# Getting to philosophy
In the summer of 2008, Wikipedia user Mark J discovered a strange phenomenon on Wikipedia: if you click the first link on a Wikipedia page and then repeat the process, you usually end up on the page for philosophy. He wrote a [Wikipedia page](https://en.wikipedia.org/wiki/Wikipedia:Getting_to_Philosophy) about this, which led to mentions on a [podcast](https://en.wikipedia.org/wiki/Wikipedia:WikipediaWeekly/Episode50) and in a [documentary](https://www.bbc.co.uk/programmes/b07lk6tj), as well as a [scientific paper](https://www.daniellamprecht.com/wp-content/uploads/2016/08/Evaluating-and-Improving-Navigability-of-Wikipedia-a-Comparative-Study-of-eight-Language-Editions.pdf) published in 2016, which found that 97% of articles led to philosophy.

In this notebook, I'd like to find out whether this phenomenon still exists on Wikipedia today, in 2021. To do this, I will build a web scraper that repeatedly follows the first link on each page. Then I will run this scraper on a sample of Wikipedia pages (~3000?) and record whether each page gets to philosophy, as well as every link the scraper had to follow to get there.

I will then analyze the resulting dataset, and try to answer the following questions: 
* What is the percentage of pages that lead to philosophy? 
* What are some other popular pages that many pages eventually lead to? 
* What is the most beautiful way of visualizing this data? (I will use this [Game of Thrones character network visualization](https://www.linkedin.com/pulse/game-thrones-social-network-analysis-conor-aspell/) as inspiration.)
* Which pages are highly connected to other pages, and which are not? (This is a chance to learn about and apply [network theory](https://en.wikipedia.org/wiki/Network_theory), measuring the [centrality](https://en.wikipedia.org/wiki/Centrality) of nodes in a network.)

In [1]:
import requests
from bs4 import BeautifulSoup
import time
import pandas as pd
import os
import re

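## Building the web scraper
The scraper uses regular expressions to clean up each page's text before picking links. A regex tutorial is available [here](https://regexone.com/lesson/kleene_operators?).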

In [32]:
url = 'https://en.wikipedia.org/wiki/Astronomy'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
urls = soup.select('p') # select all p tags, i.e. the page's paragraphs (css selector)

# combine first 50 paragraphs and convert into string
urlStr = urls[:50] 
urlStr = [str(url) for url in urlStr]
urlStr = ' '.join(urlStr)
urlStr

'<p class="mw-empty-elt">\n</p> <p class="mw-empty-elt">\n</p> <p><b>Astronomy</b> (from <a href="/wiki/Greek_language" title="Greek language">Greek</a>: <span lang="el">ἀστρονομία</span>, literally meaning the science that studies the laws of the stars) is a <a href="/wiki/Natural_science" title="Natural science">natural science</a> that studies <a href="/wiki/Astronomical_object" title="Astronomical object">celestial objects</a> and <a href="/wiki/Celestial_event" title="Celestial event">phenomena</a>. It uses <a href="/wiki/Mathematics" title="Mathematics">mathematics</a>, <a href="/wiki/Physics" title="Physics">physics</a>, and <a href="/wiki/Chemistry" title="Chemistry">chemistry</a> in order to explain their origin and <a class="mw-redirect" href="/wiki/Chronology_of_the_Universe" title="Chronology of the Universe">evolution</a>. Objects of interest include <a class="mw-redirect" href="/wiki/Planets" title="Planets">planets</a>, <a href="/wiki/Natural_satellite" title="Natural sa

In [37]:
def wiki_get_urls(url): # returns a list of candidate wikipedia urls found in the page's paragraphs
    # download the page and select its paragraphs
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    urls = soup.select('p') # select all p tags (css selector); the links are extracted further down
    
    # combine first 50 paragraphs and convert into string
    urlStr = urls[:50] 
    urlStr = [str(url) for url in urlStr]
    urlStr = ' '.join(urlStr)
    
    # remove text in parentheses or italics from string using regex
    urlStr = re.sub(r"\([^\(\)]+?\([^\(\)]+?\)[^\(\)]+?\)", "", urlStr) # remove text in double parentheses (xx(xx)xx)
    # the next three patterns handle two, three and four sub-parentheses (..(..)..(..)..)
    # TODO: write a regex that handles any number of sub-parentheses
    urlStr = re.sub(r"\([^\(\)]+?\([^\(\)]+?\)[^\(\)]+?\([^\(\)]+?\)[^\(\)]+?\)", "", urlStr)
    urlStr = re.sub(r"\([^\(\)]+?\([^\(\)]+?\)[^\(\)]+?\([^\(\)]+?\)[^\(\)]+?\([^\(\)]+?\)[^\(\)]+?\)", "", urlStr)
    urlStr = re.sub(r"\([^\(\)]+?\([^\(\)]+?\)[^\(\)]+?\([^\(\)]+?\)[^\(\)]+?\([^\(\)]+?\)[^\(\)]+?\([^\(\)]+?\)[^\(\)]+?\)", "", urlStr)
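    # remove text in parentheses; the [^a-z]+ part is meant to spare parentheses that contain
    # only lowercase letters, as in links to other wiki pages, e.g. /wiki/Justification_(epistemology)
    urlStr = re.sub(r"\([^\(\)]*?[^a-z]+[^\(\)]*?\)", "", urlStr)
    urlStr = re.sub(r'<i>.*?</i>', '', urlStr) # remove text in italics (raw string avoids an invalid escape)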
    
    # convert back into soup and find all urls  
    soup = BeautifulSoup(urlStr, 'html.parser')
    urls = soup.find_all('a', href=True)[:10] # get first 10 links (skipping a tags without an href)
    urls = [url['href'] for url in urls]
    
    # keep only links to other wikipedia articles
    excluded = ('/wiki/Help:', '/wiki/File:', '/wiki/Outline_(list)')
    urls = [url for url in urls if url.startswith('/wiki/') and not url.startswith(excluded)]
    urls = ['https://en.wikipedia.org' + url for url in urls]
    return urls 

# test the wiki_get_urls function 
urls = wiki_get_urls('https://en.wikipedia.org/wiki/Astronomy') 
urls 
# goal: should also work for both epistemology and science: 
# for science, the first link should be scientific method 
# for epistemology, the first link should be outline of philosophy 

['https://en.wikipedia.org/wiki/Natural_science',
 'https://en.wikipedia.org/wiki/Astronomical_object',
 'https://en.wikipedia.org/wiki/Celestial_event',
 'https://en.wikipedia.org/wiki/Mathematics',
 'https://en.wikipedia.org/wiki/Physics',
 'https://en.wikipedia.org/wiki/Chemistry',
 'https://en.wikipedia.org/wiki/Chronology_of_the_Universe',
 'https://en.wikipedia.org/wiki/Planets',
 'https://en.wikipedia.org/wiki/Natural_satellite',
 'https://en.wikipedia.org/wiki/Star']
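
The regex substitutions in `wiki_get_urls` cover a fixed number of sub-parentheses and leave the general case as a TODO. One possible alternative, sketched below, is to walk the string with a depth counter instead of using regex. This is not the approach the notebook uses: `strip_parentheses` is a hypothetical helper, and it assumes that parentheses inside HTML tags (such as the `href` of `/wiki/Justification_(epistemology)`) should be left alone while parenthesized visible text is dropped at any nesting depth.

```python
def strip_parentheses(html):
    """Drop parenthesized text at any nesting depth, ignoring parentheses inside HTML tags."""
    out = []
    depth = 0       # current nesting depth of parentheses in the visible text
    in_tag = False  # True while we are between '<' and '>'
    for ch in html:
        if in_tag:
            if ch == '>':
                in_tag = False
            if depth == 0:      # keep the tag only if it is outside parentheses
                out.append(ch)
            continue
        if ch == '<':
            in_tag = True
            if depth == 0:
                out.append(ch)
        elif ch == '(':
            depth += 1
        elif ch == ')' and depth > 0:
            depth -= 1
        elif depth == 0:        # ordinary character outside parentheses
            out.append(ch)
    return ''.join(out)


sample = '<p>A (from <a href="/wiki/Greek">Greek</a> (old)) is <a href="/wiki/J_(x)">j</a> (b).</p>'
print(strip_parentheses(sample))
```

In this toy input the nested group and the trailing `(b)` are removed, while the link whose `href` contains parentheses is kept. Italic text would still need separate handling.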

In [13]:
# takes a page url and returns the first link on that page that doesn't loop straight back 
# (e.g. language --> spoken language --> language)
def next_url(url): 
    test_urls = wiki_get_urls(url) # candidate links on the current page (e.g. 'language')
    for test_url in test_urls: 
        urls_1 = wiki_get_urls(test_url) # valid links on the candidate page (e.g. 'spoken language')
        if urls_1 and urls_1[0] == url: # the candidate's first link points back here, so try the next candidate
            continue 
        return test_url # otherwise, pick this candidate as the next url
    return None # no usable link was found on this page

# test 
url = 'https://en.wikipedia.org/wiki/Science'
next_url(url)

'https://en.wikipedia.org/wiki/Scientific_method'

In [38]:
def get_to_philosophy(url): 
    print('getting to philosophy from: ' + url)
    chain = [url]
    while url != 'https://en.wikipedia.org/wiki/Philosophy': 
        url = next_url(url)
        if url is None: # dead end: next_url found no usable link
            print('no valid link found, stopping')
            break
        elif url in chain: 
            print('url loop has occurred. Repeated url:\n{}'.format(url))
            chain.append(url)
            break
        elif url == 'https://en.wikipedia.org/wiki/Philosophy': 
            chain.append(url)
            print(url)
            print('reached philosophy!')
        else: 
            print(url)
            chain.append(url)
    return chain

url = 'https://en.wikipedia.org/wiki/Physics'
chain = get_to_philosophy(url)

getting to philosophy from: https://en.wikipedia.org/wiki/Physics
https://en.wikipedia.org/wiki/Natural_science
https://en.wikipedia.org/wiki/Branches_of_science
https://en.wikipedia.org/wiki/Science
https://en.wikipedia.org/wiki/Scientific_method
https://en.wikipedia.org/wiki/Empirical_evidence
https://en.wikipedia.org/wiki/Information
https://en.wikipedia.org/wiki/Uncertainty
https://en.wikipedia.org/wiki/Epistemology
https://en.wikipedia.org/wiki/Outline_of_philosophy
https://en.wikipedia.org/wiki/Philosophy
reached philosophy!
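
The stopping logic of `get_to_philosophy` can be tried out without any network access. The sketch below is a simplified, offline stand-in rather than the scraper itself: `follow_chain`, the `first_link` dictionary and the `max_steps` guard are all hypothetical names introduced for illustration, and the page names are toy data.

```python
def follow_chain(first_link, start, target, max_steps=100):
    """Follow first links from start until target, a repeated page, a dead end, or max_steps."""
    chain = [start]
    seen = {start}
    current = start
    while current != target and len(chain) <= max_steps:
        current = first_link.get(current)
        if current is None:   # dead end: this page has no outgoing link
            break
        chain.append(current)
        if current in seen:   # loop: this page was already visited
            break
        seen.add(current)
    return chain


# toy link map: each page points to its (pretend) first link
toy_links = {'A': 'B', 'B': 'C', 'C': 'Philosophy', 'X': 'Y', 'Y': 'X'}
print(follow_chain(toy_links, 'A', 'Philosophy'))  # ['A', 'B', 'C', 'Philosophy']
print(follow_chain(toy_links, 'X', 'Philosophy'))  # ['X', 'Y', 'X']
```

As in the scraper, a chain that loops ends with the repeated page, so the last element tells a successful chain apart from a loop.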


## Testing the web scraper

## Running the web scraper on a bunch of wikipedia pages

In [39]:
df = pd.read_csv('WikiMainPage.csv')
df = df[['Date', 'TFALink']].rename(columns={'TFALink': 'link', 'Date': 'date'})
df['link'] = 'https://en.wikipedia.org' + df['link']
df = df.head(3).copy() # keep only the first 3 rows while testing
df

Unnamed: 0,date,link
0,2010-12-16,https://en.wikipedia.org/wiki/Dwarf_planet
1,2011-01-01,https://en.wikipedia.org/wiki/History_of_the_A...
2,2011-01-02,https://en.wikipedia.org/wiki/Bob_Marshall_(wi...


In [40]:
%%time
df['chain'] = df.link.map(get_to_philosophy)
def philosophy(x): # True if the chain ends on the philosophy page
    return x[-1] == 'https://en.wikipedia.org/wiki/Philosophy'
df['philosophy'] = df.chain.map(philosophy)
df

getting to philosophy from: https://en.wikipedia.org/wiki/Dwarf_planet
https://en.wikipedia.org/wiki/Planetary-mass_object
https://en.wikipedia.org/wiki/Astronomical_body
https://en.wikipedia.org/wiki/Astronomy
https://en.wikipedia.org/wiki/Natural_science
https://en.wikipedia.org/wiki/Branches_of_science
https://en.wikipedia.org/wiki/Science
https://en.wikipedia.org/wiki/Scientific_method
https://en.wikipedia.org/wiki/Empirical_evidence
https://en.wikipedia.org/wiki/Information
https://en.wikipedia.org/wiki/Uncertainty
https://en.wikipedia.org/wiki/Epistemology
https://en.wikipedia.org/wiki/Outline_of_philosophy
https://en.wikipedia.org/wiki/Philosophy
reached philosophy!
getting to philosophy from: https://en.wikipedia.org/wiki/History_of_the_Australian_Capital_Territory
https://en.wikipedia.org/wiki/Australian_Capital_Territory
https://en.wikipedia.org/wiki/Federal_territory
https://en.wikipedia.org/wiki/Federation
https://en.wikipedia.org/wiki/Political_entity
https://en.wikipedia.

Unnamed: 0,date,link,chain,philosophy
0,2010-12-16,https://en.wikipedia.org/wiki/Dwarf_planet,"[https://en.wikipedia.org/wiki/Dwarf_planet, h...",True
1,2011-01-01,https://en.wikipedia.org/wiki/History_of_the_A...,[https://en.wikipedia.org/wiki/History_of_the_...,True
2,2011-01-02,https://en.wikipedia.org/wiki/Bob_Marshall_(wi...,[https://en.wikipedia.org/wiki/Bob_Marshall_(w...,True
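
Once every row has a chain, the first two questions from the introduction (the share of pages that reach philosophy, and which other pages many chains pass through) could be answered roughly as follows. This is a sketch under assumptions: `summarize` is a hypothetical helper, the chains are made-up toy data with short page names instead of full urls, and each page is counted at most once per chain.

```python
from collections import Counter

PHILOSOPHY = 'Philosophy'  # with the real data this would be the full philosophy url


def summarize(chains):
    """Return the percentage of chains ending in philosophy and per-page visit counts."""
    reached = sum(1 for chain in chains if chain[-1] == PHILOSOPHY)
    percent = 100 * reached / len(chains)
    visits = Counter()
    for chain in chains:
        # count each page once per chain, leaving out the starting page
        visits.update(set(chain[1:]))
    return percent, visits


# toy chains, not real scraper output
toy_chains = [
    ['Physics', 'Natural_science', 'Science', 'Philosophy'],
    ['Astronomy', 'Natural_science', 'Science', 'Philosophy'],
    ['X', 'Y', 'X'],
    ['Chemistry', 'Science', 'Philosophy'],
]
percent, visits = summarize(toy_chains)
print(percent)                # 75.0
print(visits.most_common(3))  # pages that the most chains pass through
```

With the dataframe above, something like `summarize(df.chain.tolist())` would apply the same idea to the scraped chains.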


In [43]:
r = 500 # number of rows to process
t = 54.1/3 # seconds per row: 54.1 s for the 3 rows timed above
print('hours required to compute {} rows: {}'.format(r, round(t*r/60/60, 2)))

hours required to compute 500 rows: 2.5
