# Getting to philosophy 
In the summer of 2008, Wikipedia user Mark J discovered a strange phenomenon on Wikipedia: if you click the first link on a Wikipedia page that isn't in parentheses or italicized, and then repeat the process, you usually end up on the page for Philosophy. He wrote a [Wikipedia page](https://en.wikipedia.org/wiki/Wikipedia:Getting_to_Philosophy) about this, which led to mentions on a [podcast](https://en.wikipedia.org/wiki/Wikipedia:WikipediaWeekly/Episode50) and in a [documentary](https://www.bbc.co.uk/programmes/b07lk6tj) (see a snippet [here](https://www.youtube.com/watch?v=Q2DdmEBXTpo&t=90s&ab_channel=Wingspan)), as well as a [scientific paper](https://www.daniellamprecht.com/wp-content/uploads/2016/08/Evaluating-and-Improving-Navigability-of-Wikipedia-a-Comparative-Study-of-eight-Language-Editions.pdf) published in 2016, which found that 97% of articles led to Philosophy.

In this notebook, I'd like to find out whether this phenomenon still exists on Wikipedia today, in 2021. To do this, I will build a web scraper that repeatedly follows first links. Then I will run the scraper on a sample (~3000?) of Wikipedia pages and record whether each page gets to Philosophy, along with all the links the scraper followed along the way.

I will then analyze the resulting dataset, and try to answer the following questions: 
* What is the percentage of pages that lead to philosophy? 
* What are some other popular pages that many pages eventually lead to? 
* What is the most beautiful way of visualizing this data? (I will use this [Game of Thrones character network visualization](https://www.linkedin.com/pulse/game-thrones-social-network-analysis-conor-aspell/) as inspiration.)
* Which pages are highly connected to other pages, and which are not? (This is a chance to learn about and apply [network theory](https://en.wikipedia.org/wiki/Network_theory) by measuring the [centrality](https://en.wikipedia.org/wiki/Centrality) of points in a network.)

In [1]:
import requests
from bs4 import BeautifulSoup
import time
import pandas as pd
import os
import re

## Building the web scraper
The scraper relies on regular expressions (regex); you can learn regex [here](https://regexone.com/lesson/kleene_operators?).

In [2]:
def wiki_get_urls(url): # returns list of valid urls   
    # select all paragraphs in wiki page using its url 
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    urls = soup.select('p') # select all p tags (css selector); the links inside them are extracted below
    
    # combine first 50 paragraphs and convert into string
    urlStr = urls[:50] 
    urlStr = [str(url) for url in urlStr]
    urlStr = ' '.join(urlStr)
    
    # remove text in parentheses or italics from string using regex
    urlStr = re.sub(r"\([^\(\)]+?\([^\(\)]+?\)[^\(\)]+?\)", "", urlStr) # remove text in double parentheses (xx(xx)xx)
    # TODO: write one regex for parenthesis with multiple sub-parentheses (..(..)..(..)..(..)..)
    urlStr = re.sub(r"\([^\(\)]+?\([^\(\)]+?\)[^\(\)]+?\([^\(\)]+?\)[^\(\)]+?\)", "", urlStr)
    urlStr = re.sub(r"\([^\(\)]+?\([^\(\)]+?\)[^\(\)]+?\([^\(\)]+?\)[^\(\)]+?\([^\(\)]+?\)[^\(\)]+?\)", "", urlStr)
    urlStr = re.sub(r"\([^\(\)]+?\([^\(\)]+?\)[^\(\)]+?\([^\(\)]+?\)[^\(\)]+?\([^\(\)]+?\)[^\(\)]+?\([^\(\)]+?\)[^\(\)]+?\)", "", urlStr)
    urlStr = re.sub(r"\([^\(\)]*?[^a-z]+[^\(\)]*?\)", "", urlStr) # remove text in parentheses 
                                                            # (unless it's another wiki page, e.g. /wiki/Justification_(epistemology))
    urlStr = re.sub(r'<i>.*?</i>', '', urlStr) # remove text in italics (raw string; '/' needs no escaping)
    
    # convert back into soup and find all urls  
    soup = BeautifulSoup(urlStr, 'html.parser')
    urls = soup.find_all('a')[:10] # get first 10 urls 
    urls = [url['href'] for url in urls]
    
    # keep only urls that point to other wikipedia articles 
    excluded = ('/wiki/Help:', '/wiki/File:', '/wiki/Outline_(list)')
    urls = [url for url in urls if url.startswith('/wiki/') and not url.startswith(excluded)]
    urls = ['https://en.wikipedia.org' + url for url in urls]
    return urls 

# test the wiki_get_urls function 
urls = wiki_get_urls('https://en.wikipedia.org/wiki/Paraphyly') 
urls 
# goal: should work for both epistemology and science: 
# for science, first link should be scientific method 
# for epistemology, first link should be outline of philosophy 

['https://en.wikipedia.org/wiki/Taxonomy_(general)',
 'https://en.wikipedia.org/wiki/Most_recent_common_ancestor',
 'https://en.wikipedia.org/wiki/Monophyly',
 'https://en.wikipedia.org/wiki/Clade',
 'https://en.wikipedia.org/wiki/Monophyletic',
 'https://en.wikipedia.org/wiki/Phylogenetics',
 'https://en.wikipedia.org/wiki/Linguistics',
 'https://en.wikipedia.org/wiki/Synapomorphy_and_apomorphy',
 'https://en.wikipedia.org/wiki/Symplesiomorphy',
 'https://en.wikipedia.org/wiki/Willi_Hennig']
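
The TODO in `wiki_get_urls` asks for a single pattern that handles parentheses with any number of sub-parentheses. Python's built-in `re` module does not support recursive patterns, so one possible alternative is a small character scanner that tracks nesting depth. The sketch below is only an illustration, not a drop-in replacement: the name `strip_parentheticals` is hypothetical, it assumes that `<` and `>` appear only as tag delimiters, and unlike the regex chain it drops every parenthesized span outside of tags.

```python
def strip_parentheticals(html):
    """Remove parenthesized text at any nesting depth, leaving tags outside it intact.

    Parentheses inside a tag (e.g. in an href) are never counted, so a link
    such as /wiki/Justification_(epistemology) survives.
    """
    out = []
    depth = 0        # current nesting depth of parentheses in visible text
    in_tag = False   # True while scanning between '<' and '>'
    for ch in html:
        if in_tag:
            if ch == '>':
                in_tag = False
            if depth == 0:      # tags inside a parenthetical are dropped too
                out.append(ch)
        elif ch == '<':
            in_tag = True
            if depth == 0:
                out.append(ch)
        elif ch == '(':
            depth += 1
        elif ch == ')':
            if depth > 0:
                depth -= 1
            else:               # unmatched ')' is kept as ordinary text
                out.append(ch)
        elif depth == 0:
            out.append(ch)
    return ''.join(out)


sample = ('<p>A (b (c) d <a href="/wiki/X_(y)">x</a>) '
          '<a href="/wiki/Justification_(epistemology)">j</a></p>')
print(strip_parentheticals(sample))
```

The link inside the parenthetical is removed, while the link whose href merely contains parentheses is kept.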

In [3]:
# takes a page url and returns its first valid link that doesn't loop straight back to the page 
# (e.g. language --> spoken language --> language)
def next_url(url): 
    test_urls = wiki_get_urls(url) # candidate links on the current page, e.g. 'language' 
    for test_url in test_urls: 
        urls_1 = wiki_get_urls(test_url) # valid links on the candidate page, e.g. 'spoken language'  
        if urls_1 and urls_1[0] == url: # candidate's first link points back to the current page: try the next candidate 
            continue 
        return test_url # otherwise, pick this candidate as the next url 
    return None # no suitable link was found 

# test 
url = 'https://en.wikipedia.org/wiki/Holes_(novel)'
next_url(url)

'https://en.wikipedia.org/wiki/Young_adult_fiction'
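
Note that `next_url` downloads each candidate page to check its first link, and the same page is then downloaded again when it becomes the current page on the next step. One way to avoid repeated downloads might be to memoize the link lookup. The sketch below uses `functools.lru_cache`; the helper name `make_cached` and the stand-in `fake_fetch` are hypothetical and only serve to make the example runnable without network access.

```python
from functools import lru_cache


def make_cached(fetch, maxsize=None):
    """Wrap a url -> list-of-links function so each url is fetched at most once."""
    @lru_cache(maxsize=maxsize)
    def cached(url):
        # return a tuple so callers cannot accidentally modify the cached result
        return tuple(fetch(url))
    return cached


# stand-in for wiki_get_urls that records how often it is called
calls = []

def fake_fetch(url):
    calls.append(url)
    return ['https://en.wikipedia.org/wiki/A', 'https://en.wikipedia.org/wiki/B']


cached_fetch = make_cached(fake_fetch)
cached_fetch('https://en.wikipedia.org/wiki/Start')
cached_fetch('https://en.wikipedia.org/wiki/Start')  # served from the cache
print(len(calls))
```

In the notebook this could be applied as `wiki_get_urls = make_cached(wiki_get_urls)`; the indexing used in `next_url` works the same on tuples as on lists.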

In [4]:
def get_to_philosophy(url): 
    try: 
        print('getting to philosophy from: ' + url)
        chain = [url]
        while url != 'https://en.wikipedia.org/wiki/Philosophy': 
            url = next_url(url)
            if url in chain: # we have been here before, so the chain would repeat forever 
                print('url loop has occurred. Repeated url:\n{}'.format(url))
                chain.append(url)
                break
            print(url)
            chain.append(url)
            if url == 'https://en.wikipedia.org/wiki/Philosophy': 
                print('reached philosophy!')
        return chain
    except Exception: # e.g. network errors or pages without valid links; Ctrl-C still interrupts 
        print('get_to_philosophy failed')
        return ['get_to_philosophy', 'failed']

url = 'https://en.wikipedia.org/wiki/Paraphyly'
chain = get_to_philosophy(url)

getting to philosophy from: https://en.wikipedia.org/wiki/Paraphyly
get_to_philosophy failed
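
The chain-following logic can also be tried without any network access, which makes it easier to see the different ways a chain can end. The following is a minimal sketch under stated assumptions: `follow_chain` and the `toy_links` dictionary (page name mapped to its first link) are made up for illustration and are not real Wikipedia data.

```python
def follow_chain(start, first_link, target='Philosophy'):
    """Follow first links from start and return (chain, status).

    status is 'reached', 'loop' or 'dead_end'.
    """
    chain = [start]
    current = start
    while current != target:
        nxt = first_link.get(current)
        if nxt is None:              # page has no usable link
            return chain, 'dead_end'
        chain.append(nxt)
        if nxt in chain[:-1]:        # page was already visited earlier in the chain
            return chain, 'loop'
        current = nxt
    return chain, 'reached'


# made-up link structure, for illustration only
toy_links = {
    'Feke': 'Turkey', 'Turkey': 'Geography', 'Geography': 'Science',
    'Science': 'Philosophy',
    'A': 'B', 'B': 'A',   # two pages pointing at each other
    'C': 'D',             # D has no outgoing link
}

for page in ['Feke', 'A', 'C']:
    print(follow_chain(page, toy_links))
```

Returning a status next to the chain would make it possible to tell loops apart from failures later on, instead of only recording whether Philosophy was reached.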


## Running the web scraper on a bunch of Wikipedia pages

In [5]:
# open list of 1000 randomly generated wikipedia articles 
import json
with open("randomWikiLinks.txt", "r") as fp:
    wikis = json.load(fp)
df = pd.DataFrame(wikis)
df.rename(columns={0: 'rLinks'}, inplace=True) # rLinks = random links
df

Unnamed: 0,rLinks
0,"https://en.wikipedia.org/wiki/Mansurlu,_Feke"
1,https://en.wikipedia.org/wiki/Net_neutrality_i...
2,https://en.wikipedia.org/wiki/Richard_Sissons
3,https://en.wikipedia.org/wiki/Meriwether_Natio...
4,https://en.wikipedia.org/wiki/Netechma_bicerit...
...,...
995,https://en.wikipedia.org/wiki/Josef_Effertz
996,https://en.wikipedia.org/wiki/Trapezoid_body
997,https://en.wikipedia.org/wiki/Gilberto_Peña
998,https://en.wikipedia.org/wiki/Stanley_Tomshinsky


In [6]:
%%time
df['chain'] = df.rLinks.map(get_to_philosophy)
def philosophy(x): # True if the chain ends on the philosophy page
    return x[-1] == 'https://en.wikipedia.org/wiki/Philosophy'
df['philosophy'] = df.chain.map(philosophy)
df

getting to philosophy from: https://en.wikipedia.org/wiki/Mansurlu,_Feke
https://en.wikipedia.org/wiki/Feke
https://en.wikipedia.org/wiki/Adana_Province
https://en.wikipedia.org/wiki/Provinces_of_Turkey
https://en.wikipedia.org/wiki/Turkey
https://en.wikipedia.org/wiki/Southeastern_Europe
https://en.wikipedia.org/wiki/Geography
https://en.wikipedia.org/wiki/Science
https://en.wikipedia.org/wiki/Scientific_method
https://en.wikipedia.org/wiki/Empirical_evidence
https://en.wikipedia.org/wiki/Information
https://en.wikipedia.org/wiki/Uncertainty
https://en.wikipedia.org/wiki/Epistemology
https://en.wikipedia.org/wiki/Outline_of_philosophy
https://en.wikipedia.org/wiki/Philosophy
reached philosophy!
getting to philosophy from: https://en.wikipedia.org/wiki/Net_neutrality_in_the_Netherlands
https://en.wikipedia.org/wiki/Netherlands
https://en.wikipedia.org/wiki/Europe
https://en.wikipedia.org/wiki/Continent
https://en.wikipedia.org/wiki/Landmass
https://en.wikipedia.org/wiki/Region
https://

Unnamed: 0,rLinks,chain,philosophy
0,"https://en.wikipedia.org/wiki/Mansurlu,_Feke","[https://en.wikipedia.org/wiki/Mansurlu,_Feke,...",True
1,https://en.wikipedia.org/wiki/Net_neutrality_i...,[https://en.wikipedia.org/wiki/Net_neutrality_...,True
2,https://en.wikipedia.org/wiki/Richard_Sissons,[https://en.wikipedia.org/wiki/Richard_Sissons...,True
3,https://en.wikipedia.org/wiki/Meriwether_Natio...,[https://en.wikipedia.org/wiki/Meriwether_Nati...,True
4,https://en.wikipedia.org/wiki/Netechma_bicerit...,"[get_to_philosophy, failed]",False
...,...,...,...
995,https://en.wikipedia.org/wiki/Josef_Effertz,"[https://en.wikipedia.org/wiki/Josef_Effertz, ...",True
996,https://en.wikipedia.org/wiki/Trapezoid_body,"[https://en.wikipedia.org/wiki/Trapezoid_body,...",True
997,https://en.wikipedia.org/wiki/Gilberto_Peña,"[https://en.wikipedia.org/wiki/Gilberto_Peña, ...",True
998,https://en.wikipedia.org/wiki/Stanley_Tomshinsky,[https://en.wikipedia.org/wiki/Stanley_Tomshin...,False


In [7]:
r = 1000 # number of rows to compute
t = 78/5 # seconds required to compute one row 
print('time required to compute {} pages: {} hours'.format(r, round(t*r/60/60, 2)))

time required to compute 1000 pages: 4.33 hours


In [8]:
df

Unnamed: 0,rLinks,chain,philosophy
0,"https://en.wikipedia.org/wiki/Mansurlu,_Feke","[https://en.wikipedia.org/wiki/Mansurlu,_Feke,...",True
1,https://en.wikipedia.org/wiki/Net_neutrality_i...,[https://en.wikipedia.org/wiki/Net_neutrality_...,True
2,https://en.wikipedia.org/wiki/Richard_Sissons,[https://en.wikipedia.org/wiki/Richard_Sissons...,True
3,https://en.wikipedia.org/wiki/Meriwether_Natio...,[https://en.wikipedia.org/wiki/Meriwether_Nati...,True
4,https://en.wikipedia.org/wiki/Netechma_bicerit...,"[get_to_philosophy, failed]",False
...,...,...,...
995,https://en.wikipedia.org/wiki/Josef_Effertz,"[https://en.wikipedia.org/wiki/Josef_Effertz, ...",True
996,https://en.wikipedia.org/wiki/Trapezoid_body,"[https://en.wikipedia.org/wiki/Trapezoid_body,...",True
997,https://en.wikipedia.org/wiki/Gilberto_Peña,"[https://en.wikipedia.org/wiki/Gilberto_Peña, ...",True
998,https://en.wikipedia.org/wiki/Stanley_Tomshinsky,[https://en.wikipedia.org/wiki/Stanley_Tomshin...,False


In [10]:
def list_to_string(x): 
    return ','.join(x) # join the chain into a single comma-separated string

df.chain = df.chain.map(list_to_string)
df

Unnamed: 0,rLinks,chain,philosophy
0,"https://en.wikipedia.org/wiki/Mansurlu,_Feke","https://en.wikipedia.org/wiki/Mansurlu,_Feke,h...",True
1,https://en.wikipedia.org/wiki/Net_neutrality_i...,https://en.wikipedia.org/wiki/Net_neutrality_i...,True
2,https://en.wikipedia.org/wiki/Richard_Sissons,"https://en.wikipedia.org/wiki/Richard_Sissons,...",True
3,https://en.wikipedia.org/wiki/Meriwether_Natio...,https://en.wikipedia.org/wiki/Meriwether_Natio...,True
4,https://en.wikipedia.org/wiki/Netechma_bicerit...,"get_to_philosophy,failed",False
...,...,...,...
995,https://en.wikipedia.org/wiki/Josef_Effertz,"https://en.wikipedia.org/wiki/Josef_Effertz,ht...",True
996,https://en.wikipedia.org/wiki/Trapezoid_body,"https://en.wikipedia.org/wiki/Trapezoid_body,h...",True
997,https://en.wikipedia.org/wiki/Gilberto_Peña,"https://en.wikipedia.org/wiki/Gilberto_Peña,ht...",True
998,https://en.wikipedia.org/wiki/Stanley_Tomshinsky,https://en.wikipedia.org/wiki/Stanley_Tomshins...,False
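
Some article urls contain commas themselves (for example `Mansurlu,_Feke`), so splitting the joined string on every comma would not give back the original chain. A possible way to recover the list is to split only at commas that are directly followed by the start of a new url. This is a sketch with a hypothetical helper name, `string_to_list`, and it assumes that no article url contains the text `,https://`.

```python
import re


def string_to_list(chain_str):
    """Split a comma-joined chain back into urls.

    Only commas immediately followed by 'https://' are treated as separators,
    so commas inside an article title are left alone.
    """
    return re.split(r',(?=https://)', chain_str)


joined = ('https://en.wikipedia.org/wiki/Mansurlu,_Feke,'
          'https://en.wikipedia.org/wiki/Feke')
print(string_to_list(joined))
```

The failure marker `get_to_philosophy,failed` contains no url, so it comes back as a single-element list.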


In [11]:
df.to_feather('links_to_philosophy.feather')
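
With the chains recorded, the first two questions from the introduction could be approached roughly as follows. This is a minimal sketch on made-up chains rather than the scraped data, and the helper names `share_reaching` and `most_visited` are hypothetical; it expects each chain as a list of urls.

```python
from collections import Counter

WIKI = 'https://en.wikipedia.org/wiki/'
PHILOSOPHY = WIKI + 'Philosophy'
FAILED = ['get_to_philosophy', 'failed']  # marker returned when a chain fails


def share_reaching(chains, target=PHILOSOPHY):
    """Fraction of chains whose last page is the target."""
    return sum(1 for chain in chains if chain[-1] == target) / len(chains)


def most_visited(chains, n=3):
    """The n pages that appear in the most chains, ignoring each chain's start page."""
    counts = Counter()
    for chain in chains:
        if chain == FAILED:
            continue
        counts.update(set(chain[1:]))  # count a page at most once per chain
    return counts.most_common(n)


# made-up chains, for illustration only
toy_chains = [
    [WIKI + 'A', WIKI + 'Science', PHILOSOPHY],
    [WIKI + 'B', WIKI + 'Science', PHILOSOPHY],
    [WIKI + 'C', WIKI + 'D', WIKI + 'C'],   # a loop
    FAILED,
]

print(share_reaching(toy_chains))
print(most_visited(toy_chains, n=2))
```

Counting each page once per chain keeps a looping chain from inflating the count of the page it repeats.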