# Getting to Philosophy
In the summer of 2008, Wikipedia user Mark J discovered a strange phenomenon: if you click the first link on a Wikipedia page, then repeat the process on each page you land on, you usually end up on the page for Philosophy. He wrote a [Wikipedia page](https://en.wikipedia.org/wiki/Wikipedia:Getting_to_Philosophy) about this, which led to mentions on a [podcast](https://en.wikipedia.org/wiki/Wikipedia:WikipediaWeekly/Episode50) and in a [documentary](https://www.bbc.co.uk/programmes/b07lk6tj), as well as a [scientific paper](https://www.daniellamprecht.com/wp-content/uploads/2016/08/Evaluating-and-Improving-Navigability-of-Wikipedia-a-Comparative-Study-of-eight-Language-Editions.pdf) published in 2016, which found that 97% of articles led to Philosophy.

In this notebook, I'd like to find out whether this phenomenon still holds on Wikipedia today, in 2021. To do this, I will build a web scraper that follows first links from page to page. I will then run the scraper on a sample of Wikipedia pages (tentatively ~3000), recording whether each page gets to Philosophy, along with every link the scraper followed on the way.

I will then analyze the resulting dataset and try to answer the following questions:
* What percentage of pages lead to Philosophy?
* Which other pages do many chains eventually lead to?
* What is the most beautiful way of visualizing this data? (I will use this [Game of Thrones character network visualization](https://www.linkedin.com/pulse/game-thrones-social-network-analysis-conor-aspell/) as inspiration.)
* Which pages are highly connected to other pages, and which are not? (This is a chance to learn about and apply [network theory](https://en.wikipedia.org/wiki/Network_theory), in particular measuring the [centrality](https://en.wikipedia.org/wiki/Centrality) of nodes in a network.)
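To make the last question a bit more concrete: in a first-link network every page has at most one outgoing edge, so a simple starting point for "how connected is a page" is its in-degree, the number of distinct pages whose first link points to it. The sketch below uses only the standard library; the chains and the helper names `edge_set` and `in_degree` are made up for illustration and are not results from the scraper.

```python
from collections import Counter

# Hypothetical chains of page titles, invented purely for illustration.
chains = [
    ["Soju", "Drink", "Liquid", "Physics"],
    ["Beer", "Drink", "Liquid", "Physics"],
    ["Star", "Physics"],
]

def edge_set(chains):
    """Collect the distinct (source, target) first-link edges across all chains."""
    edges = set()
    for chain in chains:
        # Pair each page with the page that follows it in the chain.
        edges.update(zip(chain, chain[1:]))
    return edges

def in_degree(edges):
    """Count how many distinct pages point to each page."""
    return Counter(target for _, target in edges)

edges = edge_set(chains)
degrees = in_degree(edges)
for page, count in sorted(degrees.items()):
    print(page, count)
```

In-degree is only the crudest centrality measure; a graph library would be the natural next step for anything more elaborate.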

In [1]:
import requests
from bs4 import BeautifulSoup
import time
import pandas as pd
import os

## Building the web scraper

In [106]:
def wiki_first_url(url):
    """Return the URL of the first article link in the body text of a Wikipedia page.

    If no suitable link is found, the input url is returned unchanged.
    """
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    # print(soup.prettify())  # pretty-prints the HTML; worth inspecting before scraping
    links = soup.select('p a[href]')  # all <a> tags with an href inside <p> tags (CSS selector)
    for link in links:
        href = link['href']
        # Skip help pages, file pages, and the media help page
        if href.startswith(('/wiki/Help:', '/wiki/File:', '/wiki/Wikipedia:Media_help')):
            continue
        # Skip anything that is not an internal /wiki link (external links, in-page anchors)
        if not href.startswith('/wiki'):
            continue
        # Skip audio files
        if href.endswith('.ogg'):
            continue
        url = 'https://en.wikipedia.org' + href
        break
    print(url)
    return url
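The link-filtering rules inside `wiki_first_url` can also be pulled out into a small pure function, which makes them easy to check without hitting the network. This is a minimal sketch: `is_article_href` and `SKIPPED_PREFIXES` are hypothetical names, and the sample hrefs are invented for illustration.

```python
# Prefixes that wiki_first_url skips, copied from the function above.
SKIPPED_PREFIXES = ('/wiki/Help:', '/wiki/File:', '/wiki/Wikipedia:Media_help')

def is_article_href(href):
    """Return True if href passes the same three checks wiki_first_url applies."""
    if href.startswith(SKIPPED_PREFIXES):
        return False
    if not href.startswith('/wiki'):
        return False
    if href.endswith('.ogg'):
        return False
    return True

# Invented hrefs in the order they might appear on a page.
hrefs = ['#cite_note-1', '/wiki/Help:IPA/English', '/wiki/Alcoholic_beverage']
first = next((h for h in hrefs if is_article_href(h)), None)
print(first)  # /wiki/Alcoholic_beverage
```

Returning `None` when nothing matches is a design choice of this sketch; the scraper above instead falls back to the URL it was given.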

In [113]:
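def get_to_philosophy(url):
    """Follow first links starting from url and return the list of URLs visited.

    Stops when the Philosophy page is reached or when a URL repeats (a link loop).
    The starting url itself is not stored in the returned list.
    """
    print('testing for url: ' + url)
    links = []
    while url != 'https://en.wikipedia.org/wiki/Philosophy':
        url = wiki_first_url(url)
        if url in links:
            # We have been here before, so the chain will cycle forever
            print('link loop has occurred')
            break
        elif url == 'https://en.wikipedia.org/wiki/Philosophy':
            print('reached philosophy!')
        links.append(url)
    return links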
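The chain-following logic can be tried out offline by passing in the "next page" lookup as a function instead of calling the scraper. The sketch below is one possible variant, not the function used in this notebook: `follow_chain`, `max_steps`, and the toy mapping are hypothetical. It also treats the starting page as already seen and caps the number of steps.

```python
def follow_chain(start, next_page, target='Philosophy', max_steps=100):
    """Follow next_page from start until target, a repeated page, or max_steps.

    Returns (chain, outcome), where outcome is 'target', 'loop', or 'max_steps'.
    The chain lists the pages visited after start.
    """
    chain = []
    seen = {start}  # include the start so a loop back to it is caught at once
    current = start
    for _ in range(max_steps):
        if current == target:
            return chain, 'target'
        current = next_page(current)
        if current in seen:
            return chain, 'loop'
        seen.add(current)
        chain.append(current)
    # The step budget ran out; the last step may still have landed on the target.
    return chain, ('target' if current == target else 'max_steps')

# Toy first-link mapping, invented for illustration.
toy = {'A': 'B', 'B': 'Philosophy', 'C': 'D', 'D': 'C'}

print(follow_chain('A', toy.get))  # (['B', 'Philosophy'], 'target')
print(follow_chain('C', toy.get))  # (['D'], 'loop')
```

A set makes the "seen before" check constant time, which starts to matter once chains get long or many pages are processed.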

## Testing the web scraper

In [127]:
url = 'https://en.wikipedia.org/wiki/Soju'
links = get_to_philosophy(url)

testing for url: https://en.wikipedia.org/wiki/Soju
https://en.wikipedia.org/wiki/Alcoholic_beverage
https://en.wikipedia.org/wiki/Drink
https://en.wikipedia.org/wiki/Liquid
https://en.wikipedia.org/wiki/Compressibility
https://en.wikipedia.org/wiki/Thermodynamics
https://en.wikipedia.org/wiki/Physics
https://en.wikipedia.org/wiki/Ancient_Greek_language
https://en.wikipedia.org/wiki/Greek_language
https://en.wikipedia.org/wiki/Ancient_Greek
https://en.wikipedia.org/wiki/Greek_language
link loop has occurred


## Running the web scraper on a set of Wikipedia pages

In [130]:
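# Load the list of pages to test and keep only the date and link columns
df = pd.read_csv('WikiMainPage.csv')
df = df[['Date', 'TFALink']]
df = df.rename(columns={'TFALink': 'link', 'Date': 'date'})
# The links in the file are relative paths, so prepend the domain
df['link'] = 'https://en.wikipedia.org' + df['link']
df = df.head(3)  # keep only the first 3 rows for a quick trial run
df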

| | date | link |
|---|---|---|
| 0 | 2010-12-16 | https://en.wikipedia.org/wiki/Dwarf_planet |
| 1 | 2011-01-01 | https://en.wikipedia.org/wiki/History_of_the_A... |
| 2 | 2011-01-02 | https://en.wikipedia.org/wiki/Bob_Marshall_(wi... |


In [131]:
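%%time
# Follow the first-link chain for every page in the table
df['chain'] = df.link.map(get_to_philosophy)

def philosophy(x):
    """Return True if the chain x ends on the Philosophy page."""
    # Guard against an empty chain, where x[-1] would raise an IndexError
    return bool(x) and x[-1] == 'https://en.wikipedia.org/wiki/Philosophy'

df['philosophy'] = df.chain.map(philosophy)
df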

testing for url: https://en.wikipedia.org/wiki/Dwarf_planet
https://en.wikipedia.org/wiki/Planetary-mass_object
https://en.wikipedia.org/wiki/Sun
https://en.wikipedia.org/wiki/Star
https://en.wikipedia.org/wiki/Astronomical_object
https://en.wikipedia.org/wiki/Astronomy
https://en.wikipedia.org/wiki/Greek_language
https://en.wikipedia.org/wiki/Ancient_Greek
https://en.wikipedia.org/wiki/Greek_language
link loop has occurred
testing for url: https://en.wikipedia.org/wiki/History_of_the_Australian_Capital_Territory
https://en.wikipedia.org/wiki/Australian_Capital_Territory
https://en.wikipedia.org/wiki/Federal_territory
https://en.wikipedia.org/wiki/Federation
https://en.wikipedia.org/wiki/Political_entity
https://en.wikipedia.org/wiki/Collective_identity
https://en.wikipedia.org/wiki/Belongingness
https://en.wikipedia.org/wiki/Emotional
https://en.wikipedia.org/wiki/Biological
https://en.wikipedia.org/wiki/Natural_science
https://en.wikipedia.org/wiki/Branches_of_science
https://en.wikip

| | date | link | chain | philosophy |
|---|---|---|---|---|
| 0 | 2010-12-16 | https://en.wikipedia.org/wiki/Dwarf_planet | [https://en.wikipedia.org/wiki/Planetary-mass_... | False |
| 1 | 2011-01-01 | https://en.wikipedia.org/wiki/History_of_the_A... | [https://en.wikipedia.org/wiki/Australian_Capi... | False |
| 2 | 2011-01-02 | https://en.wikipedia.org/wiki/Bob_Marshall_(wi... | [https://en.wikipedia.org/wiki/Forester, https... | False |
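Once chains like these have been collected, the first question (what share of pages lead to Philosophy) comes down to counting the chains whose last entry is the Philosophy URL. Here is a minimal sketch in plain Python; the helper names and the example chains are hypothetical and say nothing about the real results.

```python
PHILOSOPHY = 'https://en.wikipedia.org/wiki/Philosophy'

def reached_philosophy(chain):
    """Return True if a non-empty chain ends on the Philosophy page."""
    return bool(chain) and chain[-1] == PHILOSOPHY

def share_reaching(chains):
    """Return the fraction of chains that end on Philosophy (0.0 if there are none)."""
    if not chains:
        return 0.0
    return sum(reached_philosophy(c) for c in chains) / len(chains)

# Invented chains for illustration only.
example_chains = [
    ['https://en.wikipedia.org/wiki/A', PHILOSOPHY],
    ['https://en.wikipedia.org/wiki/B', 'https://en.wikipedia.org/wiki/C'],
    [PHILOSOPHY],
    [],
]

share = share_reaching(example_chains)
print(f'{share:.1%}')  # 50.0%
```

With the DataFrame above, the same figure should be available as the mean of the boolean `philosophy` column.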


In [136]:
# Rough runtime estimate in hours (presumably ~10 seconds per page for 500 pages)
10 * 500 / 60 / 60

1.3888888888888888
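The cell above reads like a back-of-the-envelope runtime estimate. Assuming that interpretation (about 10 seconds per page, 500 pages, result in hours), the same arithmetic can be wrapped in a small helper to try other sample sizes. The function name and its parameters are hypothetical.

```python
def estimate_hours(n_pages, seconds_per_page):
    """Estimate total scraping time in hours for n_pages at a given per-page cost."""
    total_seconds = n_pages * seconds_per_page
    return total_seconds / 60 / 60  # seconds -> minutes -> hours

# Same numbers as the cell above.
print(estimate_hours(500, 10))

# The ~3000 pages mentioned in the introduction, at the same assumed rate.
print(round(estimate_hours(3000, 10), 2))  # 8.33
```

At that assumed rate, the full sample would take most of a working day, so saving intermediate results would be worth considering.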