### Project: Getting to Philosophy


"Getting to Philosophy" is defined on wikipedia as:
"Clicking on the first link in the main text of an English Wikipedia article, and then repeating the process for subsequent articles, usually leads to the Philosophy article. In February 2016, this was true for 97% of all articles in Wikipedia, an increase from 94.52% in 2011. The remaining articles lead to an article without any outgoing wikilinks, to pages that do not exist, or get stuck in loops."

https://en.wikipedia.org/wiki/Wikipedia:Getting_to_Philosophy

The program should receive a Wikipedia link as an input, go to another normal link and repeat this process until either Philosophy page is reached, or we are in an article without any outgoing Wikilinks, or stuck in a loop.
This process is repeated to create the whole network of a sample size of your choice

A "normal link" is a link from the main page article, not in a box, is blue (red is for non-existing articles), not in parentheses, not italic and not a footnote. You don't have to check style tables or other fancy things, it is enough that the script works with the current Wikipedia style (for example you can use 'class' attribute in Wikipedia tags). For easy validation, please print all visited links to the standard output.

you can use a 0.5 second timeout between queries to avoid heavy load on Wikipedia (sleep function from time module).

You can use https://en.wikipedia.org/wiki/Special:Random to check this hypothesis at home.


In [343]:
from bs4 import BeautifulSoup
import urllib
import time
import sys
import requests
import re
import csv
import numpy as np

## Getting to Philosophy will stop when : 
1. reaches the Philosophy article
2. reaches 100 article without reaching the Philosophy article 
3. get into an article with no links 
4. stuck in a loop (like mathematics article)

In [344]:
#timeout
time_out=1
#size of the network
N=1000

In [345]:
# start
start_url = "https://en.wikipedia.org/wiki/Special:Random"
#start_url = "https://en.wikipedia.org/wiki/ Manga_artist"   
# target
target_url = "https://en.wikipedia.org/wiki/Philosophy"
# store the visited article 
visited_urls = [start_url]

In [346]:
#Remove text within NESTED parenthese
def remove_text_inside_brackets(text, brackets="()"):
    count = [0] * (len(brackets) // 2) # count open/close brackets
    saved_chars = []
    for character in text:
        for i, b in enumerate(brackets):
            if character == b: # found bracket
                kind, is_close = divmod(i, 2)
                count[kind] += (-1)**is_close # `+1`: open, `-1`: close
                if count[kind] < 0: # unbalanced bracket
                    count[kind] = 0  # keep it
                else:  # found bracket to remove
                    break
        else: # character is not a [balanced] bracket
            if not any(count): # outside brackets
                saved_chars.append(character)
    return ''.join(saved_chars)

In [347]:
def find_first_link(url):
    response = requests.get(url)
    html = response.text
    # Remove span, sup, small,tables, italicized text and parenthesized text
    html = remove_text_inside_brackets(html)
    html = re.sub(r'<span.*?</span>', "", html)
    html = re.sub(r'<sup.*?</sup>', "", html)
    html = re.sub(r'<small.*?</small>', "", html)
    html = re.sub(r'<table.*?</table>', "", html)
    html = re.sub(r'<i>.*?</i>', "", html)
    #html = re.sub(r' \(.*?\)', "", html) 
    soup = BeautifulSoup(html, "html.parser")

    # This div stars with the body of the article
    content_div = soup.find(id="mw-content-text").find(class_="mw-parser-output")

    # delete red links
    for s in content_div.find_all(class_="new"):
        s.replace_with("")

    # if the link contains no links it remains None
    article_link = None

    # Find all the direct children of content_div that are paragraphs
    for element in content_div.find_all("p", recursive=False):
        #find only the direct children
        if element.find("a", recursive=False):
            article_link = element.find("a", recursive=False).get('href')
            break

    if not article_link:
        return

    # Build a full url 
    first_link = urllib.parse.urljoin(
        'https://en.wikipedia.org/', article_link)

    return first_link

In [348]:
def continue_scraping(scraping_history, target_url, max_steps=100):
    # When reaches to philosphy
    if scraping_history[-1] == target_url:
        print("https://en.wikipedia.org/wiki/Philosophy")
        return False
    # max iterations 
    elif len(scraping_history) > max_steps:
        print("Maximum (100) searches reached, interrupted.")
        return False
    elif scraping_history[-1] in scraping_history[:-1]:
        print("We are in a Loop , interrupted.")
        return False
    else:
        return True

In [349]:
def generate_path():
    #generate path
    visited_urls=[start_url]
    while continue_scraping(visited_urls, target_url):
        #print first link
        print(visited_urls[-1])
        first_link = find_first_link(visited_urls[-1])
        # when arrive at an article with no links
        if not first_link:
            print("Arrived at an article with no links, search aborted.")
            break
        visited_urls.append(first_link)
        if time_out==1:
            time.sleep(0.4)  # Slow things down so as to not overload Wikipedia's servers
        path=visited_urls

    for i in range (0, len(path)):
        path[i]=path[i].replace("https://en.wikipedia.org/wiki/","",1)   #remove link to create name of node
        path[i]=path[i].split("#",1)
        #path[i]=path[i].replace("_"," ")   #replace underscore by space
    return path

In [350]:
network=[]
for n in range(0,N):
    print ("Path #",n+1)
    path=generate_path()
    network.append(path)

Path # 1
https://en.wikipedia.org/wiki/Special:Random
https://en.wikipedia.org/wiki/Natural_gas_field
https://en.wikipedia.org/wiki/Hydrocarbon
https://en.wikipedia.org/wiki/Organic_chemistry
https://en.wikipedia.org/wiki/Chemistry
https://en.wikipedia.org/wiki/Science
https://en.wikipedia.org/wiki/Scientific_method
https://en.wikipedia.org/wiki/Empirical_evidence
https://en.wikipedia.org/wiki/Proposition
https://en.wikipedia.org/wiki/Logic
https://en.wikipedia.org/wiki/Truth
https://en.wikipedia.org/wiki/Fact
https://en.wikipedia.org/wiki/Experience
https://en.wikipedia.org/wiki/Conscious
https://en.wikipedia.org/wiki/Sentience
https://en.wikipedia.org/wiki/Emotion
https://en.wikipedia.org/wiki/Mental_state
https://en.wikipedia.org/wiki/Mind
https://en.wikipedia.org/wiki/Thought
https://en.wikipedia.org/wiki/Idea
https://en.wikipedia.org/wiki/Philosophy
Path # 2
https://en.wikipedia.org/wiki/Special:Random
https://en.wikipedia.org/wiki/Unincorporated_community
https://en.wikipedia.org

UnboundLocalError: local variable 'path' referenced before assignment

In [351]:
network

[[['Special:Random'],
  ['Natural_gas_field'],
  ['Hydrocarbon'],
  ['Organic_chemistry'],
  ['Chemistry'],
  ['Science'],
  ['Scientific_method'],
  ['Empirical_evidence'],
  ['Proposition'],
  ['Logic'],
  ['Truth'],
  ['Fact'],
  ['Experience'],
  ['Conscious'],
  ['Sentience'],
  ['Emotion'],
  ['Mental_state'],
  ['Mind'],
  ['Thought'],
  ['Idea'],
  ['Philosophy']],
 [['Special:Random'],
  ['Unincorporated_community'],
  ['Municipal_corporation'],
  ['Local_government'],
  ['Public_administration'],
  ['Public_policy'],
  ['Government'],
  ['State_']],
 [['Special:Random'], ['Reservoir'], ['Lake'], ['Depression_']],
 [['Special:Random'],
  ['List_of_Professorships_at_the_University_of_Cambridge'],
  ['Professor'],
  ['Academy'],
  ['Secondary_education'],
  ['International_Standard_Classification_of_Education'],
  ['Education'],
  ['Learning'],
  ['Understanding'],
  ['Psychological'],
  ['Scientific'],
  ['Scientific_method'],
  ['Empirical_evidence'],
  ['Proposition'],
  ['Lo

In [352]:
# Exporting the generated network using the savetxt from the numpy module
data = np.asarray(network)
np.savetxt("data_raw.csv", 
           data,
           delimiter =", ", 
           fmt ='% s')

  data = np.asarray(network)
