# Data Collection

This notebook outlines our data collection strategy, which consists of the following steps:

1. Finding the relevant Wikipedia pages for each discipline through [PetScan](https://petscan.wmflabs.org/).
2. Scraping each page to extract its text and its hyperlinks to other Wikipedia pages.
3. Creating a smaller, more manageable subgraph from the network.

In [9]:
#Imports
import requests
import json
from tqdm.notebook import tqdm
from dataclasses import dataclass
from typing import List
import numpy as np
import pandas as pd
import random
from bs4 import BeautifulSoup

## Finding relevant articles

To collect the relevant Wikipedia pages for our project, we define the dataclass `WikiPage`. Page discovery relies on the open-source tool [PetScan](https://petscan.wmflabs.org/), which takes a list of Wikipedia categories and returns the names of the pages they contain. We also specify the depth of the PetScan query, that is, how many levels of subcategories are searched below each starting category. Because the list of pages grows exponentially with depth, we limit the depth parameter to 0, 1 and 2. We do not choose one specific depth because the group and sub-group structure differs between disciplines, so the same depth yields widely different numbers of pages.

In [10]:
@dataclass(frozen=False)
class WikiPage:
    """
    Data obj that stores an article and 
    its relevant attributes
    """
    title:str
    parent:str
    depth:int
    text:str = np.nan
    edges:List = np.nan
        

def collect_pages(parents:list,
                  depth:int=0)->List[WikiPage]:
    
    """
    Finds relevant articles from petscan based on some initial query.
    See https://petscan.wmflabs.org/ for api reference.
    """
    
    pages = list()
    errors = 0
    #setup API call
    base_url = 'https://petscan.wmflabs.org/?ns%5B0%5D=1&'
    params = {'project':'wikipedia',
              'language':'en',
              'format':'json',
              'interface_language':'en',
              'depth':str(depth),
              'doit':''}
    
    #Loop over parents and get corresponding page names
    for cat in parents:
        params['categories'] = cat
        resp = requests.get(url=base_url, params=params).json()
        try: 
            for page in resp['*'][0]['a']['*']:

                #Append nodes
                pages.append(WikiPage(title=page['title'],
                                      parent=cat,
                                      depth=depth))
                
        except KeyError:
            errors+=1
    
    print(f'Petscan failed to retrieve {errors} pages in depth {depth}...')
            
    return pages
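
The nested lookup `resp['*'][0]['a']['*']` in `collect_pages` can be isolated and tested without a network call. The following is a minimal sketch, assuming the response shape used above; `extract_titles` and `mock_resp` are hypothetical names introduced for illustration, and the mock payload is hand-written rather than real PetScan output. Unlike the `except KeyError` above, the sketch also catches `IndexError` and `TypeError`, since an empty list at `resp['*']` would not raise a `KeyError`.

```python
from typing import Any, Dict, List


def extract_titles(resp: Dict[str, Any]) -> List[str]:
    """Return page titles from a PetScan-style payload, or [] if the shape differs."""
    try:
        entries = resp['*'][0]['a']['*']
    except (KeyError, IndexError, TypeError):
        # The payload does not have the expected nesting
        return []
    return [entry['title'] for entry in entries if 'title' in entry]


# Hand-written stand-in for a response; not real PetScan output.
mock_resp = {'*': [{'a': {'*': [{'title': 'Affine_pricing'},
                                {'title': 'Box_office_futures'}]}}]}

print(extract_titles(mock_resp))
print(extract_titles({'error': 'no result'}))
```

Returning an empty list instead of raising lets the caller count failed categories explicitly, as the `errors` counter does in `collect_pages`.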

Below we call the function `collect_pages` to create a page list for depths 0, 1 and 2 and display the resulting counts per discipline. As can be seen, Anthropology is a clear outlier because its group structure on Wikipedia differs from that of the other disciplines.

In [11]:
#Define initial query groups
query = ['political_science', 'economics', 
          'sociology', 'anthropology', 
          'psychology']

depths = [0,1,2]
pages = []
for d in tqdm(depths):
    pages += collect_pages(query, d)
#Show marginal distribution    
pd.DataFrame(pages).groupby('parent').count()['title']

Petscan failed to retrieve 0 pages in depth 0...
Petscan failed to retrieve 0 pages in depth 1...
Petscan failed to retrieve 0 pages in depth 2...


parent
anthropology         17621
economics             6023
political_science     7011
psychology            8785
sociology             5895
Name: title, dtype: int64
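
Because the three depth queries are concatenated, the counts above may include the same title more than once. This assumes that a PetScan query at depth `d` also returns the pages found at shallower depths, which is not verified here. The sketch below shows one way to keep a single entry per title and parent, preferring the smallest depth; `Page` is a hypothetical stand-in for `WikiPage`, and `dedupe_pages` is not part of the original notebook.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Page:
    """Hypothetical stand-in for WikiPage with only the fields needed here."""
    title: str
    parent: str
    depth: int


def dedupe_pages(pages: List[Page]) -> List[Page]:
    """Keep one entry per (title, parent), preferring the smallest depth."""
    best: Dict[Tuple[str, str], Page] = {}
    for page in pages:
        key = (page.title, page.parent)
        if key not in best or page.depth < best[key].depth:
            best[key] = page
    return list(best.values())


# Toy data: each title appears at two depths
toy = [Page('Economics', 'economics', 0),
       Page('Economics', 'economics', 1),
       Page('Affine_pricing', 'economics', 1),
       Page('Affine_pricing', 'economics', 2)]

unique = dedupe_pages(toy)
print(len(toy), '->', len(unique))
```

Keying on `(title, parent)` rather than on the title alone keeps a page that belongs to two disciplines in both groups; whether that is desirable depends on the later analysis.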

In [13]:
#Display some random articles
random.sample(pages, 10)

[WikiPage(title='Adolescent_crystallization', parent='psychology', depth=2, text=nan, edges=nan),
 WikiPage(title="Maimonides'_rule", parent='psychology', depth=2, text=nan, edges=nan),
 WikiPage(title='Affine_pricing', parent='economics', depth=1, text=nan, edges=nan),
 WikiPage(title='Orgasmic_platform', parent='sociology', depth=2, text=nan, edges=nan),
 WikiPage(title='Box_office_futures', parent='economics', depth=2, text=nan, edges=nan),
 WikiPage(title='Laurent_Naud', parent='anthropology', depth=2, text=nan, edges=nan),
 WikiPage(title='Psychiatric_casualty', parent='psychology', depth=2, text=nan, edges=nan),
 WikiPage(title='Western_painting', parent='anthropology', depth=2, text=nan, edges=nan),
 WikiPage(title='Hunnic_language', parent='anthropology', depth=2, text=nan, edges=nan),
 WikiPage(title='The_Culture_of_Connectivity', parent='political_science', depth=2, text=nan, edges=nan)]

## Collect page text and edges

In the function `collect_attributes` we use `BeautifulSoup` to scrape the HTML content of the Wikipedia pages we have found. The key HTML node is the `div` with attributes `{'id':'mw-content-text'}`, from which we parse out all paragraphs and hyperlinks, disregarding section headings, tables and other irrelevant content and page attributes.

In [None]:
def collect_attributes(articles:List[WikiPage])->List[WikiPage]:
    """
    Parses the wikipedia article text and urls pointing to another wiki page.
    """

    base_url = 'https://en.wikipedia.org/wiki/'
    for page in tqdm(articles):
        resp = requests.get(base_url+page.title)
        soup = BeautifulSoup(resp.content, 'html.parser')
        content = soup.find('div', {'id':'mw-content-text'})
        if content is None:
            #No content node (e.g. a failed request): store empty attributes
            page.text, page.edges = '', []
            continue
        #Concatenate the text of all paragraphs
        text = ''
        for paragraph in content.find_all('p'):
            text += ' ' + paragraph.text
        page.text = text
        #Keep the anchor text of every link whose href contains 'wiki'
        page.edges = [ref.text for ref in content.find_all('a', href=True) 
                                             if 'wiki' in ref.get('href')]
    return articles

collect_attributes(pages)
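
The edges above store the anchor text of each link, whereas page titles from PetScan use underscores (for example `Affine_pricing`), so the two may not match when edges are later compared with collected titles. One possible alternative is to derive the edge from the `href` itself. The sketch below is an assumption-laden illustration, not the method used in this notebook: `href_to_title` is a hypothetical helper, and it assumes internal article links have the relative form `/wiki/<Title>`.

```python
from typing import Optional
from urllib.parse import unquote, urlsplit


def href_to_title(href: str) -> Optional[str]:
    """Map an internal link such as '/wiki/Game_theory#History' to 'Game_theory'."""
    parts = urlsplit(href)
    if parts.netloc:
        return None                      # absolute or protocol-relative link
    prefix = '/wiki/'
    if not parts.path.startswith(prefix):
        return None                      # e.g. '#cite_note-1' or '/w/index.php'
    title = unquote(parts.path[len(prefix):])  # decode escapes such as %27
    if not title or ':' in title:
        # Simplification: treats any colon as a namespace ('File:', 'Category:'),
        # which also drops the few article titles that contain a colon
        return None
    return title


print(href_to_title('/wiki/Game_theory#History'))
print(href_to_title('/wiki/File:Example.png'))
```

With such a helper, the list comprehension for `edges` could filter on `href_to_title(ref.get('href'))` instead of testing whether `'wiki'` occurs anywhere in the link.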

## Subsetting a smaller network

Because of the large size of the network, we deem it necessary to create a smaller subgraph that is more manageable.
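
The notebook does not state how the subgraph is chosen. Purely as an illustration, the sketch below draws a fixed number of pages per discipline, which would also dampen the Anthropology imbalance seen earlier. The function `sample_per_parent`, the `(title, parent)` tuple representation, the sample size and the seed are all assumptions introduced here, not the procedure used in this project.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def sample_per_parent(pages: List[Tuple[str, str]],
                      n_per_parent: int,
                      seed: int = 0) -> List[Tuple[str, str]]:
    """Draw up to n_per_parent (title, parent) pairs from each parent group."""
    rng = random.Random(seed)            # local RNG so the sample is reproducible
    groups: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for title, parent in pages:
        groups[parent].append((title, parent))
    sample: List[Tuple[str, str]] = []
    for parent in sorted(groups):        # fixed order keeps runs comparable
        members = groups[parent]
        k = min(n_per_parent, len(members))
        sample.extend(rng.sample(members, k))
    return sample


# Toy data: five anthropology pages, two economics pages
toy = [(f'A{i}', 'anthropology') for i in range(5)] + \
      [('E0', 'economics'), ('E1', 'economics')]
subset = sample_per_parent(toy, n_per_parent=3)
print(len(subset))
```

Sampling nodes uniformly discards many edges, so a subgraph built this way can be much sparser than the full network.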

## Saving the data

Following the collection of pages, we gather them in a dataframe and an edgelist for future use. To reduce the size of the edgelist, we already at this stage remove edges that point to pages we have not collected. This means that we only keep edges that link to other pages in one of the five categories.

In [None]:
# Based on our nodes we can now create and save our df for future use
def create_df(nodes = pages):
    return pd.DataFrame({"name": [node.title for node in nodes],
                         "parent": [node.parent for node in nodes],
                         "depth": [node.depth for node in nodes],
                         "edges": [node.edges for node in nodes],
                         "text": [node.text for node in nodes],
                         # `categories` is not a field of WikiPage as defined above; falls back to NaN
                         "categories": [getattr(node, "categories", np.nan) for node in nodes]})

df = create_df()
df.to_pickle("df.obj")

In [None]:
import pickle
from itertools import chain

# Based on our nodes we can now create an edgelist and save it for future use
def create_edgelist(nodes = pages):
    # A set makes the membership test O(1) per edge
    nodelist = {node.title for node in nodes}
    edgelist = [[(node.title, edge) for edge in node.edges if edge in nodelist]
                for node in tqdm(nodes)]
    return list(chain.from_iterable(edgelist))

edgelist = create_edgelist()
with open('edgelist.obj', 'wb') as f:
    pickle.dump(edgelist, f)
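
The filtering rule described above, keeping only edges whose target is itself a collected page, can be checked on a toy example. This is a minimal sketch with invented data; `filter_edges` and the dictionary representation are hypothetical and only mirror the logic of `create_edgelist`.

```python
from typing import Dict, List, Tuple


def filter_edges(adjacency: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Return (source, target) pairs whose target is itself a collected page."""
    collected = set(adjacency)           # set lookup is O(1) per edge
    return [(source, target)
            for source, targets in adjacency.items()
            for target in targets
            if target in collected]


# Toy data: 'Paris' was never collected, so that edge is dropped
toy = {'Economics': ['Game_theory', 'Sociology', 'Paris'],
       'Sociology': ['Economics'],
       'Game_theory': []}
edges = filter_edges(toy)
print(edges)
```

The match is an exact string comparison, so it only works if edge targets and page titles use the same spelling and underscore convention.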