# Data Collection

This notebook outlines our data collection strategy, which consists of the following steps:

1. Finding the relevant Wikipedia pages for each discipline through [PetScan](https://petscan.wmflabs.org/).
2. Scraping each page to parse out hyperlinks to other Wikipedia pages and the text.
3. Creating a smaller, more manageable subgraph from the network.

In [1]:
#Imports
import json
import pickle
import random
import re
from collections import Counter
from dataclasses import dataclass
from itertools import chain
from typing import List

import networkx as nx
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm

In [2]:
import littleballoffur

## Finding relevant articles

To collect the relevant Wikipedia pages for our project we define the dataclass `WikiPage` and query the open-source tool [PetScan](https://petscan.wmflabs.org/), which takes a list of Wikipedia categories and returns the names of the pages in them. We also specify the depth of the PetScan query, that is, how many levels of subcategories below each category are included. Because the list of pages grows exponentially with depth, we limit the depth to 0, 1 and 2. We do not settle on one specific depth because the group and sub-group structure differs between disciplines, so a single depth yields widely different numbers of pages per discipline.

In [2]:
@dataclass(frozen=False)
class WikiPage:
    """
    Data obj that stores an article and 
    its relevant attributes
    """
    title:str
    parent:str
    depth:int
    text:str = np.nan
    edges:List = np.nan
        

def collect_pages(parents:list,
                  depth:int=0)->List[WikiPage]:
    
    """
    Finds relevant articles from petscan based on some initial query.
    See https://petscan.wmflabs.org/ for api reference.
    """
    
    pages = list()
    errors = 0
    #setup API call
    base_url = 'https://petscan.wmflabs.org/?ns%5B0%5D=1&'
    params = {'project':'wikipedia',
              'language':'en',
              'format':'json',
              'interface_language':'en',
              'depth':str(depth),
              'doit':''}
    
    #Loop over parents and get corresponding page names
    for cat in parents:
        params['categories'] = cat
        resp = requests.get(url=base_url, params=params).json()
        try: 
            for page in resp['*'][0]['a']['*']:

                #Append nodes
                pages.append(WikiPage(title=page['title'],
                                      parent=cat,
                                      depth=depth))
                
        except KeyError:
            errors+=1
    
    print(f'Petscan failed to retrieve {errors} pages in depth {depth}...')
            
    return pages
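
The nested lookup `resp['*'][0]['a']['*']` in `collect_pages` is easier to follow on a concrete payload. The sketch below uses a hand-written dictionary that mimics only the keys the function reads. The page entries and the helper name `extract_titles` are made up for illustration, and a real PetScan response carries more fields.

```python
# Hypothetical, hand-written payload that mimics only the keys read by
# collect_pages; it is not a recorded PetScan response.
mock_resp = {
    '*': [
        {'a': {'*': [{'title': 'Example_page_one'},
                     {'title': 'Example_page_two'}]}}
    ]
}


def extract_titles(resp: dict) -> list:
    """Return the page titles in a PetScan-style payload, or [] if a key is missing."""
    try:
        return [page['title'] for page in resp['*'][0]['a']['*']]
    except KeyError:
        # Same failure mode that collect_pages counts as an error
        return []


print(extract_titles(mock_resp))
print(extract_titles({'error': 'no result'}))
```

A payload without the expected keys yields an empty list here, which corresponds to the case where `collect_pages` increments its error counter.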

Below we call `collect_pages` for depths 0, 1 and 2, combine the results into one page list and display the resulting counts per discipline. Anthropology is a clear outlier because its group structure on Wikipedia differs from that of the other disciplines.

In [3]:
#Define initial query groups
query = ['political_science', 'economics', 
          'sociology', 'anthropology', 
          'psychology']

depths = [0,1,2]
pages = []
for d in tqdm(depths):
    pages += collect_pages(query, d)
#Show marginal distribution    
pd.DataFrame(pages).groupby('parent').count()['title']

  0%|          | 0/3 [00:00<?, ?it/s]

Petscan failed to retrieve 0 pages in depth 0...
Petscan failed to retrieve 0 pages in depth 1...
Petscan failed to retrieve 0 pages in depth 2...


parent
anthropology         17621
economics             6023
political_science     7011
psychology            8757
sociology             5895
Name: title, dtype: int64

In [4]:
#Display some random articles
random.sample(pages, 10)

[WikiPage(title='Talking_shit', parent='sociology', depth=2, text=nan, edges=nan),
 WikiPage(title='Fan_effect', parent='psychology', depth=2, text=nan, edges=nan),
 WikiPage(title='Ernst-Ludwig_von_Thadden', parent='economics', depth=2, text=nan, edges=nan),
 WikiPage(title='Hassan_Kettani', parent='political_science', depth=2, text=nan, edges=nan),
 WikiPage(title='Ethel_Cutler_Freeman', parent='anthropology', depth=2, text=nan, edges=nan),
 WikiPage(title='Theory_of_generations', parent='anthropology', depth=2, text=nan, edges=nan),
 WikiPage(title='National_Pet_Month', parent='anthropology', depth=2, text=nan, edges=nan),
 WikiPage(title='Janine_Krieber', parent='political_science', depth=2, text=nan, edges=nan),
 WikiPage(title='Peter_Lewis_Paul', parent='anthropology', depth=2, text=nan, edges=nan),
 WikiPage(title='Toxic_masculinity', parent='sociology', depth=2, text=nan, edges=nan)]

## Collect page text and edges

In the function `collect_attributes` we use `BeautifulSoup` to scrape the HTML content of the Wikipedia pages we have found. The key HTML node is the `div` with attributes `{'id':'mw-content-text'}`, from which we parse all paragraphs and hyperlinks, disregarding section headings, tables and other irrelevant content and page attributes.

In [17]:
def collect_attributes(articles:list[WikiPage])->tuple[list[WikiPage], dict]:
    """
    Parses the wikipedia article text and urls pointing to another wiki page.
    Returns the updated articles and a log of the pages that failed.
    """
    base_url = 'https://en.wikipedia.org/wiki/'
    error_log = dict()
    for page in tqdm(articles):
        try:
            resp = requests.get(base_url+page.title, timeout=10)
            soup = BeautifulSoup(resp.content, 'html.parser')
            content = soup.find('div', {'id':'mw-content-text'})
            text = ''
            for paragraph in content.find_all('p'):
                text += ' ' + paragraph.text
            page.text = text
            #Keep the link text of every anchor whose href contains 'wiki'
            page.edges = [ref.text for ref in content.find_all('a', href=True) 
                                               if 'wiki' in ref.get('href')]
        except Exception as e:
            #Log timeouts and parsing errors and skip the page, so that a
            #response from a previous iteration is never reused
            error_log[page.title] = str(e)
            
    return articles, error_log

pages, error_log = collect_attributes(pages)

  0%|          | 0/45307 [00:00<?, ?it/s]
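
The filter `'wiki' in ref.get('href')` keeps any link whose target contains the substring `wiki`. The sketch below applies that test to a few hand-picked `href` values and compares it with a stricter variant, assuming that internal article links start with `/wiki/` and that namespace pages such as `File:` or `Category:` contain a colon. The helper `is_article_link` and the sample links are illustrative and not part of the notebook.

```python
# Illustrative href values of the kinds found on a Wikipedia page
hrefs = [
    '/wiki/Sociology',                       # internal article link
    '/wiki/File:Example.jpg',                # internal link to a file page
    'https://commons.wikimedia.org/wiki/X',  # external sister-project link
    '#cite_note-1',                          # in-page footnote anchor
]

# Test used in collect_attributes: substring match on 'wiki'
loose = [h for h in hrefs if 'wiki' in h]


def is_article_link(href: str) -> bool:
    """Stricter check (an assumption, not the notebook's method)."""
    prefix = '/wiki/'
    return href.startswith(prefix) and ':' not in href[len(prefix):]


strict = [h for h in hrefs if is_article_link(h)]

print(len(loose), 'links pass the substring test')
print(strict)
```

In the notebook the extra links do little harm to the final graph, since edges whose names do not match a collected page title are dropped in a later step. Note that the colon rule would also drop the few articles that have a colon in their title.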

In [18]:
pages_df = pd.DataFrame(pages)

In [26]:
print(f'Amount of pages that failed to be collected: {len(error_log.keys())}')

Amount of pages that failed to be collected: 135


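## Subsetting a smaller network

Because of the large size of the network, we deem it necessary to create a smaller, more manageable subgraph. We drop the depth-2 anthropology pages, remove pages that appear under more than one discipline, normalise page and edge names, keep only edges between collected pages, and finally sample 25% of the nodes in the giant connected component with a Metropolis-Hastings random walk.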

In [704]:
pages_df = pd.read_pickle("full_data.pickle")

In [705]:
def remove_anthro(df):
    df = df.loc[~((df["depth"] == 2) & (df["parent"] == "anthropology"))]
    return df

In [706]:
pages_df.loc[~((pages_df["depth"] == 2) & (pages_df["parent"] == "anthropology"))].groupby("parent").count()

| parent | title | depth | text | edges |
|---|---|---|---|---|
| anthropology | 2161 | 2161 | 2161 | 2161 |
| economics | 6023 | 6023 | 6023 | 6023 |
| political_science | 7011 | 7011 | 7011 | 7011 |
| psychology | 8757 | 8757 | 8757 | 8757 |
| sociology | 5895 | 5895 | 5895 | 5895 |


In [697]:
pages_df.groupby("parent").count()

| parent | title | depth | text | edges |
|---|---|---|---|---|
| anthropology | 17621 | 17621 | 17621 | 17621 |
| economics | 6023 | 6023 | 6023 | 6023 |
| political_science | 7011 | 7011 | 7011 | 7011 |
| psychology | 8757 | 8757 | 8757 | 8757 |
| sociology | 5895 | 5895 | 5895 | 5895 |


In [707]:
df = remove_anthro(pages_df)

In [708]:
df.groupby("parent").count()

| parent | title | depth | text | edges |
|---|---|---|---|---|
| anthropology | 2161 | 2161 | 2161 | 2161 |
| economics | 6023 | 6023 | 6023 | 6023 |
| political_science | 7011 | 7011 | 7011 | 7011 |
| psychology | 8757 | 8757 | 8757 | 8757 |
| sociology | 5895 | 5895 | 5895 | 5895 |


In [709]:
def remove_duplicates(df):
    nodes_to_remove = [node for node in tqdm(set(df[df.duplicated("title")]["title"])) if
                      len(set(df[df["title"] == node]["parent"])) > 1]
    
    df = df[~df['title'].isin(nodes_to_remove)]
    df = df.drop_duplicates(subset="title", keep="first")
    return df

In [710]:
df = remove_duplicates(df)

100%|██████████████████████████████████████| 5714/5714 [00:23<00:00, 241.87it/s]
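
The rule implemented by `remove_duplicates` can be checked on a toy frame. The sketch below expresses the same filter with a grouped `nunique`, which avoids the per-title loop. The function name `remove_duplicates_grouped` and the sample rows are assumptions made for illustration.

```python
import pandas as pd

# Made-up rows: 'a' appears under two disciplines, 'b' twice under one
toy = pd.DataFrame({
    'title':  ['a', 'a', 'b', 'b', 'c'],
    'parent': ['economics', 'sociology', 'psychology', 'psychology', 'economics'],
    'depth':  [1, 2, 1, 2, 0],
})


def remove_duplicates_grouped(df: pd.DataFrame) -> pd.DataFrame:
    """Drop titles found under several parents, then keep one row per title."""
    # Number of distinct parents for the title on each row
    n_parents = df.groupby('title')['parent'].transform('nunique')
    single_parent = df[n_parents == 1]
    return single_parent.drop_duplicates(subset='title', keep='first')


result = remove_duplicates_grouped(toy)
print(result)
```

Title `a` is removed entirely because it belongs to two disciplines, while `b` is kept once, at the first depth at which it was found.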


In [711]:
df.groupby("parent").count()

| parent | title | depth | text | edges |
|---|---|---|---|---|
| anthropology | 1401 | 1401 | 1401 | 1401 |
| economics | 4636 | 4636 | 4636 | 4636 |
| political_science | 5217 | 5217 | 5217 | 5217 |
| psychology | 6715 | 6715 | 6715 | 6715 |
| sociology | 3518 | 3518 | 3518 | 3518 |


In [712]:
def uniform_page_and_edge_names(df):
    df['title'] = df['title'].str.lower()
    df['edges'] = df['edges'].apply(lambda x: [re.sub(' ', '_', l.strip().lower()) for l in x])
    return df

In [713]:
df = uniform_page_and_edge_names(df)

In [714]:
def remove_edges_not_in_nodelist(df):
    tqdm.pandas()
    #Use a set so that each membership test is O(1) instead of a list scan
    nodes = set(df["title"])
    df['edges'] = df['edges'].progress_apply(lambda x: [e for e in x if e in nodes])
    return df

In [715]:
df = remove_edges_not_in_nodelist(df)

100%|█████████████████████████████████████| 21487/21487 [04:10<00:00, 85.72it/s]


In [716]:
df.groupby("parent").count()

| parent | title | depth | text | edges |
|---|---|---|---|---|
| anthropology | 1401 | 1401 | 1401 | 1401 |
| economics | 4636 | 4636 | 4636 | 4636 |
| political_science | 5217 | 5217 | 5217 | 5217 |
| psychology | 6715 | 6715 | 6715 | 6715 |
| sociology | 3518 | 3518 | 3518 | 3518 |


In [589]:
#def get_edgelist(df):
#    edgelist = df.apply(lambda x: [(x.title, edge) for edge in x.edges], axis = 1)
#    edgelist = chain.from_iterable(edgelist)
#    return list(edgelist)

In [590]:
#def remove_self_loops(df, edgelist):
#    edgelist = [e for e in edgelist if e[0] != e[1]]
    #list_of_nodes_to_keep = [e[0] for e in edgelist] + [e[1] for e in edgelist]
    #list_of_nodes_to_keep = list(set(list_of_nodes_to_keep))
    #df = df[df['title'].isin(list_of_nodes_to_keep)]
#    return edgelist

In [659]:
df = remove_duplicates(pages_df)
df = remove_anthro(df)
df = uniform_page_and_edge_names(df)
df = remove_edges_not_in_nodelist(df)
#edgelist = get_edgelist(df)
#df = remove_self_loops(df, edgelist)
#df = extract_connected_nodes(df, edgelist)
#df = remove_edges_not_in_nodelist(df)

100%|██████████████████████████████████████| 8004/8004 [00:43<00:00, 183.70it/s]
100%|████████████████████████████████████| 15226/15226 [00:57<00:00, 263.62it/s]


In [717]:
df = df.reset_index()

In [726]:
node_attr = df[["title", "parent", "depth"]].to_dict("index")
index_dict = {i:k for k, i in enumerate(df['title'])}

edge_list = []
for node, edges in zip(df['title'].tolist(), df['edges'].tolist()):
    for edge in edges:
        edge_list.append((index_dict[node], index_dict[edge]))
edge_list = [e for e in edge_list if e[0] != e[1]]

G = nx.Graph()
G.add_nodes_from(list(index_dict.values()))
nx.set_node_attributes(G, node_attr)
G.add_edges_from(edge_list)
gcc = max(nx.connected_components(G), key=len)
G = G.subgraph(gcc)

G = nx.relabel.convert_node_labels_to_integers(G)
from littleballoffur import MetropolisHastingsRandomWalkSampler

number_of_nodes = int(0.25 * G.number_of_nodes())
sampler = MetropolisHastingsRandomWalkSampler(number_of_nodes = number_of_nodes)
new_graph = sampler.sample(G)
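
One detail in the cell above is worth illustrating: `convert_node_labels_to_integers` assigns fresh labels to the nodes of the giant component, so a label no longer equals the node's original position, whereas node attributes travel with the node. The toy graph below is a minimal sketch of that behaviour; the graph and its titles are made up.

```python
import networkx as nx

titles = ['a', 'b', 'c', 'd', 'e']           # made-up page titles
G_toy = nx.Graph()
G_toy.add_nodes_from(range(len(titles)))
nx.set_node_attributes(G_toy, {i: {'title': t} for i, t in enumerate(titles)})
G_toy.add_edges_from([(2, 3), (3, 4)])       # nodes 0 and 1 stay isolated

# Giant connected component, relabelled to 0..n-1 as in the cell above
gcc_toy = max(nx.connected_components(G_toy), key=len)
H = nx.convert_node_labels_to_integers(G_toy.subgraph(gcc_toy))

# Looking titles up by position in the original list gives the wrong pages
by_position = sorted(titles[n] for n in H.nodes())
# Reading the 'title' attribute gives the pages that are actually in H
by_attribute = sorted(H.nodes[n]['title'] for n in H.nodes())

print(by_position)
print(by_attribute)
```

When mapping sampled nodes back to rows of the dataframe, reading the `title` attribute is therefore the safer lookup.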

In [728]:
parent=nx.get_node_attributes(new_graph,'parent')

In [729]:
Counter(parent.values())

Counter({'political_science': 1092,
         'psychology': 1587,
         'economics': 985,
         'anthropology': 338,
         'sociology': 808})

In [739]:
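#Node labels were reassigned by convert_node_labels_to_integers, so read the
#titles from the node attributes rather than from positions in index_dict
nodes_to_keep = [new_graph.nodes[n]['title'] for n in new_graph.nodes()]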

In [748]:
final_df = df[df['title'].isin(nodes_to_keep)].reset_index()[["title", "parent", "depth", "text", "edges"]]

In [752]:
final_df = remove_edges_not_in_nodelist(final_df)

100%|██████████████████████████████████████| 4810/4810 [00:06<00:00, 717.81it/s]


In [794]:
node_attr = final_df[["title", "parent", "depth"]].set_index("title").to_dict()

In [819]:
edge_list = []
for node, edges in zip(final_df['title'].tolist(), final_df['edges'].tolist()):
    for edge in edges:
        edge_list.append((node, edge))

G = nx.DiGraph()
G.add_edges_from(edge_list)
gcc = max(nx.weakly_connected_components(G), key=len)
G = G.subgraph(gcc)

In [820]:
final_df["gcc"] = final_df["title"].apply(lambda x: 1 if x in G.nodes() else 0)

In [823]:
G = nx.DiGraph()
G.add_edges_from(edge_list)
node_attr = final_df[["title", "parent", "depth", "gcc"]].set_index("title").to_dict("index")
nx.set_node_attributes(G, node_attr)

In [829]:
#Use a context manager so that the file is closed after writing
with open('Final_graph.pickle', 'wb') as f:
    pickle.dump(G, f)

In [830]:
final_df.to_pickle("Final_df.pickle")

## Saving the data

After collecting the pages we store them in a dataframe and an edge list for future use. To reduce the size of the edge list we already remove, at this stage, edges that point to pages we have not collected. This means that we only keep edges that link to other pages in one of the five categories.
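
As a small illustration of this edge filtering, the sketch below keeps only the links that point to collected pages. The page names and the `filter_edges` helper are made up; the notebook applies the same idea to the `edges` column of the dataframe.

```python
# Made-up pages: title -> names of the pages it links to
raw_edges = {
    'sociology': ['economics', 'paris', 'psychology'],
    'economics': ['sociology', 'money'],
    'psychology': [],
}


def filter_edges(edges_by_page: dict) -> dict:
    """Keep only links whose target is itself a collected page."""
    collected = set(edges_by_page)  # set lookup is O(1) per edge
    return {page: [e for e in edges if e in collected]
            for page, edges in edges_by_page.items()}


filtered = filter_edges(raw_edges)
print(filtered)
```

Links to `paris` and `money` disappear because neither is among the collected pages.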