# Data Description

For our project, we collected Wikipedia articles corresponding to five major social science disciplines: Economics, Political Science, Anthropology, Sociology and Psychology. To do this, we used the tool [PetScan](https://petscan.wmflabs.org/), which enabled us to find all the sub-categories and articles for each discipline. PetScan can be accessed either through a web interface or programmatically, for example from Python.

![petscan](petscan.gif "PetScan Demonstration")
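
For programmatic access, a PetScan query can be expressed as a URL. The sketch below only builds such URLs with the standard library and sends no request. The parameter names (`language`, `project`, `categories`, `depth`, `format`, `doit`) mirror the fields of the PetScan web form as we understand them and should be treated as assumptions. The helper name `build_petscan_url`, the root category names and the depth value are hypothetical and not necessarily the ones used in the project.

```python
from urllib.parse import urlencode

PETSCAN_URL = "https://petscan.wmflabs.org/"


def build_petscan_url(category, depth=2, language="en"):
    """Build a PetScan query URL for one root category (hypothetical helper)."""
    # Assumed parameter names, taken from the PetScan web form fields.
    params = {
        "language": language,
        "project": "wikipedia",
        "categories": category,
        "depth": depth,      # how many levels of sub-categories to follow
        "format": "json",    # ask for machine-readable output
        "doit": "1",         # run the query instead of only showing the form
    }
    return PETSCAN_URL + "?" + urlencode(params)


# One query URL per discipline (illustrative root category names).
disciplines = ["Economics", "Political science", "Anthropology",
               "Sociology", "Psychology"]
urls = {name: build_petscan_url(name) for name in disciplines}
```

Each URL could then be fetched with any HTTP client and the returned article titles collected per discipline.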

The table below shows the first five observations of our data set, which includes the following variables:

* `name`: The name of the Wikipedia article.
* `parent`: The discipline to which the article belongs.
* `edges`: All links from the article to other Wikipedia pages.
* `text`: The raw text of the article.
* `cleaned_text`: The raw text with punctuation removed and lower-cased.
* `lemmatized`: The cleaned text in lemmatized form, stop words removed.
* `gcc`: Dummy indicating whether the article is part of the giant component of the network.
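
To illustrate how two of these variables could be derived, the sketch below computes a `cleaned_text`-style string and a `gcc`-style dummy on toy data. It is not the preprocessing code used for the project. The function names are hypothetical, punctuation is taken to mean ASCII punctuation, and the giant component is assumed to be the largest connected component when link direction is ignored.

```python
import string
from collections import deque


def clean_text(text):
    """Lower-case the text and strip ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def giant_component(edges_by_article):
    """Return the largest set of articles connected by links (direction ignored)."""
    # Build an undirected adjacency map restricted to articles in the data set.
    neighbours = {name: set() for name in edges_by_article}
    for source, targets in edges_by_article.items():
        for target in targets:
            if target in neighbours and target != source:
                neighbours[source].add(target)
                neighbours[target].add(source)

    seen, largest = set(), set()
    for start in neighbours:
        if start in seen:
            continue
        # Breadth-first search collects one connected component.
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbours[node] - component:
                component.add(nxt)
                queue.append(nxt)
        seen |= component
        if len(component) > len(largest):
            largest = component
    return largest


# Toy data, not taken from the real data set.
toy_edges = {
    "Inflation": ["Money", "Not in data set"],
    "Money": ["Inflation"],
    "Kinship": [],
}
gcc_members = giant_component(toy_edges)
gcc = {name: int(name in gcc_members) for name in toy_edges}
cleaned = clean_text("Supply, Demand & Prices!")
```

In the toy example, `Inflation` and `Money` link to each other and form the giant component, while `Kinship` has no links and gets a `gcc` value of 0.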

The data can be downloaded from the following [link](https://drive.google.com/file/d/1U0Q8eMvp50crf383ykE1anB5J4oq_mrc/view?usp=sharing).

In [4]:
import pandas as pd
import numpy as np
from ast import literal_eval
from collections import defaultdict


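# Path to the downloaded data file; adjust it to where you stored the file
df = pd.read_pickle('df_preprocessed.obj')
# Drop helper columns; `tokens` is kept because the descriptives table below uses it
df = df.drop(columns=['categories', 'depth', 'Unnamed: 0']).dropna()
# `edges` is stored as the string form of a list; parse it into a real list
df['edges'] = df['edges'].apply(literal_eval)
df.drop(columns='tokens').set_index('name').head()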
#df.to_html('temp.html')

The table below displays summary statistics for each discipline: the number of articles, the average number of edges and the average word count. The distribution is rather skewed: Anthropology, for example, has almost three times as many articles as Political Science (1,198 versus 418).

In [62]:
# Create descriptives table: one row per discipline
# (uses the `tokens` column, so it must still be present in df)
tab = defaultdict(list)
for discipline in df['parent'].unique():
    subset = df.loc[df['parent'] == discipline]
    edge_counts = [len(edges) for edges in subset['edges']]     # links per article
    word_counts = [len(tokens) for tokens in subset['tokens']]  # tokens per article

    tab['Discipline'].append(discipline)
    tab['Number of articles'].append(len(subset))
    tab['Avg. edges'].append(np.mean(edge_counts))
    tab['Avg. word count'].append(np.mean(word_counts))
tab = pd.DataFrame(tab)
tab.set_index('Discipline').round(2)

| Discipline | Number of articles | Avg. edges | Avg. word count |
|---|---|---|---|
| political_science | 418 | 218.65 | 12748.76 |
| economics | 675 | 82.67 | 4399.29 |
| sociology | 684 | 137.08 | 9135.81 |
| anthropology | 1198 | 193.84 | 10567.26 |
| psychology | 955 | 127.90 | 10447.76 |
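
To make the skew concrete, the article counts from the table can be compared directly. The sketch below uses only the counts reported above; the variable names are ours and chosen for illustration.

```python
# Article counts per discipline, copied from the descriptives table.
article_counts = {
    "political_science": 418,
    "economics": 675,
    "sociology": 684,
    "anthropology": 1198,
    "psychology": 955,
}

total = sum(article_counts.values())
# Share of all articles that each discipline contributes.
shares = {name: count / total for name, count in article_counts.items()}
# Size of the largest discipline relative to the smallest.
ratio = article_counts["anthropology"] / article_counts["political_science"]

print(f"Total articles: {total}")
for name, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<18}{share:6.1%}")
print(f"anthropology / political_science: {ratio:.2f}")
```

Any analysis that compares disciplines may need to account for these unequal group sizes.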
