# Data Description

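For our project, we collected Wikipedia articles corresponding to five major social science disciplines: Economics, Political Science, Anthropology, Sociology and Psychology. To understand how we did this, one needs to know how information is organised into [categories](https://en.wikipedia.org/wiki/Wikipedia:Contents/Categories) on Wikipedia. Each category can contain both articles and subcategories, and each subcategory can in turn contain further subcategories, so the categories below a root category form a hierarchy of increasing depth, as illustrated in the figure below.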

In [41]:
# Create illustration of wikipedia structure
import pygraphviz as pgv

G=pgv.AGraph(directed=True)

G.add_node("ROOT", label="Category: Social sciences", fontsize=20)
G.add_node("ROOT_i", label="Depth 0", shape = "plaintext", fontsize=20)

disciplines = ["Anthropology",
               "Economics",
               "Sociology",
               "Political Science",
               "Psychology"]

for i,k in enumerate(disciplines):
    G.add_node("Child_%i" % i, label=f"Subcategory: {k}")
    G.add_edge("ROOT", "Child_%i" % i)
    G.add_node("Grandchild_%i" % i, label = f"List of {k} sub-subcategories")
    G.add_edge("Child_%i" % i, "Grandchild_%i" % i)
    G.add_node("Greatgrandchild_%i" % i, label = f"... n list of {k} sub-subcategories")
    G.add_edge("Grandchild_%i" % i, "Greatgrandchild_%i" % i)

# Plain-text depth labels, drawn as a separate chain next to the category tree.
# Fixed node names are used instead of relying on the leftover loop variable `i`.
G.add_node("Depth_1", label="Depth 1", shape="plaintext", fontsize=20)
G.add_node("Depth_2", label="Depth 2", shape="plaintext", fontsize=20)
G.add_node("Depth_n", label="Depth n", shape="plaintext", fontsize=20)

G.add_edge("ROOT_i", "Depth_1")
G.add_edge("Depth_1", "Depth_2")
G.add_edge("Depth_2", "Depth_n")

G.layout(prog='dot')
G.draw('wikipedia_struture.png')

![](wikipedia_struture.png)

To collect the articles, we used the tool [PetScan](https://petscan.wmflabs.org/), which lists all the subcategories or pages below a given category down to a chosen query depth. PetScan can be queried programmatically from Python, which gave us the list of pages to download for each discipline.

![](petscan.gif)
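
To make the programmatic access concrete, here is a minimal sketch of how such a query could be built and parsed with the standard library only. It is not the code we used for the project: the helper names (`build_petscan_url`, `extract_titles`, `fetch_titles`), the query parameter names and the JSON layout of the response are all assumptions based on PetScan's web form and should be checked against the live service.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

PETSCAN_URL = "https://petscan.wmflabs.org/"


def build_petscan_url(category, depth):
    """Build a query URL for all pages below `category`, down to `depth`.

    The parameter names are assumptions based on PetScan's web form.
    """
    params = {
        "language": "en",
        "project": "wikipedia",
        "categories": category,
        "depth": depth,
        "format": "json",
        "doit": "1",
    }
    return PETSCAN_URL + "?" + urlencode(params)


def extract_titles(payload):
    """Pull the page titles out of a decoded response (assumed layout)."""
    return [page["title"] for page in payload["*"][0]["a"]["*"]]


def fetch_titles(category, depth):
    """Query PetScan over the network and return the page titles."""
    with urlopen(build_petscan_url(category, depth)) as response:
        return extract_titles(json.load(response))


# Offline illustration with a hand-made payload in the assumed layout
sample_payload = {"*": [{"a": {"*": [{"title": "Anarchism"}, {"title": "Census"}]}}]}
print(build_petscan_url("Economics", 2))
print(extract_titles(sample_payload))
```

Calling the hypothetical `fetch_titles("Economics", 2)` would then return the titles found up to two levels below the Economics category, provided the assumed parameters and response layout hold.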

The table below shows the first five observations of our dataset, which includes the following variables:

* `name`: The name of the Wikipedia article.
* `parent`: The discipline to which the article belongs.
* `edges`: The links from the article to other Wikipedia pages.
* `text`: The raw text of the article.
* `cleaned_text`: The raw text with punctuation removed and lower-cased.
* `lemmatized`: The cleaned text in lemmatized form, with stop words removed.
* `gcc`: Dummy variable equal to 1 if the article is part of the giant component of the network.
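
As an illustration of how two of these derived variables can be obtained, the sketch below shows one possible way to produce `cleaned_text` and `gcc`. It is not our actual preprocessing code: the function names are hypothetical, the regular expression is only one way of removing punctuation, and the giant component is computed under the assumption that links are treated as undirected and that only links between articles in the dataset count.

```python
import re
from collections import deque


def clean_text(text):
    """Remove punctuation, lower-case and collapse whitespace."""
    no_punctuation = re.sub(r"[^\w\s]", "", text)
    return " ".join(no_punctuation.lower().split())


def giant_component_flags(links):
    """Return {article: 1 or 0}, with 1 for articles in the largest component.

    `links` maps each article name to the list of pages it links to.
    Assumption: links are undirected and targets outside `links` are ignored.
    """
    adjacency = {name: set() for name in links}
    for source, targets in links.items():
        for target in targets:
            if target in adjacency and target != source:
                adjacency[source].add(target)
                adjacency[target].add(source)

    seen, largest = set(), set()
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects the component containing `start`
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in component:
                    component.add(neighbour)
                    queue.append(neighbour)
        seen |= component
        if len(component) > len(largest):
            largest = component
    return {name: int(name in largest) for name in links}


# Toy data (hypothetical): A-B-C form the largest component, D-E a smaller one
toy_links = {"A": ["B", "X"], "B": ["C"], "C": [], "D": ["E"], "E": []}
print(clean_text("Egalitarianism (from French égal 'equal'), or"))
print(giant_component_flags(toy_links))
```

On real data, article names and link targets would first have to be normalised to the same form before matching, since the names in the table use underscores while the link targets use spaces.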

The data can be downloaded from the following [link](https://drive.google.com/file/d/1U0Q8eMvp50crf383ykE1anB5J4oq_mrc/view?usp=sharing).

In [66]:
import pandas as pd
import numpy as np
from ast import literal_eval
from collections import defaultdict


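# Adjust the path to wherever the downloaded file is stored
df = pd.read_pickle('/home/matiasp/University/m2/socialgraphs2021-Matias/project/df_preprocessed.obj')
# Keep 'tokens': it is needed for the word counts in the summary table further down
df = df.drop(columns=['categories', 'depth', 'Unnamed: 0']).dropna()
# 'edges' is stored as the string representation of a list; parse it back into a list
df['edges'] = df['edges'].apply(literal_eval)
df.drop(columns='tokens').set_index('name').head()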

| name | parent | edges | text | cleaned_text | lemmatized | gcc |
|---|---|---|---|---|---|---|
| Anarchism | political_science | [-ism, 1872 Hague Congress, 1919 United States... | Anarchism is a political philosophy and moveme... | anarchism is a political philosophy and moveme... | anarchism political philosophy movement scepti... | 1 |
| Census | political_science | [2011 Canadian census, Ab urbe condita (book),... | A census is the procedure of systematically ca... | a census is the procedure of systematically ca... | a census procedure systematically calculating ... | 1 |
| Comparative_law | political_science | [Act of Congress, Act of Parliament, Adjudicat... | Comparative law is the study of differences an... | comparative law is the study of differences an... | comparative law study difference similarity la... | 1 |
| Code_of_Hammurabi | political_science | ['Ain Ghazal Statues, A. Leo Oppenheim, A Medi... | The Code of Hammurabi is a Babylonian legal te... | the code of hammurabi is a babylonian legal te... | the code hammurabi babylonian legal text compo... | 1 |
| Egalitarianism | political_science | [Abdullah Öcalan, Abraham Lincoln, Adam Smith,... | Egalitarianism (from French égal 'equal'), or... | egalitarianism from french égal equal or equa... | egalitarianism french égal equal equalitariani... | 1 |


In the table below we display summary statistics for each discipline: the number of articles, the average number of edges per article and the average word count per article. As can be seen, the articles are unevenly distributed across disciplines, with Anthropology (1,198 articles) for example having almost three times as many articles as Political Science (418).

In [62]:
# Create descriptives table
tab = defaultdict(list)
for discipline in df['parent'].unique():
    subset = df.loc[df['parent'] == discipline]
    tab['Discipline'].append(discipline)
    tab['Number of articles'].append(len(subset))
    # Mean number of links and mean number of tokens per article
    tab['Avg. edges'].append(subset['edges'].apply(len).mean())
    tab['Avg. word count'].append(subset['tokens'].apply(len).mean())
tab = pd.DataFrame(tab)
tab.set_index('Discipline').round(2)

| Discipline | Number of articles | Avg. edges | Avg. word count |
|---|---|---|---|
| political_science | 418 | 218.65 | 12748.76 |
| economics | 675 | 82.67 | 4399.29 |
| sociology | 684 | 137.08 | 9135.81 |
| anthropology | 1198 | 193.84 | 10567.26 |
| psychology | 955 | 127.9 | 10447.76 |
