## Import Packages and Data

We first import the relevant packages as described in the README.

In [1]:
import requests
import json
import pandas as pd
import os
import csv
import numpy as np

# The following import is motivated by the Stack Overflow post by user emunsing:
#        https://stackoverflow.com/a/29665452/3905509
from IPython.display import display, HTML

os.chdir('C:/Users/willf/Documents/Data 512/HW/A2')

Next, we import the two datasets as pandas dataframes:
* one representing population estimates from many nations (*data_population*)
* one representing political Wikipedia articles from many nations (*data_wiki*)

Downstream in this notebook, we will use the ORES API from Wikipedia. To prevent access issues (and more generally to be polite), we partition *data_wiki* into chunks of roughly 100 articles, stored in the list of dataframes *data_wiki_partition*.

In [2]:
# Import WPDS_2018_data.csv
# The following dataset comes from https://www.dropbox.com/s/5u7sy1xt7g0oi2c/WPDS_2018_data.csv?dl=0
data_population = pd.read_csv('WPDS_2018_data.csv', thousands = ',')

# Import page_data.csv
# The following dataset is produced by the R project found here: https://figshare.com/articles/Untitled_Item/5513449
data_wiki = pd.read_csv('page_data.csv')

# Partition data_wiki into small enough chunks to not get blocked by ORES
# The following creates a list of data_wiki partitions. Rounding the chunk count up
# (ceiling division) keeps every list element at 100 records or fewer.
data_wiki_partition = np.array_split(data_wiki, -(-data_wiki.shape[0] // 100))
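
Because `np.array_split` takes a number of chunks rather than a chunk size, the resulting chunk lengths depend on how that count is rounded. The following is a minimal sketch, using a made-up array of 250 revision IDs (not the real data), that compares rounding the chunk count down versus up:

```python
import numpy as np

# Made-up stand-in for the rev_id column: 250 consecutive IDs (illustration only).
toy_rev_ids = np.arange(1000, 1250)
target_size = 100

# Rounding the chunk count down gives fewer, larger chunks.
floor_chunks = np.array_split(toy_rev_ids, len(toy_rev_ids) // target_size)
floor_sizes = [len(chunk) for chunk in floor_chunks]

# Rounding the chunk count up (ceiling division) keeps every chunk at or below the target.
ceil_chunks = np.array_split(toy_rev_ids, -(-len(toy_rev_ids) // target_size))
ceil_sizes = [len(chunk) for chunk in ceil_chunks]

print(floor_sizes)  # [125, 125]
print(ceil_sizes)   # [84, 83, 83]
```

With the count rounded down, each chunk holds slightly more than the target size; with the count rounded up, no chunk exceeds it.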

# Scoring Wikipedia Articles with ORES

To estimate the quality of Wikipedia articles, we use the ORES API. The following code block defines the request headers and an API call function, *get_ores_data*. The scored ORES output is stored in *data_wiki_ores_json*, a list of JSON responses (one per partition).

To make *data_wiki_ores_json* more usable, we restate it as the dataframe *data_wiki_ores_df*. Lastly, the predicted article qualities are merged onto the original *data_wiki* dataframe to produce *data_wiki_ores*.

In [3]:
# The code in this block is reused & adapted from Os Keyes:
#        https://github.com/Ironholds/data-512-a2/blob/master/hcds-a2-bias_demo.ipynb

headers = {'User-Agent' : 'https://github.com/OO00OO00', 'From' : 'frierw@uw.edu'}

# Score Wiki articles with ORES
def get_ores_data(revision_ids, headers):
    
    # Define the endpoint
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
    
    # Specify the parameters - smushing all the revision IDs together separated by | marks.    
    params = {'project' : 'enwiki',
              'model'   : 'wp10',
              'revids'  : '|'.join(str(x) for x in revision_ids)
              }
    # Pass the identifying headers along with the request
    api_call = requests.get(endpoint.format(**params), headers = headers)
    response = api_call.json()
    return response

# Score the chunks of 100 articles in ORES. Return as a list of json output for each partition.
data_wiki_ores_json = [get_ores_data(i['rev_id'].tolist(), headers) for i in data_wiki_partition]

# For each batch and each revision id contained therein (i.e., the keys of batch['enwiki']['scores']):
#     1) Restate the wp10 dictionary value as a pandas dataframe
#     2) Add a column reflecting the revision id
#     3) Extract the relevant prediction field, i.e., the 1st row of the resulting dataframe
#     4) Lastly, concatenate the list of dataframes as a new dataframe and transpose the result
data_wiki_ores_df = pd.concat([pd.DataFrame.from_dict(batch['enwiki']['scores'][revID]['wp10']).assign(rev_id = int(revID)).iloc[0,:] 
                               for batch in data_wiki_ores_json 
                               for revID in batch['enwiki']['scores'].keys()
                              ], axis = 1).transpose()

# Append ORES prediction to data_wiki
data_wiki_ores = pd.merge(data_wiki, data_wiki_ores_df, left_on = 'rev_id', right_on = 'rev_id', how = 'left')
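
To make the response handling easier to follow, here is a small sketch that extracts predictions from a mock response. The mock only mirrors the keys the code above indexes into (`enwiki`, `scores`, `wp10`, `score`); the `error` entry, the revision IDs, and the helper name `extract_predictions` are assumptions made for illustration, not values returned by ORES.

```python
import pandas as pd

# Hypothetical response: the IDs, probabilities and error text are invented.
mock_response = {
    'enwiki': {
        'scores': {
            '1001': {'wp10': {'score': {'prediction': 'Stub',
                                        'probability': {'Stub': 0.9, 'Start': 0.1}}}},
            # Assumed shape for a revision that could not be scored.
            '1002': {'wp10': {'error': {'type': 'RevisionNotFound',
                                        'message': 'Could not find revision'}}},
        }
    }
}

def extract_predictions(batch):
    """Return one row per revision id with its predicted quality (None if unscored)."""
    rows = []
    for rev_id, models in batch['enwiki']['scores'].items():
        score = models['wp10'].get('score')
        rows.append({'rev_id': int(rev_id),
                     'score': score['prediction'] if score else None})
    return pd.DataFrame(rows)

predictions = extract_predictions(mock_response)
print(predictions)
```

Keeping unscored revisions as missing values, rather than dropping them, makes it possible to count afterwards how many articles received no prediction.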

# Combine Data

Next, we combine the ORES-scored article dataframe with the population dataframe into a new dataframe called *data*. Columns are renamed and reordered to make the downstream analysis in this notebook clearer. Lastly, the result is exported to the CSV file *wikipedia_political_article_bias_2018.csv*.

In [4]:
# Merge dataframes
data = pd.merge(data_population, data_wiki_ores, left_on = 'Geography', right_on = 'country', how = 'inner')

# Rename columns:
data.rename(columns = {'page': 'article_name'}, inplace = True)
data.rename(columns = {'rev_id': 'revision_id'}, inplace = True)
data.rename(columns = {'score': 'article_quality'}, inplace = True)
data.rename(columns = {'Population mid-2018 (millions)': 'population'}, inplace = True)

# Remove duplicate column
data = data.drop(columns = 'Geography')

# Reorder columns:
data = data.reindex(columns = ['country', 'article_name', 'revision_id', 'article_quality', 'population'])

# Export data to CSV
data.to_csv('wikipedia_political_article_bias_2018.csv', index = False)
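
An inner merge silently discards rows whose country name appears in only one of the two dataframes. One way to see what is lost is an outer merge with pandas' `indicator` option. The sketch below uses made-up toy frames (the names `Region X` and `Atlantis` are invented) and is not part of the original pipeline:

```python
import pandas as pd

# Toy stand-ins for data_population and data_wiki_ores (illustration only).
toy_population = pd.DataFrame({'Geography': ['Iceland', 'Fiji', 'Region X'],
                               'population': [0.4, 0.9, 100.0]})
toy_articles = pd.DataFrame({'country': ['Iceland', 'Fiji', 'Atlantis'],
                             'page': ['Article A', 'Article B', 'Article C']})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'.
audit = pd.merge(toy_population, toy_articles,
                 left_on='Geography', right_on='country',
                 how='outer', indicator=True)

only_population = audit.loc[audit['_merge'] == 'left_only', 'Geography'].tolist()
only_articles = audit.loc[audit['_merge'] == 'right_only', 'country'].tolist()

print(only_population)  # population rows with no matching articles
print(only_articles)    # article rows with no matching population
```

Reviewing these two lists before committing to `how = 'inner'` shows which countries drop out because of naming mismatches.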

Since we want to calculate the proportion of high-quality political articles, i.e., those with an ORES prediction of FA or GA, the following code reshapes *data* so that each quality level becomes its own column and stores the result in the dataframe *data_pivot*. Next, *data_pivot* is aggregated at the country level with additional summary data; the result is called *data_aggregated*. Lastly, the column index of *data_aggregated* is flattened so that derived fields are easier to create downstream in this notebook.

In [5]:
# Expand levels of article_quality column into new columns themselves
#    Motivated by the Stack Overflow response from user DYZ:
#        https://stackoverflow.com/a/42708606
# 1) Use pivot_table on data to dcast levels of article_quality as new columns
# 2) Restate pivot to numpy record array
# 3) Restate numpy record array back to a pandas dataframe
data_pivot = pd.DataFrame(data.pivot_table(index = ['country', 'article_name', 'revision_id', 'population'], columns = 'article_quality', aggfunc=len).to_records())

# Aggregate data to country-level and include relevant summary data
data_aggregated = data_pivot.groupby('country').agg({
    'population': ['max'],
    'revision_id': ['count'],
    'B': ['count'],
    'C': ['count'],
    'FA': ['count'],
    'GA': ['count'],
    'Start': ['count'],
    'Stub': ['count']
})

# Flatten the multi-index dataframe for ease in deriving new columns
# The following code is motivated by the Stack Overflow response from user Andy Hayden:
#        https://stackoverflow.com/a/14508355
data_aggregated.columns = data_aggregated.columns.get_level_values(0)

# For clarity, renaming revision_id to article_count
data_aggregated.rename(columns = {'revision_id' : 'article_count'}, inplace = True)
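
For comparison, the same kind of per-country quality counts can be sketched with `pd.crosstab`. This is an alternative technique shown on made-up toy data, not the approach used in this notebook:

```python
import pandas as pd

# Toy article-level data (country labels and qualities are invented).
toy = pd.DataFrame({
    'country': ['A', 'A', 'A', 'B', 'B'],
    'article_quality': ['GA', 'Stub', 'Stub', 'FA', 'Start'],
})

# One row per country, one column per quality level, cells hold article counts.
counts = pd.crosstab(toy['country'], toy['article_quality'])

# Total articles per country is the row sum across quality levels.
counts['article_count'] = counts.sum(axis=1)

print(counts)
```

Unlike counting non-null cells after a pivot, `crosstab` fills absent combinations with 0 directly, so the quality columns can be summed without handling missing values.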

Two fields are derived:
* articles_per_population: the number of articles per person for each nation (population is recorded in millions, so it is multiplied by 1e6)
* pct_high_quality_articles: the proportion of each nation's articles that are high quality, where "high quality" means an ORES predicted quality of *FA* or *GA*.

In [6]:
# Create derived fields
data_aggregated['articles_per_population'] = data_aggregated['article_count'] / (data_aggregated['population'] * 1e6)
data_aggregated['pct_high_quality_articles'] = (data_aggregated['FA'] + data_aggregated['GA']) / data_aggregated['article_count']
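
As a quick sanity check of the two formulas, the sketch below recomputes them for two rows whose inputs are copied from the output tables later in this notebook (Tuvalu and North Korea). The dataframe name `example` is introduced here only for illustration:

```python
import pandas as pd

# Inputs copied from the output tables in this notebook; population is in millions.
example = pd.DataFrame(
    {'population': [0.01, 25.6],
     'article_count': [42, 20],
     'FA': [0, 0],
     'GA': [5, 9]},
    index=['Tuvalu', 'Korea, North'])

# Same formulas as above: convert population from millions to persons first.
example['articles_per_population'] = example['article_count'] / (example['population'] * 1e6)
example['pct_high_quality_articles'] = (example['FA'] + example['GA']) / example['article_count']

print(example[['articles_per_population', 'pct_high_quality_articles']])
```

Tuvalu works out to 42 / 10,000 = 0.0042 articles per person, and North Korea to 9 / 20 = 0.45 high-quality articles, matching the reported tables.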

# Analysis

Finally, we create four tables:
* 10 highest-ranked countries in terms of number of politician articles as a proportion of country population
* 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population
* 10 highest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country
* 10 lowest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country

Note: The wording for the tables is taken from here: https://wiki.communitydata.cc/Human_Centered_Data_Science_(Fall_2018)/Assignments#A2:_Bias_in_data

In [7]:
print('10 highest-ranked countries in terms of number of politician articles as a proportion of country population:')
display(data_aggregated.sort_values(by = ['articles_per_population'], ascending = False).iloc[0:10,[0, 1, -2]])

print('10 lowest-ranked countries in terms of number of politician articles as a proportion of country population:')
display(data_aggregated.sort_values(by = ['articles_per_population'], ascending = True).iloc[0:10,[0, 1, -2]])

print('10 highest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country:')
display(data_aggregated.sort_values(by = ['pct_high_quality_articles'], ascending = False).iloc[0:10,[1, 4, 5, -1]])

print('10 lowest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country:')
display(data_aggregated.sort_values(by = ['pct_high_quality_articles', 'article_count', 'population'], ascending = True).iloc[0:10,[0, 1, 4, 5, -1]])

10 highest-ranked countries in terms of number of politician articles as a proportion of country population:


| country | population | article_count | articles_per_population |
|---|---|---|---|
| Tuvalu | 0.01 | 42 | 0.0042 |
| San Marino | 0.03 | 72 | 0.0024 |
| Nauru | 0.01 | 11 | 0.0011 |
| Iceland | 0.4 | 200 | 0.0005 |
| Marshall Islands | 0.06 | 26 | 0.000433 |
| Monaco | 0.04 | 16 | 0.0004 |
| Luxembourg | 0.6 | 176 | 0.000293 |
| Fiji | 0.9 | 195 | 0.000217 |
| Seychelles | 0.1 | 17 | 0.00017 |
| Tonga | 0.1 | 17 | 0.00017 |


10 lowest-ranked countries in terms of number of politician articles as a proportion of country population:


| country | population | article_count | articles_per_population |
|---|---|---|---|
| Uzbekistan | 32.9 | 12 | 3.647416e-07 |
| Ethiopia | 107.5 | 62 | 5.767442e-07 |
| Mozambique | 30.5 | 18 | 5.901639e-07 |
| China | 1393.8 | 901 | 6.464342e-07 |
| Saudi Arabia | 33.4 | 26 | 7.784431e-07 |
| Korea, North | 25.6 | 20 | 7.8125e-07 |
| Indonesia | 265.2 | 224 | 8.446456e-07 |
| India | 1371.3 | 1238 | 9.02793e-07 |
| Zambia | 17.7 | 16 | 9.039548e-07 |
| Thailand | 66.2 | 81 | 1.223565e-06 |


10 highest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country:


| country | article_count | FA | GA | pct_high_quality_articles |
|---|---|---|---|---|
| Korea, North | 20 | 0 | 9 | 0.45 |
| Central African Republic | 19 | 1 | 2 | 0.157895 |
| Trinidad and Tobago | 31 | 0 | 4 | 0.129032 |
| Tuvalu | 42 | 0 | 5 | 0.119048 |
| Saudi Arabia | 26 | 1 | 2 | 0.115385 |
| Romania | 282 | 19 | 9 | 0.099291 |
| United States | 920 | 24 | 67 | 0.098913 |
| Singapore | 92 | 0 | 9 | 0.097826 |
| Kosovo | 31 | 0 | 3 | 0.096774 |
| Eritrea | 11 | 0 | 1 | 0.090909 |


10 lowest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country:


| country | population | article_count | FA | GA | pct_high_quality_articles |
|---|---|---|---|---|---|
| Liechtenstein | 0.04 | 5 | 0 | 0 | 0.0 |
| Kiribati | 0.1 | 5 | 0 | 0 | 0.0 |
| Sao Tome and Principe | 0.2 | 6 | 0 | 0 | 0.0 |
| Lesotho | 2.3 | 8 | 0 | 0 | 0.0 |
| Dominica | 0.07 | 10 | 0 | 0 | 0.0 |
| Turkmenistan | 5.9 | 10 | 0 | 0 | 0.0 |
| Nauru | 0.01 | 11 | 0 | 0 | 0.0 |
| Grenada | 0.1 | 11 | 0 | 0 | 0.0 |
| Barbados | 0.3 | 12 | 0 | 0 | 0.0 |
| Belize | 0.4 | 12 | 0 | 0 | 0.0 |


# Discussion

I expected smaller but notorious countries to have higher proportions of high-quality articles on Wikipedia. North Korea, for example, is notable for its aggressive statements and acts despite having a relatively small population. The tables above reflect this: North Korea is among the 10 lowest-ranked countries for articles per capita, yet it is the highest-ranked country for the proportion of high-quality articles.

I also expected some smaller nations to have disproportionately more political articles because a unique political property manifests itself more easily *because* the nations are small. Iceland is a good example: the nation has fewer than 0.5M citizens, yet it recently had its prime minister resign following the Panama Papers scandal. Because Iceland is small by population (and in size), its citizens could interact more directly with their government to elicit change.

The results also suggest that political articles per population may not be a very meaningful measure of bias for very large countries. For example, China and India each have more than a billion people, and both are among the 10 lowest-ranked nations for political articles per population. Yet each has on the order of 1,000 political articles (901 and 1,238, respectively), which suggests that considerably more content has been written about these nations than about the smaller-population nations in the same table.