# A2: Bias in data

* Author: Haowen Ni
* Date: 11/02/2017
  
In this assignment, the author:  
1. Collects articles about politicians, by country, from the English Wikipedia dataset
2. Collects a population-by-country dataset
3. Uses ORES, a machine learning API, to get article quality data for articles on political figures from English Wikipedia
4. Combines the article quality data and the population data into one dataframe
5. Analyzes how the coverage of politicians on Wikipedia, and the quality of articles about politicians, differs between countries


## Article and population data loading and preprocessing

In [1]:
# Import Python packages that we will use in this script
import csv
import json
import pandas
import pickle
import requests


We first import the English Wikipedia politician articles dataset and the population dataset into the notebook's working environment. Both datasets are CSV files. Each row of the Wikipedia articles dataset contains the sanitised country name, the unsanitised page title, and the ID of the last edit to the page (also called the revision id). Each row of the population dataset contains a country name and the corresponding population.

In [2]:
# Import wikipedia dataset

wiki_data = []
with open('page_data.csv') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        wiki_data.append(row)
wiki_data = wiki_data[1:] # Skip the header
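As a side note, the header handling above can also be left to `csv.DictReader`, which reads the header row itself and lets each field be accessed by name instead of by position. The sketch below is only an illustration: the sample rows are made up, and the column names `page`, `country`, and `rev_id` are an assumption about the header of `page_data.csv`.

```python
import csv
import io

# Hypothetical sample standing in for page_data.csv. The column names
# (page, country, rev_id) are an assumption about the real header.
sample_csv = io.StringIO(
    "page,country,rev_id\n"
    "Example Politician A,Chad,1001\n"
    "Example Politician B,Fiji,1002\n"
)

# DictReader consumes the header row itself, so no manual slicing is needed.
rows = list(csv.DictReader(sample_csv))

for row in rows:
    print(row["country"], row["page"], row["rev_id"])
```

With named fields, a later step such as building the revision id dictionary no longer depends on remembering which column index holds which value.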


Notice that the population data are in string format, so we need to convert them to integers before doing calculations.

In [3]:
# Import population data

population_list = []
with open('Population Mid-2015.csv') as csvfile:
    reader = csv.reader(csvfile)
    next(reader) # Skip table title
    next(reader) # Skip empty line
    next(reader) # Skip table headers
    for row in reader:
        if len(row) < 5: # Skip empty rows
            continue
        row[4] = int(row[4].replace(',', ''))
        population_list.append(row)
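The string-to-integer conversion used above can be isolated into a small helper, which makes it easy to check on its own. This is a minimal sketch; the function name `parse_population` is hypothetical and not part of the original notebook.

```python
def parse_population(text):
    """Convert a string such as '13,707,000' to the integer 13707000."""
    # Remove thousands separators and surrounding whitespace before parsing.
    return int(text.replace(",", "").strip())


print(parse_population("13,707,000"))  # 13707000
print(parse_population(" 867,000 "))   # 867000
```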


Based on the Wikipedia dataset we just imported, a dictionary is constructed in which the keys are revision ids and the values are tuples of country names and article names.

In [4]:
# Based on wikipedia dataset, construct a dictionary as following:
# Key: revision id
# Value: (country name, article name)

country_article = {}
for line in wiki_data:
    country_article[line[2]] = (line[1], line[0])


Based on the population dataset we just imported, a dictionary is constructed in which the keys are country names and the values are population.

In [5]:
# Based on population data, construct a dictionary as following:
# Key: country name
# Value: population

population_data = {}
for line in population_list:
    population_data[line[0]] = line[4]


## Using ORES to get article quality predictions

To get the article quality predictions, we use ORES, a machine learning API provided by Wikimedia. We write a function that accepts a list of revision ids and the request headers that identify the API user, and returns the article quality predictions. Since multiple revision ids can be passed to the ORES API in each call, we split the Wikipedia article set into batches of 50. Notice that not every revision id from the Wikipedia dataset that is passed to the API gets an article quality prediction. For articles that cannot be scored, we print a message and leave them out of the final prediction dataset. The API returns its result in JSON format.

In [6]:
headers = {'User-Agent' : 'https://github.com/HWNi', 'From' : 'haowen2@uw.edu'}

def get_ores_data(revision_ids, headers):
    
    # Define the endpoint
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
    
    # Specify the parameters - smushing all the revision IDs together separated by | marks.
    # Yes, 'smush' is a technical term, trust me I'm a scientist.
    # What do you mean "but people trusting scientists regularly goes horribly wrong" who taught you tha- oh.  
    params = {'project' : 'enwiki',
              'model'   : 'wp10',
              'revids'  : '|'.join(str(x) for x in revision_ids)
              }
    api_call = requests.get(endpoint.format(**params), headers=headers)
    response = api_call.json()
    return response
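To make the shape of the JSON result concrete, the sketch below separates scored revisions from unscored ones. It runs on a hand-written sample rather than real API output. The nesting `project -> 'scores' -> revision id -> model -> 'score' -> 'prediction'` follows the lookups used in this notebook; the `error` entry for the unscored revision and the helper name `extract_predictions` are assumptions made for illustration.

```python
def extract_predictions(response, project="enwiki", model="wp10"):
    """Split an ORES-style response into predictions and unscored revision ids."""
    predictions, missing = {}, []
    for rev_id, result in response[project]["scores"].items():
        score = result.get(model, {}).get("score")
        if score is None:
            # No 'score' entry: the revision could not be scored.
            missing.append(rev_id)
        else:
            predictions[rev_id] = score["prediction"]
    return predictions, missing


# Hand-written sample, not real API output. The 'error' entry is an assumed
# shape for a revision that could not be scored.
sample_response = {
    "enwiki": {
        "scores": {
            "1001": {"wp10": {"score": {"prediction": "Stub"}}},
            "1002": {"wp10": {"error": {"message": "revision not found"}}},
        }
    }
}

predictions, missing = extract_predictions(sample_response)
print(predictions)  # {'1001': 'Stub'}
print(missing)      # ['1002']
```

Returning the unscored ids as a list, instead of only printing them, makes it possible to count or retry them later.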


In [7]:
# Gather prediction for each article
# It's recommended to batch 50 revisions in each request 

rev_id_list = [line[2] for line in wiki_data] # The header row was already removed when loading

prediction_data = {}

fifty_ids = []
for counter, rev_id in enumerate(rev_id_list):  
    if counter % 50 == 0 and counter != 0:
        fifty_predictions = get_ores_data(fifty_ids, headers)
        fifty_predictions = fifty_predictions['enwiki']['scores'].items()
        for key, value in fifty_predictions:            
            try: 
                prediction_data[key] = value['wp10']['score']['prediction']
            except KeyError:
                print('Could not find revision ' + str(key))
        fifty_ids = []
    fifty_ids.append(rev_id)

# Gather predictions for the remaining revisions (the final, partial batch)
fifty_predictions = get_ores_data(fifty_ids, headers)
fifty_predictions = fifty_predictions['enwiki']['scores'].items()
for key, value in fifty_predictions:            
    try: 
        prediction_data[key] = value['wp10']['score']['prediction']
    except KeyError:
        print('Could not find revision ' + str(key))


Could not find revision 806811023
Could not find revision 807367030
Could not find revision 807367166
Could not find revision 807484325
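The batching in the loop above can also be written as a small generator that yields slices of at most 50 ids, so the final partial batch needs no separate handling. This is a sketch of the general technique, not the code that produced the output above; the helper name `chunked` is hypothetical.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Seven items in batches of three: the last batch holds the remainder.
batches = list(chunked(list(range(7)), 3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

Each batch could then be passed to `get_ores_data` in turn, with one shared block of code handling the response.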


We save the prediction data to local disk with pickle so that the article quality predictions can be reused after the notebook instance restarts, without calling the API again.

In [8]:
# Save the prediction data for later use
with open('prediction_data.p', 'wb') as f:
    pickle.dump(prediction_data, f)


In [9]:
# Load the saved prediction data
with open('prediction_data.p', 'rb') as f:
    prediction_data = pickle.load(f)
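The save and load steps can be combined into a "load if cached, otherwise compute and save" helper. The sketch below is one possible arrangement under stated assumptions: `load_or_compute` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the real API calls so that the example runs offline.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute):
    """Return the pickled object at `path`, creating it with `compute()` if absent."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result


calls = []


def fake_fetch():
    # Stand-in for the expensive API requests; records how often it runs.
    calls.append(1)
    return {"1001": "Stub"}


cache_path = os.path.join(tempfile.mkdtemp(), "prediction_data.p")
first = load_or_compute(cache_path, fake_fetch)
second = load_or_compute(cache_path, fake_fetch)
print(first == second, len(calls))  # True 1
```

The second call reads the pickle file instead of fetching again, which is the behavior we want after a restart.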


## Combining the datasets

Now we merge the two datasets we just collected into one dataframe, matching on country name. Notice that country names in the Wikipedia articles dataset may not match those in the population dataset. Rows whose country name has no match are ignored. Finally, we save the table to a CSV file.

In [10]:
# Combine the population data and article prediction data

combined_data = []
for key, value in country_article.items():
    country = value[0]
    article_name = value[1]
    revision_id = key
    if revision_id in prediction_data:
        article_quality = prediction_data[revision_id]
        if country in population_data:
            population = population_data[country]
            line = [country, article_name, revision_id, article_quality, population]
            combined_data.append(line)

df = pandas.DataFrame(combined_data, columns=['country', 'article_name', 'revision_id', 'article_quality', 'population'])

# Write combined data to CSV
df.to_csv('en_wikipedia_article_quality.csv', sep=',', index=False)

# Take a look at the combined dataset
df

| | country | article_name | revision_id | article_quality | population |
|---|---|---|---|---|---|
| 0 | Chad | Bir I of Kanem | 355319463 | Stub | 13707000 |
| 1 | Zimbabwe | Template:Zimbabwe-politician-stub | 391862046 | Stub | 17354000 |
| 2 | Uganda | Template:Uganda-politician-stub | 391862070 | Stub | 40141000 |
| 3 | Namibia | Template:Namibia-politician-stub | 391862409 | Stub | 2482100 |
| 4 | Nigeria | Template:Nigeria-politician-stub | 391862819 | Stub | 181839400 |
| 5 | Colombia | Template:Colombia-politician-stub | 391863340 | Stub | 48218000 |
| 6 | Chile | Template:Chile-politician-stub | 391863361 | Stub | 18025000 |
| 7 | Fiji | Template:Fiji-politician-stub | 391863617 | Stub | 867000 |
| 8 | Solomon Islands | Template:Solomons-politician-stub | 391863809 | Stub | 641900 |
| 9 | Palestinian Territory | Information Minister of the Palestinian Nation... | 393276188 | Stub | 4481195 |
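Because rows with unmatched country names are dropped silently, it can help to list which names fail to match before trusting the merged table. The sketch below assumes the same dictionary shapes used in this notebook (revision id mapped to a `(country, article)` tuple, and country mapped to population); the function name `unmatched_countries` and the sample entries are hypothetical.

```python
def unmatched_countries(country_article, population_data):
    """Return the sorted country names that have articles but no population entry."""
    article_countries = {country for country, _ in country_article.values()}
    return sorted(article_countries - set(population_data))


# Made-up sample: 'Atlantis' has an article but no population entry.
sample_articles = {
    "1001": ("Chad", "Example A"),
    "1002": ("Fiji", "Example B"),
    "1003": ("Atlantis", "Example C"),
}
sample_population = {"Chad": 13707000, "Fiji": 867000}

print(unmatched_countries(sample_articles, sample_population))  # ['Atlantis']
```

A list like this shows whether a dropped name is a genuine gap in the population data or just a spelling difference between the two sources.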


## Analyze the bias in data

In this section, we analyze potential bias in the data we just collected through some quantitative analysis. We calculate and consider two ratios:
1. Number of politician articles to country population
2. Number of 'high quality' articles (articles with a prediction of either 'FA' or 'GA') to total number of articles

## Calculate the ratio of number of politician articles to country population

In [11]:
articles_to_population = pandas.DataFrame(
    df.groupby('country').size() / df.groupby('country')['population'].max() * 100
    , columns=['Number of articles to population (in percentage)'])
articles_to_population = articles_to_population.sort_values(by='Number of articles to population (in percentage)'
                                                                   , ascending=False)


### Top 10 countries in terms of number of politician articles as a proportion of country population

In [12]:
articles_to_population.head(10)

| country | Number of articles to population (in percentage) |
|---|---|
| Nauru | 0.488029 |
| Tuvalu | 0.466102 |
| San Marino | 0.248485 |
| Monaco | 0.10502 |
| Liechtenstein | 0.077189 |
| Marshall Islands | 0.067273 |
| Iceland | 0.062268 |
| Tonga | 0.060987 |
| Andorra | 0.04359 |
| Federated States of Micronesia | 0.036893 |


### Bottom 10 countries in terms of number of politician articles as a proportion of country population

In [13]:
articles_to_population.tail(10)

| country | Number of articles to population (in percentage) |
|---|---|
| Bangladesh | 0.000202 |
| Congo, Dem. Rep. of | 0.000194 |
| Thailand | 0.000172 |
| Zambia | 0.000162 |
| Korea, North | 0.000156 |
| Ethiopia | 0.000107 |
| Uzbekistan | 0.000093 |
| Indonesia | 0.000084 |
| China | 0.000083 |
| India | 0.000075 |


## Calculate the ratio of number of 'high quality' articles to total number of articles

In [14]:
gafa_to_articles = pandas.DataFrame(
    df[(df.article_quality == 'FA') | (df.article_quality == 'GA')].groupby('country').size() / df.groupby('country').size() * 100
    , columns=['Number of \'high qualities\' articles to number of articles (in percentage)'])
gafa_to_articles = gafa_to_articles.fillna(0).sort_values(
    by='Number of \'high qualities\' articles to number of articles (in percentage)'
                    , ascending=False)
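The `fillna(0)` step matters because a country with no FA or GA articles is absent from the numerator series, and pandas aligns the two series on their index when dividing, which leaves `NaN` for that country. The sketch below shows the effect on a made-up three-row dataframe; it mirrors the calculation above on toy data and assumes pandas is installed.

```python
import pandas

# Made-up sample: Chad has one GA article, Fiji has none.
toy = pandas.DataFrame({
    "country": ["Chad", "Chad", "Fiji"],
    "article_quality": ["GA", "Stub", "Stub"],
})

high = toy[toy.article_quality.isin(["FA", "GA"])].groupby("country").size()
total = toy.groupby("country").size()

# Division aligns on the country index: Fiji is missing from `high`,
# so its ratio is NaN until it is filled with 0.
ratio = high / total * 100
print(ratio.isna()["Fiji"])  # True

ratio = ratio.fillna(0)
print(ratio.to_dict())  # {'Chad': 50.0, 'Fiji': 0.0}
```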


### Top 10 countries in terms of GA and FA-quality articles as a proportion of all articles about politicians from that country

In [15]:
gafa_to_articles.head(10)

| country | Number of 'high qualities' articles to number of articles (in percentage) |
|---|---|
| Korea, North | 23.076923 |
| Saudi Arabia | 11.764706 |
| Uzbekistan | 10.344828 |
| Central African Republic | 10.294118 |
| Romania | 9.770115 |
| Guinea-Bissau | 9.52381 |
| Bhutan | 9.090909 |
| Vietnam | 8.376963 |
| Dominica | 8.333333 |
| Mauritania | 7.692308 |


### Bottom 10 countries in terms of GA and FA-quality articles as a proportion of all articles about politicians from that country

In [16]:
gafa_to_articles.tail(10)

| country | Number of 'high qualities' articles to number of articles (in percentage) |
|---|---|
| Bahrain | 0.0 |
| Seychelles | 0.0 |
| Malta | 0.0 |
| Bahamas | 0.0 |
| Cape Verde | 0.0 |
| Solomon Islands | 0.0 |
| Guadeloupe | 0.0 |
| French Guiana | 0.0 |
| Finland | 0.0 |
| Moldova | 0.0 |


## Analysis and Write-Up


### What have I learned in this project?
  
This project gave me an opportunity to practice basic data preprocessing skills in Python, such as loading CSV files, constructing dictionaries, and splitting a dataset into batches, as well as using pandas dataframes to combine datasets and perform aggregations and calculations. I also became more familiar with calling the Wikimedia ORES API from Python, and I learned that passing multiple revision ids in each call makes the API usage more efficient. It was interesting to discover that the API does not always return the desired result and sometimes returns an error instead; I spent a large amount of time debugging this issue. In fact, several days ago only two revision ids lacked an article quality prediction, but today two more revision ids have no prediction.

### What did I find from the calculations?
  
For the first ratio, the number of politician articles relative to country population, the top-ranked countries are relatively small in both territory and population. Many people still write about the politicians of those countries, which I think is because small countries tend to have more political and domestic events and changes.

In contrast, the bottom-ranked countries have far lower coverage of politician articles on English Wikipedia than the top-ranked countries (Nauru's percentage is roughly 6,500 times India's). There are two cases. For countries like India and China, even though the number of politician articles on English Wikipedia is not small, the coverage percentage is still very small because the populations of those countries (the denominators in the calculations) are extremely large. For North Korea, on the other hand, the population is not especially large, but the number of politician articles on English Wikipedia is relatively small. I attribute this to the barrier between North Korea's Internet and the worldwide Internet, which leaves fewer people able to publish about, or even learn about, North Korea's politics.

For the second ratio, the number of 'high quality' articles (FA or GA) relative to the total number of articles, North Korea ranks first. I think there is some bias in this result: articles about North Korea's politicians may not actually be written with very good quality, but because such articles are so rare, the algorithm may somehow treat them as especially valuable and mark them as featured or good articles.