# Bias in Wikipedia
## Francisco Javier Salido Magos
Human Centered Data Science, DATA 512, Fall 2018.
M.S. Data Science,
University of Washington.

Homework A2, version 1.0

This code is made available for re-use under an MIT license (https://opensource.org/licenses/MIT).

In [16]:
import requests
import json
import pandas as pd
import numpy as np

# Step 1: Data Acquisition
## Function that prepares the environment to extract per-article quality data using the REST API
In this function we first declare the endpoint for the API. This is the actual "location" in Wikimedia to which the extraction requests will be sent via API calls.

Next we set the parameters for data extraction through API calls.

In [17]:
def get_ores_data(revision_ids, headers):
    
    # Define the endpoint
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
    
    # Specify the parameters - smushing all the revision IDs together separated by | marks. 
    params = {'project' : 'enwiki',
              'model'   : 'wp10',
              'revids'  : '|'.join(str(x) for x in revision_ids)
              }
    # Send the request, passing along the identifying headers supplied by the caller
    api_call = requests.get(endpoint.format(**params), headers=headers)
    response = api_call.json()
    return response

# Headers with personal data that will be relayed with the API calls, for reference purposes.
headers = {
    'User-Agent': 'https://github.com/PrivacyEngineer',
    'From': 'javiers@uw.edu'
}
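
To make the request format concrete, the sketch below builds the URL that get_ores_data() would send for a small batch, without contacting the API. The helper name build_ores_url and the three revision ids are illustrative assumptions and are not part of the original notebook.

```python
# Endpoint template copied from get_ores_data() above.
ENDPOINT = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'

def build_ores_url(revision_ids, project='enwiki', model='wp10'):
    """Hypothetical helper: return the request URL for one batch of revision ids."""
    # Revision ids are joined with '|' exactly as in get_ores_data().
    revids = '|'.join(str(x) for x in revision_ids)
    return ENDPOINT.format(project=project, model=model, revids=revids)

# Made-up revision ids, used only to show the shape of the URL.
example_url = build_ores_url([111, 222, 333])
print(example_url)
```

Printing the URL before sending it is a cheap way to confirm that the batch of ids has been joined as intended.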

## Function that issues calls to download batches of data
The purpose of this function is to extract the quality rating for each article by making a call to the API. Per the API's instructions, we should not request information on more than 50 articles at a time. This function *does not* break requests down into appropriately sized batches; that is done before calling this function.

First, the function takes the list of article ids that have been provided, issues the request and stores the resulting JSON-formatted data into ores_raw.

In order to obtain the specific data point we are interested in, the rating for each article, we use a for loop to extract article-specific data for each id from the JSON data. The try/except instructions are there to deal with the errors that will result from article ids that cannot be found in the JSON data (rating of zero is assigned to these cases). Note that for the first iteration, flag = 0 which means we have to create the dataframe for the results, and that dataframe will contain column names as well as data. The dataframe with results is named ores_frame.

In the final step we drop the probabilities the API has assigned to each article, keeping only the rating, and reset the index of ores_frame before returning it to the caller. The ratings are returned in the same order as the article ids that were passed in.

In [18]:
def extract_quality(list_ids):
    # Call the API and get the JSON data for this batch of ids
    ores_raw = get_ores_data(list_ids, headers)

    # A value flag = 0 marks the first iteration through the for loop, when ores_frame has to be
    # created. Once flag = 1, we know the remaining iterations are just appends to the original ores_frame.
    flag = 0
    
    # The for loop below extracts the score for each ID and stores it in the dataframe "ores_frame".
    # Row 0 of each frame holds the column names ('prediction', 'probability'), row 1 holds the data.
    for j in list_ids:
        try:
            temp = pd.DataFrame(list(ores_raw['enwiki']['scores'][str(j)]['wp10']['score'].items())).T
        except (KeyError, TypeError):
            # No score could be found for this ID, so a rating of zero is assigned
            temp = pd.DataFrame([['prediction','probability'],['0','0']])
        if flag == 0:
            flag = 1
            ores_frame = temp
        else:
            ores_frame = pd.concat([ores_frame, temp.loc[[1],:]])
    
    # Separates column names from data in the dataframe
    column_names = list(ores_frame.loc[0,:])
    ores_frame = ores_frame.loc[1:ores_frame.shape[0],:].copy()
    ores_frame.columns = column_names
    
    # Drop the probabilities, keeping only the rating, and reset the dataframe index
    ores_frame.drop('probability',axis=1,inplace=True)
    ores_frame.reset_index(drop = True, inplace = True)
    return ores_frame
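
The lookup that extract_quality() performs on the JSON data can be tried offline. The sketch below uses a hand-written dictionary that mimics the nesting the function indexes into ('enwiki' > 'scores' > id > 'wp10' > 'score' > 'prediction'). The sample values, the shape of the 'error' entry, and the helper name predictions_from_response are assumptions made for illustration only.

```python
# Hypothetical response, shaped like the JSON that extract_quality() indexes into.
sample_response = {
    'enwiki': {
        'scores': {
            '111': {'wp10': {'score': {'prediction': 'Stub',
                                       'probability': {'Stub': 0.9, 'GA': 0.1}}}},
            # Assumed shape for a revision that could not be scored: no 'score' key.
            '222': {'wp10': {'error': {'message': 'revision not found'}}},
        }
    }
}

def predictions_from_response(response, revision_ids):
    """Return one rating per revision id, using '0' when no score is present."""
    scores = response.get('enwiki', {}).get('scores', {})
    ratings = []
    for rev_id in revision_ids:
        try:
            ratings.append(scores[str(rev_id)]['wp10']['score']['prediction'])
        except KeyError:
            # Same fallback as in the notebook: a rating of zero.
            ratings.append('0')
    return ratings

ratings = predictions_from_response(sample_response, [111, 222, 333])
print(ratings)
```

Here id 111 has a score, id 222 has an entry without a score, and id 333 is missing altogether; the last two both fall back to '0'.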

## Data extraction step

The program reads the table that contains information on the 47k+ articles that we are interested in and extracts the relevant list of article ids.

Next is a for loop that breaks the list of article ids into batches of 50 and issues calls to extract_quality() to get the individual rating for each article. The ratings are stored in the dataframe "articles" in a column named "prediction", and the columns in "articles" are then renamed and reordered so that they fit the format specified by Homework A2.

In [19]:
# Read csv file that contains the list of articles and related information.
articles = pd.read_csv('c:/Users/Castor18/OneDrive/MSDS_UW/Data512_Aut18/Session 4/Hmwk_A2_Salido/data-512-a2/raw_data/page_data.csv')

# Extracts the list of article ids we are interested in.
list_ids = list(articles.rev_id)

# This for loop breaks the list of ids down into chunks of 50 and issues one API call for each chunk.
# Stepping through the start positions avoids issuing a call with an empty chunk when the number
# of ids is an exact multiple of 50. Results are combined into the quality_list dataframe.
batch_results = []
for x in range(0, len(list_ids), 50):
    list_short = list_ids[x:x+50]
    batch_results.append(extract_quality(list_short))
quality_list = pd.concat(batch_results)
        
# quality_list is re-indexed and appended to "articles"
quality_list.reset_index(drop = True, inplace = True)
articles['prediction'] = quality_list

# Contents of the "articles" dataframe are reorg'd and columns are renamed.
articles = articles[['country','page','rev_id','prediction']]
articles.columns = ['country','article_name','revision_id','article_quality']
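
The batching step can be checked on its own, without calling the API. The following is a minimal sketch of splitting a list into chunks of at most 50 elements; the generator name batches and the made-up ids are assumptions for illustration.

```python
def batches(items, size=50):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Made-up ids standing in for the real revision ids.
fake_ids = list(range(120))
batch_sizes = [len(b) for b in batches(fake_ids)]
print(batch_sizes)
```

With 120 ids this produces two full batches and one partial batch, and an empty list produces no batches at all.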

## Extracting rest of data we need and cleaning data. First attempt at merging article and population data
This section of code reads a second csv file that contains population figures for each country, and merges that with the "articles" dataframe.

Here we note a large number of countries that show no population after the merge, including some relatively large countries that have been around for over a hundred years each. This led us to inspect the data in the "articles" dataframe, where we discovered mistakes in country names ("Hondura" instead of "Honduras"), names that do not match the list of countries we have for populations ("Czech Republic" instead of "Czechia"), typos (extra commas or blank spaces in names), demonyms used instead of country names ("Salvadoran" instead of "El Salvador"), and pre-Hispanic demonyms that do not align with current political boundaries ("Incan"). We also found references to federated states instead of country names (Dagestan, which is a federated state of Russia, or Greenland, which is part of the Kingdom of Denmark).

In [20]:
# Read the country/population data
demographic = pd.read_csv('c:/Users/Castor18/OneDrive/MSDS_UW/Data512_Aut18/Session 4/Hmwk_A2_Salido/data-512-a2/raw_data/WPDS_2018_data.csv',thousands = ',')
demographic.columns = ['country','population']

# Merge population data with the articles dataset
articles2 = pd.merge(demographic, articles, how='right', on='country')
left_out_countries = articles2[articles2.population.isnull()]['country']

# Discovered a number of mistakes with country names in the articles dataset
left_out_countries = pd.DataFrame(left_out_countries.unique())
left_out_countries

Unnamed: 0,0
0,Palestinian Territory
1,Hondura
2,Czech Republic
3,Salvadoran
4,Saint Kitts and Nevis
5,Palauan
6,French Guiana
7,Ivorian
8,Saint Vincent and the Grenadines
9,Rhodesian
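
One way to list the unmatched names directly is the indicator option of pd.merge, which records where each row came from. This is a sketch on toy data, not the method used in the notebook; the two small dataframes and their population values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the "demographic" and "articles" dataframes (made-up rows).
demo = pd.DataFrame({'country': ['Honduras', 'Czechia'], 'population': [9.0, 10.6]})
arts = pd.DataFrame({'country': ['Hondura', 'Czechia', 'Czech Republic'],
                     'article_name': ['A', 'B', 'C']})

# indicator=True adds a "_merge" column with values 'both', 'left_only' or 'right_only'.
merged = pd.merge(demo, arts, how='right', on='country', indicator=True)

# Rows marked 'right_only' are articles whose country has no population entry.
unmatched = sorted(merged.loc[merged['_merge'] == 'right_only', 'country'].unique())
print(unmatched)
```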


## Addressing inconsistencies and incorrect names

To address the above problems we created the translation table shown below. 
In addition to correcting spelling mistakes and variations and changing demonyms to country names, we had to make some decisions. For federated states we used the name of the federation if the state is physically close to the rest of the federation (Chechnya and Dagestan as part of the Russian Federation), but kept the name of the federated state if they are far apart (Guadeloupe and France, or Greenland and Denmark). We did not wade into territorial disputes: states not recognized by the U.N. as independent (South Ossetia) were left under their own names. We merged Incan with Peru. Note also that we did not change Rhodesia, a colonial-era name, into a current country name because Rhodesia was broken down into Zambia and Zimbabwe, so by changing the name we would run the risk of choosing the wrong country.

In [21]:
correct_map = [
['Hondura','Honduras'],
['Czech Republic','Czechia'],
['Salvadoran','El Salvador'],
['Saint Kitts and Nevis','St. Kitts-Nevis'],
['Palauan','Palau'],
['Ivorian','Cote d\'Ivoire'],
['Saint Vincent and the Grenadines','St. Vincent and the Grenadines'],
['Rhodesian','Rhodesia'],
['Omani','Oman'],
['Congo Dem. Rep. of','Congo Dem. Rep.'],
['Congo  Dem. Rep. of','Congo Dem. Rep.'],
['Niuean','Niue'],
['East Timorese','East Timor'],
['Korea  South','Korea South'],
['Faroese','Denmark'],
['Cape Colony','South Africa'],
['South Korean','Korea South'],
['Samoan','Samoa'],
['Montserratian','Montserrat'],
['Saint Lucian','Saint Lucia'],
['South African Republic','South Africa'],
['Incan','Peru'],
['Chechen','Russia'],
['Korea  North','Korea North'],
['South Ossetian','South Ossetia'],
['Cook Island','Cook Islands'],
['Tokelauan','Tokelau'],
['Dagestani','Russia'],
['Greenlandic','Greenland']]
correct_map = pd.DataFrame(correct_map)
correct_map.columns = ['original','countries']
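
The same kind of translation can be expressed as a dictionary lookup. The sketch below first collapses repeated blank spaces, which covers cases such as "Korea  South", and then applies a correction if one exists. The helper name fix_country is hypothetical, and only three entries of the table are copied here.

```python
# A few entries copied from correct_map above; the helper itself is hypothetical.
corrections = {
    'Hondura': 'Honduras',
    'Czech Republic': 'Czechia',
    'Salvadoran': 'El Salvador',
}

def fix_country(name, mapping=corrections):
    """Collapse repeated spaces, then translate the name if a correction exists."""
    cleaned = ' '.join(name.split())
    return mapping.get(cleaned, cleaned)

fixed = [fix_country(n) for n in ['Hondura', 'Korea  South', 'Peru']]
print(fixed)
```

Names without an entry in the mapping, such as "Peru", pass through unchanged.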

# Step 2: Data Processing
## Second attempt at merging country populations into the articles dataset
Here we use the set of corrections for country names stored in the dataframe correct_map. We replace country names in the articles dataset with the correct ones and then merge the article data with the population data.

Columns in the resulting dataframe "articles2" are reorganized to match the format requested in the Homework A2 statement (columns are organized in the following order: country, article_name, revision_id, article_quality, population), and the results are stored in ../results/results.csv 

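Note that pandas issues a SettingWithCopyWarning when running this section of code, because the for loop assigns values through chained indexing (articles2['country'][...] = ...). We reviewed the output and it seemed OK.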

In [22]:
articles2 = articles

# This for loop assigns the correct country name to each row in the dataframe articles2
for j in range(0,correct_map.shape[0]):
    articles2['country'][articles2['country'] == correct_map.loc[j][0]] = correct_map.loc[j][1]
    
# Merging article and population data into a single dataframe
articles2 = pd.merge(demographic, articles2, how='right', on='country')

# Reorganize articles2 dataframe columns and store the result in a csv file
articles2 = articles2[['country','article_name','revision_id','article_quality','population']]
articles2.to_csv('c:/Users/Castor18/OneDrive/MSDS_UW/Data512_Aut18/Session 4/Hmwk_A2_Salido/data-512-a2/results/results.csv')
articles2.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """


Unnamed: 0,country,article_name,revision_id,article_quality,population
0,Algeria,Template:Algeria-politician-stub,544347736,Stub,42.7
1,Algeria,Template:Algeria-diplomat-stub,567620838,Stub,42.7
2,Algeria,Template:AlgerianPres,665948270,Stub,42.7
3,Algeria,Ali Fawzi Rebaine,686269631,Stub,42.7
4,Algeria,Ahmed Attaf,705910185,Stub,42.7
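
The warning shown above is raised by pandas when a value is assigned through chained indexing. A sketch of an alternative that avoids it is to work on an explicit copy and let Series.replace apply all corrections at once. The toy dataframe and the two-entry mapping below are made up for illustration, and this is not the code that produced the results in this notebook.

```python
import pandas as pd

# Toy stand-in for the "articles" dataframe (made-up rows).
toy = pd.DataFrame({'country': ['Hondura', 'Peru', 'Incan'],
                    'article_name': ['A', 'B', 'C']})

# Corrections expressed as original -> corrected name.
mapping = {'Hondura': 'Honduras', 'Incan': 'Peru'}

# Work on an explicit copy so the original dataframe is left untouched.
fixed = toy.copy()
fixed['country'] = fixed['country'].replace(mapping)
print(fixed)
```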


# Step 3: Analysis
## Computing proportions: Articles per million people and % of FA and GA articles per country
This section builds a list of the unique country names found in articles2, the dataframe that contains all article data. Note that this list, stored in a dataframe called freq_articles, is different from the one in correct_map. freq_articles is merged with demographic so that it also contains the population of each country. We then use a for loop to count the articles associated with each country and store the counts in a numpy vector called num_articles, which is appended to freq_articles. The frequency of articles per country is computed by dividing num_articles by population for each country. Because the population figures are expressed in millions, this is the frequency of *ARTICLES PER MILLION PEOPLE*; a computation per person would result in numbers that are too small.

We then create a separate dataframe called articles_with_FAGA that contains only those articles that were rated either FA (Featured Article) or GA (Good Article) and compute the number of articles of these ratings per country. Results are stored also in freq_articles. 

In [23]:
# Build list of unique country names in articles2 dataframe
unique_country = articles2.country.unique()
freq_articles = pd.DataFrame([unique_country]).T
freq_articles.columns = ['country']

# Join list of unique country names with demographic dataframe to get the population of each country.
freq_articles = pd.merge(freq_articles, demographic, how='right', on='country')

# Build numpy vector to store article counts, one entry per row of freq_articles.
num_articles = np.zeros(freq_articles.shape[0])

# This for loop will compute the article count for each country and store each total in num_articles
for j in range(0,freq_articles.shape[0]):
    temp = articles2['country'][articles2['country'] == freq_articles.loc[j][0]]
    num_articles[j] = len(temp)
    
# We append the column with the per-country article counts to the dataframe with country names and populations
freq_articles = freq_articles.join(pd.DataFrame(num_articles))
freq_articles.columns = ['country','population','num_articles']

# Compute proportion of articles per million people for each country and include in freq_articles dataframe
freq_articles['proportion_per_million'] = freq_articles.num_articles/freq_articles.population

# Create new dataframe that contains information on articles rated FA and GA only.
articles_with_FAGA = articles2[articles2.article_quality.isin(['FA','GA'])]
num_articles = np.zeros(freq_articles.shape[0])

# For loop computes FA and GA article count for each country
for j in range(0,freq_articles.shape[0]):
    temp = articles_with_FAGA['country'][articles_with_FAGA['country'] == freq_articles.loc[j][0]]
    num_articles[j] = len(temp)
    
# We append the column with the per-country FA and GA article counts to the dataframe with country names and populations
freq_articles = freq_articles.join(pd.DataFrame(num_articles))
freq_articles.columns = ['country','population','num_articles','proportion_per_million','num_high_quality_articles']

# Compute FA and GA articles as a proportion of total articles per country.
freq_articles['prop_high_quality'] = freq_articles.num_high_quality_articles/freq_articles.num_articles
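
The per-country counts and the two proportions can also be computed with a single groupby. The following is a minimal sketch on made-up rows that reuses the column names of articles2 and freq_articles; the population column is assumed to be in millions, as in the data above.

```python
import pandas as pd

# Made-up rows standing in for articles2; population is in millions.
toy = pd.DataFrame({
    'country': ['A', 'A', 'A', 'A', 'B', 'B'],
    'article_quality': ['FA', 'Stub', 'GA', 'Start', 'Stub', 'C'],
    'population': [2.0, 2.0, 2.0, 2.0, 0.5, 0.5],
})

# Flag articles rated FA or GA, then aggregate once per country.
toy['high_quality'] = toy['article_quality'].isin(['FA', 'GA'])
summary = toy.groupby('country').agg(
    population=('population', 'first'),
    num_articles=('article_quality', 'count'),
    num_high_quality_articles=('high_quality', 'sum'),
).reset_index()

# Articles per million people, and FA/GA articles as a share of all articles.
summary['proportion_per_million'] = summary['num_articles'] / summary['population']
summary['prop_high_quality'] = summary['num_high_quality_articles'] / summary['num_articles']
print(summary)
```

In this toy data country A has 4 articles for 2 million people, or 2 articles per million, and half of them are rated FA or GA.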

## Building tables
We now wish to use the contents of the dataframe freq_articles from the previous step to build the four tables required by Homework A2:

- 10 highest-ranked countries in terms of politician articles as proportion of country population.
- 10 lowest-ranked countries in terms of politician articles as proportion of country population.
- 10 highest-ranked countries in terms of number of GA and FA-quality articles as proportion of all articles about politicians from that country.
- 10 lowest-ranked countries in terms of number of GA and FA-quality articles as proportion of all articles about politicians from that country.

To do that we first remove from the list all countries for which we either don't have population numbers or have no articles in the dataset. The resulting dataframe is called ranked_countries.

Next we sort ranked_countries and take the top ten and bottom ten countries as measured by proportion of articles per country population.

A second sorting of ranked_countries, this time by proportion of high-quality articles, allows us to find the top and bottom ten from this point of view. Note that the bottom ten countries all have zero FA and GA articles. So we remove all countries with zero FA and GA articles, 26 countries total, and look for the bottom 10 that have a non-zero rate of FA and GA articles against total.

In [24]:
# Create dataframe of ranked countries. That is, those for which we have population data and a non-zero number of articles.
ranked_countries = freq_articles[freq_articles.population > 0]
ranked_countries = ranked_countries[ranked_countries.num_articles > 0]

# Drop a column we don't need.
ranked_countries = ranked_countries.drop(columns='num_high_quality_articles')

# Sort ranked countries so that we can pick top and bottom ten by proportion of articles per million people.
HR_prop_per_million = ranked_countries.sort_values('proportion_per_million')
BR_prop_per_million = HR_prop_per_million.head(10)
BR_prop_per_million = BR_prop_per_million.drop(columns=['num_articles','prop_high_quality'])
BR_prop_per_million.reset_index(drop = True, inplace = True)
HR_prop_per_million = HR_prop_per_million.tail(10).sort_values('proportion_per_million',ascending = False)
HR_prop_per_million = HR_prop_per_million.drop(columns=['num_articles','prop_high_quality'])
HR_prop_per_million.reset_index(drop = True, inplace = True)

# Sort ranked countries so that we can pick top and bottom ten by proportion of FA and GA articles per total articles.
HR_prop_high_quality = ranked_countries.sort_values('prop_high_quality')
BR_prop_high_quality = HR_prop_high_quality.head(10)
BR_prop_high_quality = BR_prop_high_quality.drop(columns=['population','proportion_per_million'])
BR_prop_high_quality.reset_index(drop = True, inplace = True)
HR_prop_high_quality = HR_prop_high_quality.tail(10).sort_values('prop_high_quality',ascending = False)
HR_prop_high_quality = HR_prop_high_quality.drop(columns=['population','proportion_per_million'])
HR_prop_high_quality.reset_index(drop = True, inplace = True)

# Remove all ranked countries that have zero FA and GA articles (list is shown), and recompute bottom ten FA and GA per total.
BR_prop_high_quality2 = ranked_countries.sort_values('prop_high_quality')
BR_prop_high_quality0 = BR_prop_high_quality2[BR_prop_high_quality2.prop_high_quality == 0]
BR_prop_high_quality0 = BR_prop_high_quality0.drop(columns=['population','proportion_per_million'])
BR_prop_high_quality0.reset_index(drop = True, inplace = True)
BR_prop_high_quality2 = BR_prop_high_quality2[BR_prop_high_quality2.prop_high_quality > 0]
BR_prop_high_quality2 = BR_prop_high_quality2.sort_values('prop_high_quality').head(10)
BR_prop_high_quality2 = BR_prop_high_quality2.drop(columns=['population','proportion_per_million'])
BR_prop_high_quality2.reset_index(drop = True, inplace = True)
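
For reference, pandas also offers nlargest and nsmallest, which select the top and bottom rows by a column without a separate sort. The sketch below applies them to a made-up four-row table and picks two rows instead of ten; it illustrates the idea and is not the code used to build the tables that follow.

```python
import pandas as pd

# Made-up proportions for four placeholder countries.
toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'prop_high_quality': [0.10, 0.0, 0.02, 0.05],
})

# Highest-ranked rows, already ordered from largest to smallest.
top2 = toy.nlargest(2, 'prop_high_quality').reset_index(drop=True)

# Lowest-ranked rows among those with a non-zero proportion.
nonzero = toy[toy['prop_high_quality'] > 0]
bottom2 = nonzero.nsmallest(2, 'prop_high_quality').reset_index(drop=True)

print(top2)
print(bottom2)
```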

#### 10 Highest-ranked countries by politician articles as proportion of population:

In [25]:
HR_prop_per_million

Unnamed: 0,country,population,proportion_per_million
0,Tuvalu,0.01,5500.0
1,Nauru,0.01,5300.0
2,San Marino,0.03,2733.333333
3,Palau,0.02,1150.0
4,Monaco,0.04,1000.0
5,Liechtenstein,0.04,725.0
6,St. Kitts-Nevis,0.05,640.0
7,Tonga,0.1,630.0
8,Marshall Islands,0.06,616.666667
9,Iceland,0.4,515.0


#### 10 Lowest-ranked countries by politician articles as proportion of population

In [26]:
BR_prop_per_million

Unnamed: 0,country,population,proportion_per_million
0,India,1371.3,0.721943
1,Indonesia,265.2,0.810709
2,China,1393.8,0.816473
3,Uzbekistan,32.9,0.881459
4,Ethiopia,107.5,0.976744
5,Zambia,17.7,1.468927
6,Thailand,66.2,1.691843
7,Bangladesh,166.4,1.947115
8,Mozambique,30.5,1.967213
9,Vietnam,94.7,2.016895


#### 10 Highest-ranked countries by GA and FA quality articles as proportion of total articles

In [27]:
HR_prop_high_quality

Unnamed: 0,country,num_articles,prop_high_quality
0,Saudi Arabia,119.0,0.134454
1,Central African Republic,68.0,0.117647
2,Romania,348.0,0.114943
3,Mauritania,52.0,0.096154
4,Tuvalu,55.0,0.090909
5,Bhutan,33.0,0.090909
6,Dominica,12.0,0.083333
7,United States,1098.0,0.074681
8,Benin,94.0,0.074468
9,Vietnam,191.0,0.068063


#### There actually are 25 countries where the proportion of GA and FA is zero:

In [28]:
BR_prop_high_quality0

Unnamed: 0,country,num_articles,prop_high_quality
0,Moldova,426.0,0.0
1,Finland,572.0,0.0
2,Mozambique,60.0,0.0
3,Seychelles,22.0,0.0
4,Macedonia,65.0,0.0
5,Uganda,188.0,0.0
6,Zambia,26.0,0.0
7,Angola,110.0,0.0
8,Sao Tome and Principe,22.0,0.0
9,Andorra,34.0,0.0


#### If we exclude those and instead focus on those for which the proportion of GA and FA is greater than zero:

In [29]:
BR_prop_high_quality2

Unnamed: 0,country,num_articles,prop_high_quality
0,Tanzania,408.0,0.002451
1,Peru,361.0,0.00277
2,Czechia,254.0,0.003937
3,Lithuania,248.0,0.004032
4,Nigeria,684.0,0.004386
5,Morocco,208.0,0.004808
6,Fiji,199.0,0.005025
7,Bolivia,187.0,0.005348
8,Brazil,556.0,0.005396
9,Luxembourg,180.0,0.005556
