# A2 - Bias

By: Benjamin Brodeur Mathieu  
Date: 10/05/2019

## Overview

The goal of this assignment is to reflect on sources of bias by analyzing the coverage and relative quality of politician articles from the English Wikipedia, by country and geographical region.

## Step 1: Data acquisition

The data for this analysis comes from:

1. [The Wikipedia politicians by country dataset](https://figshare.com/articles/Untitled_Item/5513449)
2. [Population Reference Bureau, mid-2018 population by country](https://www.prb.org/international/indicator/population/table/)
   
and is located in the `raw_data` folder. See the repository's README.md file for additional details.

## Step 2: Cleaning the data

First we will import a few libraries needed for our analysis.

The `pandas` library will be used for loading and manipulating the data.
> `pandas` uses the `numpy` library behind the scenes to handle multidimensional arrays efficiently. We will import this library as well to help with specific manipulations later on.

In [316]:
import pandas as pd
import numpy as np

We load the data csv files from the `raw_data` folder and output the first few rows of each to make sure they were loaded correctly.

In [317]:
politicians_by_country = pd.read_csv('../raw_data/page_data.csv')
politicians_by_country.head(2)

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463


In [318]:
population_by_geography = pd.read_csv('../raw_data/WPDS_2018_data.csv', thousands=',')
population_by_geography.head(2)

Unnamed: 0,Geography,Population mid-2018 (millions)
0,AFRICA,1284.0
1,Algeria,42.7


To simplify the use of the `population_by_geography` table we will rename its columns `geo` and `pop`.

In [319]:
population_by_geography.columns = ['geo', 'pop']
population_by_geography.head(2)

Unnamed: 0,geo,pop
0,AFRICA,1284.0
1,Algeria,42.7


We can see that some rows of the `politicians_by_country` dataframe's `page` column contain the "Template:" prefix. These pages are not Wikipedia articles and will be removed below.

In [320]:
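# ~ is the element-wise negation operator for boolean Series (like ! in other languages)
template_prefix_filter = ~politicians_by_country.page.str.startswith('Template:')
# .copy() gives an independent dataframe, so adding a column later does not raise SettingWithCopyWarning
politicians_by_country = politicians_by_country[template_prefix_filter].copy()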
politicians_by_country.head(3)

Unnamed: 0,page,country,rev_id
1,Bir I of Kanem,Chad,355319463
10,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188
12,Yos Por,Cambodia,393822005


The `population_by_geography` table contains some cumulative regional (e.g. AFRICA, OCEANIA) population counts. Regions are ALL CAPS values in the `geo` column. These rows won't match the country field of our `politicians_by_country` table, so we remove them and keep the remaining rows to form the `population_by_country` table.

In [321]:
# Only regions are in ALL CAPS
region_filter = population_by_geography.geo.str.isupper()
population_by_country = population_by_geography[~region_filter]
population_by_country.columns = ['country', 'pop']
population_by_country.head(3)

Unnamed: 0,country,pop
1,Algeria,42.7
2,Egypt,97.0
3,Libya,6.5
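
The filter above assumes that `str.isupper()` cleanly separates regions from countries. A minimal sketch of that assumption on a few hand-picked names (the sample list is illustrative and is not the full dataset):

```python
# Illustrative sample only: a few names in the style of the `geo` column
sample_names = ["AFRICA", "Algeria", "LATIN AMERICA AND THE CARIBBEAN", "Korea, North"]

# str.isupper() is True when every cased character is uppercase
# and the string contains at least one cased character
is_region = {name: name.isupper() for name in sample_names}

for name, flag in is_region.items():
    print(f"{name!r}: {'region' if flag else 'country'}")
```

A country name written entirely in capitals would be misclassified as a region, so it may be worth scanning the filtered-out rows once.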


## Step 3: Getting article quality predictions

We will gather quality prediction data from the [ORES](https://www.mediawiki.org/wiki/ORES) (Objective Revision Evaluation Service) machine learning system.

The code in the cell below was provided as sample code to use with the ores package.

In [322]:
from ores import api

# We provide this useragent string (second arg below) to help the ORES team track requests
ores_session = api.Session("https://ores.wikimedia.org", "Class project: bebrodeu@uw.edu")

# Fetch the article quality using the rev_id values
results = ores_session.score("enwiki", ["articlequality"], politicians_by_country.rev_id.values)

For each article in the results we obtain the prediction and append it to a list. If the prediction is not available we use the `no_prediction_token` value instead.

In [323]:
article_quality_col = []
no_prediction_token = 'NOT_FOUND'

for score in results:
    found_prediction = False
    
    # Is a prediction in the score object ?
    if 'articlequality' in score:
        if 'score' in score['articlequality']:
            if 'prediction' in score['articlequality']['score']:
                article_quality_col.append(score['articlequality']['score']['prediction'])
                found_prediction = True
    
    # No predictions were found
    if not found_prediction:
        article_quality_col.append(no_prediction_token)

# Output the first five values to validate
article_quality_col[0:5]

['Stub', 'Stub', 'Stub', 'Stub', 'Stub']
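
The nested membership checks above can also be written as a small helper that walks the keys one level at a time. This is only a sketch: the helper name `extract_prediction` and the two sample score objects are hypothetical, shaped after the keys the loop above looks for.

```python
def extract_prediction(score, default="NOT_FOUND"):
    """Return score['articlequality']['score']['prediction'], or default if any level is missing."""
    node = score
    for key in ("articlequality", "score", "prediction"):
        # Stop as soon as one level of the nesting is absent
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node

# Hypothetical score objects (not real ORES responses)
with_prediction = {"articlequality": {"score": {"prediction": "Stub"}}}
without_prediction = {"articlequality": {"error": {"message": "revision not found"}}}

print(extract_prediction(with_prediction))
print(extract_prediction(without_prediction))
```

With such a helper, the column could be built with a list comprehension over `results`.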

We add the newly extracted article_quality column to the politicians_by_country dataframe.

In [324]:
politicians_by_country['article_quality'] = article_quality_col
politicians_by_country.head(2)

Unnamed: 0,page,country,rev_id,article_quality
1,Bir I of Kanem,Chad,355319463,Stub
10,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188,Stub


We save the articles whose ratings weren't found in a file named `ores_not_found.csv` in the artifacts folder. We will use them later in the analysis phase.
For now, we remove these values from our `politicians_by_country` table.

In [325]:
not_found_articles_filter = (politicians_by_country['article_quality'] == 'NOT_FOUND')
not_found_articles = politicians_by_country.loc[not_found_articles_filter]

# The article_quality column only holds the NOT_FOUND token for these rows, so we drop it.
# drop returns a new dataframe, so the result must be assigned back.
not_found_articles = not_found_articles.drop(columns=['article_quality'])

not_found_articles.to_csv('../artifacts/ores_not_found.csv', index=False, header=True)

In [326]:
# Politicians by country now only has rated articles
politicians_by_country = politicians_by_country[~not_found_articles_filter]

## Step 4: Combining datasets

Now that our article data in the `politicians_by_country` table has the quality rating for each article, we will merge it with our population_by_country into one table. We also rename our columns for readability going forward.

In [276]:
# pandas' merge is the equivalent of the sql join statement
# the how parameter indicates the type of merge
# outer indicates a "full outer join"
articles_and_population = pd.merge(politicians_by_country, population_by_country, on='country', how='outer')
articles_and_population.columns = ['article_name', 'country', 'revision_id', 'article_quality', 'population']
articles_and_population.head()

Unnamed: 0,article_name,country,revision_id,article_quality,population
0,Bir I of Kanem,Chad,355319463.0,Stub,15.4
1,Abdullah II of Kanem,Chad,498683267.0,Stub,15.4
2,Salmama II of Kanem,Chad,565745353.0,Stub,15.4
3,Kuri I of Kanem,Chad,565745365.0,Stub,15.4
4,Mohammed I of Kanem,Chad,565745375.0,Stub,15.4


Some rows will not have had a match with the other table.

* We want to keep a record of rows for which there is no `population` value (NaN in the table), which indicates no match from the population_by_country table.


* We also want to keep rows for which the other fields (such as rev_id) are missing (NaN) which indicates no match from the politicians_by_country table.

In [277]:
no_population_match_rows = articles_and_population[articles_and_population['population'].isnull()]
no_revision_id_match_rows = articles_and_population[articles_and_population['revision_id'].isnull()]
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
no_match_df = pd.concat([no_population_match_rows, no_revision_id_match_rows])
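
As an alternative to testing for NaN values, `pd.merge` can label where each row came from through its `indicator` parameter. The sketch below uses toy stand-ins for the two tables; the frame names and values (including the fictional country "Atlantis") are illustrative assumptions, not data from this analysis.

```python
import pandas as pd

# Toy stand-ins for the two tables (values are illustrative only)
articles = pd.DataFrame({
    "page": ["Article A", "Article B"],
    "country": ["Chad", "Atlantis"],
})
populations = pd.DataFrame({
    "country": ["Chad", "Algeria"],
    "pop": [15.4, 42.7],
})

# indicator=True adds a `_merge` column: 'both', 'left_only' or 'right_only'
merged = pd.merge(articles, populations, on="country", how="outer", indicator=True)

matched = merged[merged["_merge"] == "both"]
unmatched = merged[merged["_merge"] != "both"]

print(unmatched[["country", "_merge"]])
```

This keeps the matching logic independent of which columns happen to contain NaN for other reasons.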

We now write the unmatched rows and the complete rows to separate files.

In [278]:
articles_and_population = articles_and_population.drop(no_match_df.index)
no_match_df.to_csv('../clean_data/wp_wpds_countries_no_match.csv', index=None, header=True)
articles_and_population.to_csv('../clean_data/wp_wpds_politicians_by_country.csv', index=None, header=True)

## Step 5: Analysis

We start by loading the cleaned data.

In [279]:
# thousands=',' tells pandas that thousands in numeric columns are delimited by commas
articles_and_population = pd.read_csv('../clean_data/wp_wpds_politicians_by_country.csv', thousands=',')
articles_and_population.head()

Unnamed: 0,article_name,country,revision_id,article_quality,population
0,Bir I of Kanem,Chad,355319463.0,Stub,15.4
1,Abdullah II of Kanem,Chad,498683267.0,Stub,15.4
2,Salmama II of Kanem,Chad,565745353.0,Stub,15.4
3,Kuri I of Kanem,Chad,565745365.0,Stub,15.4
4,Mohammed I of Kanem,Chad,565745375.0,Stub,15.4


Our analysis will focus on:

| Area | Description |
|---|---|
| Coverage | The number of politician articles as a proportion of the country's population |
| Relative article quality | The proportion of articles rated "FA" (featured article) or "GA" (good article) out of the total number of articles |

We are interested in getting those metrics by regions and countries. We will use our original data source to associate a country with its region and add this to our dataset.

In [280]:
# Drop the population from our original dataset
geography = population_by_geography.drop(columns=['pop'])

# iterate over indexes in geography and create dictionary of countries (key) to their region (value).
# The original dataset has region in ALL_CAPS first followed by all countries in that region.
country_to_region_lookup = {}
region = ''
for i in geography.index:
    country_or_region = geography.loc[i, 'geo']
    # Is the 'geo' field of this row a region?
    if country_or_region.isupper():
        # Assign region for all countries until the next region
        region = country_or_region
    else:
        # Assign current region to country
        country_to_region_lookup[country_or_region] = region

# iterate over the articles dataset using the lookup to assign a region 
# to each row based on the value of the country field
regions = []
for i in articles_and_population.index:
    country = articles_and_population.loc[i, 'country']
    regions.append(country_to_region_lookup[country])

# Assign region column
articles_and_population['region'] = regions

# Display as validation
articles_and_population.head(3)

Unnamed: 0,article_name,country,revision_id,article_quality,population,region
0,Bir I of Kanem,Chad,355319463.0,Stub,15.4,AFRICA
1,Abdullah II of Kanem,Chad,498683267.0,Stub,15.4,AFRICA
2,Salmama II of Kanem,Chad,565745353.0,Stub,15.4,AFRICA
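
The same country-to-region association can be sketched without explicit loops by forward-filling the region names. The toy frame below only mimics the layout of the population table (a region row in ALL CAPS followed by its countries); its rows are illustrative.

```python
import pandas as pd

# Toy excerpt in the same layout as the population table (values illustrative)
geography = pd.DataFrame({
    "geo": ["AFRICA", "Algeria", "Egypt", "ASIA", "Cambodia"],
})

is_region = geography["geo"].str.isupper()

# Keep region names on region rows, missing values elsewhere,
# then carry each region forward to the countries that follow it
geography["region"] = geography["geo"].where(is_region).ffill()

# Drop the region rows to obtain a country -> region lookup
country_to_region = geography[~is_region].set_index("geo")["region"].to_dict()

print(country_to_region)
```

The resulting dictionary could then be applied with `articles_and_population['country'].map(...)`, which yields NaN for unknown countries instead of raising a `KeyError`.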


## Coverage calculation

### By country

Our analysis will first focus on 'coverage' which we will calculate in terms of number of politician articles as a proportion of the country's population.
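
As a worked example of this definition, the sketch below computes coverage for a single country. The function name `coverage` is a hypothetical helper; the figures are the Tuvalu values from the coverage table in this notebook.

```python
def coverage(article_count, population_millions):
    """Articles per person; the population is expressed in millions."""
    return article_count / (population_millions * 1e6)

# Tuvalu: 54 articles for a population of 0.01 million
tuvalu_coverage = coverage(article_count=54, population_millions=0.01)

# Format as a percentage with three decimals, like the styled table
print(f"{tuvalu_coverage:.3%}")
```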

First we create a table of the article count and population by country.

In [281]:
# np.mean gives the mean for each group, np.size gives us the row_count (in this case the article count)
coverage_by_country = articles_and_population.groupby('country').agg({'population': np.mean, 'article_name': np.size})
coverage_by_country.columns = ['population', 'article_count']
coverage_by_country.head(2)

Unnamed: 0_level_0,population,article_count
country,Unnamed: 1_level_1,Unnamed: 2_level_1
Afghanistan,36.5,320
Albania,2.9,457


We calculate coverage in its own column and sort the table to obtain the top and bottom 10 countries by coverage.

In [282]:
# Reminder: we multiply by 1e6 because the population is in millions
coverage_by_country['coverage'] = (coverage_by_country.article_count/(coverage_by_country.population*1e6))

In [283]:
# Sort by coverage percentage descending and take 10
top_10_by_country = coverage_by_country.sort_values(by=['coverage'], ascending=False).head(10)

# Sort by coverage percentage ascending and take 10
bottom_10_by_country = coverage_by_country.sort_values(by=['coverage']).head(10)

### By region

We'd like to do a similar exercise to see what the coverage by geographical region will be.

In [288]:
# Group data by region counting the number of articles
articles_by_region = articles_and_population.groupby('region').agg({'article_name': np.size})

# Rename columns for article_count
articles_by_region.columns = ['article_count']

# Get population by region from the original table (population_by_geography)
coverage_by_region = pd.merge(articles_by_region, population_by_geography, left_on='region', right_on='geo', how='inner')

# Rename the 'pop' column
coverage_by_region = coverage_by_region.rename(columns={"pop": "population"})

# Calculate coverage (population is in millions)
coverage_by_region['coverage'] = (coverage_by_region['article_count']/(coverage_by_region.population*1e6))

# Output sorted by coverage percentage descending
coverage_by_region = coverage_by_region.sort_values(by=['coverage'], ascending=False)

# Output friendly names
coverage_by_region = coverage_by_region.rename(columns={'geo': 'region'})
coverage_by_region = coverage_by_region[['region', 'population', 'article_count', 'coverage']]

## Coverage tables discussion

In [300]:
# Display logic inspired by: https://stackoverflow.com/questions/38783027/jupyter-notebook-display-two-pandas-tables-side-by-side
from IPython.display import display_html

top_10_by_country_styler = top_10_by_country.style.set_table_attributes("style='display:inline'").set_caption('Top 10').format({'coverage' : '{:.3%}'})
bottom_10_by_country_styler = bottom_10_by_country.style.set_table_attributes("style='display:inline;margin-left:40px'").set_caption('Bottom 10').format({'coverage' : '{:.5%}'})
region_styler = coverage_by_region.style.set_table_attributes("style='display:block'").set_caption('Regions').format({'coverage' : '{:.5%}'})

display_html(top_10_by_country_styler._repr_html_()+bottom_10_by_country_styler._repr_html_()+region_styler._repr_html_(), raw=True)

Unnamed: 0_level_0,population,article_count,coverage
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Tuvalu,0.01,54,0.540%
Nauru,0.01,52,0.520%
San Marino,0.03,81,0.270%
Monaco,0.04,40,0.100%
Liechtenstein,0.04,28,0.070%
Tonga,0.1,63,0.063%
Marshall Islands,0.06,37,0.062%
Iceland,0.4,201,0.050%
Andorra,0.08,34,0.042%
Grenada,0.1,36,0.036%

Unnamed: 0_level_0,population,article_count,coverage
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
India,1371.3,980,0.00007%
Indonesia,265.2,210,0.00008%
China,1393.8,1130,0.00008%
Uzbekistan,32.9,28,0.00009%
Ethiopia,107.5,101,0.00009%
"Korea, North",25.6,36,0.00014%
Zambia,17.7,25,0.00014%
Thailand,66.2,112,0.00017%
Mozambique,30.5,58,0.00019%
Bangladesh,166.4,319,0.00019%

Unnamed: 0,region,population,article_count,coverage
5,OCEANIA,41,3128,0.00763%
2,EUROPE,746,15864,0.00213%
3,LATIN AMERICA AND THE CARIBBEAN,649,5169,0.00080%
0,AFRICA,1284,6851,0.00053%
4,NORTHERN AMERICA,365,1921,0.00053%
1,ASIA,4536,11531,0.00025%


> Note: population is in millions.

#### Observations

We notice that the countries with the top 10 coverage all have very small populations (under half a million people). This is expected, as it is difficult to obtain high coverage in highly populated countries. This is reflected in the bottom 10 table, where every country has a population over 17 million.

Few of the countries in either table are predominantly English-speaking, which is interesting given that the articles were fetched from the English version of Wikipedia. A hypothesis that could have been formulated before looking at the data is that countries whose official languages include English would have good coverage.

Coverage is calculated by dividing the number of articles about politicians by a country's population. This does not take into account the historical context of the countries or their political systems. Some countries may have much richer historical records, political systems that involve more people, etc.

In the region table we can see some of the observations above come into play:

- Population size seems to loosely drive the overall order.
- Northern America has a small number of articles for its population, but may also have the shortest recorded historical period.
- Many other factors, such as how English-speaking countries are distributed across regions, could explain some of the discrepancies between regions.


## Relative quality

Our analysis will now focus on 'relative quality' which we will calculate as a proportion of the number of articles with a rating of "FA" or "GA" over the total number of articles.

### By country

In [313]:
# Create custom aggregator to count the number of "FA" and "GA" articles
def count_quality_articles(series):
    # isin flags the "FA"/"GA" values; summing the boolean mask counts them
    return series.isin(['FA', 'GA']).sum()

# Group data by country
relative_quality_by_country = articles_and_population.groupby('country').agg({'article_name': np.size, 'article_quality': count_quality_articles})

# Rename columns for article_count
relative_quality_by_country.columns = ['article_count', 'quality_article_count']

# Calculate relative_quality
relative_quality_by_country['relative_quality'] = (relative_quality_by_country['quality_article_count']/relative_quality_by_country['article_count'])

# Grab top 10
top_10_relative_quality_by_country = relative_quality_by_country.sort_values(by=['relative_quality'], ascending=False).head(10)

# Grab bottom 10
bottom_10_relative_quality_by_country = relative_quality_by_country.sort_values(by=['relative_quality']).head(10)

### By region

In [314]:
# Group data by region
relative_quality_by_region = articles_and_population.groupby('region').agg({'article_name': np.size, 'article_quality': count_quality_articles})

# Rename columns for article_count
relative_quality_by_region.columns = ['article_count', 'quality_article_count']

# Calculate relative_quality
relative_quality_by_region['relative_quality'] = (relative_quality_by_region['quality_article_count']/relative_quality_by_region['article_count'])

# Output by relative_quality descending
relative_quality_by_region = relative_quality_by_region.sort_values(by=['relative_quality'], ascending=False)

## Relative quality tables

In [315]:
# Display logic inspired by: https://stackoverflow.com/questions/38783027/jupyter-notebook-display-two-pandas-tables-side-by-side
top_10_relative_quality_by_country_styler = top_10_relative_quality_by_country.style.set_table_attributes("style='display:inline'").set_caption('Top 10').format({'relative_quality' : '{:.3%}'})
bottom_10_relative_quality_by_country_styler = bottom_10_relative_quality_by_country.style.set_table_attributes("style='display:inline;margin-left:40px'").set_caption('Bottom 10').format({'relative_quality' : '{:.5%}'})
region_styler = relative_quality_by_region.style.set_table_attributes("style='display:block'").set_caption('Regions').format({'relative_quality' : '{:.5%}'})

display_html(top_10_relative_quality_by_country_styler._repr_html_()+bottom_10_relative_quality_by_country_styler._repr_html_()+region_styler._repr_html_(), raw=True)

Unnamed: 0_level_0,article_count,quality_article_count,relative_quality
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
"Korea, North",36,7,19.444%
Saudi Arabia,118,15,12.712%
Mauritania,48,6,12.500%
Central African Republic,66,8,12.121%
Romania,343,39,11.370%
Tuvalu,54,5,9.259%
Bhutan,33,3,9.091%
Dominica,12,1,8.333%
Syria,128,10,7.812%
Benin,91,7,7.692%

Unnamed: 0_level_0,article_count,quality_article_count,relative_quality
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Slovakia,116,0,0.00000%
Namibia,162,0,0.00000%
Cape Verde,37,0,0.00000%
Mozambique,58,0,0.00000%
Costa Rica,147,0,0.00000%
Monaco,40,0,0.00000%
Djibouti,37,0,0.00000%
Moldova,423,0,0.00000%
Uganda,185,0,0.00000%
Eritrea,16,0,0.00000%

Unnamed: 0_level_0,article_count,quality_article_count,relative_quality
region,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
NORTHERN AMERICA,1921,99,5.15357%
ASIA,11531,310,2.68841%
OCEANIA,3128,66,2.10997%
EUROPE,15864,322,2.02975%
AFRICA,6851,125,1.82455%
LATIN AMERICA AND THE CARIBBEAN,5169,69,1.33488%


#### Observations

Looking at the dataset more closely, we can see that more than 10 countries have no articles about a politician which obtained a quality rating of "FA" or "GA".

Having very few articles makes it easy to increase the relative_quality rating for a given country.
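
To illustrate how sensitive the ratio is to small article counts, the sketch below recomputes relative quality after promoting one additional article to "FA" or "GA". The promotion is a hypothetical scenario; the counts are taken from the tables above.

```python
def relative_quality(quality_count, article_count):
    """Share of articles rated FA or GA."""
    return quality_count / article_count

# (quality_article_count, article_count) pairs from the tables above
examples = {"Dominica": (1, 12), "Romania": (39, 343)}

for country, (quality, total) in examples.items():
    before = relative_quality(quality, total)
    # Hypothetical scenario: one existing article is promoted to FA/GA
    after = relative_quality(quality + 1, total)
    print(f"{country}: {before:.3%} -> {after:.3%}")
```

For Dominica a single promotion doubles the ratio (from 1/12 to 2/12), while for Romania the change is well under one percentage point.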

Many of the countries with poor relative quality ratings also have few articles.

When looking at the region table, we see that Northern America has the highest relative quality rating.
This may be due to having a large number of native English speakers.


# Reflection

**1. What biases did you expect to find in the data (before you started working with it), and why?**

Before starting the work, I thought article quality would reflect writing quality as well as content quality. As a result, I expected countries living under political regimes prone to censorship to have worse article quality and fewer articles. I also expected English-speaking countries to have better article quality by a significant margin, due to having a larger number of editors whose native language is English. I intuitively thought that, at least for countries whose official languages include English, population and coverage would be fairly proportional.


**2. What (potential) sources of bias did you discover in the course of your data processing and analysis?**

The evaluation for article quality doesn't really evaluate what the documentation calls 'tone':

> Article quality on Wikipedia is evaluated as follows:  
<br/>
>_"The articlequality model bases its predictions on structural characteristics of the article. E.g. How many sections are there? Is there an infobox? How many references? And do the references use a {{cite}} template? The articlequality model doesn't evaluate the quality of the writing or whether or not there's a tone problem (e.g. a point of view being pushed). However, many of the structural characteristics of articles seem to correlate strongly with good writing and tone, so the models work very well in practice."_ - ORES documentation

We also have very little information about how the model concretely performs this evaluation. The code is at least made available for further exploration.

The number of politicians in a country is not proportional to its population. Because of this, article coverage seems like a metric with very little value or explainability. Furthermore, countries have varied political systems, which might involve a greater or smaller number of people. The total number of articles written for a country also depends largely on its history and historical records. For example, a country whose libraries were destroyed during wars might not have many records of early politics. Similarly, countries that were founded in the last century will not have a number of articles comparable to that of a country whose rich history spans multiple centuries.

The dataset does not include any information about the editors of the articles. Without it, we cannot make inferences about the intent or the validity of the articles. It would have been interesting to use data about the editors to account for potential bias (age group, editors' gender relative to the gender of the article's subject, editors' country relative to the article's country, etc.).


**4. What might your results suggest about the internet and global society in general?**

It is very tempting to draw intuitive (even prejudicial) conclusions from a dataset before taking a look at the data and its source. Sources of bias in anything human centered are multiple and seem to be difficult to account for. The Internet is an inherently biased source of data (notably, because access to the Internet is required to be part of the conversation).

Given these observations, it is interesting that there seems to remain an inherent (naive) trust in the democratic process of sharing opinions and information online. The current generation is already feeling the repercussions of placing trust in largely unmonitored information sources.