# Assignment 2: Bias in Data

The goal of this assignment is to explore the concept of bias through data on Wikipedia articles - specifically, articles on political figures from a variety of countries. For this assignment, we will combine a dataset of Wikipedia articles with a dataset of country populations, and use a machine learning service called ORES to estimate the quality of each article.

We will analyze how the coverage of politicians on Wikipedia and the quality of articles about politicians vary between countries.

The analysis will consist of a series of tables that show:
- the countries with the greatest and least coverage of politicians on Wikipedia compared to their population.
- the countries with the highest and lowest proportion of high quality articles about politicians.
- a ranking of geographic regions by articles-per-person and proportion of high quality articles.

### Import Libraries

In [1]:
import os
import requests
from urllib.parse import urlencode

import pandas as pd
import numpy as np

### Step 1: Getting the Article and Population data

The first step is getting the data, which lives in several different places. The Wikipedia politicians by country dataset can be found on Figshare. 
- download the zipped folder manually from the link https://figshare.com/articles/dataset/Untitled_Item/5513449 
- unzip it in our data/source folder

The population data is available in CSV format as WPDS_2020_data.csv. This dataset is drawn from the world population data sheet published by the Population Reference Bureau.

In [4]:
df_cd = pd.read_csv('../data/source/page_data.csv')
df_wpd = pd.read_csv('../data/source/WPDS_2020_data.csv')

In [5]:
df_cd.head(5)

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


In [6]:
df_wpd.head(5)

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
0,WORLD,WORLD,World,2019,7772.85,7772850000
1,AFRICA,AFRICA,Sub-Region,2019,1337.918,1337918000
2,NORTHERN AFRICA,NORTHERN AFRICA,Sub-Region,2019,244.344,244344000
3,DZ,Algeria,Country,2019,44.357,44357000
4,EG,Egypt,Country,2019,100.803,100803000


### Step 2: Data Cleaning

Both page_data.csv and WPDS_2020_data.csv contain some rows that need to be filtered out:

page_data.csv:
- the dataset contains some page names that start with the string "Template:". These pages are not Wikipedia articles and should not be included in the analysis.

WPDS_2020_data.csv:
- the dataset contains some rows that provide cumulative regional population counts rather than country-level counts. These rows are distinguished by having ALL CAPS values in the 'Name' field (e.g. AFRICA, OCEANIA).
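
The two filters can be tried on a few toy rows first. This is a minimal sketch, assuming the column names `page` and `Name` used in this notebook; the small frames are made up for illustration and only reuse a few values shown in the previews above.

```python
import pandas as pd

# Toy rows that mimic the two source files (illustrative only)
pages = pd.DataFrame({
    "page": ["Template:Uganda-politician-stub", "Bir I of Kanem", "Yos Por"],
    "country": ["Uganda", "Chad", "Cambodia"],
})
population = pd.DataFrame({
    "Name": ["AFRICA", "NORTHERN AFRICA", "Algeria", "Egypt"],
    "Population": [1337918000, 244344000, 44357000, 100803000],
})

# Keep only real articles: drop page names that start with "Template:"
articles = pages[~pages["page"].str.startswith("Template:")]

# ALL CAPS names mark regional aggregates; everything else is a country
is_region = population["Name"].str.isupper()
regions = population[is_region]
countries = population[~is_region]

print(articles["page"].tolist())
print(regions["Name"].tolist())
print(countries["Name"].tolist())
```

Computing the boolean mask once and negating it with `~` keeps the region and country subsets guaranteed to be complements of each other.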


In [7]:
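# Drop pages whose names start with "Template:"; they are not articles
df_cd = df_cd[~df_cd["page"].str.startswith("Template:")]

# ALL CAPS names are regional aggregates; everything else is a country
is_region = df_wpd['Name'].str.isupper()
df_wpd_country = df_wpd[~is_region] # Country-level
df_wpd_region = df_wpd[is_region] # Cumulative region level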

In [8]:
df_cd.head(5)

Unnamed: 0,page,country,rev_id
1,Bir I of Kanem,Chad,355319463
10,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188
12,Yos Por,Cambodia,393822005
23,Julius Gregr,Czech Republic,395521877
24,Edvard Gregr,Czech Republic,395526568


In [9]:
df_wpd_country.head(5)

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
3,DZ,Algeria,Country,2019,44.357,44357000
4,EG,Egypt,Country,2019,100.803,100803000
5,LY,Libya,Country,2019,6.891,6891000
6,MA,Morocco,Country,2019,35.952,35952000
7,SD,Sudan,Country,2019,43.849,43849000


In [10]:
df_wpd_region.head(5)

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
0,WORLD,WORLD,World,2019,7772.85,7772850000
1,AFRICA,AFRICA,Sub-Region,2019,1337.918,1337918000
2,NORTHERN AFRICA,NORTHERN AFRICA,Sub-Region,2019,244.344,244344000
10,WESTERN AFRICA,WESTERN AFRICA,Sub-Region,2019,401.115,401115000
27,EASTERN AFRICA,EASTERN AFRICA,Sub-Region,2019,444.97,444970000


### Step 3: Getting Article Quality Predictions

Now we need to get the predicted quality scores for each article in the Wikipedia dataset. We're using a machine learning system called ORES. The name was originally an acronym for "Objective Revision Evaluation Service", but the service is now simply called "ORES". ORES provides estimates of Wikipedia article quality. The article quality classes are, from best to worst:

- FA - Featured article
- GA - Good article
- B - B-class article
- C - C-class article
- Start - Start-class article
- Stub - Stub-class article

The model learned these classes from Wikipedia articles that were peer-reviewed using the Wikipedia content assessment procedures. The classes are a subset of the quality assessment categories developed by Wikipedia editors.

To get a prediction for each article in the Wikipedia dataset, we take the rev_id values from page_data.csv and send them to the ORES API in batches of 50 revision IDs per query.
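
The batching and the query URL can be checked without touching the network. The sketch below is only an illustration: the helper names `chunked` and `build_ores_url` are hypothetical, and the endpoint and parameter names are the ones used in this notebook's `api_call` function.

```python
from urllib.parse import urlencode

# Endpoint as used in this notebook, with the project fixed to enwiki
ORES_BASE = "https://ores.wikimedia.org/v3/scores/enwiki/"

def chunked(items, size=50):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def build_ores_url(rev_ids, model="articlequality"):
    """Return the query URL for one batch of revision IDs."""
    # urlencode percent-encodes the pipe separator between revision IDs
    query = urlencode({"models": model,
                       "revids": "|".join(str(r) for r in rev_ids)})
    return f"{ORES_BASE}?{query}"

batches = list(chunked(list(range(1, 121)), size=50))
print([len(batch) for batch in batches])
print(build_ores_url([355319463, 393822005]))
```

Slicing a list past its end simply returns the remaining items, so the last, shorter batch needs no special handling.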


In [11]:
headers = {
    'User-Agent': 'https://github.com/KrishaMehta98',
    'From': 'kkm98@gmail.com'
}

In [12]:
def api_call(rev_ids):
    """Request ORES articlequality scores for a list of revision IDs and return the parsed JSON."""
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
    
    parameters = {'project': 'enwiki',
                  'model': 'articlequality',
                  'revids': '|'.join(str(x) for x in rev_ids)  # IDs are pipe-separated
                 }
    call = requests.get(endpoint.format(**parameters), headers=headers)
    response = call.json()
    return response

In [13]:
rev_ids = df_cd['rev_id'].tolist()

In [14]:
revids = []
scores = []
missing_revids = []

batch_size = 50  # number of revision IDs sent per request

for i in range(0, len(rev_ids), batch_size):
    # Slicing past the end of a list is safe, so the last batch needs no special case
    ids = rev_ids[i:i+batch_size]
    
    response = api_call(ids)
    
    for revid in ids:
        result = response['enwiki']['scores'][str(revid)]['articlequality']
        if result.get('score') is not None:
            revids.append(revid)
            scores.append(result['score']['prediction'])
        else:
            # No score came back for this revision; record it separately
            missing_revids.append(revid)
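
The parsing logic in the loop above can be exercised offline against a hand-written response. This is a sketch under assumptions: the `sample` dictionary only imitates the nesting that the loop reads (`enwiki` → `scores` → revision ID → `articlequality` → `score` → `prediction`), the contents of the `error` entry are an assumed placeholder, and `split_scores` is a hypothetical helper name.

```python
def split_scores(response, rev_ids, project="enwiki", model="articlequality"):
    """Separate revisions with a prediction from those without one."""
    scored, missing = {}, []
    results = response.get(project, {}).get("scores", {})
    for rev_id in rev_ids:
        # Response keys are strings, so convert the integer revision ID
        entry = results.get(str(rev_id), {}).get(model, {})
        if "score" in entry:
            scored[rev_id] = entry["score"]["prediction"]
        else:
            missing.append(rev_id)
    return scored, missing

# Hand-written stand-in for an API response (not real ORES output)
sample = {"enwiki": {"scores": {
    "355319463": {"articlequality": {"score": {"prediction": "Stub"}}},
    "1": {"articlequality": {"error": {"type": "RevisionNotFound"}}},
}}}

scored, missing = split_scores(sample, [355319463, 1, 2])
print(scored)
print(missing)
```

Using `.get` with empty-dict defaults means a revision that is absent from the response altogether is treated as missing rather than raising a `KeyError`.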

In [15]:
# Build the frame from a dict so that rev_id keeps its integer dtype
revid_score = pd.DataFrame({'rev_id': revids, 'prediction': scores})
revid_score.to_csv('../data/final/wikipedia-politician-article-quality.csv', index=False)


revid_missing = pd.DataFrame([missing_revids]).T
revid_missing.columns = ['revid']
revid_missing.to_csv('../data/final/revids_missing.csv', index=False)



### Step 4: Combining the data sets

Some processing of the data needs to be done. In particular, after retrieving the ORES data for each article, we need to merge the Wikipedia data and the population data. Both have fields containing country names for just that purpose. After merging the data, we'll invariably run into entries that cannot be matched: either the population dataset has no entry for a country in the Wikipedia data, or vice versa.

Remove any rows that do not have matching data and output them to a CSV file called wp_wpds_countries-no_match.csv.

Consolidate the remaining data into a single CSV file called wp_wpds_politicians_by_country.csv.
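
One way to see the matched and unmatched rows side by side is pandas' `indicator=True` option on an outer merge. The sketch below uses toy data: "Example Person" and "Atlantis" are invented to force a non-match, and this is an alternative illustration, not the approach taken in the cells that follow.

```python
import pandas as pd

articles = pd.DataFrame({
    "page": ["Bir I of Kanem", "Example Person"],   # second row is made up
    "country": ["Chad", "Atlantis"],
})
population = pd.DataFrame({
    "Name": ["Chad", "Algeria"],
    "Population": [16877000, 44357000],
})

# indicator=True adds a "_merge" column: both, left_only, or right_only
merged = articles.merge(population, how="outer",
                        left_on="country", right_on="Name", indicator=True)

matched = merged[merged["_merge"] == "both"].drop(columns="_merge")
no_match = merged[merged["_merge"] != "both"].drop(columns="_merge")

print(matched)
print(no_match)
```

Here `no_match` holds both kinds of failure: an article whose country has no population entry, and a population entry with no articles.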


In [16]:
wpd_cd = df_cd.merge(df_wpd, how='outer', left_on='country', right_on='Name')

In [17]:
wpd_cd.head(5)

Unnamed: 0,page,country,rev_id,FIPS,Name,Type,TimeFrame,Data (M),Population
0,Bir I of Kanem,Chad,355319463.0,TD,Chad,Country,2019.0,16.877,16877000.0
1,Abdullah II of Kanem,Chad,498683267.0,TD,Chad,Country,2019.0,16.877,16877000.0
2,Salmama II of Kanem,Chad,565745353.0,TD,Chad,Country,2019.0,16.877,16877000.0
3,Kuri I of Kanem,Chad,565745365.0,TD,Chad,Country,2019.0,16.877,16877000.0
4,Mohammed I of Kanem,Chad,565745375.0,TD,Chad,Country,2019.0,16.877,16877000.0


In [18]:
missing_wpd_cd = wpd_cd.loc[(wpd_cd['country'].isnull() | wpd_cd['Name'].isnull())]
missing_wpd_cd.to_csv('../data/final/wp_wpds_countries-no_match.csv', index = False)

In [19]:
wp_wpds  = df_cd.merge(df_wpd, how='inner', left_on='country', right_on='Name')
wp_wpds = wp_wpds.merge(revid_score, on='rev_id')
wp_wpds = wp_wpds.drop(columns=['FIPS', 'Name', 'Type', 'TimeFrame', 'Data (M)'])
wp_wpds = wp_wpds.rename(index=str, columns={'page': 'article_name', 'rev_id': 'revision_id', 'Population':'population', 'prediction':'article_quality_est'})
wp_wpds.to_csv('../data/final/wp_wpds_politicians_by_country.csv', index = False)

In [20]:
wp_wpds.head(5)

Unnamed: 0,article_name,country,revision_id,population,article_quality_est
0,Bir I of Kanem,Chad,355319463,16877000,Stub
1,Abdullah II of Kanem,Chad,498683267,16877000,Stub
2,Salmama II of Kanem,Chad,565745353,16877000,Stub
3,Kuri I of Kanem,Chad,565745365,16877000,Stub
4,Mohammed I of Kanem,Chad,565745375,16877000,Stub


### Step 5: Analysis

The analysis consists of calculating two percentages for each country AND for each geographic region: the number of articles as a percentage of the population, and the number of high-quality articles as a percentage of all articles. By "high quality" articles, we mean articles about politicians in a given country that ORES predicted would be in either the "FA" (featured article) or "GA" (good article) class.
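
As a quick worked example of the two percentages, the numbers below are taken from the Afghanistan row of the per-country table in this notebook; the helper name `percent` is just for illustration.

```python
def percent(part, whole):
    """Return part/whole expressed as a percentage."""
    return part / whole * 100

# Afghanistan row from the per-country table in this notebook
total_articles = 319
high_quality = 13            # FA + GA = 1 + 12
population = 38_928_000

coverage = percent(total_articles, population)
quality = percent(high_quality, total_articles)
print(f"coverage: {coverage:.6f}%  high quality: {quality:.6f}%")
```

Note that the two percentages use different denominators: coverage divides by population, while quality divides by the country's own article count.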

In [21]:
wp_wpds_per_country = pd.pivot_table(wp_wpds, fill_value=0,columns=['article_quality_est'],aggfunc={'article_quality_est': len,},index=['country'])
wp_wpds_per_country.columns = wp_wpds_per_country.columns.droplevel() 
wp_wpds_per_country = wp_wpds_per_country.reset_index()
wp_wpds_per_country.columns.name = None
wp_wpds_per_country = wp_wpds_per_country.merge(wp_wpds.groupby(['country'])['population'].mean(), left_on='country', right_index=True)

In [22]:
wp_wpds_per_country['total_articles'] = wp_wpds_per_country['FA'] + wp_wpds_per_country['GA'] + wp_wpds_per_country['B'] + wp_wpds_per_country['C'] + wp_wpds_per_country['Stub'] + wp_wpds_per_country['Start']
wp_wpds_per_country['total_high_quality_articles'] = wp_wpds_per_country['FA'] + wp_wpds_per_country['GA']
wp_wpds_per_country['articles_per_population_percent'] = (wp_wpds_per_country['total_articles'] / wp_wpds_per_country['population']) * 100
wp_wpds_per_country['high_quality_articles_percent'] = (wp_wpds_per_country['total_high_quality_articles'] / wp_wpds_per_country['total_articles']) * 100
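
Summing the six class columns by name assumes every class occurs at least once in the data. A possible safeguard, sketched here on toy rows, is to count with `pd.crosstab` and then `reindex` the columns so that absent classes appear as zeros; the `QUALITY_CLASSES` list and the toy frame are assumptions for illustration.

```python
import pandas as pd

QUALITY_CLASSES = ["FA", "GA", "B", "C", "Start", "Stub"]

# Toy predictions; note that FA, B, and C never occur here
articles = pd.DataFrame({
    "country": ["Chad", "Chad", "Chad", "Algeria"],
    "article_quality_est": ["Stub", "Stub", "GA", "Start"],
})

# Count articles per country and class, then force all six class columns
counts = pd.crosstab(articles["country"], articles["article_quality_est"])
counts = counts.reindex(columns=QUALITY_CLASSES, fill_value=0)

counts["total_articles"] = counts[QUALITY_CLASSES].sum(axis=1)
counts["total_high_quality_articles"] = counts[["FA", "GA"]].sum(axis=1)
counts["high_quality_articles_percent"] = (
    counts["total_high_quality_articles"] / counts["total_articles"] * 100
)
print(counts)
```

Without the `reindex` step, `counts["FA"]` would raise a `KeyError` on this toy data because no article was predicted as FA.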


In [23]:
wp_wpds_per_country.head(5)

Unnamed: 0,country,B,C,FA,GA,Start,Stub,population,total_articles,total_high_quality_articles,articles_per_population_percent,high_quality_articles_percent
0,Afghanistan,8,46,1,12,99,153,38928000,319,13,0.000819,4.075235
1,Albania,3,59,0,3,147,244,2838000,456,3,0.016068,0.657895
2,Algeria,3,10,0,2,44,57,44357000,116,2,0.000262,1.724138
3,Andorra,0,2,0,0,8,24,82000,34,0,0.041463,0.0
4,Angola,2,6,0,0,23,75,32522000,106,0,0.000326,0.0


In [25]:
region = "NORTHERN AFRICA"

regions = ['WORLD', 'AFRICA', 'NORTHERN AFRICA']
for i in range(3, len(df_wpd)):
    if df_wpd.iloc[i]['Type'] == 'Sub-Region':
        region = df_wpd.iloc[i]['Name']
    regions.append(region)

df_wpd['Region'] = regions
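
The same region labels can also be derived without an explicit loop. This is a sketch on toy rows, assuming (as in the population file) that each block of countries is listed directly below its region row; it keeps the name on every non-country row and carries it forward with `ffill`. It differs slightly from the loop above in that it keys on `Type != "Country"` rather than `Type == "Sub-Region"`.

```python
import pandas as pd

wpds = pd.DataFrame({
    "Name": ["WORLD", "AFRICA", "NORTHERN AFRICA", "Algeria", "Egypt",
             "WESTERN AFRICA", "Benin"],
    "Type": ["World", "Sub-Region", "Sub-Region", "Country", "Country",
             "Sub-Region", "Country"],
})

# Blank out country names, then fill each gap with the last region seen
wpds["Region"] = wpds["Name"].where(wpds["Type"] != "Country").ffill()
print(wpds)
```

Each country row ends up labelled with the nearest region row above it, and region rows are labelled with their own name.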

In [26]:
wp_wpds_per_region = pd.merge(left=wp_wpds,right=df_wpd,left_on='country',right_on='Name',how='left')
wp_wpds_per_region = pd.merge(left=wp_wpds_per_region,right=df_wpd_region,left_on='Region',right_on='Name',how='left')
wp_wpds_per_region = wp_wpds_per_region[['Region', 'country', 'article_name', 'revision_id', 'article_quality_est', 'Population_y', 'population']]
wp_wpds_per_region = wp_wpds_per_region.rename(columns={'Region': 'region', 'Population_y': 'region_population', 'population': 'country_population'})
wp_wpds_per_region = wp_wpds_per_region.dropna(subset=['region_population'])

In [27]:
wp_wpds_per_region.head(5)

Unnamed: 0,region,country,article_name,revision_id,article_quality_est,region_population,country_population
0,MIDDLE AFRICA,Chad,Bir I of Kanem,355319463,Stub,179757000.0,16877000
1,MIDDLE AFRICA,Chad,Abdullah II of Kanem,498683267,Stub,179757000.0,16877000
2,MIDDLE AFRICA,Chad,Salmama II of Kanem,565745353,Stub,179757000.0,16877000
3,MIDDLE AFRICA,Chad,Kuri I of Kanem,565745365,Stub,179757000.0,16877000
4,MIDDLE AFRICA,Chad,Mohammed I of Kanem,565745375,Stub,179757000.0,16877000


In [28]:
wp_wpds_per_region_pivot = pd.pivot_table(wp_wpds_per_region,fill_value=0,columns=['article_quality_est'],aggfunc={'article_quality_est': len, },index=['region'])
wp_wpds_per_region_pivot.columns = wp_wpds_per_region_pivot.columns.droplevel() #clean up multilevel index
wp_wpds_per_region_pivot = wp_wpds_per_region_pivot.reset_index()
wp_wpds_per_region_pivot.columns.name = None

In [29]:
wp_wpds_per_region_pivot = pd.merge(left=wp_wpds_per_region_pivot,right=wp_wpds_per_region.groupby(['region'])['region_population'].mean(), left_on='region', right_index=True)

In [30]:
wp_wpds_per_region_pivot['total_articles'] = wp_wpds_per_region_pivot['FA'] + wp_wpds_per_region_pivot['GA'] + wp_wpds_per_region_pivot['B'] + wp_wpds_per_region_pivot['C'] + wp_wpds_per_region_pivot['Stub'] + wp_wpds_per_region_pivot['Start']
wp_wpds_per_region_pivot['total_high_quality_articles'] = wp_wpds_per_region_pivot['FA'] + wp_wpds_per_region_pivot['GA']
wp_wpds_per_region_pivot['articles_per_population_percent'] = (wp_wpds_per_region_pivot['total_articles'] / wp_wpds_per_region_pivot['region_population']) * 100
wp_wpds_per_region_pivot['high_quality_articles_percent'] = (wp_wpds_per_region_pivot['total_high_quality_articles'] / wp_wpds_per_region_pivot['total_articles']) * 100

In [31]:
wp_wpds_per_region_pivot.head(5)

Unnamed: 0,region,B,C,FA,GA,Start,Stub,region_population,total_articles,total_high_quality_articles,articles_per_population_percent,high_quality_articles_percent
0,CARIBBEAN,6,103,2,11,241,332,43233000.0,695,13,0.001608,1.870504
1,CENTRAL AMERICA,8,96,7,16,266,1150,178611000.0,1543,23,0.000864,1.490603
2,CENTRAL ASIA,5,33,1,6,75,125,74961000.0,245,7,0.000327,2.857143
3,EAST ASIA,104,422,17,59,789,1082,1641063000.0,2473,76,0.000151,3.07319
4,EASTERN AFRICA,28,264,7,28,658,1517,444970000.0,2502,35,0.000562,1.398881


### Step 6: Results

#### 1. Top 10 countries by coverage: 10 highest-ranked countries in terms of number of politician articles as a proportion of country population

In [32]:
wp_wpds_per_country.sort_values(by=['articles_per_population_percent'], ascending=False).head(10)[['country', 'articles_per_population_percent']]

Unnamed: 0,country,articles_per_population_percent
169,Tuvalu,0.54
117,Nauru,0.472727
138,San Marino,0.238235
110,Monaco,0.105263
95,Liechtenstein,0.071795
104,Marshall Islands,0.064912
164,Tonga,0.063636
70,Iceland,0.05462
3,Andorra,0.041463
52,Federated States of Micronesia,0.033962


#### 2. Bottom 10 countries by coverage: 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population


In [33]:
wp_wpds_per_country.sort_values(by=['articles_per_population_percent'], ascending=True).head(10)[['country', 'articles_per_population_percent']]

Unnamed: 0,country,articles_per_population_percent
71,India,6.9e-05
72,Indonesia,7.7e-05
34,China,8.1e-05
176,Uzbekistan,8.2e-05
51,Ethiopia,8.8e-05
181,Zambia,0.000136
84,"Korea, North",0.00014
162,Thailand,0.000168
114,Mozambique,0.000186
13,Bangladesh,0.000187


#### 3. Top 10 countries by relative quality: 10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality

In [34]:
wp_wpds_per_country.sort_values(by=['high_quality_articles_percent'], ascending=False).head(10)[['country', 'high_quality_articles_percent']]

Unnamed: 0,country,high_quality_articles_percent
84,"Korea, North",22.222222
140,Saudi Arabia,12.820513
135,Romania,12.244898
31,Central African Republic,12.121212
176,Uzbekistan,10.714286
106,Mauritania,10.416667
64,Guatemala,8.433735
44,Dominica,8.333333
158,Syria,7.8125
18,Benin,7.692308


#### 4. Bottom 10 countries by relative quality: 10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality

In [35]:
wp_wpds_per_country.sort_values(by=['high_quality_articles_percent'], ascending=True).head(10)[['country', 'high_quality_articles_percent']]

Unnamed: 0,country,high_quality_articles_percent
148,Solomon Islands,0.0
164,Tonga,0.0
117,Nauru,0.0
116,Namibia,0.0
43,Djibouti,0.0
114,Mozambique,0.0
110,Monaco,0.0
49,Eritrea,0.0
50,Estonia,0.0
109,Moldova,0.0


#### 5. Geographic regions by coverage: Ranking of geographic regions (in descending order) in terms of the total count of politician articles as a proportion of regional population

In [36]:
wp_wpds_per_region_pivot.sort_values(by=['articles_per_population_percent'], ascending=False)[['region', 'articles_per_population_percent']]


Unnamed: 0,region,articles_per_population_percent
9,OCEANIA,0.007244
14,SOUTHERN EUROPE,0.002421
17,WESTERN EUROPE,0.002333
0,CARIBBEAN,0.001608
5,EASTERN EUROPE,0.001279
13,SOUTHERN AFRICA,0.000936
16,WESTERN ASIA,0.000912
1,CENTRAL AMERICA,0.000864
10,SOUTH AMERICA,0.000706
4,EASTERN AFRICA,0.000562


#### 6. Geographic regions by relative quality: Ranking of geographic regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality

In [37]:
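# Rows are ordered by high_quality_articles_percent; the column displayed next to each region is articles_per_population_percent
wp_wpds_per_region_pivot.sort_values(by=['high_quality_articles_percent'], ascending=False)[['region', 'articles_per_population_percent']]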

Unnamed: 0,region,articles_per_population_percent
8,NORTHERN AMERICA,0.000516
12,SOUTHEAST ASIA,0.000305
16,WESTERN ASIA,0.000912
5,EASTERN EUROPE,0.001279
3,EAST ASIA,0.000151
2,CENTRAL ASIA,0.000327
6,MIDDLE AFRICA,0.00037
7,NORTHERN AFRICA,0.000368
9,OCEANIA,0.007244
14,SOUTHERN EUROPE,0.002421


### Reflection

Before performing the analysis, I thought that the following factors might have the most effect on the percentage of articles:
- If English is a common language in the country, then there may be a higher percentage of articles from that country
- If the country is developed, that is, it has more resources like Internet access and a mostly literate population, then again, there may be a higher percentage of articles

But the analysis shows that articles per population is highest in countries with small populations. I was really surprised that large countries like China and India were in the bottom 10 by articles-per-population percentage. I believe that is because the number of politicians does not grow in proportion to population, so countries with smaller populations end up with more articles per person.

In the top 10 by percentage of high-quality articles, we see countries like North Korea, Saudi Arabia and Syria, which generally receive a lot of attention from the whole world and not only from their own people (which is the case for some other countries).

I believe that a major bias was introduced by using only English Wikipedia articles for the analysis, since people in many other countries would rather write in their own language. Ideally, articles about politicians in other languages should also be taken into consideration.