# A2 - Bias in Data

The goal of this assignment is to explore the concept of bias through data on Wikipedia articles - specifically, articles on political figures from a variety of countries. For this assignment, we will combine a dataset of Wikipedia articles with a dataset of country populations, and use a machine learning service called ORES to estimate the quality of each article.

We will then analyze how the coverage of politicians on Wikipedia and the quality of articles about politicians vary between countries. The analysis will consist of a series of tables that show:

1. The countries with the greatest and least coverage of politicians on Wikipedia compared to their population.
2. The countries with the highest and lowest proportion of high-quality articles about politicians.
3. A ranking of geographic regions by articles-per-person and proportion of high-quality articles.

### Step 1: Getting the Article and Population Data

The first step is getting the data. The Wikipedia [politicians by country dataset](https://figshare.com/articles/Untitled_Item/5513449) can be found on Figshare. Here it is saved as `page_data.csv`.
The population data is saved as `WPDS_2020_data.csv`. This dataset is drawn from the [world population data sheet](https://www.prb.org/international/indicator/population/table/) published by the Population Reference Bureau.

Our analysis will also use score estimates generated by ORES. You must `pip install ores` before running this notebook, or follow the [installation instructions](https://github.com/wikimedia/ores).

In [1]:
import pandas as pd
import numpy as np
from ores import api
from tqdm import tqdm

In [2]:
page_data_path = '../data/page_data.csv'
WPDS_path = '../data/WPDS_2020_data.csv'

### Step 2: Cleaning the Data

Both page_df and WPDS_df contain some rows we will need to filter out or ignore. We will clean the datasets here.

In [88]:
page_df = pd.read_csv(page_data_path)
# Filter out any rows whose page name begins with 'Template:'. These are not Wikipedia articles.
page_df = page_df.loc[~page_df['page'].str.startswith('Template:')]
page_df

Unnamed: 0,page,country,rev_id
1,Bir I of Kanem,Chad,355319463
10,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188
12,Yos Por,Cambodia,393822005
23,Julius Gregr,Czech Republic,395521877
24,Edvard Gregr,Czech Republic,395526568
...,...,...,...
47192,Yahya Jammeh,Gambia,807482007
47193,Lucius Fairchild,United States,807483006
47194,Fahd of Saudi Arabia,Saudi Arabia,807483153
47195,Francis Fessenden,United States,807483270


Here we add 'Region' and 'region_population' columns to WPDS_df so that each country is associated with its region and that region's population. Then we split the regions and countries into separate DataFrames.

In [91]:
WPDS_df = pd.read_csv(WPDS_path)
WPDS_df

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
0,WORLD,WORLD,World,2019,7772.850,7772850000
1,AFRICA,AFRICA,Sub-Region,2019,1337.918,1337918000
2,NORTHERN AFRICA,NORTHERN AFRICA,Sub-Region,2019,244.344,244344000
3,DZ,Algeria,Country,2019,44.357,44357000
4,EG,Egypt,Country,2019,100.803,100803000
...,...,...,...,...,...,...
229,WS,Samoa,Country,2019,0.200,200000
230,SB,Solomon Islands,Country,2019,0.715,715000
231,TO,Tonga,Country,2019,0.099,99000
232,TV,Tuvalu,Country,2019,0.010,10000


In [92]:
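# Add the sub-region name and its population to every row of WPDS_df.
# Rows 0-2 (WORLD, AFRICA, NORTHERN AFRICA) are seeded by hand, so the loop starts at row 3.
region = ('NORTHERN AFRICA', 244344000)

regions = [('WORLD', 7772850000), ('AFRICA', 1337918000), ('NORTHERN AFRICA', 244344000)]
for i in range(3, len(WPDS_df)):
    # A sub-region row starts a new region; the country rows that follow inherit it.
    if WPDS_df.iloc[i]['Type'] == 'Sub-Region':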
        region = (WPDS_df.iloc[i]['Name'], WPDS_df.iloc[i]['Population'])
    regions.append(region)
    
regions_tuples_df = pd.DataFrame(regions, columns=['Region', 'region_population'])
    
WPDS_df = pd.concat([WPDS_df, regions_tuples_df], axis=1)

WPDS_df

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population,Region,region_population
0,WORLD,WORLD,World,2019,7772.850,7772850000,WORLD,7772850000
1,AFRICA,AFRICA,Sub-Region,2019,1337.918,1337918000,AFRICA,1337918000
2,NORTHERN AFRICA,NORTHERN AFRICA,Sub-Region,2019,244.344,244344000,NORTHERN AFRICA,244344000
3,DZ,Algeria,Country,2019,44.357,44357000,NORTHERN AFRICA,244344000
4,EG,Egypt,Country,2019,100.803,100803000,NORTHERN AFRICA,244344000
...,...,...,...,...,...,...,...,...
229,WS,Samoa,Country,2019,0.200,200000,OCEANIA,43155000
230,SB,Solomon Islands,Country,2019,0.715,715000,OCEANIA,43155000
231,TO,Tonga,Country,2019,0.099,99000,OCEANIA,43155000
232,TV,Tuvalu,Country,2019,0.010,10000,OCEANIA,43155000
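
The loop above relies on each sub-region row appearing before its countries. Under that same assumption, the idea can also be sketched with a forward fill. This is only an illustrative alternative, not the code that produced the output above; `toy_df` is a hypothetical miniature of WPDS_df whose values are copied from the rows shown, and it ignores the special handling of the WORLD row.

```python
import pandas as pd

# Hypothetical miniature of WPDS_df; values copied from the rows shown above.
toy_df = pd.DataFrame({
    'Name': ['NORTHERN AFRICA', 'Algeria', 'Egypt', 'WESTERN AFRICA'],
    'Type': ['Sub-Region', 'Country', 'Country', 'Sub-Region'],
    'Population': [244344000, 44357000, 100803000, 401115000],
})

is_region = toy_df['Type'] == 'Sub-Region'

# Keep the name and population only on sub-region rows (NaN elsewhere),
# then carry them down to the country rows that follow.
toy_df['Region'] = toy_df['Name'].where(is_region).ffill()
toy_df['region_population'] = (
    toy_df['Population'].where(is_region).ffill().astype('int64')
)

print(toy_df[['Name', 'Region', 'region_population']])
```

The forward fill only gives sensible results if the first row is a region row; otherwise the leading country rows would be left without a region.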


In [93]:
# Separate all-UPPERCASE names from mixed-case ones. UPPERCASE names are regions; mixed-case names are countries.
is_region = WPDS_df.Name.str.isupper()
regions_df = WPDS_df.loc[is_region]
countries_df = WPDS_df.loc[~is_region]

In [94]:
regions_df

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population,Region,region_population
0,WORLD,WORLD,World,2019,7772.85,7772850000,WORLD,7772850000
1,AFRICA,AFRICA,Sub-Region,2019,1337.918,1337918000,AFRICA,1337918000
2,NORTHERN AFRICA,NORTHERN AFRICA,Sub-Region,2019,244.344,244344000,NORTHERN AFRICA,244344000
10,WESTERN AFRICA,WESTERN AFRICA,Sub-Region,2019,401.115,401115000,WESTERN AFRICA,401115000
27,EASTERN AFRICA,EASTERN AFRICA,Sub-Region,2019,444.97,444970000,EASTERN AFRICA,444970000
48,MIDDLE AFRICA,MIDDLE AFRICA,Sub-Region,2019,179.757,179757000,MIDDLE AFRICA,179757000
58,SOUTHERN AFRICA,SOUTHERN AFRICA,Sub-Region,2019,67.732,67732000,SOUTHERN AFRICA,67732000
64,NORTHERN AMERICA,NORTHERN AMERICA,Sub-Region,2019,368.193,368193000,NORTHERN AMERICA,368193000
67,LATIN AMERICA AND THE CARIBBEAN,LATIN AMERICA AND THE CARIBBEAN,Sub-Region,2019,651.036,651036000,LATIN AMERICA AND THE CARIBBEAN,651036000
68,CENTRAL AMERICA,CENTRAL AMERICA,Sub-Region,2019,178.611,178611000,CENTRAL AMERICA,178611000


In [95]:
countries_df

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population,Region,region_population
3,DZ,Algeria,Country,2019,44.357,44357000,NORTHERN AFRICA,244344000
4,EG,Egypt,Country,2019,100.803,100803000,NORTHERN AFRICA,244344000
5,LY,Libya,Country,2019,6.891,6891000,NORTHERN AFRICA,244344000
6,MA,Morocco,Country,2019,35.952,35952000,NORTHERN AFRICA,244344000
7,SD,Sudan,Country,2019,43.849,43849000,NORTHERN AFRICA,244344000
...,...,...,...,...,...,...,...,...
229,WS,Samoa,Country,2019,0.200,200000,OCEANIA,43155000
230,SB,Solomon Islands,Country,2019,0.715,715000,OCEANIA,43155000
231,TO,Tonga,Country,2019,0.099,99000,OCEANIA,43155000
232,TV,Tuvalu,Country,2019,0.010,10000,OCEANIA,43155000


### Step 3: Getting Article Quality Predictions

Now we need to get the predicted quality scores for each article in the Wikipedia dataset. We use ORES, a machine learning service that estimates the quality of Wikipedia articles. The name was originally an acronym for "Objective Revision Evaluation Service", but the service is now simply called "ORES". The article quality classes are, from best to worst:

1. FA - Featured article
2. GA - Good article
3. B - B-class article
4. C - C-class article
5. Start - Start-class article
6. Stub - Stub-class article

These classes were learned from articles in Wikipedia that were peer-reviewed using the [Wikipedia content assessment procedures](https://en.wikipedia.org/wiki/Wikipedia:Content_assessment). They are a subset of the quality assessment categories developed by Wikipedia editors. ORES will assign one of these 6 categories to any rev_id we send it.

To get the score estimates, we will build a list of all the rev_ids in page_df and feed them one at a time to ORES. Each query will return a generator object which we will collect in a list called 'results'.

In [8]:
ores_session = api.Session("https://ores.wikimedia.org", "DATA 512 Class project <swheele@uw.edu>")

revids = list(page_df.rev_id)
results = []

for revid in revids:
    results.append(ores_session.score("enwiki", ["articlequality"], [revid]))

In [9]:
# Empty list to populate with (rev_id, score) tuples.
scores = []

Here is an example of one of the generator objects that is stored in 'results'. **We will only be concerned with the 'prediction' field.**

In [11]:
for score in results[0]:
    print(score)

{'articlequality': {'score': {'prediction': 'Stub', 'probability': {'B': 0.005643168767502225, 'C': 0.005641424870624224, 'FA': 0.0010757577110297029, 'GA': 0.001543343686495854, 'Start': 0.010537503531047517, 'Stub': 0.9755588014333005}}}}
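
The extraction step that follows can be sketched as a small helper that pulls out the 'prediction' field and falls back to NaN when the response carries an 'error' entry. The helper name `extract_prediction` is hypothetical, and the contents of the error entry below are an assumption; only the presence of an 'error' key is taken from this notebook.

```python
import math

def extract_prediction(score):
    """Return the predicted quality class, or NaN if the response holds an error."""
    result = score['articlequality']
    if 'error' in result:
        return math.nan
    return result['score']['prediction']

# Response shaped like the example above (probabilities omitted for brevity).
ok = {'articlequality': {'score': {'prediction': 'Stub', 'probability': {}}}}

# Hypothetical error response; what sits under 'error' is an assumption.
bad = {'articlequality': {'error': {'message': 'revision not found'}}}

print(extract_prediction(ok))
print(extract_prediction(bad))
```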


In [12]:
for i in tqdm(range(len(results))):
    for score in results[i]:
        if 'error' in list(score['articlequality'].keys()):
            scores.append((revids[i], np.nan))
        else:
            scores.append((revids[i], score['articlequality']['score']['prediction']))

100%|███████████████████████████████████| 46701/46701 [1:11:41<00:00, 10.86it/s]


In [13]:
# Convert scores, a list of (rev_id, score) tuples, to a DataFrame.
scores_df = pd.DataFrame(scores, columns=['rev_id', 'score'])
scores_df

Unnamed: 0,rev_id,score
0,393276188,Stub
1,393822005,Stub
2,395521877,Stub
3,395526568,Stub
4,401577829,Stub
...,...,...
46695,807482007,GA
46696,807483006,C
46697,807483153,GA
46698,807483270,C


In [96]:
# scores_df.to_csv('../data/scores_df.csv')
scores_df = pd.read_csv('../data/scores_df.csv')

Here we merge scores_df with page_df on rev_id. We use a left merge to retain all the rows of page_df. Then we separate the articles for which ORES was unable to determine a score (score is NaN) from the articles with valid scores.

In [97]:
page_df = page_df.merge(scores_df, how='left', on='rev_id')

In [98]:
page_df

Unnamed: 0.1,page,country,rev_id,Unnamed: 0,score
0,Bir I of Kanem,Chad,355319463,,
1,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188,0.0,Stub
2,Yos Por,Cambodia,393822005,1.0,Stub
3,Julius Gregr,Czech Republic,395521877,2.0,Stub
4,Edvard Gregr,Czech Republic,395526568,3.0,Stub
...,...,...,...,...,...
46696,Yahya Jammeh,Gambia,807482007,46695.0,GA
46697,Lucius Fairchild,United States,807483006,46696.0,C
46698,Fahd of Saudi Arabia,Saudi Arabia,807483153,46697.0,GA
46699,Francis Fessenden,United States,807483270,46698.0,C


In [99]:
nan_scores_df = page_df[page_df['score'].isna()]
articles_df = page_df[~page_df['score'].isna()]
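
As a minimal sketch of the left merge and the NaN split, here is the same pattern on three rows copied from the tables above. The frame names `pages` and `toy_scores` are hypothetical stand-ins for page_df and scores_df.

```python
import pandas as pd

# Three pages from the table above; the first has no score in scores_df.
pages = pd.DataFrame({
    'page': ['Bir I of Kanem', 'Yos Por', 'Julius Gregr'],
    'rev_id': [355319463, 393822005, 395521877],
})
toy_scores = pd.DataFrame({
    'rev_id': [393822005, 395521877],
    'score': ['Stub', 'Stub'],
})

# A left merge keeps every page; pages without a score get NaN.
merged = pages.merge(toy_scores, how='left', on='rev_id')

missing = merged[merged['score'].isna()]
scored = merged[merged['score'].notna()]

print(missing['page'].tolist())
print(len(scored))
```

The extra `Unnamed: 0` columns in the merged output come from saving and reloading a DataFrame together with its index; passing `index=False` to `to_csv` is one way to avoid them.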

### Step 4: Combining the Datasets

We need to merge the Wikipedia data and the population data. Both have fields containing country names, which we will use for the merge. After merging, we will find that some entries could not be matched: either the population dataset has no entry for the equivalent Wikipedia country, or vice versa.

We will use an outer merge to retain all rows from both dataframes. Then we will remove any rows that are missing an article name, a matching country in the population data, or a score, and save them to a CSV file called: `wp_wpds_countries-no_match.csv`

The remaining data will be consolidated into a single CSV file called: `wp_wpds_politicians_by_country.csv`.

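The schema for that file looks like this (the last two columns are kept for the regional analysis in Step 5):

| Column              |
|---------------------|
| country             |
| article_name        |
| revision_id         |
| article_quality_est |
| population          |
| region              |
| region_population   |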

In [103]:
merged_df = articles_df.merge(countries_df, how='outer', left_on='country', right_on='Name')

merged_df

Unnamed: 0.1,page,country,rev_id,Unnamed: 0,score,FIPS,Name,Type,TimeFrame,Data (M),Population,Region,region_population
0,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188.0,0.0,Stub,PS,Palestinian Territory,Country,2019.0,5.008,5008000.0,WESTERN ASIA,280927000.0
1,Finance Minister of the Palestinian National A...,Palestinian Territory,596181202.0,37.0,Start,PS,Palestinian Territory,Country,2019.0,5.008,5008000.0,WESTERN ASIA,280927000.0
2,Planning Minister of the Palestinian National ...,Palestinian Territory,633612729.0,70.0,Start,PS,Palestinian Territory,Country,2019.0,5.008,5008000.0,WESTERN ASIA,280927000.0
3,Hossam Arafat (politician),Palestinian Territory,680933208.0,285.0,Stub,PS,Palestinian Territory,Country,2019.0,5.008,5008000.0,WESTERN ASIA,280927000.0
4,Tawfik Tirawi,Palestinian Territory,701106976.0,614.0,Start,PS,Palestinian Territory,Country,2019.0,5.008,5008000.0,WESTERN ASIA,280927000.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
46446,,,,,,PF,French Polynesia,Country,2019.0,0.280,280000.0,OCEANIA,43155000.0
46447,,,,,,GU,Guam,Country,2019.0,0.175,175000.0,OCEANIA,43155000.0
46448,,,,,,NC,New Caledonia,Country,2019.0,0.295,295000.0,OCEANIA,43155000.0
46449,,,,,,PW,Palau,Country,2019.0,0.018,18000.0,OCEANIA,43155000.0


In [104]:
no_matches_df = merged_df.loc[(merged_df.score.isna()) | (merged_df.Name.isna()) | (merged_df.page.isna())]
merged_df = merged_df.drop(index=no_matches_df.index)
merged_df = merged_df.drop(columns=['Name', 'Type', 'FIPS', 'TimeFrame', 'Data (M)', 'Unnamed: 0']).rename(columns={'page': 'article_name', 'rev_id': 'revision_id', 'score': 'article_quality_est', 'Population': 'population', 'Region': 'region'})

merged_df

Unnamed: 0,article_name,country,revision_id,article_quality_est,population,region,region_population
0,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188.0,Stub,5008000.0,WESTERN ASIA,280927000.0
1,Finance Minister of the Palestinian National A...,Palestinian Territory,596181202.0,Start,5008000.0,WESTERN ASIA,280927000.0
2,Planning Minister of the Palestinian National ...,Palestinian Territory,633612729.0,Start,5008000.0,WESTERN ASIA,280927000.0
3,Hossam Arafat (politician),Palestinian Territory,680933208.0,Stub,5008000.0,WESTERN ASIA,280927000.0
4,Tawfik Tirawi,Palestinian Territory,701106976.0,Start,5008000.0,WESTERN ASIA,280927000.0
...,...,...,...,...,...,...,...
46413,Rita Sinon,Seychelles,800323154.0,Stub,98000.0,EASTERN AFRICA,444970000.0
46414,Sylvette Frichot,Seychelles,800323798.0,Stub,98000.0,EASTERN AFRICA,444970000.0
46415,May De Silva,Seychelles,800969960.0,Start,98000.0,EASTERN AFRICA,444970000.0
46416,Vincent Meriton,Seychelles,802051093.0,Stub,98000.0,EASTERN AFRICA,444970000.0


In [105]:
no_matches_df.to_csv('../data/wp_wpds_countries-no_match.csv')
merged_df.to_csv('../data/wp_wpds_politicians_by_country.csv')
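
One way to see which side of the outer merge a row came from is the `indicator` option of `DataFrame.merge`. The sketch below is an illustrative alternative to the `isna()` filter above, not the method used in this notebook. The page names and the unmatched country are hypothetical; the two population figures are copied from the tables above.

```python
import pandas as pd

# Hypothetical article rows (placeholder page and country names).
articles = pd.DataFrame({
    'page': ['page_a', 'page_b'],
    'country': ['Egypt', 'Hypothetical Country'],
})
# Population rows; figures copied from the WPDS table above.
populations = pd.DataFrame({
    'Name': ['Egypt', 'Palau'],
    'Population': [100803000, 18000],
})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'.
merged = articles.merge(
    populations, how='outer', left_on='country', right_on='Name', indicator=True
)

matched = merged[merged['_merge'] == 'both']
unmatched = merged[merged['_merge'] != 'both']

print(matched[['page', 'country', 'Population']])
print(unmatched[['country', 'Name', '_merge']])
```

Here `left_only` rows are articles whose country has no population entry, and `right_only` rows are countries with no articles.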

### Step 5: Analysis

The analysis will consist of calculating two percentages for each country AND for each geographic region: articles-per-population and the proportion of high-quality articles. By "high quality" articles, we mean articles about politicians in a given country that ORES predicted would be in either the "FA" (featured article) or "GA" (good article) class. In this notebook, articles-per-population is computed from the FA and GA articles only.

**Examples:**
- If a country has a population of 10,000 people, and you found 10 FA or GA class articles about politicians from that country, then the percentage of articles-per-population would be 0.1%.
- If a country has 10 articles about politicians, and 2 of them are FA or GA class articles, then the percentage of high-quality articles would be 20%.
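
The arithmetic in the two examples above can be checked with a short sketch. The helper name `percentage` is hypothetical.

```python
def percentage(part, whole):
    """Return part as a percentage of whole."""
    return 100 * part / whole

# 10 FA or GA articles for a population of 10,000 people.
articles_per_pop = percentage(10, 10_000)

# 2 FA or GA articles out of 10 articles in total.
high_quality = percentage(2, 10)

print(articles_per_pop)  # 0.1
print(high_quality)      # 20.0
```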


In [106]:
groupby_country_df = merged_df.groupby(['country', 'article_quality_est']).agg({'revision_id': 'count', 'population': 'first'})

groupby_country_df

Unnamed: 0_level_0,Unnamed: 1_level_0,revision_id,population
country,article_quality_est,Unnamed: 2_level_1,Unnamed: 3_level_1
Afghanistan,B,8,38928000.0
Afghanistan,C,46,38928000.0
Afghanistan,FA,1,38928000.0
Afghanistan,GA,12,38928000.0
Afghanistan,Start,99,38928000.0
...,...,...,...
Zimbabwe,C,17,14863000.0
Zimbabwe,FA,1,14863000.0
Zimbabwe,GA,1,14863000.0
Zimbabwe,Start,48,14863000.0


We have 183 countries represented in our final dataset that have at least one Wikipedia article with an estimated score.

In [107]:
len(merged_df.country.unique())

183

In [108]:
data_by_country = []
countries = merged_df.country.unique()

for country in countries:

    if (groupby_country_df.index.isin([(country, 'FA')]).any()) | (groupby_country_df.index.isin([(country, 'GA')]).any()):
        articles_sum = groupby_country_df.loc[(country, ['FA', 'GA']), :].revision_id.sum()
    else:
        articles_sum = 0

    articles_per_pop = ( articles_sum/groupby_country_df.loc[(country, slice(None)), :].population.iloc[0] ) * 100
    high_quality = ( articles_sum/groupby_country_df.loc[(country, slice(None)), :].revision_id.sum() ) * 100
    
    data_by_country.append([country, articles_per_pop, high_quality])

results_by_country_df = pd.DataFrame(data_by_country, columns=['country', 'articles_per_pop', 'high_quality'])
results_by_country_df

Unnamed: 0,country,articles_per_pop,high_quality
0,Palestinian Territory,0.000140,3.910615
1,Cambodia,0.000026,1.877934
2,Canada,0.000063,2.860548
3,Egypt,0.000010,4.273504
4,Pakistan,0.000007,1.472031
...,...,...,...
178,Barbados,0.000000,0.000000
179,Belize,0.000000,0.000000
180,Djibouti,0.000000,0.000000
181,Zambia,0.000000,0.000000
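
The per-country loop above can also be expressed as a single grouped aggregation. The sketch below is an illustrative alternative on hypothetical toy data shaped like merged_df (one row per article); the country names 'A' and 'B', the populations, and the `toy` and `summary` names are all made up for the example.

```python
import pandas as pd

# Hypothetical toy data in the shape of merged_df (one row per article).
toy = pd.DataFrame({
    'country': ['A', 'A', 'A', 'A', 'B', 'B'],
    'article_quality_est': ['FA', 'GA', 'Stub', 'Start', 'Stub', 'C'],
    'population': [1000, 1000, 1000, 1000, 500, 500],
})

# Flag the articles that count as high quality (FA or GA).
toy['is_high_quality'] = toy['article_quality_est'].isin(['FA', 'GA'])

summary = toy.groupby('country').agg(
    high_quality_count=('is_high_quality', 'sum'),
    article_count=('article_quality_est', 'count'),
    population=('population', 'first'),
)

# Same two percentages as in the loop above.
summary['articles_per_pop'] = 100 * summary['high_quality_count'] / summary['population']
summary['high_quality'] = 100 * summary['high_quality_count'] / summary['article_count']

print(summary[['articles_per_pop', 'high_quality']])
```

Countries with no FA or GA articles get a count of 0 from the sum, so no separate `else` branch is needed.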


In [109]:
results_by_country_df.nlargest(10, 'articles_per_pop', keep='all')

Unnamed: 0,country,articles_per_pop,high_quality
99,Tuvalu,0.04,7.407407
175,Dominica,0.001389,8.333333
121,Vanuatu,0.000935,5.172414
68,Iceland,0.000543,0.995025
33,Ireland,0.0005,6.702413
123,Montenegro,0.000322,2.777778
138,Martinique,0.000281,2.941176
124,Bhutan,0.000274,6.060606
58,New Zealand,0.000261,1.660281
47,Romania,0.000218,12.244898


In [110]:
results_by_country_df.nlargest(10, 'high_quality', keep='all')

Unnamed: 0,country,articles_per_pop,high_quality
165,"Korea, North",3.1e-05,22.222222
172,Saudi Arabia,4.3e-05,12.820513
47,Romania,0.000218,12.244898
163,Central African Republic,0.000166,12.121212
151,Uzbekistan,9e-06,10.714286
139,Mauritania,0.000108,10.416667
156,Guatemala,3.9e-05,8.433735
175,Dominica,0.001389,8.333333
49,Syria,5.2e-05,7.8125
46,Benin,5.7e-05,7.692308
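
The two tables above show only the highest-ranked countries. The lowest-ranked ones could be listed the same way with `nsmallest`; the sketch below uses a handful of rows copied from the country results above, in a hypothetical frame called `sample`, so its output is not the real bottom of the full table.

```python
import pandas as pd

# A few rows copied from results_by_country_df as shown above.
sample = pd.DataFrame({
    'country': ['Tuvalu', 'Dominica', 'Egypt', 'Barbados', 'Belize'],
    'articles_per_pop': [0.04, 0.001389, 0.000010, 0.0, 0.0],
})

# keep='all' returns every row tied for the last place, so both
# zero-valued countries appear even though only one row was requested.
lowest = sample.nsmallest(1, 'articles_per_pop', keep='all')

print(lowest)
```

Because several countries have a value of exactly 0, `keep='all'` can return more rows than requested.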


In [116]:
groupby_region_df = merged_df.groupby(['region', 'article_quality_est']).agg({'revision_id': 'count', 'region_population': 'first'})

groupby_region_df

Unnamed: 0_level_0,Unnamed: 1_level_0,revision_id,region_population
region,article_quality_est,Unnamed: 2_level_1,Unnamed: 3_level_1
CARIBBEAN,B,6,43233000.0
CARIBBEAN,C,103,43233000.0
CARIBBEAN,FA,2,43233000.0
CARIBBEAN,GA,11,43233000.0
CARIBBEAN,Start,241,43233000.0
...,...,...,...
WESTERN EUROPE,C,394,195479000.0
WESTERN EUROPE,FA,12,195479000.0
WESTERN EUROPE,GA,44,195479000.0
WESTERN EUROPE,Start,1281,195479000.0


We have 19 unique regions.

In [112]:
len(merged_df.region.unique())

19

In [113]:
data_by_region = []
regions = merged_df.region.unique()

for region in regions:

    if (groupby_region_df.index.isin([(region, 'FA')]).any()) | (groupby_region_df.index.isin([(region, 'GA')]).any()):
        articles_sum = groupby_region_df.loc[(region, ['FA', 'GA']), :].revision_id.sum()
    else:
        articles_sum = 0

    articles_per_pop = ( articles_sum/groupby_region_df.loc[(region, slice(None)), :].region_population.iloc[0] ) * 100
    high_quality = ( articles_sum/groupby_region_df.loc[(region, slice(None)), :].revision_id.sum() ) * 100
    
    data_by_region.append([region, articles_per_pop, high_quality])

results_by_region_df = pd.DataFrame(data_by_region, columns=['region', 'articles_per_pop', 'high_quality'])
results_by_region_df

Unnamed: 0,region,articles_per_pop,high_quality
0,WESTERN ASIA,3.2e-05,3.472493
1,SOUTHEAST ASIA,1.1e-05,3.613861
2,NORTHERN AMERICA,2.8e-05,5.470805
3,NORTHERN AFRICA,8e-06,2.113459
4,SOUTH ASIA,4e-06,1.626202
5,MIDDLE AFRICA,9e-06,2.409639
6,WESTERN EUROPE,2.9e-05,1.22807
7,EASTERN AFRICA,8e-06,1.398881
8,CENTRAL AMERICA,1.3e-05,1.490603
9,EASTERN EUROPE,4e-05,3.161844
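
The table above is in order of first appearance rather than ranked. A ranking can be sketched with `sort_values`; the example below uses four rows copied from the regional results above, in a hypothetical frame called `sample_regions`, so the order it prints applies to those four rows only.

```python
import pandas as pd

# Four rows copied from the regional results table above.
sample_regions = pd.DataFrame({
    'region': ['WESTERN ASIA', 'NORTHERN AMERICA', 'SOUTH ASIA', 'EASTERN EUROPE'],
    'articles_per_pop': [3.2e-05, 2.8e-05, 4e-06, 4e-05],
    'high_quality': [3.472493, 5.470805, 1.626202, 3.161844],
})

# Rank once by coverage and once by proportion of high-quality articles.
by_coverage = sample_regions.sort_values('articles_per_pop', ascending=False)
by_quality = sample_regions.sort_values('high_quality', ascending=False)

print(by_coverage[['region', 'articles_per_pop']].to_string(index=False))
print(by_quality[['region', 'high_quality']].to_string(index=False))
```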
