# A2 - Bias in Data

The goal of this assignment is to explore the concept of bias through data on Wikipedia articles - specifically, articles on political figures from a variety of countries. For this assignment, we will combine a dataset of Wikipedia articles with a dataset of country populations, and use a machine learning service called ORES to estimate the quality of each article.

We will then perform an analysis of how the coverage of politicians on Wikipedia and the quality of articles about politicians varies between countries. The analysis will consist of a series of tables that show:

1. The countries with the greatest and least coverage of politicians on Wikipedia compared to their population.
2. The countries with the highest and lowest proportion of high quality articles about politicians.
3. A ranking of geographic regions by articles-per-person and proportion of high quality articles.
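
The two per-country measures behind these tables can be sketched on toy data. This is only an illustration: the countries, populations, and article counts below are invented, the column names are assumptions borrowed from the schema used later in this notebook, and "high quality" is taken to mean an FA or GA rating, as in the analysis step.

```python
import pandas as pd

# Toy data: invented numbers, not taken from the real datasets.
toy = pd.DataFrame({
    "country": ["A", "A", "A", "A", "B", "B"],
    "article_quality_est": ["FA", "Stub", "Stub", "GA", "Stub", "C"],
    "population": [1000, 1000, 1000, 1000, 500, 500],
})

# Flag FA and GA articles as "high quality".
toy["is_high_quality"] = toy["article_quality_est"].isin(["FA", "GA"])

# One row per country: article count, high quality count, and population.
summary = toy.groupby("country").agg(
    articles=("article_quality_est", "size"),
    high_quality=("is_high_quality", "sum"),
    population=("population", "first"),
)

# Coverage: articles as a percentage of population.
summary["articles_per_pop_pct"] = summary["articles"] / summary["population"] * 100
# Quality: high quality articles as a percentage of all articles.
summary["high_quality_pct"] = summary["high_quality"] / summary["articles"] * 100
print(summary)
```

Sorting `summary` by either percentage column would give the kind of ranking the tables call for.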

### Step 1: Getting the Article and Population Data

The first step is getting the data. The Wikipedia [politicians by country dataset](https://figshare.com/articles/Untitled_Item/5513449) can be found on Figshare. In this notebook it is stored as `page_data.csv`.
The population data is stored as `WPDS_2020_data.csv`. This dataset is drawn from the [world population data sheet](https://www.prb.org/international/indicator/population/table/) published by the Population Reference Bureau.

Our analysis will also use score estimates generated by ORES. You must `pip install ores` prior to running this notebook, or follow the [installation instructions](https://github.com/wikimedia/ores).

In [1]:
import pandas as pd
import numpy as np
from ores import api
from tqdm import tqdm

In [2]:
page_data_path = '../data/page_data.csv'
WPDS_path = '../data/WPDS_2020_data.csv'

page_df = pd.read_csv(page_data_path)
WPDS_df = pd.read_csv(WPDS_path)

### Step 2: Cleaning the Data

Both page_df and WPDS_df contain some rows we will need to filter out or ignore. We will clean the datasets here.

In [3]:
# Filter out any rows whose page name begins with 'Template:'. These are not Wikipedia articles.
page_df = page_df.loc[~page_df['page'].str.startswith('Template:')]

Here we add a 'Region' column to WPDS_df so we have a way to associate each country with its sub-region. Then we split the regions and countries into separate dataframes.
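
The idea of carrying each sub-region name down to the countries listed beneath it can also be expressed without an explicit loop. The following is a minimal sketch on toy rows, not the method this notebook uses: it assumes the same `Name` and `Type` columns and that every sub-region row appears directly above its countries. Rows before the first sub-region would be left empty by this approach.

```python
import pandas as pd

# Toy rows in the same layout as the population data: a sub-region row,
# followed by the countries that belong to it.
toy = pd.DataFrame({
    "Name": ["NORTHERN AFRICA", "Algeria", "Egypt", "WESTERN AFRICA", "Benin"],
    "Type": ["Sub-Region", "Country", "Country", "Sub-Region", "Country"],
})

# Keep the name only on sub-region rows (NaN elsewhere), then forward-fill
# so each country inherits the most recent sub-region above it.
toy["Region"] = toy["Name"].where(toy["Type"] == "Sub-Region").ffill()
print(toy)
```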

In [4]:
# Add each row's sub-region to WPDS_df. Sub-region rows precede the countries they
# contain, so we walk the rows in order and carry the latest sub-region name forward.
region = 'NORTHERN AFRICA'

# The first three rows (WORLD, AFRICA, NORTHERN AFRICA) are labelled by hand.
regions = ['WORLD', 'AFRICA', 'NORTHERN AFRICA']
for i in range(3, len(WPDS_df)):
    if WPDS_df.iloc[i]['Type'] == 'Sub-Region':
        region = WPDS_df.iloc[i]['Name']
    regions.append(region)

WPDS_df['Region'] = regions

# Separate all UPPERCASE entries from the rest. UPPERCASE names are regions; all other names are countries.
is_region = WPDS_df.Name.str.isupper()
regions_df = WPDS_df.loc[is_region]
countries_df = WPDS_df.loc[~is_region]

In [5]:
regions_df

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population,Region
0,WORLD,WORLD,World,2019,7772.85,7772850000,WORLD
1,AFRICA,AFRICA,Sub-Region,2019,1337.918,1337918000,AFRICA
2,NORTHERN AFRICA,NORTHERN AFRICA,Sub-Region,2019,244.344,244344000,NORTHERN AFRICA
10,WESTERN AFRICA,WESTERN AFRICA,Sub-Region,2019,401.115,401115000,WESTERN AFRICA
27,EASTERN AFRICA,EASTERN AFRICA,Sub-Region,2019,444.97,444970000,EASTERN AFRICA
48,MIDDLE AFRICA,MIDDLE AFRICA,Sub-Region,2019,179.757,179757000,MIDDLE AFRICA
58,SOUTHERN AFRICA,SOUTHERN AFRICA,Sub-Region,2019,67.732,67732000,SOUTHERN AFRICA
64,NORTHERN AMERICA,NORTHERN AMERICA,Sub-Region,2019,368.193,368193000,NORTHERN AMERICA
67,LATIN AMERICA AND THE CARIBBEAN,LATIN AMERICA AND THE CARIBBEAN,Sub-Region,2019,651.036,651036000,LATIN AMERICA AND THE CARIBBEAN
68,CENTRAL AMERICA,CENTRAL AMERICA,Sub-Region,2019,178.611,178611000,CENTRAL AMERICA


In [6]:
countries_df

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population,Region
3,DZ,Algeria,Country,2019,44.357,44357000,NORTHERN AFRICA
4,EG,Egypt,Country,2019,100.803,100803000,NORTHERN AFRICA
5,LY,Libya,Country,2019,6.891,6891000,NORTHERN AFRICA
6,MA,Morocco,Country,2019,35.952,35952000,NORTHERN AFRICA
7,SD,Sudan,Country,2019,43.849,43849000,NORTHERN AFRICA
...,...,...,...,...,...,...,...
229,WS,Samoa,Country,2019,0.200,200000,OCEANIA
230,SB,Solomon Islands,Country,2019,0.715,715000,OCEANIA
231,TO,Tonga,Country,2019,0.099,99000,OCEANIA
232,TV,Tuvalu,Country,2019,0.010,10000,OCEANIA


In [None]:
page_df

### Step 3: Getting Article Quality Predictions

Now we need to get the predicted quality scores for each article in the Wikipedia dataset. We're using ORES, a machine learning service that provides estimates of Wikipedia article quality. The name was originally an acronym for "Objective Revision Evaluation Service", but the service was later renamed simply “ORES”. The article quality estimates are, from best to worst:

1. FA - Featured article
2. GA - Good article
3. B - B-class article
4. C - C-class article
5. Start - Start-class article
6. Stub - Stub-class article

These classes were learned from Wikipedia articles that were peer-reviewed using the [Wikipedia content assessment procedures](https://en.wikipedia.org/wiki/Wikipedia:Content_assessment). They are a subset of the quality assessment categories developed by Wikipedia editors. ORES will assign one of these six categories to any rev_id we send it.
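
Because the six classes are ordered, they can be represented as an ordered categorical so that comparisons such as "GA or better" work directly. This is a small sketch on invented predictions. It is not something ORES or this notebook does for you, and the variable names are illustrative.

```python
import pandas as pd

# Quality classes from worst to best, matching the list above.
quality_order = ["Stub", "Start", "C", "B", "GA", "FA"]

# Invented predictions standing in for ORES output.
predictions = pd.Series(["Stub", "GA", "C", "FA", "Start"])
quality = pd.Categorical(predictions, categories=quality_order, ordered=True)

# With an ordered categorical, "GA or better" is a simple comparison.
is_high_quality = quality >= "GA"
print(list(is_high_quality))
```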

To get the score estimates, we will build a list of all the rev_ids in page_df and feed them one at a time to ORES. Each query will return a generator object which we will collect in a list called 'results'.
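
Each score that comes back is a nested dictionary keyed by model name, holding either a `score` entry or an `error` entry. Pulling out the prediction can be sketched as a small function. Note that `extract_prediction` is a hypothetical helper, not part of the `ores` package, and the contents of the error document below are a placeholder rather than a real ORES response.

```python
import numpy as np

def extract_prediction(score_doc, model="articlequality"):
    """Return the predicted class from one score document, or NaN if scoring failed.

    Hypothetical helper for illustration; not provided by the `ores` package.
    """
    model_doc = score_doc.get(model, {})
    if "error" in model_doc or "score" not in model_doc:
        return np.nan
    return model_doc["score"]["prediction"]

# A successful document, in the shape returned for a scored revision.
ok_doc = {"articlequality": {"score": {"prediction": "Stub", "probability": {"Stub": 0.98}}}}
# A failed document; the inner content is an illustrative placeholder.
error_doc = {"articlequality": {"error": {"message": "illustrative placeholder"}}}

print(extract_prediction(ok_doc))
print(extract_prediction(error_doc))
```

Returning NaN for failures keeps every rev_id in the output, so unscored articles can be separated later instead of being silently dropped.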

In [8]:
ores_session = api.Session("https://ores.wikimedia.org", "DATA 512 Class project <swheele@uw.edu>")

revids = list(page_df.rev_id)
results = []

for revid in revids:
    results.append(ores_session.score("enwiki", ["articlequality"], [revid]))

In [9]:
# Empty list to populate with (rev_id, score) tuples.
scores = []

Here is an example of one of the generator objects that is stored in 'results'. **We will only be concerned with the 'prediction' field.**

In [11]:
for score in results[0]:
    print(score)

{'articlequality': {'score': {'prediction': 'Stub', 'probability': {'B': 0.005643168767502225, 'C': 0.005641424870624224, 'FA': 0.0010757577110297029, 'GA': 0.001543343686495854, 'Start': 0.010537503531047517, 'Stub': 0.9755588014333005}}}}


In [12]:
for i in tqdm(range(len(results))):
    for score in results[i]:
        # ORES returns an 'error' entry instead of a 'score' entry when it cannot score a revision.
        if 'error' in score['articlequality']:
            scores.append((revids[i], np.nan))
        else:
            scores.append((revids[i], score['articlequality']['score']['prediction']))

100%|███████████████████████████████████| 46701/46701 [1:11:41<00:00, 10.86it/s]


In [13]:
# Convert scores which is a list of tuples to a dataframe
scores_df = pd.DataFrame(scores, columns=['rev_id', 'score'])
scores_df

Unnamed: 0,rev_id,score
0,393276188,Stub
1,393822005,Stub
2,395521877,Stub
3,395526568,Stub
4,401577829,Stub
...,...,...
46695,807482007,GA
46696,807483006,C
46697,807483153,GA
46698,807483270,C


Here we merge scores_df with page_df on rev_id. We use a left merge to retain all the rows of page_df. Then we separate the articles for which ORES was unable to determine a score (`score` is NaN) from the articles with valid scores.
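
The effect of the left merge and the NaN split can be seen on a toy example. The frames below are invented for illustration. One page has no score row at all and one has an explicit NaN score, and both end up in the unscored set.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for page_df and scores_df (invented values).
pages = pd.DataFrame({"page": ["P1", "P2", "P3"], "rev_id": [1, 2, 3]})
scores = pd.DataFrame({"rev_id": [2, 3], "score": ["Stub", np.nan]})

# A left merge keeps every page, filling score with NaN where no match exists.
merged = pages.merge(scores, how="left", on="rev_id")

unscored = merged[merged["score"].isna()]
scored = merged[merged["score"].notna()]
print(unscored)
print(scored)
```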

In [15]:
page_df = page_df.merge(scores_df, how='left', on='rev_id')

In [16]:
page_df

Unnamed: 0,page,country,rev_id,score
0,Bir I of Kanem,Chad,355319463,
1,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188,Stub
2,Yos Por,Cambodia,393822005,Stub
3,Julius Gregr,Czech Republic,395521877,Stub
4,Edvard Gregr,Czech Republic,395526568,Stub
...,...,...,...,...
46696,Yahya Jammeh,Gambia,807482007,GA
46697,Lucius Fairchild,United States,807483006,C
46698,Fahd of Saudi Arabia,Saudi Arabia,807483153,GA
46699,Francis Fessenden,United States,807483270,C


In [19]:
nan_scores_df = page_df[page_df['score'].isna()]
articles_df = page_df[~page_df['score'].isna()]

### Step 4: Combining the Datasets

We need to merge the Wikipedia data and population data together. Both have fields containing country names, which we will use for the merge. After merging the data, we will find that some entries could not be matched: either the population dataset does not have an entry for the equivalent Wikipedia country, or vice versa.

We will use an outer merge to retain all rows from both dataframes. Then we will remove any rows that are missing an article, a matching country in the population data, or a score, and save them to a CSV file called `wp_wpds_countries-no_match.csv`.

The remaining data will be consolidated into a single CSV file called: `wp_wpds_politicians_by_country.csv`.

The schema for that file looks like this:

| Column              |
|---------------------|
| country             |
| article_name        |
| revision_id         |
| article_quality_est |
| population          |
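
One way to see which rows fail to match in an outer merge is the `indicator` option of `DataFrame.merge`, which adds a `_merge` column recording where each row came from. This is a sketch on invented country names and populations, offered as an alternative view rather than the approach this notebook takes.

```python
import pandas as pd

# Toy stand-ins for the article and population tables (invented values).
articles = pd.DataFrame({"page": ["P1", "P2"], "country": ["Country A", "Country B"]})
countries = pd.DataFrame({"Name": ["Country A", "Country C"], "Population": [1000, 2000]})

merged = articles.merge(
    countries, how="outer", left_on="country", right_on="Name", indicator=True
)

# 'both' marks matched rows; 'left_only' and 'right_only' mark unmatched rows.
matched = merged[merged["_merge"] == "both"]
no_match = merged[merged["_merge"] != "both"]
print(no_match)
```

Here "Country B" has an article but no population entry, and "Country C" has a population entry but no article, so both land in `no_match`.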

In [24]:
merged_df = articles_df.merge(countries_df, how='outer', left_on='country', right_on='Name')

In [25]:
merged_df

Unnamed: 0,page,country,rev_id,score,FIPS,Name,Type,TimeFrame,Data (M),Population,Region
0,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188.0,Stub,PS,Palestinian Territory,Country,2019.0,5.008,5008000.0,WESTERN ASIA
1,Finance Minister of the Palestinian National A...,Palestinian Territory,596181202.0,Start,PS,Palestinian Territory,Country,2019.0,5.008,5008000.0,WESTERN ASIA
2,Planning Minister of the Palestinian National ...,Palestinian Territory,633612729.0,Start,PS,Palestinian Territory,Country,2019.0,5.008,5008000.0,WESTERN ASIA
3,Hossam Arafat (politician),Palestinian Territory,680933208.0,Stub,PS,Palestinian Territory,Country,2019.0,5.008,5008000.0,WESTERN ASIA
4,Tawfik Tirawi,Palestinian Territory,701106976.0,Start,PS,Palestinian Territory,Country,2019.0,5.008,5008000.0,WESTERN ASIA
...,...,...,...,...,...,...,...,...,...,...,...
46446,,,,,PF,French Polynesia,Country,2019.0,0.280,280000.0,OCEANIA
46447,,,,,GU,Guam,Country,2019.0,0.175,175000.0,OCEANIA
46448,,,,,NC,New Caledonia,Country,2019.0,0.295,295000.0,OCEANIA
46449,,,,,PW,Palau,Country,2019.0,0.018,18000.0,OCEANIA


In [None]:
# Rows with no score, no matching population entry, or no article could not be matched.
no_matches_df = merged_df.loc[(merged_df.score.isna()) | (merged_df.Name.isna()) | (merged_df.page.isna())]
merged_df = merged_df.drop(index=no_matches_df.index)
# Drop the unused WPDS columns and rename the rest to match the schema above.
merged_df = merged_df.drop(columns=['Name', 'Type', 'FIPS', 'TimeFrame', 'Data (M)']).rename(columns={'page': 'article_name', 'rev_id': 'revision_id', 'score': 'article_quality_est', 'Population': 'population'})

merged_df

In [None]:
no_matches_df.to_csv('../data/wp_wpds_countries-no_match.csv')
merged_df.to_csv('../data/wp_wpds_politicians_by_country.csv', index=False)

### Step 5: Analysis

In [None]:
groupby_df = merged_df.groupby(['country', 'article_quality_est']).agg({'revision_id': 'count', 'population': 'first'})

groupby_df

In [None]:
groupby_df.loc[('United Kingdom', slice(None)), :].revision_id.sum()

In [None]:
# Select Cambodia's rows so the index holds only quality classes, then check for 'FA'.
print('FA' not in groupby_df.loc['Cambodia'].index)

In [None]:
merged_df.country.unique()

In [None]:
data = []
for country in merged_df.country.unique():
    # All quality classes for this country, indexed by article_quality_est.
    country_df = groupby_df.loc[country]
    total_articles = country_df.revision_id.sum()
    population = country_df.population.iloc[0]

    # High quality articles are those rated FA or GA; the sum is 0 if the country has neither.
    high_quality_sum = country_df.loc[country_df.index.isin(['FA', 'GA']), 'revision_id'].sum()

    # Coverage: all articles as a percentage of population.
    articles_per_pop = (total_articles / population) * 100
    # Quality: high quality articles as a percentage of all articles.
    high_quality_percent = (high_quality_sum / total_articles) * 100

    data.append([country, articles_per_pop, high_quality_percent])

results_df = pd.DataFrame(data, columns=['country', 'articles_per_pop', 'high_quality_percent'])
results_df

In [None]:
results_df.nlargest(10, 'articles_per_pop', keep='all')

In [None]:
# Compare floats with a tolerance rather than exact equality.
results_df.loc[np.isclose(results_df.high_quality_percent, 9.523809523809524)]

In [None]:
groupby_df

In [None]:
WPDS_full_df = pd.read_csv(WPDS_path)

In [None]:
regions_df

In [None]:
regions_df.shape