# Bias in Data - Quality of Wikipedia Political Pages

This notebook explores the concept of bias through data on English Wikipedia articles - specifically, articles on political figures from a variety of countries. Using the ORES machine learning API, we will rate each article based on its quality and examine which countries/regions have a larger percentage of high-quality articles on political figures.

In [469]:
import pandas as pd
import requests
import json

In [470]:
# Load page data from the 'data' folder

page_data = pd.read_csv('data/page_data.csv')

# Remove pages whose names begin with the string 'Template:'; these are not Wikipedia articles
# and will not be included in the analysis

page_data.drop(page_data[page_data['page'].str.startswith('Template:')].index, axis=0, inplace=True)

In [471]:
# Load population data from 'data' folder

population_data = pd.read_csv('data/WPDS_2020_data - WPDS_2020_data.csv')

# Separate into regional data and data by country, retain both dataframes for analysis

regional_pop_data = population_data[population_data['Name'].str.isupper()]
population_data.drop(regional_pop_data.index, axis=0, inplace=True)
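
The split above relies on a convention in this file: region and sub-region rows have all-uppercase names, while country rows do not. A minimal sketch of that idea on a toy table follows; the rows and population values are invented for illustration and are not taken from the WPDS file.

```python
import pandas as pd

# Toy stand-in for the WPDS table; the values are invented for illustration.
toy = pd.DataFrame({
    'Name': ['AFRICA', 'NORTHERN AFRICA', 'Algeria', 'Egypt', 'WESTERN AFRICA', 'Benin'],
    'Population': [1338.0, 244.0, 44.4, 100.8, 401.0, 12.2],
})

# str.isupper() is True only when every cased character in the name is uppercase
is_region = toy['Name'].str.isupper()

toy_regions = toy[is_region]      # region / sub-region rows
toy_countries = toy[~is_region]   # country rows, original index preserved
```

Both subsets keep the original row index, which is what later allows countries to be matched to the sub-region row that precedes them.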

## Rating Estimation with ORES API

Now that the datasets have been loaded and cleaned, we will use the ORES API (documentation: https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context_revid_model) to predict an article's rating out of six classes:

1. FA - Featured article
2. GA - Good article
3. B - B-class article
4. C - C-class article
5. Start - Start-class article
6. Stub - Stub-class article

FA is the best rating, and the others follow in descending order of quality.

In [472]:
# Set endpoint specifying English Wikipedia as our context, article quality as our model
# of choice, and a variable list of revision ids

endpoint = 'https://ores.wikimedia.org/v3/scores/enwiki?models=articlequality&revids={revids}'

In [473]:
# Set headers with personal GitHub account and email

headers = {
    'User-Agent': 'https://github.com/TrevorNims',
    'From': 'nimstre@uw.edu'
}

In [474]:
# Define API call to communicate with ORES scoring interface

def api_call(endpoint, parameters):
    call = requests.get(endpoint.format(**parameters), headers=headers)
    response = call.json()
    return response

In [475]:
# Create dictionary to hold revision id coupled with their respective
# estimated scores. Create list to hold revision ids that cannot be scored.

score_dict = {}
unscorables = []

# Iterate over all revision ids in 'page_data' in batches of 50 - note that
# larger batches may cause a slowdown with the API calls.
for i in range(0, page_data.shape[0], 50):
    # stop when we've reached the final revision id
    end_idx = min(i+50, page_data.shape[0])
    rev_ids = page_data['rev_id'].iloc[i:end_idx]
    # concatenate revision ids as specified in the API documentation
    revid_params = {'revids' : '|'.join(str(x) for x in rev_ids)}
    data = api_call(endpoint, revid_params)
    # for each revision id, save estimated score if it exists, otherwise save 
    # revision id in 'unscorables'
    for score in data['enwiki']['scores']:
        try:
            score_dict[score] = data['enwiki']['scores'][score]['articlequality']['score']['prediction']
        except KeyError:
            unscorables.append(score)
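
The loop above combines two steps: slicing the revision ids into batches and separating scored revisions from unscorable ones. The sketch below isolates both steps without touching the network. The helper names `batches` and `split_scores` are hypothetical, and `sample_response` is a hand-written payload that only mimics the nesting the loop above reads; it is not real ORES output.

```python
def batches(items, size=50):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def split_scores(response, context='enwiki', model='articlequality'):
    """Return ({rev_id: prediction}, [rev_ids without a prediction])."""
    scored, unscored = {}, []
    for rev_id, result in response[context]['scores'].items():
        try:
            scored[rev_id] = result[model]['score']['prediction']
        except KeyError:
            # No 'score' entry for this revision, so it cannot be rated
            unscored.append(rev_id)
    return scored, unscored


# Hand-written payload for illustration only (the second rev id is made up)
sample_response = {
    'enwiki': {
        'scores': {
            '355319463': {'articlequality': {'score': {'prediction': 'Stub'}}},
            '111': {'articlequality': {'error': {'message': 'illustrative error entry'}}},
        }
    }
}

rev_id_batches = list(batches(list(range(120)), size=50))
scored, unscored = split_scores(sample_response)
```

With 120 ids and a batch size of 50, the final batch is simply shorter than the rest, so no explicit end-index check is needed.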

In [476]:
# create dataframe of revision ids and their respective estimated scores

score_estimates = pd.DataFrame(score_dict.items(), columns=['rev_id', 'article_quality_est.'])

In [477]:
# save 'unscorables' as a .csv file in the data folder

pd.Series(unscorables).to_csv('data/unscorables.csv')

In [478]:
# Retype rev_id as an int for comparison with 'page_data' dataframe

score_estimates['rev_id'] = score_estimates['rev_id'].astype(int)

In [479]:
# merge tables on rev_id, creating a single dataframe with page information and 
# predicted score

page_data_with_scores = pd.merge(page_data, score_estimates, on='rev_id')

In [480]:
# Inspect 'page_data_with_scores'

page_data_with_scores

Unnamed: 0,page,country,rev_id,article_quality_est.
0,Bir I of Kanem,Chad,355319463,Stub
1,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188,Stub
2,Yos Por,Cambodia,393822005,Stub
3,Julius Gregr,Czech Republic,395521877,Stub
4,Edvard Gregr,Czech Republic,395526568,Stub
...,...,...,...,...
46420,Hal Bidlack,United States,807481636,C
46421,Yahya Jammeh,Gambia,807482007,GA
46422,Lucius Fairchild,United States,807483006,C
46423,Fahd of Saudi Arabia,Saudi Arabia,807483153,GA


## Analysis

Now that each page has an estimated rating, we move on to producing and measuring
two different metrics:

1. Number of total Articles per population by country
2. High Quality Articles as a proportion of total articles by country

In this analysis, we will define a High Quality Article as one that has received either an 'FA' or a 'GA' rating from the ORES API.
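
As a small illustration of this definition, the sketch below computes the second metric for a single hypothetical country. The helper names and the `example` list of ratings are made up for illustration.

```python
# Ratings that count as High Quality under the definition above
HIGH_QUALITY = {'FA', 'GA'}


def is_high_quality(rating):
    return rating in HIGH_QUALITY


def high_quality_share(ratings):
    """Percentage of ratings that are FA or GA; None for an empty list."""
    if not ratings:
        return None
    return 100 * sum(is_high_quality(r) for r in ratings) / len(ratings)


# Eight invented article ratings for one hypothetical country
example = ['Stub', 'GA', 'C', 'FA', 'Stub', 'Start', 'Stub', 'B']
share = high_quality_share(example)  # 2 of 8 ratings are FA or GA
```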

In [481]:
# rename column to match 'page_data_with_scores' format and facilitate table merging

population_data.rename({'Name' : 'country'}, axis=1, inplace=True)

In [482]:
# merge 'population_data' with 'page_data_with_scores', drop unneeded columns and 
# rename some columns to make them more ergonomic

pd_merged = pd.merge(page_data_with_scores, population_data, how='outer', on='country')
pd_merged.drop(columns=['FIPS', 'Type', 'TimeFrame', 'Data (M)'], axis=1, inplace=True)
pd_merged.rename({'page' : 'article_name', 'rev_id' : 'revision_id'}, axis=1, inplace=True)

In [483]:
# Identify and save rows in 'pd_merged' that contain a value that is NaN
# (meaning either the country does not have any scored articles in our dataset,
# or that population data for the country is not available in 'population_data')

no_match = pd_merged[pd_merged.isna().any(axis=1)]
no_match.to_csv('data/wp_wpds_countries-no_match.csv')

# Remove rows with NaN values, save remaining data to csv file in 'data' folder

pd_merged.drop(no_match.index, inplace=True)
pd_merged.to_csv('data/wp_wpds_politicians_by_country.csv')
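
An outer merge keeps rows from both tables, so a country missing on either side shows up with NaN values. One alternative way to find those rows, sketched here on toy data, is the `indicator=True` option of `pd.merge`, which records the origin of each row in a `_merge` column. The table contents are invented, and 'Atlantis' is a deliberately fictional country.

```python
import pandas as pd

# Toy article and population tables; contents are invented for illustration
articles = pd.DataFrame({
    'country': ['Chad', 'Chad', 'Atlantis'],
    'rev_id': [1, 2, 3],
})
populations = pd.DataFrame({
    'country': ['Chad', 'Iceland'],
    'Population': [16.9, 0.4],
})

# '_merge' is 'both', 'left_only' or 'right_only' for each row
merged = pd.merge(articles, populations, how='outer', on='country', indicator=True)

unmatched = merged[merged['_merge'] != 'both']
matched = merged[merged['_merge'] == 'both'].drop(columns='_merge')
```

Unlike a blanket NaN check, this only flags rows that failed to match, so a NaN that was already present in one of the input tables would not be mistaken for a missing country.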

In [484]:
# Obtain the total number of articles for each country

articles_by_country = pd_merged[['country', 'revision_id']].groupby(['country']).count()

In [485]:
# Obtain the number of High Quality articles for each country

quality_articles_by_country = pd_merged.loc[pd_merged['article_quality_est.'].isin(['GA', 'FA'])]\
                                [['country', 'revision_id']].groupby(['country']).count()

In [486]:
# Calculate the percentage of high quality articles per country

percentage_high_quality_articles = quality_articles_by_country/articles_by_country*100
percentage_high_quality_articles.rename({'revision_id' : 'percentage'}, axis=1, inplace=True)
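
One detail of the division above: pandas aligns the two DataFrames on their index, so a country with no GA/FA articles is absent from the numerator and ends up as NaN rather than 0. The sketch below shows this on toy counts (countries 'A', 'B', 'C' are placeholders) along with one possible way to keep such countries at 0%, should that be preferred.

```python
import pandas as pd

total = pd.DataFrame({'revision_id': [10, 4, 5]},
                     index=pd.Index(['A', 'B', 'C'], name='country'))
# Country 'C' has no GA/FA articles, so it never appears in the filtered count
high_quality = pd.DataFrame({'revision_id': [2, 1]},
                            index=pd.Index(['A', 'B'], name='country'))

# Same form as the calculation above: 'C' becomes NaN
as_computed = high_quality / total * 100

# Alternative: give every country in 'total' an explicit count, 0 if missing
with_zeros = high_quality.reindex(total.index, fill_value=0) / total * 100
```

Because `sort_values` places NaN entries last by default, countries without any High Quality Articles do not show up in an ascending `head(10)`.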

In [487]:
# Calculate the percentage of articles-per-population

population_by_country = pd_merged.groupby('country')['Population'].mean().to_frame()
population_by_country.rename({'Population' : 'percentage'}, axis=1, inplace=True)
articles_by_country.rename({'revision_id' : 'percentage'}, axis=1, inplace=True)
percentage_articles_per_population = articles_by_country/population_by_country*100

### Analysis 1

Below we can see countries with the highest percentages of articles per population. These countries tend to have lower populations.

In [488]:
percentage_articles_per_population.sort_values('percentage', ascending=False).head(10)

Unnamed: 0_level_0,percentage
country,Unnamed: 1_level_1
Tuvalu,0.54
Nauru,0.472727
San Marino,0.238235
Monaco,0.105263
Liechtenstein,0.071795
Marshall Islands,0.064912
Tonga,0.063636
Iceland,0.05462
Andorra,0.041463
Federated States of Micronesia,0.033962


### Analysis 2
Below we can see countries with the lowest percentages of articles per population. These countries tend to have higher populations.

In [489]:
percentage_articles_per_population.sort_values('percentage').head(10)

Unnamed: 0_level_0,percentage
country,Unnamed: 1_level_1
India,6.9e-05
Indonesia,7.7e-05
China,8.1e-05
Uzbekistan,8.2e-05
Ethiopia,8.8e-05
Zambia,0.000136
"Korea, North",0.00014
Thailand,0.000168
Mozambique,0.000186
Bangladesh,0.000187
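
The percentages in the two tables above are very small numbers, so it can help to restate them as articles per million people. The conversion below is plain arithmetic and assumes the 'Population' column is a headcount; the function name is hypothetical.

```python
def percent_to_per_million(percentage):
    """Convert 'articles as a percentage of population' to articles per million people."""
    return percentage / 100 * 1_000_000


tuvalu = percent_to_per_million(0.54)     # top of the first table
india = percent_to_per_million(6.9e-05)   # top of the second table
```

On this scale Tuvalu has about 5,400 articles per million people, while India has about 0.69.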


### Analysis 3
Below we can see countries with the highest percentages of High Quality Articles as a proportion of the country's total article count.

In [490]:
percentage_high_quality_articles.sort_values('percentage', ascending=False).head(10)

Unnamed: 0_level_0,percentage
country,Unnamed: 1_level_1
"Korea, North",22.222222
Saudi Arabia,12.820513
Romania,12.244898
Central African Republic,12.121212
Uzbekistan,10.714286
Mauritania,10.416667
Guatemala,8.433735
Dominica,8.333333
Syria,7.8125
Benin,7.692308


### Analysis 4
Below we can see countries with the lowest percentages of High Quality Articles as a proportion of the country's total article count.

In [508]:
percentage_high_quality_articles.sort_values('percentage').head(10)

Unnamed: 0_level_0,percentage
country,Unnamed: 1_level_1
Belgium,0.192678
Tanzania,0.247525
Switzerland,0.248756
Nepal,0.280899
Peru,0.285714
Nigeria,0.295858
Portugal,0.314465
Colombia,0.350877
Lithuania,0.409836
Morocco,0.485437


## Country Mapping to Sub-Regions

Now we will analyze the same metrics by sub-region. First, however, we need to map each country to its respective sub-region(s).

In [495]:
# Map countries to sub-regions by examining their respective indices from the original DataFrame 
# 'population_data' 

# construct list of sub-region indices
regional_idx_list = regional_pop_data.index.tolist()
regional_list_idx = 0
country_to_region_dict = {}

while regional_list_idx+1 < len(regional_idx_list):
    for p_idx in population_data.index:
        # If the country's index is within the range of two sub-region indices, pick the lower index as the
        # sub-region
        if p_idx in range(regional_idx_list[regional_list_idx], regional_idx_list[regional_list_idx+1]):
            country_to_region_dict[population_data.loc[p_idx]['country']] = \
            regional_pop_data['Name'].loc[regional_idx_list[regional_list_idx]]
    # Update sub-region after iterating through all countries
    regional_list_idx += 1

# Original while loop misses final sub-region as it only examines index ranges between
# sub-regions, final sub-region needs to be added manually
for p_idx in population_data.index:
    if p_idx > regional_idx_list[regional_list_idx]:
        country_to_region_dict[population_data.loc[p_idx]['country']] = \
        regional_pop_data['Name'].loc[regional_idx_list[regional_list_idx]]
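
The mapping above depends on row order: each country belongs to the closest all-uppercase row that precedes it. The same idea can be sketched as a single pass over the names in file order, which also covers the final sub-region without a separate loop. This is an illustrative alternative on toy names, not the code used to produce the results in this notebook, and the function name is hypothetical.

```python
def map_countries_to_regions(names):
    """Map each country name to the nearest all-uppercase name that precedes it."""
    mapping = {}
    current_region = None
    for name in names:
        if name.isupper():
            # A new region / sub-region row starts here
            current_region = name
        elif current_region is not None:
            mapping[name] = current_region
    return mapping


# Toy names in the same kind of order as the population file
toy_names = ['AFRICA', 'NORTHERN AFRICA', 'Algeria', 'Egypt',
             'WESTERN AFRICA', 'Benin', 'Togo']
toy_mapping = map_countries_to_regions(toy_names)
```

A top-level row such as 'AFRICA' that is immediately followed by another uppercase row receives no countries of its own, which is why the continent-level groupings are assembled separately.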

In [496]:
# Construct DataFrame of each country with their associated sub-region

country_to_region = pd.DataFrame(country_to_region_dict.items(), columns=['country', 'Sub-Region'])

# Construct DataFrame for each "special" sub-region, these sub-regions consist of a collection
# of other sub-regions
africa_subset = country_to_region[country_to_region['Sub-Region'].str.contains('AFRICA')]
asia_subset = country_to_region[country_to_region['Sub-Region'].str.contains('ASIA')]
europe_subset = country_to_region[country_to_region['Sub-Region'].str.contains('EUROPE')]
latin_subset = country_to_region[country_to_region['Sub-Region'].isin(\
                                 ['CENTRAL AMERICA', 'CARIBBEAN', 'SOUTH AMERICA'])]

In [497]:
# Construct DataFrames of total article counts by sub-region/"special" sub-regions

articles_by_region = pd.merge(pd_merged, country_to_region, how='left', on='country')\
                    [['Sub-Region', 'revision_id']].groupby(['Sub-Region']).count()
articles_africa = pd.merge(pd_merged, africa_subset, how='left', on='country').dropna()\
                    [['Sub-Region', 'revision_id']].groupby(['Sub-Region']).count()
articles_africa = pd.DataFrame({'AFRICA' : articles_africa.sum()}).transpose()


articles_asia = pd.merge(pd_merged, asia_subset, how='left', on='country').dropna()\
                    [['Sub-Region', 'revision_id']].groupby(['Sub-Region']).count()
articles_asia = pd.DataFrame({'ASIA' : articles_asia.sum()}).transpose()


articles_europe = pd.merge(pd_merged, europe_subset, how='left', on='country').dropna()\
                    [['Sub-Region', 'revision_id']].groupby(['Sub-Region']).count()
articles_europe = pd.DataFrame({'EUROPE' : articles_europe.sum()}).transpose()


articles_latin = pd.merge(pd_merged, latin_subset, how='left', on='country').dropna()\
                    [['Sub-Region', 'revision_id']].groupby(['Sub-Region']).count()
articles_latin = pd.DataFrame({'LATIN AMERICA AND THE CARIBBEAN' : articles_latin.sum()}).transpose()

# construct list of DataFrames for iteration/merging purposes

region_article_list = [articles_by_region, articles_africa, articles_asia, articles_europe,
                      articles_latin]

In [498]:
# Construct DataFrames of quality article counts by sub-region/"special" sub-regions

quality_articles_by_region = pd.merge(pd_merged, country_to_region, how='left', on='country')\
                        .loc[pd.merge(pd_merged, country_to_region, how='left', on='country')['article_quality_est.']
                         .isin(['GA', 'FA'])][['Sub-Region', 'revision_id']].groupby(['Sub-Region']).count()

quality_articles_africa = pd.merge(pd_merged, africa_subset, how='left', on='country')\
                        .loc[pd.merge(pd_merged, africa_subset, how='left', on='country')['article_quality_est.']
                         .isin(['GA', 'FA'])][['Sub-Region', 'revision_id']].groupby(['Sub-Region']).count()
quality_articles_africa = pd.DataFrame({'AFRICA' : quality_articles_africa.sum()}).transpose()

quality_articles_asia = pd.merge(pd_merged, asia_subset, how='left', on='country')\
                        .loc[pd.merge(pd_merged, asia_subset, how='left', on='country')['article_quality_est.']
                         .isin(['GA', 'FA'])][['Sub-Region', 'revision_id']].groupby(['Sub-Region']).count()
quality_articles_asia = pd.DataFrame({'ASIA' : quality_articles_asia.sum()}).transpose()

quality_articles_europe = pd.merge(pd_merged, europe_subset, how='left', on='country')\
                        .loc[pd.merge(pd_merged, europe_subset, how='left', on='country')['article_quality_est.']
                         .isin(['GA', 'FA'])][['Sub-Region', 'revision_id']].groupby(['Sub-Region']).count()
quality_articles_europe = pd.DataFrame({'EUROPE' : quality_articles_europe.sum()}).transpose()

quality_articles_latin = pd.merge(pd_merged, latin_subset, how='left', on='country')\
                        .loc[pd.merge(pd_merged, latin_subset, how='left', on='country')['article_quality_est.']
                         .isin(['GA', 'FA'])][['Sub-Region', 'revision_id']].groupby(['Sub-Region']).count()
quality_articles_latin = pd.DataFrame({'LATIN AMERICA AND THE CARIBBEAN' : quality_articles_latin.sum()}).transpose()

# construct list of DataFrames for iteration/merging purposes

region_quality_article_list = [quality_articles_by_region, quality_articles_africa, quality_articles_asia,
                               quality_articles_europe, quality_articles_latin]

In [499]:
# Construct DataFrames of population totals by sub-region/"special" sub-regions

population_by_region = pd.merge(pd_merged, country_to_region, how='left', on='country')\
                    .groupby('Sub-Region')['Population'].mean().to_frame()

population_africa = pd.merge(pd_merged, africa_subset, how='left', on='country').dropna()\
                    .groupby(['Sub-Region'])['Population'].mean().to_frame()
population_africa = pd.DataFrame({'AFRICA' : population_africa.sum()}).transpose()


population_asia = pd.merge(pd_merged, asia_subset, how='left', on='country').dropna()\
                    .groupby(['Sub-Region'])['Population'].mean().to_frame()
population_asia = pd.DataFrame({'ASIA' : population_asia.sum()}).transpose()


population_europe = pd.merge(pd_merged, europe_subset, how='left', on='country').dropna()\
                    .groupby(['Sub-Region'])['Population'].mean().to_frame()
population_europe = pd.DataFrame({'EUROPE' : population_europe.sum()}).transpose()


population_latin = pd.merge(pd_merged, latin_subset, how='left', on='country').dropna()\
                    .groupby(['Sub-Region'])['Population'].mean().to_frame()
population_latin = pd.DataFrame({'LATIN AMERICA AND THE CARIBBEAN' : population_latin.sum()}).transpose()

# construct list of DataFrames for iteration/merging purposes

region_population_list = [population_by_region, population_africa, population_asia, population_europe,
                      population_latin]
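
The cell above takes the mean of 'Population' over the merged article rows in each sub-region. As a point of comparison only, the sketch below contrasts that row-wise mean with a total taken over distinct countries. The numbers and variable names are invented, and this is not the computation behind the tables that follow.

```python
from collections import defaultdict

# (country, sub-region, population) for each article row; invented values
article_rows = [
    ('Algeria', 'NORTHERN AFRICA', 44.4),
    ('Algeria', 'NORTHERN AFRICA', 44.4),
    ('Algeria', 'NORTHERN AFRICA', 44.4),
    ('Egypt', 'NORTHERN AFRICA', 100.8),
]

# Mean over article rows: countries with more articles weigh more
row_mean = sum(pop for _, _, pop in article_rows) / len(article_rows)

# Sum over distinct countries: each country is counted exactly once
per_country = {country: (region, pop) for country, region, pop in article_rows}
region_totals = defaultdict(float)
for region, pop in per_country.values():
    region_totals[region] += pop
```

The two quantities answer different questions, so which one is the appropriate denominator depends on how the articles-per-population metric is meant to be read.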

In [500]:
# Iterate through each of the corresponding DataFrames in all three lists, calculating
# metrics upon each iteration

regional_percentage_quality_articles = []
regional_percentage_articles_per_population = []

for article_count, quality_article_count, pop in zip(region_article_list, region_quality_article_list,
                                                     region_population_list):
    regional_percentage_quality_articles.append(quality_article_count/article_count*100)
    regional_percentage_quality_articles[-1].rename({'revision_id' : 'percentage'}, axis=1, inplace=True)
    pop.rename({'Population' : 'percentage'}, axis=1, inplace=True)
    article_count.rename({'revision_id' : 'percentage'}, axis=1, inplace=True)
    regional_percentage_articles_per_population.append(article_count/pop*100)

In [501]:
# Merge DataFrames for each metric into a single DataFrame for display

regional_percentage_articles_per_population_merged = pd.concat(regional_percentage_articles_per_population)
regional_percentage_quality_articles_merged = pd.concat(regional_percentage_quality_articles)

### Analysis 5

Below we can see the percentages of articles per population by sub-region, sorted in descending order.

In [503]:
regional_percentage_articles_per_population_merged.sort_values('percentage', ascending=False)

Unnamed: 0,percentage
OCEANIA,0.021351
NORTHERN EUROPE,0.019509
SOUTHERN EUROPE,0.013522
EUROPE,0.011363
WESTERN EUROPE,0.010907
WESTERN ASIA,0.010678
CARIBBEAN,0.010158
EASTERN EUROPE,0.007434
EASTERN AFRICA,0.007189
SOUTH AMERICA,0.005073


### Analysis 6

Below we can see all sub-regions' percentage of High Quality Articles as a proportion of the sub-region's total article count, sorted in descending order.

In [504]:
regional_percentage_quality_articles_merged.sort_values('percentage', ascending=False)

Unnamed: 0,percentage
NORTHERN AMERICA,5.470805
SOUTHEAST ASIA,3.613861
WESTERN ASIA,3.472493
EASTERN EUROPE,3.161844
EAST ASIA,3.07319
CENTRAL ASIA,2.857143
NORTHERN EUROPE,2.710603
ASIA,2.708494
MIDDLE AFRICA,2.406015
EUROPE,2.220108


## Final Remarks

Despite the limited scope, this study has brought to light some interesting information on estimated article quality for political figures on English Wikipedia. When examining the percentage of High Quality Articles as a proportion of the country's total article count, it's interesting to see that the country with the highest percentage is North Korea. Intuitively, one would expect an English-speaking country to be the first entry, yet no English-speaking country appears in the top ten at all. One possible explanation is that the countries in the top ten have very few articles about political figures, and that the ones that do exist are well done.

That being said, when examining the table illustrating sub-regions' percentage of High Quality Articles as a proportion of the sub-region's total article count, it is plain to see that a sub-region where English is the predominant language (i.e. Northern America) has the largest percentage by far.

A final interesting takeaway was that Western Europe had the lowest percentage of High Quality Articles as a proportion of the sub-region's total article count of any sub-region, which goes against the intuition that a sub-region geographically and politically similar to many English-speaking countries would have higher-quality articles.

#### Question Responses

1. What biases did you expect to find in the data (before you started working with it), and why?
    - Initially, I expected to find a bias towards English-speaking countries possessing more High Quality Articles. While this was somewhat true at the sub-region level, it was not the case at the country level.


2. What (potential) sources of bias did you discover in the course of your data processing and analysis?
    - While somewhat expected, I found that there were a number of smaller countries with absolutely no High Quality Articles written about their political figures. This lack of information illustrates the bias in the English Wikipedia's creator-base, as content production for political figures from smaller countries may be less desirable/more difficult than for those from larger countries.


3. What might your results suggest about (English) Wikipedia as a data source?
    - In terms of just this analysis, it's interesting to see just how low the percentage of High Quality Articles as a proportion of the country's/sub-region's total article count really is. To put it plainly, it's curious to think that only ~5% of the articles I've read about politicians in North America were "High Quality". With all this said, I think some additional domain knowledge on what differentiates each article ranking would be nice to better understand these seemingly "low" percentages.