# Bias in Data

## Purpose

This project explores the concept of *bias* by examining how the number and quality of Wikipedia articles about political figures vary among countries.

Several specific questions are addressed:
- Which countries have the greatest and the least coverage of politicians on Wikipedia compared to their population?
- Which countries have the highest and lowest proportion of high-quality articles about politicians?
- Which regions have the most articles about politicians, relative to their populations?
- Which regions have the highest proportion of high-quality articles about politicians?

Article quality is estimated using a machine learning service called ORES. The analysis is written in Python and uses the pandas package for data handling and the requests package for calling the ORES API.

## Data Ingestion and Cleaning

### Data Sources
The data used in this analysis is drawn from two sources:
- The Wikipedia politicians by country dataset, found on Figshare: https://figshare.com/articles/Untitled_Item/5513449
- A subset of the world population datasheet published by the Population Reference Bureau

### Data Cleaning

The Wikipedia *Politicians by Country* dataset contains some pages which are not Wikipedia articles. These pages are filtered out before we conduct our analysis by removing all page names that begin with the string "Template:".

The Population Reference Bureau *World Population Datasheet* contains some rows relating to regional population counts. These are filtered out prior to country-level analyses performed below, but utilized in the final two tables in the Analysis section and in the Reflection section to address coverage and quality by region.
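
The two filters described above can be sketched on a small in-memory example. The rows below are made-up placeholders rather than records from either dataset; only the column names `page`, `country`, `Geography` and `Population mid-2018 (millions)` are taken from the source files, and the variable names are illustrative.

```python
import pandas as pd

# Toy stand-ins for the two source files (all values are illustrative only).
pages = pd.DataFrame({
    "page": ["Template:Example politicians", "Example Person", "Another Person"],
    "country": ["Exampleland", "Exampleland", "Otherland"],
})
pops = pd.DataFrame({
    "Geography": ["AFRICA", "Exampleland", "ASIA", "Otherland"],
    "Population mid-2018 (millions)": ["1,284", "42.7", "4,536", "0.4"],
})

# Keep only pages whose name does not begin with "Template:".
articles = pages[~pages["page"].str.startswith("Template:")]

# Region rows are written in all caps; split them from the country rows.
is_region = pops["Geography"].str.isupper()
regions = pops[is_region]
countries = pops[~is_region]

print(articles["page"].tolist())
print(countries["Geography"].tolist())
print(regions["Geography"].tolist())
```

Keeping the region rows in their own frame, instead of discarding them, leaves the regional population totals available for later use.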

In [155]:
# import needed packages
import pandas as pd

# read the csv files in to Pandas data frames
politicos_full = pd.read_csv("page_data.csv")
pops_regions = pd.read_csv("WPDS_2018_data.csv")
# check that the imports have worked correctly
#print(politicos_full.head())

# remove the non-article pages by dropping rows whose page name begins with the string "Template:"
politicos = politicos_full[~politicos_full.page.str.startswith("Template:")]
# check that the filtering step has worked correctly
#print(politicos.head())

# remove the regions from the population data frame by removing rows where the geography col is all caps
# first we make a deep copy of the dataframe because we want a dataframe free of regions, but we also want the region data
pops_countries = pops_regions.copy(deep=True)
# drop regions from the new countries dataframe for the upcoming analysis
pops_countries.drop(pops_countries[pops_countries['Geography'].str.isupper()].index, inplace = True)
# drop countries from the regions dataframe so the two will be completely distinct
#pops_regions = pops_regions[pops_regions['Geography'].str.isupper()]

# check that both dataframes are correct
#print(pops_regions.head())
#print(pops_countries.head())

### Quality Predictions

In the following code we use the ORES API to retrieve JSON responses which contain predictions about the quality of individual articles.

ORES documentation: https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context
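
Because the notebook sends revision IDs to ORES in batches and joins each batch with `|` characters, the batching step can be sketched without any network access. The helper names `chunked` and `revids_param` and the placeholder IDs are assumptions made for illustration; the batch size of 100 matches the value used later in this notebook.

```python
def chunked(ids, size):
    """Yield consecutive slices of `ids` holding at most `size` items each."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def revids_param(ids):
    """Join revision IDs with '|' into a single query-string value."""
    return "|".join(str(x) for x in ids)


# Placeholder revision IDs, not real Wikipedia revisions.
example_ids = list(range(1, 251))

batches = list(chunked(example_ids, 100))
print([len(b) for b in batches])       # sizes of the three batches
print(revids_param(batches[2][:3]))    # first three IDs of the last batch
```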

There are six total quality categories. The first two categories (FA and GA) are considered high quality.

- FA - Featured article
- GA - Good article
- B - B-class article
- C - C-class article
- Start - Start-class article
- Stub - Stub-class article

The first function in the following code, get_ores_data, is taken from https://github.com/Ironholds/data-512-a2/blob/master/hcds-a2-bias_demo.ipynb and modified so that it returns the result (rather than simply printing it).
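
To show how a prediction is read out of a response, here is a minimal sketch that parses a hand-written mock dictionary. The nesting (`enwiki`, `scores`, revision ID, `wp10`, then either `score` or `error`) mirrors the keys accessed by the notebook code; the revision IDs, labels and error text are placeholders, and `extract_predictions` is a hypothetical helper name.

```python
import pandas as pd

# Hand-written mock shaped like the response the notebook code reads.
# Revision IDs and labels are placeholders, not real ORES output.
mock_response = {
    "enwiki": {
        "scores": {
            "1001": {"wp10": {"score": {"prediction": "Stub"}}},
            "1002": {"wp10": {"error": {"message": "placeholder error"}}},
            "1003": {"wp10": {"score": {"prediction": "GA"}}},
        }
    }
}


def extract_predictions(response):
    """Return one row per revision ID with its predicted class or 'no score'."""
    rows = []
    for rev_id, item in response["enwiki"]["scores"].items():
        result = item["wp10"]
        if "error" in result:
            prediction = "no score"
        else:
            prediction = result["score"]["prediction"]
        rows.append({"rev_id": int(rev_id), "prediction": prediction})
    return pd.DataFrame(rows)


preds = extract_predictions(mock_response)
scored = preds[preds["prediction"] != "no score"]
print(scored)
```

Converting `rev_id` to an integer while parsing avoids a separate type cast before merging on that column.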

In [156]:
# import needed packages
import requests
import json

# this block of code is taken from https://github.com/Ironholds/data-512-a2/blob/master/hcds-a2-bias_demo.ipynb
# it is modified so that get_ores_data returns the response instead of printing it
headers = {'User-Agent' : 'https://github.com/chisquareatops', 'From' : 'hertman@uw.edu'}

def get_ores_data(revision_ids, headers):    
    # Define the endpoint
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'    
    # Specify the parameters - smushing all the revision IDs together separated by | marks.
    # Yes, 'smush' is a technical term, trust me I'm a scientist.
    # What do you mean "but people trusting scientists regularly goes horribly wrong" who taught you tha- oh.  
    params = {'project' : 'enwiki',
              'model'   : 'wp10',
              'revids'  : '|'.join(str(x) for x in revision_ids)
              }
    # pass the identifying headers along with the request
    api_call = requests.get(endpoint.format(**params), headers=headers)
    response = api_call.json()
    #print(json.dumps(response, indent=4, sort_keys=True))
    return response

In [157]:
# we need to extract the overall prediction from the above function, which also returns scores for all page types

# loop through the revision ids in chunks of block_size
def get_pred(df, block_size):
    # make a list of the ids from the data frame that was passed in
    revids = list(df['rev_id'])
    start = 0
    output_final = list()
    while start < len(revids):
        revids_temp = revids[start:start + block_size]
        output_temp = get_ores_data(revids_temp, headers)
        for key, item in output_temp['enwiki']['scores'].items():
            dict_temp = dict()
            dict_temp['rev_id'] = key
            # revisions that could not be scored carry an 'error' entry instead of a 'score'
            if 'error' in item['wp10']:
                dict_temp['prediction'] = 'no score'
            else:
                dict_temp['prediction'] = item['wp10']['score']['prediction']
            output_final.append(dict_temp)
        start += block_size
    scores = pd.DataFrame(output_final)
    return scores

In [158]:
# call the above functions to get the predictions for our data frame; divide the articles into blocks of 100
politicos_preds = get_pred(politicos, 100)

# check that the above step worked correctly
#print(politicos_preds.head())

# save the articles with no score to a csv and then remove them from the data frame
politicos_preds[politicos_preds.prediction == 'no score'][['rev_id']].to_csv('wp_wpds_articles-no_score.csv')
politicos_preds = politicos_preds[~politicos_preds.prediction.str.contains("no score")]

### Merge and Output Data

In the following code we merge our data so that the predictions we are interested in are associated with the individual articles in our data set. We then export a csv of this combined data.
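
One way to see which rows fail to match on the country field, on either side of the join, is pandas' `indicator=True` option combined with an outer join. This is a sketch on placeholder data and a variation on the approach taken in this notebook, which uses right joins; the frame and column values below are invented for illustration.

```python
import pandas as pd

# Placeholder frames; only the column names echo the notebook.
articles = pd.DataFrame({
    "page": ["Person A", "Person B", "Person C"],
    "country": ["Exampleland", "Otherland", "Nowhereland"],
})
populations = pd.DataFrame({
    "Geography": ["Exampleland", "Otherland", "Thirdland"],
    "population": [42.7, 0.4, 3.1],
})

# An outer join keeps unmatched rows from both sides, and the _merge column
# records whether each row came from the left frame, the right frame, or both.
merged = articles.merge(
    populations, how="outer", left_on="country", right_on="Geography", indicator=True
)

matched = merged[merged["_merge"] == "both"].drop(columns=["_merge", "Geography"])
unmatched = merged[merged["_merge"] != "both"]

print(matched)
print(unmatched[["page", "country", "Geography", "_merge"]])
```

The `unmatched` frame could then be written to a CSV file, and `matched` carried forward into the analysis.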

In [159]:
# make copies just in case before merging
politicos_final = politicos.copy(deep=True)
politicos_preds_final = politicos_preds.copy(deep=True)
pops_countries_final = pops_countries.copy(deep=True)

# merge the political article data and the quality predictions on the rev_id/revision_id cols
politicos_preds_final = politicos_preds_final.astype({'rev_id': 'int64'})
combined_final = politicos_final.merge(politicos_preds_final, how='right', left_on='rev_id', right_on='rev_id')
# merge the new data frame with the population data on the country/Geography cols
combined_final = combined_final.merge(pops_countries_final, how='right', left_on='country', right_on='Geography')

# check that the above step worked
#print(combined_final.head())

# rename the cols to comply with assignment
combined_final.rename(columns={'page':'article_name','Population mid-2018 (millions)':'population','rev_id':'revision_id','prediction':'article_quality'}, inplace=True)

# save the rows that have no match on the country field to a csv, then drop from the final data frame
combined_final[combined_final.Geography.isnull()].to_csv('wp_wpds_countries-no_match.csv')
combined_final.dropna(inplace=True)

# remove Geography col to comply with assignment (now that rows with no country match are gone)
# drop returns a new data frame, so assign the result back
combined_final = combined_final.drop('Geography', axis=1)

# check that the above step worked
print(combined_final.head())

           article_name country  revision_id article_quality population
0        Bir I of Kanem    Chad  355319463.0            Stub       15.4
1  Abdullah II of Kanem    Chad  498683267.0            Stub       15.4
2   Salmama II of Kanem    Chad  565745353.0            Stub       15.4
3       Kuri I of Kanem    Chad  565745365.0            Stub       15.4
4   Mohammed I of Kanem    Chad  565745375.0            Stub       15.4


In [161]:
# change some data types so the following analysis will work
combined_final['population'] = combined_final['population'].str.replace(',', '')
combined_final = combined_final.astype({'population':'float'})

## Analysis

In this section we create the following six individual tables:

- Top 10 countries by coverage: 10 highest-ranked countries in terms of number of politician articles as a proportion of country population
- Bottom 10 countries by coverage: 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population
- Top 10 countries by relative quality: 10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality
- Bottom 10 countries by relative quality: 10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality
- Geographic regions by coverage: Ranking of geographic regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population
- Geographic regions by relative quality: Ranking of geographic regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality
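
The two measures named in this list, articles relative to population and the share of articles rated FA or GA, can be illustrated on a toy frame. The countries, quality labels and populations below are invented, and the column names `is_hq`, `articles_per_person` and `hq_share` are assumptions chosen for the sketch.

```python
import pandas as pd

# Toy article-level data; populations are in millions, as in the source file.
toy = pd.DataFrame({
    "country": ["Exampleland"] * 4 + ["Otherland"] * 2,
    "article_quality": ["FA", "Stub", "Stub", "C", "GA", "Start"],
    "population": [2.0] * 4 + [0.5] * 2,
})

# Flag the articles that fall in the two high-quality classes.
toy["is_hq"] = toy["article_quality"].isin(["FA", "GA"])

# One row per country: total articles, high-quality articles, population.
summary = toy.groupby("country").agg(
    total_articles=("article_quality", "size"),
    hq_articles=("is_hq", "sum"),
    population=("population", "first"),
)

# Coverage: articles per person. Relative quality: high-quality share of articles.
summary["articles_per_person"] = summary["total_articles"] / (summary["population"] * 1_000_000)
summary["hq_share"] = summary["hq_articles"] / summary["total_articles"]

print(summary.sort_values("articles_per_person", ascending=False))
```

Computing both counts in a single `groupby` keeps countries with zero high-quality articles in the table, with an `hq_share` of 0.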

In [162]:
# select for high quality articles by keeping only the FA and GA designations in the article_quality field
combined_final_2 = combined_final.copy(deep=True)
hq_articles = combined_final_2.loc[combined_final_2['article_quality'].isin(['FA','GA'])]
# count total number of high quality articles in each country using group by 
hq_articles_country = hq_articles.groupby('country').count()['article_name']

# make this result into a dataframe with appropriate cols so we can bring back population data and report the proportion
hq_articles_country_df = hq_articles_country.to_frame()
hq_articles_country_df['country'] = hq_articles_country_df.index
hq_articles_country_df.reset_index(drop=True, inplace=True)
hq_articles_country_df = hq_articles_country_df.merge(pops_countries_final, how='inner', left_on='country', right_on='Geography')

# find the actual proportion: divide number of high quality articles by total population
hq_articles_country_df = hq_articles_country_df.astype({'article_name': 'float'})
hq_articles_country_df['Population mid-2018 (millions)'] = hq_articles_country_df['Population mid-2018 (millions)'].str.replace(',', '')
hq_articles_country_df = hq_articles_country_df.astype({'Population mid-2018 (millions)': 'float'})
hq_articles_country_df['article_proportion'] = hq_articles_country_df['article_name'] / (hq_articles_country_df['Population mid-2018 (millions)'] * 1000000)

#### Top 10 countries by coverage: 10 highest-ranked countries in terms of number of politician articles as a proportion of country population

In [163]:
# sort by proportion (high quality articles per person, computed above) and display a table of the top 10
articles_over_pop = hq_articles_country_df[['country','article_proportion']]
articles_over_pop = articles_over_pop.sort_values('article_proportion', ascending=False)
print(articles_over_pop.head(10))

         country  article_proportion
130       Tuvalu            0.000500
33      Dominica            0.000014
46       Grenada            0.000010
137      Vanuatu            0.000010
52       Iceland            0.000005
57       Ireland            0.000004
13        Bhutan            0.000004
79      Maldives            0.000003
90   New Zealand            0.000002
58        Israel            0.000002


#### Bottom 10 countries by coverage: 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population

In [164]:
#Bottom 10 countries by coverage
articles_over_pop = articles_over_pop.sort_values('article_proportion', ascending=True)
print(articles_over_pop.head(10))

        country  article_proportion
93      Nigeria        1.020929e-08
53        India        1.239700e-08
125    Tanzania        1.692047e-08
9    Bangladesh        1.802885e-08
38     Ethiopia        1.860465e-08
27     Colombia        2.008032e-08
17       Brazil        2.865330e-08
26        China        2.941599e-08
99         Peru        3.105590e-08
88        Nepal        3.367003e-08


In [165]:
#Top 10 countries by relative quality

# group the same way we did in previous steps, but this time using all articles
all_articles = combined_final.loc[combined_final['article_quality'].isin(['FA','GA','B','C','Start','Stub'])]
# count total number of articles (of any quality) in each country using group by 
all_articles_country = all_articles.groupby('country').count()['article_name']

# make a dataframe with this total number of articles per country so it can be merged with dataframe from prev step
all_articles_country_df = all_articles_country.to_frame()
all_articles_country_df['country'] = all_articles_country_df.index
all_articles_country_df.reset_index(drop=True, inplace=True)

all_articles_country_df = all_articles_country_df.astype({'article_name': 'float'})
all_articles_country_df.rename(columns = {'article_name':'total_articles'}, inplace = True) 
all_articles_country_df = all_articles_country_df.merge(hq_articles_country_df, how='right', left_on='country', right_on='country')

#### Top 10 countries by relative quality: 10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality

In [166]:
# find the proportion: divide number of high quality articles by total articles
all_articles_country_df['quality_to_total'] = all_articles_country_df['article_name'] / all_articles_country_df['total_articles']
hqarticles_over_total = all_articles_country_df[['country','quality_to_total']]
hqarticles_over_total = hqarticles_over_total.sort_values('quality_to_total', ascending=False)
print(hqarticles_over_total.head(10))

                      country  quality_to_total
64               Korea, North          0.194444
107              Saudi Arabia          0.127119
81                 Mauritania          0.125000
23   Central African Republic          0.121212
104                   Romania          0.113703
130                    Tuvalu          0.092593
13                     Bhutan          0.090909
33                   Dominica          0.083333
122                     Syria          0.078125
12                      Benin          0.076923


#### Bottom 10 countries by relative quality: 10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality

In [167]:
hqarticles_over_total = hqarticles_over_total.sort_values('quality_to_total', ascending=True)
print(hqarticles_over_total.head(10))

         country  quality_to_total
11       Belgium          0.001923
125     Tanzania          0.002469
121  Switzerland          0.002488
88         Nepal          0.002801
99          Peru          0.002857
93       Nigeria          0.002954
27      Colombia          0.003509
74     Lithuania          0.004098
39          Fiji          0.005076
7     Azerbaijan          0.005587


In [170]:
# regions by coverage
# The only data source we have that connects countries to regions is the original WPDS_2018_data.csv data (now pops_regions)
# countries in this data belong to the region that precedes them in the file, so we need to loop through it.

# create an empty dict to hold country/region pairs as we find them
region_dict = {}

# loop through the original data we preserved (as a list) to identify countries vs. regions, then store pairs
for value in pops_regions['Geography'].tolist():
    # if the current row is a region, make it the current region (the first row is a region)
    if value.isupper():
        region = value
    # if the current row is a country, add a new country/region pair to the dict
    else:
        region_dict.update({value:region})

# use a lambda to make a new col in the most recent dataframe and use the dict to insert a region value
all_articles_country_df['region'] = all_articles_country_df['country'].apply(lambda x: region_dict[x])

# test that the above step worked correctly
#print(all_articles_country_df.head())
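
The country-to-region assignment used in the cell above depends on the ordering of the population file: each all-caps region row is followed by the countries that belong to it. A minimal sketch of that idea on a placeholder list (the names and the variables `region_of` and `current_region` are illustrative):

```python
# Placeholder ordering: each all-caps region is followed by its countries.
geography = ["AFRICA", "Exampleland", "Otherland", "ASIA", "Thirdland"]

region_of = {}
current_region = None
for name in geography:
    if name.isupper():
        # An all-caps row starts a new region.
        current_region = name
    else:
        # Any other row is a country in the most recent region.
        region_of[name] = current_region

print(region_of)
```

With such a dictionary in hand, `Series.map(region_of)` is an alternative to `apply` with a lambda; it yields `NaN` for countries missing from the dictionary instead of raising a `KeyError`.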

#### Geographic regions by coverage: Ranking of geographic regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population

In [171]:
# add up the total number of articles in each region using group by 
all_articles_region = all_articles_country_df.groupby('region').sum()['total_articles']

# turn that result back into a data frame
all_articles_region_df = all_articles_region.to_frame()
all_articles_country_df.reset_index(drop=True, inplace=True)

# add up the total population (in millions) in each region using group by 
pop_region = all_articles_country_df.groupby('region').sum()['Population mid-2018 (millions)']

# turn that result back into a data frame
pop_region_df = pop_region.to_frame()
pop_region_df.reset_index(inplace=True)

#all_articles_region_df = all_articles_region_df.sort_values('article_proportion', ascending=False)

all_articles_over_pop = all_articles_region_df.merge(pop_region_df, how='right', left_on='region', right_on='region')
all_articles_over_pop['total_articles_over_pop'] = all_articles_over_pop['total_articles']/all_articles_over_pop['Population mid-2018 (millions)']
all_articles_over_pop = all_articles_over_pop[['region', 'total_articles_over_pop']]
all_articles_over_pop = all_articles_over_pop.sort_values('total_articles_over_pop', ascending=False)

print(all_articles_over_pop)

                            region  total_articles_over_pop
5                          OCEANIA                72.668561
2                           EUROPE                19.907834
3  LATIN AMERICA AND THE CARIBBEAN                 7.932139
0                           AFRICA                 5.867868
4                 NORTHERN AMERICA                 5.260131
1                             ASIA                 2.544333


#### Geographic regions by relative quality: Ranking of geographic regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality

In [172]:
# add up the total number of articles in each region using group by 
all_articles_region = all_articles_country_df.groupby('region').sum()['total_articles']

# turn that result back into a data frame
all_articles_region_df = all_articles_region.to_frame()
all_articles_country_df.reset_index(drop=True, inplace=True)

# add up the total number of high quality articles in each region using group by 
hq_region = all_articles_country_df.groupby('region').sum()['article_name']

# turn that result back into a data frame
hq_region_df = hq_region.to_frame()
hq_region_df.reset_index(inplace=True)

hq_over_all_articles_df = all_articles_region_df.merge(hq_region_df, how='right', left_on='region', right_on='region')

hq_over_all_articles_df['hq_over_all_articles'] = hq_over_all_articles_df['article_name']/hq_over_all_articles_df['total_articles']
hq_over_all_articles_df = hq_over_all_articles_df[['region', 'hq_over_all_articles']]
hq_over_all_articles_df = hq_over_all_articles_df.sort_values('hq_over_all_articles', ascending=False)

print(hq_over_all_articles_df)

                            region  hq_over_all_articles
4                 NORTHERN AMERICA              0.051536
1                             ASIA              0.027143
5                          OCEANIA              0.023462
2                           EUROPE              0.022587
0                           AFRICA              0.021324
3  LATIN AMERICA AND THE CARIBBEAN              0.014002


## Reflection

Potentially the most significant source of bias in this analysis is the ORES scores themselves and the way in which they are generated. Unfortunately I can't speak to this in detail, as I don't know how the algorithms behind these scores work. I can only state that I cannot be sure whether these scores thoroughly account for possible cultural differences when evaluating the 'quality' of an article.

Another potential problem with the data is that political structures differ between countries. Governments vary widely in size, and branches of government vary widely in how much power they wield, how long their members are in office, and other important measures, both within and between countries. Therefore, it is possible that one country may have proportionally more political offices which warrant (and require) lengthy or high-quality explanation than another country.

The tables generated here suggest a possible hypothesis about internet access: countries where a higher percentage of the population has internet access may tend to have more total articles and proportionally more high-quality articles. This would have to be tested by bringing in additional data. It would also be interesting to combine the above data set with national GDP data to investigate any possible correlations with wealth at the national or regional level.