# Hasnah Said<br>A2: Bias in Data<br>October 12, 2021

### Step 1: Get the Article and Population Data

The two datasets used in this assignment, politicians by country and world population, were obtained from Figshare and the Population Reference Bureau (PRB), respectively.

In [1]:
import pandas as pd
import numpy as np
import requests
import json
from collections import defaultdict

In [2]:
# Load the data 
page_data = pd.read_csv('raw_data/page_data.csv')
wpds_data = pd.read_csv('raw_data/WPDS_2020_data.csv')

In [3]:
page_data.head()

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


In [4]:
wpds_data.head()

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
0,WORLD,WORLD,World,2019,7772.85,7772850000
1,AFRICA,AFRICA,Sub-Region,2019,1337.918,1337918000
2,NORTHERN AFRICA,NORTHERN AFRICA,Sub-Region,2019,244.344,244344000
3,DZ,Algeria,Country,2019,44.357,44357000
4,EG,Egypt,Country,2019,100.803,100803000


### Step 2: Cleaning the Data

The following will be done to prepare the datasets for the analysis:
* **page_data.csv:** Remove rows that contain 'Template:' in the page's name.
* **WPDS_2020_data.csv:** Remove rows that provide cumulative regional population counts rather than country-level counts. These rows are distinguished by having the Name value in ALL CAPS.
* Retain the country-to-region mapping for the analysis section.
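
As a minimal sketch of these three cleaning rules, the snippet below applies them to toy frames. The `toy_*` names are hypothetical and the rows are copied from the previews above. Unlike the notebook cells that follow, it uses `str.startswith` for the template filter and a forward fill for the region mapping; both are swapped-in alternatives that assume every regional row appears above the countries it contains.

```python
import pandas as pd

# Toy stand-ins for the two raw files (rows copied from the previews above).
toy_pages = pd.DataFrame({
    "page": ["Template:Uganda-politician-stub", "Bir I of Kanem"],
    "country": ["Uganda", "Chad"],
    "rev_id": [391862070, 355319463],
})
toy_wpds = pd.DataFrame({
    "Name": ["AFRICA", "NORTHERN AFRICA", "Algeria", "Egypt"],
    "Type": ["Sub-Region", "Sub-Region", "Country", "Country"],
    "Population": [1337918000, 244344000, 44357000, 100803000],
})

# Keep only real articles: drop page names that start with "Template:".
toy_articles = toy_pages[~toy_pages["page"].str.startswith("Template:")]

# Split country rows from regional aggregates using the ALL CAPS convention.
is_region = toy_wpds["Name"].str.isupper()
toy_countries = toy_wpds[~is_region]
toy_regions = toy_wpds[is_region]

# Map each country to the closest regional row above it (forward fill).
region_of = toy_wpds["Name"].where(is_region).ffill()
toy_mapping = dict(zip(toy_countries["Name"], region_of[~is_region]))
```

One caveat of the ALL CAPS rule is that it depends entirely on how names are spelled in the file, so it is worth spot-checking the resulting mapping for a few countries.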

In [5]:
clean_page_data = page_data[~page_data.page.str.contains("Template:")]
clean_wpds_country_data = wpds_data[~wpds_data.Name.str.isupper()]
clean_wpds_region_data = wpds_data[wpds_data.Name.str.isupper()]

In [6]:
# create a df with country-region mapping 
country_region_dic = {}
sub_region = ""
for i, row in wpds_data.iterrows():
    if row['Type'] == 'Sub-Region':
        sub_region = row['Name']
    elif row['Type'] == 'Country':
        country_region_dic[row['Name']] = sub_region

country_region_df = pd.DataFrame(country_region_dic.items(), columns=['country', 'sub_region'])


In [7]:
# export the clean data 
clean_page_data.to_csv('clean_data/clean_page_data.csv')
clean_wpds_country_data.to_csv('clean_data/clean_wpds_country_data.csv')
clean_wpds_region_data.to_csv('clean_data/clean_wpds_region_data.csv')
country_region_df.to_csv('clean_data/country_region_mapping.csv')

### Step 3: Getting Article Quality Predictions

In this step, I use ORES to get the predicted quality class for each article in the Wikipedia dataset through its REST API. ORES supports querying up to 50 revisions per request, with up to 4 parallel requests.

(Sources: https://www.mediawiki.org/wiki/ORES)

The article quality estimates are, from best to worst:
1. FA - Featured article
2. GA - Good article
3. B - B-class article
4. C - C-class article
5. Start - Start-class article
6. Stub - Stub-class article

I made batched API calls and stored each prediction, keyed by revision ID, in a Python dictionary. I then converted the scores dictionary to a pandas DataFrame and merged it with page_data, adding a column for the predictions.
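
The batching and parsing logic can be sketched without touching the network. The helper names `chunked` and `extract_predictions` are hypothetical, and `fake_scores` is a hand-written stand-in that mirrors only the keys this notebook reads; the shape of the error entry in particular is an assumption, not a documented ORES payload.

```python
import pandas as pd

def chunked(items, size=50):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def extract_predictions(scores):
    """Map each revision ID to its predicted class, or 'error' if missing."""
    predictions = {}
    for rev_id, models in scores.items():
        try:
            predictions[rev_id] = models["articlequality"]["score"]["prediction"]
        except KeyError:
            predictions[rev_id] = "error"
    return predictions

# Hand-written stand-in for the "scores" part of a response (assumed shape).
fake_scores = {
    "355319463": {"articlequality": {"score": {"prediction": "Stub"}}},
    "123": {"articlequality": {"error": {"type": "RevisionNotFound"}}},
}

batches = list(chunked(list(range(120)), size=50))
predictions = extract_predictions(fake_scores)

# Revision IDs arrive as strings, so cast them before merging on rev_id.
rev_pred = pd.DataFrame(list(predictions.items()),
                        columns=["rev_id", "article_quality_est"])
rev_pred["rev_id"] = rev_pred["rev_id"].astype(int)
```

Keeping the parsing in a small function makes it possible to test the error handling on a fake payload before spending time on the real API calls.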

In [8]:
# Example of an ORES API call for batch revisions:
# http://ores.wmflabs.org/v3/scores/enwiki/?models=draftquality|wp10&revids=34854345|485104318
ores_get_scores_url = "https://ores.wikimedia.org/v3/scores/enwiki/?models=articlequality&revids={}"

In [9]:
# Get all the revision IDs from clean_data
all_rev_ids = clean_page_data['rev_id'].tolist()

In [10]:
# Break all_rev_ids into chunks of 50 to make the API calls and get article predictions
n = 50
rev_ids_chunks = [all_rev_ids[i:i + n] for i in range(0, len(all_rev_ids), n)]

In [11]:
# Make the API call and get predictions and store them in a dictionary
all_scores = {}
for chunk in rev_ids_chunks:
    revids = "|".join([str(element) for element in chunk])
    endpoint = ores_get_scores_url.format(revids)
 
    req = (requests.get(endpoint)).json()
    scores = req['enwiki']['scores']
    for s in scores:
        try:
            all_scores[s] = scores[s]['articlequality']['score']['prediction']
        except KeyError:
            # No score was returned for this revision, so flag it
            all_scores[s] = 'error'


In [12]:
# Write out all_scores dictionary so that I don't have to make another api call
with open('clean_data/all_scores.txt', 'w') as outfile:
    json.dump(all_scores, outfile)

In [13]:
rev_pred = pd.DataFrame(all_scores.items(), columns=['rev_id', 'article_quality_est'])
rev_pred.rev_id = rev_pred.rev_id.astype(int)
clean_page_data_with_preds = pd.merge(clean_page_data, rev_pred, on='rev_id')

In [14]:
clean_page_data_with_preds.to_csv('clean_data/clean_page_data_with_preds.csv')
clean_page_data_with_preds.head()

Unnamed: 0,page,country,rev_id,article_quality_est
0,Bir I of Kanem,Chad,355319463,Stub
1,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188,Stub
2,Yos Por,Cambodia,393822005,Stub
3,Julius Gregr,Czech Republic,395521877,Stub
4,Edvard Gregr,Czech Republic,395526568,Stub


### Step 4: Combining the Datasets

In this step, page_data and the population data are combined into one dataframe based on the country. Rows that cannot be matched are removed and stored separately. The data is then exported to two CSV files: one with the rows that had no match and one with the final combined data.
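
A minimal sketch of the match/no-match split on toy data follows. It uses the `indicator=True` option of `pd.merge` rather than the null check used in the cells below, so it is an alternative technique and not the notebook's own method. "Atlantis" and "Hypothetical Person" are made-up placeholders for an article whose country has no population row.

```python
import pandas as pd

articles = pd.DataFrame({
    "page": ["Bir I of Kanem", "Hypothetical Person"],
    "country": ["Chad", "Atlantis"],  # "Atlantis" is a made-up placeholder
})
population = pd.DataFrame({
    "Name": ["Chad", "Palau"],
    "Population": [16877000, 18000],
})

# An outer merge keeps unmatched rows from both sides; the _merge column
# records whether each row came from the left, the right, or both.
merged = pd.merge(articles, population, left_on="country", right_on="Name",
                  how="outer", indicator=True)

matched = merged[merged["_merge"] == "both"].drop(columns="_merge")
no_match = merged[merged["_merge"] != "both"].drop(columns="_merge")
```

The indicator column separates rows by merge outcome, so a row that matched but happens to contain a null in some other column would still count as matched.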

In [15]:
final_combined_data = pd.merge(clean_page_data_with_preds, clean_wpds_country_data, left_on='country', right_on='Name', how='outer')

In [16]:
final_combined_data

Unnamed: 0,page,country,rev_id,article_quality_est,FIPS,Name,Type,TimeFrame,Data (M),Population
0,Bir I of Kanem,Chad,355319463.0,Stub,TD,Chad,Country,2019.0,16.877,16877000.0
1,Abdullah II of Kanem,Chad,498683267.0,Stub,TD,Chad,Country,2019.0,16.877,16877000.0
2,Salmama II of Kanem,Chad,565745353.0,Stub,TD,Chad,Country,2019.0,16.877,16877000.0
3,Kuri I of Kanem,Chad,565745365.0,Stub,TD,Chad,Country,2019.0,16.877,16877000.0
4,Mohammed I of Kanem,Chad,565745375.0,Stub,TD,Chad,Country,2019.0,16.877,16877000.0
...,...,...,...,...,...,...,...,...,...,...
46723,,,,,PF,French Polynesia,Country,2019.0,0.280,280000.0
46724,,,,,GU,Guam,Country,2019.0,0.175,175000.0
46725,,,,,NC,New Caledonia,Country,2019.0,0.295,295000.0
46726,,,,,PW,Palau,Country,2019.0,0.018,18000.0


In [17]:
# Create a dataframe with no null rows
wp_wpds_politicians_by_country = final_combined_data.dropna(how='any',axis=0)
# Create a dataframe with all the null rows
wp_wpds_countries_no_match = final_combined_data[final_combined_data.isnull().any(axis=1)]

In [18]:
# Check if numbers match up
(len(final_combined_data)) == (len(wp_wpds_politicians_by_country) + len(wp_wpds_countries_no_match))

True

In [19]:
# Drop extra columns, rename, and reorder the rest in the final dataframe
wp_wpds_politicians_by_country = wp_wpds_politicians_by_country.drop(columns=['FIPS', 'Type', 'Name', 'TimeFrame', 'Data (M)'])
wp_wpds_politicians_by_country = wp_wpds_politicians_by_country.rename(columns={'page': 'article_name', 'rev_id': 'revision_id', 'Population':'population'})
wp_wpds_politicians_by_country = wp_wpds_politicians_by_country[["country", "article_name", "revision_id", "article_quality_est", "population"]]

In [20]:
wp_wpds_politicians_by_country

Unnamed: 0,country,article_name,revision_id,article_quality_est,population
0,Chad,Bir I of Kanem,355319463.0,Stub,16877000.0
1,Chad,Abdullah II of Kanem,498683267.0,Stub,16877000.0
2,Chad,Salmama II of Kanem,565745353.0,Stub,16877000.0
3,Chad,Kuri I of Kanem,565745365.0,Stub,16877000.0
4,Chad,Mohammed I of Kanem,565745375.0,Stub,16877000.0
...,...,...,...,...,...
46690,Seychelles,Rita Sinon,800323154.0,Stub,98000.0
46691,Seychelles,Sylvette Frichot,800323798.0,Stub,98000.0
46692,Seychelles,May De Silva,800969960.0,Start,98000.0
46693,Seychelles,Vincent Meriton,802051093.0,Stub,98000.0


In [21]:
# Export final dataframes to CSV files
wp_wpds_politicians_by_country.to_csv('clean_data/wp_wpds_politicians_by_country.csv')
wp_wpds_countries_no_match.to_csv('clean_data/wp_wpds_countries-no_match.csv')

### Step 5: Analysis

Calculate, for each country and each geographic region, the number of articles as a percentage of the population and the percentage of articles that are high quality.

* percentage of article coverage per country
* percentage of high-quality articles per country

**The steps I followed to get the articles-per-population percentage:**
1. create a dataframe with an article count column
2. merge the dataframe on country to get the population
3. divide the count column by the population to get the coverage percentage (articles per population, stored in the coverage column)

**The steps I followed to get the high-quality proportion:**
1. I selected articles with a rating of GA or FA
2. I grouped the rows by country and added up the high-quality counts
3. I divided the high-quality count by the article count for each country
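
The two calculations above can be sketched on a toy frame. Countries "A" and "B" and their populations are invented for illustration, and the single `groupby` with named aggregation is a compact alternative to the separate merges used in the cells below.

```python
import pandas as pd

# Invented example: country A has 4 articles (2 high quality), B has 2 (none).
toy = pd.DataFrame({
    "country": ["A", "A", "A", "A", "B", "B"],
    "article_quality_est": ["GA", "Stub", "FA", "Start", "Stub", "C"],
    "population": [1000, 1000, 1000, 1000, 400, 400],
})

# Flag GA/FA rows, then aggregate everything per country in one pass.
summary = (
    toy.assign(is_high=toy["article_quality_est"].isin(["GA", "FA"]))
       .groupby("country")
       .agg(article_count=("article_quality_est", "size"),
            population=("population", "first"),
            high_quality_count=("is_high", "sum"))
       .reset_index()
)

# Both measures are expressed as percentages.
summary["coverage"] = summary["article_count"] / summary["population"] * 100
summary["high_quality_proportion"] = (
    summary["high_quality_count"] / summary["article_count"] * 100
)
```

Because the high-quality flag is aggregated together with the counts, countries with no GA or FA articles get a count of 0 directly and need no later `fillna(0)`.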

In [22]:
# country_article_dic = defaultdict(int)
# country_article_rating = defaultdict(int)
# country_population_dic = {}

# for index, row in wp_wpds_politicians_by_country.iterrows():
#     country = row.values[0]
#     rating = row.values[3]
#     population = row.values[4]
#     country_article_dic[country] += 1
#     country_population_dic[country] = population
    
#     if rating == 'GA' or rating == 'FA':
#         country_article_rating[country] += 1

In [23]:
df = wp_wpds_politicians_by_country
# count the number of high quality rows
high_quality = df[(df['article_quality_est']=='GA')|(df['article_quality_est']=='FA' )]
high_quality = high_quality.groupby(['country']).size().reset_index(name='high_quality_count')

In [24]:
high_quality.sort_values('high_quality_count')

Unnamed: 0,country,high_quality_count
124,Switzerland,1
34,Dominican Republic,1
128,Tanzania,1
37,Equatorial Guinea,1
39,Fiji,1
...,...,...
119,Spain,38
26,China,40
106,Romania,42
137,United Kingdom,56


In [25]:
# count the articles for each country
article_counts = wp_wpds_politicians_by_country.country.value_counts().reset_index().rename(columns={'index':'country', 'country':'article_count'})
# merge article counts with population and drop duplicates
article_count_population = pd.merge(article_counts, wp_wpds_politicians_by_country, on='country').drop_duplicates(subset=['country'])
# drop extra columns
article_count_population_clean = article_count_population.drop(columns=['article_name', 'revision_id', 'article_quality_est']).reset_index()

In [26]:
# create coverage column that has the articles_per_population percentage (count/population)
article_count_population_clean['coverage'] = (article_count_population_clean['article_count']/article_count_population_clean['population']) * 100

In [27]:
# merge article count and high quality count dataframes
final = pd.merge(article_count_population_clean, high_quality, on='country', how='left')
final

Unnamed: 0,index,country,article_count,population,coverage,high_quality_count
0,0,France,1681,6.494000e+07,0.002589,26.0
1,1681,Australia,1561,2.575400e+07,0.006061,38.0
2,3242,China,1133,1.402385e+09,0.000081,40.0
3,4375,United States,1092,3.298780e+08,0.000331,80.0
4,5467,Mexico,1077,1.277920e+08,0.000843,10.0
...,...,...,...,...,...,...
177,44602,Guinea-Bissau,20,1.927000e+06,0.001038,1.0
178,44622,Belize,16,4.190000e+05,0.003819,
179,44638,Eritrea,16,3.546000e+06,0.000451,
180,44654,Barbados,14,2.870000e+05,0.004878,


In [28]:
# calculate high quality article proportion
final['high_quality_count'] = final['high_quality_count'].fillna(0)
final['high_quality_proportion'] = (final['high_quality_count']/final['article_count']) * 100

In [29]:
# add subregion column
final = pd.merge(final, country_region_df, on='country', how='left')

In [30]:
final.sort_values('high_quality_proportion')

Unnamed: 0,index,country,article_count,population,coverage,high_quality_count,high_quality_proportion,sub_region
111,41123,Solomon Islands,97,715000.0,0.013566,0.0,0.000000,OCEANIA
168,44388,Liechtenstein,28,39000.0,0.071795,0.0,0.000000,WESTERN EUROPE
166,44331,Lesotho,29,2142000.0,0.001354,0.0,0.000000,SOUTHERN AFRICA
142,43387,Comoros,51,870000.0,0.005862,0.0,0.000000,EASTERN AFRICA
103,40312,Angola,106,32522000.0,0.000326,0.0,0.000000,MIDDLE AFRICA
...,...,...,...,...,...,...,...,...
169,44416,Uzbekistan,28,34174000.0,0.000082,3.0,10.714286,CENTRAL ASIA
130,42669,Central African Republic,66,4830000.0,0.001366,8.0,12.121212,MIDDLE AFRICA
43,28527,Romania,343,19241000.0,0.001783,42.0,12.244898,EASTERN EUROPE
94,39298,Saudi Arabia,118,35041000.0,0.000337,15.0,12.711864,WESTERN ASIA


In [31]:
# drop unnecessary columns for step 6 analysis
final_data = final.drop(columns=['index', 'population'])

In [32]:
final_data.head()

Unnamed: 0,country,article_count,coverage,high_quality_count,high_quality_proportion,sub_region
0,France,1681,0.002589,26.0,1.546698,WESTERN EUROPE
1,Australia,1561,0.006061,38.0,2.434337,OCEANIA
2,China,1133,8.1e-05,40.0,3.53045,EAST ASIA
3,United States,1092,0.000331,80.0,7.326007,NORTHERN AMERICA
4,Mexico,1077,0.000843,10.0,0.928505,CENTRAL AMERICA


### Step 6: Results

**1. Top 10 countries by coverage: 10 highest-ranked countries in terms of number of politician articles as a proportion of country population**

In [33]:
final_data.sort_values('coverage', ascending=False).reset_index(drop=True).head(10)

Unnamed: 0,country,article_count,coverage,high_quality_count,high_quality_proportion,sub_region
0,Tuvalu,54,0.54,4.0,7.407407,OCEANIA
1,Nauru,52,0.472727,0.0,0.0,OCEANIA
2,San Marino,81,0.238235,0.0,0.0,SOUTHERN EUROPE
3,Monaco,40,0.105263,0.0,0.0,WESTERN EUROPE
4,Liechtenstein,28,0.071795,0.0,0.0,WESTERN EUROPE
5,Marshall Islands,37,0.064912,0.0,0.0,OCEANIA
6,Tonga,63,0.063636,0.0,0.0,OCEANIA
7,Iceland,202,0.054891,2.0,0.990099,Channel Islands
8,Andorra,34,0.041463,0.0,0.0,SOUTHERN EUROPE
9,Federated States of Micronesia,36,0.033962,0.0,0.0,OCEANIA


**2. Bottom 10 countries by coverage: 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population**

In [34]:
final_data.sort_values('coverage', ascending=True).reset_index(drop=True).head(10)

Unnamed: 0,country,article_count,coverage,high_quality_count,high_quality_proportion,sub_region
0,India,985,7e-05,13.0,1.319797,SOUTH ASIA
1,Indonesia,211,7.8e-05,9.0,4.265403,SOUTHEAST ASIA
2,China,1133,8.1e-05,40.0,3.53045,EAST ASIA
3,Uzbekistan,28,8.2e-05,3.0,10.714286,CENTRAL ASIA
4,Ethiopia,101,8.8e-05,2.0,1.980198,EASTERN AFRICA
5,Zambia,25,0.000136,0.0,0.0,EASTERN AFRICA
6,"Korea, North",36,0.00014,8.0,22.222222,EAST ASIA
7,Thailand,112,0.000168,3.0,2.678571,SOUTHEAST ASIA
8,Mozambique,58,0.000186,0.0,0.0,EASTERN AFRICA
9,Bangladesh,321,0.000189,3.0,0.934579,SOUTH ASIA


**3. Top 10 countries by relative quality: 10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality**

In [35]:
final_data.sort_values('high_quality_proportion', ascending=False).reset_index(drop=True).head(10)

Unnamed: 0,country,article_count,coverage,high_quality_count,high_quality_proportion,sub_region
0,"Korea, North",36,0.00014,8.0,22.222222,EAST ASIA
1,Saudi Arabia,118,0.000337,15.0,12.711864,WESTERN ASIA
2,Romania,343,0.001783,42.0,12.244898,EASTERN EUROPE
3,Central African Republic,66,0.001366,8.0,12.121212,MIDDLE AFRICA
4,Uzbekistan,28,8.2e-05,3.0,10.714286,CENTRAL ASIA
5,Mauritania,48,0.001032,5.0,10.416667,WESTERN AFRICA
6,Guatemala,83,0.000459,7.0,8.433735,CENTRAL AMERICA
7,Dominica,12,0.016667,1.0,8.333333,CARIBBEAN
8,Syria,129,0.000665,10.0,7.751938,WESTERN ASIA
9,Benin,91,0.000745,7.0,7.692308,WESTERN AFRICA


**4. Bottom 10 countries by relative quality: 10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality**

In [36]:
final_data.sort_values('high_quality_proportion', ascending=True).reset_index(drop=True).head(10)

Unnamed: 0,country,article_count,coverage,high_quality_count,high_quality_proportion,sub_region
0,Solomon Islands,97,0.013566,0.0,0.0,OCEANIA
1,Liechtenstein,28,0.071795,0.0,0.0,WESTERN EUROPE
2,Lesotho,29,0.001354,0.0,0.0,SOUTHERN AFRICA
3,Comoros,51,0.005862,0.0,0.0,EASTERN AFRICA
4,Angola,106,0.000326,0.0,0.0,MIDDLE AFRICA
5,Moldova,424,0.011994,0.0,0.0,EASTERN EUROPE
6,Guadeloupe,49,0.013067,0.0,0.0,CARIBBEAN
7,San Marino,81,0.238235,0.0,0.0,SOUTHERN EUROPE
8,Turkmenistan,32,0.000531,0.0,0.0,CENTRAL ASIA
9,French Guiana,27,0.009184,0.0,0.0,SOUTH AMERICA


**5. Geographic regions by coverage: Ranking of geographic regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population**

To answer this question, I'll sum the article counts for each region and then merge the result with clean_wpds_region_data to get the regional population.

In [37]:
sub_region_article = final.groupby(['sub_region'])['article_count'].sum()
sub_region_article = pd.merge(sub_region_article, clean_wpds_region_data, left_on='sub_region', right_on='Name')
sub_region_article = sub_region_article.drop(columns=['FIPS', 'Type', 'TimeFrame', 'Data (M)'])

In [38]:
sub_region_article['region_coverage'] = (sub_region_article['article_count']/sub_region_article['Population']) * 100

In [39]:
sub_region_article = sub_region_article.rename(columns={'Name': 'sub_region', 'Population':'population'})
sub_region_article = sub_region_article[['sub_region', 'population', 'article_count', 'region_coverage']]

In [40]:
sub_region_article.sort_values('region_coverage', ascending=False)

Unnamed: 0,sub_region,population,article_count,region_coverage
9,OCEANIA,43155000,3132,0.007258
14,SOUTHERN EUROPE,153251000,3729,0.002433
17,WESTERN EUROPE,195479000,4577,0.002341
0,CARIBBEAN,43233000,697,0.001612
5,EASTERN EUROPE,291902000,3771,0.001292
16,WESTERN ASIA,280927000,2580,0.000918
1,CENTRAL AMERICA,178611000,1545,0.000865
10,SOUTH AMERICA,429191000,3042,0.000709
13,SOUTHERN AFRICA,67732000,473,0.000698
4,EASTERN AFRICA,444970000,2509,0.000564
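
As a quick arithmetic check, the snippet below recomputes `region_coverage` for the first two rows of the table above from their population and article counts. It is only a sanity check on the formula (articles divided by population, times 100), not part of the original pipeline.

```python
import pandas as pd

# Population and article counts copied from the first two rows above.
regions = pd.DataFrame({
    "sub_region": ["OCEANIA", "SOUTHERN EUROPE"],
    "population": [43155000, 153251000],
    "article_count": [3132, 3729],
})

# Coverage is the article count as a percentage of the regional population.
regions["region_coverage"] = (
    regions["article_count"] / regions["population"] * 100
)
rounded = regions["region_coverage"].round(6).tolist()
```

The rounded values agree with the `region_coverage` column reported for these two regions.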


**6. Geographic regions by relative quality: Ranking of geographic regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality**

In [41]:
high_quality_count = final.groupby(['sub_region'])['high_quality_count'].sum()
sub_region_article = pd.merge(high_quality_count, sub_region_article, on='sub_region')
sub_region_article['high_quality_proportion'] = (sub_region_article['high_quality_count']/sub_region_article['article_count']) * 100

In [42]:
sub_region_article.sort_values('high_quality_proportion', ascending=False)

Unnamed: 0,sub_region,high_quality_count,population,article_count,region_coverage,high_quality_proportion
8,NORTHERN AMERICA,104.0,368193000,1940,0.000527,5.360825
12,SOUTHEAST ASIA,73.0,661845000,2034,0.000307,3.588987
16,WESTERN ASIA,89.0,280927000,2580,0.000918,3.449612
5,EASTERN EUROPE,118.0,291902000,3771,0.001292,3.129143
3,EAST ASIA,76.0,1641063000,2477,0.000151,3.068228
2,CENTRAL ASIA,7.0,74961000,247,0.00033,2.834008
6,MIDDLE AFRICA,16.0,179757000,669,0.000372,2.391629
7,NORTHERN AFRICA,19.0,244344000,902,0.000369,2.10643
9,OCEANIA,63.0,43155000,3132,0.007258,2.011494
14,SOUTHERN EUROPE,74.0,153251000,3729,0.002433,1.984446


## Writeup: Reflections and Implications


Since we are analyzing data from English Wikipedia, I expected both the number of articles and the proportion of high-quality articles to be higher for English-speaking countries. After working through this assignment and looking at the results, I found that the countries with the highest coverage and the highest proportion of high-quality articles were countries where English is not the primary language. Once I started examining the raw dataset, I also expected countries with a small population to have higher article coverage, which the results section confirmed.

As for the data analysis part of this assignment, I really enjoyed it and found it interesting, since it gave me a chance to review many aspects of data preparation that I learned in previous classes. Data cleaning and data transformation are very important steps in any data analysis project, and this assignment was great practice in cleaning data, retrieving data from APIs, merging data sources, and performing calculations to get proportions.


**1. What biases did you expect to find in the data (before you started working with it), and why?<br>**
The bias I expected to find was a higher number of articles with a high-quality rating for English-speaking countries, like the USA and UK. I had this expectation because the dataset used in this assignment was from English Wikipedia. On the contrary, I found that English-speaking countries were not among the top 10 countries by relative quality and were also not in the list of top 10 countries by coverage.

<br>

**2. What (potential) sources of bias did you discover in the course of your data processing and analysis?<br>**
A potential source of bias was the language of the data source: we only used English Wikipedia and did not take into consideration the languages of all the countries in the dataset.
<br>

**3. How might a researcher supplement or transform this dataset to potentially correct for the limitations/biases you observed?<br>**
One way a researcher can correct for these limitations is by including only countries where English is the primary language, or by supplementing this analysis with Wikipedia articles in all the languages of the countries in the dataset.
<br>

