### Bias in Data

The goal of this assignment is to explore the concept of bias through data on Wikipedia articles - specifically, articles on political figures from a variety of countries. For this assignment, you will combine a dataset of Wikipedia articles with a dataset of country populations, and use a machine learning service called ORES to estimate the quality of each article.



### STEP 1: Data Acquisition
The first step is getting the data, which lives in several different places.

The Wikipedia politicians by country dataset can be found on Figshare. Read through the documentation for this repository, then download and unzip it to extract the data file, which is called page_data.csv.

The population data is available in CSV format as WPDS_2020_data.csv. This dataset is drawn from the world population data sheet published by the Population Reference Bureau.

In [52]:
#import necessary libraries
import pandas as pd
import numpy as np
import json
import requests
import math
import warnings
warnings.filterwarnings('ignore')

### Read politicians article data

 The Wikipedia politicians by country dataset is downloaded from Figshare. We download and unzip it to extract the data file, which is called page_data.csv.

In [2]:
articles = pd.read_csv('page_data.csv')
articles.head()

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


### Read population data 

The population data is available in CSV format as WPDS_2020_data.csv. This dataset is drawn from the world population data sheet published by the Population Reference Bureau.

In [58]:
wps_data = pd.read_csv("WPDS_2020_data.csv")
wps_data.head()

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
0,WORLD,WORLD,World,2019,7772.85,7772850000
1,AFRICA,AFRICA,Sub-Region,2019,1337.918,1337918000
2,NORTHERN AFRICA,NORTHERN AFRICA,Sub-Region,2019,244.344,244344000
3,DZ,Algeria,Country,2019,44.357,44357000
4,EG,Egypt,Country,2019,100.803,100803000


### STEP 2: Cleaning the Data
Both page_data.csv and WPDS_2020_data.csv contain some rows that you will need to filter out and/or ignore when you combine the datasets in the next step. 

In the case of articles data, the dataset contains some page names that start with the string "Template:". These pages are not Wikipedia articles, and should not be included in the analysis.

In [4]:
print("article data set size before dropping articles that start with template:", articles.shape)
articles = articles[~articles['page'].str.startswith('Template:')]
print("article data set size after dropping articles that start with template:", articles.shape)

articles.head()

article data set size before dropping articles that start with template: (47197, 3)
article data set size after dropping articles that start with template: (46701, 3)


Unnamed: 0,page,country,rev_id
1,Bir I of Kanem,Chad,355319463
10,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188
12,Yos Por,Cambodia,393822005
23,Julius Gregr,Czech Republic,395521877
24,Edvard Gregr,Czech Republic,395526568


Similarly, population data set contains some rows that provide cumulative regional population counts, rather than country-level counts. These rows are distinguished by having ALL CAPS values in the 'geography' field (e.g. AFRICA, OCEANIA). These rows won't match the country values in page_data.csv, but you will want to retain them (either in the original file, or a separate file) so that you can report coverage and quality by region in the analysis section.


In [46]:
print("population data set size before dropping geography field name with ALL CAPS", population.shape)
population = wps_data[~wps_data['Name'].str.isupper()]
print("population data set size after dropping geography field name with ALL CAPS", population.shape)

population.loc[population['Type'] != 'Country',]

population data set size before dropping geography field name with ALL CAPS (210, 6)
population data set size after dropping geography field name with ALL CAPS (210, 6)


Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
168,Channel Islands,Channel Islands,Sub-Region,2019,0.172,172000


### STEP 3 -  Getting Article quality predictions.

Using the wikimedia API edpoints that connects to a machine learning algorithm called ORES, we obtain predictions for each of the articles listted in the articles data. 

In order to get article predictions for each article in the Wikipedia dataset, you will first need to read page_data.csv into Python, and then read through the dataset line by line, using the value of the rev_id column to make an API query.

The ORES API expects a revision ID, which is the third column in the Wikipedia dataset, and a model, which is "articlequality".

In [118]:
# update correspondingly for reproducing
headers = {
    'User-Agent': 'https://github.com/poornima-muthukumar',
    'From': 'muthupoo@uw.edu'
}

In [7]:
endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/{revid}/{model}'

### Example API call for single Revision ID

In [8]:
parameters = {'project' : 'enwiki',
              'model'   : 'articlequality',
              'revid'   : '521986779'
              }    
call = requests.get(endpoint.format(**parameters), headers=headers)
response = call.json()

print("JSON Dump", json.dumps(response, indent=4))
print("Prediction", response['enwiki']['scores']['521986779']['articlequality']['score']['prediction'])

JSON Dump {
    "enwiki": {
        "models": {
            "articlequality": {
                "version": "0.8.2"
            }
        },
        "scores": {
            "521986779": {
                "articlequality": {
                    "score": {
                        "prediction": "Stub",
                        "probability": {
                            "B": 0.009908958962632771,
                            "C": 0.009436325360461916,
                            "FA": 0.0018057235542233237,
                            "GA": 0.0026249279565224385,
                            "Start": 0.04946106501006811,
                            "Stub": 0.9267629991560914
                        }
                    }
                }
            }
        }
    }
}
Prediction Stub


In [9]:
def api_call(rev_id):
    parameters = {'project' : 'enwiki',
                      'model'   : 'articlequality',
                      'revid'   : rev_id
                      }  
    call = requests.get(endpoint.format(**parameters), headers=headers)
    response = call.json()
    
    return response

We read the articles csv downloaded earlier row by and row and for each revision id we query the ORES API and parse the json response to retrieve the quality of the article.

We write the result of revision id to quality mapping in a csv file, first by converting it to pandas data frame.

This step takes a while to run as it queries the API for roughly ~47K ids. We enclose the code in a try catch block to catch API calls for which we do not get a response. We also write the revids for which the quality is missing into a csv file called missing_prediction_revids. 

This individual API call for each revision id takes a really long time to run. Instead we can use the bulk API call that returns result for 100 revision id at once. 

In [10]:
# revids = []
# predictions = []
# missing_prediction_revids = []

# for row in articles.iterrows():
#     try:
#         rev_id = row[1]['rev_id']
#         response = api_call(rev_id)
#         revids.append(rev_id)
#         predictions.append(response['enwiki']['scores'][str(rev_id)]['articlequality']['score']['prediction'])
#     except:
#         missing_prediction_revids.append(row[1]['rev_id'])

# print("Writing to file revid to quality mapping")
# revid_quality = pd.dataFrame([revids, predictions]).T
# revid_quality.columns = ['revid', 'quality']
# revid_quality.to_csv('wikipedia-politician-article-quality.csv', index=False)


# print("Writing to file rev_ids missing prediction")
# missing_prediction_revids.to_csv('missing_prediction_revids.csv', index=False)

In [11]:
rev_ids = articles['rev_id'].tolist()

In [12]:
def bulk_api_call(rev_ids):
    
    bulk_endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
    parameters = {'project' : 'enwiki',
                  'model'   : 'articlequality',
                  'revids'  : '|'.join(str(x) for x in rev_ids)
                  }  
    
    call = requests.get(bulk_endpoint.format(**parameters), headers=headers)
    response = call.json()
    
    return response

In [13]:
start = 0
end = 50

revids = []
predictions = []
missing_prediction_revids = []

#for each rev id retrieve predicition by sending 50 revids at once using bulk api call. 
for t in range(math.ceil(len(rev_ids)/50)):
    ids = rev_ids[start:end]
    response = bulk_api_call(ids)
    for revid in ids:
        if not response['enwiki']['scores'][str(revid)]['articlequality'].get('score') is None:
            revids.append(revid)
            
            predictions.append(response['enwiki']['scores'][str(revid)]['articlequality']['score']['prediction'])
        else:
            missing_prediction_revids.append(revid)
            
    start+=50
    end+=50 
    

In [14]:
print("Writing to file revid to quality mapping")
revid_quality = pd.DataFrame([revids, predictions]).T
revid_quality.columns = ['revision_id', 'article_quality_est']
revid_quality.revision_id = revid_quality.revision_id.astype(int)
revid_quality.to_csv('wikipedia-politician-article-quality.csv', index=False)


print("Writing to file revid missing prediction")
revid_missing_prediction = pd.DataFrame([missing_prediction_revids]).T
revid_missing_prediction.columns = ['revid']
revid_missing_prediction.to_csv('missing_prediction_revids.csv', index=False)

Writing to file revid to quality mapping
Writing to file revid missing prediction


### STEP 4 - Combining the data sets

Some processing of the data will be necessary! In particular, you'll need to - after retrieving and including the ORES data for each article - merge the wikipedia data and population data together. Both have fields containing country names for just that purpose. After merging the data, you'll invariably run into entries which cannot be merged. Either the population dataset does not have an entry for the equivalent Wikipedia country, or vise versa.

Please remove any rows that do not have matching data, and output them to a CSV file called:
wp_wpds_countries-no_match.csv

Consolidate the remaining data into a single CSV file called:
wp_wpds_politicians_by_country.csv


In [15]:
# join two tables
articles_population = articles.merge(population, how='outer', left_on='country', right_on='Name')

missing_population_or_article = articles_population.loc[(articles_population['country'].isnull() | 
                                        articles_population['Name'].isnull())]

#save missing articles to csv
missing_population_or_article.to_csv('wp_wpds_countries-no_match.csv', index = False)

In [16]:
final_merged  = articles.merge(population, how='inner', left_on='country', right_on='Name')

# remove columns not required
del final_merged['FIPS']
del final_merged['Name']
del final_merged['Type']
del final_merged['TimeFrame']
del final_merged['Population']

final_merged = final_merged.rename(index=str, columns={"page": "article_name", "rev_id": "revision_id", "Data (M)":"population"})

final_merged_quality = final_merged.merge(revid_quality, on='revision_id')

#save merged to csv
final_merged_quality.to_csv('wp_wpds_politicians_by_country.csv', index = False)

In [17]:
revid_quality

Unnamed: 0,revision_id,article_quality_est
0,355319463,Stub
1,393276188,Stub
2,393822005,Stub
3,395521877,Stub
4,395526568,Stub
...,...,...
46420,807481636,C
46421,807482007,GA
46422,807483006,C
46423,807483153,GA


###  STEP 6 Results

Your results from this analysis will be published in the form of data tables. You are being asked to produce six total tables, that show:

### 1. Top 10 countries by coverage: 10 highest-ranked countries in terms of number of politician articles as a proportion of country population

In [18]:
#compute proportion
country_ranking_by_article = pd.DataFrame(np.column_stack([np.sort(
                                                          final_merged_quality['country'].unique()),
                                                          final_merged_quality['country'].value_counts()/(final_merged_quality.groupby('country')['population'].mean()*10000.00)]),
                                                          columns=['country','politiican article as a proportion of country population'])







In [19]:
#sort proportion and display
country_ranking_by_article.sort_values('politiican article as a proportion of country population',ascending=False)[:10]

Unnamed: 0,country,politiican article as a proportion of country population
169,Tuvalu,0.54
117,Nauru,0.472727
138,San Marino,0.238235
110,Monaco,0.105263
95,Liechtenstein,0.071795
104,Marshall Islands,0.064912
164,Tonga,0.063636
70,Iceland,0.05462
3,Andorra,0.041463
52,Federated States of Micronesia,0.033962


### 2. Bottom 10 countries by coverage: 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population

In [20]:
country_ranking_by_article.sort_values('politiican article as a proportion of country population')[:10]

Unnamed: 0,country,politiican article as a proportion of country population
71,India,6.9e-05
72,Indonesia,7.7e-05
34,China,8.1e-05
176,Uzbekistan,8.2e-05
51,Ethiopia,8.8e-05
181,Zambia,0.000136
84,"Korea, North",0.00014
162,Thailand,0.000168
114,Mozambique,0.000186
13,Bangladesh,0.000187


### 3. Top 10 countries by relative quality: 10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality


In [43]:
total_article_count = final_merged_quality.groupby('country').size().reset_index(name='total_article_count')

#filter by good ratings
good_rating = ['GA','FA']
good_article_count = final_merged_quality[final_merged_quality['article_quality_est'].isin(good)].groupby('country').size().reset_index(name='good_article_count')
merged_article = total_article_count.merge(good_article_count, on='country',how='left')
merged_article.fillna(0, inplace=True)
merged_article ['proportion of politician articles that are of GA and FA-quality'] = (merged_article ['good_article_count'] * 100)/merged_article ['total_article_count']

In [44]:
merged_article.sort_values('proportion of politician articles that are of GA and FA-quality',ascending=False)[:10]

Unnamed: 0,country,total_article_count,good_article_count,proportion of politician articles that are of GA and FA-quality
84,"Korea, North",36,8.0,22.222222
140,Saudi Arabia,117,15.0,12.820513
135,Romania,343,42.0,12.244898
31,Central African Republic,66,8.0,12.121212
176,Uzbekistan,28,3.0,10.714286
106,Mauritania,48,5.0,10.416667
64,Guatemala,83,7.0,8.433735
44,Dominica,12,1.0,8.333333
158,Syria,128,10.0,7.8125
18,Benin,91,7.0,7.692308


### 4. Bottom 10 countries by relative quality: 10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality


In [42]:
merged_article.sort_values('proportion of politician articles that are of GA and FA-quality')[:10]

Unnamed: 0,country,total_article_count,good_article_count,proportion of politician articles that are of GA and FA-quality
148,Solomon Islands,97,0.0,0.0
164,Tonga,63,0.0,0.0
117,Nauru,52,0.0,0.0
116,Namibia,162,0.0,0.0
43,Djibouti,37,0.0,0.0
114,Mozambique,58,0.0,0.0
110,Monaco,40,0.0,0.0
49,Eritrea,16,0.0,0.0
50,Estonia,148,0.0,0.0
109,Moldova,421,0.0,0.0


### 5. Geographic regions by coverage: Ranking of geographic regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population


In [59]:
#filter sub regions
region = wps_data[wps_data['Name'].isin(['AFRICA','LATIN AMERICA AND THE CARIBBEAN','ASIA','EUROPE','OCEANIA'])]
region

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
1,AFRICA,AFRICA,Sub-Region,2019,1337.918,1337918000
67,LATIN AMERICA AND THE CARIBBEAN,LATIN AMERICA AND THE CARIBBEAN,Sub-Region,2019,651.036,651036000
109,ASIA,ASIA,Sub-Region,2019,4625.927,4625927000
166,EUROPE,EUROPE,Sub-Region,2019,746.622,746622000
216,OCEANIA,OCEANIA,Sub-Region,2019,43.155,43155000


In [60]:
# update mapping to add regional column
wps_data['Region'] = np.nan
wps_data['Region'][:67]='AFRICA'
wps_data['Region'][67:109]='LATIN AMERICA AND THE CARIBBEAN'
wps_data['Region'][109:166]='ASIA'
wps_data['Region'][166:216]='EUROPE'
wps_data['Region'][216:]='OCEANIA'

In [69]:
final_merged_quality

Unnamed: 0,article_name,country,revision_id,population,article_quality_est
0,Bir I of Kanem,Chad,355319463,16.877,Stub
1,Abdullah II of Kanem,Chad,498683267,16.877,Stub
2,Salmama II of Kanem,Chad,565745353,16.877,Stub
3,Kuri I of Kanem,Chad,565745365,16.877,Stub
4,Mohammed I of Kanem,Chad,565745375,16.877,Stub
...,...,...,...,...,...
44563,Rita Sinon,Seychelles,800323154,0.098,Stub
44564,Sylvette Frichot,Seychelles,800323798,0.098,Stub
44565,May De Silva,Seychelles,800969960,0.098,Start
44566,Vincent Meriton,Seychelles,802051093,0.098,Stub


In [63]:
regional_combined = pd.merge(left = wps_data[['Name','Region']], right=final_merged_quality, how='right', left_on='Name', right_on='country')
regional_combined

Unnamed: 0,Name,Region,article_name,country,revision_id,population,article_quality_est
0,Chad,AFRICA,Bir I of Kanem,Chad,355319463,16.877,Stub
1,Chad,AFRICA,Abdullah II of Kanem,Chad,498683267,16.877,Stub
2,Chad,AFRICA,Salmama II of Kanem,Chad,565745353,16.877,Stub
3,Chad,AFRICA,Kuri I of Kanem,Chad,565745365,16.877,Stub
4,Chad,AFRICA,Mohammed I of Kanem,Chad,565745375,16.877,Stub
...,...,...,...,...,...,...,...
44563,Seychelles,AFRICA,Rita Sinon,Seychelles,800323154,0.098,Stub
44564,Seychelles,AFRICA,Sylvette Frichot,Seychelles,800323798,0.098,Stub
44565,Seychelles,AFRICA,May De Silva,Seychelles,800969960,0.098,Start
44566,Seychelles,AFRICA,Vincent Meriton,Seychelles,802051093,0.098,Stub


In [103]:
#compute regional article count and merge with population of regions data
regional_article_count = regional_combined.groupby('Region').size().reset_index(name='regional_article_count')
regional_article_count_population = regional_article_count.merge(region[['Name', 'Population']], left_on='Region', right_on='Name', how='inner')
regional_article_count_population

Unnamed: 0,Region,regional_article_count,Name,Population
0,AFRICA,8740,AFRICA,1337918000
1,ASIA,11667,ASIA,4625927000
2,EUROPE,15765,EUROPE,746622000
3,LATIN AMERICA AND THE CARIBBEAN,5270,LATIN AMERICA AND THE CARIBBEAN,651036000
4,OCEANIA,3126,OCEANIA,43155000


In [105]:
# compute regional proportional
regional_ranking_by_article = pd.DataFrame(np.column_stack([np.sort(
                                                          regional_article_count_population['Region'].unique()),
                                                          regional_article_count_population['regional_article_count'],
                                                            regional_article_count_population['Population'],
                                                          regional_article_count_population['regional_article_count']/regional_article_count_population['Population']]),
                                                          columns=['Region','Article Count', 'Population','Proportion'])







In [106]:
regional_ranking_by_article.sort_values('Proportion')[:10]

Unnamed: 0,Region,Article Count,Population,Proportion
1,ASIA,11667,4625927000,3e-06
0,AFRICA,8740,1337918000,7e-06
3,LATIN AMERICA AND THE CARIBBEAN,5270,651036000,8e-06
2,EUROPE,15765,746622000,2.1e-05
4,OCEANIA,3126,43155000,7.2e-05


### 6 Geographic regions by coverage: Ranking of geographic regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality

In [116]:
total_article_count = regional_combined.groupby('Region').size().reset_index(name='total_article_count')
good_rating = ['GA','FA']
good_article_count = regional_combined[regional_combined['article_quality_est'].isin(good)].groupby('Region').size().reset_index(name='good_article_count')
merged_article = total_article_count.merge(good_article_count, on='Region',how='left')
merged_article.fillna(0, inplace=True)

merged_article ['proportion of politician articles that are of GA and FA-quality'] = (merged_article ['good_article_count'])/merged_article ['total_article_count']

In [117]:
merged_article.sort_values('proportion of politician articles that are of GA and FA-quality ',ascending=False)[:10]

Unnamed: 0,Region,total_article_count,good_article_count,proportion of politician articles that are of GA and FA-quality
1,ASIA,11667,316,0.027085
0,AFRICA,8740,223,0.025515
2,EUROPE,15765,350,0.022201
4,OCEANIA,3126,63,0.020154
3,LATIN AMERICA AND THE CARIBBEAN,5270,76,0.014421


### Reflection

Looking the ranking of countries it is not surprising to see that countries which are smaller in size with small population are ranked higher. Tuvalu with a population of only 10000 has 54 articles. It is surprising to see such a smaller country with so many articles. 

Similarly countries lower in the rank are the countries with largest population like China and India. The number of politicians in a country is not directly proportional to the population of the country thereby resulting in lower ratio when compared to smaller nations. 

Also it is possbile that in countries like China and India, the people prefer writing articles in their local languages thereby resulting in lower Engligh wikipedia articles. 

The fact that North Korea stood out as the country with the highest good quality articles could be because these articles are infact written by americans given the increased interest among americans with North Korea.
