### Bias in Data

The goal of this assignment is to explore the concept of bias through data on Wikipedia articles - specifically, articles on political figures from a variety of countries. For this assignment, you will combine a dataset of Wikipedia articles with a dataset of country populations, and use a machine learning service called ORES to estimate the quality of each article.



### STEP 1: Data Acquisition
The first step is getting the data, which lives in several different places.

The Wikipedia politicians by country dataset can be found on Figshare. Read through the documentation for this repository, then download and unzip it to extract the data file, which is called page_data.csv.

The population data is available in CSV format as WPDS_2020_data.csv. This dataset is drawn from the world population data sheet published by the Population Reference Bureau.

In [12]:
import pandas as pd
import numpy as np
import json
import requests
import math

### Read politicians article data

The Wikipedia politicians by country dataset was downloaded from Figshare and unzipped to extract the data file, page_data.csv, which we load here.

In [2]:
articles = pd.read_csv('page_data.csv')
articles.head()

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


### Read population data 

We load the population data from WPDS_2020_data.csv.

In [3]:
population = pd.read_csv("WPDS_2020_data.csv")
population.head()

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
0,WORLD,WORLD,World,2019,7772.85,7772850000
1,AFRICA,AFRICA,Sub-Region,2019,1337.918,1337918000
2,NORTHERN AFRICA,NORTHERN AFRICA,Sub-Region,2019,244.344,244344000
3,DZ,Algeria,Country,2019,44.357,44357000
4,EG,Egypt,Country,2019,100.803,100803000


### STEP 2: Cleaning the Data
Both page_data.csv and WPDS_2020_data.csv contain some rows that you will need to filter out and/or ignore when you combine the datasets in the next step. 

In the case of articles data, the dataset contains some page names that start with the string "Template:". These pages are not Wikipedia articles, and should not be included in the analysis.

In [4]:
print("article data set size before dropping articles that start with template:", articles.shape)
articles = articles[~articles['page'].str.startswith('Template:')]
print("article data set size after dropping articles that start with template:", articles.shape)

articles.head()

article data set size before dropping articles that start with template: (47197, 3)
article data set size after dropping articles that start with template: (46701, 3)


Unnamed: 0,page,country,rev_id
1,Bir I of Kanem,Chad,355319463
10,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188
12,Yos Por,Cambodia,393822005
23,Julius Gregr,Czech Republic,395521877
24,Edvard Gregr,Czech Republic,395526568


Similarly, the population data set contains some rows that provide cumulative regional population counts rather than country-level counts. These rows are distinguished by having ALL CAPS values in the 'geography' field, which is the Name column in WPDS_2020_data.csv (e.g. AFRICA, OCEANIA). These rows won't match the country values in page_data.csv, but you will want to retain them (either in the original file, or a separate file) so that you can report coverage and quality by region in the analysis section.


In [5]:
print("population data set size before dropping geography field name with ALL CAPS", population.shape)
population = population[~population['Name'].str.isupper()]
print("population data set size after dropping geography field name with ALL CAPS", population.shape)


population data set size before dropping geography field name with ALL CAPS (234, 6)
population data set size after dropping geography field name with ALL CAPS (210, 6)
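The cell above drops the regional rows outright. A minimal sketch of one way to keep them, and to tag each country with its region, is shown here on a few rows copied from the preview of WPDS_2020_data.csv. The names `toy`, `regions`, `countries` and the `region` column are illustrative, and the forward-fill relies on the assumption that each regional row appears before the countries that belong to it.

```python
import pandas as pd

# Toy rows shaped like WPDS_2020_data.csv (values copied from the preview above)
toy = pd.DataFrame({
    "Name": ["AFRICA", "NORTHERN AFRICA", "Algeria", "Egypt"],
    "Type": ["Sub-Region", "Sub-Region", "Country", "Country"],
    "Population": [1337918000, 244344000, 44357000, 100803000],
})

# Regional rows are the ones whose Name is written in ALL CAPS
is_region = toy["Name"].str.isupper()

# Assumption: a regional row precedes its member countries, so the most
# recent ALL CAPS name can be carried forward onto the country rows.
toy["region"] = toy["Name"].where(is_region).ffill()

regions = toy[is_region].drop(columns="region")   # kept for the regional analysis
countries = toy[~is_region]                       # used for the country-level merge

print(countries[["Name", "region"]])
```

Keeping `regions` as its own frame leaves the country-level table clean for merging while the regional totals remain available later.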


### STEP 3: Getting Article Quality Predictions

Using the Wikimedia API endpoint that connects to a machine learning system called ORES, we obtain quality predictions for each of the articles listed in the articles data.

In order to get article predictions for each article in the Wikipedia dataset, you will first need to read page_data.csv into Python, and then read through the dataset line by line, using the value of the rev_id column to make an API query.

The ORES API expects a revision ID, which is the third column in the Wikipedia dataset, and a model, which is "articlequality".

In [6]:
headers = {
    'User-Agent': 'https://github.com/poornima-muthukumar',
    'From': 'muthupoo@uw.edu'
}

In [7]:
endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/{revid}/{model}'

### Example API call for single Revision ID

In [8]:
parameters = {'project' : 'enwiki',
              'model'   : 'articlequality',
              'revid'   : '521986779'
              }    
call = requests.get(endpoint.format(**parameters), headers=headers)
response = call.json()

print("JSON Dump", json.dumps(response, indent=4))
print("Prediction", response['enwiki']['scores']['521986779']['articlequality']['score']['prediction'])

JSON Dump {
    "enwiki": {
        "models": {
            "articlequality": {
                "version": "0.8.2"
            }
        },
        "scores": {
            "521986779": {
                "articlequality": {
                    "score": {
                        "prediction": "Stub",
                        "probability": {
                            "B": 0.009908958962632771,
                            "C": 0.009436325360461916,
                            "FA": 0.0018057235542233237,
                            "GA": 0.0026249279565224385,
                            "Start": 0.04946106501006811,
                            "Stub": 0.9267629991560914
                        }
                    }
                }
            }
        }
    }
}
Prediction Stub
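The nested lookup above raises a KeyError whenever a level of the response is missing. A small sketch of a more forgiving lookup follows; `extract_prediction` is a hypothetical helper, and the second entry in `sample` is an assumed shape for a revision that came back without a score.

```python
def extract_prediction(response, rev_id, project="enwiki", model="articlequality"):
    """Return the predicted class for rev_id, or None when no score is present."""
    entry = (
        response.get(project, {})
        .get("scores", {})
        .get(str(rev_id), {})
        .get(model, {})
    )
    score = entry.get("score")
    return score["prediction"] if score else None


# Trimmed-down response in the shape printed above, plus a hypothetical
# second revision for which no score was returned.
sample = {
    "enwiki": {
        "scores": {
            "521986779": {"articlequality": {"score": {"prediction": "Stub"}}},
            "123": {"articlequality": {"error": {"type": "placeholder"}}},
        }
    }
}

print(extract_prediction(sample, 521986779))  # Stub
print(extract_prediction(sample, 123))        # None
print(extract_prediction(sample, 999))        # None
```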


In [9]:
def api_call(rev_id):
    """Query ORES for a single revision ID and return the parsed JSON response."""
    parameters = {'project' : 'enwiki',
                  'model'   : 'articlequality',
                  'revid'   : rev_id
                  }
    call = requests.get(endpoint.format(**parameters), headers=headers)
    response = call.json()

    return response

We read the articles CSV loaded earlier row by row; for each revision ID we query the ORES API and parse the JSON response to retrieve the predicted quality of the article.

We then convert the revision-ID-to-quality mapping to a pandas DataFrame and write it to a CSV file.

This step takes a while to run, since it queries the API for roughly 47K revision IDs. We enclose the code in a try/except block to catch API calls that return no usable response, and we write the revision IDs with missing quality to a CSV file called missing_prediction_revids.csv.

Because one API call per revision ID is very slow, we can instead use the bulk API call, which scores many revision IDs in a single request; the batch loop further down sends 50 at a time.

In [None]:
%%script false 

revids = []
predictions = []
missing_prediction_revids = []

for _, row in articles.iterrows():
    rev_id = row['rev_id']
    try:
        response = api_call(rev_id)
        # Look up the prediction first so both lists stay aligned if it is missing
        prediction = response['enwiki']['scores'][str(rev_id)]['articlequality']['score']['prediction']
        revids.append(rev_id)
        predictions.append(prediction)
    except Exception:
        # No usable response, or no score for this revision
        missing_prediction_revids.append(rev_id)
        print("Exception while calling ORES API for revision id", rev_id)


print("Writing to file revid to quality mapping")
revid_quality = pd.DataFrame({'revid': revids, 'quality': predictions})
revid_quality.to_csv('wikipedia-politician-article-quality.csv', index=False)


print("Writing to file rev_ids missing prediction")
# A plain list has no to_csv method, so wrap it in a DataFrame first
pd.DataFrame({'revid': missing_prediction_revids}).to_csv('missing_prediction_revids.csv', index=False)

Exception while calling ORES API for revision id 516633096
Exception while calling ORES API for revision id 550682925
Exception while calling ORES API for revision id 627547024
Exception while calling ORES API for revision id 636911471
Exception while calling ORES API for revision id 669987106
Exception while calling ORES API for revision id 671484594
Exception while calling ORES API for revision id 680981536
Exception while calling ORES API for revision id 684023803
Exception while calling ORES API for revision id 684023859
Exception while calling ORES API for revision id 696608092
Exception while calling ORES API for revision id 698572327
Exception while calling ORES API for revision id 699260156
Exception while calling ORES API for revision id 703773782
Exception while calling ORES API for revision id 706204833
Exception while calling ORES API for revision id 706810694
Exception while calling ORES API for revision id 708482569
Exception while calling ORES API for revision id 7088130

Exception while calling ORES API for revision id 777163201
Exception while calling ORES API for revision id 778388481
Exception while calling ORES API for revision id 778618827
Exception while calling ORES API for revision id 779101752
Exception while calling ORES API for revision id 779135011
Exception while calling ORES API for revision id 779409209
Exception while calling ORES API for revision id 779899001
Exception while calling ORES API for revision id 779954797
Exception while calling ORES API for revision id 779957437
Exception while calling ORES API for revision id 780854503
Exception while calling ORES API for revision id 781072934
Exception while calling ORES API for revision id 781427254


In [9]:
rev_ids = articles['rev_id'].tolist()

In [10]:
def bulk_api_call(rev_ids):
    
    bulk_endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
    parameters = {'project' : 'enwiki',
                  'model'   : 'articlequality',
                  'revids'  : '|'.join(str(x) for x in rev_ids)
                  }  
    
    call = requests.get(bulk_endpoint.format(**parameters), headers=headers)
    response = call.json()
    
    return response
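To give a sense of what batching saves, here is a rough sketch of the arithmetic together with a small slicing helper. `batches` is a hypothetical helper, not part of the notebook; the article count is the one printed after the "Template:" rows were dropped, and the batch size of 50 is the one this notebook uses.

```python
import math


def batches(items, size):
    """Yield consecutive slices of items with at most size elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


n_articles = 46701   # rows left after dropping the "Template:" pages
batch_size = 50      # batch size used in this notebook

# One request per batch instead of one request per article
n_requests = math.ceil(n_articles / batch_size)
print(n_requests)  # 935

# Small demonstration on 120 made-up IDs
demo_ids = list(range(1, 121))
demo_batches = list(batches(demo_ids, batch_size))
print([len(b) for b in demo_batches])  # [50, 50, 20]
```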

In [23]:
batch_size = 50

revids = []
predictions = []
missing_prediction_revids = []

# Query ORES in batches of 50 revision IDs
for start in range(0, len(rev_ids), batch_size):
    ids = rev_ids[start:start + batch_size]
    response = bulk_api_call(ids)
    scores = response['enwiki']['scores']
    for revid in ids:
        score = scores[str(revid)]['articlequality'].get('score')
        if score is not None:
            revids.append(revid)
            predictions.append(score['prediction'])
        else:
            # No score was returned for this revision
            missing_prediction_revids.append(revid)
            print("Missing Quality for Revision Id", revid)
    

Missing Quality for Revision Id 516633096
Missing Quality for Revision Id 550682925
Missing Quality for Revision Id 627547024
Missing Quality for Revision Id 636911471
Missing Quality for Revision Id 669987106
Missing Quality for Revision Id 671484594
Missing Quality for Revision Id 680981536
Missing Quality for Revision Id 684023803
Missing Quality for Revision Id 684023859
Missing Quality for Revision Id 696608092
Missing Quality for Revision Id 698572327
Missing Quality for Revision Id 699260156
Missing Quality for Revision Id 703773782
Missing Quality for Revision Id 706204833
Missing Quality for Revision Id 706810694
Missing Quality for Revision Id 708482569
Missing Quality for Revision Id 708813010
Missing Quality for Revision Id 709508670
Missing Quality for Revision Id 710135228
Missing Quality for Revision Id 710311600
Missing Quality for Revision Id 710715953
Missing Quality for Revision Id 711224007
Missing Quality for Revision Id 711288191
Missing Quality for Revision Id 71

Missing Quality for Revision Id 797311794
Missing Quality for Revision Id 797902544
Missing Quality for Revision Id 798073295
Missing Quality for Revision Id 798615865
Missing Quality for Revision Id 798841317
Missing Quality for Revision Id 798865216
Missing Quality for Revision Id 798891126
Missing Quality for Revision Id 798945915
Missing Quality for Revision Id 798952867
Missing Quality for Revision Id 799036953
Missing Quality for Revision Id 799263010
Missing Quality for Revision Id 799300265
Missing Quality for Revision Id 799540668
Missing Quality for Revision Id 799880073
Missing Quality for Revision Id 799933274
Missing Quality for Revision Id 799981363
Missing Quality for Revision Id 800072571
Missing Quality for Revision Id 800144885
Missing Quality for Revision Id 800233652
Missing Quality for Revision Id 800299837
Missing Quality for Revision Id 800471126
Missing Quality for Revision Id 800574730
Missing Quality for Revision Id 800707821
Missing Quality for Revision Id 80

In [29]:
print("Writing to file revid to quality mapping")
revid_quality = pd.DataFrame({'revision_id': revids, 'article_quality_est': predictions})
revid_quality.to_csv('wikipedia-politician-article-quality.csv', index=False)


print("Writing to file revid missing prediction")
revid_missing_prediction = pd.DataFrame({'revid': missing_prediction_revids})
revid_missing_prediction.to_csv('missing_prediction_revids.csv', index=False)

Writing to file revid to quality mapping
Writing to file revid missing prediction


### STEP 4: Combining the Data Sets

Some processing of the data will be necessary! In particular, after retrieving and including the ORES data for each article, you'll need to merge the Wikipedia data and population data together. Both have fields containing country names for just that purpose. After merging the data, you'll invariably run into entries that cannot be merged: either the population dataset does not have an entry for the equivalent Wikipedia country, or vice versa.

Please remove any rows that do not have matching data, and output them to a CSV file called:
wp_wpds_countries-no_match.csv

Consolidate the remaining data into a single CSV file called:
wp_wpds_politicians_by_country.csv


In [30]:
articles_population = articles.merge(population, how='outer', left_on='country', right_on='Name')

missing_population_or_article = articles_population.loc[(articles_population['country'].isnull() | 
                                        articles_population['Name'].isnull())]

missing_population_or_article.to_csv('wp_wpds_countries-no_match.csv', index = False)
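An alternative to testing both key columns for nulls is the `indicator` option of `DataFrame.merge`, which labels every row of an outer merge with its origin. The sketch below runs on made-up rows; the country "Atlantis" and the population values are placeholders, not data from either file.

```python
import pandas as pd

# Made-up rows in the shape of the two tables
articles_demo = pd.DataFrame({
    "page": ["A", "B", "C"],
    "country": ["Chad", "Zambia", "Atlantis"],   # "Atlantis" has no population row
    "rev_id": [1, 2, 3],
})
population_demo = pd.DataFrame({
    "Name": ["Chad", "Zambia", "Egypt"],          # "Egypt" has no article here
    "Population": [1.0, 2.0, 3.0],                # placeholder values
})

merged = articles_demo.merge(
    population_demo, how="outer",
    left_on="country", right_on="Name",
    indicator=True,   # adds a "_merge" column: left_only / right_only / both
)

matched = merged[merged["_merge"] == "both"].drop(columns="_merge")
no_match = merged[merged["_merge"] != "both"]

print(len(matched), len(no_match))  # 2 2
```

The `_merge` column also records which side a mismatch came from, which can make the no-match file easier to read.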

In [31]:
final_merged = articles.merge(population, how='inner', left_on='country', right_on='Name')

# Keep the article fields plus 'Data (M)', which holds the population in millions
final_merged = final_merged.drop(columns=['FIPS', 'Name', 'Type', 'TimeFrame', 'Population'])

final_merged = final_merged.rename(index=str, columns={"page": "article_name", "rev_id": "revision_id", "Data (M)":"population"})

# Attach the ORES predictions; this inner join drops revisions without a prediction
final_merged_quality = final_merged.merge(revid_quality, on='revision_id')
final_merged_quality.to_csv('wp_wpds_politicians_by_country.csv', index = False)
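Because the last merge is an inner join, articles without a quality prediction disappear silently. A quick sketch of how one might count them uses a left join on made-up rows; the frame names and values are illustrative only.

```python
import pandas as pd

# Made-up rows in the shape of the merged table and the prediction table
merged_demo = pd.DataFrame({
    "revision_id": [1, 2, 3],
    "country": ["Chad", "Chad", "Zambia"],
})
quality_demo = pd.DataFrame({
    "revision_id": [1, 3],
    "article_quality_est": ["Stub", "GA"],
})

# A left join keeps every article; validate guards against duplicate revision IDs
with_quality = merged_demo.merge(
    quality_demo, on="revision_id", how="left", validate="one_to_one"
)

n_missing = int(with_quality["article_quality_est"].isna().sum())
print(n_missing)  # 1
```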

### STEP 5: Analysis