# Assignment 2: Bias in data
### Data 512
### Saturday, October 5
### Tara Wilson

-----------------

## Setup

This Jupyter Notebook is created using [Python version 3.7](https://www.python.org/downloads/release/python-370/).

First, I will import the necessary libraries to run the code. The following libraries are used:

[pandas](https://pandas.pydata.org/)    
[json](https://docs.python.org/3/library/json.html)  
[requests](https://realpython.com/python-requests/)  
[logging](https://docs.python.org/3/library/logging.html)  
[numpy](https://numpy.org/)  

In [1]:
import pandas as pd
import json
import requests
import logging
import numpy as np

I will set up a log file to record messages later in the notebook. I set the logging `level` to `INFO` so that informational messages, along with anything more severe such as warnings and errors, are recorded while debug output is ignored. I set `format` to `%(message)s` so that each entry is written as the bare message string. I set `filename` to `bias_in_data_error_log.log` so the log is saved under that name in the current working directory. Finally, I set `filemode` to `w` (write mode) so that the file is overwritten on each run rather than appended to, since I do not need to keep historical logs.

In [2]:
logging.basicConfig(level=logging.INFO,
                    format='%(message)s',
                    filename='bias_in_data_error_log.log',
                    filemode='w')
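
As a minimal sketch of how these settings behave, the example below applies the same level, format, and write mode to a separate named logger writing to a temporary file. The logger name, the temporary path, and the sample messages are assumptions made for illustration; the notebook itself configures the root logger with `basicConfig`.

```python
import logging
import os
import tempfile

# Hypothetical log location, used so the sketch does not touch the real log file
log_path = os.path.join(tempfile.mkdtemp(), "example.log")

# Same settings as the notebook: bare message format, write mode
handler = logging.FileHandler(log_path, mode="w")
handler.setFormatter(logging.Formatter("%(message)s"))

logger = logging.getLogger("bias_in_data_example")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the example isolated from the root logger

logger.debug("not recorded: DEBUG is below INFO")
logger.info("Unable to get an ORES response for revision id: %s", 12345)
logger.error("recorded: ERROR is above INFO")

# Flush and detach the handler before reading the file back
logger.removeHandler(handler)
handler.close()

with open(log_path) as f:
    log_lines = f.read().splitlines()
print(log_lines)
```

Only the `INFO` and `ERROR` messages reach the file, each as a plain string with no timestamp or level prefix.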

---------------------

## Data cleaning

I will then read in the 2 source files from the `source_data` folder into pandas DataFrames using the `read_csv` function:
* page_data.csv
* WPDS_2018_data.csv

In [3]:
page_data = pd.read_csv("source_data/page_data.csv")
population_data = pd.read_csv("source_data/WPDS_2018_data.csv")

I will then take a look to ensure the data files are read in properly.  

In [4]:
print("Page dataframe row count: ", page_data.shape[0])

Page dataframe row count:  47197


I will also preview the dataset to ensure the columns are as we expect: 
* page (article names)
* country
* rev_id (revision ID for the article)

In [5]:
page_data.head()

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


In [6]:
print("Population dataframe row count: ", population_data.shape[0])

Population dataframe row count:  207


I will also preview the population dataset to ensure the columns are as follows: 
* Geography
* Population mid-2018 (millions)

In [7]:
population_data.head()

Unnamed: 0,Geography,Population mid-2018 (millions)
0,AFRICA,1284.0
1,Algeria,42.7
2,Egypt,97.0
3,Libya,6.5
4,Morocco,35.2


Rows whose page names begin with the string "Template" need to be filtered out of `page_data`, as these are not Wikipedia articles and should not be included in the analysis. To do so, I build a boolean mask with `str.startswith` and negate it with the `~` operator, which pandas applies element-wise to a boolean Series (the operator itself is described in detail [here](https://stackoverflow.com/questions/8305199/the-tilde-operator-in-python)).

In [8]:
page_data = page_data[~page_data["page"].str.startswith("Template")]
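
To make the masking step concrete, here is a small sketch on a toy DataFrame. The two rows are taken from the preview of `page_data` above; the variable names `pages`, `is_template`, and `articles` are introduced only for this illustration.

```python
import pandas as pd

# Toy stand-in for page_data with one template row and one article row
pages = pd.DataFrame({
    "page": ["Template:Uganda-politician-stub", "Bir I of Kanem"],
    "country": ["Uganda", "Chad"],
})

# Boolean Series: True where the page name starts with "Template"
is_template = pages["page"].str.startswith("Template")

# ~ flips each True/False, so indexing keeps only the non-template rows
articles = pages[~is_template]

print(articles["page"].tolist())  # ['Bir I of Kanem']
```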

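The population file mixes two kinds of rows: geographic regions, whose names are written in all capitals (for example `AFRICA`), and the countries that belong to the region listed above them. The loop below walks the rows in order, remembers the most recent all-capitals name as the current region, and tags every row with it in a new `regions` column. The region rows' own names and populations are collected separately in `population_for_regions` for the regional analysis later on.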

In [9]:
region = ""
regions = []
population_for_regions = pd.DataFrame()
region_pop = []
region_name= []

for index, row in population_data.iterrows():
    if row["Geography"].isupper():
        region = row["Geography"]
        region_pop.append(row["Population mid-2018 (millions)"])
        region_name.append(region)
    regions.append(region)

population_data["regions"] = regions

population_for_regions["region"] = region_name

population_for_regions["population"] = region_pop
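
The same tagging can also be expressed without an explicit loop. The sketch below is one possible alternative, not the approach used in this notebook: it keeps the name only on the all-capitals rows and forward-fills it down over the countries that follow. The data is a toy sample, and `Exampleland` is a made-up country added so that a second region has a member.

```python
import pandas as pd

# Toy stand-in for population_data; "Exampleland" is hypothetical
population = pd.DataFrame({
    "Geography": ["AFRICA", "Algeria", "Egypt", "ASIA", "Exampleland"],
    "Population mid-2018 (millions)": [1284.0, 42.7, 97.0, 4536.0, 1.0],
})

# Region rows are the ones written entirely in capitals
is_region = population["Geography"].str.isupper()

# Keep the name on region rows only (NaN elsewhere), then carry it forward
population["regions"] = population["Geography"].where(is_region).ffill()

# Country rows are everything that is not a region header
countries_only = population[~is_region]

print(population[["Geography", "regions"]])
```

Like the loop, this relies on each region header appearing directly above its countries in the file.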

-----------------

## Data gathering

In this next step I will gather the quality data for each of the Wikipedia articles included in `page_data`. This data is sourced from the Objective Revision Evaluation Service [(ORES)](https://www.mediawiki.org/wiki/ORES) which provides a predicted label to represent the quality of the article. The available labels are:

1. FA (Featured article)
2. GA (Good article)
3. B (B-class article)
4. C (C-class article)
5. Start (Start-class article)
6. Stub (Stub-class article)

where 1 is the highest quality and 6 is the lowest quality classification.

I was unable to `pip install ores`, so I use the ORES REST API directly to get the article quality predictions. The function below sends requests to the endpoint `https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}`, which takes the following parameters:

| parameter | value |
| ---------|:-----:|
|*project*|`enwiki`|
|*model*|`wp10`|
|*revids*|List of revision IDs, from `page_data`|


The resulting API response will look something like:
```json
{"articlequality":
    {"score":
        {"prediction": "B",
         "probability":
            {"GA": 0.005565225912988614,
             "Stub": 0.285072978841463,
             "C": 0.1237249061020009,
             "B": 0.2910788689339172,
             "Start": 0.2859984921969326,
             "FA": 0.008559528012697881
            }
        }
    }
}
```

The following function is adapted from [this example](https://github.com/Ironholds/data-512-a2/blob/master/hcds-a2-bias_demo.ipynb). 

In [10]:
def get_ores_data(revision_ids):
    """
    This function makes a request to the ORES API Endpoint. 
    Inputs:
        - revision_ids: A list of revision IDs for Wikipedia articles
    Outputs:
        - A JSON object with the predicted quality for all revision IDs
    """
    headers = {'User-Agent' : 'https://github.com/TaraWilson17', 'From' : 'wwtara@uw.edu'}
    
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
    
    params = {'project' : 'enwiki',
              'model'   : 'wp10',
              'revids'  : '|'.join(str(x) for x in revision_ids)
              }
    # Pass the identifying headers along with the request so they are actually sent
    api_call = requests.get(endpoint.format(**params), headers=headers)
    response = api_call.json()
    return response
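
To show what the request URL looks like once the template is filled in, here is a small sketch that builds the URL without sending anything. `build_ores_url` is a hypothetical helper written for this illustration; the two revision IDs come from the data shown in this notebook.

```python
endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'


def build_ores_url(revision_ids, project="enwiki", model="wp10"):
    """Fill in the endpoint template (hypothetical helper, no network call)."""
    # Revision IDs are joined with "|" into a single query value
    revids = "|".join(str(x) for x in revision_ids)
    return endpoint.format(project=project, model=model, revids=revids)


url = build_ores_url([355319463, 498683267])
print(url)
# https://ores.wikimedia.org/v3/scores/enwiki/?models=wp10&revids=355319463|498683267
```

When the request is sent, `requests` percent-encodes each `|` as `%7C`.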

In [11]:
revision_id = []
article_quality = []

page_data["rev_id"] = page_data["rev_id"].astype(np.int64)

# Request scores in batches of 50 revision IDs per API call
for i in range(0, page_data.shape[0], 50):
    batch = page_data["rev_id"].iloc[i:i + 50]
    ores_responses = get_ores_data(batch)
    for article, result in ores_responses["enwiki"]["scores"].items():
        try:
            article_quality.append(result["wp10"]["score"]["prediction"])
        except KeyError:
            # Revisions without a prediction are logged and skipped
            logging.info("Unable to get an ORES response for revision id: %s", article)
        else:
            revision_id.append(article)

ConnectionError: HTTPSConnectionPool(host='ores.wikimedia.org', port=443): Max retries exceeded with url: /v3/scores/enwiki/?models=wp10&revids=781951239%7C781968923%7C781980382%7C781983879%7C781985076%7C781985555%7C781985625%7C781985704%7C781986675%7C781988824%7C781989047%7C781990193%7C781990452%7C781990525%7C781990644%7C781990773%7C781996299%7C781997714%7C781997919%7C782005013%7C782005419%7C782005867%7C782006071%7C782014775%7C782018907%7C782021839%7C782030181%7C782031885%7C782032727%7C782035312%7C782038393%7C782044725%7C782045760%7C782047073%7C782048547%7C782058170%7C782058966%7C782061268%7C782062759%7C782063768%7C782073792%7C782074765%7C782075499%7C782082362%7C782087278%7C782089225%7C782099774%7C782100788%7C782102962%7C782103998 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x000002B695C20E48>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))
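
The sketch below shows the parsing logic of the loop above in isolation, run on a hand-written response instead of a live API call. `split_ores_scores` is a hypothetical helper, and the shape of the failed entry (an `error` key in place of `score`) is an assumption about how ORES reports a revision it cannot score.

```python
def split_ores_scores(response, project="enwiki", model="wp10"):
    """Separate scored revisions from failed ones (hypothetical helper)."""
    scored, failed = {}, []
    for rev_id, result in response[project]["scores"].items():
        score = result.get(model, {}).get("score")
        if score is None:
            # No "score" entry: treat the revision as failed
            failed.append(rev_id)
        else:
            scored[rev_id] = score["prediction"]
    return scored, failed


# Hand-written sample; the error entry's structure is assumed, not captured output
sample_response = {
    "enwiki": {"scores": {
        "355319463": {"wp10": {"score": {"prediction": "Stub"}}},
        "1": {"wp10": {"error": {"type": "RevisionNotFound"}}},
    }}
}

scored, failed = split_ores_scores(sample_response)
print(scored)  # {'355319463': 'Stub'}
print(failed)  # ['1']
```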

In [12]:
article_data = pd.DataFrame()
article_data["revision_id"] = revision_id
article_data["article_quality"] = article_quality

In [13]:
article_data["revision_id"] = article_data["revision_id"].astype(str).astype(int)
all_article_data = pd.merge(article_data, page_data, left_on="revision_id", right_on="rev_id")
all_article_data = all_article_data.drop(columns=["rev_id"])

In [14]:
all_data = pd.merge(all_article_data, population_data, left_on="country", right_on="Geography")
all_data = all_data.drop(columns=["Geography"])
all_data = all_data.rename(columns={"Population mid-2018 (millions)": "population"})
all_data.head()

Unnamed: 0,revision_id,article_quality,page,country,population,regions
0,355319463,Stub,Bir I of Kanem,Chad,15.4,AFRICA
1,498683267,Stub,Abdullah II of Kanem,Chad,15.4,AFRICA
2,565745353,Stub,Salmama II of Kanem,Chad,15.4,AFRICA
3,565745365,Stub,Kuri I of Kanem,Chad,15.4,AFRICA
4,565745375,Stub,Mohammed I of Kanem,Chad,15.4,AFRICA
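
Because `pd.merge` performs an inner join by default, any article whose country has no match in the population file, and any country with no articles, is dropped without notice. The sketch below is one way to list those rows, using an outer join with `indicator=True` on toy data. `Exampleland` is a made-up country used to trigger the unmatched case.

```python
import pandas as pd

# Toy inputs; "Exampleland" is hypothetical and has no population entry
articles = pd.DataFrame({
    "country": ["Chad", "Chad", "Exampleland"],
    "article_quality": ["Stub", "Stub", "C"],
})
population = pd.DataFrame({
    "Geography": ["Chad", "Egypt"],
    "Population mid-2018 (millions)": [15.4, 97.0],
})

# The outer join keeps unmatched rows; _merge records which side each row came from
check = pd.merge(articles, population, how="outer",
                 left_on="country", right_on="Geography", indicator=True)

no_population = check.loc[check["_merge"] == "left_only", "country"].unique().tolist()
no_articles = check.loc[check["_merge"] == "right_only", "Geography"].unique().tolist()

print(no_population)  # ['Exampleland']
print(no_articles)    # ['Egypt']
```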


In [15]:
# The article title column is named "page" in all_data, so rename it to match the output schema
all_data.rename(columns={"page": "article_name"}).to_csv("wp_wpds_politicians_by_country.csv", sep=",", columns=["country", "article_name", "revision_id", "article_quality", "population"])


-----------------------

## Data analysis

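The next step is to derive summary statistics from the combined dataset. For each country, and then for each region, I count the total number of politician articles and the number predicted to be high quality (`FA` or `GA`). These counts are then divided by population and by article count to give the coverage and relative quality metrics used in the result tables.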

In [19]:
article_stats = pd.DataFrame()
country_list = []
counts = []
populations= []
high_quality_counts = []

countries = all_data["country"].unique()
for country in countries:
    country_list.append(country)
    articles_from_country = all_data[all_data["country"] == country]
    counts.append(len(articles_from_country))
    count = 0
    for index, row in articles_from_country.iterrows():
        # "FA" and "GA" are considered 'high quality' predictions
        if row["article_quality"] == "FA" or row["article_quality"] == "GA":
            count += 1
    high_quality_counts.append(count)
    # Population is constant within a country, so take it from the first row
    populations.append(articles_from_country["population"].iloc[0])
    
    
article_stats["country"] = country_list
article_stats["num_articles"] = counts
article_stats["population"] = populations
article_stats["num_high_quality_articles"] = high_quality_counts
article_stats.head()

Unnamed: 0,country,num_articles,population,num_high_quality_articles
0,Chad,43,15.4,0
1,Cambodia,142,16.0,0
2,Canada,314,37.2,0
3,Egypt,74,97.0,0
4,Pakistan,160,200.6,0
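
The per-country counts can also be computed with a `groupby` instead of nested loops. The sketch below is an alternative formulation on toy data, not the code that produced the table above. `Exampleland` and its values are made up for the illustration.

```python
import pandas as pd

# Toy stand-in for all_data; "Exampleland" and its rows are hypothetical
articles = pd.DataFrame({
    "country": ["Chad", "Chad", "Exampleland", "Exampleland", "Exampleland"],
    "article_quality": ["Stub", "Stub", "GA", "C", "FA"],
    "population": [15.4, 15.4, 1.0, 1.0, 1.0],
})

# "FA" and "GA" are considered high quality predictions
articles["is_high_quality"] = articles["article_quality"].isin(["FA", "GA"])

grouped = articles.groupby("country", sort=False)
stats = pd.DataFrame({
    "num_articles": grouped.size(),
    "population": grouped["population"].first(),
    # Summing booleans counts the True values
    "num_high_quality_articles": grouped["is_high_quality"].sum().astype(int),
}).reset_index()

print(stats)
```

The same pattern works for regions by grouping on `regions` instead of `country`.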


In [20]:
region_stats = pd.DataFrame()
region_list = []
num_articles = []
high_quality_counts = []

regions = all_data["regions"].unique()    
for region in regions:
    region_list.append(region)
    articles_from_region = all_data[all_data["regions"] == region]
    num_articles.append(len(articles_from_region))
    count = 0
    for index, row in articles_from_region.iterrows():
        # "FA" and "GA" are considered 'high quality' predictions
        if row["article_quality"] == "FA" or row["article_quality"] == "GA":
            count += 1
    high_quality_counts.append(count)
    
region_stats["region"] = region_list
region_stats["num_articles"] = num_articles
region_stats["num_high_quality_articles"] = high_quality_counts
 
region_stats.head()    

Unnamed: 0,region,num_articles,num_high_quality_articles
0,AFRICA,2712,12
1,ASIA,3736,18
2,NORTHERN AMERICA,609,2
3,EUROPE,6361,27
4,LATIN AMERICA AND THE CARIBBEAN,3144,9


In [21]:
region_stats = pd.merge(region_stats, population_for_regions, left_on="region", right_on="region")

In [22]:
# Populations are given in millions: strip thousands separators, then convert to absolute counts
article_stats["population"] = article_stats["population"].str.replace(",","")
article_stats["population"] = article_stats["population"].astype(float) * 1000000

region_stats["population"] = region_stats["population"].str.replace(",","")
region_stats["population"] = region_stats["population"].astype(float) * 1000000

In [23]:
article_stats["articles_per_population"] = article_stats["num_articles"] / article_stats["population"]
article_stats["quality_articles_per_articles"] = article_stats["num_high_quality_articles"] / article_stats["num_articles"]

region_stats["articles_per_population"] = region_stats["num_articles"] / region_stats["population"]
region_stats["quality_articles_per_articles"] = region_stats["num_high_quality_articles"] / region_stats["num_articles"]

Unnamed: 0,region,num_articles,num_high_quality_articles,population,articles_per_population,quality_articles_per_articles
0,AFRICA,2712,12,1284000000.0,2.11215e-06,0.004425
1,ASIA,3736,18,4536000000.0,8.236332e-07,0.004818
2,NORTHERN AMERICA,609,2,365000000.0,1.668493e-06,0.003284
3,EUROPE,6361,27,746000000.0,8.52681e-06,0.004245
4,LATIN AMERICA AND THE CARIBBEAN,3144,9,649000000.0,4.844376e-06,0.002863
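
As a quick check of how the two metrics are derived, the sketch below recomputes the `AFRICA` row by hand from the counts and population shown in the tables above. The variable names are chosen only for this illustration.

```python
# Values for AFRICA taken from the tables above
num_articles = 2712
num_high_quality = 12
population_millions = 1284.0

# Convert the population from millions to an absolute count
population = population_millions * 1_000_000

articles_per_population = num_articles / population
quality_articles_per_articles = num_high_quality / num_articles

print(f"{articles_per_population:.5e}")        # 2.11215e-06
print(f"{quality_articles_per_articles:.6f}")  # 0.004425
```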


--------------------

## Result tables

### 1. Top 10 countries by coverage

These are the 10 countries with the most politician articles on Wikipedia per capita (number of articles divided by population).

In [35]:
top_10_by_coverage = article_stats.nlargest(10, "articles_per_population")
top_10_by_coverage[["country", "articles_per_population"]]

Unnamed: 0,country,articles_per_population
98,Tuvalu,0.0025
149,Nauru,0.0022
39,San Marino,0.0015
63,Monaco,0.000575
97,Liechtenstein,0.0004
86,Tonga,0.00032
66,Iceland,0.000275
104,Marshall Islands,0.000267
166,Andorra,0.000237
171,Federated States of Micronesia,0.0002


### 2. Bottom 10 countries by coverage

These are the 10 countries with the fewest politician articles on Wikipedia per capita (number of articles divided by population).

In [34]:
bottom_10_by_coverage = article_stats.nsmallest(10, "articles_per_population")
bottom_10_by_coverage[["country", "articles_per_population"]]

Unnamed: 0,country,articles_per_population
6,India,2.027273e-07
150,Uzbekistan,2.12766e-07
106,Ethiopia,2.697674e-07
112,Sudan,2.877698e-07
163,"Korea, North",3.125e-07
58,Indonesia,3.242836e-07
20,China,3.515569e-07
125,Mozambique,3.606557e-07
126,Thailand,3.927492e-07
73,Turkey,4.551046e-07


### 3. Top 10 countries by relative quality

These are the 10 countries with the highest proportion of quality articles (`GA` or `FA` predictions from ORES) per number of politician articles on Wikipedia from that country.

In [33]:
top_10_by_quality = article_stats.nlargest(10, "quality_articles_per_articles")
top_10_by_quality[["country", "quality_articles_per_articles"]]

Unnamed: 0,country,quality_articles_per_articles
163,"Korea, North",0.125
138,Mauritania,0.095238
18,Togo,0.088235
169,Saudi Arabia,0.074074
155,Guatemala,0.051282
134,Suriname,0.045455
147,Papua New Guinea,0.043478
98,Tuvalu,0.04
167,Laos,0.04
144,Gambia,0.038462


### 4. Bottom 10 countries by relative quality

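These are the 10 countries with the lowest proportion of quality articles (`GA` or `FA` predictions from ORES) per number of politician articles on Wikipedia from that country. Every country listed has a proportion of 0, so the ties are broken by row order rather than by any meaningful ranking.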

In [32]:
bottom_10_by_quality = article_stats.nsmallest(10, "quality_articles_per_articles")
bottom_10_by_quality[["country", "quality_articles_per_articles"]]

Unnamed: 0,country,quality_articles_per_articles
0,Chad,0.0
1,Cambodia,0.0
2,Canada,0.0
3,Egypt,0.0
4,Pakistan,0.0
6,India,0.0
9,Malawi,0.0
10,Nicaragua,0.0
11,Hungary,0.0
13,Iran,0.0
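
When many rows tie at the cutoff, `nsmallest` keeps the first ones it encounters by default. The sketch below, on made-up data, shows how `keep="all"` returns every tied row and how the zero-quality countries could be counted directly. The country labels `A` to `D` are placeholders.

```python
import pandas as pd

# Made-up stats: three countries tie at 0.0
stats = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "quality_articles_per_articles": [0.0, 0.0, 0.0, 0.04],
})

# Default keep="first": ties are resolved by order of appearance
first_two = stats.nsmallest(2, "quality_articles_per_articles")

# keep="all": every row tied with the cutoff value is included
all_tied = stats.nsmallest(2, "quality_articles_per_articles", keep="all")

# Count of countries with no high quality articles at all
num_zero = int((stats["quality_articles_per_articles"] == 0).sum())

print(first_two["country"].tolist())  # ['A', 'B']
print(all_tied["country"].tolist())   # ['A', 'B', 'C']
print(num_zero)                       # 3
```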


### 5. Geographic regions by coverage (by politician articles from countries in each region as a proportion of total regional population)

The geographic regions are listed in order from the most politician articles per capita to the fewest.

In [31]:
regions_by_coverage = region_stats.sort_values("articles_per_population", ascending=False)
regions_by_coverage[["region", "articles_per_population"]]

Unnamed: 0,region,articles_per_population
5,OCEANIA,4.507317e-05
3,EUROPE,8.52681e-06
4,LATIN AMERICA AND THE CARIBBEAN,4.844376e-06
0,AFRICA,2.11215e-06
2,NORTHERN AMERICA,1.668493e-06
1,ASIA,8.236332e-07


### 6. Geographic regions by coverage (by relative proportion of politician articles from countries in each region that are of GA and FA-quality)

The geographic regions are listed in order from the highest proportion of high quality articles among all of a region's politician articles to the lowest.

In [30]:
regions_by_quality_coverage = region_stats.sort_values("quality_articles_per_articles", ascending=False)
regions_by_quality_coverage[["region", "quality_articles_per_articles"]]

Unnamed: 0,region,quality_articles_per_articles
1,ASIA,0.004818
0,AFRICA,0.004425
5,OCEANIA,0.004329
3,EUROPE,0.004245
2,NORTHERN AMERICA,0.003284
4,LATIN AMERICA AND THE CARIBBEAN,0.002863


----------------

## Reflections and implications

Write a few paragraphs, either in the README or at the end of the notebook, reflecting on what you have learned, what you found, what (if anything) surprised you about your findings, and/or what theories you have about why any biases might exist (if you find they exist). You can also include any questions this assignment raised for you about bias, Wikipedia, or machine learning.

In addition to any reflections you want to share about the process of the assignment, please respond (briefly) to at least three of the questions below:

* What biases did you expect to find in the data (before you started working with it), and why?
* What (potential) sources of bias did you discover in the course of your data processing and analysis?
* What might your results suggest about (English) Wikipedia as a data source?
* What might your results suggest about the internet and global society in general?
* Can you think of a realistic data science research situation where using these data (to train a model, perform a hypothesis-driven research, or make business decisions) might create biased or misleading results, due to the inherent gaps and limitations of the data?
* Can you think of a realistic data science research situation where using these data (to train a model, perform a hypothesis-driven research, or make business decisions) might still be appropriate and useful, despite its inherent limitations and biases?
* How might a researcher supplement or transform this dataset to potentially correct for the limitations/biases you observed?

This section doesn't need to be particularly long or thorough, but we'll expect you to write at least a couple paragraphs.

In [None]:
* Bias is heavily influenced by country size! Europe is probably high overall, but Malta is low?