# Assignment 2: Bias in data
### Data 512
### Saturday, October 5
### Tara Wilson

-----------------

## Setup

This Jupyter Notebook was created using [Python version 3.7](https://www.python.org/downloads/release/python-370/).

First, I will import the necessary libraries to run the code. The following libraries are used:

[pandas](https://pandas.pydata.org/)    
[json](https://docs.python.org/3/library/json.html)  
[requests](https://realpython.com/python-requests/)  
[logging](https://docs.python.org/3/library/logging.html)  
[numpy](https://numpy.org/)  

In [75]:
import pandas as pd
import json
import requests
import logging
import numpy as np

I will set up a log file to record messages later in the notebook. I set the logging `level` to `INFO` so that informational messages are recorded in addition to warnings and errors. I set `format` to `%(message)s` so each message is written out simply as a string. I set the `filename` to `bias_in_data_error_log.log` so the log will be saved with this name in the current working directory. Finally, I set the `filemode` to `w` (write mode) so that the file is overwritten on each run rather than appended to, since I do not need to keep historical logs.

In [76]:
logging.basicConfig(level=logging.INFO,
                    format='%(message)s',
                    filename='bias_in_data_error_log.log',
                    filemode='w')
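
To illustrate what the `%(message)s` format and the `w` file mode do, here is a minimal sketch. It is not the notebook's configuration: it uses a dedicated logger with a `FileHandler` and a temporary file, because `logging.basicConfig` only takes effect the first time it is called. The logger name `bias_demo` and the helper `write_demo_log` are hypothetical.

```python
import logging
import os
import tempfile


def write_demo_log(path, messages):
    """Write messages to `path`, overwriting any previous contents."""
    logger = logging.getLogger("bias_demo")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep demo messages out of the root logger
    handler = logging.FileHandler(path, mode="w")  # "w" truncates the file
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    try:
        for message in messages:
            logger.info(message)
    finally:
        logger.removeHandler(handler)
        handler.close()


demo_path = os.path.join(tempfile.mkdtemp(), "demo.log")
write_demo_log(demo_path, ["first run"])
write_demo_log(demo_path, ["second run"])

with open(demo_path) as log_file:
    demo_contents = log_file.read()

# Only the second run survives, with no timestamp or level prefix
print(demo_contents)
```

With mode `a` instead of `w`, both runs would remain in the file.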

---------------------

## Data cleaning

I will then read the two source files from the `source_data` folder into pandas DataFrames using the `read_csv` function:
* page_data.csv
* WPDS_2018_data.csv

In [77]:
page_data = pd.read_csv("source_data/page_data.csv")
population_data = pd.read_csv("source_data/WPDS_2018_data.csv")

I will then take a look to ensure the data files are read in properly.  

In [78]:
print("Page dataframe row count: ", page_data.shape[0])

Page dataframe row count:  47197


I will also preview the dataset to ensure the columns are as we expect: 
* page (article names)
* country
* rev_id (revision ID for the article)

In [79]:
page_data.head()

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


In [80]:
print("Population dataframe row count: ", population_data.shape[0])

Population dataframe row count:  207


I will also preview the population dataset to ensure the columns are as follows: 
* Geography
* Population mid-2018 (millions)

In [81]:
population_data.head()

Unnamed: 0,Geography,Population mid-2018 (millions)
0,AFRICA,1284.0
1,Algeria,42.7
2,Egypt,97.0
3,Libya,6.5
4,Morocco,35.2


Rows with page names that begin with the string "Template" need to be filtered out of `page_data`, as these are not Wikipedia articles and we do not want to include them in the analysis. To do so we build a boolean mask with `str.startswith` and invert it with the `~` operator, which pandas applies element-wise as a logical NOT. The operator is described in detail [here](https://stackoverflow.com/questions/8305199/the-tilde-operator-in-python).

In [82]:
page_data = page_data[~page_data["page"].str.startswith("Template")]
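
As a small illustration of the mask inversion, the sketch below applies the same filter to a three-row toy DataFrame. The frame `toy_pages` is made up for the example and reuses page names from the preview above.

```python
import pandas as pd

# Toy stand-in for page_data (hypothetical, three rows only)
toy_pages = pd.DataFrame({
    "page": ["Template:Uganda-politician-stub", "Bir I of Kanem", "Yos Por"],
    "country": ["Uganda", "Chad", "Cambodia"],
})

# True where the page name starts with "Template"
is_template = toy_pages["page"].str.startswith("Template")

# ~ flips each boolean, so only non-template rows are kept
articles_only = toy_pages[~is_template]
print(articles_only["page"].tolist())
```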

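The population file lists geographic regions (names in all capital letters, such as `AFRICA`) followed by the countries that belong to them. The following code walks through the rows in order, remembers the most recent region name, and records it in a new `regions` column for every row. It also collects each region's name and population into a separate `population_for_regions` DataFrame.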

In [87]:
region = ""
regions = []
population_for_regions = pd.DataFrame()
region_pop = []
region_name= []

for index, row in population_data.iterrows():
    if row["Geography"].isupper():
        region = row["Geography"]
        region_pop.append(row["Population mid-2018 (millions)"])
        region_name.append(region)
    regions.append(region)

population_data["regions"] = regions

population_for_regions["region"] = region_name
population_for_regions["population"] = region_pop

Unnamed: 0,region,population
0,AFRICA,1284
1,NORTHERN AMERICA,365
2,LATIN AMERICA AND THE CARIBBEAN,649
3,ASIA,4536
4,EUROPE,746
5,OCEANIA,41
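
The same region assignment can also be expressed without an explicit loop. The following is a sketch of an alternative approach, not the code used above. It assumes region rows are exactly those whose `Geography` value is all uppercase. The frame `toy_population` is a hypothetical five-row excerpt.

```python
import pandas as pd

# Hypothetical excerpt with the same layout as population_data
toy_population = pd.DataFrame({
    "Geography": ["AFRICA", "Algeria", "Egypt", "ASIA", "China"],
    "Population mid-2018 (millions)": [1284.0, 42.7, 97.0, 4536.0, 1393.8],
})

# Region rows are the ones written in all capital letters
is_region = toy_population["Geography"].str.isupper()

# Keep region names, blank out countries, then carry each region name forward
toy_population["regions"] = toy_population["Geography"].where(is_region).ffill()

# Split into region totals and country rows
toy_region_totals = toy_population.loc[
    is_region, ["Geography", "Population mid-2018 (millions)"]
]
toy_countries = toy_population[~is_region]
print(toy_countries[["Geography", "regions"]])
```

Dropping the region rows, as `toy_countries` does, keeps region totals from being mistaken for countries in later steps.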


-----------------

## Data gathering

In this next step I will gather the quality data for each of the Wikipedia articles included in `page_data`. This data is sourced from the Objective Revision Evaluation Service [(ORES)](https://www.mediawiki.org/wiki/ORES) which provides a predicted label to represent the quality of the article. The available labels are:

1. FA (Featured article)
2. GA (Good article)
3. B (B-class article)
4. C (C-class article)
5. Start (Start-class article)
6. Stub (Stub-class article)

where 1 is the highest quality and 6 is the lowest quality classification.

I was unable to `pip install ores`, so I am using the ORES API service to get the article quality predictions. The following code defines a function that makes requests to the REST API endpoint: `https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}`. The API requires the following parameters:

| parameter | value |
| ---------|:-----:|
|*project*|`enwiki`|
|*model*|`wp10`|
|*revids*|List of revision IDs, from `page_data`|


The resulting API response will look something like:
```json
{"articlequality":
    {"score":
        {"prediction": "B",
         "probability":
            {"GA": 0.005565225912988614,
             "Stub": 0.285072978841463,
             "C": 0.1237249061020009,
             "B": 0.2910788689339172,
             "Start": 0.2859984921969326,
             "FA": 0.008559528012697881
            }
        }
    }
}
```

The following function is adapted from [this example](https://github.com/Ironholds/data-512-a2/blob/master/hcds-a2-bias_demo.ipynb). 

In [7]:
def get_ores_data(revision_ids):
    """
    This function makes a request to the ORES API Endpoint. 
    Inputs:
        - revision_ids: A list of revision IDs for Wikipedia articles
    Outputs:
        - A JSON object with the predicted quality for all revision IDs
    """
    headers = {'User-Agent' : 'https://github.com/TaraWilson17', 'From' : 'wwtara@uw.edu'}
    
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
    
    params = {'project' : 'enwiki',
              'model'   : 'wp10',
              'revids'  : '|'.join(str(x) for x in revision_ids)
              }
    # Pass the identifying headers along with the request
    api_call = requests.get(endpoint.format(**params), headers=headers)
    response = api_call.json()
    return response
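
To make the request format concrete, the sketch below shows how a list of revision IDs can be split into batches and turned into request URLs, without contacting the API. The helpers `chunk` and `build_ores_url` are hypothetical names introduced for this illustration. The batch size of 2 is only for the demo.

```python
def chunk(items, size=50):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_ores_url(revision_ids, project="enwiki", model="wp10"):
    """Format the endpoint URL for one batch of revision IDs."""
    endpoint = ("https://ores.wikimedia.org/v3/scores/{project}/"
                "?models={model}&revids={revids}")
    # Revision IDs are joined with a pipe character
    revids = "|".join(str(rev_id) for rev_id in revision_ids)
    return endpoint.format(project=project, model=model, revids=revids)


toy_rev_ids = [355319463, 393276188, 393822005]
batches = list(chunk(toy_rev_ids, size=2))
urls = [build_ores_url(batch) for batch in batches]

for url in urls:
    print(url)
```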

In [9]:
revision_id = []
article_quality = []

page_data["rev_id"] = page_data["rev_id"].astype(np.int64)

for i in range(0, page_data.shape[0], 50):
    # Request predictions for the next batch of up to 50 revision IDs
    ores_responses = get_ores_data(page_data["rev_id"].iloc[i:i + 50].tolist())
    for article in ores_responses["enwiki"]["scores"]:
        try:
            article_quality.append(ores_responses["enwiki"]["scores"][article]["wp10"]["score"]["prediction"])
        except KeyError:
            # Revisions without a "score" entry could not be scored; log and skip them
            logging.info("Unable to get an ORES response for revision id: %s", article)
        else:
            revision_id.append(article)
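
The parsing step can be checked offline against a hand-written response. This is a sketch under assumptions: `toy_response` only mimics the nesting the loop above relies on. The shape of the failing entry (an `error` key in place of `score`) and the revision ID `"1"` are assumptions for illustration, and `extract_predictions` is a hypothetical helper.

```python
def extract_predictions(response, project="enwiki", model="wp10"):
    """Split a scores response into predictions and unscored revision IDs."""
    predictions = {}
    failures = []
    for rev_id, models in response[project]["scores"].items():
        score = models.get(model, {}).get("score")
        if score is None:
            # No score available for this revision; record it for logging
            failures.append(rev_id)
        else:
            predictions[rev_id] = score["prediction"]
    return predictions, failures


# Hand-written stand-in for an API response (not real API output)
toy_response = {
    "enwiki": {
        "scores": {
            "355319463": {"wp10": {"score": {"prediction": "Stub"}}},
            "1": {"wp10": {"error": {"type": "RevisionNotFound"}}},  # assumed shape
        }
    }
}

toy_predictions, toy_failures = extract_predictions(toy_response)
print(toy_predictions, toy_failures)
```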

In [19]:
article_data = pd.DataFrame()
article_data["revision_id"] = revision_id
article_data["article_quality"] = article_quality

Unnamed: 0,revision_id,article_quality
0,355319463,Stub
1,393276188,Stub
2,393822005,Stub
3,395521877,Stub
4,395526568,Stub


In [20]:
article_data["revision_id"] = article_data["revision_id"].astype(str).astype(int)
all_article_data = pd.merge(article_data, page_data, left_on="revision_id", right_on="rev_id")
all_article_data = all_article_data.drop(columns=["rev_id"])

Unnamed: 0,revision_id,article_quality,page,country
0,355319463,Stub,Bir I of Kanem,Chad
1,393276188,Stub,Information Minister of the Palestinian Nation...,Palestinian Territory
2,393822005,Stub,Yos Por,Cambodia
3,395521877,Stub,Julius Gregr,Czech Republic
4,395526568,Stub,Edvard Gregr,Czech Republic


In [21]:
all_data = pd.merge(all_article_data, population_data, left_on="country", right_on="Geography")
all_data = all_data.drop(columns=["Geography"])
all_data = all_data.rename(columns={"Population mid-2018 (millions)": "population"})
all_data.head()

Unnamed: 0,revision_id,article_quality,page,country,population
0,355319463,Stub,Bir I of Kanem,Chad,15.4
1,498683267,Stub,Abdullah II of Kanem,Chad,15.4
2,565745353,Stub,Salmama II of Kanem,Chad,15.4
3,565745365,Stub,Kuri I of Kanem,Chad,15.4
4,565745375,Stub,Mohammed I of Kanem,Chad,15.4


In [22]:
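# The merged frame names the article column "page"; rename it so every selected column exists
all_data.rename(columns={"page": "article_name"}).to_csv(
    "wp_wpds_politicians_by_country.csv",
    sep=",",
    columns=["country", "article_name", "revision_id", "article_quality", "population"])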

-----------------------

## Data analysis

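The next step is to analyze the resulting dataset by deriving summary statistics for each country. For every country I will count the total number of politician articles, count the articles that ORES predicts to be high quality (`FA` or `GA`), and record the country's population.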

In [27]:
article_stats = pd.DataFrame()
country_list = []
counts = []
populations= []
high_quality_counts = []

countries = all_data["country"].unique()
for country in countries:
    country_list.append(country)
    articles_from_country = all_data[all_data["country"] == country]
    counts.append(len(articles_from_country))
    # "FA" and "GA" are considered 'high quality' predictions
    is_high_quality = articles_from_country["article_quality"].isin(["FA", "GA"])
    high_quality_counts.append(int(is_high_quality.sum()))
    # Every row for a country carries the same population, so take the first
    populations.append(articles_from_country["population"].iloc[0])
    
article_stats["country"] = country_list
article_stats["num_articles"] = counts
article_stats["population"] = populations
article_stats["num_high_quality_articles"] = high_quality_counts
article_stats.head()

Unnamed: 0,country,num_articles,population,num_high_quality_articles
0,Chad,97,15.4,2
1,Cambodia,213,16.0,4
2,Canada,843,37.2,22
3,Egypt,235,97.0,9
4,Pakistan,1023,200.6,19
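
The per-country counts can also be produced in a single `groupby` pass. The following is a sketch of that alternative, not the code used above. The frame `toy_all_data` and its quality labels are made up for the example. Named aggregation requires pandas 0.25 or later.

```python
import pandas as pd

# Hypothetical three-article stand-in for all_data
toy_all_data = pd.DataFrame({
    "country": ["Chad", "Chad", "Cambodia"],
    "article_quality": ["Stub", "GA", "FA"],
    "population": [15.4, 15.4, 16.0],
})

# Flag the high-quality predictions once, up front
toy_all_data["is_high_quality"] = toy_all_data["article_quality"].isin(["FA", "GA"])

toy_stats = (
    toy_all_data.groupby("country", sort=False)  # keep first-appearance order
    .agg(
        num_articles=("article_quality", "count"),  # non-null rows per country
        population=("population", "first"),
        num_high_quality_articles=("is_high_quality", "sum"),
    )
    .reset_index()
)
print(toy_stats)
```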


In [28]:
# Populations are given in millions; strip thousands separators and convert to absolute counts
article_stats["population"] = article_stats["population"].astype(str).str.replace(",", "")
article_stats["population"] = article_stats["population"].astype(float) * 1000000
article_stats.head()

Unnamed: 0,country,num_articles,population,num_high_quality_articles
0,Chad,97,15400000.0,2
1,Cambodia,213,16000000.0,4
2,Canada,843,37200000.0,22
3,Egypt,235,97000000.0,9
4,Pakistan,1023,200600000.0,19


In [49]:
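# Coverage: articles per person
article_stats["articles_per_population"] = article_stats["num_articles"] / article_stats["population"]
# High-quality articles per person (appears in the output below)
article_stats["quality_articles_per_population"] = article_stats["num_high_quality_articles"] / article_stats["population"]
# Relative quality: share of a country's articles that are FA or GA
article_stats["quality_articles_per_articles"] = article_stats["num_high_quality_articles"] / article_stats["num_articles"]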
article_stats.head()

Unnamed: 0,country,num_articles,population,num_high_quality_articles,articles_per_population,quality_articles_per_population,quality_articles_per_articles
0,Chad,97,15400000.0,2,6e-06,1.298701e-07,0.020619
1,Cambodia,213,16000000.0,4,1.3e-05,2.5e-07,0.018779
2,Canada,843,37200000.0,22,2.3e-05,5.913978e-07,0.026097
3,Egypt,235,97000000.0,9,2e-06,9.278351e-08,0.038298
4,Pakistan,1023,200600000.0,19,5e-06,9.471585e-08,0.018573


--------------------

## Result tables

### 1. Top 10 countries by coverage

These are the 10 countries with the highest number of politician articles on Wikipedia relative to their population.

In [89]:
top_10_by_coverage = article_stats.nlargest(10, "articles_per_population")
top_10_by_coverage[["country", "num_articles", "population", "articles_per_population"]]

Unnamed: 0,country,num_articles,population,articles_per_population
98,Tuvalu,54,10000.0,0.0054
149,Nauru,52,10000.0,0.0052
39,San Marino,81,30000.0,0.0027
63,Monaco,40,40000.0,0.001
97,Liechtenstein,28,40000.0,0.0007
86,Tonga,63,100000.0,0.00063
104,Marshall Islands,37,60000.0,0.000617
66,Iceland,201,400000.0,0.000503
166,Andorra,34,80000.0,0.000425
77,Grenada,36,100000.0,0.00036


### 2. Bottom 10 countries by coverage

These are the 10 countries with the fewest politician articles on Wikipedia relative to their population.

In [90]:
bottom_10_by_coverage = article_stats.nsmallest(10, "articles_per_population")
bottom_10_by_coverage[["country", "num_articles", "population", "articles_per_population"]]

Unnamed: 0,country,num_articles,population,articles_per_population
6,India,980,1371300000.0,7.146503e-07
58,Indonesia,210,265200000.0,7.918552e-07
20,China,1130,1393800000.0,8.107332e-07
150,Uzbekistan,28,32900000.0,8.510638e-07
106,Ethiopia,101,107500000.0,9.395349e-07
163,"Korea, North",36,25600000.0,1.40625e-06
178,Zambia,25,17700000.0,1.412429e-06
126,Thailand,112,66200000.0,1.691843e-06
125,Mozambique,58,30500000.0,1.901639e-06
115,Bangladesh,319,166400000.0,1.917067e-06


### 3. Top 10 countries by relative quality

These are the 10 countries with the highest proportion of quality articles (`GA` or `FA` predictions from ORES) per number of politician articles on Wikipedia from that country.

In [91]:
top_10_by_quality = article_stats.nlargest(10, "quality_articles_per_articles")
top_10_by_quality[["country", "num_high_quality_articles", "num_articles", "quality_articles_per_articles"]]

Unnamed: 0,country,num_high_quality_articles,num_articles,quality_articles_per_articles
163,"Korea, North",7,36,0.194444
169,Saudi Arabia,15,118,0.127119
138,Mauritania,6,48,0.125
161,Central African Republic,8,66,0.121212
45,Romania,39,343,0.113703
98,Tuvalu,5,54,0.092593
124,Bhutan,3,33,0.090909
172,Dominica,1,12,0.083333
47,Syria,10,128,0.078125
44,Benin,7,91,0.076923


### 4. Bottom 10 countries by relative quality

These are the 10 countries with the lowest proportion of quality articles (`GA` or `FA` predictions from ORES) per number of politician articles on Wikipedia from that country.

In [92]:
bottom_10_by_quality = article_stats.nsmallest(10, "quality_articles_per_articles")
bottom_10_by_quality[["country", "num_high_quality_articles", "num_articles", "quality_articles_per_articles"]]

Unnamed: 0,country,num_high_quality_articles,num_articles,quality_articles_per_articles
14,Malta,0,103,0.0
22,Angola,0,106,0.0
28,Finland,0,569,0.0
32,Tunisia,0,138,0.0
39,San Marino,0,81,0.0
50,Uganda,0,185,0.0
52,Moldova,0,423,0.0
63,Monaco,0,40,0.0
76,Turkmenistan,0,32,0.0
80,Slovakia,0,116,0.0


### 5. Geographic regions by coverage (by politician articles from countries in each region as a proportion of total regional population)

The geographic regions ranked from the most to the fewest politician articles relative to regional population.

### 6. Geographic regions by coverage (by relative proportion of politician articles from countries in each region that are of GA and FA-quality)

The geographic regions ranked from the highest to the lowest proportion of high-quality articles among all politician articles.
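
One way the two regional rankings could be computed is sketched below. It makes no claim about the actual results. It assumes the per-country statistics carry a `regions` column (hypothetical here) and that regional populations are available in a frame shaped like `population_for_regions`. The toy frames cover only three countries, so the printed numbers are illustrative.

```python
import pandas as pd

# Hypothetical per-country stats with a region label attached
toy_article_stats = pd.DataFrame({
    "country": ["Chad", "Egypt", "Cambodia"],
    "regions": ["AFRICA", "AFRICA", "ASIA"],
    "num_articles": [97, 235, 213],
    "num_high_quality_articles": [2, 9, 4],
})

# Regional populations in millions, shaped like population_for_regions
toy_region_population = pd.DataFrame({
    "region": ["AFRICA", "ASIA"],
    "population": [1284.0, 4536.0],
})

# Sum article counts per region, then attach the regional population
region_stats = (
    toy_article_stats
    .groupby("regions", as_index=False)[["num_articles", "num_high_quality_articles"]]
    .sum()
    .merge(toy_region_population, left_on="regions", right_on="region")
    .drop(columns=["region"])
)

region_stats["articles_per_population"] = (
    region_stats["num_articles"] / (region_stats["population"] * 1_000_000)
)
region_stats["quality_articles_per_articles"] = (
    region_stats["num_high_quality_articles"] / region_stats["num_articles"]
)

# Descending order for each ranking
by_coverage = region_stats.sort_values("articles_per_population", ascending=False)
by_quality = region_stats.sort_values("quality_articles_per_articles", ascending=False)
print(by_coverage[["regions", "articles_per_population"]])
print(by_quality[["regions", "quality_articles_per_articles"]])
```

Summing counts per region before dividing weights each country by its article count. Averaging the per-country ratios instead would give small countries the same weight as large ones.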

----------------

## Reflections and implications

Write a few paragraphs, either in the README or at the end of the notebook, reflecting on what you have learned, what you found, what (if anything) surprised you about your findings, and/or what theories you have about why any biases might exist (if you find they exist). You can also include any questions this assignment raised for you about bias, Wikipedia, or machine learning.

In addition to any reflections you want to share about the process of the assignment, please respond (briefly) to at least three of the questions below:

What biases did you expect to find in the data (before you started working with it), and why?  
What (potential) sources of bias did you discover in the course of your data processing and analysis?  
What might your results suggest about (English) Wikipedia as a data source?  
What might your results suggest about the internet and global society in general?  
Can you think of a realistic data science research situation where using these data (to train a model, perform a hypothesis-driven research, or make business decisions) might create biased or misleading results, due to the inherent gaps and limitations of the data?  
Can you think of a realistic data science research situation where using these data (to train a model, perform a hypothesis-driven research, or make business decisions) might still be appropriate and useful, despite its inherent limitations and biases?  
How might a researcher supplement or transform this dataset to potentially correct for the limitations/biases you observed?

This section doesn't need to be particularly long or thorough, but we'll expect you to write at least a couple paragraphs.

In [None]:
* Bias appears to be heavily influenced by country size. Europe is probably high overall, but Malta is low?