# Assignment 2: Bias in data
### Data 512
### Saturday, October 5
### Tara Wilson

-----------------

## Setup

This Jupyter Notebook is created using [Python version 3.7](https://www.python.org/downloads/release/python-370/).

First, I will import the necessary libraries to run the code. The following libraries are used:

[pandas](https://pandas.pydata.org/)    
[json](https://docs.python.org/3/library/json.html)  
[requests](https://realpython.com/python-requests/)  
[logging](https://docs.python.org/3/library/logging.html)  
[numpy](https://numpy.org/)  

In [1]:
import pandas as pd
import json
import requests
import logging
import numpy as np

I will set up a logging file to keep track of messages later in the notebook. I set the logging `level` to `INFO` so that informational messages (and anything more severe) are recorded. I set `format` to `%(message)s` so each message is written out simply as a string. I set the `filename` to `bias_in_data_error_log.log` so the log is saved under this name in the current working directory. Finally, I set the `filemode` to `w` (write mode) so that the file is overwritten on each run rather than appended to, since I do not need to keep historical logs.

In [2]:
logging.basicConfig(level=logging.INFO,
                    format='%(message)s',
                    filename='bias_in_data_error_log.log',
                    filemode='w')

---------------------

## Data cleaning

I will then read in the 2 source files from the `source_data` folder into pandas DataFrames using the `read_csv` function:
* page_data.csv
* WPDS_2018_data.csv

In [3]:
page_data = pd.read_csv("source_data/page_data.csv")
population_data = pd.read_csv("source_data/WPDS_2018_data.csv")

I will then take a look to ensure the data files are read in properly.  

In [4]:
print("Page dataframe row count: ", page_data.shape[0])

Page dataframe row count:  47197


I will also preview the dataset to ensure the columns are as we expect: 
* page (article names)
* country
* rev_id (revision ID for the article)

In [5]:
page_data.head()

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


In [6]:
print("Population dataframe row count: ", population_data.shape[0])

Population dataframe row count:  207


I will also preview the population dataset to ensure the columns are as follows: 
* Geography
* Population mid-2018 (millions)

In [7]:
population_data.head()

Unnamed: 0,Geography,Population mid-2018 (millions)
0,AFRICA,1284.0
1,Algeria,42.7
2,Egypt,97.0
3,Libya,6.5
4,Morocco,35.2


Rows with article page names that begin with the string "Template" need to be filtered out of `page_data`, as these are not Wikipedia articles and we do not want to include them in the analysis. To do so we will use Python's `~` operator, described in detail [here](https://stackoverflow.com/questions/8305199/the-tilde-operator-in-python). Applied to a boolean pandas Series, `~` inverts each value, so the filter keeps only the rows that do not start with "Template".

In [8]:
page_data = page_data[~page_data["page"].str.startswith("Template")]
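As a small illustration of the filter above, here is a sketch on a toy DataFrame. The page names and the `toy_pages` variable are made up for the example and are not part of the source data.

```python
import pandas as pd

# Toy data; the page names are invented for illustration only
toy_pages = pd.DataFrame({
    "page": ["Template:Example-stub", "Example Politician", "Another Politician"],
    "country": ["Chad", "Chad", "Egypt"],
})

# Boolean Series: True where the page name starts with "Template"
is_template = toy_pages["page"].str.startswith("Template")

# ~ flips every True/False in the mask, so only non-template rows are kept
articles_only = toy_pages[~is_template]
print(articles_only["page"].tolist())
```

Naming the mask first (`is_template`) is only a readability choice; it is equivalent to the one-line form used in the cell above.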

I will next generate a `population_for_regions` DataFrame to track region level data for use later in the analysis. I loop through the population data grabbing all of the `Geography` column values that represent regions, denoted by all capital letters. I then save the name of the region and corresponding population to be used later. I also add the corresponding region to the `population_data` DataFrame for use in the analysis portion later.

In [9]:
population_for_regions = pd.DataFrame()

# starting value for region, allows scope for variable outside of if statement
region = ""

regions = []
region_pop = []
region_name= []

for index, row in population_data.iterrows():
    # all uppercase words are the region names
    if row["Geography"].isupper():
        region = row["Geography"]
        region_pop.append(row["Population mid-2018 (millions)"])
        region_name.append(region)
    regions.append(region)

# append columns to existing DataFrames
population_data["regions"] = regions
population_for_regions["region"] = region_name
population_for_regions["population"] = region_pop

We can preview the results of this loop:

In [10]:
population_for_regions.head()

Unnamed: 0,region,population
0,AFRICA,1284
1,NORTHERN AMERICA,365
2,LATIN AMERICA AND THE CARIBBEAN,649
3,ASIA,4536
4,EUROPE,746
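The same region labelling could probably also be done without an explicit loop. The sketch below is an alternative, not the approach used in this notebook: it assumes the same column names as `population_data`, uses a small toy table, and relies on `Series.where` plus `ffill` to carry each region name down to the countries listed under it.

```python
import pandas as pd

# Toy stand-in for population_data (a few rows only, for illustration)
toy_population = pd.DataFrame({
    "Geography": ["AFRICA", "Algeria", "Egypt", "ASIA", "Cambodia"],
    "Population mid-2018 (millions)": [1284.0, 42.7, 97.0, 4536.0, 16.0],
})

# True on rows whose name is all uppercase, i.e. the region rows
is_region = toy_population["Geography"].str.isupper()

# Keep the name only on region rows, then forward-fill it onto the rows below
toy_population["regions"] = toy_population["Geography"].where(is_region).ffill()

# Region-level table, equivalent in spirit to population_for_regions
toy_regions = (
    toy_population.loc[is_region, ["Geography", "Population mid-2018 (millions)"]]
    .rename(columns={"Geography": "region",
                     "Population mid-2018 (millions)": "population"})
    .reset_index(drop=True)
)
print(toy_population["regions"].tolist())
```

One difference from the loop: rows that appear before the first region would get `NaN` here instead of an empty string.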


The flat file data sources are now all properly read-in and prepared for the next step: gathering the article quality scores.

-----------------

## Data gathering

In this next step I will gather the quality data for each of the Wikipedia articles included in `page_data`. This data is sourced from the Objective Revision Evaluation Service [(ORES)](https://www.mediawiki.org/wiki/ORES) which provides a predicted label to represent the quality of the article. The available labels are:

1. FA (Featured article)
2. GA (Good article)
3. B (B-class article)
4. C (C-class article)
5. Start (Start-class article)
6. Stub (Stub-class article)

where 1 is the highest quality and 6 is the lowest quality classification.

I was unable to `pip install ores`, so I am using the ORES API service to get the article quality predictions. The following code defines a function that makes requests to the [REST API endpoint](https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context_revid_model): `https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}`. The API requires the following parameters:

| parameter | value |
| ---------|:-----:|
|*project*|`enwiki`|
|*model*|`wp10`|
|*revids*|List of revision IDs from `page_data`|


The resulting API response will look something like:
```json
{"articlequality":
    {"score":
        {"prediction": "B",
         "probability":
            {"GA": 0.005565225912988614,
             "Stub": 0.285072978841463,
             "C": 0.1237249061020009,
             "B": 0.2910788689339172,
             "Start": 0.2859984921969326,
             "FA": 0.008559528012697881
            }
        }
    }
}
```

The following function is adapted from [this example](https://github.com/Ironholds/data-512-a2/blob/master/hcds-a2-bias_demo.ipynb). 

In [11]:
def get_ores_data(revision_ids):
    """
    This function makes a request to the ORES API Endpoint. 
    Inputs:
        - revision_ids: A list of revision IDs for Wikipedia articles
    Outputs:
        - A JSON object with the predicted quality for all revision IDs
    """
    headers = {'User-Agent' : 'https://github.com/TaraWilson17', 'From' : 'wwtara@uw.edu'}
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
    params = {'project' : 'enwiki',
              'model'   : 'wp10',
              'revids'  : '|'.join(str(x) for x in revision_ids)
              }
    # pass the headers so the request identifies who is calling the API
    api_call = requests.get(endpoint.format(**params), headers=headers)
    response = api_call.json()
    return response
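To see what URL the function requests without actually calling the API, the formatting step can be tried on its own. This is a minimal sketch; `build_ores_url` is a hypothetical helper introduced only for this illustration, and the two revision IDs are taken from the previews in this notebook.

```python
def build_ores_url(revision_ids, project="enwiki", model="wp10"):
    """Hypothetical helper: return the URL that a request for these IDs would use."""
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
    # revision IDs are joined with "|" into a single query parameter
    revids = '|'.join(str(x) for x in revision_ids)
    return endpoint.format(project=project, model=model, revids=revids)

example_url = build_ores_url([355319463, 498683267])
print(example_url)
```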

To avoid overloading the API, I batched the requests in groups of 50 as recommended by the documentation. For each batch, I grab the array of JSON responses. I then try to parse the prediction from the response. If this is successful, I add the article and the corresponding prediction to the lists. If this is unsuccessful, the revision ID is logged to the file created at the top of this notebook.

In [12]:
revision_id = []
article_quality = []

# converts `rev_id` column to integer data type
page_data["rev_id"] = page_data["rev_id"].astype(np.int64)

# loops through all data, 50 rows at a time
for i in range(0, page_data.shape[0], 50):
    # calls my get_ores_data function above with list of rev id's passed in
    ores_responses = get_ores_data(np.array(page_data["rev_id"].iloc[i:i + 50,]))
    for article in ores_responses["enwiki"]["scores"]:
        try:
            # saves prediction if it exists
            article_quality.append(ores_responses["enwiki"]["scores"][article]["wp10"]["score"]["prediction"])
        except KeyError:
            # no "score" entry for this revision, so log it and move on
            logging.info("Unable to get an ORES response for revision id: %s", article)
        else:
            revision_id.append(article)
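The batching and parsing logic can be checked offline with a mock response. The sketch below is illustrative only: `batches` and `extract_predictions` are hypothetical helpers, the response shape is assumed from the keys used in the loop above, and the error entry is a made-up stand-in for a revision that ORES could not score.

```python
def batches(items, size=50):
    """Hypothetical helper: yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def extract_predictions(ores_response, project="enwiki", model="wp10"):
    """Hypothetical helper: split one response into predictions and failed IDs."""
    predictions, failed = {}, []
    for rev_id, models in ores_response[project]["scores"].items():
        try:
            predictions[rev_id] = models[model]["score"]["prediction"]
        except KeyError:
            # no "score" entry for this revision
            failed.append(rev_id)
    return predictions, failed

# Mock response; the structure is an assumption based on the parsing code above
mock_response = {
    "enwiki": {
        "scores": {
            "355319463": {"wp10": {"score": {"prediction": "Stub"}}},
            "123": {"wp10": {"error": {"type": "RevisionNotFound"}}},
        }
    }
}

predictions, failed = extract_predictions(mock_response)
batch_sizes = [len(b) for b in batches(list(range(120)), size=50)]
print(predictions, failed, batch_sizes)
```

Keeping the failed IDs in a list, in addition to logging them, would make it easy to retry them or count them later.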

I will create a dataframe, `article_data`, that stores the results from the ORES API calls. It has a column for the revision IDs and another for the article quality prediction from ORES. 

In [13]:
article_data = pd.DataFrame()
article_data["revision_id"] = revision_id
article_data["article_quality"] = article_quality

I will then convert the `revision_id` column to an integer data type and merge the `article_data` DataFrame with `page_data` from above, which holds the article names, the corresponding revision IDs, and the country each article references. The two DataFrames are joined on their revision ID columns, which I rename so that they match.

In [14]:
# converts revision_ids to be of integer type
article_data["revision_id"] = article_data["revision_id"].astype(str).astype(int)

# renames columns to proper names and to match between datasets
page_data = page_data.rename(columns={"rev_id": "revision_id"})
page_data = page_data.rename(columns={"page": "article_name"})

all_article_data = pd.merge(article_data, page_data, on="revision_id")

I will then merge the above dataset with `population_data` to bring in the population for each country. I join `country` from the article dataset with the `Geography` column in the population source data, which is renamed to `country` first. The pandas `merge` function takes an argument, `indicator=True`, which adds a `_merge` column recording whether each row of an outer join came from the left dataset only, the right dataset only, or both. Rows without a match, marked with anything other than `both` in the `_merge` column, are saved to a separate DataFrame named `no_match_data`. Rows marked `both` are saved to the `all_data` DataFrame. I will then preview the resulting dataset:

In [15]:
# renames columns to correct names and to match between datasets
population_data = population_data.rename(columns={"Geography": "country"})
population_data = population_data.rename(columns={"Population mid-2018 (millions)": "population"})

# merges datasets on 'country' column, outer join
all_data = pd.merge(all_article_data, population_data, on="country", how="outer", indicator=True)

# selects data not represented in both datasets
no_match_data = all_data[all_data["_merge"] != "both"]
all_data = all_data[all_data["_merge"] == "both"]

# preview data
all_data.head()

Unnamed: 0,revision_id,article_quality,article_name,country,population,regions,_merge
0,355319463.0,Stub,Bir I of Kanem,Chad,15.4,AFRICA,both
1,498683267.0,Stub,Abdullah II of Kanem,Chad,15.4,AFRICA,both
2,565745353.0,Stub,Salmama II of Kanem,Chad,15.4,AFRICA,both
3,565745365.0,Stub,Kuri I of Kanem,Chad,15.4,AFRICA,both
4,565745375.0,Stub,Mohammed I of Kanem,Chad,15.4,AFRICA,both
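To make the role of the `_merge` column concrete, here is a sketch of the same outer merge on two tiny toy tables. The article names and the country "Atlantis" are invented for the example.

```python
import pandas as pd

# Toy article table: "Atlantis" has no row in the population table
left = pd.DataFrame({
    "country": ["Chad", "Chad", "Atlantis"],
    "article_name": ["Article A", "Article B", "Article C"],
})
# Toy population table: "Egypt" has no articles
right = pd.DataFrame({
    "country": ["Chad", "Egypt"],
    "population": [15.4, 97.0],
})

merged = pd.merge(left, right, on="country", how="outer", indicator=True)

# "both" means the country appeared in both tables
matched = merged[merged["_merge"] == "both"]
# "left_only" or "right_only" means one side had no match
unmatched = merged[merged["_merge"] != "both"]

print(unmatched[["country", "_merge"]])
```

Here "Atlantis" should come back as `left_only` (an article with no population data) and "Egypt" as `right_only` (a country with no articles), which are the two kinds of rows that end up in `no_match_data`.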


I will write the resulting data to .csv using the pandas `to_csv` method. Both the matches and no matches csv files are saved to a folder named `output_data` inside the current working directory.

Both files contain the following columns (the no_match data will have missing (`NaN`) values in at least one column of every row):
* country
* article_name
* revision_id
* article_quality
* population

In [16]:
all_data.to_csv("output_data/wp_wpds_politicians_by_country.csv", sep=",", columns=["country", "article_name", "revision_id", "article_quality", "population"])
no_match_data.to_csv("output_data/wp_wpds_countries-no_match.csv", sep=",", columns=["country", "article_name", "revision_id", "article_quality", "population"])
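Note that `to_csv` raises an error if the target folder does not exist. A minimal sketch of creating the folder first is shown below; it writes to a temporary directory so it can be run anywhere, and the file name `toy_output.csv` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Use a temporary base directory so the sketch does not touch real output files
base_dir = tempfile.mkdtemp()
output_dir = os.path.join(base_dir, "output_data")

# Create the folder if it is missing; exist_ok avoids an error on re-runs
os.makedirs(output_dir, exist_ok=True)

toy = pd.DataFrame({
    "country": ["Chad"],
    "article_name": ["Bir I of Kanem"],
    "revision_id": [355319463],
    "article_quality": ["Stub"],
    "population": [15.4],
})

out_path = os.path.join(output_dir, "toy_output.csv")
# index=False leaves out the row index, which would otherwise be written
# as an extra unnamed first column
toy.to_csv(out_path, index=False)

round_trip = pd.read_csv(out_path)
print(list(round_trip.columns))
```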

-----------------------

## Data analysis

The next step is to complete some analysis on the resulting dataset. This will allow me to derive some metrics to generate some summary statistics for different countries.  

The first step is to create a DataFrame to store the computed columns. I then loop through all of the unique countries in the data and make calculations. First, I gather all of the data in the `all_data` DataFrame, the result of the above merge, and filter by only rows with the country name for that loop iteration. To get the counts of articles from that country, I can simply get the `len()` of the rows meeting this condition, as each row in the dataset represents a unique article. High quality articles are defined as those with a prediction of `GA` or `FA`. I then loop through all of the filtered rows and increment a counter variable for each article that is either `GA` or `FA` quality. Once all the rows in the filtered dataset have been processed, I save this count of high quality articles. I then append the population of the country to the dataset as well to use in later calculations. Finally, I merge all of these calculations into the DataFrame.

In [17]:
article_stats = pd.DataFrame()
country_list = []
counts = []
populations= []
high_quality_counts = []

# gets list of all countries included at least once in all_data
countries = all_data["country"].unique()

for country in countries:
    country_list.append(country)
    # filter by current selected country
    articles_from_country = all_data[all_data["country"] == country]
    counts.append(len(articles_from_country))
    count = 0
    for index, row in articles_from_country.iterrows():
        # "FA" and "GA" are considered 'high quality' predictions
        if row["article_quality"] == "FA" or row["article_quality"] == "GA":
            count += 1
    high_quality_counts.append(count)
    # every row for a country carries the same population, so take the first
    populations.append(articles_from_country["population"].iloc[0])
    
    
article_stats["country"] = country_list
article_stats["num_articles"] = counts
article_stats["population"] = populations
article_stats["num_high_quality_articles"] = high_quality_counts

I can preview the resulting statistical dataset:

In [18]:
article_stats.head()

Unnamed: 0,country,num_articles,population,num_high_quality_articles
0,Chad,97,15.4,2
1,Cambodia,213,16.0,4
2,Canada,843,37.2,22
3,Egypt,235,97.0,9
4,Pakistan,1023,200.6,19
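The per-country loop could likely also be expressed with a `groupby`. The following is a sketch of that alternative on toy data, not the code used in this notebook; it assumes the same column names as `all_data` and a pandas version that supports named aggregation (0.25 or later).

```python
import pandas as pd

# Toy stand-in for all_data; the quality labels are made up for illustration
toy_all_data = pd.DataFrame({
    "country": ["Chad", "Chad", "Chad", "Egypt", "Egypt"],
    "article_quality": ["Stub", "GA", "FA", "Start", "GA"],
    "population": [15.4, 15.4, 15.4, 97.0, 97.0],
})

# Flag the "high quality" rows once, so they can simply be summed per group
toy_all_data["is_high_quality"] = toy_all_data["article_quality"].isin(["FA", "GA"])

toy_stats = (
    toy_all_data.groupby("country", sort=False)
    .agg(
        num_articles=("article_quality", "size"),          # rows per country
        population=("population", "first"),                # same value on every row
        num_high_quality_articles=("is_high_quality", "sum"),
    )
    .reset_index()
)
print(toy_stats)
```

Grouping on `regions` instead of `country` (and dropping the population column) would give the region-level counts in the same way.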


Next, I will repeat the process above but at a region level this time. All of the criteria and calculations at the region level are the same, with the exception of tracking the region population.

In [19]:
region_stats = pd.DataFrame()
region_list = []
num_articles = []
high_quality_counts = []

# gets list of all regions included at least once in all_data
regions = all_data["regions"].unique()    

for region in regions:
    region_list.append(region)
    # filters data to rows with current selected region
    articles_from_region = all_data[all_data["regions"] == region]
    num_articles.append(len(articles_from_region))
    count = 0
    for index, row in articles_from_region.iterrows():
        # "FA" and "GA" are considered 'high quality' predictions
        if row["article_quality"] == "FA" or row["article_quality"] == "GA":
            count += 1
    high_quality_counts.append(count)
    
region_stats["region"] = region_list
region_stats["num_articles"] = num_articles
region_stats["num_high_quality_articles"] = high_quality_counts   

To gather population data at a region level, I will merge the DataFrame just created with the `population_for_regions` DataFrame created earlier in this notebook. I will join these two sources on the `region` column. Again, I can preview the region statistics DataFrame to ensure everything looks correct:

In [20]:
region_stats = pd.merge(region_stats, population_for_regions, on="region")
region_stats.head() 

Unnamed: 0,region,num_articles,num_high_quality_articles,population
0,AFRICA,6851,125,1284
1,ASIA,11531,310,4536
2,NORTHERN AMERICA,1921,99,365
3,EUROPE,15864,322,746
4,LATIN AMERICA AND THE CARIBBEAN,5169,69,649


I will do some data cleaning to allow for the calculations below. The `population` data is reported in millions and stored as text, so I remove the `,` thousands separators found in the data (for example, a population of 1,174 million). I then convert the values to floats and multiply them by 1,000,000 so that they represent the actual number of people rather than millions.

In [21]:
# populations reported in millions
article_stats["population"] = article_stats["population"].str.replace(",","")
article_stats["population"] = article_stats["population"].astype(float) * 1000000

region_stats["population"] = region_stats["population"].str.replace(",","")
region_stats["population"] = region_stats["population"].astype(float) * 1000000
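The conversion can be tried on a few sample values. This is a sketch on a toy Series; the extra `astype(str)` is an assumption added here so the same line also works if the column was already parsed as numbers, in which case `.str` would otherwise fail.

```python
import pandas as pd

# Toy values in millions, stored as text with a thousands separator
raw_population = pd.Series(["1,284", "15.4", "97.0"])

population_counts = (
    raw_population.astype(str)               # no-op for text, safe for numeric input
    .str.replace(",", "", regex=False)       # "1,284" -> "1284"
    .astype(float) * 1_000_000               # millions -> number of people
)
print(population_counts.tolist())
```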

I will do some feature engineering where I calculate the number of articles per person in the population, as well as the proportion of articles that are high quality (labeled `GA` or `FA` by ORES). I do this for both the `article_stats` DataFrame at a country level and the `region_stats` DataFrame at a geographic region level.

In [22]:
article_stats["articles_per_population"] = article_stats["num_articles"] / article_stats["population"]
article_stats["quality_articles_per_articles"] = article_stats["num_high_quality_articles"] / article_stats["num_articles"]

region_stats["articles_per_population"] = region_stats["num_articles"] / region_stats["population"]
region_stats["quality_articles_per_articles"] = region_stats["num_high_quality_articles"] / region_stats["num_articles"]
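As a worked example of the two ratios, the values for Chad from the `article_stats` preview above (97 articles, 2 of them high quality, population 15.4 million) can be plugged in by hand. The variable names below are only for this illustration.

```python
# Values for Chad taken from the article_stats preview
num_articles = 97
num_high_quality_articles = 2
population = 15.4 * 1_000_000  # 15.4 million people

# Articles per person: roughly 6.3 articles per million people
articles_per_population = num_articles / population

# Share of articles predicted GA or FA: roughly 2%
quality_articles_per_articles = num_high_quality_articles / num_articles

print(articles_per_population, quality_articles_per_articles)
```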

Using the aggregated dataset and calculated columns, I will generate some result tables to explore the findings.

--------------------

## Result tables

### 1. Top 10 countries by coverage

These are the 10 countries with the highest number of politician articles on Wikipedia relative to their population.

In [23]:
top_10_by_coverage = article_stats.nlargest(10, "articles_per_population")
top_10_by_coverage[["country", "articles_per_population"]]

Unnamed: 0,country,articles_per_population
98,Tuvalu,0.0054
149,Nauru,0.0052
39,San Marino,0.0027
63,Monaco,0.001
97,Liechtenstein,0.0007
86,Tonga,0.00063
104,Marshall Islands,0.000617
66,Iceland,0.000503
166,Andorra,0.000425
77,Grenada,0.00036


### 2. Bottom 10 countries by coverage

These are the 10 countries with the fewest politician articles on Wikipedia relative to their population.

In [24]:
bottom_10_by_coverage = article_stats.nsmallest(10, "articles_per_population")
bottom_10_by_coverage[["country", "articles_per_population"]]

Unnamed: 0,country,articles_per_population
6,India,7.146503e-07
58,Indonesia,7.918552e-07
20,China,8.107332e-07
150,Uzbekistan,8.510638e-07
106,Ethiopia,9.395349e-07
163,"Korea, North",1.40625e-06
178,Zambia,1.412429e-06
126,Thailand,1.691843e-06
125,Mozambique,1.901639e-06
115,Bangladesh,1.917067e-06


### 3. Top 10 countries by relative quality

These are the 10 countries with the highest proportion of quality articles (`GA` or `FA` predictions from ORES) per number of politician articles on Wikipedia from that country.

In [25]:
top_10_by_quality = article_stats.nlargest(10, "quality_articles_per_articles")
top_10_by_quality[["country", "quality_articles_per_articles"]]

Unnamed: 0,country,quality_articles_per_articles
163,"Korea, North",0.194444
169,Saudi Arabia,0.127119
138,Mauritania,0.125
161,Central African Republic,0.121212
45,Romania,0.113703
98,Tuvalu,0.092593
124,Bhutan,0.090909
172,Dominica,0.083333
47,Syria,0.078125
44,Benin,0.076923


### 4. Bottom 10 countries by relative quality

These are the 10 countries with the lowest proportion of quality articles (`GA` or `FA` predictions from ORES) per number of politician articles on Wikipedia from that country.

In [26]:
bottom_10_by_quality = article_stats.nsmallest(10, "quality_articles_per_articles")
bottom_10_by_quality[["country", "quality_articles_per_articles"]]

Unnamed: 0,country,quality_articles_per_articles
14,Malta,0.0
22,Angola,0.0
28,Finland,0.0
32,Tunisia,0.0
39,San Marino,0.0
50,Uganda,0.0
52,Moldova,0.0
63,Monaco,0.0
76,Turkmenistan,0.0
80,Slovakia,0.0


### 5. Geographic regions by coverage (by politician articles from countries in each region as a proportion of total regional population)

The geographic regions in order of most politician articles per population to least.

In [27]:
regions_by_coverage = region_stats.sort_values("articles_per_population", ascending=False)
regions_by_coverage[["region", "articles_per_population"]]

Unnamed: 0,region,articles_per_population
5,OCEANIA,7.6e-05
3,EUROPE,2.1e-05
4,LATIN AMERICA AND THE CARIBBEAN,8e-06
0,AFRICA,5e-06
2,NORTHERN AMERICA,5e-06
1,ASIA,3e-06


### 6. Geographic regions by coverage (by relative proportion of politician articles from countries in each region that are of GA and FA-quality)

The geographic regions in order from the highest to the lowest proportion of high quality articles among all articles.

In [28]:
regions_by_quality_coverage = region_stats.sort_values("quality_articles_per_articles", ascending=False)
regions_by_quality_coverage[["region", "quality_articles_per_articles"]]

Unnamed: 0,region,quality_articles_per_articles
2,NORTHERN AMERICA,0.051536
1,ASIA,0.026884
5,OCEANIA,0.0211
3,EUROPE,0.020298
0,AFRICA,0.018246
4,LATIN AMERICA AND THE CARIBBEAN,0.013349


----------------

## Reflections and implications

**Write a few paragraphs reflecting on what you have learned, what you found, what (if anything) surprised you about your findings, and/or what theories you have about why any biases might exist (if you find they exist). You can also include any questions this assignment raised for you about bias, Wikipedia, or machine learning.**

I learned a lot during the process of this assignment. First, I strengthened my technical skills as I had not done much data source merging before this assignment. Ideas such as logging which rows I could not get data for or saving rows from a merge that don't match were not concepts I was previously familiar with. 

As I expected, population was a huge driver in the relative proportions of articles analysis. Fewer people likely means fewer politicians and fewer articles, but this relationship is likely not directly linear. It is difficult to quantify this relationship by looking at the data used for the table calculations, as there are likely other factors. I believe this could be a source of 'bias' of sorts introduced into the analysis.

There are many cases where bias could exist, which are discussed more below. Additionally, there are many edge cases where the data represented here may not tell the whole story. For example, data censorship and government control are likely what led to North Korea landing low on article count per population but high on the quality list. This is an interesting case where the government influences much more of what is communicated than in other nations. I believe this could cause bias since it sways these numbers in a way not accounted for by a data field.

Following our Wikipedia 'attack' recognition exercise in class in Week 3, it became very clear that there is no clear consensus on how content should be labeled, even in extreme cases like personal attacks. Depending on how the data used to train the ORES model was labeled, this could be a significant source of bias in this dataset, especially if those in charge of the classification were not a representative sample.

This assignment made me wonder, as a whole, how much the training data affects the model. For example, if a model was trained on a disproportionate sample of quality articles, would it be more likely to predict whatever label was most heavily represented in the training set?

**What biases did you expect to find in the data (before you started working with it), and why?**

I expected the data to be biased by representation. Since this analysis is based on Wikipedia articles, countries with less internet access and less of a norm around digital news will be underrepresented in this dataset. Additionally, depending on how government is structured in a certain country, there may be more or fewer articles about politicians.

When we looked at the Wikipedia contribution maps in class, the distributions were not as I expected. For example, I believe most of the contributions written in Mandarin came from France, since Wikipedia is banned in China. For political articles, this could really influence the quantity, and quality, of articles.

There are several [countries where Wikipedia is banned](https://en.wikipedia.org/wiki/Censorship_of_Wikipedia).  China, where Wikipedia is banned, and Uzbekistan, where Wikipedia is censored, both show up on the bottom 10 by coverage list as we might expect. For a more detailed analysis, it may be worth considering factors such as this in the conclusions.

**What (potential) sources of bias did you discover in the course of your data processing and analysis?**

The ORES service is designed to address the quality of Wikipedia articles in a manner that is scalable. To do so, it must rely on many assumptions that may unintentionally introduce bias into the analysis. 

The ORES service leverages identifications of articles that might need to be deleted. The [documentation page](https://www.mediawiki.org/wiki/ORES) lists the models used to detect vandalism, attacks, or spam in English-language text. The page then states that articles that pass this initial test have their quality examined based on the [English Wikipedia 1.0 assessment rating scale](https://en.wikipedia.org/wiki/Wikipedia:Version_1.0_Editorial_Team). I could only find mentions of the English models on the documentation page. It is therefore unclear whether there are comparable models for other languages, or how this was addressed, since the international dataset likely represents much more than the English language. Even if articles were all translated to English prior to analysis, there are likely to be some issues in translation. Therefore, articles written in a non-English language may be more or less likely to be flagged as a 'delete' article, skewing the representation of the dataset.

The ORES predictive learning model is [based on human assessment](https://www.mediawiki.org/wiki/ORES). As we discussed in class, humans can make errors in labeling, so the labels themselves are likely not 100% correct and this puts a ceiling on model performance. This can introduce subtle bias toward whichever labels the human assessors were most likely to assign.

The article quality model is [based on the structure of the article](https://www.mediawiki.org/wiki/ORES). It looks at characteristics such as the sections, references, and citation format, but does not look at "the quality of the writing or whether or not there's a tone problem," instead claiming that many "structural characteristics of articles seem to correlate strongly with good writing and tone."

**Can you think of a realistic data science research situation where using these data (to train a model, perform a hypothesis-driven research, or make business decisions) might create biased or misleading results, due to the inherent gaps and limitations of the data?**

Since the ORES responses are based on the structure of the article, as mentioned above, I can see some serious potential bias issues here. For example, in countries where there is less formal education around writing and formatting, articles from that area may consistently score lower on the ORES rankings. While it has been seen that there is a strong correlation between structure and an article's writing and tone, this relationship is not guaranteed. A well-written piece with a non-traditional structure may score poorly when in fact the article has high quality content. This could be a case where the results are unintentionally biased.

Additionally, I am not sure what the relationship is between population and number of politicians. For example, if 2 countries both have a population of 10M but Country A has a government made up of 500 politicians and Country B has a government with just 100 politicians, I would expect there to be variation in their respective numbers of articles. Therefore, it may be interesting to include a variable for government size in this analysis and see how closely it is tied to population. Depending on the desired conclusions, this could sway the results.

Data will never be perfectly clear and unbiased. This data analysis, like most, is ripe for bias and misinterpretation. Therefore, considering factors mentioned above such as Wikipedia censorship will al