Course Human-Centered Data Science ([HCDS](https://www.mi.fu-berlin.de/en/inf/groups/hcc/teaching/winter_term_2020_21/course_human_centered_data_science.html)) - Winter Term 2020/21 - [HCC](https://www.mi.fu-berlin.de/en/inf/groups/hcc/index.html) | [Freie Universität Berlin](https://www.fu-berlin.de/)

***

# A2 - Wikipedia, ORES, and Bias in Data


## Step 1⃣ | Data acquisition

Two data sources are used: (1) Wikipedia articles of politicians and (2) world population data.

**Wikipedia articles -**
The Wikipedia articles can be found on [Figshare](https://figshare.com/articles/Untitled_Item/5513449). The dataset contains politicians by country from the English-language Wikipedia. Please read through the documentation for this repository, then download and unzip it to extract the data file, which is called `page_data.csv`.

**Population data -**
The population data is available in `CSV` format in the `data_raw` folder. The file is named `export_2019.csv`. This dataset is drawn from the [world population datasheet](https://www.prb.org/international/indicator/population/table/) published by the Population Reference Bureau (downloaded 2020-11-13 10:14 AM). I have edited the dataset to make it easier to use in this assignment. The population per country is given in millions!

In [1]:
import pandas as pd

In [2]:
page = pd.read_csv("../data_raw/page_data.csv")
export = pd.read_csv("../data_raw/export_2019.csv", delimiter = ";")

In [3]:
page.head()

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


In [4]:
export.head()

Unnamed: 0,country,population,region
0,Algeria,44.357,AFRICA
1,Egypt,100.803,AFRICA
2,Libya,6.891,AFRICA
3,Morocco,35.952,AFRICA
4,Sudan,43.849,AFRICA


## Step 2⃣ | Data processing and cleaning
The data in `page_data.csv` contains some rows that you will need to filter out: page names that start with the string `"Template:"`. These pages are not Wikipedia articles and should not be included in your analysis. The data in `export_2019.csv` does not need any cleaning.


In [5]:
page = page[~page['page'].str.startswith("Template:")]

In [6]:
page.head()

Unnamed: 0,page,country,rev_id
1,Bir I of Kanem,Chad,355319463
10,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188
12,Yos Por,Cambodia,393822005
23,Julius Gregr,Czech Republic,395521877
24,Edvard Gregr,Czech Republic,395526568


### Getting article quality predictions with ORES

Now you need to get the predicted quality scores for each article in the Wikipedia dataset. We're using a machine learning system called [**ORES**](https://www.mediawiki.org/wiki/ORES) ("Objective Revision Evaluation Service"). ORES estimates the quality of an article (at a particular point in time), and assigns a series of probabilities that the article is in one of the six quality categories. The options are, from best to worst:

| ID | Quality Category |  Explanation |
|----|------------------|----------|
| 1 | FA    | Featured article |
| 2 | GA    | Good article |
| 3 | B     | B-class article |
| 4 | C     | C-class article |
| 5 | Start | Start-class article |
| 6 | Stub  | Stub-class article |
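The ordering in the table can be encoded explicitly, which makes later comparisons less error-prone. The following is only a minimal sketch; `QUALITY_ORDER`, `QUALITY_ID` and `is_high_quality` are hypothetical helper names and not part of ORES. The "high quality" rule (`FA` or `GA`) is the one used in the analysis step of this notebook.

```python
# Quality categories from best to worst, as listed in the table above.
QUALITY_ORDER = ["FA", "GA", "B", "C", "Start", "Stub"]

# Map each category to its ID (1 = best, 6 = worst).
QUALITY_ID = {label: i for i, label in enumerate(QUALITY_ORDER, start=1)}


def is_high_quality(label):
    """Return True for the two top categories (FA and GA)."""
    return label in ("FA", "GA")


print(QUALITY_ID["Stub"])      # 6
print(is_high_quality("GA"))   # True
```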


### ORES REST API endpoint

The [ORES REST API](https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context_revid_model) is configured fairly similarly to the pageviews API we used for the last assignment. It expects the following parameters:

* **project** --> `enwiki`
* **revid** --> e.g. `235107991` or multiple ids e.g.: `235107991|355319463` (batch)
* **model** --> `wp10` - The name of a model to use when scoring.

**❗Note on batch processing:** Please read the documentation about [API usage](https://www.mediawiki.org/wiki/ORES#API_usage) if you want to query a large number of revisions (batches). 

You will notice that ORES returns a prediction value that contains the name of one category (e.g. `Start`), as well as probability values for each of the six quality categories. For this assignment, you only need to capture and use the value for prediction.
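To illustrate where the prediction sits in the response, here is a small sketch that works on a hand-written dictionary instead of a live request. The nesting `enwiki` → `scores` → revision ID → `wp10` → `score` → `prediction` matches the access path used in this notebook; the probability values are made up, and the shape of the failed entry (an `error` key in place of `score`) is an assumption. `extract_predictions` is a hypothetical helper name.

```python
# Hand-written example response; probabilities are invented for illustration.
sample_response = {
    "enwiki": {
        "scores": {
            "355319463": {
                "wp10": {
                    "score": {
                        "prediction": "Stub",
                        "probability": {
                            "FA": 0.01, "GA": 0.01, "B": 0.02,
                            "C": 0.03, "Start": 0.13, "Stub": 0.80,
                        },
                    }
                }
            },
            # Assumed shape of a failed lookup: no "score" entry.
            "1": {"wp10": {"error": {"type": "RevisionNotFound",
                                     "message": "made-up error message"}}},
        }
    }
}


def extract_predictions(response, project="enwiki", model="wp10"):
    """Return ({rev_id: prediction}, [rev_ids without a score])."""
    predictions, failed = {}, []
    for rev_id, result in response[project]["scores"].items():
        score = result.get(model, {}).get("score")
        if score is None:
            failed.append(int(rev_id))
        else:
            predictions[int(rev_id)] = score["prediction"]
    return predictions, failed


predictions, failed = extract_predictions(sample_response)
print(predictions)  # {355319463: 'Stub'}
print(failed)       # [1]
```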

**❗Note:** It's possible that you will be unable to get a score for a particular article. If that happens, make sure to maintain a log of articles for which you were not able to retrieve an ORES score. This log should be saved as a separate file named `ORES_no_scores.csv` and should include the `page`, `country`, and `rev_id` (just as in `page_data.csv`).
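One possible way to build such a log is sketched below on toy data. The two rows are copied from the `page_data.csv` preview, while `failed_rev_ids` is a hypothetical list of revision IDs for which no score came back. The sketch writes to an in-memory buffer so that it runs without touching the file system.

```python
import io

import pandas as pd

# Toy stand-in for page_data.csv (rows taken from the preview above).
page = pd.DataFrame({
    "page": ["Bir I of Kanem", "Yos Por"],
    "country": ["Chad", "Cambodia"],
    "rev_id": [355319463, 393822005],
})

# Hypothetical revision IDs without an ORES score.
failed_rev_ids = [393822005]

# Keep only the failed revisions and the three required columns.
no_scores = page.loc[page["rev_id"].isin(failed_rev_ids),
                     ["page", "country", "rev_id"]]

# In the notebook this would be written to disk instead, e.g.
# no_scores.to_csv("ORES_no_scores.csv", index=False)
buffer = io.StringIO()
no_scores.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Passing `index=False` keeps the file layout identical to the three columns named above.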


It's recommended to batch no more than 50 revisions within a given request. The revision IDs are therefore split into chunks of at most 50; the final chunk may be smaller.
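A minimal sketch of this batching, independent of the cell below: stepping through the list in strides of 50 never produces an empty final batch, even when the number of IDs is an exact multiple of 50. `chunked` is a hypothetical helper name, and the revision IDs are made up.

```python
def chunked(items, size=50):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


rev_ids = list(range(1, 121))           # 120 made-up revision IDs
batches = list(chunked(rev_ids))
print([len(b) for b in batches])        # [50, 50, 20]

# Each batch becomes one pipe-separated `revids` parameter.
revids_param = "|".join(str(r) for r in batches[-1][:3])
print(revids_param)                     # 101|102|103
```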

In [9]:
import requests
import json
import numpy as np
import sys
from math import floor

# Customize these with your own information
headers = {
    'User-Agent': 'https://github.com/Francosinus',
    'From': 'franziska.rau@fu-berlin.de'
}


def get_ores_data(rev_id,headers):
    
    # Define the endpoint, e.g.
    # https://ores.wikimedia.org/v3/scores/enwiki/?models=wp10&revids=807420979|807422778
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'

    params = {'project' : 'enwiki',
              'model'   : 'wp10',
              'revids'  : rev_id
              }

    # Send the identifying headers along with the request
    api_call = requests.get(endpoint.format(**params), headers=headers)
    response = api_call.json()
    data=json.dumps(response)
    
    return data

def createDf(ids,df):
    
    prediction=[]
    err=[]
    for i in range(0, len(ids) // 50+1):
    
        if i %50 ==0:
            print(f"Processing batch: {i} of {len(ids) // 50+1}")

        batch = '|'.join(str(x) for x in ids[(i * 50):((i+1) * 50)])
        data = get_ores_data(batch,headers)
        data = json.loads(data)
    
        for key in data['enwiki']['scores'].keys():
            
            try:
                pred = data['enwiki']['scores'][str(key)]['wp10']['score']['prediction']
                prediction.append([key, pred])
                
            except KeyError:
                err.append(key)

    ores = pd.DataFrame(prediction, columns = ["rev_id","prediction"])
    ores['rev_id']=ores['rev_id'].astype(int)
    full = df.merge(ores, on=['rev_id'], how='left')

    return full

In [10]:
ids = [item for item in page['rev_id']]
 
page_ores = createDf(ids, page)


Processing batch: 0 of 935
Processing batch: 50 of 935
Processing batch: 100 of 935
Processing batch: 150 of 935
Processing batch: 200 of 935
Processing batch: 250 of 935
Processing batch: 300 of 935
Processing batch: 350 of 935
Processing batch: 400 of 935
Processing batch: 450 of 935
Processing batch: 500 of 935
Processing batch: 550 of 935
Processing batch: 600 of 935
Processing batch: 650 of 935
Processing batch: 700 of 935
Processing batch: 750 of 935
Processing batch: 800 of 935
Processing batch: 850 of 935
Processing batch: 900 of 935


**Let's have a look at the data! The table now contains a column `prediction`, which holds the predicted quality class of each article.**

In [11]:
page_ores.head()

Unnamed: 0,page,country,rev_id,prediction
0,Bir I of Kanem,Chad,355319463,Stub
1,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188,Stub
2,Yos Por,Cambodia,393822005,Stub
3,Julius Gregr,Czech Republic,395521877,Stub
4,Edvard Gregr,Czech Republic,395526568,Stub


### Combining the datasets

Both datasets are combined: (1) the Wikipedia articles with ORES scores and (2) the population data. Both have a column named `country`, which serves as the merge key. Not all entries can be matched, so the non-matching rows are filtered out. This is easily done by removing rows that contain NaN values.

The non-matching data can be found in `../data_clean/countries-no_match.csv`. The remaining data can be found in `../data_clean/politicians_by_country.csv`.
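As an alternative to checking for NaN values, the merge itself can report which rows matched. The sketch below uses the `indicator` option of `DataFrame.merge` on toy data; "Atlantis" and "Narnia" are invented country names used to force mismatches, and the population values are illustrative only.

```python
import pandas as pd

# Toy data with one unmatched row on each side.
articles = pd.DataFrame({
    "page": ["Bir I of Kanem", "Yos Por", "Some Politician"],
    "country": ["Chad", "Cambodia", "Atlantis"],
})
population = pd.DataFrame({
    "country": ["Chad", "Cambodia", "Narnia"],
    "population": [16.877, 15.5, 1.0],   # in millions, illustrative values
})

# indicator=True adds a `_merge` column: both, left_only or right_only.
merged = articles.merge(population, on="country", how="outer", indicator=True)

match = merged[merged["_merge"] == "both"].drop(columns="_merge")
no_match = merged[merged["_merge"] != "both"].drop(columns="_merge")

print(len(match), len(no_match))  # 2 2
```

The `_merge` column separates genuine country mismatches from rows that contain NaN for another reason, such as a missing ORES prediction.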


In [19]:
df = pd.merge(page_ores, export, on='country', how='outer')

**To make further processing easier, the population is converted from millions to absolute numbers and the columns are renamed. Rows with missing values are then separated from the complete rows.**

In [21]:
df['population']=df['population']*1e6
df=df.rename(columns={"page": "article_name", "rev_id": "revision_id", "prediction":"article_quality"})
no_match = df[df.isnull().any(axis=1)]
match = df[~df.isnull().any(axis=1)]

In [22]:
df.head()

Unnamed: 0,article_name,country,revision_id,article_quality,population,region
0,Bir I of Kanem,Chad,355319463.0,Stub,16877000.0,AFRICA
1,Abdullah II of Kanem,Chad,498683267.0,Stub,16877000.0,AFRICA
2,Salmama II of Kanem,Chad,565745353.0,Stub,16877000.0,AFRICA
3,Kuri I of Kanem,Chad,565745365.0,Stub,16877000.0,AFRICA
4,Mohammed I of Kanem,Chad,565745375.0,Stub,16877000.0,AFRICA

In [23]:
no_match.to_csv("../data_clean/countries-no_match.csv")
match.to_csv("../data_clean/politicians_by_country.csv")

## Step 3⃣ | Analysis

The analysis consists of calculating the proportion (as a percentage) of articles-per-population and high-quality articles for **each country** and for **each region**. By a `"high quality"` article I mean an article that ORES predicted as `FA` (featured article) or `GA` (good article).

**Examples:**

* if a country has a population of `10,000` people, and you found `10` articles about politicians from that country, then the percentage of `articles-per-population` would be `0.1%`.
* if a country has `10` articles about politicians, and `2` of them are `FA` or `GA` class articles, then the percentage of `high-quality-articles` would be `20%`.
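The two examples can be checked with a few lines of arithmetic. The function names below are hypothetical and only serve this illustration.

```python
def articles_per_population_pct(article_count, population):
    """Articles as a percentage of the population."""
    return 100 * article_count / population


def high_quality_pct(high_quality_count, article_count):
    """FA/GA articles as a percentage of all articles."""
    return 100 * high_quality_count / article_count


print(articles_per_population_pct(10, 10_000))  # 0.1
print(high_quality_pct(2, 10))                  # 20.0
```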

### Results

The results from this analysis are six `data tables`.


First we read the cleaned table in again to further process it.

In [34]:
clean = pd.read_csv("../data_clean/politicians_by_country.csv")

In [35]:
clean.head()

Unnamed: 0.1,Unnamed: 0,article_name,country,revision_id,article_quality,population,region
0,0,Bir I of Kanem,Chad,355319463.0,Stub,16877000.0,AFRICA
1,1,Abdullah II of Kanem,Chad,498683267.0,Stub,16877000.0,AFRICA
2,2,Salmama II of Kanem,Chad,565745353.0,Stub,16877000.0,AFRICA
3,3,Kuri I of Kanem,Chad,565745365.0,Stub,16877000.0,AFRICA
4,4,Mohammed I of Kanem,Chad,565745375.0,Stub,16877000.0,AFRICA


**Table 1:**

**Top 10 countries by coverage**<br>10 highest-ranked countries in terms of number of politician articles as a proportion of country population

In [36]:
top_coverage = clean.groupby(['country'])
top_coverage = top_coverage.agg({'article_name':'count','population':'last'})
top_coverage = top_coverage.reset_index()
top_coverage['coverage'] = (top_coverage['article_name']*100) / (top_coverage['population'])
top_coverage = top_coverage.sort_values('coverage', ascending=False)
top_coverage = top_coverage.reset_index(drop=True)

In [37]:
top_coverage.head(10)

Unnamed: 0,country,article_name,population,coverage
0,Tuvalu,54,10000.0,0.54
1,Albania,457,2838000.0,0.016103
2,New Zealand,783,4987000.0,0.015701
3,Norway,656,5387000.0,0.012177
4,Moldova,423,3535000.0,0.011966
5,Estonia,148,1331000.0,0.011119
6,Finland,569,5529000.0,0.010291
7,Sao Tome and Principe,21,210000.0,0.01
8,Lithuania,244,2794000.0,0.008733
9,Uruguay,285,3531000.0,0.008071


In [38]:
top_coverage.head(10).to_csv("../results/country_coverage_data_top_10.csv")

**Table 2:**

**Bottom 10 countries by coverage**<br>10 lowest-ranked countries in terms of number of politician articles as a proportion of country population

In [39]:
top_coverage = top_coverage.sort_values('coverage', ascending=True)
top_coverage.head(10)

Unnamed: 0,country,article_name,population,coverage
182,Guyana,20,787000000.0,3e-06
181,Djibouti,37,988000000.0,4e-06
180,Belize,16,419000000.0,4e-06
179,Barbados,14,287000000.0,5e-06
178,Bahamas,20,393000000.0,5e-06
177,Cape Verde,36,556000000.0,6e-06
176,Suriname,40,605000000.0,7e-06
175,French Guiana,27,294000000.0,9e-06
174,Martinique,34,356000000.0,1e-05
173,Montenegro,72,622000000.0,1.2e-05


In [40]:
top_coverage.head(10).to_csv("../results/country_coverage_data_bottom_10.csv")

**Table 3:**

**Regions by coverage**<br>Ranking of regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population
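Regional coverage relates the article count of a whole region to the region's total population, so each country's population has to be counted exactly once and then summed per region. The following is a minimal sketch of that aggregation on made-up data; the column names follow `politicians_by_country.csv`, while the countries, regions and numbers are invented.

```python
import pandas as pd

# Toy stand-in for politicians_by_country.csv (one row per article).
clean = pd.DataFrame({
    "article_name": ["a1", "a2", "a3", "b1", "c1"],
    "country":      ["A",  "A",  "A",  "B",  "C"],
    "region":       ["R1", "R1", "R1", "R1", "R2"],
    "population":   [1000, 1000, 1000, 3000, 500],
})

# Collapse to one row per country so each population is counted once.
countries = clean.groupby(["region", "country"], as_index=False).agg(
    articles=("article_name", "count"),
    population=("population", "first"),
)

# Sum article counts and populations over the countries of each region.
regions = countries.groupby("region", as_index=False)[["articles", "population"]].sum()
regions["coverage"] = 100 * regions["articles"] / regions["population"]
regions = regions.sort_values("coverage", ascending=False).reset_index(drop=True)
print(regions)
```

In this toy data, region `R1` ends up with 4 articles for a population of 4000 (coverage 0.1 %), and `R2` with 1 article for 500 people (coverage 0.2 %).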

In [41]:
top_coverage = clean.groupby(['region'])
top_coverage = top_coverage.agg({'article_name':'count','population':'last'})
top_coverage = top_coverage.reset_index()
top_coverage['coverage'] = (top_coverage['article_name']*100) / (top_coverage['population'])
top_coverage = top_coverage.sort_values('coverage', ascending=False)
top_coverage = top_coverage.reset_index(drop=True)

In [42]:
top_coverage.head(10)

Unnamed: 0,region,article_name,population,coverage
0,ASIA,11691,35041000.0,0.033364
1,EUROPE,15776,82000000.0,0.019239
2,AFRICA,6844,98000000.0,0.006984
3,OCEANIA,3127,106000000.0,0.00295
4,LATIN AMERICA AND THE CARIBBEAN,5273,419000000.0,0.001258
5,NORTHERN AMERICA,1910,329878000.0,0.000579


In [43]:
top_coverage.to_csv("../results/region_coverage_data_top.csv")

**Table 4:**

**Top 10 countries by relative quality**<br>10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality
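Following the definition in the analysis section, relative quality is the number of `FA` and `GA` articles divided by the total number of articles in the same group. The sketch below shows one way to compute this on made-up data; `quality_share` is a hypothetical helper, and flagging rows instead of filtering them keeps groups that have no high-quality articles at all.

```python
import pandas as pd

# Toy data: one row per article, with invented countries and regions.
clean = pd.DataFrame({
    "country": ["A", "A", "A", "A", "B", "B", "C"],
    "region":  ["R1", "R1", "R1", "R1", "R1", "R1", "R2"],
    "article_quality": ["FA", "GA", "Stub", "C", "Start", "Stub", "GA"],
})


def quality_share(df, by):
    """Percentage of FA/GA articles among all articles, grouped by `by`."""
    flagged = df.assign(high_quality=df["article_quality"].isin(["FA", "GA"]))
    result = flagged.groupby(by, as_index=False).agg(
        articles=("article_quality", "count"),
        high_quality=("high_quality", "sum"),
    )
    result["rel_proportion"] = 100 * result["high_quality"] / result["articles"]
    return result.sort_values("rel_proportion", ascending=False).reset_index(drop=True)


by_country = quality_share(clean, "country")
print(by_country)
```

Calling `quality_share(clean, "region")` gives the same ranking at the regional level.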

In [44]:
relative = clean.loc[(clean.article_quality =='GA') | (clean.article_quality =='FA')]

In [45]:
relative = clean.groupby(['country'])
relative = relative.agg({'article_name':'count','population':'last'})
relative = relative.reset_index()
relative['rel_proportion'] = (relative['article_name']*100) / (relative['population'])
relative = relative.sort_values('rel_proportion', ascending=False)
relative = relative.reset_index(drop=True)

In [46]:
relative.head(10)

Unnamed: 0,country,article_name,population,rel_proportion
0,Tuvalu,54,10000.0,0.54
1,Albania,457,2838000.0,0.016103
2,New Zealand,783,4987000.0,0.015701
3,Norway,656,5387000.0,0.012177
4,Moldova,423,3535000.0,0.011966
5,Estonia,148,1331000.0,0.011119
6,Finland,569,5529000.0,0.010291
7,Sao Tome and Principe,21,210000.0,0.01
8,Lithuania,244,2794000.0,0.008733
9,Uruguay,285,3531000.0,0.008071


In [47]:
relative.head(10).to_csv("../results/country_proportion_good_quality_data_top10.csv")

**Table 5:**

**Bottom 10 countries by relative quality**<br>10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality

In [48]:
relative = relative.sort_values('rel_proportion', ascending=True)
relative.head(10).to_csv("../results/country_proportion_good_quality_data_bottom10.csv")

In [49]:
relative.head(10)

Unnamed: 0,country,article_name,population,rel_proportion
182,Guyana,20,787000000.0,3e-06
181,Djibouti,37,988000000.0,4e-06
180,Belize,16,419000000.0,4e-06
179,Barbados,14,287000000.0,5e-06
178,Bahamas,20,393000000.0,5e-06
177,Cape Verde,36,556000000.0,6e-06
176,Suriname,40,605000000.0,7e-06
175,French Guiana,27,294000000.0,9e-06
174,Martinique,34,356000000.0,1e-05
173,Montenegro,72,622000000.0,1.2e-05


**Table 6:**

**Regions by relative quality**<br>Ranking of regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality

In [50]:
relative = clean.groupby(['region'])
relative = relative.agg({'article_name':'count','population':'last'})
relative = relative.reset_index()
relative['rel_proportion'] = (relative['article_name']*100) / (relative['population'])
relative = relative.sort_values('rel_proportion', ascending=False)
relative = relative.reset_index(drop=True)

In [51]:
relative.head(10)

Unnamed: 0,region,article_name,population,rel_proportion
0,ASIA,11691,35041000.0,0.033364
1,EUROPE,15776,82000000.0,0.019239
2,AFRICA,6844,98000000.0,0.006984
3,OCEANIA,3127,106000000.0,0.00295
4,LATIN AMERICA AND THE CARIBBEAN,5273,419000000.0,0.001258
5,NORTHERN AMERICA,1910,329878000.0,0.000579


In [52]:
relative.to_csv("../results/region_proportion_good_quality_data_top.csv")

***

#### Credits

This exercise is slightly adapted from the course [Human Centered Data Science (Fall 2019)](https://wiki.communitydata.science/Human_Centered_Data_Science_(Fall_2019)) of the [University of Washington](https://www.washington.edu/datasciencemasters/) by [Jonathan T. Morgan](https://wiki.communitydata.science/User:Jtmorgan).

Like the original authors, we release the notebooks under the [Creative Commons Attribution license (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/).