Course Human-Centered Data Science ([HCDS](https://www.mi.fu-berlin.de/en/inf/groups/hcc/teaching/winter_term_2020_21/course_human_centered_data_science.html)) - Winter Term 2020/21 - [HCC](https://www.mi.fu-berlin.de/en/inf/groups/hcc/index.html) | [Freie Universität Berlin](https://www.fu-berlin.de/)

***

# A2 - Wikipedia, ORES, and Bias in Data
Please follow the reproducibility workflow as practiced during the last exercise.

## Step 1⃣ | Data acquisition

You will use two data sources: (1) Wikipedia articles of politicians and (2) world population data.

**Wikipedia articles -**
The Wikipedia articles can be found on [Figshare](https://figshare.com/articles/Untitled_Item/5513449). The dataset contains politicians by country from the English-language Wikipedia. Please read through the documentation for this repository, then download and unzip it to extract the data file, which is called `page_data.csv`.

**Population data -**
The population data is available in `CSV` format in the `_data` folder. The file is named `export_2019.csv`. This dataset is drawn from the [world population datasheet](https://www.prb.org/international/indicator/population/table/) published by the Population Reference Bureau (downloaded 2020-11-13 10:14 AM). I have edited the dataset to make it easier to use in this assignment. The population per country is given in millions!

As the first step, we need to import the necessary libraries for this project.

In [1]:
import pandas as pd
import os

We now load the csv files as pandas dataframes for further processing.

In [74]:
population_df = pd.read_csv('../data_raw/export_2019.csv', delimiter=';')
articles_df = pd.read_csv('../data_raw/page_data.csv')

In [3]:
# load all files from a folder
# files = [file for file in os.listdir('../data_raw')]
# combined = pd.DataFrame()

# for file in files:
#     delimiter = ',' if file != 'export_2019.csv' else ';'
#     current_df = pd.read_csv('../data_raw/' + file, delimiter=delimiter)
#     combined = pd.concat([combined, current_df])
# combined

## Step 2⃣ | Data processing and cleaning
The data in `page_data.csv` contain some rows that you will need to filter out. It contains some page names that start with the string `"Template:"`. These pages are not Wikipedia articles, and should not be included in your analysis. The data in `export_2019.csv` does not need any cleaning.

***

| | `page_data.csv` | | |
|-|------|---------|--------|
| | **page** | **country** | **rev_id** |
|0|	Template:ZambiaProvincialMinisters | Zambia | 235107991 |
|1|	Bir I of Kanem | Chad | 355319463 |

***

| | `export_2019.csv` | | |
|-|------|---------|--------|
| | **country** | **population** | **region** |
|0|	Algeria | 44.357 | AFRICA |
|1|	Egypt | 100.803 | AFRICA |

***

We remove every row whose page name starts with the `Template:` prefix.

In [75]:
articles_df = articles_df[~articles_df['page'].str.startswith('Template:')]
articles_df

# sort out nan's
# new_df = data_df[data_df.isna().any(axis=1)]
# df.dropna(how='all')

# convert to int
# df['col'] = pd.to_numeric(df['col'])

# run function on df
# def get_city(address):
#   return address.split(',')[1]
# df['col'].apply(lambda addr: get_city(addr))

# python f strings
# .apply(lambda addr: f"{get_city(addr)} xy")

# plot labeling
# plt.xticks(df['x'].unique(), rotation='vertical', size=8)

Unnamed: 0,page,country,rev_id
1,Bir I of Kanem,Chad,355319463
10,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188
12,Yos Por,Cambodia,393822005
23,Julius Gregr,Czech Republic,395521877
24,Edvard Gregr,Czech Republic,395526568
...,...,...,...
47192,Yahya Jammeh,Gambia,807482007
47193,Lucius Fairchild,United States,807483006
47194,Fahd of Saudi Arabia,Saudi Arabia,807483153
47195,Francis Fessenden,United States,807483270


### Getting article quality predictions with ORES

Now you need to get the predicted quality scores for each article in the Wikipedia dataset. We're using a machine learning system called [**ORES**](https://www.mediawiki.org/wiki/ORES) ("Objective Revision Evaluation Service"). ORES estimates the quality of an article (at a particular point in time), and assigns a series of probabilities that the article is in one of the six quality categories. The options are, from best to worst:

| ID | Quality Category |  Explanation |
|----|------------------|----------|
| 1 | FA    | Featured article |
| 2 | GA    | Good article |
| 3 | B     | B-class article |
| 4 | C     | C-class article |
| 5 | Start | Start-class article |
| 6 | Stub  | Stub-class article |

For context, these quality classes are a sub-set of quality assessment categories developed by Wikipedia editors. If you're curious, you can [read more](https://en.wikipedia.org/wiki/Wikipedia:Content_assessment#Grades) about what these assessment classes mean on English Wikipedia. For this assignment, you only need to know that these categories exist, and that ORES will assign one of these six categories to any `rev_id`. You need to extract all `rev_id`s in the `page_data.csv` file and use the ORES API to get the predicted quality score for that specific article revision.

### ORES REST API endpoint

The [ORES REST API](https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context_revid_model) is configured fairly similarly to the pageviews API we used for the last assignment. It expects the following parameters:

* **project** --> `enwiki`
* **revid** --> e.g. `235107991` or multiple ids e.g.: `235107991|355319463` (batch)
* **model** --> `wp10` - The name of a model to use when scoring.

**❗Note on batch processing:** Please read the documentation about [API usage](https://www.mediawiki.org/wiki/ORES#API_usage) if you want to query a large number of revisions (batches). 
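
A minimal sketch of batching, assuming a limit of 50 revision IDs per request (the figure used in the comment of the sample code in this notebook; please check it against the API usage documentation). The helper name `batch_rev_ids` is hypothetical and not part of any library.

```python
def batch_rev_ids(rev_ids, batch_size=50):
    """Yield '|'-joined strings holding at most batch_size revision IDs each."""
    rev_ids = list(rev_ids)
    for start in range(0, len(rev_ids), batch_size):
        yield '|'.join(str(rev_id) for rev_id in rev_ids[start:start + batch_size])


# Example with the two revision IDs mentioned above and a batch size of 1
batches = list(batch_rev_ids([235107991, 355319463], batch_size=1))
```

Each yielded string can be passed as the `revid` parameter of one request.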

You will notice that ORES returns a prediction value that contains the name of one category (e.g. `Start`), as well as probability values for each of the six quality categories. For this assignment, you only need to capture and use the value for prediction.

**❗Note:** It's possible that you will be unable to get a score for a particular article. If that happens, make sure to maintain a log of articles for which you were not able to retrieve an ORES score. This log should be saved as a separate file named `ORES_no_scores.csv` and should include the `page`, `country`, and `rev_id` (just as in `page_data.csv`).

You can use the following **sample code for API calls**:

In [58]:
import requests
import json
import numpy as np
from ratelimit import limits, sleep_and_retry

final_ores_df = pd.DataFrame()

# Customize these with your own information
headers = {
    'User-Agent': 'https://github.com/Arne117',
    'From': 'arner92@zedat.fu-berlin.de'
}

# 50 revisions within a given request, up to 4 parallel requests.
# sleep_and_retry waits instead of raising RateLimitException when the limit is hit.
@sleep_and_retry
@limits(calls=4, period=0.1)
def get_ores_data(rev_ids, headers):

    # Define the endpoint
    # https://ores.wikimedia.org/v3/scores/enwiki/?models=wp10&revids=807420979|807422778
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'

    params = {
        'project' : 'enwiki',
        'model'   : 'wp10',
        'revids'  : rev_ids
    }

    # Send the identifying headers along with the request
    api_call = requests.get(endpoint.format(**params), headers=headers)

    # .json() already returns the parsed response as a dict
    return api_call.json()

def clean_ores_data(data):
    del data['enwiki']['models']
    chunk_df = pd.DataFrame(data['enwiki'])
    chunk_df.columns = ['score']
    chunk_df.index.name = 'rev_id'
    chunk_df['score'] = chunk_df['score'].apply(lambda score: score['wp10'])
    return chunk_df

#chunk1 = '355319463|393276188|393822005|395521877|395526568|401577829|442937236|448555418|470173494|477962574|492060822|492964343|498683267|502721672|516633096|521986779|532253442|543225630|545936100|546364151|549300521|550682925|550953646|559553872|559788982|560758943|561744402|564873005|565745353|565745365|565745375|566504165|573710096|574571582|576988466|585894477|592289232|595693452|596181202|598819900|601122766|601127343|614786300|623004627|623334577|624468970|625509885|626606789|627001041|627051151'
#chunk2 = '627432937|627547024|628261896|628268705|628270736|628312759|628379479|628563978|628619000|628766656|628988952|629562076|629818376|630396351|630396786|630704768|631437331|631581752|632008524|632261377|632447328|633612729|634032715|635240253|635814126|636911471|637801253|638214719|638362866|638377138|638566016|638571205|638599355|639021339|639061161|639471171|640014648|640214913|640826254|641422326|643410335|643746000|643932216|643932220|643932225|643932226|643932239|643932242|644024203|644041882'
#chunk3 = '644399375|645225697|645408814|647367482|647450883|647483021|647832959|647893858|648048611|648271473|650458255|650462344|650494773|651250302|651785828|651856758|653467222|653527941|653895210|654012510|655031906|655032578|655390291|655980284|656352211|656386131|656737492|657176628|658939122|659315526|660214915|660579437|660884852|662681733|662927043|663088604|663166199|663348266|663572469|663617582|663783398|663880497|664532059|664669156|664787336|664793790|664814622|664822251|666029819|666596918'

#chunks = [chunk1, chunk2, chunk1]
#for chunk in chunks:
#    chunk_ores_result = get_ores_data(chunk, headers)
#    chunk_ores_df = clean_ores_data(chunk_ores_result)
#    final_ores_df = pd.concat([final_ores_df, chunk_ores_df])

# The last row's index label approximates the length of the df. It is divided by 50 to get the
# number of chunks, so that roughly 50 rev_ids go into one chunk for the API call.
n_chunks = int(articles_df.iloc[-1].name / 50)
for i, chunk in enumerate(np.array_split(articles_df['rev_id'], n_chunks), start=1):
    print(f"Requesting chunk {i} of {n_chunks}")
    chunk_rev_ids = '|'.join(map(str, chunk))
    chunk_ores_result = get_ores_data(chunk_rev_ids, headers)
    chunk_ores_df = clean_ores_data(chunk_ores_result)
    final_ores_df = pd.concat([final_ores_df, chunk_ores_df]) # ignore_index=True



Requesting chunk 1 of 943
Requesting chunk 2 of 943
Requesting chunk 3 of 943
...

In [157]:
# final_ores_df.to_csv('en-wikipedia_tmp.csv')

In [232]:
import ast

ores_df = pd.read_csv('en-wikipedia_tmp.csv', index_col='rev_id')
# The score column holds Python dict literals (single quotes), so parse them with
# ast.literal_eval instead of rewriting quotes for json.loads
ores_df['score'] = ores_df['score'].apply(ast.literal_eval)
# ores_df

Sending one request for each `rev_id` might take some time. If you want to send batches, you can use `'|'.join(str(x) for x in revision_ids)` to join your IDs. Please make sure to handle the `KeyError` exception (see [exception handling](https://www.w3schools.com/python/python_try_except.asp)) when extracting the `prediction` from the `JSON` response.
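
A minimal sketch of the `KeyError` handling, assuming the per-revision result has the shape seen in the outputs of this notebook (`{'score': {'prediction': ...}}`) and that a failed lookup carries an `error` key instead. The function name `extract_prediction` and the two sample dictionaries are hypothetical.

```python
def extract_prediction(model_result):
    """Return the predicted quality class, or None if ORES returned no score."""
    try:
        return model_result['score']['prediction']
    except KeyError:
        # Assumed error shape: {'error': {...}} instead of {'score': {...}}
        return None


# Hypothetical, shortened results for illustration only
scored = {'score': {'prediction': 'Stub', 'probability': {'Stub': 0.9}}}
failed = {'error': {'type': 'RevisionNotFound', 'message': 'not found'}}

predictions = [extract_prediction(scored), extract_prediction(failed)]
```

Rows for which the function returns `None` are the candidates for the `ORES_no_scores.csv` log.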

### Combining the datasets

Now you need to combine both datasets: (1) the Wikipedia articles and their ORES quality scores and (2) the population data. Both have columns named `country`. After merging the data, you'll invariably run into entries which cannot be merged: either the population dataset does not have an entry for the equivalent Wikipedia country, or vice versa.

Please remove any rows that do not have matching data, and output them to a `CSV` file called `countries-no_match.csv`. Consolidate the remaining data into a single `CSV` file called `politicians_by_country.csv`.
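
One possible way to separate matching from non-matching rows is an outer merge with pandas' `indicator=True` option, sketched here on tiny frames. The helper name `split_matches` and the country `Atlantis` are made up for illustration; the real frames would come from the earlier steps.

```python
import pandas as pd


def split_matches(articles, population):
    """Outer-merge on 'country' and separate matched from unmatched rows."""
    merged = articles.merge(population, on='country', how='outer', indicator=True)
    matched = merged[merged['_merge'] == 'both'].drop(columns='_merge')
    unmatched = merged[merged['_merge'] != 'both'].drop(columns='_merge')
    return matched, unmatched


# Tiny illustrative frames, not the real data
articles = pd.DataFrame({'page': ['A', 'B'], 'country': ['Chad', 'Atlantis']})
population = pd.DataFrame({'country': ['Chad', 'Egypt'],
                           'population': [16.877, 100.803],
                           'region': ['AFRICA', 'AFRICA']})

matched, unmatched = split_matches(articles, population)
# matched.to_csv('politicians_by_country.csv', index=False)
# unmatched.to_csv('countries-no_match.csv', index=False)
```

The outer merge keeps rows that are missing on either side, so both kinds of mismatch end up in `unmatched`.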

The schema for that file should look like the following table:


| article_name | country | region | revision_id | article_quality | population |
|--------------|---------|--------|-------------|-----------------|------------|
| Bir I of Kanem | Chad  | AFRICA | 807422778 | Stub | 16877000 |

Here we combine the articles data frame with the ORES result data frame for further processing. 

In [233]:
merged_df = articles_df.join(ores_df, on='rev_id')
merged_df

Unnamed: 0,page,country,rev_id,score
1,Bir I of Kanem,Chad,355319463,"{'score': {'prediction': 'Stub', 'probability'..."
10,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188,"{'score': {'prediction': 'Stub', 'probability'..."
12,Yos Por,Cambodia,393822005,"{'score': {'prediction': 'Stub', 'probability'..."
23,Julius Gregr,Czech Republic,395521877,"{'score': {'prediction': 'Stub', 'probability'..."
24,Edvard Gregr,Czech Republic,395526568,"{'score': {'prediction': 'Stub', 'probability'..."
...,...,...,...,...
47192,Yahya Jammeh,Gambia,807482007,"{'score': {'prediction': 'GA', 'probability': ..."
47193,Lucius Fairchild,United States,807483006,"{'score': {'prediction': 'C', 'probability': {..."
47194,Fahd of Saudi Arabia,Saudi Arabia,807483153,"{'score': {'prediction': 'GA', 'probability': ..."
47195,Francis Fessenden,United States,807483270,"{'score': {'prediction': 'C', 'probability': {..."


In [234]:
merged_df = merged_df.merge(population_df, on='country', how='left')
merged_df

Unnamed: 0,page,country,rev_id,score,population,region
0,Bir I of Kanem,Chad,355319463,"{'score': {'prediction': 'Stub', 'probability'...",16.877,AFRICA
1,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188,"{'score': {'prediction': 'Stub', 'probability'...",5.008,ASIA
2,Yos Por,Cambodia,393822005,"{'score': {'prediction': 'Stub', 'probability'...",15.497,ASIA
3,Julius Gregr,Czech Republic,395521877,"{'score': {'prediction': 'Stub', 'probability'...",,
4,Edvard Gregr,Czech Republic,395526568,"{'score': {'prediction': 'Stub', 'probability'...",,
...,...,...,...,...,...,...
46696,Yahya Jammeh,Gambia,807482007,"{'score': {'prediction': 'GA', 'probability': ...",2.417,AFRICA
46697,Lucius Fairchild,United States,807483006,"{'score': {'prediction': 'C', 'probability': {...",329.878,NORTHERN AMERICA
46698,Fahd of Saudi Arabia,Saudi Arabia,807483153,"{'score': {'prediction': 'GA', 'probability': ...",35.041,ASIA
46699,Francis Fessenden,United States,807483270,"{'score': {'prediction': 'C', 'probability': {...",329.878,NORTHERN AMERICA


In [235]:
# It's possible that you will be unable to get a score for a particular article.
# If that happens, make sure to maintain a log of articles for which you were not able to retrieve an ORES score.
# This log should be saved as a separate file named ORES_no_scores.csv and should include the page, country, and
# rev_id (just as in page_data.csv).
ores_no_score_mask = merged_df['score'].apply(lambda res: 'error' in res)
ores_score_mask = merged_df['score'].apply(lambda res: 'score' in res)

ores_error_df = merged_df[ores_no_score_mask]
ores_score_df = merged_df[ores_score_mask]
#ores_error_df
#ores_score_df

# Keep only the columns required for the log (page, country, rev_id)
ores_error_df[['page', 'country', 'rev_id']].to_csv('../data_clean/ORES_no_scores.csv', index=False)

In [236]:
# Remove unused dictionary structure and format the dataframe to the required structure
# Work on an explicit copy to avoid pandas' SettingWithCopyWarning
final_df = ores_score_df.copy()
final_df['article_quality'] = final_df['score'].apply(lambda entry: entry['score']['prediction'])
final_df = final_df.rename(columns={'page': 'article_name'})
final_df = final_df.drop(columns='score')
final_df = final_df[['article_name', 'country', 'region', 'rev_id', 'article_quality', 'population']]

final_df

Unnamed: 0,article_name,country,region,rev_id,article_quality,population
0,Bir I of Kanem,Chad,AFRICA,355319463,Stub,16.877
1,Information Minister of the Palestinian Nation...,Palestinian Territory,ASIA,393276188,Stub,5.008
2,Yos Por,Cambodia,ASIA,393822005,Stub,15.497
3,Julius Gregr,Czech Republic,,395521877,Stub,
4,Edvard Gregr,Czech Republic,,395526568,Stub,
...,...,...,...,...,...,...
46695,Hal Bidlack,United States,NORTHERN AMERICA,807481636,C,329.878
46696,Yahya Jammeh,Gambia,AFRICA,807482007,GA,2.417
46697,Lucius Fairchild,United States,NORTHERN AMERICA,807483006,C,329.878
46698,Fahd of Saudi Arabia,Saudi Arabia,ASIA,807483153,GA,35.041


## Step 3⃣ | Analysis

Your analysis will consist of calculating the proportion (as a percentage) of articles-per-population (we can also call it `coverage`) and of high-quality articles (we can also call it `relative-quality`) for **each country** and for **each region**. By a `"high quality"` article we mean an article that ORES predicted as `FA` (featured article) or `GA` (good article).

**Examples:**

* if a country has a population of `10,000` people, and you found `10` articles about politicians from that country, then the percentage of `articles-per-population` would be `0.1%`.
* if a country has `10` articles about politicians, and `2` of them are `FA` or `GA` class articles, then the percentage of `high-quality-articles` would be `20%`.
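
The two examples above can be written as a short sketch. The function names are hypothetical; the population scaling assumes the values in `export_2019.csv` are given in millions, as stated in Step 1.

```python
def coverage_percent(article_count, population):
    """Articles per inhabitant, as a percentage."""
    return 100 * article_count / population


def relative_quality_percent(high_quality_count, article_count):
    """Share of FA/GA articles among all articles, as a percentage."""
    return 100 * high_quality_count / article_count


coverage = coverage_percent(10, 10_000)      # first example above
quality = relative_quality_percent(2, 10)    # second example above

# The population file lists millions, so scale it before dividing
chad_population = 16.877 * 1_000_000
```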

### Results format

The results from this analysis are six `data tables`. Embed these tables in the Jupyter notebook. You do not need to graph or otherwise visualize the data for this assignment. The tables will show:

1. **Top 10 countries by coverage**<br>10 highest-ranked countries in terms of number of politician articles as a proportion of country population
1. **Bottom 10 countries by coverage**<br>10 lowest-ranked countries in terms of number of politician articles as a proportion of country population
1. **Top 10 countries by relative quality**<br>10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality
1. **Bottom 10 countries by relative quality**<br>10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality
1. **Regions by coverage**<br>Ranking of regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population
1. **Regions by relative quality**<br>Ranking of regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality

**❗Hint:** You will also find which country belongs to which region (e.g. `ASIA`) in `export_2019.csv`. You need to calculate the total population per region. For that you could use `groupby` and also check out `apply`.
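
A minimal sketch of the regional aggregation with `groupby`, using made-up numbers. It assumes one row per article with `country`, `region`, `population` (in millions) and `article_quality` columns, as in the schema above; the country names `X`, `Y`, `Z` are placeholders.

```python
import pandas as pd

# Made-up numbers for illustration; populations are in millions
df = pd.DataFrame({
    'country': ['X', 'X', 'Y', 'Z'],
    'region': ['AFRICA', 'AFRICA', 'AFRICA', 'ASIA'],
    'population': [10.0, 10.0, 30.0, 50.0],
    'article_quality': ['GA', 'Stub', 'C', 'FA'],
})

# Each country's population must be counted once, not once per article
region_population = (
    df.drop_duplicates('country').groupby('region')['population'].sum()
)
region_articles = df.groupby('region').size()

# Coverage: articles as a percentage of the regional population
region_coverage = 100 * region_articles / (region_population * 1_000_000)
region_coverage = region_coverage.sort_values(ascending=False)

# Relative quality: percentage of FA/GA articles per region
region_quality = (
    df['article_quality'].isin(['FA', 'GA']).groupby(df['region']).mean() * 100
).sort_values(ascending=False)
```

Dropping duplicate countries before summing matters because the merged table repeats the population value on every article row.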

#### 1. Top 10 countries by coverage

In [198]:
# Count the articles per country and keep each country's population (given in millions)
countries_coverage = (
    final_df.dropna(subset=['population'])
    .groupby('country')
    .agg(articles=('article_name', 'count'), population=('population', 'first'))
)

# Coverage: articles as a percentage of the country's population
countries_coverage['coverage_pct'] = (
    100 * countries_coverage['articles'] / (countries_coverage['population'] * 1_000_000)
)

# Top 10 countries by coverage
countries_coverage.sort_values('coverage_pct', ascending=False).head(10)

#### 2. Bottom 10 countries by coverage

#### 3. Top 10 countries by relative quality

In [None]:
# High-quality articles are those that ORES predicted as FA or GA
high_quality_mask = final_df['article_quality'].isin(['FA', 'GA'])


***

#### Credits

This exercise is slightly adapted from the course [Human Centered Data Science (Fall 2019)](https://wiki.communitydata.science/Human_Centered_Data_Science_(Fall_2019)) of the [University of Washington](https://www.washington.edu/datasciencemasters/) by [Jonathan T. Morgan](https://wiki.communitydata.science/User:Jtmorgan).

Like the original authors, we release the notebooks under the [Creative Commons Attribution license (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/).