Course Human-Centered Data Science ([HCDS](https://www.mi.fu-berlin.de/en/inf/groups/hcc/teaching/winter_term_2020_21/course_human_centered_data_science.html)) - Winter Term 2020/21 - [HCC](https://www.mi.fu-berlin.de/en/inf/groups/hcc/index.html) | [Freie Universität Berlin](https://www.fu-berlin.de/)

***

# A2 - Wikipedia, ORES, and Bias in Data
Please follow the reproducibility workflow as practiced during the last exercise.

## Step 1⃣ | Data acquisition

You will use two data sources: (1) Wikipedia articles of politicians and (2) world population data.

**Wikipedia articles -**
The Wikipedia articles can be found on [Figshare](https://figshare.com/articles/Untitled_Item/5513449). The dataset contains politicians by country from the English-language Wikipedia. Please read through the documentation for this repository, then download and unzip it to extract the data file, which is called `page_data.csv`.

**Population data -**
The population data is available in `CSV` format in the `_data` folder. The file is named `export_2019.csv`. This dataset is drawn from the [world population datasheet](https://www.prb.org/international/indicator/population/table/) published by the Population Reference Bureau (downloaded 2020-11-13 10:14 AM). I have edited the dataset to make it easier to use in this assignment. The population per country is given in millions!

In [304]:
import pandas as pd
import numpy as np
import os
import os.path, time
from os import makedirs, path
import altair as alt

## Step 2⃣ | Data processing and cleaning
The data in `page_data.csv` contain some rows that you will need to filter out. It contains some page names that start with the string `"Template:"`. These pages are not Wikipedia articles, and should not be included in your analysis. The data in `export_2019.csv` does not need any cleaning.
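
A minimal sketch of this filtering step, assuming the column names shown in the sample tables of this notebook. The two sample rows are copied from `page_data.csv`; the names `pages_sample` and `articles_df` are hypothetical.

```python
import pandas as pd

# Two sample rows in the shape of page_data.csv
pages_sample = pd.DataFrame({
    'page': ['Template:ZambiaProvincialMinisters', 'Bir I of Kanem'],
    'country': ['Zambia', 'Chad'],
    'rev_id': [235107991, 355319463],
})

# Flag page names that start with "Template:" and keep everything else
is_template = pages_sample['page'].str.startswith('Template:')
articles_df = pages_sample[~is_template].reset_index(drop=True)

print(articles_df)
```

Applied to the full `pages_df`, the same two lines would drop the template pages before any scores are requested.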

***

| | `page_data.csv` | | |
|-|------|---------|--------|
| | **page** | **country** | **rev_id** |
|0|	Template:ZambiaProvincialMinisters | Zambia | 235107991 |
|1|	Bir I of Kanem | Chad | 355319463 |

***

| | `export_2019.csv` | | |
|-|------|---------|--------|
| | **country** | **population** | **region** |
|0|	Algeria | 44.357 | AFRICA |
|1|	Egypt | 100.803 | AFRICA |

***

#### Loading the data

In [259]:
pages_df = pd.read_csv(
  'raw_data/pages/csv/page_data.csv',  # forward slashes work on Windows, macOS and Linux
  sep = ','
)

In [394]:
population_df = pd.read_csv(
  'raw_data/population/csv/export_2019.csv',  # forward slashes work on Windows, macOS and Linux
  sep = ';'
)

#### Exploring the data

In [393]:
pages_df.head() 

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


In [262]:
# Checking for Nan or null values
pages_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47197 entries, 0 to 47196
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   page     47197 non-null  object
 1   country  47197 non-null  object
 2   rev_id   47197 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 1.1+ MB


In [263]:
print(len(pd.unique(pages_df['country'])),' countries')

219  countries


In [398]:
population_df.iloc[0]['population']

44.357

In [265]:
# Checking for Nan or null values
population_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 210 entries, 0 to 209
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   country     210 non-null    object 
 1   population  210 non-null    float64
 2   region      210 non-null    object 
dtypes: float64(1), object(2)
memory usage: 5.0+ KB


In [413]:
population_df['population'].dtype

dtype('float64')

In [266]:
pd.unique(population_df['region'])

array(['AFRICA', 'NORTHERN AMERICA', 'LATIN AMERICA AND THE CARIBBEAN',
       'ASIA', 'EUROPE', 'OCEANIA'], dtype=object)

### Getting article quality predictions with ORES

Now you need to get the predicted quality scores for each article in the Wikipedia dataset. We're using a machine learning system called [**ORES**](https://www.mediawiki.org/wiki/ORES) ("Objective Revision Evaluation Service"). ORES estimates the quality of an article (at a particular point in time), and assigns a series of probabilities that the article is in one of the six quality categories. The options are, from best to worst:

| ID | Quality Category |  Explanation |
|----|------------------|----------|
| 1 | FA    | Featured article |
| 2 | GA    | Good article |
| 3 | B     | B-class article |
| 4 | C     | C-class article |
| 5 | Start | Start-class article |
| 6 | Stub  | Stub-class article |

For context, these quality classes are a sub-set of quality assessment categories developed by Wikipedia editors. If you're curious, you can [read more](https://en.wikipedia.org/wiki/Wikipedia:Content_assessment#Grades) about what these assessment classes mean on English Wikipedia. For this assignment, you only need to know that these categories exist, and that ORES will assign one of these six categories to any `rev_id`. You need to extract all `rev_id`s in the `page_data.csv` file and use the ORES API to get the predicted quality score for that specific article revision.

### ORES REST API endpoint

The [ORES REST API](https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context_revid_model) is configured fairly similarly to the pageviews API we used for the last assignment. It expects the following parameters:

* **project** --> `enwiki`
* **revid** --> e.g. `235107991` or multiple ids e.g.: `235107991|355319463` (batch)
* **model** --> `wp10` - The name of a model to use when scoring.

**❗Note on batch processing:** Please read the documentation about [API usage](https://www.mediawiki.org/wiki/ORES#API_usage) if you want to query a large number of revisions (batches). 

You will notice that ORES returns a prediction value that contains the name of one category (e.g. `Start`), as well as probability values for each of the six quality categories. For this assignment, you only need to capture and use the value for prediction.

**❗Note:** It's possible that you will be unable to get a score for a particular article. If that happens, make sure to maintain a log of articles for which you were not able to retrieve an ORES score. This log should be saved as a separate file named `ORES_no_scores.csv` and should include the `page`, `country`, and `rev_id` (just as in `page_data.csv`).
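
The batch format and the missing-score case described above can be sketched without touching the network. The helpers `chunked` and `extract_prediction` are hypothetical names, and `mock_scores` is a hand-written stand-in that only mimics the nesting this notebook reads from the response (`rev_id` → `wp10` → `score` → `prediction`); it is not real ORES output.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def extract_prediction(scores, rev_id, model='wp10'):
    """Return the predicted class, or None if no prediction is present."""
    try:
        return scores[str(rev_id)][model]['score']['prediction']
    except KeyError:
        return None


revision_ids = [235107991, 355319463, 391862046]

# One '|'-separated string per batch, as expected by the revids parameter
batches = ['|'.join(str(x) for x in chunk) for chunk in chunked(revision_ids, 2)]
print(batches)

# Hand-written mock: one scored revision, one error entry, one id missing entirely
mock_scores = {
    '355319463': {'wp10': {'score': {'prediction': 'Stub'}}},
    '235107991': {'wp10': {'error': {'message': 'RevisionNotFound: ...'}}},
}

predictions = {rid: extract_prediction(mock_scores, rid) for rid in revision_ids}
print(predictions)
```

Revisions that map to `None` are the ones that would go into `ORES_no_scores.csv`.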

You can use the following **sample code for API calls**:

Sending one request for each `rev_id` might take some time. If you want to send batches, you can use `'|'.join(str(x) for x in revision_ids)` to join your ids. Please make sure to handle the `KeyError` exception (see [exception handling](https://www.w3schools.com/python/python_try_except.asp)) when extracting the `prediction` from the `JSON` response.

In [267]:
import requests
import json

# Customize these with your own information
headers = {
    'User-Agent': 'https://github.com/Alioio',
    'From': 'alib10@fu-berlin.de'
}

def get_ores_data(rev_id, headers):
    
    # Define the endpoint
    # https://ores.wikimedia.org/scores/enwiki/?models=wp10&revids=807420979|807422778
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'

    params = {'project' : 'enwiki',
              'model'   : 'wp10',
              'revids'  : rev_id
              }

    api_call = requests.get(endpoint.format(**params), headers=headers)
    response = api_call.json()
    data = json.dumps(response)

    return data

In [268]:
#pages_df = pages_df[:210]

In [269]:
#rev_ids = pages_df['rev_id'][:200]
rev_ids = pages_df['rev_id']

In [270]:
def make_df(keys, subset):
    # Collect one row per revision id (DataFrame.append was removed in pandas 2.0)
    rows = []

    for key in keys:
        try:
            quality = subset[key]['wp10']['score']['prediction']
        except KeyError:
            # No prediction available: fall back to the error message of the API
            try:
                quality = subset[key]['wp10']['error']['message']
            except KeyError:
                quality = 'unknown error'
        rows.append({'rev_id': key, 'quality': quality})

    return pd.DataFrame(rows, columns=['rev_id', 'quality'])

In [271]:
def run_batch(id1, id2, batchids):
    bucket = batchids[id1: id2]
    bucket = ('|'.join(str(x) for x in bucket))
    output = get_ores_data(bucket, headers)
    output = json.loads(output)
    subset = output['enwiki']['scores']
    keys   = subset.keys()
    df = make_df(keys, subset)  
    return df

In [272]:
if not os.path.isfile('pre_processed_data/csv/ORES_scores.csv'):

    batch_size = 50
    batches = []

    # Slicing past the end is safe, so the last (shorter) batch needs no special case
    for bucket in range(0, len(rev_ids), batch_size):
        print('new batch: ',bucket,': ',bucket+batch_size)
        batches.append(run_batch(bucket, bucket+batch_size, rev_ids))

    final_df = pd.concat(batches, axis=0, ignore_index=True)

    pages_df['rev_id'] = pages_df['rev_id'].astype(str)
    final_df['rev_id'] = final_df['rev_id'].astype(str)
    merged_df = pd.merge(pages_df, final_df, on='rev_id', how='outer')
else:
    # Reuse the cached file; it holds the merged frame written by store_preprocessed
    merged_df = pd.read_csv('pre_processed_data/csv/ORES_scores.csv', index_col=0, dtype={'rev_id': str})
    final_df = merged_df



In [276]:
print('Number of revisions without score: ',len(final_df[final_df['quality'].str.contains('RevisionNotFound') | final_df['quality'].str.contains('unknown error')]))

Number of revisions without score:  213


In [277]:
def store_preprocessed(df, name):
    
    if not os.path.isdir('pre_processed_data/csv'):
        makedirs('pre_processed_data/csv')
        
    df.to_csv('pre_processed_data/csv/'+name+'.csv') 

In [278]:
store_preprocessed(merged_df, 'ORES_scores')

In [279]:
no_scores_df = merged_df[merged_df['quality'].str.contains('RevisionNotFound:')]
no_scores_df

Unnamed: 0,page,country,rev_id,quality
0,Template:ZambiaProvincialMinisters,Zambia,235107991,RevisionNotFound: Could not find revision ({re...
126,List of politicians in Poland,Poland,516633096,RevisionNotFound: Could not find revision ({re...
222,Tingtingru,Vanuatu,550682925,RevisionNotFound: Could not find revision ({re...
330,Daud Arsala,Afghanistan,627547024,RevisionNotFound: Could not find revision ({re...
539,Bharat Saud,Nepal,671484594,RevisionNotFound: Could not find revision ({re...
...,...,...,...,...
46782,John Rose (Trotskyist),United Kingdom,807336308,RevisionNotFound: Could not find revision ({re...
46862,Jalal Movaghar,Iran,807367030,RevisionNotFound: Could not find revision ({re...
46863,Mohsen Movaghar,Iran,807367166,RevisionNotFound: Could not find revision ({re...
47182,King Gutierrez,Philippines,807479587,RevisionNotFound: Could not find revision ({re...


In [280]:
store_preprocessed(no_scores_df, 'ORES_no_scores')

### Combining the datasets

Now you need to combine both datasets: (1) the Wikipedia articles with their ORES quality scores and (2) the population data. Both have a column named `country`. After merging the data, you'll invariably run into entries which cannot be merged: either the population dataset does not have an entry for the equivalent Wikipedia country, or vice versa.

Please remove any rows that do not have matching data, and output them to a `CSV` file called `countries-no_match.csv`. Consolidate the remaining data into a single `CSV` file called `politicians_by_country.csv`.

The schema for that file should look like the following table:


| article_name | country | region | revision_id | article_quality | population |
|--------------|---------|--------|-------------|-----------------|------------|
| Bir I of Kanem | Chad  | AFRICA | 807422778 | Stub | 16877000 |
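
One way to split matching from non-matching rows and arrive at this schema is an outer merge with pandas' `indicator=True` option. The sketch below is only an illustration on hand-made frames: the Chad population is the schema value above expressed in millions, the Algeria value comes from the `export_2019.csv` sample, and the names `matched`, `no_match` and `politicians` are hypothetical.

```python
import pandas as pd

articles = pd.DataFrame({
    'page': ['Bir I of Kanem', 'Céleo Arias'],
    'country': ['Chad', 'Hondura'],
    'rev_id': [355319463, 704789339],
    'quality': ['Stub', 'Stub'],
})
population = pd.DataFrame({
    'country': ['Chad', 'Algeria'],
    'population': [16.877, 44.357],  # in millions
    'region': ['AFRICA', 'AFRICA'],
})

# The _merge column records whether a row was found in both frames or only one
merged = pd.merge(articles, population, on='country', how='outer', indicator=True)
matched = merged[merged['_merge'] == 'both'].drop(columns='_merge')
no_match = merged[merged['_merge'] != 'both'].drop(columns='_merge')

# Rename to the target schema and convert millions to absolute numbers
politicians = (
    matched
    .rename(columns={'page': 'article_name',
                     'rev_id': 'revision_id',
                     'quality': 'article_quality'})
    .assign(population=lambda d: (d['population'] * 1_000_000).round().astype(int),
            revision_id=lambda d: d['revision_id'].astype(int))
)[['article_name', 'country', 'region', 'revision_id', 'article_quality', 'population']]

print(politicians)
print(no_match[['country']])
```

Here `no_match` contains rows from both sides: Wikipedia countries without population data and population entries without articles.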

#### Some exploration of matching and not matching countries 

In [281]:
contries_pop = pd.unique(population_df['country'])

In [460]:
print('# of countries in the population dataset: ',len(contries_pop))

# of countries in the population dataset:  210


In [284]:
countries_wiki = pd.unique(merged_df['country'])

In [461]:
print('# of countries in the article quality dataset: ',len(countries_wiki))

# of countries in the article quality dataset:  219


In [462]:
print('Countries present in the population dataset that have no matching entry in the article quality df:')
contries_pop[~np.isin(contries_pop, countries_wiki)]

Countries present in the population dataset that have no matching entry in the article quality df:


array(['Western Sahara', "Cote d'Ivoire", 'Mayotte', 'Reunion',
       'Congo, Dem. Rep.', 'eSwatini', 'El Salvador', 'Honduras',
       'Curacao', 'Puerto Rico', 'St. Kitts-Nevis', 'Saint Lucia',
       'St. Vincent and the Grenadines', 'Georgia', 'Oman', 'Brunei',
       'Timor-Leste', 'China, Hong Kong SAR', 'China, Macao SAR',
       'Channel Islands', 'Czechia', 'North Macedonia',
       'French Polynesia', 'Guam', 'New Caledonia', 'Palau', 'Samoa'],
      dtype=object)

In [463]:
print('Countries present in the article quality df that have no matching entry in the population df:')
countries_wiki[~np.isin(countries_wiki, contries_pop)]

Countries present in the article quality df that have no matching entry in the population df:


array(['Hondura', 'Czech Republic', 'Salvadoran', 'Saint Kitts and Nevis',
       'Palauan', 'Ivorian', 'Saint Vincent and the Grenadines',
       'Rhodesian', 'Omani', 'Congo, Dem. Rep. of', 'Niuean',
       'East Timorese', 'Faroese', 'Cape Colony', 'South Korean',
       'Samoan', 'Montserratian', 'Pitcairn Islands', 'Macedonia',
       'Abkhazia', 'Carniolan', 'Saint Lucian', 'South African Republic',
       'Incan', 'Chechen', 'Jersey', 'Guernsey', 'South Ossetian',
       'Cook Island', 'Tokelauan', 'Swaziland', 'Dagestani',
       'Greenlandic', 'Ossetian', 'Somaliland', 'Rojava'], dtype=object)

#### Spot check: do some countries fail to match only because of different spelling? Found two examples where the matching counterpart exists under a different spelling.

In [465]:
merged_df[merged_df['country'].str.contains('Hondu')]

Unnamed: 0,page,country,rev_id,quality
22,Template:Honduras-politician-stub,Hondura,394587547,Stub
46,Template:Honduras-mayor-stub,Hondura,443469862,Stub
1155,Céleo Arias,Hondura,704789339,Stub
1239,Juan Francisco de Molina,Hondura,705346284,Stub
1240,Felipe Neri Medina,Hondura,705346304,Stub
...,...,...,...,...
45822,Selvin Laínez,Hondura,806796619,Stub
45988,Ana Julia García,Hondura,806875293,Stub
46089,Francisco Ferrera,Hondura,806948506,Stub
47076,Juan Ángel Arias Boquín,Hondura,807445333,Start


In [289]:
merged_df[merged_df['country'].str.contains('Congo, Dem. Rep.', regex=False)]

Unnamed: 0,page,country,rev_id,quality
290,Mavua Mudima,"Congo, Dem. Rep. of",592289232,Stub
441,List of provincial governors of the Democratic...,"Congo, Dem. Rep. of",663088604,Stub
1296,Eugène Diomi Ndongala Nzomambu,"Congo, Dem. Rep. of",705390654,Stub
1344,Théophile Mbemba Fundu,"Congo, Dem. Rep. of",705707214,Stub
1346,Ntumba Luaba,"Congo, Dem. Rep. of",705723187,Stub
...,...,...,...,...
46088,Jean-Chrysostome Werengemere,"Congo, Dem. Rep. of",806947596,C
46501,Jean-Pierre Bemba,"Congo, Dem. Rep. of",807193274,GA
46603,Jason Sendwe,"Congo, Dem. Rep. of",807243537,C
46639,Gérard Kamanda wa Kamanda,"Congo, Dem. Rep. of",807258156,Stub
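
The two spelling differences found above could be reconciled before merging. This is only a sketch: the notebook itself merges the names as they are, and `country_aliases` is a hypothetical mapping that lists just the two pairs confirmed by the spot check.

```python
import pandas as pd

# Hypothetical mapping: Wikipedia spelling -> population dataset spelling
country_aliases = {
    'Hondura': 'Honduras',
    'Congo, Dem. Rep. of': 'Congo, Dem. Rep.',
}

wiki_countries = pd.Series(['Hondura', 'Congo, Dem. Rep. of', 'Chad'])

# replace() with a dict swaps exact matches and leaves other values untouched
normalised = wiki_countries.replace(country_aliases)
print(normalised.tolist())
```

Applying such a mapping to `merged_df['country']` would move the affected articles from the no-match file into the merged data, so it changes the results and should be documented if used.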


#### Merging the dataframes

In [290]:
merged_df_pop = pd.merge(merged_df, population_df, on='country', how='inner')

In [291]:
merged_df_outer = pd.merge(merged_df, population_df, on='country', how='left')

In [292]:
no_match_df = merged_df_outer[merged_df_outer['population'].isna()]

In [293]:
#small plausibility check. 
(len(merged_df) - len(merged_df_pop)) - len(no_match_df)

0

In [294]:
store_preprocessed(merged_df_pop, 'politicians_by_country')

In [295]:
store_preprocessed(no_match_df, 'countries-no_match')

In [296]:
merged_df_pop[(merged_df_pop['region'] == 'EUROPE') & ((merged_df_pop['quality'] == 'FA') |  (merged_df_pop['quality'] == 'GA'))].sort_values(by=['country', 'quality'])

Unnamed: 0,page,country,rev_id,quality,population,region
5167,Fan S. Noli,Albania,802740752,GA,2.838,EUROPE
5168,Fatos Nano,Albania,802841606,GA,2.838,EUROPE
5207,Edi Rama,Albania,807436239,GA,2.838,EUROPE
15196,Werner Faymann,Austria,805569890,GA,8.914,EUROPE
15209,Christian Kern,Austria,806875042,GA,8.914,EUROPE
...,...,...,...,...,...,...
41016,Gustav Wilhelm Wolff,United Kingdom,807173782,GA,67.160,EUROPE
41019,Sally Bercow,United Kingdom,807189268,GA,67.160,EUROPE
41030,Chris Huhne,United Kingdom,807292420,GA,67.160,EUROPE
41039,John Bercow,United Kingdom,807337238,GA,67.160,EUROPE


## Step 3⃣ | Analysis

Your analysis will consist of calculating the proportion (as a percentage) of articles-per-population (we can also call it `coverage`) and high-quality articles (we can also call it `relative-quality`) for **each country** and for **each region**. By `"high quality"` article we mean an article that ORES predicted as `FA` (featured article) or `GA` (good article).

**Examples:**

* if a country has a population of `10,000` people, and you found `10` articles about politicians from that country, then the percentage of `articles-per-population` would be `0.1%`.
* if a country has `10` articles about politicians, and `2` of them are `FA` or `GA` class articles, then the percentage of `high-quality-articles` would be `20%`.
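
The two examples can be checked with a few lines of arithmetic. The function names below are hypothetical; note that the population in `export_2019.csv` is given in millions and would first have to be multiplied by `1_000_000`.

```python
def coverage_percent(article_count, population):
    """Articles as a percentage of the (absolute) population."""
    return article_count / population * 100


def high_quality_percent(high_quality_count, article_count):
    """FA/GA articles as a percentage of all articles of a country."""
    if article_count == 0:
        return 0.0
    return high_quality_count / article_count * 100


# First example: 10 articles for 10,000 people
example_coverage = coverage_percent(10, 10_000)
# Second example: 2 of 10 articles are FA or GA
example_quality = high_quality_percent(2, 10)

print(example_coverage, example_quality)
```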

### Results format

The results from this analysis are six `data tables`. Embed these tables in the Jupyter notebook. You do not need to graph or otherwise visualize the data for this assignment. The tables will show:

1. **Top 10 countries by coverage**<br>10 highest-ranked countries in terms of number of politician articles as a proportion of country population
1. **Bottom 10 countries by coverage**<br>10 lowest-ranked countries in terms of number of politician articles as a proportion of country population
1. **Top 10 countries by relative quality**<br>10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality
1. **Bottom 10 countries by relative quality**<br>10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality
1. **Regions by coverage**<br>Ranking of regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population
1. **Regions by relative quality**<br>Ranking of regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality
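
Ranked tables like these can be produced directly with pandas' `nlargest` and `nsmallest`. The sketch below uses toy values, not results from the real data, and takes the top and bottom 2 instead of 10 to stay short; `coverage_df` is a hypothetical name.

```python
import pandas as pd

# Toy values for illustration only
coverage_df = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'coverage': [0.5, 0.1, 0.3, 0.2],
})

# Highest values first
top = coverage_df.nlargest(2, 'coverage').reset_index(drop=True)
# Lowest values first
bottom = coverage_df.nsmallest(2, 'coverage').reset_index(drop=True)

print(top)
print(bottom)
```

In a notebook, leaving `top` or `bottom` as the last expression of a cell embeds the table in the output.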

**❗Hint:** `export_2019.csv` also tells you which region (e.g. `ASIA`) each country belongs to. You need to calculate the total population per region. For that you could use `groupby` and also check out `apply`.
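
A possible shape for the regional aggregation, sketched on toy data under the assumption that populations are given in millions as in `export_2019.csv`. All frame and column names introduced here (`per_region`, `coverage_pct`, `relative_quality_pct`, ...) are hypothetical. The regional population is summed from the population table rather than from the article rows, so each country is counted once.

```python
import pandas as pd

# Toy data; populations in millions
population = pd.DataFrame({
    'country': ['A', 'B', 'C'],
    'population': [1.0, 3.0, 2.0],
    'region': ['ASIA', 'ASIA', 'EUROPE'],
})
articles = pd.DataFrame({
    'country': ['A', 'A', 'B', 'C', 'C', 'C', 'C'],
    'quality': ['GA', 'Stub', 'Stub', 'FA', 'GA', 'C', 'Stub'],
})

# Attach the region to every article and flag the high-quality ones
articles = articles.merge(population[['country', 'region']], on='country')
articles['high_quality'] = articles['quality'].isin(['FA', 'GA'])

per_region = articles.groupby('region').agg(
    article_count=('quality', 'size'),
    high_quality_count=('high_quality', 'sum'),
)

# Total population per region, converted from millions to absolute numbers
per_region['population'] = population.groupby('region')['population'].sum() * 1_000_000

per_region['coverage_pct'] = per_region['article_count'] / per_region['population'] * 100
per_region['relative_quality_pct'] = (
    per_region['high_quality_count'] / per_region['article_count'] * 100
)

per_region = per_region.sort_values('coverage_pct', ascending=False)
print(per_region)
```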

1. **Top 10 countries by coverage**<br>10 highest-ranked countries in terms of number of politician articles as a proportion of country population

In [436]:
countries = pd.unique(merged_df_pop['country'])
# art_prop as computed here: (number of articles / population) / 100
# NOTE: population is given in millions, so art_prop is not yet a percentage of the
# absolute population; it differs from it only by a constant factor, so the ranking is unaffected.
rows = []

for country in countries:
    country_rows = merged_df_pop[merged_df_pop['country'] == country]
    rows.append({'country': country,
                 'art_prop': (len(country_rows) / country_rows['population'].iloc[0]) / 100})

# Build the frame once (DataFrame.append was removed in pandas 2.0)
pop_artc_propotion_df = pd.DataFrame(rows, columns=['country', 'art_prop'])


In [400]:
pop_artc_propotion_df.iloc[pop_artc_propotion_df['art_prop'].argmin()]

country         Guyana
art_prop    2.5413e-06
Name: 139, dtype: object

In [414]:
pop_artc_propotion_df.iloc[pop_artc_propotion_df['art_prop'].argmax()]

country     Tuvalu
art_prop      0.55
Name: 112, dtype: object

In [492]:
alt.Chart(pop_artc_propotion_df.sort_values(by=['art_prop'], ascending=False)[:10] ).mark_bar().encode(
    y = alt.Y('country:N', sort='-x'),
    x = alt.X('art_prop:Q')
)

2. **Bottom 10 countries by coverage**<br>10 lowest-ranked countries in terms of number of politician articles as a proportion of country population

In [490]:
alt.Chart(pop_artc_propotion_df.sort_values(by=['art_prop'], ascending=True)[:10] ).mark_bar().encode(
    y = alt.Y('country:N', sort='x'),
    x = alt.X('art_prop:Q')
)

3. **Top 10 countries by relative quality**<br>10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality

In [582]:
# high_ratio: (number of FA/GA articles) / (number of all articles) per country
rows = []

for country in countries:

    country_rows  = merged_df_pop[merged_df_pop['country'] == country]
    all_articles  = len(country_rows)
    good_articles = int(country_rows['quality'].isin(['FA', 'GA']).sum())

    # all_articles is never 0 because every country is taken from merged_df_pop
    rows.append({'country': country,
                 'high_ratio': good_articles / all_articles,
                 'all articles': all_articles,
                 'high_quality_art': good_articles})

# Build the frame once (DataFrame.append was removed in pandas 2.0)
artc_quality_rank_df = pd.DataFrame(rows, columns = ['country', 'high_ratio', 'all articles', 'high_quality_art'])

In [583]:
artc_quality_rank_df 

Unnamed: 0,country,high_ratio,all articles,high_quality_art
0,Zambia,0.000000,26.0,0.0
1,Chad,0.010000,100.0,1.0
2,Zimbabwe,0.011976,167.0,2.0
3,Uganda,0.005319,188.0,1.0
4,Namibia,0.000000,165.0,0.0
...,...,...,...,...
178,Dominica,0.083333,12.0,1.0
179,Bahamas,0.000000,20.0,0.0
180,Barbados,0.000000,14.0,0.0
181,Belize,0.000000,16.0,0.0


In [584]:
artc_quality_rank_df = artc_quality_rank_df[artc_quality_rank_df['high_ratio'] > 0]

In [585]:
country = 'Switzerland'
len(merged_df_pop[(merged_df_pop['country'] == country) & ((merged_df_pop['quality'] == 'FA') | (merged_df_pop['quality'] == 'GA'))])

1

In [586]:
len(merged_df_pop[(merged_df_pop['country'] == country )])

407

In [587]:
artc_quality_rank_df.iloc[artc_quality_rank_df['high_ratio'].argmax()]

country             Korea, North
high_ratio              0.205128
all articles                  39
high_quality_art               8
Name: 117, dtype: object

In [588]:
artc_quality_rank_df.iloc[artc_quality_rank_df['high_ratio'].argmin()]

country                Belgium
high_ratio          0.00191205
all articles               523
high_quality_art             1
Name: 64, dtype: object

In [589]:
top_10 = alt.Chart(artc_quality_rank_df.sort_values(by=['high_ratio'], ascending=False)[:10] ).mark_bar().encode(
    y = alt.Y('country:N', sort='-x'),
    x = alt.X('high_ratio:Q'),
    tooltip = ['all articles:Q', 'high_quality_art:Q']
)

4. **Bottom 10 countries by relative quality**<br>10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality

In [590]:
bottom_10 = alt.Chart(artc_quality_rank_df.sort_values(by=['high_ratio'], ascending=True)[:10] ).mark_bar().encode(
    y = alt.Y('country:N', sort='x'),
    x = alt.X('high_ratio:Q'),
    tooltip = ['all articles:Q', 'high_quality_art:Q']
)

In [591]:
top_10 & bottom_10

5. **Regions by coverage**<br>Ranking of regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population

In [419]:
pd.unique(merged_df_pop[merged_df_pop['population'] > 300]['country'])

array(['Fiji', 'Solomon Islands', 'India', 'Iceland', 'Luxembourg',
       'United States', 'China', 'Djibouti', 'Vanuatu', 'Malta', 'Guyana',
       'Maldives', 'Montenegro', 'Suriname', 'Martinique', 'Guadeloupe',
       'Cape Verde', 'Bahamas', 'Belize'], dtype=object)

***

#### Credits

This exercise is slightly adapted from the course [Human Centered Data Science (Fall 2019)](https://wiki.communitydata.science/Human_Centered_Data_Science_(Fall_2019)) of the [University of Washington](https://www.washington.edu/datasciencemasters/) by [Jonathan T. Morgan](https://wiki.communitydata.science/User:Jtmorgan).

Like the original authors, we release the notebooks under the [Creative Commons Attribution license (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/).