# Considering Bias in Data
This example illustrates how to access page info data using the [MediaWiki REST API for the EN Wikipedia](https://www.mediawiki.org/wiki/API:Main_page). This example shows how to request summary 'page info' for a single article page. The API documentation, [API:Info](https://www.mediawiki.org/wiki/API:Info), covers additional details that may be helpful when trying to use or understand this example.

## License
This code example was developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.0 - May 13, 2022

James Yang, 10/12/22, MSDS



In [350]:
# 
# These are standard python modules
import json, time, urllib.parse
#
# The 'requests' module is not a standard Python module. You will need to install this with pip/pip3 if you do not already have it
import requests

import warnings
warnings.filterwarnings('ignore')

import numpy as np

The example relies on some constants that help make the code a bit more readable.

In [236]:
#########
#
#    CONSTANTS
#

# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<yangj98@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2022',
}

# This is just a list of English Wikipedia article titles that we can use for example requests
ARTICLE_TITLES = [ 'Bison', 'Northern flicker', 'Red squirrel', 'Chinook salmon', 'Horseshoe bat' ]

# This is a string of additional page properties that can be returned see the Info documentation for
# what can be included. If you don't want any this can simply be the empty string
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
#PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making this
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}


The API request will be made using one procedure. The idea is to make this reusable. The procedure is parameterized, but relies on the constants above for the important parameters. The underlying assumption is that this will be used to request data for a set of article pages. Therefore the parameter most likely to change is the article_title.

In [237]:
#--------Get Article Titles to run --------#

import pandas as pd
ARTICLE_TITLES = pd.read_csv("politicians_by_country_SEPT.2022.csv - politicians_international_SEPT.2022.csv.csv")
ARTICLE_TITLES


Unnamed: 0,name,url,country
0,Shahjahan Noori,https://en.wikipedia.org/wiki/Shahjahan_Noori,Afghanistan
1,Abdul Ghafar Lakanwal,https://en.wikipedia.org/wiki/Abdul_Ghafar_Lak...,Afghanistan
2,Majah Ha Adrif,https://en.wikipedia.org/wiki/Majah_Ha_Adrif,Afghanistan
3,Haroon al-Afghani,https://en.wikipedia.org/wiki/Haroon_al-Afghani,Afghanistan
4,Tayyab Agha,https://en.wikipedia.org/wiki/Tayyab_Agha,Afghanistan
...,...,...,...
7579,Rekayi Tangwena,https://en.wikipedia.org/wiki/Rekayi_Tangwena,Zimbabwe
7580,Josiah Tongogara,https://en.wikipedia.org/wiki/Josiah_Tongogara,Zimbabwe
7581,Langton Towungana,https://en.wikipedia.org/wiki/Langton_Towungana,Zimbabwe
7582,Herbert Ushewokunze,https://en.wikipedia.org/wiki/Herbert_Ushewokunze,Zimbabwe


In [238]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_pageinfo_per_article(article_title = None, 
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT, 
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):
    # Make sure we have an article title
    if not article_title: return None
    
    request_template['titles'] = article_title
        
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response


In [251]:
#Create a dataframe and json off the pull

def create_dataframe(filename, ARTICLE_TITLES):
    output_df = pd.DataFrame()
    for i in range(1, len(ARTICLE_TITLES) - 1):
        print("Getting pageview data for: ",ARTICLE_TITLES[i])
        print("Percentage done: ", i/len(ARTICLE_TITLES))
        info = request_pageinfo_per_article(ARTICLE_TITLES[i])
        views_df = pd.DataFrame(info)
        output_df = pd.concat([output_df, views_df])

    result = output_df.to_json(orient="records")
    parsed = json.loads(result)
    print(output_df)

    f = open(filename, 'w')
    f.write(json.dumps(parsed, indent=4))
    f.close()
    
    print("Wrote to file!")
    return output_df

In [None]:
#Get the dataframe with the politicians and country and page id

df = create_dataframe("politicians_by_country.json", ARTICLE_TITLES['name'])

In [253]:
#Change dataframe to have the names
ARTICLE_TITLES = ARTICLE_TITLES['name']

In [254]:
#Create a dataframe to output the revisioins from the articles.

import json
import pandas as pd

def create_article_revisions(file, column):
    output_df = pd.DataFrame()
    f = open(file)
    df = json.load(f)
#     print(df[0]['query'])
    for i in range(0, len(df)-1):
        temp_dict = df[i][column]
        pull_df = pd.DataFrame.from_dict(temp_dict, orient = 'index')
#         pull_df = pull_df.T
        output_df = pd.concat([output_df, pull_df])
    return output_df


In [255]:
df_with_articles = create_article_revisions("politicians_by_country.json", 'query')
df_with_articles

Unnamed: 0,pageid,ns,title,contentmodel,pagelanguage,pagelanguagehtmlcode,pagelanguagedir,touched,lastrevid,length,talkid,fullurl,editurl,canonicalurl,watchers,redirect,new,missing
42972519,42972519.0,0,Abdul Ghafar Lakanwal,wikitext,en,en,ltr,2022-09-26T05:36:04Z,9.435623e+08,4165.0,42972696.0,https://en.wikipedia.org/wiki/Abdul_Ghafar_Lak...,https://en.wikipedia.org/w/index.php?title=Abd...,https://en.wikipedia.org/wiki/Abdul_Ghafar_Lak...,,,,
10483286,10483286.0,0,Majah Ha Adrif,wikitext,en,en,ltr,2022-10-10T00:16:18Z,8.524041e+08,3162.0,13330265.0,https://en.wikipedia.org/wiki/Majah_Ha_Adrif,https://en.wikipedia.org/w/index.php?title=Maj...,https://en.wikipedia.org/wiki/Majah_Ha_Adrif,,,,
11966231,11966231.0,0,Haroon al-Afghani,wikitext,en,en,ltr,2022-10-10T02:53:39Z,1.095102e+09,16718.0,15250816.0,https://en.wikipedia.org/wiki/Haroon_al-Afghani,https://en.wikipedia.org/w/index.php?title=Har...,https://en.wikipedia.org/wiki/Haroon_al-Afghani,,,,
46841383,46841383.0,0,Tayyab Agha,wikitext,en,en,ltr,2022-10-10T23:30:56Z,1.104998e+09,6313.0,46843786.0,https://en.wikipedia.org/wiki/Tayyab_Agha,https://en.wikipedia.org/w/index.php?title=Tay...,https://en.wikipedia.org/wiki/Tayyab_Agha,,,,
68624823,68624823.0,0,Ahmadullah Wasiq,wikitext,en,en,ltr,2022-10-11T01:23:12Z,1.109362e+09,5267.0,68639469.0,https://en.wikipedia.org/wiki/Ahmadullah_Wasiq,https://en.wikipedia.org/w/index.php?title=Ahm...,https://en.wikipedia.org/wiki/Ahmadullah_Wasiq,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8659677,8659677.0,0,Leopold Takawira,wikitext,en,en,ltr,2022-10-10T05:59:15Z,1.112149e+09,3505.0,12596029.0,https://en.wikipedia.org/wiki/Leopold_Takawira,https://en.wikipedia.org/w/index.php?title=Leo...,https://en.wikipedia.org/wiki/Leopold_Takawira,,,,
10031005,10031005.0,0,Gift Tandare,wikitext,en,en,ltr,2022-10-10T06:29:20Z,1.114642e+09,767.0,10495324.0,https://en.wikipedia.org/wiki/Gift_Tandare,https://en.wikipedia.org/w/index.php?title=Gif...,https://en.wikipedia.org/wiki/Gift_Tandare,,,,
23231242,23231242.0,0,Rekayi Tangwena,wikitext,en,en,ltr,2022-09-26T04:15:54Z,1.073819e+09,1861.0,23734031.0,https://en.wikipedia.org/wiki/Rekayi_Tangwena,https://en.wikipedia.org/w/index.php?title=Rek...,https://en.wikipedia.org/wiki/Rekayi_Tangwena,,,,
633594,633594.0,0,Josiah Tongogara,wikitext,en,en,ltr,2022-09-25T23:01:07Z,1.106932e+09,12797.0,3349627.0,https://en.wikipedia.org/wiki/Josiah_Tongogara,https://en.wikipedia.org/w/index.php?title=Jos...,https://en.wikipedia.org/wiki/Josiah_Tongogara,,,,


In [256]:
df_info = df_with_articles[['title', 'lastrevid']]
df_info

Unnamed: 0,title,lastrevid
42972519,Abdul Ghafar Lakanwal,9.435623e+08
10483286,Majah Ha Adrif,8.524041e+08
11966231,Haroon al-Afghani,1.095102e+09
46841383,Tayyab Agha,1.104998e+09
68624823,Ahmadullah Wasiq,1.109362e+09
...,...,...
8659677,Leopold Takawira,1.112149e+09
10031005,Gift Tandare,1.114642e+09
23231242,Rekayi Tangwena,1.073819e+09
633594,Josiah Tongogara,1.106932e+09


In [257]:
#See the Nulls and drop them
df_info.to_csv('df_with_articles.csv')
df_info[df_info.isnull().any(axis=1)].to_csv('wp_countries-no_match.txt')
df_info

Unnamed: 0,title,lastrevid
42972519,Abdul Ghafar Lakanwal,9.435623e+08
10483286,Majah Ha Adrif,8.524041e+08
11966231,Haroon al-Afghani,1.095102e+09
46841383,Tayyab Agha,1.104998e+09
68624823,Ahmadullah Wasiq,1.109362e+09
...,...,...
8659677,Leopold Takawira,1.112149e+09
10031005,Gift Tandare,1.114642e+09
23231242,Rekayi Tangwena,1.073819e+09
633594,Josiah Tongogara,1.106932e+09


In [258]:
df_info.dropna(inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_info.dropna(inplace=True)


In [259]:
df_info

Unnamed: 0,title,lastrevid
42972519,Abdul Ghafar Lakanwal,9.435623e+08
10483286,Majah Ha Adrif,8.524041e+08
11966231,Haroon al-Afghani,1.095102e+09
46841383,Tayyab Agha,1.104998e+09
68624823,Ahmadullah Wasiq,1.109362e+09
...,...,...
8659677,Leopold Takawira,1.112149e+09
10031005,Gift Tandare,1.114642e+09
23231242,Rekayi Tangwena,1.073819e+09
633594,Josiah Tongogara,1.106932e+09


In [260]:
#########
#
#    CONSTANTS
#

# The current ORES API endpoint
API_ORES_SCORE_ENDPOINT = "https://ores.wikimedia.org/v3"
# A template for mapping to the URL
API_ORES_SCORE_PARAMS = "/scores/{context}/{revid}/{model}"

# Use some delays so that we do not hammer the API with our requests
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<yangj98@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2022'
}

# A dictionary of English Wikipedia article titles (keys) and sample revision IDs that can be used for this ORES scoring example
# ARTICLE_REVISIONS = { 'Bison':1085687913 , 'Northern flicker':1086582504 , 'Red squirrel':1083787665 , 'Chinook salmon':1085406228 , 'Horseshoe bat':1060601936 }

# This template lists the basic parameters for making an ORES request
ORES_PARAMS_TEMPLATE = {
    "context": "enwiki",        # which WMF project for the specified revid
    "revid" : "",               # the revision to be scored - this will probably change each call
    "model": "articlequality"   # the AI/ML scoring model to apply to the reviewion
}
#
# The current ML models for English wikipedia are:
#   "articlequality"
#   "articletopic"
#   "damaging"
#   "version"
#   "draftquality"
#   "drafttopic"
#   "goodfaith"
#   "wp10"
#
# The specific documentation on these is scattered so if you want to use one you'll have to look around.
#

In [261]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_ores_score_per_article(article_revid = None, 
                                   endpoint_url = API_ORES_SCORE_ENDPOINT, 
                                   endpoint_params = API_ORES_SCORE_PARAMS, 
                                   request_template = ORES_PARAMS_TEMPLATE,
                                   headers = REQUEST_HEADERS,
                                   features=False):
    # Make sure we have an article revision id
    if not article_revid: return None
    
    # set the revision id into the template
    request_template['revid'] = article_revid
    
    # now, create a request URL by combining the endpoint_url with the parameters for the request
    request_url = endpoint_url+endpoint_params.format(**request_template)
    
    # the features used by the ML model can sometimes be returned as well as scores
    if features:
        request_url = request_url+"?features=true"
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(request_url, headers=headers)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response


In [263]:
# ARTICLE_REVISIONS = 

'852404094'

In [522]:
# map the revisions
#WARNING THIS TAKES A BIT#
all_scores = []
idx = []
for i in range(0, len(df_info['lastrevid'].map(int).map(str))-1):
    print("Percentage Done: ", i/len(df_info['lastrevid'].map(int).map(str)))
    score = request_ores_score_per_article(df_info['lastrevid'].map(int).map(str)[i])
    keys = list((score['enwiki']['scores'].keys()))
    idx.extend(keys)
    vals = list(map(lambda key: score['enwiki']['scores'][key]['articlequality']['score']['prediction'], keys))
    all_scores.extend(vals)

In [287]:
#Get the lastrevid and score

df_scores = pd.DataFrame(list(zip(idx, all_scores)), columns=['lastrevid', 'score'])
df_scores['lastrevid'] = df_scores['lastrevid'].astype(int)
df_scores

Unnamed: 0,lastrevid,score
0,943562276,Start
1,852404094,Start
2,1095102390,B
3,1104998382,Start
4,1109361754,Start


In [288]:
ARTICLE_TITLES = pd.read_csv("politicians_by_country_SEPT.2022.csv - politicians_international_SEPT.2022.csv.csv")
df = pd.merge(ARTICLE_TITLES, df_info, left_on='name', right_on='title', how='right').merge(df_scores, on='lastrevid')
df = df[['name', 'country', 'title', 'lastrevid', 'score']]
df

Unnamed: 0,name,country,title,lastrevid,score
0,Abdul Ghafar Lakanwal,Afghanistan,Abdul Ghafar Lakanwal,9.435623e+08,Start
1,Abdul Ghafar Lakanwal,Afghanistan,Abdul Ghafar Lakanwal,9.435623e+08,C
2,Majah Ha Adrif,Afghanistan,Majah Ha Adrif,8.524041e+08,Start
3,Majah Ha Adrif,Afghanistan,Majah Ha Adrif,8.524041e+08,Stub
4,Haroon al-Afghani,Afghanistan,Haroon al-Afghani,1.095102e+09,B
...,...,...,...,...,...
7918,Mário Viegas Carrascalão,Timor-Leste,Mário Viegas Carrascalão,1.072431e+09,Start
7919,Maria Terezinha Viegas,Timor-Leste,Maria Terezinha Viegas,1.112160e+09,Start
7920,Kafui Adjamagbo-Johnson,Togo,Kafui Adjamagbo-Johnson,1.090162e+09,Stub
7921,Angèle Dola Akofa Aguigah,Togo,Angèle Dola Akofa Aguigah,1.099133e+09,Stub


In [289]:
def population_manipulation(file):
    population = pd.read_csv(file)
    population['region'] = population.apply(lambda x: x['Geography'] if x['Geography'].isupper() else None, axis = 1)
    population['region'] = population['region'].fillna(method="ffill")
    population = population.drop(population[(population['Geography'].str.isupper())].index)
    population
    return population

In [290]:
population = population_manipulation('population_by_country_2022.csv - population_by_country_2022.csv.csv')

df = pd.merge(df, population, left_on='country', right_on='Geography', how='outer')
df.dropna()

Unnamed: 0,name,country,title,lastrevid,score,Geography,Population (millions),region
0,Abdul Ghafar Lakanwal,Afghanistan,Abdul Ghafar Lakanwal,9.435623e+08,Start,Afghanistan,41.1,SOUTH ASIA
1,Abdul Ghafar Lakanwal,Afghanistan,Abdul Ghafar Lakanwal,9.435623e+08,C,Afghanistan,41.1,SOUTH ASIA
2,Majah Ha Adrif,Afghanistan,Majah Ha Adrif,8.524041e+08,Start,Afghanistan,41.1,SOUTH ASIA
3,Majah Ha Adrif,Afghanistan,Majah Ha Adrif,8.524041e+08,Stub,Afghanistan,41.1,SOUTH ASIA
4,Haroon al-Afghani,Afghanistan,Haroon al-Afghani,1.095102e+09,B,Afghanistan,41.1,SOUTH ASIA
...,...,...,...,...,...,...,...,...
7918,Festo Sanga,Tanzania,Festo Sanga,1.087521e+09,Start,Tanzania,65.5,EASTERN AFRICA
7919,Mwita Waitara,Tanzania,Mwita Waitara,1.006422e+09,C,Tanzania,65.5,EASTERN AFRICA
7920,Kafui Adjamagbo-Johnson,Togo,Kafui Adjamagbo-Johnson,1.090162e+09,Stub,Togo,8.8,WESTERN AFRICA
7921,Angèle Dola Akofa Aguigah,Togo,Angèle Dola Akofa Aguigah,1.099133e+09,Stub,Togo,8.8,WESTERN AFRICA


In [291]:
#output the csv
df.to_csv('wp_politicians_by_country.csv')

In [280]:
#Create the output json file for ease of loading
def create_output(file):
    output_df = pd.DataFrame()
    f = open(file)
    df = json.load(f)
    for i in range(0, len(df)-1):
        dic = df[i]['enwiki']
        temp = pd.DataFrame.from_dict([dic]).T
        output_df = pd.concat([output_df, temp])
    return output_df


In [281]:
df_with_prob = create_output('scores.json')

In [418]:
population = pd.read_csv('population_by_country_2022.csv - population_by_country_2022.csv.csv')

df = pd.read_csv('wp_politicians_by_country.csv')

last_seen_region = ""
for geo in population['Geography']:
    if geo.isupper():
        last_seen_region = geo
    else:
        df.loc[df["country"] == geo, "Geography"] = last_seen_region
 

In [419]:
df = df.rename(columns={'title':'title_name', 'lastrevid':'revision_id', 'score':'article_quality', 'Population (millions)':'population', 'Geography' : 'region'})
df.to_csv('wp_politicians_by_country.csv', index=False)

## Top 10 countries by coverage: The 10 countries with the highest total articles per capita (in descending order)

In [441]:
df = pd.read_csv('wp_politicians_by_country.csv')
#Step 5 Analysis
df.agg({"population": ["count", "sum"]})
out = df.groupby(["country"]).agg({"population": ["count", "mean"]})
out["per_million_capita"] = out[('population','count')] / out[('population','mean')]
out = out.sort_values(by='per_million_capita', ascending=False)
#Get rid of the infinite values
out = out[out["per_million_capita"] != np.inf][:10]
#1. Top 10 countries by coverage
out

Unnamed: 0_level_0,population,population,per_million_capita
Unnamed: 0_level_1,count,mean,Unnamed: 3_level_1
country,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
Antigua and Barbuda,34,0.1,340.0
Andorra,20,0.1,200.0
Federated States of Micronesia,13,0.1,130.0
Barbados,28,0.3,93.333333
Marshall Islands,9,0.1,90.0
Montenegro,45,0.6,75.0
Albania,170,2.8,60.714286
Seychelles,6,0.1,60.0
Luxembourg,37,0.7,52.857143
Bhutan,41,0.8,51.25


## Bottom 10 countries by coverage: The 10 countries with the lowest total articles per capita (in ascending order)

In [442]:
#2. Bottom 10 countries by coverage
out = df.groupby(["country"]).agg({"population": ["count", "mean"]})
out["per_million_capita"] = out[('population','count')] / out[('population','mean')]
out = out.sort_values(by='per_million_capita', ascending=True)
out = out[out["per_million_capita"] != np.inf][:10]
#1. Top 10 countries by coverage
out

Unnamed: 0_level_0,population,population,per_million_capita
Unnamed: 0_level_1,count,mean,Unnamed: 3_level_1
country,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
China,2,1436.6,0.001392
Mexico,1,127.5,0.007843
Ukraine,2,41.0,0.04878
Saudi Arabia,3,36.7,0.081744
Romania,2,19.0,0.105263
India,181,1417.2,0.127717
Sri Lanka,3,22.4,0.133929
Egypt,14,103.5,0.135266
Taiwan,5,23.2,0.215517
Ethiopia,28,123.4,0.226904


## Top 10 countries by high quality: The 10 countries with the highest high quality articles per capita (in descending order)

In [443]:
#3. Top 10 countries by high quality
df = pd.read_csv('wp_politicians_by_country.csv')
df = df[(df['article_quality'] == 'FA') | (df['article_quality'] == 'GA')]
out = df.groupby(["country"]).agg({"population": ["count", "mean"]})
out["per_million_capita"] = out[('population','count')] / out[('population','mean')]
out = out.sort_values(by='per_million_capita', ascending=False)
#Get rid of the infinite values
out = out[out["per_million_capita"] != np.inf][:10]
#1. Top 10 countries by coverage
out

Unnamed: 0_level_0,population,population,per_million_capita
Unnamed: 0_level_1,count,mean,Unnamed: 3_level_1
country,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
Antigua and Barbuda,3,0.1,30.0
Andorra,2,0.1,20.0
Marshall Islands,1,0.1,10.0
Federated States of Micronesia,1,0.1,10.0
Montenegro,4,0.6,6.666667
Samoa,1,0.2,5.0
Barbados,1,0.3,3.333333
Cape Verde,2,0.6,3.333333
Albania,9,2.8,3.214286
Fiji,2,0.9,2.222222


## Bottom 10 countries by high quality: The 10 countries with the lowest high quality articles per capita (in ascending order)

In [469]:
#4. Bottom 10 Countries by High Quality
df = pd.read_csv('wp_politicians_by_country.csv')
df = df[(df['article_quality'] == 'FA') | (df['article_quality'] == 'GA')]
out = df.groupby(["country"]).agg({"population": ["count", "mean"]})
out["per_million_capita"] = out[('population','count')] / out[('population','mean')]
out = out.sort_values(by='per_million_capita', ascending=True)
#Get rid of the infinite values
out = out[out["per_million_capita"] != np.inf][:10]
#1. Top 10 countries by coverage
out

Unnamed: 0_level_0,population,population,per_million_capita
Unnamed: 0_level_1,count,mean,Unnamed: 3_level_1
country,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
India,2,1417.2,0.001411
Ethiopia,1,123.4,0.008104
South Africa,1,60.6,0.016502
Indonesia,5,275.5,0.018149
Kenya,1,54.0,0.018519
Iran,2,88.6,0.022573
Brazil,5,214.8,0.023277
Japan,3,124.9,0.024019
Saudi Arabia,1,36.7,0.027248
Morocco,1,36.7,0.027248


## Geographic regions by total coverage: A rank ordered list of geographic regions (in descending order) by total articles per capita.

In [520]:
#5. Geographic regions by total coverage: In descending order High quality
df = pd.read_csv('wp_politicians_by_country.csv')
df = df[(df['article_quality'] == 'FA') | (df['article_quality'] == 'GA')]
df.groupby('region').size().sort_values(0, ascending=False).head(10)


region
SOUTHERN EUROPE    56
WESTERN ASIA       37
WESTERN EUROPE     26
WESTERN AFRICA     26
SOUTH ASIA         26
EASTERN EUROPE     18
EASTERN AFRICA     14
SOUTH AMERICA      14
SOUTHEAST ASIA     12
CENTRAL AMERICA    12
dtype: int64

## Geographic regions by high quality coverage: Rank ordered list of geographic regions (in descending order) by high quality articles per capita.

In [521]:
#6. Geographic regions by total coverage: In descending order High Quality
df = pd.read_csv('wp_politicians_by_country.csv')
df = df[(df['article_quality'] == 'FA') | (df['article_quality'] == 'GA')]
df.groupby('region').size().sort_values(0, ascending=False).tail(10)

region
SOUTHEAST ASIA     12
CENTRAL AMERICA    12
MIDDLE AFRICA      11
NORTHERN EUROPE    11
CARIBBEAN           8
NORTHERN AFRICA     6
EAST ASIA           6
OCEANIA             6
SOUTHERN AFRICA     4
CENTRAL ASIA        2
dtype: int64