# Step 1: Getting the Article and Population Data
The first step is getting the data, which lives in several different places. You will need data that lists Wikipedia articles of politicians and data for country populations.
The Wikipedia Category:Politicians by nationality was crawled to generate a list of Wikipedia article pages about politicians from a wide range of countries. This data is in the homework folder as politicians_by_country.SEPT.2022.csv.
The population data is available in CSV format as population_by_country_2022.csv from the homework folder. This dataset is drawn from the world population data sheet published by the Population Reference Bureau.


In [194]:
# These are standard python modules
import json, time, urllib.parse
import pandas as pd
import numpy as np
import re
import requests
import matplotlib.pyplot as plt
from datetime import datetime

In [195]:
df_politicians = pd.read_csv('/content/drive/MyDrive/MS Admission/Washington/GitHub/DataScienceProjects/data-512-homework_2/politicians_by_country_SEPT.2022.csv')
df_population = pd.read_csv('/content/drive/MyDrive/MS Admission/Washington/GitHub/DataScienceProjects/data-512-homework_2/population_by_country_2022.csv')
df_politicians.head()

Unnamed: 0,name,url,country
0,Shahjahan Noori,https://en.wikipedia.org/wiki/Shahjahan_Noori,Afghanistan
1,Abdul Ghafar Lakanwal,https://en.wikipedia.org/wiki/Abdul_Ghafar_Lak...,Afghanistan
2,Majah Ha Adrif,https://en.wikipedia.org/wiki/Majah_Ha_Adrif,Afghanistan
3,Haroon al-Afghani,https://en.wikipedia.org/wiki/Haroon_al-Afghani,Afghanistan
4,Tayyab Agha,https://en.wikipedia.org/wiki/Tayyab_Agha,Afghanistan


## Some Considerations
You should be a little careful with the data. Crawling Wikipedia categories to identify relevant page subsets can result in misleading and/or duplicate category labels. Naturally, the data crawl attempted to resolve these, but not all may have been caught. You should document how you handle any data inconsistencies.
The population_by_country_2022.csv contains some rows that provide cumulative regional population counts. These rows are distinguished by having ALL CAPS values in the 'geography' field (e.g. AFRICA, OCEANIA). These rows won't match the country values in politicians_by_country.SEPT.2022.csv, but you will want to retain some of them so that you can report coverage and quality by region as specified in the analysis section below.


### Data Inconsistencies

There are few duplicates in the politicians data

In [196]:
duplicate_articles = df_politicians[df_politicians.duplicated(subset=['name', 'url'], keep = False)]
duplicate_records = df_politicians[df_politicians.duplicated(subset=['name', 'url', 'country'], keep = False)]

print(f'''There are a total of {duplicate_articles.shape[0]} duplicate articles with different country names''')
print(f'''There are a total of {duplicate_records.shape[0]} duplicate records with same name, url and country names''')

There are a total of 98 duplicate articles with different country names
There are a total of 4 duplicate records with same name, url and country names


Obtaining the recent/last updated record from the provided workbook and removing the duplicates 

In [197]:
df_politicians = df_politicians[~df_politicians.duplicated(subset=['name', 'url'], keep = 'last')]
df_politicians.shape

(7534, 3)

Population value being 0 for few countries

In [198]:
zero_population = df_population[df_population['Population (millions)'] == 0]
print(f'''There are a total of {zero_population.shape[0]} countries with population as zero''')
zero_population.head()

There are a total of 6 countries with population as zero


Unnamed: 0,Geography,Population (millions)
183,Liechtenstein,0.0
185,Monaco,0.0
211,San Marino,0.0
223,Nauru,0.0
226,Palau,0.0


Eliminating the cumulative regional population counts and obtaining the closest/lowest in the hierarchy regions for the countries

In [199]:
df_population['shifted'] = df_population['Geography'].shift(-1)
df_population = df_population[~((df_population['Geography'].str.isupper() == True) & (df_population['shifted'].str.isupper() == True))].iloc[:,0:2].reset_index().drop('index', axis = 1)
regions = pd.DataFrame()
regions['region'] = df_population[df_population['Geography'].str.isupper()]['Geography']
regions['flag'] = np.arange(1, len(regions['region']) + 1)
df = df_population.merge(regions, left_on = "Geography", right_on = "region", how = 'left').iloc[:,[0,1,3]]
df['flag'] = df['flag'].expanding().max()
df_population = df.merge(regions, on = "flag", how = 'inner')
df_population = df_population.iloc[:,[0, 1, 3]]
df_population = df_population[df_population['Geography'] != df_population['region']]
df_population.head()

Unnamed: 0,Geography,Population (millions),region
1,Algeria,44.9,NORTHERN AFRICA
2,Egypt,103.5,NORTHERN AFRICA
3,Libya,6.8,NORTHERN AFRICA
4,Morocco,36.7,NORTHERN AFRICA
5,Sudan,46.9,NORTHERN AFRICA


# Step 2: Getting Article Quality Predictions
Now you need to get the predicted quality scores for each article in the Wikipedia dataset. We're using a machine learning system called ORES. This was originally an acronym for "Objective Revision Evaluation Service" but was simply renamed “ORES”. ORES is a machine learning tool that can provide estimates of Wikipedia article quality. The article quality estimates are, from best to worst:

FA - Featured article

GA - Good article

B - B-class article

C - C-class article

Start - Start-class article

Stub - Stub-class article

These were learned based on articles in Wikipedia that were peer-reviewed using the Wikipedia content assessment procedures.These quality classes are a sub-set of quality assessment categories developed by Wikipedia editors.

ORES requires a specific revision ID of a specific article to be able to make a label prediction. You can use the API:Info request to get a range of metadata on an article, including the most current revision ID of the article page.
Putting this together, to get a Wikipedia page quality prediction from ORES for each politician’s article page you will need to: a) read each line of politicians_by_country.SEPT.2022.csv, b) make a page info request to get the current page revision, and c) make an ORES request using the page title and current revision id.

The homework folder contains example code in notebooks to illustrate making a page info request and making an ORES request. This sample code is licensed CC0 so feel free to reuse any of the code in either notebook without attribution.
Note: It is possible that you will be unable to get a score for a particular article. If that happens, make sure to maintain a log of articles for which you were not able to retrieve an ORES score. This log can be saved as a separate file, or (if it's only a few articles), simply printed and logged within the notebook. The choice is up to you.


### Making a page info request

In [19]:
output = []

ARTICLE_TITLES = df_politicians['name'].to_list()
# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0) - API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<karasth@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2022',
}

# This is a string of additional page properties that can be returned see the Info documentation for
# what can be included. If you don't want any this can simply be the empty string
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
#PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making this
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}

In [6]:
def request_pageinfo_per_article(article_title = None, 
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT, 
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):
    #global output
    # Make sure we have an article title
    if not article_title: return None
    
    request_template['titles'] = article_title   
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    #output.append(json_response)
    #print(output)
    return json_response


In [275]:
'''
from multiprocessing import Process, Pool, Queue

ARTICLE_TITLES = df_politicians['name'].to_list()
ARTICLE_TITLES = ARTICLE_TITLES[0:4]

for name in ARTICLE_TITLES:
    procs = []
    proc = Process(target = request_pageinfo_per_article, args=(name,))
    procs.append(proc)
    proc.start()

# complete the processes
for proc in procs:
    proc.join()

print(output)
'''

"\nfrom multiprocessing import Process, Pool, Queue\n\nARTICLE_TITLES = df_politicians['name'].to_list()\nARTICLE_TITLES = ARTICLE_TITLES[0:4]\n\nfor name in ARTICLE_TITLES:\n    procs = []\n    proc = Process(target = request_pageinfo_per_article, args=(name,))\n    procs.append(proc)\n    proc.start()\n\n# complete the processes\nfor proc in procs:\n    proc.join()\n\nprint(output)\n"

In [8]:
df_info = list()
for i in ARTICLE_TITLES:
  # Exception handling to make sure code doesn't break
  try:
    info = request_pageinfo_per_article(article_title = i, request_template = PAGEINFO_PARAMS_TEMPLATE)
    key = list(info['query']['pages'].keys())[0]
    df_info.append(pd.json_normalize(info['query']['pages'][key]))
    #print("Obtained the page info for: ",i)
  except:
    print("Couldn't get the page info for: ", i)

# Obtain the overall dinosaurs data
df_articles = pd.concat(df_info)
df_articles = df_articles[["title", "lastrevid"]]
df_articles.shape

(7584, 2)

In [14]:
# Write the raw data to excel
#df_articles.to_excel('/content/drive/MyDrive/MS Admission/Washington/GitHub/DataScienceProjects/data-512-homework_2/pageinfo_all.xlsx', index = False)
# Read the raw data from excel
df_articles = pd.read_excel('/content/drive/MyDrive/MS Admission/Washington/GitHub/DataScienceProjects/data-512-homework_2/pageinfo_all.xlsx')
df_articles.info()
df_articles = df_articles[~df_articles['lastrevid'].isnull()]
df_articles['lastrevid'] = df_articles['lastrevid'].astype(int)
df_articles.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7584 entries, 0 to 7583
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   title      7584 non-null   object 
 1   lastrevid  7578 non-null   float64
dtypes: float64(1), object(1)
memory usage: 118.6+ KB


Unnamed: 0,title,lastrevid
0,Shahjahan Noori,1099689043
1,Abdul Ghafar Lakanwal,943562276
2,Majah Ha Adrif,852404094
3,Haroon al-Afghani,1095102390
4,Tayyab Agha,1104998382


### ORES API request

In [200]:
# The current ORES API endpoint
API_ORES_SCORE_ENDPOINT = "https://ores.wikimedia.org/v3"
# A template for mapping to the URL
API_ORES_SCORE_PARAMS = "/scores/{context}/{revid}/{model}"

# Use some delays so that we do not hammer the API with our requests
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0) - API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<karasth@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2022'
}

# This template lists the basic parameters for making an ORES request
ORES_PARAMS_TEMPLATE = {
    "context": "enwiki",        # which WMF project for the specified revid
    "revid" : "",               # the revision to be scored - this will probably change each call
    "model": "articlequality"   # the AI/ML scoring model to apply to the reviewion
}

In [201]:
def request_ores_score_per_article(article_revid = None, 
                                   endpoint_url = API_ORES_SCORE_ENDPOINT, 
                                   endpoint_params = API_ORES_SCORE_PARAMS, 
                                   request_template = ORES_PARAMS_TEMPLATE,
                                   headers = REQUEST_HEADERS,
                                   features=False):
    # Make sure we have an article revision id
    if not article_revid: return None
    
    # set the revision id into the template
    request_template['revid'] = article_revid
    
    # now, create a request URL by combining the endpoint_url with the parameters for the request
    request_url = endpoint_url+endpoint_params.format(**request_template)
    
    # the features used by the ML model can sometimes be returned as well as scores
    if features:
        request_url = request_url+"?features=true"
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(request_url, headers=headers)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response


In [202]:
ARTICLE_REVISIONS = dict(zip(df_articles.title, df_articles.lastrevid))
df_ores = list()
with open('/content/drive/MyDrive/MS Admission/Washington/GitHub/DataScienceProjects/data-512-homework_2/scores_error_log.txt', 'w') as f:
  for i in ARTICLE_TITLES:
    # Exception handling to make sure code doesn't break
      try: 
        score = request_ores_score_per_article(ARTICLE_REVISIONS[i])
        revid = ARTICLE_REVISIONS.get(i)
        data = score['enwiki']['scores'][str(revid)]['articlequality']['score']['prediction']
        df_ores = df_ores + [[i, revid, data]]
        #print("Obtained the ores score for: ", i)
      except:
        line = "Couldn't get the ORES score for: " + i
        f.write(line)
        f.write('\n')

df_scores = pd.DataFrame(df_ores, columns = ['name', 'lastrevid', 'prediction'])

Unnamed: 0,name,lastrevid,prediction
0,Shahjahan Noori,1099689043,GA
1,Abdul Ghafar Lakanwal,943562276,Start
2,Majah Ha Adrif,852404094,Start
3,Haroon al-Afghani,1095102390,B
4,Tayyab Agha,1104998382,Start


In [203]:
# Write the raw data to excel
df_scores.to_excel('/content/drive/MyDrive/MS Admission/Washington/GitHub/DataScienceProjects/data-512-homework_2/scores_all.xlsx', index = False)
# Read the raw data from excel
df_scores = pd.read_excel('/content/drive/MyDrive/MS Admission/Washington/GitHub/DataScienceProjects/data-512-homework_2/scores_all.xlsx')
df_scores.info()
df_scores.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   name        300 non-null    object
 1   lastrevid   300 non-null    int64 
 2   prediction  300 non-null    object
dtypes: int64(1), object(2)
memory usage: 7.2+ KB


Unnamed: 0,name,lastrevid,prediction
0,Shahjahan Noori,1099689043,GA
1,Abdul Ghafar Lakanwal,943562276,Start
2,Majah Ha Adrif,852404094,Start
3,Haroon al-Afghani,1095102390,B
4,Tayyab Agha,1104998382,Start


# Step 3: Combining the Datasets
Some processing of the data will be necessary! In particular, you'll need to - after retrieving and including the ORES data for each article - merge the wikipedia data and population data together. Both have fields containing country names for just that purpose. After merging the data, you'll invariably run into entries which cannot be merged. Either the population dataset does not have an entry for the equivalent Wikipedia country, or vice-versa.
Identify all countries for which there are no matches and output a list of those countries, with each country on a separate line called:
wp_countries-no_match.txt
Consolidate the remaining data into a single CSV file called:
wp_politicians_by_country.csv
The schema for that file should look something like this:

Column

country

region

population

article_title

revision_id

article_quality



In [204]:
duplicate_articles = df_articles[df_articles.duplicated(subset=['title', 'lastrevid'], keep = False)]
#duplicate_records = df_articles[df_articles.duplicated(subset=['name', 'url', 'country'], keep = False)]

print(f'''There are a total of {duplicate_articles.shape[0]} duplicate articles with different country names''')
#print(f'''There are a total of {duplicate_records.shape[0]} duplicate records with same name, url and country names''')

df_articles = df_articles[~df_articles.duplicated(subset=['title', 'lastrevid'], keep = 'last')]
df_articles.shape

There are a total of 0 duplicate articles with different country names


(7528, 2)

In [205]:
duplicate_articles = df_scores[df_scores.duplicated(subset=['lastrevid'], keep = False)]
print(f'''There are a total of {duplicate_articles.shape[0]} duplicate articles with different country names''')

There are a total of 0 duplicate articles with different country names


In [206]:
df_articles = df_articles[['title', 'lastrevid']]
print(df_articles.shape)
df_politicians = df_politicians.merge(df_articles, left_on = "name", right_on = "title", how = 'inner')
print(df_politicians.shape)
df_scores = df_politicians.merge(df_scores, on = ['name', 'lastrevid'], how = 'left')
df_scores = df_scores.drop(['title', 'url'], axis = 1)
print(df_scores.shape)
df_scores.head()

(7528, 2)
(7528, 5)
(7528, 4)


Unnamed: 0,name,country,lastrevid,prediction
0,Shahjahan Noori,Afghanistan,1099689043,GA
1,Abdul Ghafar Lakanwal,Afghanistan,943562276,Start
2,Majah Ha Adrif,Afghanistan,852404094,Start
3,Haroon al-Afghani,Afghanistan,1095102390,B
4,Tayyab Agha,Afghanistan,1104998382,Start


Enriching the ORES scores data with population data

In [208]:
df_main = df_scores.merge(df_population, left_on = 'country', right_on = 'Geography', how = 'outer')
df_main.head()

Unnamed: 0,name,country,lastrevid,prediction,Geography,Population (millions),region
0,Shahjahan Noori,Afghanistan,1099689000.0,GA,Afghanistan,41.1,SOUTH ASIA
1,Abdul Ghafar Lakanwal,Afghanistan,943562300.0,Start,Afghanistan,41.1,SOUTH ASIA
2,Majah Ha Adrif,Afghanistan,852404100.0,Start,Afghanistan,41.1,SOUTH ASIA
3,Haroon al-Afghani,Afghanistan,1095102000.0,B,Afghanistan,41.1,SOUTH ASIA
4,Tayyab Agha,Afghanistan,1104998000.0,Start,Afghanistan,41.1,SOUTH ASIA


Obtaining all countries for which there are no matches i.e., either the population dataset does not have an entry for the equivalent Wikipedia country, or vice-versa.

In [209]:
l1 = df_main[df_main['country'].isnull()]['Geography'].unique()
l2 = df_main[df_main['Geography'].isnull()]['country'].unique()
no_match = list(set(np.append(l1, l2)))
no_match.sort()
no_match

['Australia',
 'Brunei',
 'Canada',
 'China,  Hong Kong SAR',
 'China,  Macao SAR',
 'Curacao',
 'French Guiana',
 'French Polynesia',
 'Guadeloupe',
 'Guam',
 'Ireland',
 'Kiribati',
 'Korean',
 'Martinique',
 'Mauritius',
 'Mayotte',
 'New Caledonia',
 'New Zealand',
 'Philippines',
 'Puerto Rico',
 'Reunion',
 'Sao Tome and Principe',
 'United Kingdom',
 'United States',
 'Western Sahara',
 'eSwatini']

Writing the no match countries list to a output text file

In [210]:
with open('/content/drive/MyDrive/MS Admission/Washington/GitHub/DataScienceProjects/data-512-homework_2/wp_countries-no_match.txt', 'w') as f:
  for i in no_match:
    f.write(line)
    f.write('\n')

Consolidate the remaining data into a single CSV file

In [211]:
df_main = df_main[(~df_main['country'].isnull()) & (~df_main['Geography'].isnull())]
df_main = df_main.drop('Geography', axis = 1)
df_main = df_main.rename(columns={'Geography': 'country', 'Population (millions)': 'population', 'name': 'article_title', 'latestrevid': 'revision_id', 'prediction': 'article_quality'})
df_main['lastrevid'] = df_main['lastrevid'].astype('int')
df_main.to_csv('wp_politicians_by_country.csv', index=False)

In [212]:
df_main.head()

Unnamed: 0,article_title,country,lastrevid,article_quality,population,region
0,Shahjahan Noori,Afghanistan,1099689043,GA,41.1,SOUTH ASIA
1,Abdul Ghafar Lakanwal,Afghanistan,943562276,Start,41.1,SOUTH ASIA
2,Majah Ha Adrif,Afghanistan,852404094,Start,41.1,SOUTH ASIA
3,Haroon al-Afghani,Afghanistan,1095102390,B,41.1,SOUTH ASIA
4,Tayyab Agha,Afghanistan,1104998382,Start,41.1,SOUTH ASIA


# Step 4: Analysis
Your analysis will consist of calculating total-articles-per-population (a ratio representing the number of articles per person) and high-quality-articles-per-population (a ratio representing the number of high quality articles per person) on a country-by-country and regional basis. All of these values are to be “per capita”.

In this analysis a country can only exist in one region. The population_by_country_2022.csv actually represents regions in a hierarchical order. For your analysis always put a country in the closest (lowest in the hierarchy) region.

For this analysis you should consider "high quality" articles to be articles that ORES predicted would be in either the "FA" (featured article) or "GA" (good article) classes.

Also, keep in mind that the population_by_country_2022.csv file provides population in millions. The calculated proportions in this step are likely to be very small numbers.


In [262]:
def enrich_region_population(data = df_population):
  region_population = data[['region', 'Population (millions)']].groupby('region').sum().reset_index()
  region_population = region_population.rename(columns = {'Population (millions)': 'population'})
  region_population['population'] = region_population['population'].round().astype('int')
  return region_population

In [264]:
def all_quality(data, grain = 'country'):
  if grain == 'region':
    total_articles = data[['region', 'article_title']]
    reg_pop = enrich_region_population()
    total_articles = total_articles.merge(reg_pop, on = "region", how = "inner")
    total_articles = total_articles.groupby(['region', 'population']).nunique().reset_index()
  elif grain == 'country':
    total_articles = df_main[['country', 'population', 'article_title']].groupby(['country', 'population']).nunique().reset_index()
  total_articles = total_articles.rename(columns = {'article_title': 'total_articles'})
  total_articles['articles_per_capita'] = total_articles['total_articles'] / (total_articles['population'] * 1000000)
  total_articles = total_articles.sort_values(by = ['articles_per_capita'], ascending = False)
  # Checking if there is division by zero because of population being 0
  total_articles = total_articles[total_articles['articles_per_capita'] != np.inf]
  total_articles.reset_index(inplace = True)
  total_articles = total_articles.drop('index', axis = 1)
  return total_articles

In [267]:
country_total_articles = all_quality(df_main, grain = 'country')
region_total_articles = all_quality(df_main, grain = 'region')

In [265]:
def high_quality(data, grain = 'country'):
  high_quality = data[data['article_quality'] == 'FA']
  if grain == 'region':
    high_quality = high_quality[['region', 'article_title']]
    reg_pop = enrich_region_population()
    high_quality = high_quality.merge(reg_pop, on = "region", how = "inner")
    high_quality = high_quality.groupby(['region', 'population']).nunique().reset_index()
  elif grain == 'country':
    high_quality = high_quality[['country', 'population', 'article_title']].groupby(['country', 'population']).nunique().reset_index()
  high_quality = high_quality.rename(columns = {'article_title': 'high_quality_articles'})
  high_quality['quality_articles_per_capita'] = high_quality['high_quality_articles'] / (high_quality['population'] * 1000000)
  high_quality = high_quality.sort_values(by = ['quality_articles_per_capita'], ascending = False)
  # Checking if there is division by zero because of population being 0
  high_quality = high_quality[high_quality['quality_articles_per_capita'] != np.inf]
  high_quality.reset_index(inplace = True)
  high_quality = high_quality.drop('index', axis = 1)
  return high_quality

In [268]:
country_high_quality = high_quality(df_main, grain = 'country')
region_high_quality = high_quality(df_main, grain = 'region')

# Step 5: Results
Your results from this analysis will be produced in the form of data tables. You are being asked to produce six total tables, that show:


Top 10 countries by coverage: The 10 countries with the highest total articles per capita (in descending order).

In [269]:
countries_coverage_top10 = country_total_articles.head(10)['country']
countries_coverage_top10

0               Antigua and Barbuda
1    Federated States of Micronesia
2                           Andorra
3                          Barbados
4                  Marshall Islands
5                        Seychelles
6                        Montenegro
7                        Luxembourg
8                            Bhutan
9                           Grenada
Name: country, dtype: object

Bottom 10 countries by coverage: The 10 countries with the lowest total articles per capita (in ascending order) .

In [270]:
countries_coverage_bottom10 = country_total_articles.reindex(index = country_total_articles.index[::-1]).head(10)['country']
countries_coverage_bottom10

177           China
176          Mexico
175    Saudi Arabia
174         Romania
173           India
172       Sri Lanka
171           Egypt
170        Ethiopia
169          Taiwan
168         Vietnam
Name: country, dtype: object


Top 10 countries by high quality: The 10 countries with the highest high quality articles per capita (in descending order) .

In [271]:
countries_quality_top10 = country_high_quality.head(10)['country']
countries_quality_top10

0        Andorra
1    Afghanistan
Name: country, dtype: object

Bottom 10 countries by high quality: The 10 countries with the lowest high quality articles per capita (in ascending order).

In [272]:
countries_quality_bottom10 = country_high_quality.reindex(index = country_high_quality.index[::-1]).head(10)['country']
countries_quality_bottom10

1    Afghanistan
0        Andorra
Name: country, dtype: object


Geographic regions by total coverage: A rank ordered list of geographic regions (in descending order) by total articles per capita.

In [273]:
region_coverage = region_total_articles['region']
region_coverage

0     SOUTHERN EUROPE
1           CARIBBEAN
2      WESTERN EUROPE
3      EASTERN EUROPE
4     NORTHERN EUROPE
5        WESTERN ASIA
6             OCEANIA
7     SOUTHERN AFRICA
8      EASTERN AFRICA
9       SOUTH AMERICA
10     WESTERN AFRICA
11       CENTRAL ASIA
12    CENTRAL AMERICA
13      MIDDLE AFRICA
14    NORTHERN AFRICA
15     SOUTHEAST ASIA
16         SOUTH ASIA
17          EAST ASIA
Name: region, dtype: object

Geographic regions by high quality coverage: Rank ordered list of geographic regions (in descending order) by high quality articles per capita.

In [274]:
region_high_quality = region_high_quality['region']
region_high_quality

0    SOUTHERN EUROPE
1         SOUTH ASIA
Name: region, dtype: object