# Homework 2


## License
This code was developed from example code written by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.0 - May 13, 2022
Some code snippets have also been adapted from online sources and modified for this assignment.

In [342]:

# These are standard python modules
import json, time, urllib.parse
import requests

The example relies on some constants that help make the code a bit more readable.

In [132]:
#########
#
#    CONSTANTS
#

# The current ORES API endpoint
API_ORES_SCORE_ENDPOINT = "https://ores.wikimedia.org/v3"
# A template for mapping to the URL
API_ORES_SCORE_PARAMS = "/scores/{context}/{revid}/{model}"

# Use some delays so that we do not hammer the API with our requests
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<ananya03@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2022'
}

# A dictionary of English Wikipedia article titles (keys) and sample revision IDs that can be used for this ORES scoring example
ARTICLE_REVISIONS = { 'Bison':1085687913 , 'Northern flicker':1086582504 , 'Red squirrel':1083787665 , 'Chinook salmon':1085406228 , 'Horseshoe bat':1060601936 }

# This template lists the basic parameters for making an ORES request
ORES_PARAMS_TEMPLATE = {
    "context": "enwiki",        # which WMF project for the specified revid
    "revid" : "",               # the revision to be scored - this will probably change each call
    "model": "articlequality"   # the AI/ML scoring model to apply to the revision
}

#########
#
#    CONSTANTS
#

# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<ananya03@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2022',
}

# This is just a list of English Wikipedia article titles that we can use for example requests
#ARTICLE_TITLES = df_politicians['name'].to_list()

# This is a string of additional page properties that can be returned see the Info documentation for
# what can be included. If you don't want any this can simply be the empty string
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
#PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making this
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}


In [240]:
# These are standard python modules
import json, time, urllib.parse
import pandas as pd
import numpy as np
import re
import requests
import matplotlib.pyplot as plt
from datetime import datetime

In [241]:
population=pd.read_excel('population_2022.xlsx')
politicians=pd.read_excel('politicians_2022.xlsx')

politicians.head()

Unnamed: 0,name,url,country
0,Shahjahan Noori,https://en.wikipedia.org/wiki/Shahjahan_Noori,Afghanistan
1,Abdul Ghafar Lakanwal,https://en.wikipedia.org/wiki/Abdul_Ghafar_Lak...,Afghanistan
2,Majah Ha Adrif,https://en.wikipedia.org/wiki/Majah_Ha_Adrif,Afghanistan
3,Haroon al-Afghani,https://en.wikipedia.org/wiki/Haroon_al-Afghani,Afghanistan
4,Tayyab Agha,https://en.wikipedia.org/wiki/Tayyab_Agha,Afghanistan


In [242]:
duplicate_articles = politicians[politicians.duplicated(subset=['name', 'url'], keep = False)]
duplicate_records = politicians[politicians.duplicated(subset=['name', 'url', 'country'], keep = False)]

In [243]:
print(f'''There are {duplicate_articles.shape[0]} duplicate articles ''')
print(f'''There are {duplicate_records.shape[0]} duplicate records ''')

There are 98 duplicate articles 
There are 4 duplicate records 


In [244]:
duplicate_records

Unnamed: 0,name,url,country
6198,Abdirahman Aw Ali Farrah,https://en.wikipedia.org/wiki/Abdirahman_Aw_Al...,Somalia
6231,Ibrahim Megag Samatar,https://en.wikipedia.org/wiki/Ibrahim_Megag_Sa...,Somalia
6295,Abdirahman Aw Ali Farrah,https://en.wikipedia.org/wiki/Abdirahman_Aw_Al...,Somalia
6309,Ibrahim Megag Samatar,https://en.wikipedia.org/wiki/Ibrahim_Megag_Sa...,Somalia


In [245]:
politicians = politicians.drop_duplicates()
politicians.shape

(7582, 3)

## Countries with 0 population
Some countries appear with a population of 0 because the values are reported in millions and rounded to one decimal place: a country with fewer than about 50,000 people is rounded off to 0.0.

In [246]:
pop_0 = population[population['Population (millions)'] == 0]
pop_0.shape[0]

6

In [247]:
pop_0.head()

Unnamed: 0,Geography,Population (millions)
183,Liechtenstein,0.0
185,Monaco,0.0
211,San Marino,0.0
223,Nauru,0.0
226,Palau,0.0


#### We will keep them in the data for completeness. However, while rank ordering we will keep in mind how to deal with these values
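
A minimal sketch of why these rows need care when rank ordering, using hypothetical toy values rather than the real data: dividing an article count by a population of 0.0 yields `inf` in pandas, so such rows float to the top of a descending sort unless they are filtered out. The column name `articles_per_million` is an assumption made for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: populations are in millions, and one row was rounded to 0.0
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "population": [0.0, 0.1, 50.0],
    "article_count": [3, 4, 10],
})

# A positive count divided by 0.0 follows floating-point rules and becomes inf
toy["articles_per_million"] = toy["article_count"] / toy["population"]

# Keep only rows with a finite ratio before ranking
ranked = toy[np.isfinite(toy["articles_per_million"])].sort_values(
    "articles_per_million", ascending=False
)
print(ranked)
```

Whether to drop these rows or report them separately is a judgment call; the sketch only shows one way to keep them from dominating the ranking.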

In [248]:
population.head()

Unnamed: 0,Geography,Population (millions)
0,WORLD,7963.0
1,AFRICA,1419.0
2,NORTHERN AFRICA,251.0
3,Algeria,44.9
4,Egypt,103.5


In [249]:
population['Region'] = population['Geography'].map(lambda name: name if name.isupper() else None)

In [250]:
population.head()

Unnamed: 0,Geography,Population (millions),Region
0,WORLD,7963.0,WORLD
1,AFRICA,1419.0,AFRICA
2,NORTHERN AFRICA,251.0,NORTHERN AFRICA
3,Algeria,44.9,
4,Egypt,103.5,


In [251]:
population['Region'] = population['Region'].ffill()

In [252]:
population.head()

Unnamed: 0,Geography,Population (millions),Region
0,WORLD,7963.0,WORLD
1,AFRICA,1419.0,AFRICA
2,NORTHERN AFRICA,251.0,NORTHERN AFRICA
3,Algeria,44.9,NORTHERN AFRICA
4,Egypt,103.5,NORTHERN AFRICA


In [253]:
population=population[population['Geography']!=population['Region']]
population.head()

Unnamed: 0,Geography,Population (millions),Region
3,Algeria,44.9,NORTHERN AFRICA
4,Egypt,103.5,NORTHERN AFRICA
5,Libya,6.8,NORTHERN AFRICA
6,Morocco,36.7,NORTHERN AFRICA
7,Sudan,46.9,NORTHERN AFRICA
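
The region assignment in the cells above can be illustrated on a small toy table. This is a sketch under the same assumption the notebook makes, namely that region rows are exactly the rows whose name is written in all capitals; the second region and its country are added only for illustration.

```python
import pandas as pd

# Toy hierarchy: all-caps rows are regions, the rows below them are countries
toy = pd.DataFrame({
    "Geography": ["WORLD", "AFRICA", "NORTHERN AFRICA", "Algeria", "Egypt",
                  "WESTERN AFRICA", "Benin"],
})

# Mark region rows, leaving country rows empty
toy["Region"] = toy["Geography"].where(toy["Geography"].str.isupper())

# Forward fill so each country inherits the closest region listed above it
toy["Region"] = toy["Region"].ffill()

# Drop the region rows themselves, keeping only countries
countries = toy[toy["Geography"] != toy["Region"]]
print(countries)
```

Because the fill always takes the most recent all-caps row, each country lands in the lowest region of the hierarchy that precedes it.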


Each API request is made through a single reusable procedure. The procedure is parameterized, but relies on the constants above for the important parameters. The page info request below takes an article title (or several titles joined with "|") as its main parameter; the ORES request defined later takes the revision IDs through article_revid.

## Get quality of article

In [254]:
#########
#
#    PROCEDURES/FUNCTIONS
#
def request_pageinfo_per_article(article_title = None, 
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT, 
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):
    # Make sure we have an article title
    if not article_title: return None
    
    request_template['titles'] = article_title
        
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response


In [255]:
ARTICLE_TITLES = politicians['name'].unique()
#Keep only unique values

In [256]:
ARTICLE_TITLES.dtype

dtype('O')

In [152]:
ARTICLE_TITLES

array(['Shahjahan Noori', 'Abdul Ghafar Lakanwal', 'Majah Ha Adrif', ...,
       'Langton Towungana', 'Herbert Ushewokunze', 'Denis Walker'],
      dtype=object)

In [257]:
data={'pages':{}} # dictionary to collect the metadata of the articles
no_response=[] # list to collect articles that the API reports as missing
n=len(ARTICLE_TITLES)

for i in range(0,n,50):
    article_arg='|'.join(ARTICLE_TITLES[i:min((i+50),n)]) # request 50 articles at once
    info = request_pageinfo_per_article(article_arg)
    if info is None: # the request itself failed, skip this batch
        continue
    pages = info['query']['pages']
    # missing pages come back under negative ids (-1, -2, ...) and carry a 'missing' key
    for page_id in [k for k, v in pages.items() if 'missing' in v]:
        no_response.append(pages.pop(page_id))
    data['pages'].update(pages) # add this batch to the collected pages
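
The batching used in the loop above (50 titles per request, joined with `|`) can be isolated into a small helper. This is only a sketch: `chunked` is a hypothetical name, and the titles are placeholders.

```python
def chunked(items, size=50):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Placeholder titles standing in for ARTICLE_TITLES
titles = [f"Title {i}" for i in range(120)]

# Each batch becomes one pipe-separated argument for a single request
batches = ["|".join(batch) for batch in chunked(titles)]
print(len(batches))
```

Keeping the slicing in one place makes it easier to reuse the same batch size for the revision IDs later on.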

In [258]:
no_response

[{'ns': 0,
  'title': 'Prince Ofosu Sefah',
  'missing': '',
  'contentmodel': 'wikitext',
  'pagelanguage': 'en',
  'pagelanguagehtmlcode': 'en',
  'pagelanguagedir': 'ltr',
  'fullurl': 'https://en.wikipedia.org/wiki/Prince_Ofosu_Sefah',
  'editurl': 'https://en.wikipedia.org/w/index.php?title=Prince_Ofosu_Sefah&action=edit',
  'canonicalurl': 'https://en.wikipedia.org/wiki/Prince_Ofosu_Sefah'},
 {'ns': 0,
  'title': 'Harjit Kaur Talwandi',
  'missing': '',
  'contentmodel': 'wikitext',
  'pagelanguage': 'en',
  'pagelanguagehtmlcode': 'en',
  'pagelanguagedir': 'ltr',
  'fullurl': 'https://en.wikipedia.org/wiki/Harjit_Kaur_Talwandi',
  'editurl': 'https://en.wikipedia.org/w/index.php?title=Harjit_Kaur_Talwandi&action=edit',
  'canonicalurl': 'https://en.wikipedia.org/wiki/Harjit_Kaur_Talwandi'},
 {'ns': 0,
  'title': 'Abd al-Razzaq al-Hasani',
  'missing': '',
  'contentmodel': 'wikitext',
  'pagelanguage': 'en',
  'pagelanguagehtmlcode': 'en',
  'pagelanguagedir': 'ltr',
  'fullurl

In [259]:
with open('error_log.txt', 'w') as f:
    for item in no_response:
        text = "Unable to get the information for: {}".format(item['title'])
        print(text)
        f.write(text + '\n')


Unable to get the information for: Prince Ofosu Sefah
Unable to get the information for: Harjit Kaur Talwandi
Unable to get the information for: Abd al-Razzaq al-Hasani
Unable to get the information for: Abiodun Abimbola Orekoya
Unable to get the information for: Roman Konoplev


In [260]:
data_list=[] 
for k,v in data['pages'].items():
    data_list.append(v)
data_articles= pd.DataFrame(data_list)
data_articles.head()

Unnamed: 0,pageid,ns,title,contentmodel,pagelanguage,pagelanguagehtmlcode,pagelanguagedir,touched,lastrevid,length,talkid,fullurl,editurl,canonicalurl,watchers,redirect,new
0,65412901,0,Abas Basir,wikitext,en,en,ltr,2022-10-11T01:20:40Z,1098419766,19306,65415333.0,https://en.wikipedia.org/wiki/Abas_Basir,https://en.wikipedia.org/w/index.php?title=Aba...,https://en.wikipedia.org/wiki/Abas_Basir,,,
1,27428272,0,Abdul Baqi Turkistani,wikitext,en,en,ltr,2022-10-11T03:06:55Z,889226470,1297,27595416.0,https://en.wikipedia.org/wiki/Abdul_Baqi_Turki...,https://en.wikipedia.org/w/index.php?title=Abd...,https://en.wikipedia.org/wiki/Abdul_Baqi_Turki...,,,
2,42972519,0,Abdul Ghafar Lakanwal,wikitext,en,en,ltr,2022-09-26T05:36:04Z,943562276,4165,42972696.0,https://en.wikipedia.org/wiki/Abdul_Ghafar_Lak...,https://en.wikipedia.org/w/index.php?title=Abd...,https://en.wikipedia.org/wiki/Abdul_Ghafar_Lak...,,,
3,29443640,0,Abdul Ghani Ghani,wikitext,en,en,ltr,2022-10-10T23:29:32Z,1072441893,1352,29453228.0,https://en.wikipedia.org/wiki/Abdul_Ghani_Ghani,https://en.wikipedia.org/w/index.php?title=Abd...,https://en.wikipedia.org/wiki/Abdul_Ghani_Ghani,,,
4,44098744,0,Abdul Malik Hamwar,wikitext,en,en,ltr,2022-10-10T23:30:44Z,1100874645,3512,44237349.0,https://en.wikipedia.org/wiki/Abdul_Malik_Hamwar,https://en.wikipedia.org/w/index.php?title=Abd...,https://en.wikipedia.org/wiki/Abdul_Malik_Hamwar,,,


In [261]:
data_articles.shape

(7529, 17)

In [262]:
data_articles = data_articles.drop_duplicates()
data_articles.shape

(7529, 17)

In [263]:
#########
#
#    CONSTANTS
#

# The current ORES API endpoint
API_ORES_SCORE_ENDPOINT = "https://ores.wikimedia.org/v3"
# A template for mapping to the URL
API_ORES_SCORE_PARAMS = "/scores/{context}/?models={models}&revids={revids}"

# Use some delays so that we do not hammer the API with our requests
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<ananya03@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2022'
}

# A dictionary of English Wikipedia article titles (keys) and sample revision IDs that can be used for this ORES scoring example
ARTICLE_REVISIONS = { 'Bison':1085687913 , 'Northern flicker':1086582504 , 'Red squirrel':1083787665 , 'Chinook salmon':1085406228 , 'Horseshoe bat':1060601936 }

# This template lists the basic parameters for making an ORES request
ORES_PARAMS_TEMPLATE = {
    "context": "enwiki",        # which WMF project for the specified revids
    "revids" : "",               # the revision(s) to be scored - this will probably change each call
    "models": "articlequality"   # the AI/ML scoring model to apply to the revision(s)
}
#
# The current ML models for English wikipedia are:
#   "articlequality"
#   "articletopic"
#   "damaging"
#   "version"
#   "draftquality"
#   "drafttopic"
#   "goodfaith"
#   "wp10"
#
# The specific documentation on these is scattered so if you want to use one you'll have to look around.
#

In [264]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_ores_score_per_article(article_revid = None, 
                                   endpoint_url = API_ORES_SCORE_ENDPOINT, 
                                   endpoint_params = API_ORES_SCORE_PARAMS, 
                                   request_template = ORES_PARAMS_TEMPLATE,
                                   headers = REQUEST_HEADERS,
                                   features=False):
    # Make sure we have an article revision id
    if not article_revid: return None
    
    # set the revision id into the template
    request_template['revids'] = article_revid
    
    # now, create a request URL by combining the endpoint_url with the parameters for the request
    request_url = endpoint_url+endpoint_params.format(**request_template)
    
    # the features used by the ML model can sometimes be returned as well as scores
    # (the URL template already contains a query string, so append with '&')
    if features:
        request_url = request_url+"&features=true"
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(request_url, headers=headers)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response
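
A sketch of how the request URL is assembled from the constants above, without making any network call. `build_ores_url` is a hypothetical helper written for this illustration; since the template already contains a `?`, an extra parameter is appended with `&`.

```python
API_ORES_SCORE_ENDPOINT = "https://ores.wikimedia.org/v3"
API_ORES_SCORE_PARAMS = "/scores/{context}/?models={models}&revids={revids}"

def build_ores_url(revids, context="enwiki", models="articlequality", features=False):
    """Return the scoring URL for a batch of revision ids (hypothetical helper)."""
    url = API_ORES_SCORE_ENDPOINT + API_ORES_SCORE_PARAMS.format(
        context=context,
        models=models,
        revids="|".join(str(r) for r in revids),
    )
    if features:
        # the template already has a query string, so continue it with '&'
        url += "&features=true"
    return url

print(build_ores_url([1085687913, 1086582504]))
```

Printing the URL before sending it is a cheap way to check that a batch of revision IDs was joined as intended.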


In [265]:
data_articles['lastrevid'].dtype

dtype('int64')

In [266]:
data_articles['lastrevid']=data_articles['lastrevid'].astype('str')

In [267]:
#Find the list of rev ids to iterate over
rev_ids = list(data_articles.lastrevid.unique())

In [268]:
rev_ids=list(rev_ids)

In [269]:
len(data_articles)

7529

In [278]:
pred={'scores':{}}
no_response2={'scores':{}} 
n=len(rev_ids)
for i in range(0,n,50):
    revidss='|'.join(rev_ids[i:min((i+50),n)])
    info = request_ores_score_per_article(revidss)
    if info is None: # no API response for this batch
        continue
    pred['scores'].update(info['enwiki']['scores'])

In [279]:
no_response2

{'scores': {}}

In [280]:
len(pred['scores'])

7529

### Check if scores for all articles are available

In [281]:
revids = []
pred1 = []
for revid, v in pred['scores'].items():
    revids.append(revid)
    pred1.append(v['articlequality']['score']['prediction'])

In [282]:
len(revids)

7529
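
Counting the entries confirms that every revision came back, but not that every entry holds a prediction. The sketch below separates the two cases on a hypothetical fragment shaped like the `pred['scores']` dictionary above. The shape of the failed entry (an `error` key in place of `score`) is an assumption made for illustration, not something shown in this notebook's output.

```python
# Hypothetical fragment; the revision ids and the 'error' entry are made up
toy_scores = {
    "1001": {"articlequality": {"score": {"prediction": "GA"}}},
    "1002": {"articlequality": {"error": {"message": "revision not found"}}},
}

predictions, unscored = {}, []
for revid, models in toy_scores.items():
    # .get avoids a KeyError when an entry has no 'score'
    score = models.get("articlequality", {}).get("score")
    if score is None:
        unscored.append(revid)
    else:
        predictions[revid] = score["prediction"]

print(predictions, unscored)
```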

In [285]:
final_df = pd.DataFrame({'revid':revids,'pred':pred1})

In [286]:
final_df.shape

(7529, 2)

# Combining the Data

In [287]:
temp = data_articles[['title', 'lastrevid']]
print(temp.shape)

(7529, 2)


In [288]:
temp = temp[~temp['lastrevid'].isnull()]
print(temp.shape)

(7529, 2)


In [289]:
df_politicians = politicians.merge(temp, left_on = "name", right_on = "title", how = 'inner')
print(df_politicians.shape)
df_politicians.head()

(7577, 5)


Unnamed: 0,name,url,country,title,lastrevid
0,Shahjahan Noori,https://en.wikipedia.org/wiki/Shahjahan_Noori,Afghanistan,Shahjahan Noori,1099689043
1,Abdul Ghafar Lakanwal,https://en.wikipedia.org/wiki/Abdul_Ghafar_Lak...,Afghanistan,Abdul Ghafar Lakanwal,943562276
2,Majah Ha Adrif,https://en.wikipedia.org/wiki/Majah_Ha_Adrif,Afghanistan,Majah Ha Adrif,852404094
3,Haroon al-Afghani,https://en.wikipedia.org/wiki/Haroon_al-Afghani,Afghanistan,Haroon al-Afghani,1095102390
4,Tayyab Agha,https://en.wikipedia.org/wiki/Tayyab_Agha,Afghanistan,Tayyab Agha,1104998382


In [290]:
temp2 = df_politicians.merge(final_df,left_on = "lastrevid", right_on = "revid" , how = 'left')
temp2 = temp2.drop(['title', 'url', 'revid'], axis = 1)
print(temp2.shape)
temp2.head()

(7577, 4)


Unnamed: 0,name,country,lastrevid,pred
0,Shahjahan Noori,Afghanistan,1099689043,GA
1,Abdul Ghafar Lakanwal,Afghanistan,943562276,Start
2,Majah Ha Adrif,Afghanistan,852404094,Start
3,Haroon al-Afghani,Afghanistan,1095102390,B
4,Tayyab Agha,Afghanistan,1104998382,Start


In [291]:
temp3 = temp2.merge(population, left_on = 'country', right_on = 'Geography', how = 'outer')
temp3.head()

Unnamed: 0,name,country,lastrevid,pred,Geography,Population (millions),Region
0,Shahjahan Noori,Afghanistan,1099689043,GA,Afghanistan,41.1,SOUTH ASIA
1,Abdul Ghafar Lakanwal,Afghanistan,943562276,Start,Afghanistan,41.1,SOUTH ASIA
2,Majah Ha Adrif,Afghanistan,852404094,Start,Afghanistan,41.1,SOUTH ASIA
3,Haroon al-Afghani,Afghanistan,1095102390,B,Afghanistan,41.1,SOUTH ASIA
4,Tayyab Agha,Afghanistan,1104998382,Start,Afghanistan,41.1,SOUTH ASIA


In [293]:
no_match_found1 = temp3[temp3['country'].isnull()]['Geography'].unique()
len(no_match_found1)

25

In [294]:
no_match_found2 = temp3[temp3['Geography'].isnull()]['country'].unique()
no_match_countries = np.append(no_match_found1, no_match_found2)
no_match_countries.sort()
len(no_match_countries)

26

In [329]:
no_match_countries

array(['Australia', 'Brunei', 'Canada', 'China,  Hong Kong SAR',
       'China,  Macao SAR', 'Curacao', 'French Guiana',
       'French Polynesia', 'Guadeloupe', 'Guam', 'Ireland', 'Kiribati',
       'Korean', 'Martinique', 'Mauritius', 'Mayotte', 'New Caledonia',
       'New Zealand', 'Philippines', 'Puerto Rico', 'Reunion',
       'Sao Tome and Principe', 'United Kingdom', 'United States',
       'Western Sahara', 'eSwatini'], dtype=object)

In [330]:
with open('wp_countries-no_match.txt', 'w') as f:
  for i in no_match_countries:
    f.write(i)
    f.write('\n')
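
An alternative way to find the unmatched countries, sketched on toy data: pandas `merge` accepts `indicator=True`, which adds a `_merge` column saying whether each row came from the left table, the right table, or both. The frames and values below are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the politicians and population tables
articles = pd.DataFrame({"country": ["A", "A", "B"], "name": ["p1", "p2", "p3"]})
pop = pd.DataFrame({"Geography": ["A", "C"], "Population (millions)": [1.5, 2.0]})

merged = articles.merge(
    pop, left_on="country", right_on="Geography", how="outer", indicator=True
)

# Countries present in only one of the two tables
only_articles = merged.loc[merged["_merge"] == "left_only", "country"].unique()
only_population = merged.loc[merged["_merge"] == "right_only", "Geography"].unique()
no_match = sorted(set(only_articles) | set(only_population))
print(no_match)
```

This gives the same kind of list as the null checks above, with the source of each mismatch stated explicitly.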

### Consolidate the remaining data into a single CSV file called wp_politicians_by_country.csv


In [302]:
final_df = temp3[(~temp3['country'].isnull()) & (~temp3['Geography'].isnull())]

In [303]:
final_df = final_df.drop('Geography', axis = 1)
final_df = final_df.rename(columns={'Population (millions)': 'population', 'Region': 'region' ,
                                    'name': 'article_title', 'lastrevid': 'revision_id', 'pred': 'article_quality'})
final_df.head()

Unnamed: 0,article_title,country,revision_id,article_quality,population,region
0,Shahjahan Noori,Afghanistan,1099689043,GA,41.1,SOUTH ASIA
1,Abdul Ghafar Lakanwal,Afghanistan,943562276,Start,41.1,SOUTH ASIA
2,Majah Ha Adrif,Afghanistan,852404094,Start,41.1,SOUTH ASIA
3,Haroon al-Afghani,Afghanistan,1095102390,B,41.1,SOUTH ASIA
4,Tayyab Agha,Afghanistan,1104998382,Start,41.1,SOUTH ASIA


In [304]:
final_df.to_csv('wp_politicians_by_country.csv', index=False)

In [305]:
final_df.shape

(7507, 6)

# Step 4: Analysis
Your analysis will consist of calculating total-articles-per-population (a ratio representing the number of articles per person)  and high-quality-articles-per-population (a ratio representing the number of high quality articles per person) on a country-by-country and regional basis. All of these values are to be “per capita”.
In this analysis a country can only exist in one region. The population_by_country_2022.csv actually represents regions in a hierarchical order. For your analysis always put a country in the closest (lowest in the hierarchy) region.
For this analysis you should consider "high quality" articles to be articles that ORES predicted would be in either the "FA" (featured article) or "GA" (good article) classes.
Also, keep in mind that the population_by_country_2022.csv file provides population in millions. The calculated proportions in this step are likely to be very small numbers.
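
A minimal sketch of the two ratios on hypothetical toy data. Because the population column is in millions, dividing an article count by it directly gives articles per million people; multiplying the population by 1,000,000 first gives a true per-person value. The summary column names are assumptions made for this illustration.

```python
import pandas as pd

# Toy data: one row per article, population in millions
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B"],
    "population": [2.0, 2.0, 2.0, 0.5],
    "article_quality": ["FA", "Stub", "GA", "Start"],
})

# "High quality" means the predicted class is FA or GA
toy["is_high_quality"] = toy["article_quality"].isin(["FA", "GA"])

summary = toy.groupby("country").agg(
    population_millions=("population", "first"),
    total_articles=("article_quality", "size"),
    high_quality_articles=("is_high_quality", "sum"),
).reset_index()

# Convert millions to people before dividing
people = summary["population_millions"] * 1_000_000
summary["articles_per_capita"] = summary["total_articles"] / people
summary["hq_articles_per_capita"] = summary["high_quality_articles"] / people
print(summary)
```

The rank order is the same whichever unit is used; only the scale of the reported numbers changes.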

In [307]:
temp4 = final_df[['country','region', 'population']]
temp4 = temp4.drop_duplicates()
regional_population = temp4[['region', 'population']].groupby('region').sum().reset_index()
regional_population.head()

Unnamed: 0,region,population
0,CARIBBEAN,39.5
1,CENTRAL AMERICA,177.9
2,CENTRAL ASIA,78.0
3,EAST ASIA,1665.8
4,EASTERN AFRICA,470.3


In [308]:
regional_population['population'] = regional_population['population'].round().astype('int')

In [309]:
# Merge region wise population back to the dataset

In [310]:
total_articles = final_df[['region', 'article_title']]
total_articles = total_articles.groupby(['region']).nunique().reset_index()
total_articles.head()

Unnamed: 0,region,article_title
0,CARIBBEAN,201
1,CENTRAL AMERICA,193
2,CENTRAL ASIA,103
3,EAST ASIA,246
4,EASTERN AFRICA,646


In [311]:
total_articles = total_articles.merge(regional_population, on = "region", how = "inner")
total_articles.head()

Unnamed: 0,region,article_title,population
0,CARIBBEAN,201,40
1,CENTRAL AMERICA,193,178
2,CENTRAL ASIA,103,78
3,EAST ASIA,246,1666
4,EASTERN AFRICA,646,470


In [312]:
total_articles['articles_per_capita'] = total_articles['article_title'] / (total_articles['population'] )
total_articles = total_articles.sort_values(by = ['articles_per_capita'])
total_articles.head()

Unnamed: 0,region,article_title,population,articles_per_capita
3,EAST ASIA,246,1666,0.147659
11,SOUTH ASIA,644,2009,0.320557
12,SOUTHEAST ASIA,410,560,0.732143
7,NORTHERN AFRICA,227,251,0.904382
6,MIDDLE AFRICA,203,196,1.035714


In [313]:
#Country wise articles per capita

In [314]:
counrty_per_capita = final_df[['country', 'population', 'article_title']].groupby(['country', 'population']).nunique().reset_index()
counrty_per_capita['articles_per_capita'] = counrty_per_capita['article_title'] / (counrty_per_capita['population'])
counrty_per_capita = counrty_per_capita.sort_values(by = ['articles_per_capita'])
counrty_per_capita.head()

Unnamed: 0,country,population,article_title,articles_per_capita
32,China,1436.6,2,0.001392
106,Mexico,127.5,1,0.007843
140,Saudi Arabia,36.7,3,0.081744
134,Romania,19.0,2,0.105263
73,India,1417.2,178,0.1256


In [265]:
# Repeating for high quality articles

In [316]:
high_quality = final_df[(final_df['article_quality'] == 'FA') | (final_df['article_quality'] == 'GA')]

In [317]:
total_articles_h = high_quality[['region', 'article_title']]
total_articles_h = total_articles_h.groupby(['region']).nunique().reset_index()
total_articles_h = total_articles_h.merge(regional_population, on = "region", how = "inner")
total_articles_h['articles_per_capita'] = total_articles_h['article_title'] / (total_articles_h['population'] )
total_articles_h = total_articles_h.sort_values(by = ['articles_per_capita'])
total_articles_h.head()

Unnamed: 0,region,article_title,population,articles_per_capita
3,EAST ASIA,16,1666,0.009604
11,SOUTH ASIA,23,2009,0.011448
6,MIDDLE AFRICA,5,196,0.02551
7,NORTHERN AFRICA,7,251,0.027888
10,SOUTH AMERICA,13,434,0.029954


In [318]:
#Country wise articles per capita
counrty_per_capita_h = high_quality[['country', 'population', 'article_title']].groupby(['country', 'population']).nunique().reset_index()
counrty_per_capita_h['articles_per_capita'] = counrty_per_capita_h['article_title'] / (counrty_per_capita_h['population'])
counrty_per_capita_h = counrty_per_capita_h.sort_values(by = ['articles_per_capita'])
counrty_per_capita_h.shape

(93, 4)

# Step 5

Your results from this analysis will be produced in the form of data tables. You are being asked to produce six total tables, that show:

### 1. Top 10 countries by coverage: The 10 countries with the highest total articles per capita (in descending order) 

In [343]:
counrty_per_capita1 = counrty_per_capita.sort_values(by = ['articles_per_capita'], ascending = False)

counrty_top10 = counrty_per_capita1.head(10)['country']
counrty_top10

108                            Monaco
139                        San Marino
125                             Palau
172                            Tuvalu
95                      Liechtenstein
115                             Nauru
5                 Antigua and Barbuda
54     Federated States of Micronesia
3                             Andorra
13                           Barbados
Name: country, dtype: object

### If we remove infinity values

In [344]:
counrty_per_capita1[counrty_per_capita1['articles_per_capita']!=np.inf].head(10)[['country']]

Unnamed: 0,country
5,Antigua and Barbuda
54,Federated States of Micronesia
3,Andorra
13,Barbados
104,Marshall Islands
143,Seychelles
110,Montenegro
97,Luxembourg
18,Bhutan
64,Grenada


### 2. Bottom 10 countries by coverage: The 10 countries with the lowest total articles per capita (in ascending order) 

In [345]:
counrty_per_capita1 = counrty_per_capita.sort_values(by = ['articles_per_capita'], ascending = True)

counrty_bottom10 = counrty_per_capita1.head(10)['country']
counrty_bottom10

32            China
106          Mexico
140    Saudi Arabia
134         Romania
73            India
153       Sri Lanka
48            Egypt
53         Ethiopia
161          Taiwan
180         Vietnam
Name: country, dtype: object

# High quality

### 3. Top 10 countries by high quality: The 10 countries with the highest high quality articles per capita (in descending order)

In [346]:
counrty_per_capita1 = counrty_per_capita_h.sort_values(by = ['articles_per_capita'], ascending = False)
hq_country_top10  = counrty_per_capita1.head(10)['country']
hq_country_top10

86                   Tuvalu
2                   Andorra
53               Montenegro
1                   Albania
80                 Suriname
9        Bosnia-Herzegovina
49                Lithuania
19                  Croatia
74                 Slovenia
61    Palestinian Territory
Name: country, dtype: object

### If we remove infinity values

In [347]:
counrty_per_capita1[counrty_per_capita1['articles_per_capita']!=np.inf].head(10)[['country']]

Unnamed: 0,country
2,Andorra
53,Montenegro
1,Albania
80,Suriname
9,Bosnia-Herzegovina
49,Lithuania
19,Croatia
74,Slovenia
61,Palestinian Territory
28,Gabon


### 4. Bottom 10 countries by high quality: The 10 countries with the lowest high quality articles per capita (in ascending order)

In [322]:
counrty_per_capita1 = counrty_per_capita_h.sort_values(by = ['articles_per_capita'], ascending = True)
hq_country_bottom10  = counrty_per_capita1.head(10)['country']
hq_country_bottom10

35       India
84    Thailand
39       Japan
58     Nigeria
91     Vietnam
17    Colombia
87      Uganda
60    Pakistan
79       Sudan
37        Iran
Name: country, dtype: object

### 5. Geographic regions by total coverage: A rank ordered list of geographic regions (in descending order) by total articles per capita.

Here, the population of each region is calculated only from the countries for which we found a match. The idea is to exclude the population of countries with no match, because the articles for those countries are not counted either.
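
The idea can be sketched on hypothetical toy data: each matched country's population is counted once, summed per region, and then used as the denominator for that region's article count. Since the population is in millions, the resulting ratio is articles per million people.

```python
import pandas as pd

# Toy data: one row per article, population in millions
toy = pd.DataFrame({
    "article_title": ["p1", "p2", "p3", "p4"],
    "country": ["A", "A", "B", "C"],
    "region": ["R1", "R1", "R1", "R2"],
    "population": [10.0, 10.0, 5.0, 20.0],
})

# Count each matched country's population once, then sum per region
country_pop = toy[["country", "region", "population"]].drop_duplicates()
regional_population = country_pop.groupby("region")["population"].sum()

# Unique articles per region divided by the matched regional population
article_counts = toy.groupby("region")["article_title"].nunique()
articles_per_million = (article_counts / regional_population).sort_values(ascending=False)
print(articles_per_million)
```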

In [331]:
total_articles = total_articles.sort_values(by = ['articles_per_capita'], ascending = False)
total_articles

Unnamed: 0,region,article_title,population,articles_per_capita
8,NORTHERN EUROPE,261,34,7.676471
9,OCEANIA,86,12,7.166667
14,SOUTHERN EUROPE,879,151,5.821192
0,CARIBBEAN,201,40,5.025
17,WESTERN EUROPE,699,197,3.548223
5,EASTERN EUROPE,732,287,2.550523
16,WESTERN ASIA,684,294,2.326531
13,SOUTHERN AFRICA,118,68,1.735294
4,EASTERN AFRICA,646,470,1.374468
10,SOUTH AMERICA,577,434,1.329493


### 6. Geographic regions by high quality coverage: Rank ordered list of geographic regions (in descending order) by high quality articles per capita.

In [332]:
total_articles_h = total_articles_h.sort_values(by = ['articles_per_capita'], ascending = False)
total_articles_h

Unnamed: 0,region,article_title,population,articles_per_capita
14,SOUTHERN EUROPE,46,151,0.304636
8,NORTHERN EUROPE,8,34,0.235294
0,CARIBBEAN,8,40,0.2
9,OCEANIA,2,12,0.166667
5,EASTERN EUROPE,38,287,0.132404
17,WESTERN EUROPE,22,197,0.111675
16,WESTERN ASIA,28,294,0.095238
13,SOUTHERN AFRICA,4,68,0.058824
1,CENTRAL AMERICA,10,178,0.05618
12,SOUTHEAST ASIA,24,560,0.042857
