# Step 1: Getting the Article and Population Data

The first step is getting the data, which lives in several different places. The Wikipedia politicians by country dataset can be found on Figshare. After extracting the download, the data file is called page_data.csv.
The population data is available in CSV format as WPDS_2020_data.csv. This dataset is drawn from the World Population Data Sheet published by the Population Reference Bureau.

In [49]:
import pandas as pd

In [50]:
# import data files
page_data = pd.read_csv('page_data.csv')
WPDS_data = pd.read_csv('WPDS_2020_data.csv')

In [51]:
page_data.head(5)

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


In [52]:
page_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47197 entries, 0 to 47196
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   page     47197 non-null  object
 1   country  47197 non-null  object
 2   rev_id   47197 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 1.1+ MB


In [53]:
WPDS_data.head(5)

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
0,WORLD,WORLD,World,2019,7772.85,7772850000
1,AFRICA,AFRICA,Sub-Region,2019,1337.918,1337918000
2,NORTHERN AFRICA,NORTHERN AFRICA,Sub-Region,2019,244.344,244344000
3,DZ,Algeria,Country,2019,44.357,44357000
4,EG,Egypt,Country,2019,100.803,100803000


# Step 2: Cleaning the Data

This step cleans both datasets. In page_data.csv, some page names start with the string "Template:". These pages are not Wikipedia articles and are excluded from the following analysis.
Similarly, WPDS_2020_data.csv contains some rows that provide cumulative regional population counts rather than country-level counts. These rows are distinguished by ALL CAPS values in the 'Name' field (e.g. AFRICA, OCEANIA). Each country is mapped to the sub-region it belongs to so that coverage and quality can be reported by region in the analysis section.

In [54]:
# Rename the 'FIPS' column to region
WPDS_data = WPDS_data.rename({'FIPS':'region'},axis=1)
# Change the type of "Channel Islands" to "Country" instead of "Sub-region"
WPDS_data.at[168,'Type'] = 'Country'
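
Hard-coding row 168 ties the fix to the row order of the CSV. As a minimal sketch, the same correction can be made by selecting the row by name. The toy frame below is an illustrative stand-in for WPDS_data, not the real dataset.

```python
import pandas as pd

# Toy stand-in for WPDS_data; only the columns needed for the illustration
toy = pd.DataFrame({
    'Name': ['NORTHERN EUROPE', 'Channel Islands', 'Denmark'],
    'Type': ['Sub-Region', 'Sub-Region', 'Country'],
})

# Select the row by its name rather than by a positional index
toy.loc[toy['Name'] == 'Channel Islands', 'Type'] = 'Country'
print(toy)
```

Selecting by name keeps working if rows are added or reordered in a later release of the file.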

In [55]:
# Collect the index labels of all rows with Type "Sub-Region"
condition = WPDS_data["Type"] == "Sub-Region"
region_indices = WPDS_data.index[condition].to_list()
# Map each country to the sub-region it belongs to: every row from a
# sub-region header onward gets that name, and later headers overwrite it
for indice in region_indices:
  WPDS_data.loc[indice:,'region'] = WPDS_data.at[indice,'Name']
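
The same mapping can also be expressed without a loop. The sketch below is an alternative under stated assumptions, not the method used in this notebook: it keeps the name only on sub-region rows and forward-fills it down to the country rows. The toy rows are illustrative.

```python
import pandas as pd

# Toy stand-in for WPDS_data with a few sub-region and country rows
toy = pd.DataFrame({
    'Name': ['AFRICA', 'NORTHERN AFRICA', 'Algeria', 'Egypt', 'WESTERN AFRICA', 'Benin'],
    'Type': ['Sub-Region', 'Sub-Region', 'Country', 'Country', 'Sub-Region', 'Country'],
})

# Keep the name only on sub-region rows, then carry it forward to the rows below
toy['region'] = toy['Name'].where(toy['Type'] == 'Sub-Region').ffill()
print(toy)
```

Rows that come before the first sub-region header would be left as NaN with this approach, so they need separate handling if they matter.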

In [56]:
WPDS_data.head(5)

Unnamed: 0,region,Name,Type,TimeFrame,Data (M),Population
0,WORLD,WORLD,World,2019,7772.85,7772850000
1,AFRICA,AFRICA,Sub-Region,2019,1337.918,1337918000
2,NORTHERN AFRICA,NORTHERN AFRICA,Sub-Region,2019,244.344,244344000
3,NORTHERN AFRICA,Algeria,Country,2019,44.357,44357000
4,NORTHERN AFRICA,Egypt,Country,2019,100.803,100803000


In [57]:
# Set aside the 'World' and 'Sub-Region' rows, then drop them from the country table
region_data = WPDS_data[WPDS_data['Type'].isin(['World','Sub-Region'])].copy()
WPDS_data.drop(region_data.index,inplace = True)
WPDS_data = WPDS_data.reset_index(drop = True)

In [58]:
WPDS_data.head(5)

Unnamed: 0,region,Name,Type,TimeFrame,Data (M),Population
0,NORTHERN AFRICA,Algeria,Country,2019,44.357,44357000
1,NORTHERN AFRICA,Egypt,Country,2019,100.803,100803000
2,NORTHERN AFRICA,Libya,Country,2019,6.891,6891000
3,NORTHERN AFRICA,Morocco,Country,2019,35.952,35952000
4,NORTHERN AFRICA,Sudan,Country,2019,43.849,43849000


In [59]:
# Drop all rows whose 'page' contains "Template:"
template = page_data[page_data['page'].str.contains('Template:')]
page_data.drop(template.index,inplace = True)
page_data = page_data.reset_index(drop = True)
page_data.head(5)

Unnamed: 0,page,country,rev_id
0,Bir I of Kanem,Chad,355319463
1,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188
2,Yos Por,Cambodia,393822005
3,Julius Gregr,Czech Republic,395521877
4,Edvard Gregr,Czech Republic,395526568


In [60]:
page_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46701 entries, 0 to 46700
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   page     46701 non-null  object
 1   country  46701 non-null  object
 2   rev_id   46701 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 1.1+ MB
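
Since the stated rule is that template pages start with "Template:", the filter can also be written as a prefix test with a boolean mask. This is a minimal sketch on a toy frame built from rows shown earlier, not a replacement for the cell above.

```python
import pandas as pd

# Toy stand-in for page_data, using rows shown earlier in this notebook
toy = pd.DataFrame({
    'page': ['Template:ZambiaProvincialMinisters', 'Bir I of Kanem', 'Template:Uganda-politician-stub'],
    'country': ['Zambia', 'Chad', 'Uganda'],
    'rev_id': [235107991, 355319463, 391862070],
})

# Keep only rows whose page name does not start with "Template:"
articles = toy[~toy['page'].str.startswith('Template:')].reset_index(drop=True)
print(articles)
```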


# Step 3: Getting Article Quality Predictions

In this step, I use a machine learning tool, the Objective Revision Evaluation Service (ORES), which provides estimates of Wikipedia article quality. Each article recorded in page_data.csv is assigned a quality estimate, which, from best to worst, can be:
FA - Featured article, GA - Good article, B - B-class article, C - C-class article, Start - Start-class article, Stub - Stub-class article

In [61]:
# install ores
!pip3 install ores



In [62]:
# Convert the revision IDs to a list of plain Python ints for the ORES client
revids = [int(rev_id) for rev_id in page_data['rev_id'].tolist()]

In [63]:
# Generate article quality prediction
from ores import api
ores_session = api.Session("https://ores.wikimedia.org", "DATA 512 Class project <sophiart@uw.edu>")
response = ores_session.score("enwiki", ["articlequality"], revids)

In [64]:
# Collect the predicted class for each revision, or the error type
# when ORES could not score it
result = []
for score in response:
  parent = score['articlequality']
  if 'error' in parent:
    temp = parent['error']['type']
  else:
    temp = parent['score']['prediction']
  result.append(temp)
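
To make the branching easier to check without calling the service, the logic can be isolated in a small function. `extract_prediction` is a hypothetical helper, and the sample items are hand-written to mirror only the keys accessed in the loop above; they are not real ORES output.

```python
# Hypothetical helper; the dictionary layout mirrors the keys accessed in the loop above
def extract_prediction(score_doc):
    """Return the predicted class, or the error type if scoring failed."""
    parent = score_doc['articlequality']
    if 'error' in parent:
        return parent['error']['type']
    return parent['score']['prediction']

# Hand-written sample items (not real ORES output)
sample = [
    {'articlequality': {'score': {'prediction': 'Stub'}}},
    {'articlequality': {'error': {'type': 'RevisionNotFound'}}},
]
labels = [extract_prediction(item) for item in sample]
print(labels)
```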

In [65]:
page_data['prediction'] = result
page_data.head(5)

Unnamed: 0,page,country,rev_id,prediction
0,Bir I of Kanem,Chad,355319463,Stub
1,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188,Stub
2,Yos Por,Cambodia,393822005,Stub
3,Julius Gregr,Czech Republic,395521877,Stub
4,Edvard Gregr,Czech Republic,395526568,Stub


In [66]:
# Keep only rows with a valid quality class, grouped in class order
quality_classes = ['FA', 'FL', 'A', 'GA', 'B', 'C', 'Start', 'Stub', 'List']
pred_data = pd.concat([page_data[page_data['prediction'] == c] for c in quality_classes])

In [67]:
page_data.drop(pred_data.index,inplace = True)
error_log = page_data.reset_index(drop = True)
error_log.head(5)

Unnamed: 0,page,country,rev_id,prediction
0,List of politicians in Poland,Poland,516633096,RevisionNotFound
1,Tingtingru,Vanuatu,550682925,RevisionNotFound
2,Daud Arsala,Afghanistan,627547024,RevisionNotFound
3,Book:Two Political Biographies,India,636911471,RevisionNotFound
4,Dilaver Bey,Turkey,669987106,RevisionNotFound


In [68]:
error_log.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 276 entries, 0 to 275
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   page        276 non-null    object
 1   country     276 non-null    object
 2   rev_id      276 non-null    int64 
 3   prediction  276 non-null    object
dtypes: int64(1), object(3)
memory usage: 8.8+ KB


In [69]:
pred_data = pred_data.reset_index(drop = True)
pred_data.head(5)

Unnamed: 0,page,country,rev_id,prediction
0,Mariano Rivera Paz,Guatemala,753360584,FA
1,Lê Văn Duyệt,Vietnam,763222145,FA
2,Haim Arlosoroff,Israel,775055838,FA
3,Sassoon Eskell,Iraq,776240985,FA
4,Malouma,Mauritania,780340163,FA


In [70]:
pred_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46425 entries, 0 to 46424
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   page        46425 non-null  object
 1   country     46425 non-null  object
 2   rev_id      46425 non-null  int64 
 3   prediction  46425 non-null  object
dtypes: int64(1), object(3)
memory usage: 1.4+ MB




# Step 4: Combining the Datasets




In this step, I merge the Wikipedia data and the population data. Both have fields containing country names for just that purpose. After merging, some entries cannot be matched: either the population dataset has no entry for the equivalent Wikipedia country, or vice versa. The unmatched entries are written to a CSV file called "wp_wpds_countries-no_match.csv", and the rest are stored in "wp_wpds_politicians_by_country.csv" for future use.
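
One way to separate matched from unmatched rows in a single pass is the `indicator` option of `merge`. The sketch below is an illustration on toy frames; "Hypothetical Land" and "Example Politician" are made-up placeholders, and the cells that follow use a different approach.

```python
import pandas as pd

# Toy stand-ins for the population and article tables
population = pd.DataFrame({
    'country': ['Algeria', 'Western Sahara'],
    'Population': [44357000, 597000],
})
articles = pd.DataFrame({
    'country': ['Algeria', 'Algeria', 'Hypothetical Land'],
    'page': ['Ahmed Ouyahia', 'Emir Abdelkader', 'Example Politician'],
})

# indicator=True adds a `_merge` column: 'both', 'left_only', or 'right_only'
merged = population.merge(articles, on='country', how='outer', indicator=True)
matched = merged[merged['_merge'] == 'both'].drop(columns='_merge')
unmatched = merged[merged['_merge'] != 'both'].drop(columns='_merge')
print(unmatched)
```

The `_merge` value also shows which side an unmatched row came from, which the null check alone does not state directly.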

In [71]:
WPDS_data = WPDS_data.drop(['Type','TimeFrame','Data (M)'], axis=1)
WPDS_data = WPDS_data.rename({'Name':'country'},axis=1)

In [72]:
# Merge page data with world population data
pg_wpds_match = WPDS_data.merge(pred_data, on='country',how='inner')
pg_wpds_match.head(5)

Unnamed: 0,region,country,Population,page,rev_id,prediction
0,NORTHERN AFRICA,Algeria,44357000,Ahmed Ouyahia,799959233,GA
1,NORTHERN AFRICA,Algeria,44357000,Mohamed Seghir Boushaki,805107344,GA
2,NORTHERN AFRICA,Algeria,44357000,Benyoucef Benkhedda,794381535,B
3,NORTHERN AFRICA,Algeria,44357000,Hayreddin Barbarossa,802441492,B
4,NORTHERN AFRICA,Algeria,44357000,Emir Abdelkader,803016375,B


In [73]:
pg_wpds_match.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 44568 entries, 0 to 44567
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   region      44568 non-null  object
 1   country     44568 non-null  object
 2   Population  44568 non-null  int64 
 3   page        44568 non-null  object
 4   rev_id      44568 non-null  int64 
 5   prediction  44568 non-null  object
dtypes: int64(2), object(4)
memory usage: 2.4+ MB


In [74]:
pg_wpds_nomatch = WPDS_data.merge(pred_data, on='country',how='outer')
pg_wpds_nomatch = pg_wpds_nomatch[pg_wpds_nomatch.isnull().any(axis=1)]
pg_wpds_nomatch.head(5)

Unnamed: 0,region,country,Population,page,rev_id,prediction
899,NORTHERN AFRICA,Western Sahara,597000.0,,,
1122,WESTERN AFRICA,Cote d'Ivoire,26175000.0,,,
4118,EASTERN AFRICA,Mayotte,284000.0,,,
4177,EASTERN AFRICA,Reunion,861000.0,,,
6061,MIDDLE AFRICA,"Congo, Dem. Rep.",89568000.0,,,


In [75]:
# Rename and reorder the columns for the output files
column_names = {'page':'article_name', 'rev_id':'revision_id', 'prediction':'article_quality_est.', 'Population':'population'}
column_order = ['country', 'article_name', 'revision_id','article_quality_est.','population']

final_match_data = pg_wpds_match.rename(columns=column_names)[column_order]
final_no_match_data = pg_wpds_nomatch.rename(columns=column_names)[column_order]

final_match_data.to_csv('wp_wpds_politicians_by_country.csv', index = False)
final_no_match_data.to_csv('wp_wpds_countries-no_match.csv',index = False)

In [76]:
error_log.to_csv('error_log.csv',index = False)

# Step 5: Analysis

The analysis will consist of calculating the proportion (as a percentage) of articles-per-population and high-quality articles for each country AND for each geographic region. By "high quality" articles, in this case we mean the number of articles about politicians in a given country that ORES predicted would be in either the "FA" (featured article) or "GA" (good article) classes.
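
As a worked example of the two percentages, the sketch below plugs in the figures for Algeria that appear in the country table later in this notebook (116 articles, 2 of them high quality, population 44,357,000).

```python
# Figures for Algeria as reported in the country table of this notebook
population = 44357000
total_articles = 116
high_quality_articles = 2

# Articles per population, expressed as a percentage
coverage_pct = total_articles / population * 100
# Share of articles predicted FA or GA, expressed as a percentage
quality_pct = high_quality_articles / total_articles * 100

print(round(coverage_pct, 6), round(quality_pct, 6))
```

The rounded values agree with the Algeria row of that table (0.000262 and 1.724138).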

In [77]:
# Calculate the total number of articles per country
country_total_article = pg_wpds_match.groupby(['country']).size().reset_index(name = 'total_article')

In [78]:
# Calculate the number of high-quality (FA or GA) articles per country
high_quality_data = pg_wpds_match[pg_wpds_match['prediction'].isin(['FA','GA'])]
country_hq_article = high_quality_data.groupby(['country']).size().reset_index(name = 'high_quality_article')

In [79]:
country_region_pop = pg_wpds_match.drop(['page','rev_id','prediction'], axis=1)

In [80]:
country_region_pop = country_region_pop.drop_duplicates()

In [81]:
country_analysis = country_region_pop.merge(country_total_article, on='country',how='outer')
country_analysis = country_analysis.merge(country_hq_article, on='country',how='outer')

In [82]:
import numpy as np
country_analysis = country_analysis.replace(np.nan,0)

In [83]:
# Calculate the total-article percentage and high-quality-article percentage per country
country_analysis['total_article_percentage (%)'] = country_analysis['total_article'] / country_analysis['Population'] * 100
country_analysis['high_quality_article_percentage (%)'] = country_analysis['high_quality_article'] / country_analysis['total_article'] * 100

In [84]:
# Calculate region features
region_total_article = pg_wpds_match.groupby(['region']).size().reset_index(name = 'total_article')
region_hq_article = high_quality_data.groupby(['region']).size().reset_index(name = 'high_quality_article')
region_population = region_data.drop(['Name','Type','TimeFrame','Data (M)'], axis = 1)
region_analysis = region_population.merge(region_total_article, on ='region', how = 'outer')
region_analysis = region_analysis.merge(region_hq_article, on ='region', how = 'outer')
region_analysis = region_analysis.drop(0)
region_analysis = region_analysis.replace(np.nan,0)

In [85]:
africa_data = pd.DataFrame(region_analysis[region_analysis['region'].str.contains('AFRICA')])
africa_total_article = africa_data['total_article'].sum()
africa_hq_article = africa_data['high_quality_article'].sum()
region_analysis.loc[region_analysis['region'] == 'AFRICA', 'total_article'] = africa_total_article
region_analysis.loc[region_analysis['region'] == 'AFRICA', 'high_quality_article'] = africa_hq_article

asia_data = pd.DataFrame(region_analysis[region_analysis['region'].str.contains('ASIA')])
asia_total_article = asia_data['total_article'].sum()
asia_hq_article = asia_data['high_quality_article'].sum()
region_analysis.loc[region_analysis['region'] == 'ASIA', 'total_article'] = asia_total_article
region_analysis.loc[region_analysis['region'] == 'ASIA', 'high_quality_article'] = asia_hq_article

europe_data = pd.DataFrame(region_analysis[region_analysis['region'].str.contains('EUROPE')])
europe_total_article = europe_data['total_article'].sum()
europe_hq_article = europe_data['high_quality_article'].sum()
region_analysis.loc[region_analysis['region'] == 'EUROPE', 'total_article'] = europe_total_article
region_analysis.loc[region_analysis['region'] == 'EUROPE', 'high_quality_article'] = europe_hq_article

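# Note: the 'AMERICA' substring also matches NORTHERN AMERICA, so its counts are part of this sum
america_data = pd.DataFrame(region_analysis[region_analysis['region'].str.contains('AMERICA')])
america_data = pd.concat([america_data, region_analysis[region_analysis['region'] == 'CARIBBEAN']])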
america_total_article = america_data['total_article'].sum()
america_hq_article = america_data['high_quality_article'].sum()
region_analysis.loc[region_analysis['region'] == 'LATIN AMERICA AND THE CARIBBEAN', 'total_article'] = america_total_article
region_analysis.loc[region_analysis['region'] == 'LATIN AMERICA AND THE CARIBBEAN', 'high_quality_article'] = america_hq_article
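
Substring matching on region names depends on how the names happen to be spelled. As a hedged alternative, the roll-up can use an explicit mapping from sub-region to parent region. `parent_of` is a hypothetical mapping introduced for this sketch, and the counts are the Africa sub-region figures from the region table in this notebook.

```python
import pandas as pd

# Sub-region totals for Africa as reported in the region table of this notebook
sub_regions = pd.DataFrame({
    'region': ['NORTHERN AFRICA', 'WESTERN AFRICA', 'EASTERN AFRICA', 'MIDDLE AFRICA', 'SOUTHERN AFRICA'],
    'total_article': [899, 2139, 2502, 665, 634],
    'high_quality_article': [19, 40, 35, 16, 9],
})

# Hypothetical explicit mapping from sub-region to parent region
parent_of = {name: 'AFRICA' for name in sub_regions['region']}

# Sum the sub-region counts under each parent region
rollup = (
    sub_regions.assign(parent=sub_regions['region'].map(parent_of))
    .groupby('parent')[['total_article', 'high_quality_article']]
    .sum()
)
print(rollup)
```

The sums (6839 articles, 119 high quality) agree with the AFRICA row of the region table. An explicit mapping also makes it clear which sub-regions feed each parent total.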

In [86]:
region_analysis['total_article_percentage (%)'] = region_analysis['total_article'] / region_analysis['Population'] * 100
region_analysis['high_quality_article_percentage (%)'] = region_analysis['high_quality_article'] / region_analysis['total_article'] * 100

In [87]:
country_analysis.head(5)

Unnamed: 0,region,country,Population,total_article,high_quality_article,total_article_percentage (%),high_quality_article_percentage (%)
0,NORTHERN AFRICA,Algeria,44357000,116,2.0,0.000262,1.724138
1,NORTHERN AFRICA,Egypt,100803000,234,10.0,0.000232,4.273504
2,NORTHERN AFRICA,Libya,6891000,110,4.0,0.001596,3.636364
3,NORTHERN AFRICA,Morocco,35952000,206,1.0,0.000573,0.485437
4,NORTHERN AFRICA,Sudan,43849000,95,2.0,0.000217,2.105263


In [88]:
region_analysis

Unnamed: 0,region,Population,total_article,high_quality_article,total_article_percentage (%),high_quality_article_percentage (%)
1,AFRICA,1337918000,6839.0,119.0,0.000511,1.74002
2,NORTHERN AFRICA,244344000,899.0,19.0,0.000368,2.113459
3,WESTERN AFRICA,401115000,2139.0,40.0,0.000533,1.870033
4,EASTERN AFRICA,444970000,2502.0,35.0,0.000562,1.398881
5,MIDDLE AFRICA,179757000,665.0,16.0,0.00037,2.406015
6,SOUTHERN AFRICA,67732000,634.0,9.0,0.000936,1.419558
7,NORTHERN AMERICA,368193000,1901.0,104.0,0.000516,5.470805
8,LATIN AMERICA AND THE CARIBBEAN,651036000,7171.0,180.0,0.001101,2.51011
9,CENTRAL AMERICA,178611000,1543.0,23.0,0.000864,1.490603
10,CARIBBEAN,43233000,695.0,13.0,0.001608,1.870504


# Step 6: Results

Results from this analysis will be published in the form of six data tables.

Top 10 countries by coverage: 10 highest-ranked countries in terms of number of politician articles as a proportion of country population

In [89]:
result_1 = country_analysis.sort_values('total_article_percentage (%)', ascending = False).reset_index(drop = True)
result_1[['country','total_article_percentage (%)']].head(10)

Unnamed: 0,country,total_article_percentage (%)
0,Tuvalu,0.54
1,Nauru,0.472727
2,San Marino,0.238235
3,Monaco,0.105263
4,Liechtenstein,0.071795
5,Marshall Islands,0.064912
6,Tonga,0.063636
7,Iceland,0.05462
8,Andorra,0.041463
9,Federated States of Micronesia,0.033962


Bottom 10 countries by coverage: 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population

In [90]:
result_2 = country_analysis.sort_values('total_article_percentage (%)', ascending = True).reset_index(drop = True)
result_2[['country','total_article_percentage (%)']].head(10)

Unnamed: 0,country,total_article_percentage (%)
0,India,6.9e-05
1,Indonesia,7.7e-05
2,China,8.1e-05
3,Uzbekistan,8.2e-05
4,Ethiopia,8.8e-05
5,Zambia,0.000136
6,"Korea, North",0.00014
7,Thailand,0.000168
8,Mozambique,0.000186
9,Bangladesh,0.000187
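
The sort-then-head pattern used for these rankings has a compact equivalent in `nlargest` and `nsmallest`. A minimal sketch on a few rows taken from the two coverage tables above:

```python
import pandas as pd

# A few rows taken from the two coverage tables above
toy = pd.DataFrame({
    'country': ['Tuvalu', 'Nauru', 'India', 'Indonesia'],
    'total_article_percentage (%)': [0.54, 0.472727, 6.9e-05, 7.7e-05],
})

# Highest and lowest coverage without sorting the whole frame explicitly
top = toy.nlargest(2, 'total_article_percentage (%)')
bottom = toy.nsmallest(2, 'total_article_percentage (%)')
print(top['country'].tolist(), bottom['country'].tolist())
```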


Top 10 countries by relative quality: 10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality

In [91]:
result_3 = country_analysis.sort_values('high_quality_article_percentage (%)', ascending = False).reset_index(drop = True)
result_3[['country','high_quality_article_percentage (%)']].head(10)

Unnamed: 0,country,high_quality_article_percentage (%)
0,"Korea, North",22.222222
1,Saudi Arabia,12.820513
2,Romania,12.244898
3,Central African Republic,12.121212
4,Uzbekistan,10.714286
5,Mauritania,10.416667
6,Guatemala,8.433735
7,Dominica,8.333333
8,Syria,7.8125
9,Benin,7.692308


Bottom 10 countries by relative quality: 10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality

In [92]:
result_4 = country_analysis.sort_values('high_quality_article_percentage (%)', ascending = True).reset_index(drop = True)
result_4[['country','high_quality_article_percentage (%)']].head(10)

Unnamed: 0,country,high_quality_article_percentage (%)
0,Belize,0.0
1,Guyana,0.0
2,Comoros,0.0
3,Djibouti,0.0
4,Eritrea,0.0
5,French Guiana,0.0
6,Antigua and Barbuda,0.0
7,Bahamas,0.0
8,Barbados,0.0
9,Turkmenistan,0.0


Geographic regions by coverage: Ranking of geographic regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population


In [93]:
result_5 = region_analysis.sort_values('total_article_percentage (%)', ascending = False).reset_index(drop = True)
result_5

Unnamed: 0,region,Population,total_article,high_quality_article,total_article_percentage (%),high_quality_article_percentage (%)
0,OCEANIA,43155000,3126.0,63.0,0.007244,2.015355
1,NORTHERN EUROPE,105990000,3763.0,102.0,0.00355,2.710603
2,SOUTHERN EUROPE,153251000,3710.0,74.0,0.002421,1.994609
3,WESTERN EUROPE,195479000,4560.0,56.0,0.002333,1.22807
4,EUROPE,746622000,15765.0,350.0,0.002112,2.220108
5,CARIBBEAN,43233000,695.0,13.0,0.001608,1.870504
6,EASTERN EUROPE,291902000,3732.0,118.0,0.001279,3.161844
7,LATIN AMERICA AND THE CARIBBEAN,651036000,7171.0,180.0,0.001101,2.51011
8,SOUTHERN AFRICA,67732000,634.0,9.0,0.000936,1.419558
9,WESTERN ASIA,280927000,2563.0,89.0,0.000912,3.472493


Geographic regions by quality: Ranking of geographic regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality

In [94]:
result_6 = region_analysis.sort_values('high_quality_article_percentage (%)', ascending = False).reset_index(drop = True)
result_6

Unnamed: 0,region,Population,total_article,high_quality_article,total_article_percentage (%),high_quality_article_percentage (%)
0,NORTHERN AMERICA,368193000,1901.0,104.0,0.000516,5.470805
1,SOUTHEAST ASIA,661845000,2020.0,73.0,0.000305,3.613861
2,WESTERN ASIA,280927000,2563.0,89.0,0.000912,3.472493
3,EASTERN EUROPE,291902000,3732.0,118.0,0.001279,3.161844
4,EAST ASIA,1641063000,2473.0,76.0,0.000151,3.07319
5,CENTRAL ASIA,74961000,245.0,7.0,0.000327,2.857143
6,NORTHERN EUROPE,105990000,3763.0,102.0,0.00355,2.710603
7,ASIA,4625927000,11667.0,316.0,0.000252,2.708494
8,LATIN AMERICA AND THE CARIBBEAN,651036000,7171.0,180.0,0.001101,2.51011
9,MIDDLE AFRICA,179757000,665.0,16.0,0.00037,2.406015
