## Imports

In [67]:
import pandas as pd
import numpy as np
from tqdm import tqdm
import json

from src.data_query import request_pageinfo_per_article, request_ores_score_per_article
from src.data_process import merge_wiki_page_and_population
from src.global_constants import GlobalConstants
import os


## Reading data

We would take in two datasets: <br>
1. **politicians by country** contains the politicians, their wiki page links, as well as their origin country.
2. **population by country** contains each country's population as well as their region in a hierarchical manner.

In [54]:
wiki_pages_df = pd.read_csv('politicians_by_country_SEPT.2022.csv')

populations_df = pd.read_csv('population_by_country_2022.csv')

articles = wiki_pages_df['name']

## Making requests

For each of the article in **politicians by country** dataset, we request the page info as well as the ORES predictions. We store the page info in `info_json` and ORES predictions in `ores_json`. The `global_constants` variable is created for storing all the constants that could be referred to in `src/global_constants.py`.

In [4]:
global_constants = GlobalConstants(wiki_pages_df)
info_json = {}
ores_json = {}

for article in tqdm(articles):
    info = request_pageinfo_per_article(article_title=article, gc=global_constants)
    article_revid = global_constants.ores_constants.ARTICLE_REVISIONS[article]
    ores = request_ores_score_per_article(article_revid=article_revid, gc=global_constants)

    if info is not None:
        info_json.update(info)
    
    if ores is not None:
        ores_json.update(ores)

100%|██████████| 200/200 [00:59<00:00,  3.37it/s]


Saving the JSON files in `/data/`.

In [None]:
# if not os.path.exists(os.path.join(global_constants.project_root, 'data')):
#     os.mkdir('data')

# with open(os.path.join(global_constants.project_root, "data", "info.json"), "w") as f:
#     json.dump(info_json, f, indent=4)

# with open(os.path.join(global_constants.project_root, "data", "ores.json"), "w") as f:
#     json.dump(ores_json, f, indent=4)

# with open(os.path.join(global_constants.project_root, 'data', 'error_files.txt'), "w") as f:
#     for error_file in global_constants.error_files:
#         f.write(error_file + '\n')

### Reading from JSON files

We now read the JSON files containing the page infos and ORES predictions. 

In [55]:
with open(os.path.join(global_constants.project_root, "data", "info.json"), "r") as f:
    info_json = json.load(f)

with open(os.path.join(global_constants.project_root, "data", "ores.json"), "r") as f:
    ores_json = json.load(f)

## Merging into a dataframe
`merge_wiki_page_and_population` combines the politicians, their countries, and article quality, and so on into a structured table from their original JSON files. It also outputs countries with no matching articles in `data/wp_countries-no_match.txt`. It can be found in `src/data_process.py` for more information. <br>
We can output the merged df to `/data/wp_politicians_by_country.csv`.

In [98]:
merged_df = merge_wiki_page_and_population(wiki_pages_df=wiki_pages_df, pop_df=populations_df, info_dict=info_json, ores_dict=ores_json, gc=global_constants)
merged_df.to_csv(os.path.join(global_constants.project_root, 'data', 'wp_politicians_by_country.csv'))

 54%|█████▍    | 4062/7526 [00:02<00:01, 2081.61it/s]

Error occurred during processing Bak Jungyang. 'Korean'
Error occurred during processing Kim Gap-sun. 'Korean'
Error occurred during processing Bak Gyusu. 'Korean'
Error occurred during processing Bak Jeongyang. 'Korean'
Error occurred during processing Chang Deok-soo. 'Korean'
Error occurred during processing Cho Bong-am. 'Korean'
Error occurred during processing Cho Man-sik. 'Korean'
Error occurred during processing Choe Bu. 'Korean'
Error occurred during processing Choe Yun-ui. 'Korean'
Error occurred during processing Chough Pyung-ok. 'Korean'
Error occurred during processing Gwon Sang-ha. 'Korean'
Error occurred during processing Han Eum. 'Korean'
Error occurred during processing Heo Jeok. 'Korean'
Error occurred during processing Hong Jung-wook. 'Korean'
Error occurred during processing Hong U-won. 'Korean'
Error occurred during processing Hwang Gi-cheon. 'Korean'
Error occurred during processing Isabu. 'Korean'
Error occurred during processing Jang Hyeongwang. 'Korean'
Error occ

100%|██████████| 7526/7526 [00:04<00:00, 1779.12it/s]


In [99]:
merged_df.head()

Unnamed: 0,country,region,continent,population,article_title,revision_id,article_quality,region_population
0,Afghanistan,SOUTH ASIA,ASIA,41.1,Shahjahan Noori,1099689043,GA,2008.0
0,Afghanistan,SOUTH ASIA,ASIA,41.1,Abdul Ghafar Lakanwal,943562276,Start,2008.0
0,Afghanistan,SOUTH ASIA,ASIA,41.1,Majah Ha Adrif,852404094,Start,2008.0
0,Afghanistan,SOUTH ASIA,ASIA,41.1,Haroon al-Afghani,1095102390,B,2008.0
0,Afghanistan,SOUTH ASIA,ASIA,41.1,Tayyab Agha,1104998382,Start,2008.0


## Analysis

For this analysis, we calculate **total-articles-per-population** and **high-quality-articles-per-population** for both country by country and regional basis.

First we filter all the high quality articles into `good_articles_df`.

In [100]:
good_articles_df = merged_df.loc[(merged_df['article_quality'] == "GA") | (merged_df['article_quality'] == 'SA')].copy()

### Total articles per population

For regular articles, we count the number of articles on country and regional basis, and divide their corresponding populations. If population is $0$, they would be filled with `nan`.

In [102]:
merged_df['country_article_count'] = merged_df.groupby('country')['article_title'].transform('count')
merged_df['region_article_count'] = merged_df.groupby('region')['article_title'].transform('count')
merged_df['articles_per_pop_country'] = merged_df.apply(lambda x: (x['country_article_count'] / x['population']) if x['population'] else np.nan, axis=1)
merged_df['articles_per_pop_region'] = merged_df.apply(lambda x: x['region_article_count'] / x['region_population'], axis=1)

The grouped dataframe is saved in `data/clean_data/tapp.csv`.

In [103]:
tapp_df = merged_df.groupby(['region', 'country'])['country_article_count', 'region_article_count', 'articles_per_pop_country', 'articles_per_pop_region'].agg('first')
tapp_df.to_csv(os.path.join(global_constants.project_root, 'data', 'clean_data', 'tapp.csv'))

  tapp_df = merged_df.groupby(['region', 'country'])['country_article_count', 'region_article_count', 'articles_per_pop_country', 'articles_per_pop_region'].agg('first')


### High quality articles per population

Next, we count the proportion of the high quality articles. The steps are the same with above except only high quality articles are taken into account.

In [104]:
good_articles_df['qual_country_article_count'] = good_articles_df.groupby('country')['article_title'].transform('count')
good_articles_df['qual_region_article_count'] = good_articles_df.groupby('region')['article_title'].transform('count')
good_articles_df['qual_articles_per_pop_country'] = good_articles_df.apply(lambda x: (x['qual_country_article_count'] / x['population']) if x['population'] else np.nan, axis=1)
good_articles_df['qual_articles_per_pop_region'] = good_articles_df.apply(lambda x: x['qual_region_article_count'] / x['region_population'], axis=1)

In [105]:
qapp_df = good_articles_df.groupby(['region', 'country'])['qual_country_article_count', 'qual_region_article_count', 'qual_articles_per_pop_country', 'qual_articles_per_pop_region'].agg('first')
qapp_df.to_csv(os.path.join(global_constants.project_root, 'data', 'clean_data', 'qapp.csv'))

  qapp_df = good_articles_df.groupby(['region', 'country'])['qual_country_article_count', 'qual_region_article_count', 'qual_articles_per_pop_country', 'qual_articles_per_pop_region'].agg('first')


## Generating results tables
We then generate six tables for analysis.

### Top 10 countries by coverage
The 10 countries with the highest total articles per capita (in descending order) .

In [107]:
result_df1 = tapp_df.reset_index().sort_values('articles_per_pop_country', ascending=False).reset_index().loc[:10, ['country', 'articles_per_pop_country']]
# result_df1

### Bottom 10 countries by coverage
The 10 countries with the lowest total articles per capita (in ascending order).

In [108]:
result_df2 = tapp_df.reset_index().sort_values('articles_per_pop_country').reset_index().loc[:10, ['country', 'articles_per_pop_country']]
# result_df2

### Top 10 countries by high quality
The 10 countries with the highest high quality articles per capita (in descending order).

In [None]:
result_df3 = qapp_df.reset_index().sort_values('qual_articles_per_pop_country', ascending=False).reset_index().loc[:10, ['country', 'qual_articles_per_pop_country']]
# result_df3

### Bottom 10 countries by high quality
The 10 countries with the lowest high quality articles per capita (in ascending order).

In [None]:
result_df4 = qapp_df.reset_index().sort_values('qual_articles_per_pop_country').reset_index().loc[:10, ['country', 'qual_articles_per_pop_country']]
# result_df4

### Geographic regions by total coverage
A rank ordered list of geographic regions (in descending order) by total articles per capita.

In [117]:
result_df5 = tapp_df.reset_index().sort_values('articles_per_pop_region', ascending=False).reset_index()[['country', 'articles_per_pop_region']]
result_df5

Unnamed: 0,country,articles_per_pop_region
0,Albania,5.821192
1,Andorra,5.821192
2,Bosnia-Herzegovina,5.821192
3,Croatia,5.821192
4,Greece,5.821192
...,...,...
179,Japan,0.145161
180,"Korea, North",0.145161
181,"Korea, South",0.145161
182,Mongolia,0.145161


### Geographic regions by high quality coverage
Rank ordered list of geographic regions (in descending order) by high quality articles per capita.

In [118]:
result_df6 = qapp_df.reset_index().sort_values('qual_articles_per_pop_region', ascending=False).reset_index()[['country', 'qual_articles_per_pop_region']]
result_df6

Unnamed: 0,country,qual_articles_per_pop_region
0,Cuba,0.181818
1,Haiti,0.181818
2,Dominican Republic,0.181818
3,Portugal,0.165563
4,Albania,0.165563
...,...,...
84,Iran,0.009960
85,Nepal,0.009960
86,Japan,0.009558
87,"Korea, North",0.009558
