# Data Analysis: Wikimedia Article Quality Data

This notebook takes the final output from `data_wrangling_merge_csvs.ipynb` and performs required analyses on this file, outputting results in tabular format.

# First Step: Read in Files

The following packages are required for this notebook:

In [137]:
import pandas as pd
from IPython.display import display

Now we read in the file. We also read in the base population by country file for accurate counts of regional populations in analyses five and six.

In [138]:
#read in files
wp_politicians_by_country = pd.read_csv("../../data/final/wp_politicians_by_country.csv")
population_by_country = pd.read_csv("../../data/intermediate/population_by_country_AUG.2024.csv")

# ANALYSIS 1: Top 10 Countries By Coverage

First, we group by country and count articles to measure coverage. Note that the population is recorded as 0.0 for some countries, so the per capita division returns `inf` for them.

In [121]:
#group by country and find number of articles
article_coverage_df = wp_politicians_by_country.groupby('country').agg(article_count = ('country', 'size'),
                                                                       population = ('population', 'mean'),
                                                                       region = ('region', 'first')).reset_index()
#Per Capita Metric
article_coverage_df["Article Count Per Million People"] = article_coverage_df['article_count'] / article_coverage_df['population']
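
Because a population of 0.0 makes the division return `inf`, one option is to treat zero populations as missing before dividing. The following is a minimal sketch on made-up numbers, not part of the original analysis; the `toy_coverage` dataframe and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: population in millions, one country recorded as 0.0
toy_coverage = pd.DataFrame({
    "country": ["A", "B", "C"],
    "article_count": [5, 33, 16],
    "population": [0.0, 0.1, 1411.3],
})

# Replace zero populations with NaN so the ratio becomes NaN instead of inf
safe_population = toy_coverage["population"].replace(0, np.nan)
toy_coverage["Article Count Per Million People"] = (
    toy_coverage["article_count"] / safe_population
)

# Rows with NaN sort last by default, so they no longer top the ranking
ranked = toy_coverage.sort_values(
    "Article Count Per Million People", ascending=False
).reset_index(drop=True)
print(ranked)
```

Whether such countries should be dropped, shown as missing, or given a more precise population figure is a judgment call that depends on the goals of the analysis.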

Next, we sort the coverage dataframe by articles per million people, keep the top 10 countries, and display the result in the Jupyter notebook.

In [122]:
#Sort input dataframe, pick top 10, and remove index
top10_article_coverage = article_coverage_df.sort_values("Article Count Per Million People", ascending = False)[0:10].reset_index(drop = True)
top10_article_coverage = top10_article_coverage.drop(['article_count'], axis = 1)
#display the dataframe
display(top10_article_coverage.style)

| | country | population | region | Article Count Per Million People |
|---|---|---|---|---|
| 0 | Tuvalu | 0.0 | OCEANIA | inf |
| 1 | Monaco | 0.0 | WESTERN EUROPE | inf |
| 2 | Antigua and Barbuda | 0.1 | CARIBBEAN | 330.0 |
| 3 | Federated States of Micronesia | 0.1 | OCEANIA | 140.0 |
| 4 | Marshall Islands | 0.1 | OCEANIA | 130.0 |
| 5 | Tonga | 0.1 | OCEANIA | 100.0 |
| 6 | Barbados | 0.3 | CARIBBEAN | 83.333333 |
| 7 | Montenegro | 0.6 | SOUTHERN EUROPE | 60.0 |
| 8 | Seychelles | 0.1 | EASTERN AFRICA | 60.0 |
| 9 | Maldives | 0.6 | SOUTH ASIA | 55.0 |


# ANALYSIS 2: Bottom 10 Countries by Coverage

We have already generated article count by country. Therefore, we can use the previous dataframe, change the sort order, and display the dataframe.

In [123]:
#Sort input dataframe, pick bottom 10, and remove index
bottom10_article_coverage = article_coverage_df.sort_values("Article Count Per Million People", ascending = True)[0:10].reset_index(drop = True)
#drop unnecessary columns
bottom10_article_coverage = bottom10_article_coverage.drop(['article_count'], axis = 1)
#display the dataframe
display(bottom10_article_coverage.style)

| | country | population | region | Article Count Per Million People |
|---|---|---|---|---|
| 0 | China | 1411.3 | EAST ASIA | 0.011337 |
| 1 | India | 1428.6 | SOUTH ASIA | 0.105698 |
| 2 | Ghana | 34.1 | WESTERN AFRICA | 0.117302 |
| 3 | Saudi Arabia | 36.9 | WESTERN ASIA | 0.135501 |
| 4 | Zambia | 20.2 | EASTERN AFRICA | 0.148515 |
| 5 | Norway | 5.5 | NORTHERN EUROPE | 0.181818 |
| 6 | Israel | 9.8 | WESTERN ASIA | 0.204082 |
| 7 | Egypt | 105.2 | NORTHERN AFRICA | 0.304183 |
| 8 | Cote d'Ivoire | 30.9 | WESTERN AFRICA | 0.323625 |
| 9 | Ethiopia | 126.5 | EASTERN AFRICA | 0.347826 |


# ANALYSIS 3: Top 10 Countries by High Quality Article Count

First, we convert each article's quality score into a boolean flag that is true when the score is a high quality class ("FA" or "GA"). Then we can sum the flags to count the high quality articles per country.

In [139]:
#find all possible article quality scores
set(wp_politicians_by_country.article_quality.tolist())

{'B', 'C', 'FA', 'GA', 'Start', 'Stub', nan}

In [140]:
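#init dict mapping each quality class to high quality (True) or not (False)
quality_dict = {'B' : False,
                'C' : False,
                'FA' : True,
                'GA' : True,
                'Start' : False,
                'Stub' : False}

#missing scores are float NaN, not the string 'nan', so default unmapped values to False
wp_politicians_by_country['High_Quality'] = wp_politicians_by_country['article_quality'].map(lambda q: quality_dict.get(q, False))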
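
Missing quality scores are stored as float NaN, which a dictionary key of the string 'nan' does not match. A small sketch on made-up data (the `scores` series is hypothetical) shows the difference, along with `isin` as one alternative way to build the flag:

```python
import numpy as np
import pandas as pd

# Hypothetical quality scores, including one missing value
scores = pd.Series(["FA", "Stub", np.nan, "GA"])

# A dict keyed on the string 'nan' leaves the real NaN unmapped
mapped = scores.map({"FA": True, "GA": True, "Stub": False, "nan": False})
print(mapped.isna().sum())

# isin() returns a plain boolean Series, with missing scores counted as False
high_quality = scores.isin(["FA", "GA"])
print(high_quality.tolist())
```

A plain boolean column also sums cleanly in a later `groupby`, without needing a separate numeric conversion.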

Here we aggregate by country and sum high quality article count.

In [141]:
#group by country, aggregate to find required fields
highquality_count_by_country = wp_politicians_by_country.groupby('country').agg(high_quality_count = ("High_Quality", 'sum'),
                                                                                population = ('population', 'mean'),
                                                                                region = ('region', 'first')).reset_index()


Next, we generate the per capita metric, sort by it, select the 10 countries with the most high quality articles per million people, and display the dataframe.

In [142]:
#convert high quality count to a numeric
highquality_count_by_country['high_quality_count'] = pd.to_numeric(highquality_count_by_country['high_quality_count'], errors='coerce')
#create per capita metric
highquality_count_by_country['High Quality Count Per Million People'] = highquality_count_by_country['high_quality_count'] / highquality_count_by_country['population']
#Sort input dataframe, pick top 10, and remove index
top10_highquality = highquality_count_by_country.sort_values('High Quality Count Per Million People', ascending = False)[0:10].reset_index(drop = True)
top10_highquality = top10_highquality.drop(['high_quality_count'], axis = 1)
#display the dataframe
display(top10_highquality.style)

| | country | population | region | High Quality Count Per Million People |
|---|---|---|---|---|
| 0 | Montenegro | 0.6 | SOUTHERN EUROPE | 5.0 |
| 1 | Luxembourg | 0.7 | WESTERN EUROPE | 2.857143 |
| 2 | Albania | 2.7 | SOUTHERN EUROPE | 2.592593 |
| 3 | Kosovo | 1.7 | SOUTHERN EUROPE | 2.352941 |
| 4 | Maldives | 0.6 | SOUTH ASIA | 1.666667 |
| 5 | Lithuania | 2.9 | NORTHERN EUROPE | 1.37931 |
| 6 | Croatia | 3.8 | SOUTHERN EUROPE | 1.315789 |
| 7 | Guyana | 0.8 | SOUTH AMERICA | 1.25 |
| 8 | Palestinian Territory | 5.5 | WESTERN ASIA | 1.090909 |
| 9 | Slovenia | 2.1 | SOUTHERN EUROPE | 0.952381 |


# ANALYSIS 4: Bottom 10 Countries by High Quality Article Count

We have already generated high quality article count by country. Therefore, we can use the previous dataframe, change the sort order, and display the dataframe.

In [146]:
#Sort input dataframe, pick bottom 10, and remove index
bottom10_highquality = highquality_count_by_country.sort_values('High Quality Count Per Million People', ascending = True)[0:10].reset_index(drop = True)
bottom10_highquality = bottom10_highquality.drop(['high_quality_count'], axis = 1)
#display the dataframe
display(bottom10_highquality.style)

| | country | population | region | High Quality Count Per Million People |
|---|---|---|---|---|
| 0 | Antigua and Barbuda | 0.1 | CARIBBEAN | 0.0 |
| 1 | Belize | 0.5 | CENTRAL AMERICA | 0.0 |
| 2 | Barbados | 0.3 | CARIBBEAN | 0.0 |
| 3 | Bahamas | 0.4 | CARIBBEAN | 0.0 |
| 4 | Cape Verde | 0.6 | WESTERN AFRICA | 0.0 |
| 5 | Botswana | 2.7 | SOUTHERN AFRICA | 0.0 |
| 6 | Bhutan | 0.8 | SOUTH ASIA | 0.0 |
| 7 | Benin | 13.7 | WESTERN AFRICA | 0.0 |
| 8 | China | 1411.3 | EAST ASIA | 0.0 |
| 9 | Chad | 18.3 | MIDDLE AFRICA | 0.0 |


However, all ten of these countries have 0 high quality articles per million people, so the ranking among them is not meaningful. Further exploration shows that many countries have a high quality article count of 0.

In [149]:
display(highquality_count_by_country[highquality_count_by_country['High Quality Count Per Million People'] == 0].reset_index(drop = True).style)

| | country | high_quality_count | population | region | High Quality Count Per Million People |
|---|---|---|---|---|---|
| 0 | Antigua and Barbuda | 0 | 0.1 | CARIBBEAN | 0.0 |
| 1 | Bahamas | 0 | 0.4 | CARIBBEAN | 0.0 |
| 2 | Barbados | 0 | 0.3 | CARIBBEAN | 0.0 |
| 3 | Belize | 0 | 0.5 | CENTRAL AMERICA | 0.0 |
| 4 | Benin | 0 | 13.7 | WESTERN AFRICA | 0.0 |
| 5 | Bhutan | 0 | 0.8 | SOUTH ASIA | 0.0 |
| 6 | Botswana | 0 | 2.7 | SOUTHERN AFRICA | 0.0 |
| 7 | Cape Verde | 0 | 0.6 | WESTERN AFRICA | 0.0 |
| 8 | Chad | 0 | 18.3 | MIDDLE AFRICA | 0.0 |
| 9 | China | 0 | 1411.3 | EAST ASIA | 0.0 |
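
Since so many countries tie at zero, it can help to report how many there are and, separately, to rank only the countries with at least one high quality article. A rough sketch under that assumption, using a hypothetical `toy_quality` dataframe rather than the real data:

```python
import pandas as pd

# Hypothetical per-country metric with several ties at zero
toy_quality = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E"],
    "High Quality Count Per Million People": [0.0, 0.0, 0.5, 0.0, 1.2],
})
metric = "High Quality Count Per Million People"

# How many countries have no high quality articles at all
zero_count = int((toy_quality[metric] == 0).sum())
print(zero_count)

# Lowest-ranked countries among those with a non-zero metric
nonzero_bottom = toy_quality[toy_quality[metric] > 0].nsmallest(2, metric)
print(nonzero_bottom["country"].tolist())
```

This is only one way to handle the ties; listing every zero-count country is an equally reasonable choice.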


# ANALYSIS 5: List of Geographic Regions with Total Articles Per Capita

First, we group by region and count the articles in each. We then merge with the population by country table to get an accurate population estimate for each region.

In [133]:
#group by region and find number of articles
article_coverage_df_region = wp_politicians_by_country.groupby('region').agg(article_count = ('region', 'size')).reset_index()
#merge to find accurate population counts
population_by_country.columns = ["region", "Population"]
article_coverage_df_region = pd.merge(article_coverage_df_region,
                                      population_by_country,
                                      how = "left",
                                      on = "region")
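
A left merge silently leaves `Population` empty for any region name that does not appear in the population table. One way to check for that is the `indicator` argument of `pd.merge`; the sketch below uses made-up region names and numbers, so treat it as an illustration rather than a check that was run on the real files.

```python
import pandas as pd

# Hypothetical regional article counts and populations (in millions)
toy_articles = pd.DataFrame({
    "region": ["NORTH", "SOUTH", "ATLANTIS"],
    "article_count": [10, 20, 3],
})
toy_population = pd.DataFrame({
    "region": ["NORTH", "SOUTH"],
    "Population": [45.0, 44.0],
})

# indicator=True adds a _merge column saying where each row was found
merged = pd.merge(toy_articles, toy_population,
                  how="left", on="region", indicator=True)

# Regions present in the article data but missing from the population table
unmatched = merged.loc[merged["_merge"] == "left_only", "region"].tolist()
print(unmatched)
```

An empty `unmatched` list would suggest that every region received a population value.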

Next, we construct the per capita metric, order by it, and display the table.

In [134]:
#per capita metric
article_coverage_df_region['Article Count Per Million People'] = article_coverage_df_region['article_count'] / article_coverage_df_region['Population']
article_coverage_df_region = article_coverage_df_region.drop(["article_count"], axis = 1)
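#add a row-position column to the country-level dataframe (it does not appear in the regional table)
article_coverage_df['rank'] = range(len(article_coverage_df))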
#order table, drop index, and display
display(article_coverage_df_region.sort_values('Article Count Per Million People', ascending = False).reset_index(drop=True).style)

| | region | Population | Article Count Per Million People |
|---|---|---|---|
| 0 | SOUTHERN EUROPE | 152.0 | 5.243421 |
| 1 | CARIBBEAN | 44.0 | 4.977273 |
| 2 | WESTERN EUROPE | 199.0 | 2.502513 |
| 3 | EASTERN EUROPE | 285.0 | 2.487719 |
| 4 | WESTERN ASIA | 299.0 | 2.040134 |
| 5 | NORTHERN EUROPE | 108.0 | 1.768519 |
| 6 | SOUTHERN AFRICA | 70.0 | 1.757143 |
| 7 | OCEANIA | 45.0 | 1.6 |
| 8 | EASTERN AFRICA | 483.0 | 1.376812 |
| 9 | SOUTH AMERICA | 426.0 | 1.335681 |


# ANALYSIS 6: List of Geographic Regions with Total High Quality Articles Per Capita

First, we aggregate high quality article count by region.

In [135]:
#group by region and find number of high quality articles
highquality_coverage_df_region = wp_politicians_by_country.groupby('region').agg(high_quality_count = ('High_Quality', 'sum')).reset_index()
#merge to find accurate population counts
highquality_coverage_df_region = pd.merge(highquality_coverage_df_region,
                                            population_by_country,
                                            how = "left",
                                            on = "region")

Next, we construct the per capita metric, order by it, and display the table.

In [136]:
#per capita metric
highquality_coverage_df_region['High Quality Count Per Million People'] = highquality_coverage_df_region['high_quality_count'] / highquality_coverage_df_region['Population']
highquality_coverage_df_region = highquality_coverage_df_region.drop(["high_quality_count"], axis = 1)
highquality_coverage_df_region = highquality_coverage_df_region.sort_values('High Quality Count Per Million People', ascending = False)
highquality_coverage_df_region['rank'] = range(len(highquality_coverage_df_region))
#show table
display(highquality_coverage_df_region.reset_index(drop=True).style)

| | region | Population | High Quality Count Per Million People | rank |
|---|---|---|---|---|
| 0 | SOUTHERN EUROPE | 152.0 | 0.348684 | 0 |
| 1 | CARIBBEAN | 44.0 | 0.204545 | 1 |
| 2 | EASTERN EUROPE | 285.0 | 0.133333 | 2 |
| 3 | SOUTHERN AFRICA | 70.0 | 0.114286 | 3 |
| 4 | WESTERN EUROPE | 199.0 | 0.105528 | 4 |
| 5 | WESTERN ASIA | 299.0 | 0.090301 | 5 |
| 6 | NORTHERN EUROPE | 108.0 | 0.083333 | 6 |
| 7 | NORTHERN AFRICA | 256.0 | 0.066406 | 7 |
| 8 | CENTRAL ASIA | 80.0 | 0.0625 | 8 |
| 9 | CENTRAL AMERICA | 182.0 | 0.054945 | 9 |
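
The rank column above is assigned by row position after sorting, so it depends on the sort having been applied first. As an alternative sketch, `Series.rank` derives the rank from the metric itself; the toy values here are hypothetical, and the rank starts at 1 rather than 0.

```python
import pandas as pd

# Hypothetical regional metric, deliberately left unsorted
toy_regions = pd.DataFrame({
    "region": ["NORTH", "SOUTH", "EAST"],
    "High Quality Count Per Million People": [0.10, 0.35, 0.20],
})

# Rank 1 is the largest value; method="min" gives tied regions the same rank
toy_regions["rank"] = (
    toy_regions["High Quality Count Per Million People"]
    .rank(ascending=False, method="min")
    .astype(int)
)
print(toy_regions.sort_values("rank").reset_index(drop=True))
```

Ranking this way keeps tied regions on the same rank and does not depend on the number of rows in the table.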
