# Data Analysis: Wikimedia Article Quality Data

This notebook takes the final output from `data_wrangling_merge_csvs.ipynb` and performs required analyses on this file, outputting results in tabular format.

# First Step: Read in Files

The following packages are required for this notebook:

In [24]:
import pandas as pd
from IPython.display import display

Now we read in the file. We also read in the base population by country file for accurate counts of regional populations in analyses five and six.

In [68]:
wp_politicians_by_country = pd.read_csv("../../data/final/wp_politicians_by_country.csv")
population_by_country = pd.read_csv("../../data/intermediate/population_by_country_AUG.2024.csv")

# ANALYSIS 1: Top 10 Countries By Coverage

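First we group by country and count articles to measure coverage, then divide the count by population. Note that the population is recorded as 0 for some countries (Tuvalu and Monaco in the output below), so their per capita value comes out as `inf` rather than a meaningful number.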

In [40]:
#group by country and find number of articles
article_coverage_df = wp_politicians_by_country.groupby('country').agg(article_count = ('country', 'size'),
                                                                       population = ('population', 'mean'),
                                                                       region = ('region', 'first')).reset_index()
#Per Capita Metrics
article_coverage_df["Article Count Per Million People"] = article_coverage_df['article_count'] / article_coverage_df['population']
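
One way to keep zero populations from producing `inf` is to treat them as missing before dividing. The sketch below uses a small made-up dataframe (`toy`, with illustrative values only) in place of `article_coverage_df`; whether countries with a recorded population of 0 should be excluded from the ranking is an assumption, not something the original analysis specifies.

```python
import numpy as np
import pandas as pd

# Toy stand-in for article_coverage_df; values are illustrative, not real data.
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "article_count": [10, 3, 6],
    "population": [2.0, 0.0, 4.0],
})

# Treat a population of 0 as missing so the ratio becomes NaN instead of inf.
safe_population = toy["population"].replace(0, np.nan)
toy["Article Count Per Million People"] = toy["article_count"] / safe_population

# Rows without a usable ratio can then be dropped before ranking.
ranked = toy.dropna(subset=["Article Count Per Million People"])
```

With this guard, country `B` no longer reaches the top of a descending sort just because its population is recorded as 0.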

Next, we sort the coverage dataframe by the per capita metric, keep the top 10 countries, and display the result in the Jupyter notebook.

In [55]:
#Sort input dataframe, pick top 10, and remove index
top10_article_coverage = article_coverage_df.sort_values("Article Count Per Million People", ascending = False)[0:10].reset_index(drop = True)
top10_article_coverage = top10_article_coverage.drop(['population', 'region', 'article_count'], axis = 1)
#display the dataframe
display(top10_article_coverage.style)

Unnamed: 0,country,Article Count Per Million People
0,Tuvalu,inf
1,Monaco,inf
2,Antigua and Barbuda,330.0
3,Federated States of Micronesia,140.0
4,Marshall Islands,130.0
5,Tonga,100.0
6,Barbados,83.333333
7,Montenegro,60.0
8,Seychelles,60.0
9,Maldives,55.0


# ANALYSIS 2: Bottom 10 Countries by Coverage

We have already generated the per capita article count by country. Therefore, we can reuse the previous dataframe, reverse the sort order, and display the 10 countries with the lowest coverage.

In [54]:
#Sort input dataframe, pick bottom 10, and remove index
bottom10_article_coverage = article_coverage_df.sort_values("Article Count Per Million People", ascending = True)[0:10].reset_index(drop = True)
#drop unnecessary columns
bottom10_article_coverage = bottom10_article_coverage.drop(['population', 'region', 'article_count'], axis = 1)
#display the dataframe
display(bottom10_article_coverage.style)

Unnamed: 0,country,Article Count Per Million People
0,China,0.011337
1,India,0.105698
2,Ghana,0.117302
3,Saudi Arabia,0.135501
4,Zambia,0.148515
5,Norway,0.181818
6,Israel,0.204082
7,Egypt,0.304183
8,Cote d'Ivoire,0.323625
9,Ethiopia,0.347826
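
The top 10 and bottom 10 tables differ only in sort direction, so the sort-and-trim step could be wrapped in a small helper. The function below, `rank_countries`, is a hypothetical name introduced for illustration and is not part of the original notebook; it is shown on made-up data.

```python
import pandas as pd

def rank_countries(df, metric, n=10, ascending=False):
    """Return the n rows with the highest (or lowest) value of `metric`."""
    ranked = df.sort_values(metric, ascending=ascending).head(n)
    # Keep only the columns shown in the output tables and reset the index.
    return ranked[["country", metric]].reset_index(drop=True)

# Toy data; values are illustrative only.
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "metric": [3.0, 1.0, 4.0, 2.0],
    "population": [1.0, 2.0, 3.0, 4.0],
})

top2 = rank_countries(toy, "metric", n=2)
bottom2 = rank_countries(toy, "metric", n=2, ascending=True)
```

Passing `ascending=True` is the only change needed to move from a top-N table to a bottom-N table.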


# ANALYSIS 3: Top 10 Countries by High Quality Article Count

First, we have to move from article quality classes to whether or not each article is high quality. We do this by assigning a boolean that is True when the quality class is "FA" or "GA" and False otherwise, including for articles with no quality score. Then we can sum the number of high quality articles per country.

In [43]:
# find all possible article quality classes
set(wp_politicians_by_country.article_quality.tolist())

{'B', 'C', 'FA', 'GA', 'Start', 'Stub', nan}
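
To see how many articles fall into each class, including those with no score, `value_counts(dropna=False)` can be used instead of a set. This is a minimal sketch on a made-up series standing in for the `article_quality` column.

```python
import numpy as np
import pandas as pd

# Toy stand-in for wp_politicians_by_country["article_quality"].
quality = pd.Series(["FA", "Stub", np.nan, "GA", "Stub", np.nan],
                    name="article_quality")

# Count every class; dropna=False keeps the missing scores as their own row.
counts = quality.value_counts(dropna=False)

# Number of articles with no quality score at all.
n_missing = int(quality.isna().sum())
```

Knowing the number of unscored articles makes it clearer how many rows are treated as not high quality by default.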

In [90]:
#flag an article as high quality when its class is FA or GA
#isin returns False for missing scores (NaN), so unscored articles count as not high quality
high_quality_classes = ['FA', 'GA']

wp_politicians_by_country['High_Quality'] = wp_politicians_by_country['article_quality'].isin(high_quality_classes)

Here we aggregate by country and sum high quality article count.

In [91]:
highquality_count_by_country = wp_politicians_by_country.groupby('country').agg(high_quality_count = ("High_Quality", "sum"),
                                                                                population = ('population', 'mean'),
                                                                                region = ('region', 'first')).reset_index()


Next, we generate the per capita metric, sort by it, select the top 10 countries, and display the dataframe.

In [53]:
#convert high quality count to a numeric
highquality_count_by_country['high_quality_count'] = pd.to_numeric(highquality_count_by_country['high_quality_count'], errors='coerce')
#create per capita metric
highquality_count_by_country['High Quality Count Per Million People'] = highquality_count_by_country['high_quality_count'] / highquality_count_by_country['population']
#Sort input dataframe, pick top 10, and remove index
top10_highquality = highquality_count_by_country.sort_values('High Quality Count Per Million People', ascending = False)[0:10].reset_index(drop = True)
top10_highquality = top10_highquality.drop(['population', 'region', 'high_quality_count'], axis = 1)
#display the dataframe
display(top10_highquality.style)

Unnamed: 0,country,High Quality Count Per Million People
0,Montenegro,5.0
1,Luxembourg,2.857143
2,Albania,2.592593
3,Kosovo,2.352941
4,Maldives,1.666667
5,Lithuania,1.37931
6,Croatia,1.315789
7,Guyana,1.25
8,Palestinian Territory,1.090909
9,Slovenia,0.952381


# ANALYSIS 4: Bottom 10 Countries by High Quality Article Count

We reuse the dataframe from Analysis 3, sort the per capita metric in ascending order, and display the 10 lowest-ranked countries.

In [52]:
#Sort input dataframe, pick bottom 10, and remove index
bottom10_highquality = highquality_count_by_country.sort_values('High Quality Count Per Million People', ascending = True)[0:10].reset_index(drop = True)
bottom10_highquality = bottom10_highquality.drop(['population', 'region', 'high_quality_count'], axis = 1)
#display the dataframe
display(bottom10_highquality.style)


# ANALYSIS 5: List of Geographic Regions with Total Articles Per Capita

First, we group by region and aggregate article count and population size. We also merge with the population by country table to find accurate population estimates per region.

In [93]:
#group by region and find number of articles
article_coverage_df_region = wp_politicians_by_country.groupby('region').agg(article_count = ('region', 'size')).reset_index()
#merge to find accurate population counts
population_by_country.columns = ["region", "Population"]
article_coverage_df_region = pd.merge(article_coverage_df_region,
                                      population_by_country,
                                      how = "left",
                                      on = "region")
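
A left merge silently leaves `Population` empty for any region name that does not match the population table. The sketch below, on made-up data, shows one way to check for that with the `validate` and `indicator` parameters of `pd.merge`. The region names and values are illustrative assumptions.

```python
import pandas as pd

# Toy stand-ins for the regional article counts and the population table.
articles = pd.DataFrame({"region": ["NORTH", "SOUTH", "EAST"],
                         "article_count": [5, 8, 2]})
population = pd.DataFrame({"region": ["NORTH", "SOUTH"],
                           "Population": [10.0, 4.0]})

# validate raises if a region appears more than once on either side;
# indicator adds a _merge column recording where each row was found.
checked = pd.merge(articles, population, how="left", on="region",
                   validate="one_to_one", indicator=True)

# Regions with no population row would otherwise yield NaN per capita values.
unmatched = checked.loc[checked["_merge"] == "left_only", "region"].tolist()
```

An empty `unmatched` list indicates that every region received a population figure.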

Next, we construct the per capita metric, order by it, and display the table.

In [94]:
#per capita metric
article_coverage_df_region['Article Count Per Million People'] = article_coverage_df_region['article_count'] / article_coverage_df_region['Population']
article_coverage_df_region = article_coverage_df_region.drop(["Population", "article_count"], axis = 1)
#show table
display(article_coverage_df_region.sort_values('Article Count Per Million People', ascending = False).style)

Unnamed: 0,region,Article Count Per Million People
14,SOUTHERN EUROPE,5.243421
0,CARIBBEAN,4.977273
17,WESTERN EUROPE,2.502513
5,EASTERN EUROPE,2.487719
16,WESTERN ASIA,2.040134
8,NORTHERN EUROPE,1.768519
13,SOUTHERN AFRICA,1.757143
9,OCEANIA,1.6
4,EASTERN AFRICA,1.376812
10,SOUTH AMERICA,1.335681


# ANALYSIS 6: List of Geographic Regions with Total High Quality Articles Per Capita

First, we aggregate high quality article count by region.

In [96]:
#group by region and find number of high quality articles
highquality_coverage_df_region = wp_politicians_by_country.groupby('region').agg(high_quality_count = ('High_Quality', 'sum')).reset_index()
#merge to find accurate population counts
highquality_coverage_df_region = pd.merge(highquality_coverage_df_region,
                                            population_by_country,
                                            how = "left",
                                            on = "region")
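
Analyses 5 and 6 share the same grouping and merge, so both regional metrics could also be computed in one pass. The following is a sketch under that assumption, using made-up data; `name` is a hypothetical column standing in for any per-article column.

```python
import pandas as pd

# Toy stand-in for wp_politicians_by_country, one row per article.
articles = pd.DataFrame({
    "name": ["p1", "p2", "p3", "p4", "p5"],
    "region": ["NORTH", "NORTH", "SOUTH", "SOUTH", "SOUTH"],
    "High_Quality": [True, False, False, False, True],
})
# Toy regional populations (illustrative values).
population = pd.DataFrame({"region": ["NORTH", "SOUTH"],
                           "Population": [4.0, 10.0]})

# Count all articles and high quality articles per region, then attach population.
by_region = (
    articles.groupby("region")
    .agg(article_count=("name", "size"),
         high_quality_count=("High_Quality", "sum"))
    .reset_index()
    .merge(population, how="left", on="region")
)

# Both per capita metrics come from the same merged table.
by_region["Article Count Per Million People"] = (
    by_region["article_count"] / by_region["Population"])
by_region["High Quality Count Per Million People"] = (
    by_region["high_quality_count"] / by_region["Population"])
```

Keeping both counts in one table avoids mixing columns from two separately built dataframes.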

Next, we construct the per capita metric, order by it, and display the table.

In [None]:
#per capita metric
highquality_coverage_df_region['High Quality Count Per Million People'] = highquality_coverage_df_region['high_quality_count'] / highquality_coverage_df_region['Population']
#show table (drop helper columns for display only, so later cells can still use them)
display(highquality_coverage_df_region.drop(["Population", "high_quality_count"], axis = 1).sort_values('High Quality Count Per Million People', ascending = False).style)

In [66]:
display(highquality_coverage_df_region.sort_values('high_quality_count', ascending = False).style)

Unnamed: 0,region,high_quality_count,population
14,SOUTHERN EUROPE,53,17956.6
5,EASTERN EUROPE,38,29044.9
16,WESTERN ASIA,27,13369.5
12,SOUTHEAST ASIA,25,45028.4
17,WESTERN EUROPE,21,18969.9
11,SOUTH ASIA,21,264161.6
10,SOUTH AMERICA,19,34514.1
4,EASTERN AFRICA,17,23941.2
7,NORTHERN AFRICA,17,12173.9
15,WESTERN AFRICA,13,58982.1
