# Data Analysis: Wikimedia Article Quality Data

This notebook reads the final output of `data_wrangling_merge_csvs.ipynb`, runs the required analyses on it, and displays each result as a table.

# First Step: Read in Files

The following packages are required for this notebook:

In [8]:
import pandas as pd
from IPython.display import display

Now we read in the file.

In [2]:
wp_politicians_by_country = pd.read_csv("../../data/final/wp_politicians_by_country.csv")

# ANALYSIS 1: Top 10 Countries By Coverage

First, we group by country and count the articles per country to measure coverage. Population values are in millions.

In [13]:
#group by country and find number of articles
article_coverage_df = wp_politicians_by_country.groupby('country').agg(article_count = ('country', 'size'),
                                                                       population = ('population', 'mean'),
                                                                       region = ('region', 'first')).reset_index()
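
As a minimal sketch of how the named aggregation above behaves, the following uses a made-up article table (the `toy_articles` frame and its values are hypothetical, not taken from the real CSV). Each keyword argument becomes an output column built from a `(source column, aggregation)` pair.

```python
import pandas as pd

# Hypothetical toy rows: one row per article, with the country's
# population (in millions) and region repeated on every row.
toy_articles = pd.DataFrame({
    "country": ["A", "A", "A", "B"],
    "population": [10.0, 10.0, 10.0, 2.5],
    "region": ["NORTH", "NORTH", "NORTH", "SOUTH"],
})

# Named aggregation: output column = (source column, aggregation).
toy_coverage = toy_articles.groupby("country").agg(
    article_count=("country", "size"),   # number of rows (articles) per country
    population=("population", "mean"),   # constant within a country, so the mean recovers it
    region=("region", "first"),          # region is also constant within a country
).reset_index()

print(toy_coverage)
```

Using `mean` for population only works because every row of a country carries the same value; `first` would give the same result under that assumption.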

Next, we sort the coverage dataframe by article count, keep the top 10 rows, and display the result in the Jupyter notebook.

In [14]:
#Sort by article count (descending) and keep the top 10 rows
top10_article_coverage = article_coverage_df.sort_values("article_count", ascending = False)[0:10]
#display the dataframe
display(top10_article_coverage.style)

Unnamed: 0,country,article_count,population,region
106,Nigeria,246,223.8,WESTERN AFRICA
116,Poland,159,37.7,EASTERN EUROPE
66,India,151,1428.6,SOUTH ASIA
71,Italy,148,58.8,SOUTHERN EUROPE
134,Spain,137,48.3,SOUTHERN EUROPE
75,Kenya,125,55.1,EASTERN AFRICA
53,France,119,65.9,WESTERN EUROPE
119,Russia,119,146.9,EASTERN EUROPE
72,Japan,117,124.5,EAST ASIA
67,Indonesia,113,278.7,SOUTHEAST ASIA
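
The sort-then-slice pattern used here can also be written with `DataFrame.nlargest`. The sketch below, on a hypothetical `toy_coverage` frame, shows that the two forms select the same rows when there are no ties at the cutoff.

```python
import pandas as pd

# Hypothetical per-country counts (not the real data).
toy_coverage = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "article_count": [5, 12, 7, 9],
})

# Form used in the notebook: sort descending, then take the first rows.
top2_sorted = toy_coverage.sort_values("article_count", ascending=False).head(2)

# Equivalent form: ask directly for the n largest values.
top2_nlargest = toy_coverage.nlargest(2, "article_count")

print(top2_sorted)
print(top2_nlargest)
```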


# ANALYSIS 2: Bottom 10 Countries by Coverage

We have already generated the article count by country. Therefore, we can reuse the previous dataframe, reverse the sort order, and display the result.

In [34]:
#Sort by article count (ascending) and keep the bottom 10 rows
bottom10_article_coverage = article_coverage_df.sort_values("article_count", ascending = True)[0:10]
#display the dataframe
display(bottom10_article_coverage.style)

Unnamed: 0,country,article_count,population,region
108,Norway,1,5.5,NORTHERN EUROPE
92,Malta,1,0.6,SOUTHERN EUROPE
154,Tuvalu,1,0.0,OCEANIA
59,Grenada,2,0.1,CARIBBEAN
70,Israel,2,9.8,WESTERN ASIA
47,Equatorial Guinea,2,1.7,MIDDLE AFRICA
137,St. Lucia,3,0.2,CARIBBEAN
20,Botswana,3,2.7,SOUTHERN AFRICA
164,Zambia,3,20.2,EASTERN AFRICA
136,St. Kitts and Nevis,3,0.1,CARIBBEAN
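
Several countries in this table share the same article count, so a fixed cutoff of ten rows may fall inside a group of tied countries. As a hedged sketch on made-up data (the `toy_counts` frame is hypothetical), `nsmallest` with `keep="all"` returns every row tied at the cutoff instead of an arbitrary subset.

```python
import pandas as pd

# Hypothetical counts with a three-way tie at 3 articles.
toy_counts = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E"],
    "article_count": [1, 3, 3, 3, 8],
})

# Default behaviour: exactly n rows, the first tied row wins.
bottom2 = toy_counts.nsmallest(2, "article_count")

# keep="all": include every row tied with the last selected value.
bottom2_with_ties = toy_counts.nsmallest(2, "article_count", keep="all")

print(bottom2)
print(bottom2_with_ties)
```

With `keep="all"` the result can contain more than `n` rows, which makes the tie visible rather than hiding it.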


# ANALYSIS 3: Top 10 Countries by High Quality Article Count

First, we have to convert each article's quality score into a flag indicating whether the article is high quality. We do this by assigning a boolean that is true when the score is "FA" or "GA" and false otherwise, including when the score is missing. Then we can sum the number of high quality articles per country.

In [20]:
# find all distinct article quality scores
set(wp_politicians_by_country.article_quality.tolist())

{'B', 'C', 'FA', 'GA', 'Start', 'Stub', nan}

In [23]:
#map each quality score to whether it counts as high quality
quality_dict = {'B' : False,
                'C' : False,
                'FA' : True,
                'GA' : True,
                'Start' : False,
                'Stub' : False}

#missing scores (NaN) are not keys of the dict, so .get() defaults them to False
wp_politicians_by_country['High_Quality'] = wp_politicians_by_country['article_quality'].map(lambda q: quality_dict.get(q, False))
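
A missing score is a floating-point NaN, not the string `'nan'`, so it needs explicit handling when scores are turned into booleans. As a minimal sketch on made-up scores (the `toy_quality` series is hypothetical), a plain dictionary `map` leaves the missing entry as NaN, while `isin` yields a clean boolean column.

```python
import numpy as np
import pandas as pd

# Hypothetical quality scores, including one missing value.
toy_quality = pd.Series(["FA", "Stub", np.nan, "GA", "B"])

# A dictionary with a *string* key 'nan' does not match the real NaN.
quality_dict = {"FA": True, "GA": True, "B": False, "Stub": False, "nan": False}
mapped = toy_quality.map(quality_dict)
print(mapped.isna().sum())  # the missing score is still missing after mapping

# isin treats NaN as "not in the list", so the result is strictly True/False.
high_quality = toy_quality.isin(["FA", "GA"])
print(high_quality.tolist())
```

A strictly boolean column sums to an integer count without any later numeric conversion.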

Here we aggregate by country and sum high quality article count.

In [31]:
highquality_count_by_country = wp_politicians_by_country.groupby('country').agg(high_quality_count = ("High_Quality", 'sum'),
                                                                                population = ('population', 'mean'),
                                                                                region = ('region', 'first')).reset_index()
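
Summing a boolean column counts its `True` values, which is what turns the per-article flag into a per-country count. A small sketch on hypothetical data (`toy_articles` is made up for illustration), passing the aggregation as the string `'sum'`:

```python
import pandas as pd

# Hypothetical article rows with a boolean high-quality flag.
toy_articles = pd.DataFrame({
    "country": ["A", "A", "B", "B", "B"],
    "High_Quality": [True, False, False, False, False],
})

# True counts as 1 and False as 0, so the sum is the number of flagged articles.
toy_hq_counts = toy_articles.groupby("country").agg(
    high_quality_count=("High_Quality", "sum"),
).reset_index()

print(toy_hq_counts)
```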


Next, we sort by high quality article count, keep the 10 countries with the highest counts, and display the dataframe.

In [38]:
#convert high quality count to a numeric
highquality_count_by_country['high_quality_count'] = pd.to_numeric(highquality_count_by_country['high_quality_count'], errors='coerce')
#Sort by high quality count (descending) and keep the top 10 rows
top10_highquality = highquality_count_by_country.sort_values("high_quality_count", ascending = False)[0:10]
#display the dataframe
display(top10_highquality.style)

Unnamed: 0,country,high_quality_count,population,region
134,Spain,18,48.3,SOUTHERN EUROPE
67,Indonesia,15,278.7,SOUTHEAST ASIA
119,Russia,9,146.9,EASTERN EUROPE
156,Ukraine,8,36.7,EASTERN EUROPE
132,South Africa,8,60.7,SOUTHERN AFRICA
1,Albania,7,2.7,SOUTHERN EUROPE
116,Poland,7,37.7,EASTERN EUROPE
141,Switzerland,7,8.8,WESTERN EUROPE
69,Iraq,7,45.5,WESTERN ASIA
21,Brazil,6,204.0,SOUTH AMERICA


# ANALYSIS 4: Bottom 10 Countries by High Quality Article Count

In [39]:
#Sort by high quality count (ascending) and keep the bottom 10 rows
bottom10_highquality = highquality_count_by_country.sort_values("high_quality_count", ascending = True)[0:10]
#display the dataframe
display(bottom10_highquality.style)

Unnamed: 0,country,high_quality_count,population,region
4,Antigua and Barbuda,0,0.1,CARIBBEAN
15,Belize,0,0.5,CENTRAL AMERICA
12,Barbados,0,0.3,CARIBBEAN
9,Bahamas,0,0.4,CARIBBEAN
27,Cape Verde,0,0.6,WESTERN AFRICA
20,Botswana,0,2.7,SOUTHERN AFRICA
17,Bhutan,0,0.8,SOUTH ASIA
16,Benin,0,13.7,WESTERN AFRICA
31,China,0,1411.3,EAST ASIA
29,Chad,0,18.3,MIDDLE AFRICA
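
Every row in this table has a count of zero, so which ten countries appear depends only on how the sort orders tied rows. One alternative, sketched here on hypothetical data (`toy_hq` is made up), is to list all countries with no high quality articles rather than cutting at ten.

```python
import pandas as pd

# Hypothetical per-country high quality counts.
toy_hq = pd.DataFrame({
    "country": ["D", "A", "C", "B", "E"],
    "high_quality_count": [0, 3, 0, 1, 0],
})

# Keep every country with zero high quality articles, in alphabetical order.
zero_hq = toy_hq[toy_hq["high_quality_count"] == 0].sort_values("country")

print(zero_hq)
print(len(zero_hq), "countries have no high quality articles")
```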


# ANALYSIS 5: List of Geographic Regions with Total Articles Per Capita

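First, we group by region and aggregate article count and population. Because the input has one row per article, summing `population` counts each country's population once per article, so the `population` column below is larger than the region's actual population.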

In [62]:
#group by region and find number of articles
article_coverage_df_region = wp_politicians_by_country.groupby('region').agg(article_count = ('region', 'size'),
                                                                       population = ('population', 'sum')).reset_index()

Next, we sort by `article_count` and display the table. The ranking uses raw article counts; no per capita value is computed in this step.

In [63]:
display(article_coverage_df_region.sort_values('article_count', ascending = False).style)

Unnamed: 0,region,article_count,population
14,SOUTHERN EUROPE,797,17956.6
5,EASTERN EUROPE,709,29044.9
11,SOUTH ASIA,670,264161.6
4,EASTERN AFRICA,665,23941.2
16,WESTERN ASIA,610,13369.5
10,SOUTH AMERICA,569,34514.1
15,WESTERN AFRICA,515,58982.1
17,WESTERN EUROPE,498,18969.9
12,SOUTHEAST ASIA,396,45028.4
7,NORTHERN AFRICA,302,12173.9
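
The table above ranks regions by raw article count. One way a per capita figure could be derived, sketched on hypothetical data, is to count articles from the article-level rows but take population from one row per country. The `toy_articles` frame and the `articles_per_million` column name are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical article rows; population (in millions) repeats on every row.
toy_articles = pd.DataFrame({
    "country": ["A", "A", "A", "B", "C"],
    "region": ["NORTH", "NORTH", "NORTH", "NORTH", "SOUTH"],
    "population": [10.0, 10.0, 10.0, 2.0, 4.0],
})

# Article counts come from the article-level rows.
region_articles = toy_articles.groupby("region").size().rename("article_count")

# Population comes from one row per country, so no country is counted twice.
region_population = (
    toy_articles.drop_duplicates("country")
    .groupby("region")["population"]
    .sum()
)

# Combine both on the shared region index and compute the ratio.
toy_region = pd.concat([region_articles, region_population], axis=1).reset_index()
toy_region["articles_per_million"] = toy_region["article_count"] / toy_region["population"]

print(toy_region.sort_values("articles_per_million", ascending=False))
```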


# ANALYSIS 6: List of Geographic Regions with Total High Quality Articles Per Capita

First, we aggregate high quality article count by region.

In [65]:
#group by region and sum the high quality articles
highquality_coverage_df_region = wp_politicians_by_country.groupby('region').agg(high_quality_count = ('High_Quality', 'sum'),
                                                                                population = ('population', 'sum')).reset_index()

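Next, we sort by `high_quality_count` and display the table. As in Analysis 5, `population` is summed over article rows and the ranking uses raw counts rather than per capita values.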

In [66]:
display(highquality_coverage_df_region.sort_values('high_quality_count', ascending = False).style)

Unnamed: 0,region,high_quality_count,population
14,SOUTHERN EUROPE,53,17956.6
5,EASTERN EUROPE,38,29044.9
16,WESTERN ASIA,27,13369.5
12,SOUTHEAST ASIA,25,45028.4
17,WESTERN EUROPE,21,18969.9
11,SOUTH ASIA,21,264161.6
10,SOUTH AMERICA,19,34514.1
4,EASTERN AFRICA,17,23941.2
7,NORTHERN AFRICA,17,12173.9
15,WESTERN AFRICA,13,58982.1
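
A per capita version of this table could be built by aggregating in two steps: first to one row per country, then from countries to regions. The following is only a sketch under assumed toy data; `toy_articles` and the `high_quality_per_million` column name are hypothetical.

```python
import pandas as pd

# Hypothetical article rows with a boolean high quality flag.
toy_articles = pd.DataFrame({
    "country": ["A", "A", "A", "B", "C"],
    "region": ["NORTH", "NORTH", "NORTH", "NORTH", "SOUTH"],
    "population": [10.0, 10.0, 10.0, 2.0, 4.0],  # millions, repeated per article
    "High_Quality": [True, False, True, False, True],
})

# Step 1: one row per country, keeping a single population value.
by_country = toy_articles.groupby(["region", "country"]).agg(
    high_quality_count=("High_Quality", "sum"),
    population=("population", "first"),
).reset_index()

# Step 2: sum counts and populations over the countries in each region.
by_region = by_country.groupby("region").agg(
    high_quality_count=("high_quality_count", "sum"),
    population=("population", "sum"),
).reset_index()

by_region["high_quality_per_million"] = (
    by_region["high_quality_count"] / by_region["population"]
)

print(by_region.sort_values("high_quality_per_million", ascending=False))
```

Aggregating to the country level first keeps each country's population from being added once per article.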
