# Goals
Create a series of tables to show...

1. The countries with the greatest and least coverage of politicians on Wikipedia compared to their population.   
2. The countries with the highest and lowest proportion of high quality articles about politicians (according to ORES).
3. A ranking of geographic regions by articles-per-person and proportion of high quality articles.



# Step 1: Gathering the data

Wikipedia politicians by country dataset: [https://figshare.com/articles/Untitled_Item/5513449](https://figshare.com/articles/Untitled_Item/5513449)

Population data by country/region: [https://www.prb.org/international/indicator/population/table/](https://www.prb.org/international/indicator/population/table/)

# Step 2: Cleaning the data
The file page_data.csv contains some page names that start with the string "Template:". These pages are not articles and need to be removed.

Similarly, WPDS_2020_data.csv contains some rows that provide cumulative regional population counts rather than country-level counts. These rows have all-caps values in the 'Name' field. They need to be removed and stored separately for later analysis.

First, we'll need to import the data into Pandas DataFrames. Then, we'll execute the data cleaning steps.

In [None]:
import pandas as pd

In [None]:
page_data = pd.read_csv('page_data.csv')
population_data = pd.read_csv('/content/WPDS_2020_data.csv')

In [None]:
# Removes all pages whose title starts with "Template:" (these are not articles)
page_data2 = page_data[~page_data.page.str.startswith("Template:")]

# Removes all regional population count rows
population_data2 = population_data[~population_data.Name.str.isupper()]

# Stores all regional population count rows in a new df
regional_pop_data = population_data[population_data.Name.str.isupper()]
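
As a quick sanity check of the two cleaning rules, the sketch below applies them to a toy frame. The template and list titles are made up for illustration; the population figures are the ones shown later in this notebook. It also shows why a prefix match and a substring match can disagree.

```python
import pandas as pd

# Toy stand-ins for page_data and population_data (page titles are illustrative)
toy_pages = pd.DataFrame({
    "page": ["Template:Example-politician-stub", "Bir I of Kanem", "List of Template: users"],
    "country": ["Chad", "Chad", "Chad"],
})
toy_population = pd.DataFrame({
    "Name": ["AFRICA", "NORTHERN AFRICA", "Algeria", "Egypt"],
    "Population": [1337918000, 244344000, 44357000, 100803000],
})

# A prefix match only drops titles that begin with "Template:";
# a substring match would also drop the third row.
kept_startswith = toy_pages[~toy_pages["page"].str.startswith("Template:")]
kept_contains = toy_pages[~toy_pages["page"].str.contains("Template:")]

# All-caps names mark the aggregate (regional) rows
is_region = toy_population["Name"].str.isupper()
toy_countries = toy_population[~is_region]
toy_regions = toy_population[is_region]

print(len(kept_startswith), len(kept_contains))
print(toy_countries["Name"].tolist(), toy_regions["Name"].tolist())
```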

# Step 3: Estimating article quality
Using the Objective Revision Evaluation Service (ORES), a machine learning tool created to estimate Wikipedia article quality, we will obtain the predicted article quality for each article in the page_data2 DataFrame.

In [None]:
!pip install ores
from ores import api



In [None]:
# Provide useragent string to help ORES team track requests
ores_session = api.Session("https://ores.wikimedia.org", "DATA 512 class project jjfields@uw.edu")

In [None]:
# Process all ~50k articles in one call
results = ores_session.score("enwiki", ["articlequality"], page_data2['rev_id'])

In [None]:
# Create a new column we can add to our page_data2 df which includes predicted article quality
scores = []

for score in results:
  try:
    scores.append(score['articlequality']['score']['prediction'])
  except (KeyError, TypeError):  # no prediction present in this result
    scores.append(-1) # -1 will be the code for the case where ORES was unable to provide a prediction
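
The same fallback logic can be factored into a small helper and tried on hand-built inputs. This is only a sketch: `extract_prediction` is a hypothetical name, and the shape of the failed response (including the error type string) is an assumption modeled on the keys the loop above reads, not a documented ORES payload.

```python
# Hypothetical helper; the response shapes below are assumptions for illustration.
def extract_prediction(score, model="articlequality", missing=-1):
    """Return the predicted class, or `missing` when no prediction is present."""
    try:
        return score[model]["score"]["prediction"]
    except (KeyError, TypeError):
        return missing

ok = {"articlequality": {"score": {"prediction": "Stub"}}}
failed = {"articlequality": {"error": {"type": "ExampleError"}}}  # made-up error payload

# None stands in for a result that is missing entirely
predictions = [extract_prediction(s) for s in (ok, failed, None)]
print(predictions)
```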

In [None]:
# Create a DataFrame with predicted_quality added as a column
page_data3 = page_data2.copy()  # explicit copy avoids pandas' SettingWithCopyWarning
page_data3['predicted_quality'] = scores


In [None]:
page_data3['predicted_quality'].value_counts()

Stub     24210
Start    14499
C         5929
GA         771
B          726
FA         290
-1         276
Name: predicted_quality, dtype: int64

# Step 4: Merging the page_data3 DF and population_data DF
We will merge these datasets together in order to complete our analysis.

In [None]:
page_data3.head(1)

Unnamed: 0,page,country,rev_id,predicted_quality
1,Bir I of Kanem,Chad,355319463,Stub


In [None]:
# rename() returns a new DataFrame, so population_data2 is left untouched
population_data3 = population_data2.rename(columns={'Name': 'country'})
population_data3.head(1)

Unnamed: 0,FIPS,country,Type,TimeFrame,Data (M),Population
3,DZ,Algeria,Country,2019,44.357,44357000


In [None]:
df = pd.merge(page_data3, population_data3, on=['country'], how='outer', indicator=True)
df.head(3)

Unnamed: 0,page,country,rev_id,predicted_quality,FIPS,Type,TimeFrame,Data (M),Population,_merge
0,Bir I of Kanem,Chad,355319463.0,Stub,TD,Country,2019.0,16.877,16877000.0,both
1,Abdullah II of Kanem,Chad,498683267.0,Stub,TD,Country,2019.0,16.877,16877000.0,both
2,Salmama II of Kanem,Chad,565745353.0,Stub,TD,Country,2019.0,16.877,16877000.0,both


In [None]:
df['_merge'].value_counts()

both          44842
left_only      1859
right_only       27
Name: _merge, dtype: int64

The new "_merge" column indicates whether the merge key exists only in the left frame (left_only), only in the right frame (right_only), or in both. We will use it to split the rows that did not merge successfully out of the main df while still preserving them in a separate file.

In [None]:
df_both = df[df['_merge'] == 'both']
df_not_both = df[df['_merge'] != 'both']

# Save dfs to CSV files: matched rows are the politicians-by-country data, unmatched rows are the no-match data
df_both.to_csv("wp_wpds_politicians_by_country.csv")
df_not_both.to_csv("wp_wpds_countries-no_match.csv")
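
To see what the `_merge` indicator reports, here is a minimal sketch on toy frames. "Atlantis" is a made-up country that exists only on the left side; the populations are the values shown elsewhere in this notebook. Listing the unmatched names per side is one way to spot spelling differences between the two sources.

```python
import pandas as pd

left = pd.DataFrame({"page": ["A", "B", "C"],
                     "country": ["Chad", "Chad", "Atlantis"]})
right = pd.DataFrame({"country": ["Chad", "Algeria"],
                      "Population": [16877000, 44357000]})

merged = pd.merge(left, right, on="country", how="outer", indicator=True)

matched = merged[merged["_merge"] == "both"]
unmatched = merged[merged["_merge"] != "both"]

# Country names that failed to match, split by the side they came from
left_only = sorted(unmatched.loc[unmatched["_merge"] == "left_only", "country"].unique())
right_only = sorted(unmatched.loc[unmatched["_merge"] == "right_only", "country"].unique())

print(len(matched), left_only, right_only)
```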

# Step 5: Analysis
We will calculate, for each country and for each geographic region, the number of articles and the number of high-quality articles as a percentage of population. We define "high quality" as articles which received a predicted quality score from ORES of "FA" or "GA" (featured article or good article).

In [None]:
# Make a DF which lists each country and the number of articles it has

# Count the rows for each country (each row represents one article).
# rename_axis/reset_index(name=...) gives the same two columns regardless of pandas version.
article_counts = (
    df_both['country']
    .value_counts()
    .rename_axis('country')
    .reset_index(name='article_count')
)

article_counts.head(3)

Unnamed: 0,country,article_count
0,France,1681
1,Australia,1561
2,China,1133


In [None]:
# Make a DF which lists the number of "FA" or "GA" articles for each country
gafa_rows = df_both[df_both.predicted_quality.isin(['FA', 'GA'])] # Keep only rows which represent a GA/FA article

# Count the rows for each country and name the columns for joining later
gafa_count = (
    gafa_rows['country']
    .value_counts()
    .rename_axis('country')
    .reset_index(name='gafa_count')
)

gafa_count.head(3)

Unnamed: 0,country,gafa_count
0,United States,80
1,United Kingdom,56
2,Romania,42


In [None]:
# Create final df by joining cleaned population data and the two dfs we just created above

counts_df = pd.merge(population_data3, article_counts, on=['country'], how='left')
counts_df = pd.merge(counts_df, gafa_count, on=['country'], how='left')
counts_df.fillna(0, inplace=True)

counts_df.head(3)

Unnamed: 0,FIPS,country,Type,TimeFrame,Data (M),Population,article_count,gafa_count
0,DZ,Algeria,Country,2019,44.357,44357000,116.0,2.0
1,EG,Egypt,Country,2019,100.803,100803000,237.0,10.0
2,LY,Libya,Country,2019,6.891,6891000,110.0,4.0


# A note about regions_df
The population data (provided in the WPDS_2020_data.csv file) implicitly defines the region and subregion each country belongs to in a hierarchical manner. For example, if the first two rows contained REGION = AFRICA and SUBREGION = NORTHERN AFRICA, then the country rows that follow would implicitly belong to the region Africa and the subregion Northern Africa. However, because of some missing data, this would lead to the USA and Canada also belonging to Africa, since they are provided with a subregion but not a region. For this reason, I exported the population_data DataFrame and manually added a subregion and region to each country row, staying as close to the provided data as possible. The result of this manual modification is in population_data_mod.csv.
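
The implicit hierarchy could also be resolved in code rather than by hand. The following is only a sketch of that idea, not what was done for population_data_mod.csv: `TOP_LEVEL` and `SUBREGION_TO_REGION` are hypothetical lookups that would have to be filled in for the full file, and any subregion without a known parent is left with an empty region instead of inheriting the previous one.

```python
# Toy slice of the 'Name' column, in file order
names = ["AFRICA", "NORTHERN AFRICA", "Algeria", "Egypt", "NORTHERN AMERICA", "Canada"]

# Hypothetical lookups; a real run would need an entry for every aggregate row
TOP_LEVEL = {"AFRICA"}
SUBREGION_TO_REGION = {"NORTHERN AFRICA": "AFRICA"}

rows = []
current_sub = None
for name in names:
    if name.isupper():
        # Aggregate row: remember it if it is a subregion, then skip it
        if name not in TOP_LEVEL:
            current_sub = name
        continue
    rows.append({
        "country": name,
        "sub-region": current_sub,
        # Unknown parents stay None so they can be reviewed manually
        "region": SUBREGION_TO_REGION.get(current_sub),
    })

for row in rows:
    print(row)
```

With this approach Canada keeps its subregion but gets no region, which flags it for review instead of silently assigning it to Africa.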

In [None]:
regions_df = pd.read_csv("population_data_mod.csv", encoding='utf-8')

In [None]:
regions_df.head()

Unnamed: 0.1,Unnamed: 0,FIPS,country,Type,TimeFrame,Data (M),Population,sub-region,region
0,0,WORLD,WORLD,World,2019,7772.85,7772850000,,
1,1,AFRICA,AFRICA,Sub-Region,2019,1337.918,1337918000,,
2,2,NORTHERN AFRICA,NORTHERN AFRICA,Sub-Region,2019,244.344,244344000,,
3,3,DZ,Algeria,Country,2019,44.357,44357000,NORTHERN AFRICA,AFRICA
4,4,EG,Egypt,Country,2019,100.803,100803000,NORTHERN AFRICA,AFRICA


In [None]:
# Finally we can join and make our final df for the analysis

analysis_df = counts_df.merge(regions_df, on='country', how='inner')[['country', 'Population_x', 'article_count', 'gafa_count', 'sub-region', 'region']]
analysis_df.rename(columns={'Population_x': 'population'}, inplace=True)

analysis_df.head()

Unnamed: 0,country,population,article_count,gafa_count,sub-region,region
0,Algeria,44357000,116.0,2.0,NORTHERN AFRICA,AFRICA
1,Egypt,100803000,237.0,10.0,NORTHERN AFRICA,AFRICA
2,Libya,6891000,110.0,4.0,NORTHERN AFRICA,AFRICA
3,Morocco,35952000,206.0,1.0,NORTHERN AFRICA,AFRICA
4,Sudan,43849000,95.0,2.0,NORTHERN AFRICA,AFRICA


### Articles/Population percentage by country

In [None]:
# .copy() avoids pandas' SettingWithCopyWarning when the new column is added
articles_over_population_county_df = analysis_df[['country', 'population', 'article_count']].copy()
articles_over_population_county_df['article_pop_percentage'] = (100 * articles_over_population_county_df['article_count'])/articles_over_population_county_df['population']

articles_over_population_county_df.head()


Unnamed: 0,country,population,article_count,article_pop_percentage
0,Algeria,44357000,116.0,0.000262
1,Egypt,100803000,237.0,0.000235
2,Libya,6891000,110.0,0.001596
3,Morocco,35952000,206.0,0.000573
4,Sudan,43849000,95.0,0.000217
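
Because the percentages are so small, it can help to check one row by hand and to look at the same ratio per million people. The sketch below uses the Algeria row from the table above; `articles_per_million` is a hypothetical helper added only for illustration.

```python
def article_pop_percentage(article_count, population):
    """Articles as a percentage of population (the metric used in this notebook)."""
    return 100 * article_count / population

def articles_per_million(article_count, population):
    """Hypothetical alternative scale: articles per one million people."""
    return 1_000_000 * article_count / population

# Algeria: 116 articles, population 44,357,000
algeria_pct = article_pop_percentage(116, 44_357_000)
algeria_per_million = articles_per_million(116, 44_357_000)

print(round(algeria_pct, 6), round(algeria_per_million, 2))
```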


### High quality articles (gafa_count)/Population percentage by country

In [None]:
# .copy() avoids pandas' SettingWithCopyWarning when the new column is added
gafa_over_population_country_df = analysis_df[['country', 'population', 'gafa_count']].copy()
gafa_over_population_country_df['gafa_pop_percentage'] = (100 * gafa_over_population_country_df['gafa_count'])/gafa_over_population_country_df['population']

gafa_over_population_country_df.head()


Unnamed: 0,country,population,gafa_count,gafa_pop_percentage
0,Algeria,44357000,2.0,5e-06
1,Egypt,100803000,10.0,1e-05
2,Libya,6891000,4.0,5.8e-05
3,Morocco,35952000,1.0,3e-06
4,Sudan,43849000,2.0,5e-06
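
The goals at the top describe quality as a proportion of a country's articles, while the table above divides by population. A minimal sketch of the articles-based variant follows, using the first three rows of counts_df plus one made-up zero-article row to show the divide-by-zero guard. The column name `gafa_article_percentage` is an assumption, and this metric is not used in the rest of the notebook.

```python
import pandas as pd

sample = pd.DataFrame({
    "country": ["Algeria", "Egypt", "Libya", "(hypothetical, no articles)"],
    "article_count": [116.0, 237.0, 110.0, 0.0],
    "gafa_count": [2.0, 10.0, 4.0, 0.0],
})

# Replace zero denominators with NaN so those countries get NaN, not an error or inf
denominator = sample["article_count"].where(sample["article_count"] > 0)
sample["gafa_article_percentage"] = 100 * sample["gafa_count"] / denominator

print(sample)
```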


### Articles/Population percentage by subregion

In [None]:
articles_over_population_sub_region_df = analysis_df[['sub-region', 'population', 'article_count']]
articles_over_population_sub_region_df = articles_over_population_sub_region_df.groupby(by=['sub-region'], dropna=False).sum()
articles_over_population_sub_region_df.reset_index(inplace=True)
articles_over_population_sub_region_df['article_pop_percentage'] = (100 * articles_over_population_sub_region_df['article_count'])/articles_over_population_sub_region_df['population']

articles_over_population_sub_region_df.head()

Unnamed: 0,sub-region,population,article_count,article_pop_percentage
0,CARIBBEAN,42747000,697.0,0.001631
1,CENTRAL AMERICA,178612000,1545.0,0.000865
2,CENTRAL ASIA,74960000,247.0,0.00033
3,EAST ASIA,1641063000,2477.0,0.000151
4,EASTERN AFRICA,444970000,2509.0,0.000564


### High quality articles (gafa_count)/Population percentage by subregion

In [None]:
gafa_over_population_sub_region_df = analysis_df[['sub-region', 'population', 'gafa_count']]
gafa_over_population_sub_region_df = gafa_over_population_sub_region_df.groupby(by=['sub-region'], dropna=False).sum()
gafa_over_population_sub_region_df.reset_index(inplace=True)
gafa_over_population_sub_region_df['gafa_pop_percentage'] = (100 * gafa_over_population_sub_region_df['gafa_count'])/gafa_over_population_sub_region_df['population']

gafa_over_population_sub_region_df.head()

Unnamed: 0,sub-region,population,gafa_count,gafa_pop_percentage
0,CARIBBEAN,42747000,13.0,3e-05
1,CENTRAL AMERICA,178612000,23.0,1.3e-05
2,CENTRAL ASIA,74960000,7.0,9e-06
3,EAST ASIA,1641063000,76.0,5e-06
4,EASTERN AFRICA,444970000,35.0,8e-06
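
The subregion figures are computed by summing counts and populations first and dividing afterwards. The sketch below illustrates why that differs from averaging the per-country percentages, using only the five Northern Africa rows shown in analysis_df.head() (not the full subregion).

```python
import pandas as pd

head_rows = pd.DataFrame({
    "country": ["Algeria", "Egypt", "Libya", "Morocco", "Sudan"],
    "population": [44357000, 100803000, 6891000, 35952000, 43849000],
    "article_count": [116, 237, 110, 206, 95],
    "sub-region": ["NORTHERN AFRICA"] * 5,
})

# Pooled ratio: sum first, then divide (the approach used above)
pooled = head_rows.groupby("sub-region")[["population", "article_count"]].sum()
pooled["article_pop_percentage"] = 100 * pooled["article_count"] / pooled["population"]

# Unweighted mean of per-country percentages: small countries count as much as large ones
per_country_pct = 100 * head_rows["article_count"] / head_rows["population"]
mean_of_percentages = per_country_pct.mean()

print(pooled)
print(mean_of_percentages)
```

In this sample the unweighted mean is pulled upward by Libya, which has a small population and a comparatively high percentage.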


# Step 6: Results
We will now embed several tables in the notebook which show the results of our analysis.

In [None]:
# Top 10 countries by coverage: 10 highest-ranked countries in terms of number of politician articles as a proportion of country population

articles_over_population_county_df.sort_values(by=['article_pop_percentage'], ascending=False).head(10)

Unnamed: 0,country,population,article_count,article_pop_percentage
208,Tuvalu,10000,54.0,0.54
200,Nauru,11000,52.0,0.472727
189,San Marino,34000,81.0,0.238235
165,Monaco,38000,40.0,0.105263
163,Liechtenstein,39000,28.0,0.071795
199,Marshall Islands,57000,37.0,0.064912
207,Tonga,99000,63.0,0.063636
152,Iceland,368000,202.0,0.054891
179,Andorra,82000,34.0,0.041463
194,Federated States of Micronesia,106000,36.0,0.033962


In [None]:
# Bottom 10 countries by coverage: 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population

articles_over_population_county_df.sort_values(by=['article_pop_percentage'], ascending=True).head(10)

Unnamed: 0,country,population,article_count,article_pop_percentage
142,"China, Macao SAR",686000,0.0,0.0
32,Mayotte,284000,0.0,0.0
61,El Salvador,6481000,0.0,0.0
129,Brunei,469000,0.0,0.0
79,Puerto Rico,3189000,0.0,0.0
34,Reunion,861000,0.0,0.0
138,Timor-Leste,1318000,0.0,0.0
48,"Congo, Dem. Rep.",89568000,0.0,0.0
101,Georgia,3715000,0.0,0.0
71,Curacao,155000,0.0,0.0
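
All ten rows above are ties at zero, so their order is arbitrary. One possible complementary view, sketched here on a few rows taken from the tables in this notebook, is to rank only countries that have at least one matched article. This is an optional extra and an assumption about what might be useful, not part of the required tables.

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["Georgia", "Algeria", "Egypt", "Sudan"],
    "article_count": [0.0, 116.0, 237.0, 95.0],
    "article_pop_percentage": [0.0, 0.000262, 0.000235, 0.000217],
})

# Keep only countries with at least one article, then take the lowest ratios
covered = toy[toy["article_count"] > 0]
lowest_nonzero = covered.nsmallest(2, "article_pop_percentage")

print(lowest_nonzero)
```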


In [None]:
# Top 10 countries by relative quality: 10 highest-ranked countries by number of GA- and FA-quality politician articles (referred to as gafa) as a percentage of country population

gafa_over_population_country_df.sort_values(by=['gafa_pop_percentage'], ascending=False).head(10)

Unnamed: 0,country,population,gafa_count,gafa_pop_percentage
208,Tuvalu,10000,4.0,0.04
72,Dominica,72000,1.0,0.001389
209,Vanuatu,321000,3.0,0.000935
152,Iceland,368000,2.0,0.000543
153,Ireland,5003000,25.0,0.0005
186,Montenegro,622000,2.0,0.000322
78,Martinique,356000,1.0,0.000281
122,Bhutan,730000,2.0,0.000274
202,New Zealand,4987000,13.0,0.000261
174,Romania,19241000,42.0,0.000218


In [None]:
# Bottom 10 countries by relative quality: 10 lowest-ranked countries by number of GA- and FA-quality politician articles as a percentage of country population

gafa_over_population_country_df.sort_values(by=['gafa_pop_percentage'], ascending=True).head(10)

Unnamed: 0,country,population,gafa_count,gafa_pop_percentage
115,Kazakhstan,18732000,0.0,0.0
34,Reunion,861000,0.0,0.0
101,Georgia,3715000,0.0,0.0
36,Seychelles,98000,0.0,0.0
107,Oman,4713000,0.0,0.0
118,Turkmenistan,6031000,0.0,0.0
71,Curacao,155000,0.0,0.0
41,Zambia,18384000,0.0,0.0
151,Finland,5529000,0.0,0.0
43,Angola,32522000,0.0,0.0


In [None]:
# Geographic regions by coverage: Ranking of geographic regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population

articles_over_population_sub_region_df.sort_values(by=['article_pop_percentage'], ascending=False).head(10)

Unnamed: 0,sub-region,population,article_count,article_pop_percentage
10,OCEANIA,42999000,3132.0,0.007284
9,NORTHERN EUROPE,105852000,3781.0,0.003572
15,SOUTHERN EUROPE,153216000,3729.0,0.002434
18,WESTERN EUROPE,195479000,4577.0,0.002341
0,CARIBBEAN,42747000,697.0,0.001631
5,EASTERN EUROPE,291902000,3771.0,0.001292
14,SOUTHERN AFRICA,67732000,635.0,0.000938
17,WESTERN ASIA,280927000,2580.0,0.000918
1,CENTRAL AMERICA,178612000,1545.0,0.000865
11,SOUTH AMERICA,429188000,3042.0,0.000709


In [None]:
# Geographic regions by relative quality: Ranking of geographic regions (in descending order) by number of GA- and FA-quality politician articles from countries in each region as a percentage of total regional population

gafa_over_population_sub_region_df.sort_values(by=['gafa_pop_percentage'], ascending=False).head(10)

Unnamed: 0,sub-region,population,gafa_count,gafa_pop_percentage
10,OCEANIA,42999000,63.0,0.000147
9,NORTHERN EUROPE,105852000,102.0,9.6e-05
15,SOUTHERN EUROPE,153216000,74.0,4.8e-05
5,EASTERN EUROPE,291902000,118.0,4e-05
17,WESTERN ASIA,280927000,89.0,3.2e-05
0,CARIBBEAN,42747000,13.0,3e-05
18,WESTERN EUROPE,195479000,56.0,2.9e-05
8,NORTHERN AMERICA,368068000,104.0,2.8e-05
14,SOUTHERN AFRICA,67732000,9.0,1.3e-05
1,CENTRAL AMERICA,178612000,23.0,1.3e-05


# Writeup: Reflections and Implications



*   What biases did you expect to find in the data (before you started working with it), and why?
  * I expected wealthier countries to have better representation of their political figures on Wikipedia, due to better access to the technology required to create, edit, and contribute to such pages.
*   What (potential) sources of bias did you discover in the course of your data processing and analysis?
  * Overwhelmingly, Western countries are better represented than Asian or African countries.
* What might your results suggest about (English) Wikipedia as a data source?
  * This suggests (albeit weakly) that information on the English Wikipedia is biased towards a Western perspective. This could especially be the case when less-than-objective information is discussed on Wikipedia pages. For example, events during a politician's term might be viewed and understood primarily through a Western lens.
* What might your results suggest about the internet and global society in general?
  * If this trend can properly be extrapolated to the internet as a whole (a lot of work would be needed to show this), then we could conclude that in general there is a Western bias on the internet.
* Can you think of a realistic data science research situation where using these data (to train a model, perform a hypothesis-driven research, or make business decisions) might create biased or misleading results, due to the inherent gaps and limitations of the data?
  * A sentiment-analysis tool used to determine politician approval based on Wikipedia sections about events which occurred during the politician's term would have a Western lean. A tool trained this way would not generalize well to politician approval in the politician's home country if that country is not a Western one.
* Can you think of a realistic data science research situation where using these data (to train a model, perform a hypothesis-driven research, or make business decisions) might still be appropriate and useful, despite its inherent limitations and biases?
  * A hypothesis-driven research scenario in which the researcher was attempting to determine the number of representatives belonging to each country might still be able to use this data to estimate the number. 
* How might a researcher supplement or transform this dataset to potentially correct for the limitations/biases you observed?
  * The researcher could use article IDs from the Wikipedia edition written in the language of the home country of the politician each page describes.


