## Data 512 - Assignment A2

The goal of this assignment is to explore the concept of bias through data on Wikipedia articles - specifically, articles on political figures from a variety of countries. For this assignment, you will combine a dataset of Wikipedia articles with a dataset of country populations, and use a machine learning service called ORES to estimate the quality of each article.

You are expected to perform an analysis of how the coverage of politicians on Wikipedia and the quality of articles about politicians varies between countries. Your analysis will consist of a series of tables that show:

- the countries with the greatest and least coverage of politicians on Wikipedia compared to their population.
- the countries with the highest and lowest proportion of high quality articles about politicians.

You are also expected to write a short reflection on the project that describes how this assignment helps you understand the causes and consequences of bias on Wikipedia.

### Getting the article and population data

The first step is to download the file that contains the Wikipedia country information from Figshare and load it into a pandas DataFrame.

In [129]:
import pandas as pd
import numpy as np
import math
import requests
import json
country = pd.read_csv('page_data.csv')

We then download the population data from Dropbox (the link is given in the README file) and load it in the same way.

In [18]:
population = pd.read_csv('WPDS_2018_data.csv')
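
The population figures in this file are stored as text with comma thousands separators, which is why they are converted to floats later in the notebook. As an alternative (not what this notebook does), `pd.read_csv` accepts a `thousands` argument that parses such values at load time. The sketch below uses a hypothetical two-row sample, not the real file; if this approach were adopted, the later string-based conversion would no longer be needed.

```python
import io
import pandas as pd

# Hypothetical two-row sample shaped like the population file (not real data).
sample_csv = io.StringIO(
    'Geography,Population mid-2018 (millions)\n'
    'India,"1,371.3"\n'
    'Tuvalu,0.01\n'
)

# thousands=',' tells pandas to treat the comma as a digit-group separator,
# so the column is parsed as a number instead of a string.
sample_population = pd.read_csv(sample_csv, thousands=',')
print(sample_population.dtypes)
```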

### Getting article quality predictions

Using a Wikimedia API endpoint that connects to a machine learning service called ORES, we get quality predictions for each of the articles listed in the country data above.

In [36]:
headers = {'User-Agent' : 'https://github.com/Gmoog', 'From' : 'mgautam@uw.edu'}

def get_ores_data(revision_ids, headers):
    
    # Define the endpoint
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
    
    # Specify the parameters - smushing all the revision IDs together separated by | marks.
    # Yes, 'smush' is a technical term, trust me I'm a scientist.
    # What do you mean "but people trusting scientists regularly goes horribly wrong" who taught you tha- oh.  
    params = {'project' : 'enwiki',
              'model'   : 'wp10',
              'revids'  : '|'.join(str(x) for x in revision_ids)
              }
    # pass the headers along so the request identifies its sender
    api_call = requests.get(endpoint.format(**params), headers=headers)
    response = api_call.json()
    return response
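
To make the shape of the response easier to follow, here is a minimal offline sketch of pulling predictions out of it. The payload is hand-written to mimic the nesting this notebook reads (`enwiki` → `scores` → revision ID → `wp10` → `score`); it is not real ORES output, and `extract_predictions` is a hypothetical helper name. The `error` entry is an assumption about what a failed revision looks like.

```python
# Hypothetical payload mimicking the nesting read by this notebook.
sample_response = {
    'enwiki': {
        'scores': {
            '111': {'wp10': {'score': {'prediction': 'Stub'}}},
            # Assumed shape of a failed lookup: no 'score' key is present.
            '222': {'wp10': {'error': {'message': 'revision not found'}}},
        }
    }
}

def extract_predictions(response, project='enwiki', model='wp10'):
    """Return {revision_id: prediction} for revisions that have a score."""
    predictions = {}
    for rev_id, models in response[project]['scores'].items():
        score = models[model].get('score')
        if score is not None:
            predictions[int(rev_id)] = score['prediction']
    return predictions

print(extract_predictions(sample_response))
```

Revisions without a `score` entry are simply skipped, which matches how errors are handled in the cells that follow.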

#### Make a list of all the rev_ids from the country table and send them to the ORES function to retrieve the article quality ratings

In [137]:
# store the revision ids in a list
rev_ids = country['rev_id'].tolist()

47197

Divide the revision IDs into lists of size 100 and query the ORES function with each list. This helps avoid hitting the API limits.
Some of the revision IDs return an error from ORES; these have been excluded from the results. The code in the cell below takes about 3 minutes to execute.

In [166]:
# number of revision ids sent to ORES per request
batch_size = 100
# lists to store the revision ids and quality predictions
re_id = []
prediction = []
# divide the ids into lists of size 100 and query ORES for each one
for start in range(0, len(rev_ids), batch_size):
    batch = rev_ids[start:start + batch_size]
    res = get_ores_data(batch, headers)
    # append the data only when ORES returned a score rather than an error
    for rev_id, models in res['enwiki']['scores'].items():
        score = models['wp10'].get('score')
        if score is not None:
            re_id.append(rev_id)
            prediction.append(score['prediction'])
# create a dataframe to hold the revision ids and quality data
art_quality = pd.DataFrame(np.column_stack([re_id,prediction]), columns=['revision_id','article_quality'])
art_quality.revision_id = art_quality.revision_id.astype(int)
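
The batching step (slices of 100 revision IDs, with a shorter final slice) can also be written as a small generator. This is only a sketch: `chunked` is a hypothetical helper name, and the list of 250 integers stands in for the real revision IDs.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# 250 stand-in ids split into batches of 100: two full batches and one of 50.
batches = list(chunked(list(range(250)), 100))
print([len(batch) for batch in batches])
```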


#### Merging data from the 3 datasets created so far: country, population and article quality


In [170]:
#lowering the case for both country lists, so that the join is not impacted by the country case
population['Geography'] = population['Geography'].str.lower()
country['country'] = country['country'].str.lower()
# create a new dataframe by joining on the country column, using the inner join, so that unmatched rows are not included
country_population = country.merge(population, how='inner', left_on='country',right_on='Geography')
del country_population['Geography']
# rename the columns as specified in the instructions
country_population = country_population.rename(index=str, columns={"page": "article_name", "rev_id": "revision_id", "Population mid-2018 (millions)":"population"})
# finally, add the data from the article quality dataframe, by joining on the revision_id column
final_df = country_population.merge(art_quality, on='revision_id')
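
Because the inner join silently drops rows that have no match, it can be useful to see which countries fall out. One way, sketched here on hypothetical toy frames rather than the real data, is an outer merge with pandas' `indicator=True`, which adds a `_merge` column marking each row as `both`, `left_only` or `right_only`.

```python
import pandas as pd

# Toy frames with made-up rows; the column names mirror the ones used above.
left = pd.DataFrame({'country': ['tuvalu', 'abkhazia'], 'rev_id': [1, 2]})
right = pd.DataFrame({'Geography': ['tuvalu', 'india'], 'population': [0.01, 1371.3]})

# The outer join keeps every row; _merge records which side each row came from.
check = left.merge(right, how='outer', left_on='country',
                   right_on='Geography', indicator=True)
unmatched = check[check['_merge'] != 'both']
print(unmatched[['country', 'Geography', '_merge']])
```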


#### The last data wrangling step is to save the final dataframe created above as a csv file, using the appropriate naming conventions

In [172]:
# saving this dataframe to the final csv data file
final_df.to_csv('en-wikipedia_article_quality_bycountry.csv',index=False)

### Data Analysis

Using pandas and its aggregation methods, create a dataframe that lists each unique country and its number of politician articles expressed as a percentage of the country's population.

In [191]:
# convert the population column to a float datatype, after removing the thousands-separator commas
final_df['population'] = final_df['population'].str.replace(',', '').astype(float)
# dataframe to hold information about countries and proportion of the number of articles with respect to its population, expressed as a percentage
art_prop = pd.DataFrame(np.column_stack([np.sort(final_df['country'].unique()),final_df['country'].value_counts()/(final_df.groupby('country')['population'].mean()*10000.00)]),columns=['country','article_proportion (as % of population)'])
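
The factor of 10000 comes from the units: the population is in millions, so `count / (population * 1,000,000) * 100` reduces to `count / (population * 10000)`. The sketch below checks this on a toy frame of 55 rows for Tuvalu with a population of 0.01 million, the value that the article count and percentage in the tables below imply. The frame is illustrative, not the real data.

```python
import pandas as pd

# 55 toy rows for one country; population is in millions, as in the source data.
toy = pd.DataFrame({'country': ['tuvalu'] * 55, 'population': [0.01] * 55})

article_count = toy.groupby('country').size()
population_millions = toy.groupby('country')['population'].mean()

# Long form: articles per person, times 100 to get a percentage.
percent = article_count / (population_millions * 1_000_000) * 100
# Short form used in the notebook: the two constants collapse to 10000.
shortcut = article_count / (population_millions * 10000.0)
print(percent, shortcut)
```

Both expressions give approximately 0.55 for this toy input.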

Create another dataframe that lists, for each country, the proportion of its articles that are good quality (rated GA or FA) out of its total article count.

In [279]:
# dataframe to store values of each country and the total number of articles
total_article = final_df.groupby('country').size().reset_index(name='total_article_count')
# list to identify the good articles
good = ['GA','FA']
# dataframe to store number of good articles per country
good_article = final_df[final_df['article_quality'].isin(good)].groupby('country').size().reset_index(name='good_article_count')
# merge the two dataframes using the left outer join, since there can be countries with zero good articles
good_v_total = total_article.merge(good_article, on='country',how='left')
good_v_total.fillna(0, inplace=True)
# calculate a new field to store the ratio of good articles to total articles per country
good_v_total['proportion (as a percentage)'] = (good_v_total['good_article_count'] * 100)/good_v_total['total_article_count']
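
An equivalent way to get the same percentage, sketched here on made-up rows, is to take the mean of a boolean "is good" flag per country. Countries with no GA or FA articles then come out as 0 without a separate merge and `fillna` step. This is an alternative formulation, not the method used in the cell above.

```python
import pandas as pd

# Toy data: country 'a' has 2 good articles out of 3, country 'b' has 0 out of 1.
toy = pd.DataFrame({
    'country': ['a', 'a', 'a', 'b'],
    'article_quality': ['GA', 'Stub', 'FA', 'Start'],
})

# True where the predicted quality is GA or FA.
is_good = toy['article_quality'].isin(['GA', 'FA'])
# The mean of a boolean column is the fraction of True values in each group.
good_percent = is_good.groupby(toy['country']).mean() * 100
print(good_percent)
```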

### Embed four tables as described

Table-1 Shows the 10 highest-ranked countries in terms of number of politician articles as a proportion of country population

In [275]:
art_prop.sort_values('article_proportion (as % of population)',ascending=False)[:10]

| index | country | article_proportion (as % of population) |
|---|---|---|
| 166 | tuvalu | 0.55 |
| 115 | nauru | 0.53 |
| 135 | san marino | 0.273333 |
| 108 | monaco | 0.1 |
| 93 | liechtenstein | 0.0725 |
| 161 | tonga | 0.063 |
| 103 | marshall islands | 0.0616667 |
| 68 | iceland | 0.0515 |
| 3 | andorra | 0.0425 |
| 52 | federated states of micronesia | 0.038 |


Table-2 Shows the 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population

In [276]:
art_prop.sort_values('article_proportion (as % of population)')[:10]

| index | country | article_proportion (as % of population) |
|---|---|---|
| 69 | india | 7.19026e-05 |
| 70 | indonesia | 8.06938e-05 |
| 34 | china | 8.14321e-05 |
| 173 | uzbekistan | 8.81459e-05 |
| 51 | ethiopia | 9.76744e-05 |
| 178 | zambia | 0.000141243 |
| 82 | korea, north | 0.000152344 |
| 159 | thailand | 0.000169184 |
| 13 | bangladesh | 0.000194111 |
| 112 | mozambique | 0.000196721 |


Table-3 Shows the 10 highest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country

In [280]:
good_v_total.sort_values('proportion (as a percentage)',ascending=False)[:10]

| index | country | total_article_count | good_article_count | proportion (as a percentage) |
|---|---|---|---|---|
| 82 | korea, north | 39 | 7.0 | 17.948718 |
| 137 | saudi arabia | 119 | 16.0 | 13.445378 |
| 31 | central african republic | 68 | 8.0 | 11.764706 |
| 132 | romania | 348 | 40.0 | 11.494253 |
| 104 | mauritania | 52 | 5.0 | 9.615385 |
| 19 | bhutan | 33 | 3.0 | 9.090909 |
| 166 | tuvalu | 55 | 5.0 | 9.090909 |
| 44 | dominica | 12 | 1.0 | 8.333333 |
| 171 | united states | 1092 | 82.0 | 7.509158 |
| 18 | benin | 94 | 7.0 | 7.446809 |


Table-4 Shows the 10 lowest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country

In [281]:
good_v_total.sort_values('proportion (as a percentage)')[:10]

| index | country | total_article_count | good_article_count | proportion (as a percentage) |
|---|---|---|---|---|
| 136 | sao tome and principe | 22 | 0.0 | 0.0 |
| 112 | mozambique | 60 | 0.0 | 0.0 |
| 28 | cameroon | 105 | 0.0 | 0.0 |
| 65 | guyana | 20 | 0.0 | 0.0 |
| 165 | turkmenistan | 33 | 0.0 | 0.0 |
| 108 | monaco | 40 | 0.0 | 0.0 |
| 107 | moldova | 426 | 0.0 | 0.0 |
| 36 | comoros | 51 | 0.0 | 0.0 |
| 103 | marshall islands | 37 | 0.0 | 0.0 |
| 38 | costa rica | 150 | 0.0 | 0.0 |


### Reflection

Please refer to the readme document.