# A2: Bias in data

For this assignment, I compare [Wikipedia](https://www.wikipedia.org) articles about political figures from various countries. By comparing each country's article count to its population, and by examining the "quality" of its articles about politicians, I hope to show that the level of coverage varies significantly between countries, which could introduce bias into the content. The quality of a given article is evaluated using the [ORES](https://www.mediawiki.org/wiki/ORES) service.

First, I import the libraries that my Python 3 code will be using:

In [114]:
import requests   # For the ORES API call 
import numpy      # For replacement of unread values with NaNs
import pandas     # For dataframing, merging, numeric conversions, and reading CSVs

Next, I load the population data from the CSV file (WPDS_2018_data.csv), available [here](https://www.dropbox.com/s/5u7sy1xt7g0oi2c/WPDS_2018_data.csv?dl=0):

In [115]:
# Load population data
data_population = pandas.read_csv('WPDS_2018_data.csv')

# Rename the columns: 'country' is needed for merging later, and 'population' is a shorter name
data_population = pandas.DataFrame({'country':data_population['Geography'],
                          'population':data_population['Population mid-2018 (millions)']})
# Take a look at its data
data_population.head()
#data_population

Unnamed: 0,country,population
0,AFRICA,1284.0
1,Algeria,42.7
2,Egypt,97.0
3,Libya,6.5
4,Morocco,35.2


To acquire the Wikipedia page data, I download the archive from [here](https://figshare.com/articles/Untitled_Item/5513449) and extract it to get \country\data\page_data.csv. For this assignment, this file is stored in the same working directory as this Python notebook. 

In [116]:
data_page = pandas.read_csv('page_data.csv')
# Take a look at its data
data_page.head(4)
#data_page

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070


To obtain a quality score from ORES, I adapted code from [here](https://github.com/Ironholds/data-512-a2/blob/master/hcds-a2-bias_demo.ipynb) (credit to GitHub users [jtmorgan](https://github.com/jtmorgan) and [ironholds](https://github.com/Ironholds)). The following function makes an ORES API call, given a list of revision ids to look up and request headers identifying me as the caller.

In [117]:
headers = {'User-Agent' : 'https://github.com/pking70', 'From' : 'pking70@uw.edu'}

def get_ores_data(revision_ids, headers):
    # Define the endpoint
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
    
    # Specify the parameters - concatenating all the revision IDs together separated by | marks. 
    params = {'project' : 'enwiki',
              'model'   : 'wp10',
              'revids'  : '|'.join(str(x) for x in revision_ids)
              }
    
    # Make the call for a response in JSON format
    api_call = requests.get(endpoint.format(**params), headers=headers)
    response = api_call.json()
    #print(json.dumps(response, indent=4, sort_keys=True))
    return response
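
As a rough illustration of the response shape that the parsing code in this notebook relies on, here is a minimal offline sketch. The sample dictionary is hypothetical: it mirrors the key path (enwiki, scores, revision id, wp10, score, prediction) that the notebook reads, and it assumes that a revision ORES cannot score carries an `error` entry in place of `score`. The helper name `extract_prediction` is my own and is not part of ORES.

```python
# Hypothetical sample shaped like the ORES v3 response parsed in this notebook
sample_response = {
    'enwiki': {
        'scores': {
            '355319463': {'wp10': {'score': {'prediction': 'Stub'}}},
            # Assumed shape for a revision that could not be scored
            '235107991': {'wp10': {'error': {'type': 'RevisionNotFound'}}},
        }
    }
}

def extract_prediction(scores, revision):
    """Return the predicted class for one revision, or NaN if it is missing."""
    try:
        return scores[revision]['wp10']['score']['prediction']
    except KeyError:
        return float('nan')

scores = sample_response['enwiki']['scores']
results = {rev: extract_prediction(scores, rev) for rev in scores}
print(results)
```

Catching only `KeyError` keeps unrelated problems, such as a network failure, from being silently recorded as missing predictions.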

To loop through all the page data, I split the revision ids into groups of 100. For each group, I call ORES and collect the results into a new dataframe named predictions. Where ORES returns no valid prediction for a rev_id, I store a numpy NaN instead.

This code can take a while to execute, depending on the count of pages to query (for this run, I have over 47,000). I uncomment the 'print(i)' statement when I want to watch progress and confirm that the loop is not stuck; for now it is commented out.

In [118]:
# Query ORES for every rev_id in the page data, one batch at a time
rev_ids = list(data_page['rev_id'])   # Extract the rev_ids from the page data
start = 0                             # Start at item 0
step = 100                            # How many ids to query ORES for at once. ORES does not work with large counts, but 100 works
rows = []                             # Collected prediction/revision records

for i in range(start, len(rev_ids), step): # Loop through all the rev_ids
    
    # print(i)                             # Uncomment this if you want to watch progress 
    rev_ids_set = rev_ids[i:i+step]        # Use this number of ids for the ORES call 
    response = get_ores_data(rev_ids_set, headers)   # Call ORES
    
    for revision, result in response['enwiki']['scores'].items():   # Loop through the JSON ORES call response
        try:
            prediction = result['wp10']['score']['prediction']   # Store predictions
        except KeyError:
            prediction = numpy.nan                  # When there is not a valid response, store a NaN
        
        # Collect revisions and predictions
        rows.append({'prediction':prediction, 'revision':revision})

# DataFrame.append was removed in pandas 2.0, so build the dataframe once from the collected rows
predictions = pandas.DataFrame(rows, columns=['prediction', 'revision'])
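
The batching step on its own can be sketched as a small generator. This is only an illustration of the slicing logic: the name `batches` and the stand-in ids are hypothetical, and no ORES call is made.

```python
def batches(items, size):
    """Yield consecutive slices of `items`, each holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

example_ids = list(range(250))            # Stand-in for real revision ids
sizes = [len(batch) for batch in batches(example_ids, 100)]
print(sizes)                              # [100, 100, 50]
```

The last batch is simply shorter when the number of ids is not a multiple of the batch size, which matches how the slice `rev_ids[i:i+step]` behaves.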

To review the structure of the predictions dataframe:

In [119]:
predictions.head()

Unnamed: 0,prediction,revision
0,,235107991
1,Stub,355319463
2,Stub,391862046
3,Stub,391862070
4,Stub,391862409


The predictions data is merged with the page data, joined on their respective revision id fields. The revision field must be converted from string to numeric for this to work.

In [120]:
predictions['revision'] = pandas.to_numeric(predictions['revision'], errors='coerce')
data_page_prediction = data_page.merge(predictions, left_on='rev_id', right_on='revision') 
# Take a look at its data
data_page_prediction.head()

Unnamed: 0,page,country,rev_id,prediction,revision
0,Template:ZambiaProvincialMinisters,Zambia,235107991,,235107991
1,Bir I of Kanem,Chad,355319463,Stub,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046,Stub,391862046
3,Template:Uganda-politician-stub,Uganda,391862070,Stub,391862070
4,Template:Namibia-politician-stub,Namibia,391862409,Stub,391862409


The prediction+page data is merged with the population data that I loaded into data_population earlier, joined on their respective country fields. I was having trouble with this merge: it is an inner join on exact country names, so any row whose country does not appear identically in both tables is dropped.

In [121]:
data_pagepredpop = data_page_prediction.merge(data_population, on='country')
# Take a look at its data
data_pagepredpop.head()

Unnamed: 0,page,country,rev_id,prediction,revision,population
0,Template:ZambiaProvincialMinisters,Zambia,235107991,,235107991,17.7
1,Gladys Lundwe,Zambia,757566606,Stub,757566606,17.7
2,Mwamba Luchembe,Zambia,764848643,Stub,764848643,17.7
3,Thandiwe Banda,Zambia,768166426,Start,768166426,17.7
4,Sylvester Chisembele,Zambia,776082926,C,776082926,17.7
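
If a merge on country names drops more rows than expected, one way to investigate is an outer merge with pandas' `indicator=True` option, which labels each row by the table its key was found in. The sketch below uses hypothetical miniature tables (the trailing space and the 'Atlantis' row are invented for illustration) and strips whitespace from the column values before merging.

```python
import pandas

# Hypothetical miniature versions of the two tables being merged
pages = pandas.DataFrame({'country': ['Zambia', 'Algeria ', 'Atlantis'],
                          'page': ['A', 'B', 'C']})
population = pandas.DataFrame({'country': ['Zambia', 'Algeria', 'AFRICA'],
                               'population': [17.7, 42.7, 1284.0]})

# Strip whitespace from the values in the column (not from the column name)
pages['country'] = pages['country'].str.strip()
population['country'] = population['country'].str.strip()

# An outer merge with indicator=True adds a '_merge' column:
# 'both', 'left_only', or 'right_only'
check = pages.merge(population, on='country', how='outer', indicator=True)
unmatched = check[check['_merge'] != 'both']
print(unmatched[['country', '_merge']])
```

Rows marked `left_only` are pages whose country has no population entry; rows marked `right_only` are population entries (such as regional totals) with no pages.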


For the final dataframe, extract the fields we want with the titles requested by the assignment, <a href="https://wiki.communitydata.cc/Human_Centered_Data_Science_(Fall_2018)/Assignments#Combining_the_datasets">here.</a>

In [122]:
data_final = pandas.DataFrame({'country':data_pagepredpop['country'],
                               'population':data_pagepredpop['population'],
                               'article_name':data_pagepredpop['page'],
                               'revision_id':data_pagepredpop['rev_id'],
                               'article_quality':data_pagepredpop['prediction']})
# Take a look at its data
data_final

Unnamed: 0,article_name,article_quality,country,population,revision_id
0,Template:ZambiaProvincialMinisters,,Zambia,17.7,235107991
1,Gladys Lundwe,Stub,Zambia,17.7,757566606
2,Mwamba Luchembe,Stub,Zambia,17.7,764848643
3,Thandiwe Banda,Start,Zambia,17.7,768166426
4,Sylvester Chisembele,C,Zambia,17.7,776082926
5,Victoria Kalima,Start,Zambia,17.7,776530837
6,Margaret Mwanakatwe,Start,Zambia,17.7,779747587
7,Nkandu Luo,Start,Zambia,17.7,779747961
8,Susan Nakazwe,Start,Zambia,17.7,779748181
9,Catherine Namugala,Start,Zambia,17.7,779748285


To analyze this data, I want to examine which countries have the most (and the fewest) articles on Wikipedia about their political figures. I also want to examine the proportion of high-quality and low-quality articles for each country.

Note that the quality of an article has been returned by ORES, according to the scale defined <a href="https://wiki.communitydata.cc/Human_Centered_Data_Science_(Fall_2018)/Assignments#Getting_article_quality_predictions">here.</a> 

In short, the article_quality column of data_final now contains a value on this spectrum, listed from highest to lowest quality:

1. FA - Featured article
2. GA - Good article
3. B - B-class article
4. C - C-class article
5. Start - Start-class article
6. Stub - Stub-class article
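
One possible way to make this ordering explicit in pandas is an ordered categorical type. This is a sketch on a hypothetical sample of predictions, not what the notebook does (the analysis below compares the labels with `==`).

```python
import pandas

# Quality classes from lowest to highest, following the scale listed above
quality_order = ['Stub', 'Start', 'C', 'B', 'GA', 'FA']
quality_type = pandas.CategoricalDtype(categories=quality_order, ordered=True)

# Hypothetical sample of predictions
sample = pandas.Series(['Start', 'FA', 'Stub', 'GA', 'C']).astype(quality_type)

# With an ordered dtype, comparisons follow the quality scale
high_quality = sample[sample >= 'GA']
print(high_quality.tolist())
```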

To prepare for analysis, I must calculate the count of articles per country, and the per capita ratio of articles per country:

In [123]:
data_article_count = data_final.groupby(['country']).size().reset_index(name='count')
# Take a look at its data
data_article_count.head()

Unnamed: 0,country,count
0,Afghanistan,327
1,Albania,460
2,Algeria,119
3,Andorra,34
4,Angola,110


The article count data is merged with the population data, joined on their respective country fields.

In [124]:
data_popcount = data_population.merge(data_article_count, on='country')
# Take a look at its data
data_popcount.head()

Unnamed: 0,country,population,count
0,Algeria,42.7,119
1,Egypt,97.0,239
2,Libya,6.5,111
3,Morocco,35.2,208
4,Sudan,41.7,98


I want the number of articles relative to population. I have to convert population and count to numeric, and also multiply population by one million to scale it according to its defined format (remember, it was 'Population mid-2018 (millions)'). The ratio is then multiplied by 100, so the 'per capita' column expresses the article count as a percentage of the population.

In [125]:
data_popcount['count'] = pandas.to_numeric(data_popcount['count'], errors='coerce')     # Numeric conversion
data_popcount['population'] = pandas.to_numeric(data_popcount['population'], errors='coerce') # Numeric conversion
data_popcount['per capita'] = 100*(data_popcount['count']/(data_popcount['population']*1000000)) # Ratio calculation
# Take a look at its data
data_popcount.head()

Unnamed: 0,country,population,count,per capita
0,Algeria,42.7,119,0.000279
1,Egypt,97.0,239,0.000246
2,Libya,6.5,111,0.001708
3,Morocco,35.2,208,0.000591
4,Sudan,41.7,98,0.000235
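
As a quick check of the arithmetic, the Algeria row above can be reproduced by hand. The variable names here are illustrative only.

```python
count = 119                 # Articles for Algeria, from the table above
population_millions = 42.7  # Population in millions, from the table above

articles_per_person = count / (population_millions * 1_000_000)
percent = 100 * articles_per_person          # Same scaling as the 'per capita' column
per_million = count / population_millions    # Equivalent view: articles per million people

print(round(percent, 6), round(per_million, 2))
```

The first value matches the 0.000279 shown for Algeria; the second (about 2.79 articles per million people) is the same ratio in a unit that may be easier to read.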


To see the ten highest-ranked countries in terms of number of politician articles as a proportion of country population:


In [126]:
data_popcount.sort_values(by='per capita', ascending=False).head(10)

Unnamed: 0,country,population,count,per capita
178,Tuvalu,0.01,55,0.55
173,Nauru,0.01,53,0.53
164,San Marino,0.03,82,0.273333
141,Monaco,0.04,40,0.1
139,Liechtenstein,0.04,29,0.0725
177,Tonga,0.1,63,0.063
172,Marshall Islands,0.06,37,0.061667
128,Iceland,0.4,206,0.0515
154,Andorra,0.08,34,0.0425
169,Federated States of Micronesia,0.1,38,0.038


To see the ten lowest-ranked countries in terms of number of politician articles as a proportion of country population:

In [127]:
data_popcount.sort_values(by='per capita', ascending=True).head(10)

Unnamed: 0,country,population,count,per capita
111,Indonesia,265.2,215,8.1e-05
100,Uzbekistan,32.9,29,8.8e-05
25,Ethiopia,107.5,105,9.8e-05
37,Zambia,17.7,26,0.000147
121,"Korea, North",25.6,39,0.000152
117,Thailand,66.2,112,0.000169
102,Bangladesh,166.4,324,0.000195
30,Mozambique,30.5,60,0.000197
118,Vietnam,94.7,191,0.000202
4,Sudan,41.7,98,0.000235


Next, I count the GA and FA-quality articles for each country, as a first step toward ranking countries by their proportion of high-quality articles:


In [128]:
data_quality = data_final[(data_final['article_quality']=='GA')|(data_final['article_quality']=='FA' )]
data_quality = data_quality.groupby(['country']).size().reset_index(name='count_quality')
# Take a look at its data
data_quality.head()

Unnamed: 0,country,count_quality
0,Afghanistan,10
1,Albania,4
2,Algeria,2
3,Argentina,15
4,Armenia,5


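The high-quality counts are merged with the population and article count data, joined on their respective country fields. This is an inner join, so countries with no GA or FA articles drop out: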

In [129]:
data_countqual = data_popcount.merge(data_quality, on='country')
# Take a look at its data
data_countqual.head()

Unnamed: 0,country,population,count,per capita,count_quality
0,Algeria,42.7,119,0.000279,2
1,Egypt,97.0,239,0.000246,8
2,Libya,6.5,111,0.001708,3
3,Morocco,35.2,208,0.000591,1
4,Sudan,41.7,98,0.000235,1


I want the proportion of highly rated articles to total articles. 

In [130]:
data_countqual['proportion'] = data_countqual['count_quality']/data_countqual['count']
# Take a look at its data
data_countqual.head()

Unnamed: 0,country,population,count,per capita,count_quality,proportion
0,Algeria,42.7,119,0.000279,2,0.016807
1,Egypt,97.0,239,0.000246,8,0.033473
2,Libya,6.5,111,0.001708,3,0.027027
3,Morocco,35.2,208,0.000591,1,0.004808
4,Sudan,41.7,98,0.000235,1,0.010204


To see the ten highest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country:

In [131]:
data_countqual.sort_values(by='proportion', ascending=False).head(10)

Unnamed: 0,country,population,count,per capita,count_quality,proportion
100,"Korea, North",25.6,39,0.000152,7,0.179487
73,Saudi Arabia,33.4,119,0.000356,16,0.134454
31,Central African Republic,4.7,68,0.001447,8,0.117647
122,Romania,19.5,348,0.001785,40,0.114943
13,Mauritania,4.5,52,0.001156,5,0.096154
141,Tuvalu,0.01,55,0.55,5,0.090909
83,Bhutan,0.8,33,0.004125,3,0.090909
46,Dominica,0.07,12,0.017143,1,0.083333
40,United States,328.0,1098,0.000335,82,0.074681
5,Benin,11.5,94,0.000817,7,0.074468


To see the ten lowest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country:

In [132]:
data_countqual.sort_values(by='proportion', ascending=True).head(10)

Unnamed: 0,country,population,count,per capita,count_quality,proportion
29,Tanzania,59.1,408,0.00069,1,0.002451
59,Peru,32.2,354,0.001099,1,0.002825
109,Lithuania,2.8,248,0.008857,1,0.004032
15,Nigeria,195.9,684,0.000349,3,0.004386
3,Morocco,35.2,208,0.000591,1,0.004808
137,Fiji,0.9,199,0.022111,1,0.005025
53,Bolivia,11.3,187,0.001655,1,0.005348
54,Brazil,209.4,556,0.000266,3,0.005396
116,Luxembourg,0.6,180,0.03,1,0.005556
17,Sierra Leone,7.7,166,0.002156,1,0.006024


However, the above table ranks only countries that have at least one high-quality article. All the countries with zero (0) GA or FA articles have been omitted, as their proportion is 0. Ranking those countries against each other is not possible: they are all tied at 0, which makes them all equally the "lowest." To see which countries completely lack high-quality articles, I first create a dataframe that contains the count of articles of all quality levels by country:

In [133]:
data_allquality = data_final.groupby(['country']).size().reset_index(name='count_allquality')
# Take a look at its data
data_allquality

Unnamed: 0,country,count_allquality
0,Afghanistan,327
1,Albania,460
2,Algeria,119
3,Andorra,34
4,Angola,110
5,Antigua and Barbuda,25
6,Argentina,496
7,Armenia,199
8,Australia,1566
9,Austria,340


There are 180 countries in this dataframe.

I can find the indexes of the countries that have high-quality articles with this merge:

In [134]:
data_indexes = pandas.merge(data_allquality.reset_index(), data_quality)
data_indexes

Unnamed: 0,index,country,count_allquality,count_quality
0,0,Afghanistan,327,10
1,1,Albania,460,4
2,2,Algeria,119,2
3,6,Argentina,496,15
4,7,Armenia,199,5
5,8,Australia,1566,42
6,9,Austria,340,3
7,10,Azerbaijan,182,2
8,12,Bahrain,42,1
9,13,Bangladesh,324,3


Then, by dropping these indexes (the indexes of countries that have high quality articles) I produce a new dataset of countries that lack any high quality articles, which I call data_lowquality:

In [135]:
data_lowquality = data_allquality.drop(data_indexes['index'])
data_lowquality

Unnamed: 0,country,count_allquality
3,Andorra,34
4,Angola,110
5,Antigua and Barbuda,25
11,Bahamas,20
14,Barbados,14
16,Belgium,523
17,Belize,16
28,Cameroon,106
30,Cape Verde,37
36,Comoros,51


In essence, this is the list of countries that have zero (0) highly rated articles. It is not meaningful to rank them against each other; they are shown above in alphabetical order: the 37 countries with the lowest possible proportion of highly rated articles.
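
An alternative way to reach the same list, sketched here on hypothetical miniature inputs that reuse a few counts from the tables above, is a left merge that keeps every country and fills the missing high-quality counts with 0. This is not the approach used in this notebook; it is shown only for comparison.

```python
import pandas

# Miniature inputs with the same column names as the notebook's dataframes
all_counts = pandas.DataFrame({'country': ['Algeria', 'Andorra', 'Angola'],
                               'count_allquality': [119, 34, 110]})
quality_counts = pandas.DataFrame({'country': ['Algeria'],
                                   'count_quality': [2]})

# A left merge keeps every country; missing quality counts become 0
combined = all_counts.merge(quality_counts, on='country', how='left')
combined['count_quality'] = combined['count_quality'].fillna(0).astype(int)
combined['proportion'] = combined['count_quality'] / combined['count_allquality']

no_high_quality = combined[combined['count_quality'] == 0]
print(no_high_quality['country'].tolist())
```

With this layout, countries with a proportion of 0 stay in the same table as the rest, so a single sort shows them alongside the ranked countries.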

Finally, I save my data to a CSV for sharing and reproducibility:

In [136]:
data_final.to_csv('data_final.csv')

For further reflection upon the meaning of the data processing and analysis within this notebook, please see the Readme file.