# A2 - Bias in Data

Preston Stringham

The purpose of this project is to explore bias in human-centered data science by measuring how the quality of English Wikipedia articles about politicians is distributed across countries. I expect to find several sources of bias, in particular a heavy bias towards developed, English-speaking countries. Let's see how my intuition compares with what the data shows.

# Step 1 - Data Acquisition

Let's load in our data and libraries.

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import requests

In [5]:
politician_df = pd.read_csv('../data/data-raw/page_data.csv')
population_df = pd.read_csv('../data/data-raw/WPDS_2020_data - WPDS_2020_data.csv')

Let's now observe our data.

In [6]:
politician_df

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409
...,...,...,...
47192,Yahya Jammeh,Gambia,807482007
47193,Lucius Fairchild,United States,807483006
47194,Fahd of Saudi Arabia,Saudi Arabia,807483153
47195,Francis Fessenden,United States,807483270


In [7]:
population_df

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
0,WORLD,WORLD,World,2019,7772.850,7772850000
1,AFRICA,AFRICA,Sub-Region,2019,1337.918,1337918000
2,NORTHERN AFRICA,NORTHERN AFRICA,Sub-Region,2019,244.344,244344000
3,DZ,Algeria,Country,2019,44.357,44357000
4,EG,Egypt,Country,2019,100.803,100803000
...,...,...,...,...,...,...
229,WS,Samoa,Country,2019,0.200,200000
230,SB,Solomon Islands,Country,2019,0.715,715000
231,TO,Tonga,Country,2019,0.099,99000
232,TV,Tuvalu,Country,2019,0.010,10000


# Step 2 - Data Preprocessing

Some basic preprocessing is needed before getting our scores. We need to remove rows whose page name contains "Template:" from our politician dataframe, as these are not articles. In addition, the population dataframe has region labels written in capital letters, which we need to remove for now.

In [8]:
politician_df = politician_df[~politician_df.page.str.contains("Template:")]

In [9]:
population_df = population_df[~population_df['Name'].str.isupper()]
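
To make the two filters concrete, here is a minimal sketch on a few rows copied from the tables above. The variable names (`pages`, `names`, `pages_clean`, `countries_only`) are my own, and `regex=False` is an optional extra that treats "Template:" as a literal substring rather than a pattern.

```python
import pandas as pd

# A few rows copied from the tables shown earlier, just for illustration.
pages = pd.DataFrame({
    'page': ['Template:Uganda-politician-stub', 'Bir I of Kanem'],
    'country': ['Uganda', 'Chad'],
})
names = pd.DataFrame({'Name': ['WORLD', 'AFRICA', 'Algeria', 'Egypt']})

# Keep only rows whose page name does not contain the literal text "Template:".
pages_clean = pages[~pages['page'].str.contains('Template:', regex=False)]

# str.isupper() is True only when every cased character is uppercase,
# which separates labels such as "AFRICA" from country names such as "Algeria".
countries_only = names[~names['Name'].str.isupper()]

print(pages_clean)
print(countries_only)
```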

## Getting Scores

We are now ready to obtain our ORES scores using the corresponding ORES REST API. The endpoint for this API is below.

In [13]:
endpoint = 'https://ores.wikimedia.org/v3/scores/enwiki/{revid}/articlequality'

We create a small function that fills the parameters into the endpoint, calls the REST API, and returns the result parsed from JSON.

In [20]:
def api_call(endpoint,parameters):
    call = requests.get(endpoint.format(**parameters))
    response = call.json()
    
    return response

Each article is scored by its revision ID. I instantiate a dictionary below that holds the parameters passed to the api_call function.

In [37]:
parameters = {'revid': '1234'}
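
As a quick illustration of how the dictionary and the endpoint fit together, this sketch shows the URL that `endpoint.format(**parameters)` produces. The revision ID used here is the one listed for "Bir I of Kanem" in the table above; no request is sent.

```python
endpoint = 'https://ores.wikimedia.org/v3/scores/enwiki/{revid}/articlequality'

# rev_id of "Bir I of Kanem", taken from the politician table above.
parameters = {'revid': '355319463'}

# str.format replaces the {revid} placeholder with the value from the dictionary.
url = endpoint.format(**parameters)
print(url)
```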

In [34]:
politician_df = politician_df.reset_index()

This is where we actually get the scores. My potentially naive idea is to iterate through every revision ID in our politician dataframe and request its score. This makes processing the REST API data much more direct, as we only deal with one response at a time. Additional logic is added because we want to take note of which articles ORES was unable to score.

In [60]:
for i in range(len(politician_df.index)):
    rev_id = str(politician_df.at[i, 'rev_id'])
    parameters['revid'] = rev_id
    result = api_call(endpoint, parameters)
    # ORES returns an 'error' entry instead of a 'score' when it cannot rate a revision
    quality = result['enwiki']['scores'][rev_id]['articlequality']
    if 'error' in quality:
        politician_df.at[i, 'prediction'] = None
    else:
        politician_df.at[i, 'prediction'] = quality['score']['prediction']
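
One request per article is slow for roughly 47,000 rows. A possible alternative, sketched below, is to score several revisions per request. This is not what the notebook does: the batch URL form (`?models=articlequality&revids=a|b|c`) is my assumption about the ORES v3 API, the batch size of 50 is arbitrary, and the helper names and the `sample` response are hypothetical. The sample only mirrors the nesting that the loop above reads.

```python
# Assumed batch form of the ORES v3 endpoint: several revision IDs joined by "|".
BATCH_ENDPOINT = ('https://ores.wikimedia.org/v3/scores/enwiki'
                  '?models=articlequality&revids={revids}')


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def batch_urls(rev_ids, size=50):
    """Build one URL per batch of revision IDs (batch size is an assumption)."""
    return [BATCH_ENDPOINT.format(revids='|'.join(str(r) for r in batch))
            for batch in chunked(rev_ids, size)]


def extract_prediction(result, rev_id):
    """Return the predicted quality class, or None if ORES reported an error."""
    entry = result['enwiki']['scores'][str(rev_id)]['articlequality']
    if 'error' in entry:
        return None
    return entry['score']['prediction']


# Hand-written response with the same nesting as above; not real ORES output.
sample = {'enwiki': {'scores': {
    '111': {'articlequality': {'score': {'prediction': 'Stub'}}},
    '222': {'articlequality': {'error': {'type': 'ExampleError'}}},
}}}

print(batch_urls([111, 222, 333], size=2))
print(extract_prediction(sample, 111), extract_prediction(sample, 222))
```

Each URL could then be passed to `requests.get` as in `api_call`, with `extract_prediction` applied to every revision ID in the batch.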

Let's check our data now that we have predictions.

In [61]:
politician_df

Unnamed: 0,index,page,country,rev_id,prediction
0,1,Bir I of Kanem,Chad,355319463,Stub
1,10,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188,Stub
2,12,Yos Por,Cambodia,393822005,Stub
3,23,Julius Gregr,Czech Republic,395521877,Stub
4,24,Edvard Gregr,Czech Republic,395526568,Stub
...,...,...,...,...,...
46696,47192,Yahya Jammeh,Gambia,807482007,GA
46697,47193,Lucius Fairchild,United States,807483006,C
46698,47194,Fahd of Saudi Arabia,Saudi Arabia,807483153,GA
46699,47195,Francis Fessenden,United States,807483270,C


Let's export every row whose prediction is None so that we know which articles did not produce a result.

ORES could not produce a prediction for 276 articles. I write a log of them below.

In [74]:
politician_df[politician_df['prediction'].isna()].to_csv('wp_wpds_countries-no_prediction.csv')

Unnamed: 0,index,page,country,rev_id,prediction
14,126,List of politicians in Poland,Poland,516633096,
21,222,Tingtingru,Vanuatu,550682925,
51,330,Daud Arsala,Afghanistan,627547024,
75,359,Book:Two Political Biographies,India,636911471,
180,514,Dilaver Bey,Turkey,669987106,
...,...,...,...,...,...
46287,46782,John Rose (Trotskyist),United Kingdom,807336308,
46367,46862,Jalal Movaghar,Iran,807367030,
46368,46863,Mohsen Movaghar,Iran,807367166,
46686,47182,King Gutierrez,Philippines,807479587,


We can drop those rows now.

In [184]:
politician_df = politician_df.dropna()

In [82]:
population_df = population_df.rename(columns={'Name': 'country'})

We can now join the politician data with the population data on the country name.

In [185]:
join_df = pd.merge(politician_df, population_df, on=['country'], how='outer')

In [179]:
join_df[join_df['country'] == 'United States']

Unnamed: 0,page,country,rev_id,prediction,Population
2865,Butler-Belmont family,United States,470173494.0,Start,329878000.0
2866,Heard-Hawes family,United States,502721672.0,C,329878000.0
2867,Russell family (American political family),United States,550953646.0,Stub,329878000.0
2868,Read family of Delaware,United States,651856758.0,Stub,329878000.0
2869,National Foundation for Women Legislators,United States,667815878.0,Start,329878000.0
...,...,...,...,...,...
3952,Huey Long,United States,807473939.0,B,329878000.0
3953,Hamilton family,United States,807473969.0,C,329878000.0
3954,Hal Bidlack,United States,807481636.0,C,329878000.0
3955,Lucius Fairchild,United States,807483006.0,C,329878000.0


In [186]:
join_df = join_df.drop(['index', 'FIPS', 'TimeFrame', 'Data (M)', 'Type'], axis=1)

I now want to find any rows that could not be joined together.

In [187]:
no_matches_df = join_df[join_df.isna().any(axis=1)]

In [191]:
no_matches_df.to_csv('wp_wpds_countries-no_match.csv')
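
An alternative way to find unmatched rows, sketched here on toy data, is the `indicator=True` option of `pd.merge`. It adds a `_merge` column that says whether each row came from the left frame only, the right frame only, or both. The frames below are illustrative: the two populations are copied from the table above, while "Example Land" and the page names are hypothetical.

```python
import pandas as pd

# Toy frames: "Example Land" and the page names are made up for illustration.
articles = pd.DataFrame({
    'country': ['Algeria', 'Algeria', 'Example Land'],
    'page': ['Page A', 'Page B', 'Page C'],
})
populations = pd.DataFrame({
    'country': ['Algeria', 'Tuvalu'],
    'Population': [44357000, 10000],
})

# indicator=True adds a '_merge' column: 'left_only', 'right_only' or 'both'.
merged = pd.merge(articles, populations, on='country', how='outer', indicator=True)

# Rows present on only one side of the join.
no_match = merged[merged['_merge'] != 'both']

# Rows that matched on both sides, with the helper column removed.
matched = merged[merged['_merge'] == 'both'].drop(columns='_merge')

print(no_match)
print(matched)
```

Unlike a NaN check across all columns, this flags a row only when the join itself failed, not when some unrelated column happens to be empty.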

We can drop those rows too.

In [192]:
join_df = join_df.dropna()

I now simply rearrange the columns and give them better names.

In [193]:
join_df = join_df[['country', 'page', 'rev_id', 'prediction', 'Population']]

In [194]:
join_df.columns = ['country', 'article_name', 'revision_id', 'article_quality_est', 'population']

In [153]:
join_df.to_csv('wp_wpds_politicians_by_country.csv')

# Step 3 - Analysis

In [11]:
join_df = pd.read_csv('../data/data-clean/wp_wpds_politicians_by_country.csv')

We can use a pivot table here to count the articles in each quality class for every country.

In [12]:
pivot_df = pd.pivot_table(join_df,
                          fill_value=0, 
                          columns=['article_quality_est'],
                          aggfunc={'article_quality_est': len}, index=['country']
                         )
pivot_df.columns = pivot_df.columns.droplevel()
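
For comparison, the same kind of count table can be sketched with `pd.crosstab`, which counts co-occurrences directly and needs no `droplevel` step. The toy rows below are illustrative only and are not taken from the real data.

```python
import pandas as pd

# Illustrative rows only; the quality labels here are not real predictions.
toy = pd.DataFrame({
    'country': ['Algeria', 'Algeria', 'Algeria', 'Tuvalu'],
    'article_quality_est': ['Stub', 'Stub', 'GA', 'Start'],
})

# One row per country, one column per quality class, cells hold article counts.
counts = pd.crosstab(toy['country'], toy['article_quality_est'])
print(counts)
```

Combinations that never occur are filled with 0, which matches the effect of `fill_value=0` in the pivot table.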

In [13]:
pivot_df = pivot_df.reset_index()

In [14]:
population_df = population_df.rename(columns={'Name' : 'country'})

In [15]:
pivot_analysis = pd.merge(pivot_df, population_df, on='country', how='inner')

In [16]:
pivot_analysis.drop(['Type', 'FIPS', 'Data (M)', 'TimeFrame'], axis=1, inplace=True)

Let's take a look at our new pivoted dataframe.

In [17]:
pivot_analysis

Unnamed: 0,country,B,C,FA,GA,Start,Stub,Population
0,Afghanistan,8,46,1,12,99,153,38928000
1,Albania,3,59,0,3,147,244,2838000
2,Algeria,3,10,0,2,44,57,44357000
3,Andorra,0,2,0,0,8,24,82000
4,Angola,2,6,0,0,23,75,32522000
...,...,...,...,...,...,...,...,...
177,Venezuela,2,16,0,3,47,62,28645000
178,Vietnam,6,50,7,6,55,63,96209000
179,Yemen,1,12,1,2,32,68,29826000
180,Zambia,0,5,0,0,15,5,18384000


We now compute, for each country, the total number of articles and the number of high-quality (GA or FA) articles. From these we derive the number of articles per person and the proportion of high-quality articles among all articles for that country, both expressed as percentages (hence the factor of 100).

In [34]:
pivot_analysis['total_article_count'] = pivot_analysis[['B', 'C', 'GA', 'Start', 'Stub', 'FA']].sum(axis=1)
pivot_analysis['quality_article_count'] = pivot_analysis[['GA', 'FA']].sum(axis=1)
pivot_analysis['articles_per_person'] = (pivot_analysis['total_article_count']/pivot_analysis['Population']) * 100
pivot_analysis['quality_per_article'] = (pivot_analysis['quality_article_count']/pivot_analysis['total_article_count']) * 100
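
As a sanity check on the two formulas, here is a small worked example in plain Python. The function names are my own. The article counts are back-calculated from values that appear in the results later in this notebook: 0.54 for Tuvalu with a population of 10,000 implies 54 articles, and 22.222222 for North Korea over 36 articles implies 8 GA or FA articles.

```python
def articles_per_person_pct(total_articles, population):
    """Articles per person, expressed as a percentage."""
    return total_articles / population * 100


def quality_pct(quality_articles, total_articles):
    """Share of GA/FA articles among all articles, as a percentage."""
    return quality_articles / total_articles * 100


# Tuvalu: 54 articles (implied by the results) for 10,000 people.
tuvalu = articles_per_person_pct(54, 10_000)

# North Korea: 8 GA/FA articles (implied by the results) out of 36.
north_korea = quality_pct(8, 36)

print(round(tuvalu, 2), round(north_korea, 6))
```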

We need to find the same information as the pivot_analysis dataframe but on a regional level. Additional processing is necessary to find this.

In [19]:
full_population_df = pd.read_csv('../data/data-raw/WPDS_2020_data - WPDS_2020_data.csv')

In [20]:
full_population_df = full_population_df.rename(columns={'Name' : 'country'})

One easy way to determine whether a row is a label for a sub-region is to use the 'FIPS' column. Every country is labeled with a two-letter code, while every sub-region has its full name, e.g. 'NORTHERN AMERICA', so we just need to fill in all the rows below a sub-region label with that label.

One row had a NaN FIPS value, so I fill in a placeholder code so that the logic works correctly.

In [21]:
full_population_df.loc[62, 'FIPS'] = 'FF' # One country had NaN FIPS. Had to fix.
current_region = 'NORTHERN AFRICA'
for i in range(3, len(full_population_df.index)):
    # Two-letter FIPS codes mark countries; anything else is treated as a region label
    if len(full_population_df.at[i, 'FIPS']) == 2:
        full_population_df.at[i, 'Region'] = current_region
    else:
        current_region = full_population_df.at[i, 'FIPS']
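
The same "carry the region label down" idea can also be sketched without an explicit loop, using `where` and `ffill`. This is an alternative under my own assumptions, not the notebook's method: the toy frame is a hand-picked subset of rows, and a missing FIPS value would still need a placeholder first, as done above.

```python
import pandas as pd

# Hand-picked subset of rows in file order; region rows carry their full name in FIPS.
toy = pd.DataFrame({
    'FIPS': ['NORTHERN AFRICA', 'DZ', 'EG', 'OCEANIA', 'WS', 'TV'],
    'country': ['NORTHERN AFRICA', 'Algeria', 'Egypt', 'OCEANIA', 'Samoa', 'Tuvalu'],
})

# Countries are the rows with a two-letter FIPS code.
is_country = toy['FIPS'].str.len() == 2

# Keep FIPS only on region rows (NaN elsewhere), then fill each gap
# with the most recent region label above it.
toy['Region'] = toy['FIPS'].where(~is_country).ffill()

# The region label rows themselves are not countries, so drop them.
countries = toy[is_country]
print(countries)
```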


In [23]:
region_df = pd.merge(pivot_analysis, full_population_df, on='country', how='outer')

In [24]:
region_df = region_df.drop(['FIPS', 'country', 'Type', 'TimeFrame', 'Data (M)', 'Population_x'], axis=1)

I now group by region with the sum aggregate function. This gives the article counts and population summed over all countries in each region.

In [25]:
region_df = region_df.groupby(['Region']).sum()

In [27]:
region_df = region_df.rename(columns={'Population_y' : 'Population'})

We repeat the same calculation at the regional level: the total and high-quality article counts for each region, then the number of articles per person and the proportion of high-quality articles among all articles, again as percentages.

In [31]:
region_df['total_article_count'] = region_df[['B', 'C', 'GA', 'Start', 'Stub', 'FA']].sum(axis=1)
region_df['quality_article_count'] = region_df[['GA', 'FA']].sum(axis=1)
region_df['articles_per_person'] = (region_df['total_article_count']/region_df['Population'])  * 100
region_df['quality_per_article'] = (region_df['quality_article_count']/region_df['total_article_count']) * 100
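
Because the counts are summed before dividing, the regional figures are pooled ratios rather than averages of the per-country percentages. The sketch below shows how far apart the two can be. The two-country region and all of its numbers are made up for illustration.

```python
import math

# Hypothetical two-country region; every number here is invented.
region = [
    {'total': 10, 'quality': 5},    # small country: 50% high quality
    {'total': 990, 'quality': 5},   # large country: about 0.5% high quality
]

total = sum(c['total'] for c in region)
quality = sum(c['quality'] for c in region)

# Pooled ratio: sum the counts first, then divide (the approach used above).
pooled_pct = quality / total * 100

# Mean of per-country percentages: every country weighs the same.
mean_of_pcts = sum(c['quality'] / c['total'] * 100 for c in region) / len(region)

print(pooled_pct, mean_of_pcts)
```

The pooled figure is dominated by the country with the most articles, which is worth keeping in mind when reading the regional rankings.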

In [32]:
region_df = region_df.reset_index()

In [33]:
region_df = region_df.dropna()

# Step 4 - Results

Top 10 countries by coverage: 10 highest-ranked countries in terms of number of politician articles as a proportion of country population

In [37]:
pivot_analysis[['country', 'articles_per_person', 'Population']].sort_values('articles_per_person', ascending=False).head(10)

Unnamed: 0,country,articles_per_person,Population
168,Tuvalu,0.54,10000
116,Nauru,0.472727,11000
137,San Marino,0.238235,34000
110,Monaco,0.105263,38000
95,Liechtenstein,0.071795,39000
104,Marshall Islands,0.064912,57000
163,Tonga,0.063636,99000
70,Iceland,0.05462,368000
3,Andorra,0.041463,82000
52,Federated States of Micronesia,0.033962,106000


Bottom 10 countries by coverage: 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population

In [38]:
pivot_analysis[['country', 'articles_per_person', 'Population']].sort_values('articles_per_person', ascending=True).head(10)

Unnamed: 0,country,articles_per_person,Population
71,India,6.9e-05,1400100000
72,Indonesia,7.7e-05,271739000
34,China,8.1e-05,1402385000
175,Uzbekistan,8.2e-05,34174000
51,Ethiopia,8.8e-05,114916000
180,Zambia,0.000136,18384000
84,"Korea, North",0.00014,25779000
161,Thailand,0.000168,66534000
114,Mozambique,0.000186,31166000
13,Bangladesh,0.000187,169809000


Top 10 countries by relative quality: 10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality

In [270]:
pivot_analysis[['country', 'quality_per_article', 'total_article_count']].sort_values('quality_per_article', ascending=False).head(10)

Unnamed: 0,country,quality_per_article,total_article_count
84,"Korea, North",22.222222,36
140,Saudi Arabia,12.820513,117
135,Romania,12.244898,343
31,Central African Republic,12.121212,66
176,Uzbekistan,10.714286,28
106,Mauritania,10.416667,48
64,Guatemala,8.433735,83
44,Dominica,8.333333,12
158,Syria,7.8125,128
18,Benin,7.692308,91


Bottom 10 countries by relative quality: 10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality

In [227]:
pivot_analysis[['country', 'quality_per_article']].sort_values('quality_per_article', ascending=True).head(10)

Unnamed: 0,country,quality_per_article
148,Solomon Islands,0.0
164,Tonga,0.0
117,Nauru,0.0
116,Namibia,0.0
43,Djibouti,0.0
114,Mozambique,0.0
110,Monaco,0.0
49,Eritrea,0.0
50,Estonia,0.0
109,Moldova,0.0


Geographic regions by coverage: Ranking of geographic regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population

In [366]:
region_df[['Region', 'articles_per_person']].sort_values('articles_per_person', ascending=False)

Unnamed: 0,Region,articles_per_person
15,OCEANIA,0.003628
5,Channel Islands,0.003555
20,SOUTHERN EUROPE,0.001211
23,WESTERN EUROPE,0.001166
2,CARIBBEAN,0.000808
8,EASTERN EUROPE,0.000639
19,SOUTHERN AFRICA,0.000468
22,WESTERN ASIA,0.000456
3,CENTRAL AMERICA,0.000432
16,SOUTH AMERICA,0.000353


Geographic regions by relative quality: Ranking of geographic regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality

In [40]:
region_df[['Region', 'quality_per_article']].sort_values('quality_per_article', ascending=False)

Unnamed: 0,Region,quality_per_article
9,NORTHERN AMERICA,5.470805
13,SOUTHEAST ASIA,3.613861
17,WESTERN ASIA,3.472493
6,EASTERN EUROPE,3.161844
4,EAST ASIA,3.07319
2,CENTRAL ASIA,2.857143
3,Channel Islands,2.710603
7,MIDDLE AFRICA,2.406015
8,NORTHERN AFRICA,2.113459
10,OCEANIA,2.015355


# Reflection

When I first began this assignment, I was certain that developed, English-speaking countries would have the highest number of high-quality articles. However, when I compared the number of articles about each country with its population, I found that the number of articles per person was much higher for countries with small populations, which makes sense because the population is the denominator. North Korea is an especially interesting case. English is not spoken there, yet it had by far the highest proportion of high-quality articles (about 22%), and these articles were very likely not written by North Koreans. It is important to note, however, that few articles were written about North Korea (36 in total), but a large share of them are of high quality.

Another reflection concerns the framing of the problem. It is important to remember that these articles are about politicians, so seeing countries like North Korea and Saudi Arabia near the top of the quality ranking is no surprise. Similarly, smaller countries that are less involved in world affairs, like Tonga, are at the bottom of the quality ranking. This is another interesting way that bias enters the data.

There were several sources of bias in this data. One could be the ORES model itself, as I am uncertain exactly how it assigns a score to each article. In addition, the per-person metric is sensitive to population size: countries with large populations will have far fewer articles per person, and countries with small populations far more. Similarly, the quality proportion is sensitive to the total number of articles written about a country. As mentioned above, North Korea had few articles written about it, but many were of high quality.

I think what this suggests about the internet and society in general is that we are all biased in our own way. I would be very interested to run the same analysis on a different-language edition of Wikipedia to see how it compares to English. I imagine that there would be major differences in some areas but, perhaps, similarities in others. The language we speak, where we live, how we live, and numerous other variables contribute to bias in certain ideas, and this bias carries over into the data we analyze.

In general, I think that this assignment was a great way to get us thinking about bias in data. While we have had discussions about this topic before, I felt that this example was a great way to showcase the many ways that bias can be introduced into the data that we analyze. 

