## A2-Bias in Data

The goal of this assignment is to explore the concept of bias using data on Wikipedia articles, specifically articles about political figures from a variety of countries. We combine a dataset of Wikipedia articles with a dataset of country populations, and use a machine learning service called ORES to estimate the quality of each article.

We then analyze how the coverage of politicians on Wikipedia and the quality of articles about politicians vary between countries. The analysis consists of a series of tables that show:

- the countries with the greatest and least coverage of politicians on Wikipedia compared to their population.
- the countries with the highest and lowest proportion of high quality articles about politicians.
- a ranking of geographic regions by articles-per-person and proportion of high quality articles.
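As a rough illustration of the two measures behind these tables, the sketch below computes coverage and relative quality on a toy frame. The countries and numbers are invented for the example, and the column names are assumptions chosen to resemble the ones used later in this notebook.

```python
import pandas as pd

# Toy numbers, invented purely for illustration (not taken from the datasets).
toy = pd.DataFrame({
    "country": ["A", "B"],
    "population": [1_000_000, 50_000],
    "total_articles": [200, 40],
    "high_quality_articles": [10, 0],
})

# Coverage: politician articles as a percentage of the population.
toy["coverage"] = toy["total_articles"] / toy["population"] * 100

# Relative quality: percentage of articles predicted to be FA or GA.
toy["relative_quality"] = toy["high_quality_articles"] / toy["total_articles"] * 100

print(toy[["country", "coverage", "relative_quality"]])
```

Both measures are ratios, so a country with a very small population or very few articles can land at either extreme of a ranking.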

### Getting the Article and Population Data

In [1]:
import numpy as np
import pandas as pd

In [2]:
df_pagedata=pd.read_csv("page_data.csv")
df_pagedata.head()

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


In [3]:
df_populationdata=pd.read_csv("WPDS_2020_data - WPDS_2020_data.csv")
df_populationdata.head()

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
0,WORLD,WORLD,World,2019,7772.85,7772850000
1,AFRICA,AFRICA,Sub-Region,2019,1337.918,1337918000
2,NORTHERN AFRICA,NORTHERN AFRICA,Sub-Region,2019,244.344,244344000
3,DZ,Algeria,Country,2019,44.357,44357000
4,EG,Egypt,Country,2019,100.803,100803000


In [4]:
df_pagedata=df_pagedata[~df_pagedata['page'].str.contains("Template:")]
df_pagedata.head()

Unnamed: 0,page,country,rev_id
1,Bir I of Kanem,Chad,355319463
10,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188
12,Yos Por,Cambodia,393822005
23,Julius Gregr,Czech Republic,395521877
24,Edvard Gregr,Czech Republic,395526568


In [5]:
df_countrydata = df_populationdata[df_populationdata['Type']=='Country']
df_countrydata.head()

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
3,DZ,Algeria,Country,2019,44.357,44357000
4,EG,Egypt,Country,2019,100.803,100803000
5,LY,Libya,Country,2019,6.891,6891000
6,MA,Morocco,Country,2019,35.952,35952000
7,SD,Sudan,Country,2019,43.849,43849000


In [6]:
df_regiondata = df_populationdata[df_populationdata['Type']=='Sub-Region']
df_regiondata.head()

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
1,AFRICA,AFRICA,Sub-Region,2019,1337.918,1337918000
2,NORTHERN AFRICA,NORTHERN AFRICA,Sub-Region,2019,244.344,244344000
10,WESTERN AFRICA,WESTERN AFRICA,Sub-Region,2019,401.115,401115000
27,EASTERN AFRICA,EASTERN AFRICA,Sub-Region,2019,444.97,444970000
48,MIDDLE AFRICA,MIDDLE AFRICA,Sub-Region,2019,179.757,179757000


### Getting Article Quality Predictions

Next, we need a predicted quality score for each article in the Wikipedia dataset. We get these from a machine learning system called ORES. The name was originally an acronym for "Objective Revision Evaluation Service", but the service is now simply called "ORES".

ORES provides estimates of Wikipedia article quality. The article quality estimates are, from best to worst:

- FA: Featured article
- GA: Good article
- B: B-class article
- C: C-class article
- Start: Start-class article
- Stub: Stub-class article

These classes were learned from Wikipedia articles that were peer-reviewed using the Wikipedia content assessment procedures. They are a subset of the quality assessment categories developed by Wikipedia editors.

We use a REST API to obtain the information for each article.
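To make the shape of a reply concrete, here is a minimal offline sketch that parses a hand-written stand-in for one parsed JSON response. The nesting mirrors the keys accessed by the cells in this notebook; the revision IDs, the contents of the `error` entry, and the helper name `split_predictions` are assumptions made for illustration.

```python
# Hand-written stand-in for one parsed JSON reply. The revision IDs and the
# "error" entry are assumptions for illustration, not real API output.
mock_response = {
    "enwiki": {
        "scores": {
            "111": {"articlequality": {"score": {"prediction": "Stub"}}},
            "222": {"articlequality": {"error": {"type": "RevisionNotFound"}}},
        }
    }
}


def split_predictions(response, context="enwiki", model="articlequality"):
    """Return (predictions, rev_ids_without_a_score) for one reply."""
    predictions, missing = {}, []
    for rev_id, result in response[context]["scores"].items():
        score = result.get(model, {}).get("score")
        if score is None:
            # No usable score for this revision; keep the ID for later review.
            missing.append(int(rev_id))
        else:
            predictions[int(rev_id)] = score["prediction"]
    return predictions, missing


predictions, missing = split_predictions(mock_response)
print(predictions)
print(missing)
```

Collecting the unscored revision IDs explicitly, rather than skipping them silently, makes it easier to report which articles had no quality estimate.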

In [7]:
import requests
import json

In [8]:
headers = {'User-Agent': 'https://github.com/Minerva-Lan', 'From' :'lluan@uw.edu'}

In [9]:
def api_call(rev_ids, headers):
    endpoint="https://ores.wikimedia.org/v3/scores/{context}/?revids={revids}&models={model}"
    params={"context": "enwiki",
            "revids"  : '|'.join(str(x) for x in rev_ids),
            "model"  : "articlequality"
            }
    call = requests.get(endpoint.format(**params), headers=headers)
    response = call.json()
    return response

In [10]:
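def extract_class_labels(response):
    # Store the predicted class of each revision in the module-level rev_scores dict.
    for rev_id, val in response.items():
        try:
            rev_scores[rev_id] = val["articlequality"]["score"]["prediction"]
        except KeyError:
            # No score was returned for this revision; it shows up later
            # as an article without a quality estimate.
            pass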

In [11]:
page_data_rev_ids=df_pagedata['rev_id'].tolist()
step = 50
rev_scores={}
for i in range(0, len(page_data_rev_ids), step):
    response = api_call(page_data_rev_ids[i: i+step], headers)
    extract_class_labels(response['enwiki']['scores'])
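The loop above sends revision IDs in slices of 50. The same slicing can be isolated in a small generator, sketched here with made-up revision IDs; `batched` is a hypothetical helper name, not part of this notebook or of any library used in it.

```python
def batched(items, size):
    """Yield consecutive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


rev_ids = list(range(1, 121))         # 120 made-up revision IDs
batches = list(batched(rev_ids, 50))
print([len(b) for b in batches])      # [50, 50, 20]
```

The last slice is simply shorter when the number of IDs is not a multiple of the batch size, so no revision is dropped.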

In [12]:
df_article_quality = pd.DataFrame(list(rev_scores.items()), columns=['rev_id', 'quality'])
df_article_quality.head()

Unnamed: 0,rev_id,quality
0,355319463,Stub
1,393276188,Stub
2,393822005,Stub
3,395521877,Stub
4,395526568,Stub


In [13]:
df_article_quality['rev_id'] = df_article_quality['rev_id'].astype('int')
page_article_quality = pd.merge(df_pagedata, df_article_quality, on='rev_id', how='left')
page_article_quality.head()

Unnamed: 0,page,country,rev_id,quality
0,Bir I of Kanem,Chad,355319463,Stub
1,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188,Stub
2,Yos Por,Cambodia,393822005,Stub
3,Julius Gregr,Czech Republic,395521877,Stub
4,Edvard Gregr,Czech Republic,395526568,Stub


In [14]:
page_no_match = page_article_quality[page_article_quality['quality'].isnull()]
page_no_match.head()

Unnamed: 0,page,country,rev_id,quality
14,List of politicians in Poland,Poland,516633096,
21,Tingtingru,Vanuatu,550682925,
51,Daud Arsala,Afghanistan,627547024,
75,Book:Two Political Biographies,India,636911471,
180,Dilaver Bey,Turkey,669987106,


In [15]:
page_no_match.to_csv('article_without_quality.csv', index=False)

In [16]:
page_with_article_quality=page_article_quality[~page_article_quality['quality'].isnull()]
page_with_article_quality.head()

Unnamed: 0,page,country,rev_id,quality
0,Bir I of Kanem,Chad,355319463,Stub
1,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188,Stub
2,Yos Por,Cambodia,393822005,Stub
3,Julius Gregr,Czech Republic,395521877,Stub
4,Edvard Gregr,Czech Republic,395526568,Stub


In [17]:
page_with_article_quality.to_csv('page_with_article_quality.csv', index=False)

### Combining the Datasets

In [18]:
wp_wpds_politicians_by_country=pd.merge(left=page_with_article_quality, right =df_countrydata, 
                                  left_on='country', right_on='Name')
wp_wpds_politicians_by_country.head()

Unnamed: 0,page,country,rev_id,quality,FIPS,Name,Type,TimeFrame,Data (M),Population
0,Bir I of Kanem,Chad,355319463,Stub,TD,Chad,Country,2019,16.877,16877000
1,Abdullah II of Kanem,Chad,498683267,Stub,TD,Chad,Country,2019,16.877,16877000
2,Salmama II of Kanem,Chad,565745353,Stub,TD,Chad,Country,2019,16.877,16877000
3,Kuri I of Kanem,Chad,565745365,Stub,TD,Chad,Country,2019,16.877,16877000
4,Mohammed I of Kanem,Chad,565745375,Stub,TD,Chad,Country,2019,16.877,16877000


In [19]:
wp_wpds_politicians_by_country=wp_wpds_politicians_by_country[['page', 'country', 'rev_id', 'quality', 'Population']]
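# Assign the result instead of renaming in place, which avoids a
# SettingWithCopyWarning on the column subset selected above.
wp_wpds_politicians_by_country = wp_wpds_politicians_by_country.rename(
    columns={'page' : 'article_name',
             'rev_id' : 'revision_id',
             'quality' : 'article_quality_est',
             'Population' : 'population'})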
wp_wpds_politicians_by_country.head()

Unnamed: 0,article_name,country,revision_id,article_quality_est,population
0,Bir I of Kanem,Chad,355319463,Stub,16877000
1,Abdullah II of Kanem,Chad,498683267,Stub,16877000
2,Salmama II of Kanem,Chad,565745353,Stub,16877000
3,Kuri I of Kanem,Chad,565745365,Stub,16877000
4,Mohammed I of Kanem,Chad,565745375,Stub,16877000


In [20]:
wp_wpds_politicians_by_country.to_csv('wp_wpds_politicians_by_country.csv', index=False)

In [21]:
wp_wpds_politicians_by_country_all=pd.merge(left=page_with_article_quality, right =df_countrydata, 
                                  left_on='country', right_on='Name', how='outer')

In [22]:
no_country_match = wp_wpds_politicians_by_country_all[wp_wpds_politicians_by_country_all['Population'].isnull()]
no_country_match.to_csv('wp_wpds_countries-no_match.csv', index = False)
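Filtering the outer merge on a null `Population` finds articles whose country is missing from the population data, but not countries that have no articles. One way to see both sides in a single pass is the `indicator` option of `pd.merge`, sketched below on toy frames. The names "Atlantis" and "Narnia" and the population of 1000 are invented for the example.

```python
import pandas as pd

# Toy inputs: "Atlantis" has no population row and "Narnia" has no articles.
articles = pd.DataFrame({
    "page": ["p1", "p2", "p3"],
    "country": ["Chad", "Chad", "Atlantis"],
})
countries = pd.DataFrame({
    "Name": ["Chad", "Narnia"],
    "Population": [16877000, 1000],
})

# indicator=True adds a "_merge" column: "both", "left_only" or "right_only".
merged = pd.merge(articles, countries, left_on="country", right_on="Name",
                  how="outer", indicator=True)

matched = merged[merged["_merge"] == "both"]
unmatched = merged[merged["_merge"] != "both"]

print(matched[["page", "country", "Population"]])
print(unmatched[["page", "country", "Name", "_merge"]])
```

Whether countries without any articles belong in the no-match file depends on the assignment's definition, so this is only one possible reading.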

### Analysis

In [23]:
pivot_quality = pd.pivot_table(wp_wpds_politicians_by_country, index=['country'], columns=['article_quality_est'],
              aggfunc={'article_quality_est' : len}, fill_value =0)
pivot_quality.columns=pivot_quality.columns.droplevel()
pivot_quality.reset_index(inplace=True)
pivot_quality.columns.name=None
pivot_quality.head()

Unnamed: 0,country,B,C,FA,GA,Start,Stub
0,Afghanistan,8,46,1,12,99,153
1,Albania,3,59,0,3,147,244
2,Algeria,3,10,0,2,44,57
3,Andorra,0,2,0,0,8,24
4,Angola,2,6,0,0,23,75
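An alternative way to build the same kind of count table is `pd.crosstab`, sketched here on a toy frame with invented rows. Reindexing against the full list of classes is an addition of this sketch: it guarantees that all six columns exist even when a class never occurs in the data.

```python
import pandas as pd

# Toy rows, invented for illustration.
toy = pd.DataFrame({
    "country": ["Chad", "Chad", "Chad", "Peru"],
    "article_quality_est": ["Stub", "Stub", "GA", "FA"],
})

# The six quality classes, from best to worst.
classes = ["FA", "GA", "B", "C", "Start", "Stub"]

# Count articles per country and class, adding zero-filled columns
# for classes that do not appear in the data.
counts = pd.crosstab(toy["country"], toy["article_quality_est"])
counts = counts.reindex(columns=classes, fill_value=0)

print(counts)
```

With every class present as a column, later sums over the class columns cannot fail with a missing-column error.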


In [24]:
df_country_analysis=pivot_quality.merge(wp_wpds_politicians_by_country.groupby(['country'])['population'].mean(),
                    left_on='country', right_index=True)

In [25]:
quality_classes = ['B', 'C', 'FA', 'GA', 'Start', 'Stub']
df_country_analysis['total_articles'] = df_country_analysis[quality_classes].sum(axis=1)
df_country_analysis['high_quality_article'] = df_country_analysis['FA'] + df_country_analysis['GA']
df_country_analysis['relative_quality']=(df_country_analysis['high_quality_article']/df_country_analysis['total_articles'])*100
df_country_analysis['coverage']=(df_country_analysis['total_articles']/df_country_analysis['population'])*100

In [26]:
df_country_analysis.drop(columns=['B', 'C', 'FA', 'GA', 'Start', 'Stub'], inplace=True)
df_country_analysis.head()

Unnamed: 0,country,population,total_articles,high_quality_article,relative_quality,coverage
0,Afghanistan,38928000,319,13,4.075235,0.000819
1,Albania,2838000,456,3,0.657895,0.016068
2,Algeria,44357000,116,2,1.724138,0.000262
3,Andorra,82000,34,0,0.0,0.041463
4,Angola,32522000,106,0,0.0,0.000326


In [27]:
region = "NORTHERN AFRICA"

regions = ['WORLD', 'AFRICA', 'NORTHERN AFRICA']
for i in range(3, len(df_populationdata)):
    if df_populationdata.iloc[i]['Type']=='Sub-Region':
        region = df_populationdata.iloc[i]['Name']
    regions.append(region)

In [28]:
df_populationdata['Region'] = regions
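The loop above relies on the file listing each sub-region immediately before its countries. Under that same assumption, the labelling can be sketched without a loop by keeping the name only on sub-region rows and forward-filling it. The toy frame below is invented for illustration.

```python
import pandas as pd

# Toy frame in the same layout: each sub-region row precedes its countries.
toy = pd.DataFrame({
    "Name": ["NORTHERN AFRICA", "Algeria", "Egypt", "WESTERN AFRICA", "Benin"],
    "Type": ["Sub-Region", "Country", "Country", "Sub-Region", "Country"],
})

# Keep the name on sub-region rows, blank it elsewhere, then carry the
# most recent sub-region name down to the rows that follow it.
toy["Region"] = toy["Name"].where(toy["Type"] == "Sub-Region").ffill()

print(toy)
```

Unlike the loop, this leaves rows that precede the first sub-region (such as a world total) without a region. In both versions, any row typed as `Sub-Region` starts a new group, so the result is only as reliable as the `Type` column; the tables in this notebook list Channel Islands as a region, which indicates it is typed that way in the population file.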

In [29]:
df_country_region_analysis = pd.merge(left=df_country_analysis, right =df_populationdata[['Name', 'Region']], 
                                  left_on='country', right_on='Name', how='left')
df_country_region_analysis.drop(columns='Name', inplace = True)
df_country_region_analysis.head()

Unnamed: 0,country,population,total_articles,high_quality_article,relative_quality,coverage,Region
0,Afghanistan,38928000,319,13,4.075235,0.000819,SOUTH ASIA
1,Albania,2838000,456,3,0.657895,0.016068,SOUTHERN EUROPE
2,Algeria,44357000,116,2,1.724138,0.000262,NORTHERN AFRICA
3,Andorra,82000,34,0,0.0,0.041463,SOUTHERN EUROPE
4,Angola,32522000,106,0,0.0,0.000326,MIDDLE AFRICA


In [30]:
df_region_analysis=df_country_region_analysis.groupby('Region').agg({'total_articles':'sum', 'high_quality_article': 'sum'})
df_region_analysis= df_region_analysis.merge(df_regiondata[['Name', 'Population']],left_index=True, right_on='Name')
df_region_analysis.head()

Unnamed: 0,total_articles,high_quality_article,Name,Population
77,695,13,CARIBBEAN,43233000
68,1543,23,CENTRAL AMERICA,178611000
129,245,7,CENTRAL ASIA,74961000
168,3763,102,Channel Islands,172000
157,2473,76,EAST ASIA,1641063000


In [31]:
df_region_analysis['coverage'] = df_region_analysis['total_articles']/df_region_analysis['Population']*100
df_region_analysis['relative_quality'] = df_region_analysis['high_quality_article']/df_region_analysis['total_articles']*100
df_region_analysis=df_region_analysis[['Name', 'coverage', 'relative_quality']]
df_region_analysis = df_region_analysis.rename(columns={'Name':'Region'})
df_region_analysis.head()

Unnamed: 0,Region,coverage,relative_quality
77,CARIBBEAN,0.001608,1.870504
68,CENTRAL AMERICA,0.000864,1.490603
129,CENTRAL ASIA,0.000327,2.857143
168,Channel Islands,2.187791,2.710603
157,EAST ASIA,0.000151,3.07319


### Results

#### Top 10 countries by coverage: 10 highest-ranked countries in terms of number of politician articles as a proportion of country population

In [32]:
df_country_analysis[['country', 'coverage']].sort_values(by=['coverage'], ascending=False)[0:10].reset_index().drop('index', axis=1)

Unnamed: 0,country,coverage
0,Tuvalu,0.54
1,Nauru,0.472727
2,San Marino,0.238235
3,Monaco,0.105263
4,Liechtenstein,0.071795
5,Marshall Islands,0.064912
6,Tonga,0.063636
7,Iceland,0.05462
8,Andorra,0.041463
9,Federated States of Micronesia,0.033962


#### Bottom 10 countries by coverage: 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population

In [33]:
df_country_analysis[['country','coverage']].sort_values(by=['coverage'])[0:10].reset_index().drop('index', axis=1)

Unnamed: 0,country,coverage
0,India,6.9e-05
1,Indonesia,7.7e-05
2,China,8.1e-05
3,Uzbekistan,8.2e-05
4,Ethiopia,8.8e-05
5,Zambia,0.000136
6,"Korea, North",0.00014
7,Thailand,0.000168
8,Mozambique,0.000186
9,Bangladesh,0.000187


#### Top 10 countries by relative quality: 10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality

In [34]:
df_country_analysis[['country','relative_quality']].sort_values(by=['relative_quality'], ascending=False)[0:10].reset_index().drop('index', axis=1)

Unnamed: 0,country,relative_quality
0,"Korea, North",22.222222
1,Saudi Arabia,12.820513
2,Romania,12.244898
3,Central African Republic,12.121212
4,Uzbekistan,10.714286
5,Mauritania,10.416667
6,Guatemala,8.433735
7,Dominica,8.333333
8,Syria,7.8125
9,Benin,7.692308


#### Bottom 10 countries by relative quality: 10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality

In [35]:
df_country_analysis[['country','relative_quality']].sort_values(by=['relative_quality'])[0:10].reset_index().drop('index', axis=1)

Unnamed: 0,country,relative_quality
0,Solomon Islands,0.0
1,Tonga,0.0
2,Nauru,0.0
3,Namibia,0.0
4,Djibouti,0.0
5,Mozambique,0.0
6,Monaco,0.0
7,Eritrea,0.0
8,Estonia,0.0
9,Moldova,0.0


#### Geographic regions by coverage: Ranking of geographic regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population

In [36]:
df_region_analysis[['Region','coverage']].sort_values(by=['coverage'], ascending =False)

Unnamed: 0,Region,coverage
168,Channel Islands,2.187791
216,OCEANIA,0.007244
200,SOUTHERN EUROPE,0.002421
179,WESTERN EUROPE,0.002333
77,CARIBBEAN,0.001608
189,EASTERN EUROPE,0.001279
58,SOUTHERN AFRICA,0.000936
110,WESTERN ASIA,0.000912
68,CENTRAL AMERICA,0.000864
95,SOUTH AMERICA,0.000706


#### Geographic regions by relative quality: Ranking of geographic regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality

In [37]:
df_region_analysis[['Region','relative_quality']].sort_values(by=['relative_quality'], ascending =False)

Unnamed: 0,Region,relative_quality
64,NORTHERN AMERICA,5.470805
145,SOUTHEAST ASIA,3.613861
110,WESTERN ASIA,3.472493
189,EASTERN EUROPE,3.161844
157,EAST ASIA,3.07319
129,CENTRAL ASIA,2.857143
168,Channel Islands,2.710603
48,MIDDLE AFRICA,2.406015
2,NORTHERN AFRICA,2.113459
216,OCEANIA,2.015355


### Reflection

#### What biases did you expect to find in the data (before you started working with it), and why?

The data comes from English Wikipedia only. English is not the native language of most of the world's population, so many articles about politicians are likely written in other languages and are missing from this dataset. I was therefore not surprised that China and India do not have the highest number of articles, even though they have the largest populations. It is likely that far more is published about their politicians in their official languages and in other media.


#### What (potential) sources of bias did you discover in the course of your data processing and analysis?

Over the course of the analysis, I found that the proportion of high-quality articles per country could be affected by the ORES model used to classify article quality. Because the model was trained to measure quality from the structure of an article rather than its content, it could introduce bias. Many aspects of the underlying machine learning algorithms are also not described in the ORES documentation, which could hide unintended bias.

 

#### Can you think of a realistic data science research situation where using this data (to train a model, perform a hypothesis-driven research, or make business decisions) might create biased or misleading results, due to the inherent gaps and limitations of the data?

The number of politician articles per capita depends heavily on the article count, and the population data is more accurate than the article counts. Because we only fetched politician articles from English Wikipedia, the article dataset does not contain every politician article for every country in the world. Anyone analyzing politician coverage per capita using English Wikipedia as the only source would reach heavily biased and misleading conclusions.