The goal of this assignment is to explore the concept of bias through data on Wikipedia articles - specifically, articles on political figures from a variety of countries. For this assignment, we will combine a dataset of Wikipedia articles with a dataset of country populations, and use a machine learning service called ORES to estimate the quality of each article.

We will then analyze how the coverage of politicians on Wikipedia and the quality of articles about politicians vary between countries. The analysis will consist of a series of tables that show:

1. The countries with the greatest and least coverage of politicians on Wikipedia compared to their population.
2. The countries with the highest and lowest proportion of high-quality articles about politicians.
3. A ranking of geographic regions by articles-per-person and proportion of high-quality articles.


### Step 1: Getting the Article and Population Data

In [1]:
import pandas as pd
import numpy as np
from ores import api
from tqdm import tqdm

In [2]:
page_data_path = '../data/page_data.csv'
WPDS_path = '../data/WPDS_2020_data.csv'

page_df = pd.read_csv(page_data_path)
WPDS_df = pd.read_csv(WPDS_path)

### Step 2: Cleaning the Data

In [3]:
page_df.rev_id.isna().sum()

0

In [4]:
# Drop pages whose title contains 'Template:'; these are not articles about politicians.
page_df = page_df.loc[~page_df['page'].str.contains('Template:')]

In [5]:
page_df.rev_id.isna().sum()

0

In [6]:
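# The first three rows (WORLD, AFRICA, NORTHERN AFRICA) are region headers,
# so seed the list with them and start scanning at row 3.
region = 'NORTHERN AFRICA'

regions = ['WORLD', 'AFRICA', 'NORTHERN AFRICA']
for i in range(3, len(WPDS_df)):
    # A 'Sub-Region' row starts a new block; the rows below it inherit its name.
    if WPDS_df.iloc[i]['Type'] == 'Sub-Region':
        region = WPDS_df.iloc[i]['Name']
    regions.append(region)

WPDS_df['Region'] = regions

# Region names are written in upper case; country names are not.
is_region = WPDS_df.Name.str.isupper()
regions_df = WPDS_df.loc[is_region]
countries_df = WPDS_df.loc[~is_region]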

In [7]:
regions_df

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population,Region
0,WORLD,WORLD,World,2019,7772.85,7772850000,WORLD
1,AFRICA,AFRICA,Sub-Region,2019,1337.918,1337918000,AFRICA
2,NORTHERN AFRICA,NORTHERN AFRICA,Sub-Region,2019,244.344,244344000,NORTHERN AFRICA
10,WESTERN AFRICA,WESTERN AFRICA,Sub-Region,2019,401.115,401115000,WESTERN AFRICA
27,EASTERN AFRICA,EASTERN AFRICA,Sub-Region,2019,444.97,444970000,EASTERN AFRICA
48,MIDDLE AFRICA,MIDDLE AFRICA,Sub-Region,2019,179.757,179757000,MIDDLE AFRICA
58,SOUTHERN AFRICA,SOUTHERN AFRICA,Sub-Region,2019,67.732,67732000,SOUTHERN AFRICA
64,NORTHERN AMERICA,NORTHERN AMERICA,Sub-Region,2019,368.193,368193000,NORTHERN AMERICA
67,LATIN AMERICA AND THE CARIBBEAN,LATIN AMERICA AND THE CARIBBEAN,Sub-Region,2019,651.036,651036000,LATIN AMERICA AND THE CARIBBEAN
68,CENTRAL AMERICA,CENTRAL AMERICA,Sub-Region,2019,178.611,178611000,CENTRAL AMERICA
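
The `Region` column shown above can also be derived without an explicit loop. The following is a minimal sketch on a hypothetical miniature of the WPDS table (the `toy` frame and its rows are illustrative, not the real data), assuming every non-country row is a region header that applies to the country rows beneath it:

```python
import pandas as pd

# Hypothetical miniature of the WPDS table; only Name and Type matter here.
toy = pd.DataFrame({
    "Name": ["WORLD", "AFRICA", "NORTHERN AFRICA", "Algeria", "Egypt",
             "WESTERN AFRICA", "Benin"],
    "Type": ["World", "Sub-Region", "Sub-Region", "Country", "Country",
             "Sub-Region", "Country"],
})

# Keep the name on header rows, blank it on country rows,
# then carry the most recent header name forward.
is_country = toy["Type"] == "Country"
toy["Region"] = toy["Name"].where(~is_country).ffill()

print(toy)
```

This avoids hard-coding the first three rows, at the cost of relying on the `Type` column being filled in consistently.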


In [8]:
countries_df

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population,Region
3,DZ,Algeria,Country,2019,44.357,44357000,NORTHERN AFRICA
4,EG,Egypt,Country,2019,100.803,100803000,NORTHERN AFRICA
5,LY,Libya,Country,2019,6.891,6891000,NORTHERN AFRICA
6,MA,Morocco,Country,2019,35.952,35952000,NORTHERN AFRICA
7,SD,Sudan,Country,2019,43.849,43849000,NORTHERN AFRICA
...,...,...,...,...,...,...,...
229,WS,Samoa,Country,2019,0.200,200000,OCEANIA
230,SB,Solomon Islands,Country,2019,0.715,715000,OCEANIA
231,TO,Tonga,Country,2019,0.099,99000,OCEANIA
232,TV,Tuvalu,Country,2019,0.010,10000,OCEANIA


In [9]:
len(countries_df.Name.unique())

210

### Step 3: Getting Article Quality Predictions

ORES returns a `prediction` value that names one quality category, along with probability values for each of the six quality categories. **This analysis captures and uses only the `prediction` value.**

In [10]:
page_df

Unnamed: 0,page,country,rev_id
1,Bir I of Kanem,Chad,355319463
10,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188
12,Yos Por,Cambodia,393822005
23,Julius Gregr,Czech Republic,395521877
24,Edvard Gregr,Czech Republic,395526568
...,...,...,...
47192,Yahya Jammeh,Gambia,807482007
47193,Lucius Fairchild,United States,807483006
47194,Fahd of Saudi Arabia,Saudi Arabia,807483153
47195,Francis Fessenden,United States,807483270


In [11]:
ores_session = api.Session("https://ores.wikimedia.org", "DATA 512 Class project <swheele@uw.edu>")

revids = list(page_df.rev_id)
results = []

for revid in revids:
    results.append(ores_session.score("enwiki", ["articlequality"], [revid]))


In [12]:
scores = []

In [13]:
print(len(revids))
print(len(results))
print(len(scores))

46701
46701
0


In [14]:
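# Preview the first result.
# Caution: score() appears to return a lazy generator, so this loop exhausts
# results[0]; that entry then contributes nothing to `scores` in the next cell,
# which is consistent with `scores` ending up one entry short below.
for score in results[0]:
    print(score)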

{'articlequality': {'score': {'prediction': 'Stub', 'probability': {'B': 0.005643168767502225, 'C': 0.005641424870624224, 'FA': 0.0010757577110297029, 'GA': 0.001543343686495854, 'Start': 0.010537503531047517, 'Stub': 0.9755588014333005}}}}
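
Given the response shape printed above, pulling out the prediction can be wrapped in a small helper. This is a sketch only: `extract_prediction` is a hypothetical name, and the layout of the error document is an assumption made for illustration (the notebook only checks for an `'error'` key).

```python
import numpy as np

def extract_prediction(score_doc, model="articlequality"):
    """Return the predicted quality class, or NaN when no score is present."""
    model_doc = score_doc.get(model, {})
    # An errored revision carries an 'error' key instead of a 'score' key.
    if "score" not in model_doc:
        return np.nan
    return model_doc["score"]["prediction"]

# Same structure as the printed response, with shortened probabilities.
ok_doc = {"articlequality": {"score": {
    "prediction": "Stub",
    "probability": {"Stub": 0.98, "Start": 0.02},
}}}
# Assumed error layout, for illustration only.
err_doc = {"articlequality": {"error": {"type": "example", "message": "example"}}}

print(extract_prediction(ok_doc))
print(extract_prediction(err_doc))
```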


In [15]:
for i in tqdm(range(len(results))):
    for score in results[i]:
        # Revisions that ORES could not score carry an 'error' key; record them as NaN.
        if 'error' in score['articlequality']:
            scores.append((revids[i], np.nan))
        else:
            scores.append((revids[i], score['articlequality']['score']['prediction']))

100%|███████████████████████████████████| 46701/46701 [1:13:29<00:00, 10.59it/s]


In [16]:
len(scores)

46700

In [29]:
print(revids[1000])
print(scores[999])

705737107
(705737107, 'Start')
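
The one-position shift above (`revids[1000]` lines up with `scores[999]`) is what one would expect if each entry of `results` is a lazy generator and the earlier preview of `results[0]` consumed it. That is an assumption about how the scoring call behaves, not something confirmed here. The sketch below reproduces the effect with a hypothetical stand-in, `fake_score`, and shows that materializing each result with `list()` avoids it:

```python
def fake_score(revid):
    """Hypothetical stand-in for a lazy scoring call: yields one result."""
    yield {"rev_id": revid, "prediction": "Stub"}

revids = [101, 102, 103]

# Lazy: each entry is an unconsumed generator.
lazy_results = [fake_score(r) for r in revids]
for doc in lazy_results[0]:      # the "preview" exhausts the first generator
    pass
lazy_scores = [doc for gen in lazy_results for doc in gen]

# Eager: list() materializes each result, so previewing is harmless.
eager_results = [list(fake_score(r)) for r in revids]
for doc in eager_results[0]:
    pass
eager_scores = [doc for res in eager_results for doc in res]

print(len(lazy_scores), len(eager_scores))  # 2 3
```

Because every tuple in `scores` carries its own revision ID, joining on that ID instead of relying on list position sidesteps the misalignment either way.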


In [None]:
page_df = pd.read_csv(page_data_path)

page_df.shape

In [None]:
len(scores)

In [None]:
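# Join predictions to pages on rev_id rather than by row position, so pages
# without a score (e.g. the Template: rows that were never sent to ORES) get NaN.
scores_df = pd.DataFrame(scores, columns=['rev_id', 'score'])
page_df = page_df.merge(scores_df, on='rev_id', how='left')

page_df.rev_id.isna().sum()

In [None]:
page_df.score.isna().sum()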

In [None]:
page_df.rev_id.isna().sum()

In [None]:
page_df

In [None]:
nan_scores_df = page_df[page_df['score'].isna()]
articles_df = page_df[~page_df['score'].isna()]

In [None]:
nan_scores_df

In [None]:
# NaN never compares equal to anything (including itself), so test with isna().
articles_df[(articles_df.rev_id.isna()) & (articles_df.score == 'Stub')]

In [None]:
page_df.loc[(page_df.rev_id.isna())]

In [None]:
page_df2 = pd.read_csv(page_data_path)
page_df2

In [None]:
page_df2.rev_id.isna().sum()

### Step 4: Combining the Datasets

In [None]:
merged_df = articles_df.merge(countries_df, how='outer', left_on='country', right_on='Name')


In [None]:
merged_df

In [None]:
# Rows with no score came only from the population table; rows with no Name
# came only from the article table. Set both aside as unmatched.
no_matches_df = merged_df.loc[(merged_df.score.isna()) | (merged_df.Name.isna())]
merged_df = merged_df.drop(index=no_matches_df.index)
merged_df = (
    merged_df
    .drop(columns=['Name', 'Type', 'FIPS', 'TimeFrame', 'Data (M)'])
    .rename(columns={
        'page': 'article_name',
        'rev_id': 'revision_id',
        'score': 'article_quality_est',
        'Population': 'population',
    })
)

merged_df
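
As an alternative way to separate matched from unmatched rows after an outer merge, pandas can label each row's origin through the `indicator=True` option of `merge`. The sketch below uses hypothetical toy frames (`articles`, `countries`, and their values are made up for illustration) rather than the notebook's data:

```python
import pandas as pd

# Toy inputs: 'Atlantis' has no population row, 'Tuvalu' has no articles.
articles = pd.DataFrame({
    "page": ["A", "B", "C"],
    "country": ["Chad", "Chad", "Atlantis"],
})
countries = pd.DataFrame({"Name": ["Chad", "Tuvalu"], "Population": [1000, 10]})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'.
merged = articles.merge(
    countries, how="outer", left_on="country", right_on="Name", indicator=True
)

matched = merged[merged["_merge"] == "both"].drop(columns="_merge")
unmatched = merged[merged["_merge"] != "both"]

print(matched)
print(unmatched)
```

The `_merge` column states directly which side a row came from, instead of inferring it from missing values that could also occur for other reasons.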

In [None]:
# `in` on a Series tests the index labels, so check the values explicitly.
'Guam' in merged_df.country.values

### Step 5: Analysis

In [None]:
groupby_df = merged_df.groupby(['country', 'article_quality_est']).agg({'revision_id': 'count', 'population': 'first'})

groupby_df

In [None]:
groupby_df.loc[('United Kingdom', slice(None)), :].revision_id.sum()

In [None]:
print('FA' not in groupby_df.loc[('Cambodia', slice(None)), :].index)

In [None]:
merged_df.country.unique()

In [None]:
data = []
for country in merged_df.country.unique():
    # All quality classes for this country; the index is now article_quality_est.
    country_rows = groupby_df.xs(country, level='country')

    total_articles = country_rows.revision_id.sum()
    # Count FA and GA articles; a class missing for this country counts as 0.
    high_quality = country_rows.revision_id.reindex(['FA', 'GA'], fill_value=0).sum()

    # Coverage is based on all articles, not only the high-quality ones.
    articles_per_pop = (total_articles / country_rows.population.iloc[0]) * 100
    high_quality_percent = (high_quality / total_articles) * 100

    data.append([country, articles_per_pop, high_quality_percent])

results_df = pd.DataFrame(data, columns=['country', 'articles_per_pop', 'high_quality_percent'])
results_df
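
The same two per-country metrics can be computed without a Python loop. This is a sketch on a hypothetical toy frame, assuming that coverage means all articles divided by population (times 100) and that "high quality" means a predicted class of `FA` or `GA`; the column names mirror those used in the merged table:

```python
import pandas as pd

# Toy stand-in for the merged table: one row per article.
toy = pd.DataFrame({
    "country": ["X", "X", "X", "X", "Y", "Y"],
    "article_quality_est": ["FA", "Stub", "Stub", "GA", "Stub", "Start"],
    "population": [1000, 1000, 1000, 1000, 500, 500],
})

# Flag high-quality articles, then aggregate once per country.
toy["is_high_quality"] = toy["article_quality_est"].isin(["FA", "GA"])
summary = toy.groupby("country").agg(
    articles=("article_quality_est", "size"),
    high_quality=("is_high_quality", "sum"),
    population=("population", "first"),
)
summary["articles_per_pop"] = summary["articles"] / summary["population"] * 100
summary["high_quality_percent"] = summary["high_quality"] / summary["articles"] * 100

print(summary)
```

Countries with no `FA` or `GA` articles get a count of 0 here automatically, so no special case is needed for missing quality classes.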

In [None]:
results_df.nlargest(10, 'articles_per_pop', keep='all')

In [None]:
# Exact equality on floats is fragile; compare within a tolerance instead.
results_df.loc[np.isclose(results_df.high_quality_percent, 9.523809523809524)]

In [None]:
groupby_df

In [None]:
WPDS_full_df = pd.read_csv(WPDS_path)

In [None]:
WPDS_regions

In [None]:
WPDS_regions_df.shape