# DATA512 A2: Bias in data

Wikipedia is a free and openly editable online encyclopedia. Despite its nature of open collaboration, [critics note](https://en.wikipedia.org/wiki/Criticism_of_Wikipedia) that there is bias in its coverage. In this notebook, we examine bias in English Wikipedia's coverage of politicians and the quality of these articles.

## Data sources
For Wikipedia data about politicians, we use data processed by Os Keyes from [Politicians by Country from the English-language Wikipedia](https://figshare.com/articles/Untitled_Item/5513449), which is available as a Fileset on figshare under the CC-BY-SA 4.0 license. The file named `page_data.csv` was extracted from the Fileset and saved to the `data_raw` folder.

Format: {page, country, rev_id}

Population data is from the Population Reference Bureau's 2018 [World Population Data Sheet](http://www.worldpopdata.org/table), using the "Population mid-2018 (millions)" indicator with the geography filter set to the regions Africa, Asia, Europe, Latin America and the Caribbean, Northern America and Oceania, plus all countries selected. Instead of using the PRB website directly, we used a cached copy of the WPDS 2018 CSV file hosted at [this Dropbox location](https://www.dropbox.com/s/5u7sy1xt7g0oi2c/WPDS_2018_data.csv?dl=0), which we saved to the `data_raw` folder. The license for this dataset is unknown.

Format: {geography, population in millions}

To measure quality, we will use Wikimedia's web service called [Objective Revision Evaluation Service (ORES)](https://ores.wikimedia.org/) to make predictions about articles' quality rating according to the English Wikipedia 1.0 (wp10) assessment scale.

## Data acquisition

In [1]:
import os, errno
import json
import requests

def create_folder(path):
    """Creates a folder if it doesn't already exist."""
    created = False
    
    try:
        os.makedirs(path)
        created = True
    except OSError as e:
        if e.errno != errno.EEXIST:
            raise
            
    return created

HEADERS = {
    'Api-User-Agent': 'https://github.com/EdmundTse/data-512-a2'
}

def get_ores_data(revision_ids):
    """Uses ORES API to get wp10 quality scores for the given revision IDs."""
    
    ores_endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
    revids = '|'.join(str(x) for x in revision_ids)
    params = {
        'project': 'enwiki',
        'model': 'wp10',
        'revids': revids
    }
    
    api_call = requests.get(ores_endpoint.format(**params), headers=HEADERS)
    
    response = api_call.json()
    return response

First, load the Wikipedia data to find the revision IDs we need to request quality scores for. The CSV file has a header row and has the following columns:

| Column Name | Description                 | Format  |
|-------------|-----------------------------|---------|
| page        | Wikipedia page name         | text    |
| country     | Country name                | text    |
| rev_id      | Wikipedia page revision ID  | integer |

In [2]:
import csv

RAW_DATA_DIR = 'data_raw'
page_data_path = os.path.join(RAW_DATA_DIR, 'page_data.csv')

with open(page_data_path, encoding='utf-8') as page_data_file:
    reader = csv.reader(page_data_file)
    
    # Skip the header row
    next(reader)
    
    page_data = [row for row in reader]

Next, batch the revision IDs, as recommended by the API usage guidelines, to retrieve quality scores using ORES. If saved responses already exist, use those instead.

In [3]:
# If present, load the saved API responses from file.
ORES_FILENAME = 'ores_responses.json'
path = os.path.join(RAW_DATA_DIR, ORES_FILENAME)

try:
    with open(path) as f:
        results = json.load(f)
except FileNotFoundError:
    results = []

In [4]:
%%time

# This variant uses 4 workers to make parallel requests to ORES for faster completion

import threading, queue

if not results:
    NUM_THREADS = 4
    MAX_BATCH_SIZE = 50

    def ores_worker():
        """Worker thread takes batches of rev_ids and makes requests to ORES."""
        while True:
            item = q.get()
            if item is None:
                break

            serial_num, rev_ids = item
            response = get_ores_data(rev_ids)
            results.append((serial_num, response))
            q.task_done()

    # Start the ORES worker threads
    q = queue.Queue()
    threads = []
    for i in range(NUM_THREADS):
        t = threading.Thread(target=ores_worker)
        t.start()
        threads.append(t)

    # Batch and queue revision IDs for ORES requests
    serial_num = 0
    rev_ids_batch = []
    for page in page_data:
        # Add this page revision ID to batch
        rev_id = int(page[2])
        rev_ids_batch.append(rev_id)

        # When the batch is filled, enqueue a new request
        if len(rev_ids_batch) == MAX_BATCH_SIZE:
            serial_num += 1
            q.put((serial_num, rev_ids_batch))
            rev_ids_batch = []

    # Flush any remaining revision IDs that didn't fill a batch
    if rev_ids_batch:
        serial_num += 1
        q.put((serial_num, rev_ids_batch))

    # Wait for queue, workers and their threads to complete
    q.join()
    for i in range(NUM_THREADS):
        q.put(None)
    for t in threads:
        t.join()

    # Put the responses back into the original order, then discard the order index
    results.sort()
    results = [x[1] for x in results]

Wall time: 1min 29s
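
As a hedged alternative to the manual queue and serial numbers above, the same batch-then-parallelize pattern can be sketched with `concurrent.futures`. This is only an illustration, not the method used in this notebook: `make_batches` and `fake_fetch` are hypothetical names, the revision IDs are made up, and `fake_fetch` stands in for `get_ores_data` so that no network request is made.

```python
from concurrent.futures import ThreadPoolExecutor


def make_batches(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def fake_fetch(batch):
    """Hypothetical stand-in for get_ores_data; makes no network request."""
    return {'revids': batch}


rev_ids = list(range(1, 121))          # 120 made-up revision IDs
batches = make_batches(rev_ids, 50)    # batch sizes: 50, 50, 20

with ThreadPoolExecutor(max_workers=4) as executor:
    # executor.map yields results in the order of the input batches,
    # so no separate order index is needed.
    responses = list(executor.map(fake_fetch, batches))

batch_sizes = [len(r['revids']) for r in responses]
```

Because `executor.map` returns results in input order, the sort-and-strip step at the end of the cell above would not be needed in this variant.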


Save the raw ORES API responses to files.

In [5]:
# Output all of the API responses into one file
path = os.path.join(RAW_DATA_DIR, ORES_FILENAME)

with open(path, 'w') as f:
    json.dump(results, f)

## Data processing

First, process the ORES API responses to extract the results. Revision IDs that did not produce a quality score, possibly due to the revision being deleted, will be recorded with a null value.

In [6]:
ores = []

for response in results:
    scores = response['enwiki']['scores']
    for rev_id, result in scores.items():
        prediction = None
        try:
            prediction = result['wp10']['score']['prediction']
        except KeyError:
            pass
        ores.append((rev_id, prediction))
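
To make the null handling concrete, here is a minimal sketch of the same extraction applied to a small hand-written response. The nesting mirrors the keys the cell above reads; `extract_predictions` is a hypothetical helper name, the revision ID `'123'` is made up, and the shape of the error entry is an assumption for illustration.

```python
def extract_predictions(response, project='enwiki', model='wp10'):
    """Return (rev_id, prediction) pairs; prediction is None when no score exists."""
    pairs = []
    for rev_id, result in response[project]['scores'].items():
        # A missing 'score' key (e.g. for a deleted revision) yields None.
        score = result.get(model, {}).get('score', {})
        pairs.append((rev_id, score.get('prediction')))
    return pairs


# Illustrative response; the error payload shape is an assumption.
sample_response = {
    'enwiki': {
        'scores': {
            '757566606': {'wp10': {'score': {'prediction': 'Stub'}}},
            '123': {'wp10': {'error': {'type': 'RevisionNotFound'}}},
        }
    }
}

predictions = extract_predictions(sample_response)
```

Using `dict.get` with defaults is one way to avoid the `try`/`except KeyError` block; both approaches record `None` for revisions without a score.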

Load the population data from the CSV file. This is a quoted CSV with a header row with these columns:

| Column Name | Description                         | Format  |
|-------------|-------------------------------------|---------|
| Geography   | Continent or country name           | text    |
| Population  | Population in mid-2018, in millions | decimal |

In [7]:
pop_data_path = os.path.join(RAW_DATA_DIR, 'WPDS_2018_data.csv')
pop_data = []

with open(pop_data_path, encoding='utf-8') as pop_data_file:
    reader = csv.reader(pop_data_file)
    
    # Skip the header row
    next(reader)
    
    for row in reader:
        geo = row[0]
        # Parse decimal value formatted with comma separator
        population = float(row[1].replace(',', ''))
        pop_data.append((geo, population))

Combine the data sources into one table.

In [8]:
import pandas as pd

df_page = pd.DataFrame(page_data, columns=['article_name', 'country', 'revision_id'])
df_page.describe()

|        | article_name           | country | revision_id |
|--------|------------------------|---------|-------------|
| count  | 47197                  | 47197   | 47197       |
| unique | 47197                  | 219     | 47197       |
| top    | Matilda Amissah-Arthur | France  | 801221770   |
| freq   | 1                      | 1689    | 1           |


In [9]:
df_ores = pd.DataFrame(ores, columns=['revision_id', 'article_quality'])
df_ores.describe()

|        | revision_id | article_quality |
|--------|-------------|-----------------|
| count  | 47197       | 47092           |
| unique | 47197       | 6               |
| top    | 801221770   | Stub            |
| freq   | 1           | 24633           |


In [10]:
df_pop = pd.DataFrame(pop_data, columns=['country', 'population'])
df_pop.shape

(207, 2)

Finally, combine the Wikipedia page data, the article quality data from ORES and the world population data into one table. Merging page data with article quality data introduces None values for the articles for which we were unable to get a score; we exclude these from the dataset. The second merge, with population data, uses an inner join, which drops rows whose 'country' value has no match.

In [11]:
df = df_page.merge(df_ores).merge(df_pop).dropna(subset=['article_quality'])
df.describe(include='all').head()

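|        | article_name  | country | revision_id | article_quality | population |
|--------|---------------|---------|-------------|-----------------|------------|
| count  | 44973         | 44973   | 44973.0     | 44973           | 44973.0    |
| unique | 44973         | 180     | 44973.0     | 6               |            |
| top    | Hou Kok Chung | France  | 801221770.0 | Stub            |            |
| freq   | 1             | 1689    | 1.0         | 23597           |            |
| mean   |               |         |             |                 | 116.660071 |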


Output the cleaned data to a CSV in the required format.

In [12]:
CLEAN_DATA_DIR = 'data_clean'
CLEANED_FILENAME = 'combined.csv'

create_folder(CLEAN_DATA_DIR)

# Output the cleaned data in the required format
cleaned_filepath = os.path.join(CLEAN_DATA_DIR, CLEANED_FILENAME)

OUTPUT_COLUMNS = [
    'country',
    'article_name',
    'revision_id',
    'article_quality',
    'population'
]

df.to_csv(cleaned_filepath, columns=OUTPUT_COLUMNS, index=False)

In [13]:
df.head()

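|   | article_name         | country | revision_id | article_quality | population |
|---|----------------------|---------|-------------|-----------------|------------|
| 1 | Gladys Lundwe        | Zambia  | 757566606   | Stub            | 17.7       |
| 2 | Mwamba Luchembe      | Zambia  | 764848643   | Stub            | 17.7       |
| 3 | Thandiwe Banda       | Zambia  | 768166426   | Start           | 17.7       |
| 4 | Sylvester Chisembele | Zambia  | 776082926   | C               | 17.7       |
| 5 | Victoria Kalima      | Zambia  | 776530837   | Start           | 17.7       |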


Some articles in the dataset are purely templates without content; their names start with "Template:". Should we filter these out, since they are not intended to be articles that contain content? We leave them in place for now, since each is still a page for that country, albeit not about any one politician, though it might be worth revisiting this decision in the future.
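
If that decision is revisited, the filter could look something like the sketch below. It runs on toy rows rather than the real dataset, the "Template:" page name is made up for illustration, and it is not applied anywhere in this notebook.

```python
import pandas as pd

# Toy rows; the "Template:" page name is invented for this example.
toy = pd.DataFrame({
    'article_name': ['Template:Zambia-politician-stub',
                     'Gladys Lundwe',
                     'Mwamba Luchembe'],
    'country': ['Zambia', 'Zambia', 'Zambia'],
})

# Boolean mask marking rows whose name starts with the template prefix
is_template = toy['article_name'].str.startswith('Template:')

# Keep only the rows that are not templates
articles_only = toy[~is_template]
```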

## Analysis
We examine these Wikipedia articles for bias, by looking at:
* The proportion of high-quality politician articles for each country. We define high-quality as having received an ORES prediction of either "featured article" (FA) or "good article" (GA).
* The number of politician articles per capita for each country.

In [14]:
articles_by_country = df_page.groupby('country').size().rename('articles').reset_index()
articles_by_country.shape

(219, 2)

In [15]:
df_country = df_pop.merge(articles_by_country).set_index('country')
df_country.shape

(180, 2)

Let's examine the result of this join operation. The Wikipedia page data represents 219 countries. The WPDS data contains 207 geographies: 6 regions and 201 countries. After the inner join, only the 180 country names common to both data sets remain.

In this notebook, we do not examine which countries could not be joined, although that would be an interesting exercise for future work.
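
For that future work, one possible starting point is an outer join with pandas' `indicator` option, which labels each row by the side it came from. The sketch below uses invented country names ("Examplestan") and toy populations purely to show the mechanics; it says nothing about which names actually mismatch in the real data.

```python
import pandas as pd

# Toy data; the mismatched spellings are invented to show the effect.
toy_pages = pd.DataFrame(
    {'country': ['Zambia', 'Zambia', 'Korea, North', 'Examplestan']}
)
toy_pop = pd.DataFrame({
    'country': ['Zambia', 'Korea, North', 'Republic of Examplestan'],
    'population': [17.7, 25.6, 1.0],
})

counts = toy_pages.groupby('country').size().rename('articles').reset_index()

# An outer join keeps every row; the _merge column records its origin.
joined = counts.merge(toy_pop, on='country', how='outer', indicator=True)

only_in_pages = sorted(
    joined.loc[joined['_merge'] == 'left_only', 'country'])
only_in_population = sorted(
    joined.loc[joined['_merge'] == 'right_only', 'country'])
```

Rows marked `left_only` or `right_only` are the ones an inner join would silently drop.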

Now, we tabulate the number of articles per capita and show the countries with the highest and lowest rates:

In [16]:
RESULTS_DIR = 'results'
_ = create_folder(RESULTS_DIR)

### The 10 highest-ranked countries by number of politician articles relative to population

In [17]:
df_country.loc[:, 'articles_per_million'] = df_country.articles / df_country.population
articles_per_million_highest = df_country.sort_values(by='articles_per_million', ascending=False)[:10]

# Output table of results
path = os.path.join(RESULTS_DIR, 'articles_per_million_highest.csv')
articles_per_million_highest.to_csv(path)

articles_per_million_highest

| country                        | population | articles | articles_per_million |
|--------------------------------|------------|----------|----------------------|
| Tuvalu                         | 0.01       | 55       | 5500.0               |
| Nauru                          | 0.01       | 53       | 5300.0               |
| San Marino                     | 0.03       | 82       | 2733.333333          |
| Monaco                         | 0.04       | 40       | 1000.0               |
| Liechtenstein                  | 0.04       | 29       | 725.0                |
| Tonga                          | 0.1        | 63       | 630.0                |
| Marshall Islands               | 0.06       | 37       | 616.666667           |
| Iceland                        | 0.4        | 206      | 515.0                |
| Andorra                        | 0.08       | 34       | 425.0                |
| Federated States of Micronesia | 0.1        | 38       | 380.0                |


### The 10 lowest-ranked countries by number of politician articles relative to population

In [18]:
articles_per_million_lowest = df_country.sort_values(by='articles_per_million', ascending=True)[:10]

# Output table of results
path = os.path.join(RESULTS_DIR, 'articles_per_million_lowest.csv')
articles_per_million_lowest.to_csv(path)

articles_per_million_lowest

| country      | population | articles | articles_per_million |
|--------------|------------|----------|----------------------|
| India        | 1371.3     | 990      | 0.721943             |
| Indonesia    | 265.2      | 215      | 0.810709             |
| China        | 1393.8     | 1138     | 0.816473             |
| Uzbekistan   | 32.9       | 29       | 0.881459             |
| Ethiopia     | 107.5      | 105      | 0.976744             |
| Zambia       | 17.7       | 26       | 1.468927             |
| Korea, North | 25.6       | 39       | 1.523438             |
| Thailand     | 66.2       | 112      | 1.691843             |
| Bangladesh   | 166.4      | 324      | 1.947115             |
| Mozambique   | 30.5       | 60       | 1.967213             |


### The 10 highest-ranked countries by proportion of high-quality articles about their politicians

In [19]:
# Tabulate by country the proportion of articles that are high quality
df.loc[:, 'high_quality'] = df.dropna(subset=['article_quality']).article_quality.isin(['FA', 'GA'])
high_quality_prop = df.groupby('country').high_quality.mean()
high_quality_prop_highest = pd.DataFrame(high_quality_prop.sort_values(ascending=False)[:10])

# Output table of results
path = os.path.join(RESULTS_DIR, 'high_quality_prop_highest.csv')
high_quality_prop_highest.to_csv(path)

high_quality_prop_highest

| country                  | high_quality |
|--------------------------|--------------|
| Korea, North             | 0.179487     |
| Saudi Arabia             | 0.134454     |
| Central African Republic | 0.117647     |
| Romania                  | 0.114943     |
| Mauritania               | 0.096154     |
| Tuvalu                   | 0.090909     |
| Bhutan                   | 0.090909     |
| Dominica                 | 0.083333     |
| United States            | 0.075092     |
| Benin                    | 0.074468     |


### The 10 lowest-ranked countries by proportion of high-quality articles about their politicians

In [20]:
high_quality_prop_lowest = pd.DataFrame(high_quality_prop.sort_values(ascending=True)[:10])

# Output table of results
path = os.path.join(RESULTS_DIR, 'high_quality_prop_lowest.csv')
high_quality_prop_lowest.to_csv(path)

high_quality_prop_lowest

| country               | high_quality |
|-----------------------|--------------|
| Sao Tome and Principe | 0.0          |
| Mozambique            | 0.0          |
| Cameroon              | 0.0          |
| Guyana                | 0.0          |
| Turkmenistan          | 0.0          |
| Monaco                | 0.0          |
| Moldova               | 0.0          |
| Comoros               | 0.0          |
| Marshall Islands      | 0.0          |
| Costa Rica            | 0.0          |


In fact, many more countries have zero high-quality articles about politicians. Since the order among equal values is arbitrary, it is better to list all such countries:

In [21]:
high_quality_prop_zero = high_quality_prop[high_quality_prop == 0]

# Output table of results
path = os.path.join(RESULTS_DIR, 'high_quality_prop_zero.csv')
high_quality_prop_zero.to_csv(path)

high_quality_prop_zero.index.values

array(['Andorra', 'Angola', 'Antigua and Barbuda', 'Bahamas', 'Barbados',
       'Belgium', 'Belize', 'Cameroon', 'Cape Verde', 'Comoros',
       'Costa Rica', 'Djibouti', 'Federated States of Micronesia',
       'Finland', 'Guyana', 'Kazakhstan', 'Kiribati', 'Lesotho',
       'Liechtenstein', 'Macedonia', 'Malta', 'Marshall Islands',
       'Moldova', 'Monaco', 'Mozambique', 'Nauru', 'Nepal', 'San Marino',
       'Sao Tome and Principe', 'Seychelles', 'Slovakia',
       'Solomon Islands', 'Switzerland', 'Tunisia', 'Turkmenistan',
       'Uganda', 'Zambia'], dtype=object)
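
If a ranked "lowest 10" table is still wanted despite the ties, one option is to break ties on a second key so the order is at least reproducible. The sketch below is an assumption about how that could be done, using the country name as the tie-breaker; the proportions are toy values (Iceland's in particular is made up), not results from this analysis.

```python
import pandas as pd

# Toy proportions; the values are illustrative only.
toy_prop = pd.Series(
    [0.0, 0.05, 0.0, 0.0],
    index=pd.Index(['Monaco', 'Iceland', 'Andorra', 'Belize'], name='country'),
    name='high_quality',
)

ordered = (
    toy_prop.reset_index()
    .sort_values(by=['high_quality', 'country'])  # country name breaks ties
    .set_index('country')['high_quality']
)

lowest_three = list(ordered.index[:3])
```

Alphabetical order is still an arbitrary choice among tied countries, but it makes the table stable from run to run.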

## Findings

Given this is an exercise looking at English Wikipedia, I expected to find more articles for politicians in English-speaking countries. If the editors tend to be from the same country as the politician, then it would make sense for those articles to be of higher quality.

However, what was apparent from the analysis is that the highest-ranked countries by politician articles per capita tended to be those with the smallest populations. At the top are Tuvalu and Nauru, both countries with populations of about 10,000. One notable exception was Iceland, which has four times the population of the next most populous country in that list, yet has enough politician articles to be amongst the highest per capita.

At the other end of the scale, we find that China and India have amongst the fewest politician articles per capita, which could perhaps be due to their huge populations.

It was surprising to see North Korea ranked highest by proportion of high-quality politician articles while also having amongst the lowest number of politician articles per capita. Furthermore, the people of North Korea generally do not have access to the internet and Wikipedia, so we are left to wonder who edits North Korean politician articles, and why they would be more motivated than editors of other countries' articles to polish them to a high quality.

As expected for an English-speaking country, the US is amongst the top 10 countries by proportion of high-quality politician articles. While not in the top 10, the UK was not far behind at rank 12. Notably absent from the top were other English-speaking countries such as Canada and Australia. The only other predominantly English-speaking country in the top 10 by proportion of high-quality politician articles is Dominica.

Finally, we notice that there are many countries over a wide range of population sizes that do not have any politician articles predicted as high quality. From this and other findings above, we can see that English Wikipedia articles do not cover every country's politicians equally well.

Other questions of interest include how the number of editors per politician article differs by country, and what proportion of politician article editors are from the same country as the politician.