In [1]:
import json
import time
import urllib.parse

import requests

import pandas as pd

# HW2 Bias in Wikipedia Article Data

Among the numerous articles hosted on Wikipedia and curated by its users are various articles covering the lives and careers of notable politicians from around the world. For many data-focused political research projects, Wikipedia's information serves as an easily accessible and widely respected data source. However, not all Wikipedia articles are created equal.

As part of its community curation systems, Wikipedia allows every article to be assigned a quality rating indicating how thorough, well-researched, and transparent its content and sources are. Because this process requires significant human effort, not all articles have such a grade assigned to them. However, Wikipedia's ORES API allows a quality rating to be assigned by a machine learning model trained on the prior quality ratings given by Wikipedia's community.

With this system available, it becomes reasonable to ask the questions that this analysis seeks to answer: Is there a bias in the quality of Wikipedia politician articles across different countries and regions? And if so, what patterns describe the nature of this bias?

### Part 1 - Obtaining Article and Population Data

#### Article Data

Before any article quality data for each politician can be obtained, the politicians must first be matched to their respective articles. The page info for each article contains the last revision ID for that article. The ORES API can use this ID to assign a quality score to one specific revision of an article, which matters because articles are frequently revised.
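
As a minimal sketch of what "the page info contains the last revision ID" looks like in practice, the snippet below pulls `lastrevid` out of a response shaped like a `prop=info` query result. The sample response, its page ID, title, and revision ID are made-up illustrative values, and `extract_lastrevid` is a hypothetical helper, not part of the notebook's code.

```python
# Illustrative response shaped like a MediaWiki "prop=info" query result.
# The page ID, title, and revision ID are made-up sample values.
sample_response = {
    "query": {
        "pages": {
            "12345": {
                "pageid": 12345,
                "ns": 0,
                "title": "Example Politician",
                "lastrevid": 987654321,
            }
        }
    }
}


def extract_lastrevid(response):
    """Return (title, lastrevid) for the single page in a pageinfo response.

    lastrevid is None when the page entry carries no revision ID.
    """
    # The "pages" dict is keyed by page ID, so take its only value
    page = next(iter(response["query"]["pages"].values()))
    return page.get("title"), page.get("lastrevid")


title, revid = extract_lastrevid(sample_response)
print(title, revid)
```

Using `.get` here means a page entry without a revision ID yields `None` rather than raising, which makes such entries easy to filter out later.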

In order to obtain this, the Wikipedia category of politicians by nationality was crawled to obtain a list of articles on political figures across the world. This list will be imported below and cleaned of any inappropriate or unusable values so that the list can be used to query the Page Info API.

In [None]:
# Reading in list of wikipedia politician articles by country
politicians_df = pd.read_csv("politicians_by_country_AUG.2024.csv")

When inspecting the article titles in the crawled dataset, a small problem can be seen. Many of the articles in the list are not articles about politicians but instead articles about political offices. In order to address this, articles referring to specific political positions will be removed from the article list.

In [None]:
# Removing non-politician articles from dataset
invalid_article_string_list = ["Presid", "Ministry", "Ministers"]
politicians_df = politicians_df[~politicians_df.name.str.contains('|'.join(invalid_article_string_list))]
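
To illustrate how this substring filter behaves, here is a small sketch on a toy DataFrame. The titles and the country name are invented for illustration; only the three substrings come from the cell above.

```python
import pandas as pd

# Toy stand-in for the crawled list; these titles are illustrative only.
toy_df = pd.DataFrame({
    "name": [
        "Jane Doe",
        "Ministry of Finance (Exampleland)",
        "List of Presidents of Exampleland",
        "John Smith",
    ],
    "country": ["Exampleland"] * 4,
})

invalid_substrings = ["Presid", "Ministry", "Ministers"]
# Joining with "|" builds a regex alternation, so a title is flagged
# if it contains any one of the substrings.
pattern = "|".join(invalid_substrings)
mask = toy_df["name"].str.contains(pattern, regex=True, na=False)

# Keep only the rows that were not flagged
filtered_df = toy_df[~mask].reset_index(drop=True)
print(filtered_df["name"].tolist())
```

Passing `na=False` treats missing titles as non-matches, so they are kept rather than producing a mask with missing values.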

A second glance at the list reveals a more significant issue for our analysis. Several political figures have separate articles for their political histories and their overall lives. In particular, politicians from impoverished, obscure, or otherwise poorly-documented countries seem to be much more likely to have their articles split in this way. Fewer than 20 such articles appear to exist in the dataset, but this trend is worth noting for future analysis.

With the list of articles cleaned, the page info can now be obtained.

In [2]:
#########
#
#    CONSTANTS
#

# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"
API_HEADER_AGENT = 'User-Agent'

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<brun0b42@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2024'
}

# This is a string of additional page properties that can be returned see the Info documentation for
# what can be included. If you don't want any this can simply be the empty string
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"

# This template lists the basic parameters for making this
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}


In [3]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_pageinfo_per_article(article_title = None, 
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT, 
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):
    
    # article title can be as a parameter to the call or in the request_template
    if article_title:
        request_template['titles'] = article_title

    if not request_template['titles']:
        raise Exception("Must supply an article title to make a pageinfo request.")

    if API_HEADER_AGENT not in headers:
        raise Exception(f"The header data should include a '{API_HEADER_AGENT}' field that contains your UW email address.")

    if 'uwnetid@uw' in headers[API_HEADER_AGENT]:
        raise Exception(f"Use your UW email address in the '{API_HEADER_AGENT}' field.")

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response


In [89]:
# Obtaining page info for each politician article
page_info_list = []
for article_num, article_title in enumerate(politicians_df.name):
    print("Getting page info data for: ", article_title, str(article_num) + " / " + str(len(politicians_df.name)), "                ", end="\r")
    info = request_pageinfo_per_article(article_title)
    # The request function returns None when the request fails, so those articles are skipped
    if info is None:
        continue
    info_dict = next(iter(info['query']['pages'].values()))
    page_info_list.append(info_dict)

# Compiling page info responses into a DataFrame
politicians_page_info_df = pd.DataFrame(page_info_list)
# Saving the page info responses into a json file
#politicians_page_info_df.to_json("politicians_page_info.json")

Getting page info data for:  Denis Walker 7141 / 7142

Looking at the data obtained from the Page Info API, one small problem in need of cleaning stands out. Eight of the articles sent to the Page Info API returned a response with missing values for both ID and revision ID. This can likely be attributed to the article lacking English-language page info. Regardless of the cause, the revision ID is necessary to obtain the ORES quality estimates, and thus these data points must be dropped.

In [110]:
# Removing data for articles with NaN id and lastrevid entries
politicians_page_info_df[politicians_page_info_df.lastrevid.isna()]
politicians_page_info_df = politicians_page_info_df[~politicians_page_info_df.lastrevid.isna()]
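
A side effect of the missing values is that pandas stores the revision IDs as floats, which is why they are cast to integers before being sent to ORES later on. The sketch below shows the drop-then-cast pattern on toy data; the frame and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy page info table; a NaN in a numeric column forces a float dtype
toy_info_df = pd.DataFrame({
    "title": ["Article A", "Article B", "Article C"],
    "pageid": [101.0, np.nan, 303.0],
    "lastrevid": [1111.0, np.nan, 3333.0],
})

# Drop rows that have no revision ID, then restore an integer dtype
cleaned_df = toy_info_df.dropna(subset=["lastrevid"]).copy()
cleaned_df["lastrevid"] = cleaned_df["lastrevid"].astype(int)

print(cleaned_df)
```

`dropna(subset=[...])` is equivalent in effect to the boolean mask used above, and the `.copy()` avoids modifying a view of the original frame.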

With the Page Info API data obtained, there is one more dataset to be brought in before moving to the ORES API. In order to effectively compare the quality statistics for articles across countries and regions, it is important to have metrics for the populations in each country. Aside from countries that have been home to exceptionally important recent events, it can generally be assumed that the number of politicians in and the amount of attention paid to any given country will be proportional to its size. As such, population and region data will be brought in from the Population Reference Bureau's world population datasheet.

In [155]:
# Reading in population dataset
populations_df = pd.read_csv("population_by_country_AUG.2024.csv")

### Part 2 - Generating Article Quality Data

With the revision IDs and article titles obtained, the ORES LiftWing API can now be queried. Given an API access token and a revision ID, the ORES API will take the specific revision of the given article and calculate its estimated quality score.

Since this process is limited to a maximum of 5000 requests per hour and some articles lack the page info necessary to generate a predicted score, it is important to catch exceptions and HTTP error responses during this process.
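
To make the rate limit concrete, the arithmetic can be sketched as follows. The 5000 requests per hour figure is the limit stated above, and 7134 is the article count shown in this notebook's progress output; `estimated_runtime_hours` is a hypothetical helper and ignores network latency.

```python
REQUESTS_PER_HOUR = 5000          # rate limit stated above
SECONDS_PER_HOUR = 60.0 * 60.0

# Minimum spacing between requests that stays within the limit
wait_per_request = SECONDS_PER_HOUR / REQUESTS_PER_HOUR


def estimated_runtime_hours(num_requests, wait_seconds=wait_per_request):
    """Rough lower bound on total run time, counting only the throttle waits."""
    return num_requests * wait_seconds / SECONDS_PER_HOUR


print(round(wait_per_request, 2))               # seconds between requests
print(round(estimated_runtime_hours(7134), 2))  # hours for the full article list
```

In other words, scoring the full list takes well over an hour even if every request succeeds, so it is worth saving the responses to disk once they have been collected.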

In [97]:
#########
#
#    CONSTANTS
#

#    The current LiftWing ORES API endpoint and prediction model
#
API_ORES_LIFTWING_ENDPOINT = "https://api.wikimedia.org/service/lw/inference/v1/models/{model_name}:predict"
API_ORES_EN_QUALITY_MODEL = "enwiki-articlequality"

#
#    The throttling rate is a function of the Access token that you are granted when you request the token. The constants
#    come from dissecting the token and getting the rate limits from the granted token. An example of that is below.
#
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = ((60.0*60.0)/5000.0)-API_LATENCY_ASSUMED  # The key authorizes 5000 requests per hour

#    When making automated requests we should include something that is unique to the person making the request
#    This should include an email - your UW email would be good to put in there
#    
#    Because all LiftWing API requests require some form of authentication, you need to provide your access token
#    as part of the header too
#
REQUEST_HEADER_TEMPLATE = {
    'User-Agent': "<{email_address}>, University of Washington, MSDS DATA 512 - AUTUMN 2024",
    'Content-Type': 'application/json',
    'Authorization': "Bearer {access_token}"
}
#
#    This is a template for the parameters that we need to supply in the headers of an API request
#
REQUEST_HEADER_PARAMS_TEMPLATE = {
    'email_address' : "",         # your email address should go here
    'access_token'  : ""          # the access token you create will need to go here
}

#
#    A dictionary of English Wikipedia article titles (keys) and sample revision IDs that can be used for this ORES scoring example
#
ARTICLE_REVISIONS = { 'Bison':1085687913 , 'Northern flicker':1086582504 , 'Red squirrel':1083787665 , 'Chinook salmon':1085406228 , 'Horseshoe bat':1060601936 }

#
#    This is a template of the data required as a payload when making a scoring request of the ORES model
#
ORES_REQUEST_DATA_TEMPLATE = {
    "lang":        "en",     # required that its english - we're scoring English Wikipedia revisions
    "rev_id":      "",       # this request requires a revision id
    "features":    True
}

#
#    These are used later - Include your own username and API token to perform the necessary function calls
#
USERNAME = "YourUserName"
ACCESS_TOKEN = "YourAccessToken"

In [98]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_ores_score_per_article(article_revid = None, email_address=None, access_token=None,
                                   endpoint_url = API_ORES_LIFTWING_ENDPOINT, 
                                   model_name = API_ORES_EN_QUALITY_MODEL, 
                                   request_data = ORES_REQUEST_DATA_TEMPLATE, 
                                   header_format = REQUEST_HEADER_TEMPLATE, 
                                   header_params = REQUEST_HEADER_PARAMS_TEMPLATE):
    
    #    Make sure we have an article revision id, email and token
    #    This approach prioritizes the parameters passed in when making the call
    if article_revid:
        request_data['rev_id'] = article_revid
    if email_address:
        header_params['email_address'] = email_address
    if access_token:
        header_params['access_token'] = access_token
    
    #   Making a request requires a revision id - an email address - and the access token
    if not request_data['rev_id']:
        raise Exception("Must provide an article revision id (rev_id) to score articles")
    if not header_params['email_address']:
        raise Exception("Must provide an 'email_address' value")
    if not header_params['access_token']:
        raise Exception("Must provide an 'access_token' value")
    
    # Create the request URL with the specified model parameter - default is a article quality score request
    request_url = endpoint_url.format(model_name=model_name)
    
    # Create a compliant request header from the template and the supplied parameters
    headers = dict()
    for key in header_format.keys():
        headers[str(key)] = header_format[key].format(**header_params)
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free data
        # source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        #response = requests.get(request_url, headers=headers)
        response = requests.post(request_url, headers=headers, data=json.dumps(request_data))
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response


In [144]:
ORES_quality_list = []
ORES_request_failed_list = []
ORES_request_http_error_list = []
for article_num, article_info in enumerate(zip(politicians_page_info_df.title, politicians_page_info_df.lastrevid.astype(int))):
    print("Getting page info data for: ", article_info[0], str(article_num) + " / " + str(len(politicians_page_info_df.title)), "                ", end="\r")
    # Obtaining raw ORES response data for each article
    try:
        score = request_ores_score_per_article(article_revid=article_info[1],
                                               email_address="brunob42@uw.edu",
                                               access_token=ACCESS_TOKEN)
    except Exception as excep:
        # Catching errors where the ORES request function throws an unexpected in-script error
        ORES_request_failed_list.append(article_info[0])
        # Moving on so that a score from an earlier article is not recorded under this title
        continue
    # Unpacking densely nested article quality information
    try:
        score_values = next(iter(next(iter(next(iter(next(iter(score.values()))['scores'].values())).values())).values()))['probability']
    except Exception:
        # Catching cases where the ORES request encounters an HTTP request timeout or missing data
        ORES_request_http_error_list.append(article_info[0])
        # Moving on so that a score from an earlier article is not recorded under this title
        continue
    # Adding in article and revision data to identify scores
    score_values['title'] = article_info[0]
    score_values['lastrevid'] = article_info[1]
    # Adding entry to list of ORES predictions
    ORES_quality_list.append(score_values)

ORES_score_df = pd.DataFrame(ORES_quality_list)
#ORES_score_df.to_json("ORES_quality_scores.json")

Getting page info data for:  Denis Walker 7133 / 7134
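
The chain of `next(iter(...))` calls above walks the nested response by position. As a hedged alternative, the sketch below reaches the same `probability` dictionary by key. The key names `enwiki`, `articlequality`, and `score` are assumptions about the response layout (only `scores` and `probability` appear in the code above), the sample payload values are made up, and `extract_probabilities` is a hypothetical helper.

```python
# Illustrative payload in the nested shape the loop above unpacks;
# the revision ID and probabilities are made-up sample values.
sample_score = {
    "enwiki": {
        "scores": {
            "12345": {
                "articlequality": {
                    "score": {
                        "prediction": "Start",
                        "probability": {
                            "B": 0.10, "C": 0.15, "FA": 0.01,
                            "GA": 0.04, "Start": 0.50, "Stub": 0.20,
                        },
                    }
                }
            }
        }
    }
}


def extract_probabilities(response, wiki="enwiki", model="articlequality"):
    """Return a copy of the class-probability dict, or None if it is absent."""
    try:
        scores = response[wiki]["scores"]
        # "scores" is keyed by revision ID; one revision is scored per request
        revision_scores = next(iter(scores.values()))
        return dict(revision_scores[model]["score"]["probability"])
    except (KeyError, TypeError, StopIteration, AttributeError):
        # Covers failed requests (None) and error payloads without scores
        return None


print(extract_probabilities(sample_score))
print(extract_probabilities(None))
```

Returning a copy means that adding `title` and `lastrevid` to the result never touches a dictionary that was already stored for another article.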

### Part 3: Data Combination

With all three datasets now obtained, it is necessary to combine them together so that the page info, population/region, and ORES score data can all be associated with their respective politician articles.

To begin with, any articles that lack a quality estimate will be removed as they cannot be used for this analysis. Approximately ten articles are dropped in this way.

In [200]:
# Removing entries that could not be assigned an ORES quality estimate
politicians_df = politicians_df[politicians_df.name.isin(ORES_score_df.title)]
ORES_score_df = ORES_score_df[ORES_score_df.title.isin(politicians_df.name)]

Following this, the quality scores and the politician list can be neatly joined on their shared article name column.

To simplify later analysis of the assigned quality scores, a column will be added to the merged dataset containing the quality category that has the highest probability assigned to it by the ORES ML model.

In [378]:
# Adding ORES quality data to politician dataset
ORES_score_df = ORES_score_df.rename({"title":"name"}, axis=1)
politicians_score_df = pd.merge(politicians_df, ORES_score_df, on='name')
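# Adding column denoting most likely quality score category
# (the probability columns are named explicitly rather than selected by position)
quality_columns = ['B', 'C', 'FA', 'GA', 'Start', 'Stub']
politicians_score_df['score'] = politicians_score_df[quality_columns].idxmax(axis=1)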
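
As a small illustration of picking the most probable quality class per article, here is a sketch using `idxmax` on toy data. The probabilities and article names are invented; the six class labels are those of the English Wikipedia article quality model.

```python
import pandas as pd

QUALITY_CLASSES = ["B", "C", "FA", "GA", "Start", "Stub"]

# Toy probability table; the values are made up for illustration
toy_scores = pd.DataFrame({
    "name": ["Article A", "Article B"],
    "B": [0.10, 0.05],
    "C": [0.15, 0.05],
    "FA": [0.01, 0.60],
    "GA": [0.04, 0.25],
    "Start": [0.50, 0.03],
    "Stub": [0.20, 0.02],
})

# For each row, idxmax(axis=1) returns the column label holding the largest value
toy_scores["score"] = toy_scores[QUALITY_CLASSES].idxmax(axis=1)
print(toy_scores[["name", "score"]])
```

Selecting the probability columns by name keeps the result stable even if other columns are later added to or removed from the merged dataset.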

Now, the population and region data must be added. To accomplish this, each article's country will be matched to the region associated with it in the population file's hierarchy. This works because each region in the population dataset is written in all caps and comes immediately before the names of the countries associated with it. Thus, the region can be obtained by finding the country name in the dataset and working back up the list until a region name is found.

Then, the population for that country will be obtained by finding the matching row in the dataset.

In [379]:
def get_region_from_country(country):
    region = None
    # Looping through geographic regions to find country location
    for geo_index, geography in enumerate(populations_df.Geography):
        if geography == country:
            current_region_index = geo_index
            # Working back up the index to find region from country name
            while populations_df.Geography[current_region_index] != populations_df.Geography[current_region_index].upper():
                current_region_index = current_region_index - 1
            region = populations_df.Geography[current_region_index]
            break
    return region
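
The same "nearest all-caps row above" rule can also be expressed without an explicit loop. The following is a minimal sketch on a toy table: the rows and population figures are illustrative stand-ins for the population file, and `country_to_region` is a hypothetical name.

```python
import pandas as pd

# Toy table in the same layout: all-caps region rows followed by their countries.
# The population values are illustrative only.
toy_pop = pd.DataFrame({
    "Geography": ["WORLD", "AFRICA", "NORTHERN AFRICA", "Algeria", "Egypt",
                  "WESTERN AFRICA", "Ghana"],
    "Population": [8000.0, 1400.0, 250.0, 46.0, 105.2, 440.0, 34.1],
})

# Rows written entirely in upper case are region headers
is_region = toy_pop["Geography"].str.isupper()

# Keep the name on region rows, blank it elsewhere, then carry it downward
toy_pop["region"] = toy_pop["Geography"].where(is_region).ffill()

# Lookup table from country name to the closest region header above it
country_to_region = (
    toy_pop.loc[~is_region].set_index("Geography")["region"].to_dict()
)
print(country_to_region)
```

With such a mapping, regions could be assigned with `Series.map`, and countries missing from the population file would come back as missing values, just as they come back as `None` from the function above.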

In [380]:
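def get_population_from_region(geography):
    # Obtaining population value associated with the given geography name (a country or a region)
    matching_rows = populations_df.loc[populations_df.Geography == geography, 'Population']
    # Taking the first match explicitly, since calling float() on a whole Series is deprecated
    return float(matching_rows.iloc[0])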

In [381]:
# Assigning regions to every politician based on country
politicians_score_df['region'] = politicians_score_df.country.apply(get_region_from_country)

While attempting to map the datasets together, it can be seen that 143 of the articles lack a corresponding region in the population dataset. These articles are all assigned to one of three countries, which should be saved to record the reason for their absence from the analysis before continuing with the merge of the three datasets.

In [382]:
# Obtaining and saving list of countries that did not have a matching region in the population dataset
regionless_politicians_df = politicians_score_df[politicians_score_df['region'].isna()]
pd.Series(regionless_politicians_df.country.unique()).to_csv("wp_countries-no_match.txt")
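
Another way to surface unmatched countries, in either direction, is an outer merge with pandas' `indicator` option. This is only a sketch on toy data; the frames, country names, and variable names are invented for illustration.

```python
import pandas as pd

# Toy article list and toy population list
toy_articles = pd.DataFrame({
    "article_title": ["Article A", "Article B", "Article C"],
    "country": ["Egypt", "Exampleland", "Ghana"],
})
toy_countries = pd.DataFrame({"country": ["Egypt", "Ghana", "Norway"]})

# indicator=True adds a "_merge" column saying which side each row came from
merged = toy_articles.merge(toy_countries, on="country", how="outer", indicator=True)

# Countries that have articles but no population entry
articles_without_population = sorted(
    merged.loc[merged["_merge"] == "left_only", "country"].unique()
)
# Countries that have a population entry but no articles
populations_without_articles = sorted(
    merged.loc[merged["_merge"] == "right_only", "country"].unique()
)

print(articles_without_population)
print(populations_without_articles)
```

The second list is not needed for the steps above, but it shows which countries in the population file have no politician articles at all.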

In [383]:
# Removing regionless politicians from final dataset
politicians_score_df = politicians_score_df[~politicians_score_df['region'].isna()]

In [384]:
# Assigning population to each article based on its country
politicians_score_df['population'] = politicians_score_df.country.apply(get_population_from_region)



With the three datasets combined, only a small number of final steps remain. In order to simplify future analysis, a boolean column containing the value of whether or not a given article was classified as 'high quality' (`FA` or `GA` in the ORES API) should be added to the dataset, the columns should be renamed for clarity, and the dataset should be saved as a csv for ease of use in the future.

In [385]:
def is_high_quality(quality):
    # An article counts as high quality when its predicted class is GA or FA
    return quality in ('GA', 'FA')

In [386]:
# Adding high quality column denoting a predicted quality of GA or FA
politicians_score_df['high_quality'] = politicians_score_df.score.apply(is_high_quality)

In [387]:
# Renaming columns for clarity
politicians_score_df = politicians_score_df.rename({'name':'article_title',
                                                    'lastrevid':'revision_id',
                                                    'score':'article_quality'},
                                                  axis = 1)
# Saving final dataset
politicians_score_df.to_csv('wp_politicians_by_country.csv')

### Part 4: Analysis and Results

With the final dataset now obtained, the six main questions of this Wikipedia politician bias analysis can be answered.

The first two of these questions are:
* What 10 countries have the most politician articles per capita?

and

* What 10 countries have the _fewest_ politician articles per capita?

To answer these first two questions, the dataset can be grouped by country and population so that the count of articles within each country can be calculated and divided by each country's population to obtain the number of articles per million people.
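
The grouping and division described above can be sketched on toy data as follows. The countries and figures are invented, populations are in millions as in the population file, and rows with a population of zero are filtered out before dividing so that they do not produce infinite values.

```python
import pandas as pd

# Toy article-level data: one row per article, population in millions
toy_rows = pd.DataFrame({
    "article_title": ["A1", "A2", "A3", "B1", "C1"],
    "country": ["Alpha", "Alpha", "Alpha", "Beta", "Gamma"],
    "population": [0.5, 0.5, 0.5, 2.0, 0.0],
})

# Count articles per country (population is constant within a country)
counts = (
    toy_rows.groupby(["country", "population"])
    .size()
    .reset_index(name="num_articles")
)

# Exclude zero-population entries, then compute articles per million people
nonzero = counts[counts["population"] > 0].copy()
nonzero["articles_per_capita"] = nonzero["num_articles"] / nonzero["population"]

ranked = nonzero.sort_values("articles_per_capita", ascending=False)
print(ranked)
```

Filtering on `population > 0` is one explicit way to leave out such entries; slicing them off the top of a sorted table has the same effect only while the number of zero-population countries is known in advance.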

In [399]:
# Getting number of politician articles for each country
articles_by_country_df = politicians_score_df.groupby(['country','population']).count().loc[:,'article_title'].reset_index()
# Calculating number of articles per million people in each country
articles_by_country_df['articles_per_capita'] = articles_by_country_df.article_title / articles_by_country_df.population
articles_by_country_df = articles_by_country_df.rename({'article_title':'num_articles'}, axis=1)
# Generating table of 10 highest articles per capita countries
# (the first two sorted rows are skipped to exclude countries with 0 population)
articles_by_country_df.sort_values('articles_per_capita', ascending=False).head(12).iloc[2:]

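|     | country                        | population | num_articles | articles_per_capita |
|----:|:-------------------------------|-----------:|-------------:|--------------------:|
|   4 | Antigua and Barbuda            |        0.1 |           33 |               330.0 |
|  51 | Federated States of Micronesia |        0.1 |           14 |               140.0 |
|  93 | Marshall Islands               |        0.1 |           13 |               130.0 |
| 149 | Tonga                          |        0.1 |           10 |               100.0 |
|  12 | Barbados                       |        0.3 |           25 |           83.333333 |
|  98 | Montenegro                     |        0.6 |           38 |           63.333333 |
| 125 | Seychelles                     |        0.1 |            6 |                60.0 |
|  90 | Maldives                       |        0.6 |           33 |                55.0 |
|  17 | Bhutan                         |        0.8 |           44 |                55.0 |
| 121 | Samoa                          |        0.2 |            8 |                40.0 |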


Looking at the above table, it can be seen that nearly all countries with a high per-capita article count achieve this due to their low populations. Regardless of how small these countries are, their politicians possess a baseline level of importance that ensures that they will have articles written about them on Wikipedia.

Sorting this dataset in ascending order instead, we obtain:

In [400]:
# Generating table of 10 lowest articles per capita countries
articles_by_country_df.sort_values('articles_per_capita', ascending=True).head(10)

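|     | country       | population | num_articles | articles_per_capita |
|----:|:--------------|-----------:|-------------:|--------------------:|
|  31 | China         |     1411.3 |           16 |            0.011337 |
|  57 | Ghana         |       34.1 |            3 |            0.087977 |
|  66 | India         |     1428.6 |          151 |            0.105698 |
| 122 | Saudi Arabia  |       36.9 |            5 |            0.135501 |
| 164 | Zambia        |       20.2 |            3 |            0.148515 |
| 108 | Norway        |        5.5 |            1 |            0.181818 |
|  70 | Israel        |        9.8 |            2 |            0.204082 |
|  45 | Egypt         |      105.2 |           32 |            0.304183 |
|  37 | Cote d'Ivoire |       30.9 |           10 |            0.323625 |
| 100 | Mozambique    |       33.9 |           12 |            0.353982 |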


Since the countries with a _high_ per-capita article count achieve their status through a low population count, it should come as no surprise that the countries with a _low_ per-capita article count achieve _their_ status with a high population count. The list consists primarily of countries with sizable populations, exceptionally few articles, or both.

The second pair of questions to be answered by this analysis are:
* What 10 countries have the most _high-quality_ politician articles per capita?

and

* What 10 countries have the fewest _high-quality_ politician articles per capita?

To answer these questions, the dataset can once again be grouped by country and population so that the total number of high quality articles for that country can be calculated and divided by the total population.

In [401]:
# Getting number of high quality politician articles (GA or FA) for each country
high_quality_by_country_df = politicians_score_df.groupby(['country','population'])['high_quality'].sum().reset_index()
# Calculating number of high quality articles per million people in each country
high_quality_by_country_df['quality_per_capita'] = high_quality_by_country_df.high_quality / high_quality_by_country_df.population
# Generating table of 10 highest quality per capita countries
high_quality_by_country_df.sort_values('quality_per_capita', ascending=False).head(10)

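|     | country         | population | high_quality | quality_per_capita |
|----:|:----------------|-----------:|-------------:|-------------------:|
|  98 | Montenegro      |        0.6 |            2 |           3.333333 |
|  86 | Luxembourg      |        0.7 |            1 |           1.428571 |
|  85 | Lithuania       |        2.9 |            4 |            1.37931 |
|  76 | Kosovo          |        1.7 |            2 |           1.176471 |
|   1 | Albania         |        2.7 |            3 |           1.111111 |
| 107 | North Macedonia |        1.8 |            1 |           0.555556 |
|  80 | Latvia          |        1.9 |            1 |           0.526316 |
| 124 | Serbia          |        6.6 |            3 |           0.454545 |
|  13 | Belarus         |        9.2 |            3 |           0.326087 |
|  95 | Moldova         |        3.4 |            1 |           0.294118 |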


Much like the results for the raw number of articles per capita, the countries with a high number of high-quality articles per capita uniformly have moderate or small populations. However, these countries _also_ have a relatively large number of high-quality articles. It may not initially appear to be the case, as each country has only 1-4 such articles, but exceptionally few articles in the dataset (approximately 90 in total) were assigned a high quality score.

Sorting this result dataset in ascending order instead, we can obtain the obvious answer to the inverse question:

In [402]:
# Generating table of 10 lowest quality per capita countries
high_quality_by_country_df.sort_values('quality_per_capita', ascending=True).head(10)

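|     | country               | population | high_quality | quality_per_capita |
|----:|:----------------------|-----------:|-------------:|-------------------:|
|   0 | Afghanistan           |       42.4 |            0 |                0.0 |
| 117 | Portugal              |       10.5 |            0 |                0.0 |
| 114 | Paraguay              |        6.2 |            0 |                0.0 |
| 113 | Papua New Guinea      |        9.5 |            0 |                0.0 |
| 111 | Palestinian Territory |        5.5 |            0 |                0.0 |
| 110 | Pakistan              |      240.5 |            0 |                0.0 |
| 109 | Oman                  |        5.0 |            0 |                0.0 |
| 108 | Norway                |        5.5 |            0 |                0.0 |
| 105 | Niger                 |       27.2 |            0 |                0.0 |
| 104 | Nicaragua             |        6.8 |            0 |                0.0 |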


Since very few high-quality articles exist in the dataset, many countries have no high-quality articles at all, and thus they are all tied for the lowest per-capita count of high-quality articles about politicians.

Thus far, these analyses have been focused on comparisons across countries. However, the final two questions to be answered are:

* What regions have the most politician articles per capita?

and

* What regions have the most _high-quality_ politician articles per capita?

To answer these questions, the dataset can be grouped by region and the sum of both populations and high-quality article counts can be obtained across each region while the counts of politician articles overall can be calculated separately. The article and high-quality article totals can then be divided by the population totals to obtain the per-capita values for both articles overall and high-quality articles specifically.
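
One detail worth checking when reproducing this step: the merged dataset has one row per article, so a plain sum of `population` within a region adds a country's population once for every article that country has. The sketch below, on toy data with hypothetical names, contrasts that row-level sum with a sum that counts each country once.

```python
import pandas as pd

# Toy article-level data: Alpha has three articles, Beta has one
toy_rows = pd.DataFrame({
    "article_title": ["A1", "A2", "A3", "B1"],
    "country": ["Alpha", "Alpha", "Alpha", "Beta"],
    "region": ["REGION X"] * 4,
    "population": [2.0, 2.0, 2.0, 8.0],
    "high_quality": [True, False, False, False],
})

# Summing over article rows counts Alpha three times: 2 + 2 + 2 + 8 = 14
row_level_population = toy_rows.groupby("region")["population"].sum()

# Counting each country once gives 2 + 8 = 10
region_population = (
    toy_rows.drop_duplicates("country").groupby("region")["population"].sum()
)

# Article totals and high-quality totals per region
region_summary = toy_rows.groupby("region").agg(
    article_count=("article_title", "count"),
    high_quality=("high_quality", "sum"),
)
region_summary["population"] = region_population
region_summary["articles_per_capita"] = (
    region_summary["article_count"] / region_summary["population"]
)
region_summary["quality_per_capita"] = (
    region_summary["high_quality"] / region_summary["population"]
)
print(region_summary)
```

Note that this deduplicated total still covers only countries that have at least one article in the dataset; a region's full population would have to be read from the population file itself.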

In [405]:
# Calculating the population in each region and the number of politician articles in each region
# (population is summed over article rows, so each country contributes its population once per article)
region_populations = politicians_score_df.groupby('region')['population'].sum()
articles_by_region_df = politicians_score_df.groupby('region')[['article_title']].count()
# Grouping both calculations together into one dataset
articles_by_region_df['population'] = region_populations
# Calculating the per-capita article count for each region
articles_by_region_df = articles_by_region_df.rename({'article_title':'article_count'},axis=1)
articles_by_region_df['articles_per_capita'] = articles_by_region_df.article_count / articles_by_region_df.population
# Displaying per-capita article counts in descending order
articles_by_region_df.sort_values('articles_per_capita', ascending=False)

| region          | article_count | population | articles_per_capita |
|:----------------|--------------:|-----------:|--------------------:|
| OCEANIA         |            71 |      110.8 |            0.640794 |
| NORTHERN EUROPE |           196 |     1194.5 |            0.164085 |
| CARIBBEAN       |           215 |     1369.7 |            0.156969 |
| CENTRAL AMERICA |           193 |     1477.5 |            0.130626 |
| CENTRAL ASIA    |           118 |     2203.5 |            0.053551 |
| WESTERN ASIA    |           613 |    13469.7 |             0.04551 |
| SOUTHERN EUROPE |           819 |    18285.8 |            0.044789 |
| EASTERN AFRICA  |           672 |    24116.9 |            0.027864 |
| WESTERN EUROPE  |           500 |    19064.0 |            0.026227 |
| NORTHERN AFRICA |           306 |    12366.3 |            0.024745 |

Once again, we can see that raw article counts are highest for the major European and Asian regions, but the per-capita article counts are highest in regions with small populations.

In [395]:
# Getting number of high quality politician articles (GA or FA) for each region
high_quality_by_region_df = politicians_score_df.groupby('region')[['population', 'high_quality']].sum()
# Calculating number of high quality articles per million people in each region
high_quality_by_region_df['quality_per_capita'] = high_quality_by_region_df.high_quality / high_quality_by_region_df.population
# Generating table of highest quality per capita regions
high_quality_by_region_df.sort_values('quality_per_capita', ascending=False)

| region          | population | high_quality | quality_per_capita |
|:----------------|-----------:|-------------:|-------------------:|
| NORTHERN EUROPE |     1194.5 |            7 |            0.00586 |
| CARIBBEAN       |     1369.7 |            5 |            0.00365 |
| SOUTHERN EUROPE |    18285.8 |           27 |           0.001477 |
| CENTRAL AMERICA |     1477.5 |            2 |           0.001354 |
| SOUTHERN AFRICA |     5954.3 |            4 |           0.000672 |
| WESTERN ASIA    |    13469.7 |            7 |            0.00052 |
| CENTRAL ASIA    |     2203.5 |            1 |           0.000454 |
| EASTERN EUROPE  |    29582.5 |           10 |           0.000338 |
| MIDDLE AFRICA   |     9496.8 |            3 |           0.000316 |
| WESTERN EUROPE  |    19064.0 |            4 |            0.00021 |

From this final table, however, we can at last see a clear bias in the quality of the articles on politicians put together on Wikipedia. Compared to the other regions, the European regions have substantially more high-quality articles and a relatively small or moderate population count, which makes their per-capita count of high-quality articles far larger than that of the other regions.

### Part 5: Reflections

Looking at the results of this analysis, several key insights stand out as being particularly worthy of note. Throughout this analysis, it could be seen that the articles from less well-known or less financially influential countries were far less likely to be constructed according to Wikipedia's tenets of quality. From the analyses of high-quality articles it can be seen that the bulk of high-quality articles were for politicians in European regions, to a degree that cannot be explained by population alone. However, even the process of cleaning the datasets for use in assembling each table revealed the fact that politicians from less influential countries were far more likely to have their political accomplishments separated off into a small stub article separate from the article on their personal life. Perhaps unsurprisingly, Wikipedia has a noticeable bias towards mainstream European politicians when it comes to the effort put into refining each article. Given that Wikipedia is curated by individuals driven by their own passion for the site, it is ultimately quite understandable that the politicians who are more influential, and thus more _visible_ in modern culture, would receive a disproportionate amount of attention and effort on their articles.

However, this presents an obvious issue for researchers seeking to use Wikipedia as a source of unbiased political history. Of course, any attempt to use Wikipedia's information without an acknowledgement of the inherent biases of crowdsourcing information is misguided to begin with. But the thoroughness with which Wikipedia catalogues information on politicians across the world nevertheless makes it a compelling source of data for a researcher seeking to compare political careers from different regions. A researcher aiming to study the differences between a typical political career path in Britain vs in the United States vs in Pakistan may decide to use Wikipedia as a source of comprehensive data simply due to the number of politicians it has catalogued. However, the conclusions such a researcher may draw would likely end up skewed by the fact that the information, citations, and article structure for politicians outside of Europe are simply worse than they are for those within it.

Ultimately, any researcher attempting to use Wikipedia as a source of political career information will have to enrich the dataset to support their needs. While Wikipedia has a significant disparity in article quality, articles larger than a stub generally include specific details and citations. As such, a researcher could use Wikipedia as a baseline source to learn the basic details of a politician's life and actions, while supplementing the articles with data obtained from political science research in the country associated with each figure. As this very analysis demonstrates, Wikipedia articles can be effective sources for analysis when combined with outside datasets such as population data and when embellished by additional data from sources such as the ORES API. Despite its inconsistency across regions, it serves perfectly well as a dataset to merge other more meticulous (but less expansive) datasets onto.


In [414]:
print(high_quality_by_region_df.sort_values('quality_per_capita', ascending=False).to_markdown())

| region          |   population |   high_quality |   quality_per_capita |
|:----------------|-------------:|---------------:|---------------------:|
| NORTHERN EUROPE |       1194.5 |              7 |          0.00586019  |
| CARIBBEAN       |       1369.7 |              5 |          0.00365043  |
| SOUTHERN EUROPE |      18285.8 |             27 |          0.00147656  |
| CENTRAL AMERICA |       1477.5 |              2 |          0.00135364  |
| SOUTHERN AFRICA |       5954.3 |              4 |          0.000671783 |
| WESTERN ASIA    |      13469.7 |              7 |          0.000519685 |
| CENTRAL ASIA    |       2203.5 |              1 |          0.000453823 |
| EASTERN EUROPE  |      29582.5 |             10 |          0.000338038 |
| MIDDLE AFRICA   |       9496.8 |              3 |          0.000315896 |
| WESTERN EUROPE  |      19064   |              4 |          0.00020982  |
| NORTHERN AFRICA |      12366.3 |              2 |          0.00016173  |
| SOUTH AMERICA   |      