# HW 2 - Considering Bias in Data

## Goal
This assignment aims to explore bias in data through the analysis of Wikipedia articles on political figures across different countries.

## Data Sources
1. **Wikipedia Articles**: A dataset of articles on politicians, generated by crawling the [Wikipedia Category:Politicians by nationality](https://en.wikipedia.org/wiki/Category:Politicians_by_nationality). The list is provided as [`politicians_by_country_AUG.2024.csv`](https://drive.google.com/drive/folders/1qINqVxEf072AKY1HpaQYlFTEllLWjl4L).

2. **Population Data**: Population data comes from the World Population Data Sheet published by the Population Reference Bureau. It is provided as [`population_by_country_AUG.2024.csv`](https://drive.google.com/drive/folders/1qINqVxEf072AKY1HpaQYlFTEllLWjl4L).

## API
**ORES**: The assignment uses the ORES machine learning service to generate article quality estimates. The well-established ORES ML models are being moved to LiftWing, Wikimedia's new machine learning infrastructure, so scoring requests in this notebook go to the LiftWing endpoint.
The ORES API documentation can be accessed on the [main ORES page](https://www.mediawiki.org/wiki/ORES).

## License
This project is based on a code example developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.0 - August 15, 2023



In [120]:
import pandas as pd

politicians_by_country = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/politicians_by_country_AUG.2024.csv')
population_by_country = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/population_by_country_AUG.2024.csv')

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
#
# These are standard python modules
import json, time, urllib.parse
#
# The 'requests' module is not a standard Python module. You will need to install this with pip/pip3 if you do not already have it
import requests

In [4]:
#########
#
#    CONSTANTS
#

# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"
API_HEADER_AGENT = 'User-Agent'

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': 'jbjunguw@uw.edu, University of Washington, MSDS DATA 512 - AUTUMN 2024'
}

# This is just a list of English Wikipedia article titles that we can use for example requests
ARTICLE_TITLES = [ 'Bison', 'Northern flicker', 'Red squirrel', 'Chinook salmon', 'Horseshoe bat' ]

# This is a string of additional page properties that can be returned see the Info documentation for
# what can be included. If you don't want any this can simply be the empty string
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
#PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making this
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}


In [5]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_pageinfo_per_article(article_title = None,
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT,
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):

    # article title can be as a parameter to the call or in the request_template
    if article_title:
        request_template['titles'] = article_title

    if not request_template['titles']:
        raise Exception("Must supply an article title to make a pageinfo request.")

    if API_HEADER_AGENT not in headers:
        raise Exception(f"The header data should include a '{API_HEADER_AGENT}' field that contains your UW email address.")

    if 'uwnetid@uw' in headers[API_HEADER_AGENT]:
        raise Exception(f"Use your UW email address in the '{API_HEADER_AGENT}' field.")

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response


In [6]:
#########
#
#    CONSTANTS
#

#    The current LiftWing ORES API endpoint and prediction model
#
API_ORES_LIFTWING_ENDPOINT = "https://api.wikimedia.org/service/lw/inference/v1/models/{model_name}:predict"
API_ORES_EN_QUALITY_MODEL = "enwiki-articlequality"

#
#    The throttling rate is a function of the Access token that you are granted when you request the token. The constants
#    come from dissecting the token and getting the rate limits from the granted token. An example of that is below.
#
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = ((60.0*60.0)/5000.0)-API_LATENCY_ASSUMED  # The key authorizes 5000 requests per hour

#    When making automated requests we should include something that is unique to the person making the request
#    This should include an email - your UW email would be good to put in there
#
#    Because all LiftWing API requests require some form of authentication, you need to provide your access token
#    as part of the header too
#
REQUEST_HEADER_TEMPLATE = {
    'User-Agent': "<{email_address}>, University of Washington, MSDS DATA 512 - AUTUMN 2024",
    'Content-Type': 'application/json',
    'Authorization': "Bearer {access_token}"
}
#
#    This is a template for the parameters that we need to supply in the headers of an API request
#
REQUEST_HEADER_PARAMS_TEMPLATE = {
    'email_address' : "",         # your email address should go here
    'access_token'  : ""          # the access token you create will need to go here
}

#
#    A dictionary of English Wikipedia article titles (keys) and sample revision IDs that can be used for this ORES scoring example
#
ARTICLE_REVISIONS = { 'Bison':1085687913 , 'Northern flicker':1086582504 , 'Red squirrel':1083787665 , 'Chinook salmon':1085406228 , 'Horseshoe bat':1060601936 }

#
#    This is a template of the data required as a payload when making a scoring request of the ORES model
#
ORES_REQUEST_DATA_TEMPLATE = {
    "lang":        "en",     # required that its english - we're scoring English Wikipedia revisions
    "rev_id":      "",       # this request requires a revision id
    "features":    True
}

#
#    These are used later - defined here so they, at least, have empty values
#
USERNAME = ""
ACCESS_TOKEN = ""
#
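
The throttle constant above follows from simple arithmetic: 5000 requests per hour allows one request every 0.72 seconds, and subtracting the assumed 2 ms latency gives the wait applied before each call. A small worked sketch of that calculation follows; the helper name `throttle_wait` is hypothetical and is not part of the original example code.

```python
def throttle_wait(requests_per_hour, assumed_latency=0.002):
    """Seconds to sleep before each request so the hourly limit is not exceeded."""
    seconds_per_request = (60.0 * 60.0) / requests_per_hour
    # Clamp at zero so a very generous limit never yields a negative sleep time
    return max(seconds_per_request - assumed_latency, 0.0)

wait = throttle_wait(5000)
print(f"{wait:.3f} s between requests")
```

Because `API_THROTTLE_WAIT` is redefined in this cell, the same wait also applies to any page info request made after the cell has run.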

In [11]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_ores_score_per_article(article_revid = None, email_address=None, access_token=None,
                                   endpoint_url = API_ORES_LIFTWING_ENDPOINT,
                                   model_name = API_ORES_EN_QUALITY_MODEL,
                                   request_data = ORES_REQUEST_DATA_TEMPLATE,
                                   header_format = REQUEST_HEADER_TEMPLATE,
                                   header_params = REQUEST_HEADER_PARAMS_TEMPLATE):

    #    Make sure we have an article revision id, email and token
    #    This approach prioritizes the parameters passed in when making the call
    if article_revid:
        request_data['rev_id'] = article_revid
    if email_address:
        header_params['email_address'] = email_address
    if access_token:
        header_params['access_token'] = access_token

    #   Making a request requires a revision id - an email address - and the access token
    if not request_data['rev_id']:
        raise Exception("Must provide an article revision id (rev_id) to score articles")
    if not header_params['email_address']:
        raise Exception("Must provide an 'email_address' value")
    if not header_params['access_token']:
        raise Exception("Must provide an 'access_token' value")

    # Create the request URL with the specified model parameter - default is a article quality score request
    request_url = endpoint_url.format(model_name=model_name)

    # Create a compliant request header from the template and the supplied parameters
    headers = dict()
    for key in header_format.keys():
        headers[str(key)] = header_format[key].format(**header_params)

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free data
        # source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        #response = requests.get(request_url, headers=headers)
        response = requests.post(request_url, headers=headers, data=json.dumps(request_data))
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response
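
The scoring loop later in this notebook reads the prediction from a nested response of the form `score['enwiki']['scores'][str(rev_id)]['articlequality']['score']['prediction']`. A minimal sketch of a helper that walks that structure and returns `None` when any level is absent is given here. The function name `extract_quality_prediction` and the sample response are hypothetical; only the nesting is taken from the loop.

```python
def extract_quality_prediction(score, revid, wiki="enwiki", model="articlequality"):
    """Return the predicted quality class, or None if the response lacks one."""
    if not isinstance(score, dict):
        return None
    try:
        return score[wiki]["scores"][str(revid)][model]["score"]["prediction"]
    except (KeyError, TypeError):
        # Any missing level (or an unexpected non-dict value) counts as "no score"
        return None

# Made-up response used only to show the expected nesting
sample_response = {
    "enwiki": {"scores": {"12345": {"articlequality": {"score": {"prediction": "Stub"}}}}}
}
print(extract_quality_prediction(sample_response, 12345))
print(extract_quality_prediction({"error": "not found"}, 12345))
```

A helper like this keeps the failure cases (failed request, error payload, missing revision) in one place instead of a long chained condition.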


In [17]:
!pip install tqdm



In [21]:
def chunk_list(data, n):
    """Yield successive chunks of at most n items from data."""
    for i in range(0, len(data), n):
        yield data[i:i + n]

politicians_dict = politicians_by_country.groupby('country').apply(lambda df: df.to_dict('records')).to_dict()

no_matching = []
matching = []
missing = []
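
As a small illustration of how the batching helper is meant to be used, the sketch below splits a list of titles into chunks and joins each chunk with `|` to form one `titles` parameter. The titles are the sample `ARTICLE_TITLES` from earlier, and the chunk size of 2 is chosen only to keep the output short; the main loop uses 50.

```python
def chunk_list(data, n):
    # Same helper as in the cell above, repeated so this snippet runs on its own
    for i in range(0, len(data), n):
        yield data[i:i + n]

titles = ['Bison', 'Northern flicker', 'Red squirrel', 'Chinook salmon', 'Horseshoe bat']

# Each batch becomes a single pipe-separated 'titles' value for one API request
batches = ["|".join(chunk) for chunk in chunk_list(titles, 2)]
for batch in batches:
    print(batch)
```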




In [19]:
from tqdm import tqdm

politicians_dict = politicians_by_country.groupby('country').apply(lambda df: df.to_dict('records')).to_dict()

no_matching = []
matching = []
missing = []



In [33]:
for index, row in tqdm(population_by_country.iterrows(), total=population_by_country.shape[0]):
    geography = row["Geography"]
    print(f"Processing country/region: {geography}")

    if geography.upper() == geography:
        cur_region = geography
    else:
        if geography not in politicians_dict:
            no_matching.append(geography)
        else:
            politicians = politicians_dict[geography]
            politician_names = [politician['name'] for politician in politicians]

            # Split the names into chunks of 50 (API limit)
            for chunk in chunk_list(politician_names, 50):
                politician_titles = "|".join(chunk)

                request_info = PAGEINFO_PARAMS_TEMPLATE.copy()
                request_info['titles'] = politician_titles

                pageinfo_response = request_pageinfo_per_article(request_template=request_info)

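                # request_pageinfo_per_article returns None when the request fails, so check for that first
                if pageinfo_response and 'query' in pageinfo_response and 'pages' in pageinfo_response['query']: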
                    pageinfo_responses = pageinfo_response['query']['pages']
                else:
                    print(f"Unexpected response format or error: {json.dumps(pageinfo_response, indent=4)}")
                    missing.append(f"Error retrieving pageinfo for {politician_titles}")
                    continue

                # Process each politician in the chunk
                for politician in chunk:
                    pageinfo = next((p for p in pageinfo_responses.values() if p.get('title') == politician), None)

                    if pageinfo:
                        if 'lastrevid' in pageinfo:
                            lastrevid = pageinfo['lastrevid']

                            score = request_ores_score_per_article(
                                article_revid=lastrevid,
                                email_address="jbjunguw@uw.edu",
                                access_token=ACCESS_TOKEN
                            )

                            if score and 'enwiki' in score and 'scores' in score['enwiki'] and str(lastrevid) in score['enwiki']['scores'] \
                                and 'articlequality' in score['enwiki']['scores'][str(lastrevid)] \
                                and 'prediction' in score['enwiki']['scores'][str(lastrevid)]['articlequality']['score']:

                                article_quality = score['enwiki']['scores'][str(lastrevid)]['articlequality']['score']['prediction']
                                result = [
                                    ('country', geography),
                                    ('region', cur_region),
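                                    # NOTE: this multiplier assumes 'Population' is recorded in units of 10,000 people;
                                    # if the file reports millions, the multiplier should be 1_000_000 instead
                                    ('population', row['Population'] * 10000),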
                                    ('article_title', politician),
                                    ('revision_id', lastrevid),
                                    ('article_quality', article_quality)
                                ]
                                matching.append(result)
                            else:
                                missing.append(politician)
                        else:
                            print(f"Missing 'lastrevid' for {politician}")
                            missing.append(politician)
                    else:
                        print(f"Missing page info for {politician}")
                        missing.append(politician)


Processing country/region: WORLD
Processing country/region: AFRICA
Processing country/region: NORTHERN AFRICA
Processing country/region: Algeria
Processing country/region: Egypt
Processing country/region: Libya
Processing country/region: Morocco
Processing country/region: Sudan
Processing country/region: Tunisia
Processing country/region: Western Sahara
Processing country/region: WESTERN AFRICA
Processing country/region: Benin
Processing country/region: Burkina Faso
Processing country/region: Cape Verde
Processing country/region: Cote d'Ivoire
Processing country/region: Gambia
Processing country/region: Ghana
Missing 'lastrevid' for Richard Sumah
Processing country/region: Guinea
Processing country/region: GuineaBissau
Processing country/region: Liberia
Processing country/region: Mali
Processing country/region: Mauritania
Processing country/region: Niger
Processing country/region: Nigeria
Missing 'lastrevid' for Segun ''Aeroland'' Adewale
Processing country/region: Senegal
Processing country/region: Sierra Leone
Processing country/region: Togo
Processing country/region: EASTERN AFRICA
Processing country/region: Burundi
Processing country/region: Comoros
Processing country/region: Djibouti
Processing country/region: Eritrea
Processing country/region: Ethiopia
Processing country/region: Kenya
Processing country/region: Madagascar
Processing country/region: Malawi
Processing country/region: Mauritius
Processing country/region: Mayotte
Processing country/region: Mozambique
Processing country/region: Reunion
Processing country/region: Rwanda
Processing country/region: Seychelles
Processing country/region: Somalia
Missing 'lastrevid' for Bashir Bililiqo
Processing country/region: South Sudan
Processing country/region: Tanzania
Processing country/region: Uganda
Processing country/region: Zambia
Processing country/region: Zimbabwe
Processing country/region: MIDDLE AFRICA
Processing country/region: Angola
Processing country/region: Cameroon
Missing 'lastrevid' for André Ngongang Ouandji
Processing country/region: Central African Republic
Processing country/region: Chad
Processing country/region: Congo
Processing country/region: Congo DR
Processing country/region: Equatorial Guinea
Processing country/region: Gabon
Processing country/region: Sao Tome and Principe
Processing country/region: SOUTHERN AFRICA
Processing country/region: Botswana
Processing country/region: eSwatini
Processing country/region: Lesotho
Processing country/region: Namibia
Processing country/region: South Africa
Processing country/region: NORTHERN AMERICA
Processing country/region: Canada
Processing country/region: United States
Processing country/region: LATIN AMERICA AND THE CARIBBEAN
Processing country/region: CENTRAL AMERICA
Processing country/region: Belize
Processing country/region: Costa Rica
Processing country/region: El Salvador
Processing country/region: Guatemala
Processing country/region: Honduras
Processing country/region: Mexico
Processing country/region: Nicaragua
Processing country/region: Panama
Processing country/region: CARIBBEAN
Processing country/region: Antigua and Barbuda
Processing country/region: Bahamas
Processing country/region: Barbados
Processing country/region: Cuba
Processing country/region: Curacao
Processing country/region: Dominica
Processing country/region: Dominican Republic
Missing 'lastrevid' for Tomás Pimentel
Processing country/region: Grenada
Processing country/region: Guadeloupe
Processing country/region: Haiti
Processing country/region: Jamaica
Processing country/region: Martinique
Processing country/region: Puerto Rico
Processing country/region: St. Kitts and Nevis
Processing country/region: St. Lucia
Processing country/region: St. Vincent and the Grenadines
Processing country/region: Trinidad and Tobago
Processing country/region: SOUTH AMERICA
Processing country/region: Argentina
Processing country/region: Bolivia
Processing country/region: Brazil
Processing country/region: Chile
Processing country/region: Colombia
Processing country/region: Ecuador
Processing country/region: French Guiana
Processing country/region: Guyana
Processing country/region: Paraguay
Processing country/region: Peru
Processing country/region: Suriname
Processing country/region: Uruguay
Processing country/region: Venezuela
Processing country/region: ASIA
Processing country/region: WESTERN ASIA
Processing country/region: Armenia
Processing country/region: Azerbaijan
Missing 'lastrevid' for Mehrali Gasimov
Processing country/region: Bahrain
Processing country/region: Cyprus
Processing country/region: Georgia
Processing country/region: Iraq
Processing country/region: Israel
Processing country/region: Jordan
Processing country/region: Kuwait
Processing country/region: Lebanon
Processing country/region: Oman
Processing country/region: Palestinian Territory
Processing country/region: Qatar
Processing country/region: Saudi Arabia
Processing country/region: Syria
Processing country/region: Turkey
Processing country/region: United Arab Emirates
Processing country/region: Yemen
Processing country/region: CENTRAL ASIA
Processing country/region: Kazakhstan
Processing country/region: Kyrgyzstan
Processing country/region: Tajikistan
Processing country/region: Turkmenistan
Processing country/region: Uzbekistan
Processing country/region: SOUTH ASIA
Processing country/region: Afghanistan
Processing country/region: Bangladesh
Processing country/region: Bhutan
Processing country/region: India
Processing country/region: Iran
Processing country/region: Maldives
Processing country/region: Nepal
Processing country/region: Pakistan
Processing country/region: Sri Lanka
Processing country/region: SOUTHEAST ASIA
Processing country/region: Brunei
Processing country/region: Cambodia
Processing country/region: Indonesia
Processing country/region: Laos
Processing country/region: Malaysia
Processing country/region: Myanmar
Missing 'lastrevid' for Kyaw Myint
Processing country/region: Philippines
Processing country/region: Singapore
Processing country/region: Thailand
Processing country/region: Timor Leste
Processing country/region: Vietnam
Processing country/region: EAST ASIA
Processing country/region: China
Processing country/region: China (Hong Kong SAR)
Processing country/region: China (Macao SAR)
Processing country/region: Japan
Processing country/region: Korea (North)
Processing country/region: Korea (South)
Processing country/region: Mongolia
Processing country/region: Taiwan
Processing country/region: EUROPE
Processing country/region: NORTHERN EUROPE
Processing country/region: Denmark
Processing country/region: Estonia
Processing country/region: Finland
Processing country/region: Iceland
Processing country/region: Ireland
Processing country/region: Latvia
Processing country/region: Lithuania
Processing country/region: Norway
Processing country/region: Sweden
Processing country/region: United Kingdom
Processing country/region: WESTERN EUROPE
Processing country/region: Austria
Missing 'lastrevid' for Barbara Eibinger-Miedl
Processing country/region: Belgium
Processing country/region: France
Processing country/region: Germany
Processing country/region: Liechtenstein
Processing country/region: Luxembourg
Processing country/region: Monaco
Processing country/region: Netherlands
Processing country/region: Switzerland
Processing country/region: EASTERN EUROPE
Processing country/region: Belarus
Processing country/region: Bulgaria
Processing country/region: Czechia
Processing country/region: Hungary
Processing country/region: Moldova
Processing country/region: Poland
Processing country/region: Romania
Processing country/region: Russia
Processing country/region: Slovakia
Processing country/region: Ukraine
Processing country/region: SOUTHERN EUROPE
Processing country/region: Albania
Processing country/region: Andorra
Processing country/region: Bosnia Herzegovina
Processing country/region: Croatia
Processing country/region: Greece
Processing country/region: Italy
Processing country/region: Kosovo
Processing country/region: Malta
Processing country/region: Montenegro
Processing country/region: North Macedonia
Processing country/region: Portugal
Processing country/region: San Marino
Processing country/region: Serbia
Processing country/region: Slovenia
Processing country/region: Spain
Processing country/region: OCEANIA
Processing country/region: Australia
Processing country/region: Federated States of Micronesia
Processing country/region: Fiji
Processing country/region: French Polynesia
Processing country/region: Guam
Processing country/region: Kiribati
Processing country/region: Marshall Islands
Processing country/region: Nauru
Processing country/region: New Caledonia
Processing country/region: New Zealand
Processing country/region: Palau
Processing country/region: Papua New Guinea
Processing country/region: Samoa
Processing country/region: Solomon Islands
Processing country/region: Tonga
Processing country/region: Tuvalu
Processing country/region: Vanuatu

100%|██████████| 233/233 [2:06:43<00:00, 32.63s/it]


The error rate is the number of articles without a quality score (8) divided by the total number of articles, expressed here as a percentage. About 0.11% of the articles in the dataset have no score.

In [92]:
8/len(politicians_by_country) * 100

0.11180992313067784
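
The hard-coded count of 8 could instead be derived from whatever list collected the articles that came back without a score. The following is a minimal sketch, assuming a hypothetical list `articles_without_scores` filled during the scoring loop and a `total_articles` value standing in for `len(politicians_by_country)`. The sample titles and total are placeholders, not data from this assignment.

```python
def error_rate_percent(num_missing: int, num_total: int) -> float:
    """Return the share of articles without a score, as a percentage."""
    if num_total <= 0:
        raise ValueError("num_total must be positive")
    return num_missing / num_total * 100


# Hypothetical stand-ins: in the notebook these would come from the
# scoring loop and from len(politicians_by_country).
articles_without_scores = ["Example A", "Example B"]
total_articles = 400

rate = error_rate_percent(len(articles_without_scores), total_articles)
print(f"{rate:.2f}% of articles have no quality score")
```

Deriving the count from the list keeps the reported rate in step with the data if the scoring loop is rerun.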

In [88]:
no_matching

['Reunion',
 'French Guiana',
 'Guadeloupe',
 'Martinique',
 'Brunei',
 'Mauritius',
 'French Polynesia',
 'Mayotte',
 'Australia',
 'Canada',
 'Dominica',
 'eSwatini',
 'Netherlands',
 'Denmark',
 'GuineaBissau',
 'Curacao',
 'Andorra',
 'Suriname',
 'Mexico',
 'Nauru',
 'China (Macao SAR)',
 'Jamaica',
 'Sao Tome and Principe',
 'Fiji',
 'United States',
 'Georgia',
 'Palau',
 'Romania',
 'Iceland',
 'San Marino',
 'Liechtenstein',
 'China (Hong Kong SAR)',
 'United Kingdom',
 'Guam',
 'Kiribati',
 'Korea (South)',
 'New Zealand',
 'Puerto Rico',
 'Ireland',
 'New Caledonia',
 'Philippines',
 'Korea (North)',
 'Western Sahara']

In [59]:
# Write the unmatched country names to a text file, one per line
with open('wp_countries-no_match.txt', 'w') as f:
    for item in no_matching:
        f.write(f"{item}\n")

In [61]:
import pandas as pd

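# Each entry in `matching` is assumed to be an iterable of (key, value) pairs
matching_df = pd.DataFrame([dict(row) for row in matching])

# Keep only the columns required for the output file, in a fixed order
matching_df = matching_df[
    ["country", "region", "population", "article_title", "revision_id", "article_quality"]
]

# Drop repeated article/revision pairs before saving
unique_matching_df = matching_df.drop_duplicates(subset=["article_title", "revision_id"])

unique_matching_df.to_csv("wp_politicians_by_country.csv", index=False)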

In [111]:
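# Index populations by name so they can be looked up by country
population_by_country = population_by_country.set_index("Geography")

# Number of articles per country
article_counts = matching_df.groupby("country")["article_title"].count()

# Articles divided by population; a population of 0 yields inf
article_coverage = article_counts / population_by_country.loc[article_counts.index, "Population"]

analysis_df = pd.DataFrame(article_coverage, columns=["article_coverage"])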

# Number of high-quality (FA or GA) articles per country
quality_counts = (
    matching_df[matching_df["article_quality"].isin(["FA", "GA"])]
    .groupby("country")["article_title"]
    .count()
)

quality_coverage = quality_counts / population_by_country.loc[quality_counts.index, "Population"]

analysis_df["quality_coverage"] = quality_coverage

analysis_df.fillna(0, inplace=True)

top_10_article_coverage = analysis_df.sort_values("article_coverage", ascending=False).head(10)
bottom_10_article_coverage = analysis_df.sort_values("article_coverage").head(10)

top_10_quality_coverage = analysis_df.sort_values("quality_coverage", ascending=False).head(10)
bottom_10_quality_coverage = analysis_df.sort_values("quality_coverage").head(10)

In [112]:
print("Top 10 countries by article coverage:")
top_10_article_coverage["article_coverage"]

Top 10 countries by article coverage:


country,article_coverage
Monaco,inf
Tuvalu,inf
Antigua and Barbuda,330.0
Federated States of Micronesia,140.0
Marshall Islands,130.0
Tonga,100.0
Barbados,83.333333
Seychelles,60.0
Montenegro,60.0
Bhutan,55.0
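
The `inf` values for Monaco and Tuvalu arise because their population figures are 0 in the population table, so the division is by zero. Values such as 330.0 for Antigua and Barbuda suggest that the population column is recorded in millions, which would round very small countries down to zero. One way to keep such rows out of the ranking is to treat a zero population as missing before dividing. The following is a minimal sketch with made-up numbers; the series names and values are illustrative only.

```python
import numpy as np
import pandas as pd

# Illustrative data only; not taken from the assignment's CSV files.
population = pd.Series({"A": 0.0, "B": 0.1, "C": 2.0}, name="Population")
article_counts = pd.Series({"A": 10, "B": 33, "C": 50}, name="article_title")

# Replace zero populations with NaN so the ratio is NaN rather than inf,
# then drop the countries whose coverage cannot be computed.
safe_population = population.replace(0, np.nan)
coverage = (article_counts / safe_population).dropna()

print(coverage.sort_values(ascending=False))
```

Whether to drop these countries or report them separately is a judgment call. Either way, stating the choice alongside the ranking makes the result easier to interpret.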


In [113]:
print("Bottom 10 countries by article coverage:")
bottom_10_article_coverage["article_coverage"]

Bottom 10 countries by article coverage:


country,article_coverage
China,0.011337
Ghana,0.087977
India,0.105698
Saudi Arabia,0.135501
Zambia,0.148515
Norway,0.181818
Israel,0.204082
Egypt,0.304183
Cote d'Ivoire,0.323625
Ethiopia,0.347826


In [114]:
print("Top 10 countries by quality coverage:")
top_10_quality_coverage["quality_coverage"]

Top 10 countries by quality coverage:


country,quality_coverage
Montenegro,5.0
Luxembourg,2.857143
Albania,2.592593
Kosovo,2.352941
Maldives,1.666667
Lithuania,1.37931
Croatia,1.315789
Guyana,1.25
Palestinian Territory,1.090909
Slovenia,0.952381


In [115]:
print("Bottom 10 countries by quality coverage:")
bottom_10_quality_coverage["quality_coverage"]

Bottom 10 countries by quality coverage:


country,quality_coverage
Lesotho,0.0
Niger,0.0
Nicaragua,0.0
Namibia,0.0
Mozambique,0.0
Monaco,0.0
Marshall Islands,0.0
Malta,0.0
Malaysia,0.0
Malawi,0.0


In [68]:
region_article_coverage = matching_df.groupby("region")["article_title"].count()
region_article_coverage

region,article_title
CARIBBEAN,217
CENTRAL AMERICA,188
CENTRAL ASIA,106
EAST ASIA,152
EASTERN AFRICA,662
EASTERN EUROPE,709
MIDDLE AFRICA,230
NORTHERN AFRICA,401
NORTHERN EUROPE,191
OCEANIA,72


In [78]:
region_article_coverage_df = pd.DataFrame(region_article_coverage).reset_index()
# "Geography" becomes the index once set_index("Geography") has run, so restore it as a column before renaming
region_population_df = population_by_country.reset_index().rename(columns={"Geography": "region"})
merged_df = pd.merge(region_article_coverage_df, region_population_df, on='region', how='left')

In [79]:
merged_df['article_coverage_per_capita'] = merged_df['article_title'] / merged_df['Population']

ranked_regions_total_coverage = merged_df[['region', 'article_coverage_per_capita']].sort_values(
    by='article_coverage_per_capita', ascending=False
).set_index('region')

ranked_regions_total_coverage

region,article_coverage_per_capita
SOUTHERN EUROPE,5.230263
CARIBBEAN,4.931818
WESTERN EUROPE,2.497487
EASTERN EUROPE,2.487719
WESTERN ASIA,2.036789
NORTHERN EUROPE,1.768519
SOUTHERN AFRICA,1.757143
OCEANIA,1.6
NORTHERN AFRICA,1.566406
EASTERN AFRICA,1.3706
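
Because the merge uses `how='left'`, a region whose name has no exact match in the population table gets a missing `Population` value, and therefore a missing coverage value, without any warning. Passing `indicator=True` to `pd.merge` is one way to surface such rows. The following is a minimal sketch using made-up frames; the column names mirror those above, but the data are illustrative.

```python
import pandas as pd

# Illustrative frames only; region names and numbers are made up.
counts = pd.DataFrame(
    {"region": ["NORTH", "SOUTH", "EAST"], "article_title": [10, 20, 30]}
)
pops = pd.DataFrame({"region": ["NORTH", "SOUTH"], "Population": [5.0, 4.0]})

# indicator=True adds a "_merge" column that records where each row came from.
checked = pd.merge(counts, pops, on="region", how="left", indicator=True)

# Rows marked "left_only" had no matching population entry.
unmatched = checked.loc[checked["_merge"] == "left_only", "region"].tolist()
print("Regions without a population row:", unmatched)
```

An empty list means every region in the article counts found a population row.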


In [117]:
high_quality_articles = matching_df[matching_df["article_quality"].isin(["FA", "GA"])]
region_high_quality_coverage = high_quality_articles.groupby("region")["article_title"].count()

region_high_quality_coverage_df = pd.DataFrame(region_high_quality_coverage).reset_index()
quality_merged_df = pd.merge(region_high_quality_coverage_df, region_population_df, on='region', how='left')

quality_merged_df['quality_coverage_per_capita'] = quality_merged_df['article_title'] / quality_merged_df['Population']

ranked_regions_total_quality_coverage = quality_merged_df[['region', 'quality_coverage_per_capita']].sort_values(
    by='quality_coverage_per_capita', ascending=False
).set_index('region')

ranked_regions_total_quality_coverage

region,quality_coverage_per_capita
SOUTHERN EUROPE,0.348684
CARIBBEAN,0.204545
EASTERN EUROPE,0.133333
SOUTHERN AFRICA,0.114286
WESTERN EUROPE,0.105528
WESTERN ASIA,0.090301
NORTHERN EUROPE,0.083333
NORTHERN AFRICA,0.070312
CENTRAL ASIA,0.0625
CENTRAL AMERICA,0.054945
