# Analysis of Quality of Wikipedia Articles on Global Politicians



This notebook contains the code for an analysis of the quality of Wikipedia articles on global politicians. The goal of the analysis is to find the top 10 countries and regions with the highest coverage (number of articles per capita) and with the most high-quality articles per capita. Parts of the code were adapted from this [example notebook](https://drive.google.com/file/d/1GN1ULxKombHRzVsNKzj7tBhnBrSWUWXc/view?usp=drive_link) provided by Dr. David McDonald. The Wikipedia Politicians dataset and the Population dataset were also extracted and provided by Dr. David McDonald.


In [1]:
#
# These are standard python modules
import json, time, urllib.parse
#
# The 'requests' module is not a standard Python module. You will need to install this with pip/pip3 if you do not already have it
import requests
import pandas as pd
import country_converter as coco

We need region-level data for each country name, for which we use the ```country_converter``` package.

In [2]:
!pip install country_converter

Collecting country_converter
  Downloading country_converter-1.2-py3-none-any.whl.metadata (24 kB)
Downloading country_converter-1.2-py3-none-any.whl (45 kB)
Installing collected packages: country_converter
Successfully installed country_converter-1.2


## Constants and API request templates
The following cells define the constants for our API calls, along with templates for the API requests. We do not expect most of these values to change during the further execution of our code, so we store them here for convenience. The first two cells are for the ORES API, while the third cell contains the constants and request templates for the page info API, which is used to extract the latest revision ID required by the ORES API.

In [2]:
#########
#
#    CONSTANTS
#

#    The current LiftWing ORES API endpoint and prediction model
#
API_ORES_LIFTWING_ENDPOINT = "https://api.wikimedia.org/service/lw/inference/v1/models/{model_name}:predict"
API_ORES_EN_QUALITY_MODEL = "enwiki-articlequality"

#
#    The throttling rate is a function of the Access token that you are granted when you request the token. The constants
#    come from dissecting the token and getting the rate limits from the granted token. An example of that is below.
#
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = ((60.0*60.0)/5000.0)-API_LATENCY_ASSUMED  # The key authorizes 5000 requests per hour

#    When making automated requests we should include something that is unique to the person making the request
#    This should include an email - your UW email would be good to put in there
#
#    Because all LiftWing API requests require some form of authentication, you need to provide your access token
#    as part of the header too
#
REQUEST_HEADER_TEMPLATE = {
    'User-Agent': "<{email_address}>, University of Washington, MSDS DATA 512 - AUTUMN 2024",
    'Content-Type': 'application/json',
    'Authorization': "Bearer {access_token}"
}
#
#    This is a template for the parameters that we need to supply in the headers of an API request
#
REQUEST_HEADER_PARAMS_TEMPLATE = {
    'email_address' : "chakim28@uw.edu",         # your email address should go here
    'access_token'  : ""          # the access token you create will need to go here
}

#
#    A dictionary of English Wikipedia article titles (keys) and sample revision IDs that can be used for this ORES scoring example
#
ARTICLE_REVISIONS = { 'Bison':1085687913 , 'Northern flicker':1086582504 , 'Red squirrel':1083787665 , 'Chinook salmon':1085406228 , 'Horseshoe bat':1060601936 }

#
#    This is a template of the data required as a payload when making a scoring request of the ORES model
#
ORES_REQUEST_DATA_TEMPLATE = {
    "lang":        "en",     # must be English - we're scoring English Wikipedia revisions
    "rev_id":      "",       # this request requires a revision id
    "features":    True
}

#
#    These are used later - defined here so they, at least, have empty values
#
USERNAME = ""
ACCESS_TOKEN = ""
#

In [3]:
USERNAME = ""
ACCESS_TOKEN = ""

In [4]:
#########
#
#    CONSTANTS
#

# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"
API_HEADER_AGENT = 'User-Agent'

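# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
# NOTE: these two assignments reuse the names from the ORES constants cell above, so once this cell has run
# both request functions sleep for this shorter wait
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED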

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': 'chakim28@uw.edu, University of Washington, MSDS DATA 512 - AUTUMN 2024'
}

# This is a string of additional page properties that can be returned see the Info documentation for
# what can be included. If you don't want any this can simply be the empty string
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
#PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making this
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}
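
The two constants cells both assign ```API_THROTTLE_WAIT```, and the functions read that name when they are called, so the value from whichever cell ran last applies to both APIs. The sketch below shows the arithmetic behind each wait and one way to keep them apart. The helper ```throttle_wait``` and the two constant names are hypothetical and are not used elsewhere in this notebook.

```python
def throttle_wait(requests_allowed: int, per_seconds: float, latency: float = 0.002) -> float:
    """Seconds to sleep between requests to stay under a rate limit."""
    return max(0.0, per_seconds / requests_allowed - latency)

# Hypothetical names: one wait value per API instead of a shared API_THROTTLE_WAIT
ORES_THROTTLE_WAIT = throttle_wait(5000, 60.0 * 60.0)   # 5000 requests per hour
PAGEINFO_THROTTLE_WAIT = throttle_wait(100, 1.0)        # 100 requests per second

print(f"ORES wait: {ORES_THROTTLE_WAIT:.3f} s, page info wait: {PAGEINFO_THROTTLE_WAIT:.3f} s")
```

With the figures from the constants cells this gives roughly 0.718 seconds between ORES requests and 0.008 seconds between page info requests.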


## Functions to make API calls

This section contains the functions that make the API calls. We first request the page info for each article to get its latest revision ID, and then pass that revision ID to the ORES API to get the article quality prediction.

In [5]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_pageinfo_per_article(article_title = None,
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT,
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):

    # the article title can be passed as a parameter to the call or set in the request_template
    if article_title:
        request_template['titles'] = article_title

    if not request_template['titles']:
        raise Exception("Must supply an article title to make a pageinfo request.")

    if API_HEADER_AGENT not in headers:
        raise Exception(f"The header data should include a '{API_HEADER_AGENT}' field that contains your UW email address.")

    if 'uwnetid@uw' in headers[API_HEADER_AGENT]:
        raise Exception(f"Use your UW email address in the '{API_HEADER_AGENT}' field.")

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response


In [6]:
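def get_latest_revision(json_response):
    # The page info response nests pages under 'query' -> 'pages', keyed by page ID;
    # return the 'lastrevid' of the first (and only) page requested
    pages = json_response['query']['pages']
    for page_id in pages:
        return pages[page_id]['lastrevid']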
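
The rerun log later in this notebook shows entries such as ```(Error: 'lastrevid')```, which is what a ```KeyError``` looks like when a page entry has no ```lastrevid``` field. A minimal sketch of a more defensive lookup follows. The function name ```safe_latest_revision``` is hypothetical, and the two sample responses are made up; the shape of the "missing" response (a ```-1``` key with a ```missing``` flag) is an assumption about how the page info API reports titles it cannot find.

```python
from typing import Optional

def safe_latest_revision(json_response: Optional[dict]) -> Optional[int]:
    """Return 'lastrevid' for the first page in a page info response, or None."""
    if not json_response:
        return None
    pages = json_response.get("query", {}).get("pages", {})
    for page in pages.values():
        # .get() returns None instead of raising when the field is absent
        return page.get("lastrevid")
    return None

# Made-up responses (assumed shapes, see the note above)
found = {"query": {"pages": {"12345": {"pageid": 12345, "title": "Example", "lastrevid": 1085687913}}}}
missing = {"query": {"pages": {"-1": {"title": "No such article", "missing": ""}}}}

print(safe_latest_revision(found))
print(safe_latest_revision(missing))
print(safe_latest_revision(None))
```

Returning ```None``` lets the caller record the article as having no revision ID without relying on an exception.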

### Define a function to make the ORES API request

The API request is made through a function that encapsulates the call and makes it reusable in other notebooks. The procedure is parameterized, relying on the constants above for some important default parameters. The primary assumption is that this function will be used to request data for a set of article revisions. The main parameter is 'article_revid'. One should be able to simply pass in a new article revision ID on each call and get back a Python dictionary as the result. A valid result will be a dictionary that contains the probabilities that the specific revision is one of six different article quality levels. Generally, the quality level with the highest probability score is considered the quality level for the article. This can be tricky when you have two (or more) highly probable quality levels.
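
To illustrate the "two highly probable quality levels" case, here is a small sketch that picks the most probable level and flags a close call. It assumes the per-level probabilities have already been pulled out of the response into a plain dictionary; the function name ```top_quality```, the ```margin``` threshold, and the probabilities are all made up for illustration.

```python
def top_quality(probabilities: dict, margin: float = 0.10):
    """Return (best_label, is_close_call) from a label -> probability mapping."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    best_label, best_p = ranked[0]
    runner_up_p = ranked[1][1] if len(ranked) > 1 else 0.0
    # A close call is when the runner-up is within `margin` of the best score
    return best_label, (best_p - runner_up_p) < margin

# Made-up probabilities for the six quality levels
example = {"FA": 0.02, "GA": 0.05, "B": 0.10, "C": 0.38, "Start": 0.41, "Stub": 0.04}
print(top_quality(example))
```

Here 'Start' wins, but 'C' is only 0.03 behind, so the second element of the result is ```True```.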

In [7]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_ores_score_per_article(article_revid = None, email_address=None, access_token=None,
                                   endpoint_url = API_ORES_LIFTWING_ENDPOINT,
                                   model_name = API_ORES_EN_QUALITY_MODEL,
                                   request_data = ORES_REQUEST_DATA_TEMPLATE,
                                   header_format = REQUEST_HEADER_TEMPLATE,
                                   header_params = REQUEST_HEADER_PARAMS_TEMPLATE):

    #    Make sure we have an article revision id, email and token
    #    This approach prioritizes the parameters passed in when making the call
    if article_revid:
        request_data['rev_id'] = article_revid
    if email_address:
        header_params['email_address'] = email_address
    if access_token:
        header_params['access_token'] = access_token

    #   Making a request requires a revision id - an email address - and the access token
    if not request_data['rev_id']:
        raise Exception("Must provide an article revision id (rev_id) to score articles")
    if not header_params['email_address']:
        raise Exception("Must provide an 'email_address' value")
    if not header_params['access_token']:
        raise Exception("Must provide an 'access_token' value")

    # Create the request URL with the specified model parameter - default is an article quality score request
    request_url = endpoint_url.format(model_name=model_name)

    # Create a compliant request header from the template and the supplied parameters
    headers = dict()
    for key in header_format.keys():
        headers[str(key)] = header_format[key].format(**header_params)

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free data
        # source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        #response = requests.get(request_url, headers=headers)
        response = requests.post(request_url, headers=headers, data=json.dumps(request_data))
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response
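
One detail of the function above: it writes ```rev_id``` into the dictionary passed as ```request_data```, which by default is the shared template. A revision ID from one call therefore stays in the template, and a later call with an empty ```article_revid``` would reuse it rather than fail the check. The sketch below shows one way to avoid that by copying the template per call. ```TEMPLATE``` and ```build_request_data``` are hypothetical stand-ins, not part of this notebook.

```python
import copy

# Stand-in for ORES_REQUEST_DATA_TEMPLATE from the constants cell
TEMPLATE = {"lang": "en", "rev_id": "", "features": True}

def build_request_data(rev_id, template=TEMPLATE):
    """Return a fresh payload; the shared template is left untouched."""
    if not rev_id:
        raise ValueError("Must provide an article revision id (rev_id)")
    payload = copy.deepcopy(template)
    payload["rev_id"] = rev_id
    return payload

payload = build_request_data(1085687913)
print(payload["rev_id"], repr(TEMPLATE["rev_id"]))
```

The template's ```rev_id``` stays empty after the call, so a missing revision ID is caught on every request.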


## Data extraction

The following cells read the list of politicians and the population of each country. They then query the page info API for the latest revision ID of each article and the ORES API for the article quality predictions. This information is stored in a CSV file named ```wp_politicians_by_country.csv```, which contains the following fields:
- country
- region (this is one of the 22 regions defined by the UN geoscheme)
- population
- article_title (this is basically the name of the politician)
- revision_id (this is the ID corresponding to the latest revision of the article)
- article_quality (this is the quality predicted by ORES)

For some of these articles, we are unable to fetch the latest revision ID through the API, and some of the ORES API requests might fail due to timeouts. We keep track of these errors in ```articles_without_scores.txt```. For the articles that encountered a timeout, we rerun the script and fill in the data. We aim for an error rate of less than 1%.

There might also be cases during the merge where either the population dataset has no entry for a country in the Wikipedia dataset, or vice versa. We keep track of such countries in ```wp_countries-no_match.txt``` for future reference.
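
As a small illustration of the no-match bookkeeping, the sketch below uses the ```indicator``` option of ```pd.merge``` on toy data to show which side of an outer merge each unmatched country came from. The names and population values are made up, and this is an alternative to the null-check used in the cell below, not a description of it.

```python
import pandas as pd

# Toy data; names and population values are illustrative only
politicians = pd.DataFrame({"name": ["A", "B", "C"], "country": ["Iraq", "Guyana", "Korean"]})
population = pd.DataFrame({"country": ["Iraq", "Guyana", "Monaco"], "population": [45.5, 0.8, 0.04]})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
merged = pd.merge(politicians, population, on="country", how="outer", indicator=True)
no_match = merged.loc[merged["_merge"] != "both", ["country", "_merge"]].drop_duplicates()
print(no_match)
```

In this toy case 'Korean' appears only in the politicians data and 'Monaco' only in the population data.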


In [None]:
# Load the CSV files
politicians_df = pd.read_csv("politicians_by_country_AUG.2024.csv")
population_df = pd.read_csv("population_by_country_AUG.2024.csv")

# Rename columns in population_df
population_df = population_df.rename(columns={'Geography': 'country', 'Population': 'population'})

# Separate regional and country data
regional_df = population_df[population_df['country'].str.isupper()]
country_population_df = population_df[~population_df['country'].str.isupper()]

print("Politicians DataFrame columns:")
print(politicians_df.columns)
print("\nPopulation DataFrame columns:")
print(population_df.columns)

# Create a dictionary to store ORES scores
ores_scores = {}

# List to store articles without ORES scores
articles_without_scores = []

# Function to get the latest revision ID
def get_latest_revision(json_response):
    for page_id in json_response['query']['pages']:
        return json_response['query']['pages'][page_id]['lastrevid']

# Function to extract the prediction from the flattened ORES response
def extract_prediction(ores_response):
    # Flatten the dictionary structure to get the prediction
    scores = ores_response.get('enwiki', {}).get('scores', {})
    for page_id in scores:
        prediction = scores[page_id].get('articlequality', {}).get('score', {}).get('prediction')
        return prediction

# Process each politician in the dataset
total_articles = len(politicians_df)
for index, row in politicians_df.iterrows():
    politician = row['name']

    try:
        # Get page info
        page_info_json = request_pageinfo_per_article(politician)

        if 'query' in page_info_json and 'pages' in page_info_json['query']:
            latest_revision = get_latest_revision(page_info_json)

            # Get ORES score
            ores_score = request_ores_score_per_article(latest_revision,
                                                        email_address="chakim28@uw.edu",
                                                        access_token=ACCESS_TOKEN)

            # Extract prediction properly
            quality_score = extract_prediction(ores_score)

            if quality_score:
                ores_scores[politician] = {
                    'revision_id': latest_revision,
                    'article_quality': quality_score
                }
            else:
                articles_without_scores.append(f"{politician} (No score in ORES response)")
        else:
            articles_without_scores.append(f"{politician} (No valid page info)")

    except Exception as e:
        articles_without_scores.append(f"{politician} (Error: {str(e)})")

    # Add a delay to avoid hitting rate limits
    time.sleep(0.1)

# Add ORES scores to politicians_df
politicians_df['revision_id'] = politicians_df['name'].map(lambda x: ores_scores.get(x, {}).get('revision_id'))
politicians_df['article_quality'] = politicians_df['name'].map(lambda x: ores_scores.get(x, {}).get('article_quality'))

# Merge datasets
merged_df = pd.merge(politicians_df, country_population_df, on='country', how='outer')

# Identify countries with no matches
no_match_countries = merged_df[merged_df['name'].isnull() | merged_df['population'].isnull()]['country'].unique()

# Save countries with no matches to a text file
with open('wp_countries-no_match.txt', 'w') as f:
    for country in no_match_countries:
        f.write(f"{country}\n")

# Create the final CSV file
final_df = merged_df.dropna(subset=['name', 'population'])
final_df = final_df.rename(columns={'name': 'article_title'})
final_df = final_df[['country', 'population', 'article_title', 'revision_id', 'article_quality']]

# Add regional data back to the final DataFrame
final_df = pd.concat([final_df, regional_df[['country', 'population']]])

# Save the final CSV file
final_df.to_csv('wp_politicians_by_country.csv', index=False)

# Compute and print the score error rate
error_rate = len(articles_without_scores) / total_articles
print(f"\nScore Error Rate: {error_rate:.2%}")

# Print or save the log of articles without scores
print("\nArticles without ORES scores:")
for article in articles_without_scores:
    print(article)

# Optionally, save the log to a file
with open('articles_without_scores.txt', 'w') as f:
    for article in articles_without_scores:
        f.write(f"{article}\n")

print("\nProcessing complete. Check wp_countries-no_match.txt, wp_politicians_by_country.csv, and articles_without_scores.txt for results.")


We take the list of articles for which the ORES API might have timed out, or for which we did not get a valid latest revision ID, and re-run the API calls on this list. After a successful rerun, we update our ```wp_politicians_by_country.csv``` file with the corrected information.
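
A rerun of this kind can also be folded into the request itself. The following is a minimal sketch of a retry helper with exponential backoff, exercised against a simulated flaky call instead of the real API. The names ```retry``` and ```flaky``` are hypothetical. Retrying only helps with transient failures such as timeouts; an article with no revision ID will fail every attempt.

```python
import time

def retry(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func() until it returns a non-None value or attempts run out."""
    for attempt in range(attempts):
        try:
            result = func()
            if result is not None:
                return result
        except Exception as exc:
            print(f"attempt {attempt + 1} failed: {exc}")
        if attempt < attempts - 1:
            sleep(base_delay * (2 ** attempt))   # 1s, 2s, 4s, ...
    return None

# Stand-in for an API call that times out twice before succeeding
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return {"prediction": "Stub"}

# The sleep is replaced with a no-op so the demonstration runs instantly
print(retry(flaky, sleep=lambda s: None))
```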

In [20]:
# List of names that encountered errors previously
errored_names = [
    "Barbara Eibinger-Miedl", "Mehrali Gasimov", "Julien Goekint", "Kyaw Myint",
    "André Ngongang Ouandji", "Tomás Pimentel", "Richard Sumah", "Mohamed El Fassi",
    "Segun ''Aeroland'' Adewale", "Binos Dauda Yaroe", "Issoufou Saidou-Djermakoye",
    "Yacouba Sido", "José Díaz de Bedoya", "João Almeida (politician)",
    "Álvaro Castello-Branco", "Bashir Bililiqo", "Gift Banda"
]

# Function to get the latest revision ID
def get_latest_revision(json_response):
    for page_id in json_response['query']['pages']:
        return json_response['query']['pages'][page_id]['lastrevid']

# Function to extract the prediction from the ORES response
def extract_prediction(ores_response):
    scores = ores_response.get('enwiki', {}).get('scores', {})
    for page_id in scores:
        prediction = scores[page_id].get('articlequality', {}).get('score', {}).get('prediction')
        return prediction

# Dictionary to store results and a list for articles without scores
ores_scores = {}
articles_without_scores = []

# Process each politician in the errored names list
for politician in errored_names:
    try:
        # Get page info to retrieve the latest revision ID
        page_info_json = request_pageinfo_per_article(politician)

        if 'query' in page_info_json and 'pages' in page_info_json['query']:
            latest_revision = get_latest_revision(page_info_json)

            # Get ORES score using the latest revision ID
            ores_score = request_ores_score_per_article(latest_revision,
                                                        email_address="chakim28@uw.edu",
                                                        access_token=ACCESS_TOKEN)

            # Extract the quality score prediction
            quality_score = extract_prediction(ores_score)

            if quality_score:
                ores_scores[politician] = {
                    'revision_id': latest_revision,
                    'article_quality': quality_score
                }
            else:
                articles_without_scores.append(f"{politician} (No score in ORES response)")
        else:
            articles_without_scores.append(f"{politician} (No valid page info)")

    except Exception as e:
        articles_without_scores.append(f"{politician} (Error: {str(e)})")

    # Add a delay to avoid hitting rate limits
    time.sleep(0.1)

# Print the ORES scores collected
print("ORES Scores Collected:")
for name, score_data in ores_scores.items():
    print(f"{name}: {score_data}")

# Print or save the log of articles without scores
print("\nArticles without ORES scores:")
for article in articles_without_scores:
    print(article)

# Optionally, save the log to a file
with open('articles_without_scores_errored_names.txt', 'w') as f:
    for article in articles_without_scores:
        f.write(f"{article}\n")

print("\nProcessing complete. Check articles_without_scores_errored_names.txt for results.")


ORES Scores Collected:
Julien Goekint: {'revision_id': 1149315537, 'article_quality': 'Start'}
Mohamed El Fassi: {'revision_id': 1190167672, 'article_quality': 'Start'}
Binos Dauda Yaroe: {'revision_id': 1226318030, 'article_quality': 'C'}
Issoufou Saidou-Djermakoye: {'revision_id': 1177650722, 'article_quality': 'Stub'}
Yacouba Sido: {'revision_id': 1177650740, 'article_quality': 'Stub'}
José Díaz de Bedoya: {'revision_id': 1216770558, 'article_quality': 'Stub'}
João Almeida (politician): {'revision_id': 1219074425, 'article_quality': 'Stub'}
Álvaro Castello-Branco: {'revision_id': 1218274868, 'article_quality': 'Stub'}
Gift Banda: {'revision_id': 1227050005, 'article_quality': 'Stub'}

Articles without ORES scores:
Barbara Eibinger-Miedl (Error: 'lastrevid')
Mehrali Gasimov (Error: 'lastrevid')
Kyaw Myint (Error: 'lastrevid')
André Ngongang Ouandji (Error: 'lastrevid')
Tomás Pimentel (Error: 'lastrevid')
Richard Sumah (Error: 'lastrevid')
Segun ''Aeroland'' Adewale (Error: 'lastrevid

In [21]:
# Load the existing CSV file
existing_df = pd.read_csv('wp_politicians_by_country.csv')

# New ORES scores that need to be added
new_ores_scores = {
    "Julien Goekint": {'revision_id': 1149315537, 'article_quality': 'Start'},
    "Mohamed El Fassi": {'revision_id': 1190167672, 'article_quality': 'Start'},
    "Binos Dauda Yaroe": {'revision_id': 1226318030, 'article_quality': 'C'},
    "Issoufou Saidou-Djermakoye": {'revision_id': 1177650722, 'article_quality': 'Stub'},
    "Yacouba Sido": {'revision_id': 1177650740, 'article_quality': 'Stub'},
    "José Díaz de Bedoya": {'revision_id': 1216770558, 'article_quality': 'Stub'},
    "João Almeida (politician)": {'revision_id': 1219074425, 'article_quality': 'Stub'},
    "Álvaro Castello-Branco": {'revision_id': 1218274868, 'article_quality': 'Stub'},
    "Gift Banda": {'revision_id': 1227050005, 'article_quality': 'Stub'}
}

# Convert the new ORES scores into a DataFrame
new_scores_list = []
for name, score_data in new_ores_scores.items():
    new_scores_list.append({
        'article_title': name,
        'revision_id': score_data['revision_id'],
        'article_quality': score_data['article_quality']
    })

new_scores_df = pd.DataFrame(new_scores_list)

# Merge the new scores into the existing DataFrame
updated_df = existing_df.merge(new_scores_df, on='article_title', how='outer', suffixes=('', '_new'))

# Update the original columns with new data if present
updated_df['revision_id'] = updated_df['revision_id_new'].combine_first(updated_df['revision_id'])
updated_df['article_quality'] = updated_df['article_quality_new'].combine_first(updated_df['article_quality'])

# Drop the temporary columns
updated_df = updated_df.drop(columns=['revision_id_new', 'article_quality_new'])

# Save the updated DataFrame back to the CSV
updated_df.to_csv('wp_politicians_by_country.csv', index=False)

print("Updated ORES scores have been added to wp_politicians_by_country.csv.")


Updated ORES scores have been added to wp_politicians_by_country.csv.


## Mapping countries to regions

In this step we map each country to its closest region, that is, the region lowest in the hierarchy.
We use the ```country-converter``` package to get the UN region for each country and then map it to the format used in our data. Some countries were not mapped successfully by this step, and for these we create a manual country-to-region mapping.
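
The mapping logic can be sketched on toy data without calling the package. The labels in ```raw_region``` are hypothetical stand-ins for converter output; if the converter returns a label that is not a key in the label mapping (the sketch assumes 'Southern Asia' and 'Polynesia' are such labels), the first ```map``` yields a missing value and the per-country fallback fills it in.

```python
import pandas as pd

# Toy rows: all-uppercase names are region rows, the rest are countries
df = pd.DataFrame({"country": ["Iraq", "Sri Lanka", "Tuvalu", "OCEANIA"]})
# Hypothetical labels standing in for what a converter might return
raw_region = pd.Series(["Western Asia", "Southern Asia", "Polynesia", None])

label_to_region = {"Western Asia": "WESTERN ASIA", "Oceania": "OCEANIA"}
country_fallback = {"Sri Lanka": "SOUTH ASIA", "Tuvalu": "OCEANIA"}

is_country = ~df["country"].str.isupper()
region = raw_region.map(label_to_region)                      # unmatched labels become NaN
region = region.fillna(df["country"].map(country_fallback))   # per-country fallback
region = region.where(is_country, df["country"])              # region rows keep their own name
df["region"] = region
print(df)
```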

In [13]:
# Load the CSV file
df = pd.read_csv('wp_politicians_by_country.csv')

# Identify rows that are countries (not in all uppercase)
df['is_country'] = ~df['country'].str.isupper()

# Use country_converter to get regions (UN subregions) only for countries
cc = coco.CountryConverter()
df.loc[df['is_country'], 'region'] = cc.convert(names=df.loc[df['is_country'], 'country'], to='UNregion')

# Manual adjustment to align with your regions
region_mapping = {
    'Northern Africa': 'NORTHERN AFRICA',
    'Western Africa': 'WESTERN AFRICA',
    'Eastern Africa': 'EASTERN AFRICA',
    'Middle Africa': 'MIDDLE AFRICA',
    'Southern Africa': 'SOUTHERN AFRICA',
    'Northern America': 'NORTHERN AMERICA',
    'Caribbean': 'CARIBBEAN',
    'Central America': 'CENTRAL AMERICA',
    'South America': 'SOUTH AMERICA',
    'Western Asia': 'WESTERN ASIA',
    'Central Asia': 'CENTRAL ASIA',
    'South Asia': 'SOUTH ASIA',
    'Southeast Asia': 'SOUTHEAST ASIA',
    'Eastern Asia': 'EAST ASIA',
    'Northern Europe': 'NORTHERN EUROPE',
    'Western Europe': 'WESTERN EUROPE',
    'Eastern Europe': 'EASTERN EUROPE',
    'Southern Europe': 'SOUTHERN EUROPE',
    'Oceania': 'OCEANIA'
}

# Custom mapping for countries with missing region information
custom_country_mapping = {
    'Sri Lanka': 'SOUTH ASIA',
    'Indonesia': 'SOUTHEAST ASIA',
    'Bangladesh': 'SOUTH ASIA',
    'Malaysia': 'SOUTHEAST ASIA',
    'Papua New Guinea': 'OCEANIA',
    'Iran': 'WESTERN ASIA',
    'Pakistan': 'SOUTH ASIA',
    'Afghanistan': 'SOUTH ASIA',
    'Maldives': 'SOUTH ASIA',
    'India': 'SOUTH ASIA',
    'Solomon Islands': 'OCEANIA',
    'Thailand': 'SOUTHEAST ASIA',
    'Timor Leste': 'SOUTHEAST ASIA',
    'Marshall Islands': 'OCEANIA',
    'Singapore': 'SOUTHEAST ASIA',
    'Cambodia': 'SOUTHEAST ASIA',
    'Federated States of Micronesia': 'OCEANIA',
    'Myanmar': 'SOUTHEAST ASIA',
    'Nepal': 'SOUTH ASIA',
    'Vietnam': 'SOUTHEAST ASIA',
    'Bhutan': 'SOUTH ASIA',
    'Vanuatu': 'OCEANIA',
    'Tonga': 'OCEANIA',
    'Laos': 'SOUTHEAST ASIA',
    'Samoa': 'OCEANIA',
    'Tuvalu': 'OCEANIA'
}

# Apply the custom mapping to match your regions (only for country rows)
df.loc[df['is_country'], 'region'] = df.loc[df['is_country'], 'region'].map(region_mapping)

# Apply the custom country mapping
df.loc[df['is_country'], 'region'] = df.loc[df['is_country'], 'region'].fillna(df.loc[df['is_country'], 'country'].map(custom_country_mapping))

# For rows that are regions (in all uppercase), set the region to be the same as the country
df.loc[~df['is_country'], 'region'] = df.loc[~df['is_country'], 'country']

# Remove the temporary 'is_country' column
df = df.drop('is_country', axis=1)

# Save the updated DataFrame back to CSV
df.to_csv('wp_politicians_by_country.csv', index=False)

print("Updated CSV file saved as 'wp_politicians_by_country.csv'")

# Display the first few rows to verify the new 'region' column
print(df[['country', 'region']].head(10))

# Count of countries with missing region information
missing_regions = df[df['country'] != df['region']]['region'].isna().sum()
print(f"\nNumber of countries with missing region information: {missing_regions}")

if missing_regions > 0:
    print("\nCountries with missing region information:")
    print(df[(df['country'] != df['region']) & df['region'].isna()]['country'].unique())
else:
    print("\nAll countries have been assigned a region.")

Updated CSV file saved as 'wp_politicians_by_country.csv'
                    country           region
0                      Iraq     WESTERN ASIA
1                  Slovenia  SOUTHERN EUROPE
2                 Sri Lanka       SOUTH ASIA
3                    Guyana    SOUTH AMERICA
4                 Indonesia   SOUTHEAST ASIA
5                Bangladesh       SOUTH ASIA
6                Bangladesh       SOUTH ASIA
7                Bangladesh       SOUTH ASIA
8                Kazakhstan     CENTRAL ASIA
9  Central African Republic    MIDDLE AFRICA

Number of countries with missing region information: 0

All countries have been assigned a region.


## Analysis and results
In this step we generate tables for the following:
- Top 10 countries by coverage: The 10 countries with the highest total articles per capita (in descending order).
- Bottom 10 countries by coverage: The 10 countries with the lowest total articles per capita (in ascending order).
- Top 10 countries by high quality: The 10 countries with the highest high quality articles per capita (in descending order).
- Bottom 10 countries by high quality: The 10 countries with the lowest high quality articles per capita (in ascending order).
- Geographic regions by total coverage: A rank ordered list of geographic regions (in descending order) by total articles per capita.
- Geographic regions by high quality coverage: Rank ordered list of geographic regions (in descending order) by high quality articles per capita.
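
For the two region tables there is more than one way to aggregate. The cell below averages the country-level per capita values within each region; an alternative is to pool the region's articles and population before dividing. The sketch below contrasts the two on made-up numbers and is offered only to show how far they can diverge when a small country has a high ratio.

```python
import pandas as pd

# Made-up numbers; population is in whatever unit the population file uses
country_stats = pd.DataFrame({
    "country": ["A", "B"],
    "region": ["R1", "R1"],
    "total_articles": [10, 100],
    "population": [1.0, 100.0],
})
country_stats["per_capita"] = country_stats["total_articles"] / country_stats["population"]

# Option 1: unweighted mean of the country-level ratios
mean_of_ratios = country_stats.groupby("region")["per_capita"].mean()

# Option 2: pooled ratio, total articles over total population
totals = country_stats.groupby("region")[["total_articles", "population"]].sum()
pooled = totals["total_articles"] / totals["population"]

print(mean_of_ratios["R1"], pooled["R1"])
```

Here the mean of ratios is 5.5 while the pooled ratio is about 1.09, because the small country A dominates the unweighted mean.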


In [16]:
import pandas as pd

# Load the CSV file with the added region column
df = pd.read_csv('wp_politicians_by_country.csv')

# Convert 'population' column to numeric and filter out non-numeric values
df['population'] = pd.to_numeric(df['population'], errors='coerce')
df = df.dropna(subset=['population'])  # Drop rows where population is NaN

# Filter out regions (where 'country' is in all uppercase)
df = df[~df['country'].str.isupper()]

# Avoid division by zero by removing rows where population is zero
df = df[df['population'] > 0]

# Determine if an article is high-quality
df['is_high_quality'] = df['article_quality'].isin(['FA', 'GA'])

# Calculate total and high-quality articles per capita for each country
country_stats = df.groupby('country').agg({
    'article_title': 'count',
    'is_high_quality': 'sum',
    'population': 'first',
    'region': 'first'
}).reset_index()

country_stats.columns = ['country', 'total_articles', 'high_quality_articles', 'population', 'region']

# Calculate per capita values
country_stats['total_articles_per_capita'] = country_stats['total_articles'] / country_stats['population']
country_stats['high_quality_articles_per_capita'] = country_stats['high_quality_articles'] / country_stats['population']

# Aggregate data by region
region_stats = country_stats.groupby('region').agg({
    'total_articles_per_capita': 'mean',
    'high_quality_articles_per_capita': 'mean'
}).reset_index()

# Sort regions by total articles per capita
regions_by_total_coverage = region_stats.sort_values(by='total_articles_per_capita', ascending=False)

# Sort regions by high-quality articles per capita
regions_by_high_quality_coverage = region_stats.sort_values(by='high_quality_articles_per_capita', ascending=False)

# Print regions by total and high-quality coverage
print("\nGeographic regions by total coverage:")
print(regions_by_total_coverage.to_markdown(index=False, floatfmt=".6f"))
print("\nGeographic regions by high quality coverage:")
print(regions_by_high_quality_coverage.to_markdown(index=False, floatfmt=".6f"))

# Top 10 countries by coverage
top_countries_by_coverage = country_stats.nlargest(10, 'total_articles_per_capita')[['country', 'total_articles_per_capita']]

# Bottom 10 countries by coverage
bottom_countries_by_coverage = country_stats.nsmallest(10, 'total_articles_per_capita')[['country', 'total_articles_per_capita']]

# Top 10 countries by high quality
top_countries_by_high_quality = country_stats.nlargest(10, 'high_quality_articles_per_capita')[['country', 'high_quality_articles_per_capita']]

# Bottom 10 countries by high quality
bottom_countries_by_high_quality = country_stats.nsmallest(10, 'high_quality_articles_per_capita')[['country', 'high_quality_articles_per_capita']]

# Print the tables
print("\nTop 10 countries by coverage (articles per capita):")
print(top_countries_by_coverage.to_markdown(index=False, floatfmt=".6f"))

print("\nBottom 10 countries by coverage (articles per capita):")
print(bottom_countries_by_coverage.to_markdown(index=False, floatfmt=".6f"))

print("\nTop 10 countries by high quality (articles per capita):")
print(top_countries_by_high_quality.to_markdown(index=False, floatfmt=".6f"))

print("\nBottom 10 countries by high quality (articles per capita):")
print(bottom_countries_by_high_quality.to_markdown(index=False, floatfmt=".6f"))


Geographic regions by total coverage:
| region          |   total_articles_per_capita |   high_quality_articles_per_capita |
|:----------------|----------------------------:|-----------------------------------:|
| OCEANIA         |                   62.769424 |                           0.015038 |
| CARIBBEAN       |                   51.320891 |                           0.130612 |
| SOUTHERN EUROPE |                   15.008149 |                           1.150622 |
| SOUTH ASIA      |                   14.694933 |                           0.263691 |
| WESTERN EUROPE  |                   11.025725 |                           0.663467 |
| NORTHERN EUROPE |                    8.078862 |                           0.408874 |
| EASTERN AFRICA  |                    7.565414 |                           0.051389 |
| EASTERN EUROPE  |                    6.922870 |                           0.260920 |
| CENTRAL AMERICA |                    6.563974 |                           0.212835 |
| WE