# DATA 512 Homework 2: Considering Bias in Data

### Project Overview
The goal of this assignment is to explore the concept of bias in data using Wikipedia articles on political figures from different countries. We combine a dataset of Wikipedia articles with a dataset of country populations, and use a machine learning service called ORES to estimate the quality of each article. We then analyze how the coverage of politicians on Wikipedia, and the quality of the articles about them, varies among countries.

### License

#### Code Attribution

Snippets of the code were taken from a code example developed by **Dr. David W. McDonald** for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the **Creative Commons CC-BY license**.

The section below imports the libraries needed to make API calls, parse the data, and save it to the required files.

In [1]:
# 
# These are standard python modules
import json, time, urllib.parse
#
# The 'requests' module is not a standard Python module. You will need to install this with pip/pip3 if you do not already have it
import requests

The data acquisition code relies on some constants that make the code more readable and flexible. These include the API endpoint, request headers, API parameters, and the paths of the input files.

In [2]:
#########
#
#    CONSTANTS
#

# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"
API_HEADER_AGENT = 'User-Agent'

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<manasars@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2024'
}

# This is a string of additional page properties that can be returned see the Info documentation for
# what can be included. If you don't want any this can simply be the empty string
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
#PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making this
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}

# Input file path containing article titles.
ARTICLE_LIST_FILE = "inputfiles/politicians_by_country_AUG.2024.csv"
POPULATION_FILE = "inputfiles/population_by_country_AUG.2024.csv"
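
The throttle constant above is a target request rate minus the assumed latency. A minimal sketch of that arithmetic follows; the helper name `throttle_wait` is hypothetical and is not used elsewhere in this notebook.

```python
def throttle_wait(max_requests, per_seconds, latency_assumed=0.002):
    """Seconds to sleep before each request so that at most
    `max_requests` are sent every `per_seconds` seconds."""
    wait = (per_seconds / max_requests) - latency_assumed
    return max(wait, 0.0)  # never return a negative sleep time

# 100 requests per second, matching API_THROTTLE_WAIT above
print(round(throttle_wait(100, 1.0), 4))
```

Clamping at zero mirrors the `if API_THROTTLE_WAIT > 0.0` check that the request function performs before sleeping.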

The API request will be made using one procedure. The idea is to make this reusable. The procedure is parameterized, but relies on the constants above for the important parameters. The underlying assumption is that this will be used to request data for a set of article pages. 

In [3]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_pageinfo_per_article(article_title = None, 
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT, 
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):
    
    # the article title can be passed as a parameter to the call or set in the request_template
    if article_title:
        request_template['titles'] = article_title

    if not request_template['titles']:
        raise Exception("Must supply an article title to make a pageinfo request.")

    if API_HEADER_AGENT not in headers:
        raise Exception(f"The header data should include a '{API_HEADER_AGENT}' field that contains your UW email address.")

    if 'uwnetid@uw' in headers[API_HEADER_AGENT]:
        raise Exception(f"Use your UW email address in the '{API_HEADER_AGENT}' field.")

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response
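
Note that the function above writes the title into the shared template dictionary, so the title from one call is still present on the next call. If that is a concern, one option is to build a fresh copy of the parameters per request. The sketch below assumes a stand-in `TEMPLATE` dictionary and a hypothetical helper `build_pageinfo_params`; neither is part of the original code.

```python
import copy

# Stand-in for PAGEINFO_PARAMS_TEMPLATE so this sketch runs on its own
TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",
    "prop": "info",
    "inprop": "talkid|url|watched|watchers",
}

def build_pageinfo_params(article_title, template=TEMPLATE):
    """Return a per-request copy of the template with the title filled in."""
    if not article_title:
        raise ValueError("Must supply an article title to make a pageinfo request.")
    params = copy.deepcopy(template)   # leave the shared template untouched
    params["titles"] = article_title
    return params

params = build_pageinfo_params("Tayyab Agha")
print(params["titles"], "|", repr(TEMPLATE["titles"]))
```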


The csv library is used to read the article list from the CSV file; pandas is used later to build and merge the tabular data.

In [4]:
import csv
import pandas as pd

This function reads the CSV file at the path passed as an argument and stores the article titles in a list for use during the API calls. The csv DictReader object yields each row as a dictionary, so we iterate over the rows, read the value under the key "name" (the article name), and append it to the list if it is not already there.

In [5]:
# Function to read article titles from the CSV file
def read_article_file(file_path):
    """
    Read the article titles from a CSV file.

    Args:
    - file_path (str): Path to the CSV file containing article titles.

    Returns:
    - list: A list of unique article titles extracted from the 'name' column of the CSV file.

    """
    article_titles = []
    try:
        with open(file_path, mode='r', newline='', encoding='utf-8') as csvfile:
            reader = csv.DictReader(csvfile)
            for row in reader:
                if 'name' in row:
                    if row['name'] not in article_titles:  # Check for duplicates in the list
                        article_titles.append(row['name']) 
    except FileNotFoundError:
        print(f"Error: CSV file '{file_path}' not found. Please check the file path and try again.")
        raise
    except Exception as e:
        print(f"Unexpected error while reading '{file_path}': {e}")
        raise
    return article_titles
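
The duplicate check above scans the list for every row. As a hedged alternative, `dict.fromkeys` removes duplicates while keeping first-seen order. The CSV content below is made up for illustration and only mimics a file with a "name" column.

```python
import csv
import io

# Made-up CSV content standing in for the article list file
sample_csv = (
    "name,country\n"
    "Tayyab Agha,Afghanistan\n"
    "Jan Baz,Afghanistan\n"
    "Tayyab Agha,Afghanistan\n"
)

reader = csv.DictReader(io.StringIO(sample_csv))
# dict keys are unique and preserve insertion order
titles = list(dict.fromkeys(row["name"] for row in reader if row.get("name")))
print(titles)
```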

In the section below, we define a function that fetches page info data for the articles. For each article in the list generated above, it retrieves the page info from the Wikimedia API and saves it to an intermediary file for later use. While processing, it also records the articles for which the page info request failed.


In [6]:

def process_and_save_data(articles):
    """
    Request page info from the Wikipedia API for each article and save it as a CSV.

    Args:
    - articles (list): List of article titles to query from the Wikipedia API.

    The output is written to intermediary_files/articles_page_info.csv.

    Returns:
    - None
    """
    output_file="intermediary_files/articles_page_info.csv"

    article_data = []
    failed_to_process = []

    for article in articles:
        print(f"Processing article: {article}")

        # Request data for each article
        response = request_pageinfo_per_article(article)

        if response is not None:
            # Append the response to the list as a dictionary
            article_data.append(response)
        else:
            print(f"Failed to process {article}")
            failed_to_process.append(article)
    
    # Flatten each response into one row per page, keeping only the fields we need.
    # dict.get() leaves a field empty when the API reports the page as missing.
    rows = []
    for response in article_data:
        pages = response.get('query', {}).get('pages', {})
        for page_data in pages.values():
            rows.append({key: page_data.get(key) for key in ('pageid', 'title', 'lastrevid')})

    # dtype=object keeps the ids as integers even when some rows have no value
    result_df = pd.DataFrame(rows, columns=['pageid', 'title', 'lastrevid'], dtype=object)

    # Write the data to a CSV file
    try:
        result_df.to_csv(output_file, index=False, encoding='utf-8')
        print(f"Data successfully saved to {output_file}.")
    except Exception as e:
        print(f"Error saving data to CSV: {e}")

    # Report the articles for which the request failed
    print(f"Failed to process: {failed_to_process}")
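
The page info response nests each page under `query.pages`, keyed by page ID. A page that does not exist is keyed `-1` and carries no `pageid` or `lastrevid`. The sketch below shows one way to pull the three fields of interest from either shape. The helper `extract_page_fields` and both sample responses are illustrative assumptions, not output captured from the API.

```python
def extract_page_fields(response, fields=("pageid", "title", "lastrevid")):
    """Return one dict per page, with None for any field the page lacks."""
    pages = (response or {}).get("query", {}).get("pages", {})
    return [{f: page.get(f) for f in fields} for page in pages.values()]

# Illustrative shapes only; the ids and titles are invented
sample_found = {"query": {"pages": {"1001": {
    "pageid": 1001, "ns": 0, "title": "Example Person", "lastrevid": 123456}}}}
sample_missing = {"query": {"pages": {"-1": {
    "ns": 0, "title": "No Such Page", "missing": ""}}}}

print(extract_page_fields(sample_found))
print(extract_page_fields(sample_missing))
```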



We now use these functions: first read the CSV and store the article titles in a list, then request and save the page info for each title.

In [7]:
# Read the article titles from the CSV
article_titles = read_article_file(ARTICLE_LIST_FILE)

process_and_save_data(article_titles)

Processing article: Majah Ha Adrif
Processing article: Haroon al-Afghani
Processing article: Tayyab Agha
Processing article: Khadija Zahra Ahmadi
Processing article: Aziza Ahmadyar
Processing article: Muqadasa Ahmadzai
Processing article: Mohammad Sarwar Ahmedzai
Processing article: Amir Muhammad Akhundzada
Processing article: Nasrullah Baryalai Arsalai
Processing article: Abdul Rahim Ayoubi
Processing article: Ismael Balkhi
Processing article: Abdul Baqi Turkistani
Processing article: Mohammad Ghous Bashiri
Processing article: Jan Baz
Processing article: Bashir Ahmad Bezan
Processing article: Rafiullah Bidar
Processing article: Mohammad Siddiq Chakari
Processing article: Cheragh Ali Cheragh
Processing article: Nasir Ahmad Durrani
Processing article: Muhammad Hashim Esmatullahi
Processing article: Ezatullah (Nangarhar)
Processing article: Aimal Faizi
Processing article: Gajinder Singh Safri
Processing article: Sharif Ghalib
Processing article: Hashmat Ghani Ahmadzai
Processing article:

We will now define the parameters and constants needed to fetch data from ORES APIs. Wikimedia is implementing a new Machine Learning (ML) service infrastructure that they call LiftWing. Given that ORES already has several ML models that have been well used, ORES is the first set of APIs that are being moved to LiftWing.

Access to the ORES API will require that you request an API access key.

To get your access token: you will need a Wikimedia user account to get access to LiftWing (the ML API service). You can either [create an account or login](https://api.wikimedia.org/w/index.php?title=Special:UserLogin&centralAuthAutologinTried=1&centralAuthError=Not+centrally+logged+in). If you have a Wikipedia user account, you might already have a Wikimedia account. If you are not sure, try your Wikipedia username and password. If you do not have a Wikimedia account, you will need to create one that you can use to get an access token.

This [guide](https://api.wikimedia.org/wiki/Authentication) provides detailed steps for generating an access token.

Note that when you create a Personal API token you are granted three items: a Client ID, a Client secret, and an Access token. You should save all three, because they are gone once you dismiss the box. If you lose any of them, you can destroy or deactivate the Personal API token from the dashboard and then create a new one.
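
Because the access token is a secret, one option is to keep it out of the notebook and read it from the environment at run time. This is a sketch under assumptions: the variable name `WIKIMEDIA_ACCESS_TOKEN` and the helper `load_access_token` are hypothetical choices, not something Wikimedia or this assignment prescribes.

```python
import os

def load_access_token(env_var="WIKIMEDIA_ACCESS_TOKEN"):
    """Read the access token from an environment variable (name is an assumption)."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable to your access token.")
    return token
```

The returned value could then be assigned to `ACCESS_TOKEN` in place of a hard-coded string.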

In [8]:
#########
#
#    CONSTANTS
#

#    The current LiftWing ORES API endpoint and prediction model
#
API_ORES_LIFTWING_ENDPOINT = "https://api.wikimedia.org/service/lw/inference/v1/models/{model_name}:predict"
API_ORES_EN_QUALITY_MODEL = "enwiki-articlequality"

#
#    The throttling rate is a function of the Access token that you are granted when you request the token. The constants
#    come from dissecting the token and getting the rate limits from the granted token. An example of that is below.
#
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = ((60.0*60.0)/5000.0)-API_LATENCY_ASSUMED  # The key authorizes 5000 requests per hour

#    When making automated requests we should include something that is unique to the person making the request
#    This should include an email - your UW email would be good to put in there
#    
#    Because all LiftWing API requests require some form of authentication, you need to provide your access token
#    as part of the header too
#
REQUEST_HEADER_TEMPLATE = {
    'User-Agent': "<manasars@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2024",
    'Content-Type': 'application/json',
    'Authorization': "Bearer {access_token}"
}
#
#    This is a template for the parameters that we need to supply in the headers of an API request
#
REQUEST_HEADER_PARAMS_TEMPLATE = {
    'email_address' : "manasars@uw.edu",           # your email address should go here
    'access_token'  :  ""     # the access token you create will need to go here
}

#
#    This is a template of the data required as a payload when making a scoring request of the ORES model
#
ORES_REQUEST_DATA_TEMPLATE = {
    "lang":        "en",     # required that its english - we're scoring English Wikipedia revisions
    "rev_id":      "",       # this request requires a revision id
    "features":    True
}

#
#    These are used later - defined here so they, at least, have empty values
#
EMAIL_ADDRESS = "manasars@uw.edu"
USERNAME = ""
ACCESS_TOKEN = ""
#

We will now define a function to make the ORES API request.

The API request is made through a function that encapsulates the call and makes it reusable in other notebooks. The function is parameterized, relying on the constants above for some important default parameters. The primary assumption is that it will be used to request data for a set of article revisions. The main parameter is 'article_revid'. One should be able to simply pass in a new article revision ID on each call and get back a Python dictionary as the result. A valid result is a dictionary that contains the probabilities that the specific revision belongs to each of six different article quality levels. Generally, the quality level with the highest probability is taken as the quality level of the article. This can be tricky when two (or more) quality levels are highly probable.

In [9]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_ores_score_per_article(article_revid = None, email_address=None, access_token=None,
                                   endpoint_url = API_ORES_LIFTWING_ENDPOINT, 
                                   model_name = API_ORES_EN_QUALITY_MODEL, 
                                   request_data = ORES_REQUEST_DATA_TEMPLATE, 
                                   header_format = REQUEST_HEADER_TEMPLATE, 
                                   header_params = REQUEST_HEADER_PARAMS_TEMPLATE):
    
    #    Make sure we have an article revision id, email and token
    #    This approach prioritizes the parameters passed in when making the call
    if article_revid:
        request_data['rev_id'] = article_revid
    if email_address:
        header_params['email_address'] = email_address
    if access_token:
        header_params['access_token'] = access_token
    
    #   Making a request requires a revision id - an email address - and the access token
    if not request_data['rev_id']:
        raise Exception("Must provide an article revision id (rev_id) to score articles")
    if not header_params['email_address']:
        raise Exception("Must provide an 'email_address' value")
    if not header_params['access_token']:
        raise Exception("Must provide an 'access_token' value")
    
    # Create the request URL with the specified model parameter - default is an article quality score request
    request_url = endpoint_url.format(model_name=model_name)
    
    # Create a compliant request header from the template and the supplied parameters
    headers = dict()
    for key in header_format.keys():
        headers[str(key)] = header_format[key].format(**header_params)
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free data
        # source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        #response = requests.get(request_url, headers=headers)
        response = requests.post(request_url, headers=headers, data=json.dumps(request_data))
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response
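
As noted above, choosing the most probable quality level can be tricky when two levels are close. The sketch below ranks a probability mapping and flags a near tie. It assumes the score carries a mapping from the six quality labels to probabilities; the helper `top_quality_levels`, the `tie_margin` value, and the sample numbers are all illustrative assumptions.

```python
def top_quality_levels(probabilities, tie_margin=0.05):
    """Return (best label, runner-up label, whether the two are within tie_margin)."""
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    (best_label, best_p), (second_label, second_p) = ranked[0], ranked[1]
    is_close = (best_p - second_p) < tie_margin
    return best_label, second_label, is_close

# Invented probabilities for illustration only
sample = {"B": 0.05, "C": 0.10, "FA": 0.01, "GA": 0.02, "Start": 0.40, "Stub": 0.42}
print(top_quality_levels(sample))
```

A flagged near tie could be logged for manual review rather than changing the recorded prediction.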

The function below fetches and processes the ORES scores for each article that has a valid last revision ID, using the ORES API function defined above.

For each article, it reads the last revision ID and uses it to request the article's quality prediction from the ORES API. Each prediction that is retrieved is saved to the intermediary CSV file for later use.
If the ORES request fails for an article, or if the last revision ID is not available, the article's title is recorded. The failed requests are used to calculate and display the error rate of the ORES score fetch.


In [10]:

def process_and_save_scores_data(file_path, email_address, access_token):
    """
    Get ORES scores for a list of article revision IDs and save them to a CSV file.

    Args:
        file_path (str): Path to the CSV file containing article page information.
        email_address (str): Your email address for the API request.
        access_token (str): Your access token for the API request.
    """
    
    failed_article_titles = []
    article_titles_missing_revid = []
    output_csv = 'intermediary_files/articles_ores_scores.csv'
    all_article_scores = []
    total_articles = 0

    try:
        with open(file_path, mode='r', newline='', encoding='utf-8') as csvfile:
            reader = csv.DictReader(csvfile)
            
            for row in reader:
                total_articles += 1 
                if 'lastrevid' in row and row['lastrevid'].strip():
                    rev_id = int(row['lastrevid'])
                    print(f"Requesting ORES score for revision ID: {rev_id}")
                    response = request_ores_score_per_article(article_revid=rev_id,
                                                    email_address=email_address,
                                                    access_token=access_token)
                    if response is not None:
                        # Initialize score data with revision ID and prediction
                        score_data = {
                            'pageid' : row['pageid'],
                            'title': row['title'],
                            'revision_id': rev_id,
                            'quality_prediction': response.get('enwiki', {}).get('scores', {}).get(str(rev_id), {}).get('articlequality', {}).get('score', {}).get('prediction')
                        }

                        all_article_scores.append(score_data)

                    else:
                        failed_article_titles.append(row['title'])
                
                else:
                    article_titles_missing_revid.append(row['title'])

        # Convert the list of scores to a DataFrame
        all_article_scores_df = pd.DataFrame(all_article_scores)

        # Save to CSV
        all_article_scores_df.to_csv(output_csv, index=False)
        print(f"Scores saved to {output_csv}")

        # Calculate and print error rate
        error_count = len(failed_article_titles)  # Count of failed articles
        error_rate = error_count / total_articles if total_articles > 0 else 0  # Avoid division by zero
        
        print("\nFailed to fetch ORES scores for below articles\n", failed_article_titles)

        print("\nLast revision id value is missing for these articles\n", article_titles_missing_revid)

        print(f"\nError rate: {error_rate:.2%} ({error_count} failed out of {total_articles})")

    except FileNotFoundError:
        print(f"Error: CSV file '{file_path}' not found. Please check the file path and try again.")
        raise
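
The error rate above counts only failed requests. If articles with a missing revision ID should also count as unscored, the rate could be computed as in this sketch. The helper `fetch_error_rate` is hypothetical, and whether to include the missing-ID articles is an analysis choice rather than something the original code does.

```python
def fetch_error_rate(total, failed, missing_revid):
    """Share of articles without a score, counting both failure categories."""
    if total == 0:
        return 0.0  # avoid division by zero on an empty input file
    return (failed + missing_revid) / total

print(f"{fetch_error_rate(100, 3, 5):.2%}")
```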


Call the function to fetch the scores and save them to the CSV file.

In [11]:

path_to_page_info_file = "intermediary_files/articles_page_info.csv"
process_and_save_scores_data(path_to_page_info_file, EMAIL_ADDRESS, ACCESS_TOKEN)

Requesting ORES score for revision ID: 1233202991
Requesting ORES score for revision ID: 1230459615
Requesting ORES score for revision ID: 1225661708
Requesting ORES score for revision ID: 1234741562
Requesting ORES score for revision ID: 1195651393
Requesting ORES score for revision ID: 1235521766
Requesting ORES score for revision ID: 1176429234
Requesting ORES score for revision ID: 1247931713
Requesting ORES score for revision ID: 1225385278
Requesting ORES score for revision ID: 1226326055
Requesting ORES score for revision ID: 1244521219
Requesting ORES score for revision ID: 1231655023
Requesting ORES score for revision ID: 1237694188
Requesting ORES score for revision ID: 1227635806
Requesting ORES score for revision ID: 1248505877
Requesting ORES score for revision ID: 1197443408
Requesting ORES score for revision ID: 1134129082
Requesting ORES score for revision ID: 1193992206
Requesting ORES score for revision ID: 988838315
Requesting ORES score for revision ID: 949986748
Re

For the analysis, the Wikipedia data and the population data will be merged. For this, both datasets need a column named "country". The first step is to derive the country column from the "Geography" column of the population file. Since regions are distinguished from countries by ALL CAPS text, we use that to separate region and country into two columns: each all-caps value is carried forward as the "Region" of the rows that follow it, and the all-caps rows are then dropped, leaving only country values in "Geography", which is renamed to "country".

In [127]:

population_df = pd.read_csv(POPULATION_FILE)

population_df['Region'] = ''

# Iterate through the rows to assign region values based on the 'Geography' column
current_region = ''
for index, row in population_df.iterrows():
    # Check if the 'Geography' value is in ALL CAPS (indicating a region)
    if row['Geography'].isupper():
        current_region = row['Geography']
    else:
        population_df.at[index, 'Region'] = current_region

population_df = population_df[~population_df['Geography'].str.isupper()]

# Rename the 'Geography' column to 'country'
population_df.rename(columns={'Geography': 'country'}, inplace=True)

population_df.reset_index(drop=True, inplace=True)
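
The same region assignment can be written without a row loop by masking and forward-filling. This is a sketch on made-up rows that only mimic the layout of the population file, with region rows in ALL CAPS followed by their countries.

```python
import pandas as pd

# Made-up rows; the names and numbers are not real data
toy = pd.DataFrame({
    "Geography": ["NORTHLAND", "Freedonia", "Sylvania", "SOUTHLAND", "Ruritania"],
    "Population": [3.0, 1.0, 2.0, 4.0, 4.0],
})

is_region = toy["Geography"].str.isupper()
# Keep the name only on region rows, then carry it forward to the country rows
toy["Region"] = toy["Geography"].where(is_region).ffill()
countries = (toy[~is_region]
             .rename(columns={"Geography": "country"})
             .reset_index(drop=True))
print(countries)
```

One difference from the loop: a country listed before any region would get a missing value here rather than an empty string.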



As the next step in the merge, we need a "country" column alongside the data in the [articles_ores_scores.csv]() file. Since this file is generated from the article list, we propagate the "country" column from [politicians_by_country_AUG.2024.csv]() all the way to the ORES scores by merging the files generated at each step of data processing. The first merge is between the input file and the page info CSV file. The second merge is between the result of the first merge and the ORES scores. This ensures we have the required "country" column. The final merge, used for our analysis, is between the Wikipedia ORES scores and the population data.

In the final merge, some countries may have no matching entry in the other dataset. We record these country names and store them in a text file, [wp_countries-no_match.txt]().


In [168]:

politicians_df = pd.read_csv(ARTICLE_LIST_FILE)
page_info_df = pd.read_csv(path_to_page_info_file)
path_to_scores_file = "intermediary_files/articles_ores_scores.csv"
page_score_df = pd.read_csv(path_to_scores_file)

# First merge
merged_df_1 = pd.merge(politicians_df, page_info_df, left_on='name', right_on='title', how='inner', indicator='merge_status_1')


# Second merge with a new indicator column name.
merged_df_2 = pd.merge(merged_df_1, page_score_df, left_on='lastrevid', right_on='revision_id', how='left', indicator='merge_status_2')

# Merge to obtain non matching records.
merged_df = pd.merge(merged_df_2, population_df, on='country', how='outer', indicator='merge_status_3')

no_match_wp = merged_df[merged_df['merge_status_3'] == 'left_only']['country'].unique()
no_match_pop = merged_df[merged_df['merge_status_3'] == 'right_only']['country'].unique()

# Combine no-match countries from both sides.
no_match_countries = set(no_match_wp) | set(no_match_pop)

# Write unmatched countries to a text file
with open("generated_files/wp_countries-no_match.txt", 'w') as f:
    for country in no_match_countries:
        f.write(f"{country}\n")

print(f"Unmatched countries saved to wp_countries-no_match.txt'.")

final_df = pd.merge(merged_df_2, population_df, on='country', how='inner', indicator='merge_status_3')
final_df = final_df[['country', 'Region', 'Population', 'title_x', 'revision_id', 'quality_prediction']]
final_df = final_df.rename(columns={'title_x': 'title'})
final_df.to_csv("generated_files/wp_politicians_by_country.csv", index=False)

print("\nMerged data saved into wp_politicians_by_country.csv.")


3
43
Unmatched countries saved to wp_countries-no_match.txt'.

Merged data saved into wp_politicians_by_country.csv.
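
The unmatched-country step relies on the `indicator` option of `pd.merge`, which labels each row as `left_only`, `right_only`, or `both`. A small sketch with invented country names shows how the two kinds of mismatch are separated.

```python
import pandas as pd

# Invented data for illustration only
articles = pd.DataFrame({
    "title": ["A", "B", "C"],
    "country": ["Freedonia", "Freedonia", "Sylvania"],
})
population = pd.DataFrame({
    "country": ["Freedonia", "Ruritania"],
    "Population": [1.0, 4.0],
})

merged = pd.merge(articles, population, on="country", how="outer", indicator="match")
# Countries with articles but no population row
only_articles = sorted(merged.loc[merged["match"] == "left_only", "country"].unique())
# Countries with a population row but no articles
only_population = sorted(merged.loc[merged["match"] == "right_only", "country"].unique())
print(only_articles, only_population)
```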


For the analysis, we are interested in total articles per capita and high-quality articles per capita. Articles with a predicted quality of "FA" or "GA" are considered high-quality articles.

The "Population" column is expressed in millions, so a value of 0.0 means a population too small to register at that precision, not a population of zero. These rows are kept, which is why their articles-per-capita value shows as inf in the table below. The table provides the "Top 10 countries by coverage": the 10 countries with the highest total articles per capita (in descending order).

In [147]:
# Group by country to get average population and total articles
country_population = final_df.groupby('country')['Population'].mean()
country_total_articles = final_df.groupby('country').size()

# Calculate articles per capita
country_total_articles_per_capita = country_total_articles / (country_population * 1e6)

top_10_countries_df = pd.DataFrame({
    'Population': country_population,
    'Total Articles': country_total_articles,
    'Articles per Capita': country_total_articles_per_capita
}).reset_index().sort_values('Articles per Capita', ascending=False)

# Get top 10 countries by articles per capita
top_10_countries = top_10_countries_df.head(10).reset_index(drop=True)

print("Top 10 Countries by Coverage:\n", top_10_countries)


Top 10 Countries by Coverage:
                           country  Population  Total Articles  Articles per Capita
0                          Monaco         0.0              10                  inf
1                          Tuvalu         0.0               1                  inf
2             Antigua and Barbuda         0.1              33             0.000330
3  Federated States of Micronesia         0.1              14             0.000140
4                Marshall Islands         0.1              13             0.000130
5                           Tonga         0.1              10             0.000100
6                        Barbados         0.3              25             0.000083
7                      Montenegro         0.6              36             0.000060
8                      Seychelles         0.1               6             0.000060
9                          Bhutan         0.8              44             0.000055
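
The `inf` values in the table come from dividing by a population recorded as 0.0. The sketch below reproduces that on invented numbers and shows one way to list the affected countries separately from the ranked ones. Whether to exclude them from the ranking is an analysis choice; the tables in this notebook keep them.

```python
import numpy as np
import pandas as pd

# Invented counts and populations (in millions)
counts = pd.Series({"Tinyland": 10, "Midland": 25})
population_millions = pd.Series({"Tinyland": 0.0, "Midland": 0.5})

per_capita = counts / (population_millions * 1e6)   # division by 0.0 yields inf
undefined = per_capita[np.isinf(per_capita)].index.tolist()
ranked = (per_capita.replace([np.inf, -np.inf], np.nan)
                    .dropna()
                    .sort_values(ascending=False))
print(undefined)
print(ranked)
```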


The table below provides the "Bottom 10 countries by coverage": the 10 countries with the lowest total articles per capita (in ascending order).

In [150]:
bottom_10_countries = top_10_countries_df.tail(10).sort_values('Articles per Capita').reset_index(drop=True)
print("Bottom 10 Countries by Coverage: \n", bottom_10_countries, "\n")



Bottom 10 Countries by Coverage: 
          country  Population  Total Articles  Articles per Capita
0          China      1411.3              16         1.133707e-08
1          India      1428.6             151         1.056979e-07
2          Ghana        34.1               4         1.173021e-07
3   Saudi Arabia        36.9               5         1.355014e-07
4         Zambia        20.2               3         1.485149e-07
5         Norway         5.5               1         1.818182e-07
6         Israel         9.8               2         2.040816e-07
7          Egypt       105.2              32         3.041825e-07
8  Cote d'Ivoire        30.9              10         3.236246e-07
9       Ethiopia       126.5              44         3.478261e-07 



The table below provides the "Top 10 countries by high quality": the 10 countries with the highest high-quality articles per capita (in descending order).

In [151]:
# Consider "FA" and "GA" as high-quality articles
high_quality_df = final_df[final_df['quality_prediction'].isin(["FA", "GA"])].copy()
country_population = high_quality_df.groupby('country')['Population'].mean()

country_high_quality_articles = high_quality_df.groupby('country').size()

# Calculate articles per capita
country_high_quality_articles_per_capita = country_high_quality_articles / (country_population * 1e6)

top_10_countries_quality_df = pd.DataFrame({
    'Population': country_population,
    'Total High Quality Articles': country_high_quality_articles,
    'High Quality Articles per Capita': country_high_quality_articles_per_capita
}).reset_index().sort_values('High Quality Articles per Capita', ascending=False)

# Get top 10 countries by high quality articles per capita
top_10_countries_quality = top_10_countries_quality_df.head(10).reset_index(drop=True)

print("Top 10 Countries by High Quality:\n", top_10_countries_quality, "\n")


Top 10 Countries by High Quality:
                  country  Population  Total High Quality Articles  High Quality Articles per Capita
0             Montenegro         0.6                            3                      5.000000e-06
1             Luxembourg         0.7                            2                      2.857143e-06
2                Albania         2.7                            7                      2.592593e-06
3                 Kosovo         1.7                            4                      2.352941e-06
4               Maldives         0.6                            1                      1.666667e-06
5              Lithuania         2.9                            4                      1.379310e-06
6                Croatia         3.8                            5                      1.315789e-06
7                 Guyana         0.8                            1                      1.250000e-06
8  Palestinian Territory         5.5                            6

The table below provides the "Bottom 10 countries by high quality": the 10 countries with the lowest high-quality articles per capita (in ascending order).

In [152]:
bottom_10_countries_quality = top_10_countries_quality_df.tail(10).sort_values('High Quality Articles per Capita').reset_index(drop=True)
print("Bottom 10 Countries by High Quality (High Quality Articles Per Capita):\n", bottom_10_countries_quality, "\n")


Bottom 10 Countries by High Quality (High Quality Articles Per Capita):
       country  Population  Total High Quality Articles  High Quality Articles per Capita
0  Bangladesh       173.5                            1                      5.763689e-09
1       Egypt       105.2                            1                      9.505703e-09
2    Ethiopia       126.5                            2                      1.581028e-08
3       Japan       124.5                            2                      1.606426e-08
4    Pakistan       240.5                            4                      1.663202e-08
5    Colombia        52.2                            1                      1.915709e-08
6    Congo DR       102.3                            2                      1.955034e-08
7     Vietnam        98.9                            2                      2.022245e-08
8      Uganda        48.6                            1                      2.057613e-08
9     Algeria        46.8            
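
Because the high-quality tables are built by filtering to "FA" and "GA" before grouping, a country with no such articles does not appear in them at all, so the bottom 10 lists only countries with at least one high-quality article. If zero counts are wanted, one option is to count a boolean flag per country instead, as in this sketch on invented data.

```python
import pandas as pd

# Invented data for illustration only
toy = pd.DataFrame({
    "country": ["Freedonia", "Freedonia", "Sylvania"],
    "quality_prediction": ["GA", "Stub", "Start"],
})

is_high = toy["quality_prediction"].isin(["FA", "GA"])
# Summing the boolean flag per country keeps countries whose count is zero
high_counts = is_high.groupby(toy["country"]).sum()
print(high_counts)
```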

We now do a similar analysis at the region level. For this, we obtain each region's population from the original input file.
The table shows "Geographic regions by total coverage": a rank-ordered list of geographic regions (in descending order) by total articles per capita.

In [169]:
# Fetch Region population from input csv file.
region_population_df = pd.read_csv(POPULATION_FILE)
region_population = region_population_df[region_population_df['Geography'].str.isupper()].groupby('Geography')['Population'].sum().reset_index()

# Calculate total articles per region
region_total_articles = final_df.groupby('Region').size().reset_index(name='Total Articles')

region_total_articles_df = pd.merge(region_population, region_total_articles, left_on='Geography', right_on='Region', how='right')
region_total_articles_df = region_total_articles_df[['Region', 'Population', 'Total Articles']]
region_total_articles_df['Articles Per Capita'] = region_total_articles_df['Total Articles'] / (region_total_articles_df['Population'] * 1e6)
region_total_articles_df = region_total_articles_df.sort_values(by='Articles Per Capita', ascending=False).reset_index(drop=True)

print("Geographic Regions by Total Coverage (Total Articles Per Capita):\n", region_total_articles_df, "\n")


Geographic Regions by Total Coverage (Total Articles Per Capita):
              Region  Population  Total Articles  Articles Per Capita
0   SOUTHERN EUROPE       152.0             797         5.243421e-06
1         CARIBBEAN        44.0             219         4.977273e-06
2    WESTERN EUROPE       199.0             498         2.502513e-06
3    EASTERN EUROPE       285.0             709         2.487719e-06
4      WESTERN ASIA       299.0             610         2.040134e-06
5   NORTHERN EUROPE       108.0             191         1.768519e-06
6   SOUTHERN AFRICA        70.0             123         1.757143e-06
7           OCEANIA        45.0              72         1.600000e-06
8    EASTERN AFRICA       483.0             665         1.376812e-06
9     SOUTH AMERICA       426.0             569         1.335681e-06
10     CENTRAL ASIA        80.0             106         1.325000e-06
11  NORTHERN AFRICA       256.0             302         1.179688e-06
12   WESTERN AFRICA       442.0     
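
The region-level calculation can be illustrated on a small toy frame. This is a minimal sketch under stated assumptions: the two region rows reuse the population and article figures printed above, the country rows are made up purely to show the upper-case filter, and the variable names are illustrative.

```python
import pandas as pd

# Toy stand-in for the population file. Region rows are upper-case,
# country rows are mixed-case. The country figures are made up.
population = pd.DataFrame({
    "Geography": ["SOUTHERN EUROPE", "Italy", "CARIBBEAN", "Cuba"],
    "Population": [152.0, 60.0, 44.0, 11.0],  # millions
})

# Toy stand-in for per-region article counts (values from the output above)
articles = pd.DataFrame({
    "Region": ["SOUTHERN EUROPE", "CARIBBEAN"],
    "Total Articles": [797, 219],
})

# Keep only the region rows: str.isupper() is True when every cased
# character in the string is upper-case
regions = population[population["Geography"].str.isupper()]

# Attach populations to the article counts, then convert millions to persons
merged = regions.merge(articles, left_on="Geography", right_on="Region", how="right")
merged["Articles Per Capita"] = merged["Total Articles"] / (merged["Population"] * 1e6)
print(merged[["Region", "Population", "Total Articles", "Articles Per Capita"]])
```

For Southern Europe this gives 797 / (152.0 × 1e6) ≈ 5.24e-06, matching the first row of the table above.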

The table generated shows "Geographic regions by high quality coverage": a rank-ordered list of geographic regions (in descending order) by high-quality articles per capita, where "high quality" means an ORES prediction of FA or GA.

In [170]:
# Calculate total high quality articles per region
region_high_quality_articles = final_df[final_df['quality_prediction'].isin(["FA", "GA"])].groupby('Region').size().reset_index(name='High Quality Articles')

region_high_quality_articles_df = pd.merge(region_population, region_high_quality_articles, left_on='Geography', right_on='Region', how='right')
region_high_quality_articles_df = region_high_quality_articles_df[['Region', 'Population', 'High Quality Articles']]
region_high_quality_articles_df['High Quality Articles Per Capita'] = region_high_quality_articles_df['High Quality Articles'] / (region_high_quality_articles_df['Population'] * 1e6)
region_high_quality_articles_df = region_high_quality_articles_df.sort_values(by='High Quality Articles Per Capita', ascending=False).reset_index(drop=True)

print("Geographic Regions by High Quality Coverage (High Quality Articles Per Capita):\n", region_high_quality_articles_df, "\n")

Geographic Regions by High Quality Coverage (High Quality Articles Per Capita):
              Region  Population  High Quality Articles  High Quality Articles Per Capita
0   SOUTHERN EUROPE       152.0                     53                      3.486842e-07
1         CARIBBEAN        44.0                      9                      2.045455e-07
2    EASTERN EUROPE       285.0                     38                      1.333333e-07
3   SOUTHERN AFRICA        70.0                      8                      1.142857e-07
4    WESTERN EUROPE       199.0                     21                      1.055276e-07
5      WESTERN ASIA       299.0                     27                      9.030100e-08
6   NORTHERN EUROPE       108.0                      9                      8.333333e-08
7   NORTHERN AFRICA       256.0                     17                      6.640625e-08
8      CENTRAL ASIA        80.0                      5                      6.250000e-08
9   CENTRAL AMERICA       182
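
Because the merges above use `how='right'`, a region that appears in the article data but not in the population data would be kept with a missing population, and its per-capita value would be `NaN`. One way to check for this is the `indicator=True` option of `pd.merge`, sketched below on toy data. The region named `UNLISTED REGION` is hypothetical and exists only to trigger the mismatch; the other figures are taken from the output above.

```python
import pandas as pd

# Toy region populations (millions), values from the output above
region_population = pd.DataFrame({
    "Geography": ["SOUTHERN EUROPE", "CARIBBEAN"],
    "Population": [152.0, 44.0],
})

# Toy high-quality counts; "UNLISTED REGION" is a hypothetical region
# that has no matching row in the population frame
region_counts = pd.DataFrame({
    "Region": ["SOUTHERN EUROPE", "CARIBBEAN", "UNLISTED REGION"],
    "High Quality Articles": [53, 9, 1],
})

# indicator=True adds a "_merge" column recording where each row came from
checked = pd.merge(
    region_population, region_counts,
    left_on="Geography", right_on="Region",
    how="right", indicator=True,
)

# Rows marked "right_only" had no population match
unmatched = checked.loc[checked["_merge"] == "right_only", "Region"].tolist()
print("Regions without a population match:", unmatched)
```

An empty list here would suggest every region in the article data found a population, so no per-capita values are silently missing.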