# Article Page Info MediaWiki API Example
This example illustrates how to access page info data using the [MediaWiki Action API for the English Wikipedia](https://www.mediawiki.org/wiki/API:Main_page). It shows how to request summary 'page info' for a single article page, and then reuses that request to collect the last revision ID for a list of politician articles. The API documentation, [API:Info](https://www.mediawiki.org/wiki/API:Info), covers additional details that may help when using or adapting this example.

## License
This code example was developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.2 - September 16, 2024



In [19]:
# 
# These are standard Python modules
import json, time, urllib.parse
#
# The modules below are not part of the Python standard library. Install them with pip/pip3 if you do not already have them
import requests
import pandas as pd
import numpy as np

The example relies on some constants that help make the code a bit more readable.

In [20]:
#########
#
#    CONSTANTS
#

# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"
API_HEADER_AGENT = 'User-Agent'

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
# Waiting this long before each request keeps the rate at or below roughly 100 requests per second
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<es2@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2024'
}

# This is just a list of English Wikipedia article titles that we can use for example requests
ARTICLE_TITLES = [ 'Bison', 'Northern flicker', 'Red squirrel', 'Chinook salmon', 'Horseshoe bat' ]

# This is a '|'-separated string of additional page properties to return. See the API:Info documentation
# for what can be included. If you don't want any, this can simply be the empty string
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
#PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making a page info request
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}
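
As a quick check of what these constants produce, the following sketch computes the throttle wait and shows how a parameter template like the one above is encoded into a request URL. It makes no network call. The local names (`endpoint`, `latency_assumed`, `throttle_wait`, `params`) are illustrative stand-ins that mirror the constants above, not additional settings.

```python
import urllib.parse

# Stand-ins mirroring the notebook constants (illustrative copies only)
endpoint = "https://en.wikipedia.org/w/api.php"
latency_assumed = 0.002
throttle_wait = (1.0 / 100.0) - latency_assumed   # seconds to sleep before each request

params = {
    "action": "query",
    "format": "json",
    "titles": "Northern flicker",
    "prop": "info",
    "inprop": "talkid|url|watched|watchers",
}

# urlencode writes the space as '+' and each '|' separator as '%7C'
request_url = endpoint + "?" + urllib.parse.urlencode(params)

print(f"sleep per request: {throttle_wait:.3f} s")
print(request_url)
```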


# Data collection

The API request will be made using one procedure. The idea is to make this reusable. The procedure is parameterized, but relies on the constants above for the important parameters. The underlying assumption is that this will be used to request data for a set of article pages. Therefore the parameter most likely to change is the article_title.

In [21]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_pageinfo_per_article(article_title = None, 
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT, 
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):
    
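    # the article title can be passed as a parameter to the call or set in the request_template;
    # copy the template first so the shared default dictionary is not modified between calls
    request_template = dict(request_template)
    if article_title:
        request_template['titles'] = article_title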

    if not request_template['titles']:
        raise Exception("Must supply an article title to make a pageinfo request.")

    if API_HEADER_AGENT not in headers:
        raise Exception(f"The header data should include a '{API_HEADER_AGENT}' field that contains your UW email address.")

    if 'uwnetid@uw' in headers[API_HEADER_AGENT]:
        raise Exception(f"Use your UW email address in the '{API_HEADER_AGENT}' field.")

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response
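
The JSON returned by a `prop=info` query holds one record per page under `query`, then `pages`, keyed by page ID. A title with no article comes back under a negative key with a `missing` field and no `lastrevid`. The sketch below pulls the title and revision ID out of a response of that shape. The sample dictionary is hand-written for illustration with made-up ID values, and `extract_lastrevids` is a hypothetical helper, not part of the notebook.

```python
def extract_lastrevids(json_response):
    """Return {title: lastrevid or None} from a prop=info style response."""
    if not json_response:
        return {}
    pages = json_response.get("query", {}).get("pages", {})
    return {page.get("title"): page.get("lastrevid") for page in pages.values()}

# Hand-written sample in the shape of a response; the numeric values are made up
sample_response = {
    "query": {
        "pages": {
            "1001": {"pageid": 1001, "ns": 0, "title": "Bison", "lastrevid": 123456},
            "-1": {"ns": 0, "title": "No such article", "missing": ""},
        }
    }
}

revids = extract_lastrevids(sample_response)
print(revids)
```

Returning `None` for a missing page keeps the title in the result, so the caller can decide whether to drop it or report it.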



The function get_revID_for_articles takes the name of a CSV file in the data/ folder (without the .csv extension) that contains politician names, and returns the results of the Wikipedia API calls as a dataframe. If full_dataframe is True, it returns every field the API provided; if it is False, it returns only the article title and the last revision ID. If remove_missing_rev_id is True, articles without a last revision ID are filtered out; if it is False, they are kept with a NaN value.

In [45]:
def get_revID_for_articles(csv_loc, 
                           full_dataframe = False, 
                           remove_missing_rev_id = True):
    """
    Given a CSV of Wikipedia article titles, output the results from the Wikipedia API calls
    Parameters:
        - csv_loc: Name of the CSV file in the data/ folder, without the .csv extension.
                   The file must have a 'name' column holding the article titles.
        - full_dataframe: Determines whether the full results from the API call (True) are returned 
                            or whether it's just the article title and last revision ID (False) 
        - remove_missing_rev_id: Determines whether to filter out articles with no revision ID (True) or leave them in as NaN (False)
    Returns a dataframe of results
    """
    # load the data in
    politicians_by_country = pd.read_csv(f"data/{csv_loc}.csv")

    if politicians_by_country.isna().any().any():  # Check if there are any NA values in the DataFrame
        na_rows = politicians_by_country[politicians_by_country.isna().any(axis=1)]  # Get rows with any NA values
        num_na_rows = len(na_rows)  # Count those rows
        print(f"There are {num_na_rows} rows with missing values. They are listed below:")
        print(na_rows)
        print("They've been omitted")
    
    politician_list = politicians_by_country.dropna()["name"].tolist()
    unique_politician_list = list(set(politician_list))
    num_duplicates = len(politician_list) - len(unique_politician_list)
    print(f"There are {num_duplicates} duplicate politicians. This does not affect the revision ID collection")

    rows = []

    # Process articles in batches
    for i in range(0, len(unique_politician_list), 50):  # Example: process 50 articles at a time
        batch_titles = "|".join(unique_politician_list[i:i+50])  # Join article titles with '|'
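        article_info = request_pageinfo_per_article(batch_titles)  # Call the function with the batched titles

        # The request function returns None when the call fails, so skip that batch rather than crash
        if not article_info or "query" not in article_info:
            print(f"Request failed for the batch starting at index {i}; skipping it")
            continue

        # Each entry under 'pages' is one article's info; copy it as a row
        for page_info in article_info["query"]["pages"].values():
            rows.append(dict(page_info))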

    # Create a DataFrame from the list of rows
    df = pd.DataFrame(rows)

    if remove_missing_rev_id:
        removed_rows = df[df['lastrevid'].isna()]
        initial_row_count = df.shape[0]
        df.dropna(subset=['lastrevid'], inplace=True)
        final_row_count = df.shape[0]
        removed_row_count = initial_row_count - final_row_count
        pct_missing = removed_row_count / len(unique_politician_list)
        print("Number of rows removed due to missing last_revisionID:", removed_row_count)
        print("Percentage of politicians missing last_revisionID:", pct_missing)
        if not removed_row_count == 0:
            print("Politicians removed:", removed_rows["title"].values.tolist())

    
    # Convert lastrevid to a nullable integer so rows kept with a missing revision ID do not raise an error
    df["lastrevid"] = df["lastrevid"].astype("Int64")

    if full_dataframe:
        return df  # Return the DataFrame
    else:
        return df[["title", "lastrevid"]]
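
The batching step in the loop above can be isolated as a small helper. This is a sketch that assumes 50 titles per request, the value the loop uses, is the limit that applies to your client. `batch_titles` and the example titles are hypothetical names introduced here for illustration.

```python
def batch_titles(titles, batch_size=50):
    """Yield '|'-joined strings of at most batch_size titles each."""
    for start in range(0, len(titles), batch_size):
        yield "|".join(titles[start:start + batch_size])

# Illustrative titles only
example_titles = [f"Article {n}" for n in range(120)]
batches = list(batch_titles(example_titles))

# Number of batches, then the number of titles in each batch
print(len(batches), [b.count("|") + 1 for b in batches])
```

Because a `set` does not preserve order, sorting the unique titles before batching is one way to make the sequence of requests reproducible between runs.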
    

The function call below passes the name of the CSV with the politician article names and specifies that we only want the article title and the last revision ID. A researcher can switch to the full dataframe with a simple adjustment of the call below. We also filter out missing revision IDs.

The politicians with missing revision IDs are listed below, along with the share of titles they represent. That share is printed as a fraction (about 0.11%) even though the label says percentage. Because so few politicians are missing revision IDs, we omit them from our final results.

In [46]:
# full_dataframe=True returns the full dataframe with all columns
# remove_missing_rev_id=True takes out article titles that don't have a revision ID
politician_revid = get_revID_for_articles("politicians_by_country_AUG.2024", full_dataframe=False, remove_missing_rev_id=True)

There are 44 duplicate politicians. This does not affect the revision ID collection
Number of rows removed due to missing last_revisionID: 8
Percentage of politicians missing last_revisionID: 0.0011250175783996624
Politicians removed: ["Segun ''Aeroland'' Adewale", 'Tomás Pimentel', 'Richard Sumah', 'Mehrali Gasimov', 'Bashir Bililiqo', 'Kyaw Myint', 'André Ngongang Ouandji', 'Barbara Eibinger-Miedl']
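
The value reported above is a fraction rather than a percentage. As a worked check using only the two numbers printed in the output, the following converts it to a percentage and recovers the number of unique titles that were queried. The variable names are illustrative.

```python
# Values copied from the printed output above
removed_row_count = 8
fraction_missing = 0.0011250175783996624

percent_missing = fraction_missing * 100
# The fraction was computed as removed rows / unique titles, so invert it
unique_titles = round(removed_row_count / fraction_missing)

print(f"{percent_missing:.2f}% of {unique_titles} unique titles had no revision ID")
```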


The cell below just saves the dataframe as a CSV to the location of choice. A researcher could adjust this depending on their desired save location. 

In [48]:
# saves the dataframe - adjust based on where you want the final dataframe
politician_revid.to_csv("results/politician_revid.csv", index=False)
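
`to_csv` does not create missing folders, so the save fails if `results/` does not exist yet. A minimal sketch of a guard, assuming a hypothetical helper named `ensure_parent_dir`:

```python
from pathlib import Path

def ensure_parent_dir(file_path):
    """Create the parent folder of file_path if needed and return it as a Path."""
    path = Path(file_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path

# Usage with the notebook's dataframe would look like:
# politician_revid.to_csv(ensure_parent_dir("results/politician_revid.csv"), index=False)
```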