# Article Page Info MediaWiki API Example
This example illustrates how to access page info data using the [MediaWiki Action API for English Wikipedia](https://www.mediawiki.org/wiki/API:Main_page). It shows how to request summary 'page info' for a single article page. The API documentation, [API:Info](https://www.mediawiki.org/wiki/API:Info), covers additional details that may help when using or adapting this example.

## License
This code example was developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.0 - May 13, 2022



In [21]:
# 
# These are standard Python modules
import json, time, urllib.parse
#
# 'pandas' and 'requests' are not standard Python modules. You will need to install them with pip/pip3 if you do not already have them
import pandas as pd
import requests

The example relies on some constants that help make the code a bit more readable.

In [63]:
#########
#
#    CONSTANTS
#
# Reading relevant files into dataframe
politician_df = pd.read_csv('/Users/kirsteenng/Desktop/UW/DATA 512/data-512-homework_2/politicians_by_country_SEPT_2022.csv')
population_df = pd.read_csv('/Users/kirsteenng/Desktop/UW/DATA 512/data-512-homework_2/population_by_country_2022.csv')

# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<uwnetid@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2022',
}

# This is just a list of English Wikipedia article titles that we can use for example requests
ARTICLE_TITLES = politician_df['name']

# This is a string of additional page properties that can be returned; see the Info documentation for
# what can be included. If you don't want any, this can simply be the empty string
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
#PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making a page info request
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}
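
To see what these parameters amount to on the wire, the sketch below builds the query string that would be sent for one title. It is only an illustration: it repeats the endpoint and parameter values from the constants above so that it runs on its own, and it uses `urllib.parse.urlencode` rather than letting `requests` do the encoding, so the exact escaping that `requests` produces may differ slightly.

```python
import urllib.parse

# Values repeated from the constants above so this sketch runs on its own
endpoint = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "format": "json",
    "titles": "Shahjahan Noori",   # one title per request, as in the template
    "prop": "info",
    "inprop": "talkid|url|watched|watchers",
}

# urlencode percent-encodes the '|' separators and turns spaces into '+'
query_string = urllib.parse.urlencode(params)
example_url = f"{endpoint}?{query_string}"
print(example_url)
```

Pasting the printed URL into a browser is a quick way to inspect the raw JSON before writing any parsing code.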


The API request is made by a single, reusable procedure. It is parameterized, but relies on the constants above for the important parameters. The underlying assumption is that it will be used to request data for a set of article pages, so the parameter most likely to change is `article_title`.

In [19]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_pageinfo_per_article(article_title = None, 
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT, 
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):
    # Make sure we have an article title
    if not article_title: return None
    
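    # Copy the template so the shared module-level dictionary is not modified between calls
    request_template = dict(request_template)
    request_template['titles'] = article_title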
        
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        print(type(response))
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response
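
The JSON returned by this kind of query nests the page record under `query` and then `pages`, keyed by page ID. The sketch below shows one way to pull `lastrevid` out of such a response. The `sample_response` and `missing_response` dictionaries are hand-written for illustration (the page ID 12345 is a placeholder, not the real one, and the shape of the missing-page record is an assumption based on the API documentation), and `extract_lastrevid` is a hypothetical helper that is not part of the original notebook.

```python
def extract_lastrevid(json_response):
    """Return the lastrevid of the single page in a prop=info response, or None."""
    if not json_response:
        return None
    pages = json_response.get("query", {}).get("pages", {})
    for page in pages.values():
        # A page that does not exist has no 'lastrevid' key, so .get() yields None
        return page.get("lastrevid")
    return None

# Hand-written stand-in for a real response; the page ID 12345 is a placeholder
sample_response = {
    "query": {
        "pages": {
            "12345": {"pageid": 12345, "title": "Shahjahan Noori", "lastrevid": 1099689043}
        }
    }
}
# Assumed shape for a title with no article: key "-1" and a 'missing' marker
missing_response = {"query": {"pages": {"-1": {"title": "No such page", "missing": ""}}}}

print(extract_lastrevid(sample_response))
print(extract_lastrevid(missing_response))
print(extract_lastrevid(None))
```

Returning `None` for every failure case keeps the caller free to choose its own sentinel value.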


In [65]:
ARTICLE_TITLES[0]

'Shahjahan Noori'

In [62]:
merged = pd.merge(politician_df, population_df, left_on= 'country', right_on = 'Geography', how = 'inner')
merged.head()

| Unnamed: 0 | name | url | country | Geography | Population (millions) | Region |
|---|---|---|---|---|---|---|
| 0 | Shahjahan Noori | https://en.wikipedia.org/wiki/Shahjahan_Noori | Afghanistan | Afghanistan | 41.1 | SOUTH ASIA |
| 1 | Abdul Ghafar Lakanwal | https://en.wikipedia.org/wiki/Abdul_Ghafar_Lak... | Afghanistan | Afghanistan | 41.1 | SOUTH ASIA |
| 2 | Majah Ha Adrif | https://en.wikipedia.org/wiki/Majah_Ha_Adrif | Afghanistan | Afghanistan | 41.1 | SOUTH ASIA |
| 3 | Haroon al-Afghani | https://en.wikipedia.org/wiki/Haroon_al-Afghani | Afghanistan | Afghanistan | 41.1 | SOUTH ASIA |
| 4 | Tayyab Agha | https://en.wikipedia.org/wiki/Tayyab_Agha | Afghanistan | Afghanistan | 41.1 | SOUTH ASIA |
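
Because the merge uses `how = 'inner'`, any politician whose `country` has no exact match in `Geography` is dropped without warning, and so is any country with no politicians. The sketch below shows one way to surface those rows with an outer merge and the pandas `indicator` flag. The two small dataframes are made-up stand-ins for the real CSV files; only the column names come from the notebook.

```python
import pandas as pd

# Toy stand-ins for politician_df and population_df (values are illustrative)
politicians = pd.DataFrame({
    "name": ["Person A", "Person B", "Person C"],
    "country": ["Afghanistan", "Afghanistan", "Atlantis"],
})
populations = pd.DataFrame({
    "Geography": ["Afghanistan", "Albania"],
    "Population (millions)": [41.1, 2.8],
})

# An outer merge keeps every row; '_merge' records which side each row came from
check = pd.merge(politicians, populations, left_on="country",
                 right_on="Geography", how="outer", indicator=True)

# Politicians whose country has no population row
unmatched_politicians = check.loc[check["_merge"] == "left_only", "name"].tolist()
# Countries in the population table with no politicians
unmatched_countries = check.loc[check["_merge"] == "right_only", "Geography"].tolist()

print(unmatched_politicians)
print(unmatched_countries)
```

Running the same check on the real files would show how many rows the inner merge leaves out.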


In [77]:

# Collect the last revision ID for each article title; 0 marks a failed or missing lookup
revid = []
for title in ARTICLE_TITLES:
    info = request_pageinfo_per_article(title)
    try:
        # 'pages' is keyed by page ID; each request carries one title, so take the only entry
        page = next(iter(info['query']['pages'].values()))
        revid.append(page['lastrevid'])
    except Exception as e:
        print(e)
        revid.append(0)


<class 'requests.models.Response'>
... (the same line is printed once per request; output truncated)

In [78]:
politician_df['revision_id'] = revid
politician_df.head()

| Unnamed: 0 | name | url | country | revision_id |
|---|---|---|---|---|
| 0 | Shahjahan Noori | https://en.wikipedia.org/wiki/Shahjahan_Noori | Afghanistan | 1099689043 |
| 1 | Abdul Ghafar Lakanwal | https://en.wikipedia.org/wiki/Abdul_Ghafar_Lak... | Afghanistan | 943562276 |
| 2 | Majah Ha Adrif | https://en.wikipedia.org/wiki/Majah_Ha_Adrif | Afghanistan | 852404094 |
| 3 | Haroon al-Afghani | https://en.wikipedia.org/wiki/Haroon_al-Afghani | Afghanistan | 1095102390 |
| 4 | Tayyab Agha | https://en.wikipedia.org/wiki/Tayyab_Agha | Afghanistan | 1104998382 |
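
The loop above records 0 whenever a lookup fails, so those rows can be listed afterwards and checked by hand. A minimal sketch, using a small made-up dataframe in place of `politician_df` (the names are placeholders):

```python
import pandas as pd

# Stand-in for politician_df after the revision_id column has been added
df = pd.DataFrame({
    "name": ["Person A", "Person B", "Person C"],
    "revision_id": [1099689043, 0, 943562276],
})

# A revision_id of 0 is the sentinel the loop appends when a request or lookup fails
failed = df[df["revision_id"] == 0]
print(f"{len(failed)} of {len(df)} titles have no revision ID")
print(failed["name"].tolist())
```

Titles that show up here are worth comparing against the `url` column, since a mismatch between the stored name and the actual article title is one plausible cause.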


In [80]:
len(politician_df)

7584
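
With 7,584 titles, the throttle constants put a floor on how long the request loop takes. The sketch below works through the arithmetic; it repeats the constants so that it runs on its own, and it counts only the deliberate sleep, not the real network time, which will usually be larger.

```python
API_LATENCY_ASSUMED = 0.002                              # seconds, as assumed in the constants
API_THROTTLE_WAIT = (1.0 / 100.0) - API_LATENCY_ASSUMED  # about 0.008 s per request
n_requests = 7584                                        # len(politician_df)

# Total time spent sleeping across all requests
total_sleep = n_requests * API_THROTTLE_WAIT
# Request rate if every request really took API_LATENCY_ASSUMED seconds
max_rate = 1.0 / (API_THROTTLE_WAIT + API_LATENCY_ASSUMED)

print(f"sleep per request: {API_THROTTLE_WAIT:.3f} s")
print(f"total sleep: {total_sleep:.1f} s")
print(f"rate ceiling: {max_rate:.0f} requests/s")
```

The sleep alone adds about a minute to the full run, and the constants cap the loop at roughly 100 requests per second under the assumed latency.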