# Article Page Info MediaWiki API Example
The following code accesses page info data using the [MediaWiki Action API for the English Wikipedia](https://www.mediawiki.org/wiki/API:Main_page). It requests summary 'page info' for multiple article pages. The API documentation, [API:Info](https://www.mediawiki.org/wiki/API:Info), covers additional details that may be helpful when trying to use or understand this example.

## License
This code was developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.1 - August 14, 2023. Please note that it has been slightly modified to request info for multiple pages at a time.

In [1]:
# 
# These are standard python modules
import json, time, urllib.parse
#
# The 'requests' module is not a standard Python module. You will need to install this with pip/pip3 if you do not already have it
import requests
import pandas as pd 
import numpy as np

In [2]:
df = pd.read_csv('us_cities_by_state_SEPT.2023.csv')
df.shape

(22157, 3)

Some rows shared the same page_title, so the duplicates were dropped to reduce redundancy and scraping time.

In [3]:
df = df.drop_duplicates(subset=['page_title'])
df.shape

(21519, 3)

The following two steps confirm that no duplicates remain. The empty list that is returned shows this.

In [4]:
temp = df['page_title'].value_counts()

In [5]:
values_to_extract = temp[temp > 1].index.tolist()
values_to_extract

[]
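
As a hedged aside, the same duplicate check can be sketched with `DataFrame.duplicated`. The small frame below is made-up illustration data, not rows from the CSV file, and the variable names are assumptions introduced for this example.

```python
import pandas as pd

# Illustrative data only; these rows are not taken from the notebook's CSV file
sample = pd.DataFrame({
    "state": ["Alabama", "Alabama", "Alabama"],
    "page_title": ["Abbeville, Alabama", "Abbeville, Alabama", "Addison, Alabama"],
})

# keep=False marks every row whose page_title occurs more than once
dupes_before = sample[sample.duplicated(subset=["page_title"], keep=False)]

# drop_duplicates keeps the first occurrence of each page_title by default
deduped = sample.drop_duplicates(subset=["page_title"])
dupes_after = deduped[deduped.duplicated(subset=["page_title"], keep=False)]

print(len(dupes_before), len(deduped), len(dupes_after))
```

An empty `dupes_after` frame plays the same role as the empty list above.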

The page titles are extracted so they can be passed to the API.

In [6]:
titles = df['page_title'].values

In [7]:
#########
#
#    CONSTANTS
#

# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<mdn27@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2023',
}

# This is just a list of English Wikipedia article titles that we can use for example requests
#ARTICLE_TITLES = [ 'Bison', 'Northern flicker', 'Red squirrel', 'Chinook salmon', 'Horseshoe bat' ]
ARTICLE_TITLES = ['Abbeville, Alabama', 'Adamsville, Alabama', 'Addison, Alabama']

# This is a string of additional page properties that can be returned see the Info documentation for
# what can be included. If you don't want any this can simply be the empty string
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
#PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making this
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # one page title, or several titles joined with '|' (up to 50 per request for standard clients)
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}
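
To make the role of these parameters concrete, here is a minimal sketch of roughly the URL that a request with two titles would produce. It uses `urllib.parse.urlencode` from the standard library for illustration; the `requests` module does its own encoding, so the exact string it sends may differ slightly. The variable names `endpoint`, `params`, `query_string` and `example_url` are assumptions for this example.

```python
import urllib.parse

# Values copied from the constants above; the two titles are joined with '|'
endpoint = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "format": "json",
    "titles": "Abbeville, Alabama|Adamsville, Alabama",
    "prop": "info",
    "inprop": "talkid|url|watched|watchers",
}

# urlencode percent-encodes reserved characters:
# '|' becomes %7C, ',' becomes %2C, and a space becomes '+'
query_string = urllib.parse.urlencode(params)
example_url = f"{endpoint}?{query_string}"
print(example_url)
```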


In [8]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_pageinfo_per_article(article_title = None, 
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT, 
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):
    
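    # article title can be as a parameter to the call or in the request_template
    # work on a copy so the shared template dictionary is not modified between calls
    request_template = dict(request_template)
    if article_title:
        request_template['titles'] = article_title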

    if not request_template['titles']:
        raise Exception("Must supply an article title to make a pageinfo request.")

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response
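
The function above sleeps for a fixed `API_THROTTLE_WAIT` of (1/100) - 0.002 = 0.008 seconds before every request, which aims at no more than about 100 requests per second. A possible variation, sketched below, measures the time since the previous call and sleeps only for the remainder. The class `MinIntervalThrottle` is a hypothetical helper written for this example, not part of the original code or of any library.

```python
import time

API_LATENCY_ASSUMED = 0.002
API_THROTTLE_WAIT = (1.0 / 100.0) - API_LATENCY_ASSUMED   # 0.008 seconds

class MinIntervalThrottle:
    """Hypothetical helper: block until min_interval seconds have passed since the last call."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last_call = None

    def wait(self):
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()

throttle = MinIntervalThrottle(API_THROTTLE_WAIT)
start = time.monotonic()
for _ in range(5):
    throttle.wait()      # a real loop would make the API request right after this
elapsed = time.monotonic() - start
print(f"5 throttled calls took {elapsed:.3f} s")
```

The first call does not wait, so five calls include four enforced gaps.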


In [9]:
#print(f"Getting page info data for: {ARTICLE_TITLES[2]}")
#info = request_pageinfo_per_article(ARTICLE_TITLES[2])
#print(json.dumps(info,indent=4))
#titles

To reduce the number of requests and the total run time, page info is requested for 50 articles at a time, which is the maximum number of titles the API accepts in a single query for standard clients.

In [10]:
chunk_size = 50
page_chunks = [titles[i:i + chunk_size] for i in range(0, len(titles), chunk_size)]
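
The slicing pattern above can be checked on a small list. This is a sketch with made-up titles and a chunk size of 3 so the result is easy to read; the names `sample_titles`, `chunks` and `expected_chunks` are assumptions for this example.

```python
# Illustrative titles; the notebook's list comes from the CSV file
sample_titles = [f"Title {n}" for n in range(1, 8)]   # 7 titles
chunk_size = 3

# Same list comprehension as above: the last chunk holds whatever is left over
chunks = [sample_titles[i:i + chunk_size] for i in range(0, len(sample_titles), chunk_size)]
print([len(c) for c in chunks])     # [3, 3, 1]
print("|".join(chunks[0]))          # Title 1|Title 2|Title 3

# Ceiling division gives the number of chunks without building them:
# 21519 titles in chunks of 50
expected_chunks = -(-21519 // 50)
print(expected_chunks)              # 431
```

The value 431 agrees with the number of iterations reported by the request loop.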

In [11]:
final_list = []
for i, chunk in enumerate(page_chunks, start=1):
    # Join the page titles into a pipe-separated string
    request_info = PAGEINFO_PARAMS_TEMPLATE.copy()
    request_info['titles'] = "|".join(chunk)
    info = request_pageinfo_per_article(request_template=request_info)
    # The request function returns None on failure, so skip chunks without page data
    if not info or 'query' not in info:
        print("iteration number", i, "failed - no page data returned")
        continue
    final_list.append(info['query']['pages'])
    print("iteration number", i)

iteration number 1
iteration number 2
iteration number 3
...
iteration number 429
iteration number 430
iteration number 431


In [12]:
#final_list[0:5]

The extracted records were saved as a JSON file so that the raw data can be reloaded without repeating the entire scrape.

In [13]:
# Specify the file path where you want to save the JSON data
file_path = "record.json"

# Open the file in write mode and use json.dump() to write the list to the file
with open(file_path, "w") as json_file:
    json.dump(final_list, json_file, indent=4)

In [14]:
# Specify the path to your JSON file
file_path = 'record.json'  # Replace with the actual path

# Open and read the JSON file
with open(file_path, 'r') as file:
    json_data = json.load(file)
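
A quick way to gain confidence in this save-and-reload step is a round-trip check. The sketch below writes a small stand-in list to a temporary directory and compares what comes back; `sample_list` is illustrative data in the same nested shape as `final_list`, not real API output.

```python
import json
import os
import tempfile

# Illustrative stand-in for final_list: one dictionary of pages per chunk
sample_list = [{"104730": {"pageid": 104730, "title": "Abbeville, Alabama"}}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "record.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample_list, f, indent=4)
    with open(path, "r", encoding="utf-8") as f:
        reloaded = json.load(f)

# JSON object keys are always strings, so the string page-id keys come back unchanged
print(reloaded == sample_list)
```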

The data was reshaped so that each article becomes one row in the resulting dataframe.

In [15]:
final_record = []
for pages in json_data:
    # each chunk is a dictionary keyed by page id; keep only the per-page records
    final_record.extend(pages.values())
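
The flattening step can be illustrated on a hand-written sample in the shape of the API's `pages` object. The sketch also separates out titles the API could not find: as far as I understand the default JSON format, such entries are keyed by a negative number and carry a `missing` key instead of a `pageid`. That behaviour is an assumption here, the sample is not real response data, and the column list shown for this run suggests no such entries occurred.

```python
# Hand-written sample in the shape of the API's "pages" object; not real response data
sample_chunks = [
    {"104730": {"pageid": 104730, "ns": 0, "title": "Abbeville, Alabama"},
     "104761": {"pageid": 104761, "ns": 0, "title": "Adamsville, Alabama"}},
    {"-1": {"ns": 0, "title": "Hypothetical missing page", "missing": ""}},
]

# One record per page, regardless of which chunk it came from
records = [page for pages in sample_chunks for page in pages.values()]

# Assumed convention: pages that do not exist carry a "missing" key
found = [p for p in records if "missing" not in p]
missing_titles = [p["title"] for p in records if "missing" in p]

print(len(records), len(found), missing_titles)
```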

In [16]:
df_new = pd.DataFrame(final_record)
df_new.head()

| | pageid | ns | title | contentmodel | pagelanguage | pagelanguagehtmlcode | pagelanguagedir | touched | lastrevid | length | talkid | fullurl | editurl | canonicalurl | watchers | redirect | new |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 104730 | 0 | Abbeville, Alabama | wikitext | en | en | ltr | 2023-10-10T22:35:37Z | 1171163550 | 24706 | 281244.0 | https://en.wikipedia.org/wiki/Abbeville,_Alabama | https://en.wikipedia.org/w/index.php?title=Abb... | https://en.wikipedia.org/wiki/Abbeville,_Alabama | | | |
| 1 | 104761 | 0 | Adamsville, Alabama | wikitext | en | en | ltr | 2023-10-10T22:35:37Z | 1177621427 | 18040 | 281272.0 | https://en.wikipedia.org/wiki/Adamsville,_Alabama | https://en.wikipedia.org/w/index.php?title=Ada... | https://en.wikipedia.org/wiki/Adamsville,_Alabama | | | |
| 2 | 105188 | 0 | Addison, Alabama | wikitext | en | en | ltr | 2023-10-10T22:35:37Z | 1168359898 | 13309 | 281517.0 | https://en.wikipedia.org/wiki/Addison,_Alabama | https://en.wikipedia.org/w/index.php?title=Add... | https://en.wikipedia.org/wiki/Addison,_Alabama | | | |
| 3 | 104726 | 0 | Akron, Alabama | wikitext | en | en | ltr | 2023-10-10T22:35:37Z | 1165909508 | 11710 | 281240.0 | https://en.wikipedia.org/wiki/Akron,_Alabama | https://en.wikipedia.org/w/index.php?title=Akr... | https://en.wikipedia.org/wiki/Akron,_Alabama | | | |
| 4 | 105109 | 0 | Alabaster, Alabama | wikitext | en | en | ltr | 2023-10-10T22:35:37Z | 1179139816 | 20343 | 281444.0 | https://en.wikipedia.org/wiki/Alabaster,_Alabama | https://en.wikipedia.org/w/index.php?title=Ala... | https://en.wikipedia.org/wiki/Alabaster,_Alabama | | | |


In [17]:
df_new.shape

(21519, 17)

The output was saved to a CSV file.

In [18]:
df_new.to_csv('page_record.csv')
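
One detail worth knowing when this file is read back: by default `to_csv` also writes the dataframe index as an unnamed first column. The sketch below, using a small made-up frame and in-memory buffers instead of files, compares the default with `index=False`. Whether to keep the index is a design choice; this is only an illustration, not a change to the step above.

```python
import io
import pandas as pd

# Illustrative frame; not the notebook's full page info table
sample = pd.DataFrame({"pageid": [104730, 104761],
                       "title": ["Abbeville, Alabama", "Adamsville, Alabama"]})

with_index = io.StringIO()
sample.to_csv(with_index)                    # default: index written as an unnamed first column
without_index = io.StringIO()
sample.to_csv(without_index, index=False)    # index omitted

cols_with = pd.read_csv(io.StringIO(with_index.getvalue())).columns.tolist()
cols_without = pd.read_csv(io.StringIO(without_index.getvalue())).columns.tolist()
print(cols_with)       # ['Unnamed: 0', 'pageid', 'title']
print(cols_without)    # ['pageid', 'title']
```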