# Data Collection: Wikimedia PageInfo API Call

The functionality to call the Wikimedia API as a runnable script is in the Python file src/libs/page_info_api_request.py. Example command-line invocations with custom arguments are in src/libs/run_scripts.sh. The following section of the notebook walks through the steps in the script and explains each in detail.

## Step 1: Processing Command Line Arguments, Reading in Files, and Setting Up Logging

The following imports are required for this script: 

In [None]:
import json, urllib, time
import pandas as pd
import sys
import logging
import datetime
import requests 
import os

From the command line, the script is generally called as follows:

```bash
# move into the scripts directory
cd {local_machine_scripts_directory}
# run the page info API requests
python3 page_info_api_request.py ${intermediate_data_directory} ${pol_csv_file_name} ${revid_out_file_name}
```

Here, the local machine scripts directory is the absolute path to /src/libs/ on the local machine. The script executes API calls against the Wikimedia page info API and writes the output to a csv file.

Inputs to this script include: 
- the intermediate local data directory, where the input csv file is located and where the output csv file will be written
- the name of the csv file containing the article names to request data for
- the name of the output file to write

Accordingly, the first few lines of the script process these command line arguments, assign them to variables, read in the necessary files, and set up logging so that users of this code can track exactly which requests succeed, which fail, and how long the script takes to run.

In [None]:
start = datetime.datetime.now()
#set up logging 
logging.basicConfig(level=logging.INFO, 
                    format='%(asctime)s - %(levelname)s - %(message)s',
                    filename=f"../../logs/pageinfo_api_requests.log")

#assign data directory and input/output file name variables 
data_dir = str(sys.argv[1])
csv_file_name = str(sys.argv[2])
out_file_name = str(sys.argv[3])
#join file paths
out_file = os.path.join(data_dir, out_file_name)
csv_file = pd.read_csv(os.path.join(data_dir, csv_file_name))
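
As an alternative to reading `sys.argv` by position, the same three arguments could be parsed with `argparse`, which adds a usage message and a clear error when an argument is missing. The sketch below is an illustration and not part of the script: the helper name `parse_args`, the argument names, and the example file names are assumptions chosen to mirror the variables above.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the three positional arguments the script expects (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Request page info for a list of article titles."
    )
    parser.add_argument("data_dir", help="intermediate data directory")
    parser.add_argument("csv_file_name", help="input csv with article names")
    parser.add_argument("out_file_name", help="name of the csv file to write")
    return parser.parse_args(argv)


# Example with an explicit argument list (made-up names);
# passing None instead would read the arguments from sys.argv[1:]
args = parse_args(["data/intermediate", "politicians.csv", "revision_ids.csv"])
csv_path = os.path.join(args.data_dir, args.csv_file_name)
out_path = os.path.join(args.data_dir, args.out_file_name)
```

Running the script with too few arguments would then exit with a usage message rather than an `IndexError`.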

## Step 2: Setting Constants Required for API Calls

The next step of the script sets the constants required for the rest of the script. This step is largely replicated from a sample code notebook available in the repo: "wp_page_info_example.ipynb". More thorough attribution for this code can be found in the README.md in this repo. 

The constants set in this step include the API request endpoint url, the agent parameter for the API headers, the request headers, and a parameter template for the page info API request.
Additionally, we set an assumed API call latency and a throttle wait parameter so as not to overwhelm the endpoint with API calls. 

In [None]:
###############
###CONSTANTS###
###############

# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"
API_HEADER_AGENT = 'User-Agent'

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<sgura99@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2024'
}

# This template lists the basic parameters for making a page info request
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": ""
}
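
To make the throttle arithmetic concrete: the target is at most 100 requests per second, or 10 ms per request, and the assumed 2 ms latency is subtracted from that budget, which leaves a sleep of about 8 ms before each call. The short check below reproduces the calculation. `TARGET_REQUESTS_PER_SECOND` is a name introduced here for illustration only; the script inlines the value as `1.0/100.0`.

```python
TARGET_REQUESTS_PER_SECOND = 100.0   # illustrative name, not used in the script
API_LATENCY_ASSUMED = 0.002          # seconds, as in the constants above

# time budget per request minus the assumed latency
API_THROTTLE_WAIT = (1.0 / TARGET_REQUESTS_PER_SECOND) - API_LATENCY_ASSUMED

# time spent sleeping alone across 1,000 sequential requests
sleep_seconds_for_1000 = 1000 * API_THROTTLE_WAIT

print(f"wait per request: {API_THROTTLE_WAIT * 1000:.1f} ms")
print(f"sleep time for 1000 requests: {sleep_seconds_for_1000:.1f} s")
```

Actual run time will be longer than the sleep time alone, since each request also takes however long the network and the API need to respond.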

## Step 3: Initialize Functions Required for API Calls

The following functions in the script define two major steps, and we will walk through them in detail.

### Step 3.1: Modularized API Call Function

This function is a modularized version of the API Call. 
This function is also adapted, with some modifications, from the "wp_page_info_example.ipynb" in the src/notebooks subdirectory in this repo. 

Inputs to this function are: 
- the title of the article (dynamic input to the function)
- the endpoint url (See API_ENWIKIPEDIA_ENDPOINT in Step 2)
- the request template (See PAGEINFO_PARAMS_TEMPLATE in Step 2)
- the headers for the api call (See REQUEST_HEADERS in Step 2)
- an integer tracking the number of failed responses

Output of this function is:
- The json response from the call, and the updated failed response counter

In [None]:
def pageinfo_api_call(article_title = None, 
                      endpoint_url = API_ENWIKIPEDIA_ENDPOINT, 
                      request_template = PAGEINFO_PARAMS_TEMPLATE,
                      headers = REQUEST_HEADERS,
                      failed_response_counter=0):
    #set article title if supplied
    if article_title:
        request_template['titles'] = article_title
    #log and raise error for no article title
    if not request_template['titles']:
        #log before raising; a statement after raise would never run
        logging.error("API Call failed because valid article title not provided")
        raise Exception("Must supply an article title to make a pageinfo request.")

    # make the request
    try:
        #throttling 
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        #make response 
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        #raise status errors 
        response.raise_for_status()
        #convert to json and log
        json_response = response.json()
        logging.info(f"API Call succeeded for article titles {article_title}")
    except Exception as e:
        #log error as needed 
        json_response = None
        logging.error(f"API Call Failed for article titles {article_title} due to error {e}")
        #update failed response counter
        failed_response_counter += 1

    return json_response, failed_response_counter

The function operates through the following steps: 
1. Dynamically ingest the article title, either through the article_title input or the request template.
2. Try to make the request: sleep for the throttle wait, send a GET request using the endpoint url, headers, and request template, and parse the response as JSON.
3. Catch any error raised by the API call, log it, update the failed response counter, and set the response to None.
4. Return the json response and the failed response counter.
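
One side effect of step 1 is worth noting: assigning to `request_template['titles']` modifies the shared `PAGEINFO_PARAMS_TEMPLATE` dictionary, so a title set in one call persists into the next. A minimal sketch of one way to avoid that is to copy the template before filling it in. The helper `build_pageinfo_params` and the example title are hypothetical and not part of the script.

```python
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",
    "prop": "info",
    "inprop": "",
}


def build_pageinfo_params(article_title, template=PAGEINFO_PARAMS_TEMPLATE):
    """Return a fresh parameter dict for one title (hypothetical helper)."""
    if not article_title:
        raise ValueError("Must supply an article title to make a pageinfo request.")
    params = dict(template)          # shallow copy; the template holds only strings
    params["titles"] = article_title
    return params


params = build_pageinfo_params("Ada Lovelace")
```

The returned dictionary can be passed as `params=` to `requests.get`, and the template itself stays unchanged between calls.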

### Step 3.2: Main Function to Execute Calls

This main function runs the API calls against the list of articles.

Inputs to this function are: 
- the csv file (set default. See csv_file in Step 1)
- the output file (set default. See out_file in Step 1)
- the start time

This function returns nothing; it ends by writing the results out to a csv file. 

In [None]:
def main(csv_file = csv_file, 
         out_file = out_file, 
         start = start):
    #initialize list of article titles
    article_titles = csv_file.name
    #empty list to append to
    df_list = []
    #failure case counters
    failed_response_counter = 0
    no_rev_id_counter = 0
    #loop through article titles
    for article in article_titles:
        res, failed_response_counter = pageinfo_api_call(article, failed_response_counter=failed_response_counter)
        #skip articles whose api call failed (already logged and counted)
        if res is None:
            continue
        key = list(res['query']['pages'].keys())[0]
        value = res['query']['pages'][key]
        if value and 'lastrevid' in value.keys():
            df_list.append(pd.DataFrame({"article_title" : [value['title']],
                                            "revision_id" : [value['lastrevid']]}))     
        else:
            #use the requested title, since value may be empty here
            logging.info(f"No revision id available for article {article}!")    
            no_rev_id_counter += 1       
    #make df
    df = pd.concat(df_list)
    #write csv
    df.to_csv(out_file, index = False)
    #logging statements
    logging.info(f"A total of {failed_response_counter} API Calls failed to execute successfully.")
    logging.info(f"A total of {no_rev_id_counter} articles did not have a revision ID available for the given parameters.")
    end = datetime.datetime.now()
    logging.info(f"Run took {(end - start).total_seconds()} total seconds!")

#execute main function
if __name__ == "__main__":
    main()

This function operates through the following steps: 
1. Set the list of articles to be used in the page info calls.
2. Create an empty list to append results to, and set up the failure case counters.
3. Loop through the article titles and make the API call for each.
4. Parse the result of each API call. Create a dataframe from each result and append it to the result list.
5. Log articles with no revision id.
6. Concatenate the results and write the csv.
7. Log summary information: the failure case counters and the total run time.
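
The parsing in steps 4 and 5 depends on the shape of the JSON returned for `prop=info`: results are nested under `query` and then `pages`, keyed by page ID, and a page that does not exist carries no `lastrevid`. The sketch below runs that logic against two hand-written sample responses (illustrative values, not real API output) and also skips a failed call, where the response is `None`. The helper `extract_revision_row` is hypothetical, and the rows are collected as plain dictionaries and written with the standard `csv` module rather than pandas, purely to keep the example self-contained.

```python
import csv
import io


def extract_revision_row(response):
    """Return {'article_title', 'revision_id'} or None (hypothetical helper)."""
    if response is None:                      # the API call failed
        return None
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():               # one title per request, so one page
        if "lastrevid" in page:
            return {"article_title": page["title"],
                    "revision_id": page["lastrevid"]}
    return None


# Hand-written sample responses; the titles and IDs are made up for illustration
found = {"query": {"pages": {"1234": {"pageid": 1234, "title": "Example A",
                                      "lastrevid": 987654}}}}
missing = {"query": {"pages": {"-1": {"title": "Example B", "missing": ""}}}}

rows = [row for row in map(extract_revision_row, [found, missing, None]) if row]

# Write the collected rows to an in-memory csv
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["article_title", "revision_id"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```

Collecting rows first and writing once at the end also avoids building a one-row dataframe per article.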