# Data Collection: API Calls

The functionality to call the Wikimedia API as a runnable script is found in the libs/src/api_request.py Python file. The script is thoroughly commented inline, and libs/src/run_scripts.sh shows how to call it from the command line with custom arguments. The following section of the notebook walks through each step of the script in detail.

## Step 1: Processing Command Line Arguments, Reading in Files, and Setting Up Logging

Setting up dependencies for this script includes importing the following packages:

In [None]:
import json, time
import urllib.parse  # format_url() calls urllib.parse.quote, so the submodule is imported explicitly
import pandas as pd
import sys
import logging
import requests
import datetime

A call to this script from the command line generally takes the following form:

```bash
#move into scripts directory
!cd {local_machine_scripts_directory}
#Run api requests for a given access mode
!python3 api_request.py {access mode} {local_machine_data_directory} {article_csv_file_name}
```

Here, {local_machine_scripts_directory} is the absolute path to /libs/src/ on the local machine. The script executes the API calls and writes a json output file for the given access mode.

Inputs to this script include: 
- access mode (desktop, mobile-web, or mobile-app)
- a local data directory path, which holds the csv file of article titles and is where the json output is written
- the file name of the csv file with article titles 

Accordingly, the first few lines of the script process these command line arguments, assign them to variables, read in the necessary files, and set up logging so that users of this code can track exactly which requests succeed, which fail, and how long the script takes to run.

In [None]:
#log start time for total time elapsed at end of script
start = datetime.datetime.now()
#assign access and output file path variables 
access = str(sys.argv[1])
data_dir = str(sys.argv[2])
csv_file_path = str(sys.argv[3])
#read in dataframe (joined with "/" to match how out_file_path is built in Step 2)
rare_disease_df = pd.read_csv(f"{data_dir}/{csv_file_path}")
#assign article titles 
article_titles = rare_disease_df.disease.tolist()
#set up logging 
logging.basicConfig(level=logging.INFO, 
                    format='%(asctime)s - %(levelname)s - %(message)s',
                    filename=f"../../logs/api_requests_{access}.log")

Logging is written to a file named api_requests_{access}.log in the logs directory at the root of the repo. The file name is set dynamically from the access mode given in the command line call to the python script.
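
As a minimal sketch of the argument handling described above, the helper below checks the three positional arguments before any file is read. The function name `parse_cli_args` and the `VALID_ACCESS_MODES` set are illustrative assumptions and are not part of api_request.py; only the log file naming pattern is taken from the script.

```python
# Hypothetical helper, not part of api_request.py: validates the three
# positional arguments before any file is read or any request is made.
VALID_ACCESS_MODES = {"desktop", "mobile-web", "mobile-app"}  # assumed from the inputs listed above

def parse_cli_args(argv):
    """Return (access, data_dir, csv_file_name, log_file) from an argv-style list."""
    if len(argv) != 4:
        raise ValueError("usage: api_request.py ACCESS DATA_DIR CSV_FILE_NAME")
    access, data_dir, csv_file_name = argv[1], argv[2], argv[3]
    if access not in VALID_ACCESS_MODES:
        raise ValueError(f"unknown access mode: {access!r}")
    # same naming pattern as the logging setup in the script
    log_file = f"../../logs/api_requests_{access}.log"
    return access, data_dir, csv_file_name, log_file

# the directory and file name below are placeholders, not real paths
example = parse_cli_args(["api_request.py", "desktop", "/path/to/data", "articles.csv"])
```

Failing early on a mistyped access mode avoids sending a full batch of requests that the API would reject one by one.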

## Step 2: Setting Constants Required for Function Calls

The next step of this script sets the constants required for the rest of the script. This step is largely replicated from a sample code notebook available in the repo, "wp_article_views_example.ipynb". More thorough attribution for this code can be found in the README.md in this repo.

The constants set in this step include the api request endpoint url, the request headers, and a parameterized string that can be dynamically filled with the project name, access mode, agent, article, granularity, start, and end of the api call.
Additionally, we set an assumed api call latency and a throttle wait parameter so as not to overwhelm the endpoint with api calls.

We also initialize a parameter template dictionary that is filled in for each specific api call and later used to format the per-article parameter string. Finally, we set the output file path dynamically based on the access mode, the local data directory, and the start and end times specified in the parameter template dictionary.

In [None]:
# The REST API 'pageviews' URL - this is the common URL/endpoint for all 'pageviews' API requests
API_REQUEST_PAGEVIEWS_ENDPOINT = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/'

# This is a parameterized string that specifies what kind of pageviews request we are going to make
# In this case it will be a 'per-article' based request. The string is a format string so that we can
# replace each parameter with an appropriate value before making the request
API_REQUEST_PER_ARTICLE_PARAMS = 'per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}'

# The Pageviews API asks that we not exceed 100 requests per second, we add a small delay to each request
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0) - API_LATENCY_ASSUMED

# When making a request to the Wikimedia API they ask that you include your email address which will allow them
# to contact you if something happens - such as - your code exceeding rate limits - or some other error 
REQUEST_HEADERS = {
    'User-Agent': 'sgura99@uw.edu, University of Washington, MSDS DATA 512 - AUTUMN 2024',
}
# This template is used to map parameter values into the API_REQUEST_PER_ARTICLE_PARAMS portion of an API request. The dictionary has a
# field/key for each of the required parameters. In the example, below, we only vary the article name, so the majority of the fields
# can stay constant for each request. Of course, these values *could* be changed if necessary.
ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE = {
    "project":     "en.wikipedia.org",
    "access":      "",      # this should be changed for the different access types
    "agent":       "user",
    "article":     "",             # this value will be set/changed before each request
    "granularity": "monthly",
    "start":       "2015070100",   # start and end dates need to be set
    "end":         "2024093000"    # this is likely the wrong end date
}

out_file_path = f"{data_dir}/rare-disease_monthly_{access}_{ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE['start']}-{ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE['end']}.json"
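
The throttle arithmetic above can be checked with a short sketch. The constant `MAX_REQUESTS_PER_SECOND` and the function `minimum_sleep_seconds` are names introduced here for illustration, and the request count of 1,000 is an arbitrary example rather than the size of the article list.

```python
# Worked arithmetic for the throttle constants defined above.
MAX_REQUESTS_PER_SECOND = 100          # limit quoted in the script's comment
API_LATENCY_ASSUMED = 0.002            # seconds, same assumption as the script
API_THROTTLE_WAIT = (1.0 / MAX_REQUESTS_PER_SECOND) - API_LATENCY_ASSUMED

def minimum_sleep_seconds(n_requests):
    """Total time spent in time.sleep() for n_requests throttled calls."""
    return n_requests * API_THROTTLE_WAIT

# 1,000 is an arbitrary illustrative count, not the size of the article list
print(round(API_THROTTLE_WAIT, 6))            # 0.008
print(round(minimum_sleep_seconds(1000), 3))  # 8.0
```

This is only the time spent sleeping; the actual run time also includes the network round trip of each request.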

## Step 3: Initialize Functions Required for API Call

The script defines three helper functions and a main function, and we will walk through them one by one.

### Step 3.1: Formatting URLs for the API Call

This function formats the url for the api call. 

Inputs to this function are: 
- the title of the article (dynamic input to the function)
- the base endpoint url (See API_REQUEST_PAGEVIEWS_ENDPOINT in Step 2)
- the endpoint params (See API_REQUEST_PER_ARTICLE_PARAMS in Step 2)
- the request template (See ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE in Step 2)
- the access mode (dynamic input to the function: one of desktop, mobile-web, mobile-app, all-access)

Output of this function is:
- The formatted request URL for a given article and access mode, and the original article title



In [None]:
def format_url(article_title,
                base_endpoint_url, 
                endpoint_params,
                request_template,
                access):
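    #copy the template so the shared constant is not modified between calls
    request_template = dict(request_template)
    #replace article title in template
    if article_title:
        request_template['article'] = article_title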
    #check for template
    if not request_template['article']:
        raise Exception("Must supply an article title to make a pageviews request.")

    # Titles are supposed to have spaces replaced with "_" and be URL encoded
    article_title_encoded = urllib.parse.quote(request_template['article'].replace(' ','_'), safe='')
    request_template['article'] = article_title_encoded
    request_template["access"] = access
    
    # now, create a request URL by combining the endpoint_url with the parameters for the request
    request_url = base_endpoint_url+endpoint_params.format(**request_template)
    return article_title, request_url

This function is also largely adapted from the "wp_article_views_example.ipynb" notebook in the libs/notebooks subdirectory of this repo. 

The function operates through the following steps: 
1. Replace the article parameter in the request template with the supplied title, and raise an error if no article title is available.
2. Replace spaces in the title with underscores and URL-encode it using urllib so it can be used in the api call.
3. Replace the article and access parameters of the request template with the encoded title and the access mode provided to the function.
4. Construct the full request url from the encoded article title, the access parameter supplied to the function, and the presets of the request template.
5. Return the original article title and the full request url.
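
The steps above can be sketched end to end with one title. The function name `build_url` and the short constant names are illustrative stand-ins for the script's own names, and "Crohn's disease" is an arbitrary example title chosen because it contains both a space and an apostrophe.

```python
import urllib.parse

# Shortened stand-ins for the constants set in Step 2
ENDPOINT = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/'
PARAMS = 'per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}'
TEMPLATE = {
    "project": "en.wikipedia.org", "access": "", "agent": "user", "article": "",
    "granularity": "monthly", "start": "2015070100", "end": "2024093000",
}

def build_url(article_title, access):
    """Fill a copy of TEMPLATE so the shared dictionary is left untouched."""
    params = dict(TEMPLATE)
    # spaces become underscores, then every reserved character is percent-encoded
    params["article"] = urllib.parse.quote(article_title.replace(' ', '_'), safe='')
    params["access"] = access
    return ENDPOINT + PARAMS.format(**params)

# "Crohn's disease" is an arbitrary example title
url = build_url("Crohn's disease", "desktop")
print(url)
# https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia.org/desktop/user/Crohn%27s_disease/monthly/2015070100/2024093000
```

Passing `safe=''` matters for titles that contain a slash, which would otherwise be read as a path separator in the request url.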

### Step 3.2: Modularized API Call Function

This function modularizes the api call required to get page view results from the Wikimedia API.

Inputs to this function are: 
- The formatted full request url, outputted from the format_url() function.
- Request headers for the api call (See REQUEST_HEADERS in Step 2)

Output of this function is:
- The API call response parsed from json into a Python object, or None if the request failed

In [None]:
def api_request(request_url, headers):
    # Check for request URL 
    if not request_url:
        raise Exception("Must supply a request URL to make an api request.")
    try:
        #sleep for the throttle wait
        time.sleep(API_THROTTLE_WAIT)
        #make request
        response = requests.get(request_url, headers = headers)
        #raise error for bad response
        response.raise_for_status() 
        #parse the json body of the response
        response = response.json()
        logging.info(f"Request for url {request_url} has succeeded!")
    except Exception as e:
        response = None
        #log failures at ERROR level so they stand out in the log file
        logging.error(f"Request for url {request_url} has failed with error {e}")
    return response

This function operates through the following steps: 
1. Check if the request url is supplied to the function, and raise an error if not.
2. Pause with time.sleep() for the duration of the API_THROTTLE_WAIT parameter set in Step 2.
3. Make the GET request, and raise an error for a bad API response.
4. Parse the json body of the API response.
5. If the steps thus far have been successful, log a successful call.
6. If any step has failed, log the unsuccessful call with its error and set the response to None.
7. Return the response.
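
To see both the success and failure paths without touching the network, the sketch below repeats the same control flow with the HTTP call passed in as an argument. The names `fetch_json`, `FakeResponse`, and the `get` parameter are hypothetical and exist only for this illustration; in the script itself the call is `requests.get`.

```python
import logging
import time

def fetch_json(request_url, headers, get, wait_seconds=0.0):
    """Same control flow as api_request(), with the HTTP call passed in as `get`."""
    if not request_url:
        raise ValueError("Must supply a request URL to make an api request.")
    try:
        time.sleep(wait_seconds)
        response = get(request_url, headers=headers)
        response.raise_for_status()
        return response.json()
    except Exception as e:
        logging.error(f"Request for url {request_url} has failed with error {e}")
        return None

class FakeResponse:
    """Stand-in for a response object; only the two methods used above."""
    def __init__(self, payload, ok=True):
        self.payload, self.ok = payload, ok
    def raise_for_status(self):
        if not self.ok:
            raise RuntimeError("simulated HTTP error")
    def json(self):
        return self.payload

# example.org urls are placeholders; no request is actually sent
good = fetch_json("https://example.org/a", {}, lambda url, headers: FakeResponse({"items": []}))
bad = fetch_json("https://example.org/b", {}, lambda url, headers: FakeResponse({}, ok=False))
```

Here `good` holds the parsed payload and `bad` is None, which is the signal the later steps use to detect a failed request.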

### Step 3.3: Call the Previous Two Functions and Format the Output Results

This function calls the previous two functions for a given article title, and formats the json output such that the "access" field is removed. 

Inputs to this function are: 
- the article title (dynamic input) for format_url()
- base url endpoint for format_url (See Step 3.1)
- endpoint params for format_url (See Step 3.1)
- the request template for format_url (See Step 3.1)
- the request headers for api_request (See Step 3.2)
- The access mode (dynamic input) for format_url (See Step 3.1)

Outputs of this function are:
- the original article title
- the json response

In [None]:
def get_page_views(article_title,
                        base_endpoint_url, 
                        endpoint_params,
                        request_template,
                        headers,
                        access):
    #format URL 
    original_article_title, request_url = format_url(article_title, 
                             base_endpoint_url,
                             endpoint_params,
                             request_template,
                             access)
    #execute api request 
    response = api_request(request_url, headers)
    #remove access key in response
    try:
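        for item in response['items']:
            del item['access']
    #a failed request leaves response as None (TypeError); a malformed response lacks the keys (KeyError)
    except (TypeError, KeyError):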
        response = None
    return original_article_title, response

The function operates through the following steps: 

1. Employ the format_url() function to obtain the original article title, and the constructed request url for the api call
2. Execute the api request using the api_request() function to obtain the page view data.
3. Attempt to remove the `'access'` field from each entry of the `response['items']` json field. If this raises an error, the api call was unsuccessful, so the response is set to None. 
4. Return the original article title, and the response data from the api call.
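
The clean-up in step 3 can be illustrated on a hand-written payload. The helper name `strip_access` is hypothetical, the field names are assumed to follow the per-article response format, and the view count is made up for this example. Unlike the script, this sketch uses `pop()` and so tolerates an item that has no `'access'` key instead of discarding the whole response.

```python
def strip_access(response):
    """Return the response with 'access' removed from every item, or None on failure."""
    if not response or 'items' not in response:
        return None
    for item in response['items']:
        item.pop('access', None)   # pop() tolerates items that lack the key
    return response

# Illustrative payload: field names are assumed to match the per-article
# response, and the view count is made up for this example.
sample = {"items": [
    {"project": "en.wikipedia", "article": "Example", "granularity": "monthly",
     "timestamp": "2015070100", "access": "desktop", "agent": "user", "views": 10},
]}
cleaned = strip_access(sample)
```

Dropping the field is reasonable here because the access mode is already recorded in the output file name.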

### Step 3.4: Main Function to Execute the API Calls and Generate the Output

This function pieces together all of the previous steps, and places them in a main function which is called at the end of the script. 

This function has no inputs or outputs, but simply executes the previous functions against the article titles from the csv file loaded in Step 1. 

It writes the output to the output file path set in Step 2.

In [None]:
def main():
    #initialize page view results
    page_view_res = {}
    responses = []
    #execute page view calls and add to response list
    for url in article_titles:
        res = get_page_views(url,
                            API_REQUEST_PAGEVIEWS_ENDPOINT,
                            API_REQUEST_PER_ARTICLE_PARAMS,
                            ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE,
                            REQUEST_HEADERS,
                            access) 
        responses.append(res)
    #parse response iterables for article titles and json objects
    for response in responses:
        title, json_obj = response
        if json_obj:
            page_view_res[title] = json_obj['items']
    #collect titles for which requests were unsuccessful
    unsuccessful_requests = [article for article in article_titles if article not in page_view_res.keys()]
    #Attempt page view call again for articles which weren't successfully called in first loop
    final_unsuccessful_requests = 0
    for url in unsuccessful_requests:
        title, retry_json_obj = get_page_views(url,
                                                API_REQUEST_PAGEVIEWS_ENDPOINT,
                                                API_REQUEST_PER_ARTICLE_PARAMS,
                                                ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE,
                                                REQUEST_HEADERS,
                                                access)
        if retry_json_obj:
            page_view_res[title] = retry_json_obj['items']
        else:
            #log number of unsuccessful requests after first retry
            final_unsuccessful_requests += 1
    #write file 
    with open(out_file_path, 'w') as file:
        json.dump(page_view_res, file)
    #log run time and the number of requests failed
    logging.info(f"Total run took {(datetime.datetime.now() - start).total_seconds():.1f} seconds! A total of {final_unsuccessful_requests} requests failed to complete.")

main()

This function operates through the following steps: 
1. Initialize a page view result dictionary to track the final output, and initialize a list to hold response objects.
2. Loop through the article title list initialized in Step 1, and get page views for these titles. Append the get_page_views() results to the response list.
3. Unpack the tuple in each element of the response list. For successful requests, add an entry to the page view result dictionary with the original article title as the key and the json object's `['items']` field as the value.
4. Collect the article titles for which api requests were unsuccessful, and retry these titles one more time in case the failures were caused by too many api calls being made at once. Add successful retries to the page view result dictionary.
5. Count the requests that are still unsuccessful after the retry.
6. Write the page view result dictionary to the output file path set in Step 2.
7. Log the final run time, and the number of failed requests.
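
The retry-once pattern in steps 2 through 5 can be sketched in isolation with a stub in place of the real request. The names `collect_with_one_retry` and `flaky_fetch` are hypothetical, and the titles "A", "B", and "C" are placeholders rather than real article titles.

```python
def collect_with_one_retry(titles, fetch):
    """Call fetch(title) for every title, then retry the failures once."""
    results = {}
    for title in titles:
        data = fetch(title)
        if data:
            results[title] = data
    # titles with no stored result are retried exactly once
    failed = [t for t in titles if t not in results]
    still_failed = []
    for title in failed:
        data = fetch(title)
        if data:
            results[title] = data
        else:
            still_failed.append(title)
    return results, still_failed

# Stub fetcher: "B" fails once and then succeeds, "C" always fails.
attempts = {}
def flaky_fetch(title):
    attempts[title] = attempts.get(title, 0) + 1
    if title == "C" or (title == "B" and attempts[title] == 1):
        return None
    return [{"views": 1}]

results, still_failed = collect_with_one_retry(["A", "B", "C"], flaky_fetch)
```

After the run, `results` contains "A" and "B" while `still_failed` contains only "C", which corresponds to the count logged at the end of main().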