# Data Collection: Wikimedia ORES (LiftWing) API Call

The functionality to call the Wikimedia API as a runnable script is found in the src/libs/ores_api_request.py Python file. Examples of calling the script from the command line with custom arguments are located in src/libs/run_scripts.sh. The following section of the notebook walks through the specific steps in the script and explains them in detail.

## Step 1: Processing Command Line Arguments, Reading in Files, and Setting Up Logging

The following imports are required for this script: 

In [3]:
import json, urllib, time
import pandas as pd
import sys
import os
import logging
import datetime
import requests
sys.path.append("../")
from libs.api_key_store import API_KEY_STORE

The call of this script from the command line generally takes the form of: 

```bash
#move into local machine scripts directory
cd ${local_machine_scripts_directory}

#run liftwing API Call
python3 ores_api_request.py ${intermediate_data_directory} ${revid_csv_file_name} ${pol_revid_out_file_name}
```

Here, the local machine scripts directory is the absolute path to /src/libs/ on the local machine. The script executes API calls against the Wikimedia LiftWing ORES API and writes a csv file of the output.

Inputs to this script include: 
- the local intermediate data directory, which contains the csv file from the page info API call and is where the output csv file is written
- the name of the csv file containing the revision ids to score with the API
- the name of the output file to write

Accordingly, the first few lines of the script process these command line arguments, assign them to variables, read in the necessary files, and set up logging so users of this code can track exactly which requests succeed, which fail, and how long the script takes to run.

In [None]:
#Initialize Key Store so access tokens will not be exposed
api_key_store = API_KEY_STORE("enwiki-articlequality")


start = datetime.datetime.now()
#set up logging 
logging.basicConfig(level=logging.INFO, 
                    format='%(asctime)s - %(levelname)s - %(message)s',
                    filename=f"../../logs/ores_api_requests.log")

#assign access and output file path variables 
data_dir = str(sys.argv[1])
csv_file_name = str(sys.argv[2])
out_file_name = str(sys.argv[3])

out_file = os.path.join(data_dir, out_file_name)
csv_file = pd.read_csv(os.path.join(data_dir, csv_file_name))
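
The positional arguments above are read without any check on how many were supplied. As a minimal sketch, assuming we want a usage message instead of an IndexError, the handling could be wrapped in a small helper. `parse_cli_args` is a hypothetical name and is not part of ores_api_request.py.

```python
import os


def parse_cli_args(argv):
    """Return (data_dir, csv_path, out_path) from a sys.argv-style list.

    Hypothetical helper for illustration; not part of the original script.
    """
    if len(argv) != 4:
        raise SystemExit(
            "usage: python3 ores_api_request.py <data_dir> <revid_csv> <out_csv>"
        )
    data_dir, csv_file_name, out_file_name = argv[1:4]
    return (
        data_dir,
        os.path.join(data_dir, csv_file_name),
        os.path.join(data_dir, out_file_name),
    )


# Example with made-up argument values
example = parse_cli_args(["ores_api_request.py", "data", "revids.csv", "scores.csv"])
print(example)
```

In the script itself this would be called as `parse_cli_args(sys.argv)` before reading the csv file.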

## Step 2: Setting Up the API Key Store Object

To ensure user-specific access keys and personal details are not exposed in this software, the API key store object houses these parameters. The class definition is shown below.

In [None]:
import os 

class API_KEY_STORE():
    def __init__(self, model):
        #Set Wikimedia API parameters from env vars; an unset variable becomes "" (not the string "None") so later checks can detect it
        self.WIKIMEDIA_ACCESS_TOKEN = os.getenv("WIKIMEDIA_ACCESS_TOKEN", "")
        self.WIKIMEDIA_CLIENT_SECRET = os.getenv("WIKIMEDIA_CLIENT_SECRET", "")
        self.WIKIMEDIA_CLIENT_ID = os.getenv("WIKIMEDIA_CLIENT_ID", "")
        self.WIKIMEDIA_EMAIL_ADDRESS = os.getenv("WIKIMEDIA_EMAIL_ADDRESS", "")
        self.WIKIMEDIA_ORES_ENDPOINT = f"https://api.wikimedia.org/service/lw/inference/v1/models/{model}:predict"

To correctly initialize the API key store object, several environment variables must be set before running the ORES API call script.
This can be achieved through the following (abstracted) line of code. my_var represents the variable name to be set, in this case one of WIKIMEDIA_ACCESS_TOKEN, WIKIMEDIA_CLIENT_SECRET, WIKIMEDIA_CLIENT_ID, and WIKIMEDIA_EMAIL_ADDRESS. value represents the user-specific parameter from the Wikimedia API to be stored.

!conda env config vars set my_var=value
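
Because an unset variable is easy to overlook, a quick check before constructing the key store can help. The following is a minimal sketch under our own assumptions: `REQUIRED_ENV_VARS` and `missing_env_vars` are hypothetical names that do not exist in the repo, and the demonstration uses a made-up mapping rather than real credentials.

```python
import os

# The four variables the key store reads, as listed above
REQUIRED_ENV_VARS = (
    "WIKIMEDIA_ACCESS_TOKEN",
    "WIKIMEDIA_CLIENT_SECRET",
    "WIKIMEDIA_CLIENT_ID",
    "WIKIMEDIA_EMAIL_ADDRESS",
)


def missing_env_vars(environ=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_ENV_VARS if not environ.get(name)]


# Demonstration with a made-up environment mapping instead of the real one
fake_env = {
    "WIKIMEDIA_ACCESS_TOKEN": "dummy-token",
    "WIKIMEDIA_EMAIL_ADDRESS": "someone@example.org",
}
print(missing_env_vars(fake_env))
```

Calling `missing_env_vars()` with no argument inspects the real environment; an empty list means all four variables are present.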

## Step 3: Setting Constants Required for API Calls

The next step of this script involves setting constant variables required for the rest of the script. This step is largely replicated from a sample code notebook available in the repo: "wp_ores_liftwing_example.ipynb". More thorough attribution for this code can be found in the README.md in this repo.

The constants set in this step include a template for API request headers, a parameter dictionary for this request template, and a template for the data parameter required for the API call.
Additionally, we set an assumed API call latency and a throttle wait parameter so as not to overwhelm the endpoint with API calls.

In [None]:
#########
#
#    CONSTANTS
#
#    The throttling rate is a function of the Access token that you are granted when you request the token. The constants
#    come from dissecting the token and getting the rate limits from the granted token. An example of that is below.
#
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = ((60.0*60.0)/5000.0)-API_LATENCY_ASSUMED  # The key authorizes 5000 requests per hour

#    When making automated requests we should include something that is unique to the person making the request
#    This should include an email - your UW email would be good to put in there
#    
#    Because all LiftWing API requests require some form of authentication, you need to provide your access token
#    as part of the header too
#
REQUEST_HEADER_TEMPLATE = {
    'User-Agent': "<{email_address}>, University of Washington, MSDS DATA 512 - AUTUMN 2024",
    'Content-Type': 'application/json',
    'Authorization': "Bearer {access_token}"
}
#
#    This is a template for the parameters that we need to supply in the headers of an API request
#
REQUEST_HEADER_PARAMS_TEMPLATE = {
    'email_address' : "",         # your email address should go here
    'access_token'  : ""          # the access token you create will need to go here
}

#
#    This is a template of the data required as a payload when making a scoring request of the ORES model
#
ORES_REQUEST_DATA_TEMPLATE = {
    "lang":        "en",     # required that it's English - we're scoring English Wikipedia revisions
    "rev_id":      "",       # this request requires a revision id
    "features":    True
}
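
The throttle arithmetic in the constants above can be checked directly. This is a small worked sketch that reuses the 5000 requests per hour limit and the 2 ms latency stated in the comments; `estimated_runtime_minutes` is a hypothetical helper added only to show what the wait implies for a full run.

```python
REQUESTS_PER_HOUR = 5000          # limit stated in the script's comment
API_LATENCY_ASSUMED = 0.002       # seconds, as in the script

# One hour spread evenly over the allowed requests
seconds_per_request = (60.0 * 60.0) / REQUESTS_PER_HOUR
# The sleep is shortened by the assumed latency of the call itself
throttle_wait = seconds_per_request - API_LATENCY_ASSUMED


def estimated_runtime_minutes(n_requests, wait=throttle_wait, latency=API_LATENCY_ASSUMED):
    """Rough run time if every request costs wait + latency seconds."""
    return n_requests * (wait + latency) / 60.0


print(round(seconds_per_request, 3))   # seconds between request starts
print(round(throttle_wait, 3))         # seconds slept before each request
print(round(estimated_runtime_minutes(5000), 1))
```

Under these assumptions each request is spaced about 0.72 seconds apart, so 5000 revisions take roughly an hour before any extra network delay.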

## Step 4: Initialize Functions Required for API Calls

The following functions in the script define three major steps, and we will walk through them in detail.

### Step 4.1: Modularized API Call

This function is a modularized version of the API call.
It is also adapted, with some modifications, from the "wp_ores_liftwing_example.ipynb" notebook in the src/notebooks subdirectory of this repo.

Inputs to this function are: 
- the revision id of the article (dynamic input to the function)
- an api key store object (See Step 2)
- request data template (See ORES_REQUEST_DATA_TEMPLATE in Step 3)
- header template (See REQUEST_HEADER_TEMPLATE in Step 3)
- header parameters (See REQUEST_HEADER_PARAMS_TEMPLATE in Step 3)
- the wait time to throttle api requests (See API_THROTTLE_WAIT in Step 3)

Outputs of this function are:
- The original article revision ID
- Serialized JSON output of the response

In [None]:
def make_ores_api_call(article_revid = None,
                        api_key_store = None,
                        request_data = ORES_REQUEST_DATA_TEMPLATE,
                        header_format = REQUEST_HEADER_TEMPLATE,
                        header_params = REQUEST_HEADER_PARAMS_TEMPLATE,
                        request_wait_timing = API_THROTTLE_WAIT):

    #setting parameters based on revid and api key store
    request_data['rev_id'] = article_revid
    header_params['email_address'] = api_key_store.WIKIMEDIA_EMAIL_ADDRESS
    header_params['access_token'] = api_key_store.WIKIMEDIA_ACCESS_TOKEN
    #Raise exceptions in cases where required inputs are missing (log first: nothing after a raise is executed)
    if not article_revid:
        logging.info("API Call failed: Valid article revision ID not Provided!")
        raise Exception("Must provide a valid article revision ID")
    if not header_params["email_address"]:
        logging.info("API Call Failed: Valid API Key Store Object not Provided! Email Address Not Available.")
        raise Exception("Must provide valid api key store object. Email Address Not Available.")
    if not header_params["access_token"]:
        logging.info("API Call Failed: Valid API Key Store Object not Provided! Access Token Not Available.")
        raise Exception("Must provide valid api key store object. Access Token Not Available.")
    # Create a compliant request header from the template and the supplied parameters
    headers = dict()
    for key in header_format.keys():
        headers[str(key)] = header_format[key].format(**header_params)
    #execute the request
    try:
        #throttling
        time.sleep(request_wait_timing)
        #run post request to API
        response = requests.post(api_key_store.WIKIMEDIA_ORES_ENDPOINT, headers=headers, data=json.dumps(request_data))
        #raise for status 
        response.raise_for_status()
        #serialize json and log successful response
        json_response = response.json() 
        logging.info(f"ORES API Call for Article Revision ID {article_revid} succeeded!")

    except Exception as e:
        #log failure and set response to None
        logging.info(f"ORES API Call for Article Revision ID {article_revid} failed with reason: {e}")
        json_response = None 

    return article_revid, json_response

The function operates through the following steps: 
1. Set parameters for the API call based on inputs to the function.
2. Raise errors if required inputs are not present.
3. Create a request header from the template and parameters.
4. Try to make the request with an api throttler to limit requests. This requires a post request with the hidden endpoint, the headers, and data for the request.
5. Serialize the response into json format and log a successful call.
6. Catch and log errors in the call, set the response to None.
7. Return the article revision ID and the response object.
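
The header-building step can be tried in isolation. The sketch below copies REQUEST_HEADER_TEMPLATE from Step 3 and fills it with placeholder credentials; `build_headers` is a hypothetical helper, not a function in the script, and it builds a fresh parameter dictionary on each call rather than writing into a shared one.

```python
REQUEST_HEADER_TEMPLATE = {
    "User-Agent": "<{email_address}>, University of Washington, MSDS DATA 512 - AUTUMN 2024",
    "Content-Type": "application/json",
    "Authorization": "Bearer {access_token}",
}


def build_headers(email_address, access_token, template=REQUEST_HEADER_TEMPLATE):
    """Fill every header value from the template without mutating it."""
    params = {"email_address": email_address, "access_token": access_token}
    return {key: value.format(**params) for key, value in template.items()}


# Placeholder credentials for illustration only
headers = build_headers("someone@example.org", "dummy-token")
print(headers["User-Agent"])
print(headers["Authorization"])
```

Values without placeholders, such as the content type, pass through `str.format` unchanged, which is why the same loop can be applied to every key.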

### Step 4.2: Parse API Response for Prediction

This function parses the API response to extract the prediction field from the json.

Input to this function is: 
- json of the api response 

Output of this function is: 
- the prediction field

In [None]:
def parse_api_response(api_response_json):
    #extract scores dictionary
    scores_dict = api_response_json['enwiki']['scores']
    #extract the key of the scores dictionary
    scores_key = list(scores_dict.keys())[0]
    #extract prediction field
    prediction = scores_dict[scores_key]['articlequality']['score']['prediction']
    return prediction 

This function operates through the following steps: 
1. Extract the dictionary of scores from the response json
2. Extract the key corresponding to article revision ID from the scores dictionary
3. Use the key to extract the article quality prediction
4. Return the prediction.
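
To make the key path concrete, here is the same function applied to a made-up payload. The payload is only shaped to match the keys the function reads; the revision id and values are invented, and a real response may carry additional fields. `parse_api_response_or_none` is a hypothetical tolerant variant, not part of the script.

```python
def parse_api_response(api_response_json):
    # Same key path as the function above
    scores_dict = api_response_json["enwiki"]["scores"]
    scores_key = list(scores_dict.keys())[0]
    return scores_dict[scores_key]["articlequality"]["score"]["prediction"]


def parse_api_response_or_none(api_response_json):
    """Hypothetical variant: return None instead of raising on an unexpected shape."""
    try:
        return parse_api_response(api_response_json)
    except (KeyError, IndexError, TypeError):
        return None


# Invented payload for illustration; values are not real model output
sample_response = {
    "enwiki": {
        "scores": {
            "12345": {
                "articlequality": {
                    "score": {"prediction": "Stub"}
                }
            }
        }
    }
}

print(parse_api_response(sample_response))
print(parse_api_response_or_none({"enwiki": {"scores": {}}}))
```

The tolerant variant is one possible design choice if a malformed response should count as a failed call rather than stop the whole run.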

### Step 4.3: Main Function to Execute Calls and Extract Predictions

This main function runs API calls against the list of article revisions.

Inputs to this function are: 
- an api key store object (set default. See api_key_store in Step 1; the class is described in Step 2)
- the csv file (set default. See csv_file in Step 1)
- the output file (set default. See out_file in Step 1)

This function has no output. It ends with writing the file out to a csv. 

In [None]:
def main(api_key_store=api_key_store,
            csv_file=csv_file,
            out_file=out_file):
    #list of revids
    revids = csv_file.revision_id.tolist()
    #init response list 
    responses = []
    for revision_id in revids:
        #append responses for made call
        responses.append(make_ores_api_call(article_revid = int(revision_id),
                                            api_key_store= api_key_store))
    #assign title as keys and items of json obj
    formatted_response_list = []
    #failure case counter 
    response_failed_count = 0
    #loop through responses, and add dataframe if response is valid
    for response in responses:
         article_revision_id, json_response = response
         if json_response:
            prediction = parse_api_response(json_response)
            formatted_response_list.append(pd.DataFrame({"revision_id" : [article_revision_id],
                                                         "article_quality" : [prediction]}))
         else:
            response_failed_count += 1
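    #concat formatted dataframes (if any succeeded) and write to out file
    if formatted_response_list:
        response_dataframe = pd.concat(formatted_response_list)
    else:
        response_dataframe = pd.DataFrame(columns=["revision_id", "article_quality"])
    response_dataframe.to_csv(out_file)
    #logging of total time and api calls failed
    end = datetime.datetime.now()
    logging.info(f"This run took {(end - start).total_seconds()} seconds to complete!")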
    logging.info(f"{response_failed_count} api calls failed.")

if __name__ == "__main__":
    main()       

This function operates through the following steps: 
1. Create a list of revision ids from the csv file.
2. Loop through those revision ids, make an API call for each, and append the result to a responses list.
3. Loop through the response list and format a dataframe per response by extracting the prediction.
4. Update the failure case counter if the response is None.
5. Concatenate the dataframes across responses and write to the out file.
6. Log the total time and how many calls failed.
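
The formatting and counting steps can be exercised without calling the API. The sketch below is an alternative arrangement, not the script's own code: it collects plain rows and builds one dataframe at the end instead of one dataframe per response. `responses_to_frame` is a hypothetical name, and the stand-in responses and parse function are invented for the demonstration.

```python
import pandas as pd


def responses_to_frame(responses, parse):
    """Turn (revision_id, json_or_None) pairs into a dataframe plus a failure count."""
    rows = []
    failed = 0
    for revision_id, json_response in responses:
        if json_response:
            rows.append({"revision_id": revision_id,
                         "article_quality": parse(json_response)})
        else:
            failed += 1
    # Naming the columns keeps the frame well-formed even when every call failed
    frame = pd.DataFrame(rows, columns=["revision_id", "article_quality"])
    return frame, failed


# Stand-in responses; the parse function here is a trivial placeholder
fake_responses = [(1, {"prediction": "GA"}), (2, None), (3, {"prediction": "Stub"})]
frame, failed = responses_to_frame(fake_responses, parse=lambda r: r["prediction"])
print(frame)
print(failed)
```

In the script, `parse` would be `parse_api_response` and `responses` would be the list returned by the calls to `make_ores_api_call`.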

## Step 5: Failure Rate

Finally, we estimate the API call failure rate for the ORES LiftWing API across our calls. To do so, we read in our log file and analyze occurrences of the failure keyword.

In [6]:
lines = []
with open("../../logs/ores_api_requests.log", "r") as files:
    for line in files:
        lines.append(line)

By reviewing the last line, we can identify the number of failed calls.

In [10]:
print(lines[-1])

2024-10-13 17:49:44,841 - INFO - 2 api calls failed.



The last two lines are not api result logs, so we can subtract them from the length to find the total number of calls. We have logged 2 failed calls, so we can proceed with calculating a rate this way.

In [11]:
num_calls = len(lines) - 2
print(f"API Call Failure Rate for liftwing API: {round((2/num_calls) * 100, 3)} %")

API Call Failure Rate for liftwing API: 0.028 %
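
As a cross-check, the rate can also be computed by counting the per-call log messages directly, which avoids hard-coding the number of failures or relying on the line count. This matters if the log holds more than one run, since `logging.basicConfig` appends to an existing file by default. The sketch below assumes the message wording written by `make_ores_api_call`; `failure_rate` is a hypothetical helper and the sample lines are invented.

```python
def failure_rate(log_lines):
    """Percent of per-call log lines that record a failure."""
    succeeded = sum("succeeded!" in line for line in log_lines)
    failed = sum("failed with reason" in line for line in log_lines)
    total = succeeded + failed
    return (failed / total * 100) if total else 0.0


# Made-up log lines in the format the script writes
sample_lines = [
    "2024-01-01 00:00:00,000 - INFO - ORES API Call for Article Revision ID 1 succeeded!",
    "2024-01-01 00:00:01,000 - INFO - ORES API Call for Article Revision ID 2 failed with reason: timeout",
    "2024-01-01 00:00:02,000 - INFO - ORES API Call for Article Revision ID 3 succeeded!",
    "2024-01-01 00:00:03,000 - INFO - This run took 0:00:03 seconds to complete!",
    "2024-01-01 00:00:03,000 - INFO - 1 api calls failed.",
]
print(f"API Call Failure Rate: {round(failure_rate(sample_lines), 3)} %")
```

Applied to the real log, this would be called as `failure_rate(lines)` with the list read in above.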
