# Data Wrangling: Combining API Results

The functionality to combine API results across access modes, as output by `libs/src/api_request.py`, is found in the Python file `libs/src/combine_api_results.py`. The script is thoroughly commented inline, and `libs/src/run_scripts.sh` shows ways to call it from the command line with custom arguments. This section of the notebook walks through each step of the script in detail.

## Step 1: Processing Command Line Arguments, Reading in Files, and Setting Up Logging

This script depends on the following libraries:

In [None]:
import json
import sys 
from collections import Counter, defaultdict
import pandas as pd
import os
import logging
import datetime 

The call of this script from the command line generally takes the form of: 

```bash
#move into scripts directory
!cd ${local_machine_scripts_directory}
#combine api results for given access modes
!python3 combine_api_results.py ${access_mode_1} ${access_mode_2} ${delete_originals} ${output_access} ${local_machine_data_directory} 
```

Here, `${local_machine_scripts_directory}` is the absolute path to `/libs/src/` on a local machine. The script combines the API results for each article by timestamp and writes the combined results to an output JSON file.

Inputs to this script are: 
- the first access mode to be combined 
- the second access mode to be combined 
- a boolean (True or False) of whether the original files should be deleted
- the name of the output access mode 
- the absolute path on a local machine of the data directory in this repo 

Accordingly, the first few lines of the script process these command line arguments, assign them to variables, and set up logging so that users can track which article merges succeed, which fail, and how long the script takes to run.

In [None]:
start_time = datetime.datetime.now()

#assign access and output file path variables 
access1 = str(sys.argv[1])
access2 = str(sys.argv[2])
#do we want to delete originals or no
delete_originals = bool(sys.argv[3] == "True")
out_access = str(sys.argv[4])
data_dir = str(sys.argv[5])

logging.basicConfig(level=logging.INFO, 
                    format='%(asctime)s - %(levelname)s - %(message)s',
                    filename=f"../../logs/combine_api_results_{access1}_{access2}.log")

Logging is written to the file `combine_api_results_{access1}_{access2}.log` in the `logs` directory of the main repo. The file name is built dynamically from the access modes being merged, as given in the command line call to the Python script.
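As a minimal sketch of the argument handling and log naming described above, the snippet below unpacks a sample argument list. The helper names `parse_args` and `log_file_name` are hypothetical and do not appear in the script, and the sample access modes and data path are made-up values. It shows that only the exact string `True` turns the delete flag on.

```python
def parse_args(argv):
    """Unpack the five positional arguments (hypothetical helper)."""
    access1, access2, delete_flag, out_access, data_dir = argv[1:6]
    return {
        "access1": access1,
        "access2": access2,
        # Only the literal string "True" enables deletion; anything else is False.
        "delete_originals": delete_flag == "True",
        "out_access": out_access,
        "data_dir": data_dir,
    }


def log_file_name(access1, access2):
    """Build the log file name from the two access modes (hypothetical helper)."""
    return f"combine_api_results_{access1}_{access2}.log"


# Sample argv; the access mode names and data path are made-up values.
args = parse_args(["combine_api_results.py", "mobile-app", "mobile-web",
                   "true", "mobile", "/tmp/data"])
print(args["delete_originals"])  # False, because "true" != "True"

log_name = log_file_name(args["access1"], args["access2"])
print(log_name)  # combine_api_results_mobile-app_mobile-web.log
```

Because the comparison is case-sensitive, passing `true` or `1` leaves the original files in place.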

## Step 2: Setting Constants Required for Function Calls

The next step of the script sets the start and end dates, which are hardcoded from the request template in the "wp_article_views_example.ipynb" notebook. The output file path and the two access JSON file path strings are then constructed from these variables. The two access JSON files are read into the objects `access1_json` and `access2_json`. For later steps, the record lists for each article from the two files are concatenated into a Python dictionary named `merged_json`.

In [None]:
#start and end for path strings
start = "2015070100"
end = "2024093000"

#format output file path
out_file_path = f"{data_dir}/rare-disease_monthly_{out_access}_{start}-{end}.json"

#format json paths
access1_json_path = f"{data_dir}/rare-disease_monthly_{access1}_{start}-{end}.json"
access2_json_path = f"{data_dir}/rare-disease_monthly_{access2}_{start}-{end}.json"

#load json objects
with open(access1_json_path, "r") as file:  
    access1_json = json.load(file)

with open(access2_json_path, "r") as file:  
    access2_json = json.load(file)

#put lists together per key to get all timestamps from both sources in same list
merged_json = {}
for title in access1_json.keys():
    merged_json[title] = access1_json[title] + access2_json[title]
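To make the concatenation concrete, here is a small sketch on toy data. The article title and view counts are made up. The use of `.get(title, [])` is an assumption added for this sketch, so that a title missing from the second file does not raise a `KeyError`; the script itself indexes `access2_json[title]` directly.

```python
# Toy stand-ins for the two loaded JSON objects (made-up article and counts).
access1_json = {
    "Example_article": [{"timestamp": "2015070100", "views": 10}],
}
access2_json = {
    "Example_article": [{"timestamp": "2015070100", "views": 5}],
}

merged_json = {}
for title, records in access1_json.items():
    # Assumption for this sketch: fall back to an empty list when the
    # second file has no entry for this title.
    merged_json[title] = records + access2_json.get(title, [])

print(len(merged_json["Example_article"]))  # 2 records, both for 2015070100
```

After this step each timestamp appears once per access mode, which is why the views still need to be summed.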

## Step 3: Initialize Functions Required For API Call Result Merging

### Step 3.1: Summing Views for a Given Timestamp and Article

The `merged_json` object constructed in Step 2 contains one record per access mode for each timestamp, so every timestamp is duplicated within an article's record list. This function sums the views for each article by timestamp and outputs a formatted JSON object.

Input to this function is:
-  the merged json object from Step 2

Output of this function is: 
- the merged output dictionary

In [None]:
#function to sum views that share a timestamp within each article
def sum_unique_views_output_dict(merged_json_object):
    #initialize default dict of lists for output
    result_dict = defaultdict(list)
    #loop through list of api results (now with two entries per timestamp)
    for article, api_result_list in merged_json_object.items():
        #initialize default view tracker dictionary per timestamp
        view_tracker = defaultdict(lambda : {
                'project' : None,
                'granularity' : None,
                'agent' : None,
                'views' : 0,
        })
        #loop through every item in the list of results
        for list_iter in api_result_list:
            #initialize the timestamp
            timestamp = list_iter['timestamp']
            #add the views from a given timestamp
            view_tracker[timestamp]['views'] += list_iter['views']
            #source other info from the iteration of the list if not already set
            if view_tracker[timestamp]['agent'] is None:
                view_tracker[timestamp]['project'] = list_iter['project']
                view_tracker[timestamp]['granularity'] = list_iter['granularity']
                view_tracker[timestamp]['agent'] = list_iter['agent']
        #add summed monthly entry to high level output dict
        for timestamp in view_tracker.keys():
            result_dict[article].append({
                'project' :  view_tracker[timestamp]['project'],
                'article' : article,
                'granularity' : view_tracker[timestamp]['granularity'],
                'timestamp' : timestamp,
                'agent' : view_tracker[timestamp]['agent'],
                'views' : view_tracker[timestamp]['views']
            })
        logging.info(f"Result merging has finished for article {article}")
    #convert from default dict to dict
    return dict(result_dict)

This function leverages the defaultdict object from the collections module. 

It operates through the following steps: 
1. Initialize a defaultdict object for the final result dictionary.
2. Loop through the article title and API result list key, value pairs in the merged JSON object.
3. Initialize a `view_tracker` defaultdict object for each article, keyed by timestamp.
4. Loop through each API result in the API result list, and add the views for that result's timestamp to the `view_tracker` record for that timestamp. Each timestamp therefore has only one entry in the `view_tracker` defaultdict object.
5. Add the auxiliary information (project, granularity, agent) from the API result to the `view_tracker` record if it is not already set.
6. Loop through the timestamp keys of the `view_tracker`, and append a dictionary of the information for each timestamp to the final result list for the given article.
7. Log that result merging has finished for the given article.
8. Return the result as a plain dictionary.
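The core of the steps above, summing views for records that share a timestamp, can be sketched on toy data as follows. The helper `sum_views_by_timestamp` is a simplified, hypothetical stand-in that tracks only the view totals, and the records are made up.

```python
from collections import defaultdict


def sum_views_by_timestamp(records):
    """Sum 'views' across records that share a timestamp (illustrative helper)."""
    totals = defaultdict(int)
    for record in records:
        # A missing timestamp starts at 0, so the first record needs no special case.
        totals[record["timestamp"]] += record["views"]
    return dict(totals)


# Made-up records: two access modes reporting the same two months.
records = [
    {"timestamp": "2015070100", "views": 10},
    {"timestamp": "2015080100", "views": 7},
    {"timestamp": "2015070100", "views": 5},
    {"timestamp": "2015080100", "views": 3},
]

summed = sum_views_by_timestamp(records)
print(summed)  # {'2015070100': 15, '2015080100': 10}
```

The defaultdict supplies a starting value for any timestamp not seen before, which is the same reason the full function uses one for its view tracker.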

### Step 3.2: Running the Merge Process for a Given Merged Json Object

This function uses the `sum_unique_views_output_dict()` function defined in Step 3.1, and runs the merge operation.

The inputs to this function are: 
- A given merged JSON object
- A delete originals flag, corresponding to the script argument (See Step 1)

This function returns nothing. Its effects are writing the merged output file and, optionally, deleting the original files.

In [None]:
def run_merge(merged_json, delete_originals=False):
    #run sum_unique_views_output_dict function for the merged json
    output_json = sum_unique_views_output_dict(merged_json)
    #write output
    with open(out_file_path, 'w') as file:
        json.dump(output_json, file)
    #delete originals if needed
    if delete_originals:
        os.remove(access1_json_path)
        os.remove(access2_json_path)
        logging.info(f"Temp Files {access1_json_path} and {access2_json_path} removed.")

This function operates through the following steps: 
1. Run `sum_unique_views_output_dict()` to produce a summed JSON object from the given merged JSON object.
2. Write the output to the output file path set in Step 2.
3. Delete originals based on the flag set in Step 1 and log the deletion. 
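The write-then-delete pattern in these steps can be tried safely in a temporary directory. This is a sketch under stated assumptions: `write_and_cleanup` is a hypothetical helper that takes its paths as arguments rather than reading module-level variables, and the file names are placeholders.

```python
import json
import os
import tempfile


def write_and_cleanup(output, out_path, originals, delete_originals=False):
    """Write the merged output, then optionally remove the input files."""
    with open(out_path, "w") as file:
        json.dump(output, file)
    if delete_originals:
        for path in originals:
            os.remove(path)


with tempfile.TemporaryDirectory() as tmp_dir:
    # Create two placeholder input files to stand in for the access JSONs.
    originals = [os.path.join(tmp_dir, name) for name in ("a.json", "b.json")]
    for path in originals:
        with open(path, "w") as file:
            json.dump({}, file)

    out_path = os.path.join(tmp_dir, "combined.json")
    write_and_cleanup({"Example_article": []}, out_path, originals,
                      delete_originals=True)

    remaining = sorted(os.listdir(tmp_dir))
    print(remaining)  # ['combined.json']
```

Writing the output before removing the inputs means a failed write leaves the original files untouched.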

### Step 3.3: Main Function to Execute Merge Run

This function pieces together all of the previous steps, and places them in a main function which is called at the end of the script. 

In [None]:
def main():
    run_merge(merged_json,
              delete_originals)
    end_time = datetime.datetime.now() 
    logging.info(f"Total run took {(end_time - start_time).total_seconds():.2f} seconds!")
    
main()

This function operates through the following steps:
1. Run the merge for the merged json object generated in Step 2 and delete originals flag set in Step 1 from the command line.
2. Log the total run time for the script.
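For the run time log, note that subtracting two `datetime` objects yields a `timedelta`, whose default string form is `H:MM:SS` rather than a count of seconds. The short sketch below uses fixed, made-up timestamps to show the difference between the two representations.

```python
import datetime

# Fixed example times, 90 seconds apart (made-up values).
start_time = datetime.datetime(2024, 1, 1, 12, 0, 0)
end_time = datetime.datetime(2024, 1, 1, 12, 1, 30)

elapsed = end_time - start_time
print(elapsed)                  # 0:01:30
print(elapsed.total_seconds())  # 90.0
```

`total_seconds()` is the form to use when the log message labels the value as seconds.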