# White Noise: Senate Summary Collection

## 1. Explaining the problem

Because gathering the data takes a long time, I collect the Senate summary data in this separate script, which I run on Azure ML. This lets me optimise my working schedule and helps me avoid connection issues. Since I have already saved the bill numbers for the House of Representatives and the Senate in the `bill_numbers_house.json` and `bill_numbers_senate.json` files, I can make my requests without fetching the unique bill identifiers again.
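
To give a rough sense of why the collection takes so long, the sketch below estimates the minimum runtime from the pauses between requests alone. Both figures are taken from later cells of this script (7864 Senate bills and a 4-second delay); the variable names are purely illustrative, and network latency is ignored, so the real runtime will be longer.

```python
# Rough lower bound on runtime, counting only the pauses between requests.
# Both figures come from later cells of this script; the names are illustrative.
total_bills = 7864      # number of Senate bills to fetch
delay_seconds = 4       # pause after each request

min_seconds = total_bills * delay_seconds
min_hours = min_seconds / 3600

print(f"At least {min_hours:.1f} hours of waiting ({min_seconds} seconds)")
```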

In [1]:
# Packages for handling APIs and .JSON files
import os
import requests
import json

# Packages for exercising the virtue of patience and monitoring loop completion
import time
from IPython.display import clear_output

## 2. Getting all the Summary Data

In [2]:
# Loading bill numbers for the Senate
with open("bill_numbers_senate.json", "r") as r:
    bill_numbers_senate = json.load(r)
    
# Setting the API key as an environment variable
os.environ["api_key"] = "I used to be an adventurer like you until I took an arrow to the knee."

# Setting the base URL
base_url = 'https://api.congress.gov'

# Setting a delay of 4 seconds between API requests
rate_limit_delay = 4

In [3]:
def get_bill_summaries(congress_number, bill_type, bill_number, file_type):
    
    # I define the starting endpoint, which employs the function's arguments instead of hard-coding values.
    endpoint = base_url + f"/v3/bill/{congress_number}/{bill_type}/{bill_number}/summaries?format={file_type}&api_key={os.environ.get('api_key')}"
    
    # I define the file path for the .JSON lines diagnostics file, which will contain all bill numbers and congress numbers
    # of all the summaries that could not be fetched
    diagnostics_path = "summary_diagnostics.jsonl"
    
    print("Contacting the congress.gov API...")
    response = requests.get(endpoint)
        
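    # I design a new "bill_type_message" string variable, which helps me to pretty print useful information for diagnostics.
            
    if bill_type == "hr":
        bill_type_message = "House of Representatives"
                
    elif bill_type == "s":
        bill_type_message = "Senate"

    # For any other bill type, I fall back to the raw code, so that the variable is always defined
    else:
        bill_type_message = bill_type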

    # In the API documentation at https://api.congress.gov, it is stated that if the response's status code is not 200,
    # then there was an error in retrieving data. However, in this case I implement the check directly when I try to save
    # the data into the "summary" object. The reason is simple: even when the bill number is meaningless - i.e., 42424242 -
    # the endpoint still returns a .JSON response, which does not contain the "summaries" key. Thus, data regarding the 
    # API request in itself - i.e., the congress and bill number, and the bill type - is always present, whereas the summary
    # text cannot be saved and the command yields a KeyError.
        
    # I do not need to print a message that keeps the user updated on the pagination script's status, since I prefer to do so
    # by printing information on loop completion.   
    
    # I try to save the data I need for my analysis in the "summary" dictionary
    try:
        # I save the .JSON response into a local variable - i.e., a dictionary
        summary_data = response.json()
        
        summary = {
            "congress": summary_data["request"]["congress"], # The congress number
            "bill_number": summary_data["request"]["billNumber"], # The bill number
            "bill_type": summary_data["request"]["billType"], # The bill type
            "text": summary_data["summaries"][0]["text"] # The first existing summary
            
            # The reason why I get the first existing summary is that I am substantively interested in the legislation first
            # introduced by a certain party in a given Congress branch, as I deem it to be the closest to what the party wants
            # to approve, and thus to the latter's ideology regarding either the economy, or socio-cultural issues. In other
            # words, I do not want the text to be confounded by bipartisan negotiations.
        }
    
    # If the summary does not exist or the response cannot be parsed, the "except" clause is activated
    except (KeyError, IndexError, TypeError, ValueError):
        
        # I first print an error message
        print(f"Error retrieving data for Bill number {bill_number} of the {congress_number}th {bill_type_message}.")

        # The computer writes the bill numbers, congress numbers, and bill types of all the documents that did not get fetched
        # into a .JSON lines file, which I can eventually inspect to double-check what happened
        with open(diagnostics_path, mode = "a") as w:
            diagnostic = {
                "bill_number": bill_number,
                "congress_number": congress_number,
                "bill_type": bill_type
            }
            
            json.dump(diagnostic, w)
            w.write("\n")

        # A failed request still counts towards the hourly limit, so I wait before returning
        time.sleep(rate_limit_delay)
        return None

    # I instruct the computer to take a 4-second break to avoid overshooting the limit of hourly requests.
    time.sleep(rate_limit_delay)
        
    # The function returns a dictionary with the first summary of a single bill from either the US House of
    # Representatives or the Senate, with the required information under the defined keys.
    return summary

In [4]:
senate_number = 7864 # Defining the total number of bills to be parsed to keep the user updated during the loop
file_path = "bill_summaries_senate.jsonl" # Defining the path to the .JSON lines output file for the Senate summaries
bill_counter = 0 # Defining a counter which will keep the user updated during the loop.

# I write the output of each single request into a .JSON lines file, to avoid losing data if something goes wrong.
with open(file_path, mode = "w") as w:
    
    # I loop over all congress and bill numbers in the bill_numbers_senate dictionary I created
    for congress, bill_numbers in bill_numbers_senate.items():
        for bill_number in bill_numbers:

            bill_counter += 1 # I update the bill counter
            clear_output() # I clear the output, to update the counter
            
            print(f"Getting summary number {bill_counter} out of {senate_number}")
        
            # I get the output of my custom function into a temporary dictionary
            summary = get_bill_summaries(congress, "s", bill_number, "json")
            
            # If the "summary" object is not empty, I write it into the .JSON lines output file.
            
            if summary is not None:
                json.dump(summary, w)
                w.write('\n')
            
            # Remember that if the custom function returns "None", there was an error, but I already saved the data that is
            # necessary to double-check what happened.

Getting summary number 7864 out of 7864
Contacting the congress.gov API...
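
The check performed inside `get_bill_summaries` can be isolated from the network call, which makes it easy to try on hand-written payloads. The sketch below uses a hypothetical helper, `extract_first_summary`, that is not part of the script above; the sample payloads are made up and only imitate the keys the script reads (`request` and `summaries`).

```python
def extract_first_summary(payload):
    """Return the fields the script keeps, or None if no summary is present.

    Hypothetical helper: it mirrors the try/except logic of get_bill_summaries
    but works on an already-parsed dictionary instead of a live response.
    """
    try:
        return {
            "congress": payload["request"]["congress"],
            "bill_number": payload["request"]["billNumber"],
            "bill_type": payload["request"]["billType"],
            "text": payload["summaries"][0]["text"],
        }
    except (KeyError, IndexError):
        # Missing "summaries" key or an empty list: nothing to save
        return None


# Made-up payloads that only imitate the keys read by the script
with_summary = {
    "request": {"congress": "117", "billNumber": "1", "billType": "s"},
    "summaries": [{"text": "First summary."}, {"text": "Later summary."}],
}
without_summary = {
    "request": {"congress": "117", "billNumber": "42424242", "billType": "s"},
}

print(extract_first_summary(with_summary))
print(extract_first_summary(without_summary))
```

Catching only `KeyError` and `IndexError` here is a deliberate choice for the sketch: those are the two ways a missing summary can surface once the response has already been parsed.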


## 3. Wrapping Up

On a final note, the `/summaries/{congress}/{billType}` API endpoint could have saved a lot of time: I could have paginated over the .JSON responses and fetched multiple bills per request instead of one bill per request. However, the endpoint does not work, which forces me to take the long way round. I used the `json` library rather than `jsonlines` because the latter appears to be unavailable on Microsoft Azure ML, the environment on which I ran the script. No diagnostics file was created, so all the data was retrieved! I now turn to data cleaning.
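
Data cleaning starts from the JSON Lines output, which can be read back with the standard `json` module alone. As a minimal sketch, the hypothetical helper `read_jsonl` below parses one JSON object per line. The demonstration writes two made-up records to a temporary file so that it runs on its own; in practice the path would be `bill_summaries_senate.jsonl`.

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Read a JSON Lines file into a list of dictionaries (hypothetical helper)."""
    records = []
    with open(path, "r") as r:
        for line in r:
            line = line.strip()
            if line:  # skip blank lines, e.g. a trailing newline
                records.append(json.loads(line))
    return records


# Made-up records written to a temporary file, purely for demonstration
sample = [
    {"congress": "117", "bill_number": "1", "bill_type": "s", "text": "A"},
    {"congress": "117", "bill_number": "2", "bill_type": "s", "text": "B"},
]
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as w:
    for record in sample:
        json.dump(record, w)
        w.write("\n")
    tmp_path = w.name

summaries = read_jsonl(tmp_path)
os.remove(tmp_path)
print(f"Loaded {len(summaries)} summaries")
```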