# White Noise: Metadata Collection

## 1. Explaining the problem

Since the data-gathering process takes a long time, I collect the bill metadata in this separate script, which I let run on the Microsoft Azure ML cloud while I work on the Supervised Machine Learning steps. I only need the metadata for the Data Analysis steps, so this lets me optimise my working schedule. Because I have already saved the bill numbers for the House of Representatives and the Senate in the `bill_numbers_house.json` and `bill_numbers_senate.json` files, I can make my requests without retrieving the unique bill identifiers again.

In [None]:
# Packages for handling APIs and .JSON files
import os
import requests
import json

# Packages for exercising the virtue of patience and monitoring loop completion
import time
from IPython.display import clear_output

## 2. Getting all the Bill Metadata

In [None]:
# Loading bill numbers for the House of Representatives
with open("bill_numbers_house.json", "r") as r:
    bill_numbers_house = json.load(r)

# Loading bill numbers for the Senate
with open("bill_numbers_senate.json", "r") as r:
    bill_numbers_senate = json.load(r)
    
# Setting the API key as an environment variable
os.environ["api_key"] = "I used to be an adventurer like you until I took an arrow to the knee."

# Setting the base URL
base_url = 'https://api.congress.gov'

# Setting a delay of 4 seconds between API requests
rate_limit_delay = 4
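
To give an idea of what the 4-second delay implies, the following is a rough estimate rather than a measurement. It assumes that each request is followed by the full delay and ignores the response time of the API; the bill counts are the totals I use further down as loop counters.

```python
# Back-of-the-envelope estimate of the time spent waiting between requests.
rate_limit_delay = 4    # seconds between API requests
house_bills = 13956     # total number of House bills to be fetched
senate_bills = 7864     # total number of Senate bills to be fetched

# Upper bound on the number of requests sent per hour with this delay
requests_per_hour = 3600 / rate_limit_delay

# Total number of requests and the hours of pure waiting they imply
total_requests = house_bills + senate_bills
sleep_hours = total_requests * rate_limit_delay / 3600

print(f"At most {requests_per_hour:.0f} requests per hour")
print(f"{total_requests} requests imply about {sleep_hours:.1f} hours of waiting")
```

This is why it pays off to let the script run in the cloud while I work on something else.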

In [None]:
def get_bill_metadata(congress_number, bill_type, bill_number, file_type):
    
    # I define the starting endpoint, which employs the function's arguments instead of hard-coding values.
    endpoint = base_url + f"/v3/bill/{congress_number}/{bill_type}/{bill_number}/?format={file_type}&api_key={os.environ.get('api_key')}"
    
    # I define the file path for the .JSON lines diagnostics file, which will contain all bill numbers and congress numbers
    # of all the metadata that could not be fetched
    diagnostics_path = "metadata_diagnostics.jsonl"
    
    print("Contacting the congress.gov API...")
    response = requests.get(endpoint)
    
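    # I design a new "bill_type_message" string variable, which helps me to pretty print useful information for diagnostics.

    if bill_type == "hr":
        bill_type_message = "House of Representatives"

    elif bill_type == "s":
        bill_type_message = "Senate"

    # Any other bill type falls back to the raw code, so that the error message below never refers to an undefined variable.
    else:
        bill_type_message = bill_type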

    # In the API documentation at https://api.congress.gov, it is stated that if the response's status code is not 200,
    # then there was an error in retrieving data. However, in this case I implement the check directly when I try to save
    # the data into the "metadata" object. The reason is simple: even when the bill number is meaningless - i.e., 42424242 -
    # the endpoint still returns a .JSON response, which does not contain the "bill" key. Thus, data regarding the 
    # API request in itself - i.e., the congress and bill number, and the bill type - is always present, whereas the
    # metadata cannot be saved and the command yields a KeyError.
        
    # I do not print a status message for each request here, since I prefer to report progress
    # from the loop that calls this function.

    # I try to save the data I need for my analysis in the "metadata" dictionary
    try:
        # I save the .JSON response into a local variable - i.e., a dictionary
        bill_metadata = response.json()
    
        metadata = {
            "congress": bill_metadata["request"]["congress"], # The congress number
            "bill_number": bill_metadata["request"]["billNumber"], # The bill number
            "bill_type": bill_metadata["request"]["billType"], # The bill type
            "policy_area": bill_metadata["bill"]["policyArea"]["name"], # The policy area
            "sponsor_name": bill_metadata["bill"]["sponsors"][0]["firstName"], # The bill's main sponsor's name
            "sponsor_lastname": bill_metadata["bill"]["sponsors"][0]["lastName"], # The bill's main sponsor's surname
            "sponsor_state": bill_metadata["bill"]["sponsors"][0]["state"], # The bill's main sponsor's State of election
            "sponsor_party": bill_metadata["bill"]["sponsors"][0]["party"] # The bill's main sponsor's party belonging
            
            # The reason why I get metadata for the first sponsor only - i.e., the main sponsor - and not for the
            # co-sponsors, is that I am substantively interested in the legislation introduced by a certain party in a given
            # Congress branch, as I deem it to be the closest to what the party wants to approve, and thus to the latter's
            # ideology regarding either the economy, or socio-cultural issues. Co-sponsors are usually not responsible for the
            # legislative proposal and process, so it would not make sense to account for them in my analysis.
        }
    
    # If the response is not valid JSON, or a required key is missing, the "except" clause is activated
    except (KeyError, IndexError, TypeError, ValueError):
        
        # I first print an error message
        print(f"Error retrieving metadata for Bill number {bill_number} of the {congress_number}th {bill_type_message}.")
        
        # The computer writes the bill numbers, congress numbers, and bill types of all the documents that did not get fetched
        # into a .JSON lines file, which I can later inspect to double-check what happened
        with open(diagnostics_path, mode = "a") as a:
            diagnostic = {
                "bill_number": bill_number,
                "congress_number": congress_number,
                "bill_type": bill_type # The value is the bill type, so the key is named accordingly
            }
            
            json.dump(diagnostic, a)
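            a.write("\n") # A newline character, so that each failed request sits on its own line

        # I still wait before returning, so that failed requests also respect the rate limit
        time.sleep(rate_limit_delay)
        return None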

    # I instruct the computer to take a 4-second break to avoid overshooting the hourly request limit.
    time.sleep(rate_limit_delay)
        
    # The function returns a dictionary with all the required metadata
    return metadata
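
The check described in the comments above can also be illustrated without contacting the API. The following is a minimal sketch, not the function I actually ran: `extract_metadata` is a hypothetical helper, the payloads are hand-made and only mirror the keys used above, and treating `policyArea` as optional is an assumption on my part.

```python
def extract_metadata(payload):
    """Build the metadata dictionary from a decoded response, or return None.

    Hypothetical helper: it reads the same keys as get_bill_metadata, but treats
    "policyArea" as optional (an assumption) instead of raising a KeyError.
    """
    request = payload.get("request", {})
    bill = payload.get("bill")

    # Without a "bill" entry or a first sponsor there is nothing worth saving
    if not bill or not bill.get("sponsors"):
        return None

    sponsor = bill["sponsors"][0]
    policy_area = (bill.get("policyArea") or {}).get("name")

    return {
        "congress": request.get("congress"),
        "bill_number": request.get("billNumber"),
        "bill_type": request.get("billType"),
        "policy_area": policy_area,
        "sponsor_name": sponsor.get("firstName"),
        "sponsor_lastname": sponsor.get("lastName"),
        "sponsor_state": sponsor.get("state"),
        "sponsor_party": sponsor.get("party"),
    }


# Hand-made payloads for illustration only, not real API responses
demo_request = {"congress": "113", "billNumber": "1", "billType": "hr"}
demo_sponsor = {"firstName": "Jane", "lastName": "Doe", "state": "XX", "party": "I"}

complete = {
    "request": demo_request,
    "bill": {"policyArea": {"name": "Health"}, "sponsors": [demo_sponsor]},
}
no_policy_area = {"request": demo_request, "bill": {"sponsors": [demo_sponsor]}}
no_bill = {"request": demo_request}

print(extract_metadata(complete)["policy_area"])
print(extract_metadata(no_policy_area)["policy_area"])
print(extract_metadata(no_bill))
```

With this variant, a response without the "bill" key still yields `None`, whereas a bill that only lacks a policy area would be kept with an empty field instead of ending up in the diagnostics file.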

In [None]:
house_number = 13956 # Defining the total number of bills to be parsed to keep the user updated during the loop
file_path = "bill_metadata_house.jsonl" # Defining the path to the .JSON lines output file for the House metadata
bill_counter = 0 # Defining a counter which will keep the user updated during the loop.

# I write the output of each single request into a .JSON lines file, to avoid losing data if something goes wrong.
with open(file_path, mode = "w") as w:
    
    # I loop over all congress and bill numbers in the bill_numbers_house dictionary I loaded
    for congress, bill_numbers in bill_numbers_house.items():
        for bill_number in bill_numbers:

            bill_counter += 1 # I update the bill counter
            clear_output() # I clear the output, to update the counter
            
            print(f"Getting metadata for bill number {bill_counter} out of {house_number}")
        
            # I get the output of my custom function into a temporary dictionary
            metadata = get_bill_metadata(congress, "hr", bill_number, "json")
            
            # If the custom function returned a dictionary and not "None", I write it into the .JSON lines output file.
            
            if metadata is not None:
                json.dump(metadata, w)
                w.write('\n')
            
            # Remember that if the custom function returns "None", there was an error, but I already saved the data that is
            # necessary to double-check what happened.

In [None]:
senate_number = 7864 # Defining the total number of bills to be parsed to keep the user updated during the loop
file_path = "bill_metadata_senate.jsonl" # Defining the path to the .JSON lines output file for the Senate metadata
bill_counter = 0 # Defining a counter which will keep the user updated during the loop.

# I write the output of each single request into a .JSON lines file, to avoid losing data if something goes wrong.
with open(file_path, mode = "w") as w:
    
    # I loop over all congress and bill numbers in the bill_numbers_senate dictionary I loaded
    for congress, bill_numbers in bill_numbers_senate.items():
        for bill_number in bill_numbers:

            bill_counter += 1 # I update the bill counter
            clear_output() # I clear the output, to update the counter
            
            print(f"Getting metadata for bill number {bill_counter} out of {senate_number}")
        
            # I get the output of my custom function into a temporary dictionary
            metadata = get_bill_metadata(congress, "s", bill_number, "json")
            
            # If the custom function returned a dictionary and not "None", I write it into the .JSON lines output file.
            
            if metadata is not None:
                json.dump(metadata, w)
                w.write('\n')
            
            # Remember that if the custom function returns "None", there was an error, but I already saved the data that is
            # necessary to double-check what happened.
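
Since the House and Senate loops only differ in their input dictionary, bill type, and output path, they could be folded into a single function. The following is a sketch of that idea and not the code I ran: `collect_metadata` and `fake_fetch` are hypothetical names, and the fake fetcher stands in for `get_bill_metadata` so that the example runs without contacting the API.

```python
import json
import os
import tempfile


def collect_metadata(bill_numbers_by_congress, bill_type, file_path, fetch):
    """Write one .JSON line per fetched bill and return (written, failed)."""
    written, failed = 0, 0

    with open(file_path, mode="w") as w:
        for congress, bill_numbers in bill_numbers_by_congress.items():
            for bill_number in bill_numbers:
                metadata = fetch(congress, bill_type, bill_number, "json")

                # "None" signals a failed request, which is only counted here
                if metadata is None:
                    failed += 1
                    continue

                json.dump(metadata, w)
                w.write("\n")
                written += 1

    return written, failed


def fake_fetch(congress, bill_type, bill_number, file_type):
    """Stand-in for get_bill_metadata: pretends that bill number 2 is missing."""
    if bill_number == 2:
        return None
    return {"congress": congress, "bill_type": bill_type, "bill_number": bill_number}


# Made-up bill numbers, written to a temporary folder for the demonstration
demo_numbers = {"113": [1, 2], "114": [1]}
demo_path = os.path.join(tempfile.mkdtemp(), "demo_metadata.jsonl")
demo_counts = collect_metadata(demo_numbers, "s", demo_path, fake_fetch)
print(demo_counts)
```

In the real setting, I would pass `get_bill_metadata` as the `fetch` argument, once with `"hr"` and once with `"s"`.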

## 3. Wrapping Up

The data collection procedure went smoothly, though not perfectly: the `bill_metadata_house.jsonl` and `bill_metadata_senate.jsonl` files do not contain metadata for all bills. The `metadata_diagnostics.jsonl` file shows that 76 bills could not be retrieved. I will take care of these rogue bills in a separate script and append them to the final metadata dataset, if possible. As in the other data collection procedures, I had to use the `json` package rather than the `jsonlines` package, because the latter does not appear to be available within Microsoft Azure ML. Moreover, no output is shown in this notebook because I let the process run on the server. If requested, I can show that the script runs as expected.
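
To inspect the failed requests, the diagnostics file can be read back and summarised per Congress. The following is a minimal sketch under a few assumptions: `summarise_diagnostics` is a hypothetical helper, the demonstration file is made up, and the function also tolerates records separated by a literal `/n`, in case an earlier run used that separator instead of a newline.

```python
import json
import os
import tempfile
from collections import Counter


def summarise_diagnostics(path):
    """Count the failed requests per congress number in a diagnostics file."""
    with open(path, "r") as r:
        text = r.read()

    # Also accept a literal "/n" between records (assumption: written by mistake)
    text = text.replace("/n", "\n")

    failures = Counter()
    for line in text.splitlines():
        line = line.strip()
        if line:
            record = json.loads(line)
            failures[str(record["congress_number"])] += 1

    return failures


# Made-up diagnostics records, written to a temporary folder for the demonstration
demo_records = [
    {"bill_number": 10, "congress_number": "113"},
    {"bill_number": 11, "congress_number": "113"},
    {"bill_number": 12, "congress_number": "114"},
]
demo_diagnostics = os.path.join(tempfile.mkdtemp(), "demo_diagnostics.jsonl")
with open(demo_diagnostics, "w") as w:
    for record in demo_records:
        json.dump(record, w)
        w.write("\n")

demo_failures = summarise_diagnostics(demo_diagnostics)
print(sum(demo_failures.values()), dict(demo_failures))
```

Run on `metadata_diagnostics.jsonl`, the total should match the number of bills that could not be retrieved.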