# White Noise: Metadata Diagnostics

## 1. Explaining the problem

The `Metadata Collection (Mattia Guarnerio - 14350920)` script collected most of the data smoothly, but not all of it. The `bill_metadata_house.jsonl` and `bill_metadata_senate.jsonl` files do not contain metadata for every bill: the `metadata_diagnostics.jsonl` file shows that 76 documents were not retrieved correctly. I take care of these rogue bills in this separate script.

In [1]:
# Packages for handling APIs and .JSON files
import os
import requests
import json

# Package for data wrangling
import pandas as pd

# Packages for exercising the virtue of patience and monitoring loop completion
import time
from IPython.display import clear_output

## 2. Unpacking Pandora's Box

I start by re-structuring the `metadata_diagnostics.jsonl` file into three separate lists:
1. `bill_numbers`, which contains the US bill numbers unique within each branch and mandate of Congress;
2. `congress_numbers`, which contains the US Congress mandate associated with each bill;
3. `bill_type`, which indicates the branch of US Congress where the single bill was introduced.

In [2]:
# I first create the three empty lists where I wish to store the data from the "metadata_diagnostics.jsonl" file

bill_numbers = []
congress_numbers = []
bill_type = []

# I open the .jsonl file with a context manager, which closes the file automatically once the loop is over
with open("metadata_diagnostics.jsonl", "r") as r:
    
    # I loop over each line...
    for line in r:
        
        # ...and load the line into a temporary "temp" object
        temp = json.loads(line)
        
        # I can now append the separate data to their corresponding list...
        bill_numbers.append(temp["bill_number"]) # Bill numbers
        congress_numbers.append(temp["congress_number"]) # Congress numbers
        bill_type.append(temp["file_type"]) # US Congress branches

In [3]:
# I check the first 10 elements of each list to assess whether this data wrangling step went smoothly
bill_numbers[:10]

['7162',
 '7134',
 '7135',
 '6895',
 '6829',
 '6257',
 '6218',
 '6191',
 '6171',
 '5623']

In [4]:
congress_numbers[:10]

['115', '115', '115', '115', '115', '115', '115', '115', '115', '115']

In [5]:
bill_type[:10]

['hr', 'hr', 'hr', 'hr', 'hr', 'hr', 'hr', 'hr', 'hr', 'hr']

In [6]:
# I check the total length of each list to assess whether this data wrangling step went smoothly
print(f"I must still retrieve {len(congress_numbers)} bill numbers, associated with {len(bill_numbers)} bills.")

I must still retrieve 76 bill numbers, associated with 76 bills.


In [7]:
print(f"Thus, I wish to contact api.congress.gov for {len(bill_type)} additional times.")

Thus, I wish to contact api.congress.gov for 76 additional times.
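
The length checks above only show that the three lists are equally long. A minimal sketch of a slightly stricter check follows, assuming that a bill is uniquely identified by its congress number, bill type, and bill number. The `check_alignment` helper and the toy lists are hypothetical and not part of the original script.

```python
from collections import Counter


def check_alignment(bill_numbers, congress_numbers, bill_type):
    """Return the (congress, type, number) triples that occur more than once.

    Raises a ValueError if the three lists differ in length, because zip()
    would otherwise silently truncate to the shortest one.
    """
    if not (len(bill_numbers) == len(congress_numbers) == len(bill_type)):
        raise ValueError("The three lists must have the same length.")

    triples = Counter(zip(congress_numbers, bill_type, bill_numbers))
    return [triple for triple, count in triples.items() if count > 1]


# Toy data for illustration only, not the content of the real diagnostics file
duplicates = check_alignment(
    ["7162", "7134", "7162"],
    ["115", "115", "115"],
    ["hr", "hr", "hr"],
)
print(duplicates)  # [('115', 'hr', '7162')]
```

An empty list would mean that no bill is queued twice, so each request in the retrieval loop targets a distinct document.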


In [8]:
# Setting the API key as an environment variable
os.environ["api_key"] = "I used to be an adventurer like you until I took an arrow to the knee."

# Setting the base URL
base_url = 'https://api.congress.gov'

# Setting a delay of 4 seconds between API requests
rate_limit_delay = 4
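
As a rough check on what the 4-second delay implies, the arithmetic can be sketched as follows. The figures only account for the pauses themselves and assume that every request succeeds. Network latency comes on top, so the runtime is a lower bound.

```python
rate_limit_delay = 4   # Seconds between requests, as set in the cell above
pending_requests = 76  # Number of bills listed in metadata_diagnostics.jsonl

# Total time spent sleeping if all requests succeed
min_runtime_seconds = pending_requests * rate_limit_delay

# Upper bound on the number of requests that fit into one hour with this delay
max_requests_per_hour = 3600 // rate_limit_delay

print(f"Lower bound on runtime: {min_runtime_seconds} seconds")  # 304 seconds
print(f"Ceiling on hourly requests: {max_requests_per_hour}")    # 900
```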

## 3. Metadata Retrieval

It's time to retrieve the rogue metadata. I design a data collection function that is analogous to the one I devised in the `Metadata Collection (Mattia Guarnerio - 14350920)` script. I visually inspected the URL endpoints of the missing metadata: the issue seems to lie in the `policyArea` key - i.e., the bill's policy area label - which is absent from the specific bills that caused problems in Microsoft Azure ML. Thus, I must modify the data retrieval function to account for this missing value.
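
One way to tolerate an optional key is to read it with `dict.get`, which returns `None` instead of raising a `KeyError`. The sketch below is only an illustration: the `extract_policy_area` helper is hypothetical, and the nesting of the label under `bill -> policyArea -> name` is an assumption about the response layout rather than something this notebook confirms.

```python
def extract_policy_area(bill_metadata):
    """Return the policy area label, or None when the key is absent."""
    # Assumption: the label is stored under bill -> policyArea -> name
    policy_area = bill_metadata.get("bill", {}).get("policyArea") or {}
    return policy_area.get("name")


# Hypothetical payloads, trimmed down to the keys that matter here
with_label = {"bill": {"policyArea": {"name": "Health"}}}
without_label = {"bill": {"title": "A bill without a policy area"}}

print(extract_policy_area(with_label))     # Health
print(extract_policy_area(without_label))  # None
```

With this approach a bill without a label would still yield a row, with an empty policy area, instead of being sent to the diagnostics file.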

In [9]:
def get_bill_metadata(congress_number, bill_type, bill_number, file_type):
    
    # I define the starting endpoint, which employs the function's arguments instead of hard-coding values.
    endpoint = base_url + f"/v3/bill/{congress_number}/{bill_type}/{bill_number}/?format={file_type}&api_key={os.environ.get('api_key')}"
    
    # I define the file path for the .JSON lines diagnostics file, which will contain all bill numbers and congress numbers
    # of all the metadata that could not be fetched
    diagnostics_path = "metadata_diagnostics_inception.jsonl"

    response = requests.get(endpoint)
    
    # I design a new "bill_type_message" string variable, which helps me to pretty print useful information for diagnostics.
            
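    if bill_type == "hr":
        bill_type_message = "House of Representatives"
                
    elif bill_type == "s":
        bill_type_message = "Senate"

    # Any other bill type falls back to a generic label, so that the error message below never uses an undefined variable
    else:
        bill_type_message = "Congress"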

    # In the API documentation at https://api.congress.gov, it is stated that if the response's status code is not 200,
    # then there was an error in retrieving data. However, in this case I implement the check directly when I try to save
    # the data into the "metadata" object. The reason is simple: even when the bill number is meaningless - i.e., 42424242 -
    # the endpoint still returns a .JSON response, which does not contain the "bill" key. Thus, data regarding the 
    # API request in itself - i.e., the congress and bill number, and the bill type - is always present, whereas the
    # metadata cannot be saved and the command yields a KeyError.
        
    # I do not need to print a message that keeps the user updated on the pagination script's status, since I prefer to do so
    # by printing information on loop completion.

    # I try to save the data I need for my analysis in the "metadata" dictionary
    try:
        # I save the .JSON response into a local variable - i.e., a dictionary
        bill_metadata = response.json()
    
        metadata = {
            "congress": bill_metadata["request"]["congress"], # The congress number
            "bill_number": bill_metadata["request"]["billNumber"], # The bill number
            "bill_type": bill_metadata["request"]["billType"], # The bill type
            "sponsor_name": bill_metadata["bill"]["sponsors"][0]["firstName"], # The bill's main sponsor's name
            "sponsor_lastname": bill_metadata["bill"]["sponsors"][0]["lastName"], # The bill's main sponsor's surname
            "sponsor_state": bill_metadata["bill"]["sponsors"][0]["state"], # The bill's main sponsor's State of election
            "sponsor_party": bill_metadata["bill"]["sponsors"][0]["party"] # The bill's main sponsor's party affiliation
            
            # The reason why I get metadata for the first sponsor - i.e., the main sponsor - only, and not for the rest of
            # the co-sponsors, is that I am substantively interested in the legislation introduced by a certain party in a given
            # Congress branch, as I deem it to be the closest to what the party wants to approve, and thus to the latter's
            # ideology regarding either the economy, or socio-cultural issues. Co-sponsors are usually not responsible for the
            # legislative proposal and process, so it would not make sense to account for them in my analysis.
        }
    
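    # If the metadata does not exist, the "except" clause is activated. I only catch the errors I expect here - i.e., a
    # missing key, an empty sponsor list, an unexpected data type, or a response that is not valid .JSON
    except (KeyError, IndexError, TypeError, ValueError):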
        
        # I first print an error message
        print(f"Error retrieving metadata for Bill number {bill_number} of the {congress_number}th {bill_type_message}.")
        
        # The computer writes the bill numbers, congress numbers, and bill types of all the documents that did not get fetched
        # into a .JSON lines file, which I can eventually inspect to second-check what happened
        with open(diagnostics_path, mode = "a") as a:
            diagnostic = {
                "bill_number": bill_number,
                "congress_number": congress_number,
                "file_type": bill_type
            }
            
            json.dump(diagnostic, a)
            a.write("\n") # The newline character separates the records in the .JSON lines file
            return None

    # I instruct the computer to take a 4-second break to avoid exceeding the limit of hourly requests.
    time.sleep(rate_limit_delay)
        
    # The function returns a dictionary with all the required metadata
    return metadata
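
The failure cases described in the comments can be reproduced offline, without contacting the API. The sketch below isolates the dictionary look-ups in a hypothetical `extract_sponsor` helper and feeds it invented payloads. It illustrates which errors the look-ups can raise and is not a replacement for the function above.

```python
def extract_sponsor(bill_metadata):
    """Pull the main sponsor's fields out of an already decoded response."""
    sponsor = bill_metadata["bill"]["sponsors"][0]
    return {
        "sponsor_name": sponsor["firstName"],
        "sponsor_lastname": sponsor["lastName"],
        "sponsor_state": sponsor["state"],
        "sponsor_party": sponsor["party"],
    }


# Invented payloads: a complete one, one without the "bill" key, one without sponsors
valid = {"bill": {"sponsors": [{"firstName": "Jane", "lastName": "Doe", "state": "NY", "party": "D"}]}}
no_bill = {"request": {"billNumber": "42424242"}}
no_sponsor = {"bill": {"sponsors": []}}

for payload in (valid, no_bill, no_sponsor):
    try:
        print(extract_sponsor(payload))
    except (KeyError, IndexError) as error:
        # A missing key raises a KeyError, an empty sponsor list an IndexError
        print(f"{type(error).__name__}: {error}")
```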

In [10]:
total_number = len(bill_numbers) # Defining the total number of bills to be parsed to keep the user updated during the loop
file_path = "bill_metadata_rogue.jsonl" # Defining the path to the .JSON lines output file for the rogue metadata
bill_counter = 0 # Defining a counter which will keep the user updated during the loop.

# I write the output of each single request into a .JSON lines file, to avoid losing data if something goes wrong.
with open(file_path, mode = "w") as w:
    
    # I loop over the congress numbers, bill numbers, and bill types stored in the three lists I created
    for congress, bill_number, branch in zip(congress_numbers, bill_numbers, bill_type):
        
        bill_counter += 1 # I update the bill counter
        
        # I do not need to clear the output to update the counter, because the files to retrieve are only 76!
        
        print(f"Catching rogue metadata for bill number {bill_counter} out of {total_number}...")
        
        # I get the output of my custom function into a temporary dictionary
        metadata = get_bill_metadata(congress, branch, bill_number, "json")
        
        # If the "metadata" object is not empty, I write it into the .JSON lines output file.
        if metadata is not None:
            json.dump(metadata, w)
            w.write('\n')
            
        # Remember that if the custom function returns "None", there was an error, but I already saved the data that is
        # necessary to second-check what happened.

print("\nI finally caught all those pesky rogue bills!")

Catching rogue metadata for bill number 1 out of 76...
Catching rogue metadata for bill number 2 out of 76...
Catching rogue metadata for bill number 3 out of 76...
Catching rogue metadata for bill number 4 out of 76...
Catching rogue metadata for bill number 5 out of 76...
Catching rogue metadata for bill number 6 out of 76...
Catching rogue metadata for bill number 7 out of 76...
Catching rogue metadata for bill number 8 out of 76...
Catching rogue metadata for bill number 9 out of 76...
Catching rogue metadata for bill number 10 out of 76...
Catching rogue metadata for bill number 11 out of 76...
Catching rogue metadata for bill number 12 out of 76...
Catching rogue metadata for bill number 13 out of 76...
Catching rogue metadata for bill number 14 out of 76...
Catching rogue metadata for bill number 15 out of 76...
Catching rogue metadata for bill number 16 out of 76...
Catching rogue metadata for bill number 17 out of 76...
Catching rogue metadata for bill number 18 out of 76...
C

## 4. Wrapping Up

The `bill_metadata_rogue.jsonl` file contains all the bills I was missing. The long data collection grind is over at last! Once automatic content labelling with SML is finished, I will combine this file with the `bill_metadata_house.jsonl` and `bill_metadata_senate.jsonl` documents into a single `DataFrame` object and save it as a `.csv` file as a backup. Then, I will merge the metadata with the bill summaries and predicted labels to obtain my final dataset.
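
The combination step could look roughly like the sketch below. It assumes that the three `.jsonl` files share the same keys. The `load_jsonl` and `combine_metadata` helpers, the `bill_metadata_all.csv` file name, and the toy rows are hypothetical. The example writes small stand-in files to a temporary folder so that it runs on its own.

```python
import json
import os
import tempfile

import pandas as pd


def load_jsonl(path):
    """Read a .jsonl file into a list of dictionaries, skipping blank lines."""
    with open(path, "r") as r:
        return [json.loads(line) for line in r if line.strip()]


def combine_metadata(paths, csv_path):
    """Stack several .jsonl metadata files and back the result up as a .csv file."""
    records = [record for path in paths for record in load_jsonl(path)]
    combined = pd.DataFrame(records)
    combined.to_csv(csv_path, index=False)
    return combined


# Toy stand-ins for the three metadata files, with invented rows
toy_files = {
    "bill_metadata_house.jsonl": {"congress": "115", "bill_number": "1"},
    "bill_metadata_senate.jsonl": {"congress": "115", "bill_number": "2"},
    "bill_metadata_rogue.jsonl": {"congress": "115", "bill_number": "7162"},
}

with tempfile.TemporaryDirectory() as folder:
    paths = []
    for name, row in toy_files.items():
        path = os.path.join(folder, name)
        with open(path, "w") as w:
            w.write(json.dumps(row) + "\n")
        paths.append(path)

    combined = combine_metadata(paths, os.path.join(folder, "bill_metadata_all.csv"))

print(combined.shape)  # (3, 2)
```

Building the `DataFrame` from the decoded dictionaries keeps the bill and congress numbers as strings, exactly as they are stored in the `.jsonl` files.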