# White Noise: Data Cleaning

## 1. Explaining the Problem

To start, I visually inspect the `bill_summaries_house.jsonl` and `bill_summaries_senate.jsonl` files. They contain 13956 and 7864 lines respectively, one for each document - i.e., bill - written down in a JSON-like - i.e., dictionary-like - structure. Each line contains an identifier for the bill itself - i.e., `bill_number` - the bill's Congress number - i.e., `congress`, the Congress mandate in which the bill was proposed - the bill's type - i.e., `bill_type`, which indicates whether the bill was originally introduced in the House of Representatives or in the Senate - and the summary itself - i.e., `text`. The bill summaries were probably scraped by the `api.congress.gov` developers, so they are still dirty with HTML boilerplate - i.e., tags and character escapes.

The steps of this data cleaning procedure are quite straightforward. First, I must re-structure the `.jsonl` data into a cleaning-friendly format - i.e., a `pandas` `DataFrame`. Second, I must join the data for the House and the Senate into a single data set. Third, I must pre-process the summaries contained in the `text` column to remove HTML tags and character escapes. I pre-process the data before the Supervised Machine Learning procedure because I want a clearer view of the texts when manually labelling them. Lastly, I must shuffle the `DataFrame` and save it into three separate `.csv` files: one with the documents I will manually label - i.e., `summary_labelled.csv` - one with the documents on which I will make predictions with my best trained classifier - i.e., `summary_unlabelled.csv` - and one with all documents - i.e., `summary_data.csv` - to keep a backup in case something happens.

In [1]:
# Packages for handling .jsonl and .csv files
import jsonlines
import csv

# Package for data cleaning
import pandas as pd

## 2. Unpacking Pandora's Box

I start by re-structuring the data regarding House bills into a `DataFrame` object.

In [2]:
# I first create an empty list to store all lines from the "bill_summaries_house.jsonl" file
house_summaries = []

# I open the .jsonl file, and read it with the following helper function
with jsonlines.open("bill_summaries_house.jsonl") as r:
    
    # I loop over each line...
    for line in r:
        
        # ...and I append it to the originally empty list!
        house_summaries.append(line)

# I now turn the list of lines - i.e., dictionaries - into a pandas DataFrame object
d_house = pd.DataFrame(house_summaries)

In [3]:
# I check the first few lines of the dataset to assess if this cleaning step went smoothly

d_house.head()

Unnamed: 0,congress,bill_number,bill_type,text
0,111,6523,hr,<p>Ike Skelton National Defense Authorization...
1,111,6561,hr,<p>History Is Learned from the Living Act - E...
2,111,81,hr,<p></p> <p>Shark Conservation Act of 2009 - A...
3,111,6533,hr,<p>Local Community Radio Act of 2010 - Amends...
4,111,6510,hr,<p>Directs the Administrator of General Servi...


In [4]:
# I check the last few lines of the dataset to assess if this cleaning step went smoothly

d_house.tail()

Unnamed: 0,congress,bill_number,bill_type,text
13951,115,123,hr,<p><b>FHA Alternative Credit Pilot Program Re...
13952,115,152,hr,<p><b>Prince Hall Freemasonry Stamp Act</b></...
13953,115,149,hr,"<p><b>Veterans, Women, Families with Children..."
13954,115,125,hr,<p><b>FHA In-Person Servicing Improvement Act...
13955,115,122,hr,<p><b>Original Living Wage Act of 2017</b></p...


In [5]:
# I check the DataFrame object's shape to assess if this cleaning step went smoothly

d_house.shape

(13956, 4)

I have 13956 rows - i.e., House bills - and four columns - i.e., identifiers and texts - in the `DataFrame` object. It appears everything went perfectly! Now, I turn to the data concerning Senate bills.

In [6]:
# I first create an empty list to store all lines from the "bill_summaries_senate.jsonl" file
senate_summaries = []

# I open the .jsonl file, and read it with the following helper function
with jsonlines.open("bill_summaries_senate.jsonl") as r:
    
    # I loop over each line...
    for line in r:
        
        # ...and I append it to the originally empty list!
        senate_summaries.append(line)

# I now turn the list of lines - i.e., dictionaries - into a pandas DataFrame object
d_senate = pd.DataFrame(senate_summaries)
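
Both loading cells follow the same pattern: read every line as a dictionary, then build a `DataFrame`. As a hedged alternative, `pandas` can read line-delimited JSON directly through `pd.read_json` with `lines=True`. The sketch below uses two made-up records held in memory (the record contents and the `d_sample` name are illustrative assumptions), so it runs without the real files.

```python
import io

import pandas as pd

# Two hypothetical records mimicking the structure of the bill summary files
sample_jsonl = io.StringIO(
    '{"congress": 111, "bill_number": 81, "bill_type": "hr", "text": "<p>Shark Conservation Act</p>"}\n'
    '{"congress": 115, "bill_number": 11, "bill_type": "s", "text": "<p><b>Example Act</b></p>"}\n'
)

# lines=True tells pandas that every line is a separate JSON object
d_sample = pd.read_json(sample_jsonl, lines=True)

print(d_sample.shape)
```

With the real data, the same call would presumably take the file path instead of the in-memory buffer. Note that `pd.read_json` may infer column types on its own, so the resulting dtypes could differ from those obtained through `jsonlines`.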

In [7]:
# I check the first few lines of the dataset to assess if this cleaning step went smoothly

d_senate.head()

Unnamed: 0,congress,bill_number,bill_type,text
0,111,841,s,<p>Pedestrian Safety Enhancement Act of 2009 ...
1,111,4036,s,<p>Amends the Federal Credit Union Act regard...
2,111,3903,s,<p>Permits land held in trust for the Ohkay O...
3,111,3874,s,<p>Reduction of Lead in Drinking Water Act- A...
4,111,3592,s,Designates the facility of the United States P...


In [8]:
# I check the last few lines of the dataset to assess if this cleaning step went smoothly

d_senate.tail()

Unnamed: 0,congress,bill_number,bill_type,text
7859,115,15,s,<p><strong>Iran Ballistic Missile Sanctions A...
7860,115,16,s,<p><b>Federal Reserve Transparency Act of 201...
7861,115,14,s,"<p><b>No Budget, No Pay Act</b></p> <p>This b..."
7862,115,13,s,<p><b>No Windfalls for Government Service Act...
7863,115,11,s,<p><strong>Jerusalem Embassy and Recognition ...


In [9]:
# I check the DataFrame object's shape to assess if this cleaning step went smoothly

d_senate.shape

(7864, 4)

I have 7864 rows - i.e., Senate bills - and four columns - i.e., identifiers and texts - in the `DataFrame` object. It appears everything went perfectly! Now, I must join the data for the House and the Senate into a single data set.

## 3. Joining the Summary Data

In [10]:
# I concatenate the two DataFrame objects vertically, ignoring their respective indexes because they do not provide any
# useful information, by setting "ignore_index" argument to "True"

d_final = pd.concat([d_house, d_senate], ignore_index = True)
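
To illustrate what `ignore_index` changes, here is a minimal sketch on two toy frames (the frames and their values are made up for the example). Without the argument, each frame keeps its own row labels and duplicates appear; with it, the rows are renumbered from zero.

```python
import pandas as pd

# Toy stand-ins for d_house and d_senate (hypothetical values)
toy_house = pd.DataFrame({"bill_type": ["hr", "hr"], "text": ["a", "b"]})
toy_senate = pd.DataFrame({"bill_type": ["s"], "text": ["c"]})

# Default behaviour: the original row labels are kept, so 0 appears twice
kept = pd.concat([toy_house, toy_senate])

# ignore_index=True: the rows get a fresh 0..n-1 index
renumbered = pd.concat([toy_house, toy_senate], ignore_index=True)

print(list(kept.index))        # [0, 1, 0]
print(list(renumbered.index))  # [0, 1, 2]
```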

In [11]:
# I check the first few lines of the dataset to assess if this cleaning step went smoothly

d_final.head()

Unnamed: 0,congress,bill_number,bill_type,text
0,111,6523,hr,<p>Ike Skelton National Defense Authorization...
1,111,6561,hr,<p>History Is Learned from the Living Act - E...
2,111,81,hr,<p></p> <p>Shark Conservation Act of 2009 - A...
3,111,6533,hr,<p>Local Community Radio Act of 2010 - Amends...
4,111,6510,hr,<p>Directs the Administrator of General Servi...


In [12]:
# I check the last few lines of the dataset to assess if this cleaning step went smoothly

d_final.tail()

Unnamed: 0,congress,bill_number,bill_type,text
21815,115,15,s,<p><strong>Iran Ballistic Missile Sanctions A...
21816,115,16,s,<p><b>Federal Reserve Transparency Act of 201...
21817,115,14,s,"<p><b>No Budget, No Pay Act</b></p> <p>This b..."
21818,115,13,s,<p><b>No Windfalls for Government Service Act...
21819,115,11,s,<p><strong>Jerusalem Embassy and Recognition ...


In [13]:
# I check the DataFrame object's shape to assess if this cleaning step went smoothly

d_final.shape

(21820, 4)

I have 21820 rows - i.e., bills - and four columns - i.e., identifiers and texts - in the `DataFrame` object. It appears everything went perfectly! Now, I must appropriately pre-process the texts to remove the HTML tags and character escapes.

## 4. Text Pre-Processing

I must greatly thank Wouter van Atteveldt, Damian Trilling, and Carlos Arcila for their regex suggestions to clean up HTML tags and character escapes. The regular expressions I employed were provided in *Chapter 9* of their book (2022), in *Table 9.3: Regular expression syntax in Python and R*, available at the following link: https://cssbook.net/content/chapter09.html#tbl-regexample.

In [14]:
# I remove HTML tags by applying the pandas ".str.replace" method on the pandas DataFrame's "text" column.
# More specifically, I replace all HTML tags with a whitespace, so that adjacent words do not get merged and
# remain recognisable during SML training. The pattern is a raw string, so that "\w" reaches the regex engine as-is.

d_final["text"] = d_final["text"].str.replace(r"</?\w[^>]*>", " ", regex = True)

In [15]:
# I remove HTML character escapes by applying the pandas ".str.replace" method on the pandas DataFrame's "text" column.
# More specifically, I replace all HTML character escapes with a whitespace, so that adjacent words do not get merged and
# remain recognisable during SML training.

d_final["text"] = d_final["text"].str.replace(r"&[^;]+;", " ", regex = True)
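
One caveat is worth illustrating: the pattern `&[^;]+;` matches from any ampersand up to the next semicolon, so a bare `&` in running text could take ordinary words with it. The sketch below contrasts it, on a made-up sentence, with a stricter pattern that only accepts named or numeric escapes. The stricter pattern and the toy sentence are illustrative assumptions, not part of the cleaning step above.

```python
import pandas as pd

# Hypothetical sentence with a bare ampersand followed by a real escape
toy = pd.Series(["Research & Development; grants &amp; loans"])

# Pattern used in the notebook: anything between "&" and the next ";"
broad = toy.str.replace(r"&[^;]+;", " ", regex=True)

# Stricter sketch: "&" must be followed by a name, a decimal code or a hex code
strict_pattern = r"&(?:#\d+|#x[0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);"
strict = toy.str.replace(strict_pattern, " ", regex=True)

print(broad[0])   # "Development" is swallowed together with the bare "&"
print(strict[0])  # only "&amp;" is replaced
```

If keeping the escaped characters matters more than removing them, the standard library's `html.unescape` could be applied instead, since it decodes escapes rather than deleting them.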

In [16]:
# I check the first few lines of the dataset to assess if this cleaning step went smoothly

d_final.head()

Unnamed: 0,congress,bill_number,bill_type,text
0,111,6523,hr,Ike Skelton National Defense Authorization A...
1,111,6561,hr,History Is Learned from the Living Act - Est...
2,111,81,hr,Shark Conservation Act of 2009 - Amends t...
3,111,6533,hr,Local Community Radio Act of 2010 - Amends t...
4,111,6510,hr,Directs the Administrator of General Service...


It seems that the HTML tags were removed, although I am not sure about the HTML character escapes. I will carry out a visual inspection of the final data sets with Notepad++ to make sure this cleaning step went smoothly. On a final note, I neither remove punctuation nor apply lowercasing to the summaries, since even the most basic tokenizer can do this automatically.
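
Alongside a visual inspection, a programmatic check can count the rows that still contain a tag or an escape. The following is a minimal sketch on a toy frame; the toy texts and the `escape_pattern` used for detection are assumptions made for the example.

```python
import pandas as pd

# Toy frame: one clean text, one with a leftover tag, one with a leftover escape
toy = pd.DataFrame(
    {"text": ["Clean summary.", "Left <b>tag</b> here", "Left &nbsp; escape"]}
)

tag_pattern = r"</?\w[^>]*>"   # same tag pattern as in the cleaning step
escape_pattern = r"&#?\w+;"    # hypothetical detector for named/numeric escapes

# Boolean masks flagging the rows where each pattern still occurs
has_tag = toy["text"].str.contains(tag_pattern, regex=True)
has_escape = toy["text"].str.contains(escape_pattern, regex=True)

print(has_tag.sum(), has_escape.sum())
```

Run on `d_final["text"]` after cleaning, both counts would be expected to be zero if the step went smoothly.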

## 5. Shuffling and Saving the Data Set

In [17]:
# I shuffle the DataFrame object by applying the pandas .sample method and feeding the integer "1" to the "frac" argument.
# "frac" stands for "fraction", so I effectively instruct the computer to sample the whole data set and shuffle it.
# 27 is the "random_state" value I often employ in my projects. It's the date of my birthday, for the sake of clarity.

d_final = d_final.sample(frac = 1, random_state = 27)
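
As a small illustration of what this cell does, the sketch below shuffles a toy frame twice with the same `random_state` and checks two things: the order is identical across runs, and every original row is still present exactly once. The toy frame is made up for the example.

```python
import pandas as pd

# Toy frame with eight rows (hypothetical values)
toy = pd.DataFrame({"text": list("abcdefgh")})

# Two shuffles with the same seed
first = toy.sample(frac=1, random_state=27)
second = toy.sample(frac=1, random_state=27)

# Same seed, same order
same_order = first.index.equals(second.index)

# frac=1 without replacement: a permutation of the original rows
is_permutation = sorted(first.index) == list(toy.index)

print(same_order, is_permutation)
```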

In [18]:
# I check the first few lines of the dataset to assess if this cleaning step went smoothly

d_final.head()

Unnamed: 0,congress,bill_number,bill_type,text
12905,115,1308,hr,Frank and Jeanne Moore Wild Steelhead Speci...
10480,115,4105,hr,This bill extends funding through FY2022 for...
18266,115,3691,s,Expanding Transparency of Information and S...
5127,111,1994,hr,Citizen Soldier Equality Act of 2009 - Requi...
6103,111,883,hr,"Amends the Internal Revenue Code to repeal, e..."


It seems that the data set was correctly shuffled, as indicated by the non-sequential index numbers and the presence of both House and Senate bills - i.e., `bill_type` - in the first few rows of the `DataFrame` object. As a final step, I slice the `DataFrame` object into two `.csv` files: `summary_labelled.csv` and `summary_unlabelled.csv`. I will manually label all summaries contained in the former, taking 2200 documents as my training, validation, and testing sample, which constitutes approximately 10% of all bills. There is not a lot of variation among these summaries, since they appear to be written and structured quite uniformly, so I deem this sub-sample to be sufficient for Supervised Machine Learning. I keep an additional data set with all documents as a backup.

In [19]:
# I save a backup of the whole data set as "summary_data.csv". I employ "|" as a separator to prevent the pd methods from
# confusing commas or semicolons within the texts with the actual separators. I set the "index" argument to False, because
# the indexes are completely meaningless and do not need to be saved.

d_final.to_csv("summary_data.csv", sep = "|", index = False)
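
To check that the `|` separator survives a round trip, the sketch below writes a toy frame to an in-memory buffer and reads it back with the same `sep`. The toy texts are made up and deliberately contain commas, a semicolon, and even a pipe; by default the `pandas` writer quotes fields that contain the separator, which is why the last case is still expected to be read back intact.

```python
import io

import pandas as pd

# Toy frame whose texts contain commas, a semicolon and a pipe (hypothetical values)
toy = pd.DataFrame(
    {"bill_type": ["hr", "s"], "text": ["Amends A, B; and C", "Contains a | pipe"]}
)

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, sep="|", index=False)

# Read the data back with the same separator
buffer.seek(0)
restored = pd.read_csv(buffer, sep="|")

print(restored["text"].tolist() == toy["text"].tolist())
```

When reloading any of the saved files later on, the same `sep = "|"` argument would need to be passed to `pd.read_csv`.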

In [20]:
# I slice the first 2200 documents into a temporary DataFrame object to construct the "summary_labelled.csv" data set

d_labelled = d_final[:2200]

The code for the following cell was inspired by the `.insert` method in the `pandas` library documentation, which can be found at https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.insert.html.

In [21]:
# I add two empty columns to make the manual labelling process easier, saving time on typing the "|" character
# I must add the first column to the right of "text", which sits at position 3 (counting from 0), hence the "loc"
# argument is equal to 4. The same logic applies to the following column. Both get an empty string as an input,
# signalled by the "value" argument.

d_labelled.insert(loc = 4, column = "economic", value = "")
d_labelled.insert(loc = 5, column = "socio_cultural", value = "")
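
As a quick illustration of how `loc` positions the new columns, here is a minimal sketch on a one-row toy frame (the row's values are made up). The toy frame is copied before inserting so that the original stays untouched.

```python
import pandas as pd

# One-row toy frame with the same four columns as the real data (hypothetical values)
toy = pd.DataFrame(
    {"congress": [111], "bill_number": [81], "bill_type": ["hr"], "text": ["Example summary"]}
)

# Work on an explicit copy, then insert at positions 4 and 5 (0-based)
toy_labelled = toy[:1].copy()
toy_labelled.insert(loc=4, column="economic", value="")
toy_labelled.insert(loc=5, column="socio_cultural", value="")

print(list(toy_labelled.columns))
# ['congress', 'bill_number', 'bill_type', 'text', 'economic', 'socio_cultural']
```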

In [22]:
d_labelled.head()

Unnamed: 0,congress,bill_number,bill_type,text,economic,socio_cultural
12905,115,1308,hr,Frank and Jeanne Moore Wild Steelhead Speci...,,
10480,115,4105,hr,This bill extends funding through FY2022 for...,,
18266,115,3691,s,Expanding Transparency of Information and S...,,
5127,111,1994,hr,Citizen Soldier Equality Act of 2009 - Requi...,,
6103,111,883,hr,"Amends the Internal Revenue Code to repeal, e...",,


In [23]:
# I check the temporary DataFrame object's shape to assess if this cleaning step went smoothly

d_labelled.shape

(2200, 6)

In [24]:
# I save the sliced DataFrame as "summary_labelled.csv". I employ "|" as a separator to prevent the pd methods from confusing
# commas or semicolons within the texts with the actual separators. I set the "index" argument to False, because the indexes
# are completely meaningless and do not need to be saved. The empty label columns also prevent me from mistyping separators
# on the .csv file and ruining the document's structure.

d_labelled.to_csv("summary_labelled.csv", sep = "|", index = False)

In [25]:
# I slice the rest of the documents into a temporary DataFrame object to construct the "summary_unlabelled.csv" data set

d_unlabelled = d_final[2200:]

In [26]:
# I check the temporary DataFrame object's shape to assess if this cleaning step went smoothly

d_unlabelled.shape

(19620, 4)

In [27]:
# I save the sliced DataFrame as "summary_unlabelled.csv". I employ "|" as a separator to prevent the pd methods from confusing
# commas or semicolons within the texts with the actual separators. I set the "index" argument to False, because the indexes
# are completely meaningless and do not need to be saved.

d_unlabelled.to_csv("summary_unlabelled.csv", sep = "|", index = False)
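
As a final sanity check, the two slices should form a partition of the shuffled data: no document in both, and none left out. The sketch below verifies this on a toy frame, with `n_labelled = 3` playing the role of 2200; the toy frame and the variable names are assumptions made for the example.

```python
import pandas as pd

# Toy shuffled frame with ten documents (hypothetical values)
toy = pd.DataFrame({"text": [f"summary {i}" for i in range(10)]})
toy = toy.sample(frac=1, random_state=27)

n_labelled = 3  # plays the role of 2200 in the notebook
toy_labelled = toy[:n_labelled]
toy_unlabelled = toy[n_labelled:]

# No document appears in both slices...
no_overlap = toy_labelled.index.intersection(toy_unlabelled.index).empty

# ...and together they cover the whole data set
complete = len(toy_labelled) + len(toy_unlabelled) == len(toy)

print(no_overlap, complete)
```

The shapes reported above are consistent with this: 2200 labelled plus 19620 unlabelled documents add up to the 21820 rows of the joined data set.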

## 6. Wrapping Up

Since the bill metadata is still being collected at the moment, I will unpack and clean it in the same way in a separate notebook, joining it with the fully labelled data set when the Supervised Machine Learning procedures are over. On a visual inspection of `summary_data.csv`, it appears that the summary texts were correctly pre-processed, as I cannot see any HTML tags or character escapes.