# White Noise: Data Labelling

## 1. Explaining the Problem

I manually annotated 2200 bill summaries, stored in the `summary_labelled.csv` file. Each bill received two binary labels: Economic (1) or Non-Economic (0), depending on whether the bill realises its proponent's economic vision, and Socio-Cultural (1) or Non-Socio-Cultural (0), depending on whether it realises its proponent's socio-cultural vision. I conceived and treated the two categories as separate during the labelling process. To save as much time as possible when classifying the documents, I used numbers rather than character strings. I now wish to check that the labelling procedure went smoothly, and re-map all labels as character values.

In [1]:
# In this script's context, I only need the pandas package for data wrangling 
import pandas as pd

## 2. Safety Checks

In [2]:
# I import the annotated data set with the "read_csv" function, specifying that I used the "|" separator
# This is crucial, because employing colons or semi-colons causes conflicts with the summaries' contents

d = pd.read_csv("summary_labelled.csv", sep = "|")

In [3]:
# I check the first few lines of the DataFrame object to assess if the "read_csv" command worked smoothly

d.head()

| | congress | bill_number | bill_type | text | economic | socio_cultural |
|---|---|---|---|---|---|---|
| 0 | 115 | 1308 | hr | Frank and Jeanne Moore Wild Steelhead Speci... | 0 | 1 |
| 1 | 115 | 4105 | hr | This bill extends funding through FY2022 for... | 1 | 0 |
| 2 | 115 | 3691 | s | Expanding Transparency of Information and S... | 0 | 1 |
| 3 | 111 | 1994 | hr | Citizen Soldier Equality Act of 2009 - Requi... | 1 | 1 |
| 4 | 111 | 883 | hr | Amends the Internal Revenue Code to repeal, e... | 1 | 1 |


In [4]:
# I check the shape of the DataFrame object to assess if the "read_csv" command worked smoothly

d.shape

(2200, 6)

2200 classified documents, six columns - i.e., the original four columns I retrieved from `api.congress.gov`, plus the two columns that contain the categories I manually annotated. Everything seems perfect! Now, I check if there were any typos or misses during the labelling procedure, and count the number of positive and negative labels for each category.

In [5]:
# There should only be two values - i.e., "0" and "1" - if no typos happened during the labelling process.
econ_unique = d["economic"].unique()
print("Unique values for the 'economic' category:", econ_unique)

# I count the number of positive and negative labels for the "economic" category
econ_count = d["economic"].value_counts()
print("Value counts for the 'economic' category:", econ_count)

Unique values for the 'economic' category: [0 1]
Value counts for the 'economic' category: 1    1233
0     967
Name: economic, dtype: int64


In [6]:
# There should only be two values - i.e., "0" and "1" - if no typos happened during the labelling process.
sc_unique = d["socio_cultural"].unique()
print("Unique values for the 'socio-cultural' category:", sc_unique)

# I count the number of positive and negative labels for the "socio_cultural" category
sc_count = d["socio_cultural"].value_counts()
print("Value counts for the 'socio-cultural' category:", sc_count)

Unique values for the 'socio-cultural' category: [1 0]
Value counts for the 'socio-cultural' category: 1    1371
0     829
Name: socio_cultural, dtype: int64


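It appears that no incorrect values were typed, and that positive labels are the majority in both categories: 1233 of 2200 summaries (about 56%) are Economic, and 1371 of 2200 (about 62%) are Socio-Cultural. Since 1233 + 1371 exceeds 2200 by 404, at least 404 summaries carry a positive label in both categories. This makes sense, since my categories are pretty broad and generic in nature, as I will explain in the report. Next, I check whether there are some bill summaries that were not annotated at all.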
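
The value counts above are marginal: they describe each category on its own, not how the two labels combine. A cross-tabulation is one way to see the joint distribution. The sketch below is only illustrative and runs on a small, made-up DataFrame (`toy`, a hypothetical stand-in for `d`), so its numbers say nothing about the real data.

```python
import pandas as pd

# Hypothetical toy data standing in for the real DataFrame "d" (2200 rows).
toy = pd.DataFrame({
    "economic":       [1, 1, 0, 0, 1, 0],
    "socio_cultural": [1, 0, 1, 0, 1, 1],
})

# Rows: "economic" label; columns: "socio_cultural" label.
# margins=True adds an "All" row and column with the totals.
joint = pd.crosstab(toy["economic"], toy["socio_cultural"], margins=True)
print(joint)

# Number of documents with a positive label in at least one category.
at_least_one = int(((toy["economic"] == 1) | (toy["socio_cultural"] == 1)).sum())
print("Documents with at least one positive label:", at_least_one)
```

Replacing `toy` with `d` would give the same table for the annotated summaries.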

In [7]:
# I calculate the number of missing values for the two categories of interest...
econ_mv = d["economic"].isnull().sum()
sc_mv = d["socio_cultural"].isnull().sum()

# ...and I print the diagnostics to see whether I appropriately coded all summaries.
print("Missing values in the 'economic' category:", econ_mv)
print("Missing values in the 'socio-cultural' category:", sc_mv)

print(f"I appropriately coded {len(d['economic'])-econ_mv} documents as (non-)economic.")
print(f"I appropriately coded {len(d['socio_cultural'])-sc_mv} documents as (non-)socio-cultural.")

Missing values in the 'economic' category: 0
Missing values in the 'socio-cultural' category: 0
I appropriately coded 2200 documents as (non-)economic.
I appropriately coded 2200 documents as (non-)socio-cultural.


Great! I now proceed to re-map all numeric labels as character values.
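
The checks in this section (allowed values and missing values) could also be bundled into a single helper, so that they can be re-run whenever the annotated file changes. This is a minimal sketch under my own assumptions: the function name `check_binary_labels` and the two toy DataFrames are hypothetical and not part of the original script.

```python
import pandas as pd

def check_binary_labels(df, columns, allowed=(0, 1)):
    """Return a dict of problems per column; an empty dict means every check passed."""
    problems = {}
    for col in columns:
        # Summaries that were not annotated at all.
        n_missing = int(df[col].isnull().sum())
        # Values that are present but outside the allowed set (e.g. a typo such as 11).
        unexpected = sorted(set(df[col].dropna().tolist()) - set(allowed))
        if n_missing or unexpected:
            problems[col] = {"missing": n_missing, "unexpected": unexpected}
    return problems

# Hypothetical toy data: one clean frame, one with a typo and a missing label.
clean = pd.DataFrame({"economic": [0, 1, 1], "socio_cultural": [1, 0, 1]})
dirty = pd.DataFrame({"economic": [0, 1, 11], "socio_cultural": [1, None, 1]})

clean_report = check_binary_labels(clean, ["economic", "socio_cultural"])
dirty_report = check_binary_labels(dirty, ["economic", "socio_cultural"])
print(clean_report)
print(dirty_report)
```

An empty dictionary for the real DataFrame would correspond to the outcome reported above.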

## 3. Re-Mapping All Labels

In [8]:
# I define the new character labels in the "re_map" dictionary...

re_map = {
    "economic": {1: "Economic", 0: "Non-Economic"},
    "socio_cultural": {1: "Socio-Cultural", 0: "Non-Socio-Cultural"}
}

# ...and I feed the dictionary as the only argument of the ".map" method provided by the pandas package,
# selecting only the appropriate key for its corresponding column.

d["economic"] = d["economic"].map(re_map["economic"]) # The "economic" column...
d["socio_cultural"] = d["socio_cultural"].map(re_map["socio_cultural"]) # ...and the "socio_cultural" column.

In [9]:
# I check the first few lines of the DataFrame object to assess if the ".map" method worked smoothly.

d.head()

| | congress | bill_number | bill_type | text | economic | socio_cultural |
|---|---|---|---|---|---|---|
| 0 | 115 | 1308 | hr | Frank and Jeanne Moore Wild Steelhead Speci... | Non-Economic | Socio-Cultural |
| 1 | 115 | 4105 | hr | This bill extends funding through FY2022 for... | Economic | Non-Socio-Cultural |
| 2 | 115 | 3691 | s | Expanding Transparency of Information and S... | Non-Economic | Socio-Cultural |
| 3 | 111 | 1994 | hr | Citizen Soldier Equality Act of 2009 - Requi... | Economic | Socio-Cultural |
| 4 | 111 | 883 | hr | Amends the Internal Revenue Code to repeal, e... | Economic | Socio-Cultural |


In [10]:
# There should only be two values - i.e., "Non-Economic" and "Economic" - if no errors happened during the re-mapping process.
econ_unique = d["economic"].unique()
print("Unique values for the 'economic' category:", econ_unique)

# I count the number of positive and negative labels for the "economic" category
econ_count = d["economic"].value_counts()
print("Value counts for the 'economic' category:", econ_count)

Unique values for the 'economic' category: ['Non-Economic' 'Economic']
Value counts for the 'economic' category: Economic        1233
Non-Economic     967
Name: economic, dtype: int64


In [11]:
# There should only be two values - i.e., "Non-Socio-Cultural" and "Socio-Cultural" -
# if no errors happened during the re-mapping process.
sc_unique = d["socio_cultural"].unique()
print("Unique values for the 'socio-cultural' category:", sc_unique)

# I count the number of positive and negative labels for the "socio_cultural" category
sc_count = d["socio_cultural"].value_counts()
print("Value counts for the 'socio-cultural' category:", sc_count)

Unique values for the 'socio-cultural' category: ['Socio-Cultural' 'Non-Socio-Cultural']
Value counts for the 'socio-cultural' category: Socio-Cultural        1371
Non-Socio-Cultural     829
Name: socio_cultural, dtype: int64


The only unique values within the two columns of interest are the ones resulting from the re-mapping procedure, and the numbers of positive and negative labels for both categories are unchanged! In other words, everything went smoothly. I can finally save the data set's definitive version as `labelled.csv`.
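
The reason this check matters is that `.map` with a dictionary does not raise an error for values missing from the dictionary: it silently turns them into `NaN`. The sketch below illustrates that behaviour on a made-up Series containing a deliberate typo; the names `labels`, `mapping`, and `preserved` are hypothetical.

```python
import pandas as pd

# Hypothetical toy labels; the value 2 is a deliberate typo.
labels = pd.Series([1, 0, 1, 2], name="economic")
mapping = {1: "Economic", 0: "Non-Economic"}

mapped = labels.map(mapping)

# Keys absent from the dictionary become NaN, so missing values
# after re-mapping reveal labels that were never mapped.
n_unmapped = int(mapped.isnull().sum())
print("Unmapped labels:", n_unmapped)

# For every mapped label, the count before and after should be identical.
preserved = all(
    int((labels == old).sum()) == int((mapped == new).sum())
    for old, new in mapping.items()
)
print("Counts preserved for mapped labels:", preserved)
```

On the real data, zero unmapped labels and unchanged counts are exactly what the two cells above confirm.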

In [12]:
# I save the DataFrame object's definitive version as "labelled.csv".
# I employ "|" as a separator to prevent the pd methods from confusing colons or semi-colons within the texts with actual
# separators. I set the "index" argument to False, because the indexes are completely meaningless and do not need to be saved.

d.to_csv("labelled.csv", sep = "|", index = False)
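
# As a final sanity check, the file could be read back to confirm that the "|" separator survives a round trip.
# The sketch below is only an illustration: it writes a made-up DataFrame ("toy") to an in-memory buffer rather
# than to disk, and all of its names and texts are hypothetical.

```python
import io
import pandas as pd

# Hypothetical toy data with punctuation that could clash with other separators.
toy = pd.DataFrame({
    "text": [
        "Amends the Code to repeal, extend; and revise: part A",
        "Uses a | inside the text",
    ],
    "economic": ["Economic", "Non-Economic"],
})

# Write to an in-memory buffer with the same arguments used for "labelled.csv".
buffer = io.StringIO()
toy.to_csv(buffer, sep="|", index=False)

# Read the buffer back with the same separator.
buffer.seek(0)
restored = pd.read_csv(buffer, sep="|")

# The round trip is successful if columns and cell values are unchanged.
round_trip_ok = (
    list(restored.columns) == list(toy.columns)
    and restored.values.tolist() == toy.values.tolist()
)
print("Round trip preserved the data:", round_trip_ok)
```

# Fields that contain the separator itself are quoted on writing by default, which is why the second toy text survives.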

## 4. Wrapping Up

I am now ready to design a Supervised Machine Learning (SML) pipeline. I will first follow a more "classic" approach, training a classifier on Bag-of-Words features, and then move to the state-of-the-art technique of fine-tuning a BERT transformer. I will carry out the two training procedures in separate scripts. I do not wish to run my "classic" SML pipeline on Google Colab, as I prefer to work in the safest conditions possible when fine-tuning my BERT models on Google's GPUs; that is, I want to take extreme care to avoid session time-outs while working within the platform.