# White Noise: Downstream Task

## 1. Explaining the Problem

Now that I have fine-tuned my BERT transformers, it is time to employ them to predict the labels of interest on the unlabelled bill summaries. However, this is not as straightforward as it seems. I must encode the unlabelled data, feed it appropriately to the fine-tuned models within the Torch framework, and save the predicted labels in a human-readable format within a new `unlabelled.csv` dataset. The latter will then be joined with the `labelled.csv` dataset and the metadata to form my final version of the US bill summaries dataset.

This notebook is heavily inspired by Anne Kroon's [version](https://github.com/uvacw/teaching-bdaca/blob/main/modules/machinelearning-text-exercises/transformers_bert_classification.ipynb) of the [BERT for Humanists Fine-Tuning for a Classification task tutorial](https://colab.research.google.com/drive/19jDqa5D5XfxPU6NQef17BC07xQdRnaKU?usp=sharing), designed by Maria Antoniak, Melanie Walsh, and the [BERT for Humanists](https://melaniewalsh.github.io/BERT-for-Humanists/) Team.

These are the steps involved in using BERT for making predictions:

    1. Convert the textual data into a format that BERT can process.
    2. Create a custom Torch dataset object to store the tokenized data.
    3. Load the fine-tuned BERT models.
    4. Wrap the dataset objects and feed them to the models to predict the labels.
    5. Convert the predictions into character labels.
    6. Associate the character labels to their respective bill summaries and save them into a new .csv file.

In [1]:
# Installing the "transformers" package, in an older version (4.28.0)
!pip3 install transformers==4.28.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers==4.28.0
  Downloading transformers-4.28.0-py3-none-any.whl (7.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.0/7.0 MB 80.5 MB/s eta 0:00:00
Collecting huggingface-hub<1.0,>=0.11.0 (from transformers==4.28.0)
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 224.5/224.5 kB 29.8 MB/s eta 0:00:00
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers==4.28.0)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.8/7.8 MB 83.2 MB/s eta 0:00:00
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.14.1 tokenizers-0.13.3 transform

I employ an older version of the `transformers` package because in the recent `4.29.0` release the name `PartialState` is not defined, leading to a `NameError` when trying to initialise the training parameters within the `Trainer` object. I must greatly thank the user `amyeroberts`, who suggested downgrading the `transformers` package within the following thread: https://github.com/huggingface/transformers/issues/22816.

In [2]:
# General packages
import gzip
import json
import pickle
import random
import sys
import csv
from collections import defaultdict

# Packages for data handling and cleaning
import numpy as np
import pandas as pd

# Packages for handling BERT transformers and making predictions
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification

In [3]:
# Packages for mounting Drive and setting the working directory
import os
from google.colab import drive

# Mounting my Drive on Google Colab
drive.mount('/content/drive')

# Setting the working directory
os.chdir(r"/content/drive/My Drive/Colab Notebooks/")

Mounted at /content/drive


## **2. Unpacking the Unlabelled Data**

In [4]:
# I start by importing the "summary_unlabelled.csv" data set as a DataFrame object within the CoLab environment.
# I crucially specify the "|" separator, because employing commas or semicolons causes conflicts with the summaries' contents.

d = pd.read_csv("summary_unlabelled.csv", sep = "|")

In [5]:
# I check the first few lines of the DataFrame object to assess if the "read_csv" command worked smoothly

d.head()

| | congress | bill_number | bill_type | text |
|---|---|---|---|---|
| 0 | 115 | 5405 | hr | This bill amends the Agricultural Act of 201... |
| 1 | 115 | 1980 | s | Renewable Chemicals Act of 2017 This bil... |
| 2 | 115 | 4769 | hr | Helping Americans Seek Treatment Act of 201... |
| 3 | 111 | 1558 | s | Travel Reimbursement for Inactive Duty Trai... |
| 4 | 115 | 3932 | hr | Healthcare Expenditures for Low-income Popu... |


In [6]:
# I check the shape of the DataFrame object to assess if the "read_csv" command worked smoothly

d.shape

(19620, 4)

19620 unlabelled documents and the original four columns I retrieved from `api.congress.gov`. Everything seems perfect! Since this dataset carries no labels, I only need to convert the column where I stored the textual data, i.e. `text`, into a list, which I will later tokenize and feed to the fine-tuned models.

The following cell's code is inspired by the official documentation for the `tolist()` `pandas` method, available at https://pandas.pydata.org/docs/reference/api/pandas.Series.tolist.html.

In [7]:
# I unpack the US bill summaries into a list format object

text = d["text"].tolist()

In [8]:
# I check the list's first 5 elements and overall length to assess whether this data wrangling step went smoothly.

text[:5]

['  This bill amends the Agricultural Act of 2014 to provide an exception to the rule that prohibits the Department of Agriculture from providing price loss coverage payments or agriculture risk coverage payments to a producer on a farm if the sum of the base acres on the farm is 10 acres or less. The exception applies if the sum of the base acres on the farm, when combined with the base acres of other farms in which the producer has an interest, is more than 10 acres. ',
 '   Renewable Chemicals Act of 2017    This bill amends the Internal Revenue Code to allow: (1) a business-related tax credit for the production of renewable chemicals, and (2) a tax credit for investment in renewable chemical production facilities.   The bill defines "renewable chemical" as any chemical that: (1) is produced in the United States from renewable biomass; (2) is sold or used for the production of chemical products, polymers, plastics, or formulated products or as chemicals, polymers, plastics, or formu

In [9]:
print(f"The unlabelled data list's total length is {len(text)}.")

The unlabelled data list's total length is 19620.


All the unlabelled data was correctly dumped into the `text` list. I now set some variables that will help me to avoid hard-coding some key values, such as the directories where my fine-tuned BERT models are stored.

In [10]:
# I set the maximum number of tokens in each document to be 512, which is the maximum length for BERT models.
max_length = 512

# I will run my code on NVIDIA GPUs using Google CoLab's program management system.
device_name = "cuda"

# I define the directory where I saved my fine-tuned LEGAL-BERT model for the Economic / Non-Economic classification task.
save_directory_econ = "legal_bert_econ"

# I define the directory where I saved my fine-tuned base BERT for the Socio-Cultural / Non-Socio-Cultural classification task.
save_directory_sc = "base_bert_sc"

# To import LEGAL-BERT's AutoTokenizer, I refer to the "nlpaueb/legal-bert-base-uncased" HuggingFace transformer.
model_name_econ = "nlpaueb/legal-bert-base-uncased"

# To import base BERT's AutoTokenizer, I refer to the "bert-base-uncased" HuggingFace transformer.
model_name_sc = "bert-base-uncased"

## **3. Encoding Data for BERT**

To prepare my bill summaries data for drawing predictions with the fine-tuned BERT models, I need to encode the texts in a way that the models can understand. Here are the steps I must follow:

1. Map the integer class ids to their string labels, so that the numeric predictions can be converted back into readable labels later.
2. Tokenize the texts, which involves breaking them up into individual words and then splitting the words into "word pieces" that can be matched with their corresponding embedding vectors.
3. Truncate texts that are longer than 512 tokens, or pad texts that are shorter than 512 tokens with a special padding token.
4. Add special tokens to the beginning and end of each document, including a start token, a separator between sentences, and a padding token as necessary.
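
Steps 3 and 4 can be illustrated with a toy sketch that does not call the real tokenizer. The `encode_toy` helper, its whitespace splitting, and the `max_length` of 8 are hypothetical simplifications chosen for readability; the actual `AutoTokenizer` additionally splits words into word pieces and maps them to vocabulary ids.

```python
def encode_toy(text, max_length=8):
    """Toy stand-in for a BERT tokenizer: whitespace tokens only, no word pieces."""
    tokens = text.lower().split()
    # Reserve two positions for the special [CLS] and [SEP] tokens, then truncate
    tokens = tokens[: max_length - 2]
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    # The attention mask marks real tokens with 1 and padding with 0
    attention_mask = [1] * len(tokens)
    # Pad on the right until the sequence reaches max_length
    n_pad = max_length - len(tokens)
    tokens += ["[PAD]"] * n_pad
    attention_mask += [0] * n_pad
    return {"tokens": tokens, "attention_mask": attention_mask}


short_doc = encode_toy("This bill amends the Act")
long_doc = encode_toy("This bill amends the Internal Revenue Code to allow a tax credit")
print(short_doc["tokens"])
print(long_doc["tokens"])
```

The short text ends with one `[PAD]` token, while the long text is cut so that `[SEP]` lands exactly on the last position. Both sequences end up with the same length, which is what allows them to be stacked into one tensor.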

In [11]:
# I import my automatic BERT tokenizers from the respective pre-trained model's directories within HuggingFace
tokenizer_econ = AutoTokenizer.from_pretrained(model_name_econ) # LEGAL-BERT
tokenizer_sc = AutoTokenizer.from_pretrained(model_name_sc) # Base BERT

Downloading (…)okenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.02k [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/222k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

I now generate a mapping of integer keys to my Economic / Non-Economic, and Socio-Cultural / Non-Socio-Cultural labels, to be able to quickly convert the numeric predictions drawn from the fine-tuned BERT models to meaningful character labels. 

In [12]:
# a. Economic / Non-Economic

id2label_econ = {
    0: "Non-Economic",
    1: "Economic"
}

# b. Socio-Cultural / Non-Socio-Cultural

id2label_sc = {
    0: "Non-Socio-Cultural",
    1: "Socio-Cultural"
}

In [13]:
id2label_econ.items() # I check the items of the "id2label_econ" dictionary...

dict_items([(0, 'Non-Economic'), (1, 'Economic')])

In [14]:
id2label_sc.items() # ...and the "id2label_sc" dictionary.

dict_items([(0, 'Non-Socio-Cultural'), (1, 'Socio-Cultural')])

Next, I tokenize all the US bill summaries with the `AutoTokenizer` of their corresponding model.

In [15]:
# I encode the unlabelled bill summaries with the pre-trained AutoTokenizers from the HuggingFace
# library. The truncation, padding, and "max_length" parameters are set to ensure that all the tokenized
# sequences are of the same length.

# 1. The truncation parameter ensures that any sequences longer than "max_length" (512)
# are truncated to the specified maximum length.

# 2. The padding parameter, when set to True, pads any shorter sequences with special [PAD] tokens
# up to the longest sequence in the batch, which truncation caps at "max_length" (512).

# 3. The "max_length" parameter specifies the maximum length of the tokenized sequences - i.e., 512.

econ_encodings = tokenizer_econ(text, truncation = True, padding = True, max_length = max_length) # Economic
sc_encodings = tokenizer_sc(text, truncation = True, padding = True, max_length = max_length) # Socio-Cultural

After the encoding procedure, I examine the newly created sets to check whether there are any issues. This is a bill summary in the `econ_encodings` set:

In [16]:
# I take the first document in the "econ_encodings" set, and I join its first 100 tokens by a whitespace to get a sneak peek.

" ".join(econ_encodings[0].tokens[0:100])

'[CLS] this bill amends the agricultural act of 2014 to provide an exception to the rule that prohibits the department of agricultur ##e from providing price loss coverage payments or agricultur ##e risk coverage payments to a producer on a farm if the sum of the base acres on the farm is 10 acres or less . the exception applies if the sum of the base acres on the farm , when combined with the base acres of other farm ##s in which the producer has an interest , is more than 10 acres . [SEP] [PAD] [PAD] [PAD] [PAD]'

This is the same bill summary in the `sc_encodings` set:

In [17]:
# I take the same document in the "sc_encodings" set, and I join its first 100 tokens by a whitespace to get a sneak peek.

" ".join(sc_encodings[0].tokens[0:100])

'[CLS] this bill amend ##s the agricultural act of 2014 to provide an exception to the rule that prohibits the department of agriculture from providing price loss coverage payments or agriculture risk coverage payments to a producer on a farm if the sum of the base acres on the farm is 10 acres or less . the exception applies if the sum of the base acres on the farm , when combined with the base acres of other farms in which the producer has an interest , is more than 10 acres . [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]'

## **4. Preparing the Input Data**

I now save the tokenized data into two separate custom Torch datasets. I use the custom Torch `MyDataset` class to make two `dataset` objects, analogously to what I did in the training scripts.

In [18]:
# The MyDataset custom class uses PyTorch's Dataset class as its parent class. 
# It takes in tokenized text data and returns these data points in a format suitable for use in PyTorch models.

class MyDataset(torch.utils.data.Dataset):

  # I define the __init__ method, which initializes the MyDataset object with one attribute: "encodings".

    def __init__(self, encodings):
        self.encodings = encodings # "encodings" is a dictionary containing the tokenized text data.

  # I create the __getitem__ method, which defines how each data point is returned
  # from the dataset. The "idx" parameter is used to index into the dataset to
  # retrieve a specific data point.

    def __getitem__(self, idx):
        # The method generates a dictionary item that contains the tokenized text data for the corresponding idx index.

        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        # The torch.tensor() function is used to convert the dictionary values to PyTorch tensors.
        # The "key" variable contains the keys in the encodings dictionary, and "val"
        # contains the corresponding values for the current idx index.
    
        # I return the item as the method's output.
        return item
    
    # At last, I generate the "__len__" method, which returns the dataset's length as its output.
    # The latter is equal to any one of the lists of values contained within the "encodings" dictionary.
    def __len__(self):
        return len(self.encodings["input_ids"])
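
The indexing pattern inside `__getitem__` can be tried out on plain Python lists. The sketch below is only an illustration: `toy_encodings` and `get_item` are hypothetical names, the ids are arbitrary toy values, and the conversion to tensors is left out so that the example runs without Torch.

```python
# Plain lists stand in for the tokenizer output (hypothetical toy values)
toy_encodings = {
    "input_ids": [[101, 2023, 102, 0], [101, 3021, 2003, 102]],
    "attention_mask": [[1, 1, 1, 0], [1, 1, 1, 1]],
    "token_type_ids": [[0, 0, 0, 0], [0, 0, 0, 0]],
}


def get_item(encodings, idx):
    # Same indexing pattern as MyDataset.__getitem__, minus the tensor conversion
    return {key: val[idx] for key, val in encodings.items()}


first = get_item(toy_encodings, 0)
n_docs = len(toy_encodings["input_ids"])
print(first)
print(n_docs)
```

Each item is a dictionary with one entry per input that the model expects, and the dataset length is simply the number of tokenized documents.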

In [19]:
# I now apply the MyDataset custom class to the tokenized unlabelled texts.

econ_dataset = MyDataset(econ_encodings) # Economic / Non-Economic
sc_dataset = MyDataset(sc_encodings) # Socio-Cultural / Non-Socio-Cultural

I inspect the newly created custom Torch datasets to check whether there are any issues. This is a bill summary in the Torch `econ_dataset`:

In [20]:
# I take the first document in "econ_dataset" dataset, and I join its first 100 tokens by a whitespace to get a sneak peek.

" ".join(econ_dataset.encodings[0].tokens[0:100])

'[CLS] this bill amends the agricultural act of 2014 to provide an exception to the rule that prohibits the department of agricultur ##e from providing price loss coverage payments or agricultur ##e risk coverage payments to a producer on a farm if the sum of the base acres on the farm is 10 acres or less . the exception applies if the sum of the base acres on the farm , when combined with the base acres of other farm ##s in which the producer has an interest , is more than 10 acres . [SEP] [PAD] [PAD] [PAD] [PAD]'

This is the same bill summary in the Torch `sc_dataset`:

In [21]:
# I take the same document in "sc_dataset" dataset, and I join its first 100 tokens by a whitespace to get a sneak peek.

" ".join(sc_dataset.encodings[0].tokens[0:100])

'[CLS] this bill amend ##s the agricultural act of 2014 to provide an exception to the rule that prohibits the department of agriculture from providing price loss coverage payments or agriculture risk coverage payments to a producer on a farm if the sum of the base acres on the farm is 10 acres or less . the exception applies if the sum of the base acres on the farm , when combined with the base acres of other farms in which the producer has an interest , is more than 10 acres . [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]'

The custom Torch datasets are appropriately set up. It is time to initialise the fine-tuned BERT models and make my predictions.

## **5. Initialising the Fine-Tuned BERT Models**
I am now ready to load the fine-tuned BERT models and move them to the GPU through CUDA (Compute Unified Device Architecture) for efficient computation. I repeat the process for each classification task. Remember that I am employing the fine-tuned `LEGAL-BERT` for the Economic / Non-Economic classification, and the fine-tuned base BERT for the Socio-Cultural / Non-Socio-Cultural classification.

In [22]:
model_econ = AutoModelForSequenceClassification.from_pretrained(save_directory_econ, num_labels = len(id2label_econ)).to(device_name)
model_sc = AutoModelForSequenceClassification.from_pretrained(save_directory_sc, num_labels = len(id2label_sc)).to(device_name)

I now wrap an iterable around the two custom Torch datasets to enable easy access to the samples and the employment of batches with the efficient `DataLoader` tool. This solution was inspired by the PyTorch Team's tutorial for beginners on Datasets and Dataloaders, which is available [here](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html#preparing-your-data-for-training-with-dataloaders). It is thanks to the `DataLoader` tool that I can load batches of 180 documents at a time on Google CoLab's GPU for the most efficient performance. This means that I will only need 109 iterations per classification task when predicting the labels. 
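
The figure of 109 iterations follows from the dataset size and the batch size. As a quick check with the standard library only, assuming the last partial batch is kept (the `DataLoader` default):

```python
import math

n_documents = 19620  # length of the unlabelled text list reported above
batch_size = 180     # batch size passed to DataLoader

# A loader that keeps the last partial batch yields ceil(n / batch_size) batches
n_batches = math.ceil(n_documents / batch_size)
remainder = n_documents % batch_size

print(n_batches)  # number of iterations per classification task
print(remainder)  # 0 means that every batch is full
```

Since 180 divides 19620 exactly, all 109 batches contain the same number of documents.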

In [23]:
# Wrapping an iterable around my Torch datasets with the "DataLoader" tool
dl_econ = DataLoader(econ_dataset, batch_size = 180)
dl_sc = DataLoader(sc_dataset, batch_size = 180)

# I calculate the number of total batches for each classification task to update the user on the loop's completion
total_batches_econ = len(dl_econ)
total_batches_sc = len(dl_sc)

## **6. Making My Predictions**

I am now ready to make my predictions. I start with the Economic / Non-Economic classification task. The prediction loop is quite complex to explain without grounding one's interpretation on the script, so I illustrate it directly in the following cells' comments.

The solution is partially inspired by the user `Rohan Shetty`'s response to [this](https://datascience.stackexchange.com/questions/32651/what-is-the-use-of-torch-no-grad-in-pytorch) StackExchange thread. I must greatly thank this user for informing me that, when one does not want to perform training, e.g. during validation, testing, or any downstream task, wrapping the code in the `torch.no_grad` context manager reduces memory usage and speeds up computations.

Another inspiration is [this](https://curiousily.com/posts/sentiment-analysis-with-bert-and-hugging-face-using-pytorch-and-python/) tutorial by Venelin Valkov, which helped me to correctly retrieve the input tensors from each batch. I must greatly thank the user `CLopez138` on StackOverflow for having led me to this tutorial, through [this](https://stackoverflow.com/questions/69820318/predicting-sentiment-of-raw-text-using-trained-bert-model-hugging-face) thread. Without it, I would never have understood how to feed data to the model appropriately to get the logit predictions.
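
Before reading the loop, it may help to see the step from logits to labels in isolation. The sketch below uses plain Python lists instead of tensors; `toy_logits`, `id2label_toy`, and `argmax_row` are hypothetical names, and the logit values are invented for illustration.

```python
id2label_toy = {0: "Non-Economic", 1: "Economic"}

# Hypothetical logits for a batch of three documents: one row per document,
# one column per class (the real values come from the model's output)
toy_logits = [
    [2.1, -1.3],
    [-0.4, 0.9],
    [0.2, 1.7],
]


def argmax_row(row):
    # Index of the largest value in the row, which is what an argmax
    # over the class dimension returns for each document
    return max(range(len(row)), key=lambda i: row[i])


predicted_ids = [argmax_row(row) for row in toy_logits]
predicted_labels = [id2label_toy[i] for i in predicted_ids]
print(predicted_ids)
print(predicted_labels)
```

The class with the highest logit wins, so no softmax is needed when only the predicted label matters.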

In [24]:
# a. Economic / Non-Economic

counter = 0 # I create a counter to keep the user updated during the loop
predictions_econ = [] # I create an empty list where I store the batches of predictions

# I do not want to train the model - i.e., update the model's parameters - therefore
# I call Torch's "no_grad()" method, which allows me to disable gradient calculation
# to reduce the memory usage and speed up computations.

with torch.no_grad():
    # I loop over every batch in the "dl_econ" DataLoader object
    for batch in dl_econ:

        # I retrieve the required input tensors from the given batch and I send them to the GPU
        input_ids = batch["input_ids"].to(device_name)
        attention_mask = batch["attention_mask"].to(device_name)
        token_type_ids = batch["token_type_ids"].to(device_name)

        counter += 1 # I update the counter to update the user on the loop's completion

        # I print a message that serves to inform the user on the loop's completion
        print(f"Predicting batch number {counter} out of {total_batches_econ}...")

        # I get the predictions from the model by feeding it the required input tensors
        # I select the first element from the output, which is the logit that I need
        logits_econ = model_econ(input_ids, attention_mask = attention_mask, token_type_ids = token_type_ids)[0]

        # I convert the logits into a one-dimensional tensor of integers (0 or 1) that signal the predicted class of each document
        predictions_econ_num = torch.argmax(logits_econ, dim = 1)

        # I convert the predicted labels to the actual labels by using the label mapping I set up beforehand
        # I use a list comprehension by looping over the "predictions_econ_num" object while converting it into a list
        # For each label in the list, I apply the dictionary for re-mapping
        predictions_econ_lab = [id2label_econ[lab] for lab in predictions_econ_num.tolist()]

        # I append the predictions in the character format to the originally empty "predictions_econ" list
        predictions_econ.append(predictions_econ_lab)

print("\nAll predictions are drawn.")

Predicting batch number 1 out of 109...
Predicting batch number 2 out of 109...
Predicting batch number 3 out of 109...
Predicting batch number 4 out of 109...
Predicting batch number 5 out of 109...
Predicting batch number 6 out of 109...
Predicting batch number 7 out of 109...
Predicting batch number 8 out of 109...
Predicting batch number 9 out of 109...
Predicting batch number 10 out of 109...
Predicting batch number 11 out of 109...
Predicting batch number 12 out of 109...
Predicting batch number 13 out of 109...
Predicting batch number 14 out of 109...
Predicting batch number 15 out of 109...
Predicting batch number 16 out of 109...
Predicting batch number 17 out of 109...
Predicting batch number 18 out of 109...
Predicting batch number 19 out of 109...
Predicting batch number 20 out of 109...
Predicting batch number 21 out of 109...
Predicting batch number 22 out of 109...
Predicting batch number 23 out of 109...
Predicting batch number 24 out of 109...
Predicting batch number 2

Shots fired! I now check whether the loop functioned correctly.

In [25]:
# I check whether the length of the "predictions_econ" list of lists is 109, equal to the total number of batches
print(len(predictions_econ))

109


I must now flatten the list of lists, unpacking the sublists, to get the predictions for the 19620 unlabelled documents into a format that can be easily and orderly saved within a `DataFrame` object.

In [26]:
# I design a list comprehension to loop over all the sublists, and the values within the sublists.
# This list comprehension helps me flatten the original lists of lists!

predictions_econ_flat = [val for sublist in predictions_econ for val in sublist]
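
As a side note, the same flattening can be written with `itertools.chain.from_iterable` from the standard library. The sketch below compares both forms on a hypothetical toy list, `toy_batches`, and is not meant to replace the cell above.

```python
from itertools import chain

# Toy list of per-batch label lists (hypothetical values)
toy_batches = [["Economic", "Non-Economic"], ["Economic"], ["Non-Economic", "Economic"]]

# Nested list comprehension, as in the cell above
flat_comprehension = [val for sublist in toy_batches for val in sublist]

# Equivalent standard-library alternative
flat_chain = list(chain.from_iterable(toy_batches))

print(flat_comprehension)
print(flat_comprehension == flat_chain)
```

Both versions keep the original order of the documents, which matters because the labels are later assigned to the DataFrame by position.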

In [27]:
# I check whether the "predictions_econ_flat" list's length is 19620, equal to the total number of unlabelled documents
print(len(predictions_econ_flat))

19620


In [28]:
# I check whether the "predictions_econ_flat" list contains "Economic" and "Non-Economic" labels.
print(predictions_econ_flat[:5])

['Economic', 'Economic', 'Non-Economic', 'Economic', 'Economic']


Everything seems to have gone smoothly! Before moving on to the Socio-Cultural / Non-Socio-Cultural classification task, I want to get the total number of documents that were classified as either Economic or Non-Economic. I do this with the efficient `Counter` class from the `collections` module. This solution was inspired by `Vidul`'s response to [this](https://stackoverflow.com/questions/69820318/predicting-sentiment-of-raw-text-using-trained-bert-model-hugging-face) StackOverflow thread. I must greatly thank this user for introducing me to this very useful tool.

In [29]:
# Package for counting unique values within a list
from collections import Counter

# I count the occurrences of each label in the list with the "Counter" function...
counterattack_econ = Counter(predictions_econ_flat)

# ...and I print them by looping over the "counterattack_econ" dictionary
for label, count in counterattack_econ.items():
    print(f"{label}: {count}")

Economic: 11013
Non-Economic: 8607
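
The raw counts are easier to judge as shares of the total. A small sketch, with the two counts copied from the output above and `label_counts` and `shares` as illustrative names:

```python
from collections import Counter

# Counts copied from the output above
label_counts = Counter({"Economic": 11013, "Non-Economic": 8607})
total = sum(label_counts.values())

# Share of each label in percent, rounded to one decimal place
shares = {label: round(100 * count / total, 1) for label, count in label_counts.items()}
print(total)
print(shares)
```

Roughly 56% of the unlabelled summaries are predicted as Economic.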


The relative numbers of Economic / Non-Economic classifications are quite believable. A notable majority of documents in the `labelled.csv` dataset were also coded as "Economic", although the model is probably slightly inflating their share here. I expected this to happen, as all the tested models had somewhat of a problem with inflating positive labels. I now turn to the Socio-Cultural / Non-Socio-Cultural classification task.

In [30]:
# b. Socio-Cultural / Non-Socio-Cultural

counter = 0 # I create a counter to keep the user updated during the loop
predictions_sc = [] # I create an empty list where I store the batches of predictions

# I do not want to train the model - i.e., update the model's parameters - therefore
# I call Torch's "no_grad()" method, which allows me to disable gradient calculation
# to reduce the memory usage and speed up computations.

with torch.no_grad():

    # I loop over every batch in the "dl_sc" DataLoader object
    for batch in dl_sc:

        # I retrieve the required input tensors from the given batch and I send them to the GPU
        input_ids = batch["input_ids"].to(device_name)
        attention_mask = batch["attention_mask"].to(device_name)
        token_type_ids = batch["token_type_ids"].to(device_name)

        counter += 1 # I update the counter to update the user on the loop's completion

        # I print a message that serves to inform the user on the loop's completion
        print(f"Predicting batch number {counter} out of {total_batches_sc}...")

        # I get the predictions from the model by feeding it the required input tensors
        # I select the first element from the output, which is the logit that I need
        logits_sc = model_sc(input_ids, attention_mask = attention_mask, token_type_ids = token_type_ids)[0]

        # I convert the logits into a one-dimensional tensor of integers (0 or 1) that signal the predicted class of each document
        predictions_sc_num = torch.argmax(logits_sc, dim = 1)

        # I convert the predicted labels to the actual labels by using the label mapping I set up beforehand
        # I use a list comprehension by looping over the "predictions_sc_num" object while converting it into a list
        # For each label in the list, I apply the dictionary for re-mapping
        predictions_sc_lab = [id2label_sc[lab] for lab in predictions_sc_num.tolist()]

        # I append the predictions in the character format to the originally empty "predictions_sc" list
        predictions_sc.append(predictions_sc_lab)

print("\nAll predictions are drawn.")

Predicting batch number 1 out of 109...
Predicting batch number 2 out of 109...
Predicting batch number 3 out of 109...
Predicting batch number 4 out of 109...
Predicting batch number 5 out of 109...
Predicting batch number 6 out of 109...
Predicting batch number 7 out of 109...
Predicting batch number 8 out of 109...
Predicting batch number 9 out of 109...
Predicting batch number 10 out of 109...
Predicting batch number 11 out of 109...
Predicting batch number 12 out of 109...
Predicting batch number 13 out of 109...
Predicting batch number 14 out of 109...
Predicting batch number 15 out of 109...
Predicting batch number 16 out of 109...
Predicting batch number 17 out of 109...
Predicting batch number 18 out of 109...
Predicting batch number 19 out of 109...
Predicting batch number 20 out of 109...
Predicting batch number 21 out of 109...
Predicting batch number 22 out of 109...
Predicting batch number 23 out of 109...
Predicting batch number 24 out of 109...
Predicting batch number 2

Shots fired! I now check whether the loop functioned correctly.

In [31]:
# I check whether the length of the "predictions_sc" list of lists is 109, equal to the total number of batches
print(len(predictions_sc))

109


I must now flatten the list of lists, unpacking the sublists, to get the predictions for the 19620 unlabelled documents into a format that can be easily and orderly saved within a `DataFrame` object.

In [32]:
# I design a list comprehension to loop over all the sublists, and the values within the sublists.
# This list comprehension helps me flatten the original lists of lists!

predictions_sc_flat = [val for sublist in predictions_sc for val in sublist]

In [33]:
# I check whether the "predictions_sc_flat" list's length is 19620, equal to the total number of unlabelled documents
print(len(predictions_sc_flat))

19620


In [34]:
# I check whether the "predictions_sc_flat" list contains "Socio-Cultural" and "Non-Socio-Cultural" labels.

print(predictions_sc_flat[:5])

['Non-Socio-Cultural', 'Socio-Cultural', 'Socio-Cultural', 'Non-Socio-Cultural', 'Socio-Cultural']


Everything seems to have gone smoothly! Before moving on to save the predictions into the final `unlabelled.csv` dataset, I want to get the total number of documents that were classified as either Socio-Cultural or Non-Socio-Cultural. I do this with the efficient `Counter` class from the `collections` module. This solution was inspired by `Vidul`'s response to [this](https://stackoverflow.com/questions/69820318/predicting-sentiment-of-raw-text-using-trained-bert-model-hugging-face) StackOverflow thread. I must greatly thank this user for introducing me to this very useful tool.

In [35]:
from collections import Counter

# I count the occurrences of each label in the list with the "Counter" function...
counterattack_sc = Counter(predictions_sc_flat)

# ...and I print them by looping over the "counterattack_sc" dictionary
for label, count in counterattack_sc.items():
    print(f"{label}: {count}")

Non-Socio-Cultural: 6022
Socio-Cultural: 13598


The relative numbers of Socio-Cultural / Non-Socio-Cultural classifications are quite believable. A sharp majority of documents in the `labelled.csv` dataset were also coded as "Socio-Cultural", although the model is probably slightly inflating their share here. I expected this to happen, as all the tested models had somewhat of a problem with inflating positive labels.

## **7. Wrapping Up**

Now that the predictions have been generated, it is time to save them, in their original order, in the new `unlabelled.csv` dataset, which I will subsequently join with the `labelled.csv` dataset and the US bill summary metadata in the script for metadata cleaning, obtaining the final version of my dataset.

In [36]:
# I generate two new empty columns in the unlabelled dataset, with names that mirror the ones of the labelled dataset
d["economic"] = None # Economic / Non-Economic
d["socio_cultural"] = None # Socio-Cultural / Non-Socio-Cultural

# I now assign the respective lists of labels to their corresponding column in the unlabelled dataset
d["economic"] = predictions_econ_flat # Economic / Non-Economic
d["socio_cultural"] = predictions_sc_flat # Socio-Cultural / Non-Socio-Cultural
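
The assignment above is positional: the first prediction goes to the first row, and so on. This works because the `DataLoader` objects were created without shuffling, which is the default. A minimal sanity check on the first five predictions of each task, copied from the outputs above (`econ_head`, `sc_head`, and `joint_counts` are illustrative names):

```python
from collections import Counter

# First five predictions of each task, copied from the outputs above
econ_head = ["Economic", "Economic", "Non-Economic", "Economic", "Economic"]
sc_head = ["Non-Socio-Cultural", "Socio-Cultural", "Socio-Cultural",
           "Non-Socio-Cultural", "Socio-Cultural"]

# Column assignment is positional, so both lists need one entry per row
same_length = len(econ_head) == len(sc_head)

# Count how often each pair of labels occurs together
joint_counts = Counter(zip(econ_head, sc_head))
for pair, count in joint_counts.items():
    print(pair, count)
```

Counting label pairs in this way also gives a first impression of how the two classifications overlap.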

In [37]:
# I check the first and last few lines of the DataFrame object to assess if the assignment worked smoothly
d.head()

| | congress | bill_number | bill_type | text | economic | socio_cultural |
|---|---|---|---|---|---|---|
| 0 | 115 | 5405 | hr | This bill amends the Agricultural Act of 201... | Economic | Non-Socio-Cultural |
| 1 | 115 | 1980 | s | Renewable Chemicals Act of 2017 This bil... | Economic | Socio-Cultural |
| 2 | 115 | 4769 | hr | Helping Americans Seek Treatment Act of 201... | Non-Economic | Socio-Cultural |
| 3 | 111 | 1558 | s | Travel Reimbursement for Inactive Duty Trai... | Economic | Non-Socio-Cultural |
| 4 | 115 | 3932 | hr | Healthcare Expenditures for Low-income Popu... | Economic | Socio-Cultural |


In [38]:
d.tail()

| | congress | bill_number | bill_type | text | economic | socio_cultural |
|---|---|---|---|---|---|---|
| 19615 | 111 | 1063 | hr | Amends the Energy Independence and Security ... | Non-Economic | Socio-Cultural |
| 19616 | 111 | 2839 | s | Torture Victims Relief Reauthorization Act o... | Economic | Socio-Cultural |
| 19617 | 111 | 834 | hr | Ramos and Compean Justice Act of 2009 - Amen... | Non-Economic | Socio-Cultural |
| 19618 | 115 | 1935 | s | Tribal Tax and Investment Reform Act of 201... | Economic | Socio-Cultural |
| 19619 | 111 | 1940 | hr | Wellness Trust Act - Amends the Public Healt... | Economic | Socio-Cultural |


19620 classified documents, the original four columns I retrieved from `api.congress.gov`, plus the two columns with the automatically classified labels. Everything seems perfect! I can now proceed to save the final version of this dataset.

In [39]:
# I save the DataFrame as "unlabelled.csv". I employ "|" as a separator to prevent the pandas methods from confusing
# commas or semicolons within the texts with the actual separators. I set the "index" argument to False, because the indexes
# are completely meaningless and do not need to be saved.

d.to_csv("unlabelled.csv", sep = "|", index = False)
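
The effect of the `|` separator can be checked with a small round trip. The sketch below uses the standard library's `csv` module and an in-memory buffer instead of `pandas` and a file on Drive; the two rows are hypothetical sample data.

```python
import csv
import io

# Hypothetical two-row sample; the first text contains commas and a semicolon
rows = [
    {"bill_type": "hr", "text": "Amends the Act; provides credits, grants, and loans.", "economic": "Economic"},
    {"bill_type": "s", "text": "Renewable Chemicals Act of 2017", "economic": "Economic"},
]

# Write the rows to an in-memory buffer with "|" as the separator
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["bill_type", "text", "economic"], delimiter="|")
writer.writeheader()
writer.writerows(rows)

# Read the buffer back with the same separator
buffer.seek(0)
restored = list(csv.DictReader(buffer, delimiter="|"))
print(restored[0]["text"])
```

The text comes back unchanged, punctuation included, as long as the same separator is used for writing and reading.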

From a visual inspection in Notepad++, the predictions seem to be good! I can provide more detailed comments from the human coder perspective on the first few bill summaries, to demonstrate how powerful the models are:

1. The first summary is about a bill that amends the Agricultural Act of 2014 to reform price loss coverage payments or agriculture risk coverage payments to producers. This bill is correctly coded as Economic, because it is a law that revises public welfare for farmers, and as Non-Socio-Cultural, because it does not involve a socio-cultural issue.

2. The second summary is about a bill that amends the Internal Revenue Code to provide tax credits for investments in renewable and environmentally friendly chemicals. This bill is correctly coded as Economic and Socio-Cultural, because it involves a tax reform within the environmental policy framework.

3. The third summary is about a bill that amends the Public Health Service Act to require the Substance Abuse and Mental Health Services Administration (SAMHSA) to conduct a national public awareness campaign regarding its substance abuse treatment referral routing programs. This bill is correctly coded as Non-Economic, because it does not imply any budgetary action by the United States Congress, and Socio-Cultural, because it refers to the very salient socio-cultural issue of substance abuse.

4. The fourth summary is about a bill that authorises the Secretary of the military department concerned to reimburse a member of the reserves for certain types of transportation expenses. This bill is correctly coded as Economic, because it concerns public employees' reimbursement, and Non-Socio-Cultural, because even though the military is mentioned, this is not a law that responds to socio-cultural issues related to national security or international relations. This classification is very hard to make, which suggests that the models capture context well.

5. The fifth summary is about a bill that increases Medicaid funding for Puerto Rico through fiscal year 2019. This bill is correctly coded as Economic, because it implies that the US Congress is taking action on Medicaid's funding, and Socio-Cultural, because it regards the very salient socio-cultural issue of public / private healthcare, and the socio-culturally conflictual status of Puerto Rico as an unincorporated territory of the United States. In other words, the island of Puerto Rico is neither a sovereign nation nor a US state, reflecting its colonial past.