# **White Noise: LEGAL-BERT Fine-Tuning**
## **0. Explaining the Problem**

The vectorizer + classifier combinations I tuned with grid searches, following the "classic" approach to Supervised Machine Learning, are a good benchmark, but they leave a lot of room for improvement. The accuracy of the best solutions for the classification tasks I must solve - i.e., labelling a US Congress bill summary as Economic / Non-Economic, and Socio-Cultural / Non-Socio-Cultural in its content - hovers around the 80% mark, which is fairly satisfactory. However, I find a recurring, underlying bias of the vectorizer + classifier combinations towards the positive labels. The most flexible and powerful way to move beyond the Bag-of-Words technique and tackle this issue is to fine-tune a BERT transformer pre-trained specifically on English legal text, so that the pre-training phase is consistent with my domain of interest - i.e., US Congress bills. However, this requires considerably more computation, to the extent that I must run this notebook on Google Colab in order to use Google's GPUs.

My plan is to download the pre-trained `nlpaueb/legal-bert-base-uncased`, a BERT model hosted on `HuggingFace` and created by the Natural Language Processing Group of the Athens University of Economics and Business. The `LEGAL-BERT` model is pre-trained on a corpus of EU legislation, UK legislation, US contracts from the US Securities and Exchange Commission (SEC), and cases from the European Court of Justice (ECJ), European Court of Human Rights (ECHR), and various courts across the USA. It is available at https://huggingface.co/nlpaueb/legal-bert-base-uncased. I expect that fine-tuning this transformer for my specific downstream tasks will lead to better performance, ultimately yielding more nuanced predictions that take context and temporality into account.

This notebook is heavily inspired by Anne Kroon's [version](https://github.com/uvacw/teaching-bdaca/blob/main/modules/machinelearning-text-exercises/transformers_bert_classification.ipynb) of the [BERT for Humanists Fine-Tuning for a Classification task tutorial](https://colab.research.google.com/drive/19jDqa5D5XfxPU6NQef17BC07xQdRnaKU?usp=sharing), designed by Maria Antoniak, Melanie Walsh, and the [BERT for Humanists](https://melaniewalsh.github.io/BERT-for-Humanists/) Team.

These are the steps involved in using BERT and HuggingFace:

    1. Split the labelled dataset into training, validation, and testing subsets.
    2. Convert the data into a format that BERT can process.
    3. Create dataset objects by joining the textual data and labels.
    4. Load the pre-trained BERT model.
    5. Refine the model by training it on the training set.
    6. Use the model to make predictions and assess its performance on the test data.

In [None]:
# Installing the "transformers" package, in an older version (4.28.0)
!pip3 install transformers==4.28.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers==4.28.0
  Downloading transformers-4.28.0-py3-none-any.whl (7.0 MB)
Collecting huggingface-hub<1.0,>=0.11.0 (from transformers==4.28.0)
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers==4.28.0)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.14.1 tokenizers-0.13.3 transform

I pin an older version of the `transformers` package because, with the more recent `4.29.0` release, the name `PartialState` is not defined, which raises a `NameError` when the training parameters are initialised for the `Trainer` object. Many thanks to the user `amyeroberts`, who suggested downgrading the `transformers` package in the following thread: https://github.com/huggingface/transformers/issues/22816.
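
Since the rest of the notebook depends on this pin, a quick check of the installed version can save a confusing traceback later. The snippet below is only a minimal sketch using the standard library's `importlib.metadata`; the helper names `installed_version` and `matches_pin` are hypothetical and not part of `transformers`.

```python
from importlib import metadata

PINNED_VERSION = "4.28.0"  # the release installed in the cell above


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(version, pinned=PINNED_VERSION):
    """True only when the given version string is exactly the pinned one."""
    return version is not None and version == pinned


current = installed_version("transformers")
if not matches_pin(current):
    print(f"transformers is {current!r}; this notebook was written against {PINNED_VERSION}.")
```

If the message is printed, re-running the `pip3 install` cell and restarting the runtime is the usual remedy on Colab.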

In [None]:
# General packages
import gzip
import json
import pickle
import random
import sys
import csv
from collections import defaultdict

# Packages for data handling and cleaning
import numpy as np
import pandas as pd

# Packages for SML and Transformer fine-tuning
from sklearn import metrics
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report
from sklearn.model_selection import train_test_split, ParameterGrid
from sklearn.utils import compute_sample_weight
from sklearn.metrics import f1_score
import torch
from transformers import Trainer, TrainingArguments

In [None]:
# Packages for mounting Drive and setting the working directory
import os
from google.colab import drive

# Mounting my Drive on Google Colab
drive.mount('/content/drive')

# Setting the working directory
os.chdir(r"/content/drive/My Drive/Colab Notebooks/")

Mounted at /content/drive


## **1. Unpacking the data, and splitting it into Training, Validation, and Test sets**

In [None]:
# I start by importing the "labelled.csv" data set as a DataFrame object within the CoLab environment.
# I crucially specify the "|" separator, because employing colons or semi-colons causes conflicts with the summaries' contents.

d = pd.read_csv("labelled.csv", sep = "|")

In [None]:
# I check the first few lines of the DataFrame object to assess if the "read_csv" command worked smoothly

d.head()

Unnamed: 0,congress,bill_number,bill_type,text,economic,socio_cultural
0,115,1308,hr,Frank and Jeanne Moore Wild Steelhead Speci...,Non-Economic,Socio-Cultural
1,115,4105,hr,This bill extends funding through FY2022 for...,Economic,Non-Socio-Cultural
2,115,3691,s,Expanding Transparency of Information and S...,Non-Economic,Socio-Cultural
3,111,1994,hr,Citizen Soldier Equality Act of 2009 - Requi...,Economic,Socio-Cultural
4,111,883,hr,"Amends the Internal Revenue Code to repeal, e...",Economic,Socio-Cultural


In [None]:
# I check the shape of the DataFrame object to assess if the "read_csv" command worked smoothly

d.shape

(2200, 6)
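
The reason for the `|` separator can be illustrated with the standard `csv` module. This is a small sketch on a made-up two-line file, assuming the summaries are stored without quoting; the column subset and the sample text are hypothetical, not taken from `labelled.csv`.

```python
import csv
import io

# Hypothetical two-line file in the same spirit as "labelled.csv":
# the summary contains a comma and a semicolon, but no pipe.
raw = (
    "congress|text|economic\n"
    "111|Amends the Internal Revenue Code to repeal, extend; and revise|Economic\n"
)

# With "|" as the delimiter, the data row has as many fields as the header.
rows = list(csv.reader(io.StringIO(raw), delimiter="|"))
header, first = rows
print(len(header), len(first))

# With "," as the delimiter, the same row is cut inside the summary text.
data_line = raw.splitlines()[1]
comma_split = next(csv.reader(io.StringIO(data_line), delimiter=","))
print(len(comma_split))
```

A quoted CSV would also survive commas inside the text, so the pipe is a convenience rather than the only option.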

2200 classified documents and six columns - i.e., the original four columns I retrieved from `api.congress.gov`, plus the two columns that contain the categories I manually annotated. Everything seems perfect! I can now convert the columns where I stored the textual data (`text`), the economic labels (`economic`), and the socio-cultural labels (`socio_cultural`) into three separate lists, which I subsequently split into suitable sets for training, validation, and testing.

The following cell's code is inspired by the official documentation for the `tolist()` `pandas` method, available at https://pandas.pydata.org/docs/reference/api/pandas.Series.tolist.html.

In [None]:
# I unpack the columns of interest into three separate lists with the .tolist() pandas method.

text = d["text"].tolist() # Textual data
economic = d["economic"].tolist() # Economic labels
socio_cultural = d["socio_cultural"].tolist() # Socio-cultural labels

In [None]:
# I check the first 5 elements and overall lengths of the three lists to assess whether this data wrangling step went smoothly.

text[:5]

['   Frank and Jeanne Moore Wild Steelhead Special Management Area Designation Act      This bill designates approximately 99,653 acres of Forest Service land in Oregon as the  Frank and Jeanne Moore Wild Steelhead Special Management Area.  ',
 '  This bill extends funding through FY2022 for the Department of Health and Human Services to award grants to states and certain other entities for demonstration projects that address health-professions workforce needs.  ',
 '   Expanding Transparency of Information and Safeguarding Toxics (EtO is Toxic) Act of 2018    This bill updates requirements for chemicals that pose an adverse public health risk. Specifically, the bill requires the Environmental Protection Agency (EPA) to publish an updated National Air Toxics Assessment once every two years. The assessment uses emissions data to estimate health risks from toxic air pollutants.    The bill also requires the EPA to use data from its Integrated Risk Information System when conducting rulem

In [None]:
print(f"The textual data list's total length is {len(text)}.")

The textual data list's total length is 2200.


In [None]:
economic[:5]

['Non-Economic', 'Economic', 'Non-Economic', 'Economic', 'Economic']

In [None]:
print(f"The economic label list's total length is {len(economic)}.")

The economic label list's total length is 2200.


In [None]:
socio_cultural[:5]

['Socio-Cultural',
 'Non-Socio-Cultural',
 'Socio-Cultural',
 'Socio-Cultural',
 'Socio-Cultural']

In [None]:
print(f"The socio-cultural label list's total length is {len(socio_cultural)}.")

The socio-cultural label list's total length is 2200.
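
Beyond the list lengths, the label balance is worth a glance, since the benchmark models leaned towards the positive labels. Here is a minimal sketch with `collections.Counter`; the helper name `label_shares` is hypothetical, and the toy input is just the first five economic labels printed above, not the full data set.

```python
from collections import Counter


def label_shares(labels):
    """Map each label to its relative frequency in the list."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Toy input: the first five economic labels shown in the output above.
sample = ["Non-Economic", "Economic", "Non-Economic", "Economic", "Economic"]
shares = label_shares(sample)
print(shares)
```

Calling `label_shares(economic)` and `label_shares(socio_cultural)` on the full lists would show how skewed each task really is.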


All data was correctly dumped into separate lists. Now, I proceed to split them into suitable sets for training, validation, and testing with the `sklearn` `train_test_split` function.

In [None]:
# I set a given random seed to make my work reproducible. For the record, the 27th of August is my birthday.
my_seed = 27

# Running the train vs test split with the standard 80% vs 20% ratio.
text_train, text_test, econ_train, econ_test, sc_train, sc_test = train_test_split(
    text, economic, socio_cultural, test_size = 0.2, random_state = my_seed)

In [None]:
# Checking whether this splitting step went smoothly for the bill summaries...
print(f"The text sets have {len(text_train)} training instances and {len(text_test)} testing instances.")

The text sets have 1760 training instances and 440 testing instances.


In [None]:
# ...and for their labels.
print(f"The economic label sets have {len(econ_train)} training instances and {len(econ_test)} testing instances.")
print(f"The socio-cultural label sets have {len(sc_train)} training instances and {len(sc_test)} testing instances.")

The economic label sets have 1760 training instances and 440 testing instances.
The socio-cultural label sets have 1760 training instances and 440 testing instances.


I further split the remaining data into training and validation sets. This time, I apply a 75% versus 25% split ratio, setting aside one fourth of the remaining bill summaries and their labels for validation, because I want the validation and test sets to be the same size (440 instances each).

In [None]:
# Running the train vs validate split with a 75% vs 25% ratio.
text_train, text_valid, econ_train, econ_valid, sc_train, sc_valid = train_test_split(
    text_train, econ_train, sc_train, test_size = 0.25, random_state = my_seed)

In [None]:
# Checking whether this splitting step went smoothly for the bill summaries...
print(f"The text sets have {len(text_train)} training instances and {len(text_valid)} validation instances.")

The text sets have 1320 training instances and 440 validation instances.


In [None]:
# ...and for their labels.
print(f"The economic label sets have {len(econ_train)} training instances and {len(econ_valid)} validation instances.")
print(f"The socio-cultural label sets have {len(sc_train)} training instances and {len(sc_valid)} validation instances.")

The economic label sets have 1320 training instances and 440 validation instances.
The socio-cultural label sets have 1320 training instances and 440 validation instances.
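
The arithmetic behind the two consecutive splits can be written out as a short sketch. The function name `split_sizes` is hypothetical, and `round` merely stands in for whatever rounding `train_test_split` applies internally, an assumption that makes no difference here because both products are whole numbers.

```python
def split_sizes(n_total, test_fraction, valid_fraction_of_rest):
    """Sizes produced by a test split followed by a validation split."""
    n_test = round(n_total * test_fraction)
    n_rest = n_total - n_test
    n_valid = round(n_rest * valid_fraction_of_rest)
    n_train = n_rest - n_valid
    return n_train, n_valid, n_test


n_train, n_valid, n_test = split_sizes(2200, 0.2, 0.25)
print(n_train, n_valid, n_test)

# Share of the full data set that ends up in each subset.
print([size / 2200 for size in (n_train, n_valid, n_test)])
```

In other words, 20% of the data followed by 25% of the remainder amounts to an overall 60% / 20% / 20% split.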


I now set some variables that will help me to avoid hard-coding some key values, such as the pre-trained BERT model's name, or the directories where I want to save my fine-tuned models.

In [None]:
# To import LEGAL-BERT in English, I refer to the "nlpaueb/legal-bert-base-uncased" transformer.
model_name = "nlpaueb/legal-bert-base-uncased"

# I will run my code on an NVIDIA GPU provided by Google Colab.
device_name = "cuda"

# I set the maximum number of tokens in each document to be 512, which is the maximum length for BERT models.
max_length = 512

# I define the directory where I'll save my fine-tuned model for the Economic / Non-Economic classification task.
save_directory_econ = "legal_bert_econ"

# I define the directory where I'll save my fine-tuned model for the Socio-Cultural / Non-Socio-Cultural classification task.
save_directory_sc = "legal_bert_sc"

On a final note, I choose to fine-tune the base - i.e., general - LEGAL-BERT pre-trained model, and not its more specific variants, such as `CONTRACTS-BERT-BASE`, trained only on the US contracts sub-corpus, `EURLEX-BERT-BASE`, trained only on the EU legislation sub-corpus, and `ECHR-BERT-BASE`, trained only on the ECHR cases sub-corpus. This is because I do not want to lose the nuance provided by training on UK legislation, which I deem to be the corpus closest to my domain - i.e., US legislation.

<br>

## **2. Encoding data for BERT**

To prepare my data set for use with the pre-trained `LEGAL-BERT`, I need to encode the texts and labels in a way that the model can understand. Here are the steps I must follow:

    1. Convert the labels from strings to integers.
    2. Tokenize the texts, which involves breaking them up into individual words and then splitting the words into "word pieces" that can be matched with their corresponding embedding vectors.
    3. Truncate texts that are longer than 512 tokens, or pad texts that are shorter than 512 tokens with a special padding token.
    4. Add special tokens to the beginning and end of each document, including a start token, a separator between sentences, and a padding token as necessary.
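
Steps 3 and 4 can be sketched in plain Python. This is only a toy illustration of the idea, not the tokenizer's actual implementation: the function name `prepare` is hypothetical, and it works on ready-made token strings rather than on vocabulary ids.

```python
CLS, SEP, PAD = "[CLS]", "[SEP]", "[PAD]"


def prepare(tokens, max_length):
    """Add special tokens, truncate, and pad a token list to max_length."""
    # Reserve two positions for [CLS] and [SEP], then truncate the rest.
    body = tokens[: max_length - 2]
    sequence = [CLS] + body + [SEP]

    # Real tokens get a 1 in the attention mask, padding gets a 0.
    attention_mask = [1] * len(sequence)
    n_pad = max_length - len(sequence)
    sequence = sequence + [PAD] * n_pad
    attention_mask = attention_mask + [0] * n_pad
    return sequence, attention_mask


short_seq, short_mask = prepare(["this", "bill", "extends", "funding"], max_length=8)
long_seq, long_mask = prepare([f"w{i}" for i in range(10)], max_length=8)
print(short_seq, short_mask)
print(long_seq, long_mask)
```

The attention mask is what lets the model ignore the padding positions during fine-tuning.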

In [None]:
# Package for automatic BERT tokenizing
from transformers import AutoTokenizer

# I import my automatic BERT tokenizer from the "nlpaueb/legal-bert-base-uncased" model
tokenizer = AutoTokenizer.from_pretrained(model_name)

Downloading (…)okenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.02k [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/222k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

I now generate a mapping of my Economic / Non-Economic, and Socio-Cultural / Non-Socio-Cultural labels to integer keys. I begin by extracting the unique labels from my dataset and creating two dictionaries that associate each label with an integer.

In [None]:
# a. Economic / Non-Economic

# I create a set called "unique_labels_econ" by passing a generator expression to set().
# The generator iterates over each label in the econ_train variable.
# The end product is a set of all the unique labels in econ_train.
unique_labels_econ = set(label for label in econ_train)

# I now create a dictionary called "label2id_econ" using a dictionary comprehension,
# by iterating over all labels in the unique_labels_econ set. For each label, I
# generate a key-value pair in the dictionary where the key is the label and
# the value is its corresponding integer ID, which is defined thanks to the
# enumerate() function.
label2id_econ = {label: id for id, label in enumerate(unique_labels_econ)}

# I finally generate a dictionary called "id2label_econ" using another dictionary
# comprehension. This time, I iterate over each key-value pair in the newly
# created "label2id_econ" dictionary, and for each key-value pair, I set a new
# key-value pair in the "id2label_econ" dictionary, where the key is the integer ID
# and the value is the label.
id2label_econ = {id: label for label, id in label2id_econ.items()}

In [None]:
label2id_econ.keys() # I check the keys of the "label2id_econ" dictionary...

dict_keys(['Non-Economic', 'Economic'])

In [None]:
id2label_econ.keys() # ...and the "id2label_econ" dictionary.

dict_keys([0, 1])

In [None]:
# b. Socio-Cultural / Non-Socio-Cultural

# I create a set called "unique_labels_sc" by passing a generator expression to set().
# The generator iterates over each label in the sc_train variable.
# The end product is a set of all the unique labels in sc_train.
unique_labels_sc = set(label for label in sc_train)

# I now create a dictionary called "label2id_sc" using a dictionary comprehension,
# by iterating over all labels in the unique_labels_sc set. For each label, I
# generate a key-value pair in the dictionary where the key is the label and
# the value is its corresponding integer ID, which is defined thanks to the
# enumerate() function.
label2id_sc = {label: id for id, label in enumerate(unique_labels_sc)}

# I finally generate a dictionary called "id2label_sc" using another dictionary
# comprehension. This time, I iterate over each key-value pair in the newly
# created "label2id_sc" dictionary, and for each key-value pair, I set a new
# key-value pair in the "id2label_sc" dictionary, where the key is the integer ID
# and the value is the label.
id2label_sc = {id: label for label, id in label2id_sc.items()}

In [None]:
label2id_sc.keys() # I check the keys of the "label2id_sc" dictionary...

dict_keys(['Socio-Cultural', 'Non-Socio-Cultural'])

In [None]:
id2label_sc.keys() # ...and the "id2label_sc" dictionary.

dict_keys([0, 1])
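
One caveat about the mappings above: the iteration order of a set of strings is not guaranteed to be the same in every Python session, because string hashing is randomised per process by default. The integer assigned to each label may therefore change after a runtime restart. A hedged alternative is to sort the unique labels before enumerating them; the helper name `build_label_maps` below is hypothetical.

```python
def build_label_maps(labels):
    """Build label <-> id dictionaries with a stable, sorted ordering."""
    unique = sorted(set(labels))
    label2id = {label: idx for idx, label in enumerate(unique)}
    id2label = {idx: label for label, idx in label2id.items()}
    return label2id, id2label


# Toy input: the first five economic labels printed earlier in the notebook.
toy_labels = ["Non-Economic", "Economic", "Non-Economic", "Economic", "Economic"]
label2id_toy, id2label_toy = build_label_maps(toy_labels)
print(label2id_toy)
print(id2label_toy)
```

A stable mapping matters most when a saved model is reloaded in a later session and its integer predictions must be decoded back into labels.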

Next, I tokenize the bill summaries and encode the labels.

In [None]:
# I first encode the bill summaries for training, validation, and testing with the
# pre-trained AutoTokenizer from the HuggingFace library. The truncation,
# padding, and "max_length" parameters are set so that all the tokenized
# sequences within each call have the same length.

# 1. The truncation parameter ensures that any sequences longer than "max_length" (512)
# are truncated to the specified maximum length.

# 2. With "padding = True", any shorter sequences are padded with special tokens to the
# length of the longest sequence in the same call, which can never exceed "max_length".
# (Passing padding = "max_length" instead would always pad to 512.)

# 3. The "max_length" parameter specifies the maximum length of the tokenized sequences - i.e., 512.

train_encodings = tokenizer(text_train, truncation = True, padding = True, max_length = max_length)
valid_encodings = tokenizer(text_valid, truncation = True, padding = True, max_length = max_length)
test_encodings = tokenizer(text_test, truncation = True, padding = True, max_length = max_length)

# I then encode my labels by iterating over the training, validation, and
# testing datasets, and mapping each label to its corresponding integer ID,
# respectively using the "label2id_econ" and "label2id_sc" dictionaries.
# I store the resulting integer IDs into new "_encoded" lists.

# a. Economic / Non-Economic
econ_train_encoded = [label2id_econ[lab] for lab in econ_train]
econ_valid_encoded = [label2id_econ[lab] for lab in econ_valid]
econ_test_encoded = [label2id_econ[lab] for lab in econ_test]

# b. Socio-Cultural / Non-Socio-Cultural
sc_train_encoded = [label2id_sc[lab] for lab in sc_train]
sc_valid_encoded = [label2id_sc[lab] for lab in sc_valid]
sc_test_encoded = [label2id_sc[lab] for lab in sc_test]

After the encoding procedure, I examine the newly created sets to check whether there are any issues. This is a bill summary in the training set after encoding:

In [None]:
# I take the first document in the "train_encodings" dataset, and I join its
# first 100 tokens by a whitespace to get a sneak peek.

" ".join(train_encodings[0].tokens[0:100])

'[CLS] protect the homeland from north korea ##n and iranian ballistic missile ##s act - states the concern of congress over north korea ##n and iranian long - range ballistic missile technology and the spread of such technology . expresse ##s support for ballistic missile protection of u . s . allie ##s and forward deploy ##ed forces but also the belief that this should not come at the expense of u . s . homeland protection . direct ##s the secretary of defense to deploy specified numbers of ground - based intercept ##or ##s in alaska and california and'
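
The `##` prefixes in the output mark word pieces that continue a word. The greedy longest-match-first idea behind this splitting can be sketched as follows. It is a toy version under stated assumptions: the function name `wordpiece` and the four-entry vocabulary are hypothetical, whereas the real tokenizer relies on the vocabulary file downloaded with `LEGAL-BERT`.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one lowercase word into word pieces by greedy longest match."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


toy_vocab = {"intercept", "##or", "##s", "missile"}
print(wordpiece("interceptors", toy_vocab))
print(wordpiece("missiles", toy_vocab))
```

With this toy vocabulary, the two words split the same way as in the encoded summary above.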

This is a bill summary in the validation set after encoding:

In [None]:
# I take the first document in the "valid_encodings" dataset, and I join its
# first 100 tokens by a whitespace to get a sneak peek.

" ".join(valid_encodings[0].tokens[0:100])

"[CLS] continuation of useful resources to states act or courts act this bill extends and otherwise revise ##s funding for program ##s related to child welfare . specifically , the bill extends funding through f ##y ##202 ##2 for the promoting safe and stable families program ; extends funding through f ##y ##202 ##2 for , and otherwise revise ##s , the grant program for improving courts ' handling of foster - care and adoption proceedings ; and provides funding for the temporary assistance for need ##y families ( ta ##n ##f ) contingenc ##y fund for f ##y ##201"

This is a bill summary in the test set after encoding:

In [None]:
# I take the first document in the "test_encodings" dataset, and I join its
# first 100 tokens by a whitespace to get a sneak peek

" ".join(test_encodings[0].tokens[0:100])

'[CLS] national security commission artificial intelligence act of 2018 this bill establishe ##s , as an independent commission within the executive branch , the national security commission on artificial intelligence to review the advance ##s in artificial intelligence , related machine learning developments , and associated technologies . such commission shall consider the methods and means necessary to advance the development of artificial intelligence , machine learning , and associated technologies by the united states in order to comprehensive ##ly address national security needs , including economic risk , and any other needs of the department of defense or the'

These are the Economic / Non-Economic, and Socio-Cultural / Non-Socio-Cultural labels for training:

In [None]:
# I print the "econ_train_encoded" labels in the set format
set(econ_train_encoded)

{0, 1}

In [None]:
# I print the "sc_train_encoded" labels in the set format
set(sc_train_encoded)

{0, 1}

These are the Economic / Non-Economic, and Socio-Cultural / Non-Socio-Cultural labels for validation:

In [None]:
# I print the "econ_valid_encoded" labels in the set format
set(econ_valid_encoded)

{0, 1}

In [None]:
# I print the "sc_valid_encoded" labels in the set format
set(sc_valid_encoded)

{0, 1}

These are the Economic / Non-Economic, and Socio-Cultural / Non-Socio-Cultural labels for testing:

In [None]:
# I print the "econ_test_encoded" labels in the set format
set(econ_test_encoded)

{0, 1}

In [None]:
# I print the "sc_test_encoded" labels in the set format
set(sc_test_encoded)

{0, 1}

## **3. Creating custom Torch datasets**

I now combine the encoded texts and labels into dataset objects, one per split and per classification task. I use the custom Torch `MyDataset` class to make a `train_dataset` object from `train_encodings` and the encoded training labels for each type of label - i.e., Economic / Socio-Cultural. Following the same logic, I also generate a `valid_dataset` object from `valid_encodings` and the encoded validation labels, and a `test_dataset` object from `test_encodings` and the encoded test labels.
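
The pairing logic itself does not depend on PyTorch, so it can be sketched with plain lists first. The class below is a dependency-free stand-in for the indexing protocol (`__getitem__` and `__len__`) that a map-style dataset implements; the name `PairedDataset` and the token ids are made up for illustration, and the length check is an addition of mine rather than something the notebook's class performs.

```python
class PairedDataset:
    """Torch-free stand-in that pairs encoded texts with integer labels."""

    def __init__(self, encodings, labels):
        # Every encoding field must hold exactly one row per label.
        lengths = {len(values) for values in encodings.values()}
        if lengths != {len(labels)}:
            raise ValueError("every encoding field needs one row per label")
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # Pick row `idx` from every field, then attach the matching label.
        item = {key: values[idx] for key, values in self.encodings.items()}
        item["labels"] = self.labels[idx]
        return item

    def __len__(self):
        return len(self.labels)


toy_encodings = {
    "input_ids": [[101, 7, 8, 102], [101, 9, 102, 0]],
    "attention_mask": [[1, 1, 1, 1], [1, 1, 1, 0]],
}
toy_dataset = PairedDataset(toy_encodings, [1, 0])
print(len(toy_dataset), toy_dataset[1])
```

The Torch version differs mainly in wrapping each value in `torch.tensor`, which is what the `Trainer` expects.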

In [None]:
# The MyDataset custom class uses PyTorch's Dataset class as its parent class. 
# It takes in tokenized text data and their corresponding integer-encoded labels
# as inputs and returns these data points in a format suitable for use in PyTorch models.

class MyDataset(torch.utils.data.Dataset):

    # I define the __init__ method, which initializes the MyDataset object with
    # two attributes: "encodings" and "labels".
    def __init__(self, encodings, labels):
        self.encodings = encodings  # "encodings" is a dictionary-like object containing the tokenized text data
        self.labels = labels  # "labels" is a list containing the integer-encoded labels

    # I create the __getitem__ method, which defines how each data point is returned
    # from the dataset. The "idx" parameter is used to index into the dataset to
    # retrieve a specific data point.
    def __getitem__(self, idx):
        # The method initially generates a dictionary item that contains the tokenized
        # text data for the corresponding idx index.
        # The torch.tensor() function is used to convert the dictionary values to PyTorch tensors.
        # The "key" variable contains the keys in the encodings dictionary, and "val"
        # contains the corresponding values, from which I select the current idx index.
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

        # I add the integer-encoded label to the dictionary, and I return the latter as the method's output.
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    # At last, I generate the "__len__" method, which returns the dataset's length as its output.
    # The latter is equal to the length of the labels list.
    def __len__(self):
        return len(self.labels)

In [None]:
# I now apply the MyDataset custom class to the textual and label encodings
# for training, validation, and testing, returning them as custom Torch datasets.

# a. Economic / Non-Economic
train_dataset_econ = MyDataset(train_encodings, econ_train_encoded)
valid_dataset_econ = MyDataset(valid_encodings, econ_valid_encoded)
test_dataset_econ = MyDataset(test_encodings, econ_test_encoded)

# b. Socio-Cultural / Non-Socio-Cultural
train_dataset_sc = MyDataset(train_encodings, sc_train_encoded)
valid_dataset_sc = MyDataset(valid_encodings, sc_valid_encoded)
test_dataset_sc = MyDataset(test_encodings, sc_test_encoded)

I inspect the newly created custom Torch datasets to check whether there are any issues. This is a bill summary in the Torch `train_dataset_econ` dataset after encoding:


In [None]:
# I take the first document in the "train_dataset_econ" dataset, and I join its
# first 100 tokens by a whitespace to get a sneak peek.

" ".join(train_dataset_econ.encodings[0].tokens[0:100])

'[CLS] protect the homeland from north korea ##n and iranian ballistic missile ##s act - states the concern of congress over north korea ##n and iranian long - range ballistic missile technology and the spread of such technology . expresse ##s support for ballistic missile protection of u . s . allie ##s and forward deploy ##ed forces but also the belief that this should not come at the expense of u . s . homeland protection . direct ##s the secretary of defense to deploy specified numbers of ground - based intercept ##or ##s in alaska and california and'

This is a bill summary in the Torch `train_dataset_sc` dataset after encoding:

In [None]:
# I take the first document in the "train_dataset_sc" dataset, and I join its
# first 100 tokens by a whitespace to get a sneak peek.

" ".join(train_dataset_sc.encodings[0].tokens[0:100])

'[CLS] protect the homeland from north korea ##n and iranian ballistic missile ##s act - states the concern of congress over north korea ##n and iranian long - range ballistic missile technology and the spread of such technology . expresse ##s support for ballistic missile protection of u . s . allie ##s and forward deploy ##ed forces but also the belief that this should not come at the expense of u . s . homeland protection . direct ##s the secretary of defense to deploy specified numbers of ground - based intercept ##or ##s in alaska and california and'

This is a bill summary in the Torch `valid_dataset_econ` dataset after encoding:

In [None]:
# I take the first document in the "valid_dataset_econ" dataset, and I join its
# first 100 tokens by a whitespace to get a sneak peek.

" ".join(valid_dataset_econ.encodings[0].tokens[0:100])

"[CLS] continuation of useful resources to states act or courts act this bill extends and otherwise revise ##s funding for program ##s related to child welfare . specifically , the bill extends funding through f ##y ##202 ##2 for the promoting safe and stable families program ; extends funding through f ##y ##202 ##2 for , and otherwise revise ##s , the grant program for improving courts ' handling of foster - care and adoption proceedings ; and provides funding for the temporary assistance for need ##y families ( ta ##n ##f ) contingenc ##y fund for f ##y ##201"

This is a bill summary in the Torch `valid_dataset_sc` dataset after encoding:


In [None]:
# I take the first document in the "valid_dataset_sc" dataset, and I join its
# first 100 tokens by a whitespace to get a sneak peek.

" ".join(valid_dataset_sc.encodings[0].tokens[0:100])

"[CLS] continuation of useful resources to states act or courts act this bill extends and otherwise revise ##s funding for program ##s related to child welfare . specifically , the bill extends funding through f ##y ##202 ##2 for the promoting safe and stable families program ; extends funding through f ##y ##202 ##2 for , and otherwise revise ##s , the grant program for improving courts ' handling of foster - care and adoption proceedings ; and provides funding for the temporary assistance for need ##y families ( ta ##n ##f ) contingenc ##y fund for f ##y ##201"

This is a bill summary in the Torch `test_dataset_econ` dataset after encoding:

In [None]:
# I take the first document in the "test_dataset_econ" dataset, and I join its
# first 100 tokens by a whitespace to get a sneak peek.

" ".join(test_dataset_econ.encodings[0].tokens[0:100])

'[CLS] national security commission artificial intelligence act of 2018 this bill establishe ##s , as an independent commission within the executive branch , the national security commission on artificial intelligence to review the advance ##s in artificial intelligence , related machine learning developments , and associated technologies . such commission shall consider the methods and means necessary to advance the development of artificial intelligence , machine learning , and associated technologies by the united states in order to comprehensive ##ly address national security needs , including economic risk , and any other needs of the department of defense or the'

This is a bill summary in the Torch `test_dataset_sc` dataset after encoding:

In [None]:
# I take the first document in the "test_dataset_sc" dataset, and I join its
# first 100 tokens by a whitespace to get a sneak peek.

" ".join(test_dataset_sc.encodings[0].tokens[0:100])

'[CLS] national security commission artificial intelligence act of 2018 this bill establishe ##s , as an independent commission within the executive branch , the national security commission on artificial intelligence to review the advance ##s in artificial intelligence , related machine learning developments , and associated technologies . such commission shall consider the methods and means necessary to advance the development of artificial intelligence , machine learning , and associated technologies by the united states in order to comprehensive ##ly address national security needs , including economic risk , and any other needs of the department of defense or the'

The custom Torch datasets are appropriately set up. It is time to initialise the `LEGAL-BERT` pre-trained model and configure its parameters for fine-tuning.

## **4. Initialising and configuring the LEGAL-BERT pre-trained model**
I now load the pre-trained `LEGAL-BERT` model and move it to the GPU through CUDA (Compute Unified Device Architecture) for efficient computation. I repeat the process for each classification task, so that each task gets its own copy of the model.
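
Because `device_name` is hard-coded to `"cuda"`, moving the model fails on a runtime without a GPU. A defensive alternative, sketched here under the assumption that falling back to the CPU is acceptable (it will be much slower), is to ask PyTorch whether CUDA is available. The helper name `pick_device` is hypothetical.

```python
def pick_device():
    """Return "cuda" when a GPU is usable through PyTorch, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is no GPU support to query.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


chosen_device = pick_device()
print(f"Models would be moved to: {chosen_device}")
```

On Colab, a `"cpu"` result usually means that no GPU runtime was selected under "Change runtime type".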

In [None]:
# I load the "AutoModelForSequenceClassification" class from the HuggingFace
# "transformers" library, which is optimal for sequence classification tasks.
from transformers import AutoModelForSequenceClassification

# I initialize my pre-trained LEGAL-BERT model with the previously set "model_name" variable.

# The "num_labels" parameter is set to the number of unique labels in the dataset,
# which is determined by the length of either the "id2label_econ", or "id2label_sc"
# dictionary. This tells the model how many output labels there are to predict.

# Finally, I indicate the device where the model will be stored during training,
# which is a GPU device ("cuda"), with the "to" method.

model_econ = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels = len(id2label_econ)).to(device_name)
model_sc = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels = len(id2label_sc)).to(device_name)

Downloading pytorch_model.bin:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of the model checkpoint at nlpaueb/legal-bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification wer

I now configure the parameters required for fine-tuning my BERT transformers. They are specified in a HuggingFace `TrainingArguments` object, which I subsequently pass to the HuggingFace `Trainer` objects. While there are numerous other arguments, I will focus on the fundamental ones.


| Parameter | Explanation |
|-----------|-------------|
| `num_train_epochs` | The total number of training epochs. This refers to how many times the entire dataset will be processed. Too many epochs can lead to overfitting. |
| `per_device_train_batch_size` | The batch size per device during training. |
| `per_device_eval_batch_size` | The batch size for evaluation. |
| `warmup_steps` | The number of warmup steps for the learning rate scheduler. A smaller value is recommended for small datasets. |
| `weight_decay` | The strength of weight decay, which reduces the size of weights, similar to regularization. |
| `output_dir` | The directory where the fine-tuned model and configuration files will be saved. |
| `logging_dir` | The directory where logs will be stored. |
| `logging_steps` | How often to print logging output. This enables me to terminate training early if the loss is not decreasing. |
| `evaluation_strategy` | Evaluates while training so that I can monitor accuracy improvements. |

I continue by defining a custom evaluation function that returns the model's accuracy and macro-F1 score. I include the latter because class imbalance is critical for both classification tasks - i.e., most models tend to artificially inflate the number of positive predictions.

In [None]:
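# I import the metric functions from "scikit-learn" (harmless if they were
# already imported earlier in the notebook).
from sklearn.metrics import accuracy_score, f1_score
from sklearn.utils.class_weight import compute_sample_weight

# I define a custom function that takes in the argument "eval_pred", an
# "EvalPrediction" object passed by the "Trainer", which holds the model's raw
# outputs ("predictions") and the ground truth labels ("label_ids").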

def compute_metrics(eval_pred):
    labels = eval_pred.label_ids # Extracting the ground truth labels from the "eval_pred" output
    preds = eval_pred.predictions.argmax(-1) # Extracting the predicted labels from the "eval_pred" output

    # I compute the accuracy score thanks to the pre-defined function from the
    # "scikit-learn" library. It takes in the labels and predictions as inputs and
    # returns the accuracy as a float.
    acc = accuracy_score(labels, preds)

    # I follow the same procedure for calculating the macro F1-score. The
    # "sample_weight" parameter is additionally set to the output of the
    # "compute_sample_weight" function from "scikit-learn", which weights each
    # sample inversely to the frequency of its class.
    macro_f1 = f1_score(labels, preds, average = "macro", sample_weight = compute_sample_weight("balanced", labels))

    # I return a dictionary with the computed metrics under meaningful keys
    return {"accuracy": acc, "macro_f1": macro_f1}
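
To see why accuracy alone can mislead under class imbalance, the following minimal sketch computes accuracy and an unweighted macro-F1 by hand on a made-up toy example. The labels are purely illustrative and not taken from my dataset, and the helper functions are hand-written stand-ins rather than the `scikit-learn` implementations. The sketch also leaves out the `sample_weight` argument used above, so it is a simplification of `compute_metrics`, not a replacement.

```python
# Toy illustration (hypothetical labels): a classifier that always predicts
# the majority class looks decent on accuracy but poor on macro-F1.

def f1_for_class(labels, preds, cls):
    """Plain F1 for a single class, treating `cls` as the positive label."""
    tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
    fp = sum(1 for y, p in zip(labels, preds) if y != cls and p == cls)
    fn = sum(1 for y, p in zip(labels, preds) if y == cls and p != cls)
    if tp == 0:
        return 0.0  # no true positives: F1 is zero for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(labels, preds):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(labels))
    return sum(f1_for_class(labels, preds, c) for c in classes) / len(classes)

def accuracy(labels, preds):
    """Share of documents whose predicted label matches the true label."""
    return sum(1 for y, p in zip(labels, preds) if y == p) / len(labels)

toy_labels = [1] * 8 + [0] * 2   # 8 positive and 2 negative documents
toy_preds = [1] * 10             # the classifier predicts "positive" every time

toy_accuracy = accuracy(toy_labels, toy_preds)
toy_macro_f1 = macro_f1(toy_labels, toy_preds)
print(f"accuracy = {toy_accuracy:.2f}, macro-F1 = {toy_macro_f1:.2f}")
```

In this toy case the accuracy is 0.80 while the macro-F1 drops to roughly 0.44, because the F1 of the never-predicted class is zero and the macro average gives both classes equal weight.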

I choose to optimise my models by maximising the macro F1-score, since class imbalance is critical for both classification tasks. Next, I configure the parameters I want to use when fine-tuning the `LEGAL-BERT` transformer by instantiating a `TrainingArguments` object. I make an informed initial choice for every parameter, and I modify it later if performance is unsatisfactory or there are strong signs of overfitting.

First, since my training dataset is moderately sized, containing 1320 training instances, starting with 5 epochs seems reasonable to me, as I want to avoid overfitting the training data. Second, I employ a batch size of 8 for both training and evaluation. I already tried bigger training batch sizes of 16 and 32 to potentially increase training efficiency, but I found that this causes out-of-memory problems on the freely available Google CoLab GPU, and I do not want to pay to win.

Third, I set an initial learning rate of 5e-5, a common choice for fine-tuning BERT models. Fourth, I specify a small number of warmup steps (200), still adequate for my medium-sized training dataset, so that the learning rate increases gradually towards this value at the start of training. Fifth, I set a weight decay of 0.01, which hopefully will prevent overfitting. Lastly, I request that the training output be logged every 20 steps.
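
To make the role of the warmup steps more concrete, here is a minimal sketch of the learning rate over the course of training. It assumes the default linear schedule of the HuggingFace `Trainer` (linear warmup followed by linear decay to zero) and a single device, so that 1320 training instances, a batch size of 8, and 5 epochs give 825 optimisation steps. The helper `linear_schedule_lr` is my own illustrative function, not part of the `transformers` API.

```python
import math

def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a given step: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)

# Values taken from the text: dataset size, batch size, and number of epochs.
n_train, batch_size, n_epochs = 1320, 8, 5
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * n_epochs

print(f"{steps_per_epoch} steps per epoch, {total_steps} steps in total")
for step in (0, 100, 200, 500, total_steps):
    lr = linear_schedule_lr(step, 5e-5, 200, total_steps)
    print(f"step {step:>3}: lr = {lr:.2e}")
```

Under these assumptions the warmup covers roughly the first quarter of training (200 of 825 steps), after which the learning rate decays linearly until the final step.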

In [None]:
metric_name = "macro_f1" # I wish to optimise the macro-F1 performance metric

In [None]:
# I instantiate an object of the TrainingArguments class, setting the aforementioned parameters:
training_args = TrainingArguments(
    
    # Number of training epochs
    num_train_epochs = 5,
    
    # Batch size for training
    per_device_train_batch_size = 8,
    
    # Batch size for evaluation
    per_device_eval_batch_size = 8,
    
    # Learning rate for optimization
    learning_rate = 5e-5,
    
    # Load the best model at the end of training
    load_best_model_at_end = True,
    
    # Metric used for selecting the best model (macro F1-score, in our case)
    metric_for_best_model = metric_name,
    
    # Number of warmup steps for the optimizer
    warmup_steps = 200,
    
    # L2 regularization weight decay
    weight_decay = 0.01,
    
    # Directory to save the fine-tuned model and configuration files
    output_dir = './results',
    
    # Directory to store logs
    logging_dir = './logs',
    
    # Log results every n steps
    logging_steps = 20,
    
    # Strategy for evaluating the model during training
    evaluation_strategy = 'steps',
)

## **5. Fine-Tuning the LEGAL-BERT models**

I now create two HuggingFace `Trainer` objects using the `TrainingArguments` object that I specified above. I also pass my custom `compute_metrics` function to the `Trainer` objects, along with my custom Torch datasets for training and validation purposes. I repeat the fine-tuning procedure for both classification tasks.

In [None]:
# a. Economic / Non-Economic

trainer_econ = Trainer(
    model = model_econ, # The instantiated HuggingFace transformer model to be trained
    args = training_args, # The training arguments, which I defined above
    train_dataset = train_dataset_econ, # The encoded PyTorch training dataset
    eval_dataset = valid_dataset_econ, # The encoded PyTorch validation dataset
    compute_metrics = compute_metrics # My custom evaluation function 
)

In [None]:
# b. Socio-Cultural / Non-Socio-Cultural

trainer_sc = Trainer(
    model = model_sc, # The instantiated HuggingFace transformer model to be trained
    args = training_args, # The training arguments, which I defined above
    train_dataset = train_dataset_sc, # The encoded PyTorch training dataset
    eval_dataset = valid_dataset_sc, # The encoded PyTorch validation dataset
    compute_metrics = compute_metrics # My custom evaluation function 
)

Time to finally fine-tune the models! I start with gearing `LEGAL-BERT` towards solving the Economic / Non-Economic classification task.

In [None]:
# a. Economic / Non Economic

trainer_econ.train() # I instruct the GPU to train the model



| Step | Training Loss | Validation Loss | Accuracy | Macro F1 |
|------|---------------|-----------------|----------|----------|
| 20 | 0.69 | 0.688904 | 0.586364 | 0.489799 |
| 40 | 0.6838 | 0.695772 | 0.527273 | 0.536662 |
| 60 | 0.6835 | 0.659121 | 0.590909 | 0.393514 |
| 80 | 0.637 | 0.668581 | 0.615909 | 0.481863 |
| 100 | 0.7 | 0.631711 | 0.643182 | 0.613059 |
| 120 | 0.6598 | 0.590638 | 0.661364 | 0.539369 |
| 140 | 0.5767 | 0.48275 | 0.743182 | 0.708821 |
| 160 | 0.4989 | 0.549003 | 0.704545 | 0.633798 |
| 180 | 0.4188 | 0.492229 | 0.806818 | 0.791339 |
| 200 | 0.4633 | 0.541395 | 0.786364 | 0.76709 |


TrainOutput(global_step=825, training_loss=0.3452735770290548, metrics={'train_runtime': 1229.9791, 'train_samples_per_second': 5.366, 'train_steps_per_second': 0.671, 'total_flos': 1736532965376000.0, 'train_loss': 0.3452735770290548, 'epoch': 5.0})

In [None]:
# b. Socio-Cultural / Non-Socio-Cultural

trainer_sc.train() # I instruct the GPU to train the model



| Step | Training Loss | Validation Loss | Accuracy | Macro F1 |
|------|---------------|-----------------|----------|----------|
| 20 | 0.6853 | 0.681111 | 0.577273 | 0.413337 |
| 40 | 0.6668 | 0.669108 | 0.611364 | 0.375599 |
| 60 | 0.6555 | 0.666176 | 0.595455 | 0.333333 |
| 80 | 0.6442 | 0.6431 | 0.595455 | 0.333333 |
| 100 | 0.6095 | 0.566355 | 0.745455 | 0.728741 |
| 120 | 0.5428 | 0.497316 | 0.754545 | 0.710593 |
| 140 | 0.5202 | 0.530167 | 0.738636 | 0.667623 |
| 160 | 0.4482 | 0.487352 | 0.8 | 0.80405 |
| 180 | 0.4315 | 0.500489 | 0.806818 | 0.770783 |
| 200 | 0.4219 | 0.468441 | 0.815909 | 0.794729 |


TrainOutput(global_step=825, training_loss=0.2807153329806346, metrics={'train_runtime': 1215.3787, 'train_samples_per_second': 5.43, 'train_steps_per_second': 0.679, 'total_flos': 1736532965376000.0, 'train_loss': 0.2807153329806346, 'epoch': 5.0})

## **6. Evaluating and Testing the Fine-Tuned Models**

After having trained the models, I wish to validate and test them on the respective sets. I call the `.evaluate` method of the `Trainer` object, which automatically runs the built-in validation procedure, referring to my custom `compute_metrics` function.

In [None]:
# a. Economic / Non-Economic

trainer_econ.evaluate() # I instruct the GPU to evaluate the model

{'eval_loss': 0.4876410961151123,
 'eval_accuracy': 0.825,
 'eval_macro_f1': 0.8225780574804976,
 'eval_runtime': 14.3387,
 'eval_samples_per_second': 30.686,
 'eval_steps_per_second': 3.836,
 'epoch': 5.0}

In [None]:
# b. Socio-Cultural / Non-Socio-Cultural

trainer_sc.evaluate() # I instruct the GPU to evaluate the model

{'eval_loss': 0.8547890186309814,
 'eval_accuracy': 0.8113636363636364,
 'eval_macro_f1': 0.7756871719606064,
 'eval_runtime': 14.3644,
 'eval_samples_per_second': 30.631,
 'eval_steps_per_second': 3.829,
 'epoch': 5.0}

`LEGAL-BERT` appears to fare better than the model I fine-tuned with the more "classic" SML techniques in the Economic / Non-Economic classification task, as its accuracy (0.82) and macro-F1 score (0.82) are both higher than the corresponding metrics of the **SVM classifier with the `linear` kernel and the `TfidfVectorizer`**.

On the other hand, `LEGAL-BERT`'s performance in the Socio-Cultural / Non-Socio-Cultural classification task is only comparable to that of a computationally cheap baseline - i.e., a **Naive-Bayes classifier with the `CountVectorizer`**. This is quite concerning and leads me to believe that the extended version of the base English transformer, `Large BERT`, could be a better alternative. In other words, I suspect that legal jargon is not as important for achieving state-of-the-art performance in this specific task.

To corroborate these findings, I wish to get a more detailed evaluation of the fine-tuned models, with precision and recall metrics for all categories, and to account for potential overfitting to the validation set. Hence, I extract the labels predicted for the test set and compare them with the "ground truth" labels.

In [None]:
# a. Economic / Non-Economic

# I call the predict() method to make the predictions on the PyTorch test set
predicted_results_econ = trainer_econ.predict(test_dataset_econ)

In [None]:
# The "predictions" attribute of "predicted_results_econ" is a 2D matrix with
# the raw scores (logits) for the respective output labels, for each document
# contained in the test set (440 in total).

predicted_results_econ.predictions.shape 

(440, 2)

In [None]:
# I take the label with the highest score (logit) for each document
predicted_labels_econ = predicted_results_econ.predictions.argmax(-1)

# I flatten the predictions into a one-dimensional list object
predicted_labels_econ = predicted_labels_econ.flatten().tolist()

# I convert from integers back to strings for readability with my custom dictionary
predicted_labels_econ = [id2label_econ[lab] for lab in predicted_labels_econ]
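
Strictly speaking, the values returned by `predict()` are raw logits rather than probabilities. Taking the `argmax` yields the same label either way, because the softmax function preserves the ordering of the scores. If probabilities are needed, a softmax can be applied row by row. The sketch below uses made-up logits and a hand-written `softmax` helper, both purely for illustration.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    highest = max(logits)
    shifted = [x - highest for x in logits]  # subtract the max for numerical stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for two documents and two output labels.
example_logits = [[1.2, -0.7], [-2.0, 0.5]]

for row in example_logits:
    probs = softmax(row)
    predicted = max(range(len(row)), key=lambda i: row[i])  # argmax over the logits
    print([round(p, 3) for p in probs], "->", predicted)
```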

In [None]:
# I compare the predictions against the "ground truth" labels with "scikit-learn"
print(classification_report(econ_test, predicted_labels_econ))

              precision    recall  f1-score   support

    Economic       0.84      0.80      0.82       222
Non-Economic       0.80      0.84      0.82       218

    accuracy                           0.82       440
   macro avg       0.82      0.82      0.82       440
weighted avg       0.82      0.82      0.82       440



The test results corroborate the validation step's findings, although they must be taken with added caution since the test sample is almost perfectly balanced (222 Economic vs. 218 Non-Economic documents), whereas the overall sample is skewed towards the positive label. Nevertheless, the fine-tuned `LEGAL-BERT` achieves a very promising performance, with an accuracy of 0.82 and precision and recall never lower than 0.80 for either class. This suggests that `LEGAL-BERT` is capable of classifying Economic / Non-Economic documents without yielding an inflated number of positive labels, unlike most of the classifiers I employed within the "classic" SML framework.

In [None]:
# b. Socio-Cultural / Non-Socio-Cultural

# I call the predict() method to make the predictions on the PyTorch test set
predicted_results_sc = trainer_sc.predict(test_dataset_sc)

In [None]:
# The "predictions" attribute of "predicted_results_sc" is a 2D matrix with
# the raw scores (logits) for the respective output labels, for each document
# contained in the test set (440 in total).

predicted_results_sc.predictions.shape 

(440, 2)

In [None]:
# I take the label with the highest score (logit) for each document
predicted_labels_sc = predicted_results_sc.predictions.argmax(-1)

# I flatten the predictions into a one-dimensional list object
predicted_labels_sc = predicted_labels_sc.flatten().tolist()

# I convert from integers back to strings for readability with my custom dictionary
predicted_labels_sc = [id2label_sc[lab] for lab in predicted_labels_sc]

In [None]:
# I compare the predictions against the "ground truth" labels with "scikit-learn"
print(classification_report(sc_test, predicted_labels_sc))

                    precision    recall  f1-score   support

Non-Socio-Cultural       0.91      0.69      0.79       157
    Socio-Cultural       0.85      0.96      0.90       283

          accuracy                           0.87       440
         macro avg       0.88      0.83      0.84       440
      weighted avg       0.87      0.87      0.86       440



The test results corroborate the validation step's findings, which were not promising. `LEGAL-BERT` shows high accuracy (0.87) and precisions (0.91 for Non-Socio-Cultural and 0.85 for Socio-Cultural), but this performance is driven by the large number of positive labels predicted by the model, as the recall for the Non-Socio-Cultural category is only 0.69. I do not consider this acceptable for a fine-tuned transformer, because `LEGAL-BERT` does not solve the class imbalance issues I experienced with the **Naive-Bayes classifier with the `CountVectorizer`**, even when I experiment with different learning parameters in the background.
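
The imbalance in the predictions can be made explicit by approximately reconstructing the confusion matrix from the recalls and supports in the report above. Since the reported recalls are rounded to two decimals, the counts below are estimates, not the exact values produced by the model.

```python
# Supports and (rounded) recalls read off the classification report above.
support = {"Non-Socio-Cultural": 157, "Socio-Cultural": 283}
recall = {"Non-Socio-Cultural": 0.69, "Socio-Cultural": 0.96}

# Estimated number of correctly classified documents per true class.
correct = {label: round(recall[label] * support[label]) for label in support}

# In a binary task, every misclassified document lands in the other class.
missed_non_sc = support["Non-Socio-Cultural"] - correct["Non-Socio-Cultural"]
missed_sc = support["Socio-Cultural"] - correct["Socio-Cultural"]

# Estimated number of documents predicted as each class.
predicted_sc = correct["Socio-Cultural"] + missed_non_sc
predicted_non_sc = correct["Non-Socio-Cultural"] + missed_sc
total = sum(support.values())

print(f"True Socio-Cultural share:      {support['Socio-Cultural'] / total:.2f}")
print(f"Predicted Socio-Cultural share: {predicted_sc / total:.2f}")
```

Under these estimates, roughly 49 of the 157 Non-Socio-Cultural bills are labelled as Socio-Cultural, so the model predicts the positive class for about 73% of the test set, against a true share of about 64%.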

## **7. Wrapping Up**

On a final note, I must specify that I wished to run an explicit hyperparameter tuning script for the few `TrainingArguments` hyperparameters that could be improved - i.e., the number of training epochs (3, 5, or 7), the learning rate (1e-5, 5e-5, or 1e-4), and the number of warmup steps (100, 200, or 300).

This could be done by creating a `ParameterGrid` object with the appropriate `sklearn` class, iterating over each hyperparameter combination within this object, and creating a fresh `TrainingArguments` object for each combination within the loop. The model would be re-initialised at each iteration, so as to ensure that each training run starts from an identical initial state, independently of the training executed during previous iterations. Another option would be looking into the Ray Tune library for more sophisticated hyperparameter tuning.
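
A minimal sketch of the grid described above follows. To keep it dependency-free, I enumerate the combinations with `itertools.product` instead of `sklearn`'s `ParameterGrid`, and the training step is only indicated in comments, since it relies on the objects defined earlier in this notebook. The names `param_grid` and `combinations` are illustrative.

```python
from itertools import product

# Candidate values mentioned above for the three hyperparameters.
param_grid = {
    "num_train_epochs": [3, 5, 7],
    "learning_rate": [1e-5, 5e-5, 1e-4],
    "warmup_steps": [100, 200, 300],
}

# Cartesian product of all values, one dictionary per combination.
keys = list(param_grid)
combinations = [dict(zip(keys, values)) for values in product(*param_grid.values())]

print(f"{len(combinations)} combinations to evaluate")

for params in combinations[:2]:
    # In the real loop, one would build fresh training arguments from `params`,
    # re-initialise the model from the pre-trained checkpoint, train it, and
    # record the validation macro-F1 for this combination.
    print(params)
```

The grid contains 27 combinations. At roughly 20 minutes per 5-epoch run, as logged above, a full search would take on the order of nine hours per classification task, which is hard to fit into an unstable free GPU session.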

However, the GPU from Google CoLab is giving me serious problems: the session constantly disconnects or does not let me connect to the backend. Furthermore, I continuously run into memory / RAM allocation issues when trying more complex procedures - i.e., changing batch sizes to improve the training's efficiency - or other pre-trained models. Therefore, after partially experimenting (and failing!) in the background, I deem this to be the best solution I can come up with in the current conditions. I consider it unlikely that trying combinations of only three parameters would have changed the results by much, so I believe this approximation is acceptable in the context of my research.

I decide to fine-tune the extended version of the base English transformer, `Large BERT`, to assess whether it is a better alternative, as legal jargon may not be as relevant for achieving state-of-the-art performance in the specific Socio-Cultural / Non-Socio-Cultural classification task. I reckon I will experience the same issues with the GPU backend, so I do not expect to be able to run hyperparameter tuning.

To conclude, I save both models and their configuration files to their set directories within Google CoLab, to keep them for potential future use, with the `save_model` method.

In [None]:
# a. Economic / Non-Economic

trainer_econ.save_model(save_directory_econ)

In [None]:
# b. Socio-Cultural / Non-Socio-Cultural

trainer_sc.save_model(save_directory_sc)