# Lab 2: Fine-tuning a generative AI model on legal contracts and case law for enhanced contextual understanding.

!pip install transformers scikit-learn evaluate datasets

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Libraries and Modules Description

In [3]:
import pandas as pd
import torch
from transformers import BartTokenizer, BartForConditionalGeneration, Trainer, TrainingArguments, pipeline
from transformers import DataCollatorForSeq2Seq
from transformers import GenerationConfig
from datasets import Dataset

1. **`pandas`**
   - A powerful data manipulation and analysis library in Python. It provides data structures like DataFrames and Series, which are ideal for handling and analyzing structured data.
   - **Common Usage**:
     - Loading, cleaning, and analyzing data in various formats (CSV, Excel, JSON, etc.).
     - Performing operations like filtering, grouping, and aggregating large datasets.

2. **`torch` (from PyTorch)**
   - A deep learning framework for building and training neural networks. It provides powerful tools for tensor computation, GPU acceleration, and automatic differentiation.
   - **Common Usage**:
     - Defining and training neural network models.
     - Performing tensor operations and manipulating data for deep learning tasks.
   - PyTorch is commonly used in conjunction with transformer models for NLP tasks.

3. **`transformers` (from `Hugging Face`)**
   - A widely used library for working with pre-trained deep learning models, especially in Natural Language Processing (NLP). It provides an easy-to-use interface for transformer-based models like BERT, GPT, T5, BART, etc.
   - **Common Usage**:
     - Loading, fine-tuning, and using pre-trained models for a variety of NLP tasks such as classification, summarization, question answering, and translation.

   - **Submodules**:
     
     - **`BartTokenizer`**:
       - The tokenizer for the BART model, responsible for preparing input text by splitting it into tokens that can be processed by the model.
     
     - **`BartForConditionalGeneration`**:
       - A pre-trained BART model specifically designed for conditional generation tasks such as text summarization, translation, and text generation.
     
     - **`Trainer`**:
       - A utility class that simplifies the process of training and evaluating transformer models. It handles the training loop, validation, and saving the model during the training process.
     
     - **`TrainingArguments`**:
       - A class that allows you to specify various hyperparameters and settings for the training process, such as batch size, learning rate, and evaluation strategy.

     - **`pipeline`**:
       - A high-level interface for performing specific tasks like text generation, translation, sentiment analysis, and more using pre-trained models. It abstracts away much of the underlying complexity.

     - **`DataCollatorForSeq2Seq`**:
       - A data collator used to efficiently batch and pad sequences of variable length for sequence-to-sequence tasks such as summarization or translation.

     - **`GenerationConfig`**:
       - A class that manages the generation configurations for models, allowing you to fine-tune various parameters like temperature, max length, and top-k sampling during the text generation process.

4. **`datasets` (from Hugging Face)**
   - A library for easily loading, processing, and working with large datasets. It provides easy access to a variety of datasets for tasks like text classification, question answering, and summarization.
   - **Common Usage**:
     - Loading datasets for machine learning tasks.
     - Preprocessing and transforming datasets into formats suitable for training or evaluation.

     - **`Dataset`**:
       - A class used to work with datasets, particularly when you need to preprocess, filter, or batch datasets in preparation for model training or evaluation.

#### Summary:
- **`pandas`** is used for data manipulation, ideal for handling structured data such as case law or other legal documents.
- **`torch`** provides the deep learning framework for model training and tensor computation.
- **`transformers`** is the core library for working with pre-trained models like BART, providing tools for tokenization, model generation, training, and evaluation.
- **`datasets`** provides an easy interface for loading and working with datasets, which is useful for model training and evaluation.

### Model and Tokenizer Initialization

In [None]:
model_name = "facebook/bart-large-cnn"
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

- **`model_name = "facebook/bart-large-cnn"`**:
    - This sets the variable `model_name` to the identifier of the pre-trained BART model.
    - The model `"facebook/bart-large-cnn"` is a large transformer model fine-tuned specifically for CNN/Daily Mail summarization tasks.

- **`tokenizer = BartTokenizer.from_pretrained(model_name)`**:
    - The `BartTokenizer` is used to load the pre-trained tokenizer for the model.
    - The tokenizer is responsible for converting raw text into tokens that can be processed by the model. It also handles padding and truncation of text sequences.
    - The `from_pretrained()` function loads the tokenizer associated with the pre-trained BART model, ensuring the correct tokenization for summarization.

- **`model = BartForConditionalGeneration.from_pretrained(model_name)`**:
    - This line loads the pre-trained BART model using the identifier `model_name`.
    - The model is designed for conditional text generation tasks, such as summarization. In this case, it has been fine-tuned for CNN/Daily Mail summarization.
    - The `from_pretrained()` function downloads and initializes the model with its pre-trained weights.


In [6]:
df = pd.read_csv('/content/drive/MyDrive/summarized_case_law.csv')

In [7]:
df['text_length'] = df['cleaned_text'].apply(lambda x: len(x.split()))
print(df['text_length'].describe())  # Get a summary of text lengths

count       50.000000
mean      8486.280000
std       6390.552587
min        259.000000
25%       2995.750000
50%       7643.500000
75%      13282.250000
max      25752.000000
Name: text_length, dtype: float64


In [8]:
df['cleaned_text'] = df['cleaned_text'].astype(str)
print(df.dtypes)

cleaned_text    object
summary         object
text_length      int64
dtype: object


In [9]:
df['cleaned_text'] = df['cleaned_text'].str.replace('[^\w\s]', '', regex=True)
df['cleaned_text'] = df['cleaned_text'].str.strip()

### Splitting Text

In [11]:
def split_text(text, max_length=1024):
    words = text.split()
    return [' '.join(words[i:i+max_length]) for i in range(0, len(words), max_length)]

df['split_texts'] = df['cleaned_text'].apply(split_text)
split_rows = []

for _, row in df.iterrows():
    case_law_chunks = split_text(row['cleaned_text'], max_length=1024)
    for chunk in case_law_chunks:
        split_rows.append({
            'cleaned_text': chunk,
            'summary': row['summary']
        })

split_df = pd.DataFrame(split_rows)

print(split_df.head())

                                        cleaned_text  \
0  Judgments and decisions from 2001 onwards 2025...   
1  Commissioner when handling data protection com...   
2  clock and to make an order for an appropriate ...   
3  2023February14 against the decision of the ICO...   
4  Judgments and decisions from 2001 onwards 2025...   

                                             summary  
0  The Tribunal issued a stay of proceedings at t...  
1  The Tribunal issued a stay of proceedings at t...  
2  The Tribunal issued a stay of proceedings at t...  
3  The Tribunal issued a stay of proceedings at t...  
4  The Bank sought an order that Mr Bogolyubov mu...  


- **`split_text(text, max_length=1024)`**:
    - The `split_text` function is designed to break down long text into smaller, more manageable chunks. This is important when dealing with large documents, as models like BART or transformers in general have a maximum token limit (in this case, 1024 tokens).
    - The function splits the text into individual words and then groups them into chunks of a specified maximum length (`max_length=1024`). The `max_length` parameter ensures that each chunk contains no more than 1024 words, which is a suitable size for processing by the model.
    - **Why Split Text?**: The reason for splitting text is that many natural language models, including BART, cannot process long documents as a single input if the text exceeds the model’s token limit. By splitting long documents into smaller parts, you can ensure that the text fits within the model's constraints and avoids truncation.

- **`df['split_texts'] = df['cleaned_text'].apply(split_text)`**:
    - This line applies the `split_text` function to each entry in the `cleaned_text` column of the dataframe. The result is stored in a new column called `split_texts`, where each document is split into smaller chunks.
    - This ensures that each chunk of text is of a size that the model can process without running into memory or token limits.

- **`split_rows = []`**:
    - An empty list `split_rows` is created to store the individual rows that will form the new dataframe.

- **`for _, row in df.iterrows():`**:
    - This loop iterates through each row of the dataframe `df`. For each row, the cleaned text (`row['cleaned_text']`) is split into smaller chunks using the `split_text` function.

- **`case_law_chunks = split_text(row['cleaned_text'], max_length=1024)`**:
    - For each row, the `split_text` function is called to divide the `cleaned_text` into manageable chunks of up to 1024 words.

- **`for chunk in case_law_chunks:`**:
    - This nested loop iterates through each chunk of the split text. For every chunk, a new dictionary is created containing the chunk of `cleaned_text` and its corresponding `summary` from the original row.

- **`split_rows.append({...})`**:
    - Each dictionary is appended to the `split_rows` list. This list will eventually hold all the individual chunks of text and their summaries.

- **`split_df = pd.DataFrame(split_rows)`**:
    - The `split_rows` list is converted into a new dataframe `split_df`. This dataframe contains the split chunks of text along with their summaries, making it easier to process smaller parts of each document during training.

---

### Why Split the Text?

Splitting the text into smaller chunks is a common practice when preparing text for training machine learning models, particularly in the case of large documents. Here’s why this is important:

- **Token Limits**: Most models, like BART or other transformers, have a token limit (e.g., 1024 tokens). Documents longer than this limit cannot be processed as a whole, so splitting them ensures that each chunk fits within the model’s token constraints.
- **Improved Training**: Training a model with smaller, more manageable chunks allows the model to focus on understanding and summarizing specific portions of text, leading to potentially better performance when handling large documents.
- **Efficiency**: Breaking down documents into smaller pieces helps improve processing time, memory usage, and allows for batch processing, which can be more efficient during training.

### Outcome:

The code processes the dataset, splits long text documents into smaller chunks, and creates a new dataframe (`split_df`) that can be used for model training. Each chunk is paired with its corresponding summary, ensuring the model can learn from smaller, easily digestible pieces of information. This step is crucial for training on documents of varying lengths while adhering to the token limitations of the model.

### Summary:

This block of code prepares the text data for training by splitting long documents into smaller chunks. This ensures that the model can process and summarize each piece of text without hitting token limits, improving both performance and efficiency during model training.

In [12]:
split_df

Unnamed: 0,cleaned_text,summary
0,Judgments and decisions from 2001 onwards 2025...,The Tribunal issued a stay of proceedings at t...
1,Commissioner when handling data protection com...,The Tribunal issued a stay of proceedings at t...
2,clock and to make an order for an appropriate ...,The Tribunal issued a stay of proceedings at t...
3,2023February14 against the decision of the ICO...,The Tribunal issued a stay of proceedings at t...
4,Judgments and decisions from 2001 onwards 2025...,The Bank sought an order that Mr Bogolyubov mu...
...,...,...
435,demand the absolute most from your body or you...,"The appellant, Global by Nature Ltd (“GBN”), i..."
436,was marketed on the basis that it supplies amp...,"The appellant, Global by Nature Ltd (“GBN”), i..."
437,of Sunwarrior products it is necessary to focu...,"The appellant, Global by Nature Ltd (“GBN”), i..."
438,overall marketing approach which as we have fo...,"The appellant, Global by Nature Ltd (“GBN”), i..."


## Fine Tuning Model
### Explanation:

In [13]:
train_size = 0.8
train_df = split_df.sample(frac=train_size, random_state=42)
val_df = split_df.drop(train_df.index)

In [14]:
train_dataset = Dataset.from_pandas(train_df)
val_dataset = Dataset.from_pandas(val_df)

- **`train_size = 0.8`**:
    - This defines the proportion of data to be used for training. In this case, 80% of the data will be used for training the model.

- **`train_df = split_df.sample(frac=train_size, random_state=42)`**:
    - This line randomly samples 80% of the rows from `split_df` to create the training set (`train_df`). The `random_state=42` ensures reproducibility of the results by setting a fixed random seed.

- **`val_df = split_df.drop(train_df.index)`**:
    - This creates the validation set by dropping the rows in `train_df` from the original dataframe `split_df`. The remaining 20% of the data will be used for validation.

- **`train_dataset = Dataset.from_pandas(train_df)`** and **`val_dataset = Dataset.from_pandas(val_df)`**:
    - These lines convert the training and validation dataframes (`train_df` and `val_df`) into Hugging Face `Dataset` objects. The `Dataset` class allows for easier manipulation and interaction with the data, especially when using Hugging Face's `transformers` library.

### Tokenization:

In [15]:
def tokenize_data(examples):
    model_inputs = tokenizer(
        examples['cleaned_text'],
        max_length=1024,  # Input max length for BART
        truncation=True,
        padding=True  # Dynamic padding
    )

    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            examples['summary'],
            max_length=150,  # Output max length
            truncation=True,
            padding=True  # Dynamic padding
        )

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [None]:
train_dataset = train_dataset.map(tokenize_data, batched=True)
val_dataset = val_dataset.map(tokenize_data, batched=True)

- **`def tokenize_data(examples):`**:
    - This function is responsible for tokenizing the input text and its corresponding summary. Tokenization converts raw text into the numerical format (token IDs) required by the model.

- **`model_inputs = tokenizer(...)`**:
    - The input text (`cleaned_text`) is tokenized using the `tokenizer` object. The `max_length=1024` ensures that the input text is truncated if it exceeds 1024 tokens. The `truncation=True` option ensures that any text longer than 1024 tokens will be cut off, and `padding=True` applies dynamic padding to ensure that shorter sequences are padded to the same length.

- **`with tokenizer.as_target_tokenizer():`**:
    - This temporarily sets the tokenizer to behave as a "target tokenizer," meaning it will be used to tokenize the target text (the summaries) rather than the input text.

- **`labels = tokenizer(...)`**:
    - The summaries (`examples['summary']`) are tokenized in a similar way to the input text, but with a smaller `max_length=150`, as the output (summary) is typically shorter. Again, `truncation=True` and `padding=True` are used to handle longer or shorter summaries.

- **`model_inputs["labels"] = labels["input_ids"]`**:
    - This assigns the tokenized summary (target) to the `labels` key in the `model_inputs` dictionary, which is the format expected by the BART model for training.

- **`train_dataset = train_dataset.map(tokenize_data, batched=True)`** and **`val_dataset = val_dataset.map(tokenize_data, batched=True)`**:
    - These lines apply the `tokenize_data` function to both the training and validation datasets using the `.map()` function. The `batched=True` option allows tokenization of multiple rows at once, which can speed up the process.

### Data Collator:

In [17]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

- **`data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)`**:
    - The `DataCollatorForSeq2Seq` is a special data collator that dynamically pads the input and output sequences to the same length during batch processing. This is crucial when training a sequence-to-sequence model like BART because it ensures that each batch has consistent input/output lengths, improving efficiency.

## Tuning Parameters
### Explanation:
#### Training Arguments:

In [18]:
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    learning_rate=1e-5,
    num_train_epochs=10,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    save_strategy = "epoch",
    eval_strategy = "epoch",
    report_to="none",
)

- **`TrainingArguments`**:
    - This class defines all the hyperparameters and configurations for training the model. These settings control how the model is trained, including aspects like batch size, learning rate, logging, and more.

- **`output_dir="./results"`**:
    - This specifies the directory where the training results (e.g., model checkpoints, logs, etc.) will be saved.

- **`per_device_train_batch_size=4`** and **`per_device_eval_batch_size=4`**:
    - These define the batch size for training and evaluation, respectively. A batch size of 4 means that 4 samples are processed simultaneously per device (e.g., per GPU).

- **`learning_rate=1e-5`**:
    - The learning rate determines how much the model weights are adjusted during training with each update. A smaller learning rate (1e-5) ensures that the model learns gradually, reducing the risk of overshooting the optimal weights.

- **`num_train_epochs=10`**:
    - This specifies the number of epochs (full passes through the training dataset) during training. In this case, the model will be trained for 10 epochs.

- **`weight_decay=0.01`**:
    - Weight decay is a form of regularization to prevent overfitting by adding a penalty to the model's weights. A weight decay of 0.01 helps reduce the likelihood of overfitting by preventing the model from assigning too much importance to any one feature.

- **`logging_dir="./logs"`**:
    - The directory where logs, including training metrics, will be saved. These logs can later be analyzed for insights into the training process.

- **`logging_steps=10`**:
    - This specifies how often (in terms of steps) the model should log its metrics. Every 10 steps, the model will log its current performance.

- **`save_strategy = "epoch"`**:
    - This tells the trainer to save model checkpoints at the end of each epoch. This ensures that a version of the model is saved after each full pass through the dataset.

- **`eval_strategy = "epoch"`**:
    - This instructs the trainer to evaluate the model on the validation dataset at the end of each epoch, allowing you to monitor its performance over time.

- **`report_to="none"`**:
    - This disables reporting to third-party monitoring tools like TensorBoard or WandB, keeping the training process simpler by not sending the logs anywhere outside the local environment.

#### Trainer:


In [19]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator
)

- **`Trainer`**:
    - The `Trainer` class is a high-level API provided by Hugging Face to streamline the training process. It handles many of the complexities of training, evaluation, and saving checkpoints.

- **`model=model`**:
    - This specifies the model to be trained. In this case, it is the BART model loaded earlier (`BartForConditionalGeneration`).

- **`args=training_args`**:
    - The `args` parameter provides the trainer with all the training configurations defined in the `TrainingArguments` block.

- **`train_dataset=train_dataset`** and **`eval_dataset=val_dataset`**:
    - These are the datasets used for training and validation. The `train_dataset` contains the training data, while the `eval_dataset` holds the validation data. The trainer will use these datasets to fine-tune the model and evaluate its performance after each epoch.

- **`data_collator=data_collator`**:
    - The `data_collator` is responsible for dynamically padding and batching the sequences, ensuring that each batch has consistent input lengths.

#### Training:

- **`trainer.train()`**:
    - This starts the training process using the configurations, datasets, and model provided. The model will train for the specified number of epochs (10) and use the settings defined in the `TrainingArguments` to control the process. During training, the model will log its progress every 10 steps and save a checkpoint at the end of each epoch.

In [None]:
trainer.train()

### Evaluation of the Case Law Summarizer Training Results

The table shows the training and validation loss over 10 epochs for the case law summarizer model. Here's an evaluation of the results:

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 1.320100      | 0.908481        |
| 2     | 0.398700      | 0.285914        |
| 3     | 0.141100      | 0.127164        |
| 4     | 0.080600      | 0.084278        |
| 5     | 0.031400      | 0.067373        |
| 6     | 0.043000      | 0.074902        |
| 7     | 0.028400      | 0.066691        |
| 8     | 0.016000      | 0.075075        |
| 9     | 0.011100      | 0.070565        |
| 10    | 0.009100      | 0.068837        |

#### Epoch 1: 
- **Training Loss:** 1.320100
- **Validation Loss:** 0.908481
  - At the start, both training and validation loss are relatively high. This indicates that the model has not yet learned how to generate accurate summaries, but this is expected in the first epoch as the model is still adjusting its weights.

#### Epoch 2: 
- **Training Loss:** 0.398700
- **Validation Loss:** 0.285914
  - Significant improvement is seen in both losses, especially in the validation loss, which drops by more than half. This suggests the model has started learning effectively and is beginning to generalize better to unseen data.

#### Epoch 3:
- **Training Loss:** 0.141100
- **Validation Loss:** 0.127164
  - Both losses continue to drop, with validation loss nearing 0.1. The model's ability to summarize case law accurately is improving, and overfitting has not yet set in, as validation loss still decreases.

#### Epoch 4-5:
- **Training Loss:** 0.080600 → 0.031400
- **Validation Loss:** 0.084278 → 0.067373
  - Training loss decreases at a faster rate than validation loss. The validation loss shows a steady decline, indicating continued improvement in generalization to unseen data.

#### Epoch 6-10:
- **Training Loss:** 0.043000 → 0.009100
- **Validation Loss:** 0.074902 → 0.068837
  - Although training loss continues to decrease, validation loss starts to fluctuate slightly around 0.07. This could indicate that the model has reached a point of diminishing returns in learning from the data. The slight increase in validation loss (epochs 6 and 8) might suggest early signs of overfitting.

### Key Observations:
1. **Steady Improvement:** The training and validation losses consistently decreased, especially during the first five epochs, indicating strong learning behavior.
2. **Early Overfitting Signs:** The small fluctuation in validation loss towards the end suggests that the model might have started overfitting slightly, even as training loss continued to improve. However, the overfitting is not severe, as validation loss remains low.
3. **Training Success:** Overall, the model has learned to effectively summarize case law, with both losses reaching very low values by epoch 10.

### Next Steps (Lab 3):
1. **Hyperparameter Tuning:** To further enhance performance and reduce the small overfitting signs, hyperparameter tuning will be crucial. Adjusting parameters like learning rate, batch size, and regularization terms may help stabilize validation loss.
2. **Detoxification:** As the model is designed for case law, detoxifying the model (e.g., removing biases or inappropriate content) will be critical to ensure it produces ethical and balanced summaries.

Lab 3 will focus on these improvements to refine the model’s performance and enhance its robustness for legal summarization tasks.

In [None]:
save_path = "/content/drive/MyDrive/fine_tuned_bart_best"

trainer.save_model(save_path)
tokenizer.save_pretrained(save_path)

In [18]:
model_path = "./fine_tuned_bart_best"

In [None]:
!zip -r /content/fine_tuned_bart_best.zip /content/fine_tuned_bart_best