### Task Definition

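We aim to classify pairs of legal texts into predefined categories (multi-class classification): given a premise and a hypothesis, the model predicts one of three labels (Entailed, Neutral, Contradict). We compare two approaches: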
- Fine-tuning a smaller language model
- Prompt engineering using the base model


This task is suitable for comparing both approaches because:
- It requires understanding of context and semantics.
- Every example in the dataset is labeled, so both approaches can be scored against the same ground truth.
- The use case mirrors real-world legal classification needs.
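
The prompt-engineering approach can be sketched as a prompt template plus a small parser for the model's reply. This is a minimal illustration, not the notebook's method: the prompt wording and the helper names `build_prompt` and `parse_label` are assumptions, and no particular model API is called.

```python
from typing import Optional

# Label names as they appear in the dataset
LABELS = ["Entailed", "Neutral", "Contradict"]

def build_prompt(premise: str, hypothesis: str) -> str:
    """Format one premise/hypothesis pair as a zero-shot classification prompt."""
    return (
        "Decide how the hypothesis relates to the premise.\n"
        f"Answer with exactly one word from: {', '.join(LABELS)}.\n\n"
        f"Premise: {premise}\n"
        f"Hypothesis: {hypothesis}\n"
        "Answer:"
    )

def parse_label(response: str) -> Optional[str]:
    """Return the label mentioned earliest in the reply, or None if none appears."""
    lowered = response.lower()
    hits = [(lowered.find(label.lower()), label) for label in LABELS if label.lower() in lowered]
    return min(hits)[1] if hits else None

print(build_prompt("ADP was sued under BIPA.", "I like the ADP timeclock."))
print(parse_label("Neutral."))  # → Neutral
```

Returning `None` for an unparseable reply keeps malformed generations visible, so they can be counted as errors rather than silently mapped to a class.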


# **Step 1: Set up the environment**

In [None]:
!pip install -q datasets transformers



# **Step 2: Load the dataset**

In [None]:
from datasets import load_dataset

# Load the LegalLensNLI dataset
dataset = load_dataset("darrow-ai/LegalLensNLI")

# Print available splits
print(dataset)

# Display a sample entry
dataset['train'][0]


DatasetDict({
    train: Dataset({
        features: ['premise', 'hypothesis', 'legal_act', 'label'],
        num_rows: 312
    })
})


{'premise': "Consumers who used an ADP timeclock in Illinois between June 5, 2013, and Nov. 6, 2020, may be eligible to claim a $250 class action rebate as part of a settlement. ADP, a company that provides human resources tools and services, was sued for violating the Illinois Biometric Information Privacy Act (BIPA) by collecting individuals' biometric information without their consent. BIPA prohibits the collection of biometric information without permission and requires disclosure about its storage and destruction. ADP agreed to pay $25 million to settle the lawsuit and will also provide a written retention policy on its website regarding biometric information. Class Members can submit a Claim Form to participate in the settlement, and the deadline for submission is Feb. 8, 2021.",
 'hypothesis': 'Really been enjoying that ADP timeclock at work in Illinois, makes clocking in and out a breeze!',
 'legal_act': 'privacy',
 'label': 'Neutral'}
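
With only 312 examples and three labels, it helps to know how well a trivial classifier would do before reading any accuracy figure. The sketch below computes the majority-class baseline. The helper name `majority_baseline` and the toy labels are assumptions for illustration; in this notebook the input would be `dataset['train']['label']`.

```python
from collections import Counter

def majority_baseline(labels):
    """Return (most common label, accuracy of always predicting that label)."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)

# Toy labels for illustration only, not the real label distribution
toy_labels = ["Neutral", "Entailed", "Neutral", "Contradict", "Neutral"]
print(majority_baseline(toy_labels))  # → ('Neutral', 0.6)
```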

# **Mapping the entire data set with Labels**

In [None]:
label2id = {
    "Entailed": 0,
    "Neutral": 1,
    "Contradict": 2
}
id2label = {v: k for k, v in label2id.items()}

def encode_labels(example):
    example["label"] = label2id[example["label"]]
    return example

# Apply label mapping to all dataset splits
dataset = dataset.map(encode_labels)



# **Tokenizing the entire dataset**

In [None]:
from transformers import AutoTokenizer

# Use the Legal-BERT tokenizer
tokenizer = AutoTokenizer.from_pretrained("nlpaueb/legal-bert-base-uncased")

# Tokenization function for premise and hypothesis
def tokenize_function(example):
    return tokenizer(
        example["premise"],
        example["hypothesis"],
        truncation=True,
        padding="max_length",
        max_length=256  # can be tuned depending on available GPU memory
    )

# Apply tokenization to all dataset splits
tokenized_datasets = dataset.map(tokenize_function, batched=True)




# **Splitting the tokenized dataset into train and validation sets**

In [None]:
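# Split the tokenized train set into train/validation
split = tokenized_datasets["train"].train_test_split(test_size=0.1, seed=42)

# Drop the raw string columns: they cannot be collated into tensors or moved
# to a device, and the model does not accept them as inputs
text_columns = ["premise", "hypothesis", "legal_act"]
train_dataset = split["train"].remove_columns(text_columns)
val_dataset = split["test"].remove_columns(text_columns)

# Return PyTorch tensors so each batch value supports .to(device)
train_dataset.set_format("torch")
val_dataset.set_format("torch")

# Create PyTorch DataLoaders
from torch.utils.data import DataLoader

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=16)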


# **Loading the Legal-BERT model with three labels**

In [None]:
from transformers import AutoModelForSequenceClassification

# Load the Legal-BERT model with 3 output classes (Entailed, Neutral, Contradict)
model = AutoModelForSequenceClassification.from_pretrained(
    "nlpaueb/legal-bert-base-uncased",
    num_labels=3
)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at nlpaueb/legal-bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


# **Defining the Optimizer and scheduler**

In [None]:
from torch.optim import AdamW
from transformers import get_scheduler
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

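# Define optimizer and scheduler
# Note: the training loop cell re-creates the optimizer with its own
# hyperparameters and steps a StepLR scheduler, so `lr_scheduler` defined
# here is never stepped there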
optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)

num_epochs = 10
num_training_steps = num_epochs * len(train_loader)
lr_scheduler = get_scheduler(
    "linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)
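
The two learning-rate schedules that appear in this notebook (linear decay here, `StepLR` with `gamma=0.7` in the training loop) can be compared with their usual closed forms. This is a sketch under assumptions: the helper names are hypothetical, and the figure of 18 batches per epoch is taken from the training log rather than computed.

```python
def linear_decay_lr(step, base_lr, total_steps, warmup_steps=0):
    """Learning rate after `step` updates with linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

def step_decay_lr(epoch, base_lr, gamma, step_size=1):
    """Learning rate after `epoch` scheduler steps when it is multiplied by gamma every step_size."""
    return base_lr * gamma ** (epoch // step_size)

total_steps = 10 * 18  # num_epochs * batches per epoch (18 per the training log)

# Linear decay reaches zero only at the final step
for step in (0, 90, 180):
    print(f"step {step:3d}: lr = {linear_decay_lr(step, 2e-5, total_steps):.2e}")

# Multiplying by 0.7 every epoch shrinks the rate much faster early on
for epoch in (0, 1, 5, 10):
    print(f"epoch {epoch:2d}: lr = {step_decay_lr(epoch, 1e-5, 0.7):.2e}")
```

The per-epoch decay by a factor of 0.7 leaves under 3% of the initial rate after 10 epochs, so later epochs change the weights very little.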


# ***The Training Loop***

In [None]:
from torch.optim import AdamW
from torch.optim.lr_scheduler import StepLR
import torch
from tqdm import tqdm

# Set hyperparameters
batch_size = 20  # Not used below: the DataLoaders were already created with batch_size=16
num_epochs = 20
learning_rate = 1e-5  # Lower than the 2e-5 used in the previous cell
weight_decay = 0.01  # Decoupled weight decay applied by AdamW
dropout_rate = 0.1  # Not applied below; dropout is controlled by the model's own config
patience = 5  # Epochs without validation improvement before early stopping

# Set up optimizer and scheduler
optimizer = AdamW(model.parameters(), lr=learning_rate, weight_decay=weight_decay)
scheduler = StepLR(optimizer, step_size=1, gamma=0.7)  # Decay learning rate every epoch

# Move the model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Training and Validation Loop
best_val_accuracy = 0
epochs_without_improvement = 0

for epoch in range(num_epochs):
    model.train()
    total_train_loss = 0
    correct = 0
    total = 0

    # Training phase
    progress_bar = tqdm(train_loader, desc=f"Training Epoch {epoch + 1}/{num_epochs}", ncols=100)
    for batch in progress_bar:
        batch = {key: value.to(device) for key, value in batch.items()}  # Move batch to device

        # Ensure that 'label' is renamed to 'labels'
        if 'label' in batch:
            batch['labels'] = batch.pop('label')

        optimizer.zero_grad()
        outputs = model(**batch)  # Forward pass
        loss = outputs.loss
        logits = outputs.logits

        # Backpropagation
        loss.backward()
        optimizer.step()

        # Calculate accuracy
        _, predicted = torch.max(logits, dim=1)
        correct += (predicted == batch['labels']).sum().item()
        total += batch['labels'].size(0)

        total_train_loss += loss.item()
        progress_bar.set_postfix(loss=loss.item())

    # Calculate average train loss and accuracy
    avg_train_loss = total_train_loss / len(train_loader)
    train_accuracy = correct / total

    # Validation phase
    model.eval()
    total_val_loss = 0
    correct = 0
    total = 0

    with torch.no_grad():
        progress_bar_val = tqdm(val_loader, desc="Validating", ncols=100)
        for batch in progress_bar_val:
            batch = {key: value.to(device) for key, value in batch.items()}  # Move batch to device

            # Ensure that 'label' is renamed to 'labels' for validation
            if 'label' in batch:
                batch['labels'] = batch.pop('label')

            outputs = model(**batch)
            loss = outputs.loss
            logits = outputs.logits

            # Calculate accuracy
            _, predicted = torch.max(logits, dim=1)
            correct += (predicted == batch['labels']).sum().item()
            total += batch['labels'].size(0)

            total_val_loss += loss.item()
            progress_bar_val.set_postfix(loss=loss.item())

    # Calculate average validation loss and accuracy
    avg_val_loss = total_val_loss / len(val_loader)
    val_accuracy = correct / total

    # Print results for the current epoch
    print(f"\nEpoch {epoch + 1}/{num_epochs}")
    print(f"{'=' * 40}")
    print(f"Training Loss: {avg_train_loss:.4f} | Training Accuracy: {train_accuracy:.4f}")
    print(f"Validation Loss: {avg_val_loss:.4f} | Validation Accuracy: {val_accuracy:.4f}")
    print(f"{'=' * 40}")

    # Save the model if validation accuracy improves
    if val_accuracy > best_val_accuracy:
        best_val_accuracy = val_accuracy
        epochs_without_improvement = 0
        # Save model checkpoint
        torch.save(model.state_dict(), f"best_model_epoch_{epoch + 1}.pt")
        print(f"Model saved at epoch {epoch + 1}")
    else:
        epochs_without_improvement += 1

    # Early stopping condition
    if epochs_without_improvement >= patience:
        print(f"Early stopping at epoch {epoch + 1} due to no improvement in validation accuracy.")
        break

    # Step the scheduler to adjust the learning rate
    scheduler.step()

# Final model save after training
torch.save(model.state_dict(), "final_trained_model.pt")
print("Final model saved.")


Training Epoch 1/10: 100%|██████████████████████████████| 18/18 [00:11<00:00,  1.52it/s, loss=0.622]
Validating: 100%|█████████████████████████████████████████| 2/2 [00:00<00:00,  4.86it/s, loss=0.982]



Epoch 1/10
Training Loss: 0.3472 | Training Accuracy: 0.9429
Validation Loss: 0.7138 | Validation Accuracy: 0.6562
Model saved at epoch 1


Training Epoch 2/10: 100%|██████████████████████████████| 18/18 [00:11<00:00,  1.52it/s, loss=0.186]
Validating: 100%|██████████████████████████████████████████| 2/2 [00:00<00:00,  4.80it/s, loss=1.25]



Epoch 2/10
Training Loss: 0.2398 | Training Accuracy: 0.9679
Validation Loss: 0.8281 | Validation Accuracy: 0.6562


Training Epoch 3/10: 100%|██████████████████████████████| 18/18 [00:12<00:00,  1.49it/s, loss=0.103]
Validating: 100%|██████████████████████████████████████████| 2/2 [00:00<00:00,  4.77it/s, loss=1.31]



Epoch 3/10
Training Loss: 0.1645 | Training Accuracy: 0.9821
Validation Loss: 0.8297 | Validation Accuracy: 0.6562


Training Epoch 4/10: 100%|██████████████████████████████| 18/18 [00:12<00:00,  1.47it/s, loss=0.124]
Validating: 100%|██████████████████████████████████████████| 2/2 [00:00<00:00,  4.61it/s, loss=1.25]



Epoch 4/10
Training Loss: 0.1214 | Training Accuracy: 0.9893
Validation Loss: 0.7900 | Validation Accuracy: 0.6875
Model saved at epoch 4


Training Epoch 5/10: 100%|██████████████████████████████| 18/18 [00:12<00:00,  1.43it/s, loss=0.133]
Validating: 100%|██████████████████████████████████████████| 2/2 [00:00<00:00,  4.40it/s, loss=1.33]



Epoch 5/10
Training Loss: 0.1030 | Training Accuracy: 1.0000
Validation Loss: 0.8479 | Validation Accuracy: 0.6250


Training Epoch 6/10: 100%|█████████████████████████████| 18/18 [00:12<00:00,  1.40it/s, loss=0.0661]
Validating: 100%|██████████████████████████████████████████| 2/2 [00:00<00:00,  4.27it/s, loss=1.39]



Epoch 6/10
Training Loss: 0.0836 | Training Accuracy: 1.0000
Validation Loss: 0.8681 | Validation Accuracy: 0.6562


Training Epoch 7/10: 100%|█████████████████████████████| 18/18 [00:12<00:00,  1.40it/s, loss=0.0739]
Validating: 100%|██████████████████████████████████████████| 2/2 [00:00<00:00,  4.42it/s, loss=1.42]



Epoch 7/10
Training Loss: 0.0757 | Training Accuracy: 1.0000
Validation Loss: 0.8749 | Validation Accuracy: 0.6562
Early stopping at epoch 7 due to no improvement in validation accuracy.
Final model saved.
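
The early-stopping rule in the loop above can be isolated into a small helper and replayed on the logged validation accuracies, which shows how the patience value decides where training halts. This is a sketch: the class name `EarlyStopping` and the function `stopping_epoch` are hypothetical, and the patience values tried below are for illustration only.

```python
class EarlyStopping:
    """Track the best validation score and signal when patience is exhausted."""

    def __init__(self, patience):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def update(self, score):
        """Record one epoch's score; return True when training should stop."""
        if score > self.best:
            self.best = score
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

def stopping_epoch(scores, patience):
    """Return the 1-based epoch at which training stops, or None if it never does."""
    stopper = EarlyStopping(patience)
    for epoch, score in enumerate(scores, start=1):
        if stopper.update(score):
            return epoch
    return None

# Validation accuracies copied from the log above
val_acc = [0.6562, 0.6562, 0.6562, 0.6875, 0.6250, 0.6562, 0.6562]
print(stopping_epoch(val_acc, patience=3))  # → 7
print(stopping_epoch(val_acc, patience=5))  # → None
```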


In [1]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report

# Select the device again, since this cell may run in a fresh session
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the tokenizer for the model (same as used in training)
tokenizer = AutoTokenizer.from_pretrained("nlpaueb/legal-bert-base-uncased")

# Load the model (final saved model after training)
model = AutoModelForSequenceClassification.from_pretrained("nlpaueb/legal-bert-base-uncased", num_labels=3)
model.load_state_dict(torch.load("final_trained_model.pt", map_location=device))  # Path to the final trained model
model.to(device)  # Ensure the model is on the right device (CPU/GPU)
model.eval()  # Set the model to evaluation mode

# Initialize variables to store predictions and true labels
predictions = []
true_labels = []

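# Assumption: `test_data` is not defined anywhere in this notebook. The dataset
# ships only a train split, so this sketch rebuilds the held-out 10% split with
# the same test_size and seed used earlier and treats it as the test set.
from datasets import load_dataset

label2id = {"Entailed": 0, "Neutral": 1, "Contradict": 2}
held_out = load_dataset("darrow-ai/LegalLensNLI")["train"].train_test_split(test_size=0.1, seed=42)["test"]
test_data = [(ex["premise"], ex["hypothesis"], label2id[ex["label"]]) for ex in held_out]

# Iterate over the test data
for premise, hypothesis, true_label in test_data: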
    # Tokenize the inputs
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", padding=True, truncation=True, max_length=512)
    inputs = {key: value.to(device) for key, value in inputs.items()}  # Move to GPU if needed

    # Forward pass through the model
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits

    # Get the predicted class (Entailed: 0, Neutral: 1, Contradict: 2)
    predicted_class = torch.argmax(logits, dim=1).item()

    # Store predictions and true labels
    predictions.append(predicted_class)
    true_labels.append(true_label)

# Calculate the accuracy of the model on the test set
accuracy = accuracy_score(true_labels, predictions)
print(f"Test Accuracy: {accuracy * 100:.2f}%")

# Calculate Precision, Recall, and F1-score for each class
precision, recall, f1, _ = precision_recall_fscore_support(true_labels, predictions, average=None, labels=[0, 1, 2])

# Print the classification report
print("\nClassification Report:")
print(classification_report(true_labels, predictions, labels=[0, 1, 2], target_names=["Entailed", "Neutral", "Contradict"]))

# Optionally, print out Precision, Recall, and F1 for each class individually
print("\nPrecision, Recall, F1 for each class:")
for i, label in enumerate(["Entailed", "Neutral", "Contradict"]):
    print(f"{label}: Precision: {precision[i]:.4f}, Recall: {recall[i]:.4f}, F1: {f1[i]:.4f}")
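
The per-class numbers reported above follow directly from counts of true positives, false positives, and false negatives. The sketch below recomputes them by hand without scikit-learn. The function name `per_class_metrics` and the toy labels are assumptions for illustration, and a class with no predictions or no true examples is scored as 0.0.

```python
def per_class_metrics(true_labels, predictions, num_classes=3):
    """Return a list of (precision, recall, f1) tuples, one per class id."""
    pairs = list(zip(true_labels, predictions))
    metrics = []
    for c in range(num_classes):
        tp = sum(1 for t, p in pairs if t == c and p == c)  # correctly predicted as c
        fp = sum(1 for t, p in pairs if t != c and p == c)  # wrongly predicted as c
        fn = sum(1 for t, p in pairs if t == c and p != c)  # true c that was missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics.append((precision, recall, f1))
    return metrics

# Hypothetical labels for illustration only (0=Entailed, 1=Neutral, 2=Contradict)
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
for name, (p, r, f) in zip(["Entailed", "Neutral", "Contradict"], per_class_metrics(y_true, y_pred)):
    print(f"{name}: Precision: {p:.4f}, Recall: {r:.4f}, F1: {f:.4f}")
```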


