**1. Introduction**

Legal contracts are usually long documents made up of many different clauses, such as termination clauses, confidentiality clauses, governing law clauses, and so on. Manually identifying and organising these clauses can be time-consuming, especially when dealing with large numbers of contracts.

The aim of this project is to investigate whether a Large Language Model (LLM), specifically a BERT-style model, can be fine-tuned to automatically classify individual contract clauses into their correct categories.

This task is framed as a multi-class text classification problem, where the input is a piece of legal text and the output is the type of clause it represents. In addition to fine-tuning a transformer-based model, a traditional machine learning baseline will also be implemented to allow for a meaningful comparison.

In [None]:
# Install the libraries needed for this notebook
# These are standard libraries used in NLP projects with transformers

!pip -q install transformers datasets evaluate accelerate scikit-learn matplotlib

In [None]:
# Import all required libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import torch

from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, confusion_matrix

from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding,
    set_seed
)

print("Torch version:", torch.__version__)
print("GPU available:", torch.cuda.is_available())

**Dataset Description**

For this project, the LEDGAR dataset is used. This dataset contains clauses extracted from real legal contracts filed with the U.S. Securities and Exchange Commission (SEC). Each clause is labelled according to its type, such as Termination, Governing Law, Confidentiality, and others.

The dataset is provided as part of the LexGLUE benchmark and is publicly available via Hugging Face. This makes it suitable for academic use and ensures the results can be reproduced.

In [None]:
# Load the LEDGAR dataset from the LexGLUE benchmark

dataset = load_dataset("coastalcph/lex_glue", "ledgar")
dataset

In [None]:
# Inspecting the structure of the dataset

print(dataset)
print("\nColumns:", dataset["train"].column_names)

dataset["train"][0]

**Exploratory Data Analysis**

Before training any models, it is important to understand what the data looks like and whether there are any obvious issues.

In [None]:
# Convert a small subset of the training data to a DataFrame for easier inspection

sample_df = pd.DataFrame(dataset["train"][:2000])
sample_df.head()

In [None]:
# Check for missing values

sample_df.isna().sum()

In [None]:
# Look at the distribution of labels (top 20 only for readability)

label_counts = sample_df["label"].value_counts().head(20)

plt.figure(figsize=(10, 4))
label_counts.plot(kind="bar")
plt.title("Label distribution (sample of training data)")
plt.xlabel("Label ID")
plt.ylabel("Number of clauses")
plt.show()

In [None]:
# Display a few example clauses to understand the task better

for i in range(3):
    print("\n--- Clause Example", i + 1, "---")
    print("Label:", sample_df.loc[i, "label"])
    print(sample_df.loc[i, "text"][:600], "...")

**Dataset Splits**

The dataset already comes with predefined training, validation, and test splits. These are used directly to avoid data leakage and to follow best practice.
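
As a quick sanity check on leakage, one could count clause texts that appear verbatim in more than one split. The sketch below is illustrative only: `normalise` and `count_overlap` are hypothetical helpers, the toy lists stand in for the real splits, and exact matching after case and whitespace normalisation will not catch near-duplicates.

```python
def normalise(text):
    """Lower-case and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())

def count_overlap(split_a, split_b):
    """Return how many distinct normalised texts occur in both splits."""
    return len({normalise(t) for t in split_a} & {normalise(t) for t in split_b})

# Toy stand-ins for two splits (invented sentences, not LEDGAR data)
toy_train = ["This Agreement is governed by Delaware law.", "Either party may terminate."]
toy_test = ["either party may  terminate.", "All notices shall be in writing."]

print(count_overlap(toy_train, toy_test))
```

With the real data, the same helper could be applied to the `text` columns of the training and test splits.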

In [None]:
train_data = dataset["train"]
val_data = dataset["validation"]
test_data = dataset["test"]

print("Training samples:", len(train_data))
print("Validation samples:", len(val_data))
print("Test samples:", len(test_data))

**Baseline Model: TF-IDF + Logistic Regression**

Before using a large language model, a simple baseline model is implemented. This helps to assess whether the additional complexity of a transformer model is actually justified.

The baseline uses TF-IDF for text vectorisation and Logistic Regression for classification.
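
To make the baseline idea concrete before applying it to LEDGAR, here is a minimal sketch on a handful of invented clauses. The texts, the two label names and the use of `make_pipeline` are assumptions for illustration; the notebook itself fits the vectoriser and classifier separately and uses different hyperparameters.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy clauses invented for illustration (not taken from LEDGAR)
toy_texts = [
    "Either party may terminate this Agreement upon thirty days written notice.",
    "This Agreement terminates automatically upon breach by either party.",
    "The Company may terminate the employment of the Executive at any time.",
    "This Agreement shall be governed by the laws of the State of Delaware.",
    "The laws of the State of New York govern this Agreement.",
    "This contract is governed by and construed under the laws of England.",
]
toy_labels = ["termination"] * 3 + ["governing_law"] * 3

# Unigram + bigram TF-IDF features feeding a logistic regression classifier
toy_baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
toy_baseline.fit(toy_texts, toy_labels)

toy_pred = toy_baseline.predict(["This Agreement is governed by the laws of California."])
print(toy_pred[0])
```

The classifier only sees word and bigram weights, so it relies on surface vocabulary such as "governed" or "terminate" rather than on sentence structure.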

In [None]:
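# Extract text and labels for the baseline model
# list(...) yields plain Python lists, whatever column type the installed `datasets` version returns

X_train = list(train_data["text"])
y_train = list(train_data["label"])

X_val = list(val_data["text"])
y_val = list(val_data["label"])

X_test = list(test_data["text"])
y_test = list(test_data["label"])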

In [None]:
# Convert text to TF-IDF features

tfidf = TfidfVectorizer(
    max_features=50000,
    ngram_range=(1, 2),
    min_df=2
)

X_train_vec = tfidf.fit_transform(X_train)
X_val_vec = tfidf.transform(X_val)
X_test_vec = tfidf.transform(X_test)

X_train_vec.shape

In [None]:
# Train the Logistic Regression baseline

baseline_model = LogisticRegression(
    max_iter=2000,
    n_jobs=-1
)

baseline_model.fit(X_train_vec, y_train)

In [None]:
# Evaluate the baseline model

def evaluate_model(y_true, y_pred, name):
    acc = accuracy_score(y_true, y_pred)
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0
    )
    print(f"{name}")
    print(f"Accuracy: {acc:.4f}")
    print(f"Macro Precision: {p:.4f}")
    print(f"Macro Recall: {r:.4f}")
    print(f"Macro F1-score: {f1:.4f}\n")

val_preds_baseline = baseline_model.predict(X_val_vec)
test_preds_baseline = baseline_model.predict(X_test_vec)

evaluate_model(y_val, val_preds_baseline, "Baseline (Validation)")
evaluate_model(y_test, test_preds_baseline, "Baseline (Test)")

**Fine-Tuning a BERT-style Model**

While the baseline provides a useful reference point, legal text often contains complex sentence structures and specialised vocabulary. For this reason, a transformer-based model trained on legal text is used.
In this project, LegalBERT is selected. It follows the same architecture as BERT but has been pre-trained on legal documents, making it well-suited to this task.

In [None]:
# Set a random seed so results are reproducible

set_seed(42)

model_name = "nlpaueb/legal-bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

num_labels = train_data.features["label"].num_classes

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=num_labels
)

**Tokenisation and Preprocessing**

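The text is tokenised with the LegalBERT tokenizer loaded above. Each clause is truncated to at most 256 tokens. Padding is not applied at this stage; it is added dynamically per batch by the data collator defined in the training configuration, so each batch is only padded to the length of its longest clause.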
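
The interplay of truncation and per-batch padding can be sketched in plain Python. This is a simplified illustration, not the Hugging Face implementation: `truncate` and `pad_batch` are hypothetical helpers, the token IDs are made up, and a real tokenizer also keeps the closing special token when it truncates.

```python
def truncate(ids, max_length):
    """Keep at most max_length token IDs (a simplified stand-in for truncation=True)."""
    return ids[:max_length]

def pad_batch(batch, pad_id=0):
    """Pad every sequence to the longest one in this batch and build attention masks."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return input_ids, attention_mask

# Hypothetical token IDs for three clauses of different lengths
toy_batch = [[101, 7, 8, 9, 10, 11, 102], [101, 5, 102], [101, 3, 4, 102]]

truncated = [truncate(ids, max_length=5) for ids in toy_batch]
input_ids, attention_mask = pad_batch(truncated)

print(input_ids)
print(attention_mask)
```

Padding to the longest sequence in each batch, rather than to a fixed 256 tokens, avoids spending computation on padding when a batch contains only short clauses.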

In [None]:
# Tokenisation function

def tokenize_batch(batch):
    return tokenizer(
        batch["text"],
        truncation=True,
        max_length=256
    )

tokenised_train = train_data.map(tokenize_batch, batched=True)
tokenised_val = val_data.map(tokenize_batch, batched=True)
tokenised_test = test_data.map(tokenize_batch, batched=True)

tokenised_train = tokenised_train.remove_columns(["text"])
tokenised_val = tokenised_val.remove_columns(["text"])
tokenised_test = tokenised_test.remove_columns(["text"])

tokenised_train.set_format("torch")
tokenised_val.set_format("torch")
tokenised_test.set_format("torch")

**Training Configuration**

The Hugging Face Trainer API is used to simplify the training process while still allowing full control over evaluation and metrics.
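
During evaluation the model returns one logit per class, and the metric function turns these into predictions with an argmax. The sketch below uses made-up logits (the names `demo_logits` and `demo_labels` are illustrative) to show why macro averaging matters on imbalanced data: a class that is never predicted correctly drags macro F1 down much more than accuracy or weighted F1.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Made-up logits for six clauses and three classes (illustrative values only)
demo_logits = np.array([
    [2.0, 0.1, 0.3],
    [1.5, 0.2, 0.1],
    [3.0, 0.5, 0.2],
    [2.2, 0.4, 0.1],
    [1.1, 0.9, 0.2],  # true class is 1, but class 0 has the highest score
    [0.1, 0.2, 2.5],
])
demo_labels = np.array([0, 0, 0, 0, 1, 2])

# Predicted class = index of the largest logit in each row
demo_preds = np.argmax(demo_logits, axis=1)

demo_acc = accuracy_score(demo_labels, demo_preds)
demo_macro_f1 = f1_score(demo_labels, demo_preds, average="macro", zero_division=0)
demo_weighted_f1 = f1_score(demo_labels, demo_preds, average="weighted", zero_division=0)

print(f"accuracy={demo_acc:.4f} macro_f1={demo_macro_f1:.4f} weighted_f1={demo_weighted_f1:.4f}")
```

Because the averaging scheme changes the numbers noticeably, two models should only be compared using the same scheme.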

In [None]:
# Padding is handled dynamically within each batch

data_collator = DataCollatorWithPadding(tokenizer)

In [None]:
# Metric function used during evaluation

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=1)

    acc = accuracy_score(labels, preds)
    p, r, f1, _ = precision_recall_fscore_support(
        labels, preds, average="macro", zero_division=0
    )

    return {
        "accuracy": acc,
        "macro_precision": p,
        "macro_recall": r,
        "macro_f1": f1
    }

In [None]:
# Training arguments

training_args = TrainingArguments(
    output_dir="legalbert_ledgar",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=2,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="macro_f1",
    logging_steps=100,
    report_to="none"
)

In [None]:
# Trainer setup

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenised_train,
    eval_dataset=tokenised_val,
    # Note: recent transformers releases deprecate `tokenizer=` in favour of `processing_class=`
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

In [None]:
# Fine-tune the LegalBERT model

trainer.train()

In [None]:
# Evaluate on validation and test sets

val_results = trainer.evaluate(tokenised_val)
test_results = trainer.evaluate(tokenised_test)

print("LegalBERT (Validation):", val_results)
print("LegalBERT (Test):", test_results)

**Error Analysis**

To better understand the model’s behaviour, a confusion matrix and a small number of misclassified examples are examined.
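
As a small illustration of how a confusion matrix and the most frequently confused label pairs relate, the sketch below builds both by hand for invented labels. The variable names and values are assumptions for this example only; rows are true labels and columns are predicted labels, the same orientation scikit-learn uses.

```python
from collections import Counter
import numpy as np

# Invented true and predicted labels for three classes
cm_true = [0, 0, 1, 1, 2, 2, 2]
cm_pred = [0, 1, 1, 1, 2, 0, 0]

n_classes = 3
demo_cm = np.zeros((n_classes, n_classes), dtype=int)
for t, p in zip(cm_true, cm_pred):
    demo_cm[t, p] += 1  # rows = true label, columns = predicted label

# Off-diagonal cells are errors; count each (true, predicted) pair
confused_pairs = Counter((t, p) for t, p in zip(cm_true, cm_pred) if t != p)

print(demo_cm)
print(confused_pairs.most_common(1))
```

Listing the pairs directly is often easier to read than a heat map when there are many classes, since most cells of a large confusion matrix are empty.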

In [None]:
# Generate predictions for the test set

predictions = trainer.predict(tokenised_test)
test_preds = np.argmax(predictions.predictions, axis=1)

In [None]:
# Confusion matrix

cm = confusion_matrix(y_test, test_preds)

plt.figure(figsize=(6, 6))
plt.imshow(cm)
plt.title("Confusion Matrix (Test Set)")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.colorbar()
plt.show()

In [None]:
# Inspect a few misclassified clauses

errors_shown = 0

for i in range(len(test_preds)):
    if test_preds[i] != y_test[i]:
        print("\nTrue label:", y_test[i], "| Predicted:", int(test_preds[i]))
        print(test_data[i]["text"][:700], "...")
        errors_shown += 1

    if errors_shown == 5:
        break

In [None]:
# Save the trained model and tokenizer

save_path = "saved_legalbert_ledgar"
trainer.save_model(save_path)
tokenizer.save_pretrained(save_path)

save_path

In [None]:
# Simple inference example

from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model=save_path,
    tokenizer=save_path,
    device=0 if torch.cuda.is_available() else -1
)

example_clause = """
This Agreement may be terminated by either party upon thirty (30) days written notice.
"""

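# No id2label mapping was set on the model, so the predicted label is reported as a generic "LABEL_<id>"
classifier(example_clause)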

In [None]:
# COMPREHENSIVE MODEL COMPARISON AND VISUALIZATION

# 1. Get final metrics for both models
print("\n1. FINAL PERFORMANCE METRICS")

# Baseline metrics, macro-averaged so they are directly comparable with the LegalBERT metrics below
baseline_test_accuracy = accuracy_score(y_test, test_preds_baseline)
baseline_precision, baseline_recall, baseline_f1, _ = precision_recall_fscore_support(
    y_test, test_preds_baseline, average="macro", zero_division=0
)

# LegalBERT metrics (from trainer evaluation)
legalbert_test_accuracy = test_results["eval_accuracy"]
legalbert_precision = test_results["eval_macro_precision"]
legalbert_recall = test_results["eval_macro_recall"]
legalbert_f1 = test_results["eval_macro_f1"]

print(f"\n BASELINE (TF-IDF + Logistic Regression):")
print(f"   Accuracy:  {baseline_test_accuracy:.4f}")
print(f"   Precision: {baseline_precision:.4f}")
print(f"   Recall:    {baseline_recall:.4f}")
print(f"   F1-score:  {baseline_f1:.4f}")

print(f"\n LEGAL-BERT (Fine-tuned):")
print(f"   Accuracy:  {legalbert_test_accuracy:.4f}")
print(f"   Precision: {legalbert_precision:.4f}")
print(f"   Recall:    {legalbert_recall:.4f}")
print(f"   F1-score:  {legalbert_f1:.4f}")

# 2. Create comparison visualization
print("\n2. PERFORMANCE COMPARISON VISUALIZATION")

fig, axes = plt.subplots(1, 4, figsize=(16, 5))
fig.suptitle('Model Performance Comparison', fontsize=16, y=1.05)

metrics = ['Accuracy', 'Precision', 'Recall', 'F1-score']
baseline_scores = [baseline_test_accuracy, baseline_precision, baseline_recall, baseline_f1]
legalbert_scores = [legalbert_test_accuracy, legalbert_precision, legalbert_recall, legalbert_f1]
colors = ['#3498db', '#2ecc71']

for idx, (ax, metric) in enumerate(zip(axes, metrics)):
    bars = ax.bar(['Baseline', 'Legal-BERT'],
                  [baseline_scores[idx], legalbert_scores[idx]],
                  color=colors)
    ax.set_title(f'{metric}')
    ax.set_ylabel('Score')
    ax.set_ylim([0, 1])
    ax.grid(True, alpha=0.3)

    # Add value labels on bars
    for bar, score in zip(bars, [baseline_scores[idx], legalbert_scores[idx]]):
        height = bar.get_height()
        ax.text(bar.get_x() + bar.get_width()/2., height + 0.01,
                f'{score:.4f}', ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.show()

# 3. Improvement percentages
print("\n3. RELATIVE IMPROVEMENT")
print("-" * 40)

improvements = []
for i in range(4):
    improvement = ((legalbert_scores[i] - baseline_scores[i]) / baseline_scores[i]) * 100
    improvements.append(improvement)
    print(f"{metrics[i]}: {improvement:+.2f}%")

# 4. Detailed confusion matrix for top classes
print("\n4. CONFUSION MATRIX ANALYSIS")

# Get top 20 classes by frequency
top_classes = pd.Series(y_test).value_counts().head(20).index.tolist()

# Create mask for top classes
mask = [label in top_classes for label in y_test]
filtered_y_true = np.array(y_test)[mask]
filtered_y_pred = test_preds[mask]

# Create confusion matrix
cm = confusion_matrix(filtered_y_true, filtered_y_pred, labels=top_classes)

# Plot confusion matrix
plt.figure(figsize=(10, 8))
plt.imshow(cm, cmap='Blues')
plt.colorbar(label='Count')
plt.title('Confusion Matrix (Top 20 Classes)', fontsize=14)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')

# Add text annotations
for i in range(len(top_classes)):
    for j in range(len(top_classes)):
        plt.text(j, i, str(cm[i, j]),
                 horizontalalignment='center',
                 verticalalignment='center',
                 color='white' if cm[i, j] > cm.max()/2 else 'black')

plt.tight_layout()
plt.show()

# 5. Error analysis - Most confused pairs
print("\n5. ERROR ANALYSIS - MOST CONFUSED CLASS PAIRS")

from collections import Counter

# Find misclassifications
misclassifications = []
for true, pred in zip(y_test, test_preds):
    if true != pred:
        misclassifications.append((true, pred))

# Count most common errors
error_counts = Counter(misclassifications).most_common(10)

print("\nTop 10 most common misclassifications:")
print("-" * 50)
for (true_label, pred_label), count in error_counts:
    print(f"True: {true_label:3d} \u2192 Predicted: {pred_label:3d} | Count: {count:3d}")

# 6. Class-wise performance
print("\n6. CLASS-WISE PERFORMANCE ANALYSIS")

from sklearn.metrics import classification_report

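# Generate detailed classification report
report = classification_report(y_test, test_preds, output_dict=True, zero_division=0)
report_df = pd.DataFrame(report).transpose()

# Keep only per-class rows (drop the 'accuracy', 'macro avg' and 'weighted avg' summary rows)
report_df = report_df.drop(index=["accuracy", "macro avg", "weighted avg"], errors="ignore")

# Sort by F1-score
report_df_sorted = report_df.sort_values('f1-score', ascending=False)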

print("\nTop 10 best performing classes:")
print(report_df_sorted.head(10)[['precision', 'recall', 'f1-score', 'support']].round(4))

print("\nTop 10 worst performing classes:")
print(report_df_sorted.tail(10)[['precision', 'recall', 'f1-score', 'support']].round(4))

# 7. Training history visualization
print("\n7. TRAINING HISTORY")

# Extract training history from trainer
history = trainer.state.log_history

# Dictionaries to store metrics aggregated by epoch
train_losses_per_epoch = {}
eval_losses_per_epoch = {}
eval_accuracies_per_epoch = {}

for entry in history:
    epoch_val = entry.get('epoch')
    if epoch_val is None:
        continue

    # Assign each log entry to the epoch it belongs to: rounding up means
    # steps logged during the first epoch (0 < epoch <= 1) are grouped under epoch 1
    current_epoch = int(np.ceil(epoch_val))
    if current_epoch <= 0 or current_epoch > training_args.num_train_epochs:
        continue

    if 'loss' in entry and 'eval_loss' not in entry:
        # This is a training step log
        if current_epoch not in train_losses_per_epoch:
            train_losses_per_epoch[current_epoch] = []
        train_losses_per_epoch[current_epoch].append(entry['loss'])

    if 'eval_loss' in entry and 'eval_accuracy' in entry:
        # This is an evaluation step log that occurs at the end of an epoch
        # Take the last evaluation result for an epoch if multiple are logged (unlikely for eval_strategy="epoch")
        eval_losses_per_epoch[current_epoch] = entry['eval_loss']
        eval_accuracies_per_epoch[current_epoch] = entry['eval_accuracy']

# Prepare lists for plotting
plot_epochs = sorted(list(eval_losses_per_epoch.keys()))
plot_train_loss = []
plot_eval_loss = []
plot_eval_accuracy = []

for epoch in plot_epochs:
    # Use the average training loss for the epoch
    if epoch in train_losses_per_epoch and len(train_losses_per_epoch[epoch]) > 0:
        plot_train_loss.append(np.mean(train_losses_per_epoch[epoch]))
    else:
        # No training loss was logged for this epoch; NaN leaves a gap in the plotted line
        plot_train_loss.append(np.nan)

    plot_eval_loss.append(eval_losses_per_epoch[epoch])
    plot_eval_accuracy.append(eval_accuracies_per_epoch[epoch])

if plot_train_loss and plot_eval_loss and len(plot_epochs) == len(plot_train_loss) == len(plot_eval_loss) == len(plot_eval_accuracy):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

    # Plot loss
    ax1.plot(plot_epochs, plot_train_loss, 'b-', label='Training Loss (Avg)', marker='o')
    ax1.plot(plot_epochs, plot_eval_loss, 'r-', label='Validation Loss', marker='s')
    ax1.set_xlabel('Epoch')
    ax1.set_ylabel('Loss')
    ax1.set_title('Training and Validation Loss')
    ax1.legend()
    ax1.grid(True, alpha=0.3)

    # Plot accuracy
    ax2.plot(plot_epochs, plot_eval_accuracy, 'g-', label='Validation Accuracy', marker='s')
    ax2.set_xlabel('Epoch')
    ax2.set_ylabel('Accuracy')
    ax2.set_title('Validation Accuracy')
    ax2.legend()
    ax2.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()
else:
    print("Training history not available in the expected format or not enough epoch-wise data to plot.")

# 8. Model size and efficiency comparison
print("\n8. MODEL SIZE AND EFFICIENCY")

import os

def get_directory_size(path):
    """Calculate total size of directory in MB"""
    total_size = 0
    for dirpath, dirnames, filenames in os.walk(path):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            total_size += os.path.getsize(fp)
    return total_size / (1024 * 1024)  # Convert to MB

# Calculate sizes
legalbert_size = get_directory_size(save_path)
print(f"Legal-BERT model size: {legalbert_size:.2f} MB")

# Rough estimate of the baseline size: IDF weights + LR coefficients at 8 bytes per float64
# (the vocabulary strings and the LR intercepts are not counted)
tfidf_size = (X_train_vec.shape[1] * 8) / (1024 * 1024)
lr_size = (X_train_vec.shape[1] * num_labels * 8) / (1024 * 1024)
baseline_total_size = tfidf_size + lr_size

print(f"Baseline model size: ~{baseline_total_size:.2f} MB")

# 9. Summary statistics
print("\n" + "="*60)
print("PROJECT SUMMARY STATISTICS")
print("="*60)

# Format the validation F1 only if it is present; applying ":.4f" to the string "N/A" would raise an error
best_val_f1 = val_results.get("eval_macro_f1")

summary_stats = {
    "Dataset": "LEDGAR (LexGLUE)",
    "Task": "Legal Clause Classification",
    "Number of Classes": num_labels,
    "Training Samples": len(train_data),
    "Validation Samples": len(val_data),
    "Test Samples": len(test_data),
    "Baseline Model": "TF-IDF + Logistic Regression",
    "Transformer Model": "Legal-BERT Base Uncased",
    "Training Epochs": training_args.num_train_epochs,
    "Batch Size": training_args.per_device_train_batch_size,
    "Learning Rate": training_args.learning_rate,
    "Best Validation F1": f"{best_val_f1:.4f}" if best_val_f1 is not None else "N/A",
    "Test F1 Improvement": f"{((legalbert_f1 - baseline_f1) / baseline_f1 * 100):.2f}%"
}

for key, value in summary_stats.items():
    print(f"{key:25}: {value}")
