# BERT Fine-Tuning for Legal QA with Hugging Face
This Colab notebook fine-tunes `bert-base-uncased` for extractive question answering about the education rights of 3rd graders in D.C., using the `ed_policy_guidance` subset of the `pile-of-law` dataset.

Dataset: https://huggingface.co/datasets/pile-of-law/pile-of-law  
Model: `bert-base-uncased` (QA task)


In [1]:
#  Install Required Libraries
!pip install transformers datasets accelerate --quiet

In [2]:
!pip install --upgrade datasets transformers



In [3]:
# Imports
from datasets import load_dataset, Dataset
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, TrainingArguments, Trainer, pipeline
import torch
import random
import pandas as pd

In [4]:
from datasets import load_dataset

# Load the "pile-of-law" dataset with the "ed_policy_guidance" configuration/subset
# This is the correct way to load a specific subset from the Hugging Face Hub.
try:
    dataset = load_dataset("pile-of-law/pile-of-law", "ed_policy_guidance", split="train", trust_remote_code=True)
    print("Dataset 'pile-of-law' with subset 'ed_policy_guidance' loaded successfully!")
    print(dataset)

    # You can inspect the first few examples to ensure it's what you expect
    print("\nFirst 3 examples:")
    for i in range(min(3, len(dataset))):
        print(dataset[i])

except Exception as e:
    print(f"An error occurred: {e}")
    print("Please ensure you have an active internet connection and that the 'datasets' library is up to date.")
    print("You might also want to try updating the datasets library: pip install --upgrade datasets")
    raise  # re-raise so later cells do not run with `dataset` undefined

Loading Dataset Infos from C:\Users\User\.cache\huggingface\modules\datasets_modules\datasets\pile-of-law--pile-of-law\c1090502f95031ebfad49ede680394da5532909fa46b7a0452be8cddecc9fa60
Overwrite dataset info from restored data version if exists.
Loading Dataset info from C:\Users\User\.cache\huggingface\datasets/pile-of-law___pile-of-law/ed_policy_guidance/0.0.0/c1090502f95031ebfad49ede680394da5532909fa46b7a0452be8cddecc9fa60
Found cached dataset pile-of-law (C:/Users/User/.cache/huggingface/datasets/pile-of-law___pile-of-law/ed_policy_guidance/0.0.0/c1090502f95031ebfad49ede680394da5532909fa46b7a0452be8cddecc9fa60)
Loading Dataset info from C:/Users/User/.cache/huggingface/datasets/pile-of-law___pile-of-law/ed_policy_guidance/0.0.0/c1090502f95031ebfad49ede680394da5532909fa46b7a0452be8cddecc9fa60


Dataset 'pile-of-law' with subset 'ed_policy_guidance' loaded successfully!
Dataset({
    features: ['text', 'created_timestamp', 'downloaded_timestamp', 'url'],
    num_rows: 507
})

First 3 examples:
{'text': 'OSEP PRIOR APPROVAL GUIDANCE UNDER IDEA\n\nPOLICY SUPPORT 22-03\n\nOffice of Special Education Programs (OSEP)\n\nGuidance for Common Prior Approval Requests under IDEA Parts B and C\nThis guidance provides a summary of the approval process and requirements for three common\ncategories of direct costs for which State agencies must obtain prior approval before using\nFederal funds under the Individuals with Disabilities Education Act (IDEA). Under the Office of\nManagement and Budget (OMB), Uniform Administrative Requirements, Cost Principles, and\nAudit Requirements for Federal Awards (OMB Uniform Guidance), certain items of cost are\nunallowable as direct charges except with advanced prior written approval of the Department.\n2 C.F.R. § 200.407. OSEP has developed this guidanc

In [5]:
# Build QA dataset (demo only): label every document that contains the target phrase.
# The labels come from a string match, not from manual annotation.
qa_data = [
    {
        "context": example["text"],
        "question": "What education rights do 3rd grade students have in D.C.?",
        "answers": {
            "text": ["free and appropriate public education"],
            "answer_start": [example["text"].lower().find("free and appropriate public education")]
        }
    }
    for example in dataset if "free and appropriate public education" in example["text"].lower()
]
qa_dataset = Dataset.from_list(qa_data)
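
Because the labels come from a string search on the lowercased text, it may be worth confirming that every `answer_start` really points at the answer inside the original context. The following is a minimal sketch of such a check; `check_answer_spans` and the toy records are hypothetical stand-ins for `qa_data`, not part of the notebook.

```python
def check_answer_spans(records):
    """Return the indices of records whose answer_start does not point at the answer."""
    bad = []
    for i, rec in enumerate(records):
        context = rec["context"]
        answers = rec["answers"]
        for text, start in zip(answers["text"], answers["answer_start"]):
            span = context[start:start + len(text)]
            # Compare case-insensitively, since the search ran on lowercased text
            if start < 0 or span.lower() != text.lower():
                bad.append(i)
                break
    return bad


phrase = "free and appropriate public education"
good_context = "Every child is entitled to a Free and Appropriate Public Education."

# Toy records (assumed for illustration) in the same shape as qa_data
toy_records = [
    {"context": good_context,
     "answers": {"text": [phrase], "answer_start": [good_context.lower().find(phrase)]}},
    {"context": "No matching phrase here.",
     "answers": {"text": [phrase], "answer_start": [-1]}},
]

bad_indices = check_answer_spans(toy_records)
print(bad_indices)  # [1]
```

An empty list from `check_answer_spans(qa_data)` would indicate that all labeled spans line up with their contexts.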

In [6]:
# Tokenization
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

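# Define a preprocessing function that tokenizes one example and locates the answer tokens

def preprocess_training_example(example):
    inputs = tokenizer(
        example["question"],
        example["context"],
        truncation="only_second",
        padding="max_length",
        max_length=512,
        return_offsets_mapping=True
    )

    offset_mapping = inputs.pop("offset_mapping")
    # None = special or padding token, 0 = question token, 1 = context token
    sequence_ids = inputs.sequence_ids()
    answer = example["answers"]["text"][0]
    start_char = example["answers"]["answer_start"][0]
    end_char = start_char + len(answer)

    # Default to the [CLS] token (index 0) when the answer is not in the window
    start_token = end_token = 0

    for idx, (start, end) in enumerate(offset_mapping):
        # Offsets restart at 0 for the question, so only match context tokens
        if sequence_ids[idx] != 1:
            continue
        if start <= start_char < end:
            start_token = idx
        if start < end_char <= end:
            end_token = idx
            break

    # If truncation removed either end of the answer, label the example as unanswerable
    if start_token == 0 or end_token == 0:
        start_token = end_token = 0

    inputs["start_positions"] = start_token
    inputs["end_positions"] = end_token
    return inputs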

# Apply the function to dataset
tokenized_dataset = qa_dataset.map(preprocess_training_example)

Map:   0%|          | 0/1 [00:00<?, ? examples/s]
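
The span-mapping step can be illustrated without downloading a tokenizer. The sketch below uses hand-written offsets and sequence ids (assumed values, not real tokenizer output) for the question "who?" and the context "a free education". It matches only context tokens, because character offsets restart at 0 for each sequence and a question token could otherwise match by accident. The helper name `char_span_to_token_span` is hypothetical.

```python
def char_span_to_token_span(offsets, sequence_ids, start_char, end_char):
    """Map a character span in the context to (start_token, end_token).

    Returns None when the span is not fully covered by context tokens,
    for example because truncation cut it off.
    """
    start_token = end_token = None
    for idx, ((tok_start, tok_end), seq_id) in enumerate(zip(offsets, sequence_ids)):
        if seq_id != 1:  # skip special tokens, question tokens, and padding
            continue
        if start_token is None and tok_start <= start_char < tok_end:
            start_token = idx
        if tok_start < end_char <= tok_end:
            end_token = idx
    if start_token is None or end_token is None:
        return None
    return start_token, end_token


# Layout: [CLS] who ? [SEP] a free education [SEP]
offsets = [(0, 0), (0, 3), (3, 4), (0, 0), (0, 1), (2, 6), (7, 16), (0, 0)]
sequence_ids = [None, 0, 0, None, 1, 1, 1, None]

# "free education" occupies characters 2..16 of the context
span = char_span_to_token_span(offsets, sequence_ids, 2, 16)
print(span)  # (5, 6)

# A span beyond the tokenized window cannot be mapped
missing = char_span_to_token_span(offsets, sequence_ids, 20, 25)
print(missing)  # None
```

Without the sequence-id filter, character 2 would also fall inside the question token "who" (offsets 0 to 3), which is the kind of mislabeling the filter is meant to avoid.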

In [7]:
tokenized_dataset[0].keys()


dict_keys(['context', 'question', 'answers', 'input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions'])

In [8]:
# Fine-tuning Setup
model = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased")

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=2,
    learning_rate=2e-5,
    num_train_epochs=2,
    logging_dir="./logs",
    logging_steps=10
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset
)

trainer.train()
trainer.save_model("dc-edu-qa-bert")

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss
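
The empty loss table above is probably explained by simple arithmetic: the `Map` progress bar reports a single example, so training runs for fewer optimizer steps than `logging_steps=10`. The sketch below works this out under the assumption of single-device training without gradient accumulation; `total_optimizer_steps` is a hypothetical helper, not a `Trainer` API.

```python
import math


def total_optimizer_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps for single-device training without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


# Values taken from the cells above: 1 mapped example, batch size 2, 2 epochs
total_steps = total_optimizer_steps(num_examples=1, batch_size=2, num_epochs=2)
logging_steps = 10

print(total_steps)                   # 2
print(total_steps >= logging_steps)  # False: no logging step is reached
```

With only two steps, lowering `logging_steps` to 1 would at least surface the loss values, although more training examples are the real fix.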


In [9]:
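# Sanity-check the model on a sample question
# (this context was also used for training, so it is not a held-out evaluation)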
qa_pipe = pipeline("question-answering", model=model, tokenizer=tokenizer)

result = qa_pipe({
    "context": qa_data[0]["context"],
    "question": qa_data[0]["question"]
})
print(result)

Device set to use cpu


{'score': 0.00012695101031567901, 'start': 13741, 'end': 13756, 'answer': 'www.apa.org/ed/'}
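
The predicted span starts at character 13741, so this context is far longer than a single 512-token input can hold. With `truncation="only_second"` the training example keeps only the first window of the context. A common remedy is to split long contexts into overlapping windows; Hugging Face tokenizers offer `return_overflowing_tokens` and `stride` for this. The sketch below shows the idea in plain Python on a list of token ids; `sliding_windows` and its `overlap` parameter are hypothetical names chosen for illustration.

```python
def sliding_windows(tokens, window, overlap):
    """Split `tokens` into windows of at most `window` items.

    Consecutive windows share `overlap` items. Returns (start_index, window_tokens) pairs.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap
    windows = []
    start = 0
    while True:
        windows.append((start, tokens[start:start + window]))
        if start + window >= len(tokens):
            break  # the last window reaches the end of the sequence
        start += step
    return windows


token_ids = list(range(10))  # stand-in for real token ids
chunks = sliding_windows(token_ids, window=4, overlap=2)
for start, chunk in chunks:
    print(start, chunk)
# 0 [0, 1, 2, 3]
# 2 [2, 3, 4, 5]
# 4 [4, 5, 6, 7]
# 6 [6, 7, 8, 9]
```

Each window would then get its own `start_positions` and `end_positions`, with windows that do not contain the answer labeled as unanswerable.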


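### Evaluation Summary
- Trained for 2 epochs on the documents that contain the target phrase (the `Map` progress bar reports 1 example).
- The pipeline returns a predicted answer span with a confidence score.
- On the sample question the model returned `www.apa.org/ed/` with a score of about 0.0001 instead of the expected "free and appropriate public education", so the model is not yet usable.
- Steps used: Hugging Face BERT and tokenizer, the `ed_policy_guidance` subset filtered by the phrase "free and appropriate public education", fine-tuning for 2 epochs, and a spot check of the pipeline's confidence score.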

### Debugging Summary
- Fixed missing `start_positions` and `end_positions` which caused training crash
- Added logic to calculate those from `answer_start`

### Creative Application Summary
- Legal QA assistant for education rights in D.C.
- Could be used by parents, advocates, or education lawyers

### Future Iteration
- Incorporate additional debugging and evaluation metrics
- Produce more training logs and data
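
As one possible starting point for the evaluation metrics mentioned above, extractive QA is often scored with exact match and token-level F1 in the style of SQuAD. The sketch below is a minimal version under that assumption; `normalize`, `exact_match`, and `token_f1` are hypothetical helper names, and the normalization rules (lowercasing, dropping punctuation and English articles) may need adjusting for legal text.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


gold = "free and appropriate public education"
prediction = "www.apa.org/ed/"  # the answer returned in the sample run above

print(exact_match(prediction, gold))  # 0.0
print(token_f1(prediction, gold))     # 0.0
```

Averaging these two scores over a held-out set of question and answer pairs would give a more informative picture than the pipeline's confidence score alone.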