# Fine-Tuning RoBERTa on Zoning By-laws

### Introduction

Zoning By-laws set out rules on land use, building height, density, and other development regulations. They inform urban planning and development decisions in cities.

They are often stored as long, unstructured PDF legal documents, which makes it difficult to find information within them. Zoning information is also spatial and tied to geospatial datasets. It would be valuable to extract the zoning information in the by-laws in an efficient and automated way and join it with geospatial datasets.

**This experiment aims to fine-tune a pre-trained RoBERTa QA model to increase its accuracy in extracting information from Zoning By-laws.**

### Why RoBERTa?

[RoBERTa](https://huggingface.co/docs/transformers/en/model_doc/roberta) is an optimized version of BERT that changes how the model is pretrained: dynamic masking, sentence packing, larger batches, and a byte-level BPE tokenizer. It is generally considered to outperform BERT on NLP tasks. This experiment uses [roberta-base-squad2 or roberta-base for Extractive QA](https://huggingface.co/deepset/roberta-base-squad2), a version of RoBERTa that has already been fine-tuned on SQuAD 2 for question answering.

[roberta-base-squad2 or roberta-base for Extractive QA](https://huggingface.co/deepset/roberta-base-squad2) is chosen because in the [Zeroshot QA Experiments](https://github.com/JoT8ng/zoning-extraction-pipelines/blob/main/zeroshot_qa/zeroshot_qa_experiment2.ipynb) it outperformed [DistilBERT](https://huggingface.co/docs/transformers/en/model_doc/distilbert) in terms of accuracy.

For more info on NLP, LLMs, and transformer models:
[Hugging Face LLM Course](https://huggingface.co/learn/llm-course/en/chapter1/2)

### Fine-Tuning Strategy

The fine-tuning strategy chosen for this experiment is based on a paper called ["Fine-tuning Strategies for Domain Specific Question Answering under Low Annotation Budget Constraints"](https://arxiv.org/html/2401.09168v1).

The paper acknowledges the challenge and cost of adapting foundation models to specific tasks, because fine-tuning those models normally requires a large number of annotated samples. In practice, training datasets for domain-specific tasks are small due to budget constraints, and creating a dataset with hundreds of labelled examples is tedious.

The paper highlights that traditionally this issue is circumvented using a double fine-tuning step:

*"It consists of fine-tuning the pre-trained foundation model on a large-scale training dataset that is as close as possible (domain and objective) to the target task and is then further fine-tuned on the given domain/task for which training data is scarce. The result is a Pre-trained Language Model (PLM) like BERT [1], trained on masked language modeling or text generation task, that is then fine-tuned on a more specific large-scale task (LM’), and ultimately refined on the domain/task at hand (LM’’)...*

*In the double fine-tuning step stated above, practitioners usually leverage the Stanford Question Answering Dataset (SQuAD) [10] which is a high-quality QA dataset that covers diverse knowledge for the PLM to train on. Nonetheless, in many real-life scenarios, specific-domain QA has a range of field applications that is narrower than SQuAD and may not appear in the SQuAD training data. This calls for building a domain-specific dataset to further fine-tune a QA model for the domain at hand to produce a QA model LM’’. This last fine-tuning step is domain-dependent, and the practitioner’s goal is also to ultimately keep the number of annotated training samples low - they are under a low annotation budget constraint. It’s worth mentioning that, for extractive QA, annotating 200 examples is already a time-consuming work: the collection of question and answer data requires the annotator to read and understand the text in order to ensure the reasonableness of the marked answers."* (Guo et al., 2024)

The paper concludes that the **best strategy to fine-tune a QA model in low-budget settings is to take a pre-trained model and fine-tune it with a dataset composed of the target domain dataset and the SQuAD dataset.** This is the strategy that will be used in this experiment.

### About the training dataset

A dataset of 80 labeled examples was created for this experiment. It was assembled manually in an Excel document from a range of zoning by-laws from different municipalities across Canada and exported to CSV format. The CSV format is deemed appropriate for this experiment because the training dataset is small and simple. [More information on LLM dataset formats](https://huggingface.co/blog/tegridydev/llm-dataset-formats-101-hugging-face)
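
As a rough illustration of the CSV layout, the sketch below builds a two-row table with the `question`, `context`, and `ground_truth` columns that the preprocessing code reads, and checks that each answer appears verbatim in its context. The rows are invented for illustration and are not taken from the actual dataset, and the `answer_in_context` column is a hypothetical helper.

```python
import io

import pandas as pd

# Hypothetical rows for illustration only; the real dataset has 80 examples.
csv_text = '''question,context,ground_truth
What is the maximum building height?,"In the R1 zone, the maximum building height is 10.0 metres.",10.0 metres
What is the minimum lot area?,"The minimum lot area for a detached house is 360 square metres.",360 square metres
'''

df = pd.read_csv(io.StringIO(csv_text))

# Extractive QA needs every answer to be an exact substring of its context,
# otherwise the example ends up labeled as "no answer" during preprocessing.
df["answer_in_context"] = [
    answer in context
    for answer, context in zip(df["ground_truth"], df["context"])
]

print(df[["question", "ground_truth", "answer_in_context"]])
```

Running a check like this before training helps catch rows where a typo or a whitespace difference would stop the answer from being located in the context.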

The municipalities whose Zoning By-laws are used for this training dataset:

* Toronto
* Calgary
* Edmonton
* Vancouver
* Waterloo
* Saint John
* Surrey

To test how well the models extract zoning information, a range of questions and contexts from different zoning by-laws throughout Canada is used. The contexts are a mix of messy and clean snippets from the zoning by-laws. One context contains a longer, messy snippet of raw text extracted directly from the PDF.

As mentioned above, the fine-tuning strategy involves using a training dataset composed of the target domain dataset and the SQuAD dataset. 50 examples consist of labeled examples from various Zoning By-laws and 30 examples are from the SQuAD dataset.
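
The mixing step can be sketched as follows, assuming both sources share the same three columns. The placeholder rows, the `source` column, and the fixed seed are illustrative assumptions, not part of the original dataset.

```python
import pandas as pd


def make_placeholder_rows(prefix, n):
    """Build n stand-in QA rows; real rows would come from the labeled data."""
    return pd.DataFrame({
        "question": [f"{prefix} question {i}" for i in range(n)],
        "context": [f"{prefix} context {i}" for i in range(n)],
        "ground_truth": [f"context {i}" for i in range(n)],
    })


zoning = make_placeholder_rows("zoning", 50)  # target-domain examples
squad = make_placeholder_rows("squad", 30)    # general-domain examples

# Hypothetical bookkeeping column to track where each row came from.
zoning["source"] = "zoning"
squad["source"] = "squad"

# Concatenate, then shuffle with a fixed seed so both sources are interleaved.
mixed = (
    pd.concat([zoning, squad], ignore_index=True)
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)

print(mixed["source"].value_counts())
```

Keeping a `source` column makes it easy to report results separately for zoning and SQuAD rows later, and it can be dropped before exporting to CSV.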

The Hugging Face Datasets library is used only to load the CSV files and apply the preprocessing function. This is a small experiment, so the library's more advanced features (shuffling, splitting, streaming, or pushing to the Hugging Face Hub) are not required.

### Imports and Set Up

First, import all the necessary Python libraries. The Hugging Face Transformers Library is used.

Since this is a small and simple training dataset, `AutoTokenizer` is used to load the tokenizer that matches the model, as it is not deemed necessary to manually customize the tokenization process.

In [2]:
from transformers import AutoTokenizer, RobertaForQuestionAnswering, TrainingArguments, Trainer, EarlyStoppingCallback
from datasets import load_dataset
import pandas as pd

### Data Preprocessing

[More information on QA data processing](https://huggingface.co/learn/llm-course/en/chapter7/7)
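
The preprocessing in this section relies on a sliding window to split long contexts into overlapping chunks. The sketch below illustrates the idea on a plain list of integers standing in for tokens. `sliding_windows` is a hypothetical helper, not a Transformers function, and it ignores the question and special tokens that also count toward `max_length` in the real tokenizer.

```python
def sliding_windows(tokens, max_len, stride):
    """Split tokens into chunks of at most max_len items.

    Consecutive chunks share `stride` items, so an answer cut off at the
    end of one chunk can still appear whole in the next one.
    """
    if stride >= max_len:
        raise ValueError("stride must be smaller than max_len")
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # this chunk reached the end of the sequence
        start += max_len - stride  # step forward, keeping `stride` items of overlap
    return chunks


tokens = list(range(10))  # stand-in for token ids
chunks = sliding_windows(tokens, max_len=4, stride=2)
for chunk in chunks:
    print(chunk)
```

With `max_len=4` and `stride=2`, each chunk repeats the last two items of the previous one, which is the same kind of overlap the tokenizer produces when `stride` and `return_overflowing_tokens` are set.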

In [None]:
# Load CSV training dataset
dataset = load_dataset(
    "csv",
    data_files={
        "train": "TrainingDataset.csv",
        "validation": "TestingDataset.csv"
    }
)

# Load tokenizer from the same checkpoint as the model that is fine-tuned later
tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2", use_fast=True)

max_length = 512 # maximum number of tokens the model can take as input. For RoBERTa it's 512
stride = 128 # When truncating long contexts, stride lets you create a sliding window over the text so overlapping chunks are created

# Tokenize function
def tokenize_function(examples):

    # Tokenize questions and contexts
    inputs = tokenizer(
        examples["question"],
        examples["context"],
        truncation = "only_second", # Truncation = True will truncate if it exceeds max length. "only_second" will truncate only at the second sequence, which is the context
        max_length = max_length,
        stride = stride,
        return_overflowing_tokens = True, # returns extra chunks from stride
        return_offsets_mapping = True, # returns a mapping from token positions to character positions in the original text. Can find the start/end of character positions of the answer in context
        padding = "max_length"
    )

    # Convert "ground_truth" to start/end positions
    # In extractive QA, the model is not trained to generate the answer text directly but it learns to predict start token index and end token index inside the given context
    offset_mapping = inputs.pop("offset_mapping")
    sample_map = inputs.pop("overflow_to_sample_mapping")

    start_positions = []
    end_positions = []

    for i, offsets in enumerate(offset_mapping):
        sample_idx = sample_map[i]
        answer_text = examples["ground_truth"][sample_idx]
        context = examples["context"][sample_idx]

        # Find character start/end of the answer in the context
        start_char = context.find(answer_text)
        if start_char == -1:
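            # If answer not found, point both labels at the position of the CLS token (<s>), which represents no answer
            # Note: the label is a token position in the sequence, not the token's vocabulary id
            cls_index = inputs["input_ids"][i].index(tokenizer.cls_token_id)
            start_positions.append(cls_index)
            end_positions.append(cls_index)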
            continue
        end_char = start_char + len(answer_text)

        # Sequence IDs: 0 = question, 1 = context, None = special tokens
        sequence_ids = inputs.sequence_ids(i)

        # Find start/end of the context in token space
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while idx < len(sequence_ids) and sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If answer is outside the context chunk, label the chunk with the position of the CLS token (not its vocabulary id)
        if offsets[context_start][0] > start_char or offsets[context_end][1] < end_char:
            cls_index = inputs["input_ids"][i].index(tokenizer.cls_token_id)
            start_positions.append(cls_index)
            end_positions.append(cls_index)
        else:
            # Find start token index
            token_start_index = context_start
            while token_start_index <= context_end and offsets[token_start_index][0] <= start_char:
                token_start_index += 1
            start_positions.append(token_start_index - 1)

            # Find end token index
            token_end_index = context_end
            while token_end_index >= context_start and offsets[token_end_index][1] >= end_char:
                token_end_index -= 1
            end_positions.append(token_end_index + 1)

    # Store positions in inputs
    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions

    # Dataset is ready for fine-tuning
    # Contains: input_ids, attention_mask, start_positions, end_positions
    return inputs

# Apply preprocessing
tokenized_datasets = dataset.map(tokenize_function, batched = True, remove_columns = dataset["train"].column_names)

# Check data preprocessing with first 3 samples
# Note: tokenized rows only line up with raw rows if no earlier context overflowed into extra chunks
for idx in range(3):
    sample = tokenized_datasets["train"][idx]
    print(f"Sample {idx}")
    print("Question:", dataset["train"][idx]["question"])
    print("Context:", dataset["train"][idx]["context"])
    print("Ground Truth:", dataset["train"][idx]["ground_truth"])
    print("Start Position:", sample["start_positions"])
    print("End Position:", sample["end_positions"])
    print("Decoded Answer:", tokenizer.decode(sample["input_ids"][sample["start_positions"]:sample["end_positions"]+1]))
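
The core of the preprocessing above is converting a character span into a token span through the offset mapping. The sketch below shows that conversion on whitespace-split words with hand-built offsets, so it runs without downloading a tokenizer. `char_span_to_token_span` is a hypothetical helper that uses a simple linear scan rather than the two-pointer search in the notebook, and the example sentence is invented.

```python
def char_span_to_token_span(offsets, start_char, end_char):
    """Map the character range [start_char, end_char) to inclusive token indices.

    Returns None when the range does not start and end inside listed tokens.
    """
    start_token = end_token = None
    for i, (tok_start, tok_end) in enumerate(offsets):
        if tok_start <= start_char < tok_end:
            start_token = i
        if tok_start < end_char <= tok_end:
            end_token = i
    if start_token is None or end_token is None:
        return None
    return start_token, end_token


context = "Maximum height is 10 metres"
words = context.split(" ")

# Build (start, end) character offsets for each word, like an offset mapping.
offsets = []
position = 0
for word in words:
    offsets.append((position, position + len(word)))
    position += len(word) + 1  # skip the separating space

answer = "10 metres"
start_char = context.find(answer)
end_char = start_char + len(answer)

span = char_span_to_token_span(offsets, start_char, end_char)
recovered = " ".join(words[span[0]:span[1] + 1])
print(span, recovered)
```

Decoding the tokens between the two indices should reproduce the answer text, which is the same sanity check the `Decoded Answer` print performs on the real tokenized samples.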

### Fine-Tune the Model

In [None]:
# Load model
model = RobertaForQuestionAnswering.from_pretrained("deepset/roberta-base-squad2")

# Training arguments
training_args = TrainingArguments(
    output_dir="./roberta-qa",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    logging_dir="./logs",
    logging_steps=10,
    save_strategy="epoch"
)

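# Define metrics
# The "squad" metric expects decoded answer strings rather than raw logits, so this
# simplified stand-in scores the predicted start/end token positions instead.
# Assumes the Trainer passes (start_logits, end_logits) and (start_positions, end_positions).
import numpy as np

def compute_metrics(p):
    start_logits, end_logits = p.predictions
    start_labels, end_labels = p.label_ids
    start_preds = np.argmax(start_logits, axis=-1)
    end_preds = np.argmax(end_logits, axis=-1)
    return {
        "start_accuracy": float((start_preds == start_labels).mean()),
        "end_accuracy": float((end_preds == end_labels).mean()),
        "exact_span_match": float(((start_preds == start_labels) & (end_preds == end_labels)).mean()),
    }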

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]
)

# Train
trainer.train()

### Evaluate the Fine-Tuned Model

In [None]:
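# Plot training metrics
# Training and validation losses are logged on separate rows, so drop the NaN rows before plotting each curve
logs = pd.DataFrame(trainer.state.log_history)
ax = logs.dropna(subset=["loss"]).plot(x="epoch", y="loss", label="training loss")
logs.dropna(subset=["eval_loss"]).plot(x="epoch", y="eval_loss", label="validation loss", marker="o", ax=ax)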

# Save model
trainer.save_model("./roberta-qa-finetuned")
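
Loss curves alone do not show how close the extracted answers are to the ground truth. One possible way to score answer strings is a SQuAD-style exact match and token-level F1, sketched below. This is a simplified version that assumes lowercasing and whitespace splitting are enough; the official SQuAD script also strips punctuation and articles. The function names and the example strings are illustrative.

```python
from collections import Counter


def normalize(text):
    """Lowercase and split on whitespace (simplified normalization)."""
    return text.lower().split()


def exact_match(prediction, truth):
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def token_f1(prediction, truth):
    """Harmonic mean of token precision and recall between two answers."""
    pred_tokens = normalize(prediction)
    truth_tokens = normalize(truth)
    if not pred_tokens or not truth_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == truth_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical prediction that misses one word of the ground truth.
score = token_f1("10.0 metres", "maximum 10.0 metres")
print(exact_match("10.0 Metres", "10.0 metres"), round(score, 2))
```

Token-level F1 gives partial credit when the model extracts a span that overlaps the labeled answer, which is useful for zoning values where the number is right but a unit or qualifier is missing.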