# Zero shot QA experiment 2 - DistilBERT vs RoBERTa
## January 2026 (An update to the October 2025 experiment)

### Introduction

Zoning By-laws contain important information about land use, building height, density, and other development regulations. They are important documents that inform urban planning and development decisions in cities.

They are often stored as long, unstructured PDF legal documents and it's difficult to find information within them. Zoning information is also spatial and tied to geospatial datasets. It would be great if the zoning information in the by-laws could be extracted in an efficient and automated way and joined with geospatial datasets.

**This experiment evaluates how well the DistilBERT and RoBERTa question answering models extract information from zoning by-laws. It is an update that aims to improve upon the October 2025 experiment.**

### Changes from the October 2025 experiment

Some improvements have been made based on the lessons learned from the previous experiment.

* A dataset of 50 labeled examples was created to use in this experiment. The dataset was created manually from a range of different zoning by-laws from different municipalities across Canada.

* LegalBERT did not perform well in the previous experiment, so it was removed. This updated experiment instead compares the accuracy of DistilBERT and RoBERTa.

### Why DistilBERT and RoBERTa?
DistilBERT is a distilled, lighter version of BERT, the model developed by Google. It is 40% smaller and 60% faster than BERT at NLP tasks like text classification, sentiment analysis, and question answering. Although it is smaller, it still retains 97% of BERT's accuracy.

In the [Claude LLM API Pipeline](https://github.com/JoT8ng/zoning-extraction-pipelines/blob/main/llm_api_pipeline/src/README.md), Anthropic's Claude model was tested to extract information from zoning by-laws. One of the key limitations of using a model like Claude is that the generative component of the model is prone to hallucinations. **Unlike models like Claude and GPT, BERT is an encoder-only model. This means it is good for tasks that require understanding of the input, like sentence classification or NER (named entity recognition).** For a task like extracting information from a zoning by-law, text generation is not that important. **LLMs like GPT and Claude, which excel at and are mainly used for generative tasks, are not considered the most efficient at text classification and NER compared to bidirectional encoders like BERT. That is why a lighter version of BERT, DistilBERT, is chosen for this experiment.**

[RoBERTa](https://huggingface.co/docs/transformers/en/model_doc/roberta) is an optimized version of BERT that improves on it with new pretraining objectives. The pretraining objectives include dynamic masking, sentence packing, larger batches and a byte-level BPE tokenizer. Because it is a newer, improved model, it is generally considered to outperform BERT on NLP tasks. This experiment uses [roberta-base-squad2 or roberta-base for Extractive QA](https://huggingface.co/deepset/roberta-base-squad2), a version of RoBERTa fine-tuned on SQuAD 2 for question answering.

For more info on NLP, LLMs, and transformer models:
[Hugging Face LLM Course](https://huggingface.co/learn/llm-course/en/chapter1/2)

### Why QA (question answering) models? Comparing different NLP tasks
The table below compares the pros and cons of different NLP tasks for extracting zoning by-law information. Based on this comparison, question answering seems to be the most appropriate.

| Approach                           | What it does                                                                           | Pros                                                                                                                             | Cons                                                                                                                 |
| ---------------------------------- | -------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------- |
| **Text Classification**            | Assigns a label to an entire chunk of text (e.g. "this section contains height rules") | Simple to set up, works well if zoning is neatly sectioned                                                                       | Can’t extract numeric values, only gives category                                                                    |
| **NER (Named Entity Recognition)** | Finds predefined entities in text (e.g. `HEIGHT=9.1 m`, `LOT_COVERAGE=35%`)            | Good for structured outputs; works well if entity spans are clearly defined                                                      | Requires labeled token-level data, zoning text is irregular (tables, bullets, weird formatting), not great zero-shot |
| **QA (Question Answering)**        | Extracts a text span from context given a natural-language question                    | Works very well zero-shot, doesn’t need special labeling format, flexible                                                        | Requires splitting long contexts, can occasionally return the wrong span                                             |

### Imports and Set Up

First, import all the necessary Python libraries. The Hugging Face Transformers Library is used.

In [1]:
from transformers import pipeline
import pandas as pd
import evaluate

The Stanford Question Answering Dataset (SQuAD) is a widely used benchmark dataset for evaluating question answering models. The original paper (Rajpurkar et al., 2016) introduced two key evaluation metrics that have since become standard in the field.

* **Exact Match (EM):** This metric measures the percentage of questions where the model's answer exactly matches one of the ground truth answers.
* **F1 Score:** This metric measures the token-level overlap between the predicted answer and the ground truth answer. Precision is the share of tokens in the prediction that also appear in the ground truth, and recall is the share of ground truth tokens that appear in the prediction. The F1 score is the harmonic mean of precision and recall, providing a balance between the two. A higher F1 score indicates a better performing model.

Hugging Face's evaluate library allows you to compute exact match (EM) and token-level F1.
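
To make the two metrics concrete, the sketch below reimplements them in plain Python. It follows the logic of the official SQuAD v1.1 evaluation script as understood here (lowercase, strip punctuation, drop the articles "a", "an", "the", then compare tokens); it is an illustration, not the code that the evaluate library runs, and the function names are chosen for this example only.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, ground_truth: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(ground_truth))


def token_f1(prediction: str, ground_truth: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(ground_truth).split()
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    num_same = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("a corner lot", "on a corner lot"))
print(round(token_f1("a corner lot", "on a corner lot"), 2))
```

For the pair "a corner lot" and "on a corner lot", the normalized tokens are `corner lot` and `on corner lot`. Precision is 2/2, recall is 2/3, so EM is 0 while F1 is 0.8.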

In [2]:
# Load SQuAD metrics
squad_metric = evaluate.load("squad")

# Set up a list to store the model responses
results = []

# Evaluation helper function to prepare inputs for Hugging Face SQuAD metrics
def evaluate_model(res, model):
    # res: list of result dictionaries holding the model predictions and the ground truth
    # model: key of the predictions to score, "distil_answer" or "roberta_answer"

    predictions = []
    references = []

    for r in res:
        predictions.append({
            "id": str(r["doc_id"]),
            "prediction_text": r[model]
        })
        references.append({
            "id": str(r["doc_id"]),
            "answers": {
                "text": [r["ground_truth"]],
                "answer_start": [0]  # dummy value
            }
        })

    # Compute metrics
    return squad_metric.compute(predictions=predictions, references=references)

### Load the evaluation dataset

In the previous experiment, a small dataset was used, and it was concluded that a larger one might yield more meaningful results. As a first experiment, it was a good starting point, but a more rigorous test called for a larger dataset. A dataset of 50 labeled examples was created for this experiment. It was assembled manually in an Excel document from a range of zoning by-laws from different municipalities across Canada and exported to CSV format.

To really test the efficacy of the models in extracting the zoning information, a range of different questions and contexts from different zoning by-laws throughout Canada are used. The contexts are a mix of messy and clean snippets from the zoning by-laws. One context contains a longer and messy snippet of raw text directly extracted from the PDF and another contains a similar long and messy snippet in raw markdown syntax. Snippets of tables in markdown syntax are also included.

The previous October 2025 experiment highlighted that the model did not perform well when given messy contexts from the zoning by-law. With this larger zero shot evaluation dataset, hopefully a clearer understanding of the models' performance can be achieved. The previous experiment also showed that the model is unable to handle cases where the correct answer is absent from the provided context, requiring separate error handling to be added in the script. Since this experiment aims to test the answer extraction accuracy of the DistilBERT and RoBERTa models themselves, rather than their ability to detect missing answers, any contexts without the answer have been removed from the evaluation dataset. This ensures that all test cases focus solely on the models' answer extraction capabilities.
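
One possible form of such error handling (not necessarily what the October 2025 script did) is a confidence threshold on the `score` field that the Transformers question answering pipeline returns with each answer. The sketch below is hypothetical: `answer_or_none`, `min_score`, and the stand-in `fake_qa` function are names invented for this example, and the threshold of 0.3 is an arbitrary placeholder that would need tuning on labeled data.

```python
from typing import Callable, Optional


def answer_or_none(qa: Callable[..., dict], question: str, context: str,
                   min_score: float = 0.3) -> Optional[str]:
    """Return the extracted span, or None when the model's confidence is low."""
    result = qa(question=question, context=context)
    if result["score"] < min_score:
        return None  # treat a low-confidence span as "answer not in context"
    return result["answer"]


# Stand-in for a real pipeline so the sketch runs without downloading a model.
# It mimics the keys of a pipeline result: answer, score, start, end.
def fake_qa(question: str, context: str) -> dict:
    if "height" in context:
        return {"answer": "9.0 m", "score": 0.92, "start": 18, "end": 23}
    return {"answer": "280m2", "score": 0.04, "start": 0, "end": 5}


print(answer_or_none(fake_qa, "What is the maximum height?", "The maximum height is 9.0 m."))
print(answer_or_none(fake_qa, "What is the maximum height?", "280m2 lot area applies."))
```

With a real pipeline, `fake_qa` would be replaced by `distilbert_qa` or `roberta_qa`. For models fine-tuned on SQuAD 2, the pipeline call also accepts a `handle_impossible_answer` argument, which may be worth testing as an alternative to a manual threshold.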

The Hugging Face Datasets library is not used in this experiment because it is not necessary. This is a small experiment and advanced features from the Datasets library (shuffling, splitting, streaming, or pushing to the Hugging Face Hub) are not required.

In [3]:
# Load the zero shot evaluation dataset

dataset = pd.read_csv("ZeroshotDataset2.csv")

### Running and testing the models

The Hugging Face Transformers pipeline function is used. Since these are simple experiments, the pipeline function is deemed adequate and no custom adjustment of tokenizers or other components is needed.

In [4]:
# Load a QA pipeline for each model
# DistilBERT
distilbert_qa = pipeline(
    "question-answering",
    model="distilbert-base-uncased-distilled-squad"
)

# RoBERTa base squad 2
roberta_qa = pipeline(
    "question-answering",
    model="deepset/roberta-base-squad2"
)

The results of the zero shot question answering are saved in a list called "results" and displayed in the data frame below.

In [5]:
# Run zero shot qa for DistilBERT and RoBERTa

for _, data in dataset.iterrows():
    q = data['question']
    ctext = data['context']
    truth = data['ground_truth']

    # DistilBERT
    distil_response = distilbert_qa(question=q, context=ctext)['answer']
    # RoBERTa base squad 2
    roberta_response = roberta_qa(question=q, context=ctext)['answer']

    results.append({
        "doc_id": data['doc_id'],
        "question": q,
        "ground_truth": truth,
        "municipality": data['municipality'],
        "distil_answer": distil_response,
        "roberta_answer": roberta_response
    })

dataframe = pd.DataFrame(results)
dataframe

|   | doc_id | question | ground_truth | municipality | distil_answer | roberta_answer |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | What is the maximum building height for access... | 4.0 m \| 1 storey | Burnaby | 4.0 m \| 1 storey | 4.0 m |
| 1 | 2 | What is the maximum lot area for 1-3 small-sca... | - | Burnaby | 280m2 | 280m2 |
| 2 | 3 | Where does a child care facility in the R1 dis... | on a corner lot | Burnaby | on a corner lot | a corner lot |
| 3 | 4 | For lots in the R1 district on the Community H... | 2.0 m | Burnaby | 2.0 m | 2.0 m |
| 4 | 5 | According to the table, what is the minimum lo... | 8 m | Burnaby | 5 m | 8 m 10 m |
| 5 | 6 | For a lot with 1 to 3 total dwelling units, wh... | 1 Unit | Burnaby | 4 to 6 | 2 Units |
| 6 | 7 | For a lot with 1 to 3 total dwelling units, wh... | 1 Unit | Burnaby | 4 to 6 | 4 to 6 Units |
| 7 | 8 | According to the table, what is the minimum lo... | 8 m | Burnaby | 5 m | 8 m 10 m |
| 8 | 9 | What is the minimum width and area for outdoor... | An outdoor amenity space with a minimum width ... | Burnaby | 2.0 m | 2.0 m and area of 10.0 m2 |
| 9 | 10 | What is the maximum building height for a slop... | 9.0 m (29.5 ft.) | Burnaby | 9.0 m | 9.0 m (29.5 ft.) |


### Evaluation, metrics, and concluding thoughts

In [6]:
# Evaluation and metrics

distil_metrics = evaluate_model(results, "distil_answer")
roberta_metrics = evaluate_model(results, "roberta_answer")

print("DistilBERT Metrics:", distil_metrics)
print("RoBERTa Metrics:", roberta_metrics)

DistilBERT Metrics: {'exact_match': 40.0, 'f1': 66.0989861989862}
RoBERTa Metrics: {'exact_match': 50.0, 'f1': 73.62126466126468}


**DistilBERT Metrics**

* **Exact Match:** 40% of predictions exactly match the ground truth.
* **F1 Score:** 66.10% token-level overlap between predictions and ground truth.

**RoBERTa Metrics**

* **Exact Match:** 50% of predictions exactly match the ground truth.
* **F1 Score:** 73.62% token-level overlap between predictions and ground truth.

At first glance, it is understandable why the exact match (EM) scores for both models are only around 50%. The EM metric is very strict: the SQuAD metric only normalizes case, punctuation, and the articles "a", "an", and "the" before comparing, so any remaining difference in wording counts as incorrect. For example, if the ground truth is "on a corner lot" but the model provides the response "a corner lot", the EM metric will count the response as incorrect. The F1 score is more forgiving and better reflects partial correctness.

Looking at the F1 score, a score of 70% or higher is often considered "good" in the industry. However, in this scenario, where accuracy is extremely important because it involves legal answers, a score of 90-95% or higher is desirable.

It is not surprising that RoBERTa outperforms DistilBERT on both metrics because it is larger and is fine-tuned on the SQuAD2 dataset. DistilBERT is a compressed, faster model and trades accuracy for speed. It would also not be surprising if RoBERTa outperformed BERT itself, because it is an optimized version of BERT with an improved pre-training methodology.

The significant gap between the EM and F1 scores indicates that the models often produce partially correct answers that don't exactly match the ground truth/golden answer.

A future iteration of this experiment could use a more forgiving evaluation metric that looks at semantic similarity and partial matches, but it made sense to start with an industry standard metric.
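
As a rough illustration of what a partial-match metric could look like, the sketch below counts a prediction as a "relaxed match" when its tokens appear as a contiguous run inside the ground truth, or the other way round. This is a hypothetical metric invented for this example (`relaxed_match` is not part of the evaluate library), and it does not measure semantic similarity.

```python
from typing import List


def _tokens(text: str) -> List[str]:
    """Lowercase and split on whitespace."""
    return text.lower().split()


def _contains(haystack: List[str], needle: List[str]) -> bool:
    """True if needle occurs as a contiguous run of tokens inside haystack."""
    n = len(needle)
    return any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))


def relaxed_match(prediction: str, ground_truth: str) -> bool:
    """True if either answer is fully contained in the other, token by token."""
    pred, truth = _tokens(prediction), _tokens(ground_truth)
    if not pred or not truth:
        return False
    return _contains(truth, pred) or _contains(pred, truth)


print(relaxed_match("a corner lot", "on a corner lot"))
print(relaxed_match("5 m", "8 m"))
```

A metric like this would accept "a corner lot" for "on a corner lot" and reject "5 m" for "8 m". It would also accept a truncated answer such as "4.0 m" for "4.0 m | 1 storey", which may hide legally relevant omissions, so it would best be reported alongside EM and F1 rather than instead of them.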

It is also important to consider that the QA models were trained on a variety of other datasets, like SQuAD, and my zero shot dataset may be harder for them to interpret since it involves legal language and is very domain specific. When evaluated on SQuAD, top models normally achieve scores of EM > 85 and F1 > 90. When evaluated zero shot on out-of-domain datasets, scores tend to drop significantly. Therefore, it might be a good idea to consider fine-tuning RoBERTa on text extracted from zoning by-laws.

### Lessons Learned and Improvements

* Consider fine-tuning RoBERTa on text extracted from zoning by-laws.
* Consider doing a cross-model evaluation exercise on other types of model architectures, like Longformer and generative LLMs, that are designed for processing long documents. RoBERTa has a maximum input length of 512 tokens, which is limiting when working with long legal documents like zoning by-laws. Since a pipeline has already been developed to extract answers from zoning text snippets in the [Claude LLM API Pipeline Project](https://github.com/JoT8ng/zoning-extraction-pipelines/blob/main/llm_api_pipeline/src/README.md), consider evaluating that model's accuracy to compare it with the others.
* As an alternative to fine-tuning a DistilBERT model, consider exploring other methods like OCR (optical character recognition). Using an OCR model may be more efficient, although the results would have to be double checked and post-processing of the data would be required.
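
The 512 token limit mentioned above is usually worked around by splitting a long context into overlapping windows and keeping the best answer across them. The sketch below shows the idea under simplifying assumptions: it counts whitespace-separated words rather than model tokens, and `split_into_windows`, `best_answer`, and the default sizes are hypothetical names and values chosen for this example.

```python
from typing import Callable, List, Optional


def split_into_windows(text: str, window_size: int = 300, overlap: int = 50) -> List[str]:
    """Split text into overlapping windows of whitespace-separated words."""
    if window_size <= 0 or not 0 <= overlap < window_size:
        raise ValueError("need window_size > 0 and 0 <= overlap < window_size")
    words = text.split()
    step = window_size - overlap
    windows = []
    for start in range(0, len(words), step):
        windows.append(" ".join(words[start:start + window_size]))
        if start + window_size >= len(words):
            break  # the last window already reaches the end of the text
    return windows


def best_answer(qa: Callable[..., dict], question: str,
                windows: List[str]) -> Optional[dict]:
    """Run the QA callable on every window and keep the highest-scoring result."""
    results = [qa(question=question, context=w) for w in windows]
    return max(results, key=lambda r: r["score"], default=None)


sample = " ".join(f"w{i}" for i in range(10))
print(split_into_windows(sample, window_size=4, overlap=2))
```

The Transformers question answering pipeline already applies its own windowing to long inputs, controlled by its `max_seq_len` and `doc_stride` arguments, so a manual splitter like this is mainly useful for inspecting which part of a by-law an answer came from. Scores are computed per window, so comparing them across windows is a heuristic rather than a calibrated ranking.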