# Zero shot QA experiment 2 - DistilBERT
## January 2026 (An update to the October 2025 experiment)

### Introduction

Zoning By-laws contain important information about land use, building height, density, and other development regulations. They are important documents that inform urban planning and development decisions in cities.

They are often stored as long, unstructured PDF legal documents and it's difficult to find information within them. Zoning information is also spatial and tied to geospatial datasets. It would be great if the zoning information in the by-laws could be extracted in an efficient and automated way and joined with geospatial datasets.

**This experiment evaluates how well the DistilBERT question answering model extracts information from zoning by-laws. It is an update that aims to improve on the October 2025 experiment.**

### Changes from the October 2025 experiment

Some improvements have been made based on the lessons learned from the previous experiment. 

* A dataset of 50 labeled examples was created to use in this experiment. The dataset was created manually from a range of different zoning by-laws from different municipalities across Canada.

* In addition to the SQuAD evaluation metric, a less strict evaluation method based on fuzzy matching is used (see below for more details).

* LegalBERT did not perform well in the previous experiment, so it was removed from this updated experiment.

### Why DistilBERT
DistilBERT is a distilled, lighter version of BERT, the model developed by Google. DistilBERT itself was released by Hugging Face. It is reported to be 40% smaller and 60% faster than BERT while retaining 97% of its language understanding performance, which makes it well suited to NLP tasks like text classification, sentiment analysis, and question answering.

In the [Claude LLM API Pipeline](https://github.com/JoT8ng/zoning-extraction-pipelines/blob/main/llm_api_pipeline/src/README.md), Anthropic's Claude model was tested to extract information from zoning by-laws. One of the key limitations of a model like Claude is that its generative component is prone to hallucinations. **Unlike models like Claude and GPT, BERT is an encoder-only model. This means it is good at tasks that require understanding the input, like sentence classification or NER (named entity recognition).** For a task like extracting information from a zoning by-law, text generation is not that important. **LLMs like GPT and Claude, which excel at and are mainly used for generative tasks, are not considered the most efficient at text classification and NER compared to bidirectional encoders like BERT. That is why DistilBERT, a lighter version of BERT, is chosen for this experiment.**

For more info on NLP, LLMs, and transformer models:
[Hugging Face LLM Course](https://huggingface.co/learn/llm-course/en/chapter1/2)

### Why QA (question answering) models? Comparing different NLP tasks
The table below compares the pros and cons of different NLP tasks for extracting zoning by-law information. Based on the table below, question answering seems to be the most appropriate.

| Approach                           | What it does                                                                           | Pros                                                                                                                             | Cons                                                                                                                 |
| ---------------------------------- | -------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------- |
| **Text Classification**            | Assigns a label to an entire chunk of text (e.g. "this section contains height rules") | Simple to set up, works well if zoning is neatly sectioned                                                                       | Can’t extract numeric values, only gives category                                                                    |
| **NER (Named Entity Recognition)** | Finds predefined entities in text (e.g. `HEIGHT=9.1 m`, `LOT_COVERAGE=35%`)            | Good for structured outputs; works well if entity spans are clearly defined                                                      | Requires labeled token-level data, zoning text is irregular (tables, bullets, weird formatting), not great zero-shot |
| **QA (Question Answering)**        | Extracts a text span from context given a natural-language question                    | Works very well zero-shot, doesn’t need special labeling format, flexible | Requires splitting long contexts, can return an incorrect span (e.g. when the answer is absent from the context) |
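
One of the drawbacks listed for QA is that long contexts need to be split. A minimal sketch of one way this could be done, assuming a simple overlapping word window, is shown below. The function name `split_context` and the window sizes are illustrative and are not part of the original experiment.

```python
def split_context(text, max_words=200, overlap=50):
    """Split text into overlapping word windows (illustrative helper)."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")

    words = text.split()
    step = max_words - overlap
    chunks = []

    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current window reaches the end of the text
        if start + max_words >= len(words):
            break

    return chunks


# Small demonstration with a 10-word text and a 4-word window
sample = " ".join(f"word{i}" for i in range(10))
for chunk in split_context(sample, max_words=4, overlap=1):
    print(chunk)
```

The overlap means an answer that straddles a window boundary still appears whole in at least one chunk, as long as it is shorter than the overlap.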

### Imports and Set Up

First, import all the necessary Python libraries. The Hugging Face Transformers Library is used.

In [None]:
from transformers import pipeline
import pandas as pd
import evaluate
from rapidfuzz import fuzz

The SQuAD metric is used to evaluate the accuracy of the model. Hugging Face's evaluate library provides a squad metric that calculates Exact Match (EM) and token-level F1.

SQuAD (Stanford Question Answering Dataset) is a benchmark dataset for question answering and reading comprehension. Its accompanying metrics are widely used to assess the performance of question answering models.

* **Exact Match (EM):** This metric measures the percentage of questions where the model's answer exactly matches one of the ground truth answers (after normalizing case, punctuation, and articles).
* **F1 Score:** This metric measures the token-level overlap between the predicted answer and the ground truth answer. Precision is the share of predicted tokens that appear in the ground truth, and recall is the share of ground truth tokens that appear in the prediction. The F1 score is the harmonic mean of precision and recall, providing a balance between the two. A higher F1 score indicates a better-performing model.

**Reference:**

Rajpurkar et al., "*SQuAD: 100,000+ Questions for Machine Comprehension of Text*", EMNLP 2016.
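
To make the two SQuAD metrics concrete, the sketch below recomputes them for a single prediction using only the standard library. It follows the normalization of the SQuAD v1.1 evaluation (lowercasing, removing punctuation and the articles a/an/the). The function names are illustrative, and the Hugging Face metric reports the same quantities as percentages averaged over all examples.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, ground_truth):
    """Return 1 if the normalized strings are identical, else 0."""
    return int(normalize_answer(prediction) == normalize_answer(ground_truth))


def token_f1(prediction, ground_truth):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(ground_truth).split()
    # Count tokens shared by both answers (respecting duplicates)
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


prediction = "on a corner lot"
ground_truth = "be located on a corner lot"

print(exact_match(prediction, ground_truth))         # 0
print(round(token_f1(prediction, ground_truth), 2))  # 0.75
```

Here all 3 normalized prediction tokens appear in the ground truth (precision 1.0), but only 3 of its 5 tokens are recovered (recall 0.6), so a correct but shorter answer scores 0 on EM and 0.75 on F1.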

The October 2025 experiment highlighted that the SQuAD metric, although a widely used evaluation standard, may be too strict for this task, because Exact Match (EM) checks whether the model's answer exactly matches the ground truth answer. In the previous experiment there were cases where the model returned an answer like "on a corner lot", which is correct, while the ground truth, also correct, included more context, like "be located on a corner lot". Therefore, this updated experiment also uses fuzzy matching to evaluate the model's results. A similarity score of 80 (on RapidFuzz's 0 to 100 scale) is used as the threshold for a match. The open source Python library [RapidFuzz](https://rapidfuzz.github.io/RapidFuzz/) is used.

In [None]:
# Load SQuAD metrics
squad_metric = evaluate.load("squad")

# Set up a list to store the model responses
results = []

# Evaluation helper function to prepare inputs for Hugging Face SQuAD metrics
def evaluate_model(res, model):

    # res or results: list of result dictionaries containing the model predictions and ground truth
    # model: "distil_answer"

    predictions = []
    references = []

    for r in res:
        predictions.append({
            "id": str(r["doc_id"]),
            "prediction_text": r[model]
        })
        references.append({
            "id": str(r["doc_id"]),
            "answers": {
                "text": [r["ground_truth"]],
                "answer_start": [0]  # dummy value
            }
        })

    # Compute metrics
    return squad_metric.compute(predictions=predictions, references=references)

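# Evaluation helper function for fuzzy matching
def evaluate_fuzzymatch_model(res, model, threshold=80):

    # res or results: list of result dictionaries containing the model predictions and ground truth
    # model: "distil_answer"
    # threshold: minimum RapidFuzz score (0-100) that counts as a match

    output = []

    for r in res:
        # Assumption: partial_ratio is used so that a prediction contained in a
        # longer ground truth (e.g. "on a corner lot") still scores highly
        score = fuzz.partial_ratio(r[model].lower(), r["ground_truth"].lower())
        output.append({
            "doc_id": r["doc_id"],
            "fuzzy_score": score,
            "fuzzy_match": score >= threshold
        })

    return output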

### Load the evaluation dataset

In the previous experiment, a small dataset was used and it was concluded that a larger one might yield more meaningful results. It was a good starting point, but a more rigorous test called for a larger dataset. A dataset of 50 labeled examples was created for this experiment. It was compiled manually in an Excel document from a range of zoning by-laws from different municipalities across Canada and exported to CSV format.

To really test how well the model extracts zoning information, a range of different questions and contexts from zoning by-laws throughout Canada are used. The contexts are a mix of messy and clean snippets from the zoning by-laws. One context contains a longer, messy snippet of raw text extracted directly from the PDF, and another contains a similarly long and messy snippet in raw markdown syntax. Snippets of tables in markdown syntax are also included.

The previous October 2025 experiment highlighted that the model did not perform well when given messy contexts from the zoning by-law. With this larger zero shot evaluation dataset, hopefully a clearer understanding of the model's performance can be achieved. The previous experiment also showed that the model is unable to handle cases where the correct answer is absent from the provided context, requiring separate error handling to be added in the script. Since the goal of this experiment is to test the performance and accuracy of the DistilBERT model itself, rather than its ability to detect missing answers, any contexts without the answer have been removed from the evaluation dataset. This ensures that all test cases focus solely on the model's answer extraction capabilities.
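
For reference, the separate error handling mentioned above could take the form of a confidence cut-off on the pipeline output. This is only a sketch of one possible approach and is not used in this experiment. The helper name `answer_or_none` and the `min_score` value are hypothetical, and the two result dictionaries are made-up stand-ins that mirror the `score`, `start`, `end`, and `answer` keys returned by the question answering pipeline.

```python
def answer_or_none(qa_result, min_score=0.1):
    """Return the extracted answer, or None when the confidence is too low.

    min_score is a hypothetical cut-off and would need tuning on real data.
    """
    if qa_result["score"] < min_score:
        return None
    return qa_result["answer"]


# Made-up pipeline outputs, for illustration only
confident = {"score": 0.92, "start": 30, "end": 45, "answer": "on a corner lot"}
uncertain = {"score": 0.01, "start": 0, "end": 3, "answer": "The"}

print(answer_or_none(confident))  # on a corner lot
print(answer_or_none(uncertain))  # None
```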

The Hugging Face Datasets library is not used in this experiment because it is not necessary. This is a small experiment and advanced features from the Datasets library (shuffling, splitting, streaming, or pushing to the Hugging Face Hub) are not required.

In [None]:
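# Load the zero shot evaluation dataset
# NOTE: the original file path is not given; "zoning_qa_dataset.csv" is a placeholder
datafile = pd.read_csv("zoning_qa_dataset.csv")

# Convert to a list of row dictionaries (doc_id, question, context, ground_truth)
dataset = datafile.to_dict(orient="records")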
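
As a rough sketch of the expected layout: the loop in the next section reads `doc_id`, `question`, `context`, and `ground_truth` from each record. The example below assumes the CSV has one column per field and uses a made-up row held in memory in place of the real file, which is not reproduced here.

```python
import io

import pandas as pd

# Made-up sample row used only to illustrate the assumed column layout
csv_text = (
    "doc_id,question,context,ground_truth\n"
    '1,Where must the building be located?,'
    '"The building shall be located on a corner lot.",'
    "be located on a corner lot\n"
)

# Read the CSV text and convert each row into a dictionary
datafile = pd.read_csv(io.StringIO(csv_text))
dataset = datafile.to_dict(orient="records")

for data in dataset:
    print(data["doc_id"], data["question"], "->", data["ground_truth"])
```

Converting the data frame to a list of dictionaries keeps the evaluation loop free of pandas-specific indexing.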

### Running and testing the models

The Hugging Face Transformers pipeline function is used. Since these are simple experiments, the pipeline function is deemed adequate and no further customization (of tokenizers, etc.) is needed.

In [None]:
# Load QA Pipelines for the model
# DistilBERT
distilbert_qa = pipeline(
    "question-answering",
    model = "distilbert-base-uncased-distilled-squad"
)

The results of the zero shot question answering are saved in a list called "results". The results are output in the data frame below.

In [None]:
# Run zero shot qa for DistilBERT

for data in dataset:
    q = data['question']
    ctext = data['context']
    truth = data['ground_truth']

    # DistilBERT
    distil_response = distilbert_qa(question=q, context=ctext)['answer']

    # Store the prediction alongside the question and ground truth
    results.append({
        "doc_id": data['doc_id'],
        "question": q,
        "ground_truth": truth,
        "distil_answer": distil_response
    })

dataframe = pd.DataFrame(results)
dataframe