# Zero-shot question answering experiments - testing and comparing DistilBERT and LEGAL-BERT

### Introduction

Zoning By-laws contain important information about land use, building height, density, and other development regulations. They are important documents that inform urban planning and development decisions in cities.

They are often stored as long, unstructured PDF legal documents and it's difficult to find information within them. Zoning information is also spatial and tied to geospatial datasets. It would be great if the zoning information in the by-laws could be extracted in an efficient and automated way and joined with geospatial datasets.

**This experiment aims to test and evaluate how well the DistilBERT and LEGAL-BERT question answering models extract information from zoning by-laws.**

### Why DistilBERT and LEGAL-BERT
DistilBERT is a distilled, lighter version of BERT, the model originally developed by Google. According to its authors, it is 40% smaller and 60% faster than BERT while retaining 97% of its language understanding performance, which makes it attractive for NLP tasks like text classification, sentiment analysis, and question answering.

In the [Claude LLM API Pipeline](https://github.com/JoT8ng/zoning-extraction-pipelines/blob/main/llm_api_pipeline/src/README.md), Anthropic's Claude model was tested to extract information from zoning by-laws. One of the key limitations of a model like Claude is that its generative component is prone to hallucinations. **Unlike models like Claude and GPT, BERT is an encoder-only model. This means it is well suited to tasks that require understanding the input, such as sentence classification or NER (named entity recognition).** For a task like extracting information from a zoning by-law, text generation is not that important. **LLMs like GPT and Claude, which excel at and are mainly used for generative tasks, are not considered the most efficient choice for text classification and NER compared to bidirectional encoders like BERT. That is why a lighter version of BERT, DistilBERT, was chosen for this experiment.**

There are many pre-trained BERT models in the Hugging Face Transformers library. [LEGAL-BERT](https://huggingface.co/nlpaueb/legal-bert-base-uncased) is a family of BERT models for the legal domain, intended to assist legal NLP research, computational law, and legal technology applications. LEGAL-BERT is pre-trained on 12 GB of diverse English legal text from several fields (e.g., legislation, court cases, contracts) scraped from publicly available resources. For more info on the data the model is pre-trained on, refer to the model card on Hugging Face. The experiment below loads the smaller `nlpaueb/legal-bert-small-uncased` checkpoint from the same family. **Since zoning by-law texts are legal documents, it would be interesting to compare the accuracy of DistilBERT vs LEGAL-BERT in this context**.

For more info on NLP, LLMs, and transformer models:
[Hugging Face LLM Course](https://huggingface.co/learn/llm-course/en/chapter1/2)

### Why QA (question answering) models? Comparing different NLP tasks
The table below compares the pros and cons of different NLP tasks for extracting zoning by-law information. Based on this comparison, question answering seems to be the most appropriate.

| Approach                           | What it does                                                                           | Pros                                                                                                                             | Cons                                                                                                                 |
| ---------------------------------- | -------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------- |
| **Text Classification**            | Assigns a label to an entire chunk of text (e.g. "this section contains height rules") | Simple to set up, works well if zoning is neatly sectioned                                                                       | Can’t extract numeric values, only gives category                                                                    |
| **NER (Named Entity Recognition)** | Finds predefined entities in text (e.g. `HEIGHT=9.1 m`, `LOT_COVERAGE=35%`)            | Good for structured outputs; works well if entity spans are clearly defined                                                      | Requires labeled token-level data, zoning text is irregular (tables, bullets, weird formatting), not great zero-shot |
| **QA (Question Answering)**        | Extracts a text span from context given a natural-language question                    | Works very well zero-shot, doesn’t need special labeling format, flexible                                                        | Requires splitting long contexts, can occasionally return an incorrect span                                          |

### Imports and Set Up

First, import all the necessary Python libraries. The Hugging Face Transformers Library is used.

In [1]:
from transformers import pipeline
import pandas as pd
import evaluate

The SQuAD metric is used to evaluate the accuracy of the models. Hugging Face's evaluate library provides a `squad` metric that calculates exact match (EM) and token-level F1.

SQuAD (Stanford Question Answering Dataset) is a reading comprehension benchmark. The two metrics introduced with it, exact match and token-level F1, are widely used to evaluate question answering models.

* **Exact Match (EM):** This metric measures the percentage of questions where the model's answer exactly matches one of the ground truth answers after normalization (lowercasing and removing punctuation and articles).
* **F1 Score:** This metric measures the token overlap between the predicted answer and the ground truth answer. Precision is the share of predicted tokens that also appear in the ground truth, and recall is the share of ground truth tokens that appear in the prediction. The F1 score is the harmonic mean of precision and recall, providing a balance between the two. A higher F1 score indicates a better performing model.

**Reference:**

Rajpurkar et al., "*SQuAD: 100,000+ Questions for Machine Comprehension of Text*", EMNLP 2016.
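
To make the two metrics concrete, the following is a minimal pure-Python sketch of how exact match and token-level F1 can be computed for a single prediction. It assumes the normalization steps of the SQuAD v1.1 evaluation script (lowercasing, stripping punctuation and English articles, collapsing whitespace). The function names are illustrative and are not part of the `evaluate` library.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, truth: str) -> bool:
    """True when both strings are identical after normalization."""
    return normalize_answer(prediction) == normalize_answer(truth)


def token_f1(prediction: str, truth: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(truth).split()
    # Multiset intersection counts each shared token at most as often as it occurs in both.
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


# A prediction that drops two words of the expected answer is not an exact
# match, but still earns partial credit (about 0.75) under F1.
print(exact_match("on a corner lot", "be located on a corner lot"))
print(token_f1("on a corner lot", "be located on a corner lot"))
```

Because F1 rewards partial overlap, it is the more forgiving of the two metrics when a model returns a span that is slightly too long or too short.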

In [None]:
# Load SQuAD metrics
squad_metric = evaluate.load("squad")

# List that will hold one row per question with both models' answers
results = []

# Evaluation helper: reshape the results into the format expected by the Hugging Face SQuAD metric
def evaluate_model(res, model):

    # res: list of result dictionaries, each holding the model predictions and the ground truth
    # model: key of the prediction column to score, "distil_answer" or "legal_answer"

    predictions = []
    references = []

    for r in res:
        predictions.append({
            "id": str(r["doc_id"]),
            "prediction_text": r[model]
        })
        references.append({
            "id": str(r["doc_id"]),
            "answers": {
                "text": [r["ground_truth"]],
                "answer_start": [0]  # dummy value
            }
        })

    # Compute metrics
    return squad_metric.compute(predictions=predictions, references=references)

### Creating the evaluation dataset

The functions in [Zoning PDF Text Extraction and Parsing Functions](https://github.com/JoT8ng/zoning-extraction-pipelines/tree/main/common_pdf_parsing) are used to extract text from a zoning by-law section. In the example below, the text from Burnaby's zoning by-law on R1 small-scale multi-unit housing is extracted. Unlike the Claude model, DistilBERT and LEGAL-BERT accept a much smaller number of input tokens (BERT-style models are typically limited to 512 tokens per input). Furthermore, the dataset has to be in a specific format containing the "context", "question", and "ground truth" for evaluating the models. Therefore, the chunks of text passed to the models as "context" were extracted manually.
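
The manual chunking described above could be partly automated with a sliding window. The sketch below splits text into overlapping chunks by word count. Words are only a rough proxy for model tokens, and the function name and default sizes are hypothetical choices for illustration, not values used in this experiment.

```python
def chunk_words(text: str, max_words: int = 200, overlap: int = 50) -> list[str]:
    """Split text into chunks of at most max_words words, sharing `overlap` words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    if not words:
        return []
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current window already reaches the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


# Toy input: ten "words", windows of four words that overlap by two.
print(chunk_words("0 1 2 3 4 5 6 7 8 9", max_words=4, overlap=2))
# ['0 1 2 3', '2 3 4 5', '4 5 6 7', '6 7 8 9']
```

The overlap reduces the chance that an answer is cut in half at a chunk boundary, at the cost of running the model on some text twice.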

To test how well the models extract zoning information, a range of different questions and contexts is used. The contexts are a mix of messy and clean snippets from the zoning by-law. One context contains a long, messy snippet of raw text extracted directly from the PDF, and another contains a similarly long and messy snippet in raw markdown syntax. Snippets of tables in markdown syntax are also included. To challenge the models, one context does not contain the answer to the question. In that scenario, the expected answer is "not specified".

In [3]:
# Create a small labeled evaluation dataset
dataset = [
    {
        "doc_id": 1,
        "context": """
        Maximum Height:       
        Principal Building 
        12.0 m | 4 storeys,    
        Accessory Buildings 
        4.0 m | 1 storey 
        """,
        "question": "What is the maximum building height for accessory buildings?",
        "ground_truth": "4.0 m"
    },
    {
        "doc_id": 2,
        "context": """
        Rowhouse Maximum Lot Area:
        1-3 units: 280m2
        Small-Scale Multi-Unit Maximum Lot Area:
        1-3 units: -,
        4 units: -,
        5-6 units: -
        """,
        "question": "What is the maximum lot area for 5-6 small-scale multi-units?",
        "ground_truth": "not specified"
    },
    {
        "doc_id": 3,
        "context": """
        A child care facility in the R1 District must:
        (a) be limited to a maximum of 25 children;
        (b) be located on a corner lot;
        (c) comply with the development regulations under section 101.4 for 1 to 3 small-scale multi-unit dwelling units on a lot;
        (d) be located on a lot that does not contain a dwelling unit or any other principal use; and
        (e) comply with all other applicable regulations under this Bylaw.
        """,
        "question": "Where does a child care facility in the R1 district be located?",
        "ground_truth": "be located on a corner lot"
    },
    {
        "doc_id": 4,
        "context": """
        some or all of the following regulations may apply to lots in the R1 District on the Community Heritage Register:
        (a) panhandle lots and other irregularly shaped lots may be permitted subject to the following:
            (i) lots with lane access shall have a minimum panhandle width of 1 m that is clear to a height of 2.5 m; and
            (ii) lots without lane access shall have a minimum panhandle width of 4.5 m that is clear to a height of 2.5 m;
        (b) maximum lot coverage as set out in Section 101.4 may be increased to up to 60%;
        (c) all original architectural appurtenances, such as chimneys, railings, vents, decorative features, or similar, may be excluded from the maximum permitted height of a principal building;
        (d) lot line setbacks for street yards may meet a minimum of 2.0 m;
        (e) the minimum separation between buildings on the same lot as required under Section 101.4 may be reduced;
        """,
        "question": "For lots in the R1 district on the Community Heritage Register, what is the minimum lot line setback for street yards?",
        "ground_truth": "2.0 m"
    },
    {
        "doc_id": 5,
        "context": """
        **Dwelling Type**
        **Rowhouse[ .1]** **Small-Scale Multi-Unit**

        Minimum Lot Width[ .2]

        5 m, except 6.2 m for end unit
        Interior Lot 10 m

        lots

        Corner Lot - Street 8 m 10 m

        Corner Lot - Lane 6.2 m 10 m

        Lot Area[ .3]

        Minimum Lot Area       - 281 m[2]

        Maximum Lot Area 280 m[2]       
        .1 At the time of registration of the subdivision plan to create two or more rowhouse lots, the
        registration of a Section 219 Covenant will be required to ensure that all adjoining rowhouse
        dwellings will be constructed at the same time.

        |Permitted Uses|Col2|
        |---|---|
        |Principal Use|Use-Specific Regulations|
        |Small-Scale Multi-Unit Housing|-|
        |Rowhouse Dwellings|101.5.2|
        |Boarding, Lodging, or Rooming House|101.5.3|
        |Group Home|-|
        |Supportive Housing (Category A)|101.5.4|
        |Child Care Facilities|101.5.6|
        |Accessory Use|Use-Specific Regulations|
        |Boarding Use (up to 2 boarders)|-|
        |Home Occupations|6.8, 6.8A|
        |Urban Agriculture|6.30|
        |Accessory Buildings, Structures, and Uses|101.5.5, 6.6|
        """,
        "question": "What is the minimum lot width for a rowhouse that has a street corner lot?",
        "ground_truth": "Corner Lot - Street 8 m"
    },
    {
        "doc_id": 6,
        "context": """
        The minimum number of dwelling units with at least 3 bedrooms must be provided on a lot as follows:
        |Col1|Total Dwelling Units on a Lot|Col3|
        |---|---|---|
        ||1 to 3 Units|4 to 6 Units|
        |Minimum 3+ Bedroom Units:|1 Unit|2 Units|
        """,
        "question": "What is the total number of dwelling units permitted on a lot for 1 to 3 units?",
        "ground_truth": "Minimum 3+ Bedroom Units: 1 Unit"
    },
    {
        "doc_id": 7,
        "context": """
        101.6 General Regulations 101.6.1 Projections (1) The following features may project into the required minimum separation between buildings on the same lot: (a) steps and stairs; 
        (b) ornamental features, such as arbors, trellises, fish ponds, flag poles, or similar landscape features; 
        (c) terraces, decks, or other similar surfaces that are 1.0 m or less above grade; 
        (d) balconies, covered decks, uncovered decks, canopies, sunshades, or other similar features, including supporting structures, that are greater than 1.0 m above grade up to a maximum of 25 percent of the width of a required separation; 
        (e) belt courses, cornices, gutters, sills, chimneys, bay windows, or other similar features up to the lesser of 0.9 m or 25 percent of the width of a required separation; 
        (f) sunken access areas and window wells as per Section 6.9; 
        (g) outdoor appliances; and 
        (h) eaves up to the lesser of 1.2 m (3.94 ft.) or 25 percent of the width of a required separation.
        (2) Permitted projections into required yards are subject to Section 6.12, except that in the R1 District projections into required rear or side yards are limited to a maximum of 0.5 m where the rear or side yard abuts a lane to provide adequate fire truck clearance. 

        101.6.2 Outdoor Areas (1) An outdoor amenity space with a minimum width of 2.0 m and area of 10.0 m2 must be provided for each primary dwelling unit for its exclusive use and be directly accessible from the primary dwelling unit it serves.  
        
        101.6.3 Access and Fire Safety (1) Dwelling units located more than 45 m from a lot line abutting a street shall contain an automatic sprinkler system. 
        (2) All dwelling units shall have a minimum 1.0 m paved or gravel fire access corridor that: (a) provides direct pedestrian access from the dwelling unit entrance to a lot line abutting a street, or abutting a constructed lane where direct access to a street is not feasible; and 
        (b) is clear of any projections or obstructions to a minimum of 2.5 m in height.
        """,
        "question": "What is the minimum width and area for outdoor amenity space for each primary dwelling unit?",
        "ground_truth": "An outdoor amenity space with a minimum width of 2.0 m and area of 10.0 m2"
    }
]
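
Because an extractive QA model can only return a span copied from the context, it is worth checking which ground truths actually occur in their context. The sketch below is one possible sanity check. `is_extractable` is a hypothetical helper, and the two samples are shortened entries modelled on `doc_id` 1 and 2, not the full dataset.

```python
def is_extractable(context: str, truth: str) -> bool:
    """Return True if `truth` occurs in `context`, ignoring case and whitespace layout."""
    def squash(text: str) -> str:
        return " ".join(text.split()).casefold()
    return squash(truth) in squash(context)


# Two shortened entries modelled on doc_id 1 and doc_id 2 above.
samples = [
    {"doc_id": 1, "context": "Accessory Buildings\n4.0 m | 1 storey", "ground_truth": "4.0 m"},
    {"doc_id": 2, "context": "5-6 units: -", "ground_truth": "not specified"},
]

for sample in samples:
    print(sample["doc_id"], is_extractable(sample["context"], sample["ground_truth"]))
# prints "1 True" then "2 False"
```

An entry that fails this check cannot be answered with an exact match by a purely extractive model, which is useful to keep in mind when reading the scores.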

### Running and testing the models

The Hugging Face Transformers `pipeline` function is used. Since these are simple experiments, the default pipeline is considered adequate and no custom tokenizer or model configuration is applied.

In [None]:
# Load QA Pipelines for each model
# DistilBERT
distilbert_qa = pipeline(
    "question-answering",
    model = "distilbert-base-uncased-distilled-squad"
)

# LEGAL-BERT
legalbert_qa = pipeline(
    "question-answering",
    model="nlpaueb/legal-bert-small-uncased"
)

Device set to use cpu
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at nlpaueb/legal-bert-small-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu


The results of the zero-shot question answering runs are saved in a list called "results" and displayed in the data frame below.

In [None]:
# Run zero shot qa for DistilBERT and LEGAL-BERT to compare

for data in dataset:
    q = data['question']
    ctext = data['context']
    truth = data['ground_truth']

    # DistilBERT
    distil_response = distilbert_qa(question=q, context=ctext)['answer']
    # LEGAL-BERT
    legal_response = legalbert_qa(question=q, context=ctext)['answer']

    results.append({
        "doc_id": data['doc_id'],
        "question": q,
        "ground_truth": truth,
        "distil_answer": distil_response,
        "legal_answer": legal_response
    })

dataframe = pd.DataFrame(results)
dataframe

|   | doc_id | question | ground_truth | distil_answer | legal_answer |
| --- | --- | --- | --- | --- | --- |
| 0 | 1 | What is the maximum building height for access... | 4.0 m | 4.0 m \| 1 storey | Accessory |
| 1 | 2 | What is the maximum lot area for 5-6 small-sca... | not specified | 280m2 | Area:\n 1-3 units |
| 2 | 3 | Where does a child care facility in the R1 dis... | be located on a corner lot | on a corner lot | be located on a lot that does not contain a dw... |
| 3 | 4 | For lots in the R1 district on the Community H... | 2.0 m | 2.0 m | access shall have a minimum panhandle |
| 4 | 5 | What is the minimum lot width for a rowhouse t... | Corner Lot - Street 8 m | 280 m | Boarding Use (up to 2 |
| 5 | 6 | What is the total number of dwelling units per... | Minimum 3+ Bedroom Units: 1 Unit | Col1 | be provided on a lot as follows |
| 6 | 7 | What is the minimum width and area for outdoor... | An outdoor amenity space with a minimum width ... | 2.0 m | provided for each primary |
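
The loop above keeps only the `answer` field, but each result returned by the question answering pipeline is a dictionary that also carries a confidence `score`. Extractive QA heads trained only on answerable questions, which appears to be the case for the SQuAD-distilled DistilBERT checkpoint, always return some span, so "not specified" cases cannot be matched directly. One possible workaround is a score threshold. In the sketch below, `answer_or_not_specified` and the 0.3 cut-off are hypothetical, and the example dictionaries are made up to mimic the pipeline's output shape rather than taken from a real run.

```python
def answer_or_not_specified(result: dict, min_score: float = 0.3,
                            fallback: str = "not specified") -> str:
    """Return the predicted span, or `fallback` when the model is not confident."""
    answer = result.get("answer", "").strip()
    if not answer or result.get("score", 0.0) < min_score:
        return fallback
    return answer


# Made-up results in the shape returned by a question-answering pipeline.
confident = {"answer": "2.0 m", "score": 0.91, "start": 10, "end": 15}
unsure = {"answer": "280m2", "score": 0.04, "start": 40, "end": 45}

print(answer_or_not_specified(confident))  # 2.0 m
print(answer_or_not_specified(unsure))     # not specified
```

A suitable threshold would have to be tuned on labeled examples, since raw pipeline scores are not calibrated probabilities.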


### Evaluation metrics and concluding thoughts

In [6]:
# Evaluation and metrics

distil_metrics = evaluate_model(results, "distil_answer")
legal_metrics = evaluate_model(results, "legal_answer")

print("DistilBERT Metrics:", distil_metrics)
print("LegalBERT Metrics:", legal_metrics)

DistilBERT Metrics: {'exact_match': 14.285714285714286, 'f1': 42.176870748299315}
LegalBERT Metrics: {'exact_match': 0.0, 'f1': 6.722689075630251}


**DistilBERT Metrics**

* **Exact Match:** A score of 14.29% is low: only one of the seven predictions (doc_id 4, "2.0 m") exactly matches the ground truth. Two more (doc_id 1 and doc_id 3) contain the right information but add or omit words, so they are close without counting as exact matches.
* **F1 Score:** A score of 42.18% indicates moderate token overlap with the ground truth. The model often finds the right region of the text but does not capture the answer precisely.
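
The DistilBERT aggregates can be broken down per question. The per-question values below were worked out by hand from the results data frame, assuming SQuAD-style normalization (lowercasing, removing punctuation and articles). They are a reconstruction, not output of the `evaluate` library, although their averages agree with the reported scores.

```python
from fractions import Fraction

# (doc_id, exact match?, token-level F1) for the DistilBERT answers.
per_question = [
    (1, False, Fraction(2, 3)),  # "4.0 m | 1 storey" vs "4.0 m": 2 shared of 4 predicted, 2 expected
    (2, False, Fraction(0)),     # "280m2" vs "not specified": no shared tokens
    (3, False, Fraction(3, 4)),  # "on a corner lot" vs "be located on a corner lot"
    (4, True, Fraction(1)),      # "2.0 m" vs "2.0 m"
    (5, False, Fraction(2, 7)),  # "280 m" vs "Corner Lot - Street 8 m": only "m" is shared
    (6, False, Fraction(0)),     # "Col1" shares nothing with the expected answer
    (7, False, Fraction(1, 4)),  # "2.0 m" covers 2 of the 14 expected tokens
]

n = len(per_question)
exact_match_pct = 100 * sum(em for _, em, _ in per_question) / n
mean_f1_pct = float(100 * sum(f1 for _, _, f1 in per_question) / n)

print(round(exact_match_pct, 2), round(mean_f1_pct, 2))  # 14.29 42.18
```

With only seven questions, each one moves exact match by about 14 percentage points, so the aggregates should be read as indicative rather than precise.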

**LEGAL-BERT Metrics**

* **Exact Match:** A score of 0% means that no question was answered exactly correctly, which suggests the model, as loaded here, is not suited to the task.
* **F1 Score:** The F1 score of 6.72% is also very low, which again indicates that this model may not be suited to the task.

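DistilBERT outperforms LEGAL-BERT on both metrics. One likely reason is visible in the warning printed when the LEGAL-BERT pipeline was loaded: its question answering head (`qa_outputs`) was newly initialized, which means this checkpoint has not been fine-tuned for question answering, whereas the DistilBERT checkpoint was distilled on SQuAD. The comparison is therefore not like-for-like. Another possible factor is that the datasets used to pre-train LEGAL-BERT are not similar to the language, formatting, and content present in zoning by-laws. Zoning by-laws may be legal documents, but their content and language appear to be more factual, and they come in varying formats (tables and images besides raw text).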

Although DistilBERT outperforms LEGAL-BERT, it does not appear to be ideal for this task. Fine-tuning DistilBERT or exploring alternative methods, such as OCR (optical character recognition), may yield better results. For now, manually extracting data from the zoning by-law or using the [Claude LLM API Pipeline](https://github.com/JoT8ng/zoning-extraction-pipelines/blob/main/llm_api_pipeline/src/README.md) seems more effective and efficient, despite the problem of hallucinations. NLP models like these may not be the right solution for zoning by-laws because of the documents' long, complex, and varying formats. OCR models could potentially be more suitable for automating this task, although their results would similarly have to be double-checked: OCR has its own limitations, including accuracy issues with complex PDF layouts and the need to post-process the outputs.