# Evaluating the Accuracy of the Results from the Claude LLM API Pipeline

This pipeline extracts zoning information in three steps. First, the text of the by-law PDFs is extracted and parsed as Markdown. Next, a query containing the extracted text is sent to the LLM API, which responds with the zoning information. Finally, the response is processed, exported to CSV, and joined with a zoning GeoJSON dataset.
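As a rough illustration of the final join step, the sketch below attaches CSV fields to GeoJSON features that share a key. It is not the pipeline's actual code: the `zone_code` key, the `max_height_m` column, the function name, and the sample data are all hypothetical, and only the Python standard library is used.

```python
import csv
import io
import json

def join_zoning_to_geojson(csv_text, geojson, key="zone_code"):
    """Copy each CSV row's fields into the properties of the feature with the same key.

    The key name is an assumption made for this sketch.
    """
    rows_by_key = {row[key]: row for row in csv.DictReader(io.StringIO(csv_text))}
    joined = {"type": "FeatureCollection", "features": []}
    for feature in geojson["features"]:
        props = dict(feature.get("properties") or {})
        match = rows_by_key.get(props.get(key))
        if match is not None:
            # Add the extracted fields; the join key is already present
            props.update({k: v for k, v in match.items() if k != key})
        joined["features"].append({**feature, "properties": props})
    return joined

# Hypothetical sample data: two extracted zones and two mapped zones
csv_text = "zone_code,max_height_m\nR1,10.5\nC2,24\n"
geojson = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"zone_code": "R1"}, "geometry": None},
        {"type": "Feature", "properties": {"zone_code": "OS"}, "geometry": None},
    ],
}

joined = join_zoning_to_geojson(csv_text, geojson)
print(json.dumps(joined["features"][0]["properties"]))
```

Features without a matching CSV row are kept unchanged, so zones that the LLM failed to extract remain visible in the joined output.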

### What are Zoning By-laws and why do they matter?
Zoning By-laws contain important information about land use, building height, density, and other development regulations. They are important documents that inform urban planning and development decisions in cities.

They are often stored as long, unstructured PDF legal documents and it's difficult to find information within them. Zoning information is also spatial and tied to geospatial datasets. It would be great if the zoning information in the by-laws could be extracted in an efficient and automated way and joined with geospatial datasets.

### Evaluation Metric

**After developing this pipeline, its accuracy needs to be evaluated so it can be benchmarked against other models and pipelines. It's important to assess the usefulness, strengths, and weaknesses of different models and pipelines for the desired task.**

Exact Match (EM) and the F1 score are the metrics most often used to evaluate the accuracy of Question Answering NER models. It makes sense to apply them in this scenario as well, because the LLM is being prompted in such a way as to act like a Question Answering NER model.

* **Exact Match (EM):** This metric measures the percentage of questions where the model's answer exactly matches one of the ground truth answers.
* **F1 Score:** This metric measures the overlap between the predicted answer and the ground truth answers. It considers both precision (the share of the model's answer that is correct) and recall (the share of the ground truth answer that the model returned). In the SQuAD metric, both are computed over the words that the prediction and the ground truth have in common. The F1 score is the harmonic mean of precision and recall, providing a balance between the two. A higher F1 score indicates a better-performing model. The F1 score is well suited to imbalanced datasets, where accuracy can be misleading. [More information](https://www.geeksforgeeks.org/machine-learning/f1-score-in-machine-learning/)
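
To make the two definitions concrete, here is a minimal sketch of how EM and a word-level F1 score can be computed for a single prediction. It is meant to mirror SQuAD-style scoring (lowercasing, removing punctuation and English articles, then comparing words), but it is an illustration rather than the library's implementation. The function names and the example strings are hypothetical.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, truth):
    """1.0 if the normalized strings are identical, otherwise 0.0."""
    return float(normalize(prediction) == normalize(truth))

def token_f1(prediction, truth):
    """Harmonic mean of word-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    truth_tokens = normalize(truth).split()
    if not pred_tokens or not truth_tokens:
        # If either side is empty, score 1.0 only when both are empty
        return float(pred_tokens == truth_tokens)
    # Count the words shared by both answers, respecting duplicates
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: the prediction contains the answer plus extra words
em = exact_match("Maximum height: 12 metres", "12 metres")
f1 = token_f1("Maximum height: 12 metres", "12 metres")
print(em, round(f1, 3))
```

In this example the prediction contains the full ground truth (recall of 1.0), but only half of its words are correct (precision of 0.5), so EM is 0 while F1 still gives partial credit.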

A CSV file called "llm_api_evaluation_dataset.csv" containing the ground truth and the LLM responses will be used to evaluate the pipeline. For reference, a CSV file ("example_pipeline_output.csv") showing the raw output from the pipeline is placed in this repository folder.
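
The evaluation code further down reads three fields from each row: `doc_id`, `ground_truth`, and the model output column `answer`. As a hedged sketch, a small check like the one below could confirm that a file has this layout before any metrics are computed. The helper name and the sample rows are hypothetical, and the rule of skipping rows with no ground truth is an assumption, not something the pipeline is known to do.

```python
import csv
import io

# Column names taken from the evaluation code in this notebook
REQUIRED_COLUMNS = ("doc_id", "ground_truth", "answer")

def load_evaluation_rows(csv_file):
    """Return rows as dictionaries of stripped strings, skipping rows with no ground truth."""
    reader = csv.DictReader(csv_file)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    rows = []
    for row in reader:
        if not (row["ground_truth"] or "").strip():
            continue  # nothing to score against
        rows.append({c: (row[c] or "").strip() for c in REQUIRED_COLUMNS})
    return rows

# Hypothetical sample standing in for the real evaluation file
sample = io.StringIO(
    "doc_id,ground_truth,answer\n"
    "1,12 metres,12 metres\n"
    "2,,unknown\n"
    "3,R1,\n"
)
rows = load_evaluation_rows(sample)
print(len(rows), rows[0])
```

An empty `answer` is kept on purpose, since a missing prediction should count against the model rather than be dropped from the evaluation.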

### Imports and Set Up

First, import all the necessary Python libraries.

In [None]:
import pandas as pd
import evaluate

### Evaluation and Metrics

In [None]:
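# Load the evaluation dataset.
# Every column is read as text and empty cells are kept as "" because the SQuAD metric expects strings.
dataset = pd.read_csv("llm_api_evaluation_dataset.csv", dtype=str, keep_default_na=False)

# Load SQuAD metrics
squad_metric = evaluate.load("squad")

# Convert the evaluation dataset into a list of row dictionaries.
# Assumes the CSV has "doc_id", "ground_truth", and "answer" columns, as used below.
results = dataset.to_dict("records")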

# WORK IN PROGRESS BELOW
# Evaluation helper function to prepare inputs for Hugging Face SQuAD metrics
def evaluate_model(res, model):

    # res (results): list of row dictionaries, each containing a "doc_id", the ground truth, and the prediction
    # model: name of the column that holds the predictions to score

    predictions = []
    references = []

    for r in res:
        predictions.append({
            "id": str(r["doc_id"]),
            "prediction_text": r[model]
        })
        references.append({
            "id": str(r["doc_id"]),
            "answers": {
                "text": [r["ground_truth"]],
                "answer_start": [0]  # dummy value
            }
        })

    # Compute metrics
    return squad_metric.compute(predictions=predictions, references=references)

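# Evaluation: score the "answer" column against the ground truth.
# The SQuAD metric reports "exact_match" and "f1" on a 0-100 scale.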
metrics = evaluate_model(results, "answer")

print("Metrics:", metrics)

### Concluding Thoughts