# Project Report: Reproducible Evaluation Results

**Audience:** Dr. Matthew Albrecht (WACRSR), Project Markers

This notebook serves as a companion to the final project report, providing the live code and reproducible results that support our findings. It demonstrates the core functionality of the `QandA` system and validates the quantitative evaluation discussed in **Section 4** of the report.

## 1. System Initialization

First, we set up the environment as described in our report. This involves initializing the `QandA` object with the `gemma3` model, which our evaluation identified as the best-performing choice for general-purpose extraction.

In [1]:
from pathlib import Path
import pandas as pd
import warnings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.vectorstores import InMemoryVectorStore
from qanda import QandA
from scores import calculate_bertscore_df

warnings.filterwarnings("ignore", category=UserWarning)

FILE_PATH = Path("jsondata/Rodier-Finding.jsonl")
GEN_MODEL = "gemma3"
EMBED_MODEL = "mxbai-embed-large"
VDB = InMemoryVectorStore
TOP_K = 3
PROMPT = ChatPromptTemplate.from_template(
    """Context information is below.\n
    ---------------------\n
    {context}\n
    ---------------------\n
    Given the context information and not prior knowledge, answer the query.\n
    Query: {input}\n
    Answer:\n"""
)

In [2]:
qanda = QandA(gen_model=GEN_MODEL,
              embed_model=EMBED_MODEL, 
              vdb=VDB,
              file_path=FILE_PATH,
              top_k=TOP_K,
              prompt=PROMPT)

Initializing, please wait...
Loading jsondata\Rodier-Finding.jsonl
Question Answer chain ready.
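
The internals of `QandA` are not shown in this notebook. Assuming it follows the usual retrieval pattern (embed the question, rank the stored chunks by cosine similarity, pass the `TOP_K` best chunks to the prompt as `{context}`), the retrieval step can be sketched as below. The function names and the toy 2-D vectors are hypothetical stand-ins for real `mxbai-embed-large` embeddings, not the project's actual code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_chunks(query_vec, chunk_vecs, k=3):
    """Return the indices of the k chunks most similar to the query, best first."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy vectors standing in for chunk embeddings (hypothetical values).
toy_chunks = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
toy_query = [1.0, 0.05]

selected = top_k_chunks(toy_query, toy_chunks, k=3)
print(selected)  # [0, 1, 3]
```

With `k=3`, only the three closest chunks reach the prompt, so a fact that sits in a lower-ranked chunk would not be visible to the model.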


## 2. Quantitative Evaluation (Report Section 4)

This section reproduces the automated evaluation detailed in our report. We use a predefined set of questions and their corresponding 'ground truth' answers to quantitatively assess the model's performance using BERTScore.
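For readers unfamiliar with the metric, the following is a minimal sketch of the greedy matching at the core of BERTScore. It assumes token embeddings are already available and uses hypothetical toy vectors; it omits the optional IDF weighting and baseline rescaling, and it is not the code in `scores.py`. Precision averages, over candidate tokens, the best similarity to any reference token; recall does the same over reference tokens; F1 is their harmonic mean.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def greedy_bertscore(candidate_vecs, reference_vecs):
    """Return (precision, recall, f1) using greedy token matching."""
    cand = [normalize(v) for v in candidate_vecs]
    ref = [normalize(v) for v in reference_vecs]
    # Precision: each candidate token is matched to its closest reference token.
    precision = sum(max(dot(c, r) for r in ref) for c in cand) / len(cand)
    # Recall: each reference token is matched to its closest candidate token.
    recall = sum(max(dot(r, c) for c in cand) for r in ref) / len(ref)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy embeddings (hypothetical): the candidate repeats both reference tokens
# and adds one extra token that matches neither exactly.
toy_reference = [[1.0, 0.0], [0.0, 1.0]]
toy_candidate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

p, r, f1 = greedy_bertscore(toy_candidate, toy_reference)
print(f"precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```

In this toy case recall stays at 1.0 because every reference token is covered, while the extra candidate token pulls precision down. An answer that is longer than its reference tends to show the same pattern.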

### 2.1 Evaluation Data

Here we define the list of questions and the manually verified correct answers used for scoring. This corresponds to the methodology described in **Section 4.1** of the report.

In [3]:
QUESTIONS = [
    "Who is the coroner?", 
    "Who is the deceased?", 
    "What was the cause of death?"
]

CORRECT_ANSWERS = [
    "Sarah Helen Linton",
    "Frank Edward Rodier",
    "unascertained"
]

### 2.2 Generating and Scoring Model Answers

We now programmatically ask each question, collect the model's answer, and then use the `calculate_bertscore_df` function from our `scores.py` module to generate the final performance metrics.
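Because questions and reference answers are paired by list position, a length mismatch would silently misalign the scoring. A small guard such as the one sketched below could be run before generation. The helper name `pair_questions` is hypothetical and not part of the `qanda` or `scores` modules; the two lists are copied from Section 2.1 so the block runs on its own.

```python
QUESTIONS = [
    "Who is the coroner?",
    "Who is the deceased?",
    "What was the cause of death?",
]

CORRECT_ANSWERS = [
    "Sarah Helen Linton",
    "Frank Edward Rodier",
    "unascertained",
]


def pair_questions(questions, answers):
    """Pair each question with its reference answer, failing on a length mismatch."""
    if len(questions) != len(answers):
        raise ValueError(
            f"{len(questions)} questions but {len(answers)} reference answers"
        )
    return list(zip(questions, answers))


pairs = pair_questions(QUESTIONS, CORRECT_ANSWERS)
for question, answer in pairs:
    print(f"{question} -> {answer}")
```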

In [4]:
print("Generating LLM answers for evaluation...")
llm_answers = [qanda.ask(q) for q in QUESTIONS]

Generating LLM answers for evaluation...


In [5]:
data = {
    'FILENAME': [FILE_PATH.stem] * len(QUESTIONS),
    'MODEL': [GEN_MODEL] * len(QUESTIONS),
    'QUESTION': QUESTIONS,
    'CORRECT_ANSWER': CORRECT_ANSWERS,
    'LLM_ANSWER': llm_answers
}
df = pd.DataFrame(data)

scores_df = calculate_bertscore_df(df)

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


calculating scores...
computing bert embedding.



computing greedy matching.



done in 1.72 seconds, 1.74 sentences/sec


computing greedy matching.


done in 1.45 seconds, 2.07 sentences/sec


### 2.3 Evaluation Results

The table below presents the BERTScore precision, recall, and F1 values for the `gemma3` model on the three sample questions. It provides the empirical evidence for the performance metrics cited in **Table 1** and the analysis in **Section 4.4** of our report.

In [6]:
display(scores_df)

| | FILENAME | MODEL | QUESTION | CORRECT_ANSWER | LLM_ANSWER | BERT_PRECISION | BERT_RECALL | BERT_F1 |
|---|---|---|---|---|---|---|---|---|
| 0 | Rodier-Finding | gemma3 | Who is the coroner? | Sarah Helen Linton | Sarah Helen Linton, Deputy State Coroner | 0.876953 | 0.963977 | 0.918408 |
| 1 | Rodier-Finding | gemma3 | Who is the deceased? | Frank Edward Rodier | Frank Edward Rodier is the deceased. | 0.913599 | 0.96174 | 0.937052 |
| 2 | Rodier-Finding | gemma3 | What was the cause of death? | unascertained | The cause of death remains unascertained. The ... | 0.811553 | 0.84024 | 0.825647 |


### 2.4 Analysis of Results

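As discussed in the report, these results show the model's strong performance on direct factual questions: the coroner and deceased questions reach F1-scores of 0.918 and 0.937, with recall above 0.96 in both cases. The cause-of-death question scores lower (F1 of 0.826), where the model returns a longer answer than the one-word reference. Precision is below recall in all three rows. Together with the model comparison in the report, this quantitative data supports our conclusion that `gemma3` is the most suitable choice for the WACRSR's goal of building a reliable, fact-based database from coroner reports.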
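As a quick consistency check on the table in Section 2.3, the reported `BERT_F1` values can be recomputed from `BERT_PRECISION` and `BERT_RECALL` as their harmonic mean. The sketch below hard-codes the three rows from that table; the helper name `f1_from` is ours and is not part of `scores.py`.

```python
# (question, precision, recall, reported F1), copied from the results table.
ROWS = [
    ("Who is the coroner?", 0.876953, 0.963977, 0.918408),
    ("Who is the deceased?", 0.913599, 0.96174, 0.937052),
    ("What was the cause of death?", 0.811553, 0.84024, 0.825647),
]


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


recomputed = [f1_from(p, r) for _, p, r, _ in ROWS]

for (question, _, _, reported), value in zip(ROWS, recomputed):
    # Inputs are rounded to six decimals, so allow a small tolerance.
    status = "ok" if abs(value - reported) < 1e-5 else "MISMATCH"
    print(f"{question:<30} reported={reported:.6f} recomputed={value:.6f} {status}")
```

All three rows agree with the reported values to within rounding, which confirms that the F1 column is derived from the precision and recall columns rather than computed independently.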