# Demo

## Libraries etc

In [1]:
import os
import pandas as pd
from pathlib import Path

import warnings
warnings.filterwarnings("ignore", category=UserWarning)

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.vectorstores import InMemoryVectorStore

from qanda import QandA
from bert_score import score

## Run the Preprocessor
Do you have some new data? If you have a new document, place it in the `data/` directory and then run the `preprocessor.py` module. The document will be processed with OCR and the result will be placed in the `jsondata/` directory.

Let's say your new file is `Rodier-Finding.pdf` and you've put it in the `data/` directory with all your other documents. Now you just need to invoke `preprocessor.py`. Here's how to do it:

In [2]:
%run preprocessor.py


Note: a message saying 'Token indices sequence length is longer than the specified maximum sequence length...' can be ignored in this case
Details: https://github.com/docling-project/docling-core/issues/119#issuecomment-2577418826

Processing Rodier-Finding.pdf
Document Rodier-Finding.pdf converted in 30.88 seconds.
jsondata/Blood-results-redacted.jsonl already exists.
jsondata/Nicholls-Diver-finding.jsonl already exists.
jsondata/TAULELEI-Jacob-Finding.jsonl already exists.
jsondata/Forkin-finding-2014.jsonl already exists.
jsondata/Baby-H-finding.jsonl already exists.

Finished.



**NB** If your document doesn't need OCR, set `ocr=False` in the `batch_convert` function of the `preprocessor.py` module.

Now let's take a look in the `jsondata/` directory. Our new document has been converted into `Rodier-Finding.jsonl`, a pre-chunked, serialized JSONL file with metadata attached, ready to be loaded into a vector store. See:

In [3]:
os.listdir('jsondata/')

['TAULELEI-Jacob-Finding.jsonl',
 'Rodier-Finding.jsonl',
 'Blood-results-redacted.jsonl',
 'Forkin-finding-2014.jsonl',
 'Baby-H-finding.jsonl',
 'Nicholls-Diver-finding.jsonl']
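
To check what the preprocessor wrote, a JSONL file can be read one line at a time with the standard library. The sketch below is not part of `preprocessor.py`; the helper names `read_jsonl` and `summarise` are hypothetical, and it assumes only that each non-empty line holds one JSON object. The actual keys in each record depend on the preprocessor and are not documented here.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Return one parsed JSON object per non-empty line of a JSONL file."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


def summarise(records):
    """Return the record count and the sorted top-level keys in use."""
    keys = set()
    for record in records:
        if isinstance(record, dict):
            keys.update(record)
    return len(records), sorted(keys)
```

Calling `summarise(read_jsonl("jsondata/Rodier-Finding.jsonl"))` would then report how many chunks were produced and which fields each chunk carries.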

## Initialize a QandA Object
Now that we've got our document ready, let's start up a RAG question-answer chain. To do so we'll need to initialise a `QandA` object, which is just a Python object that encapsulates everything we need:

- `FILE_PATH`: the file path to the pre-processed document you want to analyse (e.g., `jsondata/Rodier-Finding.jsonl`)
- `GEN_MODEL`: the generative LLM you want to use (e.g., a model from Ollama)
- `EMBED_MODEL`: the vector embedding model (e.g., `mxbai-embed-large`)
- `VDB`: the vector store class (e.g., `InMemoryVectorStore`)
- `TOP_K`: how many sources of context to retrieve in the vector similarity search
- `PROMPT`: the prompt template

OK, let's initialise the `QandA` object now:

In [4]:
# Set the file (document); generative LLM model; embedding model;
# vec db; num sources
FILE_PATH = Path("jsondata/Rodier-Finding.jsonl")
GEN_MODEL = "gemma3"
EMBED_MODEL = "mxbai-embed-large"
VDB = InMemoryVectorStore
TOP_K = 3

# Set the prompt
PROMPT = ChatPromptTemplate.from_template(
    """Context information is below.
    \n---------------------\n
    {context}
    \n---------------------\n
    Given the context information and not prior knowledge, answer the query.\n
    Query: {input}\n
    Answer:\n""",
)

# Initialize the qanda object
qanda = QandA(gen_model=GEN_MODEL,
              embed_model=EMBED_MODEL, 
              vdb=VDB,
              file_path=FILE_PATH,
              top_k=TOP_K,
              prompt=PROMPT)

Initializing, please wait...
Loading jsondata/Rodier-Finding.jsonl
Question Answer chain ready.
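
To give a feel for what `TOP_K` controls, here is a minimal pure-Python sketch of top-k retrieval by cosine similarity. It is an illustration only: the function names and the toy 2-D vectors are invented, and whether `QandA` retrieves in exactly this way internally is an assumption. Real embeddings from a model such as `mxbai-embed-large` have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec, chunks, k):
    """Return the texts of the k chunks most similar to the query.

    `chunks` is a list of (text, vector) pairs.
    """
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy 2-D "embeddings", invented for illustration only.
chunks = [
    ("chunk about fishing", [1.0, 0.0]),
    ("chunk about the coroner", [0.0, 1.0]),
    ("chunk about rocks and sea", [0.8, 0.6]),
]
print(top_k_chunks([1.0, 0.1], chunks, k=2))
```

With `k=2` only the two closest chunks are passed on as context. A larger `TOP_K` gives the generative model more material to draw on, at the cost of a longer prompt.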


Take a look at the docstring.

In [5]:
# Qanda help
help(qanda)

Help on QandA in module qanda object:

class QandA(builtins.object)
 |  QandA(gen_model, embed_model, vdb, file_path, top_k, prompt)
 |
 |  A class for performing question-answering tasks using a language model and a vector database.
 |
 |  Attributes:
 |      gen_model (str): The name of the language model to be used for generating answers.
 |      embed_model (str): The name of the embedding model to be used for generating embeddings.
 |      vdb (str): The name of the vector database to be used for storing and retrieving documents.
 |      file_path (str): The path to the file containing the documents to be used for the question-answering task.
 |      top_k (int): The number of top-k most relevant documents to be retrieved for each question.
 |      prompt (str): The prompt to be used for the question-answering chain.
 |
 |  Methods:
 |      ask(question, verbose=False):
 |          Invokes the question-answering chain to generate an answer to the given question.
 |          Args:


## Several Methods to use a QandA Object
There are a number of ways to use the `qanda` object you've created to extract and analyse information from your document. Let's go through some now.

## Method 1: Just answer the questions

In [6]:
# Ask some questions
qanda.ask("Who died?")

'Frank Edward Rodier died.'

In [7]:
qanda.ask("Activity involved in death?")

'Fishing.'

In [8]:
qanda.ask("Who went fishing?")

'Frank Rodier, Donald McLeod, and Ducas went fishing.'

In [9]:
# Create a list of questions
QUESTIONS = ["Who is the coroner?",
             "Who is the deceased?",
             "What was the cause of death?"]

# Get the answers
for i, QUESTION in enumerate(QUESTIONS):
    ANSWER = qanda.ask(QUESTION)
    print(f"Answer {i + 1}: ", ANSWER)

Answer 1:  Sarah Helen Linton, Deputy State Coroner.
Answer 2:  Frank Edward Rodier is the deceased.
Answer 3:  The cause of death remains unascertained. The report states, “Accordingly, his cause of death must remain unascertained.”


## Method 2: Answer the questions and score the answers

In [10]:
# make a scores function with the BERTScore metric
def calculate_bertscore_df(df):
    references = df['CORRECT_ANSWER'].tolist()
    candidates = df['LLM_ANSWER'].tolist()
    
    precision, recall, f1 = score(candidates, references, lang="en", verbose=True)
    
    df['BERT_PRECISION'] = precision.tolist()
    df['BERT_RECALL'] = recall.tolist()
    df['BERT_F1'] = f1.tolist()
    
    return df

In [11]:
# set the questions... and the correct answers
QUESTIONS = ["Who is the coroner?",
             "Who is the deceased?",
             "What was the cause of death?"]
CORRECT_ANSWERS = ["Sarah Helen Linton",
                   "Frank Edward Rodier",
                   "unascertained"]
LLM_ANSWERS = []

In [12]:
# get the answers from the RAG chain, i.e., the LLM_ANSWERS
for i, QUESTION in enumerate(QUESTIONS):
    ANSWER = qanda.ask(QUESTION)
    LLM_ANSWERS.append(ANSWER)
    print(f"Answer {i + 1}: ", ANSWER)

Answer 1:  Sarah Helen Linton, Deputy State Coroner.
Answer 2:  Frank Edward Rodier is the deceased.
Answer 3:  The cause of death remains unascertained. The coroner determined that while it was likely an accident, the death was caused by injuries sustained from the rocks, and the specific cause could not be determined.


In [13]:
# make a dataframe
data = {
    'FILENAME': ['Rodier-Finding'] * len(QUESTIONS),
    'MODEL': ['gemma3'] * len(QUESTIONS),
    'QUESTION': QUESTIONS,
    'CORRECT_ANSWER': CORRECT_ANSWERS,
    'LLM_ANSWER': LLM_ANSWERS
}

df = pd.DataFrame(data)

In [14]:
# score the answers
scores_df = calculate_bertscore_df(df)

print(scores_df.columns)

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


calculating scores...
computing bert embedding.
computing greedy matching.
done in 1.60 seconds, 1.88 sentences/sec
Index(['FILENAME', 'MODEL', 'QUESTION', 'CORRECT_ANSWER', 'LLM_ANSWER',
       'BERT_PRECISION', 'BERT_RECALL', 'BERT_F1'],
      dtype='object')


In [15]:
# show the results
print(scores_df)

         FILENAME   MODEL                      QUESTION       CORRECT_ANSWER  \
0  Rodier-Finding  gemma3           Who is the coroner?   Sarah Helen Linton   
1  Rodier-Finding  gemma3          Who is the deceased?  Frank Edward Rodier   
2  Rodier-Finding  gemma3  What was the cause of death?        unascertained   

                                          LLM_ANSWER  BERT_PRECISION  \
0          Sarah Helen Linton, Deputy State Coroner.        0.888055   
1               Frank Edward Rodier is the deceased.        0.913599   
2  The cause of death remains unascertained. The ...        0.805766   

   BERT_RECALL   BERT_F1  
0     0.959326  0.922316  
1     0.961740  0.937052  
2     0.863840  0.833793  
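
The three columns are related: BERTScore F1 is the harmonic mean of precision and recall. As a quick sanity check, the sketch below recomputes F1 from the precision and recall values printed above. The helper name `harmonic_f1` is hypothetical and is not part of `bert_score`.

```python
def harmonic_f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
rows = [
    (0.888055, 0.959326, 0.922316),
    (0.913599, 0.961740, 0.937052),
    (0.805766, 0.863840, 0.833793),
]

for precision, recall, reported in rows:
    recomputed = harmonic_f1(precision, recall)
    print(f"recomputed={recomputed:.4f}  reported={reported:.4f}")
```

The recomputed and reported values agree to about four decimal places; any remaining difference comes from the rounding of the printed inputs.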


## Method 3: Answer the question and provide the source context

In [16]:
# pose a question, get the answer... and sources
QUESTION = "What activity was implicated in the cause of death?"
ANSWER, SOURCES = qanda.ask(QUESTION, verbose=True)

In [17]:
print(ANSWER)

Fishing.


In [18]:
print(SOURCES)

[{'source': 1, 'text': 'IS DEATH ESTABLISHED?\n17. As is clear from the above; I am satisfied beyond reasonable doubt that Frank Rodier is deceased and that he died on 25 1975 in the sea after he was washed off the rocks while fishing with friends. May\n18. but I cannot exclude the possibility that he died from injuries he sustained from the rocks, or that injury at least contributed to his death. Accordingly, his cause of death must remain unascertained. As to the manner of death, I am satisfied he died by way of accident.', 'page': 6, 'document': 'data/Rodier-Finding.pdf'}, {'source': 2, 'text': 'INTRODUCTION\n- 2 In my capacity as the Acting State Coroner, I determined on the basis of information provided by the WA Police in August 2023 that   there was   reasonable cause to suspect that Frank had died and that his death was a reportable death under the Act. I therefore made a direction to the Commissioner of Police; pursuant to s 23(1) of the Coroners Act 1996 (WA) that the suspect

In [19]:
print(SOURCES[0]['text'])

IS DEATH ESTABLISHED?
17. As is clear from the above; I am satisfied beyond reasonable doubt that Frank Rodier is deceased and that he died on 25 1975 in the sea after he was washed off the rocks while fishing with friends. May
18. but I cannot exclude the possibility that he died from injuries he sustained from the rocks, or that injury at least contributed to his death. Accordingly, his cause of death must remain unascertained. As to the manner of death, I am satisfied he died by way of accident.


In [20]:
print(SOURCES[0]['page'])

6


In [21]:
print(SOURCES[0]['document'])

data/Rodier-Finding.pdf
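
Since each source is a plain dictionary with `source`, `text`, `page` and `document` keys (as in the output above), the list is easy to turn into short citations. The helper below is a hypothetical sketch, not a `QandA` method, and the sample text is shortened from the first source shown above.

```python
def format_sources(sources, max_chars=80):
    """Build one citation line per source dictionary."""
    lines = []
    for item in sources:
        # Collapse newlines and repeated spaces into single spaces.
        snippet = " ".join(item["text"].split())
        if len(snippet) > max_chars:
            snippet = snippet[:max_chars].rstrip() + "..."
        lines.append(
            f"[{item['source']}] {item['document']}, page {item['page']}: {snippet}"
        )
    return lines


# Shortened copy of the first source returned above.
sample_sources = [
    {
        "source": 1,
        "text": "IS DEATH ESTABLISHED?\n17. As is clear from the above; I am "
                "satisfied beyond reasonable doubt that Frank Rodier is deceased",
        "page": 6,
        "document": "data/Rodier-Finding.pdf",
    }
]

for line in format_sources(sample_sources):
    print(line)
```

In the notebook, the same call would be `format_sources(SOURCES)`.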


## Method 4: Compare answers of different models

In [22]:
# set the models
LLAMA = "llama3.2"
GEMMA = "gemma3"
PHI   = "phi4-mini"

In [23]:
# initialise RAG chains (i.e., QandA objects) for each model
qanda_llama = QandA(gen_model=LLAMA,
                    embed_model=EMBED_MODEL, 
                    vdb=VDB,
                    file_path=FILE_PATH,
                    top_k=TOP_K,
                    prompt=PROMPT)

qanda_gemma = QandA(gen_model=GEMMA,
                    embed_model=EMBED_MODEL, 
                    vdb=VDB,
                    file_path=FILE_PATH,
                    top_k=TOP_K,
                    prompt=PROMPT)

qanda_phi = QandA(gen_model=PHI,
                  embed_model=EMBED_MODEL, 
                  vdb=VDB,
                  file_path=FILE_PATH,
                  top_k=TOP_K,
                  prompt=PROMPT)

Initializing, please wait...
Loading jsondata/Rodier-Finding.jsonl
Question Answer chain ready.
Initializing, please wait...
Loading jsondata/Rodier-Finding.jsonl
Question Answer chain ready.
Initializing, please wait...
Loading jsondata/Rodier-Finding.jsonl
Question Answer chain ready.


In [24]:
# pose a question (which you know the answer to)
QUESTION = "What activity was implicated in the cause of death?"
CORRECT_ANSWER = "Fishing"
LLM_ANSWERS = []

In [25]:
# get the generated answer for each model RAG chain
for i, qanda_model in enumerate([qanda_llama, qanda_gemma, qanda_phi]):
    ANSWER = qanda_model.ask(QUESTION)
    LLM_ANSWERS.append(ANSWER)
    print(f"Answer {i + 1}: ", ANSWER)

Answer 1:  Based on the provided context, it can be inferred that fishing with friends is the activity implicated in the cause of death. This is because Frank Rodier was washed off the rocks while fishing with friends, which led to his subsequent drowning or accidental death.
Answer 2:  Fishing.
Answer 3:  Based on the provided context:

The specific circumstances surrounding Frank Rodier's drowning are described as occurring while he "was washed off the rocks" during an outing that involved fishing with friends. It is mentioned at one point, however, there might also have been a possibility for injuries from these rocky surfaces to contribute or exacerbate his demise.

Therefore:
Answer: Fishing on the sea with rock hazards implied in cause of death (accidental drowning potentially linked to both falling into water and possible injury)


In [26]:
# make a dataframe
data = {
    'FILENAME': ['Rodier-Finding'] * 3,
    'MODEL': [LLAMA, GEMMA, PHI],
    'QUESTION': [QUESTION] * 3,
    'CORRECT_ANSWER': [CORRECT_ANSWER] * 3,
    'LLM_ANSWER': LLM_ANSWERS
}
df = pd.DataFrame(data)

In [27]:
# score the answers
scores_df = calculate_bertscore_df(df)

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


calculating scores...
computing bert embedding.
computing greedy matching.
done in 1.26 seconds, 2.39 sentences/sec


In [28]:
print(scores_df.columns)

Index(['FILENAME', 'MODEL', 'QUESTION', 'CORRECT_ANSWER', 'LLM_ANSWER',
       'BERT_PRECISION', 'BERT_RECALL', 'BERT_F1'],
      dtype='object')


In [29]:
# show the results
print(scores_df)

         FILENAME      MODEL  \
0  Rodier-Finding   llama3.2   
1  Rodier-Finding     gemma3   
2  Rodier-Finding  phi4-mini   

                                            QUESTION CORRECT_ANSWER  \
0  What activity was implicated in the cause of d...        Fishing   
1  What activity was implicated in the cause of d...        Fishing   
2  What activity was implicated in the cause of d...        Fishing   

                                          LLM_ANSWER  BERT_PRECISION  \
0  Based on the provided context, it can be infer...        0.793064   
1                                           Fishing.        0.968601   
2  Based on the provided context:\n\nThe specific...        0.770629   

   BERT_RECALL   BERT_F1  
0     0.819261  0.805950  
1     0.937267  0.952677  
2     0.828657  0.798590  
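
To read the comparison at a glance, the models can be ranked by `BERT_F1`. The sketch below uses plain Python with the values copied from the table above; with the DataFrame itself, `scores_df.sort_values("BERT_F1", ascending=False)` would give the same ordering.

```python
# MODEL and BERT_F1 values copied from the table above.
results = [
    {"MODEL": "llama3.2", "BERT_F1": 0.805950},
    {"MODEL": "gemma3", "BERT_F1": 0.952677},
    {"MODEL": "phi4-mini", "BERT_F1": 0.798590},
]

# Highest F1 first.
ranked = sorted(results, key=lambda row: row["BERT_F1"], reverse=True)

for position, row in enumerate(ranked, start=1):
    print(f"{position}. {row['MODEL']}: {row['BERT_F1']:.4f}")
```

Note that the reference answer here is a single word, so a terse answer such as `Fishing.` is likely to score higher than a longer answer that is equally correct. Treat the ranking as a measure of similarity to the reference, not of correctness alone.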


**_That's it!_**