# Notebook Description

## 📒 RAG PDF QA Development Notebook

This notebook contains the **experimental, prototyping, and evaluation work** for building a local Retrieval-Augmented Generation (RAG) system over PDF documents.

It serves as the **primary development environment** for testing components, validating retrieval quality, measuring system performance, and refining architecture before migrating stable logic into production `.py` modules.

---

## ✅ What This Notebook Includes

### 🔹 PDF Ingestion and Preprocessing
- Loading PDFs from disk using LangChain loaders  
- Page-level metadata extraction  
- Boilerplate filtering (copyright, TOC, publisher pages)  
- Document inspection and debugging  

### 🔹 Text Chunking and Index Preparation
- Recursive character-based chunking experiments  
- Chunk size and overlap tuning  
- Low-information chunk removal  
- Page-aligned chunk ID generation  
- Chunk integrity verification  

### 🔹 Embedding and Vector Storage
- Initializing and testing `BAAI/bge-large-en-v1.5` embeddings  
- Local embedding generation (CPU/GPU)  
- Creating and persisting Chroma vector database  
- Incremental indexing and duplicate prevention  
- Metadata validation and index inspection  

### 🔹 Retrieval and Re-Ranking Experiments
- Top-K vector similarity retrieval testing  
- Retrieval quality debugging using source page IDs  
- Cross-encoder re-ranking integration  
- Testing `cross-encoder/ms-marco-MiniLM-L-6-v2`  
- Retrieval precision vs recall tuning  

### 🔹 Prompt Engineering and LLM Integration
- RAG prompt design and refinement  
- Context grounding and hallucination control  
- Strict extraction-based answer prompts  
- Integration with Llama-3.1 via Ollama  
- Stateless and conversational prompt variants  

### 🔹 Evaluation Pipeline Development
- Creation of manually curated QA evaluation datasets  
- Ground-truth source page annotation  
- Retrieval evaluation using Recall@K  
- End-to-end answer accuracy measurement  
- Hallucination rate measurement  
- Latency benchmarking  
- Failure case analysis and debugging  

### 🔹 UI and Pipeline Integration Testing
- Testing pipeline functions before UI integration  
- Validation of conversational query rewriting  
- Source attribution verification  
- Document upload and incremental indexing testing  

---

## ⚠️ What This Notebook Is *Not*

This notebook is **not the production entrypoint**.

It does not serve as:

- ❌ Final pipeline executable  
- ❌ Gradio UI implementation  
- ❌ CLI entry script  
- ❌ Modular backend service  

Production logic has been migrated into dedicated modules:

- `Updated_pipeline.py` → Core RAG pipeline  
- `app.py` → Gradio UI application  
- `Chroma/` → Persistent vector database  

---

## 🎯 Purpose

This notebook exists to:

- Prototype and validate system components  
- Tune chunking, retrieval, and reranking  
- Develop and validate evaluation metrics  
- Debug retrieval and generation failures  
- Benchmark system performance  
- Test new ideas before production integration  

It serves as the **development and experimentation environment**, while `.py` modules provide the stable, deployable implementation.

# Lib Installs

In [1]:
#!pip install langchain langchain-community
#!pip install pypdf
#!pip install sentence-transformers
#!pip install chromadb
#!pip install langchain-chroma
#!pip install -U langchain-ollama
#!pip install tqdm
#!pip install gradio

# Imports

In [293]:
import torch
import numpy as np
from pathlib import Path
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain_chroma import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama import OllamaLLM
from sentence_transformers import CrossEncoder
from tqdm import tqdm
import time
import gradio as gr
import shutil

# Paths

In [14]:
ROOT = Path(r"C:\Users\Archit\Documents\ML Projects\RAG-Based-PDF-QA-System")
DATA_DIR = ROOT / 'Data'
CHROMA_DIR = ROOT / 'Chroma'

In [15]:
print(ROOT)
print(DATA_DIR)
print(CHROMA_DIR)

C:\Users\Archit\Documents\ML Projects\RAG-Based-PDF-QA-System
C:\Users\Archit\Documents\ML Projects\RAG-Based-PDF-QA-System\Data
C:\Users\Archit\Documents\ML Projects\RAG-Based-PDF-QA-System\Chroma


# PDF Ingestion

In [16]:
def load_docs():
    doc_loader = PyPDFDirectoryLoader(DATA_DIR)
    return doc_loader.load()

In [6]:
docs = load_docs()
print(docs[2])

page_content='To the unrelenting voice in my head that will never allow me to stop.' metadata={'producer': 'calibre (2.85.1) [https://calibre-ebook.com]', 'creator': 'calibre (2.85.1) [https://calibre-ebook.com]', 'creationdate': '2020-06-25T21:00:51+00:00', 'author': 'David Goggins', 'moddate': '2020-06-25T21:01:00+00:00', 'title': "Can't Hurt Me: Master Your Mind and Defy the Odds", 'source': 'C:\\Users\\Archit\\Documents\\ML Projects\\RAG-Based-PDF-QA-System\\Data\\Can_t-Hurt-Me-David-Goggins.pdf', 'total_pages': 303, 'page': 2, 'page_label': '3'}


In [7]:
len(docs)

312

# Chunking

## Page-level filtering: removing boilerplate and short PDF pages

In [17]:
def filter_pages(docs, min_chars=200):
    """Drop pages that are too short or look like boilerplate."""
    cleaned = []
    # Substrings that mark boilerplate pages (matched case-insensitively)
    blacklist = [
        "all rights reserved",
        "copyright",
        "isbn",
        "table of contents"
    ]
    for d in docs:
        text = d.page_content.lower()

        if len(text) < min_chars:  # removes short pages
            continue
        if any(b in text for b in blacklist):  # removes boilerplate pages
            continue
        cleaned.append(d)
    return cleaned
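
When tuning the page filter it helps to see *why* a page was dropped. The following is a minimal sketch of that idea, not part of the pipeline: `Page` is a hypothetical stand-in for a LangChain `Document`, and `explain_filter` is an illustrative name that applies the same two rules as `filter_pages` while recording a reason for each dropped page.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """Hypothetical stand-in for a LangChain Document (text + metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


BLACKLIST = ("all rights reserved", "copyright", "isbn", "table of contents")


def explain_filter(pages, min_chars=200, blacklist=BLACKLIST):
    """Return (kept, dropped); dropped holds (page, reason) tuples."""
    kept, dropped = [], []
    for p in pages:
        text = p.page_content.lower()
        if len(text) < min_chars:
            dropped.append((p, "too short"))
            continue
        # First blacklist phrase found in the page, if any
        hit = next((b for b in blacklist if b in text), None)
        if hit is not None:
            dropped.append((p, f"blacklist: {hit}"))
            continue
        kept.append(p)
    return kept, dropped


pages = [
    Page("x" * 50, {"page": 0}),
    Page("Copyright 2020. " + "y" * 300, {"page": 1}),
    Page("z" * 300, {"page": 2}),
]
kept, dropped = explain_filter(pages)
print([p.metadata["page"] for p in kept])
print([(p.metadata["page"], why) for p, why in dropped])
```

Because the blacklist is a plain substring match, a page headed only `CONTENTS` does not match `"table of contents"` and passes the filter if it is long enough.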

In [18]:
def split_docs(docs: list[Document]):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size = 800,
        chunk_overlap = 80,
        length_function = len, 
        is_separator_regex=False,
    )
    chunks = text_splitter.split_documents(docs)
    # Drop low-information chunks of 200 characters or fewer
    chunks = [c for c in chunks if len(c.page_content) > 200]
    return chunks
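
To see how `chunk_size` and `chunk_overlap` interact, here is a deliberately simplified fixed-window chunker. It is an illustration only: `RecursiveCharacterTextSplitter` additionally prefers to break at separators such as paragraph and line boundaries, so its chunks will not line up exactly with these windows. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int = 800, chunk_overlap: int = 80) -> list[str]:
    """Cut text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


text = "".join(str(i % 10) for i in range(2000))
chunks = sliding_chunks(text)
print([len(c) for c in chunks])
print(chunks[0][-80:] == chunks[1][:80])  # consecutive windows share 80 characters
```

With 800/80, each new chunk contributes 720 fresh characters, so a larger overlap means more chunks (and more embeddings) for the same text.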

In [10]:
docs = load_docs()
docs = filter_pages(docs)

chunks = split_docs(docs)
print(chunks[0])

page_content='CONTENTS
INTRODUCTION
1. I SHOULD HAVE BEEN A STATISTIC
2. TRUTH HURTS
3. THE IMPOSSIBLE TASK
4. TAKING SOULS
5. ARMORED MIND
6. IT’S NOT ABOUT A TROPHY
7. THE MOST POWERFUL WEAPON
8. TALENT NOT REQUIRED
9. UNCOMMON AMONGST UNCOMMON
10. THE EMPOWERMENT OF FAILURE
11. WHAT IF?
ACKNOWLEDGMENTS
ABOUT THE AUTHOR' metadata={'producer': 'calibre (2.85.1) [https://calibre-ebook.com]', 'creator': 'calibre (2.85.1) [https://calibre-ebook.com]', 'creationdate': '2020-06-25T21:00:51+00:00', 'author': 'David Goggins', 'moddate': '2020-06-25T21:01:00+00:00', 'title': "Can't Hurt Me: Master Your Mind and Defy the Odds", 'source': 'C:\\Users\\Archit\\Documents\\ML Projects\\RAG-Based-PDF-QA-System\\Data\\Can_t-Hurt-Me-David-Goggins.pdf', 'total_pages': 303, 'page': 3, 'page_label': '4'}


In [11]:
print(len(docs))
print(len(chunks))

299
959


## Calculating chunk IDs

In [19]:
def calc_chunk_ids(chunks):
    # This will create IDs like "data/monopoly.pdf:6:2"
    # Page Source : Page Number : Chunk Index
    last_page_id = None
    current_chunk_index = 0

    for chunk in chunks:
        source = chunk.metadata.get("source")
        page = chunk.metadata.get("page")
        current_page_id = f"{source}:{page}"

        if current_page_id == last_page_id:
            current_chunk_index += 1
        else:
            current_chunk_index = 0

        chunk_id = f"{current_page_id}:{current_chunk_index}"
        last_page_id = current_page_id

        chunk.metadata["id"] = chunk_id

    return chunks
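
One property of this ID scheme is worth keeping in mind: the chunk index is derived from the order of the list, so IDs are only stable when the function runs over the full, ordered chunk list. The sketch below reproduces the same counting logic on plain `(source, page)` tuples (a hypothetical simplification, with `page_chunk_ids` as an illustrative name) and shows how a slice that starts mid-page gets a different ID for the same chunk.

```python
def page_chunk_ids(pairs):
    """pairs: (source, page) tuples in chunk order -> 'source:page:index' IDs."""
    ids, last_page_id, index = [], None, 0
    for source, page in pairs:
        page_id = f"{source}:{page}"
        # Increment within a page, reset to 0 when the page changes
        index = index + 1 if page_id == last_page_id else 0
        ids.append(f"{page_id}:{index}")
        last_page_id = page_id
    return ids


pairs = [("book.pdf", 4), ("book.pdf", 4), ("book.pdf", 5), ("book.pdf", 5), ("book.pdf", 5)]
full = page_chunk_ids(pairs)
print(full)

# Same second chunk, but numbered from a slice: it is now index 0 instead of 1
sliced = page_chunk_ids(pairs[1:])
print(sliced[0])
```

The same applies to the chunking parameters and the short-chunk filter: changing either shifts the indices, so IDs from an older index may no longer refer to the same text.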

### Testing

In [25]:
test = calc_chunk_ids([chunks[5]])
print(test[0].metadata["id"])

C:\Users\Archit\Documents\ML Projects\RAG-Based-PDF-QA-System\Data\Can_t-Hurt-Me-David-Goggins.pdf:6:0


In [32]:
test = calc_chunk_ids(chunks[:10])
for c in test:
    print(c.metadata["id"])

C:\Users\Archit\Documents\ML Projects\RAG-Based-PDF-QA-System\Data\Can_t-Hurt-Me-David-Goggins.pdf:3:0
C:\Users\Archit\Documents\ML Projects\RAG-Based-PDF-QA-System\Data\Can_t-Hurt-Me-David-Goggins.pdf:4:0
C:\Users\Archit\Documents\ML Projects\RAG-Based-PDF-QA-System\Data\Can_t-Hurt-Me-David-Goggins.pdf:4:1
C:\Users\Archit\Documents\ML Projects\RAG-Based-PDF-QA-System\Data\Can_t-Hurt-Me-David-Goggins.pdf:5:0
C:\Users\Archit\Documents\ML Projects\RAG-Based-PDF-QA-System\Data\Can_t-Hurt-Me-David-Goggins.pdf:5:1
C:\Users\Archit\Documents\ML Projects\RAG-Based-PDF-QA-System\Data\Can_t-Hurt-Me-David-Goggins.pdf:6:0
C:\Users\Archit\Documents\ML Projects\RAG-Based-PDF-QA-System\Data\Can_t-Hurt-Me-David-Goggins.pdf:6:1
C:\Users\Archit\Documents\ML Projects\RAG-Based-PDF-QA-System\Data\Can_t-Hurt-Me-David-Goggins.pdf:6:2
C:\Users\Archit\Documents\ML Projects\RAG-Based-PDF-QA-System\Data\Can_t-Hurt-Me-David-Goggins.pdf:7:0
C:\Users\Archit\Documents\ML Projects\RAG-Based-PDF-QA-System\Data\Can_t-

# Embeddings Setup

In [277]:
def get_embeddings_function(device: str = None):
    # Use the GPU when available, otherwise fall back to CPU
    if device is None:
        device = "cuda" if torch.cuda.is_available() else "cpu"
    embeddings = HuggingFaceBgeEmbeddings(
        model_name='BAAI/bge-large-en-v1.5',
        model_kwargs={'device': device},  # resolved device, not a hard-coded 'cuda'
        encode_kwargs={'normalize_embeddings': True},
        query_instruction="Represent this sentence for searching relevant passages:"
    )
    return embeddings
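
A short aside on `normalize_embeddings=True`: once vectors have unit length, cosine similarity is just the dot product, and squared L2 distance becomes `2 - 2 * dot`, so an L2-based index ranks results in the same order as cosine similarity. The sketch below checks this with plain Python on toy 2-d vectors; whether the Chroma collection here uses an L2 metric is an assumption, not something verified in this notebook.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
na, nb = l2_normalize(a), l2_normalize(b)

# Cosine of the raw vectors equals the dot product of the normalized ones
print(cosine(a, b), dot(na, nb))

# For unit vectors: squared L2 distance == 2 - 2 * dot
sq_l2 = sum((x - y) ** 2 for x, y in zip(na, nb))
print(sq_l2, 2 - 2 * dot(na, nb))
```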

### Testing

In [36]:
emb = get_embeddings_function()

Loading weights:   0%|          | 0/391 [00:00<?, ?it/s]

BertModel LOAD REPORT from: BAAI/bge-large-en-v1.5
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


In [39]:
vec = emb.embed_query("What is Faster R-CNN?")
print(len(vec))  # 1024 dim embedding
print(vec[:10])       

1024
[0.048546068370342255, 0.005149699281901121, -0.02971961721777916, 0.01099423784762621, 0.03847793862223625, 0.0020174733363091946, 0.0021198869217187166, -0.04009025916457176, -0.016631048172712326, 0.0679197907447815]


In [34]:
test = chunks[0]
test_vecs = emb.embed_documents([test.page_content])
print(len(test_vecs[0]))
#print(test_vecs)

1024


# Vector DB

### Testing

In [64]:
# load DB
db = Chroma(persist_directory=CHROMA_DIR,
            embedding_function = get_embeddings_function())


In [72]:
#Calc Page IDs 
chunks_with_ids = calc_chunk_ids(chunks)

In [73]:
# Add or update the documents
existing_items = db.get(include=[])
existing_ids = set(existing_items['ids'])
print(f"Number of existing documents in DB: {len(existing_ids)}")

Number of existing documents in DB: 0


In [76]:
# Only add docs that don't exist in the DB
new_chunks = []
for chunk in chunks_with_ids:
    if chunk.metadata['id'] not in existing_ids:
        new_chunks.append(chunk)

if len(new_chunks):
    print(f'Adding new documents: {len(new_chunks)}')
    new_chunk_ids = [chunk.metadata['id'] for chunk in new_chunks]
    db.add_documents(new_chunks, ids=new_chunk_ids)
    #db.persist()
else:
    print("No New Documents to add")

Adding new documents: 959


## Add-to-Chroma function

In [21]:
def add_to_chroma(chunks: list[Document]):
    # load DB
    db = Chroma(persist_directory=CHROMA_DIR,
                embedding_function = get_embeddings_function())

    #Calc Page IDs 
    chunks_with_ids = calc_chunk_ids(chunks)

    # Add or update the documents
    existing_items = db.get(include=[])
    existing_ids = set(existing_items['ids'])
    print(f"Number of existing documents in DB: {len(existing_ids)}")

    # Only add docs that don't exist in the DB
    new_chunks = []
    for chunk in chunks_with_ids:
        if chunk.metadata['id'] not in existing_ids:
            new_chunks.append(chunk)
    
    if len(new_chunks):
        print(f'Adding new documents: {len(new_chunks)}')
        new_chunk_ids = [chunk.metadata['id'] for chunk in new_chunks]
        db.add_documents(new_chunks, ids=new_chunk_ids)
        #db.persist()
    else:
        print("No New Documents to add")
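
The duplicate-prevention step can be exercised without Chroma. The sketch below uses a plain dict as a hypothetical stand-in for the vector DB and a helper named `select_new` (not part of the notebook) to show that running the same batch twice adds nothing the second time.

```python
def select_new(chunk_ids, existing_ids):
    """Return IDs not yet stored, preserving order and dropping in-batch repeats."""
    seen = set(existing_ids)
    new = []
    for cid in chunk_ids:
        if cid not in seen:
            seen.add(cid)
            new.append(cid)
    return new


store = {}  # hypothetical stand-in for the vector DB: id -> text
batch = {"a.pdf:0:0": "first", "a.pdf:0:1": "second"}

for run in range(2):
    new_ids = select_new(batch.keys(), store.keys())
    store.update({cid: batch[cid] for cid in new_ids})
    print(f"run {run}: added {len(new_ids)}, total {len(store)}")
```

This check is purely ID-based: a chunk whose text changed but whose ID stayed the same is treated as already indexed and is not re-embedded.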

## Loading DB

In [15]:
db = Chroma(
    persist_directory=CHROMA_DIR,
    embedding_function=get_embeddings_function())

print(db._collection.count())

959


## Clearing the DB

In [14]:
# Deletes the on-disk Chroma directory so the vector DB can be rebuilt from scratch.
# def clear_database():
#     if CHROMA_DIR.exists():
#         shutil.rmtree(CHROMA_DIR)

# Retriever Setup

In [178]:
# Initialize Embeddings
emb_fxn = get_embeddings_function()

# Initialize Vector Store
db = Chroma(
    persist_directory=CHROMA_DIR,
    embedding_function=emb_fxn
)

# Initialize LLM
llm = OllamaLLM(model='llama3.1')

# Cross encoder
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

Loading weights:   0%|          | 0/391 [00:00<?, ?it/s]

BertModel LOAD REPORT from: BAAI/bge-large-en-v1.5
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


Loading weights:   0%|          | 0/105 [00:00<?, ?it/s]

BertForSequenceClassification LOAD REPORT from: cross-encoder/ms-marco-MiniLM-L-6-v2
Key                          | Status     |  | 
-----------------------------+------------+--+-
bert.embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


Using Top-k Similarity Search 

In [132]:
Base_PROMPT = """
Answer the question based only on the following context:

{context}

---

Answer the question based on the above context: {question}
"""

In [234]:
Base_PROMPT_strict = """
You must answer using ONLY the exact words from the context.

Rules:
- Do NOT explain.
- Do NOT rephrase.
- Do NOT add extra information.
- Return ONLY the answer phrase.

Context:
{context}

Question:
{question}

Answer:
"""


query_rag(query_text)  
    → embed query  
    → similarity_search  
    → build prompt  
    → LLM   

### Old but working function: no reranker, only top-k (k=4) similarity search

In [128]:
# def query_rag(query_text: str):
#     emb_fxn = get_embeddings_function()
#     db = Chroma(persist_directory=CHROMA_DIR,embedding_function = emb_fxn)

#     results = db.similarity_search_with_score(query_text, k=4)

#     context = "\n\n---\n\n".join([doc.page_content for doc, _score in results])
#     prompt_template = ChatPromptTemplate.from_template(Base_PROMPT)
#     prompt = prompt_template.format(context=context, question=query_text)
#     #print(prompt)

#     model = OllamaLLM(model = 'llama3.1')
#     response_text = model.invoke(prompt)

#     sources = [doc.metadata.get('id', None) for doc, _score in results]
#     formatted_response = f'Response: {response_text}\n\nSources: {sources}'
#     #print(formatted_response)
#     return response_text

## Re-Ranker

In [148]:
def rerank(query, docs, top_n=4):
    """
    Re-rank retrieved chunks with the cross-encoder.

    query: string
    docs: list of (Document, score) from Chroma
    returns: the top_n Documents, best cross-encoder score first
    """

    passages = [doc.page_content for doc, _ in docs]
    pairs = [(query, passage) for passage in passages]

    # One relevance score per (query, passage) pair
    scores = cross_encoder.predict(pairs)

    scored_docs = list(zip(docs, scores))
    scored_docs.sort(key=lambda x: x[1], reverse=True)

    # Return only the top_n Documents; both retrieval and cross-encoder scores are dropped
    return [doc for (doc, _orig_score), _ce_score in scored_docs[:top_n]]
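
The re-ranking logic itself is independent of the model that produces the scores. The sketch below injects the scoring function so the sort-and-truncate step can be tested in isolation; `token_overlap` is a toy word-overlap score used only as a placeholder and is **not** a cross-encoder, and the passages are made-up examples.

```python
def rerank_with(score_fn, query, docs, top_n=4):
    """docs: list of (text, retrieval_score); returns top_n texts by score_fn(query, text)."""
    scored = [(text, score_fn(query, text)) for text, _ in docs]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in scored[:top_n]]


def token_overlap(query, passage):
    """Toy relevance score: number of shared lowercase tokens (placeholder only)."""
    return len(set(query.lower().split()) & set(passage.lower().split()))


candidates = [
    ("the race started at dawn", 0.2),
    ("the network shares convolutional features", 0.3),
    ("a region proposal network predicts object bounds", 0.5),
]
top = rerank_with(token_overlap, "what is a region proposal network", candidates, top_n=2)
print(top)
```

In the notebook, `cross_encoder.predict` plays the role of `score_fn`, scoring all `(query, passage)` pairs in one batch rather than one at a time.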

### New function: top-k (k=10) similarity search, re-ranked down to the top 4

In [275]:
def query_rag(query_text: str, return_context=False):
    results = db.similarity_search_with_score(query_text, k=10)
  
    reranked_docs = rerank(query_text, results)
    
    context = "\n\n---\n\n".join([doc.page_content for doc in reranked_docs])

    prompt_template = ChatPromptTemplate.from_template(Base_PROMPT_strict)
    prompt = prompt_template.format(context=context, question=query_text)
    #print(prompt)
    
    response_text = llm.invoke(prompt)
    # Source IDs are collected for debugging only; they are not returned
    sources = [doc.metadata.get('id', None) for doc in reranked_docs]
    formatted_response = f'Response: {response_text}\n\nSources: {sources}'
    #print(formatted_response)
    if return_context:
        return response_text, context
    return response_text
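
`query_rag` computes the source IDs but returns only the answer text. If source attribution is needed downstream, one option is to return both. The sketch below shows that shape with stub callables so it runs without a vector DB or an LLM; `answer_with_sources`, `fake_retrieve`, `fake_generate`, and the shortened `STRICT_TEMPLATE` are all hypothetical names introduced for illustration.

```python
# Shortened stand-in for the strict prompt; not the notebook's actual template
STRICT_TEMPLATE = "Context:\n{context}\n\nQuestion:\n{question}\n\nAnswer:"


def answer_with_sources(question, retrieve, generate, top_n=4):
    """retrieve(question) -> list of (chunk_id, text); generate(prompt) -> str."""
    hits = retrieve(question)[:top_n]
    context = "\n\n---\n\n".join(text for _, text in hits)
    prompt = STRICT_TEMPLATE.format(context=context, question=question)
    return {"answer": generate(prompt), "sources": [cid for cid, _ in hits]}


# Stubs so the sketch is self-contained
def fake_retrieve(question):
    return [("doc.pdf:1:0", "alpha"), ("doc.pdf:2:0", "beta")]


def fake_generate(prompt):
    return f"prompt had {prompt.count('---')} separator(s)"


result = answer_with_sources("q?", fake_retrieve, fake_generate)
print(result)
```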

In [274]:
test= "What was David Goggins max weight?"
query_rag(test)

'nearly 300 pounds'

In [170]:
test= "How many hell weeks did Goggins do ?"
query_rag(test)

'According to the text, Goggins did 2 Hell Weeks, but he also "participated" in 3 Hell Weeks. It\'s not clear what this means, but it seems that he was present during 3 Hell Weeks as either a student or an instructor, rather than being specifically tested through Hell Week himself.'

In [171]:
test= "What is a RPN?"
query_rag(test)

'A Region Proposal Network (RPN) is a fully-convolutional network that simultaneously predicts object bounds and ratios at a location. It can be trained end-to-end specifically for generating detection proposals, and it shares full-image convolutional features with the detection network.'

## Testing

### Query Question

In [227]:
emb_fxn = get_embeddings_function()
db = Chroma(persist_directory=CHROMA_DIR, embedding_function=emb_fxn)

Loading weights:   0%|          | 0/391 [00:00<?, ?it/s]

BertModel LOAD REPORT from: BAAI/bge-large-en-v1.5
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.



In [226]:
query_text = "When was Goggins' first Badwater? and how many did he run?"

In [228]:
results = db.similarity_search_with_score(query_text, k=4)

In [229]:
context = "\n\n---\n\n".join([doc.page_content for doc, _score in results])
#print(context)

In [230]:
prompt_template = ChatPromptTemplate.from_template(Base_PROMPT)
#print(prompt_template)

In [231]:
prompt = prompt_template.format(context=context, question=query_text)
print(prompt)

Human: 
Answer the question based only on the following context:

around and through it so it would not derail me. By the time I toed up to the
line at Badwater at 6 a.m. on July 22, 2006, I’d moved my governor to 80
percent. I’d doubled my ceiling in six months, and you know what that
guaranteed me?
Jack fucking shit.
Badwater has a staggered start. Rookies started at 6 a.m., veteran runners
had an 8 a.m. start, and the true contenders wouldn’t take off until 10 a.m.,
which put them in Death Valley for peak heat. Chris Kostman was one
hilarious son of a bitch. But he didn’t know he’d given one hard
motherfucker a serious tactical advantage. Not me. I’m talking about Akos
Konya.
Akos and I met up the night before at the Furnace Creek Inn, where all the
athletes stayed. He was a first-timer too, and he looked a hell of a lot better

---

garb. I preferred to go incognito. I was the shadow figure filtering into a new
world of pain.
During my first Badwater
Although Akos set a

### Invoking the model

In [21]:
model = OllamaLLM(model = 'llama3.1')

In [233]:
response_text = model.invoke(prompt)

In [234]:
print(response_text)

Based on the provided text:

* The date of David Goggins' first Badwater is not explicitly mentioned in the given snippet. However, it is mentioned that "During my first Badwater" refers to an event that occurred around July 22, 2006 (as mentioned in Chapter Eleven).
* It is implied that Goggins ran more than one Badwater, as he mentions running a second Badwater in 2014 and seems to be reminiscing about his previous experiences.


In [235]:
sources = [doc.metadata.get('id', None) for doc, _score in results]
formatted_response = f'Response: {response_text}\n\nSources: {sources}'
print(formatted_response)

Response: Based on the provided text:

* The date of David Goggins' first Badwater is not explicitly mentioned in the given snippet. However, it is mentioned that "During my first Badwater" refers to an event that occurred around July 22, 2006 (as mentioned in Chapter Eleven).
* It is implied that Goggins ran more than one Badwater, as he mentions running a second Badwater in 2014 and seems to be reminiscing about his previous experiences.

Sources: ['C:\\Users\\Archit\\Documents\\ML Projects\\RAG-Based-PDF-QA-System\\Data\\Can_t-Hurt-Me-David-Goggins.pdf:182:1', 'C:\\Users\\Archit\\Documents\\ML Projects\\RAG-Based-PDF-QA-System\\Data\\Can_t-Hurt-Me-David-Goggins.pdf:183:0', 'C:\\Users\\Archit\\Documents\\ML Projects\\RAG-Based-PDF-QA-System\\Data\\Can_t-Hurt-Me-David-Goggins.pdf:277:0', 'C:\\Users\\Archit\\Documents\\ML Projects\\RAG-Based-PDF-QA-System\\Data\\Can_t-Hurt-Me-David-Goggins.pdf:186:0']
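
Source IDs like the ones printed above are what a Recall@K check needs: compare the pages of the top-k retrieved chunks against manually annotated ground-truth pages. The sketch below is one possible formulation, not the notebook's evaluation code; the function names and the example paths are hypothetical. Note the split from the right, since Windows paths already contain a `:` after the drive letter.

```python
def page_of(chunk_id: str) -> str:
    """Strip the trailing chunk index: 'source:page:index' -> 'source:page'."""
    return chunk_id.rsplit(":", 1)[0]


def recall_at_k(retrieved_ids, relevant_pages, k):
    """Fraction of annotated relevant pages found among the top-k retrieved chunks."""
    if not relevant_pages:
        raise ValueError("relevant_pages must not be empty")
    top_pages = {page_of(cid) for cid in retrieved_ids[:k]}
    return len(top_pages & set(relevant_pages)) / len(relevant_pages)


# Made-up IDs in the same 'source:page:index' shape used by calc_chunk_ids
retrieved = [r"C:\data\book.pdf:182:1", r"C:\data\book.pdf:183:0", r"C:\data\book.pdf:277:0"]
relevant = {r"C:\data\book.pdf:183", r"C:\data\book.pdf:186"}

print(recall_at_k(retrieved, relevant, k=2))
print(recall_at_k(retrieved, relevant, k=1))
```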


In [201]:
# print(prompt + '\n' + formatted_response) 

# Testing RAG Output Using Unit Tests

In [170]:
# EVAL_PROMPT = """
# Expected Response: {expected_response}
# Actual Response: {actual_response}
# ---
# Does the actual response mean the same as the expected response?
# (Answer with 'true' or 'false')
# """

In [171]:
# EVAL_PROMPT = """
# Expected Answer:
# {expected_response}

# Model Answer:
# {actual_response}

# ---

# Decide whether the model answer contains the same core factual information as the expected answer.

# Ignore wording differences, extra commentary, or stylistic changes.
# Focus only on whether the main fact(s) match.

# Reply with exactly one word: true or false.
# """

In [172]:
EVAL_PROMPT = """
Expected Answer:
{expected_response}

Model Answer:
{actual_response}

---

Determine whether the Model Answer states the same factual claim(s) as the Expected Answer.

Rules:
- If any number, count, or entity differs, answer false.
- If the Expected Answer is contradicted, answer false.
- Do NOT be generous.
- Do NOT infer or reinterpret.
- Ignore wording only when the facts clearly match.

Respond with exactly one word: true or false.
"""

### Testing Function

In [142]:
question = 'What is a RPN?'
expected_response= 'A fully convolutional network that predicts object bounding boxes and objectness scores from shared feature maps to generate region proposals'

In [144]:
question="How many hell weeks did Goggins do ?"
expected_response="He did three Hell Weeks"

In [146]:
question = 'What dataset was used?'
expected_response = 'PASCAL VOC'

In [147]:
response_text = query_rag(question)
prompt = EVAL_PROMPT.format(expected_response = expected_response, actual_response = response_text)

model = OllamaLLM(model = "llama3.1")
eval_results_str = model.invoke(prompt)
eval_results_str_cleaned = eval_results_str.strip().lower()

print(prompt)
if "true" in eval_results_str_cleaned:
    print("\033[92m" + f"Response: {eval_results_str_cleaned}" + "\033[0m")
    print('True')
elif 'false' in eval_results_str_cleaned:
    print("\033[91m" + f"Response: {eval_results_str_cleaned}" + "\033[0m")
    print('False') 
else:
    raise ValueError('Invalid evaluation result. Cannot determine if "True" or "False".')



Expected Answer:
PASCAL VOC

Model Answer:
According to Table 2 and Table 3, the datasets used were:

* PASCAL VOC 2007 test set
* PASCAL VOC 2012 test set

Additionally, the training data for some experiments included:

* "07": VOC 2007 trainval
* "07+12": union set of VOC 2007 trainval and VOC 2012 trainval

---

Decide whether the model answer contains the same core factual information as the expected answer.

Ignore wording differences, extra commentary, or stylistic changes.
Focus only on whether the main fact(s) match.

Reply with exactly one word: true or false.

Response: true
True


## Evaluation Function

In [153]:
def query_and_validate(question: str, expected_response: str):
    response_text = query_rag(question)
    prompt = EVAL_PROMPT.format(expected_response = expected_response, actual_response = response_text)
    
    model = OllamaLLM(model = "llama3.1")
    eval_results_str = model.invoke(prompt)
    eval_results_str_cleaned = eval_results_str.strip().lower()
    
    print(prompt)
    if "true" in eval_results_str_cleaned:
        print("\033[92m" + f"Response: {eval_results_str_cleaned}" + "\033[0m")
        print('True')
        return True
    elif 'false' in eval_results_str_cleaned:
        print("\033[91m" + f"Response: {eval_results_str_cleaned}" + "\033[0m")
        print('False')
        return False
    else:
        raise ValueError('Invalid evaluation result. Cannot determine if "True" or "False".')
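
One caveat with the substring check in `query_and_validate`: a reply such as "That is not true." contains `"true"` and would be counted as a pass. A stricter alternative is to accept only a reply that consists of exactly one verdict word. The helper below is a hypothetical sketch of that idea, not the notebook's code.

```python
import re


def parse_verdict(raw: str) -> bool:
    """Map a judge reply to True/False; raise if it is not exactly one verdict word."""
    words = re.findall(r"[a-z]+", raw.lower())
    if words == ["true"]:
        return True
    if words == ["false"]:
        return False
    raise ValueError(f"Unparseable verdict: {raw!r}")


print(parse_verdict(" True.\n"))
print(parse_verdict("false"))

try:
    parse_verdict("That is not true.")
except ValueError as err:
    print(err)
```

The trade-off is more raised errors when the judge model adds commentary, which is arguably the safer failure mode for an evaluation metric.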

### More single-question tests on the function

In [164]:
def test_Cant_hurt_me():
    assert query_and_validate(
        question="How many hell weeks did Goggins do ?",
        expected_response="He did 3 Hell Weeks",
    )

In [165]:
test_Cant_hurt_me()



Expected Answer:
He did 3 Hell Weeks

Model Answer:
According to the text, Goggins survived two Hell Weeks (not as a student) and participated in three. However, it also mentions that "After surviving two Hell Weeks", implying that he was a participant/student at least twice, which would be his two participations mentioned earlier, but not necessarily as a survivor.

---

Determine whether the Model Answer states the same factual claim(s) as the Expected Answer.

Rules:
- If any number, count, or entity differs, answer false.
- If the Expected Answer is contradicted, answer false.
- Do NOT be generous.
- Do NOT infer or reinterpret.
- Ignore wording only when the facts clearly match.

Respond with exactly one word: true or false.

Response: true
True


In [209]:
def test_Cant_hurt_me2():
    assert query_and_validate(
        #question="What physical breakdown did David Goggins experience at mile 70 of the San Diego One Day race?",
        question="What medical problems did David Goggins suffer at mile 70 of the San Diego One Day race?",
        expected_response="At mile 70, Goggins’ body shut down due to a lack of training; he suffered from kidney failure, stress fractures, and lost control of his bladder and bowels while sitting in a lawn chair, yet he continued on to finish the race.",
    )
test_Cant_hurt_me2()


Human: 
Answer the question based only on the following context:

ankles had vanished…because my feet had swollen enough to stabilize those
tendons. Was this a good thing long-term? Probably not, but you take what
you can get on the ultra circuit, where you have to roll with whatever gets
you from mile to mile. Meanwhile, my quads and calves ached like they’d
been thumped with a sledgehammer. Yeah, I had done a lot of running, but
most of it—including my ruck runs—on pancake flat terrain in San Diego,
not on slick jungle trails.
Kate was waiting for me by the time I completed my second lap, and after
spending a relaxing morning on Waikiki beach, she watched in horror as I
materialized from the mist like a zombie from the Walking Dead. I sat and
guzzled as much water as I could. By then, word had gotten out that it was
my first trail race.

---

one dragged out like an elastic thread, sending shockwaves of pain from my
toes to the space behind my eyeballs. I hacked and coughed, 

AssertionError: 

In [212]:
def test_Cant_hurt_me3():
    assert query_and_validate(
    question = "According to Can't Hurt Me, what humiliating physical incident did David Goggins admit happened to him during the San Diego One Day race after his body began to fail?",
    expected_response = "He admitted that he lost control of his bowels during the race but kept going anyway.",
    )
test_Cant_hurt_me3()


Human: 
Answer the question based only on the following context:

accepting Trunnis Goggins as part of me, I was free to use where I came
from as fuel. I realized that each episode of child abuse that could have
killed me made me tough as hell and as sharp as a Samurai’s blade.
True, I had been dealt a fucked-up hand, but that night I started thinking of it
as running a 100-mile race with a fifty-pound ruck on my back. Could I still
compete in that race even if everyone else was running free and easy,
weighing 130 pounds? How fast would I be able to run once I’d shed that
dead weight? I wasn’t even thinking about ultras yet. To me the race was life
itself, and the more I took inventory, the more I realized how prepared I was
for the fucked-up events yet to come. Life had put me in the fire, taken me

---

typhoon.
“People have a hard time going through BUD/S healthy, and you’re going
through it on broken legs! Who else would even think of this?” I asked.
“Who else would b

AssertionError: 

In [182]:
def test_RCNN_Paper():
    assert query_and_validate(
        question = 'What dataset was used?',
        expected_response = 'PASCAL VOC',
    )

In [183]:
test_RCNN_Paper()



Expected Answer:
PASCAL VOC

Model Answer:
The datasets used are PASCAL VOC 2007 and PASCAL VOC 2012. Specifically:

* Table 1 uses PASCAL VOC (no specific year mentioned)
* Table 2 reports results for PASCAL VOC 2007 test set
* Table 3 reports results for PASCAL VOC 2012 test set

---

Determine whether the Model Answer states the same factual claim(s) as the Expected Answer.

Rules:
- If any number, count, or entity differs, answer false.
- If the Expected Answer is contradicted, answer false.
- Do NOT be generous.
- Do NOT infer or reinterpret.
- Ignore wording only when the facts clearly match.

Respond with exactly one word: true or false.

Response: true
True


## Question sets to test

In [186]:
CANT_HURT_ME_TESTS = [
    {
        "question": "How many hell weeks did Goggins do?",
        "expected": "He did three Hell Weeks",
    },
    {
        "question": "What was David Goggins' max weight?",
        "expected": "297 pounds",
    },
    {
        "question": "Before losing weight and training, which military unit was Goggins trying to join?",
        "expected": "The Navy SEALs",
    },
    {
        "question": "After serving as a SEAL, which elite Army unit did Goggins consider attempting to join?",
        "expected": "Delta Force",
    },
    {
        "question": "How many times did Goggins fail his pull-up record attempt?",
        "expected": "He failed twice",
    },
    {
        "question": "What was the previous pull-up world record?",
        "expected": "4000 pull-ups",
    },
    {
        "question": "What pull-up record did Goggins set?",
        "expected": "4030 pull-ups",
    },
    {
        "question": "What races did Goggins run before Badwater 135?",
        "expected": "San Diego One Day and Hurt 100",
    },
    {
        "question": "At what mile in the San Diego One Day race did Goggins soil himself?",
        "expected": "Mile 70",
    },
]

In [187]:
FASTER_RCNN_TESTS = [
    {
        "question": "What is a RPN?",
        "expected": "A fully convolutional network that predicts object bounding boxes and objectness scores from shared feature maps to generate region proposals",
    },
    {
        "question": "What dataset was used?",
        "expected": "PASCAL VOC",
    },
    {
        "question": "What task does Faster R-CNN perform?",
        "expected": "Object detection",
    },
    {
        "question": "What does Faster R-CNN improve over earlier R-CNN variants?",
        "expected": "It replaces external region proposal methods with a learned Region Proposal Network for end-to-end training",
    },
    {
        "question": "What is the role of anchors in Faster R-CNN?",
        "expected": "They are predefined boxes of different scales and aspect ratios used to propose candidate object regions",
    },
    {
        "question": "What two outputs does the RPN predict for each anchor?",
        "expected": "Bounding box offsets and objectness scores",
    },
    {
        "question": "What backbone network is commonly used in Faster R-CNN?",
        "expected": "A convolutional neural network such as VGG or ResNet",
    },
    {
        "question": "What is ROI pooling used for?",
        "expected": "To convert variable-sized region proposals into fixed-size feature maps",
    },
    {
        "question": "What loss components are used to train Faster R-CNN?",
        "expected": "Classification loss and bounding box regression loss",
    },
    {
        "question": "What does non-maximum suppression do in Faster R-CNN?",
        "expected": "It removes highly overlapping bounding boxes, keeping only the highest scoring ones",
    },
    {
        "question": "What is the purpose of sharing convolutional features between the RPN and detection network?",
        "expected": "To reduce computation and enable joint optimization",
    },
    {
        "question": "How are positive anchors defined during training?",
        "expected": "Anchors with high intersection-over-union overlap with a ground-truth box",
    },
    {
        "question": "What is the main output of Faster R-CNN at inference time?",
        "expected": "Class labels and refined bounding boxes for detected objects",
    },
]

In [188]:
def run_test_set(test_set_name: str):
    if test_set_name == "cant_hurt_me":
        tests = CANT_HURT_ME_TESTS
    elif test_set_name == "faster_rcnn":
        tests = FASTER_RCNN_TESTS
    else:
        raise ValueError("Unknown test set. Use 'cant_hurt_me' or 'faster_rcnn'.")

    print(f"\nRunning tests for: {test_set_name}\n")

    passed = 0

    for i, t in enumerate(tests, 1):
        print(f"Test {i}: {t['question']}")
        ok = query_and_validate(
            question=t["question"],
            expected_response=t["expected"],
        )

        if ok:
            print("PASS\n")
            passed += 1
        else:
            print("FAIL\n")

    print(f"Summary: {passed}/{len(tests)} passed.")
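
The runner above prints its results, which makes it awkward to compare runs programmatically. A minimal sketch of a variant that returns the results instead is shown below. The names `run_tests` and `validate` are hypothetical and not part of the notebook; `validate` is assumed to be any callable that takes a question and an expected answer and returns a bool, so the real `query_and_validate` can be wrapped in a lambda.

```python
from typing import Callable


def run_tests(tests: list[dict], validate: Callable[[str, str], bool]) -> dict:
    """Run every test through `validate` and return a summary instead of printing it."""
    failed = []
    for t in tests:
        # `validate` is a stand-in for the notebook's query_and_validate (assumption).
        if not validate(t["question"], t["expected"]):
            failed.append(t["question"])

    total = len(tests)
    passed = total - len(failed)
    return {
        "passed": passed,
        "total": total,
        "failed": failed,
        "pass_rate": passed / total if total else 0.0,
    }


# Usage with a fake validator, so the sketch runs without the LLM or the vector store.
demo_tests = [
    {"question": "q1", "expected": "a"},
    {"question": "q2", "expected": "b"},
]
summary = run_tests(demo_tests, lambda question, expected: expected == "a")
print(summary["passed"], "of", summary["total"], "passed; failed:", summary["failed"])
```

In the notebook itself, the call might look like `run_tests(CANT_HURT_ME_TESTS, lambda q, e: query_and_validate(question=q, expected_response=e))`.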

In [189]:
run_test_set("cant_hurt_me")


Running tests for: cant_hurt_me

Test 1: How many hell weeks did Goggins do?



Expected Answer:
He did three Hell Weeks

Model Answer:
Based on the context, Goggins has done at least 2 Hell Weeks, as mentioned in the sentence:

"After surviving two Hell Weeks and participating in three..."

So, he has either participated or survived a total of 5 Hell Weeks.

---

Determine whether the Model Answer states the same factual claim(s) as the Expected Answer.

Rules:
- If any number, count, or entity differs, answer false.
- If the Expected Answer is contradicted, answer false.
- Do NOT be generous.
- Do NOT infer or reinterpret.
- Ignore wording only when the facts clearly match.

Respond with exactly one word: true or false.

Response: true
True
PASS

Test 2: What was David Goggins' max weight?



Expected Answer:
297 pounds

Model Answer:
According to the text, David Goggins weighed:

* 255 pounds in his last days in the Air Force
* Nearly 300 pounds after he continued to bulk up after his discharge.

---

Determine whether the Model Answer states the same factual claim(s) as the Expected Answer.

Rules:
- If any number, count, or entity differs, answer false.
- If the Expected Answer is contradicted, answer false.
- Do NOT be generous.
- Do NOT infer or reinterpret.
- Ignore wording only when the facts clearly match.

Respond with exactly one word: true or false.

Response: true.
True
PASS

Test 3: Before losing weight and training, which military unit was Goggins trying to join?



Expected Answer:
The Navy SEALs

Model Answer:
Before losing weight and training, David Goggins was trying to join DEVGRU (a Navy SEAL unit), specifically Green Team, their training program. He had been approved by SEAL Team Five brass to screen for Green Team, but he had yet to attend Army Ranger School before doing so.

---

Determine whether the Model Answer states the same factual claim(s) as the Expected Answer.

Rules:
- If any number, count, or entity differs, answer false.
- If the Expected Answer is contradicted, answer false.
- Do NOT be generous.
- Do NOT infer or reinterpret.
- Ignore wording only when the facts clearly match.

Respond with exactly one word: true or false.

Response: false. 

the model answer refers to "devgru (a navy seal unit)" and specifically mentions "green team", whereas the expected answer only mentions "the navy seals". the entities differ in scope and specificity.
False
FAIL

Test 4: After serving as a SEAL, which elite Army unit did Gogg


Expected Answer:
Delta Force

Model Answer:
Delta Force

---

Determine whether the Model Answer states the same factual claim(s) as the Expected Answer.

Rules:
- If any number, count, or entity differs, answer false.
- If the Expected Answer is contradicted, answer false.
- Do NOT be generous.
- Do NOT infer or reinterpret.
- Ignore wording only when the facts clearly match.

Respond with exactly one word: true or false.

Response: true
True
PASS

Test 5: How many times did Goggins fail his pull-up record attempt?



Expected Answer:
He failed twice

Model Answer:
According to the text, Goggins failed his pull-up record attempt twice before finally breaking the record with 4,030 pull-ups in 17 hours.

---

Determine whether the Model Answer states the same factual claim(s) as the Expected Answer.

Rules:
- If any number, count, or entity differs, answer false.
- If the Expected Answer is contradicted, answer false.
- Do NOT be generous.
- Do NOT infer or reinterpret.
- Ignore wording only when the facts clearly match.

Respond with exactly one word: true or false.

Response: true
True
PASS

Test 6: What was the previous pull-up world record?



Expected Answer:
4000 pull-ups

Model Answer:
According to the text, the author's goal was to break Stephen Hyland's record of 4,020 pull-ups in a twenty-four hour period, but there is no information provided about what the previous record was before that. However, it does mention that "after my second failure" and notes that the author was still over 800 pull-ups away from the target of 4,020, suggesting that Stephen Hyland's record may have been set previously or was a known standard at the time of the author's attempt.

---

Determine whether the Model Answer states the same factual claim(s) as the Expected Answer.

Rules:
- If any number, count, or entity differs, answer false.
- If the Expected Answer is contradicted, answer false.
- Do NOT be generous.
- Do NOT infer or reinterpret.
- Ignore wording only when the facts clearly match.

Respond with exactly one word: true or false.

Response: true.
True
PASS

Test 7: What pull-up record did Goggins set?



Expected Answer:
4030 pull-ups

Model Answer:
Goggins set the 24-hour pull-up record of 4,030.

---

Determine whether the Model Answer states the same factual claim(s) as the Expected Answer.

Rules:
- If any number, count, or entity differs, answer false.
- If the Expected Answer is contradicted, answer false.
- Do NOT be generous.
- Do NOT infer or reinterpret.
- Ignore wording only when the facts clearly match.

Respond with exactly one word: true or false.

Response: true.
True
PASS

Test 8: What races did Goggins run before Badwater 135?



Expected Answer:
San Diego One Day and Hurt 100

Model Answer:
Unfortunately, the text doesn't explicitly state what specific races Goggins ran before Badwater 135. However, it does mention that he had previously watched Scott Jurek win the 2006 edition of Badwater and that he was inspired to raise money for the Special Operations Warrior Foundation by doing an endurance event, which ultimately led him to decide to run Badwater 135.

The text also mentions "Hell Week" as a reference point when discussing Goggins' reaction to seeing images from Badwater. Given the context of the story and Hell Week's notorious reputation in Navy SEAL training, it can be inferred that Goggins was likely involved in military or law enforcement training before attempting Badwater 135.

Additionally, earlier in his career, he mentions running marathons, which were previously considered the pinnacle of endurance racing.

---

Determine whether the Model Answer states the same factual claim(s) as the Expecte


Expected Answer:
Mile 70

Model Answer:
There is no mention of Goggins soiling himself at any point in the text. The narrative does describe Goggins' physical suffering and his severe dehydration, but it does not include an incident where he soils himself.

---

Determine whether the Model Answer states the same factual claim(s) as the Expected Answer.

Rules:
- If any number, count, or entity differs, answer false.
- If the Expected Answer is contradicted, answer false.
- Do NOT be generous.
- Do NOT infer or reinterpret.
- Ignore wording only when the facts clearly match.

Respond with exactly one word: true or false.

Response: false
False
FAIL

Summary: 6/9 passed.
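
In the run above, Test 6 is marked PASS even though the expected answer is 4000 pull-ups and the model answer cites 4,020, which the judge rule "If any number, count, or entity differs, answer false" should have rejected. One possible safeguard, sketched below under stated assumptions, is a deterministic pre-check that every integer in the expected answer also appears in the model answer before the LLM judge is consulted. The helpers `extract_numbers` and `numbers_match` are hypothetical names, not part of the notebook, and the check ignores spelled-out numbers such as "three" and treats decimals as separate integers.

```python
import re


def extract_numbers(text: str) -> set[str]:
    """Collect integers from text, dropping thousands separators ("4,030" -> "4030")."""
    return {m.replace(",", "") for m in re.findall(r"\d[\d,]*", text)}


def numbers_match(expected: str, answer: str) -> bool:
    """True when every integer in the expected answer also occurs in the model answer.

    An expected answer without digits yields an empty set, so the check passes
    and the decision is left entirely to the LLM judge.
    """
    return extract_numbers(expected) <= extract_numbers(answer)


# Examples built from the expected answers and model answers shown in this notebook.
print(numbers_match("4030 pull-ups", "Goggins set the 24-hour pull-up record of 4,030."))
print(numbers_match("4000 pull-ups", "break Stephen Hyland's record of 4,020 pull-ups"))
```

A check like this can only veto a PASS; it cannot confirm one, because matching digits say nothing about whether the surrounding claim is the same.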


In [197]:
run_test_set("faster_rcnn")


Running tests for: faster_rcnn

Test 1: What is a RPN?



Expected Answer:
A fully convolutional network that predicts object bounding boxes and objectness scores from shared feature maps to generate region proposals

Model Answer:
A Region Proposal Network (RPN) is a fully-convolutional network that simultaneously predicts object bounds and objectness scores for potential objects in an image.

---

Determine whether the Model Answer states the same factual claim(s) as the Expected Answer.

Rules:
- If any number, count, or entity differs, answer false.
- If the Expected Answer is contradicted, answer false.
- Do NOT be generous.
- Do NOT infer or reinterpret.
- Ignore wording only when the facts clearly match.

Respond with exactly one word: true or false.

Response: true.
True
PASS

Test 2: What dataset was used?



Expected Answer:
PASCAL VOC

Model Answer:
The datasets mentioned are:

1. PASCAL VOC 2007 test set
2. VOC 2012 trainval
3. VOC 2007 trainval
4. Union set of VOC 2007 trainval and VOC 2012 trainval (denoted as "07+12" or "07++12")

---

Determine whether the Model Answer states the same factual claim(s) as the Expected Answer.

Rules:
- If any number, count, or entity differs, answer false.
- If the Expected Answer is contradicted, answer false.
- Do NOT be generous.
- Do NOT infer or reinterpret.
- Ignore wording only when the facts clearly match.

Respond with exactly one word: true or false.

Response: false.
False
FAIL

Test 3: What task does Faster R-CNN perform?



Expected Answer:
Object detection

Model Answer:
Faster R-CNN performs object detection, including:

* Hypothesizing object locations (using a Region Proposal Network, or RPN)
* Predicting class-specific scores and regressing box locations for detected objects.

---

Determine whether the Model Answer states the same factual claim(s) as the Expected Answer.

Rules:
- If any number, count, or entity differs, answer false.
- If the Expected Answer is contradicted, answer false.
- Do NOT be generous.
- Do NOT infer or reinterpret.
- Ignore wording only when the facts clearly match.

Respond with exactly one word: true or false.

Response: true.
True
PASS

Test 4: What does Faster R-CNN improve over earlier R-CNN variants?



Expected Answer:
It replaces external region proposal methods with a learned Region Proposal Network for end-to-end training

Model Answer:
Based on the provided context, Faster R-CNN (which uses a Region Proposal Network, or RPN) improves over earlier R-CNN variants by providing "nearly cost-free region proposals" and achieving better accuracy. Specifically, it is mentioned that Faster R-CNN with an RPN has an mAP of 73.2%, which is higher than the 68.4% achieved by SS on the union set of VOC 2007 trainval+test and VOC 2012 trainval. Additionally, Faster R-CNN reduces the running time of object detection systems compared to earlier methods that use sliding windows.

---

Determine whether the Model Answer states the same factual claim(s) as the Expected Answer.

Rules:
- If any number, count, or entity differs, answer false.
- If the Expected Answer is contradicted, answer false.
- Do NOT be generous.
- Do NOT infer or reinterpret.
- Ignore wording only when the facts clearly match.



Expected Answer:
They are predefined boxes of different scales and aspect ratios used to propose candidate object regions

Model Answer:
Actually, the context does not explicitly mention "Faster R-CNN", but it does talk about Region Proposal Networks (RPN) and Fast R-CNN. According to the text, the number of anchor locations is mentioned as being approximately 2,400 (i.e., Nreg ∼ 2,400).

---

Determine whether the Model Answer states the same factual claim(s) as the Expected Answer.

Rules:
- If any number, count, or entity differs, answer false.
- If the Expected Answer is contradicted, answer false.
- Do NOT be generous.
- Do NOT infer or reinterpret.
- Ignore wording only when the facts clearly match.

Respond with exactly one word: true or false.

Response: false
False
FAIL

Test 6: What two outputs does the RPN predict for each anchor?



Expected Answer:
Bounding box offsets and objectness scores

Model Answer:
The RPN predicts two outputs for each anchor:

1. A binary class label (object vs not object) represented by pi, where pi is the predicted probability of anchor i being an object.
2. The 4 parameterized coordinates of the predicted bounding box ti, which represents the location and size of the proposed bounding box.

---

Determine whether the Model Answer states the same factual claim(s) as the Expected Answer.

Rules:
- If any number, count, or entity differs, answer false.
- If the Expected Answer is contradicted, answer false.
- Do NOT be generous.
- Do NOT infer or reinterpret.
- Ignore wording only when the facts clearly match.

Respond with exactly one word: true or false.

Response: true
True
PASS

Test 7: What backbone network is commonly used in Faster R-CNN?



Expected Answer:
A convolutional neural network such as VGG or ResNet

Model Answer:
The text does not explicitly mention that a specific backbone network is commonly used in Faster R-CNN. However, it mentions "ZF" and "VGG nets" as networks that are tested with single-scale feature extraction.

Upon further review of the paper's abstract, it appears to imply that the VGG-16 model (a variant of VGGNet) is being used for object detection in the Faster R-CNN system.

---

Determine whether the Model Answer states the same factual claim(s) as the Expected Answer.

Rules:
- If any number, count, or entity differs, answer false.
- If the Expected Answer is contradicted, answer false.
- Do NOT be generous.
- Do NOT infer or reinterpret.
- Ignore wording only when the facts clearly match.

Respond with exactly one word: true or false.

Response: true
True
PASS

Test 8: What is ROI pooling used for?



Expected Answer:
To convert variable-sized region proposals into fixed-size feature maps

Model Answer:
The provided text does not explicitly mention "ROI pooling" by name, but it does discuss a similar concept called "Spatial Pyramid Pooling (SPP)" which is used in deep convolutional networks for visual recognition. However, since Fast R-CNN and Region Proposal Networks are mentioned, it can be inferred that ROI (Region of Interest) pooling is related to the extraction of features from objects or regions.

ROI pooling is a technique used to extract fixed-size feature maps from variable-sized regions or objects in an image, allowing for the use of shared convolutions across different sized inputs. This is relevant to object detection and classification tasks.

Given the context, it can be assumed that ROI pooling is used for efficient region-based object detection by extracting features from regions of interest in the image.

---

Determine whether the Model Answer states the same fac


Expected Answer:
Classification loss and bounding box regression loss

Model Answer:
The text does not explicitly mention "Faster R-CNN" but it mentions training both region proposal and object detection networks. According to the context, these networks use two types of loss terms:

1. cls (classification) term
2. reg (regression) term

These terms are roughly equally weighted during training.

---

Determine whether the Model Answer states the same factual claim(s) as the Expected Answer.

Rules:
- If any number, count, or entity differs, answer false.
- If the Expected Answer is contradicted, answer false.
- Do NOT be generous.
- Do NOT infer or reinterpret.
- Ignore wording only when the facts clearly match.

Respond with exactly one word: true or false.

Response: true
True
PASS

Test 10: What does non-maximum suppression do in Faster R-CNN?



Expected Answer:
It removes highly overlapping bounding boxes, keeping only the highest scoring ones

Model Answer:
Non-maximum suppression (NMS) reduces redundancy by eliminating proposal regions that have a high overlap with other proposal regions, leaving only the top-ranked proposal regions for detection. In this specific implementation, NMS uses an IoU threshold of 0.7 and leaves about 2k proposal regions per image.

---

Determine whether the Model Answer states the same factual claim(s) as the Expected Answer.

Rules:
- If any number, count, or entity differs, answer false.
- If the Expected Answer is contradicted, answer false.
- Do NOT be generous.
- Do NOT infer or reinterpret.
- Ignore wording only when the facts clearly match.

Respond with exactly one word: true or false.

Response: true.
True
PASS

Test 11: What is the purpose of sharing convolutional features between the RPN and detection network?



Expected Answer:
To reduce computation and enable joint optimization

Model Answer:
The purpose of sharing convolutional features between the Region Proposal Network (RPN) and detection network is to improve the performance of both networks by allowing them to learn from each other's features. Specifically, it is mentioned that when the two networks share conv layers, "the proposal quality is improved" in the third step of the 4-step training process. This suggests that sharing features enables the RPN to generate better proposals for the detection network, which in turn improves the detection performance.

---

Determine whether the Model Answer states the same factual claim(s) as the Expected Answer.

Rules:
- If any number, count, or entity differs, answer false.
- If the Expected Answer is contradicted, answer false.
- Do NOT be generous.
- Do NOT infer or reinterpret.
- Ignore wording only when the facts clearly match.

Respond with exactly one word: true or false.

Response


Expected Answer:
Anchors with high intersection-over-union overlap with a ground-truth box

Model Answer:
During training, a positive anchor is defined as either:

(i) an anchor that has an IoU overlap higher than 0.7 with any ground-truth box, or
(ii) an anchor that has the highest Intersection-Over-Union (IoU) overlap with a ground-truth box.

In other words, anchors are assigned a positive label if they have a high enough IoU overlap with any of the ground-truth boxes in the image.

---

Determine whether the Model Answer states the same factual claim(s) as the Expected Answer.

Rules:
- If any number, count, or entity differs, answer false.
- If the Expected Answer is contradicted, answer false.
- Do NOT be generous.
- Do NOT infer or reinterpret.
- Ignore wording only when the facts clearly match.

Respond with exactly one word: true or false.

Response: true.
True
PASS

Test 13: What is the main output of Faster R-CNN at inference time?



Expected Answer:
Class labels and refined bounding boxes for detected objects

Model Answer:
According to Table 4, the "Region-wise" includes NMS, pooling, fc, and softmax. This suggests that the main output of Faster R-CNN at inference time is the class scores (output of softmax) and bounding box predictions (output of NMS and pooling), which are then combined with features extracted from a convolutional neural network to generate the final object detection results.

---

Determine whether the Model Answer states the same factual claim(s) as the Expected Answer.

Rules:
- If any number, count, or entity differs, answer false.
- If the Expected Answer is contradicted, answer false.
- Do NOT be generous.
- Do NOT infer or reinterpret.
- Ignore wording only when the facts clearly match.

Respond with exactly one word: true or false.

Response: false
False
FAIL

Summary: 10/13 passed.
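
Across both runs the judge's reply takes several shapes: `true`, `true.`, and `false.` followed by an explanation, despite the instruction to respond with exactly one word. How the notebook converts that reply into the printed `True`/`False` is not shown in this section, so the following is only a sketch of one tolerant approach: read the first word and reject anything that is neither verdict. The function name `parse_verdict` is hypothetical, and it assumes it receives the raw judge text without any display prefix.

```python
import re


def parse_verdict(raw: str) -> bool:
    """Map a judge reply to a bool using only its first word.

    Raises ValueError when the reply does not start with "true" or "false",
    so an unparseable reply is never silently counted as a pass or a fail.
    """
    match = re.match(r"\s*([A-Za-z]+)", raw)
    word = match.group(1).lower() if match else ""
    if word not in ("true", "false"):
        raise ValueError(f"Unrecognised judge reply: {raw!r}")
    return word == "true"


# Reply shapes taken from the test output in this notebook.
for reply in ["true", "true.", "false. \n\nthe model answer refers to ..."]:
    print(repr(reply), "->", parse_verdict(reply))
```

Raising on an unrecognised reply, rather than defaulting to `False`, keeps judge formatting problems visible instead of folding them into the failure count.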


# Adding chat history

In [158]:
CHAT_HISTORY = []

In [159]:
def reset_chat_history():
    CHAT_HISTORY.clear()
    print("Chat history reset.")

In [160]:
REWRITE_PROMPT = """
Given the chat history and the latest question, rewrite the question so it is standalone and can be understood without the history.

Chat history:
{history}

Latest question: {question}

Standalone question:
"""

In [161]:
def rewrite_query_with_history(query: str, history: list):
    if not history:
        return query

    model = OllamaLLM(model="llama3.1")

    history_text = "\n".join(f"{role}: {msg}" for role, msg in history[-6:])

    prompt = REWRITE_PROMPT.format(history=history_text,question=query,)
    #print(prompt)
    
    rewritten = model.invoke(prompt)
    return rewritten.strip()
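
To make the history window concrete, here is a minimal sketch of how the last six `(role, message)` tuples are flattened into the `{history}` slot of the prompt. `format_history` and `max_messages` are hypothetical names introduced for illustration; the notebook inlines this expression rather than defining a helper.

```python
def format_history(history, max_messages=6):
    """Join the most recent (role, message) pairs into 'role: message' lines."""
    recent = history[-max_messages:]  # slicing an empty or short list is safe
    return "\n".join(f"{role}: {msg}" for role, msg in recent)


history = [
    ("user", "What is an RPN?"),
    ("assistant", "A Region Proposal Network."),
    ("user", "How does it differ from Fast R-CNN?"),
]
print(format_history(history))
```

Because each exchange appends two tuples (user and assistant), a window of six messages covers the last three exchanges.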

In [162]:
Base_PROMPT_2  = """
Chat history:
{history}

Context:
{context}

Answer the question based on the above context

Question: {question}
"""

query_rag(query_text, history)  
    → rewrite question using history  
    → embed rewritten query  
    → similarity_search  
    → build prompt with context + history  
    → LLM  
    → update history  

### Old but working function: no reranker, only top-k (k=4) similarity search

In [141]:
# def query_rag_hist(query_text: str, history: list):
#     emb_fxn = get_embeddings_function()
#     db = Chroma(persist_directory=CHROMA_DIR,embedding_function = emb_fxn)

#     standalone_query = rewrite_query_with_history(query_text, history)
    
#     results = db.similarity_search_with_score(standalone_query, k=4)

#     context = "\n\n---\n\n".join([doc.page_content for doc, _score in results])

#     history_text = "\n".join(f"{role}: {msg}" for role, msg in history[-6:])
    
#     prompt_template = ChatPromptTemplate.from_template(Base_PROMPT_2
#                                                       )
#     prompt = prompt_template.format(context=context, question=query_text, history=history_text,)
#     #print(prompt)

#     model = OllamaLLM(model = 'llama3.1')
#     response_text = model.invoke(prompt)

#     history.append(("user", query_text))
#     history.append(("assistant", response_text))

#     sources = [doc.metadata.get('id', None) for doc, _source in results]
    
#     print("Standalone query:", standalone_query)
    
#     print("\nResponse: ",response_text)
#     print("\nSources:", sources)

#     return response_text

### New function using the reranker

In [311]:
def query_rag_hist(query_text: str, history: list, return_context=False, return_sources=False):
    standalone_query = rewrite_query_with_history(query_text, history)
    
    results = db.similarity_search_with_score(standalone_query, k=10)
    reranked_docs = rerank(standalone_query, results)

    context = "\n\n---\n\n".join([doc.page_content for doc in reranked_docs])
    
    history_text = "\n".join(f"{role}: {msg}" for role, msg in history[-6:])
    prompt_template = ChatPromptTemplate.from_template(Base_PROMPT_2)
    prompt = prompt_template.format(context=context, question=query_text, history=history_text,)
    #print(prompt)

    response_text = model.invoke(prompt)
    history.append(("user", query_text))
    history.append(("assistant", response_text))
    sources = [doc.metadata.get('id', None) for doc in reranked_docs]

    #print("Standalone query:", standalone_query)
    #print("\nResponse: ",response_text)
    #print("\nSources:", sources)
    if return_context and return_sources:
        return response_text, context, sources
    
    if return_sources:
        return response_text, sources
    
    if return_context:
        return response_text, context
    
    return response_text

In [180]:
query_rag_hist("What is an RPN?", CHAT_HISTORY)

Standalone query: What is an RPN?

Response:  An RPN (Region Proposal Network) is a fully-convolutional network that simultaneously predicts object bounds and ratios at a location. It shares full-image convolutional features with the detection network, allowing for nearly cost-free region proposals.

Sources: ['C:\\Users\\Archit\\Documents\\ML Projects\\RAG-Based-PDF-QA-System\\Data\\Faster-RCNN-Paper.pdf:0:0', 'C:\\Users\\Archit\\Documents\\ML Projects\\RAG-Based-PDF-QA-System\\Data\\Faster-RCNN-Paper.pdf:1:1', 'C:\\Users\\Archit\\Documents\\ML Projects\\RAG-Based-PDF-QA-System\\Data\\Faster-RCNN-Paper.pdf:3:3', 'C:\\Users\\Archit\\Documents\\ML Projects\\RAG-Based-PDF-QA-System\\Data\\Faster-RCNN-Paper.pdf:5:3']


'An RPN (Region Proposal Network) is a fully-convolutional network that simultaneously predicts object bounds and ratios at a location. It shares full-image convolutional features with the detection network, allowing for nearly cost-free region proposals.'

In [181]:
CHAT_HISTORY

[('user', 'What is an RPN?'),
 ('assistant',
  'An RPN (Region Proposal Network) is a fully-convolutional network that simultaneously predicts object bounds and ratios at a location. It shares full-image convolutional features with the detection network, allowing for nearly cost-free region proposals.')]

In [182]:
query_rag_hist("How does it differ from Fast R-CNN?", CHAT_HISTORY)

Standalone query: How does an RPN (Region Proposal Network) differ from Fast R-CNN?

Response:  Based on the provided context, it appears that the Region Proposal Network (RPN) is different from Fast R-CNN in several ways:

1. **Purpose**: The primary purpose of RPN is to generate region proposals, whereas Fast R-CNN uses these proposals for object detection.
2. **Shared features**: RPN shares full-image convolutional features with the detection network, enabling nearly cost-free region proposals, while Fast R-CNN relies on separate region proposal algorithms.
3. **Training approach**: The training process involves alternating optimization between the RPN and Fast R-CNN to learn shared convolutional features.

However, it is worth noting that both RPN and Fast R-CNN can be combined using a 4-step training algorithm (alternating optimization) to share conv layers, enabling state-of-the-art object detection accuracy with fast inference.

Sources: ['C:\\Users\\Archit\\Documents\\ML Projec

'Based on the provided context, it appears that the Region Proposal Network (RPN) is different from Fast R-CNN in several ways:\n\n1. **Purpose**: The primary purpose of RPN is to generate region proposals, whereas Fast R-CNN uses these proposals for object detection.\n2. **Shared features**: RPN shares full-image convolutional features with the detection network, enabling nearly cost-free region proposals, while Fast R-CNN relies on separate region proposal algorithms.\n3. **Training approach**: The training process involves alternating optimization between the RPN and Fast R-CNN to learn shared convolutional features.\n\nHowever, it is worth noting that both RPN and Fast R-CNN can be combined using a 4-step training algorithm (alternating optimization) to share conv layers, enabling state-of-the-art object detection accuracy with fast inference.'

In [183]:
CHAT_HISTORY

[('user', 'What is an RPN?'),
 ('assistant',
  'An RPN (Region Proposal Network) is a fully-convolutional network that simultaneously predicts object bounds and ratios at a location. It shares full-image convolutional features with the detection network, allowing for nearly cost-free region proposals.'),
 ('user', 'How does it differ from Fast R-CNN?'),
 ('assistant',
  'Based on the provided context, it appears that the Region Proposal Network (RPN) is different from Fast R-CNN in several ways:\n\n1. **Purpose**: The primary purpose of RPN is to generate region proposals, whereas Fast R-CNN uses these proposals for object detection.\n2. **Shared features**: RPN shares full-image convolutional features with the detection network, enabling nearly cost-free region proposals, while Fast R-CNN relies on separate region proposal algorithms.\n3. **Training approach**: The training process involves alternating optimization between the RPN and Fast R-CNN to learn shared convolutional features.

In [184]:
reset_chat_history()

Chat history reset.


In [185]:
CHAT_HISTORY

[]

# Evaluation Part

In [81]:
emb = get_embeddings_function()
db = Chroma(persist_directory=str(CHROMA_DIR), embedding_function=emb)

Loading weights:   0%|          | 0/391 [00:00<?, ?it/s]

BertModel LOAD REPORT from: BAAI/bge-large-en-v1.5
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


### This cell was run 30 times to build the JSON files containing the question, expected answer, and gold pages

In [101]:
q = "What phrase does Goggins use to describe the mental callous formed through suffering?"
results = db.similarity_search_with_score(q, k=8)
for doc, score in results:
    print(q)
    print(doc.page_content[:400])
    print(doc.metadata["id"])
    print("-" * 90)

What phrase does Goggins use to describe the mental callous formed through suffering?
floods our soul, and influences the decisions which determine our character.
My fears were never just about the water, and my anxieties toward Class 235
weren’t about the pain of First Phase. They were seeping from the infected
wounds I’d been walking around with my entire life, and my denial of them
amounted to a denial of myself. I was my own worst enemy! It wasn’t the
world, or God, or the Devi
C:\Users\Archit\Documents\ML Projects\RAG-Based-PDF-QA-System\Data\Can_t-Hurt-Me-David-Goggins.pdf:120:1
------------------------------------------------------------------------------------------
What phrase does Goggins use to describe the mental callous formed through suffering?
pool. I didn’t want to say anything because I didn’t yet understand what I
now know.
Similar to using an opponent’s energy to gain an advantage, leaning on your
calloused mind in the heat of battle can shift your thinki

## Loading JSON files

In [135]:
import json

faster_path = r"C:\Users\Archit\Documents\ML Projects\RAG-Based-PDF-QA-System\Eval\faster_rcnn.json"
goggins_path = r"C:\Users\Archit\Documents\ML Projects\RAG-Based-PDF-QA-System\Eval\cant_hurt_me.json"

with open(faster_path, "r") as f:
    faster_data = json.load(f)

with open(goggins_path, "r") as f:
    goggins_data = json.load(f)

In [136]:
test_data = faster_data + goggins_data
print("Total Questions:", len(test_data))

Total Questions: 30


## Retrieval Evaluation

In [137]:
import os

def normalize_id(raw_id):
    # Remove full path
    filename_with_pages = os.path.basename(raw_id)

    # Split on colon
    parts = filename_with_pages.split(":")

    # Keep filename + page number only
    if len(parts) >= 2:
        return f"{parts[0]}:{parts[1]}"
    else:
        return filename_with_pages

In [138]:
normalize_id('C:\\Users\\Archit\\Documents\\ML Projects\\RAG-Based-PDF-QA-System\\Data\\Faster-RCNN-Paper.pdf:5:1')

'Faster-RCNN-Paper.pdf:5'
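
`os.path.basename` only treats backslashes as separators when the notebook runs on Windows. If the same chunk IDs were evaluated on Linux or macOS, the Windows-style path would not be stripped. A possible portable variant, assuming IDs keep the `path:page:chunk` layout seen above, uses the standard-library `ntpath` module, which understands both separators. `normalize_id_portable` is a hypothetical name, not a function from this notebook.

```python
import ntpath


def normalize_id_portable(raw_id: str) -> str:
    """Reduce 'path:page:chunk' to 'filename:page' regardless of the host OS."""
    # ntpath.basename splits on both '\\' and '/' and ignores the drive prefix.
    filename_with_pages = ntpath.basename(raw_id)
    parts = filename_with_pages.split(":")
    if len(parts) >= 2:
        return f"{parts[0]}:{parts[1]}"
    return filename_with_pages


raw = "C:\\Users\\Archit\\Documents\\ML Projects\\RAG-Based-PDF-QA-System\\Data\\Faster-RCNN-Paper.pdf:5:1"
print(normalize_id_portable(raw))
```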

In [187]:
k = 4
total = len(test_data)
hits = 0

for item in test_data:
    question = item["question"]
    gold_pages = set(item["gold_pages"])

    # Step 1: retrieve more
    initial_results = db.similarity_search_with_score(question, k=10)

    # Step 2: rerank and take top k
    reranked_docs = rerank(question, initial_results, top_n=k)

    retrieved_ids = set(
        doc.metadata["id"] for doc in reranked_docs
    )

    normalized_retrieved = set(
        normalize_id(r) for r in retrieved_ids
    )

    if gold_pages & normalized_retrieved:
        hits += 1
    else:
        print("\nFAILED:")
        print("Q:", question)
        print("Gold:", gold_pages)
        print("Retrieved (normalized):", normalized_retrieved)

recall_at_k = hits / total

print("\n==========")
print(f"Total Questions: {total}")
print(f"PASSED {hits}")
print(f"FAILED {total - hits}")
print(f"Recall@{k}: {recall_at_k:.2f}")


FAILED:
Q: Before losing weight and training, which military unit was Goggins trying to join?
Gold: {'Can_t-Hurt-Me-David-Goggins.pdf:73'}
Retrieved (normalized): {'Can_t-Hurt-Me-David-Goggins.pdf:137', 'Can_t-Hurt-Me-David-Goggins.pdf:225', 'Can_t-Hurt-Me-David-Goggins.pdf:221', 'Can_t-Hurt-Me-David-Goggins.pdf:227'}

FAILED:
Q: After serving as a SEAL, which elite Army unit did Goggins consider attempting to join?
Gold: {'Can_t-Hurt-Me-David-Goggins.pdf:244'}
Retrieved (normalized): {'Can_t-Hurt-Me-David-Goggins.pdf:302', 'Can_t-Hurt-Me-David-Goggins.pdf:141', 'Can_t-Hurt-Me-David-Goggins.pdf:227'}

FAILED:
Q: What job did Goggins hold before pursuing the Navy SEALs?
Gold: {'Can_t-Hurt-Me-David-Goggins.pdf:66'}
Retrieved (normalized): {'Can_t-Hurt-Me-David-Goggins.pdf:198', 'Can_t-Hurt-Me-David-Goggins.pdf:302', 'Can_t-Hurt-Me-David-Goggins.pdf:225', 'Can_t-Hurt-Me-David-Goggins.pdf:72'}

FAILED:
Q: What major ultramarathon did Goggins finish despite severe kidney failure and dehydr

In [190]:
k_values = [4, 6, 8, 10]
total = len(test_data)

hits = {k: 0 for k in k_values}

for item in test_data:
    question = item["question"]
    gold_pages = set(item["gold_pages"])

    # Step 1: Retrieve once
    initial_results = db.similarity_search_with_score(question, k=10)

    # Step 2: Rerank once (max needed = max(k_values) = 10)
    reranked_docs = rerank(question, initial_results, top_n=max(k_values))

    # Normalize once
    normalized_retrieved = [
        normalize_id(doc.metadata["id"]) 
        for doc in reranked_docs
    ]

    # Step 3: Check each k
    for k in k_values:
        top_k = set(normalized_retrieved[:k])
        if gold_pages & top_k:
            hits[k] += 1

print("\n==========")
print(f"Total Questions: {total}")

for k in k_values:
    recall = hits[k] / total
    print(f"Recall@{k}: {recall:.2f}")


Total Questions: 30
Recall@4: 0.83
Recall@6: 0.87
Recall@8: 0.90
Recall@10: 0.97
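
The metric computed above counts a question as a hit when at least one gold page appears among its top-k retrieved pages. A self-contained sketch on toy data shows the calculation without the vector store or reranker. The page IDs below are made up for illustration, and `hit_rate_at_k` is a hypothetical helper.

```python
def hit_rate_at_k(ranked_pages_per_question, gold_pages_per_question, k_values):
    """Fraction of questions with at least one gold page in the top-k results."""
    total = len(gold_pages_per_question)
    hits = {k: 0 for k in k_values}
    for ranked, gold in zip(ranked_pages_per_question, gold_pages_per_question):
        for k in k_values:
            if set(gold) & set(ranked[:k]):
                hits[k] += 1
    return {k: hits[k] / total for k in k_values}


# Toy data: two questions, ranked page IDs already normalized to 'file:page'.
ranked = [
    ["a.pdf:3", "a.pdf:7", "a.pdf:1"],
    ["b.pdf:9", "b.pdf:2", "b.pdf:5"],
]
gold = [{"a.pdf:7"}, {"b.pdf:5"}]
print(hit_rate_at_k(ranked, gold, k_values=[1, 2, 3]))
```

When a question has several gold pages, this measures a per-question hit rate rather than the share of gold pages recovered.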


## Calculating Accuracy

In [193]:
print(test_data[0].keys())

dict_keys(['question', 'expected', 'gold_pages'])


In [209]:
judge_model = OllamaLLM(model="llama3.1")
def judge_answer(question, expected, predicted):
    prompt = f"""
You are evaluating a RAG system.

Question:
{question}

Expected Answer:
{expected}

Model Answer:
{predicted}

Is the model answer correct and consistent with the expected answer?

Respond with only one word:
CORRECT
or
INCORRECT
"""
    result = judge_model.invoke(prompt).strip().upper()
    return result == "CORRECT"
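
The exact-match check `result == "CORRECT"` treats any extra token, such as a trailing period or a short explanation, as a failure. One possible hardening, sketched below with a hypothetical `parse_verdict` helper, looks only at the first word of the reply. A plain substring test would not work here, because `"INCORRECT"` contains `"CORRECT"`.

```python
import re


def parse_verdict(raw: str) -> bool:
    """Return True only if the first word of the judge reply is CORRECT."""
    words = re.findall(r"[A-Za-z]+", raw.upper())
    return bool(words) and words[0] == "CORRECT"


for reply in ["CORRECT", "correct.", "INCORRECT", "Incorrect: wrong unit", ""]:
    print(repr(reply), parse_verdict(reply))
```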

### For Query Model

In [244]:
def accuracy_loop():
    total = len(test_data)
    correct = 0
    for item in tqdm(test_data, desc="Computing Accuracy",unit='question'):
        
        question = item['question']
        expected = item['expected']
        #predicted = item['expected']
        predicted = query_rag(question)
    
        if judge_answer(question, expected, predicted):
            correct += 1
        else:
            print('\nWRONG')
            print('Question: ', question)
            print('Expected: ', expected)
            print('Predicted: ', predicted)
    accuracy = correct / total

    print("\n==========")
    print(f"Total Questions: {total}")
    print(f"Correct: {correct}")
    print(f"Wrong: {total - correct}")
    print(f"Accuracy: {accuracy:.2f}")

    return accuracy

In [245]:
accuracy_loop()

Computing Accuracy: 100% | 30/30 [02:21<00:00,  4.72s/question]


Total Questions: 30
Correct: 30
Wrong: 0
Accuracy: 1.00





1.0

### Trial run on one question

In [200]:
question = item['question']
expected = item['expected']
predicted = query_rag(question)
predicted

'The phrase used by Goggins to describe the mental callous formed through suffering is "calloused mind".'

In [210]:
judge_answer(question, expected, predicted)

True

In [251]:
query_rag('What job did Goggins hold before pursuing the Navy SEALs?')

'recruitment staff in San Diego, where the SEALs train.'

In [246]:
# Test on a question that previously got a wrong answer; the new, stricter prompt fixed it
total = 1
correct = 0
question = 'What phrase does Goggins use to describe the mental callous formed through suffering?'
expected = 'Callousing the mind.'
#predicted = item['expected']
predicted = query_rag(question)

if judge_answer(question, expected, predicted):
    correct += 1
else:
    print('\nWRONG')
    print('Question: ', question)
    print('Expected: ', expected)
    print('Predicted: ', predicted)
accuracy = correct / total

print("\n==========")
print(f"Total Questions: {total}")
print(f"Correct: {correct}")
print(f"Wrong: {total - correct}")
print(f"Accuracy: {accuracy:.2f}")


Total Questions: 1
Correct: 1
Wrong: 0
Accuracy: 1.00


### On Conversational Model

In [247]:
def accuracy_loop_chat():
    total = len(test_data)
    correct = 0

    history = []

    for item in tqdm(test_data, desc="Computing Chat Accuracy", unit="question"):

        question = item["question"]
        expected = item["expected"]

        predicted = query_rag_hist(question, history)

        if judge_answer(question, expected, predicted):
            correct += 1
        else:
            print("\nWRONG")
            print("Question:", question)
            print("Expected:", expected)
            print("Predicted:", predicted)

    accuracy = correct / total

    print("\n==========")
    print(f"Total Questions: {total}")
    print(f"Correct: {correct}")
    print(f"Wrong: {total - correct}")
    print(f"Chat Accuracy: {accuracy:.2f}")

    return accuracy

In [250]:
accuracy_loop_chat()

Computing Chat Accuracy:  83% | 25/30 [05:45<01:15, 15.04s/question]


WRONG
Question: What job did Goggins hold before pursuing the Navy SEALs?
Expected: He worked as an exterminator.
Predicted: According to the provided text, David Goggins held the job of "car sales" before becoming a Navy SEAL. However, it's mentioned that he commuted everywhere on a bike and stopped into a Navy recruitment office because he knew he needed structure and purpose, and some warm clothes.

Here is an excerpt from the text:

"...He had a good car sales job and no car. He commuted everywhere on a rusted out ten-speed bike, literally freezing his balls off..."


Computing Chat Accuracy: 100% | 30/30 [06:44<00:00, 13.48s/question]


Total Questions: 30
Correct: 29
Wrong: 1
Chat Accuracy: 0.97





0.9666666666666667

# Hallucination Rate

In [266]:
def judge_hallucination(question, context, answer):
    prompt = f"""
You are evaluating whether an answer is grounded in the provided context.

Question:
{question}

Context:
{context}

Answer:
{answer}

If the answer is fully supported by the context, respond with YES.
If the answer includes unsupported or made-up information, respond with NO.

Respond with only YES or NO.
"""

    result = model.invoke(prompt).strip().upper()

    return result == "NO"

In [267]:
def hallucination_loop(query_function, use_history=False):

    total = len(test_data)
    hallucinations = 0

    history = []

    for item in tqdm(test_data, desc="Computing Hallucination Rate", unit="question"):

        question = item["question"]

        if use_history:
            answer, context = query_function(question, history, return_context=True)
        else:
            answer, context = query_function(question, return_context=True)

        if judge_hallucination(question, context, answer):

            hallucinations += 1

            print("\nHALLUCINATION:")
            print("Question:", question)
            print("Answer:", answer)

    rate = hallucinations / total

    print("\n==========")
    print(f"Hallucination Rate: {rate:.2f}")

    return rate

In [268]:
hallucination_loop(query_rag)

Computing Hallucination Rate:  10% | 3/30 [00:17<02:40,  5.94s/question]


HALLUCINATION:
Question: What task does Faster R-CNN perform?
Answer: perform object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and 2012 (70.4% mAP).


Computing Hallucination Rate:  27% | 8/30 [00:44<02:06,  5.74s/question]


HALLUCINATION:
Question: What is ROI pooling used for?
Answer: Adaptively-sized pooling (SPP) [7] on shared conv feature maps is proposed for efficient region-based object detection [7, 16] and semantic segmentation [2].


Computing Hallucination Rate:  53% | 16/30 [01:30<01:17,  5.54s/question]


HALLUCINATION:
Question: How many Hell Weeks did David Goggins complete?
Answer: three, I was a native speaker.


Computing Hallucination Rate:  97% | 29/30 [02:26<00:04,  4.08s/question]


HALLUCINATION:
Question: What major ultramarathon did Goggins finish despite severe kidney failure and dehydration?
Answer: None given in the context.


Computing Hallucination Rate: 100% | 30/30 [02:30<00:00,  5.01s/question]


Hallucination Rate: 0.13





0.13333333333333333

In [269]:
hallucination_loop(query_rag_hist, use_history=True)

Computing Hallucination Rate:  13% | 4/30 [00:43<04:58, 11.49s/question]


HALLUCINATION:
Question: What does Faster R-CNN improve over earlier R-CNN variants?
Answer: According to the text, Faster R-CNN improves the detection accuracy of earlier R-CNN variants, including SPPnet [7] and Fast R-CNN [5]. Specifically, it achieves state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and 2012 (70.4% mAP) using 300 proposals per image.


Computing Hallucination Rate:  53% | 16/30 [03:48<03:12, 13.78s/question]


HALLUCINATION:
Question: How many Hell Weeks did David Goggins complete?
Answer: There is no information in the provided text about David Goggins or his "Hell Weeks". The text appears to be related to a research paper about object detection using Faster R-CNN. 

If you'd like, I can try to answer one of the original questions based on the context.


Computing Hallucination Rate:  60% | 18/30 [04:22<03:04, 15.37s/question]


HALLUCINATION:
Question: Before losing weight and training, which military unit was Goggins trying to join?
Answer: According to the text, before losing weight and training, David Goggins was trying to join the Naval Academy. However, he wasn't recruited to the Naval Academy.

Later in the text, it mentions that after his discharge from the Air Force, he was trying to get into DEVGRU (also known as SEAL Team Six), but it says "there was a chance" they would meet there soon, implying that he hadn't yet joined.


Computing Hallucination Rate:  63% | 19/30 [04:38<02:51, 15.57s/question]


HALLUCINATION:
Question: After serving as a SEAL, which elite Army unit did Goggins consider attempting to join?
Answer: According to the text, before trying to get into the Navy Special Warfare (SEAL) program, David Goggins was trying to join the Naval Academy. However, after being discharged from the Air Force and already serving as a SEAL, there is no indication that he tried to join any elite Army unit.

However, it's mentioned in the text that during his first platoon, Goggins put in a request to attend Army Ranger School between deployments, which suggests that he had an interest in attending Ranger School, but this was before serving as a SEAL.


Computing Hallucination Rate:  80% | 24/30 [05:50<01:33, 15.58s/question]


HALLUCINATION:
Question: At what mile during the San Diego One Day race did Goggins loose bowel control
Answer: There is no mention in the provided text that David Goggins lost bowel control at any point during the San Diego One Day race. The text does describe a physical training exercise where Goggins vomits blood and has bloody snot streaming from his nose and mouth while carrying a log on his shoulders, but it does not reference this incident happening during a specific mile of a running event.

However, the text does mention that in his past, David Goggins had to drop out of one of the ultra-marathon races due to losing control of his bowels. But no information about the San Diego One Day race is provided.


Computing Hallucination Rate:  87% | 26/30 [06:24<01:05, 16.25s/question]


HALLUCINATION:
Question: What condition did Goggins suffer from as a child that affected his learning?
Answer: Based on the provided text, it is not explicitly stated that David Goggins suffered from a specific condition that affected his learning. However, the text does mention that he had an "ear infection" that was so severe that if left untreated, it would have resulted in permanent hearing loss.

Additionally, there is a reference to Ritalin, which is often used to treat Attention Deficit Hyperactivity Disorder (ADHD), but the text only mentions that Goggins did not take Ritalin. It does not suggest that he was diagnosed with ADHD or any other learning disability.


Computing Hallucination Rate:  97% | 29/30 [07:07<00:15, 15.08s/question]


HALLUCINATION:
Question: What major ultramarathon did Goggins finish despite severe kidney failure and dehydration?
Answer: There is no answer to this question as it was not asked. However, I can provide a response.

Based on the provided text, it appears that David Goggins finished an ultra-marathon despite suffering from severe symptoms of what doctors believed might be kidney failure, and later diagnosed with an Atrial Septal Defect (ASD). The specific details about the major ultramarathon he finished are not mentioned in the provided context.


Computing Hallucination Rate: 100% | 30/30 [07:24<00:00, 14.82s/question]


HALLUCINATION:
Question: What phrase does Goggins use to describe the mental callous formed through suffering?
Answer: The text doesn't explicitly mention a specific phrase used by Goggins to describe the mental callous formed through suffering. However, it describes his process of re-examining and accepting his past experiences, which ultimately led him to "find strength in enduring pain and abuse" and allowed him to "use where I came from" as a source of personal growth.

It's implied that Goggins developed a mental resilience or toughness through his experiences with suffering, but the text doesn't attribute a specific phrase to this concept.

Hallucination Rate: 0.27





0.26666666666666666

# Calculating Latency for both

In [270]:
def latency_loop(query_function, use_history=False):

    total = len(test_data)
    total_time = 0

    history = []  # only used for conversational mode

    for item in tqdm(test_data, desc="Measuring Latency", unit="question"):

        question = item["question"]

        start = time.time()

        if use_history:
            answer = query_function(question, history)
        else:
            answer = query_function(question)

        end = time.time()

        total_time += (end - start)

    avg_latency = total_time / total

    print("\n==========")
    print(f"Total Questions: {total}")
    print(f"Average Latency: {avg_latency:.2f} seconds")

    return avg_latency
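
`time.time()` is adequate for a rough average, but `time.perf_counter()` is the clock intended for measuring short intervals, and a mean alone can hide outliers such as a slow first call. A minimal sketch, assuming the query function takes a single question string; `measure_latency` and `fake_query` are hypothetical stand-ins, not functions from this notebook.

```python
import statistics
import time


def measure_latency(query_function, questions):
    """Time each call and report mean, median, and max latency in seconds."""
    timings = []
    for question in questions:
        start = time.perf_counter()
        query_function(question)
        timings.append(time.perf_counter() - start)
    return {
        "mean": statistics.mean(timings),
        "median": statistics.median(timings),
        "max": max(timings),
    }


def fake_query(question):
    # Stand-in for query_rag; sleeps briefly instead of calling an LLM.
    time.sleep(0.01)
    return "answer"


stats = measure_latency(fake_query, ["q1", "q2", "q3"])
print({name: round(value, 3) for name, value in stats.items()})
```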

In [271]:
latency_loop(query_rag)

Measuring Latency: 100% | 30/30 [01:53<00:00,  3.78s/question]


Total Questions: 30
Average Latency: 3.78 seconds





3.775906268755595

In [272]:
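# NOTE: query_rag is passed here, not query_rag_hist, so the timing below
# likely reflects the stateless pipeline rather than the conversational one.
latency_loop(query_rag, use_history=True)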

Measuring Latency: 100% | 30/30 [01:51<00:00,  3.72s/question]


Total Questions: 30
Average Latency: 3.72 seconds





3.7205656369527182

# UI - Gradio interface

In [335]:
def format_sources(sources):
    formatted = []
    for s in sources:
        try:
            path, page, chunk = s.rsplit(":", 2)
            filename = os.path.basename(path)
            formatted.append(f"• {filename} (page {page})")
        except Exception:
            formatted.append(f"• {s}")
    return "\n".join(formatted)

In [332]:
def chat_fn(message, history):
    response, sources = query_rag_hist(message, CHAT_HISTORY, return_sources=True)
    #sources_text = "\n".join([f"• {s}" for s in sources])
    sources_text = format_sources(sources)
    final_response = f"""{response}
---
### Sources
{sources_text}
"""
    return final_response

In [328]:
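def upload_pdf(file):
    if file is None:
        return 'No file uploaded'
    # file.name is the full path of the uploaded temp file. Keep only the base
    # name; otherwise Path(DATA_DIR) / file.name resolves back to the temp path.
    filename = Path(file.name).name
    save_path = Path(DATA_DIR) / filename
    shutil.copy(file.name, save_path)

    # Re-run the ingestion pipeline so the new PDF is chunked and indexed.
    docs = load_docs()
    docs = filter_pages(docs)
    chunks = split(docs)
    add_to_chroma(chunks)

    return f'Indexed: {filename}'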
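
One `pathlib` detail matters for upload handlers like this one: joining a directory with an absolute path discards the directory. If the uploaded file object's `name` is a full temporary path, the destination has to be built from the base name only. The sketch below uses `PurePosixPath` so it behaves the same on any OS; the paths are invented for illustration.

```python
from pathlib import PurePosixPath

data_dir = PurePosixPath("project/Data")
uploaded = "/tmp/gradio/abc123/paper.pdf"  # hypothetical temp path of an upload

# Joining with an absolute path replaces the left-hand side entirely.
print(data_dir / uploaded)                      # /tmp/gradio/abc123/paper.pdf

# Using only the base name keeps the file inside the data directory.
print(data_dir / PurePosixPath(uploaded).name)  # project/Data/paper.pdf
```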

In [347]:
def reset_chat_ui():
    CHAT_HISTORY.clear()
    return '', "Chat reset successfully."

In [339]:
def list_documents():
    files = []

    for file in os.listdir(DATA_DIR):
        if file.endswith(".pdf"):
            files.append(file)

    if not files:
        return "No documents uploaded."

    return "\n".join(f"• {file}" for file in files)

In [350]:
gr.ChatInterface?

Init signature:
gr.ChatInterface(
    fn: 'Callable',
    *,
    multimodal: 'bool' = False,
    chatbot: 'Chatbot | None' = None,
    textbox: 'Textbox | MultimodalTextbox | None' = None,
    additional_inputs: 'str | Component | list[str | Component] | None' = None,
    additional_inputs_accordion: 'str | Accordion | None' = None,
    additional_outputs: 'Component | list[Component] | None' = None,
    editable: 'bool' = False,
    examples: 'list[str] | list[MultimodalValue] | list[list] | None' = None,
    example_labels: 'list[str] | None' = None,
    example_icons: 'list[str] | None' = None,
    run_examples_on_click: 'bool' = True

In [352]:
with gr.Blocks() as demo:
    gr.Markdown("# Local RAG PDF QA System")

    with gr.Row():
        with gr.Column(scale = 3):
            chatbot = gr.ChatInterface(fn=chat_fn,
                                       title="Ask questions about your PDFs")
            
            gr.Markdown("### Example Questions")
            gr.Examples(examples = ["What was David Goggins's pull-up record?",
                                    "How many Hell Weeks did Goggins complete?",
                                    "What is Faster R-CNN?",
                                    "What is ROI pooling used for?",],
                                    inputs = chatbot.textbox)

        with gr.Column(scale=1):
            gr.Markdown("## Documents")
            doc_list = gr.Textbox(value=list_documents(), 
                                  label="Indexed Documents",
                                  interactive=False,
                                  lines=4)
            gr.Markdown("## Upload new PDF")
            file_upload = gr.File(file_types = ['.pdf'],label = 'Select PDF')
            upload_btn = gr.Button("Add Document")
            upload_status = gr.Textbox(label = 'Upload Status')
            
            reset_btn = gr.Button("Reset Chat", variant="secondary")
            reset_status = gr.Markdown()
            
    reset_btn.click(fn=reset_chat_ui, outputs=[chatbot.textbox, reset_status])
    upload_btn.click(fn=upload_pdf, inputs=file_upload, outputs=upload_status).then(fn=list_documents, outputs=doc_list)

demo.launch(inline=False, inbrowser=True)  # share=True

* Running on local URL:  http://127.0.0.1:7891
* To create a public link, set `share=True` in `launch()`.


