# RAG with Evaluation: Film Box Office ROI Calculator

This notebook demonstrates an **agentic RAG system** that analyzes film financial performance by:
- Extracting production budgets and box office grosses from Wikipedia
- Calculating Return on Investment (ROI)
- Classifying film performance (Blockbuster, Profitable, Break-even, Flop)
- **Evaluating** against ground truth data

This goes beyond simple question-answering: it requires **multi-step reasoning, calculations, and decision-making**.

## Why This Use Case?

Unlike simple Q&A, this agent must:
1. **Extract** budget and gross from unstructured text
2. **Calculate** ROI = (gross - budget) / budget × 100
3. **Classify** performance based on thresholds
4. **Verify** calculations against ground truth
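
As a minimal sketch of steps 2 and 3, the ROI formula and a threshold-based classifier might look like the following. The threshold values and the names `roi_percent`, `THRESHOLDS`, and `classify` are assumptions made for illustration; this notebook does not define the actual cut-offs.

```python
def roi_percent(budget: float, gross: float) -> float:
    """ROI as a percentage: (gross - budget) / budget * 100."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    return (gross - budget) / budget * 100


# Hypothetical lower bounds (percent ROI), checked from highest to lowest.
# These cut-offs are illustrative only; tune them to your own definition.
THRESHOLDS = [
    (400.0, "Blockbuster"),
    (50.0, "Profitable"),
    (-10.0, "Break-even"),
]


def classify(roi: float) -> str:
    """Return the first label whose lower bound the ROI meets, else 'Flop'."""
    for lower_bound, label in THRESHOLDS:
        if roi >= lower_bound:
            return label
    return "Flop"


# Figures in millions of USD, taken from the test cases later in this notebook.
endgame_roi = roi_percent(budget=356.0, gross=2797.5)
print(f"{endgame_roi:.2f}% -> {classify(endgame_roi)}")  # 685.81% -> Blockbuster
```

Keeping the thresholds in a single ordered list means the classification rule can be changed without touching the calculation.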

At Cohere, all RAG calls come with... **precise citations**! 🎉
The model cites which groups of words, in the RAG chunks, were used to generate the final answer.  
These citations make it easy to check where the model's generated response claims are coming from.

**Note:** The baseline agent and eval harness are intentionally minimal to demonstrate systematic improvement through the eval-driven development cycle.

RAG consists of 3 steps:
- Step 1: Index film Wikipedia pages and retrieve relevant chunks
- Step 2: Optionally, rerank the retrieved chunks
- Step 3: Generate financial analysis with **precise citations**



## Step 0 - Imports & Getting Film Data

In this example, we'll use Wikipedia pages for multiple high-grossing films.   

We'll fetch individual film pages (e.g., "Avatar (2009 film)", "Avengers: Endgame") which contain:
- Production budgets
- Box office performance
- Production details
- Release information

The agent will need to extract this information and perform calculations independently.


In [16]:
# pip install cohere

import cohere
import os
from typing import List, Dict, Any
import httpx

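# WARNING: verify=False disables TLS certificate verification.
# Only use it when a trusted proxy requires it; otherwise drop `httpx_client` below.
insecure_client = httpx.Client(verify=False)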
# Initialize Cohere client
# Try to load from .env file first
try:
    from dotenv import load_dotenv
    load_dotenv('../.env')
    print('✅ Loaded API key from .env file')
except ImportError:
    print('⚠️  python-dotenv not installed')

api_key = os.environ.get('COHERE_API_KEY')

if not api_key:
    api_key = input("Enter your Cohere API key: ")
    os.environ['COHERE_API_KEY'] = api_key

co = cohere.ClientV2(api_key=api_key, httpx_client=insecure_client)
print('✅ Cohere client initialized')
# Get your free API key: https://dashboard.cohere.com/api-keys


✅ Loaded API key from .env file
✅ Cohere client initialized


In [17]:
# For chunking let's use langchain to help us split the text
# ! pip install -qU langchain-text-splitters -qq

from langchain_text_splitters import RecursiveCharacterTextSplitter


In [40]:
with open("../data/text/spells5e.txt", "r", encoding="utf-8") as f:
    text = f.read()
print(len(text))

357874


In [41]:
# Create basic configurations to chunk the text
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    length_function=len,
    is_separator_regex=False,
)

# Split the text into chunks with some overlap
chunks_ = text_splitter.create_documents([text])
chunks = [c.page_content for c in chunks_]
print(f"The text has been broken down in {len(chunks)} chunks.")


The text has been broken down in 941 chunks.
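
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of overlapping fixed-size windows. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter first tries to break on separators such as paragraphs and lines); the function name `chunk_with_overlap` is hypothetical and only illustrates the role of the two parameters.

```python
def chunk_with_overlap(text: str, chunk_size: int = 512, chunk_overlap: int = 50) -> list:
    """Cut text into fixed-size character windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    # Each window starts `chunk_size - chunk_overlap` characters after the previous one.
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


sample = "abcdefghijklmnopqrstuvwxyz"
windows = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(windows)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of some duplicated text.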


### Embed every text chunk

We embed each chunk with Cohere's `embed-v4.0` model, sending the texts in batches of 96.


In [42]:
# Because the texts being embedded are the chunks we are searching over, we set the input type to search_document

model = "embed-v4.0"

def batch_embed(texts, batch_size=96):
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        response = co.embed(
            texts=batch,
            model=model,
            input_type="search_document",
            embedding_types=['float']
        )
        all_embeddings.extend(response.embeddings.float)
    return all_embeddings

embeddings = batch_embed(chunks)
print(f"We just computed {len(embeddings)} embeddings.")


We just computed 941 embeddings.


### Store the embeddings in a vector database

We use the simplest vector database ever: a python dictionary using `np.array()`.


In [43]:
# We use the simplest vector database ever: a python dictionary
! pip install numpy -qq


In [44]:
import numpy as np
vector_database = {i: np.array(embedding) for i, embedding in enumerate(embeddings)}
# { 0: array([...]), 1: array([...]), 2: array([...]), ..., 10: array([...]) }


## Given a user query, retrieve the relevant chunks from the vector database



### Define the user question


In [45]:
query = "What level is fireball"


### Embed the user question

We embed the query with the same model, this time with `input_type="search_query"`.


In [46]:
# Because the text being embedded is the search query, we set the input type as search_query
response = co.embed(
    texts=[query],
    model=model,
    input_type="search_query",
    embedding_types=['float']
)
query_embedding = response.embeddings.float[0]
print("query_embedding: ", query_embedding[:10] + ["..."])


query_embedding:  [-0.008171757, 0.011016739, 0.034139782, -0.032202773, -0.055931136, 0.03317128, -0.01779627, 0.0101692965, -0.00950345, -0.011743117, '...']


### Retrieve the most relevant chunks from the vector database

We rank every chunk by its cosine similarity to the query embedding and keep the 20 most similar chunks.
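
As a toy illustration of this ranking step, the sketch below scores three hand-made 2-D vectors against a query and keeps the top two. The vectors and the helper names `cosine` and `top_k` are made up for the example; the notebook itself uses NumPy on the real embeddings.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the k (score, index) pairs with the highest similarity, best first."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    return heapq.nlargest(k, scored)


toy_docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_query = [1.0, 0.1]
ranked = top_k(toy_query, toy_docs, k=2)
for score, idx in ranked:
    print(idx, round(score, 3))
```

Document 0 points almost in the same direction as the query, so it ranks first; document 2 comes second and document 1 is dropped.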


In [47]:
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def get_response_text(response) -> str:
    """
    Extract text from Cohere response.
    Handles both reasoning models (command-a-reasoning-*) and non-reasoning models.
    """
    for item in response.message.content:
        if item.type == 'text':
            return item.text
    return ""

# Calculate similarity between the user question & each chunk
similarities = [cosine_similarity(query_embedding, chunk) for chunk in embeddings]
print(f"Calculated similarity for {len(similarities)} chunks (top score: {max(similarities):.4f})")

# Get indices of the top 20 most similar chunks
sorted_indices = np.argsort(similarities)[::-1]

# Keep only the top 20 indices
top_indices = sorted_indices[:20]
print(f"Top 20 chunk indices: {list(top_indices[:5])} ... (showing first 5)")

# Retrieve the top 20 most similar chunks
top_chunks_after_retrieval = [chunks[i] for i in top_indices]
print(f"Retrieved {len(top_chunks_after_retrieval)} chunks. Here are the top 3:")
for t in top_chunks_after_retrieval[:3]:
    print("== " + t)


Calculated similarity for 941 chunks (top score: 0.5029)
Top 20 chunk indices: [np.int64(370), np.int64(380), np.int64(372), np.int64(378), np.int64(245)] ... (showing first 5)
Retrieved 20 chunks. Here are the top 3:
== Fireball 
3rd-level evocation 

Casting Time: 1 action 
Range: 150 feet 

Components: V, S, M (a tiny ball of bat 
guano and sulfur) 

Duration: Instantaneous 

A bright streak flashes from your pointing finger to a 
point you choose within range and then blossoms with 
a low roar into an explosion of flame. Each creature 
in a 20-foot-radius sphere centered on that point must 
make a Dexterity saving throw. A target takes 8d6 fire 



damage on a failed save, or half as much damage on a 
successful one.
== At Higher Levels. When you cast this spell using a 
spell slot of 6th level or higher, the fire damage or the 
radiant damage (your choice) increases by ld6 for each 
slot level above 5th. 

Flaming Sphere 

2nd-Ievel conjuration 

Casting Time: 1 action 
Range: 60 

## Step 2 - Rerank the chunks retrieved from the vector database

We rerank the 20 chunks retrieved from the vector database. The reranker scores each chunk directly against the query, which typically improves on embedding similarity alone.

Reranking lets us go from 20 chunks retrieved from the vector database to the 5 most relevant chunks.


In [48]:
response = co.rerank(
    query=query,
    documents=top_chunks_after_retrieval,
    top_n=5,
    model="rerank-v3.5",
)

# top_chunks_after_rerank = [result.document['text'] for result in response]

top_chunks_after_rerank = [top_chunks_after_retrieval[result.index] for result in response.results]

print("Here are the top 5 chunks after rerank: ")
for t in top_chunks_after_rerank:
    print("== " + t)


Here are the top 5 chunks after rerank: 
== You send negative energy coursing through a creature 
that you can see within range, causing it searing pain. 
The target must make a Constitution saving throw. It 
takes 7d8 + 30 necrotic damage on a failed save, or half 
as much damage on a successful one. 

A humanoid killed by this spell rises at the start of 
your next turn as a zombie that is permanently under 
your command, following your verbal orders to the best 
of its ability. 

Fireball 
3rd-level evocation
== Fireball 
3rd-level evocation 

Casting Time: 1 action 
Range: 150 feet 

Components: V, S, M (a tiny ball of bat 
guano and sulfur) 

Duration: Instantaneous 

A bright streak flashes from your pointing finger to a 
point you choose within range and then blossoms with 
a low roar into an explosion of flame. Each creature 
in a 20-foot-radius sphere centered on that point must 
make a Dexterity saving throw. A target takes 8d6 fire 



damage on a failed save, or half as muc

## Step 3 - Generate the model final answer, given the retrieved and reranked chunks


In [49]:
# preamble containing instructions about the task and the desired style for the output.
preamble = """
## Task & Context
You help people answer their questions about rules of the roleplaying game Dungeons and Dragons. You will be asked a very wide array of requests on all kinds of topics relating to the rules of Dungeons and Dragons. You will have access to a search function over the entire rulebook. You should focus on serving the user's needs as best you can, which will be wide-ranging.
The question does not always contain all the information that is needed to answer it; in that case, ask the user for more information.

## Style Guide
Please return the minimum amount of information. If the user asks for a number, answer only with the number. Otherwise keep the answer short.
"""


In [50]:
# retrieved documents - now using all 5 reranked chunks
documents = [
    {"data": {"title": f"chunk {i}", "snippet": chunk}} 
    for i, chunk in enumerate(top_chunks_after_rerank)
]

print(f"Passing {len(documents)} documents to the model")

# get model response
response = co.chat(
  model="command-a-reasoning-08-2025",
  messages=[{"role" : "system", "content" : preamble},
            {"role" : "user", "content" : query}],
  documents=documents,  
  temperature=0.3
)

print("Final answer:")
print(get_response_text(response))

Passing 5 documents to the model
Final answer:
3rd.


Note: this is indeed the answer you'd expect, and here is the passage in the retrieved chunks that supports it:

" [...] Fireball 3rd-level evocation [...]"


## Bonus: Citations come for free with Cohere! 🎉

At Cohere, all RAG calls come with... precise citations! 🎉
The model cites which groups of words, in the RAG chunks, were used to generate the final answer.  
These citations make it easy to check where the model's generated response claims are coming from.  
They help users gain visibility into the model reasoning, and sanity check the final model generation.  
These citations are optional; you can decide to ignore them.
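
To show how character-span citations turn into inline markers, here is a small sketch on made-up data. `ToyCitation` is a hypothetical stand-in with only the fields this example needs; it is not the SDK's citation type, and the spans and source labels are invented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyCitation:
    """Hypothetical citation: a character span in the answer plus source labels."""
    start: int
    end: int
    source_ids: List[str]


def span_of(text: str, phrase: str, source_ids: List[str]) -> ToyCitation:
    """Build a citation covering the first occurrence of `phrase` in `text`."""
    start = text.index(phrase)
    return ToyCitation(start, start + len(phrase), source_ids)


def add_markers(text: str, citations: List[ToyCitation]) -> str:
    # Insert from the end of the text backwards so earlier offsets stay valid.
    for cite in sorted(citations, key=lambda c: c.start, reverse=True):
        marker = "[" + ",".join(cite.source_ids) + "]"
        text = text[:cite.end] + marker + text[cite.end:]
    return text


answer = "Fireball is a 3rd-level evocation spell."
toy_citations = [
    span_of(answer, "3rd-level", ["1"]),
    span_of(answer, "evocation", ["1", "2"]),
]
cited = add_markers(answer, toy_citations)
print(cited)  # Fireball is a 3rd-level[1] evocation[1,2] spell.
```

Processing the spans in reverse order matters: inserting a marker shifts every character after it, so working from the last span to the first keeps the remaining offsets correct.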



In [None]:
print("Citations that support the final answer:")
for cite in response.message.citations:
    print(cite)


In [None]:
def insert_inline_citations(text, citations, field='text'):
    sorted_citations = sorted(citations, key=lambda c: c.start, reverse=True)
    
    for citation in sorted_citations:
        source_ids = [source.id.split(':')[-1] for source in citation.sources]
        citation_text = f"[{','.join(source_ids)}]"
        text = text[:citation.end] + citation_text + text[citation.end:]
    
    return text

def list_sources(citations, fields=['text']):
    unique_sources = set()
    for citation in citations:
        for source in citation.sources:
            source_data = tuple((field, source.document[field]) for field in fields if field in source.document)
            unique_sources.add((source.id.split(':')[-1], source_data))
    
    footnotes = []
    for source_id, source_data in sorted(unique_sources):
        footnote = f"[{source_id}] " + ", ".join(f"{key}: {value}" for key, value in source_data)
        footnotes.append(footnote)
    
    return "\n".join(footnotes)

# Use the functions
# get_response_text works for reasoning and non-reasoning models; citations may be None
cited_text = insert_inline_citations(get_response_text(response), response.message.citations or [])

# Print the result with inline citations
print(cited_text)

# Print footnotes
if response.message.citations:
    print("\nSource documents:")
    print(list_sources(response.message.citations, fields=['title','snippet']))


In [None]:
def insert_inline_citations(text, citations, field='text'):
    sorted_citations = sorted(citations, key=lambda c: c.start, reverse=True)
    
    for citation in sorted_citations:
        source_ids = [source.id.split(':')[-1] for source in citation.sources]
        citation_text = f"[{','.join(source_ids)}]"
        text = text[:citation.end] + citation_text + text[citation.end:]
    
    return text

def list_sources(citations):
    unique_sources = set()
    for citation in citations:
        for source in citation.sources:
            source_data = tuple((key, value) for key, value in source.document.items() if key != 'id')
            unique_sources.add((source.id.split(':')[-1], source_data))
    
    footnotes = []
    for source_id, source_data in sorted(unique_sources):
        footnote = f"[{source_id}] " + ", ".join(f"{key}: {value}" for key, value in source_data)
        footnotes.append(footnote)
    
    return "\n".join(footnotes)

# Use the functions
# get_response_text works for reasoning and non-reasoning models; citations may be None
cited_text = insert_inline_citations(get_response_text(response), response.message.citations or [])

# Print the result with inline citations
print(cited_text)

# Print footnotes
if response.message.citations:
    print("\nSource documents:")
    print(list_sources(response.message.citations))


---

# Part 2: RAG Evaluation

Now that we've built a RAG system, let's evaluate its performance!

We'll measure how well the agent extracts financial data and calculates ROI, comparing its answers against ground truth.

## Define Test Queries

We'll create test queries that ask the agent to analyze films and extract their financial performance.

In [None]:
# Create test cases for Film ROI Analysis
# These match the films fetched from Wikipedia and include ground truth values from CSV
test_cases = [
    {
        "query": "What was Avengers: Endgame's budget and how much did it gross worldwide? What was the ROI?",
        "film_title": "Avengers: Endgame",
        "year": 2019,
        "budget": 356.0,              # Ground truth from CSV (millions)
        "worldwide_gross": 2797.5,    # Ground truth from CSV (millions)
        "roi": 685.81,                # Ground truth from CSV (percent)
        "difficulty": "medium"
    },
    {
        "query": "How much did Titanic cost to make and how much money did it earn at the box office?",
        "film_title": "Titanic",
        "year": 1997,
        "budget": 200.0,              # Ground truth from CSV (millions)
        "worldwide_gross": 2257.9,    # Ground truth from CSV (millions)
        "roi": 1028.95,               # Ground truth from CSV (percent)
        "difficulty": "easy"
    },
    {
        "query": "What was the production budget and worldwide box office gross for Star Wars: The Force Awakens?",
        "film_title": "Star Wars: The Force Awakens",
        "year": 2015,
        "budget": 245.0,              # Ground truth from CSV (millions)
        "worldwide_gross": 2068.2,    # Ground truth from CSV (millions)
        "roi": 744.17,                # Ground truth from CSV (percent)
        "difficulty": "medium"
    },
    {
        "query": "What was the budget and worldwide gross for Avengers: Infinity War?",
        "film_title": "Avengers: Infinity War",
        "year": 2018,
        "budget": 400.0,              # Ground truth from CSV (millions)
        "worldwide_gross": 2048.4,    # Ground truth from CSV (millions)
        "roi": 412.09,                # Ground truth from CSV (percent)
        "difficulty": "medium"
    },
    {
        "query": "Tell me about Spider-Man: No Way Home's financial performance - budget, gross, and profitability.",
        "film_title": "Spider-Man: No Way Home",
        "year": 2021,
        "budget": 200.0,              # Ground truth from CSV (millions)
        "worldwide_gross": 1922.6,    # Ground truth from CSV (millions)
        "roi": 861.30,                # Ground truth from CSV (percent)
        "difficulty": "hard"
    }
]


## Define Evaluation Function

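The cell below defines the evaluation helpers:
1. `parse_financial_numbers` extracts budget, gross, and ROI from the answer text with regular expressions
2. `evaluate_response` compares each parsed value to the ground truth with a 5% relative tolerance
3. A test passes when both budget and gross are within tolerance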
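
Since `parse_financial_numbers` in the next cell relies on regular expressions, one pitfall is worth seeing in isolation: a lazy `.*?` between a keyword and a dollar amount can still run past the first amount if that is the only way to match. The sentence below is a made-up answer built from the Avengers: Endgame figures in the test cases.

```python
import re

answer = "the budget was $356 million and it grossed $2.797 billion worldwide."

# Lazy `.*?` may still skip the first dollar amount to reach one followed by "billion".
loose = re.search(r"budget.*?\$(\d+(?:\.\d+)?)\s*billion", answer)

# `[^$]*?` cannot cross a `$`, so only the first amount after "budget" is considered.
strict_billion = re.search(r"budget[^$]*?\$(\d+(?:\.\d+)?)\s*billion", answer)
strict_million = re.search(r"budget[^$]*?\$(\d+(?:\.\d+)?)\s*million", answer)

print(loose.group(1))           # 2.797 (the gross, wrongly captured as the budget)
print(strict_billion)           # None
print(strict_million.group(1))  # 356
```

Restricting the gap to non-`$` characters is one way to tie each keyword to the nearest amount; it is a heuristic, and answers phrased differently may need other patterns.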

In [None]:
from typing import Dict, Optional
import re

def parse_financial_numbers(text: str) -> Dict[str, Optional[float]]:
    """
    Extract budget, gross, and ROI from text response
    Returns values in millions of dollars
    """
    result = {"budget": None, "gross": None, "roi": None}
    
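    # Pattern for budget (in millions or billions).
    # `[^$]*?` stops a pattern from skipping past the first dollar amount after
    # "budget"; a lazy `.*?` could otherwise capture a gross quoted in billions.
    budget_patterns = [
        r'budget[^$]*?\$(\d+(?:\.\d+)?)\s*billion',
        r'budget[^$]*?\$(\d+(?:,\d+)?(?:\.\d+)?)\s*million',
        r'\$(\d+(?:,\d+)?(?:\.\d+)?)\s*million[^$]*?budget'
    ]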
    
    # Pattern for gross (in millions or billions)
    gross_patterns = [
        r'(?:gross|earned|made).*?\$(\d+(?:\.\d+)?)\s*billion',
        r'(?:gross|earned|made).*?\$(\d+(?:,\d+)?(?:\.\d+)?)\s*million',
        r'\$(\d+(?:\.\d+)?)\s*billion.*?(?:gross|worldwide)'
    ]
    
    # Pattern for ROI percentage
    roi_patterns = [
        r'ROI.*?(\d+(?:\.\d+)?)\s*%',
        r'(\d+(?:\.\d+)?)\s*%.*?ROI',
        r'return.*?(\d+(?:\.\d+)?)\s*%'
    ]
    
    text_lower = text.lower()
    
    # Extract budget
    for pattern in budget_patterns:
        match = re.search(pattern, text_lower, re.IGNORECASE)
        if match:
            value = float(match.group(1).replace(',', ''))
            result["budget"] = value * 1000 if 'billion' in match.group(0).lower() else value
            break
    
    # Extract gross
    for pattern in gross_patterns:
        match = re.search(pattern, text_lower, re.IGNORECASE)
        if match:
            value = float(match.group(1).replace(',', ''))
            result["gross"] = value * 1000 if 'billion' in match.group(0).lower() else value
            break
    
    # Extract ROI
    for pattern in roi_patterns:
        match = re.search(pattern, text_lower, re.IGNORECASE)
        if match:
            result["roi"] = float(match.group(1))
            break
    
    return result

def evaluate_response(response: str, expected_budget: float, expected_gross: float, expected_roi: float) -> Dict:
    """
    Evaluate response by comparing extracted numbers to ground truth
    All values should be in millions
    """
    parsed = parse_financial_numbers(response)
    
    # Calculate errors (allow 5% tolerance)
    tolerance = 0.05
    
    budget_match = False
    gross_match = False
    roi_match = False
    
    budget_error = None
    gross_error = None
    roi_error = None
    
    # Compare with `is not None` so a parsed value of 0 is still evaluated
    if parsed["budget"] is not None:
        budget_error = abs(parsed["budget"] - expected_budget) / expected_budget
        budget_match = budget_error <= tolerance
    
    if parsed["gross"] is not None:
        gross_error = abs(parsed["gross"] - expected_gross) / expected_gross
        gross_match = gross_error <= tolerance
    
    if parsed["roi"] is not None:
        roi_error = abs(parsed["roi"] - expected_roi) / expected_roi
        roi_match = roi_error <= tolerance
    
    # Pass if at least budget and gross are correct (ROI can be calculated from these)
    passed = budget_match and gross_match
    
    return {
        "passed": passed,
        "budget_match": budget_match,
        "gross_match": gross_match,
        "roi_match": roi_match,
        "parsed_budget": parsed["budget"],
        "parsed_gross": parsed["gross"],
        "parsed_roi": parsed["roi"],
        "budget_error": budget_error,
        "gross_error": gross_error,
        "roi_error": roi_error
    }

print("✅ Evaluation function created with ground truth comparison")


## Run Evaluation

Let's test the agent on all our film financial queries.

In [None]:
results = []

for i, test in enumerate(test_cases):
    print(f"\nEvaluating query {i+1}/{len(test_cases)}: {test['query'][:60]}...")
    
    # Embed and retrieve
    query_resp = co.embed(
        texts=[test['query']],
        model=model,
        input_type="search_query",
        embedding_types=['float']
    )
    q_embedding = query_resp.embeddings.float[0]
    
    # Calculate similarities
    sims = [cosine_similarity(q_embedding, chunk_emb) for chunk_emb in embeddings]
    top_idx = np.argsort(sims)[-20:][::-1]
    top_chunks = [chunks[i] for i in top_idx]
    
    # Rerank
    rerank_resp = co.rerank(
        query=test['query'],
        documents=top_chunks,
        top_n=5,
        model="rerank-v3.5",
    )
    reranked = [top_chunks[r.index] for r in rerank_resp.results]
    
    # Generate response
    docs = [{"data": {"snippet": doc}} for doc in reranked]
    chat_resp = co.chat(
        model="command-a-reasoning-08-2025",
        messages=[{"role": "user", "content": test['query']}],
        documents=docs,
        temperature=0.3
    )
    
    answer = get_response_text(chat_resp)
    
    # Evaluate against ground truth from CSV
    eval_result = evaluate_response(
        answer, 
        test['budget'], 
        test['worldwide_gross'], 
        test['roi']
    )
    
    results.append({
        "query": test['query'],
        "answer": answer,
        "film_title": test['film_title'],
        "difficulty": test['difficulty'],
        **eval_result
    })
    
    # Display results
    status = "✅ PASS" if eval_result['passed'] else "❌ FAIL"
    budget_status = "✓" if eval_result['budget_match'] else "✗"
    gross_status = "✓" if eval_result['gross_match'] else "✗"
    roi_status = "✓" if eval_result['roi_match'] else "✗"
    print(f"  {status} | Budget: {budget_status} | Gross: {gross_status} | ROI: {roi_status}")

print("\n✅ Evaluation complete!")


## Analyze Results

Let's see how well our agent extracted and calculated film financial data.
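
Each per-metric match counted below comes from a relative-error check with a 5% tolerance. A minimal standalone sketch of that check follows; the helper names `relative_error` and `within_tolerance` are made up for this example, and the parsed values are invented, while the expected values are the Avengers: Endgame ground truth in millions of USD.

```python
from typing import Optional


def relative_error(parsed: Optional[float], expected: float) -> Optional[float]:
    """|parsed - expected| / |expected|, or None when nothing was parsed."""
    if parsed is None:
        return None
    return abs(parsed - expected) / abs(expected)


def within_tolerance(parsed: Optional[float], expected: float, tolerance: float = 0.05) -> bool:
    """True when a value was parsed and its relative error is at most `tolerance`."""
    err = relative_error(parsed, expected)
    return err is not None and err <= tolerance


print(within_tolerance(2800.0, 2797.5))  # True: off by about 0.09%
print(within_tolerance(400.0, 356.0))    # False: off by about 12.4%
print(within_tolerance(None, 356.0))     # False: nothing was extracted
```

A relative tolerance absorbs rounding differences such as "$2.8 billion" versus 2797.5, while still rejecting a figure taken from the wrong film or the wrong field.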

In [None]:
# Overall metrics
total_tests = len(results)
passed = sum(1 for r in results if r['passed'])
budget_correct = sum(1 for r in results if r['budget_match'])
gross_correct = sum(1 for r in results if r['gross_match'])
roi_correct = sum(1 for r in results if r['roi_match'])

print("="*70)
print("RAG EVALUATION RESULTS - Film Box Office ROI Analysis")
print("="*70)
print(f"Total queries tested: {total_tests}")
print(f"Overall pass rate: {passed}/{total_tests} ({passed/total_tests*100:.1f}%)")
print()
print(f"Accuracy by metric:")
print(f"  Budget extraction:  {budget_correct}/{total_tests} ({budget_correct/total_tests*100:.1f}%)")
print(f"  Gross extraction:   {gross_correct}/{total_tests} ({gross_correct/total_tests*100:.1f}%)")
print(f"  ROI calculation:    {roi_correct}/{total_tests} ({roi_correct/total_tests*100:.1f}%)")
print()

# Show details for failures
failures = [r for r in results if not r['passed']]
if failures:
    print(f"Failed queries: {len(failures)}")
    for fail in failures:
        print(f"  ❌ {fail['film_title']}")
        if fail['parsed_budget']:
            print(f"     Budget: ${fail['parsed_budget']:.1f}M (error: {fail['budget_error']*100:.1f}%)")
        else:
            print(f"     Budget: Not found")
        if fail['parsed_gross']:
            print(f"     Gross: ${fail['parsed_gross']:.1f}M (error: {fail['gross_error']*100:.1f}%)")
        else:
            print(f"     Gross: Not found")
else:
    print("🎉 All tests passed!")


## Show Detailed Results

Let's examine each query to see how the agent performed on film financial extraction.

In [None]:
print("\n" + "="*60)
print("DETAILED RESULTS")
print("="*60)

for i, result in enumerate(results, 1):
    print(f"\n{'='*60}")
    print(f"Test {i}: {result['film_title']}")
    print(f"{'='*60}")
    print(f"Query: {result['query']}")
    print(f"\nPassed: {'✓' if result['passed'] else '✗'}")
    
    print(f"\nAnswer:\n{result['answer'][:500]}...")  # First 500 chars
    
    # Per-metric check against ground truth (parsed budget and gross are in millions of USD)
    print("\nMetric check:")
    for metric in ("budget", "gross", "roi"):
        mark = '✓' if result[f'{metric}_match'] else '✗'
        parsed_value = result[f'parsed_{metric}']
        print(f"  {mark} {metric}: parsed={parsed_value}")


---

## 🎉 Summary

### What We Built:

**Part 1: RAG Tutorial**
- ✅ Document chunking and embedding with `embed-v4.0`
- ✅ Vector similarity search (cosine similarity)
- ✅ Reranking with `rerank-v3.5`
- ✅ Generation with `command-a-reasoning-08-2025` and **citations**
- ✅ Citation formatting utilities

**Part 2: Evaluation**
- ✅ Test queries on the same dataset
- ✅ Automated evaluation against ground truth (5% tolerance)
- ✅ Pass/fail analysis per metric (budget, gross, ROI)
- ✅ Detailed results inspection

### Key Takeaways:

1. **RAG improves accuracy** by grounding responses in retrieved documents
2. **Citations build trust** by showing which sources were used
3. **Reranking boosts performance** by prioritizing the most relevant chunks
4. **Evaluation is essential** to measure and improve RAG systems

### Next Steps:

- **Try different parameters**: Experiment with chunk sizes, top-k values, temperature
- **Add more test queries**: Create a larger evaluation set
- **Improve retrieval**: Try different embedding models or hybrid search
- **Enhance evaluation**: Add more sophisticated metrics (BLEU, ROUGE, LLM-as-judge)

**Happy building! 🚀**


---

# Advanced Evaluation: Ground Truth Validation

Now let's evaluate the agent's ability to **extract accurate financial data** and **calculate ROI correctly** by comparing against our ground truth dataset.

This evaluation will measure:
1. **Data Extraction Accuracy**: Did the agent find the correct budget and gross figures?
2. **Calculation Correctness**: Is the ROI calculation mathematically accurate?
3. **Classification Accuracy**: Does the performance classification match the actual ROI?


In [None]:
import pandas as pd
import re

# Load ground truth data
ground_truth_df = pd.read_csv('../data/ground_truth/film_box_office_ground_truth.csv')

print(f"Loaded {len(ground_truth_df)} films from ground truth")
print("\nSample data:")
print(ground_truth_df[['Year', 'Title', 'Budget', 'Worldwide gross', 'ROI']].head(10))


In [None]:
def extract_financial_data_with_rag(film_title, film_year=None):
    """
    Use RAG to extract budget and gross for a specific film
    """
    # Construct query
    query = f"What was the production budget and worldwide box office gross for {film_title}?"
    if film_year:
        query += f" (from {film_year})"
    
    # Embed query
    query_embed = co.embed(
        texts=[query],
        model=model,
        input_type="search_query",
        embedding_types=["float"]
    ).embeddings.float[0]
    
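    # Retrieve the 20 chunks most similar to the query.
    # `embeddings` is a list, so iterate over the `vector_database` dict instead.
    scored = [(idx, cosine_similarity(query_embed, emb)) for idx, emb in vector_database.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    doc_ids = [idx for idx, _ in scored[:20]]
    docs_to_use = [chunks[idx] for idx in doc_ids]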
    
    # Rerank
    rerank_results = co.rerank(
        query=query,
        documents=docs_to_use,
        top_n=5,
        model="rerank-v3.5"
    )
    
    docs_to_use = [docs_to_use[r.index] for r in rerank_results.results]
    
    # Generate response with specific instructions.
    # As in the earlier chat call, instructions are passed as a system message.
    system_prompt = """You are a financial analyst extracting film production data. 
    Extract the EXACT budget and worldwide box office gross figures from the provided text.
    Report numbers in millions (e.g., '$237 million' or '$2.9 billion').
    If you find the data, also calculate the ROI using: ROI = (gross - budget) / budget × 100"""
    
    response = co.chat(
        model="command-a-03-2025",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": query},
        ],
        # Same document shape as the evaluation loop above
        documents=[{"data": {"snippet": doc}} for doc in docs_to_use],
    )
    
    return get_response_text(response), response.message.citations


In [None]:
def parse_financial_data(agent_response):
    """
    Parse budget and gross from agent's text response
    Returns values in dollars (not millions)
    """
    import re
    
    # Look for budget
    budget = None
    # Allow an optional decimal part, e.g. "$356.5 million"
    budget_patterns = [
        r'budget.*?\$([\d,]+(?:\.\d+)?)\s*million',
        r'\$([\d,]+(?:\.\d+)?)\s*million.*?budget',
        r'cost.*?\$([\d,]+(?:\.\d+)?)\s*million',
    ]
    
    for pattern in budget_patterns:
        match = re.search(pattern, agent_response, re.IGNORECASE)
        if match:
            budget = float(match.group(1).replace(',', '')) * 1_000_000
            break
    
    # Look for gross
    gross = None
    gross_patterns = [
        r'gross.*?\$([\d.]+)\s*billion',
        r'\$([\d.]+)\s*billion.*?gross',
        r'earned.*?\$([\d.]+)\s*billion',
        r'box office.*?\$([\d.]+)\s*billion',
    ]
    
    for pattern in gross_patterns:
        match = re.search(pattern, agent_response, re.IGNORECASE)
        if match:
            gross = float(match.group(1)) * 1_000_000_000
            break
    
    # If not found in billions, try millions
    if gross is None:
        gross_patterns_mil = [
            r'gross.*?\$([\d,]+)\s*million',
            r'\$([\d,]+)\s*million.*?gross',
        ]
        for pattern in gross_patterns_mil:
            match = re.search(pattern, agent_response, re.IGNORECASE)
            if match:
                gross = float(match.group(1).replace(',', '')) * 1_000_000
                break
    
    # Calculate ROI if we have both
    roi = None
    if budget and gross and budget > 0:
        roi = ((gross - budget) / budget) * 100
    
    return {
        'budget': budget,
        'gross': gross,
        'roi': roi
    }


In [None]:
# Test on a subset of films from ground truth (matching Wikipedia dataset)
test_films = [
    ('Avengers: Endgame', 2019),
    ('Titanic', 1997),
    ('Star Wars: The Force Awakens', 2015),
    ('Avengers: Infinity War', 2018),
    ('Spider-Man: No Way Home', 2021),
]

evaluation_results = []

for film_title_short, year in test_films:
    print(f"\nEvaluating: {film_title_short} ({year})")
    
    # Get ground truth
    # regex=False treats the title as a literal string rather than a pattern
    gt_row = ground_truth_df[
        (ground_truth_df['Title'].str.contains(film_title_short, case=False, regex=False)) & 
        (ground_truth_df['Year'] == year)
    ]
    
    if len(gt_row) == 0:
        print(f"  ⚠️  Not found in ground truth")
        continue
    
    gt_row = gt_row.iloc[0]
    
    # Parse ground truth values
    gt_budget = float(gt_row['Budget'].replace('$', '').replace(',', ''))
    gt_gross = float(gt_row['Worldwide gross'].replace('$', '').replace(',', ''))
    gt_roi = float(gt_row['ROI'])
    
    # Get agent's extraction
    agent_response, citations = extract_financial_data_with_rag(film_title_short, year)
    agent_data = parse_financial_data(agent_response)
    
    # Calculate errors
    budget_error = None
    gross_error = None
    roi_error = None
    
    if agent_data['budget']:
        budget_error = abs(agent_data['budget'] - gt_budget) / gt_budget * 100
    
    if agent_data['gross']:
        gross_error = abs(agent_data['gross'] - gt_gross) / gt_gross * 100
    
    if agent_data['roi']:
        roi_error = abs(agent_data['roi'] - gt_roi) / gt_roi * 100
    
    result = {
        'film': film_title_short,
        'year': year,
        'gt_budget': gt_budget,
        'agent_budget': agent_data['budget'],
        'budget_error_pct': budget_error,
        'gt_gross': gt_gross,
        'agent_gross': agent_data['gross'],
        'gross_error_pct': gross_error,
        'gt_roi': gt_roi,
        'agent_roi': agent_data['roi'],
        'roi_error_pct': roi_error,
        'agent_response': agent_response[:200],
        'citations_count': len(citations) if citations else 0
    }
    
    evaluation_results.append(result)
    
    print(f"  Budget: ${gt_budget/1e6:.0f}M (GT) vs ${agent_data['budget']/1e6:.0f}M (Agent) - Error: {budget_error:.1f}%" if agent_data['budget'] else "  Budget: Not extracted")
    print(f"  Gross: ${gt_gross/1e6:.0f}M (GT) vs ${agent_data['gross']/1e6:.0f}M (Agent) - Error: {gross_error:.1f}%" if agent_data['gross'] else "  Gross: Not extracted")
    print(f"  ROI: {gt_roi:.1f}% (GT) vs {agent_data['roi']:.1f}% (Agent) - Error: {roi_error:.1f}%" if agent_data['roi'] else "  ROI: Not calculated")

print("\n" + "="*60)
print("GROUND TRUTH EVALUATION COMPLETE")
print("="*60)


In [None]:
# Calculate overall accuracy metrics
import pandas as pd

eval_df = pd.DataFrame(evaluation_results)

print("\n" + "="*60)
print("FINAL EVALUATION METRICS")
print("="*60)

# Data extraction success rate
budget_extracted = eval_df['agent_budget'].notna().sum()
gross_extracted = eval_df['agent_gross'].notna().sum()
roi_calculated = eval_df['agent_roi'].notna().sum()

print(f"\nData Extraction Success Rate:")
print(f"  Budget extracted: {budget_extracted}/{len(eval_df)} ({budget_extracted/len(eval_df)*100:.0f}%)")
print(f"  Gross extracted: {gross_extracted}/{len(eval_df)} ({gross_extracted/len(eval_df)*100:.0f}%)")
print(f"  ROI calculated: {roi_calculated}/{len(eval_df)} ({roi_calculated/len(eval_df)*100:.0f}%)")

# Accuracy of extracted values (within 5% is considered accurate)
budget_accurate = (eval_df['budget_error_pct'] < 5).sum()
gross_accurate = (eval_df['gross_error_pct'] < 5).sum()
roi_accurate = (eval_df['roi_error_pct'] < 5).sum()

print(f"\nExtraction Accuracy (within 5% of ground truth):")
print(f"  Budget accurate: {budget_accurate}/{budget_extracted} ({budget_accurate/budget_extracted*100:.0f}%)" if budget_extracted > 0 else "  Budget accurate: N/A")
print(f"  Gross accurate: {gross_accurate}/{gross_extracted} ({gross_accurate/gross_extracted*100:.0f}%)" if gross_extracted > 0 else "  Gross accurate: N/A")
print(f"  ROI accurate: {roi_accurate}/{roi_calculated} ({roi_accurate/roi_calculated*100:.0f}%)" if roi_calculated > 0 else "  ROI accurate: N/A")

# Average errors
print(f"\nAverage Errors:")
print(f"  Budget: {eval_df['budget_error_pct'].mean():.2f}%")
print(f"  Gross: {eval_df['gross_error_pct'].mean():.2f}%")
print(f"  ROI: {eval_df['roi_error_pct'].mean():.2f}%")

print(f"\n✨ Average citations per response: {eval_df['citations_count'].mean():.1f}")

# Show the dataframe
print("\n" + "="*60)
print("DETAILED RESULTS TABLE")
print("="*60)
display(eval_df[['film', 'year', 'budget_error_pct', 'gross_error_pct', 'roi_error_pct', 'citations_count']])
