# Thudbot Scaffold

This will be a prototype Thudbot built in Jupyter.

Steps:
1. General setup
2. Data collection and Preparation
3. SDG with RAGAS to create a golden data set
4. Setup the RAG chain (finally)
5. Evaluate results with RAGAS
6. Refine RAG performance (prompt tuning, retrieval methods)

Once everything works:
- Convert to a standalone Python script
- Build or reuse a chatbot front end to run it locally


Naming this 00_ so that it will be the first notebook every time I start a new session.

## Step 1: General setup

In [1]:
### API key management and environment variables

### Reminder: place the .env file in the root of the project folder so the call below can find it and load it into the notebook environment
### PLEASE ADD THIS `.env` FILE TO YOUR PROJECT'S `.gitignore` before committing and pushing to your remote repo, as it contains API keys and secrets

import os
from dotenv import load_dotenv

load_dotenv(dotenv_path="../.env", override=True)

# --- Verify API Keys ---
print("--- API Key Status ---")
print(f"OPENAI_API_KEY loaded: {'OPENAI_API_KEY' in os.environ}")
print(f"LANGCHAIN_API_KEY loaded: {'LANGCHAIN_API_KEY' in os.environ}")
print(f"TAVILY_API_KEY loaded: {'TAVILY_API_KEY' in os.environ}")
print(f"RAGAS_API_KEY loaded: {'RAGAS_API_KEY' in os.environ}")
print(f"ANTHROPIC_API_KEY loaded: {'ANTHROPIC_API_KEY' in os.environ}")
print(f"COHERE_API_KEY loaded: {'COHERE_API_KEY' in os.environ}")

# --- Verify General Settings ---
print("\n--- Project Settings Status ---")
print(f"DEBUG mode enabled: {os.environ.get('DEBUG') == 'True'}")
print(f"LangSmith Tracing V2 enabled: {os.environ.get('LANGCHAIN_TRACING_V2') == 'true'}")
print(f"LangChain Project Base: {os.environ.get('LANGCHAIN_PROJECT_BASE')}")
print(f"LangChain Project: {os.environ.get('LANGCHAIN_PROJECT')}")


--- API Key Status ---
OPENAI_API_KEY loaded: True
LANGCHAIN_API_KEY loaded: True
TAVILY_API_KEY loaded: True
RAGAS_API_KEY loaded: False
ANTHROPIC_API_KEY loaded: True
COHERE_API_KEY loaded: True

--- Project Settings Status ---
DEBUG mode enabled: True
LangSmith Tracing V2 enabled: True
LangChain Project Base: None
LangChain Project: THUDBOT-CC
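
The status printout reports each key but does not stop the notebook when a required one is missing. A minimal sketch of a fail-fast check follows. The `REQUIRED_KEYS` list and the `missing_keys` helper are hypothetical names, and which keys count as required is an assumption, not something decided in this notebook.

```python
import os

# Assumption: only the OpenAI key is strictly required; adjust to taste.
REQUIRED_KEYS = ["OPENAI_API_KEY"]

def missing_keys(required, env=None):
    """Return the names in `required` that are absent or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

# Example with an explicit mapping so the check is easy to try out.
print(missing_keys(["OPENAI_API_KEY", "RAGAS_API_KEY"], {"OPENAI_API_KEY": "sk-test"}))
# prints ['RAGAS_API_KEY']
```

In the notebook itself, something like `assert not missing_keys(REQUIRED_KEYS)` right after `load_dotenv` would surface a missing key before any paid API call is made.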


including nltk, because it worked before

In [2]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /Users/family/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/family/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [3]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-nano"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())
eval_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-nano"))
eval_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

## Step 2: Data Collection and Preparation

My data is CSV structured, so using code from HW9

In [4]:
from langchain_community.document_loaders.csv_loader import CSVLoader
from datetime import datetime, timedelta

loader = CSVLoader(
    file_path="./data/Thudbot_Hint_Data_1.csv",
    metadata_columns=[
        "question",
        "hint_level",
        "character",
        "speaker",
        "narrative_context",
        "planet",
        "location",
        "category",
        "tone",
        "follow_up_hint_id",
        "answer_keywords",
        "tags"
    ]
)

hint_data = loader.load()

# No need to overwrite page_content; not doing custom transformation
print(hint_data[0].page_content)     # Every column NOT listed in metadata_columns (question_id, hint_text, puzzle_name, source)
print(hint_data[0].metadata)         # All the metadata fields, plus the source file and row number


question_id: TSB-001
hint_text: Press the escape key to exit the opening animations
puzzle_name: 
source: self
{'source': './data/Thudbot_Hint_Data_1.csv', 'row': 0, 'question': 'How do I stop the opening movie', 'hint_level': '1', 'character': 'Player', 'speaker': '', 'narrative_context': 'Meta', 'planet': '', 'location': '', 'category': 'Meta', 'tone': '', 'follow_up_hint_id': '', 'answer_keywords': '', 'tags': ''}
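
The output shows how the columns get split: anything named in `metadata_columns` lands in `metadata`, and the remaining columns are joined into `page_content` as `key: value` lines. The sketch below imitates that split with the standard `csv` module. It is an approximation of the behaviour visible in the output, not the loader's actual implementation (the real loader also adds `source` and `row` to the metadata), and `split_row` plus the sample CSV text are made up for illustration.

```python
import csv
import io

def split_row(row, metadata_columns):
    """Listed columns go to metadata; every other column goes to page_content."""
    content = "\n".join(f"{k}: {v}" for k, v in row.items() if k not in metadata_columns)
    metadata = {k: row[k] for k in metadata_columns}
    return content, metadata

# Tiny stand-in for the real CSV file (hypothetical three-column sample).
sample_csv = "question_id,hint_text,question\nTSB-001,Press the escape key,How do I stop the opening movie\n"
row = next(csv.DictReader(io.StringIO(sample_csv)))
content, metadata = split_row(row, ["question"])
print(content)
print(metadata)
```

One consequence worth keeping in mind: only `page_content` is embedded, so the player's `question` column does not influence dense retrieval unless it is moved out of `metadata_columns`.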


### Setting up Qdrant! (from HW9)

Now that we have our documents, let's create a Qdrant VectorStore with the collection name "Thudbot_Hints".

We'll leverage OpenAI's [`text-embedding-3-small`](https://openai.com/blog/new-embedding-models-and-api-updates) because it's a very powerful (and low-cost) embedding model.

 

In [5]:
from langchain_community.vectorstores import Qdrant
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Qdrant.from_documents(
    documents=hint_data,
    embedding=embeddings,
    location=":memory:",
    collection_name="Thudbot_Hints"
)

## Step 3: Synthetic data generation (SDG) with RAGAS to create a golden dataset

Adapted from HW 7 & 9

NOTE: this step is a one-time deal. Once the appropriate golden data set is finalized, don't run any of these cells, but pick up at the JSON import cell

### ⏭️ Skip these steps until/unless want new golden data

#### Group data by narrative context

The data is too granular for Ragas. I got this error:

ValueError: Documents appears to be too short (ie 100 tokens or less). Please provide longer documents

So I will group the hints by narrative context and create a new list called ```merged_docs```, just to feed to Ragas, not for retrieval.

In [6]:
from langchain.schema import Document
from collections import defaultdict

# Group hints by narrative context
grouped = defaultdict(list)

for doc in hint_data:
    key = doc.metadata.get("narrative_context", "unknown")
    grouped[key].append(doc.page_content)

# Create longer Documents for SDG
merged_docs = [
    Document(
        page_content="\n".join(hints),
        metadata={"narrative_context": context}
    )
    for context, hints in grouped.items()
]

# Optional: preview length
for doc in merged_docs:
    print(f"{doc.metadata['narrative_context']}: {len(doc.page_content.split())} words")


Meta: 505 words
Bar: 562 words
Background: 75 words
Thud Flashback: 94 words
Fleebix Flashback: 1725 words
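
Two of the groups are still fairly short. The sketch below flags groups under a word threshold, reusing the counts from the preview. Word count is only a rough proxy for tokens, the threshold of 100 is borrowed from the error message, and `short_groups` is a hypothetical helper. Whether Ragas would actually reject these groups is not something the output shows.

```python
# Word counts copied from the preview output.
word_counts = {
    "Meta": 505,
    "Bar": 562,
    "Background": 75,
    "Thud Flashback": 94,
    "Fleebix Flashback": 1725,
}

def short_groups(counts, min_words=100):
    """Return group names whose word count falls below `min_words`."""
    return [name for name, n in counts.items() if n < min_words]

print(short_groups(word_counts))
# prints ['Background', 'Thud Flashback']
```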


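Sample from the grouped documents before generating. Note that with only five groups, `random.sample(merged_docs, 5)` returns all of them, just in random order.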

In [7]:
from ragas.testset import TestsetGenerator
import random

sampled_docs = random.sample(merged_docs, 5)
generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
golden_dataset = generator.generate_with_langchain_docs(sampled_docs, testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/3 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/5 [00:00<?, ?it/s]

unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node


Applying SummaryExtractor:   0%|          | 0/4 [00:00<?, ?it/s]

Property 'summary' already exists in node 'd38336'. Skipping!


Applying CustomNodeFilter:   0%|          | 0/5 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/10 [00:00<?, ?it/s]

Property 'summary_embedding' already exists in node 'd38336'. Skipping!


Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/11 [00:00<?, ?it/s]
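
`random.sample` raises a `ValueError` when asked for more items than the list holds, and it gives a different order on every run. A small sketch of a capped, seeded variant is below. `sample_docs` and the seed value are assumptions for illustration, and plain strings stand in for the merged documents.

```python
import random

def sample_docs(docs, k, seed=42):
    """Pick up to `k` items reproducibly; never ask for more than exist."""
    rng = random.Random(seed)
    return rng.sample(docs, min(k, len(docs)))

groups = ["Meta", "Bar", "Background", "Thud Flashback", "Fleebix Flashback"]
picked = sample_docs(groups, 3)
print(picked)
```

Seeding only fixes which documents are chosen. The questions Ragas generates from them can still vary between runs.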

In [8]:
golden_dataset.to_pandas()

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,TSB-041 what do I do?,[question_id: TSB-041 hint_text: You can't do ...,"You can't do much on your own as Fleebix, beca...",single_hop_specifc_query_synthesizer
1,What does UHS stand for in the context of the ...,[question_id: TSB-041 hint_text: You can't do ...,The context mentions UHS as the source related...,single_hop_specifc_query_synthesizer
2,What is the UHS that I need to look in the sma...,[look in the small package when you were insid...,The context mentions looking in the small pack...,single_hop_specifc_query_synthesizer
3,"In the context of the UHS puzzles, what does t...",[look in the small package when you were insid...,The term 'UHS' in the provided context refers ...,single_hop_specifc_query_synthesizer
4,How do I get to Quantelope Lodge after explori...,[<1-hop>\n\nquestion_id: TSB-019 hint_text: Lo...,"First, in the entry vestibule, you should try ...",multi_hop_abstract_query_synthesizer
5,how get to quantelope lodge from entry vestibule,[<1-hop>\n\nquestion_id: TSB-019 hint_text: Lo...,"First, you need to look around in the entry ve...",multi_hop_abstract_query_synthesizer
6,Wht lvl of travel goal is to get to quantelope...,[<1-hop>\n\nquestion_id: TSB-041 hint_text: Yo...,"To reach Quantelope Lodge, the goal is to foll...",multi_hop_abstract_query_synthesizer
7,How can understanding the clues from the entry...,[<1-hop>\n\nquestion_id: TSB-041 hint_text: Yo...,"To reach Quantelope Lodge, you need to explore...",multi_hop_specific_query_synthesizer
8,How can I use the information from the entry v...,[<1-hop>\n\nquestion_id: TSB-041 hint_text: Yo...,"To progress in getting to Quantelope Lodge, yo...",multi_hop_specific_query_synthesizer
9,how TSB-041 and TSB-031 help me get to Quantel...,[<1-hop>\n\nquestion_id: TSB-041 hint_text: Yo...,TSB-041 explains that you can't do much on you...,multi_hop_specific_query_synthesizer


Save the RAGAS golden dataset as JSON.

In [None]:
import json

with open("data/goldendataset.json", "w") as f:
    json.dump([sample.model_dump() for sample in golden_dataset], f, indent=2)
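
The later cells all read `item['eval_sample']['user_input']` and friends, so the saved JSON needs to keep that nested shape. A minimal sketch of a shape check is below. `check_record` is a hypothetical helper and the record is a placeholder shaped like one saved row, not real data from the file.

```python
import json

def check_record(item):
    """Return True if `item` has the nested fields the later cells rely on."""
    sample = item.get("eval_sample", {})
    return all(key in sample for key in ("user_input", "reference", "reference_contexts"))

# Placeholder record with the same layout as one entry of the saved JSON.
record = {
    "eval_sample": {
        "user_input": "placeholder question",
        "reference": "placeholder answer",
        "reference_contexts": ["placeholder context"],
    },
    "synthesizer_name": "placeholder_synthesizer",
}

# Round-trip through JSON text, as the save and load cells do through a file.
round_tripped = json.loads(json.dumps([record]))
print(all(check_record(item) for item in round_tripped))
# prints True
```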


In [None]:
## Export Golden Dataset to Human-Readable Formats

import json
import pandas as pd

# Load the golden dataset
with open('./data/goldendataset.json', 'r') as f:
    golden_data = json.load(f)

# Extract the key information into a clean DataFrame
readable_data = []
for i, item in enumerate(golden_data):
    eval_sample = item['eval_sample']
    
    # Clean up the reference contexts (truncate if too long)
    contexts = eval_sample.get('reference_contexts', [])
    contexts_preview = str(contexts[0])[:200] + "..." if contexts else "No context"
    
    readable_data.append({
        'Question_ID': f"Q{i+1:02d}",
        'User_Question': eval_sample.get('user_input', ''),
        'Expected_Answer': eval_sample.get('reference', ''),
        'Synthesizer_Type': item.get('synthesizer_name', ''),
        'Context_Preview': contexts_preview,
        'Full_Context_Length': len(str(contexts)) if contexts else 0
    })

# Create DataFrame
df = pd.DataFrame(readable_data)

# Export to CSV
df.to_csv('./data/golden_dataset_readable.csv', index=False)

# Display preview
print("Golden Dataset Preview:")
print("=" * 60)
for _, row in df.head(3).iterrows():
    print(f"Question {row['Question_ID']}: {row['User_Question']}")
    print(f"Expected: {row['Expected_Answer']}")
    print(f"Synthesizer: {row['Synthesizer_Type']}")
    print("-" * 40)

print(f"\n✅ Exported {len(df)} questions to: ./data/golden_dataset_readable.csv")


In [None]:
## Option 2: Create Detailed Markdown Report

def create_markdown_report(golden_data, output_file='./data/golden_dataset_report.md'):
    """Create a detailed, human-readable markdown report of the golden dataset."""
    
    with open(output_file, 'w') as f:
        f.write("# Thudbot Golden Dataset Report\n\n")
        f.write("Generated synthetic test questions and answers for RAGAS evaluation.\n\n")
        f.write(f"**Total Questions:** {len(golden_data)}\n\n")
        f.write("---\n\n")
        
        for i, item in enumerate(golden_data, 1):
            eval_sample = item['eval_sample']
            
            f.write(f"## Question {i:02d}\n\n")
            f.write(f"**User Input:** {eval_sample.get('user_input', 'N/A')}\n\n")
            f.write(f"**Expected Answer:**\n{eval_sample.get('reference', 'N/A')}\n\n")
            f.write(f"**Synthesizer:** `{item.get('synthesizer_name', 'unknown')}`\n\n")
            
            # Add context information
            contexts = eval_sample.get('reference_contexts', [])
            if contexts:
                f.write("**Reference Context Preview:**\n")
                context_preview = str(contexts[0])[:300] + "..." if len(str(contexts[0])) > 300 else str(contexts[0])
                f.write(f"```\n{context_preview}\n```\n\n")
            
            f.write("---\n\n")
    
    print(f"✅ Created detailed report: {output_file}")

# Generate the markdown report
create_markdown_report(golden_data)


The answers are okay, but the questions are crappy. Let's see if we can get an LLM to rewrite the questions in a more natural game-player voice.

##### Five step plan to re-write the questions:

1. Define rewriter prompt
2. Set up a rewriter chain using the rewriter prompt
3. Define rewrite function that invokes rewriter_chain
4. Run the function on the original "golden" dataset
5. Write the re-written questions to **platinum_dataset.json** (and an .md file)

"Vibe-check" the output, and iterate until I like it!

#### Question Rewriter: Transform Formal Questions → Casual Player Questions

Using the same pattern as HW7 for LLM chains


In [9]:
# Step 1: Create the rewriter prompt (following HW7 prompt pattern)

from langchain.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

QUESTION_REWRITER_PROMPT = """\
You are helping rewrite questions to sound like real confused players asking for game hints.

Your job is to transform formal/academic questions into casual, human-like questions that real players would ask when stuck in "The Space Bar" adventure game.

IMPORTANT: If the question contains a question ID like "TSB-041" or similar, ignore that completely - it's just an organizational code that doesn't exist in the real game.

Original question: {original_question}
Game context: {context_preview}

Examples of good player questions:
- "How do I get past the guard?"
- "I'm stuck in this room, what do I do?"
- "Help! I can't figure out this puzzle"
- "Where's the token I need?"

Make it sound confused, or casual - like a real person, not an academic paper.

Rewritten question:"""

# Create the prompt template (same pattern as HW7)
rewriter_prompt = ChatPromptTemplate.from_template(QUESTION_REWRITER_PROMPT)
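
The prompt asks the model to ignore question IDs, which it may or may not do reliably. As a belt-and-braces option, a deterministic pre-filter could strip them before the question ever reaches the LLM. This is a sketch under the assumption that IDs always look like `TSB-` followed by digits, as in the samples seen so far. `strip_question_ids` is a hypothetical helper, not part of the chain defined here.

```python
import re

# Assumption: IDs always match "TSB-" plus digits.
ID_PATTERN = re.compile(r"\bTSB-\d+\b")

def strip_question_ids(text):
    """Remove TSB-style IDs and tidy up the leftover whitespace."""
    without_ids = ID_PATTERN.sub("", text)
    return re.sub(r"\s+", " ", without_ids).strip()

print(strip_question_ids("TSB-041 what do I do?"))
# prints what do I do?
```

A question that consists of nothing but an ID reference would come out nearly empty, so the context preview still has to carry the meaning in that case.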


In [10]:
# Step 2: Set up a rewriter chain using the rewriter prompt

# Set up the LLM for rewriting (following HW7 pattern)
rewriter_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.3)

# Create the rewriter chain (same LCEL pattern as HW7)
from langchain.schema import StrOutputParser

rewriter_chain = rewriter_prompt | rewriter_llm | StrOutputParser()


In [11]:
# Step 3: Define rewrite function that invokes rewriter_chain
# Main rewriting function (following HW7 iteration pattern)
def rewrite_questions_to_player_style(golden_data):
    """Rewrite formal RAGAS questions to sound like casual game players"""
    
    updated_data = []
    
    print("🔄 Rewriting questions to sound like real players...")
    print("-" * 50)
    
    for i, item in enumerate(golden_data, 1):
        eval_sample = item['eval_sample']
        original_q = eval_sample.get('user_input', '')
        
        # Get context preview for better rewriting
        contexts = eval_sample.get('reference_contexts', [])
        context_preview = str(contexts[0])[:200] + "..." if contexts else "No context available"
        
        # Rewrite the question using our chain
        try:
            rewritten_q = rewriter_chain.invoke({
                "original_question": original_q,
                "context_preview": context_preview
            })
            
            # Update the question in place
            eval_sample['user_input'] = rewritten_q.strip()
            eval_sample['original_question'] = original_q  # Keep original for comparison
            
            # Show progress
            print(f"✅ Question {i:02d}:")
            print(f"   Original: {original_q[:70]}...")
            print(f"   Rewritten: {rewritten_q.strip()[:70]}...")
            print()
            
        except Exception as e:
            print(f"❌ Error rewriting question {i}: {e}")
            # Keep original if rewriting fails
            eval_sample['original_question'] = original_q
        
        updated_data.append(item)
    
    print(f"✅ Processed {len(updated_data)} questions (any that failed keep their original text)")
    return updated_data


In [12]:
#Step 4: Run the re-writer on the original "golden" dataset

import json

# Load the original golden dataset
with open('./data/goldendataset.json', 'r') as f:
    original_golden_data = json.load(f)

print(f"📊 Loaded {len(original_golden_data)} questions to rewrite")

# Rewrite the questions
rewritten_data = rewrite_questions_to_player_style(original_golden_data)


📊 Loaded 12 questions to rewrite
🔄 Rewriting questions to sound like real players...
--------------------------------------------------
✅ Question 01:
   Original: In the context of the game, what is the significance of the Yzore in r...
   Rewritten: "Wait, what’s the deal with the Yzore? How does it help me get to the ...

✅ Question 02:
   Original: TSB-041 what do I do?...
   Rewritten: "Uh, I'm really lost here. What am I supposed to do as Fleebix? I'm st...

✅ Question 03:
   Original: What UHS mean in the game?...
   Rewritten: "Wait, what does UHS even mean in this game?"...

✅ Question 04:
   Original: WhaT is TSB-027?...
   Rewritten: "Wait, what am I supposed to do in this entry vestibule? I'm totally l...

✅ Question 05:
   Original: H0w do the residue printer and fingerprint analysis help in cracking t...
   Rewritten: "Hey, I'm totally lost! How do I use the residue printer and the finge...

✅ Question 06:
   Original: how use tokens for bus and find object like door or c

In [13]:
# Step 5: Save the rewritten dataset and create comparison report
with open('./data/platinum_dataset.json', 'w') as f:
    json.dump(rewritten_data, f, indent=2)

# Create a comparison report to review the changes
def create_comparison_report(rewritten_data, output_file='./data/question_rewrite_comparison.md'):
    with open(output_file, 'w') as f:
        f.write("# Question Rewriting Comparison Report\n\n")
        f.write("Comparison of original RAGAS questions vs. player-style questions\n\n")
        f.write("---\n\n")
        
        for i, item in enumerate(rewritten_data, 1):
            eval_sample = item['eval_sample']
            original = eval_sample.get('original_question', 'N/A')
            rewritten = eval_sample.get('user_input', 'N/A')
            
            f.write(f"## Question {i:02d}\n\n")
            f.write(f"**Original (RAGAS):** {original}\n\n")
            f.write(f"**Rewritten (Player Style):** {rewritten}\n\n")
            f.write("---\n\n")
    
    print(f"📝 Created comparison report: {output_file}")

create_comparison_report(rewritten_data)
print(f"💾 Saved rewritten dataset to: ./data/platinum_dataset.json")
print("\n🎯 Ready to use the player-style questions for RAGAS evaluation!")


📝 Created comparison report: ./data/question_rewrite_comparison.md
💾 Saved rewritten dataset to: ./data/platinum_dataset.json

🎯 Ready to use the player-style questions for RAGAS evaluation!


I ran a few iterations with different prompts, and am now reasonably happy with the questions.

Next step is to import the ```platinum_dataset.json``` for use by ragas eval

#### End of SDG work

### ▶️ Resume here to load data for any RAGAS eval

In [11]:
import json

# Load questions for testing retrievers
with open("data/platinum_dataset.json", "r") as f:
    platinum_data = json.load(f)



## Step 4: Setup the RAG chain


Starting with a "naive" dense vector retrieval

### R - Retrieval

This naive retriever will simply treat each hint as a document, and use cosine similarity to fetch the 10 most relevant documents.

> NOTE: We're choosing `10` as our `k` here to provide enough documents for our reranking process later

In [6]:
naive_retriever = vectorstore.as_retriever(search_kwargs={"k" : 10})
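
For intuition, here is roughly what "cosine similarity, top k" means, written in plain Python with toy 2-d vectors. The real vectors come from the embedding model and the search runs inside Qdrant, so this is an illustration of the idea, not of the vector store's internals. `cosine` and `top_k` are hypothetical helpers.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k vectors most similar to query_vec."""
    ranked = sorted(range(len(doc_vecs)), key=lambda i: cosine(query_vec, doc_vecs[i]), reverse=True)
    return ranked[:k]

# Toy document vectors; index 0 points almost the same way as the query.
docs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
print(top_k([1.0, 0.1], docs, k=2))
# prints [0, 1]
```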

### A - Augmented

My first pass at a Thud-like prompt, named as ```THUD_TEMPLATE```

This will need tuning!

In [7]:
from langchain_core.prompts import ChatPromptTemplate

THUD_TEMPLATE = """\
You are Thud, a friendly and somewhat simple-minded patron at The Thirsty Tentacle. 

You're trying your best to help the player navigate the game "The Space Bar."

Use the clues and context provided below to offer a gentle hint — not a full solution.

If you're not sure, say so, or suggest the player look around more.

Player's question:
{question}

Context:
{context}

Your hint:"""

rag_prompt = ChatPromptTemplate.from_template(THUD_TEMPLATE)

### G - Generation

We're going to leverage `gpt-4.1-nano` as our LLM today, as - again - we want this to largely be about the Retrieval process.

In [8]:
from langchain_openai import ChatOpenAI

chat_model = ChatOpenAI(model="gpt-4.1-nano")
#chat_model = ChatAnthropic(model="claude-3-5-sonnet-20240620")

### LCEL RAG Chain

We're going to use LCEL to construct our chain. (from HW9)


In [16]:
from langchain_core.runnables import RunnablePassthrough
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser

naive_retrieval_chain = (
    
    {"context": itemgetter("question") | naive_retriever, "question": itemgetter("question")}
    
    | RunnablePassthrough.assign(context=itemgetter("context"))

    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
).with_config({"run_name": "naive_retriever_chain"})

Test the chain, and the LangSmith tracing, with a question.
Might as well take the question from the platinum dataset (just remember to load it above ▶️)

In [31]:
sample_q = platinum_data[0]["eval_sample"]["user_input"]
naive_retrieval_chain.invoke({"question": sample_q})


{'response': AIMessage(content="Oh, Yzore is important because you need a special token to ride the bus there. Maybe there's something you can do at Glom Hole to get that token or figure out a way into the bus. Look around and see if you can find a clue or item related to the bus or the token!", additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 62, 'prompt_tokens': 2070, 'total_tokens': 2132, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4.1-nano-2025-04-14', 'system_fingerprint': 'fp_38343a2f8f', 'id': 'chatcmpl-C0IcJB7LA0t2SXT4aNbR1X6ilYuL3', 'service_tier': 'default', 'finish_reason': 'stop', 'logprobs': None}, id='run--2f4c421c-7ad9-4332-9172-a6a9e8e958c4-0', usage_metadata={'input_tokens': 2070, 'output_tokens': 62, 'total_tokens': 2132, 'input_token_details': {'

The design hypothesis is that multi-query or BM25 might be better. 
Let's define those chains now, so I can compare all three together

In [26]:
from langchain.retrievers.multi_query import MultiQueryRetriever

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=naive_retriever, llm=chat_model
)
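
The multi-query retriever asks the LLM for several rephrasings of the question, runs the base retriever once per rephrasing, and returns the de-duplicated union of the hits. The merge step can be sketched like this. `unique_union` is a hypothetical helper and hint IDs stand in for full documents.

```python
def unique_union(result_lists):
    """Merge several result lists, keeping the first occurrence of each item."""
    seen = set()
    merged = []
    for results in result_lists:
        for doc in results:
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

# Hypothetical hits for three rephrasings of one question.
per_query_hits = [["TSB-046", "TSB-048"], ["TSB-048", "TSB-045"], ["TSB-046"]]
print(unique_union(per_query_hits))
# prints ['TSB-046', 'TSB-048', 'TSB-045']
```

Because the union can hold more than `k` documents, the context passed to the prompt may be longer than with the naive retriever.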

In [33]:
multi_query_retrieval_chain = (
    {"context": itemgetter("question") | multi_query_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
).with_config({"run_name": "multi_query_chain"})

In [34]:
sample_q = platinum_data[0]["eval_sample"]["user_input"]
multi_query_retrieval_chain.invoke({"question": sample_q})

{'response': AIMessage(content="Oh, yzore, huh? Well, the Yzore planet's got a special token you need if you wanna hop on a bus there. Maybe look around for something that could be a token or a special item that helps you get on that bus. Just a little hint — sometimes those tokens are hidden or tucked away somewhere neat. Keep an eye out!", additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 74, 'prompt_tokens': 3071, 'total_tokens': 3145, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4.1-nano-2025-04-14', 'system_fingerprint': 'fp_38343a2f8f', 'id': 'chatcmpl-C0Ie2O670bHFk084QEellk0wsin69', 'service_tier': 'default', 'finish_reason': 'stop', 'logprobs': None}, id='run--e9aa8a58-bd39-4a5b-93bb-fcce248c173d-0', usage_metadata={'input_tokens': 3071, 'output_tokens': 74, 

BM25 next

In [37]:
from langchain_community.retrievers import BM25Retriever

bm25_retriever = BM25Retriever.from_documents(hint_data, k=10)

bm25_retrieval_chain = (
    {"context": itemgetter("question") | bm25_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
).with_config({"run_name": "bm25_chain"})
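
BM25 ranks by keyword overlap instead of embedding similarity: rare query terms count for more, and long documents are penalised a little. A compact sketch of the Okapi BM25 formula is below, with whitespace tokenisation and commonly used values for `k1` and `b`. It is meant for intuition only. `bm25_scores` is a hypothetical helper, and the retriever used in this notebook may tokenise and parameterise differently.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each doc against the query with Okapi BM25 (whitespace tokens)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for other in tokenized if term in other)  # docs containing the term
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
            tf = counts[term]
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(tokens) / avg_len))
        scores.append(score)
    return scores

hints = ["you need a token to get on a bus", "look around the entry vestibule", "press the escape key"]
scores = bm25_scores("bus token", hints)
print(scores.index(max(scores)))
# prints 0
```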

In [38]:
sample_q = platinum_data[0]["eval_sample"]["user_input"]
bm25_retrieval_chain.invoke({"question": sample_q})

{'response': AIMessage(content='Oh, the Yzore planet and the Yzore Yzore Yzore... Well, I remember Fleebix talking about the Yzore a little. The Yzore Yzore—uh, the Yzore Yzore—can help you figure out the way to Quantelope Lodge... but I think you might need to find something specific about the Yzore to get there. Maybe look around and check if the Yzore has any special features or clues that can guide your way. Sometimes, the Yzore helps by revealing the right route when you pay attention to the details. Just keep an eye out!', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 132, 'prompt_tokens': 1712, 'total_tokens': 1844, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 1536}}, 'model_name': 'gpt-4.1-nano-2025-04-14', 'system_fingerprint': 'fp_38343a2f8f', 'id': 'chatcmpl-C0Iql6DsHVbX7gnyn

## Step 5: Evaluate results with RAGAS

The next function is from HW9, updated to add some metadata to the LangSmith trace.

In [20]:
#define function to run selected retriever on the eval (platinum) data
# import time

def run_retriever_on_dataset(name, retriever_chain, platinum_data):
    """Run retriever and format for Ragas evaluation - using the platinum dataset"""
    print(f"Running {name} on platinum data")
    outputs = []
    
    for item in platinum_data:
        question = item["eval_sample"]["user_input"]
        reference = item["eval_sample"]["reference"]
        
        # Run retriever with LangSmith config for trace grouping
        response = (
            retriever_chain
            .with_config({
                "run_name": f"{name}_retriever_chain",
                "tags": ["platinum-eval", name],
                "metadata": {
                    "retriever": name,
                    "dataset": "platinum"
                }
            })
            .invoke({"question": question})
        )
        
        outputs.append({
            "user_input": question,
            "reference": reference,
            "response": response["response"].content if hasattr(response["response"], "content") else response["response"],
            "retrieved_contexts": [ctx.page_content for ctx in response["context"]],
            "retriever_name": name
        })

    # Optional delay between requests for rate limiting. To enable it, uncomment
    # `import time` above and loop with `for i, item in enumerate(platinum_data):`.
    #     if i < len(platinum_data) - 1:  # Don't sleep after last item
    #         print("  Waiting 2 seconds before next request...")
    #         time.sleep(2)  # Adjust this value as needed

    return outputs


More cells straight out of HW9

In [22]:
naive_outputs = run_retriever_on_dataset("naive", naive_retrieval_chain, platinum_data)
# bm25_outputs = run_retriever_on_dataset("bm25", bm25_retrieval_chain, platinum_data)
# multi_query_outputs = run_retriever_on_dataset("multi_query", multi_query_retrieval_chain, platinum_data)
# parent_doc_outputs = run_retriever_on_dataset("parent_doc", parent_document_retrieval_chain, platinum_data)
# ensemble_outputs = run_retriever_on_dataset("ensemble", ensemble_retrieval_chain, platinum_data)
# contextual_compression_outputs = run_retriever_on_dataset("contextual_compression", contextual_compression_retrieval_chain, platinum_data)

Running naive on platinum data


In [15]:
naive_outputs[:3]

[{'user_input': '"Wait, what’s the deal with the Yzore? How does it help me get to the Quantelope Lodge?"',
  'reference': "The hint_text indicates that in Yzore, you need a token to get on a bus, and there's a token in the cup. You can ask Thud to take the token from the cup, which is part of the process to reach the Quantelope Lodge.",
  'response': 'Oh, Yzore, huh? Well, I think the Yzore is kind of like a place where you need a special token to get on a bus. Maybe that token is important for your trip to Quantelope Lodge. Have you looked around Glom Hole or near the mailbox there? Sometimes, the things you need are hidden nearby or at a specific spot. I’d suggest checking out the mailbox or nearby areas to see if you can find something that might be your ticket—literally! Hope that helps a bit!',
  'retrieved_contexts': ['question_id: TSB-046\nhint_text: You need a token to get on a bus in Yzore\npuzzle_name: Getting to Quantelope Lodge\nsource: UHS',
   "question_id: TSB-048\nhint

In [23]:
import pandas as pd
from ragas import EvaluationDataset

# Step 1: Convert to DataFrame
naive_df = pd.DataFrame(naive_outputs)
# bm25_df = pd.DataFrame(bm25_outputs)
# multi_query_df = pd.DataFrame(multi_query_outputs)
# parent_doc_df = pd.DataFrame(parent_doc_outputs)
# ensemble_df = pd.DataFrame(ensemble_outputs)
# contextual_compression_df = pd.DataFrame(contextual_compression_outputs)

# Step 2: Convert to Ragas-compatible EvaluationDataset
naive_eval_dataset = EvaluationDataset.from_pandas(naive_df)
# bm25_eval_dataset = EvaluationDataset.from_pandas(bm25_df)
# multi_query_eval_dataset = EvaluationDataset.from_pandas(multi_query_df)
# parent_doc_eval_dataset = EvaluationDataset.from_pandas(parent_doc_df)
# ensemble_eval_dataset = EvaluationDataset.from_pandas(ensemble_df)
# contextual_compression_eval_dataset = EvaluationDataset.from_pandas(contextual_compression_df)
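
Before spending evaluator tokens, it can be worth checking that every row has the fields the metrics read. The sketch below works on the plain list of dicts produced by `run_retriever_on_dataset`. The field list is an assumption based on the columns used in this notebook, and `find_bad_rows` is a hypothetical helper.

```python
# Assumption: these are the fields the metrics in this notebook rely on.
REQUIRED_FIELDS = ("user_input", "reference", "response", "retrieved_contexts")

def find_bad_rows(rows):
    """Return indices of rows with a missing or empty required field, or non-list contexts."""
    bad = []
    for i, row in enumerate(rows):
        missing = [field for field in REQUIRED_FIELDS if not row.get(field)]
        if missing or not isinstance(row.get("retrieved_contexts"), list):
            bad.append(i)
    return bad

rows = [
    {"user_input": "q1", "reference": "r1", "response": "a1", "retrieved_contexts": ["c1"]},
    {"user_input": "q2", "reference": "r2", "response": "", "retrieved_contexts": ["c2"]},
]
print(find_bad_rows(rows))
# prints [1]
```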



In [24]:
naive_df
# bm25_df
# multi_query_df
# parent_doc_df
# ensemble_df
# contextual_compression_df

Unnamed: 0,user_input,reference,response,retrieved_contexts,retriever_name
0,"""Wait, what’s the deal with the Yzore? How doe...","The hint_text indicates that in Yzore, you nee...","Oh, Yzore is a pretty interesting planet! From...",[question_id: TSB-046\nhint_text: You need a t...,naive
1,"""Uh, I'm really lost here. What am I supposed ...","You can't do much on your own as Fleebix, beca...","Oh, hey there! If you're stuck in the jar as F...",[question_id: TSB-041\nhint_text: You can't do...,naive
2,"""Wait, what does UHS even mean in this game?""",UHS is referenced as the source for puzzles an...,"Oh, UHS, huh? That sounds like a clue to somet...",[question_id: TSB-069\nhint_text: To omplete t...,naive
3,"""Wait, what am I supposed to do in this entry ...",Thud is not very bright.,"Oh, hey there! In the entry vestibule, it migh...",[question_id: TSB-019\nhint_text: Look around ...,naive
4,"""Hey, I'm totally lost! How do I use the resid...",The residue printer can tell you who may have ...,"Hey there, friend! Looks like you wanna use th...",[question_id: TSB-039\nhint_text: You better c...,naive
5,"""Hey, how do I use these tokens for the bus? A...","In the context, you need a token to get on the...",Hey there! Looks like the tokens are a little ...,[question_id: TSB-046\nhint_text: You need a t...,naive
6,"""Hey, I'm totally lost! How do I use the simul...",To effectively navigate the Simulator and expl...,"Hey there! Hmm, setting that nav dial just rig...",[question_id: TSB-056\nhint_text: Have Thud se...,naive
7,"""Wait, how am I supposed to get Fleebix to fol...",Fleebix can't do much alone and needs Thud's h...,"Oh dear, it sounds like you're a bit tangled u...",[question_id: TSB-040\nhint_text: Fleebix will...,naive
8,"""Wait, how do I get to Quantelope Lodge with T...","First, you need to find Thud and Fleebix and g...","Hey there, friend! It sounds like you're tryin...",[question_id: TSB-045\nhint_text: To get to th...,naive
9,"""Wait, how do Fleebix and Thud even get to the...",Fleebix and Thud need to reach the Quantelope ...,Hi there! It sounds like you're trying to figu...,[question_id: TSB-041\nhint_text: When you are...,naive


Ragas imports

In [25]:
from ragas import evaluate, RunConfig
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import (
    LLMContextRecall,
    Faithfulness,
    FactualCorrectness,
    ResponseRelevancy,
    ContextEntityRecall,
    NoiseSensitivity,
)

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-mini"))

In [None]:
custom_run_config = RunConfig(timeout=600)
results = {}

datasets = [
    #("bm25", bm25_eval_dataset),
    ("naive", naive_eval_dataset),
    #("multi_query", multi_query_eval_dataset),
    #("parent_doc", parent_doc_eval_dataset),
    #("ensemble", ensemble_eval_dataset),
    #("contextual_compression", contextual_compression_eval_dataset)
]

for name, dataset in datasets:
    print(f"Evaluating: {name}")
    try:
        result = evaluate(
            dataset=dataset,
            metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
            llm=evaluator_llm,
            run_config=custom_run_config
        )
        results[name] = result
    except Exception as e:
        print(f"❌ Error during {name}: {e}")
        results[name] = None  # or skip entirely


In [None]:
for name in results:
    print(f"\n{name.upper()} RESULTS:")
    print(results[name])
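
Once more than one retriever has been evaluated, a side-by-side view is easier to read than separate printouts. The sketch below assumes each result has already been reduced to a plain dict of metric name to score, which is an assumption about post-processing, not the Ragas result object itself. It also assumes higher is better, which may not hold for every metric (noise sensitivity, for example, is usually read as lower-is-better). `best_per_metric` is a hypothetical helper, and the numbers are made up purely to show the shape.

```python
def best_per_metric(scores_by_retriever):
    """For each metric, name the retriever with the highest score; skip failed runs (None)."""
    best = {}
    for name, scores in scores_by_retriever.items():
        if scores is None:
            continue
        for metric, value in scores.items():
            if metric not in best or value > best[metric][1]:
                best[metric] = (name, value)
    return best

# Made-up numbers, NOT results from this notebook.
example = {
    "naive": {"context_recall": 0.6, "faithfulness": 0.8},
    "bm25": {"context_recall": 0.7, "faithfulness": 0.5},
    "multi_query": None,  # mirrors the `results[name] = None` failure path
}
print(best_per_metric(example))
# prints {'context_recall': ('bm25', 0.7), 'faithfulness': ('naive', 0.8)}
```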
