# Thudbot Scaffold

This will be a prototype Thudbot built in Jupyter.

Steps:
1. General setup
2. Data collection and preparation
3. Synthetic data generation (SDG) with RAGAS to create a golden dataset
4. Set up the RAG chain (finally)
5. Evaluate results with RAGAS
6. Refine RAG performance (prompt tuning, retrieval methods)

Once everything works:
- Convert to a standalone Python script
- Build or reuse a chatbot front end to run it locally


Naming this 00_ so that it will be the first notebook every time I start a new session.

## Step 1 General setup

In [2]:
### API key management and environment variables

### Reminder: place the .env file in the root of the project folder so that the call below, run from inside the notebook, finds the .env file and loads it into the notebook environment
### PLEASE ADD THIS `.env` FILE TO YOUR PROJECT'S `.gitignore` before committing and pushing changes to your remote repo, as it contains API keys and secrets

import os
from dotenv import load_dotenv

load_dotenv(dotenv_path="../.env", override=True)

# --- Verify API Keys ---
print("--- API Key Status ---")
print(f"OPENAI_API_KEY loaded: {'OPENAI_API_KEY' in os.environ}")
print(f"LANGCHAIN_API_KEY loaded: {'LANGCHAIN_API_KEY' in os.environ}")
print(f"TAVILY_API_KEY loaded: {'TAVILY_API_KEY' in os.environ}")
print(f"RAGAS_API_KEY loaded: {'RAGAS_API_KEY' in os.environ}")
print(f"ANTHROPIC_API_KEY loaded: {'ANTHROPIC_API_KEY' in os.environ}")
print(f"COHERE_API_KEY loaded: {'COHERE_API_KEY' in os.environ}")

# --- Verify General Settings ---
print("\n--- Project Settings Status ---")
print(f"DEBUG mode enabled: {os.environ.get('DEBUG') == 'True'}")
print(f"LangSmith Tracing V2 enabled: {os.environ.get('LANGCHAIN_TRACING_V2') == 'true'}")
print(f"LangChain Project Base: {os.environ.get('LANGCHAIN_PROJECT_BASE')}")
print(f"LangChain Project: {os.environ.get('LANGCHAIN_PROJECT')}")


--- API Key Status ---
OPENAI_API_KEY loaded: True
LANGCHAIN_API_KEY loaded: True
TAVILY_API_KEY loaded: True
RAGAS_API_KEY loaded: False
ANTHROPIC_API_KEY loaded: True
COHERE_API_KEY loaded: True

--- Project Settings Status ---
DEBUG mode enabled: True
LangSmith Tracing V2 enabled: True
LangChain Project Base: None
LangChain Project: THUDBOT-CC
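
The status printout above reports `True`/`False` but keeps going even when a key is missing. A minimal sketch of a fail-fast check follows. The split into required and optional keys is an assumption for illustration, and `missing_keys` / `check_environment` are hypothetical helper names, not part of the notebook.

```python
import os

# Assumption: which keys are required vs. optional is a guess for illustration.
REQUIRED_KEYS = ["OPENAI_API_KEY"]
OPTIONAL_KEYS = ["LANGCHAIN_API_KEY", "COHERE_API_KEY"]


def missing_keys(names, env=None):
    """Return the names that are absent or empty in the environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


def check_environment(env=None):
    """Raise if a required key is missing; return the missing optional keys."""
    missing_required = missing_keys(REQUIRED_KEYS, env)
    if missing_required:
        raise RuntimeError(f"Missing required keys: {', '.join(missing_required)}")
    return missing_keys(OPTIONAL_KEYS, env)
```

Calling `check_environment()` right after `load_dotenv(...)` would stop the session early instead of failing later inside a chain call.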


including nltk, because it worked before

In [3]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /Users/family/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/family/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [6]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-nano"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())
eval_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-nano"))
eval_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())



## Step 2: Data Collection and Preparation

My data is CSV structured, so using code from HW9

In [4]:
from langchain_community.document_loaders.csv_loader import CSVLoader
from datetime import datetime, timedelta

loader = CSVLoader(
    file_path="./data/Thudbot_Hint_Data_1.csv",
    metadata_columns=[
        "question",
        "hint_level",
        "character",
        "speaker",
        "narrative_context",
        "planet",
        "location",
        "category",
        "tone",
        "follow_up_hint_id",
        "answer_keywords",
        "tags"
    ]
)

hint_data = loader.load()

# No need to overwrite page_content; not doing custom transformation
print(hint_data[0].page_content)     # Columns not listed in metadata_columns (question_id, hint_text, puzzle_name, source)
print(hint_data[0].metadata)         # This will show all the metadata fields


question_id: TSB-001
hint_text: Press the escape key to exit the opening animations
puzzle_name: 
source: self
{'source': './data/Thudbot_Hint_Data_1.csv', 'row': 0, 'question': 'How do I stop the opening movie', 'hint_level': '1', 'character': 'Player', 'speaker': '', 'narrative_context': 'Meta', 'planet': '', 'location': '', 'category': 'Meta', 'tone': '', 'follow_up_hint_id': '', 'answer_keywords': '', 'tags': ''}
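
The output above shows how the columns get divided: anything listed in `metadata_columns` lands in `metadata`, and the remaining columns are joined as `key: value` lines in `page_content`. Here is a rough stdlib-only sketch of that split, written to mirror the observed output rather than the loader's actual source. `split_row` is a hypothetical helper and the two-column sample is a cut-down version of the first row.

```python
import csv
import io


def split_row(row, metadata_columns, source, row_index):
    """Mimic the page_content / metadata split seen in the output above."""
    content_lines = [
        f"{key.strip()}: {(value or '').strip()}"
        for key, value in row.items()
        if key not in metadata_columns
    ]
    metadata = {"source": source, "row": row_index}
    metadata.update({key: row[key] for key in metadata_columns if key in row})
    return "\n".join(content_lines), metadata


# Cut-down sample: only four of the CSV's columns, for illustration.
sample_csv = io.StringIO(
    "question_id,hint_text,question,hint_level\n"
    "TSB-001,Press the escape key to exit the opening animations,"
    "How do I stop the opening movie,1\n"
)
rows = list(csv.DictReader(sample_csv))
page_content, metadata = split_row(rows[0], ["question", "hint_level"], "sample.csv", 0)
print(page_content)
print(metadata)
```

One consequence worth keeping in mind: only `page_content` gets embedded, so the `question` column (stored as metadata here) does not influence retrieval.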


### Setting up Qdrant! (from HW9)

Now that we have our documents, let's create a Qdrant VectorStore with the collection name "Thudbot_Hints".

We'll leverage OpenAI's [`text-embedding-3-small`](https://openai.com/blog/new-embedding-models-and-api-updates) because it's a very powerful (and low-cost) embedding model.

 

In [5]:
from langchain_community.vectorstores import Qdrant
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Qdrant.from_documents(
    documents=hint_data,
    embedding=embeddings,
    location=":memory:",
    collection_name="Thudbot_Hints"
)

## Step 3: Synthetic data generation (SDG) with RAGAS to create a golden dataset

Adapted from HW 7 & 9

NOTE: this step is a one-time deal. Once the appropriate golden data set is finalized, don't run any of these cells, but pick up at the JSON import cell

### ⏭️ Skip these steps until/unless want new golden data

#### Group data by narrative context

The data is too granular for RAGAS; I got this error:

ValueError: Documents appears to be too short (ie 100 tokens or less). Please provide longer documents

So I will group it and create a new list called ```merged_docs```, just to feed to RAGAS, not for retrieval.

In [None]:
from langchain.schema import Document
from collections import defaultdict

# Group hints by narrative context
grouped = defaultdict(list)

for doc in hint_data:
    key = doc.metadata.get("narrative_context", "unknown")
    grouped[key].append(doc.page_content)

# Create longer Documents for SDG
merged_docs = [
    Document(
        page_content="\n".join(hints),
        metadata={"narrative_context": context}
    )
    for context, hints in grouped.items()
]

# Optional: preview length
for doc in merged_docs:
    print(f"{doc.metadata['narrative_context']}: {len(doc.page_content.split())} words")
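
Even after grouping, a small narrative context could still fall under the 100-token limit from the error above. Below is a rough sketch of a pre-check that separates long-enough groups from short ones. The 1.3 tokens-per-word ratio is an assumed heuristic, not a real tokenizer, and the `(label, text)` pairs stand in for the `merged_docs` entries; `"ExampleContext"` is a made-up label.

```python
def estimate_tokens(text, tokens_per_word=1.3):
    """Very rough token estimate; the 1.3 ratio is an assumption, not a tokenizer."""
    return int(len(text.split()) * tokens_per_word)


def split_by_length(docs, min_tokens=100):
    """Partition (label, text) pairs into long-enough and too-short lists."""
    long_enough, too_short = [], []
    for label, text in docs:
        target = long_enough if estimate_tokens(text) > min_tokens else too_short
        target.append((label, text))
    return long_enough, too_short


# Hypothetical groups: one short, one long.
docs = [("Meta", "word " * 20), ("ExampleContext", "word " * 200)]
keep, skip = split_by_length(docs)
print("keep:", [label for label, _ in keep])
print("skip:", [label for label, _ in skip])
```

Short groups could then be merged with a neighbour or simply left out of the SDG sample.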


Need to select a random subset, because the data is grouped

In [None]:
from ragas.testset import TestsetGenerator
import random

sampled_docs = random.sample(merged_docs, 5)
generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
golden_dataset = generator.generate_with_langchain_docs(sampled_docs, testset_size=10)
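
One caveat with the sampling above: `random.sample` raises `ValueError` when asked for more items than the list holds, and an unseeded sample changes on every run. A small sketch of a guarded, reproducible variant follows. `sample_docs` is a hypothetical helper and the seed value is arbitrary.

```python
import random


def sample_docs(docs, k=5, seed=42):
    """Return up to k docs, chosen reproducibly from a seeded generator."""
    rng = random.Random(seed)  # local generator, leaves the global RNG untouched
    return rng.sample(docs, min(k, len(docs)))


few = sample_docs(["a", "b", "c"])        # fewer than k docs available
many = sample_docs(list(range(20)))       # plenty of docs available
print(len(few), len(many))
```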

In [None]:
golden_dataset.to_pandas()

Save the RAGAS golden dataset as JSON

In [None]:
import json

with open("data/goldendataset.json", "w") as f:
    json.dump([sample.model_dump() for sample in golden_dataset], f, indent=2)


In [None]:
## Export Golden Dataset to Human-Readable Formats

import json
import pandas as pd

# Load the golden dataset
with open('./data/goldendataset.json', 'r') as f:
    golden_data = json.load(f)

# Extract the key information into a clean DataFrame
readable_data = []
for i, item in enumerate(golden_data):
    eval_sample = item['eval_sample']
    
    # Clean up the reference contexts (truncate if too long)
    contexts = eval_sample.get('reference_contexts', [])
    contexts_preview = str(contexts[0])[:200] + "..." if contexts else "No context"
    
    readable_data.append({
        'Question_ID': f"Q{i+1:02d}",
        'User_Question': eval_sample.get('user_input', ''),
        'Expected_Answer': eval_sample.get('reference', ''),
        'Synthesizer_Type': item.get('synthesizer_name', ''),
        'Context_Preview': contexts_preview,
        'Full_Context_Length': len(str(contexts)) if contexts else 0
    })

# Create DataFrame
df = pd.DataFrame(readable_data)

# Export to CSV
df.to_csv('./data/golden_dataset_readable.csv', index=False)

# Display preview
print("Golden Dataset Preview:")
print("=" * 60)
for _, row in df.head(3).iterrows():
    print(f"Question {row['Question_ID']}: {row['User_Question']}")
    print(f"Expected: {row['Expected_Answer']}")
    print(f"Synthesizer: {row['Synthesizer_Type']}")
    print("-" * 40)

print(f"\n✅ Exported {len(df)} questions to: ./data/golden_dataset_readable.csv")


In [None]:
## Option 2: Create Detailed Markdown Report

def create_markdown_report(golden_data, output_file='./data/golden_dataset_report.md'):
    """Create a detailed, human-readable markdown report of the golden dataset."""
    
    with open(output_file, 'w') as f:
        f.write("# Thudbot Golden Dataset Report\n\n")
        f.write("Generated synthetic test questions and answers for RAGAS evaluation.\n\n")
        f.write(f"**Total Questions:** {len(golden_data)}\n\n")
        f.write("---\n\n")
        
        for i, item in enumerate(golden_data, 1):
            eval_sample = item['eval_sample']
            
            f.write(f"## Question {i:02d}\n\n")
            f.write(f"**User Input:** {eval_sample.get('user_input', 'N/A')}\n\n")
            f.write(f"**Expected Answer:**\n{eval_sample.get('reference', 'N/A')}\n\n")
            f.write(f"**Synthesizer:** `{item.get('synthesizer_name', 'unknown')}`\n\n")
            
            # Add context information
            contexts = eval_sample.get('reference_contexts', [])
            if contexts:
                f.write("**Reference Context Preview:**\n")
                context_preview = str(contexts[0])[:300] + "..." if len(str(contexts[0])) > 300 else str(contexts[0])
                f.write(f"```\n{context_preview}\n```\n\n")
            
            f.write("---\n\n")
    
    print(f"✅ Created detailed report: {output_file}")

# Generate the markdown report
create_markdown_report(golden_data)


The answers are okay, but the questions are crappy. Let's see if we can get an LLM to rewrite the questions in a more natural game-player voice.

##### Five step plan to re-write the questions:

1. Define rewriter prompt
2. Set up a rewriter chain using the rewriter prompt
3. Define rewrite function that invokes rewriter_chain
4. Run the function on the original "golden" dataset
5. Write the re-written questions to .json and .md files

"Vibe-check" the output, and iterate until I like it!

In [None]:
## Question Rewriter: Transform Formal Questions → Casual Player Questions

# Using the same pattern as HW7 for LLM chains


In [None]:
# Step 1: Create the rewriter prompt (following HW7 prompt pattern)

from langchain.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

QUESTION_REWRITER_PROMPT = """\
You are helping rewrite questions to sound like real confused players asking for game hints.

Your job is to transform formal/academic questions into casual, human-like questions that real players would ask when stuck in "The Space Bar" adventure game.

IMPORTANT: If the question contains a question ID like "TSB-041" or similar, ignore that completely - it's just an organizational code that doesn't exist in the real game.

Original question: {original_question}
Game context: {context_preview}

Examples of good player questions:
- "How do I get past the guard?"
- "I'm stuck in this room, what do I do?"
- "Help! I can't figure out this puzzle"
- "Where's the token I need?"

Make it sound confused, or casual - like a real person, not an academic paper.

Rewritten question:"""

# Create the prompt template (same pattern as HW7)
rewriter_prompt = ChatPromptTemplate.from_template(QUESTION_REWRITER_PROMPT)


In [None]:
# Step 2: Set up a rewriter chain using the rewriter prompt

# Set up the LLM for rewriting (following HW7 pattern)
rewriter_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.3)

# Create the rewriter chain (same LCEL pattern as HW7)
from langchain_core.output_parsers import StrOutputParser

rewriter_chain = rewriter_prompt | rewriter_llm | StrOutputParser()


In [None]:
# Step 3: Define rewrite function that invokes rewriter_chain
# Main rewriting function (following HW7 iteration pattern)
def rewrite_questions_to_player_style(golden_data):
    """Rewrite formal RAGAS questions to sound like casual game players"""
    
    updated_data = []
    
    print("🔄 Rewriting questions to sound like real players...")
    print("-" * 50)
    
    for i, item in enumerate(golden_data, 1):
        eval_sample = item['eval_sample']
        original_q = eval_sample.get('user_input', '')
        
        # Get context preview for better rewriting
        contexts = eval_sample.get('reference_contexts', [])
        context_preview = str(contexts[0])[:200] + "..." if contexts else "No context available"
        
        # Rewrite the question using our chain
        try:
            rewritten_q = rewriter_chain.invoke({
                "original_question": original_q,
                "context_preview": context_preview
            })
            
            # Update the question in place
            eval_sample['user_input'] = rewritten_q.strip()
            eval_sample['original_question'] = original_q  # Keep original for comparison
            
            # Show progress
            print(f"✅ Question {i:02d}:")
            print(f"   Original: {original_q[:70]}...")
            print(f"   Rewritten: {rewritten_q.strip()[:70]}...")
            print()
            
        except Exception as e:
            print(f"❌ Error rewriting question {i}: {e}")
            # Keep original if rewriting fails
            eval_sample['original_question'] = original_q
        
        updated_data.append(item)
    
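    # Count only items whose question actually changed (failures keep the original text)
    rewritten_count = sum(
        1 for item in updated_data
        if item['eval_sample'].get('user_input') != item['eval_sample'].get('original_question')
    )
    print(f"✅ Successfully rewrote {rewritten_count} of {len(updated_data)} questions!")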
    return updated_data


In [None]:
# Step 4: Run the rewrite function on the original "golden" dataset

import json

# Load the original golden dataset
with open('./data/goldendataset.json', 'r') as f:
    original_golden_data = json.load(f)

print(f"📊 Loaded {len(original_golden_data)} questions to rewrite")

# Rewrite the questions
rewritten_data = rewrite_questions_to_player_style(original_golden_data)


In [None]:
# Step 5: Save the rewritten dataset and create comparison report
with open('./data/goldendataset_player_style.json', 'w') as f:
    json.dump(rewritten_data, f, indent=2)

# Create a comparison report to review the changes
def create_comparison_report(rewritten_data, output_file='./data/question_rewrite_comparison.md'):
    with open(output_file, 'w') as f:
        f.write("# Question Rewriting Comparison Report\n\n")
        f.write("Comparison of original RAGAS questions vs. player-style questions\n\n")
        f.write("---\n\n")
        
        for i, item in enumerate(rewritten_data, 1):
            eval_sample = item['eval_sample']
            original = eval_sample.get('original_question', 'N/A')
            rewritten = eval_sample.get('user_input', 'N/A')
            
            f.write(f"## Question {i:02d}\n\n")
            f.write(f"**Original (RAGAS):** {original}\n\n")
            f.write(f"**Rewritten (Player Style):** {rewritten}\n\n")
            f.write("---\n\n")
    
    print(f"📝 Created comparison report: {output_file}")

create_comparison_report(rewritten_data)
print(f"💾 Saved rewritten dataset to: ./data/goldendataset_player_style.json")
print("\n🎯 Ready to use the player-style questions for RAGAS evaluation!")


I ran a few iterations with different prompts, and am now reasonably happy with the questions.

Next step is to import the ```goldendataset_player_style.json``` for use by ragas eval

#### End of SDG work

### ▶️ Resume here to load data for any RAGAS eval

In [12]:
import json

# Load questions for testing your retrievers
with open("data/goldendataset_player_style.json", "r") as f:
    golden_data = json.load(f)

def run_retriever_on_dataset(name, retriever_chain, golden_data):
    """Run a retriever chain over the JSON-loaded golden data and format the outputs for RAGAS evaluation."""
    print(f"Running {name} on golden dataset")
    outputs = []
    
    for item in golden_data:
        question = item["eval_sample"]["user_input"]
        reference = item["eval_sample"]["reference"]
        
        # Run your retriever
        response = retriever_chain.invoke({"question": question})
        
        outputs.append({
            "user_input": question,
            "reference": reference,
            "response": response["response"].content if hasattr(response["response"], "content") else response["response"],
            "retrieved_contexts": [ctx.page_content for ctx in response["context"]],
            "retriever_name": name
        })
    
    return outputs
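
Since `run_retriever_on_dataset` indexes straight into `item["eval_sample"]["user_input"]` and `["reference"]`, a malformed entry in the JSON would only surface as a `KeyError` partway through a run. A minimal sketch of a pre-flight check follows. `find_invalid_items` is a hypothetical helper and the sample items are made up.

```python
REQUIRED_FIELDS = ("user_input", "reference")


def find_invalid_items(golden_data):
    """Return (index, missing_fields) for items lacking required eval_sample fields."""
    problems = []
    for index, item in enumerate(golden_data):
        eval_sample = item.get("eval_sample", {})
        missing = [field for field in REQUIRED_FIELDS if not eval_sample.get(field)]
        if missing:
            problems.append((index, missing))
    return problems


# Made-up items: the second lacks a reference, the third has no eval_sample at all.
sample_items = [
    {"eval_sample": {"user_input": "How do I stop the opening movie?", "reference": "Press escape."}},
    {"eval_sample": {"user_input": "Where is the token?"}},
    {},
]
print(find_invalid_items(sample_items))
```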

## Step 4: Setup the RAG chain


Starting with a "naive" dense vector retrieval

### R - Retrieval

This naive retriever will simply treat each hint as a document, and use cosine similarity to fetch the 10 most relevant documents.
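
To make the cosine-similarity ranking concrete, here is a small pure-Python sketch of top-k selection over toy vectors. It illustrates the idea only; the vector store handles this internally, and the two-dimensional vectors are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=10):
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [idx for _, idx in scored[:k]]


# Toy example: document 1 points the same way as the query, document 0 is orthogonal.
toy_docs = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
ranking = top_k([1.0, 0.0], toy_docs, k=2)
print(ranking)
```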

> NOTE: We're choosing `10` as our `k` here to provide enough documents for our reranking process later

In [7]:
naive_retriever = vectorstore.as_retriever(search_kwargs={"k" : 10})

### A - Augmented

My first pass at a Thud-like prompt, named as ```THUD_TEMPLATE```

This will need tuning!

In [8]:
from langchain_core.prompts import ChatPromptTemplate

THUD_TEMPLATE = """\
You are Thud, a friendly and somewhat simple-minded patron at The Thirsty Tentacle. 

You're trying your best to help the player navigate the game "The Space Bar."

Use the clues and context provided below to offer a gentle hint — not a full solution.

If you're not sure, say so, or suggest the player look around more.

Player's question:
{question}

Context:
{context}

Your hint:"""

rag_prompt = ChatPromptTemplate.from_template(THUD_TEMPLATE)

### G - Generation

We're going to leverage `gpt-4.1-nano` as our LLM today, as - again - we want this to largely be about the Retrieval process.

In [9]:
from langchain_openai import ChatOpenAI

chat_model = ChatOpenAI(model="gpt-4.1-nano")
#chat_model = ChatAnthropic(model="claude-3-5-sonnet-20240620")

### LCEL RAG Chain

We're going to use LCEL to construct our chain.


In [10]:
from langchain_core.runnables import RunnablePassthrough
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser

naive_retrieval_chain = (
    
    {"context": itemgetter("question") | naive_retriever, "question": itemgetter("question")}
    
    | RunnablePassthrough.assign(context=itemgetter("context"))

    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)
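
The data flow through the chain can be easier to follow without the LCEL operators. Below is a plain-Python walk-through of the same stages; it is not LCEL itself. `fake_retriever` and `fake_model` are hypothetical stand-ins for `naive_retriever` and `chat_model`, and the prompt text is shortened.

```python
def fake_retriever(question):
    """Hypothetical stand-in for naive_retriever: returns canned context strings."""
    return ["Press the escape key to exit the opening animations"]


def fake_model(prompt_text):
    """Hypothetical stand-in for chat_model: reports how much text it received."""
    return f"(model saw {len(prompt_text)} characters)"


def run_chain(inputs):
    # Stage 1: fetch context for the question and keep the question alongside it
    state = {"context": fake_retriever(inputs["question"]), "question": inputs["question"]}
    # Stage 2: fill the prompt, call the model, and pass the context through unchanged
    prompt_text = f"Player's question:\n{state['question']}\n\nContext:\n{state['context']}"
    return {"response": fake_model(prompt_text), "context": state["context"]}


result = run_chain({"question": "How do I stop the opening movie"})
print(result)
```

Returning the context next to the response is what lets the evaluation step later collect `retrieved_contexts` for each question.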

Test the chain, and the langsmith tracing with a question.
Might as well take the question from the golden data set (just remember to load it above ▶️)

In [13]:
sample_q = golden_data[0]["eval_sample"]["user_input"]
naive_retrieval_chain.invoke({"question": sample_q})


{'response': AIMessage(content="Well, Yzore's a tricky place, but I think the Yzore helps you because you need a special token to hop on a bus there. Maybe there's something at Glom Hole or around that mailbox that can help you get it. Keep looking around in those spots—you might find a clue or item that'll get you closer to your goal!", additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 71, 'prompt_tokens': 2067, 'total_tokens': 2138, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4.1-nano-2025-04-14', 'system_fingerprint': 'fp_38343a2f8f', 'id': 'chatcmpl-C0FBkhEpUPOjVrDBoI6Q5mzP4sJbT', 'service_tier': 'default', 'finish_reason': 'stop', 'logprobs': None}, id='run--0cbf6144-0850-4f7c-818a-32c915fc6738-0', usage_metadata={'input_tokens': 2067, 'output_tokens': 71, 'tot

#### RAGAS Evaluation

In [None]:
naive_outputs = run_retriever_on_dataset("naive", naive_retrieval_chain, golden_data)


In [None]:
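# Updated version of the code, with the retriever name externalized.
# Note: this version uses attribute access (test_row.eval_sample.user_input), so it expects
# the in-memory RAGAS testset (golden_dataset), not the JSON-loaded golden_data dicts.
# import time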

def run_retriever_on_dataset(name, retriever_chain, golden_dataset):
    print(f"Running {name} on golden dataset")
    outputs = []

    for test_row in golden_dataset:
        response = retriever_chain.invoke({
            "question": test_row.eval_sample.user_input})
        outputs.append({
            "user_input": test_row.eval_sample.user_input,
            "reference": test_row.eval_sample.reference,
            "response": response["response"].content if hasattr(response["response"], "content") else response["response"],
            "retrieved_contexts": [ctx.page_content for ctx in response["context"]],
            "retriever_name": name #this is the new addition for being able to keep track of the retriever name later
        })


    #    # Add delay between requests
    #     if i < len(golden_dataset) - 1:  # Don't sleep after last item
    #         print(f"  Waiting 2 seconds before next request...")
    #         time.sleep(2)  # Adjust this value as needed

    return outputs
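
The commented-out delay in the cell above refers to an index `i` that the loop never defines. If throttling is needed again, one possible shape is a small wrapper that pauses between items but not after the last one. `run_with_delay` is a hypothetical helper, and the `sleep` parameter exists only so the pause can be swapped out when trying it.

```python
import time


def run_with_delay(items, handler, delay_seconds=2.0, sleep=time.sleep):
    """Apply handler to each item, pausing between items but not after the last."""
    results = []
    for index, item in enumerate(items):
        results.append(handler(item))
        if index < len(items) - 1:  # don't sleep after the last item
            sleep(delay_seconds)
    return results


# Record the pauses instead of actually sleeping, to show the behaviour.
pauses = []
doubled = run_with_delay([1, 2, 3], lambda x: x * 2, delay_seconds=0.5, sleep=pauses.append)
print(doubled, pauses)
```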


In [None]:
# bm25_outputs = run_retriever_on_dataset("bm25", bm25_retrieval_chain, golden_dataset) #this was the first version, hard-coded
# naive_outputs = run_retriever_on_dataset("naive", naive_retrieval_chain, golden_dataset)
multi_query_outputs = run_retriever_on_dataset("multi_query", multi_query_retrieval_chain, golden_dataset)
# parent_doc_outputs = run_retriever_on_dataset("parent_doc", parent_document_retrieval_chain, golden_dataset)
# ensemble_outputs = run_retriever_on_dataset("ensemble", ensemble_retrieval_chain, golden_dataset)
# contextual_compression_outputs = run_retriever_on_dataset("contextual_compression", contextual_compression_retrieval_chain, golden_dataset)