<a href="https://colab.research.google.com/github/SarahOstermeier/TechnicalExercises/blob/main/Arize_Technical_Exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Planning

**Objective:**  Build a RAG application

## Approach

**Format:** Jupyter Notebook (Google Colab)  
**Stretch Goal:** Optimize performance (primary), Build UX (secondary)  
**Framework:** Langchain or DSPy  
**LLM Provider:** Huggingface or Mistral  
**Dataset:** [OpenStax](https://openstax.org/subjects)

****

## Requirements

**A working RAG app with some interface for Q&A**  
* ~75-80% of the time, 2-3 hours <br>


**Thorough documentation**  

* Clear setup instructions - make it so anyone can follow in your footsteps
* Tell us why you picked your tools
* Share what worked, what didn't, and how you dealt with it
* What would you do next if you had more time?
* ~20-25% of your time, 1 hour

****

## Tips

* Use those quickstart tools - no need to reinvent the wheel
* Document as you go - future you will thank you
* LLMs are your friend here, don’t be afraid to use them to help, just be sure you take the time to really understand what they tell you.
* Hit a wall? Don't spin your wheels - reach out!
* Keep it focused - better to nail the basics than half-finish three extra features

# How to run this notebook

## Set up project directory



In [37]:
# REPLACE WITH YOUR DIRECTORY PATH
project_dir_path = "/content/drive/MyDrive/Colab Notebooks/Arize_RAG_Exercise"

pdf_data_dir = "Data/PDF_Data"
vector_db_dir = "Data/Vector_DB"
eval_data_dir = "Data/Eval_Data"
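
The three subdirectories above are expected to exist under `project_dir_path`. A minimal sketch for creating any that are missing, assuming a hypothetical helper named `ensure_project_dirs` (it is not part of the original notebook):

```python
import os

def ensure_project_dirs(project_dir_path, subdirs):
    """Create each subdirectory under the project directory if it is missing."""
    created = []
    for subdir in subdirs:
        full_path = os.path.join(project_dir_path, subdir)
        # exist_ok=True makes this safe to re-run on an existing project
        os.makedirs(full_path, exist_ok=True)
        created.append(full_path)
    return created
```

In this notebook it could be called as `ensure_project_dirs(project_dir_path, [pdf_data_dir, vector_db_dir, eval_data_dir])` before any data is loaded.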

# My process

## Planning

### Approach and Tools
* I decided to work in Google Colab since it is a tool I am familiar with and will allow me to get started quickly without much setup.
* As my RAG framework I chose DSPy, as I'm interested in trying out DSPy Optimizers and thought this would be a good opportunity to do so.  
* Related to the above, my stretch goal is to optimize performance.
* I'll be using Mistral as my LLM provider, as I already have an account set up and can access it easily.

### Use Case and Dataset Selection
I started a project on Claude and provided the exercise instructions and the Jupyter Notebook I started as project content. I used Claude to brainstorm project ideas and related datasets and eventually decided to build a RAG tool to query textbooks, using documents from [OpenStax](https://openstax.org/subjects) as my dataset.




# Environment setup

## Install and import relevant libraries

[**DSPy:**](https://dspy.ai/) Framework for RAG application

**PyPDF2:** Extract text from PDFs

**Langchain:** Use the RecursiveCharacterTextSplitter for document chunking

**chromadb:** Vector database



In [1]:
!pip install PyPDF2
!pip install dspy
!pip install langchain
!pip install chromadb



In [41]:
# General imports
import os
import json
import random
import uuid

# Data processing
import PyPDF2
import pandas as pd
import numpy as np

# LLM application tools
import langchain
import dspy
import chromadb

## LLM setup


Set the model provider API keys as environment variables.

In [9]:
# Comment out if API keys are not saved in your Google Colab userdata
from google.colab import userdata

os.environ["MISTRAL_API_KEY"] = userdata.get('MISTRAL_API_KEY')
# os.environ["HUGGINGFACE_API_KEY"] = userdata.get('HUGGINGFACE_API_KEY')

## Uncomment and add API keys here if they are not saved in your Google Colab userdata
# os.environ["MISTRAL_API_KEY"] = 'YOUR_MISTRAL_API_KEY'
# os.environ["HUGGINGFACE_API_KEY"] = 'YOUR_HUGGINGFACE_API_KEY'
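
A small sketch of a fail-fast check that could run before the model is configured. The helper name `require_api_key` is hypothetical and not part of the original notebook; it only reads the environment variable and raises a readable error if the key is absent or empty.

```python
import os

def require_api_key(name):
    """Return the value of the environment variable `name`, or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Save it in Colab userdata or assign os.environ['{name}'] first."
        )
    return value
```

Calling `require_api_key("MISTRAL_API_KEY")` right after the cell above would surface a missing key before any request is sent.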

Access the LLM endpoint with DSPy.

In [10]:
lm = dspy.LM('mistral/mistral-small-latest')
dspy.configure(lm=lm)

Test the endpoint.


In [11]:
lm(messages=[{"role": "user", "content": "Say this is a test!"}])

["This is a test! How can I assist you further? Let's test something if you'd like. How about I say something and you respond with the first word that comes to your mind? I'll start:\n\nCat\n\n(What word does that make you think of?)"]

# Implementation

## Data Collection and Processing




### Data Collection

**Approach:** I manually downloaded two textbooks and saved them in a Google Drive folder, where I access them directly from my Colab notebook. Since this was already a fairly large dataset (textbooks are long!), I did not automate data collection for the purposes of this exercise.

**Possible future improvement:** Implement web scraping for an automated data collection process.


### Initialize ChromaDB vector database to store embeddings

**Approach:** I was initially storing embeddings in a simple list of dictionaries, but added the vector database later to make application development easier. The database is hosted in the project folder. I used the default embedding function from Chroma.
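
For context, a list-of-dictionaries store amounts to a brute-force similarity search. The sketch below is an illustration only, not the notebook's original code: the record layout, the toy 3-dimensional vectors, and the `top_k` helper are all assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_embedding, records, k=3):
    """Score every record against the query and return the k best matches."""
    scored = sorted(
        records,
        key=lambda r: cosine_similarity(query_embedding, r["embedding"]),
        reverse=True,
    )
    return scored[:k]

# Toy records standing in for embedded text chunks
records = [
    {"content": "chunk about neurons", "embedding": [1.0, 0.0, 0.0]},
    {"content": "chunk about cells", "embedding": [0.0, 1.0, 0.0]},
    {"content": "chunk about plants", "embedding": [0.0, 0.6, 0.8]},
]
best = top_k([0.1, 0.9, 0.1], records, k=2)
print([r["content"] for r in best])
```

Every query scans the whole list, and persistence has to be handled by hand, which is the work the vector database takes over.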


In [12]:
# Initialize ChromaDB
persist_dir = os.path.join(project_dir_path, vector_db_dir)

chroma_client = chromadb.PersistentClient(path=persist_dir)

In [18]:
from chromadb.utils import embedding_functions

# Using the default embedding function for now
embedding_function = embedding_functions.DefaultEmbeddingFunction()


# Create a new collection (or load the one that already exists in the project directory)
collection_name = "textbooks" # if you want to start from scratch, change the collection_name

# get_or_create_collection avoids checking list_collections() by hand, whose return
# type (collection objects vs. names) differs between ChromaDB versions, and it
# attaches the same embedding function whether the collection is new or loaded
collection = chroma_client.get_or_create_collection(
    name=collection_name,
    embedding_function=embedding_function
)


### Process the PDFs

**Approach:** I used Claude to generate the initial data processing functions. I wanted to save metadata such as page number and source title for each text chunk to allow for citations in the LLM responses.
I used LangChain's RecursiveCharacterTextSplitter for text chunking and started with the following parameters:

  chunk_size: 1000  
  chunk_overlap: 200

Later, I modified the function slightly when I added the ChromaDB vector database.
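
To make the two parameters concrete: `chunk_size` caps the length of each chunk, and `chunk_overlap` is the number of characters that consecutive chunks share. The sketch below is a simplified fixed-window splitter written for illustration only. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraph breaks, newlines, and spaces), and the function name is hypothetical.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size windows where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

# 2500 characters of dummy text
sample = "".join(str(i % 10) for i in range(2500))
chunks = sliding_window_chunks(sample)
print([len(c) for c in chunks])
```

With these defaults each window advances by 800 characters, so the last 200 characters of one chunk reappear at the start of the next.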

**Problems:**  
Initially I first split the documents by page and then chunked the text within each page. However, I ran into compatibility issues with the data structures the Claude-generated functions produced (Claude does not have access to recent library updates). After reviewing more recent LangChain and DSPy documentation, I updated my approach to take advantage of some simplified new functions and modified my data processing functions to do page splitting and chunking in one step to improve efficiency.


**Possible future improvement:** The current approach does not preserve chapter/section structure in the textbook or content such as images and structured tables. In a more sophisticated implementation it might be worth splitting the data in a more structured way and including details such as chapter and section in the metadata.

In [20]:
def process_documents_for_chroma(pdf_directory, text_splitter):
    """
    Process PDF documents and prepare them for ChromaDB ingestion
    """
    documents = []
    metadatas = []
    ids = []

    for filename in os.listdir(pdf_directory):
        if filename.endswith('.pdf'):
            file_path = os.path.join(pdf_directory, filename)
            try:
                # Extract text with metadata
                doc_chunks = extract_and_chunk_pdf(file_path, text_splitter)

                # Process each chunk
                for chunk in doc_chunks:
                    documents.append(chunk['content'])
                    metadatas.append(chunk['metadata'])
                    ids.append(str(uuid.uuid4()))

                print(f"Processed {filename}: {len(doc_chunks)} chunks extracted")
            except Exception as e:
                print(f"Error processing {filename}: {e}")

    return documents, metadatas, ids

def extract_and_chunk_pdf(pdf_path, text_splitter):
    """Extract text from a PDF file, split into chunks, and maintain metadata."""
    chunked_documents = []

    with open(pdf_path, 'rb') as file:
        pdf_reader = PyPDF2.PdfReader(file)

        # Get basic information
        title = os.path.basename(pdf_path).replace('.pdf', '')
        total_pages = len(pdf_reader.pages)

        # Process each page
        for page_num in range(total_pages):
            # Extract text from the page
            page = pdf_reader.pages[page_num]
            text = page.extract_text()

            # Skip empty pages
            if not text or len(text.strip()) < 50:  # Skip pages with little or no text
                continue

            # Create metadata for this page
            metadata = {
                'source': title,
                'page': page_num + 1,
                'total_pages': total_pages
            }

            # Split this page's text into chunks
            page_chunks = text_splitter.split_text(text)

            # Create a document for each chunk with proper metadata
            for chunk_idx, chunk in enumerate(page_chunks):
                chunked_documents.append({
                    'content': chunk,
                    'metadata': {
                        **metadata,
                        'chunk_id': f"{page_num}-{chunk_idx}"
                    }
                })

    return chunked_documents


In [21]:
# NOTE: Can skip this step if you loaded the existing collection from the project

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Set up the text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    is_separator_regex=False
)

documents, metadatas, ids = process_documents_for_chroma(os.path.join(project_dir_path, pdf_data_dir), text_splitter)

Processed Introduction_to_Behavioral_Neuroscience-WEB.pdf: 3495 chunks extracted
Processed ConceptsofBiology-WEB.pdf: 2319 chunks extracted


In [22]:
# NOTE: Can skip this step if you loaded the existing collection from the project.  It takes a while to run

# Add documents to the collection
collection.add(
    documents=documents,
    metadatas=metadatas,
    ids=ids
)


In [24]:
collection.peek()

{'ids': ['a3934cc1-36f4-41db-ae4a-9e54739549e2',
  '6257623d-1030-4725-98cb-80d69232d9d0',
  '18d02787-b6d7-4e2f-bffb-6bbfb2fcb334',
  '77507be1-c89f-428c-979b-02175c2487b1',
  'e9ae9325-1301-457c-b574-7bbcf6b055ed',
  '6c2dc307-a80d-4fac-8ffb-dfa66e8517ca',
  '79cea9f6-c4ff-474a-916d-a8e48ccb4119',
  'aba14d07-16a0-428c-b976-87a23646e637',
  '22c28e93-7fbf-4f24-b86f-a10bc8c4f3d3',
  'b543d50f-fb5f-4739-afee-8f92f3b06520'],
 'embeddings': array([[ 0.03137285, -0.04941513,  0.06488694, ...,  0.11583801,
         -0.06308822, -0.0459537 ],
        [-0.02433298, -0.0303662 , -0.03966172, ..., -0.02128791,
          0.0138738 ,  0.01283469],
        [-0.03150415, -0.01637185, -0.07035826, ..., -0.01801896,
          0.01942443,  0.01046865],
        ...,
        [ 0.01859296, -0.05630921, -0.02378133, ...,  0.07191464,
         -0.0512832 ,  0.0735969 ],
        [-0.11047661, -0.0109393 ,  0.00610211, ..., -0.01831879,
          0.08802343, -0.03965753],
        [-0.04994785, -0.00550003, 

## Application Design

### DSPy Module Setup

**Approach:** I designed the RAG application as an Education Assistant that answers questions posed by students using information retrieved from the textbooks. I also wanted the assistant to include some metadata in the form of a citation: the title of the textbook and the pages the information was retrieved from.  

**Problems:**
* Claude is pretty out of date with DSPy's current capabilities, so I was not able to rely heavily on generated code for this part.
* DSPy's retriever module does not provide a way to retrieve metadata along with the context, making it difficult to add the citation I wanted to my assistant's response. Eventually I had to implement a custom "MetadataRetriever" class (with some help from Claude) to bypass DSPy's retriever altogether and pass the retriever directly to my EducationalAssistant class when it is initialized.

**Possible future improvement:** Improve output formatting

In [25]:
# Define the RAG application
class EducationalQuery(dspy.Signature):
    """Query an educational assistant about textbook content."""
    question = dspy.InputField()
    context = dspy.InputField(desc="Retrieved passages from textbooks")
    answer = dspy.OutputField(desc="Comprehensive answer based on the retrieved information")
    sources = dspy.OutputField(desc="The sources used to answer the question")

class EducationalAssistant(dspy.Module):
    def __init__(self, retriever):
        super().__init__()
        self.retriever = retriever
        self.generate = dspy.ChainOfThought(EducationalQuery)

    def forward(self, question):
        # Get passages and metadata using a single query
        retrieved = self.retriever(question)

        # Create context from passages
        context = "\n\n".join(retrieved.passages)

        # Extract metadata for citations
        citation_strings = []
        for metadata in retrieved.metadatas:
            if isinstance(metadata, dict) and 'source' in metadata and 'page' in metadata:
                source = metadata['source'].replace('_', ' ').replace('-WEB', '')
                page = metadata['page']
                citation = f"{source} (Page {page})"
                if citation not in citation_strings:
                    citation_strings.append(citation)

        # Generate answer
        response = self.generate(
            question=question,
            context=context
        )

        # Format the citations as a bullet list
        citations_formatted = "\n".join([f"• {citation}" for citation in citation_strings])

        return dspy.Prediction(
            response=response.answer,
            citations=citations_formatted,
            context=context
            )

In [23]:
class MetadataRetriever:
    def __init__(self, collection_name, persist_dir, k=5):
        self.collection_name = collection_name
        self.persist_dir = persist_dir
        self.k = k
        self.client = chromadb.PersistentClient(path=persist_dir)
        self.collection = self.client.get_collection(collection_name)

    def __call__(self, query):
        # Query ChromaDB once to get both passages and metadata
        results = self.collection.query(
            query_texts=[query],
            n_results=self.k
        )

        # Create dspy-compatible result object
        retrieved_results = type('RetrievedPassages', (), {})()
        retrieved_results.passages = results['documents'][0]
        retrieved_results.metadatas = results['metadatas'][0]

        return retrieved_results
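
The `[0]` indexing above is needed because a Chroma query returns one list of results per query text. The sketch below uses a hand-written dictionary that mimics that shape (the values are made up for illustration) and `types.SimpleNamespace` as a lighter alternative to building a class with `type(...)`. The helper name `unpack_first_query` is hypothetical.

```python
from types import SimpleNamespace

# Hand-written stand-in for the dictionary returned by collection.query(...)
# when a single query text is passed; values are illustrative only.
mock_results = {
    "documents": [["passage one", "passage two"]],
    "metadatas": [[
        {"source": "ConceptsofBiology-WEB", "page": 125},
        {"source": "ConceptsofBiology-WEB", "page": 453},
    ]],
}

def unpack_first_query(results):
    """Return passages and metadatas for the first (and only) query text."""
    return SimpleNamespace(
        passages=results["documents"][0],
        metadatas=results["metadatas"][0],
    )

retrieved = unpack_first_query(mock_results)
print(retrieved.passages)
print(retrieved.metadatas[0]["page"])
```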

### DSPy Configuration

**Approach:** Initially I configured the embedder and retriever in DSPy, using the mistral-embed model, but I later switched to ChromaDB's default embedding model.

**Problems:**
* DSPy's documentation isn't very comprehensive and I was having issues with the Mistral embedding model, leading to the switch to ChromaDB to simplify handling embeddings. This also made it easier to set up the custom retriever.

**Possible future improvement:** Experiment with different embedding models to optimize for performance, optimize chunk size, experiment with different base LLMs

In [28]:
# Set up the retriever
retriever = MetadataRetriever(
    collection_name=collection_name,
    persist_dir=persist_dir,
    k=3
)

# Initialize the assistant with the retriever
assistant = EducationalAssistant(retriever)

# Configure DSPy to use the LLM (but not the retriever)
dspy.settings.configure(lm=lm)

### Test the assistant

In [29]:
response = assistant("What is cellular respiration?")
print(response.response)
print(response.citations)
print(response.context)

Cellular respiration is the process by which cells convert energy from nutrients into adenosine triphosphate (ATP), the primary energy currency of the cell. It involves several stages, including glycolysis, which is the initial pathway for breaking down glucose to extract energy. During cellular respiration, cells take in oxygen and release carbon dioxide, which is facilitated by the respiratory system. This process ensures that cells have the energy they need to function properly.
• Introduction to Behavioral Neuroscience (Page 727)
• ConceptsofBiology (Page 125)
• ConceptsofBiology (Page 453)
cells that mak e up muscles thr oughout the body under v oluntar y contr ol, greatly incr ease their activity during
exercise and ther efore greatly incr ease their need f or oxygen fr om the bloods tream. A t the same time , the y release
mor e carbon dio xide , which the bloods tream cir culat es to the lungs t o exhale .
To ensur e an op timal amount o f oxygen and s wift r emo val of carbon 

### Application refinement

Ideally I wanted this assistant to help students apply critical thinking and develop excitement for science. It should nudge students towards thinking for themselves rather than just providing an answer to a question.

To that end I created a new GuidedEducationalAssistant module with an updated prompt to promote more engagement from the student.

In [30]:
class GuidedEducationalQuery(dspy.Signature):
    """Guide students through a critical thinking process to discover answers about textbook content."""
    question = dspy.InputField()
    context = dspy.InputField(desc="Retrieved passages from textbooks")
    guided_answer = dspy.OutputField(desc="""
        Formulate a response that guides the student's thinking rather than simply providing facts.
        Follow this structure:
        1. Acknowledge the question and its importance
        2. Offer a partial insight or starting point based on the context
        3. Pose 1-2 guiding questions that help the student think through the concept
        4. Provide connections to related concepts they might already understand
        5. Conclude with a concise summary of the key points while encouraging further exploration

        The goal is to prompt critical thinking rather than delivering a complete answer.
        Use a warm, encouraging tone appropriate for education.
    """)
    sources = dspy.OutputField(desc="The sources used to answer the question")

class GuidedEducationalAssistant(EducationalAssistant):
    def __init__(self, retriever):
        super().__init__(retriever)
        # Use ChainOfThought with the new GuidedEducationalQuery signature
        self.generate = dspy.ChainOfThought(GuidedEducationalQuery)

    def forward(self, question):
        # Get passages and metadata using a single query
        retrieved = self.retriever(question)

        # Create context from passages
        context = "\n\n".join(retrieved.passages)

        # Extract metadata for citations
        citation_strings = []
        for metadata in retrieved.metadatas:
            if isinstance(metadata, dict) and 'source' in metadata and 'page' in metadata:
                source = metadata['source'].replace('_', ' ').replace('-WEB', '')
                page = metadata['page']
                citation = f"{source} (Page {page})"
                if citation not in citation_strings:
                    citation_strings.append(citation)

        # Generate guided response
        response = self.generate(
            question=question,
            context=context
        )

        # Format the citations as a bullet list
        citations_formatted = "\n".join([f"• {citation}" for citation in citation_strings])

        return dspy.Prediction(
            response=response.guided_answer,
            citations=citations_formatted,
            context=context
            )
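
The citation-building loop is repeated verbatim in both `forward` methods. One possible refactor, shown here as a sketch rather than the notebook's actual code, is to move it into a standalone helper that both assistants could call. The name `format_citations` is hypothetical.

```python
def format_citations(metadatas):
    """Build a de-duplicated bullet list of 'Title (Page N)' citations."""
    citation_strings = []
    for metadata in metadatas:
        if isinstance(metadata, dict) and "source" in metadata and "page" in metadata:
            # Clean up the file-name-based title the same way the assistants do
            source = metadata["source"].replace("_", " ").replace("-WEB", "")
            citation = f"{source} (Page {metadata['page']})"
            if citation not in citation_strings:
                citation_strings.append(citation)
    return "\n".join(f"• {citation}" for citation in citation_strings)

example_metadatas = [
    {"source": "Introduction_to_Behavioral_Neuroscience-WEB", "page": 727},
    {"source": "ConceptsofBiology-WEB", "page": 125},
    {"source": "ConceptsofBiology-WEB", "page": 125},  # duplicate is dropped
]
print(format_citations(example_metadatas))
```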

In [32]:
# Set up the guided assistant with the same retriever
guided_assistant = GuidedEducationalAssistant(retriever)

# Test the guided assistant
question = "What is cellular respiration?"
response = guided_assistant(question)

print(response.response)
print(response.citations)


Your question about cellular respiration is a great one! It's a fundamental process that helps us understand how our bodies produce energy.

First, let's start with what we know from the context. We see that muscles need more oxygen during exercise, and they release more carbon dioxide. This exchange of gases is crucial for cellular respiration. Cellular respiration is the process by which cells convert the energy from food into a usable form, called ATP. This process involves the exchange of gases—oxygen is taken in, and carbon dioxide is released.

To think through this concept, consider the following questions:
1. How might the increased need for oxygen during exercise relate to the process of cellular respiration?
2. If cellular respiration produces energy in the form of ATP, how might this energy be used by the muscles during exercise?

You might also think about how this process is similar to how a car engine uses fuel to produce energy. Just as a car needs oxygen to burn fuel ef

# Evaluation

**Approach:** To create a "golden" evaluation dataset I prompted Claude to generate 60 question and response pairs based on some sections of the textbooks, along with some response guidelines. I used the SemanticF1 metric to compare the golden responses to the actual responses.  

**Problems:**
* I had to keep the evaluation dataset small to save time/compute
* The "golden" dataset isn't really golden: it was LLM-generated and not reviewed by human experts, but it works for the purposes of the exercise

**Possible future improvement:**  
* Develop a more robust evaluation dataset
  * larger dataset
  * reviewed by human experts
  * include train, validation, and test splits
* Develop use-case-specific evaluation metrics, for example checking whether citations are correct or whether the assistant asks the student thought-provoking questions.
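
As a rough illustration of the last idea, a use-case-specific metric could check whether the response actually poses questions to the student. The sketch below follows the `metric(example, pred, trace=None)` calling convention that DSPy metrics use. The question-mark heuristic, the default threshold of 1, and the function name are assumptions made for illustration, and `SimpleNamespace` stands in for a `dspy.Prediction`.

```python
from types import SimpleNamespace

def asks_guiding_questions(example, pred, trace=None, min_questions=1):
    """Return True if the response contains at least `min_questions` question marks."""
    response = getattr(pred, "response", "") or ""
    return response.count("?") >= min_questions

# Stand-ins for predictions from the guided and vanilla assistants
guided = SimpleNamespace(response="Great question! How might oxygen demand relate to ATP production?")
plain = SimpleNamespace(response="Cellular respiration converts nutrients into ATP.")

print(asks_guiding_questions(None, guided))
print(asks_guiding_questions(None, plain))
```

A heuristic like this only checks form, not whether the questions are any good, so it would complement SemanticF1 rather than replace it.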

In [38]:
# Load the evaluation dataset
eval_data_dir = os.path.join(project_dir_path, eval_data_dir)
with open(os.path.join(eval_data_dir,'eval_data.json'), 'r') as f:
    eval_data = json.load(f)

In [47]:
eval_data_examples = [dspy.Example(**d).with_inputs('question') for d in eval_data]

# Load an example from the data
example = eval_data_examples[13]
example

Example({'question': 'What is the function of the cerebral cortex?', 'response': "You've asked about the brain's most evolutionarily advanced region! The cerebral cortex—the wrinkled outer layer of the cerebrum—is responsible for our highest cognitive functions including conscious thought, sensory perception, voluntary movement, language, reasoning, planning, and personality. It's organized into specialized regions (lobes) that handle different functions, though with considerable integration between areas. Have you thought about why the cortex is so wrinkled in humans compared to many other mammals? How might damage to specific cortical regions affect different abilities? Consider how this specialized yet integrated organization resembles other complex systems—like departments in a company that have distinct functions but must coordinate. In summary, the cerebral cortex represents the neural foundation for our most distinctly human abilities, enabling the complex cognitive and behavior

## Compare applications
I compared the guided assistant and the 'vanilla' assistant on the SemanticF1 metric, first on an initial example, and then on the whole evaluation dataset. The guided assistant outperformed the vanilla assistant by about 4.5 points (61.9% vs. 57.4%).

In [48]:
from dspy.evaluate import SemanticF1

# Instantiate the metric.
metric = SemanticF1(decompositional=True)

# Get predictions from both the guided and the regular models
pred_guided = guided_assistant(**example.inputs())
pred_regular = assistant(**example.inputs())

# Compute the metric score for the prediction for both models and compare results
score_guided = metric(example, pred_guided)
score_regular = metric(example, pred_regular)

print(f"Question: \t {example.question}\n")
print(f"Gold Response: \t {example.response}\n\n")
print(f"Guided Assistant Response: \t {pred_guided.response}\n")
print(f"Regular Assistant Response: \t {pred_regular.response}\n\n")
print(f"Guided Assistant Semantic F1 Score: {score_guided:.2f}\n")
print(f"Regular Assistant Semantic F1 Score: {score_regular:.2f}")

Question: 	 What is the function of the cerebral cortex?

Gold Response: 	 You've asked about the brain's most evolutionarily advanced region! The cerebral cortex—the wrinkled outer layer of the cerebrum—is responsible for our highest cognitive functions including conscious thought, sensory perception, voluntary movement, language, reasoning, planning, and personality. It's organized into specialized regions (lobes) that handle different functions, though with considerable integration between areas. Have you thought about why the cortex is so wrinkled in humans compared to many other mammals? How might damage to specific cortical regions affect different abilities? Consider how this specialized yet integrated organization resembles other complex systems—like departments in a company that have distinct functions but must coordinate. In summary, the cerebral cortex represents the neural foundation for our most distinctly human abilities, enabling the complex cognitive and behavioral repe

In [46]:
from dspy.evaluate import SemanticF1

# Instantiate the metric.
metric = SemanticF1(decompositional=True)

# Define an evaluator that we can re-use.
evaluate = dspy.Evaluate(
    devset=eval_data_examples,
    metric=metric,
    num_threads=24,
    display_progress=True,
    display_table=2
)

# Evaluate the guided learning assistant
evaluate(guided_assistant)

Average Metric: 37.14 / 60 (61.9%): 100%|██████████| 60/60 [00:19<00:00,  3.05it/s]

2025/03/06 08:13:04 INFO dspy.evaluate.evaluate: Average Metric: 37.14088928615047 / 60 (61.9%)





|   | question | example_response | document_id | pred_response | citations | context | SemanticF1 |
|---|---|---|---|---|---|---|---|
| 0 | What is the role of neurotransmitters in anxiety disorders? | You've asked about an important connection between neurobiology an... | 8b9c0d1e-2f3g-4h5i-6j7k-8l9m0n1o2p | Your question about the role of neurotransmitters in anxiety disor... | • Introduction to Behavioral Neuroscience (Page 660)\n• Introducti... | the am ygdala and bed nucleus o f the s tria t erminalis (BNS T) t... | ✔️ [0.400] |
| 1 | How does the nervous system control heart rate? | Excellent question about an essential regulatory process! Your ner... | 3t4u5v6w-7x8y-9z0a-1b2c-3d4e5f6g7h8i | Your question about how the nervous system controls heart rate is ... | • ConceptsofBiology (Page 448)\n• ConceptsofBiology (Page 454)\n• ... | sympathetic ner vous s ystem include an ac celerated hear t rate a... | ✔️ [0.635] |


61.9

In [49]:
# Evaluate the vanilla learning assistant
evaluate(assistant)

Average Metric: 34.42 / 60 (57.4%): 100%|██████████| 60/60 [00:21<00:00,  2.80it/s]

2025/03/06 08:14:43 INFO dspy.evaluate.evaluate: Average Metric: 34.42314100823044 / 60 (57.4%)





|   | question | example_response | document_id | pred_response | citations | context | SemanticF1 |
|---|---|---|---|---|---|---|---|
| 0 | What is the role of neurotransmitters in anxiety disorders? | You've asked about an important connection between neurobiology an... | 8b9c0d1e-2f3g-4h5i-6j7k-8l9m0n1o2p | Neurotransmitters play a crucial role in anxiety disorders, partic... | • Introduction to Behavioral Neuroscience (Page 660)\n• Introducti... | the am ygdala and bed nucleus o f the s tria t erminalis (BNS T) t... | ✔️ [0.469] |
| 1 | How does the nervous system control heart rate? | Excellent question about an essential regulatory process! Your ner... | 3t4u5v6w-7x8y-9z0a-1b2c-3d4e5f6g7h8i | The nervous system controls heart rate through the sympathetic and... | • ConceptsofBiology (Page 448)\n• ConceptsofBiology (Page 454)\n• ... | sympathetic ner vous s ystem include an ac celerated hear t rate a... | ✔️ [0.769] |


57.37
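
The percentages reported above are the summed per-example scores divided by the 60 examples. A quick check of the arithmetic, using the totals copied from the two evaluation logs:

```python
# Summed SemanticF1 scores copied from the evaluation logs above
guided_total = 37.14088928615047
vanilla_total = 34.42314100823044
num_examples = 60

guided_pct = 100 * guided_total / num_examples
vanilla_pct = 100 * vanilla_total / num_examples

print(f"Guided:  {guided_pct:.2f}%")
print(f"Vanilla: {vanilla_pct:.2f}%")
print(f"Difference: {guided_pct - vanilla_pct:.2f} points")
```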

# Optimize Prompts
Finally I experimented briefly with optimizing the prompt on my guided assistant.

**Approach:** I used DSPy's BootstrapFewShotWithRandomSearch since it was recommended by DSPy for smaller evaluation datasets (I started with only 20 samples).  

**Problems:**
* Lots of warnings about failing to deep copy the attribute 'retriever' of GuidedEducationalAssistant, possibly because I used a custom retriever. It didn't cause any major issues.

**Possible future improvement:** There's a lot to explore here.  DSPy offers many other optimization approaches, which would be worth experimenting with.
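
One concrete next step follows from the evaluation notes: the optimizer call that follows is given all 60 evaluation examples as its training set and appears to score candidates on those same examples, so held-out splits would give a more trustworthy comparison. A minimal sketch of a reproducible split, assuming a 60/20/20 ratio and a fixed seed (both arbitrary choices, as is the helper name `split_examples`):

```python
import random

def split_examples(examples, train_frac=0.6, val_frac=0.2, seed=42):
    """Shuffle a copy of the examples and cut it into train/validation/test lists."""
    shuffled = list(examples)
    # A dedicated Random instance keeps the split reproducible without
    # touching the global random state
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# Integers stand in for the 60 dspy.Example objects
train, val, test = split_examples(range(60))
print(len(train), len(val), len(test))
```

With real data this would be called as `split_examples(eval_data_examples)`, passing the training split as `trainset` and keeping the test split for the final comparison.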


In [50]:
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

# generate 8-shot examples of the program's steps
config = dict(max_bootstrapped_demos=4, max_labeled_demos=4, num_candidate_programs=3, num_threads=4)

teleprompter = BootstrapFewShotWithRandomSearch(metric=metric, **config)
optimized_program = teleprompter.compile(GuidedEducationalAssistant(retriever), trainset=eval_data_examples)



Going to sample between 1 and 4 traces per predictor.
Will attempt to bootstrap 3 candidate sets.
Average Metric: 37.14 / 60 (61.9%): 100%|██████████| 60/60 [00:09<00:00,  6.29it/s]

2025/03/06 08:18:53 INFO dspy.evaluate.evaluate: Average Metric: 37.14088928615047 / 60 (61.9%)



New best score: 61.9 for seed -3
Scores so far: [61.9]
Best score so far: 61.9
Average Metric: 37.14 / 60 (61.9%): 100%|██████████| 60/60 [00:07<00:00,  8.36it/s]

2025/03/06 08:19:00 INFO dspy.evaluate.evaluate: Average Metric: 37.14088928615047 / 60 (61.9%)



Scores so far: [61.9, 61.9]
Best score so far: 61.9


 13%|█▎        | 8/60 [00:01<00:11,  4.71it/s]


Bootstrapped 4 full traces after 8 examples for up to 1 rounds, amounting to 8 attempts.
Average Metric: 39.38 / 60 (65.6%): 100%|██████████| 60/60 [01:26<00:00,  1.44s/it]

2025/03/06 08:20:28 INFO dspy.evaluate.evaluate: Average Metric: 39.37815765992707 / 60 (65.6%)



New best score: 65.63 for seed -1
Scores so far: [61.9, 61.9, 65.63]
Best score so far: 65.63


 12%|█▏        | 7/60 [00:00<00:07,  7.35it/s]


Bootstrapped 4 full traces after 7 examples for up to 1 rounds, amounting to 7 attempts.
Average Metric: 40.46 / 60 (67.4%): 100%|██████████| 60/60 [00:41<00:00,  1.45it/s]

2025/03/06 08:21:10 INFO dspy.evaluate.evaluate: Average Metric: 40.46187959042771 / 60 (67.4%)



New best score: 67.44 for seed 0
Scores so far: [61.9, 61.9, 65.63, 67.44]
Best score so far: 67.44


  7%|▋         | 4/60 [00:00<00:07,  7.31it/s]


Bootstrapped 2 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Average Metric: 40.64 / 60 (67.7%): 100%|██████████| 60/60 [01:15<00:00,  1.27s/it]

2025/03/06 08:22:27 INFO dspy.evaluate.evaluate: Average Metric: 40.640160333451284 / 60 (67.7%)



New best score: 67.73 for seed 1
Scores so far: [61.9, 61.9, 65.63, 67.44, 67.73]
Best score so far: 67.73


  5%|▌         | 3/60 [00:00<00:07,  7.48it/s]


Bootstrapped 1 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Average Metric: 38.39 / 60 (64.0%): 100%|██████████| 60/60 [01:15<00:00,  1.25s/it]

2025/03/06 08:23:42 INFO dspy.evaluate.evaluate: Average Metric: 38.391522336934194 / 60 (64.0%)



Scores so far: [61.9, 61.9, 65.63, 67.44, 67.73, 63.99]
Best score so far: 67.73
6 candidate programs found.
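
Reading the log above: the first two candidates match the score of the unoptimized guided assistant, and the best candidate improves on it by almost 6 points. A short check of those numbers, with the scores copied from the final `Scores so far` line:

```python
# Scores copied from the final "Scores so far" line in the optimizer log
scores = [61.9, 61.9, 65.63, 67.44, 67.73, 63.99]
baseline = scores[0]  # first candidate, same score as the unoptimized guided assistant

best = max(scores)
print(f"Best score: {best}")
print(f"Improvement over baseline: {best - baseline:.2f} points")
```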


In [53]:
optimized_program.candidate_programs[0]['program']

generate.predict = Predict(StringSignature(question, context -> reasoning, guided_answer, sources
    instructions='Guide students through a critical thinking process to discover answers about textbook content.'
    question = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Question:', 'desc': '${question}'})
    context = Field(annotation=str required=True json_schema_extra={'desc': 'Retrieved passages from textbooks', '__dspy_field_type': 'input', 'prefix': 'Context:'})
    reasoning = Field(annotation=str required=True json_schema_extra={'prefix': "Reasoning: Let's think step by step in order to", 'desc': '${reasoning}', '__dspy_field_type': 'output'})
    guided_answer = Field(annotation=str required=True json_schema_extra={'desc': "\n        Formulate a response that guides the student's thinking rather than simply providing facts.\n        Follow this structure:\n        1. Acknowledge the question and its importance\n        2. Offe

In [54]:
optimized_program.save(os.path.join(project_dir_path, "optimized_program.json"))