<a href="https://colab.research.google.com/github/SarahOstermeier/TechnicalExercises/blob/main/Arize_Technical_Exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Planning

**Objective:**  Build a RAG application

## Approach

**Format:** Jupyter Notebook (Google Colab)  
**Stretch Goal:** Optimize performance (primary), Build UX (secondary)  
**Framework:** Langchain or DSPy  
**LLM Provider:** Huggingface or Mistral  
**Dataset:** [OpenStax](https://openstax.org/subjects)

****

## Requirements

**A working RAG app with some interface for Q&A**  
* ~75-80% of the time, 2-3 hours <br>


**Thorough documentation**  

* Clear setup instructions - make it so anyone can follow in your footsteps
* Tell us why you picked your tools
* Share what worked, what didn't, and how you dealt with it
* What would you do next if you had more time?
* ~20-25% of your time, 1 hour

****

## Tips

* Use those quickstart tools - no need to reinvent the wheel
* Document as you go - future you will thank you
* LLMs are your friend here, don’t be afraid to use them to help, just be sure you take the time to really understand what they tell you.
* Hit a wall? Don't spin your wheels - reach out!
* Keep it focused - better to nail the basics than half-finish three extra features

# How to run this notebook

# My process

## Planning

### Approach and Tools
* I decided to work in Google Colab since it is a tool I am familiar with and will allow me to get started quickly without much setup.
* As my RAG framework I chose DSPy, as I'm interested in trying out DSPy Optimizers and thought this would be a good opportunity to do so.  
* Related to the above, my stretch goal is to optimize performance.
* I'll be using Mistral as my LLM provider, as I already have an account set up and can access it easily.

### Use Case and Dataset Selection
I started a project on Claude and provided the exercise instructions and the Jupyter Notebook I started as project content. I used Claude to brainstorm project ideas and related datasets and eventually decided to build a RAG tool to query textbooks, using documents from [OpenStax](https://openstax.org/subjects) as my dataset.




# Environment set up

## Install and import relevant libraries

[**DSPy:**](https://dspy.ai/) Framework for the RAG application. DSPy provides a "prompts as code" library, enabling AI developers to standardize, modularize, and optimize their AI applications.

**PyPDF2:** extract text from PDFs

**langchain:** text chunking (RecursiveCharacterTextSplitter)

**chromadb:** vector database



In [1]:
!pip install PyPDF2
!pip install dspy
!pip install langchain
!pip install chromadb



In [23]:
# General imports
import os
import json

import uuid

# Data processing
import PyPDF2
import pandas as pd
import numpy as np

# LLM application tools
import langchain
import dspy
import chromadb

## Model set up


Set the model provider API keys as environment variables.

In [3]:
# Comment out if API keys are not saved in your google colab userdata
from google.colab import userdata
os.environ["MISTRAL_API_KEY"] = userdata.get('MISTRAL_API_KEY')
os.environ["HUGGINGFACE_API_KEY"] = userdata.get('HUGGINGFACE_API_KEY')

## Uncomment and add API keys here if they are not saved in your google colab userdata
# os.environ["MISTRAL_API_KEY"] = 'YOUR_MISTRAL_API_KEY'
# os.environ["HUGGINGFACE_API_KEY"] = 'YOUR_HUGGINGFACE_API_KEY'

Access the LLM endpoint with DSPy.

In [4]:
lm = dspy.LM('mistral/mistral-small-latest')
dspy.configure(lm=lm)

Test the endpoint.


In [5]:
lm(messages=[{"role": "user", "content": "Say this is a test!"}])

["This is a test! How can I assist you further? Let's test something if you'd like. How about I say something and you respond with the first word that comes to your mind? I'll start:\n\nCat\n\n(What word does that make you think of?)"]

# Implementation

## Data Collection and Processing

I started by downloading a couple of textbooks and accessing them directly from Google Drive. I will add web scraping later if there's time.


### Process the PDFs

**Approach:** I used Claude to generate the initial data processing functions. I wanted to save metadata such as page number and source title for each text chunk to allow for citations in the LLM responses.
I used LangChain's RecursiveCharacterTextSplitter class for text chunking and started with the following parameters:

  chunk_size: 1000  
  chunk_overlap: 200


**Problems:**  
Initially, I split the documents by page first and then chunked the text within each page. However, I ran into compatibility issues with the data structures that the Claude-generated functions produced (Claude does not have access to recent library updates). After reviewing more recent LangChain and DSPy documentation, I updated my approach to take advantage of some simplified newer functions and modified my data processing functions to do page splitting and chunking in one step, which is more efficient.


**Possible future improvement:** The current approach does not preserve the chapter/section structure of the textbook or content such as images and structured tables. In a more sophisticated implementation it might be worth doing more structured data splitting and including details such as chapter and section in the metadata.
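
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplified stand-in and not how RecursiveCharacterTextSplitter works internally: the LangChain splitter tries to break on separators such as paragraph breaks, newlines, and spaces before falling back to raw character windows. The function name `chunk_text` is hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into character windows of chunk_size that overlap by chunk_overlap.

    Simplified illustration only: no separator awareness, unlike LangChain's splitter.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "x" * 2500
print([len(c) for c in chunk_text(sample)])  # [1000, 1000, 900]
```

With the parameters used in this notebook, each new chunk starts 800 characters after the previous one, so the last 200 characters of one chunk are repeated at the start of the next. The overlap reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.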

In [6]:
from chromadb.utils import embedding_functions

def process_documents_for_chroma(pdf_directory, text_splitter):
    """
    Process PDF documents and prepare them for ChromaDB ingestion
    """
    documents = []
    metadatas = []
    ids = []

    for filename in os.listdir(pdf_directory):
        if filename.endswith('.pdf'):
            file_path = os.path.join(pdf_directory, filename)
            try:
                # Extract text with metadata
                doc_chunks = extract_and_chunk_pdf(file_path, text_splitter)

                # Process each chunk
                for chunk in doc_chunks:
                    documents.append(chunk['content'])
                    metadatas.append(chunk['metadata'])
                    ids.append(str(uuid.uuid4()))

                print(f"Processed {filename}: {len(doc_chunks)} chunks extracted")
            except Exception as e:
                print(f"Error processing {filename}: {e}")

    return documents, metadatas, ids

def extract_and_chunk_pdf(pdf_path, text_splitter):
    """Extract text from a PDF file, split into chunks, and maintain metadata."""
    chunked_documents = []

    with open(pdf_path, 'rb') as file:
        pdf_reader = PyPDF2.PdfReader(file)

        # Get basic information
        title = os.path.basename(pdf_path).replace('.pdf', '')
        total_pages = len(pdf_reader.pages)

        # Process each page
        for page_num in range(total_pages):
            # Extract text from the page
            page = pdf_reader.pages[page_num]
            text = page.extract_text()

            # Skip empty pages
            if not text or len(text.strip()) < 50:  # Skip pages with little or no text
                continue

            # Create metadata for this page
            metadata = {
                'source': title,
                'page': page_num + 1,
                'total_pages': total_pages
            }

            # Split this page's text into chunks
            page_chunks = text_splitter.split_text(text)

            # Create a document for each chunk with proper metadata
            for chunk_idx, chunk in enumerate(page_chunks):
                chunked_documents.append({
                    'content': chunk,
                    'metadata': {
                        **metadata,
                        'chunk_id': f"{page_num}-{chunk_idx}"
                    }
                })

    return chunked_documents


In [7]:
project_drive_dir = "/content/drive/MyDrive/Colab Notebooks/Arize RAG Exercise"
project_data_folder = "Data"

In [8]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Set up the text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    is_separator_regex=False
)

documents, metadatas, ids = process_documents_for_chroma(os.path.join(project_drive_dir, project_data_folder), text_splitter)

Processed Introduction_to_Behavioral_Neuroscience-WEB.pdf: 3495 chunks extracted
Processed ConceptsofBiology-WEB.pdf: 2319 chunks extracted


In [11]:
# Initialize ChromaDB
persist_dir = "/content/drive/MyDrive/Colab Notebooks/Arize RAG Exercise/Vector_DB"

chroma_client = chromadb.PersistentClient(path=persist_dir)

In [13]:
# Create a new collection (or get existing one)
# You can choose the embedding function based on your needs
embedding_function = embedding_functions.DefaultEmbeddingFunction()
collection_name = "textbook_data"
collection = chroma_client.get_or_create_collection(
    name=collection_name,
    embedding_function=embedding_function
)

In [14]:
# Add documents to the collection
collection.add(
    documents=documents,
    metadatas=metadatas,
    ids=ids
)


In [15]:
collection.peek()

{'ids': ['2d3cea33-afa6-4296-b38d-d9e33a54ed4c',
  'd5e2f742-e09d-4a92-a481-2f89ee102fc9',
  '003b82b7-b2e1-4564-8c49-b01e76495f48',
  '10c3fd8f-6569-4c10-a32c-482d58f2beac',
  'deb02e67-dab6-40b8-bd04-3e2d4de99ab4',
  '2b704352-0c87-4311-9220-ea71afcf6ae9',
  '4d494576-e93f-4bce-af80-4dbf147825d7',
  '442a6fd5-8eca-433f-9d72-61d1bc8713ac',
  'dad398a8-c891-4b16-8fa2-9d8c9381bd33',
  'afb8f992-488d-423f-9e4c-ff76441ff8a1'],
 'embeddings': array([[ 0.03137285, -0.04941513,  0.06488694, ...,  0.11583801,
         -0.06308822, -0.0459537 ],
        [-0.02433298, -0.0303662 , -0.03966172, ..., -0.02128791,
          0.0138738 ,  0.01283469],
        [-0.03150415, -0.01637185, -0.07035826, ..., -0.01801896,
          0.01942443,  0.01046865],
        ...,
        [ 0.01859296, -0.05630921, -0.02378133, ...,  0.07191464,
         -0.0512832 ,  0.0735969 ],
        [-0.11047661, -0.0109393 ,  0.00610211, ..., -0.01831879,
          0.08802343, -0.03965753],
        [-0.04994785, -0.00550003, 

## DSPy Setup for RAG

**Approach:** I configured the embedder and retriever in DSPy, using the mistral-embed model, retrieving 5 documents per query.

**Problems:**
* Claude is pretty out of date with DSPy's current capabilities, so I was not able to rely heavily on generated code for this part.
* I had to set a fairly small batch size for the Embedder to accommodate Mistral's token limit, since I didn't want to reduce the chunk size quite yet.

**Possible future improvement:** Experiment with different embedding models to optimize for performance, optimize chunk size
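
The batch-size workaround mentioned above can be sketched as follows. This is an illustration under assumptions, not the configuration used in this notebook: `batched`, `embed_in_batches`, and `fake_embed` are hypothetical names, the batch size of 8 is arbitrary, and `fake_embed` stands in for a real embedding API call so that the example runs offline.

```python
from typing import Callable, Iterator, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of items containing at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[list[str]], list[list[float]]],
    batch_size: int = 8,  # assumed value, chosen only for illustration
) -> list[list[float]]:
    """Call embed_fn once per batch so no single request carries too many tokens."""
    vectors: list[list[float]] = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors


def fake_embed(batch: list[str]) -> list[list[float]]:
    """Hypothetical stand-in for an embedding API: one 1-dimensional vector per text."""
    return [[float(len(text))] for text in batch]


print(embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2))
# [[1.0], [2.0], [3.0]]
```

Because the chunk size here is measured in characters while provider limits are measured in tokens, the batch size only roughly bounds the tokens per request. A smaller chunk size or a token-aware batcher would give tighter control.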

In [16]:
# Define the RAG application
class EducationalQuery(dspy.Signature):
    """Query an educational assistant about textbook content."""
    question = dspy.InputField()
    context = dspy.InputField(desc="Retrieved passages from textbooks")
    answer = dspy.OutputField(desc="Comprehensive answer based on the retrieved information")
    sources = dspy.OutputField(desc="The sources used to answer the question")

class EducationalAssistant(dspy.Module):
    def __init__(self, retriever):
        super().__init__()
        self.retriever = retriever
        self.generate = dspy.ChainOfThought(EducationalQuery)

    def forward(self, question):
        # Get passages and metadata using a single query
        retrieved = self.retriever(question)

        # Create context from passages
        context = "\n\n".join(retrieved.passages)

        # Extract metadata for citations
        citation_strings = []
        for metadata in retrieved.metadatas:
            if isinstance(metadata, dict) and 'source' in metadata and 'page' in metadata:
                source = metadata['source'].replace('_', ' ').replace('-WEB', '')
                page = metadata['page']
                citation = f"{source} (Page {page})"
                if citation not in citation_strings:
                    citation_strings.append(citation)

        # Generate answer
        response = self.generate(
            question=question,
            context=context
        )

        # Format the citations as a bullet list
        citations_formatted = "\n".join([f"• {citation}" for citation in citation_strings])

        return {
            "answer": response.answer,
            "citations": citations_formatted,
            "context": context
        }


In [17]:
class MetadataRetriever:
    def __init__(self, collection_name, persist_dir, k=5):
        self.collection_name = collection_name
        self.persist_dir = persist_dir
        self.k = k
        self.client = chromadb.PersistentClient(path=persist_dir)
        self.collection = self.client.get_collection(collection_name)

    def __call__(self, query):
        # Query ChromaDB once to get both passages and metadata
        results = self.collection.query(
            query_texts=[query],
            n_results=self.k
        )

        # Create dspy-compatible result object
        retrieved_results = type('RetrievedPassages', (), {})()
        retrieved_results.passages = results['documents'][0]
        retrieved_results.metadatas = results['metadatas'][0]

        return retrieved_results

In [18]:
# Set up the retriever
retriever = MetadataRetriever(
    collection_name=collection_name,
    persist_dir=persist_dir,
    k=3
)

# Initialize the assistant with the retriever
assistant = EducationalAssistant(retriever)

# Configure DSPy to use the LLM (but not the retriever)
dspy.settings.configure(lm=lm)

# Test the assistant
response = assistant("What is cellular respiration?")
print(response["answer"])
print("\nCitations:")
print(response["citations"])

Cellular respiration is the process by which cells convert energy from nutrients into adenosine triphosphate (ATP), the primary energy currency of the cell. It involves several stages, including glycolysis, which is the initial pathway for breaking down glucose to extract energy. During cellular respiration, cells take in oxygen and release carbon dioxide, which is facilitated by the respiratory system. This process ensures that cells have the energy they need to function properly.

Citations:
• Introduction to Behavioral Neuroscience (Page 727)
• ConceptsofBiology (Page 125)
• ConceptsofBiology (Page 453)


#### Critical Thinking Version

In [19]:
class GuidedEducationalQuery(dspy.Signature):
    """Guide students through a critical thinking process to discover answers about textbook content."""
    question = dspy.InputField()
    context = dspy.InputField(desc="Retrieved passages from textbooks")
    guided_answer = dspy.OutputField(desc="""
        Formulate a response that guides the student's thinking rather than simply providing facts.
        Follow this structure:
        1. Acknowledge the question and its importance
        2. Offer a partial insight or starting point based on the context
        3. Pose 1-2 guiding questions that help the student think through the concept
        4. Provide connections to related concepts they might already understand
        5. Conclude with a concise summary of the key points while encouraging further exploration

        The goal is to prompt critical thinking rather than delivering a complete answer.
        Use a warm, encouraging tone appropriate for education.
    """)
    sources = dspy.OutputField(desc="The sources used to answer the question")

class GuidedEducationalAssistant(EducationalAssistant):
    def __init__(self, retriever):
        super().__init__(retriever)
        # Use ChainOfThought with the new GuidedEducationalQuery signature
        self.generate = dspy.ChainOfThought(GuidedEducationalQuery)

    def forward(self, question):
        # Get passages and metadata using a single query
        retrieved = self.retriever(question)

        # Create context from passages
        context = "\n\n".join(retrieved.passages)

        # Extract metadata for citations
        citation_strings = []
        for metadata in retrieved.metadatas:
            if isinstance(metadata, dict) and 'source' in metadata and 'page' in metadata:
                source = metadata['source'].replace('_', ' ').replace('-WEB', '')
                page = metadata['page']
                citation = f"{source} (Page {page})"
                if citation not in citation_strings:
                    citation_strings.append(citation)

        # Generate guided response
        response = self.generate(
            question=question,
            context=context
        )

        # Format the citations as a bullet list
        citations_formatted = "\n".join([f"• {citation}" for citation in citation_strings])

        return {
            "guided_answer": response.guided_answer,
            "citations": citations_formatted,
            "context": context
        }

In [20]:
# Set up the guided assistant with the same retriever
guided_assistant = GuidedEducationalAssistant(retriever)

# Test the guided assistant
question = "What is cellular respiration?"
response = guided_assistant(question)

print("Original Question:", question)
print("\nGuided Answer:")
print(response["guided_answer"])
print("\nSources:")
print(response["citations"])

Original Question: What is cellular respiration?

Guided Answer:
Your question about cellular respiration is a great one! It's a fundamental process that helps us understand how our bodies produce energy.

First, let's start with what we know from the context. We see that muscles need more oxygen during exercise, and they release more carbon dioxide. This exchange of gases is crucial for cellular respiration. Cellular respiration is the process by which cells convert the energy from food into a usable form, called ATP. This process involves the exchange of gases—oxygen is taken in, and carbon dioxide is released.

To think through this concept, consider the following questions:
1. How might the increased need for oxygen during exercise relate to the process of cellular respiration?
2. If cellular respiration produces energy in the form of ATP, how might this energy be used by the muscles during exercise?

You might also think about how this process is similar to how a car engine uses f

# Evaluation

In [36]:
dspy.inspect_history()





[2025-03-06T05:08:25.705451]

System message:

Your input fields are:
1. `question` (str)
2. `context` (str): Retrieved passages from textbooks

Your output fields are:
1. `reasoning` (str)
2. `guided_answer` (str): 
        Formulate a response that guides the student's thinking rather than simply providing facts.
        Follow this structure:
        1. Acknowledge the question and its importance
        2. Offer a partial insight or starting point based on the context
        3. Pose 1-2 guiding questions that help the student think through the concept
        4. Provide connections to related concepts they might already understand
        5. Conclude with a concise summary of the key points while encouraging further exploration
        
        The goal is to prompt critical thinking rather than delivering a complete answer.
        Use a warm, encouraging tone appropriate for education.
    
3. `sources` (str): The sources used to answer the question

All in

In [39]:
eval_data_dir = os.path.join(project_drive_dir, "Eval Datasets")

with open(os.path.join(eval_data_dir,'eval_data.json'), 'r') as f:
    eval_dataset = json.load(f)

In [41]:
eval_data_examples = [dspy.Example(**d).with_inputs('question') for d in eval_dataset]

# Let's pick an `example` here from the data.
example = eval_data_examples[2]
example

Example({'question': 'What is the difference between mitosis and meiosis?', 'response': "You've asked about two fundamental cell division processes! Mitosis and meiosis differ in several key ways. Mitosis produces two genetically identical daughter cells with the same chromosome number as the parent cell, occurring in somatic cells for growth and repair. Meiosis, conversely, produces four genetically unique cells with half the original chromosome number, occurring only in germ cells for sexual reproduction. Have you considered why organisms need these different division processes? What would happen if egg and sperm cells were produced through mitosis instead? Think about how these processes relate to concepts like genetic diversity and evolutionary adaptation. How might errors in either process affect an organism? In summary, while both involve DNA replication and cell division, they serve distinct purposes—mitosis for growth and asexual reproduction, meiosis for creating genetic diver

In [42]:
import random

random.Random(0).shuffle(eval_data_examples)
trainset, devset = eval_data_examples[:20], eval_data_examples[20:40]

In [44]:
from dspy.evaluate import SemanticF1

# Instantiate the metric.
metric = SemanticF1(decompositional=True)

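# Generate a prediction for the same question as `example`, then wrap the
# dict returned by the assistant so the metric can read `pred.response`.
response = guided_assistant(example.question)
pred = dspy.Prediction(response=response['guided_answer'])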
score = metric(example, pred)

print(f"Question: \t {example.question}\n")
print(f"Gold Response: \t {example.response}\n")
print(f"Predicted Response: \t {pred.response}\n")
print(f"Semantic F1 Score: {score:.2f}")


Question: 	 What is the difference between mitosis and meiosis?

Gold Response: 	 You've asked about two fundamental cell division processes! Mitosis and meiosis differ in several key ways. Mitosis produces two genetically identical daughter cells with the same chromosome number as the parent cell, occurring in somatic cells for growth and repair. Meiosis, conversely, produces four genetically unique cells with half the original chromosome number, occurring only in germ cells for sexual reproduction. Have you considered why organisms need these different division processes? What would happen if egg and sperm cells were produced through mitosis instead? Think about how these processes relate to concepts like genetic diversity and evolutionary adaptation. How might errors in either process affect an organism? In summary, while both involve DNA replication and cell division, they serve distinct purposes—mitosis for growth and asexual reproduction, meiosis for creating genetic diversity th

In [43]:
from dspy.evaluate import SemanticF1

# Instantiate the metric.
metric = SemanticF1(decompositional=True)

# Define an evaluator that we can re-use.
evaluate = dspy.Evaluate(devset=devset, metric=metric, num_threads=24,
                         display_progress=True, display_table=2)

# Evaluate the assistant
evaluate(guided_assistant)

  0%|          | 0/20 [00:00<?, ?it/s]

2025/03/06 05:37:33 ERROR dspy.utils.parallelizer: Error processing item Example({'question': 'How does the immune system distinguish between self and non-self?', 'response': 'You\'ve asked about a critical aspect of immunity! The immune system distinguishes between "self" and "non-self" through several mechanisms, including negative selection of developing lymphocytes (eliminating those that strongly react to self-antigens), regulatory T cells that suppress immune responses against self, and the Major Histocompatibility Complex (MHC) molecules that present antigens to immune cells. These processes create "immunological tolerance" to self while maintaining reactivity against foreign material. Have you considered what might happen if these discrimination systems fail? What parallels might exist between this biological recognition system and security systems you\'re familiar with? How might understanding this process help explain autoimmune diseases? In summary, this self/non-self discri

Average Metric: 0.00 / 0 (0%):   5%|▌         | 1/20 [00:04<01:34,  4.96s/it]



Average Metric: 0.00 / 0 (0%):  10%|█         | 2/20 [00:05<00:41,  2.30s/it]

2025/03/06 05:37:35 ERROR dspy.utils.parallelizer: Error processing item Example({'question': 'What is the process of protein synthesis?', 'response': "You've asked about one of the most fundamental processes in all living cells! Protein synthesis is the process by which cells build proteins based on genetic instructions. It occurs in two main stages: transcription, where DNA information is copied into messenger RNA (mRNA) in the nucleus, and translation, where ribosomes in the cytoplasm use this mRNA template to assemble amino acids in the correct sequence. Have you considered the magnitude of this process—your cells make thousands of different proteins, each with specific functions? What might happen if errors occurred during protein synthesis? Think about how this molecular assembly line compares to manufacturing processes you're familiar with. In summary, protein synthesis represents biology's remarkable language translation system—converting the nucleic acid language of genes into

Average Metric: 0.00 / 0 (0%):  15%|█▌        | 3/20 [00:06<00:29,  1.75s/it]

2025/03/06 05:37:35 ERROR dspy.utils.parallelizer: Error processing item Example({'question': 'What is the process of photosynthesis?', 'response': "Excellent question about one of life's most fundamental processes! Photosynthesis is the process by which plants, algae, and some bacteria convert light energy into chemical energy stored in glucose, releasing oxygen as a byproduct. It occurs in two main stages: the light-dependent reactions, which capture energy from sunlight and convert it to chemical energy (ATP and NADPH), and the Calvin cycle (light-independent reactions), which uses this energy to fix carbon dioxide into glucose. Have you considered the global significance of this process? What would happen to our planet's ecosystems if photosynthesis suddenly decreased? How might this biological process compare to renewable energy technologies? In summary, photosynthesis represents nature's solar power system—capturing approximately 130 terawatts of energy globally each day, support

Average Metric: 0.00 / 0 (0%):  20%|██        | 4/20 [00:06<00:17,  1.10s/it]

AttributeError: 'dict' object has no attribute 'response'

In [34]:
response.response

AttributeError: 'dict' object has no attribute 'response'
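
The error above most likely arises because `forward` returns a plain `dict`, while the metric reads the prediction with attribute access (`pred.response`). The sketch below reproduces the mismatch with standard-library stand-ins so it runs without DSPy: `assistant_returning_dict`, `metric`, and `as_prediction` are hypothetical names, and `SimpleNamespace` plays the role that `dspy.Prediction(response=...)` would presumably play in the real notebook.

```python
from types import SimpleNamespace


def assistant_returning_dict(question: str) -> dict:
    """Stand-in for the assistant: returns a dict keyed by 'guided_answer'."""
    return {"guided_answer": f"Guided answer to: {question}"}


def metric(example, pred) -> float:
    """Stand-in metric that reads `pred.response` via attribute access."""
    return float(example.response == pred.response)


def as_prediction(result: dict) -> SimpleNamespace:
    """Adapt the dict so the answer is exposed under the attribute the metric expects."""
    return SimpleNamespace(response=result["guided_answer"])


example = SimpleNamespace(question="q", response="Guided answer to: q")
raw = assistant_returning_dict(example.question)

try:
    metric(example, raw)  # a dict has no `.response` attribute
except AttributeError as err:
    print(err)  # 'dict' object has no attribute 'response'

print(metric(example, as_prediction(raw)))  # 1.0
```

Applied to this notebook, one option would be to wrap the assistant in a small function that converts its output before passing it to the evaluator. Another would be to have `forward` return a prediction object with a `response` field directly. Either way, the field name has to match the gold field name in the evaluation dataset.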

In [None]:

# Compute the metric score for the prediction.
score = metric(example, pred)

print(f"Question: \t {example.question}\n")
print(f"Gold Response: \t {example.response}\n")
print(f"Predicted Response: \t {pred.response}\n")
print(f"Semantic F1 Score: {score:.2f}")

# LLM Usage Disclosure

The following LLM-based assistants were used in the development of this notebook:

Claude 3.7 Sonnet for:
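* Use case brainstorming and dataset selection
* Generating the initial data processing functions (later revised against current LangChain and DSPy documentation)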



## Authorship
All core components, concepts, and technical implementation of this notebook were authored by Sarah Ostermeier. LLM assistance was limited to the specific tasks listed above.
