<a href="https://colab.research.google.com/github/SarahOstermeier/TechnicalExercises/blob/main/Arize_Technical_Exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Planning

**Objective:**  Build a RAG application

## Approach

**Format:** Jupyter Notebook (Google Colab)  
**Stretch Goal:** Optimize performance (primary), Build UX (secondary)  
**Framework:** Langchain or DSPy  
**LLM Provider:** Huggingface or Mistral  
**Dataset:** [OpenStax](https://openstax.org/subjects)

****

## Requirements

**A working RAG app with some interface for Q&A**  
* ~75-80% of your time, 2-3 hours


**Thorough documentation**  

* Clear setup instructions - make it so anyone can follow in your footsteps
* Tell us why you picked your tools
* Share what worked, what didn't, and how you dealt with it
* What would you do next if you had more time?
* ~20-25% of your time, 1 hour

****

## Tips

* Use those quickstart tools - no need to reinvent the wheel
* Document as you go - future you will thank you
* LLMs are your friend here, don’t be afraid to use them to help, just be sure you take the time to really understand what they tell you.
* Hit a wall? Don't spin your wheels - reach out!
* Keep it focused - better to nail the basics than half-finish three extra features

# How to run this notebook

# My process

## Planning

### Approach and Tools
* I decided to work in Google Colab since it is a tool I am familiar with and will allow me to get started quickly without much setup.
* I chose DSPy as my RAG framework because I'm interested in trying out DSPy Optimizers and thought this would be a good opportunity to do so.  
* Related to the above, my stretch goal is to optimize performance.
* I'll be using HuggingFace or Mistral as my LLM provider, as I already have accounts for both and can access them easily.

### Use Case and Dataset Selection
I started a project on Claude and provided the exercise instructions and the Jupyter Notebook I started as project content. I used Claude to brainstorm project ideas and related datasets and eventually decided to build a RAG tool to query textbooks, using documents from [OpenStax](https://openstax.org/subjects) as my dataset.

## Implementation

# Environment set up

## Install and import relevant libraries

[**DSPy:**](https://dspy.ai/) Framework for the RAG application. DSPy provides a "prompts as code" library, enabling AI developers to standardize, modularize, and optimize their AI applications.

**PyPDF2:** To extract text from PDFs



In [1]:
!pip install PyPDF2
!pip install dspy
!pip install langchain
!pip install chromadb



In [2]:
import os

import PyPDF2
import pandas as pd
import numpy as np

import langchain
import dspy

## Model set up
In this tutorial I'll be accessing models through [Mistral](https://mistral.ai/) and through Hugging Face's [Serverless Inference API](https://huggingface.co/docs/api-inference/index), both of which can be used for free, with some limitations. The API calls will be made through **DSPy**, which [integrates with a wide range of model providers](https://dspy.ai/).

Set the model provider API keys as environment variables.

In [3]:
# Comment out if API keys are not saved in your google colab userdata
from google.colab import userdata
os.environ["MISTRAL_API_KEY"] = userdata.get('MISTRAL_API_KEY')
os.environ["HUGGINGFACE_API_KEY"] = userdata.get('HUGGINGFACE_API_KEY')

## Uncomment and add API keys here if they are not saved in your google colab userdata
# os.environ["MISTRAL_API_KEY"] = 'YOUR_MISTRAL_API_KEY'
# os.environ["HUGGINGFACE_API_KEY"] = 'YOUR_HUGGINGFACE_API_KEY'
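
Before calling any model, it can help to confirm that both keys actually made it into the environment. The following is a minimal sketch, not part of the original notebook; the helper name `missing_api_keys` is hypothetical, and it only checks that the variables are non-empty, not that the keys are valid.

```python
import os

# The two variable names used in the cell above.
REQUIRED_KEYS = ("MISTRAL_API_KEY", "HUGGINGFACE_API_KEY")

def missing_api_keys(required=REQUIRED_KEYS, environ=None):
    """Return the names of required keys that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]

missing = missing_api_keys()
if missing:
    print("Missing API keys:", ", ".join(missing))
else:
    print("All API keys are set.")
```

Running this right after the key setup cell gives a clear message instead of a provider authentication error later on.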

Access the LLM endpoint with DSPy.

In [4]:
lm = dspy.LM('mistral/mistral-small-latest')
dspy.configure(lm=lm)

Test the endpoint.


In [5]:

lm(messages=[{"role": "user", "content": "Say this is a test!"}])

["This is a test! How can I assist you further? Let's test something if you'd like. How about I say something and you respond with the first word that comes to your mind? I'll start:\n\nCat\n\n(What word does that make you think of?)"]

# Implementation

## Data Collection and Processing

I started by downloading a couple of textbooks and accessing them directly from Google Drive. I will add web scraping later if there's time.


### Process the PDFs

**Approach:** I used Claude to generate the initial data processing functions. I wanted to save metadata such as page number and source title for each text chunk to allow for citations in the LLM responses.
I used LangChain's RecursiveCharacterTextSplitter for text chunking and started with the following parameters:

  chunk_size: 1000  
  chunk_overlap: 200
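
To make the two parameters concrete, here is a minimal sliding-window sketch in plain Python. It is an illustration only, not the LangChain splitter: `RecursiveCharacterTextSplitter` tries to split on separators such as paragraph and line breaks before falling back to characters, while the hypothetical `chunk_text` helper below cuts purely by character position.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into windows of at most chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so neighbouring chunks share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks

sample = "x" * 2500
print([len(chunk) for chunk in chunk_text(sample)])  # [1000, 1000, 900]
```

With these settings, a 2,500-character page yields three chunks, and the last 200 characters of each chunk reappear at the start of the next one.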


**Problems:**  
Initially I did this by first splitting the documents by page and then chunking the text within each page. However, I ran into some compatibility issues with the data structures the Claude-generated functions produced (Claude does not have access to recent library updates). After reviewing more recent LangChain and DSPy documentation, I updated my approach to take advantage of some simplified new functions and modified my data processing functions to do page splitting and chunking in one step to improve efficiency.


**Possible future improvement:** The current approach does not preserve chapter/section structure in the textbook or content such as images and structured tables. In a more sophisticated implementation it might be worth doing some more structured data splitting and including more details such as chapter and section in the metadata.
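
One possible way to carry chapter and section into the metadata is sketched below. It is a rough illustration under a strong assumption: that section headings appear in the extracted text as their own line in the form `4.1 Energy and Metabolism`. That pattern, the `SECTION_HEADING` regex, and the `tag_pages_with_sections` helper are all hypothetical and have not been checked against the actual PDFs; a line such as `3.5 grams of sugar` would be a false positive.

```python
import re

# Assumed heading shape: "<chapter>.<section> <title>" on its own line.
SECTION_HEADING = re.compile(r"^(\d+)\.(\d+)\s+(\S.*)$")

def tag_pages_with_sections(page_texts):
    """Return one metadata dict per page, carrying the latest heading forward.

    If a page contains several headings, the last one on the page wins.
    Pages before the first heading get None for chapter and section.
    """
    current = {"chapter": None, "section": None}
    tagged = []
    for page_number, text in enumerate(page_texts, start=1):
        for line in text.splitlines():
            match = SECTION_HEADING.match(line.strip())
            if match:
                chapter, section, title = match.groups()
                current = {
                    "chapter": int(chapter),
                    "section": f"{chapter}.{section} {title.strip()}",
                }
        tagged.append({"page": page_number, **current})
    return tagged

pages = [
    "Preface text",
    "4.1 Energy and Metabolism\nCells need energy.",
    "More body text",
    "4.2 Glycolysis\nATP is produced.",
]
for entry in tag_pages_with_sections(pages):
    print(entry)
```

The resulting dicts could be merged into each chunk's metadata next to `source` and `page`. The `None` placeholders may need to be replaced with a default value before the metadata is stored in a vector database.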

In [6]:
import chromadb
from chromadb.utils import embedding_functions
import os
import uuid

def process_documents_for_chroma(pdf_directory):
    """
    Process PDF documents and prepare them for ChromaDB ingestion
    """
    documents = []
    metadatas = []
    ids = []

    for filename in os.listdir(pdf_directory):
        if filename.endswith('.pdf'):
            file_path = os.path.join(pdf_directory, filename)
            try:
                # Extract text with metadata
                doc_chunks = extract_and_chunk_pdf(file_path)

                # Process each chunk
                for chunk in doc_chunks:
                    documents.append(chunk['content'])
                    metadatas.append(chunk['metadata'])
                    ids.append(str(uuid.uuid4()))

                print(f"Processed {filename}: {len(doc_chunks)} chunks extracted")
            except Exception as e:
                print(f"Error processing {filename}: {e}")

    return documents, metadatas, ids

def extract_and_chunk_pdf(pdf_path):
    """Extract text from a PDF file, split into chunks, and maintain metadata."""
    chunked_documents = []

    with open(pdf_path, 'rb') as file:
        pdf_reader = PyPDF2.PdfReader(file)

        # Get basic information
        title = os.path.basename(pdf_path).replace('.pdf', '')
        total_pages = len(pdf_reader.pages)

        # Process each page
        for page_num in range(total_pages):
            # Extract text from the page
            page = pdf_reader.pages[page_num]
            text = page.extract_text()

            # Skip empty pages
            if not text or len(text.strip()) < 50:  # Skip pages with little or no text
                continue

            # Create metadata for this page
            metadata = {
                'source': title,
                'page': page_num + 1,
                'total_pages': total_pages
            }

            # Split this page's text into chunks (simple character-based chunking)
            chunk_size = 1000
            overlap = 200
            text_length = len(text)

            start = 0
            chunk_idx = 0

            while start < text_length:
                end = min(start + chunk_size, text_length)

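                # If this is not the first chunk, back up to include some overlap.
                # Note: end is computed before start moves back, so chunks after
                # the first can be up to chunk_size + overlap characters long.
                if start > 0:
                    start = max(0, start - overlap)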

                chunk_text = text[start:end]

                # Add the chunk with metadata
                chunked_documents.append({
                    'content': chunk_text,
                    'metadata': {
                        **metadata,
                        'chunk_id': f"{page_num}-{chunk_idx}"
                    }
                })

                # Move to next chunk
                start = end
                chunk_idx += 1

    return chunked_documents


In [10]:
project_drive_dir = "/content/drive/MyDrive/Colab Notebooks/Arize RAG Exercise"
project_data_folder = "Data"

In [14]:
documents, metadatas, ids = process_documents_for_chroma(os.path.join(project_drive_dir, project_data_folder))

Processed Introduction_to_Behavioral_Neuroscience-WEB.pdf: 3091 chunks extracted
Processed ConceptsofBiology-WEB.pdf: 2057 chunks extracted


In [50]:
# Initialize ChromaDB
persist_dir = "/content/drive/MyDrive/Colab Notebooks/Arize RAG Exercise/Vector_Data"

chroma_client = chromadb.PersistentClient(path=persist_dir)

In [51]:
# Create a new collection (or get existing one)
# You can choose the embedding function based on your needs
embedding_function = embedding_functions.DefaultEmbeddingFunction()
collection_name = "textbooks"
collection = chroma_client.get_or_create_collection(
    name=collection_name,
    embedding_function=embedding_function
)

In [53]:
# Add documents to the collection
collection.add(
    documents=documents,
    metadatas=metadatas,
    ids=ids
)


In [54]:
collection.peek()

{'ids': ['23b101d2-4a10-4854-9f57-78da025c62b9',
  '84a6e605-2e92-4a54-8b4b-51dc57fd7d4e',
  'ba804869-baa8-4b81-a053-647ba79d2eb2',
  'd063c29c-bc74-4382-9cb7-7a5be9a17404',
  'c500af61-7fd1-4815-afb0-e2773a9bf91d',
  '83e79544-a83e-4242-b315-5a4a8d29abc0',
  '43f1517a-a48d-488b-9f71-575bfeab3e6b',
  '98a2d589-3660-4dc6-a543-1e9c0333052d',
  '1d7b9d9f-f64f-44de-aa1c-fbe5c22634b6',
  'bc606aec-c280-431d-b86c-199e8af3922f'],
 'embeddings': array([[ 0.03137285, -0.04941513,  0.06488694, ...,  0.11583801,
         -0.06308822, -0.0459537 ],
        [-0.02444978, -0.03617395, -0.04078932, ..., -0.0215064 ,
          0.01420389,  0.01402987],
        [-0.02652909, -0.01673495, -0.06976717, ..., -0.00866218,
          0.02032539,  0.00517704],
        ...,
        [ 0.01859296, -0.05630921, -0.02378133, ...,  0.07191464,
         -0.0512832 ,  0.0735969 ],
        [-0.11047661, -0.0109393 ,  0.00610211, ..., -0.01831879,
          0.08802343, -0.03965753],
        [-0.04421325, -0.00484203, 

## DSPy Setup for RAG

**Approach:** I configured the embedder and retriever in DSPy, using the mistral-embed model, retrieving 5 documents per query.

**Problems:**
* Claude is pretty out of date with DSPy's current capabilities, so I was not able to rely heavily on generated code for this part.
* I had to set a pretty small batch size for the Embedder to accommodate Mistral's token limit, since I didn't want to reduce the chunk size quite yet.

**Possible future improvement:** Experiment with different embedding models to optimize for performance, optimize chunk size
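
The batch-size problem above can be illustrated with a small, provider-independent sketch. It groups chunks so that each request stays under both an item count and a size budget. The helper name `batch_by_budget` and the default limits are placeholders, not Mistral's documented limits, and character count is used only as a rough stand-in for tokens.

```python
def batch_by_budget(texts, max_items=8, max_chars=16000):
    """Group texts into consecutive batches that respect both limits.

    A single text longer than max_chars still gets a batch of its own.
    """
    batches = []
    current = []
    current_chars = 0
    for text in texts:
        too_many = len(current) >= max_items
        too_long = bool(current) and current_chars + len(text) > max_chars
        if too_many or too_long:
            # Close the current batch before it would exceed a limit.
            batches.append(current)
            current, current_chars = [], 0
        current.append(text)
        current_chars += len(text)
    if current:
        batches.append(current)
    return batches

chunks = ["a" * 1000] * 20
print([len(batch) for batch in batch_by_budget(chunks)])  # [8, 8, 4]
```

Each batch can then be sent to the embedding endpoint as one request. Lowering `max_chars` trades more requests for a lower risk of hitting the provider's limit.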

In [108]:
# Define your RAG application
class EducationalQuery(dspy.Signature):
    """Query an educational assistant about textbook content."""
    question = dspy.InputField()
    context = dspy.InputField(desc="Retrieved passages from textbooks")
    answer = dspy.OutputField(desc="Comprehensive answer based on the retrieved information")
    sources = dspy.OutputField(desc="The sources used to answer the question")

class EducationalAssistant(dspy.Module):
    def __init__(self, retriever):
        super().__init__()
        self.retriever = retriever
        self.generate = dspy.ChainOfThought(EducationalQuery)

    def forward(self, question):
        # Get passages and metadata using a single query
        retrieved = self.retriever(question)

        # Create context from passages
        context = "\n\n".join(retrieved.passages)

        # Extract metadata for citations
        citation_strings = []
        for metadata in retrieved.metadatas:
            if isinstance(metadata, dict) and 'source' in metadata and 'page' in metadata:
                source = metadata['source'].replace('_', ' ').replace('-WEB', '')
                page = metadata['page']
                citation = f"{source} (Page {page})"
                if citation not in citation_strings:
                    citation_strings.append(citation)

        # Generate answer
        response = self.generate(
            question=question,
            context=context
        )

        # Format the citations as a bullet list
        citations_formatted = "\n".join([f"• {citation}" for citation in citation_strings])

        return {
            "answer": response.answer,
            "citations": citations_formatted,
            "context": context
        }


In [115]:
class MetadataRetriever:
    def __init__(self, collection_name, persist_dir, k=5):
        self.collection_name = collection_name
        self.persist_dir = persist_dir
        self.k = k
        self.client = chromadb.PersistentClient(path=persist_dir)
        self.collection = self.client.get_collection(collection_name)

    def __call__(self, query):
        # Query ChromaDB once to get both passages and metadata
        results = self.collection.query(
            query_texts=[query],
            n_results=self.k
        )

        # Create dspy-compatible result object
        retrieved_results = type('RetrievedPassages', (), {})()
        retrieved_results.passages = results['documents'][0]
        retrieved_results.metadatas = results['metadatas'][0]

        return retrieved_results

In [109]:
# Set up the retriever
retriever = MetadataRetriever(
    collection_name=collection_name,
    persist_dir=persist_dir,
    k=3
)

# Initialize the assistant with the retriever
assistant = EducationalAssistant(retriever)

# Configure DSPy to use the LLM (but not the retriever)
dspy.settings.configure(lm=lm)

# Test the assistant
response = assistant("What is cellular respiration?")
print(response["answer"])
print("\nCitations:")
print(response["citations"])

Cellular respiration is the process by which cells generate energy from the reaction of oxygen with molecules derived from food. This process is essential for all living organisms as it provides the energy needed for various cellular activities. During cellular respiration, cells consume oxygen and produce carbon dioxide as a waste product. The energy generated is stored in the form of adenosine triphosphate (ATP), which is the primary energy currency of all cells. This energy is then used to power various cellular processes, including muscle contraction and other metabolic activities. The overall reaction of cellular respiration can be summarized as the reverse of photosynthesis, where oxygen is consumed, and carbon dioxide is released.

Citations:
• Introduction to Behavioral Neuroscience (Page 727)
• ConceptsofBiology (Page 125)
• ConceptsofBiology (Page 105)
• ConceptsofBiology (Page 114)
• Introduction to Behavioral Neuroscience (Page 729)
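
The citation step can also be checked on its own, without calling the retriever or the LLM. The sketch below lifts the same cleanup and de-duplication logic used in `EducationalAssistant.forward` into a standalone function; the name `format_citations` and the example metadata are illustrative, not part of the notebook.

```python
def format_citations(metadatas):
    """Build de-duplicated 'Title (Page N)' strings in retrieval order."""
    citations = []
    for metadata in metadatas:
        # Skip entries that lack the fields needed for a citation.
        if not isinstance(metadata, dict):
            continue
        if "source" not in metadata or "page" not in metadata:
            continue
        title = metadata["source"].replace("_", " ").replace("-WEB", "")
        citation = f"{title} (Page {metadata['page']})"
        if citation not in citations:
            citations.append(citation)
    return citations

# Illustrative metadata in the same shape the chunking step produces.
example = [
    {"source": "ConceptsofBiology-WEB", "page": 125},
    {"source": "ConceptsofBiology-WEB", "page": 125},
    {"source": "Introduction_to_Behavioral_Neuroscience-WEB", "page": 727},
]
for line in format_citations(example):
    print(f"• {line}")
```

Two chunks from the same page collapse into one citation, which is why the number of citations can be smaller than the number of retrieved passages.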


## Adding Educational Domain Knowledge

# LLM Usage Disclosure

The following LLM-based assistants were used in the development of this notebook:

Claude 3.7 Sonnet for:
* Use case brainstorming and dataset selection



## Authorship
All core components, concepts, and technical implementation of this notebook were authored by Sarah Ostermeier. LLM assistance was limited to the specific tasks listed above.
