<a href="https://colab.research.google.com/github/SarahOstermeier/TechnicalExercises/blob/main/Arize_Technical_Exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Planning

**Objective:**  Build a RAG application

## Approach

**Format:** Jupyter Notebook (Google Colab)  
**Stretch Goal:** Optimize performance (primary), Build UX (secondary)  
**Framework:** Langchain or DSPy  
**LLM Provider:** Huggingface or Mistral  
**Dataset:** [OpenStax](https://openstax.org/subjects)

****

## Requirements

**A working RAG app with some interface for Q&A**  
* ~75-80% of your time, 2-3 hours


**Thorough documentation**  

* Clear setup instructions - make it so anyone can follow in your footsteps
* Tell us why you picked your tools
* Share what worked, what didn't, and how you dealt with it
* What would you do next if you had more time?
* ~20-25% of your time, 1 hour

****

## Tips

* Use those quickstart tools - no need to reinvent the wheel
* Document as you go - future you will thank you
* LLMs are your friend here. Don't be afraid to use them to help, but take the time to really understand what they tell you.
* Hit a wall? Don't spin your wheels - reach out!
* Keep it focused - better to nail the basics than half-finish three extra features

# How to run this notebook

# My process

## Planning

### Approach and Tools
* I decided to work in Google Colab since it is a tool I am familiar with and it lets me get started quickly without much setup.
* As my RAG framework I chose DSPy, as I'm interested in trying out DSPy Optimizers and thought this would be a good opportunity to do so.  
* Related to the above, my stretch goal is to optimize performance.
* I'll be using HuggingFace or Mistral as my LLM provider, as I already have accounts for both and can access them easily.

### Use Case and Dataset Selection
I started a project on Claude and provided the exercise instructions and the Jupyter Notebook I started as project content. I used Claude to brainstorm project ideas and related datasets and eventually decided to build a RAG tool to query textbooks, using documents from [OpenStax](https://openstax.org/subjects) as my dataset.

## Implementation

# Environment set up

## Install and import relevant libraries

[**DSPy:**](https://dspy.ai/) Framework for RAG application. DSPy provides a "prompts as code" library, enabling AI developers to standardize, modularize, and optimize their AI applications.

**PyPDF2:** To extract text from PDFs



In [1]:
!pip install PyPDF2
!pip install dspy
!pip install langchain



In [2]:
import os

import PyPDF2
import pandas as pd
import numpy as np

import langchain
import dspy

## Model set up
In this tutorial I'll be accessing models through [Mistral](https://mistral.ai/) and through Huggingface's [Serverless Inference API](https://huggingface.co/docs/api-inference/index), both of which can be used for free, with some limitations. The API calls will be made through **DSPy**, which [integrates with a wide range of model providers](https://dspy.ai/).

Set the model provider API keys as environment variables.

In [3]:
# Comment out if API keys are not saved in your google colab userdata
from google.colab import userdata
os.environ["MISTRAL_API_KEY"] = userdata.get('MISTRAL_API_KEY')
# os.environ["HUGGINGFACE_API_KEY"] = userdata.get('HUGGINGFACE_API_KEY')

# Uncomment and add API keys here if they are not saved in your google colab userdata
# os.environ["MISTRAL_API_KEY"] = 'YOUR_MISTRAL_API_KEY'
# os.environ["HUGGINGFACE_API_KEY"] = 'YOUR_HUGGINGFACE_API_KEY'
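
The two alternatives in the cell above (Colab `userdata` versus pasting the key into the notebook) can be folded into a single lookup. This is a minimal sketch, not part of the original notebook: `get_api_key` is a hypothetical helper name, and the only Colab call it uses is `userdata.get`, which already appears above. Outside Colab the import simply fails and the helper falls back to an interactive prompt.

```python
import os
from getpass import getpass


def get_api_key(name, prompt_if_missing=True):
    """Return an API key from the environment, Colab userdata, or a prompt.

    `get_api_key` is a hypothetical helper, not part of DSPy or Colab.
    """
    # 1. An environment variable that is already set wins.
    value = os.environ.get(name)
    if value:
        return value

    # 2. Try Colab's secret store; this is skipped outside Colab or if the
    #    secret is not defined there.
    try:
        from google.colab import userdata
        value = userdata.get(name)
    except Exception:
        value = None

    # 3. Last resort: ask interactively so the key is never saved in the notebook.
    if not value and prompt_if_missing:
        value = getpass(f"Enter {name}: ")

    # Export the key so libraries that read the environment can find it.
    if value:
        os.environ[name] = value
    return value
```

Calling `get_api_key("MISTRAL_API_KEY")` before creating the `dspy.LM` object would then work both inside and outside Colab.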

Access the LLM endpoint with DSPy.

In [4]:
lm = dspy.LM('mistral/mistral-small-latest')
dspy.configure(lm=lm)

Test the endpoint.


In [5]:

lm(messages=[{"role": "user", "content": "Say this is a test!"}])

["This is a test! How can I assist you further? Let's test something if you'd like. How about I say something and you respond with the first word that comes to your mind? I'll start:\n\nCat\n\n(What word does that make you think of?)"]

# Implementation

## Data Collection and Processing

I started by downloading a couple of textbooks and accessing them directly from Google Drive. I will add web scraping later if there's time.


### Process the PDFs

**Approach:** I used Claude to generate the initial data processing functions. I wanted to save metadata such as page number and source title for each text chunk to allow for citations in the LLM responses.
I used LangChain's `RecursiveCharacterTextSplitter` class for text chunking and started with the following parameters:

  chunk_size: 1000  
  chunk_overlap: 200
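
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window chunker. It is only an illustration under a simplifying assumption: `RecursiveCharacterTextSplitter` additionally prefers to break on separators such as paragraph and line breaks, so its actual chunk boundaries will differ. `sliding_window_chunks` is a hypothetical name.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # with the defaults, each window starts 800 chars later
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


page_text = "x" * 2500  # stand-in for one page of extracted text
chunks = sliding_window_chunks(page_text)
print([len(c) for c in chunks])  # [1000, 1000, 900]
```

The last 200 characters of each chunk repeat at the start of the next one, which is what keeps a sentence that straddles a boundary retrievable from at least one chunk.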


**Problems:**  
Initially I did this by first splitting the documents by page and then chunking the text within each page. However, I ran into some compatibility issues with the data structures the Claude-generated functions produced (Claude does not have access to recent library updates). After reviewing more recent LangChain and DSPy documentation, I updated my approach to take advantage of some simplified new functions and modified my data processing functions to do page splitting and chunking in one step to improve efficiency.


**Possible future improvement:** The current approach does not preserve the chapter/section structure of the textbook or content such as images and structured tables. In a more sophisticated implementation it might be worth doing more structured data splitting and including details such as chapter and section in the metadata.

In [6]:
project_drive_dir = "/content/drive/MyDrive/Colab Notebooks/Arize RAG Exercise"
project_data_folder = "Data"

#### Old Approach

In [21]:
def extract_text_from_pdf(pdf_path):
    """Extract text from a PDF file, along with page numbers."""
    documents = []

    with open(pdf_path, 'rb') as file:
        pdf_reader = PyPDF2.PdfReader(file)

        # Get basic information
        title = os.path.basename(pdf_path).replace('.pdf', '')
        total_pages = len(pdf_reader.pages)

        # Process each page
        for page_num in range(total_pages):
            # Extract text from the page
            page = pdf_reader.pages[page_num]
            text = page.extract_text()

            # Skip empty pages
            if not text or len(text.strip()) < 50:  # Skip pages with little or no text
                continue

            # Create a document with metadata for citation
            documents.append({
                'content': text,
                'metadata': {
                    'source': title,
                    'page': page_num + 1,
                    'total_pages': total_pages
                }
            })

    return documents

def process_pdfs_in_directory(directory):
    """Process all PDFs in a directory."""
    all_documents = []
    for filename in os.listdir(directory):
        if filename.endswith('.pdf'):
            file_path = os.path.join(directory, filename)
            try:
                documents = extract_text_from_pdf(file_path)
                all_documents.extend(documents)
                print(f"Processed {filename}: {len(documents)} pages extracted")
            except Exception as e:
                print(f"Error processing {filename}: {e}")

    return all_documents

In [22]:
# Load my documents
pdf_directory = os.path.join(project_drive_dir, project_data_folder)
documents = process_pdfs_in_directory(pdf_directory)

# Save to CSV for later use
df = pd.DataFrame(documents)
textbook_content_path = os.path.join(project_drive_dir, 'textbook_content.csv')
df.to_csv(textbook_content_path, index=False)

Processed ConceptsofBiology-WEB.pdf: 612 pages extracted
Processed Introduction_to_Behavioral_Neuroscience-WEB.pdf: 912 pages extracted


#### New Approach

In [7]:
def extract_and_chunk_pdf(pdf_path, text_splitter):
    """Extract text from a PDF file, split into chunks, and maintain metadata."""
    chunked_documents = []

    with open(pdf_path, 'rb') as file:
        pdf_reader = PyPDF2.PdfReader(file)

        # Get basic information
        title = os.path.basename(pdf_path).replace('.pdf', '')
        total_pages = len(pdf_reader.pages)

        # Process each page
        for page_num in range(total_pages):
            # Extract text from the page
            page = pdf_reader.pages[page_num]
            text = page.extract_text()

            # Skip empty pages
            if not text or len(text.strip()) < 50:  # Skip pages with little or no text
                continue

            # Create metadata for this page
            metadata = {
                'source': title,
                'page': page_num + 1,
                'total_pages': total_pages
            }

            # Split this page's text into chunks
            page_chunks = text_splitter.split_text(text)

            # Create a document for each chunk with proper metadata
            for chunk_idx, chunk in enumerate(page_chunks):
                chunked_documents.append({
                    'content': chunk,
                    'metadata': {
                        **metadata,
                        'chunk_id': f"{page_num}-{chunk_idx}"
                    }
                })

    return chunked_documents

def process_pdfs_in_directory(directory, text_splitter):
    """Process all PDFs in a directory, splitting into chunks with metadata."""
    all_chunked_documents = []
    for filename in os.listdir(directory):
        if filename.endswith('.pdf'):
            file_path = os.path.join(directory, filename)
            try:
                chunked_docs = extract_and_chunk_pdf(file_path, text_splitter)
                all_chunked_documents.extend(chunked_docs)
                print(f"Processed {filename}: {len(chunked_docs)} chunks extracted")
            except Exception as e:
                print(f"Error processing {filename}: {e}")

    return all_chunked_documents

In [8]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Set up the text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    is_separator_regex=False
)

# Process all PDFs and get chunks with metadata
pdf_directory = os.path.join(project_drive_dir, project_data_folder)
chunked_documents = process_pdfs_in_directory(pdf_directory, text_splitter)

# Save to CSV for later use if needed
chunked_df = pd.DataFrame(chunked_documents)
textbook_chunks_path = os.path.join(project_drive_dir, 'textbook_chunks.csv')
chunked_df.to_csv(textbook_chunks_path, index=False)

# Extract just the content for the DSPy retriever
chunk_texts = [doc['content'] for doc in chunked_documents]

Processed ConceptsofBiology-WEB.pdf: 2319 chunks extracted
Processed Introduction_to_Behavioral_Neuroscience-WEB.pdf: 3495 chunks extracted
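
One caveat on the CSV export: `to_csv` writes the nested `metadata` dict as its string representation, so reading the file back yields strings rather than dicts. A minimal sketch of an alternative, assuming JSON Lines is an acceptable storage format. `save_chunks_jsonl` and `load_chunks_jsonl` are hypothetical helper names, and the sample record is invented to mirror the structure produced above.

```python
import json


def save_chunks_jsonl(chunks, path):
    """Write one JSON object per line so nested metadata stays structured."""
    with open(path, "w", encoding="utf-8") as f:
        for chunk in chunks:
            f.write(json.dumps(chunk, ensure_ascii=False) + "\n")


def load_chunks_jsonl(path):
    """Read the chunks back as a list of dicts, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Invented sample record with the same keys as the chunk dicts above.
sample_chunks = [
    {
        "content": "Example chunk text.",
        "metadata": {"source": "ExampleBook", "page": 1, "total_pages": 10, "chunk_id": "0-0"},
    }
]
```

With this, `load_chunks_jsonl(path)[0]["metadata"]["page"]` comes back as an integer instead of being buried in a string.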


## DSPy Setup for RAG

**Approach:** I configured the embedder and retriever in DSPy, using the `mistral-embed` model and retrieving 5 documents per query.

**Problems:**
* Claude is pretty out of date with DSPy's current capabilities, so I was not able to rely heavily on generated code for this part.
* I had to set a pretty small batch size for the Embedder to accommodate Mistral's token limit, since I didn't want to reduce the chunk size quite yet.

**Possible future improvement:** Experiment with different embedding models to optimize for performance, optimize chunk size
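
For intuition on the batch size: the embedder sends the corpus to the API in groups, so a smaller batch size means fewer tokens per request at the cost of more requests. The sketch below shows that grouping in plain Python. It is not DSPy's internal code, `batched` is a hypothetical helper, and the corpus is a placeholder sized to match the chunk counts reported above (2319 + 3495 = 5814).

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


corpus = [f"chunk {i}" for i in range(5814)]  # placeholder texts, not real chunks
batches = list(batched(corpus, batch_size=25))
print(len(batches), len(batches[-1]))  # 233 14
```

With chunks of up to 1000 characters, a batch of 25 caps each request at roughly 25,000 characters of input, which is the lever used here instead of shrinking the chunks.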

In [37]:
# !pip install -U faiss-cpu  # or faiss-gpu if you have a GPU

Collecting faiss-cpu
  Downloading faiss_cpu-1.10.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (4.4 kB)
Downloading faiss_cpu-1.10.0-cp311-cp311-manylinux_2_28_x86_64.whl (30.7 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.10.0


In [9]:
# Set up the embedder and retriever
embedder = dspy.Embedder(model='mistral/mistral-embed', batch_size=25)
retriever = dspy.retrievers.Embeddings(
    corpus=chunk_texts,
    embedder=embedder
    )

# Save embeddings to avoid the need to generate them again
embeddings_path = os.path.join(project_drive_dir, 'corpus_embeddings.npy')
np.save(embeddings_path, retriever.corpus_embeddings)

# # To re-load embeddings
# loaded_embeddings = np.load(embeddings_path)
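
The commented-out reload can be turned into a load-or-compute pattern, so the embedding API is only called when no cached file exists. This is a minimal sketch: `load_or_compute_embeddings` and `compute_fn` are hypothetical names, and how a loaded array is handed back to the retriever is left out because that depends on the DSPy version in use.

```python
import os

import numpy as np


def load_or_compute_embeddings(path, compute_fn):
    """Load embeddings from `path` if the file exists, otherwise compute and save them.

    `compute_fn` is any zero-argument callable returning an array-like of vectors.
    """
    if os.path.exists(path):
        return np.load(path)
    embeddings = np.asarray(compute_fn())
    np.save(path, embeddings)
    return embeddings
```

Assuming the embedder can be called directly on a list of texts, usage might look like `load_or_compute_embeddings(embeddings_path, lambda: embedder(chunk_texts))`. Keep the path ending in `.npy`, since `np.save` appends that extension when it is missing and the existence check would otherwise look for the wrong file.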


In [10]:
# Configure DSPy
dspy.settings.configure(lm=lm, rm=retriever)

In [11]:
# Define signatures for your RAG modules
class EducationalQuery(dspy.Signature):
    """Query an educational assistant about textbook content."""
    question = dspy.InputField()
    context = dspy.InputField(desc="Retrieved passages from textbooks")
    answer = dspy.OutputField(desc="Comprehensive answer based on the retrieved information")
    sources = dspy.OutputField(desc="The sources used to answer the question")

# Create educational assistant module
class EducationalAssistant(dspy.Module):
    def __init__(self):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=5)  # Retrieve 5 most relevant passages
        self.generate = dspy.ChainOfThought(EducationalQuery)

    def forward(self, question):
        retrieved = self.retrieve(question)
        answer = self.generate(
            question=question,
            context="\n\n".join(retrieved.passages)
        )
        return {
            "answer": answer.answer,
            "sources": answer.sources
        }


In [12]:
# Initialize your assistant
assistant = EducationalAssistant()

# Example usage
response = assistant("What is cellular respiration?")
print(response["answer"])
print("\nSources:")
print(response["sources"])

TypeError: Embeddings.__call__() got an unexpected keyword argument 'k'

In [60]:

# Define signatures for your RAG modules
class EducationalQuery(dspy.Signature):
    """Query an educational assistant about textbook content."""
    question = dspy.InputField()
    context = dspy.InputField(desc="Retrieved passages from textbooks")
    answer = dspy.OutputField(desc="Comprehensive answer based on the retrieved information")
    sources = dspy.OutputField(desc="The sources used to answer the question")

# Create educational assistant module
class EducationalAssistant(dspy.Module):
    def __init__(self):
        self.retrieve = dspy.retrieve()  # Retrieve 5 most relevant passages
        self.generate = dspy.ChainOfThought(EducationalQuery)

    def forward(self, question):
        # Retrieve relevant passages
        retrieved = self.retrieve(question)
        passages = retrieved.passages

        # Include metadata in prompts
        context_with_citations = []
        for i, passage in enumerate(passages):
            # Find the corresponding metadata for this passage
            metadata = chunked_documents[retrieved.indices[i]]['metadata']
            citation = f"[{metadata['source']}, Page {metadata['page']}]"
            context_with_citations.append(f"{passage} {citation}")

        # Format the context from retrieved passages
        context = "\n\n".join(context_with_citations)

        answer = self.generate(
            question=question,
            context=context)

        return {
            "answer": answer.answer,
            "sources": answer.sources
        }

# Initialize your assistant
assistant = EducationalAssistant()

# Example usage
response = assistant("What is cellular respiration?")
print(response["answer"])
print("\nSources:")
print(response["sources"])


TypeError: 'module' object is not callable
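
Both errors above appear to come from how the retriever is invoked rather than from the RAG logic itself. The first traceback reports that `Embeddings.__call__()` received an unexpected `k`, which suggests `dspy.Retrieve(k=5)` forwards `k` to the configured retriever. The second reports that a module was called, which matches `dspy.retrieve` being a module rather than a class. One possible way forward, sketched under assumptions: call the `retriever` object from the setup cell directly inside `forward` (for example `retrieved = retriever(question)`), and assume the result exposes `.passages` and `.indices` as the code above already expects. The citation-building step can then be isolated and checked without any API calls. `build_cited_context` is a hypothetical helper and the sample data is invented for illustration.

```python
def build_cited_context(passages, indices, documents):
    """Append a '[source, Page N]' citation to each retrieved passage.

    `indices` are positions in `documents`, the list of chunk dicts with
    'content' and 'metadata' keys created during PDF processing.
    """
    cited = []
    for passage, idx in zip(passages, indices):
        metadata = documents[int(idx)]["metadata"]  # int() also accepts NumPy integers
        citation = f"[{metadata['source']}, Page {metadata['page']}]"
        cited.append(f"{passage} {citation}")
    return "\n\n".join(cited)


# Invented sample data mirroring the structure of `chunked_documents`.
sample_documents = [
    {"content": "Cells release energy from glucose.", "metadata": {"source": "BookA", "page": 12}},
    {"content": "Neurons communicate via synapses.", "metadata": {"source": "BookB", "page": 40}},
]
context = build_cited_context(
    passages=[sample_documents[1]["content"], sample_documents[0]["content"]],
    indices=[1, 0],
    documents=sample_documents,
)
print(context)
```

Inside `forward`, the equivalent call would be `build_cited_context(retrieved.passages, retrieved.indices, chunked_documents)`, with the result passed as `context` to `self.generate`. Note also that the second version of `EducationalAssistant.__init__` drops the `super().__init__()` call that the first version had.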

## Adding Educational Domain Knowledge

# LLM Usage Disclosure

The following LLM-based assistants were used in the development of this notebook:

Claude 3.7 Sonnet for:
* Use case brainstorming and dataset selection



## Authorship
All core components, concepts, and technical implementation of this notebook were authored by Sarah Ostermeier. LLM assistance was limited to the specific tasks listed above.
