<a href="https://colab.research.google.com/github/SarahOstermeier/TechnicalExercises/blob/main/Arize_Technical_Exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Planning

**Objective:**  Build a RAG application

## Approach

**Format:** Jupyter Notebook (Google Colab)  
**Stretch Goal:** Optimize performance (primary), Build UX (secondary)  
**Framework:** Langchain or DSPy  
**LLM Provider:** Huggingface or Mistral  
**Dataset:** [OpenStax](https://openstax.org/subjects)

****

## Requirements

**A working RAG app with some interface for Q&A**  
* ~75-80% of your time, 2-3 hours


**Thorough documentation**  

* Clear setup instructions - make it so anyone can follow in your footsteps
* Tell us why you picked your tools
* Share what worked, what didn't, and how you dealt with it
* What would you do next if you had more time?
* ~20-25% of your time, 1 hour

****

## Tips

* Use those quickstart tools - no need to reinvent the wheel
* Document as you go - future you will thank you
* LLMs are your friend here, don’t be afraid to use them to help, just be sure you take the time to really understand what they tell you.
* Hit a wall? Don't spin your wheels - reach out!
* Keep it focused - better to nail the basics than half-finish three extra features

# How to run this notebook

# My process

## Planning

### Approach and Tools
* I decided to work in Google Colab since it is a tool I am familiar with and it lets me get started quickly without much setup.
* As my RAG framework I chose DSPy, as I'm interested in trying out DSPy Optimizers and thought this would be a good opportunity to do so.  
* Related to the above, my stretch goal is to optimize performance.
* I'll be using HuggingFace or Mistral as my LLM provider, as I already have accounts for both and can access them easily.

### Use Case and Dataset Selection
I started a project on Claude and provided the exercise instructions and the Jupyter Notebook I started as project content. I used Claude to brainstorm project ideas and related datasets and eventually decided to build a RAG tool to query textbooks, using documents from [OpenStax](https://openstax.org/subjects) as my dataset.

## Implementation

# Environment set up

## Install and import relevant libraries

[**DSPy:**](https://dspy.ai/) Framework for the RAG application. DSPy provides a "prompts as code" library, enabling AI developers to standardize, modularize, and optimize their AI applications.

**PyPDF2:** To extract text from PDFs



In [12]:
!pip install PyPDF2
!pip install dspy
!pip install langchain



In [13]:
import os
import PyPDF2
import pandas as pd

import langchain
import dspy

## Model set up
In this tutorial I'll be accessing models through [Mistral](https://mistral.ai/) and through Huggingface's [Serverless Inference API](https://huggingface.co/docs/api-inference/index), both of which can be used for free, with some limitations. The API calls will be made through **DSPy**, which [integrates with a wide range of model providers](https://dspy.ai/).

Set the model provider API keys as environment variables.

In [17]:
# Comment out if API keys are not saved in your google colab userdata
from google.colab import userdata
os.environ["MISTRAL_API_KEY"] = userdata.get('MISTRAL_API_KEY')
# os.environ["HUGGINGFACE_API_KEY"] = userdata.get('HUGGINGFACE_API_KEY')

## Uncomment and add API keys here if they are not saved in your Google Colab userdata
# os.environ["MISTRAL_API_KEY"] = 'YOUR_MISTRAL_API_KEY'
# os.environ["HUGGINGFACE_API_KEY"] = 'YOUR_HUGGINGFACE_API_KEY'
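
If a key is missing, the problem may only surface later as an authentication error from the provider. A minimal sketch of an early check, assuming a hypothetical helper named `require_api_key` (it is not part of DSPy or Colab):

```python
import os

def require_api_key(name):
    """Return the value of environment variable `name`, or raise a clear error.

    `require_api_key` is a hypothetical helper for this notebook,
    not a DSPy or Colab API.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Save it in your Colab userdata "
            f"or assign os.environ[{name!r}] before continuing."
        )
    return value
```

For example, `require_api_key("MISTRAL_API_KEY")` could be used in place of the direct `os.environ["MISTRAL_API_KEY"]` lookup when creating the `dspy.LM`.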

Access the LLM endpoint with DSPy.

In [19]:
lm = dspy.LM('mistral/mistral-small-latest', api_key=os.environ["MISTRAL_API_KEY"])
dspy.configure(lm=lm)

Test the endpoint.


In [20]:

lm(messages=[{"role": "user", "content": "Say this is a test!"}])

["This is a test! How can I assist you further? Let's test something if you'd like. How about I say something and you respond with the first word that comes to your mind? I'll start:\n\nCat\n\n(What word does that make you think of?)"]

# Implementation

## Data Collection and Processing

I started by downloading a couple of textbooks and access them directly from Google Drive. I'll add web scraping later if there's time.


### Process the PDFs

**Approach:** I used Claude to generate functions that go through each file in the directory, extract the text content from each page, and format the data into a list of dictionaries containing the content and metadata for each page. Metadata such as source (title of the textbook) and page number is saved for use in citations later.

**Possible issue:** This does not preserve the textbook's chapter/section structure or content such as images and structured tables. In a more sophisticated implementation it might be worth doing more structured data splitting and including details such as chapter and section in the metadata.
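
As a rough illustration of the chapter idea, the sketch below tags each page with the most recent chapter number seen so far. The heading pattern (`CHAPTER <number>` at the start of a line) is an assumption about how the extracted text looks, not something verified against these PDFs, and `add_chapter_metadata` is a hypothetical helper. In-text mentions such as "Chapter 3" at the start of a wrapped line would also match, so the pattern would need tuning.

```python
import re

# Assumed heading format; adjust the pattern to match the actual textbook layout.
CHAPTER_PATTERN = re.compile(r"^\s*CHAPTER\s+(\d+)\b", re.IGNORECASE | re.MULTILINE)

def add_chapter_metadata(documents):
    """Return copies of the page dicts with a 'chapter' key added to the metadata."""
    current_chapter = None  # Stays None until the first chapter heading is found
    tagged = []
    for doc in documents:
        match = CHAPTER_PATTERN.search(doc["content"])
        if match:
            current_chapter = int(match.group(1))
        tagged.append({
            "content": doc["content"],
            "metadata": {**doc["metadata"], "chapter": current_chapter},
        })
    return tagged
```

This takes the same list-of-dictionaries shape described above (`content` plus `metadata`) and leaves the input list unchanged.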

In [1]:
project_drive_dir = "/content/drive/MyDrive/Colab Notebooks/Arize RAG Exercise"
project_data_folder = "Data"

In [21]:
def extract_text_from_pdf(pdf_path):
    """Extract text from a PDF file, along with page numbers."""
    documents = []

    with open(pdf_path, 'rb') as file:
        pdf_reader = PyPDF2.PdfReader(file)

        # Get basic information
        title = os.path.basename(pdf_path).replace('.pdf', '')
        total_pages = len(pdf_reader.pages)

        # Process each page
        for page_num in range(total_pages):
            # Extract text from the page
            page = pdf_reader.pages[page_num]
            text = page.extract_text()

            # Skip empty pages
            if not text or len(text.strip()) < 50:  # Skip pages with little or no text
                continue

            # Create a document with metadata for citation
            documents.append({
                'content': text,
                'metadata': {
                    'source': title,
                    'page': page_num + 1,
                    'total_pages': total_pages
                }
            })

    return documents

def process_pdfs_in_directory(directory):
    """Process all PDFs in a directory."""
    all_documents = []
    for filename in os.listdir(directory):
        if filename.endswith('.pdf'):
            file_path = os.path.join(directory, filename)
            try:
                documents = extract_text_from_pdf(file_path)
                all_documents.extend(documents)
                print(f"Processed {filename}: {len(documents)} pages extracted")
            except Exception as e:
                print(f"Error processing {filename}: {e}")

    return all_documents

In [22]:
# Load my documents
pdf_directory = os.path.join(project_drive_dir, project_data_folder)
documents = process_pdfs_in_directory(pdf_directory)

# Save to CSV for later use
df = pd.DataFrame(documents)
textbook_content_path = os.path.join(project_drive_dir, 'textbook_content.csv')
df.to_csv(textbook_content_path, index=False)

Processed ConceptsofBiology-WEB.pdf: 612 pages extracted
Processed Introduction_to_Behavioral_Neuroscience-WEB.pdf: 912 pages extracted
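
One thing to watch when reloading this file: the `metadata` column holds dicts, and CSV stores them as their string representation, so `pd.read_csv` returns strings rather than dicts. A minimal sketch of restoring them, assuming a hypothetical helper named `parse_metadata`:

```python
import ast

def parse_metadata(cell):
    """Convert a metadata cell read back from CSV into a dict.

    `parse_metadata` is a hypothetical helper, not part of pandas.
    """
    if isinstance(cell, dict):
        return cell  # Already a dict (e.g. the DataFrame was never saved)
    # literal_eval safely parses the "{'source': ..., 'page': ...}" string form
    return ast.literal_eval(cell)
```

After `df = pd.read_csv(textbook_content_path)`, it could be applied with `df['metadata'] = df['metadata'].apply(parse_metadata)`.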


### Split into Chunks for Better Retrieval

This function was also generated by Claude. I used langchain here since it has a built-in text splitter and started with the following parameters:
* chunk_size: 1000
* chunk_overlap: 200
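
To make the two parameters concrete, here is a simplified sketch of fixed-size chunking with overlap. It is not LangChain's algorithm (`RecursiveCharacterTextSplitter` also tries to break on separators such as paragraph and line breaks). The helper name `sliding_window_chunks` is hypothetical and only illustrates how `chunk_size` and `chunk_overlap` interact.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size character windows that overlap.

    Simplified illustration only; not how RecursiveCharacterTextSplitter works.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # How far each new chunk advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks
```

With the starting values above, each chunk begins 800 characters after the previous one, so consecutive chunks share 200 characters.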

In [23]:
# from langchain.text_splitter import RecursiveCharacterTextSplitter

# def split_documents(documents, chunk_size=1000, chunk_overlap=200):
#     """Split documents into smaller chunks for better retrieval."""
#     text_splitter = RecursiveCharacterTextSplitter(
#         chunk_size=chunk_size,
#         chunk_overlap=chunk_overlap
#     )

#     chunked_documents = []
#     for doc in documents:
#         chunks = text_splitter.split_text(doc['content'])
#         for i, chunk in enumerate(chunks):
#             chunked_documents.append({
#                 'content': chunk,
#                 'metadata': {
#                     **doc['metadata'],
#                     'chunk_id': i
#                 }
#             })

#     return chunked_documents

# # Split the documents
# chunked_docs = split_documents(documents)
# chunked_df = pd.DataFrame(chunked_docs)
# textbook_chunks_path = os.path.join(project_drive_dir, 'textbook_chunks.csv')
# chunked_df.to_csv(textbook_chunks_path , index=False)

## DSPy Setup for RAG

In [27]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from dspy.retrievers import Embeddings

# Use LangChain's text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    separators=["\n\n", "\n", " ", ""]
)
# Each item in `documents` is a dict, so join the page text stored under 'content'
chunks = text_splitter.split_text("\n\n".join(doc['content'] for doc in documents))

# # Initialize DSPy embedder and retriever
# embedder = dspy.Embedder('openai/text-embedding-3-small')
# retriever = Embeddings(corpus=chunks, embedder=embedder, k=5)

# # Configure DSPy
# dspy.settings.configure(rm=retriever)

## Adding Educational Domain Knowledge

# LLM Usage Disclosure

The following LLM-based assistants were used in the development of this notebook:

Claude 3.7 Sonnet for:
* Use case brainstorming and dataset selection
* Generating the PDF text extraction functions
* Generating the text chunking function



## Authorship
All core components, concepts, and technical implementation of this notebook were authored by Sarah Ostermeier. LLM assistance was limited to the specific tasks listed above.
