# 00 - PDF Embeddings & RAG Pipeline Setup

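This notebook extracts the text of the Gemini technical report PDF, splits it into overlapping chunks, embeds each chunk, and defines the retrieval and generation helpers that the later RAG experiments build on.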

In [1]:
from dotenv import load_dotenv
import os

load_dotenv()

True

In [11]:
import fitz  # pymupdf
from openai import OpenAI
import numpy as np
import pandas as pd
import pickle
import time
from sklearn.metrics.pairwise import cosine_similarity

client = OpenAI()

In [14]:
# Load PDF
PDF_PATH = "../data/Gemini_FamilyOfMultimodelModels.pdf"

def extract_text_from_pdf(pdf_path):
    """Return the text of every page, each followed by a newline."""
    # fitz.open takes the path directly; the context manager closes the document
    with fitz.open(pdf_path) as doc:
        return "".join(page.get_text() + "\n" for page in doc)

pdf_text = extract_text_from_pdf(PDF_PATH)
print(f"PDF loaded: {len(pdf_text)} characters")
print(f"First 500 chars: {pdf_text[:500]}...")

PDF loaded: 222829 characters
First 500 chars: Gemini: A Family of Highly Capable
Multimodal Models
Gemini Team, Google1
This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities
across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano
sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained
use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model
advances the state...


In [15]:
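# Text chunking: sliding window over whitespace-separated words
def chunk_text(text, chunk_size=500, overlap=100):
    """Split text into chunks of chunk_size words; consecutive chunks share overlap words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    chunks = []

    step = chunk_size - overlap  # words to advance between chunk starts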
    
    for i in range(0, len(words), step):
        chunk = " ".join(words[i:i+chunk_size])
        if len(chunk.strip()) > 50:  # Skip chunks of 50 characters or fewer
            chunks.append({
                'text': chunk,
                'chunk_id': len(chunks),
                'word_count': len(chunk.split())
            })
        
        if i + chunk_size >= len(words):
            break
            
    return chunks

# Create chunks with default parameters
chunks = chunk_text(pdf_text, chunk_size=500, overlap=100)
print(f"Created {len(chunks)} chunks")
print(f"Sample chunk: {chunks[0]['text'][:200]}...")

Created 82 chunks
Sample chunk: Gemini: A Family of Highly Capable Multimodal Models Gemini Team, Google1 This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, vi...
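
The window arithmetic in `chunk_text` is easier to check on a toy input. The sketch below is a minimal illustration, assuming the same `chunk_size` and `overlap` semantics as the cell above. `chunk_windows` is a hypothetical helper, not part of the notebook, and it leaves out the 50-character minimum that `chunk_text` applies.

```python
def chunk_windows(n_words, chunk_size=500, overlap=100):
    """Return (start, end) word offsets for each window, mirroring chunk_text's loop."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    windows = []
    for start in range(0, n_words, step):
        end = min(start + chunk_size, n_words)
        windows.append((start, end))
        # Stop once a window reaches the end of the text
        if start + chunk_size >= n_words:
            break
    return windows

# 12 words, windows of 5 words, 2 words shared between neighbours
toy_windows = chunk_windows(12, chunk_size=5, overlap=2)
print(toy_windows)  # [(0, 5), (3, 8), (6, 11), (9, 12)]
```

Each window starts `chunk_size - overlap` words after the previous one, so the last `overlap` words of one chunk reappear at the start of the next.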


In [16]:
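# Generate embeddings
def get_embedding(text, model="text-embedding-3-small"):
    """Embed a single string (used for queries)."""
    text = text.replace("\n", " ")
    return client.embeddings.create(input=[text], model=model).data[0].embedding

def generate_embeddings(chunks, batch_size=10, model="text-embedding-3-small"):
    """Embed all chunks, sending one API request per batch."""
    embeddings = []
    n_batches = (len(chunks) - 1) // batch_size + 1

    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i+batch_size]
        print(f"Processing batch {i//batch_size + 1}/{n_batches}")

        # The embeddings endpoint accepts a list of inputs and returns one
        # result per input; sort by index so embeddings stay aligned with chunks.
        texts = [chunk['text'].replace("\n", " ") for chunk in batch]
        response = client.embeddings.create(input=texts, model=model)
        ordered = sorted(response.data, key=lambda item: item.index)
        embeddings.extend(item.embedding for item in ordered)

        time.sleep(0.1)  # Brief pause between requests

    return embeddings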

# Generate embeddings (this will take a few minutes)
print("Generating embeddings...")
embeddings = generate_embeddings(chunks)
print(f"Generated {len(embeddings)} embeddings")

Generating embeddings...
Processing batch 1/9
Processing batch 2/9
Processing batch 3/9
Processing batch 4/9
Processing batch 5/9
Processing batch 6/9
Processing batch 7/9
Processing batch 8/9
Processing batch 9/9
Generated 82 embeddings
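
Embedding requests can fail transiently, for example on rate limits or network errors, and the fixed `time.sleep(0.1)` does not recover from that. One hedged option is a small retry wrapper with exponential backoff. `with_retries` and its parameters are hypothetical, and `flaky_embed` is a stand-in for `get_embedding` so the sketch runs without an API key.

```python
import time

def with_retries(fn, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Wrap fn so that failures are retried with exponential backoff."""
    def wrapped(*args, **kwargs):
        for attempt in range(1, max_attempts + 1):
            try:
                return fn(*args, **kwargs)
            except Exception:
                # Broad on purpose for the sketch; in practice, catch only
                # the client's rate-limit and connection errors.
                if attempt == max_attempts:
                    raise
                sleep(base_delay * 2 ** (attempt - 1))
    return wrapped

# Stand-in for get_embedding: fails twice, then succeeds
calls = {"n": 0}

def flaky_embed(text):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated rate limit")
    return [0.0, 1.0]

delays = []  # record the waits instead of actually sleeping
safe_embed = with_retries(flaky_embed, sleep=delays.append)
vector = safe_embed("hello")
print(vector, delays)  # [0.0, 1.0] [0.5, 1.0]
```

In the notebook, the wrapper could be applied once with `get_embedding = with_retries(get_embedding)`, leaving the rest of the pipeline unchanged.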


In [17]:
# Save embeddings and chunks
rag_data = {
    'chunks': chunks,
    'embeddings': embeddings,
    'metadata': {
        'pdf_path': PDF_PATH,
        'chunk_size': 500,
        'overlap': 100,
        'total_chunks': len(chunks),
        'embedding_model': 'text-embedding-3-small'
    }
}

with open('../data/rag_embeddings.pkl', 'wb') as f:
    pickle.dump(rag_data, f)

print("RAG data saved to ../data/rag_embeddings.pkl")

RAG data saved to ../data/rag_embeddings.pkl
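
Later notebooks presumably read this file back. A minimal sketch of loading it and checking that chunks and embeddings still line up is shown below. It assumes the same dict layout as `rag_data`; `load_rag_data` is a hypothetical helper, and the toy file is written to a temporary directory so the example does not touch `../data`. Since unpickling can execute arbitrary code, only load pickle files you created yourself.

```python
import os
import pickle
import tempfile

def load_rag_data(path):
    """Load the pickled dict and check that chunks and embeddings are aligned."""
    with open(path, "rb") as f:
        data = pickle.load(f)
    chunks, embeddings = data["chunks"], data["embeddings"]
    if len(chunks) != len(embeddings):
        raise ValueError("number of chunks and embeddings differ")
    if len({len(vec) for vec in embeddings}) > 1:
        raise ValueError("embeddings have inconsistent dimensions")
    return chunks, embeddings, data["metadata"]

# Toy round trip with the same keys as rag_data
toy = {
    "chunks": [
        {"text": "first chunk", "chunk_id": 0, "word_count": 2},
        {"text": "second chunk", "chunk_id": 1, "word_count": 2},
    ],
    "embeddings": [[1.0, 0.0], [0.0, 1.0]],
    "metadata": {"total_chunks": 2},
}

with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy_rag.pkl")
    with open(toy_path, "wb") as f:
        pickle.dump(toy, f)
    toy_chunks, toy_embeddings, toy_meta = load_rag_data(toy_path)

print(len(toy_chunks), len(toy_embeddings), toy_meta["total_chunks"])  # 2 2 2
```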


In [18]:
# RAG retrieval function
def retrieve_chunks(query, chunks, embeddings, k=5):
    query_embedding = get_embedding(query)
    
    # Calculate similarities
    similarities = cosine_similarity([query_embedding], embeddings)[0]
    
    # Get top k chunks
    top_indices = np.argsort(similarities)[::-1][:k]
    
    retrieved_chunks = []
    for idx in top_indices:
        retrieved_chunks.append({
            'chunk': chunks[idx],
            'similarity': similarities[idx],
            'rank': len(retrieved_chunks) + 1
        })
    
    return retrieved_chunks

# Test retrieval
test_query = "What are the key capabilities of Gemini models?"
retrieved = retrieve_chunks(test_query, chunks, embeddings, k=3)

print(f"Query: {test_query}")
print("\nTop 3 retrieved chunks:")
for item in retrieved:
    print(f"\nRank {item['rank']} (similarity: {item['similarity']:.3f}):")
    print(f"{item['chunk']['text'][:200]}...")

Query: What are the key capabilities of Gemini models?

Top 3 retrieved chunks:

Rank 1 (similarity: 0.731):
testing on Gemini Advanced: • Priority User Program: This program collected feedback from 120 power users, key influencers, and thought-leaders. This program enables the collection of real-time feedba...

Rank 2 (similarity: 0.719):
Ruddock, Art Khurshudov, Artemis Chen, Arthur Argenson, Avinatan Hassidim, Beiye Liu, Benjamin Schroeder, Bin Ni, Brett Daw, Bryan Chiang, Burak Gokturk, Carl Crous, Carrie Grimes Bostock, Charbel Kae...

Rank 3 (similarity: 0.703):
of high-quality demonstration data and feedback data for coding use cases. Gemini Apps and Gemini API models use a combination of human and synthetic approaches to collect such data. We evaluate our G...
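
To make the ranking step concrete, here is what `cosine_similarity` followed by `np.argsort(...)[::-1][:k]` computes, written out with the standard library on toy 2-D vectors. This is an illustrative sketch, not a replacement for the scikit-learn call; `cosine` and `top_k` are hypothetical names.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, vectors, k=3):
    """Return (index, similarity) pairs for the k most similar vectors."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:k]]

toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ranked = top_k([1.0, 0.0], toy_vectors, k=2)
print([idx for idx, _ in ranked])  # [0, 2]
```

The query points along the first axis, so vector 0 scores 1.0, the diagonal vector 2 scores about 0.707, and the orthogonal vector 1 scores 0 and is dropped.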


In [19]:
# RAG generation function
def rag_generate(query, chunks, embeddings, retrieval_k=5, **llm_params):
    # Retrieve relevant chunks
    retrieved = retrieve_chunks(query, chunks, embeddings, k=retrieval_k)
    
    # Build context
    context = "\n\n".join([item['chunk']['text'] for item in retrieved])
    
    # Generate response
    prompt = f"""
Use the context below to answer the question. Be accurate and cite specific information from the context.

Context:
{context}

Question: {query}

Answer:"""
    
    start_time = time.time()
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=llm_params.get('temperature', 0.3),
        max_tokens=llm_params.get('max_tokens', 300),
        top_p=llm_params.get('top_p', 0.9)
    )
    
    latency = time.time() - start_time
    
    return {
        'query': query,
        'answer': response.choices[0].message.content,
        'retrieved_chunks': retrieved,
        'context_length': len(context),
        'latency': latency,
        'usage': response.usage,
        'parameters': llm_params
    }

# Test RAG pipeline
result = rag_generate(
    "What are the main capabilities of Gemini models?",
    chunks, embeddings,
    retrieval_k=3,
    temperature=0.3,
    max_tokens=200
)

print(f"Query: {result['query']}")
print(f"\nAnswer: {result['answer']}")
print("\nMetrics:")
print(f"- Latency: {result['latency']:.2f}s")
print(f"- Context length: {result['context_length']} chars")
print(f"- Tokens used: {result['usage'].total_tokens}")

Query: What are the main capabilities of Gemini models?

Answer: Gemini models exhibit a range of advanced capabilities across multiple modalities, including text, code, image, audio, and video. Specifically, the most capable model, Gemini Ultra, demonstrates significant performance improvements in several areas:

1. **Natural Language Processing**: Gemini Ultra surpasses human-expert performance on the MMLU exam benchmark, scoring 90.0%, indicating its advanced capabilities in understanding and generating human language.

2. **Multimodal Understanding**: The models excel in image understanding, video understanding, and audio understanding benchmarks without requiring task-specific modifications or tuning. This includes the ability to parse complex images, such as charts and infographics, and reason over interleaved sequences of images, audio, and text.

3. **Reasoning Capabilities**: Gemini models are noted for their reasoning abilities, which enable them to tackle complex multi-step 
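
The answer above stops mid-sentence, most likely because the test call caps the response at `max_tokens=200`. The Chat Completions response reports this through `finish_reason`, which is `"length"` when the token limit cut the output short. The sketch below shows one way to check it; `was_truncated` is a hypothetical helper, and the stub objects only imitate the shape of a real response so the example runs offline.

```python
from types import SimpleNamespace

def was_truncated(response):
    """True if the first choice stopped because it hit the token limit."""
    return response.choices[0].finish_reason == "length"

def make_stub(finish_reason, content):
    """Build an object shaped like a chat completion response (illustration only)."""
    message = SimpleNamespace(content=content)
    choice = SimpleNamespace(finish_reason=finish_reason, message=message)
    return SimpleNamespace(choices=[choice])

cut_off = make_stub("length", "which enable them to tackle complex multi-step")
complete = make_stub("stop", "Gemini comes in Ultra, Pro, and Nano sizes.")

print(was_truncated(cut_off), was_truncated(complete))  # True False
```

Inside `rag_generate`, the same check could be stored in the returned dict so that parameter experiments can separate truncated answers from complete ones.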

In [20]:
# Test queries for experiments
TEST_QUERIES = [
    "What are the key capabilities of Gemini models?",
    "How does Gemini compare to other multimodal models?",
    "What are the different versions of Gemini?",
    "What training data was used for Gemini?",
    "What are the safety measures in Gemini models?"
]

print("Test queries for parameter experiments:")
for i, query in enumerate(TEST_QUERIES, 1):
    print(f"{i}. {query}")

print("\nRAG pipeline ready! Use these queries in notebooks 01-09 for parameter testing.")

Test queries for parameter experiments:
1. What are the key capabilities of Gemini models?
2. How does Gemini compare to other multimodal models?
3. What are the different versions of Gemini?
4. What training data was used for Gemini?
5. What are the safety measures in Gemini models?

RAG pipeline ready! Use these queries in notebooks 01-09 for parameter testing.
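
How notebooks 01-09 organize their runs is not shown here. As one possible layout, a parameter sweep can be expressed as a list of keyword dicts, one per combination. `build_runs` and the parameter values below are assumptions for illustration only.

```python
from itertools import product

def build_runs(queries, temperatures, retrieval_ks):
    """Return one dict per (query, temperature, retrieval_k) combination."""
    return [
        {"query": query, "temperature": temperature, "retrieval_k": k}
        for query, temperature, k in product(queries, temperatures, retrieval_ks)
    ]

runs = build_runs(
    ["What are the different versions of Gemini?"],
    temperatures=[0.0, 0.3],
    retrieval_ks=[3, 5],
)
print(len(runs))  # 4
```

Each dict matches the keyword arguments of `rag_generate`, so a run could be executed as `rag_generate(chunks=chunks, embeddings=embeddings, **run)`.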
