# Introduction to Simple RAG

Retrieval-Augmented Generation (RAG) is a hybrid approach that combines information retrieval with generative models. It enhances the performance of language models by incorporating external knowledge, which improves accuracy and factual correctness.

In a Simple RAG setup, we follow these steps:

1. **Data Ingestion**: Load and preprocess the text data.
2. **Chunking**: Break the data into smaller chunks to improve retrieval performance.
3. **Embedding Creation**: Convert the text chunks into numerical representations using an embedding model.
4. **Semantic Search**: Retrieve relevant chunks based on a user query.
5. **Response Generation**: Use a language model to generate a response based on retrieved text.

This notebook implements a Simple RAG approach, evaluates the model’s response, and explores various improvements.

## Setting Up the Environment
We begin by importing necessary libraries.

In [18]:
import os
import numpy as np
import json
from PyPDF2 import PdfReader # Used for PDF extraction as per your original code
from dotenv import load_dotenv
from sklearn.metrics.pairwise import cosine_similarity
import google.generativeai as genai # Import the Gemini library

ModuleNotFoundError: No module named 'sklearn'

In [1]:
import fitz
import os
import numpy as np
import json
from openai import OpenAI

## Extracting Text from a PDF File
To implement RAG, we first need a source of textual data. In this case, we extract text from a PDF file using the PyMuPDF library.

In [2]:
def extract_text_from_pdf(pdf_path: str) -> str:
    """
    Extracts text from a PDF file.

    Args:
        pdf_path (str): Path to the PDF file.

    Returns:
        str: Extracted text from the entire PDF.
    """
    # Open the PDF file
    doc = fitz.open(pdf_path)
    all_text = []

    # Iterate through each page in the PDF
    for page in doc:
        all_text.append(page.get_text("text"))

    doc.close()
    return "\n".join(all_text)




## Chunking the Extracted Text
Once we have the extracted text, we divide it into smaller, overlapping chunks to improve retrieval accuracy.

In [3]:
def chunk_text(text, n, overlap):
    """
    Chunks the given text into segments of n characters with overlap.

    Args:
    text (str): The text to be chunked.
    n (int): The number of characters in each chunk.
    overlap (int): The number of overlapping characters between chunks.

    Returns:
    List[str]: A list of text chunks.
    """
    chunks = []  # Initialize an empty list to store the chunks

    # Loop through the text with a step size of (n - overlap)
    for i in range(0, len(text), n - overlap):
        # Append a chunk of text from index i to i + n to the chunks list
        chunks.append(text[i:i + n])

    return chunks  # Return the list of text chunks

## Setting Up the OpenAI API Client
We initialize the OpenAI client to generate embeddings and responses.

In [4]:
import os
import google.generativeai as genai

# --- Configure Google Generative AI client ---
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
if not GOOGLE_API_KEY:
    print("Error: GOOGLE_API_KEY environment variable is not set.")
    exit(1)

try:
    genai.configure(api_key=GOOGLE_API_KEY)
except Exception as e:
    print(f"Error configuring Google Generative AI: {e}")
    exit(1)

# --- Initialize models ---
# For chat completions
chat_model = genai.GenerativeModel("gemini-2.0-flash")

# For embeddings
embedding_model = genai.GenerativeModel("text-embedding-004")

# Example: verify configuration
print("Google Generative AI configured successfully.")
print("Chat model initialized.")
print("Embedding model initialized.")


Google Generative AI configured successfully.
Chat model initialized.
Embedding model initialized.


## Extracting and Chunking Text from a PDF File
Now, we load the PDF, extract text, and split it into chunks.

In [5]:
# Define the path to the PDF file
pdf_path = "/Users/kekunkoya/Desktop/RAG Google/PEMA.pdf"

# Extract text from the PDF file
extracted_text = extract_text_from_pdf(pdf_path)

# Chunk the extracted text into segments of 1000 characters with an overlap of 200 characters
text_chunks = chunk_text(extracted_text, 1000, 200)

# Print the number of text chunks created
print("Number of text chunks:", len(text_chunks))

# Print the first text chunk
print("\nFirst text chunk:")
print(text_chunks[0])

Number of text chunks: 69

First text chunk:
PENNSYLVANIA
EMERGENCY
PREPAREDNESS
GUIDE
Be Informed. Be Prepared. Be Involved. 
www.Ready.PA.gov 
readypa@pa.gov

Emergency Preparedness Guide. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Table of Contents
TABLE OF CONTENTS  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Pages 2-3
INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  Page    4
TOP 10 EMERGENCIES . . . . . . . . . . . . . . . . . . . . . . Pages 4-7         
       
       
     
Floods • Fires • Winter Storms • Tropical Storms, Tornadoes 
and Thunderstorms • Influenza (Flu) Pandemic • Hazardous 
Material Incidents • Earthquakes and Landslides • Nuclear 
Threat • Dam Failures • Terrorism. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
BE PREPARED – MAKE A PLAN   .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .

## Creating Embeddings for Text Chunks
Embeddings transform text into numerical vectors, which allow for efficient similarity search.

In [6]:
import os
import google.generativeai as genai

# make sure you’ve already done:
# genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))

def create_embeddings(text, model_name="text-embedding-004"):
    """
    Creates embeddings for the given text using the specified Google AI model.

    Args:
      text (str or list of str): The input text(s) for which embeddings are to be created.
      model_name (str): The model to be used for creating embeddings.

    Returns:
      list: A single embedding (list of floats) or a list of embeddings.
    """
    # Normalize to a list of strings
    inputs = text if isinstance(text, list) else [text]

    try:
        response = genai.embed_content(
            model=model_name,
            content=inputs,
            task_type="RETRIEVAL_DOCUMENT"
        )
        # response['embedding'] should be a list of lists of floats
        embeddings = response['embedding']

        # If the original input was a single string, return just its embedding
        if not isinstance(text, list):
            return embeddings[0]

        # Otherwise return the full list of embeddings
        return embeddings

    except Exception as e:
        print(f"Error creating embeddings: {e}")
        return []

# Usage example
text_chunks = [
    "First document to embed.",
    "Second document to embed."
]

chunk_embeddings = create_embeddings(text_chunks, model_name="text-embedding-004")
# chunk_embeddings is now [[…], […]] – a list of float arrays

# If you need to mimic OpenAI’s format elsewhere:
formatted_chunk_embeddings = [{"embedding": emb} for emb in chunk_embeddings]


## Performing Semantic Search
We implement cosine similarity to find the most relevant text chunks for a user query.

In [7]:

def cosine_similarity(vec1, vec2):
    """
    Calculates the cosine similarity between two vectors.

    Args:
    vec1 (np.ndarray): The first vector.
    vec2 (np.ndarray): The second vector.

    Returns:
    float: The cosine similarity between the two vectors.
    """
    # Handle zero vectors to avoid division by zero
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    if norm_vec1 == 0 or norm_vec2 == 0:
        return 0.0
    return np.dot(vec1, vec2) / (norm_vec1 * norm_vec2)

def context_enriched_search(query, text_chunks, embeddings, k=1, context_size=1):
    """
    Retrieves the most relevant chunk along with its neighboring chunks.

    Args:
    query (str): Search query.
    text_chunks (List[str]): List of text chunks.
    embeddings (List[dict]): List of chunk embeddings, where each dict has an 'embedding' key.
    k (int): Number of relevant chunks to retrieve (currently only k=1 supported for simplicity).
    context_size (int): Number of neighboring chunks to include.

    Returns:
    List[str]: Relevant text chunks with contextual information.
    """
    # Convert the query into an embedding vector
    # The create_embeddings function for a single string returns a list containing one embedding
    query_embedding = create_embeddings(query, model_name='text-embedding-004')[0]

    similarity_scores = []

    # Compute similarity scores between query and each text chunk embedding
    for i, chunk_embedding_dict in enumerate(embeddings):
        # Calculate cosine similarity between the query embedding and current chunk embedding
        similarity_score = cosine_similarity(np.array(query_embedding), np.array(chunk_embedding_dict['embedding']))
        # Store the index and similarity score as a tuple
        similarity_scores.append((i, similarity_score))

    # Sort chunks by similarity score in descending order (highest similarity first)
    similarity_scores.sort(key=lambda x: x[1], reverse=True)

    # Get the index of the most relevant chunk
    if not similarity_scores:
        return [] # Return empty if no chunks or scores
    top_index = similarity_scores[0][0]

    # Define the range for context inclusion
    # Ensure we don't go below 0 or beyond the length of text_chunks
    start = max(0, top_index - context_size)
    end = min(len(text_chunks), top_index + context_size + 1)

    # Return the relevant chunk along with its neighboring context chunks
    return [text_chunks[i] for i in range(start, end)]

## Running a Query on Extracted Chunks

In [8]:
import os
import json
import numpy as np

# --- Your existing context_enriched_search, fixed ---
def context_enriched_search(query, text_chunks, embeddings, k=1, context_size=1):
    """
    Returns the top-k most similar chunks plus their neighbors for context.
    - query:     str
    - text_chunks:          list of str
    - embeddings:           list of dicts {'embedding': [float,...]}
    - k:         int, number of top matches
    - context_size: int, how many neighbors on each side
    """
    # 1) embed the query (assuming create_embeddings returns a flat list of floats)
    query_emb = create_embeddings(query)

    # 2) compute cosine similarity for each chunk
    scores = []
    for idx, emb_dict in enumerate(embeddings):
        chunk_emb = emb_dict['embedding']
        sim = np.dot(query_emb, chunk_emb) / (np.linalg.norm(query_emb) * np.linalg.norm(chunk_emb))
        scores.append((idx, float(sim)))  # cast to float here

    # 3) sort by similarity descending
    scores.sort(key=lambda x: x[1], reverse=True)

    # 4) pick top-k indices
    top_idxs = [idx for idx, _ in scores[:k]]

    # 5) gather each top chunk plus neighbors
    results = []
    for idx in top_idxs:
        start = max(0, idx - context_size)
        end   = min(len(text_chunks), idx + context_size + 1)
        context = "\n\n".join(text_chunks[start:end])
        results.append(context)

    return results

# --- Your JSON loading & query extraction ---
val_json_path = '/Users/kekunkoya/Desktop/ISEM 770 GOOGLE Project/data/val.json'
try:
    with open(val_json_path, 'r') as f:
        data = json.load(f)
except FileNotFoundError:
    print(f"Error: '{val_json_path}' not found. Please ensure it is in the correct directory.")
    data = []
except json.JSONDecodeError:
    print(f"Error: Could not decode JSON from '{val_json_path}'. Check file format.")
    data = []

if data:
    query = data[0].get('question', '')
else:
    print("Warning: No validation data loaded. Using a default query.")
    query = "What is Explainable AI?"  # fixed typo

# --- Retrieve contexts ---
# (Assumes text_chunks and formatted_chunk_embeddings are already defined)
top_chunks = context_enriched_search(
    query,
    text_chunks,
    formatted_chunk_embeddings,
    k=1,
    context_size=1
)

# --- Print results ---
print("\nQuery:", query)
if top_chunks:
    for i, chunk in enumerate(top_chunks, start=1):
        print(f"\nContext {i}:\n{chunk}\n" + "="*40)
else:
    print("No relevant context chunks found.")



Query: What is 'Explainable AI' and why is it considered important?

Context 1:
First document to embed.

Second document to embed.


## Generating a Response Based on Retrieved Chunks

In [10]:
# cell: rag_with_google_ai_fixed

import os
import json
import numpy as np
import fitz                    # PyMuPDF
from openai import OpenAI     # pip install openai
import google.generativeai as genai  # pip install google-generativeai

# --- 1) Configure both clients ---

# 1a) Vertex AI via OpenAI-compatible endpoint (for chat)
openai_client = OpenAI(
    api_key=os.getenv("GOOGLE_API_KEY"),
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)

# 1b) Google GenerativeAI SDK (for embeddings)
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
if not GOOGLE_API_KEY:
    raise RuntimeError("Please set the GOOGLE_API_KEY environment variable.")
genai.configure(api_key=GOOGLE_API_KEY)

# --- 2) Helpers ---

def extract_text_chunks(pdf_path: str, chunk_size: int = 1000) -> list[str]:
    """Split the entire PDF text into chunks of ~chunk_size characters."""
    doc = fitz.open(pdf_path)
    text = "".join(page.get_text() for page in doc)
    return [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]

def create_embeddings(texts: list[str]) -> list[list[float]]:
    """
    Embed a list of texts via the Google GenerativeAI SDK's top-level embed_content.
    """
    inputs = texts if isinstance(texts, list) else [texts]
    resp = genai.embed_content(
        model="text-embedding-004",
        content=inputs,
        task_type="RETRIEVAL_DOCUMENT"
    )
    # resp["embedding"] is a list of float lists
    return resp["embedding"]

def semantic_search(query: str, chunks: list[str], embeddings: list[list[float]], k: int = 2) -> list[str]:
    """Return top-k text chunks most similar to the query (cosine similarity)."""
    q_emb = create_embeddings([query])[0]
    scores = [
        (i, float(np.dot(q_emb, emb) / (np.linalg.norm(q_emb) * np.linalg.norm(emb))))
        for i, emb in enumerate(embeddings)
    ]
    top_idxs = [i for i, _ in sorted(scores, key=lambda x: x[1], reverse=True)[:k]]
    return [chunks[i] for i in top_idxs]

def generate_with_gemini(system_prompt: str, user_message: str, model: str = "gemini-2.0-flash") -> str:
    """
    Sends a system + user prompt to Gemini via the OpenAI-compatible endpoint
    and returns the assistant’s reply.
    """
    resp = openai_client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user",   "content": user_message}
        ],
        temperature=0.0
    )
    return resp.choices[0].message.content

# --- 3) Run RAG pipeline ---

# 3a) Extract & embed once per PDF
pdf_path = "/Users/kekunkoya/Desktop/RAG Google/PEMA.pdf"
text_chunks      = extract_text_chunks(pdf_path)
chunk_embeddings = create_embeddings(text_chunks)

# 3b) Load your validation query
with open("/Users/kekunkoya/Desktop/RAG Google/PA211_dataset.json") as f:
    data = json.load(f)
query = data[0]["question"]
print("Query:", query)

# 3c) Retrieve top-2 chunks
top_chunks = semantic_search(query, text_chunks, chunk_embeddings, k=2)

# Debug: print what we’re sending
print("\n--- Retrieved Contexts ---")
for i, c in enumerate(top_chunks, start=1):
    print(f"[Context {i}]\n{c[:200]}…\n")  # first 200 chars

# 3d) Build prompts
system_prompt = (
    "You are an AI assistant that strictly answers based on the given context. "
    "If the answer cannot be derived directly from the provided context, "
    "respond with: 'I do not have enough information to answer that.'"
)
user_message = "\n\n".join(f"Context {i}:\n{c}" for i, c in enumerate(top_chunks, 1))
user_message += f"\n\nQuestion: {query}"

# 3e) Get your answer from Gemini
answer = generate_with_gemini(system_prompt, user_message)
print("\nAI Response:\n", answer)


Query: Where can I find emergency food in ZIP code 17104?

--- Retrieved Contexts ---
[Context 1]
to make sure the organization
asking you for money is registered as a 501(c) corporation, which means your
donation is tax deductible: https://apps.irs.gov/app/eos/
• Contact the Pennsylvania Departme…

[Context 2]
 
814-765-5357
ext. 1
Clinton County 
570-893-4090
ext. 209
Columbia County 
570-389-5720
Crawford County 
814-724-2552
Cumberland County 
717-218-2902
Dauphin County 
717-558-6801
Delaware County 
61…


AI Response:
 I do not have enough information to answer that.


## Evaluating the AI Response
We compare the AI response with the expected answer and assign a score.

In [12]:
import os
from openai import OpenAI
import json

# --- 1) Initialize OpenAI-compatible client for Google Vertex AI ---
client = OpenAI(
    api_key=os.getenv("GOOGLE_API_KEY"),
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)

# --- 2) Load the dataset ---
with open("/Users/kekunkoya/Desktop/RAG Google/PA211_dataset.json") as f:
    data = json.load(f)

# --- 3) System prompt for evaluation ---
evaluate_system_prompt = (
    "You are an intelligent evaluation system tasked with assessing the AI assistant's responses. "
    "If the AI assistant's response is very close to the true response, assign a score of 1. "
    "If the response is incorrect or unsatisfactory, assign a score of 0. "
    "If the response is partially aligned, assign a score of 0.5. "
    "Reply with ONLY the numeric score (0, 0.5, or 1)."
)

# --- 4) Function to get evaluation score ---
def evaluate_response(query, ai_response, true_answer):
    evaluation_prompt = (
        f"User Query: {query}\n\n"
        f"AI Response: {ai_response}\n\n"
        f"True Response: {true_answer}\n\n"
        f"{evaluate_system_prompt}"
    )
    eval_resp = client.chat.completions.create(
        model="gemini-2.0-flash",
        messages=[
            {"role": "system", "content": evaluate_system_prompt},
            {"role": "user", "content": evaluation_prompt}
        ],
        temperature=0.0
    )
    return float(eval_resp.choices[0].message.content.strip())

# --- 5) Loop through dataset ---
scores = []
for i, item in enumerate(data):
    # Replace this with your actual AI-generated answer from Gemini RAG
    ai_response = "YOUR_AI_RESPONSE_HERE"  

    score = evaluate_response(item["question"], ai_response, item["ideal_answer"])
    scores.append(score)
    print(f"Q{i+1}: Score = {score}")

# --- 6) Compute average score ---
avg_score = sum(scores) / len(scores) if scores else 0
print(f"\nAverage Evaluation Score: {avg_score:.2f}")


Q1: Score = 1.0
Q2: Score = 1.0
Q3: Score = 1.0
Q4: Score = 0.0
Q5: Score = 0.0

Average Evaluation Score: 0.60
