# Introduction to Simple RAG

Retrieval-Augmented Generation (RAG) is a hybrid approach that combines information retrieval with generative models. It grounds a language model's output in text retrieved from an external source, which can improve accuracy and factual correctness.

In a Simple RAG setup, we follow these steps:

1. **Data Ingestion**: Load and preprocess the text data.
2. **Chunking**: Break the data into smaller chunks to improve retrieval performance.
3. **Embedding Creation**: Convert the text chunks into numerical representations using an embedding model.
4. **Semantic Search**: Retrieve relevant chunks based on a user query.
5. **Response Generation**: Use a language model to generate a response based on retrieved text.

This notebook implements a Simple RAG approach, evaluates the model’s response, and explores various improvements.

## Setting Up the Environment
We begin by importing necessary libraries.

In [5]:
import fitz  # PyMuPDF
import os
import numpy as np
import json
from openai import OpenAI

## Extracting Text from a PDF File
To implement RAG, we first need a source of textual data. In this case, we extract text from a PDF file using the PyMuPDF library.

In [6]:
def extract_text_from_pdf(pdf_path):
    """
    Extracts text from every page of a PDF file.

    Args:
    pdf_path (str): Path to the PDF file.

    Returns:
    str: Extracted text from the PDF.
    """
    # Open the PDF file
    mypdf = fitz.open(pdf_path)
    all_text = ""  # Initialize an empty string to store the extracted text

    # Iterate through each page in the PDF
    for page_num in range(mypdf.page_count):
        page = mypdf[page_num]  # Get the page
        text = page.get_text("text")  # Extract text from the page
        all_text += text  # Append the extracted text to the all_text string

    return all_text  # Return the extracted text

## Chunking the Extracted Text
Once we have the extracted text, we divide it into smaller, overlapping chunks to improve retrieval accuracy.

In [7]:
def chunk_text(text, n, overlap):
    """
    Chunks the given text into segments of n characters with overlap.

    Args:
    text (str): The text to be chunked.
    n (int): The number of characters in each chunk.
    overlap (int): The number of overlapping characters between chunks.

    Returns:
    List[str]: A list of text chunks.
    """
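    # A zero step makes range() raise an error and a negative step yields no chunks, so validate first
    if n <= 0 or not 0 <= overlap < n:
        raise ValueError("n must be positive and overlap must satisfy 0 <= overlap < n")

    chunks = []  # Initialize an empty list to store the chunks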
    
    # Loop through the text with a step size of (n - overlap)
    for i in range(0, len(text), n - overlap):
        # Append a chunk of text from index i to i + n to the chunks list
        chunks.append(text[i:i + n])

    return chunks  # Return the list of text chunks
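
A small worked example makes the overlap easier to see. The sketch below repeats the chunking logic so that it runs on its own, and adds `expected_chunk_count`, a hypothetical helper that is not part of the notebook. It assumes `0 <= overlap < n`.

```python
import math

def chunk_text(text, n, overlap):
    """Same logic as above: windows of n characters, stepping by n - overlap."""
    return [text[i:i + n] for i in range(0, len(text), n - overlap)]

def expected_chunk_count(text_length, n, overlap):
    """Hypothetical helper: how many chunks the loop above produces."""
    return math.ceil(text_length / (n - overlap))

chunks = chunk_text("abcdefghij", n=4, overlap=2)
print(chunks)                          # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
print(expected_chunk_count(10, 4, 2))  # 5
```

Each chunk begins with the last two characters of the previous one. The final chunk can be shorter than `n`, and here it is fully contained in the chunk before it, so dropping very short trailing chunks may be worth considering.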

## Setting Up the OpenAI API Client
We initialize the OpenAI client to generate embeddings and responses.

In [8]:
# Initialize the OpenAI client with the API key
client = OpenAI(
    api_key=os.getenv("OPENAI_API_KEY")  # Retrieve the API key from environment variables
)

## Extracting and Chunking Text from a PDF File
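Now, we load the PDF, extract text, and split it into chunks. This cell redefines `extract_text_from_pdf` and `chunk_text` using PyPDF2 instead of PyMuPDF, so these versions replace the earlier ones for the rest of the notebook.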

In [10]:
# Import the required library for PDF reading
from PyPDF2 import PdfReader

# --- Function Definitions ---

def extract_text_from_pdf(pdf_path):
    """Extracts text from all pages of a PDF file."""
    text = ""
    # Open the PDF file in binary read mode
    with open(pdf_path, 'rb') as file:
        pdf = PdfReader(file)
        # Loop through each page and extract text
        for page in pdf.pages:
            text += page.extract_text() or ""
    return text

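def chunk_text(text, chunk_size, overlap):
    """Chunks text into segments with a specified overlap."""
    # Guard against a non-positive step, which would make the loop below run forever
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("chunk_size must be positive and 0 <= overlap < chunk_size")
    chunks = []
    start = 0
    # Loop through the text and create overlapping chunks
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        # Move the start position forward by the chunk size minus the overlap
        start += chunk_size - overlap
    return chunks

# --- Load, Extract, and Chunk ---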

# Define the path to the PDF file
# IMPORTANT: Make sure this path is correct on your system
pdf_path = "/Users/kekunkoya/Desktop/RAG Project/PEMA.pdf"

# Extract text from the PDF file by calling the function we defined
extracted_text = extract_text_from_pdf(pdf_path)

# Chunk the extracted text by calling the function we defined
text_chunks = chunk_text(extracted_text, 1000, 200)

# Print the number of text chunks created
print("Number of text chunks:", len(text_chunks))

# Print the first text chunk
print("\nFirst text chunk:")
# Print only the first chunk if it exists
if text_chunks:
    print(text_chunks[0])

Number of text chunks: 66

First text chunk:
PENNSYLVANIA
EMERGENCY
PREPAREDNESSGUIDE
Be Informed. Be Prepared. Be Involved. 
www.Ready .PA.gov 
readypa@pa.govEmergency Preparedness GuideTable of Contents
TABLE OF CONTENTS  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Pages 2-3
INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  Page    4
TOP 10 EMERGENCIES . . . . . . . . . . . . . . . . . . . . . . Pages 4-7
Floods • Fires • Winter Storms • Tropical Storms, Tornadoes 
and Thunderstorms • Influenza (Flu) Pandemic • Hazardous 
Material Incidents • Earthquakes and Landslides • Nuclear 
Threat • Dam Failures • Terrorism
BE PREPARED – MAKE A PLAN    .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .    Pages 10-11
How to Make a Family Emergency Plan
HOME EMERGENCY KIT CHECKLIST .   .  .  .  .  .  .  .  .  .  .  .  .  .  .   Pages 12-15
Additional Special Items
BE PREPARED IN YOUR VEHICLE   . . . . . . . . . . . . . . . . . . . . . .  Page 16
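
The first chunk shows typical extraction noise: runs of dot leaders from the table of contents and irregular spacing. Since the ingestion step mentions preprocessing, here is a minimal sketch of a cleanup pass that could run before chunking. `clean_extracted_text` is a hypothetical helper, not part of the notebook, and its regular expressions are assumptions based on the output above.

```python
import re

def clean_extracted_text(text):
    """Hypothetical cleanup: collapse dot leaders and excess whitespace."""
    # Replace runs of three or more dots (optionally separated by spaces) with one space
    text = re.sub(r"(?:\.[ \t]*){3,}", " ", text)
    # Collapse runs of spaces and tabs into a single space
    text = re.sub(r"[ \t]+", " ", text)
    # Trim whitespace at both ends of every line
    return "\n".join(line.strip() for line in text.splitlines())

sample = "INTRODUCTION . . . . . . . . .  Page    4\nTOP 10 EMERGENCIES . . . . Pages 4-7"
print(clean_extracted_text(sample))
```

Note that this pattern also replaces a literal ellipsis (`...`) in running prose, which may or may not be acceptable for a given document.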


## Creating Embeddings for Text Chunks
Embeddings transform text into numerical vectors, which allow for efficient similarity search.

In [12]:
from dotenv import load_dotenv
import os

load_dotenv()
api_key=os.getenv('OPENAI_API_KEY')

In [13]:
import os
from openai import OpenAI

key = os.getenv("OPENAI_API_KEY") # No default needed if you ensure env var is set
client = OpenAI(api_key=key)

def create_embeddings(text_chunks, model="text-embedding-3-small"):
    response = client.embeddings.create(
        model=model,
        input=text_chunks
    )
    return response.data

# You'll need to define text_chunks before calling create_embeddings
# For example:
# text_chunks = ["This is a test chunk.", "Another piece of text."]

# embeddings_data = create_embeddings(text_chunks)
# print(f"Created {len(embeddings_data)} embeddings.")

In [15]:
# Make sure to install the library: pip install openai
from openai import OpenAI
import os # Import os to check for environment variables

# It's good practice to ensure the API key is set
# If OPENAI_API_KEY is not set, this will raise an error
# You might want to add error handling here for production code
if "OPENAI_API_KEY" not in os.environ:
    print("Warning: OPENAI_API_KEY environment variable is not set.")
    print("Please set it before running the script (e.g., export OPENAI_API_KEY='sk-...').")
    # For testing, you could temporarily uncomment the line below and put your key
    # os.environ["OPENAI_API_KEY"] = "sk-YOUR_TEST_KEY_HERE"
    # exit() # Or exit if the key is mandatory for execution

client = OpenAI() # This works if OPENAI_API_KEY is set in your environment

def create_embeddings(text_chunks, model="text-embedding-3-small"):
    """
    Creates embeddings for a list of text chunks using the specified OpenAI model.

    Args:
    text_chunks (list[str]): The list of input texts for which embeddings are to be created.
    model (str): The OpenAI model to be used. Default is "text-embedding-3-small".

    Returns:
    list: A list of embedding objects from the OpenAI API.
    """
    # The 'input' parameter for the OpenAI API can take a list of strings directly
    response = client.embeddings.create(
        model=model,
        input=text_chunks
    )

    # The embeddings are located in response.data
    return response.data

# Define sample 'text_chunks' as a list of strings for a quick test.
# Note: this overwrites the chunks extracted from the PDF above; re-run the
# chunking cell before embedding the actual document.
text_chunks = [
    "This is the first sentence, which we want to embed.",
    "This is the second one, containing different information.",
    "A third text chunk to demonstrate the functionality."
]

# Create embeddings for the text chunks
embeddings_data = create_embeddings(text_chunks)

# Print the number of embeddings created
print(f"Successfully created {len(embeddings_data)} embeddings.")

# Print the embedding for the first text chunk (optional)
if embeddings_data:
    print("\nEmbedding for the first chunk (first 5 values):")
    print(embeddings_data[0].embedding[:5])

Successfully created 3 embeddings.

Embedding for the first chunk (first 5 values):
[0.027513332664966583, 0.037257637828588486, -0.004034167155623436, 0.010535563342273235, 0.012971639633178711]


In [16]:
# Make sure to install the library: pip install openai
from openai import OpenAI


client = OpenAI() # This works if OPENAI_API_KEY is set in your environment

def create_embeddings(text_chunks, model="text-embedding-3-small"):
    """
    Creates embeddings for a list of text chunks using the specified OpenAI model.

    Args:
    text_chunks (list[str]): The list of input texts for which embeddings are to be created.
    model (str): The OpenAI model to be used. Default is "text-embedding-3-small".

    Returns:
    list: A list of embedding objects from the OpenAI API.
    """
    # The 'input' parameter for the OpenAI API can take a list of strings directly
    response = client.embeddings.create(
        model=model,
        input=text_chunks
    )

    # The embeddings are located in response.data
    return response.data

# Assume 'text_chunks' is a list of strings from your previous code
# Example: text_chunks = ["This is the first sentence.", "This is the second one."]

# Create embeddings for the text chunks
embeddings_data = create_embeddings(text_chunks)

# Print the number of embeddings created
print(f"Successfully created {len(embeddings_data)} embeddings.")

# Print the embedding for the first text chunk (optional)
if embeddings_data:
    print("\nEmbedding for the first chunk (first 5 values):")
    print(embeddings_data[0].embedding[:5])

Successfully created 3 embeddings.

Embedding for the first chunk (first 5 values):
[0.027504978701472282, 0.03724632412195206, -0.00400179997086525, 0.010563506744801998, 0.012980157509446144]


In [17]:
# 1. Import the OpenAI library
from openai import OpenAI

# 2. Initialize the client
# The client automatically looks for the OPENAI_API_KEY environment variable.
client = OpenAI()

# Assume 'text_chunks' is a list of strings from your previous code
# Example: text_chunks = ["This is the first sentence.", "This is the second one."]

# 3. Create embeddings using the specified OpenAI model
model_name = "text-embedding-3-small"
response = client.embeddings.create(
    model=model_name,
    input=text_chunks
)

# 4. Extract the embedding vectors from the response object
# The actual embeddings are in the `.data` attribute of the response.
embeddings = [embedding_item.embedding for embedding_item in response.data]

# Check the first embedding's first few values
if embeddings:
    print(f"Successfully created {len(embeddings)} embeddings with model '{model_name}'.")
    print("\nFirst 5 values of the first embedding:")
    print(embeddings[0][:5])

Successfully created 3 embeddings with model 'text-embedding-3-small'.

First 5 values of the first embedding:
[0.027504978701472282, 0.03724632412195206, -0.00400179997086525, 0.010563506744801998, 0.012980157509446144]
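
For a full document, the chunk list may be too large to send in one request, since the embeddings endpoint limits how much input a single call may carry. The following is a minimal sketch of batching, assuming a callable `embed_fn` that takes a list of strings and returns one vector per string in the same order. `embed_in_batches`, `embed_fn`, and `batch_size` are hypothetical names, and `fake_embed` exists only so the example runs without an API key.

```python
from typing import Callable, List, Sequence

def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 100,
) -> List[List[float]]:
    """Embed texts in fixed-size batches, preserving input order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    vectors: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        vectors.extend(embed_fn(batch))
    return vectors

# Stand-in embedder for illustration only: [character count, number of spaces]
def fake_embed(batch: List[str]) -> List[List[float]]:
    return [[float(len(t)), float(t.count(" "))] for t in batch]

vectors = embed_in_batches(["a b", "cd", "e f g"], fake_embed, batch_size=2)
print(vectors)  # [[3.0, 1.0], [2.0, 0.0], [5.0, 2.0]]
```

With the notebook's client, `embed_fn` could be something like `lambda batch: [item.embedding for item in create_embeddings(batch)]`.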


## Performing Semantic Search
We implement cosine similarity to find the most relevant text chunks for a user query.

In [18]:
def cosine_similarity(vec1, vec2):
    """
    Calculates the cosine similarity between two vectors.

    Args:
    vec1 (np.ndarray): The first vector.
    vec2 (np.ndarray): The second vector.

    Returns:
    float: The cosine similarity between the two vectors.
    """
    # Compute the dot product of the two vectors and divide by the product of their norms
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
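
A quick sanity check of the formula on small vectors: parallel vectors should score about 1, orthogonal vectors 0, and opposite vectors about -1. The sketch repeats the function so that it runs standalone; the test vectors are arbitrary values chosen for illustration.

```python
import numpy as np

def cosine_similarity(vec1, vec2):
    """Dot product divided by the product of the two vector norms."""
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

a = np.array([1.0, 2.0, 3.0])
x_axis = np.array([1.0, 0.0])
y_axis = np.array([0.0, 1.0])

print(cosine_similarity(a, 2 * a))         # parallel: approximately 1.0
print(cosine_similarity(x_axis, y_axis))   # orthogonal: 0.0
print(cosine_similarity(a, -a))            # opposite: approximately -1.0
```

Scaling a vector does not change the score, so only direction matters. If either vector has zero norm the division yields `nan`, which is worth guarding against when inputs are not guaranteed to be non-zero.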

In [19]:
def semantic_search(query, text_chunks, embeddings, k=5):
    """
    Performs semantic search on the text chunks using the given query and embeddings.

    Args:
    query (str): The query for the semantic search.
    text_chunks (List[str]): A list of text chunks to search through.
    embeddings (List): Embedding objects (each with an `.embedding` attribute), one per text chunk.
    k (int): The number of top relevant text chunks to return. Default is 5.

    Returns:
    List[str]: A list of the top k most relevant text chunks based on the query.
    """
    # Create an embedding for the query; create_embeddings already returns response.data
    query_embedding = create_embeddings(query)[0].embedding
    similarity_scores = []  # Initialize a list to store similarity scores

    # Calculate similarity scores between the query embedding and each text chunk embedding
    for i, chunk_embedding in enumerate(embeddings):
        similarity_score = cosine_similarity(np.array(query_embedding), np.array(chunk_embedding.embedding))
        similarity_scores.append((i, similarity_score))  # Append the index and similarity score

    # Sort the similarity scores in descending order
    similarity_scores.sort(key=lambda x: x[1], reverse=True)
    # Get the indices of the top k most similar text chunks
    top_indices = [index for index, _ in similarity_scores[:k]]
    # Return the top k most relevant text chunks
    return [text_chunks[index] for index in top_indices]


## Running a Query on Extracted Chunks

In [20]:
import numpy as np
# Note: this import replaces the hand-written cosine_similarity defined above
from sklearn.metrics.pairwise import cosine_similarity

def semantic_search(query: str, text_chunks: list[str], embeddings: list[list[float]], k: int):
    """
    Performs semantic search using a query, text chunks, and their embeddings.
    """
    # Create an embedding for the query
    query_embedding = create_embeddings(query)[0].embedding

    # Calculate similarity scores between the query and each text chunk
    similarity_scores = cosine_similarity(
        [query_embedding],
        embeddings
    )[0]

    # Get the indices of the top k scores
    top_k_indices = np.argsort(similarity_scores)[-k:][::-1]

    # Return the text chunks that correspond to the top k indices
    return [text_chunks[i] for i in top_k_indices]
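
To try the ranking logic without calling the API, the sketch below runs the same top-k selection on hand-made 2-D vectors. `top_k_chunks` is a hypothetical helper that takes a query vector directly, and the vectors and chunk labels are invented stand-ins for real embeddings and text. It uses NumPy only, normalizing rows so that a matrix product yields cosine similarities.

```python
import numpy as np

def top_k_chunks(query_vec, chunk_vecs, chunks, k=2):
    """Return (chunk, score) pairs for the k chunks most cosine-similar to query_vec."""
    matrix = np.asarray(chunk_vecs, dtype=float)
    query = np.asarray(query_vec, dtype=float)
    # Normalize so that the dot product equals cosine similarity
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query = query / np.linalg.norm(query)
    scores = matrix @ query
    # argsort is ascending, so take the last k indices and reverse them
    top = np.argsort(scores)[-k:][::-1]
    return [(chunks[i], float(scores[i])) for i in top]

chunks = ["flood safety", "winter storms", "emergency kit"]
chunk_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
results = top_k_chunks([2.0, 1.0], chunk_vecs, chunks, k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

Returning the scores alongside the text makes it easier to judge whether the top results are actually close to the query or merely the least distant.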

## Generating a Response Based on Retrieved Chunks

In [21]:
import os
from openai import OpenAI
from openai import AuthenticationError # Import for specific error handling
from dotenv import load_dotenv # Optional, for loading from .env file

# --- 1. Set up your OpenAI client ---
# Load environment variables from a .env file (if you're using one)
load_dotenv()

# Get your API key from the environment variable
key = os.getenv("OPENAI_API_KEY")

# Basic check for API key
if key is None:
    print("Error: OPENAI_API_KEY environment variable is not set.")
    print("Please set your API key before running the script.")
    exit() # Exit if the key is not found

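# Initialize the OpenAI client
# Note: constructing the client does not validate the key; an invalid key
# surfaces as an AuthenticationError on the first API request.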
try:
    client = OpenAI(api_key=key)
except AuthenticationError as e:
    print(f"Authentication Error: {e}")
    print("Please check your OpenAI API key. It might be incorrect, expired, or have insufficient permissions.")
    exit()
except Exception as e:
    print(f"An unexpected error occurred during client initialization: {e}")
    exit()

# --- 2. Define the generate_response function ---
def generate_response(system_prompt, user_message, model="gpt-3.5-turbo"):
    """
    Generates a response from the OpenAI model based on the system prompt and user message.

    Args:
    system_prompt (str): The system prompt to guide the AI's behavior.
    user_message (str): The user's message or query.
    model (str): The OpenAI model to be used for generating the response.
                 Default is "gpt-3.5-turbo".

    Returns:
    ChatCompletion: The response object; the text is in .choices[0].message.content.
    """
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message}
        ]
    )
    return response

# --- 3. Define the system_prompt and user_prompt variables ---
system_prompt = "You are a helpful and creative AI assistant. Provide concise and relevant answers."
user_prompt = "Tell me about disasters."

# --- 4. Generate AI response ---
print("Generating AI response...")
try:
    ai_response = generate_response(system_prompt, user_prompt)

    # Print the final response content from the AI
    print("\nAI Response:")
    print(ai_response.choices[0].message.content)

except Exception as e:
    print(f"An error occurred during AI response generation: {e}")

Generating AI response...

AI Response:
Disasters are events that cause significant damage, destruction, and disruption to communities and the environment. They can be natural, such as earthquakes, hurricanes, and wildfires, or human-made, like industrial accidents and terrorist attacks. Disasters often result in loss of life, displacement of people, and economic hardship. Preparedness, response, and recovery efforts are crucial in mitigating the impact of disasters.
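
The call above sends the question on its own, so the answer comes from the model's general knowledge rather than from the PDF. To ground the response, the retrieved chunks can be placed in the user message. A minimal sketch, assuming a hypothetical helper `build_rag_prompt` and an illustrative prompt layout; neither is from the notebook, and the chunk strings are placeholders.

```python
def build_rag_prompt(query, retrieved_chunks):
    """Format retrieved chunks and the question into a single user message."""
    context = "\n\n".join(
        f"Context {i}:\n{chunk.strip()}"
        for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return f"{context}\n\nQuestion: {query}"

# Illustrative system prompt; the wording is an assumption
grounded_system_prompt = (
    "Answer strictly from the provided context. "
    "If the context does not contain the answer, say so."
)

user_prompt = build_rag_prompt(
    "Tell me about disasters.",
    ["First retrieved chunk. ", "Second retrieved chunk."],
)
print(user_prompt)
```

With the notebook's functions this could be combined as `generate_response(grounded_system_prompt, build_rag_prompt(query, semantic_search(query, text_chunks, embeddings, k=2)))`, provided `text_chunks` and `embeddings` come from the PDF rather than from the sample sentences.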


## Evaluating the AI Response
We use a second model call to compare the AI response with the expected answer and assign a score of 0, 0.5, or 1.

In [25]:
# --- 2. Define the generate_response function ---
def generate_response(system_prompt, user_message, model="gpt-3.5-turbo"):
    """
    Generates a response from the OpenAI model based on the system prompt and user message.

    Args:
    system_prompt (str): The system prompt to guide the AI's behavior.
    user_message (str): The user's message or query.
    model (str): The OpenAI model to be used for generating the response.
                 Default is "gpt-3.5-turbo".

    Returns:
    object: The response object from the AI model (e.g., ChatCompletion object).
            You access content via .choices[0].message.content
    """
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message}
        ]
    )
    return response

# --- 3. Define the variables needed for the AI response and evaluation ---
# Define the user's query
query = "What can I find emergency food in 17104?"


# Define the data structure containing the true response
# This 'data' variable simulates where your true answers would come from,
# perhaps a loaded dataset or a database.

data = [
    {"question": "Where can I find emergency food in ZIP code 17104?", "ideal_answer": "Thanks for reaching out. Based on your ZIP code (17104), here are food resources available near you:\n\n🍫 Central Pennsylvania Food Bank\nLocation: 3908 Corey Rd, Harrisburg, PA 17109\nPhone: (717) 564-1700\nServices: Emergency food boxes, drive-thru distribution (Mon–Fri, 9am–3pm)\n\n🥗 St. Francis of Assisi Church Pantry\nAddress: 1439 Market St, Harrisburg, PA 17103\nServices: Walk-in pantry on Wednesdays, 10am–1pm\n\n🫾 You may also qualify for Disaster SNAP benefits (D-SNAP). A 211 Navigator can guide you through the process.\n\n📞 You can always call 2-1-1 or text 898-211 for live assistance."}
]
# Define the system prompt for the AI assistant
# This is the 'system_prompt' that generate_response uses to guide the AI's initial behavior
ai_assistant_system_prompt = "You are a helpful and creative AI assistant. Provide concise and relevant answers."

# --- 4. Generate AI response ---
print(f"User Query: {query}")
try:
    # Use ai_assistant_system_prompt for the AI assistant's behavior
    ai_response = generate_response(ai_assistant_system_prompt, query)
    ai_response_content = ai_response.choices[0].message.content
    print(f"AI Assistant's Response: {ai_response_content}")

except Exception as e:
    print(f"An error occurred during AI response generation: {e}")
    # Handle the error, maybe set ai_response_content to an error message
    ai_response_content = "Error generating AI response."


# --- 5. Define the evaluation system prompt ---
evaluate_system_prompt = (
    "You are an intelligent evaluation system tasked with assessing the AI assistant's responses. "
    "If the AI assistant's response is very close to the true response, assign a score of 1. "
    "If the response is incorrect or unsatisfactory in relation to the true response, assign a score of 0. "
    "If the response is partially aligned with the true response, assign a score of 0.5. "
    "Provide only the score (e.g., '1', '0', or '0.5') and nothing else."
)

# --- 6. Create the evaluation prompt ---
# This evaluation_prompt acts as the 'user_message' for the evaluation AI
evaluation_prompt = (
    f"User Query: {query}\n"
    f"AI Response:\n{ai_response_content}\n" # Use the content directly
    f"True Response: {data[0]['ideal_answer']}\n"
    f"Instructions: Based on the true response, evaluate the AI response and provide a score according to the following rules: Score 1 if very close. Score 0 if incorrect/unsatisfactory. Score 0.5 if partially aligned. Output only the score."
)

# --- 7. Generate the evaluation response ---
print("\nGenerating Evaluation Response...")
try:
    # Use evaluate_system_prompt for the evaluation AI's behavior
    # evaluation_prompt becomes the user_message for the evaluation AI
    evaluation_response = generate_response(evaluate_system_prompt, evaluation_prompt)
    evaluation_score = evaluation_response.choices[0].message.content

    # Print the evaluation response
    print(f"Evaluation Score: {evaluation_score}")

except Exception as e:
    print(f"An error occurred during evaluation response generation: {e}")

User Query: What can I find emergency food in 17104?
AI Assistant's Response: You can find emergency food resources in the 17104 area by contacting local food banks, community centers, or organizations like the Central Pennsylvania Food Bank. They can provide you with information on where to access emergency food assistance in your area.

Generating Evaluation Response...
Evaluation Score: 0.5
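
The evaluator returns its score as text, so it has to be converted before scores can be aggregated over several test questions. A minimal sketch, assuming the evaluator replies with `0`, `0.5`, or `1` as instructed; `parse_score` and `mean_score` are hypothetical helpers, not part of the notebook.

```python
import re

VALID_SCORES = {0.0, 0.5, 1.0}

def parse_score(raw):
    """Extract a score of 0, 0.5, or 1 from the evaluator's reply, else None."""
    match = re.search(r"\d+(?:\.\d+)?", raw)
    if match is None:
        return None
    value = float(match.group())
    return value if value in VALID_SCORES else None

def mean_score(raw_replies):
    """Average the parseable scores; returns None if none can be parsed."""
    scores = [s for s in map(parse_score, raw_replies) if s is not None]
    return sum(scores) / len(scores) if scores else None

print(parse_score("0.5"))                  # 0.5
print(parse_score("Score: 1"))             # 1.0
print(parse_score("not sure"))             # None
print(mean_score(["1", "0.5", "0", "?"]))  # 0.5
```

Replies that cannot be parsed are skipped rather than counted as zero; counting how many were skipped would show whether the evaluator is following the output instruction.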
