# Introduction to Simple RAG

Retrieval-Augmented Generation (RAG) combines information retrieval with a generative language model. Passages relevant to the user's question are retrieved from an external knowledge source and supplied to the model as context, which helps ground its answers and can improve factual accuracy.

In a Simple RAG setup, we follow these steps:

1. **Data Ingestion**: Load and preprocess the text data.
2. **Chunking**: Break the data into smaller chunks to improve retrieval performance.
3. **Embedding Creation**: Convert the text chunks into numerical representations using an embedding model.
4. **Semantic Search**: Retrieve relevant chunks based on a user query.
5. **Response Generation**: Use a language model to generate a response based on retrieved text.

This notebook implements a Simple RAG approach, evaluates the model’s response, and explores various improvements.

## Setting Up the Environment
We begin by importing necessary libraries.

In [1]:
import fitz
import os
import numpy as np
import json
from openai import OpenAI

## Extracting Text from a PDF File
To implement RAG, we first need a source of textual data. In this case, we extract text from a PDF file using the PyMuPDF library.

In [2]:
def extract_text_from_pdf(pdf_path):
    """
    Extracts the text from every page of a PDF file.

    Args:
    pdf_path (str): Path to the PDF file.

    Returns:
    str: Extracted text from the PDF.
    """
    # Open the PDF file
    mypdf = fitz.open(pdf_path)
    all_text = ""  # Initialize an empty string to store the extracted text

    # Iterate through each page in the PDF
    for page_num in range(mypdf.page_count):
        page = mypdf[page_num]  # Get the page
        text = page.get_text("text")  # Extract text from the page
        all_text += text  # Append the extracted text to the all_text string

    return all_text  # Return the extracted text

## Chunking the Extracted Text
Once we have the extracted text, we divide it into smaller, overlapping chunks to improve retrieval accuracy.

In [3]:
def chunk_text(text, n, overlap):
    """
    Chunks the given text into segments of n characters with overlap.

    Args:
    text (str): The text to be chunked.
    n (int): The number of characters in each chunk.
    overlap (int): The number of overlapping characters between chunks.

    Returns:
    List[str]: A list of text chunks.
    """
    chunks = []  # Initialize an empty list to store the chunks
    
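    # A step of zero makes range() raise, and a negative step silently yields no chunks
    if overlap >= n:
        raise ValueError("overlap must be smaller than n")

    # Loop through the text with a step size of (n - overlap)
    for i in range(0, len(text), n - overlap):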
        # Append a chunk of text from index i to i + n to the chunks list
        chunks.append(text[i:i + n])

    return chunks  # Return the list of text chunks
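
To make the overlap concrete, here is a minimal sketch that runs the same sliding-window idea on a short string. The sample text and the small `n` and `overlap` values are illustrative assumptions, chosen only so the output is easy to read.

```python
import math

def chunk_text(text, n, overlap):
    """Split text into n-character chunks; consecutive chunks share `overlap` characters."""
    if not 0 <= overlap < n:
        raise ValueError("overlap must satisfy 0 <= overlap < n")
    step = n - overlap  # each new chunk starts `step` characters after the previous one
    return [text[i:i + n] for i in range(0, len(text), step)]

sample = "abcdefghijklmnopqrstuvwxyz"  # 26 characters, illustrative only
chunks = chunk_text(sample, n=10, overlap=4)
for chunk in chunks:
    print(repr(chunk))

# The number of chunks is ceil(len(text) / (n - overlap))
expected_count = math.ceil(len(sample) / (10 - 4))
print(len(chunks), expected_count)
```

Each chunk starts with the last four characters of the previous one. The final chunk (`'yz'`) is short and already contained in the chunk before it, which is a side effect of this simple scheme. By the same formula, a chunk size of 1000 with an overlap of 200 gives a step of 800, so a text that yields 65 chunks is between 51,201 and 52,000 characters long.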

## Setting Up the OpenAI API Client
We initialize the OpenAI client to generate embeddings and responses.

In [4]:
# Initialize the OpenAI client with the API key read from the environment
client = OpenAI(
    api_key=os.getenv("OPENAI_API_KEY")  # Retrieve the API key from environment variables
)

## Extracting and Chunking Text from a PDF File
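Now we load the PDF, extract its text, and split it into chunks. Note that this cell redefines `extract_text_from_pdf` and `chunk_text` using PyPDF2 instead of PyMuPDF; these later definitions are the ones used in the rest of the notebook.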

In [2]:
# Import the required library for PDF reading
from PyPDF2 import PdfReader

# --- Function Definitions ---

def extract_text_from_pdf(pdf_path):
    """Extracts text from all pages of a PDF file."""
    text = ""
    # Open the PDF file in binary read mode
    with open(pdf_path, 'rb') as file:
        pdf = PdfReader(file)
        # Loop through each page and extract text
        for page in pdf.pages:
            text += page.extract_text() or ""
    return text

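def chunk_text(text, chunk_size, overlap):
    """Chunks text into segments with a specified overlap."""
    # Without this check, overlap >= chunk_size would keep `start` from advancing
    # and the loop below would never terminate
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    # Loop through the text and create overlapping chunks
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        # Move the start position forward by the chunk size minus the overlap
        start += chunk_size - overlap
    return chunks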

# --- Extraction and Chunking ---

# Define the path to the PDF file
# IMPORTANT: Make sure this path is correct on your system
pdf_path = "/Users/kekunkoya/Desktop/PHD/ISEM 770/Class Code SAT/Homelessness.pdf"

# Extract text from the PDF file by calling the function we defined
extracted_text = extract_text_from_pdf(pdf_path)

# Chunk the extracted text by calling the function we defined
text_chunks = chunk_text(extracted_text, 1000, 200)

# Print the number of text chunks created
print("Number of text chunks:", len(text_chunks))

# Print the first text chunk
print("\nFirst text chunk:")
# Print only the first chunk if it exists
if text_chunks:
    print(text_chunks[0])

Number of text chunks: 65

First text chunk:
19
Defining and Measuring Homelessness
Volker Busch-Geertsema
GISS, Germany
>>Abstract _ Substantial progress has been made at EU level on defining home -
lessness. The European Typology on Homelessness and Housing Exclusion 
(ETHOS) is widely accepted in almost all European countries (and beyond) as 
a useful conceptual framework and almost everywhere definitions at national 
level (though often not identical with ETHOS) are discussed in relation to this typology. The development and some of the remaining controversial issues concerning ETHOS and a reduced version of it are discussed in this chapter. Furthermore essential reasons and different approaches to measure home -
lessness are presented. It is argued that a single number will not be enough 
to understand homelessness and monitor progress in tackling it. More 
research and more work to improve information on homelessness at national levels will be needed before we can achieve compara

## Creating Embeddings for Text Chunks
Embeddings transform text into numerical vectors, which allow for efficient similarity search.

In [3]:
from dotenv import load_dotenv
import os

load_dotenv()
api_key=os.getenv('OPENAI_API_KEY')

In [4]:
# Make sure to install the library: pip install openai
from openai import OpenAI

# The client reads the API key from the OPENAI_API_KEY environment variable.
# To pass the key explicitly instead, use: client = OpenAI(api_key="YOUR_OPENAI_API_KEY")
client = OpenAI()

def create_embeddings(text_chunks, model="text-embedding-3-small"):
    """
    Creates embeddings for a list of text chunks using the specified OpenAI model.

    Args:
    text_chunks (list[str]): The list of input texts for which embeddings are to be created.
    model (str): The OpenAI model to be used. Default is "text-embedding-3-small".

    Returns:
    list: A list of embedding objects from the OpenAI API.
    """
    # The 'input' parameter for the OpenAI API can take a list of strings directly
    response = client.embeddings.create(
        model=model,
        input=text_chunks
    )

    # The embeddings are located in response.data
    return response.data

# Assume 'text_chunks' is a list of strings from your previous code
# Example: text_chunks = ["This is the first sentence.", "This is the second one."]

# Create embeddings for the text chunks
embeddings_data = create_embeddings(text_chunks)

# Print the number of embeddings created
print(f"Successfully created {len(embeddings_data)} embeddings.")

# Print the embedding for the first text chunk (optional)
if embeddings_data:
    print("\nEmbedding for the first chunk (first 5 values):")
    print(embeddings_data[0].embedding[:5])

Successfully created 65 embeddings.

Embedding for the first chunk (first 5 values):
[-0.037515923380851746, 0.012890207581222057, 0.06607704609632492, 0.017732873558998108, 0.032095909118652344]
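
The cell above sends all 65 chunks in a single request. For larger documents a single request may exceed the API's per-request input limits, so one option is to embed in batches. The sketch below is an assumption-laden illustration: `batched`, `embed_in_batches`, and the offline stand-in `fake_embed` are hypothetical helpers, not part of the notebook or of the OpenAI library, and the batch size is arbitrary.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(text_chunks, embed_fn, batch_size=100):
    """Call `embed_fn` once per batch and concatenate the results in input order."""
    vectors = []
    for batch in batched(text_chunks, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors

# Stand-in embedder so the sketch runs offline: maps each text to a 2-d vector.
def fake_embed(batch):
    return [[float(len(text)), float(text.count(" "))] for text in batch]

demo_chunks = [f"chunk number {i}" for i in range(7)]
vectors = embed_in_batches(demo_chunks, fake_embed, batch_size=3)
batch_sizes = [len(batch) for batch in batched(demo_chunks, 3)]
print(len(vectors))   # one vector per chunk
print(batch_sizes)    # sizes of the individual requests
```

In the notebook, `embed_fn` could be something like `lambda batch: [item.embedding for item in create_embeddings(batch)]`, since `create_embeddings` returns the list of embedding objects for its input.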


In [5]:
# 1. Import the OpenAI library
from openai import OpenAI

# 2. Initialize the client
# The client automatically looks for the OPENAI_API_KEY environment variable.
client = OpenAI()

# Assume 'text_chunks' is a list of strings from your previous code
# Example: text_chunks = ["This is the first sentence.", "This is the second one."]

# 3. Create embeddings using the specified OpenAI model
model_name = "text-embedding-3-small"
response = client.embeddings.create(
    model=model_name,
    input=text_chunks
)

# 4. Extract the embedding vectors from the response object
# The actual embeddings are in the `.data` attribute of the response.
embeddings = [embedding_item.embedding for embedding_item in response.data]

# Check the first embedding's first few values
if embeddings:
    print(f"Successfully created {len(embeddings)} embeddings with model '{model_name}'.")
    print("\nFirst 5 values of the first embedding:")
    print(embeddings[0][:5])

Successfully created 65 embeddings with model 'text-embedding-3-small'.

First 5 values of the first embedding:
[-0.037481676787137985, 0.012894639745354652, 0.06609976291656494, 0.01773896999657154, 0.032106947153806686]


## Performing Semantic Search
We implement cosine similarity to find the most relevant text chunks for a user query.

In [6]:
def cosine_similarity(vec1, vec2):
    """
    Calculates the cosine similarity between two vectors.

    Args:
    vec1 (np.ndarray): The first vector.
    vec2 (np.ndarray): The second vector.

    Returns:
    float: The cosine similarity between the two vectors.
    """
    # Compute the dot product of the two vectors and divide by the product of their norms
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
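
A few hand-checkable vectors show what the score means: it depends only on the angle between the vectors, not on their length. The sketch below uses a hypothetical variant named `safe_cosine_similarity` that also returns 0.0 for a zero vector; that fallback is an assumption made here to avoid a division by zero, not something the notebook's function does.

```python
import numpy as np

def safe_cosine_similarity(vec1, vec2):
    """Cosine similarity that returns 0.0 when either vector has zero length."""
    denom = np.linalg.norm(vec1) * np.linalg.norm(vec2)
    if denom == 0:
        return 0.0  # assumption: treat a zero vector as having no similarity
    return float(np.dot(vec1, vec2) / denom)

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])

same_direction = safe_cosine_similarity(a, 3 * a)              # scaling does not matter
forty_five_deg = safe_cosine_similarity(a, b)                  # approximately 0.7071
orthogonal = safe_cosine_similarity(a, np.array([0.0, 2.0]))   # no shared direction
opposite = safe_cosine_similarity(a, -a)                       # opposite direction
zero_case = safe_cosine_similarity(a, np.zeros(2))             # handled by the guard

print(same_direction, forty_five_deg, orthogonal, opposite, zero_case)
```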

In [7]:
def semantic_search(query, text_chunks, embeddings, k=5):
    """
    Performs semantic search on the text chunks using the given query and embeddings.

    Args:
    query (str): The query for the semantic search.
    text_chunks (List[str]): A list of text chunks to search through.
    embeddings (list): Embedding objects for the text chunks, each exposing an `.embedding` vector (as returned by create_embeddings).
    k (int): The number of top relevant text chunks to return. Default is 5.

    Returns:
    List[str]: A list of the top k most relevant text chunks based on the query.
    """
    # Create an embedding for the query
    # create_embeddings already returns response.data, so index it directly
    query_embedding = create_embeddings(query)[0].embedding
    similarity_scores = []  # Initialize a list to store similarity scores

    # Calculate similarity scores between the query embedding and each text chunk embedding
    for i, chunk_embedding in enumerate(embeddings):
        similarity_score = cosine_similarity(np.array(query_embedding), np.array(chunk_embedding.embedding))
        similarity_scores.append((i, similarity_score))  # Append the index and similarity score

    # Sort the similarity scores in descending order
    similarity_scores.sort(key=lambda x: x[1], reverse=True)
    # Get the indices of the top k most similar text chunks
    top_indices = [index for index, _ in similarity_scores[:k]]
    # Return the top k most relevant text chunks
    return [text_chunks[index] for index in top_indices]


## Running a Query on Extracted Chunks

In [8]:
import numpy as np
# Note: this import shadows the cosine_similarity function defined earlier in the notebook
from sklearn.metrics.pairwise import cosine_similarity

def semantic_search(query: str, text_chunks: list[str], embeddings: list[list[float]], k: int):
    """
    Performs semantic search using a query, text chunks, and their embeddings.
    """
    # Create an embedding for the query
    query_embedding = create_embeddings(query)[0].embedding

    # Calculate similarity scores between the query and each text chunk
    similarity_scores = cosine_similarity(
        [query_embedding],
        embeddings
    )[0]

    # Get the indices of the top k scores
    top_k_indices = np.argsort(similarity_scores)[-k:][::-1]

    # Return the text chunks that correspond to the top k indices
    return [text_chunks[i] for i in top_k_indices]
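
The ranking step can be tried without any API call by substituting small hand-made vectors for real embeddings. The following is a minimal sketch under that assumption: `top_k_chunks`, the toy chunks, and the 2-d vectors are all hypothetical and exist only to show how the similarity scores and `np.argsort` select the top `k` chunks.

```python
import numpy as np

def top_k_chunks(query_vec, chunk_vecs, chunks, k):
    """Return the k chunks whose vectors have the highest cosine similarity to the query."""
    chunk_matrix = np.asarray(chunk_vecs, dtype=float)  # shape: (num_chunks, dim)
    query = np.asarray(query_vec, dtype=float)          # shape: (dim,)
    # Cosine similarity of the query against every row of the matrix at once
    norms = np.linalg.norm(chunk_matrix, axis=1) * np.linalg.norm(query)
    scores = chunk_matrix @ query / norms
    # argsort is ascending, so reverse it and keep the first k indices
    order = np.argsort(scores)[::-1][:k]
    return [(chunks[i], float(scores[i])) for i in order]

toy_chunks = ["cats and dogs", "stock markets", "kittens"]
toy_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]  # stand-ins for real embeddings
query_vec = [1.0, 0.1]

results = top_k_chunks(query_vec, toy_vecs, toy_chunks, k=2)
for text, score in results:
    print(f"{score:.4f}  {text}")
```

Here the third vector points in nearly the same direction as the query, so "kittens" ranks first even though the first vector is longer along the query's dominant axis.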

## Generating a Response Based on Retrieved Chunks

In [10]:
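# This cell assumes 'client' is an initialized OpenAI client and that 'query' (str)
# and 'top_chunks' (for example, the list returned by semantic_search) were defined
# in an earlier cell. Neither is defined in the cells shown above, which is why the
# run recorded below ends in a NameError.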

# 1. Define the system prompt
system_prompt = (
    "You are an AI assistant that strictly answers based on the given context. "
    "If the answer cannot be derived directly from the provided context, "
    "respond with: 'I do not have enough information to answer that.'"
)

# 2. Define the user prompt by combining context and the query
context_for_prompt = "\n".join(
    f"Context {i + 1}:\n{chunk}\n=====================================\n"
    for i, chunk in enumerate(top_chunks)
)
user_prompt = f"{context_for_prompt}\nQuestion: {query}"


# 3. Define a helper that sends the prompts to the chat model
def generate_response(system_prompt, user_message, model="gpt-3.5-turbo"):
    """
    Generates a response from the OpenAI model based on the system prompt and user message.

    Args:
    system_prompt (str): The system prompt to guide the AI's behavior.
    user_message (str): The user's message or query.
    model (str): The OpenAI model to be used for generating the response.
                 Default is "gpt-3.5-turbo".

    Returns:
    ChatCompletion: The full response object from the API; the generated text
                    is in response.choices[0].message.content.
    """
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
    )
    return response


# Generate AI response using the defined prompts
ai_response = generate_response(system_prompt, user_prompt)

# Print the final response content from the AI
print(ai_response.choices[0].message.content)

NameError: name 'top_chunks' is not defined
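
The prompt assembly from the cell above can be inspected on its own, without calling the API. The sketch below wraps the same formatting in a hypothetical helper, `build_user_prompt`; the two demo passages and the demo question are placeholders, not content retrieved from the PDF.

```python
def build_user_prompt(chunks, query):
    """Format retrieved chunks as numbered context blocks followed by the question."""
    context = "\n".join(
        f"Context {i + 1}:\n{chunk}\n=====================================\n"
        for i, chunk in enumerate(chunks)
    )
    return f"{context}\nQuestion: {query}"

# Placeholder inputs, standing in for semantic_search results and a real user query
demo_chunks = ["First retrieved passage.", "Second retrieved passage."]
demo_query = "What do the passages say?"

prompt = build_user_prompt(demo_chunks, demo_query)
print(prompt)
```

Printing the prompt before sending it is a cheap way to confirm that the retrieved chunks are present, clearly separated, and followed by the question.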

## Evaluating the AI Response
We ask the language model to compare the AI response with the expected answer and assign a score of 0, 0.5, or 1.

In [19]:
# Define the system prompt for the evaluation system
evaluate_system_prompt = "You are an intelligent evaluation system tasked with assessing the AI assistant's responses. If the AI assistant's response is very close to the true response, assign a score of 1. If the response is incorrect or unsatisfactory in relation to the true response, assign a score of 0. If the response is partially aligned with the true response, assign a score of 0.5."

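# Create the evaluation prompt by combining the user query, AI response, true response, and evaluation system prompt.
# 'data' is assumed to be a list of validation records with an 'ideal_answer' field,
# loaded in a cell that is not shown in this notebook.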
evaluation_prompt = f"User Query: {query}\nAI Response:\n{ai_response.choices[0].message.content}\nTrue Response: {data[0]['ideal_answer']}\n{evaluate_system_prompt}"

# Generate the evaluation response using the evaluation system prompt and evaluation prompt
evaluation_response = generate_response(evaluate_system_prompt, evaluation_prompt)

# Print the evaluation response
print(evaluation_response.choices[0].message.content)

Score: 0.5
The AI response provides a general understanding of Explainable AI (XAI) and its importance, but it lacks the specific details about providing insights into AI decision-making and ensuring fairness in AI systems.
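
Because the evaluator replies in free text, the numeric score has to be pulled out before it can be aggregated over several queries. The following is a minimal sketch, assuming the reply contains a `Score: <number>` line like the output above; `parse_score` is a hypothetical helper, and the model is not guaranteed to follow this format every time.

```python
import re

ALLOWED_SCORES = {0.0, 0.5, 1.0}  # the three values the evaluation prompt asks for

def parse_score(evaluation_text):
    """Return the score found after 'Score:' or None if it is missing or not allowed."""
    match = re.search(r"Score:\s*(\d+(?:\.\d+)?)", evaluation_text)
    if match is None:
        return None
    score = float(match.group(1))
    return score if score in ALLOWED_SCORES else None

sample_output = "Score: 0.5\nThe AI response provides a general understanding of the topic."
parsed = parse_score(sample_output)
missing = parse_score("The response is partially aligned.")
out_of_range = parse_score("Score: 2")
print(parsed, missing, out_of_range)
```

Returning `None` for a missing or unexpected value keeps malformed evaluator replies from being silently counted as a score.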
