# **Basic RAG Pipeline Implementation**  

**Overview**  
This is a basic RAG (Retrieval-Augmented Generation) pipeline implementation using:
- PyPDFLoader and RecursiveCharacterTextSplitter
- LangChain
- FAISS (Facebook AI Similarity Search)
- OpenAI embeddings
- GPT-4o-mini API

**Preprocessing**  
My grandfather's memoir, "My Life Story", was split into 10 PDFs, one per chapter. Each PDF was loaded with PyPDFLoader, chunked with RecursiveCharacterTextSplitter, and cleaned of tab characters. A citation to the source chapter was then appended to the end of each chunk.
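
To make the chunking and citation step concrete, here is a minimal sketch in plain Python. It uses a simple fixed-size sliding window, which is an assumption for illustration only: RecursiveCharacterTextSplitter prefers to split on separators such as paragraphs and sentences, so its chunk boundaries will differ. The helper names and the file name `chapter_01.pdf` are hypothetical.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap (simplified stand-in)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


def append_citation(chunks, file_name):
    """Append a source citation to the end of each chunk."""
    return [f"{chunk} [Source: {file_name}]" for chunk in chunks]


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=4)
cited = append_citation(chunks, "chapter_01.pdf")  # hypothetical file name

for chunk in cited:
    print(chunk)
```

Each chunk repeats the last 4 characters of the previous one, which is what `chunk_overlap` controls. The overlap keeps a sentence that straddles a boundary available in both neighboring chunks.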

In [1]:
import os
from dotenv import load_dotenv
from pydantic import BaseModel, Field
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
from langchain.prompts import PromptTemplate
from langchain_community.vectorstores import FAISS

In [2]:
# Load from .env file that contains the OpenAI API key
load_dotenv() 

# Get OpenAI API key from .env file
openai_api_key = os.getenv("OPENAI_API_KEY")

In [3]:
# Make a list of the PDF paths
paths = [os.path.join(os.getcwd(), "RAG Eval", "pdfs", file) for file in os.listdir(os.path.join(os.getcwd(), "RAG Eval", "pdfs"))]
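
One thing to watch with `os.listdir` is that it returns every entry in the folder, in no guaranteed order. The sketch below shows one possible alternative that keeps only `.pdf` files and sorts them. The helper `list_pdf_paths` and the demo file names are hypothetical, and the demo runs against a temporary directory rather than the real `RAG Eval/pdfs` folder.

```python
import tempfile
from pathlib import Path


def list_pdf_paths(pdf_dir):
    """Return the sorted paths of the .pdf files directly inside pdf_dir."""
    return sorted(
        str(p)
        for p in Path(pdf_dir).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


# Demo against a throwaway directory with two PDFs and one stray file
with tempfile.TemporaryDirectory() as tmp:
    for name in ["chapter_02.pdf", "chapter_01.pdf", "notes.txt"]:
        (Path(tmp) / name).touch()
    demo_paths = list_pdf_paths(tmp)
    demo_names = [Path(p).name for p in demo_paths]

print(demo_names)
```

The sort is lexicographic, so file names need zero-padded chapter numbers (`chapter_02`, `chapter_10`) to come out in reading order.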

In [None]:
def replace_t_with_space(list_of_documents):
    """
    Replaces all tab characters ('\t') with spaces in the page content of each document.

    Args:
        list_of_documents: A list of document objects, each with a 'page_content' attribute.

    Returns:
        The modified list of documents with tab characters replaced by spaces.
    """

    for doc in list_of_documents:
        doc.page_content = doc.page_content.replace('\t', ' ')  # Replace tabs with spaces
    return list_of_documents

In [4]:
def encode_pdfs(paths, chunk_size, chunk_overlap):
    """
    Encodes multiple PDFs into a vector store using OpenAI embeddings.

    Args:
        paths: A list of paths to the PDF files.
        chunk_size: The desired size of each text chunk.
        chunk_overlap: The amount of overlap between consecutive chunks.

    Returns:
        A FAISS vector store containing the encoded content of the PDF chunks with appended citations.
    """

    all_cleaned_texts = []

    for path in paths:
        # Load PDF documents
        loader = PyPDFLoader(path)
        documents = loader.load()

        # Split documents into chunks
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size, chunk_overlap=chunk_overlap, length_function=len
        )
        texts = text_splitter.split_documents(documents)
        cleaned_texts = replace_t_with_space(texts)

        # Extract file name from path
        file_name = os.path.basename(path)

        # Append document citation to the end of each chunk
        for text in cleaned_texts:
            text.page_content = text.page_content + f" [Source: {file_name}]"

        all_cleaned_texts.extend(cleaned_texts)

    # Create embeddings
    embeddings = OpenAIEmbeddings()

    # Create vector store
    vectorstore = FAISS.from_documents(all_cleaned_texts, embeddings)

    return vectorstore

In [5]:
# Encode the PDFs
chunks_vector_store = encode_pdfs(paths, chunk_size=1000, chunk_overlap=200)

In [4]:
# Save the vector store
# chunks_vector_store.save_local("basic_rag_citation.json")

# Load the vector store
chunks_vector_store = FAISS.load_local("my_life_story_basic_rag_citation.json", OpenAIEmbeddings(), allow_dangerous_deserialization=True)

In [5]:
# Create a retriever
chunks_query_retriever = chunks_vector_store.as_retriever(search_kwargs={"k": 2})
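
The `k` in `search_kwargs` is the number of nearest chunks returned per query. The sketch below illustrates the idea with a brute-force search over toy 3-dimensional vectors. It assumes L2 (Euclidean) distance, which I believe is the default for LangChain's FAISS wrapper, but treat that as an assumption. The vectors and chunk labels are made up, and real OpenAI embeddings have far more dimensions.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, indexed, k=2):
    """Return the texts of the k entries whose vectors are closest to query_vec.

    indexed is a list of (text, vector) pairs.
    """
    ranked = sorted(indexed, key=lambda item: l2_distance(query_vec, item[1]))
    return [text for text, _ in ranked[:k]]


# Made-up embeddings, for illustration only
toy_index = [
    ("chunk about childhood", [0.9, 0.1, 0.0]),
    ("chunk about school", [0.1, 0.9, 0.0]),
    ("chunk about family sayings", [0.8, 0.2, 0.1]),
]

nearest = top_k([1.0, 0.0, 0.0], toy_index, k=2)
print(nearest)
```

With `k=2` only the two closest chunks reach the language model, so an answer that depends on a third chunk cannot be found. Raising `k` trades a longer prompt for better recall.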

In [6]:
class QuestionAnswerFromContext(BaseModel):
    """
    Model to generate an answer to a query based on a given context.
    
    Attributes:
        answer_based_on_content (str): The generated answer and citation based on the context.
    """
    answer_based_on_content: str = Field(description="Generates an answer and [citation] to a query based on a given context.")
    
def create_question_answer_from_context_chain(llm):
    # Use the language model passed in by the caller
    question_answer_from_context_llm = llm

    # Define the prompt template for answering from the retrieved context
    question_answer_prompt_template = """ 
    You are querying a memoir called "My Life Story" written by George Shambaugh.
    For the question below, provide a concise but sufficient answer. If you don't know, only write "The RAG retrieval was unable to provide sufficient context":
    {context}
    Question
    {question}
    """

    # Create a PromptTemplate object with the specified template and input variables
    question_answer_from_context_prompt = PromptTemplate(
        template=question_answer_prompt_template,
        input_variables=["context", "question"],
    )

    # Create a chain by combining the prompt template and the language model
    question_answer_from_context_cot_chain = question_answer_from_context_prompt | question_answer_from_context_llm.with_structured_output(
        QuestionAnswerFromContext)
    return question_answer_from_context_cot_chain

In [7]:
def answer_question_from_context(question, context, question_answer_from_context_chain):
    """
    Answer a question using the given context by invoking the question-answering chain.

    Args:
        question: The question to be answered.
        context: The context to be used for answering the question.
        question_answer_from_context_chain: The chain (prompt plus structured-output LLM) to invoke.

    Returns:
        A dictionary containing the answer, context, and question.
    """
    input_data = {
        "question": question,
        "context": context
    }
    output = question_answer_from_context_chain.invoke(input_data)
    answer = output.answer_based_on_content
    return {"answer": answer, "context": context, "question": question}
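
A detail worth checking: `context` is passed as a Python list of strings. If the prompt template fills `{context}` the way `str.format` does (an assumption about the template's default f-string style), the model sees the list's repr, brackets and quotes included. The sketch below uses plain `str.format` as a stand-in to compare that with joining the chunks first. The template text, chunk contents, and question are hypothetical.

```python
# Stand-in for the prompt template, using plain str.format
template = "Context:\n{context}\nQuestion:\n{question}"

context_chunks = [
    "First chunk. [Source: a.pdf]",
    "Second chunk. [Source: b.pdf]",
]

# Passing the list directly: the placeholder receives str(list)
as_list = template.format(context=context_chunks, question="Example question?")

# Joining first: the placeholder receives clean text with blank lines between chunks
as_text = template.format(
    context="\n\n".join(context_chunks), question="Example question?"
)

print(as_list)
print("-----")
print(as_text)
```

Either form can work in practice. Joining simply removes the list punctuation from the prompt and makes chunk boundaries explicit.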

In [8]:
def show_context(context):
    """
    Display the contents of the provided context list.

    Args:
        context (list): A list of context items to be displayed.

    Prints each context item in the list with a heading indicating its position.
    """
    for i, c in enumerate(context):
        print(f"Context {i + 1}:")
        print(c)
        print("\n")

In [9]:
def test_RAG(test_query):
    """
    Test the Retrieval-Augmented Generation (RAG) process with a given query. Prints the generated answer, followed by the context chunks retrieved from the vector store.

    Args:
        test_query (str): The query to be tested against the vector store created from my grandfather's memoir.

    Returns:
        None. The answer and the retrieved context are printed rather than returned.
    """
    # Retrieve chunks related to the test query from the vector store
    chunks = chunks_query_retriever.invoke(test_query)
    # Extract the content of each chunk to form the context
    context = [chunk.page_content for chunk in chunks]
    # Initialize the language model
    llm = ChatOpenAI(temperature=0, model_name="gpt-4o-mini", max_tokens=2000)
    # Create a prompt template and combine with the language model
    question_answer_from_context_chain = create_question_answer_from_context_chain(llm)
    # Answer the question based on the retrieved context
    answer = answer_question_from_context(test_query, context, question_answer_from_context_chain)
    # Print the response generated by the language model
    print("Response:", answer["answer"], "\n")
    # Display the context chunks retrieved from the vector store
    show_context(context)
    

In [10]:
test_RAG("What is an example of one of his mom's sayings, and what did he refer to these sayings as?")



Response: One example of one of his mom's sayings is, "I told you not to climb that tree! When you fall out and break both your legs, don’t come running into me!" He referred to these sayings as "Momisms." 

Context 1:
There were other Momisms. We boys were forbidden to climb the tree in the front yard, so of course 
we would. Mom would say, ”I told you not to climb that tree! When you fall out and break both your 
legs, don’t come running into me!” We wouldn’t. If Mom was told that Paul hit Bob on the head with 
the ball bat, she would say “It’s a good thing it was his head or he could have bee n hurt.”. Substitute 
any boy’s name for either person., Mom had other sayings: “I know where you are going if you don’t 
mend your ways”; “If you spill salt it is bad luck unless yu throw a pinch of it over your left shoulder.” 
Don’t look cross-eyed are they will get stuck that way!” ”You only go to the hospital to die.” One saying 
Dad was heard to say was “He is so dumb, he couldn’t pour pi