# **Basic RAG Pipeline Implementation**

**Overview**  
This is a basic RAG (Retrieval-Augmented Generation) pipeline implementation using:
- LangChain
- FAISS (Facebook AI Similarity Search)
- OpenAI embeddings
- GPT-4o-mini API

**Implementation Reference**  
[https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/simple_rag.ipynb](https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/simple_rag.ipynb)

**Preprocessing**  
I split my grandfather's memoir, "My Life Story", into 10 PDFs, one per chapter. Each PDF is loaded with PyPDFLoader and chunked with RecursiveCharacterTextSplitter. A citation to the source file, in the form `[Source: <file name>]`, is appended to the end of each chunk so that every retrieved chunk carries its chapter of origin.
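
The chunk-and-cite step can be illustrated without any dependencies. The sketch below is a simplification: it uses fixed-size sliding windows, whereas RecursiveCharacterTextSplitter prefers to break on separators such as paragraphs and sentences. The function name `chunk_with_citation` and the file name `chapter_01.pdf` are hypothetical and chosen only for illustration.

```python
def chunk_with_citation(text, source, chunk_size=1000, chunk_overlap=200):
    """Split text into overlapping fixed-size windows and tag each with its source."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        window = text[start:start + chunk_size]
        # Same citation format as the pipeline: appended to the end of the chunk
        chunks.append(f"{window} [Source: {source}]")
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Tiny sizes so the overlap is visible; the pipeline itself uses 1000 and 200
sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_citation(sample, "chapter_01.pdf", chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
# abcdefghij [Source: chapter_01.pdf]
# ghijklmnop [Source: chapter_01.pdf]
# mnopqrstuv [Source: chapter_01.pdf]
# stuvwxyz [Source: chapter_01.pdf]
```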

In [1]:
import os
import sys
from dotenv import load_dotenv
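# The wildcard import is expected to supply PyPDFLoader, RecursiveCharacterTextSplitter,
# FAISS, OpenAIEmbeddings, ChatOpenAI and the retrieval helpers used in the cells below
from helper_functions import *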

In [2]:
# Load from .env file that contains the OpenAI API key
load_dotenv() 

# Get OpenAI API key from .env file
openai_api_key = os.getenv("OPENAI_API_KEY")

In [3]:
# Make a list of the PDF paths (sorted for a stable order, skipping non-PDF files)
pdf_dir = os.path.join(os.getcwd(), "RAG Eval", "pdfs")
paths = [os.path.join(pdf_dir, file) for file in sorted(os.listdir(pdf_dir)) if file.lower().endswith(".pdf")]

In [4]:
def encode_pdfs(paths, chunk_size, chunk_overlap):
    """
    Encodes multiple PDFs into a vector store using OpenAI embeddings.

    Args:
        paths: A list of paths to the PDF files.
        chunk_size: The desired size of each text chunk.
        chunk_overlap: The amount of overlap between consecutive chunks.

    Returns:
        A FAISS vector store containing the encoded content of the PDFs with citations.
    """

    all_cleaned_texts = []

    # Build the splitter once; it is reused for every PDF
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap, length_function=len
    )

    for path in paths:
        # Load PDF documents (one Document per page)
        loader = PyPDFLoader(path)
        documents = loader.load()

        # Split documents into chunks and replace tab characters with spaces
        texts = text_splitter.split_documents(documents)
        cleaned_texts = replace_t_with_space(texts)

        # Extract file name from path
        file_name = os.path.basename(path)

        # Append document citation to the end of each chunk
        for text in cleaned_texts:
            text.page_content = text.page_content + f" [Source: {file_name}]"

        all_cleaned_texts.extend(cleaned_texts)

    # Create embeddings
    embeddings = get_langchain_embedding_provider(EmbeddingProvider.OPENAI)

    # Create vector store
    vectorstore = FAISS.from_documents(all_cleaned_texts, embeddings)

    return vectorstore

In [5]:
# Encode the PDFs
chunks_vector_store = encode_pdfs(paths, chunk_size=1000, chunk_overlap=200)

In [None]:
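# Save the vector store (run once after encoding).
# Note: save_local writes a folder, so "basic_rag_citation.json" is a directory name, not a JSON file.
#chunks_vector_store.save_local("basic_rag_citation.json")

# Load the vector store. Loading unpickles data, so only enable
# allow_dangerous_deserialization for index files you created yourself.
chunks_vector_store = FAISS.load_local("basic_rag_citation.json", OpenAIEmbeddings(), allow_dangerous_deserialization=True)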

In [4]:
# Create a retriever
chunks_query_retriever = chunks_vector_store.as_retriever(search_kwargs={"k": 2})
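
To make `k=2` concrete, here is a minimal brute-force sketch of nearest-neighbour retrieval: embed the query, measure its distance to every stored vector, and keep the two closest. It assumes Euclidean distance and uses hand-written 3-dimensional toy vectors in place of real embeddings; the function name `top_k` is hypothetical, and a real FAISS index performs this search far more efficiently.

```python
import math


def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k vectors closest to query_vec by Euclidean distance."""
    distances = [
        (math.dist(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)
    ]
    distances.sort()  # smallest distance first
    return [idx for _, idx in distances[:k]]


# Toy stand-ins for chunk embeddings (real OpenAI embeddings have far more dimensions)
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query = [1.0, 0.05, 0.0]

nearest = top_k(query, toy_vectors, k=2)
print(nearest)  # -> [0, 1]
```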

In [11]:
def test_RAG(test_query):
    """
    Test the Retrieval-Augmented Generation (RAG) process with a given query.
    Prints the generated answer followed by the context chunks retrieved from the vector store.

    Args:
        test_query (str): The query to run against the vector store built from my grandfather's memoir.

    Returns:
        None. The answer and the retrieved context are printed rather than returned.
    """
    context = retrieve_context_per_question(test_query, chunks_query_retriever)
    llm = ChatOpenAI(temperature=0, model_name="gpt-4o-mini", max_tokens=2000)
    question_answer_from_context_chain = create_question_answer_from_context_chain(llm)
    answer = answer_question_from_context(test_query, context, question_answer_from_context_chain)
    print("Response:", answer["answer"], "\n")
    show_context(context)
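
# The commented sketch below illustrates the general idea behind answering from
# retrieved context: the chunks are joined into one block and placed in the prompt
# ahead of the question. The function name `build_prompt` and the prompt wording are
# hypothetical; this is not the template used by the helper functions above.

```python
def build_prompt(question, context_chunks):
    """Join retrieved chunks into one context block and place the question after it."""
    context_block = "\n\n".join(
        f"Context {i}:\n{chunk}" for i, chunk in enumerate(context_chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"{context_block}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Placeholder chunks; in the pipeline these come from the retriever
example_chunks = [
    "first retrieved chunk [Source: chapter_01.pdf]",
    "second retrieved chunk [Source: chapter_02.pdf]",
]
prompt = build_prompt("Who is Laura?", example_chunks)
print(prompt)
```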
    

In [18]:
test_RAG("Who is Laura?")



Response: Laura Lynn Shambaugh is the daughter of Rudy and was born on August 3, 1960. She is mentioned in the context as a young girl who needed glasses and had various adventures related to them. 

Context 1:
a young girl a little younger than Amy. The Archambaults next door had children of similar ages, so 
Amy and Tim had a lot of playmates. Rudy had become pregnant again, only this time her pregnancy 
was more of a problem. She was in and out of the hospital many times with various problems. At one 
point near the end of pregnancy when Rudy was in the hospital, the tissues of her mouth and throat 
started to break down in response to one of the medications she was given. It was a difficult , life-
threatening time for Rudy. Laura Lynn Shambaugh was born August 3, 1960 in good health. Elfleda and 
Mom Eaton both came out to help the burgeoning family. Rudy’s physician decided that it would be 
dangerous for her to have another pregnancy, so soon after Laura was born, Rudy had a com