# Retrieval Augmented Generation using LangChain

### 1. Setup and Initialization

This block sets up the environment and initializes the required API clients:
- Loads environment variables from a .env file.
- Asserts that the OpenAI API key, the LangChain API key, and the LangChain endpoint are all set.
- Initializes both the raw OpenAI client and a LangChain ChatOpenAI LLM for use in the RAG pipeline.

In [43]:
import os
from dotenv import load_dotenv
from openai import OpenAI
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

load_dotenv()

assert os.getenv("OPENAI_API_KEY"), "Missing an OpenAI key!!!!"
assert os.getenv("LANGCHAIN_API_KEY"), "Missing LangChain key!!!!"
assert os.getenv("LANGCHAIN_ENDPOINT"), "Missing endpoint!!!!"

openai_client = OpenAI()
llm = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)
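
The assertions above stop at the first missing variable. As a minimal sketch of an alternative, the helper below collects every missing variable in one pass. `missing_env_vars` and `REQUIRED_VARS` are hypothetical names introduced for illustration, not part of any library; the variable names are the ones this notebook already checks.

```python
import os

# The three variables this notebook asserts on
REQUIRED_VARS = ["OPENAI_API_KEY", "LANGCHAIN_API_KEY", "LANGCHAIN_ENDPOINT"]

def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# Explicit mapping with placeholder values, so the result does not depend on the real environment
example_env = {"OPENAI_API_KEY": "sk-placeholder", "LANGCHAIN_ENDPOINT": ""}
print(missing_env_vars(REQUIRED_VARS, example_env))
```

Called without the second argument, the helper reads the real environment, so it could run right after `load_dotenv()`.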

### 2. Load Document
- Here I load a book my mother gave me, which I use as test data.
- Verifies that the file exists.
- Splits the text into overlapping chunks so that information at chunk boundaries is not lost.

In [44]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

file_path = "data/book_abt_chairs.pdf"

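if not os.path.exists(file_path):
    # Fail fast with a clear error; the original called sys.exit without importing sys
    raise FileNotFoundError(f"The PDF file does not exist: {file_path}")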

loader = PyPDFLoader(file_path)
pages = loader.load()

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=150,
    separators=["\n\n", "\n", ".", " "]
)
splitted_docs = splitter.split_documents(pages)
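
To make the effect of the chunk size and overlap concrete, here is a simplified sketch of a fixed-width sliding window. It is not the algorithm used by RecursiveCharacterTextSplitter, which also prefers to cut at the listed separators and therefore produces chunks of varying length. `sliding_chunks` is a hypothetical helper written only for this illustration.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut `text` into windows of `chunk_size` characters that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk as long as it is shorter than the overlap.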

### 3. Embedding
- Choose the embedding model.
- Store the vectors in RAM.
- Create a retriever that returns the most semantically similar chunks.

In [45]:
from langchain_core.vectorstores import InMemoryVectorStore

embeddings = OpenAIEmbeddings()
vector_store = InMemoryVectorStore.from_documents(splitted_docs, embedding=embeddings)

retriever = vector_store.as_retriever()
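
The idea behind "most semantically similar" can be sketched without any API calls. The example below assumes a cosine-similarity ranking over embedding vectors; the three-dimensional vectors and the texts are made up for illustration, and `cosine_similarity` and `top_k` are hypothetical helpers, not LangChain functions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k=2):
    """Return the texts of the k stored items most similar to the query vector."""
    ranked = sorted(store, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Toy store of (text, embedding) pairs; real embeddings have hundreds of dimensions or more
toy_store = [
    ("chair ergonomics", [0.9, 0.1, 0.0]),
    ("eco-friendly materials", [0.1, 0.9, 0.0]),
    ("history of furniture", [0.0, 0.2, 0.9]),
]
print(top_k([1.0, 0.2, 0.0], toy_store, k=2))  # ['chair ergonomics', 'eco-friendly materials']
```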

### 4. "clean_text" and formatting
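- I added a function that replaces "\n" with a space, because line breaks in the extracted text were inconsistent.
- format_doc cleans each retrieved chunk, prints a short snippet for debugging, and joins the chunks into one context string.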

In [46]:
def clean_text(text: str) -> str:
    return text.replace("\n", " ").strip()

In [47]:
def format_doc(docs):
    cleaned_docs = []
    for doc in docs:
        cleaned = clean_text(doc.page_content)
        print("Cleaned snippet:", cleaned[:100])
        cleaned_docs.append(cleaned)
    return "\n\n".join(cleaned_docs)
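
Replacing each "\n" with a space can leave runs of two or more spaces, which is visible in the printed snippets at the end of this notebook. A minimal sketch of a stricter variant, assuming it is acceptable to collapse every run of whitespace into a single space; `normalize_whitespace` is a hypothetical name, not a function from the notebook.

```python
import re

def normalize_whitespace(text: str) -> str:
    """Collapse any run of spaces, tabs, or line breaks into one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()

sample = "7 \nактуалност  придобиват\n\nначините"
print(normalize_whitespace(sample))  # 7 актуалност придобиват начините
```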

### 5. Expand Query method
- I added this step last and adjusted the rest of the code accordingly. The user's wording might not match how the information appears in the document, and expansion also helps when the user is not entirely sure what they are looking for.
- This is the first use of chaining in the program.

In [48]:
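from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

def expand_query(question: str, llm_model) -> list[str]:
    # Prompt (Bulgarian): "Rephrase the following question as 5 varied, semantically different queries"
    prompt = ChatPromptTemplate.from_template(
        "Преформулирай следния въпрос в 5 разнообразни и семантично различни заявки:\n\n{question}"
    )
    query_chain = RunnablePassthrough() | prompt | llm_model | StrOutputParser()
    result = query_chain.invoke(question)
    # One query per non-empty line of the reply (5 are requested, but the count is not enforced)
    return [q.strip() for q in result.split("\n") if q.strip()]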
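
The model's reply format is not shown in this notebook. If it comes back as a numbered or bulleted list, the markers would end up inside the search queries. The sketch below assumes that kind of reply and strips the markers before use; `parse_queries` is a hypothetical helper and the sample reply is invented for illustration.

```python
import re

def parse_queries(raw: str, limit: int = 5) -> list[str]:
    """Split a model reply into queries, dropping blank lines and leading list markers."""
    queries = []
    for line in raw.splitlines():
        # Remove markers such as "1.", "2)", "-", "*" or "•" at the start of the line
        cleaned = re.sub(r"^\s*(?:\d+[.)]|[-*•])\s*", "", line).strip()
        if cleaned:
            queries.append(cleaned)
    return queries[:limit]

raw = "1. What does the book say about ergonomics?\n2) Is comfort discussed?\n\n- Posture and seating"
print(parse_queries(raw))
```

The `limit` argument also caps the number of queries, so an extra introductory or closing line from the model does not trigger additional searches beyond the intended five.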


### 6. Similarity Search & Deduplication
- Runs a similarity search for each generated query, removes chunks with duplicate text, and returns the first two unique chunks.

In [49]:
from langchain_core.documents import Document

def retrieve_documents(question: str, k: int = 2) -> list[Document]:
    expanded_questions = expand_query(question, llm)
    all_docs = []
    for q in expanded_questions:
        docs = vector_store.similarity_search(q, k=k)
        all_docs.extend(docs)

    seen = set()
    unique_docs = []
    for doc in all_docs:
        if doc.page_content not in seen:    # Remove duplicates
            unique_docs.append(doc)
            seen.add(doc.page_content)

    # Only the first two unique chunks are kept, regardless of k
    return unique_docs[:2]
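
Because the results are concatenated query by query and then cut to the first two unique chunks, the first expanded query tends to fill both slots. One possible alternative, shown here only as a sketch, is to interleave the result lists by rank so that every query can contribute. Plain strings stand in for Document objects, and `merge_round_robin` is a hypothetical helper.

```python
from itertools import zip_longest

def merge_round_robin(result_lists, limit):
    """Interleave ranked result lists, skip duplicates, and stop after `limit` items."""
    if limit <= 0:
        return []
    seen = set()
    merged = []
    # zip_longest yields all rank-0 hits, then all rank-1 hits, padding short lists with None
    for rank_group in zip_longest(*result_lists):
        for text in rank_group:
            if text is None or text in seen:
                continue
            seen.add(text)
            merged.append(text)
            if len(merged) == limit:
                return merged
    return merged

per_query = [["A", "B"], ["C", "A"], ["D", "B"]]
print(merge_round_robin(per_query, limit=3))  # ['A', 'C', 'D']
```

With the same input, concatenating and truncating would give `['A', 'B', 'C']`, where two of the three results come from the first query.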

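Prompt template in Bulgarian. The system message tells the model to answer ONLY on the basis of the retrieved context and, if the answer is not found, to say "I cannot answer based on the information from the document."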

In [50]:
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages([
    ("system", "Отговори на въпроса САМО на базата на тази информация:\n\n{context}\n\n"
               "Ако отговорът не е намерен, кажи 'не мога да отговоря на базата на информацията от документа.'"),
    ("user", "{question}")
])

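I pulled the chain call out into a function as good practice, even though it adds little here. The function refers to rag_chain, which is defined in the next cell, so that cell has to run before the function is called.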

In [None]:
def answer_question(question: str) -> str:
    return rag_chain.invoke(question)

### Main Chain

In [51]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough, RunnableLambda

rag_chain = (
    {
        "context": RunnableLambda(retrieve_documents) | RunnableLambda(format_doc),
        "question": RunnablePassthrough(),
    }
    | prompt
    | llm
    | StrOutputParser()
)

### Output

In [53]:
question = "Има ли тема за ергономия?"  # "Is there a topic on ergonomics?"
response = answer_question(question)
print("Отговор:", response)

Cleaned snippet: 7  актуалност придобиват начините за подобряване на  физическото и психическо благосъстояние на потр
Cleaned snippet: 103  3. Екологичен аспект  Обединяващата идея при проектирането на мебели е тяхната  екологичност - 
Отговор: Да, в документа се обръща особено внимание на добрата ергономия, която означава създаването на среда, позволяваща на тялото да се придвижва по различни начини.

In English, the answer reads: "Yes, the document pays particular attention to good ergonomics, which means creating an environment that allows the body to move in different ways."
