## Basic RAG (Retrieval-Augmented Generation) with LangChain

- This project demonstrates the complete workflow of building a Basic RAG system using LangChain. It covers every step required to convert raw documents into a retrieval-enabled LLM application.

- The purpose is to show the core workflow that connects:

1. Document processing

2. Embedding generation

3. Vector database indexing

4. Retrieval

5. LLM-based answer generation

- The goal is to create a simple but fully functional pipeline where a user can ask a question and the system retrieves relevant information from loaded documents to generate an accurate, grounded answer.

### Project Overview

- RAG (Retrieval-Augmented Generation) enhances LLM responses by grounding them in external knowledge. Instead of relying solely on the model’s internal training data, the system retrieves relevant information from your documents and uses it during generation.


In [56]:
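# Load the Groq and Hugging Face credentials from a local .env file

import os
from dotenv import load_dotenv

load_dotenv()
groq_api_key = os.getenv("GROQ_API_KEY")

# os.environ only accepts strings, so export HF_TOKEN only when it is set
hf_token = os.getenv("HF_TOKEN")
if hf_token:
    os.environ["HF_TOKEN"] = hf_token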

python-dotenv could not parse statement starting at line 16


### 1. Document Loading

- Documents are ingested using LangChain's document loader utilities. This notebook loads a single PDF with `PyPDFLoader`; other formats, such as plain text, can be handled by swapping in a different loader.

In [57]:
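# Document loaders live in the langchain-community package in recent LangChain releases
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("../research_papers/attention.pdf")  # Replace with your file path
docs = loader.load()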
docs[0]

Document(metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-04-10T21:11:43+00:00', 'author': '', 'keywords': '', 'moddate': '2024-04-10T21:11:43+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': '../research_papers/attention.pdf', 'total_pages': 15, 'page': 0, 'page_label': '1'}, page_content='Provided proper attribution is provided, Google hereby grants permission to\nreproduce the tables and figures in this paper solely for use in journalistic or\nscholarly works.\nAttention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.com\nNoam Shazeer∗\nGoogle Brain\nnoam@google.com\nNiki Parmar∗\nGoogle Research\nnikip@google.com\nJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.com\nAidan N. Gomez∗ †\nUniversity of Toronto\naidan@cs.toronto.edu\nŁukasz Kaiser∗\nGoo

### 2. Splitting into Chunks

- Long documents are divided into manageable chunks using a text splitter with overlap to preserve context continuity.

In [58]:
# Text splitters live in the langchain-text-splitters package in recent LangChain releases
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=250)
documents = text_splitter.split_documents(documents=docs)
documents[:3]


[Document(metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-04-10T21:11:43+00:00', 'author': '', 'keywords': '', 'moddate': '2024-04-10T21:11:43+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': '../research_papers/attention.pdf', 'total_pages': 15, 'page': 0, 'page_label': '1'}, page_content='Provided proper attribution is provided, Google hereby grants permission to\nreproduce the tables and figures in this paper solely for use in journalistic or\nscholarly works.\nAttention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.com\nNoam Shazeer∗\nGoogle Brain\nnoam@google.com\nNiki Parmar∗\nGoogle Research\nnikip@google.com\nJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.com\nAidan N. Gomez∗ †\nUniversity of Toronto\naidan@cs.toronto.edu\nŁukasz Kaiser∗\nGo
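
To make the effect of `chunk_overlap` concrete, here is a minimal pure-Python sketch of fixed-size chunking with overlap. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and newlines). The helper name `chunk_text` is hypothetical.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghij"
print(chunk_text(sample, chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.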

### 3. Embeddings Generation

- Each chunk is embedded into a high-dimensional vector representation. Embeddings translate natural language into numerical form for semantic search.

In [59]:
# Requires the langchain-huggingface package
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
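
As a rough illustration of why vectors enable semantic search, the sketch below compares hand-written toy vectors with cosine similarity. The three-dimensional vectors and their names are made up for the example; real embeddings from `all-MiniLM-L6-v2` have 384 dimensions and come from the model, not from hand-picked numbers.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional "embeddings" (illustrative values only)
attention_query = [0.8, 0.2, 0.1]
attention_chunk = [0.9, 0.1, 0.0]
unrelated_chunk = [0.0, 0.1, 0.9]

print(cosine_similarity(attention_query, attention_chunk))  # high: vectors point the same way
print(cosine_similarity(attention_query, unrelated_chunk))  # much lower: vectors diverge
```

Texts with similar meaning should map to vectors that point in similar directions, which is what the retrieval step relies on.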


### 4. Building a Vector Store

- Chroma is used as the vector database to index and store all embeddings. It enables efficient similarity search across large datasets.

In [60]:
# Requires the langchain-chroma package
from langchain_chroma import Chroma

vector_store = Chroma.from_documents(documents=documents, embedding=embeddings)

vector_store.similarity_search_with_score("self-attention mechanism")


[(Document(metadata={'creationdate': '2024-04-10T21:11:43+00:00', 'title': '', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'moddate': '2024-04-10T21:11:43+00:00', 'page': 1, 'producer': 'pdfTeX-1.40.25', 'source': '../research_papers/attention.pdf', 'creator': 'LaTeX with hyperref', 'page_label': '2', 'trapped': '/False', 'author': '', 'total_pages': 15, 'subject': '', 'keywords': ''}, page_content='self-attention and discuss its advantages over models such as [17, 18] and [9].\n3 Model Architecture\nMost competitive neural sequence transduction models have an encoder-decoder structure [5, 2, 35].\nHere, the encoder maps an input sequence of symbol representations (x1, ..., xn) to a sequence\nof continuous representations z = (z1, ..., zn). Given z, the decoder then generates an output\nsequence (y1, ..., ym) of symbols one element at a time. At each step the model is auto-regressive\n[10], consuming the previously genera
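
The call above returns `(document, score)` pairs. The sketch below shows the same idea as a brute-force search over a tiny in-memory index. It assumes the score is a squared Euclidean distance, so smaller means closer; check your Chroma configuration before relying on that. The names `squared_l2` and `top_k` and the two-dimensional vectors are hypothetical, and a real vector database uses an approximate index rather than scanning every vector.

```python
def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vec: list[float], index: list[tuple[str, list[float]]], k: int = 2):
    """Return the k (text, distance) pairs nearest to query_vec, closest first."""
    scored = [(text, squared_l2(query_vec, vec)) for text, vec in index]
    scored.sort(key=lambda pair: pair[1])  # ascending: smaller distance is better
    return scored[:k]


# Toy index of (chunk text, made-up 2-dimensional vector)
index = [
    ("chunk about self-attention", [0.9, 0.1]),
    ("chunk about training data", [0.2, 0.8]),
    ("chunk about multi-head attention", [0.8, 0.3]),
]

results = top_k([1.0, 0.0], index, k=2)
for text, distance in results:
    print(f"{distance:.2f}  {text}")
```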

### 5. LLM Model Selection

- The LLM is the open-weight `openai/gpt-oss-20b` model served through Groq. The Groq API key loaded earlier is passed to the client.

In [61]:
from langchain_groq import ChatGroq

llm=ChatGroq(model="openai/gpt-oss-20b",api_key=groq_api_key)

llm.invoke("hi")

AIMessage(content='Hello! How can I help you today?', additional_kwargs={'reasoning_content': 'The user says "hi". Likely respond politely.'}, response_metadata={'token_usage': {'completion_tokens': 30, 'prompt_tokens': 72, 'total_tokens': 102, 'completion_time': 0.029977205, 'prompt_time': 0.003795284, 'queue_time': 0.13558457, 'total_time': 0.033772489, 'completion_tokens_details': {'reasoning_tokens': 12}}, 'model_name': 'openai/gpt-oss-20b', 'system_fingerprint': 'fp_d6de37e6be', 'service_tier': 'on_demand', 'finish_reason': 'stop', 'logprobs': None}, id='lc_run--e53addc0-eb5c-4942-a9ad-f5c97c352b45-0', usage_metadata={'input_tokens': 72, 'output_tokens': 30, 'total_tokens': 102})

In [62]:
# Design the prompt template for the LLM

from langchain_core.prompts import ChatPromptTemplate

prompt="""
You are a helpful assistant for document question-answering tasks. Answer the user's question using only the context provided by the retriever.
If the context does not contain the answer, simply say that you don't know.

Here is the context:

{context}
"""

prompt_template=ChatPromptTemplate.from_messages(
    [
        ("system",prompt),
        ("human","{input}"),
    ]
)

prompt_template

ChatPromptTemplate(input_variables=['context', 'input'], input_types={}, partial_variables={}, messages=[SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context'], input_types={}, partial_variables={}, template="\nYou are a helpful assistant for document question-answering tasks. Answer the user's question using only the context provided by the retriever.\nIf the context does not contain the answer, simply say that you don't know.\n\nHere is the context:\n\n{context}\n"), additional_kwargs={}), HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['input'], input_types={}, partial_variables={}, template='{input}'), additional_kwargs={})])

### 6. Creating a Retriever

- The retriever component handles similarity search. Given a user query, it returns the most semantically relevant chunks from the vector store.

In [63]:
retriever=vector_store.as_retriever()


In [64]:
from langchain_classic.chains.combine_documents import create_stuff_documents_chain
from langchain_classic.chains.retrieval import create_retrieval_chain

# Chain that "stuffs" a list of documents into the prompt's {context} slot and calls the LLM
qanda_chain=create_stuff_documents_chain(llm,prompt_template)

# Chain that first retrieves documents for the input, then passes them to qanda_chain
retriever_chain=create_retrieval_chain(retriever,qanda_chain)
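
Conceptually, "stuffing" means joining every retrieved chunk into one block of text and placing it in the prompt's `{context}` slot. The sketch below imitates that step with plain strings. The helper `build_messages`, the shortened `SYSTEM_TEMPLATE`, and the blank-line separator are assumptions made for illustration, not the chain's actual internals.

```python
SYSTEM_TEMPLATE = "Answer using only this context:\n\n{context}"


def build_messages(
    question: str, chunks: list[str], separator: str = "\n\n"
) -> list[tuple[str, str]]:
    """Join the retrieved chunks and place them in the system message."""
    context = separator.join(chunks)
    return [
        ("system", SYSTEM_TEMPLATE.format(context=context)),
        ("human", question),  # the user's question is passed through unchanged
    ]


messages = build_messages(
    "what is self-attention mechanism",
    ["Chunk one text.", "Chunk two text."],
)
for role, content in messages:
    print(f"[{role}] {content}")
```

Because every chunk lands in a single prompt, this approach only works while the combined chunks fit in the model's context window.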

### 7. Retrieval-Augmented Generation

- The retrieved context is combined with the user query and passed to a chat model. This ensures the model’s answer is guided by the document content rather than relying solely on its internal knowledge.

In [65]:
response=retriever_chain.invoke({"input":"what is self-attention mechanism"})

response

{'input': 'what is self-attention mechanism',
 'context': [Document(metadata={'page_label': '5', 'creationdate': '2024-04-10T21:11:43+00:00', 'author': '', 'trapped': '/False', 'source': '../research_papers/attention.pdf', 'producer': 'pdfTeX-1.40.25', 'moddate': '2024-04-10T21:11:43+00:00', 'keywords': '', 'creator': 'LaTeX with hyperref', 'page': 4, 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'title': '', 'total_pages': 15, 'subject': ''}, page_content='3.2.3 Applications of Attention in our Model\nThe Transformer uses multi-head attention in three different ways:\n• In "encoder-decoder attention" layers, the queries come from the previous decoder layer,\nand the memory keys and values come from the output of the encoder. This allows every\nposition in the decoder to attend over all positions in the input sequence. This mimics the\ntypical encoder-decoder attention mechanisms in sequence-to-sequence models such as\n[38,

In [66]:
response["answer"]

'**Self‑attention** (also called *intra‑attention*) is a core component of the Transformer architecture.  \nIn a self‑attention layer every token (or position) in a sequence can “look at” every other token in the same sequence to gather contextual information.\n\nKey points:\n\n| Component | How it’s computed | What it represents |\n|-----------|-------------------|--------------------|\n| **Query (Q)** | Derived from the token’s own representation (e.g., by a linear projection) | The “request” or focus of the token |\n| **Key (K)** | Derived from all tokens in the sequence (same linear projection as Q) | The “identifier” used to compare with queries |\n| **Value (V)** | Derived from all tokens (same linear projection as K) | The actual information to be aggregated |\n\nThe attention score for a pair of tokens is the dot‑product of their Q and K vectors, usually scaled by the square root of the dimension and passed through a softmax to obtain a probability distribution. Each token then

In [67]:
response=retriever_chain.invoke({"input":"what are transformers"})

response["answer"]

'Transformers are a deep‑learning model architecture that processes sequences by stacking multiple identical layers of two main components:\n\n1. **Multi‑head self‑attention** – allows each token to attend to every other token in the sequence, capturing long‑range dependencies without recurrence.  \n2. **Position‑wise fully‑connected feed‑forward networks** – applied independently to each position, providing non‑linear transformations.\n\nEach of these sub‑layers is wrapped in a residual connection followed by layer normalization, ensuring stable training. The encoder stack (typically 6 layers in the original paper) feeds into a decoder stack that also uses self‑attention and cross‑attention to the encoder outputs. All sub‑layers and embedding layers produce outputs of the same dimensionality (d_model = 512 in the cited design). This architecture replaces traditional recurrent or convolutional networks for tasks such as machine translation.'
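
Since the response dictionary carries the retrieved documents under `"context"` alongside `"answer"`, it can also be used to show where an answer came from. The sketch below lists the unique source and page pairs. `Doc` is a hypothetical stand-in for a LangChain `Document`, `cited_pages` is an illustrative helper, and `fake_response` is made-up data shaped like the output above.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a LangChain Document (hypothetical, for illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def cited_pages(response: dict) -> list[tuple[str, str]]:
    """Return the unique (source, page_label) pairs behind an answer, in retrieval order."""
    seen: list[tuple[str, str]] = []
    for doc in response["context"]:
        key = (doc.metadata.get("source", "unknown"), doc.metadata.get("page_label", "?"))
        if key not in seen:
            seen.append(key)
    return seen


# Made-up response with the same keys as the chain output: input, context, answer
fake_response = {
    "input": "what is self-attention mechanism",
    "context": [
        Doc("...", {"source": "../research_papers/attention.pdf", "page_label": "5"}),
        Doc("...", {"source": "../research_papers/attention.pdf", "page_label": "2"}),
        Doc("...", {"source": "../research_papers/attention.pdf", "page_label": "5"}),
    ],
    "answer": "placeholder answer",
}

print(cited_pages(fake_response))
# [('../research_papers/attention.pdf', '5'), ('../research_papers/attention.pdf', '2')]
```

The helper only reads `.metadata`, so it should accept the real `response` from `retriever_chain.invoke(...)` as well.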