### Load Environment Variables

This cell loads environment variables required for the project:

- GROQ_API_KEY → API key for Groq LLM inference

- HF_TOKEN → Token for HuggingFace model download

load_dotenv() reads KEY=VALUE pairs from a .env file and adds them to the process environment.

In [4]:

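import os
from dotenv import load_dotenv

load_dotenv()  # read KEY=VALUE pairs from .env into the environment
groq_api_key = os.getenv("GROQ_API_KEY")

# os.environ only accepts strings, so set HF_TOKEN only when it was found
hf_token = os.getenv("HF_TOKEN")
if hf_token:
    os.environ["HF_TOKEN"] = hf_token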

python-dotenv could not parse statement starting at line 16
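
The warning above suggests that one line of the .env file is not a plain KEY=VALUE assignment. As a rough illustration of the expected shape, here is a simplified parser. It is only a sketch, not python-dotenv's actual implementation; `parse_env_text` and the sample values are hypothetical.

```python
def parse_env_text(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines; a simplified stand-in for a .env reader."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep or not key.strip():
            # No KEY=VALUE shape: this is the kind of line a parser warns about.
            continue
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


sample = """
# API credentials (placeholder values)
GROQ_API_KEY="gsk_example"
HF_TOKEN=hf_example
this line is not an assignment
"""
config = parse_env_text(sample)
print(config)
```

Checking the reported line of the .env file for a missing `=`, stray text, or an unclosed quote is usually enough to silence the warning.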


### Load PDF File

PyPDFLoader loads the PDF and returns one Document object per page.
- docs[0] displays the first page for inspection.

In [5]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("../research_papers/attention.pdf")  # Replace with your file path
docs = loader.load()  # one Document per page
docs[0]

Document(metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-04-10T21:11:43+00:00', 'author': '', 'keywords': '', 'moddate': '2024-04-10T21:11:43+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': '../research_papers/attention.pdf', 'total_pages': 15, 'page': 0, 'page_label': '1'}, page_content='Provided proper attribution is provided, Google hereby grants permission to\nreproduce the tables and figures in this paper solely for use in journalistic or\nscholarly works.\nAttention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.com\nNoam Shazeer∗\nGoogle Brain\nnoam@google.com\nNiki Parmar∗\nGoogle Research\nnikip@google.com\nJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.com\nAidan N. Gomez∗ †\nUniversity of Toronto\naidan@cs.toronto.edu\nŁukasz Kaiser∗\nGoo

### Split Text into Chunks

This cell splits the long PDF text into overlapping chunks:

- chunk_size = 1000 characters

- chunk_overlap = 250 characters

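Chunking keeps each piece small enough to embed as a focused unit and lets the retriever return only the relevant passages; the overlap preserves context that would otherwise be cut at chunk boundaries.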

In [6]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=250)
documents = text_splitter.split_documents(documents=docs)
documents[:3]


[Document(metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-04-10T21:11:43+00:00', 'author': '', 'keywords': '', 'moddate': '2024-04-10T21:11:43+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': '../research_papers/attention.pdf', 'total_pages': 15, 'page': 0, 'page_label': '1'}, page_content='Provided proper attribution is provided, Google hereby grants permission to\nreproduce the tables and figures in this paper solely for use in journalistic or\nscholarly works.\nAttention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.com\nNoam Shazeer∗\nGoogle Brain\nnoam@google.com\nNiki Parmar∗\nGoogle Research\nnikip@google.com\nJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.com\nAidan N. Gomez∗ †\nUniversity of Toronto\naidan@cs.toronto.edu\nŁukasz Kaiser∗\nGo
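
To make chunk_size and chunk_overlap concrete, the sketch below cuts a string into fixed-size windows that share 250 characters with their neighbour. This is a simplification: RecursiveCharacterTextSplitter prefers to break on separators such as paragraphs and newlines rather than at fixed offsets, and `sliding_chunks` is a hypothetical helper, not part of LangChain.

```python
def sliding_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 250) -> list[str]:
    """Cut text into fixed-size windows where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


text = "".join(str(i % 10) for i in range(2600))  # 2600 characters of dummy text
chunks = sliding_chunks(text)
print([len(c) for c in chunks])
print(chunks[0][-250:] == chunks[1][:250])  # neighbours share the overlap region
```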

### Load Embedding Model

Initializes the popular sentence-transformer model all-MiniLM-L6-v2 for embedding document chunks.

In [7]:
from langchain_huggingface import HuggingFaceEmbeddings  # pip install langchain-huggingface

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")




### Store Documents in Chroma Vector DB

- This creates a Chroma vector database from the chunked documents, using the embedding model to store one vector per chunk.
- The last line runs a similarity search to test retrieval; each hit is returned together with its score.

In [8]:
from langchain_community.vectorstores import Chroma

vector_store = Chroma.from_documents(documents=documents, embedding=embeddings)

vector_store.similarity_search_with_score("self-attention mechanism")


[(Document(metadata={'title': '', 'author': '', 'subject': '', 'keywords': '', 'creationdate': '2024-04-10T21:11:43+00:00', 'creator': 'LaTeX with hyperref', 'moddate': '2024-04-10T21:11:43+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'trapped': '/False', 'producer': 'pdfTeX-1.40.25', 'source': '../research_papers/attention.pdf', 'page': 1, 'page_label': '2', 'total_pages': 15}, page_content='self-attention and discuss its advantages over models such as [17, 18] and [9].\n3 Model Architecture\nMost competitive neural sequence transduction models have an encoder-decoder structure [5, 2, 35].\nHere, the encoder maps an input sequence of symbol representations (x1, ..., xn) to a sequence\nof continuous representations z = (z1, ..., zn). Given z, the decoder then generates an output\nsequence (y1, ..., ym) of symbols one element at a time. At each step the model is auto-regressive\n[10], consuming the previously genera
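
The idea behind the search can be sketched without a vector database: embed the query, compare it with every stored vector, and keep the closest matches. The toy three-dimensional vectors and the helper names below are made up for illustration; this is not how Chroma is implemented. Note also that the score returned by similarity_search_with_score is typically a distance (lower means closer), whereas this sketch uses cosine similarity (higher means closer).

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k=2):
    """Return the k (text, score) pairs with the highest similarity to query_vec."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in indexed]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy "embeddings"; a real model such as all-MiniLM-L6-v2 produces 384 dimensions.
toy_index = [
    ("self-attention relates positions of one sequence", [0.9, 0.1, 0.0]),
    ("training used 8 GPUs", [0.0, 0.2, 0.9]),
    ("multi-head attention runs several attention layers", [0.7, 0.3, 0.1]),
]
results = top_k([1.0, 0.0, 0.0], toy_index, k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```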

### Initialize Groq LLM

This initializes the Groq-hosted LLM GPT-OSS-20B.
llm.invoke("hi") tests whether the API key is working.


In [9]:
from langchain_groq import ChatGroq

llm=ChatGroq(model="openai/gpt-oss-20b",api_key=groq_api_key)

llm.invoke("hi")

AIMessage(content='Hello! How can I help you today?', additional_kwargs={'reasoning_content': 'We have a conversation: user says "hi". We need to respond politely. There\'s no special instruction. Just say hi back.'}, response_metadata={'token_usage': {'completion_tokens': 45, 'prompt_tokens': 72, 'total_tokens': 117, 'completion_time': 0.04523991, 'prompt_time': 0.003793385, 'queue_time': 0.045327169, 'total_time': 0.049033295, 'completion_tokens_details': {'reasoning_tokens': 27}}, 'model_name': 'openai/gpt-oss-20b', 'system_fingerprint': 'fp_e189667b30', 'service_tier': 'on_demand', 'finish_reason': 'stop', 'logprobs': None}, id='lc_run--7921af1d-0521-4613-a852-f64e4a39360e-0', usage_metadata={'input_tokens': 72, 'output_tokens': 45, 'total_tokens': 117})

### Create Prompt Template

Defines the main prompt template used by the LLM.
It includes:

- System message: instructs the assistant

- Human message: includes the user query {input}

- Context placeholder {context}: filled with the chunks returned by the retriever

In [10]:
# Design the prompt template for the LLM

from langchain_core.prompts import ChatPromptTemplate

prompt="""
You are a helpful assistant for document question-answering tasks.
Answer the user's question using the context supplied by the retriever.
If you don't know the answer, simply say that you don't know.

Here is the context:

{context}
"""

prompt_template=ChatPromptTemplate.from_messages(
    [
        ("system",prompt),
        ("human","{input}"),
    ]
)

prompt_template

ChatPromptTemplate(input_variables=['context', 'input'], input_types={}, partial_variables={}, messages=[SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context'], input_types={}, partial_variables={}, template="\nYou are a helpful assistant for document question-answering tasks.\nAnswer the user's question using the context supplied by the retriever.\nIf you don't know the answer, simply say that you don't know.\n\nHere is the context:\n\n{context}\n"), additional_kwargs={}), HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['input'], input_types={}, partial_variables={}, template='{input}'), additional_kwargs={})])
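
What the template does at run time can be sketched with plain string formatting: each placeholder is replaced by a value, producing one system message and one human message. The shortened template text and the `format_messages` helper below are illustrative assumptions, not the ChatPromptTemplate implementation.

```python
# Shortened stand-ins for the system and human templates.
system_template = "Answer using only this context:\n\n{context}"
human_template = "{input}"


def format_messages(context: str, user_input: str) -> list[tuple[str, str]]:
    """Fill both placeholders and return (role, text) pairs."""
    return [
        ("system", system_template.format(context=context)),
        ("human", human_template.format(input=user_input)),
    ]


messages = format_messages(
    context="Self-attention relates different positions of a single sequence.",
    user_input="What is self-attention?",
)
for role, text in messages:
    print(f"[{role}] {text}")
```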

### Create Retriever

Converts the vector store into a retriever that fetches relevant chunks based on similarity search.

In [11]:
retriever=vector_store.as_retriever()


### Create Retrieval + Q&A Chain

- create_stuff_documents_chain → feeds retrieved context into the prompt

- create_retrieval_chain → binds retriever + Q&A chain into one pipeline

This forms a standard RAG pipeline.

In [12]:
from langchain_classic.chains.combine_documents import create_stuff_documents_chain
from langchain_classic.chains.retrieval import create_retrieval_chain

# Chain that "stuffs" the retrieved documents into the prompt's {context} slot and calls the LLM
qanda_chain = create_stuff_documents_chain(llm, prompt_template)

# Chain that runs the retriever first, then passes the retrieved documents to qanda_chain
retriever_chain = create_retrieval_chain(retriever, qanda_chain)
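
The data flow of the two chains can be sketched with stand-ins: retrieve documents for the question, join them into one context string, build the prompt, and return the question, the documents, and the answer together. `fake_retriever` and `fake_llm` are hypothetical stubs so the example runs without an API key; the real chains operate on Document objects and chat messages.

```python
def stuff_documents(docs: list[str], separator: str = "\n\n") -> str:
    """Join all retrieved texts into a single context string."""
    return separator.join(docs)


def fake_retriever(query: str) -> list[str]:
    """Stub retriever: keep documents that share a word with the query."""
    corpus = [
        "Self-attention relates positions of a single sequence.",
        "The model was trained on eight GPUs.",
    ]
    words = set(query.lower().split())
    return [doc for doc in corpus if words & set(doc.lower().split())]


def fake_llm(prompt: str) -> str:
    """Stub model: reports the prompt size instead of generating text."""
    return f"(answer based on {len(prompt)} prompt characters)"


def retrieval_chain(inputs: dict) -> dict:
    docs = fake_retriever(inputs["input"])
    context = stuff_documents(docs)
    prompt = f"Here is the context:\n\n{context}\n\nQuestion: {inputs['input']}"
    # Return the original input plus the retrieved documents and the answer.
    return {**inputs, "context": docs, "answer": fake_llm(prompt)}


result = retrieval_chain({"input": "self-attention"})
print(sorted(result))
print(result["context"])
```

This mirrors why the notebook reads response['answer']: the retrieval chain returns a dictionary, and the retrieved chunks are available under response['context'] for inspection.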

### Ask Questions Without History

- These cells test simple queries using the basic RAG chain (no chat history).
- The retrieved context is used to answer the question.


In [13]:
response=retriever_chain.invoke({"input":"what is self-attention mechanism"})

response['answer']

'**Self‑attention (also called intra‑attention)** is a mechanism that lets every element (token) in a sequence “look at” all other elements of the same sequence to compute a new representation for itself.  \n\nIn the Transformer, each token is first projected into three vectors:\n\n* **Query (Q)** – what the token is looking for  \n* **Key (K)** – what the token offers as context  \n* **Value (V)** – the actual content that will be aggregated\n\nFor a token *i*, the attention score to token *j* is computed as the dot‑product of *Qᵢ* and *Kⱼ*, scaled and passed through a softmax. These scores weight the *V* vectors from all positions, and the weighted sum becomes the new representation for token *i*.  \n\nWhen this operation is applied to every token simultaneously, each position can attend to all other positions in the same layer. Multi‑head attention simply repeats this process with different linear projections, allowing the model to capture diverse patterns of interaction.\n\nThus, s

In [14]:
response=retriever_chain.invoke({"input":"what is attention mechanism"})

response['answer']

'**Attention Mechanism (in the context of Transformers)**  \n\nAn attention mechanism is a way for a model to focus on different parts of its input when producing each output. In the Transformer architecture it works as follows:\n\n1. **Query, Key, Value Vectors**  \n   - Every token (word, sub‑word, etc.) is projected into three vectors: **query** (q), **key** (k), and **value** (v).  \n   - The query comes from the position for which we want to compute an output.  \n   - The keys and values come from all positions that the query may attend to (e.g., all tokens in the same layer or all tokens in the encoder for encoder‑decoder attention).\n\n2. **Similarity Scores**  \n   - The similarity between a query and each key is computed (typically by a dot product).  \n   - These scores are scaled (by the square root of the key dimension) to keep gradients stable.\n\n3. **Softmax Weighting**  \n   - A softmax is applied to the similarity scores, turning them into a probability distribution th

In [27]:
response=retriever_chain.invoke({"input":"what is the difference between them?"}) # Without chat history the chain cannot tell what "them" refers to, so the answer is a guess

response['answer']

'In the excerpt the authors are comparing **how different neural‑network layers connect the positions in a sequence** and what that means for learning long‑range dependencies, training speed and parallelism.  \nThe main differences are:\n\n| Layer type | Connectivity pattern | Path length (between any two positions) | Computational complexity | Parallelism | Typical use |\n|------------|----------------------|----------------------------------------|--------------------------|-------------|-------------|\n| **Self‑attention (Transformer)** | Every position attends to every other position in one operation | Constant (1 hop) | \\(O(n^2)\\) in memory, but **constant** number of sequential ops | Fully parallel (all queries, keys, values computed at once) | Machine translation, language modelling, etc. |\n| **Recurrent (RNN/LSTM/GRU)** | Each position depends only on the previous one | Linear (n hops) | \\(O(n)\\) sequential ops | No parallelism across time steps | Classic seq‑to‑seq, speec

## Adding Chat History

### Import History Modules

This imports:

- create_history_aware_retriever → makes retriever understand chat context

- MessagesPlaceholder → placeholder for conversation history inside prompts

In [16]:
from langchain_classic.chains.history_aware_retriever import create_history_aware_retriever

from langchain_core.prompts import MessagesPlaceholder

### Create History-Aware Prompt

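This prompt asks the LLM to rewrite a follow-up question as a standalone question, using the chat history.

Example:
After a question about self-attention, the follow-up "Tell me more" becomes something like "Tell me more about the self-attention mechanism."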

In [17]:


contextualize_aware_system_prompt = (
"""
Given the chat history and the latest user question, which may refer to
context in the chat history, rewrite the question so that it can be
understood without the chat history.
Do not answer the question; only reformulate it if needed, otherwise return it unchanged.
"""
)


contextualize_q_prompt=ChatPromptTemplate.from_messages(
    [
        ("system",contextualize_aware_system_prompt),
        MessagesPlaceholder("chat_history"),
        ("human","{input}"),
    ]
)

### Create History-Aware Retriever

This retriever considers chat history before retrieving documents.

This is essential for follow-up questions.

In [18]:
history_aware_retriver=create_history_aware_retriever(llm,retriever,contextualize_q_prompt)
history_aware_retriver

RunnableBinding(bound=RunnableBranch(branches=[(RunnableLambda(lambda x: not x.get('chat_history', False)), RunnableLambda(lambda x: x['input'])
| VectorStoreRetriever(tags=['Chroma', 'HuggingFaceEmbeddings'], vectorstore=<langchain_community.vectorstores.chroma.Chroma object at 0x0000013B2A05F790>, search_kwargs={}))], default=ChatPromptTemplate(input_variables=['chat_history', 'input'], input_types={'chat_history': list[typing.Annotated[typing.Union[typing.Annotated[langchain_core.messages.ai.AIMessage, Tag(tag='ai')], typing.Annotated[langchain_core.messages.human.HumanMessage, Tag(tag='human')], typing.Annotated[langchain_core.messages.chat.ChatMessage, Tag(tag='chat')], typing.Annotated[langchain_core.messages.system.SystemMessage, Tag(tag='system')], typing.Annotated[langchain_core.messages.function.FunctionMessage, Tag(tag='function')], typing.Annotated[langchain_core.messages.tool.ToolMessage, Tag(tag='tool')], typing.Annotated[langchain_core.messages.ai.AIMessageChunk, Tag(tag
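
The RunnableBranch in the output above encodes a simple rule: with no chat history the raw question goes straight to the retriever, otherwise the question is rewritten first. A minimal sketch of that rule follows, with stub functions standing in for the LLM and the retriever (both are hypothetical, as is the `(role, text)` history format).

```python
def history_aware_retrieve(inputs: dict, rewrite_question, retrieve) -> list[str]:
    """Retrieve directly when there is no history, else rewrite the question first."""
    if not inputs.get("chat_history"):
        return retrieve(inputs["input"])
    standalone = rewrite_question(inputs["chat_history"], inputs["input"])
    return retrieve(standalone)


seen_queries = []  # records what actually reached the retriever


def stub_retrieve(query: str) -> list[str]:
    seen_queries.append(query)
    return [f"doc for: {query}"]


def stub_rewrite(history: list[tuple[str, str]], question: str) -> str:
    # Stand-in for the LLM call: attach the first human message as the topic.
    first_human_text = history[0][1]
    return f"{question} (regarding: {first_human_text})"


history_aware_retrieve({"input": "Tell me more", "chat_history": []}, stub_rewrite, stub_retrieve)
history_aware_retrieve(
    {"input": "Tell me more", "chat_history": [("human", "What is self-attention?")]},
    stub_rewrite,
    stub_retrieve,
)
print(seen_queries)
```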

### Build RAG Chain With History

This defines a new answer prompt that also includes the previous messages, so generation is context-aware.

In [19]:
qa_prompt=ChatPromptTemplate.from_messages(
    [
        ("system",prompt),
        MessagesPlaceholder("chat_history"),
        ("human","{input}"),
    ]
)

### Final RAG Chain (History-Aware)

This creates a history-aware retrieval-augmented generation chain, the core of conversational RAG.

In [20]:
question_answer_chain=create_stuff_documents_chain(llm,qa_prompt)

rag_chain=create_retrieval_chain(history_aware_retriver,question_answer_chain)

### Test Conversational RAG

Simulates a natural conversation:

1. User asks a question

2. RAG retrieves context

3. Chat history is updated

4. User asks a follow-up

5. The history-aware chain resolves the follow-up and answers it

In [21]:
from langchain_core.messages import AIMessage,HumanMessage
chat_history=[]
question="What is Self attention mechanism?"
response1=rag_chain.invoke({"input":question,"chat_history":chat_history})

chat_history.extend(
    [
        HumanMessage(content=question),
        AIMessage(content=response1["answer"])
    ]
)

question2="Tell me more about it"
response2=rag_chain.invoke({"input":question2,"chat_history":chat_history})  # pass the follow-up, not the first question
print(response2['answer'])

**Self‑attention** (also called *intra‑attention*) is a mechanism in the Transformer that lets every token in a sequence gather information from *all other tokens in the same sequence*.

- **Inputs**:  
  - **Queries** – the vector for the token we’re updating.  
  - **Keys** – the same set of vectors that represent all tokens.  
  - **Values** – the same set of vectors that hold the information to be aggregated.

- **Operation**:  
  1. Compute a similarity score between the query and each key (typically a dot product).  
  2. Apply a soft‑max to obtain attention weights.  
  3. Take a weighted sum of the values using those weights.  
  The result is a new representation for the query token that incorporates context from the whole sequence.

- **Why it matters**:  
  * In the encoder, every input token can attend to every other input token, capturing long‑range dependencies directly.  
  * In the decoder, self‑attention is masked so each position only looks at previous positions, pres
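
The invoke-then-extend pattern is easy to get wrong when it is repeated by hand for each turn, so it can help to wrap it in a small helper. The sketch below uses a stub chain and plain `(role, text)` tuples so it runs on its own; `ask` and `echo_chain` are hypothetical names, and with the real rag_chain the history entries would be HumanMessage and AIMessage objects as in the cell above.

```python
def ask(chain_invoke, question: str, chat_history: list) -> str:
    """Run one turn and record both sides of the exchange in chat_history."""
    response = chain_invoke({"input": question, "chat_history": chat_history})
    chat_history.extend([("human", question), ("ai", response["answer"])])
    return response["answer"]


def echo_chain(inputs: dict) -> dict:
    # Stub chain: reports how many history messages it received.
    return {"answer": f"{inputs['input']} [history: {len(inputs['chat_history'])} messages]"}


history = []
first = ask(echo_chain, "What is self-attention?", history)
second = ask(echo_chain, "Tell me more about it", history)
print(first)
print(second)
```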

### Add Multi-Session Chat History

Imports utilities for multi-session chat history:

- Each session gets its own message buffer

- Needed when one app serves several users or conversations, for example a Streamlit chatbot

In [22]:
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory


### Session ID → History Mapping

Creates a dictionary store that maps each session to its message history.

RunnableWithMessageHistory calls this function to keep each conversation separate.

In [23]:

store = {}


def get_session_history(session_id: str) -> BaseChatMessageHistory:
    if session_id not in store:
        store[session_id] = ChatMessageHistory()
    return store[session_id]


### Build Final Conversational RAG Chain

This wraps the RAG chain with chat history, enabling:

- Memory

- Multi-session support

- Cleaner conversation flow

In [24]:


conversational_rag_chain = RunnableWithMessageHistory(
    rag_chain,
    get_session_history,
    input_messages_key="input",
    history_messages_key="chat_history",
    output_messages_key="answer",
)
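
Conceptually, the wrapper does three things on every call: it looks up the history for the session, passes it to the chain under the history key, and then appends the new question and answer. The following is a simplified sketch of that behaviour with a stub chain; it is not the RunnableWithMessageHistory implementation, and `with_message_history`, `session_store`, and `counting_chain` are hypothetical names.

```python
from collections import defaultdict

session_store = defaultdict(list)  # session_id -> list of (role, text) messages


def with_message_history(chain_invoke):
    """Wrap a chain so that it loads and saves history per session."""
    def invoke(inputs: dict, session_id: str) -> dict:
        history = session_store[session_id]
        # Pass a copy so the chain sees the history as it was before this turn.
        result = chain_invoke({**inputs, "chat_history": list(history)})
        history.append(("human", inputs["input"]))
        history.append(("ai", result["answer"]))
        return result
    return invoke


def counting_chain(inputs: dict) -> dict:
    # Stub chain: reports how many earlier messages it was given.
    return {"answer": f"seen {len(inputs['chat_history'])} earlier messages"}


chat = with_message_history(counting_chain)
a1 = chat({"input": "What is self-attention?"}, session_id="abc123")
a2 = chat({"input": "Tell me more about it"}, session_id="abc123")
b1 = chat({"input": "Hello"}, session_id="xyz789")
print(a1["answer"], "|", a2["answer"], "|", b1["answer"])
```

The second call in session "abc123" sees the first exchange, while the new session "xyz789" starts empty, which is the isolation the session ID provides.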



### Test Multi-Session Chat

Tests whether:

- Session ID works

- History persists

- Answers use previous context

In [25]:
conversational_rag_chain.invoke(
    {"input": "What is self-attention?"},
    config={
        "configurable": {"session_id": "abc123"}
    },  # constructs a key "abc123" in `store`.
)["answer"]

'Self‑attention (also called intra‑attention) is an attention mechanism that operates **within a single sequence**.  \nIn a self‑attention layer every token in the sequence produces three vectors—**query, key, and value**—all derived from the same representation of that token. Each token then “attends” to every other token (or to itself) in the same sequence:\n\n1. **Queries** from a token are compared with **keys** from all tokens to produce attention scores.  \n2. These scores are normalized (e.g., via softmax) to give a weighting for each token.  \n3. The weighted sum of the **values** produces a new representation for the token.\n\nThus, each position in the sequence can incorporate information from all positions, allowing the model to capture long‑range dependencies efficiently. Self‑attention is a core component of Transformer architectures and has been successfully applied to tasks such as reading comprehension, summarization, textual entailment, and sentence representation lear

In [26]:

conversational_rag_chain.invoke(
    {"input": "Tell me more about it?"},
    config={
        "configurable": {"session_id": "abc123"}
    },  # reuses the existing "abc123" key in `store`, so the earlier exchange is in the history.
)["answer"]

'**Self‑attention in a nutshell**\n\n| Feature | What it means | Why it matters |\n|---------|---------------|----------------|\n| **Single‑sequence focus** | All queries, keys, and values come from the *same* token sequence (e.g., the encoder’s hidden states). | Lets every word “talk” to every other word in the same sentence, capturing global context in one layer. |\n| **Attention scores** | For each token *i*, we compute a score with every token *j* as  \\(\\text{score}_{ij}=q_i\\cdot k_j^T\\). | These scores are turned into a probability distribution (softmax) that tells how much *i* should look at *j*. |\n| **Weighted sum** | The output for token *i* is \\(\\sum_j \\alpha_{ij} v_j\\). | The token’s new representation is a mixture of all tokens, weighted by relevance. |\n| **Parallelism** | All tokens are processed simultaneously; no sequential recurrence. | Much faster than RNNs and can handle very long contexts. |\n| **Complexity** | \\(O(n^2)\\) per layer for a sequence of length