## Installing the dependencies


In [19]:
!pip install -qU langchain-huggingface langchain-community langchain-groq langgraph faiss-cpu pypdf

## Importing the required libraries
Loads system utilities, Colab user data access, and all LangChain, embedding, vector store, LLM, agent, tool, and memory components needed for the RAG pipeline.

In [4]:
import os
from google.colab import userdata
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_groq import ChatGroq
from langchain.agents import create_agent
from langchain.tools import tool
from langgraph.checkpoint.memory import InMemorySaver

## Loading the Groq API key
Retrieves the Groq API key securely from Google Colab user secrets for authenticating the LLM.

In [5]:
# `userdata` was already imported in the import cell above.
# Read the key from Colab's secret store instead of hard-coding it.
groq_api_key = userdata.get('GROQ_API_KEY')
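
The cell above only works inside Google Colab. As a minimal sketch for running the notebook elsewhere, the hypothetical helper `load_api_key` below (not part of the original notebook) tries Colab secrets first, then an environment variable, and finally prompts for the key. The lookup order is an assumption, not a requirement of Groq or LangChain.

```python
import os
from getpass import getpass


def load_api_key(name: str = "GROQ_API_KEY") -> str:
    """Return the secret `name` from Colab, the environment, or a prompt."""
    try:
        # This import only succeeds inside Google Colab.
        from google.colab import userdata

        key = userdata.get(name)
        if key:
            return key
    except Exception:
        # Not running in Colab, or the secret is missing or not shared
        # with this notebook: fall through to the other sources.
        pass

    key = os.environ.get(name)
    if key:
        return key

    # Last resort: ask interactively without echoing the key.
    return getpass(f"Enter {name}: ")
```

With this helper, `groq_api_key = load_api_key()` could replace the Colab-specific call.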

## Initializing the Groq language model
Creates an LLM instance using Groq with the specified model for generating responses in the RAG pipeline.


In [6]:
llm = ChatGroq(
    api_key=groq_api_key,
    model='openai/gpt-oss-120b'  # or another available Groq model
)

## Loading, chunking, and embedding the document
Downloads the Transformer paper PDF, splits it into overlapping text chunks, generates local embeddings, and stores them in a FAISS vector index for retrieval.


In [7]:
# Load and process the Transformer Paper
loader = PyPDFLoader("https://arxiv.org/pdf/1706.03762.pdf")
docs = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)

# Local Embeddings (MiniLM-L6-v2)
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vector_store = FAISS.from_documents(chunks, embeddings)
retriever = vector_store.as_retriever()
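
To make `chunk_size=1000` and `chunk_overlap=200` concrete, here is a simplified sketch of fixed-size overlapping windows. It is not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to break at separators such as paragraphs, lines, and spaces, so its chunks vary in length. The function name `sliding_chunks` is made up for illustration.

```python
def sliding_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Cut `text` into windows of `chunk_size` characters that share
    `chunk_overlap` characters with the previous window."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# 2,500 characters of dummy text, split with the notebook's settings.
sample = "".join(str(i % 10) for i in range(2500))
pieces = sliding_chunks(sample)
print([len(p) for p in pieces])
print(pieces[0][-200:] == pieces[1][:200])  # neighbouring chunks share 200 characters
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated text in the index.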


## Defining a document search tool
Creates a custom tool that queries the vector store to retrieve relevant sections from the research paper for agent use.


In [8]:
@tool
def search_paper(query: str) -> str:
    """Search the 'Attention is All You Need' research paper for specific technical details."""
    docs = retriever.invoke(query)
    return "\n\n".join([d.page_content for d in docs])

tools = [search_paper]
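
Conceptually, the tool embeds the query, ranks the stored chunks by similarity, and joins the best matches into one string. The sketch below imitates that flow with toy parts: word counts stand in for the MiniLM embeddings and a sorted list stands in for the FAISS index. Cosine similarity is used for simplicity; the metric in the real pipeline depends on how the FAISS index is configured. All names here (`embed`, `cosine`, `top_k`, `toy_chunks`) are illustrative.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for real vectors)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query, best first."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


toy_chunks = [
    "multi-head attention runs several attention layers in parallel",
    "the model was trained on eight gpus",
    "scaled dot-product attention divides by the square root of the key dimension",
]
hits = top_k("how does attention scale the dot-product", toy_chunks, k=2)
print("\n\n".join(hits))  # same joining step as in search_paper
```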

## Initializing memory and creating the agent
Sets up in-memory checkpointing, defines a research-focused agent with tool access, and configures a conversation thread for running the RAG demo.


In [10]:
# 1. Initialize Memory
memory = InMemorySaver()

# 2. Create Agent
system_msg = "You are a professional research assistant. Use the search_paper tool to provide accurate answers."
agent = create_agent(llm, tools=tools, checkpointer=memory, system_prompt=system_msg)

# 3. Configure the conversation thread used by the demo
config = {"configurable": {"thread_id": "final_project_user"}}
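
The `thread_id` in `config` is what ties separate `invoke` calls to the same conversation. The toy class below sketches that idea under a strong simplification: it keeps one message list per `thread_id` in a dictionary. It is a hypothetical illustration, not the `InMemorySaver` implementation, which stores full graph checkpoints rather than plain tuples.

```python
class ToyThreadMemory:
    """Hypothetical stand-in for a checkpointer: one message list per thread_id."""

    def __init__(self) -> None:
        self._threads: dict[str, list[tuple[str, str]]] = {}

    def append(self, config: dict, role: str, text: str) -> None:
        """Store a message under the thread named in `config`."""
        thread_id = config["configurable"]["thread_id"]
        self._threads.setdefault(thread_id, []).append((role, text))

    def history(self, config: dict) -> list[tuple[str, str]]:
        """Return a copy of the messages stored for that thread."""
        thread_id = config["configurable"]["thread_id"]
        return list(self._threads.get(thread_id, []))


store = ToyThreadMemory()
demo_thread = {"configurable": {"thread_id": "final_project_user"}}
other_thread = {"configurable": {"thread_id": "another_user"}}

store.append(demo_thread, "user", "Hi, I'm a student researching Transformers.")
print(len(store.history(demo_thread)), len(store.history(other_thread)))
```

Reusing the same `config` across calls continues one conversation; a different `thread_id` starts with an empty history.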

## Interactions with the LLM

**Interaction 1**

In [11]:
print("--- Start Conversation ---")
# Interaction 1
res1 = agent.invoke({"messages": [("user", "Hi, I'm a student researching Transformers.")]}, config)
print(f"Bot: {res1['messages'][-1].content}")

--- Start Conversation ---
Bot: Hello! I’d be happy to help with your research on Transformers.  

To give you the most useful information, could you let me know which aspects you’re focusing on? For example:

- The core architecture (self‑attention, multi‑head attention, positional encodings, etc.)  
- Training objectives and optimization tricks (e.g., label smoothing, learning‑rate schedules)  
- Variants and extensions (BERT, GPT, Vision Transformers, etc.)  
- Applications or performance benchmarks on specific tasks  
- Implementation details or code‑level considerations  

Feel free to be as specific as you like, and I can pull the relevant sections from the original *“Attention Is All You Need”* paper or other sources to support your work.


**Interaction 2**

In [12]:
# Interaction 2
res2 = agent.invoke({"messages": [("user", "What core concepts mentioned about attention in the paper?")]}, config)
print(f"\nBot: {res2['messages'][-1].content}")


Bot: Below is a concise “cheat‑sheet” of the **core attention concepts that the original *Attention Is All You Need* (Vaswani et al., 2017) paper introduces**.  
I have pulled the exact wording from the paper (see the excerpts returned by the search tool) and added brief explanations so you can see how each idea fits into the overall Transformer architecture.

---

## 1.  The General Attention Function  

> “An attention function can be described as mapping a **query** and a set of **key‑value pairs** to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values.”  *(Section 3.2 – “Attention”)*
  
**What it means**  
* The model decides *how much* each value (e.g., a word representation) should contribute to the result for a given query (e.g., the current word we are processing).  
* The weights are produced by a **compatibility function** that measures similarity between the query and each key.

---

## 2.  Scaled Dot‑

**Interaction 3**

In [13]:
# Interaction 3
res3 = agent.invoke({"messages": [("user", "What did i ask earlier?")]}, config)
print(f"\nBot: {res3['messages'][-1].content}")


Bot: You asked for **the core concepts about attention that are described in the original *“Attention Is All You Need”* paper**. In response I listed and explained the main attention‑related ideas introduced there (the general attention function, Scaled Dot‑Product Attention, Multi‑Head Attention, self‑attention, encoder‑decoder (cross) attention, positional encodings, and the residual‑plus‑layer‑norm wrapper).


**Output and Observations**

In this run, the agent retrieved relevant sections from the “Attention Is All You Need” paper and answered the attention question with material attributed to it, for example the Section 3.2 definition of an attention function. The answer to the technical question was therefore backed by retrieved text rather than produced by free-form generation alone. The agent also kept conversational context across turns: in Interaction 3 it correctly recalled the question asked in Interaction 2. Together, the three interactions show document loading, chunking, vector storage, retrieval, tool use, and memory working end to end in the RAG pipeline.
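
One way to check the retrieval claim rather than infer it from the answer text is to look for tool results in the returned message list. The helper below is a sketch that assumes each message exposes a `type` attribute whose value is `"tool"` for tool results, as LangChain's `ToolMessage` does. The name `tool_messages` and the stand-in objects are illustrative, so the block runs without calling the LLM.

```python
from types import SimpleNamespace


def tool_messages(messages) -> list:
    """Return the messages whose `type` attribute is "tool"."""
    return [m for m in messages if getattr(m, "type", None) == "tool"]


# Stand-in objects that mimic the shape of a message history.
fake_history = [
    SimpleNamespace(type="human", content="A question about attention"),
    SimpleNamespace(type="ai", content=""),
    SimpleNamespace(type="tool", content="Retrieved excerpt from the paper"),
    SimpleNamespace(type="ai", content="An answer based on the excerpt"),
]

found = tool_messages(fake_history)
print(f"{len(found)} tool result(s) in the history")
```

In the notebook, `tool_messages(res2["messages"])` would be the equivalent check. Because the checkpointer accumulates the whole thread, that list also includes messages from earlier turns.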