In [138]:
import os

from dotenv import load_dotenv
from langchain_groq import ChatGroq

load_dotenv()  # load GROQ_API_KEY from a local .env file into the environment


In [139]:
groq_api_key = os.getenv("GROQ_API_KEY")  # None if the variable is not set

In [140]:

# Load Document 
from langchain_community.document_loaders import WebBaseLoader

URL = "https://arxiv.org/html/1706.03762"

loader = WebBaseLoader(URL)
documents = loader.load()

print(f"Number of documents loaded: {len(documents)}")
print("\n--- Document preview ---\n")
print(documents[0].page_content[:1500])


Number of documents loaded: 1

--- Document preview ---





Attention Is All You Need












1 Introduction
2 Background

3 Model Architecture


3.1 Encoder and Decoder Stacks

Encoder:
Decoder:



3.2 Attention

3.2.1 Scaled Dot-Product Attention
3.2.2 Multi-Head Attention
3.2.3 Applications of Attention in our Model


3.3 Position-wise Feed-Forward Networks
3.4 Embeddings and Softmax
3.5 Positional Encoding


4 Why Self-Attention

5 Training

5.1 Training Data and Batching
5.2 Hardware and Schedule
5.3 Optimizer

5.4 Regularization

Residual Dropout
Label Smoothing





6 Results

6.1 Machine Translation
6.2 Model Variations
6.3 English Constituency Parsing



7 Conclusion

Acknowledgements








Provided proper attribution is provided, Google hereby grants permission to reproduce the tables and figures in this paper solely for use in journalistic or scholarly works.

Attention Is All You Need




  
Ashish Vaswani
Google Brain
avaswani@google.com
&Noam Shazeer11footnotemark:

In [141]:
# Chunking
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter= RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

chunks= text_splitter.split_documents(documents)

print(f"Total chunks created: {len(chunks)}")
print("\n--- First chunk preview ---\n")
print(chunks[0].page_content[:1000])

Total chunks created: 56

--- First chunk preview ---

Attention Is All You Need












1 Introduction
2 Background

3 Model Architecture


3.1 Encoder and Decoder Stacks

Encoder:
Decoder:



3.2 Attention

3.2.1 Scaled Dot-Product Attention
3.2.2 Multi-Head Attention
3.2.3 Applications of Attention in our Model


3.3 Position-wise Feed-Forward Networks
3.4 Embeddings and Softmax
3.5 Positional Encoding


4 Why Self-Attention

5 Training

5.1 Training Data and Batching
5.2 Hardware and Schedule
5.3 Optimizer

5.4 Regularization

Residual Dropout
Label Smoothing





6 Results

6.1 Machine Translation
6.2 Model Variations
6.3 English Constituency Parsing



7 Conclusion

Acknowledgements








Provided proper attribution is provided, Google hereby grants permission to reproduce the tables and figures in this paper solely for use in journalistic or scholarly works.

Attention Is All You Need
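The effect of chunk_size and chunk_overlap can be seen on a toy string. The sketch below is a simplified stand-in, not the algorithm RecursiveCharacterTextSplitter uses: the real splitter prefers to break at separators such as blank lines, newlines, and spaces, so its chunks are not fixed-width. The function name sliding_chunks and the sample string are made up for illustration.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-width windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_chunks(sample, chunk_size=10, chunk_overlap=4)
print(pieces)
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each piece starts with the last four characters of the previous one, which is what lets a sentence that straddles a boundary still appear whole in at least one chunk.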


In [142]:
# Embeddings
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

vectorstore = Chroma.from_documents(documents=chunks, embedding=embedding_model)

print("Vector store created successfully")
print(vectorstore._collection.count())


Vector store created successfully
280
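The collection holds 280 vectors although only 56 chunks were created. 280 is exactly 5 x 56, which is consistent with this cell having been run several times in one kernel session, each run adding the same 56 chunks to the same default collection. That is an inference from the numbers, not something the notebook confirms. If it is the cause, it would also explain near-identical chunks showing up more than once among the retrieved sources. One possible way to make reruns idempotent is to give every chunk a deterministic ID and pass the list through the ids argument of Chroma.from_documents. Whether re-adding an existing ID overwrites the entry or raises depends on the installed Chroma version, so treat that behaviour as an assumption to verify. The helper stable_ids and the dictionary fake_store below are hypothetical names used only to show the idea.

```python
import hashlib


def stable_ids(texts: list[str]) -> list[str]:
    """Build one deterministic ID per chunk from its position and content hash."""
    return [
        f"{i}-{hashlib.sha256(text.encode('utf-8')).hexdigest()[:16]}"
        for i, text in enumerate(texts)
    ]


texts = ["chunk a", "chunk b", "chunk a"]  # repeated content still gets distinct IDs
first_run = stable_ids(texts)
second_run = stable_ids(texts)

# A dict keyed by ID stands in for a store that overwrites on a repeated ID.
fake_store = {}
for run_ids in (first_run, second_run):
    for chunk_id, text in zip(run_ids, texts):
        fake_store[chunk_id] = text

print(len(fake_store))  # 3, not 6: the second run added nothing new
```

A simpler alternative is to restart the kernel, or to call vectorstore.delete_collection() before rebuilding the store.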


In [143]:
# Creating retriever

retriever= vectorstore.as_retriever(search_type="similarity", search_kwargs={"k":3})

# Test retriever
retriever.invoke("What problem does Transformer architecture solve?")

[Document(metadata={'title': 'Attention Is All You Need', 'language': 'en', 'source': 'https://arxiv.org/html/1706.03762'}, page_content='0.2\n\n\n4.95\n25.5\n\n\n\n\n\n\n\n\n\n\n0.0\n\n4.67\n25.3\n\n\n\n\n\n\n\n\n\n\n0.2\n\n5.47\n25.7\n\n\n\n\n\n(E)\n\npositional embedding instead of sinusoids\n\n4.92\n25.7\n\n\n\n\n\nbig\n6\n1024\n4096\n16\n\n\n0.3\n\n300K\n4.33\n26.4\n213\n\n\n\n\n\nTo evaluate the importance of different components of the Transformer, we varied our base model in different ways, measuring the change in performance on English-to-German translation on the development set, newstest2013. We used beam search as described in the previous section, but no checkpoint averaging. We present these results in Table\xa03.\n\n\nIn Table\xa03 rows (A), we vary the number of attention heads and the attention key and value dimensions, keeping the amount of computation constant, as described in Section 3.2.2. While single-head attention is 0.9 BLEU worse than the best setting, quality
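With search_type="similarity" and k=3, the retriever returns the three stored chunks whose embeddings are closest to the embedding of the query. The toy sketch below illustrates that ranking step with hand-written three-dimensional vectors and cosine similarity. The vectors, the labels, and the choice of cosine are assumptions for illustration; the metric the vector store actually applies depends on how the collection is configured.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], index: dict[str, list[float]], k: int = 3) -> list[str]:
    """Return the labels of the k vectors most similar to query_vec."""
    ranked = sorted(index.items(), key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [label for label, _ in ranked[:k]]


# Made-up embeddings: each label stands for one stored chunk.
toy_index = {
    "intro": [0.9, 0.1, 0.0],
    "results_table": [0.1, 0.9, 0.1],
    "training": [0.2, 0.2, 0.9],
    "background": [0.8, 0.3, 0.1],
}
query = [1.0, 0.2, 0.0]
hits = top_k(query, toy_index, k=3)
print(hits)  # ['intro', 'background', 'results_table']
```

Note that the third hit is returned even though its similarity is low: top-k search always fills k slots, which is one reason a table fragment can appear among the sources for a conceptual question.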

In [144]:
# LLM
from langchain_groq import ChatGroq

llm = ChatGroq(model="openai/gpt-oss-20b", groq_api_key=groq_api_key, temperature=0)

In [145]:
# RAG prompt 

from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
prompt= ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a helpful AI assistant. "
            "Answer the question using ONLY the provided context. "
            "Cite sources in your answer using [Source 1], [Source 2], etc. "
            "If the answer is not in the context, say you don't know."
        ),
        MessagesPlaceholder("history"),
        (
            "human",
            "Context:\n{context}\n\nQuestion:\n{question}"
        )
    ]
)


In [146]:
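def format_docs(docs):
    """Join retrieved documents into one context string with numbered source tags."""
    formatted = []
    for i, doc in enumerate(docs, 1):
        # The tag must match the "[Source N]" format the system prompt asks the model to cite.
        formatted.append(f"[Source {i}]\n{doc.page_content}")
    return "\n\n".join(formatted)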

In [147]:
# RAG Chain
# A plain Python function can take part in a LangChain chain once it is a Runnable.
# Wrapping it in RunnableLambda does this explicitly; functions and dicts piped with `|`
# next to an existing Runnable are coerced automatically.
from operator import itemgetter

from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda

rag_chain = (
    # Step 1: retrieve documents for the question; pass question and history through
    {
        "docs": itemgetter("question") | retriever,
        "question": itemgetter("question"),
        "history": itemgetter("history"),
    }
    # Step 2: render the retrieved documents into one numbered context string
    | RunnableLambda(lambda x: {
        "question": x["question"],
        "history": x["history"],
        "context": format_docs(x["docs"]),
        "docs": x["docs"],
    })
    # Step 3: run prompt -> LLM -> string parser and return the answer with its sources
    | RunnableLambda(lambda x: {
        "answer": (prompt | llm | StrOutputParser()).invoke({
            "question": x["question"],
            "history": x["history"],
            "context": x["context"],
        }),
        "sources": x["docs"],
    })
)
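The chain passes a dictionary from stage to stage, and each stage adds or reshapes keys. The sketch below traces that data flow with plain functions and no LangChain at all. fake_retriever, fake_llm, format_context, and run_pipeline are hypothetical stand-ins, and the stub strings are placeholders rather than real retrieved text.

```python
def fake_retriever(question: str) -> list[str]:
    """Stand-in for the retriever: returns fixed placeholder chunks."""
    return ["stub chunk A", "stub chunk B"]


def fake_llm(prompt_text: str) -> str:
    """Stand-in for the model: reports how many source tags it was shown."""
    return f"(answer based on {prompt_text.count('[Source')} sources)"


def format_context(docs: list[str]) -> str:
    """Number each chunk so the answer can cite it."""
    return "\n\n".join(f"[Source {i}]\n{doc}" for i, doc in enumerate(docs, 1))


def run_pipeline(inputs: dict) -> dict:
    # Stage 1: retrieve, keeping the question and history alongside the documents
    stage1 = {
        "docs": fake_retriever(inputs["question"]),
        "question": inputs["question"],
        "history": inputs["history"],
    }
    # Stage 2: add the formatted context, keep the raw documents for later
    stage2 = {**stage1, "context": format_context(stage1["docs"])}
    # Stage 3: build the prompt text, call the model, return answer plus sources
    prompt_text = f"Context:\n{stage2['context']}\n\nQuestion:\n{stage2['question']}"
    return {"answer": fake_llm(prompt_text), "sources": stage2["docs"]}


out = run_pipeline({"question": "placeholder question", "history": []})
print(out["answer"])   # (answer based on 2 sources)
print(out["sources"])  # ['stub chunk A', 'stub chunk B']
```

Carrying "docs" through every stage is what allows the final dictionary to return the sources next to the answer instead of only the generated text.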

In [151]:
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory

# In-memory store: one chat history per session_id, lost when the kernel restarts
store = {}


def get_session_history(session_id: str) -> BaseChatMessageHistory:
    if session_id not in store:
        store[session_id] = ChatMessageHistory()
    return store[session_id]


# Wrap the RAG chain with message history
conversational_rag = RunnableWithMessageHistory(
    rag_chain,
    get_session_history,
    input_messages_key="question",
    history_messages_key="history",
    output_messages_key="answer",
)
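The history wrapper looks up the stored messages for a session before each call and appends the new question and answer afterwards. The sketch below imitates that bookkeeping with a dictionary of lists. It is a simplified model of the behaviour under that assumption, not the library's implementation; sessions, chat_turn, and echo_answer are hypothetical names.

```python
from collections import defaultdict

# One message list per session ID, created on first use.
sessions: dict[str, list[tuple[str, str]]] = defaultdict(list)


def chat_turn(session_id: str, question: str, answer_fn) -> str:
    """Answer a question with the session's earlier messages, then record the turn."""
    history = sessions[session_id]
    answer = answer_fn(question, list(history))  # pass a copy of the history so far
    history.append(("human", question))
    history.append(("ai", answer))
    return answer


def echo_answer(question: str, history: list[tuple[str, str]]) -> str:
    """Stand-in for the RAG chain: reports how much history it received."""
    return f"seen {len(history)} earlier messages"


a1 = chat_turn("demo-session", "first question", echo_answer)
a2 = chat_turn("demo-session", "follow-up question", echo_answer)
a3 = chat_turn("other-session", "first question", echo_answer)
print(a1)  # seen 0 earlier messages
print(a2)  # seen 2 earlier messages
print(a3)  # seen 0 earlier messages
```

Sessions are isolated from one another, so a follow-up question only sees earlier turns when it is sent with the same session_id.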

In [152]:
def show_sources(question, k=3):
    """Print the first 600 characters of the top-k retrieved chunks for a question."""
    docs = retriever.invoke(question)
    for i, doc in enumerate(docs[:k], 1):
        print(f"[Source {i}]")
        print(doc.page_content[:600])


In [153]:
# Test: first question in a new session
 
result = conversational_rag.invoke(
    {"question": "What problem does transformers solve?"},
    config={"configurable": {"session_id": "demo-session"}}
)

print("ANSWER:\n")
print(result["answer"])

print("\nSOURCES:\n")
for i, doc in enumerate(result["sources"], 1):
    print(f"\n[Source {i}]")
    print(doc.page_content[:500])


ANSWER:

Transformers were introduced to improve sequence‑to‑sequence tasks such as machine translation. In the work cited, the authors evaluate the model on **English‑to‑German translation** using the newstest2013 development set, showing how different architectural choices affect BLEU scores on this translation task. Thus, the primary problem that Transformers solve in this context is **accurate and efficient translation between languages**. [Source 1]

SOURCES:


[Source 1]
0.2


4.95
25.5










0.0

4.67
25.3










0.2

5.47
25.7





(E)

positional embedding instead of sinusoids

4.92
25.7





big
6
1024
4096
16


0.3

300K
4.33
26.4
213





To evaluate the importance of different components of the Transformer, we varied our base model in different ways, measuring the change in performance on English-to-German translation on the development set, newstest2013. We used beam search as described in the previous section, but no checkpoint averaging. We pres

[Source 2]
0.2



In [154]:
# Test: follow-up question in the same session, so the stored history is included in the prompt
 
result = conversational_rag.invoke(
    {"question": "Why is this approach better than RNN?"},
    config={"configurable": {"session_id": "demo-session"}}
)

print("ANSWER:\n")
print(result["answer"])

print("\nSOURCES:\n")
for i, doc in enumerate(result["sources"], 1):
    print(f"\n[Source {i}]")
    print(doc.page_content[:500])


ANSWER:

The Transformer’s key advantage over traditional RNN‑based sequence‑to‑sequence models is that it replaces the recurrent layers with **multi‑headed self‑attention**. This change allows the model to capture dependencies across the entire input sequence in parallel, rather than processing tokens one after another as RNNs do. As a result, the Transformer achieves higher accuracy—outperforming the BerkeleyParser even when trained only on a modest 40 k‑sentence WSJ set—and does so without requiring task‑specific tuning [Source 1][Source 2][Source 3].

SOURCES:


[Source 1]
Our results in Table 4 show that despite the lack of task-specific tuning our model performs surprisingly well, yielding better results than all previously reported models with the exception of the Recurrent Neural Network Grammar [8].


In contrast to RNN sequence-to-sequence models [37], the Transformer outperforms the BerkeleyParser [29] even when training only on the WSJ training set of 40K sentences.





7 