In [26]:
import langchain, chromadb, pypdf, openai, tiktoken
from langchain.schema import Document
from dotenv import load_dotenv
from langchain_community.document_loaders import PyPDFLoader
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.schema.output_parser import StrOutputParser
from langchain.prompts import ChatPromptTemplate
from langchain.schema.runnable import RunnablePassthrough
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainFilter
from langchain.schema.runnable import RunnableLambda

In [2]:
load_dotenv()

True

In [3]:
pdf_path = "../papers/aiayn.pdf"
loader = PyPDFLoader(pdf_path)
docs = loader.load()

print(f"Total No of Pages : {len(docs)}")
print(docs[0].page_content[:500])


Total No of Pages : 15
Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works.
Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.com
Noam Shazeer∗
Google Brain
noam@google.com
Niki Parmar∗
Google Research
nikip@google.com
Jakob Uszkoreit∗
Google Research
usz@google.com
Llion Jones∗
Google Research
llion@google.com
Aidan N. Gomez∗ †
University of Toronto
aidan@cs.toronto.edu
Łukasz 


In [4]:
def clean_text(text):
  return " ".join(text.split())

docs = [Document(page_content=clean_text(d.page_content),metadata=d.metadata) for d in docs]
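As a quick check of what clean_text does, the sketch below runs the same whitespace normalization on a made-up string (the sample text is hypothetical, not taken from the PDF). Note that collapsing newlines also removes the paragraph breaks that a splitter could otherwise use as separators.

```python
def clean_text(text):
  # str.split() with no arguments splits on any run of whitespace
  # (spaces, tabs, newlines), so joining with " " collapses them all.
  return " ".join(text.split())

sample = "Attention  Is\nAll\tYou   Need\n\n"  # hypothetical input
cleaned = clean_text(sample)
print(repr(cleaned))
```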

In [5]:
print("1st document length : ",len(docs[0].page_content))
print("----------------------")
print(docs[0].page_content)
print("----------------------")
print("No of documents : ",len(docs))

1st document length :  2857
----------------------
Provided proper attribution is provided, Google hereby grants permission to reproduce the tables and figures in this paper solely for use in journalistic or scholarly works. Attention Is All You Need Ashish Vaswani∗ Google Brain avaswani@google.com Noam Shazeer∗ Google Brain noam@google.com Niki Parmar∗ Google Research nikip@google.com Jakob Uszkoreit∗ Google Research usz@google.com Llion Jones∗ Google Research llion@google.com Aidan N. Gomez∗ † University of Toronto aidan@cs.toronto.edu Łukasz Kaiser∗ Google Brain lukaszkaiser@google.com Illia Polosukhin∗ ‡ illia.polosukhin@gmail.com Abstract The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispen

In [6]:
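# Drop every page that mentions any of these keywords. The test works on whole
# pages, so the title page (which holds the abstract) is removed as well because
# it contains "google brain".
cleaned_docs = [d for d in docs if not any(x in d.page_content.lower()
                 for x in ["references", "acknowledgements", "arxiv", "google brain"])]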

In [7]:
print("1st document length : ",len(cleaned_docs[0].page_content))
print("----------------------")
print(cleaned_docs[0].page_content)
print("----------------------")
print("No of documents : ",len(cleaned_docs))

1st document length :  4257
----------------------
1 Introduction Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation [ 35, 2, 5]. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [38, 24, 15]. Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states ht, as a function of the previous hidden state ht−1 and the input for position t. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples. Recent work has achieved significant imp
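Because the keyword filter removes whole pages, it can help to see which keyword triggered each removal. A minimal sketch, assuming a hypothetical helper matched_keywords and two made-up page strings; it mirrors the substring test used in the filter above but is not part of the original notebook.

```python
KEYWORDS = ["references", "acknowledgements", "arxiv", "google brain"]

def matched_keywords(text, keywords=KEYWORDS):
  # Same case-insensitive substring test as the page filter,
  # but returns the keywords that matched instead of a boolean.
  lowered = text.lower()
  return [k for k in keywords if k in lowered]

# Hypothetical page snippets for illustration only.
title_page = "Attention Is All You Need Ashish Vaswani Google Brain Abstract ..."
body_page = "3.2 Attention An attention function maps a query to an output."

print(matched_keywords(title_page))
print(matched_keywords(body_page))
```

In the notebook, the same helper could be called on d.page_content for each page in docs to list the dropped pages together with the reason.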

In [8]:
import re

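def is_valid_chunk(text):
  # Reject very short pages.
  if len(text.strip()) < 50:
    return False
  # Reject pages with repeated special tokens.
  if text.count("<pad>") > 3 or text.count("<EOS>") > 3:
    return False
  # Reject any page containing a run of 4 or more digits.
  # Note that this also matches years such as 2017.
  if re.search(r'[0-9]{4,}',text):
    return False
  return True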

filtered_docs = [d for d in cleaned_docs if is_valid_chunk(d.page_content)]
print(f"Before : {len(cleaned_docs)} | After filtering : {len(filtered_docs)}")  

Before : 11 | After filtering : 3
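The drop from 11 pages to 3 suggests one of the rules is stricter than intended. The sketch below re-declares is_valid_chunk so it runs on its own and applies it to made-up strings; it shows that the digit rule rejects any text containing a four-digit year. Whether that rule is what removed the pages here is an assumption that would need checking against the actual page contents.

```python
import re

def is_valid_chunk(text):
  if len(text.strip()) < 50:
    return False
  if text.count("<pad>") > 3 or text.count("<EOS>") > 3:
    return False
  if re.search(r'[0-9]{4,}', text):
    return False
  return True

# Hypothetical samples, each longer than 50 characters.
plain = "Self-attention relates different positions of a single sequence."
with_year = "The Transformer was trained on the WMT 2014 English-German dataset."

print(is_valid_chunk(plain))
print(is_valid_chunk(with_year))
```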


In [9]:
print(filtered_docs[2].page_content[:500])

Scaled Dot-Product Attention Multi-Head Attention Figure 2: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several attention layers running in parallel. of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key. 3.2.1 Scaled Dot-Product Attention We call our particular attention "Scaled Dot-Product Attention" (Figure 2). The input consists of queries and keys of dimension dk, and values of


In [10]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
  chunk_size=600,
  chunk_overlap=120,
  separators=["\n\n","\n","."," "]
)

chunks = text_splitter.split_documents(filtered_docs)

print("Total chunks : ",len(chunks))
print("Length of the first chunk : ",len(chunks[0].page_content))
print(chunks[0].page_content)

Total chunks :  19
Length of the first chunk :  539
1 Introduction Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation [ 35, 2, 5]. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [38, 24, 15]. Recurrent models typically factor computation along the symbol positions of the input and output sequences


In [11]:
sum(len(c.page_content) for c in chunks) / len(chunks)

483.89473684210526
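The mean alone hides how uneven the chunks are. A small sketch of a fuller summary, using the standard library statistics module; the lengths list is made up for illustration, and in the notebook it would be [len(c.page_content) for c in chunks].

```python
import statistics

def length_summary(lengths):
  # Summarize chunk sizes: smallest, largest, mean and median length.
  return {
    "min": min(lengths),
    "max": max(lengths),
    "mean": statistics.mean(lengths),
    "median": statistics.median(lengths),
  }

lengths = [539, 598, 120, 580, 410]  # hypothetical chunk lengths
summary = length_summary(lengths)
print(summary)
```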

In [12]:
embedding = OpenAIEmbeddings(model="text-embedding-3-small")

vector_store = Chroma.from_documents(
  documents=chunks,
  embedding=embedding,
  persist_directory='./vector_store'
)

# With Chroma 0.4 or later, setting persist_directory writes the store to disk
# automatically, so no explicit persist() call is needed.


In [13]:
query = "What is the main contribution of this paper ?"
results = vector_store.similarity_search(query,k=2)

for r in results:
  print(r.page_content)

. Recent work has achieved significant improvements in computational efficiency through factorization tricks [21] and conditional computation [32], while also improving model performance in case of the latter. The fundamental constraint of sequential computation, however, remains. Attention mechanisms have become an integral part of compelling sequence modeling and transduc- tion models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences [2, 19]
. This makes it more difficult to learn dependencies between distant positions [ 12]. In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as described in section 3.2. Self-attention, sometimes called intra-attention is an attention mechanism relating different positions of a single sequence in order to compute a rep
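Conceptually, similarity_search embeds the query and returns the k stored chunks whose vectors are closest to it. The sketch below illustrates that idea with cosine similarity over tiny hand-written vectors. It is a simplified stand-in: the vectors, texts, and the top_k helper are hypothetical, and Chroma uses its own index and distance metric rather than this brute-force loop.

```python
import math

def cosine(a, b):
  # Cosine similarity: dot product divided by the product of the vector norms.
  dot = sum(x * y for x, y in zip(a, b))
  norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
  return dot / norm

def top_k(query_vec, items, k=2):
  # items: list of (text, vector); rank by similarity to the query, highest first.
  ranked = sorted(items, key=lambda item: cosine(query_vec, item[1]), reverse=True)
  return [text for text, _ in ranked[:k]]

# Hypothetical chunks with made-up 3-dimensional "embeddings".
items = [
  ("chunk about self-attention", [0.9, 0.1, 0.0]),
  ("chunk about training data", [0.0, 0.2, 0.9]),
  ("chunk about multi-head attention", [0.8, 0.3, 0.1]),
]
query_vec = [1.0, 0.0, 0.0]
nearest = top_k(query_vec, items, k=2)
print(nearest)
```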

In [14]:
retriever = vector_store.as_retriever(search_kwargs={"k":3})

llm = ChatOpenAI(model='gpt-4',temperature=0)

prompt = ChatPromptTemplate.from_template(
  "Use the following context to answer the question. If you can't answer just say you don't know\n\n"
  "Context:\n{context}\n\n"
  "Question: {question}"
)

rag_chain = (
  {"context":retriever,"question":RunnablePassthrough()} | prompt | llm | StrOutputParser()
)

query = "What is this paper about ?"
answer = rag_chain.invoke(query)
print(answer)

The paper appears to be about computational efficiency and model performance in the field of artificial intelligence, specifically focusing on attention mechanisms and self-attention in sequence modeling and transduction models. It discusses the use of these mechanisms in various tasks such as reading comprehension, abstractive summarization, textual entailment, and learning task-independent sentence representations. The paper also mentions the use of end-to-end memory networks in simple-language question answering and language modeling tasks.


In [28]:
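def inspect_retrieval(query,retriever):
  # Print the first 400 characters of every chunk the retriever returns.
  # Retrievers are Runnables, so invoke() returns the list of matching Documents.
  results = retriever.invoke(query)
  for i,r in enumerate(results):
    print(f"Chunks {i+1}")
    print(r.page_content[:400],"\n\n")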

inspect_retrieval("What is this paper about ?",retriever)

Chunks 1
. Recent work has achieved significant improvements in computational efficiency through factorization tricks [21] and conditional computation [32], while also improving model performance in case of the latter. The fundamental constraint of sequential computation, however, remains. Attention mechanisms have become an integral part of compelling sequence modeling and transduc- tion models in various 


Chunks 2
. Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations [4, 27, 28, 22]. End-to-end memory networks are based on a recurrent attention mechanism instead of sequence- aligned recurrence and have been shown to perform well on simple-language question answeri 


Chunks 3
. This makes it more difficult to learn dependencies between distant positions [ 12]. In the Transformer this is reduced to a constant number of operations, albeit a

In [16]:
query = "What is self-attention ?"
answer = rag_chain.invoke(query)
print(answer)

Self-attention is an attention mechanism that relates different positions of a single sequence in order to compute a representation of the sequence. It has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations.


In [18]:
inspect_retrieval(query,retriever)

Chunks 1
. Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations [4, 27, 28, 22]. End-to-end memory networks are based on a recurrent attention mechanism instead of sequence- aligned recurrence and have been shown to perform well on simple-language question answeri 


Chunks 2
. This makes it more difficult to learn dependencies between distant positions [ 12]. In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as described in section 3.2. Self-attention, sometimes called intra-attention is an attention m 


Chunks 3
Scaled Dot-Product Attention Multi-Head Attention Figure 2: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several attention layers ru

In [19]:
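# temperature=0 keeps the keep-or-drop decisions of the filter as repeatable as possible.
compressor = LLMChainFilter.from_llm(ChatOpenAI(model="gpt-4",temperature=0))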
compression_retriever = ContextualCompressionRetriever(
  base_compressor=compressor,
  base_retriever=retriever
)

In [21]:
prompt_2 = ChatPromptTemplate.from_template(
  "Use the context below to answer. Cite sources if available.\n\n"
  "Context:\n{context}\n\n"
  "Question: {question}"
)

rag_chain_2 = (
  {"context":compression_retriever,"question":RunnablePassthrough()} | prompt_2 | llm | StrOutputParser()
)

query = "What is this paper about ?"
answer = rag_chain_2.invoke(query)
print(answer)

The paper discusses advancements in computational efficiency and model performance in the field of artificial intelligence. It specifically focuses on the use of attention mechanisms in sequence modeling and transduction models. The paper also discusses the use of self-attention in various tasks such as reading comprehension, abstractive summarization, textual entailment, and learning task-independent sentence representations. Additionally, it mentions the successful application of end-to-end memory networks in simple-language question answering and language modeling tasks.


In [34]:
def format_docs(docs):
  formatted = []
  for d in docs:
    src = d.metadata.get("source","Unknown source")
    page = d.metadata.get("page_label","Unknown page")
    formatted.append(f"[Source: {src}, Page: {page}]\n{d.page_content}")
  return "\n\n".join(formatted)
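To see the string the model receives, the sketch below runs format_docs on two stand-in documents. FakeDoc is a hypothetical dataclass that only mimics the page_content and metadata attributes of a LangChain Document, so the example runs without any dependencies; the metadata values are made up.

```python
from dataclasses import dataclass, field

@dataclass
class FakeDoc:
  # Hypothetical stand-in for langchain's Document.
  page_content: str
  metadata: dict = field(default_factory=dict)

def format_docs(docs):
  formatted = []
  for d in docs:
    src = d.metadata.get("source","Unknown source")
    page = d.metadata.get("page_label","Unknown page")
    formatted.append(f"[Source: {src}, Page: {page}]\n{d.page_content}")
  return "\n\n".join(formatted)

sample_docs = [
  FakeDoc("Self-attention relates positions of a sequence.", {"source": "paper.pdf", "page_label": "2"}),
  FakeDoc("A chunk without metadata."),
]
context = format_docs(sample_docs)
print(context)
```

The second document shows the fallback labels that appear when a chunk carries no source or page_label metadata.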

In [35]:
prompt_3 = ChatPromptTemplate.from_template("""
Use the context below to answer the question.
Cite sources using the [Source: ..., Page: ...] information when relevant.

Context:
{context}

Question: {question}
""")

rag_chain_with_citations = (
  {
    "context": compression_retriever | RunnableLambda(format_docs),
    "question": RunnablePassthrough()
  }
  | prompt_3
  | llm 
  | StrOutputParser()
)

In [36]:
query = "What is this paper about?"
answer = rag_chain_with_citations.invoke(query)
print(answer)

The paper discusses the use of attention mechanisms in sequence modeling and transduction models. It specifically focuses on the Transformer model and the use of self-attention, or intra-attention, to relate different positions of a single sequence in order to compute a representation of the sequence. The paper also discusses the application of self-attention in various tasks such as reading comprehension, abstractive summarization, textual entailment, and learning task-independent sentence representations. [Source: ../papers/aiayn.pdf, Page: 2]


In [37]:
query = "What model architecture is used in this paper ?"
answer = rag_chain_with_citations.invoke(query)
print(answer)

The model architecture used in this paper is the Transformer. This architecture relies entirely on an attention mechanism to draw global dependencies between input and output, eschewing recurrence. It uses stacked self-attention and point-wise, fully connected layers for both the encoder and decoder. The encoder and decoder are composed of a stack of six identical layers. Each layer has two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network. [Source: ../papers/aiayn.pdf, Page: 2-3]


In [38]:
query = "how to stay healthy ?"
answer = rag_chain_with_citations.invoke(query)
print(answer)

The context does not provide information on how to stay healthy.
