In [None]:
!pip install langchain==0.2.14
!pip install langchain-openai==0.1.8
!pip install langchainhub==0.1.20
!pip install pypdf==4.2.0
!pip install chromadb==0.5.0

In [None]:
import os

# Set OPENAI API Key

os.environ["OPENAI_API_KEY"] = "your openai key"

# OR (load from .env file)

# from dotenv import load_dotenv
# make sure you have python-dotenv installed
# load_dotenv("./.env")
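
Hard-coding the key in a notebook makes it easy to leak when the file is shared. Below is a minimal sketch of a safer check, assuming the key was already exported in the shell or loaded from `.env`. The helper name `require_env` is hypothetical and not part of any library.

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; export it or load it from a .env file first."
        )
    return value

# Hypothetical usage in this notebook:
# api_key = require_env("OPENAI_API_KEY")
```

Failing early this way gives a readable error before any request is sent.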

Let's set up a study workflow using Jupyter Notebooks, LLMs, and LangChain.

In [15]:
import os
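# Loaders and vector stores live in langchain_community; the langchain.* paths are deprecated aliases
from langchain_community.document_loaders import PyPDFLoader
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma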
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain

In [6]:
MODEL="gpt-4o-mini"

In [7]:
pdf_path = "./assets-resources/attention-paper.pdf"
loader = PyPDFLoader(pdf_path) # LOAD
pdf_docs = loader.load_and_split() # SPLIT
pdf_docs

[Document(metadata={'source': './assets-resources/attention-paper.pdf', 'page': 0}, page_content='Provided proper attribution is provided, Google hereby grants permission to\nreproduce the tables and figures in this paper solely for use in journalistic or\nscholarly works.\nAttention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.comNoam Shazeer∗\nGoogle Brain\nnoam@google.comNiki Parmar∗\nGoogle Research\nnikip@google.comJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.comAidan N. Gomez∗ †\nUniversity of Toronto\naidan@cs.toronto.eduŁukasz Kaiser∗\nGoogle Brain\nlukaszkaiser@google.com\nIllia Polosukhin∗ ‡\nillia.polosukhin@gmail.com\nAbstract\nThe dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks that include an encoder and a decoder. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. We propose a new simple network archite
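
The `# SPLIT` step above turns the PDF into smaller `Document` chunks that can be embedded one by one. As a rough illustration of the general idea of splitting with overlap, here is a simplified character-based sketch. It is not the splitter LangChain uses internally, and the function name and default sizes are assumptions chosen for the example.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list:
    """Split text into fixed-size character windows that share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    if not text:
        return []
    step = chunk_size - overlap  # how far each new window advances
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
# ['abcd', 'defg', 'ghij']
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of some duplicated text in the index.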

In [8]:
doc_obj = pdf_docs[0]
doc_obj

Document(metadata={'source': './assets-resources/attention-paper.pdf', 'page': 0}, page_content='Provided proper attribution is provided, Google hereby grants permission to\nreproduce the tables and figures in this paper solely for use in journalistic or\nscholarly works.\nAttention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.comNoam Shazeer∗\nGoogle Brain\nnoam@google.comNiki Parmar∗\nGoogle Research\nnikip@google.comJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.comAidan N. Gomez∗ †\nUniversity of Toronto\naidan@cs.toronto.eduŁukasz Kaiser∗\nGoogle Brain\nlukaszkaiser@google.com\nIllia Polosukhin∗ ‡\nillia.polosukhin@gmail.com\nAbstract\nThe dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks that include an encoder and a decoder. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. We propose a new simple network architec

In [9]:
from IPython.display import display, Markdown

Markdown(doc_obj.page_content)

Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works.
Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.comNoam Shazeer∗
Google Brain
noam@google.comNiki Parmar∗
Google Research
nikip@google.comJakob Uszkoreit∗
Google Research
usz@google.com
Llion Jones∗
Google Research
llion@google.comAidan N. Gomez∗ †
University of Toronto
aidan@cs.toronto.eduŁukasz Kaiser∗
Google Brain
lukaszkaiser@google.com
Illia Polosukhin∗ ‡
illia.polosukhin@gmail.com
Abstract
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, the Transformer,
based solely on attention mechanisms, dispensing with recurrence and convolutions
entirely. Experiments on two machine translation tasks show these models to
be superior in quality while being more parallelizable and requiring significantly
less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-
to-German translation task, improving over the existing best results, including
ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task,
our model establishes a new single-model state-of-the-art BLEU score of 41.8 after
training for 3.5 days on eight GPUs, a small fraction of the training costs of the
best models from the literature. We show that the Transformer generalizes well to
other tasks by applying it successfully to English constituency parsing both with
large and limited training data.
∗Equal contribution. Listing order is random. Jakob proposed replacing RNNs with self-attention and started
the effort to evaluate this idea. Ashish, with Illia, designed and implemented the first Transformer models and
has been crucially involved in every aspect of this work. Noam proposed scaled dot-product attention, multi-head
attention and the parameter-free position representation and became the other person involved in nearly every
detail. Niki designed, implemented, tuned and evaluated countless model variants in our original codebase and
tensor2tensor. Llion also experimented with novel model variants, was responsible for our initial codebase, and
efficient inference and visualizations. Lukasz and Aidan spent countless long days designing various parts of and
implementing tensor2tensor, replacing our earlier codebase, greatly improving results and massively accelerating
our research.
†Work performed while at Google Brain.
‡Work performed while at Google Research.
31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.arXiv:1706.03762v7  [cs.CL]  2 Aug 2023

In [14]:
embeddings = OpenAIEmbeddings() # EMBED
vectordb = Chroma.from_documents(pdf_docs, embedding=embeddings) # STORE


# Definition of a [retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/#:~:text=A%20retriever%20is,Document's%20as%20output.):

# > A retriever is an interface that returns documents given an unstructured query. It is more general than a vector store. A retriever does not need to be able to store documents, only to return (or retrieve) them. Vector stores can be used as the backbone of a retriever, but there are other types of retrievers as well.
retriever = vectordb.as_retriever() 
# retriever
llm = ChatOpenAI(model=MODEL, temperature=0)
# source: https://python.langchain.com/v0.2/docs/tutorials/pdf_qa/#question-answering-with-rag

system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

# prompt
question_answer_chain = create_stuff_documents_chain(llm, prompt)
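
The retriever definition quoted in the cell above (query in, documents out) can be made concrete with a toy version. This is only a sketch under loud assumptions: it ranks plain strings by word-count cosine similarity instead of OpenAI embeddings, and `ToyRetriever`, `bag_of_words` and `cosine` are hypothetical names, not LangChain APIs.

```python
import math
import re
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Lowercased word counts; a crude stand-in for a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class ToyRetriever:
    """Returns the k stored texts most similar to the query."""

    def __init__(self, texts, k=2):
        self.texts = list(texts)
        self.vectors = [bag_of_words(t) for t in self.texts]
        self.k = k

    def invoke(self, query: str) -> list:
        q = bag_of_words(query)
        ranked = sorted(
            zip(self.texts, self.vectors),
            key=lambda pair: cosine(q, pair[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[: self.k]]

chunks = [
    "The encoder is composed of a stack of identical layers.",
    "Positional encodings inject information about token order.",
    "We trained on the WMT 2014 English-German dataset.",
]
toy = ToyRetriever(chunks, k=1)
print(toy.invoke("How do positional encodings work?"))
# ['Positional encodings inject information about token order.']
```

The vector-store retriever used in this notebook follows the same query-in, documents-out shape, with embeddings doing the similarity work.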

In [16]:
# question_answer_chain
# This method `create_stuff_documents_chain` [outputs an LCEL runnable](https://arc.net/l/quote/bnsztwth)
query = "What are the key components of the transformer architecture?"
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

results = rag_chain.invoke({"input": query})

results

{'input': 'What are the key components of the transformer architecture?',
 'context': [Document(metadata={'page': 2, 'source': './assets-resources/attention-paper.pdf'}, page_content='Figure 1: The Transformer - model architecture.\nThe Transformer follows this overall architecture using stacked self-attention and point-wise, fully\nconnected layers for both the encoder and decoder, shown in the left and right halves of Figure 1,\nrespectively.\n3.1 Encoder and Decoder Stacks\nEncoder: The encoder is composed of a stack of N= 6 identical layers. Each layer has two\nsub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-\nwise fully connected feed-forward network. We employ a residual connection [ 11] around each of\nthe two sub-layers, followed by layer normalization [ 1]. That is, the output of each sub-layer is\nLayerNorm( x+ Sublayer( x)), where Sublayer( x)is the function implemented by the sub-layer\nitself. To facilitate these residua

In [17]:
from IPython.display import Markdown

final_answer = results["answer"]

Markdown(final_answer)

The key components of the Transformer architecture include the encoder and decoder stacks, each composed of six identical layers. Each encoder layer has a multi-head self-attention mechanism and a position-wise fully connected feed-forward network, while the decoder adds a third sub-layer for multi-head attention over the encoder's output. Additionally, the architecture employs residual connections and layer normalization for each sub-layer.
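
The result dictionary carries the retrieved chunks under `context`, so it is possible to check which pages an answer drew from. Here is a small sketch, assuming each chunk exposes a `metadata` dict with a `page` key as in the output above. `FakeDoc` is a hypothetical stand-in used only so the example runs without a PDF, and `cited_pages` is an illustrative helper name.

```python
from dataclasses import dataclass, field

@dataclass
class FakeDoc:
    """Hypothetical stand-in for a LangChain Document (same two attribute names)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def cited_pages(result: dict) -> list:
    """Sorted, de-duplicated page numbers of the chunks in result['context']."""
    pages = {doc.metadata.get("page") for doc in result.get("context", [])}
    return sorted(p for p in pages if p is not None)

demo = {
    "input": "What are the key components of the transformer architecture?",
    "context": [
        FakeDoc("Encoder and decoder stacks.", {"page": 2, "source": "paper.pdf"}),
        FakeDoc("Abstract.", {"page": 0, "source": "paper.pdf"}),
        FakeDoc("Attention sub-layers.", {"page": 2, "source": "paper.pdf"}),
    ],
    "answer": "Encoder and decoder stacks with attention sub-layers.",
}
print(cited_pages(demo))
# [0, 2]
```

With the real chain, `cited_pages(results)` should work the same way; note that the page numbers in the loader output above start at 0.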

In [18]:
query_summary = "Write a simple bullet points summary about this paper"

# Note: no chat history is passed here, so each invoke is independent of previous questions
output = rag_chain.invoke({"input": query_summary})

Markdown(output["answer"])

- The paper introduces the Transformer, a new network architecture based solely on attention mechanisms, eliminating the need for recurrence and convolutions.
- It demonstrates superior performance in machine translation tasks, achieving a BLEU score of 28.4 for English-to-German and 41.8 for English-to-French.
- The Transformer is shown to be more parallelizable and requires significantly less training time compared to existing models, generalizing well to other tasks like English constituency parsing.

In [20]:
def ask_pdf(pdf_qa, query):
    """Run `query` through the RAG chain, print the exchange, and return the answer text."""
    print("QUERY: ",query)
    result = pdf_qa.invoke({"input": query})
    answer = result["answer"]
    print("ANSWER", answer)
    return answer


ask_pdf(rag_chain,"How does the self-attention mechanism in transformers differ from traditional sequence alignment methods?")

QUERY:  How does the self-attention mechanism in transformers differ from traditional sequence alignment methods?
ANSWER The self-attention mechanism in transformers allows for modeling dependencies between all positions in a sequence simultaneously, without regard to their distance, which contrasts with traditional sequence alignment methods that rely on sequential computation and often struggle with long-range dependencies. In traditional methods, such as recurrent neural networks, the computation is inherently sequential, limiting parallelization and making it difficult to learn relationships between distant positions. Self-attention, on the other hand, computes a representation of the entire sequence in a single step, enabling more efficient processing and better handling of long-range dependencies.


'The self-attention mechanism in transformers allows for modeling dependencies between all positions in a sequence simultaneously, without regard to their distance, which contrasts with traditional sequence alignment methods that rely on sequential computation and often struggle with long-range dependencies. In traditional methods, such as recurrent neural networks, the computation is inherently sequential, limiting parallelization and making it difficult to learn relationships between distant positions. Self-attention, on the other hand, computes a representation of the entire sequence in a single step, enabling more efficient processing and better handling of long-range dependencies.'

In [21]:
quiz_questions = ask_pdf(rag_chain, "Quiz me with 3 simple questions on the positional encodings and the role they play in transformers.")

quiz_questions

QUERY:  Quiz me with 3 simple questions on the positional encodings and the role they play in transformers.
ANSWER 1. What is the purpose of positional encodings in a Transformer model?  
2. How do positional encodings help the model understand the order of the input sequence?  
3. What are the two types of positional encodings mentioned in the context of Transformers?


'1. What is the purpose of positional encodings in a Transformer model?  \n2. How do positional encodings help the model understand the order of the input sequence?  \n3. What are the two types of positional encodings mentioned in the context of Transformers?'

In [22]:
llm = ChatOpenAI(model=MODEL, temperature=0.0)

In [23]:
from langchain_core.pydantic_v1 import BaseModel, Field
from typing import List

class QA(BaseModel):
    questions: List[str] = Field(description='List of questions about a given context.')

In [27]:
from langchain_core.prompts.chat import SystemMessagePromptTemplate, HumanMessagePromptTemplate

template = "You transform unstructured questions about a topic into a structured list of questions."
system_message_prompt = SystemMessagePromptTemplate.from_template(template)
human_message_prompt = HumanMessagePromptTemplate.from_template("Return ONLY a PYTHON list containing the questions in this text: {questions}")

In [28]:
from langchain_core.prompts import ChatPromptTemplate

chat_prompt = ChatPromptTemplate.from_messages([system_message_prompt,human_message_prompt])

In [29]:
quiz_chain = chat_prompt | llm.with_structured_output(QA)

In [32]:
questions_list = quiz_chain.invoke({"questions": quiz_questions})
questions_list

QA(questions=['What is the purpose of positional encodings in a Transformer model?', 'How do positional encodings help the model understand the order of the input sequence?', 'What are the two types of positional encodings mentioned in the context of Transformers?'])

In [34]:
questions = questions_list.questions

In [35]:
questions

['What is the purpose of positional encodings in a Transformer model?',
 'How do positional encodings help the model understand the order of the input sequence?',
 'What are the two types of positional encodings mentioned in the context of Transformers?']
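
When the quiz text follows a predictable numbered format, as it does here, the same list can be recovered without a second LLM call. The sketch below is a deterministic alternative to the structured-output chain rather than a replacement for it: it assumes every question starts on its own line with `1.`, `2)` or similar, and the function name `split_numbered_questions` is hypothetical.

```python
import re

def split_numbered_questions(text: str) -> list:
    """Split '1. ...' / '2) ...' style lines into a list of stripped strings."""
    items = re.split(r"^[ \t]*\d+[.)][ \t]+", text, flags=re.MULTILINE)
    return [item.strip() for item in items if item.strip()]

raw = (
    "1. What is the purpose of positional encodings in a Transformer model?  \n"
    "2. How do positional encodings help the model understand the order of the input sequence?  \n"
    "3. What are the two types of positional encodings mentioned in the context of Transformers?"
)
for question in split_numbered_questions(raw):
    print(question)
```

The structured-output route is more forgiving when the model changes its formatting, while the regular expression is cheaper and fully reproducible.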

In [36]:
# `questions` is the list of strings extracted from the structured `questions_list` output above.
answers = []
for q in questions:
    answers.append(ask_pdf(rag_chain,q))

QUERY:  What is the purpose of positional encodings in a Transformer model?
ANSWER Positional encodings in a Transformer model provide information about the position of tokens in the input sequence, as the model itself does not have any inherent understanding of token order due to its reliance on self-attention mechanisms. These encodings allow the model to capture the sequential nature of the data, enabling it to differentiate between tokens based on their positions. This is crucial for tasks like language modeling and translation, where the order of words significantly impacts meaning.
QUERY:  How do positional encodings help the model understand the order of the input sequence?
ANSWER Positional encodings provide information about the position of each element in the input sequence, allowing the model to differentiate between them. Since the Transformer architecture does not use recurrence or convolution, these encodings are essential for maintaining the order of the sequence. By add

In [55]:
evaluations = []

for q,a in zip(questions, answers):
    # Check for results
    evaluations.append(ask_pdf(rag_chain,f"Is this: {a} the correct answer to this question: {q} according to the paper? Return ONLY '''YES''' or '''NO'''. Output:"))

evaluations

QUERY:  Is this: Positional encodings in a Transformer model provide information about the position of tokens in the input sequence, as the model itself does not have any inherent understanding of token order due to its reliance on self-attention mechanisms. These encodings allow the model to capture the sequential nature of the data, enabling it to differentiate between tokens based on their positions. This is crucial for tasks like language modeling and translation, where the order of words significantly impacts meaning. the correct answer to this question: What is the purpose of positional encodings in a Transformer model? according to the paper? Return ONLY '''YES''' or '''NO'''. Output:
ANSWER YES
QUERY:  Is this: Positional encodings provide information about the position of each element in the input sequence, allowing the model to differentiate between them. Since the Transformer architecture does not use recurrence or convolution, these encodings are essential for maintaining t

['YES', 'YES', 'YES']

In [56]:
yes_count = evaluations.count('YES')
score = f"{yes_count / len(evaluations) * 100}%"
print(score)

100.0%
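
Counting exact `'YES'` strings works for the replies above, but a judge model may also answer with quotes, lowercase, or trailing punctuation. Here is a hedged sketch of a more tolerant scorer. The helper names `normalize_verdict` and `accuracy` are hypothetical, and treating ambiguous replies as not correct is an assumption made for this example.

```python
import re

def normalize_verdict(raw: str):
    """Map a judge reply to 'YES', 'NO', or None when it is ambiguous."""
    tokens = set(re.findall(r"[A-Za-z]+", raw.upper()))
    has_yes, has_no = "YES" in tokens, "NO" in tokens
    if has_yes == has_no:  # neither word present, or both present
        return None
    return "YES" if has_yes else "NO"

def accuracy(verdicts) -> float:
    """Percentage of replies that normalize to 'YES'; ambiguous replies count as wrong."""
    labels = [normalize_verdict(v) for v in verdicts]
    if not labels:
        return 0.0
    return 100.0 * labels.count("YES") / len(labels)

replies = ["YES", "'''YES'''", "No.", "yes\n"]
print(f"{accuracy(replies):.1f}%")
# 75.0%
```

Returning `None` for ambiguous replies also makes it possible to log them separately instead of silently scoring them.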


A more 'langchain way' to do this would be:

In [48]:
from langchain_core.output_parsers import StrOutputParser

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

prompt_eval_template = """
You take in context, a question and a generated answer and you output ONLY a score of YES if the answer is correct,
or NO if the answer is not correct.

<context>
{context}
</context>

<question>
{question}
</question>

<answer>
{answer}
</answer>
"""

prompt_eval = ChatPromptTemplate.from_template(prompt_eval_template)

answer_eval_chain = (
    {
        'context': lambda x: format_docs(x['context']),
        'question': lambda x: x['question'],
        'answer': lambda x: x['answer']
        }
    ) | prompt_eval | llm | StrOutputParser()

In [51]:
evaluations = []
for q,a in zip(questions, answers):
    evaluations.append(answer_eval_chain.invoke({'context': pdf_docs, 'question': q, 'answer': a}))

In [52]:
evaluations

['YES', 'YES', 'YES']
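
The dictionary at the head of `answer_eval_chain` prepares the prompt variables: every value is called with the same input, and the results are collected under the same keys. A plain-Python sketch of that step follows. `apply_mapping` is a hypothetical helper that only mimics the behaviour, and `SimpleNamespace` objects stand in for the PDF chunks.

```python
from types import SimpleNamespace

def format_docs(docs):
    """Join chunk texts with blank lines, as in the cell above."""
    return "\n\n".join(doc.page_content for doc in docs)

mapping = {
    "context": lambda x: format_docs(x["context"]),
    "question": lambda x: x["question"],
    "answer": lambda x: x["answer"],
}

def apply_mapping(mapping, inputs):
    """Call every value with the same input and keep the results under the same keys."""
    return {key: fn(inputs) for key, fn in mapping.items()}

docs = [
    SimpleNamespace(page_content="Chunk one."),
    SimpleNamespace(page_content="Chunk two."),
]
variables = apply_mapping(mapping, {"context": docs, "question": "Q?", "answer": "A."})
print(variables)
# {'context': 'Chunk one.\n\nChunk two.', 'question': 'Q?', 'answer': 'A.'}
```

The resulting dictionary is what fills the `{context}`, `{question}` and `{answer}` slots of the evaluation prompt.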

In [54]:
yes_count = evaluations.count('YES')
score = f"{yes_count / len(evaluations) * 100}%"
print(score)

100.0%


This example notebook introduced two ideas:
1. Structured outputs
2. Simple evaluation of RAG answers using the 'LLM-as-a-judge' strategy