# 2. RAG system - Retriever, Generator pipeline

This notebook shows how to configure the RAG system and assemble it into an inference pipeline.
The pipeline is built in the following steps:
- defining all the configuration needed,
- defining the LLM and the embeddings model,
- loading the ingested vector database and defining the retriever on top of it,
- defining the prompt template and the conversational memory buffer,
- finally, defining the chain (retrieval QA chain).

<b>The whole pipeline (RAG system - Retriever, Generator pipeline) is circled in blue and marked with 2.</b>

![image.png](attachment:image.png)

In [1]:
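# Install LangChain, its community and AWS integrations, boto3 and FAISS
# (langchain-aws and faiss-cpu are needed by the imports and the vector store used below)
!pip install langchain langchain-community langchain-aws boto3 faiss-cpu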



For our use case, we are going to work with **LangChain**.

**LangChain** is an LLM/agent orchestration framework that makes it easy to create and manage LLM applications and agents. It provides the tools needed to build a whole RAG system, from the data ingestion pipeline to the inference pipeline.

In [5]:
# Importing all the necessary modules/libraries
from langchain.chains import RetrievalQA
from langchain.memory import ConversationBufferWindowMemory
from langchain.prompts import PromptTemplate
#from langchain_community.chat_models import BedrockChat
from langchain_aws import ChatBedrock
from langchain_aws import BedrockEmbeddings
#from langchain_community.embeddings import BedrockEmbeddings
from langchain_community.vectorstores import FAISS

In this step, we define the configuration used by the rest of the pipeline. The configuration covers:
- The region name and the credentials profile name of our AWS account (the profile name is commented out below, so the default AWS credentials are used)
- The embedding model used to embed the data, and its configuration (dimension of the vector embeddings and whether the embeddings are normalized)
- The LLM used to generate the output, and its configuration (maximum tokens, top k, top p, temperature, stop sequences)
- The retriever configuration: the search type and the number of documents to retrieve (other search types take additional arguments)
- The path of the vector database used for the retrieval
- The input key, memory key and input variables used in the prompt
- The prompt template used for the inference (chain) pipeline
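
To make the LLM sampling parameters from the list above more concrete, here is a minimal pure-Python sketch of how temperature, top-k and top-p are commonly applied to a next-token distribution. The function name `filter_distribution` and the toy logits are illustrative assumptions; this is not the actual implementation used by Bedrock or by the Claude models.

```python
import math

def filter_distribution(logits, temperature=1.0, top_k=None, top_p=1.0):
    """Return {token: probability} after temperature, top-k and top-p filtering."""
    # Temperature rescales the logits: values below 1 sharpen the distribution.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    # Softmax, shifted by the maximum for numerical stability.
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    probs = sorted(((tok, e / total) for tok, e in exps.items()),
                   key=lambda item: item[1], reverse=True)
    # Top-k keeps only the k most likely tokens.
    if top_k is not None:
        probs = probs[:top_k]
    # Top-p keeps the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in probs:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize the surviving tokens so that they sum to 1.
    norm = sum(p for _, p in kept)
    return {tok: p / norm for tok, p in kept}

# Hypothetical logits for three candidate tokens.
toy_logits = {"attention": 2.0, "transformer": 1.0, "banana": -1.0}
print(filter_distribution(toy_logits, temperature=0.1, top_k=250, top_p=1.0))
print(filter_distribution(toy_logits, temperature=1.0, top_k=2, top_p=1.0))
```

With a low temperature such as the 0.1 used in this notebook, almost all probability mass lands on the most likely token, so answers vary little between runs.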

In [6]:
# Defining the configuration
REGION_NAME = "us-east-1"
#CREDENTIALS_PROFILE_NAME = "ML"

EMBEDDER_MODEL_ID = "amazon.titan-embed-text-v2:0"
EMBEDDER_MODEL_KWARGS = {
    "dimensions": 512,
    "normalize": True
}

LLM_MODEL_ID = "anthropic.claude-3-sonnet-20240229-v1:0" # anthropic.claude-3-haiku-20240307-v1:0 or anthropic.claude-3-sonnet-20240229-v1:0 or anthropic.claude-v2:1
LLM_MODEL_KWARGS = {
    "max_tokens": 4096,
    "temperature": 0.1,
    "top_p": 1,
    "top_k": 250,
    "stop_sequences": ["\n\nHuman"]
}

SEARCH_TYPE = "similarity"
RETRIEVER_KWARGS = {
    "k": 5
}

VECTOR_STORE_PATH = "./vector_database/"

INPUT_KEY = "question"
MEMORY_KEY = "history"
INPUT_VARIABLES = ["context", "history", "question"]

# Inside the prompt template, you can experiment with the system's persona, the context, the history, and the question.
PROMPT_TEMPLATE = """
System: You are a helpful, respectful and honest assistant for Machine Learning.
Always answer as helpfully as possible, while being safe.
Please ensure that your responses are socially unbiased and positive in nature.
When addressing the user, always base your responses on the context provided and the previous chat history if it is available.
If you are unsure about the answer, please let the user know.
If the user asks something that is not related to Machine Learning, please let the user know.
Human:
----------
<context>
{context}
</context>
----------
<history>
{history}
</history>
----------
<question>
{question}
</question>
----------
Assistant:
"""


We define the LLM and the embeddings model that we are going to use in the pipeline.

For the LLM we use **Claude 3 Sonnet** from Anthropic, the mid-tier model of the Claude 3 family (between Haiku and Opus), which offers a good balance between answer quality, cost and latency for a system like ours. The model has a 200k-token context window.

For the embeddings model we use the same embedding model as in the data loading, splitting and ingestion pipeline, so that queries and stored documents are embedded in the same vector space.

In [7]:
# Creating the LLM and Embedder models
llm = ChatBedrock(region_name=REGION_NAME, model_id=LLM_MODEL_ID, model_kwargs=LLM_MODEL_KWARGS)
embedder = BedrockEmbeddings(region_name=REGION_NAME, model_id=EMBEDDER_MODEL_ID, model_kwargs=EMBEDDER_MODEL_KWARGS)


We load the vector database that was built in the data ingestion pipeline and define the retriever used in this pipeline.

**The retriever** returns the documents most similar to the input query. The search type is similarity search and we retrieve the top 5 documents. Note that LangChain's FAISS store uses Euclidean (L2) distance by default; unless the ingestion pipeline configured a different distance strategy, this ranks documents in the same order as cosine similarity, because our embeddings are normalized. Other options are changing the search type to `mmr` (maximal marginal relevance) or to similarity search with a score threshold.
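
As a rough illustration of what a top-k similarity search does, the sketch below ranks a few toy vectors against a query, once by cosine similarity and once by L2 distance. For unit-length vectors, ‖a − b‖² = 2 − 2·cos(a, b), so both rankings agree. The 2-dimensional vectors and the helper names are made up for the example; FAISS itself works on the real 512-dimensional embeddings with an optimized index.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]

def cosine(a, b):
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))

def l2(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k(query, docs, k, by="cosine"):
    """Return the names of the k documents closest to the query."""
    if by == "cosine":
        ranked = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)
    else:
        ranked = sorted(docs, key=lambda name: l2(query, docs[name]))
    return ranked[:k]

# Hypothetical 2-d "embeddings" standing in for real document vectors.
docs = {
    "attention": normalize([0.9, 0.1]),
    "bert": normalize([0.6, 0.4]),
    "cooking": normalize([0.1, 0.9]),
}
query = normalize([1.0, 0.2])

print(top_k(query, docs, k=2, by="cosine"))  # ['attention', 'bert']
print(top_k(query, docs, k=2, by="l2"))      # ['attention', 'bert']
```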

In [8]:
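# Loading the vector store and creating the retriever
vector_store = FAISS.load_local(VECTOR_STORE_PATH, embeddings=embedder, allow_dangerous_deserialization=True)
# Search options such as k are passed through search_kwargs; unpacking them
# directly into as_retriever does not forward them to the search.
retriever = vector_store.as_retriever(search_type=SEARCH_TYPE, search_kwargs=RETRIEVER_KWARGS)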
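
For reference, here is how the alternative search types could be expressed as configuration. The option names (`mmr`, `similarity_score_threshold`, `fetch_k`, `lambda_mult`, `score_threshold`) are the ones LangChain's vector store retriever accepts; the numeric values are illustrative assumptions, not tuned settings.

```python
# Plain top-k similarity search.
SIMILARITY = {"search_type": "similarity", "search_kwargs": {"k": 5}}

# Maximal marginal relevance: fetch a larger candidate pool (fetch_k), then
# pick k results that balance relevance and diversity (lambda_mult).
MMR = {
    "search_type": "mmr",
    "search_kwargs": {"k": 5, "fetch_k": 20, "lambda_mult": 0.5},
}

# Similarity search with a relevance cut-off: results scoring below the
# threshold are dropped.
THRESHOLD = {
    "search_type": "similarity_score_threshold",
    "search_kwargs": {"score_threshold": 0.5},
}

for name, config in [("similarity", SIMILARITY), ("mmr", MMR), ("threshold", THRESHOLD)]:
    # With a loaded store this would be used as: vector_store.as_retriever(**config)
    print(name, config)
```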


We define the **conversational memory buffer**, which stores the most recent question/answer exchanges of the conversation (the last `k=3` exchanges in our case) so that they can be passed to the model as history.

We also define the **prompt template** used in the pipeline. **The prompt template** generates the input for the LLM: the context, the history and the question are inserted into it, together with an additional system prompt that can be changed.
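
The windowing behaviour of the memory can be pictured with a small stand-in. The class below is a toy written for this explanation (it is not the LangChain class): it keeps only the last `k` exchanges and renders them as the text that would fill the `{history}` placeholder.

```python
from collections import deque

class WindowMemory:
    """Toy stand-in for a windowed conversation buffer (not the LangChain class)."""

    def __init__(self, k, human_prefix="Human", ai_prefix="Assistant"):
        # A deque with maxlen=k silently drops the oldest exchange when full.
        self.turns = deque(maxlen=k)
        self.human_prefix = human_prefix
        self.ai_prefix = ai_prefix

    def save(self, question, answer):
        """Store one question/answer exchange."""
        self.turns.append((question, answer))

    def history(self):
        """Render the kept exchanges as the text that fills {history}."""
        lines = []
        for question, answer in self.turns:
            lines.append(f"{self.human_prefix}: {question}")
            lines.append(f"{self.ai_prefix}: {answer}")
        return "\n".join(lines)

memory_sketch = WindowMemory(k=3)
for i in range(1, 5):
    memory_sketch.save(f"question {i}", f"answer {i}")

# Only exchanges 2-4 remain; exchange 1 has fallen out of the window.
print(memory_sketch.history())
```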

In [10]:
# Creating the memory and the prompt template
memory = ConversationBufferWindowMemory(memory_key=MEMORY_KEY, input_key=INPUT_KEY, k=3, ai_prefix="Assistant")
prompt = PromptTemplate(template=PROMPT_TEMPLATE, input_variables=INPUT_VARIABLES)
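
Since the list of input variables has to agree with the placeholders in the template, a quick consistency check can catch mistakes before the chain is built. The sketch below uses only the standard library; the helper names are hypothetical and the toy template merely mirrors the three placeholders used in this notebook.

```python
from string import Formatter

def template_placeholders(template):
    """Return the set of {placeholder} names used in a format-style template."""
    return {field for _, field, _, _ in Formatter().parse(template) if field}

def check_prompt_config(template, input_variables):
    """Raise ValueError if the template and the declared variables disagree."""
    found = template_placeholders(template)
    declared = set(input_variables)
    if found != declared:
        raise ValueError(
            f"missing from template: {sorted(declared - found)}, "
            f"undeclared in config: {sorted(found - declared)}"
        )
    return True

# Toy template with the same three placeholders as the notebook's template.
toy_template = (
    "<context>{context}</context>\n"
    "<history>{history}</history>\n"
    "<question>{question}</question>"
)
print(check_prompt_config(toy_template, ["context", "history", "question"]))  # True
```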


In this step we define **the chain** used in the pipeline.
**The chain** is composed of:
- the retriever,
- the LLM,
- the conversational memory buffer,
- the prompt template.

The chain generates the output for the input query: it retrieves the relevant documents, fills the prompt template with them, the history and the question, and sends the result to the LLM.
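
Conceptually, the chain wires these components together as sketched below. This is a simplified picture of the default "stuff" strategy (all retrieved chunks are concatenated into a single context), with hypothetical stand-in functions in place of the real retriever and LLM; the actual `RetrievalQA` implementation does more than this.

```python
def answer(question, retrieve, generate, template, history=""):
    """Retrieve documents, stuff them into the prompt, and call the model."""
    documents = retrieve(question)
    # "Stuff" strategy: concatenate every retrieved chunk into one context string.
    context = "\n\n".join(documents)
    prompt_text = template.format(context=context, history=history, question=question)
    return {"query": question, "result": generate(prompt_text), "source_documents": documents}

# Hypothetical stand-ins for the retriever and the LLM.
def fake_retrieve(question):
    return ["Attention weighs input tokens.", "BERT is pre-trained, then fine-tuned."]

def fake_generate(prompt_text):
    return f"(model reply to a {len(prompt_text)}-character prompt)"

toy_template = "Context:\n{context}\n\nHistory:\n{history}\n\nQuestion: {question}"
response_sketch = answer("What is attention?", fake_retrieve, fake_generate, toy_template)
print(response_sketch["result"])
print(response_sketch["source_documents"])
```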

In [11]:
# Creating the Chain for usage
chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    verbose=True,
    return_source_documents=True,
    chain_type_kwargs={
        "prompt": prompt,
        "memory": memory,
    },
)
response = chain.invoke("What is attention mechanism?")



> Entering new RetrievalQA chain...

> Finished chain.



Let's check the pipeline's response and see which elements it contains.

In [12]:
# Checking the response
response

{'query': 'What is attention mechanism?',
 'result': 'The attention mechanism is a key component of transformer models that allows the model to focus on the relevant parts of the input sequence when making predictions for a specific part of the output sequence.\n\nIn traditional sequence-to-sequence models like RNNs and LSTMs, the entire input sequence gets encoded into a fixed-length vector representation from which the output sequence is decoded. This can make it difficult to capture long-range dependencies in the input.\n\nThe attention mechanism helps alleviate this by calculating an attention score for each part of the input sequence that indicates how relevant it is to the current step of the output sequence being generated. This allows the model to selectively focus on the most relevant pieces of information from the input when producing each part of the output.\n\nThere are different variants of attention like self-attention (used in transformers) where the attention scores are


In this step we check the result of the pipeline: the plain answer text, plus the documents (content and metadata) that were retrieved from the vector database.
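
When only an overview of the retrieved documents is needed, the response can be condensed into one row per document. The sketch below runs on a made-up response with the same keys as the chain output; `Doc` is a minimal stand-in exposing only `page_content` and `metadata`, and `summarize_sources` is a hypothetical helper.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Minimal stand-in exposing the two attributes used in this notebook."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def summarize_sources(response, preview_chars=60):
    """Return one (source, page, preview) tuple per retrieved document."""
    rows = []
    for doc in response["source_documents"]:
        preview = doc.page_content[:preview_chars].replace("\n", " ")
        rows.append((doc.metadata.get("source"), doc.metadata.get("page"), preview))
    return rows

# Hypothetical response with the same keys as the chain output.
toy_response = {
    "query": "What is attention mechanism?",
    "result": "An illustrative answer.",
    "source_documents": [
        Doc("Table 1: A list of the different tasks\nand datasets.", {"source": "gpt.pdf", "page": 4}),
        Doc("Model specifications", {"source": "gpt.pdf", "page": 4}),
    ],
}

for row in summarize_sources(toy_response):
    print(row)
```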

In [13]:
# Printing the answer
print(response["result"])

The attention mechanism is a key component of transformer models that allows the model to focus on the relevant parts of the input sequence when making predictions for a specific part of the output sequence.

In traditional sequence-to-sequence models like RNNs and LSTMs, the entire input sequence gets encoded into a fixed-length vector representation from which the output sequence is decoded. This can make it difficult to capture long-range dependencies in the input.

The attention mechanism helps alleviate this by calculating an attention score for each part of the input sequence that indicates how relevant it is to the current step of the output sequence being generated. This allows the model to selectively focus on the most relevant pieces of information from the input when producing each part of the output.

There are different variants of attention like self-attention (used in transformers) where the attention scores are calculated over the input itself, and encoder-decoder atten

In [14]:
# Getting the documents used
for doc in response["source_documents"]:
    print("\n################################################# Document #################################################")
    print(doc.page_content)
    print(doc.metadata)


################################################# Document #################################################
Table 1: A list of the different tasks and datasets used in our experiments.
Task
Datasets
Natural language inference
SNLI [5], MultiNLI [66], Question NLI [64], RTE [4], SciTail [25]
Question Answering
RACE [30], Story Cloze [40]
Sentence similarity
MSR Paraphrase Corpus [14], Quora Question Pairs [9], STS Benchmark [6]
Classiﬁcation
Stanford Sentiment Treebank-2 [54], CoLA [65]
but is shufﬂed at a sentence level - destroying long-range structure. Our language model achieves a
very low token level perplexity of 18.4 on this corpus.
Model speciﬁcations
Our model largely follows the original transformer work [62]. We trained a
12-layer decoder-only transformer with masked self-attention heads (768 dimensional states and 12
attention heads). For the position-wise feed-forward networks, we used 3072 dimensional inner states.
We used the Adam optimization scheme [27] with a max lea


Let's test a full conversation with the pipeline. We ask a question and the pipeline generates an answer. Then we ask a follow-up question, and the pipeline generates its answer taking into account the previous question and the history of the conversation.
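
The same kind of multi-turn test can be scripted as a loop. The sketch below uses a hypothetical `EchoChain` stand-in so that it runs without AWS access; any object with an `invoke` method that returns a dict with a `result` key (such as the chain in this notebook) could be passed to `run_conversation` instead.

```python
class EchoChain:
    """Hypothetical stand-in for the real chain: remembers questions it has seen."""

    def __init__(self):
        self.seen = []

    def invoke(self, question):
        # Report how many earlier questions were available as "history".
        reply = f"answer to {question!r} with {len(self.seen)} earlier turn(s)"
        self.seen.append(question)
        return {"query": question, "result": reply, "source_documents": []}

def run_conversation(chain, questions):
    """Ask each question in order and collect (question, answer) pairs."""
    transcript = []
    for question in questions:
        response = chain.invoke(question)
        transcript.append((question, response["result"]))
    return transcript

for question, result in run_conversation(
    EchoChain(),
    ["What are the two steps in BERT model?", "How does the attention mechanism work?"],
):
    print(f"Q: {question}\nA: {result}\n")
```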

In [15]:
# Resetting the chain (fresh memory, same LLM, retriever and prompt)
memory = ConversationBufferWindowMemory(memory_key=MEMORY_KEY, input_key=INPUT_KEY, k=3, ai_prefix="Assistant")
chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    verbose=True,
    return_source_documents=True,
    chain_type_kwargs={
        "prompt": prompt,
        "memory": memory,
    },
)

In [16]:
# Asking a new question and printing the answer
response = chain.invoke("What are the two steps in BERT model?")
print(response["result"])



> Entering new RetrievalQA chain...

> Finished chain.
The BERT (Bidirectional Encoder Representations from Transformers) model involves two main steps:

1. Pre-training: In this step, the BERT model is pre-trained on a large corpus of unlabeled text using two unsupervised tasks:

- Masked Language Modeling (MLM): Here, some tokens in the input sequence are randomly masked, and the model learns to predict the masked tokens based on the context.

- Next Sentence Prediction (NSP): The model receives pairs of sentences and learns to predict whether the second sentence follows the first in the original text.

This pre-training allows BERT to develop deep bidirectional representations by jointly conditioning on both left and right context in all layers.

2. Fine-tuning: After pre-training, the BERT model is fine-tuned on labeled data from the downstream task of interest, such as text classification, question answering, or natural language inference. During fine-tuning, the

In [17]:
# Getting the documents used as context
print(response["source_documents"])

[Document(metadata={'source': '/Users/raquelcardoso/Documents/interns_ml_pt/Raquel/week5/4. Retrieval Agumented Generation (RAG)/gpt.pdf', 'file_path': '/Users/raquelcardoso/Documents/interns_ml_pt/Raquel/week5/4. Retrieval Agumented Generation (RAG)/gpt.pdf', 'page': 4, 'total_pages': 12, 'format': 'PDF 1.5', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': 'LaTeX with hyperref package', 'producer': 'pdfTeX-1.40.18', 'creationDate': 'D:20180608191434Z', 'modDate': 'D:20180608191434Z', 'trapped': ''}, page_content='sentences, and handle aspects of linguistic ambiguity. On RTE, one of the smaller datasets we\nevaluate on (2490 examples), we achieve an accuracy of 56%, which is below the 61.7% reported by a\nmulti-task biLSTM model. Given the strong performance of our approach on larger NLI datasets, it is\nlikely our model will beneﬁt from multi-task training as well but we have not explored this currently.\n2https://ftfy.readthedocs.io/en/latest/\n3https://spacy.io/

In [18]:
# Asking a new question and printing the answer
response = chain.invoke("How does the attention mechanism work?")
print(response["result"])



> Entering new RetrievalQA chain...

> Finished chain.
The attention mechanism in BERT and other transformer models allows the model to focus on the relevant parts of the input sequence when encoding a specific word or token. Here's a high-level overview of how it works:

1. The input sequence is first converted into vector representations (embeddings) for each token.

2. These embeddings go through multiple layers of self-attention, where each token attends to the other tokens in the sequence to compute its representation.

3. In self-attention, three vectors are computed for each token - a query vector, a key vector, and a value vector. The query vector is compared against the key vectors of all other tokens using a similarity score (e.g. dot product). This produces attention weights that are higher for tokens more relevant to the current token.

4. The value vectors of all tokens are then multiplied by their corresponding attention weights and summed up to produce 

In [19]:
# Getting the documents used as context
print(response["source_documents"])

[Document(metadata={'source': '/Users/raquelcardoso/Documents/interns_ml_pt/Raquel/week5/4. Retrieval Agumented Generation (RAG)/gpt.pdf', 'file_path': '/Users/raquelcardoso/Documents/interns_ml_pt/Raquel/week5/4. Retrieval Agumented Generation (RAG)/gpt.pdf', 'page': 4, 'total_pages': 12, 'format': 'PDF 1.5', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': 'LaTeX with hyperref package', 'producer': 'pdfTeX-1.40.18', 'creationDate': 'D:20180608191434Z', 'modDate': 'D:20180608191434Z', 'trapped': ''}, page_content='Table 1: A list of the different tasks and datasets used in our experiments.\nTask\nDatasets\nNatural language inference\nSNLI [5], MultiNLI [66], Question NLI [64], RTE [4], SciTail [25]\nQuestion Answering\nRACE [30], Story Cloze [40]\nSentence similarity\nMSR Paraphrase Corpus [14], Quora Question Pairs [9], STS Benchmark [6]\nClassiﬁcation\nStanford Sentiment Treebank-2 [54], CoLA [65]\nbut is shufﬂed at a sentence level - destroying long-range struct