# 1. RAG with LangChain & Langfuse

This notebook shows how to use LLMOps tools such as LangChain (for orchestration) and Langfuse (for monitoring and observability) to build a simple RAG pipeline.

We first download the PDF files that serve as the knowledge base for the RAG. We use the following papers:
- Attention Is All You Need, Vaswani et al. 2017
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Devlin et al. 2018
- Improving Language Understanding by Generative Pre-Training, Radford et al. 2018

In [1]:
# Download the paper "Attention Is All You Need" (the Transformer paper) from the NeurIPS 2017 proceedings
!wget https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf -O attention_is_all_you_need.pdf

--2024-10-24 17:06:52--  https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
Resolving proceedings.neurips.cc (proceedings.neurips.cc)... 198.202.70.94
Connecting to proceedings.neurips.cc (proceedings.neurips.cc)|198.202.70.94|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 569417 (556K) [application/pdf]
Saving to: ‘attention_is_all_you_need.pdf’


2024-10-24 17:06:54 (666 KB/s) - ‘attention_is_all_you_need.pdf’ saved [569417/569417]



In [2]:
# Download the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding from the NAACL 2019 conference
!wget https://www.aclweb.org/anthology/N19-1423.pdf -O bert.pdf

--2024-10-24 17:06:54--  https://www.aclweb.org/anthology/N19-1423.pdf
Resolving www.aclweb.org (www.aclweb.org)... 50.87.169.12
Connecting to www.aclweb.org (www.aclweb.org)|50.87.169.12|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://aclanthology.org/N19-1423.pdf [following]
--2024-10-24 17:06:55--  https://aclanthology.org/N19-1423.pdf
Resolving aclanthology.org (aclanthology.org)... 174.138.37.75
Connecting to aclanthology.org (aclanthology.org)|174.138.37.75|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 786279 (768K) [application/pdf]
Saving to: ‘bert.pdf’


2024-10-24 17:06:56 (1,28 MB/s) - ‘bert.pdf’ saved [786279/786279]



In [3]:
# Download the GPT paper
!wget https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf -O gpt.pdf

--2024-10-24 17:06:56--  https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
Resolving cdn.openai.com (cdn.openai.com)... 2620:1ec:bdf::42, 13.107.246.42
Connecting to cdn.openai.com (cdn.openai.com)|2620:1ec:bdf::42|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 541036 (528K) [application/pdf]
Saving to: ‘gpt.pdf’


2024-10-24 17:06:58 (5,76 MB/s) - ‘gpt.pdf’ saved [541036/541036]



In [4]:
# Install LangChain (core, community and AWS integrations), Langfuse, FAISS, PyMuPDF and python-dotenv
!pip install langchain langchain-community langchain-aws boto3 langfuse faiss-cpu PyMuPDF python-dotenv



For our use case, we work with **LangChain** for orchestration, **Ragas** for evaluation and **Langfuse** for monitoring and observability.

**LangChain** is an orchestration framework for LLMs and agents. It provides the building blocks for a complete RAG system, from the data ingestion pipeline to the inference pipeline.

**Langfuse** is an observability tool for LLM applications. It gives us an interface for inspecting traces, latency, costs and similar metrics of our pipeline.

In [5]:
# Importing all the necessary modules/libraries
import os

from langchain.chains import RetrievalQA
from langchain_community.document_loaders import PyMuPDFLoader
from langchain.memory import ConversationBufferWindowMemory
from langchain.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_aws import ChatBedrock
from langchain_aws import BedrockEmbeddings
from langchain_community.vectorstores import FAISS
from langfuse.callback import CallbackHandler

In this step we define the configuration of the pipeline:
- The region name and the credentials profile name of our AWS account
- The public and secret keys for our Langfuse server and its host URL
- The embedding model and its configuration (dimension of the vector embeddings and whether they are normalized)
- The size of the document chunks and the overlap between them
- The paths of the PDF files from which we extract the data
- The path of the vector database where we save the embedded data
- The LLM and its configuration (maximum tokens, top k, top p, temperature)
- The retriever configuration: the search type and the number of documents to retrieve (other arguments apply if the search type is changed)
- The input key, memory key and input variables used in the prompt
- The prompt template used for the inference (chain) pipeline

In [6]:
# Load the Langfuse settings from a local .env file into the environment
from dotenv import load_dotenv

load_dotenv()

True
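
Because the Langfuse keys are read from environment variables, a missing entry in the `.env` file would otherwise only show up later as a failed connection. The following is a minimal sketch of an early check, assuming the three variable names used in this notebook. The helper `missing_env_vars` is hypothetical and not part of any library.

```python
import os

# Variable names taken from the configuration cell of this notebook.
REQUIRED_ENV_VARS = ("LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY", "LANGFUSE_HOST")


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


missing = missing_env_vars(REQUIRED_ENV_VARS)
print("Missing:", ", ".join(missing) if missing else "none")
```

Passing a plain dictionary as `environ` makes the helper easy to try out without touching the real environment.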

In [7]:
# Defining the configuration
REGION_NAME = "us-west-2"
CREDENTIALS_PROFILE_NAME = "ML"

PUBLIC_KEY = os.getenv("LANGFUSE_PUBLIC_KEY")
SECRET_KEY = os.getenv("LANGFUSE_SECRET_KEY")
HOST = os.getenv("LANGFUSE_HOST")

EMBEDDER_MODEL_ID = "amazon.titan-embed-text-v2:0"
EMBEDDER_MODEL_KWARGS = {
    "dimensions": 512,
    "normalize": True
}

LLM_MODEL_ID = "anthropic.claude-3-sonnet-20240229-v1:0" # anthropic.claude-3-haiku-20240307-v1:0 or anthropic.claude-3-sonnet-20240229-v1:0 or anthropic.claude-v2:1
LLM_MODEL_KWARGS = {
    "max_tokens": 4096,
    "temperature": 0.1,
    "top_p": 1,
    "top_k": 250,
    "stop_sequences": ["\n\nHuman"]
}

CHUNK_SIZE = 2000
CHUNK_OVERLAP = 100

DATA_PATHS = [
    "attention_is_all_you_need.pdf",
    "bert.pdf",
    "gpt.pdf"
]

VECTOR_STORE_PATH = "./vector_database/"

SEARCH_TYPE = "similarity"
RETRIEVER_KWARGS = {
    "k": 5
}

INPUT_KEY = "question"
MEMORY_KEY = "history"
INPUT_VARIABLES = ["context", "history", "question"]

# Inside the prompt template, you can experiment with the system's persona, the context, the history, and the question.
PROMPT_TEMPLATE = """
System: You are a helpful, respectful and honest assistant for Machine Learning.
Always answer as helpfully as possible, while being safe.
Please ensure that your responses are socially unbiased and positive in nature.
When addressing the user, always base your responses on the context provided and the previous chat history if it is available.
If you are unsure about the answer, please let the user know.
If the user asks something that is not related to Machine Learning, please let the user know.
Human:
----------
<context>
{context}
</context>
----------
<history>
{history}
</history>
----------
<question>
{question}
</question>
----------
Assistant:
"""


We need to define the chunker.

**The chunker** is responsible for splitting the data into chunks. `RecursiveCharacterTextSplitter` measures length in characters by default, so here we split the data into chunks of at most 2000 characters with an overlap of up to 100 characters.

We use fairly large chunks so that related information from a document stays together in one chunk, which gives each chunk a more complete representation of the content it covers.

The overlap repeats the end of one chunk at the start of the next, so that context is not lost at chunk boundaries.
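
To make the interaction between chunk size and overlap concrete, here is a simplified sketch that cuts a string into fixed-size character windows. It is an illustration only: the function `chunk_with_overlap` is hypothetical, and unlike `RecursiveCharacterTextSplitter` it does not try to break on paragraph, line or word boundaries first.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into character windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for piece in pieces:
    print(piece)
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the last three characters of one piece reappear at the start of the next.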

In [8]:
# Defining the chunker
splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP,
)


Next we load the PDF files, parse them and split the parsed text into chunks using the chunker defined in the previous step.

For loading the data we use **PyMuPDFLoader**, which returns one document per page together with metadata such as the source file and the page number.

In [9]:
# Creating chunks from the documents
global_chunks = []
for data_path in DATA_PATHS:
    loader = PyMuPDFLoader(os.path.join(os.getcwd(), data_path))
    docs = loader.load()
    chunks = splitter.split_documents(docs)
    global_chunks.extend(chunks)

We use an embedding model to embed the data.

We use **Amazon Titan Text Embeddings V2** (`amazon.titan-embed-text-v2:0`), which supports output dimensions of 256, 512 and 1024, so the embedding size can be traded off against storage and search cost.

We request 512-dimensional embeddings and enable normalization, so every vector has unit length and inner products between vectors correspond to cosine similarity.
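
The effect of normalization can be checked with a few lines of plain Python. This is a sketch on hand-picked 2-D vectors rather than real embeddings, and the helper names are illustrative.

```python
import math


def dot(a, b):
    """Inner product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vector):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(dot(vector, vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

# After normalization, the plain inner product equals the cosine similarity.
print(math.isclose(dot(a_unit, b_unit), cosine_similarity(a, b)))
```

This is why a store that compares normalized vectors by inner product or by Euclidean distance ranks results in the same order as cosine similarity.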

In [10]:
# Creating the embedder
embedder = BedrockEmbeddings(
    model_id=EMBEDDER_MODEL_ID,
    model_kwargs=EMBEDDER_MODEL_KWARGS,
    region_name=REGION_NAME
)

Now we embed the chunks with the embedding model defined in the previous step and store the resulting vectors. We use **FAISS**, a library for efficient similarity search over dense vectors, which LangChain wraps as a vector store that can be saved to and loaded from local disk.

In [11]:
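# Creating the vector store from the chunks of all documents
# (global_chunks, not chunks, which only holds the last PDF)
vector_store = FAISS.from_documents(documents=global_chunks, embedding=embedder)
vector_store.save_local(VECTOR_STORE_PATH)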


We define the LLM that we use in the pipeline.

For the LLM we use Anthropic's **Claude 3 Sonnet** through Amazon Bedrock. Within the Claude 3 family it sits between Haiku and Opus, trading some capability for lower cost and latency, and it has a 200k-token context window.


In [12]:
# Creating the LLM
llm = ChatBedrock(
    region_name=REGION_NAME,
    credentials_profile_name=CREDENTIALS_PROFILE_NAME,
    model_id=LLM_MODEL_ID,
    model_kwargs=LLM_MODEL_KWARGS,
)


We load the vector database so it can back the **retriever**.

**The retriever** returns the documents most similar to the input query. We use the `similarity` search type and retrieve the top 5 documents. FAISS vector stores created by LangChain use Euclidean (L2) distance by default; because our embeddings are normalized, this produces the same ranking as cosine similarity. Alternatives include switching the search type to `mmr` or to `similarity_score_threshold`.
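
The retrieval step can be pictured as a brute-force nearest-neighbour search. The sketch below uses made-up 2-D unit vectors in place of real embeddings and hypothetical helper names. It also checks that L2 distance and inner product give the same top-k order for unit vectors, which follows from ‖a − b‖² = 2 − 2·a·b.

```python
import heapq
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def unit(angle_degrees):
    """Build a 2-D unit vector from an angle, standing in for a normalized embedding."""
    radians = math.radians(angle_degrees)
    return [math.cos(radians), math.sin(radians)]


# Hypothetical chunk embeddings, already normalized to unit length.
chunk_vectors = [unit(0), unit(30), unit(90), unit(170), unit(45)]
query_vector = unit(40)
indices = range(len(chunk_vectors))

# Top 3 by smallest L2 distance and by largest inner product (cosine for unit vectors).
by_l2 = heapq.nsmallest(3, indices, key=lambda i: squared_l2(query_vector, chunk_vectors[i]))
by_cosine = heapq.nlargest(3, indices, key=lambda i: dot(query_vector, chunk_vectors[i]))

print(by_l2)
print(by_l2 == by_cosine)
```

A real index avoids scanning every vector, but the result it aims for is the same ordered list of nearest chunks.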

In [13]:
# Loading the vector store and creating the retriever
vector_store = FAISS.load_local(VECTOR_STORE_PATH, embeddings=embedder, allow_dangerous_deserialization=True)
# The number of documents to retrieve is passed through search_kwargs
retriever = vector_store.as_retriever(search_type=SEARCH_TYPE, search_kwargs=RETRIEVER_KWARGS)


We define the **conversational memory buffer**, which keeps the most recent exchanges of the conversation (the last `k=3` question/answer pairs) and makes them available to the prompt as `{history}`.

We also define the **prompt template** used in the pipeline. **The prompt template** builds the input for the LLM: the context, the history and the question are filled in, together with an additional system prompt that can be changed.
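
The idea of a windowed memory feeding a prompt template can be sketched with the standard library alone. The class `WindowMemory` and the shortened `TEMPLATE` below are illustrative stand-ins, not LangChain's implementation.

```python
from collections import deque

# Shortened, hypothetical version of the notebook's prompt template.
TEMPLATE = "<history>\n{history}\n</history>\n<question>\n{question}\n</question>"


class WindowMemory:
    """Keep only the last k (question, answer) exchanges."""

    def __init__(self, k):
        self.turns = deque(maxlen=k)  # older turns are dropped automatically

    def save(self, question, answer):
        self.turns.append((question, answer))

    def render(self):
        return "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in self.turns)


memory_sketch = WindowMemory(k=3)
for i in range(1, 6):
    memory_sketch.save(f"question {i}", f"answer {i}")

# Only exchanges 3, 4 and 5 remain in the rendered history.
print(TEMPLATE.format(history=memory_sketch.render(), question="question 6"))
```

Bounding the window keeps the prompt size predictable, at the cost of forgetting anything older than the last `k` exchanges.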

In [14]:
# Creating the memory and the prompt template
memory = ConversationBufferWindowMemory(memory_key=MEMORY_KEY, input_key=INPUT_KEY, k=3, ai_prefix="Assistant")
prompt = PromptTemplate(template=PROMPT_TEMPLATE, input_variables=INPUT_VARIABLES)



Before invoking the pipeline, we define the **Langfuse** handler that monitors the pipeline.
We create a Langfuse callback handler that sends the inputs and outputs of the system to the Langfuse server, where they appear as traces.

In [15]:
# Creating the callback handler
langfuse_callback = CallbackHandler(
    public_key=PUBLIC_KEY,
    secret_key=SECRET_KEY,
    host=HOST,
)


In this step we define **the chain** used in the pipeline.
**The chain** is composed of:
- the retriever
- the LLM
- the conversational memory buffer
- the prompt template

The Langfuse handler is not part of the chain itself; it is passed as a callback when the chain is invoked.

The chain generates the output for the input query.
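
Conceptually, the chain retrieves documents, fills them into the prompt and calls the model, while a callback records what happened at each step. The toy sketch below mimics that flow with placeholder functions. `RecordingHandler`, `fake_retrieve` and `fake_llm` are hypothetical and do not reflect how LangChain or Langfuse are implemented.

```python
import time


class RecordingHandler:
    """Hypothetical stand-in for a tracing callback: it stores events in a list."""

    def __init__(self):
        self.events = []

    def on_event(self, name, payload, seconds):
        self.events.append({"name": name, "payload": payload, "seconds": seconds})


def traced(name, func, handler, *args):
    """Run func(*args), then report its output and duration to the handler."""
    start = time.perf_counter()
    result = func(*args)
    handler.on_event(name, result, time.perf_counter() - start)
    return result


def fake_retrieve(question):
    # Placeholder for the vector store lookup; returns fixed text.
    return ["Attention maps a query and key-value pairs to an output."]


def fake_llm(prompt_text):
    # Placeholder for the model call; it only reports the prompt size.
    return f"(answer based on {len(prompt_text)} prompt characters)"


def answer(question, handler):
    docs = traced("retrieve", fake_retrieve, handler, question)
    context = "\n\n".join(docs)
    prompt_text = f"<context>\n{context}\n</context>\n<question>\n{question}\n</question>"
    return traced("generate", fake_llm, handler, prompt_text)


handler = RecordingHandler()
print(answer("What is attention mechanism?", handler))
print([event["name"] for event in handler.events])
```

Keeping the handler outside the pipeline functions is what allows a different handler, or none, to be supplied on each call.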

In [16]:
# Creating the chain
chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    verbose=True,
    return_source_documents=True,
    chain_type_kwargs={
        "prompt": prompt,
        "memory": memory,
    },
)

# The Langfuse handler is passed per invocation through the callbacks config
response = chain.invoke("What is attention mechanism?", config={"callbacks": [langfuse_callback]})



> Entering new RetrievalQA chain...

> Finished chain.
