# Retrieval Augmented Generation - Tutorial 1

Retrieval Augmented Generation (RAG) is a technique for applying the power of LLMs to private data, or to data produced after the model was trained, without having to retrain the model. It brings specific knowledge and more reliability to the LLM's answers, along with more control over the source knowledge and traceability of the data sources.

In this tutorial, we are going to use the concepts and base code from LangChain's [Build a Retrieval Augmented Generation (RAG) app](https://python.langchain.com/docs/tutorials/rag/) tutorial.

A RAG application usually has two main steps:
- **Indexing**: commonly run offline, it is responsible for ingesting data from the data sources and indexing it in a given place, e.g., a vector database.
- **Retrieval and Generation**: happens at run time. Given a query, it retrieves the relevant (previously indexed) data associated with the query and passes it to an LLM, which generates an answer based on it.


### Import libraries

In [None]:
# !pip install pypdf
# !pip install langchain-chroma
# !pip install langchain-huggingface

In [2]:
## This loads the environment variables defined in a .env file. 
# It's useful to load secrets
import dotenv
dotenv.load_dotenv()

True

### Indexing the data

The data indexing is composed of three main steps:
- **Document loading**: here we load the documents/data we want to index. There is a variety of document types, and LangChain has built-in tools ([`Document Loaders`](https://python.langchain.com/docs/concepts/#document-loaders)) to help us with that. A document in LangChain has two attributes:
    - `page_content: str`: the content of the document (text).
    - `metadata: dict`: metadata associated with the document, such as the document id, the filename, etc. This helps with the traceability of the information.

- **Text splitting**: once we have loaded the documents, we need to split their contents into smaller chunks, which is useful both for indexing the data and for passing it to the model. Large chunks of data are hard to search over and may not fit into the model's finite context window. Again, LangChain can help us with built-in [`Text Splitters`](https://python.langchain.com/docs/concepts/#text-splitters).
    - It's important to highlight that this is a complex process, since we want chunks that are semantically meaningful and, at the same time, have a good length for retrieval. In other words, there are many strategies to apply here, and the choice is something to be fine-tuned.

- **Storing**: after we have broken our documents into smaller pieces, we need to store them somewhere we can easily retrieve them from later. We usually do this using [`Vector Stores`](https://python.langchain.com/docs/concepts/#vector-stores) and [`Embedding Models`](https://python.langchain.com/docs/concepts/#embedding-models).
    - `Vector Store`: responsible for storing the data (the content and its embedding, which is a numerical representation of this content) and for retrieving the most relevant contents/embeddings for a given query (which also needs to be embedded to perform the search). **#Storing #Retrieving**.
    - `Embedding Models`: responsible for creating numerical representations of the data (embedding vectors), which capture its semantic meaning and allow us to perform mathematical operations over it, facilitating tasks such as searching for the pieces of data that are most similar in meaning to a query.
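
To make the two document attributes more concrete, here is a minimal sketch in plain Python. The `Doc` class and the `split_in_half` helper are hypothetical stand-ins written for this illustration, not LangChain's own classes. The point is only that chunks derived from a page keep a copy of its metadata, which is what later lets us trace an answer back to a file and page.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_in_half(doc: Doc) -> list[Doc]:
    """Split a document into two chunks that inherit the parent's metadata."""
    middle = len(doc.page_content) // 2
    halves = [doc.page_content[:middle], doc.page_content[middle:]]
    # Each chunk gets its own copy of the metadata dictionary
    return [Doc(page_content=half, metadata=dict(doc.metadata)) for half in halves]


page = Doc(
    page_content="Alzheimer's is a degenerative brain disease.",
    metadata={"source": "example.pdf", "page": 0},
)
chunks = split_in_half(page)

for chunk in chunks:
    print(chunk.metadata, repr(chunk.page_content))
```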


#### Loading the Data

In this tutorial, we are going to work with PDF files.

In [3]:
from langchain_community.document_loaders import PyPDFDirectoryLoader

path_pdfs = "/home/mariannadepinhosevero/Documentos/petro/studies/langgraph/notebooks/langchain/data/pdf"
pdf_loader = PyPDFDirectoryLoader(path=path_pdfs)

pdf_documents = pdf_loader.load()

The PDF loader returns a list of documents, where each document represents one page of a PDF, with its **content** and **metadata**. From the metadata we know which document and page a given text (content) belongs to.

In [4]:
print(f"Type return: {type(pdf_documents)}, N elements: {len(pdf_documents)}, Type element: {type(pdf_documents[0])}\n")
print("---"*20)

for idx, doc in enumerate(pdf_documents):
    print(f'{idx} --> source: {doc.metadata["source"].split("/")[-1]}, page: {doc.metadata["page"]}')

Type return: <class 'list'>, N elements: 125, Type element: <class 'langchain_core.documents.base.Document'>

------------------------------------------------------------
0 --> source: Alzheimer s   Dementia - 2020 -  - 2020 Alzheimer s disease facts and figures.pdf, page: 0
1 --> source: Alzheimer s   Dementia - 2020 -  - 2020 Alzheimer s disease facts and figures.pdf, page: 1
2 --> source: Alzheimer s   Dementia - 2020 -  - 2020 Alzheimer s disease facts and figures.pdf, page: 2
3 --> source: Alzheimer s   Dementia - 2020 -  - 2020 Alzheimer s disease facts and figures.pdf, page: 3
4 --> source: Alzheimer s   Dementia - 2020 -  - 2020 Alzheimer s disease facts and figures.pdf, page: 4
5 --> source: Alzheimer s   Dementia - 2020 -  - 2020 Alzheimer s disease facts and figures.pdf, page: 5
6 --> source: Alzheimer s   Dementia - 2020 -  - 2020 Alzheimer s disease facts and figures.pdf, page: 6
7 --> source: Alzheimer s   Dementia - 2020 -  - 2020 Alzheimer s disease facts and figures.pd

#### Splitting the data

In this tutorial we are going to use the [`RecursiveCharacterTextSplitter`](https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/), which is a type of text splitter that tries to keep related pieces of text next to each other.

Text splitters like this one have two main parameters:
- `chunk_size`: the maximum size of a chunk of text (measured in characters by default).
- `chunk_overlap`: the size of the piece of text that may be shared by two consecutive chunks.
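
To build intuition for these two parameters, the sketch below uses a plain fixed-size sliding window. This is a deliberately simpler strategy than the recursive splitter used in this tutorial, and `split_with_overlap` is a hypothetical helper written only for this illustration.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(text, chunk_size=10, chunk_overlap=4)
print(chunks)
# → ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Note how the last 4 characters of each chunk reappear at the start of the next one. The overlap reduces the chance that a sentence cut at a chunk boundary loses its context.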

In [5]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

pdf_text_splits = text_splitter.split_documents(pdf_documents)

The result is another list of documents. This time, however, each of the previous pages was split into smaller chunks with a maximum length of 1000 characters. We can notice two interesting facts here:

1. Each page is broken into several documents, so metadata such as the page number and the source is repeated across different documents.
2. The generated chunks have different sizes. As the text splitter tries to keep related content together, it can break the text differently depending on the content being split. In other words, the chunks preserve the meaning of the text as much as possible and can have different sizes, but never more than **chunk_size**.
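
The second point can be illustrated with a simplified separator-aware splitter. This is only a sketch of the general idea (try coarse separators first, fall back to finer ones), not LangChain's actual implementation. The `recursive_split` function and its default separators are assumptions made for this example, and it ignores overlap entirely.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters.

    Prefers to break at the coarsest separator that works.
    """
    if len(text) <= chunk_size:
        return [text] if text else []

    separator, finer_separators = separators[0], separators[1:]
    if separator == "":
        # Last resort: cut at fixed positions
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging related pieces together
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with a finer separator
            chunks.extend(recursive_split(piece, chunk_size, finer_separators))
    if current:
        chunks.append(current)
    return chunks


sample = (
    "Para one is short.\n\n"
    "Para two is a bit longer than the first one.\n\n"
    "Three."
)
chunks = recursive_split(sample, chunk_size=40)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

The short first paragraph stays whole, while the longer second paragraph is broken at a word boundary, so the resulting chunks end up with different lengths, all within the limit.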

In [6]:
print(f"Type return: {type(pdf_text_splits)}, N elements: {len(pdf_text_splits)}, Type element: {type(pdf_text_splits[0])}\n")
print("---"*20)


for idx, doc in enumerate(pdf_text_splits):
    print(f'{idx} --> content lenght: {len(doc.page_content)}, page: {doc.metadata["page"]}, source: {doc.metadata["source"].split("/")[-1]}')

Type return: <class 'list'>, N elements: 924, Type element: <class 'langchain_core.documents.base.Document'>

------------------------------------------------------------
0 --> content lenght: 961, page: 0, source: Alzheimer s   Dementia - 2020 -  - 2020 Alzheimer s disease facts and figures.pdf
1 --> content lenght: 937, page: 0, source: Alzheimer s   Dementia - 2020 -  - 2020 Alzheimer s disease facts and figures.pdf
2 --> content lenght: 969, page: 0, source: Alzheimer s   Dementia - 2020 -  - 2020 Alzheimer s disease facts and figures.pdf
3 --> content lenght: 697, page: 0, source: Alzheimer s   Dementia - 2020 -  - 2020 Alzheimer s disease facts and figures.pdf
4 --> content lenght: 993, page: 1, source: Alzheimer s   Dementia - 2020 -  - 2020 Alzheimer s disease facts and figures.pdf
5 --> content lenght: 932, page: 1, source: Alzheimer s   Dementia - 2020 -  - 2020 Alzheimer s disease facts and figures.pdf
6 --> content lenght: 943, page: 1, source: Alzheimer s   Dementia - 2020

#### Store the data

As the **Vector Store** in this tutorial, we are going to use [`Chroma`](https://python.langchain.com/docs/integrations/vectorstores/chroma/), which is an AI-native open-source vector database integrated with the LangChain framework.

As the **Embedding Model** we are going to use the [`Hugging Face`](https://python.langchain.com/docs/integrations/text_embedding/huggingfacehub/) "**sentence-transformers/all-MiniLM-l6-v2**" model, which is also integrated with the LangChain framework. We chose this embedding model instead of the OpenAI one to save costs.

In [7]:
# Generates a string ID by combining the chunk's index with its metadata
def generate_id(index, chunk):
    source = chunk.metadata["source"]
    page = chunk.metadata["page"]

    return f"idx_{index}_page_{page}_source_{source}"

In [8]:
from langchain_chroma import Chroma
from langchain_huggingface.embeddings import HuggingFaceEmbeddings

# Creates the embedding model
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-l6-v2")
# Creates the vector store and stores the chunks generated before
# To prevent the vector store from storing repeated data, we create unique ids from the chunks' metadata
# So, the same chunk will always generate the same id
ids = [generate_id(index=idx, chunk=chunk) for idx, chunk in enumerate(pdf_text_splits)]

vector_store = Chroma.from_documents(
    collection_name="alzheimer_papers_rag_tutorial_01",
    persist_directory="./notebooks/data/chroma",
    embedding=embedding_model,
    documents=pdf_text_splits,
    ids=ids
)
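
The reasoning behind deterministic ids can be sketched without a real vector store. Below, a plain dictionary plays the role of the store, under the assumption that writing an entry whose id already exists overwrites it instead of adding a duplicate. `make_id` mirrors the notebook's `generate_id` but takes plain values, and `upsert` is a hypothetical helper.

```python
def make_id(index, source, page):
    """Build the same kind of id as generate_id, from plain values."""
    return f"idx_{index}_page_{page}_source_{source}"


def upsert(store, ids, texts):
    """Insert new entries and overwrite entries whose id already exists."""
    for chunk_id, text in zip(ids, texts):
        store[chunk_id] = text


# (source, page, text) triples standing in for the split documents
chunks = [
    ("paper.pdf", 0, "first chunk"),
    ("paper.pdf", 0, "second chunk"),
    ("paper.pdf", 1, "third chunk"),
]
ids = [make_id(index, source, page) for index, (source, page, _) in enumerate(chunks)]
texts = [text for _, _, text in chunks]

store = {}
upsert(store, ids, texts)
upsert(store, ids, texts)  # running the indexing cell a second time

print(len(store))
# → 3
```

With random ids, the second run would have doubled the number of stored entries. Note that because the index is part of the id, the ids stay stable only while the chunks are produced in the same order.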



We can check some information about the Vector Store we just created using the method `.get`, as shown below.

**Note**: the `embeddings` info is excluded by default from the `.get` return, in order to improve performance. So, to get this info we need to explicitly specify this field in the `include` parameter.

In [9]:
print("The vector store has:")
print("---"*10)

vs_info = vector_store.get(include=["embeddings", "documents", "metadatas"])

for key in vs_info:
    print(f"{key}: {len(vs_info[key])}" if isinstance(vs_info[key], list) else f"{key}: {vs_info[key]}", end=" | ")

The vector store has:
------------------------------
ids: 924 | embeddings: 924 | metadatas: 924 | documents: 924 | uris: None | data: None | included: 3 | 

### Retrieval and Generation

**Retrieval** is the process of recovering relevant data from our storage given an input query. In the case of a Vector Store, our data is embedded in a vector space where each dimension represents some semantic meaning of the original data. So, when we want to search for **relevant data** for an input query:

1. We **embed the input query** in the same vector space. This means that we use the same embedding model to create an embedding vector for the query.
2. Then, we create a **Retriever** (usually the vector store framework provides a method to create the retriever from the storage variable), which will search for the relevant data.
    - Here, relevant means the vectors closest to the query vector in the embedding space.
    - To find them, since they are numerical arrays, we can compute distances, such as the cosine distance.
    - However, the closest vectors are not always the best ones. Because of that, there are many retrieval (e.g. similarity) and post-retrieval (e.g. ranking) strategies we can experiment with.

The **Generation** step is the one where the LLM or ChatModel receives the **Question** and the **Context** (retrieved data) included in a **Prompt** and generates an answer based on that.

In [10]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.messages import SystemMessage, HumanMessage
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

In [11]:
# We create a retriever that uses similarity search and returns the 20 documents/chunks most similar to the query
retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs = {"k": 20}
)

# Creates the prompt with the instructions and the input and output formats
prompt_template=("You are a helpful assistant for question-answering tasks and an expert in Alzheimer's scientific research."
         " Use the following pieces of retrieved context to answer the question."
         " If you don't know the answer, just say that you don't know."
         " Use three sentences maximum and keep the answer concise."
         " Question: {question}\n\n"
         " Context: {context}\n\n"
         " Answer:")

prompt = ChatPromptTemplate.from_template(prompt_template)

In the chains created below, some points are worth detailing:

1. The chain will receive as input the **question** from the user.
2. Once the question is received:
    - It will be given as input to the retriever, which returns the result into the variable "context".
    - And it will also be passed through to the "question" variable.
3. This creates a dictionary that is used as input to the prompt template, which expects the **context** and **question** inputs to be completed.
    - So, note here that the variable names in the input should be the same as the variable names in the prompt template.
4. With the prompt filled, we pass it to the chat model, which will use the context and its previous knowledge to answer the question.
5. The model output is an `AIMessage`. So, we can use an output parser, such as `StrOutputParser`, to extract the text content and turn the result into a message that is easier for humans to read.
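
The data flow in the steps above can be mimicked with ordinary Python functions. Everything in this sketch is a stand-in written for illustration (`fake_retriever`, `fake_model`, `parse_output` and the shortened template are all assumptions), so it shows only how the values move between steps, not how LangChain implements them.

```python
TEMPLATE = "Question: {question}\n\nContext: {context}\n\nAnswer:"


def fake_retriever(question):
    """Stand-in retriever: always returns the same two chunks."""
    return ["Chunk about symptoms.", "Chunk about causes."]


def format_docs(documents):
    return "\n\n".join(documents)


def fan_out(question):
    """Steps 1-3: the same question feeds both dictionary entries."""
    return {
        "context": format_docs(fake_retriever(question)),
        "question": question,
    }


def fill_prompt(inputs):
    """The dictionary keys must match the placeholders in the template."""
    return TEMPLATE.format(**inputs)


def fake_model(prompt_text):
    """Step 4: stand-in model returning a message-like dictionary."""
    return {
        "content": f"(answer based on {len(prompt_text)} prompt characters)",
        "metadata": {"model_name": "fake"},
    }


def parse_output(message):
    """Step 5: keep only the text content of the message."""
    return message["content"]


question = "What is Alzheimer's disease?"
inputs = fan_out(question)
prompt_text = fill_prompt(inputs)
answer = parse_output(fake_model(prompt_text))
print(answer)
```

If a dictionary key does not match a placeholder name (say, `query` instead of `question`), `TEMPLATE.format` raises a `KeyError` in this sketch, which is the plain-Python analogue of the naming requirement in step 3.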

In [12]:
# Instantiates the chat model
chat_model = ChatOpenAI(model="gpt-3.5-turbo")

# Creates a function to format the retrieved documents
def format_retrieved_docs(documents):
    return "\n\n".join([doc.page_content for doc in documents])

# Creates the Retrieval-Generation chain
rag_chain_with_outparser = (
    {"context": retriever | format_retrieved_docs, "question": RunnablePassthrough()}
    | prompt
    | chat_model
    | StrOutputParser()
)

rag_chain_without_outparser = (
    {"context": retriever | format_retrieved_docs, "question": RunnablePassthrough()}
    | prompt
    | chat_model
)

In [13]:
question = "What is the Alzheimers disease?"

response_without_parser = rag_chain_without_outparser.invoke(input=question)
response_without_parser

AIMessage(content="Alzheimer's disease is a type of brain disease that is degenerative, meaning it worsens over time. It is characterized by changes in the brain that lead to symptoms such as memory loss and language problems. The disease is thought to start many years before symptoms appear, affecting cognitive function and ultimately leading to difficulty in daily activities.", additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 66, 'prompt_tokens': 3812, 'total_tokens': 3878, 'completion_tokens_details': {'reasoning_tokens': 0}}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-dde8f4b1-669d-434d-b188-baa9ffe83d10-0', usage_metadata={'input_tokens': 3812, 'output_tokens': 66, 'total_tokens': 3878})

In [14]:
question = "What is the Alzheimers disease?"
response_with_parser = rag_chain_with_outparser.invoke(input=question)
print(response_with_parser)

Alzheimer's disease is a type of brain disease that is degenerative, meaning it worsens over time. It is characterized by changes in the brain that lead to symptoms like memory loss and language problems. The disease is thought to start many years before symptoms become noticeable.


### Chatting with the chain

#### Without Memory

The chat model doesn't store our previous interactions with it. So, if we ask it what our last question was, it will not know.

In [15]:
chatbot_chain = (
    {"context": retriever | format_retrieved_docs, "question": RunnablePassthrough()}
    | prompt
    | chat_model
)

In [19]:
user_input = ""

while True:
    user_input = input("Typing: ")

    if user_input == "END":
        break
    
    HumanMessage(content=user_input, name="User").pretty_print()

    response = chatbot_chain.invoke(input=user_input)
    response.pretty_print()

Name: User

What is the Alzheimers disease?

Alzheimer's disease is a degenerative brain disease that worsens over time. It is characterized by changes in the brain that lead to symptoms such as memory loss and language problems, caused by damage to neurons involved in cognitive function. The disease begins years before symptoms appear, and its hallmark pathologies include the accumulation of beta-amyloid plaques and tau tangles in the brain.
Name: User

What did I asked you before?

I'm sorry, but I don't know what you asked me before.
Name: User

A partir de que idade é comum apresentar os sintomas do Alzheimer?

Os sintomas do Alzheimer tendem a se desenvolver antes dos 65 anos, às vezes tão jovem quanto 30 anos, enquanto a grande maioria dos indivíduos com Alzheimer tem Alzheimer de início tardio. É comum apresentar sintomas do Alzheimer antes dos 65 anos.
Name: User

Quem são os autores do artigo?

Os autores do artigo são C.R. Jack Jr., J.S. Andrews, T.G. Beach, entre outros.


#### With Memory

https://python.langchain.com/docs/how_to/message_history/

To let the chat model take the chat history into account, we need to modify the prompt so that it includes the previous messages.
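
Conceptually, adding memory means keeping one list of messages per session and placing that list between the system message and the new question every time the model is called. The sketch below shows this idea in plain Python. The `histories` store, the `chat` function and `fake_model` are hypothetical and much simpler than what LangChain does internally.

```python
from collections import defaultdict

# Hypothetical in-memory store: one message list per session id
histories = defaultdict(list)

SYSTEM_MESSAGE = ("system", "You are a helpful assistant.")


def fake_model(messages):
    """Stand-in chat model: reports how many messages it received."""
    return ("ai", f"I can see {len(messages)} messages in this prompt.")


def chat(session_id, question):
    history = histories[session_id]
    human_message = ("human", f"Question: {question}")
    # The history is placed between the system message and the new question
    messages = [SYSTEM_MESSAGE, *history, human_message]
    answer = fake_model(messages)
    # Save both sides of the exchange for the next call
    history.extend([human_message, answer])
    return answer


first = chat("1", "What is Alzheimer's disease?")
second = chat("1", "What did I ask you before?")
other = chat("2", "Hello?")

print(first[1])   # → I can see 2 messages in this prompt.
print(second[1])  # → I can see 4 messages in this prompt.
print(other[1])   # → I can see 2 messages in this prompt.
```

The second call in session `"1"` sees the first exchange, while session `"2"` starts from an empty history.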

In [34]:
#https://python.langchain.com/v0.1/docs/expression_language/how_to/message_history/

from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_core.prompts import MessagesPlaceholder
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_community.chat_message_histories import ChatMessageHistory

fake_messages_database = {}

def get_session_history(session_id: str) -> BaseChatMessageHistory:
    if session_id not in fake_messages_database:
        fake_messages_database[session_id] = ChatMessageHistory()
    
    return fake_messages_database[session_id]
        

In [35]:
# Creates the prompt with the instructions and the input and output formats
system_prompt=("system","You are a helpful assistant for question-answering tasks and an expert in Alzheimer's scientific research."
         " Use the following pieces of retrieved context to answer the question."
         " If you don't know the answer, just say that you don't know."
         " Use three sentences maximum and keep the answer concise."
         " Context: {context}\n\n"
         " Answer:")

human_prompt = ("human", "Question: {question}")

prompt = ChatPromptTemplate.from_messages(
    [
        system_prompt,
        MessagesPlaceholder(variable_name="history"),
        human_prompt
    ]
)

In [36]:
chatbot_chain = (
    {"context": retriever | format_retrieved_docs, "question": RunnablePassthrough()}
    | prompt
    | chat_model
)

chatbot_chain_with_memory = RunnableWithMessageHistory(
    runnable=chatbot_chain,
    get_session_history=get_session_history,
    input_messages_key="question",
    history_messages_key="history"
)

In [38]:
get_session_history(session_id="1")

InMemoryChatMessageHistory(messages=[])

In [40]:
config = {"configurable": {"session_id":"1"}}
chatbot_chain_with_memory.invoke(input={"question": "What is the Alzheimers disease?"}, config=config)

AttributeError: 'dict' object has no attribute 'replace'
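
A likely cause of this error: with the history wrapper, the chain no longer receives a plain string but a dictionary (the question plus the injected history). The first step still sends its whole input to the retriever, so the embedding model ends up receiving a dictionary where it expects text. The sketch below reproduces the mechanism with stand-ins. `fake_embed_query` and `run_branches` are hypothetical, and the assumption that the embedding step calls `str.replace` on the query is inferred from the error message.

```python
from operator import itemgetter


def fake_embed_query(text):
    # Assumption: the embedding step normalizes newlines before encoding
    return text.replace("\n", " ")


def run_branches(branches, value):
    """Mimic a dictionary step in a chain: every branch gets the same input."""
    return {key: branch(value) for key, branch in branches.items()}


# Assumed shape of the input handed to the chain by the history wrapper
chain_input = {"question": "What is Alzheimer's disease?", "history": []}

# Same structure as the notebook's chain: the whole input reaches the retriever
broken = {"context": fake_embed_query, "question": lambda value: value}
try:
    run_branches(broken, chain_input)
except AttributeError as error:
    error_message = str(error)
    print(error_message)

# Possible fix: pick the right key for each branch and forward the history.
# In the notebook's chain this would correspond to something like (untested here):
#   "context": itemgetter("question") | retriever | format_retrieved_docs,
#   "question": itemgetter("question"),
#   "history": itemgetter("history"),
fixed = {
    "context": lambda value: fake_embed_query(itemgetter("question")(value)),
    "question": itemgetter("question"),
    "history": itemgetter("history"),
}
result = run_branches(fixed, chain_input)
print(result)
```

Forwarding `history` matters as well: the prompt contains a `MessagesPlaceholder` named `history`, so that key has to survive the first step of the chain.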