# Intro to Retrieval Augmented Generation Systems, LangChain & ChromaDB

This notebook walks through building a question-answering system that retrieves information to formulate its responses, effectively grounding the LLM in specific information. A pre-trained LLM, or even a fine-tuned one, will not be sufficient on its own when you want a system that understands specific, possibly private data, or information that was not in its training dataset.

In this lab you will:
* Learn about the different components of a retrieval augmented system
* Build a simple retrieval augmented generation system 
* Use LangChain and ChromaDB to simplify and scale the process

### Setup

In [None]:
import pandas as pd
import scipy.spatial
from google import genai
from langchain.chains import ConversationalRetrievalChain, RetrievalQA
from langchain.document_loaders import WikipediaLoader
from langchain.memory import ConversationBufferMemory
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_google_vertexai import VertexAI, VertexAIEmbeddings

In [None]:
EMBEDDING_MODEL = "text-embedding-004"
GENERATIVE_MODEL = "gemini-2.0-flash-001"

### Build a simple retrieval augmented generation system

In this toy example, we want to ground an LLM on information that an off-the-shelf LLM would not know. For example, consider instructions left for a house sitter who will be watching two pets.

In [None]:
# List of things we want to ground the LLM on.
information = [
    "Estrella is a dog",
    "Finnegan is a cat",
    "Finnegan gets fed five times daily. Estrella gets fed three times daily.",
    "Estrella usually goes on one long walk per day, but needs to go outside every 4-6 hours",
    "Please play with Finnegan for 30 minutes each day. His favorite toy is the fake mouse!",
]

information_df = pd.DataFrame({"text": information})
information_df.head()

At the core of most retrieval augmented generation systems is a vector database. A vector database stores embedded representations of information.

Let's add a column to our information dataframe that is an embedded representation of the text. We will use the [Vertex AI text-embeddings API](https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings).

In [None]:
client = genai.Client(vertexai=True, location="us-central1")

In [None]:
information_df["vector"] = [
    x.values
    for x in client.models.embed_content(
        model=EMBEDDING_MODEL, contents=information
    ).embeddings
]
information_df.head()

Retrieval systems need a way of finding the most relevant information to answer a given query. This is done with a nearest neighbor (semantic similarity) search. Let's define a function that takes a query (text) as input and returns a distance for each text in our information. The function will need to:
* Embed the query with the same embedding model used for the information
* Compute a distance between the query vector and each information vector. We will use cosine distance (one minus cosine similarity), one of many measures that can be used.
* Return a list of distances between the query vector and each information vector
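To make the distance concrete, here is a minimal sketch of cosine distance on made-up 2-D vectors. It assumes the standard definition, one minus cosine similarity, which is the quantity `scipy.spatial.distance.cosine` reports. The helper name `cosine_distance` and the toy vectors are hypothetical stand-ins and are not part of the lab code.

```python
import math


def cosine_distance(u: list[float], v: list[float]) -> float:
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical 2-D vectors standing in for real embeddings.
query_vec = [1.0, 0.0]
same_direction = [2.0, 0.0]  # points the same way as the query
orthogonal = [0.0, 3.0]  # shares no direction with the query

print(cosine_distance(query_vec, same_direction))  # 0.0
print(cosine_distance(query_vec, orthogonal))  # 1.0
```

A smaller value means the two vectors point in more similar directions, regardless of their lengths.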

### Exercise
Complete the function below so that it iterates through `information_df` and computes the cosine distance between the query embedding and each information embedding.

In [None]:
def embed_and_compute_distances(query: str):
    # Get vector for query string
    query_embedding = (
        client.models.embed_content(model=EMBEDDING_MODEL, contents=query)
        .embeddings[0]
        .values
    )  # Query embedding

    distances = []

    # TODO: Compute distances between query vector and all information vectors
    # Each element of the returned list should be a dictionary with keys
    # `information` and `distance`. To compute distance use
    # scipy.spatial.distance.cosine(v1, v2)

    return distances

Test this function out on an example.

In [None]:
embed_and_compute_distances(query="What type of animal is Estrella?")

Notice that the vector with the lowest cosine distance (meaning most similar) to the vector for "What type of animal is Estrella?" is the vector for "Estrella is a dog". This highlights the core assumption that underpins retrieval augmented systems: information relevant to answering a question will be close in vector space to the question itself.

Now all we have to do is write a function that incorporates the text corresponding to the closest information vectors in a prompt, then send that prompt to an LLM to answer the question with the information.

Start by writing a helper function to put together this prompt. `context` will be the relevant information strings (found via nearest neighbor search) and `query` will be the query string.

### Exercise
Complete the function below so that it implements a templated prompt, putting together a query string and the relevant context strings.

In [None]:
def get_prompt(query: str, context: list[str]):
    # TODO: Write the prompt template
    prompt = None
    return prompt

Now put everything together in a function that 
* Embeds the query
* Computes the distance between query vector and all information vectors 
* Gets the k most relevant information texts by sorting by distance 
* Uses the k most relevant information texts in a prompt to an LLM along with the query 
* Returns the LLM response and the information used (citations) 
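The ranking and prompt-assembly steps above can be sketched on toy data without calling any model. This is an illustration under stated assumptions, not the exercise solution: the 2-D vectors are made up rather than real embeddings, the names `toy_store`, `top_k_texts`, and `build_prompt` are hypothetical, the prompt wording is only one possible choice, and the LLM call is left out entirely.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical store: each text is paired with a made-up 2-D vector.
toy_store = [
    {"text": "The spare key is under the mat.", "vector": [1.0, 0.1]},
    {"text": "Trash pickup is on Tuesday.", "vector": [0.1, 1.0]},
    {"text": "The key also opens the garage.", "vector": [0.9, 0.3]},
]


def top_k_texts(query_vector, store, k=2):
    """Return the k texts whose vectors are closest to the query vector."""
    ranked = sorted(
        store, key=lambda row: cosine_distance(query_vector, row["vector"])
    )
    return [row["text"] for row in ranked[:k]]


def build_prompt(query, context):
    """Combine the retrieved texts and the query into a single prompt string."""
    context_block = "\n".join(f"- {text}" for text in context)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you do not know.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {query}"
    )


# A made-up query vector that points toward the "key" texts.
closest = top_k_texts([1.0, 0.0], toy_store, k=2)
print(build_prompt("Where is the spare key?", closest))
```

In the real chain, the query vector would come from the embedding model and the prompt would be sent to the LLM; only the sorting and templating logic is shown here.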

### Exercise
Complete the function below so that it implements the end to end retrieval chain.  

In [None]:
def retrieval_chain(query: str, k: int = 2):
    # TODO: Compute distances for query and all information vectors
    distances = None

    # TODO: Sort the information from smallest distance to greatest distance
    sorted_distances = None

    # TODO: Get the text corresponding to the k closest vectors
    closest_information_texts = None

    # TODO: Incorporate the closest k information texts in a prompt to an LLM
    prompt = None

    # TODO: Send prompt through LLM
    response = None
    print(f"Response: {response.text}")
    print(f"Information used: {closest_information_texts}")

In [None]:
retrieval_chain("What type of animal is Estrella?")

In [None]:
retrieval_chain("How many times a day do I need to feed Finnegan?")

In [None]:
retrieval_chain("What stock should I invest in this month?")

Notice that the prompt is constructed such that if a question is asked that cannot be answered from the information provided, the LLM will not try to answer it.

It is also worth noting that we are arbitrarily setting k=2 (including the closest 2 information texts in the prompt). Different use cases call for different values of k, and there is no one-size-fits-all choice.

### Simplify and Scale with LangChain and Chroma
With only five pieces of grounding information, we could easily include all of them in the prompt, so the extra retrieval step to identify *what* is needed in the prompt was unnecessary. In the real world, however, we may have thousands or millions of pieces of grounding information. As that number grows, computing a distance for every single vector becomes very inefficient. In other words, production retrieval augmented generation systems require:
* Scalable vector databases to store large amounts of information
* Efficient ways of performing nearest neighbor searches 

There are many options for a vector store, including managed and scalable offerings like [Vertex AI Vector Search](https://cloud.google.com/vertex-ai/docs/vector-search/overview). For simplicity, in this lab we will use [Chroma](https://python.langchain.com/docs/integrations/vectorstores/chroma) as the vector store and [LangChain](https://github.com/langchain-ai/langchain) to orchestrate the retrieval system. LangChain provides classes and methods that simplify the steps we had to implement ourselves in the toy example above.

#### Document Loading

Langchain provides classes to load data from different sources. Some useful data loaders are [Google Cloud Storage Directory Loader](https://python.langchain.com/docs/integrations/document_loaders/google_cloud_storage_directory/), [Google Drive Loader](https://python.langchain.com/docs/integrations/document_loaders/google_drive), [Recursive URL Loader](https://python.langchain.com/docs/integrations/document_loaders/recursive_url/), [PDF Loader](https://python.langchain.com/docs/integrations/document_loaders/#pdfs), [JSON Loader](https://python.langchain.com/docs/integrations/document_loaders/json/), [Wikipedia Loader](https://python.langchain.com/docs/integrations/document_loaders/wikipedia/), and [more](https://python.langchain.com/docs/integrations/document_loaders/).

In this notebook we will use the Wikipedia loader to create a private knowledge base of Wikipedia articles about large language models, but the overall process is similar regardless of which document loader you use.

In [None]:
docs = WikipediaLoader(query="Large Language Models", load_max_docs=10).load()

# Take a look at a single document
docs[0]

#### Split text into chunks 
Now that we have the documents, we will split them into chunks. Each chunk will become one vector in the vector store. To do this we will define a chunk size (number of characters) and a chunk overlap (how many characters consecutive chunks share, i.e. a sliding window). The ideal chunk size can be difficult to determine. If chunks are too large, each one contains too much information and is not specific enough. If chunks are too small, each one contains too little information. In both cases, a nearest neighbors lookup with a query/question embedding may struggle to retrieve the relevant chunks, and chunks that are too large may not fit as context in an LLM query at all.

In this notebook we will use a chunk size of 800 characters and a chunk overlap of 400 characters, but feel free to experiment with other sizes! Note: you can specify a custom `length_function` with `RecursiveCharacterTextSplitter` if you want chunk size/overlap to be determined by something other than Python's `len` function. In addition to `RecursiveCharacterTextSplitter`, there are [other text splitters](https://python.langchain.com/docs/how_to/#text-splitters) you can consider.
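The interaction between chunk size and chunk overlap can be sketched with a plain sliding window over characters. This is a simplification and not how `RecursiveCharacterTextSplitter` works internally, since that class also tries to split on separators such as paragraph breaks and spaces. The function name `sliding_window_chunks` and the sample string are hypothetical.

```python
def sliding_window_chunks(
    text: str, chunk_size: int, chunk_overlap: int
) -> list[str]:
    """Split text into fixed-size character windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError(
            "chunk_overlap must be non-negative and smaller than chunk_size"
        )

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghij"
print(sliding_window_chunks(sample, chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

With an overlap of half the chunk size, as in the 800/400 setting used in this notebook, most characters appear in two chunks, so the number of vectors to store roughly doubles compared to no overlap.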

### Exercise
In the cell below, instantiate `RecursiveCharacterTextSplitter` to split and chunk the documents.

In [None]:
text_splitter = None  # TODO: Instantiate recursive text splitter

chunks = text_splitter.split_documents(docs)

# Look at the first two chunks
chunks[0:2]

In [None]:
print(f"Number of documents: {len(docs)}")
print(f"Number of chunks: {len(chunks)}")

#### Embed Document Chunks 
Now we need to embed the document chunks and store them in a vector store. We can use any text embedding model for this, but we must use the same text embedding model when we embed our queries/questions at prediction time. To make things simple we will use the Gemini API for Embeddings. The LangChain library provides a wrapper class around the Gemini Embeddings API, `VertexAIEmbeddings()`.

Since Vertex AI Vector Search takes a while (~45 minutes) to create an index, we will use Chroma instead to keep things simple. Of course, in a real-world use case with a large private knowledge base, you may not be able to fit everything in memory. LangChain has a wrapper class for Chroma that lets us pass in a list of documents and an embedding class to create the vector store.

In [None]:
embedding = VertexAIEmbeddings(model_name=EMBEDDING_MODEL)

# set persist directory so the vector store is saved to disk
db = Chroma.from_documents(chunks, embedding, persist_directory="./vectorstore")

#### Putting it all together 

Now that everything is in place, we can tie it all together with a LangChain chain. A LangChain chain simply orchestrates the multiple steps required to use an LLM for a specific use case. In this case, the chain first embeds the query/question, then performs a nearest neighbors lookup to find the relevant chunks, then uses the relevant chunks to formulate a response with an LLM. We will use the Chroma database as our vector store and Gemini as our LLM. LangChain provides a wrapper around Gemini, `VertexAI()`.

For this simple Q/A use case we can use LangChain's `RetrievalQA` to link together the process.

### Exercise
In the cell below, implement the end to end retrieval chain using LangChain's `RetrievalQA` class.

In [None]:
# vector store
retriever = db.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 10},  # number of nearest neighbors to retrieve
)

# You can also set temperature, top_p, top_k
llm = VertexAI(model_name=GENERATIVE_MODEL, max_output_tokens=1024)

# q/a chain
qa = None  # TODO: Instantiate retrieval chain

Now that everything is tied together we can send queries and get answers!

In [None]:
def ask_question(question: str):
    response = qa.invoke({"query": question})
    print(f"Response: {response['result']}\n")

    citations = {doc.metadata["source"] for doc in response["source_documents"]}
    print(f"Citations: {citations}\n")

    # uncomment below to print source chunks used
    # print(f"Source Chunks Used: {response['source_documents']}")

In [None]:
ask_question("What technology underpins large language models?")

In [None]:
ask_question("When was the transformer introduced?")

Congrats! You have now built a toy retrieval augmented generation system from scratch and applied what you learned to build a more realistic system using a vector database and orchestration with LangChain.