# Small-to-big Retrieval-Augmented Generation  

<table align="left">
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fgemini%2Fuse-cases%2Fretrieval-augmented-generation%2Fsmall_to_big_rag%2Fsmall_to_big_rag.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo"><br> Run in Colab Enterprise
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/retrieval-augmented-generation/small_to_big_rag/small_to_big_rag.ipynb">
      <img width="32px" src="https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg" alt="Google Colaboratory logo"><br> Run in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/gemini/use-cases/retrieval-augmented-generation/small_to_big_rag/small_to_big_rag.ipynb">
      <img src="https://www.gstatic.com/images/branding/gcpiconscolors/vertexai/v1/32px.svg" alt="Vertex AI logo"><br> Open in Vertex AI Workbench
    </a>
  </td>
    <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/retrieval-augmented-generation/small_to_big_rag/small_to_big_rag.ipynb">
      <img width="32px" src="https://upload.wikimedia.org/wikipedia/commons/9/91/Octicons-mark-github.svg" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
</table>

Authors: [Keith Ballinger](https://github.com/keithballinger), [Megan O'Keefe](https://github.com/askmeegs) 

Small-to-big retrieval is a form of [modular recursive RAG](https://www.promptingguide.ai/research/rag#modular-rag), where you link smaller grounding data chunks to larger "parent" data chunks. When a small chunk is retrieved at runtime, the larger linked chunk can be retrieved if needed.  

The small-to-big strategy offers a few benefits over regular RAG: 
1. **Complex use cases**: Small-to-big RAG can handle complex queries where the relevant context is too large to fit into a single dense vector. Examples: legal documents, research papers. 
2. **Work around the limits of dense vectors**: A dense vector can only "squish" the meaning of a text so much. If you embed a chunk that's too long (e.g. an entire document), some of the meaning may be lost, resulting in less accurate retrieval. By keeping the embedded chunks small and retrieving the large documents later (with or without embeddings), you get the semantic-search benefits of dense vectors while still being able to retrieve the full context when needed.
3. **Cost**: You can set up small-to-big RAG to fetch the long documents only when needed (e.g. if the model is unable to respond with the small context). This can save on inference costs, because Gemini on Vertex AI is [priced per input character](https://cloud.google.com/vertex-ai/generative-ai/pricing). 

There are multiple ways to implement small-to-big RAG. The small chunks could represent short passages of a document, and the larger chunks could represent the entire surrounding context (e.g. the whole document); see LangChain's [ParentDocumentRetriever](https://python.langchain.com/docs/modules/data_connection/retrievers/parent_document_retriever) for more info. Another way is to have the small chunks represent summaries of the larger documents. That is the method we'll explore here. 
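To make the summary-based variant concrete, here is a minimal sketch of the link between a small chunk and its parent. The `SmallChunk` class, the `expand` helper, and the two toy files are hypothetical and exist only for illustration; the notebook itself stores this link as Chroma metadata rather than in a dataclass.

```python
from dataclasses import dataclass


@dataclass
class SmallChunk:
    """A short, embeddable summary that points at its larger parent."""

    summary: str
    parent_id: str  # key into the parent store (here, a file path)


# Hypothetical parent store: full documents keyed by ID.
parents = {
    "cart.py": "def add_item(cart, item):\n    cart.append(item)\n    return cart\n",
    "ads.py": "def get_ads(context):\n    return ['ad for ' + context]\n",
}

# Small chunks: one summary per parent document.
small_chunks = [
    SmallChunk("Python module that adds items to a shopping cart.", "cart.py"),
    SmallChunk("Python module that returns ads for a context.", "ads.py"),
]


def expand(chunk: SmallChunk) -> str:
    """Follow the link from a retrieved small chunk to its full parent."""
    return parents[chunk.parent_id]


retrieved = small_chunks[0]  # pretend the vector search returned this chunk
print(retrieved.summary)
print(expand(retrieved))
```

Only the summaries would be embedded and searched; the parent text is looked up by key when the summary alone is not enough.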


In this notebook, we'll walk through small-to-big RAG using a GitHub codebase called [Online Boutique](https://github.com/GoogleCloudPlatform/microservices-demo). Online Boutique is a multi-language, microservices-based sample application. We'll implement question answering to help a new contributor learn about and navigate this codebase.

![](architecture.png)

To complete this notebook, **you will need**: 
- A [Google Cloud account](https://console.cloud.google.com/)
- One [Google Cloud project](https://cloud.google.com/resource-manager/docs/creating-managing-projects) with [billing](https://cloud.google.com/billing/docs/how-to/modify-project) enabled 
- The [Vertex AI API](https://cloud.google.com/vertex-ai/docs/start/cloud-environment) enabled in your project
- The [gcloud SDK](https://cloud.google.com/sdk/docs/install) installed in your environment
- The **Vertex AI User** IAM role granted to your user

This notebook uses the following products and tools:
- [Vertex AI - Gemini API](https://cloud.google.com/vertex-ai/generative-ai/docs/start/quickstarts/quickstart-multimodal#gemini-beginner-samples-python_vertex_ai_sdk) 
- [Vertex AI - Text Embeddings API](https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-text-embeddings) 
- [Chroma](https://docs.trychroma.com/getting-started) (in-memory vector database)  


### Setup 

First, set variables. You are required to set your project ID. You can keep the other variables as-is.

In [None]:
EMBEDDING_MODEL = "textembedding-gecko@003"
GENERATIVE_MODEL = "gemini-1.0-pro"
PROJECT_ID = "YOUR-PROJECT-ID"
REGION = "us-central1"

Install the necessary packages, and import them. 

In [None]:
! pip install "google-cloud-aiplatform>=1.38"
! pip install pandas
! pip install chromadb

In [None]:
import os

import chromadb
import pandas as pd
import vertexai
from vertexai.generative_models import ChatSession, GenerativeModel
from vertexai.language_models import TextEmbeddingModel

Lastly, download the source code dataset from Cloud Storage. This is a modified version of the upstream Online Boutique repo, with certain files pruned for sample purposes.

In [None]:
! gsutil -m cp -r gs://github-repo/generative-ai/gemini/use-cases/rag/small-to-big-rag/onlineboutique-codefiles .

### Create helper functions

We'll create one function that calls the Vertex AI text embedding model (`textembedding-gecko`), and another that sends a prompt to Gemini Pro on Vertex AI.

In [None]:
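# Initialize the SDK before loading any model.
vertexai.init(project=PROJECT_ID, location=REGION)

# Use a dedicated name so this model is not overwritten when the
# generative model is later assigned to `model`.
embedding_model = TextEmbeddingModel.from_pretrained(EMBEDDING_MODEL)


def get_text_embedding(doc: str) -> list:
    """Return the embedding vector for a single piece of text."""
    embeddings = embedding_model.get_embeddings([doc])
    if len(embeddings) > 1:
        raise ValueError("More than one embedding returned.")
    if len(embeddings) == 0:
        raise ValueError("No embedding returned.")
    return embeddings[0].values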
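The helper above embeds one text per request. If you have many summaries, grouping them into batches can reduce the number of calls. The sketch below is one possible approach, not part of the original notebook: `batched`, `embed_all`, and `fake_embed_batch` are hypothetical names, and the default `batch_size=5` is an assumption that you should check against the per-request limit documented for your embedding model.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` entries."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start : start + batch_size])


def embed_all(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 5,  # assumption: verify the limit for your model
) -> List[List[float]]:
    """Embed every text, one batch per call, preserving input order."""
    vectors: List[List[float]] = []
    for batch in batched(texts, batch_size):
        result = embed_batch(batch)
        if len(result) != len(batch):
            raise ValueError("Embedding count does not match batch size.")
        vectors.extend(result)
    return vectors


def fake_embed_batch(batch: List[str]) -> List[List[float]]:
    """Stand-in for a real embedding call: one 1-d vector per text."""
    return [[float(len(text))] for text in batch]


vectors = embed_all(["a", "bb", "ccc"], fake_embed_batch, batch_size=2)
print(vectors)
```

With Vertex AI, `embed_batch` could be a small wrapper that calls `get_embeddings(batch)` on the embedding model and returns the `.values` of each result.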

In [None]:
vertexai.init(project=PROJECT_ID, location=REGION)
model = GenerativeModel(GENERATIVE_MODEL)
chat = model.start_chat()

In [None]:
def gemini_inference(chat: ChatSession, prompt: str) -> str:
    # ChatSession exposes send_message(); generate_content() belongs to GenerativeModel.
    text_response = chat.send_message(prompt)
    return text_response.text

In [None]:
gemini_inference(chat, "hello world!")

### Get summaries of code files 

First, we'll use Gemini on Vertex AI to get a short summary of each code file. We'll do this by recursively traversing the files in `onlineboutique-codefiles`. 

In [None]:
# for every file in onlineboutique-codefiles/, read it in, and get the full tree filename, and a code summary
summaries = {}
for root, dirs, files in os.walk("onlineboutique-codefiles/"):
    for file in files:
        temp = {}
        full_file_path = os.path.join(root, file)
        with open(full_file_path) as f:
            print("Processing file: ", full_file_path)
            try:
                content = f.read()
                temp["content"] = content
                prompt = """ 
                You are a helpful code summarizer. Here is a source code file. Please identify the programming language and summarize it in three sentences or less. Give as much detail as possible, including function names and libraries used. Code: 
                {}
                """.format(
                    content
                )
                summary = gemini_inference(chat, prompt)
                temp["summary"] = summary
                summaries[full_file_path] = temp
            except Exception as e:
                print(f"⚠️ Error processing file: {full_file_path} - {e}")
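The loop above logs a failure and moves on, so a file that hits a transient error ends up without a summary. One option is to retry with exponential backoff before giving up. This is a sketch under assumptions: `with_retries` is a hypothetical helper, it retries on any exception, and the attempt count and delays are illustrative values rather than recommendations.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn`, retrying on any exception with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))


# Demo with a function that fails twice and then succeeds.
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "summary"


delays = []  # record the delays instead of actually sleeping
result = with_retries(flaky, max_attempts=3, base_delay=0.5, sleep=delays.append)
print(result, delays)
```

In the summarization loop, the call could then become `with_retries(lambda: gemini_inference(chat, prompt))`, keeping the surrounding `try`/`except` as the last line of defense.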

Next, we'll create a Pandas DataFrame with the file paths, code content, and summaries.

In [None]:
df = pd.DataFrame.from_dict(summaries, orient="index")

In [None]:
df.head()

In [None]:
# number of file summaries
print("Number of rows: ", df.shape[0])

In [None]:
# the first column should be named "filename"
df = df.reset_index()
df = df.rename(columns={"index": "filename"})
df.head()

In [None]:
# write to csv
df.to_csv("code_summaries.csv", index=False)

### Convert summaries to embeddings

Next, we'll convert the text summaries of each code file to vector embeddings. We'll store those embeddings in an in-memory Chroma database. 

In [None]:
chroma_client = chromadb.Client()

In [None]:
collection = chroma_client.create_collection(name="code_summaries")

In [None]:
# iterate over dataframe. convert summary into embeddings. insert summary into collection.
for index, row in df.iterrows():
    fn = row["filename"]
    print("Getting embedding for: ", fn)
    summary = row["summary"]
    print(summary)
    e = get_text_embedding(summary)
    print(e)
    # add vector embedding to in-memory Chroma database.
    # the "small" summary embedding is linked to the "big" raw code file through the metadata key, "filename."
    collection.add(
        embeddings=[e], documents=[summary], metadatas=[{"filename": fn}], ids=[fn]
    )
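To build intuition for what the vector database does at query time, the sketch below runs a brute-force nearest-neighbor search in plain Python. It assumes squared L2 distance, which we understand to be Chroma's default; a real index uses an approximate algorithm rather than scanning every vector. The function names, file paths, and two-dimensional vectors are made up for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length.")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(
    query: Sequence[float],
    vectors: Dict[str, List[float]],
    n_results: int = 3,
) -> List[Tuple[str, float]]:
    """Return the `n_results` closest IDs with their distances, closest first."""
    scored = [(doc_id, squared_l2(query, vec)) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:n_results]


# Toy "summary embeddings" keyed by hypothetical file paths.
toy_vectors = {
    "adservice/AdService.java": [1.0, 0.0],
    "cartservice/CartStore.cs": [0.0, 1.0],
    "frontend/main.go": [0.7, 0.7],
}

hits = nearest([0.9, 0.1], toy_vectors, n_results=2)
for doc_id, distance in hits:
    print(doc_id, round(distance, 2))
```

The ID returned with each hit plays the same role as the `filename` metadata above: it is the pointer from the small summary back to the big code file.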

### Implement the Small-to-big RAG workflow 

In [None]:
# Get a list of all files to pass to Gemini, if it needs to see a specific code file.
all_files = []
for root, dirs, files in os.walk("onlineboutique-codefiles/"):
    for file in files:
        all_files.append(os.path.join(root, file))
print(all_files)

The `small_to_big` function below first tries to answer using only the small chunks (the code file summaries). If Gemini can answer with that context, we return its response and we're done. If Gemini needs more context, we ask it which file it would like to see, retrieve that code file directly from the DataFrame, and pass it to Gemini again as the "large" context.
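Stripped of prompts and API calls, the escalation logic described above reduces to three model calls at most. The sketch below captures just that control flow so it can be exercised offline. `route`, `scripted`, and the toy data are hypothetical; the `ask` parameter stands in for any function that sends a prompt to a model and returns text.

```python
from typing import Callable, Dict, List

NEED_MORE = "need more context"


def route(
    question: str,
    summaries: Dict[str, str],
    files: Dict[str, str],
    ask: Callable[[str], str],
) -> str:
    """Try the small context first; escalate to one full file if asked."""
    # SMALL: answer from summaries only.
    small_answer = ask(f"Question: {question}\nSummaries: {summaries}")
    if NEED_MORE not in small_answer.lower():
        return "small: " + small_answer
    # The model asked for more: let it pick a file by name.
    filename = ask(f"Which file answers: {question}\nFiles: {sorted(files)}").strip()
    if filename not in files:
        return f"error: unknown file {filename!r}"
    # BIG: answer again with the full file as context.
    big_answer = ask(f"Question: {question}\nFull file: {files[filename]}")
    return "big: " + big_answer


def scripted(replies: List[str]) -> Callable[[str], str]:
    """Build a fake model that returns the given replies in order."""
    remaining = iter(replies)
    return lambda prompt: next(remaining)


toy_files = {"main.tf": "kubectl wait ... --timeout=280s"}
toy_summaries = {"main.tf": "Terraform deployment of the application."}

small_path = route(
    "How is it deployed?", toy_summaries, toy_files, scripted(["With Terraform."])
)
big_path = route(
    "What is the timeout?",
    toy_summaries,
    toy_files,
    scripted(["Need more context", "main.tf", "280 seconds"]),
)
print(small_path)
print(big_path)
```

Passing the model call in as a parameter is a design choice made here for testability; the notebook's own function calls Gemini directly.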

In [None]:
def small_to_big(user_prompt):
    # SMALL: first, run RAG with the summary embeddings to try to get a response
    query_emb = get_text_embedding(user_prompt)
    result = collection.query(query_embeddings=[query_emb], n_results=3)
    # process nearest-neighbors
    processed_result = {}
    d = result["documents"][0]
    for i in range(0, len(d)):
        summary = d[i]
        filename = result["metadatas"][0][i]["filename"]
        processed_result[filename] = summary
    prompt_with_small = """
    You are a codebase helper. You will be given a user's question about the codebase, along with 
    summaries of relevant code files. Attempt to answer the question and only respond if you're confident in the answer. 
    If you need any more information, respond with ONLY the phrase "need more context". 

    The user query is: {} 

    The summaries are: {}
    """.format(
        user_prompt, str(processed_result)
    )
    print(prompt_with_small)
    small_result = gemini_inference(chat, prompt_with_small)
    # we're done if Gemini is confident with just the summaries as context...
    if "need more context" not in small_result.lower():
        return (
            "🐝 Completed at small, Gemini had enough context to respond. RESPONSE: \n"
            + small_result
        )
    print(
        "🤔 Gemini asked for more context. Let's ask what codefile Gemini wants to see."
    )
    # otherwise, move on to BIG:
    # IF we need the full context, get the filename that most closely matches the user's question
    prompt_to_get_filename = """ 
    You are a codebase helper. The list of code files that you know about: 
    {}

    The user asks the following question about the codebase: {}

    Please respond with the filename that most closely matches the user's question. Respond with ONLY the filename. 
    """.format(
        all_files, user_prompt
    )
    # strip surrounding whitespace so the reply can match the DataFrame exactly
    filename = gemini_inference(chat, prompt_to_get_filename).strip()
    print("📂 Gemini asked for this file: " + filename)
    # is the filename in the dataframe?
    if filename not in df["filename"].values:
        # attempt to try again, appending "onlineboutique-codefiles"
        filename = "onlineboutique-codefiles/" + filename
        if filename not in df["filename"].values:
            return f"⚠️ Error: filename {filename} not found in dataframe"

    # get the full code file
    full_code = df[df["filename"] == filename]["content"].values[0]
    prompt_with_big = """ 
    You are a codebase helper. You will be given a user's question about the codebase, along with a complete source code file. Respond to the user's question with as much detail as possible.

    The user query is: {}
    
    The full code file is: {}
    """.format(
        user_prompt, full_code
    )

    big_response = gemini_inference(chat, prompt_with_big)
    return "🦖 Completed at big. RESPONSE: \n" + big_response
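The filename lookup in the function above depends on the model replying with an exact path. Model replies often carry extra whitespace, backticks, or only part of the path, so a more forgiving match may help. The following is a sketch only: `resolve_filename` is a hypothetical helper, and the cleanup rules are assumptions about typical reply formats rather than documented model behavior.

```python
from typing import Iterable, Optional


def resolve_filename(raw: str, known_files: Iterable[str]) -> Optional[str]:
    """Map a model's reply onto one known file path, or None if unclear."""
    # Remove whitespace and any wrapping backticks or quotes.
    candidate = raw.strip().strip("`'\"").strip()
    if candidate.startswith("./"):
        candidate = candidate[2:]
    if not candidate:
        return None
    known = list(known_files)
    if candidate in known:
        return candidate
    # Fall back to a path-suffix match, accepted only when it is unique.
    matches = [path for path in known if path.endswith("/" + candidate)]
    return matches[0] if len(matches) == 1 else None


known = [
    "onlineboutique-codefiles/terraform/main.tf",
    "onlineboutique-codefiles/src/frontend/main.go",
]

print(resolve_filename(" `terraform/main.tf`\n", known))
print(resolve_filename("main.go", known))
print(resolve_filename("main.py", known))
```

Returning `None` for ambiguous or unknown names lets the caller decide whether to report an error or ask the model again.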

### Test it out 

You can test this function by calling `small_to_big("your codebase question")`. We include a few examples below. The more detailed your question (e.g. citing a specific function, line of code, or dependency), the more likely it is that Gemini will ask for more context and trigger the "large" step. 

In [None]:
# an example of a query where only the small (summary) step is needed
small_to_big("How does the ad service work?")

In [None]:
small_to_big(
    "Exactly how long is the kubectl wait condition in the Terraform deployment of online boutique? Return the right number of seconds"
)

In [None]:
# Solution terraform code in main.tf  - 280 seconds is correct
"""

# Wait condition for all Pods to be ready before finishing
resource "null_resource" "wait_conditions" {
  provisioner "local-exec" {
    interpreter = ["bash", "-exc"]
    command     = <<-EOT
    kubectl wait --for=condition=AVAILABLE apiservice/v1beta1.metrics.k8s.io --timeout=180s
    kubectl wait --for=condition=ready pods --all -n ${var.namespace} --timeout=280s
    EOT
  }

  depends_on = [
    resource.null_resource.apply_deployment
  ]
}
"""

In [None]:
small_to_big("What tracing frameworks are used across the codebase?")

In [None]:
small_to_big("Describe in detail exactly how the ListRecommendations function works.")