# Creating a Google Drive Chatbot on GCP

## Overview

This tutorial is based on Google's own Retrieval-Augmented Generation (RAG) API documentation, which you can find [here](https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/rag-api).

Vertex AI's **Retrieval-Augmented Generation (RAG) API** implements the RAG technique, a process that enhances the capabilities of large language models (LLMs) through the steps below.

1. **Ingest data** from different data sources such as local files, Cloud Storage, and Google Drive.
2. **Transform the data** by splitting it into chunks. This allows better retrieval of information related to your questions, lowers the risk of the model hallucinating, and decreases the number of tokens used.
3. **Embed the text**, meaning each word or piece of text is represented by numbers that capture its relationship to other text, that is, how similar they are.
4. **Index the data** by creating a knowledge base, or **corpus**, that stores the chunks and embeddings and assigns them IDs so they can be queried and referenced.
5. **Retrieve relevant information** from the knowledge base based on the question or query the user provides.
6. **Generate a response**: the retrieved information, which is only the relevant chunks of data, is given to the model so it can answer the user's question.
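The steps above can be illustrated with a toy, standard-library-only sketch. It is not how Vertex AI works internally: the bag-of-words "embedding", the helper names (`embed`, `cosine`, `build_index`, `retrieve`), and the example chunks are all assumptions made up for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (real systems use a learned model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(chunks):
    """Index step: store each chunk with an id and its embedding."""
    return [{"id": i, "text": c, "vector": embed(c)} for i, c in enumerate(chunks)]


def retrieve(index, query, top_k=2):
    """Retrieval step: return the top_k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda e: cosine(q, e["vector"]), reverse=True)
    return [e["text"] for e in ranked[:top_k]]


# Made-up chunks standing in for documents that were already ingested and split
chunks = [
    "A RAG corpus stores chunks and their embeddings.",
    "Cloud Storage buckets hold objects such as PDF files.",
    "Embeddings represent text as vectors of numbers.",
]
index = build_index(chunks)
question = "What does a RAG corpus store?"
context = retrieve(index, question, top_k=1)

# Generation step: the retrieved chunks are placed in the prompt sent to the LLM
prompt = "Answer using only this context:\n" + "\n".join(context) + "\nQuestion: " + question
print(prompt)
```

In the rest of this tutorial, the chunking, embedding, indexing, and retrieval are handled by the RAG API, so none of these helpers are needed.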






## Prerequisites

We assume you have access to **Vertex AI** and Google Drive and have enabled their APIs. If not, go to the console, open `APIs & Services > Enabled APIs & Services`, search for **'Google Drive API'**, and click 'Enable'. Do the same for Vertex AI.

In this tutorial we will use Google's **gemini-2.0-flash**, which doesn't need to be deployed. If you would like to use another model, you can choose one from the **Model Garden** in the console, which lets you add a model to your Model Registry, create an endpoint (or use an existing one), and deploy the model, all in one step.

The last step before we begin is to make sure the **Vertex AI RAG Data Service Agent** service account exists. Go to `IAM` in the console and check **Include Google-provided role grants**. If the service agent is not listed there, click **Grant access** and add Vertex AI RAG Data Service Agent as a role.

## Learning objectives

For this tutorial we are creating a chatbot that will answer questions by gathering information from documents we have provided via Google Drive. The model we will be using today is a pretrained 'Gemini' model from GCP.

This tutorial will go over the following topics:
- Introducing RAG
- Creating a Vertex AI RAG Corpus
- Connecting our model to the RAG corpus


## Get started

### Install packages

In [None]:
! pip install --upgrade google-api-python-client vertexai unstructured

In [None]:
from vertexai.preview import rag
from vertexai.preview.generative_models import GenerativeModel, Tool
import vertexai
import json
import os

### Enter variables

In [None]:
project_id = "<PROJECT_ID>"
display_name = "<RAG_CORPUS_NAME>"
location = "<REGION>"  # Use a region that supports creating a RAG Corpus (e.g. us-central1)

### Optional - Download articles

For this tutorial we will download scientific articles from the [NIH RECOVER program](https://recovercovid.org/publications), which our model will then use as references to answer our questions about COVID. If you have your own files on your Drive, feel free to use those.

In [None]:
#urls list of articles
articles_urls = ['https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10781091/pdf/pone.0285645.pdf', 
                 'https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10219649/pdf/elife-86014.pdf',
                 'https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10734909/pdf/pone.0285351.pdf',
                 'https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10684592/pdf/41598_2023_Article_47655.pdf', 
                 'https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10601201/pdf/12889_2023_Article_16916.pdf',
                 'https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10516599/pdf/elife-86043.pdf',
                 'https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10620090/pdf/41586_2023_Article_6651.pdf',
                 'https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10414557/pdf/pone.0289774.pdf',
                 'https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10355333/pdf/aids-37-1565.pdf',
                 'https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10289397/pdf/pone.0286297.pdf']


Now we'll use a for loop to run `subprocess.run` to download each article.

In [None]:
import subprocess

# Download each article with wget, sending "Chrome" as the user agent
for url in articles_urls:
    # Passing the arguments as a list avoids going through a shell
    subprocess.run(["wget", "--user-agent=Chrome", url])

Create a folder in your Drive and upload these docs into it. If you are using Jupyter Lab, you can download the docs by right-clicking them in the File Browser and selecting Download.

Then go to **IAM** in the console, check the box labeled **Include Google-provided role grants**, and copy the principal email address that has the **Vertex AI RAG Data Service Agent** role.

![gdrive1](../../images/gdrive1.png)

Then go to your Google Drive and share the folder you just created with that email address and add **Viewer permissions**.
![gdrive2](../../images/gdrive2.png)
![gdrive1.5](../../images/gdrive1.5.png)
![gdrive3](../../images/gdrive3.png)







### Setting up a RAG Corpus

Initialize Vertex AI API once per session.

In [None]:
vertexai.init(project=project_id, location=location)

Create your RAG Corpus by running the code below.

In [None]:
# Create RagCorpus
rag_corpus = rag.create_corpus(display_name=display_name)

You can view the name of your corpus, which we will use to import our files from Google Drive.

In [None]:
print(rag_corpus.name)

### Importing and uploading files to RAG Corpus

Enter the Google Drive folder URL you would like to import into the corpus, formatted as `https://drive.google.com/drive/folders/<FOLDER_ID>` (the form used in the code below). You can also add Cloud Storage paths (formatted as `gs://...`) to the same list so they are imported together with your Drive folder.

For local files, use `rag.upload_file` instead.
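If you keep Drive, Cloud Storage, and local paths in one list, a small helper can route each path to the right call. This is only a sketch: `split_paths` is a hypothetical helper, the paths are placeholders, and the commented calls assume the `rag.import_files` and `rag.upload_file` functions from the `vertexai.preview.rag` module used in this tutorial.

```python
def split_paths(paths):
    """Separate remote paths (Drive / Cloud Storage) from local file paths."""
    remote_prefixes = ("gs://", "https://drive.google.com/")
    remote = [p for p in paths if p.startswith(remote_prefixes)]
    local = [p for p in paths if not p.startswith(remote_prefixes)]
    return remote, local


# Placeholder paths for illustration only
remote, local = split_paths([
    "https://drive.google.com/drive/folders/<FOLDER_ID>",
    "gs://<BUCKET_NAME>/articles/",
    "./pone.0285645.pdf",
])
print(remote)
print(local)

# With a corpus created as in this tutorial, the two groups would then go to:
# rag.import_files(rag_corpus.name, remote, chunk_size=512, chunk_overlap=100)
# for path in local:
#     rag.upload_file(corpus_name=rag_corpus.name, path=path)
```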

You can also control the size of the chunks (`chunk_size`) and how much text adjacent chunks share (`chunk_overlap`). Overlap keeps text that straddles a chunk boundary available in both chunks, so less context is lost at retrieval time.

In [None]:
# Import Files to the RagCorpus
paths = ["https://drive.google.com/drive/folders/11RWaO4LulNsyeJi-zohBObH3LtUGoqhn?usp=sharing"]

response = rag.import_files(
    rag_corpus.name,
    paths,
    chunk_size=512,  
    chunk_overlap=100
)
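To build intuition for the `chunk_size` and `chunk_overlap` arguments above, here is a minimal sliding-window sketch. It is an assumption-laden illustration, not the service's actual splitter: `chunk_tokens` is a hypothetical helper and the "tokens" are simple placeholder strings.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window has reached the end of the input
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Ten placeholder tokens, windows of 4 with 1 token shared between neighbours
tokens = [f"t{i}" for i in range(10)]
for chunk in chunk_tokens(tokens, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

Each window starts `chunk_size - chunk_overlap` tokens after the previous one, so a larger overlap produces more chunks for the same document.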

Let's check which files have been imported into our corpus!

In [None]:
#list files in corpus
files = rag.list_files(corpus_name=rag_corpus.name)
for file in files:
    print(file)

We can now send a query to the corpus to see which relevant chunks it returns. Notice the **similarity_top_k** argument, which determines how many chunks are retrieved from the corpus. For this tutorial we ask for the 10 most relevant chunks.

In [None]:
query = "What is Long Covid?"

# Direct context retrieval
response = rag.retrieval_query(
    rag_corpora=[rag_corpus.name],
    text=query,
    similarity_top_k=10,
)
print(response)
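The raw response printed above can be long. A possible way to condense it is sketched below, assuming the response exposes its chunks as `response.contexts.contexts` and that each chunk has `source_uri` and `text` fields; these names may differ between SDK versions, so check them against your own printed output. `summarize_contexts` is a hypothetical helper, and the stand-in object exists only so the sketch runs without a corpus.

```python
from types import SimpleNamespace


def summarize_contexts(response, max_chars=80):
    """Return (source_uri, snippet) pairs for each retrieved chunk."""
    rows = []
    for ctx in response.contexts.contexts:
        # Collapse newlines and repeated spaces so each snippet fits on one line
        text = " ".join(ctx.text.split())
        rows.append((getattr(ctx, "source_uri", ""), text[:max_chars]))
    return rows


# Stand-in object with placeholder values, used instead of a real retrieval response
fake_response = SimpleNamespace(contexts=SimpleNamespace(contexts=[
    SimpleNamespace(source_uri="https://drive.google.com/file/d/<FILE_ID>",
                    text="First   retrieved\nchunk."),
    SimpleNamespace(source_uri="https://drive.google.com/file/d/<FILE_ID>",
                    text="Second retrieved chunk."),
]))

for uri, snippet in summarize_contexts(fake_response):
    print(uri, "|", snippet)
```

With a real corpus you would pass the `response` returned by `rag.retrieval_query` instead of `fake_response`.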

### Generating a response

Now we can create a RAG retrieval tool that connects our model to one corpus, from which it retrieves relevant data. Notice we are using **gemini-2.0-flash**.

In [None]:
# Enhance generation
# Create a RAG retrieval tool
rag_retrieval_tool = Tool.from_retrieval(
    retrieval=rag.Retrieval(
        source=rag.VertexRagStore(
            rag_corpora=[rag_corpus.name],  # Currently only 1 corpus is allowed.
            similarity_top_k=3,
        ),
    )
)
# Create a Gemini model instance
rag_model = GenerativeModel(
    model_name="gemini-2.0-flash", tools=[rag_retrieval_tool]
)


Finally, we can submit a question to our model to generate a response based on the information it receives from our corpus!

In [None]:
question = "What is Long Covid?"
# Generate response
response = rag_model.generate_content(question)
print(response.text)

### Put it all together!

Now let's put it all together as a set of functions so you can more easily add this process to other scripts!

In [None]:
vertexai.init(project=project_id, location=location)  # only needs to run once per session

def create_rag(display_name):
    """Create a RAG corpus and return its resource name."""
    rag_corpus = rag.create_corpus(display_name=display_name)
    return rag_corpus.name

def file_upload(paths, rag_corpus_name):
    """Import Google Drive or Cloud Storage paths into the corpus."""
    # Accept either a single path string or a list of paths
    if isinstance(paths, str):
        paths = [paths]
    return rag.import_files(
        rag_corpus_name,
        paths,
        chunk_size=512,
        chunk_overlap=100,
    )

def generate_response(question, rag_corpus_name):
    """Answer a question using chunks retrieved from the corpus."""
    rag_retrieval_tool = Tool.from_retrieval(
        retrieval=rag.Retrieval(
            source=rag.VertexRagStore(
                rag_corpora=[rag_corpus_name],  # Currently only 1 corpus is allowed.
                similarity_top_k=3,
            ),
        )
    )
    # Create a Gemini model instance
    rag_model = GenerativeModel(
        model_name="gemini-2.0-flash", tools=[rag_retrieval_tool]
    )
    response = rag_model.generate_content(question)

    return response.text
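As one possible way to use these functions as a chatbot, the sketch below loops over a list of questions and collects the answers. `run_chat` and `stub_answer` are hypothetical names introduced here; the stub stands in for `generate_response` so the example runs without any cloud resources.

```python
def run_chat(answer_fn, questions):
    """Ask each non-empty question and return a list of (question, answer) pairs."""
    transcript = []
    for question in questions:
        question = question.strip()
        if not question:
            continue  # skip blank input
        transcript.append((question, answer_fn(question)))
    return transcript


def stub_answer(question):
    """Stand-in for generate_response so this sketch runs offline."""
    return f"(stub answer to: {question})"


for q, a in run_chat(stub_answer, ["What is Long Covid?", "  "]):
    print("Q:", q)
    print("A:", a)

# With the functions defined in this section, the real flow would look like:
# corpus_name = create_rag(display_name)
# file_upload("https://drive.google.com/drive/folders/<FOLDER_ID>", corpus_name)
# run_chat(lambda q: generate_response(q, corpus_name), ["What is Long Covid?"])
```

Passing the answering function as an argument keeps the loop independent of the model, so the same loop can be reused if you switch corpora or models.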

## Conclusion

You have just created a Google Drive Chatbot by creating and utilizing a RAG corpus from Vertex AI to retrieve relevant information.

## Clean Up

**Warning:** Don't forget to delete the resources we just made, such as buckets, the RAG Corpus, and if needed your notebook, to avoid accruing additional costs.

In [None]:
#Delete RAG Corpus
rag.delete_corpus(name=rag_corpus.name)
print(f"Corpus {rag_corpus.name} deleted.")

In [None]:
#Delete bucket (if applicable); set the variable `bucket` to your bucket name first
!gcloud storage rm --recursive gs://{bucket}/

If you have imported and deployed a model, don't forget to delete the model from the Model Registry and delete the endpoint.