# Creating a PubMed Chatbot on GCP

## Overview

For this tutorial we create a PubMed chatbot that will answer questions by gathering information from documents we have provided via an index. The model we will be using today is the pretrained 'gemini-2.0-flash' model from GCP.

## Learning Objectives
- Introduce langchain
- Explain the differences between zero-shot, one-shot, and few-shot prompting
- Practice using different document retrievers

## Prerequisites
You need access to Vertex AI. If you want to deploy your own model, follow the optional instructions below.

## Get Started

### Optional: Deploy the Model

In this tutorial we will be using Google's Gemini model **gemini-2.0-flash**, which doesn't need to be deployed. If you would like to use another model, you can choose one from the **Model Garden** using the console, which lets you add a model to your model registry, create an endpoint (or use an existing one), and deploy the model all in one step.

### PubMed API vs RAG with Vertex AI Vector Search

Our chatbot will rely on documents to answer our questions. To do so, we supply it with a **vector index**. A vector index (or simply index) is a data structure that enables fast and accurate search and retrieval of vector embeddings from a large dataset of objects. We will be working with two options for our index: the PubMed API and RAG with Vertex AI Vector Search.

**What is the difference?**

The **PubMed API** is made available through LangChain at no cost and connects your model to more than **35 million citations** for biomedical literature from MEDLINE, life science journals, and online books. The LangChain package for PubMed is already a retriever, meaning that simply using this tool lets our chatbot retrieve documents to refer to.

**Vertex AI Vector Search** (formerly known as Matching Engine) is a vector store from GCP that gives you more **security and control** over which documents you supply to your model. It stores the **embeddings** of your documents along with their metadata. Because a vector store is not a retriever by itself, we have to turn it into one so that our model can return an answer that also tells us which documents it is referencing; this is where RAG comes in. **RAG** stands for **retrieval-augmented generation**. It is a technique that first **indexes documents** by loading them, splitting them into chunks (making it easier for our model to search for relevant splits), embedding the splits, and storing them in a vector store. The next steps in RAG depend on the question you ask your chatbot. If we ask "What is a cell?", a retriever searches the vector store for splits relevant to the question, thus **retrieving relevant documents**. Finally, our chatbot **generates an answer** explaining what a cell is and points out which source documents it used to create the answer.
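The flow described above can be sketched end to end with a toy example. This is only an illustration under simplifying assumptions: word counts stand in for a real embedding model, a Python list stands in for the vector store, and the names `embed`, `cosine`, `retrieve`, and `store` are hypothetical, not part of LangChain or Vertex AI.

```python
import math
from collections import Counter

def embed(text):
    """Toy 'embedding': a bag-of-words count vector (an assumption, not a real model)."""
    return Counter(text.lower().replace("?", "").replace(".", "").split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# 1. Index: load documents, split into chunks, embed, and store.
chunks = [
    {"text": "A cell is the basic unit of life.", "source": "doc_a.txt"},
    {"text": "Glioblastoma is an aggressive brain tumor.", "source": "doc_b.txt"},
    {"text": "Mitochondria produce energy inside the cell.", "source": "doc_c.txt"},
]
store = [(embed(c["text"]), c) for c in chunks]

def retrieve(question, k=2):
    """2. Retrieve: rank stored chunks by similarity to the question."""
    q = embed(question)
    ranked = sorted(store, key=lambda item: cosine(q, item[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]

# 3. Generate: the retrieved chunks become the context handed to the model.
question = "What is a cell?"
hits = retrieve(question)
context = "\n".join(f"[{h['source']}] {h['text']}" for h in hits)
print(f"Context:\n{context}\n\nQuestion: {question}")
```

In the real pipeline the embedding model and Vector Search replace `embed` and `store`, and the printed context is what gets inserted into the prompt.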

We will be exploring both methods!

In [None]:
! pip install langchain langchain-google-vertexai langchain-community unstructured

### Setting up Vertex AI Vector Search

If you choose to use the RAG method with Vertex AI Vector Search to supply documents to your model, follow the instructions below:

Set your project id, location, and bucket variables.

In [None]:
project_id='PROJECT_ID'
location='REGION'
bucket = 'UNIQUE_BUCKET_NAME'

### Gathering our Docs For our Vector Store

AWS Marketplace hosts a PubMed database named **PubMed Central® (PMC)**, a free full-text archive of biomedical and life sciences journal articles at the U.S. National Institutes of Health's National Library of Medicine (NIH/NLM). We will be subsetting this database to add documents to our Vertex AI Vector Search index. Ensure that you have the correct permissions to allow your environment to connect to buckets and Vertex AI.

The first step will be to create a bucket that we will later use as our data source for our index.

In [None]:
#make bucket
!gsutil mb -l {location} gs://{bucket}

We will then download the metadata file from the PMC index directory. This file lists all of the articles within the PMC bucket and their paths, and we will use it to subset the database into our own bucket. Here we use curl to connect to the public AWS S3 bucket where the metadata and documents are originally stored.

In [None]:
#download the metadata file
!curl -O http://pmc-oa-opendata.s3.amazonaws.com/oa_comm/txt/metadata/csv/oa_comm.filelist.csv

We only want the metadata of the first 50 files.

In [None]:
#import the file as a dataframe
import pandas as pd

df = pd.read_csv('oa_comm.filelist.csv')
#first 50 files
first_50=df[0:50]

Let's look at our metadata! We can see that the bucket paths to the files are under the **Key** column; this is what we will use to loop through the PMC bucket and copy the first 50 files to our bucket.

In [None]:
first_50.head()

The following commands gather the location of each document within the AWS S3 bucket, download the text of each doc, and save it to our bucket as a text file in a directory named "docs". This is done with the requests library and the Cloud Storage Python client.

In [None]:
from google.cloud import storage
import os
import requests

def upload_blob_from_memory(bucket_name, contents, destination_blob_name):
    """Uploads a string to the bucket as a text object."""
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(destination_blob_name)

    blob.upload_from_string(contents)

    print(f"{destination_blob_name} uploaded to {bucket_name}.")

for key in first_50['Key']:
    doc_name = key.split('/')[-1]  # file name is the last part of the path
    response = requests.get(f'https://pmc-oa-opendata.s3.amazonaws.com/{key}')
    response.raise_for_status()  # stop early if a download fails
    upload_blob_from_memory(bucket, response.text, f'docs/{doc_name}')
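To make the mapping in the loop above concrete, this sketch shows how a `Key` value turns into a download URL and a destination blob name. The key below is a made-up placeholder in the same general shape as a bucket path, not a real article, and `plan_copy` is a hypothetical helper.

```python
def plan_copy(key, base_url="https://pmc-oa-opendata.s3.amazonaws.com"):
    """Return (download URL, destination blob name) for one metadata Key."""
    doc_name = key.split("/")[-1]  # file name is the last path segment
    return f"{base_url}/{key}", f"docs/{doc_name}"

# Hypothetical key, for illustration only
url, blob_name = plan_copy("oa_comm/txt/all/PMC0000000.txt")
print(url)
print(blob_name)
```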

### Creating an Index

To create our vector store index, we start by creating a dummy embeddings file. An index holds a set of records, so our dummy data will be the first record, and later we will add our PubMed docs to the same index. In order for Vector Search to find our dummy embeddings file it must also be in our bucket, so we will add it to the subdirectory 'init_index'.

In [None]:
import uuid
import numpy as np
import json
init_embedding = {"id": str(uuid.uuid4()), "embedding": list(np.zeros(768))}

# dump embedding to a local file
with open("embeddings_0.json", "w") as f:
    json.dump(init_embedding, f)


In [None]:
#move initial embeddings file to bucket
!gsutil cp embeddings_0.json gs://{bucket}/init_index/
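A quick sanity check on the record format can save a failed index build. The sketch below confirms that a record has an `id` string and an `embedding` list whose length matches the index dimensions, assuming the same 768 dimensions used in this tutorial. `validate_record` is a hypothetical helper, and the file is written to a temporary directory so nothing in your working directory is touched.

```python
import json
import os
import tempfile
import uuid

def validate_record(record, dimensions=768):
    """Return True if the record has an 'id' string and an embedding of the expected length."""
    return (
        isinstance(record.get("id"), str)
        and isinstance(record.get("embedding"), list)
        and len(record["embedding"]) == dimensions
    )

record = {"id": str(uuid.uuid4()), "embedding": [0.0] * 768}

# Round-trip through a temporary JSON file, mirroring embeddings_0.json
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "embeddings_0.json")
    with open(path, "w") as f:
        json.dump(record, f)
    with open(path) as f:
        loaded = json.load(f)

print(validate_record(loaded))
```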

Now we can make our index; this can take 30 minutes to 1 hour.

Please note that the dimensions depend on which text embedding model you are using. For this tutorial we are using **Vertex AI's embedding model**, which produces 768-dimensional vectors. If you choose to change your embedding model, set the dimensions to match its output size.

In [None]:
from google.cloud import aiplatform
# create Index
index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name="pubmed_vector_index",
    contents_delta_uri=f"gs://{bucket}/init_index",
    dimensions=768,
    approximate_neighbors_count=150,
    distance_measure_type="DOT_PRODUCT_DISTANCE",
    location=location,
)

#save index id
index_id=index.name
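To see why the `dimensions` value has to match the embedding model, here is a small sketch of a dot-product comparison, the similarity measure behind `DOT_PRODUCT_DISTANCE`. It is a plain-Python illustration, not how Vector Search is implemented, and the tiny 3-dimensional vectors are made up.

```python
def dot_product(a, b):
    """Dot product of two equal-length vectors; larger means more similar."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    return sum(x * y for x, y in zip(a, b))

query = [1.0, 0.0, 1.0]
records = {"doc_a": [0.9, 0.1, 0.8], "doc_b": [0.0, 1.0, 0.0]}

# Rank the stored vectors by similarity to the query
ranked = sorted(records, key=lambda name: dot_product(query, records[name]), reverse=True)
print(ranked)

# A vector of the wrong size cannot be compared at all
try:
    dot_product(query, [1.0, 2.0])
except ValueError as err:
    print(err)
```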

### Creating an Endpoint and Deploying our Index

We will create a public endpoint for our vector store. You can also create a private one by setting up a VPC and passing the VPC ID to the 'network' parameter.

In [None]:
#Create the endpoint
index_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
    display_name = "pubmed_vector_endpoint",
    public_endpoint_enabled = True,
    location = location
)

In [None]:
#save endpoint id
endpoint_id = index_endpoint.name

Here we deploy our index to our endpoint, which can take up to an hour. It's also okay if this cell stops or gets interrupted, because the deployment itself runs on the GCP side and its progress can be checked in the console.

In [None]:
#deploy our index to our endpoint
deployed_index_id="deployed_pubmed_vector_index"
index_endpoint = index_endpoint.deploy_index(
    index=index, deployed_index_id=deployed_index_id
)
index_endpoint.deployed_indexes

### Adding Metadata to Our Data

Once our documents are stored in our bucket we can load them back in. This step may seem redundant, but it is necessary because we need to embed our docs for our vector store, and it lets us attach metadata to each document. The first step in adding our metadata to the docs is to remove the 'Key' column, because this is no longer the location of our documents. Next, we'll convert the rest of the columns into dictionary form.

In [None]:
import pandas as pd
#Drop the Key column (it is replaced by the bucket location later)
#and convert the remaining metadata to a list of dicts
first_50_dict = first_50.drop(columns=['Key']).to_dict('records')

Let's look at our metadata now!

In [None]:
first_50_dict[0]

Now we can load in our documents, add in the location of our docs in our bucket and the document name to our metadata, and finally attach that metadata to our documents. At the end we should have 50 documents before splitting the data.

In [None]:
#add metadata
from langchain_community.document_loaders import GCSDirectoryLoader
print(f"Processing documents from {bucket}")
loader = GCSDirectoryLoader(
    project_name=project_id, bucket=bucket, prefix='docs'
)
documents = loader.load()

# map each file name to its metadata row so the match does not depend on load order
metadata_by_name = {
    key.split("/")[-1]: md for key, md in zip(df["Key"][:50], first_50_dict)
}

# loop through docs to add metadata to each one
for doc in documents:
    document_name = doc.metadata["source"].split("/")[-1]
    source = f"{bucket}/docs/{document_name}"
    # Add document name and source to the metadata
    doc.metadata = {"source": source, "document_name": document_name}
    doc.metadata.update(metadata_by_name.get(document_name, {}))  # attach the other metadata to the doc
print(f"# of documents loaded (pre-chunking) = {len(documents)}")

Let's take a look at our metadata!

In [None]:
documents[0].metadata

### Splitting our Data

Splitting our data into chunks helps our vector store search through our data faster and more efficiently. We'll then add the chunk number to our metadata.

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
# split the documents into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=50,
    separators=["\n\n", "\n", ".", "!", "?", ",", " ", ""],
)
doc_splits = text_splitter.split_documents(documents)

# Add chunk number to metadata
for idx, split in enumerate(doc_splits):
    split.metadata["chunk"] = idx

print(f"# of documents = {len(doc_splits)}")

After splitting our data we now have 7620 documents (chunks). Looking at our metadata, we can see that the chunk number is the last entry.

In [None]:
doc_splits[0]
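To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified fixed-width splitter. It is only a sketch: `RecursiveCharacterTextSplitter` additionally tries to break on the listed separators, which this hypothetical `split_with_overlap` function does not do.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=50):
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

text = "x" * 2500
chunks = split_with_overlap(text)

# Attach a chunk number to each piece, as done for the real splits
records = [{"chunk": idx, "length": len(c)} for idx, c in enumerate(chunks)]
print(records)
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so neighbouring chunks share a short stretch of text and a sentence cut at a boundary still appears whole in one of them.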

### Embedding our Data

Now we can embed our text into **numerical vectors** that help our model find similar objects, such as documents with similar text or photos with similar content, based on the numbers assigned to each object. Whichever embedding model you choose, its output dimensions must match the index (768 here), and the same model must be used again when querying. Here we use the Vertex AI embedding model **'text-embedding-005'**.

In [None]:
from langchain_google_vertexai import VectorSearchVectorStore
from langchain_google_vertexai import VertexAIEmbeddings
embeddings = VertexAIEmbeddings(model_name="text-embedding-005")

# initialize vector store
vector_store = VectorSearchVectorStore.from_components(
    project_id=project_id,
    region=location,
    gcs_bucket_name=bucket,
    embedding=embeddings,
    index_id=index_id,
    endpoint_id=endpoint_id,
)

To load our split documents into the vector store we separate each **Document** object into its **page content** and **metadata**. The code below loops through the split docs and collects the page content into a list named texts and the metadata into a list named metadatas.

In [None]:
# Store docs as embeddings in Matching Engine index
# It may take a while since API is rate limited
texts = [doc.page_content for doc in doc_splits]
metadatas = [doc.metadata for doc in doc_splits]

Let's look at one of our Document objects!

In [None]:
doc_splits[0]

Now we can add our split documents and their metadata to our vector store. This is the longest step of the tutorial and can take up to 1 hour to complete. While you wait, you can read ahead to the Creating an Interactive Inference Script section of this tutorial.

In [None]:
doc_ids = vector_store.add_texts(texts=texts, metadatas=metadatas)
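Because this step is slow and rate limited, one option is to add the texts in smaller batches so that progress is visible and a failure does not mean starting over. The sketch below only shows the batching logic: `batched` is a hypothetical helper, the batch size of 100 is an arbitrary assumption, and the `add_texts` call is left as a comment because it needs a live vector store.

```python
def batched(texts, metadatas, batch_size=100):
    """Yield (texts, metadatas) slices of at most batch_size items each."""
    if len(texts) != len(metadatas):
        raise ValueError("texts and metadatas must have the same length")
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size], metadatas[start:start + batch_size]

# Stand-in data; in the notebook these come from doc_splits
demo_texts = [f"chunk {i}" for i in range(250)]
demo_metadatas = [{"chunk": i} for i in range(250)]

batch_sizes = []
for batch_texts, batch_metadatas in batched(demo_texts, demo_metadatas):
    # vector_store.add_texts(texts=batch_texts, metadatas=batch_metadatas)
    batch_sizes.append(len(batch_texts))
print(batch_sizes)
```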

Test whether search from the vector store is working:

In [None]:
results=vector_store.similarity_search_with_score("brain")

### Creating an Interactive Inference Script 

For us to submit queries and receive responses from our chatbot we need to create an **inference script** that will format inputs in a way that the chatbot can understand and format outputs in a way we can understand. We will also be supplying instructions to the chatbot through this script.

Our script will utilize **LangChain** tools and packages to enable our model to:
- **Connect to sources of context** (e.g. providing our model with tasks and examples)
- **Rely on reason** (e.g. instruct our model how to answer based on provided context)

**Warning**: The following tools must be installed via your terminal `pip install "langchain" "xmltodict" "langchain-google-vertexai" "langchain-community" "unstructured"` and the inference script must be run on the terminal via the command `python YOUR_SCRIPT.py`.

The first part of our script lists all the tools that are required:
- **PubMedRetriever:** Utilizes the LangChain retriever tool to specifically retrieve PubMed documents from the PubMed API.
- **VectorSearchVectorStore:** Connects to Vertex AI Vector Search (formerly Matching Engine) so it can be used as a LangChain retriever for the embedded documents from your bucket.
- **ConversationalRetrievalChain:** Allows the user to construct a conversation with the model and retrieves the outputs while sending inputs to the model.
- **PromptTemplate:** Allows the user to give the model instructions; this is where zero-shot and few-shot prompting are set up.
- **VertexAIEmbeddings:** Text embedding model used before to convert text to numerical vectors.
- **ChatVertexAI:** Class used to call Google's Gemini chat models on Vertex AI.


```python
from langchain_community.retrievers import PubMedRetriever
from langchain.chains import ConversationalRetrievalChain
from langchain.prompts import PromptTemplate
#from langchain_google_vertexai import VertexAIModelGarden
from langchain_google_vertexai import VertexAIEmbeddings
from langchain_google_vertexai import VectorSearchVectorStore
from langchain_google_vertexai import ChatVertexAI
import sys
import json
import os
```

Second, we will define a small class that holds the color codes we will use later to style our text conversation with the chatbot.

```python
class bcolors:
    HEADER = '\033[95m'
    OKBLUE = '\033[94m'
    OKCYAN = '\033[96m'
    OKGREEN = '\033[92m'
    WARNING = '\033[93m'
    FAIL = '\033[91m'
    ENDC = '\033[0m'
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'
```

If you are using Vector Search instead of the PubMed API, we need to create a function that gathers the information needed to connect to our vector store, which will be the:
- Project ID
- Location of bucket and vector store (they should be in the same location)
- Bucket name
- Vector Store Index ID
- Vector Store Endpoint ID

```python
def build_chain():
    PROJECT_ID = os.environ["PROJECT_ID"]
    LOCATION_ID = os.environ["LOCATION_ID"]
    #ENDPOINT_ID = os.environ["ENDPOINT_ID"] #uncomment if utilizing model from Model Garden
    BUCKET = os.environ["BUCKET"]
    VC_INDEX_ID = os.environ["VC_INDEX_ID"]
    VC_ENDPOINT_ID = os.environ["VC_ENDPOINT_ID"]
```

Now we can define our Gemini model, `gemini-2.0-flash`, and its parameters:

- Max Output Tokens: Limit on the number of tokens the model outputs.
- Temperature: Controls randomness. Higher values increase diversity and give more varied responses; values near 0 give the most predictable ones.
- Top_p (nucleus): The cumulative probability cutoff for token selection. Lower values mean sampling from a smaller, more top-weighted nucleus. Must be a number from 0 to 1.
- Top_k: Sample from the k most likely next tokens at each step. Lower values restrict the model to the most probable tokens; higher values let less likely tokens through.


```python
    llm = ChatVertexAI(
        model_name="gemini-2.0-flash",
        max_output_tokens=1024,
        temperature=0.2,
        top_p=0.8,
        top_k=40,
        verbose=True,
    )

    #if using a model from the Model Garden, uncomment the line below
    #(and the VertexAIModelGarden import) and remove the ChatVertexAI block above
    #llm = VertexAIModelGarden(project=PROJECT_ID, endpoint_id=ENDPOINT_ID, location=LOCATION_ID)
```
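To make these parameters less abstract, the sketch below applies temperature, top-k, and top-p to a made-up next-token distribution. It is a textbook-style illustration in plain Python, not Gemini's actual decoding code, and the token scores are invented.

```python
import math

def next_token_distribution(logits, temperature=1.0, top_k=None, top_p=1.0):
    """Return renormalized probabilities of the tokens that survive top-k and top-p filtering."""
    # Temperature rescales the scores: values below 1 sharpen the distribution.
    # This sketch requires temperature > 0.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())
    exp = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exp.values())
    probs = sorted(((tok, e / total) for tok, e in exp.items()), key=lambda p: p[1], reverse=True)

    # top_k keeps only the k most likely tokens
    if top_k is not None:
        probs = probs[:top_k]

    # top_p keeps the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for tok, p in probs:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    norm = sum(p for _, p in kept)
    return {tok: p / norm for tok, p in kept}

logits = {"cell": 3.0, "tissue": 2.0, "organ": 1.0, "banana": -2.0}  # invented scores
default = next_token_distribution(logits, temperature=1.0, top_k=3, top_p=0.8)
sharp = next_token_distribution(logits, temperature=0.2, top_k=3, top_p=0.8)
print(default)
print(sharp)
```

With the lower temperature the top token dominates so strongly that it alone satisfies the top-p cutoff, which is why low temperatures produce more repeatable answers.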

Next we specify our retriever. Both the PubMed and Vector Search retrievers are listed below; please only keep one per script.

If using Vector Search, we need to initialize our vector store as we did before when we added our split documents and metadata to it. Then we set the vector store as a **retriever** with the search type **'similarity'**, meaning it will find the texts most similar to the question you ask the model. We also set **'k'** to 3, meaning that our retriever will return the 3 most similar documents.

```python
    #Option 1: PubMed API retriever
    retriever = PubMedRetriever()

    #Option 2: Vector Search retriever (use this instead of Option 1)
    embeddings = VertexAIEmbeddings(model_name="text-embedding-005") #must match the embedding model used to build the index

    vector_store = VectorSearchVectorStore.from_components(
        project_id=PROJECT_ID,
        region=LOCATION_ID,
        gcs_bucket_name=BUCKET,
        embedding=embeddings,
        index_id=VC_INDEX_ID,
        endpoint_id=VC_ENDPOINT_ID,
    )

    retriever = vector_store.as_retriever(
        search_type="similarity",
        search_kwargs={"k": 3},
    )
```

Here we construct our **prompt_template**; this is where we can try zero-shot or few-shot prompting. Only add one method per script.

#### Zero-shot prompting

Zero-shot prompting does not require any additional training. Instead, it gives a pretrained language model a task or query and lets it generate text (our output) without any worked examples. The model relies on its general language understanding and the patterns it has learned during its training to produce relevant output. In our script we have connected our model to a **retriever** to make sure it gathers information from that retriever (this can be the PubMed API or Vector Search).

Notice below that the task reads like a set of instructions telling our model that it will be asked questions, which it should answer using the scientific documents provided by the index (this can be the PubMed API or Vector Search index). All of this information is established as a **prompt template** for our model to receive.

```python
    prompt_template = """
  Ignore everything before.

  Instructions:
  I will provide you with research papers on a specific topic in English, and you will create a cumulative summary.
  The summary should be concise and should accurately and objectively communicate the takeaway of the papers related to the topic.
  You should not include any personal opinions or interpretations in your summary, but rather focus on objectively presenting the information from the papers.
  Your summary should be written in your own words and ensure that your summary is clear, concise, and accurately reflects the content of the original papers. First, provide a concise summary then citations at the end.

  {question} Answer "don't know" if not present in the document.
  {context}
  Solution:"""
    PROMPT = PromptTemplate(
        template=prompt_template, input_variables=["context", "question"],
    )
```
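The two placeholders are filled in at query time: `{context}` with the retrieved documents and `{question}` with the user's question. The sketch below imitates that substitution with Python's built-in `str.format` on a shortened template, so it runs without LangChain; the retrieved text is invented.

```python
# Shortened stand-in for the prompt template above
template = """Instructions:
Summarize the provided research papers objectively, then list citations.

{question} Answer "don't know" if not present in the document.
{context}
Solution:"""

# Invented retrieval results, standing in for what the retriever returns
retrieved = ["Paper A reports that ...", "Paper B finds that ..."]

filled = template.format(
    question="What is a cell?",
    context="\n\n".join(retrieved),
)
print(filled)
```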

#### One-shot and Few-shot Prompting

One-shot and few-shot prompting are similar to zero-shot prompting, except that in addition to giving our model a task we also supply one example (one-shot) or several examples (few-shot) of how the model should structure its output.

See below that we have implemented one-shot prompting to our script.  

```python
    prompt_template = """
  Instructions:
  I will provide you with research papers on a specific topic in English, and you will create a cumulative summary.
  The summary should be concise and should accurately and objectively communicate the takeaway of the papers related to the topic.
  You should not include any personal opinions or interpretations in your summary, but rather focus on objectively presenting the information from the papers.
  Your summary should be written in your own words and ensure that your summary is clear, concise, and accurately reflects the content of the original papers. First, provide a concise summary then citations at the end.
  Examples:
  Question: What is a cell?
  Answer: '''
  Cell, in biology, the basic membrane-bound unit that contains the fundamental molecules of life and of which all living things are composed.
  Sources:
  Chow, Christopher , Laskey, Ronald A. , Cooper, John A. , Alberts, Bruce M. , Staehelin, L. Andrew ,
  Stein, Wilfred D. , Bernfield, Merton R. , Lodish, Harvey F. , Cuffe, Michael and Slack, Jonathan M.W..
  "cell". Encyclopedia Britannica, 26 Sep. 2023, https://www.britannica.com/science/cell-biology. Accessed 9 November 2023.
  '''

  {question} Answer "don't know" if not present in the document.
  {context}

  Solution:"""
    PROMPT = PromptTemplate(
        template=prompt_template, input_variables=["context", "question"],
    )
```
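Moving from one-shot to few-shot only means supplying more than one worked example. As a sketch, the examples can be kept in a list and rendered into the prompt. `build_few_shot_prompt` and the example answers are hypothetical placeholders, and the braces for `{question}` and `{context}` are left in place for the prompt template to fill in later.

```python
def build_few_shot_prompt(instructions, examples):
    """Render instructions plus any number of question/answer examples into one template string."""
    parts = [f"Instructions:\n{instructions}", "Examples:"]
    for ex in examples:
        parts.append(f"Question: {ex['question']}\nAnswer: '''\n{ex['answer']}\n'''")
    # The placeholders stay literal so a prompt template can fill them at query time
    parts.append('{question} Answer "don\'t know" if not present in the document.\n{context}\nSolution:')
    return "\n\n".join(parts)

examples = [
    {"question": "What is a cell?", "answer": "<summary>\nSources: <citations>"},
    {"question": "What is a gene?", "answer": "<summary>\nSources: <citations>"},
]
few_shot_template = build_few_shot_prompt("Summarize the provided papers objectively.", examples)
print(few_shot_template)
```

Passing a single example gives a one-shot prompt, and an empty list reduces it to the zero-shot case.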

The following set of commands controls the chat history, essentially telling the model to expect another question after it finishes answering the previous one. Follow-up questions can contain references to past chat history, so the **ConversationalRetrievalChain** combines the chat history and the follow-up question into a standalone question, then looks up relevant documents from the retriever, and finally passes those documents and the question to a question-answering chain to return a response.

All of these pieces, such as our conversational chain, prompt, and chat history, are passed through a function called **run_chain** so that our model can return its response. We have also set the length of our chat history to one, meaning that our model can only refer to the previous exchange as a reference.

```python
    condense_qa_template = """
  Chat History:
  {chat_history}
  Here is a new question for you: {question}
  Standalone question:"""
    standalone_question_prompt = PromptTemplate.from_template(condense_qa_template)

    qa = ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=retriever,
        condense_question_prompt=standalone_question_prompt,
        return_source_documents=True,
        combine_docs_chain_kwargs={"prompt": PROMPT},
    )
    return qa

def run_chain(chain, prompt: str, history=None):
    print(prompt)
    #avoid a mutable default argument; fall back to an empty history
    return chain.invoke({"question": prompt, "chat_history": history or []})

MAX_HISTORY_LENGTH = 1 #increase to refer to more previous chats
```
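The effect of `MAX_HISTORY_LENGTH` can be seen in isolation with a bounded queue. This sketch uses `collections.deque` with `maxlen` as an alternative to removing the oldest entry by hand; the questions and answers are placeholders.

```python
from collections import deque

MAX_HISTORY_LENGTH = 1  # same setting as in the script

# A deque with maxlen silently discards the oldest entry when full
chat_history = deque(maxlen=MAX_HISTORY_LENGTH)

turns = [
    ("What is a cell?", "<answer 1>"),
    ("How do they divide?", "<answer 2>"),
    ("What can go wrong?", "<answer 3>"),
]
seen_counts = []
for question, answer in turns:
    # Only the history visible *before* this turn would be passed to the chain
    visible = list(chat_history)
    seen_counts.append(len(visible))
    print(f"{question!r} sees {len(visible)} previous turn(s)")
    chat_history.append((question, answer))

print(list(chat_history))
```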

The final part of our script ties everything together and uses the colors defined earlier to add a bit of flair to our conversation with our model. When first started, the script greets the user with **"Hello! How can I help you?"** and then instructs the user to ask a question or exit the session: **"Ask a question, start a New search: or CTRL-D to exit."** For every question submitted, we run the run_chain function to get the model's response and add that response to the **chat history**. Prefixing a question with **New search:** clears the chat history first.

```python
if __name__ == "__main__":
  chat_history = []
  qa = build_chain()
  print(bcolors.OKBLUE + "Hello! How can I help you?" + bcolors.ENDC)
  print(bcolors.OKCYAN + "Ask a question, start a New search: or CTRL-D to exit." + bcolors.ENDC)
  print(">", end=" ", flush=True)
  for query in sys.stdin:
    if (query.strip().lower().startswith("new search:")):
      query = query.strip().lower().replace("new search:","")
      chat_history = []
    result = run_chain(qa, query, chat_history)
    chat_history.append((query, result["answer"]))
    #keep only the most recent MAX_HISTORY_LENGTH exchanges for the next question
    while len(chat_history) > MAX_HISTORY_LENGTH:
      chat_history.pop(0)
    print(bcolors.OKGREEN + result['answer'] + bcolors.ENDC)  
    if 'source_documents' in result: 
      print(bcolors.OKGREEN + 'Sources:')
      for ref in result["source_documents"]:
        print(ref.page_content) #Use this for Vector store
        #print("PubMed UID: "+ref.metadata["uid"])#Use this for PubMed retriever
    print(bcolors.ENDC)
    print(bcolors.OKCYAN + "Ask a question, start a New search: or CTRL-D to exit." + bcolors.ENDC)
    print(">", end=" ", flush=True)
  print(bcolors.OKBLUE + "Bye" + bcolors.ENDC)
```
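The `New search:` handling in the loop can be tested on its own. `parse_query` below is a hypothetical helper that mirrors the logic above, except that it strips only the prefix and leaves the capitalization of the question untouched, which is a deliberate difference.

```python
def parse_query(line):
    """Return (question, reset_history) for one line typed by the user."""
    text = line.strip()
    prefix = "new search:"
    if text.lower().startswith(prefix):
        # Drop the prefix only; keep the user's original casing
        return text[len(prefix):].strip(), True
    return text, False

print(parse_query("New search: What is BRCA1?\n"))
print(parse_query("and BRCA2?\n"))
```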

Running our script in the terminal requires us to export the following environment variables before using the command `python NAME_OF_SCRIPT.py`. You can also check out our **example inference scripts** for the [Pubmed API](/example_scripts/example_langchain_chat_llama_2_zeroshot.py) and [Vertex AI Vector Search](/example_scripts/example_vectorsearch_chat_llama_2_zeroshot.py).

In [None]:
#retrieve our index and endpoint id
print(index_id)
print(endpoint_id)

In [None]:
#enter the environment variables in your terminal
export PROJECT_ID='<PROJECT_ID>'
export LOCATION_ID='<LOCATION_ID>'
#export ENDPOINT_ID='<MODEL_ENDPOINT_ID>' #Uncomment if using model from Model Garden
export BUCKET='<BUCKET_NAME>'
export VC_INDEX_ID='<VECTOR_SEARCH_INDEX_ID>'
export VC_ENDPOINT_ID='<VECTOR_SEARCH_ENDPOINT_ID>'
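Since `build_chain` reads these values with `os.environ[...]`, a missing variable surfaces as a `KeyError`. A small pre-flight check, sketched here with a hypothetical `missing_variables` helper, can give a clearer message. The example values are placeholders.

```python
import os

REQUIRED = ["PROJECT_ID", "LOCATION_ID", "BUCKET", "VC_INDEX_ID", "VC_ENDPOINT_ID"]

def missing_variables(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]

# Example with a stand-in environment rather than the real one
example_env = {"PROJECT_ID": "my-project", "LOCATION_ID": "us-central1", "BUCKET": "my-bucket"}
missing = missing_variables(example_env)
if missing:
    print("Set these variables before running the script: " + ", ".join(missing))
```

Calling `missing_variables()` with no argument checks the real environment, so it could be placed at the top of the inference script.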

You should see similar results on the terminal. In this example we ask the chatbot to explain brain cancer!

![PubMed Chatbot Results](../../images/GCP_chatbot_results.png)

## Conclusion
Here you learned how to deploy and interact with a chat model and also how to deploy an inference script to create an interactive chatbot in the terminal.

## Clean Up

**Warning:** Don't forget to delete the resources we just made to avoid accruing additional costs!

In [None]:
#Undeploy index
!gcloud ai index-endpoints undeploy-index {endpoint_id} \
  --deployed-index-id={deployed_index_id} \
  --project={project_id} \
  --region={location}


#Delete index and endpoint
!gcloud ai indexes delete {index_id} \
  --project={project_id} \
  --region={location} --quiet

!gcloud ai index-endpoints delete {endpoint_id} \
  --project={project_id} \
  --region={location} --quiet

In [None]:
#Delete bucket
!gcloud storage rm --recursive gs://{bucket}/

If you have imported and deployed a model, don't forget to delete the model from the Model Registry and delete its endpoint.