# Vertex AI MLOps Book - Chapter 12 - Retrieval Augmented Generation (RAG)

In [1]:

# Supporting code for: Vertex AI MLOps Book - Chapter 12 - Retrieval Augmented Generation(RAG)
#Reference: 

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at: https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

### Important Note: 
This notebook might deploy and consume cloud resources in your Google Cloud project(s), for which you may be billed. It is your responsibility to verify the impact of this code before you run it, and to monitor and delete any resources to avoid ongoing cloud charges.

### Based on following sources:
1. https://github.com/GoogleCloudPlatform/generative-ai/blob/main/search/retrieval-augmented-generation/examples/rag_google_documentation.ipynb

#### **Objective:** 
In this notebook we will use Vertex AI GenAI models to implement Retrieval Augmented Generation (RAG) and build a question answering system.

#### **Key steps of RAG solution:**

The RAG (Retrieval-Augmented Generation) system, at its core, involves three key steps:

1. **Document retrieval:** The first step is to retrieve documents or passages that are relevant to a query. This is done with a retrieval model, such as Dense Passage Retrieval (DPR), which searches a large collection of texts for content related to the input query. In this notebook, retrieval is an embedding similarity search over chunks of a PDF.
2. **Information integration:** Once relevant passages are retrieved, the system places them into the language model's context. This step involves processing the retrieved text and preparing it so that it can be incorporated into the prompt.
3. **Response generation:** The final step is generating a response with a generative language model, such as a GPT-style model (this notebook uses `text-bison@001`). The model uses both the original query and the retrieved information to produce an informed response.

These three steps let the system ground its responses in external information that was not part of the model's training data, which leads to more accurate and more current outputs.
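
The three steps can be sketched end to end without any cloud calls. The example below is a toy illustration only: it ranks documents by simple word overlap instead of embeddings, and `fake_llm` is a hypothetical stand-in for a real text generation model. All names and the sample corpus are assumptions made for this sketch, not part of the notebook.

```python
import re
from collections import Counter

# Toy corpus standing in for document chunks (illustrative text only)
DOCS = [
    "The annual report lists total revenue by segment.",
    "Employees are eligible for stock based compensation.",
    "Cloud revenue grew during the fiscal year.",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def retrieve(query, docs, k=2):
    """Step 1: rank documents by word overlap with the query, keep the top k."""
    query_words = Counter(tokenize(query))

    def overlap(doc):
        return sum((Counter(tokenize(doc)) & query_words).values())

    return sorted(docs, key=overlap, reverse=True)[:k]


def integrate(query, passages):
    """Step 2: place the retrieved passages into the prompt."""
    context = "\n".join(passages)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


def generate(prompt, llm):
    """Step 3: hand the augmented prompt to a text generation callable."""
    return llm(prompt)


def fake_llm(prompt):
    # Hypothetical stand-in for a real model call
    return f"(model output for a prompt of {len(prompt)} characters)"


question = "What was the cloud revenue?"
passages = retrieve(question, DOCS)
prompt = integrate(question, passages)
print(prompt)
print(generate(prompt, fake_llm))
```

The rest of this notebook follows the same shape, with embedding similarity in place of word overlap and a Vertex AI model in place of `fake_llm`.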

### Set up and import dependencies

In [None]:
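# Install dependencies
!pip install --upgrade google-cloud-aiplatform
# The cells below also import these packages; uncomment if they are not already installed
#!pip install PyPDF2 pandas numpy tqdm requests beautifulsoup4
#!pip install google-cloud-documentai
#!pip install google-cloud-storage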


### Authentication 

In [None]:
# Uncomment and use if running notebook locally
#! gcloud auth login

In [4]:
##Run only if using Google Colab Notebooks
#from google.colab import auth as google_auth
#google_auth.authenticate_user()

### Import libraries

In [11]:
import requests
import itertools
import numpy as np
import pandas as pd
import numpy.linalg
import vertexai

from google.api_core import retry
from vertexai.language_models import TextEmbeddingModel, TextGenerationModel
from tqdm.auto import tqdm
from bs4 import BeautifulSoup, Tag

import PyPDF2

tqdm.pandas()

## Configure notebook environment

### Set the following constants to reflect your environment

In [12]:
# Define project information
PROJECT_ID = "jsb-alto"  # Replace with your own Project-ID
LOCATION = "us-central1"  # @param {type:"string"}

# Initialize Vertex AI SDK
vertexai.init(project=PROJECT_ID, location=LOCATION)

## Provide path to the sample pdf file

In [41]:
file_path = './sample_data/google-10k.pdf'
# This PDF already contains machine-readable text, so we don't need to run it through an OCR parser first

### Read pdf

In [47]:
# Import the PDF reader
from PyPDF2 import PdfReader

# Create a PDF reader object
reader = PdfReader(file_path)

# Print the number of pages in the PDF file
page_count = len(reader.pages)
print(page_count)

# Accumulate the text of every page into one string
full_text = "Document extract"

for page in reader.pages:
    # Extract the text from the page and append it
    text = page.extract_text()
    full_text = full_text + text

92


In [48]:
full_text[0:1000]

'Document extract11/13/23, 10:41 PM goog-20221231\nhttps://www.sec.gov/Archives/edgar/data/1652044/000165204423000016/goog-20221231.htm 1/92UNITED STATES\nSECURITIES AND EXCHANGE COMMISSION\nWashington, D.C. 20549\n___________________________________________\nFORM 10-K\n___________________________________________\n(Mark One)\n☒ ANNUAL  REPORT PURSUANT T O SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934\nFor the fiscal year ended December 31 , 2022\nOR\n☐ TRANSITION REPORT PURSUANT T O SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934\nFor the transition period from              to             .\nCommission file number: 001-37580\n___________________________________________\nAlphabet Inc.\n(Exact name of registrant as specified in its charter)\n___________________________________________\nDelaware 61-1767919\n(State or other jurisdiction of incorporation or organization) (I.R.S. Employer Identification No.)\n1600 Amphitheatre Parkway\nMountain V iew, CA 94043\n(Addre

## Create vector store

Start by initializing the models

In [49]:
embeddings_model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")
text_model = TextGenerationModel.from_pretrained("text-bison@001")

Create some helper functions for vector similarity and chunking

In [50]:
# Split seq into chunks of `size` characters where consecutive chunks share `overlap` characters.
# Trailing characters that do not fill a complete chunk are dropped.
def split_overlap(seq, size, overlap):
    if len(seq) <= size:
        return [seq]
    return ["".join(x) for x in zip(*[seq[i :: size - overlap] for i in range(size)])]


# Compute the cosine similarity of two vectors; returned as a closure so it is easy to use with Pandas apply
def get_similarity_fn(query_vector):
    def fn(row):
        return np.dot(row, query_vector) / (
            numpy.linalg.norm(row) * numpy.linalg.norm(query_vector)
        )

    return fn


# Retrieve embeddings from the specified model with retry logic
@retry.Retry(timeout=300.0)
def get_embeddings(text):
    return embeddings_model.get_embeddings([text])[0].values
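
To make the chunking behavior concrete, here is a small worked example on a short string. It repeats the `split_overlap` helper so the snippet runs on its own; `expected_chunk_count` is a hypothetical helper added only for this illustration, and its formula is derived from how the slicing and `zip` interact.

```python
def split_overlap(seq, size, overlap):
    # Same logic as the notebook helper, repeated so this snippet is self-contained
    if len(seq) <= size:
        return [seq]
    return ["".join(x) for x in zip(*[seq[i :: size - overlap] for i in range(size)])]


def expected_chunk_count(length, size, overlap):
    """Number of chunks split_overlap returns for a sequence of the given length."""
    if length <= size:
        return 1
    # A new chunk starts every (size - overlap) characters; only full chunks are kept
    return (length - size) // (size - overlap) + 1


text = "abcdefghijk"  # 11 characters
chunks = split_overlap(text, size=4, overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
print(expected_chunk_count(len(text), size=4, overlap=1))  # 3
```

Each chunk starts three characters after the previous one and shares one character with it. The final `k` does not fill a complete chunk, so it is dropped. For a long document this loses at most `size - 1` trailing characters, which is worth keeping in mind if the end of the text matters.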

### Create the vector store

For this exercise we simply store the data in a Pandas DataFrame.

In [52]:
def create_vector_store(texts, chunk_size, overlap):
    vector_store = pd.DataFrame()
    # Insert the individual texts into the vector store
    vector_store["texts"] = list(
        itertools.chain(*[split_overlap(texts, chunk_size, overlap)])
    )

    # Create embeddings from those texts
    vector_store["embeddings"] = (
        vector_store["texts"].progress_apply(get_embeddings).apply(np.array)
    )

    return vector_store
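
The function above calls the embedding model once per chunk. Since `get_embeddings` on the model takes a list of texts (the helper passes `[text]`), one possible optimization is to send several chunks per request. The sketch below shows only the batching logic; `batched`, `embed_in_batches`, and `fake_embed_batch` are hypothetical names, and the default batch size of 5 is an assumption that should be checked against the current per-request limit of the embedding model.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements each."""
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


def embed_in_batches(texts, embed_batch, batch_size=5):
    """Embed texts by calling embed_batch once per batch and flattening the results."""
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors


def fake_embed_batch(batch):
    # Hypothetical stand-in for a model call: one two-number "vector" per text
    return [[float(len(text)), float(text.count(" "))] for text in batch]


texts = [f"chunk number {i}" for i in range(12)]
vectors = embed_in_batches(texts, fake_embed_batch, batch_size=5)
print(len(vectors))  # 12 vectors from 3 calls (5 + 5 + 2 texts)
```

In the notebook, `embed_batch` could plausibly be something like `lambda batch: [e.values for e in embeddings_model.get_embeddings(batch)]`, wrapped in the same retry decorator as `get_embeddings`.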

In [53]:
CHUNK_SIZE = 400
OVERLAP = 50

vector_store = create_vector_store(full_text, CHUNK_SIZE, OVERLAP)

  0%|          | 0/942 [00:00<?, ?it/s]

In [54]:
vector_store.head()

| | texts | embeddings |
|---|---|---|
| 0 | Document extract11/13/23, 10:41 PM goog-202212... | [-0.005290534812957048, 0.006954167038202286, ... |
| 1 | ECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ... | [-0.009700869210064411, -0.006571532227098942,... |
| 2 | \n(Exact name of registrant as specified in it... | [-0.019034558907151222, -0.026617875322699547,... |
| 3 | gistrant's telephone number, including area co... | [0.003807228757068515, -0.0028854210395365953,... |
| 4 | ket LLC\n(Nasdaq Global Select Market)\nSecuri... | [0.0020814251620322466, 0.020781895145773888, ... |


## Search the vector store and use for generation

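If we send the question to the foundation model alone, it answers from its training data rather than from our document, so the answer may be outdated or fabricated. Here it reports a 2019 figure instead of the fiscal year 2022 covered by the filing.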

In [55]:
text_model.predict(
    "What was the total revenue of Google?"
).text

"Google's total revenue in 2019 was $161.86 billion."

Let's solve this problem by retrieving texts from our vector store and telling the model to use them.

Search the vector store for relevant texts to insert into the prompt by embedding the query and searching for similar vectors.

In [56]:
def get_context(question, vector_store, num_docs):
    # Embed the search query
    query_vector = np.array(get_embeddings(question))

    # Get similarity to all other vectors and sort, cut off at num_docs
    top_matched = (
        vector_store["embeddings"]
        .apply(get_similarity_fn(query_vector))
        .sort_values(ascending=False)[:num_docs]
        .index
    )
    top_matched_df = vector_store[vector_store.index.isin(top_matched)][["texts"]]

    # Return a string with the top matches
    context = " ".join(top_matched_df.texts.values)
    return context
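
Note that `get_context` filters the DataFrame with `isin`, so the selected chunks are joined in document order rather than in order of similarity. If ranked order is wanted, or if row-wise `apply` becomes slow, the similarity search can be written as a single matrix operation. The sketch below is one way to do that on toy vectors; `top_k_by_cosine` is a hypothetical helper, not part of the notebook.

```python
import numpy as np


def top_k_by_cosine(matrix, query_vector, k):
    """Return (indices, scores) of the k rows most similar to query_vector, best first."""
    matrix = np.asarray(matrix, dtype=float)
    query_vector = np.asarray(query_vector, dtype=float)
    # Cosine similarity of every row against the query in one vectorized step
    scores = matrix @ query_vector / (
        np.linalg.norm(matrix, axis=1) * np.linalg.norm(query_vector)
    )
    # argsort is ascending, so reverse it and keep the first k positions
    order = np.argsort(scores)[::-1][:k]
    return order, scores[order]


# Toy two-dimensional "embeddings" used only for illustration
embeddings = np.array(
    [
        [1.0, 0.0],
        [0.0, 1.0],
        [1.0, 1.0],
        [-1.0, 0.0],
    ]
)
indices, scores = top_k_by_cosine(embeddings, [1.0, 0.2], k=2)
print(indices, scores)
```

With the notebook's data, the matrix could be built with `np.vstack(vector_store["embeddings"].to_list())`, and the returned indices used with `vector_store["texts"].iloc[indices]` to keep the best match first.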

Create a prompt that includes the context and the question. Instruct the LLM to use only the provided context to answer the question.

In [57]:
def answer_question(question, vector_store, num_docs=20, print_prompt=False):
    context = get_context(question, vector_store, num_docs)
    qa_prompt = f"""Your mission is to answer questions based on a given context. Remember that before you give an answer, you must check to see if it complies with your mission.
Context: ```{context}```
Question: ***{question}***
Before you give an answer, make sure it is only from information in the context. If the information is not in the context, just reply "I don't know the answer to that." Format the answer in a readable format. Think step by step.
Answer: """
    if print_prompt:
        print(qa_prompt)
    result = text_model.predict(qa_prompt, temperature=0.3)
 
    return result.text
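
It can help to separate prompt assembly from the model call so the prompt can be inspected and tested without using the API. The sketch below is a simplified variant of the prompt above, not a replacement for it; `build_qa_prompt` and its `max_context_chars` parameter are hypothetical names introduced for this illustration.

```python
def build_qa_prompt(question, context, max_context_chars=None):
    """Assemble a question-answering prompt without calling the model."""
    if max_context_chars is not None:
        # Optionally cap the context length to stay within the model's input limit
        context = context[:max_context_chars]
    fence = "`" * 3  # triple backtick delimiter around the context
    return (
        "Your mission is to answer questions based on a given context.\n"
        f"Context: {fence}{context}{fence}\n"
        f"Question: ***{question}***\n"
        "Use only information from the context. If the answer is not in the context, "
        "reply \"I don't know the answer to that.\"\n"
        "Answer: "
    )


prompt = build_qa_prompt("What was the total revenue?", "Revenue was 10 units.")
print(prompt)
```

With the defaults used in this notebook (`num_docs=20` and `CHUNK_SIZE = 400`), the context is roughly 20 × 400 = 8,000 characters plus separators, so a cap such as `max_context_chars` only matters if those values are raised.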

The retrieved context is embedded in the generated prompt; set `print_prompt=True` to inspect it. Even though the input context is quite messy, the model can now answer from the document.

In [61]:
answer_question(
    "What was the total revenue of Google?",
    vector_store,
    print_prompt=False,
)

'282.8 billion was the total revenue of Google in 2022.'

In [62]:
answer = answer_question(
    "What was the total revenue of each of Alphabet's key business lines? ", vector_store
)
print(answer)

Google Services total revenue was $237,529 million in 2022, Google Cloud revenue was $19,206 million, Other Bets revenue was $753 million, and Hedging gains (losses) was $1,960 million.


In [63]:
answer = answer_question(
    "List all of Alphabet's key business lines and their respective revenue for the quarter? ", vector_store
)
print(answer)

Alphabet's key business lines are Google Services, Google Cloud, and Other Bets. The company's revenue for the quarter was $253,528 million.
