In [None]:
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Question Answering with Large Documents

> **NOTE:** This notebook uses the PaLM generative model, which will reach its [discontinuation date in October 2024](https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/text#model_versions). Please refer to [this updated notebook](https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/document-processing/document_processing.ipynb) for a version which uses the latest Gemini model.

<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/language/use-cases/document-qa/question_answering_documents.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo"><br> Run in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/language/use-cases/document-qa/question_answering_documents.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/language/use-cases/document-qa/question_answering_documents.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo"><br> Open in Vertex AI Workbench
    </a>
  </td>
</table>

<div style="clear: both;"></div>


| | |
|-|-|
|Author(s) | [Lavi Nigam](https://github.com/lavinigam-gcp) |

## Overview

This notebook shows how you can build a question-answering (Q&A) system (or "bot") over multiple large documents so that the Vertex AI PaLM API can answer questions about their contents.

Many companies have lots of information stored in documents, but retrieving that information easily and quickly can be challenging. To solve this, you will build a question-answering system powered by PaLM API to enable users to extract or query important details from those documents, which could be in any standard doc format such as .pdf, .doc, .docx, .txt, .pptx, or .html.

The challenge with building a Q&A system over large documents is that you cannot simply pass the entire documents into the prompt as context. LLMs, including the [Vertex PaLM API](https://cloud.google.com/vertex-ai/docs/generative-ai/learn/models), have token limits that [restrict how much context you can provide](https://ai.google/static/documents/palm2techreport.pdf).

So how can you build a Q&A system within these token limits? Along with your question (your prompt), you provide only the relevant context: the parts of your closed-domain sources (i.e., the large documents) that relate to the question.

In this notebook, you will see three methods that can address the large context challenge, known as:

* **Stuffing** - pushing whole document content as a context.
* **Map-Reduce** - splitting documents into smaller chunks.
* **Map-Reduce - embedding** - creating embeddings of smaller chunks and using vector similarity search to find relevant context.

This notebook introduces the fundamental approaches to handling very large documents when building a question-answering bot with the Vertex AI PaLM API: finding the context relevant to a user query while staying within the context limit.

In addition, several of these steps have open source or Google Cloud drop-in replacements, which will be discussed later in the notebook.

### Objective

By the end of the notebook, you will learn how to build a question-answering system that can handle large documents using the PaLM API.

You will also learn how to implement, at a conceptual level, two methods that help you work with large contexts drawn from many documents.

At a high level, here are the topics that will be covered in this notebook:

* Install Vertex AI SDK & Other dependencies
* Authenticating your notebook environment
* Import libraries and Load models
* Introduction to chains and index chains
* Method 1: Stuffing
* Method 2: Map Reduce
* Method 3: Map Reduce with embeddings

### Costs

This tutorial uses billable components of Google Cloud:

* Vertex AI Generative AI Studio

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing),
and use the [Pricing Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

## Getting Started

### Install Vertex AI SDK & Other dependencies

In [None]:
# Base system dependencies
!sudo apt -y -qq install tesseract-ocr libtesseract-dev

# System packages required by textract to extract text from the various file formats
!sudo apt-get -y -qq install poppler-utils python-dev libxml2-dev libxslt1-dev antiword unrtf pstotext flac ffmpeg lame libmad0 libsox-fmt-mp3 sox libjpeg-dev swig

# Python dependencies (tenacity provides the retry decorator used below)
%pip install google-cloud-aiplatform pytesseract PyPDF2 textract tenacity --upgrade --quiet --user

***Colab only***: Uncomment the following cell to restart the kernel, or use the button to restart it. For Vertex AI Workbench, you can restart the kernel using the button at the top.

In [None]:
# # Automatically restart kernel after installs so that your environment can access the new packages
# import IPython

# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)

### Authenticating your notebook environment
* If you are using **Colab** to run this notebook, uncomment the cell below and continue.
* If you are using **Vertex AI Workbench**, check out the setup instructions [here](https://github.com/GoogleCloudPlatform/generative-ai/tree/main/setup-env).

In [None]:
# from google.colab import auth
# auth.authenticate_user()

### Import libraries

**Colab only:** Uncomment the following cell to initialize the Vertex AI SDK. For Vertex AI Workbench, you don't need to run this.

In [None]:
# import vertexai

# PROJECT_ID = "[your-project-id]"  # @param {type:"string"}
# vertexai.init(project=PROJECT_ID, location="us-central1")

In [None]:
import os
import re
import warnings

from PyPDF2 import PdfReader
import numpy as np
import pandas as pd
from tenacity import retry, stop_after_attempt, wait_random_exponential
import textract
from vertexai.language_models import TextEmbeddingModel, TextGenerationModel

warnings.filterwarnings("ignore")

### Import models

In [None]:
generation_model = TextGenerationModel.from_pretrained("text-bison")
embedding_model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")

To make PaLM API calls more resilient and comply with [API quotas](https://cloud.google.com/vertex-ai/docs/quotas), you can use an [exponential backoff](https://en.wikipedia.org/wiki/Exponential_backoff) mechanism: a failed call is retried after increasing, randomized delays, so it eventually succeeds without over-calling the API or exceeding your quotas.

If you need your API quotas to be increased, see [requesting a higher quota](https://cloud.google.com/docs/quota_detail/view_manage#requesting_higher_quota).

The retry logic below uses the `tenacity` library; see its [API guide](https://tenacity.readthedocs.io/en/latest/api.html) for details.

In [None]:
@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(3))
def text_generation_model_with_backoff(**kwargs):
    """Call the text generation model with retries and return the generated text."""
    return generation_model.predict(**kwargs).text


@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(3))
def embedding_model_with_backoff(text):
    """Embed a list of strings with retries and return the vector for the first string."""
    embeddings = embedding_model.get_embeddings(text)
    return embeddings[0].values
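The decorators above delegate the retry policy to tenacity. As a rough, standard-library-only illustration of what "exponential backoff with random jitter" means, the sketch below retries a call and, before each retry, waits a random time drawn from a window that doubles after every failure (capped at `max_wait`). `call_with_backoff` is a hypothetical helper written for this explanation; it is a simplification, not tenacity's actual formula.

```python
import random
import time


def call_with_backoff(fn, max_attempts=3, base=1.0, max_wait=20.0, sleep=time.sleep):
    """Call fn() and retry on any exception, waiting a random, exponentially growing time.

    Hypothetical helper for illustration only (not tenacity's code). Before retry
    number k (k = 1, 2, ...), it waits a random time in [0, min(max_wait, base * 2**k)].
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            window = min(max_wait, base * 2**attempt)
            sleep(random.uniform(0, window))


# Usage: a flaky function that fails twice and then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated quota error")
    return "ok"


# Pass a no-op sleep so the demo doesn't actually wait.
print(call_with_backoff(flaky, sleep=lambda seconds: None))  # ok
```

The randomness spreads retries from many clients over time, so they don't all hit the quota again at the same moment.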

## Question Answering with large documents

One of the most common approaches across the industry to question answering over large and multiple documents with LLMs is to use `chains`.

A chain is a sequence of steps an LLM takes to complete a task. For example, a chain might start with the LLM reading a document, then asking a question about the document, and finally generating a response to the question.


An index chain is a special type of chain that uses an index to store and retrieve information. An index is a data structure that allows the LLM to quickly find relevant information for a given task. For example, an index chain might use an index to store the names of all the people mentioned in a document so that the LLM can quickly find the information it needs to answer a question about those people. Or it can store document path, name, page number, and other metadata.

The idea is to create a simple index of all source documents so that LLMs can search through vast amounts of information easily. Index chains are helpful in question answering, summarization, and chatbots.
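To make the idea of an index concrete, here is a toy sketch of an inverted index that maps each lowercase word to the (file name, page number) locations where it appears, so a lookup doesn't require rescanning every document. It is only an illustration: it is not how LangChain or a vector store builds its index, and `build_index` and the sample pages are hypothetical.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of (file_name, page_number) pairs containing it.

    `pages` is an iterable of (file_name, page_number, text) tuples.
    """
    index = defaultdict(set)
    for file_name, page_number, text in pages:
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add((file_name, page_number))
    return index


# Hypothetical pages; non-PDF files have no page number, so they use None.
pages = [
    ("report_a.pdf", 1, "Revenue grew in 2020."),
    ("report_a.pdf", 2, "Patents protect our intellectual property."),
    ("notes.txt", None, "Intellectual property litigation is ongoing."),
]
index = build_index(pages)
print(sorted(index["intellectual"], key=str))  # [('notes.txt', None), ('report_a.pdf', 2)]
```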

There are four foundational [index-related chains](https://docs.langchain.com/docs/components/chains/index_related_chains):
* Stuffing
* Map Reduce
* Refine
* Map-Rerank

In this notebook, you will see three methods: Stuffing, Map Reduce, and Map Reduce with embeddings.

### Method 1: Stuffing

Stuffing is the simplest way to pass data to a large language model (LLM). You simply combine all of the data into a single prompt, and then pass that prompt to the LLM. This method has two advantages:

* It only makes a single call to the LLM, which can improve performance.
* The LLM has access to all of the data at once, which can improve the quality of the generated text.

However, stuffing has one major disadvantage: it only works with small amounts of data. If you have a large dataset, stuffing will not be feasible.

Before you dive deeper into possible methods for large document question-answering, you can explore the primary process of stuffing and how it fails with larger files and context.

Here is the flow of stuffing:

* **Document Loader**: Loading the required document from the source to your bucket or local storage.
* **Document Processing**: Processing the documents by extracting content and other metadata.
* **Context**: Building the context to pass the entire content extracted in the previous step.
* **Prompt Engineering**: Building a question-answering prompt that takes the context built in the previous step and adds instructions to perform specific tasks.
* **Vertex AI PaLM API**: Finally, with the prompt and the context, call the PaLM API to get the expected answer.

#### Document Loader
Start by copying the documents from a Cloud Storage bucket and storing them in your project bucket or locally.

In [None]:
# Copying the files from the GCS bucket to local
!mkdir -p documents
!gsutil -m cp -r gs://github-repo/documents .

You can view one of the documents here:
https://storage.googleapis.com/github-repo/documents/20230426_alphabet_10Q.pdf

#### Document Processing

When you have documents, you need to process them for downstream consumption. In the processing phase, you aim to read the documents and convert them into a format that the downstream logic can easily use. While reading, you should keep as much metadata as possible from the original document.

In this case, you are loading different file types, such as .pdf, .txt, .docx, and .json. Each file type needs its own reader; here you use two simple open-source libraries, [textract](https://textract.readthedocs.io/en/stable/) for most formats and [PyPDF2](https://pypdf2.readthedocs.io/en/3.0.0/) for PDFs. For each file, you save the file name, file type, page number (available only for PDFs), and content.

This metadata is essential later on. Keeping the extracted content together with its metadata lets you:

* Quote the source of information when sending it as context.
* Answer queries about the documents.
* Track changes to the documents.
* Identify duplicate documents.
* Organize the documents.
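For example, because each packet keeps `file_name` and `page_number`, an answer can be cited back to its source. The sketch below is a minimal illustration: `format_citation` is a hypothetical helper, and the sample dictionary is made up but uses the same keys that `create_data_packet` (defined in the next cell) produces for the entries in `final_data`.

```python
def format_citation(packet):
    """Build a short source citation (hypothetical helper) from a data packet's metadata."""
    # Only PDFs carry a page number; other file types store None.
    if packet.get("page_number") is None:
        return packet["file_name"]
    return f"{packet['file_name']}, page {packet['page_number']}"


# Illustrative packet with the same keys that create_data_packet() produces.
packet = {
    "file_name": "20230426_alphabet_10Q.pdf",
    "file_type": ".pdf",
    "page_number": 3,
    "content": "(page text)",
}
print(format_citation(packet))  # 20230426_alphabet_10Q.pdf, page 3
```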


In [None]:
def create_data_packet(file_name, file_type, page_number, file_content):
    """Creating a simple dictionary to store all information (content and metadata)
    extracted from the document"""
    data_packet = {}
    data_packet["file_name"] = file_name
    data_packet["file_type"] = file_type
    data_packet["page_number"] = page_number
    data_packet["content"] = file_content
    return data_packet

In [None]:
final_data = []


def files(path):
    """
    Function that returns only filenames (and not folder names)
    """
    for file in os.listdir(path):
        if os.path.isfile(os.path.join(path, file)):
            yield file


for file_name in files("documents/"):
    path = f"documents/{file_name}"
    _, file_type = os.path.splitext(path)
    if file_type == ".pdf":
        # loading pdf files, with page numbers as metadata.
        reader = PdfReader(path)
        for i, page in enumerate(reader.pages):
            text = page.extract_text()
            if text:
                packet = create_data_packet(
                    file_name, file_type, page_number=int(i + 1), file_content=text
                )

                final_data.append(packet)
    else:
        # loading other file types
        text = textract.process(path).decode("utf-8")
        packet = create_data_packet(
            file_name, file_type, page_number=None, file_content=text
        )
        final_data.append(packet)

While extracting the content and metadata from the documents, you can store them in a pandas DataFrame, which makes it easy to cite the source of each extracted answer downstream. A DataFrame is also a convenient place to apply text chunking (splitting input text into smaller strings that fit within the token limit), which you do later in this notebook.

In [None]:
# converting the data that has been read from GCS to Pandas DataFrame for easy readability and downstream logic
pdf_data = pd.DataFrame.from_dict(final_data)
pdf_data = pdf_data.sort_values(
    by=["file_name", "page_number"]
)  # sorting the dataframe by filename and page_number
pdf_data.reset_index(inplace=True, drop=True)
pdf_data.head()

In [None]:
# Check how many rows of each file type the DataFrame has (one row per page for PDFs, one per document otherwise).
print("Data has these different file types : \n", pdf_data["file_type"].value_counts())

#### Context Selection

The next step in the conventional method is to pass the context to the PaLM API along with the question.

You don't know in advance which document will be helpful, so use all of the text in the `content` column as context.

In [None]:
# Combine the content of all documents into a single string so it can be passed as context.
context = "\n".join(str(v) for v in pdf_data["content"].values)
# len() counts characters; split() gives a rough word count.
print("Characters in the context:", len(context))
print("Words in the context:", len(context.split()))

#### Prompt Engineering

Next, write a simple prompt that includes the question, and add some basic instructions: the model should answer only if it finds the answer in the given `context`, and otherwise reply "answer not available in context".

The context and the question are inserted dynamically with an f-string, so you can change them as your requirements and experiments evolve.

In [None]:
question = "What is the effect of change in accounting estimate for google in 2020?"
prompt = f"""Answer the question as precisely as possible using the provided context. If the answer is
              not contained in the context, say "answer not available in context" \n\n
            Context: \n {context}\n
            Question: \n {question} \n
            Answer:
          """

#### Vertex AI PaLM API - Answer Extraction & Evaluation

Your prompt now contains the text of essentially every document as context.

The `text-bison` model has an input (prompt) limit of [8192 tokens](https://cloud.google.com/vertex-ai/docs/generative-ai/learn/models), so this PaLM API call should fail. A limit of ~8k tokens corresponds to roughly 6k words of input, but the context alone is ~`1531642` characters long (`len(context)` counts characters, not words), far more than the model accepts.

As a reminder, a single token may be smaller than a word. A token is approximately four characters. Therefore, 100 tokens correspond to roughly 60-80 words.
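The rule of thumb above (about four characters per token) can be turned into a rough pre-flight check before calling the API. This sketch is only an estimate, not the model's tokenizer, and `estimate_tokens` and `fits_in_limit` are hypothetical helpers. Since `len(context)` returns a character count (about 1,531,642 here), the estimate is roughly 383k tokens, far above the 8,192-token input limit quoted above.

```python
import math

CHARS_PER_TOKEN = 4  # rough rule of thumb, not an exact tokenizer
TEXT_BISON_INPUT_LIMIT = 8192  # input token limit quoted above for text-bison


def estimate_tokens(num_chars):
    """Estimate a token count from a character count, rounding up."""
    return math.ceil(num_chars / CHARS_PER_TOKEN)


def fits_in_limit(num_chars, limit=TEXT_BISON_INPUT_LIMIT):
    """True if the estimated token count is within the model's input limit."""
    return estimate_tokens(num_chars) <= limit


print(estimate_tokens(1_531_642))  # 382911
print(fits_in_limit(1_531_642))  # False
print(fits_in_limit(5_000))  # True (about 1,250 tokens)
```

Because it is a heuristic, leave some headroom below the limit rather than filling it exactly.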

This is why simply stuffing all the content into the prompt does not work for question answering over large documents.

In [None]:
try:
    print("PaLM Predicted:", generation_model.predict(prompt).text)
except Exception as e:
    print(
        "The code failed since it won't be able to run inference on such a huge context and throws this exception: ",
        e,
    )

You can still run the code if you restrict the context to the first 5,000 characters (or any length below the PaLM API token limit). However, there is a good chance you will not get the expected answer, since the relevant passage may not be within the first 5,000 characters.

In [None]:
prompt = f"""Answer the question as precisely as possible using the provided context. If the answer is
              not contained in the context, say "answer not available in context" \n\n
            Context: \n {context[:5000]}\n
            Question: \n {question} \n
            Answer:
          """
print("Characters in the prompt:", len(prompt))
print("PaLM Predicted:", generation_model.predict(prompt).text)

So stuffing the whole content of many files into the prompt is not a promising way to build a question-answering system. There are many methods to address this limitation; as discussed in the overview, this notebook covers two foundational ones:

* Map-Reduce
* Map-Reduce with embeddings

### Method 2: Map Reduce

A [Map Reduce](https://docs.langchain.com/docs/components/chains/index_related_chains) chain is a method for processing large amounts of data with a large language model (LLM). It works by breaking the data into smaller chunks, running an initial prompt on each chunk, and then combining the results of those initial prompts with a different prompt.

For example, for question answering, you can run an initial prompt on each chunk to extract an answer, and then combine the answers from the individual chunks with a final prompt.


The typical flow for this method goes like this:

* Take N documents from your source.
* Split the documents into smaller chunks (for example, about 1,000 words each).
* Pass each chunk as context to the question-answering prompt.
* Combine the answers from all chunks using a separate prompt.

![Map Reduce flow](https://storage.googleapis.com/github-repo/img/reference-architecture%20/map_reduce_flow_new.jpeg)
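The flow above can be sketched in plain Python. This is an illustrative sketch, not the notebook's implementation: `fake_llm` is a hypothetical stand-in for the PaLM call (it "answers" only when its context mentions a patent), and `build_prompt` and `map_reduce_answer` are hypothetical names. The map step runs the per-chunk prompts concurrently with a thread pool, which is what makes this method easy to parallelize.

```python
from concurrent.futures import ThreadPoolExecutor

NOT_FOUND = "answer not available in context"


def fake_llm(prompt):
    """Hypothetical stand-in for the PaLM call: 'answers' only if the context mentions a patent."""
    context = prompt.split("Context:", 1)[1].split("Question:", 1)[0].strip()
    return context if "patent" in context else NOT_FOUND


def build_prompt(context, question):
    return f"Answer using only the context.\nContext: {context}\nQuestion: {question}"


def map_reduce_answer(chunks, question, llm=fake_llm, max_workers=4):
    # Map: ask the same question against every chunk, running the calls in parallel.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partial_answers = list(pool.map(lambda chunk: llm(build_prompt(chunk, question)), chunks))
    # Filter: drop chunks where the model reported that no answer was found.
    useful = [a for a in partial_answers if a.strip().rstrip(".").lower() != NOT_FOUND]
    if not useful:
        return NOT_FOUND
    # Reduce: one final call over the combined partial answers.
    return llm(build_prompt("\n".join(useful), question))


chunks = ["Revenue rose in 2020.", "Google filed a patent.", "New offices opened."]
print(map_reduce_answer(chunks, "How does Google protect its IP?"))  # Google filed a patent.
```

Threads suit this step because each map call mostly waits on the network; in practice, keep `max_workers` low enough to stay within your API quota.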

You can start by writing a simple function `get_chunks_iter` that takes a long string `text` and the size of the chunk as `maxlength`.

This function splits `text` into chunks of at most `maxlength` characters (not words), breaking each chunk at the last space before the limit, collects the chunks in a list, and returns that `final_chunk` list.

In [None]:
# The function get_chunks_iter() can be used to split a piece of text into smaller chunks,
# each of which is at most maxlength characters long.
# This can be useful for tasks such as summarization, question answering, and translation.


def get_chunks_iter(text, maxlength):
    """
    Get chunks of text, each of which is at most maxlength characters long.

    Args:
        text: The text to be chunked.
        maxlength: The maximum length of each chunk, in characters.

    Returns:
        A list of the chunks of text.
    """
    start = 0
    final_chunk = []
    while start + maxlength < len(text):
        # Break at the last space within the window so that words stay intact.
        end = text.rfind(" ", start, start + maxlength + 1)
        if end == -1:
            # No space in this window (e.g. a very long token): hard-split at maxlength.
            final_chunk.append(text[start : start + maxlength])
            start += maxlength
        else:
            final_chunk.append(text[start:end])
            start = end + 1  # skip the space itself
    final_chunk.append(text[start:])
    return final_chunk


# Function to apply get_chunks_iter() to each row of the DataFrame.
# For file_type == ".pdf", each row holds the content of one page; for other file types, the whole document.
def split_text(row):
    chunk_iter = get_chunks_iter(row, chunk_size)
    return chunk_iter

`chunk_size` is defined at the top level of the notebook, which makes it a module-level (global) variable that `split_text` can read from inside the function. No `global` statement is needed to read it; `global` only matters inside a function that assigns to a module-level variable.

`pdf_data_sample` is a copy of `pdf_data`. The following cells clean and chunk `pdf_data_sample`, so working on a copy keeps the original `pdf_data` unchanged for later comparison.

In [None]:
# Maximum number of characters in each chunk.
chunk_size = 5000

pdf_data_sample = pdf_data.copy()

In [None]:
# Replace every run of characters that are not ASCII letters or digits with a single space.
# This is harsh cleaning (it also removes punctuation such as decimal points); you can define your own cleaning logic here.
pdf_data_sample["content"] = pdf_data_sample["content"].apply(
    lambda x: re.sub("[^A-Za-z0-9]+", " ", x)
)
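To see how aggressive this cleaning is, here is the same substitution applied to a made-up sentence in the style of a financial report. Currency symbols, commas, and decimal points are all replaced, so `1,234.5` becomes three separate numbers. For financial documents you may want a gentler pattern.

```python
import re

sample = "Revenue: $1,234.5 (2020)"  # made-up example text
cleaned = re.sub("[^A-Za-z0-9]+", " ", sample)
print(repr(cleaned))  # 'Revenue 1 234 5 2020 '
```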

The `split_text` function splits a string into a list of chunks, each a contiguous run of at most `chunk_size` characters. In the cell below, the second statement,

```python
pdf_data_sample = pdf_data_sample.explode("chunks")
```

explodes the `chunks` column into individual rows, so each row of `pdf_data_sample` now represents a single chunk of text.

In [None]:
# Apply the chunk splitting logic here on each row of content in dataframe.
pdf_data_sample["chunks"] = pdf_data_sample["content"].apply(split_text)
# Now, each row in 'chunks' contains list of all chunks and hence we need to explode them into individual rows.
pdf_data_sample = pdf_data_sample.explode("chunks")

In [None]:
# Sort and reset index
pdf_data_sample = pdf_data_sample.sort_values(by=["file_name", "page_number"])
pdf_data_sample.reset_index(inplace=True, drop=True)
pdf_data_sample.head()

You can observe how a single page in the `20210203_alphabet_10K.pdf` file is divided into three chunks.

The first three rows all have page number 1, which indicates that this page has been split into three chunks. This matters because each chunk is now a manageable size to send as context, rather than a whole document as before.

This will increase the total number of rows in the dataframe as well.

In [None]:
print("The original dataframe has :", pdf_data.shape[0], " rows without chunking")
print("The chunked dataframe has :", pdf_data_sample.shape[0], " rows with chunking")

Now you can define the prompt and pass each chunk as the context.

In [None]:
# Function passed to DataFrame.apply to answer the question using one row (one chunk) as context.


def get_answer(row):
    prompt = f"""Answer the question as precisely as possible using the provided context. If the answer is
                 not contained in the context, say "answer not available in context" \n\n
                  Context: \n {row['chunks']}\n
                  Question: \n {question} \n
                  Answer:
            """

    pred = text_generation_model_with_backoff(prompt=prompt)
    return pred

In [None]:
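# Take a small sample (the first 10 chunks) to avoid making too many calls to the API.
# .copy() makes an independent DataFrame, so adding columns doesn't modify a view of pdf_data_sample.
pdf_data_sample_head = pdf_data_sample.head(10).copy()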

question = "What is the effect of change in accounting estimate for google in 2020?"
pdf_data_sample_head["predicted_answer"] = pdf_data_sample_head.apply(
    get_answer, axis=1
)
pdf_data_sample_head.head(2)

After running the question-answering prompt on each chunk, combine all the answers into a new context and send it to a final prompt. In the per-chunk prompt, you told the model to reply "answer not available in context" if it couldn't find an answer.

This lets you remove the chunks where the model gave that reply. The remaining answers become the new context.

In [None]:
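# Keep only the answers where the model found something, joined into one string.
# The comparison ignores case, surrounding whitespace, and a trailing period,
# since the model may not reproduce the sentinel string exactly.
context_map_reduce = "\n".join(
    each_answer
    for each_answer in pdf_data_sample_head["predicted_answer"].values
    if each_answer.strip().rstrip(".").lower() != "answer not available in context"
)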

In [None]:
prompt = f"""Answer the question as precisely as possible using the provided context. If the answer is
              not contained in the context, say "answer not available in context" \n\n
            Context: \n {context_map_reduce}\n
            Question: \n {question} \n
            Answer:
          """
print("Characters in the prompt:", len(prompt))
print("PaLM Predicted:", generation_model.predict(prompt).text)

Now, let's look into this method's various pros and cons to summarize what you have done.

**Pros:**

* Increased precision due to chunking
* Most helpful for extracting entities across different document levels.
* It can scale to larger and more numerous documents than other methods because the chunks can be processed in parallel.


**Cons:**

* Multiple API calls, which can be costly and time-consuming
* Slow, as it searches through all chunks even if the answer is found early
* Conflicting answers, which can be difficult to resolve

Next, let's explore the following method, which addresses some of the shortcomings of Method 2.

### Method 3: Map Reduce with embeddings

The previous method for question answering was inefficient because it required calling the PaLM API on all chunks of text. A more efficient approach is to create embeddings of the chunks and then use vector mathematics to find similar chunks. This allows you to find the relevant context from all the chunks in the dataframe where your answer may exist.

The typical flow for this method is as follows:
* Split documents into chunks.
* Create embeddings for each chunk.
* Create an embedding for the question.
* Compute the cosine similarity between the question embedding and each chunk embedding to find the closest chunks.
* Use the closest chunks as context for the PaLM API.

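This method is more efficient because it calls the text generation model only once, on the most relevant chunks. You still compute one embedding per chunk, but only once: the stored embeddings can be reused for every new question.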


![Embedding Learning](https://storage.googleapis.com/github-repo/img/reference-architecture%20/map_reduce_embedding.jpeg)


Start the implementation by computing the embedding for each chunk.

This will add the embeddings (vector/number representation) of each chunk as a separate column.

In [None]:
pdf_data_sample_head["embedding"] = pdf_data_sample_head["chunks"].apply(
    lambda x: embedding_model_with_backoff([x])
)
pdf_data_sample_head["embedding"] = pdf_data_sample_head.embedding.apply(np.array)
pdf_data_sample_head.head(2)

Now comes the heart of this method. Define a function `get_context_from_question`, which takes:
* `question`: the question the user wants to ask,
* `vector_store`: the DataFrame of chunks and their embeddings that you created in the previous step, and
* `sort_index_value`: how many of the top-scoring chunks to pick after sorting by similarity score.

The function creates an embedding for `question` and computes its dot product with the embedding of every chunk in the vector store (the dot product equals cosine similarity when the embeddings have unit length). It then sorts the scores in decreasing order and joins the top `sort_index_value` chunks into a single string.

This will become your context for the question asked.

In [None]:
def get_context_from_question(question, vector_store, sort_index_value=2):
    """Return the `sort_index_value` chunks most similar to `question`.

    Returns the chunks joined into one context string, plus a DataFrame with their
    file name, page number, and text, ordered from most to least similar.
    """
    # Embed the question once.
    query_vector = np.array(embedding_model_with_backoff([question]))
    # Score every chunk by the dot product of its embedding with the question embedding.
    scores = vector_store["embedding"].apply(
        lambda embedding: np.dot(embedding, query_vector)
    )
    # Index labels of the highest-scoring chunks.
    top_matched = scores.sort_values(ascending=False).index[:sort_index_value]
    top_matched_df = vector_store.loc[top_matched, ["file_name", "page_number", "chunks"]]
    context = " ".join(top_matched_df["chunks"].values)
    return context, top_matched_df
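The function ranks chunks by dot product. The dot product equals cosine similarity only when both vectors have unit length; otherwise longer vectors score higher regardless of direction. Whether the `textembedding-gecko` vectors are already unit-length is not stated in this notebook, so the following pure-Python sketch, using made-up 2-D vectors (real embeddings have hundreds of dimensions), shows the difference. It also picks the top-k chunks with `heapq.nlargest`, which avoids sorting the whole list. The helper names are hypothetical.

```python
import heapq
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def normalize(v):
    """Scale v to unit length, after which dot product equals cosine similarity."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]


def top_k(query, chunk_vectors, k=2, score=cosine_similarity):
    """Indices of the k best-scoring chunk vectors, best first (no full sort needed)."""
    return heapq.nlargest(
        k, range(len(chunk_vectors)), key=lambda i: score(query, chunk_vectors[i])
    )


query = [1.0, 0.0]
chunk_vectors = [[0.8, 0.6], [2.0, 4.0], [1.0, 0.1]]
print(top_k(query, chunk_vectors, score=dot))  # [1, 2]: raw dot product favors the long vector
print(top_k(query, chunk_vectors))  # [2, 0]: cosine ignores vector length
print(top_k(normalize(query), [normalize(v) for v in chunk_vectors], score=dot))  # [2, 0]
```

If the embeddings are normalized once when they are stored, the cheaper dot product gives the same ranking as cosine similarity.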

Now that you have a general function that always gets you custom relevant context for the question, you can call it with every new question.

In [None]:
# your question for the documents
question = "What efforts have been taken by Google to safeguard their intellectual property in 2020?"

# get the custom relevant chunks from all the chunks in vector store.
context, top_matched_df = get_context_from_question(
    question,
    vector_store=pdf_data_sample_head,
    sort_index_value=5,  # Top N results to pick from embedding vector search
)
# The top 5 chunks selected by the embedding similarity search for this question; together they form the context.
top_matched_df

In [None]:
# Prompt for Q&A which takes the custom context found in last step.
prompt = f""" Answer the question as precisely as possible using the provided context. \n\n
            Context: \n {context}\n
            Question: \n {question} \n
            Answer:
          """

# Call the PaLM API on the prompt.
print("PaLM Predicted:", text_generation_model_with_backoff(prompt=prompt))

As you can see, the main advantage of this method is that you don't have to call the text generation API on every chunk: a single call, on a small and relevant context, is enough to produce the answer.

Now, let's look at this method's pros and cons to summarize what you have done.

**Pros:**

* Fast: the text generation API doesn't have to be called on every chunk.
* Embeddings support efficient operations, such as similarity search, over large collections of chunks.
* Embeddings are more informative representations of words and phrases than traditional bag-of-words representations.
* Embeddings can improve the performance of a variety of natural language processing tasks, such as text classification, machine translation, and question answering.

**Cons:**

* Computing vector similarity can be computationally expensive: as the DataFrame grows very large, computing cosine similarity against every row with basic vector math can become slow.
* Results can be sensitive to the choice of embedding model.