In [None]:
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Question Answering with Large Documents using LangChain 🦜🔗

> **NOTE:** This notebook uses the PaLM generative model, which will reach its [discontinuation date in October 2024](https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/text#model_versions). Please refer to [this updated notebook](https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/retrieval-augmented-generation/multimodal_rag_langchain.ipynb) for a version which uses the latest Gemini model.

<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/language/use-cases/document-qa/question_answering_documents_langchain.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo"><br> Run in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/language/use-cases/document-qa/question_answering_documents_langchain.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/language/use-cases/document-qa/question_answering_documents_langchain.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo"><br> Open in Vertex AI Workbench
    </a>
  </td>
</table>

<div style="clear: both;"></div>


| | |
|-|-|
|Author(s) | [Lavi Nigam](https://github.com/lavinigam-gcp) |

## Overview

This notebook demonstrates how to build a question-answering (Q&A) system using LangChain with the Vertex AI PaLM API to extract information from large documents.

The challenge with building a Q&A system over large documents is that large language models (LLMs) have token limits that restrict how much context you can provide.

There are several ways to select the context, with or without similarity search, and several ways to pass that context to the LLM. This notebook covers the following methods, or chains:

- **Stuffing**: Pass the whole document content as context in a single prompt. This is the simplest method, but it does not work once the document exceeds the model's token limit.

- **Map-Reduce**: Split the document into smaller chunks, run the prompt on each chunk independently (which can be done in parallel), and combine the results. This scales to larger documents than stuffing, but it requires more LLM calls and is more complex to implement.

- **Refine**: Run an initial prompt on the first chunk to generate an output, then, for each subsequent chunk, refine the output based on both the previous output and the new chunk. This can retain more information than Map-Reduce, but the calls run sequentially, so it is slower.

This notebook also shows **Map-Reduce with similarity search**, where you create embeddings of smaller chunks and use vector similarity search to find the relevant context. This sends only the most relevant chunks to the LLM, but it requires an additional embedding and indexing step.
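
To make the trade-offs above concrete, the following minimal sketch counts LLM calls per strategy for a document split into `n_chunks` chunks. The function name `llm_calls` is hypothetical, and the Map-Reduce count assumes that all mapped outputs fit into a single combine call, which may not hold for very large documents.

```python
def llm_calls(strategy: str, n_chunks: int) -> int:
    """Return how many LLM calls a strategy needs for n_chunks chunks."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    if strategy == "stuff":
        return 1  # all chunks go into one prompt
    if strategy == "map_reduce":
        return n_chunks + 1  # one call per chunk plus one combine call (assumed)
    if strategy == "refine":
        return n_chunks  # one initial call, then one refine call per remaining chunk
    raise ValueError(f"unknown strategy: {strategy}")


for name in ("stuff", "map_reduce", "refine"):
    print(name, llm_calls(name, n_chunks=10))
```

The Map-Reduce calls for different chunks are independent of each other, whereas each Refine call depends on the output of the previous one.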


Learn more about [LangChain](https://python.langchain.com/en/latest/use_cases/question_answering.html) and [Vertex Generative AI](https://cloud.google.com/vertex-ai/docs/generative-ai/learn/overview).

### Objective

In this tutorial, you learn how to:

- Ingest documents by downloading them.
- Extract text from the PDF by using the LangChain `PyPDFLoader`.
- Select the context, that is, identify the parts of the document that are needed to answer the question.
- Design a prompt for question answering.
- Use chains to handle large contexts (with and without embeddings).

### Costs

This tutorial uses billable components of Google Cloud:

* Vertex AI Generative AI Studio

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing),
and use the [Pricing Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

## Getting Started

### Install Vertex AI SDK, other packages and their dependencies

Install the following packages required to execute this notebook.

In [None]:
# Base system dependencies
!sudo apt -y -qq install tesseract-ocr libtesseract-dev

# required by PyPDF2 for page count and other pdf utilities
!sudo apt-get -y -qq install poppler-utils python-dev libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext tesseract-ocr flac ffmpeg lame libmad0 libsox-fmt-mp3 sox libjpeg-dev swig

In [None]:
# Install the packages
import os

if not os.getenv("IS_TESTING"):
    USER = "--user"
else:
    USER = ""
# Install Vertex AI LLM SDK, langchain and dependencies
# {USER} expands to "--user" outside the test environment and to an empty string inside it
%pip install google-cloud-aiplatform langchain==0.0.323 chromadb==0.3.26 pydantic==1.10.8 typing-inspect==0.8.0 typing_extensions==4.5.0 pandas datasets google-api-python-client transformers==4.33.1 pypdf faiss-cpu config {USER}

### Colab only: Restart the kernel

***Colab only***: Run the following cell to restart the kernel, or use the restart button. For Vertex AI Workbench, you can restart the kernel using the button at the top.

In [None]:
# Automatically restart kernel after installs so that your environment can access the new packages
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

### Authenticating your notebook environment

- If you are using **Colab** to run this notebook, run the cell below and continue.
- If you are using **Vertex AI Workbench**, check out the setup instructions [here](https://github.com/GoogleCloudPlatform/generative-ai/tree/main/setup-env).

In [None]:
import sys

if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()

- If you are running this notebook in a local development environment:
  - Install the [Google Cloud SDK](https://cloud.google.com/sdk).
  - Obtain authentication credentials. Create local credentials by running the following command and following the oauth2 flow (read more about the command [here](https://cloud.google.com/sdk/gcloud/reference/beta/auth/application-default/login)):

    ```bash
    gcloud auth application-default login
    ```

### Import libraries

**Colab only:** Run the following cell to initialize the Vertex AI SDK. For Vertex AI Workbench, you don't need to run this.

In [None]:
import vertexai

PROJECT_ID = "[your-project-id]"  # @param {type:"string"}
REGION = "us-central1"

vertexai.init(project=PROJECT_ID, location=REGION)

In [None]:
from pathlib import Path as p
from pprint import pprint
import urllib.request
import warnings

from langchain import PromptTemplate
from langchain.chains.question_answering import load_qa_chain
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings import VertexAIEmbeddings
from langchain.llms import VertexAI
from langchain.vectorstores import Chroma
import pandas as pd

warnings.filterwarnings("ignore")
# restart python kernel if issues with langchain import.

### Import models

You load the pre-trained text generation and embeddings models, `text-bison` and `textembedding-gecko@001` respectively.

In [None]:
vertex_llm_text = VertexAI(model_name="text-bison")
vertex_embeddings = VertexAIEmbeddings(model_name="textembedding-gecko@001")

## Question Answering with large documents

Large language models (LLMs) are powerful tools that can answer a wide range of questions about a large document base. However, using an LLM for question answering comes with challenges. One of them is the limited knowledge of the LLM, especially when the documents are specific to a domain the model was not trained on.

One way to address this limitation is to supply more information about the documents using retrieval augmented generation. Retrieval augmented generation is a technique for using an LLM to answer questions about documents it was not trained on. The basic idea is to first retrieve the relevant documents from a corpus, then pass those documents (the context) along with the original question to the LLM. The LLM then generates a response that is informed by the retrieved documents.
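
The retrieve-then-prompt flow described above can be sketched in plain Python. This is only an illustration: the `retrieve` and `build_prompt` helpers are hypothetical, the word-overlap ranking is a toy stand-in for a real retriever, and the two passages are made up.

```python
def retrieve(question: str, corpus: list[str], k: int = 1) -> list[str]:
    """Return the k passages sharing the most words with the question (toy retriever)."""
    question_words = set(question.lower().split())
    ranked = sorted(
        corpus,
        key=lambda passage: len(question_words & set(passage.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, passages: list[str]) -> str:
    """Place the retrieved passages and the question into a single prompt string."""
    context = "\n".join(passages)
    return f"Context:\n{context}\nQuestion:\n{question}\nAnswer:"


# Made-up passages, used only to illustrate the flow.
corpus = [
    "The deployment pipeline ships the model to production.",
    "Monitoring tracks the quality of served predictions.",
]
sample_question = "What does monitoring track?"
retrieved = retrieve(sample_question, corpus)
rag_prompt = build_prompt(sample_question, retrieved)
print(rag_prompt)  # this prompt would then be sent to the LLM
```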


### Ingest documents

To begin, download the PDF file that is used for the question-answering tasks below.

In [None]:
data_folder = p.cwd() / "data"
p(data_folder).mkdir(parents=True, exist_ok=True)

pdf_url = "https://services.google.com/fh/files/misc/practitioners_guide_to_mlops_whitepaper.pdf"
pdf_file = str(p(data_folder, pdf_url.split("/")[-1]))

urllib.request.urlretrieve(pdf_url, pdf_file)

### Extract text from the PDF

You use `PyPDFLoader` to extract the text from the PDF and split it into page-level documents.

In [None]:
pdf_loader = PyPDFLoader(pdf_file)
pages = pdf_loader.load_and_split()
print(pages[3].page_content)

### Prompt Design

In a Q&A system, you define a question and the associated prompt.

The question is a string that the application is asked to answer. In this case, the question is `"What is Experimentation?"`.

The prompt is a template string that wraps the context and the question with instructions for the model. In this case, the prompt is:

```
Answer the question as precise as possible using the provided context.
If the answer is not contained in the context, say "answer not available in context" \n\n

Context: \n {context}?\n
Question: \n {question} \n
Answer:
```
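
Conceptually, the template is a format string whose `{context}` and `{question}` placeholders are filled in at query time. The sketch below illustrates the substitution with plain Python `str.format` rather than LangChain's `PromptTemplate`, and the context value is a placeholder, not text from the PDF.

```python
template = (
    "Answer the question as precise as possible using the provided context.\n"
    'If the answer is not contained in the context, say "answer not available in context"\n\n'
    "Context:\n{context}\n"
    "Question:\n{question}\n"
    "Answer:"
)

# Placeholder context, used only to show the substitution.
filled = template.format(
    context="(text extracted from the selected pages goes here)",
    question="What is Experimentation?",
)
print(filled)
```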

In [None]:
question = "What is Experimentation?"
prompt_template = """Answer the question as precise as possible using the provided context. If the answer is
                    not contained in the context, say "answer not available in context" \n\n
                    Context: \n {context}?\n
                    Question: \n {question} \n
                    Answer:
                  """

prompt = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

### Q&A without similarity search

Without similarity search, you choose the context yourself: you can pass the whole document, or only the part of the text in which you expect to find the answer.

In this example, you select the first seven pages as the context of your Q&A system.

#### Context Selection

In [None]:
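# Use `page` as the loop variable to avoid confusion with the `p` alias for `Path`
context = "\n".join(str(page.page_content) for page in pages[:7])
# len() on a string counts characters, not words
print("The total number of characters in the context: ", len(context))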

#### Q&A Methods or Chains

##### Method 1: Stuffing

`Stuffing` is a simple method for applying large language models (LLMs) to question-answering. It involves providing the LLM with all of the relevant data as context in the prompt.

In LangChain, you can use `StuffDocumentsChain` through the `load_qa_chain` function by setting `chain_type` to `stuff`.

In [None]:
stuff_chain = load_qa_chain(vertex_llm_text, chain_type="stuff", prompt=prompt)

After you create the chain with `load_qa_chain`, you can answer your question based on the input documents.

In [None]:
stuff_answer = stuff_chain(
    {"input_documents": pages[7:10], "question": question}, return_only_outputs=True
)

In [None]:
pprint(stuff_answer)

The `Stuffing` method has the advantage of only requiring a single call to the LLM, but it is limited by the LLM's context length and is not feasible for large amounts of data.
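
Before stuffing, it can help to estimate whether the combined text is likely to fit in the model's context window. The following sketch uses a rough heuristic of about four characters per token and a hypothetical budget of 8,000 tokens. Both numbers are assumptions for illustration, not documented limits of `text-bison`, and a real tokenizer would give a different count.

```python
CHARS_PER_TOKEN = 4  # rough heuristic for English text (assumption)
TOKEN_BUDGET = 8_000  # hypothetical input budget, not an official model limit


def estimate_tokens(texts: list[str]) -> int:
    """Estimate the token count of the concatenated texts from their length."""
    total_chars = sum(len(text) for text in texts)
    return -(-total_chars // CHARS_PER_TOKEN)  # ceiling division


def fits_in_budget(texts: list[str], budget: int = TOKEN_BUDGET) -> bool:
    """Return True when the estimated token count does not exceed the budget."""
    return estimate_tokens(texts) <= budget


small = ["a" * 2_000] * 3  # 6,000 characters, about 1,500 tokens
large = ["a" * 2_000] * 40  # 80,000 characters, about 20,000 tokens
print(fits_in_budget(small), fits_in_budget(large))  # True False
```

In this notebook, the same check could be applied to a list such as `[doc.page_content for doc in pages[7:]]`.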

Below, you see the exception that is raised when the context exceeds the LLM's limit.

In [None]:
try:
    print(
        stuff_chain(
            {"input_documents": pages[7:], "question": question},
            return_only_outputs=True,
        )
    )
except Exception as e:
    print(
        "The code failed since it won't be able to run inference on such a huge context and throws this exception: ",
        e,
    )

##### Method 2: MapReduce

With `MapReduce`, you can overcome the context limit. It involves dividing the document into chunks, running an initial prompt on each chunk, and then combining the results of the initial prompts using a different prompt.
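
The control flow described above can be sketched in plain Python. Here `fake_llm` is a hypothetical stand-in that returns a canned reply so the example runs without any model, and `map_reduce_answer` is an illustrative helper, not how `MapReduceDocumentsChain` is implemented.

```python
from typing import Callable


def map_reduce_answer(
    chunks: list[str],
    question: str,
    llm: Callable[[str], str],
) -> tuple[str, list[str]]:
    """Answer `question` with one map call per chunk and one combine call."""
    # Map step: ask the question against each chunk independently.
    partial_answers = [
        llm(f"Context:\n{chunk}\nQuestion:\n{question}\nAnswer:") for chunk in chunks
    ]
    # Reduce step: combine the partial answers in a single final prompt.
    summaries = "\n".join(partial_answers)
    final = llm(f"Summaries:\n{summaries}\nQuestion:\n{question}\nAnswer:")
    return final, partial_answers


calls: list[str] = []


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model: records the prompt and returns a canned reply."""
    calls.append(prompt)
    return f"reply {len(calls)}"


final, steps = map_reduce_answer(["chunk A", "chunk B", "chunk C"], "What is X?", fake_llm)
print(len(calls))  # 4: three map calls plus one combine call
print(steps)
print(final)
```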

In LangChain, you can use `MapReduceDocumentsChain` through the `load_qa_chain` function by setting `chain_type` to `map_reduce`.

The `load_qa_chain` function with `map_reduce` as `chain_type` uses two prompts: a question prompt and a combine prompt.

The question prompt asks the LLM to answer a question based on the provided context. In this case, the `question_prompt` is:

```
Answer the question as precise as possible using the provided context. \n\n
Context: \n {context} \n
Question: \n {question} \n
Answer:
```

The combine prompt asks the LLM to combine the extracted content and the question into a final answer. In this case, the `combine_prompt` is:

```
Given the extracted content and the question, create a final answer.
If the answer is not contained in the context, say "answer not available in context". \n\n
Summaries: \n {summaries}?\n
Question: \n {question} \n
Answer:
```


In [None]:
question_prompt_template = """
                    Answer the question as precise as possible using the provided context. \n\n
                    Context: \n {context} \n
                    Question: \n {question} \n
                    Answer:
                    """
question_prompt = PromptTemplate(
    template=question_prompt_template, input_variables=["context", "question"]
)

# The combine prompt must use the `summaries` variable: the chain passes the mapped answers under that name.
combine_prompt_template = """Given the extracted content and the question, create a final answer.
If the answer is not contained in the context, say "answer not available in context". \n\n
Summaries: \n {summaries}?\n
Question: \n {question} \n
Answer:
"""
combine_prompt = PromptTemplate(
    template=combine_prompt_template, input_variables=["summaries", "question"]
)

After you define the prompts, you create the chain with `load_qa_chain`.

In [None]:
map_reduce_chain = load_qa_chain(
    vertex_llm_text,
    chain_type="map_reduce",
    return_intermediate_steps=True,
    question_prompt=question_prompt,
    combine_prompt=combine_prompt,
)

Then you answer your question based on the input documents. Notice that you are passing the entire document base.

In [None]:
map_reduce_outputs = map_reduce_chain({"input_documents": pages, "question": question})

You can store the answers in a pandas DataFrame to inspect the `MapReduce` intermediate steps and the LLM's answers.

In [None]:
final_mp_data = []

# for each document, extract metadata and intermediate steps of the MapReduce process
for doc, out in zip(
    map_reduce_outputs["input_documents"], map_reduce_outputs["intermediate_steps"]
):
    output = {}
    output["file_name"] = p(doc.metadata["source"]).stem
    output["file_type"] = p(doc.metadata["source"]).suffix
    output["page_number"] = doc.metadata["page"]
    output["chunks"] = doc.page_content
    output["answer"] = out
    final_mp_data.append(output)

In [None]:
# create a dataframe from the list of dictionaries
pdf_mp_answers = pd.DataFrame.from_dict(final_mp_data)
# sorting the dataframe by filename and page_number
pdf_mp_answers = pdf_mp_answers.sort_values(by=["file_name", "page_number"])
pdf_mp_answers.reset_index(inplace=True, drop=True)
pdf_mp_answers.head()

In [None]:
index = 3
print("[Context]")
print(pdf_mp_answers["chunks"].iloc[index])
print("\n\n [Answer]")
print(pdf_mp_answers["answer"].iloc[index])
print("\n\n [Page number]")
print(pdf_mp_answers["page_number"].iloc[index])
print("\n\n [Source: file_name]")
print(pdf_mp_answers["file_name"].iloc[index])

In [None]:
index = 5
print("[Context]")
print(pdf_mp_answers["chunks"].iloc[index])
print("\n\n [Answer]")
print(pdf_mp_answers["answer"].iloc[index])
print("\n\n [Page number]")
print(pdf_mp_answers["page_number"].iloc[index])
print("\n\n [Source: file_name]")
print(pdf_mp_answers["file_name"].iloc[index])

**Consideration**: The `MapReduce` method has the advantage of being able to scale to larger amounts of data than the stuffing method, but it requires more calls to the LLM and may lose some information during the final combine call.

##### Method 3: Refine

With the `Refine` method, you try to overcome the loss of information of the `MapReduce` method. The method involves running an initial prompt on the first chunk of data and generating some output. For each remaining document, that output is passed in along with the document, and the LLM is asked to refine the output based on the new document.
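
The sequential flow described above can be sketched in plain Python. As an assumption for illustration, `fake_llm` is a hypothetical stand-in that returns a canned reply so the example runs without any model, and `refine_answer` is an illustrative helper rather than LangChain's implementation.

```python
from typing import Callable


def refine_answer(
    chunks: list[str],
    question: str,
    llm: Callable[[str], str],
) -> tuple[str, list[str]]:
    """Answer from the first chunk, then refine sequentially with each later chunk."""
    if not chunks:
        raise ValueError("at least one chunk is required")
    # Initial step: answer using only the first chunk.
    answer = llm(f"Context:\n{chunks[0]}\nQuestion:\n{question}\nAnswer:")
    steps = [answer]
    # Refine steps: each call sees the previous answer plus one new chunk.
    for chunk in chunks[1:]:
        answer = llm(
            f"The original question is:\n{question}\n"
            f"The provided answer is:\n{answer}\n"
            f"Refine the existing answer if needed with the following context:\n{chunk}\n"
        )
        steps.append(answer)
    return answer, steps


calls: list[str] = []


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model: records the prompt and returns a canned reply."""
    calls.append(prompt)
    return f"reply {len(calls)}"


final, steps = refine_answer(["chunk A", "chunk B", "chunk C"], "What is X?", fake_llm)
print(len(calls))  # 3: one initial call plus two refine calls
print(steps)
print(final)
```

Because each call needs the previous answer, the calls cannot be parallelized the way the map step of `MapReduce` can.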

In LangChain, you can use `RefineDocumentsChain` through the `load_qa_chain` function by setting `chain_type` to `refine`.

The `load_qa_chain` function with `refine` as `chain_type` uses two prompts: a refine prompt and an initial question prompt.

The `refine prompt` asks the LLM to refine an existing answer based on the provided context. In this case, the `refine prompt` is:

```
The original question is: \n {question} \n
The provided answer is: \n {existing_answer}\n
Refine the existing answer if needed with the following context: \n {context_str} \n
Given the extracted content and the question, create a final answer.
If the answer is not contained in the context, say "answer not available in context". \n\n
```

The `initial question` prompt asks the LLM to answer a question based on the provided context only. In this case, the `initial question prompt` is:

```
Answer the question as precise as possible using the provided context only. \n\n
Context: \n {context_str} \n
Question: \n {question} \n
Answer:
```

In [None]:
refine_prompt_template = """
    The original question is: \n {question} \n
    The provided answer is: \n {existing_answer}\n
    Refine the existing answer if needed with the following context: \n {context_str} \n
    Given the extracted content and the question, create a final answer.
    If the answer is not contained in the context, say "answer not available in context". \n\n
"""
refine_prompt = PromptTemplate(
    input_variables=["question", "existing_answer", "context_str"],
    template=refine_prompt_template,
)


initial_question_prompt_template = """
    Answer the question as precise as possible using the provided context only. \n\n
    Context: \n {context_str} \n
    Question: \n {question} \n
    Answer:
"""

initial_question_prompt = PromptTemplate(
    input_variables=["context_str", "question"],
    template=initial_question_prompt_template,
)

After you define the prompts, you create the chain with `load_qa_chain`.

In [None]:
refine_chain = load_qa_chain(
    vertex_llm_text,
    chain_type="refine",
    return_intermediate_steps=True,
    question_prompt=initial_question_prompt,
    refine_prompt=refine_prompt,
)

Then you answer your question based on the input documents. Notice that you are passing the entire document base.

In [None]:
refine_outputs = refine_chain({"input_documents": pages, "question": question})

You can store the answers in a pandas DataFrame to inspect the `Refine` intermediate steps and the LLM's answers.

In [None]:
final_refine_data = []
for doc, out in zip(
    refine_outputs["input_documents"], refine_outputs["intermediate_steps"]
):
    output = {}
    output["file_name"] = p(doc.metadata["source"]).stem
    output["file_type"] = p(doc.metadata["source"]).suffix
    output["page_number"] = doc.metadata["page"]
    output["chunks"] = doc.page_content
    output["answer"] = out
    final_refine_data.append(output)

In [None]:
pdf_refine_answers = pd.DataFrame.from_dict(final_refine_data)
pdf_refine_answers = pdf_refine_answers.sort_values(
    by=["file_name", "page_number"]
)  # sorting the dataframe by filename and page_number
pdf_refine_answers.reset_index(inplace=True, drop=True)
pdf_refine_answers.head()

In [None]:
index = 3
print("[Context]")
print(pdf_refine_answers["chunks"].iloc[index])
print("\n\n [Answer]")
print(pdf_refine_answers["answer"].iloc[index])
print("\n\n [Page number]")
print(pdf_refine_answers["page_number"].iloc[index])
print("\n\n [Source: file_name]")
print(pdf_refine_answers["file_name"].iloc[index])

In [None]:
index = 5
print("[Context]")
print(pdf_refine_answers["chunks"].iloc[index])
print("\n\n [Answer]")
print(pdf_refine_answers["answer"].iloc[index])
print("\n\n [Page number]")
print(pdf_refine_answers["page_number"].iloc[index])
print("\n\n [Source: file_name]")
print(pdf_refine_answers["file_name"].iloc[index])

**Consideration**: So far, you have used either part of the document or the entire document as the context to answer your question. Both approaches have limitations: a partial context may be incomplete, and a full context is slow to query, especially for large documents.

Similarity search over a vector database is another approach that addresses these limitations.


### Q&A with similarity search

With similarity search over a vector database, each piece of context is represented as a vector. These vectors are then stored in a database. When a user asks a question, the system first calculates the similarity between the question and the vectors in the database. The most similar vectors are then used to fetch the context that is relevant to the question.
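
The ranking step can be illustrated with a small, self-contained sketch. It uses cosine similarity, which is one common choice; the metric that a given vector database applies depends on its configuration. The three-dimensional vectors are made up and stand in for real embeddings, and `top_k` is a hypothetical helper.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], index: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the names of the k stored vectors most similar to the query."""
    scored = sorted(
        index.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:k]]


# Toy 3-dimensional vectors standing in for real embeddings (made up for illustration).
index = {
    "page 1": [1.0, 0.0, 0.0],
    "page 2": [0.8, 0.6, 0.0],
    "page 3": [0.0, 0.0, 1.0],
}
query = [0.9, 0.1, 0.0]
print(top_k(query, index, k=2))  # ['page 1', 'page 2']
```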

This approach has several advantages, including context that is more relevant to the user's question.

In this case, you use `Chroma`, an in-memory open-source embedding database, to create the similarity search index.

#### Context Selection

Create the similarity search index using `Chroma`.

`Chroma.from_documents` accepts the `Document` objects returned by document loaders such as `PyPDFLoader`.

In [None]:
vector_index = Chroma.from_documents(pages, vertex_embeddings).as_retriever()

Next, retrieve relevant context using the original question.

In [None]:
docs = vector_index.get_relevant_documents(question)

#### MapReduce method

Finally, you answer your question based on the context you retrieved from the embeddings database and the input question.


In [None]:
map_reduce_embeddings_outputs = map_reduce_chain(
    {"input_documents": docs, "question": question}
)

In [None]:
print(map_reduce_embeddings_outputs["output_text"])

You can store the answers in a pandas DataFrame to inspect the intermediate steps of `MapReduce with similarity search` and the LLM's answers.

In [None]:
final_mpe_data = []
for doc, out in zip(
    map_reduce_embeddings_outputs["input_documents"],
    map_reduce_embeddings_outputs["intermediate_steps"],
):
    output = {}
    output["file_name"] = p(doc.metadata["source"]).stem
    output["file_type"] = p(doc.metadata["source"]).suffix
    output["page_number"] = doc.metadata["page"]
    output["chunks"] = doc.page_content
    output["answer"] = out
    final_mpe_data.append(output)

In [None]:
pdf_mpe_answers = pd.DataFrame.from_dict(final_mpe_data)
pdf_mpe_answers = pdf_mpe_answers.sort_values(
    by=["file_name", "page_number"]
)  # sorting the dataframe by filename and page_number
pdf_mpe_answers.reset_index(inplace=True, drop=True)
pdf_mpe_answers.head()

## Conclusion

This notebook demonstrated how to build a question-answering (Q&A) system using LangChain with the Vertex AI PaLM API to extract information from large documents.

In this case, you use Chroma, an in-memory open-source embedding database, to create the similarity search index. [LangChain](https://python.langchain.com/docs/integrations/vectorstores/matchingengine) also supports Vertex AI Matching Engine, the Google Cloud high-scale, low-latency vector database. With Vertex AI Matching Engine, you have a fully managed service that can scale to meet the needs of even the most demanding applications. It provides high performance for both training and inference, and it has several features, including support for multiple similarity metrics, batch inference, and online learning. These features can be important for applications that need to perform complex matching tasks or that need to adapt to changing data.