In [None]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Question Answering with Large Documents using LangChain
This notebook demonstrates how to build a question-answering (Q&A) system using LangChain with Vertex AI Gemini to extract information from large documents.
Learn more about [LangChain](https://python.langchain.com/en/latest/use_cases/question_answering.html) and [Vertex Generative AI](https://cloud.google.com/vertex-ai/docs/generative-ai/learn/overview).

<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/gemini/orchestration/rag_with_langchain_vertexai_gemini.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo"><br> Run in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/orchestration/rag_with_langchain_vertexai_gemini.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/gemini/orchestration/rag_with_langchain_vertexai_gemini.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo"><br> Open in Vertex AI Workbench
    </a>
  </td>
</table>

### Objective
LangChain makes it straightforward to orchestrate RAG components using the Retrieval QA chain, which integrates with the Vertex AI API for text, Vertex AI Embeddings, and Vertex AI Vector Search. LangChain provides flexible abstractions in the form of document loaders, retrievers, and chains to implement stages such as loading data, chunking, and embedding with just a few lines of code.

In this tutorial, you learn how to:

- Ingest documents using a LangChain document loader
- Chunk (split) documents using a LangChain text splitter
- Create embeddings and store them in a LangChain vector store
- Design a prompt for question answering
- Query the documents using RetrievalQA from LangChain

### Costs

This tutorial uses billable components of Google Cloud, including Vertex AI.

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing),
and use the [Pricing Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

## Initial Setup

In [None]:
# Base system dependencies
!sudo apt -y -qq install tesseract-ocr libtesseract-dev

# required by PyPDF2 for page count and other pdf utilities
!sudo apt-get -y -qq install poppler-utils python-dev libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext tesseract-ocr flac ffmpeg lame libmad0 libsox-fmt-mp3 sox libjpeg-dev swig


In [None]:
!pip install google-cloud-aiplatform
!pip install langchain
!pip install langchain-google-vertexai
!pip install unstructured
!pip install "unstructured[pdf]"
!pip install chromadb
!pip install rank_bm25
!pip install prettyprinter


In [None]:
import warnings

warnings.filterwarnings("ignore")

In [None]:
from google.colab import auth

auth.authenticate_user()

In [None]:
from google.cloud import aiplatform

PROJECT_ID = ""  # @param {type:"string"}
REGION = "us-central1"  # @param {type:"string"}

aiplatform.init(project=PROJECT_ID, location=REGION)

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_google_vertexai import VertexAIEmbeddings

### Ingesting documents using LangChain Document loader

In this step, we ingest Alphabet's 10-K annual report, which is 92 pages long, using the LangChain document loader APIs.

In [None]:
# Ingest PDF files
from langchain_community.document_loaders import PyPDFLoader

# Load GOOG's 10K annual report (92 pages).
url = "https://abc.xyz/assets/investor/static/pdf/20230203_alphabet_10K.pdf"
loader = PyPDFLoader(url)
documents = loader.load()

### Chunking or splitting documents using LangChain Text splitter
Once the documents are loaded, LangChain's RecursiveCharacterTextSplitter handles large documents by recursively splitting them into smaller chunks based on character count. This text splitter is particularly useful for lengthy documents that exceed the context window of a language model.

In [None]:
# split the documents into chunks

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
print(f"# of documents = {len(docs)}")

# of documents = 263
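
To illustrate the idea behind recursive splitting, here is a minimal pure-Python sketch. It is not LangChain's implementation: the function name `recursive_split` and the separator order are assumptions made for illustration, and like the cell above it uses no chunk overlap. It tries the coarsest separator first and falls back to finer ones only for pieces that are still too long.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters (illustrative sketch)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character offsets.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            # Keep merging neighbouring parts while they fit in one chunk.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # The part alone is too long: retry with finer separators.
            chunks.extend(recursive_split(part, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa bbbb cccc\n\ndddd eeee"
print(recursive_split(sample, chunk_size=10))
# ['aaaa bbbb', 'cccc', 'dddd eeee']
```

Splitting on paragraph and line breaks before spaces tends to keep related sentences in the same chunk, which usually helps retrieval quality.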


**Loading models for LLM and Embeddings from LangChain Google Vertex AI library**

To create embeddings and answer questions, we load models from the LangChain Vertex AI library. With LangChain you can access Google Cloud foundation models, open-source models such as Gemma, and third-party models in Vertex AI Model Garden.


In [None]:
from langchain_google_vertexai import VertexAI

llm = VertexAI(model_name="gemini-pro")
embeddings = VertexAIEmbeddings(model_name="textembedding-gecko@latest")

Asking a question without RAG, using the Gemini foundation model on Vertex AI directly

In [None]:
message = "What was Alphabet's net income in 2022?"
llm.invoke(message)

'$59.99 billion'

### Embeddings and storing embeddings in LangChain Vector store Chroma
In this step we use the LangChain vector store library to import Chroma, an open-source, lightweight, and easy-to-use vector database designed for storing and querying embeddings. As shown in the code below, the Chroma database (db) is created using the from_documents method, which takes a list of documents (docs) and an embeddings object (Vertex AI Embeddings) to convert the documents into vector representations and store them.

In [None]:
# Store docs in a local vector store as an index.
# This may take a while because the embeddings API is rate limited.
from langchain_community.vectorstores import Chroma

db = Chroma.from_documents(docs, embeddings)

In [None]:
# Expose index to the retriever
retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": 2})
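
To make the similarity search with k=2 more concrete, the sketch below ranks a few chunks against a query. It is only an illustration under loud assumptions: `toy_embed` is a hypothetical bag-of-words stand-in for the Vertex AI embedding model, the sample chunks are made up, and cosine similarity is used for simplicity, so Chroma's actual index and distance metric may differ.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical embedding: word counts instead of a learned dense vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, chunks, k=2):
    """Return the k chunks most similar to the query, best match first."""
    query_vec = toy_embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, toy_embed(c)), reverse=True)
    return ranked[:k]


chunks = [
    "Net income was 59,972 million in 2022",
    "The company has offices worldwide",
    "Revenues grew to 282,836 million in 2022",
]
for chunk in top_k("What was Alphabet's net income in 2022?", chunks, k=2):
    print(chunk)
```

A real embedding model captures meaning rather than exact word overlap, but the retrieval logic is the same: embed the query, score every stored chunk, and keep the top k.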

### Using RetrievalQA from LangChain to query

In this step we use the RetrievalQA class from LangChain, a high-level abstraction that combines document retrieval and question answering into a single, easy-to-use interface. It retrieves the chunks most relevant to the query from the search index and passes them to the LLM, which synthesizes an answer from them. We use its RetrievalQA.from_chain_type() method to create the QA system.


In [None]:
# Create chain to answer questions
from langchain.chains import RetrievalQA

# Uses LLM to synthesize results from the search index.
# We use Vertex Gemini
qa = RetrievalQA.from_chain_type(
    llm=llm, chain_type="stuff", retriever=retriever, return_source_documents=True
)
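
The "stuff" chain type places all retrieved chunks into a single prompt together with the question. The sketch below shows that idea in plain Python. The function name `build_stuff_prompt` and the template wording are hypothetical and are not LangChain's default prompt.

```python
def build_stuff_prompt(question, chunks):
    """Concatenate retrieved chunks into one prompt (illustrative template only)."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say that you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


retrieved = [
    "Net income $ 40,269 $ 76,033 $ 59,972",
    "(in millions, except per share amounts)",
]
print(build_stuff_prompt("What was Alphabet's net income in 2022?", retrieved))
```

Because every chunk goes into one prompt, this approach only works while k chunks of the chosen size fit in the model's context window, which is one reason to keep both the chunk size and k small.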

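Asking the same question using RAG gives a response grounded in the document text. The retrieved income statement reports amounts in millions, so an answer of $59,972 corresponds to about $59.97 billion.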

In [None]:
query = "What was Alphabet's net income in 2022?"
result = qa.invoke({"query": query})
print(result)

{'query': "What was Alphabet's net income in 2022?", 'result': '$59,972', 'source_documents': [Document(page_content='Alphabet Inc.\nCONSOLIDATED STATEMENTS OF INCOME\n(in millions, except per share amounts)\n Year Ended December 31,\n 2020 2021 2022\nRevenues $ 182,527 $ 257,637 $ 282,836 \nCosts and expenses:\nCost of revenues  84,732  110,939  126,203 \nResearch and development  27,573  31,562  39,500 \nSales and marketing  17,946  22,912  26,567 \nGeneral and administrative  11,052  13,510  15,724 \nTotal costs and expenses  141,303  178,923  207,994 \nIncome from operations  41,224  78,714  74,842 \nOther income (expense), net  6,858  12,020  (3,514) \nIncome before income taxes  48,082  90,734  71,328 \nProvision for income taxes  7,813  14,701  11,356 \nNet income $ 40,269 $ 76,033 $ 59,972 \nBasic net income per share of Class A, Class B, and Class C stock $ 2.96 $ 5.69 $ 4.59 \nDiluted net income per share of Class A, Class B, and Class C stock $ 2.93 $ 5.61 $ 4.56 \nSee accom