## **RAG**: Retrieval-Augmented Generation
External data source: Alphabet Inc.'s annual financial report (Form 10-K)

![RAG](https://python.langchain.com/assets/images/rag_indexing-8160f90a90a33253d0154659cf7d453f.png)

![RAG](https://python.langchain.com/assets/images/rag_retrieval_generation-1046a4668d6bb08786ef73c56d4f228a.png)

#### Installing Dependencies

In [None]:
!pip install langchain-google-genai

In [None]:
!pip install langchain_community



In [None]:
!pip install pypdf



In [None]:
!pip install tiktoken



In [None]:
!pip install chromadb



In [None]:
!pip install langchain-chroma



In [None]:
!pip install unstructured

Collecting unstructured
  Downloading unstructured-0.17.2-py3-none-any.whl.metadata (24 kB)
Collecting python-magic (from unstructured)
  Downloading python_magic-0.4.27-py2.py3-none-any.whl.metadata (5.8 kB)
Collecting emoji (from unstructured)
  Downloading emoji-2.14.1-py3-none-any.whl.metadata (5.7 kB)
Collecting python-iso639 (from unstructured)
  Downloading python_iso639-2025.2.18-py3-none-any.whl.metadata (14 kB)
Collecting langdetect (from unstructured)
  Downloading langdetect-1.0.9.tar.gz (981 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 981.5/981.5 kB 13.4 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting rapidfuzz (from unstructured)
  Downloading rapidfuzz-3.13.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting unstructured-client (from unstructured)
  Downloading unstructured_client-0.34.0-py3-none-any.whl.metadata (21 kB)
Collecting python-oxmsg (from un

In [None]:
!pip install pdfminer.six

Collecting pdfminer.six
  Downloading pdfminer_six-20250506-py3-none-any.whl.metadata (4.2 kB)
Downloading pdfminer_six-20250506-py3-none-any.whl (5.6 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.6/5.6 MB 33.1 MB/s eta 0:00:00
Installing collected packages: pdfminer.six
Successfully installed pdfminer.six-20250506


In [None]:
!pip install pi-heif

Collecting pi-heif
  Downloading pi_heif-0.22.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.5 kB)
Downloading pi_heif-0.22.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 12.3 MB/s eta 0:00:00
Installing collected packages: pi-heif
Successfully installed pi-heif-0.22.0


In [None]:
!pip install "unstructured[local-inference]"
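
As a quick sanity check after the installs above, the sketch below reports which of the expected modules cannot be found on the import path. The helper `missing_modules` and the list `REQUIRED_MODULES` are hypothetical names introduced here for illustration; the entries are the import names that correspond to the pip packages installed above, which in two cases differ from the pip names.

```python
from importlib.util import find_spec

# Import names (not pip names) for the packages installed above.
REQUIRED_MODULES = [
    "langchain_google_genai",
    "langchain_community",
    "pypdf",
    "tiktoken",
    "chromadb",
    "langchain_chroma",
    "unstructured",
    "pdfminer",  # provided by the pdfminer.six package
    "pi_heif",   # provided by the pi-heif package
]


def missing_modules(names):
    """Return the names that cannot be located on the current import path."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
print("Missing modules:", missing or "none")
```

An empty result only shows that the modules can be located; it does not confirm that their versions are compatible with each other.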



#### Document Loading

In [None]:
from langchain_community.document_loaders import UnstructuredPDFLoader

# Load Alphabet's 10-K. In the loader's default mode the whole PDF
# is returned as a single Document inside a list.
loader = UnstructuredPDFLoader("goog-10-k-2024.pdf")
doc = loader.load()

#### Document Chunking

In [None]:
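from langchain_text_splitters import RecursiveCharacterTextSplitter

# chunk_size and chunk_overlap are measured in characters by default.
# Consecutive chunks share up to 100 characters so that sentences cut at a
# chunk boundary still appear with some surrounding context.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(doc)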
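
To make the two parameters concrete, here is a minimal sketch of fixed-window chunking with overlap. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally: the real splitter tries a list of separators (paragraphs, lines, words) before falling back to characters. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=100):
    """Split text into windows of chunk_size characters.

    Each window starts (chunk_size - chunk_overlap) characters after the
    previous one, so neighbouring windows share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghij" * 3  # 30 characters
pieces = sliding_window_chunks(sample, chunk_size=12, chunk_overlap=4)
for piece in pieces:
    print(repr(piece))
```

A larger overlap reduces the chance that a fact is split across two chunks, at the cost of storing and embedding more duplicated text.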

#### Defining the Embeddings Model

In [None]:
import os
os.environ["GOOGLE_API_KEY"] = "<api-key>"
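
Hard-coding a key in a notebook makes it easy to leak when the notebook is shared. One possible alternative, sketched below, reads the key from the environment and prompts for it only when it is absent. The helper `ensure_api_key` and its `prompt_fn` parameter are hypothetical names introduced for this example; `getpass` comes from the Python standard library.

```python
import os
from getpass import getpass


def ensure_api_key(env_var="GOOGLE_API_KEY", prompt_fn=getpass):
    """Return the key from the environment, prompting only when it is absent."""
    key = os.environ.get(env_var)
    if not key:
        # getpass hides the typed value, so it does not end up in cell output.
        key = prompt_fn(f"Enter {env_var}: ")
        os.environ[env_var] = key
    return key
```

Calling `ensure_api_key()` in a cell before creating the embeddings model would then replace the hard-coded assignment.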

In [None]:
from langchain_google_genai import GoogleGenerativeAIEmbeddings

embeddings_model = GoogleGenerativeAIEmbeddings(model="models/embedding-001")

#### Inserting Embeddings in Vector Store

In [None]:
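from langchain_chroma import Chroma

# Embed every chunk and store the vectors in a Chroma collection.
# No persist_directory is given, so the collection lives in memory only.
vector_db = Chroma.from_documents(chunks, embeddings_model)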

In [None]:
retriever = vector_db.as_retriever()
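
Conceptually, the retriever embeds the question and returns the stored chunks whose vectors are closest to it. The sketch below illustrates that idea with hand-made three-dimensional vectors and cosine similarity. It is an assumption-laden toy: the vectors are invented, `top_k` and `toy_index` are hypothetical names, and the distance metric Chroma actually uses depends on how the collection is configured. The default `k=4` mirrors what LangChain documents for similarity search, but treat that as an assumption too.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k=4):
    """Return the texts of the k entries most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Invented toy vectors; real embeddings have hundreds of dimensions.
toy_index = [
    ("revenue by region", [0.9, 0.1, 0.0]),
    ("risk factors", [0.1, 0.9, 0.1]),
    ("employee headcount", [0.0, 0.2, 0.9]),
]
results = top_k([1.0, 0.2, 0.0], toy_index, k=2)
print(results)
```

A real vector store avoids scoring every entry by using an approximate nearest-neighbour index, but the ranking idea is the same.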

#### Defining the RAG Pipeline

In [None]:
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

template = """Answer the question based only on the following context:

{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)
model = ChatGoogleGenerativeAI(model="gemini-2.0-flash")


def format_docs(docs):
    """Join the retrieved documents' text into one context string."""
    return "\n\n".join(d.page_content for d in docs)


chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)
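
The chain above can be read as three steps: retrieve chunks for the question, fill the prompt template, and send the result to the model. The sketch below reproduces only the prompt-filling step with plain strings so that the text the model receives is visible. `build_prompt`, `format_chunks`, and the placeholder chunk strings are hypothetical and stand in for the retriever output; no model is called.

```python
TEMPLATE = """Answer the question based only on the following context:

{context}

Question: {question}
"""


def format_chunks(chunks):
    """Join chunk texts with blank lines, as format_docs does for Documents."""
    return "\n\n".join(chunks)


def build_prompt(question, retrieved_chunks):
    """Fill the template with the retrieved context and the user question."""
    return TEMPLATE.format(
        context=format_chunks(retrieved_chunks),
        question=question,
    )


# Placeholder strings standing in for chunks returned by the retriever.
prompt_text = build_prompt(
    "What were total revenues?",
    ["first retrieved chunk", "second retrieved chunk"],
)
print(prompt_text)
```

Because the template restricts the model to the supplied context, the quality of the answer depends directly on which chunks the retriever returns.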

#### Checking with Relevant Questions

In [None]:
chain.invoke("What was alphabet's Total revenues")

"Based on the provided text, Alphabet's **Consolidated revenues** for the year ended December 31, 2024 was **$350,018** (millions)."

In [None]:
chain.invoke("What was alphabet's Total revenues in APAC")

"Based on the provided text, Alphabet's total revenues in APAC for the year ended December 31, 2024, was $56,815 million."

#### Checking with Irrelevant Questions

In [None]:
chain.invoke("What is text embedding and how does langchain help in doing it")

'The provided text does not mention "text embedding" or "Langchain." Therefore, I cannot answer your question.'