**Step 1**

Install all dependencies and set up the API keys.

Get the GOOGLE_API_KEY from Google AI Studio, and get the LangSmith API key from your LangSmith account.

IMPORTANT NOTE: Select the T4 GPU runtime in the top right, under the Gemini button. If it is not selected, the embeddings model and vector store will be extremely slow.

In [None]:
!pip install -U langchain langchain-openai
!pip install --quiet --upgrade langchain-text-splitters langchain-community langgraph
!pip install tiktoken
!pip install -qU "langchain[google-genai]"
!pip install pypdf
!pip install datasets
!pip install -qU langchain-huggingface
!pip install -qU langchain-chroma
!pip install opentelemetry-api==1.27.0 opentelemetry-sdk==1.27.0
!pip install "huggingface_hub[hf_xet]"


In [None]:
import torch
print(torch.cuda.is_available())  # should return True

True


In [None]:
import getpass
import os

from langchain.chat_models import init_chat_model

# Prompt for the Gemini key only if it is not already set in the environment
if not os.environ.get("GOOGLE_API_KEY"):
    os.environ["GOOGLE_API_KEY"] = getpass.getpass("Enter API key for Google Gemini: ")

llm = init_chat_model("gemini-2.0-flash", model_provider="google_genai")

# LangSmith tracing configuration
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_ENDPOINT"] = "https://api.smith.langchain.com"
if not os.environ.get("LANGSMITH_API_KEY"):
    os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter API key for Langsmith")
os.environ["LANGSMITH_PROJECT"] = "pr-roasted-radar-54"

Enter API key for Google Gemini: ··········
Enter API key for Langsmith··········
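
The "prompt only if the variable is missing" pattern used in the cell above can be wrapped in a small helper. The following is a minimal sketch, not part of the original notebook: `ensure_env`, `fake_ask`, and `DEMO_API_KEY` are hypothetical names, and a stub replaces `getpass.getpass` so the example runs without an interactive prompt.

```python
import os
from typing import Callable


def ensure_env(name: str, prompt: str, ask: Callable[[str], str]) -> str:
    """Return os.environ[name], asking for a value only when it is missing or empty."""
    if not os.environ.get(name):
        os.environ[name] = ask(prompt)
    return os.environ[name]


# Hypothetical stub; in the notebook you would pass getpass.getpass instead.
def fake_ask(prompt: str) -> str:
    return "demo-key"


os.environ.pop("DEMO_API_KEY", None)  # start from a clean state
first = ensure_env("DEMO_API_KEY", "Enter demo key: ", fake_ask)
# The second call finds the variable already set, so its prompt function is never used.
second = ensure_env("DEMO_API_KEY", "Enter demo key: ", lambda _: "other-key")
print(first, second)
```

With a helper like this, re-running the cell does not ask for a key that is already present in the session.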


**Step 2**

Load in data from src_doc_files_example

The folder should be uploaded to your My Drive folder. Once your drive is mounted (drive.mount), you should be able to see "drive" in the folders section on the left-hand side of your screen. Navigate to the src_doc_files_example folder, click the three dots, and click "copy path". Paste this in as the root_folder below.

At the end, we will look at some of the documents we loaded, as well as some sample questions that you can use as a basis for writing your own.

In [None]:
import glob
import os
from collections import defaultdict

from google.colab import drive
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.documents import Document

# Before running this cell, upload the folder of PDFs to your Google Drive (under MyDrive).
# You will be prompted to log in when the drive is mounted.
drive.mount("/content/drive", force_remount=True)

# Step 1: Define the root folder where your PDFs are stored
root_folder = "/content/drive/MyDrive/ColabNotebooks/fin_docs"

# Step 2: Recursively get all PDF file paths in the folder and subfolders
pdf_files = glob.glob(os.path.join(root_folder, '**', '*.pdf'), recursive=True)

# TEST: the following print statement should list every PDF under root_folder
print("pdf files: ")
print(pdf_files)

# Step 3: Load all PDFs into a list of LangChain Document objects
all_documents = []

# document_lst is passed by reference so that it can be mutated in place
def load_pdfs(pdf_files, document_lst):
    for pdf_path in pdf_files:
        loader = PyPDFLoader(pdf_path)
        # lazy_load yields the Document objects of this PDF one at a time
        document_lst.extend(loader.lazy_load())

load_pdfs(pdf_files, all_documents)

# Final result: all_documents contains all Document objects from all PDFs
print(f"Loaded {len(all_documents)} documents from {len(pdf_files)} PDF files.")

# The loader splits each PDF across multiple elements of the list, so we
# combine the elements that share a source into one string per file
document_dict = defaultdict(str)

for document in all_documents:
    document_dict[document.metadata["source"]] += document.page_content
    # ensure that this loop is making progress by checking when each document is getting added
    print("added " + document.metadata["source"] + " to list of documents")

# Step 4: Convert the dictionary of documents to a list of Documents, using the
# file name without its directory or ".pdf" extension as the source
docs_list = []
for key, value in document_dict.items():
    docs_list.append(Document(page_content=value, metadata={"source": key.split("/")[-1].split(".pdf")[0]}))


Mounted at /content/drive
pdf files: 
['/content/drive/MyDrive/ColabNotebooks/fin_docs/ADI_2009.pdf', '/content/drive/MyDrive/ColabNotebooks/fin_docs/ABMD_2012.pdf', '/content/drive/MyDrive/ColabNotebooks/fin_docs/GS_2016.pdf', '/content/drive/MyDrive/ColabNotebooks/fin_docs/JKHY_2015.pdf']
Loaded 494 documents from 4 PDF files.
added /content/drive/MyDrive/ColabNotebooks/fin_docs/ADI_2009.pdf to list of documents
added /content/drive/MyDrive/ColabNotebooks/fin_docs/ADI_200
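
The merge performed at the end of the loading cell (many list elements per PDF combined into one text per file) can be illustrated without LangChain. This is a minimal sketch under stated assumptions: the `(source, text)` tuples and their contents are hypothetical stand-ins for the loaded Document objects.

```python
from collections import defaultdict

# Hypothetical (source, page_text) pairs standing in for the loaded Document objects.
pages = [
    ("/docs/ADI_2009.pdf", "Dear Shareholders: "),
    ("/docs/ADI_2009.pdf", "Focused innovation..."),
    ("/docs/GS_2016.pdf", "Annual report."),
]

merged = defaultdict(str)
for source, text in pages:
    # Elements arrive in reading order, so concatenation keeps that order.
    merged[source] += text

# Drop the directory and the ".pdf" suffix, as the notebook does for metadata["source"].
by_name = {src.split("/")[-1].split(".pdf")[0]: text for src, text in merged.items()}
print(by_name)
```

The short names produced here are what later has to match the `doc_name` values of the question dataset.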

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Let us inspect a sample document along with some questions we might want to ask
print(docs_list[0])
doc_names_lst = []
for doc in docs_list:
    print(doc.metadata["source"])
    doc_names_lst.append(doc.metadata["source"])
doc_names_set = set(doc_names_lst)

from ast import literal_eval

def _process_answer_entry(dataset_name, entry):
    """Extract the answer fields of one row; the layout depends on the dataset."""
    if dataset_name == "fin":
        answers = {"str_answer": entry["answer_1"], "exe_answer": entry["answer_2"]}
    elif dataset_name == "tat":
        raw_answer = entry["answer"]
        # in hf parquet, the answer is already a sequence
        answer = (
            literal_eval(raw_answer)
            if isinstance(raw_answer, str)
            else list(raw_answer)
        )
        answers = {
            "answer": answer,
            "answer_type": entry["answer_type"],
            "scale": entry["answer_scale"],
        }
    elif dataset_name in ["paper_tab", "paper_text"]:
        answers = [entry["answer_1"], entry["answer_2"], entry["answer_3"]]
    elif dataset_name == "feta":
        answers = entry["answer"]
    elif dataset_name == "nq":
        answers = {
            "short_answer": entry["short_answer"],
            "long_answer": entry["long_answer"],
        }
    else:
        # without this branch, an unknown name would fail with an unbound variable
        raise ValueError(f"Unknown dataset name: {dataset_name}")
    return answers


def qa_df_to_dict(dataset_name, df):
    """Group the Q&A rows of df by document name."""
    qas_dict = {}
    for _, row in df.iterrows():
        doc_name = str(row["doc_name"])
        answers = _process_answer_entry(dataset_name, row)
        qa_dict = {
            "question": row["question"],
            "answers": answers,
            "q_uid": str(row["q_uid"]),
        }
        qas_dict.setdefault(doc_name, []).append(qa_dict)
    return qas_dict

page_content='Dear Shareholders:
While ﬁscal year 2009 proved challenging across virtually every market in the world, we continued to steer Analog Devices on the
path to enduring success. There is much work left to do in 2010, but we enter the new year a stronger and more competitive
company. In this letter, I will describe the actions we have taken, the impact that we believe these actions will have, and why we
remain very enthusiastic about ADI’s future. 
Focused Innovation Remains the Lifeblood of ADI
It is intuitive that innovation will
separate the best technology companies
from the mediocre ones, and that
innovation is not a spigot that can be
turned on and off in response to
short-term order trends. Given the
proliferation of signal processing into
virtually every end market, we have
historically invested in many diverse
product and market opportunities;
some investments resulted in
innovations that have produced
signiﬁcant and sustainable returns, and
others have not met our ex

In [None]:
from datasets import load_dataset

# Dataset to load questions from; "fin" selects the questions for the financial documents.
# Change this to another name handled by qa_df_to_dict (such as "tat") to load different questions.
DATASET_NAME = "fin"

hf_dataset = load_dataset("qinchuanhui/UDA-QA", DATASET_NAME)
hf_data = hf_dataset["test"]
df = hf_data.to_pandas()
qas_dict = qa_df_to_dict(DATASET_NAME, df)


# Keep the first two Q&A pairs of every document that we loaded earlier
test_questions = dict()
for key in list(qas_dict.keys()):
    if key in doc_names_set:
        test_questions[key] = qas_dict[key][:2]
        print("Document Name: ", key)
        print("Its Q&A pairs: ", qas_dict[key][:2])
        print("=========================================")

keys = list(test_questions.keys())

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Document Name:  ADI_2009
Its Q&A pairs:  [{'question': 'what is the the interest expense in 2009?', 'answers': {'str_answer': '380', 'exe_answer': '3.8'}, 'q_uid': 'ADI/2009/page_49.pdf-1'}, {'question': 'what is the expected growth rate in amortization expense in 2010?', 'answers': {'str_answer': '-27.0%', 'exe_answer': '-0.26689'}, 'q_uid': 'ADI/2009/page_59.pdf-2'}]
Document Name:  ABMD_2012
Its Q&A pairs:  [{'question': 'during the 2012 year , did the equity awards in which the prescribed performance milestones were achieved exceed the equity award compensation expense for equity granted during the year?', 'answers': {'str_answer': '', 'exe_answer': 'yes'}, 'q_uid': 'ABMD/2012/page_75.pdf-1'}, {'question': 'for equity awards where the performance criteria has been met in 2012 , what is the average compensation expense per year over which the cost will be expensed?', 'answers': {'str_answer': '1719526', 'exe_answer': '1714285.71429'}, 'q_uid': 'ABMD/2012/page_75.pdf-2'}]
Document Na

**Step 3**

Let us split our text into chunks, create our vector embeddings, and load them into our vector store

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=3000,  # chunk size in characters, not in words
    chunk_overlap=300,  # consecutive chunks share up to 300 characters
)
text_chunks = text_splitter.split_documents(docs_list)
# Ensure the splitting worked as expected. You should see the beginning of the first document you saw earlier
doc_1 = text_chunks[0].page_content.strip()
print("reconstructed beginning of doc1: ")
print("---------")
print(doc_1)
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# from langchain_chroma import Chroma
from langchain_core.vectorstores import InMemoryVectorStore

# vector_store = Chroma(
#     collection_name="example_collection",
#     embedding_function=embeddings,
#     persist_directory="./chroma_langchain_db",  # Where to save data locally, remove if not necessary
# )
vector_store = InMemoryVectorStore(embeddings)

_ = vector_store.add_documents(text_chunks)

reconstructed beginning of doc1: 
---------
Dear Shareholders:
While ﬁscal year 2009 proved challenging across virtually every market in the world, we continued to steer Analog Devices on the
path to enduring success. There is much work left to do in 2010, but we enter the new year a stronger and more competitive
company. In this letter, I will describe the actions we have taken, the impact that we believe these actions will have, and why we
remain very enthusiastic about ADI’s future. 
Focused Innovation Remains the Lifeblood of ADI
It is intuitive that innovation will
separate the best technology companies
from the mediocre ones, and that
innovation is not a spigot that can be
turned on and off in response to
short-term order trends. Given the
proliferation of signal processing into
virtually every end market, we have
historically invested in many diverse
product and market opportunities;
some investments resulted in
innovations that have produced
signiﬁcant and sustainable returns, 
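
To make the meaning of chunk_size and chunk_overlap concrete, here is a simplified fixed-window splitter. It is only an illustration: RecursiveCharacterTextSplitter prefers to cut at separators such as paragraph breaks, so its chunk boundaries will differ from the ones this sketch produces. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into character windows of chunk_size that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, which is the role the 300-character overlap plays in the notebook: a sentence cut at a boundary still appears whole in one of the two neighboring chunks.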

**Step 4**

Let us now build our application logic and test our system

In [None]:
from langgraph.graph import START, StateGraph
from typing_extensions import List, TypedDict
from langchain import hub


# Define prompt for question-answering
prompt = hub.pull("rlm/rag-prompt")

# Define state for application
class State(TypedDict):
    question: str
    context: List[Document]
    answer: str


# Define application steps
def retrieve(state: State):
    retrieved_docs = vector_store.similarity_search(state["question"])
    return {"context": retrieved_docs}


def generate(state: State):
    docs_content = "\n\n".join(doc.page_content for doc in state["context"])
    messages = prompt.invoke({"question": state["question"], "context": docs_content})
    response = llm.invoke(messages)
    return {"answer": response.content}


# Compile application and test
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()
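
The retrieve step relies on vector_store.similarity_search to find the chunks closest to the question. As far as I know, the in-memory store ranks chunks by cosine similarity between embeddings and returns four results by default, but treat both details as assumptions. The sketch below shows the idea with hypothetical two-dimensional vectors in place of real embeddings; `cosine` and `top_k` are illustrative names, not LangChain functions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the k texts whose vectors are most similar to query_vec."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Hypothetical 2-d embeddings; real ones come from the embeddings model.
store = [
    ("interest expense note", [1.0, 0.0]),
    ("shareholder letter", [0.0, 1.0]),
    ("amortization schedule", [0.8, 0.6]),
]
results = top_k([1.0, 0.1], store, k=2)
print(results)  # ['interest expense note', 'amortization schedule']
```

Note that a search like this always returns the k nearest chunks, even when the question has nothing to do with the documents.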

**LANGSMITH**

After running the following cell, take a look at your LangSmith logs. They show the latency, where the time is spent, the cost, the inputs, the retrieved documents, and the output. So cool!!!

In [None]:
key = keys[0]
print("key: ", key)
print("\n")
# To compare against the reference answer, ask the dataset question instead:
# question = test_questions[key][0]["question"]
question = "What stock should I buy?"
print("first question: ", question)
print("\n")
# Reference answer for test_questions[key][0]; it only applies when that question is asked
print("actual answer: ", test_questions[key][0]["answers"])
print("\n")

print("RAG answer: ")
response = graph.invoke({"question": question})
print(response["answer"])


key:  ADI_2009


first question:  What stock should I buy?


actual answer:  {'str_answer': '380', 'exe_answer': '3.8'}


RAG answer: 
I cannot provide financial advice. Investing in the stock market involves risk, and the best stock for you will depend on your personal financial situation and investment goals. Consult with a qualified financial advisor before making any investment decisions.


**Step 5**

Now it is your turn! Use the test_questions and keys data structures and the graph.invoke method to test the RAG system. Try to create your own questions that may require knowledge from multiple documents (e.g., how does company X's policy relate to company Y's?). I also encourage you to go up a few blocks, load in different questions, and test them (see the end of Step 2)!
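
One possible way to organize such a test run is to loop over every stored question and collect the RAG answer next to the reference answers. The sketch below is a starting point, not a prescribed solution: `run_questions`, `sample_questions`, and `fake_invoke` are hypothetical, and the stand-in data only mimics the shape of test_questions so the example runs on its own.

```python
def run_questions(test_questions, invoke):
    """Ask every question and pair the model's answer with the reference answers."""
    results = []
    for doc_name, qa_pairs in test_questions.items():
        for qa in qa_pairs:
            response = invoke({"question": qa["question"]})
            results.append({
                "doc_name": doc_name,
                "question": qa["question"],
                "expected": qa["answers"],
                "rag_answer": response["answer"],
            })
    return results


# Hypothetical stand-ins; in the notebook, pass test_questions and graph.invoke instead.
sample_questions = {
    "DEMO_2020": [{
        "question": "what was revenue?",
        "answers": {"str_answer": "100", "exe_answer": "100.0"},
        "q_uid": "demo-1",
    }],
}


def fake_invoke(state):
    return {"answer": "stub answer for: " + state["question"]}


results = run_questions(sample_questions, fake_invoke)
for row in results:
    print(row["doc_name"], "|", row["question"], "|", row["rag_answer"], "| expected:", row["expected"])
```

In the notebook, the call would be run_questions(test_questions, graph.invoke), and each invocation would also show up as a trace in LangSmith.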

**Step 6**

Integrate agentic AI and compare how quickly calls that do not need retrieval execute against a non-agentic RAG system, which retrieves on every call.
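
The timing comparison can be prototyped before any agent is wired in. The sketch below is only a rough illustration under loud assumptions: `needs_retrieval` is a hypothetical keyword check standing in for the decision an LLM agent would make, and `fake_retrieve` simulates retrieval latency with a short sleep instead of querying the vector store.

```python
import time


def needs_retrieval(question: str) -> bool:
    """Hypothetical router: a keyword check standing in for an LLM's decision."""
    keywords = ("expense", "revenue", "fiscal", "amortization")
    return any(word in question.lower() for word in keywords)


def fake_retrieve(question: str) -> list[str]:
    time.sleep(0.05)  # simulated retrieval latency, not a real vector search
    return ["retrieved chunk"]


def answer(question: str, agentic: bool) -> dict:
    """Time the retrieval stage; the non-agentic path always retrieves."""
    start = time.perf_counter()
    retrieved = (not agentic) or needs_retrieval(question)
    context = fake_retrieve(question) if retrieved else []
    elapsed = time.perf_counter() - start
    return {"retrieved": retrieved, "context": context, "seconds": elapsed}


always = answer("What color is the sun?", agentic=False)
routed = answer("What color is the sun?", agentic=True)
print(always["retrieved"], routed["retrieved"])
print(f"always-retrieve: {always['seconds']:.3f}s, routed: {routed['seconds']:.3f}s")
```

In a real agentic system the routing decision is itself an LLM call with its own latency, so the measured saving will be smaller than in this stub; the LangSmith traces are a good place to compare the two.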