Before running this notebook, confirm that the kernel can execute simple Python code.

In [1]:
print(3+5)

8


#Dependencies
We also need to make sure that the required Python packages are installed.

In [2]:
!pip -q install chromadb openai langchain tiktoken pypdf google-generativeai

This notebook uses a Google API key. To use OpenAI instead, uncomment the OpenAI lines below and set up an OpenAI key here:
https://platform.openai.com/docs/quickstart/account-setup

In [3]:
import os
# os.environ['OPENAI_API_KEY'] = "Add your OpenAI key here"
# Ideally this key is set as an environment variable outside the notebook. For now, just set it here.
os.environ['GOOGLE_API_KEY'] = "Add your Google API key here"
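
To avoid hardcoding the key in the notebook, one option is to export it in the shell and only check for it here. The sketch below assumes a hypothetical helper named `require_api_key` (it is not part of LangChain) that fails early with a readable message when the variable is missing.

```python
import os


def require_api_key(name="GOOGLE_API_KEY", env=None):
    """Return the API key stored in the environment variable `name`.

    `require_api_key` is a hypothetical helper written for this notebook.
    `env` defaults to os.environ; pass a dict to test without touching it.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {name} environment variable before running this notebook."
        )
    return key
```

With the key exported in the shell, calling `require_api_key()` at the top of the notebook replaces the hardcoded assignment.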

In [4]:
from langchain.vectorstores import Chroma
#from langchain.embeddings import OpenAIEmbeddings
#from langchain.llms import OpenAI
from langchain.embeddings import GooglePalmEmbeddings
from langchain.llms import GooglePalm
from langchain.document_loaders import DirectoryLoader
from langchain.document_loaders import TextLoader
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter




In [5]:
loader = PyPDFDirectoryLoader("pdfs")
data = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=20)
text_chunks = text_splitter.split_documents(data)
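
To see what `chunk_size=500` and `chunk_overlap=20` mean, here is a simplified sketch that cuts a string into fixed-size windows sharing 20 characters. It is an illustration only: `RecursiveCharacterTextSplitter` prefers to split on separators such as paragraphs, lines, and words, and treats the sizes as upper bounds. The function name `chunk_with_overlap` is hypothetical.

```python
def chunk_with_overlap(text, chunk_size=500, chunk_overlap=20):
    """Split `text` into fixed-size windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(1200))
pieces = chunk_with_overlap(sample)
print([len(p) for p in pieces])  # [500, 500, 240]
```

The last 20 characters of each chunk reappear at the start of the next one, so a sentence cut at a boundary is still partly visible in both chunks.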

In [6]:
text_chunks

[Document(page_content='2/26/24, 10:07 AM goog-20231231\nhttps://www.sec.gov/Archives/edgar/data/1652044/000165204424000022/goog-20231231.htm 1/135UNITED STATES\nSECURITIES AND EXCHANGE COMMISSION\nWashington, D.C. 20549\n___________________________________________\nFORM 10-K\n___________________________________________\n(Mark One)\n☒ ANNUAL  REPORT PURSUANT T O SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934\nFor the fiscal year ended December 31 , 2023\nOR', metadata={'source': 'pdfs/Google-10k-2024-jan.pdf', 'page': 0}),
 Document(page_content='OR\n☐ TRANSITION REPORT PURSUANT T O SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934\nFor the transition period from              to             .\nCommission file number: 001-37580\n___________________________________________\nAlphabet Inc.\n(Exact name of registrant as specified in its charter)\n___________________________________________\nDelaware 61-1767919\n(State or other jurisdiction of incorporation or organizati

This example uses GooglePalm. PaLM is deprecated, so it should be replaced with Gemini or the latest Google AI LLM.
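
A possible Gemini-based replacement is sketched below. It assumes the separate `langchain-google-genai` package is installed and that `GOOGLE_API_KEY` is set. The helper name `build_gemini_components` is hypothetical, and the default model names are assumptions that may no longer be available, so check the current model list before relying on them.

```python
def build_gemini_components(chat_model="gemini-pro",
                            embedding_model="models/embedding-001",
                            temperature=0.1):
    """Return (llm, embeddings) backed by Gemini instead of PaLM.

    Assumptions: `pip install langchain-google-genai` has been run and
    GOOGLE_API_KEY is set. The default model names are placeholders.
    """
    # Imported lazily so this cell can be defined before the package is installed.
    from langchain_google_genai import (
        ChatGoogleGenerativeAI,
        GoogleGenerativeAIEmbeddings,
    )

    llm = ChatGoogleGenerativeAI(model=chat_model, temperature=temperature)
    embeddings = GoogleGenerativeAIEmbeddings(model=embedding_model)
    return llm, embeddings
```

If the embedding model changes, the Chroma index has to be rebuilt, because vectors produced by different embedding models are not comparable.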

In [7]:

persist_dir = "dbdir"
embeddings=GooglePalmEmbeddings()
#print embeddings
embeddings


GooglePalmEmbeddings(client=<module 'google.generativeai' from '/Users/user/Library/Python/3.9/lib/python/site-packages/google/generativeai/__init__.py'>, google_api_key=None, model_name='models/embedding-gecko-001', show_progress_bar=False)

In [8]:
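# Note: this indexes the full pages in `data`. Pass `text_chunks` instead to index
# the 500-character chunks produced by the splitter above.
vectordb = Chroma.from_documents(documents=data,
                                 embedding=embeddings,
                                 persist_directory=persist_dir)
vectordb.persist()  # write the index to disk
vectordb = None     # drop the in-memory handle so the next cell reloads from disk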

In [9]:
# Now we can load the persisted database from disk, and use it as normal.
vectordb = Chroma(persist_directory=persist_dir,
                  embedding_function=embeddings)
retriever = vectordb.as_retriever()

# Example: fetch the documents most relevant to a query. Chroma performs this lookup.
# Only one PDF is indexed here, so every result comes from the same source file.
# By default the retriever returns 4 results; this can be reduced to 1 or 2 with search_kwargs.
# This query is independent of the one passed to the QA chain below.
docs = retriever.get_relevant_documents("What was the Google's revenue according to the 10K?")

In [10]:
docs

[Document(page_content='2/26/24, 10:07 AM goog-20231231\nhttps://www.sec.gov/Archives/edgar/data/1652044/000165204424000022/goog-20231231.htm 56/135Table of Contents Alphabet Inc.\n•On July 21, 2023, the IRS announced a rule change allowing taxpayers to temporarily apply the regulations in effect\nprior to 2022 related to U.S. federal foreign tax credits. This announcement applies to foreign taxes paid or accrued in\nthe fiscal years 2022 and 2023. A cumulative one-time adjustment applicable to the prior period for this tax rule change\nwas recorded in 2023 and is reflected in our effective tax rate of 13.9% for the year ended December 31, 2023.\n•Repurchases of Class A and Class C shares were $62.2 billion for the year ended December 31, 2023. For additional\ninformation, see Note 11 of the Notes to Consolidated Financial Statements included in Item 8 of this Annual Report on\nForm 10-K.\n•Operating cash flow was $101.7 billion for the year ended December 31, 2023.\n•Capital expenditu

In [11]:
retriever.search_type

'similarity'

In [12]:
retriever = vectordb.as_retriever(search_kwargs={"k": 2})

In [20]:
retriever.search_kwargs

{'k': 2}
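
The `'similarity'` search type with `k=2` means the retriever returns the two stored entries whose embeddings are closest to the query embedding. The toy sketch below shows that idea with hand-written 2-D vectors and cosine similarity. It is not how Chroma is implemented (Chroma uses its own index and distance metric), and the names `cosine_similarity`, `top_k`, and `toy_index` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, indexed_vectors, k=2):
    """Return the ids of the k stored vectors most similar to the query."""
    ranked = sorted(
        indexed_vectors,
        key=lambda doc_id: cosine_similarity(query_vector, indexed_vectors[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Made-up embeddings standing in for three indexed pages.
toy_index = {
    "page-a": [1.0, 0.0],
    "page-b": [0.8, 0.6],
    "page-c": [0.0, 1.0],
}
print(top_k([1.0, 0.1], toy_index, k=2))  # ['page-a', 'page-b']
```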

Use LangChain to build the question-answering chain

In [14]:
llm = GooglePalm(temperature=0.1)


In [15]:
# create the chain to answer questions
from langchain.chains import RetrievalQA
qa_chain = RetrievalQA.from_chain_type(llm,
                                  chain_type="stuff",
                                  retriever=retriever,
                                  return_source_documents=True)
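
With `chain_type="stuff"`, all retrieved documents are placed into a single prompt that is sent to the LLM once. The sketch below illustrates that idea only. The prompt wording and the name `build_stuff_prompt` are assumptions and do not reproduce LangChain's actual template. The two context strings are taken from the retrieved page shown earlier.

```python
def build_stuff_prompt(question, documents):
    """Concatenate every retrieved text into one prompt string."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    "What was operating cash flow in 2023?",
    [
        "Operating cash flow was $101.7 billion for the year ended December 31, 2023.",
        "Repurchases of Class A and Class C shares were $62.2 billion for the year ended December 31, 2023.",
    ],
)
print(prompt)
```

Because everything goes into one prompt, the retrieved text must fit in the model's context window, which is one reason to keep `k` small.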

In [16]:
# Cite sources: print the answer, then the source file of each retrieved document
def process_llm_response(llm_response):
    print(llm_response['result'])
    print('\n\nSources:')
    for source in llm_response["source_documents"]:
        print(source.metadata['source'])

In [19]:
# full example
query = "List the revenue for each of the Google's products. Return answer in bullet points?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

- Google Search & other revenues: $146.9 billion
- Google Network revenues: $146.3 billion
- Google subscriptions, platforms, and devices revenues: $202.5 billion
- Google Cloud revenues: $153.3 billion


Sources:
pdfs/Google-10k-2024-jan.pdf
pdfs/Google-10k-2024-jan.pdf
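
The source list above repeats the same file because both retrieved documents come from one PDF. A possible variant, sketched below, de-duplicates the entries and adds the page number stored in each document's metadata (the loader output earlier shows a `'page'` key, starting at 0). The helper name `unique_sources` is hypothetical, and the page numbers in `fake_docs` are made-up stand-ins for real retrieved documents.

```python
from types import SimpleNamespace


def unique_sources(source_documents):
    """Return de-duplicated 'file (page N)' labels, sorted by file and page."""
    seen = set()
    for doc in source_documents:
        source = doc.metadata.get("source", "unknown")
        page = doc.metadata.get("page", -1)  # -1 marks a missing page number
        seen.add((source, page))
    return [s if p < 0 else f"{s} (page {p})" for s, p in sorted(seen)]


# Stand-ins for llm_response["source_documents"]; page values are invented.
fake_docs = [
    SimpleNamespace(metadata={"source": "pdfs/Google-10k-2024-jan.pdf", "page": 55}),
    SimpleNamespace(metadata={"source": "pdfs/Google-10k-2024-jan.pdf", "page": 55}),
    SimpleNamespace(metadata={"source": "pdfs/Google-10k-2024-jan.pdf", "page": 32}),
]
for label in unique_sources(fake_docs):
    print(label)
```

Inside `process_llm_response`, the loop over `llm_response["source_documents"]` could be replaced by a loop over `unique_sources(llm_response["source_documents"])`.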
