# **RAG SYSTEM FOR FINANCIAL DOCUMENTS**

## **PIP Installations**

The following packages need to be installed:

    - google-genai
    - azure-search-documents
    - langchain-google-genai
    - langchain-chroma

Note that either Chroma or Azure AI Search is used as the vector database.

In [39]:

#%capture
#%pip install langchain-core langchain-community langchain-google-genai
#%pip install azure-ai-documentintelligence azure-search-documents
#%pip install -U langchain-google-genai
#%pip install -U langchain-chroma


## **Setting up the Environment to Use Google's GenAI LLM Products**

In [40]:
import os

In [41]:
# Placeholder value: never commit a real API key to a notebook
os.environ['GOOGLE_API_KEY'] = '<your-google-api-key>'
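
A key written directly into a cell is easy to leak when the notebook is shared. As a minimal sketch, assuming the key has already been exported in the shell that starts Jupyter, it can be read from the environment with a clear failure when it is missing. `require_env` is a hypothetical helper, not part of any library used here.

```python
import os


def require_env(name, env=None):
    """Return the value of an environment variable, or fail with a clear message."""
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(
            "Set the {} environment variable before running this notebook.".format(name)
        )
    return value


# Usage (assumes GOOGLE_API_KEY was exported before starting Jupyter):
# google_api_key = require_env("GOOGLE_API_KEY")
```

The same helper could be reused for the LlamaParse and Azure keys that appear later in the notebook.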

## **Testing Google's Chat Generative AI**

In [42]:
# Testing the Google Gemini LangChain integration
from langchain_google_genai import ChatGoogleGenerativeAI

llm = ChatGoogleGenerativeAI(model="gemini-pro", convert_system_message_to_human=False)

messages = [
    (
        "system",
        "You are a helpful assistant that translates English to French. Translate the user sentence.",
    ),
    ("human", "I love programming."),
]
#ai_msg = llm.invoke(messages)
#ai_msg


## **Loading & Chunking Documents**

### **Defining a method used to loop through a directory and load all documents inside it.**

The loading is carried out by an instance of AzureAIDocumentIntelligenceLoader, chosen for the following reason:

    - Ability to analyze tables, images, and text, extracting information from all three

In [43]:
from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader
"""
def load_docs(direc_path, doc_intelligence_endpoint, doc_intelligence_key):
    docs = []
    for file in os.listdir(direc_path):
        if file.endswith(".pdf"):
            file_path = direc_path + "/"+ file
            loader = AzureAIDocumentIntelligenceLoader(file_path=file_path, api_endpoint= doc_intelligence_endpoint, api_key=doc_intelligence_key, api_model="prebuilt-layout", mode="markdown", analysis_features= ["ocrHighResolution"])
            docs += loader.load()
    return docs
"""


In [51]:
from llama_parse import LlamaParse
from langchain_core.documents import Document

def load_docs(direc_path, LLAMA_CLOUD_API_KEY):
    """Parse every file in direc_path with LlamaParse and return one LangChain Document per file."""
    os.environ["LLAMA_CLOUD_API_KEY"] = LLAMA_CLOUD_API_KEY
    doc_list = []
    parsing_10k_instructions = '''The documents attached are financial Form 10-Ks from various companies. They explore financial data, providing insight into each company's financial status. Answer questions using the information in these financial documents and be precise.'''
    # One parser instance is reused for all files
    parser = LlamaParse(
        result_type="markdown",
        parsing_instruction=parsing_10k_instructions,
        page_separator="\n=================\n",
        skip_diagonal_text=True,
    )
    for file in os.listdir(direc_path):
        file_path = os.path.join(direc_path, file)
        # load_data returns a list of LlamaIndex documents, not a string
        parsed = parser.load_data(file_path)
        if not parsed:
            print("No content parsed from {}; skipping.".format(file))
            continue
        # Document.page_content must be a str, so join the text of the parsed pieces
        page_content = "\n".join(part.text for part in parsed)
        doc_list.append(Document(page_content=page_content, metadata={"source": file}))

    return doc_list

In [45]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

Create an Azure AI Document Intelligence resource and note down its endpoint and API key.

*ocrHighResolution* is an add-on analysis feature that runs Optical Character Recognition (OCR) at a higher resolution, which helps extract small print and handwritten text from PDF documents and scanned images.

In [52]:
# Loading sample document (invoice)
#url_path = "https://raw.githubusercontent.com/Azure-Samples/cognitive-services-REST-api-samples/master/curl/form-recognizer/rest-api/layout.png"

#endpoint = "https://docintelone.cognitiveservices.azure.com/"
#key = "<your-document-intelligence-key>"

# Only if being run on a notebook
import nest_asyncio
nest_asyncio.apply()

analysis_features = ["ocrHighResolution"]

direc_path = "C:/Users/abbandomo/OneDrive - KPMG/Desktop/RAG-IN-ABBANDOMO/Shrinked"
#docs = load_docs(direc_path, endpoint, key)
docs = load_docs(direc_path, "<your-llama-cloud-api-key>")




In [None]:
print(docs[0].page_content)


### **Splitting**

The RecursiveCharacterTextSplitter is used because of its simplicity. It breaks text into smaller *"chunks"* whose size and overlap are controlled by *"chunk_size"* and *"chunk_overlap"*.

For a query-based system, chunk size is a trade-off: smaller chunks make retrieval more precise, while larger chunks reduce the number of vectors that have to be stored.

*Note that other splitting techniques exist, such as semantic text splitting.*
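
To make the effect of the two parameters concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification and not the RecursiveCharacterTextSplitter algorithm, which first tries to split on separators such as paragraph and line breaks. `chunk_text` is a hypothetical helper written for illustration.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into character windows of chunk_size that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=4)
print(chunks)
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each chunk repeats the last four characters of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk when the overlap is long enough.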

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [None]:
# Chunking using Langchain splitters
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
splits = text_splitter.split_documents(docs)

len(splits)


129

In [None]:
"""
for split in splits: 
    print(split.page_content)
    print("----------------------------------------------------------------------------------")
"""


## **Embedding Stage**

In [None]:
from langchain_google_genai import GoogleGenerativeAIEmbeddings

In [None]:
# Setting up embeddings

model = "models/embedding-001"
embeddings = GoogleGenerativeAIEmbeddings(model=model)


### **VECTOR STORE USING AZURE_SEARCH**

In [None]:
# Setting up Azure Search
"""
from langchain_community.vectorstores.azuresearch import AzureSearch

vector_store_address: str = "https://ragsearchone.search.windows.net"
vector_store_password: str = "<your-azure-search-admin-key>"

index_name: str = "langchain-rag-prototype"
vectorstore_azure = AzureSearch(
    azure_search_endpoint=vector_store_address,
    azure_search_key=vector_store_password,
    index_name=index_name,
    embedding_function=embeddings.embed_query,
    additional_search_client_options={"retry_total": 4},
)
retriever_azure = vectorstore_azure.as_retriever()

vectorstore_azure.add_documents(documents=docs)

"""


### **VECTOR STORE USING CHROMA_DB**

In [None]:
from langchain_chroma import Chroma

In [None]:

# from_documents embeds and indexes the chunks in one step, so no separate
# add_documents call is needed (adding `docs` again would also index the unsplit files)
vectorstore_chroma = Chroma.from_documents(splits, embeddings)

retriever_chroma = vectorstore_chroma.as_retriever()

### **VECTOR STORE USING FAISS**

In [None]:
from langchain_community.vectorstores import FAISS

In [None]:
"""
vectorstore_faiss = FAISS.from_documents(docs, embeddings)

retriever_faiss = vectorstore_faiss.as_retriever()

vectorstore_faiss.add_documents(documents=docs)
"""


## **Retrieval-Augmented Generation System**

In [None]:
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain import hub

In [None]:
# Rag prompt for retrieval

prompt = hub.pull("rlm/rag-prompt")


def format_docs(docs): 
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain_chroma = (
    {"context": retriever_chroma | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm 
    | StrOutputParser()
)

"""
rag_chain_azure = (
    {"context": retriever_azure | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm 
    | StrOutputParser()
)
"""
"""
rag_chain_faiss = (
    {"context": retriever_faiss | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm 
    | StrOutputParser()
)
"""

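
The chains above all follow the same data flow: retrieve chunks for the question, join them into a context string, fill the prompt, and pass it to the LLM. The sketch below walks through the first three steps with plain Python stand-ins. `toy_retriever` and `build_prompt` are hypothetical, the word-overlap ranking only imitates vector similarity, the prompt wording is not the `rlm/rag-prompt` template, and the corpus strings are made-up placeholders.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of words."""
    return set(re.findall(r"\w+", text.lower()))


def toy_retriever(question, corpus, k=2):
    """Stand-in for a vector-store retriever: rank chunks by words shared with the question."""
    query_words = tokenize(question)
    ranked = sorted(corpus, key=lambda chunk: len(query_words & tokenize(chunk)), reverse=True)
    return ranked[:k]


def format_chunks(chunks):
    """Join the retrieved chunk texts into one context string."""
    return "\n\n".join(chunks)


def build_prompt(question, context):
    """Fill a simple prompt template with the context and the question."""
    return "Use the context to answer the question.\n\nContext:\n{}\n\nQuestion: {}".format(
        context, question
    )


corpus = [
    "placeholder chunk about revenue recognition policy",
    "placeholder chunk about goodwill by segment",
    "placeholder chunk about share repurchases",
]
question = "How is goodwill split by segment?"
top_chunks = toy_retriever(question, corpus)
prompt_text = build_prompt(question, format_chunks(top_chunks))
print(prompt_text)
```

In the real chain, the resulting prompt is sent to `llm` and `StrOutputParser` turns the model's reply into a plain string.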

## **Testing**

In [None]:

#questions = "When does Apple recognize revenue for its products?, How does Apple allocate revenue for arrangements with multiple performance obligations?, What are the three performance obligations identified by Apple for iPhone Mac and iPad sales?, How is revenue allocated to product-related bundled services and unspecified software upgrade rights recognized?, What factors does Apple consider when determining control of third-party products?, How does Apple account for third-party application-related sales through the App Store?"
#questions = "How much deferred revenue was recognized by Apple in the year 2023?, How much deferred revenue from the previous year was recognized by Apple in the year 2022?, What amount from the year prior to that was recognized as part of Apple's net sales for the year ending on September 26th?, In which region did iPhone sales represent a higher proportion than usual according to Note 13?, As of September 30th what is the total amount reported for total deferred revenues?," 
#questions = "What is Apple's reported amount for total deferred revenues as of December 24th?"
#questions = "How many shares did Uber issue and sell during its IPO?, What was the price per share during Uber’s IPO?, How much net proceeds did Uber receive from the IPO?, How many shares of common stock were converted from the outstanding redeemable convertible preferred stock upon the IPO?, What was the tax withholding obligation based on the IPO public offering price?, How much did Uber agree to pay for its controlling interest in Cornershop?"

#questions = "How many shares of common stock were issued upon conversion of the 2021 and 2022 Convertible Notes?, How many shares were issued related to the vesting of RSUs with performance-based vesting conditions?, What was the tax withholding obligation based on the IPO public offering price?, How much stock-based compensation expense did Uber recognize upon its IPO?, How much deferred revenue was recognized by the company in the year 2023?"
#questions = "What did holders of the 2021 and 2022 Convertible Notes do after the closing of the IPO??"
#questions = "Who are ZeniMax Media?, Who are activision blizzard inc and what happened between them and microsoft?"
questions = "What was the total purchase price of ZeniMax Media Inc.?, When did Microsoft complete the acquisition of ZeniMax Media Inc.?, What is the name of the parent company of Bethesda Softworks LLC?, How much cash and cash equivalents were acquired as part of the ZeniMax Media Inc. acquisition?, To which segment of Microsoft is ZeniMax Media Inc. reported?, When was the allocation of the purchase price to goodwill for the ZeniMax Media Inc. acquisition completed?, How much goodwill was assigned to Microsoft's More Personal Computing segment due to the ZeniMax Media Inc. acquisition?, What is the expected average life of the technology-based intangible assets acquired from ZeniMax Media Inc.?, What is the purchase price per share offered by Microsoft for Activision Blizzard, Inc.?, When did Microsoft enter into a definitive agreement to acquire Activision Blizzard, Inc.?, What is the total enterprise value of Microsoft's acquisition of Activision Blizzard Inc.?, What is the name of the segment in Microsoft with the highest goodwill as of June 30 2023?, How much did the goodwill of the Productivity and Business Processes segment increase from June 30 2021 to June 30 2022?"

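import re

# Split only on commas that follow a question mark; a plain split(",") would also
# break questions containing commas, such as "Activision Blizzard, Inc."
question_list = [q.strip() for q in re.split(r"(?<=\?),", questions) if q.strip()]

for question in question_list: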
    print("Question: {}".format(question))
    response_chroma = rag_chain_chroma.invoke(question)
    #response_faiss = rag_chain_faiss.invoke(question)
    #response_azure = rag_chain_azure.invoke(question)
    #answer = response["answer"]
    #print("Question: {} \n Answer: {}".format(question, answer))
    #print("AZURE:")
    #print("Answer: {}".format(response_azure))
    #print("")
    print("Chroma:")
    print("Answer: {}".format(response_chroma))
    #print("")
    #print("faiss:")
    #print("Answer: {}".format(response_faiss))
    print("________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________")



Question: What was the total purchase price of ZeniMax Media Inc.?
Chroma:
Answer: The total purchase price of ZeniMax Media Inc. was $8.1 billion, consisting primarily of cash. The purchase price included $766 million of cash and cash equivalents acquired.
________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________
Question:  When did Microsoft complete the acquisition of ZeniMax Media Inc.?
Chroma:
Answer: Microsoft completed the acquisition of ZeniMax Media Inc. on March 9, 2021. The purchase price was $8.1 billion, primarily consisting of cash. ZeniMax is one of the largest privately held game developers and publishers in the world.
______________________________________________________________________________________________________________________________________

In [None]:
# Retrieval using Gemini
"""
response = rag_chain.invoke({"input": "During Uber's IPO, how many shares of common stock were sold, at what price were they sold, and how much were the net proceeds received?"})
response["answer"]
"""

response = rag_chain_chroma.invoke("What is the total enterprise value of Microsoft's acquisition of Activision Blizzard Inc.?")
print(response)


The total enterprise value of Microsoft's acquisition of Activision Blizzard Inc. is $68.7 billion. This includes Activision Blizzard's net cash.


In [None]:
# In progress: adding chat history as context to the RAG chain so that the chatbot has more context and the conversation feels natural.
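
One possible starting point for the chat-history work, offered as a sketch and not as the planned implementation: keep the turns in a list and prepend the most recent ones to each new question before it is sent to the chain. `add_history` and the `(role, text)` tuple layout are assumptions made for illustration.

```python
def add_history(history, question, max_turns=3):
    """Prefix a new question with the most recent turns so a follow-up keeps its context."""
    recent = history[-max_turns:] if max_turns > 0 else []
    lines = ["{}: {}".format(role, text) for role, text in recent]
    lines.append("human: {}".format(question))
    return "\n".join(lines)


history = [
    ("human", "first question"),
    ("ai", "first answer"),
]
contextual_question = add_history(history, "follow-up question")
print(contextual_question)
# human: first question
# ai: first answer
# human: follow-up question
```

The combined string could then be passed to `rag_chain_chroma.invoke(...)` and the new question and answer appended to `history` after each call.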