# **RAG SYSTEM FOR FINANCIAL DOCUMENTS**

## **PIP Installations**

The following packages need to be installed:

    - google-genai
    - azure-search-documents
    - langchain-google-genai
    - langchain-chroma

Note that only one of Chroma or Azure AI Search is used as the vector database.

In [104]:

#%capture
#%pip install langchain-core langchain-community langchain-google-genai
#%pip install azure-ai-documentintelligence azure-search-documents
#%pip install -U langchain-google-genai
#%pip install -U langchain-chroma


## **Setting up the Environment to use Google's GenAI LLM products**

In [None]:
import os

In [87]:
os.environ['GOOGLE_API_KEY'] = '<YOUR_GOOGLE_API_KEY>'  # placeholder: never commit a real key to a shared notebook

## **Testing Google's Chat Generative AI**

In [88]:
# Testing Google Gemini Langchain integration
from langchain_google_genai import ChatGoogleGenerativeAI

llm = ChatGoogleGenerativeAI(model="gemini-pro", convert_system_message_to_human=True)

messages = [
    (
        "system",
        "You are a helpful assistant that translates English to French. Translate the user sentence.",
    ),
    ("human", "I love programming."),
]
ai_msg = llm.invoke(messages)
ai_msg




AIMessage(content="J'aime la programmation.", response_metadata={'prompt_feedback': {'block_reason': 0, 'safety_ratings': []}, 'finish_reason': 'STOP', 'safety_ratings': [{'category': 'HARM_CATEGORY_SEXUALLY_EXPLICIT', 'probability': 'NEGLIGIBLE', 'blocked': False}, {'category': 'HARM_CATEGORY_HATE_SPEECH', 'probability': 'NEGLIGIBLE', 'blocked': False}, {'category': 'HARM_CATEGORY_HARASSMENT', 'probability': 'NEGLIGIBLE', 'blocked': False}, {'category': 'HARM_CATEGORY_DANGEROUS_CONTENT', 'probability': 'NEGLIGIBLE', 'blocked': False}]}, id='run-f243ff4f-6c97-4a21-9d97-6c91553f5db0-0', usage_metadata={'input_tokens': 21, 'output_tokens': 6, 'total_tokens': 27})

## **Loading & Chunking documents**

### **Defining a method used to loop through a directory and load all documents inside it.**

Loading is carried out by an instance of AzureAIDocumentIntelligenceLoader for the following reason:

    - Ability to analyze tables, images, and text, and to extract the information from all of them

In [89]:
def load_docs(direc_path, doc_intelligence_endpoint, doc_intelligence_key):
    """Load every PDF found in direc_path with Azure AI Document Intelligence."""
    docs = []
    for file in os.listdir(direc_path):
        if file.endswith(".pdf"):
            file_path = os.path.join(direc_path, file)
            loader = AzureAIDocumentIntelligenceLoader(
                file_path=file_path,
                api_endpoint=doc_intelligence_endpoint,
                api_key=doc_intelligence_key,
                api_model="prebuilt-layout",
                mode="markdown",
                analysis_features=["ocrHighResolution"],
            )
            docs += loader.load()
    return docs

In [1]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader

Create an Azure AI Document Intelligence resource and note down its endpoint and api_key.
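
The endpoint and key do not have to be typed into a cell. A minimal sketch, assuming they are exported as environment variables beforehand: the variable names `DOC_INTELLIGENCE_ENDPOINT` and `DOC_INTELLIGENCE_KEY` and the helper `read_doc_intelligence_credentials` are hypothetical and not part of this notebook or of any Azure SDK.

```python
import os


def read_doc_intelligence_credentials(env=os.environ):
    """Return (endpoint, key) from the environment, failing loudly if either is missing."""
    # Hypothetical variable names; use whatever names your own setup defines.
    names = ("DOC_INTELLIGENCE_ENDPOINT", "DOC_INTELLIGENCE_KEY")
    values = [env.get(name) for name in names]
    missing = [name for name, value in zip(names, values) if not value]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return values[0], values[1]


# Example with an explicit mapping instead of the real environment.
endpoint, key = read_doc_intelligence_credentials(
    {"DOC_INTELLIGENCE_ENDPOINT": "https://example.invalid/", "DOC_INTELLIGENCE_KEY": "placeholder"}
)
print(endpoint)
```

The returned pair could then be passed to `load_docs` in place of literal strings, which keeps secrets out of the saved notebook.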

"ocrHighResolution" enables high-resolution Optical Character Recognition (OCR) as an add-on to the Document Intelligence analysis. It runs at a higher resolution than Azure AI Vision Read and extracts print and handwritten text from PDF documents and scanned images.

In [91]:
# Loading sample document (invoice)
#url_path = "https://raw.githubusercontent.com/Azure-Samples/cognitive-services-REST-api-samples/master/curl/form-recognizer/rest-api/layout.png"

endpoint = "https://docintelone.cognitiveservices.azure.com/"
key = "<YOUR_DOCUMENT_INTELLIGENCE_KEY>"  # placeholder: never commit a real key to a shared notebook

analysis_features = ["ocrHighResolution"]

direc_path = "/teamspace/studios/this_studio/Shrinked"
docs = load_docs(direc_path, endpoint, key)
print("Length of docs: {}".format(len(docs)))


Length of docs: 4


In [92]:
#print(docs[0])

### **Splitting**

The RecursiveCharacterTextSplitter is used because of its simplicity. It breaks text down into smaller *"chunks"* according to your choice of *"chunk_size"* and *"chunk_overlap"*.

For a query-based system, chunk_size is a trade-off. Smaller chunks make retrieval more precise, while larger chunks reduce the number of chunks that have to be embedded and stored.
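
To make the trade-off concrete, here is a simplified fixed-window splitter. It is only a sketch: `chunk_text` is a hypothetical helper, and RecursiveCharacterTextSplitter itself works differently because it prefers to break on separators such as paragraph and line breaks. The loop shows how the number of chunks changes with chunk_size when the overlap is kept at 10%.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


sample = "x" * 2500  # stand-in for a loaded document
for size in (250, 500, 1000):
    pieces = chunk_text(sample, chunk_size=size, chunk_overlap=size // 10)
    print(f"chunk_size={size}: {len(pieces)} chunks")
```

Each chunk becomes one embedding vector, so fewer and larger chunks mean fewer vectors in the store, at the cost of less targeted retrieval.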

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [93]:
# Chunking using Langchain splitters
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
splits = text_splitter.split_documents(docs)

len(splits)


40

In [94]:
"""
for split in splits: 
    print(split.page_content)
    print("----------------------------------------------------------------------------------")
"""


## **Embedding Stage**

In [95]:
from langchain_google_genai import GoogleGenerativeAIEmbeddings

In [96]:
# Setting up embeddings

model = "models/embedding-001"
embeddings = GoogleGenerativeAIEmbeddings(model=model)


### **VECTOR STORE USING AZURE_SEARCH**

In [97]:
# Setting up Azure Search
"""
from langchain_community.vectorstores.azuresearch import AzureSearch

vector_store_address: str = "https://ragsearchone.search.windows.net"
vector_store_password: str = "<YOUR_AZURE_SEARCH_ADMIN_KEY>"  # placeholder: never commit a real key

index_name: str = "langchain-rag-prototype"
vectorstore: AzureSearch = AzureSearch(
    azure_search_endpoint=vector_store_address,
    azure_search_key=vector_store_password,
    index_name=index_name,
    embedding_function=embeddings.embed_query,
    additional_search_client_options={"retry_total": 4},
)
retriever = vectorstore.as_retriever(search_type="similarity")

vectorstore.add_documents(documents=splits)  # index the chunks, not the unsplit documents

"""


### **VECTOR STORE USING CHROMA_DB**

In [98]:
from langchain_chroma import Chroma

In [2]:
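# Requires the `from langchain_chroma import Chroma` cell above to have been run first.
# from_documents embeds the chunks and stores them in one step, so no separate
# add_documents call is needed (adding `docs` would also index the unsplit documents).
vectorstore = Chroma.from_documents(splits, embeddings)

retriever = vectorstore.as_retriever()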
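
Conceptually, a vector store keeps one embedding vector per chunk, and the retriever returns the chunks whose vectors are closest to the query's vector. The toy example below illustrates that idea with cosine similarity and hand-made three-dimensional vectors. It is only a sketch: `cosine_similarity`, `top_k`, and the vectors are made up for illustration, and Chroma's own index and distance metric may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, indexed_chunks, k=2):
    """Return the k chunk texts whose vectors are most similar to query_vector."""
    ranked = sorted(
        indexed_chunks,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Hand-made vectors standing in for real embeddings.
toy_index = [
    ("IPO share price", [0.9, 0.1, 0.0]),
    ("Revenue recognition policy", [0.1, 0.9, 0.1]),
    ("IPO net proceeds", [0.8, 0.2, 0.1]),
]
print(top_k([1.0, 0.0, 0.0], toy_index, k=2))
```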

### **VECTOR STORE USING FAISS**

## **Retrieval-Augmented Generation System**

In [None]:
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain import hub

In [100]:
# Rag prompt for retrieval

prompt = hub.pull("rlm/rag-prompt")


def format_docs(docs): 
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm 
    | StrOutputParser()
)
"""
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)
"""


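
The piped chain above performs four steps for each question: retrieve chunks, join them into one context string, fill the prompt, and call the model. The plain-Python sketch below mirrors that flow without LangChain. `Doc`, `answer`, `fake_retrieve`, and `fake_generate` are hypothetical stand-ins, and the prompt wording is an assumption rather than the text of the hub prompt.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    # Minimal stand-in for a LangChain Document: only the text field is modelled.
    page_content: str


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


def answer(question, retrieve, generate):
    """Retrieve context, build the prompt, and hand it to the model."""
    context = format_docs(retrieve(question))
    # Assumed prompt wording, for illustration only.
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return generate(prompt)


def fake_retrieve(question):
    # Stand-in for the retriever: always returns the same two chunks.
    return [Doc("Chunk A"), Doc("Chunk B")]


def fake_generate(prompt):
    # Stand-in for the LLM: echoes the prompt so the assembled text can be inspected.
    return prompt


print(answer("What is in the chunks?", fake_retrieve, fake_generate))
```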

## **Testing**

In [101]:

#questions = "When does Apple recognize revenue for its products?, How does Apple allocate revenue for arrangements with multiple performance obligations?, What are the three performance obligations identified by Apple for iPhone Mac and iPad sales?, How is revenue allocated to product-related bundled services and unspecified software upgrade rights recognized?, What factors does Apple consider when determining control of third-party products?, How does Apple account for third-party application-related sales through the App Store?"
#questions = "How much deferred revenue was recognized by Apple in the year 2023?, How much deferred revenue from the previous year was recognized by Apple in the year 2022?, What amount from the year prior to that was recognized as part of Apple's net sales for the year ending on September 26th?, In which region did iPhone sales represent a higher proportion than usual according to Note 13?, As of September 30th, what is the total amount reported for total deferred revenues?, What is Apple's reported amount for total deferred revenues as of December 24th?"
questions = "How many shares did Uber issue and sell during its IPO?, What was the price per share during Uber’s IPO?, How much net proceeds did Uber receive from the IPO?, How many shares of common stock were converted from the outstanding redeemable convertible preferred stock upon the IPO?, What was the tax withholding obligation based on the IPO public offering price?, How much did Uber agree to pay for its controlling interest in Cornershop?"

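# Split the comma-separated string into individual questions.
# The name `question_list` avoids shadowing the built-in `list`.
question_list = questions.split(",")

for question in question_list:
    response = rag_chain.invoke(question)
    #answer = response["answer"]
    #print("Question: {} \n Answer: {}".format(question, answer))
    print("Question: {} \n Answer: {}".format(question, response))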
    print("________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________")





Question: How many shares did Uber issue and sell during its IPO? 
 Answer: On May 14, 2019, Uber issued and sold 180 million shares of its common stock during its IPO. The price per share was $45.00.
________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________




Question:  What was the price per share during Uber’s IPO? 
 Answer: The price per share during Uber’s IPO was $45.00.
________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________




Question:  How much net proceeds did Uber receive from the IPO? 
 Answer: On May 14, 2019, Uber received approximately $8.0 billion in net proceeds from its initial public offering (IPO). This figure is after deducting $106 million in underwriting discounts and commissions and offering expenses.
________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________




Question:  How many shares of common stock were converted from the outstanding redeemable convertible preferred stock upon the IPO? 
 Answer: Upon the IPO, all shares of the Company's outstanding redeemable convertible preferred stock automatically converted into 905 million shares of common stock.
________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________




Question:  What was the tax withholding obligation based on the IPO public offering price? 
 Answer: Based on the IPO public offering price of $45.00 per share, the tax withholding obligation was $1.3 billion. The company issued 76 million shares of common stock, and withheld 29 million shares to meet the tax withholding requirements.
________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________




Question:  How much did Uber agree to pay for its controlling interest in Cornershop? 
 Answer: Uber agreed to pay up to $459 million for its controlling interest in Cornershop, payable in a combination of cash and Uber common stock.
________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________


In [102]:
# Retrieval using Gemini
"""
response = rag_chain.invoke({"input": "During Uber's IPO, how many shares of common stock were sold, at what price were they sold, and how much were the net proceeds received?"})
response["answer"]
"""

response = rag_chain.invoke("As of september 30, 2023, what was  Apple's net Products sales?")
print(response)




Apple's net Products sales as of September 30, 2023 were $298,085 million.


In [103]:
# In progress: adding chat history as context to the RAG chain so that the chatbot has more context and the conversation feels natural.
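
One simple way this could be approached, sketched here as an assumption and not as the notebook's planned design, is to prepend the most recent turns to the new question before it is sent to the chain. `with_history` is a hypothetical helper and the "User:" and "Assistant:" labels are arbitrary.

```python
def with_history(history, question, max_turns=3):
    """Prefix the question with the most recent (question, answer) turns."""
    # Guard max_turns <= 0 explicitly, because history[-0:] would return every turn.
    recent = history[-max_turns:] if max_turns > 0 else []
    lines = [f"User: {q}\nAssistant: {a}" for q, a in recent]
    lines.append(f"User: {question}")
    return "\n".join(lines)


history = [
    ("What was the price per share during Uber's IPO?", "$45.00."),
]
print(with_history(history, "And how many shares were sold?"))
```

Because the combined string is also what the retriever would search with, a longer history can dilute the query, so keeping `max_turns` small is a reasonable starting point.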