# **CREATING A RAG SYSTEM FOR FINANCIAL DATA**

The langchain_community package, which contains the AzureAIDocumentIntelligenceLoader class, must be installed with pip along with the other dependencies, using the following command.

## Installing the necessary packages

In [1]:
! pip install python-dotenv langchain langchain-community langchain-openai langchainhub openai tiktoken azure-ai-documentintelligence azure-identity azure-search-documents==11.6.0b3

Defaulting to user installation because normal site-packages is not writeable


## **Setting up environment variables from the AzureAIDocumentIntelligence Resource**
Ensure that the location of the AI Document Intelligence resource is set to a region in the USA (US West, US East, or US Central).

In [49]:

from azure.core.credentials import AzureKeyCredential
doc_intelligence_endpoint = "https://docintelone.cognitiveservices.azure.com/"
doc_intelligence_key = "fa36ec29d4d548f5985930a2cbb1b3ee"
# credential = AzureKeyCredential(doc_intelligence_key)
# The line above is not necessary because the AzureAIDocumentIntelligenceLoader uses the api_key internally
# to create the AzureKeyCredential needed to extract and load the information.
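Hard-coding the key in a notebook cell exposes it to anyone who can read the file. As a hedged alternative, the endpoint and key could be read from environment variables. The sketch below is a minimal illustration: the variable names AZURE_DOC_INTELLIGENCE_ENDPOINT and AZURE_DOC_INTELLIGENCE_KEY and the helper read_doc_intelligence_settings are hypothetical, not names required by Azure or LangChain, and the demo mapping holds placeholder values only.

```python
import os
from typing import Mapping, Tuple

# Hypothetical variable names; use whatever names your shell or .env file defines.
ENDPOINT_VAR = "AZURE_DOC_INTELLIGENCE_ENDPOINT"
KEY_VAR = "AZURE_DOC_INTELLIGENCE_KEY"


def read_doc_intelligence_settings(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Return (endpoint, key) from env, failing early if either value is missing."""
    missing = [name for name in (ENDPOINT_VAR, KEY_VAR) if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return env[ENDPOINT_VAR], env[KEY_VAR]


# Demo with placeholder values. In the notebook, call the function with no
# arguments so that it reads the real process environment instead.
demo_env = {
    ENDPOINT_VAR: "https://<your-resource>.cognitiveservices.azure.com/",
    KEY_VAR: "<your-key>",
}
doc_intelligence_endpoint, doc_intelligence_key = read_doc_intelligence_settings(demo_env)
print(doc_intelligence_endpoint)
```

If the values are kept in a .env file, calling load_dotenv() from the python-dotenv package (installed above) before this function copies them into the process environment.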

## **Import, from the langchain_community package, the AzureAIDocumentIntelligenceLoader**

In [50]:
from langchain import hub
from langchain_openai import AzureChatOpenAI
from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader
from langchain.schema.runnable import RunnablePassthrough
from langchain.text_splitter import MarkdownHeaderTextSplitter
from langchain_community.vectorstores.azuresearch import AzureSearch

# **Instantiation, Loading, and Chunking of pdf data**

1. Instantiate the AzureAIDocumentIntelligenceLoader class using the endpoint and key of the Azure AI Document Intelligence resource.
2. Call the load() method to create a list[Document]. (In this case, one PDF document produces a list of length 1.)
3. Split the document into chunks based on markdown headers. It is also possible to use RecursiveCharacterTextSplitter or the load_and_split() method instead.
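To make step 3 concrete, the following is a simplified, pure-Python stand-in for header-based chunking. It is written for illustration only and is not LangChain's implementation; the function name split_on_headers and the sample text are hypothetical. Each chunk keeps the headers it sits under as metadata, which is the idea behind splitting on markdown headers.

```python
from typing import Dict, List


def split_on_headers(markdown: str, max_level: int = 3) -> List[dict]:
    """Split markdown into chunks, tagging each chunk with its enclosing headers."""
    chunks: List[dict] = []
    current_headers: Dict[int, str] = {}  # header level -> header title
    buffer: List[str] = []

    def flush() -> None:
        # Emit the buffered lines as one chunk under the headers seen so far.
        text = "\n".join(buffer).strip()
        if text:
            metadata = {f"Header {lvl}": title for lvl, title in sorted(current_headers.items())}
            chunks.append({"metadata": metadata, "content": text})
        buffer.clear()

    for line in markdown.splitlines():
        stripped = line.strip()
        hashes = len(stripped) - len(stripped.lstrip("#"))
        is_header = 1 <= hashes <= max_level and stripped[hashes:hashes + 1] == " "
        if is_header:
            flush()
            # A new header closes every header at the same or a deeper level.
            for level in [lvl for lvl in current_headers if lvl >= hashes]:
                del current_headers[level]
            current_headers[hashes] = stripped[hashes:].strip()
        else:
            buffer.append(line)
    flush()
    return chunks


sample = "# Report\nIntro text.\n## Revenue\nRevenue rose.\n## Costs\nCosts fell."
for chunk in split_on_headers(sample):
    print(chunk["metadata"], "->", chunk["content"])
```

Because every chunk carries its header path, a retriever can later filter or display results by section.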


## **Import the os module and use it to iterate through documents in a directory**

In [30]:
import os 

## **Iterate through the directory and load each PDF file using this function.**

In [51]:
def load_docs(direc_path):
    """Load every PDF file in direc_path with Azure AI Document Intelligence."""
    docs = []
    for file in os.listdir(direc_path):
        if file.endswith(".pdf"):
            file_path = os.path.join(direc_path, file)
            loader = AzureAIDocumentIntelligenceLoader(
                file_path=file_path,
                api_endpoint=doc_intelligence_endpoint,
                api_key=doc_intelligence_key,
                api_model="prebuilt-read",
            )
            # load() returns a list of Document objects for this file
            docs.extend(loader.load())
    return docs

## **Split the documents on the basis of markdown headers**

In [52]:
# Initialize the Azure AI Document Intelligence loader. You can specify either file_path or url_path to load a document.
#loader = AzureAIDocumentIntelligenceLoader(file_path="C:/Users/abbandomo/Downloads/sample-layout.pdf", api_key = doc_intelligence_key, api_endpoint = doc_intelligence_endpoint, api_model="prebuilt-read", mode='markdown')

direc_path = "C:/Users/abbandomo/OneDrive - KPMG/Desktop/RAG-IN-ABBANDOMO/TestFile"
docs = load_docs(direc_path)


# Split the documents into chunks based on markdown headers.
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]
text_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

# split_text() takes a string, so split each document's page_content individually and
# collect the chunks from all documents in one list so that they can all be embedded.
splits = []
for doc in docs:
    splits += text_splitter.split_text(doc.page_content)

print(f"Length of splits: {len(splits)}")

for split in splits: 
    print(split.page_content)

HttpResponseError: (InvalidRequest) Invalid request.
Code: InvalidRequest
Message: Invalid request.
Inner error: {
    "code": "InvalidContentLength",
    "message": "The input image is too large. Refer to documentation for the maximum file size."
}
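The InvalidContentLength error above indicates that at least one file exceeds the size the service accepts. One possible way to avoid a failed run is to check file sizes before calling the loader. The sketch below assumes a hypothetical helper, partition_pdfs_by_size, and a placeholder limit MAX_BYTES; the real limit depends on the resource and should be taken from the service documentation.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple

# Assumed placeholder limit; replace it with the documented limit for your resource.
MAX_BYTES = 4 * 1024 * 1024


def partition_pdfs_by_size(direc_path: str, max_bytes: int = MAX_BYTES) -> Tuple[List[Path], List[Path]]:
    """Split the PDFs in direc_path into (within_limit, too_large) by file size."""
    within_limit: List[Path] = []
    too_large: List[Path] = []
    for path in sorted(Path(direc_path).glob("*.pdf")):
        if path.stat().st_size <= max_bytes:
            within_limit.append(path)
        else:
            too_large.append(path)
    return within_limit, too_large


# Demo on a temporary directory with two dummy files and a deliberately tiny limit.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "small.pdf").write_bytes(b"x" * 10)
    Path(tmp, "large.pdf").write_bytes(b"x" * 100)
    ok, skipped = partition_pdfs_by_size(tmp, max_bytes=50)
    ok_names = [p.name for p in ok]
    skipped_names = [p.name for p in skipped]

print("to load:", ok_names)
print("skipped:", skipped_names)
```

Only the files in the first list would then be passed to the loader, while the skipped files can be reported, split, or compressed separately.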

In [5]:
from langchain_community.document_loaders import HuggingFaceDatasetLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, pipeline
from langchain.chains import RetrievalQA

In [11]:
%pip install -U langchain-huggingface
from langchain_huggingface import HuggingFaceEmbeddings

Defaulting to user installation because normal site-packages is not writeable
Collecting langchain-huggingface
  Downloading langchain_huggingface-0.0.3-py3-none-any.whl.metadata (1.2 kB)
Downloading langchain_huggingface-0.0.3-py3-none-any.whl (17 kB)
Installing collected packages: langchain-huggingface
Successfully installed langchain-huggingface-0.0.3
Note: you may need to restart the kernel to use updated packages.


In [16]:
model_path = "sentence-transformers/all-MiniLM-l6-v2"

# "cuda" requires a PyTorch build with CUDA support; use "cpu" on machines without one
model_kwargs = {"device": "cuda"}

encode_kwargs = {"normalize_embeddings": False}

embeddings = HuggingFaceEmbeddings(model_name=model_path, model_kwargs=model_kwargs, encode_kwargs=encode_kwargs)

text = "This is a test document"

# embed_query returns a single embedding vector (list[float]) for one string
query_result = embeddings.embed_query(text)
print(query_result[:3])

# Use the embed_documents method to embed a list of texts; it returns a list[list[float]] with one vector per document/split




AssertionError: Torch not compiled with CUDA enabled
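
The AssertionError above means that the installed PyTorch build has no CUDA support, so the "cuda" device cannot be used. A minimal sketch of one way around this is to choose the device at runtime and fall back to the CPU. The helper name select_device is hypothetical; torch.cuda.is_available() is the standard PyTorch check.

```python
def select_device() -> str:
    """Return "cuda" only when PyTorch is installed and can see a CUDA device."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is no GPU backend to use.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


# Pass the result to HuggingFaceEmbeddings through model_kwargs.
model_kwargs = {"device": select_device()}
print(model_kwargs)
```

On a machine without a CUDA-enabled build this selects "cpu", which is slower but sufficient for a small model such as all-MiniLM-l6-v2.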