# **CREATING A RAG SYSTEM FOR FINANCIAL DATA**

First install the required packages with pip, as shown below. The langchain-community package is needed because it contains the AzureAIDocumentIntelligenceLoader class.

## Importing the necessary packages

In [2]:
! pip install python-dotenv langchain langchain-community langchain-openai langchainhub openai tiktoken azure-ai-documentintelligence azure-identity azure-search-documents==11.6.0b3

Defaulting to user installation because normal site-packages is not writeable


## **Setting up environmental variables from the AzureAIDocumentIntelligence Resource**
Ensure that the AI Document Intelligence resource is located in a US region (US West, US East, or US Central).

In [10]:

from azure.core.credentials import AzureKeyCredential
doc_intelligence_endpoint = "https://docintelone.cognitiveservices.azure.com/"
doc_intelligence_key = "79549667acc44ba491bd843cf4ff5ad2"
# credential = AzureKeyCredential(doc_intelligence_key)
# The line above is not necessary because AzureAIDocumentIntelligenceLoader uses the api_key internally
# to create the AzureKeyCredential needed to extract and load the information.
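
Hard-coded keys are easy to leak when a notebook is shared. A minimal sketch of reading such settings from environment variables with the standard library only is shown here. The helper `read_setting` and the variable names `DOC_INTELLIGENCE_ENDPOINT` and `DOC_INTELLIGENCE_KEY` are assumptions made for illustration; they are not names required by Azure or LangChain.

```python
import os


def read_setting(name: str) -> str:
    """Return the value of environment variable `name`, failing loudly if it is unset."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value


# Hypothetical variable names: set them in the shell (or in a .env file loaded
# with python-dotenv) before running the notebook, then uncomment these lines.
# doc_intelligence_endpoint = read_setting("DOC_INTELLIGENCE_ENDPOINT")
# doc_intelligence_key = read_setting("DOC_INTELLIGENCE_KEY")
```

The same pattern could be applied to any other key or password used later in the notebook.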

## **Import, from the langchain_community package, the AzureAIDocumentIntelligenceLoader**

In [4]:
from langchain import hub
from langchain_openai import AzureChatOpenAI
from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader
from langchain_core.runnables import RunnablePassthrough
from langchain_text_splitters import MarkdownHeaderTextSplitter
from langchain_community.vectorstores.azuresearch import AzureSearch

# **Instantiation, Loading, and Chunking of pdf data**

1. Instantiate the AzureAIDocumentIntelligenceLoader class using the Azure AI Document Intelligence resource endpoint and key.
2. Call the load() method to create a list[Document]. (In this case, one PDF document creates a list of length 1.)
3. Split the document into chunks based on markdown headers. Alternatively, use RecursiveCharacterTextSplitter or the load_and_split() method.
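
To make step 3 concrete, here is a rough, standard-library sketch of what splitting on markdown headers means: a new chunk starts at every header line, and the header text is kept as metadata. The function `split_on_markdown_headers` and the sample text are made up for illustration. This is not how MarkdownHeaderTextSplitter is implemented, only the general idea.

```python
import re


def split_on_markdown_headers(text: str) -> list[dict]:
    """Split markdown text into chunks, starting a new chunk at every header line."""
    header_re = re.compile(r"^(#{1,6})\s+(.*)$")
    chunks = []
    current = {"header": None, "content": []}

    def has_content(chunk: dict) -> bool:
        # A chunk is worth keeping if it has a header or any non-blank line.
        return chunk["header"] is not None or any(l.strip() for l in chunk["content"])

    for line in text.splitlines():
        match = header_re.match(line)
        if match:
            if has_content(current):
                chunks.append(current)
            current = {"header": match.group(2).strip(), "content": []}
        else:
            current["content"].append(line)
    if has_content(current):
        chunks.append(current)

    return [
        {"header": c["header"], "content": "\n".join(c["content"]).strip()}
        for c in chunks
    ]


# Made-up sample text, not taken from the notebook's PDF.
sample = "# Income\nRevenue rose.\n\n## Costs\nCosts fell."
for chunk in split_on_markdown_headers(sample):
    print(chunk)
```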


## Import the os module and use it to iterate through documents in a directory

In [5]:
import os 

## Iterate through the directory and load each PDF file using this function

In [6]:
def load_docs(direc_path):
    """Load every PDF in direc_path with Azure AI Document Intelligence."""
    docs = []
    for file in os.listdir(direc_path):
        # Only process PDF files.
        if file.endswith(".pdf"):
            file_path = os.path.join(direc_path, file)
            loader = AzureAIDocumentIntelligenceLoader(
                file_path=file_path,
                api_endpoint=doc_intelligence_endpoint,
                api_key=doc_intelligence_key,
                api_model="prebuilt-read",
            )
            docs += loader.load()
    return docs

In [7]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

## Split the documents into overlapping chunks with RecursiveCharacterTextSplitter

In [31]:
# Instantiate the Azure AI Document Intelligence loader to load the document. You can specify either file_path or url_path.
#loader = AzureAIDocumentIntelligenceLoader(file_path="C:/Users/abbandomo/Downloads/sample-layout.pdf", api_key = doc_intelligence_key, api_endpoint = doc_intelligence_endpoint, api_model="prebuilt-read", mode='markdown')

direc_path = "C:/Users/abbandomo/OneDrive - KPMG/Desktop/RAG-IN-ABBANDOMO/TestFile"
docs = load_docs(direc_path)

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)

print(len(splits))

for split in splits: 
    print(split.page_content)


7
7/1/24, 11:50 AM
Building Machine Learning Systems That Don't Suck
"This is the best machine learning course I've done. Worth every cent."
- Jose Reyes, AI/ML at Cevo Australia
Building Machine Learning Systems That Don't Suck
A live, interactive program that'll help you build production-ready machine learning systems from the ground up.
Next cohort: July 1 - 18, 2024 Check the schedule for more details about upcoming cohorts.
I want to join! Sign in
Learn how to design, build, deploy, and scale machine learning systems to solve real-world problems.
I'll lose my mind if I see another book or course teaching people the same basic ideas for the hundredth time. Most people are stuck in beginner mode, and finding help to solve real-world problems is hard.
I want to change that.
I started writing software 30 years ago. I've written pipelines and trained models for some of the largest companies in the world. I want to show you how to do the same.
https://www.ml.school
1/11
7/1/24, 11:50 AM
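
The chunk_size and chunk_overlap parameters used above can be pictured with a simplified fixed-window splitter. This sketch is an assumption-laden simplification: RecursiveCharacterTextSplitter also tries to break on separators such as paragraph and line breaks, which the hypothetical `sliding_window_chunks` below ignores. It only shows how the two parameters interact.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # distance between consecutive chunk starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
```

In this simplified model, chunk_size=1000 with chunk_overlap=200 means each chunk starts 800 characters after the previous one, so text near a chunk boundary appears in both neighbouring chunks.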

In [13]:
from langchain_community.document_loaders import HuggingFaceDatasetLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, pipeline
from langchain.chains import RetrievalQA

In [14]:
%pip install -U langchain-huggingface
from langchain_huggingface import HuggingFaceEmbeddings

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [15]:
%pip install sentence-transformers

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [18]:
model_path = "sentence-transformers/all-MiniLM-L6-v2"

model_kwargs = {"device" : "cpu"}

encode_kwargs = {"normalize_embeddings" : False}

embedder = HuggingFaceEmbeddings(model_name = model_path, model_kwargs=model_kwargs, encode_kwargs=encode_kwargs)

#text = "This is a test document"

#query_result = embedder.embed_query(text)
#print(query_result[:3])

# Use the embed_documents method to create one embedding (list[float]) per document/split, returned as a list[list[float]].
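
The normalize_embeddings setting above controls whether each embedding is scaled to unit length. A small standard-library sketch of why that matters: once vectors are normalized, a plain dot product gives the same value as cosine similarity. The helper names and the toy vectors are illustrative assumptions; real embeddings from this model have many more dimensions.

```python
import math


def l2_normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        return list(vector)
    return [x / norm for x in vector]


def dot(a: list[float], b: list[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b (0.0 if either vector is zero)."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot(a, b) / (norm_a * norm_b)


# Toy vectors standing in for embeddings.
a, b = [3.0, 4.0], [4.0, 3.0]
print(cosine_similarity(a, b))
print(dot(l2_normalize(a), l2_normalize(b)))
```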



## **Using Azure Search For Vector Storage**


In [17]:
from langchain_community.vectorstores.azuresearch import AzureSearch

In [20]:
vector_store_address: str = "https://ragsearchone.search.windows.net"
vector_store_password: str = "oZfKSbPgirnz2DlPhtHA8KQjcJ8UaRKnpANa7UI16EAzSeC0NfRj"

index_name = "ragstore"
vector_store= AzureSearch(
    azure_search_endpoint=vector_store_address,
    azure_search_key=vector_store_password,
    index_name=index_name,
    embedding_function=embedder.embed_query,
)
print(len(splits))
# add_documents returns the list of IDs of the added texts.
vector_store.add_documents(documents=splits)

13


['YzQxYWFkYzgtOTUyZC00YjAyLTk0NjItOWQxMGI1MzY1N2Iy',
 'NmM5ZmZlYzctOGZkZS00NTQxLThjNTMtNDE1OGQ0YjcyMmI1',
 'YTFhMmQ4ODMtMDNhZC00NzE3LTkyNDEtYTZkNjg3YWM5YTVh',
 'NDE2NWY5ZTMtYWJhOC00M2Y3LWE5ZjQtMTNkZThiNzc5M2Qx',
 'Yzk0NDZjMmYtMWI2Zi00MjQ1LWFiMjctMWY1MmJlYTNmZmYy',
 'MWZjYTYzNjktMTczOS00YmM4LTk3ZmQtMDFiNTY2MDYwYTIw',
 'ZTkyZjJkNzYtZWM5MC00NzU4LWE5MDgtNzUzZDJlNDczMjU0',
 'MDExOTc1ZmItNWViZC00NDk1LTkyMWEtMDQ1OGUwZGE0YmUy',
 'NjUyMGQzZjctYmNjNC00NTQ5LWJjZDUtYzFkYjhiMDM1OGUy',
 'MDEyZDY2MmQtNWVlOC00NzcxLWI4OGQtNWJjMDg3ODkzMjFi',
 'ZmQxMDc1ZGYtMjYwNC00ZmE5LTkwMmEtMGIwYjFlMDA4MzBk',
 'NzM4OGZjYmQtODdiYy00ZDVmLTkzMTAtYjNjOTIyNzZlODlj',
 'M2U2NThiYWMtYjY4Ny00NTcyLWE3OWItYzJiMDE0NGI5ZTU0']

## **Query-based retrieval of relevant data**

In [21]:
!pip install -q torch
!pip install -q transformers
!pip install -q datasets

In [22]:
from langchain_community.document_loaders import HuggingFaceDatasetLoader
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, pipeline
from langchain_huggingface import HuggingFacePipeline
from langchain.chains import RetrievalQA

In [32]:
# Retrieve relevant chunks based on the question

retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 3})

# The "k" search kwarg specifies that the 3 most relevant documents should be returned.

#retrieved_docs = retriever.get_relevant_documents(
    #"How many outstanding shares as of April 24th 2024")

#print(retrieved_docs[0].page_content)

# IMPLEMENT THE LLM USING HUGGINGFACE

# Specify the model to use. Intel/dynamic_tinybert is fine-tuned for extractive question answering.
model_name = "Intel/dynamic_tinybert"

# Load the pretrained tokenizer and question-answering model for that model name.
tokenizer = AutoTokenizer.from_pretrained(model_name, padding=True, truncation=True, max_length=512)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

# Define a question-answering pipeline using the model and tokenizer.
question_answerer = pipeline(
    "question-answering",
    model=model,
    tokenizer=tokenizer,
    return_tensors='pt'
)

# Create an instance of HuggingFacePipeline, which wraps the question-answering pipeline
# with additional model-specific arguments (temperature and max_length).
# Note: HuggingFacePipeline is designed for text-generation-style tasks. A question-answering
# pipeline expects a question/context pair rather than a single prompt string, which is what
# triggers the ValueError shown at the end of this notebook.
llm = HuggingFacePipeline(
    pipeline=question_answerer,
    model_kwargs={"temperature": 0.7, "max_length": 512},
)


# Create a retriever from the vector store using the as_retriever method.
# This replaces the k=3 retriever defined above with one that uses the default search settings.
retriever = vector_store.as_retriever()
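
Conceptually, a similarity retriever with k=3 scores every stored chunk against the query embedding and returns the three best matches. The brute-force sketch below illustrates that idea with the standard library only. The functions `cosine` and `top_k` and the toy vectors are assumptions for illustration; a real vector index typically relies on approximate nearest-neighbour search rather than a full scan like this.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vector: list[float], stored: list[tuple[str, list[float]]], k: int = 3) -> list[str]:
    """Return the texts of the k stored (text, vector) pairs most similar to the query."""
    ranked = sorted(stored, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy two-dimensional "embeddings" for four chunks.
stored = [
    ("chunk a", [1.0, 0.0]),
    ("chunk b", [0.0, 1.0]),
    ("chunk c", [1.0, 1.0]),
    ("chunk d", [-1.0, 0.0]),
]
print(top_k([1.0, 0.0], stored, k=3))
```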

## Testing the ability of the RAG prototype to retrieve relevant documents

In [33]:
docs = retriever.invoke("What is the name of the machine learning course?")
print(docs[0].page_content)

https://www.ml.school
1/11
7/1/24, 11:50 AM
Building Machine Learning Systems That Don't Suck
This is the class I wish I had taken when I started.
This program will help you unlearn what you think machine learning is. It's a practical, hands-on class where you'll learn from years of experience and real-world examples.
When you join, you get lifetime access to the following:


In [40]:
from langchain_core.runnables import RunnableParallel
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

In [56]:
# Rag prompt for retrieval
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain

system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)
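
A "stuff" documents chain places all retrieved passages into the {context} slot of the prompt and then appends the user's question. The sketch below mimics that with plain string formatting to show roughly what text reaches the LLM. The function `render_stuff_prompt` and the sample strings are hypothetical, and the "System:"/"Human:" layout is an approximation of how a chat prompt is flattened to a string, not LangChain's actual implementation.

```python
def render_stuff_prompt(system_template: str, question: str, passages: list[str]) -> str:
    """Fill {context} with all passages joined by blank lines, then append the question."""
    context = "\n\n".join(passages)
    system_text = system_template.format(context=context)
    return f"System: {system_text}\nHuman: {question}"


# Made-up template and passages, standing in for system_prompt and the retrieved chunks.
template = "Use the retrieved context to answer the question.\n\n{context}"
print(render_stuff_prompt(template, "What is the course called?", ["Passage one.", "Passage two."]))
```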

In [57]:
# Retrieval using the Hugging Face pipeline defined above
response = rag_chain.invoke({"input": "What is the name of the machine learning course"})
response["answer"]

ValueError: System: You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, say that you don't know. Use three sentences maximum and keep the answer concise.

https://www.ml.school
1/11
7/1/24, 11:50 AM
Building Machine Learning Systems That Don't Suck
This is the class I wish I had taken when I started.
This program will help you unlearn what you think machine learning is. It's a practical, hands-on class where you'll learn from years of experience and real-world examples.
When you join, you get lifetime access to the following:

https://www.ml.school
1/11
7/1/24, 11:50 AM
Building Machine Learning Systems That Don't Suck
This is the class I wish I had taken when I started.
This program will help you unlearn what you think machine learning is. It's a practical, hands-on class where you'll learn from years of experience and real-world examples.
When you join, you get lifetime access to the following:

7/1/24, 11:50 AM
Building Machine Learning Systems That Don't Suck
"This is the best machine learning course I've done. Worth every cent."
- Jose Reyes, AI/ML at Cevo Australia
Building Machine Learning Systems That Don't Suck
A live, interactive program that'll help you build production-ready machine learning systems from the ground up.
Next cohort: July 1 - 18, 2024 Check the schedule for more details about upcoming cohorts.
I want to join! Sign in

7/1/24, 11:50 AM
Building Machine Learning Systems That Don't Suck
"This is the best machine learning course I've done. Worth every cent."
- Jose Reyes, AI/ML at Cevo Australia
Building Machine Learning Systems That Don't Suck
A live, interactive program that'll help you build production-ready machine learning systems from the ground up.
Next cohort: July 1 - 18, 2024 Check the schedule for more details about upcoming cohorts.
I want to join! Sign in
Human: What is the name of the machine learning course argument needs to be of type (SquadExample, dict)
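
The ValueError above indicates that the question-answering pipeline received one combined prompt string, whereas an extractive question-answering model expects a question and a context as separate fields. One possible way forward, sketched under assumptions, is to build that pair directly from the retrieved documents. `RetrievedDoc` is a stand-in for a LangChain Document (only page_content is used) and `build_qa_input` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class RetrievedDoc:
    """Stand-in for a retrieved document; only page_content is used here."""
    page_content: str


def build_qa_input(question: str, docs: list, separator: str = "\n\n") -> dict:
    """Join the retrieved passages into one context and pair it with the question."""
    context = separator.join(doc.page_content for doc in docs)
    return {"question": question, "context": context}


# Shortened passage taken from the retrieval output shown earlier.
docs = [RetrievedDoc("Building Machine Learning Systems That Don't Suck")]
qa_input = build_qa_input("What is the name of the machine learning course?", docs)
print(qa_input)
```

With the objects defined in this notebook, the resulting dictionary could then be passed to the pipeline directly, for example as question_answerer(**qa_input), instead of routing the question through the chat-prompt chain. Treat this as an untested suggestion rather than a verified fix.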