### Guide to Creating a Basic RAG Pipeline

#### This notebook is a guide to building a RAG (Retrieval-Augmented Generation) pipeline

#### This document contains the following sections:
1. Extract text from documents
2. Chunking Techniques
3. Vector Store
4. Retrieval Method
5. Q&A Chain

#### Short courses on RAG using LangChain
1. https://learn.deeplearning.ai/courses/langchain-chat-with-your-data/lesson/1/introduction
2. https://learn.deeplearning.ai/courses/langchain/lesson/1/introduction
3. https://learn.deeplearning.ai/courses/building-evaluating-advanced-rag/lesson/1/introduction

In [1]:
### Initialize the LLM object using LangChain (covered in a separate guide).
# from langchain.embeddings import AzureOpenAIEmbeddings
# from langchain_openai import AzureOpenAIEmbeddings
# from langchain_community.chat_models import AzureChatOpenAI

# Both the chat model and the embedding model are available from langchain_openai
from langchain_openai import ChatOpenAI, OpenAIEmbeddings


## FAISS vector store, used later to index the chunk embeddings
from langchain_community.vectorstores import FAISS

# Required parameters: update the inputs below. Make sure the embedding model and the LLM are deployed within the same resource.
openai_key = "<Enter OpenAI Key Here>"
llm = ChatOpenAI(api_key=openai_key, model="gpt-4o-mini")

embedding = OpenAIEmbeddings(api_key=openai_key)

# llm = AzureChatOpenAI(azure_deployment=api_deployment,
#                       api_version=api_version,
#                       azure_endpoint=base_url,
#                       api_key=api_key,
#                       model=model_name)


# embedding = AzureOpenAIEmbeddings(deployment=embedding_deployment,
#                                         openai_api_key=api_key,
#                                         openai_api_version=api_version,
#                                         azure_endpoint=base_url,
#                                         openai_api_type="azure")


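Pasting the key into the notebook makes it easy to leak when the notebook is shared. A minimal sketch of reading it from an environment variable instead is shown below; `load_api_key` is a hypothetical helper name, and `OPENAI_API_KEY` is assumed to be the variable where the key was exported.

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in `env_var`, or raise a clear error."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Environment variable {env_var} is not set; "
            "export it before running the notebook."
        )
    return key


# Usage in the setup cell (commented out because it needs a real key):
# openai_key = load_api_key()
# llm = ChatOpenAI(api_key=openai_key, model="gpt-4o-mini")
```

Failing early with an explicit message is usually easier to debug than an authentication error raised later by the first LLM call.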


### Extract Text from the Documents or Read Files

#### A variety of document loaders are integrated with LangChain. See the link for more details: https://python.langchain.com/docs/integrations/document_loaders/

In [2]:
## Use the PyMuPDF loader integrated with LangChain to load the document
from langchain_community.document_loaders import PyMuPDFLoader, Docx2txtLoader


loader = PyMuPDFLoader(r"Tanico.pdf")
documents=loader.load()

In [3]:
# Loaders imported from LangChain return the pages as LangChain Document objects by default
documents[0]

Document(metadata={'source': 'Tanico.pdf', 'file_path': 'Tanico.pdf', 'page': 0, 'total_pages': 26, 'format': 'PDF 1.4', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': 'wkhtmltopdf 0.12.5', 'producer': 'Qt 5.12.8', 'creationDate': "D:20240125071904+01'00'", 'modDate': '', 'trapped': ''}, page_content='Inline Viewer\nMenu Menu\nInformation Save XBRL Instance Save XBRL Zip File Open as HTML Settings Help\nSections Sections\nAdditional Search Options\n Include Fact Name\n Include Fact Content\n Include Labels\n Include Definitions\n Include Dimensions\nReference Options\n Include Topic\n Include Sub-Topic\n Include Paragraph\n Include Publisher\n Include Section\n Match Case\nSearch Facts\nClear Search  Submit Search\nData Data\n All\n Amounts Only\n Text Only\n Calculations Only\n Negatives Only\n Additional Items Only\nTags Tags\n All\n Standard Only\n Custom Only\nMore Filters More Filters\nSelecting any of the below will take a few moments.\nPeriods\nMeasures\nAx

In [4]:
print("# of pages in documents",len(documents))

# of pages in documents 26


#### Link: creating a LangChain Document object manually
https://python.langchain.com/api_reference/core/documents/langchain_core.documents.base.Document.html#langchain_core.documents.base.Document

### Chunk the Documents

#### Overview of the available chunking techniques: https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/

In [5]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

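# Chunks of up to 1500 characters with a 200-character overlap; separators are tried in order
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,
    chunk_overlap=200,
    separators=["\n\n", "\n", " "],
)
splitted_documents = splitter.split_documents(documents)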
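To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified sketch of a sliding character window. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter first breaks the text on the given separators and then merges the pieces); `chunk_with_overlap` is a hypothetical helper that only illustrates how consecutive chunks share text.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for piece in chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3):
    print(piece)
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz
```

Each chunk starts with the last three characters of the previous one, which is what keeps a sentence that straddles a boundary retrievable from at least one chunk.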

In [6]:
len(splitted_documents)

60

In [7]:
splitted_documents[0]

Document(metadata={'source': 'Tanico.pdf', 'file_path': 'Tanico.pdf', 'page': 0, 'total_pages': 26, 'format': 'PDF 1.4', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': 'wkhtmltopdf 0.12.5', 'producer': 'Qt 5.12.8', 'creationDate': "D:20240125071904+01'00'", 'modDate': '', 'trapped': ''}, page_content='Inline Viewer\nMenu Menu\nInformation Save XBRL Instance Save XBRL Zip File Open as HTML Settings Help\nSections Sections\nAdditional Search Options\n Include Fact Name\n Include Fact Content\n Include Labels\n Include Definitions\n Include Dimensions\nReference Options\n Include Topic\n Include Sub-Topic\n Include Paragraph\n Include Publisher\n Include Section\n Match Case\nSearch Facts\nClear Search  Submit Search\nData Data\n All\n Amounts Only\n Text Only\n Calculations Only\n Negatives Only\n Additional Items Only\nTags Tags\n All\n Standard Only\n Custom Only\nMore Filters More Filters\nSelecting any of the below will take a few moments.\nPeriods\nMeasures\nAx

In [8]:
splitted_documents[0].metadata

{'source': 'Tanico.pdf',
 'file_path': 'Tanico.pdf',
 'page': 0,
 'total_pages': 26,
 'format': 'PDF 1.4',
 'title': '',
 'author': '',
 'subject': '',
 'keywords': '',
 'creator': 'wkhtmltopdf 0.12.5',
 'producer': 'Qt 5.12.8',
 'creationDate': "D:20240125071904+01'00'",
 'modDate': '',
 'trapped': ''}

In [9]:
splitted_documents[0].page_content

'Inline Viewer\nMenu Menu\nInformation Save XBRL Instance Save XBRL Zip File Open as HTML Settings Help\nSections Sections\nAdditional Search Options\n Include Fact Name\n Include Fact Content\n Include Labels\n Include Definitions\n Include Dimensions\nReference Options\n Include Topic\n Include Sub-Topic\n Include Paragraph\n Include Publisher\n Include Section\n Match Case\nSearch Facts\nClear Search  Submit Search\nData Data\n All\n Amounts Only\n Text Only\n Calculations Only\n Negatives Only\n Additional Items Only\nTags Tags\n All\n Standard Only\n Custom Only\nMore Filters More Filters\nSelecting any of the below will take a few moments.\nPeriods\nMeasures\nAxis\n Explicit\n Typed\nMembers\n Explicit\n Typed\nScale\nBalance 2\n Debit\n Credit\nReset All Filters\nLinks\nFacts 233\nLoading Inline Form.\nTagged Sections\n Search in All.\n Search in Internal Sections Only.\n Search in External Sections Only.\nSearch Sections\nClear Sections Search  Submit Sections Search\nHelp\nGet

### Vector Store

#### Details on the various supported vector stores: https://python.langchain.com/v0.1/docs/integrations/vectorstores/

In [10]:
## Embed the chunks and build an in-memory FAISS index from the embeddings
from langchain_community.vectorstores import FAISS

db = FAISS.from_documents(splitted_documents,embedding)

In [11]:
# Inspect the contents of the FAISS docstore
#db.docstore.__dict__

In [12]:
# The index can be saved to disk so it can be reloaded later without re-embedding the chunks
db.save_local("./Index")

In [13]:
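# Load the index from disk. Newer LangChain versions refuse to unpickle the
# index unless allow_dangerous_deserialization=True; only set it for index
# files you created yourself.
try:
    db = FAISS.load_local("./Index", embedding)
except ValueError:
    db = FAISS.load_local("./Index", embedding, allow_dangerous_deserialization=True)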

### Retrieval Method

In [14]:
## Retrieve the chunks most similar to the user input from the vector DB. Different vector DBs support different index types, e.g. IVF_FLAT.
## Here, the 10 most similar chunks are retrieved from the DB.
user_prompt = "What is Total Assets as of December 31, 2023 for Tanico?"
most_similar_chunks = db.similarity_search_with_score(user_prompt,k=10)
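With LangChain's default FAISS setup, the score returned by `similarity_search_with_score` is a distance, so a lower score means a closer match. The sketch below illustrates the idea with a brute-force nearest-neighbour search over made-up 2-D vectors; `l2_distance`, `top_k` and the toy vectors are hypothetical, and the exact value FAISS reports may differ (for example, a squared distance), although the ranking logic is the same.

```python
import math


def l2_distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, chunk_vecs, k):
    """Return (chunk_id, distance) pairs for the k nearest chunks, closest first."""
    scored = [(chunk_id, l2_distance(query_vec, vec)) for chunk_id, vec in chunk_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy 2-D "embeddings"; real embeddings have hundreds or thousands of dimensions.
toy_vectors = {
    "balance_sheet": [1.0, 0.0],
    "cash_flow": [0.0, 1.0],
    "cover_page": [-1.0, 0.0],
}
print(top_k([0.9, 0.1], toy_vectors, k=2))
```

A real vector store replaces the brute-force loop with an index structure, but the contract is the same: embed the query, compare it against the stored vectors, and return the `k` closest chunks.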

In [15]:
# The result is a list of (LangChain Document, score) tuples: each Document carries its metadata, and the score is the similarity computed for that chunk
most_similar_chunks[0]

(Document(metadata={'source': 'Tanico.pdf', 'file_path': 'Tanico.pdf', 'page': 7, 'total_pages': 26, 'format': 'PDF 1.4', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': 'wkhtmltopdf 0.12.5', 'producer': 'Qt 5.12.8', 'creationDate': "D:20240125071904+01'00'", 'modDate': '', 'trapped': ''}, page_content='Table of Contents\n  \nTanico Inc.\nBalance Sheets\nAs of December 31, 2023 (unaudited) and September 30, 2023\n \n  \n   \n \n \n \nDecember 31,\n2023\n  \nSeptember 30,\n2023\n \nASSETS\n  \n   \n \n \n  \n   \n \nCurrent Assets\n  \n   \n \nCash\n $\n4,849  $\n14,299 \nPrepaids and other current assets\n  \n-   \n- \nTotal Current Assets\n  \n4,849   \n14,299 \nNon-Current Assets\n   \n    \n \nProperty, Plant & Equipment (net)\n  \n121   \n241 \nTotal Non-Current Assets\n  \n121   \n241 \nTotal Assets\n $\n4,970  $\n14,540 \n \n   \n    \n \nLIABILITIES AND STOCKHOLDERS’ EQUITY (DEFICIT)\n   \n    \n \nCurrent Liabilities\n   \n    \n \nAccounts Payable and Accr

In [16]:
aggregated_content = "\n".join([chunk[0].page_content for chunk in most_similar_chunks])

In [17]:
aggregated_content[:1000]

'Table of Contents\n  \nTanico Inc.\nBalance Sheets\nAs of December 31, 2023 (unaudited) and September 30, 2023\n \n  \n   \n \n \n \nDecember 31,\n2023\n  \nSeptember 30,\n2023\n \nASSETS\n  \n   \n \n \n  \n   \n \nCurrent Assets\n  \n   \n \nCash\n $\n4,849  $\n14,299 \nPrepaids and other current assets\n  \n-   \n- \nTotal Current Assets\n  \n4,849   \n14,299 \nNon-Current Assets\n   \n    \n \nProperty, Plant & Equipment (net)\n  \n121   \n241 \nTotal Non-Current Assets\n  \n121   \n241 \nTotal Assets\n $\n4,970  $\n14,540 \n \n   \n    \n \nLIABILITIES AND STOCKHOLDERS’ EQUITY (DEFICIT)\n   \n    \n \nCurrent Liabilities\n   \n    \n \nAccounts Payable and Accrued Expenses\n $\n2,748  $\n8,099 \nDeferred Revenue\n  \n-   \n- \nDue to Related Party\n  \n18,007   \n17,791 \nTotal Current Liabilities\n  \n20,755   \n25,890 \nTotal Liabilities\n  \n20,755   \n25,890 \n \n   \n    \n \nCommitments and contingencies\n  \n-   \n- \n \n   \n    \n \nStockholders’ Equity (Deficit)\n   \n 

### Final Q&A

In [18]:
## Once retrieved, the content can be used to generate the final response. Custom instructions can be passed in the prompt here.

overall_message = """
                Based on the provided content, respond to the question below.
                
                Content: {}
                
                Question: {}
                    
                """.format(aggregated_content,user_prompt)

response = llm.invoke(overall_message)
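Joining all retrieved chunks works for a small document, but it drops the page information and can grow past the model's context window. One possible refinement, sketched below, labels each chunk with its page number and stops at a character budget. `build_context` and the `max_chars=6000` default are assumptions, and the stand-in objects and scores are made up so the sketch runs without LangChain.

```python
from types import SimpleNamespace


def build_context(scored_chunks, max_chars=6000):
    """Join (document, score) pairs into one context string, labelling each chunk with its page."""
    parts = []
    used = 0
    for doc, _score in scored_chunks:
        block = f"[page {doc.metadata.get('page', '?')}]\n{doc.page_content}"
        if used + len(block) > max_chars:
            break  # stop before the context grows past the budget
        parts.append(block)
        used += len(block)
    return "\n\n".join(parts)


# Stand-ins for LangChain Document objects; the scores are illustrative only.
fake_chunks = [
    (SimpleNamespace(page_content="Total Assets $ 4,970", metadata={"page": 7}), 0.21),
    (SimpleNamespace(page_content="Cash $ 4,849", metadata={"page": 7}), 0.35),
]
print(build_context(fake_chunks))
```

In the notebook, `build_context(most_similar_chunks)` could take the place of `aggregated_content` in the prompt; the page labels also let the model cite where a figure came from.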



In [19]:
response.content

'The Total Assets for Tanico Inc. as of December 31, 2023, is $4,970.'

### Leverage the built-in retrieval methods available in the LangChain framework:

#### Link: https://python.langchain.com/v0.1/docs/modules/data_connection/retrievers/
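As a starting point, a LangChain vector store can be wrapped in a retriever with `as_retriever`. The sketch below assumes a recent LangChain version in which retrievers expose `invoke` (older releases used `get_relevant_documents`); `retrieve_chunks` is a hypothetical helper name.

```python
def retrieve_chunks(db, query: str, k: int = 10):
    """Fetch the k most similar chunks through the vector store's retriever interface."""
    retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": k})
    return retriever.invoke(query)  # list of Document objects, without scores


# Usage with the objects created earlier in this notebook:
# chunks = retrieve_chunks(db, user_prompt, k=10)
# aggregated_content = "\n".join(chunk.page_content for chunk in chunks)
```

Unlike `similarity_search_with_score`, the retriever returns plain documents, so the join no longer needs to unpack `(document, score)` tuples.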
