# PDF RAG System
**Process:**
1. Ingest PDF files
2. Extract text from PDF files and split into small chunks
3. Send the chunks to the embedding model
4. Save the embeddings to a vector database
5. Perform similarity search on the vector database to find similar documents
6. Retrieve the similar documents and present them to the user

In [1]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.documents.base import Document
from langchain_ollama import OllamaEmbeddings, ChatOllama
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain.prompts import ChatPromptTemplate, PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain.retrievers.multi_query import MultiQueryRetriever

In [2]:
DOC_PATH = './data/BOI.pdf'
MODEL = 'llama3.2'
CHROMA_DB_DIR = './data/langchain_chroma_db'
EMBED_MODEL = 'nomic-embed-text'

## Ingesting PDF 

In [3]:
def pdf_loader(path: str) -> list[Document]:
    loader = PyPDFLoader(path)
    data = loader.load()
    return data

In [4]:
data = pdf_loader(DOC_PATH)

In [5]:
data[0]

Document(metadata={'producer': 'Adobe PDF Library 17.0', 'creator': 'Adobe InDesign 18.1 (Windows)', 'creationdate': '2023-12-22T14:19:26-05:00', 'moddate': '2023-12-22T14:22:35-05:00', 'title': 'BOI Report Filing Instructions', 'trapped': '/False', 'source': './data/BOI.pdf', 'total_pages': 21, 'page': 0, 'page_label': '1'}, page_content='Beneficial Ownership Information Report\nFiling Instructions\nFinancial Crimes Enforcement Network\nU.S. Department of the Treasury\nVersion 1.0 January 2024')

In [6]:
data[0].page_content[:100]

'Beneficial Ownership Information Report\nFiling Instructions\nFinancial Crimes Enforcement Network\nU.S'

## Chunking

In [7]:
def split_and_chunk(
    data: list[Document],
    *,
    chunk_size: int = 1200,
    chunk_overlap: int = 300
) -> list[Document]:
    splitter = RecursiveCharacterTextSplitter(chunk_size= chunk_size, chunk_overlap= chunk_overlap)
    chunks = splitter.split_documents(data)
    return chunks

In [8]:
chunks = split_and_chunk(data)

In [9]:
len(chunks)

48

In [10]:
print(chunks[0])

page_content='Beneficial Ownership Information Report
Filing Instructions
Financial Crimes Enforcement Network
U.S. Department of the Treasury
Version 1.0 January 2024' metadata={'producer': 'Adobe PDF Library 17.0', 'creator': 'Adobe InDesign 18.1 (Windows)', 'creationdate': '2023-12-22T14:19:26-05:00', 'moddate': '2023-12-22T14:22:35-05:00', 'title': 'BOI Report Filing Instructions', 'trapped': '/False', 'source': './data/BOI.pdf', 'total_pages': 21, 'page': 0, 'page_label': '1'}


## Vector database

In [11]:
# for writing data to vector db
def chunks_to_vectordb(
    chunks: list[Document],
    name: str,
    directory: str
) -> Chroma:
    vector_db = Chroma.from_documents(
        documents= chunks,
        embedding= OllamaEmbeddings(model= EMBED_MODEL),
        collection_name= name,
        persist_directory= directory
    )

    return vector_db

We only have to run below cell once, after that we can just read the data using the next cell.

In [12]:
vector_db = chunks_to_vectordb(
    chunks= chunks,
    name= 'boi_rag',
    directory= CHROMA_DB_DIR
)

Loading already stored vector db

In [13]:
vector_db = Chroma(
    persist_directory= CHROMA_DB_DIR, 
    embedding_function= OllamaEmbeddings(model= EMBED_MODEL)
)

In [14]:
vector_db

<langchain_chroma.vectorstores.Chroma at 0x266e26118e0>

## Retrieval

In [15]:
llm = ChatOllama(model= MODEL)

A simple prompt for relevant data retrieval

In [23]:
QUERY_PROMPT = PromptTemplate(
    input_variables= ['question'],
    template= """You are an AI language model assistant. Your task is to generate alternative queries for retrieving relevant information from a vector database for answering a user given question. You have to add relevant subtopics, use narrower and broader scopes and use domain-specific terms. Your core focus is Beneficial Ownership Information Report (BOIR) and it's filling instructions. Maintain the original question's intent while expanding. Do not invent entities, names, or laws that are not explicitly related to the Beneficial Ownership Information Report. Ensure at least two queries are highly specific and at least two are moderately broad. Provide around five alternative queries with small explanation, each on a new line, with no numbering and no bullet points.
    Original question: {question}"""
)

In [24]:
retriever = MultiQueryRetriever.from_llm(
    retriever= vector_db.as_retriever(),
    llm= llm,
    prompt= QUERY_PROMPT
)

RAG prompt 

In [25]:
template = """Answer the question based on the given context.
Context: {context}
Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)

Chain

In [26]:
chain = (
    {'context': retriever, 'question': RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

In [27]:
response = chain.invoke(input= 'What are the main points as a business owner I should know?')

In [28]:
print(response)

As a business owner, there are several key points you should be aware of to ensure the success and growth of your enterprise. Here are some main points:

1. **Business Planning**: Develop a comprehensive business plan that outlines your mission, vision, target market, financial projections, and operational strategy.
2. **Market Research**: Conduct thorough market research to understand your target audience, their needs, preferences, and buying habits.
3. **Financial Management**: Effectively manage your finances by creating a budget, tracking expenses, and maintaining a cash flow management system.
4. **Marketing Strategy**: Develop a solid marketing strategy that includes branding, advertising, social media, and content marketing to reach your target audience.
5. **Human Resources**: Build a strong team of employees or contractors who share your vision and values, and provide them with the necessary training and resources to succeed.
6. **Operations Management**: Implement efficient o

In [29]:
response = chain.invoke(input= 'What is Beneficial Ownership Information Report?')
print(response)

I don't have information about a "Beneficial Ownership Information Report" in my knowledge cutoff of December 2023. However, I can provide general information.

Beneficial ownership information reports are typically filed by companies with regulatory bodies, such as the Securities and Exchange Commission (SEC) in the United States or the Companies House in the UK. These reports disclose the ultimate beneficial owners of a company, which includes individuals who directly or indirectly control the company through shareholdings or other means.

The purpose of these reports is to provide transparency and accountability, helping to prevent tax evasion, money laundering, and other illicit activities. Beneficial ownership information can also be used to track the flow of capital and identify potential risks or vulnerabilities in the financial system.

If you have any more specific information about the context in which this question was asked, I may be able to provide a more detailed answer.


In [30]:
response = chain.invoke(input= 'How to Report Beneficial Ownership Information')
print(response)

The beneficial ownership information should be reported through various channels, depending on the jurisdiction and type of entity. Here are some common ways to report beneficial ownership information:

1. **Company Registration**: When registering a company, it is often required to provide information about the company's beneficial owners, such as directors, shareholders, or controllers.
2. **Annual Reports**: Many countries require companies to file annual reports that include information about their beneficial owners.
3. **Anti-Money Laundering (AML) Reporting**: Financial institutions and other organizations may be required to report suspicious transactions and identify beneficial owners of customers.
4. **National Counter-Terrorism Centre (NCTC)**: In India, the NCTC requires certain entities to file a Beneficial Owner Information Form, which includes details about the beneficial owner.
5. **State Data Centers**: Some states require companies to register with their state data cent