## RAG (Retrieval-Augmented Generation) Architecture:

1. Document Loading: Load documents from various sources
2. Document Splitting: Break documents into smaller chunks
3. Embedding Generation: Convert chunks into vector representations
4. Vector Storage: Store embeddings in ChromaDB
5. Query Processing: Convert user query to embedding
6. Similarity Search: Find relevant chunks from vector store
7. Context Augmentation: Combine retrieved chunks with query
8. Response Generation: LLM generates answer using context

Benefits of RAG:
- Reduces hallucinations
- Provides up-to-date information
- Allows citing sources
- Works with domain-specific knowledge

In [36]:
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings

db_folder = "./chroma_db"

try:
    # Try to load existing vector store
    vector_store = Chroma(
        persist_directory=db_folder,
        embedding_function=HuggingFaceEmbeddings(),
        collection_name="Rag_collection"
    )
    print(f"✅ Loaded existing collection. Count: {vector_store._collection.count()}")
    
except Exception as e:
    # Collection doesn't exist, create new one
    print(f"Collection not found. Creating new one...")
    vector_store = Chroma.from_documents(
        documents=[],  # Empty
        embedding=HuggingFaceEmbeddings(),
        persist_directory=db_folder,
        collection_name="Rag_collection"
    )
    print(f"✅ Created new empty collection. Count: {vector_store._collection.count()}")

  embedding_function=HuggingFaceEmbeddings(),


✅ Loaded existing collection. Count: 144


- Creating PDF Processing

In [8]:

from langchain_core.documents import Document # Document is container which contains metadata and page_content
from typing import List # relevant for TypeHints
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
class PDFProcessor:
    """Advanced PDF processing with error handling"""
    def __init__(self,chunk_size=1000,chunk_overlap=100):
        self.chunk_size=chunk_size,
        self.chunk_overlap=chunk_overlap,
        self.text_splitter=RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            separators=[" "], # only split at spaces between words after 1000 char

        )

    def process_pdf(self,pdf_path:str)->List[Document]: # returns a List of Documents => Each document contains metadata + page_content
        """Process PDF with smart chunking and metadata enhancement"""

        # Laod PDF

        loader=PyPDFLoader(pdf_path)
        pages=loader.load()

        ## Process each page

        processed_chunks=[]

        for page_num,page in enumerate(pages):
            ## clean text
            cleaned_text=self._clean_text(page.page_content)

            # Skip nearly empty pages
            if len(cleaned_text.strip()) < 50:
                continue

            # Create chunks with enhanced metadata
            #this is adding more meta data on the chunks ontop of existing data
            chunks = self.text_splitter.create_documents(
                texts=[cleaned_text],
                metadatas=[{
                    **page.metadata, #  copies ALL existing metadata, then adds your new fields (i.e. adding on new meta ontop of exisitng metadata)
                    "page": page_num + 1,
                    "total_pages": len(pages),
                    "chunk_method": "smart_pdf_processor",
                    "char_count": len(cleaned_text)
                }]
            )
            
            processed_chunks.extend(chunks)

        return processed_chunks
        #private method
    def _clean_text(self, text: str) -> str:
        """Clean extracted text"""
        # Remove excessive whitespace
        text = " ".join(text.split())
        
        
        return text

    
            


In [9]:
pdf_processor= PDFProcessor()

try:
    smart_chunks=pdf_processor.process_pdf("data_sources/Advisory Guidelines on Enforcement of DP Provisions_1oct2022.pdf")
    print(f"Processed into {len(smart_chunks)} smart chunks")

    # Show enhanced metadata
    if smart_chunks:
        #This metadata when put into vector database makes it much easier when doing query/retrieval later
        print("\nSample chunk metadata:")
        #This is the meta data of the first chunk
        for key, value in smart_chunks[0].metadata.items():
            print(f"  {key}: {value}")
    print("---------Checking smart chunks ----------\n")
   # print(smart_chunks[1])
    print("---------Checking overall smart chunks metadata ----------\n")
    print(smart_chunks)
except Exception as e:
    print(f"Processing error: {e}")

Processed into 144 smart chunks

Sample chunk metadata:
  producer: PyPDF
  creator: PyPDF
  creationdate: 2022-09-23T15:43:22+08:00
  moddate: 2022-09-23T15:43:23+08:00
  source: data_sources/Advisory Guidelines on Enforcement of DP Provisions_1oct2022.pdf
  total_pages: 52
  page: 1
  page_label: 1
  chunk_method: smart_pdf_processor
  char_count: 136
---------Checking smart chunks ----------

---------Checking overall smart chunks metadata ----------

[Document(metadata={'producer': 'PyPDF', 'creator': 'PyPDF', 'creationdate': '2022-09-23T15:43:22+08:00', 'moddate': '2022-09-23T15:43:23+08:00', 'source': 'data_sources/Advisory Guidelines on Enforcement of DP Provisions_1oct2022.pdf', 'total_pages': 52, 'page': 1, 'page_label': '1', 'chunk_method': 'smart_pdf_processor', 'char_count': 136}, page_content='ADVISORY GUIDELINES ON ENFORCEMENT OF THE DATA PROTECTION PROVISIONS Issued 21 April 2016 Revised 1 February 2021 Revised 1 October 2022'), Document(metadata={'producer': 'PyPDF', 'c

In [None]:
#Pass the chunks into the vectordb
smart_chunks[0] # 1 chunk
print(f"✅ Added {len(smart_chunks)} chunks")
print(f"Total vectors in DB before adding chunks: {vector_store._collection.count()}")
vector_store.add_documents(smart_chunks)
print(f"Total vectors in DB after adding chunks: {vector_store._collection.count()}")



✅ Added 144 chunks
Total vectors in DB before adding chunks: 0
Total vectors in DB after adding chunks: 144


## Intialising LLM (Gemma)

In [37]:
from langchain.chat_models.base import init_chat_model
from langchain_core.prompts import ChatPromptTemplate # creates a structured template for your LLM prompt => i.e. Defines HOW to present the retrieved documents and question to the LLM
from dotenv import load_dotenv
load_dotenv() #loads ALL variables from .env file 


# Gemma model test
llm = init_chat_model(model="groq:gemma2-9b-it")
response = llm.invoke("What are you?")
response 

## Create a prompt template

system_prompt="""You are an assistant for answering questions related to Singapore's Personal Data and Protecttion Act (PDPA). 
Use the following pieces of retrieved context to answer the question. Any questions not related to PDPA, you should reply "Please ask questions related to PDPA"
 or refer them to use other sources to answer their questions
If you don't know the answer to PDPA, just say that you don't know. 
Use three sentences maximum and keep the answer concise.

Context: {context}"""

prompt = ChatPromptTemplate.from_messages([
    ("system", system_prompt),
    ("human", "{input}")
])

- Create_stuff_document_chain

    - chain that "stuffs" (inserts) all retrieved documents into a single prompt and sends it to the LLM. It's called "stuff" because it literally stuffs all the documents into the context window at once.

- This chain:

    - Takes retrieved documents
    - "Stuffs" them into the prompt's {context} placeholder
    - Sends the complete prompt to the LLM
    - Returns the LLM's response

In [24]:
from langchain.chains.combine_documents import create_stuff_documents_chain

## Convert vector store to retriever
retriever=vector_store.as_retriever(
    search_kwarg={"k":3} ## Retrieve top 3 relevant chunks
)
retriever


### Create a document chain
document_chain=create_stuff_documents_chain(llm,prompt) # A chain that takes documents + question → generates an answer
document_chain

RunnableBinding(bound=RunnableBinding(bound=RunnableAssign(mapper={
  context: RunnableLambda(format_docs)
}), kwargs={}, config={'run_name': 'format_inputs'}, config_factories=[])
| ChatPromptTemplate(input_variables=['context', 'input'], input_types={}, partial_variables={}, messages=[SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context'], input_types={}, partial_variables={}, template='You are an assistant for answering questions related to Singapore\'s Personal Data and Protecttion Act (PDPA). \nUse the following pieces of retrieved context to answer the question. Any questions not related to PDPA, you should reply "Please ask questions related to PDPA"\n or refer them to use other sources to answer their questions\nIf you don\'t know the answer to PDPA, just say that you don\'t know. \nUse three sentences maximum and keep the answer concise.\n\nContext: {context}'), additional_kwargs={}), HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['inp

In [25]:
from langchain.chains import create_retrieval_chain
rag_chain=create_retrieval_chain(retriever,document_chain)
rag_chain

RunnableBinding(bound=RunnableAssign(mapper={
  context: RunnableBinding(bound=RunnableLambda(lambda x: x['input'])
           | VectorStoreRetriever(tags=['Chroma', 'HuggingFaceEmbeddings'], vectorstore=<langchain_community.vectorstores.chroma.Chroma object at 0x000001DB6A8424B0>, search_kwargs={}), kwargs={}, config={'run_name': 'retrieve_documents'}, config_factories=[])
})
| RunnableAssign(mapper={
    answer: RunnableBinding(bound=RunnableBinding(bound=RunnableAssign(mapper={
              context: RunnableLambda(format_docs)
            }), kwargs={}, config={'run_name': 'format_inputs'}, config_factories=[])
            | ChatPromptTemplate(input_variables=['context', 'input'], input_types={}, partial_variables={}, messages=[SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context'], input_types={}, partial_variables={}, template='You are an assistant for answering questions related to Singapore\'s Personal Data and Protecttion Act (PDPA). \nUse the following 

In [33]:
response=rag_chain.invoke({"input":"What are the circumstances where the comission can launch a investigation"})
response

{'input': 'What are the circumstances where the comission can launch a investigation',
 'context': [Document(metadata={'chunk_method': 'smart_pdf_processor', 'total_pages': 52, 'source': 'data_sources/Advisory Guidelines on Enforcement of DP Provisions_1oct2022.pdf', 'creationdate': '2022-09-23T15:43:22+08:00', 'moddate': '2022-09-23T15:43:23+08:00', 'char_count': 2173, 'producer': 'PyPDF', 'creator': 'PyPDF', 'page_label': '26', 'page': 26}, page_content='may, in certain circumstances, commence an investigation of its own motion. This may include (without limitation) situations w here the Commission receives information concerning the conduct of an organisation. In such situations, the Commission may, if it considers it appropriate, pr oceed with an investigation of its own motion based on the information received.'),
  Document(metadata={'moddate': '2022-09-23T15:43:23+08:00', 'total_pages': 52, 'page': 25, 'creationdate': '2022-09-23T15:43:22+08:00', 'source': 'data_sources/Advisory

In [34]:
response['answer']

"The Personal Data Protection Commission (PDPC) may launch an investigation of its own motion if it receives information concerning an organisation's conduct.  The PDPC may consider various factors, including the public interest, before deciding to commence an investigation.  The PDPC has specific powers outlined in the PDPA to conduct investigations, such as requiring document production and entering premises. \n\n\n\n"

In [35]:
response=rag_chain.invoke({"input":"When did K-POP Demon Hunters release?"})
response

{'input': 'When did K-POP Demon Hunters release?',
 'context': [Document(metadata={'page_label': '23', 'creator': 'PyPDF', 'creationdate': '2022-09-23T15:43:22+08:00', 'total_pages': 52, 'chunk_method': 'smart_pdf_processor', 'char_count': 1261, 'source': 'data_sources/Advisory Guidelines on Enforcement of DP Provisions_1oct2022.pdf', 'producer': 'PyPDF', 'page': 23, 'moddate': '2022-09-23T15:43:23+08:00'}, page_content='21 or 22 of the PDPA after completion of a review unless there appears to the Commission to be a significant non-compliance with section 21 or 22 of the PDPA or there are other exceptional circumstances. Please refer to section 16 of these Guidelines for more information on when the Commission may commence an investigation. 35 See paragraph 11.2 of these Guidelines.'),
  Document(metadata={'chunk_method': 'smart_pdf_processor', 'creationdate': '2022-09-23T15:43:22+08:00', 'page_label': '24', 'source': 'data_sources/Advisory Guidelines on Enforcement of DP Provisions_1o