### Building a RAG Application
This notebook walks you through building a complete RAG system using:
1. LangChain framework
2. ChromaDB vector database
3. OpenAI/Groq: embeddings and language model (the cells below use HuggingFace embeddings with a Groq-hosted Llama model; the OpenAI variant is commented out)

### RAG (Retrieval-Augmented Generation) Architecture:

1. Document Loading: Load documents from various sources
2. Document Splitting: Break documents into smaller chunks
3. Embedding Generation: Convert chunks into vector representations
4. Vector Storage: Store embeddings in ChromaDB
5. Query Processing: Convert user query to embedding
6. Similarity Search: Find relevant chunks from vector store
7. Context Augmentation: Combine retrieved chunks with query
8. Response Generation: LLM generates answer using context
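
The eight steps above can be sketched end to end without any external services. This is a toy illustration only: the bag-of-words `embed` function, the in-memory `store` list, and the sample chunks are all hypothetical stand-ins for the embedding model, ChromaDB, and the PDF used later in this notebook.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy embedding: lowercase bag-of-words counts (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Steps 1-2: load and split (here the chunks are already split by hand)
chunks = [
    "an inner join returns rows that match in both tables",
    "a window function computes values over a set of rows",
    "the group by clause aggregates rows that share a value",
]

# Steps 3-4: embed each chunk and keep it in an in-memory "vector store"
store = [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(query: str, k: int = 2) -> List[str]:
    # Steps 5-6: embed the query and rank stored chunks by similarity
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(query: str, k: int = 2) -> str:
    # Step 7: combine the retrieved chunks with the user query
    context = "\n".join(retrieve(query, k))
    return f"Context:\n{context}\n\nQuestion: {query}"


# Step 8 would send this prompt to an LLM; here it is only printed.
prompt = build_prompt("what does an inner join return")
print(prompt)
```

The rest of the notebook replaces each stand-in with a real component while keeping the same data flow.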

### Benefits of RAG:
- Reduces hallucinations by grounding answers in retrieved text
- Provides up-to-date information without retraining the model
- Allows citing sources
- Works with domain-specific knowledge

In [1]:
import os 
from dotenv import load_dotenv
load_dotenv()

True

In [2]:
## langchain Import
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain.schema import Document
from langchain_openai import OpenAIEmbeddings
from langchain_huggingface  import HuggingFaceEmbeddings
from langchain_groq import ChatGroq

## vectorstore 
from langchain_community.vectorstores import Chroma

## Utility imports
import numpy as np 
import pandas as pd
from typing import List

## Document loader
from langchain_community.document_loaders import PyMuPDFLoader

In [3]:
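class smartPDFProcessor:
    """Load a PDF and split its pages into overlapping text chunks."""

    def __init__(self):
        # Chunks of roughly 1000 characters with a 20-character overlap;
        # separators=[" "] means the text is split only at spaces.
        self.text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000,
                                                             chunk_overlap=20,
                                                             separators=[" "])

    def process(self, pdf_path):
        # Load the PDF: PyMuPDFLoader returns one Document per page
        self.pdf_data = PyMuPDFLoader(pdf_path).load()
        print(f"Loaded {len(self.pdf_data)} pages from PDF.")
        # Split the page contents into manageable chunks
        chunks = self.text_splitter.split_documents(self.pdf_data)
        print(f"Processed into {len(chunks)} chunks.")
        return chunks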
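
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a space-based splitter with overlap. It is a simplified illustration, not the actual `RecursiveCharacterTextSplitter` algorithm; the function name `split_with_overlap` and the sample text are hypothetical.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 20) -> List[str]:
    """Greedy word-based splitter.

    Each chunk holds at most chunk_size characters (a single word longer than
    chunk_size is kept whole), and each new chunk starts with the trailing
    words of the previous one, up to chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    chunks: List[str] = []
    current: List[str] = []  # words of the chunk being built
    for word in text.split():
        if current and len(" ".join(current + [word])) > chunk_size:
            chunks.append(" ".join(current))
            # Carry over trailing words that fit within the overlap budget
            tail: List[str] = []
            for prev in reversed(current):
                if len(" ".join([prev] + tail)) > chunk_overlap:
                    break
                tail.insert(0, prev)
            # Drop the overlap if it would push the new chunk over the limit
            if len(" ".join(tail + [word])) > chunk_size:
                tail = []
            current = tail
        current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = " ".join(f"w{i}" for i in range(200))
pieces = split_with_overlap(sample, chunk_size=50, chunk_overlap=10)
print(len(pieces), pieces[:2])
```

A small overlap such as 20 characters keeps a few words of shared context between neighbouring chunks, so a sentence cut at a chunk boundary is less likely to lose its meaning entirely.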

In [4]:
# Load the sample data
process_pdf = smartPDFProcessor()
chunk_data = process_pdf.process('data/learn_sql.pdf')

Loaded 221 pages from PDF.
Processed into 354 chunks.


### Initialize the ChromaDB Vector Store and Store the Chunks as Vector Representations

In [5]:

## Create a chromaDB vector store
persist_dirc = "./chroma_db"
# model = 'sentence-transformers/all-MiniLM-L6-v2'

# Initialize HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2") # 384 dimensions

# # Initialize Chroma DB - By default, it uses L2 distance (Euclidean distance)
# # If you want to use cosine similarity, you can set the distance metric in the collection

# vectorstore = Chroma.from_documents(
#     documents=chunk_data,  
#     embedding=embeddings,  
#     persist_directory=persist_dirc,  
#     collection_name="rag_collection"
# )

# To use cosine distance instead, set the metric in the collection metadata:
vectorstore = Chroma.from_documents(
    documents=chunk_data,
    embedding=embeddings,
    persist_directory=persist_dirc,
    collection_name="rag_collection",
    # Set the distance metric for the HNSW index
    collection_metadata={"hnsw:space": "cosine"},
)



In [6]:
print(f"Vectorstore created with {vectorstore._collection.count()} vectors.")

Vectorstore created with 1770 vectors.
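
The count above (1770) is exactly five times the 354 chunks produced earlier. One plausible explanation, not confirmed by the notebook itself, is that the indexing cell was run several times against the same persisted `./chroma_db` directory, adding the same chunks again on each run. A minimal sketch of one way to guard against that is to derive a stable ID from each chunk, so re-adding overwrites instead of duplicating. The `chunk_id` helper and the dict-based `collection` are hypothetical stand-ins; with the real vector store, the assumption is that such IDs could be passed through the `ids` argument of `Chroma.from_documents`, or that the persist directory is deleted before re-indexing.

```python
import hashlib
from typing import Dict, List, Tuple


def chunk_id(source: str, page: int, text: str) -> str:
    """Stable ID derived from where the chunk came from and what it says."""
    digest = hashlib.sha256(f"{source}|{page}|{text}".encode("utf-8")).hexdigest()
    return digest[:32]


# Stand-in for a persisted collection: a dict keyed by ID,
# so adding the same chunk twice overwrites rather than duplicates.
collection: Dict[str, str] = {}


def index(chunks: List[Tuple[str, int, str]]) -> None:
    for source, page, text in chunks:
        collection[chunk_id(source, page, text)] = text


toy_chunks = [
    ("data/learn_sql.pdf", 19, "SQL is Structured Query Language"),
    ("data/learn_sql.pdf", 96, "Date and time functions manipulate fields"),
]

for _ in range(5):  # simulate running the indexing cell five times
    index(toy_chunks)

print(len(collection))  # prints 2
```

Note that two chunks with identical text on the same page would share an ID under this scheme; adding a chunk index to the hashed string avoids that.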


In [7]:
#Similarity search
query = "What are the types of joints in SQL?"

docs = vectorstore.similarity_search(query, k=3)
docs


[Document(metadata={'author': '', 'title': '', 'keywords': '', 'trapped': '', 'creationDate': "D:20200806083349+02'00'", 'subject': '', 'moddate': '2020-08-06T08:33:49+02:00', 'total_pages': 221, 'encryption': 'Standard V2 R3 128-bit RC4', 'page': 19, 'file_path': 'data/learn_sql.pdf', 'format': 'PDF 1.4', 'modDate': "D:20200806083349+02'00'", 'producer': 'GPL Ghostscript 9.52', 'creationdate': '2020-08-06T08:33:49+02:00', 'source': 'data/learn_sql.pdf', 'creator': ''}, page_content='Chapter 1: Getting started with SQL\nRemarks\nSQL is Structured Query Language used to manage data in a relational database system. \nDifferent vendors have improved upon the language and have variety of flavors for the language.\nNB: This tag refers explicitly to the ISO/ANSI SQL standard; not to any specific implementation of \nthat standard.\nVersions\nVersion\nShort Name\nStandard\nRelease Date\n1986\nSQL-86\nANSI X3.135-1986, ISO 9075:1987\n1986-01-01\n1989\nSQL-89\nANSI X3.135-1989, ISO/IEC 9075:1989

In [8]:
query = "What is window function?"

docs = vectorstore.similarity_search(query, k=3)
docs

[Document(metadata={'source': 'data/learn_sql.pdf', 'creationDate': "D:20200806083349+02'00'", 'creationdate': '2020-08-06T08:33:49+02:00', 'subject': '', 'title': '', 'format': 'PDF 1.4', 'total_pages': 221, 'page': 96, 'author': '', 'moddate': '2020-08-06T08:33:49+02:00', 'creator': '', 'file_path': 'data/learn_sql.pdf', 'modDate': "D:20200806083349+02'00'", 'producer': 'GPL Ghostscript 9.52', 'encryption': 'Standard V2 R3 128-bit RC4', 'keywords': '', 'trapped': ''}, page_content='functions provide information about the configuration of the current SQL \ninstance.\n1. \nConversion functions convert data into the correct data type for a given operation. For \nexample, these types of functions can reformat information by converting a string to a date or \nnumber to allow two different types to be compared.\n2. \nDate and time functions manipulate fields containing date and time values. They can return \nnumeric, date, or string values. For example, you can use a function to retrieve t

In [9]:
### Advanced Similarity Search with Scores
results = vectorstore.similarity_search_with_score(query, k=3)
results

[(Document(metadata={'trapped': '', 'keywords': '', 'total_pages': 221, 'producer': 'GPL Ghostscript 9.52', 'file_path': 'data/learn_sql.pdf', 'source': 'data/learn_sql.pdf', 'modDate': "D:20200806083349+02'00'", 'subject': '', 'creator': '', 'format': 'PDF 1.4', 'page': 96, 'title': '', 'author': '', 'creationdate': '2020-08-06T08:33:49+02:00', 'moddate': '2020-08-06T08:33:49+02:00', 'creationDate': "D:20200806083349+02'00'", 'encryption': 'Standard V2 R3 128-bit RC4'}, page_content='functions provide information about the configuration of the current SQL \ninstance.\n1. \nConversion functions convert data into the correct data type for a given operation. For \nexample, these types of functions can reformat information by converting a string to a date or \nnumber to allow two different types to be compared.\n2. \nDate and time functions manipulate fields containing date and time values. They can return \nnumeric, date, or string values. For example, you can use a function to retrieve 

### Understanding Similarity Scores
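The similarity score indicates how closely a document chunk matches your query. `similarity_search_with_score` returns a distance, and its meaning depends on the distance metric configured for the collection:

ChromaDB default: L2 distance (Euclidean distance)

- The L2 distance between x = (x₁, x₂) and y = (y₁, y₂) in 2D space is:
d(x, y) = √((x₁ − y₁)² + (x₂ − y₂)²)

- Lower scores = MORE similar (closer in vector space)
- Score of 0 = identical vectors
- Typical range for unit-length embeddings: 0 to 2 for the Euclidean distance above. Chroma's `l2` space reports the squared distance (no square root), which ranges from 0 to 4. Unnormalized vectors can score higher.

To configure cosine instead, set the distance metric when creating the collection: `collection_metadata={"hnsw:space": "cosine"}` (as done in this notebook).

Cosine (if configured):
- Cosine similarity ranges from -1 to 1 (1 = same direction), and higher = MORE similar
- Chroma reports the cosine distance, 1 − cosine similarity, so the returned scores range from 0 to 2 and lower scores = MORE similar, as with L2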
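
The metrics above can be checked by hand with a few lines of standard-library Python. This is a minimal sketch on 2D toy vectors; the helper names are illustrative, and `cosine_distance` is included on the assumption that the vector store reports 1 − cosine similarity rather than the similarity itself.

```python
import math
from typing import Sequence


def l2_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine_similarity(x: Sequence[float], y: Sequence[float]) -> float:
    """Dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.hypot(*x) * math.hypot(*y))


def cosine_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Distance form of cosine: 0 for the same direction, 2 for opposite."""
    return 1.0 - cosine_similarity(x, y)


x = (1.0, 0.0)
y = (0.0, 1.0)  # perpendicular to x

print(l2_distance(x, y))        # Euclidean distance
print(l2_distance(x, y) ** 2)   # squared Euclidean distance
print(cosine_similarity(x, y))  # similarity (higher = more similar)
print(cosine_distance(x, y))    # distance (lower = more similar)
```

For unit-length vectors the squared L2 distance equals twice the cosine distance, so both metrics rank the same chunks in the same order.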

### Initialize LLM, RAG Chain, Prompt Template, Query the RAG System

In [None]:
#initialize LLM
# from langchain_openai import ChatOpenAI
# llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0.0, max_tokens=1000)

# Initialize LLM with Groq
from langchain_groq import ChatGroq
llm = ChatGroq(model="llama-3.1-8b-instant", temperature=0.3, max_tokens=1000)

## Another way to initialize the LLM
# from langchain.chat_models import init_chat_model
# llm = init_chat_model(
#     "llama-3.1-8b-instant",
#     model_provider="groq",
#     temperature=0.0,
# )

In [None]:
# llm.invoke("What is window function?")



In [None]:
### Modern RAG Chain
from langchain.chains import create_retrieval_chain
from langchain.prompts import PromptTemplate
### create_stuff_documents_chain builds the document-processing step of a RAG workflow:
### combine retrieved documents → insert into prompt → run LLM → get response.
from langchain.chains.combine_documents import create_stuff_documents_chain  
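
The "stuff" step described in the comments above can be pictured as plain string handling. This is a simplified sketch of the idea, not LangChain's implementation: the `Doc` class, the `PROMPT` text, and the `stuff_documents` function are hypothetical stand-ins, and the `"\n\n"` separator is an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# Illustrative template with the same two placeholders used in this notebook
PROMPT = "Answer using only this context.\n\ncontext = {context}\n\nQuestion: {input}"


def stuff_documents(docs: List[Doc], question: str, separator: str = "\n\n") -> str:
    """Concatenate all retrieved chunks and insert them into the prompt."""
    context = separator.join(d.page_content for d in docs)
    return PROMPT.format(context=context, input=question)


docs = [
    Doc("INNER JOIN returns matching rows."),
    Doc("LEFT JOIN keeps all left rows."),
]
filled = stuff_documents(docs, "What is a join?")
print(filled)
```

Because every retrieved chunk is placed in a single prompt, the number of chunks (`k`) times the chunk size has to fit within the model's context window.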

In [None]:
### Step 1: Convert the vector store into a retriever
### It acts as a bridge between the vector store and the RAG chain
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 3}) 

In [None]:
from langchain_core.prompts import ChatPromptTemplate
## Step 2: Create a prompt template
system_prompt = """You are a helpful assistant that answers questions based on the provided context.
Use the context to provide accurate and concise answers.
If the context does not contain enough information, you should say "I don't know".

context = {context}
"""

prompt = ChatPromptTemplate.from_messages(
    [("system", system_prompt),
     ("human", "{input}")]
)

In [36]:
from langchain.chains import create_retrieval_chain

### Step 3: Create a document chain
### This chain stuffs the retrieved documents into the prompt and passes it to the LLM.
document_chain = create_stuff_documents_chain(llm=llm, prompt=prompt)

### Step 4 : Create a RAG Chain
rag_chain = create_retrieval_chain(retriever, document_chain)
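
The data flow through the combined chain can be sketched with plain functions. This is an illustration under stated assumptions, not LangChain's code: `make_retrieval_chain`, `fake_retriever`, and `fake_llm` are hypothetical, and the result shape (a dictionary with `input`, `context`, and `answer` keys) reflects how the chain's output is expected to look.

```python
from typing import Callable, Dict, List


def make_retrieval_chain(
    retrieve: Callable[[str], List[str]],
    answer_from: Callable[[str, List[str]], str],
) -> Callable[[Dict], Dict]:
    """Wire a retriever and a document chain into one callable."""

    def invoke(inputs: Dict) -> Dict:
        docs = retrieve(inputs["input"])             # retrieval step
        answer = answer_from(inputs["input"], docs)  # document chain + LLM step
        # Keep the original input and the retrieved chunks next to the answer
        return {**inputs, "context": docs, "answer": answer}

    return invoke


def fake_retriever(query: str) -> List[str]:
    # Hypothetical retriever that always returns one fixed chunk
    return ["A join combines rows from two tables."]


def fake_llm(question: str, docs: List[str]) -> str:
    # Hypothetical "LLM" that simply echoes the first chunk
    return f"Based on {len(docs)} chunk(s): {docs[0]}"


toy_chain = make_retrieval_chain(fake_retriever, fake_llm)
toy_response = toy_chain({"input": "What is a join?"})
print(toy_response["answer"])
print(toy_response["context"])
```

Keeping the retrieved chunks in the result is what makes it possible to show sources alongside the generated answer.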

In [40]:
response = rag_chain.invoke({"input": "What are joints in SQL and explain it?"})
print(response['answer'])

In SQL, a join is a method of combining information from two or more tables based on a related column between them. The result is a stitched set of columns from both tables, defined by the join type and join criteria.

A join can be thought of as a way of querying data from several tables in a joint fashion, with the rows displaying columns taken from more than one table. Joins can be used to:

- Combine data from multiple tables
- Retrieve data from a single table that is related to data in another table
- Perform complex queries that require data from multiple tables

There are several types of joins in SQL, including:

1. **INNER JOIN**: Returns only the rows that have a match in both tables.
2. **LEFT JOIN** (or **LEFT OUTER JOIN**): Returns all the rows from the left table and the matching rows from the right table. If there is no match, the result will contain NULL values.
3. **RIGHT JOIN** (or **RIGHT OUTER JOIN**): Similar to LEFT JOIN, but returns all the rows from the right t