<a href="https://colab.research.google.com/github/DhrubaAdhikary/GEN_AI_DEMO/blob/master/1_Basic_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook builds a small RAG pipeline with:

- Interactive document upload
- Sentence-Transformers for embeddings
- ChromaDB as the vector database and similarity-search backend
- LangChain orchestration
- Clear separation into notebook cells

`faiss-cpu` is installed as well, but the cells below never call it; all similarity search goes through ChromaDB.

The code targets the LangChain 0.1.x releases pinned in the install cell.

📘 RAG with LangChain + Sentence Transformers + FAISS + ChromaDB

# Retrieval-Augmented Generation (RAG)

Pipeline:
1. Upload document (PDF / TXT / MD)
2. Chunk document
3. Generate embeddings using Sentence Transformers
4. Store vectors in ChromaDB (which uses its own HNSW index, not FAISS)
5. Perform similarity search
6. Query using an LLM


In [6]:
!pip install \
  langchain==0.1.20 \
  langchain-community==0.0.38 \
  langchain-openai==0.1.7 \
  sentence-transformers==2.6.1 \
  chromadb==0.4.24 \
  faiss-cpu==1.13.2 \
  pypdf==4.2.0 \
  ipywidgets==8.1.2 \
  python-dotenv==1.0.1


Collecting faiss-cpu==1.13.2
  Downloading faiss_cpu-1.13.2-cp310-abi3-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (7.6 kB)
Downloading faiss_cpu-1.13.2-cp310-abi3-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (23.8 MB)
Installing collected packages: faiss-cpu
  Attempting uninstall: faiss-cpu
    Found existing installation: faiss-cpu 1.8.0
    Uninstalling faiss-cpu-1.8.0:
      Successfully uninstalled faiss-cpu-1.8.0
Successfully installed faiss-cpu-1.13.2


In [8]:
from google.colab import userdata
import os

# Fetch from Colab Secrets
api_key = userdata.get("OPENAI_API_KEY")

if not api_key:
    raise ValueError(
        "OPENAI_API_KEY not found in Colab Secrets. "
        "Add it via the 🔒 Secrets panel."
    )

os.environ["OPENAI_API_KEY"] = api_key

print("OPENAI_API_KEY loaded successfully")


OPENAI_API_KEY loaded successfully
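
Outside Colab, `google.colab.userdata` is not available. As a hedged sketch, the same check can be run against an ordinary environment mapping instead. `resolve_api_key` is a hypothetical helper, not part of the original notebook.

```python
import os


def resolve_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the key stored under `name`, or raise if it is missing or blank."""
    key = env.get(name, "").strip()
    if not key:
        raise ValueError(f"{name} is not set")
    return key


# Demonstration with a plain dict so the example does not depend on the real environment.
demo_env = {"OPENAI_API_KEY": "  sk-demo-not-a-real-key  "}
print(resolve_api_key(demo_env))
```

Passing the mapping as a parameter keeps the function easy to test; in a real session it would simply be called with no arguments.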


In [17]:
### Upload File
from google.colab import files

uploaded = files.upload()


Saving CV #5 Canny_Hough (1).pdf to CV #5 Canny_Hough (1).pdf


In [18]:
from pathlib import Path

from langchain_community.document_loaders import PyPDFLoader, TextLoader

if not uploaded:
    raise RuntimeError("No file uploaded")

# files.upload() returns a dict keyed by filename; use the first upload.
filename = next(iter(uploaded))
suffix = Path(filename).suffix.lower()

# Choose a loader from the file extension.
if suffix == ".pdf":
    loader = PyPDFLoader(filename)  # one Document per page
elif suffix in {".txt", ".md"}:
    # Assumes the text file is UTF-8 encoded.
    loader = TextLoader(filename, encoding="utf-8")  # a single Document
else:
    raise ValueError(f"Unsupported file type: {suffix}")

documents = loader.load()

print(f"Loaded file: {filename}")
print(f"Pages: {len(documents)}")


Loaded file: CV #5 Canny_Hough (1).pdf
Pages: 40


In [19]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=150
)

chunks = splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks")


Created 40 chunks
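
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sliding-window sketch in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter first tries to break on separators such as paragraph breaks and newlines); `sliding_chunks` is a hypothetical helper for illustration only. Note also that 40 pages produced 40 chunks above, which suggests that each page's extracted text fit within 800 characters, so no page was split.

```python
def sliding_chunks(text: str, chunk_size: int = 800, chunk_overlap: int = 150) -> list[str]:
    """Split text into fixed-size windows; neighbours share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghij" * 3  # 30 characters
pieces = sliding_chunks(sample, chunk_size=12, chunk_overlap=4)
for piece in pieces:
    print(repr(piece))
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.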


Create Embeddings (Sentence Transformer)

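Each chunk's text is converted into a 384-dimensional vector by the `all-MiniLM-L6-v2` model. `normalize_embeddings=True` scales every vector to unit length.

🟩 Cell: Embeddings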

In [22]:
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    encode_kwargs={"normalize_embeddings": True}
)
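
With `normalize_embeddings=True`, every vector has length 1, so a plain dot product between two embeddings equals their cosine similarity. The sketch below checks this with made-up 3-dimensional vectors (real `all-MiniLM-L6-v2` embeddings have 384 dimensions); the helper names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return list(vec)
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

# Both values agree: dot product of unit vectors vs. cosine of the originals.
print(round(dot(a_unit, b_unit), 6), round(cosine(a, b), 6))
```

For unit vectors the squared Euclidean distance is 2 − 2·cos, so ranking by Euclidean distance and ranking by cosine similarity give the same order.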


Store Chunks in Vector Database (ChromaDB)

Storing the chunk vectors in ChromaDB is what makes similarity search possible; the index is saved on disk under `chroma_db`.

🟩 Cell: Vector Store

In [23]:
from langchain_community.vectorstores import Chroma

VECTOR_DB_DIR = "chroma_db"

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory=VECTOR_DB_DIR
)

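# With chromadb 0.4.x, data is written to persist_directory automatically,
# so the deprecated vectorstore.persist() call is not needed.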

print("Vector store created")


Vector store created



Create a Retriever

This is the retrieval part of RAG: for each query, the retriever returns the 4 most similar chunks (`k=4`).

🟩 Cell: Retriever

In [24]:
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4}
)
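
Conceptually, a similarity retriever with `k=4` embeds the query, scores it against the stored vectors, and returns the best matches. The brute-force sketch below shows that idea with toy 2-dimensional vectors and a dot-product score. A real vector store uses an approximate nearest-neighbour index rather than a full scan, and `top_k` and the sample data are hypothetical.

```python
import heapq


def top_k(query_vec, stored, k=4):
    """Return the k (score, text) pairs with the highest dot-product score."""
    scored = (
        (sum(q * v for q, v in zip(query_vec, vec)), text)
        for text, vec in stored
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


# Toy "vector store": (chunk text, embedding) pairs with invented values.
stored = [
    ("hough transform voting", [0.9, 0.1]),
    ("canny edge detector", [0.6, 0.8]),
    ("course logistics", [0.0, 1.0]),
]

for score, text in top_k([1.0, 0.0], stored, k=2):
    print(f"{score:.2f}  {text}")
```

If `k` is larger than the number of stored chunks, the sketch simply returns everything it has.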


Initialize the LLM

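This is the generation part: the LLM writes the answer from the retrieved chunks. `temperature=0` keeps the output as deterministic as possible.

🟩 Cell: LLM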

In [25]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="gpt-3.5-turbo",
    temperature=0
)


Build the RAG Chain

This connects:
Retriever → LLM

🟩 Cell: RAG Chain

In [26]:
from langchain.chains import RetrievalQA

rag_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type="stuff",
    return_source_documents=True
)
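
`chain_type="stuff"` means that all retrieved chunks are concatenated ("stuffed") into a single prompt together with the question, and the LLM is called once. A rough sketch of that assembly step follows; the prompt wording and `build_stuff_prompt` are hypothetical and do not reproduce LangChain's actual template.

```python
def build_stuff_prompt(question: str, chunks: list[str]) -> str:
    """Join all chunks into one context block and append the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n"
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    "What is the Hough Transform used for?",
    ["Hough Transform is a voting technique.", "Canny is an edge detector."],
)
print(prompt)
```

Because everything goes into one prompt, the combined size of the retrieved chunks has to fit in the model's context window; with `k=4` and chunks of at most 800 characters that is a small budget.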


In [27]:
### Ask a Question
query = "Summarize the key points of this document."

response = rag_chain.invoke({"query": query})

print("ANSWER:\n")
print(response["result"])


ANSWER:

The document discusses the Hough Transform for Line Fitting, which is a technique used to identify lines in an image based on points that belong to those lines. It addresses the challenges in line fitting, such as dealing with extra edge points, multiple models, missing evidence, and noise in measured edge points. The role of voting in line fitting is highlighted as a way to let features vote for all models that are compatible with them, helping to identify the most likely lines. The readings mentioned include Canny's edge detector and the Hough Transform from the book "Digital Image Processing" by Rafael C. Gonzalez and Richard E. Woods.


In [28]:
### Inspect Retrieved Chunks

for i, doc in enumerate(response["source_documents"], 1):
    print(f"\n--- Source Chunk {i} ---")
    print(doc.page_content[:400])



--- Source Chunk 1 ---
HoughTransformforLineFitting●Givenpointsthatbelongtoaline,whatisthe line?●Howmanylinesarethere?●Whichpointsbelongto whichlines?●HoughTransformisavotingtechniquethat canbeusedtoanswerallofthesequestions.Mainidea:1.Recordvoteforeachpossiblelineonwhich eachedgepointlies.2.Lookforlinesthatgetmanyvotes.
6KristenGrauman

--- Source Chunk 2 ---
DifficultyinLineFitting•Extraedgepoints(clutter),multiple models:–whichpointsgowithwhichline,ifany?•Onlysomepartsofeachlinedetected, andsomepartsaremissing:–howtofindaline thatbridgesmissing evidence?•Noiseinmeasurededgepoints, orientations:–howtodetecttrueunderlyingparameters?
4KristenGrauman

--- Source Chunk 3 ---
Readings:
19Canny’sedgedetector-Page729-735-DigitalImageProcessing,4thEd,RafaelC.Gonzalez& RichardE.WoodsHoughTransform-Page737-742-DigitalImageProcessing,4thEd,RafaelC.Gonzalez&RichardE.Woods

--- Source Chunk 4 ---
RoleofVotinginLineFitting•It’snotfeasibletocheckallcombinationsoffeaturesb