<a href="https://colab.research.google.com/github/AanLetna7025/Rag_pdf/blob/main/rag1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install -U PyPDF2 langchain langchain_community faiss-cpu sentence-transformers langchain_google_genai

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Collecting langchain_community
  Downloading langchain_community-0.3.29-py3-none-any.whl.metadata (2.9 kB)
Collecting faiss-cpu
  Downloading faiss_cpu-1.12.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (5.1 kB)
Collecting langchain_google_genai
  Downloading langchain_google_genai-2.1.10-py3-none-any.whl.metadata (7.2 kB)
Collecting requests<3,>=2 (from langchain)
  Downloading requests-2.32.5-py3-none-any.whl.metadata (4.9 kB)
Collecting dataclasses-json<0.7,>=0.6.7 (from langchain_community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting filetype<2.0.0,>=1.2.0 (from langchain_google_genai)
  Downloading filetype-1.2.0-py2.py3-none-any.whl.metadata (6.5 kB)
Collecting google-ai-generativelanguage<0.7.0,>=0.6.18 (from langchain_google_genai)
  Downloading google_ai_generativelanguage-0.6.18-py3-none-any.whl.metadata (9.8 kB)
Collecting marshmallow

In [None]:
!pip install langchain_huggingface

Collecting langchain_huggingface
  Downloading langchain_huggingface-0.3.1-py3-none-any.whl.metadata (996 bytes)
Downloading langchain_huggingface-0.3.1-py3-none-any.whl (27 kB)
Installing collected packages: langchain_huggingface
Successfully installed langchain_huggingface-0.3.1


In [None]:
import os
import google.generativeai as genai
from PyPDF2 import PdfReader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from google.colab import userdata
# Import HuggingFaceEmbeddings only from langchain_huggingface; the
# langchain_community version is deprecated and was shadowed by this import anyway.
from langchain_huggingface import HuggingFaceEmbeddings

In [None]:
from google.colab import userdata
api_key = userdata.get('gemini_key')
if not api_key:
    raise ValueError("Colab secret 'gemini_key' not found.")
genai.configure(api_key=api_key)
model = ChatGoogleGenerativeAI(model="gemini-2.5-flash", google_api_key=api_key)

In [None]:
def get_pdf_text(pdf_docs):
    text=""
    for pdf in pdf_docs:
        if not os.path.exists(pdf):
            raise FileNotFoundError(f"PDF not found: {pdf}")
        pdf_reader = PdfReader(pdf)
        for page in pdf_reader.pages:
            page_text = page.extract_text() or ""
            text += page_text
    return text

In [None]:
def get_text_chunks(text):
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    chunks = text_splitter.split_text(text)
    return chunks
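
The effect of chunk_size and chunk_overlap can be illustrated with a much simpler splitter. The sketch below is an assumption-laden stand-in, not how RecursiveCharacterTextSplitter works internally (the real splitter also tries to break on separators such as paragraphs and sentences). The function name sliding_window_chunks is hypothetical.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Tiny values so the overlap is visible by eye
sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
print(chunks)
```

Each chunk repeats the last 4 characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.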


In [None]:
def get_embedding_model():
    """
    Initializes and returns the HuggingFaceEmbeddings model.
    This helps keep the model initialization consistent.
    """
    model_name = "sentence-transformers/all-MiniLM-L6-v2"

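    # Ask PyTorch directly whether a GPU is usable; the CUDA_VISIBLE_DEVICES
    # environment variable may be unset even when a GPU is attached.
    import torch  # installed as a dependency of sentence-transformers
    device = "cuda" if torch.cuda.is_available() else "cpu"

    model_kwargs = {'device': device}
    # Unit-length vectors make L2-distance ranking in FAISS match cosine-similarity ranking
    encode_kwargs = {'normalize_embeddings': True}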

    embeddings = HuggingFaceEmbeddings(
        model_name=model_name,
        model_kwargs=model_kwargs,
        encode_kwargs=encode_kwargs
    )
    return embeddings
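
Why normalize_embeddings matters: for unit-length vectors, squared L2 distance equals 2 - 2 * cosine similarity, so ranking by smallest L2 distance gives the same order as ranking by largest cosine similarity. The sketch below checks that identity on two made-up 2-D vectors. The helper names are hypothetical, and it assumes the FAISS index uses L2 distance, which is my understanding of the LangChain wrapper's default.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity of two vectors of any length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Made-up vectors standing in for real embeddings
a = [3.0, 4.0]
b = [4.0, 3.0]

cos_ab = cosine(a, b)
dist_normalized = squared_l2(normalize(a), normalize(b))

# For unit vectors: ||a - b||^2 == 2 - 2 * cos(a, b)
print(dist_normalized, 2 - 2 * cos_ab)
```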


In [None]:
def get_vector_store(text_chunks):
    embeddings = get_embedding_model() # Use the consistent embedding model
    vector_store = FAISS.from_texts(text_chunks, embedding=embeddings)
    vector_store.save_local("faiss_index_sbert") # Saved with a new name (SBERT)
    return vector_store
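
Conceptually, the saved index answers one question: which stored chunks are closest to the query vector? A brute-force sketch of that lookup, followed by the same "\n".join step the notebook uses to build the context, might look like this. The 3-D vectors, chunk texts, and the top_k helper are invented for illustration; real embeddings from all-MiniLM-L6-v2 are much higher-dimensional, and FAISS performs the search far more efficiently.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vec, indexed, k=4):
    """Return the texts of the k stored chunks most similar to query_vec."""
    ranked = sorted(indexed, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# (chunk text, made-up embedding) pairs standing in for the FAISS index
indexed = [
    ("chunk about CNN layers", [0.9, 0.1, 0.0]),
    ("chunk about pooling", [0.7, 0.3, 0.0]),
    ("chunk about the dataset", [0.0, 0.2, 0.9]),
]

query_vec = [1.0, 0.0, 0.0]  # pretend embedding of the user's question
hits = top_k(query_vec, indexed, k=2)

# Same formatting step as in user_input: join retrieved chunks into one context string
context = "\n".join(hits)
print(context)
```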

In [None]:

def get_conversational_chain():
    prompt_template = ChatPromptTemplate.from_messages([
        ("system", "Answer the following question based only on the provided context. If the answer is not in the context, politely state that you cannot find the answer in the provided information."),
        ("human", "Context:\n{context}\n\nQuestion:\n{question}\n\nAnswer:")
    ])

    model = ChatGoogleGenerativeAI(model="gemini-2.5-flash", temperature=0.3, google_api_key=api_key)

    # Create the chain using pipe operator
    chain = prompt_template | model | StrOutputParser()
    return chain

def user_input(user_question):
    """Simple function to answer user questions"""
    embeddings = get_embedding_model()

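    # Load the FAISS index saved by get_vector_store.
    # allow_dangerous_deserialization is required because the index metadata is pickled;
    # only enable it for index files you created yourself.
    new_db = FAISS.load_local("faiss_index_sbert", embeddings, allow_dangerous_deserialization=True)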
    docs = new_db.similarity_search(user_question)

    # Format context from retrieved documents
    context = "\n".join([doc.page_content for doc in docs])

    chain = get_conversational_chain()

    # Use the pipe operator chain
    response = chain.invoke({
        "context": context,
        "question": user_question
    })

    print("Answer:", response)


if __name__ == "__main__":
    pdf_files = ["/content/MainProject_Report_final.pdf"]

    if not os.path.exists(pdf_files[0]):
        print(f"Error: PDF file not found at {pdf_files[0]}. Please upload it to Colab session storage.")
    else:
        print(f"File exists: {os.path.exists(pdf_files[0])}")
        print("Processing PDF...")
        text = get_pdf_text(pdf_files)
        chunks = get_text_chunks(text)
        get_vector_store(chunks) # This will use SBERT embeddings
        print("PDF processed successfully with Sentence-Transformer embeddings!")

        while True:
            question = input("\nAsk a question (or type 'quit'): ")
            if question.lower() == 'quit':
                break
            user_input(question)

File exists: True
Processing PDF...
PDF processed successfully with Sentence-Transformer embeddings!

Ask a question (or type 'quit'): what is cnn?
Answer: A Convolutional Neural Network (CNN) is a type of deep learning model primarily designed for processing structured grid data, such as images and spectrograms. It consists of multiple layers, including convolutional layers, pooling layers, and fully connected layers.

Here's a breakdown of its components and functions:
*   **Convolutional layers:** These layers apply filters to the input data, using kernels that slide over the input to detect patterns such as edges, textures, and shapes, thereby extracting spatial features.
*   **Pooling layers:** These layers reduce the dimensionality of the data by selecting the most significant values, which improves computational efficiency while retaining important features. Common techniques include max pooling and average pooling.
*   **Fully connected layers:** The features extracted by the p

- PDF extraction using PyPDF2's PdfReader.
- Text chunking: splits the large extracted text into manageable, overlapping pieces for embedding.
- Embedding: converts each chunk to a numerical vector for similarity search (sentence-transformers/all-MiniLM-L6-v2 model).
- Vector storage with FAISS: enables fast similarity search across document chunks.
- User input: the question typed at the prompt.
- LCEL (LangChain Expression Language) pipeline: prompt_template | model | StrOutputParser().
- Retrieved document chunks are joined into a context string and passed to the LLM.
- LLM generation with Gemini.
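
The pipe syntax in the LCEL step can look like magic, so here is a toy sketch of the idea: each step overloads the | operator to return a new step that feeds its output into the next one. The Step class and the stub model are hypothetical and only mimic the shape of prompt_template | model | StrOutputParser(); LangChain's actual runnables do considerably more (batching, streaming, async).

```python
class Step:
    """A minimal pipeline stage that supports `a | b` composition."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # a | b -> a new Step that runs a, then passes the result to b
        return Step(lambda value: other.invoke(self.invoke(value)))


# Stand-in for the prompt template: fills the context and question into a string
prompt = Step(
    lambda inputs: f"Context:\n{inputs['context']}\n\nQuestion:\n{inputs['question']}\n\nAnswer:"
)

# Stand-in for the chat model: returns a message-like dict instead of calling Gemini
stub_model = Step(
    lambda prompt_text: {"content": f"stub reply to a {len(prompt_text)}-character prompt"}
)

# Stand-in for StrOutputParser: pulls the text out of the message
parser = Step(lambda message: message["content"])

chain = prompt | stub_model | parser
result = chain.invoke({"context": "CNNs use convolutional layers.", "question": "what is cnn?"})
print(result)
```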

