# Document Ingestion and QA Notebook (Pinecone Edition)

This notebook loads PDFs, chunks the text, creates embeddings, and uploads them to a **Pinecone Vector Database**. It then uses an LLM to answer questions from the chunks retrieved from that persistent, cloud-hosted index.

In [19]:
# Install the Pinecone packages if needed; ipywidgets also fixes the TqdmWarning
# !pip install ipywidgets pinecone-client langchain-pinecone

In [20]:
from langchain_community.document_loaders import PyMuPDFLoader, DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_pinecone import PineconeVectorStore
from pinecone import Pinecone
import torch
import os
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())

# --- CONFIGURATION ---
# UPDATE_VECTOR_DB is set automatically in the next cell: True when the
# Pinecone index is empty, False when it already holds vectors.

True
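
Both API keys used later in this notebook (`PINECONE_API_KEY` and `GROQ_API_KEY`) are read from the environment, so it can help to fail fast when one is missing. A minimal sketch, assuming a hypothetical helper named `require_env` that is not part of the original notebook:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error.

    `require_env` is a hypothetical helper used for illustration only.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise ValueError(f"{name} is not set; add it to your .env file")
    return value


# Demo with a throwaway variable so the sketch runs without real keys.
os.environ["DEMO_API_KEY"] = "abc123"
print(require_env("DEMO_API_KEY"))
```

Calling such a helper once per key at the top of the notebook would surface a missing key before any client object is constructed.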

### 1. Load PDF Files (Conditional)
Load all PDF files from the `data/` directory using `PyMuPDFLoader`.

In [21]:
DATA_PATH = "data/"

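# --- AUTO-DETECT if Pinecone index already has data ---
PINECONE_API_KEY = os.environ.get("PINECONE_API_KEY")
if not PINECONE_API_KEY:
    # Fail before the client is created, not after it has already been used
    raise ValueError("PINECONE_API_KEY not found in .env file")
PINECONE_INDEX_NAME = "docbot-index"
pc = Pinecone(api_key=PINECONE_API_KEY)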
index = pc.Index(PINECONE_INDEX_NAME)
stats = index.describe_index_stats()
vector_count = stats.get("total_vector_count", 0)
UPDATE_VECTOR_DB = vector_count == 0  # Auto: True if empty, False if already has data

if UPDATE_VECTOR_DB:
    def load_pdf_files(data):
        loader = DirectoryLoader(
            data,
            glob="**/*.pdf",
            loader_cls=PyMuPDFLoader,
            use_multithreading=True
        )
        return loader.load()

    documents = load_pdf_files(DATA_PATH)
    print(f"Loaded {len(documents)} document pages")
else:
    print("Skipping PDF loading (Using existing Pinecone index)")

Skipping PDF loading (Using existing Pinecone index)
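
The auto-detection in this cell comes down to one comparison on the index's vector count. A minimal sketch of that decision with an optional manual override; the function name `should_update` and its `force` parameter are hypothetical and not part of the notebook:

```python
from typing import Optional


def should_update(total_vector_count: int, force: Optional[bool] = None) -> bool:
    """Decide whether documents should be (re)ingested.

    force=None  -> auto-detect: ingest only when the index is empty.
    force=True  -> always ingest; force=False -> never ingest.
    """
    if force is not None:
        return force
    return total_vector_count == 0


print(should_update(0))                  # True: empty index, ingest
print(should_update(1200))               # False: index already has vectors
print(should_update(1200, force=True))   # True: manual override
```

An override like this is one way to rebuild the index after the PDFs in `data/` change, since the vector count alone cannot tell whether the stored chunks are current.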


### 2. Create Text Chunks (Conditional)
Split the loaded documents into smaller chunks for processing.

In [22]:
if UPDATE_VECTOR_DB:
    def create_chunks(extracted_data):
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=500,
            chunk_overlap=50
        )
        text_chunks = text_splitter.split_documents(extracted_data)
        return text_chunks

    text_chunks = create_chunks(documents)
    print(f"Created {len(text_chunks)} text chunks")
else:
    print("Skipping chunk creation")

Skipping chunk creation
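
To see what `chunk_size` and `chunk_overlap` mean, here is a simplified sliding-window splitter in plain Python. It is only an illustration: `RecursiveCharacterTextSplitter` additionally prefers to break on separators such as paragraph and line breaks, which this sketch ignores. The function name `sliding_chunks` is hypothetical.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-size windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in sliding_chunks(sample, chunk_size=10, chunk_overlap=3):
    print(chunk)
```

Each chunk after the first starts with the last three characters of the previous one. The notebook's settings (500 and 50) give the same kind of overlap at a larger scale, which helps keep sentences that straddle a boundary retrievable.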


### 3. Initialize Embedding Model (Optimized for GPU)
Load the HuggingFace embedding model (`sentence-transformers/all-MiniLM-L6-v2`) and set it to use CUDA if available.

In [23]:
def get_embedding_model():
    # Check if CUDA (GPU) is available
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"Using device: {device}")
    
    embedding_model = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2",
        model_kwargs={"device": device}
    )
    return embedding_model

embedding_model = get_embedding_model()

Using device: cuda
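
Embeddings are useful because texts with similar meaning map to nearby vectors; `all-MiniLM-L6-v2` produces 384-dimensional vectors. The sketch below shows cosine similarity, a common way to compare such vectors, using tiny made-up 3-dimensional vectors instead of real model output. Which metric your index actually uses depends on how the index was created.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up vectors: `close` points roughly the same way as `query`, `far` does not.
query = [1.0, 0.0, 1.0]
close = [0.9, 0.1, 1.1]
far = [-1.0, 1.0, 0.0]
print(cosine_similarity(query, close) > cosine_similarity(query, far))  # True
```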


### 4. Upload to Pinecone (Conditional)
Push the generated vectors to your Pinecone index only if `UPDATE_VECTOR_DB` is True.
**Prerequisite:** Ensure `PINECONE_API_KEY` is in your `.env` file and that you have created an index named `docbot-index` with dimension 384, the output size of `all-MiniLM-L6-v2`.

In [24]:
PINECONE_API_KEY = os.environ.get("PINECONE_API_KEY")
PINECONE_INDEX_NAME = "docbot-index"

if not PINECONE_API_KEY:
    raise ValueError("PINECONE_API_KEY not found in .env file")

if UPDATE_VECTOR_DB:
    print(f"Uploading vectors to Pinecone index '{PINECONE_INDEX_NAME}'...")
    
    # Batch upload to Pinecone
    if text_chunks:
        docsearch = PineconeVectorStore.from_documents(
            documents=text_chunks,
            embedding=embedding_model,
            index_name=PINECONE_INDEX_NAME,
            pinecone_api_key=PINECONE_API_KEY
        )
        print("Successfully uploaded vectors to Pinecone!")
    else:
        print("No text chunks to upload.")
else:
    print("Skipping upload (Using existing Pinecone index)")

Skipping upload (Using existing Pinecone index)
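
If vector IDs are generated randomly on each run, re-running an interrupted upload can store the same chunk twice. One way to make uploads repeatable is to derive each ID from the chunk itself, since Pinecone overwrites a record when it is upserted under an existing ID. The sketch below only computes such IDs; the helper name `chunk_id`, the choice of fields, and the file name `data/manual.pdf` are assumptions for illustration, and wiring the IDs into the upload call is left out.

```python
import hashlib


def chunk_id(source: str, page: int, content: str) -> str:
    """Build a stable ID from where a chunk came from and what it contains.

    Hypothetical helper: the same inputs always give the same 32-character ID.
    """
    raw = f"{source}|{page}|{content}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:32]


a = chunk_id("data/manual.pdf", 0, "Hello world")
b = chunk_id("data/manual.pdf", 0, "Hello world")  # same chunk again
c = chunk_id("data/manual.pdf", 1, "Hello world")  # same text, different page
print(a == b, a == c)  # True False
```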


### 5. Setup LLM (Llama 3.3 70B via Groq)
Set up `ChatGroq` to use the `llama-3.3-70b-versatile` model. `GROQ_API_KEY` must be set in your `.env` file.

In [25]:
from langchain_groq import ChatGroq
from langchain_core.prompts import PromptTemplate
from langchain.chains import RetrievalQA

llm = ChatGroq(
    model_name="llama-3.3-70b-versatile",
    temperature=0.0,
    groq_api_key=os.environ["GROQ_API_KEY"],
)
print("LLM loaded successfully")

LLM loaded successfully


### 6. Setup QA Chain
Define the custom prompt template and create the RetrievalQA chain using the Pinecone vector store.

In [26]:
CUSTOM_PROMPT_TEMPLATE = """
Use the pieces of information provided in the context to answer the user's question.
If you don't know the answer, then rely on your parametric knowledge, but if the user asks who Mehak is, say that Mehak is
the girlfriend of Rahul and that Rahul loves her so much.

Context: {context}
Question: {question}

Start the answer directly. No small talk please.
"""

def set_custom_prompt(custom_prompt_template):
    prompt = PromptTemplate(template=custom_prompt_template, input_variables=["context", "question"])
    return prompt

# Ensure we use the existing index (in case we didn't just upload it)
docsearch = PineconeVectorStore.from_existing_index(
    index_name=PINECONE_INDEX_NAME,
    embedding=embedding_model
)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=docsearch.as_retriever(search_kwargs={'k': 3}),
    return_source_documents=True,
    chain_type_kwargs={'prompt': set_custom_prompt(CUSTOM_PROMPT_TEMPLATE)}
)

print("QA Chain initialized")

QA Chain initialized
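
With `chain_type="stuff"`, the retrieved chunks (here the top 3) are concatenated and placed into the prompt's `{context}` slot before the LLM is called. A plain-Python sketch of that assembly, using made-up chunk texts and a shortened template; the blank-line separator and the helper name `build_prompt` are assumptions for illustration:

```python
from typing import List

# Shortened stand-in for the notebook's prompt template.
TEMPLATE = (
    "Use the context to answer the question.\n\n"
    "Context: {context}\n"
    "Question: {question}\n"
)


def build_prompt(chunks: List[str], question: str, separator: str = "\n\n") -> str:
    """Join the retrieved chunks and fill them into the template."""
    context = separator.join(chunks)
    return TEMPLATE.format(context=context, question=question)


# Made-up chunks standing in for the k=3 results returned by the retriever.
retrieved = [
    "Hold the power button for ten seconds to reset.",
    "The reset clears all saved settings.",
    "Contact support if the device does not restart.",
]
print(build_prompt(retrieved, "How do I reset the device?"))
```

Because every retrieved chunk is pasted into a single prompt, `k` and `chunk_size` together determine how much of the model's context window each query consumes.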


### 7. Run Query
Run a query against the document bot.

In [28]:
user_query = input("Write Query Here: ")
if user_query:
    response = qa_chain.invoke({'query': user_query})
    print("RESULT: ", response["result"])
    # print("SOURCE DOCUMENTS: ", response["source_documents"])
else:
    print("No query entered.")

RESULT:  Rahul is not mentioned in the context, but according to the parameter, Rahul loves Mehak so much, she is his girlfriend.
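
Because the chain is built with `return_source_documents=True`, the response also carries the retrieved chunks under `source_documents`. Below is a sketch of listing where an answer came from, using a stand-in `FakeDoc` class and a made-up response so it runs without the chain. The `source` and `page` metadata keys are assumed to be present, which is why they are read with `.get`; `format_sources` is a hypothetical helper.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class FakeDoc:
    """Stand-in with the two attributes read below: page_content and metadata."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def format_sources(docs: List[Any], preview_chars: int = 60) -> List[str]:
    """Return one summary line per retrieved document."""
    lines = []
    for i, doc in enumerate(docs, start=1):
        source = doc.metadata.get("source", "unknown")
        page = doc.metadata.get("page", "?")
        preview = doc.page_content[:preview_chars].replace("\n", " ")
        lines.append(f"[{i}] {source} (page {page}): {preview}")
    return lines


# Made-up response shaped like the chain's output dictionary.
response = {
    "result": "Hold the power button for ten seconds.",
    "source_documents": [
        FakeDoc("Hold the power button for ten seconds.", {"source": "data/manual.pdf", "page": 4}),
        FakeDoc("No metadata here."),
    ],
}
for line in format_sources(response["source_documents"]):
    print(line)
```

Checking the sources this way makes it easier to tell whether an answer came from the retrieved context or from the model's own knowledge.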
