# Cleaned & Fixed RAG Notebook

This notebook contains a cleaned-up, corrected retrieval-augmented generation (RAG) pipeline built on a Groq chat model. It:

- Installs the required packages
- Loads PDFs from a folder and splits them into chunks
- Creates embeddings and a FAISS vectorstore
- Builds a retriever
- Defines a corrected `rag_simple` function that works with chat LLMs (ChatGroq)

> **Save** this file and run the cells in order. If you hit a missing-package error, run the first cell to install the dependencies.

In [None]:
!pip install -q langchain langchain-community langchain-groq langchain-core "sentence-transformers>=2.2.2" faiss-cpu pypdf python-dotenv nbformat

In [None]:
# Imports & environment
import os
from dotenv import load_dotenv
load_dotenv()

# Groq + LangChain imports
from langchain_groq import ChatGroq
from langchain_core.messages import SystemMessage, HumanMessage

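# Document loading, splitting, embeddings, vectorstore.
# Newer LangChain releases ship these in langchain_community / langchain_text_splitters;
# older ones expose them under langchain.*. Try the new location first, then fall back.
try:
    from langchain_community.document_loaders import PyPDFLoader
    from langchain_community.embeddings import HuggingFaceEmbeddings
    from langchain_community.vectorstores import FAISS
except ImportError:
    from langchain.document_loaders import PyPDFLoader
    from langchain.embeddings import HuggingFaceEmbeddings
    from langchain.vectorstores import FAISS
try:
    from langchain_text_splitters import RecursiveCharacterTextSplitter
except ImportError:
    from langchain.text_splitter import RecursiveCharacterTextSplitter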

from pathlib import Path
print('Working directory:', os.getcwd())
print('GROQ_API_KEY present in env:', bool(os.getenv('GROQ_API_KEY')))


## Initialize the Groq LLM
Change `model_name` if you want to use another chat model that Groq serves. The notebook uses `llama-3.1-8b-instant` as a recommended replacement for the deprecated Gemma models.

In [None]:
# Initialize ChatGroq (change model_name if you prefer another supported model)
groq_api_key = os.getenv('GROQ_API_KEY')
if not groq_api_key:
    print('Warning: GROQ_API_KEY not set. Set it in your .env file or your environment.')

llm = ChatGroq(
    groq_api_key=groq_api_key,
    model_name="llama-3.1-8b-instant",
    temperature=0.1,
    max_tokens=1024
)
print('LLM initialized (ok)')
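
The cell above only prints a warning when the key is missing, so any failure surfaces later and may be harder to read. A minimal sketch of a fail-fast check follows; `require_env` is a hypothetical helper, not part of LangChain or the Groq SDK.

```python
import os

def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise a clear error."""
    value = os.getenv(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add it to your .env file or export it in your shell."
        )
    return value

# Usage in the notebook (hypothetical; uncomment to stop early when the key is missing):
# groq_api_key = require_env("GROQ_API_KEY")
```

Raising here stops the notebook at the cell where the problem can be fixed, rather than at the first query.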


## PDF processing: load all PDFs from a directory and split into chunks
Place your PDFs into a folder named `pdfs/` (relative to this notebook) or change the `PDF_DIR` variable below.

In [None]:
# Set your PDF directory here
PDF_DIR = 'pdfs'  # change if your PDFs are elsewhere

# Function to load and split PDFs
def process_all_pdfs(pdf_directory: str, chunk_size: int = 1000, chunk_overlap: int = 150) -> list:
    """Load every PDF under pdf_directory (recursively) and return all chunks as one flat list."""
    pdf_dir = Path(pdf_directory)
    if not pdf_dir.exists():
        raise FileNotFoundError(f"PDF directory not found: {pdf_directory}. Create the folder and add PDFs.")

    pdf_files = list(pdf_dir.glob('**/*.pdf'))
    print(f'Found {len(pdf_files)} PDF(s) in', pdf_directory)

    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)

    all_docs = []
    for pdf in pdf_files:
        print('Loading:', pdf)
        try:
            loader_instance = PyPDFLoader(str(pdf))
            pages = loader_instance.load()
            chunks = splitter.split_documents(pages)
            print(f' - produced {len(chunks)} chunks')
            all_docs.extend(chunks)
        except Exception as e:
            print('Failed to load or split', pdf, 'Error:', e)

    print('Total chunks produced:', len(all_docs))
    return all_docs

# Ensure PDF_DIR exists
Path(PDF_DIR).mkdir(parents=True, exist_ok=True)
print('PDF_DIR exists:', Path(PDF_DIR).exists())
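
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed sliding-window splitter. It is a simplification for illustration only: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and sentences, which this hypothetical `sliding_window_chunks` function does not.

```python
from typing import List

def sliding_window_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 150) -> List[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

demo_chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(demo_chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the overlap. With the notebook's defaults, consecutive chunks would share up to 150 characters, so a sentence cut at a boundary still appears whole in at least one chunk in most cases.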


## Create embeddings and build FAISS vector store
This cell will:
- Create a HuggingFace embeddings object (uses sentence-transformers model)
- Index the document chunks into FAISS

**Note:** To use a different embedding model, change `EMBEDDING_MODEL_NAME` and rebuild the index; vectors produced by different models are not comparable.

In [None]:
# Build embeddings and vectorstore
# (HuggingFaceEmbeddings and FAISS were imported in the first code cell)

# Choose an embeddings model (sentence-transformers). You can change model_name as needed.
EMBEDDING_MODEL_NAME = 'sentence-transformers/all-mpnet-base-v2'  # 768-dim, good default

print('Creating embedding model:', EMBEDDING_MODEL_NAME)
emb = HuggingFaceEmbeddings(model_name=EMBEDDING_MODEL_NAME)

# Process PDFs and create vectorstore
docs = process_all_pdfs(PDF_DIR)
if len(docs) == 0:
    print('No document chunks found. Add PDFs to the', PDF_DIR, 'directory and re-run this cell.')
else:
    vectorstore = FAISS.from_documents(docs, emb)
    retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
    try:
        nvec = vectorstore.index.ntotal
    except Exception:
        nvec = 'unknown'
    print('Vectorstore created. Number of vectors:', nvec)
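
Conceptually, the retriever embeds the query and returns the `k` chunks whose vectors lie closest to it. The sketch below shows that idea with hand-made 2-D vectors and an exhaustive search. It assumes Euclidean distance, uses hypothetical function names, and is not how FAISS is implemented; real embeddings from `all-mpnet-base-v2` have 768 dimensions.

```python
import math
from typing import List, Sequence, Tuple

def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k_nearest(query_vec: Sequence[float],
                  doc_vecs: Sequence[Sequence[float]],
                  k: int = 3) -> List[Tuple[int, float]]:
    """Return (index, distance) pairs for the k vectors closest to query_vec."""
    scored = [(i, l2_distance(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]

# Toy "embeddings" standing in for document chunks
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
nearest = top_k_nearest([0.9, 0.1], toy_vectors, k=2)
print([index for index, _ in nearest])  # [1, 0]
```

The `search_kwargs={"k": 3}` setting in the cell above plays the role of `k` here.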


## Fixed RAG function
This RAG function retrieves documents with the retriever, builds the prompt with a single f-string (calling `.format()` on it as well would fail on any braces in the retrieved text), and calls `llm.invoke()` with proper message objects.

In [None]:
def rag_simple(query: str, retriever, llm, top_k: int = 3):
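    # Retrieve documents. Current LangChain retrievers are Runnables and expose
    # `invoke`; older releases only have `get_relevant_documents`.
    if hasattr(retriever, 'invoke'):
        docs = retriever.invoke(query)
    elif hasattr(retriever, 'get_relevant_documents'):
        docs = retriever.get_relevant_documents(query)
    else:
        # Fallback for custom retrievers that expose retrieve(query, top_k=...)
        from langchain_core.documents import Document
        docs = []
        for r in retriever.retrieve(query, top_k=top_k):
            if hasattr(r, 'page_content'):
                docs.append(r)
            elif isinstance(r, dict) and 'content' in r:
                # wrap dict-like results so the rest of the function sees Documents
                docs.append(Document(page_content=r['content']))

    # The retriever's own `k` decides how many documents come back; cap at top_k here.
    docs = list(docs)[:top_k]

    if not docs:
        print('Retrieved 0 documents')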
        return 'No relevant context found to answer the question.'

    context = "\n\n".join([d.page_content for d in docs])

    prompt = f"""
Use the following context to answer the question concisely.

Context:
{context}

Question: {query}

Answer:
"""

    # Call chat model with message objects
    response = llm.invoke([
        SystemMessage(content='You are a helpful assistant. Answer concisely using only the provided context.'),
        HumanMessage(content=prompt)
    ])

    # Chat models return an AIMessage; fall back to str() for anything without .content
    return getattr(response, 'content', str(response))
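
`rag_simple` joins only the `page_content` of each chunk, so the model cannot tell which file or page a passage came from. The sketch below shows one way to label each chunk with its origin and cap the total context length. It assumes the chunks carry `source` and `page` metadata keys (as PDF loaders typically record); `FakeDoc`, `format_context`, and `max_chars` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class FakeDoc:
    """Stand-in for a LangChain Document: text plus a metadata dict."""
    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)

def format_context(docs: List[FakeDoc], max_chars: int = 4000) -> str:
    """Join chunks into one context string, labelling each with its source and page."""
    parts = []
    used = 0
    for i, d in enumerate(docs, start=1):
        source = d.metadata.get("source", "unknown")
        page = d.metadata.get("page", "?")
        block = f"[{i}] ({source}, page {page})\n{d.page_content}"
        if used + len(block) > max_chars:
            break  # stop before the context grows past the character budget
        parts.append(block)
        used += len(block)
    return "\n\n".join(parts)

sample_docs = [
    FakeDoc("Attention maps queries to keys.", {"source": "pdfs/a.pdf", "page": 2}),
    FakeDoc("No metadata here."),
]
labelled = format_context(sample_docs)
print(labelled)
```

Because the function only reads `page_content` and `metadata`, it should also accept the real documents returned by the retriever, and its result could replace the `"\n\n".join(...)` line in `rag_simple`.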


## Example query
Run the following cell to test your RAG pipeline once the vectorstore and retriever are created.

In [None]:
# Example usage (run after creating vectorstore & retriever)
try:
    answer = rag_simple("What is 'Attention Is All You Need' about?", retriever, llm)
    print('\nAnswer:\n', answer)
except NameError as e:
    print('Make sure you ran the cell that builds the vectorstore and retriever. Error:', e)
