# Notebook 02 - Retrieval-Augmented Generation (RAG) with Pinecone + Groq
### By: Josué Hernandez

In this notebook I implement a complete **RAG pipeline** using:
- **Pinecone** as the vector database.
- **Google Gemini** (free tier) for embeddings (`gemini-embedding-001`).
- **Groq** (free tier) for the chat model (`llama-3.3-70b-versatile`).

**What I will do step by step:**
1. **Credentials** - I load API keys securely.
2. **Index creation** - I create (or connect to) a Pinecone serverless index.
3. **Document preparation** - I define sample documents and split them into chunks.
4. **Indexing** - I embed the chunks and upsert them into Pinecone.
5. **RAG chain** - I build a retrieval chain that fetches relevant context and generates an answer.
6. **Query** - I ask questions and inspect both the answer and the retrieved sources.

> **Note:** API keys are loaded from environment variables or prompted via `getpass`. I never hard-code secrets.

## 1 - Credentials

I load a `.env` file (if present) and make sure `GOOGLE_API_KEY`, `PINECONE_API_KEY`, and `GROQ_API_KEY` are set.

- **Google Gemini** provides free-tier access to embedding models.
- **Pinecone** offers a free starter tier for vector storage.
- **Groq** provides free-tier access to LLMs like Llama 3.3 (30 RPM, very fast inference).

In [15]:
import os
import getpass
from dotenv import load_dotenv

load_dotenv()

if not os.getenv("GOOGLE_API_KEY"):
    os.environ["GOOGLE_API_KEY"] = getpass.getpass("Enter your Google API key: ")

if not os.getenv("PINECONE_API_KEY"):
    os.environ["PINECONE_API_KEY"] = getpass.getpass("Enter your Pinecone API key: ")

if not os.getenv("GROQ_API_KEY"):
    os.environ["GROQ_API_KEY"] = getpass.getpass("Enter your Groq API key: ")

PINECONE_INDEX_NAME = os.getenv("PINECONE_INDEX_NAME", "lab-rag-index")
PINECONE_CLOUD = os.getenv("PINECONE_CLOUD", "aws")
PINECONE_REGION = os.getenv("PINECONE_REGION", "us-east-1")

print(f"Index: {PINECONE_INDEX_NAME} | Cloud: {PINECONE_CLOUD} | Region: {PINECONE_REGION}")

Index: lab-rag-index | Cloud: aws | Region: us-east-1
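
The three `if not os.getenv(...)` checks above follow the same pattern, so they can be folded into one helper. The sketch below is one possible way to do that; `require_env` is a hypothetical name I introduce here, not part of any library.

```python
import getpass
import os
from typing import Optional


def require_env(name: str, prompt: Optional[str] = None) -> str:
    """Return os.environ[name], prompting once via getpass if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        # getpass hides the typed value, so the secret never appears in the output.
        value = getpass.getpass(prompt or f"Enter {name}: ")
        os.environ[name] = value
    return value


# Possible usage in place of the three separate checks (not executed here,
# because it would prompt for any key that is missing):
# for key in ("GOOGLE_API_KEY", "PINECONE_API_KEY", "GROQ_API_KEY"):
#     require_env(key)
```

The helper keeps the same behavior as the cell above: an existing value is never overwritten, and a prompted value is stored in `os.environ` so later cells can read it.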


## 2 - Create / Connect to the Pinecone Index

I instantiate the Pinecone client and create a **serverless index** if it does not already exist. I set the dimension to **768** because I request 768-dimensional vectors from `gemini-embedding-001` via `output_dimensionality=768`. The model is trained with Matryoshka Representation Learning, so its embeddings can be shortened from the default size without retraining. I use **cosine** as the similarity metric.

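If the index already exists with a different dimension (e.g. 1536 from a previous OpenAI run), I automatically delete and recreate it. Deleting an index also removes every vector stored in it, so this shortcut is only appropriate for a disposable lab index.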

In [16]:
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

EMBEDDING_DIMENSION = 768

# An index's dimension is fixed at creation, so a mismatched index must be recreated.
if pc.has_index(PINECONE_INDEX_NAME):
    desc = pc.describe_index(PINECONE_INDEX_NAME)
    if desc.dimension != EMBEDDING_DIMENSION:
        print(f"Index '{PINECONE_INDEX_NAME}' has dimension {desc.dimension}, deleting to recreate with {EMBEDDING_DIMENSION}...")
        pc.delete_index(PINECONE_INDEX_NAME)

# Create the index on the first run, or again after the deletion above.
if not pc.has_index(PINECONE_INDEX_NAME):
    pc.create_index(
        name=PINECONE_INDEX_NAME,
        dimension=EMBEDDING_DIMENSION,
        metric="cosine",
        spec=ServerlessSpec(cloud=PINECONE_CLOUD, region=PINECONE_REGION),
    )

index = pc.Index(PINECONE_INDEX_NAME)
index.describe_index_stats()

{'dimension': 768,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {'lab5': {'vector_count': 3}},
 'total_vector_count': 3,
 'vector_type': 'dense'}
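
To make the **cosine** metric and the dimension requirement concrete, here is a minimal pure-Python sketch. It uses toy 3-dimensional vectors that I made up (not real embeddings) and is not how Pinecone computes scores internally; it only shows the formula and why two vectors must have the same length to be compared.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        # Same reason the index dimension must match the embedding size.
        raise ValueError(f"dimension mismatch: {len(a)} != {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


query = [1.0, 2.0, 3.0]
same_direction = [2.0, 4.0, 6.0]  # the query scaled by 2
orthogonal = [3.0, 0.0, -1.0]     # dot product with the query is 0

print(cosine_similarity(query, same_direction))  # close to 1: same direction
print(cosine_similarity(query, orthogonal))      # 0: unrelated direction
```

Because cosine similarity depends only on direction, scaling a vector does not change its score.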

## 3 - Prepare the Documents

I define three sample documents inline (no external files needed). Each `Document` object contains:
- `page_content`: the actual text.
- `metadata`: a dictionary with at least a `source` key for traceability.

In a real-world scenario I would load documents from files, databases, or the web using LangChain **document loaders**.

In [17]:
from langchain_core.documents import Document

docs = [
    Document(
        page_content=(
            "Data Policy (example)\n\n"
            "Retention:\n"
            "- Operational logs are kept for 90 days.\n"
            "- Backups are kept for 30 days.\n\n"
            "Access:\n"
            "- Access to sensitive data requires role-based authorization.\n"
            "- All access events must be audited (who, when, what).\n\n"
            "Security:\n"
            "- Secrets (API keys) must never be committed to version control.\n"
            "- Key rotation is recommended every 90 days."
        ),
        metadata={"source": "data_policy"},
    ),
    Document(
        page_content=(
            "Product FAQ (example)\n\n"
            "What is RAG?\n"
            "RAG (Retrieval-Augmented Generation) is a technique where a model retrieves "
            "relevant information from a knowledge base (e.g. a vector store) and then "
            "generates an answer using that context.\n\n"
            "Why use a vector database?\n"
            "It enables semantic similarity search even when the query does not literally "
            "match the stored text.\n\n"
            "How to reduce hallucinations?\n"
            "- Constrain the answer to the retrieved context only.\n"
            "- Improve chunking strategy and document quality.\n"
            "- Apply similarity thresholds and/or re-ranking."
        ),
        metadata={"source": "product_faq"},
    ),
    Document(
        page_content=(
            "Quick Notes: LangChain (example)\n\n"
            "Typical components in a RAG pipeline:\n"
            "1) Loader  2) Splitter  3) Embeddings  4) Vector store  5) Retriever  6) LLM\n\n"
            "Best practices:\n"
            "- Keep metadata (source, path) for traceability.\n"
            "- Experiment with k, chunk_size, and chunk_overlap.\n"
            "- Log the retrieved context for debugging."
        ),
        metadata={"source": "langchain_notes"},
    ),
]

print(f"{len(docs)} documents loaded.")

3 documents loaded.


### Split documents into chunks

I split the documents into smaller **chunks** so that:
- Each chunk fits within the embedding model's token limit.
- The retriever can return precise, relevant passages instead of entire documents.

I use `RecursiveCharacterTextSplitter`, which tries to split at natural boundaries (paragraph breaks first, then line breaks, then spaces) before falling back to character-level splits. Both `chunk_size` and `chunk_overlap` are measured in characters.

In [18]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=120)
splits = text_splitter.split_documents(docs)

print(f"{len(splits)} chunks created from {len(docs)} documents.")
splits[0].metadata

3 chunks created from 3 documents.


{'source': 'data_policy'}
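
Each sample document is shorter than 800 characters, which is why the splitter returns exactly one chunk per document. To show how `chunk_size` and `chunk_overlap` interact on longer text, here is a deliberately simplified fixed-window splitter. It is not the recursive algorithm LangChain uses (it ignores separators entirely), and `sliding_window_chunks` is a hypothetical name for illustration.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"  # 26 characters
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(repr(chunk))
```

Each window repeats the last three characters of the previous one, which is the role `chunk_overlap` plays: content near a boundary appears in both neighboring chunks.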

## 4 - Indexing (Embed + Upsert)

I initialize `GoogleGenerativeAIEmbeddings` (with `output_dimensionality=768`) and `PineconeVectorStore`, then I upsert the chunks. I use a **namespace** (`lab5`) to isolate this lab's data from anything else in the same index.

To avoid duplicating vectors on repeated runs, I check whether the namespace already contains data and skip the upsert if so.

In [19]:
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_pinecone import PineconeVectorStore

embeddings = GoogleGenerativeAIEmbeddings(
    model="models/gemini-embedding-001",
    task_type="SEMANTIC_SIMILARITY",
    output_dimensionality=768,
)

NAMESPACE = "lab5"

vector_store = PineconeVectorStore(index=index, embedding=embeddings, namespace=NAMESPACE)

stats = index.describe_index_stats()
existing = (stats.get("namespaces") or {}).get(NAMESPACE, {}).get("vector_count", 0)

if existing and existing > 0:
    print(
        f"Namespace '{NAMESPACE}' already has {existing} vectors. "
        "Skipping upsert to avoid duplicates."
    )
else:
    ids = vector_store.add_documents(documents=splits)
    print(f"Upserted {len(ids)} chunks into Pinecone (namespace={NAMESPACE}).")

Namespace 'lab5' already has 3 vectors. Skipping upsert to avoid duplicates.
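
The skip-if-not-empty check avoids duplicates, but it also means edited documents are never re-indexed. An alternative I could use is deterministic IDs: Pinecone overwrites a vector when it is upserted under an existing ID, so repeated runs become idempotent. The sketch below only derives the IDs; `stable_chunk_id` is a hypothetical helper, and the commented call at the end assumes `add_documents` is given an `ids` argument.

```python
import hashlib


def stable_chunk_id(source: str, text: str) -> str:
    """Derive a repeatable ID from a chunk's source label and its text."""
    digest = hashlib.sha256(f"{source}\n{text}".encode("utf-8")).hexdigest()
    return f"{source}-{digest[:16]}"


# Stand-ins for (metadata["source"], page_content) pairs taken from `splits`.
example_chunks = [
    ("data_policy", "Operational logs are kept for 90 days."),
    ("product_faq", "What is RAG?"),
]
ids = [stable_chunk_id(source, text) for source, text in example_chunks]
print(ids)

# In the notebook this could become (assumption, not run here):
# ids = [stable_chunk_id(d.metadata["source"], d.page_content) for d in splits]
# vector_store.add_documents(documents=splits, ids=ids)
```

One trade-off: if a chunk's text changes, its ID changes too, so the old vector stays in the namespace until it is deleted explicitly.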


## 5 - Build the RAG Chain

I build the RAG chain using **LCEL (LangChain Expression Language)**. It has two stages:

1. **Retriever** - I fetch the top-k most relevant chunks from Pinecone for a given query.
2. **Generation** - I pass the retrieved context plus the user's question to the LLM and produce an answer.

I use `RunnablePassthrough` to forward the user's question unchanged, and a helper function, `format_docs`, to join the retrieved documents into a single string for the prompt.

My system prompt instructs the model to answer **only** from the provided context. If the context does not contain the answer, it must say so explicitly - this helps reduce hallucinations.

I use **Groq** with **Llama 3.3 70B** as the chat model. Groq provides extremely fast inference on its free tier (30 requests/minute).

In [20]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_groq import ChatGroq

llm = ChatGroq(model="llama-3.3-70b-versatile", temperature=0)

retriever = vector_store.as_retriever(search_kwargs={"k": 3})


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


qa_prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a helpful question-answering assistant.\n"
            "Answer the user's question using ONLY the CONTEXT below.\n"
            "If the context does not contain the answer, respond: "
            "'I don't know based on the available context.'\n\n"
            "CONTEXT:\n{context}",
        ),
        ("human", "{question}"),
    ]
)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | qa_prompt
    | llm
    | StrOutputParser()
)

print("RAG chain ready.")

RAG chain ready.
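
The first step of the chain, the dict `{"context": ..., "question": ...}`, can be hard to read at first. The plain-Python sketch below mimics what that step produces, with no LangChain involved. `FakeDoc`, `fake_retriever`, and `build_prompt_inputs` are hypothetical stand-ins I introduce only for illustration.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Minimal stand-in for a LangChain Document (illustration only)."""
    page_content: str


def format_docs(docs):
    # Same helper as in the cell above: join chunk texts with blank lines.
    return "\n\n".join(doc.page_content for doc in docs)


def fake_retriever(question):
    # Hypothetical retriever: ignores the question and returns fixed chunks.
    return [
        FakeDoc("Backups are kept for 30 days."),
        FakeDoc("Key rotation is recommended every 90 days."),
    ]


def build_prompt_inputs(question):
    # Mirrors {"context": retriever | format_docs, "question": RunnablePassthrough()}:
    # both branches receive the same input and their results land in one dict.
    return {"context": format_docs(fake_retriever(question)), "question": question}


inputs = build_prompt_inputs("How long are backups kept?")
print(inputs["context"])
print(inputs["question"])
```

The resulting dict supplies the two variables, `{context}` and `{question}`, that the prompt template expects.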


## 6 - Query the RAG Chain

### Question 1 - Answerable from context

I ask a question whose answer exists in the indexed documents. The chain retrieves the relevant chunks, passes them as context to the LLM, and generates a grounded answer.

In [21]:
question = "What is the data retention policy?"

answer = rag_chain.invoke(question)

print("QUESTION:", question)
print("\nANSWER:", answer)

QUESTION: What is the data retention policy?

ANSWER: The data retention policy is as follows:
- Operational logs are kept for 90 days.
- Backups are kept for 30 days.


### Inspect retrieved sources

To verify that the answer is grounded in real data, I call the retriever directly with the same question and inspect which chunks come back. Because the namespace holds only three chunks and `k=3`, every chunk is returned, ordered from most to least similar.

In [22]:
retrieved_docs = retriever.invoke(question)

print("Retrieved chunks:")
for i, doc in enumerate(retrieved_docs, start=1):
    source = doc.metadata.get("source", "unknown")
    print(f"\n[{i}] source={source}")
    print(doc.page_content[:400])

Retrieved chunks:

[1] source=data_policy
Data Policy (example)

Retention:
- Operational logs are kept for 90 days.
- Backups are kept for 30 days.

Access:
- Access to sensitive data requires role-based authorization.
- All access events must be audited (who, when, what).

Security:
- Secrets (API keys) must never be committed to version control.
- Key rotation is recommended every 90 days.

[2] source=product_faq
Product FAQ (example)

What is RAG?
RAG (Retrieval-Augmented Generation) is a technique where a model retrieves relevant information from a knowledge base (e.g. a vector store) and then generates an answer using that context.

Why use a vector database?
It enables semantic similarity search even when the query does not literally match the stored text.

How to reduce hallucinations?
- Constrain the

[3] source=langchain_notes
Quick Notes: LangChain (example)

Typical components in a RAG pipeline:
1) Loader  2) Splitter  3) Embeddings  4) Vector store  5) Retriever  6) LLM

B
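
Only the first chunk is relevant to the retention question; the other two are returned simply because `k=3` covers the whole namespace. A similarity threshold is one way to drop such chunks. The sketch below filters `(text, score)` pairs in plain Python. The scores are made-up values for illustration; in the notebook they would have to come from a scored search such as the vector store's `similarity_search_with_score`, and a sensible cutoff would need to be tuned on real queries.

```python
def filter_by_score(scored_chunks, min_score):
    """Keep (text, score) pairs whose score is at least min_score, best first."""
    kept = [(text, score) for text, score in scored_chunks if score >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical scores for illustration; real values come from the vector store.
scored = [
    ("Product FAQ (example)", 0.41),
    ("Data Policy (example)", 0.82),
    ("Quick Notes: LangChain (example)", 0.37),
]

relevant = filter_by_score(scored, min_score=0.5)
for text, score in relevant:
    print(f"{score:.2f}  {text}")
```

With the cutoff at 0.5, only the data policy chunk would reach the prompt, which keeps unrelated text out of the context.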

### Question 2 - Not answerable from context

I now ask something that is **not** present in the indexed documents. The model should decline to answer rather than hallucinate, thanks to my system prompt constraint.

In [23]:
question2 = "What is the support phone number?"
answer2 = rag_chain.invoke(question2)

print("QUESTION:", question2)
print("ANSWER:", answer2)

QUESTION: What is the support phone number?
ANSWER: I don't know based on the available context.
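
If I wanted to check this behavior automatically for a batch of unanswerable questions, I could compare each answer against the fallback sentence from the system prompt. This is a small sketch under the assumption that the model repeats the sentence nearly verbatim; `is_fallback` is a hypothetical helper.

```python
FALLBACK = "I don't know based on the available context."


def is_fallback(answer: str) -> bool:
    """True if the answer is the fallback sentence, ignoring case, outer quotes, and spacing."""
    normalized = " ".join(answer.split()).strip("'\"").lower()
    return normalized == FALLBACK.lower()


print(is_fallback("I don't know based on the available context."))
print(is_fallback("Operational logs are kept for 90 days."))
```

An exact-match check like this is strict on purpose: a paraphrased refusal would not be detected and would need manual review.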


## Summary

In this notebook I built a full RAG pipeline:

| Step | Component | Tool |
|------|-----------|------|
| Document preparation | Inline `Document` objects | `langchain_core` |
| Chunking | `RecursiveCharacterTextSplitter` | `langchain_text_splitters` |
| Embedding | `GoogleGenerativeAIEmbeddings` (`gemini-embedding-001`, 768 dims) | `langchain_google_genai` |
| Vector storage and retrieval | `PineconeVectorStore` | `langchain_pinecone` |
| Answer generation | `ChatGroq` (`llama-3.3-70b-versatile`) via LCEL | `langchain_groq` |

I constrained the LLM to answer **only** from the retrieved context, which helps reduce hallucinations.