## 🌐 **Web-Based RAG with Hallucination Detection** | **RAG100x**

This notebook demonstrates a working Retrieval-Augmented Generation (RAG) pipeline built on real blog articles from [DeepLearning.ai](https://www.deeplearning.ai/blog/). It loads, indexes, and retrieves relevant sections of web content, generates answers from the retrieved text, and adds two LLM-based checks: document relevance grading and hallucination detection.

✅ **Key Capabilities**  
*This notebook expands the RAG capabilities by layering multiple critical stages:*

- *Loads articles directly from web URLs using LangChain’s WebBaseLoader*  
- *Splits the raw text into chunks of up to about 500 tokens using `RecursiveCharacterTextSplitter`*  
- *Embeds each chunk using Cohere’s high-performance embedding model*  
- *Stores and retrieves via Chroma vectorstore for scalable search*  
- *Uses Groq’s LLaMA 3.1 model to filter relevant chunks (document grading)*  
- *Generates grounded answers using only the filtered, high-relevance chunks*  
- *Detects hallucinations in the LLM output by verifying against the source text*  


> 🛠️ **Note:** This notebook is **completely self-contained**, with all logic implemented inline for transparency, reproducibility, and easy customization. It is designed to move one step closer to production-grade RAG.

---

## 🔁 **How This Notebook Differs from Previous RAGs**

🧩 Compared to the first two notebooks in **RAG100x**, this version introduces multiple critical upgrades:

| Feature                        | Day 1: PDF QA RAG | Day 2: CSV RAG | ✅ Day 3: Web RAG |
|-------------------------------|-------------------|----------------|-------------------|
| Data Source                   | PDFs              | CSVs           | 🌐 Web Articles   |
| Retrieval Only                | ✅ Yes            | ❌ No          | ❌ No (includes generation) |
| Answer Generation             | ❌ No             | ✅ Yes         | ✅ Yes            |
| Document Relevance Filtering  | ❌ No             | ❌ No          | ✅ Yes (via LLM)  |
| Hallucination Detection       | ❌ No             | ❌ No          | ✅ Yes            |
| Embedding Model               | OpenAI            | OpenAI         | 🧠 Cohere         |
| Vector DB                     | FAISS             | FAISS          | 📦 Chroma         |
| LLM Stack                     | OpenAI GPT-4o     | GPT-4o         | 🔥 Groq (LLaMA 3.1) |


This notebook moves from simple experimentation toward **real-world RAG concerns** such as trust, traceability, and precision, all of which matter for production readiness.


### 📦 Installing Core Libraries for Web-based RAG

To get started, we install a minimal yet powerful set of libraries required for a **self-contained, production-ready RAG system** built on top of live web content:

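- **`langchain` & `langchain-community`**  
  Provide standardized interfaces for document loaders, splitters, embedding models, vectorstores, and LLM chains — including community-maintained integrations.

- **`langchain-cohere` & `langchain-groq`**  
  Provider integrations for Cohere embeddings (`CohereEmbeddings`) and Groq-hosted chat models (`ChatGroq`), both imported later in this notebook.

- **`chromadb`, `beautifulsoup4` & `tiktoken`**  
  Back the Chroma vectorstore, the HTML parsing done by `WebBaseLoader`, and the token-based text splitter, respectively.

- **`python-dotenv`**  
  Helps manage API credentials securely by loading them from a `.env` file into environment variables.

> We intentionally keep dependencies lightweight and modular to retain full control over the pipeline and ensure reproducibility in future experiments.


In [None]:
# Install required packages (core libraries plus the provider integrations imported below)
!pip install langchain langchain-community langchain-cohere langchain-groq chromadb beautifulsoup4 tiktoken python-dotenv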

### 🔐 Setting Up Environment Variables for API Access

To connect with external services like **Cohere** (for embeddings) and **Groq** (for running LLMs), we need to securely load our API keys.

Instead of hardcoding sensitive credentials into the notebook, we use a `.env` file combined with the `python-dotenv` package. This keeps our keys out of version control and makes the setup portable.

- **`.env` file** contains your API keys in simple `KEY=value` format.
- **`load_dotenv()`** reads that file and makes the values available to Python.
- **`os.environ[...]`** is used to explicitly set the keys for downstream compatibility.

> 📁 Make sure your `.env` file is located in the root directory and looks like this:
> ```
> GROQ_API_KEY=your_groq_key
> COHERE_API_KEY=your_cohere_key
> ```

This setup allows the rest of our notebook to access LLMs and embedding models seamlessly.


In [None]:
# Load and set API keys from .env file
import os
from dotenv import load_dotenv

# Load variables from the .env file into environment
load_dotenv()

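# Fail early with a clear message if a key is missing, then set it explicitly
# in os.environ (ensures compatibility across services)
for key in ("GROQ_API_KEY", "COHERE_API_KEY"):
    value = os.getenv(key)
    if not value:
        raise RuntimeError(f"{key} is not set; add it to your .env file")
    os.environ[key] = value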


### 🧱 Building the Vector Index from Web Articles

We’re turning online blog posts into searchable chunks by following a few essential steps:

- **`WebBaseLoader`**  
  Loads the full article content from each URL — useful when your data lives on the web.

- **`RecursiveCharacterTextSplitter`**  
  Splits long articles into smaller chunks. This helps preserve semantic structure while making each chunk manageable for embedding.

- **`CohereEmbeddings`**  
  Converts each chunk into a dense vector using Cohere’s English embedding model, `embed-english-v3.0`.

- **`Chroma` Vectorstore**  
  Stores all these vectors in a lightweight, local vector database that supports similarity search, which makes it a good fit for local experimentation.

- **`.as_retriever()`**  
  Creates a retriever that can return the top `k` most relevant chunks based on a semantic query.

> ✅ After this step, we have a ready-to-query semantic index of the DeepLearning.ai blogs — our knowledge base for RAG!
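
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size windowing over a token list. It is a simplified stand-in, not how `RecursiveCharacterTextSplitter` works internally: the real splitter also tries to break on separators such as paragraphs and sentences, and the notebook counts tokens with tiktoken rather than the whitespace split assumed here. The function name `window_chunks` is hypothetical.

```python
def window_chunks(tokens, chunk_size, chunk_overlap=0):
    """Split a token list into windows of at most chunk_size tokens.

    Consecutive windows share chunk_overlap tokens. Illustrative only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# Assumption: whitespace-separated words stand in for tiktoken tokens.
words = "agents can plan reflect use tools and collaborate with other agents".split()
no_overlap = window_chunks(words, chunk_size=4)
with_overlap = window_chunks(words, chunk_size=4, chunk_overlap=1)
```

With `chunk_overlap=0`, as in this notebook, chunks never share tokens, so context that straddles a boundary is split between two chunks. A small overlap trades some duplication for continuity.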



In [None]:
# Import required tools
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_cohere import CohereEmbeddings

# Load Cohere embedding model
embedding_model = CohereEmbeddings(model="embed-english-v3.0")

# List of DeepLearning.ai article URLs
urls = [
    "https://www.deeplearning.ai/the-batch/how-agents-can-improve-llm-performance/?ref=dl-staging-website.ghost.io",
    "https://www.deeplearning.ai/the-batch/agentic-design-patterns-part-2-reflection/?ref=dl-staging-website.ghost.io",
    "https://www.deeplearning.ai/the-batch/agentic-design-patterns-part-3-tool-use/?ref=dl-staging-website.ghost.io",
    "https://www.deeplearning.ai/the-batch/agentic-design-patterns-part-4-planning/?ref=dl-staging-website.ghost.io",
    "https://www.deeplearning.ai/the-batch/agentic-design-patterns-part-5-multi-agent-collaboration/?ref=dl-staging-website.ghost.io"
]

# Load all web documents
raw_docs = []
for url in urls:
    raw_docs.extend(WebBaseLoader(url).load())

# Split each article into smaller chunks
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=500,
    chunk_overlap=0
)
chunks = splitter.split_documents(raw_docs)

# Store embeddings in Chroma vectorstore
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embedding_model,
    collection_name="web-rag"
)

# Create a retriever to fetch top 4 relevant chunks
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4}
)


### 🔍 Retrieving Relevant Chunks for a Question

Now that we’ve built our vector index, we can test it by asking a question.

- **`retriever.invoke()`**  
  Takes a user question, embeds it, and returns the top-matching document chunks from the vectorstore.

- **Document Metadata**  
  We print the first retrieved chunk to inspect what kind of content is being returned — including its `title`, `source URL`, and actual `content`.

> This step confirms that our retriever is working correctly and that the documents being pulled are semantically relevant.
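
Conceptually, a similarity retriever scores every stored chunk vector against the query vector and keeps the best `k`. The sketch below illustrates that idea with cosine similarity on toy 2-D vectors. It is an assumption-laden illustration: the metric Chroma actually uses depends on how the collection is configured, real vectorstores use approximate indexes rather than a full scan, and `cosine_similarity` and `top_k` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat zero vectors as matching nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=4):
    """Return indices of the k chunk vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx)
              for idx, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


# Toy 2-D "embeddings" (real embeddings have many more dimensions).
chunk_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
query_vec = [1.0, 0.1]
best = top_k(query_vec, chunk_vecs, k=2)  # indices of the two closest chunks
```

Note that a top-`k` search always returns `k` results when enough chunks exist, even if none of them is a good match, which is one reason a separate relevance check is useful.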


In [None]:
# Ask a question
question = "what are the different kinds of agentic design patterns?"

# Retrieve top-matching documents from the vector index
docs = retriever.invoke(question)

# Preview the first retrieved chunk
# Use .get() so a page without a title or source does not raise a KeyError
print(f"Title: {docs[0].metadata.get('title', 'N/A')}\n")
print(f"Source: {docs[0].metadata.get('source', 'N/A')}\n")
print(f"Content:\n{docs[0].page_content}")


### 🧪 LLM-Based Filtering: Checking Document Relevance

Retrievers aren't perfect — they might return documents that are loosely related or even irrelevant. To improve answer quality, we use an **LLM to grade each retrieved chunk** for actual relevance.

Here's how this filtering works:

- **Why?**  
  To remove noise and hallucinations in later steps, we want to only pass *truly useful* documents to the answer generation step.

- **What?**  
  We create a structured grading function using **Groq’s LLaMA 3.1 model**, which checks whether a document is relevant to the given question.

- **How?**  
  1. We define a simple schema with a single binary output: `"yes"` or `"no"`.
  2. We wrap the LLM with this schema using `with_structured_output(...)`.
  3. We feed each retrieved chunk + question into a prompt template, and the LLM decides whether it's useful or not.

- **`GradeDocuments` (Pydantic)**  
  Defines the expected output: a single `binary_score` field intended to hold **"yes"** or **"no"**.

- **`ChatGroq(model="llama-3.1-8b-instant")`**  
  Uses Groq’s lightning-fast LLaMA model to analyze each document.

- **`.with_structured_output(...)`**  
  Binds the schema to the model so each reply is parsed into a `GradeDocuments` object, which keeps responses consistent and easy to use.

- **`ChatPromptTemplate`**  
  Gives the model a clear job:  
  > “Here’s a document and a user question. Is this document relevant?”


| Component                   | Role                                                        |
| --------------------------- | ----------------------------------------------------------- |
| `retriever.invoke(...)`     | Fetches relevant-looking documents                          |
| `retrieval_grader`          | Decides if each document is *actually* relevant or not      |
| `"yes"` / `"no"` decision   | Filters out bad retrievals before they reach the LLM answer |

> 🧠 Under the hood: This turns the LLM into a *relevance classifier*, helping us automatically discard retrieved content that isn’t relevant to the user’s query.
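
The schema in this notebook types `binary_score` as a plain `str`, so the "yes"/"no" contract lives only in the field description and the prompt. The sketch below shows one way to enforce it after the fact. It uses a standard-library dataclass as a stand-in for the Pydantic model so that it runs without extra packages; `StrictGrade` and `parse_grade` are hypothetical names, and the JSON payload shape is an assumption about what a structured reply might look like.

```python
import json
from dataclasses import dataclass

ALLOWED_SCORES = ("yes", "no")


@dataclass(frozen=True)
class StrictGrade:
    """Stand-in for a schema whose score must be exactly 'yes' or 'no'."""
    binary_score: str

    def __post_init__(self):
        if self.binary_score not in ALLOWED_SCORES:
            raise ValueError(f"unexpected score: {self.binary_score!r}")


def parse_grade(raw_json):
    """Parse a payload such as '{"binary_score": "Yes"}' into a StrictGrade."""
    payload = json.loads(raw_json)
    # Normalize case and whitespace before validating the value.
    score = str(payload["binary_score"]).strip().lower()
    return StrictGrade(binary_score=score)


grade = parse_grade('{"binary_score": " Yes "}')
```

In the notebook itself, a similar constraint could be expressed by annotating the field as `Literal["yes", "no"]`, which Pydantic validates when the model is constructed.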

In [None]:
# Imports for prompting and schema validation
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field
from langchain_groq import ChatGroq

# Define a structured schema for the LLM output
class GradeDocuments(BaseModel):
    """Binary score for relevance check on retrieved documents."""
    binary_score: str = Field(
        description="Is this document relevant to the user’s question? 'yes' or 'no'"
    )

# Load Groq's LLaMA model with output schema binding
llm = ChatGroq(model="llama-3.1-8b-instant", temperature=0)
structured_llm_grader = llm.with_structured_output(GradeDocuments)

# Create a prompt that tells the LLM how to assess relevance
system_prompt = """
You are a grader assessing relevance of a retrieved document to a user question.
If the document contains keyword(s) or semantic meaning related to the user question, grade it as relevant.
It does not need to be a stringent test. The goal is to filter out erroneous retrievals.
Give a binary score — 'yes' if relevant, 'no' otherwise.
"""

grade_prompt = ChatPromptTemplate.from_messages([
    ("system", system_prompt),
    ("human", "Retrieved document:\n\n{document}\n\nUser question: {question}")
])

# Chain prompt and model into a single grader that replies 'yes' or 'no'
retrieval_grader = grade_prompt | structured_llm_grader


### 🧹 Filtering Out Irrelevant Documents

Now that we’ve built our document grader, let’s use it to **clean the retrieval results**.

We loop through the retrieved documents and ask:

> 🧐 “Does this document help answer the question?”

- **`retrieval_grader.invoke({...})`**  
  Passes in the document and question to our LLM-based grader.

- **Filter Logic**  
  If the model returns `"yes"` (relevant), we **keep the document**.  
  Otherwise, we drop it from further processing.

- **`docs_to_use`**  
  Stores only the documents that pass our relevance check —  
  so the final answer is based on **trusted, filtered sources**.

> This improves answer quality and reduces hallucination by eliminating off-topic content.


In [None]:
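# Keep only the chunks the grader marks as relevant
docs_to_use = []
for doc in docs:
    print(doc.page_content, '\n', '-'*50)
    res = retrieval_grader.invoke({"question": question, "document": doc.page_content})
    print(res, '\n')
    # Normalize case and whitespace so replies like "Yes" still count
    if res.binary_score.strip().lower() == 'yes':
        docs_to_use.append(doc)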

### 🧠 Generating the Final Answer with LLaMA-3.1

This is the generation step, where the **cleaned, relevant documents** are passed to the LLM to produce a concise answer.

- **`ChatPromptTemplate`**  
  Structures the prompt with two pieces:
  - `<docs>...</docs>` → Injects the filtered documents
  - `<question>...</question>` → Inserts the user’s query

- **`ChatGroq(model="llama-3.1-8b-instant")`**  
  Uses Groq’s blazing-fast LLaMA model to answer based on the input context.

- **`format_docs()`**  
  Wraps each document in tags (`<doc1>`, `<doc2>`, etc.) with metadata like title & source for traceability.

- **`StrOutputParser()`**  
  Ensures the final output is returned as clean, readable text.

- **`rag_chain`**  
  This chain flows:  
  **Prompt → LLM → Output Parser** — giving us a ready-to-display answer.

> ✨ In just a few lines, we’ve gone from user query → document retrieval → answer generation — all powered by LangChain + Groq.


In [None]:
from langchain_core.output_parsers import StrOutputParser

# Prompt structure for answer generation
system_prompt = """You are an assistant for question-answering tasks.
Answer the question based on the retrieved documents and your knowledge.
Keep it short and focused — 3 to 5 sentences max."""

qa_prompt = ChatPromptTemplate.from_messages([
    ("system", system_prompt),
    ("human",
     "Retrieved documents:\n\n<docs>{documents}</docs>\n\n"
     "User question: <question>{question}</question>")
])

# LLM: Fast and deterministic
llm = ChatGroq(model="llama-3.1-8b-instant", temperature=0)

# Format retrieved docs with metadata
def format_docs(docs):
    return "\n".join(
        f"<doc{i+1}>:\nTitle: {doc.metadata.get('title', 'N/A')}\n"
        f"Source: {doc.metadata.get('source', 'N/A')}\n"
        f"Content: {doc.page_content}\n</doc{i+1}>\n"
        for i, doc in enumerate(docs)
    )

# Final RAG chain: Prompt → LLM → Parser
rag_chain = qa_prompt | llm | StrOutputParser()

# Run
generation = rag_chain.invoke({
    "documents": format_docs(docs_to_use),
    "question": question
})

print(generation)
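
If the grader rejects every chunk, `docs_to_use` is empty, `format_docs` returns an empty string, and the model is left to answer from its own knowledge. A small guard can short-circuit that case. This is a minimal sketch under stated assumptions: `answer_with_guard`, `FALLBACK`, and `fake_generate` are hypothetical names, and `generate` stands in for a function that would wrap `rag_chain.invoke(...)`.

```python
FALLBACK = "No relevant sources were found for this question."


def answer_with_guard(question, docs_to_use, generate, fallback=FALLBACK):
    """Call generate() only when at least one document passed the filter."""
    if not docs_to_use:
        return fallback  # skip the LLM call instead of answering without context
    return generate(question, docs_to_use)


def fake_generate(question, docs):
    """Stand-in for the real chain, so the sketch runs without API keys."""
    return f"Answer to {question!r} from {len(docs)} document(s)."


empty_case = answer_with_guard("what is reflection?", [], fake_generate)
normal_case = answer_with_guard("what is reflection?", ["doc A", "doc B"], fake_generate)
```

Returning a fixed message keeps the behavior explicit; whether to fall back, widen the search, or rephrase the question is a design choice left open here.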


---

### 🧪 Hallucination Detection Using Groq + Function Calling

Once the LLM generates an answer, it's important to **verify whether the answer is actually grounded in the retrieved source documents** — especially in high-stakes or trust-sensitive applications.

This section introduces a **hallucination checker** using Groq’s LLaMA 3.1 model and LangChain's function calling capabilities.

---

#### 🧠 What We're Doing

We use an LLM to **analyze whether the generated answer is actually supported by the retrieved chunks**. This adds an extra trust layer to the RAG pipeline.

---

#### 🔍 Why This Matters

While RAG systems are designed to stay grounded in the retrieved context, **hallucinations can still happen** when:

- The retrieved context is only loosely relevant
- The LLM fills gaps with prior knowledge
- The question was misunderstood

This step explicitly flags such situations by asking the LLM to judge its own output.

---

#### ⚙️ How It Works

1. **`GradeHallucinations` Pydantic Model**  
   Defines a schema with a single field: `binary_score` → either `'yes'` or `'no'`.

2. **Structured Output with Groq’s LLM**  
   We wrap Groq’s LLaMA 3.1 model with `.with_structured_output(...)` so the LLM *must* return output matching our `GradeHallucinations` schema.

3. **Custom Prompt for Hallucination Scoring**  
   A system message tells the LLM to check if the answer is grounded in the facts.  
   A human message supplies both the `<facts>` and the `<generation>`.

4. **Grading Execution**  
   Finally, we pass the prompt and invoke the grading pipeline:
   ```python
   response = hallucination_grader.invoke({
       "documents": format_docs(docs_to_use),
       "generation": generation
   })
   ```


In [None]:
# Grading schema: grounded = 'yes', hallucinated = 'no'
class GradeHallucinations(BaseModel):
    binary_score: str = Field(
        ...,
        description="Answer is grounded in the facts, 'yes' or 'no'"
    )

# Grader LLM setup
llm = ChatGroq(model="llama-3.1-8b-instant", temperature=0)
structured_llm_grader = llm.with_structured_output(GradeHallucinations)

# Grading prompt: Does generation align with retrieved facts?
system_prompt = """You are a grader assessing whether an LLM generation is grounded in / supported by a set of retrieved facts.
Give a binary score 'yes' or 'no'. 'Yes' means that the answer is grounded in / supported by the set of facts."""

hallucination_prompt = ChatPromptTemplate.from_messages([
    ("system", system_prompt),
    ("human",
     "Set of facts:\n\n<facts>{documents}</facts>\n\n"
     "LLM generation: <generation>{generation}</generation>")
])

# Combine prompt and LLM
hallucination_grader = hallucination_prompt | structured_llm_grader

# Evaluate grounding
response = hallucination_grader.invoke({
    "documents": format_docs(docs_to_use),
    "generation": generation
})

print(response)
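
Printing the verdict is useful for inspection, but a pipeline usually needs to act on it. The sketch below shows one possible policy: pass grounded answers through unchanged and prefix ungrounded ones with a warning. The helper names and the warning text are hypothetical; in the notebook the score would come from `response.binary_score` and the answer from `generation`.

```python
WARNING = "[Unverified: not supported by the retrieved sources]"


def is_grounded(binary_score):
    """Interpret the grader's score, tolerating case and whitespace."""
    return str(binary_score).strip().lower() == "yes"


def gate_answer(generation, binary_score, warning=WARNING):
    """Return the answer as-is if grounded, otherwise flag it clearly."""
    if is_grounded(binary_score):
        return generation
    return f"{warning}\n{generation}"


# Assumed example values standing in for the notebook's real outputs.
passed = gate_answer("Agents can reflect on their own output.", "Yes")
flagged = gate_answer("Agents were invented in 1850.", "no")
```

Other policies, such as withholding the answer or regenerating with a stricter prompt, fit the same shape. Note that with `temperature=0`, simply re-running the same prompt is unlikely to produce a different answer.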

---

## 📘 Summary & Credits

This notebook is based on the excellent open-source repository [RAG_Techniques by NirDiamant](https://github.com/NirDiamant/RAG_Techniques).  
I referred to that work to understand how the pipeline is structured and then reimplemented the same concept in a **fully self-contained** way, but using recent models — as part of my personal learning journey.

The purpose of this notebook is purely **educational**:  
- To deepen my understanding of Retrieval-Augmented Generation systems  
- To keep a clean, trackable log of what I’ve built and learned  
- And to serve as a future reference for myself or others starting from scratch

To support that, I’ve added clear, concise markdown cells throughout the notebook — explaining *why* each package was installed, *why* each line of code exists, and *how* each component fits into the overall RAG pipeline. It’s designed to help anyone (including my future self) grasp the **how** and the **why**, not just the **what**.


## 🌐 Why Use Web Data?

Unlike static files (like PDFs or CSVs), web content changes fast — and often contains rich, diverse context. In this notebook, I:
- Scraped content from [DeepLearning.ai](https://www.deeplearning.ai/blog/)
- Loaded, split, and embedded the text using **Cohere Embeddings**
- Stored it in a **Chroma vectorstore**
- Retrieved the top-matching chunks for each query and filtered them with an LLM grader

This is closer to how real production systems behave — dynamic data, fast inference, and structured outputs.

## 🧠 What’s New in This Version?

Compared to my previous RAG builds, this version introduces several advanced features:

- 🔍 **LLM-based document filtering** — Uses a structured grader to filter out irrelevant chunks before generation  
- 🧪 **Hallucination detection** — Adds a second grading step to validate whether the generated answer is grounded in the retrieved content  
- ⚡ **Groq-hosted LLaMA 3.1 models** — Provides ultra-fast inference with structured outputs  
- 🔗 **LangChain 0.2+ composability** — Built with LCEL pipe chaining (`prompt | llm | parser`) instead of legacy chain classes such as `LLMChain`  
- 💡 **Manual document formatting** — Ensures consistent input formatting for both generation and grading phases  
- ✨ **No external modules** — Everything is implemented directly inside the notebook for transparency and reproducibility


## 🚀 What Could Be Added Next?

This is a strong base, but several production-ready enhancements could follow:

- ✅ **Source Attribution** — Map the generated answer to specific documents or snippets  
  *Use techniques like context windows with highlighted evidence or attach source metadata inline with answer spans. LangChain’s `stuff` or `map_reduce` chains can be adapted to include citations.*

- 📈 **Confidence Scoring** — Add soft labels or scoring to the relevance and hallucination outputs  
  *Instead of binary "yes/no" hallucination flags, output a confidence score between 0–1 using a regression-style prompt or token probabilities.*

- 🧩 **Multi-hop Retrieval** — Support reasoning across multiple documents with scratchpad prompts  
  *Chain together multiple retrieved documents and guide the LLM with intermediate reasoning steps (e.g. via ReAct or Tree of Thoughts-style prompting).*

- 🧪 **Answer Evaluation** — Use GPT-based graders or traditional metrics to score the final answer  
  *Apply tools like `LLM-as-a-judge` or integrate BLEU/ROUGE-style metrics if ground truths exist — useful in benchmarking system accuracy.*

- 🖼️ **Streamlit / Gradio UI** — Turn this into a live tool or chatbot  
  *Wrap the full chain into a friendly web interface where users can upload links, ask questions, and view sources, generation, and hallucination verdicts interactively.*

- 🔍 **Hybrid Retrieval** — Combine keyword (BM25) and vector search for higher recall  
  *Fuse dense embedding search (via Chroma or FAISS) with sparse keyword retrieval (via BM25 or Elasticsearch) to catch both semantic and lexical matches.*

Each of these could be a new notebook in the series.
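
As a taste of the hybrid retrieval idea, here is a minimal sketch of reciprocal rank fusion (RRF), one common way to merge ranked lists from different retrievers. This notebook does not implement hybrid search or name a fusion method, so the choice of RRF, the constant `k=60`, and the document IDs below are all assumptions for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of document IDs into a single ranking.

    Each document earns 1 / (k + rank) per list it appears in (rank starts
    at 1); documents are returned in order of their summed score.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score, breaking ties by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results: one list from vector search, one from keyword search.
dense_ranking = ["a", "b", "c"]
sparse_ranking = ["c", "a", "d"]
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
```

Documents that appear in both lists ("a" and "c" here) rise above those found by only one retriever, which is the effect hybrid search is after.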

## 💡 Final Word

This notebook is part of my larger personal project: **RAG100x** — a challenge to build and log my journey in RAG from 0 to 100 in the coming months.

It’s not built to impress — it’s built to **progress**.  
Everything here is structured to enable **daily iteration**, focused experimentation, and clean documentation.

If you're exploring RAG from first principles, feel free to use this as a scaffold for your own builds. And of course — check out the original repository for broader implementations and ideas.