# RAG 2.0 — Feature Test Notebook


This notebook is a structured, fully documented test suite for the RAG application. Each section contains explanatory markdown, code cells with purpose comments, expected outputs, and brief validation notes.

⚠️ **Run this notebook from the `RAG/` project directory. Make sure the `.env` file exists and the dependencies are installed.**


---

## 1. Setup & Environment

**Purpose:** Verify environment, .env, and dependencies. Expected: env vars loaded and paths resolved.

**How to validate:** See the expected outputs in the code cells below.
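
As a minimal sketch of this check, the helper below reports which required variables are unset or empty. The function name `missing_env_vars` and the `REQUIRED_VARS` list are illustrative assumptions, not part of the project; `HF_API_TOKEN` is the only variable name taken from this notebook.

```python
import os


def missing_env_vars(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Hypothetical list; replace it with the variables your .env actually defines.
REQUIRED_VARS = ["HF_API_TOKEN"]

missing = missing_env_vars(REQUIRED_VARS)
print("Missing variables:", missing or "none")
```

Passing a plain dict as `env` lets the helper be exercised without touching the real environment.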


In [None]:
# ------------------------------------------------------------
# Purpose: Import the standard-library modules used by later cells.
# Inputs: None
# Expected Output: None (imports only).
# Validation: The cell runs without raising ImportError.
# ------------------------------------------------------------

import json
import os
from pathlib import Path

---

## 2. Data Ingestion

**Purpose:** Upload or crawl documents and URLs. Expected: text extraction confirmations.

**How to validate:** See the expected outputs in the code cells below.


In [None]:
# ------------------------------------------------------------
# Purpose: Import the project modules (config, RAG engine, ingestion, logger).
# Inputs: The notebook is run from the RAG/ project directory.
# Expected Output: None (imports only).
# Validation: The cell runs without raising ImportError.
# ------------------------------------------------------------

from config import Config, IS_CONFIG_VALID
from core.knowledge_graph import KnowledgeGraphBuilder
from core.rag_engine import RAGEngine
from ingestion.document_processor import DocumentProcessor
from ingestion.web_crawler import WebCrawler
from logger import logger

In [None]:
# ------------------------------------------------------------
# Purpose: Define a helper that pretty-prints JSON-serializable objects.
# Inputs: The json module imported in an earlier cell.
# Expected Output: None (function definition only).
# Validation: Calling pjson({"ok": True}) prints indented JSON.
# ------------------------------------------------------------

def pjson(obj):
    """Pretty-print JSON to stdout."""
    print(json.dumps(obj, indent=2, ensure_ascii=False))

---

## 3. Text Processing & Chunking

**Purpose:** Normalize text and split into semantic chunks. Expected: chunk counts and stats.

**How to validate:** See the expected outputs in the code cells below.
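
To make "chunk counts and stats" concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simpler stand-in for semantic chunking, not the splitter that `DocumentProcessor` uses (which this notebook does not show); the name `chunk_words` and its parameters are hypothetical.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split `text` into chunks of up to `chunk_size` words.

    Consecutive chunks share `overlap` words so that context is not lost
    at chunk boundaries.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


sample = "Paris is the capital of France and is known for the Eiffel Tower"
chunks = chunk_words(sample, chunk_size=6, overlap=2)
print(f"{len(chunks)} chunks; first chunk: {chunks[0]!r}")
```

The overlap trades a little storage for better recall when a relevant sentence straddles two chunks.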


In [None]:
# ------------------------------------------------------------
# Purpose: Validate the configuration and initialize the RAG engine.
# Inputs: Config, IS_CONFIG_VALID, RAGEngine, and logger from earlier cells.
# Expected Output: LLM provider, LLM model, and vector store type printed.
# Validation: No RuntimeError is raised and the printed values match .env.
# ------------------------------------------------------------

if not IS_CONFIG_VALID:
    logger.error("CRITICAL: .env file is not configured correctly. Please check it.")
    raise RuntimeError("Invalid configuration")

logger.info("Configuration is valid.")
print(f"LLM Provider : {Config.LLM_PROVIDER}")
print(f"LLM Model    : {Config.LLM_MODEL}")
print(f"Vector Store : {Config.VECTOR_STORE_TYPE}")

rag = RAGEngine()

---

## 4. Embedding Generation

**Purpose:** Generate vector embeddings for chunks using configured embedding model. Expected: embeddings shape and latency metrics.

**How to validate:** See the expected outputs in the code cells below.
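
A shape check like the one described can be sketched without any embedding library. The helper name `embedding_shape` and the toy vectors are assumptions for illustration; how the RAG engine exposes its embeddings is not shown in this notebook.

```python
def embedding_shape(vectors):
    """Return (n_vectors, dim), raising ValueError if the rows differ in length."""
    if not vectors:
        return (0, 0)
    dim = len(vectors[0])
    for i, vec in enumerate(vectors):
        if len(vec) != dim:
            raise ValueError(f"vector {i} has length {len(vec)}, expected {dim}")
    return (len(vectors), dim)


# Toy vectors standing in for real model output.
toy = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
print("Embeddings shape:", embedding_shape(toy))
```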


In [None]:
# ------------------------------------------------------------
# Purpose: Clear the vector store so the test starts from an empty state.
# Inputs: The rag engine created in an earlier cell.
# Expected Output: "Vector store cleared." printed.
# Validation: The message is printed and no exception is raised.
# ------------------------------------------------------------

logger.warning("Clearing vector store for a clean test…")
rag.clear_vector_store()
print("Vector store cleared.\n")

In [None]:
# ------------------------------------------------------------
# Purpose: Ingest two in-memory text files and add their chunks to the vector store.
# Inputs: rag, DocumentProcessor, and pjson from earlier cells.
# Expected Output: Chunk count and vector-store stats printed.
# Validation: The chunk count is greater than zero and the stats reflect the new documents.
# ------------------------------------------------------------

mock_files = [
    {
        "name": "test_paris.txt",
        "type": "text/plain",
        "data": b"The capital of France is Paris. Paris is known for the Eiffel Tower, "
                b"the Louvre Museum, and its beautiful cafes. It is a major center for "
                b"art and culture.",
    },
    {
        "name": "test_berlin.txt",
        "type": "text/plain",
        "data": b"Berlin is the capital of Germany. It is famous for the Brandenburg Gate "
                b"and the remains of the Berlin Wall. It has a vibrant nightlife and tech scene.",
    },
]

processor = DocumentProcessor()
docs = processor.process_uploaded_files(mock_files)
rag.add_documents(docs)
print(f"DocumentProcessor created {len(docs)} chunks from {len(mock_files)} files.\n")

stats = rag.get_vector_store_stats()
print("--- Vector-store stats after document ingestion ---")
pjson(stats)

---

## 5. Vector Store Operations

**Purpose:** Write/read embeddings to/from Vector DB (Chroma or FAISS). Expected: successful writes and accurate retrievals.

**How to validate:** See the expected outputs in the code cells below.
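
The retrieval half of this test boils down to a nearest-neighbour lookup. The sketch below is a brute-force cosine-similarity search over a toy in-memory store; Chroma and FAISS use their own index structures, so treat this as an illustration of the idea only. `cosine_similarity`, `top_k`, and `toy_store` are hypothetical names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=3):
    """Return the k (doc_id, score) pairs most similar to query_vec."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in store.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings"; real ones have hundreds of dimensions.
toy_store = {
    "paris": [1.0, 0.0, 0.1],
    "berlin": [0.0, 1.0, 0.1],
    "bread": [0.1, 0.1, 1.0],
}
for doc_id, score in top_k([0.9, 0.1, 0.0], toy_store, k=2):
    print(f"{doc_id}: {score:.3f}")
```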


In [None]:
# ------------------------------------------------------------
# Purpose: Crawl a seed URL and add the relevant pages to the vector store.
# Inputs: rag, WebCrawler, and pjson from earlier cells; network access.
# Expected Output: Number of relevant pages and, if any, updated vector-store stats.
# Validation: The stats change after the crawl when at least one page is found.
# ------------------------------------------------------------

crawler = WebCrawler()
seed_urls = ["https://en.wikipedia.org/wiki/Bread"]
crawled = crawler.crawl_root_urls(
    seed_urls,
    context="baking, history, flour",
    max_depth=1,
    max_pages_per_url=2,   # cap on pages crawled per seed URL
)

print(f"\nWebCrawler found {len(crawled)} relevant pages.")
if crawled:
    rag.add_documents(crawled)
    stats = rag.get_vector_store_stats()
    print("--- Vector-store stats after web crawl ---")
    pjson(stats)

---

## 6. RAG Engine Evaluation

**Purpose:** Run retrieval + generation pipeline to answer queries. Expected: grounded and relevant answers.

**How to validate:** See the expected outputs in the code cells below.


In [None]:
# ------------------------------------------------------------
# Purpose: Retrieval-only check: fetch the top 3 documents for a sample query.
# Inputs: The rag engine with documents ingested in earlier cells.
# Expected Output: Retrieved documents printed as JSON.
# Validation: The Paris test document should appear among the results.
# ------------------------------------------------------------

query = "What is the capital of France?"
retrieved = rag.retrieve_relevant_documents(query, k=3)
print(f"\nRetrieval test for: '{query}'")
pjson(retrieved)

In [None]:
# ------------------------------------------------------------
# Purpose: Full RAG run (retrieval + generation) for two sample queries.
# Inputs: The rag engine with documents ingested in earlier cells.
# Expected Output: One generated response per query, printed as JSON.
# Validation: Each answer should be grounded in the ingested test documents.
# ------------------------------------------------------------

print("\n=== Generation tests ===")

q1 = "What is Paris known for?"
print(f"\nQ: {q1}")
pjson(rag.generate_response(q1))

q2 = "What is bread?"
print(f"\nQ: {q2}")
pjson(rag.generate_response(q2))

---

## 7. Knowledge Graph Tests

**Purpose:** Extract entities and relationships and build graph. Expected: nodes/edges summary.

**How to validate:** See the expected outputs in the code cells below.
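
The "nodes/edges summary" can be illustrated with a small, library-free sketch that counts distinct nodes and edges in a list of triples. The triple format, the function `graph_summary`, and the `graph_edges` key are assumptions; only the `graph_nodes` key appears in this notebook's code.

```python
def graph_summary(triples):
    """Count distinct nodes and edges in (subject, relation, object) triples."""
    nodes = set()
    edges = set()
    for subject, relation, obj in triples:
        nodes.update((subject, obj))
        edges.add((subject, relation, obj))
    return {"graph_nodes": len(nodes), "graph_edges": len(edges)}


# Hand-written triples based on the test documents used in this notebook.
toy_triples = [
    ("Paris", "capital_of", "France"),
    ("Berlin", "capital_of", "Germany"),
    ("Eiffel Tower", "located_in", "Paris"),
]
print(graph_summary(toy_triples))
```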


In [None]:
# ------------------------------------------------------------
# Purpose: Multi-turn chat test with a running conversation history.
# Inputs: The rag engine with documents ingested in earlier cells.
# Expected Output: The assistant's answer printed for each of the three turns.
# Validation: The second turn ("there") should resolve to Berlin if history is used.
# ------------------------------------------------------------

print("\n=== Multi-turn chat ===")
history = []

turns = [
    "What is the capital of Germany?",
    "How many people live there?",
    "What is the Eiffel Tower?",
]

for turn in turns:
    print(f"\nHuman   : {turn}")
    reply = rag.chat_mode(turn, history)
    print(f"Assistant: {reply['answer']}")
    history.append({"human": turn, "assistant": reply["answer"]})

---

## 8. LLM Response Validation

**Purpose:** Check response quality, grounding and simple factual checks. Expected: metrics or pass/fail indicators.

**How to validate:** See the expected outputs in the code cells below.
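
One simple pass/fail indicator for grounding is token overlap between the answer and the retrieved context. The sketch below is a crude lexical heuristic, not a factuality metric used by the project; `grounding_score` and the 0.8 threshold are arbitrary assumptions.

```python
import re


def tokens(text):
    """Lowercase word tokens, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def grounding_score(answer, context):
    """Fraction of distinct answer tokens that also occur in the context."""
    answer_tokens = tokens(answer)
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & tokens(context)) / len(answer_tokens)


context = "The capital of France is Paris. Paris is known for the Eiffel Tower."
score = grounding_score("Paris is the capital of France", context)
# 0.8 is an arbitrary cut-off chosen for illustration.
print(f"Grounding score: {score:.2f}", "PASS" if score >= 0.8 else "FAIL")
```

A lexical score like this cannot detect a fluent but wrong answer that reuses context words, so it is best read as a quick sanity check.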


In [None]:
# ------------------------------------------------------------
# Purpose: Build a knowledge graph from all stored chunks and save a visualization.
# Inputs: rag, KnowledgeGraphBuilder, Path, and pjson from earlier cells.
# Expected Output: KG stats printed; knowledge_graph.html written if the graph has nodes.
# Validation: The stats report a non-zero node count and the HTML file exists.
# ------------------------------------------------------------

print("\n=== Knowledge Graph ===")
kg_builder = KnowledgeGraphBuilder()

all_docs = rag.get_all_documents_for_kg()
print(f"Building KG from {len(all_docs)} chunks…")

kg_stats = kg_builder.extract_entities_and_relationships(all_docs)
print("\n--- KG stats ---")
pjson(kg_stats)

if kg_stats.get("graph_nodes", 0):
    fig = kg_builder.visualize_graph_plotly()
    # Save interactive plot instead of trying to display in terminal
    out_file = Path("knowledge_graph.html")
    fig.write_html(out_file)
    print(f"Interactive graph saved → {out_file.resolve()}")
else:
    print("No nodes found for KG visualisation.")

In [None]:
# ------------------------------------------------------------
# Purpose: Print a completion message with cleanup instructions.
# Inputs: None
# Expected Output: The completion message.
# Validation: The message is printed.
# ------------------------------------------------------------

print("\nTest complete. "
      "Delete 'chroma_db_store' and 'logs' folders if you want a fresh start next run.")

---

## 9. Streamlit UI Integration

**Purpose:** Smoke test critical UI pages (upload, chat, graph). Expected: UI initialization and basic endpoints responding.

**How to validate:** See the expected outputs in the code cells below.


In [None]:
# ------------------------------------------------------------
# Purpose: Smoke-test the Hugging Face Inference API with one chat completion.
# Inputs: The HF_API_TOKEN environment variable; network access.
# Expected Output: The model's reply printed.
# Validation: A non-empty answer is printed and no exception is raised.
# ------------------------------------------------------------


import os
from huggingface_hub import InferenceClient

# ------------------------------------------------------------------
# 1. Read token from environment
# ------------------------------------------------------------------
api_key = os.getenv("HF_API_TOKEN")
if not api_key:
    raise RuntimeError("Export HF_API_TOKEN=<your-hugging-face-token> first")

# Prompt sent to the model
s = "explain yourself"
# ------------------------------------------------------------------
# 2. Build client
# ------------------------------------------------------------------
client = InferenceClient(
    provider="featherless-ai",
    api_key=api_key,
)

# ------------------------------------------------------------------
# 3. Fire request
# ------------------------------------------------------------------
completion = client.chat.completions.create(
    model="inclusionAI/Ling-1T",
    messages=[{"role": "user", "content": s}],
    max_tokens=250,
    temperature=0.7,
)

# ------------------------------------------------------------------
# 4. Display answer
# ------------------------------------------------------------------
answer = completion.choices[0].message.content
print(answer)