# RAG 2.0 Feature Test Notebook

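This notebook exercises each major component of the RAG application: document ingestion, web crawling, retrieval, single-turn generation, multi-turn chat, and the knowledge graph.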

### ⚠️ Prerequisites

1.  **Run from `RAG/` directory:** This notebook *must* be saved in, and run from, your main `RAG/` project folder so that imports such as `config` and `core.rag_engine` resolve.
2.  **`.env` File:** Ensure you have a `.env` file in the `RAG/` directory with your `HF_API_TOKEN` and other settings.
3.  **Dependencies:** Make sure you have run `pip install -r requirements.txt` and `pip install jupyter notebook` in your virtual environment.
4.  **Clean Database:** For a clean test, stop any running apps and delete your old database directory:
    * **Windows (PowerShell):** `Remove-Item -Path .\chroma_db_store -Recurse -Force`
    * **macOS/Linux:** `rm -rf ./chroma_db_store`
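
The two shell commands above can also be written once in Python so the same step works on any platform. This is a minimal sketch: the helper name `remove_store` is hypothetical, and the default path simply mirrors the `chroma_db_store` directory named above.

```python
import shutil
from pathlib import Path


def remove_store(path: str = "./chroma_db_store") -> bool:
    """Delete the directory at `path` if it exists.

    Returns True if a directory was removed, False if there was nothing to remove.
    """
    store = Path(path)
    if not store.is_dir():
        return False
    # Recursively delete the directory and everything inside it.
    shutil.rmtree(store)
    return True
```

Call `remove_store()` from the `RAG/` directory while no app is running, since an open database file may block deletion on some platforms.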

In [1]:
import os
import io
import sys
import json
from IPython.display import display, Markdown
import plotly.io as pio

# Set Plotly to dark mode for the notebook
pio.templates.default = "plotly_dark"

# Import all our application modules
from config import Config, IS_CONFIG_VALID
from logger import logger
from core.rag_engine import RAGEngine
from ingestion.document_processor import DocumentProcessor
from ingestion.web_crawler import WebCrawler
from core.knowledge_graph import KnowledgeGraphBuilder

def pjson(data):
    """Helper function to pretty-print JSON."""
    print(json.dumps(data, indent=2))

2025-10-23 13:16:32,994 - RAG_App - INFO - Loading configuration...
2025-10-23 13:16:32,995 - RAG_App - INFO - Configuration validated successfully.
2025-10-23 13:16:33,965 - faiss.loader - INFO - Loading faiss with AVX512 support.
2025-10-23 13:16:33,966 - faiss.loader - INFO - Could not load library with AVX512 support due to:
ModuleNotFoundError("No module named 'faiss.swigfaiss_avx512'")
2025-10-23 13:16:33,966 - faiss.loader - INFO - Loading faiss with AVX2 support.
2025-10-23 13:16:34,145 - faiss.loader - INFO - Successfully loaded faiss with AVX2 support.




2025-10-23 13:17:00,624 - datasets - INFO - TensorFlow version 2.20.0 available.
2025-10-23 13:17:00,625 - datasets - INFO - JAX version 0.7.2 available.



## 1. Configuration & Setup

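This cell checks whether the `.env` configuration is valid and initializes the main `RAGEngine`. It then attempts to clear any existing vector store so the test starts clean. In the recorded output the Chroma reset logs `Reset is disabled by config`, so deleting the database directory as described in the prerequisites remains the reliable way to start fresh.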

In [2]:
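if not IS_CONFIG_VALID:
    logger.error("CRITICAL: .env file is not configured correctly. Please check it.")
    # Stop here: initializing the engine with an invalid config would fail later in less obvious ways.
    raise RuntimeError("Invalid configuration; fix the .env file before continuing.")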
else:
    logger.info("Configuration is valid.")
    print(f"LLM Provider: {Config.LLM_PROVIDER}")
    print(f"LLM Model: {Config.LLM_MODEL}")
    print(f"Vector Store: {Config.VECTOR_STORE_TYPE}")

# Initialize the main engine
rag_engine = RAGEngine()

# Clear the store for a clean test
logger.warning("Clearing vector store for a clean test...")
rag_engine.clear_vector_store()
print("\nVector store cleared.")

2025-10-23 13:17:07,024 - RAG_App - INFO - Configuration is valid.
LLM Provider: featherless-ai
LLM Model: inclusionAI/Ling-1T
Vector Store: chroma
2025-10-23 13:17:07,025 - RAG_App - INFO - Initializing RAGEngine...
2025-10-23 13:17:07,025 - RAG_App - INFO - Initializing ChromaVectorStore at ./chroma_db_store
2025-10-23 13:17:07,064 - chromadb.telemetry.product.posthog - INFO - Anonymized telemetry enabled. See                     https://docs.trychroma.com/telemetry for more information.




2025-10-23 13:17:07,506 - RAG_App - INFO - Loading embedding model: sentence-transformers/all-MiniLM-L6-v2
2025-10-23 13:17:07,535 - sentence_transformers.SentenceTransformer - INFO - Use pytorch device_name: cuda:0
2025-10-23 13:17:07,536 - sentence_transformers.SentenceTransformer - INFO - Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L6-v2




2025-10-23 13:17:13,312 - RAG_App - INFO - ChromaVectorStore loaded on initialization.
2025-10-23 13:17:13,313 - RAG_App - INFO - Vector store loaded successfully.
2025-10-23 13:17:13,313 - RAG_App - INFO - RAGEngine initialized with provider featherless-ai and model inclusionAI/Ling-1T
2025-10-23 13:17:13,326 - RAG_App - ERROR - Error resetting chroma database: Reset is disabled by config
2025-10-23 13:17:13,326 - RAG_App - INFO - Vector store cleared and re-initialized.

Vector store cleared.
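
The log above reports `Reset is disabled by config` immediately before `Vector store cleared and re-initialized`, so the printed message alone does not prove the store is empty. The guard sketched below makes the check explicit. It assumes the stats dictionary exposes a `total_documents` key, as the later outputs in this notebook do; the function name `assert_store_empty` is hypothetical.

```python
def assert_store_empty(stats: dict) -> None:
    """Raise RuntimeError if the stats dictionary reports any stored documents."""
    # Assumption: a missing key means the store reported no documents, treated as 0.
    count = stats.get("total_documents", 0)
    if count != 0:
        raise RuntimeError(
            f"Vector store still holds {count} documents after clearing."
        )
```

In this notebook it could be called as `assert_store_empty(rag_engine.get_vector_store_stats())` right after `clear_vector_store()`.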


## 2. Feature 1: Document Ingestion

We will create two dummy `.txt` files, process them, and add them to the `RAGEngine`.

In [3]:
# 1. Create dummy files
mock_files_data = [
    {
        'name': 'test_paris.txt',
        'type': 'text/plain',
        'data': b"The capital of France is Paris. Paris is known for the Eiffel Tower, the Louvre Museum, and its beautiful cafes. It is a major center for art and culture."
    },
    {
        'name': 'test_berlin.txt',
        'type': 'text/plain',
        'data': b"Berlin is the capital of Germany. It is famous for the Brandenburg Gate and the remains of the Berlin Wall. It has a vibrant nightlife and tech scene."
    }
]

# 2. Process files
processor = DocumentProcessor()
processed_docs = processor.process_uploaded_files(mock_files_data)

print(f"DocumentProcessor created {len(processed_docs)} chunks from {len(mock_files_data)} files.")

# 3. Add to RAG Engine
rag_engine.add_documents(processed_docs)

# 4. Check stats
stats = rag_engine.get_vector_store_stats()
print("\n--- Vector Store Stats after Document Ingestion ---")
pjson(stats)

2025-10-23 13:17:13,336 - RAG_App - INFO - DocumentProcessor initialized.
2025-10-23 13:17:13,336 - RAG_App - INFO - Processing file: test_paris.txt
2025-10-23 13:17:13,336 - RAG_App - INFO - Successfully processed test_paris.txt, created 1 chunks.
2025-10-23 13:17:13,337 - RAG_App - INFO - Processing file: test_berlin.txt
2025-10-23 13:17:13,337 - RAG_App - INFO - Successfully processed test_berlin.txt, created 1 chunks.
DocumentProcessor created 2 chunks from 2 files.
2025-10-23 13:17:13,339 - RAG_App - INFO - Adding 2 documents to Chroma...


Batches: 100%|██████████| 1/1 [00:00<00:00,  1.78it/s]

2025-10-23 13:17:13,971 - RAG_App - INFO - Document addition to Chroma complete.
2025-10-23 13:17:13,971 - RAG_App - INFO - ChromaVectorStore is persistent. No explicit save needed.

--- Vector Store Stats after Document Ingestion ---
{
  "total_documents": 2,
  "index_size": 2,
  "dimension": 384,
  "model": "sentence-transformers/all-MiniLM-L6-v2",
  "type": "Chroma"
}
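
Each test file above is short enough that the log reports `created 1 chunks` per file. Longer inputs have to be split, and this notebook does not show how `DocumentProcessor` does that. The following is therefore only a generic fixed-size chunker with overlap; the function name and both parameters are hypothetical.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split `text` into windows of `chunk_size` characters sharing `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text,
        # so no trailing chunk is emitted that only repeats the overlap.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks
```

With the default sizes, both mock documents are shorter than one window and come back as a single chunk each, matching the counts in the log.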





## 3. Feature 2: Web Crawling

We will crawl a simple, text-heavy webpage and add the content to the RAG Engine.

In [4]:
crawler = WebCrawler()
urls_to_crawl = ["https://en.wikipedia.org/wiki/Bread"]
context = "baking, history, flour"
max_pages = 2
max_depth = 1  # 0 = root page only, 1 = root page plus the pages it links to

print(f"Starting crawl for {urls_to_crawl[0]}...")
crawled_content = crawler.crawl_root_urls(urls_to_crawl, context, max_pages, max_depth)

print(f"WebCrawler found {len(crawled_content)} relevant pages.")

if crawled_content:
    # Add to RAG Engine
    rag_engine.add_documents(crawled_content)
    
    # Check stats again
    stats = rag_engine.get_vector_store_stats()
    print("\n--- Vector Store Stats after Web Crawl ---")
    pjson(stats)
else:
    print("Skipping web crawl addition, no content found.")

Starting crawl for https://en.wikipedia.org/wiki/Bread...
2025-10-23 13:17:13,985 - RAG_App - INFO - Starting crawl 1/1 for root URL: https://en.wikipedia.org/wiki/Bread
2025-10-23 13:17:13,985 - RAG_App - INFO - Crawling (Depth 0): https://en.wikipedia.org/wiki/Bread
2025-10-23 13:17:15,518 - RAG_App - INFO - Crawling (Depth 1): https://en.wikipedia.org/wiki/Lactic_acid
2025-10-23 13:17:16,232 - RAG_App - INFO - Crawl complete. Fetched 2 total pages.
WebCrawler found 2 relevant pages.
2025-10-23 13:17:16,233 - RAG_App - INFO - Adding 2 documents to Chroma...


Batches: 100%|██████████| 1/1 [00:00<00:00, 46.75it/s]

2025-10-23 13:17:16,273 - RAG_App - INFO - Document addition to Chroma complete.
2025-10-23 13:17:16,273 - RAG_App - INFO - ChromaVectorStore is persistent. No explicit save needed.

--- Vector Store Stats after Web Crawl ---
{
  "total_documents": 4,
  "index_size": 4,
  "dimension": 384,
  "model": "sentence-transformers/all-MiniLM-L6-v2",
  "type": "Chroma"
}
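
The crawl log above visits the root page at depth 0, follows one link at depth 1, and then stops because `max_pages = 2` has been reached. A breadth-first traversal with both limits behaves the same way. The sketch below illustrates only those two limits: the internals of `WebCrawler`, including how it uses the `context` string, are not shown in this notebook, and `crawl_order` and `get_links` are hypothetical names. Links come from a callable so the example runs without network access.

```python
from collections import deque
from typing import Callable, Iterable, List, Tuple


def crawl_order(
    root: str,
    get_links: Callable[[str], Iterable[str]],
    max_pages: int,
    max_depth: int,
) -> List[Tuple[str, int]]:
    """Return (url, depth) pairs in breadth-first visiting order."""
    visited: List[Tuple[str, int]] = []
    seen = {root}
    queue = deque([(root, 0)])
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append((url, depth))
        # Only expand links while still above the depth limit.
        if depth < max_depth:
            for link in get_links(url):
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return visited
```

For example, with a toy link graph `{"a": ["b", "c"]}`, `max_pages=2` and `max_depth=1`, the traversal visits `a` at depth 0 and `b` at depth 1, then stops.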





## 4. Feature 3: RAG Engine - Retrieval

Test the `retrieve_relevant_documents` function. The top result should be from `test_paris.txt`.

In [5]:
query = "What is the capital of France?"
print(f"Testing retrieval for: '{query}'\n")

retrieved_docs = rag_engine.retrieve_relevant_documents(query, k=3)

pjson(retrieved_docs)

Testing retrieval for: 'What is the capital of France?'



Batches: 100%|██████████| 1/1 [00:00<00:00, 26.64it/s]

[
  {
    "document": "The capital of France is Paris. Paris is known for the Eiffel Tower, the Louvre Museum, and its beautiful cafes. It is a major center for art and culture.",
    "metadata": {
      "file_type": "text/plain",
      "source": "upload",
      "chunk_id": 0,
      "filename": "test_paris.txt"
    },
    "score": 0.689079999923706
  },
  {
    "document": "Berlin is the capital of Germany. It is famous for the Brandenburg Gate and the remains of the Berlin Wall. It has a vibrant nightlife and tech scene.",
    "metadata": {
      "filename": "test_berlin.txt",
      "source": "upload",
      "chunk_id": 0,
      "file_type": "text/plain"
    },
    "score": 0.2829338312149048
  },
  {
    "document": "Bread - Wikipedia Jump to content From Wikipedia, the free encyclopedia Food made of flour and water For other uses, see Bread (disambiguation). BreadVarious leavened breadsMain ingredientsFlour, water Cookbook: Bread\u00a0 Media: Bread Bread is a baked food product made
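
The scores above separate the Paris chunk (about 0.69) clearly from the Berlin chunk (about 0.28). The notebook does not state which metric the vector store uses. Assuming a cosine-style similarity between the query embedding and each chunk embedding, ranking can be sketched in plain Python as follows; `cosine_similarity` and `top_k` are hypothetical helpers, not functions of the application.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    docs: List[Tuple[str, Sequence[float]]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k (name, score) pairs with the highest similarity to the query."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in docs]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]
```

A real store would use an approximate nearest-neighbor index rather than scoring every chunk, but the ordering principle is the same.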




## 5. Feature 4: RAG Engine - Generation (Single-Turn)

Test the full `generate_response` pipeline. We'll test both our uploaded document and the crawled content.

In [6]:
display(Markdown("### Test 1: Query from Uploaded Document"))
query1 = "What is Paris known for?"
print(f"Testing generation for: '{query1}'\n")
response1 = rag_engine.generate_response(query1)
pjson(response1)

display(Markdown("\n### Test 2: Query from Crawled Webpage"))
query2 = "What is bread?"
print(f"Testing generation for: '{query2}'\n")
response2 = rag_engine.generate_response(query2)
pjson(response2)

### Test 1: Query from Uploaded Document

Testing generation for: 'What is Paris known for?'



Batches: 100%|██████████| 1/1 [00:00<00:00, 199.90it/s]

2025-10-23 13:17:16,363 - RAG_App - INFO - Generating LLM chat completion for 1 messages...





2025-10-23 13:17:21,036 - RAG_App - INFO - LLM response received.
{
  "answer": "Paris is known for the Eiffel Tower, the Louvre Museum, and its beautiful cafes. It is also a major center for art and culture.",
  "sources": [
    {
      "filename": "test_paris.txt",
      "chunk_id": 0,
      "source": "upload",
      "file_type": "text/plain"
    },
    {
      "source": "upload",
      "file_type": "text/plain",
      "chunk_id": 0,
      "filename": "test_berlin.txt"
    },
    {
      "description": "",
      "title": "Bread - Wikipedia",
      "context": "baking, history, flour",
      "source": "web_crawl",
      "url": "https://en.wikipedia.org/wiki/Bread"
    }
  ],
  "confidence": 0.27571433782577515,
  "context_used": "The capital of France is Paris. Paris is known for the Eiffel Tower, the Louvre Museum, and its beautiful cafes. It is a major center for art and culture.\n\n---\n\nBerlin is the capital of Germany. It is famous for the Brandenburg Gate and the remains of the 


### Test 2: Query from Crawled Webpage

Testing generation for: 'What is bread?'



Batches: 100%|██████████| 1/1 [00:00<00:00, 117.52it/s]

2025-10-23 13:17:21,054 - RAG_App - INFO - Generating LLM chat completion for 1 messages...





2025-10-23 13:17:26,245 - RAG_App - INFO - LLM response received.
{
  "answer": "Bread is a baked food product made from water, flour, and often yeast. It has been an important part of many cultures' diets throughout history and is typically made by culturing wheat-flour dough with yeast, allowing it to rise, and baking it in an oven.",
  "sources": [
    {
      "context": "baking, history, flour",
      "source": "web_crawl",
      "title": "Bread - Wikipedia",
      "url": "https://en.wikipedia.org/wiki/Bread",
      "description": ""
    },
    {
      "source": "web_crawl",
      "url": "https://en.wikipedia.org/wiki/Lactic_acid",
      "description": "",
      "context": "baking, history, flour",
      "title": "Lactic acid - Wikipedia"
    },
    {
      "file_type": "text/plain",
      "chunk_id": 0,
      "source": "upload",
      "filename": "test_berlin.txt"
    }
  ],
  "confidence": 0.2364456206560135,
  "context_used": "Bread - Wikipedia Jump to content From Wikipedia, th
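
The `context_used` field in the first response shows the retrieved chunks joined by a `---` separator line. The sketch below reproduces that joining step and wraps it in a prompt. The separator is taken from the output above; the prompt wording and the names `build_context` and `build_prompt` are assumptions, since the actual template used by `generate_response` is not shown.

```python
from typing import Dict, List

# Separator observed in the `context_used` output above.
SEPARATOR = "\n\n---\n\n"


def build_context(retrieved: List[Dict]) -> str:
    """Join the text of each retrieved item, in ranked order, with the separator."""
    return SEPARATOR.join(item["document"] for item in retrieved)


def build_prompt(query: str, retrieved: List[Dict]) -> str:
    """Assemble a grounded prompt (hypothetical wording) from the query and context."""
    context = build_context(retrieved)
    return (
        "Answer using only the context below. "
        "If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
```

Because every retrieved chunk is included, loosely related sources such as `test_berlin.txt` still appear in the `sources` list of an answer about Paris.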

## 6. Feature 5: RAG Engine - Chat Mode (Multi-Turn)

Test the `chat_mode` function to check that it carries chat history across turns and grounds its answers in the retrieved context. The second turn also checks that it declines to answer when the information is not in the context.

In [7]:
chat_history = []

display(Markdown("### Chat Turn 1: Initial Question (from context)"))
query1 = "What is the capital of Germany?"
print(f"HUMAN: {query1}")
response1 = rag_engine.chat_mode(query1, chat_history)
print(f"ASSISTANT: {response1['answer']}")

# Add to history
chat_history.append({'human': query1, 'assistant': response1['answer']})

display(Markdown("\n### Chat Turn 2: Follow-up (NOT in context)"))
query2 = "How many people live there?"
print(f"HUMAN: {query2}")
response2 = rag_engine.chat_mode(query2, chat_history)
print(f"ASSISTANT: {response2['answer']}")

# Add to history
chat_history.append({'human': query2, 'assistant': response2['answer']})

display(Markdown("\n### Chat Turn 3: New Topic (from context)"))
query3 = "What is the Eiffel Tower?"
print(f"HUMAN: {query3}")
response3 = rag_engine.chat_mode(query3, chat_history)
print(f"ASSISTANT: {response3['answer']}")


### Chat Turn 1: Initial Question (from context)

HUMAN: What is the capital of Germany?


Batches: 100%|██████████| 1/1 [00:00<00:00, 110.95it/s]

2025-10-23 13:17:26,271 - RAG_App - INFO - Generating LLM chat completion for 1 messages...





2025-10-23 13:17:28,275 - RAG_App - INFO - LLM response received.
ASSISTANT: Berlin



### Chat Turn 2: Follow-up (NOT in context)

HUMAN: How many people live there?


Batches: 100%|██████████| 1/1 [00:00<00:00, 111.12it/s]

2025-10-23 13:17:28,291 - RAG_App - INFO - Generating LLM chat completion for 3 messages...





2025-10-23 13:17:30,687 - RAG_App - INFO - LLM response received.
ASSISTANT: I do not have that information in my documents.



### Chat Turn 3: New Topic (from context)

HUMAN: What is the Eiffel Tower?


Batches: 100%|██████████| 1/1 [00:00<00:00, 122.92it/s]

2025-10-23 13:17:30,703 - RAG_App - INFO - Generating LLM chat completion for 5 messages...





2025-10-23 13:17:35,077 - RAG_App - INFO - LLM response received.
ASSISTANT: The Eiffel Tower is a landmark in Paris, France.
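
The log lines above report 1, 3, and 5 messages for the three turns. That is consistent with each completed turn contributing two messages (one human, one assistant) plus the new query. A sketch of that conversion follows. The `human` and `assistant` keys come from the `chat_history` entries in the cell above; the `role`/`content` message shape and the name `build_messages` are assumptions, and where `chat_mode` places the retrieved context is not shown.

```python
from typing import Dict, List


def build_messages(query: str, chat_history: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Flatten prior turns into alternating user/assistant messages, then add the query."""
    messages: List[Dict[str, str]] = []
    for turn in chat_history:
        messages.append({"role": "user", "content": turn["human"]})
        messages.append({"role": "assistant", "content": turn["assistant"]})
    messages.append({"role": "user", "content": query})
    return messages
```

With an empty history this yields one message, and each stored turn adds two more, matching the counts in the log.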


## 7. Feature 6: Knowledge Graph

Finally, we'll test the Knowledge Graph builder by feeding it all the documents we've ingested. This may take a moment.

In [8]:
display(Markdown("### Building Knowledge Graph..."))

kg_builder = KnowledgeGraphBuilder()

# 1. Get all documents from the vector store
all_docs = rag_engine.get_all_documents_for_kg()
print(f"Building KG from {len(all_docs)} total document chunks.")

# 2. Extract entities and relationships
kg_stats = kg_builder.extract_entities_and_relationships(all_docs)
print("\n--- Knowledge Graph Stats ---")
pjson(kg_stats)

# 3. Visualize the graph
if kg_stats.get('graph_nodes', 0) > 0:
    display(Markdown("### Interactive Knowledge Graph Visualization"))
    fig = kg_builder.visualize_graph_plotly()
    display(fig)
else:
    print("No nodes found for KG visualization.")

### Building Knowledge Graph...

2025-10-23 13:17:35,086 - RAG_App - INFO - Retrieving all documents for KG build...
Building KG from 4 total document chunks.
2025-10-23 13:17:35,097 - RAG_App - INFO - Starting KG extraction from 4 documents...
2025-10-23 13:17:35,099 - RAG_App - INFO - Processing document 0/4 for KG...
2025-10-23 13:17:35,609 - RAG_App - INFO - KG build complete. Nodes: 115, Edges: 1

--- Knowledge Graph Stats ---
{
  "entities_count": 115,
  "relationships_count": 1,
  "graph_nodes": 115,
  "graph_edges": 1
}


### Interactive Knowledge Graph Visualization

ValueError: Mime type rendering requires nbformat>=4.2.0 but it is not installed
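
Plotly raises the `ValueError` above when it tries to render a figure inline and the kernel's environment lacks `nbformat` 4.2.0 or newer. Installing it with `pip install nbformat` and restarting the kernel usually resolves this. As a fallback, the sketch below checks whether the package is importable and otherwise writes the figure to a standalone HTML file with the figure's `write_html` method. The helper names and the default file name are hypothetical.

```python
import importlib.util


def has_nbformat() -> bool:
    """Return True if the nbformat package is importable (the version is not checked)."""
    return importlib.util.find_spec("nbformat") is not None


def show_or_export(fig, html_path: str = "knowledge_graph.html") -> str:
    """Show `fig` inline when nbformat is present; otherwise save it as HTML.

    Returns "inline" or the path of the written file.
    """
    if has_nbformat():
        fig.show()
        return "inline"
    fig.write_html(html_path)
    return html_path
```

In the cell above, `display(fig)` could be replaced by `show_or_export(fig)`, after which the HTML file can be opened in any browser.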

## 8. Cleanup

You can now manually delete the `chroma_db_store` and `logs` directories if you wish.

In [9]:
print("Test complete. Remember to delete 'chroma_db_store' and 'logs' if you want a fresh start next time.")

Test complete. Remember to delete 'chroma_db_store' and 'logs' if you want a fresh start next time.
