# RAG 2.0 Feature Test Notebook

This notebook tests the major components of the RAG application: document ingestion, web crawling, retrieval, generation, chat mode, and the knowledge graph.

### ⚠️ Prerequisites

1.  **Run from `RAG/` directory:** Save and run this notebook in your main `RAG/` project folder; otherwise the imports (`config`, `logger`, `core`, `ingestion`) will not resolve.
2.  **`.env` File:** Ensure you have a `.env` file in the `RAG/` directory with your `HF_API_TOKEN` and other settings.
3.  **Dependencies:** Make sure you have run `pip install -r requirements.txt` and `pip install jupyter notebook` in your virtual environment.
4.  **Clean Database:** For a clean test, stop any running apps and delete your old database directory:
    * **Windows (PowerShell):** `Remove-Item -Path .\chroma_db_store -Recurse -Force`
    * **macOS/Linux:** `rm -rf ./chroma_db_store`
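
The `.env` check and the database cleanup above can also be done from Python, which avoids maintaining separate shell commands per platform. The following is a minimal sketch using only the standard library. The helper names `missing_env_keys` and `remove_store` are hypothetical and not part of the application; the sketch assumes a plain `KEY=VALUE` `.env` format and that the notebook's working directory is `RAG/`.

```python
from __future__ import annotations

import shutil
from pathlib import Path


def missing_env_keys(env_path: Path, required: tuple[str, ...]) -> list[str]:
    """Return the required keys that are absent or empty in a KEY=VALUE .env file."""
    found: dict[str, str] = {}
    if env_path.is_file():
        for raw in env_path.read_text(encoding="utf-8").splitlines():
            line = raw.strip()
            # Skip blanks, comments, and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            found[key.strip()] = value.strip().strip("'\"")
    return [key for key in required if not found.get(key)]


def remove_store(store_dir: Path) -> bool:
    """Delete the vector store directory if present; return True if it was removed."""
    if not store_dir.is_dir():
        return False
    shutil.rmtree(store_dir)
    return True


if __name__ == "__main__":
    project_dir = Path.cwd()  # assumed to be the RAG/ folder
    print("Missing .env keys:", missing_env_keys(project_dir / ".env", ("HF_API_TOKEN",)))
    print("Removed old store:", remove_store(project_dir / "chroma_db_store"))
```

Only `HF_API_TOKEN` is checked here because it is the only key named in the prerequisites; add whichever other settings your `.env` requires.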

In [None]:
import os
import io
import sys
import json
from IPython.display import display, Markdown
import plotly.io as pio

# Set Plotly to dark mode for the notebook
pio.templates.default = "plotly_dark"

# Import all our application modules
from config import Config, IS_CONFIG_VALID
from logger import logger
from core.rag_engine import RAGEngine
from ingestion.document_processor import DocumentProcessor
from ingestion.web_crawler import WebCrawler
from core.knowledge_graph import KnowledgeGraphBuilder

def pjson(data):
    """Helper function to pretty-print JSON."""
    print(json.dumps(data, indent=2))

2025-10-21 04:59:53,617 - RAG_App - INFO - Loading configuration...
2025-10-21 04:59:53,618 - RAG_App - INFO - Configuration validated successfully.
2025-10-21 04:59:54,499 - faiss.loader - INFO - Loading faiss with AVX512 support.
2025-10-21 04:59:54,500 - faiss.loader - INFO - Could not load library with AVX512 support due to:
ModuleNotFoundError("No module named 'faiss.swigfaiss_avx512'")
2025-10-21 04:59:54,500 - faiss.loader - INFO - Loading faiss with AVX2 support.
2025-10-21 04:59:54,638 - faiss.loader - INFO - Successfully loaded faiss with AVX2 support.


2025-10-21 05:00:32,655 - datasets - INFO - TensorFlow version 2.20.0 available.
2025-10-21 05:00:32,657 - datasets - INFO - JAX version 0.7.2 available.



## 1. Configuration & Setup

This cell checks if the `.env` config is valid and initializes the main `RAGEngine`. It will also clear any existing vector store to ensure a clean test.

In [11]:
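if not IS_CONFIG_VALID:
    logger.error("CRITICAL: .env file is not configured correctly. Please check it.")
    # Stop here: initializing the engine with an invalid config would only fail later, less clearly.
    raise RuntimeError("Invalid configuration; fix the .env file and re-run this cell.")

logger.info("Configuration is valid.")
print(f"LLM Provider: {Config.LLM_PROVIDER}")
print(f"LLM Model: {Config.LLM_MODEL}")
print(f"Vector Store: {Config.VECTOR_STORE_TYPE}")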

# Initialize the main engine
rag_engine = RAGEngine()

# Clear the store for a clean test
logger.warning("Clearing vector store for a clean test...")
rag_engine.clear_vector_store()
print("\nVector store cleared.")

2025-10-21 05:06:19,574 - RAG_App - INFO - Configuration is valid.
LLM Provider: featherless-ai
LLM Model: inclusionAI/Ling-1T
Vector Store: chroma
2025-10-21 05:06:19,575 - RAG_App - INFO - Initializing RAGEngine...
2025-10-21 05:06:19,575 - RAG_App - INFO - RAGEngine initialized with provider featherless-ai and model inclusionAI/Ling-1T
2025-10-21 05:06:19,588 - RAG_App - ERROR - Error deleting chroma collection: Collection [rag_collection] does not exist
2025-10-21 05:06:19,589 - RAG_App - INFO - Vector store cleared and re-initialized.

Vector store cleared.


## 2. Feature 1: Document Ingestion

We will create two dummy `.txt` files, process them, and add them to the `RAGEngine`.

In [12]:
# 1. Create dummy files
mock_files_data = [
    {
        'name': 'test_paris.txt',
        'type': 'text/plain',
        'data': b"The capital of France is Paris. Paris is known for the Eiffel Tower, the Louvre Museum, and its beautiful cafes. It is a major center for art and culture."
    },
    {
        'name': 'test_berlin.txt',
        'type': 'text/plain',
        'data': b"Berlin is the capital of Germany. It is famous for the Brandenburg Gate and the remains of the Berlin Wall. It has a vibrant nightlife and tech scene."
    }
]

# 2. Process files
processor = DocumentProcessor()
processed_docs = processor.process_uploaded_files(mock_files_data)

print(f"DocumentProcessor created {len(processed_docs)} chunks from {len(mock_files_data)} files.")

# 3. Add to RAG Engine
rag_engine.add_documents(processed_docs)

# 4. Check stats
stats = rag_engine.get_vector_store_stats()
print("\n--- Vector Store Stats after Document Ingestion ---")
pjson(stats)

2025-10-21 05:06:24,072 - RAG_App - INFO - DocumentProcessor initialized.
2025-10-21 05:06:24,072 - RAG_App - INFO - Processing file: test_paris.txt
2025-10-21 05:06:24,074 - RAG_App - INFO - Successfully processed test_paris.txt, created 1 chunks.
2025-10-21 05:06:24,074 - RAG_App - INFO - Processing file: test_berlin.txt
2025-10-21 05:06:24,075 - RAG_App - INFO - Successfully processed test_berlin.txt, created 1 chunks.
DocumentProcessor created 2 chunks from 2 files.
2025-10-21 05:06:24,075 - RAG_App - INFO - Adding 2 documents to Chroma...


Batches: 100%|██████████| 1/1 [00:00<00:00, 90.61it/s]

2025-10-21 05:06:24,090 - RAG_App - ERROR - Error adding batch to Chroma: Error getting collection: Collection [f5ac716d-72d6-4bbf-95f6-176837ad59b3] does not exist.
2025-10-21 05:06:24,090 - RAG_App - INFO - Document addition to Chroma complete.
2025-10-21 05:06:24,090 - RAG_App - INFO - ChromaVectorStore is persistent. No explicit save needed.





NotFoundError: Collection [f5ac716d-72d6-4bbf-95f6-176837ad59b3] does not exist.

## 3. Feature 2: Web Crawling

We will crawl a simple, text-heavy webpage and add the content to the RAG Engine.

In [13]:
crawler = WebCrawler()
urls_to_crawl = ["https://en.wikipedia.org/wiki/Bread"]
context = "baking, history, flour"
max_pages = 2
max_depth = 1 # 0 = root page, 1 = root + its links

print(f"Starting crawl for {urls_to_crawl[0]}...")
crawled_content = crawler.crawl_root_urls(urls_to_crawl, context, max_pages, max_depth)

print(f"WebCrawler found {len(crawled_content)} relevant pages.")

if crawled_content:
    # Add to RAG Engine
    rag_engine.add_documents(crawled_content)
    
    # Check stats again
    stats = rag_engine.get_vector_store_stats()
    print("\n--- Vector Store Stats after Web Crawl ---")
    pjson(stats)
else:
    print("Skipping web crawl addition, no content found.")

Starting crawl for https://en.wikipedia.org/wiki/Bread...
2025-10-21 05:06:31,804 - RAG_App - INFO - Starting crawl 1/1 for root URL: https://en.wikipedia.org/wiki/Bread
2025-10-21 05:06:31,805 - RAG_App - INFO - Crawling (Depth 0): https://en.wikipedia.org/wiki/Bread
2025-10-21 05:06:33,172 - RAG_App - INFO - Crawling (Depth 1): https://en.wikipedia.org/wiki/Grain_trade
2025-10-21 05:06:34,550 - RAG_App - INFO - Crawl complete. Fetched 2 total pages.
WebCrawler found 2 relevant pages.
2025-10-21 05:06:34,551 - RAG_App - INFO - Adding 2 documents to Chroma...


Batches: 100%|██████████| 1/1 [00:00<00:00, 41.33it/s]

2025-10-21 05:06:34,580 - RAG_App - ERROR - Error adding batch to Chroma: Error getting collection: Collection [f5ac716d-72d6-4bbf-95f6-176837ad59b3] does not exist.
2025-10-21 05:06:34,581 - RAG_App - INFO - Document addition to Chroma complete.
2025-10-21 05:06:34,582 - RAG_App - INFO - ChromaVectorStore is persistent. No explicit save needed.





NotFoundError: Collection [f5ac716d-72d6-4bbf-95f6-176837ad59b3] does not exist.
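
To make the `max_pages` and `max_depth` limits used in the crawl cell concrete, here is a small sketch of a bounded breadth-first traversal over an in-memory link graph. It is not the `WebCrawler` implementation: the function `bounded_crawl` and the toy link graph are hypothetical, the breadth-first order is an assumption, and the sketch ignores the relevance filtering that the `context` argument suggests.

```python
from collections import deque


def bounded_crawl(links, root, max_pages, max_depth):
    """Breadth-first walk over `links`, stopping at max_pages or max_depth."""
    visited = []
    seen = {root}
    queue = deque([(root, 0)])
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return visited


# Toy graph for illustration only; these links are made up.
toy_links = {
    "Bread": ["Grain_trade", "Flour"],
    "Grain_trade": ["Wheat"],
}
print(bounded_crawl(toy_links, "Bread", max_pages=2, max_depth=1))
# [('Bread', 0), ('Grain_trade', 1)]
```

With `max_depth=0` only the root page is visited; with `max_depth=1` the root's direct links become eligible, but `max_pages=2` stops the walk after the first of them.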

## 4. Feature 3: RAG Engine - Retrieval

Test the `retrieve_relevant_documents` function. The top result should be from `test_paris.txt`.

In [14]:
query = "What is the capital of France?"
print(f"Testing retrieval for: '{query}'\n")

retrieved_docs = rag_engine.retrieve_relevant_documents(query, k=3)

pjson(retrieved_docs)

Testing retrieval for: 'What is the capital of France?'



NotFoundError: Collection [f5ac716d-72d6-4bbf-95f6-176837ad59b3] does not exist.
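
The expectation that `test_paris.txt` ranks first can be illustrated without the vector store. The sketch below ranks the two mock documents against the same query using a crude word-overlap cosine score. This is a stand-in for illustration only: the real engine presumably uses embedding similarity, and the helpers `tokens`, `overlap_score`, and `top_k` are hypothetical names, not part of `RAGEngine`.

```python
import math
import re


def tokens(text):
    """Lowercase word set; a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z]+", text.lower()))


def overlap_score(query, text):
    """Cosine similarity between binary bag-of-words vectors."""
    q, d = tokens(query), tokens(text)
    if not q or not d:
        return 0.0
    return len(q & d) / math.sqrt(len(q) * len(d))


def top_k(query, docs, k=3):
    """Return up to k (source, score) pairs, best match first."""
    scored = [(name, overlap_score(query, text)) for name, text in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# The same text as the mock files in the ingestion cell.
docs = {
    "test_paris.txt": (
        "The capital of France is Paris. Paris is known for the Eiffel Tower, "
        "the Louvre Museum, and its beautiful cafes. It is a major center for art and culture."
    ),
    "test_berlin.txt": (
        "Berlin is the capital of Germany. It is famous for the Brandenburg Gate and "
        "the remains of the Berlin Wall. It has a vibrant nightlife and tech scene."
    ),
}

ranked = top_k("What is the capital of France?", docs, k=3)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
assert ranked[0][0] == "test_paris.txt"
```

Asking for `k=3` with only two documents simply returns both, which mirrors what a retrieval call should do when the store holds fewer than `k` chunks.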

## 5. Feature 4: RAG Engine - Generation (Single-Turn)

Test the full `generate_response` pipeline. We'll test both our uploaded document and the crawled content.

In [None]:
display(Markdown("### Test 1: Query from Uploaded Document"))
query1 = "What is Paris known for?"
print(f"Testing generation for: '{query1}'\n")
response1 = rag_engine.generate_response(query1)
pjson(response1)

display(Markdown("\n### Test 2: Query from Crawled Webpage"))
query2 = "What is bread?"
print(f"Testing generation for: '{query2}'\n")
response2 = rag_engine.generate_response(query2)
pjson(response2)

## 6. Feature 5: RAG Engine - Chat Mode (Multi-Turn)

Test the `chat_mode` function to see whether it carries conversation history across turns and grounds its answers in the retrieved context. The second turn also checks that it declines to answer when the information is not in the context.

In [None]:
chat_history = []

display(Markdown("### Chat Turn 1: Initial Question (from context)"))
query1 = "What is the capital of Germany?"
print(f"HUMAN: {query1}")
response1 = rag_engine.chat_mode(query1, chat_history)
print(f"ASSISTANT: {response1['answer']}")

# Add to history
chat_history.append({'human': query1, 'assistant': response1['answer']})

display(Markdown("\n### Chat Turn 2: Follow-up (NOT in context)"))
query2 = "How many people live there?"
print(f"HUMAN: {query2}")
response2 = rag_engine.chat_mode(query2, chat_history)
print(f"ASSISTANT: {response2['answer']}")

# Add to history
chat_history.append({'human': query2, 'assistant': response2['answer']})

display(Markdown("\n### Chat Turn 3: New Topic (from context)"))
query3 = "What is the Eiffel Tower?"
print(f"HUMAN: {query3}")
response3 = rag_engine.chat_mode(query3, chat_history)
print(f"ASSISTANT: {response3['answer']}")
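
The cell above passes `chat_history` as a list of `{'human': ..., 'assistant': ...}` dictionaries. How `chat_mode` turns that list into prompt text is not shown in this notebook; one plausible approach is to render the most recent turns as a transcript. The following sketch assumes that approach, and `format_history` and `max_turns` are hypothetical names.

```python
def format_history(chat_history, max_turns=3):
    """Render the last max_turns exchanges as a plain-text transcript."""
    # A slice of [-0:] would return the whole list, so handle zero explicitly.
    recent = chat_history[-max_turns:] if max_turns > 0 else []
    lines = []
    for turn in recent:
        lines.append(f"HUMAN: {turn['human']}")
        lines.append(f"ASSISTANT: {turn['assistant']}")
    return "\n".join(lines)


# Placeholder answer; a real run would store the model's response here.
sample_history = [
    {"human": "What is the capital of Germany?", "assistant": "(model answer goes here)"},
]
print(format_history(sample_history))
```

Capping the number of turns keeps the prompt bounded as the conversation grows, at the cost of forgetting older exchanges.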


## 7. Feature 6: Knowledge Graph

Finally, we'll test the Knowledge Graph builder by feeding it all the documents we've ingested. This may take a moment.

In [None]:
display(Markdown("### Building Knowledge Graph..."))

kg_builder = KnowledgeGraphBuilder()

# 1. Get all documents from the vector store
all_docs = rag_engine.get_all_documents_for_kg()
print(f"Building KG from {len(all_docs)} total document chunks.")

# 2. Extract entities and relationships
kg_stats = kg_builder.extract_entities_and_relationships(all_docs)
print("\n--- Knowledge Graph Stats ---")
pjson(kg_stats)

# 3. Visualize the graph
if kg_stats.get('graph_nodes', 0) > 0:
    display(Markdown("### Interactive Knowledge Graph Visualization"))
    fig = kg_builder.visualize_graph_plotly()
    display(fig)
else:
    print("No nodes found for KG visualization.")

## 8. Cleanup

You can now manually delete the `chroma_db_store` and `logs` directories if you wish.

In [None]:
print("Test complete. Remember to delete 'chroma_db_store' and 'logs' if you want a fresh start next time.")