# RAG vs. GraphRAG: A Side-by-Side Comparison

This notebook demonstrates the difference between **Standard RAG** (Vector-based) and **GraphRAG** (Graph-based) using the `semantica` framework. We will build both systems from the same short sample text and compare what each one retrieves.

## What is the difference?
- **Standard RAG**: Retrieves text chunks ranked by semantic similarity (vector distance) to the query. Works well when the answer sits in a chunk that resembles the query, but can miss related facts stated elsewhere.
- **GraphRAG**: Retrieves information by traversing a Knowledge Graph (nodes and edges). Excellent for multi-hop reasoning and understanding relationships between entities.
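
To make the contrast concrete, here is a minimal standard-library sketch. The bag-of-words vectors are a crude stand-in for real embeddings, and the three chunks and triples are hand-written for illustration; none of this uses the `semantica` API.

```python
import math
import re
from collections import Counter

def bow(text):
    """Bag-of-words counts: a crude, hypothetical stand-in for a learned embedding."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hand-written toy data (assumption, not extracted by any model)
chunks = [
    "AI improves diagnosis.",
    "Diagnosis relies on medical imaging.",
    "Medical imaging raises privacy concerns.",
]
triples = [
    ("AI", "improves", "diagnosis"),
    ("diagnosis", "relies on", "medical imaging"),
    ("medical imaging", "raises", "privacy concerns"),
]

query = "How is AI linked to privacy?"

# Vector-style retrieval: rank chunks by similarity to the query
ranked = sorted(chunks, key=lambda c: cosine(bow(query), bow(c)), reverse=True)
top_chunk = ranked[0]

# Graph-style retrieval: follow edges outward from the entity "AI"
edges = {head: (relation, tail) for head, relation, tail in triples}
path, node = [], "AI"
while node in edges:
    relation, tail = edges[node]
    path.append(f"{node} --[{relation}]--> {tail}")
    node = tail

print("Top chunk:", top_chunk)
print("Graph path:", *path, sep="\n  ")
```

The top-ranked chunk mentions AI but says nothing about privacy, while the edge walk recovers the full chain that connects the two.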

## Workflow
1. **Ingest Data**: Load a short technical text (defined inline in this notebook).
2. **Build Standard RAG**: Chunk text -> Embeddings -> Vector Store -> Similarity Search.
3. **Build GraphRAG**: Extract Entities/Relations -> Knowledge Graph -> Graph Traversal.
4. **Compare Results**: Ask the same complex question to both systems.

In [None]:
# Install Semantica and the other libraries this notebook uses
!pip install semantica networkx matplotlib plotly sentence-transformers

In [None]:
import os
import networkx as nx
import matplotlib.pyplot as plt
import numpy as np

# Import Semantica core modules
from semantica.core import Semantica
from semantica.ingest import WebIngestor  # not used below; the sample text is defined inline
from semantica.split import TextSplitter  # text chunking
from semantica.vector_store import FAISSStore  # vector index
from semantica.embeddings import EmbeddingGenerator  # text embeddings
from semantica.semantic_extract import NamedEntityRecognizer, RelationExtractor
from semantica.kg import GraphBuilder
from semantica.visualization import KGVisualizer  # knowledge graph plots

# Initialize Semantica
semantica = Semantica()

## 1. Data Ingestion
We'll use a text about **"The Impact of AI on Healthcare"**. This topic has rich relationships (AI -> improves -> Diagnosis, Privacy -> challenges -> AI).

In [None]:
# Sample text used as the data source for both pipelines
text_content = """
Artificial Intelligence (AI) is revolutionizing healthcare by enabling early diagnosis and personalized treatment plans.
Machine Learning algorithms analyze medical imaging to detect anomalies like tumors faster than human radiologists.
For instance, DeepMind's AlphaFold has solved the protein folding problem, accelerating drug discovery.
However, the integration of AI in healthcare faces significant challenges, primarily concerning patient data privacy and algorithmic bias.
Regulatory bodies like the FDA are establishing guidelines to ensure AI tools are safe and effective.
Unlike traditional software, AI systems can adapt and learn, which complicates validation processes.
Hospitals using predictive analytics have seen a 30% reduction in patient readmission rates.
Yet, cybersecurity threats remain a critical risk for connected medical devices.
"""

print(f"Loaded text with {len(text_content)} characters.")

## 2. Standard RAG (Vector Search)
We will chunk the text using `TextSplitter`, create embeddings with `EmbeddingGenerator`, and store them in a `FAISSStore`.
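
As a rough picture of what these steps do, the sketch below implements overlapping chunking and exhaustive L2 search in plain Python. It is a simplification under stated assumptions: `chunk_text` uses fixed character windows (the `recursive` method of `TextSplitter` may split differently), and `l2_search` is a brute-force scan of the kind a flat index performs. Both function names are hypothetical.

```python
def chunk_text(text, chunk_size=100, chunk_overlap=20):
    """Split text into fixed-size character windows; neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once the remaining tail is already covered by the previous window
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - chunk_overlap, 1), step)]

def l2_search(query_vec, vectors, k=2):
    """Return the indices of the k vectors with the smallest squared L2 distance to query_vec."""
    distances = [
        (sum((q - x) ** 2 for q, x in zip(query_vec, vec)), index)
        for index, vec in enumerate(vectors)
    ]
    return [index for _, index in sorted(distances)[:k]]

# Toy usage with made-up data
sample = "".join(chr(97 + i % 26) for i in range(250))
pieces = chunk_text(sample, chunk_size=100, chunk_overlap=20)
print([len(p) for p in pieces])

toy_vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
print(l2_search([0.9, 0.9], toy_vectors, k=2))
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, as long as it is shorter than the overlap.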

In [None]:
# 1. Split Text using Semantica's TextSplitter
splitter = TextSplitter(method="recursive", chunk_size=100, chunk_overlap=20)
chunks = splitter.split(text_content)

# Extract text from chunk objects
chunk_texts = [chunk.text for chunk in chunks]
print(f"Created {len(chunks)} chunks.")

# 2. Initialize Vector Store
vector_store = FAISSStore(dimension=384)  # all-MiniLM-L6-v2 produces 384-dimensional embeddings
vector_store.create_index(index_type="flat", metric="L2")

# 3. Generate Embeddings using Semantica's EmbeddingGenerator
# This uses local Sentence Transformers by default
generator = EmbeddingGenerator(text={"method": "sentence_transformers", "model_name": "all-MiniLM-L6-v2"})
embeddings = generator.generate_embeddings(chunk_texts, data_type="text")

# 4. Store Vectors
vector_store.add_vectors(embeddings, ids=[f"chunk_{i}" for i in range(len(chunks))])

# 5. Define a search function
def standard_rag_search(query, k=2):
    """Return the text of the k chunks closest to the query in embedding space."""
    # Embed the query with the same model used for the chunks
    query_vec = generator.generate_embeddings(query, data_type="text")

    # Nearest-neighbour search in the vector store
    results = vector_store.search_similar(query_vec, k=k)

    # Map IDs of the form "chunk_<i>" back to the chunk text
    retrieved_texts = []
    for res in results:
        idx = int(res['id'].split('_')[1])
        retrieved_texts.append(chunk_texts[idx])
    return retrieved_texts

print("Standard RAG Pipeline Built.")

## 3. GraphRAG (Knowledge Graph)
Now we extract entities and relationships to build a structured graph using `GraphBuilder` and visualize it with `KGVisualizer`.
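
Before relying on the extractors, it can help to see the target data structure in isolation. The sketch below builds a directed graph from `(head, relation, tail)` triples and collects the facts within a given number of hops using breadth-first search. The triples are hand-written from the sample text for illustration, and `build_graph` and `facts_within` are hypothetical helpers, not part of `semantica`.

```python
from collections import defaultdict, deque

def build_graph(triples):
    """Map each head entity to its outgoing (relation, tail) edges."""
    graph = defaultdict(list)
    for head, relation, tail in triples:
        graph[head].append((relation, tail))
    return graph

def facts_within(graph, start, hops=1):
    """Breadth-first traversal that returns every edge leaving a node closer than `hops` to start."""
    seen = {start}
    queue = deque([(start, 0)])
    facts = []
    while queue:
        node, depth = queue.popleft()
        if depth == hops:
            continue  # depth limit reached; do not expand this node
        for relation, tail in graph.get(node, []):
            facts.append(f"{node} --[{relation}]--> {tail}")
            if tail not in seen:
                seen.add(tail)
                queue.append((tail, depth + 1))
    return facts

# Hand-written triples (assumption: a real extractor may produce different ones)
toy_triples = [
    ("Machine Learning", "analyzes", "medical imaging"),
    ("medical imaging", "reveals", "tumors"),
    ("FDA", "establishes", "guidelines"),
]
toy_graph = build_graph(toy_triples)

print(facts_within(toy_graph, "Machine Learning", hops=1))
print(facts_within(toy_graph, "Machine Learning", hops=2))
```

Raising `hops` from 1 to 2 is what turns a direct lookup into multi-hop retrieval: the second call also returns the fact about tumors, which is not attached to the starting entity.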

In [None]:
# 1. Extract Entities and Relations
ner = NamedEntityRecognizer()
rel_extractor = RelationExtractor()

entities = ner.extract_entities(text_content)
relations = rel_extractor.extract_relations(text_content, entities=entities)

print(f"Extracted {len(entities)} entities and {len(relations)} relationships.")

# 2. Build Graph using Semantica's GraphBuilder
builder = GraphBuilder()
kg = builder.build([{"entities": entities, "relationships": relations}])

# 3. Visualize Graph using Semantica's KGVisualizer
viz = KGVisualizer()
fig = viz.visualize_network(kg, output="interactive")
fig.show()

# Fallback static visualization if interactive fails in some environments
# plt.figure(figsize=(10, 8))
# pos = nx.spring_layout(kg, k=0.5)
# nx.draw(kg, pos, with_labels=True, node_color='lightblue', node_size=2000, font_size=10)
# edge_labels = nx.get_edge_attributes(kg, 'relation')
# nx.draw_networkx_edge_labels(kg, pos, edge_labels=edge_labels)
# plt.show()

In [None]:
# 4. Define Graph Search Function
def graph_rag_search(start_entity, hops=1):
    """Return the facts (edges) within `hops` steps of start_entity.

    Assumes `kg` is a networkx-compatible graph whose edges carry a 'relation' attribute.
    """
    if start_entity not in kg:
        # Fall back to a case-insensitive substring match on node names
        for node in kg.nodes():
            if start_entity.lower() in str(node).lower():
                start_entity = node
                break
        else:
            return [f"Entity '{start_entity}' not found in graph."]

    # Collect every node reachable within `hops` steps, then take the induced subgraph
    subgraph_nodes = list(nx.bfs_tree(kg, source=start_entity, depth_limit=hops))
    subgraph = kg.subgraph(subgraph_nodes)

    # Render each edge as a readable fact
    facts = []
    for u, v, data in subgraph.edges(data=True):
        relation = data.get('relation', 'related to')
        facts.append(f"{u} --[{relation}]--> {v}")

    return facts

## 4. Comparison Results
Let's compare what each system retrieves for the query: **"What are the risks associated with AI?"**
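
Note that `graph_rag_search` takes an entity name rather than a free-text question, so the query has to be mapped to a graph node first. The comparison cell does this by hand. A minimal sketch of automating that step with an alias table and whole-word name matching might look like the following; the `ALIASES` table and `link_query_to_nodes` are hypothetical and not part of `semantica`.

```python
import re

# Hypothetical alias table: abbreviation -> canonical node name
ALIASES = {"ai": "Artificial Intelligence", "ml": "Machine Learning"}

def link_query_to_nodes(query, nodes, aliases=ALIASES):
    """Return graph nodes mentioned in the query, by alias or by whole-word name match."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    padded_query = " " + " ".join(tokens) + " "
    linked = []

    # 1. Abbreviations that expand to a known node
    for token in tokens:
        node = aliases.get(token)
        if node in nodes and node not in linked:
            linked.append(node)

    # 2. Node names that appear verbatim (case-insensitive, whole words only)
    for node in nodes:
        name = " ".join(re.findall(r"[a-z0-9]+", node.lower()))
        if name and f" {name} " in padded_query and node not in linked:
            linked.append(node)

    return linked

toy_nodes = ["Artificial Intelligence", "FDA", "Machine Learning"]
print(link_query_to_nodes("What are the risks associated with AI?", toy_nodes))
```

Each linked node could then be passed to `graph_rag_search` as the starting entity.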

In [None]:
query = "What are the risks associated with AI?"

print(f"Query: {query}\n")

print("--- Standard RAG Results (Vector Similarity) ---")
rag_results = standard_rag_search(query)
for i, res in enumerate(rag_results):
    print(f"{i+1}. {res}...")

print("\n--- GraphRAG Results (Graph Traversal) ---")
# The graph search needs a starting node, so "AI" is mapped by hand to "Artificial Intelligence";
# a full system would perform this entity linking automatically
graph_results = graph_rag_search("Artificial Intelligence", hops=2)
for i, res in enumerate(graph_results):
    print(f"{i+1}. {res}")

## Conclusion

| Feature | Standard RAG | GraphRAG |
|---------|--------------|----------|
| **Mechanism** | Semantic Similarity (Vector) | Graph Traversal (Structural) |
| **Strengths** | Fast, good for general queries | Captures explicit relationships, multi-hop reasoning |
| **Weaknesses** | May miss related facts spread across separate chunks | Requires entity extraction & graph construction overhead |
| **Best For** | Fact retrieval | Complex reasoning & relationship mapping |

**Semantica** allows you to combine both into a **Hybrid RAG** system that pairs the broad recall of vector search with the relationship awareness of graph traversal.
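
One simple way to picture such a combination is to interleave the outputs of the two retrievers into a single context for the language model. The sketch below is a generic illustration only: `build_hybrid_context` is a hypothetical helper and does not reflect how Semantica implements hybrid retrieval.

```python
from itertools import zip_longest

def build_hybrid_context(chunks, facts, max_items=6):
    """Interleave vector-retrieved chunks and graph facts, dropping duplicates."""
    merged, seen = [], set()
    for chunk, fact in zip_longest(chunks, facts):
        for item in (chunk, fact):
            if item is not None and item not in seen:
                seen.add(item)
                merged.append(item)
    # Cap the context so it stays within a prompt budget
    return "\n".join(f"- {item}" for item in merged[:max_items])

# Toy inputs standing in for standard_rag_search(...) and graph_rag_search(...) output
toy_chunks = ["chunk about privacy", "chunk about bias"]
toy_facts = ["AI --[faces]--> privacy challenges", "FDA --[establishes]--> guidelines"]
print(build_hybrid_context(toy_chunks, toy_facts))
```

Interleaving keeps the top result from each retriever near the start of the context, so neither source is cut off entirely when `max_items` is small.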