# GraphRAG-Viz: Glass Box Pipeline Tutorial

This notebook demonstrates the complete Glass Box GraphRAG pipeline with full transparency and interpretability.

## What is GraphRAG?

GraphRAG (Graph Retrieval-Augmented Generation) enhances traditional RAG by:
1. Building a knowledge graph from documents
2. Detecting communities of related entities
3. Using hierarchical summarization for better query answering
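
The three steps above can be sketched with the standard library alone. This is a toy illustration, not the `graphrag_viz` implementation: the triples are hand-written (and deliberately incomplete, so that two groups emerge) rather than extracted by an LLM, connected components stand in for a real community detection algorithm, and `build_graph`, `find_communities`, and `summarize` are hypothetical names.

```python
from collections import defaultdict

# Hand-written (source, relation, target) triples; a real pipeline would extract these.
triples = [
    ("Alice Johnson", "works_at", "TechCorp"),
    ("Bob Smith", "works_at", "TechCorp"),
    ("Deep Learning", "subfield_of", "Machine Learning"),
]

def build_graph(triples):
    """Step 1: build an undirected adjacency map from the triples."""
    graph = defaultdict(set)
    for source, _relation, target in triples:
        graph[source].add(target)
        graph[target].add(source)
    return graph

def find_communities(graph):
    """Step 2 (stand-in): group nodes by connected component."""
    seen, communities = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        communities.append(sorted(component))
    return communities

def summarize(community):
    """Step 3 (stand-in): a real pipeline would ask an LLM for this text."""
    return "Community of: " + ", ".join(community)

graph = build_graph(triples)
communities = find_communities(graph)
summaries = [summarize(members) for members in communities]
for line in summaries:
    print(line)
```

Connected components are used here only because they need no third-party library; they cannot split a densely connected graph into sub-communities the way a modularity-based method can.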

## What Makes This "Glass Box"?

- ✅ Complete provenance tracking
- ✅ Transparent processing at every stage
- ✅ Interpretable query results
- ✅ Interactive visualizations

In [None]:
# Install dependencies (if needed)
# !pip install -r requirements.txt

In [None]:
import os
from graphrag_viz import GraphRAGPipeline, GraphVisualizer, PipelineConfig
import json
from IPython.display import HTML, Image, display

## Step 1: Configure the Pipeline

Set up your OpenAI API key and pipeline parameters.

In [None]:
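# Set your OpenAI API key: replace the placeholder, or export OPENAI_API_KEY
# before starting Jupyter (setdefault leaves an already exported key untouched)
os.environ.setdefault("OPENAI_API_KEY", "your-api-key-here")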

# Configure the pipeline
config = PipelineConfig(
    chunk_size=500,
    chunk_overlap=50,
    openai_model="gpt-3.5-turbo",
    enable_logging=True,
    save_intermediate_results=True,
    output_dir="notebook_output"
)

print("✓ Configuration complete")
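
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sliding-window chunker. It assumes both values are measured in characters, which the configuration above does not confirm (they could equally be tokens), and `chunk_text` is a hypothetical helper rather than part of `graphrag_viz`.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=50):
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

sample = "x" * 1200
print([len(chunk) for chunk in chunk_text(sample)])  # [500, 500, 300]
```

With these settings each window starts 450 characters after the previous one, so the last 50 characters of one chunk reappear at the start of the next. The overlap reduces the chance that a sentence mentioning two entities is cut exactly at a boundary.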

## Step 2: Prepare Sample Documents

Let's use some sample documents about a fictional tech company.

In [None]:
documents = [
    {
        "id": "doc1",
        "text": """
        Alice Johnson is a software engineer at TechCorp, a leading technology company based in San Francisco.
        She specializes in artificial intelligence and machine learning. Alice graduated from MIT with a degree
        in Computer Science. At TechCorp, she works on developing natural language processing systems and 
        collaborates closely with Bob Smith, who is the head of the AI research division.
        """
    },
    {
        "id": "doc2",
        "text": """
        TechCorp was founded in 2010 by Carol Williams and David Brown. The company is headquartered in 
        San Francisco, California, and has offices in New York, London, and Tokyo. TechCorp specializes in
        artificial intelligence solutions for enterprise clients. The company has grown to over 500 employees
        and is known for its innovative approach to machine learning and data analytics.
        """
    },
    {
        "id": "doc3",
        "text": """
        Bob Smith leads the AI research division at TechCorp. He has a PhD in Machine Learning from Stanford
        University and has published numerous papers on deep learning and neural networks. Bob's team focuses
        on developing cutting-edge AI technologies, including natural language processing, computer vision,
        and reinforcement learning. The team collaborates with universities and research institutions worldwide.
        """
    }
]

print(f"Prepared {len(documents)} documents for processing")

## Step 3: Initialize and Run the Pipeline

Process documents through all pipeline stages with full transparency.

In [None]:
# Initialize the pipeline
pipeline = GraphRAGPipeline(config)
print("✓ Pipeline initialized\n")

# Process documents
print("Processing documents through pipeline...")
results = pipeline.process_documents(documents)
print("\n✓ Processing complete!")
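
One way a pipeline could record what each stage did is to wrap every stage in a timer and log its input and output sizes. The sketch below is hypothetical: `run_with_trace` and the two toy stages are not part of `graphrag_viz`, and only the field names (`step`, `name`, `input`, `output`, `duration_seconds`) are borrowed from the keys this notebook reads from `results['execution_trace']`.

```python
import time

def run_with_trace(stages, data):
    """Run each (name, function) stage in order and record what it consumed and produced."""
    steps = []
    for number, (name, func) in enumerate(stages, start=1):
        started = time.perf_counter()
        result = func(data)
        steps.append({
            "step": number,
            "name": name,
            "input": f"{len(data)} items",
            "output": f"{len(result)} items",
            "duration_seconds": time.perf_counter() - started,
        })
        data = result  # the next stage consumes this stage's output
    return data, {"steps": steps}

def split_sentences(docs):
    """Toy stand-in for chunking: split each document on periods."""
    return [s.strip() for doc in docs for s in doc.split(".") if s.strip()]

def capitalized_words(sentences):
    """Toy stand-in for entity extraction: keep capitalized words."""
    return sorted({w for s in sentences for w in s.split() if w[0].isupper()})

stages = [("Split sentences", split_sentences), ("Keep capitalized words", capitalized_words)]
final, trace = run_with_trace(stages, ["Alice works at TechCorp. Bob leads research."])
for step in trace["steps"]:
    print(step["step"], step["name"], step["input"], "->", step["output"])
```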

## Step 4: Examine Pipeline Results (Glass Box Transparency)

Let's examine what happened at each stage of the pipeline.

In [None]:
# Display execution trace
print("📊 Pipeline Execution Summary")
print("=" * 70)
print(f"Pipeline ID: {results['pipeline_id']}")
print("\nGraph Statistics:")
print(f"  - Entities: {results['graph_statistics']['num_nodes']}")
print(f"  - Relationships: {results['graph_statistics']['num_edges']}")
print(f"  - Communities: {results['community_structure']['num_communities']}")
print(f"  - Graph Density: {results['graph_statistics']['density']:.4f}")
print(f"  - Is Connected: {results['graph_statistics']['is_connected']}")

print("\n🔍 Execution Trace (Glass Box):")
for step in results['execution_trace']['steps']:
    print(f"\n{step['step']}. {step['name']}")
    print(f"   Input: {step['input']}")
    print(f"   Output: {step['output']}")
    print(f"   Duration: {step['duration_seconds']:.2f}s")
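
For reference, the density reported above is conventionally the fraction of possible edges that are actually present. The sketch assumes an undirected simple graph, for which density is 2m / (n(n - 1)) with n nodes and m edges. Whether `graphrag_viz` builds an undirected or a directed graph is not stated in this notebook, so treat `graph_density` as an illustrative helper only.

```python
def graph_density(num_nodes, num_edges):
    """Density of an undirected simple graph: edges present / edges possible."""
    if num_nodes < 2:
        return 0.0  # no pair of nodes exists, so no edge is possible
    return 2 * num_edges / (num_nodes * (num_nodes - 1))

# A tree on 10 nodes has 9 edges out of 45 possible.
print(f"{graph_density(10, 9):.4f}")  # 0.2000
```

A low density together with `is_connected` being true usually indicates a sparse, tree-like graph, which is common when entities are extracted from only a few short documents.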

## Step 5: Explore Communities

Let's see what communities were detected and their summaries.

In [None]:
print("🏘️ Community Summaries")
print("=" * 70)

for comm_id, summary in results['summaries'].items():
    print(f"\nCommunity {comm_id}:")
    print(f"  Entities: {summary['num_entities']}")
    print(f"  Key entities: {', '.join(summary['key_entities'][:5])}")
    print(f"  Summary: {summary['summary']}")
    print("-" * 70)
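
How the `key_entities` list is chosen is not shown in this notebook. One plausible approach, sketched below purely as an assumption, is to rank a community's members by degree, that is, by how many relationships inside the community touch each entity. `rank_key_entities` and the sample edges are hypothetical.

```python
from collections import Counter

def rank_key_entities(edges, members, limit=5):
    """Rank a community's members by the number of intra-community edges touching them."""
    degree = Counter()
    for source, target in edges:
        if source in members and target in members:
            degree[source] += 1
            degree[target] += 1
    # Highest degree first; ties are broken alphabetically for stable output.
    ranked = sorted(members, key=lambda name: (-degree[name], name))
    return ranked[:limit]

edges = [
    ("Alice Johnson", "TechCorp"),
    ("Bob Smith", "TechCorp"),
    ("Alice Johnson", "Bob Smith"),
    ("Carol Williams", "TechCorp"),
]
members = {"Alice Johnson", "Bob Smith", "Carol Williams", "TechCorp"}
print(rank_key_entities(edges, members, limit=2))  # ['TechCorp', 'Alice Johnson']
```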

## Step 6: Query the Knowledge Graph

Now let's ask questions and see how the system answers with full provenance.

In [None]:
questions = [
    "Who works at TechCorp?",
    "What is Alice Johnson's role?",
    "Where is TechCorp located?",
    "What are the main focus areas of TechCorp?"
]

for question in questions:
    print("\n" + "=" * 70)
    print(f"❓ Question: {question}")
    print("=" * 70)
    
    answer_result = pipeline.query(question, top_k=2)
    
    print(f"\n💡 Answer: {answer_result['answer']}")
    
    print("\n🔍 Provenance (Glass Box):")
    print(f"  Relevant Communities: {[c['community_id'] for c in answer_result['provenance']['relevant_communities']]}")
    print(f"  Entities Referenced: {answer_result['provenance']['entities_referenced'][:5]}")
    print(f"  Model Used: {answer_result['provenance']['model']}")
    print(f"  Tokens Used: {answer_result['provenance']['tokens_used']}")
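
The `top_k` argument suggests that only the most relevant communities are passed to the model. How `graphrag_viz` scores relevance is not shown in this notebook, so the sketch below assumes a simple word-overlap score purely for illustration (a production system would more likely compare embeddings). `select_communities` and the sample summaries are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def select_communities(question, summaries, top_k=2):
    """Rank community summaries by word overlap with the question and keep the top_k."""
    question_words = tokenize(question)
    scored = [
        {"community_id": cid, "score": len(question_words & tokenize(text))}
        for cid, text in summaries.items()
    ]
    # Highest score first; ties are broken by community id for deterministic output.
    scored.sort(key=lambda item: (-item["score"], item["community_id"]))
    return scored[:top_k]

summaries = {
    0: "Alice Johnson and Bob Smith work on AI research at TechCorp.",
    1: "TechCorp was founded by Carol Williams and David Brown in San Francisco.",
    2: "Stanford University and MIT are universities.",
}
selected = select_communities("Who founded TechCorp?", summaries)
print([c["community_id"] for c in selected])  # [1, 0]
```

Keeping the score next to each community id is what makes the selection inspectable: a reader can see not only which communities were used but why they outranked the others.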

## Step 7: Trace Answer Provenance

For any answer, we can trace it back to the source documents.

In [None]:
# Re-run the first question to get its detailed provenance
question = "Who works at TechCorp?"
answer_result = pipeline.query(question)

print("📍 Complete Provenance Trace")
print("=" * 70)

full_provenance = answer_result['full_provenance']

print("\nEntities used in answer:")
for entity in full_provenance['entities_used'][:5]:
    print(f"\n  - {entity['name']} ({entity['type']})")
    print(f"    Description: {entity['description'][:100]}...")
    print(f"    Source chunks: {entity['source_chunks']}")

print(f"\nAll source chunks involved: {full_provenance['source_chunks']}")
print(f"Communities involved: {full_provenance['communities_involved']}")
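
The last hop, from chunks back to the original documents, can be sketched as a simple roll-up. The entity fields mirror the ones printed above, but the chunk id format and the `chunk_to_doc` mapping are assumptions made for illustration, and `trace_to_documents` is a hypothetical helper.

```python
def trace_to_documents(entities_used, chunk_to_doc):
    """Collect the chunks behind the entities, then map each chunk to its document."""
    source_chunks = sorted(
        {chunk for entity in entities_used for chunk in entity["source_chunks"]}
    )
    source_documents = sorted({chunk_to_doc[chunk] for chunk in source_chunks})
    return {"source_chunks": source_chunks, "source_documents": source_documents}

# Hypothetical entities and chunk ids, shaped like the provenance printed above.
entities_used = [
    {"name": "Alice Johnson", "type": "PERSON", "source_chunks": ["doc1_chunk0"]},
    {"name": "TechCorp", "type": "ORGANIZATION", "source_chunks": ["doc1_chunk0", "doc2_chunk0"]},
]
chunk_to_doc = {"doc1_chunk0": "doc1", "doc2_chunk0": "doc2"}

trace = trace_to_documents(entities_used, chunk_to_doc)
print(trace["source_documents"])  # ['doc1', 'doc2']
```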

## Step 8: Create Visualizations

Generate interactive visualizations of the knowledge graph.

In [None]:
# Initialize visualizer
visualizer = GraphVisualizer()

# Create output directory (os was imported in the first code cell)
os.makedirs("notebook_visualizations", exist_ok=True)

# Generate visualizations
print("Creating visualizations...")

# 1. Interactive graph
graph_file = visualizer.visualize_graph(
    graph=pipeline.graph,
    partition=pipeline.partition,
    output_file="notebook_visualizations/knowledge_graph.html"
)
print(f"✓ Knowledge graph: {graph_file}")

# 2. Pipeline summary
summary_file = visualizer.create_pipeline_summary_visualization(
    pipeline_results=results,
    output_file="notebook_visualizations/pipeline_summary.html"
)
print(f"✓ Pipeline summary: {summary_file}")

# 3. Community distribution
visualizer.plot_community_distribution(
    communities=pipeline.communities,
    output_file="notebook_visualizations/community_distribution.png"
)
print("✓ Community distribution plot")

# 4. Entity type distribution
visualizer.plot_entity_type_distribution(
    graph=pipeline.graph,
    output_file="notebook_visualizations/entity_types.png"
)
print("✓ Entity type distribution plot")

print("\n✅ All visualizations created!")

## Display Visualizations

In [None]:
# Display community distribution
print("Community Size Distribution:")
display(Image(filename="notebook_visualizations/community_distribution.png"))

In [None]:
# Display entity type distribution
print("Entity Type Distribution:")
display(Image(filename="notebook_visualizations/entity_types.png"))

In [None]:
# Link to interactive graph (open in browser)
print("🌐 Interactive Knowledge Graph:")
print("Open this file in your browser: notebook_visualizations/knowledge_graph.html")
print("\n🌐 Pipeline Summary:")
print("Open this file in your browser: notebook_visualizations/pipeline_summary.html")
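
The relative paths printed above may not be clickable in every notebook front end. A small sketch using only the standard library converts them to absolute `file://` URIs that can be pasted into a browser. `as_file_uri` is a hypothetical helper, and it does not check that the files exist.

```python
from pathlib import Path

def as_file_uri(relative_path):
    """Resolve a path against the current working directory and return a file:// URI."""
    return Path(relative_path).resolve().as_uri()

for name in ("knowledge_graph.html", "pipeline_summary.html"):
    print(as_file_uri(f"notebook_visualizations/{name}"))
```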

## Summary: Glass Box Advantages

This Glass Box implementation provides:

1. **Complete Transparency**: Every entity traces back to source text
2. **Execution Tracking**: Full logs of all pipeline steps
3. **Interpretable Results**: Answers include provenance information
4. **Interactive Exploration**: Visualize the knowledge graph structure
5. **Debuggability**: Understand exactly how decisions were made

This makes it ideal for:
- Research and education
- Compliance and auditing
- System debugging and optimization
- Understanding AI decision-making processes