# 📚 GraphRAG MCP Document Processing

Transform your research papers into an intelligent knowledge graph for AI-powered literature review.

## 🎯 What You'll Build

By the end of this notebook, you'll have:
- 🕸️ **Knowledge Graph** - Your documents connected through entities and relationships
- 📊 **Interactive Visualization** - An interactive graph showing how your papers connect to each other
- 💾 **Persistent Storage** - Knowledge graph stored in Neo4j for future use

## 📋 Prerequisites

**Before starting:**
1. **Services running**: Ollama and Neo4j
2. **Documents ready**: PDF papers in a folder
3. **Environment**: GraphRAG MCP toolkit installed

---

## 🚀 Let's Get Started!

## 1. 🔍 System Check

**First, let's verify all required services are running:**

In [None]:
# 🔍 SYSTEM VALIDATION - Run this first!
import subprocess
import os
import time
from pathlib import Path
from IPython.display import display, HTML
from tqdm import tqdm
import json

def check_service(command, service_name, timeout=10):
    """Check if a service is running"""
    try:
        result = subprocess.run(command, shell=True, capture_output=True, text=True, timeout=timeout)
        return result.returncode == 0, result.stdout, result.stderr
    except subprocess.TimeoutExpired:
        return False, "", "Timeout"
    except Exception as e:
        return False, "", str(e)

print("🔍 Checking system prerequisites...")
print("=" * 50)

# Check Neo4j
print("🔧 Checking Neo4j...")
neo4j_ok, _, _ = check_service("curl -f -s http://localhost:7474/", "Neo4j")
if neo4j_ok:
    print("✅ Neo4j is running")
else:
    print("❌ Neo4j not accessible")
    print("   💡 Start with: docker run -d --name neo4j -p 7474:7474 -p 7687:7687 -e NEO4J_AUTH=neo4j/password neo4j:latest")

# Check Ollama
print("\n🔧 Checking Ollama...")
ollama_ok, _, _ = check_service("curl -f -s http://localhost:11434/api/tags", "Ollama")  # -f: treat HTTP errors as failure, as in the Neo4j check
if ollama_ok:
    print("✅ Ollama is running")
else:
    print("❌ Ollama not accessible")
    print("   💡 Start with: ollama serve")

# Check CLI availability
print("\n🔧 Checking GraphRAG MCP CLI...")
cli_ok, _, _ = check_service("graphrag-mcp --help", "CLI")
if cli_ok:
    print("✅ GraphRAG MCP CLI is available")
else:
    print("❌ GraphRAG MCP CLI not found")
    print("   💡 Install with: pip install graphrag-mcp-toolkit")

# Overall status
print("\n" + "=" * 50)
if neo4j_ok and ollama_ok and cli_ok:
    print("🚀 All systems ready! You can proceed to configuration.")
    system_ready = True
else:
    print("⚠️  Please fix the issues above before continuing.")
    system_ready = False
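
The check above shells out to `curl`, which may not be installed everywhere. As a rough alternative, here is a minimal sketch that tests the same ports with Python's standard `socket` module. The helper names `port_open` and `check_ports` are hypothetical. The port numbers come from the commands printed in the cell above. An open port only shows that something is listening, not that the service is healthy.

```python
import socket

# Ports taken from the Neo4j docker command and the Ollama URL used above
SERVICE_PORTS = {"Neo4j HTTP": 7474, "Neo4j Bolt": 7687, "Ollama": 11434}

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts
        return False

def check_ports(host="localhost", ports=None):
    """Map each service name to whether its port accepts connections."""
    ports = SERVICE_PORTS if ports is None else ports
    return {name: port_open(host, port) for name, port in ports.items()}
```

Calling `check_ports()` returns a dictionary keyed by service name, which you could print alongside the `curl` results.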

## 2. 🔧 Configuration

**Set up your project details:**

In [None]:
# 🔧 PROJECT SETTINGS - Customize these for your research
PROJECT_NAME = "literature-assistant"    # Your project name
DOCUMENTS_FOLDER = "../../examples"       # Path to your PDF papers
TEMPLATE = "academic"                     # Template for academic research

# Display settings
print(f"📋 Project Configuration:")
print(f"   🎯 Project Name: {PROJECT_NAME}")
print(f"   📁 Documents Folder: {DOCUMENTS_FOLDER}")
print(f"   🎓 Template: {TEMPLATE}")
print(f"   📍 Working Directory: {os.getcwd()}")

# Check if system validation passed
if 'system_ready' in globals() and system_ready:
    print("\n✅ Configuration complete - system validated")
else:
    print("\n⚠️  Run the system check cell first to validate prerequisites")

## 3. 📄 Document Discovery

**Find your research papers:**

In [None]:
# 📄 DOCUMENT DISCOVERY
print("🔍 Scanning for PDF documents...")

# Find PDF files
doc_path = Path(DOCUMENTS_FOLDER)
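if not doc_path.exists():
    print(f"❌ Folder not found: {DOCUMENTS_FOLDER}")
    print("   💡 Update DOCUMENTS_FOLDER path in the configuration cell")
    # Define the flags here too, so later cells don't raise NameError
    documents_found = False
    document_count = 0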
else:
    pdf_files = sorted(doc_path.rglob("*.pdf"))  # recursive, to match --recursive in the add-documents step
    
    if pdf_files:
        total_size = sum(f.stat().st_size for f in pdf_files) / (1024 * 1024)  # MB
        
        print(f"📊 Found {len(pdf_files)} PDF documents ({total_size:.1f} MB total):")
        print()
        
        for i, pdf_file in enumerate(pdf_files, 1):
            size_mb = pdf_file.stat().st_size / (1024 * 1024)
            print(f"   {i:2d}. 📄 {pdf_file.name} ({size_mb:.1f} MB)")
        
        # Processing time estimate
        est_minutes = len(pdf_files) * 3  # ~3 minutes per document
        print(f"\n⏱️  Estimated processing time: {est_minutes} minutes")
        print("✅ Document discovery complete")
        
        # Store for later use
        documents_found = True
        document_count = len(pdf_files)
        
    else:
        print(f"❌ No PDF files found in {DOCUMENTS_FOLDER}")
        print("   💡 Add PDF files to the folder or update the path")
        documents_found = False
        document_count = 0
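
The cell above matches only the lowercase `.pdf` extension. If your folder mixes `.pdf` and `.PDF`, a case-insensitive scan may be handy. This is a minimal sketch, not part of the toolkit, and the helper names `find_pdfs` and `total_size_mb` are hypothetical.

```python
from pathlib import Path

def find_pdfs(folder, recursive=True):
    """Return a sorted list of PDF paths under folder, ignoring extension case."""
    root = Path(folder)
    if not root.is_dir():
        return []
    candidates = root.rglob("*") if recursive else root.glob("*")
    return sorted(p for p in candidates if p.is_file() and p.suffix.lower() == ".pdf")

def total_size_mb(paths):
    """Sum the file sizes of the given paths, in megabytes."""
    return sum(p.stat().st_size for p in paths) / (1024 * 1024)
```

For example, `find_pdfs(DOCUMENTS_FOLDER, recursive=False)` limits the scan to the top-level folder.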

## 4. 🏗️ Create Project

**Set up your GraphRAG project:**

In [None]:
# 🏗️ PROJECT CREATION - with real-time feedback
print("🔨 Creating GraphRAG project...")

# Check if project already exists
import os
project_dir = os.path.expanduser(f"~/.graphrag-mcp/projects/{PROJECT_NAME}")
if os.path.exists(project_dir):
    print(f"📁 Project '{PROJECT_NAME}' already exists, using --force to overwrite")
    force_flag = "--force"
else:
    print(f"📁 Creating new project '{PROJECT_NAME}'")
    force_flag = ""

print("📊 This may take a moment to:")
print("   - Create project directory")
print("   - Set up template configuration")
print("   - Initialize database connections")
print("   - Validate prerequisites")
print()

# Create project using CLI with real-time output
create_cmd = f"graphrag-mcp create {PROJECT_NAME} --template {TEMPLATE} {force_flag}".strip()
print(f"📝 Command: {create_cmd}")
print("⏳ Running...")

# Run with real-time output
import subprocess
process = subprocess.Popen(create_cmd, shell=True, stdout=subprocess.PIPE, 
                          stderr=subprocess.STDOUT, text=True, bufsize=1)

# Show output in real-time
output_lines = []
while True:
    line = process.stdout.readline()
    if not line and process.poll() is not None:
        break
    if line:
        print(f"   {line.strip()}")
        output_lines.append(line.strip())

# Get final result
return_code = process.wait()

if return_code == 0:
    print("\n✅ Project created successfully!")
    print(f"📁 Project location: ~/.graphrag-mcp/projects/{PROJECT_NAME}")
    print(f"🎓 Using template: {TEMPLATE}")
    project_created = True
    
else:
    print("\n❌ Project creation failed")
    print("Check the output above for details")
    project_created = False
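
The cell above builds the command as a string and runs it through the shell. A variant worth considering is passing an argument list without `shell=True`, so project names with spaces or shell metacharacters are passed through verbatim. The sketch below assumes only the `create`, `--template`, and `--force` arguments already used above. The helper names `build_create_command` and `stream_command` are hypothetical.

```python
import subprocess
import sys

def build_create_command(project, template, force=False):
    """Build the CLI call as an argument list (no shell quoting needed)."""
    cmd = ["graphrag-mcp", "create", project, "--template", template]
    if force:
        cmd.append("--force")
    return cmd

def stream_command(args):
    """Run args without a shell, echo each output line, return (returncode, lines)."""
    lines = []
    with subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                          text=True, bufsize=1) as proc:
        # Iterating over stdout yields lines as the child process writes them
        for line in proc.stdout:
            line = line.rstrip()
            print(f"   {line}")
            lines.append(line)
    # Leaving the with-block waits for the process, so returncode is set here
    return proc.returncode, lines

# Example with a harmless command instead of the real CLI:
# stream_command([sys.executable, "-c", "print('hello')"])
```

In the notebook you would call `stream_command(build_create_command(PROJECT_NAME, TEMPLATE, force=True))` and check the returned code.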

## 5. 📥 Add Documents

**Add your papers to the project:**

In [None]:
# 📥 ADD DOCUMENTS
if globals().get("documents_found") and globals().get("project_created"):  # safe even if earlier cells were skipped
    print("📥 Adding documents to project...")
    
    # Add documents using CLI
    add_cmd = f"graphrag-mcp add-documents {PROJECT_NAME} {DOCUMENTS_FOLDER} --recursive"
    print(f"📝 Command: {add_cmd}")
    
    result = subprocess.run(add_cmd, shell=True, capture_output=True, text=True)
    
    if result.returncode == 0:
        print(f"✅ Successfully added {document_count} documents to project")
        
        if result.stdout:
            print(f"\n📋 Output:")
            print(result.stdout)
            
        documents_added = True
        
    else:
        print("❌ Failed to add documents")
        print(f"Error: {result.stderr}")
        documents_added = False
        
else:
    print("⚠️  Skipping document addition - prerequisites not met")
    documents_added = False

## 6. 🦙 Process Documents

**Transform papers into a knowledge graph (this takes several minutes):**

In [None]:
# 🦙 DOCUMENT PROCESSING - Setup
if globals().get("documents_added"):  # safe even if the previous cell was skipped
    print("🚀 Starting document processing...")
    print("📊 This will:")
    print("   1. 📄 Extract text from PDFs")
    print("   2. 📝 Create text chunks")
    print("   3. 🦙 Extract entities with LLM")
    print("   4. 📚 Parse citations")
    print("   5. 🔗 Build relationships")
    print("   6. 💾 Store in Neo4j + ChromaDB")
    print()
    
    # Set up the command
    process_cmd = f"graphrag-mcp process {PROJECT_NAME}"
    print(f"📝 Command: {process_cmd}")
    print("⏳ This may take several minutes...")
    print("💡 Run the next cell to start processing")
    
    # Make command available for next cell
    processing_ready = True
    
else:
    print("⚠️  Cannot start processing - documents not added")
    print("   Go back and run the document addition cell first")
    processing_ready = False

In [None]:
# 🔄 RUN PROCESSING - Execute the document processing
if 'processing_ready' in globals() and processing_ready:
    print("🚀 Starting document processing...")
    print("📄 Processing PDFs into a knowledge graph...")
    print(f"⏳ This will take roughly {document_count * 3} minutes (~3 minutes per document)...")
    print()
    
    import subprocess
    import logging
    import sys
    import re
    
    # Set cleaner logging
    logging.getLogger('graphrag_mcp').setLevel(logging.WARNING)
    logging.getLogger('httpx').setLevel(logging.WARNING)
    logging.getLogger('neo4j').setLevel(logging.WARNING)
    
    print(f"🔧 Running: {process_cmd}")
    print("📊 Progress milestones:")
    print("-" * 40)
    
    # Run processing with smart filtering
    try:
        process = subprocess.Popen(
            process_cmd, 
            shell=True, 
            stdout=subprocess.PIPE, 
            stderr=subprocess.STDOUT,
            text=True,  # universal_newlines is an older alias for text, so one is enough
            bufsize=1
        )
        
        last_status = ""
        
        while True:
            output = process.stdout.readline()
            if output == '' and process.poll() is not None:
                break
            if output:
                # Strip a leading spinner frame so the status text itself is kept
                line = re.sub(r'^[⠏⠋⠙⠹⠸⠼⠴⠦⠧⠇]\s*', '', output.strip())

                # Skip HTTP request logs
                if 'HTTP Request:' in line or 'INFO:httpx:' in line:
                    continue

                # Show each meaningful status once (this also drops empty lines)
                if line != last_status and len(line) > 5:
                    print(f"  {line}")
                    last_status = line
                    sys.stdout.flush()
        
        # Get final return code
        return_code = process.poll()
        
        print("-" * 40)
        print()
        
        if return_code == 0:
            print("✅ Processing completed successfully!")
            print("📊 Your documents are now in the knowledge graph")
            processing_complete = True
        else:
            print(f"❌ Processing failed (code: {return_code})")
            print("💡 Check the output above for issues")
            processing_complete = False
            
    except KeyboardInterrupt:
        print("\n⏸️ Processing interrupted by user")
        process.terminate()  # stop the CLI subprocess so it doesn't keep running in the background
        processing_complete = False
        
    except Exception as e:
        print(f"\n❌ Exception: {str(e)}")
        processing_complete = False
        
else:
    print("⚠️  Run the previous cell first to set up processing")
    processing_complete = False
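
The filtering rules in the cell above (spinner frames, HTTP logs, repeated or very short lines) can be pulled into a small pure function, which makes them easier to test without running the CLI. This is a minimal sketch and the name `filter_progress` is hypothetical. This variant strips a leading spinner frame and keeps the text after it.

```python
import re

SPINNER = "⠏⠋⠙⠹⠸⠼⠴⠦⠧⠇"
_SPINNER_PREFIX = re.compile(rf"^[{SPINNER}]\s*")

def filter_progress(lines, min_length=6):
    """Yield status lines without spinner prefixes, HTTP logs, or repeats."""
    last = None
    for raw in lines:
        line = _SPINNER_PREFIX.sub("", raw.strip())
        if len(line) < min_length:
            continue  # empty or too short to be a meaningful status
        if "HTTP Request:" in line or "INFO:httpx:" in line:
            continue  # request logging noise
        if line == last:
            continue  # same status as the previous line shown
        last = line
        yield line
```

Inside the read loop you could then feed each stripped line through this generator, or call `list(filter_progress(captured_lines))` on output collected earlier.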

In [None]:
# 📊 CHECK PROCESSING RESULTS
if 'processing_complete' in globals() and processing_complete:
    print("🎉 Processing Results:")
    print("=" * 40)
    
    # Check project status
    import subprocess
    result = subprocess.run(f"graphrag-mcp status {PROJECT_NAME}", shell=True, capture_output=True, text=True)
    
    if result.returncode == 0:
        print("✅ Project status:")
        print(result.stdout)
    else:
        print("⚠️  Status check failed")
        print(result.stderr)
    
    print("\n📊 Your knowledge graph is ready:")
    print("   - Entities extracted and connected")
    print("   - Citations tracked and indexed")
    print("   - Relationships mapped across papers")
    print("   - Data stored in Neo4j + ChromaDB")
    print("\n💡 Ready for visualization in the next cell!")
    
elif 'processing_complete' in globals() and not processing_complete:
    print("❌ Processing was not successful")
    print("   Check the previous cell for error details")
    print("   You may need to fix issues and re-run processing")
    
else:
    print("⚠️  Processing not yet attempted")
    print("   Run the processing cells above first")

## 7. 🕸️ Visualize Knowledge Graph

**See your research papers as an interactive knowledge graph:**

In [None]:
# 🕸️ KNOWLEDGE GRAPH VISUALIZATION
if globals().get("processing_complete"):  # safe even if processing was never run
    print("🎨 Creating knowledge graph visualization...")
    
    # Use the app's visualization function to display yFiles widget
    try:
        import sys
        sys.path.append("../..")
        from graphrag_mcp.visualization.graphiti_yfiles import display_project_knowledge_graph
        
        # Display interactive yFiles visualization
        display_project_knowledge_graph(PROJECT_NAME, max_nodes=20)
        
    except Exception as e:
        print(f"⚠️ Interactive visualization failed: {e}")
        print("🔄 Falling back to CLI command...")
        
        # Fallback to CLI command
        viz_cmd = f"graphrag-mcp visualize {PROJECT_NAME} --max-nodes 20"
        result = subprocess.run(viz_cmd, shell=True, capture_output=True, text=True)
        
        if result.returncode == 0:
            print("✅ CLI visualization completed!")
            if result.stdout:
                print(result.stdout)
        else:
            print("💡 Alternative: Neo4j Browser at http://localhost:7474")
    
    print(f"\n📋 Your Knowledge Graph is Ready:")
    print(f"   🎯 Project: {PROJECT_NAME}")
    print(f"   📄 Documents: {document_count} processed") 
    print(f"   🔴 Entities: see 'graphrag-mcp status {PROJECT_NAME}' for current details")
    print(f"   💾 Storage: Neo4j + ChromaDB")
    
    print(f"\n🚀 Next Steps:")
    print(f"   • Start MCP server: graphrag-mcp serve-universal --template {TEMPLATE}")
    print(f"   • Connect to Claude Desktop for research assistance")
    
else:
    print("⚠️  Skipping visualization - processing not complete")
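
If you want actual node and relationship counts in the summary, one option is to ask Neo4j directly. The sketch below assumes the official `neo4j` Python driver is installed and that the credentials match the `neo4j/password` pair from the docker command in the system check. The function names are hypothetical. The queries count everything in the database, not just this project's data, and make no assumption about how the toolkit labels its nodes.

```python
def count_graph(session):
    """Count all nodes and relationships using generic Cypher queries."""
    nodes = session.run("MATCH (n) RETURN count(n) AS c").single()["c"]
    rels = session.run("MATCH ()-[r]->() RETURN count(r) AS c").single()["c"]
    return {"nodes": nodes, "relationships": rels}

def count_graph_from_neo4j(uri="bolt://localhost:7687", user="neo4j", password="password"):
    """Open a driver session and return the counts (requires the `neo4j` package)."""
    from neo4j import GraphDatabase  # imported lazily so the helper above stays dependency-free
    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        with driver.session() as session:
            return count_graph(session)

# Usage (needs a running Neo4j instance):
# print(count_graph_from_neo4j())
```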

## 8. 📊 Final Status

**Summary of your knowledge graph:**

In [None]:
# 📊 FINAL STATUS
print("📈 Getting final system status...")

# Check project status
status_cmd = f"graphrag-mcp status {PROJECT_NAME}"
print(f"📝 Command: {status_cmd}")

result = subprocess.run(status_cmd, shell=True, capture_output=True, text=True)

if result.returncode == 0:
    print("✅ Project status retrieved successfully")
    if result.stdout:
        print("\n📋 Status Details:")
        print(result.stdout)
else:
    print("❌ Status check failed")
    print(f"Error: {result.stderr}")

# Summary
print("\n" + "=" * 60)

# Only announce success if processing actually finished
if globals().get("processing_complete"):
    print("🎉 KNOWLEDGE GRAPH CREATION COMPLETE!")
    print("=" * 60)
    print("\n📊 Your research papers are now:")
    print("   ✅ Processed into entities and relationships")
    print("   ✅ Stored in persistent Neo4j database")
    print("   ✅ Citations tracked and indexed")
    print("   ✅ Ready for AI-powered queries")
    print("   ✅ Visualized as an interactive graph")
    
    print("\n🚀 Next Steps:")
    print(f"   1. Start MCP server: graphrag-mcp serve {PROJECT_NAME} --transport stdio")
    print("   2. Connect to Claude Desktop for AI assistant")
    print("   3. Use dual-mode tools:")
    print("      • Chat: 'Ask knowledge graph about transformer architectures'")
    print("      • Literature: 'Get facts with citations in APA style'")
    print("   4. Generate literature reviews with tracked citations")
    
    print("\n📚 You now have an intelligent research assistant!")
    print("🎯 Your knowledge graph reveals connections across your entire corpus.")
    
else:
    print("\n⚠️  Some steps were not completed successfully.")
    print("💡 Review the errors above and ensure all prerequisites are met.")
    print("🔄 You can re-run individual cells to retry specific steps.")
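
To connect Claude Desktop, you register the server in its `claude_desktop_config.json` file under the `mcpServers` key. The sketch below only builds and prints such an entry. It reuses the `serve ... --transport stdio` command from the Next Steps printed above, so the arguments are only as correct as that command. The `server_key` value and the function name are hypothetical, and the config file's location depends on your operating system.

```python
import json

def claude_desktop_entry(project_name, server_key="graphrag"):
    """Build an mcpServers entry that launches the CLI over stdio."""
    return {
        "mcpServers": {
            server_key: {
                "command": "graphrag-mcp",
                "args": ["serve", project_name, "--transport", "stdio"],
            }
        }
    }

# Print the JSON so it can be merged into the existing config by hand
print(json.dumps(claude_desktop_entry("literature-assistant"), indent=2))
```

Printing rather than writing the file avoids overwriting servers you may already have configured.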