# 📚 GraphRAG MCP Document Processing

Transform your PDF documents into an intelligent AI research assistant!

## 🎯 What You'll Build
By the end of this notebook, you'll have:
- 🕸️ **Knowledge Graph** - Your documents transformed into connected concepts
- 🤖 **AI Research Assistant** - Chat with your documents via Claude Desktop
- 📊 **Analytics** - See what entities and citations were extracted
- 🔄 **Dual-Mode Tools** - Both conversational chat and formal literature review
- 🎨 **Interactive Visualization** - See your knowledge graph in action

## 📋 Prerequisites (5-minute setup)

**Before running this notebook, make sure you have:**

### 1. **Environment Setup**
```bash
# Activate your Python environment
source graphrag-env/bin/activate  # or your environment name

# Install dependencies
uv pip install -r requirements.txt

# Install visualization libraries
pip install plotly networkx
```

### 2. **Ollama (AI Models)**
```bash
# Install Ollama
brew install ollama  # macOS
# or download from https://ollama.com

# Download required models
ollama pull llama3.1:8b
ollama pull nomic-embed-text

# Start Ollama server (keep this running)
ollama serve
```

### 3. **Neo4j Database**
```bash
# Start Neo4j with Docker (keep this running)
docker run -d --name neo4j -p 7474:7474 -p 7687:7687 -e NEO4J_AUTH=neo4j/password neo4j:latest

# Verify it's running
curl -f http://localhost:7474/
```

### 4. **Your Documents**
- 📄 **PDFs ready**: Put your research papers in a folder
- 📁 **Folder path**: Know where your documents are located
- 📊 **5-20 documents**: A good size for a first test run (larger batches mean longer processing times)

## 🚨 Troubleshooting
If you see errors below, check:
- ✅ **Ollama running**: `curl -s http://localhost:11434/api/tags`
- ✅ **Neo4j running**: `curl -f http://localhost:7474/`
- ✅ **Models installed**: `ollama list`
- ✅ **Python environment**: `which python`
- ✅ **Plotly installed**: `python -c "import plotly, networkx"` (if this fails, run `pip install plotly networkx`)
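
The two service checks above can also be run from Python, for example in a notebook cell. This is a minimal sketch using only the standard library; `service_is_up` is a hypothetical helper, not part of `processing_utils`, and the URLs are the ones listed above.

```python
import urllib.error
import urllib.request


def service_is_up(url: str, timeout: float = 3.0) -> bool:
    """Return True if an HTTP GET to `url` succeeds with a 2xx/3xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error status, or malformed URL
        return False


if __name__ == "__main__":
    checks = {
        "Ollama": "http://localhost:11434/api/tags",
        "Neo4j": "http://localhost:7474/",
    }
    for name, url in checks.items():
        status = "running" if service_is_up(url) else "not reachable"
        print(f"{name}: {status}")
```

A "not reachable" result only says that the HTTP endpoint did not answer; it does not tell you whether the required models are installed, so `ollama list` is still worth running.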

---

## 🏃‍♀️ Quick Start: Run All Cells
1. **Update settings** in the Setup cell below
2. **Run all cells** (Cell → Run All)
3. **Wait for processing** (progress bars will show)
4. **View your knowledge graph** visualization
5. **Follow next steps** for Claude Desktop integration

**⏱️ Processing Time:** ~2-10 minutes per document
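
As a worked example of that range, the sketch below turns the 2 to 10 minutes per document figure into a total estimate for a batch. `estimate_processing_minutes` is a hypothetical helper for illustration only; actual times depend on document length and hardware.

```python
def estimate_processing_minutes(num_documents: int,
                                min_per_doc: float = 2.0,
                                max_per_doc: float = 10.0) -> tuple[float, float]:
    """Return (best case, worst case) total minutes for a batch."""
    if num_documents < 0:
        raise ValueError("num_documents must be non-negative")
    return num_documents * min_per_doc, num_documents * max_per_doc


low, high = estimate_processing_minutes(20)
print(f"20 documents: roughly {low:.0f} to {high:.0f} minutes")
```

For a 20-document batch this gives a range of 40 to 200 minutes, which is why a smaller batch is a sensible first run.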

---

## 🔍 Prerequisites Check

**🚨 Let's verify all required services are running properly:**

In [None]:
# 🔍 Run comprehensive prerequisites check
from processing_utils import check_prerequisites

# This will check all services, packages, and configurations
result = check_prerequisites()

if result["status"] == "passed":
    print("\n🚀 All systems ready! You can proceed to the next cell.")
else:
    print(f"\n⚠️  Found {len(result['issues'])} issues that need to be fixed:")
    for i, issue in enumerate(result['issues'], 1):
        print(f"   {i}. ❌ {issue}")

    print("\n🛑 Please fix these issues before continuing!")
    print("💡 Refer to the detailed output above for specific solutions.")

## 🚀 Setup & Configuration

**📝 Customize these settings for your project:**

In [None]:
from processing_utils import check_prerequisites, quick_setup

# ⚙️ CUSTOMIZE THESE SETTINGS FOR YOUR PROJECT:
PROJECT_NAME = "my-research"        # 📝 Change this to your project name
DOCUMENTS_FOLDER = "../../examples"    # 📁 Change to your documents folder path

# Examples of document folder paths:
# DOCUMENTS_FOLDER = "~/Documents/research-papers"
# DOCUMENTS_FOLDER = "./my-pdfs"
# DOCUMENTS_FOLDER = "/Users/yourname/Desktop/papers"
# DOCUMENTS_FOLDER = "../../examples"  # Default: examples folder in main project

print("🔧 Checking prerequisites...")
check_prerequisites()

print("\n🏗️  Initializing GraphRAG processor...")
processor = quick_setup(PROJECT_NAME, DOCUMENTS_FOLDER)
print(f"✅ Ready to process project: {PROJECT_NAME}")
print(f"📁 Looking for documents in: {DOCUMENTS_FOLDER}")

## 📄 Document Discovery

**🔍 Finding PDFs in your folder...**

In [None]:
# 🔍 Scan for PDF documents
documents = processor.discover_documents(DOCUMENTS_FOLDER)

if documents:
    print(f"\n📊 Found {len(documents)} PDF documents:")
    total_size = sum(doc.size_mb for doc in documents)
    est_time = len(documents) * 5  # Rough estimate: 5 minutes per document

    print(f"   📏 Total size: {total_size:.2f} MB")
    print(f"   ⏱️  Estimated processing time: {est_time} minutes")
    print("\n📋 Document list:")

    for i, doc in enumerate(documents, 1):
        print(f"   {i:2d}. 📄 {doc.name} ({doc.size_mb:.2f} MB)")

    print(f"\n✅ Ready to process {len(documents)} documents")

else:
    print("❌ No PDF documents found!")
    print(f"   📁 Checked folder: {DOCUMENTS_FOLDER}")
    print("   💡 Make sure your PDF files are in the correct folder")
    print("   🔧 Try updating DOCUMENTS_FOLDER path in the cell above")
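
For intuition, a folder scan of this kind might look like the sketch below. This is not the actual `discover_documents` implementation, which is not shown in this notebook; `PdfInfo` and `find_pdfs` are hypothetical names, and only the `name` and `size_mb` fields mirror the attributes used in the cell above.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class PdfInfo:
    """Hypothetical record mirroring the `name` and `size_mb` attributes used above."""
    name: str
    path: Path
    size_mb: float


def find_pdfs(folder: str) -> list[PdfInfo]:
    """Return one record per PDF directly inside `folder`, sorted by path."""
    root = Path(folder).expanduser()  # Expand "~" in paths such as ~/Documents/papers
    if not root.is_dir():
        return []
    found = []
    for path in sorted(root.iterdir()):
        # Match the extension case-insensitively so "Paper.PDF" is included
        if path.is_file() and path.suffix.lower() == ".pdf":
            size_mb = path.stat().st_size / (1024 * 1024)
            found.append(PdfInfo(name=path.name, path=path, size_mb=size_mb))
    return found
```

Calling `find_pdfs("./my-pdfs")` on a folder that does not exist returns an empty list, which corresponds to the "No PDF documents found" branch above.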

## 🔄 Processing Documents

**🚀 This will process all documents and build the knowledge graph...**

*You'll see progress bars as documents are processed. This may take several minutes.*

In [None]:
# 🚀 Process all documents with progress tracking
if documents:
    print(f"🔄 Starting processing of {len(documents)} documents...")
    print("⏳ Please wait - this may take several minutes...")

    # Process documents (this will show progress bars)
    results = await processor.process_documents(documents)

    # Show results summary
    print("\n📊 Processing Complete!")
    print(f"   ✅ Successfully processed: {results['success']}/{len(documents)} documents")

    if results['failed'] > 0:
        print(f"   ❌ Failed: {results['failed']} documents")
        print("   🔄 You can retry failed documents in the 'Error Recovery' section below")

    print(f"   ⏱️  Total processing time: {results['total_time']/60:.1f} minutes")
    print(f"   ⚡ Average per document: {results['avg_time']:.1f} seconds")

    if results['success'] > 0:
        print("\n🎉 Success! Your knowledge graph is ready!")
        print("   🕸️  Documents are now connected in Neo4j database")
        print("   🤖 Ready to create your AI research assistant")

else:
    print("❌ No documents to process")
    print("   Go back to the 'Document Discovery' section and check your folder path")
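
The summary keys used above (`success`, `failed`, `total_time`, `avg_time`) could be produced by a loop like the following sketch. It is an illustration under assumptions, not the processor's actual code: `process_one` is a hypothetical stand-in for the real per-document work, the `errors` key is an addition for this example, and `avg_time` is assumed to be the mean over all documents.

```python
import asyncio
import time


async def process_one(name: str) -> None:
    """Hypothetical stand-in for per-document processing."""
    await asyncio.sleep(0)  # Yield control, as real I/O-bound work would
    if name.endswith(".broken"):
        raise ValueError(f"cannot parse {name}")


async def process_all(names: list[str]) -> dict:
    """Process documents one at a time and collect a summary dict."""
    summary = {"success": 0, "failed": 0, "errors": {}}
    start = time.perf_counter()
    for name in names:
        try:
            await process_one(name)
            summary["success"] += 1
        except Exception as exc:  # One bad document should not stop the batch
            summary["failed"] += 1
            summary["errors"][name] = str(exc)
    summary["total_time"] = time.perf_counter() - start  # seconds
    summary["avg_time"] = summary["total_time"] / len(names) if names else 0.0
    return summary


results = asyncio.run(process_all(["a.pdf", "b.pdf", "c.broken"]))
print(results["success"], results["failed"])
```

`asyncio.run` is used so the sketch works as a plain script. Inside a notebook cell, where an event loop is already running, you would write `results = await process_all([...])` instead, as the cell above does.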

## 📊 Processing Results & Analytics

**📈 See detailed results and performance metrics:**

In [None]:
# 📊 Show detailed processing results
if documents:
    print("📋 Detailed Processing Results:")
    print("=" * 80)

    # Results table
    df = processor.get_results_dataframe(documents)
    print(df.to_string(index=False))

    # Key metrics
    completed = df[df['Status'] == 'completed']
    if not completed.empty:
        print("\n📈 Key Metrics:")
        print(f"   🔢 Total entities extracted: {completed['Entities'].sum()}")
        print(f"   📚 Total citations found: {completed['Citations'].sum()}")
        print(f"   ⚡ Average entities per document: {completed['Entities'].mean():.1f}")
        print(f"   ⏱️  Fastest document: {completed['Time (s)'].min():.1f} seconds")
        print(f"   🐌 Slowest document: {completed['Time (s)'].max():.1f} seconds")

    print("\n📊 Generating analytics charts...")
    # Analytics charts (this will show visualizations)
    processor.show_analytics(documents)

else:
    print("❌ No results to display")
    print("   Process documents first in the section above")

## 🕸️ Knowledge Graph Visualization

**🎨 Interactive visualization of your knowledge graph:**

In [None]:
# 🕸️ Create interactive knowledge graph visualization
if documents and any(doc.status == "completed" for doc in documents):
    print("🎨 Creating interactive knowledge graph...")
    print("   📊 This shows documents (blue) and entities (red) with their connections")
    print("   🖱️  You can hover over nodes to see details")
    print("   🔍 Larger nodes have more connections")
    print()

    # Create the visualization
    graph = processor.visualize_knowledge_graph(documents, max_nodes=50)

    if graph:
        print("✅ Knowledge graph visualization complete!")
        print("   🎯 Blue nodes = Documents")
        print("   🎯 Red nodes = Entities")
        print("   🎯 Lines = Connections")
        print("   💡 This shows how your documents are connected through shared concepts")
    else:
        print("❌ Could not create visualization")
        print("   📦 Install required libraries: pip install plotly networkx")

else:
    print("❌ No completed documents to visualize")
    print("   Process documents first in the sections above")
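
The `max_nodes` cap and the "larger nodes have more connections" behaviour can be illustrated without any graph library. The sketch below is an assumption about how such a view could be derived, not the implementation of `visualize_knowledge_graph`: it counts connections per node from a list of document-entity edges and keeps the most connected nodes. The function names and sample edges are made up for illustration.

```python
from collections import Counter


def node_degrees(edges: list[tuple[str, str]]) -> Counter:
    """Count how many edges touch each node."""
    degrees = Counter()
    for source, target in edges:
        degrees[source] += 1
        degrees[target] += 1
    return degrees


def top_nodes(edges: list[tuple[str, str]], max_nodes: int) -> list[str]:
    """Return up to `max_nodes` node names, most connected first (ties by name)."""
    degrees = node_degrees(edges)
    ranked = sorted(degrees, key=lambda node: (-degrees[node], node))
    return ranked[:max_nodes]


# Made-up sample: two documents sharing one entity
edges = [
    ("paper_a.pdf", "attention"),
    ("paper_a.pdf", "BERT"),
    ("paper_b.pdf", "attention"),
]
print(top_nodes(edges, max_nodes=2))
```

In this sample the shared entity `attention` ranks first because both documents link to it, which is the kind of connection the visualization is meant to surface.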

## 🔄 Error Recovery (Optional)

**🛠️ If some documents failed, you can retry them here:**

In [None]:
# 🔍 Check for failed documents
failed_docs = processor.get_failed_documents(documents)

if failed_docs:
    print(f"❌ Found {len(failed_docs)} failed documents:")
    print("   📋 Failed documents and errors:")

    for i, doc in enumerate(failed_docs, 1):
        print(f"   {i}. 📄 {doc.name}")
        print(f"      💥 Error: {doc.error_message}")
        print()

    print("🔄 To retry failed documents:")
    print("   1. Uncomment the lines below")
    print("   2. Run this cell again")
    print("   3. Wait for processing to complete")
    print()

    # 🔄 UNCOMMENT THESE LINES TO RETRY FAILED DOCUMENTS:
    # print("🔄 Retrying failed documents...")
    # processor.reset_for_retry(failed_docs)
    # retry_results = await processor.process_documents(failed_docs)
    # print(f"✅ Retry complete: {retry_results['success']} success, {retry_results['failed']} still failed")

else:
    print("✅ No failed documents - all processing successful!")
    print("   🎉 Your knowledge graph is complete and ready to use")

## 🎯 Next Steps: Create Your AI Research Assistant

**🚀 Your knowledge graph is ready! Now connect it to Claude Desktop:**

In [None]:
# 🎯 Show next steps for MCP server setup
print("🎉 Congratulations! Your GraphRAG MCP system is ready!")
print("=" * 60)

processor.print_next_steps()

print("\n" + "=" * 60)
print("🎮 How to Use Your AI Research Assistant:")
print()
print("📝 Conversational Mode (Explore & Discover):")
print('   • "Ask knowledge graph: What are the main themes in my research?"')
print('   • "Explore topic: machine learning applications"')
print('   • "Find connections between neural networks and optimization"')
print('   • "What do we know about drug discovery?"')
print()
print("📚 Literature Review Mode (Formal Writing):")
print('   • "Gather sources for topic: transformer architectures"')
print('   • "Get facts with citations about attention mechanisms in APA style"')
print('   • "Verify claim with sources: BERT outperforms traditional NLP models"')
print('   • "Generate bibliography in IEEE style"')
print()
print("🔍 Research Analysis:")
print('   • "Find research gaps in my field"')
print('   • "Compare methodologies for sentiment analysis"')
print('   • "Analyze contributions by [author name]"')
print('   • "Track evolution of [concept] over time"')
print()
print("✨ Your documents are now an intelligent AI research assistant!")
print("🎯 Happy researching! 🎓")