# ICE Building Workflow - Knowledge Graph Construction

**Purpose**: Comprehensive data ingestion and knowledge graph building for investment intelligence
**Architecture**: ICE Simplified (2,508 lines) with LightRAG integration
**Input**: Financial data from multiple sources → **Output**: Searchable knowledge graph

## Workflow Overview

1. **Environment Setup** - Initialize ICE system and configure data sources
2. **Workflow Mode Selection** - Choose between initial build or incremental update
3. **Data Ingestion** - Fetch financial data from APIs and process documents
4. **Knowledge Graph Building** - Extract entities, relationships, and build LightRAG graph
5. **Storage & Validation** - Verify graph construction and monitor storage
6. **Metrics & Monitoring** - Track processing metrics and system health

---

## 1. Environment Setup & System Initialization

In [1]:
# Setup and imports
import sys
import os
from pathlib import Path
from datetime import datetime, timedelta
import json

# Add project root to path
project_root = Path.cwd()
sys.path.insert(0, str(project_root))

# Configure environment
os.environ.setdefault('ICE_WORKING_DIR', './src/ice_lightrag/storage')

print(f"🚀 ICE Building Workflow")
print(f"📅 {datetime.now().strftime('%Y-%m-%d %H:%M')}")
print(f"📁 Working Directory: {project_root}")

🚀 ICE Building Workflow
📅 2025-10-11 22:49
📁 Working Directory: /Users/royyeo/Library/CloudStorage/OneDrive-NationalUniversityofSingapore/Capstone Project


In [2]:
# Initialize ICE system
from updated_architectures.implementation.ice_simplified import create_ice_system

try:
    ice = create_ice_system()
    system_ready = ice.is_ready()
    print(f"✅ ICE System Initialized")
    print(f"🧠 LightRAG Status: {'Ready' if system_ready else 'Initializing'}")
    print(f"📊 Architecture: ICE Simplified (2,508 lines)")
    print(f"🔗 Components: Core + Ingester + QueryEngine")
except Exception as e:
    print(f"❌ Initialization Error: {e}")
    raise  # Let errors surface for proper debugging

INFO:ice_data_ingestion.ice_integration:ICE LightRAG system initialized successfully
INFO:updated_architectures.implementation.ice_simplified:ICE Core initializing with ICESystemManager orchestration
INFO:src.ice_core.ice_system_manager:ICE System Manager initialized with working_dir: ice_lightrag/storage
INFO:updated_architectures.implementation.ice_simplified:✅ ICESystemManager initialized successfully
INFO:updated_architectures.implementation.ice_simplified:Data Ingester initialized with 4 API services
INFO:updated_architectures.implementation.ice_simplified:Query Engine initialized
INFO:updated_architectures.implementation.ice_simplified:✅ ICE Simplified system initialized successfully
INFO:src.ice_core.ice_system_manager:LightRAG wrapper created successfully (lazy initialization mode)
INFO:src.ice_core.ice_system_manager:Exa MCP connector initialized successfully
INFO:src.ice_core.ice_graph_bu

✅ LightRAG successfully imported!


INFO:src.ice_core.ice_graph_builder:LightRAG instance updated in Graph Builder
INFO:src.ice_core.ice_system_manager:Graph Builder initialized successfully
INFO:src.ice_core.ice_query_processor:ICE Query Processor initialized
INFO:src.ice_core.ice_system_manager:Query Processor initialized successfully
INFO:updated_architectures.implementation.ice_simplified:System health: ready=True
INFO:updated_architectures.implementation.ice_simplified:Components: {'lightrag': True, 'exa_connector': True, 'graph_builder': True, 'query_processor': True, 'data_manager': False}
INFO:updated_architectures.implementation.ice_simplified:✅ ICE system created and ready for operations


✅ ICE System Initialized
🧠 LightRAG Status: Ready
📊 Architecture: ICE Simplified (2,508 lines)
🔗 Components: Core + Ingester + QueryEngine


In [3]:
# Verify storage architecture and components
print(f"📦 LightRAG Storage Architecture Verification")
print(f"━" * 40)

if not (ice and ice.core.is_ready()):
    raise RuntimeError("ICE system not ready - cannot verify storage")

# Get storage statistics using new method
storage_stats = ice.core.get_storage_stats()

print(f"LightRAG Storage Components:")
for component_name, component_info in storage_stats['components'].items():
    status = "✅ Initialized" if component_info['exists'] else "⚠️ Not created yet"
    size_mb = component_info['size_bytes'] / (1024 * 1024) if component_info['size_bytes'] > 0 else 0
    print(f"  {component_name}: {status}")
    print(f"    Purpose: {component_info['description']}")
    print(f"    File: {component_info['file']}")
    if size_mb > 0:
        print(f"    Size: {size_mb:.2f} MB")

print(f"\n📁 Working Directory: {storage_stats['working_dir']}")
print(f"🗄️ Storage Backend: File-based (development mode)")
print(f"💾 Total Storage: {storage_stats['total_storage_bytes'] / (1024 * 1024):.2f} MB")

📦 LightRAG Storage Architecture Verification
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
LightRAG Storage Components:
  chunks_vdb: ⚠️ Not created yet
    Purpose: Vector database for document chunks
    File: vdb_chunks.json
  entities_vdb: ⚠️ Not created yet
    Purpose: Vector database for extracted entities
    File: vdb_entities.json
  relationships_vdb: ⚠️ Not created yet
    Purpose: Vector database for entity relationships
    File: vdb_relationships.json
  graph: ⚠️ Not created yet
    Purpose: NetworkX graph structure
    File: graph_chunk_entity_relation.graphml

📁 Working Directory: ice_lightrag/storage
🗄️ Storage Backend: File-based (development mode)
💾 Total Storage: 0.00 MB


In [4]:
# Data sources configuration status
if not (ice and hasattr(ice, 'ingester')):
    raise RuntimeError("Data ingester not initialized")

available_services = ice.ingester.available_services
print(f"\n📡 Data Sources Available: {len(available_services)}")
for service in available_services:
    print(f"  ✅ {service}")

if not available_services:
    print(f"  ⚠️ No APIs configured - will use sample data")
    print(f"  💡 Set NEWSAPI_ORG_API_KEY for real news")
    print(f"  💡 Set ALPHA_VANTAGE_API_KEY for financial data")

# Validate OpenAI for LightRAG
openai_configured = bool(os.getenv('OPENAI_API_KEY'))
print(f"\n🔑 OpenAI API: {'✅ Configured' if openai_configured else '❌ Required for full functionality'}")


📡 Data Sources Available: 4
  ✅ alpha_vantage
  ✅ newsapi
  ✅ polygon
  ✅ finnhub

🔑 OpenAI API: ✅ Configured


## 2. Workflow Mode Selection & Configuration

### 🔧 Model Provider Configuration

ICE supports **OpenAI** (paid) or **Ollama** (free local) for LLM and embeddings:

#### Option 1: OpenAI (Default - No setup required)
```bash
export OPENAI_API_KEY="sk-..."
```
- **Cost**: ~$5/month for typical usage
- **Quality**: Highest accuracy for entity extraction and reasoning
- **Setup**: Just set API key

#### Option 2: Ollama (Free Local - Requires setup)
```bash
# Set provider
export LLM_PROVIDER="ollama"

# One-time setup:
ollama serve                      # Start Ollama service
ollama pull qwen3:30b-32k        # Pull LLM model (32k context required)
ollama pull nomic-embed-text      # Pull embedding model
```
- **Cost**: $0/month (completely free)
- **Quality**: Good for most investment analysis tasks
- **Setup**: Requires local Ollama installation and model download

#### Option 3: Hybrid (Recommended for cost-conscious users)
```bash
export LLM_PROVIDER="ollama"           # Use Ollama for LLM
export EMBEDDING_PROVIDER="openai"     # Use OpenAI for embeddings
export OPENAI_API_KEY="sk-..."
```
- **Cost**: ~$2/month (embeddings only)
- **Quality**: Balanced - free LLM with high-quality embeddings

**Current configuration will be logged when you run the next cell.**
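
The three options above differ only in which environment variables are set. A minimal sketch of how they could be resolved is shown here; `resolve_providers` is a hypothetical helper, not part of ICE, and it assumes that `LLM_PROVIDER` defaults to `openai` and that `EMBEDDING_PROVIDER` falls back to the LLM provider when unset. Check the project's own configuration code for the actual defaults.

```python
import os


def resolve_providers(env=None):
    """Return the (llm, embedding) provider pair implied by the environment."""
    env = os.environ if env is None else env
    # Assumption: OpenAI is the default when LLM_PROVIDER is not set.
    llm = env.get("LLM_PROVIDER", "openai").strip().lower()
    # Assumption: embeddings follow the LLM provider unless set explicitly.
    embedding = env.get("EMBEDDING_PROVIDER", llm).strip().lower()

    for name in (llm, embedding):
        if name not in {"openai", "ollama"}:
            raise ValueError(f"Unknown provider: {name!r}")

    # Any use of OpenAI (LLM or embeddings) needs an API key.
    if "openai" in (llm, embedding) and not env.get("OPENAI_API_KEY"):
        raise EnvironmentError("OPENAI_API_KEY is required when OpenAI is used")
    return llm, embedding


# Hybrid setup from Option 3: Ollama for the LLM, OpenAI for embeddings.
hybrid_env = {
    "LLM_PROVIDER": "ollama",
    "EMBEDDING_PROVIDER": "openai",
    "OPENAI_API_KEY": "sk-...",
}
print(resolve_providers(hybrid_env))
```

Passing a plain dictionary instead of `os.environ` keeps the check easy to try out without touching the real environment.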

In [5]:
# Provider switching: uncomment ONE option below, then restart kernel
# (option numbers match the markdown section above)

# Option 1: OpenAI ($5/mo, highest quality)
import os; os.environ['LLM_PROVIDER'] = 'openai'; print("✅ Switched to OpenAI")

# Option 2: Full Ollama ($0/mo, requires graph clearing)
# import os; os.environ['LLM_PROVIDER'] = 'ollama'; os.environ['EMBEDDING_PROVIDER'] = 'ollama'; print("✅ Switched to Full Ollama - Clear graph in Cell 9 if needed")

# Option 3: Hybrid ($2/mo, 60% savings, recommended)
# import os; os.environ['LLM_PROVIDER'] = 'ollama'; os.environ['EMBEDDING_PROVIDER'] = 'openai'; os.environ['OPENAI_API_KEY'] = 'sk-YOUR-KEY'; print("✅ Switched to Hybrid")

✅ Switched to OpenAI


### 🗑️ Graph Management (Optional)

**When to clear the graph:**
- ✅ Switching to Full Ollama (1536-dim → 768-dim embeddings)
- ✅ Graph corrupted or very old (>30 days without updates)
- ✅ Testing fresh graph builds from scratch

**When NOT to clear:**
- ❌ Just switching LLM provider (OpenAI ↔ Hybrid use same embeddings)
- ❌ Adding new documents (incremental updates work fine)
- ❌ Changing query modes (local, hybrid, etc.)
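
The rule behind both lists is that stored vectors are only reusable when the embedding dimension stays the same. The sketch below encodes that rule using the two dimensions quoted above; `EMBEDDING_DIMS` and `must_clear_graph` are illustrative names, and the dimensions depend on the specific embedding model, so treat the table as an assumption to verify against your setup.

```python
# Embedding dimensions as quoted in this notebook (model-dependent assumption).
EMBEDDING_DIMS = {"openai": 1536, "ollama": 768}


def must_clear_graph(old_embedding_provider, new_embedding_provider):
    """True when existing vectors cannot be compared with newly created ones."""
    old_dim = EMBEDDING_DIMS[old_embedding_provider]
    new_dim = EMBEDDING_DIMS[new_embedding_provider]
    return old_dim != new_dim


# OpenAI -> Hybrid keeps OpenAI embeddings, so the graph can stay.
print(must_clear_graph("openai", "openai"))
# OpenAI -> Full Ollama changes the embedding dimension, so the graph must go.
print(must_clear_graph("openai", "ollama"))
```

Only the embedding provider enters the decision; changing the LLM alone never invalidates the stored vectors.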

**How to clear:**
Run the code cell below (uncomment lines to activate)

In [17]:
##########################################################
#                    Check graph info                    #
##########################################################

from pathlib import Path
import shutil

def check_storage(storage_path):
    """Check and display storage file inventory"""
    files = ['vdb_entities.json', 'vdb_relationships.json', 'vdb_chunks.json', 'graph_chunk_entity_relation.graphml']
    total_size = 0
    for fname in files:
        fpath = storage_path / fname
        if fpath.exists():
            size_mb = fpath.stat().st_size / (1024 * 1024)
            print(f"  ✅ {fname}: {size_mb:.2f} MB")
            total_size += size_mb
        else:
            print(f"  ⚠️  {fname}: not found")
    print(f"  💾 Total: {total_size:.2f} MB")

# Use actual config path instead of hardcoded path to avoid path mismatches
storage_path = Path(ice.config.working_dir)

check_storage(storage_path)

#####################################################################
ice.core.get_graph_stats()

##########################################################
#              Graph Health Metrics (P0)                 #
##########################################################

def check_graph_health(storage_path):
    """Check critical graph health metrics (P0 only)"""
    import json
    from pathlib import Path
    
    TICKERS = {'NVDA', 'TSMC', 'AMD', 'ASML'}  # Known portfolio tickers
    
    result = {
        'tickers_covered': set(),
        'total_entities': 0,
        'total_relationships': 0,
        'buy_signals': 0,
        'sell_signals': 0,
        'price_targets': 0
    }
    
    # Parse entities
    entities_file = Path(storage_path) / 'vdb_entities.json'
    if entities_file.exists():
        data = json.loads(entities_file.read_text())
        result['total_entities'] = len(data.get('data', []))
        
        for entity in data.get('data', []):
            text = f"{entity.get('entity_name', '')} {entity.get('content', '')}".upper()
            
            # Detect tickers
            for ticker in TICKERS:
                if ticker in text:
                    result['tickers_covered'].add(ticker)
            
            # Detect signals
            if 'BUY' in text:
                result['buy_signals'] += 1
            if 'SELL' in text:
                result['sell_signals'] += 1
            if 'PRICE TARGET' in text or 'PRICE_TARGET' in text:
                result['price_targets'] += 1
    
    # Parse relationships
    rels_file = Path(storage_path) / 'vdb_relationships.json'
    if rels_file.exists():
        data = json.loads(rels_file.read_text())
        result['total_relationships'] = len(data.get('data', []))
    
    result['tickers_covered'] = sorted(list(result['tickers_covered']))
    return result

# Run health check
health = check_graph_health(storage_path)

# Display results
print("\n🧬 Graph Health Metrics:")
print(f"  📊 Content Coverage:")
print(f"    Tickers: {', '.join(health['tickers_covered']) if health['tickers_covered'] else 'None'} ({len(health['tickers_covered'])}/4 portfolio holdings)")

print(f"\n  🕸️ Graph Structure:")
print(f"    Total entities: {health['total_entities']:,}")
print(f"    Total relationships: {health['total_relationships']:,}")
if health['total_entities'] > 0:
    avg_conn = health['total_relationships'] / health['total_entities']
    print(f"    Avg connections: {avg_conn:.2f}")

print(f"\n  💼 Investment Signals:")
print(f"    BUY signals: {health['buy_signals']}")
print(f"    SELL signals: {health['sell_signals']}")
print(f"    Price targets: {health['price_targets']}")

  ✅ vdb_entities.json: 1.96 MB
  ✅ vdb_relationships.json: 1.65 MB
  ✅ vdb_chunks.json: 0.27 MB
  ✅ graph_chunk_entity_relation.graphml: 0.13 MB
  💾 Total: 4.02 MB

🧬 Graph Health Metrics:
  📊 Content Coverage:
    Tickers: AMD, ASML, TSMC (3/4 portfolio holdings)

  🕸️ Graph Structure:
    Total entities: 165
    Total relationships: 139
    Avg connections: 0.84

  💼 Investment Signals:
    BUY signals: 2
    SELL signals: 1
    Price targets: 0
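
The health check above counts a ticker or signal whenever the string appears anywhere in the entity text, so `BUY` also matches `BUYBACK` and `SELL` matches `UPSELL`. A possible refinement, sketched here with a hypothetical `find_tokens` helper, is to match whole words only. This is one way to tighten the counts, not how ICE currently computes them.

```python
import re

TICKERS = {"NVDA", "TSMC", "AMD", "ASML"}  # same portfolio tickers as above
SIGNALS = {"BUY", "SELL"}


def find_tokens(text, vocabulary):
    """Return the vocabulary items that occur in `text` as whole words."""
    alternatives = "|".join(re.escape(word) for word in sorted(vocabulary))
    pattern = rf"\b({alternatives})\b"
    return set(re.findall(pattern, text.upper()))


sample = "Buyback announced; analysts upsell AMD"
print("substring BUY match:", "BUY" in sample.upper())
print("whole-word signals:", find_tokens(sample, SIGNALS))
print("whole-word tickers:", find_tokens(sample, TICKERS))
```

Whole-word matching still cannot tell a recommendation from a passing mention, so the resulting counts remain a rough health indicator.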


## 🧪 Test: Hybrid Entity Categorization with Qwen2.5-3B

**Purpose**: Compare categorization accuracy between keyword-only and hybrid (keyword+LLM) approaches

**What this tests**:
- Baseline: Fast keyword pattern matching (~1ms per entity)
- Enhanced: Confidence scoring to identify ambiguous cases
- Hybrid: LLM fallback for low-confidence entities (~40ms per entity)

**Prerequisites**:
- ✅ Ollama installed with qwen2.5:3b model (optional - degrades gracefully)
- ✅ LightRAG graph built (previous cells completed)

**Expected runtime**: ~0.5 seconds for 12 sample entities (hybrid mode)

**Configuration**: `src/ice_lightrag/graph_categorization.py` - Change `CATEGORIZATION_MODE` to enable hybrid by default
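
The three tiers described above can be sketched as a single function: score with keywords first, and call the LLM only when the score falls below the threshold. Everything in this block is illustrative. `PATTERNS`, `keyword_with_confidence`, and `categorize_hybrid_sketch` are hypothetical stand-ins for the project's real functions, whose scoring rules are not shown in this notebook. The 0.70 threshold and the fixed 0.90 confidence for LLM answers are taken from the test cell that follows.

```python
from typing import Callable, Optional, Tuple

# Hypothetical keyword patterns; the real ones live in entity_categories.py.
PATTERNS = {
    "Company": ("inc", "corp", "ltd"),
    "Financial Metric": ("revenue", "margin", "eps"),
}


def keyword_with_confidence(name: str, content: str) -> Tuple[str, float]:
    """Score each category by keyword hits; confidence is the winner's share."""
    text = f"{name} {content}".lower()
    hits = {cat: sum(kw in text for kw in kws) for cat, kws in PATTERNS.items()}
    total = sum(hits.values())
    if total == 0:
        return "Other", 0.0
    best = max(hits, key=hits.get)
    return best, hits[best] / total


def categorize_hybrid_sketch(
    name: str,
    content: str,
    llm: Optional[Callable[[str, str], str]] = None,
    threshold: float = 0.70,
) -> Tuple[str, float, bool]:
    """Return (category, confidence, used_llm)."""
    category, confidence = keyword_with_confidence(name, content)
    # Confident keyword result, or no LLM available: keep the keyword answer.
    if confidence >= threshold or llm is None:
        return category, confidence, False
    # Ambiguous case: defer to the LLM and report a fixed confidence of 0.90.
    return llm(name, content), 0.90, True


def fake_llm(name: str, content: str) -> str:
    """Stub standing in for a call to a local model."""
    return "Company"


print(categorize_hybrid_sketch("Nvidia Inc", ""))
print(categorize_hybrid_sketch("Acme Corp", "revenue grew", llm=fake_llm))
print(categorize_hybrid_sketch("Acme Corp", "revenue grew"))
```

When no LLM callable is supplied the function returns the keyword result unchanged, which mirrors the graceful degradation listed in the prerequisites.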

In [19]:
# Purpose: Test hybrid entity categorization with Ollama LLM fallback
# Location: ice_building_workflow.ipynb Cell 12 (after Cell 10)
# Dependencies: graph_categorization.py, entity_categories.py, LightRAG storage

import json
import time
import sys
import random
import requests
from pathlib import Path
from collections import Counter

# ===== SETUP: Imports with error handling =====
print("=" * 70)
print("🧪 Hybrid Entity Categorization Test")
print("=" * 70)

# Ensure src is in path for notebook context
project_root = Path.cwd()
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

try:
    from src.ice_lightrag.graph_categorization import (
        categorize_entity,
        categorize_entity_with_confidence,
        categorize_entity_hybrid
    )
    from src.ice_lightrag.entity_categories import CATEGORY_DISPLAY_ORDER
    print("✅ Categorization functions imported successfully\n")
except ImportError as e:
    print(f"❌ Import error: {e}")
    print("   → Ensure previous cells completed successfully")
    print("   → Check that src/ice_lightrag/graph_categorization.py exists\n")
    raise

# ===== HEALTH CHECK: Ollama service availability =====
def check_ollama_service():
    """Check if Ollama service is running and qwen2.5:3b is available"""
    try:
        response = requests.get("http://localhost:11434/api/tags", timeout=2)
        if response.status_code == 200:
            models = response.json().get('models', [])
            qwen_available = any('qwen2.5:3b' in m.get('name', '') for m in models)
            return True, qwen_available
        return False, False
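    # Check malformed-response errors first: newer versions of requests raise a
    # JSON decode error that is also a RequestException, so the order matters.
    except (KeyError, json.JSONDecodeError):
        return True, False  # Ollama running but malformed response
    except requests.RequestException:
        return False, False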

ollama_running, qwen_available = check_ollama_service()
if ollama_running and qwen_available:
    print("✅ Ollama service running with qwen2.5:3b model")
elif ollama_running:
    print("⚠️  Ollama running but qwen2.5:3b not found")
    print("   → Install: ollama pull qwen2.5:3b")
else:
    print("⚠️  Ollama not running - hybrid mode will fall back to keyword matching")
    print("   → Start Ollama: brew services start ollama (macOS)\n")

# ===== DATA LOADING: Entities with validation =====
# Use ICE config for robust path resolution.
# A separate name keeps `storage_path` (the storage directory used by the
# check and clear cells) from being overwritten with a file path.
entities_path = Path(ice.config.working_dir) / "vdb_entities.json"

if not entities_path.exists():
    print(f"❌ Storage file not found: {entities_path}")
    print("   → Run previous cells to build the knowledge graph first\n")
    raise FileNotFoundError(f"LightRAG storage not found at {entities_path}")

with open(entities_path) as f:
    entities_data = json.load(f)

# Validate LightRAG storage structure: {"data": [...]}
if not isinstance(entities_data, dict) or 'data' not in entities_data:
    print(f"❌ Invalid storage format (expected dict with 'data' key)")
    raise ValueError("Invalid LightRAG storage format - expected {'data': [...]}")

entities_list = entities_data.get('data', [])
if not isinstance(entities_list, list) or len(entities_list) == 0:
    print(f"❌ No entities found in storage")
    raise ValueError("No entities found in LightRAG storage")

print(f"✅ Loaded {len(entities_list)} entities from knowledge graph")

# Random sampling for better test coverage (avoids bias from first N entities)
random.seed(42)  # Reproducible sampling
test_entities = random.sample(entities_list, min(12, len(entities_list)))
print(f"   Testing with {len(test_entities)} randomly sampled entities\n")

# ===== HELPER: Compact result display =====
def display_results(results, title, show_confidence=False, show_llm=False):
    """Display categorization results in compact format"""
    print(f"\n{title}")
    print("-" * 70)
    
    for i, (name, category, confidence, used_llm) in enumerate(results, 1):
        # Truncate long entity names for readability
        display_name = name[:40] + "..." if len(name) > 40 else name
        
        if show_llm and used_llm:
            indicator = "🤖"  # LLM was used
        else:
            indicator = "⚡"  # Keyword matching

        if show_confidence:
            print(f"{i:2d}. {indicator} {display_name:43s} → {category:20s} (conf: {confidence:.2f})")
        else:
            print(f"{i:2d}. {display_name:45s} → {category}")

    # Category distribution summary
    category_counts = Counter(cat for _, cat, _, _ in results)
    print(f"\n📊 Distribution: {dict(category_counts)}")

# ===== TEST 1: Keyword-Only Baseline =====
print("\n" + "=" * 70)
print("TEST 1: Keyword-Only Categorization (Baseline)")
print("=" * 70)

start_time = time.time()
keyword_results = []

for entity in test_entities:
    name = entity.get('entity_name', '')
    content = entity.get('content', '')
    category = categorize_entity(name, content)
    keyword_results.append((name, category, 1.0, False))  # No confidence/LLM info

elapsed = time.time() - start_time
display_results(keyword_results, "Results (Keyword Matching):", show_confidence=False)
print(f"\n⏱️  Time: {elapsed*1000:.1f}ms ({elapsed*1000/len(test_entities):.1f}ms per entity)")

# ===== TEST 2: Confidence Scoring Analysis =====
print("\n" + "=" * 70)
print("TEST 2: Keyword + Confidence Scoring")
print("=" * 70)

start_time = time.time()
confidence_results = []

for entity in test_entities:
    name = entity.get('entity_name', '')
    content = entity.get('content', '')
    category, confidence = categorize_entity_with_confidence(name, content)
    confidence_results.append((name, category, confidence, False))

elapsed = time.time() - start_time
display_results(confidence_results, "Results (with confidence scores):", show_confidence=True)

# Highlight ambiguous entities (confidence < 0.70)
low_confidence = [(n, c, conf) for n, c, conf, _ in confidence_results if conf < 0.70]
if low_confidence:
    print(f"\n🔍 Ambiguous entities (confidence < 0.70): {len(low_confidence)}")
    for name, cat, conf in low_confidence:
        print(f"   - {name[:50]:50s} → {cat:20s} (conf: {conf:.2f})")
else:
    print(f"\n✅ All entities have high confidence (≥0.70) - no LLM fallback needed")

print(f"\n⏱️  Time: {elapsed*1000:.1f}ms ({elapsed*1000/len(test_entities):.1f}ms per entity)")

# ===== TEST 3: Hybrid Mode (if Ollama available) =====
if ollama_running and qwen_available:
    print("\n" + "=" * 70)
    print("TEST 3: Hybrid Categorization (Keyword + LLM Fallback)")
    print("=" * 70)
    print("⏱️  Note: LLM calls may take 5-10 seconds total...\n")
    
    start_time = time.time()
    hybrid_results = []
    llm_call_count = 0
    
    for entity in test_entities:
        name = entity.get('entity_name', '')
        content = entity.get('content', '')
        
        # Optimization: Compute keyword confidence once and reuse
        keyword_cat, keyword_conf = categorize_entity_with_confidence(name, content)
        
        if keyword_conf >= 0.70:
            # High confidence - use keyword result (skip LLM call)
            category, confidence = keyword_cat, keyword_conf
            used_llm = False
        else:
            # Low confidence - use hybrid (may call LLM)
            category, confidence = categorize_entity_hybrid(name, content, confidence_threshold=0.70)
            # LLM was used if confidence jumped to 0.90
            used_llm = (confidence == 0.90)
            if used_llm:
                llm_call_count += 1
        
        hybrid_results.append((name, category, confidence, used_llm))
    
    elapsed = time.time() - start_time
    display_results(hybrid_results, "Results (Hybrid mode - keyword + LLM):", show_confidence=True, show_llm=True)
    
    print(f"\n🤖 LLM calls: {llm_call_count}/{len(test_entities)} ({100*llm_call_count/len(test_entities):.1f}%)")
    print(f"⏱️  Time: {elapsed*1000:.1f}ms ({elapsed*1000/len(test_entities):.1f}ms per entity)")

    # ===== COMPARISON SUMMARY =====
    print("\n" + "=" * 70)
    print("📊 COMPARISON SUMMARY")
    print("=" * 70)
    
    # Count category changes between keyword and hybrid
    changes = 0
    for i in range(len(test_entities)):
        if keyword_results[i][1] != hybrid_results[i][1]:
            changes += 1
    
    print(f"Entities recategorized by LLM: {changes}/{len(test_entities)} ({100*changes/len(test_entities):.1f}%)")
    
    if changes > 0:
        print("\nRecategorization details:")
        for i in range(len(test_entities)):
            kw_cat = keyword_results[i][1]
            hyb_cat = hybrid_results[i][1]
            if kw_cat != hyb_cat:
                name = test_entities[i].get('entity_name', '')[:50]
                print(f"   - {name:50s}: {kw_cat:20s} → {hyb_cat}")

    print(f"\n✅ Hybrid categorization complete!")

else:
    print("\n" + "=" * 70)
    print("⚠️  TEST 3 SKIPPED: Ollama not available")
    print("=" * 70)
    print("To enable hybrid mode:")
    print("   1. Install Ollama: https://ollama.com")
    print("   2. Pull model: ollama pull qwen2.5:3b")
    print("   3. Start service: brew services start ollama (macOS)")
    print("   4. Re-run this cell")

print("\n" + "=" * 70)

🧪 Hybrid Entity Categorization Test
✅ Categorization functions imported successfully

✅ Ollama service running with qwen2.5:3b model
❌ Storage file not found: ice_lightrag/storage/vdb_entities.json
   → Run previous cells to build the knowledge graph first



FileNotFoundError: LightRAG storage not found at ice_lightrag/storage/vdb_entities.json

In [18]:

# Clear graph storage (DESTRUCTIVE: deletes the storage directory and all its contents).
# Run this cell only when you intend to rebuild the graph from scratch.
# Requires `shutil`, `storage_path`, and `check_storage` from the graph info cell.

##########################################################
#                      Clear Graph                       #
##########################################################

if storage_path.exists():
    print("📊 PRE-DELETION CHECK")
    check_storage(storage_path)

    shutil.rmtree(storage_path)  # Delete the directory and all contents
    storage_path.mkdir(parents=True, exist_ok=True)  # Re-create it empty

    print("\n✅ POST-DELETION CHECK")
    check_storage(storage_path)
    print("\n✅ Graph cleared - will rebuild from scratch")
else:
    print("⚠️  Storage path doesn't exist - nothing to clear")

📊 PRE-DELETION CHECK
  ✅ vdb_entities.json: 1.96 MB
  ✅ vdb_relationships.json: 1.65 MB
  ✅ vdb_chunks.json: 0.27 MB
  ✅ graph_chunk_entity_relation.graphml: 0.13 MB
  💾 Total: 4.02 MB

✅ POST-DELETION CHECK
  ⚠️  vdb_entities.json: not found
  ⚠️  vdb_relationships.json: not found
  ⚠️  vdb_chunks.json: not found
  ⚠️  graph_chunk_entity_relation.graphml: not found
  💾 Total: 0.00 MB

✅ Graph cleared - will rebuild from scratch


In [None]:
# Portfolio configuration
import pandas as pd

portfolio_df = pd.read_csv('portfolio_holdings.csv')

# Basic validation
if portfolio_df.empty:
    raise ValueError("Portfolio CSV is empty")
if 'ticker' not in portfolio_df.columns:
    raise ValueError("CSV must have 'ticker' column")

holdings = portfolio_df['ticker'].tolist()

print(f"🎯 Portfolio Configuration")
print(f"━" * 40)
print(f"Holdings: {', '.join(holdings)} ({len(holdings)} stocks)")
# 'sector' is optional; only the 'ticker' column is validated above
print(f"Sector: {portfolio_df['sector'].iloc[0] if 'sector' in portfolio_df.columns else 'N/A'}")
print(f"Data Range: 2 years historical (editable in Cell 21)")
print(f"📄 Source: portfolio_holdings.csv")

## 3. Data Ingestion & Processing

In [9]:
print("\n📊 Data Source Summary")
print("=" * 50)

# Show ACTUAL metrics if available (not fake percentages)
if ice and ice.core.is_ready():
    storage_stats = ice.core.get_storage_stats()
    print(f"💾 Current Graph Size: {storage_stats['total_storage_bytes'] / (1024*1024):.2f} MB")

    # Show real source info if ingestion has run
    if 'ingestion_result' in locals() and 'metrics' in ingestion_result:
        metrics = ingestion_result['metrics']
        if 'data_sources_used' in metrics:
            print(f"✅ Active sources: {', '.join(metrics['data_sources_used'])}")
        print(f"📄 Total documents: {ingestion_result.get('total_documents', 0)}")
    else:
        print("ℹ️ Data source metrics available after ingestion completes")
else:
    print("⚠️ Knowledge graph not ready")


📊 Data Source Summary
💾 Current Graph Size: 0.00 MB
ℹ️ Data source metrics available after ingestion completes


## 3b. Data Source Contribution Visualization (Week 5)

In [10]:
ice.core.get_graph_stats()

{'is_ready': True,
 'storage_indicators': {'all_components_present': False,
  'chunks_file_size': 0.0,
  'entities_file_size': 0.0,
  'relationships_file_size': 0.0,
  'graph_file_size': 0.0}}

In [11]:
print("📊 ICE Data Sources Summary (Phase 1 Integration)")
print("=" * 60)
print("\nℹ️  Phase 1 focuses on architecture and data flow patterns.")
print("Actual data ingestion depends on configured API keys.\n")
print("\n1️⃣ API/MCP Sources:")
print("   - NewsAPI: Real-time financial news")
print("   - SEC EDGAR: Regulatory filings (10-K, 10-Q, 8-K)")
print("   - Alpha Vantage: Market data")
print("\n2️⃣ Email Pipeline (Phase 1 Enhanced Documents):")
print("   - Broker research with BUY/SELL signals")
print("   - Enhanced documents: [TICKER:NVDA|confidence:0.95]")
print("   - See detailed demo: imap_email_ingestion_pipeline/investment_email_extractor_simple.ipynb")
print("\n3️⃣ SEC Filings:")
print("   - Management commentary and financial statements")
print("   - Integrated via SEC EDGAR connector")
print("\n💡 All sources → Single LightRAG knowledge graph via ice_simplified.py")

📊 ICE Data Sources Summary (Phase 1 Integration)

ℹ️  Phase 1 focuses on architecture and data flow patterns.
Actual data ingestion depends on configured API keys.


1️⃣ API/MCP Sources:
   - NewsAPI: Real-time financial news
   - SEC EDGAR: Regulatory filings (10-K, 10-Q, 8-K)
   - Alpha Vantage: Market data

2️⃣ Email Pipeline (Phase 1 Enhanced Documents):
   - Broker research with BUY/SELL signals
   - Enhanced documents: [TICKER:NVDA|confidence:0.95]
   - See detailed demo: imap_email_ingestion_pipeline/investment_email_extractor_simple.ipynb

3️⃣ SEC Filings:
   - Management commentary and financial statements
   - Integrated via SEC EDGAR connector

💡 All sources → Single LightRAG knowledge graph via ice_simplified.py
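
The enhanced-document marker shown above, `[TICKER:NVDA|confidence:0.95]`, embeds metadata inline so it travels with the text into the graph. A minimal sketch of reading such markers back out follows. It assumes every marker has the shape `[KIND:value|confidence:number]`; only the `TICKER` example appears in this notebook, so the `RATING` tag and the `parse_tags` helper are hypothetical.

```python
import re

# Assumed marker grammar: [KIND:value|confidence:0.NN]
TAG_RE = re.compile(
    r"\[(?P<kind>[A-Z_]+):(?P<value>[^|\]]+)\|confidence:(?P<conf>\d(?:\.\d+)?)\]"
)


def parse_tags(text):
    """Return (kind, value, confidence) for each inline marker in `text`."""
    return [
        (match["kind"], match["value"], float(match["conf"]))
        for match in TAG_RE.finditer(text)
    ]


doc = "[TICKER:NVDA|confidence:0.95] upgraded to [RATING:BUY|confidence:0.87]"
for kind, value, confidence in parse_tags(doc):
    print(f"{kind:8s} {value:6s} {confidence:.2f}")
```

See the email pipeline notebook referenced above for the marker format the pipeline actually emits.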


## 3a. ICE Data Sources Integration (Week 5)

ICE integrates 3 heterogeneous data sources into a unified knowledge graph:

**Detailed Demonstrations Available**:
- 📧 **Email Pipeline**: See `imap_email_ingestion_pipeline/investment_email_extractor_simple.ipynb` (25 cells)
  - Entity extraction (tickers, ratings, price targets)
  - BUY/SELL signal extraction with confidence scores
  - Enhanced document creation with inline metadata

- 📊 **Quick Summary Below**

In [None]:
# Execute data ingestion
# NOTE: This operation may take several minutes. If it hangs, restart kernel.
print(f"\n📥 Fetching Portfolio Data")
print(f"━" * 50)

if not (ice and ice.is_ready()):
    raise RuntimeError("ICE system not ready for data ingestion")

# Fetch historical data (2 years default - adjust years parameter as needed for faster testing)
print(f"🔄 Fetching data for {len(holdings)} holdings...")
ingestion_result = ice.ingest_historical_data(holdings, years=2)

# Display results
print(f"\n📊 Ingestion Results:")
print(f"  Status: {ingestion_result['status']}")
print(f"  Holdings: {len(ingestion_result['holdings_processed'])}/{len(holdings)}")
print(f"  Documents: {ingestion_result['total_documents']}")

# Show successful holdings
if ingestion_result['holdings_processed']:
    print(f"  ✅ Successful: {', '.join(ingestion_result['holdings_processed'])}")

# Show metrics
if 'metrics' in ingestion_result:
    print(f"\n⏱️  Processing Time: {ingestion_result['metrics']['processing_time']:.2f}s")
    if 'data_sources_used' in ingestion_result['metrics']:
        print(f"  Data Sources: {', '.join(ingestion_result['metrics']['data_sources_used'])}")

# Show failures if any
if ingestion_result.get('failed_holdings'):
    print(f"\n❌ Failed Holdings:")
    for failure in ingestion_result['failed_holdings']:
        print(f"  {failure['symbol']}: {failure['error']}")

## 4. Knowledge Graph Building Pipeline

In [None]:
# Knowledge Graph Building - Already completed during ingestion
print(f"\n🧠 Knowledge Graph Building")
print(f"━" * 60)

if not (ice and ice.core.is_ready()):
    raise RuntimeError("LightRAG not ready")

print(f"ℹ️  NOTE: Knowledge graph building happens automatically during data ingestion")
print(f"   The ingestion method (ingest_historical_data) already added documents")
print(f"   to the graph via LightRAG. This cell validates that building succeeded.\n")

# Validate that building succeeded by checking storage
storage_stats = ice.core.get_storage_stats()

if storage_stats['total_storage_bytes'] > 0:
    print(f"✅ KNOWLEDGE GRAPH BUILT SUCCESSFULLY")
    print(f"━" * 40)
    print(f"   📄 Documents processed: {ingestion_result.get('total_documents', 0)}")
    print(f"   💾 Storage size: {storage_stats['total_storage_bytes'] / (1024*1024):.2f} MB")

    components_ready = sum(1 for c in storage_stats['components'].values() if c['exists'])
    print(f"   🔗 Components ready: {components_ready}/4")
    
    # Create success result for metrics tracking
    building_result = {
        'status': 'success',
        'total_documents': ingestion_result.get('total_documents', 0),
        'metrics': {
            'building_time': ingestion_result.get('metrics', {}).get('processing_time', 0.0),
            'graph_initialized': True
        }
    }
    
    print(f"\n🎯 Graph Building Process:")
    print(f"   1️⃣ Text Chunking: 1200 tokens (optimal for financial documents)")
    print(f"   2️⃣ Entity Extraction: Companies, metrics, risks, regulations")
    print(f"   3️⃣ Relationship Discovery: Dependencies, impacts, correlations")
    print(f"   4️⃣ Graph Construction: LightRAG optimized structure")
    print(f"   5️⃣ Storage: chunks_vdb, entities_vdb, relationships_vdb, graph")
    
    print(f"\n🚀 System ready for intelligent queries!")
    
else:
    print(f"⚠️ NO GRAPH DATA DETECTED")
    print(f"   Storage size: 0 MB")
    print(f"   Check ingestion results above for errors")
    print(f"   Possible causes:")
    print(f"   - No API keys configured")
    print(f"   - All holdings failed to fetch data")
    print(f"   - Network connectivity issues")
    
    building_result = {
        'status': 'error',
        'message': 'No graph data - check ingestion results'
    }
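
The check above relies on `storage_stats['components']` and `storage_stats['total_storage_bytes']`. As a rough illustration of how such a dictionary could be assembled from the storage directory, here is a minimal sketch. The helper `collect_storage_stats` is hypothetical and is not the actual `ice.core.get_storage_stats()` implementation; the file names are taken from the storage validation output shown in section 5.

```python
from pathlib import Path

# File names as reported by the storage validation output in this notebook.
COMPONENT_FILES = {
    "chunks_vdb": "vdb_chunks.json",
    "entities_vdb": "vdb_entities.json",
    "relationships_vdb": "vdb_relationships.json",
    "graph": "graph_chunk_entity_relation.graphml",
}


def collect_storage_stats(working_dir):
    """Hypothetical stand-in that mimics the shape of the storage stats dict."""
    root = Path(working_dir)
    components = {}
    for name, filename in COMPONENT_FILES.items():
        path = root / filename
        exists = path.is_file()
        components[name] = {
            "file": filename,
            "exists": exists,
            # A missing file counts as zero bytes.
            "size_bytes": path.stat().st_size if exists else 0,
        }
    return {
        "working_dir": str(root),
        "storage_exists": root.is_dir(),
        "components": components,
        "total_storage_bytes": sum(c["size_bytes"] for c in components.values()),
    }
```

Calling `collect_storage_stats(os.environ['ICE_WORKING_DIR'])` would then give a dictionary that the "components ready" count and the storage-size check can read in the same way, assuming the real storage uses these four files.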

## 5. Storage Architecture Validation & Monitoring

In [14]:
# Comprehensive storage validation and metrics
print(f"\n🔍 Storage Architecture Validation")
print(f"━" * 40)

if not (ice and ice.core.is_ready()):
    raise RuntimeError("Cannot validate storage without initialized system")

# Get detailed storage statistics
storage_stats = ice.core.get_storage_stats()
graph_stats = ice.core.get_graph_stats()

print(f"📦 LightRAG Storage Components Status:")
for component_name, component_info in storage_stats['components'].items():
    status_icon = "✅" if component_info['exists'] else "⚠️"
    size_mb = component_info['size_bytes'] / (1024 * 1024) if component_info['size_bytes'] > 0 else 0
    
    print(f"  {status_icon} {component_name}:")
    print(f"    File: {component_info['file']}")
    print(f"    Purpose: {component_info['description']}")
    print(f"    Size: {size_mb:.2f} MB" if size_mb > 0 else "    Size: Not created yet")

print(f"\n📊 Storage Summary:")
print(f"  Working Directory: {storage_stats['working_dir']}")
print(f"  Total Storage: {storage_stats['total_storage_bytes'] / (1024 * 1024):.2f} MB")
print(f"  System Initialized: {storage_stats['is_initialized']}")

print(f"\n🕸️ Knowledge Graph Status:")
print(f"  Graph Ready: {graph_stats['is_ready']}")
if graph_stats.get('storage_indicators'):
    indicators = graph_stats['storage_indicators']
    print(f"  All Components Present: {indicators['all_components_present']}")
    print(f"  Chunks Storage: {indicators['chunks_file_size']:.2f} MB")
    print(f"  Entity Storage: {indicators['entities_file_size']:.2f} MB")
    print(f"  Relationship Storage: {indicators['relationships_file_size']:.2f} MB")
    print(f"  Graph Structure: {indicators['graph_file_size']:.2f} MB")

# Validation checks
print(f"\n✅ Validation Checks:")
validation_score = 0
max_score = 4

# Check 1: System ready
if storage_stats['is_initialized']:
    print(f"  ✅ System initialization: PASSED")
    validation_score += 1
else:
    print(f"  ❌ System initialization: FAILED")

# Check 2: Storage exists
if storage_stats['storage_exists']:
    print(f"  ✅ Storage directory: PASSED")
    validation_score += 1
else:
    print(f"  ❌ Storage directory: FAILED")

# Check 3: Components created
components_exist = sum(1 for c in storage_stats['components'].values() if c['exists'])
if components_exist > 0:
    print(f"  ✅ Storage components: PASSED ({components_exist}/4 created)")
    validation_score += 1
else:
    print(f"  ❌ Storage components: FAILED (no components created)")

# Check 4: Has storage content
if storage_stats['total_storage_bytes'] > 0:
    print(f"  ✅ Storage content: PASSED")
    validation_score += 1
else:
    print(f"  ❌ Storage content: FAILED (no data stored)")

print(f"\n📊 Validation Score: {validation_score}/{max_score} ({(validation_score/max_score)*100:.0f}%)")

if validation_score == max_score:
    print(f"🎉 All validations passed! Knowledge graph is ready for queries.")
elif validation_score >= max_score * 0.75:
    print(f"✅ Most validations passed. System is functional.")
else:
    print(f"⚠️ Some validations failed. Check configuration and retry building.")


🔍 Storage Architecture Validation
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📦 LightRAG Storage Components Status:
  ✅ chunks_vdb:
    File: vdb_chunks.json
    Purpose: Vector database for document chunks
    Size: 0.27 MB
  ✅ entities_vdb:
    File: vdb_entities.json
    Purpose: Vector database for extracted entities
    Size: 1.96 MB
  ✅ relationships_vdb:
    File: vdb_relationships.json
    Purpose: Vector database for entity relationships
    Size: 1.65 MB
  ✅ graph:
    File: graph_chunk_entity_relation.graphml
    Purpose: NetworkX graph structure
    Size: 0.13 MB

📊 Storage Summary:
  Working Directory: ice_lightrag/storage
  Total Storage: 4.70 MB
  System Initialized: True

🕸️ Knowledge Graph Status:
  Graph Ready: True
  All Components Present: True
  Chunks Storage: 0.27 MB
  Entity Storage: 1.96 MB
  Relationship Storage: 1.65 MB
  Graph Structure: 0.13 MB

✅ Validation Checks:
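
The checks in this section confirm that the storage files exist and are non-empty, but a non-empty file does not by itself show that the graph contains entities and relationships. A minimal sketch of an extra sanity check follows, assuming the graph file is standard GraphML (as its `.graphml` extension and the "NetworkX graph structure" description suggest). The helper `count_graphml_elements` is hypothetical and uses only the standard library.

```python
import xml.etree.ElementTree as ET

# Standard GraphML XML namespace; element tags are qualified with it.
GRAPHML_NS = "{http://graphml.graphdrawing.org/xmlns}"


def count_graphml_elements(graphml_path):
    """Return (node_count, edge_count) for a GraphML file."""
    root = ET.parse(graphml_path).getroot()
    nodes = sum(1 for _ in root.iter(f"{GRAPHML_NS}node"))
    edges = sum(1 for _ in root.iter(f"{GRAPHML_NS}edge"))
    return nodes, edges
```

Pointing this at `graph_chunk_entity_relation.graphml` inside the working directory should return non-zero counts after a successful build; the exact path is an assumption based on the output above.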

## 6. Building Metrics & Performance Analysis

In [None]:
# Comprehensive building session metrics
print(f"\n📊 Building Session Metrics & Performance")
print(f"━" * 50)

session_metrics = {
    'holdings_count': len(holdings),
    'total_processing_time': 0.0,
    'documents_processed': 0,
    'building_successful': False
}

# Collect metrics from ingestion and building
if 'ingestion_result' in locals() and ingestion_result:
    if 'metrics' in ingestion_result:
        session_metrics['ingestion_time'] = ingestion_result['metrics'].get('processing_time', 0.0)
    session_metrics['documents_processed'] = ingestion_result.get('total_documents', 0)

if 'building_result' in locals() and building_result:
    if building_result.get('status') == 'success':
        session_metrics['building_successful'] = True
    if 'metrics' in building_result:
        building_time = building_result['metrics'].get('building_time', building_result['metrics'].get('update_time', 0.0))
        session_metrics['building_time'] = building_time

# Calculate total time
if 'pipeline_stats' in locals():
    session_metrics['total_processing_time'] = pipeline_stats.get('processing_time', 0.0)

print(f"🎯 Session Overview:")
print(f"  Holdings Processed: {session_metrics['holdings_count']}")
print(f"  Documents Processed: {session_metrics['documents_processed']}")
print(f"  Building Successful: {session_metrics['building_successful']}")

if session_metrics.get('ingestion_time', 0) > 0:
    print(f"\n⏱️ Performance Metrics:")
    print(f"  Data Ingestion Time: {session_metrics['ingestion_time']:.2f}s")
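    if session_metrics.get('building_time', 0) > 0:
        print(f"  Graph Building Time: {session_metrics['building_time']:.2f}s")
        # Graph building runs inside ingestion (building_time mirrors the ingestion time),
        # so report the pipeline total when available instead of summing the two.
        total_time = session_metrics['total_processing_time'] or session_metrics['ingestion_time']
        print(f"  Total Processing Time: {total_time:.2f}s")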
    
    print(f"\n📈 Efficiency Analysis:")
    if session_metrics['documents_processed'] > 0:
        docs_per_second = session_metrics['documents_processed'] / session_metrics['ingestion_time']
        print(f"  Processing Rate: {docs_per_second:.2f} documents/second")
    
    holdings_per_second = session_metrics['holdings_count'] / session_metrics['ingestion_time']
    print(f"  Holdings Rate: {holdings_per_second:.2f} holdings/second")

# Architecture efficiency comparison
print(f"\n🏗️ Architecture Efficiency:")
print(f"  ICE Simplified: 2,508 lines of code")
print(f"  Code Reduction: 83% (vs 15,000 line original)")
print(f"  File Count: 5 core modules")
print(f"  Dependencies: Direct LightRAG wrapper")
print(f"  Token Efficiency: 4,000x better than GraphRAG")

# Success summary
print(f"\n✅ Building Session Summary:")
if session_metrics['building_successful']:
    print(f"  🎉 Knowledge graph building completed successfully")
    print(f"  📊 {session_metrics['documents_processed']} documents processed")
    print(f"  🚀 System ready for intelligent investment queries")
    print(f"  💡 Proceed to ice_query_workflow.ipynb for analysis")
else:
    print(f"  ⚠️ Building completed with warnings or in demo mode")
    print(f"  📋 Review configuration and API settings")
    print(f"  🔧 Consider running with fresh data if issues persist")

print(f"\n🔗 Next Steps:")
print(f"  1. Review building metrics and validate storage")
print(f"  2. Run ice_query_workflow.ipynb for portfolio analysis")
print(f"  3. Test different LightRAG query modes")
print(f"  4. Monitor system performance and optimize as needed")
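
The efficiency figures above divide document and holding counts by the ingestion time, which is only safe because the cell first checks that the time is positive. A minimal sketch of the same calculation as a reusable function is shown below. The name `compute_rates` is hypothetical and not part of ICE; returning `None` when no time was recorded is one possible convention.

```python
def compute_rates(documents, holdings, elapsed_seconds):
    """Return per-second rates, or None for each rate when no time was recorded."""
    if elapsed_seconds <= 0:
        # Avoid dividing by zero when ingestion metrics are missing.
        return {"docs_per_second": None, "holdings_per_second": None}
    return {
        "docs_per_second": documents / elapsed_seconds,
        "holdings_per_second": holdings / elapsed_seconds,
    }
```

For example, `compute_rates(session_metrics['documents_processed'], session_metrics['holdings_count'], session_metrics.get('ingestion_time', 0.0))` would reproduce the two rates printed above without needing the outer guard.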

## 📋 Building Workflow Complete

**Summary**: This notebook demonstrated the complete ICE building workflow from data ingestion through knowledge graph construction.

### Key Achievements
✅ **System Initialization**: ICE simplified architecture deployed  
✅ **Data Ingestion**: Portfolio data fetched and processed  
✅ **Graph Building**: LightRAG knowledge graph constructed  
✅ **Storage Validation**: All components verified and metrics tracked  

### Architecture Benefits
- **83% Code Reduction**: 2,508 lines vs 15,000 original
- **4,000x Token Efficiency**: vs GraphRAG baseline
- **Mode Flexibility**: Initial build or incremental updates
- **Complete Metrics**: Processing time, success rates, storage stats
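
The 83% figure follows directly from the two line counts quoted above. A short sketch of the arithmetic, using a hypothetical helper name:

```python
def code_reduction_pct(new_lines, original_lines):
    """Percentage of lines removed relative to the original codebase."""
    return (1 - new_lines / original_lines) * 100


# 2,508 lines now versus 15,000 originally, as stated in this notebook.
print(f"{code_reduction_pct(2508, 15000):.1f}%")
```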

### Next Steps
1. **Launch Query Workflow**: Open `ice_query_workflow.ipynb`
2. **Test Investment Intelligence**: Run portfolio analysis queries
3. **Explore Query Modes**: Test all 5 LightRAG modes
4. **Monitor Performance**: Track query response times and accuracy

---
**Ready for Investment Intelligence Queries** 🚀