# 🚀 Tachyon Argus - TFT Training Quick Start

## Simplified workflow for dataset creation and model training

This notebook does two things:
1. **Generate training dataset** - Create realistic server metrics data with inter-server dependencies
2. **Train TFT model** - Train the Temporal Fusion Transformer

The dashboard and inference daemon handle everything else!

---

**⏱️ Estimated Times:**
- Dataset generation (24h): ~30-60 seconds
- Dataset generation (720h): ~5-10 minutes
- Model training (10 epochs): ~3-5 hours on RTX 4090

**🎯 After Training:**
- Start system: `start_all.bat` (Windows) or `./start_all.sh` (Linux/Mac)
- Dashboard: http://localhost:8050
- API: http://localhost:8000

---

**📊 Argus Metrics Framework:**
- 15 server metrics including cascade_impact
- 7 server profiles with inter-server dependencies
- Cascade failure simulation (database → dependent services)

In [None]:
# Cell 1: Setup and Configuration
import sys
import time
from pathlib import Path

# Add src/ to Python path (works from either root or Argus directory)
current_dir = Path.cwd()
if current_dir.name == 'Argus':
    # Notebook is in Argus folder
    argus_src = (current_dir / 'src').absolute()
    argus_root = current_dir
else:
    # Notebook is in root folder
    argus_src = (current_dir / 'Argus' / 'src').absolute()
    argus_root = current_dir / 'Argus'

if str(argus_src) not in sys.path:
    sys.path.insert(0, str(argus_src))

print("🎯 Tachyon Argus - TFT Training System")
print("=" * 70)
print("✅ Python path configured")
print(f"📁 Argus source: {argus_src}")
print(f"📁 Argus root: {argus_root}")
print("\n🔧 Configuration:")
print(f"   Training directory: {argus_root}/training/")
print(f"   Models directory: {argus_root}/models/")
print("   Prediction horizon: 96 steps (8 hours)")
print("   Context length: 288 steps (24 hours)")
print("   Metrics: 15 Argus metrics (including cascade_impact)")
print("=" * 70)
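
The same path-detection logic is repeated at the top of every cell in this notebook. As a minimal sketch, it could be factored into one helper; `resolve_argus_paths` is a hypothetical name used here for illustration and is not part of the Argus codebase.

```python
from pathlib import Path


def resolve_argus_paths(cwd):
    """Return (argus_root, argus_src) for a given working directory.

    Mirrors the notebook's rule: if the working directory is itself
    named 'Argus' it is the root, otherwise the root is ./Argus.
    """
    cwd = Path(cwd)
    root = cwd if cwd.name == "Argus" else cwd / "Argus"
    return root, (root / "src").absolute()
```

Calling `resolve_argus_paths(Path.cwd())` once and reusing the result would keep the cells consistent if the directory layout ever changes.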

---

## System Health Check

Verify your environment is ready for training:

In [None]:
# Cell 2: Comprehensive System Check
# Verify GPU, Python environment, dependencies, and system readiness

import sys
import platform
from pathlib import Path
import importlib.util

# Setup paths (same as Cell 1)
current_dir = Path.cwd()
if current_dir.name == 'Argus':
    argus_src = (current_dir / 'src').absolute()
    argus_root = current_dir
else:
    argus_src = (current_dir / 'Argus' / 'src').absolute()
    argus_root = current_dir / 'Argus'

if str(argus_src) not in sys.path:
    sys.path.insert(0, str(argus_src))

print("╔" + "═" * 68 + "╗")
print("║" + " " * 20 + "SYSTEM HEALTH CHECK" + " " * 29 + "║")
print("╚" + "═" * 68 + "╝")
print()

# ============================================================================
# 1. PYTHON ENVIRONMENT
# ============================================================================
print("┌─ Python Environment " + "─" * 47 + "┐")
print(f"│ Python Version:     {platform.python_version():<46}│")
print(f"│ Platform:           {platform.system()} {platform.release():<36}│")
print(f"│ Architecture:       {platform.machine():<46}│")

# Long paths are wrapped into fixed-width chunks. Defined up front so that
# both the working-directory and Argus-root branches can use it.
chunk_size = 60

# Working directory - handle long paths gracefully
cwd = str(Path.cwd())
if len(cwd) <= 45:
    print(f"│ Working Directory:  {cwd:<46}│")
else:
    print(f"│ Working Directory:                                          │")
    for i in range(0, len(cwd), chunk_size):
        chunk = cwd[i:i+chunk_size]
        print(f"│   {chunk:<64}│")

# Show Argus root detection
argus_root_str = str(argus_root)
if len(argus_root_str) <= 45:
    print(f"│ Argus Root:         {argus_root_str:<46}│")
else:
    print(f"│ Argus Root:                                                 │")
    for i in range(0, len(argus_root_str), chunk_size):
        chunk = argus_root_str[i:i+chunk_size]
        print(f"│   {chunk:<64}│")

print("└" + "─" * 68 + "┘")
print()

# ============================================================================
# 2. GPU AVAILABILITY & PYTORCH CUDA CHECK
# ============================================================================
print("┌─ GPU Status " + "─" * 54 + "┐")

gpu_available = False
gpu_name = "Not available"
gpu_memory = 0
cuda_version = "N/A"
torch_cuda_enabled = False
pytorch_installed = False

try:
    import torch
    pytorch_installed = True
    torch_cuda_enabled = torch.cuda.is_available()
    gpu_available = torch_cuda_enabled
    
    if torch_cuda_enabled:
        gpu_name = torch.cuda.get_device_name(0)
        gpu_memory = torch.cuda.get_device_properties(0).total_memory / (1024**3)
        cuda_version = torch.version.cuda
        
        print(f"│ ✅ GPU Detected:     {gpu_name[:45]:<45}│")
        print(f"│    CUDA Version:     {cuda_version:<46}│")
        print(f"│    Memory:           {gpu_memory:.1f} GB{' ' * 42}│")
        print(f"│    PyTorch CUDA:     Enabled{' ' * 40}│")

        try:
            import subprocess
            result = subprocess.run(['nvidia-smi', '--query-gpu=utilization.gpu,memory.used,memory.total',
                                   '--format=csv,noheader,nounits'],
                                  capture_output=True, text=True, timeout=2)
            if result.returncode == 0:
                # nvidia-smi prints one line per GPU; report the first device only
                first_gpu = result.stdout.strip().splitlines()[0]
                gpu_util, mem_used, mem_total = first_gpu.split(',')
                print(f"│    Utilization:      {gpu_util.strip()}%{' ' * 43}│")
                print(f"│    Memory Used:      {mem_used.strip()} MB / {mem_total.strip()} MB{' ' * 28}│")
        except Exception:
            # nvidia-smi is optional: skip live utilization if it is missing or fails
            pass

    else:
        print(f"│ ⚠️  PyTorch installed but CUDA not enabled{' ' * 24}│")
        print(f"│    PyTorch Version:  {torch.__version__:<46}│")
        print(f"│    Training will use CPU (20-40x slower){' ' * 25}│")

except ImportError:
    print(f"│ ❌ PyTorch not installed{' ' * 42}│")
    print(f"│   pip install torch --index-url{' ' * 34}│")
    print(f"│     https://download.pytorch.org/whl/cu121{' ' * 24}│")

print("└" + "─" * 68 + "┘")
print()

# ============================================================================
# 3. CRITICAL DEPENDENCIES
# ============================================================================
print("┌─ Critical Dependencies " + "─" * 43 + "┐")

dependencies = {
    'torch': 'PyTorch (Deep Learning)',
    'lightning': 'PyTorch Lightning (Training)',
    'pandas': 'Pandas (Data Processing)',
    'numpy': 'NumPy (Numerical Computing)',
    'pytorch_forecasting': 'PyTorch Forecasting (TFT Model)',
    'fastapi': 'FastAPI (Inference API)',
    'plotly': 'Plotly (Dashboard)',
    'dash': 'Dash (Dashboard Framework)'
}

missing_deps = []
installed_deps = []

for package, description in dependencies.items():
    spec = importlib.util.find_spec(package)
    if spec is not None:
        try:
            module = importlib.import_module(package)
            version = getattr(module, '__version__', 'unknown')
            status = "✅"
            installed_deps.append(package)
            pkg_display = f"{package} ({version})"
        except Exception:
            # Package is present but failed to import (e.g. a broken install)
            status = "⚠️"
            pkg_display = package
    else:
        status = "❌"
        missing_deps.append(package)
        pkg_display = package

    print(f"│ {status} {pkg_display:<63}│")

print("└" + "─" * 68 + "┘")
print()

# ============================================================================
# 4. DIRECTORY STRUCTURE
# ============================================================================
print("┌─ Directory Structure " + "─" * 46 + "┐")

required_dirs = {
    'training': argus_root / 'training',
    'models': argus_root / 'models',
    'checkpoints': argus_root / 'checkpoints',
    'logs': argus_root / 'logs'
}

for name, path in required_dirs.items():
    exists = path.exists()
    status = "✅" if exists else "⚠️"
    existence = "exists" if exists else "will be created"
    print(f"│ {status} {name + '/':20} {existence:<44}│")

print("└" + "─" * 68 + "┘")
print()

# ============================================================================
# 5. EXISTING MODELS CHECK
# ============================================================================
print("┌─ Existing Models " + "─" * 50 + "┐")

models_dir = argus_root / 'models'
if models_dir.exists():
    model_dirs = sorted(models_dir.glob('tft_model_*'), reverse=True)
    
    if model_dirs:
        print(f"│ Found {len(model_dirs)} trained model(s):{' ' * 40}│")
        for i, model_dir in enumerate(model_dirs[:3], 1):
            model_name = model_dir.name
            model_size = sum(f.stat().st_size for f in model_dir.rglob('*') if f.is_file()) / (1024**2)
            print(f"│   {i}. {model_name:<40} ({model_size:>6.1f} MB) │")
        if len(model_dirs) > 3:
            print(f"│   ... and {len(model_dirs) - 3} more{' ' * 44}│")
    else:
        print(f"│ No trained models found - ready for first training{' ' * 16}│")
else:
    print(f"│ Models directory will be created on first training{' ' * 16}│")

print("└" + "─" * 68 + "┘")
print()

# ============================================================================
# 6. OVERALL READINESS
# ============================================================================
print("╔" + "═" * 68 + "╗")

all_critical_deps = all(dep in installed_deps for dep in ['torch', 'lightning', 'pandas', 'pytorch_forecasting'])

if all_critical_deps and torch_cuda_enabled:
    print("║" + " " * 15 + "✅ SYSTEM READY FOR TRAINING" + " " * 24 + "║")
    # gpu_name is always defined above (it defaults to "Not available")
    gpu_short = gpu_name[:20]
    print(f"║" + " " * 15 + f"Estimated: 10 epochs ≈ 3-5 hours on {gpu_short}" + " " * max(0, 12 - len(gpu_short)) + "║")
elif all_critical_deps and pytorch_installed and not torch_cuda_enabled:
    print("║" + " " * 10 + "⚠️  PYTORCH INSTALLED WITHOUT CUDA SUPPORT" + " " * 16 + "║")
    print("║" + " " * 15 + "Training will be 20-40x slower on CPU" + " " * 15 + "║")
else:
    print("║" + " " * 12 + "❌ MISSING DEPENDENCIES - INSTALL FIRST" + " " * 17 + "║")

print("╚" + "═" * 68 + "╝")

---

## Dataset Generation

Creates realistic server metrics with:
- **7 server profiles** (ML, DB, Web, Conductor, ETL, Risk, Generic)
- **15 Argus metrics** including cascade_impact
- **Inter-server dependencies** (database failures cascade to dependent services)
- **Financial market hours patterns**

### Server Dependency Graph:
```
DATABASE (upstream)
    ├── WEB_API (connection_failures, request_queuing)
    ├── DATA_INGEST (connection_failures, request_queuing)
    ├── RISK_ANALYTICS (data_starvation, idle_resources)
    └── CONDUCTOR_MGMT (connection_failures)

CONDUCTOR_MGMT (upstream)
    └── ML_COMPUTE (idle_resources)

DATA_INGEST (upstream)
    └── ML_COMPUTE (data_starvation)
```
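
The graph above can be sketched as a plain adjacency mapping, which makes it easy to ask which profiles sit downstream of a failing one. The edges and impact types are copied from the diagram. The `DEPENDENCIES` name and the `downstream_profiles` helper are hypothetical, and whether the real generator propagates failures transitively (for example DATABASE → DATA_INGEST → ML_COMPUTE) is an assumption of this sketch.

```python
from collections import deque

# upstream profile -> list of (downstream profile, impact types), per the diagram
DEPENDENCIES = {
    "DATABASE": [
        ("WEB_API", ["connection_failures", "request_queuing"]),
        ("DATA_INGEST", ["connection_failures", "request_queuing"]),
        ("RISK_ANALYTICS", ["data_starvation", "idle_resources"]),
        ("CONDUCTOR_MGMT", ["connection_failures"]),
    ],
    "CONDUCTOR_MGMT": [("ML_COMPUTE", ["idle_resources"])],
    "DATA_INGEST": [("ML_COMPUTE", ["data_starvation"])],
}


def downstream_profiles(source):
    """Breadth-first walk: every profile reachable from `source`, in visit order."""
    seen = {source}
    order = []
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for child, _impacts in DEPENDENCIES.get(current, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


print(downstream_profiles("DATABASE"))
```

Under these assumptions a database failure reaches all four direct dependents and then ML_COMPUTE through either DATA_INGEST or CONDUCTOR_MGMT.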

**Adjust parameters below:**

In [None]:
# Cell 3: Generate Training Dataset
# Expected time: 24h=30-60s | 720h=15-20min (optimized with parallelization)

import sys
import time
from pathlib import Path
import pandas as pd

# Add src/ to Python path (works from either root or Argus directory)
current_dir = Path.cwd()
if current_dir.name == 'Argus':
    argus_src = (current_dir / 'src').absolute()
    argus_root = current_dir
else:
    argus_src = (current_dir / 'Argus' / 'src').absolute()
    argus_root = current_dir / 'Argus'

if str(argus_src) not in sys.path:
    sys.path.insert(0, str(argus_src))

# ============================================
# CONFIGURATION - SIMPLE TWO-PARAMETER SETUP
# ============================================

TRAINING_HOURS = 366        # Duration: 24 (1 day), 168 (1 week), 720 (30 days - production)
TOTAL_SERVERS = 45         # Fleet size: 20 (demo), 90 (default), 400 (production)

# Servers are AUTO-DISTRIBUTED across 7 profiles:
#   - Web/API:       28% (user-facing services)
#   - ML Compute:    22% (training workloads)
#   - Database:      17% (critical infrastructure - CASCADE SOURCE)
#   - Data Ingest:   11% (ETL pipelines)
#   - Risk Analytics: 9% (EOD calculations)
#   - Generic:        7% (utility, max 10)
#   - Conductor:      6% (orchestration)

TRAINING_DIR = str(argus_root / 'training')

# ============================================

print(f"🏢 Argus Dataset Generation")
print("-" * 70)
print(f"⚙️  Configuration:")
print(f"   Duration: {TRAINING_HOURS} hours ({TRAINING_HOURS/24:.1f} days)")
print(f"   Fleet size: {TOTAL_SERVERS} servers (auto-distributed across 7 profiles)")
print(f"   Output: {TRAINING_DIR}")

# Show expected distribution
print(f"\n📊 Expected Profile Distribution:")
dist = {
    'Web/API': int(TOTAL_SERVERS * 0.28),
    'ML Compute': int(TOTAL_SERVERS * 0.22),
    'Database': int(TOTAL_SERVERS * 0.17),
    'Data Ingest': int(TOTAL_SERVERS * 0.11),
    'Risk Analytics': int(TOTAL_SERVERS * 0.09),
    'Generic': min(int(TOTAL_SERVERS * 0.07), 10),
    'Conductor': int(TOTAL_SERVERS * 0.06)
}
for profile, count in dist.items():
    cascade_note = " (CASCADE SOURCE)" if profile == 'Database' else ""
    print(f"   {profile:<15} ~{count:>3} servers{cascade_note}")

# Estimate rows and time
expected_timestamps = TRAINING_HOURS * 3600 // 5  # 5-second intervals
expected_rows = expected_timestamps * TOTAL_SERVERS
print(f"\n📈 Expected Output:")
print(f"   ~{expected_rows:,} rows ({expected_timestamps:,} timestamps × {TOTAL_SERVERS} servers)")
print(f"   Parallelized generation: ~{TRAINING_HOURS // 24 * 2 + 1}-{TRAINING_HOURS // 24 * 4 + 2} minutes")
print()

print("🔗 Inter-Server Dependencies:")
print("   Database failures cascade to: Web/API, Data Ingest, Risk, Conductor")
print("   Impact types: connection_failures, request_queuing, data_starvation")
print("   cascade_impact metric: 0.0 (no impact) to 1.0 (full cascade)")
print()

_start = time.time()

# Import and run generator
from generators.metrics_generator import main as generate_metrics

# Set up command-line arguments - SIMPLE: just --servers and --hours
old_argv = sys.argv
sys.argv = [
    'metrics_generator.py',
    '--hours', str(TRAINING_HOURS),
    '--servers', str(TOTAL_SERVERS),  # Auto-distributes across profiles!
    '--out_dir', TRAINING_DIR,
    '--format', 'parquet'
]

try:
    generate_metrics()
    print("\n✅ Dataset generation complete!")
    success = True
except Exception as e:
    print(f"\n❌ Generation failed: {e}")
    import traceback
    traceback.print_exc()
    success = False
finally:
    sys.argv = old_argv

_elapsed = time.time() - _start
_mins = int(_elapsed // 60)
_secs = int(_elapsed % 60)
print(f"\n⏱️  Execution time: {_mins}m {_secs}s")

if success:
    # Show what was created
    training_path = Path(TRAINING_DIR)
    parquet_files = list(training_path.glob("*.parquet"))
    
    if parquet_files:
        latest = max(parquet_files, key=lambda p: p.stat().st_mtime)
        df = pd.read_parquet(latest)
        
        print(f"\n📊 Dataset Summary:")
        print(f"   File: {latest.name}")
        print(f"   Size: {latest.stat().st_size / (1024*1024):.1f} MB")
        print(f"   Records: {len(df):,}")
        print(f"   Servers: {df['server_name'].nunique()}")
        
        # Show actual profile distribution
        if 'profile' in df.columns:
            profile_counts = df.groupby('profile')['server_name'].nunique()
            print(f"\n   Profile Distribution:")
            for profile, count in profile_counts.sort_values(ascending=False).items():
                print(f"     {profile:<20} {count:>3} servers")
        
        # Show cascade_impact stats
        if 'cascade_impact' in df.columns:
            cascade_affected = (df['cascade_impact'] > 0).sum()
            cascade_pct = (cascade_affected / len(df)) * 100
            print(f"\n   Cascade Impact:")
            print(f"     Records affected: {cascade_affected:,} ({cascade_pct:.1f}%)")
            print(f"     Max intensity: {df['cascade_impact'].max():.2f}")
        
        print(f"\n   Time span: {(df['timestamp'].max() - df['timestamp'].min()).total_seconds() / 3600:.1f} hours")
        print(f"\n🎯 Ready for training!")
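
As a worked example of the preview arithmetic in this cell, the sketch below reproduces the profile shares and the row estimate. It only mirrors the notebook's preview: `preview_distribution` and `expected_rows` are hypothetical helper names, and how the real generator assigns servers to profiles is not shown in this notebook.

```python
# Shares copied from the configuration comments in the cell above
PROFILE_SHARES = {
    "Web/API": 0.28,
    "ML Compute": 0.22,
    "Database": 0.17,
    "Data Ingest": 0.11,
    "Risk Analytics": 0.09,
    "Generic": 0.07,
    "Conductor": 0.06,
}
GENERIC_CAP = 10  # "utility, max 10"


def preview_distribution(total_servers):
    """Truncate each share to a whole server count, as the preview does."""
    counts = {name: int(total_servers * share) for name, share in PROFILE_SHARES.items()}
    counts["Generic"] = min(counts["Generic"], GENERIC_CAP)
    return counts


def expected_rows(hours, servers, interval_s=5):
    """Rows = timestamps (one per interval) times servers."""
    return (hours * 3600 // interval_s) * servers


counts = preview_distribution(45)
print(counts, "previewed total:", sum(counts.values()))
print(f"{expected_rows(366, 45):,} rows")
```

Because every share is truncated with `int()`, the previewed counts can add up to fewer servers than `TOTAL_SERVERS`, which is why the preview prints them with a `~` prefix.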

---

## Dataset Explorer

Executive-level dataset analysis and visualization:

In [None]:
# Dataset Explorer - Executive Presentation View
# Professional analysis with visualizations suitable for C-suite presentations

import sys
from pathlib import Path
import pandas as pd
import numpy as np

# Setup paths
current_dir = Path.cwd()
if current_dir.name == 'Argus':
    argus_root = current_dir
else:
    argus_root = current_dir / 'Argus'

if str(argus_root / 'src') not in sys.path:
    sys.path.insert(0, str(argus_root / 'src'))

# Plotting imports
try:
    import plotly.graph_objects as go
    import plotly.express as px
    from plotly.subplots import make_subplots
    PLOTLY_AVAILABLE = True
except ImportError:
    PLOTLY_AVAILABLE = False
    print("⚠️  Plotly not available - visualizations disabled")
    print("   Install: pip install plotly")

# Find the most recent dataset
training_dir = argus_root / 'training'
parquet_files = list(training_dir.glob("*.parquet"))

if not parquet_files:
    print("❌ No dataset found. Please run the Dataset Generation cell first.")
else:
    latest_file = max(parquet_files, key=lambda p: p.stat().st_mtime)
    
    print("╔" + "═" * 68 + "╗")
    print("║" + " " * 18 + "DATASET ANALYSIS REPORT" + " " * 27 + "║")
    print("║" + " " * 15 + "Tachyon Argus Predictive Monitoring" + " " * 17 + "║")
    print("╚" + "═" * 68 + "╝")
    print()

    # Load dataset
    print(f"📂 Loading dataset: {latest_file.name}")
    df = pd.read_parquet(latest_file)
    print(f"✅ Loaded {len(df):,} records")
    print()
    
    # ========================================================================
    # EXECUTIVE SUMMARY
    # ========================================================================
    print("╔" + "═" * 68 + "╗")
    print("║" + " " * 22 + "EXECUTIVE SUMMARY" + " " * 29 + "║")
    print("╚" + "═" * 68 + "╝")
    print()
    
    file_size_mb = latest_file.stat().st_size / (1024 * 1024)
    time_span = (df['timestamp'].max() - df['timestamp'].min()).total_seconds() / 3600
    num_servers = df['server_name'].nunique()
    num_profiles = df['profile'].nunique() if 'profile' in df.columns else 0
    records_per_hour = len(df) / time_span if time_span > 0 else 0
    
    print(f"┌─ Dataset Metrics " + "─" * 50 + "┐")
    print(f"│ Total Records:          {len(df):>12,} samples{' ' * 24}│")
    print(f"│ File Size:              {file_size_mb:>12.1f} MB{' ' * 27}│")
    print(f"│ Time Span:              {time_span:>12.1f} hours ({time_span/24:.1f} days){' ' * 13}│")
    print(f"│ Sampling Rate:          {records_per_hour:>12.1f} records/hour{' ' * 16}│")
    print(f"│ Date Range:             {df['timestamp'].min().strftime('%Y-%m-%d %H:%M'):<33}│")
    print(f"│                    to   {df['timestamp'].max().strftime('%Y-%m-%d %H:%M'):<33}│")
    print("└" + "─" * 68 + "┘")
    print()
    
    # ========================================================================
    # FLEET COMPOSITION
    # ========================================================================
    print(f"┌─ Fleet Composition " + "─" * 48 + "┐")
    print(f"│ Total Servers:          {num_servers:>12} servers{' ' * 25}│")

    if 'profile' in df.columns:
        print(f"│ Server Profiles:        {num_profiles:>12} types{' ' * 27}│")
        print(f"│{' ' * 68}│")

        profile_counts = df.groupby('profile')['server_name'].nunique().sort_values(ascending=False)
        for profile, count in profile_counts.items():
            pct = (count / num_servers) * 100
            print(f"│  {profile[:20]:<20} {count:>3} ({pct:>5.1f}%) │")

    print("└" + "─" * 68 + "┘")
    print()
    
    # ========================================================================
    # METRICS COVERAGE (15 ARGUS METRICS)
    # ========================================================================
    print(f"┌─ Argus Metrics Coverage " + "─" * 43 + "┐")
    
    from core.nordiq_metrics import NORDIQ_METRICS, NUM_NORDIQ_METRICS
    
    available_metrics = [m for m in NORDIQ_METRICS if m in df.columns]
    coverage_pct = (len(available_metrics) / NUM_NORDIQ_METRICS) * 100
    
    print(f"│ Argus Metrics:          {len(available_metrics):>12} / {NUM_NORDIQ_METRICS} ({coverage_pct:.0f}%){' ' * 20}│")
    print(f"│{' ' * 68}│")
    
    # Group metrics by category (now includes cascade)
    metric_categories = {
        'CPU': ['cpu_user_pct', 'cpu_sys_pct', 'cpu_iowait_pct', 'cpu_idle_pct', 'java_cpu_pct'],
        'Memory': ['mem_used_pct', 'swap_used_pct'],
        'Disk': ['disk_usage_pct'],
        'Network': ['net_in_mb_s', 'net_out_mb_s'],
        'Connections': ['back_close_wait', 'front_close_wait'],
        'System': ['load_average', 'uptime_days'],
        'Cascade': ['cascade_impact']  # NEW!
    }
    
    for category, metrics in metric_categories.items():
        category_available = [m for m in metrics if m in df.columns]
        cat_pct = (len(category_available) / len(metrics)) * 100
        status = "✅" if cat_pct == 100 else "⚠️" if cat_pct > 0 else "❌"
        print(f"│  {status} {category:<15} {len(category_available):>2}/{len(metrics)} metrics ({cat_pct:>5.1f}%){' ' * 25}│")

    print("└" + "─" * 68 + "┘")
    print()
    
    # ========================================================================
    # CASCADE IMPACT ANALYSIS (NEW!)
    # ========================================================================
    if 'cascade_impact' in df.columns:
        print(f"┌─ Cascade Impact Analysis " + "─" * 42 + "┐")

        cascade_affected = (df['cascade_impact'] > 0).sum()
        cascade_pct = (cascade_affected / len(df)) * 100

        print(f"│ Records with cascade:   {cascade_affected:>12,} ({cascade_pct:.1f}%){' ' * 16}│")
        print(f"│ Max cascade intensity:  {df['cascade_impact'].max():>12.3f}{' ' * 26}│")
        print(f"│ Mean (when active):     {df[df['cascade_impact'] > 0]['cascade_impact'].mean():>12.3f}{' ' * 26}│")

        # Cascade by profile
        if 'profile' in df.columns:
            print(f"│{' ' * 68}│")
            print(f"│ Cascade Impact by Profile:{' ' * 40}│")
            cascade_by_profile = df[df['cascade_impact'] > 0].groupby('profile')['cascade_impact'].mean()
            for profile, avg_impact in cascade_by_profile.sort_values(ascending=False).head(5).items():
                print(f"│   {profile:<20} avg: {avg_impact:.3f}{' ' * 30}│")

        print("└" + "─" * 68 + "┘")
        print()
    
    # ========================================================================
    # DATA QUALITY METRICS
    # ========================================================================
    print(f"┌─ Data Quality " + "─" * 53 + "┐")

    total_cells = len(df) * len(available_metrics)
    missing_cells = df[available_metrics].isna().sum().sum()
    # Guard against division by zero when no Argus metric columns are present
    completeness = ((total_cells - missing_cells) / total_cells) * 100 if total_cells else 0.0

    print(f"│ Completeness:           {completeness:>12.2f}%{' ' * 28}│")
    print(f"│ Missing Values:         {missing_cells:>12,} cells{' ' * 24}│")

    # Check for duplicates
    duplicates = df.duplicated(subset=['timestamp', 'server_name']).sum()
    duplicate_pct = (duplicates / len(df)) * 100
    print(f"│ Duplicate Records:      {duplicates:>12,} ({duplicate_pct:.2f}%){' ' * 20}│")

    print("└" + "─" * 68 + "┘")
    print()
    
    # ========================================================================
    # STATISTICAL SUMMARY
    # ========================================================================
    print(f"┌─ Key Metrics Statistics " + "─" * 43 + "┐")
    print(f"│ {'Metric':<20} {'Mean':>10} {'Std':>10} {'Min':>10} {'Max':>10} │")
    print(f"│ {'-'*20} {'-'*10} {'-'*10} {'-'*10} {'-'*10} │")

    key_metrics = ['cpu_user_pct', 'mem_used_pct', 'disk_usage_pct', 'load_average', 'cascade_impact']
    for metric in key_metrics:
        if metric in df.columns:
            stats = df[metric].describe()
            print(f"│ {metric:<20} {stats['mean']:>10.2f} {stats['std']:>10.2f} {stats['min']:>10.2f} {stats['max']:>10.2f} │")

    print("└" + "─" * 68 + "┘")
    print()
    
    # ========================================================================
    # VISUALIZATIONS
    # ========================================================================
    if PLOTLY_AVAILABLE:
        print("╔" + "═" * 68 + "╗")
        print("║" + " " * 20 + "EXECUTIVE VISUALIZATIONS" + " " * 24 + "║")
        print("╚" + "═" * 68 + "╝")
        print()
        
        # 1. Fleet Distribution by Profile
        if 'profile' in df.columns:
            fig_fleet = px.pie(
                profile_counts.reset_index(), 
                values='server_name', 
                names='profile',
                title='Fleet Distribution by Server Profile',
                color_discrete_sequence=px.colors.qualitative.Set3
            )
            fig_fleet.update_layout(font=dict(size=14), height=500)
            fig_fleet.show()
        
        # 2. Cascade Impact Distribution (NEW!)
        if 'cascade_impact' in df.columns and df['cascade_impact'].sum() > 0:
            cascade_data = df[df['cascade_impact'] > 0]
            fig_cascade = px.histogram(
                cascade_data, 
                x='cascade_impact',
                color='profile' if 'profile' in df.columns else None,
                title='Cascade Impact Distribution by Profile',
                nbins=50
            )
            fig_cascade.update_layout(font=dict(size=12), height=400)
            fig_cascade.show()
        
        # 3. CPU Heatmap by Profile and Hour
        if 'cpu_user_pct' in df.columns and 'profile' in df.columns:
            df['hour'] = pd.to_datetime(df['timestamp']).dt.hour
            heatmap_data = df.groupby(['hour', 'profile'])['cpu_user_pct'].mean().reset_index()
            heatmap_pivot = heatmap_data.pivot(index='profile', columns='hour', values='cpu_user_pct')
            
            fig_heatmap = go.Figure(data=go.Heatmap(
                z=heatmap_pivot.values,
                x=heatmap_pivot.columns,
                y=heatmap_pivot.index,
                colorscale='RdYlGn_r',
                colorbar=dict(title="CPU %")
            ))
            fig_heatmap.update_layout(
                title='CPU Utilization by Profile and Hour',
                xaxis_title='Hour of Day',
                yaxis_title='Server Profile',
                height=400
            )
            fig_heatmap.show()
        
        print("✅ Executive visualizations generated")
    
    # ========================================================================
    # READINESS ASSESSMENT
    # ========================================================================
    print()
    print("╔" + "═" * 68 + "╗")
    
    is_ready = (
        len(df) >= 1000 and
        num_servers >= 5 and
        completeness >= 95.0 and
        len(available_metrics) >= 14  # At least 14 of 15 metrics
    )
    
    if is_ready:
        print("║" + " " * 15 + "✅ DATASET READY FOR TRAINING" + " " * 23 + "║")
        print(f"║" + " " * 10 + f"{len(df):,} records | {num_servers} servers | {len(available_metrics)}/15 metrics" + " " * 10 + "║")
    else:
        print("║" + " " * 12 + "⚠️  DATASET MAY NEED MORE DATA" + " " * 26 + "║")

    print("╚" + "═" * 68 + "╝")

---

## Model Training

Trains the Temporal Fusion Transformer with:
- Profile-based transfer learning
- GPU acceleration (if available)
- Early stopping to prevent overfitting
- **15 Argus metrics including cascade_impact**
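
To illustrate what early stopping does, here is a minimal, framework-free sketch of the usual patience rule: stop once the validation loss has failed to improve for a set number of consecutive epochs. The `early_stop_epoch` helper, the `patience` value, and the sample losses are illustrative assumptions. The actual criterion used inside `train_model` is not shown in this notebook.

```python
def early_stop_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None.

    An epoch counts as an improvement only if it beats the best
    validation loss seen so far by more than `min_delta`.
    """
    best = float("inf")
    stale_epochs = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss
            stale_epochs = 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                return epoch
    return None


# Loss improves for three epochs, then stalls for three more
print(early_stop_epoch([1.00, 0.80, 0.70, 0.71, 0.72, 0.73, 0.60]))
```

With `patience=3` the run above halts at epoch 6 and never sees the later improvement, which is the trade-off a larger patience value relaxes.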

**Adjust parameters below:**

In [None]:
# Cell 4: Train TFT Model
# Expected time: 10 epochs=3-5h | 20 epochs=6-10h
# STREAMING MODE: ~10x less memory usage for large datasets

import sys
import os
import time
from pathlib import Path

# Add src/ to Python path (works from either root or Argus directory)
current_dir = Path.cwd()
if current_dir.name == 'Argus':
    argus_src = (current_dir / 'src').absolute()
    argus_root = current_dir
else:
    argus_src = (current_dir / 'Argus' / 'src').absolute()
    argus_root = current_dir / 'Argus'

if str(argus_src) not in sys.path:
    sys.path.insert(0, str(argus_src))

# ============================================
# CONFIGURATION - ADJUST THESE VALUES
# ============================================

TRAINING_EPOCHS = 10      # Recommended: 10-20 epochs

# STREAMING MODE: Use for large datasets (30+ days, 90+ servers)
# - Loads time chunks one at a time instead of full dataset
# - Memory: ~2-4 GB instead of 130+ GB
USE_STREAMING_MODE = True  # Set to True for large datasets

# IMPORTANT: Training must run from Argus directory for paths to work correctly
original_dir = Path.cwd()

# ============================================

print(f"🤖 Tachyon Argus Model Training")
print("-" * 70)
print(f"⚙️  Configuration:")
print(f"   Epochs: {TRAINING_EPOCHS}")
print(f"   Dataset: ./training/ (relative to Argus/)")
print(f"   Mode: {'STREAMING (memory-efficient)' if USE_STREAMING_MODE else 'Standard (full dataset in memory)'}")
print(f"   Metrics: 15 Argus metrics (including cascade_impact)")
print()

# Estimate training time
est_mins_low = TRAINING_EPOCHS * 20
est_mins_high = TRAINING_EPOCHS * 30
if USE_STREAMING_MODE:
    est_mins_low = int(est_mins_low * 1.2)
    est_mins_high = int(est_mins_high * 1.2)
print(f"⏱️  Estimated time: {est_mins_low//60}h {est_mins_low%60}m - {est_mins_high//60}h {est_mins_high%60}m")
print(f"   (Based on ~20-30 minutes per epoch on RTX 4090)")
print()
print("🚀 Starting training...")
print()

_start = time.time()

# Import and run trainer
from training.tft_trainer import train_model

try:
    # CRITICAL: Change to Argus directory before training
    os.chdir(argus_root)
    print(f"[INFO] Working directory: {Path.cwd()}")
    
    model_path = train_model(
        dataset_path='./training/',
        epochs=TRAINING_EPOCHS,
        per_server=False,
        streaming=USE_STREAMING_MODE
    )
    
    if model_path:
        print("\n" + "=" * 70)
        print("✅ TRAINING COMPLETED SUCCESSFULLY!")
        print("=" * 70)
        print(f"📁 Model saved: {model_path}")
        print()
        print("🎯 Transfer Learning Enabled:")
        print("   ✅ Model learned patterns for each server profile")
        print("   ✅ Model learned cascade dependency patterns")
        print("   ✅ New servers get strong predictions from day 1")
        print("   ✅ No retraining needed when adding servers of known types")
        print()
        print("💡 Next Steps:")
        print("   1. Start system: start_all.bat (Windows) or ./start_all.sh (Linux/Mac)")
        print("   2. Open dashboard: http://localhost:8050")
        print("   3. API endpoint: http://localhost:8000")
    else:
        print("\n❌ Training failed - check logs above")

except Exception as e:
    print(f"\n❌ Training error: {e}")
    import traceback
    traceback.print_exc()
finally:
    os.chdir(original_dir)
    print(f"\n[INFO] Restored working directory: {Path.cwd()}")

_elapsed = time.time() - _start
_hours = int(_elapsed // 3600)
_mins = int((_elapsed % 3600) // 60)
_secs = int(_elapsed % 60)
print(f"\n⏱️  Execution time: {_hours}h {_mins}m {_secs}s")

---

## Training Complete!

### What you've built:

**Profile-Based Transfer Learning**
- Model learned patterns for 7 server profiles
- New servers get accurate predictions immediately
- No retraining needed for known server types

**Inter-Server Cascade Dependencies**
- Model understands database → service dependencies
- Predicts cascade impact propagation
- 15 Argus metrics including cascade_impact

**Production-Ready System**
- 8-hour forecast horizon (96 steps)
- Quantile uncertainty estimates (p10, p50, p90)
- Safetensors model format
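
The p10, p50 and p90 outputs are quantile forecasts. A minimal sketch of the pinball (quantile) loss commonly used to train such outputs follows, assuming the model is fitted with a loss of this kind. This notebook does not state which loss the trainer uses, and `pinball_loss` is a hypothetical helper name.

```python
def pinball_loss(y_true, y_pred, q):
    """Pinball loss for one observation at quantile level q (0 < q < 1).

    Under-prediction is weighted by q and over-prediction by (1 - q),
    so a high quantile such as p90 is pushed above most observations.
    """
    diff = y_true - y_pred
    return max(q * diff, (q - 1) * diff)


# Actual CPU is 80%. A p90 forecast of 70 (too low) costs far more
# than a p90 forecast of 90 (too high by the same margin).
print(pinball_loss(80.0, 70.0, q=0.9))
print(pinball_loss(80.0, 90.0, q=0.9))
```

The asymmetry is what makes the band between p10 and p90 usable as an uncertainty estimate: each quantile is penalised for landing on the wrong side of the observed value.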

---

### Launch the System:

**Windows:**
```bash
cd Argus
start_all.bat
```

**Linux/Mac:**
```bash
cd Argus
./start_all.sh
```

**Manual start (development):**
```bash
# Terminal 1 - Inference daemon
cd Argus
conda activate py310
python src/daemons/tft_inference_daemon.py --port 8000

# Terminal 2 - Metrics generator (with cascade scenario)
cd Argus
conda activate py310
python src/daemons/metrics_generator_daemon.py --stream --servers 20

# Terminal 3 - Dashboard
cd Argus
conda activate py310
python dash_app.py
```

---

### Access Points:

- **Dashboard:** http://localhost:8050
- **Inference API:** http://localhost:8000
- **Metrics Generator API:** http://localhost:8001
- **Health Check:** http://localhost:8000/health
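
Once the system is started, a quick way to confirm the inference daemon is up is to request the health endpoint listed above. The sketch below uses only the standard library and checks just the HTTP status, since the response body format is not documented here. `is_healthy` is a hypothetical helper name.

```python
import urllib.request


def is_healthy(url="http://localhost:8000/health", timeout=2.0):
    """Return True if the endpoint answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts and HTTP error statuses
        # (urllib's URLError and HTTPError both derive from OSError)
        return False
```

For example, `is_healthy()` can be called in a short retry loop after running the start script to wait until the daemon begins accepting requests.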

---

### Dashboard Scenario Buttons:

| Button | Description |
|--------|-------------|
| 🟢 Healthy | All servers healthy (force_healthy mode) |
| 🟡 Degrading | Gradual performance degradation |
| 🔴 Critical | Critical server failures |
| 🔗 Cascade | Database cascade failure simulation |

---

### Documentation:

- **[Argus/README.md](Argus/README.md)** - Complete system overview
- **[Argus/Docs/SERVER_PROFILES.md](Argus/Docs/SERVER_PROFILES.md)** - 7 server profiles explained
- **[Argus/Docs/GETTING_STARTED.md](Argus/Docs/GETTING_STARTED.md)** - Setup and configuration
- **[Docs/ARCHITECTURE_GUIDE.md](Docs/ARCHITECTURE_GUIDE.md)** - System architecture and data contract

---

**Your Tachyon Argus predictive monitoring system is ready!**