# 🚀 TFT Training Quick Start

## Simplified workflow for dataset creation and model training

This notebook does two things:
1. **Generate training dataset** - Create realistic server metrics data
2. **Train TFT model** - Train the Temporal Fusion Transformer

The dashboard and inference daemon handle everything else!

---

**⏱️ Estimated Times:**
- Dataset generation (24h): ~30-60 seconds
- Dataset generation (720h): ~5-10 minutes
- Model training (10 epochs): ~3-5 hours on RTX 4090

**🎯 After Training:**
- Start system: `start_all.bat` (Windows) or `./start_all.sh` (Linux/Mac)
- Dashboard: http://localhost:8501
- API: http://localhost:8000

In [None]:
# Cell 1: Setup and Configuration
import sys
import time
from pathlib import Path

# Add src/ to Python path (works from either root or NordIQ directory)
current_dir = Path.cwd()
if current_dir.name == 'NordIQ':
    # Notebook is in NordIQ folder
    nordiq_src = (current_dir / 'src').absolute()
else:
    # Notebook is in root folder
    nordiq_src = (current_dir / 'NordIQ' / 'src').absolute()

if str(nordiq_src) not in sys.path:
    sys.path.insert(0, str(nordiq_src))

print("🎯 TFT Training System")
print("=" * 70)
print("✅ Python path configured")
print(f"📁 NordIQ source: {nordiq_src}")
print("\n🔧 Configuration:")
print("   Training directory: ./training/")
print("   Models directory: ./models/")
print("   Prediction horizon: 96 steps (8 hours)")
print("   Context length: 288 steps (24 hours)")
print("=" * 70)

---
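
The configuration printed by Cell 1 gives the horizon and context length both in steps and in hours. The sketch below shows how the two relate. The 5-minute step is not stated anywhere in this notebook; it is inferred from "96 steps (8 hours)", and the names `STEP` and `steps_to_hours` are illustrative only.

```python
from datetime import timedelta

PREDICTION_STEPS = 96   # from the Cell 1 configuration
CONTEXT_STEPS = 288     # from the Cell 1 configuration

# Assumption: step size inferred from "96 steps (8 hours)", not a documented setting.
STEP = timedelta(hours=8) / PREDICTION_STEPS


def steps_to_hours(steps: int, step: timedelta = STEP) -> float:
    """Convert a number of model steps into wall-clock hours."""
    return steps * step.total_seconds() / 3600


print(f"Step size: {STEP}")
print(f"Horizon:   {steps_to_hours(PREDICTION_STEPS)} hours")
print(f"Context:   {steps_to_hours(CONTEXT_STEPS)} hours")
```

If the generator samples at a different interval, the hour figures above change accordingly, so treat the step size as something to confirm against the data.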

## Cell 2: Generate Training Dataset

Creates realistic server metrics with:
- 7 server profiles (ML, DB, Web, Conductor, ETL, Risk, Generic)
- Financial market hours patterns
- 14 LINBORG-compatible metrics

**Adjust parameters below:**

In [None]:
# Cell 2: Generate Training Dataset
# Expected time: 24h=30-60s | 720h=5-10min

import sys
import time
from pathlib import Path
import pandas as pd

# Add src/ to Python path (works from either root or NordIQ directory)
current_dir = Path.cwd()
if current_dir.name == 'NordIQ':
    # Notebook is in NordIQ folder
    nordiq_src = (current_dir / 'src').absolute()
else:
    # Notebook is in root folder
    nordiq_src = (current_dir / 'NordIQ' / 'src').absolute()

if str(nordiq_src) not in sys.path:
    sys.path.insert(0, str(nordiq_src))

# ============================================
# CONFIGURATION - ADJUST THESE VALUES
# ============================================

TRAINING_HOURS = 24        # Options: 24, 168, 720 (recommended: 720 for production)
NUM_ML_COMPUTE = 5         # ML training nodes
NUM_DATABASE = 4           # Database servers
NUM_WEB_API = 6            # Web/API servers
NUM_CONDUCTOR_MGMT = 1     # Conductor management
NUM_DATA_INGEST = 2        # ETL/streaming servers
NUM_RISK_ANALYTICS = 1     # Risk calculation servers
NUM_GENERIC = 1            # Generic/utility servers

TRAINING_DIR = './training'

# ============================================

print("🏢 Dataset Generation")
print("-" * 70)
print("⚙️  Configuration:")
print(f"   Duration: {TRAINING_HOURS} hours ({TRAINING_HOURS/24:.1f} days)")
total_servers = sum([
    NUM_ML_COMPUTE, NUM_DATABASE, NUM_WEB_API, NUM_CONDUCTOR_MGMT,
    NUM_DATA_INGEST, NUM_RISK_ANALYTICS, NUM_GENERIC,
])
print(f"   Total servers: {total_servers}")
print(f"   Output: {TRAINING_DIR}")
print()

_start = time.time()

# Import and run generator
from generators.metrics_generator import main as generate_metrics

# Set up command-line arguments for the generator
old_argv = sys.argv
sys.argv = [
    'metrics_generator.py',
    '--hours', str(TRAINING_HOURS),
    '--num_ml_compute', str(NUM_ML_COMPUTE),
    '--num_database', str(NUM_DATABASE),
    '--num_web_api', str(NUM_WEB_API),
    '--num_conductor_mgmt', str(NUM_CONDUCTOR_MGMT),
    '--num_data_ingest', str(NUM_DATA_INGEST),
    '--num_risk_analytics', str(NUM_RISK_ANALYTICS),
    '--num_generic', str(NUM_GENERIC),
    '--out_dir', TRAINING_DIR,
    '--format', 'parquet'
]

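success = False  # defined up front so later checks never hit a NameError
try:
    generate_metrics()
    success = True
except SystemExit as e:
    # A CLI-style main() may exit via sys.exit(); only code 0/None means success.
    success = e.code in (0, None)
    if not success:
        print(f"\n❌ Generator exited with code {e.code}")
except Exception as e:
    print(f"\n❌ Generation failed: {e}")
finally:
    sys.argv = old_argv  # always restore the notebook's original argv

if success:
    print("\n✅ Dataset generation complete!")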

_elapsed = time.time() - _start
_mins = int(_elapsed // 60)
_secs = int(_elapsed % 60)
print(f"\n⏱️  Execution time: {_mins}m {_secs}s")

if success:
    # Show what was created
    training_path = Path(TRAINING_DIR)
    parquet_files = list(training_path.glob("*.parquet"))
    
    if parquet_files:
        # Summarize the most recently written parquet file
        latest = max(parquet_files, key=lambda p: p.stat().st_mtime)
        df = pd.read_parquet(latest)
        span_hours = (df['timestamp'].max() - df['timestamp'].min()).total_seconds() / 3600

        print("\n📊 Dataset Summary:")
        print(f"   File: {latest.name}")
        print(f"   Size: {latest.stat().st_size / (1024*1024):.1f} MB")
        print(f"   Records: {len(df):,}")
        print(f"   Servers: {df['server_name'].nunique()}")
        print(f"   Profiles: {sorted(df['profile'].unique())}")
        print(f"   Time span: {span_hours:.1f} hours")
        print("\n🎯 Ready for training!")
    else:
        print(f"\n⚠️  No .parquet files found in {training_path.resolve()}")

---
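
To sanity-check the "Records" figure in the dataset summary, the following sketch estimates how many rows a run should produce and how many full training windows each server contributes. It rests on three assumptions not confirmed by this notebook: one record per server per step, a 5-minute step (inferred from "96 steps (8 hours)"), and a trainer that needs a full 288-step context plus 96-step horizon per sample. The function names are hypothetical.

```python
STEP_MINUTES = 5      # assumption: implied by "96 steps (8 hours)"
CONTEXT_STEPS = 288   # from the Cell 1 configuration
HORIZON_STEPS = 96    # from the Cell 1 configuration


def expected_rows(hours: int, servers: int, step_minutes: int = STEP_MINUTES) -> int:
    """Rows expected if every server emits exactly one record per step."""
    return servers * hours * 60 // step_minutes


def full_windows_per_server(
    hours: int,
    step_minutes: int = STEP_MINUTES,
    context: int = CONTEXT_STEPS,
    horizon: int = HORIZON_STEPS,
) -> int:
    """Sliding windows of length context + horizon that fit in one server's series."""
    steps = hours * 60 // step_minutes
    return max(0, steps - (context + horizon) + 1)


for hours in (24, 168, 720):
    rows = expected_rows(hours, servers=20)
    windows = full_windows_per_server(hours)
    print(f"{hours:>4}h: {rows:>8,} rows, {windows:>6,} full windows per server")
```

Under these assumptions a 24-hour dataset is shorter than one full context-plus-horizon window. The trainer may accept shorter contexts, but this is consistent with the configuration comment recommending 720 hours for production.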

## Cell 3: Train TFT Model

Trains the Temporal Fusion Transformer with:
- Profile-based transfer learning
- GPU acceleration (if available)
- Early stopping to prevent overfitting
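
As an illustration of the early-stopping idea in the list above, here is a minimal, framework-free sketch. It is not the trainer's implementation: the `EarlyStopping` class, its `patience` and `min_delta` parameters, and the loss values are all made up for the example.

```python
class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Hypothetical validation losses: improvement stalls after epoch 3.
losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.68, 0.60]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        print(f"Stopping after epoch {epoch}; best validation loss {stopper.best}")
        break
```

With early stopping active, a run may finish before the configured number of epochs, so the time estimates in this notebook are upper bounds in that case.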

**Adjust parameters below:**

In [None]:
# Cell 3: Train TFT Model
# Expected time: 10 epochs=3-5h | 20 epochs=6-10h

import sys
import time
from pathlib import Path

# Add src/ to Python path (works from either root or NordIQ directory)
current_dir = Path.cwd()
if current_dir.name == 'NordIQ':
    # Notebook is in NordIQ folder
    nordiq_src = (current_dir / 'src').absolute()
else:
    # Notebook is in root folder
    nordiq_src = (current_dir / 'NordIQ' / 'src').absolute()

if str(nordiq_src) not in sys.path:
    sys.path.insert(0, str(nordiq_src))

# ============================================
# CONFIGURATION - ADJUST THESE VALUES
# ============================================

TRAINING_EPOCHS = 10       # Recommended: 10-20 epochs
DATASET_PATH = './training'

# ============================================

print("🤖 Model Training")
print("-" * 70)
print("⚙️  Configuration:")
print(f"   Epochs: {TRAINING_EPOCHS}")
print(f"   Dataset: {DATASET_PATH}")
print("   Mode: Fleet-wide with profile-based transfer learning")
print()

# Estimate training time
est_mins_low = TRAINING_EPOCHS * 20
est_mins_high = TRAINING_EPOCHS * 30
print(f"⏱️  Estimated time: {est_mins_low//60}h {est_mins_low%60}m - {est_mins_high//60}h {est_mins_high%60}m")
print("   (Based on ~20-30 minutes per epoch on RTX 4090)")
print()
print("🚀 Starting training...")
print()

_start = time.time()

# Import and run trainer
from training.tft_trainer import train_model

try:
    model_path = train_model(
        dataset_path=DATASET_PATH,
        epochs=TRAINING_EPOCHS,
        per_server=False  # Fleet-wide training with profiles
    )
    
    if model_path:
        print("\n" + "=" * 70)
        print("✅ TRAINING COMPLETED SUCCESSFULLY!")
        print("=" * 70)
        print(f"📁 Model saved: {model_path}")
        print()
        print("🎯 Transfer Learning Enabled:")
        print("   ✅ Model learned patterns for each server profile")
        print("   ✅ New servers get strong predictions from day 1")
        print("   ✅ No retraining needed when adding servers of known types")
        print()
        print("💡 Next Steps:")
        print("   1. Start system: start_all.bat (Windows) or ./start_all.sh (Linux/Mac)")
        print("   2. Open dashboard: http://localhost:8501")
        print("   3. API endpoint: http://localhost:8000")
    else:
        print("\n❌ Training failed - check logs above")
        
except Exception as e:
    print(f"\n❌ Training error: {e}")
    import traceback
    traceback.print_exc()

_elapsed = time.time() - _start
_hours = int(_elapsed // 3600)
_mins = int((_elapsed % 3600) // 60)
_secs = int(_elapsed % 60)
print(f"\n⏱️  Execution time: {_hours}h {_mins}m {_secs}s")

---

## 🎉 Training Complete!

### What you've built:

✅ **Profile-Based Transfer Learning**
- Model learned patterns for 7 server profiles
- New servers with a known profile get predictions from day one
- No retraining needed for known server types

✅ **Production-Ready System**
- 8-hour forecast horizon (96 steps)
- Quantile uncertainty estimates (p10, p50, p90)
- 14 LINBORG-compatible metrics
- Safetensors model format
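
The p10/p50/p90 outputs listed above are quantile forecasts. Quantile forecasts are commonly scored with the pinball (quantile) loss; whether this trainer uses exactly that loss is not stated in the notebook, so the sketch below is a generic illustration with made-up numbers and a hypothetical `pinball_loss` helper.

```python
def pinball_loss(y_true: float, y_pred: float, q: float) -> float:
    """Pinball loss for one observation at quantile level q (0 < q < 1)."""
    diff = y_true - y_pred
    # Under-prediction is weighted by q, over-prediction by (1 - q).
    return max(q * diff, (q - 1) * diff)


# Hypothetical CPU forecast for one future step, in percent.
actual = 80.0
forecast = {0.1: 70.0, 0.5: 78.0, 0.9: 90.0}

for q, pred in forecast.items():
    print(f"p{int(q * 100)}: predicted {pred}, loss {pinball_loss(actual, pred, q):.2f}")

# The p10-p90 band is a nominal 80% interval around the median.
inside_band = forecast[0.1] <= actual <= forecast[0.9]
print("Actual value inside p10-p90 band:", inside_band)
```

The asymmetric weighting is what pushes the p90 estimate above most observations and the p10 estimate below them.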

---

### 🚀 Launch the System:

**Windows:**
```bat
start_all.bat
```

**Linux/Mac:**
```bash
./start_all.sh
```

**Manual start (development):**
```bash
# Terminal 1 - Inference daemon
conda activate py310
python NordIQ/src/daemons/tft_inference_daemon.py --port 8000

# Terminal 2 - Metrics generator
conda activate py310
python NordIQ/src/daemons/metrics_generator_daemon.py --stream --servers 20

# Terminal 3 - Dashboard
conda activate py310
streamlit run NordIQ/src/dashboard/tft_dashboard_web.py
```

---

### 📊 Access Points:

- **Dashboard:** http://localhost:8501
- **Inference API:** http://localhost:8000
- **Metrics Generator API:** http://localhost:8001
- **Health Check:** http://localhost:8000/health
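
A quick way to confirm the inference daemon is up is to poll the health-check URL listed above. This sketch uses only the standard library and checks for an HTTP 200 status; the response body format is not documented here, so it is ignored. The helper name `is_healthy` is hypothetical, and if the API requires a key (see the API key setup document) the request would need the appropriate header.

```python
import urllib.error
import urllib.request

HEALTH_URL = "http://localhost:8000/health"  # from the access points list


def is_healthy(url: str = HEALTH_URL, timeout: float = 2.0) -> bool:
    """Return True if the endpoint answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers connection refused, timeouts, and non-2xx responses.
        return False


print("Inference daemon healthy:", is_healthy())
```

Returning a plain boolean keeps the check easy to reuse in a retry loop while the daemon is still loading the model.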

---

### 📚 Documentation:

- **[NordIQ/README.md](NordIQ/README.md)** - Complete system overview
- **[NordIQ/Docs/SERVER_PROFILES.md](NordIQ/Docs/SERVER_PROFILES.md)** - 7 server profiles explained
- **[NordIQ/Docs/API_KEY_SETUP.md](NordIQ/Docs/API_KEY_SETUP.md)** - Security configuration
- **[NordIQ/Docs/DATA_CONTRACT.md](NordIQ/Docs/DATA_CONTRACT.md)** - Metrics schema

---

### 🔄 Incremental Training:

To add more training epochs later (recommended for continuous learning):

```bash
# Add 1-5 epochs per week
python NordIQ/src/training/tft_trainer.py --epochs 5 --incremental
```

Training resumes from your existing model and adds the requested epochs instead of starting over.

---

**🎉 Your predictive monitoring system is ready!**