# 🛡️ NIDS-ML - Complete Pipeline (Pre-Training)

**Version:** 2.0 - Optimized and Modular

### 📋 Pipeline Steps:
1. **Environment Detection** - Detects whether the notebook runs on Kaggle or locally
2. **Repository Setup** - Downloads `src/` (Kaggle only)
3. **Dataset Management** - Imports the raw dataset
4. **Preprocessing** - Cleans and transforms the data
5. **Feature Engineering** - Statistical preprocessing + RobustScaler + RF feature selection
6. **Validation** - Verifies that the artifacts and datasets are ready
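
Step 5 relies on RobustScaler. As a rough illustration of why that choice suits features with heavy outliers, the sketch below reproduces by hand what scikit-learn's `RobustScaler` computes with its default settings: subtract the median, then divide by the interquartile range. The helper name `robust_scale` and the sample values are made up for illustration and are not taken from `src/feature_engineering.py`.

```python
from statistics import median, quantiles


def robust_scale(values):
    """Center on the median and scale by the interquartile range (IQR)."""
    # method="inclusive" interpolates linearly between data points
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        iqr = 1.0  # constant column: keep the centered values unscaled
    center = median(values)
    return [(v - center) / iqr for v in values]


# Hypothetical feature column with one extreme outlier
sample = [1, 2, 3, 4, 100]
scaled = robust_scale(sample)
print(scaled)  # [-1.0, -0.5, 0.0, 0.5, 48.5]
```

The outlier stays large after scaling, but it does not distort the median or the IQR, so the bulk of the values keeps a usable spread.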

### ✨ Features:
- ✅ **Auto-detection**: Kaggle vs. local
- ✅ **Path Management**: Automatic handling of dataset and script paths
- ✅ **Clean Run**: Deletes previous results (configurable)
- ✅ **Checkpoints**: Intermediate outputs are saved for debugging
- ✅ **Modular**: Each step is independent and reusable

---

**Next Notebooks:**
- `nids_training_random_forest.ipynb` - Tuning Random Forest
- `nids_training_xgboost.ipynb` - Tuning XGBoost
- `nids_training_lightgbm.ipynb` - Tuning LightGBM

## 🔧 1. PIPELINE CONFIGURATION
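
The configuration cell in this section checks `N_FEATURES` and `RF_ESTIMATORS` against fixed ranges. One possible way to keep the values and their checks together is a small dataclass. The sketch below is only an illustration: `PipelineConfig` is a hypothetical name, and the bounds are copied from the notebook's validation block.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical container for the feature-engineering settings."""
    n_features: int = 30
    rf_estimators: int = 100

    def __post_init__(self):
        # Same bounds as the notebook's validation block
        if not 5 <= self.n_features <= 100:
            raise ValueError(
                f"n_features must be between 5 and 100 (got {self.n_features})"
            )
        if not 10 <= self.rf_estimators <= 1000:
            raise ValueError(
                f"rf_estimators must be between 10 and 1000 (got {self.rf_estimators})"
            )


config = PipelineConfig()
print(config)

try:
    PipelineConfig(n_features=3)
except ValueError as exc:
    print(f"Rejected: {exc}")
```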

In [None]:
# ==========================================
# GLOBAL CONFIGURATION
# ==========================================

# --- Clean Run ---
CLEAN_RUN = True  # If True: delete data/processed and artifacts before starting

# --- Repository (Kaggle only) ---
REPO_URL = "https://github.com/riiccardob/nids-ml-ssr2"
BRANCH = "main"

# --- Dataset Paths ---
# On Kaggle: path of the input dataset
KAGGLE_DATASET_PATH = "/kaggle/input/network-intrusion-dataset/"

# Locally: path relative to the repo
LOCAL_DATASET_PATH = "data/raw"  # Assumes the dataset is already in data/raw

# --- Feature Engineering Config ---
USE_STATISTICAL = True   # Statistical preprocessing (RECOMMENDED)
USE_ROBUST = True        # RobustScaler (RECOMMENDED)
N_FEATURES = 30          # Number of features to select
RF_ESTIMATORS = 100      # Random Forest trees used for feature importance

# ==========================================
# VALIDATION
# ==========================================

if N_FEATURES < 5 or N_FEATURES > 100:
    raise ValueError(f"❌ N_FEATURES must be between 5 and 100 (value: {N_FEATURES})")

if RF_ESTIMATORS < 10 or RF_ESTIMATORS > 1000:
    raise ValueError(f"❌ RF_ESTIMATORS must be between 10 and 1000 (value: {RF_ESTIMATORS})")

print("="*70)
print("✅ PIPELINE CONFIGURATION")
print("="*70)
print(f"Clean Run:              {CLEAN_RUN}")
print(f"Statistical Preproc:    {USE_STATISTICAL}")
print(f"RobustScaler:           {USE_ROBUST}")
print(f"Feature Selection:      Random Forest ({N_FEATURES} features, {RF_ESTIMATORS} trees)")
print("="*70)

## 🌍 2. ENVIRONMENT DETECTION & SETUP
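
The cell in this section decides between Kaggle and local execution by checking whether `KAGGLE_KERNEL_RUN_TYPE` is set. The same check can be wrapped in a function that takes the environment as an argument, which makes it easy to try without a Kaggle session. This is a minimal sketch: `detect_environment` is a hypothetical helper, and the value `"Interactive"` is only an example, since any non-empty value counts.

```python
import os
from pathlib import Path


def detect_environment(environ=os.environ):
    """Return (is_kaggle, working_dir) based on KAGGLE_KERNEL_RUN_TYPE."""
    is_kaggle = bool(environ.get("KAGGLE_KERNEL_RUN_TYPE"))
    working_dir = Path("/kaggle/working") if is_kaggle else Path.cwd()
    return is_kaggle, working_dir


# Simulated Kaggle session, then a plain local one
print(detect_environment({"KAGGLE_KERNEL_RUN_TYPE": "Interactive"}))
print(detect_environment({}))
```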

In [None]:
# ==========================================
# ENVIRONMENT DETECTION
# ==========================================

import os
import sys
import shutil
from pathlib import Path

# Detect Kaggle: a non-empty KAGGLE_KERNEL_RUN_TYPE means we are in a Kaggle kernel
IS_KAGGLE = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '') != ''

# Choose the working directory and the dataset source
if IS_KAGGLE:
    WORKING_DIR = Path("/kaggle/working")
    DATASET_SOURCE = KAGGLE_DATASET_PATH
    print("🚀 Environment: KAGGLE")
else:
    WORKING_DIR = Path.cwd()
    DATASET_SOURCE = LOCAL_DATASET_PATH
    print("💻 Environment: LOCAL")

print(f"📂 Working Directory: {WORKING_DIR}")
print(f"📦 Dataset Source:    {DATASET_SOURCE}")

# Create the working folders
DIRS_STRUCTURE = [
    "data/raw",
    "data/processed",
    "artifacts",
    "logs/timing",
    "models"
]

for directory in DIRS_STRUCTURE:
    dir_path = WORKING_DIR / directory
    dir_path.mkdir(parents=True, exist_ok=True)

print("✅ Folder structure created")

## 📥 3. REPOSITORY SETUP (Kaggle Only)
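
The setup cell downloads the branch archive, unpacks everything, and then moves only `src/` into place. A possible alternative that avoids the external `unzip` call is to pull just the needed members out of the archive with the standard `zipfile` module. The sketch below is an illustration under stated assumptions: `extract_subfolder` is a hypothetical helper, the demo archive is built on the fly to mimic the `<repo>-<branch>/` layout, and the code trusts the archive contents (it does not guard against malicious member paths).

```python
import tempfile
import zipfile
from pathlib import Path


def extract_subfolder(zip_path, prefix, dest):
    """Extract only the archive members under `prefix` into `dest`."""
    dest = Path(dest)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir() or not info.filename.startswith(prefix):
                continue
            # Drop the archive's leading folders from the output path
            target = dest / info.filename[len(prefix):]
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(archive.read(info))
            extracted.append(target)
    return extracted


# Demo with a tiny archive that mimics the "<repo>-<branch>/" layout
with tempfile.TemporaryDirectory() as tmp:
    zip_path = Path(tmp) / "repo.zip"
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("nids-ml-ssr2-main/src/preprocessing.py", "print('hi')\n")
        archive.writestr("nids-ml-ssr2-main/README.md", "# readme\n")
    files = extract_subfolder(zip_path, "nids-ml-ssr2-main/src/", Path(tmp) / "src")
    extracted_names = [f.name for f in files]
    print(extracted_names)  # ['preprocessing.py']
```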

In [None]:
# ==========================================
# REPOSITORY & REQUIREMENTS SETUP
# ==========================================

import subprocess

if IS_KAGGLE:
    print("📥 Setting up the repository for Kaggle...")

    # Paths
    repo_zip = "repo.zip"
    repo_folder = f"nids-ml-ssr2-{BRANCH}"
    src_dir = WORKING_DIR / "src"

    # Download only if src/ does not exist yet
    if not src_dir.exists():
        print(f"  Downloading {REPO_URL}...")

        # Download the branch ZIP; check=True raises if wget fails
        download_url = f"{REPO_URL}/archive/refs/heads/{BRANCH}.zip"
        subprocess.run(["wget", "-q", download_url, "-O", repo_zip], check=True)

        # Extract
        subprocess.run(["unzip", "-qo", repo_zip], check=True)

        # Move only src/ and fail early if the archive does not contain it
        extracted_src = Path(repo_folder) / "src"
        if not extracted_src.exists():
            raise FileNotFoundError(f"src/ not found in the downloaded archive: {extracted_src}")
        shutil.move(str(extracted_src), str(src_dir))
        print("  ✅ src/ folder imported")

        # Move requirements.txt if present
        extracted_req = Path(repo_folder) / "requirements.txt"
        if extracted_req.exists():
            shutil.move(str(extracted_req), str(WORKING_DIR / "requirements.txt"))

        # Cleanup
        if Path(repo_folder).exists():
            shutil.rmtree(repo_folder)
        if Path(repo_zip).exists():
            os.remove(repo_zip)
    else:
        print("  ⏩ src/ already present, skipping download")

    # Install dependencies with the interpreter that runs this notebook
    requirements_file = WORKING_DIR / "requirements.txt"
    if requirements_file.exists():
        print("  📦 Installing dependencies...")
        subprocess.run(
            [sys.executable, "-m", "pip", "install", "-q", "-r", str(requirements_file)],
            check=True,
        )
        print("  ✅ Dependencies installed")

    # Add src/ to the import path
    sys.path.insert(0, str(src_dir))
    print(f"✅ Repository ready: {src_dir}")

else:
    # Local: check that src/ exists
    src_dir = WORKING_DIR / "src"
    if not src_dir.exists():
        print("⚠️  WARNING: src/ folder not found!")
        print("    Make sure you are in the repository root.")
        raise FileNotFoundError(f"src/ not found in {WORKING_DIR}")

    sys.path.insert(0, str(src_dir))
    print(f"✅ Local repository: {src_dir}")

## 🧹 4. CLEAN RUN MANAGER
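
A clean run amounts to deleting each output folder and recreating it empty. The sketch below isolates that operation in a hypothetical helper, `reset_dir`, and exercises it on a temporary directory so that nothing in the real project tree is touched.

```python
import shutil
import tempfile
from pathlib import Path


def reset_dir(path):
    """Delete `path` if it exists, then recreate it empty."""
    path = Path(path)
    if path.exists():
        shutil.rmtree(path)
    path.mkdir(parents=True, exist_ok=True)
    return path


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "data" / "processed"
    out.mkdir(parents=True)
    (out / "train.parquet").write_text("stale")  # leftover from a previous run
    reset_dir(out)
    remaining = list(out.iterdir())
    print(remaining)  # []
```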

In [None]:
# ==========================================
# CLEAN RUN - Remove Previous Outputs
# ==========================================

CLEAN_DIRS = [
    "data/processed",
    "artifacts"
]

if CLEAN_RUN:
    print("🧹 CLEAN_RUN = True: cleaning output folders...")

    for directory in CLEAN_DIRS:
        dir_path = WORKING_DIR / directory
        if dir_path.exists():
            print(f"  🗑️  Removing: {directory}")
            shutil.rmtree(dir_path)
        # Recreate the folder even if it was missing
        dir_path.mkdir(parents=True, exist_ok=True)

    print("✅ Cleanup completed - the pipeline starts from scratch")
else:
    print("⏩ CLEAN_RUN = False: keeping previous results (if any)")

## 📦 5. DATASET IMPORT
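
The import cell walks the Kaggle input folder and copies every CSV into a flat `data/raw`, skipping a file when the destination name is already taken. If the input folder ever contained two CSVs with the same name in different subfolders, the second would be dropped silently. The sketch below shows one possible variation that keeps both by prefixing the parent folder name. `import_csvs` and the demo file names are hypothetical, and this is not the notebook's actual behavior.

```python
import shutil
import tempfile
from pathlib import Path


def import_csvs(source, dest):
    """Copy every *.csv under `source` into `dest`, keeping names unique."""
    source, dest = Path(source), Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for src_path in sorted(source.rglob("*")):
        if not src_path.is_file() or src_path.suffix.lower() != ".csv":
            continue
        dst_path = dest / src_path.name
        if dst_path.exists():
            # Same file name in another subfolder: prefix the parent folder name
            dst_path = dest / f"{src_path.parent.name}_{src_path.name}"
        shutil.copy2(src_path, dst_path)
        copied.append(dst_path.name)
    return copied


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "input"
    for sub in ("a", "b"):
        (source / sub).mkdir(parents=True)
        (source / sub / "day1.csv").write_text("x,y\n1,2\n")
    (source / "b" / "notes.txt").write_text("not a dataset")
    copied = import_csvs(source, Path(tmp) / "raw")
    print(sorted(copied))  # ['b_day1.csv', 'day1.csv']
```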

In [None]:
# ==========================================
# IMPORT RAW DATASET
# ==========================================

RAW_DIR = WORKING_DIR / "data/raw"

print(f"📦 Importing dataset into {RAW_DIR}...")

# Look for CSV files that are already present
existing_csv = list(RAW_DIR.glob("*.csv"))

if len(existing_csv) > 0:
    print(f"  ✅ Dataset already present: {len(existing_csv)} CSV files")
    for csv_file in existing_csv:
        print(f"     - {csv_file.name}")
else:
    print(f"  📥 Importing dataset from {DATASET_SOURCE}...")

    if IS_KAGGLE:
        # Kaggle: copy from /kaggle/input
        if not Path(DATASET_SOURCE).exists():
            raise FileNotFoundError(f"Dataset not found: {DATASET_SOURCE}")

        copied_count = 0
        for root, dirs, files in os.walk(DATASET_SOURCE):
            for file in files:
                if file.lower().endswith(".csv"):
                    src_path = Path(root) / file
                    dst_path = RAW_DIR / file

                    if not dst_path.exists():
                        shutil.copy2(src_path, dst_path)
                        copied_count += 1
                        print(f"     ✓ Copied: {file}")

        print(f"  ✅ Imported {copied_count} CSV files")
    else:
        # Local: this branch runs only when RAW_DIR holds no CSV files,
        # so there is nothing to import and the run cannot continue
        if not RAW_DIR.exists():
            raise FileNotFoundError(f"Dataset folder not found: {RAW_DIR}")
        raise FileNotFoundError(f"No CSV files found in {RAW_DIR}")

# Final check: fail here rather than later inside preprocessing
final_csv = list(RAW_DIR.glob("*.csv"))
if not final_csv:
    raise FileNotFoundError(f"No CSV files available in {RAW_DIR}")
print(f"\n✅ Dataset ready: {len(final_csv)} CSV files in {RAW_DIR}")

## 🔄 6. PREPROCESSING
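
The preprocessing cell follows a checkpoint pattern: skip the step when all expected outputs exist, otherwise run the script and verify that the outputs were produced. The sketch below captures that pattern in a hypothetical `run_step` helper. The command used in the demo is a stand-in that merely creates an empty file; it is not `src/preprocessing.py`.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_step(command, outputs, force=False):
    """Run `command` unless every path in `outputs` already exists."""
    if not force and all(Path(p).exists() for p in outputs):
        return "skipped"
    completed = subprocess.run(command)
    if completed.returncode != 0:
        raise RuntimeError(f"Step failed with exit code {completed.returncode}")
    missing = [str(p) for p in outputs if not Path(p).exists()]
    if missing:
        raise FileNotFoundError(f"Missing outputs: {missing}")
    return "ran"


with tempfile.TemporaryDirectory() as tmp:
    marker = Path(tmp) / "train.parquet"
    # Stand-in for the real script: just creates the expected output file
    command = [sys.executable, "-c", f"open({str(marker)!r}, 'w').close()"]
    first = run_step(command, [marker])
    second = run_step(command, [marker])
    print(first, second)  # ran skipped
```

Passing `force=True` plays the role that `CLEAN_RUN` plays in the notebook: the step runs even when its outputs are already present.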

In [None]:
# ==========================================
# PREPROCESSING
# ==========================================

import subprocess

print("="*70)
print("🔄 STEP 1: PREPROCESSING")
print("="*70)

# Check whether the step has already been done
processed_dir = WORKING_DIR / "data/processed"
required_files = ["train.parquet", "val.parquet", "test.parquet"]
all_exist = all((processed_dir / f).exists() for f in required_files)

if all_exist and not CLEAN_RUN:
    print("⏩ Preprocessing already completed (parquet files exist)")
    print("   To rerun: set CLEAN_RUN = True")
else:
    print("🚀 Starting preprocessing.py...\n")

    # Run the script from WORKING_DIR with the notebook's own interpreter;
    # returncode is the real exit code (os.system returns a raw wait status on Unix)
    result = subprocess.run([sys.executable, "src/preprocessing.py"], cwd=WORKING_DIR)

    if result.returncode != 0:
        raise RuntimeError(f"❌ Preprocessing failed with exit code {result.returncode}")

    print("\n✅ Preprocessing completed successfully")

# Verify the outputs
for file in required_files:
    file_path = processed_dir / file
    if file_path.exists():
        size_mb = file_path.stat().st_size / (1024 * 1024)
        print(f"  ✓ {file:<20} ({size_mb:.1f} MB)")
    else:
        raise FileNotFoundError(f"Missing file: {file_path}")

## ⚙️ 7. FEATURE ENGINEERING
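
The feature engineering cell translates the notebook settings into command-line flags for `src/feature_engineering.py`. Building the command as a list of arguments, rather than one joined string, avoids quoting problems when it is handed to `subprocess`. The sketch below assumes only the four flags that appear in the notebook. `build_fe_command` is a hypothetical helper, and how the script parses these flags is not shown here.

```python
import shlex
import sys


def build_fe_command(use_statistical, use_robust, n_features, rf_estimators):
    """Assemble the argument list for src/feature_engineering.py."""
    cmd = [sys.executable, "src/feature_engineering.py"]
    if use_statistical:
        cmd.append("--use-statistical")
    if use_robust:
        cmd.append("--use-robust")
    cmd += ["--n-features", str(n_features), "--rf-estimators", str(rf_estimators)]
    return cmd


cmd = build_fe_command(True, False, 30, 100)
# shlex.join gives a copy-pasteable string for logging only
print(shlex.join(cmd[1:]))
```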

In [None]:
# ==========================================
# FEATURE ENGINEERING
# ==========================================

import subprocess

print("="*70)
print("⚙️  STEP 2: FEATURE ENGINEERING")
print("="*70)

# Check whether the step has already been done
artifacts_dir = WORKING_DIR / "artifacts"
required_artifacts = [
    "scaler.pkl",
    "selected_features.json",
    "feature_importances.json"
]
artifacts_exist = all((artifacts_dir / f).exists() for f in required_artifacts)

ready_dir = processed_dir
ready_files = ["train_ready.parquet", "val_ready.parquet", "test_ready.parquet"]
ready_exist = all((ready_dir / f).exists() for f in ready_files)

if artifacts_exist and ready_exist and not CLEAN_RUN:
    print("⏩ Feature Engineering already completed")
    print("   To rerun: set CLEAN_RUN = True")
else:
    print("🚀 Starting feature_engineering.py...\n")

    # Build the command as an argument list
    cmd_parts = [sys.executable, "src/feature_engineering.py"]

    if USE_STATISTICAL:
        cmd_parts.append("--use-statistical")

    if USE_ROBUST:
        cmd_parts.append("--use-robust")

    cmd_parts.extend([
        "--n-features", str(N_FEATURES),
        "--rf-estimators", str(RF_ESTIMATORS)
    ])

    print(f"Command: {' '.join(cmd_parts)}\n")

    # Run from WORKING_DIR; returncode is the script's real exit code
    result = subprocess.run(cmd_parts, cwd=WORKING_DIR)

    if result.returncode != 0:
        raise RuntimeError(f"❌ Feature Engineering failed with exit code {result.returncode}")

    print("\n✅ Feature Engineering completed successfully")

# Verify the outputs
print("\n📋 Generated artifacts:")
for artifact in required_artifacts:
    artifact_path = artifacts_dir / artifact
    if artifact_path.exists():
        size_kb = artifact_path.stat().st_size / 1024
        print(f"  ✓ {artifact:<30} ({size_kb:.1f} KB)")
    else:
        raise FileNotFoundError(f"Missing artifact: {artifact_path}")

print("\n📋 Datasets ready for training:")
for file in ready_files:
    file_path = ready_dir / file
    if file_path.exists():
        size_mb = file_path.stat().st_size / (1024 * 1024)
        print(f"  ✓ {file:<30} ({size_mb:.1f} MB)")
    else:
        raise FileNotFoundError(f"Missing file: {file_path}")

## ✅ 8. VALIDATION & SUMMARY
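
The summary cell ranks the entries of `feature_importances.json` and prints the ten highest. The sketch below shows the same ranking on a tiny in-memory mapping, assuming the file maps feature names to numeric scores (which is what the notebook's use of `.items()` implies). The feature names and scores are invented for the example, and `top_features` is a hypothetical helper.

```python
import json
from heapq import nlargest


def top_features(importances, k=10):
    """Return the k (name, score) pairs with the highest scores, best first."""
    return nlargest(k, importances.items(), key=lambda item: item[1])


# Invented scores standing in for the contents of feature_importances.json
raw = json.loads('{"Flow Duration": 0.12, "Fwd Packets": 0.31, "Bwd Packets": 0.07}')
top2 = top_features(raw, k=2)
for rank, (name, score) in enumerate(top2, 1):
    print(f"{rank:2}. {name:<20} {score:.6f}")
```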

In [None]:
# ==========================================
# FINAL VALIDATION & SUMMARY
# ==========================================

import json
import pandas as pd

print("="*70)
print("✅ PIPELINE VALIDATION")
print("="*70)

# 1. Check the ready datasets
print("\n1. Dataset Ready:")
for dataset_name in ["train", "val", "test"]:
    file_path = processed_dir / f"{dataset_name}_ready.parquet"
    df = pd.read_parquet(file_path)
    print(f"  {dataset_name.upper():5} - Shape: {df.shape[0]:>7,} samples x {df.shape[1]:>2} features")

# 2. Selected features
print("\n2. Feature Selection:")
with open(artifacts_dir / "selected_features.json") as f:
    selected_features = json.load(f)
print(f"  Selected: {len(selected_features)} features")

# 3. Top-10 importances
print("\n3. Top-10 Feature Importances:")
with open(artifacts_dir / "feature_importances.json") as f:
    importances = json.load(f)

sorted_features = sorted(importances.items(), key=lambda x: x[1], reverse=True)[:10]
for i, (feat, score) in enumerate(sorted_features, 1):
    print(f"  {i:2}. {feat:<40} {score:.6f}")

# 4. Scaler info
print("\n4. Scaler:")
scaler_type = "RobustScaler" if USE_ROBUST else "StandardScaler"
print(f"  Type: {scaler_type}")
statistical_info_path = artifacts_dir / "statistical_preprocessing_info.json"
if statistical_info_path.exists():
    with open(statistical_info_path) as f:
        stat_info = json.load(f)
    print("  Statistical Preprocessing: ENABLED")
    if 'summary' in stat_info:
        summary = stat_info['summary']
        print(f"    - Features reduced: {summary.get('reduction_percent', 0):.1f}%")
        print(f"    - Removed by variance: {stat_info.get('step1_variance', {}).get('removed_count', 0)}")
        print(f"    - Removed by correlation: {stat_info.get('step2_correlation', {}).get('removed_count', 0)}")
else:
    print("  Statistical Preprocessing: DISABLED")

print("\n" + "="*70)
print("🎉 PIPELINE COMPLETED SUCCESSFULLY!")
print("="*70)
print("\n📍 Next Steps:")
print("  1. Tuning Random Forest:  nids_training_random_forest.ipynb")
print("  2. Tuning XGBoost:        nids_training_xgboost.ipynb")
print("  3. Tuning LightGBM:       nids_training_lightgbm.ipynb")
print("\n💾 Available Outputs:")
print(f"  - Dataset: {processed_dir}")
print(f"  - Artifacts: {artifacts_dir}")
print(f"  - Logs: {WORKING_DIR / 'logs'}")

## 💾 9. EXPORT ARTIFACTS (Optional - Kaggle Only)
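
The export cell shells out to `zip` to bundle `artifacts` and `data/processed`. Where the `zip` binary is not available, the standard `zipfile` module can produce an equivalent archive. The sketch below is one possible approach, not the notebook's method: `export_dirs` is a hypothetical helper, and the demo files are empty placeholders.

```python
import tempfile
import zipfile
from pathlib import Path


def export_dirs(base_dir, dirs, zip_name):
    """Zip the given sub-directories of `base_dir`, keeping relative paths."""
    base_dir = Path(base_dir)
    zip_path = base_dir / zip_name
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for rel_dir in dirs:
            folder = base_dir / rel_dir
            if not folder.is_dir():
                continue  # nothing to export for this entry
            for path in sorted(folder.rglob("*")):
                if path.is_file():
                    archive.write(path, path.relative_to(base_dir).as_posix())
    return zip_path


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "artifacts").mkdir()
    (base / "artifacts" / "scaler.pkl").write_bytes(b"")
    (base / "data" / "processed").mkdir(parents=True)
    (base / "data" / "processed" / "train_ready.parquet").write_bytes(b"")
    zip_path = export_dirs(base, ["artifacts", "data/processed", "models"], "pipeline_artifacts.zip")
    with zipfile.ZipFile(zip_path) as archive:
        names = sorted(archive.namelist())
    print(names)
```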

In [None]:
# ==========================================
# EXPORT ARTIFACTS - Kaggle Only
# ==========================================

if IS_KAGGLE:
    print("📦 Creating artifacts archive for download...")

    OUTPUT_ZIP = "pipeline_artifacts.zip"
    DIRS_TO_EXPORT = ["artifacts", "data/processed"]

    existing_dirs = [d for d in DIRS_TO_EXPORT if (WORKING_DIR / d).exists()]

    if existing_dirs:
        # Create the zip
        dirs_str = " ".join(existing_dirs)
        os.chdir(WORKING_DIR)
        os.system(f"zip -qr {OUTPUT_ZIP} {dirs_str}")

        zip_path = WORKING_DIR / OUTPUT_ZIP
        if zip_path.exists():
            size_mb = zip_path.stat().st_size / (1024 * 1024)
            print(f"✅ Archive created: {OUTPUT_ZIP} ({size_mb:.1f} MB)")
            # f-string so that the archive name is actually interpolated
            print(f"   Download the file from the Kaggle interface (Output > {OUTPUT_ZIP})")
        else:
            print("⚠️  Error while creating the archive")
    else:
        print("⚠️  No artifacts to export")
else:
    print("💻 Local environment: export not needed")
    print(f"   Artifacts available in: {artifacts_dir}")