# Synthetic Gym Dataset Generator v2.0 - Set-Level (Experience-Aware)

**Project**: Multi-module ML system for physical-training decision support  
**Version**: 2.0 (Experience-Aware Behavioral Patterns)  
**Author**: Alessandro Ambrosio

---

## **CHANGELOG v1.0 → v2.0**

### **Problems Identified in v1.0**
- Skip rate always 0.0 (no dropout pattern)
- Consistency score always 1.0 (uninformative)
- RPE variability uniform across levels
- Spike frequency random (no self-regulation)
- Excessive overlap between total-sets ranges

### **Changes in v2.0 (Scientifically Grounded)**

1. **Skip Rate Experience-Aware** (Sperandei et al. 2016)
   - Beginner: 12-15% skip rate
   - Intermediate: 6-9%
   - Advanced: 4-6%

2. **RPE Calibration** (Day et al. 2004, Helms et al. 2016)
   - Beginner: std=1.5, bias +0.5 (they overestimate)
   - Advanced: std=0.6, bias -0.2 (they underestimate)

3. **Spike Self-Regulation** (Gabbett 2016)
   - Beginner: 18-22% spike weeks
   - Advanced: 5-8% spike weeks

4. **Training History Duration** (Rønnestad & Mujika 2014)
   - Beginner: 3-9 months of history
   - Advanced: 18-36 months of history

5. **Consistency Post-Processing**
   - Beginner: 65-80%
   - Advanced: 85-95%
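
The RPE calibration item can be illustrated with a minimal sketch. It assumes the reported RPE is the true RPE plus a level-specific bias and Gaussian noise, then clipped to [1, 10] and quantized to 0.5 steps. The helper name `report_rpe` is hypothetical, and the Intermediate values are taken from this notebook's configuration; the generator's actual RPE logic may combine further terms.

```python
import numpy as np

# Per-level parameters as listed in this notebook (Intermediate from the config cell).
RPE_PARAMS = {
    "Beginner": {"noise_std": 1.5, "bias": 0.5},
    "Intermediate": {"noise_std": 1.0, "bias": 0.0},
    "Advanced": {"noise_std": 0.6, "bias": -0.2},
}

def report_rpe(true_rpe, level, rng, step=0.5):
    """Hypothetical helper: add level bias and noise, clip to [1, 10], quantize."""
    p = RPE_PARAMS[level]
    noisy = true_rpe + p["bias"] + rng.normal(0.0, p["noise_std"], size=np.shape(true_rpe))
    clipped = np.clip(noisy, 1.0, 10.0)
    return np.round(clipped / step) * step

rng = np.random.default_rng(0)
true_rpe = np.full(20_000, 7.0)  # same underlying effort for both groups
beginner = report_rpe(true_rpe, "Beginner", rng)
advanced = report_rpe(true_rpe, "Advanced", rng)
print(f"Beginner: mean={beginner.mean():.2f}, sd={beginner.std():.2f}")
print(f"Advanced: mean={advanced.mean():.2f}, sd={advanced.std():.2f}")
```

With identical true effort, the Beginner sample should show both a higher mean and a wider spread than the Advanced one, which is the pattern the changelog targets.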

---

## **Key References**

- **Banister (1975)**: Fitness-Fatigue impulse-response model
- **Gabbett (2016)**: ACWR spike detection (threshold 1.5-1.6)
- **Day et al. (2004)**: Novices overestimate RPE
- **Helms et al. (2016)**: Experienced lifters show 60% lower RPE variability
- **Sperandei et al. (2016)**: Novice dropout rate 2.5x that of experienced lifters
- **Rønnestad & Mujika (2014)**: Training-history volume predicts performance

---


# **CELL 2 - Setup & Imports**

In [3]:
# ═══════════════════════════════════════════════════════════
# SETUP ENVIRONMENT
# ═══════════════════════════════════════════════════════════

!pip -q install pandas numpy

import os, json, math
from dataclasses import dataclass
from pathlib import Path
from datetime import date, timedelta, datetime
import numpy as np
import pandas as pd

print("[OK] Libraries imported")
print(f"[OK] NumPy version: {np.__version__}")
print(f"[OK] Pandas version: {pd.__version__}")


[OK] Libraries imported
[OK] NumPy version: 2.0.2
[OK] Pandas version: 2.2.2


## **LOAD EXISTING DATASET (skip generation)**

In [4]:
# ═══════════════════════════════════════════════════════════
# LOAD EXISTING DATASET (skip generation)
# ═══════════════════════════════════════════════════════════

LOAD_EXISTING = False  # ← Set True to load from disk, False to regenerate

if LOAD_EXISTING:
    print("="*80)
    print("LOADING EXISTING DATASET v2.0")
    print("="*80)

    DATADIR = Path('data/synth_set_level_v2')

    # Check if files exist
    required_files = [
        'users.csv',
        'workout_sets.csv',
        'workouts.csv',
        'workout_plan.csv',
        'banister_daily.csv',
        'banister_meta.csv',
        'exercises.csv'
    ]

    missing = [f for f in required_files if not (DATADIR / f).exists()]

    if missing:
        print(f"[!] Missing files: {missing}")
        print("-> Set LOAD_EXISTING = False and re-run generator")
        raise FileNotFoundError("Dataset files not found")

    # Load all CSVs
    df_users = pd.read_csv(DATADIR / 'users.csv')
    df_exercises = pd.read_csv(DATADIR / 'exercises.csv')
    df_workouts = pd.read_csv(DATADIR / 'workouts.csv')
    df_plan = pd.read_csv(DATADIR / 'workout_plan.csv')
    df_sets = pd.read_csv(DATADIR / 'workout_sets.csv')
    df_banister_daily = pd.read_csv(DATADIR / 'banister_daily.csv')
    df_banister_meta = pd.read_csv(DATADIR / 'banister_meta.csv')

    # Parse dates
    df_workouts['date'] = pd.to_datetime(df_workouts['date'])
    df_sets['date'] = pd.to_datetime(df_sets['date'])
    df_banister_daily['date'] = pd.to_datetime(df_banister_daily['date'])

    print(f"\n[OK] Dataset loaded from {DATADIR}")
    print(f"[OK] Users: {len(df_users):,}")
    print(f"[OK] Workouts: {len(df_workouts):,}")
    print(f"[OK] Sets: {len(df_sets):,}")

    # Skip generation cells
    print("\n[!] SKIP CELLS 13-16 (already executed previously)")
    print("="*80)


# **CELL 3 - Configuration v2.0**

In [5]:
# ═══════════════════════════════════════════════════════════
# CONFIGURATION v2.0 (Experience-Aware Parameters)
# ═══════════════════════════════════════════════════════════

@dataclass
class CFG:
    # Random seed
    seed: int = 99999
    out_dir: str = "data/synth_set_level_v2"

    # User generation
    n_users: int = 1500

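    # Date ranges (PER-USER, stratified by experience at generation time)
    # NOTE: date.today() is evaluated once, when the class is defined;
    # pin a fixed date here if runs must be reproducible across days.
    today: date = date.today()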
    end_date_max_days_ahead: int = 730

    # Training schedule
    weekly_freq_mu: float = 3.5
    weekly_freq_sd: float = 1.0
    weekly_freq_min: int = 1
    weekly_freq_max: int = 6
    weekday_jitter_probs = [0.15, 0.70, 0.15]  # [-1, 0, +1]

    # Quantization
    load_step: float = 0.25
    rpe_step: float = 0.5

    # ════════════════════════════════════════════════════════
    # SKIP MODEL v2.0 (Experience-Aware)
    # ════════════════════════════════════════════════════════

    # Baseline skip probability per level (Sperandei et al. 2016)
    skip_p0_by_level: dict = None

    # Fatigue sensitivity per level
    skip_fatigue_sensitivity: dict = None

    skip_fatigue_cap: float = 1.2
    skip_noise_sd: float = 0.10

    # ════════════════════════════════════════════════════════
    # RPE CALIBRATION v2.0 (Day et al. 2004, Helms et al. 2016)
    # ════════════════════════════════════════════════════════

    rpe_params_by_level: dict = None

    # ════════════════════════════════════════════════════════
    # SPIKE ACWR v2.0 (Gabbett 2016 + Experience Auto-regulation)
    # ════════════════════════════════════════════════════════

    spike_enabled: bool = True

    # Spike/deload probability per level
    spike_deload_probs: dict = None

    spike_acwr_min: float = 1.35
    spike_acwr_max: float = 1.85
    deload_acwr_min: float = 0.65
    deload_acwr_max: float = 0.85
    normal_acwr_mean: float = 1.05
    normal_acwr_std: float = 0.10

    # ════════════════════════════════════════════════════════
    # TRAINING HISTORY DURATION (Rønnestad & Mujika 2014)
    # ════════════════════════════════════════════════════════

    duration_days_by_level: dict = None

    # ════════════════════════════════════════════════════════
    # CONSISTENCY TARGET RANGES
    # ════════════════════════════════════════════════════════

    consistency_targets: dict = None

    # Injury
    injury_lambda: float = 0.002
    injury_days_min: int = 7
    injury_days_max: int = 28

    # Missingness
    p_missing_rpe: float = 0.02
    p_missing_load: float = 0.01
    p_missing_feedback: float = 0.02

    # Banister
    tau_F_mean: float = 45.0
    tau_F_sd: float = 8.0
    tau_D_mean: float = 7.0
    tau_D_sd: float = 2.0
    beta_F: float = 0.010
    beta_D: float = 0.015

    # Progressive overload
    overload_base_rate: dict = None
    overload_I0: float = 1800.0
    overload_quality_fatigue: float = 0.6
    transfer_same_muscle: float = 0.35
    transfer_same_split: float = 0.12

    # User profiles
    user_profiles: dict = None


# ═══════════════════════════════════════════════════════════
# INITIALIZE CONFIG v2.0
# ═══════════════════════════════════════════════════════════

cfg = CFG()

# ────────────────────────────────────────────────────────────
# Skip Model Parameters (Experience-Aware)
# ────────────────────────────────────────────────────────────

cfg.skip_p0_by_level = {
    'Beginner': 0.13,       # Target: 12-15% skip rate
    'Intermediate': 0.075,  # Target: 6-9% skip rate
    'Advanced': 0.05        # Target: 4-6% skip rate
}

cfg.skip_fatigue_sensitivity = {
    'Beginner': 0.35,       # More sensitive to fatigue
    'Intermediate': 0.25,
    'Advanced': 0.15        # Less sensitive
}

# ────────────────────────────────────────────────────────────
# RPE Calibration Parameters (Experience-Aware)
# ────────────────────────────────────────────────────────────

cfg.rpe_params_by_level = {
    'Beginner': {
        'noise_std': 1.5,    # High variability
        'bias': 0.5          # Overestimates effort
    },
    'Intermediate': {
        'noise_std': 1.0,
        'bias': 0.0
    },
    'Advanced': {
        'noise_std': 0.6,    # Low variability
        'bias': -0.2         # Slightly underestimates
    }
}

# ────────────────────────────────────────────────────────────
# Spike/Deload Probabilities (Experience-Aware)
# ────────────────────────────────────────────────────────────

cfg.spike_deload_probs = {
    'Beginner': {
        'deload': 0.03,      # 3% (do not know when to deload)
        'spike': 0.20        # 20% (poor self-regulation)
    },
    'Intermediate': {
        'deload': 0.05,      # 5% (baseline)
        'spike': 0.12        # 12% (good load management)
    },
    'Advanced': {
        'deload': 0.08,      # 8% (deloads are planned)
        'spike': 0.06        # 6% (optimal control)
    }
}

# ────────────────────────────────────────────────────────────
# Training History Duration (Experience-Aware)
# ────────────────────────────────────────────────────────────

cfg.duration_days_by_level = {
    'Beginner': (90, 270),       # 3-9 months
    'Intermediate': (240, 600),  # 8-20 months
    'Advanced': (540, 1080)      # 18-36 months
}

# ────────────────────────────────────────────────────────────
# Consistency Target Ranges
# ────────────────────────────────────────────────────────────

cfg.consistency_targets = {
    'Beginner': (0.65, 0.80),
    'Intermediate': (0.75, 0.90),
    'Advanced': (0.85, 0.95)
}

# ────────────────────────────────────────────────────────────
# Progressive Overload Rates
# ────────────────────────────────────────────────────────────

cfg.overload_base_rate = {
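    # NOTE: simulate_user() clips the sampled growth_rate to at most 0.005,
    # so the Beginner value below saturates at that cap.
    'Beginner': 0.0500,    # +5% rate (newbie gains)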
    'Intermediate': 0.0050,
    'Advanced': 0.0040
}

# ────────────────────────────────────────────────────────────
# User Profiles (independent of experience)
# ────────────────────────────────────────────────────────────

USER_PROFILES = {
    'balanced': {
        'desc': "Balanced volume/intensity",
        'sets_mult': 1.0,
        'rpe_mult': 1.0,
        'skip_mult': 1.0,
        'discipline_base': 0.75
    },
    'high_volume': {
        'desc': "High volume, moderate intensity",
        'sets_mult': 1.35,
        'rpe_mult': 0.80,
        'skip_mult': 1.15,
        'discipline_base': 0.65
    },
    'high_intensity': {
        'desc': "Low volume, high intensity",
        'sets_mult': 0.75,
        'rpe_mult': 1.25,
        'skip_mult': 0.85,
        'discipline_base': 0.80
    },
    'inconsistent': {
        'desc': "Irregular, skips often",
        'sets_mult': 1.0,
        'rpe_mult': 0.95,
        'skip_mult': 2.5,
        'discipline_base': 0.45
    }
}

cfg.user_profiles = USER_PROFILES

print("="*80)
print("CONFIGURATION v2.0 LOADED")
print("="*80)
print(f"[OK] Users: {cfg.n_users}")
print(f"[OK] Seed: {cfg.seed}")
print(f"[OK] Output dir: {cfg.out_dir}")
print("\n[OK] Experience-Aware Parameters:")
print(f"  - Skip rates: {cfg.skip_p0_by_level}")
print(f"  - RPE calibration: {len(cfg.rpe_params_by_level)} levels")
print(f"  - Spike probs: {cfg.spike_deload_probs}")
print(f"  - Duration ranges: {cfg.duration_days_by_level}")
print("="*80)

# Create RNG
rng = np.random.default_rng(cfg.seed)

# Create output directory
OUTDIR = Path(cfg.out_dir)
OUTDIR.mkdir(parents=True, exist_ok=True)

print(f"\n[OK] RNG initialized with seed {cfg.seed}")
print(f"[OK] Output directory created: {OUTDIR}")


CONFIGURATION v2.0 LOADED
[OK] Users: 1500
[OK] Seed: 99999
[OK] Output dir: data/synth_set_level_v2

[OK] Experience-Aware Parameters:
  - Skip rates: {'Beginner': 0.13, 'Intermediate': 0.075, 'Advanced': 0.05}
  - RPE calibration: 3 levels
  - Spike probs: {'Beginner': {'deload': 0.03, 'spike': 0.2}, 'Intermediate': {'deload': 0.05, 'spike': 0.12}, 'Advanced': {'deload': 0.08, 'spike': 0.06}}
  - Duration ranges: {'Beginner': (90, 270), 'Intermediate': (240, 600), 'Advanced': (540, 1080)}

[OK] RNG initialized with seed 99999
[OK] Output directory created: data/synth_set_level_v2
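
The skip-model parameters above (`skip_p0_by_level`, `skip_fatigue_sensitivity`, `skip_fatigue_cap`) suggest a logistic formulation. The following is only a sketch of one plausible way to combine them: the baseline probability is moved to logit space, shifted by capped fatigue times the level sensitivity, and mapped back with a sigmoid. The function `skip_probability` is hypothetical, and the noise term (`skip_noise_sd`) and profile multiplier (`skip_mult`) are left out to keep the example deterministic.

```python
import math

# Values copied from the configuration cell.
SKIP_P0 = {"Beginner": 0.13, "Intermediate": 0.075, "Advanced": 0.05}
SKIP_SENS = {"Beginner": 0.35, "Intermediate": 0.25, "Advanced": 0.15}
SKIP_FATIGUE_CAP = 1.2

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def logit(p: float) -> float:
    return math.log(p / (1.0 - p))

def skip_probability(level: str, fatigue: float) -> float:
    """Hypothetical combination: baseline in logit space plus capped fatigue effect."""
    capped = min(max(fatigue, 0.0), SKIP_FATIGUE_CAP)
    return sigmoid(logit(SKIP_P0[level]) + SKIP_SENS[level] * capped)

for level in SKIP_P0:
    rested = skip_probability(level, 0.0)
    tired = skip_probability(level, 1.2)
    print(f"{level:12s}: rested={rested:.3f}  fatigued={tired:.3f}")
```

Working in logit space keeps the result inside (0, 1) for any fatigue value, and at zero fatigue the function returns the configured baseline unchanged.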


# **CELL 4 - Utility Functions**

In [6]:
# ═══════════════════════════════════════════════════════════
# UTILITY FUNCTIONS
# ═══════════════════════════════════════════════════════════

def sigmoid(z: float) -> float:
    """Sigmoid function for probability."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p: float) -> float:
    """Inverse sigmoid: logit(p) = ln(p/(1-p))"""
    return math.log(p / (1 - p))

def q_load(x: float, step: float) -> float:
    """Quantize load to nearest step (e.g., 0.25 kg)."""
    if x is None or (isinstance(x, float) and np.isnan(x)):
        return np.nan
    return float(np.round(x / step) * step)

def q_rpe(x: float, step: float) -> float:
    """Quantize RPE to nearest step (e.g., 0.5) and clip [1, 10]."""
    if x is None or (isinstance(x, float) and np.isnan(x)):
        return np.nan
    x = float(np.clip(x, 1.0, 10.0))
    return float(np.round(x / step) * step)

def clamp_int(x, lo, hi):
    """Clamp and cast to int."""
    return int(np.clip(int(round(x)), lo, hi))

def sample_split(rng):
    """Sample split type: PPL (70%) or FullBody (30%)."""
    return str(rng.choice(['PPL', 'FullBody'], p=[0.7, 0.3]))

def exp_weights(L: int, tau: float) -> np.ndarray:
    """Exponential weights for Banister model."""
    idx = np.arange(L, dtype=float)
    return np.exp(-idx / float(tau))

print("[OK] Utility functions loaded (7 functions)")


[OK] Utility functions loaded (7 functions)
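
As an illustration of how `exp_weights` can feed a Banister-style fitness-fatigue calculation, here is a minimal sketch. It assumes net readiness is `beta_F * fitness_trace - beta_D * fatigue_trace`, where each trace is a causal convolution of the daily impulse series with exponential weights. The function `banister_net` is hypothetical; its defaults are the population means from the configuration, whereas the simulator samples `tau_F` and `tau_D` per user.

```python
import numpy as np

def exp_weights(L: int, tau: float) -> np.ndarray:
    """Exponential decay weights exp(-k / tau) for k = 0..L-1."""
    idx = np.arange(L, dtype=float)
    return np.exp(-idx / float(tau))

def banister_net(impulses, tau_F=45.0, tau_D=7.0, beta_F=0.010, beta_D=0.015):
    """Hypothetical sketch: return (fitness, fatigue, net) for a daily impulse series."""
    impulses = np.asarray(impulses, dtype=float)
    n = len(impulses)
    # Causal convolution: day t sums impulses[t - k] * exp(-k / tau) for k = 0..t.
    fitness = beta_F * np.convolve(impulses, exp_weights(n, tau_F))[:n]
    fatigue = beta_D * np.convolve(impulses, exp_weights(n, tau_D))[:n]
    return fitness, fatigue, fitness - fatigue

# One training impulse on day 0, then 59 days of rest.
impulses = np.zeros(60)
impulses[0] = 1800.0
fitness, fatigue, net = banister_net(impulses)
first_positive_day = int(np.argmax(net > 0))
print(f"net on day 0: {net[0]:.2f}")
print(f"first day with positive net effect: {first_positive_day}")
```

Because fatigue has the larger gain but the shorter time constant, the net effect of a single session starts negative and turns positive a few days later.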


# **CELL 5 - Exercise Catalog**

In [7]:
# ═══════════════════════════════════════════════════════════
# EXERCISE CATALOG (Unchanged from v1.0)
# ═══════════════════════════════════════════════════════════

EXERCISES = [
    # Push exercises
    {"exercise_id": 1, "exercise_name": "Bench Press", "target_muscle_group": "Chest", "split_cat": "Push"},
    {"exercise_id": 2, "exercise_name": "Incline Dumbbell Press", "target_muscle_group": "Chest", "split_cat": "Push"},
    {"exercise_id": 3, "exercise_name": "Overhead Press", "target_muscle_group": "Shoulders", "split_cat": "Push"},
    {"exercise_id": 4, "exercise_name": "Lateral Raises", "target_muscle_group": "Shoulders", "split_cat": "Push"},
    {"exercise_id": 5, "exercise_name": "Tricep Dips", "target_muscle_group": "Triceps", "split_cat": "Push"},
    {"exercise_id": 6, "exercise_name": "Tricep Pushdowns", "target_muscle_group": "Triceps", "split_cat": "Push"},

    # Pull exercises
    {"exercise_id": 7, "exercise_name": "Pull-ups", "target_muscle_group": "Back", "split_cat": "Pull"},
    {"exercise_id": 8, "exercise_name": "Bent-Over Rows", "target_muscle_group": "Back", "split_cat": "Pull"},
    {"exercise_id": 9, "exercise_name": "Lat Pulldowns", "target_muscle_group": "Back", "split_cat": "Pull"},
    {"exercise_id": 10, "exercise_name": "Face Pulls", "target_muscle_group": "Back", "split_cat": "Pull"},
    {"exercise_id": 11, "exercise_name": "Barbell Curls", "target_muscle_group": "Biceps", "split_cat": "Pull"},
    {"exercise_id": 12, "exercise_name": "Hammer Curls", "target_muscle_group": "Biceps", "split_cat": "Pull"},

    # Legs exercises
    {"exercise_id": 13, "exercise_name": "Squats", "target_muscle_group": "Quads", "split_cat": "Legs"},
    {"exercise_id": 14, "exercise_name": "Leg Press", "target_muscle_group": "Quads", "split_cat": "Legs"},
    {"exercise_id": 15, "exercise_name": "Lunges", "target_muscle_group": "Quads", "split_cat": "Legs"},
    {"exercise_id": 16, "exercise_name": "Romanian Deadlifts", "target_muscle_group": "Hamstrings", "split_cat": "Legs"},
    {"exercise_id": 17, "exercise_name": "Leg Curls", "target_muscle_group": "Hamstrings", "split_cat": "Legs"},
    {"exercise_id": 18, "exercise_name": "Calf Raises", "target_muscle_group": "Calves", "split_cat": "Legs"},

    # FullBody compound
    {"exercise_id": 19, "exercise_name": "Deadlifts", "target_muscle_group": "Back", "split_cat": "FullBody"},
    {"exercise_id": 20, "exercise_name": "Power Cleans", "target_muscle_group": "Full", "split_cat": "FullBody"},
]

df_exercises = pd.DataFrame(EXERCISES)

print("="*80)
print("EXERCISE CATALOG LOADED")
print("="*80)
print(f"[OK] Total exercises: {len(df_exercises)}")
print(f"[OK] Split categories: {df_exercises['split_cat'].unique().tolist()}")
print(f"[OK] Muscle groups: {df_exercises['target_muscle_group'].nunique()}")
print("\nPreview:")
print(df_exercises.head(10).to_string(index=False))


EXERCISE CATALOG LOADED
[OK] Total exercises: 20
[OK] Split categories: ['Push', 'Pull', 'Legs', 'FullBody']
[OK] Muscle groups: 9

Preview:
 exercise_id          exercise_name target_muscle_group split_cat
           1            Bench Press               Chest      Push
           2 Incline Dumbbell Press               Chest      Push
           3         Overhead Press           Shoulders      Push
           4         Lateral Raises           Shoulders      Push
           5            Tricep Dips             Triceps      Push
           6       Tricep Pushdowns             Triceps      Push
           7               Pull-ups                Back      Pull
           8         Bent-Over Rows                Back      Pull
           9          Lat Pulldowns                Back      Pull
          10             Face Pulls                Back      Pull


# **CELL 6 - Prescribe Exercise**

In [8]:
# ═══════════════════════════════════════════════════════════
# PRESCRIBE EXERCISE FUNCTION (Unchanged from v1.0)
# ═══════════════════════════════════════════════════════════

def prescribe_exercise(ex_row: dict, experience_label: str, rng) -> tuple:
    """
    Set/rep/rest/RIR prescription for an exercise, based on experience level.

    Returns:
        (sets_planned, reps_min, reps_max, rest_sec, rir_target)
    """

    # Sets planned based on experience
    sets_by_level = {
        'Beginner': (2, 4),
        'Intermediate': (3, 5),
        'Advanced': (4, 6)
    }
    sets_min, sets_max = sets_by_level.get(experience_label, (3, 5))
    sets_planned = int(rng.integers(sets_min, sets_max + 1))

    # Reps range
    reps_min = 6
    reps_max = 12

    # Rest time (seconds)
    rest_sec = int(rng.choice([90, 120, 150, 180]))

    # RIR target
    rir_by_level = {
        'Beginner': (2, 4),
        'Intermediate': (1, 3),
        'Advanced': (0, 2)
    }
    rir_min, rir_max = rir_by_level.get(experience_label, (1, 3))
    rir_target = int(rng.integers(rir_min, rir_max + 1))

    return sets_planned, reps_min, reps_max, rest_sec, rir_target

print("[OK] prescribe_exercise() function loaded")


[OK] prescribe_exercise() function loaded


# **CELL 7 - Intensity from Reps/RIR**

In [9]:
# ═══════════════════════════════════════════════════════════
# INTENSITY FROM REPS/RIR (Unchanged from v1.0)
# ═══════════════════════════════════════════════════════════

def intensity_from_reps_rir(reps_target: int, rir: int, rng) -> float:
    """
    Estimate relative intensity (fraction of 1RM) from target reps and RIR.

    Empirical formula: intensity = 1.0 - 0.025 * (reps + rir - 1)
    Noise: Gaussian with sd 0.03 (about ±3% of 1RM) for intra-individual variability
    """

    # Base formula (Epley-based approximation)
    inten = 1.0 - 0.025 * (reps_target + rir - 1)

    # Add noise
    inten = float(rng.normal(inten, 0.03))

    # Clamp [0.5, 1.0]
    inten = float(np.clip(inten, 0.5, 1.0))

    return inten

# Test
test_intensity = intensity_from_reps_rir(8, 2, rng)
print(f"[OK] intensity_from_reps_rir() function loaded")
print(f"  Test: 8 reps @ 2 RIR → {test_intensity:.2%} 1RM")


[OK] intensity_from_reps_rir() function loaded
  Test: 8 reps @ 2 RIR → 79.95% 1RM
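
To make the formula easier to inspect, the sketch below tabulates its noise-free value over a small grid of reps and RIR. The helper `base_intensity` is a hypothetical name; it reproduces only the deterministic part of `intensity_from_reps_rir`, including the clamp to [0.5, 1.0].

```python
def base_intensity(reps: int, rir: int) -> float:
    """Noise-free part of the formula: 1.0 - 0.025 * (reps + rir - 1), clamped."""
    return min(1.0, max(0.5, 1.0 - 0.025 * (reps + rir - 1)))

print("reps   RIR0   RIR1   RIR2   RIR3   RIR4")
for reps in (6, 8, 10, 12):
    row = "  ".join(f"{base_intensity(reps, rir):.3f}" for rir in range(5))
    print(f"{reps:>4}  {row}")
```

For 8 reps at 2 RIR the deterministic value is 0.775, so the 79.95% printed by the test call above reflects the added Gaussian noise.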


# **CELL 8 - User Generation v2.0 (Stratified Duration)**

In [10]:
# ═══════════════════════════════════════════════════════════
# USER GENERATION v2.0 (Experience-Aware Duration)
# ═══════════════════════════════════════════════════════════

print("="*80)
print("USER GENERATION v2.0 (Experience-Aware Training History)")
print("="*80)

# Experience distribution (realistic: 36% Beginner, 52% Intermediate, 12% Advanced)
experience_probs = [0.36, 0.52, 0.12]
experience_labels = rng.choice(
    ['Beginner', 'Intermediate', 'Advanced'],
    size=cfg.n_users,
    p=experience_probs
)

print(f"\n[OK] Experience distribution (n={cfg.n_users}):")
exp_counts = pd.Series(experience_labels).value_counts().sort_index()
for label, count in exp_counts.items():
    pct = count / cfg.n_users * 100
    print(f"  {label:12s}: {count:4d} ({pct:5.1f}%)")

# Weekly frequency
weekly_freqs = rng.normal(cfg.weekly_freq_mu, cfg.weekly_freq_sd, size=cfg.n_users)
weekly_freqs = np.clip(weekly_freqs, cfg.weekly_freq_min, cfg.weekly_freq_max)

# User profiles (uniform distribution)
profile_names = list(cfg.user_profiles.keys())
profiles = rng.choice(profile_names, size=cfg.n_users)

# ════════════════════════════════════════════════════════════
# DURATION STRATIFIED BY EXPERIENCE (v2.0)
# ════════════════════════════════════════════════════════════

def generate_user_duration(experience_label: str, rng) -> int:
    """
    Generate the length of the training time window based on experience.

    Rationale (Rønnestad & Mujika 2014):
    - Beginner: short history (3-9 months)
    - Intermediate: medium history (8-20 months)
    - Advanced: long history (18-36 months)
    """
    min_days, max_days = cfg.duration_days_by_level[experience_label]
    duration = int(rng.integers(min_days, max_days + 1))
    return duration

# Generate start/end dates per user
user_rows = []

for i in range(cfg.n_users):
    uid = i + 1
    exp_label = experience_labels[i]

    # v2.0: Duration experience-aware
    duration_days = generate_user_duration(exp_label, rng)

    # Start date: today - duration
    start_date = cfg.today - timedelta(days=duration_days)
    end_date = cfg.today

    # Weekly frequency
    wf = float(weekly_freqs[i])
    wf_declared = clamp_int(wf, cfg.weekly_freq_min, cfg.weekly_freq_max)

    # Profile
    profile = profiles[i]
    profile_dict = cfg.user_profiles[profile]

    # User traits (latent)
    exp_latent = float(rng.uniform(0.0, 1.0))
    alpha_adapt = float(rng.uniform(0.03, 0.12))
    k_detraining = float(rng.uniform(0.005, 0.02))
    obs_noise = float(rng.uniform(0.5, 1.5))
    resilience = float(rng.uniform(0.5, 1.2))
    fatigue_sens = float(rng.uniform(0.6, 1.4))
    rpe_report_bias = float(rng.uniform(-0.3, 0.3))

    # Discipline and motivation
    discipline_base = profile_dict['discipline_base']
    discipline = float(np.clip(rng.normal(discipline_base, 0.15), 0.1, 1.0))
    motivation = float(np.clip(rng.normal(0.7, 0.15), 0.3, 1.0))

    user_rows.append({
        'user_id': uid,
        'start_date': start_date.isoformat(),
        'end_date': end_date.isoformat(),
        'duration_days': duration_days,
        'weekly_freq_declared': wf_declared,
        'split_type': sample_split(rng),
        'profile': profile,
        'experience_label': exp_label,
        'experience_latent': exp_latent,
        'alpha_adapt': alpha_adapt,
        'k_detraining': k_detraining,
        'obs_noise': obs_noise,
        'resilience': resilience,
        'fatigue_sens': fatigue_sens,
        'rpe_report_bias': rpe_report_bias,
        'discipline': discipline,
        'motivation': motivation
    })

df_users = pd.DataFrame(user_rows)

print("\n" + "-"*80)
print("USER DATAFRAME CREATED")
print("-"*80)
print(f"[OK] Shape: {df_users.shape}")
print(f"[OK] Columns: {df_users.columns.tolist()}")

# ════════════════════════════════════════════════════════════
# DURATION STATISTICS PER EXPERIENCE (v2.0 Validation)
# ════════════════════════════════════════════════════════════

print("\n" + "-"*80)
print("DURATION STATISTICS per EXPERIENCE LEVEL")
print("-"*80)

for exp_label in ['Beginner', 'Intermediate', 'Advanced']:
    subset = df_users[df_users['experience_label'] == exp_label]['duration_days']
    mean_days = subset.mean()
    mean_months = mean_days / 30.0
    min_days = subset.min()
    max_days = subset.max()

    print(f"{exp_label:12s}: {mean_days:6.0f} days ({mean_months:5.1f} months) | Range: [{min_days}-{max_days}]")

# Total sets estimation
print("\n" + "-"*80)
print("ESTIMATED TOTAL SETS per EXPERIENCE LEVEL")
print("-"*80)

for exp_label in ['Beginner', 'Intermediate', 'Advanced']:
    subset = df_users[df_users['experience_label'] == exp_label]
    avg_duration = subset['duration_days'].mean()
    avg_weekly_freq = subset['weekly_freq_declared'].mean()

    # Estimate: weeks * freq * sets_per_session (assume ~15 sets/session)
    weeks = avg_duration / 7.0
    estimated_total_sets = weeks * avg_weekly_freq * 15

    print(f"{exp_label:12s}: ~{estimated_total_sets:6.0f} total sets (avg)")

print("\n[OK] User generation v2.0 complete")
print("="*80)


USER GENERATION v2.0 (Experience-Aware Training History)

[OK] Experience distribution (n=1500):
  Advanced    :  189 ( 12.6%)
  Beginner    :  542 ( 36.1%)
  Intermediate:  769 ( 51.3%)

--------------------------------------------------------------------------------
USER DATAFRAME CREATED
--------------------------------------------------------------------------------
[OK] Shape: (1500, 17)
[OK] Columns: ['user_id', 'start_date', 'end_date', 'duration_days', 'weekly_freq_declared', 'split_type', 'profile', 'experience_label', 'experience_latent', 'alpha_adapt', 'k_detraining', 'obs_noise', 'resilience', 'fatigue_sens', 'rpe_report_bias', 'discipline', 'motivation']

--------------------------------------------------------------------------------
DURATION STATISTICS per EXPERIENCE LEVEL
--------------------------------------------------------------------------------
Beginner    :    182 days (  6.1 months) | Range: [90-270]
Intermediate:    423 days ( 14.1 months) | Range: [240-600]
A
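
The total-sets estimate in the cell above (`weeks * freq * 15`) can be worked through analytically. This sketch assumes durations are drawn uniformly on each level's integer range, so the mean is the range midpoint, and it assumes a mean weekly frequency of 3.5 (the configured `weekly_freq_mu`); the declared frequency is rounded and clipped, so observed values may differ slightly. The names below are illustrative only.

```python
# Duration ranges copied from the configuration cell (days).
DURATION_DAYS = {
    "Beginner": (90, 270),
    "Intermediate": (240, 600),
    "Advanced": (540, 1080),
}
ASSUMED_WEEKLY_FREQ = 3.5  # assumption: population mean, before rounding/clipping
SETS_PER_SESSION = 15      # same rough figure used by the estimation cell

def expected_total_sets(level: str) -> float:
    """Expected sets = mean weeks * sessions per week * sets per session."""
    lo, hi = DURATION_DAYS[level]
    mean_days = (lo + hi) / 2  # mean of a uniform draw on [lo, hi]
    return mean_days / 7.0 * ASSUMED_WEEKLY_FREQ * SETS_PER_SESSION

for level in DURATION_DAYS:
    print(f"{level:12s}: ~{expected_total_sets(level):6.0f} total sets (expected)")
```

Under these assumptions the expected totals separate clearly by level, which is the effect the stratified duration is meant to produce.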

# **CELL 9 - ACWR Calculator v2.0 (Experience-Aware Spike)**

In [11]:
# ═══════════════════════════════════════════════════════════
# ACWR CALCULATOR v2.0 (Experience-Aware Spike/Deload)
# ═══════════════════════════════════════════════════════════

def calculate_weekly_acwr_multiplier(cfg, week_impulses: list, rng, experience_label: str):
    """
    Compute the ACWR multiplier, with spike/deload probability modulated
    by experience level (self-regulation).

    Args:
        cfg: Configuration object
        week_impulses: List of historical weekly impulses
        rng: Random generator
        experience_label: 'Beginner' | 'Intermediate' | 'Advanced'

    Returns:
        (multiplier, week_type, acwr_value)
    """

    if not cfg.spike_enabled:
        return 1.05, 'normal', 1.05

    # ════════════════════════════════════════════════════════
    # Chronic load (mean of the last 4 weeks)
    # NOTE: computed for reference; the returned values do not depend on it
    # ════════════════════════════════════════════════════════

    if len(week_impulses) >= 4:
        chronic_load = float(np.mean(week_impulses[-4:]))
    elif len(week_impulses) > 0:
        chronic_load = float(np.mean(week_impulses))
    else:
        chronic_load = cfg.overload_I0

    if chronic_load < 100.0:
        chronic_load = cfg.overload_I0

    # ════════════════════════════════════════════════════════
    # EXPERIENCE-AWARE SPIKE/DELOAD PROBABILITY (v2.0)
    # ════════════════════════════════════════════════════════

    probs = cfg.spike_deload_probs[experience_label]

    # Week-type decision
    rand = float(rng.random())

    if rand < probs['deload']:
        # DELOAD
        acwr = float(rng.uniform(cfg.deload_acwr_min, cfg.deload_acwr_max))
        week_type = 'deload'

    elif rand < probs['deload'] + probs['spike']:
        # SPIKE
        acwr = float(rng.uniform(cfg.spike_acwr_min, cfg.spike_acwr_max))
        week_type = 'spike'

    else:
        # NORMAL
        acwr = float(rng.normal(cfg.normal_acwr_mean, cfg.normal_acwr_std))
        acwr = float(np.clip(acwr, 0.90, 1.25))
        week_type = 'normal'

    multiplier = acwr

    return multiplier, week_type, acwr

print("="*80)
print("ACWR CALCULATOR v2.0 LOADED")
print("="*80)
print("\n[OK] Experience-Aware Spike/Deload Probabilities:")
for level, probs in cfg.spike_deload_probs.items():
    spike_pct = probs['spike'] * 100
    deload_pct = probs['deload'] * 100
    normal_pct = (1 - probs['spike'] - probs['deload']) * 100
    print(f"  {level:12s}: Spike {spike_pct:4.0f}% | Deload {deload_pct:4.0f}% | Normal {normal_pct:4.0f}%")

# Test function
test_multiplier, test_type, test_acwr = calculate_weekly_acwr_multiplier(
    cfg, [1800, 1900, 2000, 2100], rng, 'Beginner'
)
print(f"\n[OK] Test (Beginner): week_type={test_type}, ACWR={test_acwr:.3f}")


ACWR CALCULATOR v2.0 LOADED

[OK] Experience-Aware Spike/Deload Probabilities:
  Beginner    : Spike   20% | Deload    3% | Normal   77%
  Intermediate: Spike   12% | Deload    5% | Normal   83%
  Advanced    : Spike    6% | Deload    8% | Normal   86%

[OK] Test (Beginner): week_type=spike, ACWR=1.790
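
A quick Monte Carlo check can confirm that the threshold logic (deload first, then spike, otherwise normal) reproduces the configured probabilities. The sketch below is a vectorized restatement of that decision only; `sample_week_types` is a hypothetical helper, and it does not sample the ACWR values themselves.

```python
import numpy as np

# Probabilities copied from the configuration cell.
SPIKE_DELOAD_PROBS = {
    "Beginner": {"deload": 0.03, "spike": 0.20},
    "Intermediate": {"deload": 0.05, "spike": 0.12},
    "Advanced": {"deload": 0.08, "spike": 0.06},
}

def sample_week_types(level: str, n_weeks: int, rng) -> np.ndarray:
    """Hypothetical helper: draw week types with the same thresholds as the cell above."""
    p = SPIKE_DELOAD_PROBS[level]
    u = rng.random(n_weeks)
    return np.where(
        u < p["deload"], "deload",
        np.where(u < p["deload"] + p["spike"], "spike", "normal"),
    )

rng = np.random.default_rng(0)
freqs = {}
for level in SPIKE_DELOAD_PROBS:
    types = sample_week_types(level, 50_000, rng)
    freqs[level] = {t: float((types == t).mean()) for t in ("deload", "spike", "normal")}
    print(level, {t: round(f, 3) for t, f in freqs[level].items()})
```

With 50,000 simulated weeks per level, the empirical shares should land within about one percentage point of the configured values.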


# **CELL 10 - Schedule Sessions**

In [12]:
# ═══════════════════════════════════════════════════════════
# SCHEDULE SESSIONS FUNCTION (Unchanged from v1.0)
# ═══════════════════════════════════════════════════════════

def schedule_sessions_for_user(start_u: date, end_u: date, weekly_freq: int, rng):
    """
    Generate session dates for a user, with day-level jitter.

    Args:
        start_u: Start date of the user's training history
        end_u: End date of the user's training history
        weekly_freq: Target weekly frequency
        rng: Random generator

    Returns:
        List of session dates (sorted, unique)
    """

    # Target weekdays (base pattern)
    base_days = sorted(rng.choice(np.arange(7), size=weekly_freq, replace=False).tolist())

    dates = []
    d0 = start_u
    n_days = (end_u - start_u).days + 1

    for i in range(n_days):
        day = d0 + timedelta(days=i)

        if day.weekday() in base_days:
            # Jitter: -1, 0, +1 day
            jitter = int(rng.choice([-1, 0, 1], p=cfg.weekday_jitter_probs))
            day2 = day + timedelta(days=jitter)

            # Keep in range
            if start_u <= day2 <= end_u:
                dates.append(day2)

    # Remove duplicates and sort
    dates = sorted(list(set(dates)))

    return dates

# Test function
test_dates = schedule_sessions_for_user(
    date(2025, 1, 1),
    date(2025, 1, 31),
    3,
    rng
)
print("="*80)
print("SCHEDULE SESSIONS FUNCTION LOADED")
print("="*80)
print(f"[OK] Test (Jan 2025, 3x/week): {len(test_dates)} sessions scheduled")
print(f"  First 5 dates: {test_dates[:5]}")


SCHEDULE SESSIONS FUNCTION LOADED
[OK] Test (Jan 2025, 3x/week): 13 sessions scheduled
  First 5 dates: [datetime.date(2025, 1, 1), datetime.date(2025, 1, 4), datetime.date(2025, 1, 7), datetime.date(2025, 1, 8), datetime.date(2025, 1, 11)]
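
One property of this scheduler is worth making explicit: jitter can shift two neighbouring target days onto the same date, or push a day outside the window, so the number of sessions kept can be lower than the number of weekday hits. The sketch below is a compact restatement that also counts the raw hits; the function name `schedule_with_hits` and its `jitter_probs` default are illustrative, mirroring `cfg.weekday_jitter_probs`.

```python
from datetime import date, timedelta
import numpy as np

def schedule_with_hits(start, end, weekly_freq, rng, jitter_probs=(0.15, 0.70, 0.15)):
    """Illustrative restatement of the scheduler that also returns raw weekday hits."""
    base_days = set(rng.choice(np.arange(7), size=weekly_freq, replace=False).tolist())
    hits, kept = 0, set()
    for i in range((end - start).days + 1):
        day = start + timedelta(days=i)
        if day.weekday() in base_days:
            hits += 1
            shift = int(rng.choice([-1, 0, 1], p=jitter_probs))
            shifted = day + timedelta(days=shift)
            # Out-of-range dates are dropped; duplicates collapse in the set.
            if start <= shifted <= end:
                kept.add(shifted)
    return sorted(kept), hits

rng = np.random.default_rng(0)
dates, hits = schedule_with_hits(date(2025, 1, 1), date(2025, 1, 31), 3, rng)
print(f"{hits} weekday hits -> {len(dates)} sessions kept")
```

Comparing the two counts over many users gives a rough measure of how many planned sessions the jitter step silently removes before the skip model is even applied.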


# **CELL 11 - Simulate User v2.0 (CORE LOGIC - all changes integrated)**

In [13]:
# ═══════════════════════════════════════════════════════════
# SIMULATE USER v2.0 (Experience-Aware Skip + RPE)
# ═══════════════════════════════════════════════════════════

def simulate_user(cfg, user_row: dict, df_ex: pd.DataFrame, caps_u: dict, templates_u: dict, rng):
    """
    Simulate training for a single user with dynamic progressive overload.

    v2.0 Changes:
    - Skip model experience-aware
    - RPE calibration experience-aware
    - Spike frequency experience-aware

    Returns:
        (workouts_rows, plan_rows, sets_rows, impulse_rows, user_meta)
    """

    # ════════════════════════════════════════════════════════
    # USER PARAMETERS
    # ════════════════════════════════════════════════════════

    uid = int(user_row['user_id'])
    start_u = date.fromisoformat(user_row['start_date'])
    end_u = date.fromisoformat(user_row['end_date'])
    weekly_freq = int(user_row['weekly_freq_declared'])
    experience_label = str(user_row['experience_label'])

    # Latent traits
    exp_lat = float(user_row['experience_latent'])
    alpha = float(user_row['alpha_adapt'])
    kd = float(user_row['k_detraining'])
    obs_noise = float(user_row['obs_noise'])
    resilience = float(user_row['resilience'])
    fatigue_sens = float(user_row['fatigue_sens'])
    rpe_bias = float(user_row['rpe_report_bias'])

    # Profile & traits
    profile_name = str(user_row.get('profile', 'balanced'))
    profile = cfg.user_profiles.get(profile_name, cfg.user_profiles['balanced'])

    user_discipline = float(user_row.get('discipline', 0.7))
    user_motivation = float(user_row.get('motivation', 0.7))

    # Per-user Banister parameters
    tau_F = float(max(7.0, rng.normal(cfg.tau_F_mean, cfg.tau_F_sd)))
    tau_D = float(max(2.0, rng.normal(cfg.tau_D_mean, cfg.tau_D_sd)))

    # ════════════════════════════════════════════════════════
    # STATE INITIALIZATION
    # ════════════════════════════════════════════════════════

    fitness = float(rng.normal(0.0, 1.0))
    fatigue = float(max(0.0, rng.normal(0.5, 0.3)))
    skill = float(np.clip(rng.normal(0.2 + 0.6 * exp_lat, 0.15), 0.0, 2.0))

    injury_until = None

    # Candidate schedule
    session_dates = schedule_sessions_for_user(start_u, end_u, weekly_freq, rng)

    workouts_rows = []
    plan_rows = []
    sets_rows = []
    impulse_rows = []

    wid = 1
    set_id_counter = 1

    # Tag rotation per split
    PPL_ROT = ['Push', 'Pull', 'Legs']
    FB_ROT = ['FullBody']
    tags = PPL_ROT if str(user_row['split_type']) == 'PPL' else FB_ROT
    tag_i = int(rng.integers(0, len(tags)))

    last_train_date = None

    # ════════════════════════════════════════════════════════
    # PROGRESSIVE OVERLOAD STATE
    # ════════════════════════════════════════════════════════

    current_caps = {eid: float(val) for eid, val in caps_u.items()}

    base_growth = cfg.overload_base_rate.get(experience_label, 0.001)
    growth_rate = float(np.clip(rng.normal(base_growth, 0.0002), 1e-5, 0.005))

    # ════════════════════════════════════════════════════════
    # SPIKE ACWR STATE
    # ════════════════════════════════════════════════════════

    weekly_impulses = []
    current_week_impulse = 0.0
    current_week_start = start_u
    week_type_current = 'normal'
    acwr_multiplier = 1.05
    acwr_value = 1.05
    target_weekly_impulse = cfg.overload_I0

    # ════════════════════════════════════════════════════════
    # MAIN LOOP: PROCESS SESSIONS
    # ════════════════════════════════════════════════════════

    for idx, d in enumerate(session_dates):

        # ════════════════════════════════════════════════════
        # SPIKE ACWR: computed at the start of each week
        # ════════════════════════════════════════════════════

        week_index = (d - start_u).days // 7
        week_start_for_d = start_u + timedelta(days=week_index * 7)

        if week_start_for_d != current_week_start:
            # Close the previous week
            if current_week_impulse > 0:
                weekly_impulses.append(current_week_impulse)

            # Compute ACWR for the new week (v2.0: experience-aware)
            acwr_multiplier, week_type_current, acwr_value = calculate_weekly_acwr_multiplier(
                cfg, weekly_impulses, rng, experience_label
            )

            # Reset
            current_week_impulse = 0.0
            current_week_start = week_start_for_d

            # Update target (chronic load = mean of up to the last 4 closed weeks)
            if weekly_impulses:
                chronic_avg = float(np.mean(weekly_impulses[-4:]))
                target_weekly_impulse = chronic_avg * acwr_multiplier
            else:
                target_weekly_impulse = cfg.overload_I0 * acwr_multiplier

        # ════════════════════════════════════════════════════
        # DETRAINING (if gap > 1 day)
        # ════════════════════════════════════════════════════

        if last_train_date is not None:
            gap = (d - last_train_date).days
            if gap > 1:
                fitness *= math.exp(-kd * gap)

                # Capacities also detrain when the gap is very long (> 7 days)
                if gap > 7:
                    decay = math.exp(-kd * (gap - 7) * 0.3)
                    for eid in current_caps:
                        current_caps[eid] *= decay

        # Fatigue decay: one day's worth (tau = 7 days), applied once per scheduled session
        fatigue *= math.exp(-1.0 / 7.0)

        # ════════════════════════════════════════════════════
        # INJURY CHECK
        # ════════════════════════════════════════════════════

        in_injury = injury_until is not None and d <= injury_until

        # ════════════════════════════════════════════════════
        # SKIP MODEL v2.0 (EXPERIENCE-AWARE)
        # ════════════════════════════════════════════════════

        p0_base = float(cfg.skip_p0_by_level.get(experience_label, 0.10))

        # Discipline MODULATES the baseline instead of REPLACING it (v2.0)
        p0 = p0_base * (1.5 - 0.5 * user_discipline)  # Range: [1.0x, 1.5x] of baseline for discipline in [0, 1]
        p0 = float(np.clip(p0, 0.01, 0.25))

        bias = logit(p0)

        # Fatigue sensitivity by level (v2.0)
        fatigue_weight = cfg.skip_fatigue_sensitivity[experience_label]

        fat_term = float(np.log1p(max(0.0, float(fatigue))))
        fat_term = min(fat_term, float(cfg.skip_fatigue_cap))

        z = bias + fatigue_weight * fat_term + float(rng.normal(0.0, cfg.skip_noise_sd))
        p_skip = sigmoid(z)

        status = 'done'
        if rng.random() < p_skip:
            status = 'skipped'

        # ════════════════════════════════════════════════════
        # SESSION TAG
        # ════════════════════════════════════════════════════

        tag = tags[tag_i % len(tags)]
        tag_i += 1

        week_index_user = (d - start_u).days // 7 + 1

        # ════════════════════════════════════════════════════
        # WORKOUTS ROW
        # ════════════════════════════════════════════════════

        workouts_rows.append({
            'user_id': uid,
            'date': d.isoformat(),
            'week_index_user': int(week_index_user),
            'session_tag': tag,
            'workout_status': status,
            'z_skip': float(z),
            'p_skip': float(p_skip),
            'fatigue_term': float(fat_term),
            'experience_label': experience_label,
        })

        if status == 'skipped':
            impulse_rows.append({'user_id': uid, 'date': d.isoformat(), 'impulse': 0.0})
            continue

        # ════════════════════════════════════════════════════
        # PER-EXERCISE PLAN
        # ════════════════════════════════════════════════════

        ex_ids = templates_u.get(tag, [])

        if len(ex_ids) == 0:
            ex_ids = templates_u[list(templates_u.keys())[0]]

        fatigue_session = float(fatigue)

        day_impulse = 0.0
        day_total_sets = 0

        for ex_idx, ex_id in enumerate(ex_ids):
            if day_total_sets > 100:  # Safety
                break

            ex_row = df_ex[df_ex['exercise_id'] == ex_id].iloc[0].to_dict()

            sets_planned, reps_min, reps_max, rest_sec, rir_target = prescribe_exercise(
                ex_row, experience_label, rng
            )

            # Volume reduction in injury
            if in_injury:
                sets_planned = max(1, int(round(sets_planned * 0.6)))

            plan_rows.append({
                'user_id': uid,
                'date': d.isoformat(),
                'session_tag': tag,
                'exercise_id': int(ex_id),
                'sets_planned': int(sets_planned),
                'reps_min': int(reps_min),
                'reps_max': int(reps_max),
                'rest_planned_sec': int(rest_sec),
                'rir_target': int(rir_target),
            })

            # ════════════════════════════════════════════════
            # CAPACITY DINAMICA
            # ════════════════════════════════════════════════

            c_max = current_caps.get(int(ex_id), 50.0)

            # Intended baseline load
            reps_target_0 = int(rng.integers(reps_min, reps_max + 1))
            inten_0 = intensity_from_reps_rir(reps_target_0, rir_target, rng)
            intended_load = q_load(inten_0 * c_max, cfg.load_step)

            # Sets done, modulated by profile and ACWR
            sets_done_base = sets_planned * profile['sets_mult'] * acwr_multiplier
            sets_done = int(np.clip(round(rng.normal(sets_done_base, 0.5)), 1, 10))

            # ════════════════════════════════════════════════
            # SET EXECUTION
            # ════════════════════════════════════════════════

            for s in range(1, sets_done + 1):
                day_total_sets += 1

                reps_target = int(rng.integers(reps_min, reps_max + 1))
                inten = intensity_from_reps_rir(reps_target, rir_target, rng)

                fatigue_factor = float(np.clip(0.03 * fatigue_session * fatigue_sens, 0.0, 0.20))

                # Fatigue reduction for Beginners (newbie gains) - KEPT from v1.0
                if experience_label == 'Beginner':
                    fatigue_factor *= 0.2

                # ════════════════════════════════════════════
                # MODIFIED LOAD v1.0 (KEPT)
                # ════════════════════════════════════════════

                load_from_capacity = inten * c_max * (1.0 - fatigue_factor)
                daily_variation = rng.uniform(0.85, 1.15)
                load_from_daily = c_max * 0.3 * daily_variation

                load_done = load_from_capacity * 0.7 + load_from_daily * 0.3

                # Observation noise
                load_done = float(rng.normal(1.0, 0.04 * obs_noise * 0.10)) * load_done
                load_done = q_load(float(np.clip(load_done, 2.5, c_max)), cfg.load_step)

                # Reps drop as fatigue rises
                reps_done = int(np.clip(
                    round(rng.normal(reps_target * (1.0 - 0.20 * fatigue_factor), 0.6 * obs_noise * 1.5)),
                    1, 30
                ))

                # ════════════════════════════════════════════
                # MODIFIED RPE v2.0 (EXPERIENCE-AWARE)
                # ════════════════════════════════════════════

                rpe_params = cfg.rpe_params_by_level[experience_label]

                # RPE components
                rpe_from_intensity = 5.0 + 4.0 * inten
                rpe_from_motivation = user_motivation * 3.0
                rpe_from_fatigue = fatigue_factor * 2.5

                # Weighted mix
                rpe_true = (rpe_from_intensity * 0.5 +
                            rpe_from_motivation * 0.3 +
                            rpe_from_fatigue * 0.2)

                # Observed RPE with experience-aware NOISE and BIAS (v2.0)
                rpe_obs = rng.normal(
                    rpe_true + rpe_bias + rpe_params['bias'],  # User bias + level bias
                    rpe_params['noise_std'] * obs_noise * 1.2  # Level noise
                )

                rpe_done = q_rpe(float(np.clip(rpe_obs, 1.0, 10.0)), cfg.rpe_step)

                # Text feedback (optional, 5% of sets)
                feedback = None
                if rng.random() < 0.05:
                    feedback = str(rng.choice(['Recuperi corti', 'Sentito bene']))  # 'Short rests', 'Felt good'

                # Missingness
                if rng.random() < cfg.p_missing_load:
                    load_done = np.nan
                if rng.random() < cfg.p_missing_rpe:
                    rpe_done = np.nan
                if rng.random() < cfg.p_missing_feedback:
                    feedback = None

                # ════════════════════════════════════════════
                # SETS ROW
                # ════════════════════════════════════════════

                sets_rows.append({
                    'set_id': f"{uid:04d}S{set_id_counter:07d}",
                    'user_id': uid,
                    'date': d.isoformat(),
                    'week_index_user': int(week_index_user),
                    'week_type': week_type_current,
                    'acwr': float(acwr_value),
                    'session_tag': tag,
                    'exercise_id': int(ex_id),
                    'set_index': int(s),
                    'reps_target': int(reps_target),
                    'reps_done': int(reps_done),
                    'load_intended_kg': float(intended_load),
                    'load_done_kg': load_done,
                    'rpe_done': rpe_done,
                    'rest_planned_sec': int(rest_sec),
                    'rir_target': int(rir_target),
                    'feedback': feedback,
                })

                set_id_counter += 1

                # Banister daily impulse (a missing load or RPE counts as 0)
                ld = 0.0 if isinstance(load_done, float) and np.isnan(load_done) else float(load_done)
                rd = 0.0 if isinstance(rpe_done, float) and np.isnan(rpe_done) else float(rpe_done)
                set_impulse = ld * float(reps_done) * rd / 10.0
                day_impulse += set_impulse

                # Accumulate weekly impulse
                current_week_impulse += set_impulse

                # Update intra-session fatigue
                fatigue_session += 0.08 * inten + 0.02 * ld / max(20.0, c_max)

        # ════════════════════════════════════════════════════
        # POST-SESSION PROGRESSIVE OVERLOAD (KEPT from v1.0)
        # ════════════════════════════════════════════════════

        stim = float(np.clip(day_impulse / cfg.overload_I0, 0.05, 2.0))
        quality = max(0.2, 1.0 - cfg.overload_quality_fatigue * fatigue / 20.0)
        gain_base = growth_rate * stim * quality

        # Build exercise info lookup (muscle group and split) for all tracked exercises
        ex_info = {}
        for eid in current_caps:
            row = df_ex[df_ex['exercise_id'] == eid].iloc[0]
            ex_info[eid] = {
                'muscle': str(row['target_muscle_group']),
                'split': str(row['split_cat']),
            }

        # Transfer for trained exercises
        for ex_id_trained in ex_ids:
            muscle_trained = ex_info[ex_id_trained]['muscle']
            split_trained = ex_info[ex_id_trained]['split']

            for eid_all in current_caps:
                if eid_all == ex_id_trained:
                    transfer_weight = 1.0
                elif ex_info[eid_all]['muscle'] == muscle_trained:
                    transfer_weight = cfg.transfer_same_muscle
                elif ex_info[eid_all]['split'] == split_trained:
                    transfer_weight = cfg.transfer_same_split
                else:
                    transfer_weight = 0.0

                current_caps[eid_all] *= (1.0 + gain_base * transfer_weight)

        # ════════════════════════════════════════════════════
        # INJURY EVENT (KEPT from v1.0)
        # ════════════════════════════════════════════════════

        p_injury = cfg.injury_lambda * (day_impulse / 1000.0) * (1.0 - resilience) * (1.0 + 0.5 * fatigue_sens)
        if rng.random() < p_injury:
            injury_days = int(rng.integers(cfg.injury_days_min, cfg.injury_days_max + 1))
            injury_until = d + timedelta(days=injury_days)

        # Update global fatigue
        fatigue = float(np.clip(fatigue_session, 0.0, 20.0))

        impulse_rows.append({'user_id': uid, 'date': d.isoformat(), 'impulse': float(day_impulse)})
        last_train_date = d

    # ════════════════════════════════════════════════════════
    # USER METADATA (Banister params)
    # ════════════════════════════════════════════════════════

    user_meta = {
        'user_id': uid,
        'tau_F': tau_F,
        'tau_D': tau_D,
        'beta_F': cfg.beta_F,
        'beta_D': cfg.beta_D,
    }

    return workouts_rows, plan_rows, sets_rows, impulse_rows, user_meta

print("="*80)
print("SIMULATE USER v2.0 LOADED")
print("="*80)
print("\n[OK] Experience-Aware Components:")
print("  - Skip model: p0 stratified by level")
print("  - RPE calibration: noise + bias by level")
print("  - Spike frequency: self-regulation for advanced users")
print("="*80)


SIMULATE USER v2.0 LOADED

[OK] Experience-Aware Components:
  - Skip model: p0 stratified by level
  - RPE calibration: noise + bias by level
  - Spike frequency: self-regulation for advanced users
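
The skip model in `simulate_user` can be read as a small logistic model: a level-specific baseline `p0`, scaled by discipline, is moved to logit space, shifted by a capped log-fatigue term, and mapped back through a sigmoid. The sketch below reproduces that chain in isolation. The `logit` and `sigmoid` helpers come from an earlier cell not shown here, so standard definitions are assumed. The numeric parameters are hypothetical stand-ins for `cfg.skip_p0_by_level`, `cfg.skip_fatigue_sensitivity` and `cfg.skip_fatigue_cap`, and the noise term is set to zero to keep the example deterministic.

```python
import math

def logit(p):
    """Assumed standard definition: log-odds of p."""
    return math.log(p / (1.0 - p))

def sigmoid(z):
    """Assumed standard definition: inverse of logit."""
    return 1.0 / (1.0 + math.exp(-z))

def skip_probability(p0_base, discipline, fatigue, fatigue_weight, fatigue_cap, noise=0.0):
    # Discipline scales the baseline by a factor in [1.0, 1.5] for discipline in [0, 1]
    p0 = p0_base * (1.5 - 0.5 * discipline)
    p0 = min(max(p0, 0.01), 0.25)
    # Capped log-fatigue term, as in the simulation loop
    fat_term = min(math.log1p(max(0.0, fatigue)), fatigue_cap)
    return sigmoid(logit(p0) + fatigue_weight * fat_term + noise)

# Hypothetical parameters (not the notebook's cfg values)
p_rested = skip_probability(0.13, discipline=1.0, fatigue=0.0, fatigue_weight=0.3, fatigue_cap=2.0)
p_tired = skip_probability(0.13, discipline=1.0, fatigue=5.0, fatigue_weight=0.3, fatigue_cap=2.0)
p_undisciplined = skip_probability(0.13, discipline=0.0, fatigue=0.0, fatigue_weight=0.3, fatigue_cap=2.0)

print(f"rested: {p_rested:.3f}, tired: {p_tired:.3f}, undisciplined: {p_undisciplined:.3f}")
```

With zero fatigue and zero noise the function returns the discipline-adjusted baseline unchanged. With a positive fatigue weight, any accumulated fatigue can only raise the probability above that baseline, so the realized skip rate may end up higher than the configured `p0`.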


#**CELL 12 - Exercise Templates & Capacities Setup**

In [14]:
# ═══════════════════════════════════════════════════════════
# EXERCISE TEMPLATES & CAPACITIES SETUP (Unchanged from v1.0)
# ═══════════════════════════════════════════════════════════

print("="*80)
print("PREPARING EXERCISE TEMPLATES & CAPACITIES")
print("="*80)

# ────────────────────────────────────────────────────────────
# Exercise Templates per Split
# ────────────────────────────────────────────────────────────

PUSH_EX = df_exercises[df_exercises['split_cat'] == 'Push']['exercise_id'].tolist()
PULL_EX = df_exercises[df_exercises['split_cat'] == 'Pull']['exercise_id'].tolist()
LEGS_EX = df_exercises[df_exercises['split_cat'] == 'Legs']['exercise_id'].tolist()
FB_EX = df_exercises[df_exercises['split_cat'] == 'FullBody']['exercise_id'].tolist()

# Fallback: if FullBody is empty, use compound lifts
if len(FB_EX) == 0:
    FB_EX = [1, 7, 13]  # Bench, Pull-ups, Squats

print(f"[OK] Exercise pools:")
print(f"  Push: {len(PUSH_EX)} exercises")
print(f"  Pull: {len(PULL_EX)} exercises")
print(f"  Legs: {len(LEGS_EX)} exercises")
print(f"  FullBody: {len(FB_EX)} exercises")

# ────────────────────────────────────────────────────────────
# Generate Templates & Capacities per User
# ────────────────────────────────────────────────────────────

def generate_templates_and_caps(user_row: dict, df_ex: pd.DataFrame, rng):
    """
    Generate exercise templates and initial capacities for a user.
    """

    split_type = user_row['split_type']
    experience_label = user_row['experience_label']

    # Sample exercises per split
    if split_type == 'PPL':
        push_sample = list(rng.choice(PUSH_EX, size=min(4, len(PUSH_EX)), replace=False))
        pull_sample = list(rng.choice(PULL_EX, size=min(4, len(PULL_EX)), replace=False))
        legs_sample = list(rng.choice(LEGS_EX, size=min(4, len(LEGS_EX)), replace=False))

        templates = {
            'Push': push_sample,
            'Pull': pull_sample,
            'Legs': legs_sample
        }
    else:
        fb_sample = list(rng.choice(FB_EX, size=min(5, len(FB_EX)), replace=False))
        templates = {
            'FullBody': fb_sample
        }

    # Initial capacities for ALL exercises (needed for transfer)
    caps = {}

    # Capacity range by level
    cap_ranges = {
        'Beginner': (25.0, 40.0),
        'Intermediate': (40.0, 60.0),
        'Advanced': (55.0, 80.0)
    }

    cap_min, cap_max = cap_ranges[experience_label]

    for ex_id in df_ex['exercise_id']:
        caps[int(ex_id)] = float(rng.uniform(cap_min, cap_max))

    return templates, caps

print("\n[OK] generate_templates_and_caps() function ready")
print("="*80)


PREPARING EXERCISE TEMPLATES & CAPACITIES
[OK] Exercise pools:
  Push: 6 exercises
  Pull: 6 exercises
  Legs: 6 exercises
  FullBody: 2 exercises

[OK] generate_templates_and_caps() function ready
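
The template sampling uses `size=min(k, len(pool))` with `replace=False`, so a small pool caps the template length. With the 2-exercise FullBody pool reported above, a FullBody template can hold at most 2 exercises even though 5 are requested. A minimal sketch of that behaviour follows. `sample_template` is a hypothetical helper and the exercise ids are placeholders.

```python
import numpy as np

def sample_template(pool, k, rng):
    """Draw up to k distinct exercise ids from pool (fewer if the pool is smaller)."""
    size = min(k, len(pool))
    # .tolist() converts NumPy integers to plain Python ints
    return rng.choice(pool, size=size, replace=False).tolist()

rng = np.random.default_rng(42)

push_pool = [1, 2, 3, 4, 5, 6]  # placeholder ids for a 6-exercise pool
fb_pool = [1, 13]               # placeholder ids for a 2-exercise pool

push_template = sample_template(push_pool, 4, rng)
fb_template = sample_template(fb_pool, 5, rng)

print("Push template:", push_template)
print("FullBody template:", fb_template)
```

Converting with `.tolist()` is a design choice of this sketch: it keeps the ids as plain `int` values, which avoids mixing NumPy and Python integers when the ids are later used as dictionary keys.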


#**CELL 13 - Run Generator (Main Loop)**

In [15]:
# ═══════════════════════════════════════════════════════════
# RUN GENERATOR - ALL USERS
# ═══════════════════════════════════════════════════════════

print("="*80)
print("RUNNING GENERATOR v2.0 - ALL USERS")
print("="*80)
print(f"Users to process: {len(df_users)}")
print(f"Estimated time: ~3-5 minutes")
print("="*80)

import time
start_time = time.time()

all_workouts = []
all_plan = []
all_sets = []
all_impulse = []
all_banister_meta = []

skipped_users = []

for idx, user_row in df_users.iterrows():
    uid = user_row['user_id']

    # Progress tracking (every 250 users)
    if uid % 250 == 0:
        elapsed = time.time() - start_time
        print(f"[{uid:4d}/{len(df_users)}] Elapsed: {elapsed:.1f}s")

    # Generate templates & capacities
    templates_u, caps_u = generate_templates_and_caps(user_row, df_exercises, rng)

    # Simulate user
    try:
        workouts_rows, plan_rows, sets_rows, impulse_rows, user_meta = simulate_user(
            cfg, user_row, df_exercises, caps_u, templates_u, rng
        )

        # Filter out users with too few sets (< 20)
        if len(sets_rows) < 20:
            skipped_users.append(uid)
            continue

        all_workouts.extend(workouts_rows)
        all_plan.extend(plan_rows)
        all_sets.extend(sets_rows)
        all_impulse.extend(impulse_rows)
        all_banister_meta.append(user_meta)

    except Exception as e:
        print(f"\n[!] ERROR processing user {uid}: {e}")
        skipped_users.append(uid)
        continue

elapsed_total = time.time() - start_time

print("\n" + "="*80)
print("GENERATION COMPLETE")
print("="*80)
print(f"[OK] Time elapsed: {elapsed_total:.1f}s ({elapsed_total/60:.1f} min)")
print(f"[OK] Users processed: {len(df_users) - len(skipped_users)}/{len(df_users)}")
if skipped_users:
    print(f"[!] Skipped users (< 20 sets or errors): {len(skipped_users)}")

# ════════════════════════════════════════════════════════════
# CREATE DATAFRAMES
# ════════════════════════════════════════════════════════════

df_workouts = pd.DataFrame(all_workouts)
df_plan = pd.DataFrame(all_plan)
df_sets = pd.DataFrame(all_sets)
df_impulse = pd.DataFrame(all_impulse)
df_banister_meta = pd.DataFrame(all_banister_meta)

print("\n" + "-"*80)
print("DATAFRAMES CREATED")
print("-"*80)
print(f"[OK] df_workouts: {len(df_workouts):,} rows")
print(f"[OK] df_plan:     {len(df_plan):,} rows")
print(f"[OK] df_sets:     {len(df_sets):,} rows (PRIMARY)")
print(f"[OK] df_impulse:  {len(df_impulse):,} rows")
print(f"[OK] df_banister_meta: {len(df_banister_meta):,} rows")

# ════════════════════════════════════════════════════════════
# ASSIGN GLOBAL WORKOUT IDs
# ════════════════════════════════════════════════════════════

print("\n" + "-"*80)
print("ASSIGNING WORKOUT IDs")
print("-"*80)

# Create mapping: (user_id, date) -> workout_id
df_workouts = df_workouts.sort_values(['user_id', 'date']).reset_index(drop=True)
df_workouts['workout_id'] = df_workouts.index + 1

workout_id_map = df_workouts.set_index(['user_id', 'date'])['workout_id'].to_dict()

# Propagate to plan and sets (plain dict lookup per row; avoids a slow row-wise DataFrame.apply)
df_plan['workout_id'] = [
    workout_id_map.get(key) for key in zip(df_plan['user_id'], df_plan['date'])
]

df_sets['workout_id'] = [
    workout_id_map.get(key) for key in zip(df_sets['user_id'], df_sets['date'])
]

print(f"[OK] Workout IDs assigned: {df_workouts['workout_id'].nunique()} unique workouts")

# ════════════════════════════════════════════════════════════
# STATISTICS PREVIEW
# ════════════════════════════════════════════════════════════

print("\n" + "="*80)
print("DATASET STATISTICS")
print("="*80)

print(f"\n[OK] Total users: {df_sets['user_id'].nunique()}")
print(f"[OK] Total workouts: {df_workouts['workout_id'].nunique()}")
print(f"[OK] Total sets: {len(df_sets):,}")
print(f"[OK] Date range: {df_sets['date'].min()} to {df_sets['date'].max()}")

# Experience distribution in final dataset
print("\n" + "-"*80)
print("EXPERIENCE DISTRIBUTION (Final Dataset)")
print("-"*80)
exp_dist = df_workouts.drop_duplicates('user_id')['experience_label'].value_counts().sort_index()
for label, count in exp_dist.items():
    pct = count / exp_dist.sum() * 100
    print(f"{label:12s}: {count:4d} ({pct:5.1f}%)")

# Workout status distribution
print("\n" + "-"*80)
print("WORKOUT STATUS DISTRIBUTION")
print("-"*80)
status_dist = df_workouts['workout_status'].value_counts()
for status, count in status_dist.items():
    pct = count / len(df_workouts) * 100
    print(f"{status:10s}: {count:6d} ({pct:5.1f}%)")

# ACWR week types
print("\n" + "-"*80)
print("ACWR WEEK TYPE DISTRIBUTION")
print("-"*80)
week_type_dist = df_sets['week_type'].value_counts()
for wtype, count in week_type_dist.items():
    pct = count / len(df_sets) * 100
    print(f"{wtype:10s}: {count:7d} ({pct:5.1f}%)")

print("="*80)


RUNNING GENERATOR v2.0 - ALL USERS
Users to process: 1500
Estimated time: ~3-5 minutes
[ 250/1500] Elapsed: 403.1s
[ 500/1500] Elapsed: 797.2s
[ 750/1500] Elapsed: 1184.6s
[1000/1500] Elapsed: 1578.5s
[1250/1500] Elapsed: 2006.8s
[1500/1500] Elapsed: 2424.1s

GENERATION COMPLETE
[OK] Time elapsed: 2424.9s (40.4 min)
[OK] Users processed: 1500/1500

--------------------------------------------------------------------------------
DATAFRAMES CREATED
--------------------------------------------------------------------------------
[OK] df_workouts: 256,286 rows
[OK] df_plan:     763,802 rows
[OK] df_sets:     3,511,093 rows (PRIMARY)
[OK] df_impulse:  256,286 rows
[OK] df_banister_meta: 1,500 rows

--------------------------------------------------------------------------------
ASSIGNING WORKOUT IDs
--------------------------------------------------------------------------------
[OK] Workout IDs assigned: 256286 unique workouts

DATASET STATISTICS

[OK] Total users: 1500
[OK] Total workouts
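
The `(user_id, date) -> workout_id` propagation can also be written as a left merge, which lets pandas check that each key maps to exactly one workout. The sketch below uses toy data. The column names follow the notebook, but the frames themselves are made up for illustration.

```python
import pandas as pd

# Toy workouts table: one row per (user_id, date)
workouts = pd.DataFrame({
    'user_id': [2, 1, 1],
    'date': ['2025-01-03', '2025-01-01', '2025-01-04'],
})
workouts = workouts.sort_values(['user_id', 'date']).reset_index(drop=True)
workouts['workout_id'] = workouts.index + 1

# Toy sets table: several sets per workout
sets = pd.DataFrame({
    'user_id': [1, 1, 1, 2],
    'date': ['2025-01-01', '2025-01-01', '2025-01-04', '2025-01-03'],
    'reps_done': [8, 7, 10, 12],
})

# Left merge keeps every set row; validate raises if a key is duplicated in workouts
sets = sets.merge(
    workouts[['user_id', 'date', 'workout_id']],
    on=['user_id', 'date'],
    how='left',
    validate='many_to_one',
)
print(sets)
```

Note that the merge form is not idempotent: re-running it on a frame that already has a `workout_id` column produces suffixed duplicate columns, so the column would need to be dropped first.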

#**CELL 14 - Banister Model Calculation**

In [16]:
# ═══════════════════════════════════════════════════════════
# BANISTER MODEL - FITNESS/FATIGUE/PERFORMANCE
# ═══════════════════════════════════════════════════════════

print("="*80)
print("CALCULATING BANISTER MODEL")
print("="*80)

# Parse dates
df_impulse['date'] = pd.to_datetime(df_impulse['date'])
df_impulse = df_impulse.sort_values(['user_id', 'date']).reset_index(drop=True)

# Merge Banister params
df_impulse = df_impulse.merge(
    df_banister_meta[['user_id', 'tau_F', 'tau_D', 'beta_F', 'beta_D']],
    on='user_id',
    how='left'
)

print(f"[OK] Impulse data: {len(df_impulse):,} rows")
print(f"[OK] Users with Banister params: {df_impulse['tau_F'].notna().sum():,}")

# ════════════════════════════════════════════════════════════
# CALCULATE FITNESS, FATIGUE, PERFORMANCE PER USER
# ════════════════════════════════════════════════════════════

def calculate_banister_user(user_df: pd.DataFrame) -> pd.DataFrame:
    """
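    Compute Fitness, Fatigue and Performance for a single user.

    Impulse-Response Model (Banister 1975), with i indexing rows of user_df
    (one row per scheduled session, not per calendar day):
    Fitness(t)  = beta_F * sum_{i=0}^{t} w_F[i] * Impulse(t-i)
    Fatigue(t)  = beta_D * sum_{i=0}^{t} w_D[i] * Impulse(t-i)
    where w_F, w_D = exp_weights(n, tau) are exponential-decay weights (~ exp(-i/tau))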
    Performance(t) = Fitness(t) - Fatigue(t)
    """

    user_df = user_df.sort_values('date').reset_index(drop=True)

    tau_F = user_df['tau_F'].iloc[0]
    tau_D = user_df['tau_D'].iloc[0]
    beta_F = user_df['beta_F'].iloc[0]
    beta_D = user_df['beta_D'].iloc[0]

    impulses = user_df['impulse'].values
    n = len(impulses)

    # Exponential weights
    w_F = exp_weights(n, tau_F)
    w_D = exp_weights(n, tau_D)

    # Convolve impulses with weights
    fitness_series = np.convolve(impulses, w_F, mode='full')[:n] * beta_F
    fatigue_series = np.convolve(impulses, w_D, mode='full')[:n] * beta_D

    # Performance = Fitness - Fatigue
    performance_series = fitness_series - fatigue_series

    # Scale to [0, 100] for interpretability
    perf_min = performance_series.min()
    perf_max = performance_series.max()

    if perf_max > perf_min:
        performance_scaled = 100 * (performance_series - perf_min) / (perf_max - perf_min)
    else:
        performance_scaled = np.full(n, 50.0)

    user_df['fitness'] = fitness_series
    user_df['fatigue'] = fatigue_series
    user_df['performance'] = performance_scaled
    user_df['TSB'] = fitness_series - fatigue_series  # Training Stress Balance

    return user_df

# Apply to all users (groupby avoids re-filtering the full frame once per user)
print("\nProcessing Banister model per user...")
banister_results = [
    calculate_banister_user(user_df.copy())
    for _, user_df in df_impulse.groupby('user_id', sort=False)
]

df_banister_daily = pd.concat(banister_results, ignore_index=True)

print(f"[OK] Banister model calculated: {len(df_banister_daily):,} rows")

# Summary statistics
print("\n" + "-"*80)
print("BANISTER MODEL STATISTICS")
print("-"*80)
print(df_banister_daily[['fitness', 'fatigue', 'TSB', 'performance']].describe().round(1))

# TSB categories
print("\n" + "-"*80)
print("TSB CATEGORIES")
print("-"*80)
tsb_cats = pd.cut(
    df_banister_daily['TSB'],
    bins=[-np.inf, -5000, 0, 15000, np.inf],
    labels=['Overreaching (<-5k)', 'Mild Fatigue (-5k to 0)', 'Balanced (0-15k)', 'Fresh (>15k)']
)
print(tsb_cats.value_counts().sort_index())

print("="*80)


CALCULATING BANISTER MODEL
[OK] Impulse data: 256,286 rows
[OK] Users with Banister params: 256,286

Processing Banister model per user...
[OK] Banister model calculated: 256,286 rows

--------------------------------------------------------------------------------
BANISTER MODEL STATISTICS
--------------------------------------------------------------------------------
        fitness   fatigue       TSB  performance
count  256286.0  256286.0  256286.0     256286.0
mean      867.6     270.7     597.0         56.9
std      1122.8     321.5     830.5         29.3
min         0.0       0.0     -68.0          0.0
25%       289.8     109.6     162.5         34.3
50%       571.1     184.0     383.6         61.5
75%      1019.7     314.3     721.3         82.0
max     25146.9    7993.1   20011.7        100.0

--------------------------------------------------------------------------------
TSB CATEGORIES
--------------------------------------------------------------------------------
TSB
Over
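
The truncated convolution used in `calculate_banister_user` is equivalent to a first-order recursion in which the state decays by a fixed factor at each step and the new impulse is added. The sketch below checks that equivalence on toy data. `exp_weights` is defined in an earlier cell not shown here, so the form `exp(-i / tau)` is an assumption, as are the time constants and the unit gains.

```python
import numpy as np

def exp_weights(n, tau):
    """Assumed form of the notebook helper: w[i] = exp(-i / tau)."""
    return np.exp(-np.arange(n) / tau)

def response_by_convolution(impulses, tau):
    """Truncated convolution, as in calculate_banister_user."""
    n = len(impulses)
    return np.convolve(impulses, exp_weights(n, tau), mode='full')[:n]

def response_by_recursion(impulses, tau):
    """Same response computed step by step: state <- state * decay + impulse."""
    decay = np.exp(-1.0 / tau)
    out = np.empty(len(impulses))
    state = 0.0
    for t, imp in enumerate(impulses):
        state = state * decay + imp
        out[t] = state
    return out

impulses = np.array([100.0, 0.0, 150.0, 0.0, 0.0, 200.0])  # toy per-row impulses
tau_fit, tau_fat = 42.0, 7.0                               # hypothetical time constants

fitness = response_by_convolution(impulses, tau_fit)
fatigue = response_by_convolution(impulses, tau_fat)
tsb = fitness - fatigue  # unit gains assumed for both components

print(np.allclose(fitness, response_by_recursion(impulses, tau_fit)))
print(tsb.round(1))
```

Under these assumptions (equal gains, slower decay for fitness than for fatigue, non-negative impulses) the balance can never go negative, because every fitness weight is at least as large as the matching fatigue weight. Negative values therefore require the fatigue gain to exceed the fitness gain.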

#**CELL 15 - Consistency Score Post-Processing v2.0**

In [17]:
# ═══════════════════════════════════════════════════════════
# CONSISTENCY SCORE POST-PROCESSING v2.0 (Experience-Aware)
# ═══════════════════════════════════════════════════════════

print("="*80)
print("CALCULATING CONSISTENCY SCORE v2.0 (Experience-Aware)")
print("="*80)

def calculate_consistency_score(user_workouts_df: pd.DataFrame, experience_label: str) -> float:
    """
    Compute consistency as the share of training days over the observed window,
    clipped to an experience-specific range to simulate realistic adherence.

    Args:
        user_workouts_df: workouts DataFrame for a single user
        experience_label: 'Beginner' | 'Intermediate' | 'Advanced'

    Returns:
        Consistency score [0.0, 1.0]
    """

    if len(user_workouts_df) == 0:
        return 0.0

    start_date = pd.to_datetime(user_workouts_df['date']).min()
    end_date = pd.to_datetime(user_workouts_df['date']).max()
    total_days = (end_date - start_date).days + 1

    if total_days == 0:
        return 1.0

    # Training days (status == 'done')
    training_days = len(user_workouts_df[user_workouts_df['workout_status'] == 'done'])

    # Raw consistency
    raw_consistency = training_days / total_days

    # ════════════════════════════════════════════════════════
    # Experience-based target range (v2.0)
    # ════════════════════════════════════════════════════════

    target_min, target_max = cfg.consistency_targets[experience_label]

    # Clip to target range (simulates dropout/irregularity); values outside the range collapse onto its bounds
    adjusted_consistency = float(np.clip(raw_consistency, target_min, target_max))

    return adjusted_consistency

# Prepare workouts dataframe
df_workouts['date'] = pd.to_datetime(df_workouts['date'])

# Merge experience_label to workouts
df_workouts = df_workouts.merge(
    df_users[['user_id', 'experience_label']],
    on='user_id',
    how='left',
    suffixes=('', '_from_users')
)

# Use merged column or keep existing
if 'experience_label_from_users' in df_workouts.columns:
    df_workouts['experience_label'] = df_workouts['experience_label_from_users']
    df_workouts = df_workouts.drop(columns=['experience_label_from_users'])

# Calculate consistency per user
print("\nCalculating consistency scores...")
consistency_scores = []

for uid in df_workouts['user_id'].unique():
    user_workouts = df_workouts[df_workouts['user_id'] == uid]
    exp_label = user_workouts['experience_label'].iloc[0]

    consistency = calculate_consistency_score(user_workouts, exp_label)

    consistency_scores.append({
        'user_id': uid,
        'consistency_score': consistency
    })

df_consistency = pd.DataFrame(consistency_scores)

# Merge back to df_users
df_users = df_users.merge(df_consistency, on='user_id', how='left')

print(f"[OK] Consistency scores calculated: {len(df_consistency)} users")

# ════════════════════════════════════════════════════════════
# VALIDATION: Check consistency distribution per experience
# ════════════════════════════════════════════════════════════

print("\n" + "-"*80)
print("CONSISTENCY SCORE DISTRIBUTION per EXPERIENCE")
print("-"*80)

for exp_label in ['Beginner', 'Intermediate', 'Advanced']:
    subset = df_users[df_users['experience_label'] == exp_label]['consistency_score']

    if len(subset) > 0:
        mean_val = subset.mean()
        std_val = subset.std()
        min_val = subset.min()
        max_val = subset.max()

        target_min, target_max = cfg.consistency_targets[exp_label]

        print(f"{exp_label:12s}: μ={mean_val:.3f}, σ={std_val:.3f} | Range: [{min_val:.3f}, {max_val:.3f}]")
        print(f"              Target: [{target_min}, {target_max}]")

        # Check if within target
        in_target = ((subset >= target_min) & (subset <= target_max)).sum()
        in_target_pct = in_target / len(subset) * 100
        print(f"              Within target: {in_target}/{len(subset)} ({in_target_pct:.1f}%)\n")

print("="*80)


CALCULATING CONSISTENCY SCORE v2.0 (Experience-Aware)

Calculating consistency scores...
[OK] Consistency scores calculated: 1500 users

--------------------------------------------------------------------------------
CONSISTENCY SCORE DISTRIBUTION per EXPERIENCE
--------------------------------------------------------------------------------
Beginner    : μ=0.650, σ=0.000 | Range: [0.650, 0.650]
              Target: [0.65, 0.8]
              Within target: 542/542 (100.0%)

Intermediate: μ=0.750, σ=0.000 | Range: [0.750, 0.750]
              Target: [0.75, 0.9]
              Within target: 769/769 (100.0%)

Advanced    : μ=0.850, σ=0.000 | Range: [0.850, 0.850]
              Target: [0.85, 0.95]
              Within target: 189/189 (100.0%)
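
In the output above, every level reports σ=0.000 and a range equal to its lower target bound. That is what a clip produces when all raw values sit at or below `target_min`: the score ends up constant within each level. One possible alternative, sketched here in plain Python with made-up values, is to rescale the raw scores linearly into the target range so that differences between users survive. The helper name `rescale_to_target` and the sample values are assumptions for illustration and are not part of the notebook.

```python
from statistics import pstdev


def clip(value, lo, hi):
    """Clamp a single value to [lo, hi] (same effect as np.clip on a scalar)."""
    return max(lo, min(hi, value))


def rescale_to_target(raw_scores, target_min, target_max):
    """Map raw scores linearly onto [target_min, target_max], keeping their order."""
    lo, hi = min(raw_scores), max(raw_scores)
    if hi == lo:
        # Degenerate group (all scores equal): place everyone at the midpoint
        midpoint = (target_min + target_max) / 2
        return [midpoint for _ in raw_scores]
    span = target_max - target_min
    return [target_min + (r - lo) / (hi - lo) * span for r in raw_scores]


# Made-up raw consistency values, all below the Beginner floor of 0.65
raw = [0.40, 0.45, 0.55]

clipped = [clip(r, 0.65, 0.80) for r in raw]
rescaled = rescale_to_target(raw, 0.65, 0.80)

print("clipped :", clipped, "std =", pstdev(clipped))
print("rescaled:", [round(x, 3) for x in rescaled], "std =", round(pstdev(rescaled), 3))
```

Rescaling within a level keeps the ranking of users but makes the score relative to the group, so whether it suits the downstream modules is a design decision rather than a drop-in fix.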



#**CELL 16 - Validation & Save**

In [20]:
# ═══════════════════════════════════════════════════════════
# POST-PROCESSING: CLAMP LOAD OUTLIERS (BEFORE VALIDATION)
# ═══════════════════════════════════════════════════════════

print("="*80)
print("POST-PROCESSING: CLAMP LOAD OUTLIERS")
print("="*80)

# Check outliers BEFORE clamp
load_valid = df_sets['load_done_kg'].dropna()
outliers_high = (load_valid > 200.0).sum()
outliers_low = (load_valid < 2.5).sum()

print(f"\n[!] Load outliers found:")
print(f"  - Above 200kg: {outliers_high:,} ({outliers_high/len(load_valid)*100:.2f}%)")
print(f"  - Below 2.5kg: {outliers_low:,} ({outliers_low/len(load_valid)*100:.2f}%)")

if outliers_high > 0:
    print(f"\n[OK] Max load before clamp: {load_valid.max():.1f} kg")

# Clamp to realistic range [2.5, 200.0]; Series.clip leaves NaN values untouched
df_sets['load_done_kg'] = df_sets['load_done_kg'].clip(lower=2.5, upper=200.0)

# Verify
load_fixed = df_sets['load_done_kg'].dropna()
print(f"[OK] Max load after clamp: {load_fixed.max():.1f} kg")
print(f"[OK] Min load after clamp: {load_fixed.min():.1f} kg")

# Distribution check
print(f"\n[OK] Load distribution (post-clamp):")
print(load_fixed.describe().round(1))

print("="*80)
print()

# ═══════════════════════════════════════════════════════════
# FINAL VALIDATION & SAVE
# ═══════════════════════════════════════════════════════════

print("="*80)
print("FINAL VALIDATION & SAVE")
print("="*80)

# ════════════════════════════════════════════════════════════
# VALIDATION CHECKS
# ════════════════════════════════════════════════════════════

print("\n" + "-"*80)
print("VALIDATION CHECKS")
print("-"*80)

# Check 1: No missing user_id
assert df_sets['user_id'].notna().all(), "[ERR] Missing user_id in sets"
print("[OK] No missing user_id")

# Check 2: All sets have workout_id
assert df_sets['workout_id'].notna().all(), "[ERR] Missing workout_id in sets"
print("[OK] All sets have workout_id")

# Check 3: Date consistency
df_sets['date'] = pd.to_datetime(df_sets['date'])
assert (df_sets['date'] >= '2023-01-01').all(), "[ERR] Dates before 2023"
assert (df_sets['date'] <= '2026-12-31').all(), "[ERR] Dates after 2026"
print("[OK] Date range valid (2023-2026)")

# Check 4: Load range realistic (expected to pass after the clamp above)
load_valid = df_sets['load_done_kg'].dropna()
assert (load_valid >= 2.5).all(), "[ERR] Load < 2.5kg"
assert (load_valid <= 200.0).all(), "[ERR] Load > 200kg"
print("[OK] Load range valid (2.5-200kg)")

# Check 5: RPE range valid
rpe_valid = df_sets['rpe_done'].dropna()
assert (rpe_valid >= 1.0).all(), "[ERR] RPE < 1"
assert (rpe_valid <= 10.0).all(), "[ERR] RPE > 10"
print("[OK] RPE range valid (1-10)")

# Check 6: Reps range realistic
assert (df_sets['reps_done'] >= 1).all(), "[ERR] Reps < 1"
assert (df_sets['reps_done'] <= 50).all(), "[ERR] Reps > 50"
print("[OK] Reps range valid (1-50)")

# Check 7: Experience distribution in final dataset
exp_final = df_sets.merge(df_users[['user_id', 'experience_label']], on='user_id', how='left')
exp_counts = exp_final.drop_duplicates('user_id')['experience_label'].value_counts()
assert len(exp_counts) == 3, "[ERR] Not all experience levels present"
print(f"[OK] All 3 experience levels present: {exp_counts.to_dict()}")

# Check 8: Skip rate validation (v2.0)
skip_rates_by_exp = {}
for exp_label in ['Beginner', 'Intermediate', 'Advanced']:
    user_subset = df_users[df_users['experience_label'] == exp_label]['user_id']
    workouts_subset = df_workouts[df_workouts['user_id'].isin(user_subset)]

    total_sessions = len(workouts_subset)
    skipped_sessions = len(workouts_subset[workouts_subset['workout_status'] == 'skipped'])

    if total_sessions > 0:
        skip_rate = skipped_sessions / total_sessions
        skip_rates_by_exp[exp_label] = skip_rate

print("\n" + "-"*80)
print("SKIP RATE VALIDATION (v2.0)")
print("-"*80)
for exp_label, skip_rate in skip_rates_by_exp.items():
    target_skip = cfg.skip_p0_by_level[exp_label]
    print(f"{exp_label:12s}: {skip_rate:.1%} (target: {target_skip:.1%})")

print("\n[OK] All validation checks passed")

# ════════════════════════════════════════════════════════════
# SAVE DATASETS
# ════════════════════════════════════════════════════════════

print("\n" + "="*80)
print("SAVING DATASETS")
print("="*80)

# Save primary datasets
df_users.to_csv(OUTDIR / 'users.csv', index=False)
print(f"[OK] Saved: users.csv ({len(df_users)} rows)")

df_exercises.to_csv(OUTDIR / 'exercises.csv', index=False)
print(f"[OK] Saved: exercises.csv ({len(df_exercises)} rows)")

df_workouts.to_csv(OUTDIR / 'workouts.csv', index=False)
print(f"[OK] Saved: workouts.csv ({len(df_workouts):,} rows)")

df_plan.to_csv(OUTDIR / 'workout_plan.csv', index=False)
print(f"[OK] Saved: workout_plan.csv ({len(df_plan):,} rows)")

df_sets.to_csv(OUTDIR / 'workout_sets.csv', index=False)
print(f"[OK] Saved: workout_sets.csv ({len(df_sets):,} rows) [PRIMARY]")

df_banister_daily.to_csv(OUTDIR / 'banister_daily.csv', index=False)
print(f"[OK] Saved: banister_daily.csv ({len(df_banister_daily):,} rows)")

df_banister_meta.to_csv(OUTDIR / 'banister_meta.csv', index=False)
print(f"[OK] Saved: banister_meta.csv ({len(df_banister_meta)} rows)")

# ════════════════════════════════════════════════════════════
# SAVE CONFIG & METADATA
# ════════════════════════════════════════════════════════════

metadata = {
    'version': '2.0',
    'date_generated': datetime.now().isoformat(),
    'seed': cfg.seed,
    'n_users': len(df_users),
    'n_workouts': len(df_workouts),
    'n_sets': len(df_sets),
    'experience_distribution': exp_counts.to_dict(),
    'skip_rates': skip_rates_by_exp,
    'date_range': {
        'min': df_sets['date'].min().isoformat(),
        'max': df_sets['date'].max().isoformat()
    },
    'changelog': [
        'Skip model experience-aware (Sperandei 2016)',
        'RPE calibration experience-aware (Day 2004, Helms 2016)',
        'Spike frequency experience-aware (Gabbett 2016)',
        'Training history duration stratified (Ronnestad 2014)',
        'Consistency score post-processed per experience',
        'Load outliers clamped to [2.5, 200.0] kg range'
    ]
}

with open(OUTDIR / 'metadata_v2.json', 'w') as f:
    json.dump(metadata, f, indent=2)

print(f"[OK] Saved: metadata_v2.json")

# ════════════════════════════════════════════════════════════
# FINAL SUMMARY
# ════════════════════════════════════════════════════════════

print("\n" + "="*80)
print("GENERATION v2.0 COMPLETE")
print("="*80)
print(f"\n[OK] Output directory: {OUTDIR}")
print(f"[OK] Total users: {len(df_users):,}")
print(f"[OK] Total workouts: {len(df_workouts):,}")
print(f"[OK] Total sets: {len(df_sets):,}")
print(f"[OK] Date range: {metadata['date_range']['min']} to {metadata['date_range']['max']}")
print(f"\n[OK] Experience-Aware Components:")
print(f"  - Skip rates: Beginner {skip_rates_by_exp['Beginner']:.1%}, Advanced {skip_rates_by_exp['Advanced']:.1%}")
print(f"  - Consistency: Stratified by level")
print(f"  - RPE calibration: Noise + bias by level")
print(f"  - Spike frequency: Self-regulation in experienced lifters")
print("\n[OK] Ready for:")
print("  - STATUS Module (classification)")
print("  - IMPETUS Module (regression + injury risk)")
print("="*80)


POST-PROCESSING: CLAMP LOAD OUTLIERS

[!] Load outliers found:
  - Above 200kg: 43,819 (1.26%)
  - Below 2.5kg: 0 (0.00%)

[OK] Max load before clamp: 2663.5 kg
[OK] Max load after clamp: 200.0 kg
[OK] Min load after clamp: 8.5 kg

[OK] Load distribution (post-clamp):
count    3476236.0
mean          44.5
std           30.7
min            8.5
25%           27.5
50%           35.2
75%           49.0
max          200.0
Name: load_done_kg, dtype: float64

FINAL VALIDATION & SAVE

--------------------------------------------------------------------------------
VALIDATION CHECKS
--------------------------------------------------------------------------------
[OK] No missing user_id
[OK] All sets have workout_id
[OK] Date range valid (2023-2026)
[OK] Load range valid (2.5-200kg)
[OK] RPE range valid (1-10)
[OK] Reps range valid (1-50)
[OK] All 3 experience levels present: {np.str_('Intermediate'): 769, np.str_('Beginner'): 542, np.str_('Advanced'): 189}

-------------------------------------
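
Check 8 in the cell above prints the observed skip rate next to its target for each level but does not assert anything, so the final "All validation checks passed" message does not cover it. A minimal sketch of a tolerance-based check follows. The function name `check_skip_rates`, the tolerance of 0.03 and all numeric values are assumptions chosen for illustration; they are not the notebook's results or the contents of `cfg.skip_p0_by_level`.

```python
def check_skip_rates(observed, targets, tolerance=0.03):
    """Return levels whose observed skip rate is missing or off target by more than `tolerance`."""
    failures = {}
    for level, target in targets.items():
        rate = observed.get(level)
        if rate is None or abs(rate - target) > tolerance:
            failures[level] = rate
    return failures


# Hypothetical values for illustration only
targets = {"Beginner": 0.13, "Intermediate": 0.08, "Advanced": 0.05}
observed = {"Beginner": 0.14, "Intermediate": 0.07, "Advanced": 0.11}

failures = check_skip_rates(observed, targets)
if failures:
    print(f"[ERR] Skip rate outside tolerance: {failures}")
else:
    print("[OK] Skip rates within tolerance")
```

In the notebook, the returned dictionary could feed an `assert not failures` so that a drifting skip model stops the run before the datasets are saved.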

#**Download ZIP**

In [21]:
# ═══════════════════════════════════════════════════════════
# DOWNLOAD DATASET as ZIP
# ═══════════════════════════════════════════════════════════

import shutil
from google.colab import files

print("Creating ZIP archive...")

# Create zip
shutil.make_archive(
    'synthetic_gym_dataset_v2',
    'zip',
    'data/synth_set_level_v2'
)

print("[OK] ZIP created: synthetic_gym_dataset_v2.zip")
print(f"[OK] Size: {os.path.getsize('synthetic_gym_dataset_v2.zip') / 1024 / 1024:.1f} MB")

# Download
print("\nDownloading...")
files.download('synthetic_gym_dataset_v2.zip')

print("[OK] Download complete!")


Creating ZIP archive...
[OK] ZIP created: synthetic_gym_dataset_v2.zip
[OK] Size: 64.2 MB

Downloading...



[OK] Download complete!


In [22]:
# ═══════════════════════════════════════════════════════════
# LOAD DATASET - STRATIFIED SAMPLE (for STATUS + IMPETUS)
# ═══════════════════════════════════════════════════════════

import pandas as pd
import numpy as np

DATADIR = 'data/synth_set_level_v2'

print("="*80)
print("CREATING STRATIFIED SAMPLE (STATUS + IMPETUS)")
print("="*80)

# ────────────────────────────────────────────────────────────
# STRATEGY: Sample 510 users (170 per experience level)
# ────────────────────────────────────────────────────────────

df_users_full = pd.read_csv(f'{DATADIR}/users.csv')

# Stratified sample
np.random.seed(42)

sampled_users = []
for exp in ['Beginner', 'Intermediate', 'Advanced']:
    users_exp = df_users_full[df_users_full['experience_label'] == exp]
    n_sample = min(170, len(users_exp))  # up to 170 per class = 510 total
    sampled = users_exp.sample(n=n_sample, random_state=42)
    sampled_users.append(sampled)

df_users = pd.concat(sampled_users, ignore_index=True)

print(f"\n[OK] Users sampled: {len(df_users)} / {len(df_users_full)}")
print(f"\nExperience distribution:")
for exp in ['Beginner', 'Intermediate', 'Advanced']:
    count = len(df_users[df_users['experience_label'] == exp])
    pct = count / len(df_users) * 100
    print(f"  {exp:12s}: {count:3d} ({pct:5.1f}%)")

# ────────────────────────────────────────────────────────────
# LOAD SETS in CHUNKS (only for sampled users)
# ────────────────────────────────────────────────────────────

sampled_user_ids = df_users['user_id'].tolist()

print(f"\nLoading workout_sets for {len(sampled_user_ids)} users (chunked)...")

chunks = []
chunksize = 100000
total_processed = 0

for chunk in pd.read_csv(f'{DATADIR}/workout_sets.csv', chunksize=chunksize):
    # Filter only sampled users
    chunk_filtered = chunk[chunk['user_id'].isin(sampled_user_ids)]
    if len(chunk_filtered) > 0:
        chunks.append(chunk_filtered)

    total_processed += len(chunk)
    if total_processed % 500000 == 0:
        print(f"  Processed {total_processed:,} rows...")

df_sets = pd.concat(chunks, ignore_index=True)

print(f"\n[OK] Sets loaded: {len(df_sets):,}")
# Share of rows kept; total_processed counts every row read from workout_sets.csv
print(f"  Reduction: {len(df_sets) / total_processed * 100:.1f}% of original")
print(f"  Avg sets per user: {len(df_sets) / len(df_users):.0f}")

# ────────────────────────────────────────────────────────────
# LOAD OTHER TABLES (filtered)
# ────────────────────────────────────────────────────────────

print("\nLoading other tables (filtered)...")

df_workouts = pd.read_csv(f'{DATADIR}/workouts.csv')
df_workouts = df_workouts[df_workouts['user_id'].isin(sampled_user_ids)]

df_plan = pd.read_csv(f'{DATADIR}/workout_plan.csv')
df_plan = df_plan[df_plan['user_id'].isin(sampled_user_ids)]

df_banister_daily = pd.read_csv(f'{DATADIR}/banister_daily.csv')
df_banister_daily = df_banister_daily[df_banister_daily['user_id'].isin(sampled_user_ids)]

print(f"[OK] Workouts: {len(df_workouts):,} rows")
print(f"[OK] Plan: {len(df_plan):,} rows")
print(f"[OK] Banister daily: {len(df_banister_daily):,} rows")

# ────────────────────────────────────────────────────────────
# LOAD EXERCISES (no filtering needed)
# ────────────────────────────────────────────────────────────

df_exercises = pd.read_csv(f'{DATADIR}/exercises.csv')
print(f"[OK] Exercises: {len(df_exercises)} rows")

# ────────────────────────────────────────────────────────────
# SAVE SAMPLED DATASET
# ────────────────────────────────────────────────────────────

print("\n" + "-"*80)
print("SAVING SAMPLED DATASET")
print("-"*80)

df_users.to_csv(f'{DATADIR}/users_sampled.csv', index=False)
df_sets.to_csv(f'{DATADIR}/workout_sets_sampled.csv', index=False)
df_workouts.to_csv(f'{DATADIR}/workouts_sampled.csv', index=False)
df_plan.to_csv(f'{DATADIR}/workout_plan_sampled.csv', index=False)
df_banister_daily.to_csv(f'{DATADIR}/banister_daily_sampled.csv', index=False)

# Copy exercises (no filtering)
df_exercises.to_csv(f'{DATADIR}/exercises_sampled.csv', index=False)

print(f"\n[OK] Saved sampled datasets:")
print(f"  - users_sampled.csv ({len(df_users)} rows, ~50 KB)")
print(f"  - workout_sets_sampled.csv ({len(df_sets):,} rows, ~{len(df_sets)*200/1024/1024:.0f} MB)")
print(f"  - workouts_sampled.csv ({len(df_workouts):,} rows, ~{len(df_workouts)*150/1024/1024:.0f} MB)")
print(f"  - workout_plan_sampled.csv ({len(df_plan):,} rows)")
print(f"  - banister_daily_sampled.csv ({len(df_banister_daily):,} rows)")
print(f"  - exercises_sampled.csv ({len(df_exercises)} rows)")

# ────────────────────────────────────────────────────────────
# MEMORY INFO
# ────────────────────────────────────────────────────────────

total_memory_mb = (
    df_users.memory_usage(deep=True).sum() +
    df_sets.memory_usage(deep=True).sum() +
    df_workouts.memory_usage(deep=True).sum()
) / 1024 / 1024

print(f"\n[OK] Total memory usage: {total_memory_mb:.0f} MB (Colab-friendly)")

print("\n" + "="*80)
print("READY FOR:")
print("  - STATUS EDA/FE/Modeling")
print("  - IMPETUS EDA/FE/Modeling")
print("="*80)


CREATING STRATIFIED SAMPLE (STATUS + IMPETUS)

[OK] Users sampled: 510 / 1500

Experience distribution:
  Beginner    : 170 ( 33.3%)
  Intermediate: 170 ( 33.3%)
  Advanced    : 170 ( 33.3%)

Loading workout_sets for 510 users (chunked)...
  Processed 500,000 rows...
  Processed 1,000,000 rows...
  Processed 1,500,000 rows...
  Processed 2,000,000 rows...
  Processed 2,500,000 rows...
  Processed 3,000,000 rows...
  Processed 3,500,000 rows...

[OK] Sets loaded: 1,566,944
  Reduction: 44.6% of original
  Avg sets per user: 3072

Loading other tables (filtered)...
[OK] Workouts: 106,571 rows
[OK] Plan: 322,580 rows
[OK] Banister daily: 106,571 rows
[OK] Exercises: 20 rows

--------------------------------------------------------------------------------
SAVING SAMPLED DATASET
--------------------------------------------------------------------------------

[OK] Saved sampled datasets:
  - users_sampled.csv (510 rows, ~50 KB)
  - workout_sets_sampled.csv (1,566,944 rows, ~299 MB)
  - work
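
The chunked load in the cell above keeps memory bounded by filtering each chunk before it is stored. The sketch below reproduces that pattern on a tiny in-memory CSV so it can be run on its own. It also counts the rows as they stream past, which gives the retained share without a hardcoded total. The CSV contents, the sampled IDs and the chunk size are made up for illustration.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for workout_sets.csv (made-up rows: 20 sets, 5 users)
csv_text = "user_id,reps_done\n" + "\n".join(
    f"{i % 5},{8 + i % 3}" for i in range(20)
)

sampled_ids = {0, 3}  # hypothetical sampled users; a set gives fast membership tests
kept = []
total_rows = 0

for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=6):
    total_rows += len(chunk)  # count every row seen, kept or not
    filtered = chunk[chunk["user_id"].isin(sampled_ids)]
    if len(filtered) > 0:
        kept.append(filtered)

df_kept = pd.concat(kept, ignore_index=True)
retained_pct = len(df_kept) / total_rows * 100
print(f"Kept {len(df_kept)} of {total_rows} rows ({retained_pct:.1f}%)")
```

Only the filtered rows are accumulated, so peak memory depends on the chunk size and the size of the sample rather than on the full file.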

In [23]:
# ═══════════════════════════════════════════════════════════
# DOWNLOAD SAMPLED DATASET as ZIP
# ═══════════════════════════════════════════════════════════

import shutil
import os
from pathlib import Path
from google.colab import files

print("="*80)
print("CREATING ZIP ARCHIVE (Sampled Dataset)")
print("="*80)

DATADIR = Path('data/synth_set_level_v2')

# ────────────────────────────────────────────────────────────
# Create temporary directory for sampled files only
# ────────────────────────────────────────────────────────────

temp_dir = Path('sampled_dataset_temp')
temp_dir.mkdir(exist_ok=True)

sampled_files = [
    'users_sampled.csv',
    'workout_sets_sampled.csv',
    'workouts_sampled.csv',
    'workout_plan_sampled.csv',
    'banister_daily_sampled.csv',
    'exercises_sampled.csv',
    'metadata_v2.json'  # Include metadata
]

print("\nCopying sampled files to temporary directory...")

total_size_mb = 0

for fname in sampled_files:
    src = DATADIR / fname
    dst = temp_dir / fname

    if src.exists():
        shutil.copy2(src, dst)
        size_mb = src.stat().st_size / 1024 / 1024
        total_size_mb += size_mb
        print(f"  [OK] {fname:35s} ({size_mb:6.1f} MB)")
    else:
        print(f"  [!] {fname:35s} (NOT FOUND)")

print(f"\n[OK] Total size: {total_size_mb:.1f} MB")

# ────────────────────────────────────────────────────────────
# Create ZIP archive
# ────────────────────────────────────────────────────────────

print("\nCreating ZIP archive...")

zip_name = 'synthetic_gym_dataset_v2_SAMPLED'

shutil.make_archive(
    zip_name,
    'zip',
    temp_dir
)

zip_file = f'{zip_name}.zip'
zip_size_mb = os.path.getsize(zip_file) / 1024 / 1024

print(f"[OK] ZIP created: {zip_file}")
print(f"[OK] Compressed size: {zip_size_mb:.1f} MB")
print(f"  Compression ratio: {(1 - zip_size_mb/total_size_mb)*100:.1f}%")

# ────────────────────────────────────────────────────────────
# Clean up temporary directory
# ────────────────────────────────────────────────────────────

shutil.rmtree(temp_dir)
print("\n[OK] Temporary files cleaned")

# ────────────────────────────────────────────────────────────
# Download
# ────────────────────────────────────────────────────────────

print("\n" + "="*80)
print("DOWNLOADING ZIP...")
print("="*80)
print("\nDownload starting (browser prompt)...\n")

files.download(zip_file)

print("[OK] Download complete!")
print("\n" + "="*80)
print("SAMPLED DATASET READY")
print("="*80)
print(f"\nFile: {zip_file}")
print(f"Content:")
print(f"  - 510 users (170 per experience level)")
print(f"  - ~1.57M workout sets")
print(f"  - ~107k workouts")
print(f"  - Banister daily metrics")
print(f"  - Exercise catalog")
print(f"\n[OK] Ready for:")
print(f"  - STATUS EDA/FE/Modeling")
print(f"  - IMPETUS EDA/FE/Modeling")
print("="*80)


CREATING ZIP ARCHIVE (Sampled Dataset)

Copying sampled files to temporary directory...
  [OK] users_sampled.csv                   (   0.1 MB)
  [OK] workout_sets_sampled.csv            ( 149.5 MB)
  [OK] workouts_sampled.csv                (   9.1 MB)
  [OK] workout_plan_sampled.csv            (  13.2 MB)
  [OK] banister_daily_sampled.csv          (  14.8 MB)
  [OK] exercises_sampled.csv               (   0.0 MB)
  [OK] metadata_v2.json                    (   0.0 MB)

[OK] Total size: 186.7 MB

Creating ZIP archive...
[OK] ZIP created: synthetic_gym_dataset_v2_SAMPLED.zip
[OK] Compressed size: 27.8 MB
  Compression ratio: 85.1%

[OK] Temporary files cleaned

DOWNLOADING ZIP...

Download starting (browser prompt)...




[OK] Download complete!

SAMPLED DATASET READY

File: synthetic_gym_dataset_v2_SAMPLED.zip
Content:
  - 510 users (170 per experience level)
  - ~1.57M workout sets
  - ~107k workouts
  - Banister daily metrics
  - Exercise catalog

[OK] Ready for:
  - STATUS EDA/FE/Modeling
  - IMPETUS EDA/FE/Modeling
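
The packaging cell prints a `NOT FOUND` warning for a missing source file but still builds and downloads the archive. A small post-build check can confirm that the ZIP holds every expected file. The sketch below uses only the standard library and throwaway files; the helper name `missing_from_zip` is an assumption for illustration and does not appear in the notebook.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def missing_from_zip(zip_path, expected_names):
    """Return the expected file names that are absent from the archive."""
    with zipfile.ZipFile(zip_path) as zf:
        present = set(zf.namelist())
    return sorted(set(expected_names) - present)


# Self-contained demo: archive a directory that lacks one expected file
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "payload"
    src.mkdir()
    (src / "users_sampled.csv").write_text("user_id\n1\n")

    archive = shutil.make_archive(str(Path(tmp) / "demo"), "zip", src)
    missing = missing_from_zip(archive, ["users_sampled.csv", "metadata_v2.json"])

print(f"Missing from archive: {missing}")
```

Running such a check before deleting the temporary directory would let the cell stop instead of shipping an incomplete archive.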
