# **Synthetic Gym Dataset Generator (Set-Level)**

**Realistic synthetic dataset generator for gym training**

---

## **Overview**

The dataset is produced by a dynamic simulation:
- **Variable time window** per user (2 weeks → 2 years)
- **State-based simulation** (fitness, fatigue, skill, resilience)
- **Realistic events** (skip, injury)
- **Progressive overload** with transfer between exercises
- **Banister model** for fitness/fatigue

### **Output Files**
- `workout_sets.csv`: set-level (primary)
- `workout_logs.csv`: exercise-level
- `sessions.csv`: session-level
- `banister_daily.csv`: daily F/D/P series
- `users.csv`, `workouts.csv`, `workout_plan.csv`

### **Key Parameters**
- `n_users`: 300 (configurable; the `CFG` default in Cell 2 is 1000)
- `seed`: 27 (reproducibility)
- `overload_base_rate`: Beginner 5%, Intermediate 0.5%, Advanced 0.4%

**Run time**: ~5-8 minutes for 300 users

---


# 1. **Setup & Imports**

Install the libraries and import the modules.


# Synthetic Gym Logs Generator (Set-level)
Synthetic dataset generator for **per-set** (set-level) training logs, with:
- a variable per-user time window (`end_date` between today and +2 years, duration 2 weeks → 2 years)
- state-based simulation (fitness/fatigue/skill/resilience)
- events (skip, injury)
- canonical output: `workout_sets.csv`
- derived outputs: `workoutlogs.csv` (per exercise), `sessions.csv` (per session), `banisterdaily.csv`

Goal: produce realistic data first, then derive aggregated views for the ML modules.


**CELL 1 — (Python) Setup**

In [None]:
!pip -q install pandas numpy

import os, json, math
from dataclasses import dataclass
from pathlib import Path
from datetime import date, timedelta, datetime
import numpy as np
import pandas as pd


---
# 2. **Configuration**

Global parameter definitions:

### **User Generation**
- `n_users`: number of users
- `seed`: random seed

### **Date Ranges**
- `today`: reference date
- `end_date_max_days_ahead`: maximum number of days into the future for `end_date`
- `min/max_duration_days`: range of the window duration

### **Training Schedule**
- `weekly_freq_mu/sd`: mean and standard deviation of the weekly frequency
- `weekday_jitter_probs`: probabilities of shifting a session by -1, 0, or +1 day

### **Progressive Overload**
- `overload_base_rate`: growth rate per level
- `transfer_same_muscle`: 35% (same muscle group)
- `transfer_same_split`: 12% (same split)

### **Skip & Injury**
- `skip_p0_by_level`: baseline skip probability
- `skip_fatigue_weight`: how strongly fatigue modulates skipping
- `injury_lambda`: injury probability scale

### **Banister**
- `tauF`: ~45 days (fitness decay)
- `tauD`: ~7 days (fatigue decay)
- `betaF/betaD`: coefficients
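
To give a feel for how the four Banister parameters interact, here is a minimal sketch of a Banister-style impulse-response update on a made-up impulse series. The helper name `banister_series`, the sign convention `P = betaF*F - betaD*D`, and the toy impulses are assumptions made for illustration; the notebook's own computation of the daily F/D/P series is not shown in this part and may differ.

```python
import numpy as np

def banister_series(impulse, tauF=45.0, tauD=7.0, betaF=0.010, betaD=0.015):
    """Hypothetical helper: recursive Banister-style fitness/fatigue update."""
    F = np.zeros(len(impulse))
    D = np.zeros(len(impulse))
    f = d = 0.0
    for t, w in enumerate(impulse):
        # yesterday's state decays exponentially, then today's impulse is added
        f = f * np.exp(-1.0 / tauF) + w
        d = d * np.exp(-1.0 / tauD) + w
        F[t], D[t] = f, d
    # assumed convention: performance = weighted fitness minus weighted fatigue
    P = betaF * F - betaD * D
    return F, D, P

# toy input: 14 sessions on alternate days, followed by complete rest
impulse = np.zeros(60)
impulse[:28:2] = 1800.0
F, D, P = banister_series(impulse)
print(round(P[27], 1), round(P[34], 1), round(P[59], 1))
```

Because `tauD` is much shorter than `tauF`, fatigue fades faster than fitness, so under these assumptions the performance proxy keeps rising for some days after training stops before it declines.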


**CELL 2 — (Python) Config**

In [None]:
@dataclass
class CFG:
    seed: int = 27
    outdir: str = "data_synth_setlevel"

    n_users: int = 1000

    # Per-user date ranges
    today: date = date.today()             # evaluated once, when the class is defined
    end_date_max_days_ahead: int = 730     # today -> +2 years
    min_duration_days: int = 14            # 2 weeks
    max_duration_days: int = 730           # 2 years

    # Training schedule
    weekly_freq_mu: float = 3.5
    weekly_freq_sd: float = 1.0
    weekly_freq_min: int = 1
    weekly_freq_max: int = 6
    weekday_jitter_probs: tuple = (0.15, 0.70, 0.15)  # P(shift) for -1, 0, +1 days

    # Quantization / realism
    load_step: float = 0.25
    rpe_step: float = 0.5

    # Skip model (baseline per level)
    skip_p0_by_level: dict = None  # assigned on the instance below
    # how much fatigue raises the skip probability (light modulator)
    skip_fatigue_weight: float = 0.25
    # cap and noise (keep the system stable)
    skip_fatigue_cap: float = 1.2
    skip_noise_sd: float = 0.10

    # how much experience damps the fatigue effect (0.0 = no effect)
    skip_exp_fatigue_scale: float = 0.85
    skip_exp_weight: float = 1.2

    injury_lambda: float = 0.002   # injury probability scale
    injury_days_min: int = 7
    injury_days_max: int = 28

    # Missingness (observations only)
    p_missing_rpe: float = 0.02
    p_missing_load: float = 0.01
    p_missing_feedback: float = 0.02

    # Banister-like params
    tauF_mean: float = 45.0
    tauF_sd: float = 8.0
    tauD_mean: float = 7.0
    tauD_sd: float = 2.0
    betaF: float = 0.010
    betaD: float = 0.015

    # === PROGRESSIVE OVERLOAD (dose-driven) ===
    overload_base_rate: dict = None      # growth rate per level, assigned on the instance below
    overload_I0: float = 1800.0          # normalization impulse (expected median)
    overload_quality_fatigue: float = 0.6  # how much fatigue reduces quality
    # Transfer weights
    transfer_same_muscle: float = 0.35   # same targetmusclegroup
    transfer_same_split: float = 0.12    # same splitcat


cfg = CFG()

# Baseline skip probability per level
cfg.skip_p0_by_level = {
    "Beginner": 0.10,       # was 0.13, lowered for a ~12-13% target
    "Intermediate": 0.065,  # was 0.08, lowered for a ~8% target
    "Advanced": 0.05,       # left unchanged
}

# Base growth rate per level
cfg.overload_base_rate = {
    "Beginner": 0.0500,      # raised from 0.0180 (about x2.8)
    "Intermediate": 0.0050,  # unchanged
    "Advanced": 0.0040,      # unchanged
}

rng = np.random.default_rng(cfg.seed)

OUTDIR = Path(cfg.outdir)
OUTDIR.mkdir(parents=True, exist_ok=True)

cfg.today


datetime.date(2026, 2, 1)

---
# 3. **Utility Functions**

Helper functions:
- `sigmoid(z)`: sigmoid, used for probabilities
- `logit(p)`: inverse sigmoid
- `qload(x, step)`: load quantization (0.25 kg)
- `qrpe(x, step)`: RPE quantization (0.5)
- `clamp_int()`: clamp and cast to int
- `sample_split()`: PPL 70% / FullBody 30%
- `exp_weights()`: exponential weights for the Banister model


**CELL 3 — (Python) Utils**

In [None]:
def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def qload(x: float, step: float) -> float:
    if x is None or (isinstance(x, float) and np.isnan(x)):
        return np.nan
    return float(np.round(x / step) * step)

def qrpe(x: float, step: float) -> float:
    if x is None or (isinstance(x, float) and np.isnan(x)):
        return np.nan
    x = float(np.clip(x, 1.0, 10.0))
    return float(np.round(x / step) * step)

def clamp_int(x, lo, hi):
    return int(np.clip(int(round(x)), lo, hi))

def sample_split(rng):
    return str(rng.choice(["PPL", "FullBody"], p=[0.7, 0.3]))

def exp_weights(L: int, tau: float) -> np.ndarray:
    idx = np.arange(L, dtype=float)
    return np.exp(-idx / float(tau))

def logit(p: float) -> float:
    """Inverse sigmoid: logit(p) = ln(p/(1-p))"""
    return math.log(p/(1-p))
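
As a quick sanity check of the helpers above, the sketch below restates compact versions of them (without the NaN handling, so that it runs on its own) and shows the `logit`/`sigmoid` round trip together with the two quantizers. The sample inputs are arbitrary.

```python
import math
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    return math.log(p / (1 - p))

def qload(x, step):
    # snap a load to the nearest multiple of `step`
    return float(np.round(x / step) * step)

def qrpe(x, step):
    # clip to the 1-10 RPE scale, then snap to the nearest multiple of `step`
    x = float(np.clip(x, 1.0, 10.0))
    return float(np.round(x / step) * step)

print(round(sigmoid(logit(0.065)), 3))  # 0.065
print(qload(82.37, 0.25))               # 82.25
print(qload(82.40, 0.25))               # 82.5
print(qrpe(7.8, 0.5))                   # 8.0
print(qrpe(10.4, 0.5))                  # 10.0
```

Note that `np.round` rounds exact halves to the nearest even value, so a value that falls exactly between two steps may snap downward.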


### **Skip Model Configuration Check**

Checks that the skip parameters are loaded and that `logit()` is defined.


In [None]:
# === CONFIG CHECK ===
print("Skip configuration check:")
print(f"  cfg.skip_p0_by_level = {cfg.skip_p0_by_level}")
print(f"  cfg.skip_fatigue_weight = {cfg.skip_fatigue_weight}")

try:
    test_val = logit(0.10)
    print(f"logit(0.10) = {test_val:.3f}")
except NameError:
    print("ERROR: logit() is not defined!")


Skip configuration check:
  cfg.skip_p0_by_level = {'Beginner': 0.1, 'Intermediate': 0.065, 'Advanced': 0.05}
  cfg.skip_fatigue_weight = 0.25
logit(0.10) = -2.197


---
# 4. **Exercise Catalog**

Loads the exercise catalog, with a built-in fallback.

### **How It Works**
1. Try to load `esercizi_catalogo.csv`
2. If the file is not found, fall back to 12 basic exercises
3. Normalize the column names of a loaded file (Italian → English)

### **Schema**
- `exerciseid`, `name`, `targetmusclegroup`
- `mechanics` (Compound/Isolation)
- `difficultylevel` (Beginner/Intermediate/Advanced)
- `equipment`, `bodyregion`, `splitcat` (push/pull/legs/core/other)

**NOTE on the fallback**: it contains only 12 exercises, but that is enough to run the generator.


**CELL 4 - (Python) Load exercise catalog (fallback included)**

In [None]:
def load_exercises_catalog(path: str = "esercizi_catalogo.csv") -> pd.DataFrame:
    p = Path(path)
    if p.exists():
        df = pd.read_csv(p)

        rename_map = {
            "idEsercizio": "exerciseid",
            "nome": "name",
            "gruppoMuscolare": "targetmusclegroup",
            "livelloEsercizio": "difficultylevel",
            "Mechanics": "mechanics"
        }
        for k, v in rename_map.items():
            if k in df.columns and v not in df.columns:
                df = df.rename(columns={k: v})
        if "splitcat" not in df.columns:
            df["splitcat"] = "other"
        keep = ["exerciseid","name","targetmusclegroup","mechanics","difficultylevel",
                "equipment","bodyregion","splitcat"]
        for c in keep:
            if c not in df.columns:
                df[c] = None
        df = df[keep].copy()
        df["exerciseid"] = df["exerciseid"].astype(int)
        df["splitcat"] = df["splitcat"].astype(str).str.lower()
        return df.sort_values("exerciseid").reset_index(drop=True)

    # Minimal fallback (extensible)
    rows = [
        (1,"Bench Press","Chest","Compound","Intermediate","Barbell","Upper Body","push"),
        (2,"Barbell Row","Back","Compound","Intermediate","Barbell","Upper Body","pull"),
        (3,"Squat","Quadriceps","Compound","Advanced","Barbell","Lower Body","legs"),
        (4,"Cable Fly","Chest","Isolation","Beginner","Cable","Upper Body","push"),
        (5,"Lat Pulldown","Back","Compound","Beginner","Machine","Upper Body","pull"),
        (6,"Leg Press","Quadriceps","Compound","Intermediate","Machine","Lower Body","legs"),
        (7,"Plank","Abdominals","Compound","Beginner","Bodyweight","Core","core"),
        (8,"Lateral Raise","Shoulders","Isolation","Beginner","Dumbbell","Upper Body","push"),
        (9,"Romanian Deadlift","Hamstrings","Compound","Advanced","Barbell","Lower Body","legs"),
        (10,"Incline DB Press","Chest","Compound","Intermediate","Dumbbell","Upper Body","push"),
        (11,"Seated Cable Row","Back","Compound","Beginner","Cable","Upper Body","pull"),
        (12,"Leg Curl","Hamstrings","Isolation","Beginner","Machine","Lower Body","legs"),
    ]
    return pd.DataFrame(rows, columns=[
        "exerciseid","name","targetmusclegroup","mechanics","difficultylevel",
        "equipment","bodyregion","splitcat"
    ])

df_ex = load_exercises_catalog()
df_ex.head()


| | exerciseid | name | targetmusclegroup | mechanics | difficultylevel | equipment | bodyregion | splitcat |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Bench Press | Chest | Compound | Intermediate | Barbell | Upper Body | push |
| 1 | 2 | Barbell Row | Back | Compound | Intermediate | Barbell | Upper Body | pull |
| 2 | 3 | Squat | Quadriceps | Compound | Advanced | Barbell | Lower Body | legs |
| 3 | 4 | Cable Fly | Chest | Isolation | Beginner | Cable | Upper Body | push |
| 4 | 5 | Lat Pulldown | Back | Compound | Beginner | Machine | Upper Body | pull |


---
# 5. **User Generation**

Generates users with **latent variables** and individual time windows.

## **User Latents (Hidden Variables)**

### **Experience Latent**
- Continuous ∈ [0, 1], then discretized:
  - **Beginner** (< 0.40)
  - **Intermediate** (0.40 ≤ x < 0.80)
  - **Advanced** (≥ 0.80)
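
Since Cell 5 draws `experience_latent` from a Beta(2, 2) distribution, the thresholds above fix the expected class shares. The sketch below computes them from the closed-form CDF of Beta(2, 2), which is 3x² − 2x³, and cross-checks the result by sampling. The helper name `beta22_cdf` and the sample size are illustrative choices, not part of the notebook.

```python
import numpy as np

def beta22_cdf(x):
    """CDF of Beta(2, 2): the integral of 6t(1 - t) from 0 to x."""
    return 3 * x**2 - 2 * x**3

p_beg = beta22_cdf(0.40)
p_int = beta22_cdf(0.80) - p_beg
p_adv = 1.0 - beta22_cdf(0.80)
print(f"Beginner {p_beg:.3f}, Intermediate {p_int:.3f}, Advanced {p_adv:.3f}")

# Monte Carlo cross-check using the same distribution as Cell 5
rng = np.random.default_rng(0)
x = rng.beta(2.0, 2.0, size=200_000)
mc = [(x < 0.40).mean(), ((x >= 0.40) & (x < 0.80)).mean(), (x >= 0.80).mean()]
print([round(float(v), 3) for v in mc])
```

The expected mix is therefore roughly 35% Beginner, 54% Intermediate, and 10% Advanced, which is worth keeping in mind when the label is used as a classification target.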

### **Adaptation Parameters**
- `alpha_adapt`: adaptation rate (higher for beginners)
- `k_detraining`: detraining rate (higher for beginners)
- `obs_noise`: observation noise (higher for beginners)

### **Individual Differences**
- `resilience`: injury resistance (higher for advanced users)
- `fatigue_sens`: fatigue sensitivity
- `rpe_report_bias`: subjective RPE reporting bias

## **Date Window**
Each user has an individual `start_date` and `end_date`, which mimics a "real-world" dataset.

## **Output**
`df_users`: user attributes + date window + target label + latents


**CELL 5 — (Python) Sample user latents + per-user date windows**

In [None]:
LEVELS = ["Beginner", "Intermediate", "Advanced"]

def sample_user_window(cfg: CFG, rng) -> tuple[date, date]:
    end_date = cfg.today + timedelta(days=int(rng.integers(0, cfg.end_date_max_days_ahead + 1)))
    dur = int(rng.integers(cfg.min_duration_days, cfg.max_duration_days + 1))
    start_date = end_date - timedelta(days=dur)
    return start_date, end_date

def sample_user_latents(cfg: CFG, rng):
    # experience_latent in [0,1], 0=novice-ish, 1=advanced-ish
    exp_lat = float(np.clip(rng.beta(2.0, 2.0), 0.0, 1.0))

    # Continuous parameters
    # adaptation: higher when exp_lat is low
    alpha = float(np.clip(rng.normal(0.05 - 0.03*exp_lat, 0.01), 0.005, 0.08))
    # detraining: higher when exp_lat is low
    k_d = float(np.clip(rng.normal(0.020 - 0.012*exp_lat, 0.004), 0.002, 0.03))
    # noise: higher when exp_lat is low
    obs_noise = float(np.clip(rng.normal(0.25 - 0.18*exp_lat, 0.05), 0.03, 0.35))

    resilience = float(np.clip(rng.normal(1.0 + 0.6*exp_lat, 0.25), 0.4, 2.2))
    fatigue_sens = float(np.clip(rng.lognormal(mean=-0.2, sigma=0.35), 0.2, 2.0))

    return dict(
        experience_latent=exp_lat,
        alpha_adapt=alpha,
        k_detraining=k_d,
        obs_noise=obs_noise,
        resilience=resilience,
        fatigue_sens=fatigue_sens,
        rpe_report_bias=float(rng.normal(0.0, 0.35)),
    )

def latents_to_experience_label(exp_lat: float) -> str:
    # simple discretization
    # 3 classes: Beginner / Intermediate / Advanced
    if exp_lat < 0.40:
        return "Beginner"
    if exp_lat < 0.80:
        return "Intermediate"
    return "Advanced"

def generate_users(cfg: CFG, rng) -> pd.DataFrame:
    rows = []
    for uid in range(1, cfg.n_users + 1):
        start_u, end_u = sample_user_window(cfg, rng)
        weekly_freq = clamp_int(rng.normal(cfg.weekly_freq_mu, cfg.weekly_freq_sd),
                                cfg.weekly_freq_min, cfg.weekly_freq_max)
        split = sample_split(rng)

        lat = sample_user_latents(cfg, rng)
        exp_label = latents_to_experience_label(lat["experience_latent"])

        rows.append({
            "userid": uid,
            "weeklyfreqdeclared": weekly_freq,
            "splittype": split,
            "start_date": start_u.isoformat(),
            "end_date": end_u.isoformat(),

            # target label (not present in the logs)
            "experience_label": exp_label,

            # latents
            "experience_latent": round(lat["experience_latent"], 4),
            "alpha_adapt": round(lat["alpha_adapt"], 5),
            "k_detraining": round(lat["k_detraining"], 5),
            "obs_noise": round(lat["obs_noise"], 4),
            "resilience": round(lat["resilience"], 4),
            "fatigue_sens": round(lat["fatigue_sens"], 4),
            "rpe_report_bias": round(lat["rpe_report_bias"], 4),
        })
    return pd.DataFrame(rows)

df_users = generate_users(cfg, rng)
df_users.head()


| | userid | weeklyfreqdeclared | splittype | start_date | end_date | experience_label | experience_latent | alpha_adapt | k_detraining | obs_noise | resilience | fatigue_sens | rpe_report_bias |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 4 | PPL | 2024-09-06 | 2026-02-02 | Beginner | 0.2995 | 0.05222 | 0.01466 | 0.1321 | 1.0623 | 0.5385 | -0.6439 |
| 1 | 2 | 4 | PPL | 2024-04-13 | 2026-02-06 | Intermediate | 0.47 | 0.03279 | 0.01361 | 0.187 | 0.9602 | 0.9363 | 0.3472 |
| 2 | 3 | 4 | FullBody | 2026-02-26 | 2026-05-30 | Beginner | 0.2713 | 0.04981 | 0.01256 | 0.2242 | 1.5184 | 0.6392 | 0.2331 |
| 3 | 4 | 3 | FullBody | 2025-01-21 | 2026-03-12 | Intermediate | 0.4448 | 0.052 | 0.01132 | 0.2132 | 1.4421 | 0.8406 | 0.2141 |
| 4 | 5 | 4 | PPL | 2025-02-12 | 2026-06-04 | Beginner | 0.2797 | 0.03763 | 0.0219 | 0.2159 | 1.0132 | 0.6727 | 0.4974 |


---
# 6. **Capabilities & Templates**

Generates the initial capabilities (cmax) and the session templates.

## **Capabilities (Cmax)**
For each (user, exercise): a **theoretical maximum load**
- Base = f(experience_label, experience_latent)
- Exercise difficulty multiplier (Beginner: 0.85×, Intermediate: 1.0×, Advanced: 1.12×)
- Individual jitter

**Important**: `cmax` **evolves dynamically**:
- ↑ with progressive overload
- ↓ with detraining (breaks > 7 days); note that in Cell 8 this capability decay is currently commented out, so only `fitness` decays during breaks
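
As a worked example of the initial `cmax`, the sketch below evaluates its mean before the individual jitter, the 10-200 kg clipping, and the load quantization are applied. The values mirror `BASEMAP` and the difficulty multipliers used in Cell 6; the names `DIFF_MUL` and `expected_cmax` are hypothetical and exist only for this illustration.

```python
BASEMAP = {"Beginner": 50.0, "Intermediate": 80.0, "Advanced": 105.0}
DIFF_MUL = {"Beginner": 0.85, "Intermediate": 1.0, "Advanced": 1.12}  # hypothetical name

def expected_cmax(exp_label, exp_lat, ex_difficulty):
    """Hypothetical helper: mean cmax in kg, before jitter, clipping and quantization."""
    base = BASEMAP.get(exp_label, 70.0) * (0.85 + 0.30 * exp_lat)
    return base * DIFF_MUL.get(ex_difficulty, 1.0)

# Beginner user (latent 0.30) on an Intermediate exercise
print(round(expected_cmax("Beginner", 0.30, "Intermediate"), 2))
# Advanced user (latent 0.90) on an Advanced exercise
print(round(expected_cmax("Advanced", 0.90, "Advanced"), 2))
```

The first case gives about 47 kg and the second about 132 kg, so the label sets the scale while the latent moves the value within a ±15% band around it.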

## **Session Templates**

### PPL Split (70% of users)
Rotation: Push → Pull → Legs (3-6 exercises per session)

### FullBody Split (30% of users)
Rotation: FullBody-A/B/C (balanced mix: 1 legs + 1 push + 1 pull + optional core)

## Output
- `caps`: dict `{userid: {exerciseid: cmax_kg}}`
- `templates`: dict `{userid: {session_tag: [exerciseid, ...]}}`


**CELL 6 — (Python) Capabilities (cmax) + templates per split**

In [None]:
BASEMAP = {"Beginner": 50.0, "Intermediate": 80.0, "Advanced": 105.0}

def build_capabilities(df_users: pd.DataFrame, df_ex: pd.DataFrame, rng) -> dict:
    caps = {}
    for u in df_users.itertuples(index=False):
        uid = int(u.userid)
        exp_label = str(u.experience_label)
        exp_lat = float(u.experience_latent)

        # base scale from the label (only for the "average" cmax), with continuous jitter on exp_lat
        # Note: the label does not drive the dynamics; it is only a prior on the theoretical maximum.
        base_factor = BASEMAP.get(exp_label, 70.0) * (0.85 + 0.30*exp_lat)

        usermap = {}
        for ex in df_ex.itertuples(index=False):
            # exercise difficulty affects the relative cmax
            diff = str(ex.difficultylevel)
            diff_mul = {"Beginner": 0.85, "Intermediate": 1.0, "Advanced": 1.12}.get(diff, 1.0)
            cmax = base_factor * diff_mul * float(rng.normal(1.0, 0.12))
            cmax = float(np.clip(cmax, 10.0, 200.0))
            usermap[int(ex.exerciseid)] = qload(cmax, cfg.load_step)
        caps[uid] = usermap
    return caps

PPL_ROT = ["Push", "Pull", "Legs"]
FB_ROT = ["FullBody-A", "FullBody-B", "FullBody-C"]

def choose_exercises_for_tag(df_ex: pd.DataFrame, tag: str, rng, n_min=3, n_max=6):
    if tag.startswith("FullBody"):
        pools = {
            "legs": df_ex[df_ex["splitcat"].isin(["legs"])],
            "push": df_ex[df_ex["splitcat"].isin(["push"])],
            "pull": df_ex[df_ex["splitcat"].isin(["pull"])],
            "core": df_ex[df_ex["splitcat"].isin(["core"])],
            "other": df_ex[~df_ex["splitcat"].isin(["legs","push","pull","core"])],
        }
        exids = []
        for k, n in [("legs",1),("push",1),("pull",1)]:
            if len(pools[k]) > 0:
                exids += rng.choice(pools[k]["exerciseid"].values, size=n, replace=False).tolist()
        if len(pools["core"]) and rng.random() < 0.6:
            exids += rng.choice(pools["core"]["exerciseid"].values, size=1, replace=False).tolist()
        while len(exids) < 4:
            pool = pools["other"] if len(pools["other"]) else df_ex
            exids += rng.choice(pool["exerciseid"].values, size=1, replace=False).tolist()
        # dedup, preserving order
        seen = set()
        exids2 = []
        for x in exids:
            if x not in seen:
                seen.add(x)
                exids2.append(int(x))
        return exids2[:6]

    # PPL
    tag_l = tag.lower()
    pool = df_ex[df_ex["splitcat"] == tag_l]
    if len(pool) < 3:
        pool = df_ex.copy()
    n = int(rng.integers(n_min, n_max + 1))
    n = min(n, len(pool))
    return [int(x) for x in rng.choice(pool["exerciseid"].values, size=n, replace=False)]

def build_user_templates(df_users, df_ex, rng):
    templates = {}
    for u in df_users.itertuples(index=False):
        uid = int(u.userid)
        split = str(u.splittype)
        tags = PPL_ROT if split == "PPL" else FB_ROT
        templates[uid] = {tag: choose_exercises_for_tag(df_ex, tag, rng) for tag in tags}
    return templates

caps = build_capabilities(df_users, df_ex, rng)
templates = build_user_templates(df_users, df_ex, rng)

list(templates.items())[0]


(1, {'Push': [10, 8, 1, 4], 'Pull': [2, 11, 5], 'Legs': [12, 9, 3, 6]})

---
# 7. **Prescription Helpers**

Functions that prescribe the training parameters.

## `prescribe_exercise()`
Generates a prescription based on exercise difficulty and user experience.

### **Ranges by Exercise Difficulty**

| Parametro | Beginner Ex | Intermediate Ex | Advanced Ex |
|-----------|-------------|-----------------|-------------|
| Sets | 2-4 | 3-5 | 3-6 |
| Reps (mid) | ~12 | ~8 | ~6 |
| Rest (sec) | 60-120 | 90-180 | 120-240 |

**RIR target** depends on user experience (sampled around these means, then rounded to an integer in 0-5):
- Beginner user → RIR ~2.5 (safety buffer)
- Intermediate → RIR ~2.0
- Advanced → RIR ~1.5 (close to failure)

## `intensity_from_reps_rir()`
Intensity heuristic (fraction of cmax, clipped to [0.35, 0.92]) from reps + RIR:
```
intensity = 0.86 - 0.018*(reps - 5) - 0.03*rir + noise
```

Examples:
- 5 reps, RIR 1 → ~83% (strength)
- 12 reps, RIR 2.5 → ~66% (hypertrophy)
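
The examples above can be reproduced with a noise-free version of the heuristic. The sketch below drops the random term of `intensity_from_reps_rir()` but keeps the same coefficients and clipping; the name `intensity_mean` is hypothetical.

```python
import numpy as np

def intensity_mean(reps_target, rir_target):
    """Noise-free intensity heuristic, with the same clipping as Cell 7."""
    base = 0.86 - 0.018 * (reps_target - 5.0) - 0.03 * rir_target
    return float(np.clip(base, 0.35, 0.92))

for reps, rir in [(5, 1), (8, 2), (12, 2.5), (20, 5)]:
    inten = intensity_mean(reps, rir)
    # with cmax = 100 kg the intended load in kg equals the percentage
    print(f"{reps:>2} reps, RIR {rir}: intensity {inten:.3f}, load at cmax=100 kg: {inten * 100:.1f}")
```

Each extra rep lowers the intensity by 1.8 points and each extra rep in reserve by 3 points, so a set of 20 with RIR 5 lands at 44% of cmax, still above the 35% floor.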


**CELL 7 - (Python) Plan prescription (per exercise) + intensity helpers**

In [None]:
def prescribe_exercise(df_ex_row: dict, experience_label: str, rng):
    # Reps/sets/rest/RIR ranges per exercise (simple and stable)
    lvl = str(df_ex_row["difficultylevel"])

    # base defaults depend on exercise difficulty (not on experience!)
    if lvl == "Beginner":
        reps_mu, reps_sd = 12, 2
        sets_lo, sets_hi = 2, 4
        rest_lo, rest_hi = 60, 120
    elif lvl == "Intermediate":
        reps_mu, reps_sd = 8, 2
        sets_lo, sets_hi = 3, 5
        rest_lo, rest_hi = 90, 180
    else:  # Advanced/other
        reps_mu, reps_sd = 6, 2
        sets_lo, sets_hi = 3, 6
        rest_lo, rest_hi = 120, 240

    # The RIR target may depend on experience (a coaching choice), but it does not act as a "dynamics flag"
    # (it defines the prescription, which is observable in the plan)
    if experience_label in ["Beginner", "Novice"]:
        rir_mu = 2.5
    elif experience_label == "Intermediate":
        rir_mu = 2.0
    else:
        rir_mu = 1.5

    setsplanned = int(rng.integers(sets_lo, sets_hi + 1))
    repsmid = clamp_int(rng.normal(reps_mu, reps_sd), 3, 20)
    width = 2 if lvl != "Beginner" else 3
    repsmin = max(1, repsmid - width)
    repsmax = min(30, repsmid + width)

    restplannedsec = int(rng.integers(rest_lo, rest_hi + 1))
    rirtarget = clamp_int(rng.normal(rir_mu, 0.6), 0, 5)

    return setsplanned, repsmin, repsmax, restplannedsec, rirtarget

def intensity_from_reps_rir(reps_target: float, rir_target: float, rng):
    # heuristic: more reps and more RIR => lower intensity
    base = 0.86 - 0.018*(reps_target - 5.0) - 0.03*rir_target
    base += float(rng.normal(0.0, 0.02))
    return float(np.clip(base, 0.35, 0.92))


---
# 8. **Core Simulation Engine**

**Heart of the generator**: dynamic simulation for a single user.

## **Simulation Flow**

### 1. **Initialization**
- Initial fitness, fatigue, skill
- Individual Banister parameters (tauF, tauD)
- Initial per-exercise capabilities (copied from `caps_u`; they evolve over time)

### 2. **Session Schedule**
- Candidate dates from `weeklyfreqdeclared`
- Realistic jitter (±1 day)

### 3. **Session Loop**

For each scheduled date:

#### A. Detraining (if gap > 1 day)
```python
fitness *= exp(-k_d * gap)
# If gap > 7: capabilities would decay too (currently commented out in Cell 8)
```

#### B. Fatigue Decay
```python
fatigue *= exp(-1/7)  # once per scheduled date; 7 steps give e^-1 ≈ 0.37 (half-life ≈ 4.9 steps)
```

#### C. Skip Decision
Logistic model:
```python
p_skip = sigmoid(logit(p0_level) + weight * min(log1p(fatigue), cap) + noise)
```
- Baseline: Beginner 10%, Intermediate 6.5%, Advanced 5%
- Fatigue raises the probability
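
To see how much fatigue can move the skip probability, the sketch below restates the skip model of Cell 8 as a pure function, with the noise term passed in explicitly instead of sampled. The function name `skip_probability` and the fatigue values are illustrative; the defaults match `skip_fatigue_weight` and `skip_fatigue_cap` in `CFG`.

```python
import math

def skip_probability(p0, fatigue, weight=0.25, cap=1.2, noise=0.0):
    """Skip probability for one scheduled date (noise given, not sampled)."""
    bias = math.log(p0 / (1.0 - p0))                     # logit of the baseline
    fat_term = min(math.log1p(max(0.0, fatigue)), cap)   # saturating fatigue term
    z = bias + weight * fat_term + noise
    return 1.0 / (1.0 + math.exp(-z))

for level, p0 in [("Beginner", 0.10), ("Intermediate", 0.065), ("Advanced", 0.05)]:
    rested = skip_probability(p0, fatigue=0.0)
    tired = skip_probability(p0, fatigue=5.0)   # log1p(5) exceeds the cap
    print(f"{level}: rested {rested:.3f}, at the fatigue cap {tired:.3f}")
```

With zero fatigue the model returns the baseline exactly. Because the fatigue term is capped at 1.2, it adds at most 0.3 to the logit, which lifts the Beginner baseline from 10% to about 13%.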

#### D. Session Execution (if not skipped)
For each exercise:
1. **Prescription**: setsplanned, reps range, RIR
2. **Set execution**: for each set:
   ```python
   load_done = intensity * cmax * (1 - fatigue_factor) * noise
   reps_done ~ reps_target * (1 - 0.2*fatigue_factor) + noise
   RPE = 4 + 5.5*intensity + 1.8*rep_gap + 1.2*fatigue_factor + bias
   ```
   - Fatigue penalty **reduced for Beginners** (80% lower) → newbie gains
3. **Banister impulse**: `impulse += load * reps * (RPE/10)`
4. **Intra-session fatigue**: increases progressively

#### E. **Progressive Overload (post-session)**
```python
stim = clip(impulse / I0, 0.05, 2.0)
quality = max(0.2, 1 - 0.6*(fatigue/20))
gain_base = growth_rate * stim * quality
```
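
As a rough numerical illustration of the three formulas above, the sketch below wraps them in one function and evaluates a typical session and an extreme one. The function name `overload_gain` and the input values are hypothetical; the defaults match `overload_I0` and `overload_quality_fatigue` in `CFG`.

```python
import numpy as np

def overload_gain(impulse, fatigue, growth_rate, I0=1800.0, quality_fatigue=0.6):
    """Relative capability gain for one session, following the formulas above."""
    stim = float(np.clip(impulse / I0, 0.05, 2.0))                  # normalized dose
    quality = max(0.2, 1.0 - quality_fatigue * (fatigue / 20.0))   # fatigue discount
    return growth_rate * stim * quality

# typical session: impulse at the normalization value, low fatigue
typical = overload_gain(impulse=1800.0, fatigue=2.0, growth_rate=0.005)
# extreme session: the dose saturates at 2.0 and the quality bottoms out at 0.2
extreme = overload_gain(impulse=9000.0, fatigue=40.0, growth_rate=0.005)
print(round(typical, 5), round(extreme, 5))
```

Under these assumptions a very large but very fatigued session yields less gain than a typical one (0.2% versus 0.47%), because the quality floor outweighs the doubled stimulus.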

**Transfer between exercises**:
- Trained directly: 100%
- Same targetmusclegroup: 35%
- Same splitcat: 12%
- No similarity: 0%

```python
current_caps[eid] *= (1 + gain_base * transfer_weight)
```
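
The transfer rule can be sketched as a small lookup followed by the capability update. This is an illustration under stated assumptions: the helper `transfer_weight`, the order in which the muscle-group match is checked before the split match, and the sample capabilities are not taken from the notebook, whose own transfer code lies outside this part.

```python
def transfer_weight(trained, other, same_muscle=0.35, same_split=0.12):
    """Hypothetical helper: share of the gain that `other` receives when `trained` is performed."""
    if trained["exerciseid"] == other["exerciseid"]:
        return 1.0
    if trained["targetmusclegroup"] == other["targetmusclegroup"]:
        return same_muscle   # assumed to take precedence over the split match
    if trained["splitcat"] == other["splitcat"]:
        return same_split
    return 0.0

bench = {"exerciseid": 1, "targetmusclegroup": "Chest", "splitcat": "push"}
fly = {"exerciseid": 4, "targetmusclegroup": "Chest", "splitcat": "push"}
lateral_raise = {"exerciseid": 8, "targetmusclegroup": "Shoulders", "splitcat": "push"}
squat = {"exerciseid": 3, "targetmusclegroup": "Quadriceps", "splitcat": "legs"}

current_caps = {1: 80.0, 4: 20.0, 8: 10.0, 3: 100.0}   # made-up capabilities in kg
gain_base = 0.0047                                      # made-up session gain

# the bench press was trained: every exercise gets its share of the gain
for ex in (bench, fly, lateral_raise, squat):
    current_caps[ex["exerciseid"]] *= 1 + gain_base * transfer_weight(bench, ex)

print({k: round(v, 3) for k, v in current_caps.items()})
```

In this toy run the bench press receives the full gain, the cable fly 35% of it, the lateral raise 12%, and the squat stays unchanged.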

#### F. **Injury Event**
```python
p_injury = lambda * (impulse / resilience) * (1 + 0.5*fatigue_sens)
```
On injury, the planned sets are scaled to 60% (at least 1 set) until recovery.

#### G. **Global State Update**
```python
fitness += alpha * log1p(volume_proxy)
skill += 0.002 * log1p(1 + exp_lat)
fatigue = fatigue_session
```

## **Per-User Output**
- `workouts_rows`: session metadata
- `plan_rows`: prescriptions
- `sets_rows`: detailed set-level log
- `impulse_rows`: daily impulse
- `user_meta`: Banister parameters

---

**Complexity**: O(n_users × n_sessions × n_exercises × n_sets)

300 users × 200 sessions × 4 exercises × 4 sets ≈ **1M rows**

Run time: ~5-8 minutes on a standard CPU.


**CELL 8 - (Python) Per-user scheduler + set-level simulation (core)**

In [None]:
def schedule_sessions_for_user(start_u: date, end_u: date, weekly_freq: int, rng):
    # target weekdays
    basedays = sorted(rng.choice(np.arange(7), size=weekly_freq, replace=False).tolist())
    dates = []
    d0 = start_u
    n_days = (end_u - start_u).days + 1
    for i in range(n_days):
        day = d0 + timedelta(days=i)
        if day.weekday() in basedays:
            # jitter -1/0/+1
            jitter = int(rng.choice([-1,0,1], p=cfg.weekday_jitter_probs))
            day2 = day + timedelta(days=jitter)
            if start_u <= day2 <= end_u:
                dates.append(day2)
    dates = sorted(list(set(dates)))
    return dates

def simulate_user(cfg: CFG, user_row: dict, df_ex: pd.DataFrame, caps_u: dict, templates_u: dict, rng):
    uid = int(user_row["userid"])
    start_u = date.fromisoformat(user_row["start_date"])
    end_u = date.fromisoformat(user_row["end_date"])
    weekly_freq = int(user_row["weeklyfreqdeclared"])
    experience_label = str(user_row["experience_label"])

    # latents (state + parameters)
    exp_lat = float(user_row["experience_latent"])
    alpha = float(user_row["alpha_adapt"])
    k_d = float(user_row["k_detraining"])
    obs_noise = float(user_row["obs_noise"])
    resilience = float(user_row["resilience"])
    fatigue_sens = float(user_row["fatigue_sens"])
    rpe_bias = float(user_row["rpe_report_bias"])

    # per-user Banister params
    tauF = float(max(7.0, rng.normal(cfg.tauF_mean, cfg.tauF_sd)))
    tauD = float(max(2.0, rng.normal(cfg.tauD_mean, cfg.tauD_sd)))

    # dynamic state (scalars)
    fitness = float(rng.normal(0.0, 1.0))
    fatigue = float(max(0.0, rng.normal(0.5, 0.3)))
    skill = float(np.clip(rng.normal(0.2 + 0.6*exp_lat, 0.15), 0.0, 2.0))

    # injury state
    injury_until = None

    # "candidate" schedule
    session_dates = schedule_sessions_for_user(start_u, end_u, weekly_freq, rng)

    workouts_rows = []
    plan_rows = []
    sets_rows = []
    impulse_rows = []

    wid = 1  # per-user counter, made global outside
    set_id_counter = 1

    # tag rotation per split
    tags = PPL_ROT if str(user_row["splittype"]) == "PPL" else FB_ROT
    tag_i = int(rng.integers(0, len(tags)))

    last_train_date = None

    # --- PROGRESSIVE OVERLOAD STATE ---
    # Dynamic copy of the capabilities (they evolve over time)
    current_caps = {eid: float(val) for eid, val in caps_u.items()}

    # Personal growth rate
    # NOTE: the upper clip of 0.005 means a Beginner base rate of 0.05 always ends up at 0.005
    base_growth = cfg.overload_base_rate.get(experience_label, 0.001)
    growth_rate = float(np.clip(rng.normal(base_growth, 0.0002), 1e-5, 0.005))


    for d in session_dates:
        # detraining: applies when more than one day has passed since the last session
        if last_train_date is not None:
            gap = (d - last_train_date).days
            if gap > 1:
                fitness *= math.exp(-k_d * gap)

                # Capability detraining for very long gaps (>7 days), currently disabled:
                # if gap > 7:
                #     decay = math.exp(-k_d * (gap - 7) * 0.3)
                #     for eid in current_caps:
                #         current_caps[eid] *= decay

        # fatigue decay, applied once per scheduled date (not per calendar day)
        fatigue *= math.exp(-1.0/7.0)

        # injury state for this date
        in_injury = (injury_until is not None and d <= injury_until)



        # skip probability
        # --- SKIP MODEL (baseline per level + light fatigue effect) ---
        p0 = float(cfg.skip_p0_by_level.get(experience_label, 0.10))
        bias = logit(p0)

        fat_term = float(np.log1p(max(0.0, float(fatigue))))
        fat_term = min(fat_term, float(cfg.skip_fatigue_cap))

        z = bias + float(cfg.skip_fatigue_weight) * fat_term + float(rng.normal(0.0, cfg.skip_noise_sd))
        p_skip = sigmoid(z)

        status = "done"
        if rng.random() < p_skip:
            status = "skipped"
        # ----------------------------------------



        # assign the session tag
        tag = tags[tag_i % len(tags)]
        tag_i += 1

        week_index_user = (d - start_u).days // 7 + 1

        workouts_rows.append({
            "userid": uid,
            "date": d.isoformat(),
            "weekindex_user": int(week_index_user),
            "sessiontag": tag,
            "workoutstatus": status,
            "z_skip": float(z),
            "p_skip": float(p_skip),
            "fatigue_term": float(fat_term),
            "experience_label": experience_label,
        })

        if status == "skipped":
            impulse_rows.append({"userid": uid, "date": d.isoformat(), "impulse": 0.0})
            continue

        # plan per exercise (for this session)
        exids = templates_u.get(tag, [])
        if len(exids) == 0:
            exids = templates_u[list(templates_u.keys())[0]]

        # intra-session fatigue (simple scalar)
        fatigue_session = float(fatigue)

        day_impulse = 0.0
        day_total_sets = 0

        for exid in exids:
            exrow = df_ex[df_ex["exerciseid"] == exid].iloc[0].to_dict()
            setsplanned, repsmin, repsmax, restplannedsec, rirtarget = prescribe_exercise(exrow, experience_label, rng)

            # reduce volume while injured
            if in_injury:
                setsplanned = max(1, int(round(setsplanned * 0.6)))

            plan_rows.append({
                "userid": uid,
                "date": d.isoformat(),
                "sessiontag": tag,
                "exerciseid": int(exid),
                "setsplanned": int(setsplanned),
                "repsmin": int(repsmin),
                "repsmax": int(repsmax),
                "restplannedsec": int(restplannedsec),
                "rirtarget": int(rirtarget),
            })

            # Use the CURRENT (dynamic) capability
            cmax = current_caps.get(int(exid), 50.0)

            # intended baseline load (per exercise), derived from the first set
            reps_target0 = int(rng.integers(repsmin, repsmax + 1))
            inten0 = intensity_from_reps_rir(reps_target0, rirtarget, rng)
            intended_load = qload(inten0 * cmax, cfg.load_step)

            # "execution": setsdone ~ setsplanned with noise / implicit adherence
            setsdone = int(np.clip(round(rng.normal(setsplanned, 0.5)), 1, 10))

            for s in range(1, setsdone + 1):
                day_total_sets += 1

                reps_target = int(rng.integers(repsmin, repsmax + 1))
                inten = intensity_from_reps_rir(reps_target, rirtarget, rng)

                fatigue_factor = float(np.clip(0.03 * fatigue_session * fatigue_sens, 0.0, 0.20))

                # Reduced fatigue effect for Beginners (to allow newbie gains)
                if experience_label == "Beginner":
                    fatigue_factor *= 0.2  # 80% reduction (was 70%)



                load_done = float(inten * cmax * (1.0 - fatigue_factor))
                load_done *= float(rng.normal(1.0, 0.03 + obs_noise*0.08))
                load_done = qload(float(np.clip(load_done, 2.5, cmax)), cfg.load_step)

                # reps drop as fatigue rises
                reps_done = int(np.clip(round(rng.normal(reps_target * (1.0 - 0.20*fatigue_factor), 0.6 + obs_noise*1.5)),
                                        1, 30))

                # RPE grows with intensity, fatigue, and the reps gap
                rep_gap = (reps_target - reps_done) / max(1.0, reps_target)
                rpe_true = 4.0 + 5.5*inten + 1.8*rep_gap + 1.2*fatigue_factor
                rpe_obs = float(rng.normal(rpe_true + rpe_bias, 0.35 + obs_noise))
                rpe_done = qrpe(rpe_obs, cfg.rpe_step)

                # occasional feedback
                feedback = None
                if rng.random() < 0.03:
                    feedback = str(rng.choice([
                        "Tecnica ok", "Fatica alta", "Allenamento solido", "Recuperi corti", "Non ero in giornata"
                    ]))

                # missingness (observations only)
                if rng.random() < cfg.p_missing_rpe:
                    rpe_done = np.nan
                if rng.random() < cfg.p_missing_load:
                    load_done = np.nan
                if rng.random() < cfg.p_missing_feedback:
                    feedback = None

                sets_rows.append({
                    "set_id": f"U{uid:04d}_S{set_id_counter:07d}",
                    "userid": uid,
                    "date": d.isoformat(),
                    "weekindex_user": int(week_index_user),
                    "sessiontag": tag,
                    "exerciseid": int(exid),

                    "set_index": int(s),

                    "reps_target": int(reps_target),
                    "reps_done": int(reps_done),
                    "load_intended_kg": float(intended_load),
                    "load_done_kg": load_done,
                    "rpe_done": rpe_done,

                    "restplannedsec": int(restplannedsec),
                    "rirtarget": int(rirtarget),
                    "feedback": feedback,
                })
                set_id_counter += 1

                # daily impulse (Banister input)
                ld = 0.0 if (isinstance(load_done, float) and np.isnan(load_done)) else float(load_done)
                rd = 0.0 if (isinstance(rpe_done, float) and np.isnan(rpe_done)) else float(rpe_done)
                day_impulse += ld * float(reps_done) * (rd / 10.0)

                # update intra-session fatigue
                fatigue_session += 0.08 * inten + 0.02 * (ld / max(20.0, cmax))

            # update fitness/fatigue/skill after each exercise (very simple)
            # "effective" load = intended load * relative volume
            vol_proxy = setsdone * reps_target0 * float(intended_load)
            fitness += alpha * math.log1p(vol_proxy / 1000.0)
            skill += 0.002 * math.log1p(1.0 + exp_lat)  # slow growth

        # --- APPLY PROGRESSIVE OVERLOAD (dose-driven + transfer) ---
        if status == "done" and day_impulse > 5.0 and not in_injury:
            # Normalized stimulus (saturates at 2x)
            stim = float(np.clip(day_impulse / cfg.overload_I0, 0.05, 2.0))

            # Quality factor: decreases as fatigue rises
            quality = max(0.2, 1.0 - cfg.overload_quality_fatigue * (fatigue / 20.0))

            # Base gain for this session
            gain_base = growth_rate * stim * quality

            # Transfer: iterate over ALL exercises and weight each by similarity
            exids_done_set = set(exids)  # exercises trained today

            for ex_target in df_ex.itertuples(index=False):
                eid_target = int(ex_target.exerciseid)

                # Compute the transfer weight
                if eid_target in exids_done_set:
                    # Exercise trained directly
                    weight = 1.00
                else:
                    # Indirect transfer (similarity-based)
                    weight = 0.0

                    # Check similarity against each exercise trained today
                    for eid_done in exids_done_set:
                        ex_done_row = df_ex[df_ex["exerciseid"] == eid_done].iloc[0]

                        # Same targetmusclegroup (specific)
                        if str(ex_target.targetmusclegroup).lower() == str(ex_done_row["targetmusclegroup"]).lower():
                            weight = max(weight, cfg.transfer_same_muscle)
                        # Same splitcat (similar movement pattern)
                        elif str(ex_target.splitcat).lower() == str(ex_done_row["splitcat"]).lower():
                            weight = max(weight, cfg.transfer_same_split)

                # Apply the adaptation
                if weight > 0:
                    gain = gain_base * weight
                    current_caps[eid_target] = float(current_caps.get(eid_target, 50.0) * (1.0 + gain))

            # DEBUG: check growth for User 1 (an Advanced user whose capacity was dropping).
            # Printed once per session, after all capacities have been updated.
            if uid == 1:
                cap_ex9_new = current_caps.get(9, 0)
                print(f"[DEBUG] User {uid} Day {d.isoformat()}: Impulse={day_impulse:.1f}, Stim={stim:.3f}, Quality={quality:.3f}, GainBase={gain_base:.5f}")
                print(f"         Cap Ex9 → {cap_ex9_new:.2f} kg")

        # injury event (after the session): probability increases with fatigue, "impulse", and low resilience
        p_injury = cfg.injury_lambda * (day_impulse / max(1.0, resilience)) * (1.0 + 0.5 * fatigue_sens)
        p_injury = float(np.clip(p_injury, 0.0, 0.35))
        if (injury_until is None or d > injury_until) and rng.random() < p_injury:
            injury_days = int(rng.integers(cfg.injury_days_min, cfg.injury_days_max + 1))
            injury_until = d + timedelta(days=injury_days)

        # update global fatigue at the end of the session
        fatigue = float(np.clip(fatigue_session, 0.0, 20.0))

        impulse_rows.append({"userid": uid, "date": d.isoformat(), "impulse": float(day_impulse)})
        last_train_date = d

    # user-level Banister metadata output
    user_meta = {
        "userid": uid,
        "tauF": tauF,
        "tauD": tauD,
        "betaF": cfg.betaF,
        "betaD": cfg.betaD
    }

    return workouts_rows, plan_rows, sets_rows, impulse_rows, user_meta
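
The transfer rule used in the progressive-overload block above can be isolated into a small helper, which also avoids filtering `df_ex` inside the inner loop. This is only a sketch: the name `transfer_weight`, the toy catalogue, and the weights 0.5 and 0.2 are assumptions standing in for `cfg.transfer_same_muscle` and `cfg.transfer_same_split`.

```python
import pandas as pd

def transfer_weight(target, done_ids, catalogue, w_muscle=0.5, w_split=0.2):
    """Adaptation weight for one exercise, given the exercises trained today.

    catalogue maps exerciseid -> (targetmusclegroup, splitcat), lower-cased.
    w_muscle and w_split are placeholder values (assumptions).
    """
    if target in done_ids:
        return 1.0  # trained directly
    t_muscle, t_split = catalogue[target]
    weight = 0.0
    for eid in done_ids:
        d_muscle, d_split = catalogue[eid]
        if t_muscle == d_muscle:
            weight = max(weight, w_muscle)   # same muscle group
        elif t_split == d_split:
            weight = max(weight, w_split)    # same split category
    return weight

# Toy exercise catalogue (hypothetical rows)
df_ex_toy = pd.DataFrame({
    "exerciseid": [1, 2, 3, 4],
    "targetmusclegroup": ["Chest", "chest", "Triceps", "Quads"],
    "splitcat": ["Push", "Push", "Push", "Legs"],
})
# Build the lookup once instead of filtering the DataFrame per pair
catalogue = {
    int(r.exerciseid): (str(r.targetmusclegroup).lower(), str(r.splitcat).lower())
    for r in df_ex_toy.itertuples(index=False)
}
weights = {eid: transfer_weight(eid, {1}, catalogue) for eid in catalogue}
print(weights)
```

With only exercise 1 trained, exercise 2 receives the same-muscle weight, exercise 3 the same-split weight, and exercise 4 no transfer.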


### **Test Detraining Logic**

Checks that capacity decays during long breaks.


In [None]:
# === DETRAINING TEST ===
print("Check that current_caps exists and can be modified:")
test_caps = {1: 100.0, 2: 80.0}
k_d_test = 0.01
gap_test = 10

decay = 1.0  # no decay within the 7-day grace period
if gap_test > 7:
    decay = math.exp(-k_d_test * (gap_test - 7) * 0.3)
    for eid in test_caps:
        test_caps[eid] *= decay

print(f"  Decay factor for gap={gap_test}: {decay:.4f}")
print(f"  Caps after: {test_caps}")
print("  (Expected ~99 and ~79, not 100 and 80)")


Check that current_caps exists and can be modified:
  Decay factor for gap=10: 0.9910
  Caps after: {1: 99.10403787728836, 2: 79.28323030183068}
  (Expected ~99 and ~79, not 100 and 80)
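
The decay rule exercised by this test can be written as a reusable function. The sketch below assumes the same constants as the test cell (a 7-day grace period and a 0.3 scaling factor); the names `detraining_factor`, `grace_days`, and `scale` are illustrative and do not appear in the notebook.

```python
import math

def detraining_factor(gap_days, k_d, grace_days=7, scale=0.3):
    """Multiplicative capacity decay after gap_days without training.

    Returns 1.0 (no decay) while the gap is within the grace period.
    """
    if gap_days <= grace_days:
        return 1.0
    return math.exp(-k_d * (gap_days - grace_days) * scale)

caps = {1: 100.0, 2: 80.0}
factor = detraining_factor(gap_days=10, k_d=0.01)
caps_after = {eid: cap * factor for eid, cap in caps.items()}
print(round(factor, 4), caps_after)
```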


---
# 9. **Run Generator (All Users)**

Runs the simulation for every user.

## **Process**
1. Loop over all users in `df_users`
2. For each user:
   - `simulate_user()` → workouts, plan, sets, impulse, Banister metadata
   - Assign a **globally** unique `workoutid`
   - Build the mapping `(userid, date) → workoutid`
   - Propagate `workoutid` to plan/sets
3. Concatenate the global DataFrames
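
The id-assignment and propagation steps can also be expressed with a pandas merge instead of a dictionary lookup per row. This is a minimal sketch on toy data, not the notebook's implementation; the toy frames are assumptions.

```python
import numpy as np
import pandas as pd

# Toy workouts and sets (hypothetical rows)
workouts = pd.DataFrame({
    "userid": [1, 1, 2],
    "date": ["2024-09-06", "2024-09-10", "2024-09-06"],
})
sets = pd.DataFrame({
    "userid": [1, 1, 1, 2],
    "date": ["2024-09-06", "2024-09-06", "2024-09-10", "2024-09-06"],
    "set_index": [1, 2, 1, 1],
})

# One global id per (userid, date), assigned in sorted order
workouts = workouts.sort_values(["userid", "date"]).reset_index(drop=True)
workouts["workoutid"] = np.arange(1, len(workouts) + 1)

# Propagate the id to the set-level rows; validate guards against duplicate keys
sets_with_id = sets.merge(
    workouts[["userid", "date", "workoutid"]],
    on=["userid", "date"],
    how="left",
    validate="many_to_one",
)
print(sets_with_id)
```

A left merge keeps every set row, so a missing `workoutid` after the merge would signal a set without a matching workout.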

## **Global Outputs**
- `df_workouts`: workout metadata
- `df_plan`: prescriptions
- `df_sets`: **set logs (main table)**
- `df_impulse`: daily impulses
- `df_ban_meta`: per-user Banister parameters

---

 **This cell takes several minutes to run.**

 During execution it prints debug output for User 1 (to verify progressive overload).


**CELL 9 — (Python) Run generator (all users) + global IDs**

In [None]:
all_workouts = []
all_plan = []
all_sets = []
all_impulse = []
ban_meta_rows = []

workout_id_counter = 1

for u in df_users.to_dict(orient="records"):
    uid = int(u["userid"])
    w_rows, p_rows, s_rows, i_rows, meta = simulate_user(cfg, u, df_ex, caps[uid], templates[uid], rng)
    ban_meta_rows.append(meta)

    # assign a global workoutid: one id per (userid, date)
    # build the mapping per user
    wdf = pd.DataFrame(w_rows)
    if len(wdf) == 0:
        continue

    # sort and assign
    wdf = wdf.sort_values(["userid","date"]).reset_index(drop=True)
    wdf["workoutid"] = np.arange(workout_id_counter, workout_id_counter + len(wdf))
    workout_id_counter += len(wdf)

    # mapping (userid,date) -> workoutid
    key_to_wid = {(int(r.userid), str(r.date)): int(r.workoutid) for r in wdf.itertuples(index=False)}

    # push workouts
    all_workouts.append(wdf)

    # attach workoutid to plan/sets
    pdf = pd.DataFrame(p_rows)
    if len(pdf):
        pdf["workoutid"] = [key_to_wid[(int(r["userid"]), str(r["date"]))] for r in pdf.to_dict("records")]
        all_plan.append(pdf)

    sdf = pd.DataFrame(s_rows)
    if len(sdf):
        sdf["workoutid"] = [key_to_wid[(int(r["userid"]), str(r["date"]))] for r in sdf.to_dict("records")]
        all_sets.append(sdf)

    idf = pd.DataFrame(i_rows)
    if len(idf):
        all_impulse.append(idf)

df_workouts = pd.concat(all_workouts, ignore_index=True) if all_workouts else pd.DataFrame()
df_plan = pd.concat(all_plan, ignore_index=True) if all_plan else pd.DataFrame()
df_sets = pd.concat(all_sets, ignore_index=True) if all_sets else pd.DataFrame()
df_impulse = pd.concat(all_impulse, ignore_index=True) if all_impulse else pd.DataFrame()
df_ban_meta = pd.DataFrame(ban_meta_rows)

df_workouts.head(), df_sets.head()


(   userid        date  weekindex_user sessiontag workoutstatus    z_skip  \
 0       1  2024-09-06               1       Push          done -2.188516   
 1       1  2024-09-10               1       Pull          done -1.910497   
 2       1  2024-09-13               2       Legs          done -1.968734   
 3       1  2024-09-14               2       Push          done -2.067860   
 4       1  2024-09-16               2       Pull          done -1.956130   
 
      p_skip  fatigue_term experience_label  workoutid  
 0  0.100786      0.003221         Beginner          1  
 1  0.128925      0.502959         Beginner          2  
 2  0.122525      0.739271         Beginner          3  
 3  0.112260      1.025749         Beginner          4  
 4  0.123886      1.150986         Beginner          5  ,
            set_id  userid        date  weekindex_user sessiontag  exerciseid  \
 0  U0001_S0000001       1  2024-09-06               1       Push          10   
 1  U0001_S0000002       1  202

---
# 10. **Derive Aggregated Views**

Builds aggregated views from the set-level table.

## **Workout Logs (Exercise-Level)**
Aggregation by (workoutid, userid, date, sessiontag, exerciseid).

### **Metrics**
- `setsdone`: max(set_index)
- `repsdonetotal`: sum(reps_done)
- `repsdoneavg`: mean(reps_done)
- `loaddonekg`: median(load_done_kg)
- `rpedone`: mean(rpe_done)

### **GAP (Gap Adherence Score)**
Adherence to the plan:
```python
GAP = (0.45 * (load_done / load_intended)
       + 0.30 * (sets_done / sets_planned)
       + 0.25 * (reps_done / reps_target))
```
- GAP ~ 1.0: perfect adherence
- GAP > 1.0: plan exceeded
- GAP < 1.0: under-execution

The score is clipped to [0.3, 1.8], and a ratio that cannot be computed (for example, a missing load) counts as 1.0.
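
A worked example of the GAP formula for a single exercise, including the [0.3, 1.8] clip applied in the aggregation cell. The function name `gap_score` and the input values are illustrative assumptions; missing-value handling is left out.

```python
import numpy as np

def gap_score(load_done, load_intended, sets_done, sets_planned,
              reps_done_avg, reps_target_avg):
    """Weighted adherence score, clipped to [0.3, 1.8] and rounded to 3 decimals."""
    gap = (0.45 * load_done / load_intended
           + 0.30 * sets_done / sets_planned
           + 0.25 * reps_done_avg / reps_target_avg)
    return round(float(np.clip(gap, 0.3, 1.8)), 3)

# Plan followed exactly
perfect = gap_score(40.0, 40.0, 4, 4, 10.0, 10.0)
# 10% lighter load, one set skipped, one rep short on average
under = gap_score(36.0, 40.0, 3, 4, 9.0, 10.0)
print(perfect, under)
```

In the second case the terms are 0.405, 0.225, and 0.225, so the skipped set costs as much as the load and rep shortfalls combined.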

## **Sessions (Session-Level)**
Aggregation by (workoutid, userid, date, sessiontag).

### **Metrics**
- `total_sets`: number of sets
- `total_reps`: sum(reps_done)
- `volume_kg`: sum(load_done_kg × reps_done), with missing loads counted as 0
- `sRPE`: mean(rpe_done)
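
The session metrics can be illustrated on a toy session with one missing load and one missing RPE. This is a sketch on made-up rows, showing that volume is computed per set before summing and that the RPE mean skips missing values.

```python
import numpy as np
import pandas as pd

# One toy session (hypothetical values)
sets = pd.DataFrame({
    "workoutid": [1, 1, 1],
    "reps_done": [10, 8, 8],
    "load_done_kg": [40.0, np.nan, 42.5],
    "rpe_done": [7.0, 8.0, np.nan],
})

# Per-set volume first; a missing load contributes 0 kg
sets["volume_kg"] = sets["load_done_kg"].fillna(0.0) * sets["reps_done"]

sessions = sets.groupby("workoutid", as_index=False).agg(
    total_sets=("reps_done", "count"),
    total_reps=("reps_done", "sum"),
    volume_kg=("volume_kg", "sum"),
    sRPE=("rpe_done", "mean"),   # NaN values are skipped by mean
)
print(sessions)
```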

---

 **Usage**:
- `workoutlogs.csv` → feature engineering for Mod STATUS
- `sessions.csv` → macro-cycle analysis


**CELL 10 — (Python) Derive workoutlogs (exercise-level) + sessions (session-level)**

In [None]:
# key the plan so gapadherencescore can be computed at exercise level
plan_key = ["workoutid","userid","date","sessiontag","exerciseid"]
df_plan_keyed = df_plan[plan_key + ["setsplanned","repsmin","repsmax","rirtarget"]].copy()

# aggregate sets -> exercise
gex = df_sets.groupby(["workoutid","userid","date","sessiontag","exerciseid"], as_index=False).agg(
    setsdone=("set_index","max"),
    repsdonetotal=("reps_done","sum"),
    repsdoneavg=("reps_done","mean"),
    loaddonekg=("load_done_kg","median"),
    rpedone=("rpe_done","mean"),
    loadintendedkg=("load_intended_kg","median"),
    reps_target_avg=("reps_target","mean"),
)

df_logs = gex.merge(df_plan_keyed, on=plan_key, how="left")

# gapadherencescore
carratio = (df_logs["loaddonekg"] / df_logs["loadintendedkg"]).replace([np.inf, -np.inf], np.nan).fillna(1.0)
sratio = (df_logs["setsdone"] / df_logs["setsplanned"]).replace([np.inf, -np.inf], np.nan).fillna(1.0)
# "central" rep target ~ mean of the per-set targets
repratio = (df_logs["repsdoneavg"] / df_logs["reps_target_avg"]).replace([np.inf, -np.inf], np.nan).fillna(1.0)

gap = 0.45*carratio + 0.30*sratio + 0.25*repratio
df_logs["gapadherencescore"] = np.clip(gap, 0.3, 1.8).round(3)

df_logs["repsdoneavg"] = df_logs["repsdoneavg"].round(2)
df_logs["loaddonekg"] = df_logs["loaddonekg"].round(2)
df_logs["rpedone"] = df_logs["rpedone"].round(2)

# sessions (session-level)
# true volume = sum(load * reps); missing loads count as 0
tmp = df_sets.copy()
tmp["load_done_kg_0"] = tmp["load_done_kg"].fillna(0.0)
tmp["volume_kg"] = tmp["load_done_kg_0"] * tmp["reps_done"].astype(float)
df_sessions = tmp.groupby(["workoutid","userid","date","sessiontag"], as_index=False).agg(
    total_sets=("set_index","count"),
    total_reps=("reps_done","sum"),
    volume_kg=("volume_kg","sum"),
    sRPE=("rpe_done","mean")
)
df_sessions["sRPE"] = df_sessions["sRPE"].round(2)

df_logs.head(), df_sessions.head()


(   workoutid  userid        date sessiontag  exerciseid  setsdone  \
 0          1       1  2024-09-06       Push           1         5   
 1          1       1  2024-09-06       Push           4         2   
 2          1       1  2024-09-06       Push           8         1   
 3          1       1  2024-09-06       Push          10         3   
 4          2       1  2024-09-10       Pull           2         4   
 
    repsdonetotal  repsdoneavg  loaddonekg  rpedone  loadintendedkg  \
 0             53        10.60       31.25     7.10           33.25   
 1             21        10.50       21.12     7.75           21.00   
 2             12        12.00       29.00     7.00           28.50   
 3             25         8.33       23.00     7.17           22.50   
 4             37         9.25       35.00     7.50           37.25   
 
    reps_target_avg  setsplanned  repsmin  repsmax  rirtarget  \
 0        11.200000            5       10       14          2   
 1        10.000000 

---
# 11. **Compute Banister Daily Series**

Computes the daily fitness/fatigue time series.

## **Banister Model (Impulse-Response)**

For each user, on every day in `[start_date, end_date]`, where `u(i)` is the training impulse on day `i`:

### **Fitness (F) — Long-Term Adaptation**
```
F(t) = Σ_{i=0}^{t} u(i) × exp(-(t-i)/tauF)
```
- `tauF` ~ 45 days (per user)
- Slow build-up, slow decay

### **Fatigue (D) — Short-Term Fatigue**
```
D(t) = Σ_{i=0}^{t} u(i) × exp(-(t-i)/tauD)
```
- `tauD` ~ 7 days (per user)
- Fast build-up, fast decay

### **Performance (P)**
```
P(t) = betaF × F(t) - betaD × D(t)
```
- P > 0: positive form
- P < 0: overreaching
- Rising P: supercompensation

## **Output**
`df_ban` (saved as `banisterdaily.csv`):
- One row per (userid, date)
- Columns: impulse, F, D, P, tauF, tauD, betaF, betaD

---

 **Usage**: main input for Mod IMPETUS (trend regression).

 The implementation is $O(L^2)$ per user, where $L$ is the number of days in the user's window. For $L > 1000$, consider a more efficient algorithm.
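
One way to reach $O(L)$ is the recursion $F(t) = e^{-1/\tau_F} F(t-1) + u(t)$, which follows from factoring one day of decay out of the sum (and likewise for $D$). The sketch below is not the notebook's code: the function names are hypothetical, and it assumes the weights are `exp(-k / tau)` for lag `k`, which is consistent with the values printed for user 1 in this notebook.

```python
import numpy as np

def banister_recursive(impulse, tau_f, tau_d, beta_f, beta_d):
    """O(L) Banister series using F(t) = exp(-1/tau) * F(t-1) + u(t)."""
    u = np.asarray(impulse, dtype=float)
    decay_f = np.exp(-1.0 / tau_f)
    decay_d = np.exp(-1.0 / tau_d)
    F = np.zeros_like(u)
    D = np.zeros_like(u)
    f = d = 0.0
    for t, u_t in enumerate(u):
        f = decay_f * f + u_t   # decay yesterday's fitness by one day, add today's impulse
        d = decay_d * d + u_t   # same update for fatigue, with its own time constant
        F[t], D[t] = f, d
    return F, D, beta_f * F - beta_d * D

def banister_naive(impulse, tau):
    """Direct O(L^2) evaluation of the convolution sum, for comparison."""
    u = np.asarray(impulse, dtype=float)
    w = np.exp(-np.arange(len(u)) / tau)
    return np.array([np.sum(u[:i + 1][::-1] * w[:i + 1]) for i in range(len(u))])

# First five days of user 1, taken from the output table in this notebook
impulse = [2212.7875, 0.0, 0.0, 0.0, 2257.9625]
F, D, P = banister_recursive(impulse, tau_f=45.859934, tau_d=7.779921,
                             beta_f=0.01, beta_d=0.015)
print(np.round(F, 3), np.round(D, 3), np.round(P, 3))
```

Both versions produce the same series; the recursive form only needs the previous day's state, so it can also be updated incrementally as new impulses arrive.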


**CELL 11 — (Python) Compute Banister daily**

In [None]:
def compute_banister_daily(cfg: CFG, df_users: pd.DataFrame, df_impulse: pd.DataFrame, df_ban_meta: pd.DataFrame):
    # for each user, build a daily series over [start_u, end_u]
    rows = []
    for u in df_users.itertuples(index=False):
        uid = int(u.userid)
        start_u = date.fromisoformat(u.start_date)
        end_u = date.fromisoformat(u.end_date)

        meta = df_ban_meta[df_ban_meta["userid"] == uid].iloc[0].to_dict()
        tauF = float(meta["tauF"]); tauD = float(meta["tauD"])
        betaF = float(meta["betaF"]); betaD = float(meta["betaD"])

        days = [start_u + timedelta(days=i) for i in range((end_u - start_u).days + 1)]
        days_iso = [d.isoformat() for d in days]

        sub = df_impulse[df_impulse["userid"] == uid].copy()
        imp_map = dict(zip(sub["date"].astype(str), sub["impulse"].astype(float)))

        uts = np.array([float(imp_map.get(d, 0.0)) for d in days_iso], dtype=float)
        L = len(uts)
        wF = exp_weights(L, tauF)
        wD = exp_weights(L, tauD)

        # naive O(L^2) cumulative computation (fine for a medium-sized dataset); can be optimized if needed
        F = np.array([float(np.sum(uts[:i+1][::-1] * wF[:i+1])) for i in range(L)], dtype=float)
        D = np.array([float(np.sum(uts[:i+1][::-1] * wD[:i+1])) for i in range(L)], dtype=float)
        P = betaF * F - betaD * D

        for i, d in enumerate(days_iso):
            rows.append({
                "userid": uid,
                "date": d,
                "impulse": float(uts[i]),
                "F": float(F[i]),
                "D": float(D[i]),
                "P": float(P[i]),
                "tauF": tauF,
                "tauD": tauD,
                "betaF": betaF,
                "betaD": betaD,
            })
    return pd.DataFrame(rows)

df_ban = compute_banister_daily(cfg, df_users, df_impulse, df_ban_meta)
df_ban.head()


|   | userid | date | impulse | F | D | P | tauF | tauD | betaF | betaD |
|---|--------|------|---------|---|---|---|------|------|-------|-------|
| 0 | 1 | 2024-09-06 | 2212.7875 | 2212.7875 | 2212.7875 | -11.063937 | 45.859934 | 7.779921 | 0.01 | 0.015 |
| 1 | 1 | 2024-09-07 | 0.0 | 2165.05877 | 1945.885263 | -7.537691 | 45.859934 | 7.779921 | 0.01 | 0.015 |
| 2 | 1 | 2024-09-08 | 0.0 | 2118.359524 | 1711.176268 | -4.484049 | 45.859934 | 7.779921 | 0.01 | 0.015 |
| 3 | 1 | 2024-09-09 | 0.0 | 2072.667558 | 1504.777428 | -1.844986 | 45.859934 | 7.779921 | 0.01 | 0.015 |
| 4 | 1 | 2024-09-10 | 2257.9625 | 4285.923646 | 3581.236525 | -10.859311 | 45.859934 | 7.779921 | 0.01 | 0.015 |


---
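# 12. **Validation & Save**

Dataset validation, CSV export, and ZIP archive.

## **Automatic Validations**

1. **Date Range**: sets fall within each user's `[start_date, end_date]`
2. **Foreign Keys**: valid `exerciseid` values
3. **Skip Rate**: Beginner 10-13%, Intermediate 6-9%, Advanced 4-7%
4. **Missingness**: RPE ~2%, Load ~1%, Feedback ~2%
5. **Quantization**: loads are multiples of 0.25 kg, RPE values multiples of 0.5
6. **Progressive Overload**: mean capacities increase (users with more than 180 days)

Of these, the `validate_dataset` cell in this section covers the date-range check and adds three structural checks: unique `set_id`, no sets for skipped sessions, and non-null key columns. The skip rate is checked in the post-generation section.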
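
The foreign-key, missingness, and quantization items could be checked along the following lines. This is a hedged sketch on toy data: the function name `extra_checks` and its parameters are assumptions, and the step sizes 0.25 and 0.5 are taken from the list above rather than from `cfg`.

```python
import numpy as np
import pandas as pd

def extra_checks(df_sets, valid_exercise_ids, load_step=0.25, rpe_step=0.5):
    """Foreign-key, quantization, and missingness checks on set-level rows."""
    checks = {}
    checks["exerciseid_fk_valid"] = bool(df_sets["exerciseid"].isin(valid_exercise_ids).all())

    # Quantization: observed values divided by the step should be whole numbers
    load = df_sets["load_done_kg"].dropna().to_numpy(dtype=float)
    rpe = df_sets["rpe_done"].dropna().to_numpy(dtype=float)
    checks["load_quantized"] = bool(np.allclose(load / load_step, np.round(load / load_step)))
    checks["rpe_quantized"] = bool(np.allclose(rpe / rpe_step, np.round(rpe / rpe_step)))

    # Missingness: report the rates so they can be compared with the targets
    checks["missing_rpe_rate"] = float(df_sets["rpe_done"].isna().mean())
    checks["missing_load_rate"] = float(df_sets["load_done_kg"].isna().mean())
    return checks

# Toy set-level rows (hypothetical values)
toy_sets = pd.DataFrame({
    "exerciseid": [1, 2, 2, 3],
    "load_done_kg": [20.0, 22.25, np.nan, 30.5],
    "rpe_done": [7.0, 7.5, 8.0, np.nan],
})
result = extra_checks(toy_sets, valid_exercise_ids={1, 2, 3})
print(result)
```

The missingness entries are rates rather than booleans, because the targets above are approximate and a tolerance would have to be chosen separately.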

## **Output Files**

In the `cfg.outdir` folder (default: `data_synth_setlevel/`):

| File | Description | Size (300 users) |
|------|-------------|------------------|
| `users.csv` | Demographics + latent traits + date window | ~50 KB |
| `exercises.csv` | Exercise catalogue | ~5 KB |
| `workouts.csv` | Session metadata | ~500 KB |
| `workoutexercises.csv` | Prescriptions (plan) | ~2 MB |
| **`workout_sets.csv`** | **Set-level (MAIN)** | **~30 MB** |
| `workoutlogs.csv` | Exercise-level | ~10 MB |
| `sessions.csv` | Session-level | ~3 MB |
| `banisterdaily.csv` | F/D/P series | ~20 MB |
| `capabilities.json` | Initial capacities | ~500 KB |
| `validation_checks.json` | Validation summary | ~1 KB |

### **ZIP Archive**
All files are compressed into `{outdir}.zip` for download.

---

 If all validations pass (`True`), the dataset is ready for ML.


**CELL 12 — (Python) Validations + Save CSV + Zip**

In [None]:
def validate_dataset(df_users, df_workouts, df_plan, df_sets, df_logs, df_sessions, df_ban):
    checks = {}

    # 1) date range per user
    u = df_users.copy()
    u["start_date"] = pd.to_datetime(u["start_date"]).dt.date
    u["end_date"] = pd.to_datetime(u["end_date"]).dt.date
    s = df_sets.copy()
    s["date"] = pd.to_datetime(s["date"]).dt.date

    merged = s.merge(u[["userid","start_date","end_date"]], on="userid", how="left")
    checks["sets_in_range"] = bool(((merged["date"] >= merged["start_date"]) & (merged["date"] <= merged["end_date"])).all())

    # 2) set_id is unique
    checks["set_id_unique"] = bool(df_sets["set_id"].is_unique)

    # 3) workoutstatus consistency: skipped sessions should have no sets
    w = df_workouts.copy()
    sw = s.merge(w[["workoutid","workoutstatus"]], on="workoutid", how="left")
    checks["no_sets_for_skipped"] = bool((sw[sw["workoutstatus"] == "skipped"].shape[0] == 0))

    # 4) minimal key columns are non-null
    required_cols = ["userid","date","exerciseid","set_index"]
    checks["sets_required_cols_nonnull"] = bool(df_sets[required_cols].notnull().all().all())

    return checks

checks = validate_dataset(df_users, df_workouts, df_plan, df_sets, df_logs, df_sessions, df_ban)
checks


{'sets_in_range': True,
 'set_id_unique': True,
 'no_sets_for_skipped': True,
 'sets_required_cols_nonnull': True}

In [None]:
# Save
df_users.to_csv(OUTDIR / "users.csv", index=False)
df_ex.to_csv(OUTDIR / "exercises.csv", index=False)
df_workouts.to_csv(OUTDIR / "workouts.csv", index=False)
df_plan.to_csv(OUTDIR / "workoutexercises.csv", index=False)     # per-exercise plan (as before)
df_sets.to_csv(OUTDIR / "workout_sets.csv", index=False)          # canonical set-level
df_logs.to_csv(OUTDIR / "workoutlogs.csv", index=False)           # derived exercise-level (compat)
df_sessions.to_csv(OUTDIR / "sessions.csv", index=False)          # derived session-level
df_ban.to_csv(OUTDIR / "banisterdaily.csv", index=False)          # conceptually compatible

with open(OUTDIR / "validation_checks.json", "w", encoding="utf-8") as f:
    json.dump(checks, f, indent=2)

# Zip for download
zip_name = f"{cfg.outdir}.zip"
!zip -r {zip_name} {cfg.outdir}
print("DONE:", zip_name)


  adding: data_synth_setlevel/ (stored 0%)
  adding: data_synth_setlevel/banisterdaily.csv (deflated 72%)
  adding: data_synth_setlevel/workout_sets.csv (deflated 84%)
  adding: data_synth_setlevel/validation_checks.json (deflated 32%)
  adding: data_synth_setlevel/workoutexercises.csv (deflated 84%)
  adding: data_synth_setlevel/workouts.csv (deflated 70%)
  adding: data_synth_setlevel/workoutlogs.csv (deflated 78%)
  adding: data_synth_setlevel/users.csv (deflated 68%)
  adding: data_synth_setlevel/sessions.csv (deflated 72%)
  adding: data_synth_setlevel/exercises.csv (deflated 59%)
DONE: data_synth_setlevel.zip


---
# **Post-Generation Validations**

Dataset quality checks.

---


## **Skip Rate by Level**

Target: Beginner 10-13%, Intermediate 6-9%, Advanced 4-7%


In [None]:
# === SKIP-RATE ===
import pandas as pd
from pathlib import Path

DATA_DIR = OUTDIR if "OUTDIR" in globals() else Path("data_synth_setlevel")
workouts = pd.read_csv(DATA_DIR / "workouts.csv")

# experience_label is already in workouts.csv, so no merge is needed
workouts["is_skipped"] = (workouts["workoutstatus"] == "skipped").astype(int)

print("Skip rate by level:")
print(workouts.groupby("experience_label")["is_skipped"].mean().sort_index())

# Debug z/p/fatigue_term
if "z_skip" in workouts.columns:
    print("\nMean z_skip by level:")
    print(workouts.groupby("experience_label")["z_skip"].mean().sort_index())

    print("\nMean p_skip by level:")
    print(workouts.groupby("experience_label")["p_skip"].mean().sort_index())

    print("\np_skip distribution:")
    print(workouts["p_skip"].describe().round(4))


Skip rate by level:
experience_label
Advanced        0.071001
Beginner        0.131598
Intermediate    0.087638
Name: is_skipped, dtype: float64

Mean z_skip by level:
experience_label
Advanced       -2.646355
Beginner       -1.900262
Intermediate   -2.369068
Name: z_skip, dtype: float64

Mean p_skip by level:
experience_label
Advanced        0.066495
Beginner        0.130511
Intermediate    0.085901
Name: p_skip, dtype: float64

p_skip distribution:
count    169292.0000
mean          0.0993
std           0.0250
min           0.0390
25%           0.0810
50%           0.0903
75%           0.1233
max           0.1824
Name: p_skip, dtype: float64


## **Dataset Summary**


In [None]:
print("\n" + "="*60)
print("  DATASET SUMMARY")
print("="*60)
print(f"\nUsers: {len(df_users)}")
print(f"Sessions: {len(df_workouts)}")
print(f"Set logs: {len(df_sets)}")
print(f"\nLevel distribution:")
print((df_users['experience_label'].value_counts(normalize=True) * 100).round(1))
print("\n" + "="*60)
print(" Dataset ready!")
print("="*60)


  DATASET SUMMARY

Users: 1000
Sessions: 169292
Set logs: 1401317

Level distribution:
experience_label
Intermediate    55.1
Beginner        34.5
Advanced        10.4
Name: proportion, dtype: float64

 Dataset ready!