# Probabilistic strategy for Polymarket up/down contracts

This notebook explores how to continuously estimate the probability that a Bitcoin candle closes above or below its opening price for three horizons (m15, h1, and daily, in the ET timezone), and then how to exploit the odds imbalances observed during FOMO phases.

The pipeline covers: (1) ingesting minute-level OHLC data, (2) multi-scale feature engineering, (3) training probability models, (4) simulating a configurable "FOMO" odds series, and (5) a simple value backtest to quantify the potential edge.

**📋 Logical execution order of the sections:**
0. Setup & Configuration (imports, constants, functions)
1. Data loading and preparation (OHLC + indicators)
2. Multi-scale reconstruction (snapshots per timeframe)
3. Intra-candle models (training, probabilities, visualizations)
4. Pre-open probabilities (pre-open models, thresholds, visualizations)
5. FOMO odds simulation (scenario generation)
   5.1 FOMO model calibration (comparison with real market odds)
6. ONLINE backtest (minute-by-minute trading, metrics)
7. Visualizations and analysis (equity curves, distributions, timing)
8. ONLINE backtest with real market odds
9. Model interpretation (feature importances)
10. Summary and next steps


## 0. Setup & Configuration

Environment configuration, imports, constants, and definitions of all utility functions used in the notebook.

- Purpose: centralize all configuration and core functions.
- Inputs: none.
- Outputs: global functions and constants available to the following sections.
- Usage: run this section first.


In [None]:
from __future__ import annotations

import math
import pathlib
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple, Optional

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
from sklearn.ensemble import HistGradientBoostingClassifier, HistGradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    brier_score_loss,
    mean_absolute_error,
    mean_squared_error,
    r2_score,
    roc_auc_score,
)


In [None]:
pd.options.display.max_columns = 120
pd.options.display.max_rows = 200

DATA_PATH = pathlib.Path("../data/btc_1m_OHLC.csv").resolve()
MARKET_ODDS_PATH = pathlib.Path("../data/BTC.csv").resolve()
TARGET_TZ = "America/New_York"
RANDOM_SEED = 17

np.random.seed(RANDOM_SEED)


In [None]:
sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (10, 5)
plt.rcParams["axes.titlesize"] = 12
plt.rcParams["axes.labelsize"] = 11


In [None]:
def announce(msg: str) -> None:
    print(f"[INFO] {msg}")


### 0.1 Loading functions and indicators


In [None]:
def load_minute_data(path: pathlib.Path) -> pd.DataFrame:
    """Load minute OHLCV data and enforce a UTC datetime index."""
    df = pd.read_csv(path)
    df = df.sort_values("timestamp")
    df["timestamp"] = pd.to_datetime(df["timestamp"], unit="s", utc=True)
    df = df.set_index("timestamp")
    df = df.rename(
        columns={
            "open": "open",
            "high": "high",
            "low": "low",
            "close": "close",
            "volume": "volume",
        }
    )
    df.index.name = "timestamp_utc"
    return df


def compute_rsi(close: pd.Series, period: int = 14) -> pd.Series:
    """Compute a classic RSI on a series of closing prices."""
    delta = close.diff()
    gain = delta.clip(lower=0).ewm(alpha=1 / period, adjust=False).mean()
    loss = -delta.clip(upper=0).ewm(alpha=1 / period, adjust=False).mean()
    rs = gain / (loss + 1e-9)
    rsi = 100 - (100 / (1 + rs))
    return rsi


def add_global_indicators(df: pd.DataFrame) -> pd.DataFrame:
    enriched = df.copy()
    enriched["return_1m"] = enriched["close"].pct_change().fillna(0.0)
    enriched["ema_12"] = enriched["close"].ewm(span=12, adjust=False).mean()
    enriched["ema_48"] = enriched["close"].ewm(span=48, adjust=False).mean()
    enriched["ema_288"] = enriched["close"].ewm(span=288, adjust=False).mean()

    enriched["range_1m"] = (enriched["high"] - enriched["low"]).abs()
    enriched["atr_15m"] = enriched["range_1m"].rolling(15).mean().bfill()
    enriched["sigma_15m"] = enriched["return_1m"].rolling(15).std().bfill() * math.sqrt(15)

    enriched["rsi_14"] = compute_rsi(enriched["close"])
    enriched["rolling_vol_30"] = enriched["return_1m"].rolling(30).std().fillna(0.0) * math.sqrt(30)
    enriched["volume_per_minute"] = enriched["volume"].rolling(30).mean().bfill()
    enriched["volume_z"] = ((enriched["volume"] - enriched["volume"].rolling(120).mean())
                            / (enriched["volume"].rolling(120).std() + 1e-9)).fillna(0.0)
    enriched["trend_ema_ratio"] = (enriched["ema_12"] - enriched["ema_48"]) / (enriched["ema_48"] + 1e-9)
    enriched["macro_trend_ratio"] = (enriched["ema_48"] - enriched["ema_288"]) / (enriched["ema_288"] + 1e-9)
    enriched["is_trend_up"] = (enriched["trend_ema_ratio"] > 0).astype(int)
    return enriched

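To make the smoothing in `compute_rsi` concrete, here is a minimal pure-Python sketch of the same recursion (exponential smoothing with `alpha = 1 / period`, in the spirit of `ewm(alpha=..., adjust=False)`). The helper name `wilder_rsi` and the toy price lists are illustrative assumptions, not part of the notebook.

```python
from typing import List, Optional


def wilder_rsi(closes: List[float], period: int = 14) -> List[float]:
    """Hypothetical helper: RSI with recursive smoothing, alpha = 1 / period."""
    alpha = 1.0 / period
    avg_gain: Optional[float] = None
    avg_loss: Optional[float] = None
    out: List[float] = []
    for prev, cur in zip(closes, closes[1:]):
        delta = cur - prev
        gain, loss = max(delta, 0.0), max(-delta, 0.0)
        if avg_gain is None or avg_loss is None:
            # The first observed change seeds both averages.
            avg_gain, avg_loss = gain, loss
        else:
            avg_gain = (1 - alpha) * avg_gain + alpha * gain
            avg_loss = (1 - alpha) * avg_loss + alpha * loss
        rs = avg_gain / (avg_loss + 1e-9)  # same epsilon guard as compute_rsi
        out.append(100 - 100 / (1 + rs))
    return out


rising = [100.0 + i for i in range(30)]
falling = [100.0 - i for i in range(30)]
print(round(wilder_rsi(rising)[-1], 2), round(wilder_rsi(falling)[-1], 2))
```

A steadily rising series should sit close to 100 and a steadily falling one close to 0, which is a quick sanity check on the sign conventions.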

### 0.2 Multi-scale reconstruction functions


In [None]:
def build_timeframe_snapshots(
    df: pd.DataFrame,
    freq: str,
    label: str,
    target_tz: str | None = None,
) -> pd.DataFrame:
    """Aggregate minute bars using UTC bucket boundaries to avoid DST ambiguities."""
    tz = target_tz
    if tz is None:
        try:
            tz = TARGET_TZ
        except NameError:
            tz = "America/New_York"

    local = df.copy()
    local["bucket_start_utc"] = local.index.floor(freq)
    local["bucket_end_utc"] = local["bucket_start_utc"] + pd.to_timedelta(freq)
    local["bucket_key"] = local["bucket_start_utc"]

    local["timestamp_et"] = local.index.tz_convert(tz)
    # Use the .dt accessor: Series.tz_convert would convert the index, not the column values.
    local["bucket_start_et"] = local["bucket_start_utc"].dt.tz_convert(tz)
    local["bucket_end_et"] = local["bucket_end_utc"].dt.tz_convert(tz)

    group = local.groupby("bucket_key", group_keys=False)
    local["tf_open"] = group["open"].transform("first")
    local["tf_high_to_now"] = group["high"].cummax()
    local["tf_low_to_now"] = group["low"].cummin()
    local["tf_close_to_now"] = local["close"]
    local["tf_volume_to_now"] = group["volume"].cumsum()
    local["tf_final_close"] = group["close"].transform("last")
    local["tf_final_high"] = group["high"].transform("max")
    local["tf_final_low"] = group["low"].transform("min")

    local["minutes_elapsed"] = group.cumcount() + 1
    local["bucket_size"] = group["close"].transform("size")
    local["minutes_total"] = local["bucket_size"].clip(lower=1)
    local["minutes_remaining"] = (
        local["minutes_total"] - local["minutes_elapsed"]
    ).clip(lower=0)
    local["seconds_remaining"] = local["minutes_remaining"] * 60
    local["time_elapsed_ratio"] = local["minutes_elapsed"] / local["minutes_total"]
    local["time_remaining_ratio"] = (
        local["minutes_remaining"] / local["minutes_total"]
    )

    local["target_up"] = (local["tf_final_close"] >= local["tf_open"]).astype(int)

    local["dist_from_open_pct"] = (
        (local["tf_close_to_now"] - local["tf_open"]) / (local["tf_open"] + 1e-9)
    )
    local["high_gap_pct"] = (
        (local["tf_high_to_now"] - local["tf_close_to_now"]) / (local["tf_open"] + 1e-9)
    )
    local["low_gap_pct"] = (
        (local["tf_close_to_now"] - local["tf_low_to_now"]) / (local["tf_open"] + 1e-9)
    )
    local["running_range_pct"] = (
        (local["tf_high_to_now"] - local["tf_low_to_now"]) / (local["tf_open"] + 1e-9)
    )
    local["minute_body_pct"] = (
        (local["close"] - local["open"]) / (local["tf_open"] + 1e-9)
    )

    # Distance normalized by the 15-minute ATR (volatility in price units)
    local["z_dist_atr15"] = (
        (local["tf_close_to_now"] - local["tf_open"]) / (local["atr_15m"] + 1e-6)
    )
    local["z_range_atr15"] = (
        (local["tf_high_to_now"] - local["tf_low_to_now"]) / (local["atr_15m"] + 1e-6)
    )

    local["minute_of_day"] = (
        local["timestamp_et"].dt.hour * 60 + local["timestamp_et"].dt.minute
    )
    local["minute_of_week"] = (
        local["timestamp_et"].dt.dayofweek * 1440 + local["minute_of_day"]
    )
    local["minute_of_day_sin"] = np.sin(2 * np.pi * local["minute_of_day"] / 1440)
    local["minute_of_day_cos"] = np.cos(2 * np.pi * local["minute_of_day"] / 1440)
    local["day_of_week"] = local["timestamp_et"].dt.dayofweek
    local["day_of_week_sin"] = np.sin(2 * np.pi * local["day_of_week"] / 7)
    local["day_of_week_cos"] = np.cos(2 * np.pi * local["day_of_week"] / 7)

    # Intra-candle streaks (consecutive counts)
    def _streak_bool(series: pd.Series) -> pd.Series:
        count = 0
        out = []
        for v in series.astype(bool):
            if v:
                count += 1
            else:
                count = 0
            out.append(count)
        return pd.Series(out, index=series.index)

    local["minute_up"] = (local["close"] >= local["open"]).astype(int)
    local["minute_down"] = 1 - local["minute_up"]
    local["tf_up_to_now"] = (local["tf_close_to_now"] >= local["tf_open"]).astype(int)
    local["tf_down_to_now"] = 1 - local["tf_up_to_now"]

    # transform keeps the original row index regardless of the pandas group_keys default.
    local["streak_up_minute"] = local.groupby("bucket_key")["minute_up"].transform(_streak_bool)
    local["streak_down_minute"] = local.groupby("bucket_key")["minute_down"].transform(_streak_bool)
    local["streak_tf_up"] = local.groupby("bucket_key")["tf_up_to_now"].transform(_streak_bool)
    local["streak_tf_down"] = local.groupby("bucket_key")["tf_down_to_now"].transform(_streak_bool)

    bucket_summary = (
        local.groupby("bucket_key")
        .agg(
            bucket_open=("tf_open", "first"),
            bucket_close=("tf_final_close", "first"),
            bucket_high=("tf_final_high", "first"),
            bucket_low=("tf_final_low", "first"),
            bucket_minutes=("minutes_total", "first"),
            bucket_target=("target_up", "first"),
        )
        .sort_index()
    )
    bucket_summary["bucket_return"] = (
        (bucket_summary["bucket_close"] - bucket_summary["bucket_open"])
        / (bucket_summary["bucket_open"] + 1e-9)
    )
    bucket_summary["bucket_range"] = (
        (bucket_summary["bucket_high"] - bucket_summary["bucket_low"])
        / (bucket_summary["bucket_open"] + 1e-9)
    )
    bucket_summary["prev_bucket_return"] = bucket_summary["bucket_return"].shift(1)
    bucket_summary["prev_bucket_target"] = bucket_summary["bucket_target"].shift(1)
    bucket_summary["prev_bucket_range"] = bucket_summary["bucket_range"].shift(1)

    local = local.join(
        bucket_summary[
            [
                "prev_bucket_return",
                "prev_bucket_target",
                "prev_bucket_range",
            ]
        ],
        on="bucket_key",
    )

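    # Reassign instead of chained inplace fillna, which is unreliable under copy-on-write.
    local["prev_bucket_return"] = local["prev_bucket_return"].fillna(0.0)
    local["prev_bucket_target"] = local["prev_bucket_target"].fillna(0.5)
    local["prev_bucket_range"] = local["prev_bucket_range"].fillna(0.0)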

    local["timeframe"] = label
    local["contract_id"] = (
        label
        + "_"
        + local["bucket_start_et"].dt.strftime("%Y-%m-%d %H:%M")
    )
    return local

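As a rough illustration of the running ("to now") features built by `build_timeframe_snapshots`, the sketch below replays one toy 5-minute candle in plain Python. The prices and the helper name `replay_candle` are made up for the example; only the formulas follow the function above.

```python
from typing import Dict, List, Tuple

# Toy minute bars as (open, high, low, close); the values are invented.
BARS: List[Tuple[float, float, float, float]] = [
    (100.0, 100.4, 99.8, 100.2),
    (100.2, 100.9, 100.1, 100.8),
    (100.8, 101.0, 100.3, 100.4),
    (100.4, 100.5, 99.6, 99.7),
    (99.7, 100.3, 99.5, 100.1),
]


def replay_candle(bars: List[Tuple[float, float, float, float]]) -> List[Dict[str, float]]:
    """Hypothetical helper: per-minute snapshot of the candle seen so far."""
    tf_open = bars[0][0]
    final_close = bars[-1][3]
    target_up = int(final_close >= tf_open)  # same label rule as target_up
    high_to_now, low_to_now = float("-inf"), float("inf")
    rows: List[Dict[str, float]] = []
    for i, (_, high, low, close) in enumerate(bars, start=1):
        high_to_now = max(high_to_now, high)
        low_to_now = min(low_to_now, low)
        rows.append({
            "minutes_elapsed": i,
            "minutes_remaining": len(bars) - i,
            # The notebook adds a 1e-9 guard to these denominators.
            "dist_from_open_pct": (close - tf_open) / tf_open,
            "running_range_pct": (high_to_now - low_to_now) / tf_open,
            "target_up": target_up,
        })
    return rows


for row in replay_candle(BARS):
    print(row)
```

In this toy candle the fourth minute trades below the open while the candle still finishes up, which is the kind of intra-candle state the classifier has to judge.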

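The streak features count consecutive truthy values and reset to zero on the first falsy one. A small standalone sketch of that counter, using a hypothetical `streak_lengths` function on a plain list:

```python
from typing import Iterable, List


def streak_lengths(flags: Iterable[bool]) -> List[int]:
    """Hypothetical helper: running count of consecutive truthy values."""
    count = 0
    out: List[int] = []
    for flag in flags:
        count = count + 1 if flag else 0
        out.append(count)
    return out


minute_up = [True, True, False, True, True, True, False]
print(streak_lengths(minute_up))  # [1, 2, 0, 1, 2, 3, 0]
```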

In [None]:
def prepare_timeframe_dataset(
    df: pd.DataFrame,
    mapping: Dict[str, str],
    target_tz: str | None = None,
) -> pd.DataFrame:
    """Assemble the enriched streams for each requested horizon."""
    tz = target_tz
    if tz is None:
        try:
            tz = TARGET_TZ
        except NameError:
            tz = "America/New_York"

    frames = []
    for label, freq in mapping.items():
        frame = build_timeframe_snapshots(df, freq=freq, label=label, target_tz=tz)
        frames.append(frame)
    combined = pd.concat(frames).sort_index()
    combined = combined[combined["minutes_remaining"] > 0]
    combined = combined.dropna(subset=["tf_open", "tf_close_to_now"])
    return combined


In [None]:
def build_preopen_dataset(
    minute_df: pd.DataFrame,
    mapping: Dict[str, str],
    target_tz: str | None = None,
) -> pd.DataFrame:
    """Build a dataset of snapshots taken just before each target candle opens."""
    tz = target_tz or TARGET_TZ

    records = []
    for label, freq in mapping.items():
        full_snapshots = build_timeframe_snapshots(
            minute_df,
            freq=freq,
            label=label,
            target_tz=tz,
        )

        bucket_meta = (
            full_snapshots.groupby("contract_id")
            .agg(
                bucket_start_utc=("bucket_start_utc", "first"),
                bucket_start_et=("bucket_start_et", "first"),
                target_up=("target_up", "first"),
                prev_bucket_return=("prev_bucket_return", "first"),
                prev_bucket_target=("prev_bucket_target", "first"),
                prev_bucket_range=("prev_bucket_range", "first"),
            )
            .reset_index()
            .sort_values("bucket_start_utc")
        )

        for _, row in bucket_meta.iterrows():
            snapshot_time = row["bucket_start_utc"] - pd.Timedelta(minutes=1)
            if snapshot_time not in minute_df.index:
                continue
            base = minute_df.loc[snapshot_time]
            minute_et = snapshot_time.tz_convert(tz)
            minute_of_day = minute_et.hour * 60 + minute_et.minute
            day_of_week = minute_et.dayofweek
            records.append(
                {
                    "timeframe": label,
                    "contract_id": row["contract_id"],
                    "snapshot_utc": snapshot_time,
                    "bucket_start_utc": row["bucket_start_utc"],
                    "target_up": row["target_up"],
                    "prev_bucket_return": row["prev_bucket_return"],
                    "prev_bucket_target": row["prev_bucket_target"],
                    "prev_bucket_range": row["prev_bucket_range"],
                    "ema_12": base["ema_12"],
                    "ema_48": base["ema_48"],
                    "ema_288": base["ema_288"],
                    "trend_ema_ratio": base["trend_ema_ratio"],
                    "macro_trend_ratio": base["macro_trend_ratio"],
                    "rsi_14": base["rsi_14"],
                    "rolling_vol_30": base["rolling_vol_30"],
                    "volume_per_minute": base["volume_per_minute"],
                    "volume_z": base["volume_z"],
                    "minute_of_day": minute_of_day,
                    "minute_of_day_sin": np.sin(2 * np.pi * minute_of_day / 1440),
                    "minute_of_day_cos": np.cos(2 * np.pi * minute_of_day / 1440),
                    "day_of_week": day_of_week,
                    "day_of_week_sin": np.sin(2 * np.pi * day_of_week / 7),
                    "day_of_week_cos": np.cos(2 * np.pi * day_of_week / 7),
                }
            )
    preopen_df = pd.DataFrame.from_records(records)
    preopen_df.sort_values("bucket_start_utc", inplace=True)
    return preopen_df

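Both dataset builders encode clock time as sine/cosine pairs, which places each minute on a circle so that 23:59 and 00:00 end up close together. A raw `minute_of_day` value cannot express that. A minimal sketch, with the hypothetical helpers `encode_minute` and `circle_distance`:

```python
import math
from typing import Tuple


def encode_minute(minute_of_day: int) -> Tuple[float, float]:
    """Map a minute in [0, 1440) to a point on the unit circle."""
    angle = 2 * math.pi * minute_of_day / 1440
    return math.sin(angle), math.cos(angle)


def circle_distance(a: int, b: int) -> float:
    """Euclidean distance between two encoded minutes (illustrative only)."""
    (sa, ca), (sb, cb) = encode_minute(a), encode_minute(b)
    return math.hypot(sa - sb, ca - cb)


# 23:59 vs 00:00 are one minute apart; 00:00 vs 12:00 are opposite points.
print(circle_distance(1439, 0), circle_distance(0, 720))
```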


### 0.3 Modeling functions


In [None]:
FEATURE_COLUMNS = [
    "dist_from_open_pct",
    "high_gap_pct",
    "low_gap_pct",
    "running_range_pct",
    "minute_body_pct",
    "time_elapsed_ratio",
    "time_remaining_ratio",
    "minutes_elapsed",
    "minutes_remaining",
    "minutes_total",
    "seconds_remaining",
    "minute_of_day",
    "minute_of_week",
    "minute_of_day_sin",
    "minute_of_day_cos",
    "day_of_week",
    "day_of_week_sin",
    "day_of_week_cos",
    "prev_bucket_return",
    "prev_bucket_target",
    "prev_bucket_range",
    "return_1m",
    "ema_12",
    "ema_48",
    "ema_288",
    "trend_ema_ratio",
    "macro_trend_ratio",
    "rsi_14",
    "rolling_vol_30",
    "volume_per_minute",
    "volume_z",
    "is_trend_up",
    "streak_up_minute",
    "streak_down_minute",
    "streak_tf_up",
    "streak_tf_down",
]

TARGET_COLUMN = "target_up"


In [None]:
def sanitize_features(df: pd.DataFrame, features: Iterable[str]) -> pd.DataFrame:
    """Fill missing feature values with the column median (0.0 if the column is all NaN)."""
    cleaned = df.copy()
    for col in features:
        if col not in cleaned:
            continue
        median = cleaned[col].median()
        cleaned[col] = cleaned[col].fillna(median if not np.isnan(median) else 0.0)
    return cleaned

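A tiny plain-Python sketch of the same median fill, using a hypothetical `fill_with_median` helper. One caveat: the median is computed on whatever frame is passed in, and `train_timeframe_models` applies it before the chronological split, so later rows can influence how earlier gaps are filled.

```python
import math
from statistics import median
from typing import List, Optional


def fill_with_median(values: List[Optional[float]]) -> List[float]:
    """Hypothetical helper: replace None/NaN by the median of observed values."""
    observed = [v for v in values if v is not None and not math.isnan(v)]
    fallback = median(observed) if observed else 0.0  # 0.0 when nothing is observed
    return [fallback if v is None or math.isnan(v) else v for v in values]


print(fill_with_median([1.0, None, 3.0, float("nan"), 10.0]))  # [1.0, 3.0, 3.0, 3.0, 10.0]
```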

In [None]:
@dataclass
class ModelBundle:
    timeframe: str
    model: HistGradientBoostingClassifier
    calibrator: LogisticRegression
    feature_names: List[str]
    metrics: Dict[str, float]
    feature_importances: np.ndarray | None = None


In [None]:
def train_timeframe_models(
    df: pd.DataFrame,
    feature_cols: List[str],
    target_col: str = TARGET_COLUMN,
) -> Dict[str, ModelBundle]:
    """Train one model per horizon and return the calibrated bundles."""
    bundles: Dict[str, ModelBundle] = {}
    for timeframe, frame in df.groupby("timeframe"):
        frame = frame.sort_index()
        frame = sanitize_features(frame, feature_cols)
        frame = frame.dropna(subset=[target_col])

        n = len(frame)
        if n < 1000:
            continue

        train_end = int(n * 0.6)
        calib_end = int(n * 0.8)

        train_slice = frame.iloc[:train_end]
        calib_slice = frame.iloc[train_end:calib_end]
        test_slice = frame.iloc[calib_end:]

        X_train = train_slice[feature_cols]
        y_train = train_slice[target_col]

        X_calib = calib_slice[feature_cols]
        y_calib = calib_slice[target_col]

        X_test = test_slice[feature_cols]
        y_test = test_slice[target_col]

        base_model = HistGradientBoostingClassifier(
            learning_rate=0.05,
            max_iter=400,
            max_depth=6,
            l2_regularization=0.01,
            min_samples_leaf=80,
            random_state=RANDOM_SEED,
            scoring="loss",
            tol=1e-4,
        )
        base_model.fit(X_train, y_train)

        calib_preds = base_model.predict_proba(X_calib)[:, 1]
        calib_preds = calib_preds.reshape(-1, 1)
        calibrator = LogisticRegression(max_iter=200)
        calibrator.fit(calib_preds, y_calib)

        test_raw = base_model.predict_proba(X_test)[:, 1]
        test_calibrated = calibrator.predict_proba(test_raw.reshape(-1, 1))[:, 1]

        metrics = {
            "roc_auc": roc_auc_score(y_test, test_calibrated),
            "brier": brier_score_loss(y_test, test_calibrated),
            "accuracy": accuracy_score(y_test, (test_calibrated >= 0.5).astype(int)),
        }

        if hasattr(base_model, "feature_importances_"):
            feature_importances = base_model.feature_importances_
        else:
            perm = permutation_importance(
                base_model,
                X_test,
                y_test,
                n_repeats=5,
                random_state=RANDOM_SEED,
                n_jobs=-1,
            )
            feature_importances = perm.importances_mean

        bundles[timeframe] = ModelBundle(
            timeframe=timeframe,
            model=base_model,
            calibrator=calibrator,
            feature_names=feature_cols,
            metrics=metrics,
            feature_importances=feature_importances,
        )
    return bundles

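The split in `train_timeframe_models` is chronological rather than shuffled: the first 60% of rows train the classifier, the next 20% fit the calibrator, and the last 20% are held out for metrics. A minimal sketch of the index arithmetic, with a hypothetical `chrono_split` helper:

```python
from typing import Tuple


def chrono_split(n: int, train_frac: float = 0.6, calib_end_frac: float = 0.8) -> Tuple[range, range, range]:
    """Hypothetical helper: contiguous train / calibration / test index ranges."""
    train_end = int(n * train_frac)
    calib_end = int(n * calib_end_frac)
    return range(0, train_end), range(train_end, calib_end), range(calib_end, n)


train_idx, calib_idx, test_idx = chrono_split(1000)
print(len(train_idx), len(calib_idx), len(test_idx))  # 600 200 200
```

Because the cut is made by row position, the minute snapshots of a single contract can fall on both sides of a boundary.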

In [None]:
def infer_probabilities(
    df: pd.DataFrame,
    bundles: Dict[str, ModelBundle],
) -> pd.DataFrame:
    """Apply the calibrated models to each horizon and return a table with probabilities."""
    results = []
    for timeframe, frame in df.groupby("timeframe"):
        bundle = bundles.get(timeframe)
        if bundle is None:
            continue
        frame_prepared = sanitize_features(frame, bundle.feature_names)
        raw = bundle.model.predict_proba(frame_prepared[bundle.feature_names])[:, 1]
        prob = bundle.calibrator.predict_proba(raw.reshape(-1, 1))[:, 1]
        enriched = frame_prepared.copy()
        enriched["prob_up_raw"] = raw
        enriched["prob_up"] = prob
        results.append(enriched)
    return pd.concat(results).sort_index()

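Because the calibrator is a one-feature logistic regression fitted on the raw probability, applying it amounts to a sigmoid of an affine function of `prob_up_raw`. The sketch below shows that mapping; the coefficients `A` and `B` are placeholders chosen for illustration, not fitted values from the notebook.

```python
import math


def platt_map(prob_raw: float, a: float, b: float) -> float:
    """Calibrated probability as sigmoid(a * prob_raw + b)."""
    return 1.0 / (1.0 + math.exp(-(a * prob_raw + b)))


# Placeholder coefficients (assumed): slope 4 and intercept -2 keep 0.5 fixed.
A, B = 4.0, -2.0
for raw in (0.1, 0.5, 0.9):
    print(raw, round(platt_map(raw, A, B), 3))
```

With these placeholder values, extreme raw scores are pulled toward 0.5 (a raw 0.1 maps to about 0.17), which is the typical effect when the base model is overconfident.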

### 0.4 FOMO simulation and backtest functions


In [None]:
@dataclass
class FomoScenario:
    name: str
    fomo_index: float  # 0 = slow correction, 1 = instant correction
    aggressiveness: float  # amplitude of the bias
    stickiness: float  # inertia of the odds with respect to the shock
    noise: float = 0.0
    alpha: float = 4.0  # weight on the normalized distance
    beta: float = 2.0   # weight on the normalized range
    gamma: float = 1.0  # end-of-period reinforcement
    k_atr: float = 1.0  # ATR factor for the normalization


def simulate_fomo_odds(
    df: pd.DataFrame,
    scenarios: Iterable[FomoScenario],
) -> pd.DataFrame:
    """Generate simulated odds minute by minute for each FOMO scenario.
    Uses a distance normalized by the 15-minute ATR and an end-of-period reinforcement."""
    simulated = df.copy()

    def _simulate_group(group: pd.DataFrame, scenario: FomoScenario) -> np.ndarray:
        odds = []
        prev_odds = None
        eps = 1e-6
        for row in group.itertuples():
            base = getattr(row, "prob_up")
            time_decay = getattr(row, "time_remaining_ratio")
            # normalized distances
            atr = max(getattr(row, "atr_15m", np.nan) * scenario.k_atr, eps)
            z_dist = (getattr(row, "tf_close_to_now") - getattr(row, "tf_open")) / atr
            z_range = (getattr(row, "tf_high_to_now") - getattr(row, "tf_low_to_now")) / atr

            # end-of-period intensity
            end_boost = (1 - time_decay) ** scenario.gamma
            bias = scenario.aggressiveness * np.tanh(scenario.alpha * z_dist + scenario.beta * z_range) * end_boost

            target = float(np.clip(base + bias, 1e-4, 1 - 1e-4))

            if prev_odds is None:
                proposal = target
            else:
                proposal = (
                    scenario.stickiness * prev_odds
                    + (1 - scenario.stickiness) * target
                )
            blended = (
                scenario.fomo_index * base + (1 - scenario.fomo_index) * proposal
            )
            if scenario.noise > 0:
                blended += np.random.normal(0, scenario.noise)
            blended = float(np.clip(blended, 1e-4, 1 - 1e-4))
            odds.append(blended)
            prev_odds = blended
        return np.array(odds)

    for scenario in scenarios:
        column = f"odds_{scenario.name}"
        simulated[column] = np.nan
        for contract_id, group in simulated.groupby("contract_id"):
            series = pd.Series(_simulate_group(group, scenario), index=group.index)
            simulated.loc[group.index, column] = series
    return simulated

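To see how the pieces of `simulate_fomo_odds` combine, the sketch below computes a single minute of the update in plain Python, without the optional noise term. The function name `fomo_step` and all input numbers are assumptions made for the example.

```python
import math
from typing import Optional


def clip_prob(value: float) -> float:
    """Keep a probability inside [1e-4, 1 - 1e-4], as in the simulator."""
    return min(max(value, 1e-4), 1 - 1e-4)


def fomo_step(base: float, z_dist: float, z_range: float,
              time_remaining_ratio: float, prev_odds: Optional[float],
              aggressiveness: float, stickiness: float, fomo_index: float,
              alpha: float = 4.0, beta: float = 2.0, gamma: float = 1.0) -> float:
    """Hypothetical helper: one noise-free step of the simulated odds."""
    end_boost = (1 - time_remaining_ratio) ** gamma
    bias = aggressiveness * math.tanh(alpha * z_dist + beta * z_range) * end_boost
    target = clip_prob(base + bias)
    if prev_odds is None:
        proposal = target
    else:
        proposal = stickiness * prev_odds + (1 - stickiness) * target
    # fomo_index = 1 returns the model probability; 0 keeps the biased proposal.
    return clip_prob(fomo_index * base + (1 - fomo_index) * proposal)


odds = fomo_step(base=0.60, z_dist=0.5, z_range=1.0, time_remaining_ratio=0.25,
                 prev_odds=0.62, aggressiveness=0.15, stickiness=0.5, fomo_index=0.3)
print(round(odds, 4))
```

With these inputs the simulated odds land at roughly 0.646, above the model's 0.60. Since `z_range` is never negative, a positive `beta` shifts the bias toward the up side even when `z_dist` is negative.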

In [None]:
def equity_curve(trades: pd.DataFrame, stake_usd: float = 50.0) -> pd.Series:
    """Compute the cumulative equity curve."""
    if trades.empty:
        return pd.Series(dtype=float)
    ordered = trades.sort_values("timestamp").copy()
    ordered["pnl_usd"] = ordered["pnl"] * stake_usd
    curve = ordered["pnl_usd"].cumsum()
    curve.index = pd.RangeIndex(len(curve))
    return curve


def max_drawdown(series: pd.Series) -> float:
    """Compute the maximum drawdown (a non-positive value)."""
    if series.empty:
        return 0.0
    cummax = series.cummax()
    dd = series - cummax
    return float(dd.min())


def max_consecutive_losses(trades: pd.DataFrame) -> int:
    """Compute the maximum number of consecutive losses."""
    if trades.empty:
        return 0
    ordered = trades.sort_values("timestamp")
    count = 0
    best = 0
    for v in (ordered["pnl"] <= 0).astype(int).tolist():
        if v == 1:
            count += 1
            best = max(best, count)
        else:
            count = 0
    return best


def build_trades_online_stream(
    df: pd.DataFrame,
    odds_column: str,
    prob_column: str = "prob_up",
    target_column: str = TARGET_COLUMN,
    min_edge: float = 0.05,
    min_seconds_remaining: int = 0,
    spread_abs: float = 0.05,
    fee_abs: float = 0.0,
    min_z_abs: Optional[float] = None,
    allow_multiple: bool = False,
    cooldown_minutes: int = 0,
) -> pd.DataFrame:
    """Walk through each contract in chronological order and enter at the first threshold crossing.
    - No max-edge search; the decision is made live, minute by minute
    - A single trade per contract by default (allow_multiple=False)
    - Optional filter on the normalized distance (min_z_abs)
    """
    trades: list[dict] = []

    for (timeframe, contract_id), g in df.sort_index().groupby(["timeframe", "contract_id"], sort=False):
        in_cooldown_until = None
        entered = False
        for row in g.itertuples():
            sec_rem = float(getattr(row, "seconds_remaining"))
            if sec_rem < min_seconds_remaining:
                continue
            if min_z_abs is not None and hasattr(row, "z_dist_atr15"):
                if abs(getattr(row, "z_dist_atr15")) < min_z_abs:
                    continue

            ts = row.Index
            if in_cooldown_until is not None and ts < in_cooldown_until:
                continue

            p = float(getattr(row, prob_column))
            mid = float(getattr(row, odds_column))
            edge = p - mid
            if abs(edge) < min_edge:
                continue

            half_spread = spread_abs / 2.0
            if edge > 0:  # buy YES
                direction = "up"
                paid = np.clip(mid + half_spread, 1e-4, 0.999)
                outcome = int(getattr(row, target_column))
                pnl = outcome - paid - fee_abs
                ev = p - (mid + half_spread) - fee_abs
                model_prob_used = p
            else:  # buy NO
                direction = "down"
                paid = np.clip((1 - mid) + half_spread, 1e-4, 0.999)
                outcome = 1 - int(getattr(row, target_column))
                pnl = outcome - paid - fee_abs
                ev = (1 - p) - ((1 - mid) + half_spread) - fee_abs
                model_prob_used = 1 - p

            trades.append(
                {
                    "timeframe": timeframe,
                    "contract_id": contract_id,
                    "timestamp": ts,
                    "seconds_remaining": sec_rem,
                    "edge": float(edge),
                    "direction": direction,
                    "price": paid,
                    "model_prob": model_prob_used,
                    "expected_value": ev,
                    "outcome": outcome,
                    "pnl": pnl,
                }
            )

            if not allow_multiple:
                break
            if cooldown_minutes > 0:
                in_cooldown_until = ts + pd.Timedelta(minutes=cooldown_minutes)

    return pd.DataFrame(trades)


def summarize_online_by_timeframe(
    df: pd.DataFrame,
    odds_column: str,
    tolerances: list[float],
    min_seconds_remaining_by_tf: Optional[Dict[str, int]] = None,
    spread_abs: float = 0.05,
    fee_abs: float = 0.0,
    min_z_abs: Optional[float] = None,
    stake_usd: float = 50.0,
) -> pd.DataFrame:
    if min_seconds_remaining_by_tf is None:
        min_seconds_remaining_by_tf = {"m15": 0, "h1": 0, "d1": 0}

    rows: list[dict] = []
    for tf in ["m15", "h1", "d1"]:
        tf_frame = df[df["timeframe"] == tf]
        for tol in tolerances:
            tr = build_trades_online_stream(
                tf_frame,
                odds_column=odds_column,
                min_edge=tol,
                min_seconds_remaining=min_seconds_remaining_by_tf.get(tf, 0),
                spread_abs=spread_abs,
                fee_abs=fee_abs,
                min_z_abs=min_z_abs,
                allow_multiple=False,
            )
            curve = equity_curve(tr, stake_usd=stake_usd)
            mdd = max_drawdown(curve)
            mcl = max_consecutive_losses(tr)
            med_sec = float(tr["seconds_remaining"].median()) if len(tr) else np.nan
            mean_sec = float(tr["seconds_remaining"].mean()) if len(tr) else np.nan
            rows.append(
                {
                    "timeframe": tf,
                    "tolerance": tol,
                    "num_trades": len(tr),
                    "hit_rate": float((tr["pnl"] > 0).mean()) if len(tr) else 0.0,
                    "ev_trade_usd": float(tr["pnl"].mean() * stake_usd) if len(tr) else 0.0,
                    "pnl_total_usd": float((tr["pnl"] * stake_usd).sum()) if len(tr) else 0.0,
                    "max_drawdown_usd": float(mdd),
                    "max_consec_losses": int(mcl),
                    "num_up": int((tr["direction"]=='up').sum()) if len(tr) else 0,
                    "num_down": int((tr["direction"]=='down').sum()) if len(tr) else 0,
                    "up_ratio": float((tr["direction"]=='up').mean()) if len(tr) else 0.0,
                    "median_entry_min": None if np.isnan(med_sec) else round(med_sec / 60.0, 2),
                    "mean_entry_min": None if np.isnan(mean_sec) else round(mean_sec / 60.0, 2),
                }
            )
    return pd.DataFrame(rows).sort_values(["timeframe", "tolerance"])

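A worked example of the per-trade arithmetic in `build_trades_online_stream`, per unit of payout: the entry price is the mid plus half the spread, and the PnL is the binary outcome minus that price and the fee. The numbers and the helper name `trade_economics` are illustrative assumptions.

```python
from typing import Dict


def trade_economics(p_model: float, mid: float, outcome_up: int,
                    spread_abs: float = 0.05, fee_abs: float = 0.0) -> Dict[str, object]:
    """Hypothetical helper mirroring the YES/NO branches of the backtest."""
    half_spread = spread_abs / 2.0
    edge = p_model - mid
    if edge > 0:  # buy YES
        direction, win_prob, fair = "up", p_model, mid
        outcome = outcome_up
    else:  # buy NO
        direction, win_prob, fair = "down", 1 - p_model, 1 - mid
        outcome = 1 - outcome_up
    paid = min(max(fair + half_spread, 1e-4), 0.999)
    return {
        "direction": direction,
        "paid": paid,
        "expected_value": win_prob - (fair + half_spread) - fee_abs,
        "pnl": outcome - paid - fee_abs,
    }


# Model says 0.70, market mid is 0.60, and the candle closes up.
print(trade_economics(p_model=0.70, mid=0.60, outcome_up=1))
```

In the notebook these per-contract figures are then multiplied by `stake_usd` in `equity_curve` and in the summary table.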
    

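The two risk helpers reduce to a running maximum and a running loss counter. A plain-Python sketch on a made-up PnL sequence follows; the function names are hypothetical stand-ins for `max_drawdown` and `max_consecutive_losses`.

```python
from itertools import accumulate
from typing import List


def drawdown_from_pnl(pnls: List[float]) -> float:
    """Largest drop of cumulative PnL below its running peak (always <= 0)."""
    worst, peak = 0.0, float("-inf")
    for equity in accumulate(pnls):
        peak = max(peak, equity)
        worst = min(worst, equity - peak)
    return worst


def longest_losing_run(pnls: List[float]) -> int:
    """Longest run of trades with pnl <= 0, following the notebook's convention."""
    best = count = 0
    for pnl in pnls:
        count = count + 1 if pnl <= 0 else 0
        best = max(best, count)
    return best


pnls = [10.0, -5.0, -10.0, 20.0, -2.5]
print(drawdown_from_pnl(pnls), longest_losing_run(pnls))  # -15.0 2
```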

In [None]:
def evaluate_confidence_bands(
    df: pd.DataFrame,
    thresholds: Iterable[float],
    minute_filter: int | None = None,
) -> pd.DataFrame:
    """Measure the accuracy achieved above various intra-candle probability thresholds."""
    records = []
    for timeframe, frame in df.groupby("timeframe"):
        subset = frame
        if minute_filter is not None:
            subset = subset[subset["minutes_elapsed"] <= minute_filter]
        for thresh in thresholds:
            selected = subset[subset["prob_up"] >= thresh]
            if selected.empty:
                continue
            hit_rate = selected[TARGET_COLUMN].mean()
            avg_prob = selected["prob_up"].mean()
            avg_edge = (selected["prob_up"] - 0.5).mean()
            records.append(
                {
                    "timeframe": timeframe,
                    "threshold": thresh,
                    "count": len(selected),
                    "hit_rate": hit_rate,
                    "avg_prob": avg_prob,
                    "avg_edge_vs_50pct": avg_edge,
                }
            )
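    if not records:
        # Nothing passed any threshold: return an empty frame instead of failing on sort_values.
        return pd.DataFrame()
    return pd.DataFrame(records).sort_values(["timeframe", "threshold"])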

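In essence, `evaluate_confidence_bands` asks: among snapshots where the model's `prob_up` is at least a given threshold, how often did the candle actually close up? A minimal sketch on invented (probability, outcome) pairs, with a hypothetical `hit_rate_above` helper:

```python
from typing import List, Optional, Tuple

# Invented (prob_up, target_up) pairs, for illustration only.
SAMPLES: List[Tuple[float, int]] = [
    (0.55, 1), (0.62, 0), (0.71, 1), (0.80, 1), (0.91, 1), (0.58, 0),
]


def hit_rate_above(samples: List[Tuple[float, int]], threshold: float) -> Optional[float]:
    """Share of up-closes among rows with prob_up >= threshold (None if no rows)."""
    selected = [target for prob, target in samples if prob >= threshold]
    return sum(selected) / len(selected) if selected else None


for thresh in (0.6, 0.7, 0.95):
    print(thresh, hit_rate_above(SAMPLES, thresh))
```

Only the up side is examined here, as in the notebook function: confident "down" calls (low `prob_up`) are not scored by this check.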
In [None]:
FOMO_FEATURE_COLUMNS = [
    "prob_up",
    "prob_up_raw",
    "dist_from_open_pct",
    "high_gap_pct",
    "low_gap_pct",
    "running_range_pct",
    "minute_body_pct",
    "time_elapsed_ratio",
    "time_remaining_ratio",
    "minutes_elapsed",
    "minutes_remaining",
    "minutes_total",
    "seconds_remaining",
    "z_dist_atr15",
    "z_range_atr15",
    "return_1m",
    "trend_ema_ratio",
    "macro_trend_ratio",
    "volume_z",
    "streak_up_minute",
    "streak_down_minute",
    "streak_tf_up",
    "streak_tf_down",
    "prev_bucket_return",
    "prev_bucket_range",
    "odds_fast_revert",
    "odds_balanced",
    "odds_slow_sticky",
]


@dataclass
class FomoModelBundle:
    timeframe: str
    target: str
    model: HistGradientBoostingRegressor
    feature_names: List[str]
    metrics: Dict[str, float]


def train_fomo_models(
    df: pd.DataFrame,
    feature_cols: List[str],
    target_col: str,
    target_name: str,
    min_rows: int = 500,
) -> Dict[str, FomoModelBundle]:
    """Train one FOMO regressor per timeframe to approximate market odds."""
    bundles: Dict[str, FomoModelBundle] = {}
    for timeframe, frame in df.groupby("timeframe"):
        frame = frame.sort_index()
        frame = frame.dropna(subset=[target_col])
        if len(frame) < min_rows:
            continue

        frame_clean = sanitize_features(frame, feature_cols)
        X = frame_clean[feature_cols]
        y = frame[target_col]

        n = len(frame)
        train_end = int(n * 0.7)
        val_end = int(n * 0.85)

        X_train, y_train = X.iloc[:train_end], y.iloc[:train_end]
        X_val, y_val = X.iloc[train_end:val_end], y.iloc[train_end:val_end]
        X_test, y_test = X.iloc[val_end:], y.iloc[val_end:]

        if len(X_test) < 100:
            continue

        model = HistGradientBoostingRegressor(
            learning_rate=0.05,
            max_depth=5,
            max_iter=400,
            l2_regularization=0.01,
            min_samples_leaf=60,
            random_state=RANDOM_SEED,
        )
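        # Fit on train + validation: the validation slice is not used for tuning
        # or early stopping here, so only the final 15% is held out for testing.
        model.fit(pd.concat([X_train, X_val]), pd.concat([y_train, y_val]))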

        test_pred = model.predict(X_test)
        rmse = float(np.sqrt(mean_squared_error(y_test, test_pred)))
        mae = float(mean_absolute_error(y_test, test_pred))
        r2 = float(r2_score(y_test, test_pred))
        similarity_3pct = float((np.abs(test_pred - y_test) <= 0.03).mean())
        similarity_5pct = float((np.abs(test_pred - y_test) <= 0.05).mean())

        bundles[timeframe] = FomoModelBundle(
            timeframe=timeframe,
            target=target_name,
            model=model,
            feature_names=list(feature_cols),
            metrics={
                "rmse": rmse,
                "mae": mae,
                "r2": r2,
                "similarity_3pct": similarity_3pct,
                "similarity_5pct": similarity_5pct,
                "n_test": len(X_test),
            },
        )
    return bundles


def apply_fomo_models(
    df: pd.DataFrame,
    bundles: Dict[str, FomoModelBundle],
    feature_cols: List[str],
) -> pd.Series:
    """Generate FOMO predictions for each timeframe present in df."""
    preds = pd.Series(np.nan, index=df.index, dtype=float)
    for timeframe, bundle in bundles.items():
        mask = df["timeframe"] == timeframe
        if not mask.any():
            continue
        frame = sanitize_features(df.loc[mask], feature_cols)
        X = frame[feature_cols]
        preds.loc[mask] = bundle.model.predict(X)
    return preds


def summarize_fomo_performance(
    df: pd.DataFrame,
    prediction_col: str,
    target_col: str,
    similarity_threshold: float = 0.03,
) -> pd.DataFrame:
    """Compute model-vs-market alignment metrics per timeframe."""
    records: list[dict] = []
    subset = df.dropna(subset=[prediction_col, target_col])
    for timeframe, frame in subset.groupby("timeframe"):
        diff = frame[prediction_col] - frame[target_col]
        rmse = float(np.sqrt((diff**2).mean()))
        mae = float(diff.abs().mean())
        bias = float(diff.mean())
        similarity = float((diff.abs() <= similarity_threshold).mean())
        records.append(
            {
                "timeframe": timeframe,
                "n": len(frame),
                "rmse": rmse,
                "mae": mae,
                "bias": bias,
                "similarity_threshold": similarity_threshold,
                "similarity_ratio": similarity,
            }
        )
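    columns = ["timeframe", "n", "rmse", "mae", "bias", "similarity_threshold", "similarity_ratio"]
    # Explicit columns keep sort_values from failing when no rows matched
    return pd.DataFrame(records, columns=columns).sort_values("timeframe")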


def select_fomo_contract_samples(
    df: pd.DataFrame,
    timeframe: str,
    prediction_col: str,
    target_col: str,
    n: int = 1,
) -> List[str]:
    """Select the contracts with the highest mean absolute error for inspection."""
    subset = df[(df["timeframe"] == timeframe)].dropna(subset=[prediction_col, target_col])
    if subset.empty:
        return []
    errors = (
        subset.groupby("contract_id")
        .apply(lambda g: float(np.abs(g[prediction_col] - g[target_col]).mean()))
        .sort_values(ascending=False)
    )
    return errors.head(n).index.tolist()



In [None]:
def equity_curve_with_capital(
    trades: pd.DataFrame,
    initial_capital: float = 1000.0,
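    strategy_type: str = "risk_pct",  # "risk_pct" or "shares_pct"
    pct_value: float = 0.02,
) -> Tuple[pd.Series, pd.Series]:
    """Compute the equity curve and the capital after each trade.

    - "risk_pct": stake `pct_value` of the current capital on each trade.
    - "shares_pct": buy a fixed `initial_capital * pct_value` shares per trade.

    Both returned series hold the capital after each trade, in timestamp order.
    """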
    if trades.empty:
        return pd.Series(dtype=float), pd.Series(dtype=float)

    ordered = trades.sort_values("timestamp").copy()
    capital = initial_capital
    equity_values: list[float] = []
    capital_values: list[float] = []

    for _, trade in ordered.iterrows():
        price = trade["price"]
        if strategy_type == "risk_pct":
            stake_usd = capital * pct_value
            shares = stake_usd / max(price, 1e-6)
            pnl_usd = trade["pnl"] * shares
        elif strategy_type == "shares_pct":
            shares = initial_capital * pct_value
            pnl_usd = trade["pnl"] * shares
        else:
            raise ValueError("strategy_type must be 'risk_pct' or 'shares_pct'")

        capital += pnl_usd
        equity_values.append(capital)
        capital_values.append(capital)

    equity_curve = pd.Series(equity_values, index=pd.RangeIndex(len(equity_values)))
    capital_series = pd.Series(capital_values, index=pd.RangeIndex(len(capital_values)))
    return equity_curve, capital_series


def summarize_online_by_timeframe_with_capital(
    df: pd.DataFrame,
    odds_column: str,
    tolerances: list[float],
    initial_capital: float = 1000.0,
    strategy_type: str = "risk_pct",
    pct_value: float = 0.02,
    min_seconds_remaining_by_tf: Optional[Dict[str, int]] = None,
    spread_abs: float = 0.05,
    fee_abs: float = 0.0,
    min_z_abs: Optional[float] = None,
) -> Tuple[pd.DataFrame, Dict[str, pd.Series]]:
    """ONLINE summary per timeframe and tolerance, with dynamic capital management."""
    if min_seconds_remaining_by_tf is None:
        min_seconds_remaining_by_tf = {"m15": 0, "h1": 0, "d1": 0}

    rows: list[dict] = []
    equity_curves: Dict[str, pd.Series] = {}
    total_contracts = df.groupby("timeframe")["contract_id"].nunique()

    for tf in ["m15", "h1", "d1"]:
        tf_frame = df[df["timeframe"] == tf]
        for tol in tolerances:
            tr = build_trades_online_stream(
                tf_frame,
                odds_column=odds_column,
                prob_column="prob_up",
                min_edge=tol,
                min_seconds_remaining=min_seconds_remaining_by_tf.get(tf, 0),
                spread_abs=spread_abs,
                fee_abs=fee_abs,
                min_z_abs=min_z_abs,
                allow_multiple=False,
            )

            curve, capital_series = equity_curve_with_capital(
                tr,
                initial_capital=initial_capital,
                strategy_type=strategy_type,
                pct_value=pct_value,
            )

            mdd = max_drawdown(curve)
            mcl = max_consecutive_losses(tr)
            med_sec = float(tr["seconds_remaining"].median()) if len(tr) else np.nan
            mean_sec = float(tr["seconds_remaining"].mean()) if len(tr) else np.nan
            num_contracts = total_contracts.get(tf, 0)
            prob_trade = len(tr) / num_contracts if num_contracts else 0.0

            if len(tr) > 0 and not capital_series.empty:
                final_capital = float(capital_series.iloc[-1])
                pnl_total = final_capital - initial_capital
                if strategy_type == "risk_pct":
                    avg_stake = (capital_series.shift(fill_value=initial_capital) * pct_value).mean()
                else:
                    shares_count = initial_capital * pct_value
                    avg_price = tr["price"].mean()
                    avg_stake = shares_count * avg_price
                ev_trade = pnl_total / len(tr)
            else:
                final_capital = initial_capital
                pnl_total = 0.0
                avg_stake = 0.0
                ev_trade = 0.0

            rows.append(
                {
                    "timeframe": tf,
                    "tolerance": tol,
                    "num_trades": len(tr),
                    "hit_rate": float((tr["pnl"] > 0).mean()) if len(tr) else 0.0,
                    "ev_trade_usd": ev_trade,
                    "pnl_total_usd": pnl_total,
                    "final_capital_usd": final_capital,
                    "avg_stake_usd": avg_stake,
                    "max_drawdown_usd": float(mdd),
                    "max_consec_losses": int(mcl),
                    "num_up": int((tr["direction"] == "up").sum()) if len(tr) else 0,
                    "num_down": int((tr["direction"] == "down").sum()) if len(tr) else 0,
                    "up_ratio": float((tr["direction"] == "up").mean()) if len(tr) else 0.0,
                    "prob_trade": prob_trade,
                    "median_entry_min": None if np.isnan(med_sec) else round(med_sec / 60.0, 2),
                    "mean_entry_min": None if np.isnan(mean_sec) else round(mean_sec / 60.0, 2),
                }
            )

            equity_curves[f"{tf}_{tol}"] = curve

    summary = pd.DataFrame(rows).sort_values(["timeframe", "tolerance"])
    return summary, equity_curves



### 0.5 Market calibration functions (optional)


In [None]:
def load_market_csv(csv_path: str) -> pd.DataFrame:
    """Load the market quotes CSV and add a mid price per timeframe."""
    dfm = pd.read_csv(csv_path)
    # ISO 8601 timestamps parsed as UTC; rows that fail to parse are dropped
    dfm['timestamp'] = pd.to_datetime(dfm['timestamp'], utc=True, errors='coerce')
    dfm = dfm.dropna(subset=['timestamp']).sort_values('timestamp')
    dfm = dfm.set_index('timestamp').rename_axis('timestamp_utc')
    # Mid price per timeframe: average of the buy and sell quotes
    dfm['m15_mid'] = (dfm['m15_buy'] + dfm['m15_sell']) / 2
    dfm['h1_mid'] = (dfm['h1_buy'] + dfm['h1_sell']) / 2
    dfm['d1_mid'] = (dfm['daily_buy'] + dfm['daily_sell']) / 2
    return dfm

def _et(ts_utc: pd.Timestamp) -> pd.Timestamp:
    return ts_utc.tz_convert(TARGET_TZ)

def _daily_bucket_start_12pm_et(ts_et: pd.Timestamp) -> pd.Timestamp:
    # Before 12:00 ET the bucket started at 12:00 the previous day; otherwise at 12:00 today
    base = ts_et.floor('D') + pd.Timedelta(hours=12)
    return base if ts_et >= base else base - pd.Timedelta(days=1)

def build_market_long(dfm: pd.DataFrame) -> pd.DataFrame:
    # Aggregate per minute (UTC) to align with the m1 snapshots
    dfm = dfm.copy()
    dfm['minute_bucket'] = dfm.index.floor('1min')
    dfm_agg = dfm.groupby('minute_bucket').agg({
        'm15_mid': 'mean',
        'h1_mid': 'mean',
        'd1_mid': 'mean',
    }).reset_index()
    dfm_agg = dfm_agg.set_index('minute_bucket').rename_axis('timestamp_utc')
    dfm_agg['timestamp_et'] = dfm_agg.index.tz_convert(TARGET_TZ)
    
    rows = []
    for ts_utc, row in dfm_agg.iterrows():
        ts_et = row['timestamp_et']
        # m15
        b15_start = ts_et.floor('15min'); b15_end = b15_start + pd.Timedelta(minutes=15)
        rows.append({'timeframe':'m15','timestamp_utc':ts_utc,'timestamp_et':ts_et,
                     'bucket_start_et':b15_start,'bucket_end_et':b15_end,
                     'odds_market_mid': row.get('m15_mid', np.nan)})
        # h1
        h1_start = ts_et.floor('1h'); h1_end = h1_start + pd.Timedelta(hours=1)
        rows.append({'timeframe':'h1','timestamp_utc':ts_utc,'timestamp_et':ts_et,
                     'bucket_start_et':h1_start,'bucket_end_et':h1_end,
                     'odds_market_mid': row.get('h1_mid', np.nan)})
        # d1 (12:00 → 12:00 ET)
        d1_start = _daily_bucket_start_12pm_et(ts_et); d1_end = d1_start + pd.Timedelta(days=1)
        rows.append({'timeframe':'d1','timestamp_utc':ts_utc,'timestamp_et':ts_et,
                     'bucket_start_et':d1_start,'bucket_end_et':d1_end,
                     'odds_market_mid': row.get('d1_mid', np.nan)})
    long = pd.DataFrame(rows).dropna(subset=['odds_market_mid'])
    long['bucket_start_utc'] = long['bucket_start_et'].dt.tz_convert('UTC')
    long['contract_id'] = long['timeframe'] + '_' + long['bucket_start_et'].dt.strftime('%Y-%m-%d %H:%M')
    long['seconds_remaining'] = (long['bucket_end_et'] - long['timestamp_et']).dt.total_seconds()
    return long.set_index('timestamp_utc').sort_index()

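def merge_market_with_sim(simulated_df: pd.DataFrame, market_long: pd.DataFrame) -> pd.DataFrame:
    """Left-join simulated snapshots with market rows on (timeframe, contract_id).

    The join key does not include the timestamp, so each simulated row is paired
    with every market row of the same contract.
    """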
    left = simulated_df.copy().reset_index().rename(columns={'index': 'timestamp_utc'})
    right = market_long.reset_index().rename(columns={'index': 'timestamp_market_utc'})

    merged = pd.merge(
        left,
        right,
        on=['timeframe', 'contract_id'],
        how='left',
        suffixes=('', '_market'),
    )

    merged = merged.rename(columns={'timestamp_utc': 'timestamp_model_utc'})
    merged = merged.sort_values(['timeframe', 'contract_id', 'timestamp_model_utc'])
    merged = merged.set_index('timestamp_model_utc')
    merged.index.name = 'timestamp_utc'
    return merged

def rebuild_ohlc_1m_from_seconds(dfm: pd.DataFrame) -> pd.DataFrame:
    """Rebuild 1-minute OHLC bars from per-second `spot_price` data."""
    dfm = dfm.copy()
    dfm['minute_bucket'] = dfm.index.floor('1min')
    
    ohlc_1m = dfm.groupby('minute_bucket').agg({
        'spot_price': ['first', 'max', 'min', 'last', 'count']
    })
    ohlc_1m.columns = ['open', 'high', 'low', 'close', 'count']
    ohlc_1m = ohlc_1m.reset_index()
    ohlc_1m = ohlc_1m.set_index('minute_bucket').rename_axis('timestamp_utc')
    
    # Volume = number of ticks per minute (approximation)
    ohlc_1m['volume'] = ohlc_1m['count']
    ohlc_1m = ohlc_1m.drop('count', axis=1)
    
    return ohlc_1m


def build_market_long_with_stats(dfm: pd.DataFrame) -> pd.DataFrame:
    """Rebuild minute rows with per-candle statistics (m15/h1/d1).

    - Each minute carries the mean market odds (`odds_market_mid`).
    - The min/max/mean/close statistics are computed at candle level
      (15min, 1h, daily 12pm → 12pm) and then replicated on every minute
      belonging to that candle.
    """
    dfm = dfm.copy().sort_index()
    dfm['timestamp_et'] = dfm.index.tz_convert(TARGET_TZ)
    dfm['minute_utc'] = dfm.index.floor('1min')

    # Per-minute means for each timeframe
    minute_means = (
        dfm.groupby('minute_utc')[['m15_mid', 'h1_mid', 'd1_mid']].mean().rename(
            columns={
                'm15_mid': 'm15_mid_minute_mean',
                'h1_mid': 'h1_mid_minute_mean',
                'd1_mid': 'd1_mid_minute_mean',
            }
        )
    )
    minute_means.index = minute_means.index.tz_convert('UTC')
    minute_means['timestamp_et'] = minute_means.index.tz_convert(TARGET_TZ)

    # Accumulator for the final DataFrame rows
    all_rows: list[dict] = []

    def process_timeframe(label: str, minute_col: str, freq: str) -> None:
        if label == 'd1':
            bucket_start_et = minute_means['timestamp_et'].apply(_daily_bucket_start_12pm_et)
            bucket_end_et = bucket_start_et + pd.Timedelta(days=1)
            bucket_key = dfm['timestamp_et'].apply(_daily_bucket_start_12pm_et)
            stats = (
                dfm.groupby(bucket_key)[minute_col.replace('_minute_mean', '')]
                .agg(['min', 'max', 'mean', 'last'])
                .rename(columns={
                    'min': 'odds_market_min',
                    'max': 'odds_market_max',
                    'mean': 'odds_market_mean',
                    'last': 'odds_market_close',
                })
            )
            stats.index.name = 'bucket_start_et'
        else:
            bucket_start_et = minute_means['timestamp_et'].dt.floor(freq)
            bucket_end_et = bucket_start_et + (
                pd.Timedelta(minutes=15) if label == 'm15' else pd.Timedelta(hours=1)
            )
            stats = (
                dfm.groupby(dfm['timestamp_et'].dt.floor(freq))[minute_col.replace('_minute_mean', '')]
                .agg(['min', 'max', 'mean', 'last'])
                .rename(columns={
                    'min': 'odds_market_min',
                    'max': 'odds_market_max',
                    'mean': 'odds_market_mean',
                    'last': 'odds_market_close',
                })
            )
            stats.index.name = 'bucket_start_et'
        
        tf_df = pd.DataFrame({
            'timestamp_utc': minute_means.index,
            'timestamp_et': minute_means['timestamp_et'],
            'bucket_start_et': bucket_start_et,
            'bucket_end_et': bucket_end_et,
            'odds_market_mid': minute_means[minute_col],
        })
        tf_df = tf_df.join(stats, on='bucket_start_et')
        tf_df['bucket_start_utc'] = tf_df['bucket_start_et'].dt.tz_convert('UTC')
        tf_df['contract_id'] = (
            label + '_' + tf_df['bucket_start_et'].dt.strftime('%Y-%m-%d %H:%M')
        )
        tf_df['seconds_remaining'] = (
            (tf_df['bucket_end_et'] - tf_df['timestamp_et']).dt.total_seconds()
        )
        tf_df['timeframe'] = label
        all_rows.extend(tf_df.to_dict('records'))

    process_timeframe('m15', 'm15_mid_minute_mean', '15min')
    process_timeframe('h1', 'h1_mid_minute_mean', '1h')
    process_timeframe('d1', 'd1_mid_minute_mean', '1D')

    market_long = pd.DataFrame(all_rows)
    market_long = market_long.dropna(subset=['odds_market_mid'])
    market_long = market_long.sort_values(['timeframe', 'timestamp_utc'])
    market_long = market_long.set_index('timestamp_utc')
    return market_long


def scenario_scores_vs_market(df: pd.DataFrame, scenarios: list[str]) -> pd.DataFrame:
    records = []
    sub = df.dropna(subset=['odds_market_mid'])
    if len(sub) == 0:
        return pd.DataFrame(columns=['timeframe', 'scenario', 'rmse', 'mae', 'n'])
    for tf, g in sub.groupby('timeframe'):
        for sc in scenarios:
            col = f'odds_{sc}'
            if col not in g.columns:
                continue
            gg = g.dropna(subset=[col])
            if len(gg)==0: continue
            err = gg[col] - gg['odds_market_mid']
            rmse = float(np.sqrt((err**2).mean()))
            mae  = float(err.abs().mean())
            records.append({'timeframe':tf,'scenario':sc,'rmse':rmse,'mae':mae,'n':len(gg)})
    if len(records) == 0:
        return pd.DataFrame(columns=['timeframe', 'scenario', 'rmse', 'mae', 'n'])
    return pd.DataFrame(records).sort_values(['timeframe','rmse'])


In [None]:
def build_market_long_with_minmax(dfm: pd.DataFrame) -> pd.DataFrame:
    """Return the mid/min/max/mean/close market odds statistics per candle."""
    base = build_market_long_with_stats(dfm)
    columns_to_keep = [
        "timeframe",
        "timestamp_et",
        "bucket_start_et",
        "bucket_end_et",
        "bucket_start_utc",
        "contract_id",
        "odds_market_mid",
        "odds_market_min",
        "odds_market_max",
        "odds_market_mean",
        "odds_market_close",
        "seconds_remaining",
    ]
    available = [col for col in columns_to_keep if col in base.columns]
    return base[available]


def build_trades_with_minmax_opportunities(
    df: pd.DataFrame,
    prob_column: str = "prob_up",
    target_column: str = TARGET_COLUMN,
    min_edge: float = 0.05,
    min_seconds_remaining: int = 0,
    spread_abs: float = 0.05,
    fee_abs: float = 0.0,
    use_min_for_up: bool = True,
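    use_max_for_down: bool = True,
) -> pd.DataFrame:
    """Simulate entries that exploit the min/max of the market odds.

    When `df` comes from `build_market_long_with_stats`, the min/max columns are
    computed over the whole candle, so they include quotes observed after the
    entry row; treat the results as an optimistic bound.
    """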
    trades: list[dict] = []

    for (timeframe, contract_id), g in df.sort_index().groupby(["timeframe", "contract_id"], sort=False):
        for row in g.itertuples():
            sec_rem = float(getattr(row, "seconds_remaining"))
            if sec_rem < min_seconds_remaining:
                continue

            p = float(getattr(row, prob_column))
            if np.isnan(p):
                # No model probability on this row: nothing to trade on
                continue
            if p > 0.5:
                mid = getattr(row, "odds_market_mid", np.nan)
                if use_min_for_up and hasattr(row, "odds_market_min"):
                    mid = getattr(row, "odds_market_min")
                if np.isnan(mid):
                    continue
                direction = "up"
                edge = p - mid
                # Require a positive edge: the model must value this side above the market price
                if edge < min_edge:
                    continue
                paid = np.clip(mid + spread_abs / 2.0, 1e-4, 0.999)
                outcome = int(getattr(row, target_column))
                pnl = outcome - paid - fee_abs
                ev = p - (mid + spread_abs / 2.0) - fee_abs
                model_prob_used = p
            else:
                mid = getattr(row, "odds_market_mid", np.nan)
                if use_max_for_down and hasattr(row, "odds_market_max"):
                    mid = 1.0 - getattr(row, "odds_market_max")
                else:
                    mid = 1.0 - mid
                if np.isnan(mid):
                    continue
                direction = "down"
                edge = (1 - p) - mid
                if edge < min_edge:
                    continue
                paid = np.clip(mid + spread_abs / 2.0, 1e-4, 0.999)
                outcome = 1 - int(getattr(row, target_column))
                pnl = outcome - paid - fee_abs
                ev = (1 - p) - (mid + spread_abs / 2.0) - fee_abs
                model_prob_used = 1 - p

            trades.append(
                {
                    "timeframe": timeframe,
                    "contract_id": contract_id,
                    "timestamp": row.Index,
                    "seconds_remaining": sec_rem,
                    "edge": float(edge),
                    "direction": direction,
                    "price": paid,
                    "model_prob": model_prob_used,
                    "expected_value": ev,
                    "outcome": outcome,
                    "pnl": pnl,
                }
            )
            break

    return pd.DataFrame(trades)


def prepare_live_market_stream(market_raw: pd.DataFrame, predictions: pd.DataFrame) -> pd.DataFrame:
    """Assemble a second-by-second stream carrying the minute-level probabilities."""
    seconds_df = market_raw.copy().sort_index()
    seconds_df.index = seconds_df.index.tz_convert("UTC")
    seconds_df["timestamp_et"] = seconds_df.index.tz_convert(TARGET_TZ)
    seconds_df["minute_utc"] = seconds_df.index.floor("1min")

    pred_cols = ["timeframe", "prob_up", "target_up", "contract_id"]
    pred_meta = (
        predictions[pred_cols]
        .rename_axis("minute_utc")  # works whether or not the index already has a name
        .reset_index()
    )
    pred_meta["minute_utc"] = pred_meta["minute_utc"].dt.tz_convert("UTC")

    stream_rows: list[pd.DataFrame] = []

    def _bucket_info(frame: pd.Series, tf: str) -> tuple[pd.Series, pd.Series]:
        if tf == "m15":
            bucket_start_et = frame.dt.floor("15min")
            bucket_end_et = bucket_start_et + pd.Timedelta(minutes=15)
        elif tf == "h1":
            bucket_start_et = frame.dt.floor("1h")
            bucket_end_et = bucket_start_et + pd.Timedelta(hours=1)
        else:
            bucket_start_et = frame.apply(_daily_bucket_start_12pm_et)
            bucket_end_et = bucket_start_et + pd.Timedelta(days=1)
        return bucket_start_et, bucket_end_et

    for timeframe, col_name in [("m15", "m15_mid"), ("h1", "h1_mid"), ("d1", "d1_mid")]:
        if col_name not in seconds_df.columns:
            continue
        tf_seconds = (
            seconds_df[["timestamp_et", "minute_utc", col_name]]
            .rename(columns={col_name: "odds_market_mid"})
            .dropna(subset=["odds_market_mid"])
            .rename_axis("timestamp_utc")
            .reset_index()  # keep the timestamp as a column: merge() discards the index
        )
        if tf_seconds.empty:
            continue

        bucket_start_et, bucket_end_et = _bucket_info(tf_seconds["timestamp_et"], timeframe)
        tf_seconds["bucket_start_et"] = bucket_start_et
        tf_seconds["bucket_end_et"] = bucket_end_et
        tf_seconds["contract_id"] = timeframe + "_" + tf_seconds["bucket_start_et"].dt.strftime("%Y-%m-%d %H:%M")
        tf_seconds["seconds_remaining"] = (tf_seconds["bucket_end_et"] - tf_seconds["timestamp_et"]).dt.total_seconds()
        tf_seconds["timeframe"] = timeframe

        tf_pred = pred_meta[pred_meta["timeframe"] == timeframe]
        merge_cols = ["minute_utc", "contract_id"]
        tf_seconds = tf_seconds.merge(
            tf_pred[merge_cols + ["prob_up", "target_up"]],
            on=merge_cols,
            how="inner",
        )
        if tf_seconds.empty:
            continue

        # Restore the UTC timestamp index (already converted to UTC above)
        tf_seconds = tf_seconds.set_index("timestamp_utc")
        stream_rows.append(tf_seconds)

    if not stream_rows:
        return pd.DataFrame()

    stream = pd.concat(stream_rows).sort_index()
    cols_order = [
        "timeframe",
        "contract_id",
        "seconds_remaining",
        "prob_up",
        "target_up",
        "odds_market_mid",
    ]
    cols_extra = [c for c in ["timestamp_et", "bucket_start_et", "bucket_end_et"] if c in stream.columns]
    stream = stream[cols_order + cols_extra]
    stream.rename_axis("timestamp_utc", inplace=True)
    return stream



In [None]:
def train_preopen_models(
    df: pd.DataFrame,
    feature_cols: List[str],
    target_col: str = TARGET_COLUMN,
) -> Dict[str, ModelBundle]:
    """Train one model per timeframe using pre-open information only."""
    bundles: Dict[str, ModelBundle] = {}
    for timeframe, frame in df.groupby("timeframe"):
        frame = frame.dropna(subset=feature_cols + [target_col])
        if len(frame) < 500:
            continue

        n = len(frame)
        train_end = int(n * 0.7)
        calib_end = int(n * 0.85)

        train_slice = frame.iloc[:train_end]
        calib_slice = frame.iloc[train_end:calib_end]
        test_slice = frame.iloc[calib_end:]

        X_train = train_slice[feature_cols]
        y_train = train_slice[target_col]

        X_calib = calib_slice[feature_cols]
        y_calib = calib_slice[target_col]

        X_test = test_slice[feature_cols]
        y_test = test_slice[target_col]

        base_model = HistGradientBoostingClassifier(
            learning_rate=0.05,
            max_iter=350,
            max_depth=5,
            l2_regularization=0.015,
            min_samples_leaf=60,
            random_state=RANDOM_SEED,
        )
        base_model.fit(X_train, y_train)

        calib_preds = base_model.predict_proba(X_calib)[:, 1]
        calibrator = LogisticRegression(max_iter=200)
        calibrator.fit(calib_preds.reshape(-1, 1), y_calib)

        test_raw = base_model.predict_proba(X_test)[:, 1]
        test_calibrated = calibrator.predict_proba(test_raw.reshape(-1, 1))[:, 1]

        metrics = {
            "roc_auc": roc_auc_score(y_test, test_calibrated),
            "brier": brier_score_loss(y_test, test_calibrated),
            "accuracy": accuracy_score(y_test, (test_calibrated >= 0.5).astype(int)),
        }

        if hasattr(base_model, "feature_importances_"):
            feature_importances = base_model.feature_importances_
        else:
            perm = permutation_importance(
                base_model,
                X_test,
                y_test,
                n_repeats=5,
                random_state=RANDOM_SEED,
                n_jobs=-1,
            )
            feature_importances = perm.importances_mean

        bundles[timeframe] = ModelBundle(
            timeframe=timeframe,
            model=base_model,
            calibrator=calibrator,
            feature_names=feature_cols,
            metrics=metrics,
            feature_importances=feature_importances,
        )
    return bundles


In [None]:
def evaluate_preopen_thresholds(
    df: pd.DataFrame,
    thresholds: Iterable[float],
) -> pd.DataFrame:
    """Compute the pre-open hit rate for different probability thresholds."""
    records = []
    for timeframe, frame in df.groupby("timeframe"):
        for thresh in thresholds:
            selected = frame[frame["prob_up"] >= thresh]
            if selected.empty:
                continue
            hit_rate = selected[TARGET_COLUMN].mean()
            records.append(
                {
                    "timeframe": timeframe,
                    "threshold": thresh,
                    "count": len(selected),
                    "hit_rate": hit_rate,
                    "avg_prob": selected["prob_up"].mean(),
                }
            )
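    columns = ["timeframe", "threshold", "count", "hit_rate", "avg_prob"]
    # Explicit columns keep sort_values from failing when no threshold selected any row
    return pd.DataFrame(records, columns=columns).sort_values(["timeframe", "threshold"])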


In [None]:
def top_feature_importances(bundle: ModelBundle, top_n: int = 10) -> pd.DataFrame:
    """Return the model's top features ranked by importance."""
    if bundle.feature_importances is None:
        raise ValueError("Feature importances are unavailable for this model")
    data = pd.DataFrame(
        {
            "feature": bundle.feature_names,
            "importance": bundle.feature_importances,
        }
    )
    data = data.sort_values("importance", ascending=False).head(top_n)
    return data


## 1. Loading and preparing the data

We load the BTC m1 history (UTC), then add the global indicators the models need: returns, multi-scale EMAs, RSI, volatility, and volume signals.

- Goal: obtain a minute-level DataFrame enriched with technical indicators.
- Inputs: minute OHLCV CSV.
- Outputs: `minute_df` with enriched columns (EMA, RSI, ATR, volume, etc.).
- What to check: verify that the indicators are computed correctly.
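
For orientation, here is a minimal sketch of the kind of indicators listed above. It is not the notebook's `add_global_indicators`: the helper name `sketch_indicators`, the EMA spans (12 and 48), the 14-bar RSI built on simple rolling means, and the 60-minute volume window are all illustrative assumptions.

```python
import numpy as np
import pandas as pd


def sketch_indicators(df: pd.DataFrame, rsi_window: int = 14) -> pd.DataFrame:
    """Add a few illustrative indicators to a minute DataFrame (close, volume)."""
    out = df.copy()
    # One-minute simple return
    out["return_1m"] = out["close"].pct_change()
    # Two EMAs at different scales (spans are arbitrary choices for this sketch)
    out["ema_fast"] = out["close"].ewm(span=12, adjust=False).mean()
    out["ema_slow"] = out["close"].ewm(span=48, adjust=False).mean()
    # RSI variant using simple rolling means of gains and losses
    delta = out["close"].diff()
    gain = delta.clip(lower=0).rolling(rsi_window).mean()
    loss = (-delta.clip(upper=0)).rolling(rsi_window).mean()
    rs = gain / loss.replace(0, np.nan)  # left as NaN when there were no losses
    out["rsi"] = 100 - 100 / (1 + rs)
    # Volume z-score over a rolling 60-minute window
    vol_mean = out["volume"].rolling(60).mean()
    vol_std = out["volume"].rolling(60).std()
    out["volume_z"] = (out["volume"] - vol_mean) / vol_std.replace(0, np.nan)
    return out
```

The first rows of each rolling indicator are NaN until its window is full, which is one thing to look for when checking the enriched frame.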


In [None]:
minute_df = load_minute_data(DATA_PATH)
minute_df = add_global_indicators(minute_df)
print(f"[INFO] Data loaded: {len(minute_df)} rows from {minute_df.index[0]} to {minute_df.index[-1]}")
minute_df.head()


## 2. Multi-scale reconstruction

We group the data by horizon (m15, h1, daily) following the ET timezone and compute, for each minute: the open of the target candle, the range covered so far, the time remaining, and a memory of the previous candle.

- Goal: create minute-by-minute snapshots for each timeframe with intra-candle features.
- Inputs: enriched `minute_df`.
- Outputs: `snapshots_df` with per-timeframe columns (contract_id, minutes_remaining, dist_from_open_pct, etc.).
- What to check: verify that each timeframe has its buckets and features.
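
The snapshot idea can be sketched for a single timeframe as follows. This is an illustration under assumptions, not the notebook's `prepare_timeframe_dataset`: the helper name `sketch_snapshots` is hypothetical, the input is assumed to have a UTC `DatetimeIndex` with `open`, `high`, `low`, `close` columns, and `minutes_remaining` is measured from each minute's opening timestamp.

```python
import pandas as pd


def sketch_snapshots(minute_df: pd.DataFrame, freq: str = "15min",
                     tz: str = "America/New_York") -> pd.DataFrame:
    """Per-minute intra-candle features for one timeframe (illustrative only)."""
    out = minute_df.copy()
    # Bucket boundaries are computed on Eastern Time wall-clock timestamps
    ts_et = pd.Series(out.index.tz_convert(tz), index=out.index)
    out["bucket_start_et"] = ts_et.dt.floor(freq)
    bucket_end_et = out["bucket_start_et"] + pd.Timedelta(freq)

    grp = out.groupby("bucket_start_et")
    bucket_close = grp["close"].transform("last")
    out["bucket_open"] = grp["open"].transform("first")
    # Running extremes only use minutes already seen inside the candle
    out["running_high"] = grp["high"].cummax()
    out["running_low"] = grp["low"].cummin()

    out["dist_from_open_pct"] = out["close"] / out["bucket_open"] - 1.0
    out["running_range_pct"] = (out["running_high"] - out["running_low"]) / out["bucket_open"]
    out["minutes_remaining"] = (bucket_end_et - ts_et).dt.total_seconds() / 60.0
    # Label: did the candle close above its open? (known only after the candle ends)
    out["target_up"] = (bucket_close > out["bucket_open"]).astype(int)
    return out
```

Only `target_up` looks ahead to the end of the candle; every feature column is built from information available at the snapshot minute.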


In [None]:
TIMEFRAME_MAP = {"m15": "15min", "h1": "1h", "d1": "1d"}

snapshots_df = prepare_timeframe_dataset(minute_df, TIMEFRAME_MAP)
print(f"[INFO] Snapshots created: {len(snapshots_df)} rows")
print("[INFO] Breakdown by timeframe:")
print(snapshots_df.groupby("timeframe").size())
snapshots_df.sample(5, random_state=RANDOM_SEED)


## 3. Intra-candle models

Training of `HistGradientBoosting` classifiers on each horizon with logistic calibration. We then look at the overall metrics and the distribution of probabilities.


### 3.1 Training and calibration

We train one `HistGradientBoosting` model per horizon (m15, h1, d1) on the intra-candle features, then calibrate the probabilities with a logistic regression. We then review the main metrics to check that the model does beat the 50/50 benchmark.
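
To make the 50/50 benchmark concrete: a constant forecast of 0.5 has a Brier score of exactly 0.25 whatever the outcomes, so a useful model should score below that on held-out data. The sketch below uses hypothetical helpers (`chrono_split`, `brier`) and synthetic data; it mirrors the chronological 70/15/15 split seen in `train_preopen_models`, and is not taken from `train_timeframe_models`.

```python
import numpy as np


def chrono_split(n: int, train_frac: float = 0.70, calib_end_frac: float = 0.85):
    """Chronological train / calibration / test slices (no shuffling)."""
    train_end = int(n * train_frac)
    calib_end = int(n * calib_end_frac)
    return slice(0, train_end), slice(train_end, calib_end), slice(calib_end, n)


def brier(y_true, prob) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    y_true = np.asarray(y_true, dtype=float)
    prob = np.asarray(prob, dtype=float)
    return float(np.mean((prob - y_true) ** 2))


# Synthetic outcomes drawn from known probabilities (assumption for the demo)
rng = np.random.default_rng(0)
true_prob = rng.uniform(0.05, 0.95, size=20_000)
y = (rng.uniform(size=20_000) < true_prob).astype(int)

train, calib, test = chrono_split(len(y))
baseline = brier(y[test], np.full(len(y[test]), 0.5))  # always 0.25
informed = brier(y[test], true_prob[test])             # forecast with real information
print(f"50/50 Brier: {baseline:.3f} | informed Brier: {informed:.3f}")
```

The same comparison applies to the `brier` entry of each bundle's metrics: values close to 0.25 mean the model adds little over a coin flip.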


In [None]:
bundles = train_timeframe_models(snapshots_df, FEATURE_COLUMNS)
metrics_overview = {tf: bundle.metrics for tf, bundle in bundles.items()}
metrics_overview


In [None]:
pred_df = infer_probabilities(snapshots_df, bundles)
pred_df.tail()


### 3.2 Probability visualizations (e.g. m15)

Distribution of the predicted probabilities and empirical calibration curve, to check that the scores are consistent.


In [None]:
viz_df = pred_df[pred_df["timeframe"] == "m15"].copy()
if len(viz_df) > 20000:
    viz_df = viz_df.sample(20000, random_state=RANDOM_SEED)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

sns.histplot(viz_df["prob_up"], bins=30, ax=axes[0], color="#1f77b4")
axes[0].set_title("Distribution of probabilities (m15)")
axes[0].set_xlabel("Model probability of an up close")
axes[0].set_ylabel("Number of minutes")

bins = np.linspace(0, 1, 11)
labels = pd.IntervalIndex.from_breaks(bins, closed="right")
viz_df["bucket"] = pd.cut(viz_df["prob_up"], bins=bins, labels=False, include_lowest=True)
calibration = (
    viz_df.groupby("bucket").agg(
        mean_prob=("prob_up", "mean"),
        hit_rate=(TARGET_COLUMN, "mean"),
        count=(TARGET_COLUMN, "size"),
    )
)
axes[1].plot([0, 1], [0, 1], ls="--", color="gray", label="Perfectly calibrated")
axes[1].plot(calibration["mean_prob"], calibration["hit_rate"], marker="o", label="Empirical")
axes[1].set_title("Empirical calibration (m15)")
axes[1].set_xlabel("Mean probability per bin")
axes[1].set_ylabel("Realized frequency")
axes[1].legend()

plt.tight_layout()
plt.show()

calibration.head()


### 3.2.5 Bullish-trend bias analysis

We check whether the model is biased by a reference period in which Bitcoin was trending up. We examine:
- The distribution of targets (share of up vs. down)
- The distribution of predicted probabilities
- Model performance across different periods
- Whether the model predicts "up" systematically more often than "down"
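A gap between the mean predicted probability and the realized share of up candles is only meaningful relative to sampling noise. As a rough, hedged sketch (the helper name `bias_z_score` is hypothetical), one can express the gap in standard errors of the base rate. Minute snapshots within the same candle share one outcome, so the count used here is assumed to be the number of contracts, not the number of rows.

```python
import math

def bias_z_score(mean_prob, base_rate, n_contracts):
    """Gap between mean prediction and base rate, in standard errors of the base rate."""
    se = math.sqrt(base_rate * (1.0 - base_rate) / n_contracts)
    return (mean_prob - base_rate) / se

# Same 3-point gap, very different evidence depending on the sample size.
z_large = bias_z_score(0.53, 0.50, 10_000)
z_small = bias_z_score(0.53, 0.50, 100)
print(round(z_large, 2), round(z_small, 2))
```

With few daily contracts, a gap that looks "moderate" in absolute terms may be indistinguishable from noise, while the same gap on thousands of m15 contracts is not.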


In [None]:
# Analysis of a potential bullish-trend bias

print("=" * 80)
print("BULLISH-TREND BIAS ANALYSIS")
print("=" * 80)

# 1. Distribution of realized targets by timeframe
print("\n1. Distribution of realized TARGETS (share of candles that close UP):")
target_dist = (
    snapshots_df.groupby("timeframe")[TARGET_COLUMN]
    .agg(["mean", "count", "sum"])
    .rename(columns={"mean": "proportion_up", "count": "total", "sum": "nb_up"})
)
target_dist["proportion_down"] = 1 - target_dist["proportion_up"]
target_dist["nb_down"] = target_dist["total"] - target_dist["nb_up"]
display(target_dist)

# 2. Distribution of predicted probabilities by timeframe
print("\n2. Distribution of the model's PREDICTED PROBABILITIES:")
prob_stats = pred_df.groupby("timeframe")["prob_up"].agg([
    "mean", "median", "std",
    lambda x: (x > 0.5).mean(),  # share of predictions > 0.5
    lambda x: (x > 0.6).mean(),  # share of predictions > 0.6
    lambda x: (x < 0.4).mean(),  # share of predictions < 0.4
]).rename(columns={
    "<lambda_0>": "prop_pred_>_0.5",
    "<lambda_1>": "prop_pred_>_0.6", 
    "<lambda_2>": "prop_pred_<_0.4"
})
display(prob_stats.round(4))

# 3. Realized target vs. mean predicted probability
print("\n3. REALIZED TARGET vs. MEAN PREDICTED PROBABILITY:")
comparison = pd.DataFrame({
    "target_real_prop_up": target_dist["proportion_up"],
    "prob_pred_mean": prob_stats["mean"],
    "diff": prob_stats["mean"] - target_dist["proportion_up"],
})
comparison["bias_pct"] = (comparison["diff"] / comparison["target_real_prop_up"] * 100).round(2)
display(comparison.round(4))

# 4. Performance by period (time split)
print("\n4. Model performance across periods (split into 4 time quartiles):")
pred_df_with_date = pred_df.copy()
pred_df_with_date["date"] = pred_df_with_date.index.date
pred_df_with_date["period_quartile"] = pd.qcut(
    # Explicit int64 (nanoseconds since epoch) avoids platform-dependent `int` width.
    pred_df_with_date.index.astype("int64"), q=4, labels=["Q1", "Q2", "Q3", "Q4"]
)

period_perf = []
for tf in ["m15", "h1", "d1"]:
    tf_data = pred_df_with_date[pred_df_with_date["timeframe"] == tf]
    for period in ["Q1", "Q2", "Q3", "Q4"]:
        period_data = tf_data[tf_data["period_quartile"] == period]
        if len(period_data) == 0:
            continue
        target_prop = period_data[TARGET_COLUMN].mean()
        prob_mean = period_data["prob_up"].mean()
        accuracy = ((period_data["prob_up"] >= 0.5) == period_data[TARGET_COLUMN]).mean()
        period_perf.append({
            "timeframe": tf,
            "period": period,
            "target_prop_up": target_prop,
            "prob_mean": prob_mean,
            "accuracy": accuracy,
            "n": len(period_data),
        })

period_df = pd.DataFrame(period_perf)
if len(period_df) > 0:
    display(period_df.pivot_table(
        index="timeframe", 
        columns="period", 
        values=["target_prop_up", "prob_mean", "accuracy"],
        aggfunc="first"
    ).round(4))

# 5. Conditional calibration test: is the model biased toward "up"?
print("\n5. Conditional bias test:")
print("   If the model were biased by the trend, we would expect:")
print("   - A mean predicted probability > realized share of 'up'")
print("   - Better performance during bullish periods")
print("   - Predictions systematically > 0.5")

bias_summary = pd.DataFrame({
    "timeframe": ["m15", "h1", "d1"],
    "target_real_up": [target_dist.loc[tf, "proportion_up"] for tf in ["m15", "h1", "d1"]],
    "prob_pred_mean": [prob_stats.loc[tf, "mean"] for tf in ["m15", "h1", "d1"]],
    "diff_abs": [abs(prob_stats.loc[tf, "mean"] - target_dist.loc[tf, "proportion_up"]) 
                 for tf in ["m15", "h1", "d1"]],
})

bias_summary["bias_direction"] = bias_summary.apply(
    lambda row: "overestimates UP" if row["prob_pred_mean"] > row["target_real_up"]
    else "underestimates UP", axis=1
)
bias_summary["bias_severity"] = bias_summary["diff_abs"].apply(
    lambda x: "low" if x < 0.02 else "moderate" if x < 0.05 else "strong"
)

display(bias_summary.round(4))

print("\n" + "=" * 80)
print("CONCLUSION:")
print("=" * 80)
for tf in ["m15", "h1", "d1"]:
    target_p = target_dist.loc[tf, "proportion_up"]
    prob_p = prob_stats.loc[tf, "mean"]
    diff = prob_p - target_p
    print(f"\n{tf.upper()}:")
    print(f"  - Realized share of UP candles: {target_p:.1%}")
    print(f"  - Mean predicted probability: {prob_p:.1%}")
    print(f"  - Gap: {diff:+.1%} ({'overestimates' if diff > 0 else 'underestimates'} UP)")
    if abs(diff) < 0.02:
        print("  → Negligible bias (< 2 percentage points)")
    elif abs(diff) < 0.05:
        print("  → Moderate bias, worth monitoring")
    else:
        print("  → STRONG BIAS DETECTED: the model may be influenced by the trend")


### 3.3 Probability analysis

We quantify the edge by keeping only the signals whose probability exceeds various thresholds, both over the whole candle and from the first minute.
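The threshold filter described above can be sketched as follows. This is an illustrative stand-in, not the notebook's `evaluate_confidence_bands`; the function name, the default column names, and the toy data are assumptions. Confidence is taken as the probability of whichever side the model favors, so a `prob_up` of 0.2 counts as an 80% "down" signal.

```python
import pandas as pd

def hit_rate_by_threshold(df, thresholds, prob_col="prob_up", target_col="target_up"):
    """Hit rate of the favored side, restricted to signals above each confidence threshold."""
    side_up = df[prob_col] >= 0.5
    # Confidence = probability assigned to the favored side.
    confidence = df[prob_col].where(side_up, 1.0 - df[prob_col])
    correct = side_up == (df[target_col] == 1)
    rows = []
    for thr in thresholds:
        selected = confidence >= thr
        n = int(selected.sum())
        hit_rate = float((correct & selected).sum() / n) if n else float("nan")
        rows.append({"threshold": thr, "count": n, "hit_rate": hit_rate})
    return pd.DataFrame(rows)

# Toy data (assumption): five minute-level signals with their realized outcomes.
demo = pd.DataFrame({
    "prob_up": [0.9, 0.2, 0.55, 0.65, 0.45],
    "target_up": [1, 0, 0, 0, 1],
})
demo_stats = hit_rate_by_threshold(demo, [0.6, 0.75])
print(demo_stats)
```

Raising the threshold typically trades sample size for hit rate, which is why the count column should always be read alongside it.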


In [None]:
summary = (
    snapshots_df.groupby("timeframe")
    .agg(
        rows=("contract_id", "size"),
        contracts=("contract_id", pd.Series.nunique),
        median_minutes_total=("minutes_total", "median"),
    )
)
display(summary)


## 4. Pre-open probabilities

We compute a snapshot just before each bet opens (m15, h1, daily) to estimate the likely direction before the first minute of trading.

**How the predictive model works:**
- **Pre-open**: probability computed at T−1 min before the open, based on macro features (EMA, RSI, trend, previous candle).
- **Intra-candle**: probability recomputed every minute during the candle, enriched with intra-candle features (distance from the open, time remaining, etc.).
- **Conditional entry**: entries can be filtered with `min_seconds_remaining` to avoid entering too early, while the probabilities are still insufficient. They can also be filtered by a minimum probability.
- **Validation**: the prediction is compared with the actual close (`target_up`) to compute the win rate.

- Goal: anticipate the direction as early as T−1 min.
- Inputs: `preopen_df`, `PREOPEN_FEATURES`.
- Outputs: `preopen_bundles`, `preopen_pred`, evaluation tables by threshold.
- Reading: compare hit_rate across thresholds and timeframes; useful when odds form early.
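The T−1 min snapshot can be illustrated with a small selection rule on a 1-minute frame. This is a sketch, not the notebook's `build_preopen_dataset`: the helper name `preopen_rows` is hypothetical, and it assumes bars are stamped at their start time, so the bar stamped 09:59 is the last one fully known before a 10:00 open.

```python
import pandas as pd

def preopen_rows(minute_df, freq="15min"):
    """Keep, for each candle, the last minute bar strictly before its open."""
    next_minute = minute_df.index + pd.Timedelta(minutes=1)
    # A row is a pre-open row when the following minute falls exactly on a candle boundary.
    is_preopen = next_minute.floor(freq) == next_minute
    out = minute_df[is_preopen].copy()
    out["candle_open_time"] = next_minute[is_preopen]
    return out

# Toy 1-minute frame (assumption): 09:55 to 10:19, so two m15 opens are covered.
idx = pd.date_range("2024-01-01 09:55", periods=25, freq="1min")
minute_demo = pd.DataFrame({"close": range(25)}, index=idx)
pre_demo = preopen_rows(minute_demo)
print(pre_demo)
```

Selecting rows this way keeps every feature strictly in the past relative to the candle being predicted, which is the property the pre-open models rely on.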


In [None]:
thresholds = [0.55, 0.6, 0.65, 0.7]
all_minutes_stats = evaluate_confidence_bands(pred_df, thresholds)
first_minute_stats = evaluate_confidence_bands(pred_df, thresholds, minute_filter=1)

display(all_minutes_stats.head(12))
display(first_minute_stats.head(12))


In [None]:
scenarios = [
    FomoScenario(name="fast_revert", fomo_index=0.8, aggressiveness=0.12, stickiness=0.25, noise=0.01),
    FomoScenario(name="balanced", fomo_index=0.5, aggressiveness=0.18, stickiness=0.55, noise=0.015),
    FomoScenario(name="slow_sticky", fomo_index=0.2, aggressiveness=0.25, stickiness=0.8, noise=0.02),
]

simulated_df = simulate_fomo_odds(pred_df, scenarios)
simulated_df[["prob_up"] + [f"odds_{s.name}" for s in scenarios]].head()


In [None]:
preopen_df = build_preopen_dataset(minute_df, TIMEFRAME_MAP)
preopen_df.head()


In [None]:
preopen_bundles = train_preopen_models(preopen_df, PREOPEN_FEATURES)
{k: v.metrics for k, v in preopen_bundles.items()}


In [None]:
preopen_pred = infer_probabilities(preopen_df, preopen_bundles)
preopen_pred.head()


In [None]:
preopen_thresholds = evaluate_preopen_thresholds(preopen_pred, thresholds)
preopen_thresholds.head(9)


### 4.1 Pre-open probability visualizations

Plots of the win rate by timeframe and probability threshold for the pre-open predictions.
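When comparing win rates across thresholds, a high rate on a handful of trades can be misleading. One common safeguard, shown here as a hedged sketch and not as part of the notebook's pipeline, is to rank thresholds by the lower bound of a Wilson score interval instead of the raw hit rate. The helper name is an assumption.

```python
import math

def wilson_lower_bound(wins, n, z=1.96):
    """Lower bound of the Wilson score interval for a binomial proportion."""
    if n == 0:
        return 0.0
    p = wins / n
    denom = 1.0 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / denom

# 80% on 5 trades is weaker evidence than 65% on 1000 trades.
lb_small = wilson_lower_bound(4, 5)
lb_large = wilson_lower_bound(650, 1000)
print(round(lb_small, 3), round(lb_large, 3))
```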


In [None]:
# Win rate by timeframe and threshold
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: win rate by threshold and timeframe
for tf in preopen_thresholds['timeframe'].unique():
    tf_data = preopen_thresholds[preopen_thresholds['timeframe'] == tf]
    axes[0].plot(tf_data['threshold'], tf_data['hit_rate'], marker='o', label=f'{tf}', linewidth=2)
axes[0].axhline(0.5, color='gray', linestyle='--', alpha=0.5, label='50% (baseline)')
axes[0].axhline(0.6, color='green', linestyle='--', alpha=0.3, label='60% (minimum target)')
axes[0].set_xlabel('Probability threshold')
axes[0].set_ylabel('Win rate (hit rate)')
axes[0].set_title('Pre-open win rate by timeframe and threshold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Plot 2: number of trades by threshold and timeframe
for tf in preopen_thresholds['timeframe'].unique():
    tf_data = preopen_thresholds[preopen_thresholds['timeframe'] == tf]
    axes[1].plot(tf_data['threshold'], tf_data['count'], marker='s', label=f'{tf}', linewidth=2)
axes[1].set_xlabel('Probability threshold')
axes[1].set_ylabel('Number of trades')
axes[1].set_title('Number of pre-open trades by timeframe')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
axes[1].set_yscale('log')

plt.tight_layout()
plt.show()

# Summary table by timeframe (best threshold for each timeframe)
best_by_tf = preopen_thresholds.loc[preopen_thresholds.groupby('timeframe')['hit_rate'].idxmax()]
print("Best threshold per timeframe (max win rate):")
display(best_by_tf[['timeframe', 'threshold', 'hit_rate', 'count', 'avg_prob']])


## 5. FOMO odds simulation

We generate several parameterizable scenarios (fast, balanced, and sticky correction) to simulate the odds that an imbalanced market could display. These simulated odds are then compared with real market odds to calibrate the model.

- Goal: create synthetic odds (odds_*) to test the model↔market gap.
- Inputs: `pred_df` (intra-candle probabilities), `FomoScenario` parameters.
- Outputs: `simulated_df` with `odds_<scenario>` columns.
- Reading: larger (z_dist_atr15, z_range_atr15) and less time remaining ⇒ more extreme odds; otherwise close to `prob_up`.
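The qualitative behavior described in the last bullet can be illustrated with a toy formula. This is explicitly not the implementation of `simulate_fomo_odds`, which is defined earlier in the notebook: the functional form (a `tanh` push scaled by elapsed time) is an assumption, `frac_remaining` is a hypothetical input, and the `stickiness` and `noise` parameters of `FomoScenario` are left out.

```python
import numpy as np

def toy_fomo_odds(prob_up, z_dist, frac_remaining, fomo_index=0.5, aggressiveness=0.18):
    """Hypothetical toy model: push the quoted odds in the direction of the current move."""
    # tanh bounds the crowd reaction; (1 - frac_remaining) makes it grow as expiry nears.
    push = fomo_index * aggressiveness * np.tanh(z_dist) * (1.0 - frac_remaining)
    return float(np.clip(prob_up + push, 0.01, 0.99))

flat = toy_fomo_odds(0.60, z_dist=0.0, frac_remaining=0.5)   # no move: odds stay at prob_up
early = toy_fomo_odds(0.60, z_dist=2.0, frac_remaining=0.9)  # strong move, early in the candle
late = toy_fomo_odds(0.60, z_dist=2.0, frac_remaining=0.1)   # same move, close to expiry
print(flat, round(early, 4), round(late, 4))
```

Under these assumptions, the same price displacement produces a larger deviation from `prob_up` late in the candle than early, which is the pattern the backtest later tries to exploit.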


In [None]:
scenarios = [
    FomoScenario(name="fast_revert", fomo_index=0.8, aggressiveness=0.12, stickiness=0.25, noise=0.01),
    FomoScenario(name="balanced", fomo_index=0.5, aggressiveness=0.18, stickiness=0.55, noise=0.015),
    FomoScenario(name="slow_sticky", fomo_index=0.2, aggressiveness=0.25, stickiness=0.8, noise=0.02),
]

simulated_df = simulate_fomo_odds(pred_df, scenarios)
print(f"[INFO] FOMO odds simulated for {len(scenarios)} scenarios")
print(f"[INFO] Columns created: {[f'odds_{s.name}' for s in scenarios]}")
simulated_df[["prob_up"] + [f"odds_{s.name}" for s in scenarios]].head()


### 5.1 FOMO model calibration

We load the BTC.csv data (per second), rebuild the 1-minute OHLC bars and the bet candles (m15, h1, daily) with odds statistics (min/max/mean/close). We then compare the simulated odds with the real market odds to calibrate the model.

- Goal: validate and calibrate the FOMO model by comparing simulated odds with real market odds.
- Inputs: `simulated_df`, `MARKET_ODDS_PATH` (BTC.csv with per-second data: timestamp, spot_price, m15_buy, m15_sell, h1_buy, h1_sell, daily_buy, daily_sell).
- Outputs: `ohlc_1m_rebuilt` (rebuilt 1-minute OHLC), `market_long` (candles with stats: odds_market_min/max/mean/close), `scores`, `best_scenario`, `merged`.
- Reading: daily candles run from 12pm ET to 12pm ET. The odds statistics (min/max/mean/close) are useful for the trading simulations.
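Rebuilding minute bars from per-second data is a plain resampling step. The sketch below shows one way to do it with pandas; it is not the code of `rebuild_ohlc_1m_from_seconds` or `build_market_long_with_stats`, and the helper names and the toy series are assumptions.

```python
import pandas as pd

def ohlc_1m_from_seconds(spot):
    """Aggregate a per-second price series into 1-minute OHLC bars."""
    return spot.resample("1min").ohlc().dropna()

def odds_stats_1m(mid):
    """Per-minute min/max/mean/close of a per-second odds series."""
    stats = mid.resample("1min").agg(["min", "max", "mean", "last"])
    return stats.rename(columns={"last": "close"}).dropna()

# Toy per-second series (assumption): two minutes of a steadily rising value.
sec_idx = pd.date_range("2024-01-01", periods=120, freq=pd.Timedelta(seconds=1))
spot_demo = pd.Series([float(i) for i in range(120)], index=sec_idx)

ohlc_demo = ohlc_1m_from_seconds(spot_demo)
stats_demo = odds_stats_1m(spot_demo)
print(ohlc_demo)
print(stats_demo)
```

For the daily contracts, the same idea would need bins anchored at 12pm ET rather than at midnight, which this sketch does not cover.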


In [None]:
# The functions load_market_csv, build_market_long, merge_market_with_sim and scenario_scores_vs_market
# are already defined in section 0.5 (market calibration functions).
# We check whether the file exists and load the real data.

fomo_models_by_target: Dict[str, Dict[str, FomoModelBundle]] = {}
fomo_performance: Dict[str, pd.DataFrame] = {}
fomo_target_columns = {"mean": "odds_market_mean", "close": "odds_market_close"}
fomo_similarity_threshold = 0.03

if MARKET_ODDS_PATH.exists():
    print(f"[INFO] Market file found: {MARKET_ODDS_PATH}")
    market_raw = load_market_csv(str(MARKET_ODDS_PATH))
    print(f"[INFO] Market data loaded: {len(market_raw)} rows (per second) from {market_raw.index[0]} to {market_raw.index[-1]}")
    print(f"[INFO] Available columns: {list(market_raw.columns)}")
    
    # Rebuild the 1-minute OHLC bars from spot_price
    ohlc_1m_rebuilt = rebuild_ohlc_1m_from_seconds(market_raw)
    print(f"[INFO] 1-minute OHLC rebuilt: {len(ohlc_1m_rebuilt)} bars")
    print(f"[INFO] OHLC period: {ohlc_1m_rebuilt.index[0]} to {ohlc_1m_rebuilt.index[-1]}")
    
    # Rebuild the bet candles with odds statistics
    market_long = build_market_long_with_stats(market_raw)
    print(f"[INFO] Bet candles with stats built: {len(market_long)} rows")
    print(f"[INFO] Available columns: {list(market_long.columns)}")
    if len(market_long) > 0:
        print("[INFO] Sample m15 statistics:")
        m15_sample = market_long[market_long['timeframe'] == 'm15'].head(3)
        display(m15_sample[['timeframe', 'contract_id', 'odds_market_min', 'odds_market_max', 'odds_market_mean', 'odds_market_close']])
    
    merged = merge_market_with_sim(simulated_df, market_long)
    print(f"[INFO] Simulated/market merge: {len(merged)} rows")
    print(f"[INFO] Rows with market odds: {merged['odds_market_mid'].notna().sum()}")

    # Diagnostics if the merge produced no matches
    if merged['odds_market_mid'].notna().sum() == 0:
        print("\n[WARNING] No match found between simulated and market data.")
        print(f"[DEBUG] simulated_df period: {simulated_df.index.min()} to {simulated_df.index.max()}")
        print(f"[DEBUG] market_long period: {market_long.index.min()} to {market_long.index.max()}")
        print(f"[DEBUG] Sample contract_id values in simulated_df: {simulated_df['contract_id'].unique()[:5]}")
        print(f"[DEBUG] Sample contract_id values in market_long: {market_long['contract_id'].unique()[:5]}")
        print(f"[DEBUG] Timeframes simulated_df: {simulated_df['timeframe'].unique()}")
        print(f"[DEBUG] Timeframes market_long: {market_long['timeframe'].unique()}")
    
    # Compare the simulated scenarios against the real market
    scenario_names = [s.name for s in scenarios]
    scores = scenario_scores_vs_market(merged, scenario_names)
    print("\n[INFO] RMSE/MAE scores by scenario and timeframe:")
    display(scores)
    
    # Best scenario per timeframe (if any scores exist)
    if len(scores) > 0:
        best_scenario = scores.loc[scores.groupby('timeframe')['rmse'].idxmin()]
        print("\n[INFO] Best scenario per timeframe (lowest RMSE):")
        display(best_scenario[['timeframe', 'scenario', 'rmse', 'mae', 'n']])
    else:
        print("\n[WARNING] No scores available (no matches between simulated and market data).")
    
    # Train supervised FOMO models (targets: mean and close)
    for target_name, market_col in fomo_target_columns.items():
        if market_col not in merged.columns:
            continue
        available = merged.dropna(subset=[market_col])
        if len(available) < 1000:
            continue
        # Use a dedicated name so the intra-candle `bundles` from section 3 are not overwritten.
        fomo_bundles = train_fomo_models(available, FOMO_FEATURE_COLUMNS, market_col, target_name)
        if not fomo_bundles:
            continue
        fomo_models_by_target[target_name] = fomo_bundles

        merged[f"fomo_pred_{target_name}"] = apply_fomo_models(merged, fomo_bundles, FOMO_FEATURE_COLUMNS)
        simulated_df[f"fomo_pred_{target_name}"] = apply_fomo_models(simulated_df, fomo_bundles, FOMO_FEATURE_COLUMNS)

        metrics_rows = []
        for tf, bundle in fomo_bundles.items():
            row = bundle.metrics.copy()
            row["timeframe"] = tf
            metrics_rows.append(row)
        if metrics_rows:
            df_metrics = pd.DataFrame(metrics_rows).sort_values("timeframe")
            print(f"\n[INFO] FOMO training metrics (target {target_name})")
            display(df_metrics[["timeframe", "rmse", "mae", "r2", "similarity_3pct", "similarity_5pct", "n_test"]])

        perf = summarize_fomo_performance(
            merged,
            prediction_col=f"fomo_pred_{target_name}",
            target_col=market_col,
            similarity_threshold=fomo_similarity_threshold,
        )
        if not perf.empty:
            print(f"[INFO] Model ↔ market alignment (target {target_name}, threshold {fomo_similarity_threshold:.2%})")
            display(perf)
            fomo_performance[target_name] = perf

    # Option: use real market odds in the backtest (if available)
    if merged['odds_market_mid'].notna().sum() > 0:
        print("\n[INFO] Summary with real market odds:")
        summary_online_market = summarize_online_by_timeframe(
            merged, odds_column='odds_market_mid', tolerances=[0.05, 0.10, 0.20, 0.30],
            min_seconds_remaining_by_tf={'m15': 0, 'h1': 0, 'd1': 0},
            spread_abs=0.05, fee_abs=0.0, min_z_abs=None, stake_usd=50.0
        )
        display(summary_online_market)
    else:
        print("\n[WARNING] Cannot produce the summary with market odds (no matches).")

    # Use real market odds downstream only if some were found
    if merged['odds_market_mid'].notna().sum() > 0:
        use_market_odds = True
        best_odds_column = 'odds_market_mid'
    else:
        print("\n[INFO] No market matches: falling back to simulated odds (FOMO model if available, else slow_sticky).")
        use_market_odds = False
        # If a FOMO model (mean target) is available, use it as the best approximation
        if 'fomo_pred_mean' in simulated_df.columns:
            best_odds_column = 'fomo_pred_mean'
        else:
            best_odds_column = 'odds_slow_sticky'
else:
    print(f"[INFO] Market file not found: {MARKET_ODDS_PATH}")
    print("[INFO] Using simulated scenarios only.")
    use_market_odds = False
    # Fall back to the default scenario (slow_sticky)
    best_odds_column = 'odds_slow_sticky'
    merged = None
    scores = None


### 5.2 Visualization and accuracy of the FOMO model

Detailed analysis of how well the odds predicted by the supervised FOMO model align with real market odds (per-minute mean and close).


In [None]:
if 'merged' not in globals() or merged is None or merged.empty:
    print("[INFO] No simulated/market merge available. Run section 5.1 before this visualization.")
elif not fomo_models_by_target:
    print("[INFO] No supervised FOMO model was trained. Check that the market columns are available and rerun cell 5.1.")
else:
    summary_tables: list[pd.DataFrame] = []
    global_rows: list[dict] = []

    for target_name, market_col in fomo_target_columns.items():
        pred_col = f"fomo_pred_{target_name}"
        if pred_col not in merged.columns or market_col not in merged.columns:
            continue
        perf = summarize_fomo_performance(
            merged,
            prediction_col=pred_col,
            target_col=market_col,
            similarity_threshold=fomo_similarity_threshold,
        )
        if not perf.empty:
            perf = perf.copy()
            perf["target"] = target_name
            perf["similarity_%"] = (perf["similarity_ratio"] * 100).round(2)
            summary_tables.append(perf)

        subset = merged.dropna(subset=[pred_col, market_col])
        if subset.empty:
            continue
        diff = subset[pred_col] - subset[market_col]
        corr = np.corrcoef(subset[market_col], subset[pred_col])[0, 1] if len(subset) > 1 else np.nan
        global_rows.append(
            {
                "target": target_name,
                "n_obs": len(subset),
                "rmse": float(np.sqrt((diff**2).mean())),
                "mae": float(diff.abs().mean()),
                "bias": float(diff.mean()),
                "similarity_3pct": float((diff.abs() <= 0.03).mean()),
                "similarity_5pct": float((diff.abs() <= 0.05).mean()),
                "corr": float(corr),
            }
        )

    if summary_tables:
        summary_df = pd.concat(summary_tables, ignore_index=True)
        display(summary_df[["target", "timeframe", "n", "rmse", "mae", "bias", "similarity_%"]])
    else:
        print("[INFO] No per-timeframe statistics available for the target columns.")

    if global_rows:
        global_df = pd.DataFrame(global_rows)
        global_df[["similarity_3pct", "similarity_5pct"]] = (global_df[["similarity_3pct", "similarity_5pct"]] * 100).round(2)
        display(global_df)
    else:
        print("[INFO] No global computation possible (insufficient data).")

    mean_col = fomo_target_columns.get("mean")
    mean_pred_col = "fomo_pred_mean"
    if mean_col in merged.columns and mean_pred_col in merged.columns:
        sample_ids = select_fomo_contract_samples(
            merged,
            timeframe="h1",
            prediction_col=mean_pred_col,
            target_col=mean_col,
            n=1,
        )
        if sample_ids:
            sample_id = sample_ids[0]
            sample = merged[(merged["timeframe"] == "h1") & (merged["contract_id"] == sample_id)].sort_index()
            plt.figure(figsize=(12, 5))
            plt.plot(sample.index, sample[mean_col], label="Market (mean)")
            plt.plot(sample.index, sample[mean_pred_col], label="FOMO model (mean)")
            if "odds_balanced" in sample.columns:
                plt.plot(sample.index, sample["odds_balanced"], linestyle="--", alpha=0.6, label="Balanced scenario")
            plt.title(f"Contract {sample_id} (H1): mean odds comparison")
            plt.xlabel("Timestamp UTC")
            plt.ylabel("Probability of an up close")
            plt.legend()
            plt.grid(alpha=0.3)
            plt.tight_layout()
            plt.show()
        else:
            print("[INFO] Could not select an H1 contract for the visualization (insufficient data).")
    else:
        print("[INFO] Required columns missing for the H1 visualization (mean).")

    close_col = fomo_target_columns.get("close")
    close_pred_col = "fomo_pred_close"
    if close_col in merged.columns and close_pred_col in merged.columns:
        sample_ids_close = select_fomo_contract_samples(
            merged,
            timeframe="m15",
            prediction_col=close_pred_col,
            target_col=close_col,
            n=1,
        )
        if sample_ids_close:
            contract_close = sample_ids_close[0]
            sample_close = merged[(merged["timeframe"] == "m15") & (merged["contract_id"] == contract_close)].sort_index()
            plt.figure(figsize=(12, 5))
            plt.plot(sample_close.index, sample_close[close_col], label="Market (close)")
            plt.plot(sample_close.index, sample_close[close_pred_col], label="FOMO model (close)")
            plt.title(f"Contract {contract_close} (M15): closing odds comparison")
            plt.xlabel("Timestamp UTC")
            plt.ylabel("Probability of an up close")
            plt.legend()
            plt.grid(alpha=0.3)
            plt.tight_layout()
            plt.show()
        else:
            print("[INFO] Could not select an M15 contract for the visualization (close).")
    else:
        print("[INFO] Required columns missing for the M15 visualization (close).")


## 6. ONLINE minute-by-minute backtest

Decisions are made live at each minute: we enter at the first minute where |model_prob − odds| ≥ tolerance, with no look-ahead and with spread/fees included. Results are broken down by timeframe (m15/h1/d1).

- Goal: measure the real impact of the tolerance (number of trades, EV/trade) under online conditions.
- Inputs: `simulated_df`, `tolerances`, `spread_abs`, `fee_abs`.
- Outputs: `summary_online` (hit_rate, EV/trade, PnL_total, MDD, consecutive losses, timing).
- Reading: tolerance ↑ ⇒ number of trades ↓, EV/trade ↑; watch MDD and consecutive losses.
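The entry rule above can be sketched for a single contract as follows. This is a simplified illustration, not the notebook's `build_trades_online_stream`: the function names are hypothetical, and charging half the spread on top of the quoted price is an assumption about how `spread_abs` is applied.

```python
def first_entry(probs, odds, tolerance, spread_abs=0.05):
    """Return (minute_index, side, entry_price) for the first qualifying minute, else None."""
    for i, (p, q) in enumerate(zip(probs, odds)):
        if p - q >= tolerance:   # the model sees UP as underpriced
            return i, "up", q + spread_abs / 2
        if q - p >= tolerance:   # the model sees DOWN as underpriced
            return i, "down", (1.0 - q) + spread_abs / 2
    return None

def pnl_per_share(side, entry_price, closed_up, fee_abs=0.0):
    """A share pays 1 if the chosen side wins and 0 otherwise."""
    won = (side == "up") == closed_up
    return (1.0 if won else 0.0) - entry_price - fee_abs

# Toy contract (assumption): the model's edge only reaches the tolerance at minute 2.
entry = first_entry(probs=[0.55, 0.60, 0.70], odds=[0.52, 0.55, 0.55], tolerance=0.10)
win_pnl = pnl_per_share(entry[1], entry[2], closed_up=True)
loss_pnl = pnl_per_share(entry[1], entry[2], closed_up=False)
print(entry, round(win_pnl, 3), round(loss_pnl, 3))
```

Because the loop stops at the first qualifying minute and only reads values up to that minute, the decision never depends on later data.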


In [None]:
tolerances_online = [0.05, 0.10, 0.20, 0.30]
minutes_filters = {"m15": 0, "h1": 0, "d1": 0}

if use_market_odds and merged is not None:
    backtest_source = merged.dropna(subset=[best_odds_column])
    context_label = "market odds"
else:
    backtest_source = simulated_df
    context_label = "simulated odds"

print(f"[INFO] ONLINE backtest on {context_label} with tolerances {tolerances_online}")
summary_online = summarize_online_by_timeframe(
    backtest_source,
    odds_column=best_odds_column,
    tolerances=tolerances_online,
    min_seconds_remaining_by_tf=minutes_filters,
    spread_abs=0.05,
    fee_abs=0.0,
    min_z_abs=None,
    stake_usd=50.0,
)
display(summary_online)


## 7. Visualizations and analysis

We use the results of the minute-by-minute backtest to diagnose opportunities: min/mid/max comparison of market odds, equity tracking, and bias analyses.

- Goal: explore performance graphically (equity, distributions, timing) and identify the potential gains tied to extreme quotes.
- Inputs: `merged_minmax`, `summary_2pct`, `summary_4pct`, `equity_curves_*`.
- Outputs: statistical comparisons, equity plots, bias measures.
- Reading: check robustness (drawdown, consecutive losses) and the impact of using min/max on the edge.
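The two sizing rules compared in this section, and the drawdown used to judge robustness, can be written in a few lines. The helpers below are an illustrative sketch under assumed names, not the internals of `summarize_online_by_timeframe_with_capital`; the drawdown is measured in USD from the running peak.

```python
def stake_risk_pct(capital, pct, price):
    """Risk a fixed fraction of capital in USD; the share count depends on the price."""
    stake = capital * pct
    return stake, stake / price

def stake_shares_pct(initial_capital, pct, price):
    """Buy a fixed number of shares; the USD cost depends on the price."""
    shares = initial_capital * pct
    return shares * price, shares

def max_drawdown(equity):
    """Largest peak-to-trough drop (in USD) along an equity curve."""
    peak, mdd = equity[0], 0.0
    for value in equity:
        peak = max(peak, value)
        mdd = max(mdd, peak - value)
    return mdd

# Odds of 0.2 and a capital of $1000, as in the strategy descriptions.
cost_risk, shares_risk = stake_risk_pct(1000.0, 0.02, 0.2)       # $20 buys 100 shares
cost_shares, shares_fixed = stake_shares_pct(1000.0, 0.04, 0.2)  # 40 shares cost $8
mdd_demo = max_drawdown([1000.0, 1050.0, 990.0, 1020.0, 970.0])
print(cost_risk, shares_risk, cost_shares, shares_fixed, mdd_demo)
```

The first rule keeps the dollar loss per trade constant, while the second keeps the payout per win constant, so the two equity curves diverge most when trades are taken at extreme odds.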


In [None]:
if MARKET_ODDS_PATH.exists():
    market_raw = load_market_csv(str(MARKET_ODDS_PATH))
    market_long_minmax = build_market_long_with_minmax(market_raw)
    merged_minmax = merge_market_with_sim(simulated_df, market_long_minmax)

    # Comparison: mean vs. min/max
    trades_mean = build_trades_online_stream(
        merged_minmax, odds_column='odds_market_mid', min_edge=0.10,
        min_seconds_remaining=0, spread_abs=0.05, fee_abs=0.0
    )
    trades_minmax = build_trades_with_minmax_opportunities(
        merged_minmax, min_edge=0.10,
        min_seconds_remaining=0, spread_abs=0.05, fee_abs=0.0,
        use_min_for_up=True, use_max_for_down=True
    )

    print(f"Trades using the mean: {len(trades_mean)}")
    print(f"Trades using min/max: {len(trades_minmax)}")
    print(f"Additional opportunities: {len(trades_minmax) - len(trades_mean)}")

    # PnL comparison (per-share PnL scaled by the $50 stake used in the other backtests)
    if len(trades_mean) > 0:
        print(f"Average PnL (mean): ${trades_mean['pnl'].mean() * 50:.2f}")
    if len(trades_minmax) > 0:
        print(f"Average PnL (min/max): ${trades_minmax['pnl'].mean() * 50:.2f}")


In [None]:
# ONLINE summary by timeframe with dynamic capital management
odds_col = "odds_slow_sticky"   # or "odds_balanced" / "odds_fast_revert"
tolerances = [0.05, 0.10, 0.20, 0.30]
initial_capital = 1000.0

# Strategy 1: 2% risk per bet (fixed USD amount)
print("=" * 80)
print("STRATEGY 1: 2% risk per bet (fixed USD amount)")
print("=" * 80)
print("Example: odds 0.2, capital $1000 → stake = $20, shares = 20/0.2 = 100 shares")
summary_2pct, equity_curves_2pct = summarize_online_by_timeframe_with_capital(
    simulated_df,
    odds_column=odds_col,
    tolerances=tolerances,
    initial_capital=initial_capital,
    strategy_type="risk_pct",  # Risk X% in USD
    pct_value=0.02,  # 2% risk
    min_seconds_remaining_by_tf={"m15": 0, "h1": 0, "d1": 0},
    spread_abs=0.05,
    fee_abs=0.0,
    min_z_abs=None,
)
display(summary_2pct)

# Strategy 2: 4% of initial capital in shares (fixed share count, variable cost)
print("\n" + "=" * 80)
print("STRATEGY 2: 4% of initial capital in shares (fixed count, cost varies with the odds)")
print("=" * 80)
print("Example: odds 0.2, capital $1000 → shares = 40 (fixed), cost = 40 * 0.2 = $8")
summary_4pct, equity_curves_4pct = summarize_online_by_timeframe_with_capital(
    simulated_df,
    odds_column=odds_col,
    tolerances=tolerances,
    initial_capital=initial_capital,
    strategy_type="shares_pct",  # Buy X% of capital in shares
    pct_value=0.04,  # 4% of capital in shares
    min_seconds_remaining_by_tf={"m15": 0, "h1": 0, "d1": 0},
    spread_abs=0.05,
    fee_abs=0.0,
    min_z_abs=None,
)
display(summary_4pct)

# UP/DOWN split by timeframe (aggregated over all tolerances), 2% strategy
print("\n" + "=" * 80)
print("UP/DOWN SPLIT (2% strategy)")
print("=" * 80)
mix_2pct = (summary_2pct
       .groupby('timeframe')[['num_up','num_down','num_trades']]
       .sum()
       .assign(up_ratio=lambda x: (x['num_up'] / x['num_trades']).fillna(0.0)))
display(mix_2pct)

# Equity curves by risk-management strategy
print("\n" + "=" * 80)
print("EQUITY CURVES BY RISK-MANAGEMENT STRATEGY")
print("=" * 80)
sel_tol = 0.10  # Tolerance selected for the visualization

fig, axes = plt.subplots(1, 3, figsize=(18, 5))
for idx, tf in enumerate(["m15", "h1", "d1"]):
    ax = axes[idx]
    
    # 2% curve
    key_2pct = f"{tf}_{sel_tol}"
    if key_2pct in equity_curves_2pct and len(equity_curves_2pct[key_2pct]) > 0:
        curve_2pct = equity_curves_2pct[key_2pct]
        ax.plot(curve_2pct.index, curve_2pct.values, label="2% risk (fixed USD)", linewidth=2, color="#1f77b4")

    # 4% curve
    key_4pct = f"{tf}_{sel_tol}"
    if key_4pct in equity_curves_4pct and len(equity_curves_4pct[key_4pct]) > 0:
        curve_4pct = equity_curves_4pct[key_4pct]
        ax.plot(curve_4pct.index, curve_4pct.values, label="4% shares (fixed count, variable cost)", linewidth=2, color="#ff7f0e")

    # Reference line (initial capital)
    ax.axhline(initial_capital, color="gray", linestyle="--", alpha=0.5, label="Initial capital")

    ax.set_title(f"Equity {tf.upper()} (tol={sel_tol})")
    ax.set_xlabel("Trades (#)")
    ax.set_ylabel("Capital (USD)")
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Use summary_2pct for the analyses that follow
summary_online = summary_2pct
mix = mix_2pct


In [None]:
# Trend-bias check: compare trades with realized outcomes
print("\n" + "=" * 80)
print("OVERFITTING / TREND-BIAS CHECK")
print("=" * 80)
print("\nComparison between:")
print("  - Share of 'up' TRADES (strategy)")
print("  - REALIZED share of candles closing 'up' (dataset)")

# Realized target distribution by timeframe
target_real_dist = (
    snapshots_df.groupby("timeframe")[TARGET_COLUMN]
    .agg(["mean", "count"])
    .rename(columns={"mean": "target_real_prop_up", "count": "total_bougies"})
)

# Comparison
bias_check = mix[["up_ratio"]].copy()
bias_check["target_real_prop_up"] = target_real_dist["target_real_prop_up"]
bias_check["diff"] = bias_check["up_ratio"] - bias_check["target_real_prop_up"]
bias_check["diff_pct"] = (bias_check["diff"] / bias_check["target_real_prop_up"] * 100).round(2)
bias_check["bias_severity"] = bias_check["diff"].abs().apply(
    lambda x: "✅ Negligible" if x < 0.05 else "⚠️ Moderate" if x < 0.10 else "❌ STRONG"
)

print("\nR√©sultats par timeframe:")
display(bias_check.round(4))

print("\nInterpr√©tation:")
for tf in bias_check.index:
    trades_up_pct = bias_check.loc[tf, "up_ratio"] * 100
    real_up_pct = bias_check.loc[tf, "target_real_prop_up"] * 100
    diff = bias_check.loc[tf, "diff"] * 100
    severity = bias_check.loc[tf, "bias_severity"]
    
    print(f"\n{tf.upper()}:")
    print(f"  - 'Up' trades: {trades_up_pct:.1f}%")
    print(f"  - Actual 'up': {real_up_pct:.1f}%")
    print(f"  - Gap: {diff:+.1f} pts")
    print(f"  - {severity}")
    
    if abs(diff) > 10:
        print(f"  ⚠️ WARNING: the strategy trades 'up' {'much more' if diff > 0 else 'much less'} often than candles actually close up.")
        print("     This may indicate overfitting to the dataset's bullish trend.")
    elif abs(diff) > 5:
        print("  ⚠️ Worth watching: slight imbalance detected.")
    else:
        print("  ✅ Balanced distribution, no bias detected.")

print("\n" + "=" * 80)


In [None]:
# ONLINE visualizations
# 1) Equity per timeframe for a single tolerance
sel_tol = 0.10
plt.figure(figsize=(12,6))
for tf in ["m15","h1","d1"]:
    tf_frame = simulated_df[simulated_df["timeframe"] == tf]
    tr = build_trades_online_stream(
        tf_frame, odds_column=odds_col, min_edge=sel_tol,
        min_seconds_remaining=0, spread_abs=0.05, fee_abs=0.0,
        min_z_abs=None, allow_multiple=False
    )
    curve = equity_curve(tr, stake_usd=50.0)
    if len(curve) > 0:
        plt.plot(curve.index, curve.values, label=f"{tf}, tol={sel_tol}")
plt.title(f"ONLINE equity (stake $50, {odds_col})")
plt.xlabel("Trades (#)")
plt.ylabel("Cumulative PnL (USD)")
plt.legend(); plt.tight_layout(); plt.show()

# 2) PnL histogram per timeframe (m15 as an example)
tf = "m15"
tr_m15 = build_trades_online_stream(
    simulated_df[simulated_df["timeframe"]==tf], odds_column=odds_col, min_edge=sel_tol,
    min_seconds_remaining=0, spread_abs=0.05, fee_abs=0.0,
    min_z_abs=None, allow_multiple=False
)
if not tr_m15.empty:
    plt.figure(figsize=(8,4))
    sns.histplot(tr_m15["pnl"]*50.0, bins=30, color="#ff7f0e")
    plt.axvline((tr_m15["pnl"]*50.0).mean(), color="black", ls="--", label="Mean PnL")
    plt.title(f"PnL distribution in USD (ONLINE), {tf}, tol={sel_tol}")
    plt.xlabel("PnL per trade (USD)"); plt.ylabel("Number of trades")
    plt.legend(); plt.tight_layout(); plt.show()

# 3) Entry timing (minutes before the close)
if not tr_m15.empty:
    plt.figure(figsize=(8,4))
    mins = tr_m15["seconds_remaining"]/60.0
    sns.histplot(mins, bins=30, color="#1f77b4")
    plt.title(f"Entry timing, {tf}, tol={sel_tol}")
    plt.xlabel("Minutes before the close"); plt.ylabel("Number of entries")
    plt.tight_layout(); plt.show()


## 8. ONLINE backtest with real market odds

We replay the strategy minute by minute, this time using the Polymarket odds (mid prices) directly.
For each timeframe, an entry is simulated whenever the gap between the model probability and the market odds exceeds the tolerance.

- Goal: evaluate realized performance once real market quotes are plugged in.
- Inputs: `merged` (model probabilities + market odds), tolerances.
- Outputs: summary tables (2% risk / 4% shares) and comparative equity curves.
- How to read: check that entries are consistent (prob_trade), that results are robust (drawdown), and how much sizing matters.
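
The entry rule described above can be sketched as follows. This is a minimal illustration, not the notebook's `build_trades_online_stream`: the helper names `pick_entry` and `pnl_per_share` are hypothetical, and it assumes that half of `spread_abs` is paid on top of the mid price and that a winning share settles at 1 USD.

```python
from typing import Optional


def pick_entry(prob_up: float, odds_mid: float, min_edge: float,
               spread_abs: float = 0.05, fee_abs: float = 0.0) -> Optional[dict]:
    """Return a trade description when the model-vs-market gap reaches min_edge.

    Assumption: the 'up' share costs mid + spread/2 + fee and the 'down'
    share costs (1 - mid) + spread/2 + fee. The real helper may differ.
    """
    half_spread = spread_abs / 2.0
    price_up = odds_mid + half_spread + fee_abs
    price_down = (1.0 - odds_mid) + half_spread + fee_abs

    # Edge = model probability of winning minus the price paid for the share.
    candidates = [
        ("up", prob_up - price_up, price_up),
        ("down", (1.0 - prob_up) - price_down, price_down),
    ]
    direction, edge, price = max(candidates, key=lambda c: c[1])
    if edge < min_edge:
        return None  # gap too small: no entry at this minute
    return {"direction": direction, "edge": edge, "price": price}


def pnl_per_share(direction: str, price: float, closed_up: bool) -> float:
    """PnL of one share held to settlement (pays 1 if right, 0 otherwise)."""
    won = closed_up if direction == "up" else not closed_up
    return (1.0 - price) if won else -price


trade = pick_entry(prob_up=0.80, odds_mid=0.60, min_edge=0.10)
if trade is not None:
    print(trade["direction"], round(trade["edge"], 3))
```

With a tolerance of 0.10, a model probability of 0.62 against a mid of 0.60 produces no entry under these assumptions, because the half-spread alone absorbs the gap.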


In [None]:
# 8. ONLINE backtest with real market odds
initial_capital_live = 1000.0
tolerances_live = [0.05, 0.10, 0.20, 0.30]

if 'pred_df' not in globals():
    print('[WARN] pred_df is missing. Run section 3 to generate the probabilities.')
elif not MARKET_ODDS_PATH.exists():
    print(f'[WARN] Market file not found: {MARKET_ODDS_PATH}')
else:
    if 'market_raw' not in globals():
        market_raw = load_market_csv(str(MARKET_ODDS_PATH))
    live_stream = prepare_live_market_stream(market_raw, pred_df)
    if live_stream.empty:
        print('[WARN] Could not build the live stream (no overlapping minute). Check the date ranges of the model data vs. the market data.')
    else:
        nb_rows = len(live_stream)
        nb_contracts = live_stream['contract_id'].nunique()
        print(f'[INFO] Market stream available: {nb_rows} points (seconds), {nb_contracts} contracts.')
        print(f'[INFO] Window: {live_stream.index.min()} → {live_stream.index.max()}')

        # Strategy 1: 2% risk per trade (fixed USD stake)
        live_summary_2pct, live_equity_2pct = summarize_online_by_timeframe_with_capital(
            live_stream,
            odds_column='odds_market_mid',
            tolerances=tolerances_live,
            initial_capital=initial_capital_live,
            strategy_type='risk_pct',
            pct_value=0.02,
            min_seconds_remaining_by_tf={'m15': 0, 'h1': 0, 'd1': 0},
            spread_abs=0.05,
            fee_abs=0.0,
            min_z_abs=None,
        )
        print('[INFO] Results (2% risk per trade)')
        display(live_summary_2pct)

        # Strategy 2: 4% of initial capital in shares (fixed share count)
        live_summary_4pct, live_equity_4pct = summarize_online_by_timeframe_with_capital(
            live_stream,
            odds_column='odds_market_mid',
            tolerances=tolerances_live,
            initial_capital=initial_capital_live,
            strategy_type='shares_pct',
            pct_value=0.04,
            min_seconds_remaining_by_tf={'m15': 0, 'h1': 0, 'd1': 0},
            spread_abs=0.05,
            fee_abs=0.0,
            min_z_abs=None,
        )
        print('[INFO] Results (4% of capital in shares)')
        display(live_summary_4pct)

        # UP/DOWN split (2% strategy)
        mix_live = (
            live_summary_2pct
            .groupby('timeframe')[['num_up', 'num_down', 'num_trades']]
            .sum()
            .assign(up_ratio=lambda x: (x['num_up'] / x['num_trades']).fillna(0.0))
        )
        print('[INFO] UP/DOWN split, 2% risk strategy')
        display(mix_live)

        # Equity curves for a given tolerance
        sel_tol_live = 0.10
        plt.figure(figsize=(12, 6))
        for tf in ['m15', 'h1', 'd1']:
            key = f'{tf}_{sel_tol_live}'
            if key in live_equity_2pct and len(live_equity_2pct[key]) > 0:
                plt.plot(live_equity_2pct[key].values, label=f'{tf}, 2% risk')
            if key in live_equity_4pct and len(live_equity_4pct[key]) > 0:
                plt.plot(live_equity_4pct[key].values, linestyle='--', label=f'{tf}, 4% shares')
        plt.axhline(initial_capital_live, color='gray', linestyle='--', alpha=0.4, label='Initial capital')
        plt.title(f'Equity with real odds (tolerance = {sel_tol_live})')
        plt.xlabel('Trades (#)')
        plt.ylabel('Capital (USD)')
        plt.legend()
        plt.grid(alpha=0.3)
        plt.tight_layout()
        plt.show()

        # Archive the summaries for later use
        pathlib.Path('data').mkdir(parents=True, exist_ok=True)  # to_csv does not create missing directories
        live_summary_2pct.to_csv('data/live_backtest_summary_2pct.csv', index=False)
        live_summary_4pct.to_csv('data/live_backtest_summary_4pct.csv', index=False)
        print('[INFO] Summaries saved to data/live_backtest_summary_*.csv')


### 8.5 Trade visualization (4% of capital in shares strategy)

This section summarizes the trades from the real-odds backtest (section 8), focusing on fixed-size position management: 4% of the initial capital, expressed as a number of shares.
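
To make the difference between the two sizing rules concrete, here is a small sketch. The function names are hypothetical, and it assumes the 2% rule converts a USD stake into shares at the entry price, while the 4% rule holds a constant share count whatever the price. The last helper reproduces the per-trade Sharpe convention (mean over standard deviation, scaled by the square root of the number of trades) used in the statistics of this section.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def shares_fixed(initial_capital: float, pct: float = 0.04) -> float:
    """Fixed share count: the USD cost then varies with the entry price."""
    return initial_capital * pct


def shares_risk(capital: float, price: float, pct: float = 0.02) -> float:
    """Fixed USD stake (assumption): the share count varies with the entry price."""
    return (capital * pct) / price


def sharpe_sqrt_n(pnls: Sequence[float]) -> float:
    """Per-trade Sharpe scaled by sqrt(N); NaN when it is undefined."""
    if len(pnls) < 2:
        return math.nan
    sd = stdev(pnls)  # sample standard deviation (ddof=1)
    if sd <= 1e-9:
        return math.nan
    return mean(pnls) / sd * math.sqrt(len(pnls))


for price in (0.25, 0.50, 0.75):
    fixed = shares_fixed(1000.0)
    risk = shares_risk(1000.0, price)
    print(f"price={price:.2f}  4% shares: {fixed:.0f} sh, cost {fixed * price:.1f} USD"
          f"  |  2% risk: {risk:.1f} sh, cost {risk * price:.1f} USD")
```

Under these assumptions the fixed-share rule commits more dollars on expensive contracts, whereas the fixed-stake rule buys more shares on cheap ones, which is why the two equity curves can diverge even on identical entries.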


In [None]:
if "live_stream" not in globals() or live_stream.empty:
    print("[INFO] No market stream ready. Run section 8 before this visualization.")
elif "live_summary_4pct" not in globals():
    print("[INFO] 4% shares results unavailable. Rerun the main cell of section 8.")
else:
    share_count = initial_capital_live * 0.04
    stats_records: list[dict] = []
    trades_frames: list[pd.DataFrame] = []

    for tf in ["m15", "h1", "d1"]:
        tf_stream = live_stream[live_stream["timeframe"] == tf]
        if tf_stream.empty:
            continue
        for tol in tolerances_live:
            trades = build_trades_online_stream(
                tf_stream,
                odds_column="odds_market_mid",
                prob_column="prob_up",
                target_column=TARGET_COLUMN,
                min_edge=tol,
                min_seconds_remaining=0,
                spread_abs=0.05,
                fee_abs=0.0,
                min_z_abs=None,
                allow_multiple=False,
            )
            if trades.empty:
                continue

            trades = trades.copy()
            trades["timeframe"] = tf
            trades["tolerance"] = tol
            trades["pnl_usd"] = trades["pnl"] * share_count
            trades["shares"] = share_count
            trades_frames.append(trades)

            pnl_usd = trades["pnl_usd"]
            pnl_mean = float(pnl_usd.mean())
            pnl_std = float(pnl_usd.std(ddof=1))
            sharpe = np.nan
            if pnl_std > 1e-9:
                sharpe = (pnl_mean / pnl_std) * np.sqrt(len(pnl_usd))

            stats_records.append(
                {
                    "timeframe": tf,
                    "tolerance": tol,
                    "num_trades": len(trades),
                    "winrate": float((pnl_usd > 0).mean()),
                    "num_up": int((trades["direction"] == "up").sum()),
                    "num_down": int((trades["direction"] == "down").sum()),
                    "avg_pnl_usd": pnl_mean,
                    "pnl_total_usd": float(pnl_usd.sum()),
                    "sharpe_trade_sqrtN": sharpe,
                    "median_edge": float(trades["edge"].median()),
                    "median_seconds_remaining": float(trades["seconds_remaining"].median()),
                }
            )

    if not stats_records:
        print("[INFO] No trades were generated for the 4% shares parameters.")
    else:
        stats_df = pd.DataFrame(stats_records).sort_values(["tolerance", "timeframe"])
        display(stats_df)

        all_trades_df = pd.concat(trades_frames, ignore_index=True).sort_values("timestamp")
        pnl_usd = all_trades_df["pnl_usd"]
        global_stats = {
            "total_trades": len(all_trades_df),
            "winrate": float((pnl_usd > 0).mean()),
            "num_up": int((all_trades_df["direction"] == "up").sum()),
            "num_down": int((all_trades_df["direction"] == "down").sum()),
            "avg_pnl_usd": float(pnl_usd.mean()),
            "pnl_total_usd": float(pnl_usd.sum()),
            "median_seconds_remaining": float(all_trades_df["seconds_remaining"].median()),
            "median_edge": float(all_trades_df["edge"].median()),
        }
        pnl_std = float(pnl_usd.std(ddof=1))
        if pnl_std > 1e-9:
            global_stats["sharpe_trade_sqrtN"] = (global_stats["avg_pnl_usd"] / pnl_std) * np.sqrt(len(pnl_usd))
        else:
            global_stats["sharpe_trade_sqrtN"] = np.nan

        print("\n[INFO] Aggregate statistics (all tolerances / timeframes)")
        display(pd.DataFrame([global_stats]))

        fig, axes = plt.subplots(1, 2, figsize=(14, 5))
        sns.barplot(data=stats_df, x="tolerance", y="pnl_total_usd", hue="timeframe", ax=axes[0])
        axes[0].set_title("Total PnL by timeframe and tolerance")
        axes[0].set_xlabel("Tolerance")
        axes[0].set_ylabel("Total PnL (USD)")
        axes[0].legend(title="Timeframe")

        sns.barplot(data=stats_df, x="tolerance", y="winrate", hue="timeframe", ax=axes[1])
        axes[1].set_title("Win rate by timeframe and tolerance")
        axes[1].set_xlabel("Tolerance")
        axes[1].set_ylabel("Win rate")
        axes[1].legend(title="Timeframe")

        plt.tight_layout()
        plt.show()



## 9. Model interpretation

We inspect the dominant features of the intra-candle and pre-open models to understand which signals each model relies on most.

- Goal: understand which signals carry the edge.
- Inputs: `bundles`, `preopen_bundles`.
- Outputs: importance tables (permutation importance if needed).
- How to read: validate the contribution of `z_dist_atr15`, EMA ratios, time remaining, streaks.
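
When a model exposes no native importances, permutation importance is the usual fallback: shuffle one feature column and measure how much the score drops. The sketch below is a generic, standard-library illustration with a toy predictor; it is not the notebook's `top_feature_importances`, and all names in it are hypothetical. With a fitted scikit-learn estimator, `sklearn.inspection.permutation_importance` would normally be used instead.

```python
import random
from typing import Callable, List, Sequence


def accuracy(predict: Callable[[Sequence[float]], int],
             rows: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Share of rows whose predicted label matches the true label."""
    hits = sum(1 for row, y in zip(rows, labels) if predict(row) == y)
    return hits / len(labels)


def permutation_importance(predict: Callable[[Sequence[float]], int],
                           rows: Sequence[Sequence[float]], labels: Sequence[int],
                           n_repeats: int = 5, seed: int = 0) -> List[float]:
    """Mean accuracy drop per feature after shuffling that feature's column."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    drops: List[float] = []
    for j in range(len(rows[0])):
        total = 0.0
        for _ in range(n_repeats):
            column = [row[j] for row in rows]
            rng.shuffle(column)  # breaks the link between feature j and the label
            shuffled = [list(row[:j]) + [v] + list(row[j + 1:])
                        for row, v in zip(rows, column)]
            total += baseline - accuracy(predict, shuffled, labels)
        drops.append(total / n_repeats)
    return drops


# Toy data: the label depends only on the sign of feature 0; feature 1 is noise.
rows = [[(-1) ** i * (i + 1), i % 3] for i in range(20)]
labels = [int(row[0] > 0) for row in rows]
toy_predict = lambda row: int(row[0] > 0)
print(permutation_importance(toy_predict, rows, labels))
```

A feature the predictor ignores gets an importance of exactly zero, which makes this a convenient sanity check for signals such as `z_dist_atr15` or the time remaining.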


In [None]:
# Use top_feature_importances, defined in section 0, to analyze the models.

for name, bundle in bundles.items():
    print(f"Feature importances {name}")
    display(top_feature_importances(bundle))


In [None]:
for name, bundle in preopen_bundles.items():
    print(f"Pre-open feature importances {name}")
    display(top_feature_importances(bundle))


## 10. Summary and next steps

- The intra-candle models produce calibrated probabilities with AUC > 0.6 on the test set, and exceed a 70% hit rate when the model probability crosses 0.65 late in an m15/h1 candle. Signals from the very first minute retain an edge, with a hit rate above 60% on m15 and h1.
- The pre-open model mostly exploits returns and trend structure (EMA ratios, RSI, volatility). Several windows reach a 60–62% success rate on m15 when the probability exceeds 0.6 before the open.
- The "FOMO" odds simulation offers three configurable scenarios (fast, balanced, and sticky correction); the simple value strategy keeps a positive average PnL (> 5% EV per trade) in the slow-correction configurations.
- Everything is in place to plug in a real Polymarket feed: it is enough to populate `odds_xxx` with live odds, tune the FOMO index, and monitor calibration in real time.
- Next actions: (1) deploy the real-time scraper, (2) compare observed odds with the simulated scenarios to estimate the FOMO index dynamically, (3) refine risk management (adaptive stake sizing, liquidity limits), and (4) automate continuous evaluation with rolling backtests.
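
As a starting point for the calibration monitoring mentioned above, the following sketch computes two simple diagnostics on a stream of (probability, outcome) pairs: the hit rate restricted to confident signals and the Brier score. The function names and the sample values are illustrative assumptions, not results from this notebook.

```python
from typing import Optional, Sequence


def hit_rate_above(probs: Sequence[float], outcomes: Sequence[int],
                   threshold: float = 0.65) -> Optional[float]:
    """Hit rate on signals where the model is confident in either direction.

    A signal counts when prob_up >= threshold (bet 'up') or
    prob_up <= 1 - threshold (bet 'down'). Returns None if nothing qualifies.
    """
    picks = [(p, y) for p, y in zip(probs, outcomes)
             if p >= threshold or p <= 1.0 - threshold]
    if not picks:
        return None
    hits = sum(1 for p, y in picks if (p >= 0.5) == bool(y))
    return hits / len(picks)


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between probabilities and outcomes (lower is better)."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


# Illustrative values only.
probs = [0.70, 0.30, 0.55, 0.90]
outcomes = [1, 0, 0, 0]
print(hit_rate_above(probs, outcomes), brier_score(probs, outcomes))
```

Recomputing both numbers over a sliding window of recent candles would give an early warning when live probabilities drift away from the backtested calibration.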
