# BTC Hourly Direction Study – Two Labeling Schemes
**Dataset**: `gemini_btc_data_final_version.csv`  
**File path**: `C:\Users\ADMIN\Desktop\Coding_projects\stock_market_prediction\Stock-Market-Prediction\data\processed\gemini_btc_data_final_version.csv`

We run **two separate experiments**, each on its own filtered copy of the data, so a failure in one does not corrupt the other:

| Run | Label | Why we use it |
|-----|-------|---------------|
| **A** | `direction_quant70` – top/bottom 30 % 1-h returns (drop middle 40 %) | Removes noisy bars; focuses on meaningful moves. |
| **B** | `direction_raw` – every bar (up/down), plus a **high-confidence filter** (predicted P(up) ≥ 0.7) | Uses full data; we act only when the model is very sure. |

For each run we compare **Momentum-only features** vs. **All engineered features** and report Accuracy, Precision, Recall, F1, and ROC-AUC, averaged over time-ordered folds.  
For Run A, permutation importances computed on the held-out folds show which columns contribute to out-of-sample accuracy.


### Feature glossary

| Feature                | Formula / window                 | Intuition |
|------------------------|-----------------------------------|-----------|
| **vol_6h**             | Std. dev. of 1-h returns, 6-hour window | Captures short-term volatility spikes that often precede breakouts. |
| **vol_24h**            | Std. dev. of 1-h returns, 24 h window  | Detects high-vol vs. calm regimes over one day. |
| **atr_14h**            | Mean of `(high-low)` over last 14 bars | Simplified Average True Range (ignores gaps from the previous close); bigger bars → more movement potential. |
| **pos_24h**            | `(close − 24h low) / (24h high − 24h low)` | Where price sits inside its 24-hour range (0 = bottom, 1 = top). |
| **boll_b**             | Bollinger %B on a 24-h SMA ±2 σ        | >1 means over-bought; <0 means over-sold relative to band. |
| **vol_mean_24h**       | 24-h moving average of **Volume BTC**  | Baseline liquidity level. |
| **vol_ratio**          | `Volume / vol_mean_24h`                | >1 = current bar has abnormally high volume (demand shock). |
| **vol_pct_change**     | Fractional change in volume vs. 6 bars ago | Sudden volume jumps ahead of price moves. |
| **obv**                | Cumulative Σ (sign(return) × Volume)   | On-Balance Volume; tracks whether volume flows with price. |
| **body**               | `close − open`                         | Size / direction of candle body (positive = green bar). |
| **upper_shadow**       | `high − max(open, close)`              | Buying rejection wick; long tails can signal reversals. |
| **lower_shadow**       | `min(open, close) − low`               | Selling rejection wick. |
| **hour_sin / hour_cos**| Sin/Cos of UTC hour (0-23)             | Encode intraday cycle (Asia, EU, US sessions). |
| **dow**                | Day-of-week (0 = Mon … 6 = Sun)        | Weekend vs. weekday behaviour. |

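The `hour_sin` / `hour_cos` pair deserves a quick illustration: placing the hour on a circle keeps 23:00 and 00:00 adjacent, which a raw 0-23 integer would not. A minimal sketch on hand-picked hours; the helper names `encode_hour` and `circle_distance` are illustrative and not part of the notebook.

```python
import numpy as np

def encode_hour(hour):
    """Map an hour in 0-23 to a point (sin, cos) on the unit circle."""
    angle = 2 * np.pi * np.asarray(hour) / 24
    return np.stack([np.sin(angle), np.cos(angle)], axis=-1)

def circle_distance(h1, h2):
    """Euclidean distance between two encoded hours."""
    return float(np.linalg.norm(encode_hour(h1) - encode_hour(h2)))

# 23:00 -> 00:00 is a one-hour step, just like 11:00 -> 12:00.
d_wrap = circle_distance(23, 0)
d_adjacent = circle_distance(11, 12)
# Midnight and noon sit on opposite sides of the circle.
d_opposite = circle_distance(0, 12)

print(round(d_wrap, 4), round(d_adjacent, 4), round(d_opposite, 4))
```

Both one-hour steps come out at the same distance, while midnight and noon are as far apart as the encoding allows.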

In [15]:
# ─────────────────────────────────────────────────────────────
# Cell 1 ▸ Imports & load
# ─────────────────────────────────────────────────────────────
import pandas as pd, numpy as np
from pathlib  import Path
from sklearn.ensemble   import RandomForestClassifier
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)
from sklearn.inspection import permutation_importance

# --- edit your file path here ---
CSV = Path(r"C:\Users\ADMIN\Desktop\Coding_projects\stock_market_prediction\Stock-Market-Prediction\data\processed\gemini_btc_data_final_version.csv")

df = (pd.read_csv(CSV, parse_dates=["date"])
        .sort_values("date")
        .reset_index(drop=True))

print(f"Rows loaded : {len(df):,}")
print("Date span   :", df['date'].min(), "→", df['date'].max())


Rows loaded : 82,519
Date span   : 2015-10-08 14:00:00 → 2025-03-27 23:00:00


### Basic cleaning
1. **Drop** Gemini’s ultra-sparse infancy period (before 2016-04-01).  
2. **Forward-fill core OHLCV values across at most 3 consecutive missing hours** so short feed hiccups don’t create NaNs.

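One detail of the forward-fill step is easy to miss: `limit=3` caps the number of consecutive fills, so a longer gap is partly filled rather than skipped. A small sketch on a made-up series (the values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy hourly series with a 2-bar gap followed by a 5-bar gap.
s = pd.Series([100.0, np.nan, np.nan, 103.0,
               np.nan, np.nan, np.nan, np.nan, np.nan, 109.0])

filled = s.ffill(limit=3)

# The 2-bar gap is closed entirely; the 5-bar gap keeps its last two NaNs
# because only the first three missing bars after 103.0 are filled.
print(filled.tolist())
print("remaining NaNs:", int(filled.isna().sum()))
```

Rows that are still NaN after this step are removed later, when the feature mask drops any row with a missing feature.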

In [16]:
# ─────────────────────────────────────────────────────────────
# Cell 2 ▸ Basic cleaning
# ─────────────────────────────────────────────────────────────
df = df[df['date'] >= '2016-04-01'].copy()

core = ['open', 'high', 'low', 'close', 'Volume BTC']
df[core] = df[core].ffill(limit=3)   # fillna(method='ffill') is deprecated in recent pandas

print("Rows after cut & ffill:", len(df))


Rows after cut & ffill: 78576




## Feature engineering (safe – guards against ÷0 / ±inf)

We create six families:

1. **Momentum** (`roc_*`)  
2. **Volatility** (`vol_*`, `atr_*`)  
3. **Range/position** (`pos_24h`, `boll_b`)  
4. **Volume/liquidity** (`vol_ratio`, `vol_pct_change`, `obv`)  
5. **Candlestick anatomy** (`body`, `upper_shadow`, `lower_shadow`)  
6. **Time seasonality** (`hour_sin`, `hour_cos`, `dow`)


In [27]:
# ─────────────────────────────────────────────────────────────
# Cell 3 ▸ Feature engineering (safe)
# ─────────────────────────────────────────────────────────────
EPS = 1e-9

# ── 1) Momentum
df['roc_4h']  = df['close'].pct_change(4)
df['roc_24h'] = df['close'].pct_change(24)
df['roc_7d']  = df['close'].pct_change(24*7)
df['roc_30d'] = df['close'].pct_change(24*30)

# ── 2) Volatility
ret = df['close'].pct_change()
df['vol_6h']  = ret.rolling(6).std()
df['vol_24h'] = ret.rolling(24).std()
df['atr_14h'] = (df['high'] - df['low']).rolling(14).mean()

# ── 3) Range / position
hi24 = df['high'].rolling(24).max()
lo24 = df['low'] .rolling(24).min()
rng   = (hi24 - lo24).replace(0, EPS)
df['pos_24h'] = (df['close'] - lo24) / rng
mid24 = df['close'].rolling(24).mean()
std24 = df['close'].rolling(24).std().replace(0, EPS)
df['boll_b']  = (df['close'] - (mid24 - 2*std24)) / (4*std24)

# ── 4) Volume / liquidity
df['vol_mean_24h']   = df['Volume BTC'].rolling(24).mean().replace(0, EPS)
df['vol_ratio']      = df['Volume BTC'] / df['vol_mean_24h']
df['vol_pct_change'] = df['Volume BTC'].pct_change(6)
sign = np.sign(df['close'].diff()).fillna(0)
df['obv'] = (sign * df['Volume BTC']).cumsum()

# ── 5) Candle shape
df['body']          = df['close'] - df['open']
df['upper_shadow']  = df['high'] - df[['close','open']].max(axis=1)
df['lower_shadow']  = df[['close','open']].min(axis=1) - df['low']

# ── 6) Time seasonality
df['hour'] = df['date'].dt.hour
df['hour_sin'] = np.sin(2*np.pi*df['hour']/24)
df['hour_cos'] = np.cos(2*np.pi*df['hour']/24)
df['dow']      = df['date'].dt.dayofweek


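The `boll_b` line in Cell 3 folds the usual Bollinger %B definition into one expression. Written out, with $C_t$ the close, $m_t$ the 24-bar rolling mean (`mid24`) and $s_t$ the 24-bar rolling standard deviation (`std24`); the symbols are chosen here for illustration:

$$
U_t = m_t + 2 s_t, \qquad L_t = m_t - 2 s_t, \qquad
\%B_t = \frac{C_t - L_t}{U_t - L_t} = \frac{C_t - (m_t - 2 s_t)}{4 s_t}
$$

So `boll_b` is 0 at the lower band, 0.5 when the close equals the rolling mean, and 1 at the upper band.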
### Feature family lists


In [28]:
# ─────────────────────────────────────────────────────────────
# Cell 4 ▸ Feature family lists
# ─────────────────────────────────────────────────────────────
F_MOM  = ['roc_4h','roc_24h','roc_7d','roc_30d']
F_VOL  = ['vol_6h','vol_24h','atr_14h']
F_RNG  = ['pos_24h','boll_b']
F_VLM  = ['vol_ratio','vol_pct_change','obv']
F_CDL  = ['body','upper_shadow','lower_shadow']
F_TIME = ['hour_sin','hour_cos','dow']
ALL_FEATS = F_MOM + F_VOL + F_RNG + F_VLM + F_CDL + F_TIME


In [29]:
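# Lag every feature by one bar so row t only sees information through bar t-1.
# Run this cell once: re-running it shifts the features a further bar.
df[ALL_FEATS] = df[ALL_FEATS].shift(1)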

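To make the one-bar lag concrete, here is a toy sketch of how a lagged feature lines up with a same-bar label. The frame and column names (`toy`, `feat`, `feat_lag1`, `up`) are hypothetical. Writing the lag into a new column, rather than shifting in place, is one way to keep the step repeatable.

```python
import pandas as pd

# Toy frame: `feat` is computed from the same bar's close.
toy = pd.DataFrame({"close": [100.0, 101.0, 99.0, 102.0, 103.0]})
toy["feat"] = toy["close"].diff()

# Lag from the untouched source column into a new column, so running
# this line twice gives the same result.
toy["feat_lag1"] = toy["feat"].shift(1)

# Label for row t: did bar t close above bar t-1?
toy["up"] = (toy["close"].diff() > 0).astype(int)

print(toy)
```

At row 3 the model would see `feat_lag1 = -2.0` (the move into bar 2) and be asked about the move into bar 3, so nothing from the bar being predicted leaks into the features.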
# RUN A – Quantile 70/30 label (decisive moves) 


In [30]:
# Cell 5A ▸ Create quantile-based label & mask rows
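# --- final safety sweep: convert ±inf → NaN so the masks below drop those rows.
#     Run B (Cell 5B) relies on this sweep too, so run this cell first.
df.replace([np.inf, -np.inf], np.nan, inplace=True)

# Return of bar t (close[t-1] → close[t]); features were lagged one bar above,
# so each row pairs information through t-1 with bar t's move.
# Note: the quantile thresholds below are computed over the full sample.
ret1h = df['close'].pct_change()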
up_q, dn_q = ret1h.quantile([0.70, 0.30])
df['direction_quant70'] = np.select(
    [ret1h >= up_q, ret1h <= dn_q], [1, 0], default=np.nan
)

mask_A = df[ALL_FEATS].notna().all(axis=1) & df['direction_quant70'].notna()
dfa = df.loc[mask_A].reset_index(drop=True)

print("RUN A rows :", len(dfa))
print("Class balance:\n", dfa['direction_quant70'].value_counts(normalize=True).round(3))


RUN A rows : 46859
Class balance:
 direction_quant70
0.0    0.5
1.0    0.5
Name: proportion, dtype: float64

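The quantile labelling in Cell 5A can be traced by hand on a tiny example. The ten returns below are made up for illustration. As in the notebook, the thresholds come from the whole series, which means they also reflect bars that later land in test folds.

```python
import numpy as np
import pandas as pd

# Ten toy 1-h returns, sorted for readability.
ret = pd.Series([-0.05, -0.03, -0.02, -0.01, -0.005,
                 0.005, 0.01, 0.02, 0.03, 0.05])

up_q = ret.quantile(0.70)   # top-30 % cut-off
dn_q = ret.quantile(0.30)   # bottom-30 % cut-off

# 1 = decisive up move, 0 = decisive down move, NaN = middle band (dropped).
label = pd.Series(
    np.select([ret >= up_q, ret <= dn_q], [1.0, 0.0], default=np.nan),
    index=ret.index,
)
kept = label.dropna()

print(f"thresholds: {dn_q:.4f} / {up_q:.4f}")
print(f"kept {len(kept)} of {len(ret)} bars")
```

Three bars land in each tail and the four in the middle are discarded, which is why Run A ends up with an even class balance.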

In [31]:
# Cell 6A ▸ Evaluation helper for one label
def eval_run(dataframe, feat_list, label_col, n_splits=5):
    """Expanding-window CV: return (mean metrics, mean permutation importances)."""
    tscv = TimeSeriesSplit(n_splits=n_splits)
    metrics, imps = [], []
    for fold,(tr,ts) in enumerate(tscv.split(dataframe)):
        Xtr, ytr = dataframe.iloc[tr][feat_list], dataframe.iloc[tr][label_col]
        Xts, yts = dataframe.iloc[ts][feat_list], dataframe.iloc[ts][label_col]

        clf = RandomForestClassifier(n_estimators=400, n_jobs=-1,
                                     random_state=fold).fit(Xtr, ytr)
        y_pred = clf.predict(Xts)
        proba  = clf.predict_proba(Xts)[:,1]

        metrics.append({
            "accuracy" : accuracy_score(yts, y_pred),
            "precision": precision_score(yts, y_pred, zero_division=0),
            "recall"   : recall_score(yts, y_pred, zero_division=0),
            "f1"       : f1_score(yts, y_pred, zero_division=0),
            "roc_auc"  : roc_auc_score(yts, proba),
        })

        pi = permutation_importance(clf, Xts, yts, scoring="accuracy",
                                    n_repeats=10, random_state=fold, n_jobs=-1)
        imps.append(pd.Series(pi.importances_mean, index=feat_list))

    return pd.DataFrame(metrics).mean().round(3), pd.concat(imps, axis=1).mean(axis=1)

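`eval_run` leans on `TimeSeriesSplit`, which grows the training window forward in time and always tests on the block that follows it. A small sketch on 12 dummy rows shows the fold layout (the array `X` is a placeholder, not the BTC data):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 time-ordered toy rows

folds = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    folds.append((train_idx.tolist(), test_idx.tolist()))
    print(f"train 0..{train_idx[-1]:>2}  |  test {test_idx.tolist()}")
```

Every training index precedes every test index in the same fold, so no fold is fitted on bars from its own future. The earliest folds train on far less data than the last one, which is worth remembering when reading fold-averaged metrics.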

In [32]:
# Cell 7A ▸ Run Momentum vs. ALL for RUN A
expA = {}
for name, feats in {"MOM":F_MOM, "ALL":ALL_FEATS}.items():
    res, imp = eval_run(dfa, feats, "direction_quant70")
    expA[name] = res
    if name == "ALL": impA = imp          # save importances for ALL

pd.DataFrame(expA).T




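| Feature set | accuracy | precision | recall | f1 | roc_auc |
|-------------|----------|-----------|--------|----|---------|
| MOM | 0.513 | 0.505 | 0.55 | 0.526 | 0.521 |
| ALL | 0.535 | 0.526 | 0.612 | 0.561 | 0.554 |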


### Top-10 generalising features (Run A, ALL)


In [33]:
impA.sort_values(ascending=False).head(10)


body              0.011666
roc_4h            0.008872
pos_24h           0.007658
boll_b            0.004766
upper_shadow      0.004323
lower_shadow      0.002799
vol_pct_change    0.002287
hour_cos          0.001867
roc_30d           0.001749
vol_ratio         0.001501
dtype: float64

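For scale, each value above is the mean drop in held-out accuracy when that column is shuffled, so `body` at 0.0117 costs a little over one percentage point. The synthetic sketch below (not BTC data; `signal` and `noise` are made-up columns) shows what the same measurement looks like when one feature fully determines the label and the other is irrelevant:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 600
signal = rng.normal(size=n)          # drives the label
noise = rng.normal(size=n)           # unrelated to the label
X = np.column_stack([signal, noise])
y = (signal > 0).astype(int)

# Fit on the first 400 rows, measure importances on the held-out rest.
X_tr, y_tr, X_te, y_te = X[:400], y[:400], X[400:], y[400:]
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

pi = permutation_importance(clf, X_te, y_te, scoring="accuracy",
                            n_repeats=10, random_state=0)
imp_signal, imp_noise = pi.importances_mean
print(f"signal: {imp_signal:.3f}   noise: {imp_noise:.3f}")
```

Shuffling the informative column pushes accuracy toward chance, while shuffling the noise column barely moves it. The Run A importances sit much closer to the second case than the first.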
# RUN B – Raw label (every bar) + confidence filter 


In [34]:
# Cell 5B ▸ Build raw label and mask
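# Label for row t: does bar t+1 close above bar t? Features were lagged one bar,
# so this target sits one bar further ahead of the features than Run A's does.
df['direction_raw'] = (df['close'].shift(-1) > df['close']).astype(int)
# The comparison never yields NaN: the final bar (no next close) is labelled 0.
mask_B = df[ALL_FEATS].notna().all(axis=1)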
dfb    = df.loc[mask_B].reset_index(drop=True)

print("RUN B rows :", len(dfb))
print("Class balance:\n", dfb['direction_raw'].value_counts(normalize=True).round(3))


RUN B rows : 77570
Class balance:
 direction_raw
1    0.506
0    0.494
Name: proportion, dtype: float64


In [37]:
# Cell 6B ▸ Evaluation with optional prob threshold (robust to 1-class slices)
def eval_runB(dataframe, feat_list, prob_thr=None, n_splits=5):
    tscv = TimeSeriesSplit(n_splits=n_splits)
    rows = []           # store metric dicts
    cover = []          # coverage for prob-threshold variant

    for fold, (tr, ts) in enumerate(tscv.split(dataframe)):
        Xtr, ytr = dataframe.iloc[tr][feat_list], dataframe.iloc[tr]['direction_raw']
        Xts, yts = dataframe.iloc[ts][feat_list], dataframe.iloc[ts]['direction_raw']

        clf = RandomForestClassifier(
            n_estimators=400, n_jobs=-1, random_state=fold
        ).fit(Xtr, ytr)

        proba = clf.predict_proba(Xts)[:, 1]
        y_pred = (proba >= 0.50).astype(int)          # default 0.5 cut

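        # --- optional high-confidence filter ---
        # Keeps only bars with P(up) >= prob_thr, so with prob_thr >= 0.5 every
        # kept prediction is "up"; confident "down" calls are not kept.
        if prob_thr is not None:
            mask = proba >= prob_thr
            cover.append(mask.mean())    # record coverage even when it is 0
            if mask.sum() == 0:          # nothing meets threshold → skip fold
                continue
            yts, y_pred, proba = yts[mask], y_pred[mask], proba[mask]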

        # --- compute metrics safely ---------------------------------
        unique = np.unique(yts)
        if len(unique) == 1:
            roc = np.nan                 # can’t compute ROC-AUC with 1 class
        else:
            roc = roc_auc_score(yts, proba)

        rows.append({
            "accuracy" : accuracy_score(yts, y_pred),
            "precision": precision_score(yts, y_pred, zero_division=0),
            "recall"   : recall_score(yts, y_pred, zero_division=0),
            "f1"       : f1_score(yts, y_pred, zero_division=0),
            "roc_auc"  : roc,
        })

    out = pd.DataFrame(rows).mean(numeric_only=True).round(3)
    if prob_thr is not None:
        out["coverage"] = np.mean(cover) if cover else 0.0
    return out


In [38]:
# Cell 7B ▸ Run 3 variants for RUN B
expB = {
    "RawFull | MOM"  : eval_runB(dfb, F_MOM),
    "RawFull | ALL"  : eval_runB(dfb, ALL_FEATS),
    "RawConf>.7 | ALL": eval_runB(dfb, ALL_FEATS, prob_thr=0.7)
}
pd.DataFrame(expB).T


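| Variant | accuracy | coverage | f1 | precision | recall | roc_auc |
|---------|----------|----------|----|-----------|--------|---------|
| RawFull \| MOM | 0.503 | NaN | 0.527 | 0.511 | 0.546 | 0.504 |
| RawFull \| ALL | 0.517 | NaN | 0.559 | 0.521 | 0.607 | 0.521 |
| RawConf>.7 \| ALL | 0.55 | 0.000201 | 0.613 | 0.55 | 0.8 | 0.583 |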

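The filter in `eval_runB` thresholds P(up) only, so it never keeps a confident "down" call. A possible two-sided variant, sketched under the assumption that confidence means the larger of P(up) and P(down), could look like this. The function name `two_sided_filter` and the eight probabilities are hypothetical.

```python
import numpy as np

def two_sided_filter(proba_up, thr=0.7):
    """Keep rows where either class reaches `thr`; return (mask, predictions)."""
    proba_up = np.asarray(proba_up, dtype=float)
    confidence = np.maximum(proba_up, 1.0 - proba_up)
    mask = confidence >= thr
    y_pred = (proba_up >= 0.5).astype(int)
    return mask, y_pred

# Hypothetical P(up) values and true labels for eight bars.
proba = np.array([0.82, 0.55, 0.28, 0.49, 0.71, 0.15, 0.60, 0.33])
y_true = np.array([1, 0, 0, 1, 0, 0, 1, 1])

mask, y_pred = two_sided_filter(proba, thr=0.7)
coverage = mask.mean()                                  # share of bars acted on
accuracy = (y_pred[mask] == y_true[mask]).mean()        # accuracy on kept bars
print(f"coverage={coverage:.3f}  accuracy on kept bars={accuracy:.3f}")
```

Reporting coverage next to accuracy matters here: a filter that keeps only a few bars can post attractive numbers that do not hold up.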

## Interpretation guide

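* **Run A table** (Quant70) – compare the MOM and ALL rows on ROC-AUC and F1 to see whether the extra features help on decisive moves. In the run above, ALL leads on both (ROC-AUC 0.554 vs. 0.521, F1 0.561 vs. 0.526), a small edge over the 0.5 chance level.  
* **Run B table** – compare RawFull vs. RawConf > 0.7.  
  * If RawConf > 0.7 reaches similar F1 at, say, 20 % *coverage*, you have a practical high-confidence trading signal.  
  * In the run above, coverage is 0.000201 (about 0.02 % of test bars), so the RawConf row rests on only a few bars per fold and is too thin to support a conclusion.  
* **Top-10 importances (Run A)** – mean drop in held-out accuracy when a column is shuffled; the largest (`body`, 0.0117) is roughly 1.2 percentage points.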

Feel free to tweak:
* Quantile thresholds (e.g., 0.75 / 0.25)  
* Probability threshold (0.6–0.8)  
* Swap `RandomForestClassifier` for XGBoost's `XGBClassifier`, which follows the scikit-learn estimator API.

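This notebook is self-contained. Each run evaluates its own frame (`dfa` or `dfb`), but both are built from the shared `df`, and the ±inf → NaN sweep that both masks rely on lives in Cell 5A, so run Cells 1–5A before starting Run B. Good luck!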
