
This notebook answers two questions:

1. Do the new feature families (volatility, range-position, volume, candles, time-seasonality)
   improve hourly direction prediction over a momentum-only baseline?
2. Which individual columns contribute the most, as measured by permutation importance on held-out folds?


### Feature glossary

| Feature                | Formula / window                 | Intuition |
|------------------------|-----------------------------------|-----------|
| **vol_6h**             | Std. dev. of 1-h returns, 6-h window | Captures short-term volatility spikes that often precede breakouts. |
| **vol_24h**            | Std. dev. of 1-h returns, 24-h window | Separates high-volatility from calm regimes over one day. |
| **atr_14h**            | Mean of `(high − low)` over the last 14 bars | A simplified Average True Range (ignores gaps from the previous close); bigger bars → more movement potential. |
| **pos_24h**            | `(close − 24h low) / (24h high − 24h low)` | Where price sits inside its 24-hour range (0 = bottom, 1 = top). |
| **boll_b**             | Bollinger %B: `(close − lower band) / (upper band − lower band)`, bands = 24-h SMA ± 2σ | >1 = close above the upper band (often read as over-bought); <0 = close below the lower band (over-sold). |
| **vol_mean_24h**       | 24-h moving average of **Volume BTC**  | Baseline liquidity level (here `vol_` means volume, not volatility). |
| **vol_ratio**          | `Volume BTC / vol_mean_24h`            | >1 = current bar's volume is above its 24-h average; values well above 1 flag a volume spike. |
| **vol_pct_change**     | % change in volume vs. 6 bars ago      | Sudden volume jumps, which can precede price moves. |
| **obv**                | Cumulative Σ (sign(Δclose) × Volume)   | On-Balance Volume; tracks whether volume flows with price. |
| **body**               | `close − open`                         | Size and direction of the candle body (positive = green bar). |
| **upper_shadow**       | `high − max(open, close)`              | Rejection of higher prices (selling pressure); long wicks can signal reversals. |
| **lower_shadow**       | `min(open, close) − low`               | Rejection of lower prices (buying pressure). |
| **hour_sin / hour_cos**| Sin/Cos of UTC hour (0-23)             | Encode the intraday cycle (Asia, EU, US sessions) so that 23:00 and 00:00 are neighbours. |
| **dow**                | Day-of-week (0 = Mon … 6 = Sun)        | Weekend vs. weekday behaviour. |

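To illustrate the `hour_sin / hour_cos` row, the sketch below uses two hypothetical helpers (`encode_hour` and `circle_distance`, not part of the notebook) to show that 23:00 and 00:00 land close together on the unit circle, even though the raw hour values are 23 apart.

```python
import numpy as np

def encode_hour(hour):
    """Map an hour of day (0-23) to a (sin, cos) point on the unit circle."""
    angle = 2 * np.pi * np.asarray(hour) / 24
    return np.sin(angle), np.cos(angle)

def circle_distance(h1, h2):
    """Euclidean distance between two encoded hours."""
    s1, c1 = encode_hour(h1)
    s2, c2 = encode_hour(h2)
    return float(np.hypot(s1 - s2, c1 - c2))

d_adjacent = circle_distance(23, 0)   # neighbouring hours across midnight
d_opposite = circle_distance(0, 12)   # hours half a day apart
print(d_adjacent, d_opposite)
```

Any two hours that are one step apart get the same distance, which is the property a raw 0-23 integer lacks at the midnight boundary.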

In [2]:
# Cell 1 ▸ Imports & data load
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, confusion_matrix
)
from sklearn.inspection import permutation_importance

CSV = r"C:\Users\ADMIN\Desktop\Coding_projects\stock_market_prediction\Stock-Market-Prediction\data\processed\gemini_btc_data_final_version.csv"
df  = (
    pd.read_csv(CSV, parse_dates=["date"])
      .sort_values("date")
      .reset_index(drop=True)
)

print(f"{len(df):,} rows  {df['date'].min()} → {df['date'].max()}")


82,519 rows  2015-10-08 14:00:00 → 2025-03-27 23:00:00


In [10]:
# Cell 2 ▸ Safe feature-engineering (guard against div-by-zero, inf, NaN)
EPS = 1e-9                     # tiny constant to avoid /0

# --- Momentum -----------------------------------------------------
df['roc_4h']  = df['close'].pct_change(4)
df['roc_24h'] = df['close'].pct_change(24)
df['roc_7d']  = df['close'].pct_change(24*7)
df['roc_30d'] = df['close'].pct_change(24*30)

# --- Volatility ---------------------------------------------------
ret_1h = df['close'].pct_change()
df['vol_6h']  = ret_1h.rolling(6).std()
df['vol_24h'] = ret_1h.rolling(24).std()
df['atr_14h'] = (df['high'] - df['low']).rolling(14).mean()

# --- Range position ----------------------------------------------
rng_hi = df['high'].rolling(24).max()
rng_lo = df['low'].rolling(24).min()
rng    = (rng_hi - rng_lo).replace(0, EPS)     # avoid 0-range

df['pos_24h'] = (df['close'] - rng_lo) / rng

mid_24h = df['close'].rolling(24).mean()
std_24h = df['close'].rolling(24).std().replace(0, EPS)
df['boll_b'] = (df['close'] - (mid_24h - 2*std_24h)) / (4*std_24h)

# --- Volume / liquidity ------------------------------------------
df['vol_mean_24h']   = df['Volume BTC'].rolling(24).mean().replace(0, EPS)
df['vol_ratio']      = df['Volume BTC'] / df['vol_mean_24h']
df['vol_pct_change'] = df['Volume BTC'].pct_change(6)

direction = np.sign(df['close'].diff()).fillna(0)
df['obv'] = (direction * df['Volume BTC']).cumsum()

# --- Candlestick shape -------------------------------------------
df['body']          = df['close'] - df['open']
df['upper_shadow']  = df['high'] - df[['close','open']].max(axis=1)
df['lower_shadow']  = df[['close','open']].min(axis=1) - df['low']

# --- Time-of-day / weekday ---------------------------------------
df['hour'] = df['date'].dt.hour
df['hour_sin'] = np.sin(2*np.pi*df['hour']/24)
df['hour_cos'] = np.cos(2*np.pi*df['hour']/24)
df['dow']  = df['date'].dt.dayofweek

# --- Final clean-up: drop every row that contains inf / NaN -------
df = (
    df.replace([np.inf, -np.inf], np.nan)   # turn inf → NaN
      .dropna()                             # drop remaining NaNs
      .reset_index(drop=True)
)

print("Feature engineering safe-completed.  Rows left:", len(df))


Feature engineering safe-completed.  Rows left: 47903

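The row count falls from 82,519 to 47,903 in this cell, far more than the 720-bar warm-up of `roc_30d` alone would remove. A per-column count of NaN and ±inf values, taken just before the `dropna()`, would show which column is responsible. The helper below is a minimal sketch on toy data; `nan_inf_report` is a hypothetical name, not part of the notebook.

```python
import numpy as np
import pandas as pd

def nan_inf_report(frame):
    """Count NaN and ±inf values per numeric column, worst column first."""
    num = frame.select_dtypes(include=[np.number])
    values = num.to_numpy(dtype=float)
    report = pd.DataFrame({
        "n_nan": np.isnan(values).sum(axis=0),
        "n_inf": np.isinf(values).sum(axis=0),
    }, index=num.columns)
    report["n_bad"] = report["n_nan"] + report["n_inf"]
    return report.sort_values("n_bad", ascending=False)

# Toy data: zero-volume bars make pct_change produce NaN (0/0) and inf (x/0).
toy = pd.DataFrame({"Volume BTC": [0.0, 0.0, 5.0, 10.0, 20.0]})
toy["vol_pct_change"] = toy["Volume BTC"].pct_change(1)
report = nan_inf_report(toy)
print(report)
```

Running the same report on the real `df` before the clean-up step would indicate whether zero-volume bars, missing source values, or something else accounts for the dropped rows.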

In [12]:
# Cell 2b ▸ sanity-check (numeric columns only)
num = df.select_dtypes(include=[np.number]).values

print("Total rows after cleaning :", len(df))
print("Any remaining NaN?        :", df.isna().any().any())
print("Any ±inf after cleaning?  :", np.isinf(num).any())


Total rows after cleaning : 47903
Any remaining NaN?        : False
Any ±inf after cleaning?  : False


In [6]:
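# Cell 3 ▸ Build quantile-based direction label (clearer than 1 % cutoff)
# Label each row with the NEXT bar's return, so that features known at bar t
# predict the move into bar t+1. A same-bar return would leak into the
# candle and momentum features, which are built from the same close.
# Note: rows dropped in Cell 2 leave gaps, so "next bar" means the next surviving row.
fwd_ret = df['close'].pct_change().shift(-1)
up_thr, dn_thr = fwd_ret.quantile([0.7, 0.3])
df['direction'] = np.select(
    [fwd_ret >= up_thr, fwd_ret <= dn_thr],
    [1, 0],
    default=np.nan          # middle 40 % (and the last row) get no label
)
df = df.dropna(subset=['direction']).reset_index(drop=True)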
print("Class balance:\n", df['direction'].value_counts(normalize=True))


Class balance:
 direction
1.0    0.5
0.0    0.5
Name: proportion, dtype: float64

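The 0.5 / 0.5 balance follows from the construction: the top 30 % of returns become class 1, the bottom 30 % become class 0, and the middle 40 % are dropped. A toy sketch with ten made-up returns (all names and values here are illustrative only):

```python
import numpy as np
import pandas as pd

# Ten made-up hourly returns, already sorted for readability.
returns = pd.Series([-0.04, -0.03, -0.02, -0.01, 0.00,
                     0.01, 0.02, 0.03, 0.04, 0.05])
up_thr = returns.quantile(0.7)
dn_thr = returns.quantile(0.3)

labels = pd.Series(
    np.select([returns >= up_thr, returns <= dn_thr], [1.0, 0.0], default=np.nan),
    index=returns.index,
)
kept = labels.dropna()          # the ambiguous middle band is discarded
print(kept.value_counts(normalize=True))
```

Because the two tails are the same width, the surviving rows split evenly between the classes, so accuracy of 0.5 is the natural chance baseline for the metrics that follow.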

In [7]:
# Cell 4 ▸ Feature-family dictionaries
F_MOM   = ['roc_4h','roc_24h','roc_7d','roc_30d']
F_VOL   = ['vol_6h','vol_24h','atr_14h']
F_RNG   = ['pos_24h','boll_b']
F_VLM   = ['vol_ratio','vol_pct_change','obv']
F_CDL   = ['body','upper_shadow','lower_shadow']
F_TIME  = ['hour_sin','hour_cos','dow']

SETUPS = {
    "Momentum-only"        : F_MOM,
    "Momentum+Volatility"  : F_MOM + F_VOL,
    "Momentum+Range"       : F_MOM + F_RNG,
    "Momentum+Volume"      : F_MOM + F_VLM,
    "Momentum+Candles"     : F_MOM + F_CDL,
    "Momentum+Time"        : F_MOM + F_TIME,
    "ALL_FEATURES"         : F_MOM + F_VOL + F_RNG + F_VLM + F_CDL + F_TIME,
}


In [8]:
# Cell 5 ▸ Evaluation helper (TimeSeriesSplit + held-out permutation importance)
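def evaluate(feature_list, label='direction', n_splits=5):
    """Fit a random forest on each TimeSeriesSplit fold and return the
    fold-averaged metrics and permutation importances (test folds only)."""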
    tscv = TimeSeriesSplit(n_splits=n_splits)
    metrics, importances = [], []
    
    for fold,(train_idx, test_idx) in enumerate(tscv.split(df)):
        X_train, y_train = df.loc[train_idx, feature_list], df.loc[train_idx, label]
        X_test , y_test  = df.loc[test_idx , feature_list], df.loc[test_idx , label]

        clf = RandomForestClassifier(
            n_estimators=400, max_depth=None, n_jobs=-1, random_state=fold
        ).fit(X_train, y_train)

        y_pred = clf.predict(X_test)
        y_prob = clf.predict_proba(X_test)[:,1]
        
        metrics.append({
            "accuracy" : accuracy_score(y_test, y_pred),
            "precision": precision_score(y_test, y_pred, zero_division=0),
            "recall"   : recall_score(y_test, y_pred, zero_division=0),
            "f1"       : f1_score(y_test, y_pred, zero_division=0),
            "roc_auc"  : roc_auc_score(y_test, y_prob)
        })
        
        imp = permutation_importance(
            clf, X_test, y_test,
            scoring="accuracy", n_repeats=20, random_state=fold, n_jobs=-1
        )
        importances.append(pd.Series(imp.importances_mean, index=feature_list))
    
    metr_df = pd.DataFrame(metrics).mean().round(3)
    imp_df  = pd.concat(importances, axis=1).mean(axis=1).sort_values(ascending=False)
    return metr_df, imp_df

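`TimeSeriesSplit` keeps every test fold strictly after its training fold, which is why the permutation importances computed in `evaluate` count as held-out. A small sketch on 12 dummy rows shows the expanding-window layout:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)          # 12 dummy rows in time order
folds = list(TimeSeriesSplit(n_splits=3).split(X))

for i, (train_idx, test_idx) in enumerate(folds):
    print(f"fold {i}: train {train_idx[0]}-{train_idx[-1]}  "
          f"test {test_idx[0]}-{test_idx[-1]}")
```

The training window grows with each fold, so early folds train on much less history than later ones; averaging metrics across folds mixes those regimes.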

In [9]:
# Cell 6 ▸ Run all setups and store results
results, imps = {}, {}
for name, feats in SETUPS.items():
    print(f"▶ Evaluating: {name}  ({len(feats)} features)")
    res, imp = evaluate(feats)
    results[name] = res
    imps[name]    = imp


▶ Evaluating: Momentum-only  (4 features)
▶ Evaluating: Momentum+Volatility  (7 features)
▶ Evaluating: Momentum+Range  (6 features)
▶ Evaluating: Momentum+Volume  (7 features)


ValueError: Input X contains infinity or a value too large for dtype('float32').

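The error above means at least one column in the Momentum+Volume set held ±inf or an overflowing value when Cell 6 ran. The execution counters (Cell 6 is `In [9]`, Cell 2 is `In [10]`) suggest Cell 6 ran before the clean-up version of Cell 2, though that is an inference from the counters rather than something the notebook states. A guard like the following sketch, called before fitting, would name the offending columns; `check_finite` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def check_finite(frame, columns):
    """Raise a ValueError naming any listed column with NaN or ±inf values."""
    values = frame[columns].to_numpy(dtype=float)
    bad_mask = ~np.isfinite(values).all(axis=0)
    bad_cols = [c for c, bad in zip(columns, bad_mask) if bad]
    if bad_cols:
        raise ValueError(f"Non-finite values in: {bad_cols}")
    return True

# Toy frame with one inf planted in a volume feature.
toy = pd.DataFrame({
    "roc_4h": [0.01, -0.02, 0.03],
    "vol_pct_change": [0.5, np.inf, -0.1],
})
try:
    check_finite(toy, ["roc_4h", "vol_pct_change"])
except ValueError as err:
    message = str(err)
    print(message)
```

Re-running the notebook top to bottom after the clean-up in Cell 2, with a check like this at the start of `evaluate`, would confirm whether the error still reproduces.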
In [None]:
# Cell 7 ▸ Show metrics table (higher is better for all)
metrics_df = pd.DataFrame(results).T.sort_values("roc_auc", ascending=False)
metrics_df


In [None]:
# Cell 8 ▸ Display top-10 held-out importances for BEST setup
best_name = metrics_df.index[0]
print(f"Top features for: {best_name}\n")
imps[best_name].head(10)
