In [14]:
import pandas as pd
import numpy as np
import lightgbm as lgb
df = pd.read_parquet("C:/Users/james/OneDrive/Documents/GitHub/solana-qrf-interval-forecasting/data/features_full.parquet").dropna(subset=["return_72h"])


## Stage 1 – define predictors & basic types
Any column other than `timestamp`, `token`, or the 72-h target (`return_72h`) is a candidate predictor.

In [15]:
drop_cols = ["timestamp", "token", "return_72h"]
predictors_initial = df.columns.difference(drop_cols).tolist()

# Columns that are really *categorical* even if stored as int/float
EXPLICIT_CAT = [
    # seasonality
    "day_of_week",
    # discretised regimes
    "vol_regime",          # 0–4
    "trend_regime",        # -1, 0, +1
    "momentum_bucket",     # 0–9
    # extreme-tail indicators (binary)
    "extreme_flag1", "tail_positive", "tail_negative", "tail_asym"
]

for col in EXPLICIT_CAT:
    if col in df.columns:
        df[col] = df[col].astype("category")

# Everything else stays numeric
#   hour_sin / hour_cos are **already scaled floats** – leave numeric.
#   holder_missing, new_token_accounts_missing etc. are binary ints – fine as numeric.

# Build initial predictor lists (drop_cols is already defined at the top of this cell)
cat_feats = [c for c in df.columns
             if df[c].dtype.name == "category" and c not in drop_cols]
num_feats = [c for c in df.columns
             if c not in drop_cols + cat_feats]

print(f"Found {len(num_feats)} numeric and {len(cat_feats)} categorical predictors.")

Found 40 numeric and 8 categorical predictors.


In [17]:
# 3.  Variance & sparsity filter
#     • numeric features: keep if variance > 1e-8
#     • all features (numeric or categorical): keep if < 80 % NaN

var_keep = df[num_feats].var() > 1e-8          # boolean Series indexed by num_feats
sparse_keep = df[num_feats + cat_feats].isna().mean() < 0.80

# numeric features must pass both filters; categoricals only the sparsity filter
num_pass = var_keep.index[var_keep & sparse_keep[var_keep.index]].tolist()
cat_pass = [c for c in cat_feats if sparse_keep[c]]
predictors_stage1 = num_pass + cat_pass

# report the counts that actually survive (not just the variance pass)
print(f"Stage 1 keeps {len(predictors_stage1)} predictors "
      f"({len(num_feats)} numeric ➞ {len(num_pass)} kept, "
      f"{len(cat_feats)} categorical ➞ {len(cat_pass)} kept)")

Stage 1 keeps 41 predictors (40 numeric ➞ 33 kept, 8 categorical ➞ 8 kept)


### Stage 1 — Type Cleaning & Basic Filters

**Objective** Ensure that every candidate predictor (i) has the correct data type for downstream models and (ii) carries non-trivial information.

* **Explicit categorical encoding**  
  `day_of_week`, `vol_regime`, `trend_regime`, `momentum_bucket`, and the binary tail flags (`extreme_flag1`, `tail_positive`, `tail_negative`, `tail_asym`) were cast to the pandas `category` dtype.
  This lets LightGBM apply its native categorical split logic and keeps one-hot expansion available for linear models.

* **Near-zero variance filter**  
  Numeric predictors with variance ≤ 1 × 10⁻⁸ were removed: they provide no discriminatory power and can destabilise optimisation in linear quantile regression.

* **Sparsity filter**  
  Any feature (numeric or categorical) with ≥ 80 % missing values was discarded.
  The threshold trades off retaining rarely populated signals against the risk of noise amplification in high-NA columns.

**Result** 41 predictors (33 numeric | 8 categorical) passed Stage 1.
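
The one-hot option mentioned above is not exercised in this notebook. A minimal sketch of what it could look like, assuming a toy frame whose column names mirror the notebook but whose values are made up for illustration:

```python
import pandas as pd

# Toy data: names follow the notebook, values are invented for this sketch.
toy = pd.DataFrame({
    "trend_regime": [-1, 0, 1, 0],
    "logret_12h": [0.01, -0.02, 0.03, 0.00],
})
toy["trend_regime"] = toy["trend_regime"].astype("category")

# Expand only the categorical column. drop_first=True removes one level so the
# dummies are not collinear with an intercept in a linear quantile model.
toy_linear = pd.get_dummies(
    toy, columns=["trend_regime"], drop_first=True, dtype=float
)
print(toy_linear.columns.tolist())
```

The tree model keeps using the `category` column directly; only a linear model would consume the expanded frame.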

## Stage 2 – multicollinearity filter (|ρ| > 0.98 → drop one feature)

In [19]:
from itertools import combinations

# ① split Stage-1 list back into numeric vs. categorical
num_keep = [c for c in predictors_stage1 if c in num_feats]
cat_keep = [c for c in predictors_stage1 if c in cat_feats]

# ② compute absolute Pearson correlation on numeric part
corr = df[num_keep].corr().abs()

# ③ scan the upper triangle; for each highly correlated pair keep the
#    earlier feature and mark the later one for dropping
to_drop = set()
for (col_i, col_j) in combinations(corr.columns, 2):
    if col_i in to_drop:
        continue                      # already removed; it should not evict others
    if corr.loc[col_i, col_j] > 0.98:
        to_drop.add(col_j)

num_after = [c for c in num_keep if c not in to_drop]
predictors_stage2 = num_after + cat_keep

print(f"Dropped {len(to_drop)} highly-collinear numerics "
      f"(>0.98) ➜ {len(predictors_stage2)} predictors remain.\n"
      f"Numeric kept: {len(num_after)}  |  Categorical kept: {len(cat_keep)}")

# Optional: inspect what was dropped
display(sorted(to_drop))

Dropped 1 highly-collinear numerics (>0.98) ➜ 40 predictors remain.
Numeric kept: 32  |  Categorical kept: 8


['williams_r']

### Stage 2 — Multicollinearity Filter

**Objective** Remove redundant numeric predictors to stabilise models and clarify feature importance.

**Procedure**

* Compute the absolute Pearson correlation matrix for the numeric Stage-1 set.
* Scan the upper triangle; when |ρ| > 0.98, keep the first feature of the pair and drop the second.

The 0.98 cut-off is intended to prune near-duplicates without sacrificing distinct information.
Categorical variables are exempt, because Pearson's ρ does not capture their dependence.

**Outcome** 40 predictors survive (32 numeric | 8 categorical) and feed Stage 3; the only feature removed was `williams_r`.
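
To make the keep-first rule concrete, here is a small sketch on synthetic data. The helper name `drop_collinear` and the toy columns are hypothetical; this variant compares each column only against columns that have already been kept, which is one reasonable reading of the rule rather than the notebook's exact loop.

```python
import numpy as np
import pandas as pd

def drop_collinear(frame, threshold=0.98):
    """Greedy filter: walk the columns in order and keep a column only if
    its |Pearson rho| with every already-kept column is not above threshold."""
    corr = frame.corr().abs()
    kept = []
    for col in corr.columns:
        if not any(corr.loc[col, k] > threshold for k in kept):
            kept.append(col)
    return kept

rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "a_scaled": 3.0 * a + 1.0,      # affine copy of "a", so |rho| is 1
    "b": rng.normal(size=500),      # independent noise
})

kept = drop_collinear(toy)
print(kept)
```

Because the rule is order dependent, reordering the columns changes which member of a correlated pair survives.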

## Stage 3 — Cheap LightGBM Quantile Model (τ = 0.50)

**Objective** Obtain a fast, model-based ranking of predictor importance  
before engaging in computationally expensive tuning.

* **Model**  LightGBM with `objective="quantile"` and `alpha = 0.5`
  (i.e., median pinball loss).  
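* **Configuration**  400 trees, shrinkage 0.05, `num_leaves = 64`, and
  80 % feature subsampling (`feature_fraction`). `bagging_fraction = 0.80`
  is also set, but LightGBM only applies row bagging when `bagging_freq` > 0,
  which the parameters below do not set.  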
* **Categorical handling**  Native LightGBM categorical splits, using the
  list derived in Stage 1 (`cat_keep`).  
* **Output**  Gain-based importance for every predictor; features
  contributing < 0.3 % total gain will be eligible for pruning in
  Stage 4.

Running this quick model costs < 30 s on a laptop and yields an
importance ordering that should broadly track the one from more expensive
hyper-parameter-tuned runs.
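
For reference, the pinball loss behind `objective="quantile"` is ρ_τ(u) = max(τ·u, (τ − 1)·u) with u = y − ŷ. A minimal NumPy sketch (the function name `pinball_loss` and the toy numbers are illustrative, not part of the notebook) shows that at τ = 0.5 it equals half the mean absolute error:

```python
import numpy as np

def pinball_loss(y_true, y_pred, tau):
    """Mean pinball (quantile) loss for quantile level tau in (0, 1)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    # Under-predictions (err > 0) cost tau per unit, over-predictions cost 1 - tau.
    return float(np.mean(np.maximum(tau * err, (tau - 1.0) * err)))

y = [1.0, 2.0, 4.0]
pred = [2.0, 2.0, 2.0]

median_loss = pinball_loss(y, pred, tau=0.5)
mae = float(np.mean(np.abs(np.asarray(y) - np.asarray(pred))))
print(median_loss, 0.5 * mae)
```

At τ = 0.9 the same predictions are penalised much more for under-predicting than for over-predicting, which is why tail models can rank features differently from the median model.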


In [20]:
# --- 1. prepare matrices -------------------------------------------------
X = df[predictors_stage2]          # predictors from Stage 2
y = df["return_72h"]

lgb_data = lgb.Dataset(
    X,
    label=y,
    categorical_feature=cat_keep,  # defined in Stage 1
    free_raw_data=False
)

In [21]:
# --- 2. model params -----------------------------------------------------
params = dict(
    objective        = "quantile",
    alpha            = 0.5,          # median
    learning_rate    = 0.05,
    num_leaves       = 64,
    feature_fraction = 0.80,
    bagging_fraction = 0.80,         # no effect unless bagging_freq > 0 (not set here)
    seed             = 42,
    verbose          = -1,
)

gbm = lgb.train(
    params,
    lgb_data,
    num_boost_round = 400
)

In [24]:
# --- 3. gain importance --------------------------------------------------
gain = pd.Series(
    gbm.feature_importance(importance_type="gain"),
    index = predictors_stage2
).sort_values(ascending=False)

gain_pct = 100 * gain / gain.sum()
display(gain_pct.head(20).to_frame("gain_%").style.format({"gain_%":"{:.2f}"}))

# candidate list for Stage 4 pruning
threshold = 0.3                   # % of total gain
predictors_stage3 = gain_pct[gain_pct >= threshold].index.tolist()

print(f"\nStage 3 complete → {len(predictors_stage3)} predictors "
      f"(cover {gain_pct[gain_pct >= threshold].sum():.1f}% of total gain) "
      "advance to Stage 4.")

| feature | gain_% |
|---------|--------|
| proc | 32.06 |
| ret_ETH | 4.13 |
| ret_SOL | 4.08 |
| ret_BTC | 3.70 |
| stoch_k | 3.45 |
| cci | 3.39 |
| logret_12h | 3.31 |
| logret_36h | 3.16 |
| bollinger_bw | 3.06 |
| bollinger_b | 3.00 |



Stage 3 complete → 29 predictors (cover 99.3% of total gain) advance to Stage 4.


## Stage 4 — Gain-Based Feature Pruning

**Objective** Remove predictors that contribute a negligible share of LightGBM gain, so that the subsequent
hyper-parameter search is faster and feature importances are easier to read.

* **Criterion**  
  A predictor is kept if its **gain share ≥ 0.3 %** of total model gain
  (median-quantile LightGBM from Stage 3).

* **Result**  
  *29 predictors* survive the filter, representing **99.3 % of total gain**.
  The discarded set contains mainly rare-event flags (`extreme_flag1`,
  `tail_*`) and low-signal regime dummies (`vol_regime`, `trend_regime`)
  that LightGBM could not exploit at τ = 0.5.

* **Rationale**  
  • 0.3 % is conservative: features below this level each explain less
    than about 1⁄333 of model gain (0.3 % = 3⁄1000).  
  • Sparse tail flags can still be revisited for τ = 0.10 / 0.90 if needed,
    but including them now would inflate tree depth without measurable
    benefit at the median.

The pruned predictor list is frozen as **feature-set v1** for all
down-stream modelling and statistical tests.


In [25]:
THRESH = 0.3    # percent gain threshold (feel free to tweak)

predictors_final = gain_pct[gain_pct >= THRESH].index.tolist()
print(f"Kept {len(predictors_final)} predictors "
      f"(covers {gain_pct[gain_pct >= THRESH].sum():.1f}% of gain)")

# ----- build final dataframe to save -------------------------------------
cols_to_save = ["timestamp", "token", "return_72h"] + predictors_final
df_v1        = df[cols_to_save].copy()

df_v1.to_parquet("features_v1.parquet", index=False)
print("📁  Saved feature-set v1 → features_v1.parquet")

Kept 29 predictors (covers 99.3% of gain)
📁  Saved feature-set v1 → features_v1.parquet
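
The notebook does not print which predictors fell below the threshold. A small sketch of how that check could be done, using a toy Series with hypothetical feature names and gain shares in place of the real `gain_pct`:

```python
import pandas as pd

# Hypothetical gain shares in percent; the real values live in `gain_pct`.
toy_gain_pct = pd.Series(
    {"feat_a": 60.0, "feat_b": 39.5, "flag_x": 0.29, "flag_y": 0.21}
)
THRESH = 0.3  # percent of total gain, same meaning as in the cell above

kept = toy_gain_pct[toy_gain_pct >= THRESH]
dropped = toy_gain_pct[toy_gain_pct < THRESH].sort_values(ascending=False)

print(f"kept {len(kept)} ({kept.sum():.1f}% of gain), dropped {len(dropped)}")
print(dropped.round(2).to_string())
```

Applied to the real `gain_pct`, the `dropped` Series would list the discarded predictors together with their gain shares.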


## Stage 5 — Domain “must-keep” Add-Backs

Sparse tail-event indicators carry little gain for the median
quantile, but economic theory suggests they matter for the tails
(τ well below or well above 0.50).  
Therefore, we add tail flags back after Stage 4 pruning. The candidates are  
`extreme_flag1`, `tail_positive`, `tail_negative`, `tail_asym`, and  
`extreme_count_72h`;  
the Stage 6 cell adds back `extreme_flag1`, `tail_asym`, and
`extreme_count_72h`, taking the set from 29 to 32 predictors. These flags
cost almost no depth in tree models
and can widen the 10 % / 90 % (and other tail) intervals when recent
shocks cluster.
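
When adding columns back, it helps to skip names that are missing from the data or already in the pruned list, since duplicate column names cause trouble when selecting and saving. A minimal sketch; the helper `add_back` and the example lists are hypothetical:

```python
def add_back(base, extras, available):
    """Append extras to base in order, skipping names that are not
    available or that are already present."""
    result = list(base)
    seen = set(result)
    for name in extras:
        if name in available and name not in seen:
            result.append(name)
            seen.add(name)
    return result

# Illustrative inputs only; real lists come from the pruning step and df.columns.
base = ["proc", "ret_ETH", "tail_asym"]
extras = ["extreme_flag1", "tail_asym", "extreme_count_72h", "not_a_column"]
available = {"proc", "ret_ETH", "tail_asym", "extreme_flag1", "extreme_count_72h"}

merged = add_back(base, extras, available)
print(merged)
```

Here `tail_asym` is not duplicated and `not_a_column` is ignored, so the merged list grows by two names.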


## Stage 6 — Freeze Feature Sets

We freeze two predictor sets for reproducibility:

| File | # Predictors | Intended quantiles |
|------|--------------|--------------------|
| `features_v1.parquet` | 29 | core set for τ = 0.50 (median) baseline models |
| `features_v1_tail.parquet` | 32 | core set **plus** tail flags for τ ∈ {0.05, 0.10, 0.25, 0.75, 0.90, 0.95} models |

Locking the sets guarantees fair, identical inputs in all subsequent
rolling-CV experiments and statistical tests.
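
One lightweight way to check later that an experiment really used a frozen set is to record a fingerprint of the ordered column list. This is only a sketch of the idea; `feature_fingerprint` is a hypothetical helper, not something the notebook defines:

```python
import hashlib
import json

def feature_fingerprint(columns):
    """Short, order-sensitive SHA-256 fingerprint of a list of column names."""
    payload = json.dumps(list(columns), separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]

# Illustrative lists; in practice these would be the frozen predictor lists.
core = ["proc", "ret_ETH", "ret_SOL"]
tail = core + ["extreme_flag1"]

fp_core = feature_fingerprint(core)
fp_tail = feature_fingerprint(tail)
print(fp_core, fp_tail)
```

Storing the fingerprint next to each experiment's results makes it easy to spot a run that silently used a different or reordered predictor list.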


In [27]:
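# 1. Add tail flags to the pruned predictor list
#    (skip any that are missing from df or already in predictors_final)
tail_cols = ["extreme_flag1", "tail_asym", "extreme_count_72h"]
tail_cols = [c for c in tail_cols if c in df.columns and c not in predictors_final]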

predictors_final_tail = predictors_final + tail_cols

print(f"Feature-set sizes  →  v1: {len(predictors_final)}  |  v1_tail: {len(predictors_final_tail)}")

# 2. Save Parquet files
base_cols = ["timestamp", "token", "return_72h"]

df[base_cols + predictors_final]         .to_parquet("features_v1.parquet",       index=False)
df[base_cols + predictors_final_tail]    .to_parquet("features_v1_tail.parquet", index=False)

print("✅  Saved features_v1.parquet  and  features_v1_tail.parquet")


Feature-set sizes  →  v1: 29  |  v1_tail: 32
✅  Saved features_v1.parquet  and  features_v1_tail.parquet
