# ML-Based Trade Filtering using XGBoost

This notebook trains a supervised learning model to predict whether a generated trade signal is likely to be profitable.

The model is then used as a filter on top of the rule-based trading strategy.


In [13]:
import pandas as pd
import numpy as np

from xgboost import XGBClassifier
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import accuracy_score


In [14]:
features = pd.read_csv("../data/nifty_features_5min.csv")
features["timestamp"] = pd.to_datetime(features["timestamp"])
features.head()


Unnamed: 0,timestamp,open,high,low,close,volume,ema_5,ema_15,spot_return,futures_basis,avg_iv,iv_spread,pcr_oi,pcr_volume,regime,market_regime,signal,filtered_signal,strategy_return
0,2015-01-09 09:20:00,8300.5,8303.0,8293.25,8301.0,0,8301.133333,8301.175,-2.4e-05,0.0005,0.209528,-0.00166,1.215374,0.94108,2,1,-1,0,
1,2015-01-09 09:25:00,8301.65,8302.55,8286.8,8294.15,0,8298.805556,8300.296875,-0.000826,0.0005,0.187823,-0.012707,3.066087,0.621079,2,1,-1,0,-0.0
2,2015-01-09 09:30:00,8294.1,8295.75,8280.65,8288.5,0,8295.37037,8298.822266,-0.000681,0.0005,0.191463,-0.024234,0.516905,0.395105,2,1,-1,0,-0.0
3,2015-01-09 09:35:00,8289.1,8290.45,8278.0,8283.45,0,8291.396914,8296.900732,-0.000609,0.0005,0.173081,0.045038,2.092148,4.619883,1,-1,-1,-1,-0.0
4,2015-01-09 09:40:00,8283.4,8288.3,8277.4,8285.55,0,8289.447942,8295.481891,0.000253,0.0005,0.225163,-0.020501,2.222767,0.645333,2,1,-1,0,-0.000253


## Target Variable

The target is 1 when the next period's strategy return is strictly positive and 0 otherwise, so bars followed by a zero return (for example, when no position is open) are labeled 0. This framing allows the ML model to act as a short-term profitability filter rather than a price predictor.


In [15]:
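# Label each bar 1 if the next bar's strategy return is positive, else 0.
# The final row has no next bar: NaN > 0 evaluates to False, so it is labeled 0.
features["target"] = (features["strategy_return"].shift(-1) > 0).astype(int)

# Drop rows with missing values (e.g. the first row, whose strategy_return is NaN).
features.dropna(inplace=True)
features["target"].value_counts()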


target
0    143721
1     51357
Name: count, dtype: int64

## Feature Selection for ML Model


In [16]:
ml_features = [
    "avg_iv",
    "iv_spread",
    "pcr_oi",
    "pcr_volume",
    "futures_basis",
    "spot_return",
    "market_regime"
]

X = features[ml_features]
y = features["target"]


## Model Training using Time-Series Split
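
The cell below relies on `TimeSeriesSplit`, which yields expanding training windows followed by a later test block, so a fold is never tested on data that precedes its training window. As a minimal sketch of what those index splits look like, the example uses a made-up array of eight time-ordered samples (an assumption for illustration, not the notebook's data):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Eight time-ordered samples stand in for the feature matrix (illustrative only).
X_demo = np.arange(8).reshape(-1, 1)

splits = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X_demo):
    # Each test block comes strictly after its training window.
    splits.append((train_idx.tolist(), test_idx.tolist()))
    print("train:", train_idx.tolist(), "test:", test_idx.tolist())
```

With three splits the training window grows fold by fold, which is why the model fitted in the last fold has seen most of the dataset.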


In [17]:
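tscv = TimeSeriesSplit(n_splits=3)

accuracies = []

# Expanding-window validation: each fold trains on the past and tests on the block that follows.
for train_idx, test_idx in tscv.split(X):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

    # A small model (50 shallow trees) to limit overfitting.
    model = XGBClassifier(
        n_estimators=50,
        max_depth=3,
        learning_rate=0.1,
        eval_metric="logloss",
        random_state=42
    )

    model.fit(X_train, y_train)
    preds = model.predict(X_test)

    acc = accuracy_score(y_test, preds)
    accuracies.append(acc)

# Mean test accuracy across folds; `model` now holds the fit from the last fold.
np.mean(accuracies)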


np.float64(0.7340250295611283)
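
Because the target is imbalanced, this accuracy is easier to interpret next to the score of a model that always predicts the majority class. The sketch below is a rough comparison only: it reuses the class counts printed earlier, which cover the whole dataset, whereas the reported accuracy is averaged over three test folds whose class balance may differ.

```python
# Class counts copied from the value_counts() output earlier in the notebook.
n_negative = 143721  # target == 0
n_positive = 51357   # target == 1

# Accuracy of a model that always predicts the majority class (0).
majority_baseline = n_negative / (n_negative + n_positive)

cv_accuracy = 0.7340250295611283  # mean accuracy reported by the cell above

print(f"majority baseline:        {majority_baseline:.4f}")
print(f"cross-validated accuracy: {cv_accuracy:.4f}")
print(f"difference:               {cv_accuracy - majority_baseline:+.4f}")
```

On these numbers the classifier does not beat the always-0 baseline, so accuracy alone gives no evidence that it identifies profitable signals.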

## ML-Filtered Strategy


In [18]:
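# Predict with the model fitted in the last fold. Rows inside that fold's
# training window are in-sample, so these predictions are optimistic.
features["ml_prediction"] = model.predict(X)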

features["ml_filtered_signal"] = np.where(
    (features["filtered_signal"] != 0) & (features["ml_prediction"] == 1),
    features["filtered_signal"],
    0
)

features[["filtered_signal", "ml_filtered_signal"]].head()


Unnamed: 0,filtered_signal,ml_filtered_signal
1,0,0
2,0,0
3,-1,0
4,0,0
5,0,0
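
The filter above keeps a signal only when the hard class prediction is 1, which for a binary classifier corresponds to a 0.5 probability cut-off. As a hedged sketch of an alternative, the same rule can be applied to predicted probabilities with an explicit threshold. The arrays, the helper `apply_filter`, and the 0.4 threshold are all hypothetical; in the notebook the probabilities would come from `model.predict_proba(X)[:, 1]`.

```python
import numpy as np

# Hypothetical inputs: rule-based signals and the model's P(target == 1) per bar.
# The values are made up for illustration.
filtered_signal = np.array([0, -1, 1, -1, 0, 1])
proba_profitable = np.array([0.10, 0.42, 0.30, 0.55, 0.60, 0.20])

def apply_filter(signal, proba, threshold):
    """Keep a signal only where it is active and proba meets the threshold."""
    keep = (signal != 0) & (proba >= threshold)
    return np.where(keep, signal, 0)

default_cut = apply_filter(filtered_signal, proba_profitable, 0.5)
lower_cut = apply_filter(filtered_signal, proba_profitable, 0.4)  # assumed value

print(default_cut)
print(lower_cut)
```

Any threshold other than the default would need to be chosen on training or validation folds only, otherwise the comparison with the baseline is biased.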


## ML Model Notes

- XGBoost is used due to its robustness on tabular data.
- The model is intentionally kept small (50 trees, maximum depth 3) to limit overfitting.
- ML acts as a confirmation layer, not a replacement for strategy logic.
- Performance improvement is evaluated relative to the baseline strategy.


In [19]:
features[
    ["filtered_signal", "strategy_return", "target"]
].head(10)


Unnamed: 0,filtered_signal,strategy_return,target
1,0,-0.0,0
2,0,-0.0,0
3,-1,-0.0,0
4,0,-0.000253,0
5,0,-0.0,0
6,-1,-0.0,0
7,0,-0.000695,0
8,-1,0.0,1
9,-1,0.000628,0
10,-1,-0.000151,0
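
The rows above can be used to check the target definition by hand. The sketch below copies the ten `strategy_return` values from the table and rebuilds the label with the same `shift(-1) > 0` rule; the `next_return` column is added only for illustration.

```python
import pandas as pd

# strategy_return values for rows 1-10, copied from the table above.
strategy_return = pd.Series(
    [-0.0, -0.0, -0.0, -0.000253, -0.0, -0.0, -0.000695, 0.0, 0.000628, -0.000151],
    index=range(1, 11),
)

# A bar is labeled 1 when the NEXT bar's strategy return is strictly positive.
next_return = strategy_return.shift(-1)
target = (next_return > 0).astype(int)

print(pd.DataFrame({
    "strategy_return": strategy_return,
    "next_return": next_return,
    "target": target,
}))
```

Only row 8 is labeled 1, because row 9 is the only bar with a positive return. In this ten-row slice the last row has no successor and falls back to 0; in the notebook its label depends on row 11.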


ML strategy returns

In [21]:
features["ml_strategy_return"] = (
    features["ml_filtered_signal"].shift(1) * features["spot_return"]
)

features["ml_strategy_return"] = features["ml_strategy_return"].fillna(0)
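
The `shift(1)` in the return calculation means the position held during a bar is the signal emitted on the previous bar, which avoids using a signal on the same bar that produced it. A minimal sketch with made-up signals and returns (illustrative values, not the notebook's data):

```python
import pandas as pd

# Made-up signals and bar returns, for illustration only.
demo = pd.DataFrame({
    "signal":      [0, 1, 1, -1, 0],
    "spot_return": [0.001, 0.002, -0.001, 0.003, -0.002],
})

# The position held during bar t is the signal emitted at bar t-1.
demo["position"] = demo["signal"].shift(1)

# The first bar has no previous signal, so its return is set to 0.
demo["strategy_return"] = (demo["position"] * demo["spot_return"]).fillna(0)

print(demo)
```

The short signal on the fourth bar earns its return on the fifth bar, where the market falls by 0.002 and the strategy gains 0.002.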


Compare Baseline vs ML

In [22]:
baseline_return = features["strategy_return"].sum()
ml_return = features["ml_strategy_return"].sum()

baseline_return, ml_return


(np.float64(1.6546889774737052), np.float64(0.0))
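
A total of exactly 0.0 for the ML strategy is worth checking against the number of signals that survive the filter. The sketch below shows one way to run that check on made-up data; in the notebook the same counts could be taken from the `filtered_signal`, `ml_prediction`, and `ml_filtered_signal` columns of `features`.

```python
import numpy as np
import pandas as pd

# Made-up signals and predictions, for illustration only.
demo = pd.DataFrame({
    "filtered_signal": [0, -1, 1, -1, 0, 1],
    "ml_prediction":   [1,  0, 0,  0, 1, 0],
})
demo["ml_filtered_signal"] = np.where(
    (demo["filtered_signal"] != 0) & (demo["ml_prediction"] == 1),
    demo["filtered_signal"],
    0,
)

# Active signal bars before and after the ML filter.
baseline_trades = int((demo["filtered_signal"] != 0).sum())
ml_trades = int((demo["ml_filtered_signal"] != 0).sum())

# How often the model predicts "profitable" on bars that actually carry a signal.
active = demo["filtered_signal"] != 0
positives_on_signal = int(demo.loc[active, "ml_prediction"].sum())

print(baseline_trades, ml_trades, positives_on_signal)
```

In this toy case the model predicts 1 only on bars without a signal, so every signal is filtered out and the filtered strategy cannot earn any return.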

## ML Filter Results

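In this run the ML-filtered strategy has a total return of 0.0 against about 1.65 for the baseline, which indicates that the filter removed every trade rather than selecting a better subset, most likely because the model never predicts class 1 on bars with an active signal. No improvement in trade quality can therefore be claimed from these results.
The focus is on consistency rather than maximizing returns.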


### Model Performance Note

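The objective of the ML model in this project is not to maximize accuracy, but to demonstrate how machine learning can be integrated responsibly into a rule-based trading system. The model is kept simple to limit overfitting, and validation uses a time-ordered split so that each fold is tested only on data that follows its training window. The filter predictions, however, come from the last fold's model applied to the full dataset, so most of them are in-sample.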


In [23]:
features.to_csv("../data/nifty_features_5min.csv", index=False)


In [25]:
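def simple_metrics(returns):
    """Return the total return and an annualized Sharpe ratio of per-bar returns."""
    total_return = returns.sum()
    # Annualization assumes 252 trading days of 78 five-minute bars each.
    # NSE's 09:15-15:30 session has 75 such bars, so adjust the factor if needed.
    std = returns.std()
    if std > 0:
        sharpe = np.sqrt(252 * 78) * returns.mean() / std
    else:
        # A constant series (e.g. all zeros) has no volatility: Sharpe is undefined.
        sharpe = np.float64(np.nan)
    return total_return, sharpe

base_ret, base_sharpe = simple_metrics(features["strategy_return"])
ml_ret, ml_sharpe = simple_metrics(features["ml_strategy_return"])

base_ret, base_sharpe, ml_ret, ml_sharpe
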

(np.float64(1.6546889774737052),
 np.float64(1.5379293503336462),
 np.float64(0.0),
 np.float64(nan))
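
The `nan` Sharpe ratio follows directly from the formula: a return series that is zero on every bar has zero mean and zero standard deviation, so the ratio is undefined. The sketch below illustrates this with a hypothetical helper, `annualized_sharpe`, that also exposes the bars-per-day assumption as a parameter; the return values are made up.

```python
import numpy as np
import pandas as pd

def annualized_sharpe(returns, bars_per_day=78, days_per_year=252):
    """Annualized Sharpe ratio of per-bar returns; NaN when volatility is zero."""
    std = returns.std()
    if not std > 0:  # also catches NaN from an empty or single-value series
        return float("nan")
    return float(np.sqrt(days_per_year * bars_per_day) * returns.mean() / std)

all_flat = pd.Series([0.0, 0.0, 0.0, 0.0])             # a filter that blocks every trade
some_trades = pd.Series([0.0, 0.001, -0.0005, 0.002])  # illustrative returns

flat_sharpe = annualized_sharpe(all_flat)
sharpe_78 = annualized_sharpe(some_trades)                   # 78 bars per day
sharpe_75 = annualized_sharpe(some_trades, bars_per_day=75)  # 75 bars per day

print(flat_sharpe)  # nan: zero volatility
print(sharpe_78, sharpe_75)
```

Changing the bars-per-day assumption only rescales the ratio by the square root of the change, so it does not affect the comparison between two strategies measured on the same bars.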

## ML Enhancement Summary

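In this run the ML-based filter did not improve on the baseline: the filtered strategy returned exactly zero on every bar, giving a total return of 0.0 and an undefined Sharpe ratio, against a total return of about 1.65 and a Sharpe ratio of about 1.54 for the baseline.
The goal of ML in this project is not to outperform the strategy aggressively, but to demonstrate how machine learning can be integrated responsibly into a rule-based system.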
