# Predicting 1-Minute Stock Direction Using Machine Learning

This notebook builds a machine learning model to predict whether
the next one-minute candle will close **higher or lower**
for HDFC Bank stock.

We will:

1. Load minute-level price data
2. Engineer technical features
3. Train Logistic Regression
4. Convert probabilities into trading signals
5. Backtest the strategy with transaction costs
6. Interpret model coefficients

> Goal: understand whether the model captures any directional signal,
not to guarantee profitability.


In [4]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

## Step 1 — Load Data

We load the 1-minute OHLCV data for HDFC Bank.
Each row represents one minute of trading activity.

Important columns:
- open
- high
- low
- close
- volume


In [5]:
df = pd.read_csv("HDFCBANK_minute.csv")
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d %H:%M:%S')

df = df.sort_values('date').reset_index(drop=True)
df.head()


|   | date | open | high | low | close | volume |
|---|---|---|---|---|---|---|
| 0 | 2015-02-02 09:15:00 | 532.9 | 532.9 | 530.2 | 530.9 | 12719 |
| 1 | 2015-02-02 09:16:00 | 531.55 | 531.9 | 530.75 | 530.75 | 9437 |
| 2 | 2015-02-02 09:17:00 | 530.75 | 531.7 | 530.75 | 531.45 | 3500 |
| 3 | 2015-02-02 09:18:00 | 531.45 | 531.45 | 530.4 | 530.5 | 5203 |
| 4 | 2015-02-02 09:19:00 | 530.5 | 530.75 | 529.8 | 529.8 | 3386 |
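
Before engineering features, it can help to confirm that the data matches what the loader assumes: sorted, unique timestamps and internally consistent candles. The sketch below runs on a small hand-made frame built from the first rows shown above. `check_ohlcv` is a hypothetical helper, not part of the notebook, and the checks are only a starting point.

```python
import pandas as pd

def check_ohlcv(frame: pd.DataFrame) -> dict:
    """Return a few basic integrity counts for an OHLCV frame (hypothetical helper)."""
    return {
        "duplicate_timestamps": int(frame["date"].duplicated().sum()),
        "is_sorted": bool(frame["date"].is_monotonic_increasing),
        "high_below_low": int((frame["high"] < frame["low"]).sum()),
        "close_outside_range": int(
            ((frame["close"] > frame["high"]) | (frame["close"] < frame["low"])).sum()
        ),
        "missing_values": int(frame.isna().sum().sum()),
    }

# First three rows of the output above, typed in by hand
sample = pd.DataFrame({
    "date": pd.to_datetime([
        "2015-02-02 09:15:00", "2015-02-02 09:16:00", "2015-02-02 09:17:00",
    ]),
    "open": [532.9, 531.55, 530.75],
    "high": [532.9, 531.9, 531.7],
    "low": [530.2, 530.75, 530.75],
    "close": [530.9, 530.75, 531.45],
    "volume": [12719, 9437, 3500],
})

report = check_ohlcv(sample)
print(report)
```

On the full dataset, the same call would be `check_ohlcv(df)`; any non-zero count is worth inspecting before training.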


## Step 2 — Feature Engineering

We create features commonly used in quantitative trading:

- returns & log-returns
- candle range
- close-open distance
- volume change
- moving averages
- EMA
- volatility
- RSI (14)
- MACD

These help the model estimate directional bias.


In [6]:
# Simple return: percentage change from the previous close
df['return'] = df['close'].pct_change()
# log_return = ln(close_now / close_previous); shift(1) refers to the previous row
df['log_return'] = np.log(df['close'] / df['close'].shift(1))

# Candle range (high minus low)
df['range'] = df['high'] - df['low']

# Close minus open:
# positive means a bullish (green) candle,
# negative means a bearish (red) candle
df['co'] = df['close'] - df['open']

# Volume change: same calculation as the return, applied to volume
df['volume_change'] = df['volume'].pct_change()


# Indicator 1: simple moving averages (SMA)
df['sma_20'] = df['close'].rolling(window=20).mean()
df['sma_50'] = df['close'].rolling(window=50).mean()
# Difference between the fast and slow SMA
df['sma_20_50_diff'] = df['sma_20'] - df['sma_50']

# Indicator 2: exponential moving averages (EMA)
df['ema_12'] = df['close'].ewm(span=12, adjust=False).mean()
df['ema_26'] = df['close'].ewm(span=26, adjust=False).mean()
# Difference between the fast and slow EMA
df['ema_12_26_diff'] = df['ema_12'] - df['ema_26']

# Indicator 3: rolling volatility (standard deviation of log returns)
df['vol_20'] = df['log_return'].rolling(window=20).std()
df['vol_50'] = df['log_return'].rolling(window=50).std()

# Indicator 4: RSI over 14 periods, using simple rolling averages
window = 14

delta = df['close'].diff()

# Gains (positive changes) and losses (negative changes, stored as positive numbers)
gain = delta.clip(lower=0)
loss = -delta.clip(upper=0)

# Average gain and loss
avg_gain = gain.rolling(window=window).mean()
avg_loss = loss.rolling(window=window).mean()

# Relative strength (the 1e-9 avoids division by zero)
rs = avg_gain / (avg_loss + 1e-9)

# RSI formula
df['rsi_14'] = 100 - (100 / (1 + rs))

# Indicator 5: MACD line (the same quantity as ema_12_26_diff above)
df['macd'] = df['ema_12'] - df['ema_26']

# Signal line (9-period EMA of MACD)
df['macd_signal'] = df['macd'].ewm(span=9, adjust=False).mean()

# Histogram
df['macd_hist'] = df['macd'] - df['macd_signal']


# print(df.columns)
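
The RSI above averages gains and losses with a simple rolling mean. Many charting packages instead use Wilder's smoothing, an exponential average with `alpha = 1 / window`, so their values may not match this notebook's. The sketch below shows both variants side by side on toy price series. The `rsi` function and its `wilder` flag are hypothetical names introduced for illustration; the notebook itself uses only the simple variant.

```python
import numpy as np
import pandas as pd

def rsi(close: pd.Series, window: int = 14, wilder: bool = False) -> pd.Series:
    """RSI from closing prices, with simple or Wilder-style averaging (illustrative)."""
    delta = close.diff()
    gain = delta.clip(lower=0)
    loss = -delta.clip(upper=0)
    if wilder:
        # Wilder's smoothing: exponential average with alpha = 1 / window
        avg_gain = gain.ewm(alpha=1 / window, adjust=False, min_periods=window).mean()
        avg_loss = loss.ewm(alpha=1 / window, adjust=False, min_periods=window).mean()
    else:
        # Simple rolling mean, as in the cell above
        avg_gain = gain.rolling(window=window).mean()
        avg_loss = loss.rolling(window=window).mean()
    rs = avg_gain / (avg_loss + 1e-9)
    return 100 - (100 / (1 + rs))

rising = pd.Series(np.arange(100.0, 120.0))      # only gains: RSI close to 100
falling = pd.Series(np.arange(120.0, 100.0, -1))  # only losses: RSI of 0

rsi_rising_simple = rsi(rising)
rsi_rising_wilder = rsi(rising, wilder=True)
rsi_falling_simple = rsi(falling)
rsi_falling_wilder = rsi(falling, wilder=True)

print(rsi_rising_simple.iloc[-1], rsi_rising_wilder.iloc[-1])
print(rsi_falling_simple.iloc[-1], rsi_falling_wilder.iloc[-1])
```

On these extreme series both variants agree; on real prices they generally differ, because Wilder's average gives weight to the entire history rather than only the last 14 candles.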





## Step 3 — Define Target

Target = 1  
if next candle close > current close  
else 0

This makes it a **binary classification** problem.


In [7]:
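# 1 if the next close is strictly higher than the current close, else 0.
# Note: the final row has no next candle; comparing with NaN yields False,
# so that row is silently labeled 0.
df['target'] = (df['close'].shift(-1) > df['close']).astype(int)

# volume_change is infinite when the previous volume is 0; treat those as missing,
# then drop every row with a missing value (including indicator warm-up rows)
df.replace([np.inf, -np.inf], np.nan, inplace=True)
df = df.dropna().reset_index(drop=True)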
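
Two details of this labeling are easy to miss: a flat candle (next close equal to the current close) is labeled 0, and the last row is labeled 0 only because a comparison with `NaN` evaluates to `False`. A minimal sketch on a toy series makes both visible. The `has_next` column is a hypothetical addition for illustration and is not used elsewhere in the notebook.

```python
import pandas as pd

close = pd.Series([100.0, 101.0, 101.0, 99.5, 100.2])
next_close = close.shift(-1)

# Same rule as the cell above: 1 only when the next close is strictly higher
target = (next_close > close).astype(int)
# The last row has no next candle; flag it so it can be excluded if desired
has_next = next_close.notna()

labels = pd.DataFrame({
    "close": close,
    "next_close": next_close,
    "target": target,
    "has_next": has_next,
})
print(labels)
```

Filtering with `labels[labels["has_next"]]` would remove the undefined final label. With hundreds of thousands of rows, the effect of one mislabeled row on the results is negligible, but the filter keeps the labels strictly correct.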


## Step 4 — Train / Test Split and Scaling

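We use:
- The first 80% of rows (oldest) → training
- The last 20% of rows (most recent) → testing, standing in for future data

The split is chronological, with no shuffling, so the model never trains on data from after the test period.
Features are standardized so that no feature dominates because of its scale;
the scaler is fit on the training set only.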


In [8]:
feature_cols = [
    'return','log_return','range','co','volume_change',
    'sma_20','sma_50','sma_20_50_diff',
    'ema_12','ema_26','ema_12_26_diff',
    'vol_20','vol_50',
    'rsi_14',
    'macd','macd_signal','macd_hist'
]

X = df[feature_cols]
y = df['target']

train_size = int(len(df)*0.8)

X_train = X.iloc[:train_size]
X_test  = X.iloc[train_size:]
y_train = y.iloc[:train_size]
y_test  = y.iloc[train_size:]

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled  = scaler.transform(X_test)
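
The order of operations in this cell matters: the split comes first, and the scaler learns its mean and standard deviation from the training rows alone. A minimal sketch on a toy feature, assuming a single column of increasing values, shows what that implies for the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

values = np.arange(10, dtype=float).reshape(-1, 1)  # toy feature, oldest row first
split = int(len(values) * 0.8)

train, test = values[:split], values[split:]  # no shuffling: test is strictly later

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # mean and std come from the training rows only
test_scaled = scaler.transform(test)        # test rows reuse the training statistics

print("train mean after scaling:", train_scaled.mean())
print("test values after scaling:", test_scaled.ravel())
```

The scaled training data has mean 0 and unit variance, while the scaled test values do not. That is expected: fitting the scaler on all rows would let information from the test period leak into training.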


## Step 5 — Train Logistic Regression

Logistic Regression predicts the probability
that the next candle closes higher.


In [9]:
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train_scaled, y_train)

y_pred = log_reg.predict(X_test_scaled)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))


Accuracy: 0.5327012653395989

Confusion Matrix:
 [[79588 21834]
 [68757 23682]]
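
An accuracy figure is easier to judge next to a baseline. The sketch below takes the confusion matrix printed above (rows are actual classes, columns are predicted classes) and computes the accuracy of always predicting the more common class, plus the recall of each class.

```python
import numpy as np

# Confusion matrix copied from the output above
# rows = actual (0 = not up, 1 = up), columns = predicted (0, 1)
cm = np.array([[79588, 21834],
               [68757, 23682]])

total = cm.sum()
accuracy = np.trace(cm) / total
# Accuracy obtained by always predicting the more frequent actual class
majority_baseline = cm.sum(axis=1).max() / total
down_recall = cm[0, 0] / cm[0].sum()
up_recall = cm[1, 1] / cm[1].sum()

print(f"accuracy          : {accuracy:.4f}")
print(f"majority baseline : {majority_baseline:.4f}")
print(f"recall (class 0)  : {down_recall:.4f}")
print(f"recall (class 1)  : {up_recall:.4f}")
```

The model's accuracy sits only about one percentage point above the baseline, and most of its correct predictions come from class 0, which it predicts far more often than class 1.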


## Step 6 — Probability Threshold

Instead of predicting up/down directly,
we use the predicted probability.

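We only trade when the predicted probability of an up move exceeds a threshold (e.g., 0.6).
A higher threshold means fewer trades and, if the probabilities are informative, a higher hit rate on the trades that are taken.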


In [10]:
from sklearn.metrics import precision_score, recall_score

# Predicted probability of class 1 (next close higher)
y_prob = log_reg.predict_proba(X_test_scaled)[:, 1]
thresholds = [0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8]
for threshold in thresholds:
    print(f" for threshold = {threshold} we have our model details as :")
    # Signal a trade only when the probability exceeds the threshold
    y_pred_custom = (y_prob > threshold).astype(int)

    print("Number of trades:", y_pred_custom.sum())
    print("Trade frequency:", y_pred_custom.mean())

    print("UP Precision:", precision_score(y_test, y_pred_custom))
    print("UP Recall:", recall_score(y_test, y_pred_custom))

    # Share of signaled trades where the next close was higher.
    # This is the same quantity as UP precision; max(..., 1) guards against zero trades.
    hit_rate = (
        (y_pred_custom == 1) & (y_test == 1)
    ).sum() / max((y_pred_custom == 1).sum(), 1)
    print("Hit Rate on Trades:", hit_rate)
    print("\n\n\n")


 for threshold = 0.4 we have our model details as :
Number of trades: 193505
Trade frequency: 0.9981636327059078
UP Precision: 0.47687139867186895
UP Recall: 0.998247492941291
Hit Rate on Trades: 0.47687139867186895




 for threshold = 0.45 we have our model details as :
Number of trades: 160290
Trade frequency: 0.8268295325000903
UP Precision: 0.4833551687566286
UP Recall: 0.8381419097999762
Hit Rate on Trades: 0.4833551687566286




 for threshold = 0.5 we have our model details as :
Number of trades: 45516
Trade frequency: 0.23478678021881658
UP Precision: 0.5203005536514632
UP Recall: 0.2561905689157174
Hit Rate on Trades: 0.5203005536514632




 for threshold = 0.55 we have our model details as :
Number of trades: 8413
Trade frequency: 0.04339707316066666
UP Precision: 0.5348864852014739
UP Recall: 0.04868075163080518
Hit Rate on Trades: 0.5348864852014739




 for threshold = 0.6 we have our model details as :
Number of trades: 2180
Trade frequency: 0.01124517050876659
UP Precis

## Step 7 — Backtesting With Transaction Costs

Trading rule:
- If probability > 0.6 → Buy
- Hold 1 candle
- Exit next close

Assumed transaction cost: 0.02% per round-trip trade (0.01% to buy plus 0.01% to sell).


In [11]:
# Build a separate frame covering the test period
df_test = df.iloc[train_size:].copy()
df_test = df_test.reset_index(drop=True)

df_test['prob_up'] = y_prob
threshold = 0.6
df_test['signal'] = (df_test['prob_up'] > threshold).astype(int)

# Return from buying at this close and selling at the next close
df_test['future_return'] = (
    df_test['close'].shift(-1) / df_test['close'] - 1
)
df_test = df_test.dropna()  # drops the last row, which has no next close
# 0.01% to buy plus 0.01% to sell = 0.02% per round trip, i.e. 0.0002 as a fraction
transaction_cost = 0.0002
# Long-only: the strategy earns (return - cost) on signaled candles and 0 otherwise
df_test['strategy_return'] = df_test['signal'] * (df_test['future_return'] - transaction_cost)
print("this is cumulative records starting")
print((1+df_test['strategy_return']).cumprod().head(5))
print("this is cumulative records ending")
print((1+df_test['strategy_return']).cumprod().tail(5))
print("this is total percentage returns")
print((1+df_test['strategy_return']).prod()-1)

# Win rate and average return are computed over signaled candles only
win_rate = (df_test[df_test['signal'] == 1]['strategy_return'] > 0).mean()
print("Win rate after costs:", win_rate)

avg_trade_return = df_test[df_test['signal'] == 1]['strategy_return'].mean()
print("Average return per trade:", avg_trade_return)



this is cumulative records starting
0    1.0
1    1.0
2    1.0
3    1.0
4    1.0
Name: strategy_return, dtype: float64
this is cumulative records ending
193855    0.83084
193856    0.83084
193857    0.83084
193858    0.83084
193859    0.83084
Name: strategy_return, dtype: float64
this is total percentage returns
-0.1691596098612147
Win rate after costs: 0.46559633027522934
Average return per trade: -8.396096641707033e-05
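
Because each trade's return is the next-candle return minus a fixed cost, the average net return can be split into a gross part and the cost. The sketch below does that arithmetic with the two figures reported above. The variable names are introduced here for illustration.

```python
# Figures taken from the backtest above
avg_net_return = -8.396096641707033e-05  # average return per trade, after costs
transaction_cost = 0.0002                # 0.02% per trade, as assumed in this step

# strategy_return = future_return - cost on traded candles,
# so the average gross return is the average net return plus the cost
avg_gross_return = avg_net_return + transaction_cost
cost_to_edge_ratio = transaction_cost / avg_gross_return

print(f"average gross return per trade: {avg_gross_return:.6%}")
print(f"average net return per trade  : {avg_net_return:.6%}")
print(f"cost as a multiple of the edge: {cost_to_edge_ratio:.2f}x")
```

The gross edge is positive but smaller than the assumed cost, which is why the strategy loses money overall even though the signaled trades are slightly profitable before costs.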


## Step 8 — Feature Coefficients

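Logistic Regression assigns a weight to each feature.
Because the features were standardized, each weight is the change in the log-odds of an up move per one-standard-deviation increase in that feature, holding the others fixed:

- Positive → bullish influence
- Negative → bearish influence

Several features are nearly or exactly collinear (for example, `macd` and `ema_12_26_diff` are the same quantity), so individual weights should be read with caution.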


In [12]:
# Inspect the learned feature coefficients.
# Build a pandas Series where:
#   - the values are the model's learned coefficients
#   - the labels are the feature names
# then sort it so the most negative and most positive weights are easy to spot.
coefs = pd.Series(
    log_reg.coef_[0],
    index=feature_cols
).sort_values()

# Prints the weight of each feature, from most negative to most positive
print(coefs)


sma_20_50_diff   -0.088146
rsi_14           -0.080217
co               -0.060022
return           -0.028704
log_return       -0.026142
macd_hist        -0.020238
volume_change     0.000168
vol_20            0.006796
vol_50            0.006909
sma_20            0.010185
ema_26            0.010559
ema_12            0.010643
sma_50            0.010683
macd              0.032261
ema_12_26_diff    0.032261
macd_signal       0.040649
range             0.060959
dtype: float64
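
Log-odds weights are hard to read directly. Exponentiating a coefficient gives an odds ratio: the factor by which the odds of an up move are multiplied when that standardized feature rises by one standard deviation, with the other features held fixed. A minimal sketch, using a few coefficients copied from the output above:

```python
import numpy as np
import pandas as pd

# A subset of the coefficients printed above
selected_coefs = pd.Series({
    "sma_20_50_diff": -0.088146,
    "rsi_14": -0.080217,
    "volume_change": 0.000168,
    "range": 0.060959,
})

# exp(coefficient) = multiplicative change in the odds of "up"
# for a one-standard-deviation increase in the feature
odds_ratio = np.exp(selected_coefs)
print(odds_ratio.round(4))
```

Values below 1 lower the odds of an up move and values above 1 raise them. All of these ratios are within about 10% of 1, which matches the weak predictive power seen in the accuracy figures.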


## Step 9 — Conclusions

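- Directional predictability exists but is weak: test accuracy is 53.3%, about one percentage point above the 52.3% obtained by always predicting "not up"
- Transaction costs destroy most edges: at a 0.6 threshold the average trade loses about 0.008% after the 0.02% cost, and the backtest ends down 16.9%
- The negative weights on `return`, `co`, and `rsi_14` suggest that Logistic Regression learns short-term mean-reversion behavior
- Random Forest may capture nonlinear structure that a linear model cannot; this is not tested here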

Next steps:
- Random Forest Classifier
- Hyperparameter tuning
- Feature selection
- Risk management
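
As a starting point for the first item, the sketch below compares Logistic Regression with scikit-learn's `RandomForestClassifier` on synthetic data where the label depends on an interaction between two features. The data, the split, and the hyperparameters are assumptions chosen for illustration. The result says nothing about how a Random Forest would perform on the HDFC Bank features.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Synthetic data: the label depends on the sign of x1 * x2,
# an interaction that no single linear boundary can separate
X_syn = rng.normal(size=(2000, 2))
y_syn = (X_syn[:, 0] * X_syn[:, 1] > 0).astype(int)

# Same style of split as the notebook: first 80% for training, last 20% for testing
split = int(len(X_syn) * 0.8)
X_tr, X_te = X_syn[:split], X_syn[split:]
y_tr, y_te = y_syn[:split], y_syn[split:]

linear = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

linear_acc = accuracy_score(y_te, linear.predict(X_te))
forest_acc = accuracy_score(y_te, forest.predict(X_te))
print(f"logistic regression: {linear_acc:.3f}")
print(f"random forest      : {forest_acc:.3f}")
```

On real minute data the comparison should reuse the chronological split and the cost-aware backtest from the earlier steps, since a more flexible model can also overfit noise more easily.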
