# Multi-Stock Direction Prediction using XGBoost

This project explores next-week stock return direction prediction using a merged multi-asset dataset.  
The objective is to evaluate whether machine learning can detect generalized directional patterns across different equities.


## 1. Problem Statement

Financial markets are noisy and exhibit weak short-term predictability.

The objective of this project is to predict the next-week return direction (Up/Down) using historical market data from multiple equities.

Target Definition:
- 1 → Next-week return > 0
- 0 → Next-week return ≤ 0

This serves as a baseline classification experiment for financial time-series modeling.
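
As a minimal sketch of the target definition above, assuming a hypothetical series of weekly closing prices (the values are made up for illustration and are not taken from the dataset):

```python
import pandas as pd

# Hypothetical weekly closing prices, for illustration only
close = pd.Series(
    [100.0, 102.0, 101.0, 101.0, 105.0],
    index=pd.date_range("2018-01-07", periods=5, freq="W"),
)

weekly_return = close.pct_change()     # return realized in week t
next_return = weekly_return.shift(-1)  # return realized in week t+1

# 1 if next-week return > 0, else 0; the final week has no label yet and is dropped
target = (next_return > 0).astype(int)[next_return.notna()]
print(target.tolist())
```

A flat week (return exactly 0) falls into class 0 under this definition.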


In [None]:
!pip install yfinance xgboost



In [None]:
import yfinance as yf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## 2. Data Collection

Historical daily OHLCV data was collected from Yahoo Finance (via `yfinance`) and resampled to weekly bars.

Stocks Included:
- TCS.NS
- INFY.NS
- HDFCBANK.NS

The datasets were merged into a single combined dataset to evaluate whether a unified model can learn cross-asset directional patterns.

Data Frequency: Weekly  
Time Horizon: 2018-01-01 to 2023-12-31


In [None]:
stocks = ["TCS.NS", "INFY.NS", "HDFCBANK.NS"]
data_list = []

In [None]:
df_test = yf.download("TCS.NS", start="2018-01-01", end="2023-12-31")
df_test.head()



Ticker,TCS.NS,TCS.NS,TCS.NS,TCS.NS,TCS.NS
Date,Close,High,Low,Open,Volume
2018-01-01,1083.771606,1103.926395,1079.42927,1098.805763,1351760
2018-01-02,1077.872681,1093.603324,1073.366524,1089.670638,1920290
2018-01-03,1080.924072,1093.357003,1077.872197,1078.199937,1257120
2018-01-04,1088.482422,1090.489669,1081.477358,1085.573863,913082
2018-01-05,1101.631958,1105.851378,1085.573681,1085.573681,1153706


In [None]:
print(df_test.columns)

MultiIndex([( 'Close', 'TCS.NS'),
            (  'High', 'TCS.NS'),
            (   'Low', 'TCS.NS'),
            (  'Open', 'TCS.NS'),
            ('Volume', 'TCS.NS')],
           names=['Price', 'Ticker'])


In [None]:
for stock in stocks:
    df = yf.download(stock, start="2018-01-01", end="2023-12-31")
    df = df[["Open", "High", "Low", "Close", "Volume"]]
    df.columns = df.columns.get_level_values(0)
    # Aggregate daily bars into weekly bars (each week is labeled by its ending Sunday)
    df = df.resample("W").agg(
        {"Open": "first", "High": "max", "Low": "min", "Close": "last", "Volume": "sum"}
    )
    df["Stock"] = stock
    data_list.append(df)



In [None]:
data = pd.concat(data_list)
data.head()

Date,Open,High,Low,Close,Volume,Stock
2018-01-07,1098.805763,1105.851378,1073.366524,1101.631958,6595958,TCS.NS
2018-01-14,1106.056395,1155.992623,1096.450388,1137.333618,12638524,TCS.NS
2018-01-21,1137.660976,1221.578224,1120.414748,1212.279175,14035250,TCS.NS
2018-01-28,1215.851271,1338.237552,1199.837019,1281.571167,17062102,TCS.NS
2018-02-04,1285.821058,1324.255298,1268.328603,1294.957642,12616152,TCS.NS


In [None]:
data.shape

(939, 6)

In [None]:
# Weekly return
data["Return"] = data.groupby("Stock")["Close"].pct_change()
# Next week return (our target)
data["Return_next"] = data.groupby("Stock")["Return"].shift(-1)
data.head(10)

Date,Open,High,Low,Close,Volume,Stock,Return,Return_next
2018-01-07,1098.805763,1105.851378,1073.366524,1101.631958,6595958,TCS.NS,,0.032408
2018-01-14,1106.056395,1155.992623,1096.450388,1137.333618,12638524,TCS.NS,0.032408,0.065896
2018-01-21,1137.660976,1221.578224,1120.414748,1212.279175,14035250,TCS.NS,0.065896,0.057158
2018-01-28,1215.851271,1338.237552,1199.837019,1281.571167,17062102,TCS.NS,0.057158,0.010445
2018-02-04,1285.821058,1324.255298,1268.328603,1294.957642,12616152,TCS.NS,0.010445,-0.057695
2018-02-11,1277.033948,1309.062452,1189.427685,1220.244873,12452834,TCS.NS,-0.057695,-0.013006
2018-02-18,1222.954828,1241.453358,1187.723781,1204.374756,9803086,TCS.NS,-0.013006,0.048669
2018-02-25,1209.281409,1268.410666,1188.832419,1262.990479,13372030,TCS.NS,0.048669,-0.012387
2018-03-04,1262.662148,1268.800924,1240.077839,1247.345825,7506670,TCS.NS,-0.012387,-0.001185
2018-03-11,1251.986063,1284.014651,1222.503834,1245.868286,10506966,TCS.NS,-0.001185,-0.068686


## 3. Multi-Asset Data Integration

All three stocks were combined into a single dataset by stacking observations across time.

This approach allows:
- Increased training data size
- Testing model generalization across assets
- Learning broader market behavior patterns

Each observation represents a stock-week instance with engineered features.


In [None]:
data.groupby("Stock").tail(3)

Date,Open,High,Low,Close,Volume,Stock,Return,Return_next
2023-12-17,3374.087907,3628.386104,3318.302201,3595.836426,19011963,TCS.NS,0.064604,-0.009583
2023-12-24,3593.135594,3659.166271,3486.266235,3561.376953,11042750,TCS.NS,-0.009583,-0.008002
2023-12-31,3557.511794,3574.416099,3506.801632,3532.878662,5837092,TCS.NS,-0.008002,
2023-12-17,1402.812374,1495.595767,1349.289036,1486.039673,49400017,INFY.NS,0.058512,-0.00982
2023-12-24,1476.20109,1499.785331,1432.233826,1471.446655,29341429,INFY.NS,-0.00982,-0.012797
2023-12-31,1445.179225,1478.131086,1433.881407,1452.616943,21195159,INFY.NS,-0.012797,
2023-12-17,803.0558,811.791766,786.04612,806.219238,263125506,HDFCBANK.NS,0.002026,0.008632
2023-12-24,808.238971,822.377238,800.135714,813.178894,161423810,HDFCBANK.NS,0.008632,0.022982
2023-12-31,814.346917,837.780824,812.059518,831.867615,114142830,HDFCBANK.NS,0.022982,


## 4. Feature Engineering

To extract predictive signals from historical price data, the following features are used in this baseline:

- Weekly return
- Rolling volatility (10-week standard deviation of weekly returns)
- Weekly traded volume

Lagged returns (t-1, t-2, etc.) and moving averages (MA10) are natural additions but are not computed in the cells below.

Feature engineering plays a critical role in financial modeling due to the low signal-to-noise ratio in raw prices.
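
A minimal sketch of how per-stock lagged returns, a 10-period moving-average feature, and rolling volatility could be computed on a stacked frame. The helper name `add_features`, the column names `Return_lag1`, `Return_lag2`, and `Close_to_MA10`, and the synthetic prices are assumptions made for illustration; they are not part of the notebook's pipeline.

```python
import numpy as np
import pandas as pd

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add per-stock features; expects 'Stock' and 'Close' columns."""
    out = df.copy()
    out["Return"] = out.groupby("Stock")["Close"].pct_change()
    by_stock = out.groupby("Stock")
    # Lagged returns: groupby().shift() never pulls values across stocks
    out["Return_lag1"] = by_stock["Return"].shift(1)
    out["Return_lag2"] = by_stock["Return"].shift(2)
    # Close relative to its 10-period moving average (scale-free across stocks)
    ma10 = by_stock["Close"].transform(lambda s: s.rolling(10).mean())
    out["Close_to_MA10"] = out["Close"] / ma10 - 1
    # Rolling volatility; transform() keeps every value on its original row
    out["Volatility_10"] = by_stock["Return"].transform(lambda s: s.rolling(10).std())
    return out

# Synthetic two-stock example, stacked in non-alphabetical order
rng = np.random.default_rng(0)
dates = pd.date_range("2018-01-07", periods=12, freq="W")
frames = []
for name, start in [("ZZZ", 100.0), ("AAA", 50.0)]:
    close = start * np.cumprod(1 + rng.normal(0, 0.02, len(dates)))
    frames.append(pd.DataFrame({"Close": close, "Stock": name}, index=dates))
demo = add_features(pd.concat(frames))
print(demo.tail(3))
```

Using `transform` rather than `groupby().rolling()` is a deliberate choice here: the latter returns rows in sorted group order, which can differ from the row order of a stacked frame whose dates repeat across stocks.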


In [None]:
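# transform() keeps each value on its own row; groupby().rolling() returns rows in sorted
# group order, which misaligns stocks here because the same dates repeat for every stock
data["Volatility_10"] = data.groupby("Stock")["Return"].transform(lambda s: s.rolling(10).std())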
data.head(15)

Date,Open,High,Low,Close,Volume,Stock,Return,Return_next,Volatility_10
2018-01-07,1098.805763,1105.851378,1073.366524,1101.631958,6595958,TCS.NS,,0.032408,
2018-01-14,1106.056395,1155.992623,1096.450388,1137.333618,12638524,TCS.NS,0.032408,0.065896,
2018-01-21,1137.660976,1221.578224,1120.414748,1212.279175,14035250,TCS.NS,0.065896,0.057158,
2018-01-28,1215.851271,1338.237552,1199.837019,1281.571167,17062102,TCS.NS,0.057158,0.010445,
2018-02-04,1285.821058,1324.255298,1268.328603,1294.957642,12616152,TCS.NS,0.010445,-0.057695,
2018-02-11,1277.033948,1309.062452,1189.427685,1220.244873,12452834,TCS.NS,-0.057695,-0.013006,
2018-02-18,1222.954828,1241.453358,1187.723781,1204.374756,9803086,TCS.NS,-0.013006,0.048669,
2018-02-25,1209.281409,1268.410666,1188.832419,1262.990479,13372030,TCS.NS,0.048669,-0.012387,
2018-03-04,1262.662148,1268.800924,1240.077839,1247.345825,7506670,TCS.NS,-0.012387,-0.001185,
2018-03-11,1251.986063,1284.014651,1222.503834,1245.868286,10506966,TCS.NS,-0.001185,-0.068686,


In [None]:
data = data.dropna()
data.shape

(906, 9)

In [None]:
features = ["Return", "Volatility_10", "Volume"]
X = data[features]
y = data["Return_next"]
X.head()

Date,Return,Volatility_10,Volume
2018-03-18,-0.068686,0.02464,103373550
2018-03-25,-0.002672,0.024704,17402818
2018-04-01,0.011,0.020581,12324438
2018-04-08,0.035502,0.021379,9380540
2018-04-15,0.068807,0.021022,19838292


## 5. Train-Test Strategy

A chronological split at 2022-01-01 was used, giving 594 training and 312 test observations (roughly 66% / 34%), to preserve temporal ordering and avoid look-ahead bias.

Time-series data must respect sequential structure to ensure realistic evaluation.
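
A minimal sketch of deriving the cutoff date from a target training fraction instead of hard-coding it. The helper `chronological_split` and the toy frame are hypothetical; the idea is only that the cutoff is chosen over distinct dates, so all stocks share the same boundary.

```python
import pandas as pd

def chronological_split(df: pd.DataFrame, train_frac: float = 0.8):
    """Split a date-indexed, stacked frame so every training row precedes every test row."""
    dates = df.index.unique().sort_values()
    cutoff = dates[int(len(dates) * train_frac)]  # first date of the test period
    train = df[df.index < cutoff]
    test = df[df.index >= cutoff]
    return train, test, cutoff

# Hypothetical stacked frame: 2 stocks x 10 weekly dates
dates = pd.date_range("2018-01-07", periods=10, freq="W")
demo = pd.concat(
    [pd.DataFrame({"Stock": s, "x": range(10)}, index=dates) for s in ("AAA", "BBB")]
)
train, test, cutoff = chronological_split(demo)
print(len(train), len(test), cutoff.date())
```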


In [None]:
# Sort by date to be safe
data = data.sort_index()

# Define split date
split_date = "2022-01-01"

train = data[data.index < split_date]
test = data[data.index >= split_date]

X_train = train[features]
y_train = train["Return_next"]

X_test = test[features]
y_test = test["Return_next"]

print("Train size:", X_train.shape)
print("Test size:", X_test.shape)

Train size: (594, 3)
Test size: (312, 3)


## 6. Model: XGBoost Regressor

XGBoost was selected due to:

- Strong performance on structured tabular data
- Ability to capture nonlinear feature interactions
- Built-in regularization to mitigate overfitting

The model is trained as a regressor (`XGBRegressor`) on next-week return; the predicted direction is then taken from the sign of the predicted return.


In [None]:
from xgboost import XGBRegressor
model = XGBRegressor(n_estimators=200, max_depth=4, learning_rate=0.05, random_state=42)
model.fit(X_train, y_train)
print("Model trained successfully.")

Model trained successfully.


In [None]:
from sklearn.metrics import mean_squared_error
# Predictions
y_pred = model.predict(X_test)
# RMSE
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("RMSE:", rmse)
# Correlation
correlation = np.corrcoef(y_test, y_pred)[0,1]
print("Correlation:", correlation)

RMSE: 0.03381204830631196
Correlation: 0.026959981524933972


In [None]:
# Convert to direction
direction_actual = (y_test > 0).astype(int)
direction_pred = (y_pred > 0).astype(int)
accuracy = (direction_actual == direction_pred).mean()
print("Directional Accuracy:", accuracy)

Directional Accuracy: 0.5


## 7. Results

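The baseline model achieved 50% directional accuracy on the test period of the merged multi-stock dataset, with an RMSE of about 0.034 and a correlation of about 0.027 between predicted and actual next-week returns.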

Given the stochastic and regime-dependent nature of financial markets, this result reflects the difficulty of short-term return prediction.

The model serves as a benchmark for further experimentation.
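
A directional accuracy figure is easier to interpret next to a naive reference. The sketch below, using made-up return arrays rather than the notebook's predictions, compares directional accuracy with the accuracy of always predicting the more frequent class; the function names are hypothetical.

```python
import numpy as np

def directional_accuracy(actual_returns, predicted_returns) -> float:
    """Share of periods where the predicted sign matches the realized sign."""
    actual_up = np.asarray(actual_returns) > 0
    predicted_up = np.asarray(predicted_returns) > 0
    return float((actual_up == predicted_up).mean())

def majority_baseline(actual_returns) -> float:
    """Accuracy of always predicting the more frequent direction."""
    up_share = float((np.asarray(actual_returns) > 0).mean())
    return max(up_share, 1.0 - up_share)

# Made-up values for illustration only
actual = np.array([0.01, -0.02, 0.03, 0.00, 0.02, -0.01])
predicted = np.array([0.02, 0.01, -0.01, -0.02, 0.01, -0.03])

print("Model accuracy:", directional_accuracy(actual, predicted))
print("Majority baseline:", majority_baseline(actual))
```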


## 8. Limitations

- Financial markets exhibit non-stationarity and structural breaks.
- Short-term directional signals are weak and highly noisy.
- A single static train-test split may not capture regime shifts.
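
One way to address the single-split limitation is to evaluate over several consecutive test windows. The following is a minimal sketch of expanding-window (walk-forward) splits over distinct dates; `walk_forward_splits` and its parameters are hypothetical names, and the fold sizes are chosen only for illustration.

```python
import pandas as pd

def walk_forward_splits(dates, n_folds: int = 3, min_train: int = 4):
    """Yield (train_dates, test_dates) pairs with an expanding training window."""
    unique_dates = pd.DatetimeIndex(dates).unique().sort_values()
    test_size = (len(unique_dates) - min_train) // n_folds
    for fold in range(n_folds):
        train_end = min_train + fold * test_size
        yield unique_dates[:train_end], unique_dates[train_end:train_end + test_size]

dates = pd.date_range("2018-01-07", periods=10, freq="W")
folds = list(walk_forward_splits(dates))
for train_dates, test_dates in folds:
    print(len(train_dates), "train weeks ->", len(test_dates), "test weeks")
```

For a stacked, date-indexed frame `df`, the rows of a fold can then be selected with `df[df.index.isin(train_dates)]` and `df[df.index.isin(test_dates)]`.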


## 9. Future Work

Planned extensions include:

- Implementing Rainbow DQN for reinforcement learning-based trading policy development
- Applying walk-forward validation
- Incorporating deep learning models (LSTM) for sequential pattern recognition
- Evaluating performance using financial metrics such as Sharpe Ratio and drawdown analysis
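
As a rough sketch of the financial metrics mentioned above, assuming a zero risk-free rate, weekly strategy returns (52 periods per year), and made-up return values; the function names are hypothetical:

```python
import numpy as np

def sharpe_ratio(returns, periods_per_year: int = 52) -> float:
    """Annualized Sharpe ratio, assuming a zero risk-free rate."""
    r = np.asarray(returns, dtype=float)
    return float(np.sqrt(periods_per_year) * r.mean() / r.std(ddof=1))

def max_drawdown(returns) -> float:
    """Largest peak-to-trough decline of the compounded equity curve (a value <= 0)."""
    r = np.asarray(returns, dtype=float)
    equity = np.concatenate([[1.0], np.cumprod(1.0 + r)])  # start from initial capital 1.0
    running_peak = np.maximum.accumulate(equity)
    return float((equity / running_peak - 1.0).min())

# Made-up weekly strategy returns, for illustration only
strategy_returns = np.array([0.10, -0.50, 0.20])
print("Sharpe:", sharpe_ratio(strategy_returns))
print("Max drawdown:", max_drawdown(strategy_returns))
```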
