**Required Library Import**

In [1]:
import numpy as np
import pandas as pd

from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_percentage_error
from sklearn.model_selection import train_test_split
import pickle
import warnings 
warnings.filterwarnings("ignore")

In [2]:
data=pd.read_csv("/kaggle/input/tata-steel-stock-data/TATASTEEL.csv")

In [3]:
data.columns

Index(['Date', 'Symbol', 'Series', 'Prev Close', 'Open', 'High', 'Low', 'Last',
       'Close', 'VWAP', 'Volume', 'Turnover', 'Trades', 'Deliverable Volume',
       '%Deliverble'],
      dtype='object')

# Problem Statement 

**Tata Steel stock belongs to a highly volatile sector of the stock market where daily prices are influenced by multiple factors such as market dependency, buying and selling pressure, and news events. These factors cause frequent fluctuations in the stock price.
In this project, we aim to train a predictive model using several years of historical data of Tata Steel to forecast the next-day Open, High, Low, and Close (OHLC) prices. The objective is to capture historical price patterns and market behavior to support data-driven analysis and decision-making.**

# Objective Of The Project 

**The primary objective of this project is to analyze historical stock price data of Tata Steel and develop a machine learning model capable of predicting the next-day Open, High, Low, and Close (OHLC) prices with reasonable accuracy.
The project also aims to understand price trends, volatility patterns, and time-series dependencies present in the stock data.**

# Dataset Description

**The dataset used in this project is collected from NSE (National Stock Exchange) historical stock data for Tata Steel. It contains roughly 21 years of daily trading data (January 2000 to April 2021), which provides a long-term view of price movements and market behavior.**

**The dataset includes the following features:**

- Date – Trading date
- Symbol – Stock symbol (TISCO in the early rows, TATASTEEL later)
- Series – Equity series (e.g., EQ)
- Prev Close – Previous day's closing price
- Open – Opening price of the day
- High – Highest price of the day
- Low – Lowest price of the day
- Last – Last traded price
- Close – Closing price of the day
- VWAP – Volume Weighted Average Price
- Volume – Total traded quantity
- Turnover – Total traded value
- Trades – Number of executed trades
- Deliverable Volume – Quantity of shares delivered
- %Deliverble – Share of the traded quantity that was delivered, stored as a fraction (e.g., 0.2159); the column name is misspelled in the source data

**This dataset is well-suited for time-series analysis and machine learning modeling to predict next-day OHLC prices, as it captures both price-based and volume-based market dynamics.**

# EDA (Exploratory Data Analysis)

In [4]:
data.head()

Unnamed: 0,Date,Symbol,Series,Prev Close,Open,High,Low,Last,Close,VWAP,Volume,Turnover,Trades,Deliverable Volume,%Deliverble
0,2000-01-03,TISCO,EQ,142.35,148.0,153.2,146.1,152.5,152.45,150.92,2003185,30231640000000.0,,,
1,2000-01-04,TISCO,EQ,152.45,150.1,153.0,143.05,151.95,150.8,151.03,1555136,23487850000000.0,,,
2,2000-01-05,TISCO,EQ,150.8,144.6,162.9,144.6,158.0,156.55,156.85,3840284,60233640000000.0,,,
3,2000-01-06,TISCO,EQ,156.55,158.95,169.1,158.95,169.0,168.25,167.61,2560449,42915300000000.0,,,
4,2000-01-07,TISCO,EQ,168.25,173.4,179.0,166.3,170.55,171.95,173.89,3641691,63324590000000.0,,,


In [5]:
data.tail()

Unnamed: 0,Date,Symbol,Series,Prev Close,Open,High,Low,Last,Close,VWAP,Volume,Turnover,Trades,Deliverable Volume,%Deliverble
5301,2021-04-26,TATASTEEL,EQ,925.6,935.0,956.0,930.05,942.5,940.75,942.98,21234858,2002407000000000.0,274958.0,4584617.0,0.2159
5302,2021-04-27,TATASTEEL,EQ,940.75,948.3,983.0,944.3,982.0,977.75,965.43,24904515,2404346000000000.0,331493.0,3575969.0,0.1436
5303,2021-04-28,TATASTEEL,EQ,977.75,985.0,986.0,962.0,971.0,971.4,972.08,20447968,1987700000000000.0,255599.0,3550908.0,0.1737
5304,2021-04-29,TATASTEEL,EQ,971.4,983.0,1036.95,983.0,1035.0,1031.35,1015.76,44718647,4542359000000000.0,554647.0,5539528.0,0.1239
5305,2021-04-30,TATASTEEL,EQ,1031.35,1024.0,1052.6,1011.1,1025.6,1034.0,1031.95,28129738,2902854000000000.0,385840.0,3536863.0,0.1257


In [6]:
print("Shape of dataset-->\n", data.shape)
print("Size of dataset-->\n", data.size)

Shape of dataset-->
 (5306, 15)
Size of dataset-->
 79590


In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5306 entries, 0 to 5305
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Date                5306 non-null   object 
 1   Symbol              5306 non-null   object 
 2   Series              5306 non-null   object 
 3   Prev Close          5306 non-null   float64
 4   Open                5306 non-null   float64
 5   High                5306 non-null   float64
 6   Low                 5306 non-null   float64
 7   Last                5306 non-null   float64
 8   Close               5306 non-null   float64
 9   VWAP                5306 non-null   float64
 10  Volume              5306 non-null   int64  
 11  Turnover            5306 non-null   float64
 12  Trades              2456 non-null   float64
 13  Deliverable Volume  4792 non-null   float64
 14  %Deliverble         4792 non-null   float64
dtypes: float64(11), int64(1), object(3)
memory usage: 621.9

**The dataset is a pandas DataFrame with 5306 rows and fifteen columns. The data types are a mix of floats (11 columns), integers (1 column), and objects (3 columns). The columns Trades, Deliverable Volume, and %Deliverble contain missing (null) values.**

In [8]:
data.describe()

Unnamed: 0,Prev Close,Open,High,Low,Last,Close,VWAP,Volume,Turnover,Trades,Deliverable Volume,%Deliverble
count,5306.0,5306.0,5306.0,5306.0,5306.0,5306.0,5306.0,5306.0,5306.0,2456.0,4792.0,4792.0
mean,403.385658,404.253581,411.21046,396.509197,403.467414,403.553703,404.062991,6165253.0,266487600000000.0,93969.26873,1550750.0,0.260951
std,187.146366,187.559958,190.791329,183.858461,187.26519,187.312178,187.436529,5329084.0,301286100000000.0,58218.860189,1215813.0,0.107903
min,67.25,66.0,69.7,66.0,67.3,67.25,67.97,23291.0,215916500000.0,2796.0,24158.0,0.0451
25%,275.775,275.6,284.4125,270.0,275.8125,275.9375,276.935,2801380.0,111871900000000.0,57557.25,769850.0,0.1805
50%,402.85,403.0,409.375,396.65,402.7,402.9,403.43,4800300.0,194930300000000.0,79400.0,1250946.0,0.253
75%,523.9875,525.0,534.725,516.4875,523.95,524.075,525.23,7833888.0,337964000000000.0,110710.25,2018066.0,0.3277
max,1031.35,1024.0,1052.6,1011.1,1035.0,1034.0,1031.95,64284600.0,4881124000000000.0,626502.0,26434720.0,0.9701


**The data.describe() output provides a statistical summary of the numerical columns. It shows that Tata Steel stock has a wide price range with high variability (for Close: min ~67, max ~1034, std ~187). The counts for Trades, Deliverable Volume, and %Deliverble are below 5306, which confirms that these columns contain missing values. Volume and Turnover are on a much larger scale than the price columns, and %Deliverble averages around 0.26 (26%). The quartiles indicate that the middle 50% of closing prices lie between about 276 and 524.**

**Missing Value Analysis**

In [9]:
print("sum missing value in the dataset are:\n")
for i in data.columns:
    missing=data[i].isnull().sum()
    if missing>0:
        print(f"percentage of missing value in {i} is\n",(missing*100)/len(data))

sum missing value in the dataset are:

percentage of missing value in Trades is
 53.71277798718432
percentage of missing value in Deliverable Volume is
 9.68714662646061
percentage of missing value in %Deliverble is
 9.68714662646061


**The missing value analysis shows that about 53.7% of the rows in the Trades column are missing, which makes the column hard to impute reliably and a poor candidate feature. The other two columns, Deliverable Volume and %Deliverble, are each missing about 9.7% of their values, so these gaps need to be handled appropriately.**
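
The same percentages can be computed without a loop or a hard-coded row count. The snippet below is a minimal sketch on a made-up toy frame (the values and the name `toy` are illustrative, not taken from the dataset); it relies on the fact that the mean of a boolean mask is the fraction of `True` values.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the stock data (values are made up for illustration).
toy = pd.DataFrame({
    "Close": [100.0, 101.5, 99.8, 102.3],
    "Trades": [np.nan, np.nan, 5200.0, np.nan],
    "Deliverable Volume": [np.nan, 800.0, 950.0, 700.0],
})

# isnull() gives a boolean mask; its column-wise mean is the fraction of missing rows.
missing_pct = toy.isnull().mean() * 100

# Keep only the columns that actually have gaps, largest share first.
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```

On the real frame, replacing `toy` with `data` should reproduce the figures printed by the loop above.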

In [10]:
#data["Deliverable Volume"]=np.fillna(data["Deliverable Volume"],)

**Duplicate checking**

In [11]:
data.duplicated().sum()

0

**No duplicate rows were found in the dataset.**

In [12]:
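# Assign the results back; both calls return new objects and do not modify data in place
data["Date"]=pd.to_datetime(data["Date"])
data=data.sort_values("Date")
data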

Unnamed: 0,Date,Symbol,Series,Prev Close,Open,High,Low,Last,Close,VWAP,Volume,Turnover,Trades,Deliverable Volume,%Deliverble
0,2000-01-03,TISCO,EQ,142.35,148.00,153.20,146.10,152.50,152.45,150.92,2003185,3.023164e+13,,,
1,2000-01-04,TISCO,EQ,152.45,150.10,153.00,143.05,151.95,150.80,151.03,1555136,2.348785e+13,,,
2,2000-01-05,TISCO,EQ,150.80,144.60,162.90,144.60,158.00,156.55,156.85,3840284,6.023364e+13,,,
3,2000-01-06,TISCO,EQ,156.55,158.95,169.10,158.95,169.00,168.25,167.61,2560449,4.291530e+13,,,
4,2000-01-07,TISCO,EQ,168.25,173.40,179.00,166.30,170.55,171.95,173.89,3641691,6.332459e+13,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5301,2021-04-26,TATASTEEL,EQ,925.60,935.00,956.00,930.05,942.50,940.75,942.98,21234858,2.002407e+15,274958.0,4584617.0,0.2159
5302,2021-04-27,TATASTEEL,EQ,940.75,948.30,983.00,944.30,982.00,977.75,965.43,24904515,2.404346e+15,331493.0,3575969.0,0.1436
5303,2021-04-28,TATASTEEL,EQ,977.75,985.00,986.00,962.00,971.00,971.40,972.08,20447968,1.987700e+15,255599.0,3550908.0,0.1737
5304,2021-04-29,TATASTEEL,EQ,971.40,983.00,1036.95,983.00,1035.00,1031.35,1015.76,44718647,4.542359e+15,554647.0,5539528.0,0.1239


**Pipeline for NaN filling**

In [13]:
class Datapipeline:
    """Small helper for imputing NaNs, dropping columns, and splitting the date."""

    def __init__(self, data):
        self.data = data.copy()   # work on a copy so the caller's frame is untouched
        self.nan_strategy = {}    # maps column name -> "mean", "median", or "mode"

    def set_nan_strategy(self, column_strategy):
        self.nan_strategy.update(column_strategy)

    def nan_fill(self):
        for col, strategy in self.nan_strategy.items():
            if col not in self.data.columns:
                print(f"column {col} not present in dataset")
                continue
            if strategy == "mean":
                value = self.data[col].mean()
            elif strategy == "mode":
                value = self.data[col].mode()[0]   # first mode if there are ties
            elif strategy == "median":
                value = self.data[col].median()
            else:
                raise ValueError("strategy must be mean, mode, or median")

            self.data[col] = self.data[col].fillna(value)
            print(f"[NAN_FILL] in {col} using {strategy} : {value:.4f}")
        return self.data

    def get_dataset(self):
        return self.data

    def drop_column(self, column):
        # pandas documents only "raise" and "ignore" for drop's errors argument
        self.data = self.data.drop(columns=column, errors="raise")
        print(f"column dropped from dataset {column}")
        return self.data

    def Split_Date_column(self):
        # Break the trading date into numeric parts that a regressor can use
        self.data["Date"] = pd.to_datetime(self.data["Date"])
        self.data["Year"] = self.data["Date"].dt.year
        self.data["Month"] = self.data["Date"].dt.month
        self.data["day"] = self.data["Date"].dt.day
        return self.data

In [14]:
pipeline=Datapipeline(data)
pipeline.set_nan_strategy({
    "Deliverable Volume":"mean",
    "%Deliverble":"mode",
    "Trades":"median"
    })
data=pipeline.nan_fill()


[NAN_FILL] in Deliverable Volume using mean : 1550749.8082
[NAN_FILL] in %Deliverble using mode : 0.2010
[NAN_FILL] in Trades using median : 79400.0000


**Separation of categorical and numerical columns**

**Categorical column encoding**

In [15]:
pipeline= Datapipeline(data)
data=pipeline.Split_Date_column()

**The categorical columns are Symbol and Series. Neither is useful as a feature for a single-stock model, and Date has already been split into Year, Month, and day, so we drop Date, Symbol, and Series from the dataset.**

In [16]:
pipeline=Datapipeline(data)
data=pipeline.drop_column(["Date","Symbol","Series"])

column dropped from dataset ['Date', 'Symbol', 'Series']


In [17]:
numerical_col=data.select_dtypes(include=["int64","float64","int32"])
categorical_col=data.select_dtypes(include=["O"])
print("numerical column is ",numerical_col.columns.to_list())
print("\n categorical column is ", categorical_col.columns.to_list())

numerical column is  ['Prev Close', 'Open', 'High', 'Low', 'Last', 'Close', 'VWAP', 'Volume', 'Turnover', 'Trades', 'Deliverable Volume', '%Deliverble', 'Year', 'Month', 'day']

 categorical column is  []


**Target columns**

In [18]:
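# Targets: next day's OHLC, obtained by shifting each column up by one row
data["open"]=data["Open"].shift(-1)
data["close"]=data["Close"].shift(-1)
data["low"]=data["Low"].shift(-1)
data["high"]=data["High"].shift(-1)
# The last row has no next day, so its targets are NaN and the row is dropped
data=data.dropna()
y=data[["open","low","high","close"]]
# Features: everything except the targets; Trades is excluded because ~54% of it was missing
x=data.drop(["open","high","low","close","Trades"],axis=1)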
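
To see what `shift(-1)` does to the alignment, here is a small sketch on a four-row toy series. The frame `toy` and the column name `next_close` are hypothetical; the notebook itself uses the lowercase names `open`, `high`, `low`, and `close` for the shifted targets.

```python
import pandas as pd

# Four consecutive closing prices (made-up values).
toy = pd.DataFrame({"Close": [10.0, 11.0, 12.0, 13.0]})

# shift(-1) moves every value up one row, so each row now also
# carries the following day's close as its prediction target.
toy["next_close"] = toy["Close"].shift(-1)

# The final row has no following day, so its target is NaN and is removed.
toy = toy.dropna()
print(toy)
```

Each remaining row pairs today's `Close` with tomorrow's value, which is the supervised layout the models are trained on.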

In [19]:
for i in data.columns:
    q1=data[i].quantile(0.25)
    q2=data[i].quantile(0.75)
    iqr=q2-q1
    lower=q1-1.5*iqr
    upper=q2+1.5*iqr
    outlier=((data[i]<lower)|(data[i]>upper)).sum()
    print(f"sum of outlier in {i} is ", outlier)

sum of outlier in Prev Close is  30
sum of outlier in Open is  33
sum of outlier in High is  38
sum of outlier in Low is  25
sum of outlier in Last is  32
sum of outlier in Close is  31
sum of outlier in VWAP is  32
sum of outlier in Volume is  299
sum of outlier in Turnover is  270
sum of outlier in Trades is  2455
sum of outlier in Deliverable Volume is  277
sum of outlier in %Deliverble is  77
sum of outlier in Year is  0
sum of outlier in Month is  0
sum of outlier in day is  0
sum of outlier in open is  34
sum of outlier in close is  33
sum of outlier in low is  25
sum of outlier in high is  38


In [20]:
data.skew()

Prev Close            0.189430
Open                  0.192806
High                  0.210376
Low                   0.177251
Last                  0.195363
Close                 0.193886
VWAP                  0.194239
Volume                2.726401
Turnover              4.844322
Trades                4.461069
Deliverable Volume    4.103746
%Deliverble           0.765465
Year                  0.013374
Month                 0.013875
day                   0.010313
open                  0.196960
close                 0.198352
low                   0.181648
high                  0.214783
dtype: float64

**Observation of outlier and skewness**
A high outlier count alone is not a reason to drop a feature. However, when a feature is highly skewed, heavily outlier-dominated, and redundant with other features, it increases model variance without adding predictive power.
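
The link between the two diagnostics can be illustrated on a tiny series. The sketch below wraps the same 1.5 × IQR rule used in the loop above in a helper (the name `iqr_outlier_count` and the sample values are assumptions for illustration) and shows that a single extreme value is enough to produce both an IQR outlier and a large positive skew.

```python
import pandas as pd

def iqr_outlier_count(series: pd.Series) -> int:
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return int(((series < lower) | (series > upper)).sum())

# Made-up volume-like values: seven ordinary days and one extreme spike.
volume = pd.Series([100, 120, 110, 130, 125, 115, 105, 5000], dtype="float64")

n_outliers = iqr_outlier_count(volume)
skewness = volume.skew()
print("outliers:", n_outliers)
print("skew:", round(skewness, 3))
```

This is roughly the pattern seen in Volume, Turnover, and Deliverable Volume above, where large outlier counts go together with skew values well above 2.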

**Correlation analysis**

In [21]:
corr_matrix = data.corr()
for target in y.columns:
    print(f"\nCorrelation with target: {target}")
    print(corr_matrix[target].sort_values(ascending=False))


Correlation with target: open
open                  1.000000
Close                 0.999356
Last                  0.999350
high                  0.999059
VWAP                  0.999019
low                   0.998755
High                  0.998413
Low                   0.998379
close                 0.998093
Open                  0.997434
Prev Close            0.997363
Year                  0.470805
Turnover              0.428153
Volume                0.173820
Trades                0.168284
%Deliverble           0.140260
Deliverable Volume    0.090122
day                  -0.004666
Month                -0.009159
Name: open, dtype: float64

Correlation with target: low
low                   1.000000
close                 0.999056
open                  0.998755
Last                  0.998448
Close                 0.998427
high                  0.998301
VWAP                  0.997976
Low                   0.997716
High                  0.996975
Open                  0.996159
Prev Close   

In [22]:
x.columns

Index(['Prev Close', 'Open', 'High', 'Low', 'Last', 'Close', 'VWAP', 'Volume',
       'Turnover', 'Deliverable Volume', '%Deliverble', 'Year', 'Month',
       'day'],
      dtype='object')

In [23]:
X=x.drop(columns=["day","Month","%Deliverble","Deliverable Volume"],axis=1)

In [24]:
class RegressionPipeline:
    def __init__(self, X, y, test_size=0.2, random_state=42):
        self.X = X
        self.y = y
        self.test_size = test_size
        self.random_state = random_state

        self.models = {
            "LinearRegression": MultiOutputRegressor(LinearRegression()),
            "Ridge": MultiOutputRegressor(Ridge()),
            "Lasso": MultiOutputRegressor(Lasso()),
            "RandomForest": MultiOutputRegressor(
                RandomForestRegressor(n_estimators=100, random_state=42)
            ),
            "GradientBoosting": MultiOutputRegressor(
                GradientBoostingRegressor(random_state=42)
            )
        }

        self.results = []

    def train_test_split(self):
        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(
            self.X, self.y,
            test_size=self.test_size,
            random_state=self.random_state,
            shuffle=False   # ⚠️ important for time series: keeps the split chronological
        )

    def evaluate(self):
        self.train_test_split()

        for name, model in self.models.items():
            model.fit(self.X_train, self.y_train)
            y_pred = model.predict(self.X_test)

            r2 = r2_score(self.y_test, y_pred)
            mse = mean_squared_error(self.y_test, y_pred)
            rmse = np.sqrt(mse)
            mape = mean_absolute_percentage_error(self.y_test, y_pred)

            self.results.append({
                "Model": name,
                "R2_Score": r2,
                "MSE": mse,
                "RMSE": rmse,
                "MAPE": mape
            })

        return pd.DataFrame(self.results).sort_values(by="R2_Score", ascending=False)
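
The effect of `shuffle=False` is easiest to see on ordered toy data. This is a minimal sketch with made-up arrays (`X_toy` and `y_toy` are illustrative names): with shuffling disabled, `train_test_split` takes the first 80% of rows for training and the last 20% for testing, so the model is never evaluated on days that precede its training data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten "days" in chronological order; the value doubles as the day index.
X_toy = np.arange(10).reshape(-1, 1)
y_toy = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy,
    test_size=0.2,
    shuffle=False,  # keep row order: earliest rows train, latest rows test
)

print("train days:", y_train.tolist())
print("test days:", y_test.tolist())
```

Note that `random_state` has no effect once shuffling is turned off, so the split is deterministic either way.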

In [25]:
y.isna().sum()

open     0
low      0
high     0
close    0
dtype: int64

In [26]:
pipeline = RegressionPipeline(X, y)
results = pipeline.evaluate()

results

Unnamed: 0,Model,R2_Score,MSE,RMSE,MAPE
0,LinearRegression,0.99488,91.592122,9.570377,0.012848
1,Ridge,0.994879,91.605243,9.571063,0.012849
2,Lasso,0.994818,92.633717,9.624641,0.013064
3,RandomForest,0.993119,122.849717,11.083759,0.014988
4,GradientBoosting,0.992829,128.030485,11.315056,0.014887
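
The scores in this table are averaged over all four targets, because `r2_score` uses a uniform average across outputs by default. If a per-target breakdown is wanted, `multioutput="raw_values"` returns one score per column. The sketch below uses made-up two-column arrays (the column labels `open` and `close` are only illustrative), not the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true and predicted values for two targets (columns: open, close).
y_true = np.array([[100.0, 95.0],
                   [102.0, 97.0],
                   [104.0, 99.0],
                   [106.0, 101.0]])
y_pred = np.array([[101.0, 95.0],
                   [101.0, 97.0],
                   [105.0, 99.0],
                   [105.0, 101.0]])

# One R2 value per target column instead of a single averaged score.
per_target_r2 = r2_score(y_true, y_pred, multioutput="raw_values")

# Default behaviour: uniform average of the per-target scores.
averaged_r2 = r2_score(y_true, y_pred)

for name, score in zip(["open", "close"], per_target_r2):
    print(f"{name}: {score:.3f}")
print(f"average: {averaged_r2:.3f}")
```

Applied inside `evaluate`, the same argument would show whether one of the four OHLC targets is predicted noticeably worse than the others.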
