- **pandas library**: Pandas is a data manipulation library for Python, providing powerful data structures and tools for working with structured data, such as tables and CSV files.

- **scikit-learn (sklearn)**: Scikit-learn is a machine learning library in Python that offers a wide range of tools for data analysis and modeling, including classification, regression, and clustering algorithms.

- **pickle**: Pickle is a Python module for serializing and deserializing Python objects, enabling the storage and retrieval of complex data structures, including machine learning models, in a compact binary format.


# 1. Data Collection

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import pickle

In [2]:
data = pd.read_csv("HDFCBANK.csv")
data.head()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,2022-09-30,1378.800049,1431.449951,1365.0,1421.349976,1405.234863,7890878
1,2022-10-03,1409.949951,1417.849976,1401.099976,1413.199951,1397.177246,5770556
2,2022-10-04,1429.5,1458.0,1426.150024,1453.0,1436.526123,5769263
3,2022-10-06,1459.949951,1462.599976,1434.199951,1437.0,1420.70752,6274599
4,2022-10-07,1430.25,1434.949951,1420.349976,1430.800049,1414.577759,6648997


# 2. Feature Engineering

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 248 entries, 0 to 247
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Date       248 non-null    object 
 1   Open       248 non-null    float64
 2   High       248 non-null    float64
 3   Low        248 non-null    float64
 4   Close      248 non-null    float64
 5   Adj Close  248 non-null    float64
 6   Volume     248 non-null    int64  
dtypes: float64(5), int64(1), object(1)
memory usage: 13.7+ KB


In [4]:
data['Date'] = pd.to_datetime(data['Date'])

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 248 entries, 0 to 247
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   Date       248 non-null    datetime64[ns]
 1   Open       248 non-null    float64       
 2   High       248 non-null    float64       
 3   Low        248 non-null    float64       
 4   Close      248 non-null    float64       
 5   Adj Close  248 non-null    float64       
 6   Volume     248 non-null    int64         
dtypes: datetime64[ns](1), float64(5), int64(1)
memory usage: 13.7 KB
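
As an alternative to converting the column after loading, `pd.read_csv` can parse dates while reading via its `parse_dates` argument. The following is a minimal sketch, assuming the same column layout as above; the two rows are copied from the `data.head()` output and held in an in-memory string (the name `csv_text` is illustrative) so the example runs without `HDFCBANK.csv` on disk.

```python
import io

import pandas as pd

# Two rows copied from the data.head() output, kept in memory so that the
# sketch does not depend on HDFCBANK.csv being present.
csv_text = """Date,Open,High,Low,Close,Adj Close,Volume
2022-09-30,1378.800049,1431.449951,1365.0,1421.349976,1405.234863,7890878
2022-10-03,1409.949951,1417.849976,1401.099976,1413.199951,1397.177246,5770556
"""

# parse_dates converts the listed columns to datetime while reading
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

# The 'Date' column is now a datetime dtype instead of object
print(sample.dtypes)
```

With the real file, the equivalent call would be `pd.read_csv("HDFCBANK.csv", parse_dates=["Date"])`, which makes the separate `pd.to_datetime` step unnecessary.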


In [6]:
data[['Date', 'Open', 'Close']]

Unnamed: 0,Date,Open,Close
0,2022-09-30,1378.800049,1421.349976
1,2022-10-03,1409.949951,1413.199951
2,2022-10-04,1429.500000,1453.000000
3,2022-10-06,1459.949951,1437.000000
4,2022-10-07,1430.250000,1430.800049
...,...,...,...
243,2023-09-25,1525.000000,1531.000000
244,2023-09-26,1525.000000,1537.650024
245,2023-09-27,1523.000000,1526.849976
246,2023-09-28,1534.000000,1523.699951


# Problem Statement

**Predict today's Closing Price based on yesterday's Closing Price**

In [7]:
# Create the feature 'Yesterday_Close' by shifting 'Close' down by one row,
# so each row carries the closing price of the previous trading day
data['Yesterday_Close'] = data['Close'].shift(1)

# Drop the first row, which has no previous day and therefore NaN in 'Yesterday_Close'
data = data.dropna()

data.head()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,Yesterday_Close
1,2022-10-03,1409.949951,1417.849976,1401.099976,1413.199951,1397.177246,5770556,1421.349976
2,2022-10-04,1429.5,1458.0,1426.150024,1453.0,1436.526123,5769263,1413.199951
3,2022-10-06,1459.949951,1462.599976,1434.199951,1437.0,1420.70752,6274599,1453.0
4,2022-10-07,1430.25,1434.949951,1420.349976,1430.800049,1414.577759,6648997,1437.0
5,2022-10-10,1408.0,1426.0,1398.199951,1415.0,1398.956909,6554651,1430.800049
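
To make the effect of `shift(1)` concrete, here is a small sketch on four closing prices (rounded from the first rows of the table above). The names `close` and `frame` are illustrative and not part of the notebook.

```python
import pandas as pd

# Four closing prices, rounded from the first rows of the dataset
close = pd.Series([1421.35, 1413.20, 1453.00, 1437.00], name="Close")

# shift(1) moves every value down one row; the first row becomes NaN
frame = pd.DataFrame({"Close": close, "Yesterday_Close": close.shift(1)})

# Remove the row that has no previous close
frame = frame.dropna(subset=["Yesterday_Close"])

print(frame)
```

Note that the shift is by row, not by calendar day: if the data skips a date (weekends, holidays), "yesterday" means the previous trading day in the file.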


# 3. Splitting data into Training and Testing sets 

In [8]:
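# Split the data into features (X) and target (y).
# Note: 'Open', 'High', 'Low' and 'Adj Close' are same-day values, so this model
# uses more than yesterday's close. 'High', 'Low' and 'Adj Close' are only known
# once the trading day has ended.
X = data[['Yesterday_Close', 'Open', 'High', 'Low', 'Adj Close']]
y = data['Close']

# Split the data into training (80%) and testing (20%) sets.
# train_test_split shuffles rows by default, so test days are interleaved with training days.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)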
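
For time-ordered data, a common alternative is to split chronologically and to use only features that are available before the trading day ends. The sketch below is one way to do that, not the approach used in this notebook: it keeps only `Yesterday_Close` and passes `shuffle=False` so the most recent rows form the test set. The prices are a synthetic random walk (not the HDFCBANK data), and the names `demo`, `X_demo` and `y_demo` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic random-walk prices (NOT the HDFCBANK data), used only so that
# the sketch runs on its own.
rng = np.random.default_rng(0)
close = pd.Series(1400 + rng.normal(0, 10, size=100).cumsum(), name="Close")

# Build the lagged feature and drop the first row, which has no previous close
demo = pd.DataFrame({"Close": close, "Yesterday_Close": close.shift(1)}).dropna()

X_demo = demo[["Yesterday_Close"]]  # only information known before today's session
y_demo = demo["Close"]

# shuffle=False preserves row order: earlier rows train, the latest rows test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, shuffle=False
)

print(len(X_tr), len(X_te))
```

Because this setup evaluates the model only on days that come after everything it was trained on, its scores are usually a more realistic estimate of forecasting performance than scores from a shuffled split.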

# 4. Model Training

In [9]:
# Create a Linear Regression model
model = LinearRegression()

# Train the model on the training data
model.fit(X_train, y_train)

# Predict closing prices for the held-out test set
y_pred = model.predict(X_test)
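
A fitted `LinearRegression` exposes its learned weights through the `coef_` and `intercept_` attributes, which helps when checking how much each feature contributes. The sketch below uses a tiny made-up dataset in which the target is exactly `2 * x + 1`, so the recovered values are easy to verify; `toy_model`, `X_toy` and `y_toy` are illustrative names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data following y = 2 * x + 1 exactly
X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = np.array([3.0, 5.0, 7.0, 9.0])

toy_model = LinearRegression().fit(X_toy, y_toy)

# coef_ holds one weight per feature; intercept_ is the constant term
print("Coefficients:", toy_model.coef_)
print("Intercept:", toy_model.intercept_)
```

Applied to the notebook's model, `model.coef_` would list one weight for each of the five feature columns, in the same order as the columns of `X`.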

# 5. Model Evaluation

In [10]:
# Evaluate the model on the testing data
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error: ", mse)
print("R-Squared: ", r2)

Mean Squared Error:  22.410547891531348
R-Squared:  0.9952223658844194


# 6. Model Deployment

In [11]:
# Save the trained model to a file using pickle
with open('lag.pkl', 'wb') as model_file:
    pickle.dump(model, model_file)

print("Model saved as 'lag.pkl'")

Model saved as 'lag.pkl'
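
To use a saved model later, it can be read back with `pickle.load`. The sketch below shows the round trip on a small stand-in model written to a temporary directory, so it does not touch `lag.pkl`; the file name `demo_model.pkl` and the toy data are assumptions for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Small stand-in model trained on made-up data
X_small = np.array([[1.0], [2.0], [3.0], [4.0]])
y_small = np.array([2.0, 4.0, 6.0, 8.0])
original = LinearRegression().fit(X_small, y_small)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")

    # Serialize the fitted model to disk
    with open(path, "wb") as model_file:
        pickle.dump(original, model_file)

    # Deserialize it again; only unpickle files from sources you trust
    with open(path, "rb") as model_file:
        restored = pickle.load(model_file)

# The restored model should give the same predictions as the original
same = np.allclose(original.predict(X_small), restored.predict(X_small))
print(same)
```

When loading `lag.pkl`, the input passed to `predict` needs the same five feature columns, in the same order, that the model was trained on. Pickled scikit-learn models are also not guaranteed to load correctly under a different scikit-learn version.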
