# **Project Name**    -



##### **Project Type**    - Regression
##### **Contribution**    - Individual
##### **Team Member**    - Kavya Reddy Chinnam


# **Project Summary -**

This project focuses on analyzing and predicting the stock closing prices of Yes Bank using historical stock data and various regression models. The goal is to build accurate predictive models that can estimate future closing prices based on historical open, high, and low prices.

Data Loading and Preprocessing

The dataset consists of daily stock prices of Yes Bank, including the columns Date, Open, High, Low, and Close prices. The data is loaded from a CSV file (data_YesBank_StockPrices.csv) into a pandas DataFrame for analysis.

The Date column is converted to datetime format to allow time-series sorting and analysis. The dataset is sorted by Date to maintain chronological order, which is essential for time-dependent stock data. Basic exploratory data analysis is performed to check the dataset’s structure, missing values, and duplicates. Any rows with missing values are removed to ensure clean data for model training.

Exploratory Data Analysis

Several visualization techniques help understand the data:

A correlation matrix is computed and visualized using a heatmap to identify relationships between features. The strong correlations observed between the Close price and other price features (Open, High, Low) justify the use of these as predictors.

Scatter plots visualize the closing price over time, showing trends and variations.

Box plots provide an overview of the distribution and potential outliers in the stock price features.

Hypothesis testing confirms the presence of strong linear relationships between Close prices and other features, validating the use of regression models.

Model Development

The dataset is split into training and testing sets (80-20 split) to evaluate model performance on unseen data.

Three machine learning models are developed and compared:

Linear Regression: A simple model assuming a linear relationship between predictors (Open, High, Low) and the target (Close). It serves as a baseline to compare more complex models.

Decision Tree Regressor: A non-linear model that splits the feature space into regions based on decision rules. GridSearchCV is used to tune hyperparameters such as maximum tree depth and minimum samples per split for better generalization and reduced overfitting.

Random Forest Regressor: An ensemble model combining multiple decision trees to improve accuracy and reduce variance. Hyperparameters like the number of trees, maximum depth, and minimum samples split are optimized using GridSearchCV with cross-validation.

Model Evaluation

Models are evaluated using two key metrics:

Mean Squared Error (MSE): Measures the average squared difference between predicted and actual closing prices. Lower values indicate better model accuracy.

R-squared (R²) Score: Represents the proportion of variance in the dependent variable explained by the model. Values closer to 1 indicate better fit.
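
As a rough illustration of the two metrics above, the sketch below computes MSE and R² by hand on a tiny made-up example. The helper names `mean_squared_error_manual` and `r2_score_manual` and the numbers are assumptions for illustration only; they are not taken from the Yes Bank data.

```python
# Hand-computed MSE and R^2 on a tiny made-up example (values are illustrative only).

def mean_squared_error_manual(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r2_score_manual(actual, predicted):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


actual = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 12.0, 13.0, 16.0]

mse = mean_squared_error_manual(actual, predicted)
r2 = r2_score_manual(actual, predicted)
print("MSE:", mse)
print("R2:", r2)
```

Here two of the four predictions are off by 1, so the squared errors sum to 2 and the MSE is 0.5, while the model explains 90% of the variance around the mean of 13.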

The linear regression model provides a baseline R² score; the tuned decision tree and random forest models are then compared against it to see whether modelling non-linear relationships improves the fit.

Results

In this analysis, the linear regression model performed best of the three, with the highest R² score and the lowest MSE on the test data.

The best-performing model is selected based on the highest R² score.

Model Deployment

The selected best model is serialized and saved using Python’s pickle module, allowing for later reuse without retraining. This saved model can be loaded and used for real-time predictions or integration into a larger application, such as a stock forecasting tool or a financial decision support system.


# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


#### The objective of the project is to develop an effective machine learning model that can predict the closing price of Yes Bank’s stock using historical stock data, including open, high, and low prices. The challenge lies in capturing the underlying patterns and relationships within the financial time-series data to provide reliable and accurate price predictions.

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV
import pickle

### Dataset Loading

In [None]:
# Load Dataset
df=pd.read_csv('data_YesBank_StockPrices.csv')

### Dataset First View

In [None]:
# Dataset First Look
df

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("Dataset shape (rows, columns):", df.shape)

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_count = df.duplicated().sum()
print("Number of duplicate rows:", duplicate_count)

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print("Missing values:\n", df.isnull().sum())

### What did you know about your dataset?

My dataset contains 5 columns (Date, Open, High, Low, Close).
##### Date needs to be converted to datetime format to sort and visualize trends.
##### Open, High, and Low are highly correlated with Close.
##### Close is used as the target variable.
##### This dataset contains 185 records.


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print(df.columns)

### Variables Description

##### Date needs to be converted to datetime format to sort and visualize trends.
##### The Close column is typically used as the target variable for stock price prediction.
##### Features like Open, High, Low are highly correlated with Close.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
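# Print the number of unique values per column
# (separate bare expressions in one cell would only display the last one).
for col in ['Open', 'Close', 'High', 'Low', 'Date']:
    print(col, ":", df[col].nunique(), "unique values")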


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
df = df.dropna()

### What all manipulations have you done and insights you found?

I dropped rows with missing values from the dataset.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
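# Correlation matrix of the numeric price columns
# (Date is still a string at this point, so it is excluded explicitly)
corr_matrix = df[['Open', 'High', 'Low', 'Close']].corr()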
sns.heatmap(corr_matrix, annot=True)
plt.show()

##### 1. Why did you pick the specific chart?

To examine the correlation between numerical variables.

##### 2. What is/are the insight(s) found from the chart?

#### Reveals how strongly variables are related (values from -1 to 1).
#### Useful for spotting multicollinearity among the predictors.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
print("scatter plot")
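# Parse the month-year Date strings so the x-axis is chronological rather than categorical
plt.scatter(pd.to_datetime(df['Date'], format='%b-%y'), df['Close'])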
plt.xlabel("Date")
plt.ylabel("Close Price")
plt.show()

##### 1. Why did you pick the specific chart?

To visualize the trend of the stock closing prices over time.

##### 2. What is/are the insight(s) found from the chart?

#### Shows how the stock price changes with time.
#### Helps identify upward or downward trends, spikes, or crashes.
#### Useful for detecting seasonality or volatility.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
print("box plot")
plt.figure(figsize=(8, 5))
sns.boxplot(data=df[['Open', 'High', 'Low', 'Close']])
plt.title("Box Plot of Stock Price Features")
plt.ylabel("Price")
plt.show()

##### 1. Why did you pick the specific chart?

To analyze the distribution and outliers of stock price features.

##### 2. What is/are the insight(s) found from the chart?

#### Quickly compares Open, High, Low, and Close prices.
#### Box plots show the median, the interquartile range (IQR), and outliers (unusual values).
#### Useful for detecting price variability or anomalies.
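
To make the box plot quantities concrete, here is a minimal sketch of the usual 1.5 × IQR rule that box plots apply by default to flag outliers. The helper name `iqr_outlier_bounds` and the price values are made up for illustration; they are not the Yes Bank prices.

```python
import numpy as np


def iqr_outlier_bounds(values):
    """Return (median, q1, q3, lower_fence, upper_fence) using the 1.5 * IQR rule."""
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    return median, q1, q3, q1 - 1.5 * iqr, q3 + 1.5 * iqr


# Made-up prices for illustration; 400.0 is deliberately extreme.
prices = np.array([10.0, 12.0, 13.0, 15.0, 18.0, 20.0, 22.0, 25.0, 400.0])

median, q1, q3, lower, upper = iqr_outlier_bounds(prices)
outliers = prices[(prices < lower) | (prices > upper)]

print("median:", median, "IQR:", q3 - q1)
print("fences:", lower, upper)
print("outliers:", outliers)
```

Points beyond the fences are the ones a box plot draws as individual markers.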

## ***5. Hypothesis Testing***

### Hypothetical Statement - 1

#### Null Hypothesis: There is no strong linear relationship between the Close price and the Open, High, and Low prices.
#### Alternative Hypothesis: There is a strong linear relationship between the Close price and the Open, High, and Low prices.

#### Hypothesis test.

In [None]:
# Check that Close correlates strongly (|r| > 0.8) with Open, High and Low,
# selecting by label instead of by position
close_corr = corr_matrix.loc[['Open', 'High', 'Low'], 'Close']
if (close_corr.abs() > 0.8).all():
    print(" Hypothesis 1 supported: Strong linear relationships found.")
else:
    print(" Hypothesis 1 not supported: Correlations are weak.")
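
The threshold check above compares correlation values against 0.8 but does not produce a p-value. One possible complement, sketched here on synthetic data, is `scipy.stats.pearsonr`, which tests the null hypothesis that two variables are uncorrelated. The arrays `open_price` and `close_price` and the significance level `alpha` are assumptions for illustration, not the real dataset.

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic stand-in data: 185 "Open" prices and a noisy "Close" that tracks them.
rng = np.random.default_rng(0)
open_price = rng.uniform(10, 400, size=185)
close_price = open_price + rng.normal(0, 15, size=185)

# pearsonr returns the correlation coefficient and a two-sided p-value
r, p_value = pearsonr(open_price, close_price)

alpha = 0.05  # assumed significance level
print("Pearson r:", round(r, 3))
if p_value < alpha:
    print("Reject the null hypothesis of no linear correlation.")
else:
    print("Cannot reject the null hypothesis of no linear correlation.")
```

On the real data, the same call could be applied to `df['Open']` and `df['Close']` (and likewise for High and Low).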

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
# dropna() returns a new DataFrame, so assign the result back
df = df.dropna()

#### What all missing value imputation techniques have you used and why did you use those techniques?

I used `dropna()` to remove rows with missing values; no values were imputed.

### 1. Data Transformation

In [None]:
# data transformation
df['Date'] = pd.to_datetime(df['Date'], format='%b-%y')
df = df.sort_values('Date')


#### Splitting of data

In [None]:
x=df[['Open','High','Low']]
y=df['Close']
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=42)
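
The split above shuffles rows at random. Because the data is sorted by date, a chronological split (train on earlier rows, test on later rows) is one alternative worth comparing. The sketch below shows it on a small synthetic frame named `demo`; the frame and its values are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Small synthetic frame standing in for the date-sorted stock data.
demo = pd.DataFrame({
    "Date": pd.date_range("2020-01-01", periods=10, freq="MS"),
    "Open": range(10, 20),
    "High": range(12, 22),
    "Low": range(8, 18),
    "Close": range(11, 21),
})

x_demo = demo[["Open", "High", "Low"]]
y_demo = demo["Close"]

# shuffle=False keeps row order: the first 80% is training, the last 20% is testing
x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, test_size=0.2, shuffle=False)

last_train_date = demo.loc[x_tr.index, "Date"].max()
first_test_date = demo.loc[x_te.index, "Date"].min()
print("train ends:", last_train_date.date(), "| test starts:", first_test_date.date())
```

With this variant, every test row is later in time than every training row, which is closer to how a forecast would be used.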

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation
print("Linear Regression model")
model1=LinearRegression()
model1.fit(x_train,y_train)
y_pred1=model1.predict(x_test)
mse1=mean_squared_error(y_test,y_pred1)
r2_1=r2_score(y_test,y_pred1)



#### Display

In [None]:
# x_test holds the Open/High/Low features, not dates
print("test features:", x_test)

print('actual value:', y_test)

print('predicted value:', y_pred1)

print('mean squared error:', mse1)

print('r2 score:', r2_1)


#### Why this model

Linear Regression is a simple algorithm that assumes a linear relationship between input features (Open, High, Low) and the output (Close).

It tries to find the best-fit line that minimizes the difference between predicted values and actual values.
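
As a sketch of what "best fit" means here, the example below fits `LinearRegression` on synthetic data and checks that it matches an ordinary least-squares solution from `numpy.linalg.lstsq`. The data and the coefficients used to generate it are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic features standing in for Open, High, Low (not real prices).
rng = np.random.default_rng(42)
X = rng.uniform(10, 400, size=(50, 3))
y = 0.2 * X[:, 0] + 0.5 * X[:, 1] + 0.3 * X[:, 2] + rng.normal(0, 1, size=50)

model = LinearRegression().fit(X, y)

# Same problem solved directly: add a column of ones for the intercept,
# then minimise the sum of squared residuals.
X_aug = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)

print("sklearn coefficients:", model.coef_)
print("lstsq coefficients:  ", beta[1:])
print("intercepts match:", np.isclose(beta[0], model.intercept_))
```

Both approaches minimise the same squared-error objective, so the coefficients agree up to numerical precision.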

#### ML model 2

In [None]:
#decision tree model2
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

# Decision Tree model with GridSearchCV
print("Tuned Decision Tree Model")
param_grid_dt = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10]
}

grid_dt = GridSearchCV(DecisionTreeRegressor(), param_grid_dt, cv=5, scoring='r2')
grid_dt.fit(x_train, y_train)

best_dt = grid_dt.best_estimator_
y_pred2 = best_dt.predict(x_test)
mse2 = mean_squared_error(y_test, y_pred2)
r2_2 = r2_score(y_test, y_pred2)

#### Display

In [None]:
print("Best parameters (Decision Tree):", grid_dt.best_params_)
# x_test holds the Open/High/Low features, not dates
print("test features:", x_test)
print("actual values:", y_test)
print("predicted values:", y_pred2)
print("mean squared error:", mse2)
print("r2 score:", r2_2)

#### Why this model

Decision Tree Regressor splits the data into branches based on feature values and predicts output based on decision rules.

It builds a tree structure with nodes representing features and branches representing possible decisions based on those features.
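
The decision rules can be inspected directly. The sketch below trains a deliberately shallow tree on synthetic data and prints its rules with `sklearn.tree.export_text`. The data and the `max_depth=2` setting are assumptions chosen to keep the output short.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic features standing in for Open, High, Low (not real prices).
rng = np.random.default_rng(0)
X = rng.uniform(10, 400, size=(100, 3))
y = X.mean(axis=1)  # synthetic target loosely tied to the features

# A depth-2 tree has at most four leaves, so the printed rules stay readable
tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

rules = export_text(tree, feature_names=["Open", "High", "Low"])
print(rules)
```

Each indented line is a threshold test on one feature, and each leaf shows the value the tree predicts for rows that reach it.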

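#### ML model 3

In [None]:
from sklearn.ensemble import RandomForestRegressor
# Random Forest with GridSearchCV; defines grid_rf and best_rf (grid values are illustrative choices)
print("Tuned Random Forest Model")
param_grid_rf = {'n_estimators': [50, 100], 'max_depth': [5, 10, None], 'min_samples_split': [2, 5]}
grid_rf = GridSearchCV(RandomForestRegressor(random_state=42), param_grid_rf, cv=5, scoring='r2')
grid_rf.fit(x_train, y_train)
best_rf = grid_rf.best_estimator_
y_pred3 = best_rf.predict(x_test)
mse3 = mean_squared_error(y_test, y_pred3)
r2_3 = r2_score(y_test, y_pred3)


#### Display

In [None]:
print("Best parameters (Random Forest):", grid_rf.best_params_)
print("test features:", x_test)
print("actual values:", y_test)
print("predicted values:", y_pred3)
print("mean squared error:", mse3)
print("r2 score:", r2_3)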

#### Why this model

Random Forest Regressor is an ensemble method that combines multiple decision trees to improve accuracy and control overfitting.

It creates multiple trees and aggregates their predictions, typically by averaging the results in regression tasks.
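
The averaging step can be checked by hand. In this sketch, on synthetic data, the forest's prediction is compared with the mean of its individual trees' predictions, which are available through the fitted model's `estimators_` attribute. The data and `n_estimators=10` are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic features standing in for Open, High, Low (not real prices).
rng = np.random.default_rng(1)
X = rng.uniform(10, 400, size=(80, 3))
y = X.mean(axis=1) + rng.normal(0, 5, size=80)

forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

sample = X[:5]
# One row of predictions per tree, shape (n_trees, n_samples)
per_tree = np.array([tree.predict(sample) for tree in forest.estimators_])
manual_avg = per_tree.mean(axis=0)

print("forest prediction:", forest.predict(sample))
print("mean of the trees:", manual_avg)
```

The two printed arrays agree, since a regression forest predicts the mean of its trees.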

#### Hyperparameter tuning

GridSearchCV was used to optimize the performance of the Decision Tree and Random Forest models by finding the best combination of hyperparameters.

#### Reason for choosing that method

GridSearchCV is used because it:
##### Improves accuracy.
##### Helps avoid overfitting and underfitting.
##### Has built-in cross-validation.
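
To show what the grid search evaluates, the sketch below runs `GridSearchCV` on synthetic data with the same decision tree grid as above and prints the mean cross-validated R² for every combination. The data is made up; only the grid values come from this notebook.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic features standing in for Open, High, Low (not real prices).
rng = np.random.default_rng(0)
X = rng.uniform(10, 400, size=(100, 3))
y = X.mean(axis=1) + rng.normal(0, 5, size=100)

param_grid = {
    "max_depth": [3, 5, 10, None],
    "min_samples_split": [2, 5, 10],
}

grid = GridSearchCV(DecisionTreeRegressor(random_state=0), param_grid, cv=5, scoring="r2")
grid.fit(X, y)

# Each combination is scored as the mean R^2 over the 5 validation folds
for params, score in zip(grid.cv_results_["params"], grid.cv_results_["mean_test_score"]):
    print(params, round(score, 3))

print("best:", grid.best_params_, round(grid.best_score_, 3))
```

The grid has 4 × 3 = 12 combinations, and the one with the highest mean validation score becomes `best_params_`.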

#### Which is the best model?

In [None]:
best_model = max((r2_1, "Linear Regression"), (r2_2, "Decision Tree"), (r2_3, "Random Forest"))
print("The best model is:", best_model)

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File
# Labels must match the names used when best_model was selected
if best_model[1] == "Linear Regression":
    final_model = model1
    filename = 'linear_regression_model.pkl'
elif best_model[1] == "Decision Tree":
    final_model = best_dt
    filename = 'decision_tree_model.pkl'
elif best_model[1] == "Random Forest":
    final_model = best_rf
    filename = 'random_forest_model.pkl'
else:
    raise ValueError("Unknown model type")

# Save using pickle
with open(filename, 'wb') as file:
    pickle.dump(final_model, file)

print(f" Best model saved as '{filename}'")

### 2. Again Load the saved model file


In [None]:
# Load the File
with open(filename, 'rb') as file:
    loaded_model = pickle.load(file)
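
Once loaded, the model can be used like the original. The sketch below shows a pickle round trip in memory on a tiny made-up dataset and then predicts for one new row. The training values and `new_row` are hypothetical.

```python
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny made-up training set: columns stand for Open, High, Low.
X = np.array([
    [10.0, 12.0, 9.0],
    [20.0, 23.0, 19.0],
    [30.0, 33.0, 28.0],
    [40.0, 45.0, 38.0],
])
y = np.array([11.0, 22.0, 31.0, 43.0])

model = LinearRegression().fit(X, y)

# dumps/loads is the in-memory equivalent of dump/load with a file
blob = pickle.dumps(model)
restored = pickle.loads(blob)

new_row = np.array([[25.0, 27.0, 24.0]])  # hypothetical Open, High, Low
print("prediction from restored model:", restored.predict(new_row))
```

Pickle files should only be loaded from trusted sources, and a model is safest to load with the same scikit-learn version that saved it.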

# **Conclusion**

This project explored how machine learning can help predict Yes Bank’s stock closing prices using past price data. After cleaning and analyzing the data, I tested different models and found that the linear regression model gave the most accurate predictions.