# **Project Name**    - **Yes Bank Stock Closing Price Prediction**





##### **Project Type**    - Regression, Random Forest
##### **Contribution**    - Individual
##### **Team Member 1 -** Mohammed Ashfakh

# **Project Summary -** Yes Bank Stock Closing Price Prediction

This project focuses on predicting the monthly closing price of Yes Bank stock using machine learning techniques. The dataset consists of 185 records with five variables: Date, Open, High, Low, and Close. The primary objective of the project was to build a reliable regression model that can accurately forecast the closing price based on historical stock price data.

The project began with Exploratory Data Analysis (EDA) to understand the structure and quality of the dataset. Initial checks confirmed that there were no missing values or duplicate records. Descriptive statistics and visualization techniques were used to examine the distribution of stock prices and identify patterns. Correlation analysis revealed a very strong positive linear relationship between Open, High, Low, and Close prices, indicating that these variables are highly influential in predicting the closing price.

Two hypothesis tests were conducted to statistically validate relationships in the dataset. Pearson’s correlation test confirmed a significant linear relationship between Open and Close prices, supporting the suitability of regression modeling. Additionally, a t-test was used to examine differences between opening and closing prices. These statistical tests strengthened the reliability of the modeling approach.

Outlier detection was performed using boxplots, and the Interquartile Range (IQR) method was applied to identify extreme values. Instead of removing data points, a capping technique was used to reduce the influence of outliers while preserving important information. This ensured model stability without losing valuable stock price patterns.

For feature engineering and selection, the Date column was removed as it does not directly contribute to numerical prediction in a regression framework. The Open, High, and Low columns were selected as independent variables due to their strong correlation with the Close price. Data scaling was performed using Standardization to ensure that all features were on a similar scale, which improves regression model performance.

Two machine learning models were implemented: Linear Regression and Random Forest Regressor. Linear Regression was chosen as the baseline model because of its simplicity and interpretability. Random Forest was selected to capture potential non-linear relationships and interactions between features.

Model evaluation was conducted using Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R² Score. Linear Regression achieved an MAE of 4.98, RMSE of 8.52, and an R² score of 0.9913. Random Forest produced comparatively higher error values with an R² score of 0.9793 before tuning. After hyperparameter optimization using GridSearchCV, Random Forest improved slightly, but Linear Regression still outperformed it in terms of accuracy and error reduction.

Based on the evaluation metrics, Linear Regression was selected as the final prediction model due to its lower error values and higher explanatory power.

Overall, the project demonstrates that stock closing prices can be effectively predicted using regression techniques when strong linear relationships exist in the data. The high R² score indicates strong predictive capability, making the model suitable for supporting financial analysis, investment planning, and risk assessment decisions. The project successfully follows the complete machine learning pipeline, from data preprocessing and statistical validation to model training, evaluation, and optimization.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


The objective of this project is to develop a machine learning model to predict the monthly closing price of Yes Bank stock using historical stock data. The dataset includes Open, High, Low, and Close prices, and the goal is to analyze the relationships among these variables to build an accurate regression model. Accurate prediction of closing prices can support better financial analysis, investment planning, and data-driven decision-making.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart answers the following format.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# Load Dataset
df=pd.read_csv('data_YesBank_StockPrices.csv')

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("ROWS:",df.shape[0])
print("COLUMNS:",df.shape[1])

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
import missingno as msno
plt.figure(figsize=(10,5))
msno.bar(df)
plt.title("Missing values per column",y=1.1)
plt.show()

### What did you know about your dataset?


The dataset contains the monthly stock prices of Yes Bank. It has 185 rows and 5 columns: Date, Open, High, Low, and Close. Date is stored as an object (string) type, and the remaining four columns are floats. The dataset has no duplicate rows and no missing values.
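The structural checks summarized above (expected columns, duplicates, missing values) can be bundled into a small loader that fails early with a clear message. This is a minimal sketch, not the notebook's own code: `load_stock_csv` and `REQUIRED_COLUMNS` are hypothetical names, and the in-memory sample uses made-up values in place of the real file.

```python
import io
import pandas as pd

# Columns the notebook expects (hypothetical constant name)
REQUIRED_COLUMNS = ["Date", "Open", "High", "Low", "Close"]

def load_stock_csv(source):
    """Read the CSV and fail early if its structure is not what is expected."""
    try:
        frame = pd.read_csv(source)
    except FileNotFoundError as exc:
        raise FileNotFoundError(f"Could not find dataset: {source}") from exc

    missing = [c for c in REQUIRED_COLUMNS if c not in frame.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")

    # Same quantities the notebook inspects cell by cell
    report = {
        "rows": frame.shape[0],
        "columns": frame.shape[1],
        "duplicates": int(frame.duplicated().sum()),
        "missing_values": int(frame.isnull().sum().sum()),
    }
    return frame, report

# Tiny in-memory sample standing in for the real file (values are made up)
sample_csv = io.StringIO(
    "Date,Open,High,Low,Close\n"
    "Jan-20,10.0,12.0,9.0,11.0\n"
    "Feb-20,11.0,13.0,10.0,12.0\n"
)
sample_df, sample_report = load_stock_csv(sample_csv)
print(sample_report)
```

Passing the path `'data_YesBank_StockPrices.csv'` instead of the in-memory sample would run the same checks on the real file.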

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

Date: Month and year of the stock price record.

Open: Opening price for the month.

High: Highest price during the month.

Low: Lowest price during the month.

Close: Stock price at the end of the month.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

In [None]:
#converting date column
df['Date']=pd.to_datetime(df['Date'],format='%b-%y')

In [None]:
df=df.sort_values(by='Date')

In [None]:
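# Reset to a clean 0..n-1 index; drop=True avoids keeping the old index as an extra column
df.reset_index(drop=True, inplace=True)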

In [None]:
df.head()

### What all manipulations have you done and insights you found?

The Date column was originally stored as strings (Jan-20, Feb-20, etc.).

Converting it to datetime enables:
*   Correct chronological sorting.
*   Time-based analysis.

Sorting the dataset by Date:

* Ensures the data is in proper time order.
* Is essential for stock price trend analysis and modeling.

Resetting the index after sorting (sorting changes the row order but keeps the old index labels):

*   Creates a clean sequential index.
*   Improves readability.

Insights derived from these steps:

*   Chronological consistency is achieved.
*   This makes trend detection reliable.
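The three wrangling steps can be seen end to end on a toy frame. The sketch below uses made-up rows that are deliberately out of order; the names `raw` and `tidy` are illustrative only.

```python
import pandas as pd

# Synthetic rows, deliberately out of order (not taken from the real dataset)
raw = pd.DataFrame({
    "Date": ["Mar-20", "Jan-20", "Feb-20"],
    "Close": [30.0, 10.0, 20.0],
})

# Parse 'Mon-YY' strings; each value becomes the first day of that month
raw["Date"] = pd.to_datetime(raw["Date"], format="%b-%y")

# Sort chronologically and discard the old index rather than keeping it as a column
tidy = raw.sort_values(by="Date").reset_index(drop=True)
print(tidy)
```

Using `drop=True` matters here: without it, `reset_index` adds the previous index as a new column named `index`.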








## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 univariate

In [None]:
# Histogram
plt.figure(figsize=(10,5))
sns.histplot(df['Close'], bins=30,kde=True)
plt.title('Distribution of Closing Price')
plt.xlabel('Closing Price')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

A histogram is ideal for understanding the distribution and frequency of values, helping identify common price ranges and outliers.

##### 2. What is/are the insight(s) found from the chart?

-> Most closing prices fall within a lower price range, so the distribution is right-skewed.

-> The presence of extreme values shows historical instability in stock performance.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

->Negative insights: Concentration at lower prices signals reduced valuation and weaker financial health.

->Positive impact: Helps long-term investors identify low-price accumulation zones.

#### Chart - 2

In [None]:
# Chart - 2 Univariate

In [None]:
#Boxplot
plt.figure(figsize=(10,5))
sns.boxplot(data=df[['Open','High','Low','Close']])
plt.title('Price Distribution Comparison')
plt.ylabel('Price')
plt.show()

##### 1. Why did you pick the specific chart?

A boxplot is chosen because it summarizes the distribution, spread, median, and outliers of the four price columns (Open, High, Low, Close) in a compact visual.

##### 2. What is/are the insight(s) found from the chart?

-> The median closing price indicates the typical market level.

-> The presence of outliers indicates extreme market events or abnormal trading periods.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

-> Negative insight: Outliers reflect market instability, which increases investment risk.

-> Positive insight: Identifying outliers helps traders and investors avoid abnormal periods.

#### Chart - 3

In [None]:
# Chart - 3 Bivariate

In [None]:
#Line Chart
plt.figure(figsize=(10,5))
plt.plot(df['Date'],df['Close'])
plt.title('Yes bank monthly closing price trend')
plt.xlabel('Date')
plt.ylabel('Close')
plt.show()

##### 1. Why did you pick the specific chart?

A line chart is the most suitable choice for time-dependent data. It shows how the closing price changes over time and helps identify long-term trends, cycles, and sudden price movements.


##### 2. What is/are the insight(s) found from the chart?

-> The closing price shows high volatility over time.

-> There is an extended period of decline from 2018 to 2020, indicating bearish sentiment.

-> The sharp drop points to financial or organizational stress affecting Yes Bank.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive insight: Identifying volatile periods helps investors avoid high-risk entry points and supports better timing strategies.

Negative insight: A sharp or extended downward trend indicates a loss of investor confidence, which can lead to lower trading volume and market capitalization.

#### Chart - 4

In [None]:
# Chart - 4 Bivariate

In [None]:
#Line Chart
plt.figure(figsize=(10,5))
plt.plot(df['Date'], df['High'] - df['Low'])
plt.title('Monthly Price Volatility (High–Low)')
plt.xlabel('Date')
plt.ylabel('Price Range')
plt.show()

##### 1. Why did you pick the specific chart?

The High–Low spread directly measures monthly volatility, which is critical for risk assessment in stock markets.

##### 2. What is/are the insight(s) found from the chart?

->Periods with high spread indicate uncertain or unstable market conditions.

->Calm periods suggest temporary stability in stock performance.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Negative impact: High volatility increases investment risk and discourages conservative investors

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap (multivariate)

In [None]:
import seaborn as sns
sns.heatmap(df[['Open','High','Low','Close']].corr(), annot=True)
plt.title('Correlation Heatmap')
plt.show()

##### 1. Why did you pick the specific chart?

A heatmap visually represents the strength of relationship between multiple numerical variables

##### 2. What is/are the insight(s) found from the chart?

->Strong positive correlation between Open, High, Low, and Close prices

->Indicates internal pricing consistency in the stock: when the Open price increases, Close, High, and Low also tend to increase, and when it decreases, they tend to decrease as well.

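->The high correlation is consistent with clean, coherent price data. It also means that Open, High, and Low are strongly collinear, which is worth keeping in mind when interpreting individual regression coefficients.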

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code(Multivariate)

In [None]:
import seaborn as sns
sns.pairplot(df[['Open','High','Low','Close']])
plt.suptitle('Pairplot of Yes bank Stock prices', y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

A pairplot is chosen to simultaneously visualize relationships and distributions among all stock price variables. It provides a compact overview of how prices interact with each other.



##### 2. What is/are the insight(s) found from the chart?

->Strong linear relationships between all price pairs

->Confirms that all prices move together under the same market conditions

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

There is a significant relationship between the Open price and the Close price of Yes Bank stock.

->This statement was derived from scatter plots and correlation heatmaps, which visually indicated a strong positive relationship.


### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis: There is no significant linear relationship between the Open and Close prices.

Alternative Hypothesis: There is a significant linear relationship between the Open and Close prices.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# TEST USED- PEARSON CORRELATION TEST

In [None]:
from scipy.stats import pearsonr
correlation, p_value=pearsonr(df['Open'],df['Close'])

print("Correlation:",correlation)
print("p-value:",p_value)

# Compare against the 5% significance level
if p_value<0.05:
  print("Reject the null hypothesis: there is a significant relationship between open and close price")
else:
  print("Fail to reject the null hypothesis: no significant relationship between open and close price was detected")


##### Which statistical test have you done to obtain P-Value?

Pearson Correlation Test was used to test the null hypothesis that there is no linear relationship between Open and Close prices. Since the p-value was significantly less than 0.05 and the correlation coefficient was high (r ≈ 0.98), the null hypothesis was rejected, confirming a strong and statistically significant positive relationship between the two variables.


##### Why did you choose the specific statistical test?

The Pearson correlation test was chosen to measure the strength and direction of the linear relationship between two continuous numerical variables: Open price and Close price.

This test is appropriate for the dataset because:

->Both variables are continuous and numerical, satisfying a key requirement of Pearson correlation.

->The objective was to determine whether changes in the Open price are linearly associated with changes in the Close price.

->Pearson correlation provides both a correlation coefficient (r) to quantify relationship strength and a p-value to test statistical significance.
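To make the coefficient and its significance test concrete, the sketch below computes r from its definition and the matching t statistic, t = r·sqrt((n−2)/(1−r²)), which follows a Student t distribution with n−2 degrees of freedom under the null hypothesis. The input values are made up, and `pearson_r_and_t` is a hypothetical helper rather than part of SciPy.

```python
import numpy as np

def pearson_r_and_t(x, y):
    """Pearson r from its definition, plus the t statistic used to test r = 0."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    dx = x - x.mean()
    dy = y - y.mean()
    # Covariance term divided by the product of the two spreads
    r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
    # Under H0 (no linear relationship) this follows a t distribution with n - 2 df
    t_stat = r * np.sqrt((n - 2) / (1.0 - r ** 2))
    return r, t_stat

# Made-up monthly prices, for illustration only
open_demo = [10.0, 12.0, 15.0, 11.0, 18.0, 20.0]
close_demo = [11.0, 12.5, 14.0, 12.0, 19.0, 19.5]

r, t_stat = pearson_r_and_t(open_demo, close_demo)
print(round(r, 4), round(t_stat, 4))
```

The value of r obtained this way should match `np.corrcoef` and the first value returned by `scipy.stats.pearsonr` on the same inputs.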

### Hypothetical Statement - 2

***Statement***

There is a significant difference between the monthly Open and Close prices of Yes Bank stock.

This statement was motivated by boxplots and line charts, which showed noticeable price movement within a month.

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null hypothesis: The mean Open price is equal to the mean Close price.

Alternative hypothesis: The mean Open price is not equal to the mean Close price.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# Paired (T-test)

In [None]:
from scipy.stats import ttest_rel
t_statistic, p_value=ttest_rel(df['Open'],df['Close'])

print("t-statistic:", t_statistic)
print('P-value:',p_value)

if p_value<0.05:
  print("Reject the null hypothesis, there is a significant difference between the monthly open and close prices")
else:
  print("Fail to reject the null hypothesis, there is no significant difference between the open and closing prices")

##### Which statistical test have you done to obtain P-Value?

Paired Sample t-Test

A paired sample t-test was used because Open and Close prices are dependent observations measured from the same time period, and the test evaluates whether their mean difference is statistically significant

p-value interpretation:

Since the p-value (0.825) is greater than the significance level (0.05), we fail to reject the null hypothesis. This suggests that there is no statistically significant difference between the monthly Open and Close prices of Yes Bank

##### Why did you choose the specific statistical test?

The paired sample t-test was chosen because:

->Open and Close prices are paired observations from the same month.

->Each Open price is directly related to its corresponding Close price.

->The objective was to test whether the mean difference between Open and Close prices is statistically significant.

->The test is designed to compare two related samples, not independent groups
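The paired test works on the month-by-month differences rather than on the two columns separately: t = mean(d) / (sd(d) / sqrt(n)), where d = Open − Close. The sketch below is a minimal illustration on made-up numbers; `paired_t_statistic` is a hypothetical helper and should agree with the statistic returned by `scipy.stats.ttest_rel`.

```python
import numpy as np

def paired_t_statistic(a, b):
    """t statistic of a paired-sample t-test, computed from per-pair differences."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    n = d.size
    # Sample standard deviation (ddof=1) of the differences, scaled to a standard error
    standard_error = d.std(ddof=1) / np.sqrt(n)
    return d.mean() / standard_error

# Made-up prices: differences are 1, 2, 3, 4
open_demo = [10.0, 12.0, 15.0, 11.0]
close_demo = [9.0, 10.0, 12.0, 7.0]

t_demo = paired_t_statistic(open_demo, close_demo)
print(t_demo)
```

A statistic close to zero, as reported for the real data, means the average monthly Open−Close difference is small relative to its variability.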

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

In [None]:
df.isnull().sum()

#### What all missing value imputation techniques have you used and why did you use those techniques?

The dataset has no missing values, so no imputation technique was required.


### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

In [None]:
import pandas as pd
import numpy as np

In [None]:
# to detect outliers

import matplotlib.pyplot as plt

plt.figure(figsize=(10,5))
sns.boxplot(data=df[['Open','High','Low','Close']])
plt.title('boxplot')
plt.ylabel('Price')
plt.show()

In [None]:
price_col=['Open','High','Low','Close']

Q1=df[price_col].quantile(0.25)
Q3=df[price_col].quantile(0.75)
IQR=Q3-Q1

lower_bound=Q1-(1.5*IQR)
upper_bound=Q3+(1.5*IQR)

print("Lower Bound:\n",lower_bound)
print("Upper Bound:\n",upper_bound)



In [None]:
#Capping
df[price_col]=df[price_col].clip(lower=lower_bound,upper=upper_bound, axis=1 )

In [None]:
#Verify outliers after treatment
outliers_after = ((df[price_col] < lower_bound) | (df[price_col] > upper_bound)).sum()
outliers_after


In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10,5))
sns.boxplot(data=df[price_col])
plt.title("Boxplot After Outlier Treatment")
plt.show()

##### What all outlier treatment techniques have you used and why did you use those techniques?

Outliers were first identified using visualization techniques, specifically boxplots, which provided a view of the data distribution and highlighted the presence of extreme values across all price-related columns

After confirming the presence of outliers, the Interquartile Range (IQR) method was applied to statistically determine the lower and upper bounds for each price column

Instead of removing the detected outliers, a capping (winsorization) technique was used to limit extreme values within the calculated IQR bounds. This approach was selected to avoid loss of important information, especially given the limited size of the dataset.

Finally, boxplots were re-generated after applying capping to verify that the influence of outliers had been effectively reduced, ensuring the dataset was more stable and suitable for regression modeling.
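The detect-then-cap procedure described above can be written as one reusable function. This is a sketch on a made-up column with a single extreme value; `cap_with_iqr` and its `k` parameter are hypothetical names, with `k=1.5` matching the multiplier used in the notebook.

```python
import pandas as pd

def cap_with_iqr(frame, columns, k=1.5):
    """Clip each column to [Q1 - k*IQR, Q3 + k*IQR] and return a capped copy."""
    q1 = frame[columns].quantile(0.25)
    q3 = frame[columns].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr

    capped = frame.copy()  # leave the caller's frame untouched
    # axis=1 aligns the per-column bounds with the column labels
    capped[columns] = frame[columns].clip(lower=lower, upper=upper, axis=1)
    return capped, lower, upper

# Made-up prices with one extreme value
demo = pd.DataFrame({"Close": [10.0, 11.0, 12.0, 13.0, 14.0, 100.0]})
capped, lower, upper = cap_with_iqr(demo, ["Close"])
print(capped["Close"].tolist())
```

Returning the bounds alongside the capped frame makes it possible to reuse the same limits on new data instead of recomputing them.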

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

#### What all categorical encoding techniques have you used & why did you use those techniques?

Categorical encoding was not required for this regression task because the dataset contains only numerical features. The Date column is a time-based field rather than a categorical variable, and it was dropped before modeling.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words and digits contain digits.

In [None]:
# Remove URLs & Remove words and digits contain digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Not applicable: the dataset contains no text columns, so no text normalization was performed.

#### 9. Part of speech tagging

In [None]:
# POS Taging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Not applicable: the dataset contains no text columns, so no text vectorization was performed.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

Creating new features was not mandatory for this regression task, as the existing price-related features already showed strong predictive power.

In [None]:
df.drop(columns='Date', inplace=True)

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

In [None]:
#define features and target
X = df[['Open', 'High', 'Low']]
y = df['Close']

##### What all feature selection methods have you used  and why?

Feature selection was performed using correlation analysis. The columns Open, High, and Low were selected as independent features, while Close was chosen as the target variable. These feature columns exhibit a strong positive correlation with the target, indicating that they carry significant predictive information for estimating the closing price. Selecting highly correlated numerical features helps improve model performance and ensures that the regression model captures meaningful price relationships

##### Which all features you found important and why?

Open, High, and Low prices were identified as the most important features because they exhibited strong positive correlation with the Close price and directly represent key aspects of market behavior. These features collectively capture price movement throughout the month, making them highly informative for predicting the closing price

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

The data did not need major transformation because the features already had a linear relationship with the target variable. Outliers were handled using the capping method. Since regression models do not strictly require normally distributed features, transformations like log or power transformation were not applied. Standardization was used only to bring all features to the same scale

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data

In [None]:
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()

# Features and target
X = df[['Open', 'High', 'Low']]
y = df['Close']

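# Fit and transform features (for inspection only; the modeling cells below
# refit the scaler on the training split so test-set statistics do not leak in)
X_scaled = scaler.fit_transform(X)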

#Convert back to dataframe
X_scaled=pd.DataFrame(X_scaled, columns=X.columns)

X_scaled.head()

##### Which method have you used to scale you data and why?

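I used Standardization (StandardScaler), which rescales each feature to a mean of 0 and a standard deviation of 1. Putting features on a common scale matters most for models that are sensitive to feature magnitude, such as regularized or gradient-based regressors. For ordinary least-squares Linear Regression it changes the coefficients but not the predictions, and tree-based models such as Random Forest are unaffected by it, so here it mainly keeps the preprocessing consistent and the coefficients comparable.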
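Standardization itself is a one-line formula, z = (x − mean) / std, with the mean and standard deviation taken from the training rows only and then reused for the test rows. The sketch below reproduces that with NumPy on made-up numbers; the array names are illustrative.

```python
import numpy as np

# Made-up feature rows: three training rows and one test row, two features each
train = np.array([[10.0, 12.0], [20.0, 22.0], [30.0, 38.0]])
test = np.array([[40.0, 24.0]])

# Statistics come from the training rows only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), the convention StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse training statistics; never refit on test data

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
```

After scaling, each training column has mean 0 and standard deviation 1, while the test row is simply expressed in the same units.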

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Dimensionality reduction is not needed.

The dataset contains only 5 columns and 185 rows, which is already small and easy to handle. All selected features (Open, High, Low) are directly relevant to predicting the Close price and do not cause high dimensional complexity. Applying dimensionality reduction techniques like PCA may reduce interpretability without providing any significant performance benefit.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Not applicable: no dimensionality reduction was applied to this dataset.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

In [None]:
from sklearn.model_selection import train_test_split

# Train_test split-(80-20)
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42)

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Fit only on training data
X_train_scaled = scaler.fit_transform(X_train)

# Use same scaler for test data
X_test_scaled = scaler.transform(X_test)


##### What data splitting ratio have you used and why?

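An 80:20 split was used: 80% of the rows for training and 20% for testing (`test_size=0.2`), with `random_state=42` so that the split is reproducible.

Why?

The dataset is small (185 rows), so an 80% share leaves enough data to fit the models while still holding out a test set for evaluation.

`train_test_split` shuffles the rows before splitting, so the test set is a random sample of months rather than only the most recent ones.

The split ensures the model is trained and tested on different data.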
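For a dataset of this size, the 80-20 split works out to 148 training rows and 37 test rows. The sketch below checks that on placeholder arrays with the same row count; `X_demo` and `y_demo` are stand-ins, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 185  # row count reported for this dataset
X_demo = np.arange(n_rows * 3, dtype=float).reshape(n_rows, 3)  # placeholder features
y_demo = np.arange(n_rows, dtype=float)                         # placeholder target

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(len(X_tr), len(X_te))

# Same seed, same partition: the split is reproducible across runs
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
```

Fixing `random_state` is what makes the reported metrics repeatable from one run of the notebook to the next.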


### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

This is a regression problem where the target variable (Close price) is a continuous numerical value, not a category or class. The concept of data imbalance mainly applies to classification problems, where some classes have significantly fewer samples than others. Since this dataset aims to predict a numerical value and does not involve class labels, the notion of class imbalance does not apply here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Not applicable: no balancing technique was needed for this regression task.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

from sklearn.linear_model import LinearRegression
lr=LinearRegression()

# Fit the Algorithm
lr.fit(X_train_scaled,y_train)

# Predict on the model
y_pred=lr.predict(X_test_scaled)

In [None]:
y_pred

In [None]:
#Model evaluation
from sklearn.metrics import mean_absolute_error,mean_squared_error,r2_score
import numpy as np

Lr_MAE=mean_absolute_error(y_test,y_pred)
Lr_RMSE=np.sqrt(mean_squared_error(y_test,y_pred))
Lr_r2=r2_score(y_test,y_pred)

print("Mean Absolute Error:",Lr_MAE)
print("Root Mean Squared Error:",Lr_RMSE)
print("R2 Score:",Lr_r2)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

Linear Regression was selected as a baseline model because it is simple, interpretable, and works well when there is a strong linear relationship between features and the target. In this dataset, Open, High, and Low prices show a strong positive correlation with the Close price, making Linear Regression a suitable starting point.

In [None]:
# Visualizing evaluation Metric Score chart

import matplotlib.pyplot as plt

metrics = ['MAE','RMSE','R²']
scores=[Lr_MAE,Lr_RMSE,Lr_r2]

plt.figure(figsize=(10,5))
plt.bar(metrics,scores, color=['red','green','blue'])
plt.title('Linear Regression evaluation metrics')
plt.xlabel('Metrics')
plt.ylabel('Scores')
plt.show()

The R² score appears smaller in the bar chart because it ranges between 0 and 1, while MAE and RMSE are measured in price units and have larger numerical values. The difference in scale causes the R² bar to look comparatively smaller, even though its value (0.99) indicates excellent model performance.
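The difference in scale is easier to see when the three metrics are written out from their definitions: MAE and RMSE are averages of residuals in price units, whereas R² is a ratio of sums of squares. The sketch below uses made-up numbers; `regression_metrics` is a hypothetical helper that should agree with the scikit-learn functions used in the evaluation cell.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R²) computed directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    mae = np.abs(residuals).mean()           # average absolute error, in price units
    rmse = np.sqrt((residuals ** 2).mean())  # penalizes large errors more heavily
    ss_res = (residuals ** 2).sum()          # unexplained variation
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()  # total variation around the mean
    r2 = 1.0 - ss_res / ss_tot               # unitless share of variation explained
    return mae, rmse, r2

# Made-up actual and predicted prices
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]

mae, rmse, r2 = regression_metrics(actual, predicted)
print(mae, rmse, r2)
```

Because R² is unitless and capped at 1, plotting it on its own axis or in a separate chart is one way to keep it readable next to MAE and RMSE.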

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

from sklearn.model_selection import GridSearchCV
param_grid = {'fit_intercept': [True, False]}

# Fit the Algorithm

grid_lr = GridSearchCV(lr,param_grid,cv=5,scoring='r2')
grid_lr.fit(X_train_scaled,y_train)

# Get the best model
best_lr_model = grid_lr.best_estimator_

# Predict on the model
y_pred_lr_opt = best_lr_model.predict(X_test_scaled)

print('Best Parameters:', grid_lr.best_params_)
print('Best Score:', grid_lr.best_score_)


In [None]:
y_pred_lr_opt[:5]

In [None]:
# import pandas as pd

# comparison = pd.DataFrame({
#     'Actual Close Price': y_test.values,
#     'Predicted Close Price': y_pred_lr_opt
# })

# comparison.head()

##### Which hyperparameter optimization technique have you used and why?

Technique Used: GridSearchCV

GridSearchCV was chosen because Linear Regression has a limited number of hyperparameters. GridSearch systematically evaluates all parameter combinations and ensures optimal model selection using cross-validation.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

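No meaningful improvement can be concluded from hyperparameter tuning.

The only hyperparameter searched was `fit_intercept`. The reported increase in R² from 0.9913 to 0.9955 appears to compare two different quantities: 0.9913 is the test-set R² of the baseline model, while 0.9955 appears to be `grid_lr.best_score_`, the mean cross-validated R² on the training folds. The notebook does not compute test-set metrics for the tuned model, and if the search selects `fit_intercept=True`, the tuned model is identical to the baseline.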
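A like-for-like comparison scores the baseline and the tuned estimator on the same held-out test set, and keeps the cross-validated score separate. The sketch below shows this on synthetic data generated for the example (the coefficients and noise level are arbitrary assumptions), so its numbers say nothing about the Yes Bank results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, roughly linear data (arbitrary coefficients, for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 3))
y_demo = X_demo @ np.array([0.2, 0.5, 0.3]) + rng.normal(scale=0.1, size=120)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

baseline = LinearRegression().fit(X_tr, y_tr)

search = GridSearchCV(
    LinearRegression(), {"fit_intercept": [True, False]}, cv=5, scoring="r2"
)
search.fit(X_tr, y_tr)

# Mean R² over validation folds carved out of the training data
cv_r2 = search.best_score_
# R² on the held-out test set: the figures that are comparable with each other
baseline_test_r2 = r2_score(y_te, baseline.predict(X_te))
tuned_test_r2 = r2_score(y_te, search.best_estimator_.predict(X_te))

print("CV R2 (train folds):", cv_r2)
print("Test R2 baseline:", baseline_test_r2)
print("Test R2 tuned:", tuned_test_r2)
```

When the search keeps `fit_intercept=True`, the two test-set scores coincide, because the refitted best estimator is the same model as the baseline.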

### ML Model - 2

In [None]:
from sklearn.ensemble import RandomForestRegressor

rf_model = RandomForestRegressor(
    n_estimators=100,
    random_state=42
)

rf_model.fit(X_train, y_train)

y_pred_rf = rf_model.predict(X_test)

In [None]:
y_pred_rf[:5]

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

rf_mae = mean_absolute_error(y_test, y_pred_rf)
rf_rmse = np.sqrt(mean_squared_error(y_test, y_pred_rf))
rf_r2 = r2_score(y_test, y_pred_rf)

print("Mean Absolute Error:", rf_mae)
print("Root Mean Squared Error:", rf_rmse)
print("R2 Score:", rf_r2)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

Random Forest Regressor is an ensemble learning model that builds multiple decision trees and combines their predictions. It is capable of capturing complex and non-linear relationships between features and the target variable.
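The "combine their predictions" step is a plain average for regression: the forest's output equals the mean of its individual trees' outputs. The sketch below checks this on synthetic data generated for the example; the data and the choice of 25 trees are arbitrary assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data for illustration only
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 100, size=(80, 3))
y_demo = X_demo.mean(axis=1) + rng.normal(scale=1.0, size=80)

forest = RandomForestRegressor(n_estimators=25, random_state=42).fit(X_demo, y_demo)

# One row of predictions per fitted tree, for the first five samples
per_tree = np.stack([tree.predict(X_demo[:5]) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

forest_prediction = forest.predict(X_demo[:5])
print(np.allclose(manual_average, forest_prediction))
```

Averaging many trees trained on different bootstrap samples is what reduces the variance of a single deep tree.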

In [None]:
# Visualizing evaluation Metric Score chart
import matplotlib.pyplot as plt

metrics = ['MAE', 'RMSE', 'R²']
rf_scores = [rf_mae, rf_rmse, rf_r2]

plt.figure(figsize=(10,5))
plt.bar(metrics, rf_scores, color=['red','green','blue'])
plt.title('Random Forest Evaluation Metrics')
plt.xlabel('Metrics')
plt.ylabel('Score')
plt.show()


#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
from sklearn.model_selection import GridSearchCV

param_grid_rf = {'n_estimators': [100, 200],
                'max_depth': [None, 10, 20],
                'min_samples_split': [2, 5]}

grid_rf = GridSearchCV( estimator=RandomForestRegressor(random_state=42),
                        param_grid=param_grid_rf,
                        cv=5,
                        scoring='r2',
                        n_jobs=-1)

# Fit the Algorithm
grid_rf.fit(X_train, y_train)

# Best model
best_rf_model = grid_rf.best_estimator_

# Predict on the model
y_pred_rf_opt = best_rf_model.predict(X_test)

print('Best Parameters:', grid_rf.best_params_)
print('Best Score:', grid_rf.best_score_)


In [None]:
y_pred_rf_opt[:5]

In [None]:
import pandas as pd

comparison = pd.DataFrame({
    'Actual Close Price': y_test.values,
    'Predicted Close Price': y_pred_rf_opt
})

comparison.head()

##### Which hyperparameter optimization technique have you used and why?

GridSearchCV was used because it systematically evaluates multiple combinations of hyperparameters using cross-validation. Random Forest has several important tuning parameters (like n_estimators and max_depth), so GridSearch helps identify the best combination for improved performance.
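
To make "systematically evaluates multiple combinations" concrete, the following sketch enumerates the same grid with scikit-learn's `ParameterGrid` and counts the model fits that a 5-fold search performs. It only counts candidates and does not train anything; the variable names `candidates`, `n_candidates`, and `n_fits` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as the GridSearchCV cell above
param_grid_rf = {'n_estimators': [100, 200],
                 'max_depth': [None, 10, 20],
                 'min_samples_split': [2, 5]}

candidates = list(ParameterGrid(param_grid_rf))
n_candidates = len(candidates)   # 2 * 3 * 2 combinations
n_fits = n_candidates * 5        # one fit per combination per fold (cv=5)

print(n_candidates, n_fits)
print(candidates[0])             # one concrete parameter combination
```

The grid is small enough that an exhaustive search is cheap. For much larger grids, a randomized search would usually be the more practical choice.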

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Yes. After hyperparameter tuning with GridSearchCV, the R² score increased from 0.9793 to 0.9891, so the tuned model explains more of the variance in the closing price. The optimized parameters (increased n_estimators and controlled max_depth) likely helped the model generalize better and reduced overfitting. Overall, the tuned Random Forest model is more accurate than the baseline model.
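
A small helper can produce the before/after score chart as a table. This is a sketch under stated assumptions: `score_table` is a hypothetical helper, not part of the notebook, and the arrays are toy values. In the notebook, the inputs would be `y_test`, `y_pred_rf`, and `y_pred_rf_opt`.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def score_table(y_true, predictions):
    """Return one row of MAE, RMSE and R2 per named prediction array."""
    rows = {}
    for name, y_pred in predictions.items():
        rows[name] = {
            'MAE': mean_absolute_error(y_true, y_pred),
            'RMSE': np.sqrt(mean_squared_error(y_true, y_pred)),
            'R2': r2_score(y_true, y_pred),
        }
    return pd.DataFrame(rows).T

# Toy values for illustration only (assumption: not the project's predictions)
y_true = np.array([100.0, 150.0, 200.0, 250.0])
demo = score_table(y_true, {
    'Baseline': np.array([110.0, 140.0, 210.0, 240.0]),
    'Tuned': np.array([105.0, 145.0, 205.0, 245.0]),
})
print(demo)
```

Note that `grid_rf.best_score_` is the mean cross-validated R² on the training folds, so it should not be compared directly with a test-set R². Scoring both models on the same test split, as above, gives a like-for-like comparison.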

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

#### Mean Absolute Error (MAE)

*   MAE shows the average absolute difference between the predicted closing price and the actual closing price.
*   If MAE is low, predictions are close to real market prices. This helps investors, analysts, and financial planners make more accurate trading and investment decisions with lower financial risk.

#### Root Mean Squared Error (RMSE)

*   RMSE gives more weight to larger errors, so it shows how serious large prediction mistakes are.
*   In stock markets, large prediction errors can lead to heavy financial losses. A lower RMSE means the model avoids major price miscalculations.

#### R² Score

*   R² shows how much of the variation in closing price is explained by the model.
*   A high R² indicates strong predictive power. This improves confidence in forecasting stock trends, supporting better strategic decisions in investment planning.
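
The three metrics can also be computed by hand, which shows why RMSE reacts more strongly to a single large miss than MAE does. The prices below are made-up toy values, not project results.

```python
import numpy as np

# Toy actual and predicted closing prices (made up for illustration)
actual = np.array([100.0, 150.0, 200.0, 250.0])
predicted = np.array([110.0, 140.0, 210.0, 220.0])

errors = actual - predicted
mae = np.mean(np.abs(errors))          # average absolute miss
rmse = np.sqrt(np.mean(errors ** 2))   # squaring lets the largest miss dominate
r2 = 1 - np.sum(errors ** 2) / np.sum((actual - actual.mean()) ** 2)

print(mae, rmse, r2)
```

Three of the four predictions are off by 10 and one is off by 30. That single large miss pulls RMSE above MAE, which is the behaviour described in the RMSE section.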

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

I considered MAE, RMSE, and R² score for evaluating business impact.

MAE and RMSE were important because they measure prediction error: lower values mean less financial risk in decisions based on the predicted price. The R² score was considered to check how well the model explains price variation: a higher R² increases confidence in forecasting and investment planning.


### 2. Which ML model did you choose from the above created models as your final prediction model and why?

I selected Linear Regression as the final prediction model. It achieved a lower MAE (4.98) and RMSE (8.52) than Random Forest, indicating smaller prediction errors. It also produced a higher R² score (0.9913), meaning it explains more of the variance in the closing price. Because it performs better on all three metrics, it is the preferred model for this dataset.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

In this project, I used Linear Regression to predict the monthly closing price of Yes Bank stock. Linear Regression establishes a linear relationship between the independent variables (Open, High, Low) and the target variable (Close). Since the features were highly correlated with the closing price, the model achieved a high R² score. For a linear model, feature importance can be read from the fitted coefficients on the standardized features.
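
For a linear model, one simple explainability view is the set of fitted coefficients on standardized features. The sketch below is illustrative only: it uses synthetic prices generated for the example (an assumption, not the Yes Bank data), so the printed values say nothing about the project's actual coefficients.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic monthly prices (assumption: stand-in for the real dataset)
rng = np.random.default_rng(0)
open_ = rng.uniform(20, 350, size=150)
high = open_ + rng.uniform(0, 30, size=150)
low = open_ - rng.uniform(0, 30, size=150)
close = 0.5 * high + 0.5 * low + rng.normal(0, 3, size=150)

X = pd.DataFrame({'Open': open_, 'High': high, 'Low': low})
X_scaled = StandardScaler().fit_transform(X)

model = LinearRegression().fit(X_scaled, close)

# On standardized inputs, each coefficient is the change in Close per one
# standard deviation of that feature, holding the other features fixed
importance = pd.Series(model.coef_, index=X.columns)
importance = importance.sort_values(key=np.abs, ascending=False)
print(importance)
```

Because Open, High, and Low are strongly correlated with each other, individual coefficients can be unstable from one sample to another, so they are best read as a rough ranking rather than precise effects.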

The main libraries used were Pandas and NumPy for data processing, Matplotlib and Seaborn for visualization, and Scikit-learn for model building, scaling, and evaluation.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
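
The two placeholder cells above could be filled in along the following lines. This is a minimal sketch using `joblib.dump` and `joblib.load`; the toy model, the file name `yes_bank_close_model.joblib`, and the random "unseen" rows are all assumptions made for the example, with the toy model standing in for `best_lr_model`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy model standing in for best_lr_model (assumption: not the trained project model)
rng = np.random.default_rng(1)
X_demo = rng.uniform(10, 400, size=(50, 3))
y_demo = X_demo.mean(axis=1)
model = LinearRegression().fit(X_demo, y_demo)

# Save the fitted model; the file name is a placeholder
model_path = os.path.join(tempfile.gettempdir(), 'yes_bank_close_model.joblib')
joblib.dump(model, model_path)

# Load it back and predict on rows the model has not seen
loaded_model = joblib.load(model_path)
X_unseen = rng.uniform(10, 400, size=(3, 3))
print(np.allclose(loaded_model.predict(X_unseen), model.predict(X_unseen)))
```

Since the Linear Regression model in this notebook was fitted on scaled features (`X_test_scaled` is used for prediction), the fitted scaler would need to be saved and reloaded in the same way so that unseen data is transformed consistently.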

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

This project successfully developed a machine learning model to predict the monthly closing price of Yes Bank stock using historical price data. Through exploratory data analysis, statistical testing, outlier treatment, and feature selection, a strong linear relationship was identified between Open, High, Low, and Close prices. Two regression models were implemented and evaluated, and Linear Regression outperformed Random Forest with lower error values and a higher R² score (0.9913).

The final model demonstrates high predictive accuracy and reliability, making it suitable for supporting financial forecasting and investment-related decision-making. Overall, the project follows a complete end-to-end machine learning workflow and provides meaningful business insights.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***