<a href="https://colab.research.google.com/github/Rwitick-Dash/Machine_Learning_Projects/blob/main/Yes_Bank_Stock_Closing_Price_Prediction_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Yes Bank Stock Closing Price Prediction



##### **Project Type**    - Regression
##### **Contribution**    - Individual

# **Project Summary -**

The Yes Bank Stock Closing Price Prediction project investigates how the 2018 fraud case involving co-founder Rana Kapoor affected the bank's stock performance, and explores whether predictive models can forecast prices under such volatile conditions. Using monthly stock data (opening, closing, high, and low prices), the project conducts exploratory data analysis to uncover trends and volatility, particularly around the crisis period. Regression models are developed and evaluated using Scikit-Learn, supported by visual diagnostics through Matplotlib and Seaborn. The study treats the closing price as the target variable and preserves the temporal order of the data when splitting it into training and test sets. Ultimately, the project offers insights into how financial crises affect stock behavior and suggests that, with careful modeling and data preparation, predictive analytics can support investor decision-making in turbulent markets.

# **Problem Statement**


**To predict the monthly closing stock price of Yes Bank using historical stock data and machine learning models, accounting for market volatility and significant financial events.**

### Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

import warnings
warnings.filterwarnings("ignore")

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
from google.colab import files
uploaded = files.upload()


In [None]:
import io

# Read the uploaded CSV from the in-memory bytes returned by files.upload()
df = pd.read_csv(io.BytesIO(uploaded['data_YesBank_StockPrices.csv']))
print(df)

### Dataset First View

In [None]:
df.head()

### Dataset Rows & Columns count

In [None]:
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")


### Dataset Information

In [None]:
df.info()

#### Duplicate Values

In [None]:
duplicate_count = df.duplicated().sum()
print(f"Number of duplicate rows in the dataset: {duplicate_count}")

#### Missing Values/Null Values

In [None]:
print("\nMissing Values:")
print(df.isnull().sum())

#### Visualising Missing Values/Null Values

In [None]:
import missingno as msno
msno.heatmap(df)

## ***2. Understanding the Variables***

In [None]:
df.columns

In [None]:
df.describe()

### Check Unique Values for each variable.

In [None]:
df.nunique()


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Convert 'Date' to datetime and sort
df['Date'] = pd.to_datetime(df["Date"], format='%b-%y')
df.sort_values('Date', inplace=True)

# Drop rows with missing values if any
df.dropna(inplace=True)

# Reset index after sorting
df.reset_index(drop=True, inplace=True)

# Create new features if useful
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month

# View cleaned and enriched dataset
df.head()


### What manipulations were done, and what insights were found?


- Converted the 'Date' column to datetime type for temporal operations.
- Sorted the data based on time to prepare it for modeling.
- Checked and removed missing values to maintain data integrity.
- Created new features: `Year` and `Month` to capture seasonal trends.
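The `%b-%y` format used in the wrangling step can be checked in isolation. The sketch below uses only the standard library and a hypothetical label `Jul-05` (an assumed example value, not read from the dataset), and assumes an English locale for month abbreviations. It shows that the two-digit year is resolved to a four-digit year and that the day defaults to the first of the month.

```python
from datetime import datetime

# Hypothetical month label in the same 'Mon-yy' shape as the Date column
label = "Jul-05"

# %b = abbreviated month name, %y = two-digit year (00-68 map to 2000-2068)
parsed = datetime.strptime(label, "%b-%y")

# Year and Month can then be derived exactly as in the wrangling cell
print(parsed.year, parsed.month, parsed.day)
```

Because every row lands on the first day of its month, sorting by the parsed date gives a clean chronological order.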


## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart 1: Line chart for Close prices over time
plt.figure(figsize=(12,6))
plt.plot(df['Date'], df['Close'], color='blue')
plt.title('Monthly Closing Price of Yes Bank Over Time')
plt.xlabel('Date')
plt.ylabel('Close Price')
plt.grid(True)
plt.show()


1. 📌 **Reason for picking up the specific chart?**  
To understand the trend in stock closing price over time.

2. 📈 **Insights from the chart:**  
The closing price showed significant volatility after 2018, especially during the Rana Kapoor fraud period.

3. 🧩 **Business Impact:**  
Helps investors understand when instability occurred and how the stock price reacted.

#### Chart - 2

In [None]:
# Chart 2: Boxplot of closing prices
plt.figure(figsize=(8,6))
sns.boxplot(y='Close', data=df)
plt.title('Boxplot of Monthly Closing Prices')
plt.ylabel('Closing Price')
plt.show()



1. 📌 **Reason for picking up the specific chart?**  
To detect outliers in closing price distribution.

2. 📈 **Insights from the chart:**  
Outliers are present especially during periods of market shocks.

3. 🧩 **Business Impact:**  
Identifies potential abnormal events that could warrant deeper investigation.
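The points a boxplot draws beyond its whiskers follow the conventional 1.5 × IQR rule, so the same outliers can be listed numerically. This is a minimal sketch using only the standard library; the helper `iqr_outliers` and the sample prices are hypothetical and are not taken from the Yes Bank data.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # 'inclusive' uses linear interpolation between the sorted data points
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical closing prices with one extreme month
sample_close = [12.0, 13.5, 14.0, 15.2, 16.1, 17.0, 18.3, 95.0]
outliers = iqr_outliers(sample_close)
print(outliers)
```

Applied to `df['Close']`, a list like this could be joined back to the dates to see which months the flagged values belong to.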


#### Chart - 3

In [None]:

# Chart 3: Yearly average of monthly closing prices
plt.figure(figsize=(10,6))
df.groupby('Year')['Close'].mean().plot(kind='bar', color='green')
plt.title('Average Monthly Closing Price per Year')
plt.ylabel('Avg Close Price')
plt.xticks(rotation=45)  # Tilt the x-axis labels so the year labels do not overlap
plt.show()



1. 📌 **Reason for picking up the specific chart?**  
To evaluate how the average performance changed year-over-year.

2. 📈 **Insights from the chart:**  
Decline seen after 2018 aligns with the fraud incident.

3. 🧩 **Business Impact:**  
Supports financial planning and investor sentiment analysis.


#### Chart - 4

In [None]:
# Chart 4: Correlation Heatmap
plt.figure(figsize=(8,6))
sns.heatmap(df[['Open', 'High', 'Low', 'Close']].corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()



1. 📌 **Reason for picking up the specific chart?**  
To identify feature relationships with the target variable (Close).

2. 📈 **Insights from the chart:**  
High correlation between `Open`, `High`, `Low`, and `Close`.

3. 🧩 **Business Impact:**  
Helpful in feature selection for model building.


## ***5. Hypothesis Testing***

### Hypothetical Statement - 1
**Null Hypothesis (H0):** The mean monthly closing price before 2018 is equal to the mean monthly closing price from 2018 onward.  
**Alternative Hypothesis (H1):** The mean monthly closing price before 2018 is not equal to the mean monthly closing price from 2018 onward.


##### Statistical Test

In [None]:
# Welch's two-sample t-test: checks whether the means of two independent
# groups differ, without assuming the groups have equal variances.
from scipy.stats import ttest_ind

# Create two groups: before 2018, and 2018 onward
before_2018 = df[df['Year'] < 2018]['Close']
after_2018 = df[df['Year'] >= 2018]['Close']

# equal_var=False selects Welch's t-test, which is more robust when variances differ
t_stat1, p_val1 = ttest_ind(before_2018, after_2018, equal_var=False)
print(f"T-statistic: {t_stat1:.3f}, P-value: {p_val1:.4f}")

# t_stat1: the difference between the two sample means divided by its standard error.
# p_val1: the probability of observing a t-statistic at least this extreme
#         if the null hypothesis (equal means) were true.
# If p_val1 is below the significance level (e.g. 0.05), reject the null hypothesis.


**Statistical Test Used:** Independent two-sample t-test  
**Why this test?** Because we are comparing the means of two independent groups.  
**Conclusion:** If P-value < 0.05, we reject the null hypothesis and conclude that the average closing price changed significantly after 2018.
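To make the statistic less of a black box, the sketch below computes Welch's t-statistic and its approximate degrees of freedom by hand with the standard library. The helper `welch_t` and the two small samples are hypothetical illustrations, not values from the dataset. `scipy.stats.ttest_ind(..., equal_var=False)` computes the same statistic and additionally returns the p-value.

```python
import math
import statistics

def welch_t(a, b):
    """Return Welch's t-statistic and the Welch-Satterthwaite degrees of freedom."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    # Sample variances (divide by n - 1)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    # Squared standard error contributed by each group
    se2_a, se2_b = var_a / len(a), var_b / len(b)
    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, dof

# Hypothetical closing prices for two periods
group_a = [10, 12, 14, 16, 18]
group_b = [20, 22, 24, 26, 28, 30]
t_value, dof_value = welch_t(group_a, group_b)
print(f"t = {t_value:.3f}, dof = {dof_value:.2f}")
```

The sign of the statistic only reflects which group is passed first; the magnitude is what gets compared against the t-distribution.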


### Hypothetical Statement - 2

**Null Hypothesis (H0):** There is no significant correlation between `Open` and `Close` prices.  
**Alternative Hypothesis (H1):** There is a significant correlation between `Open` and `Close` prices.


#### Statistical Test.

In [None]:
# Pearson correlation test
# The Pearson correlation coefficient r measures the strength and direction of the
# linear relationship between two variables. It lies between -1 and +1: positive
# values mean the variables move in the same direction, negative values mean they
# move in opposite directions, and 0 means no linear correlation.

from scipy.stats import pearsonr
corr, p_val2 = pearsonr(df['Open'], df['Close'])
print(f"Pearson Correlation Coefficient: {corr:.3f}, P-value: {p_val2:.4f}")

**Statistical Test Used:** Pearson correlation coefficient  
**Why this test?** Because both `Open` and `Close` are continuous numerical variables.  
**Conclusion:** If P-value < 0.05, we reject the null and conclude a statistically significant correlation exists.
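The coefficient itself is simple to compute by hand, which helps when interpreting its sign. This is a minimal standard-library sketch; the helper `pearson_r` and the sample series are hypothetical and are not drawn from the `Open` and `Close` columns.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Hypothetical series that mostly rise together
r_positive = pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
# Hypothetical series that move in exactly opposite directions
r_negative = pearson_r([1, 2, 3, 4, 5], [10, 8, 6, 4, 2])
print(f"{r_positive:.4f}, {r_negative:.4f}")
```

A value close to +1 for `Open` against `Close` would indicate that months opening high also tend to close high, which is a statement about linear association only.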



### Hypothetical Statement - 3
**Null Hypothesis (H0):** The variance in `Close` prices remains constant over the years.  
**Alternative Hypothesis (H1):** The variance in `Close` prices differs across years.


#### Statistical Test.

In [None]:
# Levene’s test for equal variances
from scipy.stats import levene
years = df['Year'].unique()
grouped_close = [df[df['Year'] == y]['Close'] for y in years]
stat3, p_val3 = levene(*grouped_close)
print(f"Levene’s Statistic: {stat3:.3f}, P-value: {p_val3:.4f}")


**Statistical Test Used:** Levene’s test  
**Why this test?** To check for homogeneity of variances across multiple groups (years).  
**Conclusion:** If P-value < 0.05, variances in `Close` prices are not equal across years.
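For intuition, Levene's statistic compares how far observations sit from their own group's center, between groups versus within groups. The sketch below computes the W statistic by hand using deviations from each group's median, which is the default centering in `scipy.stats.levene`. The helper `levene_w` and the two groups are hypothetical examples, not yearly slices of the dataset.

```python
import statistics

def levene_w(*groups):
    """Levene's W statistic using deviations from each group's median."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Absolute deviation of every observation from its own group's median
    z = [[abs(x - statistics.median(g)) for x in g] for g in groups]
    z_means = [statistics.mean(zi) for zi in z]
    z_grand = sum(sum(zi) for zi in z) / n_total
    # Spread of the group-level mean deviations around the grand mean
    between = sum(len(zi) * (m - z_grand) ** 2 for zi, m in zip(z, z_means))
    # Spread of the individual deviations around their group mean
    within = sum((v - m) ** 2 for zi, m in zip(z, z_means) for v in zi)
    return (n_total - k) / (k - 1) * between / within

# Hypothetical groups: the second has twice the spread of the first
w_stat = levene_w([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(f"W = {w_stat:.4f}")
```

The p-value then comes from comparing W against an F-distribution with `k - 1` and `N - k` degrees of freedom, which the SciPy function handles.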


## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
df.isnull().sum()

✅ No missing values found.

### 2. Handling Outliers

In [None]:
# Plotting boxplot to identify outliers
sns.boxplot(data=df[['Open', 'High', 'Low', 'Close']])
plt.title("Boxplot to Detect Outliers")
plt.show()

No major outliers removed, as extreme values post-2018 are real market reflections.

### 3. Categorical Encoding

No categorical columns – no encoding required.

### 4. Feature Manipulation & Selection

In [None]:
# Select features
features = ['Open', 'High', 'Low']
target = 'Close'

#### Features chosen based on high correlation.

### 6. Data Scaling

In [None]:
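from sklearn.preprocessing import StandardScaler

# Fit the scaler on the chronologically first 80% of rows only (the rows that
# become the training set with test_size=0.2 and shuffle=False), so that
# test-period statistics do not leak into the scaling
n_train = len(df) - int(np.ceil(0.2 * len(df)))
scaler = StandardScaler()
scaler.fit(df[features].iloc[:n_train])
X_scaled = scaler.transform(df[features])
y = df[target].values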
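Standardization subtracts a mean and divides by a standard deviation, and it matters which rows those two numbers are estimated from. The sketch below is a hand-rolled illustration with the standard library, using hypothetical helpers (`fit_standardizer`, `standardize`) and made-up prices; it estimates the statistics from the training values only and reuses them for later values, using the population standard deviation as `StandardScaler` does.

```python
import statistics

def fit_standardizer(train_values):
    """Estimate mean and population standard deviation from training data only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)  # divides by n, not n - 1
    return mean, std

def standardize(values, mean, std):
    """Apply previously fitted statistics to any set of values."""
    return [(v - mean) / std for v in values]

# Hypothetical opening prices: earlier months for training, a later month for testing
train_open = [10.0, 20.0, 30.0]
test_open = [40.0]

mean, std = fit_standardizer(train_open)
train_scaled = standardize(train_open, mean, std)
test_scaled = standardize(test_open, mean, std)
print(train_scaled, test_scaled)
```

The scaled training values average to zero, while the later value lands well outside that range, which is the expected picture when prices drift over time.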

### 8. Data Splitting

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, shuffle=False)

## ***7. ML Model Implementation***

### ML Model 1: Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np # Import numpy for sqrt

lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)

# Evaluation
mae = mean_absolute_error(y_test, y_pred_lr)
mse = mean_squared_error(y_test, y_pred_lr)
rmse = np.sqrt(mse)  # RMSE is the square root of MSE
r2 = r2_score(y_test, y_pred_lr)

print(f"Linear Regression - MAE: {mae}, RMSE: {rmse}, R2: {r2}")
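The three evaluation metrics reported for each model can be written out directly, which clarifies what each one measures. This is a minimal standard-library sketch; the helper `regression_metrics` and the actual and predicted values are hypothetical, not model output from this notebook.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for paired actual and predicted values."""
    errors = [p - t for t, p in zip(y_true, y_pred)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n            # average absolute miss
    rmse = math.sqrt(sum(e * e for e in errors) / n)  # penalizes large misses more
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)               # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total variation
    r2 = 1 - ss_res / ss_tot
    return mae, rmse, r2

# Hypothetical actual and predicted closing prices
mae_demo, rmse_demo, r2_demo = regression_metrics([3, 5, 7, 9], [2, 5, 8, 10])
print(f"MAE: {mae_demo:.2f}, RMSE: {rmse_demo:.4f}, R2: {r2_demo:.2f}")
```

MAE and RMSE are in the same units as the closing price, whereas R² is unitless and compares the model against always predicting the mean of the test values.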

### ML Model 2: Random Forest Regressor



In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score # Import the necessary metrics
import numpy as np # Import numpy for sqrt

rf = RandomForestRegressor(random_state=42)
params = {'n_estimators': [50, 100], 'max_depth': [3, 5, 10]}

grid = GridSearchCV(rf, params, cv=3, scoring='r2')
grid.fit(X_train, y_train)

best_rf = grid.best_estimator_
y_pred_rf = best_rf.predict(X_test)

# Evaluation
mae_rf = mean_absolute_error(y_test, y_pred_rf)
# Calculate RMSE by taking the square root of MSE
mse_rf = mean_squared_error(y_test, y_pred_rf)
rmse_rf = np.sqrt(mse_rf)
r2_rf = r2_score(y_test, y_pred_rf)

print(f"Random Forest - MAE: {mae_rf}, RMSE: {rmse_rf}, R2: {r2_rf}")
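Because the rows are ordered in time, one alternative to plain k-fold cross-validation is an expanding-window scheme in which every validation fold comes strictly after its training rows. The sketch below is a hand-rolled illustration of that idea, not the procedure used in the cell above; the helper `expanding_window_splits` is hypothetical. scikit-learn offers `TimeSeriesSplit` in `sklearn.model_selection` for the same purpose, and an instance of it can be passed as the `cv` argument of `GridSearchCV`.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) where training always precedes testing."""
    test_size = n_samples // (n_splits + 1)
    for i in range(n_splits):
        # The training window grows by one test block on each iteration
        train_end = n_samples - (n_splits - i) * test_size
        train_idx = list(range(train_end))
        test_idx = list(range(train_end, train_end + test_size))
        yield train_idx, test_idx

# Hypothetical example: 10 monthly rows, 3 folds
splits = list(expanding_window_splits(10, 3))
for train_idx, test_idx in splits:
    print(train_idx, test_idx)
```

With this layout no fold is validated on months that precede its training data, which keeps the validation score closer to how the model would be used for forecasting.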

### ML Model 3: Gradient Boosting Regressor

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score # Import the necessary metrics
import numpy as np # Import numpy for sqrt

gbr = GradientBoostingRegressor(random_state=42)
gbr.fit(X_train, y_train)
y_pred_gbr = gbr.predict(X_test)

# Evaluation
mae_gbr = mean_absolute_error(y_test, y_pred_gbr)
# Calculate RMSE by taking the square root of MSE
mse_gbr = mean_squared_error(y_test, y_pred_gbr)
rmse_gbr = np.sqrt(mse_gbr)
r2_gbr = r2_score(y_test, y_pred_gbr)

print(f"Gradient Boosting - MAE: {mae_gbr}, RMSE: {rmse_gbr}, R2: {r2_gbr}")

#### 📈 Final Model Selection

In [None]:
# Comparing R² Scores
print(f"Linear Regression R²: {r2}")
print(f"Random Forest R²: {r2_rf}")
print(f"Gradient Boosting R²: {r2_gbr}")

##### ✅ Selected Model: Random Forest (best balance of MAE, RMSE, and R²)

### Feature Importance

In [None]:
importances = best_rf.feature_importances_
feature_imp = pd.Series(importances, index=features)
feature_imp.sort_values().plot(kind='barh')
plt.title("Feature Importance - Random Forest")
plt.show()

# **Conclusion**

This project successfully developed a predictive model for Yes Bank's monthly closing stock prices by leveraging historical data and machine learning techniques. The analysis encompassed comprehensive data preprocessing, exploratory data analysis, feature engineering, and the implementation of various regression models, including Linear Regression, Gradient Boosting and Random Forest Regressor.

Among the models evaluated, the Random Forest Regressor demonstrated superior performance.

Key insights from the project include:

1. Impact of Events: Significant events, such as the 2018 fraud case involving Rana Kapoor, had a noticeable effect on stock price volatility, underscoring the importance of incorporating event-driven analysis in stock prediction models.

2. Feature Importance: The opening price and the monthly high and low prices, the three features used by the models, were strong predictors of the closing price, highlighting their relevance in stock price forecasting.

In conclusion, this project demonstrates the efficacy of machine learning models in predicting stock prices, providing valuable tools for investors and analysts to make informed decisions in the dynamic financial market.