<a href="https://colab.research.google.com/github/SouvikChakraborty472/ML_Predictive_Analysis/blob/main/ML_YesBank_StockPrice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Regression - Yes Bank Stock Closing Price Prediction


##### **Project Type**    - Regression
##### **Contribution**    - Individual
##### **Name** - Souvik Chakraborty

# **Project Summary -**

This project predicts the monthly closing stock price of Yes Bank from historical monthly data containing the opening, closing, highest, and lowest prices. After exploratory data analysis (EDA) and feature engineering, regression models are trained on the bank's monthly prices from its inception onward and then evaluated. The goal is a model whose predictions help investors and other stakeholders make informed decisions.

# **GitHub Link -**

https://github.com/SouvikChakraborty472/ML_Predictive_Analysis

# **Problem Statement**


Forecast the monthly closing stock price of Yes Bank by analyzing historical stock price data, with a keen focus on significant events like the fraud case involving Rana Kapoor that have profoundly impacted the stock's performance.



# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart answers the following questions.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score  # evaluation metrics
import warnings  # used to silence library warnings in the notebook output
warnings.filterwarnings('ignore')  # suppress all warnings

### Dataset Loading

In [None]:
# Load Dataset

from google.colab import drive
drive.mount('/content/drive') #mounting the google drive

In [None]:
# Load Dataset

df=pd.read_csv("/content/drive/MyDrive/YesBank_StockPrices.csv")
df

### Dataset First View

In [None]:
# Dataset First Look
df.head()

In [None]:
df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
plt.figure(figsize=(5,4))

colours = ['#34495E', 'seagreen']
sns.heatmap(df.isnull(), cmap=sns.color_palette(colours)) #using the heatmap to visualize

### What did you know about your dataset?

* The dataset encompasses comprehensive monthly stock prices for Yes Bank, detailing the open, close, high, and low prices from its inception.
* Given the inherent volatility of financial markets, the dataset likely contains significant outliers, offering a rich field for analysis of market fluctuations.


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

Dataset Overview:
* Date: Represents the month and year of the stock prices.
* Open: The initial price of the stock at the beginning of the month.
* High: The peak price of the stock during the month.
* Low: The minimum price of the stock within the month.
* Close: The final price of the stock at the end of the month.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
Q1 = df['Close'].quantile(0.25)
Q3 = df['Close'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
df = df[(df['Close'] >= lower_bound) & (df['Close'] <= upper_bound)]
print(df)

### What all manipulations have you done and insights you found?

Answer:

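* Missing values were counted earlier with `df.isnull().sum()`; the wrangling code above applies no imputation. A forward fill would be a natural choice for a time series if any gaps were found.
* Handled outliers in the 'Close' price using the IQR method: rows whose 'Close' value lies below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR were removed.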
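
The IQR rule described above can be sketched on a small example. This is a minimal illustration, not the notebook's own code: the helper `iqr_bounds` and the toy prices are hypothetical, and the sketch flags outliers instead of dropping them.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple:
    """Return the (lower, upper) fences used by the IQR rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical closing prices; 400.0 is an obvious outlier.
close = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 14.0, 13.0, 400.0], name="Close")
lower, upper = iqr_bounds(close)

# Flag rather than drop, so the monthly sequence keeps every row.
is_outlier = (close < lower) | (close > upper)
print(f"fences: [{lower:.2f}, {upper:.2f}]")
print(close[is_outlier])
```

Flagging keeps the series free of gaps, which can matter for a time series: dropping months removes exactly the extreme periods a price model may need to learn from.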

## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
df.info()

In [None]:
plt.figure(figsize=(12, 6))
plt.plot(df.index, df['Close'])
plt.title('Closing Prices Over Time')
plt.xlabel('Row index (one row per month)')  # df.index is the default integer index, not a date
plt.ylabel('Closing Price')
plt.show()

##### 1. Why did you pick the specific chart?

A line chart is the natural choice for showing how the closing price moves over time.

##### 2. What is/are the insight(s) found from the chart?

The chart exposes the overall trend of the closing price and the periods in which it fluctuates most.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Understanding the trend is crucial for making informed investment decisions.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
plt.figure(figsize=(12, 10))
sns.histplot(df['Close'], bins=30, kde=True)
plt.title('Distribution of Closing Prices')
plt.xlabel('Closing Price')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

A histogram with a KDE curve shows the distribution and skewness of the closing prices.

##### 2. What is/are the insight(s) found from the chart?

The closing prices are right-skewed: most months fall in the lower price range, and a small number of months reach much higher prices.
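
Right skew can also be checked numerically. The sketch below uses hypothetical prices, not the Yes Bank data, and compares the sample skewness before and after a log transform.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed prices: many low values, a few large ones.
prices = pd.Series([10, 11, 12, 12, 13, 14, 15, 18, 60, 150], dtype=float)

skew_raw = prices.skew()          # sample skewness; > 0 means a longer right tail
skew_log = np.log(prices).skew()  # skewness after a log transform

print(f"mean={prices.mean():.1f}, median={prices.median():.1f}")
print(f"skew raw={skew_raw:.2f}, skew log={skew_log:.2f}")
```

A mean well above the median and a positive skewness both point to a right tail. The log transform reduces the skew in this toy example, which is the usual motivation for modelling log prices.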

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. The distribution shows the price ranges in which the stock has most often closed, which helps frame expectations for future prices.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(12, 9))
sns.boxplot(data=df[['Open']])
plt.title('Box Plot of Opening prices')
plt.show()

##### 1. Why did you pick the specific chart?

A box plot of the opening prices shows their median, spread, and any outliers in a single view.

##### 2. What is/are the insight(s) found from the chart?

Higher prices show greater variability, and any points beyond the upper whisker mark months with unusually high opening prices.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. The spread and outliers indicate how volatile the price has been, which is essential input for risk management.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(12, 9))
sns.boxplot(data=df[[ 'Close']])
plt.title('Box Plot of Closing prices')
plt.show()

##### 1. Why did you pick the specific chart?

A box plot of the closing prices shows their median, spread, and any outliers in a single view.

##### 2. What is/are the insight(s) found from the chart?

Higher prices show greater variability and more frequent outliers than the lower price range, which makes them harder to predict.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. It gives a direct view of price volatility, which supports risk management.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
plt.figure(figsize=(12, 10))
sns.boxplot(data=df[[ 'High' ]])
plt.title('Box Plot of High Prices')
plt.show()

##### 1. Why did you pick the specific chart?

A box plot of the monthly high prices shows their median, spread, and any outliers in a single view.

##### 2. What is/are the insight(s) found from the chart?

The chart shows how widely the monthly highs are spread and whether any months lie beyond the whiskers as outliers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. The spread of the monthly highs is a direct indicator of price volatility, which supports risk management.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
plt.figure(figsize=(12, 9))
sns.boxplot(data=df[['Low']])
plt.title('Box Plot of Low Prices')
plt.show()


##### 1. Why did you pick the specific chart?

A box plot of the monthly low prices shows their median, spread, and any outliers in a single view.

##### 2. What is/are the insight(s) found from the chart?

The chart shows how widely the monthly lows are spread and whether any months lie beyond the whiskers as outliers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. The spread of the monthly lows is a direct indicator of price volatility and downside risk, which supports risk management.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
plt.figure(figsize=(12, 9))
sns.histplot(df['Open'], bins=30, kde=True)
plt.title('Distribution of Opening Prices')
plt.xlabel('Opening Price')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

A histogram with a KDE curve shows the distribution and skewness of the opening prices. It does not show how they change over time.

##### 2. What is/are the insight(s) found from the chart?

The opening prices are strongly right-skewed: most values sit in the lower price range, and only a few months open at much higher prices.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Knowing that low opening prices dominate helps set realistic expectations for typical price levels; the few very high values should be treated as exceptions.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
plt.figure(figsize=(12, 9))
sns.histplot(df['High'], bins=30, kde=True)
plt.title('Distribution of Highest Prices')
plt.xlabel('Highest Price')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

A histogram with a KDE curve shows the distribution and skewness of the monthly high prices.

##### 2. What is/are the insight(s) found from the chart?

The monthly highs are strongly right-skewed: most values are modest, with a few rare peaks at much higher prices.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. The distribution shows which price levels are typical for the monthly high and which are rare, which helps when judging how realistic a price target is.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
plt.figure(figsize=(12, 9))
sns.histplot(df['Low'], bins=30, kde=True)
plt.title('Distribution of Lowest Prices')
plt.xlabel('Lowest Price')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

A histogram with a KDE curve shows the distribution and skewness of the monthly low prices.

##### 2. What is/are the insight(s) found from the chart?

The histogram shows how the monthly lows are distributed and which price range occurs most often.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. The distribution of monthly lows indicates how far the price has tended to fall within a month, which is useful when assessing downside risk.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
# Scatter plot between dependent variable with all independent variables.
for col in df.columns:
    if df[col].dtype in ['int64', 'float64']:  # Check if column data type is numeric
        fig = plt.figure(figsize=(12, 9))
        ax = fig.gca()
        feature = df[col]
        label = df['Close']
        correlation = feature.corr(label)
        plt.scatter(x=feature, y=label)
        plt.xlabel(col)
        plt.ylabel('Closing Price')
        ax.set_title('Closing Price - ' + col + ' Correlation: ' + str(correlation))
        z = np.polyfit(df[col], df['Close'], 1)
        y_hat = np.poly1d(z)(df[col])
        plt.plot(df[col], y_hat, "r--", lw=1)

plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot with a fitted regression line shows how each numeric variable relates to the closing price and how close that relationship is to linear. The correlation coefficient in each title quantifies its strength.

##### 2. What is/are the insight(s) found from the chart?

The closer the points lie to the dashed red line, and the closer the correlation is to 1, the more strongly that variable moves together with the closing price. Because the loop covers every numeric column, it also plots 'Close' against itself, which trivially gives a correlation of 1.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Informed decision-making:

A strong positive correlation between a variable and the closing price tells investors which variables are worth tracking when forecasting price movements.

Overreliance on correlation:

Correlation does not establish causation. Basing investment decisions on these correlations alone, without understanding the underlying causes, can lead to misguided investments and unforeseen risk.
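
The two quantities shown in each chart, the correlation in the title and the dashed fitted line, can be reproduced with NumPy alone. The arrays below are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical monthly prices; the values are illustrative only.
open_price = np.array([12.0, 13.5, 15.0, 14.0, 18.0, 21.0, 19.5, 25.0])
close_price = np.array([13.0, 14.8, 14.2, 17.5, 20.5, 19.0, 24.0, 26.5])

# Pearson correlation: covariance divided by the product of the standard deviations.
r = np.corrcoef(open_price, close_price)[0, 1]

# Least-squares line close = slope * open + intercept, as fitted by np.polyfit(..., 1).
slope, intercept = np.polyfit(open_price, close_price, 1)

# For a one-feature linear fit, R^2 equals the squared correlation.
residuals = close_price - (slope * open_price + intercept)
r_squared = 1 - residuals.var() / close_price.var()

print(f"r={r:.3f}, slope={slope:.3f}, intercept={intercept:.3f}, R^2={r_squared:.3f}")
```

The identity between R² and r² holds only for a single feature with an intercept. With several correlated features, as in this dataset, the individual correlations overlap and cannot simply be added up.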

#### Chart - 11

In [None]:
# Chart - 11 visualization code
# Scatter plot between dependent variable with all independent variables.
for col in ['Open', 'Low', 'High']:
    if df[col].dtype in ['int64', 'float64']:  # Check if column data type is numeric
        fig = plt.figure(figsize=(12, 9))
        ax = fig.gca()
        feature = df[col]
        label = df['Close']
        correlation = feature.corr(label)
        plt.scatter(x=feature, y=label)
        plt.xlabel(col)
        plt.ylabel('Closing Price')
        ax.set_title('Closing Price - ' + col + ' Correlation: ' + str(correlation))
        z = np.polyfit(df[col], df['Close'], 1)
        y_hat = np.poly1d(z)(df[col])
        plt.plot(df[col], y_hat, "r--", lw=1)

plt.show()

##### 1. Why did you pick the specific chart?

Scatter plots with a fitted line, restricted here to 'Open', 'Low', and 'High', show how each independent variable relates to the closing price.

##### 2. What is/are the insight(s) found from the chart?

Each plot's title reports the correlation with the closing price, and the dashed red line shows the least-squares linear fit. Points lying close to that line indicate a near-linear relationship.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The plots indicate that each independent variable has a linear relationship with the dependent variable.

POSITIVE BUSINESS IMPACT

Informed Decision-Making:
Unlocking the intricate relationship between diverse independent variables and stock closing prices empowers investors to craft informed investment strategies. By discerning strong positive correlations, investors can strategically monitor these variables, enhancing their ability to forecast stock movements.

Risk Management:
The identification of variables negatively correlated with stock prices is pivotal for effective risk management. Consistent patterns of value depreciation associated with certain factors arm investors and financial analysts with the insights needed to proactively mitigate risk.

NEGATIVE BUSINESS IMPACT

Overreliance on Correlation:
Relying solely on correlation without establishing causation can be perilous. Blindly following these correlations without delving into underlying causes may result in suboptimal investment decisions.

Misinterpretation of Data:
Erroneous interpretation of trends or outliers poses a significant risk. Mistakenly assuming a linear relationship where none exists can lead to flawed predictions, jeopardizing investment outcomes.

#### Chart - 12

In [None]:
# Chart - 12 visualization code

# No additional chart is needed; the previous charts already cover the relevant information.

No additional chart is needed; the previous charts already cover the relevant information.

#### Chart - 13

In [None]:
# Chart - 13 visualization code

# No additional chart is needed; the previous charts already cover the relevant information.

No additional chart is needed; the previous charts already cover the relevant information.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
# Remove non-numeric columns from DataFrame
numeric_df = df.select_dtypes(include=['float64', 'int64'])

# Correlation plot
plt.figure(figsize=(12, 5))
sns.heatmap(numeric_df.corr(), annot=True, cmap='coolwarm')
plt.show()

##### 1. Why did you pick the specific chart?

The correlation heatmap gives a compact view of the pairwise relationships between all numeric variables in the dataset.

Heatmaps are effective for quickly identifying patterns and relationships in large datasets, especially when dealing with numerical variables.

##### 2. What is/are the insight(s) found from the chart?

Reading the correlations:

Values close to 1 indicate a positive correlation: as one variable rises, so does the other. Values close to -1 indicate a negative correlation: as one rises, the other falls.

Strong relationships:

Cells with high absolute values mark the strongest relationships and show which variables carry largely the same information.

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(df)

##### 1. Why did you pick the specific chart?

A pair plot shows every pairwise scatter plot together with each variable's distribution in a single figure, giving a quick overview of the whole dataset.

##### 2. What is/are the insight(s) found from the chart?

The plots reveal a high degree of collinearity and correlation among the variables. Charts 10 and 11 already show this for each variable against the closing price, and the correlation heatmap in Chart 14 makes its magnitude explicit.
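
One common way to put a number on collinearity is the variance inflation factor (VIF). The notebook does not compute it; the sketch below is an added diagnostic on synthetic data, and the helper `vif` is a hypothetical name written with NumPy only.

```python
import numpy as np

def vif(features: np.ndarray) -> np.ndarray:
    """VIF of each column: 1 / (1 - R^2) from regressing it on the other columns."""
    n_rows, n_cols = features.shape
    out = np.empty(n_cols)
    for j in range(n_cols):
        target = features[:, j]
        others = np.delete(features, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])  # add an intercept
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        r2 = 1 - residual.var() / target.var()
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic stand-ins for Open/High/Low: three noisy copies of one rising price path.
rng = np.random.default_rng(0)
base = np.linspace(100, 200, 200)
X = np.column_stack([base + rng.normal(0, 0.5, 200) for _ in range(3)])

print(vif(X).round(1))
```

A widely used rule of thumb treats VIF values above roughly 10 as a sign of strong collinearity. When features are this redundant, the individual coefficients of a linear model become unstable even if its predictions stay accurate.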

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null hypothesis (H0): the mean closing price before 2018 equals the mean closing price from 2018 onward.

Alternate hypothesis (H1): the mean closing price before 2018 differs from the mean closing price from 2018 onward.

#### 2. Perform an appropriate statistical test.

In [None]:
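# Perform Statistical Test to obtain P-Value
# Build the datetime index from the 'Date' column. Converting the default
# integer index instead maps every row to 1970, which leaves post_2018 empty
# and makes the t-test return (nan, nan).
# ASSUMPTION: 'Date' values look like 'Jul-05'; adjust `format` if they differ.
df.index = pd.to_datetime(df['Date'], format='%b-%y')

# Split data into two periods
pre_2018 = df[df.index < '2018-01-01']['Close']
post_2018 = df[df.index >= '2018-01-01']['Close']

# Perform an independent two-sample t-test
from scipy.stats import ttest_ind
t_stat, p_val = ttest_ind(pre_2018, post_2018)
t_stat, p_val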


##### Which statistical test have you done to obtain P-Value?

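An independent two-sample t-test (`scipy.stats.ttest_ind`).

##### Why did you choose the specific statistical test?

It compares the means of two independent groups, here the closing prices before 2018 and from 2018 onward, which is what the hypothesis asks.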

### Hypothetical Statement - 2

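Not performed. In the original run, Hypothesis 1 returned `nan` for both the t-statistic and the p-value, so no further statements were tested.

### Hypothetical Statement - 3

Not performed. In the original run, Hypothesis 1 returned `nan` for both the t-statistic and the p-value, so no further statements were tested.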
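
The `nan` result can be traced on a toy frame. In this sketch the 'Mon-YY' date strings are an assumption about the file's format, and the prices are made up.

```python
import pandas as pd

# Hypothetical rows in the 'Mon-YY' style; the real file's format is an assumption.
toy = pd.DataFrame({"Date": ["Nov-17", "Dec-17", "Jan-18", "Feb-18"],
                    "Close": [300.0, 315.0, 354.0, 322.0]})

# Converting the default integer index treats 0, 1, 2, ... as nanoseconds since 1970.
wrong_index = pd.to_datetime(toy.index)
print(wrong_index.year.unique())   # every row lands in 1970, so "from 2018 onward" is empty

# Parsing the 'Date' column gives real month starts.
toy.index = pd.to_datetime(toy["Date"], format="%b-%y")
pre_2018 = toy.loc[toy.index < "2018-01-01", "Close"]
post_2018 = toy.loc[toy.index >= "2018-01-01", "Close"]
print(len(pre_2018), len(post_2018))
```

With an empty second group a t-test has nothing to compare, which is consistent with the `nan` values reported above. Once both groups contain observations, the test can return a finite statistic and p-value.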

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
# No imputation is applied here; null counts were checked earlier with df.isnull().sum().

#### What all missing value imputation techniques have you used and why did you use those techniques?

None. Missing values are counted in the Know Your Data section, and the Data Wrangling code applies no imputation.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
# Already handled in the Data Wrangling section

##### What all outlier treatment techniques have you used and why did you use those techniques?

The IQR method, applied in the Data Wrangling section: rows whose 'Close' value lies below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR were removed.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
# Not applicable as there are no categorical variables in the dataset.

#### What all categorical encoding techniques have you used & why did you use those techniques?

Not applicable, as there are no categorical variables in the dataset.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

Not applicable as there are no Textual data variables in the dataset.


### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
# Create additional features like month and year from Date
df['Year'] = df.index.year
df['Month'] = df.index.month

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting
# No selection step is run here. This cell repeats the 'Year' and 'Month'
# creation from the Feature Manipulation cell, so re-running it is harmless.
df['Year'] = df.index.year
df['Month'] = df.index.month

##### What all feature selection methods have you used  and why?

No formal feature selection method was applied. 'Year' and 'Month' were derived from the date index so that seasonal patterns and year-over-year trends could be examined.

##### Which all features you found important and why?

'Year' and 'Month' matter when the data shows seasonal fluctuations or annual patterns. Judging the importance of the other features requires further exploration, domain knowledge, or a model-based measure such as decision-tree feature importances. Note that the models below are trained only on 'Open', 'High', and 'Low'.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data
# Log transformation to stabilize variance
df['Log_Close'] = np.log(df['Close'])

### 6. Data Scaling

In [None]:
# Scaling your data
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df[['Open', 'High', 'Low', 'Log_Close']]), columns=['Open', 'High', 'Low', 'Log_Close'])

##### Which method have you used to scale you data and why?

`StandardScaler` rescales each column to zero mean and unit variance, which puts 'Open', 'High', 'Low', and 'Log_Close' on a comparable scale. Two consequences are worth noting: the scaler is fitted on the full dataset before the train/test split, and the target 'Log_Close' is scaled as well, so the errors reported later are in standardized log units rather than in price units.
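
Because the target is log-transformed and then standardized, a prediction has to pass back through both steps before it can be read as a price. The sketch below shows the round trip on hypothetical prices; it fits a scaler on the target alone, which differs from the single four-column scaler used in the cell above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical closing prices, log-transformed as in the notebook.
close = np.array([12.0, 15.0, 20.0, 35.0, 80.0, 150.0])
log_close = np.log(close).reshape(-1, 1)   # scalers expect a 2-D array

scaler = StandardScaler()
z = scaler.fit_transform(log_close)        # z = (x - mean) / std, with the population std

# The same result computed by hand.
z_manual = (log_close - log_close.mean()) / log_close.std()

# Undo both steps to get back to the price scale.
recovered = np.exp(scaler.inverse_transform(z)).ravel()
print(np.allclose(z, z_manual), np.allclose(recovered, close))
```

Keeping a separate scaler for the target is one way to make the inverse step straightforward; with one scaler fitted on four columns, the target column's mean and scale would have to be picked out of `scaler.mean_` and `scaler.scale_` by position.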

### 7. Dimensionality Reduction

Not needed for this dataset.

##### Do you think that dimensionality reduction is needed? Explain Why?

No. The models use only three input features ('Open', 'High', and 'Low'), so the feature space is already small.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

None; dimensionality reduction was not applied.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
X = df_scaled[['Open', 'High', 'Low']]
y = df_scaled['Log_Close']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

##### What data splitting ratio have you used and why?

The data is split 80:20: 80% of the rows are used for training and 20% are held out for testing. This is a common default that leaves most of the data for fitting while keeping a separate set for evaluation. Note that `train_test_split` shuffles the rows by default, so the test months are not necessarily the most recent ones.
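
For time-ordered data, one alternative to a shuffled split is to hold out the most recent months and to compute the scaling statistics from the training rows only. The sketch below illustrates that alternative on a hypothetical frame; it is not what the cell above does.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly frame, already sorted by date.
dates = pd.date_range("2015-01-01", periods=10, freq="MS")
df_toy = pd.DataFrame({"Open": np.arange(10.0, 20.0),
                       "Close": np.arange(11.0, 21.0)}, index=dates)

# Time-ordered 80:20 split: the last 20% of months form the test set.
cut = int(len(df_toy) * 0.8)
train, test = df_toy.iloc[:cut], df_toy.iloc[cut:]

# Scaling statistics come from the training rows only and are reused on the test rows.
mean, std = train["Open"].mean(), train["Open"].std(ddof=0)
train_open_z = (train["Open"] - mean) / std
test_open_z = (test["Open"] - mean) / std

print(len(train), len(test), train.index.max() < test.index.min())
```

Splitting this way keeps information from later months out of training, so the test score is closer to what a real forecast would face. The same effect can be had from `train_test_split(..., shuffle=False)` on a frame sorted by date.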

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Not applicable. Class imbalance is a classification concern, and this is a regression problem with a continuous target.

In [None]:
# Handling Imbalanced Dataset (If needed)
# Not needed

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Not needed.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation: LR
lr = LinearRegression()
# Fit the Algorithm
lr.fit(X_train, y_train)
# Predict on the model
y_pred_lr = lr.predict(X_test)

In [None]:
#Evaluation

mse_lr = mean_squared_error(y_test, y_pred_lr)
r2_lr = r2_score(y_test, y_pred_lr)
mse_lr, r2_lr


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
#Visualization

#defining mape
def mape(actual, pred):
    actual, pred = np.array(actual), np.array(pred)
    return np.mean(np.abs((actual - pred) / actual)) * 100


#actual-predicted values plot
plt.figure(figsize=(10,5))
plt.plot(y_pred_lr)
plt.plot(np.array(y_test))
plt.legend(["Predicted","Actual"])
plt.xlabel('No of Test Data')
plt.show()

The first model is ordinary Linear Regression (LR), which fits a linear relationship between the independent variables and the target. It is the simplest of the three models and serves as the baseline.

The model is evaluated on the test set with two metrics:

* Mean Squared Error (MSE): the average squared difference between actual and predicted values. LR scores 0.1782, the highest error of the three models in this notebook.

* R-squared (R2) Score: the proportion of variance in the dependent variable that the model explains. LR scores approximately 0.7922, so it accounts for about 79.22% of the variance.

* Both metrics are computed on the log-transformed closing price (the predictions are exponentiated back to the original scale in the Future Work section), so the MSE is in squared log units rather than squared price units.
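To make the metrics concrete, the sketch below computes MSE and R2 directly from their definitions. It also calls a `mape` helper like the one in the visualization cell, which the notebook defines but never calls. The arrays are made-up illustrative values, not the Yes Bank data.

```python
import numpy as np

# Illustrative values only (hypothetical, not taken from the notebook's data)
actual = np.array([10.0, 20.0, 30.0, 40.0])
pred = np.array([12.0, 18.0, 33.0, 37.0])

# MSE: mean of the squared residuals
mse = np.mean((actual - pred) ** 2)

# R2: 1 - (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum((actual - pred) ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

def mape(actual, pred):
    """Mean absolute percentage error; undefined if any actual value is 0."""
    actual, pred = np.array(actual), np.array(pred)
    return np.mean(np.abs((actual - pred) / actual)) * 100

print(f"MSE={mse:.3f}  R2={r2:.3f}  MAPE={mape(actual, pred):.3f}%")
```

In the notebook itself the same helper could be called as `mape(y_test, y_pred_lr)`. Because the target there is log-transformed, a percentage error is easier to interpret after the values are exponentiated back to prices.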

### ML Model - 2 Decision Tree Regressor (dt)

In [None]:
#ML MODEL-2 (Decision Tree Regressor)

dt = DecisionTreeRegressor()
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)

In [None]:
#Evaluation

mse_dt = mean_squared_error(y_test, y_pred_dt)
r2_dt = r2_score(y_test, y_pred_dt)
mse_dt, r2_dt

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
#Visualization

#defining mape
def mape(actual, pred):
    actual, pred = np.array(actual), np.array(pred)
    return np.mean(np.abs((actual - pred) / actual)) * 100


#actual-predicted values plot
plt.figure(figsize=(10,5))
plt.plot(y_pred_dt)
plt.plot(np.array(y_test))
plt.legend(["Predicted","Actual"])
plt.xlabel('No of Test Data')
plt.show()

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

This model is a Decision Tree Regressor, which builds a tree-structured regression model. At each node it splits the data into subsets on a threshold of one feature, and the resulting sequence of decision rules yields a prediction for the target variable at each leaf.

The model is evaluated with two metrics:

1. Mean Squared Error (MSE): the average squared difference between actual and predicted values. Lower values indicate better performance.

2. R-squared (R2) Score: the proportion of variance in the dependent variable that the independent variables collectively explain. It is at most 1 (a perfect fit), equals 0 for a model that always predicts the mean, and can fall below 0 on test data; higher is better.

The Decision Tree Regressor achieves an MSE of 0.0432 and an R2 score of approximately 0.9496 on the test set, meaning it explains about 94.96% of the variance in the dependent variable. Both figures improve on Linear Regression (MSE 0.1782, R2 0.7922).
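How a regression tree picks a single split can be sketched in a few lines. The helper below (`best_split`, a hypothetical name that is not part of scikit-learn) tries every midpoint between sorted feature values and keeps the one that minimizes the summed squared error of the two sides, which is the idea behind the squared-error splitting criterion. The data are made up for illustration.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x, y):
    """Return (threshold, error) for the one-feature split with the lowest total SSE."""
    pairs = sorted(zip(x, y))
    best_threshold, best_error = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # equal feature values cannot be separated by a threshold
        left = [target for _, target in pairs[:i]]
        right = [target for _, target in pairs[i:]]
        error = sse(left) + sse(right)
        if error < best_error:
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_error = error
    return best_threshold, best_error

# Hypothetical data: two clearly separated groups of target values
x_demo = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y_demo = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]
threshold, error = best_split(x_demo, y_demo)
print(threshold, round(error, 4))
```

A full tree repeats this search over every feature and then recurses into each side until a stopping rule such as `max_depth` or `min_samples_leaf` is reached.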

### ML Model - 3 Random Forest Regressor (rfr)

In [None]:
# ML Model - 3 Implementation

rf = RandomForestRegressor()
rf.fit(X_train, y_train)  # Fitting the model
y_pred_rf = rf.predict(X_test)  # Predicting on the test set

In [None]:

#Evaluation in mse score & R2 score

mse_rf = mean_squared_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)
mse_rf, r2_rf


In [None]:
importances = rf.feature_importances_
feature_importance_df = pd.DataFrame({'Feature': X.columns, 'Importance': importances})
feature_importance_df.sort_values(by='Importance', ascending=False, inplace=True)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
#Visualization

#defining mape
def mape(actual, pred):
    actual, pred = np.array(actual), np.array(pred)
    return np.mean(np.abs((actual - pred) / actual)) * 100


#actual-predicted values plot
plt.figure(figsize=(10,5))
plt.plot(y_pred_rf)
plt.plot(np.array(y_test))
plt.legend(["Predicted","Actual"])
plt.xlabel('No of Test Data')
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Instantiate the regressor
rf = RandomForestRegressor()

# Perform grid search
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)

# Get the best parameters and the estimator refit with them on the full training set (refit=True by default)
best_params = grid_search.best_params_
best_rf = grid_search.best_estimator_

# Evaluate the model
y_pred_rf = best_rf.predict(X_test)
mse_rf = mean_squared_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)
mse_rf, r2_rf

##### Which hyperparameter optimization technique have you used and why?

Grid Search with cross-validation (GridSearchCV) was used. A Random Forest's performance depends on hyperparameters such as 'n_estimators', 'max_depth', 'min_samples_split', and 'min_samples_leaf', and these interact, so they are best tuned together.

Grid Search evaluates every combination in the parameter grid exhaustively. Here the grid has 3 x 4 x 3 x 3 = 108 combinations, each scored with 5-fold cross-validation on the training set, and the combination with the best mean validation score (R2, the regressor's default) is kept. The grid is small, so an exhaustive search is affordable and a randomized or Bayesian search is not needed.
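The cost of the search can be counted directly from the grid. This sketch enumerates the candidates with `itertools.product`, using the same grid and fold count as the tuning cell. It only counts fits and does not train anything.

```python
from itertools import product

# Same grid as in the tuning cell
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}
cv_folds = 5

# One dict per candidate, in the same spirit as GridSearchCV's exhaustive enumeration
combinations = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
total_fits = len(combinations) * cv_folds

print(len(combinations), "combinations,", total_fits, "cross-validation fits")
print("first candidate:", combinations[0])
```

GridSearchCV then performs one more fit, refitting the best candidate on the whole training set.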

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

The untuned Random Forest Regressor already performed well, with an MSE of 0.033 and an R2 score of 96.1%. After hyperparameter tuning with Grid Search CV, the MSE fell to 0.031 and the R2 score rose to 96.3%. That is a gain of 0.2 percentage points in R2 and a reduction of about 6% in MSE. The improvement is modest, and because neither forest was fitted with a fixed random_state, part of it may be run-to-run variation.
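The size of the change follows from the rounded figures quoted above. A quick arithmetic check:

```python
# Rounded figures quoted in the text above
mse_before, mse_after = 0.033, 0.031
r2_before, r2_after = 0.961, 0.963

r2_gain_points = (r2_after - r2_before) * 100                    # percentage points
mse_reduction_pct = (mse_before - mse_after) / mse_before * 100  # relative change

print(f"R2 gain: {r2_gain_points:.1f} percentage points")
print(f"MSE reduction: {mse_reduction_pct:.1f}%")
```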

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

MSE and R2 were used to evaluate every model, because together they report both the size of the prediction error and the share of variance explained. The final model is a Random Forest Regressor, an ensemble method that combines many decision trees and averages their predictions, which reduces the overfitting a single tree is prone to.

Its performance on the test set:

* Mean Squared Error (MSE) is approximately 0.0332, the lowest of the three models.
* R-squared (R2) Score is approximately 0.9613, so the model explains about 96.13% of the variance in the dependent variable.
* Lower error and higher explained variance mean closing-price forecasts that deviate less from the actual values, which is what matters to anyone acting on the predictions.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

The Random Forest Regressor is chosen as the final model.

* It has the lowest Mean Squared Error (MSE) at 0.0332 and the highest R-squared (R2) score at 0.9613 of the three models.

* Linear Regression trails with an MSE of 0.1782 and an R2 of 0.7922. The Decision Tree Regressor is close, with an MSE of 0.0432 and an R2 of 0.9496, but still behind.

* Beyond the scores, averaging over many trees generally makes a Random Forest generalize better to unseen data than a single decision tree, and makes it less sensitive to outliers and noise.

In summary, the Random Forest Regressor offers the best test-set accuracy of the models compared here, with the added stability of an ensemble.
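Because the target appears to be log-transformed (the Future Work cells apply np.exp to recover prices), an MSE in log units can be read as a typical multiplicative error. The sketch below assumes a natural-log target and treats exp(RMSE) as a rough error factor. It is an interpretation aid under that assumption, not a metric computed in the notebook.

```python
import math

# MSE values reported in this notebook (log scale, assuming a natural-log target)
reported_mse = {
    "Linear Regression": 0.1782,
    "Decision Tree": 0.0432,
    "Random Forest": 0.0332,
}

for name, mse in reported_mse.items():
    rmse_log = math.sqrt(mse)    # typical error in log units
    factor = math.exp(rmse_log)  # the same error as a multiplicative factor on price
    print(f"{name}: log-RMSE={rmse_log:.3f}, typical error factor ~{factor:.2f}x")

# Model with the lowest reported MSE
best_model = min(reported_mse, key=reported_mse.get)
print("Lowest MSE:", best_model)
```

Under that assumption, the Random Forest's predictions are typically within a factor of about 1.2 of the actual price, against roughly 1.5 for Linear Regression.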

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

The model: the Random Forest Regressor is an ensemble method for regression that combines multiple decision trees. Each tree is trained on a bootstrap sample of the training data (drawn at random with replacement), and the forest's prediction is the average of the individual trees' predictions.

Model explainability: feature importance is read from the fitted model's feature_importances_ attribute. It assigns each feature a weight reflecting how much that feature contributed to reducing prediction error across the trees, and the weights sum to 1. The features are then ranked and plotted by this weight.
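The bootstrap-and-average idea can be sketched without scikit-learn. The example below is a simplified illustration, not the Random Forest algorithm itself: the base learner is a straight-line fit (np.polyfit) rather than a decision tree, and the data are synthetic. It only shows how resampling with replacement and averaging fit together.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): y = 2x + 1 plus small noise
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.size)

x_new = np.array([0.25, 0.5, 0.75])
n_models = 25
predictions = []

for _ in range(n_models):
    # Bootstrap sample: draw n row indices with replacement
    idx = rng.integers(0, x.size, size=x.size)
    # Base learner: a degree-1 polynomial fit on the resampled rows
    coefs = np.polyfit(x[idx], y[idx], deg=1)
    predictions.append(np.polyval(coefs, x_new))

# Bagged prediction: average over all base learners
bagged = np.mean(predictions, axis=0)
print(bagged)
```

In a Random Forest the base learners are decision trees, which vary far more from sample to sample than a straight line does, so averaging them removes more variance.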

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.ensemble import RandomForestRegressor # Importing the libraries

# Assuming best_rf is the trained Random Forest model
feature_importance = best_rf.feature_importances_
features = X_train.columns

# Create a DataFrame for better visualization
importance_df = pd.DataFrame({'Feature': features, 'Importance': feature_importance})
importance_df = importance_df.sort_values(by='Importance', ascending=False)

# Plotting feature importances
plt.figure(figsize=(12, 9))
plt.barh(importance_df['Feature'], importance_df['Importance'])
plt.xlabel('Feature Importance') # X label
plt.ylabel('Feature') # Y label
plt.title('Feature Importance from Random Forest Regressor')
plt.gca().invert_yaxis()
plt.show()

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File
# Using joblib to persist the model for deployment
import joblib

# Save the tuned model from the grid search.
# best_rf is already fitted on X_train and y_train, so no refit is needed;
# refitting a default RandomForestRegressor here would discard the tuning.
joblib.dump(best_rf, 'best_model.pkl')


### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.

# Load the saved model
loaded_model = joblib.load('best_model.pkl')

# Predict on test data
new_predictions = loaded_model.predict(X_test)

print(new_predictions)

# Exponentiate the log-transformed predictions to get the original scale
original_scale_predictions = np.exp(new_predictions)

print("Original scale predictions:", original_scale_predictions)


In [None]:
# Calculate evaluation metrics in the original scale
original_scale_y_test = np.exp(y_test)
mse = mean_squared_error(original_scale_y_test, original_scale_predictions)
r2 = r2_score(original_scale_y_test, original_scale_predictions)

print("Mean Squared Error:", mse)
print("R2 Score:", r2)

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

The Random Forest Regressor achieved the highest R2 score and the lowest MSE of the three models, so it is selected as the model for forecasting Yes Bank's monthly closing stock prices. Its predictions give investors and other stakeholders a data-driven input for their decisions.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***