<a href="https://colab.research.google.com/github/AdityaSingh1907/Yes-Bank-Stock-Price-Prediction-Project/blob/main/Yes_Bank_Stock_Closing_Price_Prediction_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - **Yes Bank Stock Closing Price Prediction**



##### **Project Type**    - **Regression**
##### **Contribution** - **Individual**
##### **Name** - **Aditya Singh**


# **Project Summary -**

The objective of this project is to develop a machine learning model that can accurately predict the closing price of Yes Bank's stock. Predicting stock prices is a challenging task due to the complex and dynamic nature of financial markets. However, by leveraging historical stock data and utilizing advanced machine learning techniques, we aim to build a robust predictive model.



# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**Yes Bank is a well-known bank in the Indian financial domain. Since 2018, it has been in the news because of the fraud case involving Rana Kapoor. Owing to this fact, it was interesting to see how that impacted the stock prices of the company and whether Time series models or any other predictive models can do justice to such situations. This dataset has monthly stock prices of the bank since its inception and includes closing, starting, highest, and lowest stock prices of every month. The main objective is to predict the stock’s closing price of the month.**

# **General Guidelines** : -

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.

     The additional credits will have advantages over other students during Star Student selection.

             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that, for each chart, the following format is answered.


```
# Chart visualization code
```


*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML model used and its performance using an evaluation metric score chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with an updated evaluation metric score chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import *
import plotly.express as px
import math
from scipy.stats import mannwhitneyu
from datetime import datetime                           # to convert to date

from dateutil.relativedelta import relativedelta        # date arithmetic (month/year offsets)
from scipy.optimize import minimize                     # for function minimization
import copy                                             # create copies

from sklearn.preprocessing import (MinMaxScaler,        # scale the data
StandardScaler)
from sklearn.feature_selection import VarianceThreshold #remove constant and quasi-constant features.
from sklearn.model_selection import train_test_split    # split train and test data
from sklearn.model_selection import (cross_val_score,   # split train and test data on a timeseries
TimeSeriesSplit)

from sklearn.linear_model import LinearRegression       # regression model
from xgboost import XGBRegressor                        # xgboost model
from sklearn.ensemble import RandomForestRegressor      # random forest model
from sklearn.svm import SVR                             # support vector regressor
from sklearn.linear_model import (Lasso, Ridge,         # regularization
ElasticNet, LassoCV, RidgeCV, ElasticNetCV)
from sklearn.model_selection import GridSearchCV        # grid search to optimize parameters

from sklearn.metrics import (r2_score,                  # import required metrics
mean_squared_error,  mean_absolute_percentage_error,
mean_absolute_error)

from statsmodels.tsa.stattools import adfuller          # statistics and econometrics
import statsmodels.formula.api as smf
import statsmodels.tsa.api as smt
import statsmodels.api as sm
import scipy.stats as scs




In [None]:
from google.colab import drive
drive.mount('/content/drive')

### Dataset Loading

In [None]:
# Load Dataset
file_path = '/content/drive/MyDrive/data_YesBank_StockPrices.csv'
df = pd.read_csv(file_path)

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
len(df[df.duplicated()])

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(df.isnull().sum())

In [None]:
# Visualizing the missing values
sns.heatmap(df.isnull(), cbar=True)

### What did you know about your dataset?

The dataset has a total of 185 entries and no null values.
The Date column has the 'object' datatype, so we have to convert it to 'datetime'.

In [None]:
# Convert the 'Date' strings (format "%b-%y") to datetime objects
df['Date'] = pd.to_datetime(df['Date'], format='%b-%y')

In [None]:
df

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe(include='all')


### Variables Description

**Date** - Month of the record

**Open** - Opening price of the month

**High** - Highest price in the month

**Low** - Lowest price in the month

**Close** - Closing price of the month (the target variable)

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

for i in df.columns.tolist():
  print("No. of unique values in ",i,"is",df[i].nunique(),".")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Copy of data
df=df.copy()

In [None]:
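# dropna returns a new DataFrame, so the result must be assigned back.
# The dataset has no missing values, so both calls leave it unchanged here.
df = df.dropna()        # Drop rows with any NaN value
df = df.dropna(axis=1)  # Drop columns with any NaN value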

In [None]:
# Fill missing values with the column mean
# (assigning the result back avoids chained assignment with inplace=True)
for col in ['Date', 'Open', 'High', 'Low', 'Close']:
    df[col] = df[col].fillna(df[col].mean())

print(df.head())
df.shape

In [None]:
# BIFURCATE DEPENDENT AND INDEPENDENT VARIABLES

indep_var=df[['High','Low','Open']]
dep_var=df['Close']

### What all manipulations have you done and insights you found?

First, I made a copy of the dataset, then dropped rows and columns with missing values, filled any remaining missing values with the column mean, and separated the variables into dependent and independent variables.

## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 - Distribution and Box Plots of Independent Variables (UNIVARIATE)

In [None]:
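# Plots for the independent variables: distribution (left) and box plot (right)
for var in indep_var.columns:
    plt.figure(figsize=(15,6))
    plt.subplot(1, 2, 1)
    # histplot with a KDE curve replaces distplot, which is deprecated in recent seaborn
    ax = sns.histplot(df[var].dropna(), kde=True, stat='density')
    ax.set_ylabel(' ')
    ax.set_xlabel(var)

    plt.subplot(1, 2, 2)
    ax = sns.boxplot(y=df[var])
    ax.set_title('')
    ax.set_ylabel(var)
    plt.show()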

##### 1. Why did you pick the specific chart?

Two types of charts are used for visualizing the independent variables: a distribution plot and a box plot.

Distribution plot: This plot helps in understanding the data distribution and identifying any outliers or unusual patterns.

Box plot: This plot helps in identifying potential outliers, skewness, and variability in the data.

##### 2. What is/are the insight(s) found from the chart?

The plots of the independent variables clearly show that the data is right skewed.
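The visual impression of right skew can be cross-checked numerically. The following is a minimal sketch on a synthetic price series (the values are generated for illustration and are not taken from the dataset); on the real data the equivalent call would be `df[['Open', 'High', 'Low']].skew()`.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed stand-in for a price column (illustrative values only)
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=4.0, sigma=0.8, size=500), name="price")

skewness = prices.skew()  # sample skewness; a value above 0 indicates a longer right tail
print(f"skewness = {skewness:.2f}")
print("mean > median:", prices.mean() > prices.median())
```

A positive skewness together with a mean above the median is the numeric counterpart of the long right tail seen in the distribution plots.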

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. The insights show that the independent variables ('Open', 'Low', 'High') have a positively skewed distribution, which helps me decide which methods to apply to handle the skewness of the data.

#### Chart - 2 - Dependent Variable

In [None]:
# Chart - 2 visualization code
# Dependent variable 'Closing price'
plt.figure(figsize=(15,10))
sns.histplot(df['Close'], kde=True, stat='density', color="y")  # distplot is deprecated in recent seaborn
plt.title('Close Data Distribution')
plt.xlabel('Closing Price')
plt.show()

##### 1. Why did you pick the specific chart?

A distribution plot is used for visualizing the dependent variable.

Distribution plot: This plot helps in understanding the data distribution and identifying any outliers, skewness, or unusual patterns.

##### 2. What is/are the insight(s) found from the chart?

The plot of the dependent variable clearly shows that the data is right skewed.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. The insights show that the dependent variable 'Close' has a positively skewed distribution, which helps me decide which methods to apply to handle the skewness of the data.

#### Chart - 3- Visualize The Data

In [None]:
# Chart - 3 visualization
# visualise the data
fig = px.line(df, x='Date', y='Close', title='Monthly closing price')
fig.update_layout(
    xaxis=dict(title='Year'),
    yaxis=dict(title='Closing price'),
    autosize=False,
    width=1400,
    height=400)

fig.show()

##### 1. Why did you pick the specific chart?

 A line plot is well-suited for this purpose as it provides a clear representation of how the closing prices have evolved over time, allowing you to spot trends and changes.

##### 2. What is/are the insight(s) found from the chart?

We can see that from 2006 to 2018 the stock price more or less kept increasing, but there was a sudden fall after that. This can be attributed to the Yes Bank fraud case against Rana Kapoor.

#### Chart - 4 - Check for skewness in the dataset

In [None]:
# Chart - 4 visualization code

# Select the numeric columns explicitly; df.describe() may or may not include
# the datetime 'Date' column depending on the pandas version
numeric_features = df.select_dtypes(include='number').columns
for col in numeric_features:
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    feature = df[col]
    feature.hist(bins=50, ax = ax)
    ax.axvline(feature.mean(), color='magenta', linestyle='dashed', linewidth=2)    # mean
    ax.axvline(feature.median(), color='cyan', linestyle='dashed', linewidth=2)     # median
    ax.set_title(col)
plt.show()

##### 1. Why did you pick the specific chart?

We used histograms to visualize the distribution of the data in the numeric columns, along with lines indicating the mean and median values. This helps in understanding the central tendency and spread of each feature.

##### 2. What is/are the insight(s) found from the chart?

From the above charts we can clearly see that all numerical variables are positively skewed, so I have to transform these columns to handle the skewness of the data.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. The insights show that all the numeric columns have a positively skewed distribution, which helps me decide which methods to apply to handle the skewness of the data.

#### Chart - 5 - Relationship between dependent & independent variables

In [None]:
# Chart - 5 visualization code
# scatter plot to see the relationship between dependent & independent variables
# (columns are listed explicitly so the datetime 'Date' column is never passed to np.polyfit)
for col in ['Open', 'High', 'Low']:
  fig = plt.figure(figsize=(20,5))
  ax = fig.gca()
  plt.scatter(df[col], df['Close'])
  plt.xlabel(col)
  plt.ylabel('Close')
  ax.set_title('{} vs Close'.format(col))
  z = np.polyfit(df[col], df['Close'], 1)
  y_hat = np.poly1d(z)(df[col])
  plt.plot(df[col], y_hat, "r--", lw=1)
  plt.show()

##### 1. Why did you pick the specific chart?

The scatter plots with fitted regression lines provide visual insight into how each independent variable relates to the dependent variable 'Close'. The slope of the regression line indicates the direction of the relationship, and the spread of the points around the line indicates its strength.

##### 2. What is/are the insight(s) found from the chart?

The independent variables (Open, High, Low) and the dependent variable 'Close' are highly correlated, so we can say that the closing price depends strongly on the independent variables.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The high correlation between the independent variables ('Open', 'High', 'Low') and the dependent variable ('Close') indicates a strong positive relationship. This information can be valuable for building predictive models, understanding stock price dynamics, and making informed decisions in the financial domain.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(12,5))
plt.title('Correlation Heatmap')
cor = sns.heatmap(df.corr(numeric_only=True), cmap='coolwarm', annot=True)  # correlate numeric columns only

##### 1. Why did you pick the specific chart?

A correlation matrix is a table showing correlation coefficients between variables. Each cell in the table shows the correlation between two variables. A correlation matrix is used to summarize data, as an input into a more advanced analysis, and as a diagnostic for advanced analyses. The range of correlation is [-1,1].

Thus, to see the correlation between all the variables along with the correlation coefficients, I used a correlation heatmap.

##### 2. What is/are the insight(s) found from the chart?

Every feature is extremely correlated with the others, so using a single feature, or the average of these features, would suffice for our regression model, since linear regression assumes there is no multicollinearity among the features.
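Multicollinearity of this kind is often quantified with the variance inflation factor, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on the remaining features. The sketch below is illustrative only: the helper `variance_inflation_factors` and the synthetic `toy` frame are assumptions made for the example, not part of the notebook. A VIF above roughly 5 to 10 is a common rule of thumb for problematic collinearity.

```python
import numpy as np
import pandas as pd

def variance_inflation_factors(features: pd.DataFrame) -> pd.Series:
    """Compute VIF_j = 1 / (1 - R_j^2) for every column of `features`."""
    vifs = {}
    for col in features.columns:
        y = features[col].to_numpy(dtype=float)
        others = features.drop(columns=col).to_numpy(dtype=float)
        # Regress column j on the other columns (with an intercept) by least squares
        design = np.column_stack([np.ones(len(others)), others])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        residuals = y - design @ coef
        r_squared = 1.0 - residuals.var() / y.var()
        vifs[col] = 1.0 / (1.0 - r_squared)
    return pd.Series(vifs, name="VIF")

# Synthetic stand-in for Open/High/Low: three noisy copies of one underlying series
rng = np.random.default_rng(42)
base = rng.uniform(10, 300, size=200)
toy = pd.DataFrame({
    "Open": base + rng.normal(0, 5, 200),
    "High": base + rng.normal(8, 5, 200),
    "Low": base + rng.normal(-8, 5, 200),
})

vif = variance_inflation_factors(toy)
print(vif.round(1))
```

On the real data the same helper could be applied to `df[['Open', 'High', 'Low']]`; very large values would support combining the features into one.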

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(df)
plt.show()

##### 1. Why did you pick the specific chart?

A pair plot is used to understand the best set of features to explain a relationship between two variables or to form the most separated clusters. It also helps to form some simple classification models by drawing simple lines or making linear separations in the dataset.

Thus, I used a pair plot to analyse the patterns in the data and the relationships between the features. It conveys the same information as the correlation heatmap, but in graphical form.

##### 2. What is/are the insight(s) found from the chart?

Every feature is extremely correlated with the others, so using a single feature, or the average of these features, would suffice for our regression model, since linear regression assumes there is no multicollinearity among the features.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

1: The closing prices of Yes Bank stock in the last quarter of the year have a different distribution compared to the closing prices in the first quarter.   
2: There is a significant difference in the mean closing prices of Yes Bank stock between two consecutive years.  
3: The closing prices of Yes Bank stock on Fridays have a different distribution compared to the closing prices on other weekdays.

In [None]:
close_data = df['Close'].values
high_data = df['High'].values
low_data = df['Low'].values
open_data = df['Open'].values


### Hypothetical Statement - 1 -The closing prices of Yes Bank stock in the last quarter of the year have a different distribution compared to the closing prices in the first quarter.

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H0): There is no significant difference in the closing price distributions between the last quarter and the first quarter.

Alternative Hypothesis (Ha): There is a significant difference in the closing price distributions between the last quarter and the first quarter.

#### 2. Perform an appropriate statistical test.

In [None]:
# Hypothesis Testing - Statement 1: Closing prices in the last quarter vs. first quarter
# Group the monthly closing prices by calendar quarter (Q1 = Jan-Mar, Q4 = Oct-Dec)
first_quarter_close = df.loc[df['Date'].dt.quarter == 1, 'Close']
last_quarter_close = df.loc[df['Date'].dt.quarter == 4, 'Close']
u_stat, p_value = stats.mannwhitneyu(first_quarter_close, last_quarter_close, alternative='two-sided')
print("Hypothesis Statement 1:")
print("U-statistic:", u_stat)
print("p-value:", p_value)
if p_value < 0.05:
    print("Conclusion: Reject the null hypothesis. The closing prices in the last quarter and first quarter have different distributions.")
else:
    print("Conclusion: Fail to reject the null hypothesis. There is no significant difference in the closing price distributions between the last quarter and first quarter.")


##### Which statistical test have you done to obtain P-Value?

The p-value is obtained from the Mann-Whitney U test. It is the probability of observing data at least as extreme as ours if the null hypothesis (no difference in distributions) is true. If the p-value is below a chosen significance level (commonly 0.05), we reject the null hypothesis and conclude that there is a significant difference between the closing price distributions in the last quarter and the first quarter.

##### Why did you choose the specific statistical test?

In [None]:
# Visualizing code of hist plot for required columns to know the data distibution

columns_to_visualize = ['Close', 'High', 'Low', 'Open']
plt.figure(figsize=(12, 8))
for i, col in enumerate(columns_to_visualize):
    plt.subplot(2, 2, i+1)
    plt.hist(df[col], bins=20, color='skyblue', edgecolor='black')
    plt.title(f'{col} Distribution')
    plt.xlabel(col)
    plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

The Mann-Whitney U test is a suitable non-parametric test for comparing two independent groups when the data is not normally distributed.
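The claim that the data is not normally distributed can be checked with a formal test as well as with histograms. The sketch below applies the Shapiro-Wilk test from SciPy to a synthetic right-skewed sample (generated for illustration, not taken from the dataset); on the real data one would pass `df['Close']` instead.

```python
import numpy as np
from scipy import stats

# Synthetic, right-skewed sample standing in for a price column
rng = np.random.default_rng(1)
skewed_sample = rng.lognormal(mean=4.0, sigma=0.8, size=185)

# Shapiro-Wilk: the null hypothesis is that the sample comes from a normal distribution
statistic, p_value = stats.shapiro(skewed_sample)
print(f"W = {statistic:.3f}, p-value = {p_value:.4f}")
if p_value < 0.05:
    print("Normality rejected: a non-parametric test is the safer choice.")
else:
    print("No evidence against normality at the 5% level.")
```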

### Hypothetical Statement - 2 -There is a significant difference in the mean closing prices of Yes Bank stock between two consecutive years.

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H0): There is no significant difference in the mean closing prices between two consecutive years.

Alternative Hypothesis (Ha): There is a significant difference in the mean closing prices between two consecutive years.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# Hypothesis Testing - Statement 2: Closing prices for two consecutive years
# The pair of years below is an illustrative choice; change year_a to test another pair
year_a = 2017
year_b = year_a + 1
close_year_a = df.loc[df['Date'].dt.year == year_a, 'Close']
close_year_b = df.loc[df['Date'].dt.year == year_b, 'Close']
h_stat, p_value = stats.kruskal(close_year_a, close_year_b)
print("\nHypothesis Statement 2:")
print("H-statistic:", h_stat)
print("p-value:", p_value)
if p_value < 0.05:
    print("Conclusion: Reject the null hypothesis. There is a significant difference in the mean closing prices between two consecutive years.")
else:
    print("Conclusion: Fail to reject the null hypothesis. There is no significant difference in the mean closing prices between two consecutive years.")


##### Which statistical test have you done to obtain P-Value?

The p-value is obtained from the Kruskal-Wallis H test.

Conclusion: The cell above prints the outcome for the chosen pair of years. We reject the null hypothesis if the p-value is below 0.05; otherwise we fail to reject it.




##### Why did you choose the specific statistical test?

The Kruskal-Wallis test is often used when the data is not normally distributed, or when the sample sizes are small, and the assumptions of the one-way ANOVA are not met.

### Hypothetical Statement - 3 -The closing prices of Yes Bank stock on Fridays have a different distribution compared to the closing prices on other weekdays.

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H0): There is no significant difference in the closing price distributions between Fridays and other weekdays.

Alternative Hypothesis (Ha): There is a significant difference in the closing price distributions between Fridays and other weekdays.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# Hypothesis Testing - Statement 3: Closing prices on Fridays vs. other weekdays
# Caveat: the dataset is monthly and carries no weekday information, so every
# fifth observation is used as a stand-in for "Friday". The result therefore
# does not reflect actual Fridays.
friday_proxy = close_data[::5]
other_days_proxy = np.concatenate([close_data[1::5], close_data[2::5], close_data[3::5], close_data[4::5]])
u_stat, p_value = stats.mannwhitneyu(friday_proxy, other_days_proxy)
print("\nHypothesis Statement 3:")
print("U-statistic:", u_stat)
print("p-value:", p_value)
if p_value < 0.05:
    print("Conclusion: Reject the null hypothesis. The closing prices on Fridays have different distributions compared to other weekdays.")
else:
    print("Conclusion: Fail to reject the null hypothesis. There is no significant difference in the closing price distributions between Fridays and other weekdays.")

##### Which statistical test have you done to obtain P-Value?

The p-value is obtained from the Mann-Whitney U test.

Conclusion: Fail to reject the null hypothesis. There is no significant difference in the closing price distributions between Fridays and other weekdays.


##### Why did you choose the specific statistical test?

The Mann-Whitney U test is a suitable non-parametric test for comparing two independent groups when the data is not normally distributed.

## ***6. Feature Engineering & Data Pre-processing***

In [None]:
# Creating a copy of the dataset for further feature engineering
df=df.copy()

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
# Missing Values/Null Values Count
print(df.isnull().sum())

# Visualizing the missing values
# Checking Null Value by plotting Heatmap
sns.heatmap(df.isnull(), cbar=False)

#### What all missing value imputation techniques have you used and why did you use those techniques?

There are no missing values to handle in the given dataset.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
# Extract the 'Close', 'High', 'Low', and 'Open' columns from the DataFrame
close_data = df['Close'].values
high_data = df['High'].values
low_data = df['Low'].values
open_data = df['Open'].values

# Function to remove outliers using Z-score method
def remove_outliers_zscore(data):
    z_scores = np.abs(stats.zscore(data))
    threshold = 3
    data_no_outliers = data[z_scores <= threshold]
    return data_no_outliers

# Remove outliers for each column
close_data_no_outliers = remove_outliers_zscore(close_data)
high_data_no_outliers = remove_outliers_zscore(high_data)
low_data_no_outliers = remove_outliers_zscore(low_data)
open_data_no_outliers = remove_outliers_zscore(open_data)

# Print the lengths of the original and outlier-removed data for comparison
print("Original Data Lengths:")
print("Close Data:", len(close_data))
print("High Data:", len(high_data))
print("Low Data:", len(low_data))
print("Open Data:", len(open_data))

print("\nOutlier-Removed Data Lengths:")
print("Close Data:", len(close_data_no_outliers))
print("High Data:", len(high_data_no_outliers))
print("Low Data:", len(low_data_no_outliers))
print("Open Data:", len(open_data_no_outliers))


##### What all outlier treatment techniques have you used and why did you use those techniques?

We used the Z-score method to handle outliers. This technique standardizes each data point by calculating how many standard deviations it lies from the mean. Data points whose absolute Z-score exceeds a threshold (commonly 3) are considered outliers.
We used this method because it is straightforward to implement and provides a statistical approach to identify extreme values.
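To make the thresholding step concrete, here is a small sketch on synthetic numbers (not the stock data). The helper name `zscore_outlier_mask` is hypothetical and introduced only for this illustration; it uses the population standard deviation, which is also the default of `scipy.stats.zscore`.

```python
import numpy as np

def zscore_outlier_mask(values: np.ndarray, threshold: float = 3.0) -> np.ndarray:
    """Return a boolean mask that is True where |z| exceeds the threshold."""
    z = (values - values.mean()) / values.std()  # population standard deviation (ddof=0)
    return np.abs(z) > threshold

# 50 ordinary points plus one extreme value (synthetic data)
rng = np.random.default_rng(7)
values = np.append(rng.normal(100, 10, size=50), 400.0)

mask = zscore_outlier_mask(values)
print("outliers:", values[mask])
print("kept:", int((~mask).sum()), "of", len(values))
```

Note that an extreme value inflates the mean and standard deviation it is measured against, so the method works best when outliers are few.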

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
categorical_columns=list(set(df.columns.to_list()).difference(set(df.describe().columns.to_list())))
print("Categorical Columns are :-", categorical_columns)

In [None]:
# One-Hot Encoding
one_hot_encoded = pd.get_dummies(df, columns=['Date'])

# Label Encoding (for ordinal categorical variables)
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
df['Date'] = label_encoder.fit_transform(df['Date'])

# Print the encoded data
print("One-Hot Encoded Data:")
print(one_hot_encoded.head())

print("\nLabel Encoded Data:")
print(df.head())

#### What all categorical encoding techniques have you used & why did you use those techniques?

The code above demonstrates two techniques on the 'Date' column: one-hot encoding and label encoding. One-hot encoding converts a categorical variable into binary vectors. Each category in the original variable is represented by a binary column, where a value of 1 indicates the presence of the category and 0 indicates its absence.
One-hot encoding is used when dealing with nominal categorical variables (categories without any inherent order) or when the categorical variable has a small number of unique categories. Label encoding instead assigns one integer to each category, which suits ordered values such as dates.

One-hot encoding is a simple and effective way to represent categorical variables with multiple categories, allowing machine learning algorithms to interpret the data correctly. Note that the one-hot result is stored separately in `one_hot_encoded`, while the working DataFrame `df` keeps the label-encoded 'Date' column.
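The difference between the two encodings is easiest to see on a tiny example. The sketch below uses a hypothetical nominal column named `quarter` (not a column of the dataset).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical nominal column used only for illustration
toy = pd.DataFrame({"quarter": ["Q1", "Q2", "Q3", "Q1"]})

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(toy, columns=["quarter"], dtype=int)
print(one_hot)

# Label encoding: one integer per category, assigned in sorted order
labels = LabelEncoder().fit_transform(toy["quarter"])
print(labels)
```

One-hot encoding adds a column for every distinct value, so applying it to a column where every value is unique (such as a monthly date) produces as many columns as rows.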

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

**There are no text columns in the given dataset which I am working on. So, Skipping this part.**

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
# Creating a new feature based on average of other features in the dataset
df['OHL'] = df[['Open', 'High', 'Low']].mean(axis=1).round(2)
df.head()

Linear regression also assumes a linear relationship between the target variable and the independent variables. Let's check whether such a relationship exists through a scatter plot.

In [None]:
# scatter plot to see the relationship between dependent & independent variables
fig = plt.figure(figsize=(20,5))
ax = fig.gca()
plt.scatter(df['OHL'], df['Close'])
plt.xlabel('OHL')
plt.ylabel('Close')
ax.set_title('OHL vs Close')
z = np.polyfit(df['OHL'], df['Close'], 1)
y_hat = np.poly1d(z)(df['OHL'])
plt.plot(df['OHL'], y_hat, "r--", lw=1)
plt.show()

#### 2. Feature Selection

In [None]:
# Checking the shape of dataset
df.shape

In [None]:
# Separate the target variable (y) from the features (X)
X = df.drop(columns=['Close'])  # 'Close' is the target, so it is excluded from the features
y = df['Close']

# Split the data into training and testing sets (if needed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the VarianceThreshold with threshold (default is 0)
# For quasi-constant features, you can set a threshold close to 0 to remove features with very low variance
variance_threshold = VarianceThreshold(threshold=0.01)  # Adjust the threshold value as per your data

# Fit the VarianceThreshold on the training data
variance_threshold.fit(X_train)

# Get the indices of non-constant and non-quasi-constant features
non_constant_features = variance_threshold.get_support(indices=True)

# Filter the original data to keep only non-constant and non-quasi-constant features
X_train_filtered = X_train.iloc[:, non_constant_features]
X_test_filtered = X_test.iloc[:, non_constant_features]

# Print the original and filtered shape to see the difference
print("Original data shape:", X_train.shape)
print("Filtered data shape:", X_train_filtered.shape)


##### What all feature selection methods have you used  and why?

We use the VarianceThreshold class from scikit-learn to remove constant and quasi-constant features. The threshold value is set to 0.01, which means features with a variance below 0.01 are considered quasi-constant and removed. We can adjust the threshold value based on our data and problem requirements.

The target column 'Close' is dropped from the feature matrix before the selector is fitted.

After dropping constant and quasi-constant features, we will be left with a filtered DataFrame containing only the informative features, which can improve the performance of our machine learning models.
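The effect of the threshold is easier to see on a toy frame. The sketch below uses made-up columns (`constant`, `quasi_constant`, `informative`) chosen only for illustration; it relies on `VarianceThreshold` keeping the features whose variance is above the threshold.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

toy = pd.DataFrame({
    "constant": [1.0, 1.0, 1.0, 1.0, 1.0],         # variance 0
    "quasi_constant": [0.0, 0.0, 0.0, 0.0, 0.1],   # variance 0.0016
    "informative": [10.0, 20.0, 15.0, 30.0, 25.0], # clearly varying
})

selector = VarianceThreshold(threshold=0.01)
selector.fit(toy)

# get_support() returns a boolean mask over the input columns
kept = toy.columns[selector.get_support()].tolist()
print(kept)
```

Because the variance depends on the scale of a column, the same threshold can be strict for one feature and lenient for another; price columns with large values will pass a threshold of 0.01 easily.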

##### Which all features you found important and why?

In [None]:
# Separate the target variable (y) from the features (X)
X = df.drop(columns=['Close'])  # 'Close' is the target, so it is excluded from the features
y = df['Close']

# Keep only the features that passed the variance threshold in the previous step
X_selected = X[X_train_filtered.columns]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.2, random_state=42)

# Initialize the Random Forest Regression model
rf_model = RandomForestRegressor()

# Fit the model on the training data
rf_model.fit(X_train, y_train)

# Get the feature importance scores
feature_importance = rf_model.feature_importances_

# Print the feature importance scores
print("Feature Importance Scores:")
for feature, importance in zip(X_train.columns, feature_importance):
    print(f"{feature}: {importance}")

Using the Random Forest regression model, I obtained importance scores for all 5 features. The scores suggest that the 'Low' price, followed by the 'High' price, has the most significant impact on predicting the Yes Bank stock's closing price. This aligns with the general understanding that low and high prices are strong indicators of market sentiment and can heavily influence the closing price. However, it's essential to interpret these results carefully and consider additional factors to build a robust predictive model for closing price prediction.

### 5. Data Transformation

In [None]:
# Separate the target variable (Close) from the features (X)
X = df.drop(columns=['Close'])
y = df['Close']

# Calculate mean and median difference for each column
mean_median_diff = abs(X.mean() - X.median())

# Define a threshold for determining symmetric features
threshold = 0.1

# Separate symmetric and non-symmetric features
symmetric_feature = list(mean_median_diff[mean_median_diff < threshold].index)
non_symmetric_feature = list(mean_median_diff[mean_median_diff >= threshold].index)

# Print symmetric and non-symmetric (skewed) features
print("Symmetric Distributed Features: ", symmetric_feature)
print("Skewed Distributed Features: ", non_symmetric_feature)

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform your data
# Apply a power transformation (exponent 0.25, i.e. the fourth root) to the required columns
columns_to_transform = ['Close', 'Open', 'High', 'Low']

for col in columns_to_transform:
    df[col] = df[col] ** 0.25

# Print the updated DataFrame to check the changes
print(df.head())

In [None]:
# Histogram with KDE for each skewed column to inspect its distribution
for col in non_symmetric_feature:
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    feature = df[col]
    # sns.distplot is deprecated; histplot with kde=True is its replacement
    sns.histplot(feature, kde=True, ax=ax)
    ax.axvline(feature.mean(), color='magenta', linestyle='dashed', linewidth=2)   # mean
    ax.axvline(feature.median(), color='cyan', linestyle='dashed', linewidth=2)    # median
    ax.set_title(col)
    plt.show()

From the mean-median comparison, I found that four features are not symmetric and therefore do not follow a Gaussian distribution, while the rest have a symmetric curve. For those four columns I applied a power transformation (referred to in this notebook as the exponential transformation) to bring them closer to a Gaussian distribution.

I also tried other transformations; the power transformation produced no infinite values and worked well, so I continued with it, using a power of 0.25.
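The effect of raising values to the power 0.25 can be checked numerically by comparing skewness before and after. The sketch below uses synthetic log-normal values as a stand-in for right-skewed prices; the data and variable names are assumptions, not the notebook's.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed "prices" (log-normal); not the Yes Bank data
prices = pd.Series(rng.lognormal(mean=4.0, sigma=0.8, size=500))

skew_before = prices.skew()
skew_after = (prices ** 0.25).skew()   # fourth-root (power 0.25) transform
skew_log = np.log(prices).skew()       # log transform, shown for comparison

print(f"skew before: {skew_before:.2f}")
print(f"skew after power 0.25: {skew_after:.2f}")
print(f"skew after log: {skew_log:.2f}")

# Values on the transformed scale can be mapped back by raising to the 4th power
restored = (prices ** 0.25) ** 4
```

Because the target 'Close' is transformed as well, any model trained afterwards predicts on the transformed scale; raising a prediction to the 4th power returns it to price units.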

### 6. Data Scaling

In [None]:
# Scaling your data
# Select only the numerical columns for scaling
numerical_columns = ['Open', 'High', 'Low', 'OHL']
X_numerical = df[numerical_columns]

# Create a StandardScaler object
scaler = StandardScaler()

# Scale the numerical features
X_scaled = scaler.fit_transform(X_numerical)

# Now, X_scaled contains the scaled numerical features

Two common ways to scale the data are Min-Max Scaling and Standardization (Z-score scaling). The code cell above applies Standardization through StandardScaler; Min-Max Scaling is described alongside it for comparison.

Min-Max Scaling:

Method: Min-Max Scaling rescales the data to a specific range, usually [0, 1].

Reason for Using: It brings all the features to a common scale within the range [0, 1]. It is suitable when the features have different ranges and we want to preserve the relationships between the data points while maintaining interpretability.

Standardization (Z-score Scaling):

Method: Standardization rescales each feature to have a mean of 0 and a standard deviation of 1.

Reason for Using: It is useful when the features have different units and scales and we want to give them equal importance.
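The two scaling methods described above can be compared side by side. This is a minimal sketch on a made-up array `X_toy`; the values are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two hypothetical features on very different scales
X_toy = np.array([[10.0, 200.0],
                  [20.0, 400.0],
                  [30.0, 900.0],
                  [40.0, 500.0]])

# Min-Max scaling: each column is mapped to the range [0, 1]
X_minmax = MinMaxScaler().fit_transform(X_toy)

# Standardization: each column gets mean 0 and standard deviation 1
X_std = StandardScaler().fit_transform(X_toy)

print(X_minmax)
print(X_std)
```

To avoid leaking information from the test period, a scaler is usually fitted on the training rows only and then applied to the test rows with `transform`.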


### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

For this dataset, dimensionality reduction is not required.

Dimensionality reduction is a technique used to reduce the number of features or variables in a dataset while preserving most of its important information. Here, the dataset is relatively small, with 185 rows and 5 columns. Since the number of features (5 columns) is already small compared to the number of samples (185 rows), dimensionality reduction is not as critical as it would be for larger, wider datasets.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
# 5 fold time-series cross-validation
# split into 80:20 ratio
tscv = TimeSeriesSplit(n_splits=5)

# function for splitting time-series dataset
def timeseries_train_test_split(X, y, test_size):
    """
        Perform train-test split with respect to time series structure
    """

    # index at which the test set starts (rows before it form the training set)
    test_index = int(len(X)*(1-test_size))
    X_train = X.iloc[:test_index]
    y_train = y.iloc[:test_index]
    X_test = X.iloc[test_index:]
    y_test = y.iloc[test_index:]

    return X_train, X_test, y_train, y_test

In [None]:
# choose appropriate dependent and independent variables
y = df.dropna().Close
X = df.dropna().drop(['Date','Close','Open','High','Low'], axis=1)

# split the dataset into train and test sets
X_train, X_test, y_train, y_test = timeseries_train_test_split(X, y, test_size=0.2)
print(X_train.shape)
print(X_test.shape)

##### What data splitting ratio have you used and why?

An 80-20 split is common in many machine learning tasks and is a good starting point for dividing data into training and testing sets. The training set (80%) is used to fit the model, and the testing set (20%) is used to evaluate how well the model generalizes to unseen data. Because this is time-series data, the split is chronological: the earliest 80% of the rows form the training set and the most recent 20% form the test set.
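The chronological split and the 5-fold time-series cross-validation used above can be illustrated on plain row indices. This is a sketch only: `n_samples = 185` reuses the row count quoted earlier, and the actual number of rows after `dropna()` may differ.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_samples = 185                      # illustrative row count
rows = np.arange(n_samples).reshape(-1, 1)

# Chronological 80:20 hold-out: everything before split_at is training data
split_at = int(n_samples * 0.8)
train_rows, test_rows = rows[:split_at], rows[split_at:]

# 5-fold time-series CV: each fold trains on the past and validates on the future
tscv_demo = TimeSeriesSplit(n_splits=5)
folds = list(tscv_demo.split(rows))
for i, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"fold {i}: train rows 0-{train_idx.max()}, "
          f"validate rows {val_idx.min()}-{val_idx.max()}")
```

Unlike a shuffled split, neither scheme lets the model see observations that come after the rows it is evaluated on.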

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1 - Implementing Linear Regression Model

In [None]:
# ML Model - 1 Implementation
lr = LinearRegression()
# Fit the Algorithm
lr.fit(X_train, y_train)

In [None]:
lr.score(X_train, y_train)

In [None]:
lr.coef_

In [None]:
lr.intercept_


In [None]:
# make predictions
lr_y_pred = lr.predict(X_test)

# evaluate predictions
lr_mae = round(mean_absolute_error(y_test, lr_y_pred),2)
print('mean absolute error: {}\n'.format(lr_mae))
lr_mse = round(mean_squared_error(y_test, lr_y_pred),2)
print('mean squared error: {}\n'.format(lr_mse))
lr_rmse = round(np.sqrt(lr_mse),2)
print('root mean squared error: {}\n'.format(lr_rmse))
lr_r2 = round(r2_score(y_test, lr_y_pred),2)
print('r2_score: {}\n'.format(lr_r2))
lr_mape = round(mean_absolute_percentage_error(y_test, lr_y_pred),2)  # scikit-learn expects (y_true, y_pred)
print('mean absolute percentage error: {}\n\n\n'.format(lr_mape))


In [None]:
# Visualizing evaluation Metric Score chart
model_report = pd.DataFrame(data={'model':['linear regression'], 'mae':[lr_mae], 'mse':[lr_mse],'rmse':[lr_rmse],'r2_score':[lr_r2],'mape':[lr_mape]})
model_report

In [None]:
# Data visualization chart for Linear Regression Model
metrics = ['MAE', 'MSE', 'RMSE', 'R2 Score', 'MAPE']
scores = [lr_mae, lr_mse, lr_rmse, lr_r2, lr_mape]

plt.bar(metrics, scores)
plt.xlabel('Evaluation Metric')
plt.ylabel('Score')
plt.title('Evaluation Metric Scores for Linear Regression Model')
plt.show()

In [None]:
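# Check for homoscedasticity: residuals should scatter evenly around zero
plt.scatter(lr_y_pred, y_test - lr_y_pred)
plt.axhline(0, color='red', linestyle='dashed')
plt.xlabel('Predicted')
plt.ylabel('Residual')
plt.title('Variance of residuals')
plt.show()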

In [None]:

# function to plot model performance
def plotModelResults(model, X_train=X_train, X_test=X_test, plot_intervals=False, plot_anomalies=False):
    """
        Plots modelled vs fact values, prediction intervals and anomalies
    """

    prediction = model.predict(X_test)

    plt.figure(figsize=(15, 7))
    plt.plot(prediction, "g", label="prediction", linewidth=2.0)
    plt.plot(y_test.values, label="actual", linewidth=2.0)

    if plot_intervals:
        cv = cross_val_score(model, X_train, y_train,
                                    cv=tscv,
                                    scoring="neg_mean_absolute_error")
        mae = cv.mean() * (-1)
        deviation = cv.std()

        scale = 1.96
        lower = prediction - (mae + scale * deviation)
        upper = prediction + (mae + scale * deviation)

        plt.plot(lower, "r--", label="upper bound / lower bound", alpha=0.5)
        plt.plot(upper, "r--", alpha=0.5)

        if plot_anomalies:
            # mark actual values that fall outside the prediction interval
            actual = y_test.values
            outside = (actual < lower) | (actual > upper)
            anomalies = np.where(outside, actual, np.nan)
            plt.plot(anomalies, "o", markersize=10, label="Anomalies")

    error = mean_absolute_percentage_error(y_test, prediction)  # (y_true, y_pred)
    plt.title("Mean absolute percentage error {0:.2f}".format(error))
    plt.legend(loc="best")
    plt.tight_layout()
    plt.grid(True);

# function to plot coefficients
def plotCoefficients(model):
    """
        Plots sorted coefficient values of the model
    """

    coefs = pd.DataFrame(model.coef_, X_train.columns)
    coefs.columns = ["coef"]
    coefs["abs"] = coefs.coef.apply(np.abs)
    coefs = coefs.sort_values(by="abs", ascending=False).drop(["abs"], axis=1)

    plt.figure(figsize=(12, 5))
    coefs.coef.plot(kind='bar')
    plt.grid(True, axis='y')
    plt.hlines(y=0, xmin=0, xmax=len(coefs), linestyles='dashed')

plotModelResults(lr, plot_intervals=True)

We have implemented a Linear Regression model for predicting the closing stock price of Yes Bank.

We used several evaluation metrics to assess the performance of the Linear Regression model on the test data. Because 'Close' was transformed with a power of 0.25 before modelling, the error metrics below are on the transformed scale rather than in original price units.

The R-squared value of 0.87 indicates that approximately 87% of the variance in the (transformed) closing price is explained by the features used in the model.

The MAE of 0.26 means that, on average, the model's predictions are off by about 0.26 units on the transformed scale.

The MSE of 0.1 is the average squared difference between predicted and actual values.

The RMSE of 0.32 expresses the error in the same units as the target.

The MAPE of 0.08 (8%) indicates an average percentage difference of 8% between predicted and actual values.

Overall, the Linear Regression model provides reasonably good performance on the test data. The R-squared value indicates that the model captures a significant portion of the variance, and the other metrics provide insight into the magnitude of the errors.
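For reference, the five metrics discussed above can be computed by hand and checked against scikit-learn. The arrays below are hypothetical values chosen for illustration; they are not the model's predictions.

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             r2_score, mean_absolute_percentage_error)

# Hypothetical actual and predicted values
y_true = np.array([3.0, 3.5, 4.0, 4.5])
y_pred = np.array([2.8, 3.6, 4.3, 4.4])

err = y_true - y_pred
mae = np.mean(np.abs(err))                     # mean absolute error
mse = np.mean(err ** 2)                        # mean squared error
rmse = np.sqrt(mse)                            # root mean squared error
r2 = 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
mape = np.mean(np.abs(err) / np.abs(y_true))   # as a fraction, not a percent

# scikit-learn's functions take the true values first, then the predictions
sk_mae = mean_absolute_error(y_true, y_pred)
sk_mse = mean_squared_error(y_true, y_pred)
sk_r2 = r2_score(y_true, y_pred)
sk_mape = mean_absolute_percentage_error(y_true, y_pred)

print(mae, mse, rmse, r2, mape)
```

Computing RMSE from the unrounded MSE, as done here, avoids compounding rounding error.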

### ML Model - 2 - Implementing Lasso regression


In [None]:
# ML Model - 2 Implementation
lasso = Lasso(alpha=0.0001, max_iter=3000)
# Fit the Algorithm
lasso.fit(X_train, y_train)

In [None]:
lasso.score(X_train, y_train)

In [None]:
lasso.coef_

In [None]:
# initialize and fit lasso regression
lasso = LassoCV(cv=tscv)
lasso.fit(X_train, y_train)

plotModelResults(lasso,
                 X_train,
                 X_test,
                 plot_intervals=True, plot_anomalies=True)


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
# make predictions
l_y_pred = lasso.predict(X_test)

# evaluate predictions
l_mae = round(mean_absolute_error(y_test, l_y_pred),2)
print('mean absolute error: {}\n'.format(l_mae))
l_mse = round(mean_squared_error(y_test, l_y_pred),2)
print('mean squared error: {}\n'.format(l_mse))
l_rmse = round(np.sqrt(l_mse),2)
print('root mean squared error: {}\n'.format(l_rmse))
l_r2 = round(r2_score(y_test, l_y_pred),2)
print('r2_score: {}\n'.format(l_r2))
l_mape = round(mean_absolute_percentage_error(y_test, l_y_pred),2)  # scikit-learn expects (y_true, y_pred)
print('mean absolute percentage error: {}\n\n\n'.format(l_mape))

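# Check for homoscedasticity: residuals should scatter evenly around zero
plt.scatter(l_y_pred, y_test - l_y_pred)
plt.axhline(0, color='red', linestyle='dashed')
plt.xlabel('Predicted')
plt.ylabel('Residual')
plt.title('Variance of residuals')
plt.show()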


In [None]:
# Visualizing evaluation Metric Score chart
model_report = pd.DataFrame(data={'model': ['lasso'],
                                  'mae': [l_mae],
                                  'mse': [l_mse],
                                  'rmse': [l_rmse],
                                  'r2_score': [l_r2],
                                  'mape': [l_mape]})

# Display the model report
print(model_report)

In [None]:
# Data visualization chart for Lasso Regression Model
metrics = ['MAE', 'MSE', 'RMSE', 'R2 Score', 'MAPE']
scores = [l_mae, l_mse, l_rmse, l_r2, l_mape]  # Lasso scores (previously plotted the Linear Regression ones)

plt.bar(metrics, scores)
plt.xlabel('Evaluation Metric')
plt.ylabel('Score')
plt.title('Evaluation Metric Scores for Lasso Regression Model')
plt.show()

The ML model used in this case is Lasso Regression. Lasso Regression is a type of linear regression that performs both variable selection and regularization to prevent overfitting. It works by adding a penalty term to the linear regression cost function, encouraging the model to use fewer features by shrinking some feature coefficients to zero.

The MAE is 0.25, which indicates that, on average, the model's predictions are off by 0.25 units from the actual values.

The MSE is 0.09, meaning that the average squared difference between predicted and actual values is 0.09.

The RMSE of 0.3 is the square root of the MSE and expresses the typical error in the same units as the target; it weights large errors more heavily than the MAE does.

The R2 score measures the proportion of the variance in the dependent variable (closing prices) that's predictable from the independent variables (features). An R2 score of 0.88 means that approximately 88% of the variability in the closing prices can be explained by the model's features.

The MAPE of 0.07 implies that, on average, the model's predictions are off by 7% from the actual values.

Overall, these evaluation metrics suggest that the model is performing well. The low values of MAE, MSE, and RMSE indicate accurate predictions. The high R2 score indicates that the model's features are explaining a significant portion of the variance in the closing prices. The low MAPE implies that the percentage difference between predicted and actual values is relatively small.
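To illustrate the variable-selection behaviour of Lasso described above, the sketch below fits ordinary least squares and Lasso on synthetic data in which only two of five features matter. The data, the `alpha=0.1` setting and all names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))

# Only the first two columns influence the target; the other three are irrelevant
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

ols = LinearRegression().fit(X_demo, y_demo)
lasso_demo = Lasso(alpha=0.1).fit(X_demo, y_demo)

# The L1 penalty drives coefficients of irrelevant features to exactly zero
n_zero = int(np.sum(lasso_demo.coef_ == 0))

print("OLS coefficients:  ", np.round(ols.coef_, 3))
print("Lasso coefficients:", np.round(lasso_demo.coef_, 3))
print("Coefficients set to zero by Lasso:", n_zero)
```

The OLS fit keeps small non-zero weights on the irrelevant columns, while Lasso removes them and slightly shrinks the two informative coefficients.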


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
from sklearn.model_selection import GridSearchCV
optimal_lasso = Lasso()
parameters = {'alpha': [1e-15,1e-13,1e-10,1e-8,1e-5,1e-4,1e-3,1e-2,1e-1,1,5,10,20,30,40,45,50,55,60,100,0.0014]}
# Search over a plain Lasso estimator (not the LassoCV object fitted earlier)
lasso_regressor = GridSearchCV(optimal_lasso, parameters, scoring='neg_mean_squared_error', cv=5)

# Fit the Algorithm
lasso_regressor.fit(X_train, y_train)


In [None]:
print("The best fit alpha value is found out to be :" ,lasso_regressor.best_params_)
print("\nUsing ",lasso_regressor.best_params_, " the negative mean squared error is: ", lasso_regressor.best_score_)

In [None]:
y_pred_lasso = lasso_regressor.predict(X_test)

In [None]:
plt.figure(figsize=(8,5))
plt.plot(y_pred_lasso)
plt.plot(np.array(y_test))
plt.legend(["Predicted","Actual"])
plt.show()

In [None]:
# evaluate predictions
l_mae_grid = round(mean_absolute_error(y_test, y_pred_lasso),2)
print('mean absolute error: {}\n'.format(l_mae_grid))
l_mse_grid = round(mean_squared_error(y_test, y_pred_lasso),2)
print('mean squared error: {}\n'.format(l_mse_grid))
l_rmse_grid = round(np.sqrt(l_mse_grid),2)
print('root mean squared error: {}\n'.format(l_rmse_grid ))
l_r2_grid = round(r2_score(y_test, y_pred_lasso),2)
print('r2_score: {}\n'.format(l_r2_grid))
l_mape_grid = round(mean_absolute_percentage_error(y_test, y_pred_lasso),2)  # scikit-learn expects (y_true, y_pred)
print('mean absolute percentage error: {}\n\n\n'.format(l_mape_grid))

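# Check for homoscedasticity: residuals should scatter evenly around zero
plt.scatter(y_pred_lasso, y_test - y_pred_lasso)
plt.axhline(0, color='red', linestyle='dashed')
plt.xlabel('Predicted')
plt.ylabel('Residual')
plt.title('Variance of residuals')
plt.show()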

In [None]:
# Visualizing evaluation Metric Score chart
model_report = pd.DataFrame(data={'model': ['optimal_lasso'],
                                  'mae': [l_mae_grid],
                                  'mse': [l_mse_grid],
                                  'rmse': [l_rmse_grid],
                                  'r2_score': [l_r2_grid],
                                  'mape': [l_mape_grid]})

# Display the model report
print(model_report)

##### Which hyperparameter optimization technique have you used and why?

We used GridSearchCV with the Lasso regression model and a range of alpha values. GridSearchCV systematically evaluates each value of the regularization parameter alpha with cross-validation and selects the one with the highest negative mean squared error (that is, the lowest mean squared error), which is the scoring metric we chose.

GridSearchCV is a good choice when we have a relatively small parameter space to search through, as it guarantees that we'll find the best combination of hyperparameters from the given grid. However, it might be computationally expensive if the parameter space is large.

Overall, GridSearchCV is a well-established and robust method for hyperparameter tuning, providing a systematic approach to optimizing model performance.
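For time-ordered data, GridSearchCV can also be given a TimeSeriesSplit object as `cv`, so that each validation fold lies after its training fold. The sketch below shows this on synthetic data; the grid, the data and the variable names are assumptions and differ from the search run above.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.2, size=120)

param_grid = {"alpha": [1e-4, 1e-3, 1e-2, 1e-1, 1.0]}

search = GridSearchCV(
    Lasso(max_iter=5000),
    param_grid,
    scoring="neg_mean_squared_error",   # higher (closer to 0) is better
    cv=TimeSeriesSplit(n_splits=5),     # validation folds always follow training folds
)
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["alpha"]
best_mse = -search.best_score_          # flip the sign to recover the MSE

print("best alpha:", best_alpha)
print("cross-validated MSE:", round(best_mse, 4))
```

Passing an integer such as `cv=5` uses ordinary K-fold splitting instead, in which some training folds contain rows that come after the validation rows.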

### ML Model - 3 - Implementing Ridge Regression

In [None]:
# ML Model - 3 Implementation
ridge = Ridge()
# Fit the Algorithm
ridge.fit(X_train,y_train)


In [None]:
# Predict on the model
r_y_pred = ridge.predict(X_test)
print(r_y_pred)

In [None]:
# initialize and fit ridge regression
ridge = RidgeCV(cv=tscv)
ridge.fit(X_train, y_train)

plotModelResults(ridge,
                 X_train,
                 X_test,
                 plot_intervals=True, plot_anomalies=True)


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
# evaluate predictions (scikit-learn metrics take y_true first, then y_pred)
r_mae = round(mean_absolute_error(y_test, r_y_pred),2)
print('mean absolute error: {}\n'.format(r_mae))
r_mse = round(mean_squared_error(y_test, r_y_pred),2)
print('mean squared error: {}\n'.format(r_mse))
r_rmse = round(np.sqrt(r_mse),2)
print('root mean squared error: {}\n'.format(r_rmse))
r_r2 = round(r2_score(y_test, r_y_pred),2)
print('r2_score: {}\n'.format(r_r2))
r_mape = round(mean_absolute_percentage_error(y_test, r_y_pred),2)
print('mean absolute percentage error: {}\n\n\n'.format(r_mape))

In [None]:
#evaluation Metric Score chart
ridge_model_report = pd.DataFrame(data={'model': ['ridge'],
                                  'mae': [r_mae],
                                  'mse': [r_mse],
                                  'rmse': [r_rmse],
                                  'r2_score': [r_r2],
                                  'mape': [r_mape]})

# Display the Ridge regression model report
print(ridge_model_report)

In [None]:
# Data visualization chart for Ridge Regression Model
metrics = ['MAE', 'MSE', 'RMSE', 'R2 Score', 'MAPE']
scores = [r_mae, r_mse, r_rmse, r_r2, r_mape]  # Ridge scores (previously plotted the Linear Regression ones)

plt.bar(metrics, scores)
plt.xlabel('Evaluation Metric')
plt.ylabel('Score')
plt.title('Evaluation Metric Scores for Ridge Regression Model')
plt.show()

The ML model used in this case is Ridge Regression. Ridge Regression adds an L2 penalty on the size of the coefficients to the linear regression cost function. It is suitable when most of the independent variables are believed to be relevant to the target variable, even though they might be correlated with each other.

The MAE of 0.26 indicates that, on average, the model's predictions deviate from the actual values by approximately 0.26 units. This suggests that the model's predictions are relatively close to the true values.

The MSE value of 0.1 means that, on average, the squared differences between the predicted and actual values are around 0.1. Lower MSE values are better, indicating smaller errors.

The RMSE value of 0.32 is the square root of the MSE. It provides an estimate of the typical magnitude of the prediction errors, in the same units as the target.

The R2 score of 0.89 indicates that the model explains about 89% of the variance in the target variable. In other words, around 89% of the variability in the actual data is captured by the model's predictions. This is a relatively high R2 score, implying that the model is performing well in explaining the variability in the data.

The MAPE value of 0.08 means that, on average, the model's predictions deviate from the actual values by about 8% in terms of percentage. This suggests that the model's predictions are generally within a reasonable percentage of the actual values.

Overall, the evaluation metrics collectively indicate that the model is performing well. It is making predictions that are close to the actual values with relatively small errors. The R2 score also indicates a good fit to the data, and the MAPE suggests that the percentage errors are moderate. This model seems to be a good candidate for predicting the stock's closing price.
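Ridge's handling of correlated features, mentioned above, can be seen on a small synthetic example with two nearly identical columns. Everything below (data, `alpha=1.0`, names) is an assumption for illustration and is unrelated to the notebook's fitted model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(1)
base = rng.normal(size=200)

# Two nearly identical columns, mimicking highly correlated price features
X_demo = np.column_stack([base, base + rng.normal(scale=0.01, size=200)])
y_demo = base + rng.normal(scale=0.1, size=200)

ols = LinearRegression().fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=1.0).fit(X_demo, y_demo)

# The L2 penalty shrinks the coefficient vector and spreads weight
# across the correlated columns instead of letting them offset each other
ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge_demo.coef_)

print("OLS coefficients:  ", np.round(ols.coef_, 3))
print("Ridge coefficients:", np.round(ridge_demo.coef_, 3))
```

With collinear inputs the unpenalized coefficients can be large and unstable, whereas the Ridge coefficients stay small and close to each other.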

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
optimal_ridge = Ridge()
parameters = {'alpha': [1e-15,1e-10,1e-8,1e-5,1e-4,1e-3,1e-2,1,5,10,20,30,40,45,50,55,60,100]}
# Search over a plain Ridge estimator (not the RidgeCV object fitted earlier)
ridge_regressor = GridSearchCV(optimal_ridge, parameters, scoring='neg_mean_squared_error', cv=3)
# Fit the Algorithm
ridge_regressor.fit(X_train,y_train)

print("The best fit alpha value is found out to be :" ,ridge_regressor.best_params_)
print("\nUsing ",ridge_regressor.best_params_, " the negative mean squared error is: ", ridge_regressor.best_score_)

In [None]:
# Predict on the model
y_pred_ridge = ridge_regressor.predict(X_test)

In [None]:
y_pred_ridge


In [None]:
plt.figure(figsize=(8,5))
plt.plot(y_pred_ridge)
plt.plot(np.array(y_test))
plt.legend(["Predicted","Actual"])
plt.show()

In [None]:
# evaluate predictions (scikit-learn metrics take y_true first, then y_pred)
r_mae_grid = round(mean_absolute_error(y_test, y_pred_ridge),2)
print('mean absolute error: {}\n'.format(r_mae_grid))
r_mse_grid = round(mean_squared_error(y_test, y_pred_ridge),2)
print('mean squared error: {}\n'.format(r_mse_grid))
r_rmse_grid = round(np.sqrt(r_mse_grid),2)
print('root mean squared error: {}\n'.format(r_rmse_grid))
r_r2_grid = round(r2_score(y_test, y_pred_ridge),2)
print('r2_score: {}\n'.format(r_r2_grid))
r_mape_grid = round(mean_absolute_percentage_error(y_test, y_pred_ridge),2)
print('mean absolute percentage error: {}\n\n\n'.format(r_mape_grid))

In [None]:
# Visualizing evaluation Metric Score chart
model_report = pd.DataFrame(data={'model': ['optimal_ridge'],
                                  'mae': [r_mae_grid],
                                  'mse': [r_mse_grid],
                                  'rmse': [r_rmse_grid],
                                  'r2_score': [r_r2_grid],
                                  'mape': [r_mape_grid]})

# Display the model report
print(model_report)

##### Which hyperparameter optimization technique have you used and why?

We used GridSearchCV with the Ridge regression model and a range of alpha values. GridSearchCV systematically evaluates each value of the regularization parameter alpha with cross-validation and selects the one with the highest negative mean squared error (that is, the lowest mean squared error), which is the scoring metric we chose.

GridSearchCV is a good choice when we have a relatively small parameter space to search through, as it guarantees that we'll find the best combination of hyperparameters from the given grid. However, it might be computationally expensive if the parameter space is large.

Overall, GridSearchCV is a well-established and robust method for hyperparameter tuning, providing a systematic approach to optimizing model performance.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

In [None]:
# create dataframe with evaluation metrics
# Define the evaluation metrics
metrics_data = {'model': ['Linear Regression', 'Ridge Regression', 'Optimal Ridge Regression', 'Lasso Regression', 'Optimal Lasso Regression'],
                'mae': [lr_mae, r_mae, r_mae_grid, l_mae, l_mae_grid],
                'mse': [lr_mse, r_mse, r_mse_grid, l_mse, l_mse_grid],
                'rmse': [lr_rmse, r_rmse, r_rmse_grid, l_rmse, l_rmse_grid],
                'r2_score': [lr_r2, r_r2, r_r2_grid, l_r2, l_r2_grid],
                'mape': [lr_mape, r_mape, r_mape_grid, l_mape, l_mape_grid]}

# Create a dataframe
model_report = pd.DataFrame(data=metrics_data)

# Display the model report
model_report


In [None]:
# Sort the dataframe by 'r2_score' in decreasing order
model_report_sorted = model_report.sort_values(by='r2_score', ascending=False)

# Display the sorted model report
model_report_sorted

I'm considering the MAE, MAPE, and R2 score as the evaluation metrics for a positive business impact.

For instance, if small prediction errors have a direct financial impact, MAE and MAPE could be important. If understanding how well our model explains the variability in the target variable is crucial, R2 would be relevant. Ultimately, a combination of these metrics, along with domain knowledge, can provide a comprehensive view of our model's performance and its potential positive impact on our business.
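When comparing models on these metrics, it helps to remember that scikit-learn's metric functions take the true values first and the predictions second. MAE does not depend on the order, but R2 and MAPE do. The sketch below shows this on hypothetical values.

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error, r2_score,
                             mean_absolute_percentage_error)

# Hypothetical actual and predicted values
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 3.0])

# MAE is symmetric: swapping the arguments gives the same value
mae_correct = mean_absolute_error(y_true, y_pred)
mae_swapped = mean_absolute_error(y_pred, y_true)

# R2 normalizes by the variance of its FIRST argument, so order matters
r2_correct = r2_score(y_true, y_pred)
r2_swapped = r2_score(y_pred, y_true)

# MAPE divides by its FIRST argument, so order matters here too
mape_correct = mean_absolute_percentage_error(y_true, y_pred)
mape_swapped = mean_absolute_percentage_error(y_pred, y_true)

print("MAE :", mae_correct, mae_swapped)
print("R2  :", r2_correct, r2_swapped)
print("MAPE:", mape_correct, mape_swapped)
```

Using the same argument order for every model keeps the R2 and MAPE columns of a comparison table directly comparable.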

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

I chose the Ridge Regression model as the final prediction model.

Here's why:

Ridge Regression has slightly better performance in terms of R-squared (R2) score and negative mean squared error. It also performs well on the other metrics: MAE, MSE, RMSE, and MAPE. Hyperparameter tuning selected an alpha of 100, a fairly strong penalty, which is consistent with the price features being highly correlated and suggests that regularization helps to mitigate multicollinearity.

Therefore, based on the results above, Ridge Regression with the best-fit alpha value of 100 is taken as the final prediction model. It shows promising performance in terms of both predictive accuracy and generalization to unseen data, although the differences between the models are small. This decision could also be influenced by the specific requirements of the problem, the business context, and the balance between model complexity and interpretability.






# **Conclusion**

* The dataset does not have any null/missing values or duplicate values, which made the analysis easy and smooth.
* I started with univariate analysis, which showed that all the variables were positively skewed.
* I identified the features ('Open', 'High', 'Low', etc.) and the target variable ('Close').
* I conducted feature selection, filtering out features with low variance.
* I applied a power transformation (exponent 0.25, called the exponential transformation above) to specific columns to bring them closer to a normal distribution.
* I performed data scaling using Min-Max Scaling and Standardization to bring features to a common scale.
* I implemented Linear Regression, Lasso Regression, and Ridge Regression, with hyperparameter tuning for Lasso and Ridge.
* I evaluated these models using various performance metrics such as MAE, MSE, RMSE, R2 score, and MAPE.
* Linear Regression showed an R2 score of 0.87, indicating it explains about 87% of the variance in the target variable.
* Lasso Regression with hyperparameter tuning improved the performance slightly with an R2 score of 0.88.
* Ridge Regression exhibited similar performance with an R2 score of 0.89.
* I selected the Ridge Regression model as the final prediction model.















### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***