<a href="https://colab.research.google.com/github/Faraz-Khan02/Yes-Bank-Stock-Closing-Price-Prediction/blob/main/Yes_Bank_Stock_Closing_Price_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - **Yes Bank Stock Closing Price Prediction**



##### **Project Type**    - Regression
##### **Contribution**    - Individual
##### **Name -** Faraz Faisal Khan


# **Project Summary -**

Write the summary here within 500-600 words.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**Yes Bank is a well-known bank in the Indian financial domain. Since 2018, it has been in the news because of the fraud case involving Rana Kapoor. Owing to this fact, it was interesting to see how that impacted the stock prices of the company and whether Time series models or any other predictive models can do justice to such situations. This dataset has monthly stock prices of the bank since its inception and includes closing, starting, highest, and lowest stock prices of every month. The main objective is to predict the stock’s closing price of the month.**

# ***Let's Begin !***

### Importing Libraries

In [None]:
# Importing Libraries
import numpy as np
import pandas as pd
import math  # numpy.math was only an alias of the standard-library module and is gone in NumPy 2.x

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Mounting Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Loading Dataset
df = pd.read_csv("/content/drive/MyDrive/Capstone Project-2/data_YesBank_StockPrices.csv")

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

Our dataset contains 185 rows and 5 columns.

### Dataset Information

In [None]:
# Dataset Info
df.info()

Our dataset contains no null values.


This is the Yes Bank stock price dataset, which records monthly stock prices. It contains the following features:

**Date**: It denotes the month & year with respect to the price of the stock.

**Open**: It denotes the opening price at which a stock started trading that month.

**High**: It denotes the highest price at which the stock traded during that month.

**Low**: It denotes the lowest price at which the stock traded during that month.

**Close**: It denotes the final trading price for that month.
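
These definitions imply a simple consistency check: within a month, Low should not exceed Open or Close, and High should not fall below them. The sketch below assumes the column names listed above; `check_ohlc` is a hypothetical helper and the sample values are made up for illustration, not taken from the dataset.

```python
import pandas as pd

def check_ohlc(frame: pd.DataFrame) -> pd.Series:
    """Return True for each row whose Open/High/Low/Close values are mutually consistent."""
    low_ok = frame["Low"] <= frame[["Open", "Close"]].min(axis=1)
    high_ok = frame["High"] >= frame[["Open", "Close"]].max(axis=1)
    return low_ok & high_ok

# Illustrative values only (not real Yes Bank prices)
sample = pd.DataFrame({
    "Open":  [13.0, 12.5, 14.0],
    "High":  [14.0, 14.9, 13.5],   # third row: High is below Open, so it is inconsistent
    "Low":   [11.2, 12.2, 12.0],
    "Close": [12.4, 13.4, 13.0],
})
consistent = check_ohlc(sample)
print(consistent.tolist())
```

On the real data, `check_ohlc(df).all()` would be expected to return `True` if the file is clean.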


#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate = df.duplicated()
print(duplicate.value_counts())

There are no duplicate rows in our dataset.

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()


There are no null values in our dataset.

In [None]:
# Visualizing the missing values
plt.figure(figsize=(10, 4))
sns.heatmap(df.isnull(), cbar=True, yticklabels=False)
plt.xlabel("Column_name", size=12, weight="bold")
plt.title("Missing values",fontweight="bold",size=15)
plt.show()


The heatmap confirms that there are no missing values.

## ***Data Cleaning***

In [None]:
# Copying data to preserve the original dataset
new_df = df.copy()

The Date column is stored as text in month-year form, so we need to convert it to a proper datetime ('YYYY-MM-DD') format.

In [None]:
new_df['Date'] = new_df['Date'].apply(lambda x: datetime.strptime(x, "%b-%y"))

In [None]:
new_df.head()

The Date column is now in datetime format, and we are ready to do EDA on our dataset.
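
As a small illustration of what the `"%b-%y"` format string does, the sketch below parses a single value. The string `"Jul-05"` is an assumed example in that layout, not a value quoted from the dataset.

```python
from datetime import datetime

# "Jul-05" is an illustrative string in the "%b-%y" layout (abbreviated month, two-digit year)
parsed = datetime.strptime("Jul-05", "%b-%y")

# The day is not present in the input, so it defaults to the first of the month
print(parsed.strftime("%Y-%m-%d"))
```

Note that Python maps two-digit years 00 to 68 onto 2000 to 2068 and 69 to 99 onto 1969 to 1999, which is worth keeping in mind for older records.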

In [None]:
# Segregating the dataset into dependent & independent variable.
X = new_df.drop(['Close','Date'],axis=1)       
Y = new_df['Close']                          

Here X holds our independent variables (Open, High, Low) and Y is our dependent variable (Close).

## ***Exploratory Data Analysis***

### **Univariate Analysis**

In univariate analysis we analyze the dependent variable and the independent variables separately, using distribution plots and histograms.

In [None]:

# Visualisation of closing price with respect to dates.
plt.figure(figsize=(15,5))
plt.xlabel('Year', fontsize=15)
plt.ylabel('Closing Prices', fontsize=15)
plt.plot(new_df['Date'], new_df['Close'],linewidth=2,color='red')
plt.title('Closing Stock Prices along different Year', fontsize=20)
plt.grid()
plt.show()


From 2018 onwards, the closing price fell sharply following the Rana Kapoor fraud case.

## *Distplot for Dependent Variable - Close*

In [None]:
# Dependent variable-'Close'
plt.figure(figsize=(10,5))
sns.distplot(df['Close'],color="g")
plt.xlabel('Closing Price',fontsize=20)
plt.ylabel('Density',fontsize=20)
plt.show()

The distribution is clearly right-skewed, so we apply a log transformation to bring it closer to a normal distribution.
A log transformation is commonly used when data is strictly positive and highly right-skewed.
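
To put a number on "right-skewed", one option is the sample skewness before and after taking logs. The sketch below uses synthetic log-normal values as a stand-in for prices (an assumption for illustration; it does not use the Yes Bank data).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed "prices" drawn from a log-normal distribution
prices = pd.Series(rng.lognormal(mean=4.0, sigma=1.0, size=1000))

raw_skew = prices.skew()          # clearly positive for right-skewed data
log_skew = np.log(prices).skew()  # close to zero once the long right tail is compressed

print(f"skew before log: {raw_skew:.2f}, after log: {log_skew:.2f}")
```

The same two lines could be run on `new_df['Close']` to check how far the transformation actually reduces the skew.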



In [None]:
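# Collecting all numeric columns in a new variable.
# select_dtypes is used because describe() also reports datetime columns in recent pandas versions.
numeric_features = new_df.select_dtypes(include=np.number).columns
numeric_features
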

## *Mean and Median of Dependent Variable - Close*

In [None]:
# Plotting Mean and Median of our Dependent variable
for col in numeric_features[-1:]:
    fig = plt.figure(figsize=(12,6))
    ax = fig.gca()
    feature = new_df[col]
    feature.hist(bins=50, ax = ax)
    ax.axvline(feature.mean(), color='red', linestyle='dashed', linewidth=2)
    ax.axvline(feature.median(), color='green', linestyle='dashed', linewidth=2)    
    ax.set_title(col)
plt.show()

The red dashed line shows the mean of our target variable and the green dashed line shows its median.

The mean and the median differ considerably, which again indicates skew, so we apply a log transformation to bring the two measures of central tendency closer together.
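
A tiny worked example may make the mean/median argument concrete. The values below are made up (a doubling sequence), chosen because they are strongly right-skewed on the raw scale and evenly spaced on the log scale.

```python
import numpy as np

# Made-up values that double at each step: the large ones pull the mean upward
values = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0])

gap_raw = values.mean() - np.median(values)    # mean is well above the median
log_values = np.log(values)
gap_log = log_values.mean() - np.median(log_values)  # evenly spaced, so mean and median coincide

print(f"raw gap: {gap_raw:.2f}, log gap: {gap_log:.2e}")
```

Real prices will not line up this neatly, so the gap on the log scale is expected to shrink rather than vanish.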

## *Applying Log Transformation on Dependent Variable - Close*

In [None]:
# Applying log transformation on target variable 
plt.figure(figsize=(10,5))
sns.distplot(np.log(df['Close']),color="g")
plt.xlabel('Closing Price',fontsize=20)
plt.ylabel('Density',fontsize=20)
plt.show()

After the transformation, the distribution of the target is much closer to normal.

In [None]:

# Applying log transformation on target variable
for col in numeric_features[-1:]:
    fig = plt.figure(figsize=(12,6))
    ax = fig.gca()
    feature = np.log(new_df[col])
    feature.hist(bins=50, ax = ax)
    ax.axvline(feature.mean(), color='red', linestyle='dashed', linewidth=2)
    ax.axvline(feature.median(), color='green', linestyle='dashed', linewidth=2)    
    ax.set_title(col)
plt.show()

After the log transformation, the difference between the mean and the median is greatly reduced.

## *Distplot on Independent variable - (Open, High, Low)*

In [None]:

# Distribution of independent variable
plt.figure(figsize=(10,5))
sns.distplot(new_df['Open'], color='purple')

plt.figure(figsize=(10,5))
sns.distplot(new_df['High'], color='brown')

plt.figure(figsize=(10,5))
sns.distplot(new_df['Low'], color='magenta')

plt.show()

Like the dependent variable, all independent variables are right-skewed, so we apply a log transformation to bring them closer to a normal distribution.

## *Mean & Median of Independent Variables - (Open, High, Low)*

In [None]:
# Plotting a histogram with mean and median for each independent variable - (Open, High, Low)
for col in numeric_features[:-1]:
    fig = plt.figure(figsize=(12, 6))
    ax = fig.gca()
    feature = new_df[col]
    feature.hist(bins=50, ax = ax)
    ax.axvline(feature.mean(), color='red', linestyle='dashed', linewidth=2)
    ax.axvline(feature.median(), color='green', linestyle='dashed', linewidth=2)    
    ax.set_title(col)
plt.show()

Here the red dashed line denotes the mean and the green dashed line denotes the median.

The plots above show a large gap between the mean and the median for every independent variable, which means the independent variables are not normally distributed.

## *Applying Log Transformation on Independent Variables - (Open, High, Low)*

In [None]:

# Applying log transformation to the independent variables.
plt.figure(figsize=(10,5))
sns.distplot(np.log(new_df['Open']), color='purple')

plt.figure(figsize=(10,5))
sns.distplot(np.log(new_df['High']), color='brown')

plt.figure(figsize=(10,5))
sns.distplot(np.log(new_df['Low']), color='magenta')

plt.show()

After the log transformation, our independent variables are approximately normally distributed.

In [None]:
# Applying log transformation on Independent Variables
for col in numeric_features[:-1]:
    fig = plt.figure(figsize=(12, 6))
    ax = fig.gca()
    feature = np.log(new_df[col])
    feature.hist(bins=50, ax = ax)
    ax.axvline(feature.mean(), color='red', linestyle='dashed', linewidth=2)
    ax.axvline(feature.median(), color='green', linestyle='dashed', linewidth=2)    
    ax.set_title(col)
plt.show()

After the log transformation, the difference between the mean and the median almost disappears.
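
If the log-transformed variables are later used for modelling, anything predicted on the log scale has to be mapped back with `np.exp`. The following is a minimal sketch of that round trip, assuming the column names from the dataset description and a small made-up frame (illustrative values only).

```python
import numpy as np
import pandas as pd

# Illustrative values only; column names follow the dataset description
frame = pd.DataFrame({
    "Open":  [13.0, 12.6, 13.5],
    "High":  [14.0, 14.9, 14.4],
    "Low":   [11.3, 12.6, 12.3],
    "Close": [12.5, 13.4, 13.3],
})

# Log-transform features and target (prices are strictly positive, so the log is defined)
X_log = np.log(frame[["Open", "High", "Low"]])
y_log = np.log(frame["Close"])

# Values on the log scale are mapped back to prices with the inverse transform
y_back = np.exp(y_log)
print(np.allclose(y_back, frame["Close"]))
```

Error metrics are usually easier to interpret when they are computed after this back-transformation, on the original price scale.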

## **Bivariate Analysis**

In bivariate analysis we look at the relation between the dependent and independent variables using scatter plots and a heatmap.

## *Scatter plot for finding relation between Dependent & Independent variable*

In [None]:
#Using scatter plot
for col in numeric_features[:-1]:
  fig = plt.figure(figsize = (10,5))
  ax = fig.gca()
  features = new_df[col]
  label = new_df['Close']
  correlation = features.corr(label)
  plt.scatter(x = features,y = label)
  plt.xlabel(col)
  plt.ylabel('Close')
  plt.title('Close Vs  ' + col + '_ correlation:' + str(correlation))
  z = np.polyfit(df[col],df['Close'],1)
  y_hat = np.poly1d(z)(df[col])
  plt.plot(df[col] , y_hat, "r--",lw = 2)
plt.show()

The scatter plots above show that every independent variable is highly correlated with our dependent variable.

So we do not drop any column for modelling; all the independent variables are important.
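
The correlation printed in each plot title is the Pearson coefficient, which is the default for `Series.corr`. As a rough sketch of what that number measures, the block below computes it by hand and compares it with pandas. `pearson_r` is a hypothetical helper and the values are made up for illustration.

```python
import numpy as np
import pandas as pd

def pearson_r(x, y) -> float:
    """Pearson correlation: covariance of x and y divided by the product of their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Illustrative values only (not real Yes Bank prices)
open_prices = pd.Series([13.0, 12.6, 13.5, 13.2, 13.4])
close_prices = pd.Series([12.5, 13.4, 13.3, 13.0, 13.6])

manual = pearson_r(open_prices, close_prices)
builtin = open_prices.corr(close_prices)  # pandas default method is 'pearson'
print(np.isclose(manual, builtin))
```

Pearson correlation only captures linear association, which matches the straight trend lines drawn in the scatter plots.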

## *Heatmap for checking correlation with Independent variables*

In [None]:

#Correlation heatmap
plt.figure(figsize=(12,7))
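# Date is dropped explicitly because a datetime column cannot be correlated in recent pandas versions
sns.heatmap(new_df.drop(columns='Date').corr(),cmap='PiYG',annot=True)
plt.show()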

Here dependent variable shows high correlation with all independent variables.

## *Checking Multicollinearity*

In [None]:
#Calculating VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor
def calculate_vif(X):
  """Return a DataFrame listing the variance inflation factor of each column in X."""
  vif = pd.DataFrame()
  vif["variables"] = X.columns
  vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
  return vif

calculate_vif(new_df[[i for i in new_df.describe().columns if i not in ['Date', 'Close']]])

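The VIF of every independent variable is high, which indicates strong multicollinearity. Since there are only three features and each is strongly correlated with the target, we keep all of them.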
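
For reference, the VIF of feature i is 1 / (1 - R_i^2), where R_i^2 comes from regressing that feature on the remaining ones. The sketch below computes it with plain NumPy and an explicit intercept. `vif_with_intercept` is a hypothetical helper and the data is synthetic. To the best of my knowledge, statsmodels' `variance_inflation_factor` uses the design matrix exactly as passed, so when no constant column is included (as in the call above) its values can differ from this intercept-based version.

```python
import numpy as np

def vif_with_intercept(X) -> np.ndarray:
    """VIF per column: regress each column on the others (plus an intercept) and use 1 / (1 - R^2)."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for i in range(n_cols):
        target = X[:, i]
        others = np.delete(X, i, axis=1)
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        ss_res = float(residual @ residual)
        ss_tot = float(((target - target.mean()) ** 2).sum())
        r_squared = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

rng = np.random.default_rng(0)
base = rng.normal(size=200)
# Columns 0 and 1 are nearly identical (collinear); column 2 is independent noise
synthetic = np.column_stack([base, base + 0.05 * rng.normal(size=200), rng.normal(size=200)])

vifs = vif_with_intercept(synthetic)
print(np.round(vifs, 1))
```

A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic multicollinearity, which is why regularized models such as Ridge or Lasso are often considered for data like this.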

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***