<a href="https://colab.research.google.com/github/Adityasingh3008/YES-BANK-STOCK-PRICE-PREDICTION/blob/main/Individual_Notebook_YES_BANK_STOCK_PRICE_PREDICTION_(Capstone_Project_2).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Problem Statement**
# **Yes Bank is a well-known bank in the Indian financial domain. Since 2018, it has been in the news because of the fraud case involving Rana Kapoor. Given this, it is interesting to see how the case affected the company's stock prices and whether time series models or other predictive models can do justice to such situations. This dataset has the bank's monthly stock prices since its inception and includes the opening, highest, lowest, and closing price for every month. The main objective is to predict the stock's closing price for the month.**

# **Let's First Understand: What Is a Stock?**
**Stocks are a type of security that gives stockholders a share of ownership in a company. Stocks are also called "equities", and units of stock are called "shares". Stocks are bought and sold predominantly on stock exchanges, though there can be private sales as well, and they are the foundation of many individual investors' portfolios.**

In [23]:
# Importing all required libraries
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import math  # standard library; 'from numpy import math' fails on NumPy 2.0+
from sklearn.preprocessing import MinMaxScaler
from scipy.stats import zscore
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
from sklearn import metrics
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

In [24]:
#  Mounting google drive to load our dataset
from google.colab import drive
drive.mount('/content/drive')



Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [25]:
# Link the dataset path and read csv file
dataset = pd.read_csv('/content/drive/MyDrive/data_YesBank_StockPrices.csv')

In [26]:
# Loaded csv file
dataset

Unnamed: 0,Date,Open,High,Low,Close
0,Jul-05,13.00,14.00,11.25,12.46
1,Aug-05,12.58,14.88,12.55,13.42
2,Sep-05,13.48,14.87,12.27,13.30
3,Oct-05,13.20,14.47,12.40,12.99
4,Nov-05,13.35,13.88,12.88,13.41
...,...,...,...,...,...
180,Jul-20,25.60,28.30,11.10,11.95
181,Aug-20,12.00,17.16,11.85,14.37
182,Sep-20,14.30,15.34,12.75,13.15
183,Oct-20,13.30,14.01,12.11,12.42


In [27]:
# Fetch first five rows by using head() method of dataframe
dataset.head()

Unnamed: 0,Date,Open,High,Low,Close
0,Jul-05,13.0,14.0,11.25,12.46
1,Aug-05,12.58,14.88,12.55,13.42
2,Sep-05,13.48,14.87,12.27,13.3
3,Oct-05,13.2,14.47,12.4,12.99
4,Nov-05,13.35,13.88,12.88,13.41


In [28]:
# Fetch last five rows by using tail() method of dataframe
dataset.tail()

Unnamed: 0,Date,Open,High,Low,Close
180,Jul-20,25.6,28.3,11.1,11.95
181,Aug-20,12.0,17.16,11.85,14.37
182,Sep-20,14.3,15.34,12.75,13.15
183,Oct-20,13.3,14.01,12.11,12.42
184,Nov-20,12.41,14.9,12.21,14.67


In [29]:
# Getting information about the datatypes and null values stored at each column by using "info()" method
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 185 entries, 0 to 184
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Date    185 non-null    object 
 1   Open    185 non-null    float64
 2   High    185 non-null    float64
 3   Low     185 non-null    float64
 4   Close   185 non-null    float64
dtypes: float64(4), object(1)
memory usage: 7.4+ KB


In [30]:
# Fetch first five rows of feature "Date" by using head() method of dataframe
dataset['Date'].head()

0    Jul-05
1    Aug-05
2    Sep-05
3    Oct-05
4    Nov-05
Name: Date, dtype: object

* **From the output above we can see that the 'Date' feature has the object data type and its values are in MMM-YY format (for example, Jul-05). We need to convert it to a proper datetime format, YYYY-MM-DD.**

In [31]:
# importing datetime and converting 'Date' into datetime - YYYY-MM-DD
from datetime import datetime
dataset['Date'] = pd.to_datetime(dataset['Date'].apply(lambda x: datetime.strptime(x, '%b-%y')))   

In [10]:
# Fetch first five rows by using head() method of dataframe after converting "Date" column into proper Date format
dataset.head()

Unnamed: 0,Date,Open,High,Low,Close
0,2005-07-01,13.0,14.0,11.25,12.46
1,2005-08-01,12.58,14.88,12.55,13.42
2,2005-09-01,13.48,14.87,12.27,13.3
3,2005-10-01,13.2,14.47,12.4,12.99
4,2005-11-01,13.35,13.88,12.88,13.41


* **Now the 'Date' feature is converted into a proper datetime format.**
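
The conversion above can also be written in vectorised form. The snippet below is a minimal sketch on a small hypothetical sample (the `sample` series is not part of the dataset), assuming an English locale so that `%b` matches abbreviations such as `Jul`.

```python
import pandas as pd

# Hypothetical sample that mirrors the 'Date' strings in the dataset
sample = pd.Series(['Jul-05', 'Aug-05', 'Nov-20'])

# '%b' is the abbreviated month name, '%y' is the two-digit year.
# No day is given, so every value is set to the first day of its month.
parsed = pd.to_datetime(sample, format='%b-%y')

print(parsed.dt.strftime('%Y-%m-%d').tolist())
# ['2005-07-01', '2005-08-01', '2020-11-01']
```

Passing `format` explicitly avoids any guessing about whether `05` is a day or a year.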

# **Checking Null Values**

In [32]:
# Checking Null Values In Our Dataset
dataset.isnull().sum()

Date     0
Open     0
High     0
Low      0
Close    0
dtype: int64

* **We can see that there are no null values in our dataset.**

In [33]:
# Creating a copy of a dataframe
df = dataset.copy()

In [34]:
# Set the DataFrame index using existing columns.
df.set_index('Date',inplace=True)

In [35]:
# Fetch first five rows by using head() method of dataframe after set_index method()
df.head()

Date,Open,High,Low,Close
2005-07-01,13.0,14.0,11.25,12.46
2005-08-01,12.58,14.88,12.55,13.42
2005-09-01,13.48,14.87,12.27,13.3
2005-10-01,13.2,14.47,12.4,12.99
2005-11-01,13.35,13.88,12.88,13.41


* **We created a copy of the dataframe so that any conditions we apply or changes we make do not affect the original dataset.**
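
As a small illustration of why the copy matters, the sketch below uses a hypothetical three-row frame (`original` and `working` are illustrative names, not variables from this notebook) and shows that changes to the copy leave the source untouched.

```python
import pandas as pd

# Hypothetical frame holding the first three closing prices
original = pd.DataFrame({'Close': [12.46, 13.42, 13.30]})

# copy() makes a deep copy by default, so the data is duplicated
working = original.copy()

# Change only the working copy
working['Close'] = working['Close'] * 2

print(original['Close'].tolist())  # [12.46, 13.42, 13.3]
print(working['Close'].tolist())   # [24.92, 26.84, 26.6]
```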

# **Checking Duplicate Values**

In [36]:
#Taking a look at duplicate values
len(df[df.duplicated()])

0

* **Just as there are no null values, there are no duplicate rows either.**
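
To make the duplicate check concrete, here is a minimal sketch on a hypothetical frame (`demo` is an illustrative name) in which the third row repeats the first.

```python
import pandas as pd

# Hypothetical frame: row 2 is an exact repeat of row 0
demo = pd.DataFrame({
    'Open':  [13.00, 12.58, 13.00],
    'Close': [12.46, 13.42, 12.46],
})

# duplicated() marks a row True when it repeats an earlier row
dup_mask = demo.duplicated()
print(int(dup_mask.sum()))  # 1

# drop_duplicates() keeps the first occurrence of each row
deduped = demo.drop_duplicates()
print(len(deduped))  # 2
```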

In [37]:
# Summary statistics (count, mean, std, min, quartiles, max) of the numeric columns
# df is used here because 'Date' is its index, so only the price columns are summarised
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Open,185.0,105.541405,98.87985,10.0,33.8,62.98,153.0,369.95
High,185.0,116.104324,106.333497,11.24,36.14,72.55,169.19,404.0
Low,185.0,94.947838,91.219415,5.55,28.51,58.0,138.35,345.5
Close,185.0,105.204703,98.583153,9.98,33.45,62.54,153.3,367.9


* **We used the describe() method to compute summary statistics (count, mean, std, min, quartiles and max) of the numerical columns of the DataFrame.**
* **We also used the transpose attribute (.T) to swap rows and columns, so that each feature becomes a row.**
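
The effect of the transpose is easiest to see from the shapes. The sketch below uses the first five rows of `Open` and `Close` shown earlier; the `prices` and `summary` names are illustrative.

```python
import pandas as pd

# First five Open and Close values from the dataset
prices = pd.DataFrame({
    'Open':  [13.00, 12.58, 13.48, 13.20, 13.35],
    'Close': [12.46, 13.42, 13.30, 12.99, 13.41],
})

summary = prices.describe()

# describe() returns one row per statistic and one column per feature
print(summary.shape)    # (8, 2)

# .T flips this: one row per feature, one column per statistic
print(summary.T.shape)  # (2, 8)
print(summary.T.columns.tolist())
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
```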

In [39]:
# Check the various attributes of data like shape (rows and columns), null values, unique values.
print ("Rows     : " ,dataset.shape[0])
print ("Columns  : " ,dataset.shape[1])
print ("\nFeatures : \n " ,dataset.columns.tolist())
print ("\nMissing values:", dataset.isnull().sum().values.sum())
print ("\nUnique values :  \n", dataset.nunique())

Rows     :  185
Columns  :  5

Features : 
  ['Date', 'Open', 'High', 'Low', 'Close']

Missing values: 0

Unique values :  
 Date     185
Open     183
High     184
Low      183
Close    185
dtype: int64


**Conclusion Drawn**:-

* **From the various attributes of the data, such as shape (rows and columns), null values and unique values, we get to know that there are 185 rows and 5 columns, that there are no missing values in our data, and that 'Date' and 'Close' are unique in every row while 'Open', 'High' and 'Low' contain a few repeated values.**

# **Features/Columns in respective dataset**
 * **Date:-** The period the prices refer to (in our data, each date holds the month and year for a set of prices).
 * **Open:-** The price at which the stock started trading in the period.
 * **High:-** The highest price at which the stock traded during the period.
 * **Low:-** The lowest price at which the stock traded during the period.
 * **Close:-** The price at which the stock last traded at the end of the period (here, the end of the month).
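
To show how the four price columns relate to each other, the sketch below builds monthly Open/High/Low/Close values from a handful of made-up daily prices. The `daily` series is purely hypothetical and is not how this dataset was produced; it only illustrates the definitions above.

```python
import pandas as pd

# Hypothetical daily prices spanning the end of July and start of August 2005
daily = pd.Series(
    [13.0, 14.0, 12.5, 12.6, 13.4],
    index=pd.to_datetime([
        '2005-07-27', '2005-07-28', '2005-07-29',
        '2005-08-01', '2005-08-02',
    ]),
)

# Group by calendar month ('MS' labels each group by its first day)
# and take the first, max, min and last price of each month
monthly = daily.resample('MS').ohlc()
print(monthly)

# July:   open 13.0, high 14.0, low 12.5, close 12.5
# August: open 12.6, high 13.4, low 12.6, close 13.4
```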


# **Now that the 'Date' feature is converted to a proper datetime format and there are no categorical features in our dataset, we will take a look at the numerical features.**

In [40]:
# Select every column that is not of bool or object dtype
# (note: this keeps the datetime 'Date' column alongside the numeric ones)
# and fetch the first five rows with head()
dataset_num = dataset.select_dtypes(exclude=['bool','object'])
dataset_num.head()

Unnamed: 0,Date,Open,High,Low,Close
0,2005-07-01,13.0,14.0,11.25,12.46
1,2005-08-01,12.58,14.88,12.55,13.42
2,2005-09-01,13.48,14.87,12.27,13.3
3,2005-10-01,13.2,14.47,12.4,12.99
4,2005-11-01,13.35,13.88,12.88,13.41


In [41]:
# Store the names of the numeric columns (df is indexed by 'Date', so only the price columns remain)
numeric_features = df.columns
numeric_features

Index(['Open', 'High', 'Low', 'Close'], dtype='object')
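
An alternative way to get only the numeric columns, without first moving 'Date' into the index, is to select by dtype. The sketch below uses a hypothetical two-row frame (`frame` is an illustrative name) to compare excluding `bool`/`object` with including `'number'`.

```python
import pandas as pd

# Hypothetical frame with one datetime column and two float columns
frame = pd.DataFrame({
    'Date':  pd.to_datetime(['2005-07-01', '2005-08-01']),
    'Open':  [13.00, 12.58],
    'Close': [12.46, 13.42],
})

# Excluding bool and object still keeps the datetime column
non_object = frame.select_dtypes(exclude=['bool', 'object'])
print(non_object.columns.tolist())    # ['Date', 'Open', 'Close']

# Including 'number' keeps only integer and float columns
numeric_only = frame.select_dtypes(include='number')
print(numeric_only.columns.tolist())  # ['Open', 'Close']
```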