<a href="https://colab.research.google.com/github/Nautiyalmukesh2001/Yes-Bank-Stock-Price-Prediction/blob/main/Yes_Bank_Stock_Price%7CPredict%7CM6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name : Yes Bank Stock Closing Price Prediction**







**Project Type** - Time Series Analysis

**Contribution** - Individual

**Team Member 1** - Mukesh Nautiyal

# **Project Summary :**


**Objective**  
This project aimed to predict the closing stock price of Yes Bank using time series analysis and machine learning techniques. The goal was to provide actionable insights for investors and financial analysts, enabling informed decision-making and effective risk management.  

**Dataset**  
The analysis was conducted using historical Yes Bank stock prices, recorded monthly as open, high, low, and close values.  

**Methodology**  

1. **Data Preprocessing & Wrangling** – Cleaned and prepared the dataset to ensure accuracy and consistency.  
2. **Exploratory Data Analysis (EDA)** – Used visualizations to identify historical trends, volatility, and key patterns.  
3. **Feature Engineering** – Created lagged features and relevant variables to enhance model performance.  
4. **Model Implementation** – Developed and evaluated various models, including:  
   - **Time Series Models:** ARIMA, SARIMA (to capture trends and seasonality).  
   - **Machine Learning Models:** Linear Regression, Lasso, and Ridge (to leverage predictive patterns).  
5. **Model Evaluation** – Assessed model performance using Mean Squared Error (MSE) and R-squared metrics.  
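
For reference, the two metrics named in step 5 can be computed by hand. The sketch below is illustrative only: the function names `mse` and `r_squared` and the toy numbers are assumptions, not part of the notebook, which relies on scikit-learn's `mean_squared_error` and `r2_score`.

```python
def mse(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Toy closing prices (hypothetical values, not taken from the dataset)
actual = [10.0, 12.0, 14.0]
predicted = [11.0, 12.0, 13.0]

print(mse(actual, predicted))        # (1 + 0 + 1) / 3
print(r_squared(actual, predicted))  # 1 - 2 / 8 = 0.75
```

A lower MSE means smaller errors in squared price units, while an R-squared closer to 1 means the predictions explain more of the variance in the actual prices.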

**Key Findings**  

- **ARIMA & SARIMA Models** effectively captured trends and seasonality, producing reasonable forecasts.  
- **Machine Learning Models (Lasso & Ridge)** emphasized the significance of feature selection and regularization for better predictions.  
- **Data Limitations**: The predictions were constrained by the dataset, which lacked external economic factors, market sentiment, and global indicators that significantly impact stock prices.  

**Business Impact & Solutions**  

1. **Risk Management** – Helps investors assess potential price fluctuations, enabling proactive risk mitigation.  
2. **Investment Strategies** – Supports data-driven investment decisions by forecasting future price movements.  
3. **Market Analysis** – Provides deeper insights into stock behavior, which can be correlated with external financial and economic conditions.  

**Conclusion**  
This project demonstrated the power of data-driven approaches in stock price prediction. While stock markets remain inherently unpredictable, the developed models offer valuable insights for investment and risk management. Future work can enhance accuracy by integrating external factors such as economic indicators, market sentiment analysis, and advanced deep learning techniques.  


# **GitHub Link -**

# **Problem Statement**

**BUSINESS PROBLEM OVERVIEW**

Stock market fluctuations present a significant challenge for investors and financial institutions. Predicting stock prices accurately can help in making informed investment decisions, mitigating risks, and optimizing portfolio strategies.

This project focuses on predicting the closing stock price of Yes Bank using time series and machine learning techniques. By analyzing historical stock price data, namely the monthly opening, high, low, and closing prices, the models aim to provide insights into future price movements. The predictions can assist traders, investors, and financial analysts in making data-driven decisions, thereby improving risk assessment and investment strategies.




# **1. Dataset Exploration**

## Import Libraries

In [None]:
# standard library
import os
import re
import time
from datetime import datetime

# data handling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# time series tools
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX

# regression models and metrics
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso, Ridge



## Dataset Loading

In [None]:
# Creating file path
filepath = '/content/data_YesBank_StockPrices.csv'

# creating a pandas dataframe
stock_df = pd.read_csv(filepath)

## Dataset First View

In [None]:
# top 5 rows of the data
stock_df.head()

## Dataset Rows & Columns count

In [None]:
# Dataset rows and columns
stock_df.shape

## Dataset Information

In [None]:
# dataset info
stock_df.info()

### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

stock_df.duplicated().sum()

### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

print(stock_df.isnull().sum())

In [None]:
# Checking Null Value by plotting Heatmap

sns.heatmap(stock_df.isnull(),cbar=True)


## What did you know about your dataset?

The dataset originates from the stock market domain, specifically focusing on Yes Bank's stock prices, with the objective of predicting closing stock prices. By leveraging historical trends, this dataset aims to provide insights into market fluctuations, assisting investors and analysts in making informed decisions.

It comprises 185 rows and 5 columns, capturing essential stock-related metrics. The dataset is clean and well-structured, with no missing or null values, ensuring a robust foundation for analysis.

# **2. Understanding Variables**

In [None]:
# Dataset Columns

stock_df.columns

In [None]:
# Dataset Describe

stock_df.describe(include='all')

## Variable Description



*   Date: Date of the Record
*   Open: Opening Price
*   High: Highest Price in the month
*   Low: Lowest Price in the month
*   Close: Closing Price of the month










##  Unique Values for each variable.

In [None]:
# number of unique values
stock_df.nunique()

# **3. Data Wrangling**

## Data Wrangling Code

In [None]:
# creating copy of dataset
stock_df_copy = stock_df.copy()

In [None]:
# the Date column holds 'Mon-YY' strings, so parse it and move each date to its month end
stock_df['Date'] = pd.to_datetime(stock_df['Date'],format='%b-%y') + pd.offsets.MonthEnd(0)
stock_df['Date'] = stock_df['Date'].dt.normalize()  # keeping only the date (time set to midnight)
stock_df['Day'] = stock_df['Date'].dt.day # extracting the day part
stock_df['Month'] = stock_df['Date'].dt.month # extracting the month part
stock_df['Year'] = stock_df['Date'].dt.year  # extracting the year part


In [None]:
# creating a new df with date column as index
stock_df = stock_df.set_index('Date')

In [None]:
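# grouping by year to see the trend in average stock prices
# (only the price columns are averaged; Day and Month means are not meaningful)
stock_df.groupby('Year')[['Open', 'High', 'Low', 'Close']].mean()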

## What all manipulations have you done and insights you found?

To ensure data integrity and facilitate accurate analysis, I first created a copy of the original dataset to prevent any data loss during transformations. This allowed me to experiment with different data wrangling techniques without affecting the raw data.

One of the key transformations involved structuring the date column to ensure consistency in the format. To enable better trend analysis, I extracted separate Year, Month, and Day columns from the date field. This decomposition made it easier to analyze stock price movements over different time periods and identify patterns at various granular levels.

To understand the overall trend of Yes Bank’s stock prices, I used the groupby function to aggregate the data by year. This helped in observing how the opening price, closing price, highest price, and lowest price changed over time. By summarizing the stock price trends annually, I could identify key fluctuations and patterns in the market.

From the initial insights, the grouped data revealed noticeable variations in the closing stock prices across different years. Some years exhibited significant spikes or drops, suggesting potential market events that influenced Yes Bank’s stock performance. This structured approach to data wrangling not only cleaned and organized the dataset but also provided a strong foundation for deeper time-series analysis, including identifying seasonal trends and volatility patterns.
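
The date handling described above can be tried in isolation. This is a minimal sketch, assuming the raw `Date` values look like `Jul-05` (inferred from the `%b-%y` format string); the sample strings and the `raw_dates`, `month_end`, and `parts` names are hypothetical.

```python
import pandas as pd

# Hypothetical sample in the same 'Mon-YY' style as the Date column
raw_dates = pd.Series(["Jul-05", "Feb-20", "Nov-20"])

# '%b-%y' parses to the first day of the month;
# MonthEnd(0) rolls each date forward to the last day of that month
month_end = pd.to_datetime(raw_dates, format="%b-%y") + pd.offsets.MonthEnd(0)

# Split the parsed date into the same Day / Month / Year parts
parts = pd.DataFrame({
    "Date": month_end,
    "Day": month_end.dt.day,
    "Month": month_end.dt.month,
    "Year": month_end.dt.year,
})
print(parts)
```

Because every row is moved to its month end, the `Day` column only records the length of the month, so `Month` and `Year` are the parts that carry information for trend analysis.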

# **4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables**

## Chart 1: Line Chart



In [None]:
# creating a line chart to visualize the closing price
plt.figure(figsize=(12, 6))
plt.plot(stock_df.index, stock_df['Close'], label='Closing Price', color='blue')
plt.xlabel("Date")
plt.ylabel("Closing Price")
plt.title("Closing Stock Price Over Time")
plt.legend()
plt.grid(True)
plt.show()

### 1. Why did you pick the specific chart?

This line chart was chosen because it effectively visualizes the historical trend of the closing stock price over time. It allows us to observe patterns, trends, and fluctuations in the stock price, which is crucial for understanding past performance and making future predictions.

### 2. What is/are the insight(s) found from the chart?

*   2005 to 2018: The closing price rose gradually, indicating a positive trend.
*   2018 to 2019: The price surged significantly and then fluctuated sharply.
*   After 2019: The price collapsed, reaching levels lower than in previous years.
*   There is a horizontal line on the left side of the chart, which might indicate a data processing issue or missing values in the dataset.

### 3. Will the gained insights help creating a positive business impact?


*   Risk Management: Understanding the sharp decline can help businesses assess risks in stock investments.
*   Investment Strategies: Identifying cyclical patterns and volatility can help traders make informed decisions.
*   Market Analysis: Businesses can correlate stock performance with company events, policies, or market conditions to enhance decision-making.







## Chart 2: Seasonal Decompose

In [None]:
# Perform seasonal decomposition
result = seasonal_decompose(stock_df['Close'], model='additive', period=12)

In [None]:
# Plot trend
plt.figure(figsize=(14, 4))
plt.plot(result.trend, label='Trend')
plt.legend(loc='upper left')
plt.show()

In [None]:
# Plot seasonality
plt.figure(figsize=(14, 4))
plt.plot(result.seasonal, label='Seasonality')
plt.legend(loc='upper left')
plt.show()

In [None]:
# Plot residuals
plt.figure(figsize=(14, 4))
plt.plot(result.resid, label='Residuals')
plt.legend(loc='upper left')
plt.show()

## Chart 3: Box Plot

In [None]:
# creating a boxplot to visualize the features of prices
plt.figure(figsize=(10,5))
sns.boxplot(data=stock_df['Close'])
plt.title("Stock Price Distribution")
plt.show()

## Chart 4: Distribution

In [None]:
sns.histplot(data=stock_df, x="Close",  kde = True, color  = 'Red') # closing price
# Show plot
plt.show()


In [None]:
sns.kdeplot(data=stock_df, x="Close",  fill = True, color  = 'Red') # closing price
# Show plot
plt.show()


In [None]:
# Calculating monthly returns to check stock price volatility
# (the data has one row per month, so pct_change gives month-over-month returns)
stock_df['Monthly Return'] = stock_df['Close'].pct_change()

# plotting
plt.figure(figsize=(10,5))
sns.histplot(stock_df['Monthly Return'].dropna(), bins=50, kde=True)
plt.xlabel("Monthly Return")
plt.ylabel("Frequency")
plt.title("Distribution of Monthly Returns")
plt.show()

## Chart 5: Smoothed Closing Price Trend Over Time

In [None]:
moving_avg = stock_df['Close'].rolling(window=12).mean()  # 12 Month moving average

# Plotting the original data and the moving average
plt.figure(figsize=(12, 6))
plt.plot(stock_df['Close'], label='Original Data')
plt.plot(moving_avg, label='12-Month Moving Average', color='orange')
plt.legend(loc='upper left')
plt.title('Time Series with 12-Month Moving Average')
plt.show()

## Chart 6: Candle

In [None]:
# Candlestick Chart for Price Movement
# Visualize open, high, low, close (OHLC) prices.


fig = go.Figure(data=[go.Candlestick(
    x=stock_df.index,
    open=stock_df['Open'], high=stock_df['High'],
    low=stock_df['Low'], close=stock_df['Close'])])

fig.update_layout(title="Candlestick Chart for Yes Bank Stock")
fig.show()


## Chart 7: Relative Strength Index

In [None]:
# Relative Strength Index (RSI) – Momentum Indicator
# RSI helps identify overbought/oversold conditions.

def compute_RSI(data, window=14):
    delta = data.diff(1)
    gain = (delta.where(delta > 0, 0)).rolling(window=window).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(window=window).mean()
    RS = gain / loss
    RSI = 100 - (100 / (1 + RS))
    return RSI

stock_df['RSI'] = compute_RSI(stock_df['Close'])

stock_df['RSI'].plot(title="Relative Strength Index (RSI)",figsize=(15,6))
plt.show()

# RSI > 70 → Overbought (Stock may be overpriced → possible sell signal)
# RSI < 30 → Oversold (Stock may be undervalued → possible buy signal)

### 1. Why did you pick the specific chart?

The RSI chart is chosen to analyze the stock's momentum and identify overbought and oversold conditions over time. It helps in detecting trend reversals and understanding whether the stock is in a bullish or bearish phase.

### 2. What is/are the insight(s) found from the chart?

RSI values above 70 indicate overbought conditions, suggesting potential price drops or corrections.
RSI values below 30 indicate oversold conditions, signaling possible price increases or rebounds.
The chart shows multiple peaks above 70 and dips below 30, suggesting high volatility and frequent trend changes.
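
To make the 70/30 thresholds concrete, here is a small plain-Python sketch of the same moving-average RSI idea. The names `simple_rsi` and `rsi_signal` and the sample prices are hypothetical. Unlike the pandas version, it returns 100 whenever a window contains no losses, which is an assumption made here to avoid dividing by zero.

```python
def simple_rsi(prices, window=14):
    """RSI from simple averages of gains and losses over a sliding window."""
    deltas = [b - a for a, b in zip(prices, prices[1:])]
    rsi = []
    for end in range(window, len(deltas) + 1):
        chunk = deltas[end - window:end]
        avg_gain = sum(d for d in chunk if d > 0) / window
        avg_loss = sum(-d for d in chunk if d < 0) / window
        if avg_loss == 0:
            rsi.append(100.0)  # no losses in the window (assumed convention)
        else:
            rs = avg_gain / avg_loss
            rsi.append(100 - 100 / (1 + rs))
    return rsi


def rsi_signal(value):
    """Label an RSI value using the 70 / 30 thresholds."""
    if value > 70:
        return "overbought"
    if value < 30:
        return "oversold"
    return "neutral"


# Hypothetical prices: a steady rise followed by a steady fall
prices = [10, 11, 12, 13, 14, 13, 12, 11, 10]
for value in simple_rsi(prices, window=3):
    print(round(value, 1), rsi_signal(value))
```

With a 3-period window the rising stretch is labeled overbought and the falling stretch oversold, which is the pattern of peaks and dips described above.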

### 3. Will the gained insights help creating a positive business impact?

## Chart 8: Heatmap

In [None]:
# plotting heatmap
plt.figure(figsize=(8,6))
sns.heatmap(stock_df[['Open', 'High', 'Low', 'Close']].corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Matrix")
plt.show()


## Chart 9: Pairplot

In [None]:
# plotting a pairplot
sns.pairplot(stock_df[['Open','High','Low','Close']])
plt.show()

# **5. Hypothesis Testing**

### Hypothetical Statement - 1

The data is stationary: it has a constant mean and variance, with no seasonality.

In [None]:
# Performing ADF test to calculate the p-value
# Null hypothesis = Series has a unit root
# Alternate hypothesis = Series has no unit root

result = adfuller(stock_df['Close'])
print('p-value: {}'.format(result[1]))
if result[1] < 0.05:
    print("Strong evidence against the null hypothesis: reject it. The series has no unit root and is stationary.")
else:
    print("Weak evidence against the null hypothesis: fail to reject it. The series may have a unit root and is likely non-stationary.")

### Hypothetical Statement - 2

Stock Returns Follow a Normal Distribution

In [None]:
# Performing Shapiro-Wilk Test
# Null Hypothesis = Stock returns follow normal distribution
# Alternate Hypothesis = Stock returns do not follow normal distribution

from scipy.stats import shapiro

stock_returns = stock_df['Close'].pct_change().dropna()
shapiro_stat, shapiro_p = shapiro(stock_returns)

print(f'Shapiro-Wilk Statistic: {shapiro_stat}, p-value: {shapiro_p}')
if shapiro_p < 0.05:
    print("Reject null hypothesis: Stock returns are not normally distributed.")
else:
    print("Fail to reject null hypothesis: Stock returns follow a normal distribution.")

### Hypothetical Statement - 3

Past stock prices influence future prices.

In [None]:
# Performing Autocorrelation Test - Ljung-Box Test
# Null hypothesis = No autocorrelation in stock returns (efficient market).
# Alternate hypothesis = Stock returns show autocorrelation (inefficient market).

from statsmodels.stats.diagnostic import acorr_ljungbox

lb_test = acorr_ljungbox(stock_returns, lags=[10], return_df=True)
print(lb_test)

if lb_test['lb_pvalue'].values[0] < 0.05:
    print("Reject null hypothesis: Stock returns exhibit autocorrelation (potential inefficiency).")
else:
    print("Fail to reject null hypothesis: No significant autocorrelation (efficient market).")


# **6. Feature Engineering & Data Pre-processing**

In [None]:
# creating a function to check data is stationary or not
def adf_test(series):
    result = adfuller(series)
    print('ADF Statistics : {}'.format(result[0]))
    print('p-value: {}'.format(result[1]))
    if result[1] < 0.05:
        print("Strong evidence against the null hypothesis: reject it. The series has no unit root and is stationary.")
    else:
        print("Weak evidence against the null hypothesis: fail to reject it. The series may have a unit root and is likely non-stationary.")

In [None]:
# adf test for closing price
adf_test(stock_df['Close'])

In [None]:
# first differencing
stock_df['First_Differencing'] = stock_df['Close'].diff()

In [None]:
# adf test after first differencing
adf_test(stock_df['First_Differencing'].dropna())

In [None]:
# second differencing
stock_df['Second_Differencing'] = stock_df['First_Differencing'].diff()

In [None]:
# adf test after second differencing
adf_test(stock_df['Second_Differencing'].dropna())
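
Differencing replaces each value with its change from the previous value, which is why it can remove a trend. The toy example below is a sketch only: the `difference` helper and the quadratic series are hypothetical and merely mimic what `.diff()` does to a trending series.

```python
def difference(values):
    """Return the changes between consecutive values (like Series.diff without the NaN)."""
    return [b - a for a, b in zip(values, values[1:])]


# Hypothetical series with a quadratic trend: t**2 for t = 0..6
series = [t ** 2 for t in range(7)]   # [0, 1, 4, 9, 16, 25, 36]

first = difference(series)            # [1, 3, 5, 7, 9, 11] still trending upward
second = difference(first)            # [2, 2, 2, 2, 2] constant, trend removed

print(first)
print(second)
```

The number of differencing passes needed to reach a stationary series corresponds to the `d` term in an ARIMA `(p, d, q)` order.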

In [None]:
# Plotting the original and differenced data
plt.figure(figsize=(12, 6))
plt.subplot(2, 1, 1)
plt.plot(stock_df['Close'], label='Original Closing Price')
plt.legend(loc='upper left')
plt.subplot(2, 1, 2)
plt.plot(stock_df['Second_Differencing'].dropna(), label='Second Differencing', color='orange')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()

In [None]:
# plotting an ACF plot to help choose the q value for the ARIMA model
acf_plot = plot_acf(stock_df['Second_Differencing'].dropna())

In [None]:
# plotting pacf plot to determine the p value
pacf_plot = plot_pacf(stock_df['Second_Differencing'].dropna())

# **7. ML Model Implementation**

In [None]:
# splitting train and test data
train = stock_df['Close'].iloc[:-12]
test = stock_df['Close'].iloc[-12:]

# creating model
arima_model = ARIMA(train, order=(12, 2,8))

# model fit
arima_result = arima_model.fit()

# prediction for test data
arima_forecast = arima_result.forecast(steps=12)

In [None]:
# Evaluating the models on test data
arima_mse = mean_squared_error(test, arima_forecast)

arima_mse

In [None]:
# summary of arima model
arima_result.summary()

In [None]:
# making prediction for all values
stock_df['pred_arima']  = arima_result.predict(start = datetime(2005,7,31), end=datetime(2020,11,30))

In [None]:
# calculating the mean absolute error (MAE) between actual and predicted values
(stock_df['Close'] - stock_df['pred_arima']).abs().mean()

In [None]:
# plotting the graph to visualize the predicted values over actual values
stock_df[['Close','pred_arima']].plot(figsize=(12,6))

In [None]:
# Applying SARIMA model
sarima_model = SARIMAX(train, order=(12, 0, 8), seasonal_order=(0, 1, 0, 12))

# model fit sarima
sarima_result = sarima_model.fit()

# sarima predictions
sarima_forecast = sarima_result.forecast(steps=12)

In [None]:
# Evaluating the sarima model on test data
sarima_mse = mean_squared_error(test, sarima_forecast)

sarima_mse

In [None]:
# predicting all values using the fitted SARIMA model (sarima_result, not arima_result)
stock_df['pred_sarima']  = sarima_result.predict(start=datetime(2005,7,31), end=datetime(2020,11,30))

In [None]:
# calculating the mean absolute error (MAE) of the SARIMA predictions
(stock_df['Close'] - stock_df['pred_sarima']).abs().mean()

In [None]:
# plotting sarima's predicted and actual values
stock_df[['Close','pred_sarima']].plot(figsize=(12,6))

In [None]:
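# creating new column storing the previous month's closing price (lag 1)
stock_df['Lag1'] = stock_df['Close'].shift(1)

# removing only the rows where the lag is missing
# (a plain dropna() would also drop rows with NaNs in RSI and the differencing columns)
stock_df.dropna(subset=['Lag1'], inplace=True)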

In [None]:
# preparing data for regression model
X = stock_df[['Lag1']]
y = stock_df['Close']
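
For time-ordered data, a chronological hold-out is a common alternative to a shuffled split, because it keeps every test row later than every training row. The sketch below builds a lag-1 feature and splits by position; the prices, the 80/20 ratio, and the `frame`, `split_at`, `train`, and `test` names are hypothetical. With scikit-learn, passing `shuffle=False` to `train_test_split` gives a similar ordered split.

```python
import pandas as pd

# Hypothetical month-end closing prices (not taken from the dataset)
dates = pd.date_range("2019-01-01", periods=10, freq="MS") + pd.offsets.MonthEnd(0)
frame = pd.DataFrame(
    {"Close": [10.0, 11.0, 12.5, 12.0, 13.0, 14.5, 14.0, 15.0, 16.5, 16.0]},
    index=dates,
)

# Lag-1 feature: the previous month's close
frame["Lag1"] = frame["Close"].shift(1)
frame = frame.dropna(subset=["Lag1"])  # only the first row has no lag

# Chronological split: first 80% of rows for training, the rest for testing
split_at = int(len(frame) * 0.8)
train, test = frame.iloc[:split_at], frame.iloc[split_at:]

print(train.index.max(), "<", test.index.min())
```

Splitting this way means the model is always evaluated on months it could not have seen during training.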

In [None]:
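# splitting train and test data in time order (no shuffling),
# so the test set contains only months that come after the training months
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)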


In [None]:
# linear regression model
model_lr = LinearRegression()

# model fit
model_lr.fit(X_train,y_train)

# model prediction
y_pred_lr = model_lr.predict(X_test)

In [None]:

# evaluating the linear regression model on the test data
mse = mean_squared_error(y_test, y_pred_lr)
r2 = r2_score(y_test,y_pred_lr)
print("Mean Squared Error:", mse)
print('r2 score is ',r2)

In [None]:
# Create a Lasso regression model
lasso = Lasso(alpha=1.0)  # alpha is the regularization parameter

# Train the model
lasso.fit(X_train, y_train)

# Evaluate the model
y_pred_lasso = lasso.predict(X_test)
mse_lasso = mean_squared_error(y_test, y_pred_lasso)
r2_lasso = r2_score(y_test,y_pred_lasso)
print(f"Lasso Regression Mean Squared Error: {mse_lasso}")
print(f"Lasso Regression r2 score: {r2_lasso:.2f}")

In [None]:
# Create a Ridge regression model
ridge = Ridge(alpha=1.0)  # alpha is the regularization parameter

# Train the model
ridge.fit(X_train, y_train)

# Evaluate the model
y_pred_ridge = ridge.predict(X_test)
mse_ridge = mean_squared_error(y_test, y_pred_ridge)
r2_ridge = r2_score(y_test,y_pred_ridge)
print(f"Ridge Regression Mean Squared Error: {mse_ridge}")
print(f"Ridge Regression r2 score: {r2_ridge:.2f}")

# **Conclusion**

1. Stock Price Forecasting: Applied time series models (ARIMA, SARIMA) and regression models (Linear Regression, Lasso, Ridge) to predict Yes Bank’s monthly closing stock price, providing reasonable forecasts based on historical data.

2. Risk Management & Investment Strategies: Predictive models assist investors in assessing potential price fluctuations, helping in portfolio optimization and risk mitigation.

3. Market Analysis & Decision-Making: The insights gained enable financial analysts to better understand stock trends, supporting strategic market decisions.

4. Challenges & Limitations: The models relied solely on historical stock prices and did not incorporate external economic indicators, market news, or sentiment analysis, which could impact prediction accuracy.

5. Future Enhancements: Incorporating macroeconomic factors, company financials, and real-time news sentiment could improve forecasting precision. Advanced deep learning models like LSTMs and transformers can be explored for enhanced performance.

6. Business Impact: While absolute accuracy in stock price prediction is challenging due to market volatility, this project demonstrates the value of data-driven approaches in making informed financial decisions and reducing investment risks.