# **Project Name  - Yes Bank**



##### **Project Type**    - Regression/Machine Learning
##### **Contribution**    - Individual

# **Project Summary -**

Write the summary here within 500-600 words.

# **GitHub Link -**

[Github Link](https://github.com/Aastha2675/Stock_Price_Prediction)

# **Problem Statement**


Stock prices are highly volatile and influenced by various factors. In the case of Yes Bank, events like the 2018 fraud case led to sharp fluctuations. Accurately predicting the monthly closing stock price is challenging yet essential for investors and analysts. This project addresses the problem by using historical stock data to build a machine learning model that forecasts closing prices and aids decision-making.

# ***Solution***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
from scipy.stats import pearsonr
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

### Dataset Loading

In [None]:
# Load Dataset
df = pd.read_csv('data_YesBank_StockPrices.csv')

### Dataset First View

In [None]:
df.head()

### Dataset Rows & Columns count

In [None]:
df.shape

### Dataset Information

In [None]:
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isna().sum()

In [None]:
# Visualizing the missing values
plt.figure(figsize=(6, 3))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis', yticklabels=False)
plt.title('Visualizing Missing Values Using Heatmap')
plt.show()

### What did you know about your dataset?

This is a monthly time series dataset containing historical stock price data for Yes Bank.

*Dataset Structure:* 185 rows and 5 columns (Date; Open: price at the beginning of the month; High: highest price in that month; Low: lowest price in that month; Close: price at the end of the month).

*Datatype of columns:* Date is object type; Open, High, Low, and Close are all float64, representing monthly stock prices.

There are no missing or duplicate values, so the dataset is already clean and ready for modeling after minor preprocessing.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

All four are continuous numerical features, and the high standard deviation shows significant volatility in Yes Bank’s stock price over time.

1. Open - Opening stock price for the month. Ranges from ₹10 to ₹369.95. Median (50%) is ₹62.98.

2. High - Highest stock price during the month. Goes up to ₹404. Indicates peak market value.

3. Low - Lowest stock price in the month. Shows market downside, ranging from ₹5.55 to ₹345.50.

4. Close - Final stock price at month end. Median is ₹62.54, max is ₹367.90.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for col in df.columns:
    print(f"{col} has {df[col].nunique()} unique values")

Date has 185 unique values, confirming that each row represents a unique month.

Other variables have high uniqueness (close to the total number of rows), so they are not categorical; they are all numerical and continuous.

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# converting the type of col 'Date' to datetime64[ns] format
df['Date'] = pd.to_datetime(df['Date'], format='%b-%y')

In [None]:
# checking the updated datatype of col 'Date'
df['Date'].dtypes

In [None]:
# sorting the df chronologically (if not sorted earlier)
df = df.sort_values('Date').reset_index(drop=True)

In [None]:
print(f"Range of Date is : {df['Date'].min()} , {df['Date'].max()}")

In [None]:
# adding new col 'Year' 'Month' for better understanding
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month

In [None]:
# checking whether new columns are added successfully or not
df.head(2)

In [None]:
# as we added new cols 'Year' and 'Month' we can drop the col 'Date'
df.drop('Date',axis=1,inplace=True)

In [None]:
# checking
df.head(2)

### What all manipulations have you done and insights you found?

The Date column was converted to datetime, the rows were sorted chronologically, Date was split into Year and Month, and the original column was dropped.

These manipulations made the dataset clean, chronologically ordered, and ready for time-series or seasonal analysis, such as checking how the closing price varies across years or months.
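
The wrangling steps above can be sketched on a tiny frame. The three rows below are made up for illustration (they are not taken from the dataset); they only show how `Mon-YY` strings parse with `format='%b-%y'` and how sorting puts them in chronological order before Year and Month are derived.

```python
import pandas as pd

# Hypothetical rows in the same 'Mon-YY' format as the Date column
sample = pd.DataFrame({
    "Date": ["Sep-05", "Jul-05", "Aug-05"],
    "Close": [13.0, 12.0, 14.0],
})

# Parse the strings, sort chronologically, then derive Year and Month
sample["Date"] = pd.to_datetime(sample["Date"], format="%b-%y")
sample = sample.sort_values("Date").reset_index(drop=True)
sample["Year"] = sample["Date"].dt.year
sample["Month"] = sample["Date"].dt.month

print(sample)
```

Sorting on the parsed datetime rather than on the raw string matters here, because alphabetical order of month names is not chronological.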

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

Scatter Plot

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(15, 5))

# Relation between 'Close' and 'Open'
plt.subplot(2, 3, 1)
sns.scatterplot(x='Open', y='Close', data=df)
plt.title('Open vs Close')

# Relation between 'Close' and 'Low'
plt.subplot(2, 3, 2)
sns.scatterplot(x='Low', y='Close', data=df)
plt.title('Low vs Close')

# Relation between 'Close' and 'High'
plt.subplot(2, 3, 3)
sns.scatterplot(x='High', y='Close', data=df)
plt.title('High vs Close')

# Relation between 'Close' and 'Month'
plt.subplot(2, 3, 4)
sns.scatterplot(x='Month', y='Close', data=df)
plt.title('Month vs Close')

# Relation between 'Close' and 'Year'
plt.subplot(2, 3, 5)
sns.scatterplot(x='Year', y='Close', data=df)
plt.title('Year vs Close')

plt.tight_layout()
plt.show()

##### Questions

###### 1. Why did you pick the specific chart?


*  It helps us visually inspect whether a linear relationship exists between variables.
*  It helps check whether the data is evenly spread, clustered, or skewed, which can affect the model’s accuracy.
*  It shows whether there are any outliers, which might need special handling or removal.


###### 2. What is/are the insight(s) found from the chart?

* Input features Open, Low, and High show a strong positive linear relationship with the Close price.
* Input features Month and Year do not show any linear relationship with the Close price.
* The plots do not show major outliers or abnormal patterns.

###### 3. Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.

Yes, the insights support a positive business impact. The visualizations clearly show that the variables Open, High, and Low have a strong positive linear relationship with the Close price. This suggests that these features are reliable predictors for building a regression model to forecast stock closing prices. Month and Year show a weak relationship, which may reduce model accuracy if not handled properly.
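
The visual impression of a linear relationship can be backed by a number. The notebook imports `pearsonr` from `scipy.stats`, so a minimal sketch with it follows. The arrays are synthetic stand-ins generated here (assumptions, not the Yes Bank data); in the notebook the same call would take `df['Open']` and `df['Close']`.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

# Synthetic stand-ins: a close price that tracks the open price, and an unrelated month
open_price = rng.uniform(10, 370, size=100)
close_price = open_price + rng.normal(0, 10, size=100)
month = rng.integers(1, 13, size=100)

# pearsonr returns the correlation coefficient and a two-sided p-value
r_open, p_open = pearsonr(open_price, close_price)
r_month, p_month = pearsonr(month, close_price)

print(f"Open vs Close:  r = {r_open:.3f}")
print(f"Month vs Close: r = {r_month:.3f}")
```

A coefficient near 1 matches a tight upward-sloping scatter, while a coefficient near 0 matches the flat clouds seen for Month and Year.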


#### Chart - 2

Line Plot

In [None]:
# Calculating the yearly average Close and Open price
yearly_avg = df.groupby('Year')[['Close','Open']].mean().reset_index()

plt.figure(figsize=(10, 5))
sns.lineplot(data=yearly_avg, x='Year', y='Close', marker='o', color='blue',label='closing price')
sns.lineplot(data=yearly_avg, x='Year', y='Open', marker='o', color='green',label='opening price')
plt.title('Yearly trend')
plt.xlabel('Year')
plt.ylabel('Avg Price')
plt.grid(True)
plt.legend()
plt.tight_layout()
plt.show()

##### Questions

###### 1. Why did you pick the specific chart?


*  To observe the yearly trend for open price and close price



###### 2. What is/are the insight(s) found from the chart?

answer

###### 3. Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.

answer

#### Chart - 3

Box plot

In [None]:
for i, col in enumerate(df.columns[:4]):
    plt.subplot(2, 2, i + 1)
    sns.boxplot(x=df[col], color='skyblue')
    plt.title(f'Box Plot of {col}')

plt.tight_layout()
plt.show()


##### Questions

###### 1. Why did you pick the specific chart?


*  To visualize the outliers



###### 2. What is/are the insight(s) found from the chart?

answer

###### 3. Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.

answer

#### Chart - 4

Heatmap

In [None]:
# visualization code
plt.figure(figsize=(6, 4))
sns.heatmap(df[['Open', 'High', 'Low', 'Month','Year' ,'Close']].corr(), annot=True, cmap='coolwarm')
plt.title("Feature Correlation Heatmap")
plt.show()

###### 1. Why did you pick the specific chart?


* It clearly visualizes the strength and direction of correlation between variables.
* Helps to identify highly correlated features and avoid multicollinearity in regression.


###### 2. What is/are the insight(s) found from the chart?



* Close has a very strong positive correlation with Open (0.98), High (0.99), and Low (1.0).
* Month and Year show very weak or negative correlation with Close.



###### 3. Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.

Yes, selecting features like Open, High, and Low will enhance prediction accuracy for closing prices.
However, since these features are strongly inter-correlated, they may introduce multicollinearity, which can affect model stability.
Using Month and Year directly may not add value and might dilute model effectiveness.
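
Multicollinearity can be quantified with the variance inflation factor (VIF), defined for feature j as 1 / (1 - R²), where R² comes from regressing feature j on the other features. The sketch below is an illustration on synthetic columns, not a result for this dataset; the helper name `vif` and the stand-in columns are assumptions introduced here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def vif(X):
    """Return the variance inflation factor of each column of a 2-D array."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        # R^2 of predicting column j from all remaining columns
        r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

rng = np.random.default_rng(0)
base = rng.uniform(10, 370, size=200)
X_demo = np.column_stack([
    base,                               # stand-in for Open
    base + rng.normal(0, 5, size=200),  # stand-in for High, nearly collinear with Open
    rng.normal(0, 1, size=200),         # unrelated feature
])

vif_values = vif(X_demo)
print(vif_values)
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign of strong collinearity; the two price-like columns land far above that, while the unrelated column stays close to 1.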

#### Chart - 5

Pair Plot

In [None]:
# visualization code
selected_features = ['Open','Low','High','Year','Month','Close']

# pairplot creates its own figure, so a separate plt.figure() call is not needed
sns.pairplot(df, vars=selected_features)
plt.show()

##### Questions


###### 1. Why did you pick the specific chart?


* It helps visualize pairwise relationships between multiple variables in one view.
* Shows linear trends, correlations, and cluster patterns.
* Useful to check multicollinearity among predictors before model building.



###### 2. What is/are the insight(s) found from the chart?


* Strong positive linear relationship between Open, Low, High, and Close
* Month and Year show no clear trend with Close.
* No major outliers or noisy patterns detected among selected features.


###### 3. Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.

Yes, insights support selecting strong predictors (Open, High, Low) to build a reliable model for forecasting closing prices, which helps in better investment planning. Weak patterns from Year/Month may reduce accuracy if used without proper encoding, but don’t directly indicate negative growth.

## ***5. Hypothesis Testing***

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.


Null Hypothesis (H₀): The mean of Open prices = mean of Close prices

Alternative Hypothesis (H₁): The mean of Open prices ≠ mean of Close prices

#### 2. Perform an appropriate statistical test.

In [None]:
from scipy.stats import ttest_rel

# Paired sample t-test
t_stat, p_value = ttest_rel(df['Open'], df['Close'])

print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")

# Significance level
alpha = 0.05

# Conclusion
if p_value < alpha:
    print("Reject H0: There is a significant difference between Open and Close price means.")
else:
    print("Fail to reject H0: No significant difference between Open and Close price means.")


##### Questions

###### Which statistical test have you done to obtain P-Value?

paired t-test

###### Why did you choose the specific statistical test?

Because Open and Close prices are related for each time point, the paired t-test is appropriate. It checks whether the average difference between Open and Close is statistically significant.
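
To make "average difference" concrete: a paired t-test is the same computation as a one-sample t-test on the per-row differences against zero. The short sketch below checks this on synthetic arrays (assumed values, not the notebook's data).

```python
import numpy as np
from scipy.stats import ttest_1samp, ttest_rel

rng = np.random.default_rng(0)

# Synthetic paired observations standing in for Open and Close
open_demo = rng.uniform(10, 370, size=60)
close_demo = open_demo + rng.normal(0, 15, size=60)

# Paired t-test on the two columns
t_paired, p_paired = ttest_rel(open_demo, close_demo)

# One-sample t-test on the row-wise differences against a mean of 0
t_diff, p_diff = ttest_1samp(open_demo - close_demo, 0.0)

print(t_paired, t_diff)
print(p_paired, p_diff)
```

Both calls give the same statistic and p-value, which is why the pairing of rows (same month) matters for this test.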

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H0): There is no significant difference in the mean closing price before and after the 2018 fraud case.

Alternative Hypothesis (H1): There is a significant difference in the mean closing price before and after the 2018 fraud case.

#### 2. Perform an appropriate statistical test.

In [None]:
from scipy.stats import ttest_ind

# split the data into before and after 2018
before_2018 = df[df['Year'] < 2018]['Close']
after_2018 = df[df['Year'] >= 2018]['Close']

# performing independent t-test (Welch's variant, since equal_var=False)
t_stat, p_value = ttest_ind(before_2018, after_2018, equal_var=False)

print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")

# significance level
alpha = 0.05

if p_value < alpha:
    print("Reject H0: Significant difference in mean closing prices before and after 2018.")
else:
    print("Fail to reject H0: No significant difference in mean closing prices before and after 2018.")

##### Question

###### Which statistical test have you done to obtain P-Value?

Independent two-sample t-test (Welch's variant, since `equal_var=False` is passed)

###### Why did you choose the specific statistical test?

The goal is to compare the mean closing price before and after the 2018 fraud case.
The Close price is continuous numeric data, and we are comparing means across two different time periods, before and after 2018. Hence, an independent two-sample t-test is an appropriate choice for this comparison, and `equal_var=False` avoids assuming that both periods have the same variance.
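
With `equal_var=False`, `ttest_ind` computes Welch's statistic: the difference in sample means divided by the square root of the sum of each group's sample variance over its size. A minimal sketch comparing a manual computation with SciPy follows; the two groups are synthetic and their sizes and spreads are assumptions chosen only to be unequal.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Synthetic groups with different sizes and variances
before = rng.normal(100, 40, size=150)
after = rng.normal(60, 80, size=30)

# SciPy's Welch t-test
t_scipy, p_scipy = ttest_ind(before, after, equal_var=False)

# Manual Welch statistic using unbiased sample variances (ddof=1)
standard_error = np.sqrt(before.var(ddof=1) / before.size + after.var(ddof=1) / after.size)
t_manual = (before.mean() - after.mean()) / standard_error

print(t_scipy, t_manual)
```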

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H0): The average monthly price range (High - Low) is less than or equal to ₹5.

Alternative Hypothesis (H1): The average monthly price range (High - Low) is greater than ₹5.

#### 2. Perform an appropriate statistical test.

In [None]:
from scipy.stats import ttest_1samp

# Calculate monthly price range (each row is one month)
df['Price_Range'] = df['High'] - df['Low']

# Perform one-sample t-test (test against value = 5)
t_stat, p_value = ttest_1samp(df['Price_Range'], 5)

print(f"T-statistic: {t_stat}")
print(f"P-value (two-tailed): {p_value}")

# Convert to one-tailed p-value for H1: mean > 5
p_value_one_tailed = p_value / 2

# Significance level
alpha = 0.05

# Conclusion
if t_stat > 0 and p_value_one_tailed < alpha:
    print("Reject H0: The average monthly price range is significantly greater than 5.")
else:
    print("Fail to reject H0: No significant evidence that average range exceeds 5.")

##### Which statistical test have you done to obtain P-Value?

One-sample, one-tailed t-test.

##### Why did you choose the specific statistical test?

Because we are testing a sample mean (High - Low) against a specific threshold value (here ₹5).
Since we are only interested in whether the mean is greater than ₹5, a one-tailed one-sample t-test fits best.
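
As an alternative to halving the two-tailed p-value by hand, `ttest_1samp` accepts an `alternative` argument (available in SciPy 1.6 and later). The sketch below, on synthetic price-range values that are assumptions made up here, shows that the two routes agree when the t-statistic is positive.

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(0)

# Synthetic, strictly positive stand-in for the High - Low column
price_range = rng.gamma(shape=2.0, scale=10.0, size=100)

# Two-tailed test, then the manual one-tailed conversion
t_two, p_two = ttest_1samp(price_range, 5)
p_halved = p_two / 2

# Direct one-tailed test for H1: mean > 5
t_one, p_one = ttest_1samp(price_range, 5, alternative="greater")

print(p_halved, p_one)
```

The direct form also handles a negative t-statistic correctly, which the manual halving only does with the extra `t_stat > 0` check.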


## ***6. Feature Engineering & Data Pre-processing***

No need to handle missing values

### 1. Handling Outliers

In [None]:
# Making copy of dataset
df1 = df.copy()

# Define a function to remove outliers using IQR
def remove_outliers_iqr(df, columns):
    for col in columns:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR

        # Filter out rows where values are outside the IQR bounds
        df = df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]

    return df

# List of numeric columns to check for outliers
cols_to_check = ['Open', 'High', 'Low', 'Close']

# Apply the function
df = remove_outliers_iqr(df, cols_to_check)

df.shape

##### What all outlier treatment techniques have you used and why did you use those techniques?

IQR (Interquartile Range) Method - It detects outliers as values below Q1 - 1.5×IQR or above Q3 + 1.5×IQR.

Why Used: This method is robust to non-normal distributions because it relies on quartiles rather than the mean and standard deviation. Data points more than 1.5 × IQR below the first quartile (Q1) or above the third quartile (Q3) are considered outliers.
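
A small worked example of the IQR rule, using ten made-up values (not from the dataset) so the bounds can be checked by hand:

```python
import pandas as pd

# Ten illustrative values with one obvious extreme
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

q1 = values.quantile(0.25)   # 3.25
q3 = values.quantile(0.75)   # 7.75
iqr = q3 - q1                # 4.5

lower_bound = q1 - 1.5 * iqr  # -3.5
upper_bound = q3 + 1.5 * iqr  # 14.5

# Values outside the bounds are flagged as outliers
outliers = values[(values < lower_bound) | (values > upper_bound)]
print(outliers.tolist())      # [100]
```

Note that in a time-ordered dataset, dropping flagged rows leaves gaps between months, which is worth keeping in mind when lag features are built afterwards.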

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Closing price from the previous row (the previous month in this monthly dataset)
df['Close_Lag_1'] = df['Close'].shift(1)

# Monthly Open-Close difference (the column keeps its original 'Daily_' name)
df['Daily_Open_Close_Diff'] = df['Close'] - df['Open']

df.head()

In [None]:
# Close_Lag_1 will have NaN in the first row since there's no previous month.
df.dropna(inplace=True)
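
A minimal sketch of how `shift(1)` builds the lag feature, on four made-up closing prices (illustrative values only):

```python
import pandas as pd

# Four illustrative closing prices in chronological order
demo = pd.DataFrame({"Close": [10.0, 12.0, 11.0, 15.0]})

# Each row receives the previous row's close; the first row has no predecessor
demo["Close_Lag_1"] = demo["Close"].shift(1)
first_is_nan = bool(demo["Close_Lag_1"].isna().iloc[0])

# Drop the row with the missing lag value
demo = demo.dropna().reset_index(drop=True)
print(demo)
```

`shift(1)` works on row position, not on dates, so the lag equals the previous month only when no months are missing between consecutive rows.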

#### 2. Feature Selection

In [None]:
# Select relevant features
features = ['Open', 'High', 'Low', 'Close_Lag_1', 'Price_Range', 'Daily_Open_Close_Diff']

# Calculate correlation with 'Close'
correlations = df[features + ['Close']].corr()['Close'].drop('Close')

print(correlations)

##### What all feature selection methods have you used  and why?

To identify which of these features might be most useful for predicting 'Close' price, I've calculated the Pearson correlation coefficient between each feature and the 'Close' price. Features with higher absolute correlation values are generally considered more relevant.

##### Which all features you found important and why?

**Low**, **High**, **Close_Lag_1**, and **Open** show extremely high positive correlations with the 'Close' price. This indicates they are very strong indicators, which is typical for stock data where prices are highly correlated from one period to the next.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

Yes, data transformation is needed because:


* Different scales: Features like Open, High, and Low are on large numeric scales. Scale differences can bias scale-sensitive models such as KNN and SVM, and they make linear regression coefficients harder to compare.
* Skewness: In the histograms, the variables appear right-skewed. Skewed data can violate model assumptions (especially for linear regression) and lead to poor performance.
* Improved model convergence: Models trained with gradient descent converge faster with normalized or standardized inputs.
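
Skewness can also be measured rather than only read off a histogram. The sketch below uses `Series.skew()` before and after `np.log1p` on a synthetic right-skewed series; the lognormal parameters are assumptions chosen only to produce a long right tail.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed stand-in for a price column
prices = pd.Series(rng.lognormal(mean=4.0, sigma=0.8, size=500))

# Sample skewness: values well above 0 indicate a long right tail
skew_raw = prices.skew()
skew_log = np.log1p(prices).skew()

print(f"skew before log1p: {skew_raw:.2f}")
print(f"skew after log1p:  {skew_log:.2f}")
```

In the notebook, the same two lines could be applied to each column in turn to decide which features benefit from the log transform.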



In [None]:
# Check the skewness of newly added features
for i, col in enumerate(df.columns[6:]):
    plt.subplot(2, 3, i + 1)
    sns.histplot(df[col], kde=True)
    plt.title(f'{col}')

plt.tight_layout()
plt.show()

In [None]:
# Log Transform - needed when data is skewed
df['Open_log'] = np.log1p(df['Open'])
df['High_log'] = np.log1p(df['High'])
df['Low_log'] = np.log1p(df['Low'])
df['Price_Range_log'] = np.log1p(df['Price_Range'])
df['Close_Lag_1_log'] = np.log1p(df['Close_Lag_1'])

df.head()

### 6. Data Scaling

In [None]:
# Data Scaling
from sklearn.preprocessing import StandardScaler

# Use log-transformed features
features_to_scale = ['Open_log','High_log','Low_log','Price_Range_log','Close_Lag_1_log']

# standardization of log-transformed features
scaler = StandardScaler()
df = df.copy()
df[features_to_scale] = scaler.fit_transform(df[features_to_scale])

# Optional: drop original columns if no longer needed
# df.drop(['Open', 'High', 'Low', 'Price_Range', 'Close_Lag_1', 'Daily_Open_Close_Diff'], axis=1, inplace=True)

# Check the result
df.head()

##### Which method have you used to scale you data and why?

StandardScaler is used here. It transforms each feature to have mean = 0 and standard deviation = 1, which puts the features on a common scale and makes the linear regression coefficients directly comparable.
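
One caveat with fitting a scaler on the full dataset is that rows later used for testing influence the scaling statistics. A possible alternative, sketched here on synthetic data (the arrays, weights, and the 96/24 split are assumptions for illustration), is to wrap `StandardScaler` and the model in a scikit-learn pipeline so the scaler is fitted on the training rows only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features and target
X_demo = rng.uniform(10, 370, size=(120, 3))
y_demo = X_demo @ np.array([0.2, 0.5, 0.3]) + rng.normal(0, 5, size=120)

# Keep the first 96 rows for training and the last 24 for testing
X_tr, X_te = X_demo[:96], X_demo[96:]
y_tr, y_te = y_demo[:96], y_demo[96:]

# The pipeline fits the scaler on the training rows, then reuses it at predict time
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)

r2_demo = pipe.score(X_te, y_te)
scaler_mean = pipe.named_steps["standardscaler"].mean_
print(r2_demo)
```

The stored `mean_` equals the training-set column means, which confirms that the test rows played no part in the scaling.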


### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

No, dimensionality reduction is not required for current setup since the number of features is small and each has clear predictive value.

### 8. Data Splitting

In [None]:
from sklearn.model_selection import train_test_split

X = df[['Open_log', 'High_log', 'Low_log', 'Close_Lag_1_log']]
y = df['Close']

# split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)

##### What data splitting ratio have you used and why?

 80:20 ratio - 80% train, 20% test
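
Because the rows are ordered in time, a random split lets the model train on months that come after some of the test months. One alternative, shown as a minimal sketch on a toy array (an assumption for illustration, not the notebook's data), is to keep the same 80:20 ratio but pass `shuffle=False` so the test set is the most recent 20% of rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data where the value doubles as the time index
X_demo = np.arange(20).reshape(-1, 1)
y_demo = np.arange(20)

# shuffle=False keeps row order: the first 80% train, the last 20% test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, shuffle=False
)

print(X_tr.ravel())
print(X_te.ravel())
```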

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

**No**, dataset is not imbalanced as there are no classes or labels to balance.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# Required Libraries
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Initialize and train the Linear Regression model
model = LinearRegression()

# Fit the Algorithm
model.fit(X_train, y_train)

# Predict on the model
y_pred = model.predict(X_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

Linear Regression is a supervised learning algorithm used to model the relationship between a dependent variable (target: Close) and one or more independent variables (Open_log, High_log, Low_log, etc.).

In [None]:
# Visualizing evaluation Metric Score chart

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

# Define metric names and scores
metrics = ['R² Score', 'MSE', 'RMSE']
scores = [r2, mse, rmse]

# Plot the bar chart
plt.figure(figsize=(5, 5))
sns.barplot(x=metrics, y=scores)

# Add score labels on bars
for i, score in enumerate(scores):
    plt.text(i, score + 2, f"{score:.2f}", ha='center', fontsize=12)

plt.title('Regression Model Evaluation Metrics', fontsize=14)
plt.ylabel('Score')
plt.xlabel('Metrics')
plt.tight_layout()
plt.show()
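
For reference, the three metrics in the chart can be computed by hand. The sketch below uses four made-up actual and predicted values (not model output) and checks the manual formulas against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Illustrative actual and predicted values
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_hat = np.array([12.0, 18.0, 33.0, 39.0])

# MSE: mean of squared errors; RMSE: its square root, in the target's units
mse_manual = np.mean((y_true - y_hat) ** 2)   # (4 + 4 + 9 + 1) / 4 = 4.5
rmse_manual = np.sqrt(mse_manual)

# R^2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)          # 18
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 500
r2_manual = 1 - ss_res / ss_tot                 # 0.964

print(mse_manual, rmse_manual, r2_manual)
```

R² is unitless and at most 1, while MSE and RMSE are in squared and plain price units respectively, so the three bars are not on a comparable scale.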

#### 2. Cross-Validation

In [None]:
from sklearn.model_selection import cross_val_score

model = LinearRegression()

# Use negative MSE because scikit-learn expects higher score = better
mse_scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
rmse_scores = np.sqrt(-mse_scores)

print("Cross-Validation RMSE Scores:", rmse_scores)
print("Average RMSE:", rmse_scores.mean())

##### Which hyperparameter optimization technique have you used and why?

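No hyperparameter optimization was applied here: `LinearRegression` was used with its default settings. Instead, 5-fold cross-validation was used to check how well the model generalizes to unseen data by splitting the dataset into multiple training/testing folds.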
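
For time-ordered data, one option worth considering is `TimeSeriesSplit`, where every fold trains on earlier rows and validates on later ones. The sketch below runs it on a synthetic random-walk feature; the data and the choice of five splits are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(0)

# Synthetic random-walk feature and a target that follows it
X_demo = np.cumsum(rng.normal(0, 1, size=(100, 1)), axis=0) + 50
y_demo = X_demo[:, 0] + rng.normal(0, 0.5, size=100)

# Each split trains on an initial segment and tests on the segment that follows it
tscv = TimeSeriesSplit(n_splits=5)
neg_mse = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=tscv, scoring="neg_mean_squared_error"
)
rmse_folds = np.sqrt(-neg_mse)

# Confirm that training indices always precede test indices
folds_ordered = all(train.max() < test.min() for train, test in tscv.split(X_demo))

print(rmse_folds)
print("Average RMSE:", rmse_folds.mean())
```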

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
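
A minimal sketch of the save-and-reload round trip with `joblib`, assuming a fitted scikit-learn estimator. The model here is trained on synthetic data, and the file name `demo_model.joblib` is a hypothetical placeholder; a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data and a fitted stand-in model
X_demo = rng.uniform(0, 1, size=(50, 2))
y_demo = X_demo @ np.array([3.0, -2.0]) + 1.0
fitted = LinearRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib")  # hypothetical file name
    joblib.dump(fitted, path)      # save the fitted model
    restored = joblib.load(path)   # load it back

# Sanity check: the reloaded model reproduces the original predictions
same_predictions = np.allclose(fitted.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

If preprocessing such as scaling is part of the workflow, saving the fitted preprocessing objects together with the model keeps predictions on new data consistent.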

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***