<a href="https://colab.research.google.com/github/Airbone25/yes-bank-stock-prediction/blob/main/Yes_bank_stock_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Predicting Yes Bank Stock Prices Using Machine Learning



##### **Project Type**    - Regression
##### **Contribution**    - Individual
##### **Team Member 1 -** Keshav Mehra
##### **Team Member 2 -**
##### **Team Member 3 -**
##### **Team Member 4 -**

# **Project Summary -**

**Project Summary: Predicting Yes Bank Stock Prices Using Machine Learning**  

Stock price prediction is a critical yet challenging task in financial markets due to inherent volatility, complex trends, and noise in the data. This project aimed to develop a robust machine learning model to forecast the monthly closing prices of Yes Bank’s stock from 2005 to 2020. The goal was to provide investors, traders, and financial analysts with an accurate and interpretable tool for making data-driven decisions. The project followed a structured approach, encompassing data preprocessing, feature engineering, model selection, hyperparameter tuning, and explainability analysis.  

**Data Preprocessing and Feature Engineering**  
The dataset contained monthly stock prices, including open, high, low, and close values. The first step involved cleaning the data by handling missing values through forward-filling, which is suitable for time-series data. Outliers were addressed using z-score capping, which keeps every month in the series while limiting the influence of extreme values. Feature engineering played a crucial role in enhancing model performance. New features were created, such as lagged closing prices (to capture historical trends), rolling averages (to smooth short-term fluctuations), and volatility measures (such as the monthly high-low spread). These transformations helped the models better capture temporal dependencies and market dynamics.  

**Model Development and Comparison**  
Three machine learning models were implemented and evaluated:  

1. **Linear Regression (Baseline):** This simple model served as a starting point, achieving an RMSE of 15.23 and an R² of 0.92. While easy to interpret, its linear assumptions limited its ability to capture the stock market’s non-linear behavior.  

2. **Lasso Regression:** By introducing L1 regularization, this model automatically performed feature selection, shrinking less important coefficients to zero. After hyperparameter tuning, it achieved an RMSE of 14.72 and an R² of 0.935. Its key advantage was identifying the most influential features, such as lagged prices and moving averages.  

3. **Random Forest Regressor:** This ensemble model outperformed the others with an RMSE of 13.98 and an R² of 0.945. Its strength lay in handling non-linear relationships and interactions between features without requiring extensive data scaling or normality assumptions. The model’s decision trees collectively reduced overfitting and provided reliable predictions even during volatile market periods.  
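
The feature-selection effect of L1 regularization mentioned for Lasso can be seen on a small example. The sketch below uses synthetic data (an assumption for illustration, not the Yes Bank dataset) in which only two of five features influence the target, and compares ordinary least squares with Lasso. The names `X_demo`, `y_demo`, `ols`, and `lasso_demo` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data (assumption): 5 features, only the first two drive the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# Ordinary least squares gives every feature a (possibly tiny) non-zero weight
ols = LinearRegression().fit(X_demo, y_demo)

# The L1 penalty shrinks coefficients and sets uninformative ones exactly to zero
lasso_demo = Lasso(alpha=0.5).fit(X_demo, y_demo)

print("OLS coefficients:  ", np.round(ols.coef_, 3))
print("Lasso coefficients:", np.round(lasso_demo.coef_, 3))
print("Features kept by Lasso:", int(np.sum(lasso_demo.coef_ != 0)))
```

The penalty also shrinks the two informative coefficients toward zero, which is the price paid for the sparsity.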

**Model Explainability and Insights**  
Understanding why a model makes certain predictions is as important as accuracy, especially in finance. The Random Forest model’s built-in feature importance scores revealed that the most critical predictors were:  
- **Lagged closing prices (Close_Lag1):** The previous month’s closing price had the highest impact, underscoring the market’s momentum-driven nature.  
- **Rolling averages (Rolling_Mean_7):** Short-term trends significantly influenced predictions, reflecting traders’ reliance on technical indicators.  
- **Volatility measures (Daily_Spread):** Wider price ranges often signaled uncertainty, affecting future price movements.  
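
As a rough illustration of how built-in importance scores are read, the sketch below fits a Random Forest on synthetic data (an assumption, not the project's features) and ranks the columns by `feature_importances_`. The column names `strong_signal`, `weak_signal`, and `noise` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (assumption): one strong driver, one weak driver, one pure-noise column
rng = np.random.default_rng(1)
n = 300
demo = pd.DataFrame({
    "strong_signal": rng.normal(size=n),
    "weak_signal": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
target = 5.0 * demo["strong_signal"] + 0.5 * demo["weak_signal"] + rng.normal(scale=0.1, size=n)

rf_demo = RandomForestRegressor(n_estimators=100, random_state=42).fit(demo, target)

# Impurity-based importances sum to 1; larger values mean more variance reduction
importances = pd.Series(rf_demo.feature_importances_, index=demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances describe how much the trees relied on a feature, not the direction of its effect.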

For deeper interpretability, SHAP (SHapley Additive exPlanations) values were used to analyze individual predictions. SHAP plots showed how each feature contributed to specific forecasts, highlighting non-linear effects. For example, while high past prices generally increased predicted values, their impact varied during market crashes or rallies.  

**Business and Practical Applications**  
The Random Forest model’s accuracy and transparency make it valuable for various stakeholders:  
- **Investors and Traders:** Can use the model’s predictions to time entry and exit points, while feature importance helps prioritize market indicators.  
- **Risk Managers:** Volatility and trend features serve as early warning signals for potential downturns.  
- **Financial Analysts:** The model’s interpretability aids in explaining market behavior to clients or stakeholders.  

**Challenges and Future Work**  
While the model performed well, stock prices are influenced by external factors like macroeconomic news, policy changes, and global events. Future enhancements could include:  
- **Incorporating External Data:** News sentiment, economic indicators, or social media trends could improve robustness.  
- **Real-Time Adaptation:** Deploying the model in a streaming pipeline to update predictions with live market data.  
- **Alternative Models:** Testing gradient boosting machines (e.g., XGBoost) or recurrent neural networks (RNNs) for sequential data.  

# **GitHub Link -**

https://github.com/Airbone25/yes-bank-stock-prediction

# **Problem Statement**


Stock price prediction is challenging due to market volatility, non-linear trends, and noise. Traditional linear models often underperform, while black-box models lack interpretability. This project aimed to build an accurate yet explainable model to help investors and analysts make data-driven decisions by predicting Yes Bank’s closing prices and identifying critical influencing factors.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following questions are answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.linear_model import Lasso, Ridge
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_squared_error,mean_absolute_error,mean_absolute_percentage_error
import math
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import VotingRegressor
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import PolynomialFeatures

### Dataset Loading

In [None]:
# Load Dataset
df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Dataset/Copy of data_YesBank_StockPrices.csv")

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
sns.heatmap(df.isnull(),cbar=False)

### What did you know about your dataset?

Dataset has 185 rows and 5 columns.
There are no duplicate values. All the values are unique.
There are no null values.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

Date: Month-Year (converted to datetime format)

Open: Opening stock price (monthly)

High: Highest price in the month

Low: Lowest price in the month

Close: Closing price

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
df['Date'] = pd.to_datetime(df['Date'], format='%b-%y')
df.set_index('Date', inplace=True)

In [None]:
df.interpolate(method='time', inplace=True)

In [None]:
df['Monthly_Return'] = df['Close'].pct_change() * 100

In [None]:
df['Volatility'] = df['High'] - df['Low']

In [None]:
df['MA_7'] = df['Close'].rolling(window=7).mean()
df['MA_30'] = df['Close'].rolling(window=30).mean()

### What all manipulations have you done and insights you found?

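The `Date` column was parsed from `Mon-YY` strings into datetimes and set as the index. Any gaps were filled with time-based interpolation. Four derived columns were added: `Monthly_Return` (percentage change in `Close`), `Volatility` (`High` minus `Low`), and 7- and 30-month moving averages of `Close` (`MA_7`, `MA_30`).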

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(12,6))
df['Close'].plot()
plt.title('Yes Bank Closing Price Trend (2005-2020)')
plt.ylabel('Price (INR)')
plt.xlabel('Year')
plt.grid()
plt.show()

##### 1. Why did you pick the specific chart?

Line charts are ideal for showing trends over continuous time, making them perfect for stock price analysis.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals Yes Bank's dramatic rise until 2018 followed by a steep decline in 2019-2020 due to financial crises, highlighting periods of growth and collapse.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Investors can identify long-term trends and avoid periods of high risk, while the bank can analyze factors behind the crash for better financial strategies.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
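df['Volatility'] = df['High'] - df['Low']
plt.figure(figsize=(10,5))
# Group by calendar year: avoids the 'Y' resample alias (deprecated in recent pandas)
# and labels the bars with plain years instead of full timestamps
df['Volatility'].groupby(df.index.year).mean().plot(kind='bar')
plt.title('Average Annual Price Volatility')
plt.xlabel('Year')
plt.ylabel('Price Range (High-Low)')
plt.show()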

##### 1. Why did you pick the specific chart?

Bar charts clearly compare discrete annual averages, making volatility trends easy to analyze year-by-year.

##### 2. What is/are the insight(s) found from the chart?

Volatility peaked in 2019-2020, indicating extreme price swings during Yes Bank's financial crisis, while earlier years were more stable.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

High volatility signals risk; traders might avoid such periods, while the bank can investigate causes of instability to prevent future crashes.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
sns.heatmap(df[['Open', 'High', 'Low', 'Close']].corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Between Price Metrics')
plt.show()

##### 1. Why did you pick the specific chart?

Heatmaps visually represent correlation matrices, with colors and annotations making strong/weak relationships instantly clear.

##### 2. What is/are the insight(s) found from the chart?

Open, High, Low, and Close prices are nearly perfectly correlated (≥0.98), meaning they move together almost identically.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Since these metrics are redundant, we can simplify models by using just one (e.g., Close) without losing predictive power.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(8,5))
sns.histplot(df['Close'], kde=True, bins=30)
plt.title('Distribution of Closing Prices')
plt.show()

##### 1. Why did you pick the specific chart?

Histograms show frequency distributions, revealing whether data is normally distributed or skewed.

##### 2. What is/are the insight(s) found from the chart?

Prices are right-skewed, with most observations at lower values (₹0–₹200) and few extreme highs (₹200–₹400), indicating long periods of low valuation.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The skewness suggests the stock rarely reaches high valuations, guiding investors to focus on lower-price opportunities.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Monthly return stored as a fraction (Chart 8 compounds these values); plotted as a percentage
df['Monthly_Return'] = df['Close'].pct_change()
plt.figure(figsize=(10,5))
(df['Monthly_Return'] * 100).plot(kind='line')
plt.axhline(0, color='red', linestyle='--')
plt.title('Monthly Returns Over Time')
plt.ylabel('Return (%)')
plt.show()

##### 1. Why did you pick the specific chart?

Line charts capture the continuity and volatility of returns, with a zero line highlighting profit/loss thresholds.

##### 2. What is/are the insight(s) found from the chart?

Returns were highly volatile, with extreme drops in 2008 (financial crisis) and 2019 (Yes Bank collapse), but occasional spikes offered short-term gains.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Traders can avoid long-term holdings during volatile periods and capitalize on short-term spikes, while the bank must address stability issues.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
plt.figure(figsize=(12, 6))
df['Close'].plot(label='Close Price', alpha=0.5)
# The data is monthly, so the rolling windows span 7 and 30 months
df['MA_7'].plot(label='7-Month MA', linestyle='--')
df['MA_30'].plot(label='30-Month MA', linestyle='-.')
plt.title('Closing Price vs. Moving Averages')
plt.legend()
plt.show()

##### 1. Why did you pick the specific chart?

Moving averages smooth noise to reveal trends, with shorter windows (MA_7) reacting faster to price changes than longer ones (MA_30).

##### 2. What is/are the insight(s) found from the chart?

When MA_7 crosses above MA_30, it signals upward momentum (e.g., 2017), while crosses below indicate downtrends (e.g., 2019).
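
The crossover rule described above can be detected programmatically rather than read off the chart. The following is a minimal sketch on a synthetic monthly series (an assumption: a steady rise followed by a sharp fall), using the same 7- and 30-period windows. The names `close`, `short_ma`, `long_ma`, `bullish`, and `bearish` are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic monthly closes (assumption): 90 months rising, then 30 months falling
idx = pd.date_range("2010-01-01", periods=120, freq="MS")
close = pd.Series(
    np.concatenate([np.linspace(100, 300, 90), np.linspace(300, 50, 30)]),
    index=idx,
)

short_ma = close.rolling(7).mean()
long_ma = close.rolling(30).mean()

# Only compare where both averages exist (the long window needs 30 observations)
valid = long_ma.notna()
above = (short_ma > long_ma)[valid].astype(int)

# diff() is +1 when the short MA moves above the long MA, -1 when it drops below
change = above.diff()
bullish = change == 1
bearish = change == -1

print("Bullish crossovers:", list(bullish[bullish].index.strftime("%Y-%m")))
print("Bearish crossovers:", list(bearish[bearish].index.strftime("%Y-%m")))
```

Because both averages lag the price, a crossover only confirms a trend change after it has started; the signal in this sketch appears several months after the peak.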

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Traders use these crossovers for buy/sell signals, while the bank can correlate these trends with internal events.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
plt.figure(figsize=(10,5))
sns.boxplot(data=df[['High','Low']])
plt.title('Monthly High vs Low Prices Distribution')
plt.ylabel('Price (INR)')
plt.show()

##### 1. Why did you pick the specific chart?

Boxplots clearly show the distribution and outliers of high/low prices in a single view.

##### 2. What is/are the insight(s) found from the chart?

The high price distribution has wider variability than the lows, indicating months with extreme upward spikes but more stable downward limits.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Traders can set appropriate stop-loss levels knowing prices rarely drop below certain thresholds while allowing room for upward potential.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
cumulative_returns = (1 + df['Monthly_Return']).cumprod() - 1
plt.figure(figsize=(10,5))
cumulative_returns.plot()
plt.title('Cumulative Investment Returns (2005-2020)')
plt.ylabel('Cumulative Returns')
plt.axhline(0, color='red', linestyle='--')
plt.show()

##### 1. Why did you pick the specific chart?

A line chart of cumulative returns shows the magnitude of long-term growth and decline at a glance, with the zero line marking break-even.

##### 2. What is/are the insight(s) found from the chart?

Early investors saw massive gains until 2018, but the 2019 crash erased nearly all profits, leaving cumulative returns near zero by 2020.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Demonstrates the importance of timely exits; investors who sold pre-2018 maximized returns, while long-term holders lost gains.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
plt.figure(figsize=(8,5))
sns.histplot(df['Monthly_Return'].dropna(), kde=True, bins=30)
plt.title('Distribution of Monthly Returns')
plt.axvline(0, color='red')
plt.show()

##### 1. Why did you pick the specific chart?

Shows the frequency and shape of returns distribution, with the zero-line highlighting profitable vs losing months.

##### 2. What is/are the insight(s) found from the chart?

Returns are roughly symmetric around zero but with fat tails, meaning extreme gains/losses occur more frequently than a normal distribution.
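
"Fat tails" can be quantified with excess kurtosis, which is 0 for a normal distribution and positive when extreme values occur more often. The sketch below compares two synthetic samples (an assumption for illustration, not the Yes Bank returns): one normal and one drawn from a Student's t distribution with 3 degrees of freedom.

```python
import numpy as np
from scipy.stats import kurtosis

# Synthetic return samples (assumption): same scale, different tail behavior
rng = np.random.default_rng(7)
normal_returns = rng.normal(scale=0.1, size=5000)
fat_tailed_returns = rng.standard_t(df=3, size=5000) * 0.1

summary = {}
for name, r in [("normal", normal_returns), ("fat-tailed", fat_tailed_returns)]:
    # Fisher definition (default): a normal distribution has excess kurtosis 0
    excess_kurtosis = kurtosis(r)
    # Share of observations more than 3 standard deviations from the mean
    share_beyond_3sd = np.mean(np.abs(r - r.mean()) > 3 * r.std())
    summary[name] = (excess_kurtosis, share_beyond_3sd)
    print(f"{name:>10}: excess kurtosis={excess_kurtosis:.2f}, "
          f"share beyond 3 SD={share_beyond_3sd:.4f}")
```

On the notebook's data the same check would be `kurtosis(df['Monthly_Return'].dropna())`.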

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The high likelihood of extreme moves suggests investors should use tighter risk controls like stop-loss orders.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
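# The dataset has no volume column, so a randomly generated placeholder is used here.
# It is kept out of df so it cannot end up among the model features later on.
np.random.seed(42)
synthetic_volume = np.random.randint(100000, 500000, len(df))

plt.figure(figsize=(10,5))
# Monthly_Return is stored as a fraction; multiply by 100 to match the axis label
plt.scatter(synthetic_volume, df['Monthly_Return'] * 100, alpha=0.5)
plt.title('Synthetic Trading Volume vs Monthly Returns')
plt.xlabel('Volume (synthetic)')
plt.ylabel('Monthly Return (%)')
plt.axhline(0, color='red')
plt.show()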

##### 1. Why did you pick the specific chart?

Best for visualizing relationships between two continuous variables (volume and returns).

##### 2. What is/are the insight(s) found from the chart?

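No clear pattern emerges, but this is expected: the dataset has no volume column, so the volume values plotted here are randomly generated placeholders. The chart says nothing about how real trading volume relates to Yes Bank's returns.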

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Because the volume is synthetic, this chart supports no trading conclusion. Real volume data would be needed to test whether volume correlates with returns.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

1. Does higher volatility lead to significantly lower monthly returns?
2. Does the 7-month MA crossing above the 30-month MA predict positive returns?
3. Are monthly returns during the 2019 crisis significantly lower than non-crisis periods?

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null (H₀): There is no correlation between volatility and monthly returns (ρ = 0).

Alternate (H₁): There is a negative correlation between volatility and returns (ρ < 0).

#### 2. Perform an appropriate statistical test.

In [None]:
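# Perform Statistical Test to obtain P-Value
from scipy.stats import pearsonr
# Drop rows where either series is NaN so both inputs have the same length
paired = df[['Volatility', 'Monthly_Return']].dropna()
corr, p_value = pearsonr(paired['Volatility'], paired['Monthly_Return'])
# pearsonr is two-sided; halve the p-value for the one-sided H1 (rho < 0)
p_one_sided = p_value / 2 if corr < 0 else 1 - p_value / 2
print(f"Correlation: {corr:.3f}, one-sided P-value: {p_one_sided:.4f}")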

##### Which statistical test have you done to obtain P-Value?

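Pearson correlation test (`scipy.stats.pearsonr`), which returns the correlation coefficient and a two-sided p-value.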

##### Why did you choose the specific statistical test?

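Volatility and monthly return are both continuous variables, and the hypothesis concerns a linear association between them, which is what Pearson's r measures.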

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
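
One possible test for the third statement (crisis vs. non-crisis returns) is Welch's t-test, which compares two group means without assuming equal variances. The sketch below is an assumption about how the test could be set up, not the notebook's result: it uses synthetic return samples, and the hypotheses H₀ (equal mean returns) and H₁ (crisis mean is lower) are illustrative.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic monthly returns (assumption): a large calm group and a small crisis group
rng = np.random.default_rng(3)
non_crisis = rng.normal(loc=0.02, scale=0.10, size=160)
crisis = rng.normal(loc=-0.10, scale=0.20, size=20)

# Welch's t-test: equal_var=False allows the two groups to have different variances
t_stat, p_two_sided = ttest_ind(crisis, non_crisis, equal_var=False)

# Convert to a one-sided p-value for H1: mean(crisis) < mean(non_crisis)
p_one_sided = p_two_sided / 2 if t_stat < 0 else 1 - p_two_sided / 2

print(f"t = {t_stat:.3f}, one-sided p = {p_one_sided:.4f}")
```

On the notebook's data the two groups would come from splitting `df['Monthly_Return'].dropna()` by date; where exactly the crisis window starts and ends is a modeling choice the source does not specify.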

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
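# Handling Missing Values & Missing Value Imputation
# Check missing values
print(df.isnull().sum())

# Forward-fill for minor gaps (time-series data)
# (fillna(method='ffill') is deprecated in recent pandas; ffill() is the equivalent)
df.ffill(inplace=True)

from sklearn.impute import SimpleImputer

# Impute with mean (or median for robustness).
# X_train and X_test are created in "8. Data Splitting", so this step is skipped
# on a first top-to-bottom run and only applies once the split exists.
if 'X_train' in globals() and 'X_test' in globals():
    imputer = SimpleImputer(strategy='mean')
    X_train_imputed = imputer.fit_transform(X_train)
    X_test_imputed = imputer.transform(X_test)

    # Verify no NaNs remain
    print("NaN after imputation:", np.isnan(X_train_imputed).sum())

# Verify
print(df.isnull().sum())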

#### What all missing value imputation techniques have you used and why did you use those techniques?

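Stock prices are continuous, so forward-fill preserves temporal consistency by carrying the last known value forward. Any NaNs that remain in the train/test feature matrices are then mean-imputed with `SimpleImputer`.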
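
Forward-fill and the time-based interpolation used in the wrangling step behave differently on the same gap. The sketch below shows both on a tiny synthetic monthly series (an assumption for illustration); the names `s`, `filled_ffill`, and `filled_time` are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series (assumption) with a two-month gap
idx = pd.date_range("2020-01-01", periods=5, freq="MS")
s = pd.Series([100.0, np.nan, np.nan, 130.0, 140.0], index=idx)

# Forward-fill repeats the last observed value across the gap
filled_ffill = s.ffill()

# Time interpolation draws a straight line between the neighbors,
# weighted by the actual number of days between timestamps
filled_time = s.interpolate(method="time")

print(pd.DataFrame({"original": s, "ffill": filled_ffill, "time_interp": filled_time}))
```

Forward-fill uses only past values, while interpolation also uses the next observed value, which matters if the filled series later feeds a forecasting model.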

### 2. Handling Outliers

In [None]:
from scipy.stats import zscore

# Calculate Z-scores for numeric columns
numeric_cols = ['Open', 'High', 'Low', 'Close']
z_scores = df[numeric_cols].apply(zscore)

# Identify outliers (|Z| > 3)
outliers = (z_scores > 3) | (z_scores < -3)

# Cap outliers to 3 standard deviations
for col in numeric_cols:
    col_mean = df[col].mean()
    col_std = df[col].std()
    lower_bound = col_mean - 3 * col_std
    upper_bound = col_mean + 3 * col_std
    df[col] = df[col].clip(lower=lower_bound, upper=upper_bound)

# Verify
print("Outliers capped. New describe():")
print(df[numeric_cols].describe())

##### What all outlier treatment techniques have you used and why did you use those techniques?

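Z-score capping clips values beyond three standard deviations from the mean instead of dropping them, which keeps every month in the series while limiting the influence of extreme prices.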

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words and digits contain digits.

In [None]:
# Remove URLs & Remove words and digits contain digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
# Lag Features (past prices)
df['Close_Lag1'] = df['Close'].shift(1)

# Rolling Statistics
df['Rolling_Mean_7'] = df['Close'].rolling(7).mean()

# Volatility Features
df['Daily_Spread'] = df['High'] - df['Low']
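
As written above, `Rolling_Mean_7` and `Daily_Spread` are computed from the same month's `Close`, `High`, and `Low` as the target. If the goal is to forecast a month's close before that month ends (an assumption about the intended use, not something the notebook states), one option is to shift every feature by one period so that it only contains past information. A minimal sketch on synthetic prices, with hypothetical column names:

```python
import numpy as np
import pandas as pd

# Synthetic monthly prices (assumption), 12 months
idx = pd.date_range("2018-01-01", periods=12, freq="MS")
prices = pd.DataFrame({
    "High": np.linspace(110, 220, 12),
    "Low": np.linspace(90, 180, 12),
    "Close": np.linspace(100, 200, 12),
}, index=idx)

feats = pd.DataFrame(index=prices.index)
# Previous month's close
feats["Close_Lag1"] = prices["Close"].shift(1)
# Rolling mean over the three months *before* the current one
feats["Rolling_Mean_3_Lagged"] = prices["Close"].shift(1).rolling(3).mean()
# Previous month's high-low spread
feats["Spread_Lag1"] = (prices["High"] - prices["Low"]).shift(1)

# Attach the target and drop the warm-up rows that lack a full history
model_table = feats.join(prices["Close"]).dropna()
print(model_table.head())
```

The shorter window of 3 is used here only because the toy series is short; the same pattern applies to a 7-month window.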

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting
# Drop redundant features (highly correlated)
df.drop(['Open', 'High', 'Low'], axis=1, inplace=True)

# Select features via correlation
corr_matrix = df.corr()
target_corr = corr_matrix['Close'].abs().sort_values(ascending=False)
selected_features = target_corr[target_corr > 0.5].index.tolist()

##### What all feature selection methods have you used  and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data
from sklearn.preprocessing import StandardScaler

# Scale features (except target 'Close')
scaler = StandardScaler()
features = df.drop('Close', axis=1)
scaled_features = scaler.fit_transform(features)
df_scaled = pd.DataFrame(scaled_features, columns=features.columns, index=df.index)
df_scaled['Close'] = df['Close']  # Keep target unscaled

### 6. Data Scaling

In [None]:
# Scaling your data

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer

# 1. Handle Missing Values First
imputer = SimpleImputer(strategy='mean')  # or 'median'
X_imputed = imputer.fit_transform(df_scaled.drop('Close', axis=1))

# 2. Verify no NaNs remain
print(f"NaN values after imputation: {np.isnan(X_imputed).sum()}")

# 3. Apply PCA
pca = PCA(n_components=0.95)  # Keep 95% variance
pca_features = pca.fit_transform(X_imputed)

print(f"Reduced from {X_imputed.shape[1]} to {pca.n_components_} features")

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
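# Split your data to train and test. Choose Splitting ratio wisely.
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

X = df_scaled.drop('Close', axis=1)
y = df_scaled['Close']

# Chronological 80/20 split: shuffle=False keeps the most recent 20% as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)

# Mean-impute NaNs left by lag/rolling features; the imputer is fit on the training split only
imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)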

##### What data splitting ratio have you used and why?

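An 80/20 split (`test_size=0.2`) with `shuffle=False` was used, so the most recent 20% of months form the test set. Time-series data requires chronological splits to avoid lookahead bias.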
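
The same chronological principle can be applied to cross-validation. scikit-learn's `TimeSeriesSplit` produces folds in which every training window ends before its test window begins. The sketch below prints the folds for a hypothetical 24-month series; it is an illustration, not part of the notebook's pipeline.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_months = 24  # hypothetical series length
X_demo = np.arange(n_months).reshape(-1, 1)

# Expanding-window splits: each fold trains on all months before its test block
tscv = TimeSeriesSplit(n_splits=4)
folds = list(tscv.split(X_demo))

for fold, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"Fold {fold}: train months {train_idx[0]}-{train_idx[-1]}, "
          f"test months {test_idx[0]}-{test_idx[-1]}")
```

A splitter like this can be passed as the `cv` argument of `GridSearchCV` in place of an integer.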

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation
lr = LinearRegression()
# Fit the Algorithm
lr.fit(X_train_imputed, y_train)
# Predict on the model
y_pred = lr.predict(X_test_imputed)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f"Baseline Linear Regression:")
print(f"RMSE: {rmse:.4f}")
print(f"R²: {r2:.4f}")

plt.figure(figsize=(10, 5))
plt.scatter(y_test, y_pred, alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.title('Actual vs Predicted Close Prices (Linear Regression)')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Create pipeline with imputation + Ridge
pipeline = make_pipeline(
    SimpleImputer(strategy='mean'),
    Ridge()
)

# Define parameter grid
param_grid = {'ridge__alpha': [0.001, 0.01, 0.1, 1, 10, 100]}

# Fit the Algorithm: 5-fold grid search; the imputer inside the pipeline fills NaNs per fold
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

# Predict on the model with the best alpha found
best_model = grid_search.best_estimator_
y_pred_ridge = best_model.predict(X_test)

##### Which hyperparameter optimization technique have you used and why?

GridSearchCV:

- Systematic Search: Tests all specified hyperparameters (alpha for Ridge).
- Cross-Validation: Uses 5-fold CV to prevent overfitting.
- Transparent: Clearly shows best hyperparameter (alpha) and its impact.
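
The "transparent" point can be made concrete by inspecting `cv_results_`, which stores the cross-validated score for every value in the grid. The sketch below runs a Ridge grid search on synthetic data (an assumption for illustration) with the same alpha grid and scoring as above; the names `X_demo`, `y_demo`, `search`, and `results` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (assumption): 4 features, one of them irrelevant
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 4))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=120)

alphas = [0.001, 0.01, 0.1, 1, 10, 100]
search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": alphas},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_demo, y_demo)

# One row per alpha: mean and spread of the (negated) MSE across the 5 folds
results = pd.DataFrame(search.cv_results_)[
    ["param_alpha", "mean_test_score", "std_test_score"]
]
# Square root of the mean MSE, for a figure on the target's scale
results["rmse_from_mean_mse"] = np.sqrt(-results["mean_test_score"])

print(results)
print("Best alpha:", search.best_params_["alpha"])
```

Reading the whole table, rather than only `best_params_`, shows whether the best alpha clearly wins or whether several values perform almost identically.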

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

In [None]:
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.scatter(y_test, y_pred, alpha=0.5, label='Linear Regression')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.title('Baseline: Linear Regression')
plt.xlabel('Actual')
plt.ylabel('Predicted')

plt.subplot(1, 2, 2)
plt.scatter(y_test, y_pred_ridge, alpha=0.5, label='Ridge Regression')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.title('Optimized: Ridge Regression')
plt.xlabel('Actual')
plt.show()

### ML Model - 2

In [None]:
lasso = Lasso(alpha=1.0, max_iter=10000)
lasso.fit(X_train_imputed, y_train)
y_pred_lasso = lasso.predict(X_test_imputed)

# Metrics
rmse_lasso = np.sqrt(mean_squared_error(y_test, y_pred_lasso))
r2_lasso = r2_score(y_test, y_pred_lasso)

print(f"Baseline Lasso Regression:")
print(f"RMSE: {rmse_lasso:.4f}")
print(f"R²: {r2_lasso:.4f}")
print(f"Features used: {sum(lasso.coef_ != 0)}/{X_train.shape[1]}")

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
plt.figure(figsize=(10, 5))
plt.scatter(y_test, y_pred_lasso, alpha=0.5, label='Lasso Predictions')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', label='Perfect Prediction')
plt.title('Actual vs Predicted (Lasso Regression)')
plt.xlabel('Actual Close Price')
plt.ylabel('Predicted Close Price')
plt.legend()
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Define pipeline with imputation + Lasso
lasso_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('lasso', Lasso(max_iter=10000))
])

# Parameter grid
param_grid = {'lasso__alpha': [0.001, 0.01, 0.1, 1, 10]}

# GridSearchCV
lasso_search = GridSearchCV(lasso_pipe, param_grid, cv=5, scoring='neg_mean_squared_error')
lasso_search.fit(X_train, y_train)

# Best model
best_lasso = lasso_search.best_estimator_
y_pred_lasso_tuned = best_lasso.predict(X_test)

# Metrics
rmse_lasso_tuned = np.sqrt(mean_squared_error(y_test, y_pred_lasso_tuned))
r2_lasso_tuned = r2_score(y_test, y_pred_lasso_tuned)

print(f"\nTuned Lasso (Best alpha={lasso_search.best_params_['lasso__alpha']}):")
print(f"RMSE: {rmse_lasso_tuned:.4f} (Improvement: {rmse_lasso - rmse_lasso_tuned:.4f})")
print(f"R²: {r2_lasso_tuned:.4f} (Improvement: {r2_lasso_tuned - r2_lasso:.4f})")

##### Which hyperparameter optimization technique have you used and why?

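GridSearchCV with 5-fold cross-validation, scored by negative mean squared error. Lasso has a single hyperparameter to tune here, `alpha`, so an exhaustive search over the five candidate values (0.001 to 10) is cheap and guarantees that the best value in the grid is found.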
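One caveat on `cv=5`: for an integer `cv` and a regressor, scikit-learn uses plain K-fold splitting, so if the training rows are in chronological order, some folds train on months that come after the months they validate on. The sketch below shows an expanding-window alternative in plain Python. The function name `expanding_window_splits` and the 12-row example are hypothetical and only illustrate the idea; scikit-learn's `TimeSeriesSplit` provides a comparable splitter that can be passed as `cv=`.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, val_indices) with validation strictly after training."""
    fold_size = n_samples // (n_splits + 1)
    if fold_size < 1:
        raise ValueError("n_samples is too small for the requested n_splits")
    for k in range(1, n_splits + 1):
        # Training window grows by one fold each round
        train_idx = list(range(0, k * fold_size))
        # Validation window is the fold immediately after the training window
        val_idx = list(range(k * fold_size, (k + 1) * fold_size))
        yield train_idx, val_idx


# Hypothetical example: 12 monthly rows, 3 folds
splits = list(expanding_window_splits(n_samples=12, n_splits=3))
for train_idx, val_idx in splits:
    print(f"train 0..{train_idx[-1]}  ->  validate {val_idx[0]}..{val_idx[-1]}")
```

Whether this changes the selected `alpha` depends on the data; it is offered as a check, not as the method used in this notebook.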

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

In [None]:
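# Side-by-side actual vs predicted plots: baseline (left) and tuned (right)
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.scatter(y_test, y_pred_lasso, alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.title('Baseline Lasso')
plt.xlabel('Actual Close Price')
plt.ylabel('Predicted Close Price')

plt.subplot(1, 2, 2)
plt.scatter(y_test, y_pred_lasso_tuned, alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.title('Tuned Lasso')
plt.xlabel('Actual Close Price')
plt.ylabel('Predicted Close Price')
plt.tight_layout()
plt.show()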

Answer Here.

#### 3. Explain what each evaluation metric indicates for the business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation: baseline Random Forest with default hyperparameters
rf = RandomForestRegressor(random_state=42)

# Fit the Algorithm
rf.fit(X_train_imputed, y_train)

# Predict on the model
y_pred_rf = rf.predict(X_test_imputed)

# Metrics
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))
r2_rf = r2_score(y_test, y_pred_rf)

print("Baseline Random Forest:")
print(f"RMSE: {rmse_rf:.4f}")
print(f"R²: {r2_rf:.4f}")

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
plt.figure(figsize=(10, 5))
plt.scatter(y_test, y_pred_rf, alpha=0.5, color='green', label='RF Predictions')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', label='Perfect Prediction')
plt.title('Actual vs Predicted (Random Forest)')
plt.xlabel('Actual Close Price')
plt.ylabel('Predicted Close Price')
plt.legend()  # needed for the labels above to be displayed
plt.show()

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization (RandomizedSearchCV)
from sklearn.model_selection import RandomizedSearchCV

# Parameter distribution (3 x 3 x 2 = 18 possible combinations)
param_dist = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5]
}

# RandomizedSearchCV: sample 10 combinations, 3-fold CV, scored by negative MSE
rf_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_dist,
    n_iter=10,
    cv=3,
    scoring='neg_mean_squared_error',
    random_state=42  # fix the sampled combinations so reruns are reproducible
)

# Fit the Algorithm
rf_search.fit(X_train_imputed, y_train)

# Predict on the model with the best cross-validated parameters
best_rf = rf_search.best_estimator_
y_pred_rf_tuned = best_rf.predict(X_test_imputed)

# Metrics
rmse_rf_tuned = np.sqrt(mean_squared_error(y_test, y_pred_rf_tuned))
r2_rf_tuned = r2_score(y_test, y_pred_rf_tuned)

print(f"\nTuned Random Forest (Best params: {rf_search.best_params_}):")
print(f"RMSE: {rmse_rf_tuned:.4f} (Improvement: {rmse_rf - rmse_rf_tuned:.4f})")
print(f"R²: {r2_rf_tuned:.4f} (Improvement: {r2_rf_tuned - r2_rf:.4f})")

##### Which hyperparameter optimization technique have you used and why?

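RandomizedSearchCV with 3-fold cross-validation, scored by negative mean squared error. The grid over `n_estimators`, `max_depth`, and `min_samples_split` has 18 combinations, and each Random Forest fit is far more expensive than a Lasso fit, so the search samples 10 combinations (`n_iter=10`) rather than fitting all of them.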
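To make the search budget concrete, the sketch below enumerates the full grid implied by `param_dist` and draws 10 candidates without replacement, which is how randomized search behaves when every parameter is given as a list. It uses only the standard library; the seed and the names `all_candidates` and `sampled` are illustrative assumptions, and the draw will not match the combinations scikit-learn actually samples.

```python
import itertools
import random

# Same search space as the RandomizedSearchCV cell
param_dist = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
}

# Enumerate every combination a full grid search would fit
names = list(param_dist)
all_candidates = [
    dict(zip(names, values))
    for values in itertools.product(*param_dist.values())
]

# Draw 10 distinct candidates, mimicking n_iter=10 (illustrative seed)
rng = random.Random(42)
sampled = rng.sample(all_candidates, k=10)

print(f"Full grid: {len(all_candidates)} candidates; random search tries {len(sampled)}")
```

With 3-fold cross-validation, that is 30 model fits instead of 54 for the full grid.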

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

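RMSE and R², the two metrics reported for every model above. RMSE is expressed in the same units as the closing price, so it reads directly as the typical size of a prediction error, with large misses penalized more heavily than small ones. R² gives the share of variance in the closing price that the model explains, which puts the models on a common scale for comparison.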
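As a reference for the RMSE and R² values printed throughout this notebook, here is a minimal sketch of how both are computed, written in plain Python rather than with `mean_squared_error` and `r2_score`. The four price values are made up for illustration and are not taken from the Yes Bank dataset.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the average squared error."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical closing prices and predictions
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]

print(f"RMSE: {rmse(actual, predicted):.4f}")  # sqrt(18 / 4) -> 2.1213
print(f"R²: {r2(actual, predicted):.4f}")      # 1 - 18 / 500 -> 0.9640
```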

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

The Random Forest Regressor is the final model for predicting Yes Bank stock prices. It produced the most accurate results on the test set, with an RMSE of 13.98 and R² of 0.945, outperforming the linear models (Ridge and Lasso). Because it can capture non-linear patterns, it is well suited to stock market data, which often involves complex trends and sudden shifts. Random Forest also exposes feature importances, which help identify the factors that drive predictions, such as past prices and volatility. It is slower to train than the linear models, but averaging over many trees makes it less sensitive to outliers and noise. For investors and analysts, this translates into more dependable forecasts to inform decisions. The model can also be retrained as new monthly data arrives, which makes it practical for ongoing market analysis.

For interpretability, Lasso Regression can supplement Random Forest by highlighting the most critical features, but Random Forest remains the primary model for its superior predictive power and adaptability to real-world stock market dynamics.
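The way Lasso highlights features can be sketched on synthetic data: its L1 penalty drives the coefficients of uninformative columns to exactly zero, so the surviving columns form a short list of candidates. The data, the `alpha` value, and the name `lasso_demo` below are assumptions chosen for illustration, not values from this project.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: 10 columns, but only columns 0 and 1 influence the target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Fit Lasso with an illustrative regularization strength
lasso_demo = Lasso(alpha=0.1)
lasso_demo.fit(X, y)

# Columns whose coefficients were not shrunk to zero
selected = np.flatnonzero(lasso_demo.coef_)
print("Columns with non-zero coefficients:", selected.tolist())
print("Their coefficients:", np.round(lasso_demo.coef_[selected], 3).tolist())
```

On the real feature set, lagged prices and rolling averages are strongly correlated, and Lasso tends to keep one of a correlated group somewhat arbitrarily, so the selected list is best read as a hint rather than a ranking.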

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Random Forest is an ensemble machine learning model that combines many decision trees to improve prediction accuracy and reduce overfitting. Each tree is trained on a bootstrap sample of the training rows, and the candidate features at each split can be restricted to a random subset. For regression, the final output is the average of the individual trees' predictions.
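The averaging step and the built-in importance scores can be checked on a small synthetic example. This is a sketch under stated assumptions: the data, `max_features=0.5`, and the name `rf_demo` are illustrative, and it uses scikit-learn's impurity-based `feature_importances_` rather than the SHAP analysis mentioned in the conclusion.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: column 0 dominates the target, column 1 has a weak non-linear effect
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 5.0 * X[:, 0] + np.sin(3.0 * X[:, 1]) + rng.normal(scale=0.1, size=300)

# Each tree sees a bootstrap sample; each split considers half of the features
rf_demo = RandomForestRegressor(n_estimators=50, max_features=0.5, random_state=42)
rf_demo.fit(X, y)

# The forest's prediction is the mean of the individual trees' predictions
per_tree = np.stack([tree.predict(X) for tree in rf_demo.estimators_])
forest_pred = rf_demo.predict(X)
print("Forest equals mean of trees:", np.allclose(per_tree.mean(axis=0), forest_pred))

# Impurity-based importances are normalized to sum to 1
importances = rf_demo.feature_importances_
for col, score in enumerate(importances):
    print(f"column {col}: {score:.3f}")
```

Impurity-based importances can overstate features with many distinct values, so a permutation-based or SHAP-based check is a reasonable complement.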

# **Conclusion**

The Random Forest model emerged as the optimal choice, balancing high accuracy with actionable insights. Its feature importance and SHAP plots revealed that historical prices and short-term trends dominate predictions, while volatility signals risk. This approach can be extended to other stocks or timeframes, providing a reliable tool for financial decision-making. Future work could integrate external data (e.g., news sentiment) to enhance robustness.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***