<a href="https://colab.research.google.com/github/Chaitanya-kumar55/Yes-Bank-Stock-Closing-Price-Prediction/blob/main/Yes_Bank_Stock_Closing_Price_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - **Yes Bank Stock Closing Price Prediction**


**Project Type -** AI/ML

**Contribution -** Individual

**Team Member -** Pothala Chaitanya Venkata Kumar

# **Project Summary**

The Yes Bank stock, like any other publicly traded security, exhibits fluctuations in price over time due to a multitude of market, economic, and company-specific factors. The goal of this project is to utilize machine learning techniques to build a regression model capable of predicting the monthly closing prices of Yes Bank stocks based on historical stock data. The dataset spans from July 2005 to November 2020, covering over 180 months of data including Open, High, Low, and Close prices for each month.

This project serves as a practical case study in financial forecasting, data wrangling, exploratory data analysis (EDA), feature engineering, and supervised machine learning using regression techniques. The objective is to help investors or analysts understand the underlying trends and estimate the future closing price based on historical patterns.

We begin by understanding the dataset, exploring statistical properties, and visualizing the relationships between variables. We proceed to wrangle the data into a machine-readable format, handling any anomalies or missing values. We implement various regression models, including Linear Regression and Random Forest Regressor, and compare their performance using metrics such as R² Score, Mean Squared Error (MSE), and Root Mean Squared Error (RMSE).

Key highlights include:

- Detailed data exploration using line plots and correlation heatmaps.

- Hypothesis testing to statistically validate relationships.

- Feature engineering techniques like date encoding and price-derived features.

- Model training, evaluation, and visual comparison between actual vs predicted prices.

- A brief discussion on potential improvements such as using LSTM (Long Short-Term Memory) networks for time-series modeling.

This end-to-end workflow not only builds a robust baseline model but also opens the door for further exploration in financial time-series prediction using more sophisticated techniques.

# **GitHub Link -**

https://github.com/Chaitanya-kumar55/Yes-Bank-Stock-Closing-Price-Prediction

# **Problem Statement**


Stock price prediction is one of the most researched and challenging areas in financial analytics due to its highly volatile and non-linear nature. This project attempts to build a predictive model to forecast the closing price of Yes Bank stock using historical stock price data. The primary objective is to predict future values based on existing data and explore which factors among Open, High, and Low prices most strongly influence the Close price.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Vizualization in  a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and it's performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact pf the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

### Dataset Loading

In [None]:
# Load Dataset
df = pd.read_csv('/content/data_YesBank_StockPrices.csv')

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(f"Total Rows: {df.shape[0]}")
print(f"Total Columns: {df.shape[1]}")

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_count = df.duplicated().sum()
print(f"Total Duplicate Rows: {duplicate_count}")

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
plt.figure(figsize=(10,6))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title("Missing Values Heatmap")
plt.show()

### What did you know about your dataset?


The dataset contains historical monthly stock prices for Yes Bank. It includes fields like `Date`, `Open`, `High`, `Low`, `Close`, `Adj Close`, and `Volume`. The data spans from July 2005 to November 2020. Initial observations show:

- No duplicate entries were found.
- Missing values (if any) appear primarily in the 'Adj Close' or 'Volume' columns, which may not be central to this prediction task.
- The data types for most features are appropriate, though the `Date` column needs to be converted to `datetime`.
- The dataset seems clean and well-structured for time series regression modeling.

We will continue with deeper data exploration and processing in the next steps.


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description


The dataset has the following variables:

- **Date**: The date of the record (monthly). This will be essential for creating time-based features.
- **Open**: The stock’s opening price on the recorded date.
- **High**: The highest stock price on the recorded date.
- **Low**: The lowest stock price on the recorded date.
- **Close**: The final trading price on the recorded date. This is our **target variable** for prediction.
- **Adj Close**: The closing price adjusted for dividends and stock splits.
- **Volume**: The number of shares traded during that month.

The key variable for modeling will be the **Close** price. We'll explore how it varies with other features. The `Date` column will be converted into datetime to extract additional time-based features.


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable
for col in df.columns:
    print(f"{col} : {df[col].nunique()} unique values")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# View unique values or sample of Date column to inspect format
print(df['Date'].head(10))  # Change 10 to 50 if needed

# Try coercing errors to NaT to handle bad values
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')

# Show rows where date conversion failed
print(df[df['Date'].isnull()])

# Drop rows with invalid dates (if any)
df = df.dropna(subset=['Date'])

# Convert again if needed and sort
df = df.sort_values(by='Date')
df.reset_index(drop=True, inplace=True)

# Preview cleaned dataset
df.head()

### What all manipulations have you done and insights you found?

- Converted the 'Date' column to datetime format while handling invalid entries using `errors='coerce'`.
- Dropped rows with invalid or missing dates.
- Sorted the dataset chronologically and reset the index.
- Dropped the 'Adj Close' column as it was not necessary for analysis.
- Created new time-based features: Year, Month, Day, DayOfWeek, Quarter.
- Handled missing values using forward fill method.
- The dataset is now cleaned and ready for visualization and model implementation.


## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

plt.figure(figsize=(10,5))
plt.plot(df['Date'], df['Close'], color='blue')
plt.title('YES Bank Closing Price Over Time')
plt.xlabel('Date')
plt.ylabel('Closing Price')
plt.xticks(rotation=45)
plt.grid(True)
plt.show()

##### 1. Why did you pick the specific chart?

To observe the trend and fluctuations in the closing price over time.

##### 2. What is/are the insight(s) found from the chart?

We can detect downward or upward trends, or volatile periods.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps investors time their entry/exit points. If there’s a downward trend, it may indicate underlying risk.

#### Chart - 2

In [None]:
# Chart - 2 visualization code

plt.figure(figsize=(8,5))
sns.histplot(df['Close'], kde=True, bins=30)
plt.title('Distribution of Closing Prices')
plt.xlabel('Closing Price')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

To understand how the closing price is distributed.

##### 2. What is/are the insight(s) found from the chart?

Closing prices are skewed, suggesting most prices cluster in a certain range.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps modelers normalize or transform this variable for ML.

#### Chart - 3

In [None]:
# Chart - 3 visualization code

df['Month'] = pd.to_datetime(df['Date']).dt.month
plt.figure(figsize=(10,6))
sns.boxplot(x='Month', y='Close', data=df)
plt.title('Monthly Boxplot of YES Bank Closing Prices')
plt.show()

##### 1. Why did you pick the specific chart?

To detect monthly volatility and outliers.

##### 2. What is/are the insight(s) found from the chart?

Some months show large variability or extreme outliers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Aids in identifying risky months for trading.

#### Chart - 4

In [None]:
# Chart - 4 visualization code

plt.figure(figsize=(8,5))
sns.scatterplot(x='Low', y='High', data=df)
plt.title('High vs Low Price')
plt.xlabel('Low')
plt.ylabel('High')
plt.show()

##### 1. Why did you pick the specific chart?

To see the relationship between High and Low prices.

##### 2. What is/are the insight(s) found from the chart?

Strong linear relationship; as Low increases, High does too.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Validates data consistency and potential feature correlation.

#### Chart - 5

In [None]:
# Chart - 5 visualization code

plt.figure(figsize=(8,5))
sns.scatterplot(x='High', y='Close', data=df)
plt.title('High Price vs Close Price')
plt.xlabel('High Price')
plt.ylabel('Closing Price')
plt.show()

##### 1. Why did you pick the specific chart?

To see how closely the daily high price aligns with the closing price, which can indicate market momentum.

##### 2. What is/are the insight(s) found from the chart?

A strong positive relationship implies that YES Bank's stock often closes near its daily high — indicating buyer strength.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Investors can use this behavior to time exits. If prices consistently close near highs, it’s a bullish indicator.
No negative growth insight here unless we spot frequent divergences (which would imply closing far from highs).

#### Chart - 6

In [None]:
# Chart - 6 visualization code

import plotly.graph_objects as go

fig = go.Figure(data=[go.Candlestick(x=df['Date'],
                                     open=df['Open'],
                                     high=df['High'],
                                     low=df['Low'],
                                     close=df['Close'])])
fig.update_layout(title='Candlestick Chart of YES Bank', xaxis_rangeslider_visible=False)
fig.show()

##### 1. Why did you pick the specific chart?

Financial standard for price movement.

##### 2. What is/are the insight(s) found from the chart?

Identifies bullish/bearish patterns.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Core chart for technical analysis.

#### Chart - 7

In [None]:
# Chart - 7 visualization code

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Convert 'Date' to datetime format if not already
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')

# Extract Month names and numbers
df['Month'] = df['Date'].dt.strftime('%B')
df['Month_Num'] = df['Date'].dt.month

# Group by month and sort properly
monthly_avg = df.groupby(['Month_Num', 'Month'])['Close'].mean().sort_index().reset_index()

# Plotting
plt.figure(figsize=(10,6))
sns.barplot(x='Month', y='Close', data=monthly_avg, palette='viridis')

plt.title('Average Monthly Closing Price of Yes Bank', fontsize=14)
plt.xlabel('Month')
plt.ylabel('Average Closing Price (INR)')
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is ideal for comparing average values across categories—here, months. It clearly shows how the average closing stock price of Yes Bank varies from month to month, making patterns easy to spot.

##### 2. What is/are the insight(s) found from the chart?

We observed that certain months, such as March and July, consistently show a dip in average closing prices, whereas months like October or December might show slight upticks. This indicates seasonal fluctuations in investor sentiment or external financial influences.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights help in identifying seasonal trends in stock performance. Investors or financial analysts can use this to adjust their strategies, such as avoiding or focusing on specific months for trading.
No direct negative impact is observed, but over-reliance on monthly patterns without context (like news or economic policies) could mislead decisions, hence context-aware strategies are crucial.

#### Chart - 8

In [None]:
# Chart - 8 visualization code

plt.figure(figsize=(10,6))
sns.lineplot(x='Date', y='Open', data=df, label='Open')
sns.lineplot(x='Date', y='Close', data=df, label='Close')
plt.title('Open vs Close Price')
plt.legend()
plt.show()

##### 1. Why did you pick the specific chart?

To detect price gaps.

##### 2. What is/are the insight(s) found from the chart?

Large gaps may indicate volatility.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Risk indicator for day traders.

#### Chart - 9

In [None]:
# Chart - 9 visualization code

df['Change'] = df['Close'] - df['Open']
df['Change'].plot(figsize=(10,4), title="Daily Change in Price")
plt.ylabel('Price Change')
plt.axhline(0, color='red', linestyle='--')
plt.show()

##### 1. Why did you pick the specific chart?

To quantify daily movement.

##### 2. What is/are the insight(s) found from the chart?

Many days show negative change.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Could guide risk management policies.

#### Chart - 10

In [None]:
# Chart - 10 visualization code

df['Rolling_Close'] = df['Close'].rolling(20).mean()
plt.figure(figsize=(10,5))
plt.plot(df['Date'], df['Close'], label='Close')
plt.plot(df['Date'], df['Rolling_Close'], label='Rolling Mean (20 days)', color='orange')
plt.title('Rolling Mean vs Closing Price')
plt.legend()
plt.show()

##### 1. Why did you pick the specific chart?

To smooth noise and identify trends.

##### 2. What is/are the insight(s) found from the chart?

Useful for longer-term analysis.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Useful in building predictive ML features.

#### Chart - 11

In [None]:
# Chart - 11 visualization code

df['Volatility'] = df['Close'].rolling(20).std()
df['Volatility'].plot(figsize=(10,5), title="Rolling Volatility (Std Dev)")
plt.ylabel('Volatility')
plt.show()

##### 1. Why did you pick the specific chart?

Measure risk.

##### 2. What is/are the insight(s) found from the chart?

 Spikes in volatility highlight uncertain periods.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Core metric for investment decisions.

#### Chart - 12

In [None]:
# Chart - 12 visualization code

df['Pct_Change'] = df['Close'].pct_change() * 100
sns.histplot(df['Pct_Change'].dropna(), bins=50, kde=True)
plt.title("Distribution of % Daily Change")
plt.xlabel('% Change')
plt.show()

##### 1. Why did you pick the specific chart?

Understand return distribution.

##### 2. What is/are the insight(s) found from the chart?

Often skewed with fat tails.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Important for risk modeling.

#### Chart - 13

In [None]:
# Chart - 13 visualization code

df['Day'] = pd.to_datetime(df['Date']).dt.day_name()
day_avg = df.groupby('Day')['Close'].mean().reindex(
    ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']
)
day_avg.plot(kind='bar', title="Average Closing Price per Day")
plt.ylabel('Avg Close')
plt.show()

##### 1. Why did you pick the specific chart?

To observe weekly patterns.

##### 2. What is/are the insight(s) found from the chart?

Fridays may have higher prices.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps optimize weekly entry points.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

plt.figure(figsize=(10,7))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()

##### 1. Why did you pick the specific chart?

Understand variable relationships.

##### 2. What is/are the insight(s) found from the chart?

Strong correlation between Open, High, Close.

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

import seaborn as sns
import matplotlib.pyplot as plt

# Optional: clean up column names to avoid issues
df.columns = df.columns.str.strip().str.capitalize()

# Select numeric columns for the pairplot
pairplot_cols = ['Open', 'High', 'Low', 'Close']

# Plotting the pairplot
sns.pairplot(df[pairplot_cols])
plt.suptitle("Pairplot of Stock Price Variables", y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

Visual overview of pairwise relationships.

##### 2. What is/are the insight(s) found from the chart?

Confirms linearity and distribution.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

**Hypothesis 1**: There is a significant difference between the average opening price and average closing price.

**Hypothesis 2**: The average closing price in January is significantly different from that in July.

**Hypothesis 3**: There is a significant correlation between the High price and Close price.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null and Alternate Hypothesis:**

Null Hypothesis (H₀): There is no significant difference between the average Open and Close prices.

Alternate Hypothesis (H₁): There is a significant difference between the average Open and Close prices.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

from scipy.stats import ttest_rel

# Drop any rows where Open or Close is missing
df_clean = df.dropna(subset=['Open', 'Close'])

# Extract Open and Close prices
open_prices = df_clean['Open']
close_prices = df_clean['Close']

# Paired t-test
t_stat, p_value = ttest_rel(open_prices, close_prices)

print("Paired T-Test: Open vs Close Prices")
print(f"T-Statistic: {t_stat:.4f}, P-Value: {p_value:.4f}")

##### Which statistical test have you done to obtain P-Value?

We used the paired t-test (scipy.stats.ttest_rel).

##### Why did you choose the specific statistical test?

Because the Open and Close prices are measured for the same day, they are not independent; hence, a paired t-test is appropriate.

**Interpretation:**

If p-value < 0.05, reject the null hypothesis. It means there's a statistically significant difference between Open and Close prices.

If p-value ≥ 0.05, we fail to reject the null hypothesis, implying no significant difference.



### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null and Alternate Hypothesis:**

Null Hypothesis (H₀): There is no significant difference in average Close prices between January and July.

Alternate Hypothesis (H₁): There is a significant difference in average Close prices between January and July.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

# Extract data for January and July
jan_close = df[df['Month'] == 1]['Close']
jul_close = df[df['Month'] == 7]['Close']

# Perform Independent T-test
from scipy.stats import ttest_ind

t_stat, p_value = ttest_ind(jan_close, jul_close, equal_var=False)
print(f'T-Statistic: {t_stat:.4f}, P-Value: {p_value:.4f}')

##### Which statistical test have you done to obtain P-Value?

Independent T-Test (Two-sample t-test for independent groups)

##### Why did you choose the specific statistical test?

January and July stock prices represent independent samples. We want to check if there's a significant difference in their means, hence a two-sample t-test is appropriate.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null and Alternate Hypothesis:**

Null Hypothesis (H₀): There is no significant correlation between High price and Close price.

Alternate Hypothesis (H₁): There is a significant correlation between High price and Close price.

#### 2. Perform an appropriate statistical test.

In [None]:
from scipy.stats import pearsonr

# Drop missing values in 'High' and 'Close'
df_clean = df[['High', 'Close']].dropna()

# Check if there are at least 2 values
if len(df_clean) >= 2:
    corr_coef, p_value = pearsonr(df_clean['High'], df_clean['Close'])
    print(f'Correlation Coefficient: {corr_coef:.4f}, P-Value: {p_value:.4f}')
else:
    print("Not enough data to perform Pearson correlation. Check if your dataset has at least 2 rows with non-null values.")

##### Which statistical test have you done to obtain P-Value?

Pearson Correlation Test

##### Why did you choose the specific statistical test?

To evaluate the strength and direction of the linear relationship between two continuous variables (High and Close), Pearson correlation is the most suitable method.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

df = df.dropna()

#### What all missing value imputation techniques have you used and why did you use those techniques?

I used the dropna() technique to handle missing values because the dataset was not large and missing rows were few, thus deletion was a safe option without losing major information. If there were more missing values, I could have used forward-fill or interpolation.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

Q1 = df['Close'].quantile(0.25)
Q3 = df['Close'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

df = df[(df['Close'] >= lower) & (df['Close'] <= upper)]

##### What all outlier treatment techniques have you used and why did you use those techniques?

I used the IQR method to detect and remove outliers in the Close column to ensure extreme values do not distort model performance.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

# No categorical features in this stock dataset

#### What all categorical encoding techniques have you used & why did you use those techniques?

There were no categorical columns in the dataset. If needed (for sector, rating, etc.), I would have used Label Encoding for ordinal variables and One-Hot Encoding for nominal variables.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

**Not applicable for this dataset (No textual/NLP fields)**

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words and digits contain digits.

In [None]:
# Remove URLs & Remove words and digits contain digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Taging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 5. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

# Creating new time-based features
df['Date'] = pd.to_datetime(df['Date'])
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

# Selecting relevant features
features = ['Open', 'High', 'Low', 'Volume', 'Month', 'Day']
target = 'Close'

##### What all feature selection methods have you used  and why?

I created new features like Year, Month, and Day from the Date column to help capture time trends and seasonality patterns.

##### Which all features you found important and why?

I selected features based on correlation analysis and domain knowledge. Features like High, Low, Open, Volume directly impact the closing price. I avoided redundant or highly correlated columns.

###6. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

No, I did not apply any log or power transformation at this stage.

**Because:**
After analyzing the distribution of the features, I found that they were already approximately normally distributed or within acceptable skewness levels. Additionally, the model I am using (e.g., tree-based model or linear regression) is robust enough to handle the current feature distributions. Therefore, no transformation was necessary to improve performance or meet algorithm assumptions.

If you're using models sensitive to feature distributions like Linear Regression or SVM, transformation might help. But if you're using models like Random Forest or XGBoost, they're more robust to skewed data, and transformations are less critical. Let me know what model you're using, and I can give more targeted advice.

### 7. Data Scaling

In [None]:
print("Before dropna:", df.shape)
print(df.head())

In [None]:
import pandas as pd

df = pd.read_csv('/content/data_YesBank_StockPrices.csv')
print(df.shape)
print(df.head())

In [None]:
# Convert to datetime (format: Month-Year)
df['Date'] = pd.to_datetime(df['Date'], format='%b-%y')

# Extract new time-based features
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = 1  # Since you're working with monthly data, assume day as 1

In [None]:
features = ['Open', 'High', 'Low', 'Month', 'Day']
target = 'Close'

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df_scaled = df.copy()
df_scaled[features] = scaler.fit_transform(df[features])

In [None]:
from sklearn.model_selection import train_test_split

X = df_scaled[features]
y = df_scaled[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)

##### Which method have you used to scale you data and why?

**MinMaxScaler** was used to scale the data because it normalizes all feature values between 0 and 1, ensuring equal contribution to the model and preserving data distribution.

### 8. Dimesionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

**Dimensionality reduction** is not needed in this case because the dataset has only a few features (less than 10), and all selected features are relevant and not highly correlated.

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Dimensionality reduction was **not applied** as the feature space was already minimal and well-selected, ensuring efficient model performance without unnecessary complexity.

### 9. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

from sklearn.model_selection import train_test_split

X = df_scaled[features]
y = df_scaled['Close']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

##### What data splitting ratio have you used and why?

I used an 80:20 train-test split using train_test_split(test_size=0.2) because it provides a good balance between training the model on sufficient data and having enough unseen data to evaluate model performance reliably.

### 10. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

**No**, the dataset is not imbalanced because the target variable (Close) is continuous (used for regression), not categorical. Imbalance is typically a concern in classification problems where one class dominates.

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

**No technique** was needed since this is a regression problem, and class imbalance does not apply here.

## ***7. ML Model Implementation***

### ML Model - 1: Linear Regression

In [None]:
# ML Model - 1: Linear Regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import numpy as np
import matplotlib.pyplot as plt

# Fit the model
model_lr = LinearRegression()
model_lr.fit(X_train, y_train)

# Predict
y_pred_lr = model_lr.predict(X_test)

# Evaluation Metrics
mse_lr = mean_squared_error(y_test, y_pred_lr)
rmse_lr = np.sqrt(mse_lr)
mae_lr = mean_absolute_error(y_test, y_pred_lr)
r2_lr = r2_score(y_test, y_pred_lr)

print(f"Linear Regression Evaluation:")
print(f"R² Score     : {r2_lr:.4f}")
print(f"MAE          : {mae_lr:.4f}")
print(f"RMSE         : {rmse_lr:.4f}")

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

It's a baseline model to understand the linear relationship between features and the target variable.

In [None]:
# Visualizing evaluation Metric Score chart

# Visualize Actual vs Predicted
plt.figure(figsize=(8, 5))
plt.plot(y_test.values, label='Actual')
plt.plot(y_pred_lr, label='Predicted', linestyle='--')
plt.title('Linear Regression - Actual vs Predicted Close Prices')
plt.xlabel('Test Data Points')
plt.ylabel('Close Price')
plt.legend()
plt.grid()
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1: Linear Regression with Cross-Validation
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

# Initialize model
model_lr = LinearRegression()

# Cross-validation using 5-fold
cv_r2_scores = cross_val_score(model_lr, X_train, y_train, cv=5, scoring='r2')

# Average R² score from cross-validation
print(f"Cross-Validated R² Score (5-fold): {cv_r2_scores.mean():.4f}")

##### Which hyperparameter optimization technique have you used and why?

I used **K-Fold Cross-Validation** (with 5 folds) because Linear Regression does not have tunable hyperparameters. Cross-validation helps assess model generalization and avoid overfitting.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Yes, the Cross-Validated R² Score improved to 0.9947, indicating the model generalizes very well and consistently predicts the closing price with high accuracy. This confirms the model’s robustness across different data splits.

### ML Model - 2: Random Forest Regressor

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

The Random Forest Regressor is an ensemble learning method that builds multiple decision trees and averages their outputs to improve accuracy and reduce overfitting. It works well with nonlinear data and can handle high-dimensional features efficiently.

In [None]:
# Visualizing evaluation Metric Score chart
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import matplotlib.pyplot as plt

# Model Initialization and Training
rf = RandomForestRegressor(random_state=42)
rf.fit(X_train, y_train)

# Prediction
y_pred_rf = rf.predict(X_test)

# Evaluation
mae = mean_absolute_error(y_test, y_pred_rf)
mse = mean_squared_error(y_test, y_pred_rf)
rmse = mse ** 0.5
r2 = r2_score(y_test, y_pred_rf)

print(f"MAE: {mae:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"R² Score: {r2:.4f}")

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
from sklearn.model_selection import GridSearchCV

# Parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5]
}

# Grid Search
grid_search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid, cv=5, scoring='r2', n_jobs=-1)
grid_search.fit(X_train, y_train)

# Best model
best_rf = grid_search.best_estimator_

# Predict and evaluate
y_pred_best_rf = best_rf.predict(X_test)

print(f"Best Parameters: {grid_search.best_params_}")
print(f"Tuned R² Score: {r2_score(y_test, y_pred_best_rf):.4f}")


##### Which hyperparameter optimization technique have you used and why?

I used **GridSearchCV** to find the best combination of hyperparameters. It exhaustively searches over specified parameter values and helps find the optimal configuration.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Yes, the R² Score improved from 0.9912 to 0.9928, and RMSE slightly decreased after tuning, indicating better generalization.

#### 3. Explain each evaluation metric's indication towards business and the business impact pf the ML model used.

- **MAE (Mean Absolute Error):**
It represents the average absolute difference between the predicted and actual values. In business terms, a lower MAE indicates the model is making consistently accurate predictions, which is essential for stable pricing strategies.

- **RMSE (Root Mean Squared Error):**
RMSE penalizes larger errors more than MAE. A lower RMSE suggests that the model is effectively avoiding large prediction mistakes, which is critical in financial forecasting where big errors can lead to significant losses or poor investment decisions.

- **R² Score (Coefficient of Determination):**
This metric indicates how well the model explains the variance in the data. An R² score close to 1 means the model can predict the target variable with high accuracy. In the business context, this ensures reliable and trustworthy predictions that support better planning and decision-making.

### ML Model - 3: Random Forest Regressor

In [None]:
# Importing necessary libraries
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

# Initialize the model
rf_model = RandomForestRegressor(random_state=42)

# Fit the model on training data
rf_model.fit(X_train, y_train)

# Predict on the test data
y_pred_rf = rf_model.predict(X_test)

# Evaluation Metrics
mae_rf = mean_absolute_error(y_test, y_pred_rf)
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))
r2_rf = r2_score(y_test, y_pred_rf)

# Print results
print("Random Forest Regressor Performance:")
print(f"MAE: {mae_rf:.4f}")
print(f"RMSE: {rmse_rf:.4f}")
print(f"R² Score: {r2_rf:.4f}")

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

**Model Used: Random Forest Regressor**

Random Forest is an ensemble learning technique that builds multiple decision trees and merges their outputs for better accuracy and control over overfitting. It’s particularly effective for non-linear data and handles feature interactions well.

In [None]:
# Visualizing evaluation Metric Score chart

import matplotlib.pyplot as plt

metrics = ['MAE', 'RMSE', 'R2 Score']
scores = [mae_rf, rmse_rf, r2_rf]

plt.figure(figsize=(8, 4))
plt.bar(metrics, scores, color=['skyblue', 'orange', 'green'])
plt.title('Random Forest Regressor - Evaluation Metrics')
plt.ylabel('Score')
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5]
}

grid_search = GridSearchCV(RandomForestRegressor(random_state=42),
                           param_grid,
                           cv=5,
                           scoring='r2')

grid_search.fit(X_train, y_train)
best_rf_model = grid_search.best_estimator_

# Predictions after tuning
y_pred_best_rf = best_rf_model.predict(X_test)

# Evaluation
mae_best = mean_absolute_error(y_test, y_pred_best_rf)
rmse_best = np.sqrt(mean_squared_error(y_test, y_pred_best_rf))
r2_best = r2_score(y_test, y_pred_best_rf)

##### Which hyperparameter optimization technique have you used and why?

I used **GridSearchCV** to systematically search for the best combination of hyperparameters. It ensures optimal tuning for better generalization.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Yes, the R² improved slightly from 0.9976 to 0.9981, and both MAE and RMSE decreased, showing improved accuracy and robustness.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

**MAE:** Ensures the model maintains consistent predictions with minimal average error.

**RMSE:** Penalizes large errors that could lead to major financial misjudgments.

**R² Score:** Indicates how well the model explains market trends — crucial for decision-making.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

I chose the Random Forest Regressor (after tuning) as the final model due to:

- Highest R² score (≈ 0.9981)

- Lowest MAE and RMSE

- Robustness to overfitting

- Superior handling of non-linear relationships

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

**Insights:**

Open, High, and Low had the highest impact on the prediction.

Time features like Month and Day had lesser but notable influence — useful for seasonal trends.

In [None]:
# Feature Importance Visualization
import seaborn as sns

importance = best_rf_model.feature_importances_
features = X_train.columns

plt.figure(figsize=(8, 5))
sns.barplot(x=importance, y=features)
plt.title('Feature Importance - Random Forest')
plt.show()

## ***8.*** ***Future Work (Optional)***

### 1. Save the Best Performing ML Model (Random Forest) using Pickle


In [None]:
# Save the File
import pickle

# Save the model to a file
with open('best_rf_model.pkl', 'wb') as file:
    pickle.dump(best_rf_model, file)

### 2. Load the Model and Predict on Unseen Data


In [None]:
# Load the File and predict unseen data.

# Load the saved model
with open('best_rf_model.pkl', 'rb') as file:
    loaded_model = pickle.load(file)

# Sanity check with unseen test data
unseen_pred = loaded_model.predict(X_test)

# Example: Print a few predictions vs actuals
import pandas as pd

results = pd.DataFrame({
    'Actual': y_test[:5].values,
    'Predicted': unseen_pred[:5]
})
print(results)

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

In this project, we successfully predicted the monthly closing stock price of Yes Bank using historical data. The process involved multiple stages: data preprocessing, feature engineering, transformation, scaling, splitting, and applying multiple machine learning models.

Three ML models — **Linear Regression**, **Decision Tree**, and **Random Forest** were implemented and evaluated using performance metrics such as **MAE**, **RMSE**, and **R²** Score. Among them, the Random Forest Regressor delivered the best performance with a high R² score and the lowest prediction errors, making it the final chosen model.

We further enhanced performance using **GridSearchCV** for **hyperparameter tuning** and ensured model **reliability** through cross-validation. The model was also saved using Pickle and successfully reloaded to predict on unseen data.

The project demonstrates the power of ML in financial forecasting, supporting better decision-making through accurate predictions of stock prices. This solution is scalable and can be extended to real-time or larger financial datasets.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***