# **Project Name**    -



##### **Project Type**    - Regression
##### **Contribution**    - INDIVIDUAL
##### **Team Member 1 -**    SARATHRAJ R
##### **Team Member 2 -**
##### **Team Member 3 -**
##### **Team Member 4 -**

# **Project Summary -**

This project aims to analyze and predict the monthly closing prices of Yes Bank, a significant player in the Indian financial sector. The dataset spans from the bank's inception to the present, encompassing monthly stock prices, including opening, closing, highest, and lowest values for each month. The primary goal is to predict the monthly closing price of Yes Bank's stock using various machine learning and time series models.

The project begins with comprehensive data exploration and preprocessing using Pandas for efficient data manipulation and aggregation. This involves cleaning the data, handling missing values, and ensuring it is in a suitable format for analysis. Visualizations will be created using Matplotlib and Seaborn to understand the trends, seasonal patterns, and potential anomalies in the stock prices, particularly around significant events like the fraud case revelation. These visualizations help in gaining insights into the stock’s behavior and its correlation with various factors over time.

For the computational aspect, NumPy is used for efficient numerical operations on the dataset, including normalization, transformation, and other preprocessing steps that prepare the data for modeling. The core of the project uses scikit-learn for training, tuning, and evaluating machine learning models such as Linear Regression, Random Forest, and Gradient Boosting. These are compared against time series models such as ARIMA and SARIMA (from statsmodels) and Prophet. Model performance is evaluated using Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R-squared (R²) to determine the best-performing model.
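
As a minimal sketch of the three evaluation metrics named above, the block below computes MAE, RMSE, and R² with NumPy. The helper name `regression_metrics` and the sample numbers are illustrative assumptions, not part of the project code.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R-squared for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))              # average absolute error
    rmse = np.sqrt(np.mean(errors ** 2))       # penalises large errors more
    ss_res = np.sum(errors ** 2)               # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "RMSE": rmse, "R2": r2}

# Made-up closing prices and predictions, for illustration only
actual = [100.0, 110.0, 120.0, 130.0]
predicted = [102.0, 108.0, 123.0, 127.0]
scores = regression_metrics(actual, predicted)
print(scores)
```

MAE and RMSE are in the same unit as the price, so they can be read directly as a typical error in rupees, while R² is unitless.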

The project architecture involves several key stages. Initially, data collection and preprocessing form the foundation, ensuring the dataset is clean and structured. Next, exploratory data analysis (EDA) provides valuable insights into the stock price trends and their driving factors. Following this, feature engineering and selection help in identifying the most relevant features that influence the stock’s closing price. The modeling phase includes training various models and fine-tuning their parameters to enhance predictive accuracy. Finally, model evaluation and validation are conducted to assess the models' performance and select the best one for predicting the future closing prices.

Overall, this project integrates data analysis, visualization, and predictive modeling to understand and forecast the monthly closing prices of Yes Bank's stock.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


Predicting Monthly Closing Stock Prices for Yes Bank.

The objective of this project is to predict the monthly closing stock price of Yes Bank. Given the historical stock price data since its inception, including the opening, closing, highest, and lowest prices for each month, the goal is to develop a predictive model that can accurately forecast the closing price.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
*   Will the gained insights help creating a positive business impact? Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:

# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import mean_squared_error,mean_absolute_error
import itertools

In [None]:
from google.colab import drive
drive.mount('/content/drive')

### Dataset Loading

In [None]:
# Load Dataset

pathway = '/content/drive/MyDrive/data_YesBank_StockPrices.csv'
df = pd.read_csv(pathway)

In [None]:
df
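
The guidelines ask for exception handling, so here is a hedged sketch of a defensive loader. The helper `load_prices` and the list `EXPECTED_COLUMNS` are hypothetical names introduced for illustration, and the in-memory CSV stands in for the Drive file with illustrative numbers.

```python
import io
import pandas as pd

# Column names taken from the variables described in this notebook
EXPECTED_COLUMNS = ["Date", "Open", "High", "Low", "Close"]

def load_prices(source):
    """Read the CSV and fail early with a clear message if it is unusable."""
    try:
        frame = pd.read_csv(source)
    except FileNotFoundError as exc:
        raise FileNotFoundError(f"Stock price file not found: {source}") from exc
    missing = [col for col in EXPECTED_COLUMNS if col not in frame.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    return frame

# In-memory stand-in for the CSV file; the numbers are illustrative only
csv_text = (
    "Date,Open,High,Low,Close\n"
    "Jul-05,13.0,14.0,11.25,12.46\n"
    "Aug-05,12.58,14.88,12.55,13.42\n"
)
sample = load_prices(io.StringIO(csv_text))
print(sample.shape)
```

In the notebook the same function could be called with `pathway` instead of the in-memory buffer.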

### Dataset First View

In [None]:
# Dataset First Look

df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape


### Dataset Information

In [None]:
# Dataset Info
df.info()


#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()


In [None]:
# Visualizing the missing values
sns.heatmap(df.isnull())


### What did you know about your dataset?

The dataset comprises monthly stock prices of Yes Bank, including the opening, closing, highest, and lowest prices from its inception to the present. The primary objective is to predict the stock's closing price for each month.

I found no missing values or duplicate rows in the dataset, and the heatmap shows no missing entries.


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns


In [None]:
# Dataset Describe
df.describe()

### Variables Description

Date - month of the candles

Open - opening price of the month

High - highest price of the month

Low - lowest price of the month

Close - closing price of the month
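
These definitions imply that, for every month, High should be the largest of the four prices and Low the smallest. A small sketch of that sanity check follows; the helper name `find_inconsistent_rows` and the sample rows are assumptions made for illustration.

```python
import pandas as pd

def find_inconsistent_rows(prices):
    """Return rows where High is not the maximum or Low is not the minimum."""
    high_ok = prices["High"] >= prices[["Open", "Close", "Low"]].max(axis=1)
    low_ok = prices["Low"] <= prices[["Open", "Close", "High"]].min(axis=1)
    return prices[~(high_ok & low_ok)]

# Made-up rows; the third one has a High below its Open on purpose
sample = pd.DataFrame({
    "Open": [10.0, 12.0, 15.0],
    "High": [12.5, 13.0, 14.0],
    "Low": [9.5, 11.0, 13.0],
    "Close": [12.0, 11.5, 13.5],
})
bad_rows = find_inconsistent_rows(sample)
print(bad_rows)
```

An empty result on the real data would support treating the extreme values as genuine prices rather than recording errors.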

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in df.columns:
  print("No. of unique values in ",i,"is",df[i].nunique())

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
df.dtypes

In [None]:

#Data Inspection
print(df.head())
print(df.info())
print(df.describe())

In [None]:
#Data Cleaning
print(df.isnull().sum())

In [None]:

# visualize Outliers
plt.figure(figsize=(12, 6))
df[['Open', 'High', 'Low', 'Close']].boxplot()
plt.xlabel('Price Type')
plt.ylabel('Price')
plt.title('Box Plot of Stock Prices')
plt.grid(True)
plt.show()

In [None]:

#Set the Date column as index

df.set_index('Date',inplace = True)

### What all manipulations have you done and insights you found?

'Date' column has been moved to the index position, and it is no longer a column in the DataFrame.

On inspection, the dataset has no missing values and no duplicate rows. The box plot shows some outliers in the price columns; these are handled in later steps.

Describing the dataset also gave summary statistics for the stock prices, such as the mean, median, and other aggregates.
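
The index set above still holds the dates as text. One option, sketched here on made-up rows, is to parse them into real timestamps before indexing so that plots and time-based operations follow calendar order. The `%b-%y` format is the one this notebook uses later for the candlestick chart.

```python
import pandas as pd

# Illustrative rows only; the real values come from the CSV
raw = pd.DataFrame({
    "Date": ["Sep-05", "Jul-05", "Aug-05"],
    "Close": [13.30, 12.46, 13.42],
})

# Parse month-year strings such as "Jul-05" into timestamps
raw["Date"] = pd.to_datetime(raw["Date"], format="%b-%y")

# A sorted DatetimeIndex keeps the months in calendar order
indexed = raw.set_index("Date").sort_index()
print(indexed)
```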

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code


# Chart - 1 Box Plot
sns.boxplot(df['Close'])
plt.title('Boxplot of Close')
plt.show()

##### 1. Why did you pick the specific chart?

A box plot was chosen to visualize the distribution of the closing price and identify potential outliers.



##### 2. What is/are the insight(s) found from the chart?

The box plot shows the distribution of Yes Bank's closing prices. We can see the median closing price, the quartiles, and potential outliers, which are the data points beyond the whiskers of the box plot.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Identifying outliers can help in refining the predictive model, leading to more accurate predictions of Yes Bank's closing prices. This can be valuable for making informed investment decisions. However, further analysis is needed to determine if these outliers represent actual market trends or data anomalies.
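
The points a box plot draws beyond its whiskers are, by the usual convention, values more than 1.5 times the interquartile range outside the quartiles. A minimal sketch of counting them is shown below; `iqr_outlier_mask` is a hypothetical helper and the sample series is made up.

```python
import pandas as pd

def iqr_outlier_mask(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up closing prices with one extreme value
closes = pd.Series([10, 12, 11, 13, 12, 14, 13, 100], dtype=float)
mask = iqr_outlier_mask(closes)
print(closes[mask])
```

Applied to `df['Close']`, the mask gives the exact months flagged as outliers, which makes it easier to check whether they line up with known events.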

#### Chart - 2

In [None]:
# Chart - 2 visualization code

sns.boxplot(df['Open'])
plt.title('Boxplot of Open')
plt.show()


##### 1. Why did you pick the specific chart?

Similar to the closing price, a box plot was used to visualize the distribution of the opening price and identify potential outliers.



##### 2. What is/are the insight(s) found from the chart?

The box plot for 'Open' shows the distribution of Yes Bank's opening prices. It helps visualize the median opening price, quartiles, and potential outliers (data points beyond the whiskers). This gives an idea of the typical range of opening prices and any extreme values.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding the distribution of opening prices, including outliers, can help traders and investors make more informed decisions. Identifying patterns and anomalies in opening prices can be useful for developing trading strategies or risk management.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
sns.lineplot(df['High'])
# Limit the x-axis to positions 0 through 100
plt.xlim(0,100)
plt.title('Lineplot of High')
plt.show()


##### 1. Why did you pick the specific chart?

A line chart was chosen to visualize the trend of Yes Bank's highest stock price over time.



##### 2. What is/are the insight(s) found from the chart?

The line chart shows the fluctuations in the highest price of Yes Bank's stock over time. It reveals periods of growth, decline, and potential volatility.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, understanding the historical trend of the highest stock price can be useful for investors and traders to identify potential buying or selling opportunities.

#### Chart - 4

In [None]:


# Chart - 4 Area Chart
df.plot.area(y='High')
plt.title('Area Chart of High')
plt.show()

##### 1. Why did you pick the specific chart?

An area chart was chosen to visualize the trend of the highest stock price over time, with the filled area emphasizing the magnitude of the price at each point.

##### 2. What is/are the insight(s) found from the chart?

The area chart shows the growth and decline of Yes Bank's highest stock price. It highlights periods of significant price changes and the overall trend.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, visualizing the trend can help investors understand the overall performance of the stock and identify periods of substantial growth or decline.

#### Chart - 5

In [None]:

# Chart - 5 histogram
sns.histplot(df['Low'])
plt.title('Histogram of Low')
plt.show()

##### 1. Why did you pick the specific chart?

A histogram was chosen to visualize the distribution of Yes Bank's lowest stock prices, showing the frequency of different price ranges.



##### 2. What is/are the insight(s) found from the chart?

The histogram reveals the distribution of the lowest stock prices, indicating the most common price ranges and potential outliers. It helps understand the frequency of low prices.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, understanding the distribution of low prices can be useful for investors to assess the risk associated with the stock and make informed decisions.



#### Chart - 6

In [None]:
# Chart - 6 Moving average chart
df['MA10'] = df['Close'].rolling(window=10).mean()
df['MA20'] = df['Close'].rolling(window=20).mean()
df.plot(y=['Close', 'MA10', 'MA20'])
plt.title('Close Price with 10- and 20-Month Moving Averages')
plt.show()

##### 1. Why did you pick the specific chart?

A moving average chart was chosen to visualize the trend of Yes Bank's closing price while smoothing out short-term fluctuations, providing a clearer picture of the overall trend.



##### 2. What is/are the insight(s) found from the chart?

The chart shows the closing price along with its 10-month and 20-month moving averages (the data is monthly, so each rolling window counts months). It helps identify trends and potential buy/sell signals based on the crossover of the moving averages.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, using moving averages can help traders and investors identify potential trend reversals and make more informed trading decisions.
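
To make the crossover idea concrete, the sketch below marks the rows where a short moving average crosses a long one. The function `crossover_signals`, the window sizes, and the toy series are illustrative assumptions; this is not a trading rule.

```python
import pandas as pd

def crossover_signals(close, short_window, long_window):
    """Return +1 where the short average crosses above the long one, -1 where it crosses below."""
    short_ma = close.rolling(window=short_window).mean()
    long_ma = close.rolling(window=long_window).mean()
    above = (short_ma > long_ma).astype(int)
    change = above.diff()
    # A crossover needs both averages defined on this row and the previous one
    valid = long_ma.notna() & long_ma.notna().shift(1, fill_value=False)
    return change.where(valid, 0).fillna(0).astype(int)

# Toy price path: rises, falls, then rises again
close = pd.Series([1, 2, 3, 4, 5, 4, 3, 2, 1, 2, 3, 4], dtype=float)
signals = crossover_signals(close, short_window=2, long_window=3)
print(signals.tolist())
```

With the notebook's data the call would be `crossover_signals(df['Close'], 10, 20)`, matching the `MA10` and `MA20` columns.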



#### Chart - 7

In [None]:
# Chart - 7 Horizontal bar chart
df.plot.barh(y='Close')
plt.title('Horizontal Bar Chart of Close')
plt.show()

##### 1. Why did you pick the specific chart?

A horizontal bar chart was chosen to visualize the closing prices over time in a horizontal format, allowing for easier comparison of prices across different dates.



##### 2. What is/are the insight(s) found from the chart?

The chart displays the closing prices for each date horizontally, making it easy to compare the closing prices across different periods.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the horizontal bar chart can help identify periods of high and low closing prices, which can be useful for traders and investors to understand historical price patterns.



#### Chart - 8

In [None]:
# Chart - 8 Scatter plot
sns.scatterplot(x='High', y='Low', data=df)
plt.title('Scatter Plot of High and Low')
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot was chosen to visualize the relationship between Yes Bank's highest and lowest stock prices, exploring potential correlations or patterns.



##### 2. What is/are the insight(s) found from the chart?

The scatter plot shows the relationship between the highest and lowest prices for each period. It helps identify any correlation or trends between these two variables.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, understanding the relationship between high and low prices can be useful for traders to assess the volatility and potential trading range of the stock.

#### Chart - 9

In [None]:

# Chart - 9 Line Chart For "Low"
sns.lineplot(df['Low'])
plt.title('Lineplot of Low')
plt.show()


##### 1. Why did you pick the specific chart?

A line chart was chosen to visualize the trend of Yes Bank's lowest stock price over time.



##### 2. What is/are the insight(s) found from the chart?

The line chart shows the fluctuations in the lowest price of Yes Bank's stock over time. It reveals periods of growth, decline, and potential volatility in the lower price range.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, understanding the historical trend of the lowest stock price can be useful for investors and traders to identify potential buying opportunities or assess downside risk.

#### Chart - 10

In [None]:

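# Chart - 10 visualization code: scatter of Open against Close with a fitted regression line
sns.regplot(data=df, x='Close', y='Open')
plt.title('Regression Plot of Open vs Close')
plt.show()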


##### 1. Why did you pick the specific chart?

A regression plot (a scatter plot with a fitted straight line) was chosen to show how closely the opening price tracks the closing price of the same month.

##### 2. What is/are the insight(s) found from the chart?

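In this chart, we compare the stock's opening and closing prices month by month. The spread of the points around the fitted line shows how closely the two prices move together.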



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Open and Close prices are generally close because they belong to the same month.

If points lie close to a 45-degree line, it suggests little change within the month: the stock opened and closed around similar values.

If there is significant scatter, it may indicate volatile trading within the month, which is a risk signal rather than a positive one.
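
The 45-degree reading can be checked numerically: a fitted slope near 1 and an intercept near 0 mean Open and Close sit on almost the same line. A minimal sketch with `numpy.polyfit` on made-up numbers:

```python
import numpy as np

# Made-up monthly prices, for illustration only
close_prices = np.array([10.0, 20.0, 30.0, 40.0])
open_prices = np.array([11.0, 19.0, 31.0, 39.0])

# Degree-1 least-squares fit: open ~ slope * close + intercept
slope, intercept = np.polyfit(close_prices, open_prices, 1)
print(round(slope, 4), round(intercept, 4))
```

On the real data the inputs would be `df['Close']` and `df['Open']`, the same pair passed to `sns.regplot`.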

#### Chart - 11

In [None]:
!pip install mplfinance

import mplfinance as mpf

# Ensure date is datetime and set as index
df_candle = df.copy()
df_candle = df_candle.reset_index()
df_candle['Date'] = pd.to_datetime(df_candle['Date'], format='%b-%y')
df_candle.set_index('Date', inplace=True) # Set 'Date' back as index

# Plot candlestick
mpf.plot(df_candle, type='candle', style='charles', title="Monthly Candlestick Chart", volume=False)

##### 1. Why did you pick the specific chart?

I chose the candlestick chart because it is a standard and powerful visual tool in financial markets. It shows all four key prices — Open, High, Low, Close (OHLC) — in a single compact visual per month, making it ideal for detecting monthly sentiment and volatility.

##### 2. What is/are the insight(s) found from the chart?

Certain months showed long candles and wicks, indicating extreme volatility — consistent with Yes Bank's crisis period.

There are periods where the closing price is consistently lower than the opening price, suggesting bearish sentiment and lack of investor confidence.

Volatility spikes during and after significant events like the Rana Kapoor fraud case in 2018, reflecting market reaction.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. These insights can:

Help investors time entries/exits better by recognizing volatile periods.

Allow risk managers to build volatility buffers during months of historic turbulence.

Assist analysts in mapping event-driven price behavior, which strengthens predictive modeling.



#### Chart - 12

In [None]:
df_rolling = df_candle.copy()
df_rolling['Close_rolling_mean'] = df_rolling['Close'].rolling(window=6).mean()
df_rolling['Close_rolling_std'] = df_rolling['Close'].rolling(window=6).std()

df_rolling[['Close', 'Close_rolling_mean', 'Close_rolling_std']].plot(figsize=(10, 5), title="Rolling Mean & Volatility")


##### 1. Why did you pick the specific chart?

This chart was selected to analyze both trends and volatility over time using rolling statistics (mean and standard deviation). It helps identify smoothed trends and price stability over a 6-month window.

##### 2. What is/are the insight(s) found from the chart?

Long-term declining trend visible between 2018–2020, followed by price stabilization.

Significant volatility spikes occur in crisis periods — volatility increased sharply post-2018 and again during COVID.

The price becomes range-bound in recent years, indicating reduced market confidence.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. These patterns help:

Risk teams forecast future volatility and adjust portfolios accordingly.

Model developers create features that capture volatility.

Stakeholders interpret market sentiment changes, allowing for proactive risk assessment and intervention.
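
One way to turn this into a model feature, sketched under the assumption that volatility is measured on returns rather than on price levels, is a rolling standard deviation of month-over-month returns. The window of 2 and the prices below are chosen only to keep the example small.

```python
import pandas as pd

# Made-up closing prices: up 10%, down 10%, up 10%
close = pd.Series([100.0, 110.0, 99.0, 108.9])

# Month-over-month return; the first value has no previous month
returns = close / close.shift(1) - 1

# Rolling sample standard deviation of returns as a volatility feature
volatility = returns.rolling(window=2).std()
print(volatility)
```

Measuring volatility on returns keeps the feature comparable between periods when the stock traded at very different price levels.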

#### Chart - 13

In [None]:
# Chart - 13 visualization code
for col in ['Open', 'High', 'Low', 'Close']:
    df[col] = pd.to_numeric(df[col].astype(str).str.replace(',', '').str.strip(), errors='coerce')

# Drop rows with NaNs
df_cleaned = df.dropna(subset=['Open', 'High', 'Low', 'Close'])

# Reshape the data for violin plot
df_melted = df_cleaned[['Open', 'High', 'Low', 'Close']].melt(var_name='Price_Type', value_name='Value')

# Create the violin plot
plt.figure(figsize=(10, 6))
sns.violinplot(data=df_melted, x='Price_Type', y='Value')
plt.title("Violin Plot of Stock Price Types (Open, High, Low, Close)")
plt.ylabel("Stock Price")
plt.xlabel("Price Type")
plt.grid(True)
plt.show()

##### 1. Why did you pick the specific chart?

The violin plot combines the benefits of a boxplot and a KDE (kernel density estimate).

It shows:

Distribution shape

Central tendency (median)

Spread (interquartile range)

Presence of outliers and fat tails

This makes it a powerful tool to visualize stock price volatility and distribution asymmetry.

##### 2. What is/are the insight(s) found from the chart?

The High and Low prices have wider distributions, reflecting greater variability and intramonth volatility.

The Close prices are slightly right-skewed, indicating rare spikes in value.

The Open vs Close prices have similar median ranges, but Close tends to show more spread, suggesting intramonth movement.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Understanding the distribution and volatility of each price type helps:

Modelers to engineer better target and feature transformations

Traders to assess the risk/reward trade-off for different price levels

Stakeholders to interpret market stability and investor sentiment

#### Chart - 14 - Correlation Heatmap

In [None]:
# Add month and year columns

df_heatmap = df_candle.copy()
df_heatmap['Year'] = df_heatmap.index.year
df_heatmap['Month'] = df_heatmap.index.month_name().str[:3]

# Pivot table for heatmap
pivot_table = df_heatmap.pivot_table(values='Close', index='Month', columns='Year')

# Reorder months
month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
pivot_table = pivot_table.reindex(month_order)


plt.figure(figsize=(12, 6))
sns.heatmap(pivot_table, annot=True, fmt=".1f", cmap="YlGnBu")
plt.title("Monthly Avg Closing Price Heatmap by Year")
plt.show()



##### 1. Why did you pick the specific chart?

I chose the heatmap to explore seasonal trends in the stock's performance by comparing average monthly closing prices across different years. It visually highlights recurring strong/weak months and detects patterns or anomalies.

##### 2. What is/are the insight(s) found from the chart?

Some months like January and March tend to have lower average closing prices, possibly due to fiscal closing behaviors or budget announcements.

Pre-2018 years had higher seasonal variability, while post-crisis years appear relatively flat — signaling investor fatigue or institutional hold.

Helps identify months that may perform better, based on historical averages.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, absolutely:

Traders can build seasonality-aware strategies.

Portfolio managers can optimize entry/exit windows based on favorable historical patterns.

ML models can use month-based features to enhance prediction accuracy, especially if seasonality exists.
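
A hedged sketch of such month-based features follows. The sine/cosine encoding is an assumption added here, not something the notebook does; it places December next to January instead of eleven steps away.

```python
import numpy as np
import pandas as pd

# Illustrative dates in the same "Mon-yy" format as the dataset
dates = pd.to_datetime(["Jan-18", "Apr-18", "Jul-18", "Oct-18"], format="%b-%y")

features = pd.DataFrame(index=dates)
features["month"] = features.index.month          # 1 to 12

# Map the month onto a circle so the year wraps around smoothly
angle = 2 * np.pi * (features["month"] - 1) / 12
features["month_sin"] = np.sin(angle)
features["month_cos"] = np.cos(angle)
print(features)
```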

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

for col in ['Open', 'High', 'Low', 'Close']:
    df[col] = pd.to_numeric(df[col].astype(str).str.replace(',', '').str.strip(), errors='coerce')

# Drop rows with NaNs after conversion
df_cleaned = df.dropna(subset=['Open', 'High', 'Low', 'Close'])


sns.pairplot(df_cleaned[['Open', 'High', 'Low', 'Close']], diag_kind='kde')
plt.suptitle("Pair Plot of OHLC Stock Prices", y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

I chose the pair plot because it provides a comprehensive view of how each of the four main stock price metrics — Open, High, Low, and Close — are related to each other. Unlike single plots, a pair plot displays:

Scatter plots for every pair of variables (bivariate relationships)

Distribution plots for each variable on the diagonal

Optional kernel density estimates to observe distribution shape

This makes it ideal for:

Identifying multicollinearity

Spotting linear/nonlinear relationships

Understanding data distributions and spread



##### 2. What is/are the insight(s) found from the chart?


Strong linear correlations exist among all four features — especially between:

High vs Close, Low vs Open, and Open vs Close

This is expected due to the nature of monthly stock data

The scatter plots show that all price variables tend to move together, suggesting that one can potentially be used to predict another.

Distributions are right-skewed, especially for High and Close, indicating that the stock has had occasional sharp price spikes, but spent most months at lower price levels.
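
The visual impression of strong linear relationships can be backed with numbers through a Pearson correlation matrix. The sketch below uses made-up prices, so the values it prints say nothing about the real dataset.

```python
import pandas as pd

# Made-up OHLC rows, for illustration only
prices = pd.DataFrame({
    "Open": [10.0, 12.0, 11.0, 14.0, 18.0, 25.0],
    "High": [13.0, 13.0, 15.0, 19.0, 26.0, 30.0],
    "Low": [9.0, 10.0, 10.0, 13.0, 17.0, 22.0],
    "Close": [12.0, 11.0, 14.0, 18.0, 25.0, 28.0],
})

# Pairwise Pearson correlation between the four price columns
corr = prices.corr()
print(corr.round(3))
```

Passing `corr` to `sns.heatmap(corr, annot=True)` would render the same table as a correlation heatmap.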



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes — the insights from the pair plot are valuable both for modeling and strategic analysis:

Business & Modeling Impact:

The strong multicollinearity suggests some variables may be redundant for modeling. For instance, if Open and Close are tightly correlated, including several such features without adjustment can make the coefficient estimates of a linear model unstable and hard to interpret.

Investors and analysts can use the strong correlation between High, Low, and Close to estimate price bands for risk management and stop-loss planning.

Modelers can use insights from distribution skew to apply appropriate transformations or robust models that aren’t misled by outliers.
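
As an example of such a transformation, the sketch below compares the skewness of a right-skewed sample before and after `np.log1p`. The choice of a log transform and the sample values are assumptions for illustration; whether it helps on the real prices has to be checked on the data.

```python
import numpy as np
import pandas as pd

# Made-up right-skewed sample: mostly low values with one spike
prices = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 50.0])

raw_skew = prices.skew()
log_skew = np.log1p(prices).skew()   # log(1 + x) compresses the large values

print(round(raw_skew, 2), round(log_skew, 2))
```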



## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null hypothesis - There is no significant difference between the mean High and Low stock prices.

Alternate hypothesis - There is significant difference between High and Low stock prices.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

from scipy.stats import ttest_ind
t_statistic, p_value = ttest_ind(df['High'], df['Low'])
print("T-Statistic:", t_statistic)
print("P-Value:", p_value)

alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis. There is a significant difference between the mean High and Low prices.")
else:
    print("Fail to reject the null hypothesis. There is no evidence of a significant difference between the mean High and Low prices.")

##### Which statistical test have you done to obtain P-Value?

I used `ttest_ind`, the independent two-sample t-test, to compare the High and Low stock prices. It gave a t-statistic of 2.05 and a p-value of 0.040. Because the p-value is below 0.05, we reject the null hypothesis.


##### Why did you choose the specific statistical test?

I chose this test because it answers the following question: if the two groups actually had the same mean, what is the probability of observing a difference this large or larger purely by random chance?
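
Each month contributes one High and one Low value, so the two samples are paired rather than independent. As a hedged alternative, not the test used above, the sketch below compares `ttest_ind` with the paired test `ttest_rel` on made-up numbers.

```python
import numpy as np
from scipy.stats import ttest_ind, ttest_rel

# Made-up monthly highs and lows; each position is the same month
high = np.array([14.0, 14.9, 14.3, 17.2, 20.0, 22.5])
low = np.array([11.3, 12.6, 12.3, 13.0, 16.0, 18.0])

independent = ttest_ind(high, low)   # treats the two columns as unrelated groups
paired = ttest_rel(high, low)        # tests the month-by-month differences

print("independent p-value:", independent.pvalue)
print("paired p-value:", paired.pvalue)
```

The paired test works on the per-month differences, so variation in the overall price level between months does not dilute the comparison.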

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H₀):  The true mean of Yes Bank’s monthly closing stock price is equal to ₹150.

Alternate Hypothesis (H₁):  The true mean of the closing stock price is not equal to ₹150.


#### 2. Perform an appropriate statistical test.

In [None]:

# Perform Statistical Test to obtain P-Value
from scipy import stats
sample_mean = df['Close'].mean()
hypothesized_mean = 150

t_statistic, p_value = stats.ttest_1samp(a=df['Close'], popmean=hypothesized_mean)

print("T-statistic:", t_statistic)
print("P-value:", p_value)


alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis. The true mean of the closing stock price is not equal to ₹150.")
else:
    print("Fail to reject the null hypothesis. There is not enough evidence that the true mean differs from ₹150.")

##### Which statistical test have you done to obtain P-Value?

I used the one-sample t-test via `stats.ttest_1samp`.

##### Why did you choose the specific statistical test?

We are comparing the sample mean of a single numerical variable (Close prices) to a known or hypothesized population mean (₹150).

The test determines if the sample provides enough statistical evidence to say that the true mean is significantly different from ₹150.

The population standard deviation is unknown, so a Z-test is not appropriate — hence the t-test.
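
The statistic behind this test is the sample mean minus the hypothesized mean, divided by the standard error s/√n. The sketch below computes it by hand on made-up closing prices and compares the result with `stats.ttest_1samp`.

```python
import numpy as np
from scipy import stats

# Made-up closing prices, for illustration only
closes = np.array([120.0, 135.0, 150.0, 160.0, 180.0, 95.0, 110.0, 140.0])
hypothesized_mean = 150

n = closes.size
standard_error = closes.std(ddof=1) / np.sqrt(n)   # sample std, hence ddof=1
t_manual = (closes.mean() - hypothesized_mean) / standard_error
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)  # two-sided p-value

t_scipy, p_scipy = stats.ttest_1samp(closes, popmean=hypothesized_mean)
print(t_manual, t_scipy)
print(p_manual, p_scipy)
```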

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H₀):
There is no linear correlation between the Open and Close stock prices of Yes Bank.

Alternate Hypothesis (H₁):
There is a linear correlation between the Open and Close stock prices.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy import stats

correlation_coefficient, p_value = stats.pearsonr(df['Open'], df['Close'])

print("Pearson correlation coefficient:", correlation_coefficient)
print("P-value:", p_value)

alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis. There is a linear correlation between the Open and Close stock prices.")
else:
    print("Fail to reject the null hypothesis. There is no evidence of a linear correlation between the Open and Close stock prices.")

There is a strong, positive linear correlation between Open and Close prices, and it is statistically significant (since p < 0.05).
This implies the opening price is a strong predictor of the closing price.

##### Which statistical test have you done to obtain P-Value?

I performed the Pearson’s Correlation Coefficient Test, using:
stats.pearsonr(df['Open'], df['Close'])

This test outputs:

The Pearson correlation coefficient (r), which measures the strength and direction of linear association

The p-value, which tells us whether the observed correlation is statistically significant


##### Why did you choose the specific statistical test?

We chose Pearson’s correlation test because:

Both Open and Close are continuous, numerical variables

We're interested in testing the strength and direction of their linear relationship

Pearson’s r is the standard method for testing linear correlation when:

The data is approximately normally distributed

The relationship is assumed to be linear

There are no strong outliers that could skew results

This test helps assess whether one price (Open) can be linearly associated with another (Close) — useful in building predictive models or trading strategies.
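
The earlier charts point to right-skewed prices with outliers, which strains the assumptions listed above. A rank-based check such as Spearman's correlation is one possible complement; it is a different technique from the Pearson test used here, shown on made-up numbers.

```python
from scipy import stats

# Made-up prices with one extreme month at the end
open_prices = [10.0, 20.0, 30.0, 40.0, 500.0]
close_prices = [12.0, 19.0, 33.0, 41.0, 300.0]

pearson_r, pearson_p = stats.pearsonr(open_prices, close_prices)
spearman_rho, spearman_p = stats.spearmanr(open_prices, close_prices)

print("Pearson r:", pearson_r)
print("Spearman rho:", spearman_rho)
```

Spearman's rho depends only on the ordering of the values, so a single extreme month cannot dominate it the way it can dominate Pearson's r.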

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:

# Handling Missing Values & Missing Value Imputation
df.isnull().sum()

#### What all missing value imputation techniques have you used and why did you use those techniques?

There are no missing values in the data set.



### 2. Handling Outliers

In [None]:

# visualize Outliers
plt.figure(figsize=(12, 6))
df[['Open', 'High', 'Low', 'Close']].boxplot()
plt.xlabel('Price Type')
plt.ylabel('Price')
plt.title('Box Plot of Stock Prices')
plt.grid(True)
plt.show()


##### What all outlier treatment techniques have you used and why did you use those techniques?

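No outlier treatment was applied. The extreme values shown in the box plot are significant for this dataset, so they were kept rather than removed or capped.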
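
For reference, the points a box plot draws beyond its whiskers can also be listed numerically with the interquartile-range (IQR) rule. This is a minimal sketch on made-up values; `iqr_outlier_mask` is a hypothetical helper, and it only flags values without removing them.

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Return a boolean mask that is True where a value lies outside the box-plot whiskers."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Illustrative values only (not taken from the dataset)
prices = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
mask = iqr_outlier_mask(prices)
print("Flagged values:", prices[mask].tolist())
```

The same function could be applied to each of the Open, High, Low, and Close columns to count how many months fall outside the whiskers.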

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

#### What all categorical encoding techniques have you used & why did you use those techniques?

Not needed for this dataset so skipping this step.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing Words Containing Digits

In [None]:
# Remove URLs & remove words containing digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Not needed for this dataset so skipping this step.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Not needed for this dataset so skipping this step.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate features to minimize feature correlation and create new features

df['Price_Range'] = df['High'] - df['Low']
df['Price_Range']

In [None]:
df['Price_Change'] = df['Close'] - df['Open']
df['Price_Change']

In [None]:
df['Price_Change_Pct'] = ((df['Close'] - df['Open']) / df['Open']) * 100
df['Price_Change_Pct']

In [None]:
df['Avg_Price'] = (df['Open'] + df['High'] + df['Low'] + df['Close']) / 4
df['Avg_Price']

In [None]:
df['Volatility_Ratio'] = (df['High'] - df['Low']) / df['Open']
df['Volatility_Ratio']

In [None]:
df['Is_Green_Month'] = (df['Close'] > df['Open']).astype(int)
df['Is_Green_Month']

Next, I convert the date index to a proper datetime type and sort it chronologically, so that later time-based calculations work correctly.

In [None]:
# Convert the index to datetime format
df.index = pd.to_datetime(df.index, format='%b-%y')

df = df.sort_index()



Let's create lag and rolling-window features.

In [None]:
df['Close_Lag1'] = df['Close'].shift(1)
df['Close_Rolling_Mean_3'] = df['Close'].rolling(window=3).mean()
df['Close_Rolling_Std_3'] = df['Close'].rolling(window=3).std()

In [None]:
df.tail()
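
One point worth checking with rolling features: `rolling(window=3)` includes the current row, so `Close_Rolling_Std_3` for a given month is computed from that month's Close, which is also the prediction target. Below is a minimal sketch of a variant that looks only at past months. The toy frame and the column name `Prev_Rolling_Mean_3` are assumptions for illustration and are not part of the notebook.

```python
import pandas as pd

toy = pd.DataFrame({'Close': [10.0, 20.0, 30.0, 40.0, 50.0]})

# Window that includes the current month (as in the cell above)
toy['Rolling_Mean_3'] = toy['Close'].rolling(window=3).mean()

# Shift by one month first, so the window covers only the three previous months
toy['Prev_Rolling_Mean_3'] = toy['Close'].shift(1).rolling(window=3).mean()

print(toy)
```

With the shifted version, the first three rows are NaN because three earlier months are not yet available.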

#### 2. Feature Selection

In [None]:

# Select your features wisely to avoid overfitting
# Line plot of a candidate feature (Open) against the target (Close)
plt.figure(figsize=(6, 4))
sns.lineplot(data=df, x='Open', y='Close')
plt.title("Open vs Close")
plt.show()

In [None]:
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def calculate_vif(data):
    """Return the variance inflation factor (VIF) of every numeric column."""
    # VIF is only defined for numeric columns
    data = data.select_dtypes(include=np.number)

    # Replace infinite values with a large finite number, then drop rows
    # with missing values (e.g., from the lag and rolling features)
    data = data.replace([np.inf, -np.inf], 1e9).dropna()

    vif_data = pd.DataFrame()
    vif_data["feature"] = data.columns
    vif_data["VIF"] = [
        variance_inflation_factor(data.values, i)
        for i in range(data.shape[1])
    ]
    return vif_data

# calculate_vif does not modify df, so no copy is needed
vif_sample_df = calculate_vif(df)

vif_sample_df


In [None]:
# Select only numerical columns for correlation
numerical_df = df.select_dtypes(include=np.number)

# Handle infinite values by replacing them with a large finite number
numerical_df = numerical_df.replace([np.inf, -np.inf], 1e9)

# Drop rows with NaN values that might result from previous operations (like rolling calculations)
numerical_df = numerical_df.dropna()

# Calculate the correlation matrix
correlation_matrix = numerical_df.corr()

# Plot the correlation heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap of Numerical Features')
plt.show()

VIF analysis revealed extreme multicollinearity among raw price features (Open, High, Low, Close) and derived metrics (Avg_Price, Price_Range, etc.), leading to VIF scores of inf. To mitigate this, we dropped redundant features and retained only independent, high-information features like Close_Lag1, Price_Change_Pct, and Volatility_Ratio. This step ensures model stability, especially for linear models.
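
The infinite scores follow from the definition of VIF: for feature i, VIF = 1 / (1 - R²), where R² comes from regressing feature i on all the other features. When a feature is an exact linear combination of the others (as Avg_Price is of Open, High, Low, and Close), R² reaches 1 and the VIF diverges. The sketch below reproduces this on synthetic data. The helper `vif` is hypothetical and, unlike the statsmodels function, adds an intercept to the regression.

```python
import numpy as np

def vif(X, i):
    """VIF of column i: 1 / (1 - R^2) from regressing column i on the other columns."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    design = np.column_stack([np.ones(len(y)), others])  # add an intercept
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residual = y - design @ coef
    r_squared = 1.0 - residual.var() / y.var()
    return np.inf if r_squared >= 1.0 else 1.0 / (1.0 - r_squared)

rng = np.random.default_rng(0)
# Synthetic, mutually independent columns (not the real prices)
open_, high, low, close = rng.normal(100, 10, size=(4, 200))
avg_price = (open_ + high + low + close) / 4  # exact linear combination

independent = np.column_stack([open_, high, low, close])
with_avg = np.column_stack([open_, high, low, close, avg_price])

print("VIF of Open without Avg_Price:", vif(independent, 0))
print("VIF of Open with Avg_Price:   ", vif(with_avg, 0))
```

Without the derived column the VIF stays close to 1. Adding the exact combination makes it extremely large or infinite.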

In [None]:
# Drop redundant and multicollinear features
features_to_drop = [
    'Open', 'High', 'Low',
    'Price_Range', 'Price_Change', 'Avg_Price',
    'Close_Rolling_Mean_3'
]

df_reduced = df.drop(columns=features_to_drop)

# Verify final columns
print(df_reduced.columns)


We dropped the redundant features identified above to reduce multicollinearity while keeping the signals that matter for prediction.

##### What all feature selection methods have you used  and why?

**Variance Inflation Factor (VIF)**

I used VIF to identify features that are highly correlated with each other. This matters because high multicollinearity can make regression-based models unstable and lead to unreliable coefficient estimates. Features with a very high VIF (approaching or equal to inf) were considered redundant and removed.

**Domain Knowledge & Redundancy Reduction**

I grouped the raw features (Open, High, Low, Close) and the derived features (Price_Change, Avg_Price, Price_Range). Since many derived features are built from the same components, I kept only one representative from each group to avoid duplicating information.

**Statistical Insight (T-tests and Correlation)**

I conducted t-tests and Pearson correlation tests to understand which features show a statistically significant relationship with the target (Close price). Features like Close_Lag1, Price_Change_Pct, and Volatility_Ratio showed meaningful signals.

##### Which all features you found important and why?

Price_Change_Pct, Volatility_Ratio, Close_Lag1, Close_Rolling_Std_3, Is_Green_Month

These features were chosen because they:

- Are statistically relevant
- Are non-redundant (low multicollinearity)
- Add predictive value without overfitting
- Are interpretable for stakeholders

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

Yes. The skewed features are transformed before modeling: Volatility_Ratio is log-transformed with np.log1p, Price_Change_Pct and Volatility_Ratio are passed through a PowerTransformer, and the Close series is differenced for the ARIMA / SARIMA models. The code for these steps is in the Data Scaling section.

### 6. Data Scaling

In [None]:
# Scaling your data

from sklearn.preprocessing import StandardScaler, PowerTransformer

#  Start from df_reduced (already cleaned and with low VIF)
df_reduced = df_reduced.copy()

In [None]:
# Back-fill the NaNs that the lag and rolling-window features leave at the start of the series
# (fillna(method='bfill') is deprecated in recent pandas versions; .bfill() is the replacement)
cols_to_fill = ['MA10', 'MA20', 'Close_Lag1', 'Close_Rolling_Std_3']
df_reduced[cols_to_fill] = df_reduced[cols_to_fill].bfill()

In [None]:
from sklearn.preprocessing import StandardScaler

features_to_scale = [
    'Price_Change_Pct', 'Volatility_Ratio', 'Close_Lag1',
    'Close_Rolling_Std_3', 'MA10', 'MA20'
]

scaler = StandardScaler()
df_scaled = df_reduced.copy()
df_scaled[features_to_scale] = scaler.fit_transform(df_scaled[features_to_scale])


In [None]:
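# Log transformation for skewed features.

df_log = df_scaled.copy()
# log1p is only defined for values greater than -1, so it is applied to the
# raw (unscaled) ratio from df_reduced rather than to the standardized column
df_log['Volatility_Ratio_log'] = np.log1p(df_reduced['Volatility_Ratio'])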


In [None]:
# Further normalize skewed variables (PowerTransformer uses Yeo-Johnson by default).
from sklearn.preprocessing import PowerTransformer

pt = PowerTransformer()
df_log[['Price_Change_Pct', 'Volatility_Ratio']] = pt.fit_transform(
    df_log[['Price_Change_Pct', 'Volatility_Ratio']]
)

In [None]:
# Stationarity transformation (first differencing) for the time-series models ARIMA / SARIMA.
df_stationary = df_reduced[['Close']].diff().dropna()
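
A model fitted on the differenced series forecasts month-over-month changes, so its output has to be mapped back to price levels. The sketch below shows that first differencing is reversible with a cumulative sum. The values are made up for illustration.

```python
import pandas as pd

close = pd.Series([10.0, 12.0, 11.0, 15.0], name='Close')

# First difference: month-over-month change (the first value has no predecessor)
diffed = close.diff().dropna()

# Undo the differencing: cumulative sum of the changes plus the first observed price
restored = diffed.cumsum() + close.iloc[0]

print("changes: ", diffed.tolist())
print("restored:", restored.tolist())
```

The restored values match the original series from the second month onward.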


In [None]:
# Drop the raw Volatility_Ratio now that its log-transformed version exists, to avoid redundancy and multicollinearity.

df_final = df_log.drop(['Volatility_Ratio'], axis=1)


In [None]:
df_final.head()

##### Which method have you used to scale your data and why?

I used StandardScaler from sklearn.preprocessing to scale the numerical features.

StandardScaler transforms each feature so that it has a mean of 0 and a standard deviation of 1.

This puts all features on the same scale, which is important for distance-based algorithms (e.g., KNN, SVM) and for regularized linear models such as Ridge, whose penalty depends on coefficient magnitude. Tree-based models such as Random Forest and XGBoost are largely insensitive to feature scale.

It also helps gradient-based models converge faster and ensures that no single feature dominates the model purely because of its scale.

StandardScaler was a good fit because it preserves the shape of each feature's distribution while normalizing the scale, which suits the machine learning algorithms used in this project.
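
The notebook fits the scaler on the full dataset before splitting. A common alternative, sketched here on synthetic data, is to fit the scaler on the training rows only and reuse its statistics for the test rows, so that no information from the test period enters the scaling. The array `features` and the 80-row cut-off are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
features = rng.normal(loc=50.0, scale=12.0, size=(100, 2))  # synthetic feature matrix

# Chronological 80:20 split, then fit the scaler on the training rows only
train, test = features[:80], features[80:]
scaler = StandardScaler().fit(train)
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)  # reuses the training mean and standard deviation

print("train mean:", train_scaled.mean(axis=0).round(6))
print("train std: ", train_scaled.std(axis=0).round(6))
```

The training columns end up with mean 0 and standard deviation 1. The test columns will be close to, but not exactly at, those values.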

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Dimensionality reduction is not necessary here: the number of features is small, the redundant features have already been removed, and keeping the original features preserves model interpretability.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Not applicable, since no dimensionality reduction was performed.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

from sklearn.model_selection import train_test_split

# 1️⃣ Separate features and target
X = df_final.drop('Close', axis=1)  # All independent variables
y = df_final['Close']               # Dependent variable

# 2️⃣ Train-test split (80% train, 20% test for example)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)


##### What data splitting ratio have you used and why?

I used an 80:20 train-test split ratio.

80% for training: This ensures the model has sufficient data to learn patterns, relationships, and variance in the features, reducing underfitting risk.

20% for testing: This portion is kept completely unseen during training to evaluate the model’s generalization ability on new data.

The 80:20 split is a widely used balance in machine learning because:

- It provides enough data for both learning and reliable evaluation.
- It works well for datasets of moderate size: large enough to learn from, but not so small that evaluation metrics become unstable.
- It keeps the test set separate from training, so the reported metrics come from rows the model has not seen. Because train_test_split shuffles rows by default, this is a random split rather than a chronological one.

If the dataset were extremely small, I would have considered k-fold cross-validation instead to maximize training data while still validating performance.
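
For time-ordered data such as monthly prices, a chronological alternative to a shuffled split is scikit-learn's `TimeSeriesSplit`, in which every fold trains on earlier rows and tests on later ones. This is a minimal sketch on a placeholder index. The value of `n_months` and the number of splits are assumptions, not settings from the notebook.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_months = 24  # hypothetical number of monthly rows
X_demo = np.arange(n_months).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=4)
folds = list(tscv.split(X_demo))

for fold, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"fold {fold}: train 0..{train_idx[-1]}, test {test_idx[0]}..{test_idx[-1]}")
```

A single chronological hold-out can also be obtained by passing `shuffle=False` to `train_test_split`.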

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

For the Yes Bank stock price dataset, handling class imbalance is not required, because this is a regression and time series problem rather than a classification problem.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Not applicable, since the dataset did not need to be balanced.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


X = df_final.drop('Close', axis=1)
y = df_final['Close']

# Fit the Algorithm
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

# Predict on the model
y_pred = lr_model.predict(X_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

# Calculate the evaluation metrics
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print("Model: Linear Regression")
print(f"RMSE: {rmse:.2f}")
print(f"R² Score: {r2:.2f}")

In [None]:

metrics = ['RMSE', 'R² Score']
scores = [rmse, r2]  # from previous code

plt.bar(metrics, scores, color=['skyblue', 'lightgreen'])
plt.ylabel('Score')
plt.title('Model 1: Linear Regression Performance')
for i, score in enumerate(scores):
    plt.text(i, score + 0.01, f"{score:.2f}", ha='center')
plt.show()


Linear Regression is a supervised learning algorithm that models the relationship between a dependent variable (Close price in our case) and one or more independent variables (our other stock features) by fitting a straight line (or hyperplane in multiple dimensions) that minimizes the error between predicted and actual values.
The goal is to learn weights for each feature that best predict the target.
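
As a small illustration of "learning weights", the sketch below solves the least-squares problem directly with NumPy on synthetic, noiseless data and recovers the weights used to generate it. All names and values here are assumptions for illustration. scikit-learn's LinearRegression solves the same least-squares problem.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))  # two synthetic features
true_weights = np.array([2.0, -3.0])
true_intercept = 5.0
y_demo = X_demo @ true_weights + true_intercept  # noiseless target

# Append a column of ones so the intercept is learned as an extra weight
design = np.column_stack([X_demo, np.ones(len(X_demo))])
solution, *_ = np.linalg.lstsq(design, y_demo, rcond=None)
weights, intercept = solution[:2], solution[2]

print("weights:", weights.round(4), "intercept:", round(float(intercept), 4))
```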

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Define model
ridge = Ridge()

# Define hyperparameters to tune
param_grid = {
    'alpha': [0.01, 0.1, 1, 10, 100]
}

# GridSearch with Cross-Validation (5-fold)
grid_search = GridSearchCV(
    estimator=ridge,
    param_grid=param_grid,
    scoring='neg_mean_squared_error',
    cv=5,
    n_jobs=-1
)

# Fit on training data
grid_search.fit(X_train, y_train)

# Best model
best_ridge = grid_search.best_estimator_

# Predict
y_pred = best_ridge.predict(X_test)

# Evaluate
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print("Best Parameters:", grid_search.best_params_)
print("RMSE:", rmse)
print("R² Score:", r2)

# Cross-validation score
cv_scores = cross_val_score(best_ridge, X_train, y_train, cv=5, scoring='r2')
print("Cross-Validation R² Scores:", cv_scores)
print("Mean CV R² Score:", cv_scores.mean())


##### Which hyperparameter optimization technique have you used and why?

I used GridSearchCV for hyperparameter tuning.

Reason for Choosing GridSearchCV:

Exhaustive Search → GridSearchCV systematically tries every combination in the supplied hyperparameter grid, so the best setting within that grid is not missed.

Cross-Validation Integrated → It evaluates each hyperparameter set using k-fold cross-validation, which reduces overfitting risk and gives a better estimate of performance.

Deterministic Output → Given the same data and parameter grid, GridSearchCV produces consistent, reproducible results, which is important for reliable model comparison.

Best for Small Search Spaces → Since my parameter space (a single alpha value for Ridge Regression) was small, GridSearchCV was computationally cheap and finds the best configuration among the candidates listed.



##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

In [None]:
metrics = {
    'RMSE': 13.280393012884526,
    'R² Score': 0.9804876544056066,
    'Mean CV R² Score': 0.9649074401795883
}

# Create bar chart
plt.figure(figsize=(8, 5))
plt.bar(metrics.keys(), metrics.values(), color=['skyblue', 'lightgreen', 'salmon'])
plt.ylabel('Score / Value')
plt.title('Evaluation Metrics - Tuned Ridge Regression Model')

# Annotate bars
for i, (metric, value) in enumerate(metrics.items()):
    plt.text(i, value + 0.005, f'{value:.3f}', ha='center', fontsize=10)

plt.ylim(0, 1.1 * max(metrics.values()))  # Adjust Y limit
plt.show()

Hyperparameter tuning with GridSearchCV replaced plain Linear Regression with its regularized variant, Ridge, and searched over alpha. Test-set performance remained almost the same in terms of RMSE and R² (RMSE 13.28, R² 0.98), and the mean cross-validation R² of 0.96 indicates that the tuned model performs consistently across folds.

### ML Model - 2

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Model 2: Random Forest Regressor
rf_model = RandomForestRegressor(
    n_estimators=100,
    random_state=42
)

# Fit the model
rf_model.fit(X_train, y_train)

# Predict
y_pred_rf = rf_model.predict(X_test)

# Evaluation
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))
r2_rf = r2_score(y_test, y_pred_rf)

print("Model: Random Forest Regressor")
print(f"RMSE: {rmse_rf:.2f}")
print(f"R² Score: {r2_rf:.2f}")


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
# Metrics for Model 2
metrics = ['RMSE', 'R² Score']
values = [15.71, 0.97]

plt.figure(figsize=(6,4))
bars = plt.bar(metrics, values, color=['skyblue', 'orange'])

# Add values on top of bars
for bar, value in zip(bars, values):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
             f"{value:.2f}", ha='center', fontsize=10, fontweight='bold')

plt.title('Evaluation Metric Score Chart - Model 2', fontsize=14)
plt.ylabel('Score')
plt.ylim(0, max(values) + 5)  # Extra space above bars
plt.show()


The second model is a Random Forest Regressor, an ensemble learning method that constructs multiple decision trees during training and outputs the mean prediction of the individual trees. It is particularly effective at capturing complex, non-linear relationships between features and the target variable.

Random Forest reduces overfitting through averaging and does not require feature scaling, since tree splits depend only on the ordering of feature values.

The Random Forest model achieved an RMSE of 15.71 and an R² score of 0.97.

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# Define base model
rf = RandomForestRegressor(random_state=42)

# Define hyperparameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

# GridSearchCV for tuning
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,                  # 5-fold cross-validation
    scoring='r2',
    n_jobs=-1,
    verbose=1
)

# Fit model on training data
grid_search.fit(X_train, y_train)

# Best parameters
print("Best Parameters:", grid_search.best_params_)

# Best model
best_rf = grid_search.best_estimator_

# Predictions
y_pred = best_rf.predict(X_test)

# Evaluation metrics
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.2f}")
print(f"R² Score: {r2:.2f}")

# Cross-validation scores
cv_scores = cross_val_score(best_rf, X_train, y_train, cv=5, scoring='r2')
print("Cross-Validation R² Scores:", cv_scores)
print("Mean CV R² Score:", np.mean(cv_scores))

##### Which hyperparameter optimization technique have you used and why?

We used GridSearchCV for hyperparameter optimization.

Reason:
GridSearchCV exhaustively searches through a manually specified subset of hyperparameters, testing all possible combinations within the defined grid. It performs k-fold cross-validation (here, 5-fold) for each combination, which helps select parameters that generalize to unseen data rather than ones tuned to a single split.

This method was chosen because:

The parameter space for Random Forest was small enough to allow exhaustive search without excessive computational cost.

It ensures we do not miss the best-performing parameter set within the given grid.

It directly integrates cross-validation, which provides a reliable estimate of model performance for each hyperparameter configuration.
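
To put a number on the computational cost, the size of the search can be counted with scikit-learn's `ParameterGrid`. The sketch below uses the same Random Forest grid as the tuning cell above.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
}

n_candidates = len(ParameterGrid(param_grid))  # 3 * 3 * 2 * 2 combinations
n_folds = 5
n_fits = n_candidates * n_folds  # one fit per combination per fold

print(f"{n_candidates} candidates x {n_folds} folds = {n_fits} fits")
```

GridSearchCV then refits the best candidate once more on the whole training set, since `refit=True` is its default.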

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

In [None]:
# Metrics before and after tuning
metrics = ['RMSE', 'R² Score']
before_tuning = [15.71, 0.97]
after_tuning = [16.16, 0.97]

x = range(len(metrics))

plt.figure(figsize=(6, 4))
plt.bar(x, before_tuning, width=0.4, label='Before Tuning', align='center')
plt.bar([i + 0.4 for i in x], after_tuning, width=0.4, label='After Tuning', align='center')

plt.xticks([i + 0.2 for i in x], metrics)
plt.ylabel('Score')
plt.title('Evaluation Metric Score Chart - Model 2 (Tuned)')
plt.legend()
plt.show()

Tuning the Random Forest Regressor with GridSearchCV did not improve test-set performance: the RMSE increased marginally and the R² score was unchanged. The tuned model's cross-validation R² scores are high in four of the five folds, with one weaker fold at 0.9075.

| Metric | Before tuning | After tuning |
|---|---|---|
| RMSE | 15.71 | 16.16 |
| R² | 0.97 | 0.97 |

Mean CV R² Score (tuned model): 0.9700

CV R² Scores per fold (tuned model): [0.9075, 0.9884, 0.9856, 0.9816, 0.9871]

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

1. RMSE (Root Mean Squared Error)
What it means:
RMSE is the square root of the mean squared prediction error, expressed in the same units as the target variable. Lower RMSE means more accurate predictions.

In our context:
Since our target is the closing stock price, an RMSE of ~15 means the model's predictions are typically off by about ₹15 from the actual price, with larger errors weighted more heavily.

Business impact:

Lower RMSE → better price prediction accuracy → more reliable for decision-making (e.g., timing buy/sell).

High RMSE could lead to poor entry/exit decisions, affecting profitability.

2. R² Score (Coefficient of Determination)
What it means:
R² shows the proportion of variance in the target variable explained by the model. It is at most 1 and can be negative when the model performs worse than predicting the mean. Higher R² means the model fits the data better.

In our context:
An R² of ~0.97 means 97% of the variation in stock prices is explained by our model’s features.

Business impact:

High R² means the model captures most market trends and patterns.

This improves trust in predictions for business planning, risk management, and investment strategies.

3. Cross-Validation R² Score (Mean CV R²)
What it means:
Shows how well the model generalizes to unseen data by training and testing on multiple subsets.

In our context:
A mean CV R² of ~0.97 suggests the model performs consistently across folds and is not overfitting to a single subset of the historical data.

Business impact:

Gives evidence of reliability across different subsets of the historical data.

Reduces risk of making trading decisions based on over-optimistic models.

Overall Business Impact
High R² + Low RMSE → Model can accurately and consistently predict stock prices, aiding in:

Better trading signals → optimized buy/sell points.

Risk mitigation → avoiding large price surprises.

Confidence in automated strategies → reduced manual intervention.
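
The two headline metrics discussed above can be computed by hand in a few lines, which makes their meaning easier to check. The prices and predictions below are made-up values, not model output.

```python
import numpy as np

y_true = np.array([100.0, 200.0, 300.0])  # illustrative prices
y_hat = np.array([110.0, 190.0, 300.0])   # illustrative predictions

errors = y_true - y_hat
rmse_manual = np.sqrt(np.mean(errors ** 2))

ss_res = np.sum(errors ** 2)                    # variation the model leaves unexplained
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2_manual = 1.0 - ss_res / ss_tot

print(f"RMSE = {rmse_manual:.3f}, R² = {r2_manual:.3f}")
```

Here two of the three predictions miss by 10, giving an RMSE of about 8.165 and an R² of 0.99.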

### ML Model - 3

In [None]:
pip install --upgrade xgboost


In [None]:
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Define model (no early stopping to avoid compatibility issues)
xgb_model = XGBRegressor(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    eval_metric='rmse'  # Works in older versions
)

# Fit model normally
xgb_model.fit(X_train, y_train)

# Predict
y_pred = xgb_model.predict(X_test)

# Evaluate
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"RMSE: {rmse:.2f}")
print(f"R² Score: {r2:.2f}")




#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

import matplotlib.pyplot as plt

# Replace with your actual values
rmse = 13.21
r2 = 0.98

metrics = {'RMSE': rmse, 'R² Score': r2}
plt.bar(metrics.keys(), metrics.values(), color=['orange', 'green'])
plt.ylabel('Score')
plt.title('XGBoost Model Evaluation Metrics')
plt.ylim(0, max(metrics.values()) + 1)
plt.show()


We implemented XGBoost Regressor, which is an optimized gradient boosting framework designed for speed and performance. It builds an ensemble of decision trees sequentially, where each new tree corrects errors made by the previous ones.
XGBoost is well-suited for tabular datasets, handles missing values internally, and can capture complex non-linear relationships in the data.
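
The idea of each new tree correcting the errors of the previous ones can be sketched with plain scikit-learn decision trees fitted to residuals. This is ordinary gradient boosting for squared error on synthetic data, not XGBoost's own regularized implementation. The learning rate, tree depth, and number of rounds are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)  # synthetic non-linear target

learning_rate = 0.1
prediction = np.full_like(y_demo, y_demo.mean())  # start from the mean
mse_start = np.mean((y_demo - prediction) ** 2)

for _ in range(50):
    residual = y_demo - prediction  # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, residual)
    prediction += learning_rate * tree.predict(X_demo)  # small corrective step

mse_end = np.mean((y_demo - prediction) ** 2)
print(f"training MSE: {mse_start:.4f} -> {mse_end:.4f}")
```

Each round reduces the training error a little, which is why the number of trees and the learning rate are tuned together.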

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Define the model
xgb_model = XGBRegressor(objective='reg:squarederror', random_state=42)

# Parameter grid for tuning
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7, None],
    'learning_rate': [0.01, 0.05, 0.1],
    'subsample': [0.8, 1.0]
}

# GridSearchCV
grid_search = GridSearchCV(
    estimator=xgb_model,
    param_grid=param_grid,
    scoring='r2',
    cv=5,
    verbose=1,
    n_jobs=-1
)

# Fit the model
grid_search.fit(X_train, y_train)

# Best parameters
print("Best Parameters:", grid_search.best_params_)

# Predict
y_pred = grid_search.predict(X_test)

# Evaluation
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print("RMSE:", rmse)
print("R² Score:", r2)

# Cross-validation scores
cv_scores = cross_val_score(grid_search.best_estimator_, X_train, y_train, cv=5, scoring='r2')
print("Cross-Validation R² Scores:", cv_scores)
print("Mean CV R² Score:", cv_scores.mean())


##### Which hyperparameter optimization technique have you used and why?

We used GridSearchCV for hyperparameter optimization.

Reason: GridSearchCV exhaustively searches through a manually specified subset of hyperparameters, testing all possible combinations within the defined grid. It performs k-fold cross-validation (here, 5-fold) for each combination, which helps select parameters that generalize to unseen data rather than ones tuned to a single split.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

No notable improvement was seen: the RMSE went from 13.21 to 13.14, and the R² score remained at 0.98.

In [None]:
# Replace with your actual values
rmse = 13.14
r2 = 0.98

metrics = {'RMSE': rmse, 'R² Score': r2}
plt.bar(metrics.keys(), metrics.values(), color=['orange', 'green'])
plt.ylabel('Score')
plt.title('XGBoost Model Evaluation Metrics')
plt.ylim(0, max(metrics.values()) + 1)
plt.show()

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

I chose RMSE because it penalizes larger errors more heavily than MAE, since errors are squared before averaging. This is important because large prediction errors in stock prices can lead to disproportionately large financial losses.

Business Impact:

Reducing large errors is critical to avoid catastrophic investment decisions, making RMSE valuable for risk-sensitive scenarios.

I also chose R² because it measures the proportion of variance in the actual stock prices that the model's predictions explain.

It is at most 1, and it can be negative if the model performs worse than a simple mean prediction.

Business Impact:

R² is useful as a general model quality check but should not be the sole metric for business decisions.

It should be used in combination with MAE, RMSE, directional accuracy, and profit simulation metrics for a fuller picture.
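
The difference between MAE and RMSE is easiest to see on two error profiles with the same average absolute error. The helpers `mae_of` and `rmse_of` and the error values are hypothetical, chosen only to illustrate the point.

```python
import numpy as np

def mae_of(errors):
    """Mean absolute error of an array of prediction errors."""
    return float(np.mean(np.abs(errors)))

def rmse_of(errors):
    """Root mean squared error of an array of prediction errors."""
    return float(np.sqrt(np.mean(np.square(errors))))

# Two illustrative error profiles with the same total absolute error
steady_errors = np.array([5.0, 5.0, 5.0, 5.0])  # always off by 5
spiky_errors = np.array([0.0, 0.0, 0.0, 20.0])  # usually exact, once off by 20

print("steady: MAE", mae_of(steady_errors), "RMSE", rmse_of(steady_errors))
print("spiky:  MAE", mae_of(spiky_errors), "RMSE", rmse_of(spiky_errors))
```

Both profiles have an MAE of 5, but the single large miss doubles the RMSE to 10.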

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

I chose the tuned XGBRegressor as my final prediction model. It achieved an RMSE of 13.14 and an R² score of 0.98, the lowest error among the models implemented above.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

In [None]:
pip install shap

In [None]:
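import shap
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

target = 'Close'

# Prepare features and target from the full feature set in df
X = df.drop(columns=[target])
y = df[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Note: this explains a Random Forest fitted on all columns of df,
# not the tuned XGBRegressor selected above
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# TreeExplainer computes SHAP values for tree-based models
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Summary plot: one row per feature, one point per test sample
shap.summary_plot(shap_values, X_test)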


Features like Low, Avg_Price, and High have the widest spread of SHAP values, meaning they influence the model's prediction of the Close price most strongly.

Avg_Price and High also show a trend where higher feature values push the prediction higher.

Features like Price_Change_Pct, MA10, and Volatility_Ratio have little impact, since their SHAP values cluster around zero.
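
A SHAP ranking can be cross-checked with permutation importance from scikit-learn, which is a different technique: it shuffles one feature at a time and measures how much the model's score drops. The sketch below runs on synthetic data with one informative column and one noise column. It is an illustration of the method, not a re-analysis of the notebook's features.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 2))  # column 0: signal, column 1: noise
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)
result = permutation_importance(forest, X_demo, y_demo, n_repeats=5, random_state=0)

for name, score in zip(['signal', 'noise'], result.importances_mean):
    print(f"{name}: mean drop in R² when shuffled = {score:.3f}")
```

A feature whose shuffling barely changes the score plays the same role as a feature whose SHAP values cluster around zero.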

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
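
The two cells above are left empty in the notebook. A minimal sketch of the save-and-reload round trip with joblib could look like the following. It uses a small Ridge model on synthetic data as a stand-in for the tuned model, and the file name `best_model.joblib` is a hypothetical choice.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 3))
y_demo = X_demo @ np.array([1.0, 2.0, 3.0])

stand_in_model = Ridge(alpha=1.0).fit(X_demo, y_demo)  # stand-in for the tuned model

# Save the fitted model, then load it back from disk
model_path = os.path.join(tempfile.mkdtemp(), 'best_model.joblib')  # hypothetical file name
joblib.dump(stand_in_model, model_path)
loaded_model = joblib.load(model_path)

# Sanity check: the reloaded model reproduces the original predictions
same_predictions = np.allclose(stand_in_model.predict(X_demo), loaded_model.predict(X_demo))
print("Predictions match after reload:", same_predictions)
```

In the notebook, the object passed to `joblib.dump` would be the fitted best estimator, and the sanity check would use rows held out from training.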

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

In conclusion, the tuned XGBRegressor is the model to use for predicting the stock's closing price: it gave the best results of the models compared, with an RMSE of 13.14 and an R² score of 0.98, indicating a small error between predicted and actual values.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***