# **Project Name**    -



##### **Project Type**    - Regression
##### **Contribution**    - Individual
##### **Team Member 1 -** Shreyash Pal


# **Project Summary -**

The financial market is a dynamic environment where stock prices are influenced by many factors, including economic conditions, political developments, company performance, and unforeseen events such as fraud cases. This project focuses on predicting the monthly closing stock price of Yes Bank, a prominent private sector bank in India that has faced considerable volatility in recent years, particularly after the 2018 financial fraud involving its co-founder, Rana Kapoor. This event sparked significant interest in understanding how a company's stock behaves under adverse conditions and how machine learning techniques can be used to predict such behavior.

The primary objective is to use regression models to forecast the closing stock price of Yes Bank from historical stock data. The dataset contains monthly records of the stock's opening price, its highest and lowest prices during the month, and the closing price, which is the target variable. The aim is to build a robust model that learns from historical trends and predicts the closing price accurately from the input features Open, High, and Low.

# **GitHub Link -**

https://github.com/ShreyashPal88/Labmentix_YES_Bank_Project_Shreyash

# **Problem Statement**


The goal of this project is to develop a predictive regression model that accurately forecasts the monthly closing stock price of Yes Bank using historical stock market data. The dataset includes features such as the stock's opening price, highest price, and lowest price within a month. The target variable is the closing price.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
df = pd.read_csv('/content/drive/MyDrive/data_YesBank_StockPrices.csv')

### Dataset First View

In [None]:
df

In [None]:
display(df.head())

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("Dataset Shape:", df.shape)
print("\nNumber of Rows:", df.shape[0])
print("Number of Columns:", df.shape[1])

### Dataset Information

In [None]:
# Dataset Info
print("\nColumn Data Types:")
print(df.dtypes)

In [None]:
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_count = df.duplicated().sum()
print("\nNumber of Duplicate Rows:", duplicate_count)

#### Missing Values/Null Values

In [None]:

# Missing Values/Null Values Count
print("\nMissing Values:")
print(df.isnull().sum())

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values Heatmap')
plt.show()

### What did you know about your dataset?

These exploratory data analysis (EDA) steps tell us the following about the Yes Bank stock prices dataset:

*   **Basic dataset overview:** The number of rows and columns comes from `df.shape`. The columns hold the date and the stock's open, high, low, and close prices, and `df.dtypes` shows which columns are numerical and which are stored as strings.
*   **Missing values:** `df.isnull().sum()` and the heatmap show whether and where values are missing. If any were present, one option would be to fill them with the column mean (`df.fillna(df.mean(numeric_only=True))`).
*   **Duplicate values:** `df.duplicated().sum()` counts duplicate rows. If necessary, they can be removed with `df.drop_duplicates(inplace=True)`.
*   **Data distribution and outliers:** Histograms show the distribution of the numerical columns, and box plots help detect potential outliers, which can be examined further with the interquartile range (IQR).
*   **Correlations between features:** A correlation heatmap reveals relationships between numerical columns. Where strong correlations exist, we may decide to remove or engineer features.
*   **Stock price trends:** Because the records are dated, trends over time can be analyzed later as a time series.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print("\nColumn Names:")
print(df.columns.tolist())
print(df.dtypes)

In [None]:
# Dataset Describe
print("\nDataset Summary Statistics:")
display(df.describe())

### Variables Description

Variable - Description

Date - The month and year of the stock price record.

Open - The opening price of the stock for the month.

High - The highest price of the stock during the month.

Low - The lowest price of the stock during the month.

Close - The closing price of the stock for the month (the target variable).

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for column in df.columns:
    print(f"Column: {column}")
    print(f"Unique Values: {df[column].unique()}")
    print("\n")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
import pandas as pd

# Work on the DataFrame loaded from the CSV above.
# (Do not overwrite df with sample data, or the real dataset is lost.)

# Step 1: Convert the 'Date' column (month-year strings such as 'Jan-20') to datetime
df['Date'] = pd.to_datetime(df['Date'], format='%b-%y')

# Step 2: Check for missing values
missing_values = df.isnull().sum()

# Step 3: Remove duplicate rows, then sort chronologically
df = df.drop_duplicates().sort_values(by='Date')

# Step 4: Extract Year and Month for further analysis
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month

# Step 5: Set 'Date' as the index so the charts below can plot against df.index
df = df.set_index('Date')

# Step 6: Validate data types
data_types = df.dtypes

# Display results
print("Missing Values:\n", missing_values)
print("\nData Types:\n", data_types)
print("\nDataFrame Head:\n", df.head())

### What all manipulations have you done and insights you found?

We performed several data wrangling and exploratory steps to make sure the dataset is clean, structured, and ready for further analysis. Below is a summary of the manipulations and the key insights derived.

Data Manipulations Performed:

1️⃣ Data Loading and Inspection

*   Loaded the dataset using pandas (`pd.read_csv()`).
*   Displayed the first few rows using `df.head()`.
*   Checked dataset structure using `df.info()`.
*   Generated summary statistics using `df.describe()`.

🔎 Insights:

*   ✔ The dataset has 185 rows and 5 columns: Date, Open, High, Low, and Close.
*   ✔ Date was stored as a string (object) instead of a datetime format.
*   ✔ Other columns were numerical.

2️⃣ Data Cleaning

*   Converted the Date column to datetime format.
*   Checked for missing values. Result: no missing values were found.
*   Checked for duplicate records. Result: no duplicate rows were found.
*   Sorted the dataset by Date.
*   Set the Date column as the index for time-series analysis.

Insights:

*   ✔ No missing or duplicate values were found.
*   ✔ Sorting by Date ensures correct time-series ordering.

3️⃣ Handling Outliers Used Interquartile Range (IQR) method to detect and remove extreme values.

Insights: ✔ Outliers were detected and removed to improve data reliability. ✔ This helps avoid misleading statistical trends.
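The IQR rule mentioned above can be sketched as follows. This is a minimal illustration, not the notebook's actual outlier step: the helper name `iqr_bounds`, the multiplier `k=1.5`, and the sample prices are assumptions made for the example.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences of the IQR rule for a numeric Series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical closing prices, for illustration only (not taken from the dataset)
sample_close = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 95.0], name="Close")

lower, upper = iqr_bounds(sample_close)

# Flag values outside the fences instead of dropping them immediately
outlier_mask = (sample_close < lower) | (sample_close > upper)

print(f"Fences: [{lower:.2f}, {upper:.2f}]")
print("Flagged values:", sample_close[outlier_mask].tolist())
```

Flagging first and deciding later is a deliberate choice here: in a price series, an extreme value may be a genuine market move and not a data error.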

Final Summary of Insights

🔹 No missing or duplicate values, ensuring data integrity.

🔹 Stock prices fluctuate, requiring trend analysis.

🔹 Outliers were removed to avoid distorted trends.

🔹 High correlation between stock prices indicates synchronized movements.

🔹 Moving averages help in identifying trends for better decision-making.

🔹 Daily returns provide insight into market volatility.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(12, 6))
plt.plot(df.index, df['Close'], color='blue', label='Closing Price')
plt.title('YES Bank Stock Closing Price Over Time')
plt.xlabel('Date')
plt.ylabel('Closing Price (INR)')
plt.grid(True)
plt.legend()
plt.show()

##### 1. Why did you pick the specific chart?

The Closing Price Over Time chart was chosen to observe long-term trends in YES Bank's stock performance.

The Daily Change in Closing Price chart helps visualize short-term volatility.

The Missing Values Heatmap was used to check data quality before analysis.

##### 2. What is/are the insight(s) found from the chart?

The Closing Price Over Time chart likely reveals trends such as steady growth, decline, or sudden fluctuations.

The Daily Change in Closing Price chart indicates volatility, showing sharp price changes that may suggest market instability.

The Missing Values Heatmap helps detect gaps in data that might affect analysis accuracy.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:

Identifying trends in stock prices helps investors make informed decisions. Understanding volatility patterns can aid risk assessment and strategic trading.

Negative Growth Insights:

If the stock shows consistent downward trends or extreme volatility, it may signal instability, deterring investors. Large missing values in the dataset can lead to unreliable insights, affecting decision-making.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Assuming df is the DataFrame with the stock data
# Calculate the month-over-month change in closing price
# (the column keeps its original name, but the records are monthly)
df['Daily_Change'] = df['Close'].diff()

# Create a bar chart of the monthly change
plt.figure(figsize=(12, 6))
plt.bar(df.index, df['Daily_Change'], color='red', label='Monthly Change')

plt.title('YES Bank Monthly Change in Closing Price')
plt.xlabel('Date')
plt.ylabel('Monthly Change (INR)')
plt.grid(True)
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for clarity

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart was chosen because it effectively represents month-to-month fluctuations in the closing price.

It visually highlights the magnitude and frequency of positive and negative price changes.

This helps in understanding stock volatility and identifying patterns in price movements.

##### 2. What is/are the insight(s) found from the chart?

The chart shows how frequently the stock price changes and by how much.

Large spikes (both positive and negative) indicate high volatility.

Periods of small or no changes suggest stability in stock price movement.

A sustained run of negative monthly changes indicates an overall declining stock value.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact: Traders and investors can use this information to plan entry and exit points. High volatility presents trading opportunities for short-term gains.

Negative Growth Insights: If frequent large negative bars are observed, it indicates instability, which may lead to a loss of investor confidence.

Erratic movements may suggest uncertainty in the market, making long-term investments risky.

#### Chart - 3

In [None]:
import matplotlib.pyplot as plt

# Assuming 'df' is your DataFrame with the stock data
plt.figure(figsize=(10, 6))

# Create a scatter plot of high vs. low prices
plt.scatter(df['High'], df['Low'], color='purple', alpha=0.7)

plt.title('Relationship between Monthly High and Low Prices')
plt.xlabel('Monthly High Price (INR)')
plt.ylabel('Monthly Low Price (INR)')
plt.grid(True)

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot was chosen because it effectively shows the relationship between two numerical variables: the monthly high and low stock prices.

This chart helps identify patterns, correlations, and outliers in price movements.

It visually represents the range of price fluctuations within each month, which is crucial for traders and analysts.

##### 2. What is/are the insight(s) found from the chart?

A strong positive correlation between monthly high and low prices indicates consistency in stock movement.

If the points closely follow a diagonal trend, it suggests low volatility: the gap between each month's high and low stays small relative to the price level.

A wide spread in data points suggests higher price volatility, indicating unstable market conditions.

Outliers (high points far from the trend) may signal sudden price jumps or crashes, potentially caused by external market events or investor sentiment shifts.
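The strength of the relationship described above can also be quantified. The sketch below computes the Pearson correlation and the monthly high-low range on a small hypothetical sample; the `prices` frame and its values are assumptions for illustration, and in the notebook the same calls would apply to `df`.

```python
import pandas as pd

# Hypothetical monthly prices, for illustration only (not taken from the dataset)
prices = pd.DataFrame({
    "High": [14.0, 15.5, 17.0, 16.2, 18.9],
    "Low": [11.0, 12.8, 13.5, 13.0, 15.1],
})

# Pearson correlation between the monthly high and low (close to 1 = tight diagonal)
r = prices["High"].corr(prices["Low"])

# The monthly range is a simple measure of how far apart high and low are
prices["Range"] = prices["High"] - prices["Low"]

print(f"Correlation between High and Low: {r:.3f}")
print(prices)
```

A correlation near 1 together with a small range relative to the price level corresponds to points that hug the diagonal in the scatter plot.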

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:

If the high-low relationship is stable, it helps traders set realistic stop-loss and take-profit points. Investors can assess risk levels by observing volatility patterns over time.

Negative Growth Insights:

High volatility with extreme fluctuations may indicate an unpredictable market, deterring long-term investors. If the high prices increase but low prices remain stagnant, it may signal price manipulation or speculative trading rather than genuine growth.

#### Chart - 4

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Assuming 'df' is your DataFrame with the stock data
plt.figure(figsize=(10, 6))

# Create a box plot of monthly closing prices
sns.boxplot(data=df, y='Close', color='yellow')

plt.title('Distribution of Monthly Closing Prices')
plt.ylabel('Closing Price (INR)')
plt.grid(True)

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A box plot was chosen because it effectively shows the distribution, variability, and outliers in monthly closing prices.

It helps in identifying median prices, interquartile range (IQR), and potential extreme values.

This visualization is useful for detecting market trends, volatility, and unusual price movements.

##### 2. What is/are the insight(s) found from the chart?

The median closing price provides a central value, helping to understand the stock’s typical price level.

The spread (IQR) indicates how much the prices fluctuate on a regular basis.

Presence of outliers (dots outside the whiskers) suggests extreme price movements, which could be caused by news events, investor panic, or high volatility.

A narrow IQR suggests stable stock performance, while a wide IQR indicates high variability.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

 Positive Impact:

If the box plot shows a narrow and stable distribution, it indicates a less volatile stock, which attracts long-term investors. Identifying outliers and price swings can help traders develop risk management strategies.

Negative Growth Insights:

If the stock has many extreme outliers, it suggests unpredictable fluctuations, which can discourage long-term investments. A widening IQR over time may indicate increasing risk, making the stock less attractive for conservative investors.

#### Chart - 5

In [None]:
import matplotlib.pyplot as plt

# Assuming 'df' is your DataFrame with the stock data
plt.figure(figsize=(10, 6))

# Create a scatter plot of opening vs. closing prices
plt.scatter(df['Open'], df['Close'], color='green', alpha=0.7)

plt.title('Relationship between Opening and Closing Prices')
plt.xlabel('Opening Price (INR)')
plt.ylabel('Closing Price (INR)')
plt.grid(True)

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot was chosen to visualize the relationship between opening and closing prices of the stock.

This chart helps identify how closely the opening price predicts the closing price and whether there is a strong correlation.

It allows for spotting outliers or anomalies, such as significant gaps between opening and closing prices.

##### 2. What is/are the insight(s) found from the chart?

If the points align closely along a diagonal line, it indicates a strong correlation, meaning the stock’s opening price is a good predictor of its closing price.

A wide spread suggests greater within-month volatility, meaning significant fluctuations happen between the open and the close of a single month.

If several points deviate significantly from the diagonal, it could indicate market events, investor sentiment shifts, or sudden external influences affecting closing prices.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Positive Impact: A strong correlation between opening and closing prices can help traders and investors predict market movements more effectively. Understanding within-month volatility helps traders plan better risk management strategies.

Negative Growth Insights: If there are large deviations, it suggests unpredictability and risk, which can discourage conservative investors. If opening prices are consistently higher than closing prices, it might indicate selling pressure, leading to a bearish trend in the stock.

#### Chart - 6

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Assuming 'df' is your DataFrame with the stock data
# Calculate the month-over-month percentage change in closing price
# (the column keeps its original name, but the records are monthly)
df['Daily_Percentage_Change'] = df['Close'].pct_change() * 100

# Create a histogram of the monthly percentage change
plt.figure(figsize=(10, 6))
plt.hist(df['Daily_Percentage_Change'].dropna(), bins=30, color='green', edgecolor='black')

plt.title('Distribution of Monthly Percentage Change in Closing Price')
plt.xlabel('Monthly Percentage Change (%)')
plt.ylabel('Frequency')
plt.grid(True)

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A histogram was chosen because it effectively shows the distribution of monthly percentage changes in closing prices.

This helps in understanding the volatility of the stock by showing how frequently different levels of price changes occur.

It provides insights into whether stock returns follow a normal distribution or exhibit extreme fluctuations.

##### 2. What is/are the insight(s) found from the chart?

If the histogram has a bell-shaped curve, it indicates that most price changes are small, with extreme changes being rare.

A wide spread suggests high volatility, meaning the stock experiences frequent large price swings.

If the histogram is skewed, it may indicate that the stock has more frequent gains or losses, influencing investment strategies.

Presence of extreme values (long tails) suggests occasional large price swings, which could be driven by market news, investor sentiment, or external events.
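Skewness and tail weight can be checked numerically as well as visually. The sketch below uses pandas' `Series.skew()` and `Series.kurt()` (excess kurtosis) on percentage changes computed from a hypothetical price series; the sample values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical closing prices with one sharp drop, for illustration only
close = pd.Series([100.0, 104.0, 98.0, 101.0, 60.0, 63.0, 66.0])

# Month-over-month percentage change; the first value is NaN and is dropped
returns = close.pct_change().dropna() * 100

skewness = returns.skew()         # negative: the left tail (losses) is longer
excess_kurtosis = returns.kurt()  # above 0: heavier tails than a normal distribution

print(f"Skewness: {skewness:.2f}")
print(f"Excess kurtosis: {excess_kurtosis:.2f}")
```

A single large loss is enough to pull the skewness below zero, which is the pattern the histogram would show as a long left tail.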

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:

If the histogram shows a concentrated range of small monthly changes, it suggests a stable stock, which attracts long-term investors. Understanding volatility helps traders and portfolio managers develop better risk management and hedging strategies.

Negative Growth Insights:

If the histogram has a high frequency of large negative price changes, it suggests frequent stock declines, which can lead to loss of investor confidence. High volatility can discourage risk-averse investors, making it harder for the company to attract long-term capital.

#### Chart - 7

In [None]:
plt.figure(figsize=(12, 6))
df["Close"].plot(kind="area", alpha=0.5)
plt.title("Stock Trend (Area Chart)")
plt.xlabel("Date")
plt.ylabel("Closing Price")
plt.show()

##### 1. Why did you pick the specific chart?

An area chart was chosen because it effectively represents the trend of closing prices over time while also highlighting the magnitude of changes.

The filled area provides a clear visual of stock fluctuations, making it easy to spot trends, peaks, and dips.

This chart is useful for identifying long-term patterns, such as uptrends, downtrends, and periods of stability.

##### 2. What is/are the insight(s) found from the chart?

The general trend of the stock price is visible—whether it is increasing, decreasing, or fluctuating.

If the area steadily rises, it suggests positive growth and strong investor confidence. A declining area may indicate a bearish trend, meaning the stock is losing value over time.

If the chart shows high volatility with frequent peaks and dips, it suggests market instability or speculative trading activity.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:

If the stock is in an uptrend, it reassures investors and encourages further investment. Recognizing long-term trends helps businesses and traders make informed investment decisions.

Negative Growth Insights:

A continuous downward trend signals declining investor confidence, possibly leading to lower market valuation. High volatility without a clear trend can indicate market uncertainty, making it risky for long-term investors.

#### Chart - 8

In [None]:
plt.figure(figsize=(10, 6))
sns.violinplot(data=df[["Open", "High", "Low", "Close"]])
plt.title("Price Spread (Violin Plot)")
plt.ylabel("Price")
plt.show()

##### 1. Why did you pick the specific chart?

A violin plot was chosen because it combines the benefits of a box plot and a density plot, allowing us to see both the distribution and spread of stock prices (Open, High, Low, Close).

It provides insights into price variation, volatility, and the probability density of stock prices, helping investors understand how prices are distributed.

Unlike a box plot, a violin plot also shows the shape of the distribution, making it easier to detect multimodal distributions (multiple peaks).

##### 2. What is/are the insight(s) found from the chart?

The width of the violin plot at different price levels represents the frequency of price occurrences.

A wider section means prices frequently stay around that value, while a narrower section indicates fewer occurrences.

If the distributions for Open, High, Low, and Close prices differ significantly, it suggests high volatility in stock performance.

The presence of long tails indicates extreme values or outliers, suggesting significant price swings in certain months.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:

Helps traders understand how stock prices fluctuate, allowing them to adjust their strategies.

If the violin plot is narrow and concentrated, it indicates stable price movements, which is reassuring for investors.

Negative Growth Insights:

If the violin plot shows high variability with long tails, it suggests unpredictability and high risk, discouraging risk-averse investors. A very asymmetrical distribution in Open vs. Close prices may indicate a bearish or highly speculative market.

#### Chart - 9

In [None]:
plt.figure(figsize=(10, 6))
sns.kdeplot(df["Close"], fill=True)
plt.title("Closing Price Density (KDE)")
plt.xlabel("Closing Price")
plt.show()

##### 1. Why did you pick the specific chart?

A Kernel Density Estimate (KDE) plot was chosen because it provides a smooth estimate of the distribution of closing prices. It shows how often each price level occurs, not how prices move over time.

Unlike a histogram, a KDE plot does not depend on bin size or bin edges, although its shape still depends on the chosen bandwidth. It offers a clearer view of how frequently different prices occur.

It helps in identifying the most common closing prices and the peaks in the data.

##### 2. What is/are the insight(s) found from the chart?

The peak(s) in the KDE plot indicate price levels where the stock frequently closes, suggesting areas of support or resistance.

A wide spread suggests high volatility, while a narrow, tall distribution indicates stability in stock performance.

If the KDE is right-skewed (long tail on the right), it indicates occasional high closing prices, possibly due to sudden rallies.

If the KDE is left-skewed, it suggests more frequent lower closing prices, indicating bearish market trends.
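To make the role of the bandwidth concrete, here is a minimal NumPy sketch of a one-dimensional Gaussian KDE. It is not seaborn's implementation: the function name `gaussian_kde_1d`, the Scott-style bandwidth rule, and the sample prices are assumptions chosen for illustration.

```python
import numpy as np

def gaussian_kde_1d(samples, grid):
    """Evaluate a Gaussian KDE of `samples` on `grid`; returns (density, bandwidth)."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    # Scott-style rule of thumb: sample standard deviation times n^(-1/5)
    bandwidth = samples.std(ddof=1) * n ** (-1 / 5)
    # One standard-normal kernel per sample, evaluated at every grid point
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernel.sum(axis=1) / (n * bandwidth), bandwidth

# Hypothetical closing prices with two clusters, for illustration only
sample_close = np.array([12.0, 13.5, 14.0, 15.0, 15.5, 30.0, 31.0, 32.5])

# Pad the grid so the tails of the outermost kernels are covered
pad = 4 * sample_close.std(ddof=1)
grid = np.linspace(sample_close.min() - pad, sample_close.max() + pad, 500)

density, bandwidth = gaussian_kde_1d(sample_close, grid)

# A valid density integrates to about 1 (simple Riemann sum on the uniform grid)
area = density.sum() * (grid[1] - grid[0])
print(f"Bandwidth: {bandwidth:.2f}, area under curve: {area:.3f}")
```

A smaller bandwidth would separate the two clusters into sharper peaks, while a larger one would merge them, which is why the number of peaks in a KDE plot should be read with the bandwidth in mind.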

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:

Identifying common closing price ranges helps traders make informed buy/sell decisions. Recognizing price concentration areas can help investors set realistic entry and exit points.

Negative Growth Insights:

If the KDE plot shows multiple peaks, it may indicate an unstable stock with erratic price behavior, discouraging long-term investment. A left-skewed distribution may suggest a declining stock trend, which could reduce investor confidence.

#### Chart - 10

In [None]:
plt.figure(figsize=(12, 6))
df["Close"].rolling(window=5).mean().plot(label="5-Month MA", color="red")
df["Close"].plot(alpha=0.6)
plt.legend()
plt.title("5-Month Moving Average")
plt.xlabel("Date")
plt.ylabel("Closing Price")
plt.show()

##### 1. Why did you pick the specific chart?

A line chart with a moving average overlay was chosen because it helps identify trends in closing prices while smoothing out short-term fluctuations.

The 5-month moving average (MA) provides a clearer view of the stock's general direction, reducing noise from month-to-month price changes.

This is a useful trend-following indicator that helps investors and traders assess the momentum and stability of the stock.

##### 2. What is/are the insight(s) found from the chart?

If the moving average is sloping upward, it indicates a bullish trend, meaning the stock price is generally increasing.

A downward-sloping MA suggests a bearish trend, meaning the stock is losing value over time.

If the closing price consistently stays above the moving average, it signals strong market sentiment and potential price growth.

If the price crosses below the moving average, it may indicate a trend reversal or weakening stock momentum.
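The crossover signals described above can be located programmatically. The following sketch uses a hypothetical monthly series and the same 5-month window; the variable names and sample values are assumptions for illustration, not results from the dataset.

```python
import pandas as pd

# Hypothetical monthly closing prices, for illustration only
close = pd.Series(
    [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 12.0, 9.0, 8.0, 13.0],
    index=pd.date_range("2020-01-01", periods=10, freq="MS"),
    name="Close",
)

# 5-month simple moving average; the first four values are NaN
ma = close.rolling(window=5).mean()

# True where the price is above its moving average (only where the MA is defined)
above = (close > ma)[ma.notna()]

# A crossover is any month where the above/below state flips
changed = above.astype(int).diff().fillna(0) != 0
crossovers = above[changed]  # True = crossed above the MA, False = crossed below

print(crossovers)
```

In this sample the price falls below the moving average in July 2020 and moves back above it in October 2020, so `crossovers` holds two entries.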

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:

Identifying trend direction allows investors to make informed decisions about buying, holding, or selling the stock. A sustained upward trend boosts investor confidence, leading to more capital inflow into the stock.

Negative Growth Insights:

If the stock price is consistently below the moving average, it suggests weak momentum and declining investor confidence. Frequent crossovers (price moving above and below the MA) indicate high volatility, making it difficult for investors to predict future trends reliably.

#### Chart - 11

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Assuming df is your DataFrame and it has a 'Close' column
df["Monthly Return"] = df["Close"].pct_change() * 100

# Drop NaN values that result from pct_change
df.dropna(inplace=True)

# Ensure the index is in datetime format
df.index = pd.to_datetime(df.index)

plt.figure(figsize=(12, 6))
sns.barplot(x=df.index.strftime('%b-%y'), y=df["Monthly Return"])
plt.title("Monthly Returns (%)")
plt.xlabel("Month-Year")
plt.ylabel("Return (%)")
plt.xticks(rotation=90)
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart was chosen because it effectively displays the monthly return percentages over time, making it easy to compare performance across different months.

This visualization helps identify periods of strong growth and decline, revealing seasonal trends or significant market movements.

The use of percentage changes (rather than absolute values) allows for a better understanding of relative stock performance.

##### 2. What is/are the insight(s) found from the chart?

Months with high positive returns indicate strong stock performance, possibly due to market optimism, earnings reports, or external factors.

Months with negative returns highlight downturns, which may be caused by economic conditions, company-specific issues, or broader market trends.

If the returns fluctuate significantly, it suggests high volatility, which could be risky for investors. A pattern of consistent positive or negative returns can indicate a long-term trend in the stock's performance.
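As a small worked example of how these monthly returns are computed, the sketch below checks `pct_change()` against the explicit formula (P_t / P_{t-1} - 1) * 100. The three prices are hypothetical and chosen so the arithmetic is easy to follow.

```python
import pandas as pd

# Hypothetical closing prices for three consecutive months
close = pd.Series([100.0, 110.0, 99.0], name="Close")

# Built-in percentage change, scaled to percent
returns = close.pct_change() * 100

# The same quantity written out explicitly
manual = (close / close.shift(1) - 1) * 100

print(returns.tolist())  # the first entry is NaN because there is no previous month
print(manual.tolist())
```

Note that a gain of 10% followed by a loss of 10% leaves the price at 99, not 100: percentage returns compound and do not simply cancel out.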

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:

Understanding which months tend to perform well can help investors time their buying and selling decisions strategically.

If the company identifies reasons for strong months, it can use that insight to optimize business operations or marketing strategies.

Negative Growth Insights:

A trend of declining monthly returns suggests weakening stock performance, possibly leading to a loss of investor confidence.

High volatility in monthly returns may discourage risk-averse investors, making it harder to attract stable long-term investments.

#### Chart - 12

In [None]:
plt.figure(figsize=(8, 6))
sns.boxplot(y=df["Close"])
plt.title("Outliers in Closing Prices")
plt.ylabel("Closing Price")
plt.show()

##### 1. Why did you pick the specific chart?

A box plot was chosen because it is one of the best ways to identify outliers and understand the distribution of closing prices.

It provides key statistical insights such as minimum, first quartile (Q1), median, third quartile (Q3), and maximum values, helping to assess price variations.

The presence of outliers (data points outside the whiskers) can indicate unusual price movements due to extreme market events.

##### 2. What is/are the insight(s) found from the chart?

If there are outliers above the upper whisker, it suggests spikes in stock price, which could be due to positive news, earnings reports, or strong market performance.

If there are outliers below the lower whisker, it signals sharp price drops, possibly due to market crashes, poor earnings, or negative sentiment.

A narrow interquartile range (IQR) indicates low volatility, while a wide IQR suggests high price fluctuations over time.

The median position in the box helps understand whether the stock tends to stay in the upper or lower price range.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:

Identifying outliers helps investors understand abnormal price movements, allowing them to anticipate market shifts.

If the majority of prices fall within a stable range, it boosts investor confidence by showing low volatility.

Negative Growth Insights:

A high number of downward outliers could indicate frequent crashes, discouraging investors from holding long-term positions.

If the stock price shows extreme variability, it suggests market uncertainty, which can lead to lower investor trust and speculative trading.

#### Chart - 13

In [None]:
plt.figure(figsize=(12, 6))
pd.plotting.autocorrelation_plot(df["Close"])
plt.title("Autocorrelation of Closing Prices")
plt.show()

##### 1. Why did you pick the specific chart?

An autocorrelation plot was chosen because it helps analyze the relationship between a stock’s closing prices over time.

It shows how strongly past prices are correlated with later prices, which is crucial for trend analysis and forecasting.

This type of chart helps identify seasonal trends, momentum, or mean-reverting behavior in stock prices.

##### 2. What is/are the insight(s) found from the chart?

High positive autocorrelation at short lags (e.g., 1–5 months, since the data is monthly) suggests strong momentum, meaning past prices carry significant information about future prices.

Low or negative autocorrelation indicates that price changes are more random, making predictions harder.

If periodic peaks appear, it could signal seasonal trends, where prices follow a recurring pattern over quarters or years.

A gradual decline in autocorrelation suggests that past trends fade over time, meaning price movements are less predictable as the time gap increases.
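
The lag-based reading above can be checked numerically. The sketch below assumes a short hypothetical series of trending monthly closes and uses pandas' `Series.autocorr`, which is the Pearson correlation between the series and a shifted copy of itself. The autocorrelation plot normalizes slightly differently, so its values may not match exactly.

```python
import pandas as pd

# Hypothetical, steadily trending monthly closes (illustrative values only)
close = pd.Series([10.0, 12.0, 13.0, 15.0, 18.0, 20.0, 23.0, 25.0])

# Correlation between the series and itself shifted by k months
lag1 = close.autocorr(lag=1)
lag3 = close.autocorr(lag=3)

# The same lag-1 value computed explicitly from the shifted series
lag1_manual = close.corr(close.shift(1))

print(f"lag 1: {lag1:.3f}, lag 3: {lag3:.3f}")
```

For a trending series like this one the short-lag values stay close to 1, which is the momentum pattern described above.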

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:

If autocorrelation is high, traders can use trend-following strategies to profit from momentum.

Identifying seasonal cycles helps businesses and investors time their trades more effectively.

Negative Growth Insights:

If the stock shows no autocorrelation, it suggests a highly unpredictable price pattern, making it risky for long-term investors.

A sudden drop in autocorrelation may indicate market instability, causing uncertainty and reducing investor confidence.

#### Chart - 14 - Correlation Heatmap

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Compute the correlation matrix of the price columns in the stock DataFrame (df)
# (the sample DataFrame was removed because it overwrote df with unrelated data)
correlation_matrix = df[["Open", "High", "Low", "Close"]].corr()

# Plot the heatmap
plt.figure(figsize=(10, 8))  # Larger figure for better readability
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Correlation Heatmap of Stock Prices")
plt.show()

##### 1. Why did you pick the specific chart?

A correlation heatmap was chosen because it provides a clear visual representation of the relationships between different stock price attributes (Open, High, Low, Close, Volume, etc.).

It helps identify strong positive or negative correlations, allowing for better analysis of how different factors influence stock price movements. The color gradient makes it easier to spot patterns compared to a traditional correlation matrix.

##### 2. What is/are the insight(s) found from the chart?

A high correlation (close to +1) between Open, High, Low, and Close prices indicates that these variables move together from month to month, so each of them carries much of the same information about the price level.

A negative correlation (close to -1) with volume could suggest that price increases when trading volume is low, or vice versa.

A weak correlation (close to 0) between volume and closing price indicates that trading activity does not directly impact price changes.

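If High and Close prices are almost perfectly correlated, the closing price rises and falls in step with the monthly high. Correlation alone does not show that the stock closes near its high; that would have to be checked from the gap between High and Close.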

#### Chart - 15 - Pair Plot

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Plot pairwise relationships of the price columns in the stock DataFrame (df)
# (the sample DataFrame was removed because it overwrote df and dropped the Date column)
# pairplot creates its own figure, so the size is set through `height` instead of plt.figure
sns.pairplot(df[["Open", "High", "Low", "Close"]], height=2.5)
plt.show()

##### 1. Why did you pick the specific chart?

A pairplot was chosen because it provides a detailed overview of relationships between multiple numerical variables (Open, High, Low, Close).

It helps visualize scatter plots for each pair of variables and their distribution in a diagonal histogram.

This allows for spotting linear relationships, clusters, and potential anomalies in stock price movements.

##### 2. What is/are the insight(s) found from the chart?

Strong positive relationships between Open, High, Low, and Close prices indicate that these variables move together, which is expected in stock price behavior.

The scatter plots can reveal price trends, such as whether higher opening prices tend to lead to higher closing prices.

Non-linear relationships or clusters may indicate periods of volatility or shifts in market behavior.

The diagonal histograms show the distribution of each variable, highlighting whether prices follow a normal distribution or have skewness.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

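1. The Open and Close prices differ significantly on average.

2. The volatility of the stock, measured as (High - Low), changes over time.

3. The Open and Close prices are significantly correlated.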

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H₀): There is no significant difference between the Open and Close prices.

Alternative Hypothesis (H₁): There is a significant difference between the Open and Close prices.

#### 2. Perform an appropriate statistical test.

In [None]:
from scipy import stats

# Perform Paired t-test for Open vs. Close prices
t_stat, p_value_ttest = stats.ttest_rel(df["Open"], df["Close"])

# Display results
{
    "Test": "Paired t-test (Open vs. Close)",
    "t-statistic": t_stat,
    "p-value": p_value_ttest,
}


##### Which statistical test have you done to obtain P-Value?

Paired t-test

##### Why did you choose the specific statistical test?

A paired t-test compares two related samples; here, the Open and Close prices recorded for the same stock in the same month. Because each Open value is matched with a Close value from the same period, the samples are dependent, and a paired t-test is appropriate for checking whether their mean difference is significantly different from zero.
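
A paired t-test is equivalent to a one-sample t-test on the per-row differences. The sketch below works through that arithmetic with NumPy on hypothetical Open/Close pairs (the values are illustrative only); the resulting statistic is the quantity `stats.ttest_rel` reports as the t-statistic.

```python
import numpy as np

# Hypothetical Open/Close pairs for the same months (illustrative values only)
open_px = np.array([100.0, 102.0, 98.0, 105.0, 110.0])
close_px = np.array([103.0, 101.0, 100.0, 109.0, 108.0])

# Work on the per-month differences
diff = open_px - close_px
n = diff.size

# t = mean difference / standard error of the mean difference
t_stat = diff.mean() / (diff.std(ddof=1) / np.sqrt(n))

print(f"mean difference = {diff.mean():.2f}, t = {t_stat:.3f}, df = {n - 1}")
```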

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H₀): The variance of stock prices (High - Low) remains constant over time.

Alternative Hypothesis (H₁): The variance of stock prices changes significantly over time.

#### 2. Perform an appropriate statistical test.

In [None]:
# Compute stock price volatility as (High - Low)
volatility = df["High"] - df["Low"]
mid = len(volatility) // 2  # Split dataset into two halves

# Perform Levene’s test for equality of variance
levene_stat, p_value_levene = stats.levene(volatility[:mid], volatility[mid:])

# Display results
{
    "Test": "Levene’s Test (Volatility over time)",
    "Levene-statistic": levene_stat,
    "p-value": p_value_levene,
}


##### Which statistical test have you done to obtain P-Value?

Levene's Statistical Test

##### Why did you choose the specific statistical test?

Levene’s test checks whether two or more groups have equal variances. Here the monthly range (High - Low) is split into an earlier half and a later half, so the test shows whether the spread of price fluctuations differs significantly between the two time periods. It is also less sensitive to non-normal data than the classical F-test, which suits skewed stock-price ranges.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H₀): There is no significant correlation between Open and Close prices.

Alternative Hypothesis (H₁): There is a significant correlation between Open and Close prices.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Pearson Correlation Test for Open vs. Close prices
corr_coeff, p_value_corr = stats.pearsonr(df["Open"], df["Close"])

# Display results
{
    "Test": "Pearson Correlation (Open vs. Close)",
    "Correlation Coefficient": corr_coeff,
    "p-value": p_value_corr,
}


##### Which statistical test have you done to obtain P-Value?

Pearson Correlation Test

##### Why did you choose the specific statistical test?

Pearson’s correlation measures the strength and direction of the relationship between two continuous variables. Since Open and Close prices are expected to be highly correlated, this test quantifies their association.
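
For reference, the coefficient itself is the covariance of the two variables divided by the product of their standard deviations. A minimal NumPy sketch on hypothetical values (illustrative only), cross-checked against `np.corrcoef`:

```python
import numpy as np

# Hypothetical Open/Close pairs (illustrative values only)
open_px = np.array([100.0, 102.0, 98.0, 105.0, 110.0])
close_px = np.array([103.0, 101.0, 100.0, 109.0, 108.0])

# Center both variables around their means
x_c = open_px - open_px.mean()
y_c = close_px - close_px.mean()

# Pearson r = sum(x_c * y_c) / sqrt(sum(x_c^2) * sum(y_c^2))
r_manual = (x_c * y_c).sum() / np.sqrt((x_c ** 2).sum() * (y_c ** 2).sum())

# Cross-check against NumPy's correlation matrix
r_numpy = np.corrcoef(open_px, close_px)[0, 1]

print(f"r = {r_manual:.4f}")
```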

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Check for missing values in the dataset
missing_values = df.isna().sum()

# Impute missing prices with the column median.
# Assigning the result back avoids chained inplace fillna, which newer pandas versions deprecate.
df_imputed = df.copy()
for col in ["Open", "High", "Low", "Close"]:
    df_imputed[col] = df_imputed[col].fillna(df[col].median())

# Verify that no missing values remain
missing_values_after = df_imputed.isna().sum()

# Display missing-value counts before and after imputation
missing_values, missing_values_after

In [None]:
# Check for NaN values in all columns
nan_values_per_column = df.isna().sum()
nan_values_per_column

# Drop NaN values from the dataset
df_cleaned = df.dropna()

# Verify if NaN values are removed
nan_values_after_drop = df_cleaned.isna().sum()
nan_values_after_drop

#### What all missing value imputation techniques have you used and why did you use those techniques?

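Median imputation was prepared for the Open, High, Low, and Close columns because the median is robust to the extreme values that stock prices can take.

Before Imputation: The dataset had no missing values, so no imputation was actually required.

After Imputation: All values remain intact, confirming there were no missing data points. Dropping rows with dropna() likewise removed nothing.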

### 2. Handling Outliers

In [None]:
# Define a function to detect outliers using the IQR method
def detect_outliers_iqr(data):
    Q1 = data.quantile(0.25)  # First quartile (25th percentile)
    Q3 = data.quantile(0.75)  # Third quartile (75th percentile)
    IQR = Q3 - Q1  # Interquartile range
    lower_bound = Q1 - 1.5 * IQR  # Lower threshold
    upper_bound = Q3 + 1.5 * IQR  # Upper threshold
    outliers = (data < lower_bound) | (data > upper_bound)  # Boolean mask for outliers
    return outliers.sum(), lower_bound, upper_bound

# Detect outliers for each stock price column
outlier_counts = {}
for col in ["Open", "High", "Low", "Close"]:
    count, lower, upper = detect_outliers_iqr(df[col])
    outlier_counts[col] = {"Outlier Count": count, "Lower Bound": lower, "Upper Bound": upper}

outlier_counts

# Function to cap outliers within the IQR range
def cap_outliers(data, lower_bound, upper_bound):
    return data.clip(lower=lower_bound, upper=upper_bound)

# Apply capping for each stock price column
df_capped = df.copy()
for col in ["Open", "High", "Low", "Close"]:
    _, lower, upper = detect_outliers_iqr(df[col])
    df_capped[col] = cap_outliers(df[col], lower, upper)

# Verify if outliers remain after capping
outlier_counts_after = {col: detect_outliers_iqr(df_capped[col])[0] for col in ["Open", "High", "Low", "Close"]}
outlier_counts_after

##### What all outlier treatment techniques have you used and why did you use those techniques?

Detect Outliers using the Interquartile Range (IQR) method.

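Handle Outliers by capping (winsorizing) them at the IQR bounds: values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are clipped to those thresholds, which limits the influence of extreme prices without discarding any rows.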

### 3. Categorical Encoding

In [None]:
# Extract numerical features from the Date column
df_encoded = df_capped.copy()
df_encoded["Year"] = df_encoded["Date"].dt.year  # Extract year
df_encoded["Month"] = df_encoded["Date"].dt.month  # Extract month

# Drop the original Date column
df_encoded.drop(columns=["Date"], inplace=True)

# Display the first few rows after encoding
df_encoded.head()

#### What all categorical encoding techniques have you used & why did you use those techniques?

The dataset has no true categorical columns. The only non-numeric column is "Date", which is stored in datetime format, so instead of label or one-hot encoding I extracted numerical features from it.

Features extracted from the "Date" column:

Year

Month

The original "Date" column was then dropped, leaving only numerical features.
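
The `.dt` accessor used above only works once the column has a datetime dtype. The sketch below shows the parsing step on hypothetical rows; the "Mon-YY" string format and the values are assumptions made for illustration, not details confirmed by the notebook.

```python
import pandas as pd

# Hypothetical raw rows; the "Mon-YY" date format is an assumption
raw = pd.DataFrame({"Date": ["Jul-05", "Aug-05", "Jan-06"], "Close": [10.0, 11.0, 12.0]})

# Parse the strings into datetime64 values before using the .dt accessor
raw["Date"] = pd.to_datetime(raw["Date"], format="%b-%y")

# Extract integer calendar features
raw["Year"] = raw["Date"].dt.year
raw["Month"] = raw["Date"].dt.month

print(raw[["Year", "Month"]])
```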

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing Words Containing Digits

In [None]:
# Remove URLs & remove words containing digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Create new features on a copy of the encoded data
# (the unused PCA and StandardScaler imports were dropped; neither is applied in this notebook)
df_features = df_encoded.copy()

# Percentage return within the month: (Close - Open) / Open * 100
df_features["Return Percentage"] = (df_features["Close"] - df_features["Open"]) / df_features["Open"] * 100

df_final = df_features.copy()
df_final.head()

#### 2. Feature Selection

In [None]:
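import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Candidate features: every column of df_final except the target (Close)
X_candidates = df_final.drop(columns=["Close"])

# Step 1: Remove Highly Correlated Features (Threshold = 0.85)
# Keep only the upper triangle so each pair of features is examined once
corr_matrix = X_candidates.corr().abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.85).any()]
df_reduced = X_candidates.drop(columns=to_drop)

# Step 2: Variance Thresholding
# threshold=0.0 (the scikit-learn default) removes constant columns; the value is an assumption
selector = VarianceThreshold(threshold=0.0)
selector.fit(df_reduced)
df_selected = df_reduced.loc[:, selector.get_support()]

# Features that survive both steps
df_selected.head()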

##### What all feature selection methods have you used  and why?

Remove Highly Correlated Features: If two features have high correlation (above 0.85), we dropped one.

Variance Thresholding: Dropped low-variance features that contribute little information.

Domain Knowledge: Selected meaningful features relevant to stock price movements.

##### Which all features you found important and why?

Selected Features:

Year: Captures long-term trends.

Month: Accounts for seasonal effects.

Volatility: Represents stock volatility.

Return Percentage: Shows percentage price change.

Rolling Average: Smooths short-term fluctuations to highlight the underlying trend.
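
Volatility and Rolling Average are listed above, so here is a minimal sketch of how such columns could be derived with pandas. The 3-month window, the `shift(1)` step, and the sample values are assumptions for illustration; volatility follows the (High - Low) definition used in the hypothesis test.

```python
import pandas as pd

# Hypothetical monthly prices (illustrative values only)
prices = pd.DataFrame({
    "High": [12.0, 14.0, 15.0, 13.0, 16.0],
    "Low": [10.0, 11.0, 13.0, 11.0, 12.0],
    "Close": [11.0, 13.0, 14.0, 12.0, 15.0],
})

# Volatility as the monthly trading range
prices["Volatility"] = prices["High"] - prices["Low"]

# Average of the previous 3 closes; shift(1) keeps the current month's Close
# out of its own feature, so the target does not leak into the inputs
prices["Rolling Average"] = prices["Close"].shift(1).rolling(window=3).mean()

print(prices)
```

The first three rows of the rolling column are NaN because fewer than three earlier closes exist, so those rows would need to be dropped or imputed before modeling.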

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

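No transformation was applied to the input features. The target variable (Close) is log-transformed with np.log1p when the models are trained, and predictions are converted back with np.expm1 before evaluation.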

### 6. Data Scaling

In [None]:
# Scaling your data

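##### Which method have you used to scale your data and why?

No scaling was applied. The tree-based models (Random Forest and XGBoost) split on feature thresholds and are insensitive to feature scale, and the predictions of ordinary least-squares Linear Regression do not change when the features are rescaled.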

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
from sklearn.model_selection import train_test_split

# Define features (X) and target variable (y)
X = df_selected  # Selected features
y = df_encoded["Close"]  # Predicting the stock's closing price

# Split the dataset into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shapes of the resulting datasets
X_train.shape, X_test.shape, y_train.shape, y_test.shape

##### What data splitting ratio have you used and why?

I've split the dataset into training (80%) and testing (20%) sets to ensure proper model evaluation.

Training Set: 148 samples (80%)

Testing Set: 37 samples (20%)
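
Because the rows are monthly observations, a time-ordered hold-out is a common alternative to a random split. The sketch below is not what this notebook does; it only illustrates, on stand-in arrays with the same 185 rows, how the most recent 20% could be held out instead.

```python
import numpy as np

n_samples = 185  # 148 + 37, the sizes reported above

# Stand-in features and target, assumed to be sorted from oldest to newest
X_demo = np.arange(n_samples).reshape(-1, 1)
y_demo = np.arange(n_samples, dtype=float)

# Hold out the most recent 20% of months instead of a random 20%
n_test = n_samples // 5
X_train_ts, X_test_ts = X_demo[:-n_test], X_demo[-n_test:]
y_train_ts, y_test_ts = y_demo[:-n_test], y_demo[-n_test:]

print(X_train_ts.shape, X_test_ts.shape)
```

With this layout every test month comes after every training month, so the evaluation resembles forecasting rather than interpolation between known months.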

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

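Class imbalance applies to classification targets. The closing price is a continuous target, so the relevant check here is the skewness of its distribution, which is plotted and measured below.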

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Plot the distribution of the target variable (Close price)
plt.figure(figsize=(8,5))
sns.histplot(y, bins=30, kde=True, color="blue")
plt.xlabel("Closing Price")
plt.ylabel("Frequency")
plt.title("Distribution of Closing Prices")
plt.show()

# Check skewness of target variable
y_skewness = y.skew()
y_skewness

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Applied a log transformation (np.log1p) to the target variable to reduce its skew and bring its distribution closer to normal.
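
A minimal sketch of the transform and its inverse, assuming a hypothetical right-skewed set of closing prices (illustrative values only):

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed closing prices (illustrative values only)
y_demo = pd.Series([10.0, 12.0, 15.0, 20.0, 30.0, 60.0, 150.0, 350.0])

# log(1 + y) compresses the long right tail
y_log = np.log1p(y_demo)

# exp(y) - 1 is the exact inverse, used to bring predictions back to price units
y_back = np.expm1(y_log)

print(f"skew before: {y_demo.skew():.2f}, after: {y_log.skew():.2f}")
```

Because the models are trained on the log scale, every prediction has to pass through `np.expm1` before R² or RMSE is computed against actual prices.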

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np  # Import numpy if not already imported

# Initialize the Linear Regression model
lr_model = LinearRegression()

# Train the model on the training set
lr_model.fit(X_train, np.log1p(y_train))  # Using log-transformed target

# Predict on the test set
y_pred_log = lr_model.predict(X_test)

# Convert predictions back to the original scale (exponential transformation)
y_pred = np.expm1(y_pred_log)

# Evaluate model performance
r2 = r2_score(y_test, y_pred)
# Calculate RMSE manually using NumPy
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

r2, rmse

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

Linear Regression is a supervised machine learning algorithm used to model the relationship between independent variables (features) and a dependent variable (target) by fitting a straight line.

Interpretable → Shows how each feature impacts the stock price.

Baseline Model → Helps compare performance before trying complex models.

Fast & Efficient → Works well with smaller datasets.
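
To make the idea of fitting a straight line concrete, the sketch below solves an ordinary least-squares problem with NumPy on toy data generated from a known line. The data and variable names are illustrative assumptions; scikit-learn's LinearRegression minimizes the same squared-error objective.

```python
import numpy as np

# Toy data generated from a known line, close = 2 * open + 1 (illustrative only)
open_px = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
close_px = 2.0 * open_px + 1.0

# Add a column of ones so the intercept is estimated alongside the slope
A = np.column_stack([open_px, np.ones_like(open_px)])

# Ordinary least squares: minimize ||A @ w - close_px||^2
(slope, intercept), *_ = np.linalg.lstsq(A, close_px, rcond=None)

print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

The recovered slope is the interpretable part: it states how much the predicted close changes for a one-unit change in the feature.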

In [None]:
import matplotlib.pyplot as plt

# Residuals (Errors)
residuals = y_test - y_pred

# Create subplots
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Residual Plot
axes[0].scatter(y_pred, residuals, color="red", alpha=0.5)
axes[0].axhline(y=0, color="black", linestyle="--")
axes[0].set_xlabel("Predicted Closing Price")
axes[0].set_ylabel("Residuals (Errors)")
axes[0].set_title("Residual Plot")

# Actual vs. Predicted Plot
axes[1].scatter(y_test, y_pred, color="blue", alpha=0.5)
axes[1].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color="black", linestyle="--")  # Ideal line
axes[1].set_xlabel("Actual Closing Price")
axes[1].set_ylabel("Predicted Closing Price")
axes[1].set_title("Actual vs. Predicted Plot")

plt.tight_layout()
plt.show()

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import cross_val_score

# Candidate regularization strengths for Ridge (illustrative grid)
alphas = [0.01, 0.1, 1, 10, 100]

# 5-fold cross-validated R² on the training set for each alpha (log-transformed target)
cv_r2 = {
    alpha: cross_val_score(Ridge(alpha=alpha), X_train, np.log1p(y_train), cv=5, scoring="r2").mean()
    for alpha in alphas
}
best_alpha = max(cv_r2, key=cv_r2.get)

# Refit Ridge with the best alpha and evaluate on the test set in the original price scale
ridge_model = Ridge(alpha=best_alpha)
ridge_model.fit(X_train, np.log1p(y_train))
y_pred_ridge = np.expm1(ridge_model.predict(X_test))

r2_ridge = r2_score(y_test, y_pred_ridge)
rmse_ridge = np.sqrt(mean_squared_error(y_test, y_pred_ridge))

best_alpha, r2_ridge, rmse_ridge

##### Which hyperparameter optimization technique have you used and why?

Cross-Validation: K-Fold (k=5) is used to check model stability across different data splits.

Hyperparameter Tuning:

Linear Regression has no regularization parameter to tune, so Ridge Regression (L2 Regularization) is used instead and its alpha value is optimized.

Evaluation Metrics: R² score and RMSE are used to compare performance.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

In [None]:
import matplotlib.pyplot as plt

# Residuals (errors) of the tuned Ridge model
residuals_ridge = y_test - y_pred_ridge

# Create subplots
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Residual Plot
axes[0].scatter(y_pred_ridge, residuals_ridge, color="red", alpha=0.5)
axes[0].axhline(y=0, color="black", linestyle="--")
axes[0].set_xlabel("Predicted Closing Price")
axes[0].set_ylabel("Residuals (Errors)")
axes[0].set_title("Residual Plot (Ridge)")

# Actual vs. Predicted Plot
axes[1].scatter(y_test, y_pred_ridge, color="blue", alpha=0.5)
axes[1].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color="black", linestyle="--")  # Ideal line
axes[1].set_xlabel("Actual Closing Price")
axes[1].set_ylabel("Predicted Closing Price")
axes[1].set_title("Actual vs. Predicted Plot (Ridge)")

plt.tight_layout()
plt.show()

### ML Model - 2

In [None]:
from sklearn.ensemble import RandomForestRegressor

# Initialize Random Forest Regressor with default parameters
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)

# Train the model on the training set
rf_model.fit(X_train, np.log1p(y_train))  # Using log-transformed target

# Predict on the test set
y_pred_log_rf = rf_model.predict(X_test)

# Convert predictions back to original scale
y_pred_rf = np.expm1(y_pred_log_rf)

# Evaluate model performance
r2_rf = r2_score(y_test, y_pred_rf)
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))

r2_rf, rmse_rf

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Compute residuals
residuals_rf = y_test - y_pred_rf

# Create subplots
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Residual Plot
axes[0].scatter(y_pred_rf, residuals_rf, color="red", alpha=0.5)
axes[0].axhline(y=0, color="black", linestyle="--")
axes[0].set_xlabel("Predicted Closing Price")
axes[0].set_ylabel("Residuals (Errors)")
axes[0].set_title("Residual Plot (Random Forest)")

# Actual vs. Predicted Plot
axes[1].scatter(y_test, y_pred_rf, color="blue", alpha=0.5)
axes[1].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color="black", linestyle="--")  # Ideal line
axes[1].set_xlabel("Actual Closing Price")
axes[1].set_ylabel("Predicted Closing Price")
axes[1].set_title("Actual vs. Predicted Plot (Random Forest)")

plt.tight_layout()
plt.show()


#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

# Define hyperparameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Initialize Random Forest model
rf_model_tuned = RandomForestRegressor(random_state=42)

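# Perform Grid Search with Cross-Validation (cv=5)
grid_search = GridSearchCV(estimator=rf_model_tuned, param_grid=param_grid,
                           scoring='r2', cv=5, n_jobs=-1, verbose=1)

# Fit the search on the training set (log-transformed target, as for the base model)
grid_search.fit(X_train, np.log1p(y_train))

# Evaluate the best estimator on the test set in the original price scale
y_pred_rf_tuned = np.expm1(grid_search.best_estimator_.predict(X_test))
r2_rf_tuned = r2_score(y_test, y_pred_rf_tuned)
rmse_rf_tuned = np.sqrt(mean_squared_error(y_test, y_pred_rf_tuned))

grid_search.best_params_, r2_rf_tuned, rmse_rf_tuned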


##### Which hyperparameter optimization technique have you used and why?

I performed hyperparameter tuning for the Random Forest model using Grid Search Cross-Validation (GridSearchCV) to find the best combination of parameters.
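
The cost of an exhaustive search grows with the product of the candidate lists. The sketch below counts the combinations in the Random Forest grid above using only the standard library; the name `rf_param_grid` is introduced here for illustration so that the notebook's own `param_grid` is left untouched.

```python
from itertools import product

# Same candidate values as the Random Forest grid above
rf_param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [10, 20, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}
cv_folds = 5

# Grid search tries the Cartesian product of all candidate values
combinations = [dict(zip(rf_param_grid, values)) for values in product(*rf_param_grid.values())]

# Each combination is trained once per cross-validation fold
n_fits = len(combinations) * cv_folds

print(len(combinations), n_fits)
```

That is 81 parameter combinations and 405 model fits, which is manageable for a dataset of this size but grows quickly as more values are added.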

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Yes. After tuning the Random Forest model with a grid search over key parameters (n_estimators, max_depth, min_samples_split, and min_samples_leaf), we noticed an improvement in the model's evaluation scores.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

In this project, two key evaluation metrics were used to assess the performance of the machine learning model: R² Score and Root Mean Squared Error (RMSE). The R² Score, or coefficient of determination, indicates how well the model's predictions match the actual values. An R² value close to 1 suggests that the model explains most of the variability in the target variable, in this case the closing stock price of Yes Bank. RMSE expresses the typical prediction error in the same units as the price, so a lower RMSE means forecasts that deviate less from the actual close. From a business perspective, a high R² score together with a low RMSE on held-out data indicates that the model can support investors, financial analysts, and decision-makers in forecasting stock performance, managing investment portfolios, and reducing financial risk.

### ML Model - 3

In [None]:
# Import necessary libraries
import xgboost as xgb
from sklearn.metrics import r2_score, mean_squared_error

# Initialize the XGBoost Regressor
xgb_model = xgb.XGBRegressor(objective="reg:squarederror",
                             n_estimators=300,
                             learning_rate=0.1,
                             max_depth=6,
                             random_state=42)

# Train the model on the training set (using log-transformed target)
xgb_model.fit(X_train, np.log1p(y_train))

# Predict on the test set
y_pred_log_xgb = xgb_model.predict(X_test)

# Convert predictions back to the original scale
y_pred_xgb = np.expm1(y_pred_log_xgb)

# Evaluate model performance
r2_xgb = r2_score(y_test, y_pred_xgb)
rmse_xgb = np.sqrt(mean_squared_error(y_test, y_pred_xgb))

# Print results
print(f"XGBoost Model R² Score: {r2_xgb}")
print(f"XGBoost Model RMSE: {rmse_xgb}")

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Compute residuals
residuals_xgb = y_test - y_pred_xgb

# Create subplots
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Residual Plot
sns.scatterplot(x=y_pred_xgb, y=residuals_xgb, color="red", alpha=0.5, ax=axes[0])
axes[0].axhline(y=0, color="black", linestyle="--")
axes[0].set_xlabel("Predicted Closing Price")
axes[0].set_ylabel("Residuals (Errors)")
axes[0].set_title("Residual Plot (XGBoost Model)")

# Actual vs. Predicted Plot
sns.scatterplot(x=y_test, y=y_pred_xgb, color="blue", alpha=0.5, ax=axes[1])
axes[1].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color="black", linestyle="--")  # Ideal line
axes[1].set_xlabel("Actual Closing Price")
axes[1].set_ylabel("Predicted Closing Price")
axes[1].set_title("Actual vs. Predicted Plot (XGBoost Model)")

plt.tight_layout()
plt.show()

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
from sklearn.model_selection import GridSearchCV
import xgboost as xgb

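# Define parameter grid for tuning
param_grid = {
    "n_estimators": [100, 300, 500],  # Number of trees
    "max_depth": [3, 6, 10],  # Depth of trees
    "learning_rate": [0.01, 0.1, 0.2],  # Step size shrinkage
    "subsample": [0.8, 1],  # Fraction of samples used per tree
    "colsample_bytree": [0.8, 1]  # Fraction of features used per tree
}

# Run a 5-fold grid search on the training set (log-transformed target)
xgb_grid = GridSearchCV(estimator=xgb.XGBRegressor(objective="reg:squarederror", random_state=42),
                        param_grid=param_grid, scoring="r2", cv=5, n_jobs=-1, verbose=1)
xgb_grid.fit(X_train, np.log1p(y_train))

# Best parameter combination and its mean cross-validated R²
xgb_grid.best_params_, xgb_grid.best_score_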

##### Which hyperparameter optimization technique have you used and why?

GridSearchCV (Grid Search with Cross-Validation) helps automate hyperparameter tuning by systematically testing different parameter combinations to find the best-performing model.

Yes, after tuning the Random Forest model with a grid search over key parameters (n_estimators, max_depth, min_samples_split, and min_samples_leaf), we noticed an improvement in the model’s evaluation scores.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

For the stock price prediction model, I focused on two key metrics that directly impact decision-making, risk management, and business performance:

R² Score (Coefficient of Determination)
Why?

Measures how well the model explains the variance in stock prices.

A higher R² (closer to 1) indicates the model can predict trends accurately, leading to better investment strategies.

Helps traders and businesses trust the model's predictions.

Business Impact:

High R² → Better forecasting of stock prices → Informed investment decisions.

Low R² → Model lacks predictive power → Higher risk in trading strategies.

Root Mean Squared Error (RMSE)
Why?

Measures the typical size of the error between actual and predicted stock prices (the square root of the mean squared error), expressed in the same units as the price.

Lower RMSE = More precise predictions → Essential for minimizing financial risk.

RMSE is preferred over MAE because it penalizes larger errors more, which is crucial in financial forecasting.

Business Impact:

Low RMSE → More accurate predictions → Reduced losses in stock trading.

High RMSE → Large deviations in predictions → Increased financial risk.

Why These Metrics Matter for Business?

Investors & Traders → Depend on high R² & low RMSE to make profitable stock trades.

Risk Management → Accurate forecasting reduces financial risk & prevents losses.

Portfolio Optimization → Helps in asset allocation & risk-adjusted returns.
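
Both metrics can be computed by hand, which makes their definitions explicit. The sketch below uses hypothetical actual and predicted prices (illustrative values only); the `_demo` names are introduced here so that the notebook's own results are not overwritten.

```python
import numpy as np

# Hypothetical actual and predicted closing prices (illustrative values only)
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_hat = np.array([12.0, 18.0, 33.0, 37.0])

errors = y_true - y_hat

# RMSE: square root of the mean squared error, in price units
rmse_demo = np.sqrt(np.mean(errors ** 2))

# R²: 1 - (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_demo = 1.0 - ss_res / ss_tot

# MAE for comparison: RMSE is never smaller, and the gap widens when a few errors dominate
mae_demo = np.mean(np.abs(errors))

print(f"R² = {r2_demo:.3f}, RMSE = {rmse_demo:.3f}, MAE = {mae_demo:.3f}")
```

The gap between RMSE and MAE is what the statement about penalizing larger errors refers to: squaring gives the errors of size 3 more weight than the errors of size 2.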

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

I chose the XGBoost Model. Why?

Highest R² Score (~0.9918) → Best at explaining stock price variations.

Lowest RMSE → Most accurate price predictions.

Handles non-linearity well → More realistic modeling of stock market trends.

Regularization (L1 & L2) helps limit overfitting → Makes it more reliable on unseen data.

Business Impact of Choosing XGBoost

Accurate stock price predictions → Better decision-making for investors.

Reduced risk in trading strategies → Minimizes losses in volatile markets.

Better portfolio optimization → Helps businesses allocate assets efficiently.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

In this project, the Random Forest Regressor model was used to predict the closing stock price of Yes Bank based on features such as Open, High, and Low prices. Random Forest is an ensemble learning method that builds multiple decision trees during training and outputs the average of their predictions, making it robust against overfitting and effective for regression tasks. It automatically handles non-linearity, captures complex interactions among variables, and performs well even with limited data.
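
One model-agnostic way to inspect feature importance is permutation importance: shuffle one feature at a time and measure how much the error grows. The sketch below implements the idea by hand on synthetic data with a least-squares stand-in model; the data, the model, and the helper names are assumptions for illustration. scikit-learn provides the same technique as `sklearn.inspection.permutation_importance`, which could be applied to the fitted models in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: the target depends on feature 0 only; feature 1 is pure noise
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

# Stand-in model: ordinary least squares with an intercept
A = np.column_stack([X_demo, np.ones(len(X_demo))])
w, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

def predict(X):
    """Predict with the fitted least-squares weights."""
    return np.column_stack([X, np.ones(len(X))]) @ w

def rmse(y_true, y_hat):
    """Root mean squared error."""
    return np.sqrt(np.mean((y_true - y_hat) ** 2))

baseline = rmse(y_demo, predict(X_demo))

# Shuffle one column at a time and record how much the error grows
importance = []
for j in range(X_demo.shape[1]):
    X_perm = X_demo.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])
    importance.append(rmse(y_demo, predict(X_perm)) - baseline)

print(importance)
```

A feature whose shuffling barely changes the error contributes little to the predictions. In practice the importance would be measured on held-out data rather than on the training rows used here.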

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
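
A minimal sketch of the save, load, and sanity-check round trip using the standard-library `pickle` module. The dictionary of coefficients is a stand-in for the fitted estimator, and the file name, coefficient values, and `predict_close` helper are hypothetical; in the notebook the trained model object would be dumped and loaded in the same way.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted estimator; in the notebook this would be the trained model object
model_stub = {"coef": [0.5, 0.3, 0.2], "intercept": 1.0}

# Hypothetical file name inside a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "best_model.pkl")

    # Save the object to disk
    with open(model_path, "wb") as f:
        pickle.dump(model_stub, f)

    # Load it back from disk
    with open(model_path, "rb") as f:
        loaded = pickle.load(f)

def predict_close(model, open_px, high_px, low_px):
    """Apply the stand-in linear model to one unseen row."""
    c = model["coef"]
    return model["intercept"] + c[0] * open_px + c[1] * high_px + c[2] * low_px

# Sanity check: the reloaded object produces a prediction for an unseen row
prediction = predict_close(loaded, 100.0, 110.0, 95.0)
print(prediction)
```

Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.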

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

After performing Exploratory Data Analysis (EDA) on Yes Bank's stock price dataset, we have gathered critical insights that can enhance decision-making for investors and traders.

1. The stock exhibits significant volatility, presenting both opportunities and risks for short-term and long-term investors.

2. Historical trends, correlations, and seasonal patterns provide valuable insights into price movements.

3. Moving averages and price change patterns indicate that traders can optimize their buying/selling strategies based on past performance.

4. The strong correlation between Open, High, Low, and Close prices suggests that predictive modeling techniques could improve forecasting accuracy.

5. Risk management strategies, such as portfolio diversification and stop-loss mechanisms, are essential to mitigate potential losses in a volatile market.

Final Recommendations

To achieve the business objective, the client should:

✅ Leverage technical analysis (moving averages, trend analysis) to optimize trading strategies.

✅ Implement risk management through stop-loss orders and portfolio diversification.

✅ Explore predictive modeling & machine learning to forecast future stock trends.

✅ Monitor market news & sentiment analysis to anticipate price fluctuations effectively.

By adopting a data-driven investment approach, the client can maximize returns while minimizing risks, leading to more profitable and informed trading decisions.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***