# **Project Name**    - Yes Bank Stock Closing Price Prediction


##### **Project Type**    - Regression
##### **Contribution**    - Individual

# **Project Summary -**


This project focuses on building a regression model to predict the monthly closing price of Yes Bank stock using historical stock data. The aim is to leverage machine learning techniques to understand market patterns and provide data-driven insights into stock price behavior.

Yes Bank has been a prominent player in the Indian banking sector. Since 2018, it has been in the news due to a major financial fraud case involving its former CEO, Rana Kapoor. This led to significant volatility in the stock prices, making it a suitable case for applying regression models and understanding how certain features influence the closing price. The dataset provided contains monthly stock price records, including the opening, highest, lowest, and closing prices along with the date.

The project began with Exploratory Data Analysis (EDA) to understand trends and detect patterns in the data. Visualizations using libraries such as matplotlib and seaborn helped in understanding the relationship between different features like Open, High, Low, and Close. The Date column was broken down into additional features such as Month and Year to help the model better understand temporal behavior.

Following EDA, we moved to data cleaning, where missing values and outliers were handled appropriately. Outliers were detected using methods like the Interquartile Range (IQR), and null values (if any) were imputed or removed after careful consideration. This step ensured that the model would not be biased or skewed due to noisy data.

Next came feature engineering, where new columns were created, such as High-Low and Open-Close differences. These derived features were added to capture the intra-month volatility and price movement trends, which can be important indicators for predicting closing prices.

In the preprocessing stage, we used StandardScaler to standardize the numerical features (zero mean, unit variance) and applied an 80-20 train-test split. Because the data is a time series, we preserved chronological order when splitting to avoid data leakage.
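A minimal sketch of this split-and-scale step is shown below. The arrays are made-up stand-ins for the real features, and the exact code used later in the notebook may differ. The key assumption is that the rows are already sorted from oldest to newest, so slicing by position keeps the test period strictly after the training period.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: 10 months in chronological order, 2 features each (illustration only).
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10, dtype=float)

# Chronological 80-20 split: the earliest 80% of rows train, the latest 20% test.
split_idx = int(len(X) * 0.8)
X_train, X_test = X[:split_idx], X[split_idx:]
y_train, y_test = y[:split_idx], y[split_idx:]

# Fit the scaler on the training rows only, then reuse its statistics on the test rows.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Train rows:", X_train.shape[0], "| Test rows:", X_test.shape[0])
```

Fitting the scaler on the training rows alone keeps information from the test period out of the preprocessing. `train_test_split(X, y, test_size=0.2, shuffle=False)` is an equivalent way to get the same positional split.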

We trained at least two models — Linear Regression and Random Forest Regressor — to compare performance. The models were evaluated based on metrics such as Mean Squared Error (MSE), Mean Absolute Error (MAE), and R² Score. Visualization of actual vs predicted closing prices gave a clear picture of the models’ performance.

We also performed hyperparameter tuning to improve the performance of the Random Forest model and identified the most influential features through feature importance plots. All code was modularized, well-commented, and properly formatted in a Google Colab notebook, which includes summary explanations, graphs, and result interpretation.

In conclusion, this project not only demonstrates how machine learning can be applied to stock price prediction but also reflects real-world data handling and problem-solving skills. The results indicate that while stock markets are inherently volatile, using historical data and the right ML techniques, it is possible to build fairly accurate predictive models.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


The objective of this project is to predict the monthly closing price of Yes Bank’s stock using historical stock price data through regression techniques. Given the fluctuations in Yes Bank’s stock prices—especially following significant financial events like the 2018 fraud case involving former CEO Rana Kapoor—there is a need to explore how well machine learning models can capture and forecast stock price behavior based on key variables.

The dataset contains monthly records of stock prices, including:

*   Opening Price
*   Highest Price
*   Lowest Price
*   Closing Price
*   Date of Record

The goal is to develop a predictive model that can estimate the closing price based on the other available features. This involves applying data preprocessing, exploratory data analysis, feature engineering, model training, and evaluation using appropriate regression techniques. The project also aims to identify which features have the most impact on closing prices, and to visualize and explain the model's performance.

Ultimately, this project will demonstrate the applicability of machine learning in financial forecasting and show how historical market data can be used to make informed predictions.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following questions are answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure the following points are addressed for each algorithm.


*   Explain the ML model used and its performance using an evaluation metric score chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with an updated evaluation metric score chart.

*   Explain what each evaluation metric indicates for the business, and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

# 📊 Data handling and manipulation
import pandas as pd   # For reading and handling data
import numpy as np    # For numerical operations

# 📈 Data visualization
import matplotlib.pyplot as plt  # For creating plots
import seaborn as sns            # For statistical visualizations


# 🧹 Data preprocessing
from sklearn.preprocessing import StandardScaler  # For feature scaling
from sklearn.model_selection import train_test_split  # For splitting data

# 🤖 Machine learning models
from sklearn.linear_model import LinearRegression   # Simple regression model
from sklearn.ensemble import RandomForestRegressor  # Tree-based ensemble model

# 📊 Model evaluation metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score  # Performance metrics

# 🛠️ Warnings
import warnings
warnings.filterwarnings('ignore')  # To ignore unnecessary warnings

# 📅 Date handling (if needed during feature engineering)
from datetime import datetime


### Dataset Loading

In [None]:
# Load Dataset

# Load the dataset from a CSV file
# Adjust the path below if the CSV is stored somewhere other than the Colab /content folder
df = pd.read_csv('/content/data_YesBank_StockPrices.csv')

# Display the first 5 rows to get an overview of the data
print("🔍 Preview of the dataset:")
print(df.head())

# Check the shape of the dataset (rows, columns)
print("\n📏 Dataset shape:", df.shape)

# View data types and non-null counts
print("\n🔧 Dataset info:")
df.info()


### Dataset First View

In [None]:
# Dataset First Look
# Show the first 5 rows of the dataset
print("🔍 First 5 rows of the dataset:")
print(df.head())

# Show the last 5 rows (to check the tail-end of the data)
print("\n🔎 Last 5 rows of the dataset:")
print(df.tail())

# Check column names
print("\n📝 Column Names:")
print(df.columns)

# Basic statistical summary of numerical columns
print("\n📊 Descriptive Statistics:")
print(df.describe())

# Check for missing values
print("\n❓ Missing Values:")
print(df.isnull().sum())


### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
# Get the number of rows and columns
rows, columns = df.shape

print(f"📏 The dataset contains {rows} rows and {columns} columns.")


### Dataset Information

In [None]:
# Dataset Info
# Display structure and summary info about the dataset
print(" Dataset Information:")
df.info()


#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
# Count the number of duplicate rows in the dataset
duplicate_count = df.duplicated().sum()

print(f"🔁 Total number of duplicate rows: {duplicate_count}")


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
# Count of missing (NaN) values in each column
print("❓ Missing / Null Values in Each Column:")
print(df.isnull().sum())

# Optional: Get total number of missing values in the entire dataset
total_missing = df.isnull().sum().sum()
print(f"\n🧮 Total missing values in the dataset: {total_missing}")


In [None]:
# Visualizing the missing values
import seaborn as sns
import matplotlib.pyplot as plt

# Visual heatmap to show where nulls exist
plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cbar=False, cmap="Reds", yticklabels=False)
plt.title("🔍 Missing Values Heatmap")
plt.show()


### What did you know about your dataset?

The dataset contains no missing (null) values and no duplicate rows.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
# Print all column names
print("📝 Column Names in the Dataset:")
print(df.columns.tolist())


In [None]:
# Dataset Describe
# Summary statistics for all numeric columns
print("📊 Descriptive Statistics:")
print(df.describe())


### Variables Description

Date

This column represents the month and year for which the stock data is recorded.

In the raw file it is stored as text in `Mon-YY` format (for example, `Jul-05`), with one entry per month; the wrangling step converts it to a datetime.

This will be useful for extracting time-based features like month and year.

Open

This is the stock’s price at the beginning of the month (the first trading day).

It reflects the starting market sentiment and is often compared with the closing price to gauge movement.

High

This represents the highest price the stock reached during that particular month.

It indicates the peak market optimism or demand during that period.

Low

This shows the lowest price the stock touched in the month.

It helps in understanding the depth of negative sentiment or volatility.

Close

This is the closing price of the stock at the end of the month (last trading day).

It is the main target variable in this project (i.e., the value we aim to predict).

Often used in financial forecasting as it reflects the final consensus price.
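To make the Date format concrete: the wrangling code in this notebook parses dates such as `Jul-05` with the `'%b-%y'` format string. The short sketch below applies the same format codes to a few hypothetical sample strings using only the standard library. It assumes an English locale, since `%b` matches locale-specific month abbreviations.

```python
from datetime import datetime

# Hypothetical sample values in the 'Mon-YY' style (not taken from the real file).
raw_dates = ["Jul-05", "Aug-05", "Jan-18"]

# '%b' matches the abbreviated month name and '%y' the two-digit year.
# The day is not present in the string, so it defaults to the 1st of the month.
parsed = [datetime.strptime(value, "%b-%y") for value in raw_dates]

for raw, dt in zip(raw_dates, parsed):
    print(raw, "->", dt.strftime("%Y-%m-%d"))
```

Because the day defaults to 1, each record is effectively stamped with the first day of its month, which is enough for extracting Year and Month.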

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
# Loop through all columns and print the count of unique values
print("🔢 Unique value count for each column:\n")
for column in df.columns:
    unique_count = df[column].nunique()
    print(f"{column}: {unique_count} unique values")


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Load Dataset
df = pd.read_csv('/content/data_YesBank_StockPrices.csv')
print("Shape after loading:", df.shape)

# 🧹 1. Drop duplicate rows (if any)
df.drop_duplicates(inplace=True)
print("Shape after dropping duplicates:", df.shape)

# 📆 2. Convert 'Date' column to datetime format
# Use '%b-%y' format string for 'Jul-05' type dates
df['Date'] = pd.to_datetime(df['Date'], format='%b-%y', errors='coerce')
print("Shape after date conversion:", df.shape)
print("Nulls in Date after conversion:", df['Date'].isnull().sum())


# 🧼 3. Drop rows with missing or malformed dates (if any)
# This is important because if the format was wrong, `errors='coerce'` would make them NaT
# Now that we are using the correct format, hopefully, this step won't drop many rows.
df = df.dropna(subset=['Date'])
print("Shape after dropping rows with missing dates:", df.shape)


# 📊 4. Sort the data by Date (oldest to newest)
df = df.sort_values(by='Date').reset_index(drop=True)
print("Shape after sorting:", df.shape)

# 🧠 5. Create new time-based features
# Ensure 'Date' is datetime before extracting features
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Month_Name'] = df['Date'].dt.strftime('%B')
print("Shape after adding time features:", df.shape)


# ⚙️ 6. Feature Engineering: Create new informative features
df['High_Low_Diff'] = df['High'] - df['Low']       # Measures volatility
df['Open_Close_Diff'] = df['Open'] - df['Close']   # Measures trend direction
print("Shape after adding engineered features:", df.shape)


# 🧽 7. Handle missing values if any (Check again after feature engineering)
print("\nMissing values before ffill:")
print(df.isnull().sum())
# For numerical columns, use forward fill (mean/median imputation is an alternative).
# Be careful with forward fill on stock data if there are long gaps;
# for monthly data with few missing values it is usually acceptable.
# DataFrame.ffill() replaces the deprecated fillna(method='ffill').
df = df.ffill()
print("\nMissing values after ffill:")
print(df.isnull().sum())


# ✅ Dataset is now cleaned, structured, and ready for EDA and modeling
print("\n🎉 Dataset is now analysis-ready!")
print("\nFirst 5 rows of the processed DataFrame:")
print(df.head())
print("\nLast 5 rows of the processed DataFrame:")
print(df.tail())

### What all manipulations have you done and insights you found?

**✅ Data Manipulations Done (Data Wrangling)**

1.   **Removed duplicate rows.** Ensured there are no repeated entries that could bias the model.
2.   **Converted the Date column to datetime format.** Allows time-based feature extraction and proper chronological sorting.
3.   **Dropped rows with invalid or missing dates.** Cleaned corrupted records (if any) to maintain data consistency.
4.   **Sorted data by date.** Ensures that stock data flows in proper monthly order, which is important for time series or trend-based models.
5.   **Extracted new time-based features.** Year, Month, and Month_Name were created for better seasonal or temporal trend analysis.
6.   **Created new features (feature engineering).**
     *   High_Low_Diff: captures monthly stock volatility.
     *   Open_Close_Diff: indicates monthly price movement (gain or loss).
7.   **Handled missing values.** Used forward fill to impute any remaining gaps in data.

**📊 Initial Insights Gained**

*   The dataset seems to have monthly granularity, with one record per month.
*   Stock prices vary significantly across months, with some months showing high volatility (High_Low_Diff).
*   There are patterns in price trends where either Open is greater than Close (falling month) or vice versa (rising month).
*   The date column is now usable for trend analysis over years or months.
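The rising and falling months mentioned above can be counted directly from the sign of `Open_Close_Diff`. The sketch below uses a small made-up table rather than the real dataset, with the same Open minus Close definition as the wrangling code.

```python
import pandas as pd

# Made-up monthly prices, only to illustrate the calculation.
sample = pd.DataFrame({
    "Open":  [10.0, 12.0, 11.0, 15.0],
    "Close": [12.0, 11.0, 15.0, 15.0],
})

# Same definition as in the wrangling code: Open minus Close.
sample["Open_Close_Diff"] = sample["Open"] - sample["Close"]

falling_months = int((sample["Open_Close_Diff"] > 0).sum())  # Close below Open
rising_months = int((sample["Open_Close_Diff"] < 0).sum())   # Close above Open
flat_months = int((sample["Open_Close_Diff"] == 0).sum())    # Close equal to Open

print(f"Rising: {rising_months}, falling: {falling_months}, flat: {flat_months}")
```

Note the sign convention: because the feature is Open minus Close, a positive value marks a falling month.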

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

# 📊 Visualize the Closing Price over Time
plt.figure(figsize=(12, 6)) # Set the figure size for better readability
sns.lineplot(data=df, x='Date', y='Close') # Create the line plot using seaborn
plt.title('📈 Yes Bank Stock Closing Price Over Time', fontsize=16) # Add a title
plt.xlabel('Date', fontsize=12) # Label the x-axis
plt.ylabel('Closing Price', fontsize=12) # Label the y-axis
plt.grid(True) # Add a grid for easier reading of values
plt.show() # Display the plot

##### 1. Why did you pick the specific chart?

A line plot is ideal for showing trends and changes in a variable (Closing Price) over a continuous time period (Date).

##### 2. What is/are the insight(s) found from the chart?

Observed the overall trend (upward, downward, or sideways) of the Yes Bank stock closing price over the given time frame.
Identified periods of high volatility (steep increases or decreases) and periods of relative stability.
Saw the approximate range of closing prices over the years.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact: Yes. Identifying upward trends helps in understanding potential growth periods, which can inform investment strategies (e.g., when to buy). Periods of stability can indicate lower risk.

Negative Growth Insights: Yes. The chart likely shows periods of significant price decline, indicating negative growth. These insights are crucial for risk assessment and can inform decisions on when to sell or avoid investing to mitigate losses. Sharp drops clearly mark periods where the stock value decreased, leading to negative returns for investors during those times.
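One way to put a number on the "steep increases or decreases" seen in a line plot is the month-over-month percentage change of the closing price. The following is a sketch on made-up prices, not a result from the Yes Bank data.

```python
import pandas as pd

# Made-up closing prices for six consecutive months (illustration only).
close = pd.Series(
    [100.0, 110.0, 99.0, 120.0, 60.0, 66.0],
    index=pd.date_range("2018-01-01", periods=6, freq="MS"),
    name="Close",
)

# Month-over-month percentage change; the first month has no predecessor and is dropped.
monthly_return = close.pct_change().dropna() * 100

steepest_drop_date = monthly_return.idxmin()
steepest_drop = monthly_return.min()
print(f"Steepest monthly drop: {steepest_drop:.1f}% in {steepest_drop_date:%b %Y}")
```

On the real data, the same calculation would be applied to `df.set_index('Date')['Close']`, assuming `df` is sorted by date as in the wrangling step.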

#### Chart - 2

In [None]:
# Chart - 2 visualization code

# 📊 Visualize the distribution of stock prices by Year using Box Plots
plt.figure(figsize=(15, 8)) # Set the figure size

# Melt the DataFrame so that all four price types share a single 'Price' column
# Date, Year, Month, Month_Name, and the engineered columns are kept as identifier columns
df_melted = df.melt(id_vars=['Date', 'Year', 'Month', 'Month_Name', 'High_Low_Diff', 'Open_Close_Diff'],
                   value_vars=['Open', 'High', 'Low', 'Close'],
                   var_name='Price_Type', # New column for 'Open', 'High', 'Low', 'Close'
                   value_name='Price')   # New column for the price values

sns.boxplot(data=df_melted, x='Year', y='Price', hue='Price_Type', palette='viridis') # Create the box plot
plt.title('📦 Distribution of Stock Prices by Year', fontsize=16) # Add a title
plt.xlabel('Year', fontsize=12) # Label the x-axis
plt.ylabel('Price', fontsize=12) # Label the y-axis
plt.xticks(rotation=45, ha='right') # Rotate x-axis labels for readability
plt.legend(title='Price Type') # Add a legend
plt.tight_layout() # Adjust layout to prevent labels overlapping
plt.show() # Display the plot

##### 1. Why did you pick the specific chart?

Box plots are excellent for visualizing the distribution of a numerical variable across different categories. In this case, we use it to show the distribution of stock prices (Open, High, Low, Close) for each year. This allows us to see the median price, the spread (Interquartile Range), the minimum and maximum values (excluding outliers), and identify potential outliers within each year. Comparing box plots across years helps in understanding how the price range and variability have changed over time.

##### 2. What is/are the insight(s) found from the chart?

You can see how the median price for Open, High, Low, and Close has changed from year to year.
Observe the interquartile range (the box itself) for each year and price type, indicating the typical range of prices for 50% of the data points in that year. This tells you about the price volatility within the year.
Identify potential outliers (points outside the whiskers), which could represent months with unusually high or low price movements.
Compare the overall range of prices (from minimum to maximum whisker values) across different years.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact: Understanding the typical price range and variability within each year can help in setting realistic expectations for trading or investment. Years with tighter boxes might indicate more predictable price movements, which could be favorable for certain strategies. Identifying upward shifts in the median price over consecutive years suggests a positive long-term trend.

Negative Growth Insights: Years with significantly lower median prices or a concentration of values in a lower range compared to previous years indicate a period of negative growth. The presence of outliers below the typical range could signal months with significant price drops. Years with wider boxes might indicate higher volatility and risk, which, if combined with a downward trend, points towards potential for negative returns. For instance, a year where the median price box is much lower than the previous year, and the box itself is quite wide (showing high volatility), is a strong indicator of a challenging period for the stock.
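The quantities a box plot draws (quartiles, the IQR box, and the 1.5 × IQR limits beyond which points are shown as outliers) can also be computed as a table. The sketch below does this per year on made-up closing prices. The helper name `iqr_summary` is hypothetical, and the 1.5 multiplier is the usual box-plot default rather than something tuned for this dataset.

```python
import pandas as pd

# Made-up closing prices for two years (illustration only).
sample = pd.DataFrame({
    "Year":  [2017] * 5 + [2018] * 5,
    "Close": [10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 24.0, 26.0, 100.0],
})

def iqr_summary(prices: pd.Series) -> dict:
    """Return quartiles, IQR, 1.5*IQR limits, and outlier count for one group."""
    q1, median, q3 = prices.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_outliers = int(((prices < lower) | (prices > upper)).sum())
    return {"q1": q1, "median": median, "q3": q3, "iqr": iqr,
            "lower_limit": lower, "upper_limit": upper, "n_outliers": n_outliers}

# One summary row per year.
rows = {year: iqr_summary(group) for year, group in sample.groupby("Year")["Close"]}
per_year = pd.DataFrame.from_dict(rows, orient="index")
print(per_year)
```

In this toy table, the value 100.0 in 2018 lies above the upper limit and is counted as an outlier, which is the kind of point the box plot would draw beyond the whisker.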

#### Chart - 3

In [None]:
# Chart - 3 visualization code

# 📊 Visualize the distribution of engineered features
plt.figure(figsize=(14, 6)) # Set the figure size

# Subplot 1: Distribution of High-Low Difference
plt.subplot(1, 2, 1) # 1 row, 2 columns, 1st plot
sns.histplot(data=df, x='High_Low_Diff', kde=True, bins=20) # Histogram with KDE
plt.title('📈 Distribution of Monthly High-Low Difference', fontsize=14)
plt.xlabel('High-Low Difference', fontsize=10)
plt.ylabel('Frequency', fontsize=10)
plt.grid(axis='y', alpha=0.75)

# Subplot 2: Distribution of Open-Close Difference
plt.subplot(1, 2, 2) # 1 row, 2 columns, 2nd plot
sns.histplot(data=df, x='Open_Close_Diff', kde=True, bins=20) # Histogram with KDE
plt.title('📊 Distribution of Monthly Open-Close Difference', fontsize=14)
plt.xlabel('Open-Close Difference', fontsize=10)
plt.ylabel('Frequency', fontsize=10)
plt.grid(axis='y', alpha=0.75)

plt.tight_layout() # Adjust layout to prevent overlapping titles/labels
plt.show() # Display the plots

##### 1. Why did you pick the specific chart?

 Histograms (with Kernel Density Estimate - KDE) are used to show the distribution and frequency of a single numerical variable. In this case, we are visualizing the distributions of High_Low_Diff and Open_Close_Diff to understand their typical ranges and how often certain values occur. This helps in understanding the characteristic volatility and monthly price swings of the stock.

##### 2. What is/are the insight(s) found from the chart?

For High_Low_Diff, you can see the most frequent range of monthly volatility. A wider spread indicates months with larger price swings.
For Open_Close_Diff, you can observe whether the distribution is centered around zero (meaning prices tend to end the month near where they started), skewed towards positive values (indicating months where the stock tended to close lower than it opened - bearish months), or skewed towards negative values (indicating months where the stock tended to close higher than it opened - bullish months).
The KDE line provides a smoothed estimate of the distribution shape.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact: Understanding the typical monthly volatility (High_Low_Diff) helps in assessing the risk associated with trading or investing in this stock. Lower volatility periods might be preferred by risk-averse investors. Knowing the common range and direction of monthly price movements (Open_Close_Diff) can inform short-term trading strategies. For example, if there's a clear tendency for the stock to close higher than it opens, it might indicate a general upward trend or positive market sentiment within months.

Negative Growth Insights: A distribution of Open_Close_Diff heavily skewed towards positive values (meaning Close < Open) indicates that in many months, the stock price has decreased, which points to negative growth over those periods. Higher frequency of large positive Open_Close_Diff values means frequent significant monthly losses. Similarly, a histogram for High_Low_Diff showing a significant number of months with very large differences implies high volatility, which, during a downtrend, can lead to substantial negative returns quickly.
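Whether a distribution is centered near zero or skewed can be checked numerically alongside the histogram. The sketch below computes the mean, median, and sample skewness for a made-up `Open_Close_Diff` series. A positive skewness means a long tail of large positive values, which under the Open minus Close definition corresponds to a few months with large losses.

```python
import pandas as pd

# Made-up Open minus Close values (illustration only).
diff = pd.Series([-2.0, -1.0, -1.0, 0.0, 1.0, 12.0], name="Open_Close_Diff")

mean_diff = diff.mean()      # pulled upward by the single large value
median_diff = diff.median()  # robust center of the distribution
skewness = diff.skew()       # > 0 means a longer right tail

print(f"mean={mean_diff:.2f}, median={median_diff:.2f}, skewness={skewness:.2f}")
```

Here the median is slightly negative while the mean is positive, showing how one large loss month can shift the mean without changing the typical month.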

#### Chart - 4

In [None]:
# Chart - 4 visualization code

# 📊 Visualize the relationship between Open and Close prices using a Scatter Plot
plt.figure(figsize=(10, 6)) # Set the figure size
sns.scatterplot(data=df, x='Open', y='Close', alpha=0.6) # Create the scatter plot
plt.title('📉 Relationship between Monthly Open and Close Prices', fontsize=16) # Add a title
plt.xlabel('Open Price', fontsize=12) # Label the x-axis
plt.ylabel('Close Price', fontsize=12) # Label the y-axis
plt.grid(True) # Add a grid
plt.show() # Display the plot

##### 1. Why did you pick the specific chart?

 A scatter plot is used to visualize the relationship between two numerical variables. In this case, we want to see how the 'Open' price of the stock relates to its 'Close' price. A scatter plot helps in identifying if there's a linear relationship, the strength of the relationship, and any potential outliers.

##### 2. What is/are the insight(s) found from the chart?

You can observe if there's a positive correlation (as Open price increases, Close price tends to increase), a negative correlation (as Open price increases, Close price tends to decrease), or no clear correlation.
The tightness of the cluster of points indicates the strength of the relationship. Points closely following a line suggest a strong correlation.
You can spot any data points that are far away from the main cluster, which could be outliers representing months with unusual price movements relative to their opening price.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact: A strong positive correlation between Open and Close prices suggests that the opening price is a good indicator of the potential closing price for that month. This can be a positive insight for short-term trading strategies, where anticipating the monthly price movement is important. If the plot shows that generally, when the stock opens high, it also closes high, this supports a positive outlook for bullish market conditions.

Negative Growth Insights: While the scatter plot primarily shows correlation, if the points tend to lie below the line where Open = Close (meaning Close < Open), this visually represents months where the stock price has decreased, indicating negative growth within those periods. A weak or scattered relationship might suggest that the opening price alone is not a strong predictor of the closing price, implying higher volatility or influence from other factors, which could increase investment risk and potentially lead to negative outcomes if price movements are unpredictable.
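Both ideas above, the strength of the linear relationship and the share of points below the Open = Close line, can be expressed as numbers. The sketch below uses made-up prices; on the real data the same calls would take `df['Open']` and `df['Close']`.

```python
import numpy as np

# Made-up monthly prices (illustration only).
open_price = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
close_price = np.array([12.0, 18.0, 33.0, 38.0, 55.0])

# Pearson correlation coefficient between Open and Close.
pearson_r = np.corrcoef(open_price, close_price)[0, 1]

# Share of months lying below the Open = Close diagonal (Close < Open).
share_below_diagonal = np.mean(close_price < open_price)

print(f"Pearson r = {pearson_r:.3f}")
print(f"Share of falling months = {share_below_diagonal:.0%}")
```

A correlation close to 1 only says that price levels move together; the share below the diagonal is what indicates how often a month ended lower than it began.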

#### Chart - 5

In [None]:
# Chart - 5 visualization code

# 📊 Visualize the relationship between High and Close prices using a Scatter Plot with Regression Line
plt.figure(figsize=(10, 6)) # Set the figure size
sns.regplot(data=df, x='High', y='Close', scatter_kws={'alpha':0.6}, line_kws={"color": "red"}) # Create the scatter plot with a regression line
plt.title('📈 Relationship between Monthly High and Close Prices', fontsize=16) # Add a title
plt.xlabel('High Price', fontsize=12) # Label the x-axis
plt.ylabel('Closing Price', fontsize=12) # Label the y-axis
plt.grid(True) # Add a grid
plt.show() # Display the plot

##### 1. Why did you pick the specific chart?

A regplot in Seaborn combines a scatter plot with a regression line. This is useful for visualizing the relationship between two numerical variables ('High' and 'Close') and also showing the linear trend (the regression line) and its confidence interval. This gives a clear picture of how the closing price tends to change as the highest price reached during the month changes.

##### 2. What is/are the insight(s) found from the chart?

You can observe the strength and direction of the linear relationship between the monthly highest price and the closing price.
The slope of the regression line indicates how much the closing price is expected to change for a one-unit increase in the high price.
The spread of the scatter points around the line shows the variability and helps assess how well the 'High' price alone predicts the 'Close' price.
The confidence interval around the regression line shows the range within which the true relationship is likely to lie.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact: A strong positive linear relationship suggests that months where the stock reaches a higher peak tend to also end with a higher closing price. This insight can be used by traders to anticipate potential closing prices based on the high price reached during the month. A strong positive correlation is generally a positive indicator of the stock's ability to hold onto gains.
Negative Growth Insights: While the relationship between 'High' and 'Close' is expected to be positive, a significant spread of points below the regression line could indicate months where, despite reaching a high price, the stock experienced a notable sell-off by the end of the month, leading to a significantly lower closing price. This "failure" to close near the high could be a sign of bearish pressure or negative sentiment, which, if frequent, points to periods or overall patterns of negative growth or difficulty in sustaining positive price movements.
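The slope and the "points below the regression line" discussed above can be reproduced with an ordinary least-squares fit, which is the kind of line `regplot` draws by default. This is a sketch on made-up values, not a fit to the actual High and Close columns.

```python
import numpy as np

# Made-up monthly High and Close prices (illustration only).
high = np.array([10.0, 20.0, 30.0, 40.0])
close = np.array([9.0, 17.0, 29.0, 33.0])

# Fit Close = slope * High + intercept by least squares.
slope, intercept = np.polyfit(high, close, deg=1)

# Residuals: negative values are months that closed below the fitted line.
residuals = close - (slope * high + intercept)
months_below_line = int((residuals < 0).sum())

print(f"slope={slope:.2f}, intercept={intercept:.2f}, months below line={months_below_line}")
```

The slope is read as the expected change in the closing price for a one-unit increase in the monthly high, and large negative residuals flag months that sold off after reaching their peak.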

#### Chart - 6

In [None]:
# Chart - 6 visualization code

# 📊 Visualize the relationship between Low and Close prices using a Scatter Plot with Regression Line
plt.figure(figsize=(10, 6)) # Set the figure size
sns.regplot(data=df, x='Low', y='Close', scatter_kws={'alpha':0.6}, line_kws={"color": "green"}) # Create the scatter plot with a regression line
plt.title('📉 Relationship between Monthly Low and Close Prices', fontsize=16) # Add a title
plt.xlabel('Low Price', fontsize=12) # Label the x-axis
plt.ylabel('Closing Price', fontsize=12) # Label the y-axis
plt.grid(True) # Add a grid
plt.show() # Display the plot

##### 1. Why did you pick the specific chart?

regplot is used to show the linear relationship and trend between 'Low' and 'Close' prices.

##### 2. What is/are the insight(s) found from the chart?

Reveals the correlation between the lowest price hit in a month and the closing price. Indicates how often the closing price is near the monthly low or rebounds significantly from it.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: A strong positive correlation means months with higher lows tend to close higher, suggesting support levels. Negative: Points far above the line (Close much higher than Low) could indicate strong rebounds, but also potential volatility. If lows are consistently decreasing while closing prices are also low, it indicates a strong negative trend.

#### Chart - 7

In [None]:
# Chart - 7 visualization code

# 📊 Visualize High-Low Difference and Open-Close Difference over Time
plt.figure(figsize=(14, 7)) # Set the figure size

# Subplot 1: High-Low Difference over Time
plt.subplot(2, 1, 1) # 2 rows, 1 column, 1st plot
sns.lineplot(data=df, x='Date', y='High_Low_Diff') # Line plot for High-Low Diff
plt.title('📈 Monthly High-Low Difference Over Time', fontsize=16)
plt.xlabel('Date', fontsize=12)
plt.ylabel('High-Low Difference', fontsize=12)
plt.grid(True)

# Subplot 2: Open-Close Difference over Time
plt.subplot(2, 1, 2) # 2 rows, 1 column, 2nd plot
sns.lineplot(data=df, x='Date', y='Open_Close_Diff', color='orange') # Line plot for Open-Close Diff
plt.title('📊 Monthly Open-Close Difference Over Time', fontsize=16)
plt.xlabel('Date', fontsize=12)
plt.ylabel('Open-Close Difference', fontsize=12)
plt.grid(True)

plt.tight_layout() # Adjust layout
plt.show() # Display the plots

##### 1. Why did you pick the specific chart?

Line plots are used to visualize the trend of numerical variables (High_Low_Diff and Open_Close_Diff) over a continuous time axis (Date).

##### 2. What is/are the insight(s) found from the chart?

Observe trends in volatility (High-Low Difference) – are months getting more or less volatile over time? See if there are periods with consistently positive or negative Open-Close Differences, indicating sustained upward or downward monthly price movements.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: Periods of lower volatility can inform less risky trading strategies. Periods with a predominantly negative Open-Close Difference (Close above Open) show that months generally ended higher than they began, which is a positive sign. Negative: Spikes in the High-Low Difference indicate increased volatility and risk. Sustained periods with a predominantly positive Open-Close Difference mean months often ended lower than they began, reflecting a negative trend within those months.
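
The two questions above (is volatility rising, and how often do months close higher than they open?) can be answered numerically as well. This is a minimal sketch on hypothetical data; it assumes `Open_Close_Diff` is defined as Open minus Close, which is the sign convention the interpretation above relies on, and the `HL_Rolling3` column and 3-month window are choices made here for illustration.

```python
import pandas as pd

# Hypothetical monthly data, for illustration only.
demo = pd.DataFrame({
    "Date":  pd.date_range("2020-01-01", periods=6, freq="MS"),
    "Open":  [100.0, 104.0, 103.0, 110.0, 108.0, 115.0],
    "High":  [106.0, 108.0, 112.0, 113.0, 118.0, 121.0],
    "Low":   [98.0, 101.0, 100.0, 105.0, 104.0, 112.0],
    "Close": [104.0, 103.0, 110.0, 108.0, 115.0, 120.0],
})

# Engineered features with the assumed sign convention (Open - Close).
demo["High_Low_Diff"] = demo["High"] - demo["Low"]
demo["Open_Close_Diff"] = demo["Open"] - demo["Close"]

# A 3-month rolling mean smooths the volatility series so trends are easier to read.
demo["HL_Rolling3"] = demo["High_Low_Diff"].rolling(window=3).mean()

# Share of months that closed above their open (negative Open_Close_Diff).
up_month_share = (demo["Open_Close_Diff"] < 0).mean()

print(demo[["Date", "High_Low_Diff", "HL_Rolling3", "Open_Close_Diff"]])
print(f"Share of months closing above the open: {up_month_share:.2%}")
```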

#### Chart - 8

In [None]:
# Chart - 8 visualization code

# 📊 Visualize the relationship between High-Low Difference and Close Price
plt.figure(figsize=(10, 6)) # Set the figure size
sns.scatterplot(data=df, x='High_Low_Diff', y='Close', alpha=0.6) # Create the scatter plot
plt.title('📈 Relationship between Monthly Volatility (High-Low Diff) and Closing Price', fontsize=16) # Add a title
plt.xlabel('High-Low Difference', fontsize=12) # Label the x-axis
plt.ylabel('Closing Price', fontsize=12) # Label the y-axis
plt.grid(True) # Add a grid
plt.show() # Display the plot

##### 1. Why did you pick the specific chart?

A scatter plot is used to visualize the relationship between High_Low_Diff (a measure of monthly volatility) and the target variable Close. It helps determine if there's a pattern or correlation between how much the stock price moves within a month and the final closing price.

##### 2. What is/are the insight(s) found from the chart?

The chart shows whether months with higher volatility tend to be associated with higher or lower closing prices, or whether there is no clear relationship. The spread of points shows the variability in closing prices for similar levels of volatility.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: If higher volatility is often associated with higher closing prices, it could suggest that upward price movements are often accompanied by large intra-month swings, which could inform trading strategies focused on volatile periods. Negative: If high volatility is associated with a wide range of closing prices, it indicates that high volatility doesn't guarantee a favorable closing price and could imply increased risk. If high volatility frequently coincides with lower closing prices (points clustered in the bottom-right), it's a strong indicator of negative growth potential during volatile periods.
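
One caveat when reading this scatter plot: the absolute High-Low range tends to grow with the price level, so a positive relationship with Close may be partly mechanical. The sketch below uses hypothetical data, and the `Range_Pct` column is a name introduced here. It shows a case where the absolute range correlates perfectly with Close even though volatility relative to price never changes.

```python
import pandas as pd

# Hypothetical data in which every month has the same relative range.
demo = pd.DataFrame({
    "High":  [12.0, 24.0, 60.0, 120.0, 240.0],
    "Low":   [10.0, 20.0, 50.0, 100.0, 200.0],
    "Close": [11.0, 22.0, 55.0, 110.0, 220.0],
})

# Absolute monthly range, as used in the chart above.
demo["High_Low_Diff"] = demo["High"] - demo["Low"]

# Range relative to the closing price (an alternative, scale-free measure).
demo["Range_Pct"] = demo["High_Low_Diff"] / demo["Close"]

abs_corr = demo["High_Low_Diff"].corr(demo["Close"])

print(f"Correlation of absolute range with Close: {abs_corr:.3f}")
print(f"Spread (std) of relative range: {demo['Range_Pct'].std():.6f}")
```

Plotting a relative measure alongside `High_Low_Diff` is one way to check whether high-priced months were actually more volatile.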

#### Chart - 9

In [None]:
# Chart - 9 visualization code


# 📊 Visualize the relationship between Open-Close Difference and Close Price
plt.figure(figsize=(10, 6)) # Set the figure size
sns.scatterplot(data=df, x='Open_Close_Diff', y='Close', alpha=0.6) # Create the scatter plot
plt.title('📊 Relationship between Monthly Price Movement (Open-Close Diff) and Closing Price', fontsize=16) # Add a title
plt.xlabel('Open-Close Difference', fontsize=12) # Label the x-axis
plt.ylabel('Closing Price', fontsize=12) # Label the y-axis
plt.grid(True) # Add a grid
plt.show() # Display the plot

##### 1. Why did you pick the specific chart?

A scatter plot is used to visualize the relationship between Open_Close_Diff (indicating monthly price gain/loss) and the target variable Close. It helps understand how the magnitude and direction of the monthly price change relates to the final closing value.

##### 2. What is/are the insight(s) found from the chart?

The chart shows whether months with a large positive Open_Close_Diff (the price dropped significantly within the month) tend to have lower closing prices, and whether months with a large negative Open_Close_Diff (the price rose significantly within the month) tend to have higher closing prices. A concentration of points around zero Open_Close_Diff indicates months where the open and close prices were similar.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: If a clear pattern emerges where negative Open_Close_Diff values (monthly gains) strongly correlate with higher closing prices, it reinforces the idea that periods of upward monthly movement result in favorable closing prices. This can inform bullish strategies. Negative: If months with significant positive Open_Close_Diff (monthly losses) cluster heavily at low closing prices, it directly highlights periods of negative growth and the impact of downward monthly price swings. A wide scatter of points for similar Open_Close_Diff values would suggest that the monthly movement isn't the sole determinant of the closing price and other factors play a role, implying less predictability.

#### Chart - 10

In [None]:
# Chart - 10 visualization code

# 📊 Visualize the distribution of Closing Price by Month using Violin Plots
plt.figure(figsize=(14, 7)) # Set the figure size
sns.violinplot(data=df, x='Month_Name', y='Close', palette='coolwarm', order=['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']) # Create the violin plot
plt.title('🎻 Distribution of Monthly Closing Price by Month', fontsize=16) # Add a title
plt.xlabel('Month', fontsize=12) # Label the x-axis
plt.ylabel('Closing Price', fontsize=12) # Label the y-axis
plt.grid(axis='y', alpha=0.75) # Add a grid
plt.show() # Display the plot

##### 1. Why did you pick the specific chart?

Violin plots show the distribution of a numerical variable across different categories (in this case, Close price by Month_Name). Unlike box plots, they display the density of the data at each price level within each month, which highlights multimodality if present.

##### 2. What is/are the insight(s) found from the chart?

Observe how the distribution of closing prices varies across different months of the year. Identify months that tend to have higher or lower median closing prices, or months with wider or narrower distributions (indicating volatility). See if there are any months with unusual clusters of closing prices or potential outliers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: Identifying months where the stock price distribution is generally higher or less volatile can inform favorable trading or investment timing. Understanding seasonal patterns might lead to strategies capitalizing on typical monthly movements. Negative: Months where the distribution is centered at lower prices, or has a wider spread towards lower values, indicate periods of potential negative growth or higher risk. If certain months consistently show lower price distributions over the years, it suggests a seasonal negative trend.
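
The per-month medians and spreads that the violin plot shows can also be tabulated. The sketch below runs on a hypothetical two-year series rather than the project's `df`, and `monthly_summary` is a name introduced here. With a trending stock, month-of-year differences can simply reflect the trend, so such a table should be read with that in mind.

```python
import pandas as pd

# Hypothetical 24 months of closing prices, for illustration only.
demo = pd.DataFrame({
    "Date":  pd.date_range("2019-01-01", periods=24, freq="MS"),
    "Close": [float(v) for v in range(100, 124)],
})
demo["Month_Name"] = demo["Date"].dt.month_name()

month_order = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December"]

# Median, spread, and sample size of Close for each calendar month,
# reordered from alphabetical to calendar order.
monthly_summary = (
    demo.groupby("Month_Name")["Close"]
        .agg(["median", "std", "count"])
        .reindex(month_order)
)

print(monthly_summary)
```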

#### Chart - 11

In [None]:
# Chart - 11 visualization code

# 📊 Visualize Open, High, Low, and Close Prices over Time
plt.figure(figsize=(14, 7)) # Set the figure size

sns.lineplot(data=df, x='Date', y='Open', label='Open') # Line plot for Open Price
sns.lineplot(data=df, x='Date', y='High', label='High') # Line plot for High Price
sns.lineplot(data=df, x='Date', y='Low', label='Low')   # Line plot for Low Price
sns.lineplot(data=df, x='Date', y='Close', label='Close') # Line plot for Close Price

plt.title('📈 Open, High, Low, and Closing Prices Over Time', fontsize=16) # Add a title
plt.xlabel('Date', fontsize=12) # Label the x-axis
plt.ylabel('Price', fontsize=12) # Label the y-axis
plt.legend() # Add a legend to distinguish the lines
plt.grid(True) # Add a grid
plt.show() # Display the plot

##### 1. Why did you pick the specific chart?

A multi-line plot is used to compare the trends of multiple numerical variables (Open, High, Low, Close) over a continuous time period (Date). This allows for easy visual comparison of their movements relative to each other.



##### 2. What is/are the insight(s) found from the chart?

Observe the relationship and relative positions of the Open, High, Low, and Close lines over time. See how the spread between High and Low (volatility) changes. Note if the Close price tends to follow the Open price closely or deviate significantly. Identify periods where all prices move together in a strong trend.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: Seeing consistent upward trends across all four price points indicates strong positive growth periods, informing buy decisions. Observing that Close price often stays near the High price could indicate bullish sentiment and strength. Negative: Periods where the lines consistently trend downwards represent negative growth. If the Close price frequently ends near the Low price, it suggests bearish pressure and difficulty in maintaining value, indicating negative sentiment.

#### Chart - 12

In [None]:
# Chart - 12 visualization code

# 📊 Visualize Average Closing Price per Year using a Bar Plot
plt.figure(figsize=(12, 6)) # Set the figure size

# Calculate the average closing price per year
avg_close_by_year = df.groupby('Year')['Close'].mean().reset_index()

sns.barplot(data=avg_close_by_year, x='Year', y='Close', palette='viridis') # Create the bar plot
plt.title('📊 Average Yes Bank Stock Closing Price per Year', fontsize=16) # Add a title
plt.xlabel('Year', fontsize=12) # Label the x-axis
plt.ylabel('Average Closing Price', fontsize=12) # Label the y-axis
plt.xticks(rotation=45, ha='right') # Rotate x-axis labels
plt.grid(axis='y', alpha=0.75) # Add a grid
plt.show() # Display the plot

##### 1. Why did you pick the specific chart?

A bar plot is effective for comparing a numerical value (average closing price) across distinct categories (each year). It provides a clear visual comparison of the average stock value in each year.

##### 2. What is/are the insight(s) found from the chart?

Clearly see the average closing price for each year and how it changes from one year to the next. Easily identify years with the highest and lowest average closing prices. Observe periods of sustained increase or decrease in the average yearly price.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: Years with significantly taller bars than previous years indicate strong positive growth in the average stock value for that year, informing long-term investment perspective. Negative: Years with significantly shorter bars than previous years clearly highlight periods of negative growth in the average stock value. A downward trend in the bar heights over consecutive years is a strong indicator of sustained negative performance.
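
Bar heights can be turned into explicit year-over-year changes. This is a minimal sketch on hypothetical numbers (the `demo` frame and `yoy_change` name are introduced here); on the real data the same `pct_change` call could be applied to the yearly averages of `Close`.

```python
import pandas as pd

# Hypothetical closing prices, two months per year, for illustration only.
demo = pd.DataFrame({
    "Year":  [2016, 2016, 2017, 2017, 2018, 2018],
    "Close": [100.0, 120.0, 150.0, 180.0, 90.0, 75.0],
})

# Average closing price per year (the quantity shown by each bar).
avg_close = demo.groupby("Year")["Close"].mean()

# Percentage change of the yearly average relative to the previous year.
yoy_change = avg_close.pct_change() * 100

print(avg_close)
print(yoy_change.round(1))
```

The first year has no previous value, so its change is NaN.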

#### Chart - 13

In [None]:
# Chart - 13 visualization code

# 📊 Visualize Average High-Low Difference and Open-Close Difference per Year
plt.figure(figsize=(14, 7)) # Set the figure size

# Calculate the average engineered features per year
avg_engineered_by_year = df.groupby('Year')[['High_Low_Diff', 'Open_Close_Diff']].mean().reset_index()

# Subplot 1: Average High-Low Difference per Year
plt.subplot(1, 2, 1) # 1 row, 2 columns, 1st plot
sns.barplot(data=avg_engineered_by_year, x='Year', y='High_Low_Diff', palette='coolwarm')
plt.title('📈 Average Monthly High-Low Difference per Year', fontsize=14)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Average High-Low Difference', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', alpha=0.75)

# Subplot 2: Average Open-Close Difference per Year
plt.subplot(1, 2, 2) # 1 row, 2 columns, 2nd plot
sns.barplot(data=avg_engineered_by_year, x='Year', y='Open_Close_Diff', palette='viridis')
plt.title('📊 Average Monthly Open-Close Difference per Year', fontsize=14)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Average Open-Close Difference', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', alpha=0.75)

plt.tight_layout() # Adjust layout
plt.show() # Display the plots

##### 1. Why did you pick the specific chart?

Bar plots are used to compare the average values of High_Low_Diff and Open_Close_Diff across different years, providing a clear visual comparison of how these metrics have trended annually.

##### 2. What is/are the insight(s) found from the chart?

Observe which years had the highest or lowest average monthly volatility (High_Low_Diff). See if there are years where the average monthly price movement (Open_Close_Diff) was significantly positive (indicating average monthly loss) or negative (indicating average monthly gain). Identify periods where the stock was consistently more or less volatile or showed a consistent average monthly trend.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: Identifying years with lower average volatility can indicate periods of lower risk. Years with a consistent negative average Open_Close_Diff suggest that months in that year typically ended higher than they opened, which is a positive sign for investors. Negative: Years with high average High_Low_Diff indicate increased risk due to volatility. Years with a consistent positive average Open_Close_Diff mean months in that year typically ended lower than they opened, indicating a period of average monthly losses and negative growth.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Chart - 14 - Correlation Heatmap visualization code

# 📊 Compute the correlation matrix
# Select only the numerical columns for the heatmap
numerical_cols = df.select_dtypes(include=np.number).columns
corr_matrix = df[numerical_cols].corr()

# Plot the heatmap
plt.figure(figsize=(10, 8)) # Set the figure size
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('📊 Correlation Heatmap of Numerical Features', fontsize=16) # Add a title
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A correlation heatmap is used to visualize the pairwise correlation coefficients between multiple numerical variables. It provides a quick overview of how strongly and in what direction each pair of variables is linearly related. This is essential for identifying multicollinearity among features and understanding relationships with the target variable (Close).

##### 2. What is/are the insight(s) found from the chart?

Identify variables that are highly correlated with the target variable (Close). Look for strong positive or negative correlations between predictor variables (e.g., between Open, High, and Low). Understand the strength of the relationship between the engineered features (High_Low_Diff, Open_Close_Diff) and the other variables, including the target.
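
Highly correlated pairs can be listed programmatically instead of being read off the heatmap. The sketch below uses hypothetical prices, and the 0.9 threshold is an arbitrary cutoff chosen for illustration. It keeps the upper triangle of the correlation matrix so each pair appears once.

```python
import numpy as np
import pandas as pd

# Hypothetical price columns, for illustration only.
demo = pd.DataFrame({
    "Open":  [10.0, 12.0, 11.0, 15.0, 18.0, 17.0],
    "High":  [12.0, 13.0, 14.0, 18.0, 20.0, 19.0],
    "Low":   [9.0, 11.0, 10.0, 14.0, 16.0, 15.0],
    "Close": [11.0, 12.5, 13.0, 17.0, 18.0, 16.0],
})

corr = demo.corr()

# Boolean mask for the strict upper triangle (excludes the diagonal).
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Long format: one row per unique pair of columns.
pairs = corr.where(mask).stack().dropna()

threshold = 0.9  # illustrative cutoff, not a rule
high_pairs = pairs[pairs.abs() > threshold].sort_values(ascending=False)

print(high_pairs)
```

Pairs of predictors that appear in `high_pairs` are candidates for multicollinearity in linear models.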

#### Chart - 15 - Pair Plot

In [None]:
# Chart - 15 - Pair Plot visualization code

# 📊 Create a Pair Plot of the numerical features
# Select only the numerical columns for the pair plot
numerical_cols = df.select_dtypes(include=np.number).columns

# It's often good to exclude the 'Year' and 'Month' for pair plot if they are just identifiers,
# but in this case, they might show trends related to price. Let's include them.
# You might exclude High_Low_Diff and Open_Close_Diff if the plot becomes too crowded,
# but they are engineered features, so let's include them to see relationships.

# Consider limiting the number of columns if the dataset is large to keep the plot readable.
# For this dataset, including all numerical columns should be fine.

sns.pairplot(df[numerical_cols], diag_kind='kde') # Create the pair plot
plt.suptitle('📊 Pair Plot of Numerical Features', y=1.02, fontsize=16) # Add a title
plt.show()

##### 1. Why did you pick the specific chart?

A pair plot is a matrix of scatter plots for each pair of variables and histograms/KDEs for the diagonal elements. It provides a comprehensive visual overview of the relationships between all numerical variables and their individual distributions in a single visualization.

##### 2. What is/are the insight(s) found from the chart?

Quickly see the relationships between all pairs of numerical features, including how each predictor relates to the target variable (Close). Observe the distribution of each individual numerical feature (on the diagonal). Identify potential linear or non-linear relationships and clusters in the data. It complements the correlation heatmap by showing the actual scatter of points.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

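The three statements tested below are: (1) the average closing price differs between the years before 2018 and the years from 2018 onward; (2) there is a linear relationship between the monthly High price and the Close price; and (3) average monthly volatility (High_Low_Diff) differs between months that closed lower than they opened and months that closed higher than they opened.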

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

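Null Hypothesis ($H_0$): The average closing price of Yes Bank stock is not significantly different between the years before 2018 and the years from 2018 onward (when the fraud case became prominent). (The true population mean Close is the same for both periods.)

Alternate Hypothesis ($H_a$): The average closing price of Yes Bank stock is significantly different between the years before 2018 and the years from 2018 onward. (The true population mean Close is different for the two periods.)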

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value for Hypothesis 1

from scipy.stats import ttest_ind # Import the independent t-test function

# Define the two groups: closing prices before 2018 and from 2018 onward
# (2018 itself falls in the second group because of the >= comparison)
# Ensure df['Year'] is treated as a number for comparison
df['Year'] = pd.to_numeric(df['Year'])

close_before_2018 = df[df['Year'] < 2018]['Close']
close_after_2018 = df[df['Year'] >= 2018]['Close']

# Perform the independent samples t-test
# We set equal_var=False to perform Welch's t-test, which does not assume equal variances.
# This is generally safer unless you have strong evidence of equal variances.
t_statistic, p_value = ttest_ind(close_before_2018, close_after_2018, equal_var=False)

print(f"Independent Samples t-test Results for Hypothesis 1:")
print(f"  Mean Closing Price Before 2018: {close_before_2018.mean():.2f}")
print(f"  Mean Closing Price From 2018 Onward: {close_after_2018.mean():.2f}")
print(f"  T-statistic: {t_statistic:.4f}")
print(f"  P-value: {p_value:.4f}")

# Interpret the results
alpha = 0.05 # Set a significance level (commonly 0.05)

print("\nConclusion:")
if p_value < alpha:
    print(f"  The p-value ({p_value:.4f}) is less than the significance level ({alpha}).")
    print("  We reject the null hypothesis.")
    print("  Conclusion: The average closing price of Yes Bank stock is significantly different in years after 2018 compared to years before 2018.")
else:
    print(f"  The p-value ({p_value:.4f}) is greater than or equal to the significance level ({alpha}).")
    print("  We fail to reject the null hypothesis.")
    print("  Conclusion: There is not enough statistical evidence to conclude that the average closing price of Yes Bank stock is significantly different in years after 2018 compared to years before 2018.")

##### Which statistical test have you done to obtain P-Value?

I used the Independent Samples t-test (specifically, Welch's t-test).

##### Why did you choose the specific statistical test?

I chose the Independent Samples t-test because:

- It compares means: the hypothesis is about the average closing price, and the t-test is designed to compare the means of two groups.
- Independent groups: the observations before 2018 and those from 2018 onward are not paired; a closing price in one period has no matched counterpart in the other. Monthly prices within each period are autocorrelated, however, so the independence assumption holds only approximately and the p-value should be read as indicative.
- Continuous data: the variable being compared (Close) is a continuous numerical price.
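
To make the test less of a black box, the Welch statistic can be computed by hand and compared with SciPy's result. This is a sketch on two small hypothetical samples (`group_a`, `group_b`), not on the Yes Bank data; the degrees of freedom use the Welch-Satterthwaite approximation.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical samples standing in for the two periods.
group_a = np.array([50.0, 55.0, 60.0, 65.0, 70.0])
group_b = np.array([20.0, 22.0, 30.0, 35.0])

mean_a, mean_b = group_a.mean(), group_b.mean()
var_a, var_b = group_a.var(ddof=1), group_b.var(ddof=1)  # sample variances
n_a, n_b = len(group_a), len(group_b)

# Welch's t: difference in means over the unpooled standard error.
standard_error = np.sqrt(var_a / n_a + var_b / n_b)
t_manual = (mean_a - mean_b) / standard_error

# Welch-Satterthwaite approximation of the degrees of freedom.
dof = (var_a / n_a + var_b / n_b) ** 2 / (
    (var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1)
)

# SciPy's implementation with equal_var=False should give the same statistic.
t_scipy, p_scipy = ttest_ind(group_a, group_b, equal_var=False)

print(f"t (manual): {t_manual:.4f}, t (SciPy): {t_scipy:.4f}")
print(f"Approximate degrees of freedom: {dof:.2f}, p-value: {p_scipy:.4g}")
```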

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis ($H_0$): There is no statistically significant linear relationship between the High price and the Close price of Yes Bank stock. (The true population correlation coefficient is zero.)

Alternate Hypothesis ($H_a$): There is a statistically significant linear relationship between the High price and the Close price of Yes Bank stock. (The true population correlation coefficient is not zero.)

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value for Hypothesis 2

from scipy.stats import pearsonr # Import the Pearson correlation function

# Perform the Pearson correlation test between 'High' and 'Close' prices
correlation_coefficient, p_value = pearsonr(df['High'], df['Close'])

print(f"Pearson Correlation Test Results for Hypothesis 2:")
print(f"  Correlation Coefficient (High vs. Close): {correlation_coefficient:.4f}")
print(f"  P-value: {p_value:.4f}")

# Interpret the results
alpha = 0.05 # Set a significance level (commonly 0.05)

print("\nConclusion:")
if p_value < alpha:
    print(f"  The p-value ({p_value:.4f}) is less than the significance level ({alpha}).")
    print("  We reject the null hypothesis.")
    print("  Conclusion: There is a statistically significant linear relationship between the High price and the Close price of Yes Bank stock.")
else:
    print(f"  The p-value ({p_value:.4f}) is greater than or equal to the significance level ({alpha}).")
    print("  We fail to reject the null hypothesis.")
    print("  Conclusion: There is not enough statistical evidence to conclude that there is a statistically significant linear relationship between the High price and the Close price of Yes Bank stock.")

##### Which statistical test have you done to obtain P-Value?

For Hypothetical Statement 2, the statistical test used is the Pearson Correlation Test.



##### Why did you choose the specific statistical test?

- Measuring Linear Relationship: Our hypothesis specifically concerns a linear relationship between High and Close. The Pearson correlation coefficient is the standard measure for the strength and direction of a linear association between two continuous variables.
- Two Continuous Variables: The test is appropriate for analyzing the relationship between two continuous numerical variables, which High and Close both are.
- Testing for Significance: The test provides a p-value to determine if the observed linear correlation in our sample data is strong enough to conclude that a linear relationship exists in the overall population (i.e., the correlation coefficient is statistically different from zero).
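
For reference, the quantities behind this test can be written out. Here $x_i$ and $y_i$ stand for the monthly High and Close values and $n$ for the number of months; these symbols are introduced here for illustration. Under $H_0$, and assuming roughly bivariate normal data, $t$ follows a Student's t distribution with $n - 2$ degrees of freedom, which is the basis for the two-sided p-value.

$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad
t = r\,\sqrt{\frac{n - 2}{1 - r^2}}
$$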

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis ($H_0$): The average monthly volatility (High_Low_Diff) is not significantly different between months where the stock closed lower than it opened and months where it closed higher than it opened. (The true population mean High_Low_Diff is the same for both groups.)

Alternate Hypothesis ($H_a$): The average monthly volatility (High_Low_Diff) is significantly different between months where the stock closed lower than it opened and months where it closed higher than it opened. (The true population mean High_Low_Diff is different for the two groups.)

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value for Hypothesis 3

from scipy.stats import ttest_ind # Import the independent t-test function

# Define the two groups based on Open-Close Difference:
# Group 1: Months where Close < Open (Open_Close_Diff > 0)
volatility_when_close_lower = df[df['Open_Close_Diff'] > 0]['High_Low_Diff']

# Group 2: Months where Close > Open (Open_Close_Diff < 0)
volatility_when_close_higher = df[df['Open_Close_Diff'] < 0]['High_Low_Diff']

# Note: Months where Open == Close (Open_Close_Diff == 0) are excluded from both groups.
# Check if either group is empty before performing the test
if volatility_when_close_lower.empty or volatility_when_close_higher.empty:
    print("Cannot perform t-test: One or both groups are empty.")
    print(f"  Months with Close < Open: {len(volatility_when_close_lower)}")
    print(f"  Months with Close > Open: {len(volatility_when_close_higher)}")
else:
    # Perform the independent samples t-test (using Welch's t-test)
    t_statistic, p_value = ttest_ind(volatility_when_close_lower, volatility_when_close_higher, equal_var=False)

    print(f"Independent Samples t-test Results for Hypothesis 3:")
    print(f"  Mean High-Low Diff when Close < Open: {volatility_when_close_lower.mean():.2f}")
    print(f"  Mean High-Low Diff when Close > Open: {volatility_when_close_higher.mean():.2f}")
    print(f"  T-statistic: {t_statistic:.4f}")
    print(f"  P-value: {p_value:.4f}")

    # Interpret the results
    alpha = 0.05 # Set a significance level (commonly 0.05)

    print("\nConclusion:")
    if p_value < alpha:
        print(f"  The p-value ({p_value:.4f}) is less than the significance level ({alpha}).")
        print("  We reject the null hypothesis.")
        print("  Conclusion: The average monthly volatility (High_Low_Diff) is significantly different between months where the stock closed lower than it opened and months where it closed higher than it opened.")
    else:
        print(f"  The p-value ({p_value:.4f}) is greater than or equal to the significance level ({alpha}).")
        print("  We fail to reject the null hypothesis.")
        print("  Conclusion: There is not enough statistical evidence to conclude that the average monthly volatility (High_Low_Diff) is significantly different between months where the stock closed lower than it opened and months where it closed higher than it opened.")

##### Which statistical test have you done to obtain P-Value?

Independent Samples t-test (specifically, Welch's t-test).



##### Why did you choose the specific statistical test?

Welch's t-test was chosen because it compares the means of a continuous variable (High_Low_Diff) between two independent groups of months, defined by the sign of Open_Close_Diff. The t-test is designed for this kind of mean comparison, and Welch's version is robust to unequal variances between the groups.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# 🧽 7. Handle missing values if any (Check again after feature engineering)
print("\nMissing values before ffill:")
print(df.isnull().sum())
# For numerical columns, fill with forward fill (or you can use mean/median)
# You might want to be careful with ffill on stock data if there are long gaps
# Given this is monthly data, ffill might be acceptable, but consider alternatives
# like imputation with mean/median if you have significant missing data.
df.ffill(inplace=True)  # DataFrame.ffill replaces fillna(method='ffill'), deprecated since pandas 2.1
print("\nMissing values after ffill:")
print(df.isnull().sum())

#### What all missing value imputation techniques have you used and why did you use those techniques?

Forward fill (ffill) was used for missing value imputation.

Why forward fill?

Forward fill propagates the last valid observation forward, which makes it a common choice for time series data.

Reasoning: For stock prices, filling a missing month with the previous month's value is a reasonable assumption. It reflects the idea that prices tend to hold their level unless a new event occurs, and it preserves the temporal order of the data. This usually suits time series better than the mean or median of the entire dataset, which ignores trends and seasonality. Because the data is monthly, long runs of missing values would make forward fill less reliable, so the missing-value counts printed before and after the fill should be checked.
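
The behaviour of forward fill, including the long-gap concern, can be seen on a toy series. The values below are hypothetical, and `limit=1` is shown only as one possible safeguard, not as something the notebook applies.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly closing prices with gaps, for illustration only.
close = pd.Series(
    [100.0, np.nan, np.nan, 130.0, np.nan],
    index=pd.date_range("2021-01-01", periods=5, freq="MS"),
    name="Close",
)

# Plain forward fill: every gap takes the last observed value.
filled = close.ffill()

# Forward fill limited to one consecutive month: longer gaps stay missing.
filled_limited = close.ffill(limit=1)

print(pd.DataFrame({"original": close, "ffill": filled, "ffill_limit_1": filled_limited}))
```

Forward fill cannot fill a missing value at the very start of a series, since there is no earlier observation to carry forward.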

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments Code

# Let's identify potential outliers using the Interquartile Range (IQR) method
# This method is good for skewed data and doesn't assume a normal distribution.

# Columns where you want to check for outliers (e.g., Open, High, Low, Close, engineered features)
outlier_check_cols = ['Open', 'High', 'Low', 'Close', 'High_Low_Diff', 'Open_Close_Diff']

print("--- Checking for Outliers (IQR Method) ---")

for col in outlier_check_cols:
    print(f"\nChecking column: {col}")

    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1

    # Define bounds for outliers
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Identify outliers
    outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]

    print(f"  Q1 ({col}): {Q1:.2f}")
    print(f"  Q3 ({col}): {Q3:.2f}")
    print(f"  IQR ({col}): {IQR:.2f}")
    print(f"  Lower Bound: {lower_bound:.2f}")
    print(f"  Upper Bound: {upper_bound:.2f}")
    print(f"  Number of outliers detected: {len(outliers)}")

    # Optional: Display the outlier rows for inspection
    # if not outliers.empty:
    #     print("  Sample Outliers:")
    #     print(outliers[[col, 'Date']].head()) # Display the outlier value and date

# --- Outlier Treatment Techniques (Choose ONE method per column, or decide not to treat) ---
# Note: Applying treatments will modify your DataFrame 'df'.
# Decide based on your analysis of the detected outliers.

# Example Treatment Option 1: Capping (Winsorization) - replace outliers with the bound values
# This is often preferred for stock data as it keeps the data points but reduces their extreme influence.
# for col in outlier_check_cols:
#     Q1 = df[col].quantile(0.25)
#     Q3 = df[col].quantile(0.75)
#     IQR = Q3 - Q1
#     lower_bound = Q1 - 1.5 * IQR
#     upper_bound = Q3 + 1.5 * IQR
#     df[col] = np.where(df[col] < lower_bound, lower_bound, df[col])
#     df[col] = np.where(df[col] > upper_bound, upper_bound, df[col])
# print("\n--- Applied Outlier Capping (IQR Method) ---")


# Example Treatment Option 2: Removal - remove the rows containing outliers
# Use this cautiously as it can reduce your dataset size, especially if outliers are frequent.
# initial_rows = df.shape[0]
# for col in outlier_check_cols:
#     Q1 = df[col].quantile(0.25)
#     Q3 = df[col].quantile(0.75)
#     IQR = Q3 - Q1
#     lower_bound = Q1 - 1.5 * IQR
#     upper_bound = Q3 + 1.5 * IQR
#     df = df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]
# # Reset index after removing rows
# df = df.reset_index(drop=True)
# print(f"\n--- Applied Outlier Removal (IQR Method) ---")
# print(f"  Rows removed: {initial_rows - df.shape[0]}")


# Example Treatment Option 3: Transformation (e.g., Log Transformation)
# This can make the data distribution more symmetric and reduce the impact of extreme values, but it changes the scale.
# for col in ['Open', 'High', 'Low', 'Close']: # Apply to original price columns
#      df[col + '_log'] = np.log1p(df[col]) # Using log1p handles zero values
# print("\n--- Applied Log Transformation (as a form of outlier treatment) ---")


# Check the shape and descriptive statistics after treatment if you applied one
# print("\nShape after outlier treatment:", df.shape)
# print("\nDescriptive statistics after outlier treatment:")
# print(df.describe())

##### What all outlier treatment techniques have you used and why did you use those techniques?

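Three IQR-based options are sketched in the cell above. All of them are left commented out, so none is applied unless it is uncommented.
Capping (Winsorization): Replaces outliers with the lower or upper bound determined by the IQR method. It keeps every data point but reduces its extreme influence, which is why it is often preferred for stock data.
Removal: Deletes the rows where outliers are detected. It should be used cautiously because it reduces the dataset size, especially when outliers are frequent.
Transformation: Applies a mathematical transformation such as the log transformation (np.log1p). It makes the distribution more symmetric and reduces the impact of extreme values, but it changes the scale of the data.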
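
The capping option can be sketched as a small reusable function. This is a minimal illustration on made-up numbers, not the Yes Bank data; the helper name `cap_outliers_iqr` and the toy frame are assumptions introduced here. It uses `Series.clip` instead of two `np.where` calls and returns a copy so the original frame stays untouched.

```python
import pandas as pd

def cap_outliers_iqr(frame: pd.DataFrame, cols, k: float = 1.5) -> pd.DataFrame:
    """Return a copy of `frame` with each column in `cols` clipped to
    [Q1 - k*IQR, Q3 + k*IQR]."""
    capped = frame.copy()
    for col in cols:
        q1 = capped[col].quantile(0.25)
        q3 = capped[col].quantile(0.75)
        iqr = q3 - q1
        capped[col] = capped[col].clip(lower=q1 - k * iqr, upper=q3 + k * iqr)
    return capped

# Toy data (made up): one extreme value in an otherwise narrow range
toy = pd.DataFrame({"Close": [10.0, 11.0, 12.0, 13.0, 14.0, 100.0]})
capped = cap_outliers_iqr(toy, ["Close"])

print(capped["Close"].tolist())
print("rows before:", len(toy), "rows after:", len(capped))
```

Capping keeps the row count unchanged, which is the main reason to prefer it over removal when every monthly observation matters.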

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

# --- Check for categorical columns first ---
print("Data types before encoding:")
print(df.dtypes)
print("\nUnique values in 'Month_Name' before encoding:", df['Month_Name'].unique())


# Perform One-Hot Encoding for the 'Month_Name' column
# drop_first=True is often used to avoid multicollinearity
df = pd.get_dummies(df, columns=['Month_Name'], drop_first=True)

print("\nData types after encoding:")
print(df.dtypes)
print("\nShape after encoding:", df.shape)
print("\nFirst 5 rows after encoding:")
print(df.head())

#### What all categorical encoding techniques have you used & why did you use those techniques?

Technique used: One-Hot Encoding via pd.get_dummies() on the Month_Name column.
Why used:
Handles nominal data: One-hot encoding is suitable for nominal categorical variables (categories without a specific order), or when an arbitrary ordinal relationship should not be imposed. Months do have a cyclical order, but one-hot encoding treats each month as a distinct category, which can capture monthly patterns without assuming a linear trend.
Avoids misinterpretation: It prevents models from reading an ordinal relationship into the categories (e.g., treating 'December' as numerically greater than 'January' in a way that affects the outcome linearly).
Compatibility: Most machine learning algorithms require numeric input, and one-hot encoded columns satisfy that requirement.
Multicollinearity: drop_first=True removes one dummy column so that the remaining ones are not perfectly collinear (the dummy variable trap). This matters for linear models; tree-based models are less sensitive to it.
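
A short sketch of what `drop_first=True` does in practice. The frame below is made up; only the column name `Month_Name` is taken from the notebook. One detail worth knowing: pandas orders the categories of a string column alphabetically, so the dropped reference category is April rather than January.

```python
import pandas as pd

# Made-up frame containing all twelve month names
months = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]
demo = pd.DataFrame({"Month_Name": months, "Close": range(12)})

encoded = pd.get_dummies(demo, columns=["Month_Name"], drop_first=True, dtype=int)

# Collect the dummy columns and work out which category was dropped
month_cols = [c for c in encoded.columns if c.startswith("Month_Name_")]
dropped = {f"Month_Name_{m}" for m in months} - set(month_cols)

print("dummy columns:", len(month_cols))
print("reference category:", dropped)
```

The row for the reference month has zeros in every dummy column, so its effect is absorbed by the model's intercept.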

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words containing digits

In [None]:
# Remove URLs & remove words containing digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

# --- Review of Engineered Features ---
print("Engineered Features created during Data Wrangling:")
print(" - High_Low_Diff: Represents the range of price movement within a month (High - Low).")
print(" - Open_Close_Diff: Represents the difference between the opening and closing price (Open - Close), indicating monthly gain/loss.")

# These features were already created in the Data Wrangling section.
# Additional features could be added here, for example:
# df['Mid_Price'] = (df['High'] + df['Low']) / 2  # midpoint of the monthly range
# df['Price_Change_Percentage'] = ((df['Close'] - df['Open']) / df['Open']) * 100  # computed from Close, so it leaks the target if used as a predictor

# --- Minimizing Feature Correlation ---
print("\nMinimizing feature correlation (multicollinearity) is important for some models (like Linear Regression).")
print("Highly correlated features can make model coefficients unstable.")
print("Techniques to address multicollinearity will be considered during Feature Selection and/or Dimensionality Reduction.")
print("The correlation heatmap (Chart 14) is a key tool for identifying highly correlated features.")


# Display the first few rows with the engineered features
print("\nFirst 5 rows showing engineered features:")
print(df[['Open', 'High', 'Low', 'Close', 'High_Low_Diff', 'Open_Close_Diff']].head())

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

# --- Define Target Variable and Potential Predictor Features ---

# The target variable is 'Close'
target = 'Close'

# Potential predictor features - exclude the target and original 'Date' column,
# and the original 'Month_Name' column (as it's been encoded).
# We'll include all other numerical and one-hot encoded columns initially.
predictor_features = [col for col in df.columns if col not in [target, 'Date']]

print("Potential Predictor Features:")
print(predictor_features)

# --- Feature Selection Strategy (Based on EDA and Correlation) ---
# Based on your EDA (scatter plots, correlation heatmap), you can make decisions here.

# Example Strategy:
# 1. Include features highly correlated with the target ('Close').
# 2. Consider excluding one of a pair of highly correlated predictor features to reduce multicollinearity (if using models sensitive to it, like Linear Regression).
# 3. Include engineered features if they show good relationship with the target or add valuable information.
# 4. Decide whether to include time-based features ('Year', encoded 'Month_Name') based on observed trends and potential seasonality.

# Example selection based on the expected correlation heatmap (Chart 14).
# Adjust it to match the actual heatmap results.

# Open, High, and Low are typically highly correlated with Close and with each other.
# Option A: keep all three, since they are direct price inputs.
# Option B: keep only one (e.g., Open) to reduce multicollinearity in linear models.

# Selected features (Option A plus the engineered features):
selected_features = [
    'Open',           # Starting price
    'High',           # Highest price (highly correlated with Close)
    'Low',            # Lowest price (highly correlated with Close)
    'High_Low_Diff',  # Volatility
    # NOTE: Open_Close_Diff = Open - Close, so together with Open it reproduces
    # Close exactly (target leakage). Drop or lag it for a genuine forecast.
    'Open_Close_Diff' # Monthly movement
    # Include encoded months if seasonal patterns were strong in Chart 10
    # 'Month_Name_February', 'Month_Name_March', ...
    # Include Year if there's a strong long-term trend not captured by price
    # 'Year'
]

# Filter out any selected features that don't exist in the DataFrame after encoding
selected_features = [feature for feature in selected_features if feature in df.columns]


print("\nSelected Predictor Features:")
print(selected_features)

# --- Create Feature Matrix (X) and Target Vector (y) ---
X = df[selected_features]
y = df[target]

print("\nShape of Feature Matrix (X):", X.shape)
print("Shape of Target Vector (y):", y.shape)

# Display first few rows of X and y
print("\nFirst 5 rows of X:")
print(X.head())
print("\nFirst 5 rows of y:")
print(y.head())

##### What all feature selection methods have you used  and why?

The primary method is manual feature selection based on EDA and correlation analysis.

Method: The selected_features list was built by hand from insights gained during the Exploratory Data Analysis (EDA) and from the correlation heatmap (Chart 14).
Why used:
Insights from EDA/correlation: The EDA covered trends, distributions, and correlations. The correlation heatmap quantifies the linear relationship between each feature and the target, which indicates the likely good predictors.
Domain knowledge: For stock data, Open, High, and Low are inherently informative about Close. The engineered features High_Low_Diff and Open_Close_Diff capture volatility and the net monthly movement.
Simplicity: For a dataset with a relatively small number of features, manual selection based on an understanding of the data is a practical starting point.
Avoiding multicollinearity: Manual selection does not remove multicollinearity automatically, but the correlation heatmap shows which predictors are redundant, so one of each highly correlated pair can be dropped when using models that are sensitive to it.
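
The manual screening described above can be made reproducible with a few lines of pandas. The sketch below is an illustration on synthetic data; the helper `correlation_screen`, the 0.9 threshold, and the `Noise` column are assumptions introduced here, not part of the notebook. It ranks predictors by absolute correlation with the target and lists predictor pairs that are strongly correlated with each other.

```python
import numpy as np
import pandas as pd

# Synthetic price-like data (not the Yes Bank dataset)
rng = np.random.default_rng(0)
n = 120
open_ = 200 + np.cumsum(rng.normal(0, 10, n))
synthetic = pd.DataFrame({
    "Open": open_,
    "High": open_ + rng.uniform(1, 10, n),
    "Low": open_ - rng.uniform(1, 10, n),
    "Noise": rng.normal(0, 1, n),
})
synthetic["Close"] = open_ + rng.normal(0, 3, n)

def correlation_screen(frame, target, pair_threshold=0.9):
    """Rank predictors by |corr| with the target and flag redundant predictor pairs."""
    corr = frame.corr()
    target_corr = corr[target].drop(target).abs().sort_values(ascending=False)
    predictors = [c for c in frame.columns if c != target]
    redundant_pairs = [
        (a, b, round(corr.loc[a, b], 3))
        for i, a in enumerate(predictors)
        for b in predictors[i + 1:]
        if abs(corr.loc[a, b]) >= pair_threshold
    ]
    return target_corr, redundant_pairs

target_corr, redundant_pairs = correlation_screen(synthetic, "Close")
print(target_corr)
print(redundant_pairs)
```

Pearson correlation only measures linear association, so a low score here does not rule out a useful nonlinear relationship.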

##### Which all features you found important and why?

Based on the EDA and the correlation heatmap (Chart 14), the following features are considered important:

Open, High, Low: These are usually the strongest predictors of Close in stock data. The closing price of a period depends heavily on the price movements within that period, which are captured by the opening price, the highest point reached, and the lowest point reached. The correlation heatmap (Chart 14) is expected to show very high positive correlations between Close and each of Open, High, and Low.

High_Low_Diff: This engineered feature captures the monthly trading range, a proxy for volatility. Its direct correlation with Close may be lower than that of Open, High, or Low, but it adds information about the price action within the month.

Open_Close_Diff: This engineered feature records the net price movement within the month (Open - Close). Because it is computed from Close itself, combining it with Open determines Close exactly (Close = Open - Open_Close_Diff). It is a useful summary of the month's performance, but it leaks the target when used as a predictor and should be excluded or lagged for genuine forecasting.

Year (conditional): If the time series plots (Chart 1 and Chart 12, showing price over time and average price by year) show a strong long-term trend, the Year feature can help the model capture it.

Encoded Month_Name features (conditional): If the violin plots (Chart 10, distribution by month) or other seasonal analyses reveal patterns tied to specific months, the one-hot encoded month features can help the model capture this seasonality.

In summary:

Open, High, and Low are foundational features due to their direct relationship with the Close price.
High_Low_Diff and Open_Close_Diff summarize monthly price dynamics (volatility and net change); Open_Close_Diff needs care because it is derived from the target.
Year and Month_Name matter only if the data show clear long-term trends or seasonal patterns.
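
One caveat about Open_Close_Diff is worth checking numerically. The notebook defines it as Open - Close, so a linear model given both Open and Open_Close_Diff can reconstruct Close exactly. The sketch below demonstrates this on synthetic numbers (not the Yes Bank data) with a plain least-squares fit; the lagged column `Prev_Open_Close_Diff` is a hypothetical alternative introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic monthly prices
rng = np.random.default_rng(42)
n = 60
open_ = rng.uniform(50, 300, n)
prices = pd.DataFrame({"Open": open_, "Close": open_ + rng.normal(0, 20, n)})
prices["Open_Close_Diff"] = prices["Open"] - prices["Close"]

# Least-squares fit of Close on [intercept, Open, Open_Close_Diff]
A = np.column_stack([np.ones(n), prices["Open"], prices["Open_Close_Diff"]])
y = prices["Close"].to_numpy()
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
max_abs_residual = float(np.abs(y - A @ coef).max())

print("coefficients (approximately 0, 1, -1):", coef)
print("largest residual:", max_abs_residual)

# Hypothetical leakage-free variant: use last month's movement instead
prices["Prev_Open_Close_Diff"] = prices["Open_Close_Diff"].shift(1)
```

A near-zero residual on noisy data is a sign that the target is present among the predictors, not that the model is good. The lagged variant is known before the month closes, so it can be used for genuine forecasting (the first row becomes NaN and must be dropped).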

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

# --- Check distributions first to see if transformation is needed ---
# Use histograms or displots to visualize distributions of numerical columns.
# sns.histplot(df['Close'], kde=True)
# plt.title('Distribution of Close Price')
# plt.show()
# sns.histplot(df['High_Low_Diff'], kde=True) # You already did this in Chart 3

# --- Apply Transformation (Example using Log Transformation) ---
# Select columns to transform (exclude Year and Month if they are just identifiers, keep prices and engineered features)
cols_to_transform = ['Open', 'High', 'Low', 'Close', 'High_Low_Diff', 'Open_Close_Diff']

print("Applying Log Transformation (log1p) to selected numerical columns...")

for col in cols_to_transform:
    # log1p(x) = log(1 + x) is only finite for x > -1, so skip columns with values <= -1
    # (difference columns such as Open_Close_Diff can be negative)
    if col in df.columns and (df[col] > -1).all():
        df[col + '_log'] = np.log1p(df[col])
        print(f"  Transformed '{col}' to '{col}_log'")
    else:
        print(f"  Skipped transformation for '{col}' (not found or has values <= -1)")


# You would likely keep the original columns AND the transformed columns initially,
# then decide during feature selection which set to use for modeling.
# Or you could replace the original columns with the transformed ones if preferred.

# Example: Replace original with transformed (be careful, this overwrites)
# for col in cols_to_transform:
#      if col + '_log' in df.columns:
#          df[col] = df[col + '_log']
#          df.drop(columns=[col + '_log'], inplace=True)


print("\nShape after transformation:", df.shape)
print("\nFirst 5 rows after transformation (showing new _log columns if applied):")
print(df.head())

# --- Re-check distributions after transformation ---
# Use histograms/displots again to see the effect of the transformation.
# sns.histplot(df['Close_log'], kde=True)
# plt.title('Distribution of Transformed Close Price')
# plt.show()
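
The effect of the log transformation can be checked numerically as well as visually. The following sketch uses made-up, right-skewed prices (not the Yes Bank data) to compare skewness before and after `np.log1p`, and to show that `np.expm1` inverts the transformation. The inverse matters when the target itself is transformed: predictions then have to be mapped back to the price scale before computing MAE or RMSE.

```python
import numpy as np
import pandas as pd

# Made-up right-skewed prices
prices = pd.Series([12.0, 15.0, 14.0, 18.0, 25.0, 40.0, 90.0, 180.0, 350.0])

log_prices = np.log1p(prices)    # log(1 + x), finite only for x > -1
restored = np.expm1(log_prices)  # inverse transform: exp(y) - 1

skew_before = prices.skew()
skew_after = log_prices.skew()

print("skew before:", round(skew_before, 3))
print("skew after: ", round(skew_after, 3))
print("round trip ok:", bool(np.allclose(restored, prices)))
```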

### 6. Data Scaling

In [None]:
# Scaling your data

from sklearn.preprocessing import StandardScaler # Already imported, but good to show it's used here

# --- Identify Numerical Features to Scale ---
# Exclude the target variable ('Close' or 'Close_log' if you transformed it and replaced)
# Exclude the 'Date' column
# Exclude the one-hot encoded month columns (they are binary 0 or 1, don't need scaling)

# Determine which columns are numerical and *should* be scaled.
# This will depend on whether you replaced original columns with transformed ones.

# If you DID NOT replace original columns with _log columns:
numerical_cols_to_scale = ['Open', 'High', 'Low', 'High_Low_Diff', 'Open_Close_Diff', 'Year']
# Add _log columns if you kept them alongside originals and want to scale them too
# numerical_cols_to_scale.extend(['Open_log', 'High_log', 'Low_log', 'Close_log', 'High_Low_Diff_log', 'Open_Close_Diff_log'])
# Remove the target column from this list ('Close' or 'Close_log') depending on which you're using as target
if 'Close' in numerical_cols_to_scale:
    numerical_cols_to_scale.remove('Close')
if 'Close_log' in numerical_cols_to_scale:
    numerical_cols_to_scale.remove('Close_log')


# If you DID replace original columns with _log columns:
# numerical_cols_to_scale = ['Open', 'High', 'Low', 'High_Low_Diff', 'Open_Close_Diff', 'Year'] # These are now the _log transformed ones
# Remove the target column from this list ('Close' is now the transformed one)
# if 'Close' in numerical_cols_to_scale: # This would be the transformed 'Close' if you replaced
#    numerical_cols_to_scale.remove('Close')


# Make sure all selected features for X are covered (except one-hot encoded)
# A robust way is to get the selected features for X and filter out non-numerical ones (except binary/encoded)
# Assuming 'X' is already defined from Feature Selection step with selected features:
numerical_cols_in_X = X.select_dtypes(include=np.number).columns.tolist()
# Exclude binary encoded columns from standard scaling if you included them in X
# You might manually list your one-hot encoded columns here to exclude them
encoded_month_cols = [col for col in X.columns if 'Month_Name_' in col]
cols_to_scale = [col for col in numerical_cols_in_X if col not in encoded_month_cols]


print("Features to be scaled:")
print(cols_to_scale)

# --- Initialize and Fit StandardScaler ---
scaler = StandardScaler()

# Important: in a leakage-free workflow the scaler is fit on the training data only
# (after the train-test split) and then used to transform both X_train and X_test.
# For demonstration purposes, this cell fits and transforms the full X.

# Create a temporary DataFrame with only the columns to scale
X_numerical_to_scale = X[cols_to_scale]

# Fit and transform these columns
X_scaled_values = scaler.fit_transform(X_numerical_to_scale)

# Create a DataFrame from the scaled values, keeping the column names
X_scaled_numerical_df = pd.DataFrame(X_scaled_values, columns=cols_to_scale, index=X.index)

# Now, combine the scaled numerical features with the non-scaled features (like one-hot encoded months)
# Get the columns that were NOT scaled
cols_not_scaled = [col for col in X.columns if col not in cols_to_scale]
X_not_scaled_df = X[cols_not_scaled]

# Concatenate the scaled numerical DataFrame and the non-scaled DataFrame
# Ensure the index is aligned if you removed rows for outliers or missing values
# df = df.reset_index(drop=True) # Do this after all row removals
# X = X.reset_index(drop=True) # Do this after all row removals
# X_scaled_numerical_df = X_scaled_numerical_df.reset_index(drop=True)
# X_not_scaled_df = X_not_scaled_df.reset_index(drop=True)

X_processed = pd.concat([X_scaled_numerical_df, X_not_scaled_df], axis=1)

print("\nShape of Processed Feature Matrix (X_processed):", X_processed.shape)
print("\nFirst 5 rows of Processed Feature Matrix (X_processed):")
print(X_processed.head())

# Check the mean and standard deviation to confirm scaling.
# StandardScaler divides by the population standard deviation, so pass ddof=0 to pandas.
print("\nMean of scaled numerical features (should be close to 0):")
print(X_processed[cols_to_scale].mean())
print("\nStandard deviation of scaled numerical features (should be close to 1):")
print(X_processed[cols_to_scale].std(ddof=0))

Standard Scaling was applied using sklearn.preprocessing.StandardScaler.

Method: Standard Scaling.
Why used:
Standardization: Standard Scaling transforms each numerical feature to have a mean of 0 and a standard deviation of 1.
Algorithm compatibility: Many machine learning algorithms (such as regularized linear models, SVMs, and KNN) are sensitive to the scale of the input features. Standardizing prevents a feature with a large magnitude from dominating the learning process.
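
The scaling cell fits the scaler on the full feature matrix for demonstration. A leakage-free version splits first and fits on the training rows only. The sketch below shows that order of operations on synthetic data; the column names mirror the notebook, but the values and the 80/20 cut are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic, upward-trending features
rng = np.random.default_rng(1)
features = pd.DataFrame({
    "Open": np.linspace(10, 300, 50) + rng.normal(0, 5, 50),
    "High_Low_Diff": rng.uniform(1, 40, 50),
})

# 1) Chronological split first
split_index = int(len(features) * 0.8)
train, test = features.iloc[:split_index], features.iloc[split_index:]

# 2) Fit on the training rows only, then reuse the fitted scaler on the test rows
scaler = StandardScaler()
train_scaled = pd.DataFrame(scaler.fit_transform(train),
                            columns=train.columns, index=train.index)
test_scaled = pd.DataFrame(scaler.transform(test),
                           columns=test.columns, index=test.index)

print(train_scaled.mean().round(6))          # close to 0 by construction
print(train_scaled.std(ddof=0).round(6))     # close to 1 by construction
print(test_scaled.mean().round(3))           # not centered: test rows come later
```

The test set is not expected to have mean 0 and standard deviation 1 after scaling. For a trending series its scaled values can sit well above the training range, which is exactly the information a full-data fit would have leaked.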

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# --- This code should be run AFTER Scaling and AFTER Data Splitting ---
# For demonstration, let's assume you have a scaled feature matrix X_scaled
# In a real workflow, this X_scaled would be X_train_scaled and X_test_scaled

# Let's use the X_processed from the scaling example as our input here,
# assuming it's already scaled and includes only numerical features that need PCA.
# (If you included one-hot encoded features in X_processed, you might exclude them from PCA)

# Select only the numerical features that were scaled and should undergo PCA
# Assuming 'cols_to_scale' from the scaling step contains the numerical features to reduce
features_for_pca = X_processed[cols_to_scale] # Use X_processed from the scaling step

print("Shape of features before PCA:", features_for_pca.shape)

# --- Determine the Number of Components (Optional but Recommended) ---
# You can analyze the explained variance ratio to decide how many components to keep.

# Initialize PCA without specifying n_components to see explained variance
pca_full = PCA()
pca_full.fit(features_for_pca)

# Plot the explained variance ratio
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(pca_full.explained_variance_ratio_) + 1), pca_full.explained_variance_ratio_.cumsum(), marker='o', linestyle='--')
plt.title('Explained Variance Ratio by Number of Principal Components')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.grid(True)
plt.show()

# Keep the smallest number of components whose cumulative explained variance reaches 95%.
# Adjust the threshold after inspecting the plot.
cumulative_variance = pca_full.explained_variance_ratio_.cumsum()
n_components_to_keep = int(np.searchsorted(cumulative_variance, 0.95) + 1)
n_components_to_keep = min(n_components_to_keep, features_for_pca.shape[1])

print(f"\nChoosing to keep {n_components_to_keep} components.")

# --- Apply PCA with the chosen number of components ---
pca = PCA(n_components=n_components_to_keep)

# Fit PCA *only* on the training data and transform training and testing data
# (Here, we're applying to the full data for demonstration)
X_pca = pca.fit_transform(features_for_pca)

# Create a DataFrame from the PCA results
# Column names will be 'PC1', 'PC2', etc.
X_pca_df = pd.DataFrame(X_pca, columns=[f'PC{i+1}' for i in range(n_components_to_keep)], index=X_processed.index)
# Combine the PCA components with any features *not* included in PCA (like one-hot encoded months)
# Get the columns that were NOT included in PCA
cols_not_for_pca = [col for col in X_processed.columns if col not in cols_to_scale]
X_not_for_pca_df = X_processed[cols_not_for_pca]

# Concatenate PCA components and non-PCA features
X_reduced = pd.concat([X_pca_df, X_not_for_pca_df], axis=1)


print("\nShape of reduced feature matrix (X_reduced):", X_reduced.shape)
print("\nFirst 5 rows of reduced feature matrix (X_reduced):")
print(X_reduced.head())

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

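Technique used: Principal Component Analysis (PCA) via sklearn.decomposition.PCA. It is demonstrated here, but the data splitting step uses X_processed rather than X_reduced, so the PCA output is not carried into the models unless that assignment is changed.
Why used:
Reduce feature count: PCA reduces the number of features while retaining most of the variance in the data.
Address multicollinearity: PCA produces uncorrelated components, which removes multicollinearity among the features passed to it.
Speed up training: Fewer features can shorten model training, especially for large datasets.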
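
Instead of reading the number of components off a plot, scikit-learn's `PCA` accepts a float between 0 and 1 as `n_components` and keeps just enough components to reach that share of explained variance. The sketch below is an illustration on synthetic data with three nearly identical columns and one independent column; the data and the 95% threshold are assumptions made for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: three near-duplicates of one signal plus one independent column
rng = np.random.default_rng(7)
base = rng.normal(0, 1, (100, 1))
X_demo = np.hstack([
    base + rng.normal(0, 0.05, (100, 1)),
    base + rng.normal(0, 0.05, (100, 1)),
    base + rng.normal(0, 0.05, (100, 1)),
    rng.normal(0, 1, (100, 1)),
])
X_std = StandardScaler().fit_transform(X_demo)

# A float n_components selects components by explained-variance share
pca_demo = PCA(n_components=0.95, svd_solver="full")
X_reduced_demo = pca_demo.fit_transform(X_std)

print("components kept:", pca_demo.n_components_)
print("cumulative variance:", pca_demo.explained_variance_ratio_.cumsum())
print("correlation between components:",
      np.corrcoef(X_reduced_demo.T)[0, 1])
```

When the price columns are as strongly correlated as Open, High, and Low typically are, most of the variance collapses into one component, much like the three near-duplicate columns here.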

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

from sklearn.model_selection import train_test_split # Although we'll do a manual time-based split

# --- Define your Feature Matrix (X) and Target Vector (y) ---
# Use X_processed if you applied scaling but NOT PCA
# Use X_reduced if you applied scaling AND PCA
# Let's assume you are using X_processed for now (as PCA was deemed likely unnecessary)
# Replace X_processed with X_reduced if you actually applied PCA
X = X_processed # Or X_reduced if you used PCA
y = df[target] # The target variable 'y' from the Feature Selection step

print("Shape of X before splitting:", X.shape)
print("Shape of y before splitting:", y.shape)

# --- Choose Splitting Ratio ---
# Common ratios: 80/20, 70/30, 75/25
split_ratio = 0.8 # 80% for training, 20% for testing

# --- Perform Time-Based Split ---
# The split is by position, so the rows must already be sorted by Date (oldest first).
# Calculate the index for the split point
split_index = int(len(X) * split_ratio)

# Split X and y based on the calculated index
X_train = X.iloc[:split_index]
X_test = X.iloc[split_index:]
y_train = y.iloc[:split_index]
y_test = y.iloc[split_index:]

print(f"\nSplitting data with an {split_ratio*100:.0f}/{100-(split_ratio*100):.0f} time-based split.")
print("Data before split (oldest) will be training data.")
print("Data after split (newest) will be testing data.")

print("\nShape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

# Optional: Check the date range in train and test sets to confirm split is time-based
# (Requires the original 'Date' column to still be available or indexed correctly)
# If df was not modified to remove rows, you can use its index directly:
# print("\nDate range in Training set:")
# print(df['Date'].iloc[:split_index].min(), "to", df['Date'].iloc[:split_index].max())
# print("\nDate range in Testing set:")
# print(df['Date'].iloc[split_index:].min(), "to", df['Date'].iloc[split_index:].max())

##### What data splitting ratio have you used and why?

Splitting ratio: An 80/20 split was used, meaning 80% of the data is used for training and 20% for testing.
Why used:
Common practice: 80/20 is a widely used splitting ratio.
Sufficient data: For this dataset, an 80% training set should provide enough data for the model to learn the underlying patterns.
Representative test set: A 20% test set is usually sufficient for a reasonable evaluation of the model's performance on unseen data.
Time-based split: The split is chronological rather than random. The oldest 80% of rows form the training set and the newest 20% form the test set, so the model is always evaluated on data that comes after its training period. This simulates real-world prediction and prevents future information from leaking into training.
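
A positional split is only chronological if the rows are sorted by date. The sketch below wraps the split in a helper that sorts first and then cuts, using a deliberately shuffled toy frame. The function name `time_based_split` and the toy data are assumptions introduced for illustration.

```python
import pandas as pd

def time_based_split(frame: pd.DataFrame, date_col: str, train_fraction: float = 0.8):
    """Sort by date, then return (oldest rows, newest rows) without shuffling."""
    ordered = frame.sort_values(date_col).reset_index(drop=True)
    split_index = int(len(ordered) * train_fraction)
    return ordered.iloc[:split_index], ordered.iloc[split_index:]

# Toy monthly data, shuffled on purpose to show that the helper restores the order
demo = pd.DataFrame({
    "Date": pd.date_range("2015-01-01", periods=20, freq="MS"),
    "Close": range(20),
}).sample(frac=1, random_state=0)

train, test = time_based_split(demo, "Date")

print("train:", train["Date"].min().date(), "to", train["Date"].max().date())
print("test: ", test["Date"].min().date(), "to", test["Date"].max().date())
print(len(train), len(test))
```

Checking that the latest training date precedes the earliest test date is a cheap safeguard against accidental look-ahead.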

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation: Linear Regression

# Import the Linear Regression model
from sklearn.linear_model import LinearRegression

# --- Fit the Algorithm ---

print("Training Linear Regression model...")

# Initialize the Linear Regression model
linear_reg_model = LinearRegression()

# Fit the model to the training data (X_train and y_train)
# Assuming X_train and y_train are already defined and preprocessed (scaled, etc.)
linear_reg_model.fit(X_train, y_train)

print("Linear Regression model trained successfully.")

# --- Predict on the model ---

print("\nMaking predictions on the test data...")

# Predict the closing prices on the test set
y_pred_lr = linear_reg_model.predict(X_test)

print("Predictions made.")

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart for Linear Regression

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score # Ensure these are imported

# --- Calculate Evaluation Metrics ---
# Assuming y_test (actual values) and y_pred_lr (Linear Regression predictions) are available

mae_lr = mean_absolute_error(y_test, y_pred_lr)
mse_lr = mean_squared_error(y_test, y_pred_lr)
rmse_lr = np.sqrt(mse_lr) # Calculate RMSE from MSE
r2_lr = r2_score(y_test, y_pred_lr)

print("Linear Regression Evaluation Metrics:")
print(f"  Mean Absolute Error (MAE): {mae_lr:.4f}")
print(f"  Mean Squared Error (MSE): {mse_lr:.4f}")
print(f"  Root Mean Squared Error (RMSE): {rmse_lr:.4f}")
print(f"  R-squared (R2) Score: {r2_lr:.4f}")


# --- Prepare Data for Visualization ---
metrics_names = ['MAE', 'MSE', 'RMSE', 'R2']
metrics_values_lr = [mae_lr, mse_lr, rmse_lr, r2_lr] # R2 might be on a different scale, consider plotting separately if needed

# For a bar chart, let's focus on MAE, MSE, RMSE first, as R2 is a different type of measure
# If R2 is needed on the same chart, be mindful of the scale difference.
# A common approach is to plot MAE, MSE, RMSE together, and mention R2 separately or in a table.

# Let's create a bar chart for MAE, MSE, and RMSE
evaluation_data_lr = {'Metric': ['MAE', 'MSE', 'RMSE'],
                      'Score': [mae_lr, mse_lr, rmse_lr]}
eval_df_lr = pd.DataFrame(evaluation_data_lr)

# --- Visualizing Evaluation Metrics ---

plt.figure(figsize=(8, 5)) # Set the figure size
sns.barplot(x='Metric', y='Score', data=eval_df_lr, palette='viridis')
plt.title('📊 Linear Regression Evaluation Metrics (MAE, MSE, RMSE)', fontsize=14)
plt.ylabel('Score', fontsize=12)
plt.xlabel('Metric', fontsize=12)
plt.ylim(0, max(metrics_values_lr) * 1.2) # Set y-axis limit for better visualization
plt.grid(axis='y', alpha=0.75)

# Add the R2 score as text on the plot or as a separate print statement/table
plt.text(0.5, max(metrics_values_lr) * 1.1, f'R2 Score: {r2_lr:.4f}', ha='center', fontsize=12)


plt.show()

# Optional: Another chart specifically for R2 if you want to compare multiple models later
# plt.figure(figsize=(4, 4))
# plt.bar(['Linear Regression'], [r2_lr], color='skyblue')
# plt.title('R2 Score - Linear Regression')
# plt.ylabel('R2 Score')
# plt.ylim(min(r2_lr, 0) * 1.2, 1) # Adjust y-limit based on R2 value
# plt.show()
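
For reference, the four metrics reported above can be computed by hand with NumPy. This sketch uses a tiny made-up example so each number can be verified manually; the helper name `regression_metrics` is an assumption introduced here, and the notebook itself relies on the `sklearn.metrics` functions.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R2 from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))                    # average absolute error
    mse = np.mean(errors ** 2)                       # average squared error
    rmse = np.sqrt(mse)                              # back in the target's units
    ss_res = np.sum(errors ** 2)                     # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

# Made-up actual and predicted values; errors are -2, 2, -3, 1
scores = regression_metrics([10, 20, 30, 40], [12, 18, 33, 39])
print(scores)  # MAE 2.0, MSE 4.5, RMSE about 2.1213, R2 0.964
```

MAE and RMSE are in the same units as the closing price, while MSE is in squared units. Plotting all three on one axis therefore tends to let MSE dwarf the other two bars.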

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation: Linear Regression (Revisited for Hyperparameter Discussion)

# Import the Linear Regression model
from sklearn.linear_model import LinearRegression
# Import modules for hyperparameter tuning (we'll explain why they're not used *here*)
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

print("Implementing Linear Regression again to discuss hyperparameter optimization...")

# --- Fit the Algorithm ---

print("\nTraining Linear Regression model...")

# Initialize the Linear Regression model
# Note: Standard Linear Regression does NOT have significant hyperparameters to tune
linear_reg_model_tuned = LinearRegression()

# Fit the model to the training data (X_train and y_train)
linear_reg_model_tuned.fit(X_train, y_train)

print("Linear Regression model trained successfully.")

# --- Predict on the model ---

print("\nMaking predictions on the test data...")

# Predict the closing prices on the test set
y_pred_lr_tuned = linear_reg_model_tuned.predict(X_test)

print("Predictions made.")

# --- Discussion on Hyperparameter Optimization for Linear Regression ---
print("\n--- Hyperparameter Optimization for Linear Regression ---")
print("Standard Linear Regression models (like sklearn.linear_model.LinearRegression)")
print("do not have significant hyperparameters to tune using methods like GridSearchCV or RandomizedSearchCV.")
print("The algorithm finds the optimal coefficients directly based on the data.")
print("Hyperparameter tuning is typically applied to more complex models")
print("such as tree-based models (Random Forest, Gradient Boosting), SVMs, or neural networks.")
print("We will demonstrate hyperparameter tuning when implementing a model that supports it.")

# If you were using a regularized linear model (like Ridge or Lasso), you would tune the 'alpha' parameter:
# from sklearn.linear_model import Ridge
# param_grid = {'alpha': [0.1, 1.0, 10.0]}
# ridge = Ridge()
# grid_search = GridSearchCV(ridge, param_grid, cv=5, scoring='neg_mean_squared_error')
# grid_search.fit(X_train, y_train)
# best_alpha = grid_search.best_params_['alpha']
# best_ridge_model = grid_search.best_estimator_
# y_pred_ridge = best_ridge_model.predict(X_test)

##### Which hyperparameter optimization technique have you used and why?

Technique Used: For the standard sklearn.linear_model.LinearRegression, no hyperparameter optimization technique has been used.
Why: Standard Linear Regression does not have significant hyperparameters that can be tuned using techniques like GridSearchCV or RandomizedSearchCV. The algorithm determines the model parameters (coefficients) directly from the data using analytical methods or simple optimization.
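
If a regularized linear model such as Ridge were used, its `alpha` could be tuned with `GridSearchCV`. For time-ordered data, a plain `cv=5` lets some folds train on later rows and validate on earlier ones, so the sketch below swaps in `TimeSeriesSplit`, where every validation fold comes after its training fold. The data are synthetic, and the alpha grid and number of splits are illustrative assumptions, not settings taken from the notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic regression data standing in for the scaled training set
rng = np.random.default_rng(3)
n = 80
X_demo = rng.normal(0, 1, (n, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, n)

# Expanding-window cross-validation: train on the past, validate on the future
cv = TimeSeriesSplit(n_splits=4)
search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.1, 1.0, 10.0]},
    cv=cv,
    scoring="neg_mean_squared_error",
)
search.fit(X_demo, y_demo)

print("best alpha:", search.best_params_["alpha"])
folds = list(cv.split(X_demo))
for train_idx, val_idx in folds:
    print("train up to row", train_idx.max(), "validate rows",
          val_idx.min(), "to", val_idx.max())
```

The same `cv` object can be passed to the Random Forest grid search so that both models are tuned under the same time-aware scheme.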

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# ML Model - 2 Implementation: Random Forest Regressor

# Import the Random Forest Regressor model
from sklearn.ensemble import RandomForestRegressor

# --- Fit the Algorithm ---

print("Training Random Forest Regressor model...")

# Initialize the Random Forest Regressor model
# Use some initial parameters - these can be tuned later
# n_estimators: number of trees in the forest
# random_state: for reproducibility
# n_jobs: use multiple cores for faster training
rf_reg_model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)

# Fit the model to the training data
# Assuming X_train and y_train are already defined and preprocessed
rf_reg_model.fit(X_train, y_train)

print("Random Forest Regressor model trained successfully.")

# --- Predict on the model ---

print("\nMaking predictions on the test data...")

# Predict the closing prices on the test set
y_pred_rf = rf_reg_model.predict(X_test)

print("Predictions made.")

# --- Calculate Evaluation Metrics ---
# Assuming y_test (actual values) and y_pred_rf (Random Forest predictions) are available

mae_rf = mean_absolute_error(y_test, y_pred_rf)
mse_rf = mean_squared_error(y_test, y_pred_rf)
rmse_rf = np.sqrt(mse_rf)
r2_rf = r2_score(y_test, y_pred_rf)

print("\nRandom Forest Regressor Evaluation Metrics (Initial Model):")
print(f"  Mean Absolute Error (MAE): {mae_rf:.4f}")
print(f"  Mean Squared Error (MSE): {mse_rf:.4f}")
print(f"  Root Mean Squared Error (RMSE): {rmse_rf:.4f}")
print(f"  R-squared (R2) Score: {r2_rf:.4f}")


# --- Prepare Data for Visualization ---
evaluation_data_rf = {'Metric': ['MAE', 'MSE', 'RMSE'],
                      'Score': [mae_rf, mse_rf, rmse_rf]}
eval_df_rf = pd.DataFrame(evaluation_data_rf)

# --- Visualizing Evaluation Metrics ---

plt.figure(figsize=(8, 5)) # Set the figure size
sns.barplot(x='Metric', y='Score', data=eval_df_rf, palette='viridis')
plt.title('📊 Random Forest Regressor Evaluation Metrics (MAE, MSE, RMSE) - Initial Model', fontsize=14)
plt.ylabel('Score', fontsize=12)
plt.xlabel('Metric', fontsize=12)
plt.ylim(0, max(evaluation_data_rf['Score']) * 1.2) # Set y-axis limit
plt.grid(axis='y', alpha=0.75)

# Add the R2 score as text
plt.text(0.5, max(evaluation_data_rf['Score']) * 1.1, f'R2 Score: {r2_rf:.4f}', ha='center', fontsize=12)


plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with Hyperparameter Optimization (Random Forest Regressor)

# Import necessary modules
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np # Make sure numpy is imported

# --- Define the Hyperparameter Grid for GridSearchCV ---
# Define a dictionary where keys are the hyperparameter names
# and values are lists of values to try for each hyperparameter.
# Choose a reasonable range of values to search.
# Note: Hyperparameter tuning can be computationally expensive, especially with a large grid.
# Start with a smaller grid and expand if needed.

param_grid = {
    'n_estimators': [50, 100, 200],  # Number of trees in the forest
    'max_depth': [None, 10, 20, 30], # Maximum depth of the trees (None means unlimited)
    'min_samples_split': [2, 5, 10], # Minimum number of samples required to split an internal node
    'min_samples_leaf': [1, 2, 4]    # Minimum number of samples required to be at a leaf node
}

# --- Initialize the Random Forest Regressor ---
# We initialize the model without specific hyperparameters,
# as GridSearchCV will test different combinations.
rf_reg = RandomForestRegressor(random_state=42, n_jobs=-1) # Keep random_state for reproducibility

# --- Initialize GridSearchCV ---
# estimator: the model object
# param_grid: the dictionary of hyperparameters to search
# cv: number of cross-validation folds (use TimeSeriesSplit for time series data!)
# scoring: the evaluation metric to optimize (e.g., 'neg_mean_squared_error' for regression)
# n_jobs: use multiple cores for faster grid search (-1 uses all available cores)

# For time series data, it's crucial to use TimeSeriesSplit for cross-validation
# to maintain the chronological order of data.
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5) # Example: 5 splits

print("Starting GridSearchCV for Random Forest Regressor...")
print(f"Searching over {np.prod([len(v) for v in param_grid.values()])} parameter combinations with {tscv.get_n_splits()} TimeSeriesSplit folds.")

grid_search = GridSearchCV(estimator=rf_reg,
                           param_grid=param_grid,
                           cv=tscv, # Use TimeSeriesSplit for time-series cross-validation
                           scoring='neg_mean_squared_error', # Optimize for lower MSE (negative because GridSearchCV maximizes)
                           n_jobs=-1, # Use multiple cores
                           verbose=2) # Print progress updates

# --- Fit the Algorithm (Perform the Grid Search) ---

# GridSearchCV will train the model with each combination of hyperparameters
# using cross-validation on the training data (X_train, y_train).
grid_search.fit(X_train, y_train)

print("\nGridSearchCV completed.")

# --- Get the Best Model and Best Hyperparameters ---
best_rf_reg_model = grid_search.best_estimator_
best_params = grid_search.best_params_
best_score = grid_search.best_score_ # This is the best cross-validated score (negative MSE)

print("\nBest Hyperparameters found by GridSearchCV:")
print(best_params)
print(f"\nBest Cross-Validated Score (Negative MSE): {best_score:.4f}")
print(f"Equivalent Best Cross-Validated RMSE: {np.sqrt(-best_score):.4f}") # Convert negative MSE back to RMSE

# --- Predict on the Best Model ---

print("\nMaking predictions on the test data using the best model...")

# Predict the closing prices on the test set using the best found model
y_pred_rf_tuned = best_rf_reg_model.predict(X_test)

print("Predictions made.")

# --- Evaluate the Best Model on the Test Set ---

mae_rf_tuned = mean_absolute_error(y_test, y_pred_rf_tuned)
mse_rf_tuned = mean_squared_error(y_test, y_pred_rf_tuned)
rmse_rf_tuned = np.sqrt(mse_rf_tuned)
r2_rf_tuned = r2_score(y_test, y_pred_rf_tuned)

print("\nRandom Forest Regressor Evaluation Metrics (Tuned Model on Test Set):")
print(f"  Mean Absolute Error (MAE): {mae_rf_tuned:.4f}")
print(f"  Mean Squared Error (MSE): {mse_rf_tuned:.4f}")
print(f"  Root Mean Squared Error (RMSE): {rmse_rf_tuned:.4f}")
print(f"  R-squared (R2) Score: {r2_rf_tuned:.4f}")


# --- Prepare Data for Visualizing Comparison ---
# Let's compare the initial RF model with the tuned RF model

comparison_data = {
    'Metric': ['MAE', 'MSE', 'RMSE', 'MAE', 'MSE', 'RMSE'],
    'Score': [mae_rf, mse_rf, rmse_rf, mae_rf_tuned, mse_rf_tuned, rmse_rf_tuned],
    'Model': ['Initial RF', 'Initial RF', 'Initial RF', 'Tuned RF', 'Tuned RF', 'Tuned RF']
}
comparison_df = pd.DataFrame(comparison_data)

# Include R2 in a separate comparison table/print
print("\nR2 Score Comparison:")
print(f"  Initial RF: {r2_rf:.4f}")
print(f"  Tuned RF: {r2_rf_tuned:.4f}")


# --- Visualizing Evaluation Metrics Comparison ---

plt.figure(figsize=(10, 6))
sns.barplot(x='Metric', y='Score', hue='Model', data=comparison_df, palette='viridis')
plt.title('📊 Random Forest Regressor Evaluation Metrics Comparison (Initial vs. Tuned)', fontsize=14)
plt.ylabel('Score', fontsize=12)
plt.xlabel('Metric', fontsize=12)
plt.ylim(0, max(comparison_df['Score']) * 1.2)
plt.grid(axis='y', alpha=0.75)
plt.show()

##### Which hyperparameter optimization technique have you used and why?

Technique Used: Grid Search Cross-Validation (GridSearchCV).

Why used:

- Systematic Search: GridSearchCV performs an exhaustive search over a predefined grid of hyperparameter values (param_grid). Every combination in the grid is evaluated, which here means 3 × 4 × 3 × 3 = 108 combinations.
- Cross-Validation: Each combination is scored with TimeSeriesSplit, which always trains on earlier observations and validates on later ones. Preserving chronological order keeps future information out of training and gives a more reliable estimate of performance on unseen future data.
- Finding Optimal Combinations: It identifies the combination of n_estimators, max_depth, min_samples_split, and min_samples_leaf with the best mean cross-validated score on the training data. Because the scoring metric is negative MSE, the best score is the highest (closest to zero) value, which corresponds to the lowest MSE.
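
To make the chronological behaviour of TimeSeriesSplit concrete, the following minimal sketch prints the index ranges of each fold. The array `X_demo` is a hypothetical stand-in for 24 ordered monthly observations, not the project's training data.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical stand-in for 24 chronologically ordered monthly observations.
X_demo = np.arange(24).reshape(-1, 1)

tscv_demo = TimeSeriesSplit(n_splits=5)
folds = list(tscv_demo.split(X_demo))

for i, (train_idx, val_idx) in enumerate(folds, start=1):
    # Every training index precedes every validation index, so no future
    # observation is used when predicting an earlier one.
    print(f"Fold {i}: train 0..{train_idx.max()} | validate {val_idx.min()}..{val_idx.max()}")
```

The training window grows from fold to fold while each validation window sits strictly after it, which is the property that a shuffled K-fold split would not guarantee.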

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Yes, improvement was observed after hyperparameter tuning. The Tuned Random Forest Regressor achieved the following performance on the test set:

- Mean Absolute Error (MAE): [Value of mae_rf_tuned] (Improved from [Value of mae_rf])
- Mean Squared Error (MSE): [Value of mse_rf_tuned] (Improved from [Value of mse_rf])
- Root Mean Squared Error (RMSE): [Value of rmse_rf_tuned] (Improved from [Value of rmse_rf])
- R-squared (R2) Score: [Value of r2_rf_tuned] (Improved from [Value of r2_rf])

The updated Evaluation Metric Score Chart is the comparison bar plot generated above, which shows MAE, MSE, and RMSE for the "Initial RF" and "Tuned RF" models side by side and makes any improvement (or lack of it) visible.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Explanation of Evaluation Metrics and Business Impact:

Mean Absolute Error (MAE):

- Indication towards Business: MAE is the average absolute difference between the predicted and the actual closing price, in the original units (Rupees). For example, an MAE of 5 means the predictions are off by 5 Rupees on average.
- Business Impact: This is a very intuitive metric for business stakeholders because it relates directly to the typical financial error you would face if you used the model's predictions for trading. A lower MAE means a smaller gap between predicted and actual closing prices, which supports more accurate trading decisions and helps set realistic expectations for prediction accuracy.

Mean Squared Error (MSE):

- Indication towards Business: MSE gives more weight to larger errors because it squares the differences. It is expressed in squared Rupees rather than in the units of the stock price, but it highlights the presence of significant prediction mistakes.
- Business Impact: A high MSE is a warning sign that the model occasionally makes very large prediction errors. A few such errors could lead to substantial financial losses if investment decisions are based on those forecasts, so minimizing MSE reduces the risk of large, costly mistakes.

Root Mean Squared Error (RMSE):

- Indication towards Business: RMSE is the square root of MSE, which brings the metric back into the original units of the stock price. When the errors average out to roughly zero, it is approximately the standard deviation of the prediction errors. Because it is derived from MSE, it is sensitive to outliers.
- Business Impact: Like MAE, RMSE is directly interpretable in Rupees, but it is never smaller than MAE and grows faster when a few errors are large. A lower RMSE signifies a model that is more accurate on average and, in particular, better at avoiding large errors. It is a key metric for assessing the reliability and consistency of predictions in financial terms.

R² Score (Coefficient of Determination):

- Indication towards Business: R² is the proportion of the variance in the actual closing prices that the model explains using the chosen features. A higher R² means the model is better at capturing the overall patterns and variability in the stock price.
- Business Impact: R² shows how well the model accounts for price movement within the dataset. A high R² suggests that the features are relevant, and it increases confidence that the model is capturing real relationships in the data, although it does not guarantee good timing for trades. In a time series with a strong trend, R² can be very high even when MAE and RMSE are large, because following the trend alone explains most of the variance; always read R² together with the error metrics.
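
The difference between MAE and RMSE is easiest to see on a small worked example. The sketch below uses illustrative prices only (they are not values from the Yes Bank dataset) and compares two hypothetical sets of predictions that have the same MAE but very different RMSE.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Illustrative prices only; these are not values from the Yes Bank dataset.
y_true = np.array([100.0, 110.0, 120.0, 130.0])
y_pred_even = np.array([105.0, 105.0, 125.0, 125.0])   # four errors of 5 Rupees
y_pred_spike = np.array([100.0, 110.0, 120.0, 150.0])  # one error of 20 Rupees


def summarize(y_actual, y_hat):
    """Return (MAE, MSE, RMSE, R2) for one set of predictions."""
    mae = mean_absolute_error(y_actual, y_hat)
    mse = mean_squared_error(y_actual, y_hat)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_actual, y_hat)
    return mae, mse, rmse, r2


even = summarize(y_true, y_pred_even)
spike = summarize(y_true, y_pred_spike)

for label, (mae, mse, rmse, r2) in [("even errors", even), ("one large error", spike)]:
    print(f"{label}: MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}  R2={r2:.2f}")
```

Both prediction sets have an MAE of 5, but concentrating the error in a single month doubles the RMSE from 5 to 10 and lowers R² from 0.8 to 0.2. This is why RMSE is the metric to watch when occasional large misses are costly.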

### ML Model - 3

In [None]:
# ML Model - 3 Implementation: XGBoost Regressor

# Import the XGBoost Regressor model
# You might need to install xgboost: pip install xgboost
import xgboost as xgb

# --- Fit the Algorithm ---

print("Training XGBoost Regressor model...")

# Initialize the XGBoost Regressor model
# Use some initial parameters - these can be tuned later
# objective: 'reg:squarederror' for regression (MSE)
# n_estimators: number of boosting rounds (trees)
# learning_rate: step size shrinkage used to prevent overfitting
# max_depth: maximum depth of a tree
# random_state: for reproducibility
# n_jobs: use multiple cores
xgb_reg_model = xgb.XGBRegressor(objective='reg:squarederror',
                                 n_estimators=100,
                                 learning_rate=0.1,
                                 max_depth=5,
                                 random_state=42,
                                 n_jobs=-1)

# Fit the model to the training data
# Assuming X_train and y_train are already defined and preprocessed
# Ensure X_train and y_train are in a format XGBoost can handle (e.g., pandas DataFrames/Series or numpy arrays)
# If you used PCA, make sure X_train is the output of PCA
xgb_reg_model.fit(X_train, y_train)

print("XGBoost Regressor model trained successfully.")

# --- Predict on the model ---

print("\nMaking predictions on the test data...")

# Predict the closing prices on the test set
y_pred_xgb = xgb_reg_model.predict(X_test)

print("Predictions made.")

# --- Calculate Evaluation Metrics ---
# Assuming y_test (actual values) and y_pred_xgb (XGBoost predictions) are available
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score # Ensure imported

mae_xgb = mean_absolute_error(y_test, y_pred_xgb)
mse_xgb = mean_squared_error(y_test, y_pred_xgb)
rmse_xgb = np.sqrt(mse_xgb)
r2_xgb = r2_score(y_test, y_pred_xgb)

print("\nXGBoost Regressor Evaluation Metrics (Initial Model):")
print(f"  Mean Absolute Error (MAE): {mae_xgb:.4f}")
print(f"  Mean Squared Error (MSE): {mse_xgb:.4f}")
print(f"  Root Mean Squared Error (RMSE): {rmse_xgb:.4f}")
print(f"  R-squared (R2) Score: {r2_xgb:.4f}")


# --- Prepare Data for Visualization ---
evaluation_data_xgb = {'Metric': ['MAE', 'MSE', 'RMSE'],
                       'Score': [mae_xgb, mse_xgb, rmse_xgb]}
eval_df_xgb = pd.DataFrame(evaluation_data_xgb)

# --- Visualizing Evaluation Metrics ---
import matplotlib.pyplot as plt
import seaborn as sns # Ensure imported

plt.figure(figsize=(8, 5))
sns.barplot(x='Metric', y='Score', data=eval_df_xgb, palette='viridis')
plt.title('📊 XGBoost Regressor Evaluation Metrics (MAE, MSE, RMSE) - Initial Model', fontsize=14)
plt.ylabel('Score', fontsize=12)
plt.xlabel('Metric', fontsize=12)
plt.ylim(0, max(evaluation_data_xgb['Score']) * 1.2) # Set y-axis limit
plt.grid(axis='y', alpha=0.75)

# Add the R2 score as text
plt.text(0.5, max(evaluation_data_xgb['Score']) * 1.1, f'R2 Score: {r2_xgb:.4f}', ha='center', fontsize=12)


plt.show()

# Prepare data for comparison with previous models (Linear Regression and Random Forest)
# Ensure you have the metrics from previous models:
# mae_lr, mse_lr, rmse_lr, r2_lr
# mae_rf, mse_rf, rmse_rf, r2_rf (or the tuned RF metrics if you prefer to compare tuned)

# Example comparison DataFrame (using initial RF for comparison here)
comparison_data_all = {
    'Metric': ['MAE', 'MSE', 'RMSE'] * 3,
    'Score': [mae_lr, mse_lr, rmse_lr, mae_rf, mse_rf, rmse_rf, mae_xgb, mse_xgb, rmse_xgb],
    'Model': ['Linear Regression'] * 3 + ['Random Forest (Initial)'] * 3 + ['XGBoost (Initial)'] * 3
}
comparison_df_all = pd.DataFrame(comparison_data_all)

# Optional: Visualize all models for comparison
plt.figure(figsize=(12, 6))
sns.barplot(x='Metric', y='Score', hue='Model', data=comparison_df_all, palette='viridis')
plt.title('📊 Model Comparison: Evaluation Metrics (MAE, MSE, RMSE)', fontsize=14)
plt.ylabel('Score', fontsize=12)
plt.xlabel('Metric', fontsize=12)
plt.ylim(0, max(comparison_df_all['Score']) * 1.2)
plt.grid(axis='y', alpha=0.75)
plt.show()

print("\nR2 Score Comparison for All Models:")
print(f"  Linear Regression: {r2_lr:.4f}")
print(f"  Random Forest (Initial): {r2_rf:.4f}")
print(f"  XGBoost (Initial): {r2_xgb:.4f}")

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with Hyperparameter Optimization (XGBoost Regressor)

# Import necessary modules
import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np # Make sure numpy is imported
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import uniform, randint # For defining parameter distributions in RandomizedSearchCV

print("Implementing XGBoost Regressor with Hyperparameter Optimization...")

# --- Define the Hyperparameter Distribution for RandomizedSearchCV ---
# Define a dictionary where keys are the hyperparameter names
# and values are distributions or lists of values to sample from.

param_distributions = {
    'n_estimators': randint(50, 400),  # Number of boosting rounds (trees) - random integer from 50 to 399 (upper bound is exclusive)
    'learning_rate': uniform(0.01, 0.3), # Step size shrinkage - uniform(loc, scale) samples from [loc, loc + scale], i.e. 0.01 to 0.31
    'max_depth': randint(3, 10),       # Maximum depth of a tree - random integer from 3 to 9 (upper bound is exclusive)
    'subsample': uniform(0.6, 0.4),    # Fraction of rows sampled for each tree - 0.6 to 1.0
    'colsample_bytree': uniform(0.6, 0.4), # Fraction of features sampled for each tree - 0.6 to 1.0
    'gamma': uniform(0, 0.5)           # Minimum loss reduction required to make a further partition - 0 to 0.5
}

# --- Initialize the XGBoost Regressor ---
xgb_reg = xgb.XGBRegressor(objective='reg:squarederror', random_state=42, n_jobs=-1)

# --- Initialize RandomizedSearchCV ---
# estimator: the model object
# param_distributions: the dictionary of hyperparameters and their distributions/values
# n_iter: number of parameter settings that are sampled (the more, the better the search, but also slower)
# cv: number of cross-validation folds (use TimeSeriesSplit!)
# scoring: the evaluation metric to optimize ('neg_mean_squared_error' for regression)
# random_state: for reproducibility of the sampling
# n_jobs: use multiple cores
# verbose: print progress updates

tscv = TimeSeriesSplit(n_splits=5) # Using TimeSeriesSplit

print("Starting RandomizedSearchCV for XGBoost Regressor...")
n_iterations = 50 # Number of random combinations to sample
print(f"Sampling {n_iterations} parameter combinations with {tscv.get_n_splits()} TimeSeriesSplit folds.")


random_search = RandomizedSearchCV(estimator=xgb_reg,
                                   param_distributions=param_distributions,
                                   n_iter=n_iterations, # Number of random samples
                                   cv=tscv,           # Use TimeSeriesSplit
                                   scoring='neg_mean_squared_error',
                                   random_state=42,   # For reproducibility of sampling
                                   n_jobs=-1,         # Use multiple cores
                                   verbose=2)          # Print progress updates

# --- Fit the Algorithm (Perform the Random Search) ---

# RandomizedSearchCV will train the model with randomly sampled combinations
# using cross-validation on the training data (X_train, y_train).
random_search.fit(X_train, y_train)

print("\nRandomizedSearchCV completed.")

# --- Get the Best Model and Best Hyperparameters ---
best_xgb_reg_model = random_search.best_estimator_
best_params_xgb = random_search.best_params_
best_score_xgb = random_search.best_score_ # Best cross-validated score (negative MSE)

print("\nBest Hyperparameters found by RandomizedSearchCV:")
print(best_params_xgb)
print(f"\nBest Cross-Validated Score (Negative MSE): {best_score_xgb:.4f}")
print(f"Equivalent Best Cross-Validated RMSE: {np.sqrt(-best_score_xgb):.4f}") # Convert negative MSE back to RMSE

# --- Predict on the Best Model ---

print("\nMaking predictions on the test data using the best model...")

# Predict the closing prices on the test set using the best found model
y_pred_xgb_tuned = best_xgb_reg_model.predict(X_test)

print("Predictions made.")

# --- Evaluate the Best Model on the Test Set ---

mae_xgb_tuned = mean_absolute_error(y_test, y_pred_xgb_tuned)
mse_xgb_tuned = mean_squared_error(y_test, y_pred_xgb_tuned)
rmse_xgb_tuned = np.sqrt(mse_xgb_tuned)
r2_xgb_tuned = r2_score(y_test, y_pred_xgb_tuned)

print("\nXGBoost Regressor Evaluation Metrics (Tuned Model on Test Set):")
print(f"  Mean Absolute Error (MAE): {mae_xgb_tuned:.4f}")
print(f"  Mean Squared Error (MSE): {mse_xgb_tuned:.4f}")
print(f"  Root Mean Squared Error (RMSE): {rmse_xgb_tuned:.4f}")
print(f"  R-squared (R2) Score: {r2_xgb_tuned:.4f}")


# --- Prepare Data for Visualizing Comparison ---
# Compare Initial XGBoost with Tuned XGBoost

comparison_data_xgb = {
    'Metric': ['MAE', 'MSE', 'RMSE', 'MAE', 'MSE', 'RMSE'],
    'Score': [mae_xgb, mse_xgb, rmse_xgb, mae_xgb_tuned, mse_xgb_tuned, rmse_xgb_tuned],
    'Model': ['Initial XGBoost', 'Initial XGBoost', 'Initial XGBoost', 'Tuned XGBoost', 'Tuned XGBoost', 'Tuned XGBoost']
}
comparison_df_xgb = pd.DataFrame(comparison_data_xgb)


# --- Visualizing Evaluation Metrics Comparison (XGBoost) ---

plt.figure(figsize=(10, 6))
sns.barplot(x='Metric', y='Score', hue='Model', data=comparison_df_xgb, palette='viridis')
plt.title('📊 XGBoost Regressor Evaluation Metrics Comparison (Initial vs. Tuned)', fontsize=14)
plt.ylabel('Score', fontsize=12)
plt.xlabel('Metric', fontsize=12)
plt.ylim(0, max(comparison_df_xgb['Score']) * 1.2)
plt.grid(axis='y', alpha=0.75)
plt.show()


# --- Prepare Data for Overall Comparison (Optional - Update with Tuned RF metrics if used) ---
# Ensure you have the metrics from previous models:
# mae_lr, mse_lr, rmse_lr, r2_lr
# mae_rf_tuned, mse_rf_tuned, rmse_rf_tuned, r2_rf_tuned # Use tuned RF if available

comparison_data_all_tuned = {
    'Metric': ['MAE', 'MSE', 'RMSE'] * 3,
    'Score': [mae_lr, mse_lr, rmse_lr, mae_rf_tuned, mse_rf_tuned, rmse_rf_tuned, mae_xgb_tuned, mse_xgb_tuned, rmse_xgb_tuned],
    'Model': ['Linear Regression'] * 3 + ['Random Forest (Tuned)'] * 3 + ['XGBoost (Tuned)'] * 3
}
comparison_df_all_tuned = pd.DataFrame(comparison_data_all_tuned)

# Optional: Visualize all tuned models for comparison
plt.figure(figsize=(12, 6))
sns.barplot(x='Metric', y='Score', hue='Model', data=comparison_df_all_tuned, palette='viridis')
plt.title('📊 Model Comparison: Evaluation Metrics (Linear Reg, Tuned RF, Tuned XGBoost)', fontsize=14)
plt.ylabel('Score', fontsize=12)
plt.xlabel('Metric', fontsize=12)
plt.ylim(0, max(comparison_df_all_tuned['Score']) * 1.2)
plt.grid(axis='y', alpha=0.75)
plt.show()

print("\nR2 Score Comparison for Tuned Models:")
print(f"  Linear Regression: {r2_lr:.4f}") # Linear Regression doesn't tune this way
print(f"  Random Forest (Tuned): {r2_rf_tuned:.4f}")
print(f"  XGBoost (Tuned): {r2_xgb_tuned:.4f}")

##### Which hyperparameter optimization technique have you used and why?

Technique Used: Random Search Cross-Validation (RandomizedSearchCV).

Why used:

- Efficiency: RandomizedSearchCV samples a fixed number of hyperparameter combinations (n_iter, set to 50 here) from the specified distributions. When the search space is large or continuous, this is far cheaper than GridSearchCV and still covers a wide range of values for each hyperparameter.
- Cross-Validation: Each sampled combination is scored with TimeSeriesSplit, so validation data is always later in time than training data and the performance estimate respects the time-series nature of the data.
- Finding Good Results: Although it is not exhaustive like GridSearchCV, it often finds a good set of hyperparameters within a reasonable number of iterations.
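
When defining distributions for RandomizedSearchCV, it helps to check the range each one actually covers. The sketch below inspects two of the scipy.stats distributions used in the XGBoost search space; the variable names are illustrative.

```python
from scipy.stats import randint, uniform

# uniform(loc, scale) covers [loc, loc + scale], not [loc, scale].
lr_dist = uniform(0.01, 0.3)
lr_low, lr_high = lr_dist.support()

# randint(low, high) draws integers from low up to high - 1.
depth_dist = randint(3, 10)
depth_low, depth_high = depth_dist.support()

# Draw reproducible samples to confirm the bounds empirically.
lr_samples = lr_dist.rvs(size=1000, random_state=42)
depth_samples = depth_dist.rvs(size=1000, random_state=42)

print(f"learning_rate support: {lr_low:.2f} to {lr_high:.2f}")
print(f"max_depth support: {int(depth_low)} to {int(depth_high)}")
print(f"sampled learning_rate range: {lr_samples.min():.4f} to {lr_samples.max():.4f}")
print(f"sampled max_depth range: {depth_samples.min()} to {depth_samples.max()}")
```

If the intended upper limit for the learning rate is exactly 0.3, the distribution would need to be written as `uniform(0.01, 0.29)`; likewise `randint(3, 11)` would be needed to include a depth of 10.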

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

For a positive business impact in stock price prediction, the most crucial evaluation metrics to consider are MAE (Mean Absolute Error) and RMSE (Root Mean Squared Error).

Why: These metrics directly quantify the magnitude of your prediction errors in the same units as the stock price (Rupees). Lower MAE and RMSE mean your model's predictions are, on average, closer to the actual prices. This directly translates to reduced financial risk and potentially improved profitability when using the predictions for trading or investment decisions. You want your model to be off by the smallest possible amount.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Chosen Model: Based on typical performance on similar datasets, the final prediction model is likely to be the Tuned XGBoost Regressor or the Tuned Random Forest Regressor. This must be confirmed from the test-set results: the final choice is whichever model has the lowest MAE/RMSE and the highest R².

Why Chosen:

- Superior Performance: The chosen model should be the one with the best evaluation metrics (lowest MAE, lowest RMSE, and highest R²) on the unseen test data. Stock prices often exhibit complex, non-linear relationships, and ensemble models like Random Forest and XGBoost are generally better at capturing these than simple Linear Regression.
- Robustness: Tree-based ensemble models are generally more robust to outliers and do not assume linearity or a specific data distribution, which can be advantageous for financial data.
- Effectiveness of Tuning: If the tuned version of a model shows a significant improvement over its initial version, hyperparameter optimization was effective in finding better settings for that model on this dataset.
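
One way to make the final selection explicit is to collect the test-set metrics in a table and rank them. The sketch below is only an illustration: the model names and scores are placeholders, not results from this project, and should be replaced with the metrics computed in the notebook (for example `mae_lr`, `rmse_rf_tuned`, `r2_xgb_tuned`).

```python
import pandas as pd

# Placeholder scores for illustration only; substitute the test-set metrics
# computed in the notebook. "Model A/B/C" are hypothetical names.
scores = pd.DataFrame(
    {
        "Model": ["Model A", "Model B", "Model C"],
        "MAE": [6.0, 4.0, 5.0],
        "RMSE": [9.0, 6.0, 7.0],
        "R2": [0.95, 0.98, 0.97],
    }
).set_index("Model")

# Rank by RMSE first, then MAE; lower is better for both.
ranked = scores.sort_values(["RMSE", "MAE"])
best_model_name = ranked.index[0]

print(ranked)
print(f"Lowest test RMSE: {best_model_name}")
```

Ranking on RMSE first favours the model that avoids large misses; sorting on MAE instead would favour the model with the smallest typical error.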

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

In [None]:
# 3. Explain the model which you have used and the feature importance using any model explainability tool?

# --- Explain the Chosen Model (Assuming Tuned XGBoost Regressor) ---
# (Copy and paste your explanation from the previous section here, or summarize)
print("--- Explanation of the Final Chosen Model (Tuned XGBoost Regressor) ---")
print("The chosen model for final prediction is the Tuned XGBoost Regressor.")
print("XGBoost is a powerful gradient boosting ensemble method that builds trees sequentially,")
print("correcting errors at each step. It is effective at capturing complex patterns and interactions")
print("in the data. Hyperparameter tuning using RandomizedSearchCV helped optimize its performance.")
print("\nIt was chosen because it achieved the best evaluation metrics (lowest MAE/RMSE, highest R2)")
print(" on the test set compared to Linear Regression and Tuned Random Forest.")


# --- Feature Importance using Built-in Model Tool ---
# Tree-based models (Random Forest, XGBoost) have a .feature_importances_ attribute
# that indicates the relative importance of each feature.

# Assuming you have the best trained XGBoost model object: best_xgb_reg_model
# And you have the list of selected features used to train it: selected_features (or the columns of X_train/X_processed/X_reduced)

# Get feature importances
feature_importances = best_xgb_reg_model.feature_importances_

# Get the names of the features
# Use the columns from your processed feature matrix X (or X_reduced if you used PCA)
# Assuming X_processed was used before splitting:
feature_names = X_processed.columns # Or X_reduced.columns if you used PCA


# Create a DataFrame for easier visualization and sorting
feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': feature_importances
})

# Sort by importance in descending order
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

print("\n--- Feature Importance from Tuned XGBoost Regressor ---")
print(feature_importance_df)

# --- Visualize Feature Importance ---
plt.figure(figsize=(10, 7))
sns.barplot(x='Importance', y='Feature', data=feature_importance_df, palette='viridis')
plt.title('📊 Feature Importance from Tuned XGBoost Regressor', fontsize=16)
plt.xlabel('Importance', fontsize=12)
plt.ylabel('Feature', fontsize=12)
plt.grid(axis='x', alpha=0.75)
plt.show()

# --- Interpretation of Feature Importance ---
print("\n--- Interpretation of Feature Importance ---")
print("The bar chart above shows the relative importance of each feature in predicting the Closing Price.")
print("Features with higher importance scores contributed more significantly to the model's predictions.")
print("\nKey Observations:")
# Interpret the top features based on the printed DataFrame and the chart
# (Replace with your actual top features and their implications for stock price)
print(f"- The most important features appear to be: {feature_importance_df['Feature'].iloc[0]}, {feature_importance_df['Feature'].iloc[1]}, etc.")
print("- This aligns with expectations as [Explain why the top features are important, e.g., 'Open, High, Low directly reflect the price range,' 'High_Low_Diff captures volatility'].")
print("- Less important features [Mention some features lower down the list] had less influence on the predictions.")
print("- The importance of time-based features (Year, Month) indicates [Whether trend or seasonality was influential].")
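
The built-in `feature_importances_` attribute is computed from the training process. As a complementary check, permutation importance measures how much a held-out score drops when one feature is shuffled. The sketch below is a self-contained illustration of that technique on synthetic data with a Random Forest; the data, feature names, and model are assumptions for demonstration and are not the project's `best_xgb_reg_model` or its features.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(42)

# Synthetic stand-in data: only the first column drives the target.
X_demo = rng.normal(size=(200, 3))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)
demo_feature_names = ["informative", "noise_1", "noise_2"]  # hypothetical names

# Hold out the last 50 rows, mirroring a time-ordered split.
X_tr, X_te = X_demo[:150], X_demo[150:]
y_tr, y_te = y_demo[:150], y_demo[150:]

demo_model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Shuffle one column at a time on the held-out rows and record the drop in R2.
perm = permutation_importance(demo_model, X_te, y_te, n_repeats=10, random_state=42)

for idx in perm.importances_mean.argsort()[::-1]:
    print(f"{demo_feature_names[idx]}: "
          f"{perm.importances_mean[idx]:.4f} +/- {perm.importances_std[idx]:.4f}")
```

The same call should work with any fitted scikit-learn compatible regressor, so passing the tuned model together with `X_test` and `y_test` would give a test-set view of which features matter.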

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
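
A minimal sketch of the save-and-reload round trip with joblib is shown here. It trains a small stand-in model on synthetic data so that it runs on its own; the file name `demo_model.joblib` and the variables are hypothetical, and in the notebook the object to save would be the chosen final model.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in data and model; replace with the final model in practice.
X_demo = rng.normal(size=(60, 4))
y_demo = X_demo @ np.array([1.0, 0.5, -0.5, 2.0])
demo_model = RandomForestRegressor(n_estimators=20, random_state=42).fit(X_demo, y_demo)

# Save the fitted model to disk (hypothetical file name in a temp directory).
model_path = os.path.join(tempfile.mkdtemp(), "demo_model.joblib")
joblib.dump(demo_model, model_path)

# Load it back and run a sanity check on rows the model has not seen.
reloaded_model = joblib.load(model_path)
X_unseen = rng.normal(size=(5, 4))
same_predictions = np.allclose(demo_model.predict(X_unseen),
                               reloaded_model.predict(X_unseen))

print(f"Saved to: {model_path}")
print(f"Reloaded model reproduces predictions: {same_predictions}")
```

Any scaler or other preprocessing fitted on the training data needs to be saved and reloaded alongside the model, since unseen data must go through the same transformations before prediction.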

### ***Congrats! Your model has been successfully created and is ready for deployment on a live server for real user interaction!***

# **Conclusion**

This section summarizes the entire project, from the problem statement to the final model's performance and its implications.

1. Recap of the Project Goal:

   Start by briefly restating the main objective: building a regression model to predict the monthly closing price of Yes Bank stock using historical data.

2. Summary of Key Steps Performed:

   Provide a concise overview of the major stages of your project:

   - Data loading and initial understanding (Know Your Data).
   - Data cleaning (handling duplicates, missing values if any).
   - Data Wrangling (date conversion, sorting, creating time-based features).
   - Feature Engineering (creating High-Low Diff and Open-Close Diff).
   - Feature Selection (how you chose which features to use, referencing EDA/correlation).
   - Data Scaling (Standard Scaling method used).
   - Mention if you considered or applied Data Transformation (e.g., Log Transformation).
   - Mention if you considered or applied Dimensionality Reduction (e.g., PCA) and your decision.
   - Data Splitting (time-based split ratio used).
   - Model Implementation (list the models you trained: Linear Regression, Random Forest, XGBoost).
   - Hyperparameter Optimization (mention the techniques used, like GridSearchCV/RandomizedSearchCV, and TimeSeriesSplit CV).
   - Model Evaluation and Comparison (using MAE, MSE, RMSE, R²).

3. Model Performance Summary and Best Model:

   - Compare the performance of the models you trained.
   - State which model performed the best on the test set based on your chosen evaluation metrics (lowest MAE, lowest RMSE, highest R²). This is likely one of the tuned ensemble models (Random Forest or XGBoost).
   - Report the key evaluation metrics (MAE, RMSE, R²) for the best-performing model on the test set.

4. Insights from Feature Importance:

   - Discuss the results of your feature importance analysis from the best model.
   - Identify and list the most important features that influenced the model's predictions.
   - Explain why these features are likely important in the context of stock price movement (e.g., Open, High, Low are fundamental; engineered features capture dynamics; time features capture trends/seasonality).

5. Business Impact and Future Scope:

   - Explain the practical business implications of your model. How can it be used? (e.g., informing trading decisions, risk assessment, understanding market drivers).
   - Translate the evaluation metrics into business terms (e.g., an MAE of X Rupees means predictions are typically off by X Rupees).
   - Mention the potential business value: providing data-driven forecasts, potentially improving investment outcomes.
   - Suggest areas for future work to further improve the model or analysis:
     - Incorporating more features (volume, indicators, external data).
     - Exploring other advanced modeling techniques (ARIMA, LSTMs).
     - More extensive hyperparameter tuning or different optimization methods.
     - Testing on longer-term or real-time data.
     - Addressing the need for model retraining in dynamic markets.
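
Point 5 asks for the evaluation metrics to be expressed in business terms. The following is a minimal sketch of how MAE, RMSE, and R² could be computed and phrased in rupees using only the standard library. The price lists are made-up placeholders, not results from this project, and the `regression_metrics` function name is an assumption.

```python
import math


def regression_metrics(actual, predicted):
    """Return MAE, MSE, RMSE, and R2 for two equal-length lists of numbers."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("actual and predicted must be non-empty and equal length")
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)                    # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)   # total sum of squares
    r2 = 1 - ss_res / ss_tot  # undefined when all actual values are identical
    return {"MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse), "R2": r2}


# Placeholder closing prices in rupees (illustrative only, not project data).
actual_close = [10.0, 12.0, 14.0, 16.0]
predicted_close = [11.0, 11.0, 15.0, 17.0]

metrics = regression_metrics(actual_close, predicted_close)
print(f"MAE:  {metrics['MAE']:.2f}")
print(f"RMSE: {metrics['RMSE']:.2f}")
print(f"R2:   {metrics['R2']:.2f}")
print(f"On average, predictions are off by about Rs. {metrics['MAE']:.2f}.")
```

In the notebook, the test-set targets and the best model's predictions would replace the placeholder lists, so that the reported sentence reflects the actual error in rupees.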

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***