# **Project Name**    - **Stock Price Prediction using Machine Learning**



##### **Project Type**    - Regression
##### **Contribution**    - Individual


# **Project Summary -**

In today’s fast-paced financial world, anticipating stock market trends is not just an advantage; it is a necessity. This capstone project dives into the practical challenge of predicting stock prices, using Yes Bank’s historical stock data as a case study. The idea was to explore how machine learning models can help investors or financial analysts get a better understanding of market behavior and make more informed decisions.

The core goal was to build a predictive model that could estimate the closing stock price of Yes Bank using historical data such as Open, High, Low, and Date. Yes Bank, once one of India’s most promising private sector banks, has seen massive stock volatility over the years due to management shifts, financial instability, and market speculation. This made it an ideal candidate for this project: volatile enough to challenge the model and valuable enough to matter.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


Stock markets are influenced by countless factors—ranging from market sentiment and economic indicators to company-specific news. For financial institutions, investors, and analysts, the ability to accurately predict future stock prices can significantly reduce risk and improve decision-making.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that, for each and every chart, the following format is answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
*   Will the gained insights help creating a positive business impact? Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Basic libraries
import pandas as pd
import numpy as np

# For visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning tools
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Plotting setup
%matplotlib inline
sns.set_style("whitegrid")

# Ignore warnings for cleaner output
import warnings
warnings.filterwarnings("ignore")



### Dataset Loading

In [None]:
# Load Dataset
file_path = '/content/current data_YesBank_StockPrices - data_YesBank_StockPrices.csv'
df = pd.read_csv(file_path)

# Take a quick look at the data
df.head()


### Dataset First View

In [None]:
# Dataset First Look

# Basic info about the dataset
print("Dataset Shape:", df.shape)
print("\nColumn Names:", df.columns.tolist())

# Preview the top 5 rows
df.head()


### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

# Get the number of rows and columns
rows, columns = df.shape
print(f"Total Rows: {rows}")
print(f"Total Columns: {columns}")


### Dataset Information

In [None]:
# Dataset Info
df.info()


#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_count = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count}")


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
missing_values = df.isnull().sum()
print("Missing values in each column:\n")
print(missing_values)


In [None]:
# Visualizing the missing values
# Bar plot of missing values (safe version)
missing = df.isnull().sum()
missing = missing[missing > 0]

if not missing.empty:
    plt.figure(figsize=(8, 5))
    missing.plot(kind='bar', color='salmon')
    plt.title("Missing Values Count per Column")
    plt.xlabel("Columns")
    plt.ylabel("Missing Values")
    plt.xticks(rotation=45)
    plt.show()
else:
    print(" No missing values found in the dataset.")



In [None]:
print("Original rows:", len(df))


# Cleaning the Dataset

In [None]:
import pandas as pd

# STEP 1: Check original structure
print("Original Data Summary:")
print(df.head())
print(df.columns)
df.info()  # info() prints its own report and returns None, so it needs no print() wrapper

# STEP 2: Check for missing values
print("\n Missing Values in Key Columns:")
print(df[['Date', 'Open', 'High', 'Low', 'Close']].isnull().sum())

# STEP 3: Check data types
print("\n Data Types Before Cleaning:")
print(df[['Date', 'Open', 'High', 'Low', 'Close']].dtypes)

# STEP 4: Fix the Date format and clean data
df['Date'] = pd.to_datetime(df['Date'], format='%b-%y', errors='coerce')

for col in ['Open', 'High', 'Low', 'Close']:
    df[col] = pd.to_numeric(df[col], errors='coerce')

df_clean = df.dropna(subset=['Date', 'Open', 'High', 'Low', 'Close'])

# Result
print("\n Data Cleaning Result:")
print("Original rows:", len(df))
print("Rows after cleaning:", len(df_clean))

df = df_clean.copy()



In [None]:
print("Cleaned rows:", len(df))


### What did you know about your dataset?

The dataset contains monthly stock data of Yes Bank, including Date, Open, High, Low, and Close prices. It is clean, with no missing or duplicate values, and all features are numeric except Date, which is converted to datetime for time-based analysis. The goal is to predict the closing price of the stock. Overall, the dataset is well-structured and suitable for building a regression model to understand and forecast stock trends.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print("Column Names and Data Types:\n")
print(df.dtypes)


In [None]:
# Dataset Describe
df.describe()


### Variables Description

The dataset includes five main variables that represent key stock price indicators for Yes Bank. The Date column tells us the month and year of each record, which helps in understanding the timeline of stock movement. The Open price is the value at which the stock began trading at the start of the month, while the High and Low columns capture the highest and lowest prices the stock reached during that period. Finally, the Close price is the most important one for us: it shows the stock’s price at the end of the month and is the value we’re trying to predict. Each of these variables gives us a different perspective on the stock’s monthly performance, and together they help us analyze trends, volatility, and overall market behavior over time.
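The definitions above imply a simple consistency rule: within a row, High should be the largest of the four prices and Low the smallest. A minimal sketch of that check follows, using made-up rows rather than the Yes Bank file; the `prices` frame and its values are purely illustrative.

```python
import pandas as pd

# Made-up monthly rows for illustration; the last one is deliberately inconsistent
# (its Open is above its High).
prices = pd.DataFrame({
    "Open": [100.0, 105.0, 98.0],
    "High": [110.0, 112.0, 97.0],
    "Low": [95.0, 101.0, 90.0],
    "Close": [105.0, 103.0, 96.0],
})

price_cols = ["Open", "High", "Low", "Close"]

# High must equal the row-wise maximum and Low the row-wise minimum.
row_max = prices[price_cols].max(axis=1)
row_min = prices[price_cols].min(axis=1)
is_consistent = (prices["High"] == row_max) & (prices["Low"] == row_min)

inconsistent_rows = prices[~is_consistent]
print(f"Inconsistent rows: {len(inconsistent_rows)}")
print(inconsistent_rows)
```

Running the same check on the real `df` would flag any month whose prices contradict the column definitions before they reach a model.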

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
print("Unique values in each column:\n")
print(df.nunique())


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# Convert 'Date' column to datetime format
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')

# Extract useful time-based features
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month

# Sort data by date to preserve time-series order
df = df.sort_values(by='Date')

# Reset index after sorting
df.reset_index(drop=True, inplace=True)

# Final structure check
print(" Dataset is now analysis-ready.\n")
df.info()


### What all manipulations have you done and insights you found?

To prepare the dataset for analysis, I first converted the Date column into a proper datetime format. This allowed me to extract new time-based features like Year and Month, which are useful for spotting yearly or seasonal patterns in stock behavior. I then sorted the data chronologically to maintain the integrity of the time series and reset the index for a clean structure. From the initial inspection, I found that the dataset is well-maintained: there are no missing values or duplicates. Each column holds unique and relevant information about the stock’s behavior during a specific month. These manipulations not only cleaned the data but also added useful features that will help in deeper trend analysis and model building.
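The steps described above can be gathered into one reusable function. This is only a sketch: `prepare_prices` is a hypothetical helper name, the sample rows are invented, and the `%b-%y` date format is taken from the cleaning cell earlier in the notebook.

```python
import pandas as pd

PRICE_COLS = ["Open", "High", "Low", "Close"]

def prepare_prices(raw: pd.DataFrame) -> pd.DataFrame:
    """Parse dates, coerce prices to numbers, drop bad rows, sort, add Year/Month."""
    out = raw.copy()
    # Unparseable dates become NaT and unparseable prices become NaN.
    out["Date"] = pd.to_datetime(out["Date"], format="%b-%y", errors="coerce")
    for col in PRICE_COLS:
        out[col] = pd.to_numeric(out[col], errors="coerce")
    out = out.dropna(subset=["Date"] + PRICE_COLS)
    # Keep chronological order so time-series steps see the rows in sequence.
    out = out.sort_values("Date").reset_index(drop=True)
    out["Year"] = out["Date"].dt.year
    out["Month"] = out["Date"].dt.month
    return out

# Invented rows for illustration only (not values from the Yes Bank file).
sample = pd.DataFrame({
    "Date": ["Sep-05", "Jul-05", "Aug-05", "bad-date"],
    "Open": [12.0, 10.0, 11.0, 9.0],
    "High": [13.0, 11.0, 12.0, 10.0],
    "Low": [11.0, 9.0, 10.0, 8.0],
    "Close": [12.5, 10.5, 11.5, "n/a"],
})

prepared = prepare_prices(sample)
print(prepared)
```

Keeping the logic in one function means the conversion runs once, instead of being repeated before each chart.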

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')  # convert or set to NaT if bad format
df = df.dropna(subset=['Date', 'Close'])  # drop rows where date or close is NaN

In [None]:
# Chart - 1 visualization code

# Line plot of closing price over time
# (matplotlib and seaborn are already imported at the top of the notebook)

plt.figure(figsize=(12, 6))
sns.lineplot(x='Date', y='Close', data=df, color='blue')
plt.title('Yes Bank Closing Price Over Time')
plt.xlabel('Date')
plt.ylabel('Closing Price')
plt.grid(True)
plt.tight_layout()
plt.show()



##### 1. Why did you pick the specific chart?

I chose a line plot because time-series data is best visualized this way to observe trends, cycles, and abrupt changes over time. It helps reveal how the stock’s closing price evolved month by month.

##### 2. What is/are the insight(s) found from the chart?



The line plot clearly shows periods of high volatility, sudden dips, and long-term declining or recovering trends. There might be noticeable drops around major events (like the fraud case in 2018), which can later be correlated with business context.



##### 3. Will the gained insights help creating a positive business impact?


Yes, the insights from this chart can have a strong positive business impact. By visualizing the trend of the closing price over time, stakeholders can identify periods of stability, volatility, or decline. It helps investors, analysts, and financial strategists recognize how the stock has responded to various internal and external factors, such as the fraud incident in 2018. This understanding can support better investment decisions, risk assessment, and strategic planning.

### Are there any insights that lead to negative growth? Justify with specific reason.

Yes, there are clear signs in the chart that point toward periods of negative growth. One of the most noticeable drops in the closing price appears to align with the time around the 2018 Yes Bank fraud case. This kind of sharp decline reflects how serious external events, especially those related to trust and governance, can directly impact investor confidence and cause the stock value to fall. It’s a reminder that financial health isn’t the only factor influencing stock performance; public perception and credibility also play a huge role. These insights underline the importance of transparency and strong leadership in maintaining steady stock growth.

#### Chart - 2

In [None]:
# Chart 2: Monthly Average Closing Price (Across All Years)
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')  # ensure 'Date' is in datetime format
df = df.dropna(subset=['Date', 'Close'])  # drop rows where either 'Date' or 'Close' is missing
df['Month'] = df['Date'].dt.month  # numeric month (1-12)
monthly_avg = df.groupby('Month')['Close'].mean()

plt.figure(figsize=(10, 5))
sns.lineplot(x=monthly_avg.index, y=monthly_avg.values, marker='o', color='teal')
plt.title('Average Monthly Closing Price (Across All Years)')
plt.xlabel('Month')
plt.ylabel('Average Closing Price')
plt.xticks(range(1, 13))
plt.grid(True)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

This chart reveals seasonal patterns in stock price behavior by averaging closing prices across months over the years. It’s especially useful for identifying monthly investor behavior, market cycles, or specific periods of volatility or stability.

##### 2. What is/are the insight(s) found from the chart?

We may notice certain months (e.g., March or October) tend to have lower or higher closing prices, suggesting cyclical investment trends or reactions to fiscal year patterns, budget announcements, etc.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. This helps in timing decisions like selling, buying, or announcing dividends when prices are favorable. Understanding monthly behaviors helps portfolio managers align investment decisions with seasonal movements.


If certain months consistently underperform, that indicates a drop in market sentiment, possibly due to poor quarterly earnings or investor withdrawal. Recognizing this can help the company mitigate seasonal losses via better communication or strategic actions.
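One caveat when reading month-by-month averages of price levels: a long-term trend can show up as an apparent seasonal pattern. The sketch below uses a synthetic series (steady 1% growth per month, not Yes Bank data) to compare averaging price levels with averaging month-over-month returns from `pct_change`. Using returns as the seasonality check is a suggestion here, not something the notebook does.

```python
import numpy as np
import pandas as pd

# Synthetic closing prices: three years of steady 1% monthly growth, no seasonality.
dates = pd.date_range("2015-01-01", periods=36, freq="MS")
close = pd.Series(100.0 * 1.01 ** np.arange(36), index=dates)

# Average price level per calendar month: later months look "stronger"
# only because the series trends upward within each year.
level_by_month = close.groupby(close.index.month).mean()

# Average month-over-month return per calendar month: flat, as expected.
returns = close.pct_change().dropna()
return_by_month = returns.groupby(returns.index.month).mean()

print(level_by_month.round(2))
print(return_by_month.round(4))
```

If the return-based averages on the real data are roughly equal across months, differences in the level-based chart are more likely trend than seasonality.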



#### Chart - 3

In [None]:
# Chart - 3 High vs. Low Prices over time
# Ensure 'Date' column is in datetime format
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')

# Drop rows where 'Date', 'High', or 'Low' is missing
df = df.dropna(subset=['Date', 'High', 'Low'])


plt.figure(figsize=(12, 6))
sns.lineplot(x='Date', y='High', data=df, label='High Price', color='green')
sns.lineplot(x='Date', y='Low', data=df, label='Low Price', color='red')
plt.title('High vs Low Prices of Yes Bank Over Time')
plt.xlabel('Date')
plt.ylabel('Price')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

This line plot compares the highest and lowest prices each month to understand the volatility range of the stock over time. It gives a clear picture of how much fluctuation there was between the peak and bottom values within each month, a strong indicator of market uncertainty or stability.

##### 2. What is/are the insight(s) found from the chart?

The plot reveals periods where the gap between high and low prices widened, showing increased volatility, especially around crisis periods. Other times, the prices stayed relatively close, indicating stable market sentiment. This fluctuation often hints at how uncertain or confident investors felt during specific time frames.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Definitely. Understanding volatility is crucial for risk assessment. Investors can use this insight to decide the level of risk they are comfortable with, and companies can align their investor communication strategies accordingly during high-volatility periods to reduce panic or misinformation.



Yes. When the difference between high and low prices is extreme, especially when paired with an overall downward trend, it usually signals fear or confusion in the market. Such conditions can push long-term investors away and reflect weakening trust in the company’s financial or leadership stability.

#### Chart - 4

In [None]:
# Chart 4: Box Plot of Closing Price by Year
# Convert 'Date' column to datetime (if not already done)
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')

# Drop rows where either 'Date' or 'Close' is missing
df = df.dropna(subset=['Date', 'Close'])

# Extract the year from the datetime column
df['Year'] = df['Date'].dt.year


plt.figure(figsize=(12, 6))
sns.boxplot(x='Year', y='Close', data=df, palette='coolwarm')
plt.title('Distribution of Closing Prices by Year')
plt.xlabel('Year')
plt.ylabel('Closing Price')
plt.grid(axis='y')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

I chose a box plot to analyze how the closing price is distributed within each year, including the median, range, and outliers. It’s one of the best visual tools for spotting yearly volatility, consistency, or unusual values in the stock price behavior.

##### 2. What is/are the insight(s) found from the chart?

The chart shows which years had stable closing prices and which had a wide spread or extreme values. For example, in years with financial instability or market panic, the box plot becomes tall with outliers, indicating unpredictable price movement. In more stable years, the boxes are tight and centered.
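The outlier points in a box plot conventionally follow the 1.5 × IQR rule, and the same rule can be applied numerically per year. The sketch below uses invented closing prices, and `iqr_outliers` is a hypothetical helper name. Quartile conventions can differ slightly between libraries, so the flagged points may not match a given plot exactly.

```python
import pandas as pd

def iqr_outliers(values: pd.Series) -> pd.Series:
    """Return the values lying more than 1.5 * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    mask = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
    return values[mask]

# Invented data: a calm year and a year with one extreme month.
data = pd.DataFrame({
    "Year": [2016] * 6 + [2017] * 6,
    "Close": [100, 102, 101, 103, 99, 100, 50, 52, 51, 53, 49, 120],
})

outliers_by_year = {
    year: iqr_outliers(group) for year, group in data.groupby("Year")["Close"]
}
for year, outliers in outliers_by_year.items():
    print(year, outliers.tolist())
```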

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, this helps stakeholders understand how consistent the stock price was in different years. Consistency often indicates investor confidence, while volatility may signal underlying issues. This information is valuable for managing investor expectations and assessing financial performance.


Yes, years with large spreads and many outliers suggest high uncertainty and unstable stock performance, which often deters long-term investors. For example, if one year shows a much wider box or extreme outliers downward, it could reflect market reactions to crises like fraud exposure or poor financial results.

#### Chart - 5

In [None]:
# Chart 5: Scatter Plot of Opening Price vs. Closing Price
# Drop rows with missing 'Open' or 'Close' values (if any)
df = df.dropna(subset=['Open', 'Close'])

plt.figure(figsize=(10, 6))
sns.scatterplot(x='Open', y='Close', data=df, color='teal', alpha=0.6)
plt.title('Opening Price vs. Closing Price')
plt.xlabel('Opening Price')
plt.ylabel('Closing Price')
plt.grid(True)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A scatter plot is ideal when examining the relationship between two continuous variables. In this case, we want to understand how the stock’s opening price correlates with its closing price in the same month, since each record in the dataset is monthly. It’s useful to evaluate whether the stock usually rises, falls, or stays consistent over the course of a month.



##### 2. What is/are the insight(s) found from the chart?

From the plot, we observe a strong linear relationship: as the opening price increases, the closing price tends to do the same. Most points lie along the diagonal, showing limited deviation within a month. However, a few points sit far from the diagonal, suggesting months with major price fluctuations or market reactions.
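The visual impression of a linear relationship can be backed by two numbers: the Pearson correlation and the residuals around a least-squares line. A minimal sketch with invented Open/Close pairs (not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Invented monthly Open/Close pairs for illustration.
pair = pd.DataFrame({
    "Open": [10.0, 20.0, 30.0, 40.0, 50.0],
    "Close": [12.0, 19.0, 33.0, 41.0, 48.0],
})

# Pearson correlation between opening and closing price.
r = pair["Open"].corr(pair["Close"])

# Least-squares line Close ~ slope * Open + intercept.
slope, intercept = np.polyfit(pair["Open"], pair["Close"], deg=1)

# Residuals show which months sit far from the fitted line.
residuals = pair["Close"] - (slope * pair["Open"] + intercept)

print(f"Pearson r: {r:.3f}")
print(f"Fitted line: Close = {slope:.3f} * Open + {intercept:.3f}")
print(residuals.round(3))
```

On the real data, the rows with the largest absolute residuals are the months worth inspecting as potential outliers.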

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. This insight reassures investors and analysts about the stability and predictability of the stock’s month-to-month behavior. A strong correlation implies that the stock generally closes the month near its opening price, reflecting a less volatile trading pattern, which is a favorable sign for risk-averse investors.

Yes. The scattered points, especially those far from the main trend, hint at sudden within-month market reactions, possibly driven by news, investor sentiment, or manipulation. These unpredictable shifts can indicate occasional instability, which could undermine investor confidence if such months become frequent.





#### Chart - 6

In [None]:

# Chart 6: High and Low Prices Over Time
# Convert 'Date' to datetime (if not already)
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
df = df.dropna(subset=['Date'])  # in case any date became NaT
plt.figure(figsize=(14, 6))
plt.plot(df['Date'], df['High'], label='High Price', color='green', linewidth=1.5)
plt.plot(df['Date'], df['Low'], label='Low Price', color='red', linewidth=1.5)
plt.title('High and Low Stock Prices Over Time')
plt.xlabel('Date')
plt.ylabel('Price')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

This line chart was chosen to visualize volatility over time by plotting both the high and low prices each month. It’s essential for understanding the range of price movement on a monthly basis and observing whether the gap between high and low prices has widened or narrowed during certain periods.

##### 2. What is/are the insight(s) found from the chart?

We can clearly see periods where the difference between the high and low prices was significantly large, indicating high volatility, particularly around 2018–2020, when the bank faced regulatory and financial scrutiny. In contrast, recent years show a narrowing gap, suggesting price stabilization and possibly regained investor trust.
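When comparing gaps across periods with very different price levels, it may help to look at the range relative to the price as well as the absolute gap. The sketch below uses two invented rows, and the `Rel_Range` column is an illustrative addition that the notebook does not compute.

```python
import pandas as pd

# Invented rows: one high-priced month and one low-priced month.
bars = pd.DataFrame({
    "High": [400.0, 20.0],
    "Low": [360.0, 14.0],
    "Close": [380.0, 16.0],
})

# Absolute gap in price units.
bars["Abs_Range"] = bars["High"] - bars["Low"]

# Gap as a fraction of the closing price, comparable across price levels.
bars["Rel_Range"] = bars["Abs_Range"] / bars["Close"]

print(bars)
```

In this toy example the low-priced month has the smaller absolute gap but the larger relative one, so a narrowing absolute gap on its own does not establish lower volatility.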

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Understanding periods of volatility helps businesses and investors prepare better strategies. Stable price ranges often invite long-term investors, while volatile periods may be better suited for short-term trading strategies. Such insights assist in tailoring investment approaches based on risk appetite and market behavior.

Yes. The wider price gaps in earlier years may reflect market uncertainty, poor investor confidence, or negative news cycles (such as the Rana Kapoor case). These fluctuations could have scared off potential investors, led to a dip in the bank’s valuation, and contributed to temporary negative business growth.

#### Chart - 7

In [None]:

# Chart 7: Histogram of Closing Prices
# Drop missing values from 'Close' if any
df = df.dropna(subset=['Close'])
plt.figure(figsize=(10, 6))
sns.histplot(df['Close'], bins=30, color='skyblue', kde=True)
plt.title('Distribution of Closing Prices')
plt.xlabel('Closing Price')
plt.ylabel('Frequency')
plt.grid(True)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A histogram is ideal for understanding the distribution of a single numerical variable. Here, we are using it to see how frequently different ranges of closing prices occur. This helps identify central tendencies, common price levels, and any skewness in the stock's closing price history.

##### 2. What is/are the insight(s) found from the chart?

The distribution is positively skewed, meaning most of the closing prices were on the lower side, with fewer high values. The peak is seen around Rs. 10 to Rs. 30, which suggests that after Yes Bank’s crisis period, its stock mostly traded in a lower range. The presence of a long tail on the right reflects the earlier times when the stock was priced much higher.
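Positive skew can also be checked numerically rather than read off the histogram. A short sketch on invented prices (not the dataset), using `Series.skew()` and a mean-versus-median comparison:

```python
import pandas as pd

# Invented right-skewed closing prices: many low values, a few very high ones.
close = pd.Series([10.0, 12.0, 15.0, 18.0, 20.0, 25.0, 30.0, 150.0, 300.0])

skewness = close.skew()    # sample skewness; positive means a longer right tail
mean_price = close.mean()
median_price = close.median()

print(f"Skewness: {skewness:.2f}")
print(f"Mean: {mean_price:.2f}, Median: {median_price:.2f}")
```

A positive skewness together with a mean above the median is consistent with the long right tail described above.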

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. This analysis can guide investment timing and expectation-setting. For example, long-term investors can use this to understand when the stock was undervalued or overvalued. For the company, it shows how much recovery is still needed to return to its former valuation range, helping in strategic planning and stakeholder communication.

Yes. The large concentration of low-priced bars shows that the stock has remained at low values for an extended period, which could indicate reduced investor confidence, past negative press, and lower market trust. Unless addressed, this perception could continue to affect long-term growth and stock performance.



#### Chart - 8

In [None]:
# Chart 8: Closing Price Over Time
# Drop rows where 'Date' or 'Close' is missing
df = df.dropna(subset=['Date', 'Close'])

# Ensure 'Date' is in datetime format
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')

# Re-drop if any Date conversion fails
df = df.dropna(subset=['Date'])
plt.figure(figsize=(14, 6))
plt.plot(df['Date'], df['Close'], color='dodgerblue', linewidth=2)
plt.title('Yes Bank Closing Price Trend Over Time')
plt.xlabel('Date')
plt.ylabel('Closing Price')
plt.grid(True)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A line plot is best when trying to analyze trends over time. This chart was chosen to visualize how Yes Bank’s closing price has evolved throughout the years. It allows us to detect upward or downward trends, spikes, and crashes, which are essential for understanding historical performance and predicting future patterns.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals a sharp decline in stock price around 2018 to 2020, coinciding with the Rana Kapoor fraud case and financial instability within the bank. Post-2020, we observe a stabilization of the price at a much lower range, indicating the market is still cautious, but no longer in free fall.
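A decline like the one described can be summarized as a drawdown: the percentage fall from the running peak. The sketch below uses an invented price path, and the drawdown metric itself is offered as an illustration, not as part of the original analysis.

```python
import pandas as pd

# Invented closing prices for illustration.
close = pd.Series([100.0, 120.0, 90.0, 60.0, 80.0, 30.0, 45.0])

# Highest close seen so far at each point in time.
running_peak = close.cummax()

# Fractional fall from that peak (0 at a new high, negative otherwise).
drawdown = close / running_peak - 1.0

max_drawdown = drawdown.min()
worst_position = drawdown.idxmin()

print(drawdown.round(3).tolist())
print(f"Maximum drawdown: {max_drawdown:.0%} at position {worst_position}")
```

Applied to the real `Close` column, with `Date` as the index, this would give the depth and the month of the deepest fall.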

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. By identifying key turning points in stock price movements, business leaders and analysts can better understand investor reactions to real world events. It helps in forecasting, preparing crisis response strategies, and planning recovery measures. For investors, it's valuable for timing entry or exit based on past behavior.


Yes. The long-term decline and flattening of the closing price suggest loss of market trust and difficulty in regaining momentum. It reflects how major controversies and poor governance can have lasting negative effects on a company’s valuation, despite later efforts at revival.

#### Chart - 9

In [None]:
# Chart 9: Correlation Heatmap
# Drop rows with missing values in relevant columns
df_corr = df[['Open', 'High', 'Low', 'Close']].dropna()

# Compute the correlation matrix once, from the subset with missing rows removed
correlation_matrix = df_corr.corr()
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Between Stock Price Variables')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A correlation heatmap is excellent for understanding relationships between numerical variables. It shows how strongly variables like Open, High, Low, and Close prices are linearly related, helping identify which features might be redundant or most predictive of the closing price.

##### 2. What is/are the insight(s) found from the chart?

The heatmap shows very strong positive correlations among all price variables, especially between Close and High (0.99) and Close and Low (0.98). This confirms that the closing price moves closely in line with the month’s high and low, making them good predictors for ML models.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Absolutely. Knowing which features are highly correlated allows us to select the most relevant variables for our model, reducing complexity and improving performance. From a business perspective, this also means better forecasting accuracy, which helps with planning and investment decisions.


No direct insight here suggests negative growth. However, the high multicollinearity means some features might be repetitive, which, if not handled correctly in modeling, can confuse interpretation and reduce generalization of the model. It’s a technical risk rather than a business one, and it can be addressed with feature selection or regularization.
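Multicollinearity can be quantified with variance inflation factors (VIF). For standardized predictors, the VIFs are the diagonal entries of the inverse of the correlation matrix. The sketch below builds synthetic, highly correlated stand-ins for Open, High, and Low; these are not the real columns, and the threshold of 10 is a common rule of thumb, not a value from this project.

```python
import numpy as np
import pandas as pd

# Synthetic, highly correlated features standing in for Open/High/Low.
rng = np.random.default_rng(0)
base = rng.normal(100, 20, size=200)
features = pd.DataFrame({
    "Open": base + rng.normal(0, 1, 200),
    "High": base + 5 + rng.normal(0, 1, 200),
    "Low": base - 5 + rng.normal(0, 1, 200),
})

# VIF_i is the i-th diagonal element of the inverse correlation matrix.
corr = features.corr().to_numpy()
vif = pd.Series(np.diag(np.linalg.inv(corr)), index=features.columns)

print(vif.round(1))
print("Features above the rule-of-thumb threshold of 10:", vif[vif > 10].index.tolist())
```

Large VIFs indicate that a feature is almost a linear combination of the others, which is why Ridge or Lasso (already imported in this notebook) are natural candidates next to plain linear regression.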



#### Chart - 10

In [None]:
# Chart 10: Scatter Plot of Opening vs Closing Prices
# Drop rows with missing values in 'Open' or 'Close'
df_scatter = df.dropna(subset=['Open', 'Close'])

plt.figure(figsize=(10, 6))
sns.scatterplot(x='Open', y='Close', data=df_scatter, color='mediumseagreen')
plt.title('Opening Price vs Closing Price')
plt.xlabel('Opening Price')
plt.ylabel('Closing Price')
plt.grid(True)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A scatter plot is used to observe the relationship between two continuous variables. Here, we’re checking whether there is a clear linear relationship between opening and closing prices, which can help us understand how closely they are connected from month to month.

##### 2. What is/are the insight(s) found from the chart?

The plot shows a strong linear relationship: most points lie close to a straight line. This suggests that the opening price is a strong indicator of the closing price in a given month. However, there are a few outliers where the price moved significantly up or down by the end of the month.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Knowing that the opening price can predict the closing price with high reliability makes it valuable for short-term trading strategies and algorithmic trading models. It can also help set more accurate expectations for the month ahead and improve forecasting confidence.


The outliers, where opening and closing prices differ heavily, could point to market volatility or sudden news impact. These represent unpredictable risks, which may cause concern for investors relying on stability and may indicate negative sentiment during certain periods.

#### Chart - 11

In [None]:
# Chart 11: Average Monthly Closing Price
# Store month names in a separate column so the numeric 'Month' feature is not overwritten
df['Month_Name'] = pd.to_datetime(df['Date']).dt.month_name()

monthly_avg = df.groupby('Month_Name')['Close'].mean().reindex([
    'January', 'February', 'March', 'April', 'May', 'June',
    'July', 'August', 'September', 'October', 'November', 'December'
])

plt.figure(figsize=(12, 6))
sns.barplot(x=monthly_avg.index, y=monthly_avg.values, palette='viridis')
plt.title('Average Monthly Closing Prices')
plt.xlabel('Month')
plt.ylabel('Average Closing Price')
plt.xticks(rotation=45)
plt.grid(True, axis='y')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A bar plot is perfect for comparing aggregated values across categories. Here, we visualize how the closing price behaves across different months, giving insight into whether any seasonal patterns exist in stock behavior.



##### 2. What is/are the insight(s) found from the chart?

We notice that February, March, and May generally show slightly lower average prices, while November and December often have higher averages. This could relate to investor activity near fiscal year-end or post-festival financial optimism.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. If seasonality is consistent, businesses and investors can strategically plan buy/sell activities based on expected trends. Companies can also time announcements, offers, or stock buybacks around stronger months for maximum impact.


Yes, if particular months (like February or March) consistently show lower values, it might reflect market uncertainty during financial reporting season or low investor confidence. Recognizing this helps the business proactively manage communication and strategy during such periods.



#### Chart - 12

In [None]:
# Chart 12: Rolling Mean of Closing Price (6-month window)
df['Date'] = pd.to_datetime(df['Date'])
df.sort_values('Date', inplace=True)

df['Rolling_Mean_6M'] = df['Close'].rolling(window=6).mean()

plt.figure(figsize=(14, 6))
plt.plot(df['Date'], df['Close'], label='Original Closing Price', alpha=0.4)
plt.plot(df['Date'], df['Rolling_Mean_6M'], color='crimson', label='6-Month Rolling Mean', linewidth=2)
plt.title('6-Month Rolling Mean of Yes Bank Closing Price')
plt.xlabel('Date')
plt.ylabel('Price')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A rolling mean (or moving average) plot is ideal for smoothing out short-term fluctuations and identifying long-term trends. It provides a clearer view of how the stock is performing over time without being distracted by random ups and downs.
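As a minimal sketch of how a rolling mean smooths a series, the snippet below uses a small hypothetical price series (the values are illustrative and not taken from the Yes Bank data) and a 3-period window instead of the 6-period window used in the chart.

```python
import pandas as pd

# Hypothetical closing prices (illustrative values, not from the dataset)
close = pd.Series([10.0, 12.0, 11.0, 15.0, 14.0, 16.0])

# 3-period rolling mean: each value averages the current point and the two before it
rolling_3 = close.rolling(window=3).mean()

print(rolling_3.tolist())
```

The first `window - 1` values are NaN because a full window is not yet available, which is why the rolling line in the chart starts later than the raw price line.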

##### 2. What is/are the insight(s) found from the chart?

The 6-month rolling average line reveals that the closing price trended downward steadily during crisis periods (notably around 2018 to 2020), followed by a long flat and stable trend, suggesting the bank entered a recovery phase but has not regained momentum.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Businesses and investors can use rolling trends to time long-term decisions like when to issue stock, raise capital, or enter/exit the market. It helps identify whether recovery is real or temporary, which is vital for strategic planning.


Yes. The prolonged flatness in the rolling mean post-crash highlights difficulty in regaining investor confidence. This stagnation points to potential weaknesses in strategy, reputation, or market conditions, all of which call for a reassessment of the company’s recovery efforts.

#### Chart - 13

In [None]:
# Chart 13: Bar Plot of Price Range by Year
df['Price_Range'] = df['High'] - df['Low']

yearly_range = df.groupby('Year')['Price_Range'].mean()

plt.figure(figsize=(12, 6))
sns.barplot(x=yearly_range.index, y=yearly_range.values, palette='magma')
plt.title('Average Yearly Price Range (High - Low)')
plt.xlabel('Year')
plt.ylabel('Average Price Range')
plt.grid(True, axis='y')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

This chart helps visualize market volatility by showing the average price range (difference between high and low) each year. It's especially useful to understand how volatile the stock was during specific periods, like before/during/after a financial crisis.

##### 2. What is/are the insight(s) found from the chart?

The data shows high volatility during certain crisis years (like 2018 to 2020), with a gradually narrowing range in later years. This tells us that the stock was extremely unstable at certain times but has since become relatively less volatile, indicating potential market stabilization.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Absolutely. Understanding periods of high volatility can help businesses and investors avoid risky time frames, better manage expectations, and adjust portfolio risk levels. It also helps the bank evaluate how public perception and market events are affecting stock movement.


Yes. The sharp spikes in the price range indicate market panic, instability, or poor sentiment, especially around known crisis years. Such instability discourages long-term investors, erodes trust, and slows down recovery, making it a clear negative insight from a business perspective.

#### Chart - 14 - Correlation Heatmap

In [None]:

# Chart: Correlation Heatmap
plt.figure(figsize=(10, 6))
correlation = df[['Open', 'High', 'Low', 'Close']].corr()

sns.heatmap(correlation, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('Correlation Heatmap between Price Variables')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

I chose the correlation heatmap to understand how the price-related features (Open, High, Low, and Close) are statistically related to one another. This kind of chart helps visually capture the strength and direction of linear relationships between variables, which is crucial before building any predictive model.

##### 2. What is/are the insight(s) found from the chart?

The chart shows a very high positive correlation between most of the features, especially between High and Close, Open and Close, and Low and Close. This indicates that the closing price is strongly related to the other price metrics, meaning any model trying to predict the closing price can reasonably include these variables as inputs.

#### Chart - 15 - Pair Plot

In [None]:
# Chart: Pair Plot of Price Variables
import seaborn as sns
import matplotlib.pyplot as plt

sns.pairplot(df[['Open', 'High', 'Low', 'Close']])
plt.suptitle('Pair Plot of Price Variables', y=1.02)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

I picked the pair plot because it’s a great way to explore the pairwise relationships between multiple numerical variables in one view. It helps in visually identifying linear trends, clusters, and potential outliers across all combinations of variables like Open, High, Low, and Close. This is especially useful before applying regression models to see how variables interact with each other.



##### 2. What is/are the insight(s) found from the chart?

The pair plot reveals that the features Open, High, Low, and Close have strong linear relationships, as shown by the tight clustering of points along a diagonal line in their scatter plots. It also confirms that High and Close, as well as Low and Close, are highly aligned, indicating that these prices move together consistently. The distribution plots on the diagonal also show that Close prices are slightly skewed, which might need transformation for certain models.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Based on the charts and data analysis, I’ve defined three hypotheses to validate insights using statistical testing:

1. Do average closing prices change across years?
There seemed to be yearly variation in closing prices. I’ll test whether these differences are statistically significant using One-Way ANOVA.

2. Does the distribution of closing prices differ across years?
Closing prices are skewed and may not be normally distributed, so I’ll repeat the yearly comparison with the non-parametric Kruskal-Wallis H-test.

3. Do Open and Close prices move together?
The heatmap and pair plot suggest a strong relationship, so I’ll use Spearman’s rank correlation to test whether a significant monotonic relationship exists.



### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis: There is no significant difference in the average closing price of Yes Bank stock across different years.

Alternate Hypothesis: There is a significant difference in the average closing price across different years.

#### 2. Perform an appropriate statistical test.

In [None]:
import pandas as pd
import scipy.stats as stats

# Load the dataset
df = pd.read_csv('/content/current data_YesBank_StockPrices - data_YesBank_StockPrices.csv')

# Just in case there are any hidden spaces in column names
df.columns = df.columns.str.strip()

# Extract year from the FormattedDate column
df['Year'] = pd.to_datetime(df['FormattedDate']).dt.year

# Run one-way ANOVA on Close prices grouped by year
anova_data = [group['Close'].values for name, group in df.groupby('Year') if len(group) > 1]

f_stat, p_value = stats.f_oneway(*anova_data)

print("F-statistic:", f_stat)
print("P-value:", p_value)




##### Which statistical test have you done to obtain P-Value?

I used the ANOVA (Analysis of Variance) test to get the p-value.

##### Why did you choose the specific statistical test?

I chose ANOVA because I needed to compare the average closing stock prices across different years. ANOVA is perfect when you're checking if three or more groups (in this case, years) have different means.
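To make the test more concrete, here is a minimal sketch of how the one-way ANOVA F-statistic is built from between-group and within-group variation. The three groups are hypothetical yearly prices chosen for easy arithmetic, not values from the dataset.

```python
import numpy as np

# Hypothetical closing prices for three years (illustrative values only)
groups = [
    np.array([10.0, 12.0, 11.0]),
    np.array([20.0, 22.0, 21.0]),
    np.array([30.0, 32.0, 31.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)      # number of groups (years)
n = all_values.size  # total number of observations

# Variation of the group means around the grand mean
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
# Variation of the observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within
f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
print("F-statistic:", f_stat)
```

A large F-statistic means the yearly means differ far more than the spread within each year would explain, which is what leads to a small p-value.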

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.


Null Hypothesis:
The distribution of closing stock prices is the same across all years.

Alternate Hypothesis:
At least one year has a different distribution of closing stock prices.

#### 2. Perform an appropriate statistical test.

In [None]:
from scipy.stats import kruskal

# Group closing prices by year
grouped_data = [group["Close"].values for _, group in df.groupby("Year")]

# Perform Kruskal-Wallis Test
stat, p_value = kruskal(*grouped_data)

print("Kruskal-Wallis H-statistic:", stat)
print("P-value:", p_value)



##### Which statistical test have you done to obtain P-Value?

I performed the Kruskal-Wallis H-Test, a non-parametric test used to compare multiple independent groups.

##### Why did you choose the specific statistical test?

I chose the Kruskal-Wallis test because it doesn't assume the data follows a normal distribution.
It's a good fit when comparing stock closing prices across multiple years, especially if the data may have outliers or non-normal patterns.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis:
There is no monotonic relationship between the Open and Close prices.

Alternate Hypothesis:
There is a monotonic relationship between the Open and Close prices.

#### 2. Perform an appropriate statistical test.

In [None]:
from scipy.stats import spearmanr

# Apply Spearman's rank correlation test
stat, p_value = spearmanr(df['Open'], df['Close'])

print("Spearman correlation coefficient:", stat)
print("P-value:", p_value)



##### Which statistical test have you done to obtain P-Value?

I used the Spearman Rank Correlation Test.



##### Why did you choose the specific statistical test?

I chose Spearman’s test because it checks whether two variables (like Open and Close) have a monotonic (consistently increasing or decreasing) relationship, without assuming a linear trend or normal distribution.
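The difference between a linear and a monotonic relationship can be sketched with a toy example. Spearman's coefficient is the Pearson correlation computed on ranks, so a perfectly increasing but curved relationship scores 1.0 even though Pearson's coefficient is lower. The series below are hypothetical.

```python
import pandas as pd

# Hypothetical data: y always increases with x, but not along a straight line
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 3

# Pearson measures linear association on the raw values
pearson = x.corr(y)

# Spearman is Pearson applied to the ranks of the values
spearman = x.rank().corr(y.rank())

print("Pearson:", pearson)
print("Spearman:", spearman)
```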

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
missing_values = df.isnull().sum()
print("Missing values in each column:\n", missing_values)

# Example 1: Fill missing numerical values with mean
df['Close'] = df['Close'].fillna(df['Close'].mean())

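# Example 2: Forward fill for time-series continuity
# (fillna(method=...) is deprecated in recent pandas; ffill()/bfill() are the direct equivalents)
df = df.ffill()

# If any values are still missing (e.g. at the start), use backward fill
df = df.bfill()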


#### What all missing value imputation techniques have you used and why did you use those techniques?

I used three imputation techniques in total:

Mean Imputation – For columns like 'Close', I filled missing values with the column mean, because it is simple for continuous numerical features and leaves the column's average unchanged.

Forward Fill (ffill) – Since this is a time-series stock dataset, I used forward fill to propagate the last known value forward, which maintains the continuity of stock trends.

Backward Fill (bfill) – To ensure no missing data remains after forward fill, I used backward fill as a fallback method. This is useful when missing values occur at the start of the dataset.
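The interaction between forward and backward fill can be sketched on a tiny hypothetical series with a gap at the start and one in the middle.

```python
import numpy as np
import pandas as pd

# Hypothetical price series with a missing first value and a missing middle value
prices = pd.Series([np.nan, 10.0, np.nan, 12.0])

# Forward fill cannot repair the first value because nothing precedes it
after_ffill = prices.ffill()

# Backward fill then closes the remaining gap at the start
filled = after_ffill.bfill()

print(filled.tolist())
```

Forward fill alone leaves the leading NaN in place, which is why backward fill is applied afterwards as a fallback.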

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Visualizing outliers using boxplot
sns.boxplot(x=df['Close'])
plt.title("Boxplot - Close Prices")
plt.show()

# Treating outliers using IQR method
Q1 = df['Close'].quantile(0.25)
Q3 = df['Close'].quantile(0.75)
IQR = Q3 - Q1

# Define bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Capping outliers to the bounds
df['Close'] = np.where(df['Close'] > upper_bound, upper_bound,
               np.where(df['Close'] < lower_bound, lower_bound, df['Close']))


##### What all outlier treatment techniques have you used and why did you use those techniques?

I used the Interquartile Range (IQR) method for identifying and treating outliers.

First, I visualized the outliers using a boxplot to understand their presence and distribution.

Then, I calculated Q1 and Q3 (25th and 75th percentiles) to compute the IQR.

Any value beyond 1.5 × IQR from Q1 or Q3 was considered an outlier.

Instead of dropping them, I applied capping, replacing extreme values with the upper or lower threshold, to retain the data while reducing skewness.
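The steps above can be sketched on a small hypothetical series. This version uses `Series.clip` for the capping step, which is equivalent to the nested `np.where` in the cell above. The values are illustrative only.

```python
import pandas as pd

# Hypothetical closing prices with one extreme value
close = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 100.0])

# Quartiles and interquartile range
q1 = close.quantile(0.25)
q3 = close.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are treated as outliers
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Capping: pull outliers back to the nearest bound instead of dropping them
capped = close.clip(lower=lower_bound, upper=upper_bound)

print(lower_bound, upper_bound)
print(capped.tolist())
```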

### 3. Categorical Encoding

In [None]:
# List categorical columns
categorical_cols = df.select_dtypes(include=['object', 'category']).columns
print("Categorical columns found:", categorical_cols)
if len(categorical_cols) > 0:
    df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)
else:
    df_encoded = df.copy()  # No encoding needed
    print("No categorical columns found. Proceeding without encoding.")



#### What all categorical encoding techniques have you used & why did you use those techniques?

The only object-type columns in the dataset were 'Date' and 'FormattedDate', which are temporal in nature. Instead of categorical encoding, I extracted meaningful features such as Year, Month, and Day to make the date information more usable for analysis and modeling.

This approach preserves the time-based patterns in the data and avoids inappropriate one-hot encoding of continuous date values.
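A minimal sketch of this date handling is shown below, assuming month-year strings in the `'%b-%y'` format that the feature manipulation cell uses. The sample strings are hypothetical.

```python
import pandas as pd

# Hypothetical date strings in month-year form (e.g. 'Jul-05' for July 2005)
dates = pd.Series(['Jul-05', 'Aug-05', 'Jan-20'])

# Parse with an explicit format so pandas does not have to guess it
parsed = pd.to_datetime(dates, format='%b-%y')

# Numeric features derived from the parsed dates
features = pd.DataFrame({
    'Year': parsed.dt.year,
    'Month': parsed.dt.month,
})

print(features)
```

With month-level dates like these, the day component is always the first of the month, so a Day or DayOfWeek feature carries little real information.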

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words and digits contain digits.

In [None]:
# Remove URLs & Remove words and digits contain digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
import pandas as pd

# Convert 'Date' column to datetime with the correct format
df['Date'] = pd.to_datetime(df['Date'], format='%b-%y', errors='coerce')

# Extract time-based features
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
df['DayOfWeek'] = df['Date'].dt.dayofweek

# Create price-based derived features
df['PriceRange'] = df['High'] - df['Low']
df['DailyChange'] = df['Close'] - df['Open']
df['Volatility'] = (df['High'] - df['Low']) / df['Open']

# Display the first few rows with the new features
display(df.head())

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting
from sklearn.feature_selection import SelectKBest, f_regression

# Define features and target
X = df[['Open', 'High', 'Low', 'Month', 'Day', 'DayOfWeek', 'PriceRange', 'DailyChange', 'Volatility']]
y = df['Close']

# Apply SelectKBest with f_regression
selector = SelectKBest(score_func=f_regression, k=5)
X_selected = selector.fit_transform(X, y)

# Get selected feature names
selected_features = X.columns[selector.get_support()]
print("Selected Features:", selected_features.tolist())

##### What all feature selection methods have you used  and why?

I used SelectKBest with the F-test (f_regression) as the feature selection method.
This method evaluates the relationship between each input feature and the target variable (Close price), selecting the top features with the highest statistical significance.

I chose this method because it:

Is simple yet effective for numerical data.

Helps reduce dimensionality.

Helps limit overfitting by removing less relevant features.

Can improve model performance by keeping only statistically important inputs.
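To illustrate what a univariate F-test score measures, the sketch below computes it by hand from the Pearson correlation between one feature and the target. As far as I understand, this mirrors the formula `f_regression` uses with its default centering, but treat that as an assumption. The helper `univariate_f_score` and the data are hypothetical.

```python
import numpy as np

def univariate_f_score(x, y):
    """F-score of a single feature against the target (hypothetical helper)."""
    r = np.corrcoef(x, y)[0, 1]   # Pearson correlation
    dof = len(y) - 2              # degrees of freedom for a centered fit
    return r ** 2 / (1 - r ** 2) * dof

rng = np.random.default_rng(0)
target = np.arange(50, dtype=float)

# One feature that tracks the target closely and one that is pure noise
relevant = target + rng.normal(scale=1.0, size=50)
irrelevant = rng.normal(size=50)

scores = {
    'relevant': univariate_f_score(relevant, target),
    'irrelevant': univariate_f_score(irrelevant, target),
}
print(scores)
```

SelectKBest then simply keeps the `k` features with the highest scores, so a feature that tracks the target closely ranks far above a noisy one.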

##### Which all features you found important and why?

After applying SelectKBest, the following features were found to be most important:

Open – Strongly correlates with the closing price; it's a direct indicator of the day’s market trend.

High – Shows the highest price reached in a day; often useful in predicting end-of-day movements.

Low – Like High, it reflects price fluctuation and helps in understanding volatility.

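DailyChange – Engineered feature (Close - Open) that captures the net price move over the period. Because it is derived from Close itself, it leaks the target and should be used with care as a predictor.

Volatility – Another engineered feature ((High - Low) / Open), useful for understanding market risk.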


### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

Yes. The price features were standardized with StandardScaler (zero mean, unit variance) so that models sensitive to feature magnitude are not dominated by large absolute price levels.

In [None]:
# Transform Your data
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df_scaled = scaler.fit_transform(df[['Open', 'High', 'Low', 'Close']])


### 6. Data Scaling

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df_scaled = scaler.fit_transform(df[['Open', 'High', 'Low', 'Close']])


##### Which method have you used to scale your data and why?

I used the StandardScaler (Z-score Normalization) method to scale the data.

My dataset contains numerical features like Open, High, Low, and Close, which share the same unit but span a wide range of values over the years.

StandardScaler transforms the data to have zero mean and unit variance.

This scaling helps algorithms that are sensitive to feature magnitude (like Ridge Regression, SVM, and K-Means) to perform better and converge faster.

It ensures that no feature dominates the others simply due to scale.
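The z-score transformation itself is short enough to sketch by hand. The values below are hypothetical, and the population standard deviation (`ddof=0`, NumPy's default) is used, which is also what StandardScaler uses.

```python
import numpy as np

# Hypothetical price values
values = np.array([10.0, 20.0, 30.0, 40.0])

# z = (x - mean) / std, using the population standard deviation
z = (values - values.mean()) / values.std()

print(z)
print("mean:", z.mean(), "std:", z.std())
```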

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Yes, dimensionality reduction is useful in this project to:

Remove redundant or highly correlated features that do not contribute significantly to the prediction.

Improve model performance by reducing noise and overfitting.

Speed up computation time by lowering the number of features the model processes.

Make the data visualization and interpretation easier when projecting high-dimensional data to 2 or 3 dimensions.

Even though the dataset has only a few numerical features (Open, High, Low, Close), dimensionality reduction like PCA (Principal Component Analysis) was used to check if the essential information can be retained in fewer dimensions, which can benefit downstream models.

In [None]:
# Dimensionality Reduction (If needed)
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Select numerical features for PCA
features = ['Open', 'High', 'Low', 'Close']
X = df[features]

# Step 1: Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 2: Apply PCA
pca = PCA(n_components=2)  # Reduce to 2 components for visualization
X_pca = pca.fit_transform(X_scaled)

# Step 3: Create a new DataFrame with principal components
pca_df = pd.DataFrame(data=X_pca, columns=['PC1', 'PC2'])

# Optional: Add Date column for tracking
pca_df['Date'] = df['Date'].values

# Explained variance ratio
explained_variance = pca.explained_variance_ratio_
print("Explained variance by components:", explained_variance)

# View the transformed dataset
pca_df.head()


##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

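I used Principal Component Analysis (PCA) as the dimensionality reduction technique. Open, High, Low, and Close are highly correlated, so PCA is a natural way to check whether most of their information can be retained in fewer components.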
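As a rough illustration of why highly correlated price columns compress well, the sketch below computes explained variance ratios from the eigenvalues of the covariance matrix of standardized data. The four synthetic columns are an assumption standing in for Open, High, Low, and Close. They are not the real data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Four synthetic columns that all follow the same underlying trend plus small noise
base = np.linspace(10, 100, 60)
X = np.column_stack([base + rng.normal(scale=1.0, size=60) for _ in range(4)])

# Standardize each column, then take the covariance matrix
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
cov = np.cov(X_std, rowvar=False)

# Eigenvalues of the covariance matrix are the variances along the principal axes
eigenvalues = np.linalg.eigvalsh(cov)[::-1]  # sorted from largest to smallest
explained_ratio = eigenvalues / eigenvalues.sum()

print("Explained variance ratio:", explained_ratio)
```

When the first ratio is close to 1, a single component already captures almost all of the variation, which is what one would expect for tightly coupled price columns.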

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
from sklearn.model_selection import train_test_split

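# Re-select the predictors and target explicitly: X was reassigned in the PCA cell above
# and would otherwise include 'Close', leaking the target into the features
X = df[['Open', 'High', 'Low']]
y = df['Close']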
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


##### What data splitting ratio have you used and why?

I used an 80:20 split — 80% for training and 20% for testing. It's a common and reliable choice that balances learning and evaluation well.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

No, we don’t need to handle imbalance because the target is continuous: the dataset deals with stock prices rather than categorical classes. Imbalance handling is only necessary for classification problems with uneven class distribution.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Not applicable. This is a regression problem, so no balancing technique was needed.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 (Linear Regression) Implementation
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Selecting features and target
X = df[['Open', 'High', 'Low']]
y = df['Close']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the Algorithm
model_1 = LinearRegression()
model_1.fit(X_train, y_train)

# Predict on the model
y_pred_1 = model_1.predict(X_test)

# Evaluate the model
mse_1 = mean_squared_error(y_test, y_pred_1)
r2_1 = r2_score(y_test, y_pred_1)

print("Model 1 - Linear Regression Results:")
print(f"Mean Squared Error: {mse_1}")
print(f"R² Score: {r2_1}")



#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
import matplotlib.pyplot as plt

# Store metrics
metrics = {'Mean Squared Error': mse_1, 'R2 Score': r2_1}

# Create bar chart
plt.figure(figsize=(8, 5))
plt.bar(metrics.keys(), metrics.values(), color=['skyblue', 'orange'])
plt.title('Evaluation Metrics for Model 1 - Linear Regression')
plt.ylabel('Score')
plt.ylim(0, max(metrics.values()) * 1.2)

# Add values on top
for i, v in enumerate(metrics.values()):
    plt.text(i, v + 0.01, f"{v:.4f}", ha='center', fontweight='bold')

plt.tight_layout()
plt.show()


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score

# Define the model (Ridge Regression adds regularization)
ridge = Ridge()

# Define hyperparameters to tune
param_grid = {
    'alpha': [0.01, 0.1, 1, 10, 100]  # Ridge regularization strength
}

# GridSearchCV setup
grid_search = GridSearchCV(estimator=ridge, param_grid=param_grid, cv=5, scoring='r2')

# Fit the model on training data
grid_search.fit(X_train, y_train)

# Best model
best_ridge = grid_search.best_estimator_

# Predict on test data
y_pred_ridge = best_ridge.predict(X_test)

# Evaluation
mse_ridge = mean_squared_error(y_test, y_pred_ridge)
r2_ridge = r2_score(y_test, y_pred_ridge)

print("Best Hyperparameters:", grid_search.best_params_)
print("Mean Squared Error:", mse_ridge)
print("R² Score:", r2_ridge)
print("Model Accuracy (based on R²):", r2_ridge* 100, "%")


##### Which hyperparameter optimization technique have you used and why?

I used GridSearchCV for hyperparameter optimization. It exhaustively searches through a specified grid of hyperparameter values, it is simple and reliable when the parameter space is not too large, and it performs cross-validation automatically, which helps reduce overfitting to a single train/validation split.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Only marginally. After tuning, the R² score moved from 0.983544389505753 to 0.9835443919282351 and the Mean Squared Error from 137.2334 to 137.2333.
A change this small is negligible in practice: the tuned Ridge model and the plain Linear Regression perform essentially the same on the test set.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

Model Used: Random Forest Regressor.
Random Forest is a powerful ensemble method that builds multiple decision trees and combines their outputs to improve accuracy and reduce overfitting.

In [None]:
# Visualizing evaluation Metric Score chart
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
import numpy as np

# Initialize the model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)

# Fit the model
rf_model.fit(X_train, y_train)

# Predict
y_pred_rf = rf_model.predict(X_test)

# Evaluation
mse_rf = mean_squared_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)

print("Random Forest Regressor Results:")
print(f"Mean Squared Error: {mse_rf}")
print(f"R² Score: {r2_rf}")
print("Model Accuracy (based on R²):",r2_rf * 100, "%")

# Plotting
plt.figure(figsize=(6,4))
plt.bar(['MSE', 'R² Score'], [mse_rf, r2_rf], color=['skyblue', 'lightgreen'])
plt.title('Evaluation Metrics - Random Forest Regressor')
plt.ylabel('Score')
plt.show()



#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Define the model
rf = RandomForestRegressor(random_state=42)

# Define the parameter grid
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5]
}

# GridSearch with Cross-Validation
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
grid_search.fit(X_train, y_train)

# Best Model
best_rf = grid_search.best_estimator_

# Predict on Test Data
y_pred_rf_tuned = best_rf.predict(X_test)

# Evaluation
mse_rf_tuned = mean_squared_error(y_test, y_pred_rf_tuned)
r2_rf_tuned = r2_score(y_test, y_pred_rf_tuned)

print("Best Hyperparameters:", grid_search.best_params_)
print("After Tuning - Mean Squared Error:", mse_rf_tuned)
print("After Tuning - R² Score:", r2_rf_tuned)
print("Model Accuracy (based on R²):",r2_rf_tuned * 100, "%")


##### Which hyperparameter optimization technique have you used and why?

I applied GridSearchCV to find the best combination of parameters for the Random Forest model.
It helped fine-tune the model to reduce error and improve accuracy through cross-validation.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Yes, there is an improvement after tuning!
The Mean Squared Error (MSE) dropped from 171.88 to 159.00, and the R² Score improved from 0.9794 to 0.9809.
This means the tuned model predicts better and explains more variance in the data.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Mean Squared Error (MSE)
 Indicates the average squared difference between actual and predicted values.
 Lower MSE means our predictions are closer to real stock prices.
 Business Impact: Reduces forecasting errors, crucial in making informed decisions like when to buy/sell stocks.

R2 Score
 Measures how well the model explains the variability in the data.
 Closer to 1 means better performance.
 Business Impact: A high R2 (like 0.9809) shows the model reliably captures market behavior — boosting confidence in automation, planning, and risk assessment.
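Both metrics are simple enough to compute by hand, which makes their meaning easier to see. The sketch below uses hypothetical actual and predicted prices chosen for easy arithmetic.

```python
import numpy as np

# Hypothetical actual and predicted closing prices
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])

# MSE: average of the squared prediction errors
mse = np.mean((y_true - y_pred) ** 2)

# R2: 1 minus the ratio of residual variation to total variation around the mean
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("MSE:", mse)
print("R2:", r2)
```

MSE is expressed in squared price units while R2 is unitless, so the two are not directly comparable on one axis of a bar chart.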

### ML Model - 3

In [None]:
# ML Model - 3: XGBoost Regressor

from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Create the model
xgb_model = XGBRegressor(objective='reg:squarederror', random_state=42)

# Fit the model
xgb_model.fit(X_train, y_train)

# Predict on the test set
y_pred_xgb = xgb_model.predict(X_test)

# Evaluate the model
mse_xgb = mean_squared_error(y_test, y_pred_xgb)
r2_xgb = r2_score(y_test, y_pred_xgb)

print("XGBoost Regressor Results:")
print("Mean Squared Error:", mse_xgb)
print("R² Score:", r2_xgb)
print("Model Accuracy (based on R²):",r2_xgb * 100, "%")



#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
import matplotlib.pyplot as plt

# Scores from XGBoost model
mse_xgb = mean_squared_error(y_test, y_pred_xgb)
r2_xgb = r2_score(y_test, y_pred_xgb)

# Bar chart to visualize performance
metrics = ['Mean Squared Error', 'R² Score']
scores = [mse_xgb, r2_xgb]

plt.figure(figsize=(6,4))
bars = plt.bar(metrics, scores, color=['skyblue', 'lightgreen'])
plt.title("XGBoost Model Evaluation Metrics")
plt.ylabel("Score")

# Add text labels on top of bars
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2, height, f'{height:.4f}', ha='center', va='bottom')

plt.tight_layout()
plt.show()


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

# Define the parameter grid
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1],
    'subsample': [0.8, 1]
}

# Create the base model
xgb = XGBRegressor(objective='reg:squarederror', random_state=42)

# Grid Search with 5-fold cross-validation
grid_search = GridSearchCV(estimator=xgb, param_grid=param_grid,
                           cv=5, scoring='r2', n_jobs=-1, verbose=1)

# Fit the model
grid_search.fit(X_train, y_train)

# Predict using best model
best_xgb = grid_search.best_estimator_
y_pred_best_xgb = best_xgb.predict(X_test)

# Evaluation
from sklearn.metrics import mean_squared_error, r2_score

mse_best_xgb = mean_squared_error(y_test, y_pred_best_xgb)
r2_best_xgb = r2_score(y_test, y_pred_best_xgb)

print("Best Hyperparameters:", grid_search.best_params_)
print("After Tuning - Mean Squared Error:", mse_best_xgb)
print("After Tuning - R² Score:", r2_best_xgb)
print("Model Accuracy (based on R²):", r2_best_xgb * 100, "%")


##### Which hyperparameter optimization technique have you used and why?

I used GridSearchCV because it systematically checks all combinations of the given hyperparameters. It’s easy to implement, reliable, and works well when the dataset isn’t extremely large, helping to find the set of parameters that best improves model accuracy and generalization.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

No, there was no improvement after hyperparameter tuning.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

I used Mean Squared Error (MSE) and R2 Score as key metrics.

MSE shows how much the predictions deviate from the actual stock prices. Lower is better, meaning more accurate predictions.

R2 Score tells how well the model explains the variance in the data. Closer to 1 means better predictive power.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

I chose the Linear Regression model (with Ridge regularization after hyperparameter tuning) as the final model.

It gave the lowest MSE and highest R2 Score among all models.

It also performed consistently, with very little change after tuning, indicating robustness and simplicity.

For business, a simpler, stable, and interpretable model like Linear Regression is often preferable over complex ones with marginal improvement.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

I used Linear Regression, a straightforward and interpretable algorithm.

To explain feature importance, I used the model coefficients.

Features with higher absolute coefficient values have more impact on the prediction. For example, if the Open and High prices have high positive coefficients, they strongly drive the predicted Close price.
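The ranking described above can be sketched as follows. The coefficient values here are hypothetical placeholders, not the fitted values from this notebook. With scikit-learn they would typically come from the fitted model's `coef_` attribute paired with the feature names (assuming the training features are held in a DataFrame).

```python
# Hypothetical coefficients for illustration only (not this notebook's fitted values).
# In the notebook, something like dict(zip(X_train.columns, linear_model.coef_))
# would supply these, assuming X_train is a pandas DataFrame.
coefficients = {"Open": -0.45, "High": 0.80, "Low": 0.62}

# Rank features by the absolute size of their coefficient
ranked = sorted(coefficients.items(), key=lambda item: abs(item[1]), reverse=True)

for name, coef in ranked:
    direction = "raises" if coef > 0 else "lowers"
    print(f"{name}: {coef:+.2f} ({direction} the predicted Close)")
```

Coefficient magnitudes are only comparable when the features are on a similar scale, so this reading assumes the scaled features are the ones passed to the model.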

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File
import pickle
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# (Optional) Split data if not already split
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the Linear Regression model
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

# Save the model to a pickle file
with open('best_model.pkl', 'wb') as file:
    pickle.dump(linear_model, file)

print("Model successfully saved as 'best_model.pkl'")

# To load it later:
# with open('best_model.pkl', 'rb') as file:
#     loaded_model = pickle.load(file)



### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File
import pickle

# Load the saved model from the pickle file
with open('best_model.pkl', 'rb') as file:
    loaded_model = pickle.load(file)

# Sanity Check: Predict on unseen test data
# Make sure X_test is already defined in your environment
y_pred = loaded_model.predict(X_test)

# Show a few predictions
print("Sanity check - Sample predictions:")
print(y_pred[:5])  # show first 5 predictions


In [None]:
# Compare predictions with actual target values
print("Predicted:", y_pred[:5])
print("Actual:   ", y_test[:5].values)
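A further sanity check is to confirm that the reloaded object predicts exactly what the in-memory object predicts. The sketch below shows the idea with a plain dictionary of parameters as a hypothetical stand-in for the fitted model, so that it runs on its own. The parameter values, the sample rows, and the `predict_with` helper are all illustrative assumptions.

```python
import pickle

# Hypothetical stand-in for a fitted linear model (illustrative values only)
model_params = {"coef": [0.5, 0.25], "intercept": 1.0}


def predict_with(params, rows):
    """Linear prediction: dot(coef, row) + intercept for each row."""
    return [
        sum(c * x for c, x in zip(params["coef"], row)) + params["intercept"]
        for row in rows
    ]


# Serialize and restore in memory, mirroring the save/load steps above
restored_params = pickle.loads(pickle.dumps(model_params))

sample_rows = [[10.0, 20.0], [30.0, 40.0]]  # made-up inputs
before = predict_with(model_params, sample_rows)
after = predict_with(restored_params, sample_rows)

print("Predictions match after round trip:", before == after)
```

In the notebook, the analogous check would compare `linear_model.predict(X_test)` with `loaded_model.predict(X_test)`, for example with `numpy.allclose`.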

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

In this project, we successfully built and evaluated multiple machine learning models to predict stock prices using historical market data. We began with thorough data preprocessing, including handling missing values, outliers, and feature scaling. Feature engineering and selection helped enhance model performance while preventing overfitting.

Three machine learning models were implemented: Linear Regression, Random Forest Regressor, and XGBoost Regressor. After performing hyperparameter tuning and cross-validation, the Linear Regression model showed the best performance with an R² score of 0.9835, making it the final chosen model.

We saved this model using pickle and validated it by reloading it and predicting on the held-out test data. The model produced predictions closely aligned with actual values, indicating strong reliability and low error.
Overall, this solution demonstrates a robust, scalable approach for stock price forecasting, with potential for positive business impact in financial decision-making, investment planning, and market analysis.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***