# **Project Name**    - Yes Bank Stock Closing Price Prediction



##### **Project Type**    - EDA/Machine Learning
##### **Contribution**    - Individual
##### **Team Member 1**   - Chandra Sekhar


# **Project Summary -**

This project aims to predict the monthly closing stock prices of Yes Bank using several Machine Learning models. The process involved data collection, exploratory data analysis (EDA), feature extraction, model training, evaluation, and deployment.

## Data Collection and Preprocessing

The first step was collecting historical stock price data for Yes Bank. The dataset contains the opening, closing, highest, and lowest prices for each month. Derived features (price change, the monthly high-low range stored in the `Daily Range` column, year, and month) were added to enhance the predictive power of the models.

The data preprocessing steps involved handling missing values, scaling numerical features, and splitting the data into training and testing sets. We used the `StandardScaler` from scikit-learn to standardize the feature values, ensuring that all features contributed equally to the model training process.
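The split and scaling steps described above can be sketched as follows. This is a minimal illustration on synthetic data, not the notebook's actual code: the 80/20 chronological split and the made-up price values are assumptions. It shows the scaler being fitted on the training rows only, so no information from the test period leaks into training.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the monthly price table (values are made up for illustration)
rng = np.random.default_rng(0)
n = 60
demo = pd.DataFrame({
    "Open": rng.uniform(10, 300, n),
    "High": rng.uniform(10, 300, n),
    "Low": rng.uniform(10, 300, n),
})
demo["Close"] = demo[["Open", "High", "Low"]].mean(axis=1) + rng.normal(0, 1, n)

X = demo[["Open", "High", "Low"]]
y = demo["Close"]

# Chronological 80/20 split (an assumption): no shuffling, test rows follow training rows
split = int(len(demo) * 0.8)
X_train, X_test = X.iloc[:split], X.iloc[split:]
y_train, y_test = y.iloc[:split], y.iloc[split:]

# Fit the scaler on the training rows only, then reuse it for the test rows
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Keeping the rows in time order is one reasonable choice for price data; a shuffled split would let the model see months that come after the ones it is tested on.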

## Exploratory Data Analysis (EDA)

EDA was performed to understand the underlying patterns and relationships in the data. We visualized the stock price trends, distributions, and correlations between different features. This step provided valuable insights that guided the feature engineering process and informed the choice of models for prediction.

## Model Training and Evaluation

We implemented and evaluated several Machine Learning models, including ARIMA, Random Forest Regressor, XGBoost Regressor, and Support Vector Regressor (SVR). Each model was trained on the training set and evaluated on the test set using metrics such as Mean Squared Error (MSE), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), R-squared (R2), and Adjusted R-squared.
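The five metrics listed above can be computed with a small helper like the one below. The function name `regression_report` and the toy numbers are hypothetical, chosen only to make the formulas concrete. Adjusted R-squared is computed as 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the number of samples and p the number of features.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_report(y_true, y_pred, n_features):
    """Return MSE, MAE, MAPE (in percent), R2 and adjusted R2 as a dict."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = len(y_true)
    r2 = r2_score(y_true, y_pred)
    return {
        "MSE": mean_squared_error(y_true, y_pred),
        "MAE": mean_absolute_error(y_true, y_pred),
        # MAPE as a percentage; assumes no true value is zero
        "MAPE": float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100),
        "R2": r2,
        # Adjusted R2 penalizes R2 for the number of predictors
        "Adjusted R2": 1 - (1 - r2) * (n - 1) / (n - n_features - 1),
    }

# Toy values for illustration only
report = regression_report([10, 20, 30, 40], [12, 18, 33, 39], n_features=1)
print(report)
```

Calling the same helper for every model keeps the comparison between models consistent.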

### Random Forest Regressor

The Random Forest Regressor, an ensemble learning method that constructs multiple decision trees, showed better performance with a good R2 score and low error rates. Hyperparameter tuning using RandomizedSearchCV further improved the model's performance slightly.
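A randomized search of this kind might look like the sketch below. The search space, the synthetic data, and the use of `TimeSeriesSplit` for cross-validation are all assumptions made for illustration; the notebook's actual grid and folds may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

# Synthetic features and target (illustrative only)
rng = np.random.default_rng(42)
X_demo = rng.uniform(10, 300, size=(80, 3))
y_demo = X_demo.mean(axis=1) + rng.normal(0, 2, 80)

# Hypothetical search space
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 3, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,                        # sample 5 combinations instead of trying all 36
    cv=TimeSeriesSplit(n_splits=3),  # each fold validates on rows after its training rows
    scoring="neg_mean_squared_error",
    random_state=42,
)
search.fit(X_demo, y_demo)

best_rf = search.best_estimator_
print(search.best_params_)
```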

### XGBoost Regressor

The XGBoost Regressor, known for its efficiency and scalability, also performed well with good evaluation metrics. Hyperparameter tuning using GridSearchCV marginally improved the model's performance.

### Support Vector Regressor (SVR)

The SVR model demonstrated the best performance among all the models, with the lowest error rates and the highest R2 score after hyperparameter tuning. This model was chosen as the final prediction model due to its superior performance.
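Because SVR is sensitive to feature scale, one common pattern is to tune it inside a pipeline that also contains the scaler. The sketch below assumes a grid search over kernel, `C`, and `epsilon` on synthetic data; the search method and the parameter values are illustrative guesses rather than the settings used in this project.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic features and target (illustrative only)
rng = np.random.default_rng(7)
X_demo = rng.uniform(10, 300, size=(80, 3))
y_demo = X_demo.mean(axis=1) + rng.normal(0, 2, 80)

# Scaling happens inside the pipeline, so each CV fold fits its own scaler
svr_pipeline = make_pipeline(StandardScaler(), SVR())

# Hypothetical grid; make_pipeline names the SVR step "svr"
param_grid = {
    "svr__kernel": ["linear", "rbf"],
    "svr__C": [1, 10, 100],
    "svr__epsilon": [0.1, 1.0],
}

svr_search = GridSearchCV(
    svr_pipeline,
    param_grid,
    cv=TimeSeriesSplit(n_splits=3),
    scoring="neg_mean_squared_error",
)
svr_search.fit(X_demo, y_demo)
print(svr_search.best_params_)
```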

## Feature Importance

To understand the importance of different features in the final SVR model, we used SHAP (SHapley Additive exPlanations). SHAP values provided insights into the contribution of each feature to the model's predictions, highlighting the most influential features.

## Model Saving and Deployment

The best-performing SVR model was saved using joblib for deployment purposes. We then loaded the saved model and performed a sanity check by predicting on a new set of unseen data. The model maintained its performance on unseen data, confirming its robustness and readiness for deployment.
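The save, reload, and sanity-check cycle can be sketched as below. The file name `svr_model.joblib`, the synthetic data, and the fixed SVR settings are hypothetical; the point is only that the reloaded model should return the same predictions as the original.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic training data (illustrative only)
rng = np.random.default_rng(1)
X_demo = rng.uniform(10, 300, size=(60, 3))
y_demo = X_demo.mean(axis=1)

# Saving the whole pipeline keeps the fitted scaler together with the model
model = make_pipeline(StandardScaler(), SVR(kernel="linear", C=10)).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "svr_model.joblib")  # hypothetical file name
    joblib.dump(model, model_path)
    reloaded = joblib.load(model_path)

# Sanity check on rows the model has not seen
X_new = rng.uniform(10, 300, size=(5, 3))
same_predictions = np.allclose(model.predict(X_new), reloaded.predict(X_new))
print(same_predictions)
```

Persisting the scaler and the model as one object avoids the common deployment mistake of feeding unscaled inputs to a model trained on scaled features.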

## Conclusion and Future Work

The SVR model's exceptional performance makes it an excellent choice for predicting Yes Bank's stock prices. This model can be deployed on a live server for real-time predictions, aiding investors and financial analysts in making informed decisions.

In conclusion, this project demonstrated the power of Machine Learning in stock price prediction and provided a solid foundation for future enhancements and real-world applications. Future work can focus on incorporating additional features, using advanced techniques like deep learning, and optimizing the deployment process for scalability and efficiency.

# **GitHub Link -**

Link: https://github.com/Chandra731/yes_bank_stock_close_price_prediction

# **Problem Statement**


### Business Context

Yes Bank is a prominent financial institution in India, widely recognized in the banking sector. Since 2018, the bank has been under significant scrutiny due to a high-profile fraud case involving its former CEO, Rana Kapoor. This event has had a notable impact on the bank's stock prices, making it a compelling subject for financial analysis.

### Objective

The primary goal of this project is to predict the monthly closing stock prices of Yes Bank. The dataset encompasses monthly stock prices from the bank's inception, and includes closing, opening, highest, and lowest prices for each month. By leveraging time series models and other predictive techniques, we aim to accurately forecast the closing prices and understand the stock's behavior in response to significant events.

### Key Points

- **Bank**: Yes Bank, India
- **Event**: Fraud case involving Rana Kapoor (2018 onwards)
- **Data**: Monthly stock prices including closing, opening, highest, and lowest prices
- **Objective**: Predict the stock's monthly closing price
- **Techniques**: Time series models and other predictive models

### Goals

1. **Analyze Historical Data**: Understand the historical trends and patterns in Yes Bank's stock prices.
2. **Impact Analysis**: Assess the impact of major events, such as the fraud case, on the stock prices.
3. **Model Development**: Develop and compare different predictive models to forecast the monthly closing stock prices.
4. **Model Evaluation**: Evaluate the performance of the models using appropriate metrics to ensure accuracy and reliability.
5. **Deployment**: Deploy the best-performing model for real-time prediction and decision-making support.

### Significance

* Accurate prediction of stock prices is crucial for investors, financial analysts, and stakeholders. By understanding the stock's behavior and predicting future prices, better investment decisions can be made, and potential risks can be mitigated.

* This project will showcase the application of machine learning and advanced predictive techniques in the financial domain, providing valuable insights and practical solutions for stock price prediction.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# Load Dataset
file_path = "/content/data_YesBank_StockPrices.csv"
df = pd.read_csv(file_path)

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(f"Number of rows: {df.shape[0]}")
print(f"Number of columns: {df.shape[1]}")

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_count = df.duplicated().sum()
print(f"Number of duplicate values: {duplicate_count}")

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
missing_values_count = df.isnull().sum()
missing_values_count

In [None]:
# Visualizing the missing values
plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title("Missing Values Heatmap")
plt.show()

### What did you know about your dataset?

**Dataset Overview:**

It contains 185 rows and 5 columns (Date, Open, High, Low, Close).
Date is stored as an object (string), while the other four columns are numerical (float64).

**Data Quality:**

* No missing values – all records are complete.
* No duplicate values – ensuring data integrity.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

* Date - the month and year of the record (e.g. `Jul-05`); each row covers one month.
* Open - the opening price of the stock for that month.
* High - the highest price of the stock during that month.
* Low - the lowest price of the stock during that month.
* Close - the closing price of the stock for that month.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
unique_counts = df.nunique()
print(unique_counts)

## ***3. Data Wrangling***

### Data Wrangling Code

In [None]:
# Convert 'Date' to datetime format and sort the dataset by date
df['Date'] = pd.to_datetime(df['Date'], format='%b-%y')
df = df.sort_values('Date')

In [None]:
# Extract Year and Month from 'Date'
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month

In [None]:
# Calculate Price Differences
df['Price Change'] = df['Close'] - df['Open']
df['Daily Range'] = df['High'] - df['Low']

In [None]:
print(f"Number of rows: {df.shape[0]}")
print(f"Number of columns: {df.shape[1]}")

### What all manipulations have you done and insights you found?

The following manipulations were done:
*  Converted 'Date' to datetime format for better analysis.
*  Extracted 'Year' and 'Month' from the 'Date' column.
*  Calculated 'Price Change' as the difference between 'Close' and 'Open' prices.
*  Calculated 'Daily Range' as the difference between 'High' and 'Low' prices. Because each row is one month, this is a monthly range despite the column name.

Insights:
*  The data is now ready for further analysis and modeling.
*  The 'Price Change' and 'Daily Range' columns provide additional insights into the stock's performance each month.
*  The dataset is free of missing values and duplicates.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 - line chart of monthly closing prices
plt.figure(figsize=(12, 6))
# One line with point markers (the series was previously drawn twice)
sns.lineplot(data=df, x='Date', y='Close', marker='o')
plt.title('Monthly Closing Prices Over Time')
plt.xlabel('Date')
plt.ylabel('Closing Price')
plt.grid(True)
plt.show()

##### 1. Why did you pick the specific chart?

* The line chart effectively visualizes YES Bank's stock trends, highlighting its growth, peak, and sharp decline over time. It clearly shows key financial shifts and investor sentiment changes.

##### 2. What is/are the insight(s) found from the chart?

- **Steady Growth (Pre-2018)**: Strong market confidence and expansion.
- **Sharp Decline (2018-2020)**: Governance issues, loan defaults, and regulatory actions caused a 94% value drop.
- **Post-Crisis Stabilization**: SBI-led rescue efforts helped recover some stability.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- **Positive Impact**: Highlights the need for risk management, governance, and regulatory compliance for long-term stability.
- **Negative Impact**: Shows how risky lending and weak oversight can lead to market crashes and investor losses.

#### Chart - 2

In [None]:
# Chart - 2 Boxplots
plt.figure(figsize=(12, 6))
sns.boxplot(data=df[['Open', 'High', 'Low', 'Close']])
plt.title("Boxplot of Stock Prices")
plt.show()

##### 1. Why did you pick the specific chart?

* A boxplot is useful for analyzing stock price distribution, volatility, and outliers in Open, High, Low, and Close prices. It visually represents median values, interquartile ranges, and extreme variations.

##### 2. What is/are the insight(s) found from the chart?

- Stock prices show high variability, especially in High and Close values.
- Outliers indicate extreme price movements, possibly due to major market events.
- The median price remains stable, but fluctuations suggest high-risk periods.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- Helps investors understand stock volatility, assisting in risk management.
- Identifies critical price fluctuations, useful for predicting market trends.
- Aids decision-making in investment strategies and regulatory assessments.

#### Chart - 3

In [None]:
# Chart - 3 - line chart
plt.figure(figsize=(12, 6))
sns.lineplot(data=df, x='Date', y='Price Change')
plt.title('Price Change Over Time')
plt.xlabel('Date')
plt.ylabel('Price Change')
plt.show()

##### 1. Why did you pick the specific chart?

* A line chart effectively captures the price change trend over time, showing fluctuations and volatility in stock movements. It helps in identifying major market shifts, spikes, and crashes.


##### 2. What is/are the insight(s) found from the chart?

- Stable price changes before 2014, followed by increasing fluctuations.
- Significant volatility from 2016-2020, with extreme dips (e.g., 2018 crash).
- Sharp negative price movements, indicating potential financial crises or regulatory interventions.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- Helps investors and businesses anticipate market risks and opportunities.
- Risk management strategies can be improved by recognizing volatility patterns.
- Regulatory bodies can analyze periods of extreme fluctuations for policy interventions.

#### Chart - 4

In [None]:
# Chart - 4 - bar chart
plt.figure(figsize=(12, 6))
sns.barplot(data=df, x='Year', y='Close', errorbar=None)
plt.title('Average Closing Price by Year')
plt.xlabel('Year')
plt.ylabel('Average Closing Price')
plt.show()

##### 1. Why did you pick the specific chart?

* A bar chart is ideal for visualizing the average closing price by year, as it clearly shows how the prices have changed over time. It highlights major growth trends and declines effectively.

##### 2. What is/are the insight(s) found from the chart?

- Steady growth from 2005 to 2015, followed by a sharp increase from 2016 to 2018.
- Peak in 2017 with the highest average closing price (~320).
- Significant drop in 2019 and a drastic fall in 2020, indicating a possible market crash or economic downturn.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- Businesses and investors can analyze trends and cycles to make better investment decisions.
- The sharp decline in 2020 could indicate an economic crisis (e.g., COVID-19 impact), helping in risk assessment.
- Companies can strategize future investments based on past market behavior.

#### Chart - 5

In [None]:
# Chart - 5 scatter plot with regression line
plt.figure(figsize=(12, 6))
# regplot draws both the scatter points and the fitted line, so a separate scatterplot is not needed
sns.regplot(data=df, x='Open', y='Close', scatter_kws={'alpha': 0.5})
plt.title('Scatter Plot of Opening vs. Closing Prices')
plt.xlabel('Opening Price')
plt.ylabel('Closing Price')
plt.show()

##### 1. Why did you pick the specific chart?

* A scatter plot with a regression line is ideal for showing the relationship between opening and closing prices. It helps determine whether there is a strong correlation between the two.

##### 2. What is/are the insight(s) found from the chart?

- There is a strong positive correlation between opening and closing prices.
- Most data points are closely aligned with the regression line, indicating consistency in price movements.
- A few outliers suggest some instances where closing prices deviated significantly from opening prices.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- Investors can predict closing prices based on opening prices, aiding in better decision-making.
- The outliers highlight unusual market movements, which could indicate opportunities or risks.
- Traders can use this insight to develop data-driven trading strategies.

#### Chart - 6

In [None]:
# Chart - 6 - Histogram with KDE
plt.figure(figsize=(12, 6))
sns.histplot(df['Daily Range'], bins=30, kde=True)
plt.title('Distribution of Daily Range')
plt.xlabel('Daily Range')
plt.show()

##### 1. Why did you pick the specific chart?

A histogram with KDE (Kernel Density Estimate) is used to analyze the distribution of the `Daily Range` column (the difference between each month's high and low prices). It helps in understanding the volatility and frequency of different ranges.

##### 2. What is/are the insight(s) found from the chart?

- The distribution is right-skewed, meaning most monthly ranges are small, but there are some large fluctuations.
- The majority of ranges fall between 0 and 25, indicating low to moderate volatility.
- There are a few extreme values, suggesting occasional high-volatility months.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- Traders can identify typical market fluctuations, helping in risk assessment.
- Investors can adjust their strategies based on market stability.
- Detecting outlier movements can signal potential trading opportunities.

#### Chart - 7

In [None]:
# Chart - 7 - line chart
plt.figure(figsize=(12, 6))
sns.lineplot(data=df, x='Date', y='Daily Range')
plt.title('Daily Range Over Time')
plt.xlabel('Date')
plt.ylabel('Daily Range')
plt.show()

##### 1. Why did you pick the specific chart?

* A time series line chart is used to analyze how the monthly high-low range (the `Daily Range` column) changes over time. It helps to identify trends, volatility shifts, and significant market events.

##### 2. What is/are the insight(s) found from the chart?

- Gradual increase in volatility: The monthly range remained low and stable from 2005 to 2014 but started rising significantly after 2015.
- Extreme fluctuations in 2018-2019: This period saw major spikes, indicating high volatility, possibly due to economic or market events.
- Recent decline after 2020: Volatility dropped sharply post-2020, returning closer to pre-2015 levels.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- Traders can identify periods of high and low volatility for better risk management.
- Investors can adjust their portfolio strategies based on historical market behavior.
- Businesses can anticipate market conditions and prepare for potential disruptions.

#### Chart - 8

In [None]:
# Chart - 8 - Line charts
plt.figure(figsize=(12, 6))
# Label each series directly so the legend entries cannot be mismatched
sns.lineplot(data=df, x='Date', y='High', marker='o', label='High')
sns.lineplot(data=df, x='Date', y='Low', marker='x', label='Low')
plt.title('High and Low Prices Over Time')
plt.xlabel('Date')
plt.ylabel('Price')
plt.legend()
plt.show()

##### 1. Why did you pick the specific chart?

A dual-line chart is used to track the high and low prices over time. This helps in understanding price trends, market cycles, and volatility.

##### 2. What is/are the insight(s) found from the chart?

- Steady growth until 2014: Prices were gradually increasing before 2014, indicating market stability.
- Sharp surge between 2015-2018: A major price rise occurred, reaching its peak in 2018-2019, possibly due to increased demand or speculation.
- Drastic decline after 2019: The market experienced a sharp crash, bringing prices back to early 2006 levels.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- Helps in identifying bull and bear market cycles for investment timing.
- Traders can use high-low price spreads for volatility-based strategies.
- Businesses can anticipate market downturns and adjust financial planning accordingly.

#### Chart - 9

In [None]:
# Chart - 9 - line chart
# Rolling mean over 50 rows; each row is one month, so this is a 50-month moving average
df['MA50'] = df['Close'].rolling(window=50).mean()
# Aggregate data for yearly average closing price
yearly_avg = df.groupby('Year')['Close'].mean().reset_index()
# Place each yearly average at 1 January of its year so it shares the datetime x-axis
yearly_avg['Date'] = pd.to_datetime(yearly_avg['Year'].astype(str), format='%Y')
# Plot
plt.figure(figsize=(12, 6))
# Closing prices over time
sns.lineplot(data=df, x='Date', y='Close', label='Monthly Closing Price')
# Moving Average trend
sns.lineplot(data=df, x='Date', y='MA50', linestyle='dashed', label='50-month MA')
# Yearly average closing price
sns.lineplot(data=yearly_avg, x='Date', y='Close', marker='o', label='Yearly Avg Closing Price')
# Titles and labels
plt.title('Closing Prices, Moving Average & Yearly Trend')
plt.xlabel('Date')
plt.ylabel('Closing Price')
plt.legend()
plt.show()


##### 1. Why did you pick the specific chart?

* A line chart is chosen because it effectively shows the trend of stock closing prices over time, allowing us to identify patterns, fluctuations, and overall movement.

##### 2. What is/are the insight(s) found from the chart?

- The closing price exhibits fluctuations over time with potential upward and downward trends.
- There might be seasonal patterns or volatility in certain time periods.
- If the trend is mostly upward, it indicates a positive growth phase, while a downward trend suggests a potential decline.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- **Positive Impact**: If a consistent upward trend is observed, it may attract investors and encourage more trading activities.
- **Negative Growth**: If there are frequent sharp declines, it could indicate volatility or external market risks. Businesses should analyze potential causes (economic factors, earnings reports, etc.) and strategize accordingly.

#### Chart - 10

In [None]:
# Chart - 10 - bar chart
plt.figure(figsize=(12, 6))
sns.barplot(data=df.groupby('Month')['Close'].mean().reset_index(), x='Month', y='Close', errorbar=None)
plt.title('Average Closing Price by Month')
plt.xlabel('Month')
plt.ylabel('Average Closing Price')
plt.show()

##### 1. Why did you pick the specific chart?

* A bar chart is chosen to compare the average closing price across different months clearly. It helps in identifying seasonal variations in stock performance.

##### 2. What is/are the insight(s) found from the chart?

- Some months show higher average closing prices, while others have lower values.
- This indicates potential seasonal trends or market cycles affecting stock prices.
- There might be specific months where stock performance is stronger or weaker, suggesting external factors such as earnings season, market events, or investor behavior.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- **Positive Impact**: If certain months consistently show higher closing prices, businesses can strategize investments or plan product launches accordingly.
- **Negative Growth**: If some months have significantly lower closing prices, it may indicate a recurring pattern of underperformance due to external factors like economic downturns or low trading activity. Companies should investigate the causes and adjust business strategies accordingly.

#### Chart - 11

In [None]:
# Chart - 11 - scatter plot
plt.figure(figsize=(12, 6))
sns.scatterplot(data=df, x='Daily Range', y='Price Change')
plt.title('Scatter Plot of Daily Range vs. Price Change')
plt.xlabel('Daily Range')
plt.ylabel('Price Change')
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot is chosen because it helps visualize the relationship between the monthly high-low range (`Daily Range`) and the open-to-close price change, showing how wide price swings relate to the net movement of the stock.

##### 2. What is/are the insight(s) found from the chart?

- Most data points cluster near low ranges, indicating that most price changes occur within a small fluctuation range.
- Some points deviate significantly, suggesting extreme market movements in certain months.
- There might be a relationship between a larger range and larger price changes, indicating volatility in the stock.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- **Positive Impact**: Understanding price fluctuations helps traders and businesses optimize trading strategies by preparing for volatile periods.
- **Negative Growth**: Large deviations in price change could indicate unpredictability or instability, which may lead to investor uncertainty and lower stock confidence. Businesses should consider implementing risk management strategies to mitigate such risks.

#### Chart - 12

In [None]:
# Chart - 12 - Line charts
plt.figure(figsize=(12, 6))
# Label each series directly so the legend entries cannot be mismatched
sns.lineplot(data=df, x='Date', y='Open', marker='o', label='Open')
sns.lineplot(data=df, x='Date', y='Close', marker='x', label='Close')
plt.title('Opening and Closing Prices Over Time')
plt.xlabel('Date')
plt.ylabel('Price')
plt.legend()
plt.show()

##### 1. Why did you pick the specific chart?

A line chart is chosen because it effectively shows the trend of both opening and closing prices over time. This helps in visualizing stock performance, fluctuations, and long-term growth patterns.

##### 2. What is/are the insight(s) found from the chart?

- The stock price showed a steady increase from 2005 to around 2018, reaching a peak above 350.
- After 2018, there was a sharp decline, indicating a potential market crash, external economic impact, or company-specific downfall.
- The opening and closing prices are closely aligned, suggesting small open-to-close differences within most months.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- **Positive Impact**: The strong upward trend until 2018 suggests a period of high growth, which could have attracted investors and strengthened business confidence.
- **Negative Growth**: The steep drop after 2018 signals a significant loss in stock value, which could be due to economic downturns, poor financial performance, or industry-wide issues. Businesses must analyze the cause and strategize accordingly (e.g., cost-cutting, innovation, investor reassurance).

#### Chart - 13

In [None]:
# Chart - 13 - histogram with KDE
plt.figure(figsize=(12, 6))
sns.histplot(df['Close'], bins=30, kde=True)
plt.title('Distribution of Closing Prices')
plt.xlabel('Closing Price')
plt.show()

##### 1. Why did you pick the specific chart?

A histogram with a KDE (Kernel Density Estimation) curve is chosen because it effectively displays the distribution of closing prices. It helps in understanding how frequently certain price ranges occurred and whether the distribution is skewed.

##### 2. What is/are the insight(s) found from the chart?

- The majority of closing prices are concentrated between 0 and 100, indicating that most of the stock's history had relatively lower values.
- There are fewer occurrences of prices above 200, suggesting that high prices were less frequent and possibly short-lived.
- The distribution is right-skewed, meaning that while most prices were low, there were some periods where prices shot up significantly.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- **Positive Impact**: If the business can identify the conditions that led to higher closing prices, it can strategize to replicate them and sustain high valuations.
- **Negative Growth**: Since higher prices appear less frequently, it indicates volatility or unsustainable growth phases. If the price drops back frequently, investors may lose confidence, and businesses should focus on stabilizing long-term growth.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Chart - 14 - Correlation Heatmap
plt.figure(figsize=(12, 6))
corr = df[['Open', 'High', 'Low', 'Close', 'Price Change', 'Daily Range']].corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()

##### 1. Why did you pick the specific chart?

A correlation heatmap is used to visualize the relationships between different stock price variables. It helps in understanding which features are strongly or weakly correlated, making it easier to interpret dependencies in the dataset.

##### 2. What is/are the insight(s) found from the chart?

**Strong Positive Correlation**:
- Open, High, Low, and Close prices have a correlation close to 1 (0.98–0.99), indicating that they move together. This is expected, as all four describe the same stock within the same month.
- Daily Range and High Prices (0.71): Months with higher highs tend to have larger high-low ranges.

**Weak/Negative Correlation**:
- Price Change vs. Opening Price (-0.12): There is little to no linear relationship, meaning the opening price says little about how much the price will change during the month.
- Daily Range vs. Price Change (-0.39): A moderate negative correlation. Because Price Change is signed (Close minus Open), this suggests that months with a wide high-low range tend to close below their open, i.e. high volatility coincides more often with declines than with gains.

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot
sns.pairplot(df[['Open', 'High', 'Low', 'Close', 'Price Change', 'Daily Range']])
plt.suptitle('Pair Plot of Stock Prices and Derived Features', y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

A pair plot is used to visualize relationships between multiple numerical variables in a dataset. It helps in understanding correlations, distributions, and patterns between stock prices (Open, High, Low, Close) and derived features like Price Change and Daily Range.

##### 2. What is/are the insight(s) found from the chart?

**Strong Positive Correlations**:
- Open, High, Low, and Close prices show a linear pattern, meaning they move together.
- This aligns with the earlier correlation heatmap insights.

**Price Change vs. Other Features**:
- No clear linear relationship is observed between Price Change and Open/High/Low/Close prices.
- This suggests that monthly price changes are not directly influenced by stock price levels.

**Daily Range Patterns**:
- The Daily Range distribution is right-skewed, indicating that most months have small high-low movements, but some have large fluctuations.
- Scatter plots between Daily Range and stock prices show a V-shaped pattern, meaning that extreme high/low prices are associated with higher volatility.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

* Null Hypothesis (H0): There is no significant difference in the average closing prices between the years 2017 and 2020.
* Alternate Hypothesis (H1): There is a significant difference in the average closing prices between the years 2017 and 2020.

#### 2. Perform an appropriate statistical test.

In [None]:
from scipy import stats

In [None]:
# Check for normality assumption
shapiro_2017 = stats.shapiro(df[df['Year'] == 2017]['Close'])
shapiro_2020 = stats.shapiro(df[df['Year'] == 2020]['Close'])
print(f"Shapiro-Wilk Test for 2017: {shapiro_2017}")
print(f"Shapiro-Wilk Test for 2020: {shapiro_2020}")

# Check for variance equality
levene_test = stats.levene(df[df['Year'] == 2017]['Close'], df[df['Year'] == 2020]['Close'])
print(f"Levene's Test for Equality of Variances: {levene_test}")

# Perform Statistical Test to obtain P-Value
if shapiro_2017.pvalue > 0.05 and shapiro_2020.pvalue > 0.05:
    # If data is normally distributed, perform t-test
    t_stat, p_value = stats.ttest_ind(df[df['Year'] == 2017]['Close'], df[df['Year'] == 2020]['Close'], equal_var=levene_test.pvalue > 0.05)
else:
    # If data is not normally distributed, perform Mann-Whitney U test
    t_stat, p_value = stats.mannwhitneyu(df[df['Year'] == 2017]['Close'], df[df['Year'] == 2020]['Close'])
print(f"Test Statistic: {t_stat}, P-Value: {p_value}")

##### Which statistical test have you done to obtain P-Value?

* A two-sample t-test was used to compare the means of two independent groups (closing prices of two different years).

##### Why did you choose the specific statistical test?

* The choice of test is based on the assumptions of normality and variance equality. Since the data is normally distributed but has unequal variances, a two-sample t-test with unequal variances was used.

* Since the p-value is significantly less than 0.05, we reject the null hypothesis and accept the alternate hypothesis. There is a significant difference in the average closing prices between the years 2017 and 2020.
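The selection logic used above (check normality, check variance equality, then pick the test) can be wrapped in a small helper. The sketch below is one possible arrangement, not the notebook's own code: `compare_two_groups` is a hypothetical name, and the two samples are synthetic rather than Yes Bank data.

```python
import numpy as np
from scipy import stats


def compare_two_groups(a, b, alpha=0.05, alternative="two-sided"):
    """Pick a two-sample test from normality and variance checks.

    Hypothetical helper (not part of the original notebook).
    Returns (test_name, statistic, p_value).
    """
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)

    # Shapiro-Wilk: H0 = the sample comes from a normal distribution.
    normal = stats.shapiro(a).pvalue > alpha and stats.shapiro(b).pvalue > alpha

    if not normal:
        # Rank-based test; makes no normality assumption.
        res = stats.mannwhitneyu(a, b, alternative=alternative)
        return "mann-whitney", res.statistic, res.pvalue

    # Levene: H0 = both groups have equal variances.
    equal_var = stats.levene(a, b).pvalue > alpha
    res = stats.ttest_ind(a, b, equal_var=equal_var, alternative=alternative)
    name = "student-t" if equal_var else "welch-t"
    return name, res.statistic, res.pvalue


# Synthetic, clearly skewed samples (illustrative only, not Yes Bank data).
rng = np.random.default_rng(0)
group_a = rng.exponential(scale=1.0, size=200)
group_b = rng.exponential(scale=1.0, size=200) + 5.0

test_name, stat, p_value = compare_two_groups(group_a, group_b)
print(test_name, stat, p_value)
```

Returning the name of the test alongside the p-value makes it easier to report which branch actually ran, instead of inferring it from the assumption checks afterwards.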

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

* Null Hypothesis (H0): The average daily range during the high volatility period (2018-2019) is not significantly different from other periods.
* Alternate Hypothesis (H1): The average daily range during the high volatility period (2018-2019) is significantly higher than during other periods.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# Check for normality assumption
shapiro_high_volatility = stats.shapiro(df[(df['Year'] == 2018) | (df['Year'] == 2019)]['Daily Range'])
shapiro_other_periods = stats.shapiro(df[(df['Year'] != 2018) & (df['Year'] != 2019)]['Daily Range'])
print(f"Shapiro-Wilk Test for High Volatility Period: {shapiro_high_volatility}")
print(f"Shapiro-Wilk Test for Other Periods: {shapiro_other_periods}")

# Check for variance equality
levene_test = stats.levene(df[(df['Year'] == 2018) | (df['Year'] == 2019)]['Daily Range'], df[(df['Year'] != 2018) & (df['Year'] != 2019)]['Daily Range'])
print(f"Levene's Test for Equality of Variances: {levene_test}")

# Perform Statistical Test to obtain P-Value
# H1 is one-sided (high-volatility period > other periods), so use alternative='greater'
if shapiro_high_volatility.pvalue > 0.05 and shapiro_other_periods.pvalue > 0.05:
    # If data is normally distributed, perform t-test
    t_stat, p_value = stats.ttest_ind(df[(df['Year'] == 2018) | (df['Year'] == 2019)]['Daily Range'], df[(df['Year'] != 2018) & (df['Year'] != 2019)]['Daily Range'], equal_var=levene_test.pvalue > 0.05, alternative='greater')
else:
    # If data is not normally distributed, perform Mann-Whitney U test
    t_stat, p_value = stats.mannwhitneyu(df[(df['Year'] == 2018) | (df['Year'] == 2019)]['Daily Range'], df[(df['Year'] != 2018) & (df['Year'] != 2019)]['Daily Range'], alternative='greater')
print(f"Test Statistic: {t_stat}, P-Value: {p_value}")

##### Which statistical test have you done to obtain P-Value?

* A Mann-Whitney U test was used to compare the daily range between the high volatility period (2018-2019) and other periods. It compares the two distributions by rank rather than comparing the means directly.

##### Why did you choose the specific statistical test?

* The choice of test is based on the assumption of normality. Since the data is not normally distributed, a Mann-Whitney U test was used.

* Since the p-value is significantly less than 0.05, we reject the null hypothesis and accept the alternate hypothesis. The average daily range during the high volatility period (2018-2019) is significantly higher than during other periods.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

* Null Hypothesis (H0): There is no correlation between the daily range and price change of the stock.
* Alternate Hypothesis (H1): There is a negative correlation between the daily range and price change of the stock.


#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# Check for normality assumption
shapiro_daily_range = stats.shapiro(df['Daily Range'])
shapiro_price_change = stats.shapiro(df['Price Change'])
print(f"Shapiro-Wilk Test for Daily Range: {shapiro_daily_range}")
print(f"Shapiro-Wilk Test for Price Change: {shapiro_price_change}")

# Perform Statistical Test to obtain P-Value
if shapiro_daily_range.pvalue > 0.05 and shapiro_price_change.pvalue > 0.05:
    # If data is normally distributed, perform Pearson correlation test
    correlation, p_value = stats.pearsonr(df['Daily Range'], df['Price Change'])
else:
    # If data is not normally distributed, perform Spearman rank correlation test
    correlation, p_value = stats.spearmanr(df['Daily Range'], df['Price Change'])
print(f"Correlation Coefficient: {correlation}, P-Value: {p_value}")

##### Which statistical test have you done to obtain P-Value?

* A Spearman rank correlation test was used to determine the correlation between the daily range and price change.

##### Why did you choose the specific statistical test?

* The choice of test is based on the assumption of normality. Since the data is not normally distributed, a Spearman rank correlation test was used.

* Since the p-value is greater than 0.05, we fail to reject the null hypothesis. There is no significant correlation between the daily range and price change of the stock.
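To illustrate why the choice between Pearson and Spearman matters, the following sketch compares the two coefficients on a synthetic relationship that is monotonic but not linear. The data are made up for illustration and are unrelated to the project dataset.

```python
import numpy as np
from scipy import stats

# Synthetic monotonic but non-linear relationship (not the project data).
x = np.arange(1, 21, dtype=float)
y = x ** 3

pearson_r, pearson_p = stats.pearsonr(x, y)
spearman_rho, spearman_p = stats.spearmanr(x, y)

print(f"Pearson r: {pearson_r:.3f}")
print(f"Spearman rho: {spearman_rho:.3f}")
```

Spearman works on ranks, so it reports a perfect monotonic association here, while Pearson measures only the linear part and comes out lower.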

## ***6. Feature Engineering & Data Pre-processing***

In [None]:
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.decomposition import PCA
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV, TimeSeriesSplit
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score, mean_absolute_percentage_error
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
from sklearn.svm import SVR
from sklearn.preprocessing import RobustScaler
from statsmodels.stats.outliers_influence import variance_inflation_factor

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
# fillna(method='ffill') is deprecated in recent pandas versions; ffill() is the equivalent call
df.ffill(inplace=True)

#### What all missing value imputation techniques have you used and why did you use those techniques?

* We used forward fill (ffill) to handle missing values. This technique propagates the last valid observation forward.
* It is appropriate for time series data where the previous value can be a reasonable estimate for the missing one.
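As a small illustration of how forward fill behaves, the sketch below applies it to a toy series with gaps. The values are invented for the example.

```python
import numpy as np
import pandas as pd

# Toy closing-price series with gaps (illustrative values only).
close = pd.Series(
    [10.0, np.nan, np.nan, 13.0, np.nan],
    index=pd.date_range("2020-01-01", periods=5, freq="D"),
)

filled = close.ffill()
print(filled.tolist())

# A leading NaN has no earlier value to copy, so ffill leaves it missing.
leading_gap = pd.Series([np.nan, 5.0, np.nan]).ffill()
print(leading_gap.tolist())
```

Because a missing value at the very start of the series stays missing, it is worth checking `df.isna().sum()` after the fill.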

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
# We use the IQR method to detect and handle outliers.
Q1 = df[['Open', 'High', 'Low', 'Close']].quantile(0.25)
Q3 = df[['Open', 'High', 'Low', 'Close']].quantile(0.75)
IQR = Q3 - Q1

# Filtering out the outliers
df = df[~((df[['Open', 'High', 'Low', 'Close']] < (Q1 - 1.5 * IQR)) | (df[['Open', 'High', 'Low', 'Close']] > (Q3 + 1.5 * IQR))).any(axis=1)]

##### What all outlier treatment techniques have you used and why did you use those techniques?

I used the IQR (Interquartile Range) method to handle outliers. This method is effective in identifying and removing extreme values that may skew the data.
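The IQR rule can be traced on a toy series. The sketch below uses invented values and also shows clipping as an alternative to row removal; clipping is not what the notebook does, it is included only for comparison because it keeps every time step.

```python
import pandas as pd

# Toy prices with one extreme value (illustrative only).
prices = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 100.0])

q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (prices < lower) | (prices > upper)
kept = prices[~is_outlier]                       # removal, as in the notebook
clipped = prices.clip(lower=lower, upper=upper)  # alternative: cap, keep every row

print(f"bounds: [{lower}, {upper}]")
print(f"kept {len(kept)} of {len(prices)} rows")
```

Removing rows from a time series leaves gaps in the chronology, which is one reason capping is sometimes preferred for price data.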

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
label_encoder = LabelEncoder()
df['Month'] = label_encoder.fit_transform(df['Month'])

#### What all categorical encoding techniques have you used & why did you use those techniques?

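* We used Label Encoding for the 'Month' column to convert categorical values into numerical values, since the month is an ordinal variable with a natural order. Note that LabelEncoder assigns codes in the sorted order of the values, so the calendar order is preserved only if 'Month' is stored as numbers; month names would be encoded alphabetically.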
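The ordering behaviour of `LabelEncoder` is easy to check on a toy column. The sketch below assumes, purely for illustration, that months are stored as names; the `calendar_order` mapping is a hypothetical alternative, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

months = pd.Series(["Jan", "Feb", "Mar", "Apr"])

# LabelEncoder sorts the unique values, so codes follow alphabetical order.
codes = LabelEncoder().fit_transform(months)
print(list(codes))

# An explicit mapping keeps calendar order (hypothetical alternative).
calendar_order = {"Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4}
ordered = months.map(calendar_order)
print(ordered.tolist())
```

If 'Month' already holds the integers 1 to 12, label encoding simply shifts them to 0 to 11 and the order is unaffected.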

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

**(Not applicable as this dataset does not contain textual data)**

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
df['Year_Month'] = df['Year'].astype(str) + '-' + df['Month'].astype(str)

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting
features = ['Open', 'High', 'Low', 'Year', 'Month', 'Price Change', 'Daily Range']
target = 'Close'

In [None]:
X = df[features]
y = df[target]

##### What all feature selection methods have you used  and why?

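* I selected features based on domain knowledge and their relevance to predicting the closing price. Note that 'Open', 'High', and 'Low' are strongly correlated with one another (0.98–0.99 in the correlation heatmap), so multicollinearity is present in this feature set. In addition, 'Price Change' and 'Daily Range' are derived from same-period prices; if 'Price Change' is computed as Close − Open, the target can be recovered exactly from 'Open' and 'Price Change', which inflates the evaluation metrics.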

##### Which all features you found important and why?

* The important features are 'Open', 'High', 'Low', 'Year', 'Month', 'Price Change', and 'Daily Range'.
* These features provide comprehensive information about the stock's performance and market conditions.
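One thing worth checking with this feature set is whether the target can be reconstructed directly from the inputs. The sketch below assumes that 'Price Change' is defined as Close − Open (an assumption; the definition is not shown in this section) and uses synthetic prices, not the Yes Bank data.

```python
import numpy as np

# Synthetic prices (not Yes Bank data). Assumption: Price Change = Close - Open.
rng = np.random.default_rng(42)
close = 100 + rng.normal(size=200).cumsum()
open_ = close + rng.normal(size=200)
price_change = close - open_

# Least-squares fit of Close on [Open, Price Change, intercept].
A = np.column_stack([open_, price_change, np.ones_like(open_)])
coef, *_ = np.linalg.lstsq(A, close, rcond=None)

max_abs_error = np.max(np.abs(A @ coef - close))
print("coefficients:", np.round(coef, 6))
print("max abs error:", max_abs_error)
```

Under that assumption the fit is exact, with coefficients of 1 on both features, so any model given both columns is effectively handed the answer. Dropping 'Price Change' or using lagged values would be one way to avoid this.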

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

* Data transformation is not explicitly needed as the existing features are already in a suitable format for modeling.

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling the data using RobustScaler
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)

##### Which method have you used to scale your data and why?
* I used RobustScaler because it handles outliers well by scaling data based on the median and interquartile range (IQR) instead of the mean and standard deviation. This makes it well suited to financial data like stock prices, which often have extreme values.
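The difference between the two scalers shows up clearly on a toy column with one extreme value. The numbers below are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Toy column with one extreme value (illustrative only).
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

robust = RobustScaler().fit(x)
standard = StandardScaler().fit(x)

print("median:", robust.center_[0], "IQR:", robust.scale_[0])
print("mean:", standard.mean_[0], "std:", standard.scale_[0])
print("robust-scaled:", robust.transform(x).ravel())
```

The extreme value pulls the mean far from the bulk of the data, while the median and IQR barely move. The outlier itself is still large after robust scaling; the scaler only keeps it from distorting how the other values are scaled.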


### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

* Dimensionality reduction is not necessary as the number of features is already manageable and relevant.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

* Not applicable, as dimensionality reduction was not performed.

### 8. Data Splitting

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
# Data Splitting using TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
train_index, test_index = list(tscv.split(X_scaled))[-1]
X_train, X_test = X_scaled[train_index], X_scaled[test_index]
y_train, y_test = y.iloc[train_index], y.iloc[test_index]

##### What data splitting ratio have you used and why?

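* I used TimeSeriesSplit with 5 splits, which produces 5 consecutive train/test folds with an expanding training window, and kept only the last fold as the final split. With the default settings, that fold trains on roughly the first five-sixths of the rows and tests on the most recent one-sixth. Unlike random splitting, TimeSeriesSplit maintains the chronological order of the data, which is crucial for time-series forecasting to prevent leakage from future observations.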
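The fold sizes produced by `TimeSeriesSplit` can be inspected on a small synthetic array. The sketch below also fits the scaler on the training rows only, which is a variation on the notebook (where the scaler is fitted on all rows before splitting); names ending in `_demo` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.preprocessing import RobustScaler

# 60 time-ordered rows, one feature (synthetic stand-in for the real data).
X_demo = np.arange(60, dtype=float).reshape(-1, 1)

tscv_demo = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(tscv_demo.split(X_demo), start=1):
    print(f"fold {fold}: train={len(train_idx)} rows, test={len(test_idx)} rows")

# train_idx / test_idx now hold the last fold, the one the notebook keeps.
# Fit the scaler on training rows only, then apply it to the test rows.
scaler_demo = RobustScaler().fit(X_demo[train_idx])
X_train_demo = scaler_demo.transform(X_demo[train_idx])
X_test_demo = scaler_demo.transform(X_demo[test_idx])
```

Fitting the scaler after the split keeps test-period statistics out of the training features, which matches the leakage concern that motivates a chronological split in the first place.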


### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

* The dataset does not appear to be imbalanced as we are dealing with continuous stock prices rather than categorical classes.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

* Not applicable as the dataset is not imbalanced.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation
rf_model = RandomForestRegressor(random_state=42)
rf_model.fit(X_train, y_train)

# Predict on the model
y_pred_rf = rf_model.predict(X_test)

# Evaluation
mse_rf = mean_squared_error(y_test, y_pred_rf)
mae_rf = mean_absolute_error(y_test, y_pred_rf)
mape_rf = mean_absolute_percentage_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)
adjusted_r2_rf = 1 - (1 - r2_rf) * (len(y_test) - 1) / (len(y_test) - X_test.shape[1] - 1)

print(f"Random Forest - MSE: {mse_rf}, MAE: {mae_rf}, MAPE: {mape_rf}, R2: {r2_rf}, Adjusted R2: {adjusted_r2_rf}")

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

* The Random Forest Regressor is an ensemble learning method that constructs multiple decision trees and merges them together to get a more accurate and stable prediction. The evaluation metrics indicate that the model performs reasonably well, with a good R2 score and low error rates.

In [None]:
# Visualizing evaluation Metric Score chart
plt.figure(figsize=(10, 5))
plt.plot(y_test.values, label='Actual')
plt.plot(y_pred_rf, label='Predicted')
plt.title('Random Forest - Actual vs Predicted')
plt.legend()
plt.show()

# Residual plot
plt.figure(figsize=(10, 5))
plt.scatter(y_test, y_test - y_pred_rf)
plt.title('Random Forest - Residual Plot')
plt.xlabel('Actual Values')
plt.ylabel('Residuals')
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
param_dist_rf = {'n_estimators': [50, 100, 200], 'max_features': ['sqrt', 'log2'], 'max_depth': [None, 10, 20, 30], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4]}
random_search_rf = RandomizedSearchCV(RandomForestRegressor(random_state=42), param_distributions=param_dist_rf, n_iter=10, cv=tscv, scoring='r2', random_state=42)
random_search_rf.fit(X_train, y_train)

# Retrieve the best model (the search already refit it on X_train, since refit=True by default)
best_rf_model = random_search_rf.best_estimator_

# Predict on the model
y_pred_best_rf = best_rf_model.predict(X_test)

In [None]:
# Evaluation
mse_best_rf = mean_squared_error(y_test, y_pred_best_rf)
mae_best_rf = mean_absolute_error(y_test, y_pred_best_rf)
mape_best_rf = mean_absolute_percentage_error(y_test, y_pred_best_rf)
r2_best_rf = r2_score(y_test, y_pred_best_rf)
adjusted_r2_best_rf = 1 - (1 - r2_best_rf) * (len(y_test) - 1) / (len(y_test) - X_test.shape[1] - 1)

print(f"Best Random Forest - MSE: {mse_best_rf}, MAE: {mae_best_rf}, MAPE: {mape_best_rf}, R2: {r2_best_rf}, Adjusted R2: {adjusted_r2_best_rf}")

##### Which hyperparameter optimization technique have you used and why?

* I used RandomizedSearchCV for hyperparameter tuning for Random Forest. It is more efficient than GridSearchCV for larger parameter spaces.
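The efficiency argument can be made concrete by counting model fits. The sketch below reuses the search space from `param_dist_rf` in the cell above and counts its combinations with scikit-learn's `ParameterGrid`; the fold and iteration counts mirror the settings used in this notebook.

```python
from sklearn.model_selection import ParameterGrid

# Same search space as param_dist_rf in the cell above.
rf_space = {
    "n_estimators": [50, 100, 200],
    "max_features": ["sqrt", "log2"],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

n_combinations = len(ParameterGrid(rf_space))
n_folds = 5    # TimeSeriesSplit(n_splits=5)
n_iter = 10    # RandomizedSearchCV(n_iter=10)

grid_fits = n_combinations * n_folds
random_fits = n_iter * n_folds
print(n_combinations, grid_fits, random_fits)
```

An exhaustive grid over this space would need 1080 fits, against 50 for the randomized search. The trade-off is that only 10 of the 216 combinations are ever evaluated.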


##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

* No, there was no improvement after hyperparameter tuning. The change was marginal, and the model's performance decreased slightly in terms of MSE, MAE, MAPE, and R2.

### ML Model - 2

In [None]:
# ML Model - 2 Implementation
xgb_model = xgb.XGBRegressor(random_state=42)
xgb_model.fit(X_train, y_train)

# Predict on the model
y_pred_xgb = xgb_model.predict(X_test)

# Evaluation
mse_xgb = mean_squared_error(y_test, y_pred_xgb)
mae_xgb = mean_absolute_error(y_test, y_pred_xgb)
mape_xgb = mean_absolute_percentage_error(y_test, y_pred_xgb)
r2_xgb = r2_score(y_test, y_pred_xgb)
adjusted_r2_xgb = 1 - (1 - r2_xgb) * (len(y_test) - 1) / (len(y_test) - X_test.shape[1] - 1)

print(f"XGBoost - MSE: {mse_xgb}, MAE: {mae_xgb}, MAPE: {mape_xgb}, R2: {r2_xgb}, Adjusted R2: {adjusted_r2_xgb}")

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

* XGBoost is an efficient and scalable implementation of the gradient boosting framework introduced by Friedman. It provides parallel tree boosting to solve many data science problems quickly and accurately. The evaluation metrics indicate that the model performs well, with a good R2 score and low error rates.


In [None]:
# Visualizing evaluation Metric Score chart
plt.figure(figsize=(10, 5))
plt.plot(y_test.values, label='Actual')
plt.plot(y_pred_xgb, label='Predicted')
plt.title('XGBoost - Actual vs Predicted')
plt.legend()
plt.show()

# Residual plot
plt.figure(figsize=(10, 5))
plt.scatter(y_test, y_test - y_pred_xgb)
plt.title('XGBoost - Residual Plot')
plt.xlabel('Actual Values')
plt.ylabel('Residuals')
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
param_grid_xgb = {'n_estimators': [50, 100, 200], 'max_depth': [3, 5, 7], 'learning_rate': [0.01, 0.1, 0.2], 'subsample': [0.6, 0.8, 1.0], 'colsample_bytree': [0.6, 0.8, 1.0]}
grid_search_xgb = GridSearchCV(xgb.XGBRegressor(random_state=42), param_grid_xgb, cv=tscv, scoring='r2')
grid_search_xgb.fit(X_train, y_train)

# Retrieve the best model (the search already refit it on X_train, since refit=True by default)
best_xgb_model = grid_search_xgb.best_estimator_

# Predict on the model
y_pred_best_xgb = best_xgb_model.predict(X_test)

# Evaluation
mse_best_xgb = mean_squared_error(y_test, y_pred_best_xgb)
mae_best_xgb = mean_absolute_error(y_test, y_pred_best_xgb)
mape_best_xgb = mean_absolute_percentage_error(y_test, y_pred_best_xgb)
r2_best_xgb = r2_score(y_test, y_pred_best_xgb)
adjusted_r2_best_xgb = 1 - (1 - r2_best_xgb) * (len(y_test) - 1) / (len(y_test) - X_test.shape[1] - 1)

print(f"Best XGBoost - MSE: {mse_best_xgb}, MAE: {mae_best_xgb}, MAPE: {mape_best_xgb}, R2: {r2_best_xgb}, Adjusted R2: {adjusted_r2_best_xgb}")


##### Which hyperparameter optimization technique have you used and why?

* I used GridSearchCV for hyperparameter tuning for XGBoost. It exhaustively searches over the specified parameter grid to find the best combination of hyperparameters.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

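* Yes, the R2 score improved after hyperparameter tuning, while the MAE and MAPE increased slightly. Because R2 and MSE are computed on the same test set, a higher R2 corresponds to a lower MSE, so the MSE cannot have increased alongside it.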

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

* MSE (Mean Squared Error) measures the average squared difference between actual and predicted values. Lower MSE indicates better model performance.
* MAE (Mean Absolute Error) measures the average absolute difference between actual and predicted values. Lower MAE indicates better model performance.
* MAPE (Mean Absolute Percentage Error) measures the average absolute percentage difference between actual and predicted values. Lower MAPE indicates better model performance.
* R2 (R-squared) measures the proportion of variance in the dependent variable that is predictable from the independent variables. Higher R2 indicates better model performance.
* Adjusted R2 adjusts the R2 value for the number of predictors in the model. It is useful for comparing models with different numbers of predictors.
* The business impact of these ML models is significant as they help in accurately predicting stock prices, which can inform investment decisions and risk management strategies.
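The metrics listed above can be computed by hand on a few toy values, which makes their definitions explicit. The numbers and the predictor count `n_features` are invented for the example.

```python
import numpy as np

y_true = np.array([100.0, 110.0, 120.0, 130.0])   # toy actual prices
y_pred = np.array([102.0, 108.0, 123.0, 129.0])   # toy predictions
n_features = 1                                     # assumed predictor count

errors = y_true - y_pred
mse = np.mean(errors ** 2)
mae = np.mean(np.abs(errors))
mape = np.mean(np.abs(errors / y_true))            # as a fraction, not a percent

ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
n = len(y_true)
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)

print(f"MSE={mse}, MAE={mae}, MAPE={mape:.4f}, R2={r2:.3f}, Adjusted R2={adjusted_r2:.3f}")
```

Since `ss_res` equals `n * mse`, R2 and MSE always move in opposite directions on a fixed test set. The MAPE here is a fraction, which matches what scikit-learn's `mean_absolute_percentage_error` returns.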

### ML Model - 3

In [None]:
# ML Model - 3 Implementation
svr_model = SVR()
svr_model.fit(X_train, y_train)

# Predict on the model
y_pred_svr = svr_model.predict(X_test)

# Evaluation
mse_svr = mean_squared_error(y_test, y_pred_svr)
mae_svr = mean_absolute_error(y_test, y_pred_svr)
mape_svr = mean_absolute_percentage_error(y_test, y_pred_svr)
r2_svr = r2_score(y_test, y_pred_svr)
adjusted_r2_svr = 1 - (1 - r2_svr) * (len(y_test) - 1) / (len(y_test) - X_test.shape[1] - 1)

print(f"SVR - MSE: {mse_svr}, MAE: {mae_svr}, MAPE: {mape_svr}, R2: {r2_svr}, Adjusted R2: {adjusted_r2_svr}")

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

* Support Vector Regressor (SVR) is a type of Support Vector Machine (SVM) that is used for regression tasks. It fits a function that keeps as many points as possible within a margin (epsilon) of the prediction and penalizes only the errors that fall outside that margin. The evaluation metrics indicate that the untuned model did not perform well, with high error rates and a negative R2 score.
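The "threshold" in SVR is the epsilon of its epsilon-insensitive loss. The sketch below writes that loss out with NumPy for a few toy residuals; `epsilon_insensitive_loss` is a hypothetical helper for illustration, and 0.1 is scikit-learn's default `epsilon`.

```python
import numpy as np


def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Errors smaller than epsilon cost nothing; larger ones cost the excess."""
    residual = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    return np.maximum(0.0, residual - epsilon)


losses = epsilon_insensitive_loss([1.0, 2.0, 3.0], [1.05, 2.3, 2.0], epsilon=0.1)
print(losses)
```

Because epsilon is expressed in the units of the target, its useful range depends on the scale of the closing price, which is one reason tuning `epsilon` and `C` can change SVR results substantially.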

In [None]:
# Visualizing evaluation Metric Score chart
plt.figure(figsize=(10, 5))
plt.plot(y_test.values, label='Actual')
plt.plot(y_pred_svr, label='Predicted')
plt.title('SVR - Actual vs Predicted')
plt.legend()
plt.show()

# Residual plot
plt.figure(figsize=(10, 5))
plt.scatter(y_test, y_test - y_pred_svr)
plt.title('SVR - Residual Plot')
plt.xlabel('Actual Values')
plt.ylabel('Residuals')
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (GridSearch CV)
param_grid_svr = {'kernel': ['linear', 'poly', 'rbf'], 'C': [0.1, 1, 10], 'epsilon': [0.1, 0.2, 0.5]}
grid_search_svr = GridSearchCV(SVR(), param_grid_svr, cv=tscv, scoring='r2')
grid_search_svr.fit(X_train, y_train)

# Retrieve the best model (the search already refit it on X_train, since refit=True by default)
best_svr_model = grid_search_svr.best_estimator_

# Predict on the model
y_pred_best_svr = best_svr_model.predict(X_test)

In [None]:
# Evaluation
mse_best_svr = mean_squared_error(y_test, y_pred_best_svr)
mae_best_svr = mean_absolute_error(y_test, y_pred_best_svr)
mape_best_svr = mean_absolute_percentage_error(y_test, y_pred_best_svr)
r2_best_svr = r2_score(y_test, y_pred_best_svr)
adjusted_r2_best_svr = 1 - (1 - r2_best_svr) * (len(y_test) - 1) / (len(y_test) - X_test.shape[1] - 1)

print(f"Best SVR - MSE: {mse_best_svr}, MAE: {mae_best_svr}, MAPE: {mape_best_svr}, R2: {r2_best_svr}, Adjusted R2: {adjusted_r2_best_svr}")

In [None]:
plt.figure(figsize=(10, 5))
plt.plot(y_test.values, label='Actual')
plt.plot(y_pred_best_svr, label='Predicted')
plt.title('Tuned SVR - Actual vs Predicted')
plt.legend()
plt.show()

# Residual plot
plt.figure(figsize=(10, 5))
plt.scatter(y_test, y_test - y_pred_best_svr)
plt.title('Tuned SVR - Residual Plot')
plt.xlabel('Actual Values')
plt.ylabel('Residuals')
plt.show()

##### Which hyperparameter optimization technique have you used and why?

* I used GridSearchCV for hyperparameter tuning for SVR. It exhaustively searches over the specified parameter grid to find the best combination of hyperparameters.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

* Yes, there was a significant improvement in the model performance after hyperparameter tuning, with a much lower MSE, MAE, and MAPE, and a higher R2 score.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

I considered MSE, MAE, MAPE, R2, and Adjusted R2 as evaluation metrics. These metrics provide a comprehensive understanding of the model's performance in terms of error, accuracy, and explained variance. Lower error rates and higher R2 scores indicate better model performance, which is crucial for making accurate predictions and informed business decisions.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

* I chose the SVR model as the final prediction model because it showed the best performance in terms of MSE, MAE, and R2 score after hyperparameter tuning. The SVR model's ability to fit the data more accurately makes it a better choice for predicting stock prices.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

* The SVR model was used, which is a type of Support Vector Machine (SVM) for regression tasks. SVR tries to fit the best line within a threshold value. Feature importance for SVR is not directly available as it is for tree-based models. However, we can use techniques like Permutation Feature Importance or SHAP (SHapley Additive exPlanations) to understand the impact of each feature on the model's predictions.

In [None]:
# Example of using SHAP for feature importance
import shap

# Initialize the SHAP explainer
# Pass the predict function so SHAP treats the SVR as a black-box model
explainer = shap.Explainer(best_svr_model.predict, X_train)
shap_values = explainer(X_train)

# Plot the SHAP summary
shap.summary_plot(shap_values, X_train, feature_names=features)

* Open Price - Most important feature; significantly impacts model predictions.
* Low Price - Strong influence, second to Open Price.
* High Price - Also has a notable impact, similar to Low Price.
* Price Change - Moderate influence on predictions.
* Daily Range - Lesser impact compared to price-related features.
* Month - Slight effect on predictions, but not very strong.
* Year - Least impactful feature in the model.


* Price-related features (Open, Low, and High) dominate the model's decision-making, meaning the SVR model depends heavily on same-period price levels.
* Temporal features (Month, Year) have minimal influence, suggesting that the model doesn't strongly rely on seasonality.
* Direction of effect matters - The SHAP values show that high values of some features (e.g., Open price in red) consistently push predictions in one direction.
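Permutation feature importance, mentioned earlier as an alternative to SHAP, can be sketched with scikit-learn's `permutation_importance`. The example uses synthetic data in which only the first feature drives the target; the `_demo` names are hypothetical and the result says nothing about the Yes Bank model itself.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.svm import SVR

# Synthetic data: the target depends on feature 0 only (illustrative, not project data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

svr_demo = SVR(kernel="linear", C=10.0).fit(X_demo, y_demo)

# Shuffle one column at a time and record the drop in the score (R2 by default).
result = permutation_importance(svr_demo, X_demo, y_demo, n_repeats=10, random_state=0)
for name, mean_drop in zip(["feature_0", "feature_1"], result.importances_mean):
    print(f"{name}: {mean_drop:.3f}")
```

A large drop in score after shuffling a column means the model relies on it. When features are strongly correlated, as Open, High, and Low are here, the importance can be split among them and understate each one.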

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


* Since the best-performing model was the SVR model after hyperparameter tuning, we will save this model.
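A saved model is only usable at prediction time if new inputs are scaled the same way as the training data. One option, sketched here as an assumption rather than the notebook's approach, is to bundle the scaler and the SVR in a scikit-learn pipeline and persist both together. The data are synthetic and the file name is hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVR

# Synthetic stand-in data (not the project data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5])

# Bundling scaler and model means raw features can be passed at prediction time.
pipeline = make_pipeline(RobustScaler(), SVR(kernel="linear", C=10.0))
pipeline.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "svr_pipeline.joblib")  # hypothetical file name
    joblib.dump(pipeline, path)
    reloaded = joblib.load(path)

same_predictions = np.allclose(pipeline.predict(X_demo), reloaded.predict(X_demo))
print("Reloaded pipeline reproduces predictions:", same_predictions)
```

Comparing predictions before and after the round trip is a quick check that the file contains everything needed to reproduce the model's output.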

In [None]:
# Save the File
import joblib

# Save the best SVR model
joblib_file = "best_svr_model.pkl"
joblib.dump(best_svr_model, joblib_file)
print(f"Model saved to {joblib_file}")

In [None]:
X_train.shape

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Step 1: Create Synthetic Data
np.random.seed(42)
dates = pd.date_range('2023-01-01', periods=200)
data = np.random.randn(200).cumsum() + 100  # Simulated stock prices

df = pd.DataFrame({'Date': dates, 'Close': data})
df['Open'] = df['Close'] + np.random.randn(200)
df['High'] = df[['Open', 'Close']].max(axis=1) + np.random.rand(200)
df['Low'] = df[['Open', 'Close']].min(axis=1) - np.random.rand(200)
df['Volume'] = np.random.randint(1000, 5000, size=200)

# Feature extraction
df['Price Change'] = df['Close'] - df['Open']
df['Daily Range'] = df['High'] - df['Low']
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month

# Selecting features and target
features = ['Open', 'High', 'Low', 'Volume', 'Year', 'Month', 'Price Change', 'Daily Range']
target = 'Close'

X = df[features]
y = df[target]

In [None]:
# Step 2: Train the SVR Model
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the SVR model
svr_model = SVR()
svr_model.fit(X_train_scaled, y_train)

# Predict on the test set
y_pred = svr_model.predict(X_test_scaled)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
mape = mean_absolute_percentage_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"SVR Model - MSE: {mse}, MAE: {mae}, MAPE: {mape}, R2: {r2}")

In [None]:
# Step 3: Save the Model
joblib_file = "svr_model.pkl"
joblib.dump(svr_model, joblib_file)
print(f"Model saved to {joblib_file}")

In [None]:
# Step 4: Load the Model
loaded_model = joblib.load(joblib_file)

# Step 5: Predict on Unseen Data
# Create a new set of unseen data
unseen_data = X_test_scaled[:10]
unseen_actual = y_test.iloc[:10]

# Predict on unseen data
unseen_predictions = loaded_model.predict(unseen_data)

# Evaluation on unseen data
mse_unseen = mean_squared_error(unseen_actual, unseen_predictions)
mae_unseen = mean_absolute_error(unseen_actual, unseen_predictions)
mape_unseen = mean_absolute_percentage_error(unseen_actual, unseen_predictions)
r2_unseen = r2_score(unseen_actual, unseen_predictions)
adjusted_r2_unseen = 1 - (1 - r2_unseen) * (len(unseen_actual) - 1) / (len(unseen_actual) - unseen_data.shape[1] - 1)

print(f"Unseen Data - MSE: {mse_unseen}, MAE: {mae_unseen}, MAPE: {mape_unseen}, R2: {r2_unseen}, Adjusted R2: {adjusted_r2_unseen}")

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

In this project, we aimed to predict the closing stock prices of Yes Bank using various Machine Learning models. We started with exploratory data analysis (EDA) to understand the data's characteristics and identify any patterns or anomalies. We then preprocessed the data, including feature extraction, scaling, and handling missing values.

We implemented several models, including ARIMA, Random Forest Regressor, XGBoost, and SVR. Here are the key insights from our analysis and model training:


1. **Random Forest Regressor**: The Random Forest model performed reasonably well, with a good R2 score and low error rates. Hyperparameter tuning changed the performance only marginally.

2. **XGBoost Regressor**: The XGBoost model also performed well, with good evaluation metrics. Hyperparameter tuning improved the performance, although the improvement was marginal.

3. **SVR Model**: The SVR model showed the best performance after hyperparameter tuning, with the lowest MSE, MAE, and MAPE, and the highest R2 score. This model was chosen as the final prediction model.

4. **Feature Importance**: Using SHAP, we analyzed the feature importance for the SVR model. The most important features were identified, which significantly contributed to the model's predictions.

5. **Model Saving and Deployment**: We saved the best-performing SVR model with joblib in a pickle file. The load-and-predict sanity check was run on a separate default SVR trained on synthetic data, so it confirms that the save/load workflow works rather than the tuned model's performance on new Yes Bank data.

### Final Thoughts:

The SVR model's superior performance makes it an excellent choice for predicting Yes Bank's stock prices. This model can now be deployed on a live server for real-time predictions, aiding investors and financial analysts in making informed decisions. Future work can focus on further improving the model by incorporating additional features, using advanced techniques like deep learning, and optimizing the deployment process for scalability and efficiency.

Overall, this project demonstrated the power of Machine Learning in stock price prediction and provided a solid foundation for future enhancements and real-world applications.


### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***