# **Project Name**    - Yes Bank Stock Closing Price Prediction



##### **Project Type**    - EDA/Regression
##### **Contribution**    - Individual
##### **Team Member 1 -** SUMIT KAUSHIK

# **Project Summary -**

<p>The dataset being analyzed involves various stock market metrics, including columns such as <strong>Open</strong>, <strong>High</strong>, <strong>Low</strong>, <strong>Close</strong>, along with temporal variables like <strong>month</strong> and <strong>day</strong>. The goal is to analyze stock performance over time, identify trends, and implement predictive models that can help in understanding stock behavior. This dataset, while straightforward, provides insights into stock price movements, including fluctuations and volatility.</p>
<h4>Key Features:</h4>
<ol>
<li><strong>Date</strong>: Represents the specific date for each stock market entry. This feature is critical for understanding time-based patterns.</li>
<li><strong>Open</strong>: The price at which a stock opened on a given day. This serves as a baseline for each trading day and is influenced by various pre-market activities.</li>
<li><strong>High</strong>: The highest price reached by the stock during a particular trading session.</li>
<li><strong>Low</strong>: The lowest price that the stock touched during the trading session.</li>
<li><strong>Close</strong>: The price at which the stock closed for the day, considered the most important price point for analysis.</li>
<li><strong>Month</strong>: Extracted from the <strong>Date</strong> column, this helps in aggregating data to identify seasonal patterns or monthly trends.</li>
<li><strong>Day</strong>: Also derived from <strong>Date</strong>, this allows for comparisons across days of the month and potential day-of-the-month effects on stock prices.</li>
</ol>

# **GitHub Link -**

https://github.com/Sumit021990/ML-Yes-Bank-Stock-Closing-Price-Prediction-Capstone-Sumit_Kaushik-.git

# **Problem Statement**


<p>The objective of this analysis is to explore and predict stock price movements using a dataset consisting of various stock market metrics such as <strong>Open</strong>, <strong>High</strong>, <strong>Low</strong>, and <strong>Close</strong> prices. Stock prices fluctuate over time due to a variety of factors, and accurately predicting these changes can provide significant value to investors, traders, and financial analysts.</p>
<p>Specifically, the task involves analyzing historical stock price data to:</p>
<ol>
<li><strong>Understand trends and patterns</strong> over different time periods (e.g., by month or day).</li>
<li><strong>Identify and treat outliers</strong> in stock prices to avoid skewing predictions and analyses.</li>
<li><strong>Build predictive models</strong> to forecast future stock prices using features such as historical price points, temporal factors (e.g., month, day), and other relevant attributes.</li>
</ol>
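
The outlier step in the list above is not tied to a specific method. One common option is the interquartile-range (IQR) rule; the sketch below is a minimal illustration under that assumption. The helper name `flag_iqr_outliers`, the multiplier `k=1.5`, and the toy prices are all hypothetical and are not taken from the dataset.

```python
import pandas as pd


def flag_iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside the IQR fences."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (series < lower) | (series > upper)


# Toy closing prices, invented purely for illustration
toy_close = pd.Series([10.0, 11.0, 12.0, 11.5, 10.5, 95.0])
outlier_mask = flag_iqr_outliers(toy_close)

# Show only the values flagged as outliers
print(toy_close[outlier_mask])
```

Flagged values could then be capped, removed, or simply inspected, depending on whether they reflect data errors or genuine market moves.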

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
import plotly.express as px
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn import metrics
from datetime import datetime

from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve
from sklearn.metrics import auc
from sklearn.metrics import precision_recall_curve

import lightgbm as lgb
import xgboost as xgb
from xgboost import XGBClassifier

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
data=pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Data Sets/Yes Bank Stock Closing Price Prediction-20240907T092607Z-001/Yes Bank Stock Closing Price Prediction/Copy of data_YesBank_StockPrices.csv')

### Dataset First View

In [None]:
# Dataset First Look
df=data.copy()
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
df.isnull().sum().plot(kind='bar')
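
The duplicate and missing-value checks can also be gathered into a single summary, which is handy when the notebook is rerun on a refreshed file. This is only a sketch: the helper name `quality_report` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd


def quality_report(frame: pd.DataFrame) -> dict:
    """Summarise row count, fully duplicated rows, and missing values per column."""
    return {
        "rows": len(frame),
        "duplicate_rows": int(frame.duplicated().sum()),
        "missing_per_column": frame.isnull().sum().to_dict(),
    }


# Toy frame, invented for illustration: one duplicated row and one missing value
toy = pd.DataFrame({"Open": [1.0, 1.0, np.nan], "Close": [2.0, 2.0, 3.0]})
report = quality_report(toy)
print(report)
```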


### What did you know about your dataset?

<p>The dataset provided for stock price analysis contains the following key features:</p>
<ol>
<li><strong>Date</strong>: The timestamp for each record, indicating when the stock price data was recorded.</li>
<li><strong>Open</strong>: The stock price at the start of the trading day.</li>
<li><strong>High</strong>: The highest price of the stock during the trading day.</li>
<li><strong>Low</strong>: The lowest price of the stock during the trading day.</li>
<li><strong>Close</strong>: The final price of the stock at the end of the trading day.</li>
<li><strong>Month</strong>: Derived from the date, indicating the month in which the stock prices were recorded.</li>
<li><strong>Day</strong>: Also derived from the date, indicating the specific day of the month.</li>
</ol>

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

<h3>Variables Description for the Stock Price Dataset:</h3>
<ol>
<li>
<p><strong>Date</strong>:</p>
<ul>
<li><strong>Type</strong>: DateTime</li>
<li><strong>Description</strong>: Represents the specific trading day. Useful for time series analysis and extracting time-based features (month, day, year, etc.).</li>
</ul>
</li>
<li>
<p><strong>Open</strong>:</p>
<ul>
<li><strong>Type</strong>: Float</li>
<li><strong>Description</strong>: The stock's price at the beginning of the trading day. This variable indicates how the stock opens in the market.</li>
</ul>
</li>
<li>
<p><strong>High</strong>:</p>
<ul>
<li><strong>Type</strong>: Float</li>
<li><strong>Description</strong>: The highest price the stock reaches during the trading day. Reflects the peak value during the day.</li>
</ul>
</li>
<li>
<p><strong>Low</strong>:</p>
<ul>
<li><strong>Type</strong>: Float</li>
<li><strong>Description</strong>: The lowest price the stock falls to during the trading day. This value helps assess how low the stock was during volatile trading periods.</li>
</ul>
</li>
<li>
<p><strong>Close</strong>:</p>
<ul>
<li><strong>Type</strong>: Float</li>
<li><strong>Description</strong>: The stock's price at the end of the trading day. It is considered a significant variable in stock market analysis as it reflects the final price investors are willing to pay for the stock at market close.</li>
</ul>
</li>
<li>
<p><strong>Month</strong>:</p>
<ul>
<li><strong>Type</strong>: Integer</li>
<li><strong>Description</strong>: Extracted from the Date variable, this represents the month of the trading activity. Useful for capturing seasonal trends or patterns over time.</li>
</ul>
</li>
<li>
<p><strong>Day</strong>:</p>
<ul>
<li><strong>Type</strong>: Integer</li>
<li><strong>Description</strong>: Extracted from the Date variable, this represents the day of the trading month. Useful for understanding daily fluctuations in stock prices.</li>
</ul>
</li>
</ol>

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
df.columns


In [None]:
df.dtypes

In [None]:
df.head()

In [None]:
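# The raw Date strings are parsed here as 'Mon-DD'; a placeholder year (2023) is appended
# so they become full dates. If the two digits are actually a year, format='%b-%y' applies instead.
df['Date'] = pd.to_datetime(df['Date'] + '-2023', format='%b-%d-%Y')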

# Extract month and day from the 'Date' column
df['month'] = df['Date'].dt.month
df['day'] = df['Date'].dt.day

# Display the DataFrame
print(df)
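
The conversion above reads the two digits after the month as a day of the month. If the `Date` strings instead encode a month and a two-digit year (an assumption about the file, not something shown in this notebook), a parse along the following lines may be more appropriate. The toy strings are invented for illustration.

```python
import pandas as pd

# Toy date strings in 'Mon-YY' form, invented for illustration
toy_dates = pd.Series(["Jul-05", "Aug-05", "Jan-20"])

# '%b' is the abbreviated month name and '%y' the two-digit year
parsed = pd.to_datetime(toy_dates, format="%b-%y")

toy = pd.DataFrame({"Date": parsed})
toy["year"] = toy["Date"].dt.year
toy["month"] = toy["Date"].dt.month
print(toy)
```

Under this reading each row is one month of data, so a `year` feature replaces `day`, and any day-based chart would need to be reinterpreted accordingly.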

### What all manipulations have you done and insights you found?

<h3>Data Manipulations</h3>
<ol>
<li>
<p><strong>Date Conversion</strong>:</p>
<ul>
<li>Converted the <code>Date</code> column from a string format (<code>'%b-%d'</code>) to a datetime format by appending the year <code>'-2023'</code>. This allowed extraction of month and day components for analysis.</li>
</ul>
</li>
<li>
<p><strong>Feature Extraction</strong>:</p>
<ul>
<li>Extracted <code>month</code> and <code>day</code> from the <code>Date</code> column. These features can help in understanding seasonal patterns and daily variations.</li>
</ul>
</li>
<li>
<p><strong>Handling Missing Values</strong>:</p>
<ul>
<li>Filled missing values in <code>Open</code>, <code>High</code>, <code>Low</code>, and <code>Close</code> columns with the mean of their respective columns to avoid data imputation errors and maintain data consistency.</li>
</ul>
</li>
<li>
<p><strong>Data Verification</strong>:</p>
<ul>
<li>Dropped rows where <code>Date</code> could not be converted, ensuring data integrity for subsequent analysis.</li>
</ul>
</li>
</ol>
<h3>Insights</h3>
<ol>
<li>
<p><strong>Monthly Trends</strong>:</p>
<ul>
<li>By extracting the month from the <code>Date</code>, we can analyze trends and patterns specific to each month. For instance, if certain months show higher or lower average stock prices, this might indicate seasonal trends or events affecting stock prices.</li>
</ul>
</li>
<li>
<p><strong>Daily Trends</strong>:</p>
<ul>
<li>Extracting the day allows for analysis of variations across the month. This can help identify patterns that recur on specific days of the month or anomalies in daily trading behavior.</li>
</ul>
</li>
<li>
<p><strong>Missing Value Handling</strong>:</p>
<ul>
<li>Filling missing values with the mean of the respective columns ensures that the dataset remains complete for analysis and prevents biases that could arise from dropping missing data points.</li>
</ul>
</li>
<li>
<p><strong>Correlation Analysis</strong>:</p>
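<ul>
<li>With <code>month</code> and <code>day</code> stored as numeric columns, they can be included alongside <code>Open</code>, <code>High</code>, <code>Low</code>, and <code>Close</code> in a correlation matrix (see the heatmap in Chart 6).</li>
</ul>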
</li>
</ol>
<h3>Example Insights from Analysis</h3>
<ul>
<li>
<p><strong>Seasonal Patterns</strong>: If monthly data shows specific trends, such as higher stock prices in particular months, this could suggest seasonal effects or annual events impacting stock performance.</p>
</li>
<li>
<p><strong>Volatility Patterns</strong>: By examining daily price fluctuations, it might be possible to identify specific days with higher volatility or recurring patterns, which could be useful for trading strategies.</p>
</li>
<li>
<p><strong>Data Quality</strong>: The manipulation steps ensured that the dataset was cleaned and ready for further analysis, such as machine learning modeling, by handling missing values and converting data types appropriately.</p>
</li>
</ul>
<p>These manipulations and insights lay the groundwork for more detailed analysis, such as time series forecasting, anomaly detection, or deeper statistical analyses to understand and predict stock price behavior.</p>
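
The missing-value handling and the dropping of unparseable dates described above do not appear in the wrangling cell. The sketch below shows one way those steps might look. The helper name `clean_prices`, the order of operations (drop bad dates first, then impute), and the toy frame are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def clean_prices(frame, price_cols=("Open", "High", "Low", "Close")):
    """Parse dates, drop rows whose date cannot be parsed, and mean-impute prices."""
    out = frame.copy()
    # errors='coerce' turns unparseable strings into NaT instead of raising
    out["Date"] = pd.to_datetime(out["Date"] + "-2023", format="%b-%d-%Y", errors="coerce")
    out = out.dropna(subset=["Date"]).reset_index(drop=True)
    # Fill remaining gaps in each price column with that column's mean
    for col in price_cols:
        out[col] = out[col].fillna(out[col].mean())
    out["month"] = out["Date"].dt.month
    out["day"] = out["Date"].dt.day
    return out


# Toy frame, invented for illustration: one bad date and one missing price
toy = pd.DataFrame({
    "Date": ["Jul-05", "not-a-date", "Aug-06"],
    "Open": [10.0, 11.0, np.nan],
    "High": [12.0, 13.0, 14.0],
    "Low": [9.0, 9.5, 10.0],
    "Close": [11.0, 12.0, 13.0],
})
cleaned = clean_prices(toy)
print(cleaned)
```

Mean imputation keeps every row but flattens variation. For ordered price data, forward-filling is a common alternative worth comparing.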

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

In [None]:
df.columns

#### Chart - 1

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(10, 6))
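# np.mean is passed as a callable so the call also works on seaborn versions that do not accept string estimators
sns.barplot(x='month', y='Close', data=df, estimator=np.mean, palette='Blues_d')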

# Adding labels and title
plt.title('Average Close Price by Month')
plt.xlabel('Month')
plt.ylabel('Average Close Price')
plt.show()

##### 1. Why did you pick the specific chart?

<p>The bar chart was selected to visualize the average closing price of the stock by month. This type of chart is ideal for comparing the average values across discrete categories (in this case, the months of the year). By aggregating the closing prices to monthly averages, the chart provides a clear and intuitive view of how the stock&rsquo;s performance fluctuates over the course of the year. This helps in identifying any seasonal trends, patterns, or anomalies in the stock's behavior.</p>

##### 2. What is/are the insight(s) found from the chart?

<p>From the bar chart, you can derive several insights:</p>
<ul>
<li>
<p><strong>Monthly Trends</strong>: The chart will show which months have higher or lower average closing prices. For instance, if certain months consistently show higher closing prices, it might indicate positive performance during those periods.</p>
</li>
<li>
<p><strong>Seasonal Patterns</strong>: By analyzing the average closing prices for each month, you can identify any seasonal patterns in the stock&rsquo;s performance. For example, if the stock tends to perform better in specific months, this could be due to seasonal factors affecting the market or industry trends.</p>
</li>
<li>
<p><strong>Anomalies</strong>: The chart may reveal any anomalies or outliers where the closing price deviates significantly from the norm. Such anomalies might indicate exceptional market conditions or specific events impacting stock performance.</p>
</li>
</ul>
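
To state the monthly comparison in numbers rather than reading bar heights, the same averages can be computed directly. This is a minimal sketch on an invented toy frame. In the notebook, `df` with its `month` and `Close` columns would take the place of `toy`.

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    "month": [1, 1, 2, 2, 3],
    "Close": [10.0, 14.0, 20.0, 22.0, 5.0],
})

# Average closing price per month, the quantity the bar chart displays
monthly_avg = toy.groupby("month")["Close"].mean()

# Months with the highest and lowest average close
best_month = monthly_avg.idxmax()
worst_month = monthly_avg.idxmin()
print(monthly_avg)
print("highest:", best_month, "lowest:", worst_month)
```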

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

<ul>
<li>
<p><strong>Positive Business Impact</strong>:</p>
<ul>
<li><strong>Strategic Planning</strong>: If certain months consistently show strong performance, businesses can use this information to strategize investments or marketing efforts. For example, knowing that the stock performs well in specific months can guide timing for stock purchases or financial planning.</li>
<li><strong>Seasonal Adjustments</strong>: Recognizing seasonal patterns can help businesses prepare for expected fluctuations in stock performance, allowing for better cash flow management and investment planning.</li>
</ul>
</li>
<li>
<p><strong>Negative Growth Insights</strong>:</p>
<ul>
<li><strong>Poor Performance Months</strong>: If the chart highlights certain months with consistently low average closing prices, it may signal periods of poor performance. Businesses need to be cautious during these periods and may need to investigate underlying causes.</li>
<li><strong>Market Trends</strong>: Significant drops or erratic behavior in certain months could indicate underlying market issues or external factors affecting stock performance. Understanding these factors can help mitigate potential negative impacts.</li>
</ul>
</li>
</ul>

#### Chart - 2

In [None]:
# Chart - 2 visualization code
df_melted = df.melt(id_vars='month', value_vars=['Close', 'Low', 'High', 'Open'], var_name='Stock Metric', value_name='Value')

# Create the FacetGrid plot
g = sns.catplot(x='month', y='Value', hue='Stock Metric', data=df_melted, kind='bar', height=6, aspect=2)
g.set_axis_labels('Month', 'Average Value')
g.fig.suptitle('Average Stock Metrics by Month')
plt.show()



##### 1. Why did you pick the specific chart?

<p>I chose the FacetGrid bar plot to visualize the average values of key stock metrics (<code>Close</code>, <code>Low</code>, <code>High</code>, <code>Open</code>) across different months. This type of chart was selected because:</p>
<ul>
<li><strong>Comparative Analysis</strong>: It allows us to compare multiple stock metrics simultaneously and see their performance throughout the year.</li>
<li><strong>Trend Identification</strong>: It helps identify monthly trends and variations in stock metrics.</li>
<li><strong>Ease of Interpretation</strong>: Bar plots are straightforward and effective for showing differences in average values over discrete categories like months.</li>
</ul>


##### 2. What is/are the insight(s) found from the chart?

<p>From the chart, we can derive several insights:</p>
<ul>
<li><strong>Seasonal Patterns</strong>: The plot reveals whether certain months have higher or lower average values for stock metrics. For example, if the "Close" price is consistently higher in specific months, it may indicate seasonal trends or market patterns.</li>
<li><strong>Metric Comparison</strong>: We can observe how different metrics behave relative to each other over the months. If the "High" prices consistently peak in certain months compared to "Low" prices, it suggests higher volatility or market activity during those periods.</li>
<li><strong>Volatility</strong>: Variations in metrics like "Open" and "Close" over the months can indicate periods of high or low market volatility. For instance, large differences between the "Open" and "Close" prices in certain months could highlight increased market fluctuations or events impacting stock performance.</li>
</ul>
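
The volatility reading above relies on gaps between metrics, which are hard to judge from grouped bars alone. One way to make them explicit is to compute the High-Low range and the absolute Open-to-Close move per row, then average by month. The column names `range` and `abs_move` and the toy values are hypothetical.

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    "month": [1, 1, 2],
    "Open": [10.0, 12.0, 20.0],
    "High": [13.0, 15.0, 30.0],
    "Low": [9.0, 11.0, 18.0],
    "Close": [12.0, 11.0, 25.0],
})

# Spread between the highest and lowest price in each record
toy["range"] = toy["High"] - toy["Low"]
# Size of the move from open to close, ignoring direction
toy["abs_move"] = (toy["Close"] - toy["Open"]).abs()

# Average both volatility proxies per month
monthly_volatility = toy.groupby("month")[["range", "abs_move"]].mean()
print(monthly_volatility)
```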

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

<p><strong>Positive Business Impact:</strong></p>
<ul>
<li><strong>Informed Investment Decisions</strong>: By understanding monthly trends in stock metrics, investors can make more strategic decisions. For instance, knowing which months typically show higher stock prices can help in timing investments more effectively.</li>
<li><strong>Financial Planning</strong>: Businesses can use this data to align their financial strategies with expected market conditions. For example, if certain months show consistent growth, companies can plan their budget or stock purchases accordingly.</li>
<li><strong>Risk Management</strong>: Identifying months with high volatility can help in planning risk management strategies. Businesses can prepare for potential downturns or capitalize on opportunities during more stable periods.</li>
</ul>
<p><strong>Potential Negative Growth Insights:</strong></p>
<ul>
<li><strong>Consistent Low Performance</strong>: If the chart shows that certain months consistently have lower stock metrics, it might indicate periods of poor performance or market downturns. This could lead to reduced profitability or losses if not anticipated and managed.</li>
<li><strong>Volatility Risks</strong>: High variability between stock metrics in certain months could suggest periods of increased risk. Businesses and investors might face challenges if they are not prepared for the high fluctuations in stock performance.</li>
</ul>

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Group by 'month' and find the minimum close price
chart_3 = df.groupby('month')['Close'].min().reset_index()

# Plotting the result
plt.plot(chart_3['month'], chart_3['Close'], marker='o', linestyle='-', color='b')
plt.xlabel('Month')
plt.ylabel('Minimum Close Price')
plt.title('Minimum Close Price by Month')
plt.grid(True)
plt.show()



##### 1. Why did you pick the specific chart?

<p>I chose the line chart displaying the minimum closing price by month to highlight the periods when the stock experienced its lowest points. This chart provides a clear, continuous view of the minimum values over time, allowing us to identify trends and fluctuations in the stock's performance. The line chart is particularly effective for showing how the minimum price changes month-to-month and for detecting any recurring patterns or anomalies in stock performance.</p>

##### 2. What is/are the insight(s) found from the chart?

<ul>
<li>
<p><strong>Seasonal Trends:</strong> The chart reveals whether there are specific months when the stock regularly reaches its lowest closing prices. This can indicate potential seasonal trends or cyclic patterns in the stock's performance.</p>
</li>
<li>
<p><strong>Price Drops:</strong> By examining the minimum closing prices, we can identify months with significant price drops. These drops might correlate with market events, economic conditions, or company-specific issues.</p>
</li>
<li>
<p><strong>Market Volatility:</strong> The variability in minimum prices across months provides insights into market volatility. Periods with large fluctuations may indicate higher risk or instability in the stock.</p>
</li>
</ul>

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

<p><strong>Positive Business Impact:</strong></p>
<ol>
<li>
<p><strong>Strategic Investment Decisions:</strong> Knowing which months historically have lower minimum prices can help investors time their purchases more effectively. They might choose to invest during these periods if they anticipate a recovery, potentially leading to higher returns.</p>
</li>
<li>
<p><strong>Risk Management:</strong> Understanding the months with historically low minimum prices allows businesses and investors to anticipate and prepare for potential downturns. This foresight can lead to better risk management and more informed financial strategies.</p>
</li>
</ol>
<p><strong>Negative Growth Insights:</strong></p>
<ol>
<li>
<p><strong>Investment Caution:</strong> Consistently low minimum prices in certain months could signal unfavorable investment conditions. Investors may need to exercise caution or avoid investing during these periods, as it might indicate periods of market weakness or instability.</p>
</li>
<li>
<p><strong>Market Sentiment Concerns:</strong> Persistent low minimum prices could reflect underlying issues with the stock or negative market sentiment. This may necessitate a deeper investigation into the reasons behind the low prices and could influence decisions to divest or hold off on new investments.</p>
</li>
</ol>

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Group by 'month' and find the maximum close price
chart_4 = df.groupby('month')['Close'].max().reset_index()

# Plotting the result
plt.plot(chart_4['month'], chart_4['Close'], marker='o', linestyle='-', color='b')
plt.xlabel('Month')
plt.ylabel('Maximum Close Price')
plt.title('Maximum Close Price by Month')
plt.grid(True)
plt.show()


##### 1. Why did you pick the specific chart?

<p>I chose the line chart displaying the maximum closing price by month to illustrate the highest points reached by the stock over time. This visualization helps in understanding the peak performance of the stock in different months. By showing the maximum values, the chart allows us to assess periods of highest stock performance and identify any patterns or trends associated with peak prices.</p>

##### 2. What is/are the insight(s) found from the chart?

<ul>
<li>
<p><strong>Peak Performance:</strong> The chart highlights the months in which the stock achieved its highest closing prices. This can reveal periods of strong performance and potential market highs.</p>
</li>
<li>
<p><strong>Trend Analysis:</strong> By observing the fluctuations in maximum closing prices, one can identify whether the stock tends to reach higher peaks in certain months. This can be useful for understanding the stock's performance cycle.</p>
</li>
<li>
<p><strong>Market Timing:</strong> Identifying months with consistently high maximum prices can provide insights into the best times to sell or take advantage of peak stock performance.</p>
</li>
</ul>
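
The per-month minimum (Chart 3) and maximum (Chart 4) can be computed in a single pass, which also gives the spread between them. The sketch below uses an invented toy frame, and the `spread` column is a hypothetical addition.

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    "month": [1, 1, 2, 2],
    "Close": [10.0, 14.0, 20.0, 26.0],
})

# Minimum and maximum close per month in one aggregation
summary = toy.groupby("month")["Close"].agg(["min", "max"])
# Gap between each month's highest and lowest close
summary["spread"] = summary["max"] - summary["min"]
print(summary)
```

Plotting `summary["min"]` and `summary["max"]` on the same axes would show both charts together and make the spread visible.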

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

<p><strong>Positive Business Impact:</strong></p>
<ol>
<li>
<p><strong>Optimizing Sales:</strong> Investors or traders can use the information about peak performance months to optimize their selling strategies. Knowing when the stock reaches its highest points can help in maximizing returns by selling during these peak periods.</p>
</li>
<li>
<p><strong>Strategic Investments:</strong> Understanding which months historically show higher maximum prices can help in making strategic investment decisions. Investors might time their purchases before anticipated peaks to benefit from the high performance.</p>
</li>
</ol>
<p><strong>Negative Growth Insights:</strong></p>
<ol>
<li>
<p><strong>Market Pressure:</strong> If the maximum prices are decreasing over time, it could indicate a long-term decline in stock performance. This might lead to concerns about the stock&rsquo;s future potential and prompt a reevaluation of investment strategies.</p>
</li>
<li>
<p><strong>Unrealistic Expectations:</strong> Frequent high peaks might create unrealistic expectations about the stock's performance. Investors might overestimate future returns based on past peaks, which could lead to disappointment if the stock does not perform as well in the future.</p>
</li>
</ol>

#### Chart - 5

In [None]:
# Chart - 5 visualization code (day-wise graph)
df_melted = df.melt(id_vars='day', value_vars=['Close', 'Low', 'High', 'Open'], var_name='Stock Metric', value_name='Value')

# Create the FacetGrid plot
g = sns.catplot(x='day', y='Value', hue='Stock Metric', data=df_melted, kind='bar', height=6, aspect=2)
g.set_axis_labels('Day', 'Average Value')
g.fig.suptitle('Average Stock Metrics by Day')
plt.show()

##### 1. Why did you pick the specific chart?

<p>The FacetGrid bar chart was chosen to display the average stock metrics (Open, High, Low, Close) by day of the month. This chart provides a comprehensive view of how each stock metric varies throughout the days in a month. By plotting all four metrics on the same graph, it enables direct comparison and analysis of their daily patterns and trends.</p>

##### 2. What is/are the insight(s) found from the chart?

<ol>
<li>
<p><strong>Daily Variation:</strong> The chart reveals how each stock metric (Open, High, Low, Close) behaves on different days of the month. This can help in understanding daily fluctuations and identifying specific days with significant variations.</p>
</li>
<li>
<p><strong>Metric Comparison:</strong> By plotting all four metrics together, it&rsquo;s easier to compare their daily values. For example, if certain days show higher values for the 'High' metric compared to 'Low', it indicates increased volatility or market activity on those days.</p>
</li>
<li>
<p><strong>Patterns and Anomalies:</strong> Observing the average values for each metric can help identify any recurring patterns or anomalies. For instance, if certain days consistently show unusually high or low metrics, these days might warrant further investigation to understand the underlying causes.</p>
</li>
</ol>

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

<ol>
<li>
<p><strong>Positive Business Impact:</strong></p>
<ol>
<li>
<p><strong>Trading Strategy:</strong> Understanding daily variations in stock metrics can help in developing trading strategies based on historical patterns. For instance, if certain days consistently show high values for the 'High' metric, traders might choose to buy or sell on those days based on anticipated trends.</p>
</li>
<li>
<p><strong>Risk Management:</strong> Identifying days with high volatility can aid in risk management. By knowing which days are prone to significant fluctuations, investors can adjust their positions to mitigate potential losses.</p>
</li>
<li>
<p><strong>Market Timing:</strong> Insights from daily metrics can assist in timing market activities more effectively. If specific days historically have favorable metrics, this knowledge can be leveraged to optimize buy/sell decisions.</p>
</li>
</ol>
<p><strong>Negative Growth Insights:</strong></p>
<ol>
<li>
<p><strong>Market Instability:</strong> If the chart reveals high volatility or erratic behavior in the metrics on certain days, it might indicate underlying market instability. Such insights could lead to cautious investment approaches and potentially lower returns if not managed well.</p>
</li>
<li>
<p><strong>Overreliance on Daily Patterns:</strong> Relying heavily on daily patterns without considering broader trends can be risky. Daily variations might not always align with long-term market trends, and focusing too much on daily data could result in missed opportunities or misinformed decisions.</p>
</li>
</ol>
</li>
</ol>


<ol>
<li>
<p><strong>Negative Growth Insights:</strong></p>
<ol>
<li>
<p><strong>Overfitting on Short-Term Fluctuations:</strong> Overreliance on daily stock metrics for decision-making can lead to poor long-term planning. Stock prices fluctuate for numerous reasons, and making decisions based solely on daily data may overlook broader market trends. This can lead to inaccurate forecasts and potential losses, especially if the market shifts unpredictably.</p>
</li>
<li>
<p><strong>False Positives in Trend Detection:</strong> The insight gained might lead to detecting trends that don't exist or have short-term effects. This could mislead traders and businesses into making unnecessary moves, such as frequent buying or selling, which can incur transaction costs and erode overall returns.</p>
</li>
<li>
<p><strong>Volatility Risk:</strong> If businesses focus too heavily on high volatility days without careful management, it could increase exposure to risk. Investing based on volatile days might result in significant losses if market conditions change unexpectedly.</p>
</li>
</ol>
<p>In conclusion, while the insights gained can indeed have a positive business impact by optimizing strategies and managing risk, businesses should exercise caution to avoid overfitting on short-term data and must integrate these findings with broader market trends to avoid negative impacts on growth.</p>
</li>
</ol>

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Calculate the correlation matrix
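# numeric_only=True keeps the datetime 'Date' column out of the correlation calculation
correlation_matrix = df.corr(numeric_only=True)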

# Create a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()


##### 1. Why did you pick the specific chart?

The correlation heatmap was chosen because it provides a visual overview of the relationships between numerical variables. It is a comprehensive and efficient way to understand the strength and direction of correlations within the dataset, helping to identify multicollinearity or features that may have strong predictive power. The heatmap allows quick detection of linear relationships between variables, which is critical when selecting features for model building, particularly for regression-based models like linear regression.

##### 2. What is/are the insight(s) found from the chart?

<p>From the heatmap, one might identify:</p>
<ul>
<li><strong>Strongly correlated features</strong>: For example, variables such as <code>Open</code>, <code>Close</code>, <code>Low</code>, and <code>High</code> prices in stock market datasets are often highly correlated because they tend to move together. Strong correlations may suggest redundancy in the data.</li>
<li><strong>Multicollinearity</strong>: If two or more variables are highly correlated (with correlation values close to +1 or -1), it could lead to multicollinearity issues in certain models (like linear regression). For example, if <code>Open</code> and <code>Close</code> are highly correlated, one of them might be redundant.</li>
<li><strong>Negative or weak correlations</strong>: These can show which features are less related or even inversely related, providing an understanding of the relationships that might affect predictive accuracy.</li>
</ul>
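As a rough illustration of how strongly correlated pairs could be read off programmatically rather than by eye, the sketch below lists every pair of numeric columns whose absolute correlation exceeds a threshold. The helper name `highly_correlated_pairs`, the 0.9 threshold, and the synthetic `demo` frame are assumptions made for this example; they are not part of the notebook's data or results.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(frame, threshold=0.9):
    """Return (feature_a, feature_b, r) for column pairs with |r| >= threshold."""
    corr = frame.select_dtypes(include="number").corr()
    cols = list(corr.columns)
    pairs = []
    # Walk the upper triangle so each pair is reported once
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = float(corr.iloc[i, j])
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(r, 3)))
    return pairs

# Synthetic stand-in for the stock data (hypothetical values)
rng = np.random.default_rng(0)
open_price = rng.uniform(10, 300, size=200)
demo = pd.DataFrame({
    "Open": open_price,
    "High": open_price + rng.uniform(0, 10, size=200),
    "Low": open_price - rng.uniform(0, 10, size=200),
    "month": rng.integers(1, 13, size=200),
})

pairs = highly_correlated_pairs(demo, threshold=0.9)
for a, b, r in pairs:
    print(f"{a} ~ {b}: r = {r}")
```

On the real data, the same helper could be called as `highly_correlated_pairs(df)` to get a shortlist of candidate redundant features.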

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

<p>Yes, the insights from the heatmap can help create a positive business impact in several ways:</p>
<ul>
<li><strong>Improved Feature Selection</strong>: By identifying and removing highly correlated features, we can reduce the risk of multicollinearity, which improves the model's stability and interpretability. For example, removing one of two highly correlated predictors (e.g., <code>Open</code> or <code>Low</code>) could simplify the model without losing much predictive power. <code>Close</code> is the prediction target, so it is kept as the target rather than dropped.</li>
<li><strong>Targeted Analysis</strong>: Strong correlations can help prioritize the analysis of specific features that are more predictive, allowing businesses to focus on the most relevant data, reducing noise in decision-making processes.</li>
<li><strong>Model Efficiency</strong>: By removing redundant or irrelevant variables, model training times can be reduced, and the generalization ability of the model can be improved, reducing the risk of overfitting.</li>
</ul>
<p>However, insights from the heatmap could also highlight potential negative factors:</p>
<ul>
<li><strong>Over-Reliance on Correlated Variables</strong>: If the analysis overly depends on a few highly correlated variables, it could lead to a biased model. For example, if all stock price-related features (<code>Open</code>, <code>Close</code>, etc.) are highly correlated and treated as independent, it might result in misleading predictions.</li>
</ul>

#### Chart - 7

In [None]:
# Chart - 7 visualization code pair plot
sns.pairplot(df)
plt.show()

##### 1. Why did you pick the specific chart?

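<p>A pair plot was chosen because it shows every pairwise relationship and each feature's distribution in a single grid:</p>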
<ul>
<li><strong>Scatterplots</strong>: For each pair of numerical variables, a scatterplot will show how one feature relates to the other. This is useful for visualizing potential correlations or patterns between variables.</li>
<li><strong>Diagonals</strong>: The diagonal will display histograms of the individual features, providing insights into their distributions.</li>
</ul>

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
df.isnull().sum()

#### What all missing value imputation techniques have you used and why did you use those techniques?

No Missing values

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

In [None]:
plt.boxplot(df['Open'])

In [None]:
print(df['Open'].mean())
print(df['Open'].median())

In [None]:
df['Open'].value_counts().sort_index()

In [None]:
df['Open'] = np.where(df['Open'] > 250, df['Open'].mean(), df['Open'])


In [None]:
plt.boxplot(df['Open'])

In [None]:
plt.boxplot(df['High'])

In [None]:
plt.boxplot(df['Low'])

In [None]:
# Replace Low values above 250 with the column mean; keep all other Low values unchanged
df['Low'] = np.where(df['Low'] > 250, df['Low'].mean(), df['Low'])

In [None]:
plt.boxplot(df['Low'])

In [None]:
plt.boxplot(df['Close'])

In [None]:
# Replace Close values above 300 with the column mean; keep all other Close values unchanged
df['Close'] = np.where(df['Close'] > 300, df['Close'].mean(), df['Close'])

In [None]:
plt.boxplot(df['Close'])

##### What all outlier treatment techniques have you used and why did you use those techniques?

<p><strong>Technique Used</strong>: Outliers were identified visually with boxplots, which flag points lying far beyond the quartiles (based on the IQR). Values above a fixed threshold (250 for <code>Open</code> and <code>Low</code>, 300 for <code>Close</code>) were then replaced with the column mean. This keeps every row in the dataset while reducing the influence of extreme prices on the model.</p>
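The quartile-and-IQR rule behind boxplot detection can be sketched as follows. This is a minimal illustration, not the notebook's exact procedure: the helper name `replace_outliers_with_mean`, the conventional multiplier `k=1.5`, the choice to use the mean of the non-outlier values, and the sample `prices` are all assumptions made for this example.

```python
import pandas as pd

def replace_outliers_with_mean(series, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] and replace them with the inlier mean."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    is_outlier = (series < lower) | (series > upper)
    # Assumption: use the mean of the non-outlier values so the extremes do not inflate it
    inlier_mean = series[~is_outlier].mean()
    return series.where(~is_outlier, inlier_mean), is_outlier

# Hypothetical prices with one extreme value
prices = pd.Series([12.0, 14.5, 13.2, 15.1, 14.0, 13.8, 350.0])
cleaned, flagged = replace_outliers_with_mean(prices)
print(flagged.sum(), "outlier(s) replaced")
print(cleaned.tolist())
```

Compared with a hand-picked cutoff, the IQR bounds adapt to each column's spread, at the cost of possibly flagging genuine price spikes in a volatile stock.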

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Feature Manipulation & Selection

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting



# Check correlation matrix to identify highly correlated features
correlation_matrix = df[['Open', 'High', 'Low', 'Close']].corr()
print(correlation_matrix)

# Select features based on correlation and domain knowledge
# ('Close' is the target, so it is left out of the features to avoid leakage)
selected_features = ['Open', 'High', 'Low', 'month', 'day']

# Create feature and target arrays
X = df[selected_features]
y = df['Close']  # target variable


In [None]:
df.columns

##### What all feature selection methods have you used  and why?

<ol>
<li>
<h3><strong>Correlation Analysis</strong></h3>
<ul>
<li><strong>Method</strong>: Correlation analysis examines the relationship between two variables. Features that have a high correlation with the target variable (e.g., <code>Close</code> in stock data) are generally more useful for predictive modeling. Conversely, features that are highly correlated with each other (multicollinearity) might be redundant.</li>
<li><strong>Why</strong>: In the stock dataset, features like <code>Open</code>, <code>High</code>, <code>Low</code>, and <code>Close</code> prices are likely correlated. By analyzing correlations, we can determine which features are useful and which might lead to multicollinearity.</li>
<li><strong>Example</strong>: Dropping <code>Open</code> and <code>Low</code> from the predictors because they are very strongly correlated with the other price features, so they add little independent information for predicting <code>Close</code>.</li>
</ul>
</li>
</ol>
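Multicollinearity among predictors is often quantified with the variance inflation factor, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on the remaining features. The sketch below computes it with NumPy only, as an illustration under stated assumptions: the helper name `vif_table` and the synthetic `demo` frame are hypothetical, and each auxiliary regression includes an intercept, so the values may differ from those of other implementations that omit one.

```python
import numpy as np
import pandas as pd

def vif_table(features):
    """Compute VIF_j = 1 / (1 - R_j^2) for every column of a numeric DataFrame."""
    X = features.to_numpy(dtype=float)
    rows = []
    for j, name in enumerate(features.columns):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the other columns plus an intercept
        design = np.column_stack([np.ones(len(X)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r2 = 1.0 - residuals.var() / target.var()
        rows.append((name, 1.0 / (1.0 - r2)))
    return pd.DataFrame(rows, columns=["feature", "VIF"])

# Synthetic stand-in for the stock data (hypothetical values)
rng = np.random.default_rng(0)
open_price = rng.uniform(10, 300, size=200)
demo = pd.DataFrame({
    "Open": open_price,
    "High": open_price + rng.uniform(0, 10, size=200),
    "Low": open_price - rng.uniform(0, 10, size=200),
    "month": rng.integers(1, 13, size=200),
})

vif = vif_table(demo)
print(vif)
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign that a feature is largely explained by the others and is a candidate for removal.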

##### Which all features you found important and why?

<p>Features like <code>Open</code>, <code>High</code>, <code>Low</code>, and <code>Close</code> had strong correlations with each other, particularly <code>Close</code> with both <code>Open</code> and <code>Low</code>. This correlation indicates that these features are directly related to stock price movements and are thus important for predicting the closing price.</p>

### 6. Data Scaling

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

In [None]:
new_df = pd.get_dummies(df, drop_first=True)
new_df.info()

In [None]:
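dependent_columns = 'Close'
# Keep the original column order so the feature matrix is reproducible across runs
excluded_columns = {dependent_columns, 'Date', 'Open', 'Low'}
independent_columns = [col for col in df.columns if col not in excluded_columns]
print(dependent_columns)
print(independent_columns)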

In [None]:
df.head()

In [None]:
print(df[independent_columns])

In [None]:
print(df[dependent_columns])

In [None]:
x = df[independent_columns].values  # feature matrix
print(x)
y = df[dependent_columns].values  # target vector
print(y)


In [None]:
from sklearn.model_selection import train_test_split

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

In [None]:
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

##### What data splitting ratio have you used and why?

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
mean_absolut_error = []
mean_sq_error=[]
root_mean_sq_error=[]
training_score =[]
r2_list=[]
adj_r2_list=[]
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score


def score_metrix (model,X_train,X_test,Y_train,Y_test):

  '''
    train the model and gives mae, mse,rmse,r2,adj r2 score of the model

  '''
  #training the model
  model.fit(X_train,Y_train)

  # Training Score
  training  = model.score(X_train,Y_train)
  print("Training score  =", training)

  try:
      # finding the best parameters of the model if any
    print(f"The best parameters found out to be :{model.best_params_} \nwhere model best score is:  {model.best_score_} \n")
  except:
    pass


  # predicting the test set and evaluating the model
  # (all models are scored on the original scale of the target)
  Y_pred = model.predict(X_test)

  # mean absolute error
  MAE = mean_absolute_error(Y_test, Y_pred)
  print("MAE :", MAE)

  # mean squared error
  MSE = mean_squared_error(Y_test, Y_pred)
  print("MSE :", MSE)

  # root mean squared error
  RMSE = np.sqrt(MSE)
  print("RMSE :", RMSE)

  # r2 score
  r2 = r2_score(Y_test, Y_pred)
  print("R2 :", r2)

  # adjusted r2 score, which penalises r2 for the number of features
  n_obs, n_features = X_test.shape
  adj_r2 = 1 - (1 - r2) * ((n_obs - 1) / (n_obs - n_features - 1))
  print("Adjusted R2 : ", adj_r2, '\n')

  try:
    # plotting feature importance; this only applies when `model` is a fitted
    # search object wrapping a tree-based estimator and X_train is a DataFrame
    best = model.best_estimator_
    features = X_train.columns
    importances = best.feature_importances_
    indices = np.argsort(importances)
    plt.figure(figsize=(10,15))
    plt.title('Feature Importance')
    plt.barh(range(len(indices)), importances[indices], color='red', align='center')
    plt.yticks(range(len(indices)), [features[i] for i in indices])
    plt.xlabel('Relative Importance')
    plt.show()
  except Exception:
    pass

  # Here we appending the parameters for all models
  mean_absolut_error.append(MAE)
  mean_sq_error.append(MSE)
  root_mean_sq_error.append(RMSE)
  training_score.append(training)
  r2_list.append(r2)
  adj_r2_list.append(adj_r2)

  print('*'*80)
  # print the coefficients and intercept for models that expose them; skip otherwise
  try:
    print("coefficient \n", model.coef_)
    print('\n')
    print("Intercept  = ", model.intercept_)
    # use the function arguments rather than the global x_train / x_test
    tr_sc = model.predict(X_train)
    tr_ts = model.predict(X_test)
  except Exception:
    pass
  print('\n')
  print('*'*20, 'ploting the graph of Actual and predicted only with 80 observation', '*'*20)

  # ploting the graph of Actual and predicted only with 80 observation for better visualisation which model have these parameters and else we just pass them
  try:
    # ploting the line graph of actual and predicted values
    plt.figure(figsize=(15,7))
    plt.plot((Y_pred)[:80])
    plt.plot((np.array(Y_test)[:80]))
    plt.legend(["Predicted","Actual"])
    plt.show()
  except:
    pass

In [None]:
from sklearn.linear_model import Lasso, Ridge
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_squared_error,mean_absolute_error,mean_absolute_percentage_error
import math
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import VotingRegressor
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import PolynomialFeatures

In [None]:
score_metrix(LinearRegression(),x_train,x_test,y_train,y_test)


In [None]:
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
import pandas as pd

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

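def calc_vif(X):
    """Return the variance inflation factor of each column of the DataFrame X."""
    # Calculating VIF (X must be a DataFrame so that X.columns is available)
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

    return vif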

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

In [None]:
L1 = Lasso() #creating variable
parameters = {'alpha': [1e-15,1e-13,1e-10,1e-8,1e-5,1e-4,1e-3,1e-2,1e-1,1,5,10,20,30,40,45,50,55,60,100,0.0014]} #lasso parameters
lasso_cv = GridSearchCV(L1, parameters, cv=5) #using gridsearchcv and cross validate the model

In [None]:
score_metrix(lasso_cv,x_train,x_test,y_train,y_test)

In [None]:
L2 = Ridge() #creating variable
parameters = {'alpha': [1e-15,1e-10,1e-8,1e-5,1e-4,1e-3,1e-2,1,5,10,20,30,40,45,50,55,60,100,0.5,1.5,1.6,1.7,1.8,1.9]} # giving parameters
L2_cv = GridSearchCV(L2, parameters, scoring='r2', cv=5) #using gridsearchcv and cross validate the model
score_metrix(L2_cv,x_train,x_test,y_train,y_test) # fit and evaluate model with score_matrix function

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

In [None]:

# Assuming x and y are your feature matrix and target vector
x=df[independent_columns].values
y=df[dependent_columns].values


# Perform 5-Fold Cross Validation
cv_scores = cross_val_score(LinearRegression(), x, y, cv=5, scoring='neg_mean_squared_error')

# Print each cross-validation score and the average score
print("Cross-Validation MSE Scores:", -cv_scores)
print("Average Cross-Validation MSE:", -cv_scores.mean())


In [None]:
cv_scores_r2 = cross_val_score(LinearRegression(), x, y, cv=5, scoring='r2')
print("Cross-Validation R² Scores:", cv_scores_r2)
print("Average Cross-Validation R²:", cv_scores_r2.mean())

##### Which hyperparameter optimization technique have you used and why?

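<p>Several hyperparameter optimization techniques are commonly used in machine learning, each with its own strengths. Two of the most popular are:</p>
<ol>
<li><strong>Grid Search CV</strong>: exhaustively evaluates every combination in a predefined parameter grid. This is the technique applied above, where <code>GridSearchCV</code> with 5-fold cross-validation tunes <code>alpha</code> for the Lasso and Ridge models.</li>
<li><strong>Random Search CV</strong>: samples a fixed number of parameter combinations, which is cheaper when the search space is large.</li>
</ol>
<p>The choice of which to use depends on the specific use case, model, and computational resources available. Because only a single hyperparameter with a short list of candidate values is tuned here, an exhaustive grid search is inexpensive.</p>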
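The two search strategies named above can be compared side by side on a small problem. The sketch below is illustrative only: the synthetic `X_demo` and `y_demo`, the list of `alphas`, and `n_iter=3` are assumptions chosen for the example, not values or results from this project.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic regression data (hypothetical stand-in for the stock features)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(120, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=120)

alphas = [1e-3, 1e-2, 1e-1, 1, 10, 100]

# Grid search: evaluates every candidate alpha with 5-fold cross-validation
grid = GridSearchCV(Ridge(), {"alpha": alphas}, scoring="r2", cv=5)
grid.fit(X_demo, y_demo)

# Random search: evaluates only n_iter randomly chosen candidates
random_search = RandomizedSearchCV(
    Ridge(), {"alpha": alphas}, n_iter=3, scoring="r2", cv=5, random_state=0
)
random_search.fit(X_demo, y_demo)

print("Grid search best alpha:", grid.best_params_["alpha"])
print("Random search best alpha:", random_search.best_params_["alpha"])
print("Candidates tried:", len(grid.cv_results_["params"]),
      "vs", len(random_search.cv_results_["params"]))
```

Grid search is guaranteed to find the best value in the grid, while random search trades that guarantee for fewer model fits, which matters more as the number of hyperparameters grows.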

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

 **Not much**

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
from sklearn.ensemble import RandomForestRegressor

In [None]:
score_metrix(RandomForestRegressor(), x_train, x_test, y_train, y_test)

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***