# **Project Name**    - Stock Prediction (Yes Bank)


##### **Project Type**    - EDA/Regression
##### **Contribution**    - Team
##### Team Member 1 -  Pawan Gajbhiye
##### Team Member 2 - Rohan Parage
##### Team Member 3 - Sameer Marakala
##### Team Member 4 - Aaftab Patel

# **Project Summary -**

The objective of this project was to develop a robust machine learning model for predicting stock prices of Yes Bank, using historical data. This comprehensive study encompassed data exploration, preprocessing, visualization, hypothesis testing, feature engineering, and model implementation with hyperparameter optimization. The ultimate goal was to derive insights from the data that could help investors make informed decisions and potentially maximize their returns.

The dataset contains monthly stock metrics for Yes Bank: Date, Open, High, Low, and Close prices. It has no Volume column; a synthetic Volume series was generated later purely to illustrate one of the hypothesis tests. The initial exploration involved loading and examining the dataset to understand its structure, identify missing values, and detect any anomalies. The dataset was found to be complete with no missing values, eliminating the need for imputation.

The next step involved understanding the variables through descriptive statistics and visualization. Various plots like line charts, box plots, histograms, and candlestick charts were created to visualize trends, distributions, and relationships between variables. For example, line charts illustrated the overall trend in closing prices over time, while candlestick charts provided a detailed view of price movements within specific periods.

To further understand the data, hypothesis testing was conducted. Three hypotheses were formulated and tested: the difference in average closing prices before and after a chosen cutoff date, the variability in closing prices across different months, and the correlation between closing prices and trading volume (tested on synthetic volume data, since the dataset has no volume column). The tests provided statistical evidence supporting or refuting the hypotheses.

Feature engineering was used to enhance the predictive power of the models. Moving averages over 10 and 50 periods (each period is one month in this dataset) were created to capture trends. Additionally, a log transformation was applied to the synthetic Volume variable.

For model implementation, two machine learning models were employed: Linear Regression and Random Forest Regressor. The Linear Regression model reported a near-zero Mean Squared Error (MSE) and an R-squared (R2) score of 1.0. Scores this perfect are a warning sign rather than evidence of a strong fit: the feature list used for that run included Close, which is also the target, so the model could reproduce the answer directly (target leakage). The Random Forest model, which can capture non-linear patterns, was fine-tuned using GridSearchCV to optimize its hyperparameters.

The project concluded with saving the best-performing model for future use. The results indicated that the developed models could provide accurate stock price predictions, which are crucial for making informed investment decisions. This study highlights the importance of data-driven approaches in financial markets and demonstrates the potential of machine learning in predicting stock prices, thus offering valuable insights for investors and stakeholders.

# **GitHub Link -**

https://github.com/ArmanRaut/Stock-Prediction-ML-Module

# **Problem Statement**


**Predicting Monthly Closing Stock Prices for Yes Bank.**

The objective of this project is to predict the monthly closing stock price of Yes Bank. Given the historical stock price data since its inception, including the opening, closing, highest, and lowest prices for each month, the goal is to develop a predictive model that can accurately forecast the closing price.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
*   Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with specific reasons.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# Load Dataset
file_path = '/content/data_YesBank_StockPrices.csv'
stock_data = pd.read_csv(file_path)

### Dataset First View

In [None]:
# Dataset First Look
stock_data.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
stock_data.shape

### Dataset Information

In [None]:
# Dataset Info
stock_data.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_count = stock_data.duplicated().sum()
duplicate_count

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
missing_values = stock_data.isnull().sum()
missing_values

In [None]:
# Visualizing the missing values
sns.heatmap(stock_data.isnull(), cbar=False, cmap='viridis')
plt.show()

### What did you know about your dataset?

The dataset contains monthly stock prices of Yes Bank with the columns Date, Open, High, Low, and Close.

There are no duplicate values.

There are no missing values.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
stock_data.columns

In [None]:
# Dataset Describe
stock_data.describe()

### Variables Description

Date: The month and year the prices refer to.

Open: The opening price of the stock for that month.

High: The highest price of the stock during that month.

Low: The lowest price of the stock during that month.

Close: The closing price of the stock for that month.

### Check Unique Values for each variable.

In [None]:
unique_values = {}

for column in stock_data.columns:
    unique_count = stock_data[column].nunique()
    unique_values[column] = unique_count

# Print the result
unique_values

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
stock_data['Date'] = pd.to_datetime(stock_data['Date'], format='%b-%y')

# Sort data by Date
stock_data = stock_data.sort_values('Date')
stock_data

### What all manipulations have you done and insights you found?

Converted the Date column to datetime format for better manipulation and analysis.

Sorted the data by Date to maintain chronological order.
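
The date handling above can be tried on a small example. The following is a minimal sketch with made-up rows in the same `Mon-YY` layout; the values and the name `sample` are illustrative assumptions, not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample rows in the same 'Mon-YY' layout as the Date column
sample = pd.DataFrame({
    "Date": ["Sep-05", "Jul-05", "Aug-05"],
    "Close": [13.30, 12.46, 13.42],
})

# '%b' is the abbreviated month name and '%y' the two-digit year;
# the day of month defaults to 1 because the string carries none
sample["Date"] = pd.to_datetime(sample["Date"], format="%b-%y")

# Sort chronologically and rebuild a clean 0..n-1 index
sample = sample.sort_values("Date").reset_index(drop=True)
print(sample)
```

Resetting the index after sorting is optional, but it keeps positional operations such as rolling windows easier to reason about.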

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(10, 6))
plt.plot(stock_data['Date'], stock_data['Close'], label='Close Price')
plt.xlabel('Date')
plt.ylabel('Close Price')
plt.title('Close Price Over Time')
plt.legend()
plt.show()

##### 1. Why did you pick the specific chart?

A line chart effectively shows the trend of the closing price over time, making it easier to observe patterns and changes.

##### 2. What is/are the insight(s) found from the chart?

The chart shows the fluctuation in closing prices over the given period, highlighting periods of increase and decrease.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding the trend of stock prices can help investors make informed decisions, potentially leading to positive business impact.

Sudden drops might indicate periods of negative growth, which need further investigation.

#### Chart - 2

In [None]:
# Chart - 2 visualization code

plt.figure(figsize=(10, 6))
sns.boxplot(data=stock_data[['Open', 'High', 'Low', 'Close']])
plt.title('Boxplot of Stock Prices')
plt.show()

##### 1. Why did you pick the specific chart?

A boxplot is useful for visualizing the distribution and identifying outliers in stock prices.

##### 2. What is/are the insight(s) found from the chart?

The boxplot shows the range, quartiles, and potential outliers for each stock price type (Open, High, Low, Close).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Identifying outliers can help in understanding unusual price movements, which might be due to significant market events or errors in data.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Each row is one month, so pct_change() gives month-over-month returns
# (the column keeps the name 'Daily Return' because later cells refer to it)
stock_data['Daily Return'] = stock_data['Close'].pct_change()

# Histogram of monthly returns
plt.figure(figsize=(10, 5))
plt.hist(stock_data['Daily Return'].dropna(), bins=50, alpha=0.7, color='blue')
plt.xlabel('Monthly Return')
plt.ylabel('Frequency')
plt.title('Histogram of Monthly Returns')
plt.show()

##### 1. Why did you pick the specific chart?

A histogram is used to understand the distribution of month-over-month returns, highlighting the frequency of different return ranges.

##### 2. What is/are the insight(s) found from the chart?

The histogram shows the variability and risk associated with the stock, helping to identify the most common return rates.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding return distribution helps in risk assessment and investment strategy formulation. High variability might indicate higher risk.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Extract month from date
stock_data['Month'] = stock_data['Date'].dt.month

# Box plot of stock prices by month
plt.figure(figsize=(10, 5))
sns.boxplot(x='Month', y='Close', data=stock_data)
plt.xlabel('Month')
plt.ylabel('Close Price')
plt.title('Box Plot of Stock Prices by Month')
plt.show()

##### 1. Why did you pick the specific chart?

A box plot is used to show the distribution of stock prices for each month, highlighting median values and outliers.

##### 2. What is/are the insight(s) found from the chart?

The box plot shows monthly variations and seasonal trends in stock prices, identifying months with higher volatility or consistent performance.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Identifying seasonal trends aids in timing investment decisions for better returns. Understanding monthly volatility helps in risk management.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Each row is one month, so these windows span 10 and 50 months
stock_data['MA10'] = stock_data['Close'].rolling(window=10).mean()
stock_data['MA50'] = stock_data['Close'].rolling(window=50).mean()

# Plot close price and moving averages
plt.figure(figsize=(10, 5))
plt.plot(stock_data['Date'], stock_data['Close'], label='Close Price')
plt.plot(stock_data['Date'], stock_data['MA10'], label='10-Month MA')
plt.plot(stock_data['Date'], stock_data['MA50'], label='50-Month MA')
plt.xlabel('Date')
plt.ylabel('Price')
plt.title('Stock Prices with Moving Averages')
plt.legend()
plt.show()

##### 1. Why did you pick the specific chart?

Moving average plots smooth out price data to identify trends, helping to reduce noise in the data.

##### 2. What is/are the insight(s) found from the chart?

The plot shows short-term and long-term trends, indicating potential buy or sell signals based on crossovers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Moving averages help in identifying trends and making informed trading decisions, potentially leading to better investment outcomes.
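
The crossover idea mentioned above can be made concrete. The following is a minimal sketch, not part of the original analysis: the helper `crossover_signals`, the toy prices, and the window sizes are all illustrative assumptions.

```python
import pandas as pd

def crossover_signals(close: pd.Series, short_window: int, long_window: int) -> pd.Series:
    """Return +1 where the short MA crosses above the long MA, -1 where it crosses below, else 0."""
    short_ma = close.rolling(window=short_window).mean()
    long_ma = close.rolling(window=long_window).mean()
    # 1 while the short MA is above the long MA, 0 otherwise (comparisons with NaN are False)
    above = (short_ma > long_ma).astype(int)
    # diff() is +1 on an upward cross and -1 on a downward cross
    signals = above.diff().fillna(0).astype(int)
    # Only trust a cross when the long MA exists on both the current and the previous row
    valid = long_ma.notna()
    prev_valid = valid.shift(1, fill_value=False)
    signals[~(valid & prev_valid)] = 0
    return signals

# Hypothetical toy prices: a dip, a rally, then a decline
prices = pd.Series([10, 9, 8, 7, 8, 10, 12, 13, 11, 8, 6, 5], dtype=float)
signals = crossover_signals(prices, short_window=2, long_window=4)
print(signals[signals != 0])
```

A crossover is a descriptive signal only; whether it has any predictive value for this stock would need to be tested separately.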

#### Chart - 6

In [None]:
# Chart - 6 visualization code
import plotly.graph_objects as go

# Create candlestick chart
fig = go.Figure(data=[go.Candlestick(x=stock_data['Date'],
                                     open=stock_data['Open'],
                                     high=stock_data['High'],
                                     low=stock_data['Low'],
                                     close=stock_data['Close'])])

fig.update_layout(title='Candlestick Chart of Stock Prices',
                  xaxis_title='Date',
                  yaxis_title='Price')
fig.show()

##### 1. Why did you pick the specific chart?

A candlestick chart is chosen to provide detailed information on stock price movements within each trading period, including open, high, low, and close prices.

##### 2. What is/are the insight(s) found from the chart?

The candlestick chart reveals price patterns, trends, and potential reversal points, offering a comprehensive view of price action.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Detailed price action analysis helps traders in making timely and accurate trading decisions, improving investment performance.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
monthly_avg_close = stock_data.groupby(stock_data['Date'].dt.to_period('M'))['Close'].mean()

# Bar chart of monthly average closing prices
plt.figure(figsize=(10, 5))
monthly_avg_close.plot(kind='bar')
plt.xlabel('Month')
plt.ylabel('Average Close Price')
plt.title('Monthly Average Closing Prices')
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is used to compare the average closing prices for each month, making it easy to see differences between months.

##### 2. What is/are the insight(s) found from the chart?

The bar chart shows which months have higher or lower average closing prices, indicating potential seasonal effects.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Identifying months with higher average prices helps in timing investments, potentially leading to higher returns.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
# Bollinger Bands: a 20-period moving average plus/minus two rolling standard deviations
# (each period is one month in this dataset)
stock_data['MA20'] = stock_data['Close'].rolling(window=20).mean()
stock_data['STD20'] = stock_data['Close'].rolling(window=20).std()
stock_data['Upper Band'] = stock_data['MA20'] + (stock_data['STD20'] * 2)
stock_data['Lower Band'] = stock_data['MA20'] - (stock_data['STD20'] * 2)

# Plot Bollinger Bands
plt.figure(figsize=(10, 5))
plt.plot(stock_data['Date'], stock_data['Close'], label='Close Price')
plt.plot(stock_data['Date'], stock_data['MA20'], label='20-Month MA')
plt.plot(stock_data['Date'], stock_data['Upper Band'], label='Upper Band', linestyle='--')
plt.plot(stock_data['Date'], stock_data['Lower Band'], label='Lower Band', linestyle='--')
plt.xlabel('Date')
plt.ylabel('Price')
plt.title('Bollinger Bands')
plt.legend()
plt.show()

##### 1. Why did you pick the specific chart?

Bollinger Bands are chosen to identify volatility and potential price breakouts or breakdowns.

##### 2. What is/are the insight(s) found from the chart?

The chart shows periods of high and low volatility and potential buy or sell signals when the price touches the bands.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Bollinger Bands assist in identifying trading opportunities, helping to maximize returns and manage risks effectively.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
# Calculate MACD
stock_data['EMA12'] = stock_data['Close'].ewm(span=12, adjust=False).mean()
stock_data['EMA26'] = stock_data['Close'].ewm(span=26, adjust=False).mean()
stock_data['MACD'] = stock_data['EMA12'] - stock_data['EMA26']
stock_data['Signal Line'] = stock_data['MACD'].ewm(span=9, adjust=False).mean()

# Plot MACD
plt.figure(figsize=(10, 5))
plt.plot(stock_data['Date'], stock_data['MACD'], label='MACD')
plt.plot(stock_data['Date'], stock_data['Signal Line'], label='Signal Line')
plt.xlabel('Date')
plt.ylabel('MACD')
plt.title('MACD Plot')
plt.legend()
plt.show()

##### 1. Why did you pick the specific chart?

The MACD plot is chosen to identify changes in the strength, direction, momentum, and duration of a trend in a stock's price.

##### 2. What is/are the insight(s) found from the chart?

The MACD plot shows potential buy or sell signals when the MACD line crosses the signal line.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

MACD insights help in timing entry and exit points, improving trading decisions and potentially increasing returns.

#### Chart - 10

In [None]:
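# Chart - 10 visualization code
# RSI over a 14-period window, using simple rolling means of gains and losses
delta = stock_data['Close'].diff()
# Average gain: positive changes kept, everything else counted as 0
gain = (delta.where(delta > 0, 0)).rolling(window=14).mean()
# Average loss: negative changes kept as positive magnitudes
loss = (-delta.where(delta < 0, 0)).rolling(window=14).mean()
# Relative strength and its mapping onto the 0-100 RSI scale
rs = gain / loss
stock_data['RSI'] = 100 - (100 / (1 + rs))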

# Plot RSI
plt.figure(figsize=(10, 5))
plt.plot(stock_data['Date'], stock_data['RSI'], label='RSI')
plt.axhline(70, color='red', linestyle='--')
plt.axhline(30, color='green', linestyle='--')
plt.xlabel('Date')
plt.ylabel('RSI')
plt.title('RSI Plot')
plt.legend()
plt.show()

##### 1. Why did you pick the specific chart?

The RSI plot is chosen to identify overbought or oversold conditions in the stock market.

##### 2. What is/are the insight(s) found from the chart?

The RSI plot shows potential reversal points when the RSI crosses above 70 (overbought) or below 30 (oversold).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

RSI helps in timing trades by identifying potential buy or sell signals, improving trading strategies and investment returns.

#### Chart - 11 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(12, 8))
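# numeric_only=True skips the datetime 'Date' column (parameter available in pandas 1.5+)
sns.heatmap(stock_data.corr(numeric_only=True), annot=True, cmap='coolwarm')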
plt.title('Correlation Heatmap')
plt.show()

##### 1. Why did you pick the specific chart?

A correlation heatmap is chosen to visualize the relationships between different variables in the dataset.

##### 2. What is/are the insight(s) found from the chart?

The heatmap shows how different variables are correlated, indicating which variables move together and which are independent.

#### Chart - 12 - Pair Plot

In [None]:
# Pair Plot visualization code
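pair_grid = sns.pairplot(stock_data[['Open', 'High', 'Low', 'Close']])
# pairplot builds its own figure, so set a figure-level title instead of plt.title
pair_grid.figure.suptitle('Pair Plot', y=1.02)
plt.show()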

##### 1. Why did you pick the specific chart?

A pair plot is chosen to visualize the relationships between pairs of variables, including distributions and correlations.

##### 2. What is/are the insight(s) found from the chart?

The pair plot shows potential relationships between variables, helping to identify patterns and trends.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

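Based on the charts, three statements are tested:

1.   The average closing price differs before and after a chosen cutoff date.
2.   Mean closing prices differ across calendar months.
3.   Closing price is correlated with the volume of stocks traded.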

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Hypothesis:

Null Hypothesis (H0): There is no significant difference in the average closing price before and after a specific event (e.g., introduction of a new policy).

Alternate Hypothesis (H1): There is a significant difference in the average closing price before and after a specific event.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
event_date = '2015-01-01'
before_event = stock_data[stock_data['Date'] < event_date]['Close']
after_event = stock_data[stock_data['Date'] >= event_date]['Close']

# Perform t-test
from scipy.stats import ttest_ind
t_stat, p_value = ttest_ind(before_event, after_event)
t_stat, p_value

##### Which statistical test have you done to obtain P-Value?

I performed an independent t-test to compare the means of two independent groups (before and after the event).

##### Why did you choose the specific statistical test?

The t-test is appropriate for comparing the means of two groups to determine if they are significantly different from each other.
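
The default `ttest_ind` call assumes both groups have equal variance. When the two periods differ a lot in spread, Welch's variant is a common alternative. The sketch below uses randomly generated stand-in samples (an assumption, not the Yes Bank data) to show the call and the decision step. Note also that consecutive prices in a time series are not independent, so either p-value should be read with caution.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
# Hypothetical samples standing in for the two periods; note the different spreads
before = rng.normal(loc=50.0, scale=5.0, size=60)
after = rng.normal(loc=150.0, scale=80.0, size=60)

# equal_var=False selects Welch's t-test, which does not assume equal variances
t_stat, p_value = ttest_ind(before, after, equal_var=False)

# Compare the p-value with a significance level chosen in advance
alpha = 0.05
reject_null = p_value < alpha
print(f"t = {t_stat:.3f}, p = {p_value:.3g}, reject H0 at alpha={alpha}: {reject_null}")
```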

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Hypothesis:

Null Hypothesis (H0): The mean closing prices across different months do not differ significantly.

Alternate Hypothesis (H1): The mean closing prices across different months differ significantly.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

stock_data['Month'] = stock_data['Date'].dt.month

# Perform ANOVA
from scipy.stats import f_oneway
anova_result = f_oneway(*[stock_data[stock_data['Month'] == month]['Close'] for month in stock_data['Month'].unique()])
anova_result

##### Which statistical test have you done to obtain P-Value?

I performed a one-way ANOVA test to compare the means of multiple groups (closing prices across different months).

##### Why did you choose the specific statistical test?

The ANOVA test is suitable for comparing the means of three or more groups to see if at least one group mean is different from the others.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Hypothesis:

Null Hypothesis (H0): There is no correlation between the closing price and the volume of stocks traded.

Alternate Hypothesis (H1): There is a correlation between the closing price and the volume of stocks traded.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# The source dataset has no 'Volume' column, so a synthetic one is generated here
# purely to illustrate the test; the result says nothing about real trading volume
from scipy.stats import pearsonr
np.random.seed(42)  # fixed seed so the dummy data is reproducible
stock_data['Volume'] = np.random.randint(1000, 10000, size=len(stock_data))  # Dummy data for illustration

# Perform Pearson correlation test
correlation, p_value = pearsonr(stock_data['Close'], stock_data['Volume'])
correlation, p_value

##### Which statistical test have you done to obtain P-Value?

I performed the Pearson correlation test to measure the strength and direction of the linear relationship between two continuous variables (closing price and volume).

##### Why did you choose the specific statistical test?

The Pearson correlation test is appropriate for determining the linear relationship between two continuous variables.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
# The raw data has no missing values; drop the helper columns created for the charts above
chart_columns = ["Daily Return", "Month", "MA10", "MA20", "MA50", "STD20", "Upper Band",
                 "Lower Band", "EMA12", "EMA26", "MACD", "Signal Line", "RSI"]
stock_data = stock_data.drop(columns=chart_columns)

In [None]:
stock_data

#### What all missing value imputation techniques have you used and why did you use those techniques?

No missing values were found, so imputation was not necessary.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
Q1 = stock_data['Close'].quantile(0.25)
Q3 = stock_data['Close'].quantile(0.75)
IQR = Q3 - Q1
outliers = stock_data[(stock_data['Close'] < (Q1 - 1.5 * IQR)) | (stock_data['Close'] > (Q3 + 1.5 * IQR))]
outliers

##### What all outlier treatment techniques have you used and why did you use those techniques?

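The IQR method was used to detect outliers in Close: any value more than 1.5 × IQR below the first quartile or above the third quartile is flagged. Outliers can significantly affect the performance of machine learning models, so it’s important to detect them. In this notebook the flagged rows are only inspected; none are removed or capped.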
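
The IQR rule can be wrapped in a small helper. The sketch below also shows capping (clipping values to the bounds) as one possible treatment. That step is an assumption for illustration and is not applied in this notebook; the name `iqr_bounds` and the sample values are likewise hypothetical.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences of the IQR rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical closing prices with one extreme value
close = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 95.0])
lower, upper = iqr_bounds(close)

# Flag values outside the fences
is_outlier = (close < lower) | (close > upper)

# One possible treatment: cap values at the fences instead of dropping rows
capped = close.clip(lower=lower, upper=upper)
print(f"fences: [{lower}, {upper}], outliers flagged: {int(is_outlier.sum())}")
```

For stock prices, extreme values are often genuine market moves rather than errors, so capping or removing them is a modelling choice that should be justified.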

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
# Not applicable in this dataset as there are no categorical columns

#### What all categorical encoding techniques have you used & why did you use those techniques?

No categorical columns were present in the dataset.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
stock_data['MA_10'] = stock_data['Close'].rolling(window=10).mean()
stock_data['MA_50'] = stock_data['Close'].rolling(window=50).mean()
stock_data

In [None]:
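# The first rows have no complete window, so rolling() leaves NaN there; fill with the column mean.
# Assigning the result back avoids the chained inplace fillna, which newer pandas versions warn about
stock_data['MA_10'] = stock_data['MA_10'].fillna(stock_data['MA_10'].mean())
stock_data['MA_50'] = stock_data['MA_50'].fillna(stock_data['MA_50'].mean())
stock_data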

#### 2. Feature Selection

In [None]:
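# Select your features wisely to avoid overfitting
# 'Close' is the prediction target, so it is kept out of the inputs to avoid target leakage
features = ['Open', 'High', 'Low', 'Volume', 'MA_10', 'MA_50']
features

##### What all feature selection methods have you used  and why?

No formal feature selection method was applied; the features were chosen by hand. Moving averages were added as new features to capture the trend and smooth out the price data.

##### Which all features you found important and why?

The features Open, High, Low, Volume, MA_10, and MA_50 were selected for their relevance in predicting stock prices. Close is the target and is therefore excluded from the inputs. Note that Volume is synthetic, and that MA_10 and MA_50 are computed from Close including the current row, so they still carry some information about the target.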
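
One way to keep information about the current period's close out of the inputs is to build features only from past rows. The following is a minimal sketch under that assumption; the helper `add_lagged_features`, the column names it creates, and the demo values are hypothetical and not part of the original notebook.

```python
import pandas as pd

def add_lagged_features(df: pd.DataFrame, window: int = 3) -> pd.DataFrame:
    """Add features that use only closes from earlier rows."""
    out = df.copy()
    # Previous period's close: already known before the current period ends
    out["Close_lag1"] = out["Close"].shift(1)
    # Rolling mean over past closes only: shift first, then roll
    out[f"MA_{window}_lag"] = out["Close"].shift(1).rolling(window=window).mean()
    # Drop the warm-up rows that have no complete history
    return out.dropna().reset_index(drop=True)

# Hypothetical closes in chronological order
demo = pd.DataFrame({"Close": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]})
lagged = add_lagged_features(demo, window=3)
print(lagged)
```

Dropping the warm-up rows, rather than filling them with a column mean, also avoids mixing future values into early rows.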

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data
stock_data['Volume_log'] = np.log(stock_data['Volume'])
stock_data

### 6. Data Scaling

In [None]:
# Scaling your data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_features = scaler.fit_transform(stock_data[features])

In [None]:
scaled_features

##### Which method have you used to scale you data and why?

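StandardScaler was used to standardize features by removing the mean and scaling to unit variance, which is important for many machine learning algorithms. Note that `scaled_features` is computed on the full dataset and is not passed to the models below, which are trained on the unscaled `X`.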
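
If scaling is meant to feed a model, a common pattern is to put the scaler and the estimator in one pipeline so that the scaling statistics are learned from the training rows only. The sketch below uses randomly generated stand-in data (an assumption) and is not the notebook's actual workflow.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Hypothetical features and a target that depends linearly on them
X_demo = rng.normal(loc=100.0, scale=20.0, size=(50, 3))
y_demo = X_demo @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.5, size=50)

# Keep the last 10 rows aside as a test set
X_tr, X_te = X_demo[:40], X_demo[40:]
y_tr, y_te = y_demo[:40], y_demo[40:]

# fit() learns the scaling statistics from the training rows only;
# score() reuses those statistics on the test rows
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)
test_r2 = pipe.score(X_te, y_te)

# make_pipeline names each step after its lowercased class name
scaler_step = pipe.named_steps["standardscaler"]
print(f"test R2: {test_r2:.3f}")
```

Tree-based models such as Random Forest are insensitive to feature scaling, so this step matters mainly for linear or distance-based models.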

### 7. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
from sklearn.model_selection import train_test_split
X = stock_data[features]
y = stock_data['Close']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
print(X_train)
print(X_test)
print(y_train)
print(y_test)

##### What data splitting ratio have you used and why?

An 80-20 split was used to ensure sufficient data for both training and testing. Because `train_test_split` shuffles rows by default, the test set contains months that fall between training months rather than only later ones.
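
For time-ordered data, an alternative to a shuffled split is to keep the rows in chronological order and test only on later periods. The following is a minimal sketch of both a simple holdout and scikit-learn's `TimeSeriesSplit`; the array `X_demo` is a hypothetical stand-in for rows already sorted by date.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_rows = 20
# Hypothetical stand-in for rows that are already sorted by date
X_demo = np.arange(n_rows).reshape(-1, 1)

# Simple chronological holdout: the last 20% of rows form the test set
split_at = int(n_rows * 0.8)
train_idx = np.arange(split_at)
test_idx = np.arange(split_at, n_rows)

# Expanding-window cross-validation: every fold trains on the past and tests on the next block
tscv = TimeSeriesSplit(n_splits=4)
folds = list(tscv.split(X_demo))
for fold, (tr, te) in enumerate(folds):
    print(f"fold {fold}: train rows 0-{tr.max()}, test rows {te.min()}-{te.max()}")
```

A `TimeSeriesSplit` instance can also be passed as the `cv` argument of `GridSearchCV` in place of an integer.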

### 8. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

No. Class imbalance is a property of categorical targets; the target here (closing price) is continuous.

In [None]:
#Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

No balancing technique was applied. The dataset is not imbalanced, as the target variable (closing price) is continuous and not categorical.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Fit the Algorithm
model_lr = LinearRegression()
model_lr.fit(X_train, y_train)

# Predict on the model
y_pred_lr = model_lr.predict(X_test)

# Evaluate on the held-out test set
mse_lr = mean_squared_error(y_test, y_pred_lr)
r2_lr = r2_score(y_test, y_pred_lr)

mse_lr, r2_lr

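#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

Model: Linear Regression

Performance:

Mean Squared Error (MSE): 2.683692860045479e-25

R-squared (R2): 1.0

These scores were recorded with Close included among the input features. Because Close is also the target, the model can reproduce it exactly, so the figures reflect target leakage rather than genuine predictive accuracy.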

In [None]:
# Visualizing evaluation Metric Score chart
evaluation_metrics = ['MSE', 'R2 Score']
# Use the values computed above instead of hard-coded numbers
scores = [mse_lr, r2_lr]

# Plotting
plt.figure(figsize=(8, 6))
plt.bar(evaluation_metrics, scores, color=['blue', 'green'])
plt.title('Evaluation Metric Scores')
plt.xlabel('Metrics')
plt.ylabel('Score')
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression

# LinearRegression has few tunable options; only 'fit_intercept' is searched here
param_grid = {'fit_intercept': [True, False]}

# Create the GridSearchCV object with 5-fold cross-validation
grid_search = GridSearchCV(estimator=LinearRegression(), param_grid=param_grid, cv=5)

# Fit the Algorithm
grid_search.fit(X_train, y_train)

# Predict on the model using the best parameters found
best_params = grid_search.best_params_
y_pred_best_lr = grid_search.best_estimator_.predict(X_test)

# MSE of the tuned model on the test set
mse_best_lr = mean_squared_error(y_test, y_pred_best_lr)

# Print the best parameters and MSE
print("Best Parameters:", best_params)
print("Best MSE:", mse_best_lr)

##### Which hyperparameter optimization technique have you used and why?

GridSearchCV was used because it exhaustively evaluates every combination in the specified parameter grid with cross-validation and keeps the combination with the best cross-validated score. For Linear Regression the grid is small (only `fit_intercept`), so the search is cheap.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

In this case, the performance metrics before and after hyperparameter tuning are the same, which means no significant improvement was observed after tuning the hyperparameters.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# ML Model - 2 Implementation
from sklearn.ensemble import RandomForestRegressor

# Fit the Algorithm (random_state fixed so repeated runs give the same forest)
model_rf = RandomForestRegressor(random_state=42)
model_rf.fit(X_train, y_train)

# Predict on the model
y_pred_rf = model_rf.predict(X_test)

# Evaluate the untuned baseline on the test set
mse_rf = mean_squared_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)
mse_rf, r2_rf

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
param_grid_rf = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
grid_search_rf = GridSearchCV(estimator=model_rf, param_grid=param_grid_rf, cv=5)
grid_search_rf.fit(X_train, y_train)

best_params_rf = grid_search_rf.best_params_
y_pred_best_rf = grid_search_rf.best_estimator_.predict(X_test)

mse_best_rf = mean_squared_error(y_test, y_pred_best_rf)
r2_best_rf = r2_score(y_test, y_pred_best_rf)

best_params_rf, mse_best_rf, r2_best_rf

##### Which hyperparameter optimization technique have you used and why?

GridSearchCV was used to systematically search for the best hyperparameters by evaluating the model on different combinations of parameters.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

After tuning with GridSearchCV, the best parameters found were:

'max_depth': None

'min_samples_leaf': 2

'min_samples_split': 2

'n_estimators': 300

The updated performance metrics are:

Mean Squared Error (MSE):  45.7669297435576

R-squared (R2): 0.994936633040408

This indicates that tuning improved on the untuned Random Forest, with a lower MSE and a higher R2 score than the baseline values (`mse_rf`, `r2_rf`) computed above.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Mean Squared Error (MSE):

Indication: MSE measures the average squared difference between the predicted and actual values. Lower MSE indicates better predictive accuracy.

Business Impact: A lower MSE translates to more accurate stock price predictions, which can help investors make better decisions and potentially increase returns. Accurate predictions reduce the risk of investment and improve confidence in the model's predictions.

R-squared (R2) Score:

Indication: R2 score represents the proportion of the variance in the dependent variable that is predictable from the independent variables. A higher R2 score indicates better model performance.

Business Impact: A higher R2 means the model explains more variability in stock prices, leading to more reliable predictions. This reliability is crucial for financial decision-making, helping investors understand the strength of the relationship between features and stock prices. This can lead to more informed and strategic investment decisions.

By using these evaluation metrics, the business can assess the effectiveness of the predictive model and its potential impact on investment strategies and decision-making processes. Improved model performance directly contributes to better financial outcomes and strategic planning.
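
Both metrics follow directly from their definitions: MSE is the mean of the squared errors, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes them by hand on a tiny made-up example; the helper names and the numbers are illustrative assumptions.

```python
import numpy as np

def mse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean of the squared prediction errors."""
    return float(np.mean((y_true - y_pred) ** 2))

def r2(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """1 - (residual sum of squares / total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical actual and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

# Squared errors are 1, 0, 1, 0, so MSE = 2 / 4
# The mean of y_true is 6, so SS_tot = 9 + 1 + 1 + 9 = 20 and R2 = 1 - 2 / 20
print(f"MSE = {mse(y_true, y_pred):.2f}, R2 = {r2(y_true, y_pred):.2f}")
```

Because MSE is in squared price units, its square root (RMSE) is often easier to relate to actual price differences.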

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
import joblib
joblib.dump(grid_search_rf.best_estimator_, 'best_random_forest_model.pkl')

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the saved model and predict as a sanity check.
# X_test is reused here as a stand-in because no separate unseen dataset is available
loaded_model = joblib.load('best_random_forest_model.pkl')
unseen_predictions = loaded_model.predict(X_test)
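
A stricter sanity check is to confirm that the reloaded model reproduces the predictions of the in-memory model on rows that neither has seen. The following is a minimal sketch on randomly generated stand-in data; the file name, data, and model settings are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Hypothetical training data
X_demo = rng.uniform(0, 100, size=(40, 3))
y_demo = X_demo.sum(axis=1)

model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Save to a temporary directory and load the model back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib")
    joblib.dump(model, path)
    restored = joblib.load(path)

# Rows that were not used for training
X_new = rng.uniform(0, 100, size=(5, 3))
same = np.allclose(model.predict(X_new), restored.predict(X_new))
print("Round trip preserves predictions:", same)
```

Saved models should be loaded with the same scikit-learn version that produced them, since the serialized format is not guaranteed to be compatible across versions.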

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

This project built machine learning models to predict Yes Bank's monthly closing stock price. Data exploration, preprocessing, and feature engineering prepared the data; visualization revealed key trends and patterns; and hypothesis testing examined three statements about the series. The Linear Regression model reported an R-squared score of 1.0, but that result was obtained with Close among the input features, so it reflects target leakage rather than an excellent fit. The Random Forest model, fine-tuned with GridSearchCV, reported an R-squared of about 0.995 on the test set. The project highlights the potential of machine learning in financial forecasting and underscores the importance of data-driven approaches in financial markets. Future work should focus on leakage-free features, time-ordered validation, advanced models, additional datasets such as real trading volume, and real-time predictions to enhance practical utility.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***