# **Project Name**    - Yes Bank Stock Closing Price Prediction



##### **Project Type**    - Linear Regression ML
##### **Contribution**    - Individual


# **Project Summary -**

This project aims to build a predictive model for Yes Bank's stock closing prices using historical data, machine learning, and statistical techniques. Stock price prediction is inherently complex due to the influence of numerous factors, such as market trends, economic indicators, and news events. By analyzing patterns in Yes Bank’s past stock prices, the project seeks to deliver a model that can make informed predictions about future prices. This prediction model could aid investors, traders, and financial analysts in making data-driven decisions and understanding stock behavior.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


The objective of this project is to accurately predict the closing price of Yes Bank's stock for future trading days based on historical price data. Stock prices are highly volatile, affected by a combination of financial, social, and economic factors. Traditional models often fall short of accounting for these fluctuations, especially in the short term. Thus, the primary challenge is to develop a robust model capable of providing reliable stock price predictions for Yes Bank by leveraging time series forecasting, machine learning algorithms, or a hybrid approach.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following questions are answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure the following questions are answered for each algorithm.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
from datetime import datetime
from datetime import timedelta
from datetime import date
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.linear_model import Lasso, Ridge
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_squared_error,mean_absolute_error,mean_absolute_percentage_error
import math
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import VotingRegressor
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import PolynomialFeatures

### Dataset Loading

In [None]:
# Connecting google drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
df = pd.read_csv('/content/drive/MyDrive/Datasets/Copy of data_YesBank_StockPrices.csv')

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
len(df[df.duplicated()])

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(df.isnull().sum())

### What did you know about your dataset?

Answer Here

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

**Date:** month and year given

**Open:** opening price of the stock

**High:** highest price of the stock

**Low:** lowest price of the stock

**Close:** closing price of the stock

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Convert object format to datetime
df['Date'] = pd.to_datetime(df['Date'], format='%b-%y')


In [None]:
# Extract the month and year from date
df['Month'] = df['Date'].dt.strftime('%b')
df['Year'] = df['Date'].dt.year


In [None]:
# Convert month abbreviation to month number using datetime
df['Month_Num'] = pd.to_datetime(df['Month'], format='%b').dt.month
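The month number can also be read straight from the parsed date, without the intermediate abbreviation column. A minimal sketch on a toy frame (the three sample dates are illustrative and not taken from the dataset):

```python
import pandas as pd

# Toy frame with dates in the same 'Mon-YY' text format (illustrative values)
toy = pd.DataFrame({'Date': ['Jul-05', 'Aug-05', 'Jan-20']})

# Parse the text into timestamps: '%b' is the abbreviated month name, '%y' the two-digit year
toy['Date'] = pd.to_datetime(toy['Date'], format='%b-%y')

# Read the year and the month number directly from the datetime accessor
toy['Year'] = toy['Date'].dt.year
toy['Month_Num'] = toy['Date'].dt.month

print(toy)
```

This gives the same `Year` and `Month_Num` columns in one step per column.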

In [None]:
# Drop Columns
df.drop(['Month'],axis=1,inplace=True)
df.drop(['Date'],axis=1,inplace=True)

In [None]:
# Add New Column ROI(%)
df['ROI(%)'] = ((df['Close'] - df['Open']) / df['Open']) * 100

### What all manipulations have you done and insights you found?

Answer Here.

In [None]:
Avg_open_price = pd.DataFrame(df.groupby('Month_Num')['Open'].mean())
Avg_open_price

In [None]:
Avg_open_price_year = pd.DataFrame(df.groupby('Year')['Open'].mean())
Avg_open_price_year

In [None]:
Avg_high_price = pd.DataFrame(df.groupby('Month_Num')['High'].mean())
Avg_high_price

In [None]:
Avg_low_price = pd.DataFrame(df.groupby('Month_Num')['Low'].mean())
Avg_low_price

In [None]:
Avg_close_price = pd.DataFrame(df.groupby('Month_Num')['Close'].mean())
Avg_close_price

In [None]:
ROI_yearly = pd.DataFrame(df.groupby('Year')['ROI(%)'].mean())
ROI_yearly

In [None]:
# Lowest opening price by an year
Lowest_open = pd.DataFrame(df.groupby('Year')['Open'].min())
Lowest_open

In [None]:
# Highest opening price by an year
Highest_open = pd.DataFrame(df.groupby('Year')['Open'].max())
Highest_open

In [None]:
# Lowest closing price by an year
Lowest_close = pd.DataFrame(df.groupby('Year')['Close'].min())
Lowest_close

In [None]:
# Highest closing price by an year
Highest_close = pd.DataFrame(df.groupby('Year')['Close'].max())
Highest_close

In [None]:
# In which year having the highest Opening and Lowest Closing price
highest_open_lowest_close = pd.DataFrame(df.groupby('Year').agg({'Open': 'max', 'Close': 'min'}))
highest_open_lowest_close

In [None]:
# Higher High and Lower Low Stock Price by an Year
High_low_price = pd.DataFrame(df.groupby('Year').agg({'High': 'max', 'Low': 'min'}))
High_low_price

In [None]:
# Average Open and Close Stock Price by an Year
average_price = pd.DataFrame(df.groupby('Year').agg({'Open': 'mean', 'Close': 'mean'}))
average_price

In [None]:
# Highest Open and Close Stock Price by an Year
highest_price = pd.DataFrame(df.groupby('Year').agg({'Open': 'max', 'Close': 'max'}))
highest_price

In [None]:
# Lowest Open and Close Stock Price by an Year
lowest_price = pd.DataFrame(df.groupby('Year').agg({'Open': 'min', 'Close': 'min'}))
lowest_price

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

Because this project deals with stock prices over time, most of the charts below are line charts. They make trends easy to read even when many data points are shown.

#### Chart - 1 Plotting the line chart

In [None]:
plt.figure(figsize=(10, 6))
plt.plot(highest_open_lowest_close.index, highest_open_lowest_close['Open'], marker='o', label='Highest Open', color='green')
plt.plot(highest_open_lowest_close.index, highest_open_lowest_close['Close'], marker='o', label='Lowest Close', color='red')

# Adding labels and title
plt.xlabel('Year')
plt.ylabel('Stock Price')
plt.title('Highest Opening and Lowest Closing price by an Year')
plt.legend()

# Displaying the plot
plt.grid(True)
plt.show()

##### 1. What is/are the insight(s) found from the chart?

Highest Open Price - 375 in 2018

Lowest Close Price - 275 in 2017

#### Chart - 2 Plotting the line chart

In [None]:
plt.figure(figsize=(10, 6))
plt.plot(High_low_price.index, High_low_price['High'], marker='o', label='Higher High', color='green')
plt.plot(High_low_price.index, High_low_price['Low'], marker='o', label='Lower Low', color='red')

# Adding labels and title
plt.xlabel('Year')
plt.ylabel('Stock Price')
plt.title('Higher High and Lower Low Stock Price by an Year')
plt.legend()

# Displaying the plot
plt.grid(True)
plt.show()

##### 1. What is/are the insight(s) found from the chart?

Higher High Price - 400 in 2018

Lower Low Price - 225 in 2017

#### Chart - 3 Plotting the line chart

In [None]:
plt.figure(figsize=(10, 6))
plt.plot(average_price.index, average_price['Open'], marker='o', label='Average Open', color='blue')
plt.plot(average_price.index, average_price['Close'], marker='o', label='Average Close', color='red')

# Adding labels and title
plt.xlabel('Year')
plt.ylabel('Average Stock Price')
plt.title('Average Open and Close Stock Price by an Year')
plt.legend()

# Displaying the plot
plt.grid(True)
plt.show()

##### 1. What is/are the insight(s) found from the chart?

Average Opening and Average Closing stock prices is very close in each year.

Here,

Highest Average Opening - 315 in 2017

Highest Average Closing - 325 in 2017

#### Chart - 4 Plotting the line chart

In [None]:
plt.figure(figsize=(10, 6))
plt.plot(highest_price.index, highest_price['Open'], marker='o', label='Highest Open', color='blue')
plt.plot(highest_price.index, highest_price['Close'], marker='o', label='Highest Close', color='red')

# Adding labels and title
plt.xlabel('Year')
plt.ylabel('Stock Price')
plt.title('Highest Open and Close Stock Price by an Year')
plt.legend()

# Displaying the plot
plt.grid(True)
plt.show()

#### Chart - 5 Plotting the line chart

In [None]:
plt.figure(figsize=(10, 6))
plt.plot(lowest_price.index, lowest_price['Open'], marker='o', label='Lowest Open', color='blue')
plt.plot(lowest_price.index, lowest_price['Close'], marker='o', label='Lowest Close', color='red')

# Adding labels and title
plt.xlabel('Year')
plt.ylabel('Stock Price')
plt.title('Lowest Open and Close Stock Price by Year')
plt.legend()

# Displaying the plot
plt.grid(True)
plt.show()

#### Chart - 6

In [None]:
# DataFrame.plot opens its own figure, so the size is passed to it directly
Avg_open_price.plot(marker='o', color='b', figsize=(8, 6))
plt.xlabel('Month')
plt.ylabel('Average Opening Price')
plt.title('Average Opening Price Over Months')
plt.legend()
plt.grid(True)
plt.show()

#### Chart - 7

In [None]:
# DataFrame.plot opens its own figure, so the size is passed to it directly
Avg_close_price.plot(marker='o', color='b', figsize=(14, 8))
plt.xlabel('Month')
plt.ylabel('Average Closing Price')
plt.title('Average Closing Price Over Months')
plt.legend()
plt.grid(True)
plt.show()

#### Chart - 8

In [None]:
# DataFrame.plot opens its own figure, so the size is passed to it directly
Avg_high_price.plot(marker='o', color='g', figsize=(14, 8))
plt.xlabel('Month')
plt.ylabel('Average High Price')
plt.title('Average High Price Over Months')
plt.legend()
plt.grid(True)
plt.show()

#### Chart - 9

In [None]:
# DataFrame.plot opens its own figure, so the size is passed to it directly
Avg_low_price.plot(marker='o', color='r', figsize=(14, 8))
plt.xlabel('Month')
plt.ylabel('Average Low Price')
plt.title('Average Low Price Over Months')
plt.legend()
plt.grid(True)
plt.show()

#### Chart - 10 - Correlation (Heat Map)

In [None]:
dataset_corr = df[['Open', 'High', 'Low', 'Close','Month_Num', 'Year', 'ROI(%)']]

In [None]:
plt.figure(figsize=(12,6))
sns.heatmap(dataset_corr.corr(), cmap=plt.cm.CMRmap_r, annot=True)

##### 1. What is/are the insight(s) found from the chart?

I picked this chart because a correlation heatmap shows the correlation coefficients between variables in a matrix format, using color to highlight strong and weak relationships.

**Benefits:** It quickly visualizes which variables are strongly correlated (positive or negative), making it easy to spot potential predictors and identify multicollinearity issues.

**Uses:** Used frequently in EDA, especially for feature selection.

#### Chart - 11- Closing Price (Distribution Chart)

In [None]:
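plt.figure(figsize=(7, 7))
# sns.distplot is deprecated; histplot with a KDE overlay draws the equivalent chart
sns.histplot(df['Close'], kde=True, stat='density', color='r')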

##### 1. Why did you pick the specific chart?

Histograms display the frequency distribution of a single variable, helping us understand its distribution shape (e.g., normal, skewed, uniform).

**Benefits:** They help identify the central tendency, spread, and shape of the data distribution, as well as outliers and data skewness.

**Uses:** Ideal for univariate analysis to check the underlying distribution of a variable, informing data transformations and adjustments.

##### 2. What is/are the insight(s) found from the chart?

Shape = Right skewed

Spread = Mostly between 10 and 360 Closing price.

Maximum density = between 10 and 100 Closing price
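The visual impression of right skew can be checked numerically. A minimal sketch, assuming toy prices (illustrative values, not the dataset) and using a log transform as one common remedy, which this notebook itself does not apply:

```python
import numpy as np
import pandas as pd

# Toy right-skewed prices (illustrative values, not from the dataset)
prices = pd.Series([10, 12, 15, 20, 30, 60, 150, 360], dtype=float)

# Positive sample skewness indicates a longer right tail
raw_skew = prices.skew()

# A log transform compresses large values and usually reduces right skew
log_skew = np.log(prices).skew()

print(f"skew before: {raw_skew:.2f}, after log: {log_skew:.2f}")
```

On the real data, the same two calls could be applied to `df['Close']`.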

#### Chart - 12- Numerical Feature Distributions (Histograms)

In [None]:
num_features = df.describe().columns
num_features

In [None]:
for col in num_features:
  fig = plt.figure(figsize=(9,6))
  ax = fig.gca()
  feature = df[col]
  feature.hist(bins = 50, ax=ax)
  ax.axvline(feature.mean(),color='magenta', linestyle='dashed',linewidth=2)
  ax.axvline(feature.median(),color='cyan', linestyle='dashed',linewidth=2)
  ax.set_title(col)
plt.show()

#### Chart - 13- Correlation (scatter plot)

In [None]:
# Compare every numerical feature except the target itself against Close
for col in num_features.drop('Close'):
  fig = plt.figure(figsize=(9,6))
  ax = fig.gca()
  feature =(df[col])
  label = df['Close']
  correlation = feature.corr(label)
  plt.scatter(x=feature, y=label)
  plt.xlabel(col)
  plt.ylabel('Closing Price')
  ax.set_title('Close vs ' + col + '- correlation: ' + str(correlation))
  z = np.polyfit(df[col], df['Close'], 1)
  y_hat = np.poly1d(z)(df[col])

  plt.plot(df[col], y_hat, "r--", lw=1)
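Each scatter plot overlays a least-squares line obtained from `np.polyfit`. A minimal sketch on toy data (illustrative values that lie exactly on a line) showing what the fitted coefficients and the correlation represent:

```python
import numpy as np
import pandas as pd

# Toy feature and target on the exact line y = 2x + 1 (illustrative values)
feature = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
target = 2 * feature + 1

# A degree-1 polyfit returns [slope, intercept] of the least-squares line
slope, intercept = np.polyfit(feature, target, 1)

# Pearson correlation is 1 for a perfectly increasing linear relation
correlation = feature.corr(target)

# Evaluate the fitted line at the feature values, as done for the dashed trend line
y_hat = np.poly1d([slope, intercept])(feature)

print(slope, intercept, correlation)
```

With real price columns the points scatter around the line, and the correlation in the chart title measures how tightly they do so.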

#### Chart - 14- Pair Plot

In [None]:
sns.pairplot(df)

##### 1. Why did you pick the specific chart?

I picked the pair plot because it visualizes relationships between multiple variables in a dataset, displaying a scatter plot for each pair of features and a histogram for each individual feature.

**Benefits:** They’re excellent for identifying patterns, trends, and potential correlations between variables. They can also reveal clusters, distributions, and outliers.

**Uses:** Useful in exploratory data analysis (EDA) to see how variables interact with each other and to detect any linear or non-linear relationships.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words and digits contain digits.

In [None]:
# Remove URLs & Remove words and digits contain digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### What all feature selection methods have you used  and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data
df.info()

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Split your data to train and test.
x = df.drop('Close',axis=1)
y = df['Close']

In [None]:
# Train test split our data
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2,random_state=42)

In [None]:
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)
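Since the rows are monthly observations, a shuffled split lets the model train on months that come after some of the test months. A chronological hold-out is a common alternative. The sketch below is an illustration under stated assumptions, not the split used above: `chronological_split` is a hypothetical helper, the toy frame stands in for `df`, and `ROI(%)` is dropped from the features because it is computed from `Close` itself.

```python
import pandas as pd

def chronological_split(frame, target='Close', test_size=0.2, drop_cols=('ROI(%)',)):
    """Hypothetical helper: hold out the most recent rows as the test set."""
    # Order rows by time so that the last rows are the latest months
    ordered = frame.sort_values(['Year', 'Month_Num']).reset_index(drop=True)
    # Number of rows held out for testing (at least one)
    n_test = max(1, round(len(ordered) * test_size))
    n_train = len(ordered) - n_test
    # Remove the target and any columns derived from it
    to_drop = [target] + [c for c in drop_cols if c in ordered.columns]
    features = ordered.drop(columns=to_drop)
    labels = ordered[target]
    return (features.iloc[:n_train], features.iloc[n_train:],
            labels.iloc[:n_train], labels.iloc[n_train:])

# Toy data with the same column names as the notebook (values are illustrative)
toy = pd.DataFrame({
    'Open': [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0],
    'High': [11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0],
    'Low': [9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0],
    'Close': [10.5, 11.5, 12.5, 13.5, 14.5, 15.5, 16.5, 17.5, 18.5, 19.5],
    'Year': [2005] * 6 + [2006] * 4,
    'Month_Num': [7, 8, 9, 10, 11, 12, 1, 2, 3, 4],
})
toy['ROI(%)'] = (toy['Close'] - toy['Open']) / toy['Open'] * 100

x_tr, x_te, y_tr, y_te = chronological_split(toy)
print(x_tr.shape, x_te.shape)
```

Scores from such a split are usually lower than from a shuffled one, but they are closer to how the model would be used on future months.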

##### What data splitting ratio have you used and why?

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

In [None]:
# Lists that collect each model's metrics, in the order the models are evaluated
# (the metric functions are already imported in the Import Libraries cell)
mean_absolut_error = []
mean_sq_error = []
root_mean_sq_error = []
training_score = []
r2_list = []
adj_r2_list = []


def score_metrix(model, X_train, X_test, Y_train, Y_test):
  '''
    Train the model, print MAE, MSE, RMSE, R2 and adjusted R2 on the test set,
    and append them to the global metric lists.
  '''
  # Training the model
  model.fit(X_train, Y_train)

  # Training score (R2 for plain regressors, the search's scorer for GridSearchCV)
  training = model.score(X_train, Y_train)
  print("Training score  =", training)

  # Report the best parameters when the model is a fitted search object
  if hasattr(model, 'best_params_'):
    print(f"The best parameters found out to be :{model.best_params_} \nwhere model best score is:  {model.best_score_} \n")

  # Predicting the test set and evaluating the model.
  # The target is not transformed, so every model is scored on the original price scale.
  # (The earlier check `model == LinearRegression()` compared object identity and was never true.)
  Y_pred = model.predict(X_test)

  # Mean absolute error
  MAE = mean_absolute_error(Y_test, Y_pred)
  print("MAE :", MAE)

  # Mean squared error
  MSE = mean_squared_error(Y_test, Y_pred)
  print("MSE :", MSE)

  # Root mean squared error
  RMSE = np.sqrt(MSE)
  print("RMSE :", RMSE)

  # R2 score
  r2 = r2_score(Y_test, Y_pred)
  print("R2 :", r2)

  # Adjusted R2 score: penalizes R2 for the number of predictors
  n_rows, n_cols = X_test.shape
  adj_r2 = 1 - (1 - r2) * ((n_rows - 1) / (n_rows - n_cols - 1))
  print("Adjusted R2 : ", adj_r2, '\n')

  # Appending the metrics of this model to the global lists
  mean_absolut_error.append(MAE)
  mean_sq_error.append(MSE)
  root_mean_sq_error.append(RMSE)
  training_score.append(training)
  r2_list.append(r2)
  adj_r2_list.append(adj_r2)

  print('*'*80)
  # Print the coefficients and intercept for models that expose them
  # (for a grid search, look at the refitted best estimator)
  fitted = getattr(model, 'best_estimator_', model)
  if hasattr(fitted, 'coef_'):
    print("coefficient \n", fitted.coef_)
    print('\n')
    print("Intercept  = ", fitted.intercept_)
  print('\n')
  print('*'*20, 'plotting actual vs predicted for the first 80 observations', '*'*20)

  # Line graph of predicted and actual values, limited to 80 observations for readability
  plt.figure(figsize=(15,7))
  plt.plot(np.array(Y_pred)[:80])
  plt.plot(np.array(Y_test)[:80])
  plt.legend(["Predicted","Actual"])
  plt.show()

### ML Model - 1 - Linear Regression

In [None]:
score_metrix(LinearRegression(),x_train,x_test,y_train,y_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

The linear regression model tries to find the best-fit line that minimizes the sum of squared differences between actual values and predicted values.

**Interpretation:**

**MSE & RMSE:** The lower these values, the better the model's predictions. Here, a test-set MSE of 68.86 and an RMSE of 8.29 indicate good accuracy, though improvements might be possible.

**MAE:** An MAE of 5.38 shows that predictions deviate from actual values by about 5.38 price units on average.

**R-squared:** An R-squared of 0.9923 indicates that 99.23% of the variance in the closing price is explained by the model, which is generally a strong fit.

**Adjusted R2:** The adjusted R2 of 0.9908 is close to the R2 of 0.9923, which suggests that the predictors are relevant rather than merely inflating the fit.
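The adjusted R2 follows from the R2, the number of test rows n, and the number of predictors p. A minimal sketch: `adjusted_r2` is a hypothetical helper, and n = 37 and p = 6 are assumed values chosen for illustration (the row count is not stated in this notebook).

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Assumed values for illustration: 37 test rows and 6 predictors
adj = adjusted_r2(0.9923, n_samples=37, n_features=6)
print(round(adj, 4))
```

The penalty grows as p approaches n, which is why the adjusted value is always at or below the plain R2 for a model with at least one predictor.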

##### Which hyperparameter optimization technique have you used and why?

**I used cross-validation because:**

Plain linear regression has few hyperparameters to tune, so the main check is k-fold cross-validation. The dataset is split into k folds; the model is trained on k-1 folds and validated on the remaining one.

This process is repeated k times, with each fold serving as the validation set once, and the results are averaged to give a more reliable measure of model performance.

It helps detect whether the model is overfitting or underfitting.

### **For Model Improvement Techniques**

####  1 - Cross Validation & Hyperparameter Tuning

In [None]:
model = LinearRegression()
from sklearn.model_selection import cross_val_score, KFold

In [None]:
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

In [None]:
mse_scores = cross_val_score(model, x, y, cv=kfold, scoring='neg_mean_squared_error')

# Convert negative MSE scores to positive and calculate average MSE
mse_scores = -mse_scores
mean_mse = np.mean(mse_scores)
std_mse = np.std(mse_scores)
print("Cross-Validation MSE Scores for each fold:", mse_scores)
print("Average MSE:", mean_mse)
print("Standard Deviation of MSE:", std_mse)

**In this case:**

The mean MSE of 46.17 means that, averaged across the folds, the squared prediction error is 46.17, which corresponds to an RMSE of roughly 6.8 price units. A high value here would indicate that the model is not sufficiently accurate for predicting stock prices.

The standard deviation of 17.05 shows that the MSE varies noticeably from fold to fold, suggesting that the model may not be equally robust across all data subsets and could potentially be improved.

Actions Based on Evaluation

**If Mean MSE is high:** Consider adding more features, using a more complex model, or performing feature engineering.

**If Std MSE is high:** Check for outliers or data anomalies, try regularization to reduce overfitting, or increase the dataset size for stability.

Overall, a good-performing model should have both a low mean MSE and a low std MSE for consistency and accuracy.
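MSE is in squared price units, so it can help to restate the cross-validation summary on the price scale. A minimal sketch using the mean and standard deviation reported above; the relative spread (standard deviation divided by mean) is an extra summary introduced here for illustration.

```python
import math

mean_mse = 46.17  # average MSE across the 5 folds, as reported above
std_mse = 17.05   # standard deviation of the fold MSEs, as reported above

# The square root puts the error back in price units
rmse_from_mean = math.sqrt(mean_mse)

# Spread of the fold scores relative to their mean
relative_spread = std_mse / mean_mse

print(f"RMSE of mean MSE: {rmse_from_mean:.2f}")
print(f"Relative spread:  {relative_spread:.2f}")
```

A relative spread well above zero is a sign that performance depends on which rows end up in the validation fold.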

### 2 - Lasso with hyperparameter tuning

In [None]:
L1 = Lasso() #creating variable
parameters = {'alpha': [1e-15,1e-13,1e-10,1e-8,1e-5,1e-4,1e-3,1e-2,1e-1,1,5,10,20,30,40,45,50,55,60,100,0.0014]} #lasso parameters
lasso_cv = GridSearchCV(L1, parameters, cv=5) #using gridsearchcv and cross validate the model

In [None]:
score_metrix(lasso_cv,x_train,x_test,y_train,y_test)

### 3 - **Ridge with hyperparameter tuning**

In [None]:
L2 = Ridge() #creating variable
parameters = {'alpha': [1e-15,1e-10,1e-8,1e-5,1e-4,1e-3,1e-2,1,5,10,20,30,40,45,50,55,60,100,0.5,1.5,1.6,1.7,1.8,1.9]} # giving parameters
L2_cv = GridSearchCV(L2, parameters, scoring='r2', cv=5) #using gridsearchcv and cross validate the model


In [None]:
score_metrix(L2_cv,x_train,x_test,y_train,y_test) # fit and evaluate model with score_matrix function

### Model - 2 - **XGBoost Regression**

In [None]:
params = {'learning_rate':[0.5,1,1.5,2],'n_estimators':[80,100,150],'max_depth':[15,20,30]}
xgb_grid_search= GridSearchCV(XGBRegressor(),param_grid=params,)
score_metrix(xgb_grid_search,x_train,x_test,y_train,y_test)

### Model - 3 - **KNN Regressor**

In [None]:
knn = KNeighborsRegressor()
score_metrix(knn,x_train,x_test,y_train,y_test)

### Model - 4 - **Random Forest Regressor**

In [None]:
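# Parameters for Random Forest
# None (not the string 'none') means unlimited depth;
# x has 6 feature columns, so max_features cannot exceed 6
param_grid = {"n_estimators":[50,100,150],
              'max_depth' : [10,15,20,25,None],
              'min_samples_split': [10,50,100],
              'max_features' :[2,4,6]}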

In [None]:
# Using GridSearchCV
Random_forest_Grid_search = GridSearchCV(RandomForestRegressor(),param_grid=param_grid,n_jobs=-1,verbose=2)
score_metrix(Random_forest_Grid_search,x_train,x_test,y_train,y_test)

### Model - 5 - **Gradient Boosting Regressor**

In [None]:
param_grid = {'learning_rate':[0.15,0.1,0.05,0.02,0.20],
              'n_estimators':[100,150,200,250],
              'max_depth':[2,4,6,10]}

gradient_boost_grid_search = GridSearchCV(GradientBoostingRegressor(), param_grid=param_grid, n_jobs=-1, verbose=2)
score_metrix(gradient_boost_grid_search,x_train,x_test,y_train,y_test)

### Model - 6 - **AdaBoost Regressor**

In [None]:
# # parameters for Ada Boost Regressor
param_grid = {'n_estimators': [50,100,150,200],
          'learning_rate':[0.5,1,1.5,2]}

Ada_boost_grid_search = GridSearchCV(AdaBoostRegressor(),param_grid=param_grid,n_jobs=-1)
score_metrix(Ada_boost_grid_search,x_train,x_test,y_train,y_test)
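The metric lists filled by `score_metrix` can be gathered into one table for a side-by-side comparison. A minimal sketch: `build_score_table` is a hypothetical helper, and the numbers passed to it below are dummy placeholders, not results from this notebook.

```python
import pandas as pd

def build_score_table(model_names, mae, mse, rmse, r2, adj_r2):
    """Hypothetical helper: one row per model, sorted by RMSE (lowest first)."""
    table = pd.DataFrame({
        'Model': model_names,
        'MAE': mae,
        'MSE': mse,
        'RMSE': rmse,
        'R2': r2,
        'Adjusted R2': adj_r2,
    })
    return table.sort_values('RMSE').reset_index(drop=True)

# Dummy placeholder values for two imaginary models
demo = build_score_table(
    model_names=['model_a', 'model_b'],
    mae=[2.0, 1.0],
    mse=[9.0, 4.0],
    rmse=[3.0, 2.0],
    r2=[0.90, 0.95],
    adj_r2=[0.89, 0.94],
)
print(demo)
```

In the notebook, the arguments would be the lists `mean_absolut_error`, `mean_sq_error`, `root_mean_sq_error`, `r2_list` and `adj_r2_list`, with the model names listed in the order the models were scored.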

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

For practical evaluation in stock price prediction, MAE and RMSE are commonly used to measure how close the predictions are to the actual prices.

MAE gives a clear view of the average error. It is easy to interpret: it shows how far our predictions are, on average, from the actual prices.

RMSE penalizes larger errors more heavily, so it reflects the impact of occasional large prediction misses on trading outcomes.
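The difference between the two metrics shows up when errors are uneven. A minimal sketch with toy error values (illustrative, not from the models above); `mae_and_rmse` is a hypothetical helper.

```python
import numpy as np

def mae_and_rmse(errors):
    """Return (MAE, RMSE) for an array of prediction errors."""
    errors = np.asarray(errors, dtype=float)
    mae = np.mean(np.abs(errors))
    rmse = np.sqrt(np.mean(errors ** 2))
    return mae, rmse

# Two toy error sets with the same average absolute error
even_mae, even_rmse = mae_and_rmse([3, 3, 3, 3])    # every prediction off by 3
spiky_mae, spiky_rmse = mae_and_rmse([1, 1, 1, 9])  # one large miss

print(f"even:  MAE={even_mae:.2f}, RMSE={even_rmse:.2f}")
print(f"spiky: MAE={spiky_mae:.2f}, RMSE={spiky_rmse:.2f}")
```

Both sets have the same MAE, but the set with one large miss has the higher RMSE, which is why RMSE is the more sensitive of the two to occasional big errors.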

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Of the models built above, I chose linear regression, checked with cross-validation and complemented by the Lasso and Ridge variants with tuned regularization strength, as the final model. It gave good accuracy with lower errors than the other models (XGBoost, KNN, Random Forest, Gradient Boosting, and AdaBoost).

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Linear regression has fewer hyperparameters than more complex models such as decision trees, so there is typically less hyperparameter optimization to do. However, some techniques and choices can still improve performance, especially with the regularized versions of linear regression.

In this case, cross-validation together with tuning the regularization strength of Lasso (L1) and Ridge (L2) is sufficient, because the task is simple and the dataset is small.

# **Conclusion**

In summary, the linear regression model provides a straightforward approach to predicting stock prices with good accuracy: its predictions fall within a close range of the actual values. It is the best-fit model here because the dataset is small. However, given the volatility and non-linear characteristics of stock markets, this model will have limitations in highly dynamic environments. For more precise applications, we recommend exploring more advanced, non-linear models to capture additional market complexities.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***