<a href="https://colab.research.google.com/github/ShivamV01/ML-Project/blob/main/YesBank_StockPrices_MLproject.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -



##### **Project Type**    - Regression
##### **Contribution**    - Individual
##### **Team Member 1 -** Shivam Vishwakarma

# **Project Summary -**

Accurately forecasting the closing price of Yes Bank's stock is a critical challenge for investors, market participants, and stakeholders due to the bank's recent tumultuous history. As a prominent private sector bank in India, Yes Bank has faced significant financial distress, marked by a substantial accumulation of bad loans and allegations of fraudulent activities. This has resulted in regulatory intervention by the Reserve Bank of India, creating an environment of uncertainty and complexity surrounding the bank's stock price trajectory.

This project endeavors to address this challenge by harnessing a comprehensive dataset encompassing monthly stock price information from the bank's inception. The dataset includes essential metrics such as the closing, starting, highest, and lowest prices for each month, providing a rich source of historical data for analysis. The overarching aim is to develop robust predictive models capable of capturing the nuanced dynamics and trends in Yes Bank's stock prices, even in the face of recent adverse events and the associated volatility.

To achieve this, the project will employ a multifaceted approach, leveraging a variety of modeling techniques that include time series models and regression methods. A thorough evaluation will be conducted to assess the efficacy of these models in accurately predicting Yes Bank's closing stock price. Furthermore, the models will be rigorously tested to gauge their ability to incorporate the impact of pivotal events, such as fraud cases involving the bank's founders or regulatory actions taken by the Reserve Bank of India.

The successful prediction of Yes Bank's closing stock price holds the potential to offer invaluable insights to stakeholders, empowering them to make well-informed investment decisions. By navigating the complex and unpredictable nature of Yes Bank's stock performance, this project aims to deepen the understanding of its financial standing and contribute to effective decision-making in the financial markets.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


The main objective of this project is to develop a robust and accurate predictive model that can effectively forecast the closing price of Yes Bank's stock. The challenge lies in understanding and capturing the complex dynamics and trends in the stock prices, considering various factors such as the historical trend of an increasing price followed by a sudden decline after 2018.

One of the key challenges in developing the predictive model is addressing the issue of multicollinearity present in the dataset. Multicollinearity occurs when there is a high correlation between independent variables, which can lead to difficulties in interpreting the model and can affect the accuracy of the predictions. Therefore, the model should incorporate techniques to handle multicollinearity and ensure that the independent variables are appropriately considered in the prediction process.
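One common way to quantify multicollinearity is the variance inflation factor (VIF). The sketch below is illustrative only: it uses synthetic price-like columns rather than the Yes Bank data, and the helper name `vif_from_correlation` is hypothetical. It relies on the identity that each VIF equals the matching diagonal entry of the inverse correlation matrix.

```python
import numpy as np
import pandas as pd


def vif_from_correlation(features: pd.DataFrame) -> pd.Series:
    """Return the variance inflation factor of each column.

    VIF_j equals the j-th diagonal entry of the inverse of the
    predictors' correlation matrix.
    """
    corr = features.corr().to_numpy()
    inv_corr = np.linalg.inv(corr)
    return pd.Series(np.diag(inv_corr), index=features.columns, name="VIF")


# Synthetic, illustrative data: three price-like columns that move together.
rng = np.random.default_rng(0)
base = np.cumsum(rng.normal(0, 5, size=200)) + 100
demo = pd.DataFrame({
    "Open": base + rng.normal(0, 1, size=200),
    "High": base + 3 + rng.normal(0, 1, size=200),
    "Low": base - 3 + rng.normal(0, 1, size=200),
})

vif = vif_from_correlation(demo)
print(vif.round(1))
```

As a rule of thumb, VIF values well above roughly 5 to 10 are often read as a sign of strong multicollinearity; the exact threshold is a convention, not a hard rule.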

Furthermore, the model should account for significant events that have had an impact on Yes Bank's stock performance. This includes events such as fraud cases involving the bank's founders and regulatory interventions by the Reserve Bank of India. These events can significantly influence the stock prices, and it is crucial for the predictive model to capture and reflect their effects accurately.

In terms of performance, the model should aim for a high level of accuracy in forecasting the closing price of Yes Bank's stock. The 99% accuracy achieved by the K-Nearest Neighbors (KNN) Regression model serves as a benchmark, indicating the target accuracy that the developed model should strive to achieve. By achieving high accuracy, the predictive model can provide valuable insights to stakeholders, investors, and market participants, enabling them to make informed decisions and effectively manage their investments in Yes Bank's stock.
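The text above does not say which metric the 99% figure refers to. For a regression task, "accuracy" is usually expressed through error metrics such as RMSE, MAE, MAPE, or the R² score. The following is a minimal sketch of those metrics; the function name `regression_report` and the numbers are made up for illustration and are not results from this project.

```python
import numpy as np


def regression_report(y_true, y_pred) -> dict:
    """Compute common regression error metrics for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    rmse = float(np.sqrt(np.mean(residuals ** 2)))
    mae = float(np.mean(np.abs(residuals)))
    # MAPE is undefined when a true value is zero; stock prices are positive.
    mape = float(np.mean(np.abs(residuals / y_true)) * 100)
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    return {"RMSE": rmse, "MAE": mae, "MAPE_%": mape, "R2": r2}


# Illustrative numbers only, not results from this project.
actual = [100.0, 120.0, 150.0, 90.0]
predicted = [98.0, 123.0, 147.0, 93.0]
report = regression_report(actual, predicted)
print(report)
```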

Overall, this project seeks to develop a predictive model that addresses the complexities and challenges associated with forecasting Yes Bank's stock prices. The ultimate goal is to provide stakeholders with a reliable tool that can enhance their understanding of the stock's future performance and support them in making well-informed investment decisions.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime

# Import Plotly graph objects for interactive visualizations
import plotly.graph_objects as go

# TODO: import the required sklearn modules

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
link = "/content/drive/MyDrive/Almabetter/Capstone video and projects by me/Module 6/data_YesBank_StockPrices.csv"

stock_df = pd.read_csv(link)

### Dataset First View

In [None]:
# Dataset First Look
stock_df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
rows, columns = stock_df.shape
print(f"Number of Rows: {rows}")
print(f"Number of Columns: {columns}")

### Dataset Information

In [None]:
# Dataset Info
stock_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicated_count=stock_df.duplicated().sum()
print(duplicated_count)

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
Null_values = stock_df.isnull().sum()
print(Null_values)

In [None]:
# Handling missing values: fill any NaNs in the price columns with the column mean
# (a no-op here, since the dataset has no null values)
columns_to_fill = ['Open', 'High', 'Low', 'Close']
df_filled = stock_df[columns_to_fill].fillna(stock_df[columns_to_fill].mean())
print(df_filled)

### What did you know about your dataset?

*   The data includes the Open, High, Low, and Close prices of the Yes Bank stock on a monthly basis, along with the date (month and year) of each record.
*   The data has 185 rows and 5 columns. There are no null values and no duplicate rows in the dataframe.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
stock_df.columns.tolist()

In [None]:
# Dataset Describe
stock_df.describe()

### Variables Description

*   There are 185 values in each column.
*   The min and max value for following columns are:
    1.   Open   = 10 to 369.95
    2.   High   = 11.24 to 404
    3.   Low    = 5.55 to 345.5
    4.   Close  = 9.98 to 367.9


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for column in stock_df.columns:
    unique_values = stock_df[column].unique()
    print(f"Unique values in column '{column}': {unique_values}")
stock_df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Check Unique Values for each variable.
for variable in stock_df.columns:
  print(f"The unique values for the '{variable}' variable are:\n\n {stock_df[variable].unique()}\n\n")

In [None]:
# Saving a copy of the original dataframe
og_df = stock_df.copy()

**Data Type Correction**

Since the dataset does not contain any duplicate or null values, no treatment is needed for them. We can proceed to detecting and handling outliers.

However, the datatype of the Date column is currently object (string). We need to change it to datetime format, because the column represents a month and year, and strings cannot be sorted, indexed, or plotted in chronological order.

To change the datatype of the Date column, we can use the pd.to_datetime() function. For example, the following code would change the datatype of the Date column to datetime:

In [None]:
# Checking the exact datatype of the entries under the 'Date' column
type(stock_df['Date'][0])

In [None]:
# Changing the Date column datatype to datetime format.
# The dates are strings of format %b-%y (%b: month as the locale's abbreviated
# name, %y: year without century as a zero-padded decimal number).
stock_df['Date'] = pd.to_datetime(stock_df['Date'], format='%b-%y')

In [None]:
# Check the datatype of the columns after changing datatype of date
# using 'info()' method
print(stock_df.info())

In [None]:
#Check the datatype of the columns after changing datatype of date
stock_df.dtypes
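
The `%b-%y` format used to parse the Date column has one subtlety worth keeping in mind: `%y` is a two-digit year, and Python's `strptime` maps 69 to 99 onto 1969 to 1999 and 00 to 68 onto 2000 to 2068. The sketch below uses made-up labels in the same layout (they are assumptions, not rows from the dataset) to show the behavior.

```python
from datetime import datetime

# Hypothetical sample labels in the same 'Mon-YY' layout as the Date column.
samples = ["Jul-05", "Dec-18", "Jan-69"]

parsed = [datetime.strptime(label, "%b-%y") for label in samples]
for label, value in zip(samples, parsed):
    print(label, "->", value.date())
```

Since each label carries no day, every parsed value falls on the first day of its month.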

In [None]:
# Setting the 'Date' column as the index
stock_df = stock_df.set_index('Date')

In [None]:
# Set the background color of the DataFrame to a gradient
# using 'style.background_gradient()' method
stock_df.head().style.background_gradient(cmap='hot')

In [None]:
# Set the background color of the DataFrame to a gradient
# using `style.background_gradient()` method
stock_df.tail().style.background_gradient(cmap='hot')

In [None]:
dependent_variable = ['Close']
independent_variables = list(stock_df.columns[:-1])

### What all manipulations have you done and insights you found?

Upon examining the provided dataframe, it becomes apparent that all the columns exclusively consist of numerical data. There is an absence of any categorical data in the dataset, which means that the information available for analysis primarily comprises quantitative values. This characteristic enables direct application of numerical calculations, statistical analyses, and modeling techniques to the dataset. The lack of categorical data simplifies data processing and ensures a streamlined approach when performing quantitative analyses.

Furthermore, during the examination of the dataset, it is evident that outliers are present. These outliers are data points that significantly deviate from the majority of the data. Before proceeding with modeling or conducting further analysis, it is crucial to address these outliers. Dealing with outliers involves assessing their impact on the data and making decisions regarding appropriate actions, such as removing or transforming them. By addressing the outliers, we can enhance the robustness and reliability of our models and analyses.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 Bar chart: Month VS Open price

In [None]:
# Chart - 1 visualization code
# Set the figure size before plotting
plt.figure(figsize=(40, 10))  # Increase width and height of the chart

# Create the barplot
# 'Date' is the index at this point, so expose it as a column for seaborn
sns.barplot(x='Date', y='Open', data=stock_df.reset_index())

# Add titles and labels
plt.title('Month VS Open price', fontsize=16)  # Increased font size for clarity
plt.xlabel('Date', fontsize=12,fontweight='bold')
plt.ylabel('Open price', fontsize=12)

# Rotate x-axis labels for better readability
plt.xticks(rotation=90, fontweight='bold')

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

I chose a bar chart to compare the opening price across every month and year in the data.
The bars are ordered chronologically, starting from the first month of available data.
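
To look at the opening price per year rather than per month, the monthly values can be aggregated first. This is a minimal sketch on a hypothetical frame (`demo`, with made-up values) that mirrors the layout of `stock_df` after the Date column becomes the index; averaging is one possible choice of aggregate.

```python
import pandas as pd

# Hypothetical monthly frame with a DatetimeIndex, mirroring stock_df's layout.
dates = pd.date_range("2017-01-01", periods=24, freq="MS")
demo = pd.DataFrame({"Open": range(100, 124)}, index=dates)

# Average the monthly opening prices within each calendar year.
yearly_open = demo["Open"].groupby(demo.index.year).mean()
print(yearly_open)
```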


##### 2. What is/are the insight(s) found from the chart?

*   In the Year of 2018



#### Chart - 2 Months VS Close price

In [None]:
# Chart - 2 visualization code
# Set the figure size before plotting
plt.figure(figsize=(40, 10))  # Increase width and height of the chart

# Create the barplot
# 'Date' is the index at this point, so expose it as a column for seaborn
sns.barplot(x='Date', y='Close', data=stock_df.reset_index())

# Add titles and labels
plt.title('Month VS Close price', fontsize=16)  # Increased font size for clarity
plt.xlabel('Date', fontsize=12,fontweight='bold')
plt.ylabel('Close price', fontsize=12)

# Rotate x-axis labels for better readability
plt.xticks(rotation=90, fontweight='bold')

# Show the plot
plt.show()

#### Chart - 4 candle stick graph with price movement

In [None]:
# Chart - 4 visualization code
# Create a Figure object with Candlestick chart
fig = go.Figure(go.Candlestick(
    x = stock_df.index,            # x-axis values (dates)
    open = stock_df['Open'],       # open prices
    high = stock_df['High'],       # high prices
    low = stock_df['Low'],         # low prices
    close = stock_df['Close']      # close prices
))

# Update the layout of the figure with a title
fig.update_layout(
    title={'text': 'Describing the Price Movements', 'x': 0.5, 'y': 0.95, 'font': {'color': 'white'}},
    xaxis=dict(title='Year', title_font={'color': 'white'}, tickfont={'color': 'white'}),
    yaxis=dict(title='Price', title_font={'color': 'white'}, tickfont={'color': 'white'}),
    width=1450,
    height=1000,
    plot_bgcolor='rgb(36, 40, 47)',  # Set the background color to a professional dark gray
    paper_bgcolor='rgb(51, 56, 66)'  # Set the paper color
)


# Show the figure
fig.show()


##### 1. Why did you pick the specific chart?

We chose the candlestick chart to analyze price movements because it shows the open, high, low, and close prices of each interval in a single glyph, which is why it is widely used in the financial analysis of stocks and other assets. Each candlestick represents one time interval (here, one month). The color of a candlestick shows whether the price rose or fell over that interval, the body spans the opening and closing prices, and the wicks mark the highest and lowest prices reached. These features help us identify patterns, trends, and potential price reversals, which supports decisions about buying or selling an asset. The larger figure size improves visibility and allows a more detailed reading of the price movements.
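
The anatomy of a candlestick described above can be computed directly from the four price columns. The sketch below uses hypothetical OHLC rows (not taken from the dataset) and illustrative column names such as `upper_wick`.

```python
import pandas as pd

# Hypothetical OHLC rows for illustration; not taken from the dataset.
candles = pd.DataFrame(
    {
        "Open": [100.0, 120.0, 90.0],
        "High": [125.0, 122.0, 95.0],
        "Low": [98.0, 85.0, 88.0],
        "Close": [120.0, 90.0, 90.0],
    },
    index=["month_1", "month_2", "month_3"],
)

body_top = candles[["Open", "Close"]].max(axis=1)
body_bottom = candles[["Open", "Close"]].min(axis=1)

summary = pd.DataFrame({
    # Positive change: close above open (an "increasing" candle).
    "change": candles["Close"] - candles["Open"],
    "body": body_top - body_bottom,
    "upper_wick": candles["High"] - body_top,
    "lower_wick": body_bottom - candles["Low"],
})
summary["direction"] = summary["change"].apply(
    lambda c: "up" if c > 0 else ("down" if c < 0 else "flat")
)
print(summary)
```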

##### 2. What is/are the insight(s) found from the chart?

The analysis of Yes Bank stock prices reveals a distinct pattern. Prior to 2018, the stock exhibited a consistent upward trend, indicating positive growth and reflecting investor optimism. However, a significant decline occurred after this period, primarily attributed to the Yes Bank fraud case involving Rana Kapoor, the former CEO.

Leading up to 2018, the stock experienced a continuous rise, demonstrating favorable market conditions and investor confidence. However, the revelation of the fraud case involving Rana Kapoor had a profound impact on the stock's performance. This event marked a turning point, as the stock prices sharply declined.

The fraud case involving Rana Kapoor significantly affected investor sentiment, eroding trust and confidence in Yes Bank. Consequently, the stock's value experienced a notable decrease, reflecting the negative repercussions of the scandal on the company's reputation and financial stability.

Overall, the analysis highlights the contrasting trends in Yes Bank's stock prices. Pre-2018, there was a consistent upward trajectory, while the post-2018 period witnessed a significant decline due to the repercussions of the fraud case involving Rana Kapoor.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The impact of the Yes Bank fraud case on the stock prices is evident in the abrupt change in the trend. The case brought about increased scrutiny and regulatory interventions, causing a negative sentiment surrounding the bank's future prospects. Consequently, investors reacted by selling off their shares, leading to a rapid decline in the stock prices.

It is important to consider external factors, such as legal proceedings and market sentiment, when interpreting the observed drop in stock prices. The Yes Bank fraud case involving Rana Kapoor significantly affected investor perception and had a direct impact on the stock's value, resulting in the observed downturn in stock prices after 2018.

#### Chart - 5 Line plot showcasing variations in each feature over the years

In [None]:
import plotly.express as px

# Plotting line graph wrt Date and Low prices
fig = px.line(og_df, x="Date", y="Low")

# Add additional traces for Open, Close, and High prices
fig.add_scatter(x=og_df['Date'], y=og_df['Open'], name="Open",
                line_color='lime', marker_color='hotpink', marker_size=10)
fig.add_scatter(x=og_df['Date'], y=og_df['Close'], name="Close",
                line_color='cyan', marker_color='magenta', marker_size=10)
fig.add_scatter(x=og_df['Date'], y=og_df['High'], name="High",
                line_color='gold', marker_color='deepskyblue', marker_size=10)
fig.add_scatter(x=og_df['Date'], y=og_df['Low'], name="Low",
                line_color='orange', marker_color='chartreuse', marker_size=10)

# Update the layout of the plot
fig.update_layout(
    title={'text': "Yes Bank Prices with Respect to Year", 'x': 0.5, 'y': 0.95, 'xanchor': 'center', 'yanchor': 'top', 'font': {'color': 'white'}},
    xaxis_title="Year",
    yaxis_title="Price",
    width=1400,
    height=800,
    plot_bgcolor='rgb(36, 40, 47)',  # Set the dark gray background color of the plot
    paper_bgcolor='rgb(51, 56, 66)',  # Set the dark gray background color of the paper area
    font_color='white',  # Set the font color to white
    legend=dict(x=0.02, y=0.98, bgcolor='rgba(255, 255, 255, 0.7)', bordercolor='gray', borderwidth=1, font={'color': 'black'}),  # Dark text for contrast on the light legend box
    margin=dict(l=50, r=50, t=100, b=50),
    xaxis=dict(tickangle=90)  # Rotate x-axis labels by 90 degrees
)
# Show the plot
fig.show()

##### 1. Why did you pick the specific chart?

The specific chart chosen for this analysis is a combination of line and scatter plots. This chart type is suitable for visualizing the individual changes in Open, High, Low, and Close prices of the Yes Bank stock over time. By utilizing line plots, we can observe the overall trends and patterns, while scatter plots allow us to identify specific data points. The chart effectively presents the data by distinguishing each price variable with unique line colors and marker styles. The layout of the plot is optimized to include a centered title, clear axis labels, and an appropriate size. Additionally, the choice of color scheme and background enhances visual appeal and readability. Overall, this chart enables a comprehensive analysis of the Yes Bank stock prices, aiding in the identification of trends, patterns, and potential insights for informed decision-making.

##### 2. What is/are the insight(s) found from the chart?

The dip in the price variables after 2018 is prominently visible in the chart. The line graph shows a notable decrease in the Open, High, Low, and Close prices of the Yes Bank stock from that point on. This decline can be attributed to various factors, such as the impact of the Yes Bank fraud case involving Rana Kapoor, which adversely affected investor sentiment and confidence in the bank. The scatter plots further accentuate the dip, as they highlight individual data points that deviate significantly from the preceding upward trend. By visually representing the price variables over time, the chart effectively showcases the substantial decrease in prices after 2018, emphasizing the challenging period faced by Yes Bank and the subsequent decline in its stock value.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights can potentially help create a positive business impact by providing valuable information for decision-making and strategic planning. By analyzing the Yes Bank stock price data and observing the significant dip after 2018, businesses and investors can adjust their strategies accordingly. These insights can guide them in making informed decisions about investing in Yes Bank or adjusting their existing holdings. Additionally, the insights can alert businesses to the need for diligent risk management practices and thorough due diligence when evaluating financial institutions.

Regarding insights leading to negative growth, the significant dip in the Yes Bank stock prices after 2018 can be seen as a negative growth trend. The decline in stock prices can be attributed to various factors, including the Yes Bank fraud case involving Rana Kapoor. This event had a detrimental impact on investor sentiment and eroded trust in the bank, resulting in a decrease in its stock value. The negative growth observed in this scenario highlights the importance of maintaining ethical practices, strong corporate governance, and transparency within financial institutions. It also underscores the potential consequences of fraud and misconduct on the overall growth and stability of a business.

#### Chart - 6 Distribution of dependent variable Close Price of stock.

In [None]:
# Chart - 6 visualization code
# Set the dark theme before creating the figure so it applies to the figure too
plt.style.use('dark_background')

# Set the figure size and title
plt.figure(figsize=(15, 9))
plt.suptitle('Overall Distribution of Each Variable', color='white')

# Define the color list for each variable (using Yes Bank color scheme)
color_list = ['#003366', '#FF6600', '#99CC00', '#FFCC00']

for i, column in enumerate(stock_df.columns):
    # Create subplots
    ax1 = plt.subplot(2, 2, i + 1)
    ax2 = ax1.twinx()

    # Plot histogram
    sns.histplot(stock_df[column], color=color_list[i], ax=ax1)

    # Plot KDE curve
    sns.kdeplot(stock_df[column], color=color_list[i], ax=ax2)

    # Set gridlines
    ax1.grid(which='major', alpha=0.5)
    ax1.grid(which='minor', alpha=0.5)

    # Add vertical lines for mean and median
    plt.axvline(stock_df[column].mean(), color='white', linestyle='dashed', linewidth=1.5)
    plt.axvline(stock_df[column].median(), color='yellow', linestyle='dashed', linewidth=1.5)

# Set the background color of the figure
plot_bgcolor = (36/255, 40/255, 47/255, 1)  # RGB values divided by 255, with alpha=1 for full opacity
plt.gcf().patch.set_facecolor(plot_bgcolor)

# Set the background color of the axes
paper_bgcolor = (51/255, 56/255, 66/255, 1)  # RGB values divided by 255, with alpha=1 for full opacity
for ax in plt.gcf().get_axes():
    ax.set_facecolor(paper_bgcolor)

# Adjust the layout
plt.tight_layout()

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

The chosen chart, a combination of histograms and KDE plots, visualizes the distribution of each variable in the dataset. The histograms show the frequency distribution, while the KDE plots overlay a smooth density curve, so the central tendency, spread, and shape of each distribution can be examined together; dashed vertical lines mark the mean and the median. The subplots allow easy comparison between variables, and the color scheme follows the Yes Bank branding. Overall, the chart is a concise way to explore the data and to spot skewness, multimodality, and outliers.

##### 2. What is/are the insight(s) found from the chart?

The distributions of Open, High, Low, and Close are all positively skewed: most data points are concentrated on the left side of each distribution, with a long tail extending towards larger values on the right. The histograms and KDE plots show this clearly. In other words, low prices occur often, while high prices occur comparatively rarely. This shape is common for variables that are bounded below, such as prices, which cannot fall below zero. The skewness should be taken into account during analysis and modeling, for example by applying a transformation (such as a logarithm) or by using techniques that are robust to skewed data.
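
As a rough illustration of how skewness can be measured and reduced, the sketch below draws a synthetic right-skewed sample (log-normal, not the Yes Bank prices) and compares the skewness before and after a `log1p` transform. The choice of `log1p` is an assumption here; other transforms are possible.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample (log-normal); not the Yes Bank prices.
rng = np.random.default_rng(42)
prices = pd.Series(rng.lognormal(mean=4.0, sigma=0.8, size=500), name="Close")

skew_raw = prices.skew()            # sample skewness of the raw values
skew_log = np.log1p(prices).skew()  # skewness after a log(1 + x) transform

print(f"raw skewness: {skew_raw:.2f}")
print(f"log1p skewness: {skew_log:.2f}")
```

A skewness close to zero after the transform suggests a more symmetric distribution, which some regression models handle better.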

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights about the positively skewed distributions of open, high, low, and close prices can have a positive business impact by informing strategic decision-making and identifying potential buying opportunities. However, it is important to note that positive skewness does not directly imply negative growth. Negative growth would require a comprehensive analysis considering various factors beyond skewness, such as trends, market conditions, and external influences. Therefore, it is not justified to conclude specific insights leading to negative growth based solely on the skewness of the distributions. Further analysis is needed to assess any potential negative impacts on business growth.

#### Chart - 7 Boxplots: Studying the Outliers

In [None]:
# Chart - 7 visualization code
fig = plt.figure(figsize=(10, 7))
boxplot = stock_df.boxplot(column=['Open', 'High', 'Low', 'Close'], grid=False, notch=True)

# Change the line color to white
for item in boxplot.findobj(plt.Line2D):
    item.set_color('white')  # Set the color of the lines to white

# Add title to the plot
plt.title("Outliers in Various Features")

# Change the background color of the plot
paper_bgcolor = (51/255, 56/255, 66/255, 1)  # RGB values divided by 255, with alpha=1 for full opacity

fig.patch.set_facecolor(paper_bgcolor)  # Set the background color of the figure

plt.show()

##### 1. Why did you pick the specific chart?

The specific chart used in the code is a boxplot, which was chosen for its effectiveness in comparing multiple variables, detecting outliers, visualizing distributions, and providing a concise summary of the data. The notch feature adds a confidence interval around the median, enhancing comparison. The boxplot's space efficiency allows for displaying multiple variables in a compact manner. Removing the gridlines reduces visual clutter. The code also demonstrates customization flexibility, such as changing line color to white. Overall, the boxplot is a suitable choice for analyzing and comparing the Open, High, Low, and Close prices of the stock.
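
For reference, the notch width can be reproduced by hand. To our understanding, when no bootstrap is requested Matplotlib draws the notch at median ± 1.57 × IQR / √n; the sketch below applies that approximation to illustrative values, and the function name `notch_interval` is hypothetical.

```python
import numpy as np


def notch_interval(values):
    """Approximate median confidence interval as drawn by a notched boxplot.

    Uses median +/- 1.57 * IQR / sqrt(n), a Gaussian-based approximation.
    """
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    half_width = 1.57 * (q3 - q1) / np.sqrt(values.size)
    return median - half_width, median + half_width


# Illustrative values only.
sample = np.arange(1, 101)  # 1, 2, ..., 100
low, high = notch_interval(sample)
print(f"median notch: [{low:.2f}, {high:.2f}]")
```

If the notches of two boxes do not overlap, that is commonly taken as informal evidence that their medians differ.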

##### 2. What is/are the insight(s) found from the chart?

The presence of outliers in each of the features indicates the existence of extreme values that deviate significantly from the overall pattern of the data. These outliers can potentially impact the model fitting process and the accuracy of the predictions. Therefore, it is crucial to address these outliers before proceeding with model fitting.

To handle outliers, various approaches can be employed, such as removing them from the dataset, transforming the data using robust statistical techniques, or imputing them with more representative values. The choice of the method depends on the nature of the data and the specific requirements of the analysis.

Handling outliers helps to ensure that the model captures the underlying patterns and relationships accurately, leading to more reliable predictions and interpretations. It also improves the robustness of the model against extreme observations that may introduce bias or noise. Properly addressing outliers contributes to the overall validity and integrity of the analysis, enhancing the reliability of the model fitting process and subsequent predictions.
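
One simple way to flag and treat outliers is the 1.5 × IQR rule, which is also the default whisker length of a Matplotlib boxplot. The sketch below uses made-up values and a hypothetical helper `iqr_bounds`; capping values at the fences is only one of the options listed above, shown here as an assumption rather than the method used in this project.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5):
    """Return the (lower, upper) fences of the k * IQR rule."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Illustrative values with one extreme point; not taken from the dataset.
close = pd.Series([10, 12, 11, 13, 12, 14, 11, 95], name="Close", dtype=float)

lower, upper = iqr_bounds(close)
outliers = close[(close < lower) | (close > upper)]
capped = close.clip(lower=lower, upper=upper)  # one option: cap instead of drop

print(outliers)
print(capped.max())
```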

**Bivariate Analysis**

#### Chart - 8 Scatter Plot to see the Best Fit line

In [None]:
# Chart - 8 visualization code
# Function to create scatter plots with correlation lines
def create_scatter_plot(col, df):
    fig = plt.figure(figsize=(20, 10))  # Create a new figure with the specified size
    ax = fig.gca()  # Get the current axes
    feature = df[col]  # Extract the data for the given column
    label = df['Close']  # Extract the 'Close' column data
    correlation = feature.corr(label)  # Calculate the correlation between the two columns
    plt.scatter(x=feature, y=label, marker="*", c="b", s=label/20)  # Create a scatter plot with blue markers
    plt.xlabel(col)  # Set the label for the x-axis
    plt.ylabel('Close')  # Set the label for the y-axis
    ax.set_title(col + ' Vs. Close' + '         Correlation: ' + str(round(correlation, 2)), fontsize=16)  # Set the title of the plot
    z = np.polyfit(df[col], df['Close'], 1)  # Fit a linear regression line to the data
    y_hat = np.poly1d(z)(df[col])  # Generate the y-values for the regression line

    plt.plot(df[col], y_hat, "r", lw=1)  # Plot the regression line in red

    # Annotate the plot with the correlation coefficient
    plt.annotate('The correlation coefficient is {}.'.format(round(correlation, 2)), (200, 0.2), fontsize=10)

    plt.grid(alpha=0.3)  # Add grid lines with transparency (alpha=0.3)
    plt.xticks(np.arange(min(df[col]), max(df[col]), 100))  # Set the x-axis ticks
    plt.yticks(np.arange(min(df['Close']), max(df['Close']), 10))  # Set the y-axis ticks

    # Set the background color of the figure
    plot_bgcolor = (36/255, 40/255, 47/255)  # RGB values divided by 255
    fig.patch.set_facecolor(plot_bgcolor)

    plt.show()  # Display the plot

# Loop through the columns and create scatter plots
for col in ['Open', 'High', 'Low']:
    create_scatter_plot(col, stock_df)  # Call the function for each column in the loop

##### 1. Why did you pick the specific chart?

Using scatter plots with a best fit line allows for visualizing the relationship between numerical features and the 'Close' price. The correlation coefficient quantifies the strength of the relationship. The best fit line provides an estimate of the trend and predictive power. The plot aids interpretation and communication of the relationship to stakeholders. Annotations, such as the correlation coefficient, provide valuable insights. Customization enhances clarity and aesthetics. The plots help identify potential predictors and support analysis and decision-making in stock market analysis.

##### 2. What is/are the insight(s) found from the chart?

Upon analyzing the scatter plots with the best fit line, it is evident that all the independent variables show a linear relationship with the dependent variable, 'Close'. This indicates that there is a consistent and predictable relationship between these variables.

The presence of a linear relationship has important implications in data analysis and modeling. It suggests that changes in the independent variables can be associated with proportional changes in the dependent variable. This knowledge can be leveraged to build regression models, make predictions, and understand the impact of the independent variables on the 'Close' price.
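
As a rough illustration of what the correlation coefficient and the best fit line in these plots measure, the sketch below computes both with NumPy on a small made-up series. The arrays `open_price` and `close_price` are hypothetical stand-ins, not values from the Yes Bank dataset.

```python
import numpy as np

# Hypothetical monthly prices (illustrative values only, not from the dataset)
open_price = np.array([10.0, 12.0, 15.0, 20.0, 26.0, 33.0])
close_price = np.array([11.0, 13.5, 16.0, 22.0, 27.5, 35.0])

# Pearson correlation: covariance scaled by both standard deviations
pearson_r = np.corrcoef(open_price, close_price)[0, 1]

# A degree-1 polynomial fit returns the slope and intercept of the least-squares line
slope, intercept = np.polyfit(open_price, close_price, 1)
fitted_close = slope * open_price + intercept

print(f"Pearson r = {pearson_r:.4f}")
print(f"Best fit line: Close = {slope:.3f} * Open + {intercept:.3f}")
```

A coefficient close to 1 says the points lie near a straight line; it does not by itself say how well the line will predict months that were not used to fit it.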

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The business impact of identifying linear relationships between the independent variables and the dependent variable is significant. It allows for better decision-making and forecasting in the context of stock market analysis. Here are a few potential implications:

Prediction and Forecasting: With a clear understanding of the linear relationships, regression models can be developed to predict future 'Close' prices based on the values of the independent variables. This can assist in forecasting stock performance and informing investment decisions.

Risk Assessment: By analyzing the strength and direction of the linear relationships, it becomes possible to assess the risk associated with changes in the independent variables. This knowledge can aid in risk management and portfolio optimization strategies.

Feature Selection: Identifying the linear relationships helps in determining the most influential independent variables that impact the 'Close' price. This knowledge can guide feature selection and variable prioritization in future analyses or model development.

Strategy Development: The linear relationships can provide insights into the factors driving stock price movements. This information can be utilized to develop trading strategies, identify patterns, and make informed investment decisions.

By recognizing and understanding the linear relationships between the independent variables and the dependent variable, businesses and investors can gain valuable insights into the dynamics of stock prices. This can lead to improved forecasting accuracy, risk management, and decision-making in the context of financial markets.

#### Chart - 9 Pair Plot

In [None]:
# Chart - 9 visualization code
# Create a pair plot to explore relationships in the stock_df DataFrame
sns.pairplot(stock_df)

# Set the background colors for the figure and axes
plot_bgcolor = (36/255, 40/255, 47/255, 1)  # RGB values divided by 255, with alpha=1 for full opacity
paper_bgcolor = (51/255, 56/255, 66/255, 1)

plt.gcf().patch.set_facecolor(plot_bgcolor)  # Set the background color of the figure
plt.gca().patch.set_facecolor(paper_bgcolor)  # Set the background color of the axes

# Display the pair plot
plt.show()


##### 1. Why did you pick the specific chart?

A pair plot was chosen because it shows, in a single grid, the scatter plot of every pair of numerical variables together with each variable's own distribution on the diagonal. This makes it easy to explore pairwise relationships, detect patterns, and identify outliers, and it gives a quick visual check on the correlations among Open, High, Low, and Close. A pair plot does not display the time dimension, so trends over time have to be read from a line chart instead. It serves as a starting point for exploratory data analysis, and its visual findings should be confirmed with statistical analysis.

##### 2. What is/are the insight(s) found from the chart?

The variables Open, High, and Low show a strong correlation with the Close variable, indicating a close relationship between the stock's opening, highest, lowest, and closing prices. The Open, High, and Low variables also exhibit a high correlation with each other, suggesting they move in sync and share similar trends. These correlations provide valuable insights for analyzing the Yes Bank stock and can serve as predictors of the closing price. The relationships highlight the interdependencies within the stock market and the potential impact of external factors. Understanding these connections aids in making informed decisions and identifying patterns for forecasting future stock price movements. However, it's important to consider that correlation does not imply causation, and comprehensive analysis requires additional factors and considerations.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from the strong correlations between the Open, High, Low, and Close variables can have a positive business impact. They aid in making informed decisions and predicting the closing price, improving investment strategies for Yes Bank stock. However, correlation alone does not guarantee success, and comprehensive analysis is necessary, considering other factors and potential limitations. The insights do not directly lead to negative growth; rather, misinterpretation or over-reliance on correlations without considering other factors can result in negative outcomes. Thorough analysis, risk management, and understanding market conditions are crucial for mitigating risks and ensuring positive growth.

To address multicollinearity, we drop the variable with the highest VIF. We prioritize retaining variables that have the strongest correlation with the dependent variable while having the least correlation with other independent variables. This improves model performance and interpretation by reducing interdependence between variables. The decision aligns with the objective of selecting the most influential predictors and enhancing prediction accuracy.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

No missing values were found in the dataset, as confirmed earlier. Therefore, there is no requirement for missing values imputation techniques. The dataset is complete, allowing for direct analysis without the need to handle missing data.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
# Create a figure with a size of 10x6 inches
fig = plt.figure(figsize=(10, 6))

# Add a super title to the plot
plt.suptitle('Studying the Outliers after Log Transformation', color='white', fontsize=16)

# Define a list of colors for the boxplots
color_list = ['blue', 'green', 'red', 'grey', 'orange']

# Select only the numeric columns so that non-numeric columns (e.g. a date column) are skipped
numeric_columns = stock_df.select_dtypes(include='number').columns

# Calculate the number of subplot rows needed for a two-column grid
num_rows = int(np.ceil(len(numeric_columns) / 2))

# Iterate over each numeric column in the dataframe
for i, column in enumerate(numeric_columns, start=1):
    # Create a subplot for the column
    plt.subplot(num_rows, 2, i)

    # Apply a log transformation to the column and create a boxplot
    sns.boxplot(x=np.log10(stock_df[column]), color=color_list[(i - 1) % len(color_list)])

    # Add a title to each subplot
    plt.title(column, color='white')

# Adjust the layout of the subplots
plt.tight_layout()

plt.show()

After applying the log transformation to the features, the boxplots show no points beyond the whiskers, so no feature has remaining outliers under the boxplot rule. The transformation compresses the large values that previously appeared as extremes and makes the distributions less skewed.

##### What all outlier treatment techniques have you used and why did you use those techniques?

The log transformation was applied as a treatment for outliers. This approach not only addresses outliers but also helps to alleviate skewness in the features' distribution. By using log transformation, two problems - outlier treatment and skewness correction - are tackled simultaneously, providing a consolidated solution. This technique aids in normalizing the data and improving the suitability of the features for analysis and modeling purposes.
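
One way to check for points beyond the whiskers numerically is the 1.5 × IQR rule that boxplots use by default. The sketch below counts such points before and after a log10 transform. The `prices` array is made up for illustration, and `count_iqr_outliers` is a hypothetical helper that is not part of this notebook.

```python
import numpy as np

def count_iqr_outliers(values):
    """Count points lying more than 1.5 * IQR outside the quartiles."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(np.sum((values < lower) | (values > upper)))

# Hypothetical right-skewed prices (illustrative only)
prices = np.array([10, 12, 15, 18, 20, 25, 30, 40, 60, 100], dtype=float)

outliers_raw = count_iqr_outliers(prices)
outliers_log = count_iqr_outliers(np.log10(prices))
print("Outliers before log10:", outliers_raw)
print("Outliers after log10:", outliers_log)
```

Applying the same helper to each column of the real data would give a count to set beside the visual reading of the boxplots.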

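### 3. Categorical Encoding

The dataset contains no categorical variables, so no encoding is required.

### 4. Feature Manipulation & Selection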

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Create an empty dataframe to store the VIF for each feature
vif_df = pd.DataFrame()

# Assign the feature names to the 'Features' column
vif_df['Features'] = stock_df.iloc[:, :-1].columns.tolist()

# Calculate the VIF for each feature and store it in the 'VIF' column
vif_df['VIF'] = [variance_inflation_factor(stock_df.iloc[:, :-1].values, i) for i in range(len(stock_df.iloc[:, :-1].columns))]

# Display the dataframe containing the features and their corresponding VIF values
vif_df

The VIF values for all the features indicate high multicollinearity. However, considering the small size of the dataset and having only three numerical independent variables, there is limited potential for feature manipulation that could be beneficial. With the absence of categorical variables, the scope for feature engineering or transformation is constrained. Therefore, the focus should be on alternative modeling approaches or additional data collection to address the issue of multicollinearity.
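
For reference, the VIF of feature j is 1 / (1 - R_j²), where R_j² comes from regressing feature j on the remaining features. The sketch below computes it directly with NumPy on made-up columns. `vif_with_intercept` is a hypothetical helper that includes an intercept in each auxiliary regression, so its values need not match those produced by the cell above.

```python
import numpy as np

def vif_with_intercept(features):
    """Return the VIF of each column: 1 / (1 - R^2) from regressing it on the others."""
    features = np.asarray(features, dtype=float)
    n_rows, n_cols = features.shape
    vifs = []
    for j in range(n_cols):
        target = features[:, j]
        # Regress column j on an intercept plus all remaining columns
        others = np.column_stack([np.ones(n_rows), np.delete(features, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ coef
        r_squared = 1.0 - residuals @ residuals / np.sum((target - target.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

# Hypothetical features: the first two columns move almost in lockstep
x1 = np.arange(1.0, 9.0)
x2 = x1 + np.array([0.1, -0.1, 0.1, -0.1, 0.1, -0.1, 0.1, -0.1])
x3 = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0])
demo_vifs = vif_with_intercept(np.column_stack([x1, x2, x3]))
print(dict(zip(["x1", "x2", "x3"], demo_vifs.round(2))))
```

Columns that nearly duplicate each other get very large values, while a column that is only loosely related to the others stays close to 1.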

#### 2. Feature Selection

Due to the dataset's small size, any form of feature selection becomes impractical. Given the limited number of observations, attempting to reduce the feature space may lead to unreliable or biased results. Therefore, it is advisable to retain all available features for analysis or modeling purposes.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

To address the skewed distribution of the features, a data transformation is necessary to approximate a normal distribution. In this case, a log transformation will be applied. This transformation aims to reduce skewness and make the data more symmetrical. Furthermore, as observed earlier, the log transformation also aids in handling outliers. By employing this transformation, we can simultaneously improve the normality of the data distribution and mitigate the impact of outliers.
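
The effect of a log transformation on skewness can be checked numerically. The sketch below compares the moment-based skewness of a made-up right-skewed series before and after log10. The values and the helper name `sample_skewness` are illustrative assumptions, not taken from the dataset.

```python
import math

def sample_skewness(values):
    """Moment-based skewness: third central moment divided by variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed prices (illustrative only)
prices = [10, 12, 15, 18, 20, 25, 30, 40, 60, 100]
log_prices = [math.log10(p) for p in prices]

skew_raw = sample_skewness(prices)
skew_log = sample_skewness(log_prices)
print(f"Skewness before log10: {skew_raw:.3f}")
print(f"Skewness after log10:  {skew_log:.3f}")
```

A value nearer to zero after the transformation indicates a more symmetric distribution.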

In [None]:
# Iterate over each column in the dataframe
for column in stock_df.columns:
    # Apply a log transformation to the column using np.log10()
    stock_df[column] = np.log10(stock_df[column])

In [None]:
# Create a figure with a size of 15x10 inches
plt.figure(figsize=(15, 10))

# Set the background colors for the figure and axes
plot_bgcolor = (36/255, 40/255, 47/255, 1)  # RGB values divided by 255, with alpha=1 for full opacity
paper_bgcolor = (51/255, 56/255, 66/255, 1)

plt.gcf().patch.set_facecolor(plot_bgcolor)  # Set the background color of the figure

# Add a super title to the plot
plt.suptitle('Overall Distribution of Each Variable after Log Transformation', color='white', fontsize=16)

color_list = ['blue', 'green', 'red', 'purple']

for i, column in enumerate(stock_df.columns):
    plt.subplot(2, 2, i + 1)
    ax1 = plt.gca()
    sns.histplot(stock_df[column], color=color_list[i], ax=ax1)
    ax2 = ax1.twinx()
    sns.kdeplot(stock_df[column], color=color_list[i], ax=ax2)  # Overlapping the KDE plot on the histogram.

    # Set the background color of the axes
    ax1.patch.set_facecolor(paper_bgcolor)
    ax2.patch.set_facecolor(paper_bgcolor)

    # Add gridlines
    plt.grid(which='major', alpha=0.5)
    plt.grid(which='minor', alpha=0.5)

    # Add dashed lines for mean and median
    plt.axvline(stock_df[column].mean(), color='purple', linestyle='dashed', linewidth=1.5)
    plt.axvline(stock_df[column].median(), color='orange', linestyle='dashed', linewidth=1.5)

plt.tight_layout()

plt.show()

After the log transformation, the distributions of the features appear to be closer to a normal distribution compared to their previous state. The mean (indicated by the purple vertical line) and the median (represented by the orange vertical line) are nearly equal for each feature. This alignment suggests that the log transformation successfully reduced the skewness and brought the data closer to symmetry. Overall, these observations indicate an improved approximation to a normal distribution after the log transformation.

### 6. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Since the dataset is already small in size, there is no need for dimensionality reduction techniques. With a limited number of observations, attempting to reduce the number of features may not provide significant benefits and could potentially lead to loss of valuable information. Therefore, it is advisable to retain all the available features for analysis or modeling purposes without applying dimensionality reduction methods.

### 7. Data Splitting

In [None]:
from sklearn.model_selection import train_test_split

# Assign the independent and dependent variables to X and y, respectively
X = stock_df[independent_variables]
y = stock_df[dependent_variable]

# Split the data into training and testing datasets using a test size of 0.2 (20%)
# Set random_state to 0 for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

##### What data splitting ratio have you used and why?

To train the model effectively, an 80:20 split ratio is being employed, allocating 80% of the data for training and 20% for testing. However, considering the small dataset size, it may be beneficial to acquire more data for training purposes. Increasing the training data size helps improve the model's ability to learn and generalize from the patterns present in the data. Gathering additional data can enhance the model's performance, reduce the risk of overfitting, and provide a more comprehensive representation of the underlying relationships within the dataset.
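
The split above shuffles the rows at random, which is the default behaviour of `train_test_split`. Since the observations are monthly and ordered in time, a chronological hold-out is one alternative that may be worth comparing. It is not what this notebook does, and `chronological_split` below is a hypothetical helper shown only as a sketch.

```python
def chronological_split(rows, test_fraction=0.2):
    """Hold out the last `test_fraction` of rows, preserving their order."""
    n_test = max(1, round(len(rows) * test_fraction))
    split_at = len(rows) - n_test
    return rows[:split_at], rows[split_at:]

# Hypothetical ordered observations (illustrative labels only)
months = [f"month_{i:02d}" for i in range(1, 11)]
train_rows, test_rows = chronological_split(months)

print("Train:", train_rows)
print("Test:", test_rows)
```

With this kind of split the model is always evaluated on months that come after every month it was trained on.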

### 8. Data Scaling

In [None]:
from sklearn.preprocessing import StandardScaler

# Create an instance of StandardScaler
scaler = StandardScaler()

# Scale the training data (X_train) using fit_transform
X_train = scaler.fit_transform(X_train)

# Scale the testing data (X_test) using transform
X_test = scaler.transform(X_test)

In [None]:
# Checking the training dataset
X_train[0: 10]

##### Which method have you used to scale your data and why?

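StandardScaler is used because the models in this project are linear regressions, including the regularized Lasso and Ridge variants, whose penalty terms are sensitive to the scale of each feature. StandardScaler transforms each feature to have a mean of 0 and a standard deviation of 1, using the mean and standard deviation learned from the training set; the same parameters are then applied to the test set so that no information from the test data leaks into training. Standardization does not make the features normally distributed, and linear regression does not require that they be. What it ensures is that the features are on a comparable scale, which keeps the coefficients comparable and applies the regularization evenly across features.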
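
The arithmetic behind standardization is short enough to write out by hand. The sketch below reproduces it with NumPy on made-up matrices, learning the mean and standard deviation from the training rows only. The names `X_train_demo` and `X_test_demo` are hypothetical and are kept separate from the notebook's own variables.

```python
import numpy as np

# Hypothetical feature matrices (rows = observations, columns = features)
X_train_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
X_test_demo = np.array([[2.5, 25.0], [5.0, 50.0]])

# Learn the mean and population standard deviation (ddof=0) from the training rows only
train_mean = X_train_demo.mean(axis=0)
train_std = X_train_demo.std(axis=0)

# Apply the same training statistics to both sets
X_train_scaled = (X_train_demo - train_mean) / train_std
X_test_scaled = (X_test_demo - train_mean) / train_std

print(X_train_scaled)
print(X_test_scaled)
```

The scaled test rows are not guaranteed to have mean 0 or unit variance, because they are expressed relative to the training statistics.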

## ***7. ML Model Implementation***

### ML Model - 1 Linear Regression

Linear Regression is a powerful machine learning algorithm that falls under the category of supervised learning. It is specifically designed for regression tasks, where the goal is to predict a continuous target variable based on independent variables. In regression analysis, the algorithm establishes a relationship between the predictor variables and the target variable to make accurate predictions.

The primary objective of Linear Regression is to identify and quantify the relationship between variables. By examining the patterns and trends in the data, the algorithm enables us to understand how changes in one variable affect the target variable. This understanding is crucial for making informed decisions and forecasting future outcomes.

Linear Regression is widely employed in various domains, including finance, economics, social sciences, and engineering. It finds applications in areas such as sales forecasting, housing price prediction, demand estimation, and trend analysis. By leveraging the insights gained from analyzing the relationship between variables, Linear Regression empowers us to make reliable forecasts and make informed business decisions.

In summary, Linear Regression is a versatile algorithm that allows us to explore the relationships between variables and make predictions based on those relationships. Its ability to model the dependencies between variables makes it a valuable tool for understanding data and making accurate forecasts in numerous fields.

In [None]:
from sklearn.linear_model import LinearRegression

# Create an instance of the LinearRegression model
linear_reg = LinearRegression()

# Fit the Linear Regression model to the training data
linear_reg.fit(X_train, y_train)

In [None]:
# Predict on the model
y_pred_lin = linear_reg.predict(X_test)

In [None]:
# Checking the model parameters
print("Coefficients:", linear_reg.coef_)
print("Intercept:", linear_reg.intercept_)

In [None]:
plt.figure(figsize=(8, 4))

# Set the background colors for the figure
plot_bgcolor = (36/255, 40/255, 47/255, 1)  # RGB values divided by 255, with alpha=1 for full opacity

plt.gcf().patch.set_facecolor(plot_bgcolor)  # Set the background color of the figure

# Plot the actual Close prices from the test data
plt.plot(np.array(10**y_test), color='blue')

# Plot the predicted Close prices from the Linear Regression model
plt.plot(10**y_pred_lin, color='red')

# Set the label for the y-axis
plt.ylabel("Close Price")

# Add a legend to differentiate between the actual and predicted values
plt.legend(["Actual", "Predicted"])

# Set the title of the plot
plt.title("Linear Regression", color='white')

# Add gridlines
plt.grid(which='major', alpha=0.5)
plt.grid(which='minor', alpha=0.5)

plt.show()

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

Linear Regression aims to establish a linear connection between the independent and dependent variables by minimizing the sum of squared differences between the observed and predicted dependent values. It assumes a linear relationship and calculates the best-fitting line by adjusting the model's coefficients. The objective is to minimize the overall distance between the observed data points and the line of best fit. This approach enables the model to capture the underlying linear pattern and make predictions based on the learned relationship between the variables.
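
To make the least-squares idea concrete, the sketch below solves a small problem directly with NumPy instead of scikit-learn. The data are made up and generated from a known linear rule with no noise, so the recovered coefficients can be compared against it. All variable names are hypothetical.

```python
import numpy as np

# Hypothetical data generated from y = 1 + 2*x1 - 3*x2 with no noise
X_demo = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 2.0], [4.0, 0.0]])
y_demo = 1.0 + 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1]

# Prepend a column of ones so the first coefficient acts as the intercept
design = np.column_stack([np.ones(len(X_demo)), X_demo])

# Find beta that minimizes the sum of squared residuals ||design @ beta - y||^2
beta, *_ = np.linalg.lstsq(design, y_demo, rcond=None)
intercept, coefficients = beta[0], beta[1:]
residual_sum_of_squares = float(np.sum((design @ beta - y_demo) ** 2))

print("Intercept:", intercept)
print("Coefficients:", coefficients)
print("Residual sum of squares:", residual_sum_of_squares)
```

On real, noisy data the residual sum of squares is not zero; the fitted coefficients are simply the ones that make it as small as possible.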

In [None]:
# importing libraries
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Calculate the Mean Squared Error (MSE)
mse_lin = round(mean_squared_error(10**y_test, 10**y_pred_lin), 4)

# Calculate the Root Mean Squared Error (RMSE)
rmse_lin = round(np.sqrt(mse_lin), 4)

# Calculate the Mean Absolute Error (MAE)
mae_lin = round(mean_absolute_error(10**y_test, 10**y_pred_lin), 4)

# Calculate the R-squared Score (R2)
r2_lin = round(r2_score(10**y_test, 10**y_pred_lin), 4)

# Calculate the Adjusted R-squared Score (Adjusted R2)
adj_r2_lin = round(1 - (1 - r2_lin) * ((X_test.shape[0] - 1) / (X_test.shape[0] - X_test.shape[1] - 1)), 4)


In [None]:
# Create a dataframe to store the evaluation metrics
evametdf_lin = pd.DataFrame()

# Set the 'Metrics' column in the dataframe
evametdf_lin['Metrics'] = ['Mean Squared Error', 'Root Mean Squared Error', 'Mean Absolute Error', 'R-2 Score', 'Adjusted R-2 Score']

# Set the 'Linear Regression' column in the dataframe with the corresponding metric values
evametdf_lin['Linear Regression'] = [mse_lin, rmse_lin, mae_lin, r2_lin, adj_r2_lin]

# Display the dataframe
evametdf_lin

The evaluation metrics for the Linear Regression model are as follows:

Mean Squared Error (MSE): The MSE value is 70.4204, indicating the average squared difference between the actual and predicted Close prices. Lower values indicate better model performance, as they represent a smaller overall prediction error.

Root Mean Squared Error (RMSE): The RMSE value is 8.3917, which is the square root of the MSE. It provides a measure of the average difference between the actual and predicted Close prices in the original scale. Again, a lower value signifies better predictive accuracy.

Mean Absolute Error (MAE): The MAE value is 4.8168, representing the average absolute difference between the actual and predicted Close prices. Similar to MSE and RMSE, a smaller MAE indicates better model performance.

R-2 Score: The R-2 score is 0.9937, reflecting the proportion of variance in the dependent variable (Close prices) explained by the independent variables. A score closer to 1 indicates a better fit of the model to the data.

Adjusted R-2 Score: The adjusted R-2 score is 0.9931. It corrects the R-2 score for the number of independent variables relative to the sample size, so adding a predictor raises it only if that predictor improves the fit by more than chance alone would. Because it stays close to the R-2 score here, the fit does not appear to be inflated by the number of predictors.

These evaluation metrics collectively demonstrate that the Linear Regression model performs well in predicting the Close prices, with low errors, a high R-2 score, and a relatively stable adjusted R-2 score.
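
The definitions behind these five metrics can be checked by hand on a tiny example. The sketch below uses four made-up values rather than the model's predictions, and `regression_metrics` is a hypothetical helper written only for illustration.

```python
import math

def regression_metrics(y_true, y_pred, n_features):
    """Compute MSE, RMSE, MAE, R-2, and adjusted R-2 from paired values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e ** 2 for e in errors) / n
    rmse = math.sqrt(mse)
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e ** 2 for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    # Penalize R-2 for the number of predictors relative to the sample size
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {"mse": mse, "rmse": rmse, "mae": mae, "r2": r2, "adj_r2": adj_r2}

# Hypothetical actual and predicted values with one predictor
metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0], n_features=1)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The adjusted R-2 is always at most the R-2 score, and the gap widens as the number of predictors grows relative to the number of observations.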

### ML Model - 2 Lasso Regression

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

Lasso regression is a form of penalized (L1-regularized) linear regression that is commonly used for variable selection. Its penalty shrinks the coefficients of less relevant variables, in some cases exactly to zero, which can make the model easier to interpret and can improve prediction accuracy when some predictors contribute little or are highly correlated with one another.

In [None]:
# ML Model - 2 Implementation
from sklearn.linear_model import Lasso
lasso = Lasso(alpha = 0.01)

# Fit the Algorithm
lasso.fit(X_train, y_train)

In [None]:
# Predict on the model
y_pred_lasso = lasso.predict(X_test)

In [None]:
# Print the coefficients of the Lasso model
print("Coefficients:", lasso.coef_)

# Print the intercept of the Lasso model
print("Intercept:", lasso.intercept_)

In [None]:
plt.figure(figsize=(8, 4))

# Set the background colors for the figure
plot_bgcolor = (36/255, 40/255, 47/255, 1)  # RGB values divided by 255, with alpha=1 for full opacity

plt.gcf().patch.set_facecolor(plot_bgcolor)  # Set the background color of the figure

# Plot the actual Close prices from the test data
plt.plot(np.array(10**y_test), color='blue')

# Plot the predicted Close prices from the Lasso Regression model
plt.plot(10**y_pred_lasso, color='red')

# Set the label for the y-axis
plt.ylabel("Close Price")

# Add a legend to differentiate between the actual and predicted values
plt.legend(["Actual", "Predicted"])

# Set the title of the plot
plt.title("Lasso Regression", color='white')

# Add gridlines
plt.grid(which='major', alpha=0.5)
plt.grid(which='minor', alpha=0.5)

plt.show()

Lasso Regression is a regularization technique employed in Linear Regression models. It incorporates a penalty term into the loss function that is based on the sum of the absolute values of the coefficients. This penalty term encourages sparsity in the model by driving some coefficients to exactly zero. As a result, Lasso Regression not only reduces the magnitudes of the coefficients but can also eliminate some features from the model by setting their corresponding coefficients to zero.

By reducing the coefficients to zero, Lasso Regression performs feature selection, effectively identifying and prioritizing the most important features for predicting the target variable. This characteristic makes Lasso Regression particularly useful when dealing with high-dimensional datasets where feature reduction is desired.

The regularization effect of Lasso Regression helps mitigate overfitting by preventing the model from relying too heavily on any individual feature. It encourages a more parsimonious model representation, improving its generalizability to unseen data. The capability of Lasso Regression to shrink coefficients towards zero and perform feature selection makes it a valuable tool for both improving model interpretability and enhancing prediction accuracy.
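
The way an L1 penalty sets coefficients exactly to zero can be seen in the soft-thresholding operator, which is the per-coefficient update used by coordinate-descent Lasso solvers. The sketch below is a simplified illustration that treats each coefficient in isolation. The function name, the penalty value, and the sample coefficients are all hypothetical.

```python
def soft_threshold(value, penalty):
    """Shrink `value` toward zero by `penalty`; return 0.0 if it does not exceed the penalty."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0

# Hypothetical unpenalized coefficients (illustrative only)
unpenalized = [2.5, -0.8, 0.05, -0.02]
shrunk = [soft_threshold(c, penalty=0.1) for c in unpenalized]

for before, after in zip(unpenalized, shrunk):
    print(f"{before:+.2f} -> {after:+.2f}")
```

Coefficients whose magnitude is below the penalty are removed entirely, while larger ones are only reduced, which is the behaviour that distinguishes Lasso from a penalty on squared coefficients.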

In [None]:
# Mean Squared Error
mse_lasso = round( mean_squared_error((10**y_test), 10**(y_pred_lasso)), 4)

# Root Mean Squared Error
rmse_lasso = round(np.sqrt(mse_lasso), 4)

# Mean Absolute Error
mae_lasso = round(mean_absolute_error((10**y_test), 10**(y_pred_lasso)), 4)

# R-2 Score
r2_lasso = round(r2_score((10**y_test), (10**y_pred_lasso)), 4)

# Adjusted R-2 Score
adj_r2_lasso = round(1 - (1 - r2_lasso)*((X_test.shape[0] - 1)/(X_test.shape[0] - X_test.shape[1] - 1)), 4)

In [None]:
# Create a dataframe to store the evaluation metrics for Lasso Regression
evametdf_lasso = pd.DataFrame()

# Set the 'Metrics' column in the dataframe
evametdf_lasso['Metrics'] = ['Mean Squared Error', 'Root Mean Squared Error', 'Mean Absolute Error', 'R-2 Score', 'Adjusted R-2 Score']

# Set the 'Lasso Regression' column in the dataframe with the corresponding metric values
evametdf_lasso['Lasso Regression'] = [mse_lasso, rmse_lasso, mae_lasso, r2_lasso, adj_r2_lasso]

# Display the dataframe
evametdf_lasso

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
from sklearn.model_selection import GridSearchCV

# Define the hyperparameter grid for Lasso Regression
lasso_param_grid = {'alpha': [0.00001, 0.0001, 0.001, 0.01, 1, 10, 100, 1000]}

# Perform GridSearchCV with Lasso Regression
lasso_gscv = GridSearchCV(lasso, param_grid=lasso_param_grid, scoring='neg_mean_squared_error', cv=3)

# Fit the Lasso Regression model with GridSearchCV
lasso_gscv.fit(X_train, y_train)

In [None]:
# Finding the best parameter value
print("The best value of 'alpha' would be:", lasso_gscv.best_params_)

In [None]:
# Print the coefficients of the best estimator from GridSearchCV
print("Coefficients:", lasso_gscv.best_estimator_.coef_)

# Print the intercept of the best estimator from GridSearchCV
print("Intercept:", lasso_gscv.best_estimator_.intercept_)

In [None]:
# Predict on the model
y_pred_lasso_gscv = lasso_gscv.predict(X_test)

In [None]:
plt.figure(figsize=(8, 4))

# Set the background colors for the figure
plot_bgcolor = (36/255, 40/255, 47/255, 1)  # RGB values divided by 255, with alpha=1 for full opacity

plt.gcf().patch.set_facecolor(plot_bgcolor)  # Set the background color of the figure

# Plot the actual Close prices from the test data
plt.plot(np.array(10**y_test), color='blue')

# Plot the predicted Close prices from the Lasso Regression model with GridSearchCV
plt.plot(10**y_pred_lasso_gscv, color='red')

# Set the label for the y-axis
plt.ylabel("Close Price")

# Add a legend to differentiate between the actual and predicted values
plt.legend(["Actual", "Predicted"])

# Set the title of the plot
plt.title("Lasso Regression with GridSearchCV", color='white')

# Add gridlines
plt.grid(which='major', alpha=0.5)
plt.grid(which='minor', alpha=0.5)

plt.show()

In [None]:
# Mean Squared Error
mse_lasso_gscv = round( mean_squared_error((10**y_test), 10**(y_pred_lasso_gscv)), 4)

# Root Mean Squared Error
rmse_lasso_gscv = round(np.sqrt(mse_lasso_gscv), 4)

# Mean Absolute Error
mae_lasso_gscv = round(mean_absolute_error((10**y_test), 10**(y_pred_lasso_gscv)), 4)

# R-2 Score
r2_lasso_gscv = round(r2_score((10**y_test), (10**y_pred_lasso_gscv)), 4)

# Adjusted R-2 Score
adj_r2_lasso_gscv = round(1 - (1 - r2_lasso_gscv)*((X_test.shape[0] - 1)/(X_test.shape[0] - X_test.shape[1] - 1)), 4)

In [None]:
# Create a dataframe to store the evaluation metrics for Lasso Regression with GridSearchCV
evametdf_lasso_gscv = pd.DataFrame()

# Add the column "Metrics" to the dataframe
evametdf_lasso_gscv['Metrics'] = ['Mean Squared Error', 'Root Mean Squared Error', 'Mean Absolute Error', 'R-2 Score', 'Adjusted R-2 Score']

# Add the column "Lasso Regression with GridSearchCV" to the dataframe with the corresponding evaluation metric values
evametdf_lasso_gscv['Lasso Regression with GridSearchCV'] = [mse_lasso_gscv, rmse_lasso_gscv, mae_lasso_gscv, r2_lasso_gscv, adj_r2_lasso_gscv]

# Display the dataframe
evametdf_lasso_gscv

##### Which hyperparameter optimization technique have you used and why?

GridSearchCV was used because only one hyperparameter, alpha, is tuned for Lasso Regression, so the search space is small enough to evaluate exhaustively. The hyperparameter grid specified a range of alpha values, and GridSearchCV cross-validated each one, scoring it with the negative mean squared error. Because scikit-learn treats higher scores as better, the alpha with the highest (least negative) cross-validated score is the one with the lowest cross-validated mean squared error, and it was selected as the regularization strength for Lasso Regression.
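The sign convention of `neg_mean_squared_error` is easy to misread, so a minimal sketch may help. It assumes synthetic stand-in data (`X_demo`, `y_demo`) and an illustrative grid (`demo_grid`); none of these names or values come from the notebook, and the numbers it produces say nothing about the Yes Bank results.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: this is NOT the Yes Bank dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 3))
y_demo = X_demo @ np.array([0.5, 0.3, 0.2]) + rng.normal(scale=0.05, size=60)

# Illustrative alpha grid (assumption: values chosen only for the demo)
demo_grid = {'alpha': [0.0001, 0.001, 0.01, 0.1, 1]}

demo_gscv = GridSearchCV(Lasso(max_iter=10000), param_grid=demo_grid,
                         scoring='neg_mean_squared_error', cv=3)
demo_gscv.fit(X_demo, y_demo)

# Scores are negated MSE values, so flip the sign to read them as errors
mean_cv_mse = -demo_gscv.cv_results_['mean_test_score']
best_cv_mse = -demo_gscv.best_score_

print("Best alpha:", demo_gscv.best_params_['alpha'])
print("Cross-validated MSE per alpha:", mean_cv_mse)
print("Best cross-validated MSE:", best_cv_mse)
```

The best score reported by GridSearchCV corresponds to the smallest entry of `mean_cv_mse`, which is why "highest score" and "lowest error" describe the same alpha.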

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

In [None]:
# Create a dataframe to store the comparison of evaluation metrics for Lasso Regression and Lasso Regression with GridSearchCV
lasso_comp_df = pd.concat([evametdf_lasso, evametdf_lasso_gscv.iloc[:, 1]], axis=1)

# Display the dataframe
lasso_comp_df

Lasso Regression with GridSearchCV performs better than Lasso Regression without GridSearchCV: its mean squared error, root mean squared error, and mean absolute error are all lower, and its R-2 scores are slightly higher. The lower errors indicate more accurate predictions on the test data, and the slightly higher R-2 score suggests that the tuned model captures a little more of the variance in the target variable.

### ML Model - 3 Ridge Regression

Ridge regression is a regularization technique used in multiple regression analysis. A solid understanding of multiple regression provides the foundation for understanding how Ridge regression works.

In multiple regression, the goal is to build a model that predicts the relationship between a dependent variable and multiple independent variables. This is done by estimating the coefficients of the independent variables that minimize the difference between the predicted and actual values of the dependent variable. The traditional least squares method is commonly used to estimate these coefficients.

Ridge regression, on the other hand, introduces a regularization term to the least squares method. This regularization term, known as the Ridge penalty or L2 regularization, adds a constraint to the coefficient estimation process. The purpose of this constraint is to prevent overfitting and improve the model's generalization ability.

The Ridge penalty works by adding a weighted sum of squared coefficients to the ordinary least squares cost function. This sum penalizes larger coefficient values, encouraging them to be smaller. Consequently, Ridge regression tends to shrink the coefficient estimates towards zero, while still allowing them to have non-zero values. This shrinkage effect helps mitigate the impact of multicollinearity, a situation where the independent variables are highly correlated with each other.
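The penalized cost function described above can be written out explicitly. The notation here is an assumption introduced for illustration ($y_i$ for the target, $x_i$ for the feature vector of observation $i$, $\beta_0$ for the intercept, $\beta_j$ for the coefficients, and $\alpha \ge 0$ for the penalty weight); it follows the form scikit-learn documents for its `Ridge` estimator, where the intercept is not penalized.

$$
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta_0,\,\beta} \; \sum_{i=1}^{n} \bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2 \;+\; \alpha \sum_{j=1}^{p} \beta_j^2
$$

The first term is the ordinary least squares cost and the second is the L2 penalty. Setting $\alpha = 0$ recovers ordinary least squares, while increasing $\alpha$ pulls every $\beta_j$ toward zero without forcing any of them to be exactly zero.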

In scikit-learn, Ridge regression exposes this tuning parameter as `alpha` (often written as lambda in the literature). This parameter controls the amount of regularization applied to the model. A larger alpha value results in stronger regularization, leading to smaller coefficient estimates. Conversely, a smaller alpha value reduces the regularization effect, allowing the coefficients to approach the values obtained from ordinary least squares regression.
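The effect of the tuning parameter can be checked with a small experiment. The sketch below assumes synthetic, deliberately correlated features (`X_demo`, `y_demo`) rather than the notebook's data, and the list of alpha values is chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic, highly correlated features (assumption: NOT the Yes Bank data)
rng = np.random.default_rng(42)
base = rng.normal(size=(80, 1))
X_demo = np.hstack([base + rng.normal(scale=0.05, size=(80, 1)) for _ in range(3)])
y_demo = X_demo.sum(axis=1) + rng.normal(scale=0.1, size=80)

# Size of the ordinary least squares coefficient vector, for reference
ols_norm = np.linalg.norm(LinearRegression().fit(X_demo, y_demo).coef_)

# Size of the Ridge coefficient vector for increasing regularization strength
alphas = [0.0001, 1, 100, 10000]
ridge_norms = [np.linalg.norm(Ridge(alpha=a).fit(X_demo, y_demo).coef_)
               for a in alphas]

print("OLS coefficient norm:", ols_norm)
for a, norm in zip(alphas, ridge_norms):
    print(f"alpha={a}: Ridge coefficient norm = {norm}")
```

With a very small alpha the Ridge coefficients stay close to the least squares solution, and the norm of the coefficient vector shrinks as alpha grows.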

By understanding the fundamentals of multiple regression, researchers can grasp the underlying principles of Ridge regression. This regularization technique offers a valuable tool for handling multicollinearity and improving the generalization performance of multiple regression models.

In [None]:
# Import the Ridge regression model from scikit-learn
from sklearn.linear_model import Ridge

# Create an instance of the Ridge regression model
ridge = Ridge()

# Fit the Ridge regression model to the training data
ridge.fit(X_train, y_train)

In [None]:
# Predict on the model
y_pred_ridge = ridge.predict(X_test)

In [None]:
# Print the coefficients of the Ridge regression model
print("Coefficients:", ridge.coef_)

# Print the intercept of the Ridge regression model
print("Intercept:", ridge.intercept_)

In [None]:
plt.figure(figsize=(8, 4))

# Set the background colors for the figure
plot_bgcolor = (36/255, 40/255, 47/255, 1)  # RGB values divided by 255, with alpha=1 for full opacity

plt.gcf().patch.set_facecolor(plot_bgcolor)  # Set the background color of the figure

# Plot the actual Close prices from the test data in blue
plt.plot(np.array(10**y_test), color='blue')

# Plot the predicted Close prices from the Ridge regression model in red
plt.plot(10**y_pred_ridge, color='red')

# Set the label for the y-axis as "Close Price"
plt.ylabel("Close Price")

# Add a legend to differentiate between the actual and predicted values
plt.legend(["Actual", "Predicted"])

# Set the title of the plot as "Ridge Regression" with white color
plt.title("Ridge Regression", color='white')

# Add grid lines to the plot
plt.grid(which='major', alpha=0.5)
plt.grid(which='minor', alpha=0.5)

# Display the plot
plt.show()

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

Ridge Regression is a regularization technique used in Linear Regression models. It introduces a penalty term to the loss function, which is the sum of squared values of the coefficients. This penalty term helps control the magnitude of the coefficients, limiting their impact on the model and reducing the chances of overfitting. By adding this penalty term, Ridge Regression encourages a balance between fitting the training data well and maintaining generalization to unseen data. It is an effective approach to handle multicollinearity and stabilize the model's performance.

In [None]:
# Calculate the Mean Squared Error (MSE)
mse_ridge = round(mean_squared_error(10**y_test, 10**y_pred_ridge), 4)

# Calculate the Root Mean Squared Error (RMSE)
rmse_ridge = round(np.sqrt(mse_ridge), 4)

# Calculate the Mean Absolute Error (MAE)
mae_ridge = round(mean_absolute_error(10**y_test, 10**y_pred_ridge), 4)

# Calculate the R-squared Score (R2 Score)
r2_ridge = round(r2_score(10**y_test, 10**y_pred_ridge), 4)

# Calculate the Adjusted R-squared Score (Adjusted R2 Score)
adj_r2_ridge = round(1 - (1 - r2_ridge) * ((X_test.shape[0] - 1) / (X_test.shape[0] - X_test.shape[1] - 1)), 4)

In [None]:
# Create a dataframe to store the evaluation metrics
evametdf_ridge = pd.DataFrame()

# Set the metrics as a column in the dataframe
evametdf_ridge['Metrics'] = ['Mean Squared Error', 'Root Mean Squared Error', 'Mean Absolute Error', 'R-2 Score', 'Adjusted R-2 Score']

# Set the corresponding values for Ridge Regression in the dataframe
evametdf_ridge['Ridge Regression'] = [mse_ridge, rmse_ridge, mae_ridge, r2_ridge, adj_r2_ridge]

evametdf_ridge

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# Define the hyperparameter grid for Ridge regression
ridge_param_grid = {'alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000]}

# Create an instance of the Ridge regression model
ridge = Ridge()

# Create an instance of GridSearchCV with the Ridge regression model,
# the hyperparameter grid, scoring metric, and cross-validation settings
ridge_gscv = GridSearchCV(ridge, param_grid=ridge_param_grid, scoring='neg_mean_squared_error', cv=3)

# Fit the GridSearchCV instance to the training data
ridge_gscv.fit(X_train, y_train)

In [None]:
# Predict on the model
y_pred_ridge_gscv = ridge_gscv.predict(X_test)


In [None]:
# Finding the best parameter value
print("The best value of 'alpha' would be:", ridge_gscv.best_params_)

In [None]:
# Checking the model parameters after GridSearchCV
print("Coefficients:", ridge_gscv.best_estimator_.coef_)
print("Intercept:", ridge_gscv.best_estimator_.intercept_)

In [None]:
plt.figure(figsize=(8, 4))

# Set the background colors for the figure
plot_bgcolor = (36/255, 40/255, 47/255, 1)  # RGB values divided by 255, with alpha=1 for full opacity

plt.gcf().patch.set_facecolor(plot_bgcolor)  # Set the background color of the figure

# Plot the actual Close prices from the test set in blue
plt.plot(np.array(10**y_test), color='blue')

# Plot the predicted Close prices from the Ridge regression model with GridSearchCV in red
plt.plot(10**y_pred_ridge_gscv, color='red')

# Set the y-axis label
plt.ylabel("Close Price")

# Add a legend for the plotted lines
plt.legend(["Actual", "Predicted"])

# Add grid lines to the plot
plt.grid(which='major', alpha=0.5)
plt.grid(which='minor', alpha=0.5)

# Set the title of the plot with white color
plt.title("Ridge Regression with GridSearchCV", color='white')

# Display the plot
plt.show()

In [None]:
# Mean Squared Error
mse_ridge_gscv = round( mean_squared_error((10**y_test), 10**(y_pred_ridge_gscv)), 4)

# Root Mean Squared Error
rmse_ridge_gscv = round(np.sqrt(mse_ridge_gscv), 4)

# Mean Absolute Error
mae_ridge_gscv = round(mean_absolute_error((10**y_test), 10**(y_pred_ridge_gscv)), 4)

# R-2 Score
r2_ridge_gscv = round(r2_score((10**y_test), (10**y_pred_ridge_gscv)), 4)

# Adjusted R-2 Score
adj_r2_ridge_gscv = round(1 - (1 - r2_ridge_gscv)*((X_test.shape[0] - 1)/(X_test.shape[0] - X_test.shape[1] - 1)), 4)

In [None]:
# Create an empty dataframe
evametdf_ridge_gscv = pd.DataFrame()

# Create a column for the evaluation metrics
evametdf_ridge_gscv['Metrics'] = ['Mean Squared Error', 'Root Mean Squared Error', 'Mean Absolute Error', 'R-2 Score', 'Adjusted R-2 Score']

# Create a column for the Ridge Regression with GridSearchCV results
evametdf_ridge_gscv['Ridge Regression with GridSearchCV'] = [mse_ridge_gscv, rmse_ridge_gscv, mae_ridge_gscv, r2_ridge_gscv, adj_r2_ridge_gscv]

# Display the dataframe
evametdf_ridge_gscv

##### Which hyperparameter optimization technique have you used and why?

GridSearchCV was used because the Ridge regression model has only a small set of hyperparameters to tune. GridSearchCV searches the specified hyperparameter grid exhaustively and returns the combination that yields the best cross-validated performance.

In this case, the hyperparameter being tuned is alpha, which represents the regularization strength in Ridge regression. The ridge_param_grid contains a predefined list of candidate alpha values. GridSearchCV fits the Ridge regression model with each alpha value in the grid and evaluates it with 3-fold cross-validation, using the negative mean squared error as the score.

GridSearchCV is an effective approach for a small hyperparameter space because it systematically evaluates every combination within that space. As the space grows, however, the exhaustive search becomes computationally expensive and time-consuming, and techniques such as RandomizedSearchCV or Bayesian optimization may be more efficient. The right choice depends on the specific problem, the size of the hyperparameter space, and the available computational resources.
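For comparison, the RandomizedSearchCV alternative mentioned above could look roughly like the following. This is a hedged sketch, not part of the project's pipeline: the data (`X_demo`, `y_demo`) is synthetic, and the log-uniform range, `n_iter=20`, and `random_state=0` are assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption: NOT the Yes Bank dataset)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(60, 3))
y_demo = X_demo @ np.array([0.4, 0.4, 0.2]) + rng.normal(scale=0.05, size=60)

# Sample alpha from a log-uniform distribution instead of a fixed grid
demo_rscv = RandomizedSearchCV(
    Ridge(),
    param_distributions={'alpha': loguniform(1e-4, 1e4)},
    n_iter=20,                         # number of sampled alpha values
    scoring='neg_mean_squared_error',
    cv=3,
    random_state=0,                    # fixed seed so the sampling is repeatable
)
demo_rscv.fit(X_demo, y_demo)

print("Best sampled alpha:", demo_rscv.best_params_['alpha'])
```

The number of model fits is controlled by `n_iter` rather than by the size of a grid, which is what makes this approach attractive when several hyperparameters are tuned at once.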

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

In [None]:
# Concatenating two DataFrames side by side using pd.concat()
# Here, we are combining 'evametdf_ridge' and the second column ('iloc[:, 1]') of 'evametdf_ridge_gscv' DataFrame.
ridge_comp_df = pd.concat([evametdf_ridge, evametdf_ridge_gscv.iloc[:, 1]], axis=1)

# Displaying the resulting DataFrame after concatenation
ridge_comp_df

In terms of error metrics, the Ridge Regression model with GridSearchCV outperformed the other models, including Ridge Regression without GridSearchCV. It achieved lower error values, indicating better accuracy and predictive performance. The alpha value selected through GridSearchCV improved the model's fit and reduced its prediction errors, which suggests that Ridge Regression with GridSearchCV is the more reliable choice for the given dataset.

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

A careful examination of the data reveals a pronounced decline in the stock prices of Yes Bank following the exposure of the Rana Kapoor fraud in 2018.

The dataset was clean, with no missing values or duplicated rows, so little data wrangling was needed.

Although outliers were present in the features, effective outlier mitigation was achieved through the implementation of a log transformation across all features.

The log transformation also reduced the positive skewness observed in all features, making the data better suited to the linear regression models.

Strong positive correlations were observed between the independent variables (Open, High, Low) and the dependent variable (Close), implying a high predictive potential of the dependent variable based on the independent variables.

The presence of positive correlations among the independent variables suggested the presence of multicollinearity; however, given the limited dataset size, feature removal was deemed unnecessary.

Among the various implemented regression models, the Ridge Regression model, combined with GridSearchCV for hyperparameter optimization, emerged as the preferred choice. It achieved a commendable performance, boasting an RMSE of 8.3824 and an R-2 score of 0.9938.

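Notably, the 'High' and 'Low' features received positive weights, so a higher value of either raises the predicted closing price when the other features are held fixed. The 'Open' feature received a negative weight, meaning it lowers the prediction under the same condition. A negative weight does not mean the feature harms the model, and because the three features are strongly correlated, the signs of individual weights should be interpreted with caution.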

The residuals satisfied the assumptions of homoscedasticity, absence of autocorrelation, and a mean of zero, which supports the reliability of the regression model.
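Two of the residual checks named above can be computed with NumPy alone. The sketch below is illustrative only: `demo_residuals` is simulated noise rather than the project's residuals, and `durbin_watson_stat` is a hypothetical helper that implements the standard Durbin-Watson formula (values near 2 suggest little first-order autocorrelation).

```python
import numpy as np

def durbin_watson_stat(residuals):
    """Durbin-Watson statistic: sum of squared successive differences
    divided by the sum of squared residuals."""
    residuals = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)

# Simulated residuals (assumption: stand-in for actual minus predicted values)
rng = np.random.default_rng(7)
demo_residuals = rng.normal(loc=0.0, scale=1.0, size=500)

demo_mean = demo_residuals.mean()              # should be close to zero
demo_dw = durbin_watson_stat(demo_residuals)   # should be close to 2

print("Mean of residuals:", demo_mean)
print("Durbin-Watson statistic:", demo_dw)
```

On real model output, the residuals would be the difference between the actual and predicted values on the scale used for fitting; homoscedasticity is usually judged separately, for example from a plot of residuals against predictions.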

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***