# **Project Name**    - **Seoul Bike Sharing Demand Prediction**



##### **Project Type**    - Regression
***
SAMYAK JAIN


# **Project Summary -**

This project focuses on predicting the hourly demand for rental bikes in urban areas to ensure a stable and efficient supply, thereby reducing user wait times and improving the bike-sharing experience. We use exploratory data analysis (EDA) to understand patterns, relationships, and correlations in the data, and apply several regression algorithms (linear regression, ridge regression, lasso regression, and elastic net regression) to forecast bike demand. The project involves splitting the data into training and test sets, evaluating models using metrics such as RMSE, MAE, and R-squared, and tuning hyperparameters for optimal performance. The insights gained will help identify key factors affecting bike demand, such as weather and time of day, leading to recommendations for better bike distribution strategies. Ultimately, this project aims to support informed decisions that improve the availability and accessibility of rental bikes.

# **GitHub Link -**

# **Problem Statement**


**Currently Rental bikes are introduced in many urban cities for the enhancement of mobility comfort. It is important to make the rental bike available and accessible to the public at the right time as it lessens the waiting time. Eventually, providing the city with a stable supply of rental bikes becomes a major concern. The crucial part is the prediction of bike count required at each hour for the stable supply of rental bikes.**

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***Know Your Data***

### Import Libraries

In [None]:
!pip install shap

In [None]:
# Import Libraries
# Data manipulation and numerical operations
import pandas as pd
import numpy as np
import shap

# Data visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# machine learning algorithms and preprocessing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from scipy import stats
from scipy.stats import f_oneway
from scipy.stats import ttest_ind


# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

# For statistical models and tests
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Read dataset
dataset = pd.read_csv('/content/drive/MyDrive/SeoulBikeData.csv', encoding='unicode_escape')
df = dataset.copy()

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### The dataset has 8760 rows and 14 columns.

### Dataset Information

In [None]:
# Dataset Info
df.info()

### According to the above information, the given dataset does not have null values.

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_values = df.duplicated().sum()
print("Number of duplicate values:", duplicate_values)

### The given dataset contains no duplicate rows.

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
missing_values = df.isnull().sum()
print("Missing values count:\n", missing_values)

### There are no missing values in the given dataset.

Summary :

### The given dataset has 8760 rows and 14 columns, with no null (missing) values and no duplicate rows.
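
The three checks above (shape, duplicates, missing values) can be bundled into one helper. The sketch below is illustrative only: `quality_report` and the toy frame are hypothetical names and values, not part of the original notebook.

```python
import pandas as pd


def quality_report(frame: pd.DataFrame) -> dict:
    """Return row/column counts plus duplicate-row and missing-cell totals."""
    return {
        "rows": frame.shape[0],
        "columns": frame.shape[1],
        "duplicate_rows": int(frame.duplicated().sum()),
        "missing_cells": int(frame.isnull().sum().sum()),
    }


# Toy frame standing in for the bike data (hypothetical values)
toy = pd.DataFrame({
    "Hour": [0, 1, 1, 2],
    "Rented Bike Count": [254, 204, 204, None],
})

report = quality_report(toy)
print(report)
```

Calling `quality_report(df)` on the real frame would give the same four numbers that the separate cells above print one by one.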

## ***Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

* **Date** : Date on which the bikes were rented.
* **Rented Bike Count** : Number of bikes rented in each hour.
* **Hour** : Hour of the day (0-23).
* **Temperature** : Temperature in degrees Celsius.
* **Humidity** : Humidity in percent.
* **Wind Speed** : Wind speed in m/s.
* **Visibility** : Visibility, measured in units of 10 m.
* **Dew Point Temperature** : Dew point temperature in degrees Celsius.
* **Solar Radiation** : Solar radiation in MJ/m2.
* **Rainfall** : Rainfall in mm.
* **Snowfall** : Snowfall in cm.
* **Seasons** : Spring, Summer, Autumn, Winter.
* **Holiday** : Whether the day is a holiday or not.
* **Functioning Day** : Whether the day is a functioning day or not.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values using lambda function for each variable.
print(df.apply(lambda x: len(x.unique())))

In [None]:
# Function to print all the unique values each categorical column can take
def cat_unique_vals(cat_cols,df):
  for col in cat_cols:
    print("The values that the categorical column",col,"can take are:",df[col].unique())

In [None]:
# Checking the possible values of the important and meaningful categorical columns.
categorical_columns=['Seasons','Holiday']
cat_unique_vals(categorical_columns,df)

## ***Data Wrangling***

In [None]:
# Write your code to make your dataset analysis ready.
df1 = df.copy()

# **Handling Null Values**

In [None]:
# Checking for null values
df1.isnull().sum()

### **There are no null values in the given dataset.**

# **Handling Duplicate Values**

In [None]:
# Checking Duplicate Values
df1.duplicated().sum()

### **The given dataset does not have any duplicate values.**

* **The dataset has 8760 rows and 14 columns.**
* **The dataset does not have duplicate or missing values.**

## ***Data Visualization with Charts***

#### Chart - 1 : What is the distribution of 'Rented Bike Count'?

In [None]:
# Calculate the frequency of each rented bike count value
rented_bike_count = df1['Rented Bike Count'].value_counts()
print(rented_bike_count)

In [None]:
# visualization code for rented_bike_count
plt.figure(figsize=(10, 6))
sns.histplot(df1['Rented Bike Count'], bins=20, kde=True)
plt.title('Distribution of Rented Bike Count')
plt.xlabel('Rented Bike Count')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

Answer : A histogram (`histplot`) is a standard way to visualize the distribution of rented bike counts because it shows clearly how the hourly counts are spread across different ranges.

##### 2. What is/are the insight(s) found from the chart?

Answer : The distribution is right-skewed: most hours record relatively low rental counts, while a smaller number of hours record much higher counts. Activity is therefore concentrated towards the lower end of the scale, with progressively fewer hours observed as the rental count increases.

##### 3. Will the gained insights help creating a positive business impact?

Answer : The right-skewed rental data suggests sizing the baseline bike supply to typical hourly demand, promoting rentals in low-demand hours, and considering dynamic pricing to balance demand and revenue. However, ignoring the few high-demand hours could lead to missed opportunities and customer dissatisfaction during peak times.
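
Right skew can also be quantified rather than only read off the histogram. The sketch below uses synthetic gamma-distributed counts (an assumption made purely for illustration, not the real data) to show how `Series.skew()` measures the asymmetry and how a square-root or `log1p` transform changes it.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed hourly counts (gamma-distributed); NOT the real dataset
counts = pd.Series(rng.gamma(shape=1.5, scale=450.0, size=5000)).round()

# Sample skewness: positive values indicate a long right tail
raw_skew = counts.skew()
sqrt_skew = np.sqrt(counts).skew()
log_skew = np.log1p(counts).skew()

print(f"raw skew:   {raw_skew:.2f}")
print(f"sqrt skew:  {sqrt_skew:.2f}")
print(f"log1p skew: {log_skew:.2f}")
```

On the real column the same three lines, applied to `df1['Rented Bike Count']`, would indicate whether a transform is worth considering before fitting linear models.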

#### Chart - 2 : How does the distribution of 'Rented Bike Count' vary throughout the days in the dataset?

In [None]:
# Convert 'Date' column to datetime. The dates are assumed to be written as DD/MM/YYYY,
# so the format is given explicitly; values that do not match become NaT.
df1['Date'] = pd.to_datetime(df1['Date'], format='%d/%m/%Y', errors='coerce')

# Drop rows with NaT (parsing errors)
data = df1.dropna(subset=['Date'])

# Extract 'Date' and 'Rented Bike Count' columns
bike_counts_by_date = data[['Date', 'Rented Bike Count']]

# Group by day and sum 'Rented Bike Count' to get the daily total
daily_bike_counts = bike_counts_by_date.groupby(bike_counts_by_date['Date'].dt.date)['Rented Bike Count'].sum()

# Plot the distribution
plt.figure(figsize=(10, 6))
sns.lineplot(data=daily_bike_counts)
plt.title('Distribution of Rented Bike Count Throughout the Days')
plt.xlabel('Date')
plt.ylabel('Rented Bike Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


Answer : I used a line plot because it suits time-ordered data and makes temporal trends and patterns in bike rental demand easy to identify.

Answer : The graph shows that there's a rise in bike rentals during 2018 compared to 2017, with noticeable drops in rentals during 2017. Starting from early 2018, there's a general trend of more variability and peaks in bike rentals. This could suggest that bike rentals became more popular or there were increased marketing efforts around that time.
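
A daily trend like this depends on the `Date` strings being parsed in the right day/month order. The sketch below assumes the dates are written day-first (DD/MM/YYYY); the toy rows are hypothetical. Passing an explicit `format` removes the ambiguity of a string such as `01/12/2017`.

```python
import pandas as pd

# Toy rows with day-first date strings (hypothetical values)
toy = pd.DataFrame({
    "Date": ["01/12/2017", "01/12/2017", "13/12/2017", "02/01/2018"],
    "Rented Bike Count": [254, 204, 300, 150],
})

# An explicit format avoids reading 01/12/2017 as 12 January
toy["Date"] = pd.to_datetime(toy["Date"], format="%d/%m/%Y")

# Daily totals, indexed by calendar date
daily = toy.groupby(toy["Date"].dt.date)["Rented Bike Count"].sum()
print(daily)
```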

#### Chart - 3 : How does 'Temperature(°C)' affect 'Rented Bike Count'?

In [None]:
# Compute the correlation between temperature and rented bike count
correlation = df1['Temperature(°C)'].corr(df1['Rented Bike Count'])
print("Correlation between Temperature and Rented Bike Count:", correlation)

In [None]:
# visualization code
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df1, x='Temperature(°C)', y='Rented Bike Count')
plt.title("Temperature vs Rented Bike Count")
plt.xlabel("Temperature")
plt.ylabel("Rented Bike Count")
plt.grid(True)
plt.show()

Answer : A scatter plot is used to visualize the relationship between two variables by plotting individual data points on a Cartesian plane. It helps identify patterns, trends, and correlations between the variables, making it suitable for exploring how changes in one variable (e.g., temperature) affect another (e.g., bike rentals) across different observations.

Answer : The scatter plot indicates a positive correlation between temperature and bike rentals, with higher temperatures generally leading to more bike rentals. The rental count increases significantly as temperatures rise up to around 20-25°C, beyond which the rental growth plateaus or slightly decreases. This suggests that bike rental demand is highest in mild to warm weather conditions.

Answer : The insights gained from the analysis can indeed have a positive business impact. Understanding the positive correlation between temperature and bike rentals, with a peak demand in mild to warm weather conditions, allows businesses to capitalize on this relationship.
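
The plateau described above is easier to confirm with binned averages than with a raw scatter plot. A minimal sketch, using invented values and hypothetical bin edges, groups temperature into bands with `pd.cut` and compares the mean count per band:

```python
import pandas as pd

# Invented observations that rise and then level off (not the real data)
toy = pd.DataFrame({
    "Temperature(°C)": [-5, 2, 8, 14, 18, 22, 24, 28, 33, 36],
    "Rented Bike Count": [150, 220, 400, 650, 900, 1200, 1250, 1100, 950, 800],
})

# Hypothetical band edges; intervals are closed on the right, e.g. (20, 25]
bins = [-20, 0, 10, 20, 25, 30, 40]
toy["temp_band"] = pd.cut(toy["Temperature(°C)"], bins=bins)

# Mean count per temperature band, and the band with the highest mean
band_means = toy.groupby("temp_band", observed=True)["Rented Bike Count"].mean()
peak_band = band_means.idxmax()

print(band_means)
print("Peak band:", peak_band)
```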

#### Chart - 4 : How do 'Seasons' affect 'Rented Bike Count'?

In [None]:
# Calculate the total rented bike count for each season
seasons_rented_bike_count = df1.groupby('Seasons')['Rented Bike Count'].sum().reset_index()
print(seasons_rented_bike_count)

In [None]:
# visualization code
plt.figure(figsize=(10, 4))
sns.barplot(data=seasons_rented_bike_count, x='Seasons', y='Rented Bike Count', palette = 'deep')
plt.title("Seasons vs Rented Bike Count")
plt.xlabel("Seasons")
plt.ylabel("Rented Bike Count")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

Answer : According to the graph, the rented bike count is highest in the summer season, followed by autumn and then spring. The rented bike count is lowest in the winter season.

Answer : The insights indicating higher bike rentals in summer, autumn, and spring can help businesses optimize resources and marketing efforts during these peak seasons, positively impacting revenue. However, the low rental count in winter could lead to negative growth if not addressed, suggesting a need for strategies to attract customers or diversify offerings during colder months.

#### Chart - 5 : How does 'Wind speed (m/s)' affect 'Rented Bike Count'?

In [None]:
rented_bike_count_by_wind_speed = df1.groupby('Wind speed (m/s)')['Rented Bike Count'].sum().reset_index()
print(rented_bike_count_by_wind_speed)

In [None]:
# visualization code
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df1, x='Wind speed (m/s)', y='Rented Bike Count')
plt.title("Wind speed vs Rented Bike Count")
plt.xlabel("Wind speed")
plt.ylabel("Rented Bike Count")
plt.grid(True)
plt.show()

Answer : The graph shows that the rented bike count is high when the wind speed is low and falls when the wind speed is high. Rentals are highest at wind speeds of about 1-3 m/s, and very few bikes are rented when the wind speed exceeds 5 m/s.

Answer : The insights show that bike rentals are highest at low wind speeds (1-3 m/s) and drop significantly at high wind speeds (above 5 m/s). This can help businesses plan better by promoting rentals during favorable wind conditions. However, high wind speeds may lead to fewer rentals, suggesting a need for strategies to maintain demand during such conditions.

#### Chart - 6 : How do 'Holiday' affect 'Rented Bike Count'?

In [None]:
# Compute rented bike count by holiday
rented_bike_count_by_holiday = df1.groupby('Holiday')['Rented Bike Count'].sum().reset_index()
print(rented_bike_count_by_holiday)

In [None]:
# visualization code
plt.figure(figsize=(10, 6))
sns.barplot(data=rented_bike_count_by_holiday, x='Holiday', y='Rented Bike Count', palette='muted')
plt.title("Holiday vs Rented Bike Count")
plt.xlabel("Holiday")
plt.ylabel("Rented Bike Count")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

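Answer : According to the graph, the total rented bike count is much lower on holidays than on non-holidays. Because the chart compares totals and there are far fewer holiday hours than non-holiday hours in a year, part of this gap reflects group size rather than lower demand per hour.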
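
When two groups differ greatly in size, a sum and a mean can tell different stories. The sketch below uses invented numbers to show both aggregates side by side for a `Holiday` column:

```python
import pandas as pd

# Invented hourly counts: 8 non-holiday rows and 2 holiday rows
toy = pd.DataFrame({
    "Holiday": ["No Holiday"] * 8 + ["Holiday"] * 2,
    "Rented Bike Count": [700, 720, 690, 710, 705, 695, 715, 700, 650, 640],
})

# Row count, total and mean per group
summary = toy.groupby("Holiday")["Rented Bike Count"].agg(["count", "sum", "mean"])
print(summary)
```

In this toy example the totals differ by more than a factor of four while the hourly means differ by less than ten percent, which is why a mean-based bar chart is often the fairer comparison.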

#### Chart - 7 : How does visibility ('Visibility (10m)') influence bike rental count?

In [None]:
# compute bike rental count by visibility
rented_bike_count_by_visibility = df1.groupby('Visibility (10m)')['Rented Bike Count'].sum().reset_index()
print(rented_bike_count_by_visibility)

In [None]:
# visualization code
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df1, x='Visibility (10m)', y='Rented Bike Count')
plt.title("Visibility vs Rented Bike Count")
plt.xlabel("Visibility")
plt.ylabel("Rented Bike Count")
plt.grid(True)
plt.show()

Answer : The scatter plot shows the relationship between visibility and rented bike count. As visibility increases, the number of rented bikes generally increases. There is a noticeable clustering of higher bike rentals at the highest visibility levels, particularly around 2000 on the axis (the column is in units of 10 m, so about 20 km). This suggests that better visibility conditions are associated with higher bike rental counts.

Answer : The gained insights from the scatter plot indicating a positive correlation between visibility and rented bike count can indeed help in creating a positive business impact for a bike rental company.

#### Chart - 8 : Is there any noticeable relationship between dew point temperature ('Dew point temperature(°C)') and rented bike count?


In [None]:
relationship_between_dew_point_temperature_and_bike_rental = df1.groupby('Dew point temperature(°C)')['Rented Bike Count'].sum().reset_index()
print(relationship_between_dew_point_temperature_and_bike_rental)

In [None]:
# visualization code
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df1, x='Dew point temperature(°C)', y='Rented Bike Count')
plt.title("Dew point temperature vs Rented Bike Count")
plt.xlabel("Dew point temperature")
plt.ylabel("Rented Bike Count")
plt.grid(True)
plt.show()

Answer : I used a scatter plot because scatter plots visualize the relationship between two continuous variables, helping to identify patterns, trends, and correlations in the data at a glance.

Answer : Yes, there is a noticeable relationship between dew point temperature and rented bike count. The rented bike count tends to be highest when the dew point temperature is high and lowest when it is low.

Answer : The observed relationship between dew point temperature and bike rental count suggests a positive impact for the business, as it allows for strategic planning and resource allocation based on weather conditions. However, overreliance on optimal weather conditions for high rental counts may lead to negative growth during periods of unfavorable weather, potentially resulting in revenue fluctuations and operational challenges.

#### Chart - 9 : How does the functioning day ('Functioning Day') affect bike rental demand?

In [None]:
# group by functioning day and calculate the mean of bike rental demand.
bike_rental_demand_by_functioning_day = df1.groupby('Functioning Day')['Rented Bike Count'].mean().reset_index()
print(bike_rental_demand_by_functioning_day)

In [None]:
# visualization code
plt.figure(figsize=(10,6))
sns.barplot(data=bike_rental_demand_by_functioning_day, x='Functioning Day', y='Rented Bike Count', palette = 'deep')
plt.title("Functioning Day vs Rented Bike Count")
plt.xlabel("Functioning Day")
plt.ylabel("Rented Bike Count")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

Answer : The bar plot shows that the average rented bike count is high on functioning days and zero on non-functioning days.

#### Chart - 10 - Correlation Heatmap

In [None]:
# Exclude non-numeric columns from the DataFrame
numeric_df = df1.select_dtypes(include=['float64', 'int64'])

# Calculate the correlation matrix
correlation_matrix = numeric_df.corr()

# Plot the correlation heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Heatmap")
plt.show()

From the above heatmap we can draw the following insights:
* Rented bike count has its strongest positive correlation with temperature (0.54), meaning higher temperatures are associated with more bike rentals.

* There is a moderate positive correlation between rented bike count and the time of day (hour) (0.39) as well as dew point temperature (0.39), indicating more rentals during specific hours and in higher dew point conditions.

* Rented bike count is negatively correlated with humidity (-0.20), suggesting that higher humidity reduces bike rentals. The visibility coefficient is noted here as -0.24, but a negative sign would contradict the Chart 7 scatter plot, where rentals rise with visibility, so this value should be re-checked against the heatmap.

* There are weak positive correlations with wind speed (0.10) and solar radiation (0.25), indicating these factors have minimal impact on rentals.

* Both rainfall (-0.14) and snowfall (-0.11) have weak negative correlations with rented bike count, suggesting that precipitation slightly reduces bike rentals.
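
Reading coefficients off a heatmap is error-prone, so it can help to print the target's correlations as a sorted list. The sketch below runs on synthetic data (the coefficients used to generate it are arbitrary assumptions) and ranks features by the absolute size of their correlation with the target.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500

# Synthetic features and target; the generating coefficients are arbitrary
temperature = rng.normal(13, 12, n)
humidity = rng.uniform(20, 95, n)
count = 700 + 30 * temperature - 8 * humidity + rng.normal(0, 300, n)

toy = pd.DataFrame({
    "Temperature(°C)": temperature,
    "Humidity(%)": humidity,
    "Rented Bike Count": count,
})

# Correlation of every feature with the target, strongest first
target_corr = (
    toy.corr()["Rented Bike Count"]
    .drop("Rented Bike Count")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```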

#### Chart - 11 - Pair Plot

In [None]:
df1.columns

In [None]:
# Pair Plot visualization code
sns.pairplot(df1)
plt.show()

##### 2. What is/are the insight(s) found from the chart?

Answer : The pair plot provides a comprehensive view of the relationships between multiple variables. Key insights include:

1. **Temperature vs. Rented Bike Count**: There is a clear positive trend showing higher bike rentals at higher temperatures, confirming the strong correlation observed earlier.

2. **Hour vs. Rented Bike Count**: The plot shows that bike rentals have a distinct pattern across different hours, likely peaking during specific times of the day, such as morning and evening commutes.

3. **Humidity vs. Rented Bike Count**: The scatter plot reveals a weak negative trend, indicating that higher humidity tends to reduce bike rentals.

4. **Dew Point Temperature vs. Temperature**: There is a strong positive linear relationship between dew point temperature and temperature, indicating that as the temperature increases, the dew point temperature also rises.

5. **Rainfall and Snowfall**: Both these variables show a sparse distribution with rented bike count, indicating they have less frequent but notable negative impacts on bike rentals.

6. **Wind Speed, Visibility, and Solar Radiation**: These factors show weak to moderate relationships with rented bike count. Solar radiation has a slightly positive impact, while visibility and wind speed show no strong patterns.


## ***Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer :

* Null Hypothesis (H0): There is no significant difference in bike rental demand across different temperature ranges.

* Alternative Hypothesis (H1): Bike rental demand significantly varies across different temperature ranges, with a peak in demand observed in mild to warm weather conditions (20-25°C).

#### 2. Perform an appropriate statistical test.

In [None]:
# Divide the dataset into two groups based on temperature
group1 = df1[df1['Temperature(°C)'] < 20]['Rented Bike Count']
group2 = df1[df1['Temperature(°C)'] >= 20]['Rented Bike Count']

# Perform t-test
t_statistic, p_value = stats.ttest_ind(group1, group2)

# Print the results
print("T-Statistic:", t_statistic)
print("P-Value:", p_value)

### The extremely large negative T-Statistic (-47.5046) and a P-Value reported as 0.0 (too small to represent in floating point) indicate a highly significant difference in mean bike rental counts between temperatures below 20°C and those at or above 20°C, so we reject the null hypothesis. Note that this two-group test shows that demand differs around the 20°C threshold; it does not by itself locate a peak at 20-25°C.

##### Which statistical test have you done to obtain P-Value?

Answer : I used the independent two-sample t-test to calculate the P-Value.

##### Why did you choose the specific statistical test?

Answer : The t-test is used to determine whether there is a significant difference between the means of two groups, which helps in understanding whether a particular factor (like temperature) has a substantial impact on the observed data.
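
By default `ttest_ind` assumes that both groups have equal variances. When the groups differ in size and spread, as the two temperature groups may, Welch's variant (`equal_var=False`) is a common alternative. A minimal sketch on synthetic data, where the group means and spreads are invented for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic groups with different sizes and variances (not the real data)
cool = rng.normal(loc=500, scale=400, size=600)
warm = rng.normal(loc=1100, scale=700, size=300)

# Student's t-test (pooled variance) versus Welch's t-test (separate variances)
student = stats.ttest_ind(cool, warm)
welch = stats.ttest_ind(cool, warm, equal_var=False)

print("Student:", student.statistic, student.pvalue)
print("Welch:  ", welch.statistic, welch.pvalue)
```

If both versions lead to the same decision, the conclusion does not hinge on the equal-variance assumption.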

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer :
* Null Hypothesis (H0): There is no significant difference in bike rental counts across different seasons.

* Alternative Hypothesis (H1) : There is a significant difference in bike rental counts across different seasons.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# Group data by seasons
data = [df1[df1['Seasons'] == season]['Rented Bike Count'] for season in df1['Seasons'].unique()]

# Perform ANOVA test
f_statistic, p_value = stats.f_oneway(*data)

# Print the results
print("F-Statistic:", f_statistic)
print("P-Value:", p_value)

The above values indicate a highly significant difference in bike rental counts across the different seasons. The extremely low P-Value means that the likelihood of observing such a large F-Statistic under the null hypothesis (that there is no difference in bike rental counts across seasons) is extremely low. Therefore, we reject the null hypothesis and conclude that the bike rental counts vary significantly between seasons.

##### Which statistical test have you done to obtain P-Value?

Answer : I used the one-way ANOVA F-test.

##### Why did you choose the specific statistical test?

Answer : The one-way ANOVA F-test is used to determine whether there are significant differences between the means of multiple groups, making it well suited to comparing bike rental counts across different seasons.
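
The F-statistic is the ratio of between-group variance to within-group variance. The sketch below computes it by hand with NumPy on three tiny made-up groups; `one_way_f` is a hypothetical helper written for illustration, and `stats.f_oneway` computes the same ratio.

```python
import numpy as np


def one_way_f(groups):
    """One-way ANOVA F-statistic: mean square between / mean square within."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    k, n = len(groups), all_values.size

    # Variation of group means around the grand mean, weighted by group size
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    # Variation of observations around their own group mean
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Made-up groups: means 2, 3 and 6, each with the same spread
f_value = one_way_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print("F-statistic:", f_value)
```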

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer :

* Null Hypothesis (H0): There is no significant difference in bike rental counts between days with and without holidays.

* Alternative Hypothesis (H1): There is a significant difference in bike rental counts between days with and without holidays.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# Divide the dataset into two groups based on holiday
holiday_group = df1[df1['Holiday'] == 'Holiday']['Rented Bike Count']
non_holiday_group = df1[df1['Holiday'] == 'No Holiday']['Rented Bike Count']

# Perform t-test
t_statistic, p_value = ttest_ind(holiday_group, non_holiday_group)

# Print the results
print("T-Statistic:", t_statistic)
print("P-Value:", p_value)

The T-Statistic of -6.7874 and the very small P-Value (printed in scientific notation with leading digits 1.21, far below 0.05) suggest that there is a highly significant difference in bike rental counts between days with and without holidays. We therefore reject the null hypothesis.

##### Which statistical test have you done to obtain P-Value?

Answer : I used the independent two-sample t-test.

##### Why did you choose the specific statistical test?

Answer : The t-test is used to determine if there is a significant difference between the means of two groups. In this scenario, we're comparing bike rental counts between days with and without holidays, which are two distinct groups. Therefore, the t-test is appropriate for examining whether there's a statistically significant difference in bike rental counts between these two conditions.

## ***Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# missing values
df1.isnull().sum()

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer : No imputation was needed at this stage because the given dataset has no missing values.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
# Numeric feature columns only (this excludes the datetime 'Date' column), in a stable order
outlier_columns = [col for col in df1.select_dtypes(include='number').columns
                   if col not in ('Rented Bike Count', 'Hour')]
outlier_columns

In [None]:
# Plot one horizontal boxplot per numeric feature
plt.figure(figsize=(10, 6))
for index, column in enumerate(outlier_columns):
    plt.subplot(3, 3, index+1)
    sns.boxplot(x=df1[column])
    plt.title(column)
plt.tight_layout()
plt.show()

Answer : From the above boxplots we can see that the snowfall, wind speed, solar radiation and rainfall columns have outliers.

In [None]:
#Creating a list of columns that contains outliers
outlier_cols = ['Rainfall(mm)','Wind speed (m/s)','Snowfall (cm)','Solar Radiation (MJ/m2)']

#Finding the inter-quartile range for the columns with outliers
Q1 = df1[outlier_cols].quantile(0.25)
Q3 = df1[outlier_cols].quantile(0.75)
IQR = Q3-Q1

#Calculating the upper and lower fence for outlier removal
u_fence = Q3 + (1.5*IQR)
l_fence = Q1 - (1.5*IQR)

#Detecting the outliers and masking them as NaN (values inside the fences are kept)
df1[outlier_cols] = df1[outlier_cols][~((df1[outlier_cols] < l_fence) | (df1[outlier_cols] > u_fence))]

In [None]:
df1.info()

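The outlier step above replaces values outside the fences with nulls (NaN). These nulls are then imputed with the column median, which is a robust choice because the median is less influenced by extreme values than the mean. One caveat: for zero-inflated columns whose first and third quartiles are both zero (as can happen with rainfall and snowfall), the IQR is zero, so every non-zero reading is flagged and the column collapses to a constant after imputation.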
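
The IQR rule behaves differently on a zero-inflated column than on a column with ordinary spread. The sketch below uses invented values to illustrate this; `iqr_mask` is a hypothetical helper, and skipping columns with a zero IQR is one possible design choice, not the notebook's method.

```python
import pandas as pd


def iqr_mask(series: pd.Series) -> pd.Series:
    """Set values outside the 1.5*IQR fences to NaN; skip columns whose IQR is 0."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    if iqr == 0:
        # Fences would collapse onto a single value, so leave the data untouched
        return series.copy()
    return series.where(series.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr))


# Zero-inflated toy column: 17 dry hours and 3 rainy hours (invented values)
rain = pd.Series([0.0] * 17 + [0.5, 2.0, 9.5], name="Rainfall(mm)")
rain_iqr = rain.quantile(0.75) - rain.quantile(0.25)

# Unguarded rule: both fences equal 0, so every non-zero value becomes NaN
naive_masked = rain.where(rain.between(0.0, 0.0))
naive_imputed = naive_masked.fillna(naive_masked.median())

# Guarded rule keeps the rainy hours
guarded = iqr_mask(rain)

# A column with ordinary spread still has its extreme value flagged
wind = pd.Series([1.0, 1.2, 1.5, 1.7, 2.0, 2.2, 2.5, 9.0], name="Wind speed (m/s)")
wind_masked = iqr_mask(wind)

print("Rain IQR:", rain_iqr)
print("Distinct values after naive treatment:", naive_imputed.nunique())
print("Wind values flagged:", int(wind_masked.isna().sum()))
```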

In [None]:
# Impute the null values created by outlier handling with the column median
def impute_null(outlier_cols):
    for col in outlier_cols:
        df1[col] = df1[col].fillna(df1[col].median())
    return df1

df1 = impute_null(outlier_cols)
df1.isnull().sum()

### 3. Categorical Encoding

### **Categorical encoding is the process of converting categorical variables into numerical representations that can be used as inputs for machine learning algorithms.**

**In this dataset the 'Seasons', 'Holiday' and 'Functioning Day' columns hold categorical values, so these three columns require encoding.**

In [None]:
# Encoding the seasons columns
df1['Winter'] = np.where(df1['Seasons']=='Winter', 1, 0)
df1['Spring'] = np.where(df1['Seasons']=='Spring', 1, 0)
df1['Summer'] = np.where(df1['Seasons']=='Summer', 1, 0)
df1['Autumn'] = np.where(df1['Seasons']=='Autumn', 1, 0)

#Removing seasons column
df1.drop(columns=['Seasons'],axis=1,inplace=True)

In [None]:
# Encoding holiday column
holiday_mapping = {'Holiday': 1, 'No Holiday': 0}
df1['Holiday'] = df1['Holiday'].map(holiday_mapping)

In [None]:
# Encoding functioning day column
functioning_day_mapping = {'Yes': 1, 'No': 0}
df1['Functioning Day'] = df1['Functioning Day'].map(functioning_day_mapping)

In [None]:
df1.head()

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer :

* The 'Seasons' column is encoded with One-Hot Encoding. In one-hot encoding, each category is represented by a binary column where 1 indicates the presence of that category and 0 its absence. Each category in the 'Seasons' column ('Winter', 'Spring', 'Summer', 'Autumn') is encoded into a separate binary column. This technique is commonly used when the categories are nominal (unordered) and have no natural ordinal relationship.

* The 'Holiday' column is encoded with a binary label mapping: 'Holiday' is mapped to 1, indicating the presence of a holiday, while 'No Holiday' is mapped to 0, indicating its absence.

* The 'Functioning Day' column is encoded with the same kind of binary label mapping: 'Yes' is mapped to 1, indicating a functioning day, while 'No' is mapped to 0, indicating a non-functioning day.
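
The manual `np.where` and `map` steps described above have a compact pandas equivalent. A minimal sketch on a toy frame (the rows are invented) using `pd.get_dummies` for the seasons and a dictionary mapping for the binary column:

```python
import pandas as pd

# Toy rows with the same category labels as the dataset (values invented)
toy = pd.DataFrame({
    "Seasons": ["Winter", "Spring", "Summer", "Autumn", "Winter"],
    "Holiday": ["No Holiday", "Holiday", "No Holiday", "No Holiday", "Holiday"],
})

# One-hot encode the nominal column: one 0/1 column per season
season_dummies = pd.get_dummies(toy["Seasons"], dtype=int)

# Replace the original column with its indicator columns
encoded = pd.concat([toy.drop(columns="Seasons"), season_dummies], axis=1)

# Binary mapping for the two-level column
encoded["Holiday"] = encoded["Holiday"].map({"Holiday": 1, "No Holiday": 0})

print(encoded)
```

Every row has exactly one season indicator set to 1, which is what makes the full set of indicators redundant in a model with an intercept.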

In [None]:
df1.columns

In [None]:
# drop unnecessary columns
df1.drop(columns=['Date','Dew point temperature(°C)'],axis=1,inplace=True)

In [None]:
df1.columns

### **Checking and Removing Multicollinearity**

Multicollinearity refers to the phenomenon where two or more predictor variables in a regression model are highly correlated with each other. It can lead to issues such as inflated standard errors of coefficients and difficulty in interpreting the effects of individual predictors on the target variable.

Multicollinearity is measured here with the variance inflation factor (VIF); a VIF below 10 is commonly treated as acceptable.
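
For feature j, the VIF is 1 / (1 - R²), where R² comes from regressing feature j on all the other features. The NumPy sketch below computes this on synthetic data; `vif` is a hypothetical helper that includes an intercept in each auxiliary regression, so its numbers can differ from a library call made on raw columns without a constant.

```python
import numpy as np


def vif(X):
    """Variance inflation factor for each column of X (intercept included)."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(y)), others])
        # Least-squares fit of column j on the remaining columns
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r_squared = 1.0 - resid.var() / y.var()
        factors.append(1.0 / (1.0 - r_squared))
    return np.array(factors)


rng = np.random.default_rng(7)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)   # nearly a copy of x1
x3 = rng.normal(size=n)              # unrelated to the others

vif_values = vif(np.column_stack([x1, x2, x3]))
print(vif_values)
```

The two near-duplicate columns receive large factors while the unrelated column stays close to 1.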

In [None]:
# list of independent variable
independent_variable = list(set(df1.columns) - {'Rented Bike Count'})
independent_variable

In [None]:
# calculate variance inflation factor
vif_data = pd.DataFrame()
vif_data["feature"] = independent_variable
vif_data["VIF"] = [variance_inflation_factor(df1[independent_variable].values, i) for i in range(len(independent_variable))]
print(vif_data)

We observed that the predictor variables representing the seasons have very high VIF values, indicating strong multicollinearity. This is expected when all four season indicators are kept, because any one of them is determined by the other three. To address this, we drop one of the season columns; we choose 'Winter', the season with the lowest bike rental count, which then serves as the baseline season. The 'Snowfall' and 'Rainfall' columns have negligible VIF values. A low VIF is not in itself a reason to remove a feature; they are dropped here on the assumption that they carry little remaining variation after the outlier treatment above.
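
The redundancy among season indicators can be checked directly with a matrix rank. In the sketch below (toy labels, and assuming a model with an intercept column), the full set of four indicators plus the intercept has five columns but only rank four; dropping one indicator restores full column rank.

```python
import numpy as np
import pandas as pd

# Toy season labels, three rows per season
seasons = pd.Series(["Winter", "Spring", "Summer", "Autumn"] * 3)

full = pd.get_dummies(seasons, dtype=float)   # four indicator columns
reduced = full.drop(columns="Winter")         # 'Winter' becomes the baseline

# Design matrices with an explicit intercept column of ones
design_full = np.column_stack([np.ones(len(full)), full.to_numpy()])
design_reduced = np.column_stack([np.ones(len(reduced)), reduced.to_numpy()])

rank_full = np.linalg.matrix_rank(design_full)
rank_reduced = np.linalg.matrix_rank(design_reduced)

print("full:   ", design_full.shape[1], "columns, rank", rank_full)
print("reduced:", design_reduced.shape[1], "columns, rank", rank_reduced)
```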

In [None]:
#drop columns
df1.drop(columns=['Winter','Snowfall (cm)','Rainfall(mm)'],axis=1,inplace=True)

In [None]:
# list of remaining independent columns
independent_variable1 = list(set(df1.columns) - {'Rented Bike Count'})

In [None]:
# calculate VIF
vif_data = pd.DataFrame()
vif_data["feature"] = independent_variable1
vif_data["VIF"] = [variance_inflation_factor(df1[independent_variable1].values, i) for i in range(len(independent_variable1))]
print(vif_data)

We can see that the 'Functioning Day' column has a VIF above 10, so we drop it.

In [None]:
# drop column
df1.drop(columns=['Functioning Day'],axis=1,inplace=True)

In [None]:
# remaining independent columns
independent_variable2 = list(set(df1.columns) - {'Rented Bike Count'})

In [None]:
# calculate VIF
vif_data = pd.DataFrame()
vif_data["feature"] = independent_variable2
vif_data['VIF'] = [variance_inflation_factor(df1[independent_variable2].values, i) for i in range(len(independent_variable2))]
print(vif_data)

Now every remaining feature has a VIF below 10.

### Check correlation between independent and dependent variables

In [None]:
# Check correlation between independent_variables2 and dependent variable using regression plot
for col in independent_variable2:
    plt.figure(figsize=(8, 4))
    sns.regplot(x=col, y='Rented Bike Count', data=df1, scatter_kws={"color" : 'pink'}, line_kws={"color" : 'black'})
    correlation = df1[col].corr(df1['Rented Bike Count'])
    plt.title(f"Correlation between {col} and Rented Bike Count: {correlation:.3f}")
    plt.ylabel('Rented Bike Count')
    plt.show()

We can see that each independent variable shows a linear relationship with the dependent variable; the correlation coefficient in each plot title indicates how strong it is.

# **Preprocessing of the Data**

In [None]:
# Creating dataset independent variables and dependent variables
X = df1.drop(columns=['Rented Bike Count'])
y = df1['Rented Bike Count']

In [None]:
# independent variable dataset
X.head()

In [None]:
# dependent variable dataset
y.head()

### Feature Transformation

In [None]:
# checking distribution of the target variable
plt.figure(figsize=(10,6))
sns.histplot(y)
plt.show()

We can see that the target variable has a positively skewed distribution, so we reduce the skew with a square root transformation.

In [None]:
# apply square root transformation
y = np.sqrt(y)

In [None]:
# plot histplot
plt.figure(figsize=(10,6))
sns.histplot(y)
plt.show()

Now the distribution of the target variable is much closer to normal.
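
The effect of the transformation can also be checked numerically instead of only visually. The sketch below is an illustration on synthetic right-skewed data (the gamma-distributed `counts` array and the `skewness` helper are assumptions, not the notebook's data). It also shows that squaring undoes the transformation, which is needed to turn predictions back into bike counts.

```python
import numpy as np

def skewness(values):
    """Sample skewness: third central moment over the cubed standard deviation."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

rng = np.random.default_rng(42)
counts = rng.gamma(shape=1.5, scale=400.0, size=5000)  # synthetic, right-skewed

skew_raw = skewness(counts)
skew_sqrt = skewness(np.sqrt(counts))

# Squaring maps values on the square-root scale back to the count scale
restored = np.sqrt(counts) ** 2

print(round(skew_raw, 2), round(skew_sqrt, 2))
```

A skewness closer to zero after the transformation supports the visual impression from the histogram.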

### **Apply test and train split**

In [None]:
# split dataset into train and test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# check the shape of the train dataset of the independent variables
X_train.shape

In [None]:
# check the shape of the test dataset of the independent variables
X_test.shape

### **Feature Scaling:**

Feature scaling involves changing the range or distribution of numerical features so that they have similar scales.

### **Common methods include:**

* **Standardization (Z-score normalization)**: Centers the data around zero with a standard deviation of one.
* **Normalization (Min-Max scaling)**: Scales the data to a fixed range, typically [0, 1].
* **Robust Scaling**: Uses the median and interquartile range, which makes it robust to outliers.

Here we are going to use the standardization method.
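
As a minimal sketch of what standardization does: the mean and standard deviation are estimated on the training data only and then reused for the test data, so both sets are scaled with the same parameters. The arrays `X_tr` and `X_te` are synthetic stand-ins, not the notebook's data.

```python
import numpy as np

rng = np.random.default_rng(0)
X_tr = rng.normal(loc=20.0, scale=5.0, size=(800, 2))  # synthetic "train"
X_te = rng.normal(loc=20.0, scale=5.0, size=(200, 2))  # synthetic "test"

# Parameters come from the training data only
mean_ = X_tr.mean(axis=0)
std_ = X_tr.std(axis=0)  # population standard deviation (ddof=0)

X_tr_scaled = (X_tr - mean_) / std_
X_te_scaled = (X_te - mean_) / std_  # same mean_ and std_ as for train

print(X_tr_scaled.mean(axis=0).round(6), X_tr_scaled.std(axis=0).round(6))
print(X_te_scaled.mean(axis=0).round(3))
```

The scaled test set will usually not have a mean of exactly zero. That is expected, because it is scaled with the training statistics rather than its own.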

In [None]:
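# Apply standardization: fit the scaler on the training data only,
# then reuse the same mean and standard deviation for the test data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)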

In [None]:
# dataset after standardization
X_train

## ***7. ML Model Implementation***

In this project we are going to implement following regression models:

1. Linear Regression
2. Ridge Regression
3. Lasso Regression
4. Elastic Net Regression

### ML Model - 1 : Linear Regression:

Linear regression is the simplest form of regression analysis where the relationship between the dependent variable y and one or more independent variables X is modeled by fitting a linear equation to the observed data.
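
As a rough illustration of "fitting a linear equation", the sketch below solves the least-squares problem directly with NumPy on synthetic data. The data and the names `X_demo`, `y_demo`, `true_w` and `true_b` are assumptions; the notebook itself uses `LinearRegression`.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X_demo = rng.normal(size=(n, 2))
true_w = np.array([3.0, -2.0])   # hypothetical true slopes
true_b = 5.0                     # hypothetical true intercept
y_demo = X_demo @ true_w + true_b + 0.1 * rng.normal(size=n)

# Prepend a column of ones so the intercept is estimated with the slopes
A = np.column_stack([np.ones(n), X_demo])
beta, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
intercept, coefs = beta[0], beta[1:]

print(round(intercept, 2), coefs.round(2))
```

The recovered intercept and slopes should be close to the values used to generate the data, up to the added noise.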

In [None]:
# Initialize linear regression model
linear_model = LinearRegression()

# Fit the model on the training data
linear_model.fit(X_train, y_train)

In [None]:
# check the score of model
linear_model.score(X_train, y_train)

In [None]:
# Checking coefficient value
linear_model.coef_

In [None]:
# predicting the value of the dependent variable for train and test dataset
y_train_pred_lr = linear_model.predict(X_train)
y_test_pred_lr = linear_model.predict(X_test)
print(y_train_pred_lr)
print(y_test_pred_lr)

In [None]:
#Creating a function to plot the comparison between actual values and predictions
def plot_comparison(y_pred,model):
   plt.figure(figsize=(8,4))
   plt.title("The comparison of actual values and predictions obtained by "+model)
   plt.plot(np.array((y_test)))
   plt.plot((y_pred),color='red')
   plt.legend(["Actual","Predicted"])
   plt.show()

In [None]:
#Plotting the comparison between actual and predicted values obtained by Linear Regression
plot_comparison(y_test_pred_lr,'Linear Regression')

In [None]:
# Calculate evaluation metrics of the model
print('MAE:', mean_absolute_error(y_test, y_test_pred_lr))
print('MSE:', mean_squared_error(y_test, y_test_pred_lr))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_test_pred_lr)))
print('R2 Score:', r2_score(y_test, y_test_pred_lr))

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Define the metrics
metrics = {
    'MAE': 6.631991473282277,
    'MSE': 81.09904863897937,
    'RMSE': 9.005501020985971,
    'R2 Score': 0.4730974203328704,
}

In [None]:
# Visualizing evaluation Metric Score chart
# Convert the dictionary to two lists
metric_names = list(metrics.keys())
metric_values = list(metrics.values())

# Create the bar chart
plt.figure(figsize=(8, 4))
sns.barplot(x=metric_names, y=metric_values, palette='viridis')

# Add titles and labels
plt.title('Regression Model Performance Metrics')
plt.ylabel('Score')
plt.xlabel('Metrics')

# Display the values on top of the bars
for i, v in enumerate(metric_values):
    plt.text(i, v + 0.05, f"{v:.10f}", ha='center', va='bottom')

plt.show()

From the above chart we can see the following. All errors are on the square-root scale of the target, because `y` was transformed before training:
* The Mean Absolute Error (MAE) of 6.63 indicates that, on average, the model's predictions are off by approximately 6.63 units from the actual values.
* The Mean Squared Error (MSE) of 81.10 is the average squared difference between the predicted and actual values. Because the errors are squared, large errors weigh more heavily in this metric.
* The Root Mean Square Error (RMSE) of 9.01 is the square root of the MSE and expresses the typical error magnitude in the same units as the target. It is noticeably larger than the MAE, which suggests that some predictions have relatively large errors.
* An R2 score of 0.473 means that approximately 47.3% of the variance in the dependent variable is explained by the independent variables in the model, indicating moderate predictive capability.

In summary, the model's performance is moderate based on these metrics. Further evaluation is necessary to determine whether it is adequate for the dataset and whether improvements are needed.
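
Because the target was square-root transformed, the metrics can also be reported in bike counts by squaring both the actual and the predicted values first. The sketch below computes the four metrics by hand on hypothetical numbers; the helper `regression_metrics` and the sample arrays are assumptions for illustration.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R2 for two equally long arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mse = (err ** 2).mean()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    return {
        "MAE": np.abs(err).mean(),
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "R2": 1.0 - (err ** 2).sum() / ss_tot,
    }

# Hypothetical values on the square-root scale
y_true_sqrt = np.array([10.0, 20.0, 30.0, 40.0])
y_pred_sqrt = np.array([12.0, 18.0, 33.0, 39.0])

on_sqrt_scale = regression_metrics(y_true_sqrt, y_pred_sqrt)
# Square both sides to get back to bike counts before computing the metrics
on_count_scale = regression_metrics(y_true_sqrt ** 2, y_pred_sqrt ** 2)

print(on_sqrt_scale)
print(on_count_scale)
```

In this toy case an MAE of 2 on the square-root scale corresponds to an MAE of 97 bikes, which shows how different the two scales can look.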

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
from sklearn.model_selection import GridSearchCV

# Define the model
model = LinearRegression()

# Define the parameter grid
param_grid = {
    'fit_intercept': [True, False],
    'n_jobs': [None, -1, 1, 2],
    'copy_X':[True, False]
}

# Perform GridSearchCV
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

In [None]:
# Get the best estimator
best_model = grid_search.best_estimator_

# Predict on test set
y_pred = best_model.predict(X_test)

# Calculate evaluation metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = mse ** 0.5
r2 = r2_score(y_test, y_pred)

# Output the best parameters and new metrics
print("Best parameters from GridSearchCV:", grid_search.best_params_)
print("MAE after tuning:", mae)
print("MSE after tuning:", mse)
print("RMSE after tuning:", rmse)
print("R2 Score after tuning:", r2)

##### Which hyperparameter optimization technique have you used and why?

Answer : I used GridSearchCV for hyperparameter optimization because it exhaustively evaluates every combination of the specified parameter values with cross-validation, so the best combination within the grid is found. This thorough approach is practical for small to moderately sized parameter grids.
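
The idea behind an exhaustive grid search can be sketched by hand: enumerate every parameter combination, score each with k-fold cross-validation, and keep the one with the lowest error. The code below is a simplified illustration on synthetic data, using a closed-form ridge fit as the model. The helpers `fit_ridge` and `cv_mse` and the grid values are assumptions, not what `GridSearchCV` runs internally.

```python
import itertools
import numpy as np

def fit_ridge(X, y, alpha, fit_intercept):
    """Closed-form ridge fit; returns (weights, intercept)."""
    if fit_intercept:
        x_mean, y_mean = X.mean(axis=0), y.mean()
    else:
        x_mean, y_mean = np.zeros(X.shape[1]), 0.0
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w

def cv_mse(X, y, params, k=5):
    """Mean validation MSE over k consecutive, unshuffled folds."""
    scores = []
    for val_idx in np.array_split(np.arange(len(y)), k):
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[val_idx] = False
        w, b = fit_ridge(X[train_mask], y[train_mask], **params)
        pred = X[val_idx] @ w + b
        scores.append(((y[val_idx] - pred) ** 2).mean())
    return float(np.mean(scores))

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = X_demo @ np.array([2.0, 0.0, -1.0]) + 4.0 + rng.normal(scale=0.5, size=300)

grid = {"alpha": [0.01, 1.0, 100.0], "fit_intercept": [True, False]}
results = []
for combo in itertools.product(*grid.values()):
    params = dict(zip(grid.keys(), combo))
    results.append((cv_mse(X_demo, y_demo, params), params))

best_score, best_params = min(results, key=lambda r: r[0])
print(best_params, round(best_score, 3))
```

The number of fits grows as the product of the grid sizes times the number of folds, which is why this approach suits small grids.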

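##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer : No, there is no improvement in the evaluation metrics. This is expected: `n_jobs` and `copy_X` do not change the fitted model, so `fit_intercept` is the only parameter in the grid that can affect the predictions.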

### ML Model - 2 : Ridge Regression

Ridge regression is a type of linear regression that includes a regularization term to prevent overfitting. This regularization term, which is the sum of the squared coefficients multiplied by a penalty factor (alpha), shrinks the coefficients towards zero, adding bias but reducing variance and potentially improving model performance on new data.
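
The shrinkage effect can be illustrated with the closed-form ridge solution w = (XᵀX + alpha·I)⁻¹ Xᵀy on centered data. The sketch below uses synthetic data and a hypothetical helper `ridge_coefficients`; it shows that the length of the coefficient vector decreases as alpha grows.

```python
import numpy as np

def ridge_coefficients(X, y, alpha):
    """Closed-form ridge weights on centered data (intercept handled by centering)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([4.0, -3.0, 1.0]) + rng.normal(scale=0.5, size=200)

alphas = [0.01, 1.0, 10.0, 100.0, 1000.0]
norms = [float(np.linalg.norm(ridge_coefficients(X_demo, y_demo, a))) for a in alphas]

for a, nrm in zip(alphas, norms):
    print(a, round(nrm, 3))
```

Coefficients shrink toward zero but do not become exactly zero, which is the main practical difference from Lasso.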

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Create a Ridge regression model
ridge= Ridge()

# Define the parameter grid for alpha
param_grid = {'alpha': [0.01, 0.1, 1.0, 10.0, 100.0]}

# Perform GridSearchCV to find the best alpha
grid_search = GridSearchCV(ridge, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

In [None]:
#Getting the best parameters for Ridge regression fetched through GridSearchCV
print(f"The best value for alpha in ridge regression through GridSearchCV is found to be {grid_search.best_params_}")
print(f"\nUsing {grid_search.best_params_} as the value for alpha gives a negative mean squared error of: {grid_search.best_score_}")

In [None]:
#Fitting the Ridge regression model on the dataset with appropriate alpha value
ridge_model=Ridge(alpha=10).fit(X_train,y_train)

In [None]:
#Predicting values of the dependent variable on the test set
y_test_pred_ridge = ridge_model.predict(X_test)

In [None]:
#Plotting the comparison between actual and predicted values obtained by Ridge Regression
plot_comparison(y_test_pred_ridge,'Ridge Regression')

In [None]:
# Calculate metrics for the Ridge regression model
ridge_mae = mean_absolute_error(y_test, y_test_pred_ridge)
ridge_mse = mean_squared_error(y_test, y_test_pred_ridge)
ridge_rmse = ridge_mse ** 0.5
ridge_r2 = r2_score(y_test, y_test_pred_ridge)

print("\nBest alpha from GridSearchCV:", grid_search.best_params_['alpha'])
print("\nRidge Regression Metrics After Hyperparameter Tuning:")
print(f"MAE: {ridge_mae}")
print(f"MSE: {ridge_mse}")
print(f"RMSE: {ridge_rmse}")
print(f"R2 Score: {ridge_r2}")

In [None]:
# Define the metrics
metrics_1 = {
    'MAE': 6.631908141849714,
    'MSE': 81.10342522112052,
    'RMSE': 9.00574401263552,
    'R2 Score': 0.473068985567494,
}

In [None]:
# Visualizing evaluation Metric Score chart
# Convert the dictionary to two lists
metric_names = list(metrics_1.keys())
metric_values = list(metrics_1.values())

# Create the bar chart
plt.figure(figsize=(8, 4))
sns.barplot(x=metric_names, y=metric_values, palette='viridis')

# Add titles and labels
plt.title('Regression Model Performance Metrics')
plt.ylabel('Score')
plt.xlabel('Metrics')

# Display the values on top of the bars
for i, v in enumerate(metric_values):
    plt.text(i, v + 0.05, f"{v:.10f}", ha='center', va='bottom')

plt.show()

### ML Model - 3 : Lasso Regression

Lasso regression, or L1 regularization, is a linear regression technique that adds a penalty term to the loss function, constraining the absolute size of the coefficients. It encourages sparse models by shrinking coefficients to zero, effectively performing feature selection and providing interpretable models with fewer predictors.
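
How the L1 penalty produces exact zeros can be sketched with coordinate descent and soft-thresholding. The code below is a simplified illustration on synthetic, standardized data. It minimizes (1/(2n))·||y - Xw||² + alpha·||w||₁ and is not the solver used by the library; the function names are assumptions.

```python
import numpy as np

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0 if it is within the threshold."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Cyclic coordinate descent for the lasso objective (no intercept)."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # Residual with feature j's current contribution added back
            residual = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ residual / n
            w[j] = soft_threshold(rho, alpha) / col_sq[j]
    return w

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 4))
X_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
# Only the first two features influence the target
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=400)
y_demo = y_demo - y_demo.mean()

w_small = lasso_coordinate_descent(X_demo, y_demo, alpha=0.01)
w_large = lasso_coordinate_descent(X_demo, y_demo, alpha=0.5)

print(w_small.round(3))
print(w_large.round(3))
```

With the larger alpha, the coefficients of the two irrelevant features are set exactly to zero, while the relevant ones are shrunk but kept.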

In [None]:
# Initialization of lasso regression
lasso = Lasso()

# Define the parameter grid for alpha
param_grid = {'alpha': [0.01, 0.1, 1.0, 10.0, 100.0]}

# Perform GridSearchCV to find the best alpha
lasso_grid_search = GridSearchCV(lasso, param_grid, cv=5, scoring='neg_mean_squared_error')
lasso_grid_search.fit(X_train, y_train)

In [None]:
#Getting the best parameters for Lasso regression fetched through GridSearchCV
print(f"The best value for alpha in lasso regression through GridSearchCV is found to be {lasso_grid_search.best_params_}")
print(f"\nUsing {lasso_grid_search.best_params_} as the value for alpha gives a negative mean squared error of: {lasso_grid_search.best_score_}")

In [None]:
#Fitting the lasso regression model on the dataset with appropriate alpha value
lasso_model=Lasso(alpha=0.01).fit(X_train,y_train)

In [None]:
#Predicting values of the dependent variable on the test set
y_test_pred_lasso = lasso_model.predict(X_test)

In [None]:
#Plotting the comparison between actual and predicted values obtained by Lasso Regression
plot_comparison(y_test_pred_lasso,'Lasso Regression')

In [None]:
# Calculate metrics for the lasso regression model
lasso_mae = mean_absolute_error(y_test, y_test_pred_lasso)
lasso_mse = mean_squared_error(y_test, y_test_pred_lasso)
lasso_rmse = lasso_mse ** 0.5
lasso_r2 = r2_score(y_test, y_test_pred_lasso)

print("\nBest alpha from GridSearchCV:", lasso_grid_search.best_params_['alpha'])
print("\nLasso Regression Metrics After Hyperparameter Tuning:")
print(f"MAE: {lasso_mae}")
print(f"MSE: {lasso_mse}")
print(f"RMSE: {lasso_rmse}")
print(f"R2 Score: {lasso_r2}")

In [None]:
# Define the metrics
metrics_2 = {
    'MAE': 6.630418122574132,
    'MSE': 81.09393160547758,
    'RMSE': 9.005216910517902,
    'R2 Score': 0.4731306658295511,
}

In [None]:
# Visualizing evaluation Metric Score chart
# Convert the dictionary to two lists
metric_names = list(metrics_2.keys())
metric_values = list(metrics_2.values())

# Create the bar chart
plt.figure(figsize=(8, 4))
sns.barplot(x=metric_names, y=metric_values, palette='viridis')

# Add titles and labels
plt.title('Regression Model Performance Metrics')
plt.ylabel('Score')
plt.xlabel('Metrics')

# Display the values on top of the bars
for i, v in enumerate(metric_values):
    plt.text(i, v + 0.05, f"{v:.10f}", ha='center', va='bottom')

plt.show()

### ML Model - 4 : Elastic Net Regression

Elastic Net Regression is a hybrid regularization technique that combines L1 (Lasso) and L2 (Ridge) penalties in the loss function. It balances between feature selection (Lasso) and coefficient shrinkage (Ridge), providing a solution for multicollinearity and improving model performance by handling both high-dimensional data and correlated predictors.
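
How the two penalties are combined can be written out explicitly. The sketch below evaluates the objective (1/(2n))·||y - Xw||² + alpha·l1_ratio·||w||₁ + 0.5·alpha·(1 - l1_ratio)·||w||² for given weights. The helper name and the toy numbers are assumptions, and the intercept is left out for brevity.

```python
import numpy as np

def elastic_net_objective(X, y, w, alpha, l1_ratio):
    """Squared-error term plus the mixed L1/L2 penalty (no intercept)."""
    n = len(y)
    residual = y - X @ w
    data_term = residual @ residual / (2 * n)
    l1_term = alpha * l1_ratio * np.abs(w).sum()
    l2_term = 0.5 * alpha * (1 - l1_ratio) * (w @ w)
    return data_term + l1_term + l2_term

# Toy problem: two samples, two features
X_toy = np.eye(2)
y_toy = np.array([1.0, 2.0])
w_toy = np.array([1.0, 1.0])

mixed = elastic_net_objective(X_toy, y_toy, w_toy, alpha=0.1, l1_ratio=0.5)
pure_l1 = elastic_net_objective(X_toy, y_toy, w_toy, alpha=0.1, l1_ratio=1.0)
pure_l2 = elastic_net_objective(X_toy, y_toy, w_toy, alpha=0.1, l1_ratio=0.0)

print(mixed, pure_l1, pure_l2)
```

Setting `l1_ratio` to 1 leaves only the L1 (Lasso-style) penalty and setting it to 0 leaves only the L2 (Ridge-style) penalty, so `l1_ratio=0.5` weights both equally.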

In [None]:
# Initialize elastic net regression
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)

# Fit the model on the training data
elastic_net.fit(X_train, y_train)

In [None]:
# check the score
elastic_net.score(X_train, y_train)

In [None]:
# predict on the test set
y_test_pred_elastic_net = elastic_net.predict(X_test)
y_test_pred_elastic_net

In [None]:
#Plotting the comparison between actual and predicted values obtained by Elastic Net Regression
plot_comparison(y_test_pred_elastic_net,'Elastic Net Regression')

In [None]:
# Calculate evaluation metrics
elastic_mae = mean_absolute_error(y_test, y_test_pred_elastic_net)
elastic_mse = mean_squared_error(y_test, y_test_pred_elastic_net)
elastic_rmse = elastic_mse ** 0.5
elastic_r2 = r2_score(y_test, y_test_pred_elastic_net)

print(f"MAE: {elastic_mae}")
print(f"MSE: {elastic_mse}")
print(f"RMSE: {elastic_rmse}")
print(f"R2 Score: {elastic_r2}")

In [None]:
# Visualization
metrics_3 = {
    'MAE': elastic_mae,
    'MSE': elastic_mse,
    'RMSE': elastic_rmse,
    'R2 Score': elastic_r2
}

# plot the barplot
plt.figure(figsize=(8, 4))
sns.barplot(x=list(metrics_3.keys()), y=list(metrics_3.values()), palette='viridis')

# Display the values on top of the bars (use this model's own values)
for i, v in enumerate(metrics_3.values()):
    plt.text(i, v + 0.05, f"{v:.10f}", ha='center', va='bottom')

# Add titles and labels
plt.title('Regression Model Performance Metrics')
plt.ylabel('Score')
plt.xlabel('Metrics')
plt.show()

In [None]:
# Define dictionaries for each model's metrics
linear_regression_metrics = {
    'Model': 'Linear Regression',
    'MAE': 6.631991473282277,
    'MSE': 81.09904863897937,
    'RMSE': 9.005501020985971,
    'R2 Score': 0.4730974203328704
}

ridge_regression_metrics = {
    'Model': 'Ridge Regression',
    'MAE': 6.631908141849714,
    'MSE': 81.10342522112052,
    'RMSE': 9.00574401263552,
    'R2 Score': 0.473068985567494
}

lasso_regression_metrics = {
    'Model': 'Lasso Regression',
    'MAE': 6.630418122574132,
    'MSE': 81.09393160547758,
    'RMSE': 9.005216910517902,
    'R2 Score': 0.4731306658295511
}

elastic_net_regression_metrics = {
    'Model': 'Elastic Net Regression',
    'MAE': 6.6717984977445175,
    'MSE': 81.62135825952943,
    'RMSE': 81.62135825952943 ** 0.5,  # square root of this model's own MSE
    'R2 Score': 0.46970396145670157
}

# Create a DataFrame
metrics_df = pd.DataFrame([linear_regression_metrics, ridge_regression_metrics, lasso_regression_metrics, elastic_net_regression_metrics])

# Print the DataFrame
print(metrics_df)

Based on these metrics:

* Linear Regression, Ridge Regression, and Lasso Regression have very similar performance across all metrics.
* Elastic Net Regression performs slightly worse than the other models, with higher MAE and MSE and a lower R2 Score.

Given how similar Linear Regression, Ridge Regression, and Lasso Regression are in performance, any of these three models can be considered the best choice for this dataset. Elastic Net Regression is not the preferred choice in this case because of its slightly poorer performance.

## **Model Explainability**
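
For a linear model, SHAP values have a simple closed form when the features are treated as independent: the contribution of feature j for one row is coef_j · (x_j - mean(x_j)), and the contributions plus the average prediction add up to the model output. The sketch below illustrates this on synthetic data with hypothetical coefficients. It is not a replacement for the `shap` library calls used in this section.

```python
import numpy as np

rng = np.random.default_rng(0)
background = rng.normal(size=(500, 3))   # synthetic stand-in for scaled training data
coef = np.array([2.0, -1.0, 0.5])        # hypothetical fitted coefficients
intercept = 4.0                          # hypothetical fitted intercept

def predict(X):
    return X @ coef + intercept

def linear_shap_values(X, coef, background):
    """Per-feature contributions relative to the background mean."""
    return (X - background.mean(axis=0)) * coef

sample = background[:5]
phi = linear_shap_values(sample, coef, background)

# Additivity: base value plus contributions reproduces each prediction
base_value = predict(background).mean()
reconstructed = base_value + phi.sum(axis=1)

# Mean absolute contribution per feature, the quantity a bar summary plot ranks by
mean_abs = np.abs(linear_shap_values(background, coef, background)).mean(axis=0)
print(mean_abs.round(3))
```

Because the features are standardized, a larger absolute coefficient translates directly into a larger mean absolute contribution in this setting.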

In [None]:
def shap_summary(model):
    # X_train and X are read from the notebook's global scope

    # Create a Shap explainer
    explainer_shap = shap.Explainer(model=model, masker=X_train)

    # Calculate Shap values
    shap_values = explainer_shap.shap_values(X_train)

    # Set color palette
    colors = ["#008fd5", "#fc4f30", "#e5ae38", "#6d904f", "#8b8b8b"]

    # Plot the summary plot
    shap.summary_plot(shap_values, X_train, feature_names=X.columns, plot_type="bar", color=colors)

In [None]:
#Plotting shap summary plot for linear regression
shap_summary(linear_model)

In [None]:
#Plotting shap summary plot for ridge regression
shap_summary(ridge_model)

In [None]:
#Plotting shap summary plot for lasso regression
shap_summary(lasso_model)

In [None]:
#Plotting shap summary plot for elastic net regression
shap_summary(elastic_net)

# Conclusion

1. Summary of the EDA :

  * Bike rentals rose in 2018 compared to 2017, with noticeable drops in rentals during 2017, suggesting increased popularity or intensified marketing efforts from early 2018 onwards.

  * Bike rental demand is high in warm weather conditions and low in mild conditions.
  * The rented bike count is highest in the summer season, followed by autumn and then spring. The rented bike count is lowest in the winter season.

  * When the wind speed is low, the rented bike count is high, and when the wind speed is at its maximum, the count is low.

  * On holidays the rented bike count is very low, and on non-holidays it is very high.

  * Better visibility conditions are associated with higher bike rental counts.

  * The rented bike count is highest when the dew point temperature is highest and lowest when it is lowest.

  * On a functioning day the rented bike count is high, and on a non-functioning day it is zero.

  * Rented bike count has a moderately strong positive correlation with temperature (0.54), meaning higher temperatures are associated with more bike rentals.

2. Summary of the Machine Learning Model applications :

  * Based on these metrics:

    * Linear Regression, Ridge Regression, and Lasso Regression have very similar performance across all metrics.

    * Elastic Net Regression performs slightly worse than the other models, with higher MAE and MSE and a lower R2 Score.

    * Given how similar Linear Regression, Ridge Regression, and Lasso Regression are in performance, any of these three models can be considered the best choice for this dataset. Elastic Net Regression is not the preferred choice in this case because of its slightly poorer performance.

    * All 4 models have been explained with the help of SHAP library.

    * Temperature and Hour are the two most important factors according to all the models.

3. Cornerstone points of the process:

  * Handling outliers to avoid their impact on predictions.
  * Converting categorical variables into numerical representations.
  * Managing multicollinearity among predictor variables.
  * Choosing an easy-to-understand technique to explain model predictions.


