# **Project Name**    -  **Bike Sharing Demand Prediction**



##### **Project Type**    - Regression
##### **Contribution**    - Individual (Sayesh Ankaram)


# **Project Summary -**

The emergence of bike and scooter ride-sharing companies in urban areas has created a challenge in accurately predicting the demand for their services. Overestimating demand leads to resource wastage, while underestimating it leads to revenue loss. To address this challenge, this project aims to combine historical bike usage patterns with weather data to forecast bike rental demand.

This project uses a dataset with 14 columns: the target 'Rented Bike Count', the calendar variables 'Date', 'Hour', 'Seasons', 'Holiday' and 'Functioning Day', and the weather variables 'Temperature', 'Humidity', 'Wind speed', 'Visibility', 'Dew point temperature', 'Solar Radiation', 'Rainfall' and 'Snowfall'. Python libraries such as Pandas, Seaborn, NumPy, and scikit-learn (sklearn) are used to develop the prediction algorithm. By evaluating different models, the project seeks to identify algorithms that provide accurate predictions and can be deployed effectively in real-world scenarios.

Accurate bike rental demand forecasting offers significant benefits. Ride-sharing companies can reduce waste and improve resource allocation, resulting in cost savings and increased profitability. By optimizing bike maintenance, parking space allocation, and operational planning based on anticipated demand, these companies can operate more efficiently.

Moreover, accurate demand predictions enhance customer satisfaction and provide a better overall experience for users. By ensuring an adequate supply of bikes and scooters based on anticipated demand, customers are less likely to face unavailability issues. This fosters customer loyalty, positive word-of-mouth, and sustained business growth.

Additionally, bike and scooter ride-sharing services are considered environmentally friendly alternatives to traditional transportation methods. By incorporating weather data into demand forecasting, it becomes possible to align the supply of bikes and scooters with weather conditions suitable for cycling. This encourages more people to choose biking as a means of transportation, resulting in reduced traffic congestion and lower carbon emissions. Accurate demand forecasting contributes to the broader goal of promoting sustainable and eco-friendly urban mobility.

In conclusion, this project's aim of combining historical bike usage patterns with weather data for accurate demand forecasting holds significant potential for the bike and scooter ride-sharing industry. By utilizing machine learning techniques, the project seeks to optimize resource allocation, reduce waste, and increase profitability for ride-sharing companies. Simultaneously, it strives to enhance customer satisfaction, promote environmentally friendly transportation alternatives, and mitigate traffic congestion and carbon emissions. Data-driven insights can have a positive impact on both the business and environmental aspects of the bike and scooter ride-sharing industry, leading to a more sustainable and efficient urban mobility landscape.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**Currently Rental bikes are introduced in many urban cities for the enhancement of mobility comfort. It is important to make the rental bike available and accessible to the public at the right time as it lessens the waiting time. Eventually, providing the city with a stable supply of rental bikes becomes a major concern. The crucial part is the prediction of bike count required at each hour for the stable supply of rental bikes.**

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart answers the following questions.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
from google.colab import drive                                                  # for mounting the dataset

import numpy as np                                                              # for numerical operations on n-dimensional arrays
import pandas as pd                                                             # for working with tabular (DataFrame) data
import matplotlib.pyplot as plt                                                 # for plotting 2D graphs
import seaborn as sns                                                           # for plotting more sophisticated statistical graphs
import missingno as msno                                                        # for plotting null values

import random                                                                   # for randomly picking visualizations background style

import warnings                                                                 # for ignoring any warning interruptions that can disrupt the flow of code

from datetime import datetime                                                   # for converting the Dtype of a categorical date feature from object to datetime64[ns]
import datetime as dt

from statsmodels.stats.outliers_influence import variance_inflation_factor      # for using VIF in order to detect multicollinearity between features

import statsmodels.api as sm                                                    # Perform Statistical Test to obtain P-Value

from scipy.stats import zscore                                                  # for Scaling the data

from sklearn.preprocessing import MinMaxScaler                                  # for transforming the features on a common scale
from sklearn.preprocessing import OneHotEncoder                                 # for transforming categorical data into numerical binary format
from sklearn.preprocessing import OrdinalEncoder                                # for transforming categorical data into ordinal values with natural ordering
from sklearn.preprocessing import LabelEncoder                                  # for transforming categorical data into unique integer values

from sklearn.model_selection import train_test_split                            # for splitting the dataset into training & testing set
from sklearn.model_selection import GridSearchCV                                # for utilizing Grid Search Cross Validation
from sklearn.model_selection import cross_validate                              # for utilizing cross validation in order to check the performance of the model
from sklearn.model_selection import RandomizedSearchCV                          # for utilizing Random Search Cross Validation in order to select the best hyperparameters for the model

from sklearn.linear_model import LinearRegression                               # for fitting a Linear Regression model onto the data distribution
from sklearn.linear_model import Lasso                                          # for fitting a Lasso Regression model onto the data distribution
from sklearn.linear_model import Ridge                                          # for fitting a Ridge Regression model onto the data distribution

from sklearn.metrics import *                                                   # for including Evaluation metrics in order to quantify the performance of predictive models

### Dataset Loading

In [None]:
# Load Dataset
drive.mount('/content/drive')
df_path = '/content/drive/My Drive/Colab Notebooks/Module 5 - Machine Learning/Capstone Project - Regression/'
data = pd.read_csv(df_path + 'SeoulBikeData.csv', index_col=False, encoding='unicode escape')

### Handling Warnings and Assigning Background Randomizer

In [None]:
# Ignoring interpreter generated Warnings
warnings.filterwarnings(action='ignore')

In [None]:
# Initializing Background Randomizer for Data Visualizations
style_types = plt.style.available
for i in random.sample(style_types, len(style_types)):
  plt.style.use(i)
%matplotlib inline

### Dataset First View

In [None]:
# Dataset First Look
data.head().T

### Dataset Rows & Columns count

In [None]:
# Dataset Last Look
data.tail().T

In [None]:
# Dataset Rows & Columns count
data.shape

### Dataset Information

In [None]:
# Dataset Info
data.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
data.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
data.isna().sum()

In [None]:
# Visualizing the missing values
msno.bar(data, figsize=(10, 7), fontsize=10, color='black')

### What did you know about your dataset?

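The dataset we've been presented with records the number of bikes rented in each hour, together with the weather conditions for that hour (temperature, humidity, wind speed, visibility, dew point temperature, solar radiation, rainfall and snowfall) and calendar information (date, hour, season, holiday and functioning day).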

Our main aim is to predict bike count at each hour for a stable supply of rental bikes.

The above dataset comprises 8760 rows & 14 columns.

There are 0 duplicates present in the dataset.

There are 0 missing values present in the dataset.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
list(data.columns)

In [None]:
#Renaming the columns
data = data.rename(columns={'Temperature(°C)':'Temperature', 'Humidity(%)':'Humidity', 'Wind speed (m/s)':'Wind speed', 'Visibility (10m)':'Visibility',
                            'Dew point temperature(°C)':'Dew point temperature', 'Solar Radiation (MJ/m2)':'Solar Radiation', 'Rainfall(mm)':'Rainfall',
                            'Snowfall (cm)':'Snowfall'})

In [None]:
# Dataset Describe
data.describe().T

### Variables Description

Our Dataset comprises 8760 rows & 14 columns.



Following is the description regarding each column:
  
* **Date** - Date in day/month/year format
* **Rented Bike Count** - Count of bikes rented per hour
* **Hour** - Hour of the day
* **Temperature(°C)** - Temperature in Celsius
* **Humidity(%)** - Humidity in the air in %
* **Wind speed (m/s)** - Speed of the wind in m/s
* **Visibility (10m)** - Visibility in units of 10 m
* **Dew point temperature(°C)** - Temperature at which the air becomes saturated with water vapour, in Celsius
* **Solar Radiation (MJ/m2)** - Solar radiation in MJ/m2
* **Rainfall(mm)** - Amount of rainfall in mm
* **Snowfall (cm)** - Amount of snowfall in cm
* **Seasons** - Winter, Spring, Summer, Autumn
* **Holiday** - Holiday/No Holiday
* **Functioning Day** - Whether the day is a Functioning Day or not



From above Overview, we can conclude that:

*   **Categorical Discrete Variables (dtype: object)** - Date, Seasons, Holiday, Functioning Day.
*   **Numerical Discrete Variables (dtype: int64)** - Hour.

*   **Numerical Continuous Variables (dtype: float64)** - Temperature(°C), Humidity(%), Wind speed (m/s), Visibility (10m), Dew point temperature(°C), Solar Radiation (MJ/m2), Rainfall(mm), Snowfall(cm).



*   **Dependent/Target/Y variable** - "Rented Bike count"






### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in list(data.columns):
  print(f"""No. of unique values in "{i}" is {data[i].nunique()}.""")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# Creating an instance of the dataset in order to preserve the original dataset.
bike_df = data.copy()

In [None]:
# converting the Dtype of a categorical 'Date' feature from object to datetime64[ns]
print(f"Initial Dtype of Date (pre-conversion): '{bike_df['Date'].dtype}'")
bike_df['Date'] = pd.to_datetime(bike_df['Date'], format='%d/%m/%Y')
print(f"Final Dtype of Date (post-conversion): '{bike_df['Date'].dtype}'")

In [None]:
# creating new columns by extracting Day, Month & Year out of the 'Date' column
bike_df['Day'] = bike_df['Date'].dt.day_name()
bike_df['Month'] = bike_df['Date'].dt.month
bike_df['Year'] = bike_df['Date'].dt.year

In [None]:
bike_df.head()

In [None]:
# creating a new column called 'Weekend' that takes 1 for Saturday & Sunday and 0 for all other days
bike_df['Weekend'] = bike_df['Day'].apply(lambda x: 1 if x=='Saturday' or x=='Sunday' else 0)
bike_df['Weekend'].value_counts()

In [None]:
# dropping 'Date', 'Day' & 'Year' columns
bike_df.drop(columns=['Date', 'Day','Year'], axis=1, inplace=True)

In [None]:
bike_df.head()

### What all manipulations have you done and insights you found?

**Following are some of the steps that were carried out and some insights that have been found while performing Data wrangling:**



1.   In order to avoid tampering with the original dataset, we created a copy of it.
2.   We found that the 'Date' variable was of Dtype: object, hence we converted it to Dtype: datetime64.
3.  After the conversion, we created new columns by extracting 'Day', 'Month' & 'Year' from the 'Date' column.
4.  After the extraction, we used the 'Day' column to create a new column called 'Weekend', which takes 1 for Saturdays & Sundays and 0 for weekdays.
5.  Finally, we dropped the columns 'Date', 'Day' & 'Year', which are no longer needed.
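
The wrangling steps above can be sketched end to end on a tiny, made-up sample. The four dates and counts below are illustrative assumptions, not rows taken from the dataset, and `demo` is a hypothetical name used only here.

```python
import pandas as pd

# Made-up sample using the same day/month/year format as the 'Date' column
sample = pd.DataFrame({
    'Date': ['01/12/2017', '02/12/2017', '03/12/2017', '04/12/2017'],
    'Rented Bike Count': [254, 204, 173, 107],
})

# Steps 1-2: work on a copy and parse the strings into datetime values
demo = sample.copy()
demo['Date'] = pd.to_datetime(demo['Date'], format='%d/%m/%Y')

# Step 3: derive calendar features from the parsed date
demo['Day'] = demo['Date'].dt.day_name()
demo['Month'] = demo['Date'].dt.month

# Step 4: 1 for Saturday/Sunday, 0 for every other day
demo['Weekend'] = demo['Day'].isin(['Saturday', 'Sunday']).astype(int)

# Step 5: drop the helper columns that are no longer needed
demo = demo.drop(columns=['Date', 'Day'])
print(demo)
```

Using `isin` instead of a row-wise `apply` keeps the weekend flag vectorized, which matters little here but scales better on the full 8760 rows.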














## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

### **Univariate Analysis**

#### Chart - 1

In [None]:
# Considering only the Dependent feature
bike_df['Rented Bike Count'].value_counts()

In [None]:
# Checking the Distribution of Target variable by making 2 plots
fig, (ax1,ax2) = plt.subplots(1,2,figsize=(13,3))

# Checking the Skewness of Target variable about its mean - using Kernel Density Estimation Plot
sns.kdeplot(bike_df,x='Rented Bike Count',fill=True,color='darkblue',ax=ax1)
ax1.axvline(bike_df['Rented Bike Count'].mean(), color='red', linestyle=':', linewidth=2)

# Checking the Outliers present in the Target variable - using Box Plot
sns.boxplot(bike_df,x='Rented Bike Count',ax=ax2,palette='Pastel1')

plt.show()

##### 1. Why did you pick the specific chart?



   



*   Kernel Density Estimation Plot describes the skewness of Dependent variable.
*   Box Plot describes the presence of outliers in the Dependent variable.



##### 2. What is/are the insight(s) found from the chart?



*   Dependent variable is positively skewed about its mean.
*   Dependent variable has a lot of Outliers.







##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Knowing that the dependent variable (Rented Bike Count) is positively skewed and contains a high number of outliers can create a positive business impact, because it shows that demand is usually moderate with occasional extreme peaks.

The outliers correspond to hours with extremely high bike rental counts, which may indicate exceptional demand spikes or anomalies.

While this may not directly lead to negative growth, it poses challenges in capacity planning, resource allocation, and service delivery. Businesses need to manage and optimize operations carefully to meet customer demand during such spikes and to prevent negative impacts on customer satisfaction and business growth.
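
The visual impression of skewness and outliers can also be put into numbers. The following is a minimal sketch, assuming the 1.5 × IQR rule (the default whisker length of a box plot) as the outlier definition. The helper name `skew_and_outliers` and the toy values are hypothetical.

```python
import pandas as pd

def skew_and_outliers(series: pd.Series) -> dict:
    """Return the sample skewness and the number of points outside the 1.5*IQR fences."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_outliers = int(((series < lower) | (series > upper)).sum())
    return {'skewness': float(series.skew()), 'n_outliers': n_outliers}

# Made-up right-skewed counts, for illustration only
toy_counts = pd.Series([10, 12, 15, 18, 20, 22, 25, 30, 40, 400])
summary = skew_and_outliers(toy_counts)
print(summary)
```

In the notebook the same helper could be called as `skew_and_outliers(bike_df['Rented Bike Count'])`. A positive skewness value corresponds to the long right tail seen in the KDE plot.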

#### Chart - 2

In [None]:
# Considering Independent Numerical features & excluding the Dependent feature
Numerical_feat = bike_df.describe().columns.drop('Rented Bike Count')
list(Numerical_feat)

In [None]:
# Checking the Distribution of Independent Numerical features by making 2 plots
for feat in Numerical_feat:
  fig, (ax1,ax2) = plt.subplots(1,2, figsize=(7,7))
  sns.kdeplot(bike_df, x=feat, fill=True, color='darkblue', ax=ax1)
  ax1.axvline(bike_df[feat].mean(), color='red', linestyle=':', linewidth=2)
  sns.boxplot(bike_df, x=feat, ax=ax2, palette='Pastel1')
  plt.show()
  print('\n\n')

##### 1. Why did you pick the specific chart?

*   Kernel Density Estimation Plot describes the skewness of Independent Numerical features.
*   Box Plot describes the presence of outliers in the Independent Numerical features.

##### 2. What is/are the insight(s) found from the chart?

*   Some Independent Numerical features exhibit skewness.
*   Some Independent Numerical features contain outliers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights gained from carefully analyzing and appropriately handling the skewness and outliers of Independent Numerical features can help create a positive business impact.

By accurately understanding and addressing these data characteristics, businesses can make informed decisions, develop effective strategies, and optimize their operations.

This can lead to improved resource allocation, targeted marketing, enhanced customer satisfaction, and overall positive growth and performance in the business.

#### Chart - 3

In [None]:
# Considering the Independent Categorical features
Categorical_feat = bike_df[['Seasons', 'Holiday', 'Functioning Day']]
list(Categorical_feat)

In [None]:
# Checking the Distribution of Independent Categorical features by making 1 plot
for feat in Categorical_feat:
  fig, (ax1) = plt.subplots(1, figsize=(7,7))
  sns.countplot(bike_df, x=feat, palette="magma")
  plt.show()
  print('\n')

##### 1. Why did you pick the specific chart?




*   Count plot shows the number of hourly records in each category of the 3 Categorical features: Seasons, Holiday, Functioning Day.






##### 2. What is/are the insight(s) found from the chart?

*   The number of hourly records is almost the same across the different seasons.
*   The number of hourly records is highly imbalanced for both Holiday & Functioning Day.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Knowing how the hourly records are distributed across seasons, holidays, and functioning days shows which conditions are well represented in the data, which is a prerequisite for understanding fluctuations in demand and optimizing resource allocation accordingly.
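
The balance of each categorical feature can be checked numerically as well as visually. Below is a small sketch using `value_counts(normalize=True)`. The toy frame and its proportions are made up for illustration and do not reflect the dataset's actual class shares.

```python
import pandas as pd

# Made-up categorical data: balanced 'Seasons', heavily imbalanced 'Holiday'
toy = pd.DataFrame({
    'Seasons': ['Winter', 'Spring', 'Summer', 'Autumn'] * 5,
    'Holiday': ['No Holiday'] * 19 + ['Holiday'],
})

# Share of records per category; a largest share close to 1 signals imbalance
largest_share = {}
for col in ['Seasons', 'Holiday']:
    shares = toy[col].value_counts(normalize=True)
    largest_share[col] = float(shares.max())
    print(shares, end='\n\n')

print(largest_share)
```

Applied to `bike_df`, the same loop over `['Seasons', 'Holiday', 'Functioning Day']` would report the shares behind the count plots.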

### **Bivariate Analysis**

#### Chart - 4

In [None]:
# Considering the Independent Numerical features
list(Numerical_feat)

In [None]:
# Checking the relationship between Independent Numerical features & the Dependent variable
for feat in Numerical_feat:
  plt.figure(figsize=(7,7))
  sns.scatterplot(bike_df,x=feat,y='Rented Bike Count',color='green')
  correlation=bike_df[feat].corr(bike_df['Rented Bike Count'])
  plt.title('Rented Bike Count vs ' + feat + ': Correlation = '+str(correlation) )
  z = np.polyfit(bike_df[feat], bike_df['Rented Bike Count'], 1)
  y_hat = np.poly1d(z)(bike_df[feat])
  plt.plot(bike_df[feat], y_hat,'b--', lw=1)
  plt.show()
  print('\n\n\n')

##### 1. Why did you pick the specific chart?



*   Scatter plot helps in understanding the relationship between each Independent Numerical feature and the Dependent variable; the fitted regression line helps in visualizing the direction and strength of the correlation between them.




##### 2. What is/are the insight(s) found from the chart?



*   Independent Numerical features such as Hour, Temperature, Wind speed, Visibility, Dew point temperature, Solar Radiation & Month exhibit a positive correlation with our dependent variable: Rented Bike Count.
*   Independent Numerical features such as Humidity, Rainfall & Snowfall exhibit a negative correlation with our dependent variable: Rented Bike Count.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights gained indicating positive correlations between Hour, Temperature, wind speed, visibility, dew point temperature, solar radiation, and month with the Rented Bike Count can inform business decisions such as optimizing operational hours, adjusting pricing, and targeting marketing efforts to maximize bike rentals and drive positive business impact.

Additionally, understanding the negative correlations with other numerical features can help identify areas for improvement and implement strategies to mitigate potential negative impacts on bike rentals.
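
Instead of reading each correlation off a plot title, the correlations with the target can be collected in one sorted table. This is a minimal sketch on made-up values; the numbers are illustrative assumptions chosen so that one feature rises and one falls with the target.

```python
import pandas as pd

# Made-up values, for illustration only
toy = pd.DataFrame({
    'Temperature': [-5.0, 0.0, 8.0, 15.0, 22.0, 30.0],
    'Humidity': [90, 80, 70, 60, 50, 40],
    'Seasons': ['Winter', 'Winter', 'Spring', 'Spring', 'Summer', 'Summer'],
    'Rented Bike Count': [100, 150, 400, 700, 900, 1200],
})

target = 'Rented Bike Count'

# Keep numeric columns only, then rank features by Pearson correlation with the target
numeric = toy.select_dtypes(include='number')
corr_with_target = (numeric.corr()[target]
                    .drop(target)
                    .sort_values(ascending=False))
print(corr_with_target)
```

Pearson correlation only captures linear association, so a feature such as Hour, whose effect on demand has two daily peaks, can show a modest coefficient despite being highly informative.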

#### Chart - 5

In [None]:
# Considering the Independent Categorical features
list(Categorical_feat)

In [None]:
# Checking the relationship between Independent Categorical features & the Dependent variable
for feat in Categorical_feat:
  fig, (ax1) = plt.subplots(1, figsize=(7,7))
  sns.barplot(bike_df, x=feat, y='Rented Bike Count', palette="magma", ax=ax1)
  plt.show()
  print('\n')

##### 1. Why did you pick the specific chart?



*   Bar plot for plotting the variation in Rented Bike count due to Seasons,Holiday and Functioning day.




##### 2. What is/are the insight(s) found from the chart?

* The Bike Rent Count is maximum during summer but minimum during winter.
* During holidays, Bike Rent counts drop down.
* Contribution of non-functioning days to the Bike Rent count is insignificant.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights gained about the count being maximum during summer, dropping during winter, and decreasing during holidays can help businesses plan their resources, adjust marketing strategies, and optimize operations to meet customer demand, resulting in a positive business impact.

Additionally, the understanding that non-functioning days have an insignificant contribution can guide businesses in allocating resources more efficiently.

#### Chart - 6

In [None]:
# Checking the relationship between Independent feature: Rainfall & the Dependent Variable
plt.figure(figsize=(7,7))
# Select the target column before aggregating, so non-numeric columns are never averaged
bike_df.groupby('Rainfall')['Rented Bike Count'].mean().plot(c='b')
plt.xlabel('Rainfall in mm')
plt.ylabel('Average rented bike count')
plt.xticks(range(0,37,2))
plt.show()

##### 1. Why did you pick the specific chart?




*   Line plot for analyzing the average Rented Bike Count at each level of Rainfall (in mm).



##### 2. What is/are the insight(s) found from the chart?


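*   The above plot indicates that the average demand for rented bikes does not fall steadily as rainfall increases. For instance, at a rainfall of 22-24 mm there is a significant peak in the average number of rented bikes. Hours with such heavy rainfall are typically rare, so a peak like this may rest on only a few records and should be interpreted with caution.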



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. The observation that demand for rented bikes does not disappear during heavy rainfall can have a positive business impact.

Businesses can leverage this information to optimize their operations during rainy periods and ensure a continuous supply of bikes, meeting customer demand and potentially increasing revenue.
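
One way to judge how much weight an average at a given rainfall level deserves is to look at how many hourly records stand behind it. The sketch below reports the mean and the record count side by side. The toy values and the `few_records` threshold of 2 are arbitrary assumptions for illustration.

```python
import pandas as pd

# Made-up records: many dry hours, one single hour of heavy rain
toy = pd.DataFrame({
    'Rainfall': [0.0, 0.0, 0.0, 0.0, 0.5, 0.5, 21.0],
    'Rented Bike Count': [500, 700, 600, 800, 120, 80, 600],
})

# Mean demand and number of supporting records per rainfall level
summary = (toy.groupby('Rainfall')['Rented Bike Count']
              .agg(['mean', 'count']))

# Flag levels whose average rests on very few records (threshold is arbitrary)
summary['few_records'] = summary['count'] < 2
print(summary)
```

An average flagged in this way describes a single hour rather than a trend, so it should not drive supply decisions on its own.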

#### Chart - 7

In [None]:
# Checking the relationship between Independent feature: Wind speed & the Dependent Variable
plt.figure(figsize=(7,7))
# Select the target column before aggregating, so non-numeric columns are never averaged
bike_df.groupby('Wind speed')['Rented Bike Count'].mean().plot(c='g')
plt.xlabel('Wind speed in m/s')
plt.ylabel('Average rented bike count')
plt.show()

##### 1. Why did you pick the specific chart?

*   Line plot for analyzing the average Rented Bike Count at each Wind speed (in m/s).

##### 2. What is/are the insight(s) found from the chart?



*   The above plot indicates that the average demand for rented bikes is fairly even across wind speeds. There is a spike in average bike rentals at a wind speed of about 7 m/s. This may suggest that riders are not deterred by breezy conditions, but the spike may also rest on a small number of records and should be confirmed before drawing conclusions.




##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. The roughly even distribution of bike rentals across wind speeds suggests that wind is not a major constraint on demand, which can have a positive business impact by simplifying supply planning.

If the spike at 7 m/s holds up on closer inspection, businesses can promote biking as an enjoyable activity during breezy conditions, potentially increasing bike rentals and attracting more customers.

#### Chart - 8

In [None]:
# Checking relationship between Independent feature: Hour & the Dependent variable
fig, ax = plt.subplots(figsize=(7, 7))
sns.boxplot(bike_df, x='Hour', y='Rented Bike Count', ax=ax, palette='viridis')
ax.set(title='Count of Rented bikes according to Hour')
plt.show()

##### 1. Why did you pick the specific chart?


*   Box plot for visualizing the count of rented bikes on an hourly basis.



##### 2. What is/are the insight(s) found from the chart?


*   The above plot showcases the usage of rented bikes across different hours throughout the year. It is notable that people tend to use rented bikes during commuting hours, specifically from 7 AM to 9 AM and from 5 PM to 7 PM.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. The insight that people tend to use rented bikes during commuting hours can have a positive business impact.
Businesses can optimize their operations and marketing efforts during these peak hours to meet customer demand, attract more riders, and potentially increase revenue.

#### Chart - 9

In [None]:
# Checking relationship between Independent feature: Month & the Dependent variable
plt.figure(figsize=(7,7))
sns.barplot(bike_df, x='Month', y='Rented Bike Count', palette='flare')
plt.title('Average count of Bikes Rented per Month')
plt.show()

##### 1. Why did you pick the specific chart?


*   Bar plot for visualizing the rented bike count over a period of 12 months.



##### 2. What is/are the insight(s) found from the chart?


*   During the Summer months, the demand for rented bikes is at its highest.
*   During the Winter months, the demand for rented bikes is low.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights gained i.e, the demand for rented bikes is high during summer and low during winter can help businesses align their resources and marketing strategies accordingly, maximizing their revenue and creating a positive business impact.

### **Trivariate Analysis**

#### Chart - 10

In [None]:
# Checking the relationship between Independent feature: Hour & the Dependent variable for different seasons
plt.figure(figsize=(7,7))
sns.lineplot(bike_df, x='Hour', y= "Rented Bike Count", hue='Seasons', palette='deep', alpha=1)
plt.xticks(range(0,24))
plt.title('Analysing trend line of "Rented Bike Count" w.r.t "Hour" for different Seasons')
plt.show()

##### 1. Why did you pick the specific chart?



*   Line plot for analyzing the count of rented bikes for different hours over 4 different seasons.



##### 2. What is/are the insight(s) found from the chart?


*   The analysis reveals that the use of rented bikes is significantly high during the summer season with peak demand during 7am-9am and 5pm-7pm.

*   However, during the winter season, the use of rented bikes is quite low, most likely because of cold weather and snowfall.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Knowing that demand for rented bikes is high during the summer season and at specific peak hours, and low during the winter season (most likely because of cold weather and snowfall), can help businesses optimize operations, target marketing efforts, and adjust resources accordingly, leading to a positive business impact.

#### Chart - 11

In [None]:
# Checking the relationship between Independent feature: Hour & the Dependent variable for Holidays & No Holidays
plt.figure(figsize=(7,7))
sns.pointplot(bike_df, x='Hour', y= "Rented Bike Count", hue='Holiday', palette='dark')
plt.title('Analysing trend line of "Rented Bike Count" w.r.t "Hour" separately for "Holiday" and "No Holiday" ')
plt.show()

##### 1. Why did you pick the specific chart?


*   Point plot for analyzing the count of rented bikes for different hours over Holidays & Non-Holidays.



##### 2. What is/are the insight(s) found from the chart?


*   During Holidays People prefer to use rented bikes after 12 pm.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights gained i.e, people prefer to use rented bikes after 12 pm during holidays can help businesses adjust their operational hours and allocate resources effectively, catering to the increased demand and potentially creating a positive business impact.

#### Chart - 12

In [None]:
# Checking the relationship between Independent feature: Hour & the Dependent variable for Weekend & No Weekend
plt.figure(figsize=(7,7))
sns.pointplot(bike_df, x='Hour', y= "Rented Bike Count", hue='Weekend',palette='rocket')
plt.title('Analysing trend line of "Rented Bike Count" w.r.t "Hour" separately for "weekdays" and "weekend" ')
plt.show()

##### 1. Why did you pick the specific chart?




*   Point plot for analyzing the count of rented bikes for different hours over Weekends and Non-Weekend days.




##### 2. What is/are the insight(s) found from the chart?




*  The demand for rented bikes is higher on weekdays and more specifically between 7am-9am and 5pm-7pm.
*  On weekends, the demand for rented bikes is generally lower, especially during the morning hours, but rises thereafter.
  






##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights gained i.e, a higher demand for rented bikes on weekdays, specifically during peak commuting hours, and lower demand on weekends, especially in the morning, can help businesses optimize their operations, staffing, and marketing strategies to cater to these patterns, potentially leading to a positive business impact.
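
The weekday versus weekend comparison read off the point plot can also be tabulated. The following sketch builds an hour-by-weekend table of average demand with `pivot_table`. The toy values are made up, and the column name `weekday_minus_weekend` is a hypothetical label introduced here.

```python
import pandas as pd

# Made-up records for two hours of the day, for illustration only
toy = pd.DataFrame({
    'Hour': [8, 8, 8, 8, 18, 18, 18, 18],
    'Weekend': [0, 0, 1, 1, 0, 0, 1, 1],
    'Rented Bike Count': [900, 1100, 300, 500, 1400, 1600, 800, 1000],
})

# Average demand per hour, with one column for weekdays (0) and one for weekends (1)
profile = toy.pivot_table(index='Hour', columns='Weekend',
                          values='Rented Bike Count', aggfunc='mean')

# Positive values mean weekdays are busier than weekends at that hour
profile['weekday_minus_weekend'] = profile[0] - profile[1]
print(profile)
```

On the full data, sorting this table by the difference column would list the hours where weekday commuting demand stands out most.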

### **Multivariate Analysis**

#### Chart - 13

In [None]:
# Checking the correlation between multiple Independent variables
plt.figure(figsize=(7,7))
plt.title('Correlation Chart')
sns.heatmap(bike_df[bike_df.describe().columns].corr(),annot=True,annot_kws={'size': 9},linewidths=3,square=True,fmt='.2f',cmap='PiYG')
plt.show()

##### 2. What is/are the insight(s) found from the chart?

*  We observe that columns 'Temperature' and 'Dew point temperature' are highly positively correlated, with a correlation coefficient of 0.91.
*  'Visibility' and 'Humidity' show the strongest negative correlation among the feature pairs, with a correlation coefficient of -0.54.

In [None]:
# Using VIF to remove Multicollinearity
def calc_vif(X):
  vif =pd.DataFrame()
  vif['Features']= X.columns
  vif['VIF']=[variance_inflation_factor(X.values,i) for i in range(X.shape[1])]
  return vif

In [None]:
calc_vif(bike_df[[i for i in bike_df.describe().columns if i not in ["Rented Bike Count"]]])

In [None]:
bike_df.drop(columns=['Dew point temperature'], inplace=True)

In [None]:
calc_vif(bike_df[[i for i in bike_df.describe().columns if i not in ["Rented Bike Count"]]])

In [None]:
bike_df.drop(columns=['Humidity'], inplace=True)

In [None]:
calc_vif(bike_df[[i for i in bike_df.describe().columns if i not in ["Rented Bike Count"]]])
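The VIF of a feature can be written as 1 / (1 - R²), where R² comes from regressing that feature on all the other features. The following is a minimal sketch of that definition on synthetic data, not the notebook's `calc_vif`; the function name `vif_with_intercept` and the arrays `a`, `b`, `c` are illustrative assumptions. Because the sketch adds an intercept to every auxiliary regression, its numbers need not match `calc_vif` exactly.

```python
import numpy as np

def vif_with_intercept(X):
    """Return VIF_j = 1 / (1 - R_j^2) for every column of a 2-D array X."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for j in range(p):
        target = X[:, j]
        # Regress column j on the remaining columns plus an intercept
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid.var() / target.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic stand-in data (not the bike dataset)
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + 0.1 * rng.normal(size=500)   # nearly a copy of a
c = rng.normal(size=500)             # unrelated to a and b
vifs = vif_with_intercept(np.column_stack([a, b, c]))
print(vifs.round(2))
```

The two nearly identical columns get large VIFs while the unrelated one stays close to 1, which is the pattern that motivates dropping one feature from a highly correlated pair.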

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis (H0)**: There is no significant relationship between the independent variables and the 'Rented Bike Count' (dependent variable).

**Alternate Hypothesis (Ha)**: There is a significant relationship between the independent variables and the 'Rented Bike Count' (dependent variable).

#### 2. Perform an appropriate statistical test.

#### **Ordinary Least Squares Model**

In [None]:
# Add a constant column to the DataFrame for the intercept term
bike_df = sm.add_constant(bike_df)

independent_vars=bike_df[bike_df.describe().columns].drop('Rented Bike Count',axis=1)
dependent_var=bike_df['Rented Bike Count']

# Perform the regression analysis
model = sm.OLS(dependent_var,independent_vars)
results = model.fit()

# Obtain the p-values
p_values = results.pvalues

print(round(p_values,5))

In [None]:
results.summary()

**Conclusion**
* For the 'Solar_Radiation' variable, the p-value is 0.53238, which is greater than 0.05. Therefore, there is not enough evidence to conclude a significant relationship between 'Solar_Radiation' and the 'Rented Bike Count'.
* Similarly, for the 'Snowfall' variable, the p-value is 0.10884, which is also greater than 0.05. Hence, there is not enough evidence to establish a significant relationship between 'Snowfall' and the 'Rented Bike Count'.



In summary, based on the given p-values, we can reject the null hypothesis for the independent variables 'Hour', 'Temperature', 'Wind_speed', 'Visibility', 'Rainfall', 'Month', and 'Weekend'. This implies that there is a significant relationship between these independent variables and the 'Rented Bike Count'. However, there is insufficient evidence to reject the null hypothesis for the 'Solar_Radiation' and 'Snowfall' variables, indicating that these variables may not have a significant relationship with the 'Rented Bike Count'.
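To make the link between coefficients and p-values concrete, here is a rough sketch of how a per-coefficient p-value arises in ordinary least squares: each coefficient is divided by its standard error, and the resulting statistic is compared with a reference distribution. The helper `ols_p_values` and the synthetic variables are assumptions for illustration. The sketch uses a normal approximation to the t distribution, which is only reasonable for large samples, so it is not a substitute for `results.pvalues`.

```python
import math
import numpy as np

def ols_p_values(X, y):
    """Fit OLS and return coefficients, t statistics and approximate p-values."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)            # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)       # covariance of the coefficients
    se = np.sqrt(np.diag(cov))
    t_stats = beta / se
    # Two-sided p-value under a normal approximation (assumes large n - k)
    p_vals = np.array([math.erfc(abs(t) / math.sqrt(2)) for t in t_stats])
    return beta, t_stats, p_vals

# Synthetic example: x1 drives y, x2 does not
rng = np.random.default_rng(1)
n = 2000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])       # constant column first
beta, t_stats, p_vals = ols_p_values(X, y)
print(beta.round(3), p_vals.round(5))
```

A small p-value for a coefficient is what leads to rejecting the null hypothesis for that variable at the 0.05 level.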

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null hypothesis (H0)**: The dependent variable is normally distributed in the population.

**Alternative hypothesis (Ha)**: The dependent variable is not normally distributed in the population.

#### 2. Perform an appropriate statistical test.

#### **Shapiro-Wilk test**

In [None]:
# Perform Statistical Test to obtain P-Value

from scipy import stats

# Perform the Shapiro-Wilk test
statistic, p_value = stats.shapiro(bike_df['Rented Bike Count'])

print("Shapiro-Wilk Test")
print("Test statistic:", statistic)
print("p-value:", p_value)

**Conclusion**


*   The Shapiro-Wilk test returns a test statistic of 0.8822 and a reported p-value of 0.0 (effectively zero), which is below the chosen significance level of 0.05. We therefore reject the null hypothesis (H0) that the dependent variable is normally distributed.




In [None]:
bike_df.drop('const',axis=1,inplace=True)

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Outliers

In [None]:
# Treatment of Outliers in our dependent Variable(applying square root transformation)
fig,(ax1,ax2)=plt.subplots(1,2,figsize=(16,6))
sns.kdeplot(np.sqrt(bike_df['Rented Bike Count']),color='y',fill=True,ax=ax1)
ax1.axvline(np.sqrt(bike_df['Rented Bike Count']).mean(), color='green', linestyle='dashed', linewidth=2)
ax1.axvline(np.sqrt(bike_df['Rented Bike Count']).median(), color='blue', linestyle='dashed', linewidth=2)
# Draw the box plot explicitly on the second axes instead of relying on the current axes
sns.boxplot(x=np.sqrt(bike_df['Rented Bike Count']), color='y', ax=ax2)
plt.show()
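One way to check numerically what the square root transformation does to a right-skewed target is to compare the sample skewness before and after. The sketch below uses synthetic gamma-distributed data as a stand-in for the rental counts; the helper `sample_skewness` and the distribution parameters are assumptions, not values taken from the dataset.

```python
import numpy as np

def sample_skewness(x):
    """Third standardized moment; 0 for a symmetric distribution."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

# Synthetic right-skewed "counts" (illustrative only)
rng = np.random.default_rng(42)
counts = rng.gamma(shape=2.0, scale=350.0, size=5000)

skew_raw = sample_skewness(counts)
skew_sqrt = sample_skewness(np.sqrt(counts))
print("skewness before:", round(skew_raw, 3))
print("skewness after sqrt:", round(skew_sqrt, 3))
```

A skewness closer to 0 after the transformation indicates a more symmetric distribution with a shorter right tail, which is the effect the plots above are meant to show.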

### 2. Categorical Encoding

In [None]:
#ONE HOT ENCODING
cat_features=['Hour', 'Seasons', 'Holiday', 'Functioning Day', 'Month',
       'Weekend']
def one_hot_encoding(data, column):
    data = pd.concat([data, pd.get_dummies(data[column], prefix=column, drop_first=True)], axis=1)
    data = data.drop([column], axis=1)
    return data

In [None]:
for col in cat_features:
    bike_df = one_hot_encoding(bike_df, col)
bike_df.head()
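The `drop_first=True` argument removes one dummy column per categorical feature. A small sketch of why this matters for linear models: with an intercept and a full set of dummies, the dummy columns sum to the intercept column, so the design matrix loses rank. The season labels and variable names below are illustrative.

```python
import numpy as np

# Illustrative categorical column with four levels
seasons = np.array(["Winter", "Spring", "Summer", "Autumn"] * 5)
levels = np.unique(seasons)                       # sorted unique levels

# Full one-hot encoding: one column per level
full = (seasons[:, None] == levels[None, :]).astype(float)
intercept = np.ones((len(seasons), 1))

with_all = np.hstack([intercept, full])           # intercept + 4 dummies
with_drop = np.hstack([intercept, full[:, 1:]])   # intercept + 3 dummies

rank_all = np.linalg.matrix_rank(with_all)
rank_drop = np.linalg.matrix_rank(with_drop)
print(with_all.shape[1], "columns, rank", rank_all)
print(with_drop.shape[1], "columns, rank", rank_drop)
```

With all dummies kept, the matrix has more columns than its rank, so the least-squares coefficients are not unique. Dropping the first level restores full column rank.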

In [None]:
bike_df.columns

### 3. Data Scaling

In [None]:
# Scaling your data
features = list(set(bike_df.columns) - {'Rented Bike Count'})
from scipy.stats import zscore
bike_df[features]=bike_df[features].apply(zscore)
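The cell above standardizes the whole dataset before the train/test split, so the test rows contribute to the means and standard deviations. A commonly used alternative is to compute the scaling statistics on the training rows only and reuse them for the test rows. The sketch below illustrates that idea on synthetic data; the array names and the 75/25 split are assumptions chosen to mirror the notebook, not its actual code.

```python
import numpy as np

# Synthetic two-feature data on very different scales
rng = np.random.default_rng(19)
X = rng.normal(loc=[10.0, 200.0], scale=[5.0, 50.0], size=(1000, 2))

# Shuffle and split 75% / 25%
idx = rng.permutation(len(X))
cut = int(0.75 * len(X))
train, test = X[idx[:cut]], X[idx[cut:]]

# Statistics come from the training rows only
mu = train.mean(axis=0)
sigma = train.std(axis=0)          # ddof=0, like scipy.stats.zscore's default

train_scaled = (train - mu) / sigma
test_scaled = (test - mu) / sigma  # reuse the training statistics

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
print(test_scaled.mean(axis=0).round(3))
```

The scaled test set will not have exactly zero mean and unit variance, which is expected: it is being measured against the training distribution.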

## ***7. ML Model Implementation***

In [None]:
X=bike_df.drop('Rented Bike Count',axis=1)
y=np.sqrt(bike_df['Rented Bike Count'])

In [None]:
# TRAIN TEST SPLIT
X_train,X_test,y_train,y_test= train_test_split(X,y,test_size=0.25,random_state=19)
print(X_train.shape)
print(X_test.shape)

### ML Model - 1

### **Linear Regression**

In [None]:
lr=LinearRegression()
lr.fit(X_train,y_train)

In [None]:
#check the score
lr.score(X_train,y_train)

In [None]:
#check the coefficient
lr.coef_

In [None]:
# Prediction
y_pred_train= lr.predict(X_train)
y_pred_test= lr.predict(X_test)

In [None]:
# Metrics evaluation for Train set
# 1. mean_squared_error
mse_lr= mean_squared_error(y_train,y_pred_train)
print('MSE :' , mse_lr)
#2. Root_mean_squared_error
rmse_lr=np.sqrt(mse_lr)
print('RMSE :' , rmse_lr)
#3. mean_absolute_error
mae_lr=mean_absolute_error(y_train,y_pred_train)
print('MAE :' ,mae_lr)
#4. coefficient of determination(r2_score)
r2_lr=r2_score(y_train,y_pred_train)
print('R2 :' ,r2_lr)
#5. adjusted  coefficient of determination
adjusted_r2_lr=(1-(1-r2_score(y_train,y_pred_train))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1)))
print('Adjusted R2 :' ,adjusted_r2_lr)

In [None]:
#Storing
lr_dict={'Model':'Linear Regression','MAE':round(mae_lr,2),'MSE':round(mse_lr,2),'RMSE':round(rmse_lr,2),'R2_score':round(r2_lr,2),'Adjusted R2_score':round(adjusted_r2_lr,2)}
training_df=pd.DataFrame(lr_dict,index=[1])
training_df

In [None]:
# Metrics evaluation for Test set
# 1. mean_squared_error
mse_lr= mean_squared_error(y_test,y_pred_test)
print('MSE :' , mse_lr)
#2. Root_mean_squared_error
rmse_lr=np.sqrt(mse_lr)
print('RMSE :' , rmse_lr)
#3. mean_absolute_error
mae_lr=mean_absolute_error(y_test,y_pred_test)
print('MAE :' ,mae_lr)
#4. coefficient of determination(r2_score)
r2_lr=r2_score(y_test,y_pred_test)
print('R2 :' ,r2_lr)
#5. adjusted  coefficient of determination (uses the test-set shape, since these are test-set metrics)
adjusted_r2_lr=(1-(1-r2_score(y_test,y_pred_test))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)))
print('Adjusted R2 :' ,adjusted_r2_lr)

In [None]:
#Storing
lr_dict2={'Model':'Linear Regression','MAE':round(mae_lr,2),'MSE':round(mse_lr,2),'RMSE':round(rmse_lr,2),'R2_score':round(r2_lr,2),'Adjusted R2_score':round(adjusted_r2_lr,2)}
test_df=pd.DataFrame(lr_dict2,index=[1])
test_df
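The same five metrics are recomputed for every model and for both splits. As a possible simplification, they can be wrapped in one helper. The function `regression_metrics` below is a hypothetical sketch written with NumPy only, not part of the notebook; the adjusted R² follows the formula used in the cells above, with `n` taken from the set being evaluated.

```python
import numpy as np

def regression_metrics(y_true, y_pred, n_features):
    """Return MAE, MSE, RMSE, R2 and adjusted R2 for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = len(y_true)
    errors = y_true - y_pred
    mse = np.mean(errors ** 2)
    ss_res = np.sum(errors ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R2 penalizes R2 for the number of predictors
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    return {
        "MAE": np.mean(np.abs(errors)),
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "R2": r2,
        "Adjusted R2": adj_r2,
    }

# Tiny worked example with one predictor
metrics = regression_metrics([1, 2, 3, 4, 5], [1.1, 1.9, 3.2, 3.8, 5.0], n_features=1)
print(metrics)
```

Calling it once with the training arrays and once with the test arrays keeps the sample size and feature count consistent with the data being scored.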

### **Concluding Remark:**

* The linear regression model shows moderate performance on both the training and test sets.
* The model achieves an R-squared (R2) value of approximately 0.76, indicating that around 76% of the variance in the target variable is explained by the independent variables.
* The mean squared error (MSE) values are 36.49 (training set) and 37.49 (test set). Because the target was square-root transformed before fitting, these errors are in square-root units of the bike count rather than in bikes.
* The root mean squared error (RMSE) values are around 6.04 and 6.12, indicating the typical magnitude of the errors on the transformed scale.
* The mean absolute error (MAE) values are approximately 4.55 and 4.56, representing the average absolute deviation of the predictions on the transformed scale.
* The adjusted R-squared values account for the number of predictors in the model, showing a similar pattern.
* Overall, further analysis and model refinement may be beneficial to improve the performance.
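Since the model is trained on `np.sqrt` of the rental count, the metrics above describe errors on that transformed scale. To report errors in actual bikes, predictions can be squared before scoring. The snippet below is a small sketch with made-up numbers; the array names are assumptions. Negative predictions are clipped to zero first, because squaring a negative square-root prediction would produce a misleading positive count.

```python
import numpy as np

# Made-up values on the square-root scale (illustrative only)
sqrt_actual = np.array([10.0, 20.0, 30.0])
sqrt_pred = np.array([12.0, 19.0, 27.0])

# RMSE on the transformed scale, as the notebook reports it
rmse_sqrt = np.sqrt(np.mean((sqrt_actual - sqrt_pred) ** 2))

# Back-transform: clip negatives, then square to get counts
count_actual = sqrt_actual ** 2
count_pred = np.clip(sqrt_pred, 0, None) ** 2
rmse_counts = np.sqrt(np.mean((count_actual - count_pred) ** 2))

print("RMSE (sqrt scale):", round(rmse_sqrt, 3))
print("RMSE (bike counts):", round(rmse_counts, 3))
```

The two numbers are not comparable with each other, so it helps to state which scale a reported error refers to.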

In [None]:
# Checking Heteroscedasticity
residuals = y_test - y_pred_test
sns.scatterplot(x=y_pred_test, y=residuals,color='salmon')

# Add a horizontal line at y=0
plt.axhline(y=0, color='red', linestyle='--')


plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot')

plt.show()

* **Since the points in the scatter plot are more or less evenly distributed on both sides of the line y=0, the residuals appear to have relatively consistent variability across the range of predicted values. This indicates homoscedasticity rather than heteroscedasticity.**
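A visual reading of the residual plot can be complemented by a simple numeric check: if the spread of the residuals grows with the predicted value, the absolute residuals will be positively correlated with the predictions. The sketch below shows this informal check on synthetic data. The helper `abs_residual_trend` is an assumption for illustration and is not a formal test such as Breusch-Pagan.

```python
import numpy as np

def abs_residual_trend(fitted, residuals):
    """Pearson correlation between fitted values and absolute residuals."""
    return np.corrcoef(fitted, np.abs(residuals))[0, 1]

rng = np.random.default_rng(7)
fitted = rng.uniform(5, 40, size=2000)

# Case 1: noise with constant spread (homoscedastic)
constant_noise = rng.normal(scale=3.0, size=2000)
# Case 2: noise whose spread grows with the fitted value (heteroscedastic)
growing_noise = rng.normal(size=2000) * (0.2 * fitted)

trend_constant = abs_residual_trend(fitted, constant_noise)
trend_growing = abs_residual_trend(fitted, growing_noise)
print("constant spread:", round(trend_constant, 3))
print("growing spread:", round(trend_growing, 3))
```

A value near zero is consistent with constant variance, while a clearly positive value suggests a funnel-shaped residual plot.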

In [None]:
plt.figure(figsize=(16,10))
plt.scatter(range(len(y_pred_test)),y_pred_test,s=20,c='green',label='Predicted')
plt.scatter(range(len(y_test)), y_test, s=20, c='red', label='Actual')
plt.legend()
plt.xlabel('number of test data')
plt.show()

### ML Model - 2

### **Lasso Regression(L1 Regularization)**

In [None]:
lasso = Lasso(alpha=0.1,max_iter=3500)
lasso.fit(X_train,y_train)

In [None]:
#check the score
lasso.score(X_train,y_train)

In [None]:
# Prediction
y_pred_train_lasso= lasso.predict(X_train)
y_pred_test_lasso= lasso.predict(X_test)

In [None]:
# Metrics evaluation for Train set
# 1. mean_squared_error
mse_lasso= mean_squared_error(y_train,y_pred_train_lasso)
print('MSE :' , mse_lasso)
#2. Root_mean_squared_error
rmse_lasso=np.sqrt(mse_lasso)
print('RMSE :' , rmse_lasso)
#3. mean_absolute_error
mae_lasso=mean_absolute_error(y_train,y_pred_train_lasso)
print('MAE :' ,mae_lasso)
#4. coefficient of determination(r2_score)
r2_lasso=r2_score(y_train,y_pred_train_lasso)
print('R2 :' ,r2_lasso)
#5. adjusted  coefficient of determination
adjusted_r2_lasso=(1-(1-r2_score(y_train,y_pred_train_lasso))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1)))
print('Adjusted R2 :' ,adjusted_r2_lasso)

In [None]:
#Storing
lasso_dict={'Model':'Lasso Regression','MAE':round(mae_lasso,2),'MSE':round(mse_lasso,2),'RMSE':round(rmse_lasso,2),'R2_score':round(r2_lasso,2),'Adjusted R2_score':round(adjusted_r2_lasso,2)}
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
training_df=pd.concat([training_df,pd.DataFrame([lasso_dict])],ignore_index=True)
training_df

In [None]:
# Metrics evaluation for Test set
# 1. mean_squared_error
mse_lasso= mean_squared_error(y_test,y_pred_test_lasso)
print('MSE :' , mse_lasso)
#2. Root_mean_squared_error
rmse_lasso=np.sqrt(mse_lasso)
print('RMSE :' , rmse_lasso)
#3. mean_absolute_error
mae_lasso=mean_absolute_error(y_test,y_pred_test_lasso)
print('MAE :' ,mae_lasso)
#4. coefficient of determination(r2_score)
r2_lasso=r2_score(y_test,y_pred_test_lasso)
print('R2 :' ,r2_lasso)
#5. adjusted  coefficient of determination
adjusted_r2_lasso=(1-(1-r2_score(y_test,y_pred_test_lasso))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)))
print('Adjusted R2 :' ,adjusted_r2_lasso)

In [None]:
#Storing
lasso_dict2={'Model':'Lasso Regression','MAE':round(mae_lasso,2),'MSE':round(mse_lasso,2),'RMSE':round(rmse_lasso,2),'R2_score':round(r2_lasso,2),'Adjusted R2_score':round(adjusted_r2_lasso,2)}
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
test_df=pd.concat([test_df,pd.DataFrame([lasso_dict2])],ignore_index=True)
test_df
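L1 regularization can set coefficients exactly to zero. In the special case of an orthonormal design, the Lasso solution is the soft-thresholded least-squares coefficient, which makes the effect easy to see. The sketch below illustrates that operator with made-up coefficients; `soft_threshold` and the values are assumptions, and scikit-learn's `Lasso` solves the general case iteratively rather than with this one-step formula.

```python
import numpy as np

def soft_threshold(z, alpha):
    """Shrink each value toward zero by alpha; values within alpha of zero become 0."""
    return np.sign(z) * np.maximum(np.abs(z) - alpha, 0.0)

# Made-up least-squares coefficients (illustrative only)
ols_like = np.array([3.0, -0.05, 0.5, -1.2])
shrunk = soft_threshold(ols_like, alpha=0.1)
print(shrunk)
print("non-zero coefficients:", np.count_nonzero(shrunk))
```

Coefficients smaller in magnitude than `alpha` are removed entirely, while larger ones are only shifted, which is why a small `alpha` such as 0.1 tends to leave the metrics close to those of plain linear regression.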

### ML Model - 3

### **Ridge Regression(L2 Regularization)**

In [None]:
ridge=Ridge(alpha=0.1)
ridge.fit(X_train,y_train)

In [None]:
#check the score
ridge.score(X_train,y_train)

In [None]:
# Prediction
y_pred_train_ridge= ridge.predict(X_train)
y_pred_test_ridge= ridge.predict(X_test)

In [None]:
# Metrics Evaluation for Train set
# 1. mean_squared_error
mse_ridge= mean_squared_error(y_train,y_pred_train_ridge)
print('MSE :' , mse_ridge)
#2. Root_mean_squared_error
rmse_ridge=np.sqrt(mse_ridge)
print('RMSE :' , rmse_ridge)
#3. mean_absolute_error
mae_ridge=mean_absolute_error(y_train,y_pred_train_ridge)
print('MAE :' ,mae_ridge)
#4. coefficient of determination(r2_score)
r2_ridge=r2_score(y_train,y_pred_train_ridge)
print('R2 :' ,r2_ridge)
#5. adjusted  coefficient of determination
adjusted_r2_ridge=(1-(1-r2_score(y_train,y_pred_train_ridge))*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1)))
print('Adjusted R2 :' ,adjusted_r2_ridge)

In [None]:
#Storing
ridge_dict={'Model':'Ridge Regression','MAE':round(mae_ridge,2),'MSE':round(mse_ridge,2),'RMSE':round(rmse_ridge,2),'R2_score':round(r2_ridge,2),'Adjusted R2_score':round(adjusted_r2_ridge,2)}
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
training_df=pd.concat([training_df,pd.DataFrame([ridge_dict])],ignore_index=True)

In [None]:
# Metrics Evaluation for Test set
# 1. mean_squared_error
mse_ridge= mean_squared_error(y_test,y_pred_test_ridge)
print('MSE :' , mse_ridge)
#2. Root_mean_squared_error
rmse_ridge=np.sqrt(mse_ridge)
print('RMSE :' , rmse_ridge)
#3. mean_absolute_error
mae_ridge=mean_absolute_error(y_test,y_pred_test_ridge)
print('MAE :' ,mae_ridge)
#4. coefficient of determination(r2_score)
r2_ridge=r2_score(y_test,y_pred_test_ridge)
print('R2 :' ,r2_ridge)
#5. adjusted  coefficient of determination
adjusted_r2_ridge=(1-(1-r2_score(y_test,y_pred_test_ridge))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)))
print('Adjusted R2 :' ,adjusted_r2_ridge)

In [None]:
#Storing
ridge_dict2={'Model':'Ridge Regression','MAE':round(mae_ridge,2),'MSE':round(mse_ridge,2),'RMSE':round(rmse_ridge,2),'R2_score':round(r2_ridge,2),'Adjusted R2_score':round(adjusted_r2_ridge,2)}
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
test_df=pd.concat([test_df,pd.DataFrame([ridge_dict2])],ignore_index=True)
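Ridge regression has a closed-form solution, which helps explain why a small penalty such as `alpha=0.1` behaves almost like ordinary linear regression. The sketch below implements that formula on synthetic data with an unpenalized intercept; `ridge_closed_form` and the data are assumptions for illustration, not the notebook's model.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Solve (Xc'Xc + alpha*I) w = Xc'yc on centered data; intercept is not penalized."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    n_features = X.shape[1]
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return coef, intercept

# Synthetic data with known coefficients
rng = np.random.default_rng(19)
X = rng.normal(size=(500, 3))
true_coef = np.array([2.0, -1.0, 0.5])
y = X @ true_coef + 3.0 + rng.normal(scale=0.5, size=500)

coef_small, _ = ridge_closed_form(X, y, alpha=0.1)
coef_large, _ = ridge_closed_form(X, y, alpha=1000.0)
print("alpha=0.1 :", coef_small.round(3))
print("alpha=1000:", coef_large.round(3))
```

With a small `alpha` the coefficients stay close to the least-squares values, while a large `alpha` shrinks them noticeably toward zero.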

# **Conclusion**

In our analysis, we began by conducting an exploratory data analysis (EDA) on all features in the dataset. We started by examining the dependent variable, 'Rented Bike Count,' and made necessary transformations to ensure its suitability for modeling. Moving on, we focused on the categorical variables and eliminated those with a dominant single class. For the numerical variables, we calculated correlations, studied their distributions, and analyzed their relationships with the dependent variable. Additionally, we removed numerical features mostly consisting of 0 values and performed one-hot encoding for the categorical variables.

Next, we implemented three machine learning algorithms: Linear Regression, Lasso Regression & Ridge Regression. Our evaluation yielded the following findings:

In [None]:
# displaying the results of evaluation metric values for all models
result=pd.concat([training_df,test_df],keys=['Training set','Test set'])
result

* Linear Regression, Lasso Regression & Ridge Regression show similar performance on both the training and test sets. They have comparable MAE, MSE, RMSE, R2 score, and adjusted R2 score values, indicating consistent predictive accuracy across the two datasets.
