<a href="https://colab.research.google.com/github/Nakulcj7/bike/blob/main/Bike_sharing_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -Seoul Bike Sharing Demand Prediction



##### **Project Type**    -Regression
##### **Contribution**    - Individual


# **Project Summary -**

Bike sharing systems have gained widespread popularity in urban environments, offering a sustainable and efficient mode of transportation. This project focuses on developing a predictive model for bike sharing demand, leveraging historical data, weather conditions, and other relevant factors. The primary goal is to create a robust and accurate prediction system to optimize bike allocation and enhance user experience. The dataset contains 8760 records and 14 attributes. It covers Seoul's weather conditions (temperature, humidity, wind speed, visibility, dew point, solar radiation, snowfall, rainfall), the number of bikes rented in each hour, and date information.

# **GitHub Link -**

https://github.com/Nakulcj7/bike/blob/main/Bike_sharing_prediction.ipynb


# **Problem Statement**



Rental bikes must be available and accessible to the public at the right time so that waiting periods shorten. Providing the city with a stable supply of rental bikes therefore becomes a major concern. The main thing to focus on here is predicting the bike count required at each hour for a stable supply of rental bikes.


The major objective is to predict the number of rental bikes required in each hour of each day and to identify the features that influence the hourly demand for rental bikes.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
#loading the dataset
bd_df=pd.read_csv("/content/drive/MyDrive/Almabetter/SeoulBikeData.csv", encoding='ISO-8859-1')

### Dataset First View

In [None]:
# Dataset First Look
bd_df.head(5)

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
bd_df.shape

In [None]:
print(f'number of rows : {bd_df.shape[0]}  \nnumber of columns : {bd_df.shape[1]}')

### Dataset Information

In [None]:
# Dataset Info
bd_df.info()

In [None]:
#viewing the statistical summary of the data
bd_df.describe(include='all').T

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
bd_df[bd_df.duplicated()]

This shows that there are no duplicate entries in the dataset.
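An empty frame is easy to misread in a notebook, so a single number can make the same check explicit. The sketch below uses a small made-up frame (`toy`, not the Seoul data) to show how `duplicated().sum()` counts repeated rows.

```python
import pandas as pd

# Hypothetical toy frame: the last row repeats the first one
toy = pd.DataFrame({"Hour": [0, 1, 0], "Rented Bike Count": [254, 204, 254]})

# duplicated() marks every repeat of an earlier row; sum() counts the marks
n_duplicates = int(toy.duplicated().sum())
print(f"duplicate rows: {n_duplicates}")
```

Applied to `bd_df`, the same expression would be expected to return 0 if the dataset has no duplicates.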


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
bd_df.isnull().sum()

So it is evident that there are no null values in the dataset, which means no imputation or row removal is needed.


### What did you know about your dataset?

The dataset provided contains 14 columns and 8760 rows and does not have any missing or duplicate values.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
bd_df.columns

In [None]:
# Dataset Describe
bd_df.describe(include='all').T


### Variables Description





This dataset contains information on Seoul's weather conditions (temperature, humidity, wind speed, visibility, dew point, solar radiation, snowfall, rainfall), the number of bikes rented in each hour, and date information.

Attribute Information:


*   Date - The date of each observation, stored in the raw file in the format 'day/month/year'

*   Rented Bike count - Count of bikes rented at each hour

*   Hour - Hour of the day

*   Temperature - Temperature recorded in the city in Celsius (°C).

*   Humidity - Relative humidity in %

*   Windspeed - Speed of the wind in m/s


*   Visibility - Distance at which an object or light can be clearly discerned, in units of 10 m

*   Dew point temperature - Temperature at which the air becomes saturated with water vapour, in Celsius (°C).


*   Solar radiation - Intensity of sunlight in MJ/m^2


*   Rainfall - Amount of rainfall received in mm


*   Snowfall - Amount of snowfall received in cm


*   Seasons - Season of the year (Winter, Spring, Summer, Autumn)


*   Holiday - Whether the day is a Holiday or not (Holiday/No holiday)


*   Functioning Day - Whether the rental service is available (Yes: functional hours) or not (No: non-functional hours)



### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
bd_df.nunique()


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

#Converting the datatype of Date column to datetime (the raw strings are day/month/year)
bd_df['Date'] = pd.to_datetime(bd_df['Date'], format='%d/%m/%Y')

#Extracting Month,Weekday and Year from the date column
bd_df['Month']=bd_df['Date'].dt.month
bd_df['Days_of_week']=bd_df['Date'].dt.day_name()
bd_df['Year']=bd_df['Date'].dt.year
bd_df['Day']=bd_df['Date'].dt.day
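The explicit `format='%d/%m/%Y'` matters because a string such as 01/12/2017 could be read as either 1 December or 12 January. The sketch below uses two illustrative date strings (assumed here, not read from the file) to show the day-first reading and the derived month and weekday.

```python
import pandas as pd

# Illustrative day/month/year strings, assumed for this example
raw = pd.Series(["01/12/2017", "02/12/2017"])

# An explicit format removes the day-first versus month-first ambiguity
parsed = pd.to_datetime(raw, format="%d/%m/%Y")

months = parsed.dt.month.tolist()          # calendar month numbers
day_names = parsed.dt.day_name().tolist()  # English weekday names
print(months, day_names)
```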


In [None]:
#The number of unique values in Date column
bd_df['Date'].nunique()

The dataset contains records of rented bikes per hour for a period of 365 days

In [None]:
#The number of unique values in Year column
bd_df['Year'].value_counts()


Most of the records are from the year 2018

In [None]:
#Finding the date of first and last entry in the dataset
print(f"The dataset contains observations from {bd_df['Date'].min().date()} to {bd_df['Date'].max().date()}")


In [None]:
#Creating a column which specifies  if the day is a Weekend('Y')or not ('N')
bd_df['Weekend']=bd_df['Days_of_week'].apply(lambda x : ('Y') if x in ['Saturday','Sunday'] else ('N'))
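The same flag can be derived without a row-by-row `apply`. This is a minimal sketch on a few assumed dates, using `dt.dayofweek` (Monday is 0, Sunday is 6); it is an alternative, not the approach used in the cell above.

```python
import numpy as np
import pandas as pd

# Four consecutive example dates: Friday, Saturday, Sunday, Monday
dates = pd.Series(pd.to_datetime(["2017-12-01", "2017-12-02", "2017-12-03", "2017-12-04"]))

# dayofweek >= 5 selects Saturday (5) and Sunday (6)
weekend = np.where(dates.dt.dayofweek >= 5, "Y", "N")
print(list(weekend))
```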


In [None]:
#Displaying the unique values in the categorical columns
categorical_columns=['Seasons','Holiday', 'Functioning Day','Days_of_week','Weekend']

for col in categorical_columns:
  print(f'The unique values in the column {col} are {bd_df[col].unique()}')

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

# **Univariate Analysis**

Let's see how some of the important numerical independent features are distributed in our data.

In [None]:
fig = plt.figure(figsize=(12,15))
c=1
lis=['Hour','Temperature(°C)','Humidity(%)','Wind speed (m/s)','Visibility (10m)','Dew point temperature(°C)','Solar Radiation (MJ/m2)','Rainfall(mm)','Snowfall (cm)']
for i in lis:
  plt.subplot(3,3, c)
  sns.histplot(bd_df[i],kde=True)
  plt.title('Distribution of {}'.format(i))
  c+=1
plt.tight_layout()



*   The distributions of Temperature, Humidity and Dew point temperature are approximately normal.

*   Wind speed, Solar Radiation, Rainfall and Snowfall are positively skewed.
*   Visibility is negatively skewed.
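The visual impression of skewness can be backed by a number. The sketch below runs `DataFrame.skew()` on synthetic columns (generated here, not taken from the bike data): values near 0 suggest symmetry, positive values a right tail, negative values a left tail.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins for the three shapes described above
demo = pd.DataFrame({
    "roughly_normal": rng.normal(size=5000),
    "right_skewed": rng.exponential(size=5000),
    "left_skewed": -rng.exponential(size=5000),
})

# Sample skewness of every column
skewness = demo.skew()
print(skewness)
```

On the notebook's data, `bd_df[lis].skew()` would give the corresponding values for the plotted features.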





Let's see how the dependent variable, Rented Bike Count, is distributed.

In [None]:
sns.displot(bd_df['Rented Bike Count'],kde=True,color='black')
plt.title('Distribution of Rented Bike Count')

In [None]:
#Checking for outliers

fig = plt.figure(figsize=(8,25))
c=1
for i in lis :
    plt.subplot(len(lis),1, c)   #one row per feature in lis
    sns.boxplot(x=i,data=bd_df)
    plt.xlabel('Distribution of {}'.format(i))   #set after plotting so seaborn does not overwrite it
    c = c + 1
plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=5.0)


The outlier values are neither extreme nor unusual, so we retain them in our dataset.
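Boxplots flag points using the 1.5 x IQR rule, and counting those points gives a rough sense of how many values are involved. The helper `iqr_outlier_count` and the wind-speed values below are hypothetical, written only for illustration.

```python
import pandas as pd

def iqr_outlier_count(series, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return int(mask.sum())

# Made-up wind speeds in m/s with one unusually large reading
toy_wind = pd.Series([0.5, 1.0, 1.2, 1.5, 1.8, 2.0, 2.3, 7.4])
n_out = iqr_outlier_count(toy_wind)
print(n_out)
```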

In [None]:
#The number of records belonging to each category
for col in categorical_columns:
 print('Column :',col)
 print(bd_df[col].value_counts(),'\n')

# Basic Conclusions from Univariate Analysis


*   The number of records is nearly the same across seasons (we need to dig further for a better understanding).

*   Most records fall on non-holidays (working days) and on functioning days of the rental service.

*   Fewer records fall on weekends than on weekdays.

*   Hour does not reveal much at this point.
*   The temperature is mostly above 0°C, so for now let's consider Seoul to be on the warmer side.


*   Humidity is moderate for the most part.


*   Wind speed is not extreme.


*   Most rainfall readings are below 4 mm.

*   Snowfall is mostly 0-1 cm and is not extreme in most cases.





# Bivariate Analysis

## Correlation Heatmap

In [None]:
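#Correlation heatmap of numerical features in the dataset
#numeric_only=True skips text columns such as Seasons, which pandas 2.x would otherwise reject
plt.figure(figsize = (10,10))
sns.heatmap(bd_df.corr(numeric_only=True),annot=True,linewidth = 0.5, vmin=-1, vmax=1, cmap = 'YlGnBu')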




*   Dew point temperature is strongly correlated with Temperature.
*   Temperature and Hour show the strongest correlations with Rented Bike Count.



# **Scatter plot showing the high correlation of Temperature and Dew point Temperature**

In [None]:
plt.figure(figsize=[10,6])
sns.scatterplot(data=bd_df, y='Temperature(°C)', x='Dew point temperature(°C)')

**Were rental services offered on non-functional days?**

In [None]:
len(bd_df[bd_df['Functioning Day']=='No'])


It is highly unlikely that services will be provided on non-functional days. But since a few observations (295) were recorded on those days, let's check whether there were any exceptional cases.

In [None]:
#Calculating the total bike rentals, the number of distinct dates and the number of records for each combination of Functioning Day and Holiday

bd_df.groupby(['Functioning Day','Holiday']).agg(bikerentalcounts=('Rented Bike Count','sum'),no_of_holidays_nonholidays=('Date','nunique'),no_of_records=('Date','count'))

In [None]:
plt.figure(figsize=(8,6))
sns.barplot(x='Functioning Day',y='Rented Bike Count',data=bd_df)



*   The rental service was functional on most days during the period from Dec 2017 to Nov 2018 (only 13 non-functional days).
*   Although we observe a few records on non-functioning days, rental services were not offered on those days (no exceptions).



# **Which are the days on which the rental facility was unavailable?**

In [None]:
non_functioning_days =bd_df.loc[bd_df['Functioning Day']=='No']

#Holiday on which the rental service was unavailable
non_functioning_days.loc[non_functioning_days['Holiday']=='Holiday']['Date'].unique()

The holiday on which the rental service was not functioning is Hangeul Day, a national Korean commemorative day marking the invention and proclamation of Hangul, the alphabet of the Korean language.

In [None]:
non_functioning_days.loc[non_functioning_days['Holiday']=='No Holiday']['Date'].value_counts().to_frame(name = 'Hours_of_non_operation').reset_index().rename(columns={'index':'Date'})

The service was unavailable for 1 day in April, 1 day in May, 4 days in September, and 3 days each in October and November.

# **What is the likelihood of people renting bikes on holidays and non-holidays?**

In [None]:
plt.figure(figsize=(8,6))
sns.boxplot(x='Holiday',y='Rented Bike Count',data=bd_df,palette='Set1')



*   The demand for rented bikes is higher on non holidays



# **What is the count of rented bikes during different seasons over the entire period of observation?**

In [None]:
#Finding the total number of bikes rented in each season (rows are sorted alphabetically: Autumn, Spring, Summer, Winter)
season_df=bd_df.groupby('Seasons')['Rented Bike Count'].sum().to_frame(name = 'season_count').reset_index()

In [None]:
#Finding the total number of bikes rented in each month
month_df=(bd_df.groupby(['Seasons','Month'])['Rented Bike Count'].sum()).to_frame(name = 'month_count').reset_index()

In [None]:
import calendar

# Define month names
month_names = [
    'January', 'February', 'March', 'April', 'May', 'June',
    'July', 'August', 'September', 'October', 'November', 'December'
]

# Create a figure and axis
fig, ax = plt.subplots()
size = 1
group_names = ['Autumn', 'Spring', 'Summer', 'Winter']
group_size = season_df['season_count']
subgroup_size = month_df['month_count']

# Setting figure colors using color maps
a, b, c, d = [plt.cm.Blues, plt.cm.Reds, plt.cm.Greens, plt.cm.Purples]
outer_colors = [a(.8), b(.8), c(.8), d(.8)]
inner_colors = [*a(np.linspace(.7, .4, 3)), *b(np.linspace(.7, .4, 3)), *c(np.linspace(.7, .4, 3)), *d(np.linspace(.7, .4, 3))]

# Creating the outer pie chart
patches, texts, pcts = ax.pie(
    group_size,
    radius=3.2,
    colors=outer_colors,
    wedgeprops=dict(width=size, edgecolor='w'),
    labels=group_names,
    autopct='%1.1f%%',
    textprops={'fontsize': 16},
    labeldistance=1.1,
    pctdistance=0.85
)
plt.setp(pcts, color='white', fontweight='bold')
plt.setp(texts, fontweight=600)

# Creating the inner pie chart with month names
subgroup_names = [calendar.month_abbr[month_num] for month_num in month_df['Month']]
patches1, texts1, pcts1 = ax.pie(
    subgroup_size,
    radius=3.2 - size,
    colors=inner_colors,
    labels=subgroup_names,
    wedgeprops=dict(width=1.2, edgecolor='w'),
    autopct='%1.1f%%',
    textprops={'fontsize': 14},
    labeldistance=0.8,
    pctdistance=0.65
)
plt.setp(pcts1, color='w', fontweight='bold', fontsize=12)
plt.setp(texts1, fontweight=600)

ax.set(aspect="equal")

# Show the pie chart
plt.show()




*   The demand for rental bikes is lowest during winter (Dec-Feb) and highest during summer (June-August).



## **What is the demand for rental bikes during different days of the week?**

In [None]:
plt.figure(figsize=(8,6))
sns.boxenplot(x='Days_of_week',y='Rented Bike Count',data=bd_df,palette='Set1')



*   Demand is lowest on Sunday and slightly higher on Friday.

*   More demand on weekdays than weekends.





# What is the demand for rental bikes during weekdays and weekends?

In [None]:
plt.figure(figsize=(8,6))
sns.boxplot(x='Weekend',y='Rented Bike Count',data=bd_df,palette='Set1')

In [None]:
bd_df.groupby(['Weekend'])['Rented Bike Count'].mean()



*   The average demand for rental bikes is lower on Weekends(Sat-Sun) as compared to Weekdays(Mon-Fri).



# What is the demand for rental bikes during different hours of the day?

In [None]:
plt.figure(figsize=(10,8))
sns.lineplot(x='Hour',y='Rented Bike Count',data=bd_df,palette='Set1',hue='Seasons',lw=1.5)



*   The demand for rental bikes peaks at hour 8 (8:00 AM) and hour 18 (6:00 PM).

*   These peaks coincide with the opening and closing hours of various institutions and offices.

*   The demand for rental bikes increases steadily after 10:00 AM and continues to rise until 6:00 PM.
*   The demand for bikes is lowest during the early hours (1:00 AM to 6:00 AM).

*   This general trend holds regardless of the season.
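The peak hours read off the line plot can also be obtained numerically by averaging the count per hour. The numbers below are made up to mimic a morning and an evening peak; they are not values from the dataset.

```python
import pandas as pd

# Hypothetical counts for two days around the commuting hours
toy = pd.DataFrame({
    "Hour": [7, 8, 9, 17, 18, 19, 7, 8, 9, 17, 18, 19],
    "Rented Bike Count": [400, 900, 500, 800, 1400, 700,
                          420, 950, 520, 820, 1500, 720],
})

# Mean demand per hour, then the two busiest hours
hourly_mean = toy.groupby("Hour")["Rented Bike Count"].mean()
top_two = hourly_mean.nlargest(2).index.tolist()
print(top_two)
```

Running the same two lines on `bd_df` would list the hours with the highest average demand in the real data.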







## What is the variation of Rented bikes count over the entire period of observation based on various factors?

In [None]:
fig = plt.figure(figsize=(15,12))
c=1
cont = ['Date','Temperature(°C)','Wind speed (m/s)','Visibility (10m)','Solar Radiation (MJ/m2)','Rainfall(mm)','Snowfall (cm)']
for i in cont:
  plt.subplot(4,2,c)
  #palette is omitted because seaborn ignores it when no hue is given
  sns.lineplot(x=i,y='Rented Bike Count',data=bd_df)
  plt.title('Demand of Rental bikes at different {}'.format(i))
  c = c + 1
plt.tight_layout()



*   Temperature vs Bike count plot : The demand is higher at warmer temperatures (25°C-30°C).



*   Windspeed vs Bike count plot : The demand for rental bikes is relatively uniform for wind speeds up to 5 m/s. Beyond that speed, we observe a higher demand for bikes.

*   Visibility vs Bike count plot : Few bikes are rented when visibility is extremely low (less than 1000 m).
*   Solar radiation vs Bike count plot : There is an overall increase in demand as solar radiation increases.


*   Rainfall vs Bike count plot : The peak between 20 mm and 25 mm seems out of place. On referring to the dataset, we find that these observations were recorded during the summer season on working days (No Holiday), when people still opt for rental bikes because they have to go to work.


*   Snowfall vs Bike count plot : The demand for bikes is comparatively lower when the snowfall received is 4 cm or more.



# Inspecting the observations where there is a peak in demand for bikes regardless of the weather conditions

In [None]:
#1.Rainfall
bd_df[(bd_df['Rainfall(mm)'] >=20) & (bd_df['Rainfall(mm)'] <=25)]

These are working days



In [None]:
bd_df[(bd_df['Snowfall (cm)'] >=5) & (bd_df['Snowfall (cm)'] <=8)]



*   These are also working days




# What are the factors which influence the demand for rental bikes during a day?

In [None]:

fig = plt.figure(figsize=(11,9))
c=1
columns=['Rented Bike Count','Temperature(°C)','Visibility (10m)', 'Wind speed (m/s)','Solar Radiation (MJ/m2)', 'Humidity(%)']
for i in columns :
    plt.subplot(3,2,c)
    plt.ylabel(i)
    plt.title(label=i,fontsize=15,color="green")
    sns.lineplot(data=bd_df, x='Hour', y=i, color='r')
    c = c + 1
plt.tight_layout()



*   Temperature, visibility and wind speed appear to be positively associated with the hourly demand for rental bikes, while humidity appears to be negatively associated.
*   The rented bike counts are highest between 7:00 AM and 8:00 PM (hour 20), when temperature, visibility and wind speed are at their highest and humidity is at its lowest.



# What are the factors which influence the demand for rental bikes during different months?

In [None]:
fig = plt.figure(figsize=(12,12))
c=1
columns=['Rented Bike Count','Temperature(°C)','Visibility (10m)', 'Wind speed (m/s)','Solar Radiation (MJ/m2)', 'Humidity(%)','Rainfall(mm)','Snowfall (cm)']
for i in columns :
    plt.subplot(4,2, c)
    plt.ylabel(i)
    plt.title(label=i,fontsize=15,color="green")
    sns.lineplot(data=bd_df, x='Month', y=i, color='r')
    c = c + 1
plt.tight_layout()



*   The monthly count of rented bikes is positively associated with Temperature.

*   Snowfall follows the seasons, with the heaviest snowfall from December to February during winter. The count of rented bikes declines during these months.
*   Rainfall tends to be more frequent in Seoul from June to August, during the summer season. However, this has not led to a decline in demand for rental bikes during those months.






# What are the factors which influence the demand for rental bikes during various seasons of the year?

In [None]:
fig = plt.figure(figsize=(12,12))
c=1
columns=['Rented Bike Count','Temperature(°C)','Visibility (10m)', 'Wind speed (m/s)', 'Humidity(%)','Rainfall(mm)', 'Snowfall (cm)']
for i in columns :
    plt.subplot(4,2,c)
    plt.ylabel(i)
    plt.title(label=i,fontsize=15,color="black")
    sns.barplot(data=bd_df, x='Seasons', y=i, palette='Set1')
    sns.lineplot(data=bd_df, x='Seasons', y=i, color='black')
    c = c + 1
plt.tight_layout()



*   The seasonal demand for rental bikes is positively associated with temperature, solar radiation, rainfall and humidity, and negatively associated with snowfall.
*   Therefore, the demand is highest during summer and lowest during winter.



# Basic Conclusions from Bivariate Analysis



*   Temperature and Hour have a strong correlation with the count of rented bikes.

*   Dew point temperature is highly positively correlated with Temperature.
*   The peak demand for rental bikes occurs at the opening (8-9 AM) and closing (6-7 PM) times of offices and institutions.

*   During the period from Dec 2017 to Nov 2018, bike rental facilities were available on most days. The service was unavailable on only 13 days.

*   The demand for rental bikes is higher on regular days (non-holidays).
*   There is more demand for rental bikes on weekdays than on weekends.

*   There is a significant drop in the number of rented bikes during winter (Dec-Feb), when temperatures are lowest.

*   The demand for bikes increases at warmer temperatures, which is why the count of rented bikes is highest during the summer season.





# Feature engineering

In [None]:
#Checking for multicollinearity

from statsmodels.stats.outliers_influence import variance_inflation_factor

def check_vif(dataframe):
  '''
  This function calculates the variance inflation factor of the independent features in the dataset
  '''

  # the independent variables set
  X =dataframe
  # VIF dataframe
  vif_data = pd.DataFrame()
  vif_data["feature"] = X.columns

  # calculating VIF for each feature
  vif_data["VIF"] = [variance_inflation_factor(X.values, i)
                            for i in range(len(X.columns))]
  print(vif_data)

In [None]:
#Displaying the columns in the dataframe
bd_df.columns

In [None]:
#Checking the VIF value of certain columns in bd_df
check_vif(bd_df[['Hour', 'Temperature(°C)', 'Humidity(%)',
       'Wind speed (m/s)', 'Visibility (10m)', 'Dew point temperature(°C)',
       'Solar Radiation (MJ/m2)', 'Rainfall(mm)', 'Snowfall (cm)',
        'Month','Year','Day']])



Multicollinearity inflates the variance of the coefficient estimates, which reduces the statistical power of the regression model.

Let's check the values of VIF if we exclude Dew point temperature and Year.


In [None]:
check_vif(bd_df[['Hour', 'Temperature(°C)', 'Humidity(%)','Wind speed (m/s)', 'Visibility (10m)',
       'Solar Radiation (MJ/m2)', 'Rainfall(mm)', 'Snowfall (cm)','Month','Day']])

The VIF values of the remaining features now lie within the acceptable range.
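For reference, the VIF of feature j is 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on the other features. The sketch below computes it with NumPy on synthetic columns; the helper name `vif_with_intercept` is hypothetical, and because it fits an intercept its numbers may differ from the values printed by `check_vif` above.

```python
import numpy as np

def vif_with_intercept(X):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the others plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        r2 = 1.0 - residual.var() / target.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

rng = np.random.default_rng(42)
temp = rng.normal(15, 10, size=500)            # synthetic temperature
dew = temp - 8 + rng.normal(0, 2, size=500)    # nearly collinear with temp
wind = rng.normal(2, 1, size=500)              # unrelated to both
vifs = vif_with_intercept(np.column_stack([temp, dew, wind]))
print(vifs.round(2))
```

The two collinear columns get large VIF values while the unrelated one stays close to 1, which mirrors why dropping Dew point temperature lowers the VIF of Temperature.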

In [None]:
#Dropping 'Dew point temperature(°C)','Year' to reduce the VIF
bd_df.drop(columns=['Dew point temperature(°C)','Year'],inplace=True)

In [None]:
#Creating a copy of the main dataframe 'bd_df'
df=bd_df.copy()

In [None]:
#Creating dummies for the Categorical columns
df = pd.get_dummies(bd_df, columns = ['Seasons','Holiday','Weekend','Functioning Day'],drop_first=True)
df.head(2)
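With `drop_first=True`, the first category of each column (in sorted order) is dropped and becomes the implicit baseline, which avoids perfectly collinear dummy columns in linear models. A small illustration on an assumed one-column frame:

```python
import pandas as pd

# Toy frame with the four season labels
toy = pd.DataFrame({"Seasons": ["Winter", "Spring", "Summer", "Autumn"]})

full = pd.get_dummies(toy, columns=["Seasons"])                      # one column per season
reduced = pd.get_dummies(toy, columns=["Seasons"], drop_first=True)  # Autumn becomes the baseline

print(list(full.columns))
print(list(reduced.columns))
```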

In [None]:
df.columns

In [None]:
#Dropping the columns Date and Days_of_week
df.drop(['Days_of_week','Date'],axis=1,inplace=True)

In [None]:
#Displaying the columns present in the dataframe 'df'
df.columns

# Implementation of Regression Model

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error,mean_absolute_error,r2_score

In [None]:
#Defining independent and dependent variables

y = df['Rented Bike Count']
X = df.drop('Rented Bike Count',axis=1)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
print(f'The shape of X : {X.shape}\n The shape of X_train : {X_train.shape}\n The shape of X_test : {X_test.shape}')

In [None]:
#Creating functions to calculate the Evaluation metrics for the regression models

def evaluate_model(name,X_test,y_true,y_pred):

  '''
  This function calculates metrics for evaluating
  the performance of Regression models
  '''
  list_=[]
  #calculating mean absolute error
  MAE =  mean_absolute_error(y_true,y_pred)
  print(f'MAE : {MAE}')

  #finding mean_squared_error
  MSE  = mean_squared_error(y_true,y_pred)
  print("MSE :" , MSE)

  #finding root mean squared error
  RMSE = np.sqrt(MSE)
  print("RMSE :" ,RMSE)

  #finding the r2 score
  r2 = r2_score(y_true,y_pred)
  print("R2 :" ,r2)

  #finding the adjusted r2 score
  adj_r2=1-(1-r2_score(y_true,y_pred))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
  print("Adjusted R2 : ",adj_r2)
  list_.extend([name,MAE,MSE,RMSE,r2,adj_r2])
  return(list_)
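The adjusted R2 used above is 1 - (1 - R2)(n - 1)/(n - p - 1), with n test rows and p features, so it penalises models that add features without improving the fit. A worked example with assumed numbers (the function name `adjusted_r2` is introduced here only for illustration):

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R2 for a model with n_features predictors scored on n_samples rows."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Assumed values: R2 = 0.80 on 101 rows with 10 features
example = adjusted_r2(r2=0.80, n_samples=101, n_features=10)
print(round(example, 4))
```

Here the penalty factor is 100/90, so the adjusted value is 1 - 0.2 x 100/90.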

In [None]:
#Creating a  list which would store lists of different models and their performance metrics
list_of_models=[]

## Linear Regression

**Multiple linear regression**

In [None]:
#Scaling the features
from sklearn.preprocessing import StandardScaler
std = StandardScaler()
X_train = std.fit_transform(X_train)
X_test = std.transform(X_test)
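One caveat with scaling before cross-validation: the scaler above is fitted on all of `X_train`, so each validation fold in a later grid search has already influenced the scaling statistics. Putting the scaler inside a pipeline lets it be re-fitted on every training fold. The sketch below shows the pattern on synthetic data from `make_regression`; the alpha grid is an assumption, not the grid used later in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the bike features
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# The scaler is re-fitted inside every cross-validation fold
pipe = make_pipeline(StandardScaler(), Lasso(max_iter=10000))
search = GridSearchCV(pipe, {"lasso__alpha": [0.01, 0.1, 1.0]}, cv=5, scoring="r2")
search.fit(X_tr, y_tr)

test_r2 = search.score(X_te, y_te)
print(search.best_params_, test_r2)
```

In practice the difference is often small for standardisation, but the pipeline keeps the cross-validated score an honest estimate.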

In [None]:
#Importing the Linear Regression model
from sklearn.linear_model import LinearRegression

In [None]:
#Fitting the data to Linear Regression model
regressor = LinearRegression()
regressor.fit(X_train, y_train)

In [None]:
#predicting the values of y from X_test
y_pred= regressor.predict(X_test)

In [None]:
#Evaluating the model
list_of_models.append(evaluate_model('Multiple Linear Regression',X_test,y_test,y_pred))



## Lasso Regression



In [None]:
#importing the classes required for Cross Validation
from sklearn.model_selection import RandomizedSearchCV as rscv
from sklearn.model_selection import GridSearchCV as gsv

In [None]:
#importing the linear_model class from sklearn library
from sklearn import linear_model

#Creating a Lasso Linear model object
ls_model = linear_model.Lasso()

In [None]:
#Creating the parameter grid
grid = dict()
grid['alpha'] = np.arange(0.005, 1, 0.005)   #start above 0: Lasso with alpha=0 is plain least squares, which scikit-learn advises against
grid['max_iter'] = [25,50,100,500,1000]

In [None]:
#performing GridSearch CV
ls_model=gsv(estimator=ls_model, param_grid=grid,cv=5 ,verbose=1, scoring='r2')


In [None]:
#Training the model
ls_model.fit(X_train,y_train)

In [None]:
#displaying the best estimators and score
print(ls_model.best_estimator_,'The best score is ',ls_model.best_score_)

In [None]:
#Fitting the data to optimal Lasso model
best_lasso = ls_model.best_estimator_
best_lasso.fit(X_train, y_train)
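With the default `refit=True`, `GridSearchCV` already retrains the best configuration on all the data passed to `fit`, so the extra `fit` call above should reproduce the same model. A minimal check on synthetic data (the alpha values are assumptions for the example):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=1)

search = GridSearchCV(Lasso(), {"alpha": [0.1, 1.0]}, cv=5, scoring="r2")
search.fit(X_demo, y_demo)

# Coefficients of the estimator that GridSearchCV refitted on its own
coef_before = search.best_estimator_.coef_.copy()

# Fitting again on the same data, as done in the notebook
search.best_estimator_.fit(X_demo, y_demo)
coef_after = search.best_estimator_.coef_

same = bool(np.allclose(coef_before, coef_after))
print(same)
```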

In [None]:
#Predicting the values for y from X_test using the best parameters
y_pred=best_lasso.predict(X_test)

In [None]:
#Evaluating the model
list_of_models.append(evaluate_model('Lasso Regression(Tuned)',X_test,y_test,y_pred))

# Ridge Regression



*   Ridge regression with default parameters




In [None]:
#importing Ridge from linear_model class of sklearn library
from sklearn.linear_model import Ridge

In [None]:
#Creating an instance of Ridge regression
ridge=Ridge()

In [None]:
#Training the model
ridge.fit(X_train,y_train)

In [None]:
#predicting the values of y from the test data
y_pred=ridge.predict(X_test)

ridge.score(X_train,y_train)

In [None]:
#Evaluating the model
list_of_models.append(evaluate_model('Ridge Regression (default)',X_test,y_test,y_pred))



*   Ridge Regression with Hyperparameter Tuning



In [None]:
#Creating an object of linear model with Ridge regularization
Ridge_model = linear_model.Ridge()

In [None]:
#Creating the parameter grid
grid = dict()
grid['alpha'] = np.arange(0, 1, 0.005)
grid['max_iter'] = [25,50,100,500,1000]

In [None]:
#Performing cross-validation to find the best model
Ridge_model=gsv(estimator=Ridge_model, param_grid=grid,cv=5 ,verbose=1, scoring='r2')
Ridge_model.fit(X_train,y_train)

In [None]:
print(Ridge_model.best_estimator_,Ridge_model.best_score_)

In [None]:
#Fitting the train data to the Ridge model with best parameters
best_ridge = Ridge_model.best_estimator_
best_ridge.fit(X_train,y_train)

In [None]:
y_pred = best_ridge.predict(X_test)

In [None]:
#Evaluating the model
list_of_models.append(evaluate_model('Ridge Regression(Tuned)',X_test,y_test,y_pred))

## Elasticnet Regression

In [None]:
#Importing Elastic Net
from sklearn.linear_model import ElasticNet

In [None]:
#Creating an instance of ElasticNet model
elasticnet = ElasticNet()

In [None]:
#Fitting the model to train data and finding its score
elasticnet.fit(X_train,y_train)
elasticnet.score(X_train, y_train)

In [None]:
#Predicting the values of y from test data
y_pred = elasticnet.predict(X_test)

In [None]:
#Evaluating the model
list_of_models.append(evaluate_model('Elastic Net Regression(default)',X_test,y_test,y_pred))



*   Elasticnet Regression with Hyperparameter Tuning



In [None]:
#Creating an instance of ElasticNet model
elastic = ElasticNet()

#Creating the parameter grid
parameters = {'alpha': [1e-15,1e-13,1e-10,1e-8,1e-5,1e-4,1e-3,1e-2,1e-1,1,5,10,20,30,40,45,50,55,60,100],'l1_ratio':[0.3,0.4,0.5,0.6,0.7,0.8]}

In [None]:
#Performing GridSearch Cross Validation to find the best parameters

elastic_regressor = gsv(elastic, parameters,scoring='neg_mean_squared_error',cv=5)
elastic_regressor.fit(X_train, y_train)

In [None]:
print("The best parameters are found to be :" ,elastic_regressor.best_params_)
print("\nUsing ",elastic_regressor.best_params_, " the negative mean squared error is: ", elastic_regressor.best_score_)

In [None]:
best_elasticnet=elastic_regressor.best_estimator_
best_elasticnet.fit(X_train,y_train)

In [None]:
#predicting the values of y from best model
y_pred_elastic = best_elasticnet.predict(X_test)

In [None]:
#Evaluating the model
list_of_models.append(evaluate_model('Elastic Net Regression(Tuned)',X_test,y_test,y_pred_elastic))

# Polynomial Regression

In [None]:
#importing the packages required for preprocessing,creating pipeline,cross-validation
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV

In [None]:
def PolynomialRegression(degree=1, **kwargs):#initializing the degree as 1,however the degree will change while performing GridSearch cross-validation
  '''
  This function transforms the independent features(X_train) to a polynomial of the degree given in the parameters
  and performs Linear regression using the  y_train(not-transformed) and the transformed X_train.
  '''
  return make_pipeline(PolynomialFeatures(degree), LinearRegression(**kwargs))

In [None]:
#Creating a parameter grid
parameters = {'polynomialfeatures__degree': [2,3,4,5]}

#Creating a PolynomialRegression object
poly_regressor = PolynomialRegression()

#Performing GridSearch Cross-validation to find the optimal parameters
poly_grid = GridSearchCV(poly_regressor, param_grid=parameters,cv=3, scoring='neg_mean_squared_error', verbose=3)
poly_grid.fit(X_train, y_train)

In [None]:
print("\n The best parameters across ALL searched params:\n",poly_grid.best_params_)

In [None]:
#Fitting the independent features to a polynomial of degree 2
poly_features = PolynomialFeatures(degree =2)
X_train_poly = poly_features.fit_transform(X_train)

#Performing Linear regression using y_train and the transformed X_train
poly_regressor = LinearRegression( )
poly_regressor.fit(X_train_poly, y_train)

#Predicting the y values from X_test

X_test_transform=poly_features.transform(X_test)
y_pred=poly_regressor.predict(X_test_transform)

In [None]:
#Determining the evaluation metrics of the model
print('The polynomial with degree = 2 is the optimal fit')
list_of_models.append(evaluate_model('Polynomial Regression(Tuned)',X_test,y_test,y_pred))

# Decision Tree Regressor

In [None]:
#Splitting the data to train and test again(to obtain non-scaled test and train data)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
# making function to plot important features.
def plotting_imp_features(model, training_data):
  imp_features = model.feature_importances_
  feature_names = training_data.columns
  _imp_features = pd.Series(imp_features, index=feature_names)
  return _imp_features.sort_values(ascending=True).plot(kind='barh',figsize=[10,8], title='Feature Importance')

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

#Creating a decision tree regressor object
dtr = DecisionTreeRegressor()

# hyper-parameter tuning using GridSearchCV (max_depth covers every integer from 3 to 20)
parameters = {'max_depth': list(range(3, 21)),
              'min_samples_split':[2,3,4],
              'min_samples_leaf':[1,2,3,4],
              }

gridsearch_dtr = GridSearchCV(dtr, parameters, scoring='r2', cv=5)
gridsearch_dtr.fit(X_train, y_train)

In [None]:
# best parameters.
print('Best parameters for our model are: max_depth={}, min_samples_split={}, min_samples_leaf={}'.format(gridsearch_dtr.best_params_['max_depth'],
                                    gridsearch_dtr.best_params_['min_samples_split'], gridsearch_dtr.best_params_['min_samples_leaf']))

In [None]:
# cross-validated and test performance
# best_score_ is the mean r2 over the 5 validation folds for the best parameters, not a training-set score
cv_score = gridsearch_dtr.best_score_
test_score = gridsearch_dtr.best_estimator_.score(X_test,y_test)

print('The best mean cross-validated r2 score is {}'.format(cv_score))
print('The r2 score for test data is {}'.format(test_score))

In [None]:
#Selecting the Decision tree regressor with the tuned parameters (GridSearchCV has already refit it on the training data)
best_dtr = gridsearch_dtr.best_estimator_
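
`GridSearchCV.best_score_` is the mean score over the validation folds for the best parameter combination, and with the default `refit=True` the `best_estimator_` is already refit on the whole training set. A small sketch on synthetic data (all names below are illustrative and not part of the notebook) makes both points concrete:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, for illustration only
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.1 * X_demo[:, 1] + rng.normal(0, 0.1, size=200)

demo_search = GridSearchCV(DecisionTreeRegressor(random_state=0),
                           {'max_depth': [2, 4, 8]}, scoring='r2', cv=5)
demo_search.fit(X_demo, y_demo)

# Mean validation-fold r2 of the winning candidate
mean_cv_r2 = demo_search.cv_results_['mean_test_score'][demo_search.best_index_]

# r2 of the refit best estimator on the data it was trained on
train_r2 = demo_search.best_estimator_.score(X_demo, y_demo)

print('best_score_ :', demo_search.best_score_)
print('mean CV r2  :', mean_cv_r2)
print('training r2 :', train_r2)
```

The first two printed values coincide, while the training r2 is a separate quantity that has to be computed explicitly.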

In [None]:
best_dtr.fit(X_train,y_train)
#Making predictions on the best decision tree model
y_pred=best_dtr.predict(X_test)

In [None]:
#Evaluating the model
list_of_models.append(evaluate_model('Decision Tree Regression (Tuned)',X_test,y_test,y_pred))

In [None]:
#Visualizing features importance of Decision Tree Regressor model
plotting_imp_features(gridsearch_dtr.best_estimator_, X_train)

## XGBoost Regression

In [None]:
#importing the required packages and classes
from sklearn.model_selection import cross_val_score, GridSearchCV, KFold, RandomizedSearchCV, train_test_split
from xgboost import XGBRegressor
import xgboost as xgb

In [None]:
def rmse_cv(model):
    rmse = np.sqrt(-cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=5)).mean()
    return rmse

def evaluation(y, predictions):
    mae = mean_absolute_error(y, predictions)
    mse = mean_squared_error(y, predictions)
    rmse = np.sqrt(mean_squared_error(y, predictions))
    r_squared = r2_score(y, predictions)
    return mae, mse, rmse, r_squared
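
As a quick check of what the `evaluation` helper returns, here is a worked example on four hand-made values (the numbers are illustrative only). The helper is repeated so that the snippet runs on its own.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluation(y, predictions):
    mae = mean_absolute_error(y, predictions)
    mse = mean_squared_error(y, predictions)
    rmse = np.sqrt(mse)
    r_squared = r2_score(y, predictions)
    return mae, mse, rmse, r_squared

y_true_demo = [3, 5, 7, 9]
y_pred_demo = [2, 5, 9, 9]   # errors: 1, 0, -2, 0

mae, mse, rmse, r_squared = evaluation(y_true_demo, y_pred_demo)
# mae = 3/4 = 0.75, mse = 5/4 = 1.25, rmse = sqrt(1.25), r2 = 1 - 5/20 = 0.75
print(mae, mse, rmse, r_squared)
```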


XGBoost with fixed parameters (before tuning)

In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Identify categorical columns
categorical_cols = ['Seasons', 'Holiday', 'Functioning Day', 'Weekend']

# Create a ColumnTransformer to apply one-hot encoding to categorical columns
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(), categorical_cols)
    ], remainder='passthrough'  # Keep non-categorical columns as they are
)

# Create an XGBoost Regressor instance with specified hyperparameters
# (named xgb_regressor so that the xgboost module imported as xgb is not overwritten)
xgb_regressor = XGBRegressor(n_estimators=1000, learning_rate=0.01)

# Create a pipeline that includes preprocessing and the XGBoost Regressor
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('xgb', xgb_regressor)
])

# Fit the model to the training data
pipeline.fit(X_train, y_train)

# Make predictions on the test data
predictions = pipeline.predict(X_test)

# Calculate evaluation metrics
mae = mean_absolute_error(y_test, predictions)
mse = mean_squared_error(y_test, predictions)
rmse = np.sqrt(mean_squared_error(y_test, predictions))  # RMSE as the square root of MSE
r_squared = r2_score(y_test, predictions)

print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R2 Score:", r_squared)
print("-" * 30)

# Calculate RMSE using the rmse_cv helper defined above (5-fold cross-validation on the full X and y)
rmse_cross_val = rmse_cv(pipeline)
print("RMSE Cross-Validation:", rmse_cross_val)

XGBoost with hyperparameter tuning


In [None]:
import scipy.stats as stats
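# Create an XGBoost Regressor instance
# (named xgb_regressor so that the xgboost module imported as xgb is not overwritten)
xgb_regressor = XGBRegressor()

# Create a pipeline that includes preprocessing and the XGBoost Regressor
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('xgb', xgb_regressor)
])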

# Define the parameter distributions for the randomized search
param_dist = {
    'xgb__n_estimators': stats.randint(100, 1000),  # Random integer from 100 to 999 (upper bound is exclusive)
    'xgb__learning_rate': stats.uniform(0.01, 0.19),  # Random float between 0.01 and 0.2 (loc=0.01, scale=0.19)
    'xgb__max_depth': stats.randint(3, 6),  # Random integer between 3 and 5
    'xgb__min_child_weight': stats.randint(1, 4)  # Random integer between 1 and 3
}
# Create a RandomizedSearchCV object
random_search = RandomizedSearchCV(pipeline, param_distributions=param_dist, n_iter=50, cv=5,
                                   scoring='neg_mean_squared_error', verbose=1, random_state=42, n_jobs=-1)

# Fit the random search to the training data
random_search.fit(X_train, y_train)

# Print the best hyperparameters
print("Best Hyperparameters:", random_search.best_params_)

# Get the best model from the random search
best_model = random_search.best_estimator_

# Make predictions on the test data
y_pred = best_model.predict(X_test)

# Calculate RMSE and R-squared on the test set
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print("Test RMSE:", rmse)
print("Test R-squared:", r2)
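
The bounds of the `scipy.stats` distributions used for the search are easy to misread: `randint(low, high)` excludes `high`, and `uniform(loc, scale)` covers `[loc, loc + scale]` rather than `[loc, scale]`. A short sketch that samples from two frozen distributions to confirm this (the sample sizes and seeds are arbitrary choices):

```python
import scipy.stats as stats

# randint(3, 6): the upper bound 6 is never drawn
int_samples = stats.randint(3, 6).rvs(size=2000, random_state=0)
print(sorted(set(int_samples.tolist())))

# uniform(0.01, 0.2): loc=0.01 and scale=0.2, so values reach up to about 0.21
float_samples = stats.uniform(0.01, 0.2).rvs(size=10000, random_state=0)
print(float_samples.min(), float_samples.max())
```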



In [None]:
# Create an XGBoost Regressor instance with specific hyperparameters
# (verbosity=0 suppresses XGBoost's log messages)
best_xgb = XGBRegressor(n_estimators=1000, learning_rate=0.03, subsample=0.7, objective='reg:squarederror',
                        max_depth=7, verbosity=0, min_child_weight=4, colsample_bytree=0.7)

# Create a pipeline that includes preprocessing and the XGBoost Regressor
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('xgb', best_xgb)
])

# Fit the model to the training data
pipeline.fit(X_train, y_train)

# Make predictions on the test data
y_pred = pipeline.predict(X_test)

# Calculate and print the training score (R-squared)
training_score = pipeline.score(X_train, y_train)
print(f'Training score: {training_score}')

# Evaluate the model on the test set
test_r2_score = r2_score(y_test, y_pred)
print("Test R-squared Score:", test_r2_score)

In [None]:

# Named separately so that the plotting_imp_features helper defined earlier
# (based on feature_importances_) stays available for the CatBoost models below.
def plotting_xgb_imp_features(model, importance_type='weight', max_num_features=None, title='Feature Importance'):
    # Get feature importance scores using XGBoost's built-in method.
    # The model was fitted inside the pipeline on the preprocessor's output,
    # so features are typically labelled by position (f0, f1, ...) rather than by column name.
    importance_scores = model.get_booster().get_score(importance_type=importance_type)

    # Extract feature names and importance scores
    feature_names, scores = zip(*importance_scores.items())

    # Create a DataFrame to store feature names and their importance scores
    feature_importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': scores})

    # Sort features by importance score in descending order
    feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

    # Select the top features if max_num_features is specified
    if max_num_features is not None:
        feature_importance_df = feature_importance_df.head(max_num_features)

    # Plot feature importance
    plt.figure(figsize=(10, 6))
    plt.barh(feature_importance_df['Feature'], feature_importance_df['Importance'])
    plt.xlabel('Importance')
    plt.ylabel('Feature')
    plt.title(title)
    plt.gca().invert_yaxis()  # Invert the y-axis to show the most important feature at the top
    plt.show()

# Call the function with the XGBoost model that was fitted inside the pipeline above
plotting_xgb_imp_features(best_xgb, max_num_features=10)

## CATBoost Regressor

In [None]:
#installing CatBoost package
!pip install catboost

In [None]:
#Importing the Catboost Regressor class
from catboost import CatBoostRegressor

In [None]:
#Displaying the list of columns in the main data frame
bd_df.columns

In [None]:
X=bd_df.drop(columns=['Date','Rented Bike Count','Days_of_week'])
y=bd_df['Rented Bike Count']

In [None]:
#List of categorical columns
categorical_columns = X.select_dtypes(include=["object"]).columns.tolist()
print("Names of categorical columns : ", categorical_columns)

#Get location of categorical columns
categorical_features_indices = [X.columns.get_loc(col) for col in categorical_columns]
print("Location of categorical columns : ",categorical_features_indices)
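
`Index.get_loc` returns the zero-based position of a column, which is the form passed to CatBoost's `cat_features` argument in the cells that follow. A minimal illustration on a made-up frame (the values are invented; the column names echo the dataset's, and the text columns are cast to `object` explicitly so the dtype filter behaves the same across pandas versions):

```python
import pandas as pd

demo_X = pd.DataFrame({
    'Hour': [0, 1],
    'Seasons': ['Winter', 'Winter'],
    'Temperature': [-5.2, -5.5],
    'Holiday': ['No Holiday', 'Holiday'],
}).astype({'Seasons': object, 'Holiday': object})

# Names of the object-typed (categorical) columns, in column order
demo_cat_cols = demo_X.select_dtypes(include=['object']).columns.tolist()

# Zero-based positions of those columns
demo_cat_idx = [demo_X.columns.get_loc(col) for col in demo_cat_cols]

print(demo_cat_cols)
print(demo_cat_idx)
```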

In [None]:
#Splitting the data to train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

CATBoost Regressor with default parameters


In [None]:
# creating Catboost model
CB_regressor= CatBoostRegressor( loss_function='RMSE')

# train the model
CB_regressor.fit(X_train,y_train,cat_features=categorical_features_indices,eval_set=(X_test, y_test),verbose=False,plot=False)
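
Passing the test split as `eval_set` lets CatBoost monitor the same data that is later used for the final score and, depending on settings such as `use_best_model`, select the final iteration on it, which can make the test metrics optimistic. One possible alternative, sketched here under the assumption that a further split of the training data is acceptable, is to carve out a separate validation set. The arrays are synthetic and the names `X_fit` and `X_val` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and target
rng = np.random.default_rng(0)
X_all = rng.normal(size=(1000, 4))
y_all = rng.normal(size=1000)

# First split: hold out 30% as the untouched test set (same ratio as the notebook)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.3, random_state=42)

# Second split: take 20% of the training part as a validation set for eval_set
X_fit, X_val, y_fit, y_val = train_test_split(X_tr, y_tr, test_size=0.2, random_state=42)

print(X_fit.shape, X_val.shape, X_te.shape)

# A model could then be trained along the lines of:
# CatBoostRegressor(...).fit(X_fit, y_fit, eval_set=(X_val, y_val), ...)
```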

In [None]:
#Make predictions and evaluate the Catboost model

print(f'Training score:{CB_regressor.score(X_train,y_train)}')

y_pred=CB_regressor.predict(X_test)

#Evaluating the model
list_of_models.append(evaluate_model('Catboost Regression(default)',X_test,y_test,y_pred))

In [None]:
#Visualizing features importance of Catboost model
plotting_imp_features(CB_regressor, X_train)

CATBoost with Hyperparameter Tuning

In [None]:
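#Finding the optimal parameters by Grid Search Cross Validation
#(no scoring is given, so GridSearchCV ranks candidates by the estimator's default score, r2)

parameters = {'depth' : [8,7,6],'learning_rate' : [0.025, 0.05, 0.1],'iterations':[100,200,500]}
#iterations=50 is only a starting value; the grid above overrides it for every candidate
CB_regressor = CatBoostRegressor(iterations=50, loss_function='RMSE',cat_features=categorical_features_indices)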
grid = GridSearchCV(estimator=CB_regressor, param_grid = parameters, cv = 5, n_jobs=-1)
grid.fit(X_train, y_train)
print("\n The best parameters across ALL searched params:\n",grid.best_params_)


In [None]:
# creating Catboost model
best_CB_regressor= CatBoostRegressor(iterations=500,depth=8,learning_rate=0.1, loss_function='RMSE')
# train the model
best_CB_regressor.fit(X_train,y_train,cat_features=categorical_features_indices,eval_set=(X_test, y_test),verbose=False,plot=False)

In [None]:
#Make predictions and evaluate the Catboost model

print(f'Training score:{best_CB_regressor.score(X_train,y_train)}')

y_pred=best_CB_regressor.predict(X_test)

#Evaluating the model
list_of_models.append(evaluate_model('Catboost Regression(tuned)',X_test,y_test,y_pred))

In [None]:
#Visualizing features importance of the best Catboost model
plotting_imp_features(best_CB_regressor, X_train)

In [None]:
#Comparing models
Comparison_df=pd.DataFrame(list_of_models,columns=['Regression Model','Mean Absolute Error','Mean Squared Error','Root Mean Squared Error','r2 score','adjusted r2 score'])
Comparison_df
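
The comparison table includes an adjusted r2 column. Assuming `evaluate_model` uses the standard definition (its body is not shown in this section), adjusted r2 penalises r2 for the number of predictors p relative to the number of samples n. A small stand-alone sketch with made-up numbers:

```python
def adjusted_r2(r2, n_samples, n_features):
    """Standard adjusted r2: 1 - (1 - r2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Illustrative values only: r2 of 0.90 on 101 samples with 10 features
adj = adjusted_r2(0.90, 101, 10)
print(round(adj, 4))
```

With these inputs the penalty term is 0.1 * 100 / 90, so the adjusted value comes out slightly below the raw r2.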

# Conclusion



*   Evaluating the performance metrics of the models leads to the conclusion that decision-tree-based ensemble models such as XGBoost and CatBoost are the most suitable for predicting the number of bikes required on an hourly basis.

*   The most important features for prediction are Hour and Temperature.

*   Because there is no significant linear correlation between the independent variables and the rented bike count, Linear regression and Polynomial regression are not a good fit in this scenario.



