<a href="https://colab.research.google.com/github/Nakulcj7/bike/blob/main/Bike_sharing_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Seoul Bike Sharing Demand Prediction



##### **Project Type**    - Regression
##### **Contribution**    - Individual


# **Project Summary -**

Bike sharing systems have gained widespread popularity in urban environments, offering a sustainable and efficient mode of transportation. This project develops a predictive model for bike sharing demand from historical rental data, weather conditions and other relevant factors. The primary goal is a robust and accurate prediction system that helps optimize bike allocation and improve the user experience. The dataset has 8760 records and 14 attributes. It contains Seoul's weather conditions (Temperature, Humidity, Windspeed, Visibility, Dewpoint, Solar radiation, Snowfall, Rainfall), the number of bikes rented in each hour, and date information.

# **GitHub Link -**

https://github.com/Nakulcj7/bike/blob/main/Bike_sharing_prediction.ipynb


# **Problem Statement**



Rental bikes need to be available and accessible to the public at the right time so that waiting time is reduced. Providing the city with a stable supply of rental bikes therefore becomes a major concern. The main thing to focus on is predicting the bike count required at each hour for a stable supply of rental bikes.


The major objective is to predict the number of rental bikes required at each hour of the day and to identify the features that influence the hourly demand for rental bikes.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
import pandas as pd
import datetime as dt
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
%matplotlib inline

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
#loading the dataset
df=pd.read_csv("/content/drive/MyDrive/Almabetter/SeoulBikeData.csv", encoding='ISO-8859-1')

### Dataset First View

In [None]:
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

In [None]:
print(f'number of rows : {df.shape[0]}  \nnumber of columns : {df.shape[1]}')

### Dataset Information

In [None]:
# Dataset Info
df.info()

In [None]:
# Viewing the statistical summary of the data
df.describe(include='all').T

#### Duplicate Values

In [None]:
len(df[df.duplicated()])

This shows that there are no duplicate rows present in the dataset.


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

The dataset has no null values, so no imputation is required.


### What did you know about your dataset?

The dataset provided contains 14 columns and 8760 rows and does not have any missing or duplicate values.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe(include='all').T


In [None]:
numerical_features = [ftr for ftr in df.columns if df[ftr].dtype != 'O'] # attributes that are not of Object type, i.e. numerical data
categorical_features = [ftr for ftr in df.columns if df[ftr].dtype == 'O']
target = ['Rented Bike Count']
numerical_features.remove(target[0])

### Variables Description





This dataset contains information on Seoul's weather conditions (Temperature, Humidity, Windspeed, Visibility, Dewpoint, Solar radiation, Snowfall, Rainfall), the number of bikes rented in each hour, and date information.

Attribute Information:


*   Date - The date of each observation (stored as day/month/year in the raw CSV)
*   Rented Bike Count - Count of bikes rented at each hour
*   Hour - Hour of the day
*   Temperature - Temperature recorded in the city in Celsius (°C)
*   Humidity - Relative humidity in %
*   Windspeed - Speed of the wind in m/s
*   Visibility - Measure of the distance at which an object or light can be clearly discerned, in units of 10 m
*   Dew point temperature - Temperature at which the air becomes saturated with water vapour, in Celsius (°C)
*   Solar radiation - Intensity of sunlight in MJ/m^2
*   Rainfall - Amount of rainfall received in mm
*   Snowfall - Amount of snowfall received in cm
*   Seasons - Season of the year (Winter, Spring, Summer, Autumn)
*   Holiday - Whether the day is a holiday or not (Holiday/No holiday)
*   Functional Day - Whether the rental service is available (Yes: functional hours) or not (No: non-functional hours)



### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()


## 3. ***Exploratory Data Analysis***

Categorizing features as numerical and categorical

## Univariate Analysis

Distribution of Numerical Features

In [None]:
import warnings
warnings.filterwarnings("ignore")

plt.figure(figsize=(25, 15))
plt.suptitle("Distribution Plot", fontsize=18, y=0.95)

for n, ticker in enumerate(numerical_features + target):
  # add a new subplot iteratively
  ax = plt.subplot(4,3, n + 1)
  plt.subplots_adjust(hspace=0.5, wspace=0.3)
  # plot the distribution of the feature on the new subplot axis
  sns.histplot(df[ticker], kde=True, ax=ax) #histplot replaces the deprecated distplot
  plt.axvline(df[ticker].mean(), color='magenta', linestyle='dashed', linewidth=2)
  plt.axvline(df[ticker].median(), color='cyan', linestyle='dashed', linewidth=2)
  ax.set_title(ticker.upper())
  ax.set_xlabel("")



*   'Hour' is uniformly distributed, since every date has 24 hourly records. Several of the other numerical features loosely resemble a normal distribution.

*   The distributions of Rented Bike Count, Solar Radiation and Visibility are highly skewed, indicating the presence of large outliers.



In [None]:
import warnings
warnings.filterwarnings("ignore")

plt.figure(figsize=(25, 15))
plt.suptitle("Regression Plot", fontsize=18, y=0.95)

for n, ticker in enumerate(numerical_features):
  # add a new subplot iteratively

  # filter df and plot ticker on the new subplot axis
  ax = plt.subplot(4,3, n+1)
  plt.subplots_adjust(hspace=0.5, wspace=0.3)
  sns.regplot(x=df[ticker],y=df[target[0]],line_kws={"color": "red"}) #target is a list, so its single column name is passed
  ax.set_title(ticker.upper())
  ax.set_xlabel("")


## Feature Extraction

In order to get deeper insights, we'll be extracting features from the date column. The relevant features extracted will be:

1.   Weekend: Boolean variable that tells whether the day falls on a weekend
2.   Day of the week
3.   Month

Features extracted from the Hour column will be:

4.   Day Phase: Morning, Afternoon, Evening and Night








Extracting features from Date column

In [None]:
import datetime as dt
df['Date']=pd.to_datetime(df['Date'], dayfirst=True) #assumes the raw CSV stores dates as day/month/year

In [None]:
#datetime_is_numeric was removed in pandas 2.0; checking the date range directly works on any version
df['Date'].agg(['min', 'max'])

The data covers one full year (1 December 2017 to 30 November 2018), which matches 8760 hourly records (365 days x 24 hours).

In [None]:
df.groupby('Date').agg({'Hour':'count'}).Hour.unique()

Every date in the dataset has 24 hourly records of bike rental data.

In [None]:
a = df['Date'].dt.day_name()
df['Weekend']= a.apply(lambda x : 'Yes' if x=='Saturday' or x=='Sunday' else 'No' ) #tells if the day falls on a weekend or not
df['DayNo'] = df['Date'].dt.dayofweek #Day number of the week (Monday=0, Sunday=6)
df['Day Name'] = a
df['Month'] = df['Date'].dt.month #returns month from date

categorical_features.remove('Date') #Dropping date as necessary features are extracted
categorical_features = categorical_features + ['Weekend','Month', 'Day Name']

In [None]:
categorical_features

In [None]:
def phase_day(row):

  if 6 <= int(row['Hour']) <=11: #Time between 6am and 11am
    return 'Morning'

  if 12 <= int(row['Hour'])< 18: #Time between 12 noon and 6pm
    return 'AfterNoon'

  if 18 <= int(row['Hour'])<=21: #Time between 6pm and 9pm
    return 'Evening'

  if 22 <= int(row['Hour']) or int(row['Hour']) < 6  : #Time between 10pm and 6am, next day
    return 'Night'

df['Day Phase'] = df.apply(lambda row: phase_day(row), axis =1)
categorical_features.append('Day Phase')
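
The phase boundaries above can also be expressed as a lookup table, which avoids a row-wise `apply`. The following is a minimal sketch rather than the notebook's own method: the names `build_phase_lookup` and `PHASE_BY_HOUR` are hypothetical, and it assumes `Hour` holds integers from 0 to 23.

```python
import pandas as pd

def build_phase_lookup():
    """Return a dict mapping each hour (0-23) to its day phase."""
    lookup = {}
    for hour in range(24):
        if 6 <= hour <= 11:
            lookup[hour] = 'Morning'
        elif 12 <= hour <= 17:
            lookup[hour] = 'AfterNoon'
        elif 18 <= hour <= 21:
            lookup[hour] = 'Evening'
        else:  # hours 22-23 and 0-5
            lookup[hour] = 'Night'
    return lookup

PHASE_BY_HOUR = build_phase_lookup()

# Toy hours standing in for df['Hour']
hours = pd.Series([0, 5, 6, 11, 12, 17, 18, 21, 22, 23])
phases = hours.map(PHASE_BY_HOUR)
print(phases.tolist())
```

On the real data the equivalent call would be `df['Hour'].map(PHASE_BY_HOUR)`.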

Since the time of day is cyclic (00:01 comes two minutes after 23:59, yet a plain numeric encoding treats the two as 23 hours and 58 minutes apart), we apply sine and cosine transformations to the cyclic features.

In [None]:
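#period of each cyclic feature: 24 hours, 12 months, 7 days of the week
#dividing by the column maximum instead would map hour 0 and hour 23 to the same point
periods = {'Hour': 24, 'Month': 12, 'DayNo': 7}
for a, period in periods.items():
  b =  2 * np.pi * df[a] / period
  df['sin ' + a ] = np.sin(b)
  df["cos " + a ] = np.cos(b)
df.drop('DayNo', inplace=True, axis=1)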
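
The effect of the sine and cosine transformation can be checked on a toy example. This is a minimal sketch, assuming a period of 24 for hours; the helper names `cyclic_encode` and `circle_distance` are hypothetical and not part of the notebook.

```python
import numpy as np

def cyclic_encode(values, period):
    """Return (sin, cos) coordinates of values placed on a circle with the given period."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.sin(angle), np.cos(angle)

def circle_distance(a, b, period):
    """Euclidean distance between two values after cyclic encoding."""
    sin_v, cos_v = cyclic_encode([a, b], period)
    return float(np.hypot(sin_v[0] - sin_v[1], cos_v[0] - cos_v[1]))

d_23_0 = circle_distance(23, 0, period=24)   # one hour apart across midnight
d_0_1 = circle_distance(0, 1, period=24)     # one hour apart after midnight
d_0_12 = circle_distance(0, 12, period=24)   # opposite sides of the clock

print(d_23_0, d_0_1, d_0_12)
```

With the period set to 24, the hours 23 and 0 are as close as 0 and 1, while midnight and noon sit at opposite ends of the circle.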

In [None]:
df.info()

In [None]:
for col in categorical_features:
  print(col, df[col].unique(), '\n')

## Count Plot of Categorical Attributes

In [None]:
import warnings
warnings.filterwarnings("ignore")

plt.figure(figsize=(25, 15))
plt.suptitle("Count Plot", fontsize=18, y=0.95)

for n, ticker in enumerate(categorical_features):
  # add a new subplot iteratively

  # filter df and plot ticker on the new subplot axis
  ax = plt.subplot(4,3, n+1)
  plt.subplots_adjust(hspace=0.5, wspace=0.3)
  sns.countplot(x=df[ticker])
  ax.set_title(ticker.upper())
  ax.set_xlabel("")

**Questions**


*   What is the trend of Bike Sharing on an average day?

*   How do Holidays affect Bike Sharing Demand?
*   How do Holidays, Seasons, Weekends and the Month of the year have an effect on this trend?


*   How does the Bike Sharing demand fluctuate in different times of the day?(will be done using boxplot)



1. What is the trend of Bike Sharing on an average day?

In [None]:
plt.figure(figsize = (16,8))
sns.lineplot(x = 'Hour', y= 'Rented Bike Count', data = df)
plt.title("Average Bike Sharing Demand")
a = plt.xticks(ticks = np.arange(0,24,1))

In [None]:
plt.figure(figsize = (16,8))
sns.lineplot(x = 'Hour', y= 'Rented Bike Count',hue ='Day Name' , data = df)
plt.title("Average Bike Sharing Demand on Different days of the Week")
a = plt.xticks(ticks = np.arange(0,24,1))



*   The number of bikes rented increases from 5 am and reaches its first peak at 8 am.
*   Demand starts rising again at 10 am and reaches the second peak at 6 pm, the busiest time of the day.
*   Demand keeps decreasing from 6 pm to 4 am the next day. 4-5 am is the quietest period of the day.
*   Mornings are busiest on Mondays, while evenings are busiest on Fridays, the last working day. Late-morning and afternoon demand is highest on Saturdays and Sundays.



2. How do Holidays affect Bike Sharing Demand?

In [None]:
plt.figure(figsize = (16,8))
sns.lineplot(x = 'Hour', y= 'Rented Bike Count',hue ='Holiday' , data = df)
plt.title("Bike Sharing Demand: Holidays vs Working Days")
a = plt.xticks(ticks = np.arange(0,24,1))



*   Bike sharing demand is substantially lower on holidays than on working days.
*   On holidays the commute peaks are much less pronounced: demand stays low and rises only gradually through the day. This suggests that working commuters contribute a notable share of the demand.





3. How do Seasons have an effect on this trend?

In [None]:
plt.figure(figsize = (16,8))
sns.lineplot(x = 'Hour', y= 'Rented Bike Count',hue ='Seasons' , data = df)
plt.title("Bike Sharing Demand in Different Seasons")
a = plt.xticks(ticks = np.arange(0,24,1))



*   Winter receives the least bike sharing demand of all the seasons, while Summer has the highest.
*   The peaks and lows are similar in all the seasons suggesting that the rental routines of the people don't change but the amount of demand certainly does



In [None]:
plt.figure(figsize = (16,8))
sns.lineplot(x = 'Hour', y= 'Rented Bike Count',hue ='Weekend' , data = df)
plt.title("Bike Sharing Demand: Weekends vs Weekdays")
a = plt.xticks(ticks = np.arange(0,24,1))

The afternoons and late-nights are busier in the weekends as compared to the weekdays while the mornings and late evenings appear quieter

4. How does the Bike Sharing demand fluctuate in different times of the day?

In [None]:
import warnings
warnings.filterwarnings("ignore")

plt.figure(figsize=(15, 10))
plt.suptitle('Bikes Rented Fluctuation on Different Times of the Day', fontsize=18, y=0.95)

for n,time in enumerate(df['Day Phase'].unique()):
  a = df[df['Day Phase']== time]
  ax = plt.subplot(1,4, n+1)
  plt.subplots_adjust(hspace=0.5, wspace=0.5)
  sns.boxplot(y= 'Rented Bike Count', data = a)
  ax.set_title(time.upper())
  ax.set_ylabel("")

Most bike sharing demand occurs in the evening.




## Correlation Heatmap

In [None]:
plt.figure(figsize = (15,10))
#Plotting COrrelation HEATMAP
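sns.heatmap(df.corr(numeric_only=True), annot = True) #numeric_only skips the text columns, which recent pandas versions refuse to correlate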



*   Strong correlation of 0.91 observed between Temperature and Dew point temperature
*   Rented Bike count has a positive correlation with the Temperature and Hour of the day
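
Strongly correlated pairs can also be read off the correlation matrix programmatically instead of from the heatmap. The sketch below uses a made-up toy frame; the helper `strong_pairs` and the 0.8 threshold are assumptions for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd

def strong_pairs(frame, threshold=0.8):
    """List feature pairs whose absolute Pearson correlation is at least `threshold`."""
    corr = frame.select_dtypes('number').corr()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs.abs() >= threshold].sort_values(ascending=False)

# Toy data: y is a linear function of x, z alternates in sign, label is non-numeric
toy = pd.DataFrame({
    'x': np.arange(10.0),
    'y': 2 * np.arange(10.0) + 1,
    'z': [1.0, -1.0] * 5,
    'label': list('abcdefghij'),
})
result = strong_pairs(toy)
print(result)
```

On the real data, `strong_pairs(df)` would list the same pairs that stand out in the heatmap.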



Dropping the Date feature, as the necessary features have already been extracted from it

In [None]:
df = df.drop(columns=['Date'])

## Feature Engineering

In [None]:
plt.figure(figsize=(25, 10))
plt.suptitle('BOX PLOT of Numerical Features', fontsize=18, y=0.95)

#Looping through numerical features and looking for outliers using BOX PLOT
for n, ticker in enumerate(numerical_features + target):
  ax = plt.subplot(4,3, n+1)
  plt.subplots_adjust(hspace=0.5, wspace=0.3)
  sns.boxplot(x = df[ticker])
  ax.set_title(ticker.upper())
  ax.set_xlabel("")



*   Windspeed, Rainfall, Snowfall and Solar Radiation vary with the seasons: they peak only during particular seasons and stay near their average values during the others.
*   These outliers might offer insight when predicting the target variable, so they are not removed.
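
Seaborn's box plot draws its whiskers at 1.5 times the interquartile range by default and marks anything beyond them as an outlier. A minimal sketch of that rule on made-up rainfall values (the helper `iqr_outlier_mask` and the sample numbers are illustrative, not taken from the dataset):

```python
import numpy as np

def iqr_outlier_mask(values, whisker=1.5):
    """Flag values lying more than `whisker` * IQR outside the quartiles."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - whisker * iqr) | (values > q3 + whisker * iqr)

# Hypothetical hourly rainfall in mm: mostly dry hours with two wet ones
sample = np.array([0.0] * 8 + [0.5, 35.0])
mask = iqr_outlier_mask(sample)
n_outliers = int(mask.sum())
print(n_outliers)
```

When most readings are zero, both quartiles are zero, so every non-zero reading is flagged. This is one reason a feature such as Rainfall can show many points beyond the whiskers.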



## One Hot Encoding

In [None]:
#categorical features are one hot encoded
engineered_df = df.copy()
for ftr in categorical_features:
  engineered_df = pd.concat([pd.get_dummies(engineered_df[ftr],prefix = ftr, drop_first=True), engineered_df.drop(ftr, axis = 1)], axis =1)
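
What `drop_first=True` does is easiest to see on a toy frame. This is a small sketch with made-up rows; it uses the `columns=` argument of `pd.get_dummies` instead of the loop above, which produces the same kind of dummy columns.

```python
import pandas as pd

# Toy rows standing in for the real dataset
toy = pd.DataFrame({
    'Hour': [0, 1, 2, 3],
    'Seasons': ['Winter', 'Spring', 'Summer', 'Autumn'],
    'Holiday': ['No Holiday', 'Holiday', 'No Holiday', 'No Holiday'],
})

# drop_first=True drops the first category of each feature (alphabetically),
# which then acts as the baseline encoded by all-zero dummies
encoded = pd.get_dummies(toy, columns=['Seasons', 'Holiday'], drop_first=True)
print(encoded.columns.tolist())

# The Autumn row has no season dummy switched on
season_cols = ['Seasons_Spring', 'Seasons_Summer', 'Seasons_Winter']
autumn_flags = int(encoded.loc[3, season_cols].astype(int).sum())
```

Dropping one level per feature avoids perfectly collinear dummy columns, which matters most for the linear models.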


In [None]:
engineered_df.info()

## Train Test Split

In [None]:
#importing necessary modules for Deploying models and Evaluating them
from sklearn.metrics import mean_absolute_error, r2_score, mean_squared_error, max_error
from sklearn.linear_model import  LinearRegression, Lasso, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR, LinearSVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import AdaBoostRegressor, BaggingRegressor, GradientBoostingRegressor, RandomForestRegressor, VotingRegressor, StackingRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

In [None]:
#splitting data into train and test set
X,y = engineered_df.drop('Rented Bike Count', axis=1), engineered_df['Rented Bike Count']
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size =0.2,  random_state=5)

## Functions

Scaling

In [None]:
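#function to scale: the scaler is fit on the training set only and then applied to both sets
def do_scale(X_train, X_test, scaling_type = StandardScaler):
  scaler = scaling_type()
  scaler.fit(X_train)
  #keeping the original row index so the features stay aligned with y_train / y_test
  X_train_scaled = pd.DataFrame(scaler.transform(X_train), columns = X_train.columns, index = X_train.index)
  X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns = X_test.columns, index = X_test.index)
  return X_train_scaled, X_test_scaled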
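
The scaler is fit on the training split only, so the test split is transformed with the training mean and standard deviation. A minimal sketch on made-up numbers shows what that means in practice:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature data: three training rows and one unseen test row
train = np.array([[0.0], [10.0], [20.0]])
test = np.array([[30.0]])

scaler = StandardScaler().fit(train)       # statistics come from the training rows only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

print(scaler.mean_, scaler.scale_)
print(train_scaled.ravel(), test_scaled.ravel())
```

The scaled training column has mean zero, while the test value lands outside the training range because it is measured against the training statistics. Fitting the scaler on the full data instead would leak test information into training.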

## Evaluation

In [None]:
pip install shap

In [None]:
import shap

In [None]:
#Dictionary that will store metrics for evaluated models
#This will be converted to a DataFrame by the display_report function
report = {
    'model_type':[],
    'model_name':[],
    'rmse':[],
    'mae':[],
    'R2':[],
    'adjusted R2':[]
}

In [None]:
# function to fit a model, score it on the test set and record the metrics in report
def evaluate(modeltype, modelname, Model, X_train, y_train, X_test, y_test):
  #making sure the same model is not entered twice
  if modelname in report['model_name']:
    print("Pre-existing model")
    return 0
  #making a copy to prevent accidental data changes
  X_tr = X_train.copy()
  X_te = X_test.copy()

  #Fitting Model
  Model.fit(X_tr, y_train)

  #Predicting Values from test set using model
  y_pred = Model.predict(X_te)

  #Model Evaluation

  #Mean Absolute Error
  mae = mean_absolute_error(y_test,y_pred)
  report['mae'].append(mae) #Appending Metric

  #R2 score
  R2 = r2_score(y_test,y_pred)
  report['R2'].append(R2) #Appending Metric

  #Root Mean Square Error
  rmse = np.sqrt(mean_squared_error(y_test,y_pred))
  report['rmse'].append(rmse) #Appending Metric

  #Adjusted R2 score
  adj_r2=1-(1-R2)*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
  report['adjusted R2'].append(adj_r2) #Appending Metric

  #Appending Model Details
  report['model_name'].append(modelname)
  report['model_type'].append(modeltype)

  print(f"\n\n\n----------\n")

  #Plotting Graph of observed vs predicted values
  plt.figure(figsize=(20,10))
  plt.plot((y_pred)[:100])
  plt.plot((np.array(y_test)[:100]))
  plt.legend(["Predicted","Actual"])
  plt.title(modelname)
  plt.show()

  print(f"\n\n\n----------\n")
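
The adjusted R2 computed in `evaluate` penalises R2 for the number of predictors: adjusted R2 = 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the number of test rows and p the number of features. A small worked sketch follows; the helper name `adjusted_r2`, the R2 value and the feature count are hypothetical, while n = 1752 is 20% of the 8760 rows.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R2 for a model with n_features predictors scored on n_samples rows."""
    dof = n_samples - n_features - 1
    if dof <= 0:
        raise ValueError("need more samples than features + 1")
    return 1 - (1 - r2) * (n_samples - 1) / dof

# Hypothetical score: R2 of 0.90 on 1752 test rows with 40 features
example = adjusted_r2(0.90, n_samples=1752, n_features=40)
print(round(example, 4))

# With no predictors the adjustment vanishes
baseline = adjusted_r2(0.90, n_samples=101, n_features=0)
```

Because the test set is large relative to the number of features, the adjustment is small, which is why R2 and adjusted R2 stay close in the report.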


In [None]:
#displays report in a dataframe
def display_report():
  return pd.DataFrame(report)

## Getting Models

In [None]:
#installing catboost
!pip install catboost

In [None]:
from catboost import CatBoostRegressor
import lightgbm as lgb
from xgboost import XGBRegressor

In [None]:
#function that returns 3 arrays of model-functions, names and their details
def get_models():
  models, names, model_type = list(), list(), list()

  # LinearReg
  models.append(LinearRegression())
  names.append('Linear Regression')
  model_type.append('Linear')

  #Lasso
  models.append(Lasso(alpha =0.2))
  names.append('Lasso Regression')
  model_type.append('Regularized Linear (Lasso) ')

  #Ridge
  models.append(Ridge(alpha =0.5))
  names.append('Ridge Regression')
  model_type.append('Regularized Linear (Ridge)')

  # DecisionTree
  models.append((DecisionTreeRegressor()))
  names.append('DecisionTree Regressor')
  model_type.append('CART')

  #RandomForest
  models.append(RandomForestRegressor())
  names.append('RandomForest Regressor')
  model_type.append('Ensemble Method')

  # GradientBoosting
  models.append(GradientBoostingRegressor())
  names.append('GradientBoosting Regressor')
  model_type.append('Ensemble Method')

  # CatBoosting
  models.append(CatBoostRegressor(silent = True))
  names.append('Cat Boosting Regressor')
  model_type.append('Ensemble Method')

  #Bagging
  models.append(BaggingRegressor())
  names.append('Bagging Regressor')
  model_type.append('Ensemble Method')

  #LightGBM Regressor
  models.append(lgb.LGBMRegressor())
  names.append('LightGBM Regressor')
  model_type.append('Ensemble Method')

  # KNN
  models.append(KNeighborsRegressor())
  names.append('K Neighbors Regressor')
  model_type.append('Neighbours')


  return models, names, model_type

## Model Deployment

In [None]:
#scaling
X_train_scaled, X_test_scaled = do_scale(X_train,X_test)

In [None]:
X_train_scaled.shape, X_test_scaled.shape

## Training and Evaluating ML models

In [None]:
#evaluating all models returned by get_models and printing prediction graphs
Model, modelname, modeltype = get_models()
for i in range(len(Model)):
  evaluate(modeltype[i], modelname[i], Model[i], X_train_scaled , y_train, X_test_scaled,y_test)

In [None]:
display_report().sort_values('R2', ascending=False)

CatBoost, LightGBM and Random Forest are the best-performing models, with R2 scores of 0.923, 0.909 and 0.898 respectively.

## Hyperparameter Tuning

## CATBOOST

In [None]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

In [None]:
cbc = CatBoostRegressor(silent = True)

#create the grid
grid = {'max_depth': [1,3,5,10,12],
        'n_estimators':[100,200,350,400]
        }

#Instantiate GridSearchCV
gscv = GridSearchCV (estimator = cbc, param_grid = grid, scoring ='r2', cv = 3)

#fit the model
gscv.fit(X_train_scaled,y_train)
print(gscv.best_estimator_)


In [None]:
#returns the best score
print(gscv.best_score_)
#returns the best parameters
print(gscv.best_params_)
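
The cost of a grid search grows with the product of the grid sizes. A quick sketch of the arithmetic for the CatBoost grid above, using scikit-learn's `ParameterGrid` to enumerate the candidates (the variable names are illustrative):

```python
from sklearn.model_selection import ParameterGrid

# Same grid as the CatBoost search
grid = {'max_depth': [1, 3, 5, 10, 12],
        'n_estimators': [100, 200, 350, 400]}

n_candidates = len(ParameterGrid(grid))   # every combination of the listed values
cv_folds = 3
n_fits = n_candidates * cv_folds          # one fit per candidate per fold

print(n_candidates, n_fits)
```

`GridSearchCV` then refits the best candidate once more on the full training set, since `refit=True` is its default.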

In [None]:
#evaluation
tuned_catboost = CatBoostRegressor(**gscv.best_params_, silent=True)
evaluate("Ensemble Method", "TUNED CatBoost Model", tuned_catboost, X_train_scaled , y_train, X_test_scaled,y_test)

In [None]:
display_report().tail(1)

After tuning the parameters, CatBoost's R2 score increased from 0.9235 to 0.9368.

## SHAP

In [None]:
explainer = shap.TreeExplainer(gscv.best_estimator_)
shap_values = explainer.shap_values(X_test_scaled)

In [None]:
shap.initjs()
i = 4
shap.force_plot(explainer.expected_value, shap_values[i], features=X_test_scaled.iloc[i], feature_names=X_train_scaled.columns)

In [None]:
shap.summary_plot(shap_values, features= X_test_scaled, feature_names= X_test_scaled.columns, plot_type='bar')

In [None]:
import random
random.seed(23)
index = random.randrange(len(X_test_scaled)) #randint would include len(X_test_scaled), which is out of range for iloc

## LIME Explainability

In [None]:
pip install lime

In [None]:
import lime
from lime.lime_tabular import LimeTabularExplainer

In [None]:
feature_names =list(X_train_scaled.columns)

explainer = LimeTabularExplainer(np.array(X_train_scaled),
    feature_names=feature_names,
    mode = 'regression')

In [None]:
exp = explainer.explain_instance(
      data_row=X_test_scaled.iloc[index],
      predict_fn= gscv.best_estimator_.predict
)
print(f"-----------\nACTUAL VALUE OBSERVED: {y_test.iloc[index]}\n")
exp.show_in_notebook(show_table=True)


## Random Forest

In [None]:
rfr = RandomForestRegressor()

#create the grid
grid = {'max_depth': [24,25,26],
        'n_estimators':[375,400,425],
        'max_samples':[0.5,0.6,0.8],
        'max_features':[10,15,20]
        }

#Instantiate GridSearchCV
gscv2 = GridSearchCV (estimator = rfr, param_grid = grid, scoring ='r2', cv = 3)


#fit the model
gscv2.fit(X_train_scaled,y_train)

print(gscv2.best_estimator_)


In [None]:
#returns the best score
print(gscv2.best_score_)
#returns the best parameters
print(gscv2.best_params_)

In [None]:
evaluate("Ensemble Method", "TUNED Random Forest Model", gscv2.best_estimator_, X_train_scaled , y_train, X_test_scaled,y_test)

In [None]:
display_report().tail(1)



After hyperparameter tuning, the R2 score for Random Forest increased from 0.896 to 0.903.



## SHAP

In [None]:
explainer = shap.TreeExplainer(gscv2.best_estimator_)
shap_values = explainer.shap_values(X_test_scaled[:200])

In [None]:
shap.initjs()
i = 4
shap.force_plot(explainer.expected_value, shap_values[i], features=X_test_scaled.iloc[i], feature_names=X_train_scaled.columns)

In [None]:
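#SHAP values were computed for the first 200 test rows only, so the same rows are passed as features
shap.summary_plot(shap_values, features= X_test_scaled[:200], feature_names= X_test_scaled.columns, plot_type='bar')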

## LIME Explainability

In [None]:
feature_names =list(X_train_scaled.columns)

explainer = LimeTabularExplainer(np.array(X_train_scaled),
    feature_names=feature_names,
    mode = 'regression')

In [None]:
exp = explainer.explain_instance(
      data_row=X_test_scaled.iloc[index],
      predict_fn= gscv2.best_estimator_.predict
)
print(f"-----------\nACTUAL VALUE OBSERVED: {y_test.iloc[index]}\n")
exp.show_in_notebook(show_table=True)

## Light GBM

In [None]:
#'max_features' is not a LightGBM parameter (LightGBM ignores it with a warning), so it is left out of the grid
params = {
        'max_depth': [19,20,25],
        'n_estimators':[375,400,425]
}

# Initialize a GridSearchCV -
gscv3 = GridSearchCV(estimator=lgb.LGBMRegressor(), param_grid =  params, scoring ='r2', cv =  3,verbose=1)


# Train on training data-
gscv3.fit(X_train_scaled, y_train)

In [None]:
#returns the best score
print(gscv3.best_score_)
#returns the best parameters
print(gscv3.best_params_)

In [None]:
#evaluate
evaluate("Ensemble Method", "TUNED LGBM", gscv3.best_estimator_, X_train_scaled , y_train, X_test_scaled,y_test)

In [None]:
display_report().tail(1)

The tuned LightGBM regressor performs better than the untuned LightGBM regressor: its R2 score improved from 0.909 to 0.921.

## SHAP

In [None]:
explainer = shap.TreeExplainer(gscv3.best_estimator_)
shap_values = explainer.shap_values(X_test_scaled)

In [None]:
shap.initjs()
i = 4
shap.force_plot(explainer.expected_value, shap_values[i], features=X_test_scaled.iloc[i], feature_names=X_train_scaled.columns)

In [None]:
shap.summary_plot(shap_values, features= X_test_scaled, feature_names= X_test_scaled.columns, plot_type='bar')

## LIME Explainability

In [None]:
feature_names =list(X_train_scaled.columns)

explainer = LimeTabularExplainer(np.array(X_train_scaled),
    feature_names=feature_names,
    mode = 'regression')

In [None]:
exp = explainer.explain_instance(
      data_row=X_test_scaled.iloc[index],
      predict_fn= gscv3.best_estimator_.predict
)
print(f"-----------\nACTUAL VALUE OBSERVED: {y_test.iloc[index]}\n")
exp.show_in_notebook(show_table=True)

## Stacking

Alternatively, we stack the three best-performing regressors and evaluate the stack against the best-performing individual model.

In [None]:
level0 = list()
level0.append(('LGBM', gscv3.best_estimator_)) #Tuned Light GBM
level0.append(('Random Forest', gscv2.best_estimator_)) #Tuned Random Forest Model
level0.append(('Cat Boost', gscv.best_estimator_)) #Tuned CatBoost
# define meta learner model
level1 = LinearRegression()

In [None]:
stacking_model = StackingRegressor(estimators=level0, final_estimator=level1, cv=5, passthrough = True)
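
With `passthrough=True` the meta learner is trained on the base models' cross-validated predictions plus the original features. The sketch below illustrates this on synthetic data with small stand-in base models (a decision tree and a small random forest, not the tuned models used in this notebook):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data with 5 features
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Two stand-in base learners (level 0) and a linear meta learner (level 1)
base = [
    ('tree', DecisionTreeRegressor(max_depth=4, random_state=0)),
    ('forest', RandomForestRegressor(n_estimators=20, random_state=0)),
]
stack = StackingRegressor(estimators=base, final_estimator=LinearRegression(),
                          cv=5, passthrough=True)
stack.fit(X_tr, y_tr)

# The meta learner gets one coefficient per base model plus one per original feature
n_meta_inputs = stack.final_estimator_.coef_.shape[0]
predictions = stack.predict(X_te)
print(n_meta_inputs, predictions.shape)
```

Here the linear meta learner sees 2 base predictions and 5 passthrough features, so it learns 7 coefficients.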

In [None]:
evaluate("Ensemble Method", "STACKING Regressor", stacking_model, X_train_scaled , y_train, X_test_scaled,y_test)


In [None]:
display_report().tail(1)



*   Stacking all three top learners gives an R2 score of 0.9379
*   The Stacked model has better R2 scores than the individual learners



## LIME Explainability

In [None]:
stacking_model.fit(X_train_scaled , y_train)

feature_names =list(X_train_scaled.columns)

explainer = LimeTabularExplainer(np.array(X_train_scaled),
    feature_names=feature_names,
    mode = 'regression')

In [None]:
exp = explainer.explain_instance(
      data_row=X_test_scaled.iloc[index],
      predict_fn= stacking_model.predict
)
print(f"-----------\nACTUAL VALUE OBSERVED: {y_test.iloc[index]}\n")
exp.show_in_notebook(show_table=True)

## Model Performance Report

In [None]:
display_report().sort_values('R2', ascending = False)

The top performing models based on R2 scores are Stacking Regressor and Parameter tuned CatBoosting Regressor.

## Conclusion



*   Upon Exploratory Data Analysis, we found that the bike rentals follow an hourly trend where it hits the first peak in the morning and the highest peak later in the evening.

*   We also found that these trends are prominent only during weekdays and working days, leading us to make a safe assumption that office-goers make a notable contribution to bike sharing demand.

*   In addition, seasons were observed to have a notable effect on bike rentals with high traffic during summer and a significantly lower demand in winter.
*   Upon training and evaluation of the machine learning models, the CatBoost model and the Stacked Ensemble of CatBoost, LightGBM and Random Forest models performed the best when evaluated using the R2 metric. They produced R2 scores of 0.9369 and 0.9380, with a root mean squared error of 162.01 and 160.59 respectively.


*   It was found that the top performing models made predictions based on the weather and time of the day as high weightage was given to seasons, temperature recorded, solar radiation and hour of the day. This confirms the trends observed during the exploratory data analysis stage of the project.




