<a href="https://colab.research.google.com/github/1aishwarye/demand_prediction/blob/main/Demand_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name -** <font color='#FF3206'>Seoul Bike Sharing Demand Prediction</font>

# **Project Type** - Regression

# **Contribution**    - Individual

## **Problem Statement** 

### Rental bikes have been introduced in many cities to make urban travel more convenient. Bikes must be available and accessible to the public at the right time, because this reduces waiting time. Keeping the city supplied with a stable number of rental bikes therefore becomes a major concern, and the crucial step is predicting the number of bikes required at each hour.


## **Importing Required Libraries**

---

All required libraries are imported up front to keep the workflow smooth. They are used for data manipulation, plotting, modelling and evaluation.

In [None]:
#importing numpy and pandas
import numpy as np
import pandas as pd

#Data visualization libraries
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('ticks')
sns.set_context('poster')
from scipy.stats import norm

#Datetime library and calendar
from datetime import datetime
import calendar

#from scikit-learn we import the scalers and the encoder
from sklearn.preprocessing import StandardScaler,MinMaxScaler
from sklearn.preprocessing import LabelEncoder

#from scikit-learn, lightgbm and xgboost we import the ML models
from sklearn.linear_model import LinearRegression , Lasso , Ridge , ElasticNet
from sklearn.tree import DecisionTreeRegressor , ExtraTreeRegressor
from sklearn.ensemble import  RandomForestRegressor, GradientBoostingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn import neighbors
from lightgbm import LGBMRegressor
import lightgbm
from xgboost import XGBRegressor

#for VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor

# for data split
from sklearn.model_selection import train_test_split

#for optimization
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

#Evaluation metrics
from sklearn import metrics
from sklearn.metrics import r2_score as r2
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import mean_absolute_error

## **Loading data set**

In [None]:
data_path = "/content/drive/MyDrive/SeoulBikeData.csv"
df = pd.read_csv(data_path,encoding = "ISO-8859-1")
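The explicit `encoding` is presumably needed because the column headers contain the degree sign (`°C`), which ISO-8859-1 stores as a single byte that is not valid UTF-8 on its own. The sketch below illustrates this on a hypothetical two-column, in-memory CSV (not the real file); the names `raw_bytes` and `sample` are made up for the illustration.

```python
import io
import pandas as pd

# Hypothetical miniature of the file: the header contains "°", encoded as ISO-8859-1.
raw_bytes = "Hour,Temperature(°C)\n0,-5.2\n1,-5.5\n".encode("ISO-8859-1")

# Decoding the same bytes as UTF-8 fails on the lone degree-sign byte.
try:
    raw_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Passing the matching encoding to read_csv recovers the header intact.
sample = pd.read_csv(io.BytesIO(raw_bytes), encoding="ISO-8859-1")
print(utf8_ok, list(sample.columns))
```

If a file like this raises a `UnicodeDecodeError` with the default settings, trying `encoding="ISO-8859-1"` is a common first step.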

In [None]:
#shape of dataset
df.shape

(8760, 14)

In [None]:
#displaying first 5 rows
df.head()

Unnamed: 0,Date,Rented Bike Count,Hour,Temperature(°C),Humidity(%),Wind speed (m/s),Visibility (10m),Dew point temperature(°C),Solar Radiation (MJ/m2),Rainfall(mm),Snowfall (cm),Seasons,Holiday,Functioning Day
0,01/12/2017,254,0,-5.2,37,2.2,2000,-17.6,0.0,0.0,0.0,Winter,No Holiday,Yes
1,01/12/2017,204,1,-5.5,38,0.8,2000,-17.6,0.0,0.0,0.0,Winter,No Holiday,Yes
2,01/12/2017,173,2,-6.0,39,1.0,2000,-17.7,0.0,0.0,0.0,Winter,No Holiday,Yes
3,01/12/2017,107,3,-6.2,40,0.9,2000,-17.6,0.0,0.0,0.0,Winter,No Holiday,Yes
4,01/12/2017,78,4,-6.0,36,2.3,2000,-18.6,0.0,0.0,0.0,Winter,No Holiday,Yes


In [None]:
#displaying last 5 rows
df.tail()

Unnamed: 0,Date,Rented Bike Count,Hour,Temperature(°C),Humidity(%),Wind speed (m/s),Visibility (10m),Dew point temperature(°C),Solar Radiation (MJ/m2),Rainfall(mm),Snowfall (cm),Seasons,Holiday,Functioning Day
8755,30/11/2018,1003,19,4.2,34,2.6,1894,-10.3,0.0,0.0,0.0,Autumn,No Holiday,Yes
8756,30/11/2018,764,20,3.4,37,2.3,2000,-9.9,0.0,0.0,0.0,Autumn,No Holiday,Yes
8757,30/11/2018,694,21,2.6,39,0.3,1968,-9.9,0.0,0.0,0.0,Autumn,No Holiday,Yes
8758,30/11/2018,712,22,2.1,41,1.0,1859,-9.8,0.0,0.0,0.0,Autumn,No Holiday,Yes
8759,30/11/2018,584,23,1.9,43,1.3,1909,-9.3,0.0,0.0,0.0,Autumn,No Holiday,Yes


### **Different features and their description**
   
> A feature is an input variable, for example the x variable in simple linear regression. A simple machine learning project might use a single feature, while a more sophisticated one could use millions of features.

#### **Describing DataSet**

---
<b> The dataset contains weather information (Temperature, Humidity, Windspeed, Visibility, Dewpoint, Solar radiation, Snowfall, Rainfall), the number of bikes rented per hour and date information.</b>


#### <b>Attribute Information: </b>

*  Date - day/month/year (for example 01/12/2017)
*  Rented_Bike_Count - Count of bikes rented at each hour
*  Hour - Hour of the day
*  Temperature - Celsius
*  Humidity - %
*  Windspeed - m/s
*  Visibility - 10m
*  Dew point temperature - Celsius
*  Solar radiation - MJ/m2
*  Rainfall - mm
*  Snowfall - cm
*  Seasons - Winter, Spring, Summer, Autumn
*  Holiday - Holiday/No holiday
*  Functioning Day - Yes (functional hours), No (non-functional hours)

In [None]:
#describing dataset
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Rented Bike Count,8760.0,704.602055,644.997468,0.0,191.0,504.5,1065.25,3556.0
Hour,8760.0,11.5,6.922582,0.0,5.75,11.5,17.25,23.0
Temperature(°C),8760.0,12.882922,11.944825,-17.8,3.5,13.7,22.5,39.4
Humidity(%),8760.0,58.226256,20.362413,0.0,42.0,57.0,74.0,98.0
Wind speed (m/s),8760.0,1.724909,1.0363,0.0,0.9,1.5,2.3,7.4
Visibility (10m),8760.0,1436.825799,608.298712,27.0,940.0,1698.0,2000.0,2000.0
Dew point temperature(°C),8760.0,4.073813,13.060369,-30.6,-4.7,5.1,14.8,27.2
Solar Radiation (MJ/m2),8760.0,0.569111,0.868746,0.0,0.0,0.01,0.93,3.52
Rainfall(mm),8760.0,0.148687,1.128193,0.0,0.0,0.0,0.0,35.0
Snowfall (cm),8760.0,0.075068,0.436746,0.0,0.0,0.0,0.0,8.8


* The value ranges in the numerical columns look plausible, so extensive data cleansing may not be needed. However, columns such as **Wind speed**, **Solar Radiation**, **Rainfall** and **Snowfall** appear right-skewed, because their **median** (50th percentile) is far **below** their **maximum**; for Rainfall and Snowfall even the 75th percentile is 0.0 while the maxima are 35.0 and 8.8. **Dew point temperature** is less clear-cut: its mean (4.07) sits slightly below its median (5.1).
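Comparing the median with the maximum is only a rough check. As a minimal sketch, skewness can also be measured directly with `DataFrame.skew()`. The values below are a hypothetical toy sample (mostly zeros with a few large readings, similar in shape to rainfall), not rows from the dataset.

```python
import pandas as pd

# Hypothetical toy sample: "Rainfall" is mostly zeros with a few large values,
# "Hour" is evenly spread and therefore symmetric.
toy = pd.DataFrame({
    "Rainfall": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 1.0, 9.5, 35.0],
    "Hour": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
})

# Mean well above the median and a large positive skew both indicate a long right tail.
summary = pd.DataFrame({
    "mean": toy.mean(),
    "median": toy.median(),
    "skew": toy.skew(),
})
print(summary)
```

On the real frame, `df.skew(numeric_only=True)` would give the same statistic for every numerical column.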


In [None]:
df.rename({'Temperature(°C)':'Temp',
           'Humidity(%)':'Humidity',
           'Wind speed (m/s)':'Wind_speed',
           'Visibility (10m)': 'Visibility',
           'Dew point temperature(°C)': 'Dew_point_temperature',
           'Solar Radiation (MJ/m2)': 'Solar_Radiation',
           'Snowfall (cm)': 'Snowfall',
           'Rainfall(mm)': 'Rainfall',
           'Rented Bike Count': 'Rented_Bike_Count',
           'Functioning Day':'Functioning_Day'}, 
          axis = "columns", inplace = True)

In [None]:
#new column names list
list(df.columns)

['Date',
 'Rented_Bike_Count',
 'Hour',
 'Temp',
 'Humidity',
 'Wind_speed',
 'Visibility',
 'Dew_point_temperature',
 'Solar_Radiation',
 'Rainfall',
 'Snowfall',
 'Seasons',
 'Holiday',
 'Functioning_Day']

In [None]:
#information about the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8760 entries, 0 to 8759
Data columns (total 14 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Date                   8760 non-null   object 
 1   Rented_Bike_Count      8760 non-null   int64  
 2   Hour                   8760 non-null   int64  
 3   Temp                   8760 non-null   float64
 4   Humidity               8760 non-null   int64  
 5   Wind_speed             8760 non-null   float64
 6   Visibility             8760 non-null   int64  
 7   Dew_point_temperature  8760 non-null   float64
 8   Solar_Radiation        8760 non-null   float64
 9   Rainfall               8760 non-null   float64
 10  Snowfall               8760 non-null   float64
 11  Seasons                8760 non-null   object 
 12  Holiday                8760 non-null   object 
 13  Functioning_Day        8760 non-null   object 
dtypes: float64(6), int64(4), object(4)
memory usage: 958.2+ 

In [None]:
#checking for duplicates
df.duplicated().value_counts()

False    8760
dtype: int64

`duplicated()` returns a boolean Series marking duplicate rows, so a result containing only `False` means there are no duplicates.
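To make the meaning of that boolean Series concrete, here is a small sketch on a hypothetical three-row frame in which the last row repeats the second one. The frame and the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical toy frame: the third row is an exact copy of the second.
toy = pd.DataFrame({
    "Date": ["01/12/2017", "01/12/2017", "01/12/2017"],
    "Hour": [0, 1, 1],
    "Rented_Bike_Count": [254, 204, 204],
})

dup_mask = toy.duplicated()           # True where a row repeats an earlier row
n_duplicates = int(dup_mask.sum())    # number of repeated rows
deduplicated = toy.drop_duplicates()  # keeps the first occurrence of each row
print(dup_mask.tolist(), n_duplicates, len(deduplicated))
```

`df.duplicated().sum()` is a compact alternative to `value_counts()` when only the number of duplicates is of interest.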


# Important points from dataset
> * The dataset has 8760 rows and 14 columns. Luckily, there are no null values.
> * There are 10 numerical features ('Rented_Bike_Count', 'Hour', 'Temp', 'Humidity', 'Wind_speed', 'Visibility', 'Dew_point_temperature', 'Solar_Radiation', 'Rainfall', 'Snowfall') and 4 categorical features ('Date', 'Seasons', 'Holiday', 'Functioning_Day').
> * However, 'Date' is stored as an object and needs converting to a proper datetime type; per the data description, 'Hour' is the hour of the day for each record.
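The numerical/categorical split can be read from `df.info()`, but it can also be derived programmatically. The sketch below uses `select_dtypes` on a hypothetical four-column frame; the column names mirror the dataset, while the values and variable names are illustrative.

```python
import pandas as pd

# Hypothetical toy frame with two numeric and two text columns.
toy = pd.DataFrame({
    "Date": ["01/12/2017", "01/12/2017"],
    "Rented_Bike_Count": [254, 204],
    "Temp": [-5.2, -5.5],
    "Seasons": ["Winter", "Winter"],
})

# "number" matches every integer and float dtype; everything else is treated as non-numeric.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols, non_numeric_cols)
```

Note that a date stored as text lands in the non-numeric group, which is why it still needs an explicit conversion.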



In [None]:
#Creating a copy of the dataset so we do not make any changes to the original
df1 = df.copy()

**First, we need to convert the date column to a proper datetime type**

In [None]:
#dates are written day/month/year (e.g. 01/12/2017), so the format is given explicitly
df1['Date'] = pd.to_datetime(df1['Date'], format='%d/%m/%Y')
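The dates in this file are written day first (the first row is 01/12/2017 and the last is 30/11/2018), so a string such as 01/12/2017 is ambiguous unless the format is stated. A minimal sketch on a hypothetical three-value sample shows the difference; the variable names are illustrative.

```python
import pandas as pd

# Hypothetical sample of day-first date strings in the file's style.
raw = pd.Series(["01/12/2017", "13/12/2017", "30/11/2018"])

# An explicit day/month/year format parses every value the same way.
parsed = pd.to_datetime(raw, format="%d/%m/%Y")

# Reading the first string month-first silently yields a different date.
month_first = pd.to_datetime("01/12/2017", format="%m/%d/%Y")

print(parsed.dt.month.tolist(), month_first.date())
```

Stating the format also makes a malformed value raise an error instead of being parsed by guesswork.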


In [None]:
#Extracting month from date column
#use df1 (the parsed copy); df still holds the dates as strings and has no 'month' column
df1['month'] = df1['Date'].dt.month
df1['month'] = df1['month'].apply(lambda x : calendar.month_abbr[x])

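#Extracting day name from date column
#(the right-hand side is missing in the source; dt.day_name() is an assumed completion)
df1['day'] = df1['Date'].dt.day_name()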

Unnamed: 0,Date,Rented_Bike_Count,Hour,Temp,Humidity,Wind_speed,Visibility,Dew_point_temperature,Solar_Radiation,Rainfall,Snowfall,Seasons,Holiday,Functioning_Day,month
8755,2018-11-30,1003,19,4.2,34,2.6,1894,-10.3,0.0,0.0,0.0,Autumn,No Holiday,Yes,11
8756,2018-11-30,764,20,3.4,37,2.3,2000,-9.9,0.0,0.0,0.0,Autumn,No Holiday,Yes,11
8757,2018-11-30,694,21,2.6,39,0.3,1968,-9.9,0.0,0.0,0.0,Autumn,No Holiday,Yes,11
8758,2018-11-30,712,22,2.1,41,1.0,1859,-9.8,0.0,0.0,0.0,Autumn,No Holiday,Yes,11
8759,2018-11-30,584,23,1.9,43,1.3,1909,-9.3,0.0,0.0,0.0,Autumn,No Holiday,Yes,11
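As a rough sketch of the month and day-name extraction, the snippet below derives both from a datetime column using the `.dt` accessor. The two dates are a hypothetical sample, and the month abbreviations assume the default English locale.

```python
import calendar
import pandas as pd

# Hypothetical sample of already-parsed dates.
dates = pd.Series(pd.to_datetime(["2017-12-01", "2018-11-25"]))

# Month number mapped to its abbreviation (e.g. 12 becomes "Dec").
month_abbr = dates.dt.month.map(lambda m: calendar.month_abbr[m])

# Name of the weekday for each date.
day_name = dates.dt.day_name()

print(month_abbr.tolist(), day_name.tolist())
```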
