## Context

In this project, we combine historical usage patterns with weather data to forecast bike rental demand for the Capital Bikeshare program in Washington, D.C. The data covers the years 2011 and 2012 and was provided by Hadi Fanaee Tork, using data from Capital Bikeshare [2].

![image.png](attachment:image.png)

# Data Exploration


## The Dataset

To avoid problems later when predicting rental counts, we first familiarise ourselves with the dataset, starting with a simple analysis and exploring the details step by step.

In [1]:
%load_ext autoreload
%autoreload 2
import warnings
warnings.filterwarnings("ignore")
import math
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "svg"
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_log_error, mean_squared_error, r2_score, mean_absolute_error
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

In [2]:
df = pd.read_csv("BikeRental.csv")
df.head()

Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,2011-01-01,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0.0,3,13,16
1,2,2011-01-01,1,0,1,1,0,6,0,1,0.22,0.2727,0.8,0.0,8,32,40
2,3,2011-01-01,1,0,1,2,0,6,0,1,0.22,0.2727,0.8,0.0,5,27,32
3,4,2011-01-01,1,0,1,3,0,6,0,1,0.24,0.2879,0.75,0.0,3,10,13
4,5,2011-01-01,1,0,1,4,0,6,0,1,0.24,0.2879,0.75,0.0,0,1,1


In [3]:
df.shape

(17379, 17)

We have 17379 rows of data, with 17 columns. 

In [4]:
df.columns

Index(['instant', 'dteday', 'season', 'yr', 'mnth', 'hr', 'holiday', 'weekday',
       'workingday', 'weathersit', 'temp', 'atemp', 'hum', 'windspeed',
       'casual', 'registered', 'cnt'],
      dtype='object')

We observe the following from the data:

- All columns except `dteday` appear to contain numerical values; `dteday` holds dates.
- The weather-related values (`temp`, `atemp`, `hum`, `windspeed`) appear to be scaled, as they all lie between 0 and 1.
- The values in the `instant` column are sequential and match the row number.
- `cnt` (the total number of rentals) is our dependent variable. It is a numeric count that varies over a wide range, so we treat this as a regression problem.
- All independent variables are numeric, except `dteday`.
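
These first impressions can be checked directly instead of being read off a few rows. A minimal sketch, using the first four rows of the `df.head()` output above as a small stand-in frame (`toy` and the result names are hypothetical; on the real data the same expressions would be applied to `df`):

```python
import numpy as np
import pandas as pd

# Stand-in for df: first four rows of the head() output, selected columns only
toy = pd.DataFrame({
    "instant": [1, 2, 3, 4],
    "temp": [0.24, 0.22, 0.22, 0.24],
    "hum": [0.81, 0.80, 0.80, 0.75],
    "windspeed": [0.0, 0.0, 0.0, 0.0],
})

# Is `instant` simply the 1-based row number?
instant_is_row_number = (toy["instant"] == np.arange(1, len(toy) + 1)).all()

# Do the weather columns stay inside [0, 1], as scaled values would?
scaled_cols = ["temp", "hum", "windspeed"]
within_unit_interval = toy[scaled_cols].apply(lambda col: col.between(0, 1).all())

print(instant_is_row_number)
print(within_unit_interval)
```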

Next, we look at the structure of the dataset in more detail.

## Searching for Missing Values

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17379 entries, 0 to 17378
Data columns (total 17 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   instant     17379 non-null  int64  
 1   dteday      17379 non-null  object 
 2   season      17379 non-null  int64  
 3   yr          17379 non-null  int64  
 4   mnth        17379 non-null  int64  
 5   hr          17379 non-null  int64  
 6   holiday     17379 non-null  int64  
 7   weekday     17379 non-null  int64  
 8   workingday  17379 non-null  int64  
 9   weathersit  17379 non-null  int64  
 10  temp        17379 non-null  float64
 11  atemp       17379 non-null  float64
 12  hum         17379 non-null  float64
 13  windspeed   17379 non-null  float64
 14  casual      17379 non-null  int64  
 15  registered  17379 non-null  int64  
 16  cnt         17379 non-null  int64  
dtypes: float64(4), int64(12), object(1)
memory usage: 2.3+ MB


This confirms that all columns are numeric except `dteday`, which is stored with dtype `object`.

In [6]:
df.isna().sum()

instant       0
dteday        0
season        0
yr            0
mnth          0
hr            0
holiday       0
weekday       0
workingday    0
weathersit    0
temp          0
atemp         0
hum           0
windspeed     0
casual        0
registered    0
cnt           0
dtype: int64

None of the columns contain NaN values, which reduces the effort needed to clean the data.
Missing data can also show up as missing rows, so we plot the distribution of the number of hourly records (`instant`) per day (`dteday`).

In [None]:
# Count how many hourly records exist for each day
hours_per_day = df.groupby('dteday')['instant'].count().reset_index()
fig = px.histogram(hours_per_day, x="instant")
fig.show()

The histogram shows that most days have data for all 24 hours, but a few days are missing some hours. Missing hours can mean one of two things: (1) the data is genuinely missing, in which case we need to correct it, or (2) no bikes were rented during those hours, so the total count was 0 and no row was recorded. In the second case no correction is needed, so we check whether the data contains any explicit 0 counts.

In [None]:
# Number of rows with an explicit rental count of 0
(df['cnt'] == 0).sum()

`cnt` contains no 0 values, so hours without rentals were evidently never recorded as rows. We therefore assume we are dealing with case 2: hours absent from the dataset indicate no rentals during that hour.
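
If a later step needs one row per hour, the assumption that an absent hour means zero rentals can be made explicit. A minimal sketch on a three-row frame with invented values (`toy`, `absent` and `filled` are hypothetical names), not a step of the original analysis:

```python
import pandas as pd

# Toy stand-in for df (made-up values): hours 2 and 3 are absent
toy = pd.DataFrame({
    "dteday": ["2011-01-02"] * 3,
    "hr": [0, 1, 4],
    "cnt": [17, 17, 3],
})

# One timestamp per record, built from the date and the hour of day
timestamp = pd.to_datetime(toy["dteday"]) + pd.to_timedelta(toy["hr"], unit="h")
hourly = toy.set_index(timestamp)["cnt"]

# Complete hourly range between the first and the last record
full_range = pd.date_range(hourly.index.min(), hourly.index.max(),
                           freq=pd.Timedelta(hours=1))

# Hours that never appear in the data
absent = full_range.difference(hourly.index)

# Make the "absent hour = no rentals" assumption explicit
filled = hourly.reindex(full_range, fill_value=0)

print(absent)
print(filled)
```

On the real data, `toy` would be replaced by `df`; whether filling with 0 is appropriate depends on the assumption above actually holding.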

## Statistics

We now compute summary statistics for each column.

In [None]:
df.describe()

Looking at these values and going back to our first impressions of the dataset, we can now observe and deduce the following:

- The `instant` column ranges from 1 to 17379, which equals the number of rows in the dataset. This column is used as an incremental index.
- The `season` column contains integers between 1 and 4 with a mean of ~2.5, so we assume the seasons are represented roughly equally.
- The `yr` column encodes two years as either 0 or 1 with a mean above 0.5, meaning one year is represented more often in the data than the other.
- The `mnth` column ranges from 1 to 12, encoding each month as an integer.
- `hr` encodes the hour of the day as an integer between 0 and 23.
- `holiday` is a binary encoding using 0 and 1 with a very small mean, so holidays are rare.
- `weekday` ranges from 0 to 6, encoding each day of the week.
- `workingday` is, just like `holiday`, a binary encoding using 0 and 1.
- `weathersit` is an integer column with values between 1 and 4.
- `temp` starts at 0.02 and goes all the way up to 1.
- `atemp` is very close to `temp` and might be dropped from further analysis.
- `hum` ranges from 0 to 1.
- `windspeed` starts at 0 but only goes up to 0.85.
- The numbers for `casual` users are much lower than those for `registered` users, suggesting that the two groups may need to be reviewed separately.
- The *maximum* number of bikes rented in one hour is 977.

## Exploring the features of various discrete variables 

In [None]:
# factorplot was renamed to catplot (and `size` to `height`) in newer seaborn versions
sns.catplot(x='season', data=df, kind='count', height=5, aspect=1.25, palette='muted')

In [None]:
sns.catplot(x='holiday', data=df, kind='count', height=5, aspect=1.25, palette='muted')

Most records fall on non-holidays, so we expect the bulk of bike rentals to happen on working days. Let's plot the distribution of `workingday` to check.

In [None]:
sns.catplot(x='workingday', data=df, kind='count', height=5, aspect=1.25, palette='muted')

In [None]:
sns.catplot(x='weathersit', data=df, kind='count', height=5, aspect=1.25, palette='muted')

The `weathersit` categories are:

- 1: clear or partly cloudy weather
- 2: mist + cloudy weather
- 3: light snow, light rain
- 4: heavy rain, snow + fog

In [None]:
fig = px.scatter(df, x="hr", y="cnt",
                 labels={
                     "hr": "Hour",
                     "cnt": "Count",
                 },)
fig.update_layout(xaxis={"dtick":1},margin={"t":0,"b":0},height=500)
fig.show()

In [None]:
fig = px.scatter(df, x="mnth", y="cnt",
                 labels={
                     "mnth": "Month",
                     "cnt": "Count",
                 },)
fig.update_layout(xaxis={"dtick":1},margin={"t":0,"b":0},height=500)
fig.show()

In [None]:
fig = px.scatter(df, x="yr", y="cnt",
                 labels={
                     "yr": "Year",
                     "cnt": "Count",
                 },)
fig.update_layout(xaxis={"dtick":1},margin={"t":0,"b":0},height=500)
fig.show()

In [None]:
fig = px.scatter(df, x="weekday", y="cnt",
                 labels={
                     "weekday": "Day",
                     "cnt": "Count",
                 },)
fig.update_layout(xaxis={"dtick":1},margin={"t":0,"b":0},height=500)
fig.show()

From the above plots, we observe the following:
- The data is distributed almost equally across all seasons.
- Most records fall on non-holidays, so we expect the bulk of bike rentals to happen on working days.
- Most rentals are recorded in clear weather, compared with the other weather types (as expected).
- Demand peaks in the morning around 7-8 AM and in the evening around 5-6 PM, probably due to office hours.
- Demand is highest in the 9th month (September).
- Demand increased from the first year to the second.
- Demand is higher on weekdays than on weekends or holidays.
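
Count plots show how many hourly records fall into each category, not how many bikes were rented. To compare demand itself, `cnt` can be averaged per group. A minimal sketch on made-up numbers (`toy` is a hypothetical stand-in for `df`, and the values are invented for illustration):

```python
import pandas as pd

# Toy stand-in for df (made-up values)
toy = pd.DataFrame({
    "workingday": [0, 0, 1, 1, 1, 1],
    "hr":         [8, 17, 8, 17, 8, 17],
    "cnt":        [20, 40, 300, 350, 320, 330],
})

# Average hourly rentals on non-working (0) vs working (1) days
mean_by_workingday = toy.groupby("workingday")["cnt"].mean()

# Average rentals per hour of day, split by working day
mean_by_hour = toy.pivot_table(index="hr", columns="workingday",
                               values="cnt", aggfunc="mean")

print(mean_by_workingday)
print(mean_by_hour)
```

Run against the real `df`, the same two aggregations would show whether the working-day and rush-hour patterns read off the plots hold up in the averages.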

## Checking correlation between variables

In [None]:
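# numeric_only=True skips the non-numeric dteday column (required in recent pandas)
fig = px.imshow(df.corr(numeric_only=True), text_auto=True)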
fig.update_layout(
    autosize=False,
    width=1000,
    height=1000,)
fig.show()

Observations:

1. The vast majority of variables in the dataset are uncorrelated or only very weakly correlated.
2. There are many more strong positive correlations than strong negative ones.
3. Actual temperature (`temp`) and felt temperature (`atemp`) are strongly correlated (0.99), and both are positively correlated with `cnt`.
4. `hum` is negatively correlated with `casual` and `registered`, which suggests that people are less likely to rent bikes in humid weather.
5. `holiday` and `cnt` are negatively correlated, in line with the plot above.
6. `registered` is highly correlated with `cnt`, which suggests that most people who rent bikes are registered users (rather than casual users).
7. `weathersit` and `cnt` are negatively correlated, which suggests that as the weather gets harsher (from category 1 to 4), people are less likely to rent a bike.
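
Reading individual cells off a large heatmap is error-prone. A sorted list of each variable's correlation with `cnt` makes the sign and strength easier to compare. A minimal sketch on invented values (`toy` and `corr_with_cnt` are hypothetical names; on the real data `toy` would be `df`):

```python
import pandas as pd

# Toy stand-in for df (made-up values)
toy = pd.DataFrame({
    "temp": [0.2, 0.4, 0.6, 0.8, 0.7],
    "hum":  [0.9, 0.7, 0.5, 0.3, 0.4],
    "cnt":  [10, 40, 90, 160, 120],
})

# Pearson correlation of every numeric column with the target,
# sorted from most negative to most positive
corr_with_cnt = (
    toy.corr(numeric_only=True)["cnt"]
    .drop("cnt")
    .sort_values()
)

print(corr_with_cnt)
```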

## Data preparation using feature engineering

We one-hot encode the categorical variables `season` and `weathersit`.

In [None]:
df = pd.get_dummies(df, prefix=['season', 'weathersit'], columns=['season', 'weathersit'])
df.head()
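
To see what this encoding produces, here is a minimal sketch on a four-row toy frame (made-up values; `toy` and `encoded` are hypothetical names). Each category that occurs in a column becomes its own indicator column, named with the given prefix:

```python
import pandas as pd

# Toy stand-in for df (made-up values)
toy = pd.DataFrame({
    "season": [1, 2, 3, 4],
    "weathersit": [1, 1, 2, 3],
    "cnt": [16, 40, 32, 13],
})

# Same call as above, applied to the toy frame
encoded = pd.get_dummies(toy, prefix=["season", "weathersit"],
                         columns=["season", "weathersit"])
new_columns = list(encoded.columns)

print(new_columns)
```

There is no `weathersit_4` column here only because no toy row has that category; columns are created for the categories that actually occur.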

We drop the `dteday` column because most of the time features derived from it (`yr`, `mnth`, `hr`, `weekday`) are already in the dataset, so it adds little to our analysis.

In [None]:
df.drop(['dteday'], axis=1, inplace=True)
df.head()

## Correlation (New Features)

Checking correlation between variables after adding/removing some features.

In [None]:
fig = px.imshow(df.corr(), text_auto=True)
fig.update_layout(
    autosize=True,
    width=1000,
    height=1000,)
fig.show()

## Data Modelling

With data exploration and preparation done, we move on to modelling.

Separate the independent variables (`x`) from the dependent variable (`y`).

In [None]:
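# Target: total rentals per hour
y = df['cnt'].values
# Features: everything except the target and its two components;
# casual + registered = cnt, so keeping them would leak the target
x = df.drop(columns=['cnt', 'casual', 'registered']).astype(float).values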
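
One thing worth verifying before fitting: in the rows printed earlier, `casual` and `registered` add up to `cnt`. Any feature matrix that contains both columns lets a model reconstruct the target exactly, which makes the test error meaninglessly low. A minimal sketch of the check, using the five rows from the `df.head()` output (`sample` is a hypothetical name; on the full data the same comparison would be run on `df`):

```python
import pandas as pd

# First five rows from the df.head() output shown earlier
sample = pd.DataFrame({
    "casual": [3, 8, 5, 3, 0],
    "registered": [13, 32, 27, 10, 1],
    "cnt": [16, 40, 32, 13, 1],
})

# True if the two user groups always sum to the total count
components_sum_to_target = (
    sample["casual"] + sample["registered"] == sample["cnt"]
).all()

print(components_sum_to_target)
```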

Split the data into training and test set.

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 40)

Using Linear Regression

In [None]:
model = LinearRegression()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("The root mean squared error (RMSE) on test set: {:.5f}".format(rmse))

Using Random Forest Regressor

In [None]:
model_rf = RandomForestRegressor()
model_rf.fit(x_train, y_train)
y_pred = model_rf.predict(x_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("The root mean squared error (RMSE) on test set: {:.5f}".format(rmse))
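
An RMSE is easier to judge next to a reference point. The sketch below computes RMSE by hand and compares it with a naive baseline that always predicts the mean. `y_true` reuses the five `cnt` values from the `df.head()` output, while `y_hat` is a set of hypothetical predictions, not output from either model above:

```python
import numpy as np

y_true = np.array([16, 40, 32, 13, 1])   # cnt values from the head() output
y_hat = np.array([20, 35, 30, 10, 5])    # hypothetical predictions

# RMSE: square root of the mean squared difference
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))

# Naive baseline: always predict the mean of the observed counts
baseline = np.full_like(y_true, y_true.mean(), dtype=float)
baseline_rmse = np.sqrt(np.mean((y_true - baseline) ** 2))

print("model RMSE: {:.5f}, baseline RMSE: {:.5f}".format(rmse, baseline_rmse))
```

Applied to the notebook, the baseline would use the mean of `y_train` and be scored on `y_test`, giving both models a common yardstick.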

# Bibliography

1. https://www.kaggle.com/c/bike-sharing-demand
2. https://www.capitalbikeshare.com/system-data
3. https://scikit-learn.org/stable/supervised_learning.html