# Bike Sharing Demand Analysis

The data comes from a bike sharing system. It covers the first 19 days of each month over a period of two years, and the task is to predict demand for the remaining days of each month (roughly 10 days).

Let's see what's inside the data. As a first approach, I would expect weekday peak hours to come from people commuting to work, while weekend demand should depend heavily on the weather.

In [None]:
#Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [None]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"


In [None]:
#a second sns.set call overrides the first, so only the whitegrid style is kept
sns.set(style="whitegrid", color_codes=True)

# Import Datasets:

In [None]:
train=pd.read_csv('/kaggle/input/bike-sharing-demand/train.csv')
test=pd.read_csv('/kaggle/input/bike-sharing-demand/test.csv')
print('train shape:',train.shape)
print('test shape:',test.shape)

In [None]:
train.head()

In [None]:
test.head()

# About Data:

Data provided is about hourly rental data spanning two years. For this competition, the training set is comprised of the first 19 days of each month, while the test set is the 20th to the end of the month. The goal is to predict the total count of bikes rented during each hour covered by the test set, using only information available prior to the rental period.
**Data Fields:**

**datetime:** hourly date + timestamp     
**season:**  1 = spring, 2 = summer, 3 = fall, 4 = winter    
**holiday:** whether the day is considered a holiday  
**workingday:** whether the day is neither a weekend nor holiday   
**weather:**  
1: Clear, Few clouds, Partly cloudy  
2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist  
3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds  
4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog  
**temp:** temperature in Celsius   
**atemp:** "feels like" temperature in Celsius   
**humidity:** relative humidity   
**windspeed:** wind speed   
**casual:** number of non-registered user rentals initiated   
**registered:** number of registered user rentals initiated   
**count:** number of total rentals   
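
Since `count` is described as the total number of rentals, it should equal `casual + registered` on every row. A quick sanity check for that could look like the sketch below. It runs on a small made-up frame (the values are illustrative, not taken from the dataset); on the real data the same expression would be applied to `train`.

```python
import pandas as pd

# Toy rows with the same column names as the training data (values are made up).
toy = pd.DataFrame({
    "casual": [3, 8, 5],
    "registered": [13, 32, 27],
    "count": [16, 40, 32],
})

# True only when the total matches the sum of both user types on every row.
totals_match = (toy["casual"] + toy["registered"]).eq(toy["count"]).all()
print(totals_match)
```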

In [None]:
#check for null data
train.isnull().sum()

In [None]:
import missingno as msno

fig,ax=plt.subplots(2,1,figsize=(10,5))

msno.matrix(train,ax=ax[0])
ax[0].set_title('Train Data')
msno.matrix(test,ax=ax[1])
ax[1].set_title('Test Data')

In [None]:
#variable datatype:
train.info()

No null data! So let's change some formats and start the EDA.

## Feature Engineering:

In [None]:
from datetime import datetime
from dateutil import parser
import calendar

#parse string datetime into datetime format (vectorized)
train['datetime2']=pd.to_datetime(train['datetime'])

#Get some different time variables through the .dt accessor
train['year']=train['datetime2'].dt.year
train['month']=train['datetime2'].dt.month
train['weekday']=train['datetime2'].dt.weekday
train['weekday_name']=train['datetime2'].dt.day_name()
train['hour']=train['datetime2'].dt.hour


In [None]:
#create categorical data
train['season_decode']=train.season.map({1:'spring',2:'summer',3:'fall',4:'winter'})
train['working_decode']=train.workingday.map({1:'work',0:'notwork'})
train['weather_decode']=train.weather.map({1:'Clear',2:'Mist',3:'LightRain',4:'HeavyRain'})

In [None]:
train.head()

# Outliers Analysis

In [None]:
f,ax=plt.subplots(1,2)
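#histplot replaces the deprecated distplot; stat='density' keeps the same y-axis scale
sns.histplot(train['count'],bins=30,kde=True,stat='density',ax=ax[0])
ax[0].set_title('count distribution')
sns.boxplot(data=train,y='count',ax=ax[1])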
ax[1].set_title('count boxplot')

The distribution is right-skewed. Let's see how many observations fall outside three standard deviations from the mean:

In [None]:
mean_count=train['count'].mean()
std_count=train['count'].std()
print(mean_count-3*std_count)
print(mean_count+3*std_count)
outliers1=train[train['count']>(mean_count+3*std_count)]
len(outliers1['count'])

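There are 147 rows where the count lies more than three standard deviations above the mean. Under a normal distribution that range would cover about 99.7% of the values; the count is skewed, so this is only a rough rule, but I will drop these rows for the next steps.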

In [None]:
train2=train[train['count']<=(mean_count+3*std_count)]
train2.shape

This outlier analysis is very basic. To get a better prediction, outliers should be identified by combining different features, for example relative to the typical demand for a given hour and type of day.
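
One way to combine features, sketched under assumptions: compute the mean and standard deviation of `count` within each (`workingday`, `hour`) group and flag rows that sit more than three standard deviations above their own group's mean. The helper name `flag_group_outliers` and the choice of grouping columns are hypothetical, and the demo frame is synthetic.

```python
import numpy as np
import pandas as pd

def flag_group_outliers(df, group_cols, target="count", n_std=3.0):
    """Return a boolean Series: True where target exceeds its group's mean + n_std * std."""
    grouped = df.groupby(group_cols)[target]
    group_mean = grouped.transform("mean")
    group_std = grouped.transform("std").fillna(0.0)  # single-row groups have no std
    return df[target] > group_mean + n_std * group_std

# Synthetic demo: 2 day types x 24 hours x 60 days with Poisson-like demand.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "workingday": np.repeat([0, 1], 24 * 60),
    "hour": np.tile(np.arange(24), 2 * 60),
})
demo["count"] = rng.poisson(lam=(20 + 10 * demo["hour"]).to_numpy())
demo.loc[0, "count"] = 10_000  # plant one obvious outlier

is_outlier = flag_group_outliers(demo, ["workingday", "hour"])
cleaned = demo[~is_outlier]
print(is_outlier.sum(), cleaned.shape)
```

Compared with a single global threshold, this keeps legitimate rush-hour peaks and can still catch unusual values at quiet hours.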

# EDA: 

In [None]:
#Season
sns.boxplot(data=train2,x='season_decode',y='count').set_title('Demand by season')

Strangely, spring looks like the season with the fewest bikers.
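
Before reading too much into this, it may be worth checking which calendar months each `season` code actually covers. The sketch below defines a small hypothetical helper, `months_per_season`, and runs it on a toy frame. The quarter-based mapping in the toy data is an assumption for illustration only.

```python
import pandas as pd

def months_per_season(df):
    """List the distinct months observed for each season code."""
    return df.groupby("season")["month"].unique().apply(sorted)

# Toy frame: the quarter-based mapping below is an assumption for illustration.
toy = pd.DataFrame({"month": list(range(1, 13))})
toy["season"] = (toy["month"] - 1) // 3 + 1

print(months_per_season(toy))
```

On the real data this would be `months_per_season(train2)`. If code 1 turned out to cover January to March, the low "spring" demand would mostly reflect winter weather.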

In [None]:
#Year

train2.groupby(['year','month'])['count'].mean().plot().set_title('demand by year')


Demand is increasing over time, so year, month and season are important features, all the more so given that we have to predict the end of each month.

In [None]:
#WeekDay & Hour:
week_hour=train2.groupby(['weekday_name','hour'])['count'].mean().unstack()
week_hour=week_hour.reindex(index=['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'])


plt.figure(figsize=(15,6))
cmap2 = sns.cubehelix_palette(start=2,light=1, as_cmap=True)

sns.heatmap(week_hour,cmap=cmap2).set_title('Demand by Day-Hour')

Busiest hours during the week: 8 and 17-18 --> work commuters!  
Busiest hours during the weekend: 13-16  

In [None]:
#Difference between casual and registered
#(selecting several columns from a groupby needs a list, hence the double brackets)
train2.groupby(['hour'])[['casual','registered','count']].mean().plot().set_title('Demand by hour')


In [None]:

train2.groupby(['weekday_name'])[['casual','registered','count']].mean().plot(kind='bar').set_title('demand by day of week')


* Casual demand increases during the weekend, while registered demand is driven by commuting to work.
* Registered demand accounts for most of the overall demand.

In [None]:
#Weather
train2.groupby(['weather_decode'])[['casual','registered']].mean().plot(kind='bar').set_title('demand by weather')

In [None]:
#Temp
season_temp=train2.groupby(['season_decode','temp'])['count'].mean().unstack()


plt.figure(figsize=(15,8))
cmap3 = sns.cubehelix_palette(start=6,light=1, as_cmap=True)

sns.heatmap(season_temp,cmap=cmap3).set_title('demand by season and temperature')

# Correlation & Choosing Variables:

In [None]:
Correlation_Matrix=train2[['holiday','workingday','weather','temp','atemp','humidity','windspeed','casual','registered','count']].corr()
#boolean mask that hides the upper triangle (above the diagonal) so each pair is shown once
mask = np.triu(np.ones_like(Correlation_Matrix, dtype=bool), k=1)
fig,ax= plt.subplots()
fig.set_size_inches(20,10)
sns.heatmap(Correlation_Matrix,mask=mask,vmax=.8,annot=True,square=True)

### Run a random forest to select features and understand the importance of each

In [None]:
#preparing data sets for random forest
X=train2[['season','holiday','workingday','weather','temp','atemp','humidity','windspeed','year','month','weekday','hour']]

y_count=train2['count']
y_casual=train2['casual']
y_reg=train2['registered']

In [None]:
from sklearn.preprocessing import StandardScaler

#Scaled all distributions
X_Scaled=StandardScaler().fit_transform(X=X)

In [None]:
from sklearn.model_selection import train_test_split
#Split for train-test
X_train, X_test, y_train, y_test = train_test_split(X_Scaled, y_count, test_size=0.25, random_state=42)


In [None]:
from sklearn.ensemble import RandomForestRegressor

rf_count=RandomForestRegressor()
rf_count.fit(X_train,y_train)

importance_count=pd.DataFrame(rf_count.feature_importances_ , index=X.columns, columns=['count']).sort_values(by='count',ascending=False)



In [None]:
importance_count.plot(kind='bar',color='r').set_title('Importance of features for total demand')

In [None]:
#repeat for casual demand:

X_train, X_test, y_train, y_test = train_test_split(X_Scaled, y_casual, test_size=0.25, random_state=42)

rf_casual=RandomForestRegressor()
rf_casual.fit(X_train,y_train)

importance_casual=pd.DataFrame(rf_casual.feature_importances_ , index=X.columns, columns=['casual']).sort_values(by='casual',ascending=False)


In [None]:
importance_casual.plot(kind='bar').set_title('Importance of features for casual demand')

In [None]:
#repeat for registered demand:

X_train, X_test, y_train, y_test = train_test_split(X_Scaled, y_reg, test_size=0.25, random_state=42)

rf_reg=RandomForestRegressor()
rf_reg.fit(X_train,y_train)

importance_reg=pd.DataFrame(rf_reg.feature_importances_ , index=X.columns, columns=['reg']).sort_values(by='reg',ascending=False)


In [None]:
importance_reg.plot(kind='bar',color='g').set_title('Importance of features for registered demand')

In [None]:
importance_df=pd.concat([importance_count,importance_casual,importance_reg],axis=1)
importance_df.plot(kind='bar').set_title('Feature importance for each kind of demand')

#### Preliminary Conclusions:

- **Hour** is the most important feature for predicting demand, for both casual and registered bikers.
- Registered bikers drive most of the bike demand.
- For **registered bikers**, demand peaks just before and after work hours, so **commuters** are the key users.
- The overall number of bike users is increasing year over year. This trend matters because we have to predict demand for the last days of each month.
- For **casual bikers**, **atemp** (the "feels like" temperature) and non-working days are also very important, which makes sense.

Since registered bikers account for most of the total demand, and casual bikers mostly ride on non-working days, I will proceed with a single shared model for both behaviors; the two groups do not seem to be competing for the 'same' bikes.

# Preparing Train/Test set and final feature selection: 

In [None]:
feature_selection=['workingday','weather','atemp','humidity','windspeed','year','month','weekday','hour']
print('features for model:',len(feature_selection))

In [None]:
#Prepare Training data
X_train=train2[feature_selection]
print(X_train.shape)

y_train=train2['count']
print(y_train.shape)

In [None]:
#Prepare Test data

#parse string datetime into datetime format (vectorized)
test['datetime2']=pd.to_datetime(test['datetime'])

#Get some different time variables through the .dt accessor
test['year']=test['datetime2'].dt.year
test['month']=test['datetime2'].dt.month
test['weekday']=test['datetime2'].dt.weekday
test['hour']=test['datetime2'].dt.hour

X_test=test[feature_selection]
print(X_test.shape)

In [None]:
X_train_scaled=StandardScaler().fit_transform(X=X_train)
X_test_scaled=StandardScaler().fit_transform(X=X_test)
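
Note that the cell above fits a second `StandardScaler` on the test set, so the two sets are standardized with different means and variances. A common alternative is to fit the scaler once on the training features and reuse it for the test features. Here is a minimal sketch on synthetic arrays (the values and the `_demo` names are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_train_demo = rng.normal(loc=20.0, scale=5.0, size=(200, 3))
X_test_demo = rng.normal(loc=25.0, scale=5.0, size=(50, 3))  # shifted on purpose

scaler = StandardScaler().fit(X_train_demo)          # statistics come from train only
X_train_demo_scaled = scaler.transform(X_train_demo)
X_test_demo_scaled = scaler.transform(X_test_demo)   # same mean/std applied to test

# Train is centered at 0; test keeps its real offset relative to train.
print(X_train_demo_scaled.mean(axis=0).round(3))
print(X_test_demo_scaled.mean(axis=0).round(3))
```

Tree-based models such as random forests split on thresholds, so they are largely insensitive to this kind of rescaling as long as train and test share the same transform.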

As we saw before, there are some outliers. I have cleaned the data, but not thoroughly, so I will use RMSLE as the evaluation metric: it works on the logarithm of the values, so it measures relative rather than absolute error and does not heavily penalize large differences when both the actual and the predicted values are large.
   

For more: https://www.kaggle.com/c/ashrae-energy-prediction/discussion/113064
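
To make the difference concrete, here is a small sketch comparing RMSE and RMSLE on two made-up predictions that are both off by 20%: one at low demand and one at peak demand. RMSLE is computed here as the root mean squared difference of `log(1 + y)`, which matches the definition behind `mean_squared_log_error`. The helper names `rmse` and `rmsle_np` are hypothetical.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error on the raw values."""
    return float(np.sqrt(np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2)))

def rmsle_np(y_true, y_pred):
    """RMSLE via log1p, i.e. sqrt(mean((log(1 + pred) - log(1 + true))^2))."""
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

low_true, low_pred = [10], [12]        # off by 2 bikes (20%)
peak_true, peak_pred = [500], [600]    # off by 100 bikes (20%)

print(rmse(low_true, low_pred), rmse(peak_true, peak_pred))
print(rmsle_np(low_true, low_pred), rmsle_np(peak_true, peak_pred))
```

The RMSE of the peak-hour miss is 50 times that of the low-demand miss, while the two RMSLE values are of similar size, because the log turns both into roughly the same relative error.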

In [None]:
from sklearn.metrics import mean_squared_log_error
from sklearn.metrics import make_scorer


def rmsle(y,y_pred):
    return np.sqrt(mean_squared_log_error(y,y_pred))
    
rmsle_score=make_scorer(rmsle)

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rfr=RandomForestRegressor(random_state=42)

score=cross_val_score(rfr,X_train_scaled,y_train,cv=15,scoring=rmsle_score)

print(f'Score rmsle mean: {np.round(score.mean(),4)}')
print(f'Score  rmsle std: {np.round(score.std(),4)}')

In [None]:
rfr.fit(X_train_scaled,y_train)
y_pred=rfr.predict(X_test_scaled)

In [None]:
submission=pd.read_csv('/kaggle/input/bike-sharing-demand/sampleSubmission.csv')
submission['count']=y_pred
submission.to_csv('submissionI.csv',index=False)

**Score for Submission I = 0.48816**

In [None]:
#Without Scaling Data

rfr.fit(X_train,y_train)
y_pred=rfr.predict(X_test)
submission2=pd.read_csv('/kaggle/input/bike-sharing-demand/sampleSubmission.csv')
submission2['count']=y_pred
submission2.to_csv('submissionII.csv',index=False)               

**Score for Submission II = 0.48682**

Let's tune the model:

In [None]:
from sklearn.model_selection import GridSearchCV, train_test_split


x_train2,x_test2,y_train2,y_test2=train_test_split(X_train,y_train,test_size=0.25,random_state=42)

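#max_features=1.0 uses all features, which is what 'auto' meant for regressors
#(the 'auto' option is no longer accepted by recent scikit-learn versions)
params={'n_estimators': [10,50,100,300,500],
       'n_jobs':[-1],
       'max_features':[1.0,'sqrt','log2'],
       'random_state':[42]}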

rfr_tuned=GridSearchCV(estimator=RandomForestRegressor(),param_grid=params,scoring='neg_mean_squared_log_error',verbose=True)

rfr_tuned.fit(x_train2,y_train2)
print(rfr_tuned.best_params_)
print(rfr_tuned.best_estimator_)
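
`GridSearchCV` maximizes its score, so with `scoring='neg_mean_squared_log_error'` the value stored in `best_score_` is the negated mean squared log error, averaged over the folds. To put it on roughly the same scale as the RMSLE values reported earlier, negate it and take the square root. Here is a small sketch on synthetic data (the data, the tiny grid and the `_demo` names are assumptions for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 24, size=(300, 2))
# Non-negative target, as required by the squared log error.
y_demo = np.round(50 + 10 * X_demo[:, 0] + rng.uniform(0, 5, size=300))

search_demo = GridSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_grid={"n_estimators": [10, 30]},
    scoring="neg_mean_squared_log_error",
    cv=3,
)
search_demo.fit(X_demo, y_demo)

best_rmsle = np.sqrt(-search_demo.best_score_)  # back to the RMSLE scale
print(search_demo.best_params_, round(float(best_rmsle), 4))
```

The square root of the averaged MSLE is not exactly the average of the per-fold RMSLE values, so the comparison with `cross_val_score` results is approximate.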



In [None]:
from sklearn.ensemble import RandomForestRegressor

#tuned settings; every other parameter is left at its default
#(criterion='mse', max_features='auto' and min_impurity_split are no longer accepted by recent scikit-learn)
rfr_final=RandomForestRegressor(n_estimators=500, max_features=1.0,
                      n_jobs=-1, random_state=42)

rfr_final.fit(x_train2,y_train2)
y_pred2=rfr_final.predict(x_test2)
print('RMSLE:',np.round(rmsle(y_test2,y_pred2),4))

In [None]:
#refit the tuned model on the full training data and predict the test set
rfr_final=RandomForestRegressor(n_estimators=500, max_features=1.0,
                      n_jobs=-1, random_state=42)

rfr_final.fit(X_train,y_train)
#predict with the tuned model (rfr_final), not the untuned rfr
y_pred=rfr_final.predict(X_test)
submission3=pd.read_csv('/kaggle/input/bike-sharing-demand/sampleSubmission.csv')
submission3['count']=y_pred
submission3.to_csv('submissionIII.csv',index=False)

In conclusion, a few features are highly important. We can predict overall demand fairly easily without a complex model, but to improve the final result we would need to spend more time working on the outliers.

On the other hand, I could have built two different models (one for casual and one for registered demand) in order to be more precise.
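
A rough sketch of that two-model idea, assuming the same feature matrix is used for both targets: fit one forest on `casual`, one on `registered`, and add the two predictions to get the total. The function name `fit_predict_split_demand` and the synthetic data are hypothetical; this was not run against the competition data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fit_predict_split_demand(X_fit, y_casual, y_registered, X_new, random_state=42):
    """Fit one forest per user type and return the summed prediction."""
    model_casual = RandomForestRegressor(n_estimators=100, random_state=random_state)
    model_registered = RandomForestRegressor(n_estimators=100, random_state=random_state)
    model_casual.fit(X_fit, y_casual)
    model_registered.fit(X_fit, y_registered)
    return model_casual.predict(X_new) + model_registered.predict(X_new)

# Synthetic demo: column 0 plays the role of hour, column 1 of workingday.
rng = np.random.default_rng(1)
X_demo = np.column_stack([rng.integers(0, 24, 400), rng.integers(0, 2, 400)])
casual_demo = rng.poisson(5 + 10 * (1 - X_demo[:, 1]))
registered_demo = rng.poisson(20 + 5 * X_demo[:, 0] * X_demo[:, 1])

total_pred = fit_predict_split_demand(X_demo[:300], casual_demo[:300],
                                      registered_demo[:300], X_demo[300:])
print(total_pred[:5].round(1))
```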

Finally, using RMSLE avoids over-penalizing the difference between actual and predicted values at the demand peaks. I think this is a good strategy when building an algorithm that predicts demand.