## Demand Forecasting
Can you forecast the demand of the car rentals on an hourly basis?


#### Problem Statement
ABC is a car rental company based out of Bangalore. It rents cars for both in-station and out-station trips at affordable prices. Users can rent different types of cars, such as sedans, hatchbacks, SUVs, MUVs and minivans.

Demand for cars has been rising recently. As a result, the company wants to tackle the problem of matching supply to demand. Its ultimate goal is to strike a balance between supply and demand in order to meet user expectations.

The company has collected the details of each rental. Based on the past data, the company would like to forecast the demand of car rentals on an hourly basis. 


#### Objective
The main objective is to develop a machine learning approach that forecasts the demand for car rentals on an hourly basis.


#### Data Dictionary
You are provided with 3 files - train.csv, test.csv and sample_submission.csv

Training set

train.csv contains the hourly demand of car rentals from August 2018 to February 2021.

In [1]:
# Import the required libraries
import plotly.express as px
import numpy as np, pandas as pd
import datetime

import matplotlib.pyplot as plt

from statsmodels.graphics.tsaplots import plot_acf,plot_pacf
from statsmodels.tsa.stattools import adfuller
from sklearn.model_selection import TimeSeriesSplit,GridSearchCV
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.metrics import mean_squared_error

from glmnet import ElasticNet as glm_elastic
from sklearn.linear_model import LinearRegression, ElasticNet
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR

import xgboost as xgb
from fbprophet import Prophet

import math
from keras.models import Sequential
from keras.layers import Dense,LSTM
from keras.callbacks import EarlyStopping
from keras.optimizers import adam_v2



In [2]:
train_df = pd.read_csv("train_E1GspfA.csv")
test_df = pd.read_csv("test_6QvDdzb.csv")
sub_df = pd.read_csv("sample_4E0BhPN.csv")

In [3]:
train_df.head(2)

Unnamed: 0,date,hour,demand
0,2018-08-18,9,91
1,2018-08-18,10,21


### FB Prophet

In [4]:
train_df['date'] = pd.to_datetime(train_df['date'])
test_df['date'] = pd.to_datetime(test_df['date'])

In [None]:
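def add_hour(dt,hr):
    # Shift a date by `hr` hours; int() guards against NumPy integer types,
    # which datetime.timedelta may reject
    return dt  + datetime.timedelta(hours = int(hr))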

In [None]:
train_df['date'] = train_df.apply(lambda row: add_hour(row['date'],row['hour']),axis=1)
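
The row-wise `apply` above can be slow on tens of thousands of rows. A vectorized alternative, sketched here on a tiny frame (the column names match the notebook; the third row is made up for illustration), adds the hour offset to the whole column with `pd.to_timedelta`:

```python
import pandas as pd

# Toy frame with the same columns as train.csv (third row is illustrative).
toy = pd.DataFrame({
    "date": ["2018-08-18", "2018-08-18", "2018-08-19"],
    "hour": [9, 10, 0],
    "demand": [91, 21, 30],
})
toy["date"] = pd.to_datetime(toy["date"])

# Add the hour offset to the whole column at once instead of row by row.
toy["timestamp"] = toy["date"] + pd.to_timedelta(toy["hour"], unit="h")
print(toy[["timestamp", "demand"]])
```

Writing to a separate `timestamp` column, as done here, also keeps the original `date` intact, so re-running the cell does not add the hours a second time.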

In [None]:
fb_train_df = train_df[['date','demand']]
fb_train_df.columns = ['ds','y'] 

In [None]:
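# Hold out the last 1000 rows for validation
# (rmse() is defined in the Feature Engineering section below; run that cell first)
train = fb_train_df.iloc[:-1000]
test = fb_train_df.iloc[-1000:]
fb_m = Prophet()
fb_m.fit(train)

# Predict at the held-out timestamps themselves, so the forecast lines up with
# `test` row by row even if some hours are missing from the data
y_pred = fb_m.predict(test[['ds']])[['yhat_lower', 'yhat','yhat_upper']]

# .values compares by position rather than by index label
print(rmse(test['y'].values,y_pred['yhat'].values.astype('int')))
print("with dec",rmse(test['y'].values,y_pred['yhat'].values))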
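
One thing to watch when scoring a hold-out slice: subtracting two pandas Series aligns them by index label, not by position. The sketch below uses toy numbers, and `rmse_by_position` is a hypothetical helper name, not part of the notebook. It shows how mismatched indices silently produce NaN and how converting to NumPy arrays avoids that.

```python
import numpy as np
import pandas as pd

def rmse_by_position(actual, forecast):
    """RMSE that compares values position by position, ignoring index labels."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return float(np.sqrt(np.mean((forecast - actual) ** 2)))

# The two Series carry different index labels, as can happen when a
# hold-out slice is compared with a freshly built forecast frame.
actual = pd.Series([10.0, 20.0, 30.0], index=[100, 101, 102])
forecast = pd.Series([12.0, 18.0, 33.0], index=[0, 1, 2])

misaligned = forecast - actual            # aligned on labels, so every entry is NaN
score = rmse_by_position(actual, forecast)
print(misaligned.isna().all(), round(score, 4))
```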

In [None]:
# update test_df
test_fb_df = test_df.copy()
test_df['date'] = test_df.apply(lambda row: add_hour(row['date'],
                                                       row['hour']),axis=1)

# apply model and make prediction
full_pred = fb_m.predict(fb_m.make_future_dataframe(
    periods=11000,freq = 'H'))[['ds','yhat_lower', 'yhat','yhat_upper']]

# takeout predicted values and save
demand_pred = full_pred[full_pred['ds'].isin(test_df['date'].values)]['yhat']
sub_df['demand'] = demand_pred.values
sub_df.to_csv('sub_fb_prpht.csv',index= False)
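
The `isin` filter above returns rows in forecast order and silently drops any test timestamp that has no forecast, in which case the assignment to `sub_df['demand']` fails on a length mismatch. A left merge is one way to make that visible. This is a minimal sketch on made-up data; the `forecast` frame only imitates the shape of a prediction output.

```python
import numpy as np
import pandas as pd

# Hypothetical forecast frame, shaped like the output of a model's predict().
forecast = pd.DataFrame({
    "ds": pd.Timestamp("2021-03-01") + pd.to_timedelta(np.arange(6), unit="h"),
    "yhat": [50.0, 40.0, 35.0, 30.0, 45.0, 60.0],
})

# Test timestamps with a gap (02:00 and 03:00 are absent) and one stamp
# that lies outside the forecast horizon.
test_stamps = pd.DataFrame({
    "date": pd.to_datetime([
        "2021-03-01 00:00", "2021-03-01 01:00",
        "2021-03-01 04:00", "2021-03-01 09:00",
    ])
})

# A left merge keeps one row per test timestamp, in test order,
# and leaves NaN where no forecast exists.
aligned = test_stamps.merge(forecast, how="left", left_on="date", right_on="ds")
missing = int(aligned["yhat"].isna().sum())
print(aligned[["date", "yhat"]])
print("timestamps without a forecast:", missing)
```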

In [None]:
fb_m_final = Prophet()
fb_m_final.fit(fb_train_df)

y_pred_full = fb_m_final.predict(fb_m_final.make_future_dataframe(
    periods=10000,freq = 'H'))[['ds','yhat_lower', 'yhat','yhat_upper']]


In [None]:
demand_pred_with_final_model = y_pred_full[y_pred_full['ds'].isin(test_df['date'].values)]['yhat']
sub_df['demand'] = demand_pred_with_final_model.values
sub_df.to_csv('sub_fb_prpht_full.csv',index= False)

In [None]:
demand_pred.values

In [None]:
test_df = test_df.assign(check=test_df['hour'])  # assign() takes keyword arguments

In [None]:
train_df['demand'].describe()

In [None]:
sub_df['demand'].describe()

In [None]:
# NOTE: `tt` is never defined in this notebook; this cell and the next are leftover scratch work
tt['12']

In [None]:
type(tt.copy())

### Data Analysis

In [None]:
# daily demand summing up all the hours in the day
# fig = px.line(train_df.groupby(['date'])['demand'].sum().reset_index(), x="date", y="demand")
# fig.show()

# mean daily demand 
# fig = px.line(train_df.groupby(['date'])['demand'].mean().reset_index(), x="date", y="demand")
# fig.show()


# mean hourly demand
# fig = px.line(train_df.groupby(['hour'])['demand'].mean().reset_index(), x="hour", y="demand")
# fig = px.line(train_df.groupby(['hour'])['demand'].max().reset_index(), x="hour", y="demand")
# fig = px.line(train_df.groupby(['hour'])['demand'].min().reset_index(), x="hour", y="demand")

# fig.show()

# plot_acf(train_df["demand"]);
# plot_pacf(train_df["demand"]);


# print("Observations of Dickey-fuller test")
# dftest = adfuller(train_df['demand'],autolag='AIC')
# dfoutput=pd.Series(dftest[0:4],index=['Test Statistic','p-value','#lags used','number of observations used'])
# for key,value in dftest[4].items():
#     dfoutput['critical value (%s)'%key]= value
# print(dfoutput)
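
The commented-out plots above group demand by calendar date and by hour of day. A minimal sketch of the same aggregations on synthetic data (the demand values are randomly generated, not taken from train.csv) looks like this:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for train_df: 7 days x 24 hours of made-up demand.
days = (pd.Timestamp("2018-08-20") + pd.to_timedelta(np.arange(7), unit="D")).repeat(24)
hours = np.tile(np.arange(24), 7)
toy = pd.DataFrame({
    "date": days,
    "hour": hours,
    "demand": rng.integers(10, 120, size=hours.size),
})

# Mean, min and max demand for each hour of the day.
hourly_profile = toy.groupby("hour")["demand"].agg(["mean", "min", "max"])

# Total demand per calendar day.
daily_total = toy.groupby("date")["demand"].sum()

print(hourly_profile.head())
print(daily_total.head())
```

Note that the daily grouping only works while `date` still holds plain dates. Once the hour has been added to `date`, grouping by `date` yields one group per hour, so `df['date'].dt.date` or `dt.normalize()` would be needed instead.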


### Feature Engineering

In [5]:
def get_date_feats(df):
    df = df.copy()
    df['date'] = pd.to_datetime(df['date'])
    df['month'] = df['date'].dt.month
    df['week'] = df['date'].dt.isocalendar().week.astype(int)  # Series.dt.week is deprecated
    df['year'] = df['date'].dt.year
    df['day_of_week'] = df['date'].dt.dayofweek
    df['day_of_year'] = df['date'].dt.dayofyear
    df['quarter'] = df['date'].dt.quarter
    df['is_weekend'] = np.where(df['day_of_week'].isin([5,6]),1,0)
    df['is_weekday'] = np.where(df['day_of_week'].isin([0,1,2,3,4]),1,0)
    df['days_in_month'] = df['date'].dt.days_in_month
    
    
    return df

def get_lag_feats(df,num_of_lags=4):
    df = df.copy()
    for lag in range(1,num_of_lags+1):
        df["demand_lag_0" + str(lag)] = df['demand'].shift(lag).bfill()  # fillna(method=...) is deprecated
    return df

def feat_engg(df):
    df = df.copy()
    df = get_date_feats(df)
#     df = get_lag_feats(df,4)
    return df

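def rmse(actual,forecast):
    # Root mean squared error, compared by position (index labels are ignored)
    actual, forecast = np.asarray(actual, dtype=float), np.asarray(forecast, dtype=float)
    return np.mean((forecast - actual)**2)**.5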
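
As a quick sanity check of the calendar features, the sketch below recomputes them for the first timestamp of train.csv shown earlier (2018-08-18, hour 9). It is a standalone illustration rather than a call to `get_date_feats`, and it uses `isocalendar()` for the week number.

```python
import numpy as np
import pandas as pd

# One timestamp taken from the first row of train.csv shown earlier.
df = pd.DataFrame({"date": pd.to_datetime(["2018-08-18 09:00"])})

feats = pd.DataFrame({
    "month": df["date"].dt.month,
    # isocalendar() replaces the deprecated Series.dt.week accessor.
    "week": df["date"].dt.isocalendar().week.astype(int),
    "day_of_week": df["date"].dt.dayofweek,      # Monday=0 ... Sunday=6
    "day_of_year": df["date"].dt.dayofyear,
    "quarter": df["date"].dt.quarter,
    "is_weekend": np.where(df["date"].dt.dayofweek.isin([5, 6]), 1, 0),
    "days_in_month": df["date"].dt.days_in_month,
})
print(feats.iloc[0].to_dict())
```

Since `is_weekday` is always `1 - is_weekend`, the two columns carry the same information; dropping one of them may help linear models, though that is a judgment call rather than something the notebook tests.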

In [6]:
train_df = feat_engg(train_df)
test_df = feat_engg(test_df)



In [7]:
test_df.head(2)

Unnamed: 0,date,hour,month,week,year,day_of_week,day_of_year,quarter,is_weekend,is_weekday,days_in_month
0,2021-03-01,0,3,9,2021,0,60,1,0,1,31
1,2021-03-01,1,3,9,2021,0,60,1,0,1,31


In [8]:
train_df.columns

Index(['date', 'hour', 'demand', 'month', 'week', 'year', 'day_of_week',
       'day_of_year', 'quarter', 'is_weekend', 'is_weekday', 'days_in_month'],
      dtype='object')

In [9]:
cols_for_model = ['hour', 'month', 'week', 'year', 'day_of_week',
       'day_of_year', 'quarter', 'is_weekend', 'is_weekday', 'days_in_month']

In [11]:
train_df.head(2)

Unnamed: 0,date,hour,demand,month,week,year,day_of_week,day_of_year,quarter,is_weekend,is_weekday,days_in_month
0,2018-08-18,9,91,8,33,2018,5,230,3,1,0,31
1,2018-08-18,10,21,8,33,2018,5,230,3,1,0,31


## Train Test Split

In [10]:


tscv = TimeSeriesSplit(n_splits=2, test_size=500)
X = train_df[cols_for_model]
Y = train_df["demand"]

# for train_index, test_index in tscv.split(X):
#     X_train, X_test = X.iloc[list(train_index)], X.iloc[list(test_index)]
#     y_train, y_test = Y.iloc[list(train_index)], Y.iloc[list(test_index)]
#     # Build Model
#     st = StandardScaler()
#     st.fit(X_train)
#     X_train = st.transform(X_train)
#     X_test = st.transform(X_test)
#     my_lr = ElasticNet()
#     my_lr.fit(X_train, y_train)
#     y_pred = my_lr.predict(X_test)
#     print(rmse(y_test,y_pred))
    
# for train_index, test_index in tscv.split(X):
#     X_train, X_test = X.iloc[list(train_index)], X.iloc[list(test_index)]
#     y_train, y_test = Y.iloc[list(train_index)], Y.iloc[list(test_index)]
#     # Build Model
#     rfr = RandomForestRegressor()
#     rfr.fit(X_train, y_train)
#     y_pred = rfr.predict(X_test)
#     print(rmse(y_test,y_pred))

for train_index, test_index in tscv.split(X):
    X_train, X_test = X.iloc[list(train_index)], X.iloc[list(test_index)]
    y_train, y_test = Y.iloc[list(train_index)], Y.iloc[list(test_index)]
    # Build Model
    st = StandardScaler()
    st.fit(X_train)
    X_train = st.transform(X_train)
    X_test = st.transform(X_test)
    
    print("Run started")
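    params = {"alpha": [0.1, 0.3, 0.5, 0.7, 0.9, 1]}

    # NOTE: cv=5 is plain (unshuffled) K-fold, so some folds validate on rows
    # that come before their training rows; a TimeSeriesSplit would avoid that.
    gsearch1 = GridSearchCV(estimator = glm_elastic(),param_grid = params,
                            scoring='neg_root_mean_squared_error',n_jobs=4, cv=5)
    gsearch1.fit(X_train,y_train)
    print("params", gsearch1.best_params_, gsearch1.best_score_)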
    
    
#     glm = glm_elastic(alpha = 0.3)
#     glm.fit(X_train, y_train)
    
#     y_pred_train = glm.predict(X_train)
#     print("train RMSE",rmse(y_train,y_pred_train))
    
#     y_pred = glm.predict(X_test)
#     print("test RMSE",rmse(y_test,y_pred))
    


# for train_index, test_index in tscv.split(X):
#     X_train, X_test = X.iloc[list(train_index)], X.iloc[list(test_index)]
#     y_train, y_test = Y.iloc[list(train_index)], Y.iloc[list(test_index)]
#     # Build Model
#     dtree = DecisionTreeRegressor()
#     dtree.fit(X_train, y_train)
#     y_pred = dtree.predict(X_test)
#     print(rmse(y_test,y_pred))



# # XGB Model
# for train_index, test_index in tscv.split(X):
#     X_train, X_test = X.iloc[list(train_index)], X.iloc[list(test_index)]
#     y_train, y_test = Y.iloc[list(train_index)], Y.iloc[list(test_index)]
# #     Build Model
#     st = StandardScaler()
#     st.fit(X_train)
#     X_train = st.transform(X_train)
#     X_test = st.transform(X_test)

# #     print("Run started")
# #     params = {"learning_rate"    : [0.05, 0.10, 0.15] , 
# #               "max_depth"        : [ 3, 4, 5, 6],
# #              "min_child_weight" : [ 1, 3 ],
# #              "gamma"            : [ 0.0, 0.1, 0.2 , 0.3, 0.4 ],
# #              "colsample_bytree" : [ 0.3, 0.4, 0.5 , 0.7 ] }

# #     gsearch1 = GridSearchCV(estimator = xgb.XGBRegressor(n_estimators=50,
# #                                                       subsample=0.8,
# #                                                       seed=0),param_grid = params, 
# #                             scoring='neg_root_mean_squared_error',n_jobs=4, cv=5)
# #     gsearch1.fit(X_train,y_train)
# #     print("params", gsearch1.best_params_, gsearch1.best_score_)

    
    
#     xgm = xgb.XGBRegressor(n_estimators=50,subsample=0.8,
#                            colsample_bytree= 0.5,
#                         gamma = 0.0,learning_rate = 0.1,
#                         max_depth = 3,
#                         min_child_weight = 1)
#     xgm.fit(X_train, y_train)
    
#     y_pred_train = xgm.predict(X_train)
#     print("train RMSE; ",rmse(y_train,y_pred_train))
    
#     y_pred = xgm.predict(X_test)
#     print("test RMSE: ",rmse(y_test,y_pred))




# # Dense neural network model

# # define model
# for train_index, test_index in tscv.split(X):
#     X_train, X_test = X.iloc[list(train_index)], X.iloc[list(test_index)]
#     y_train, y_test = Y.iloc[list(train_index)], Y.iloc[list(test_index)]
#     # Build Model
#     X_train = np.reshape(X_train,(X_train.shape[0], X_train.shape[1]))
#     X_test = np.reshape(X_test,(X_test.shape[0], X_test.shape[1]))
#     print(X_train.shape,X_test.shape)
#     nn_model = Sequential()
#     nn_model.add(Dense(12, activation='relu'))
#     nn_model.add(Dense(10))
#     nn_model.add(Dense(1))
#     nn_model.compile(loss='mean_squared_error', optimizer='adam')
#     early_stop = EarlyStopping(monitor='loss', patience=2, verbose=1)
#     nn_model.fit(X_train, y_train, epochs=100, batch_size=1, verbose=1, callbacks=[early_stop], shuffle=False)
#     y_pred = nn_model.predict(X_test)
#     y_pred = np.array([y_pred[i][0] for i in range(len(y_pred)) ])
#     print("RMSE Score is",rmse(y_test,y_pred))

    
# # define LSTM model
# for train_index, test_index in tscv.split(X):
#     X_train, X_test = X.iloc[list(train_index)], X.iloc[list(test_index)]
#     y_train, y_test = Y.iloc[list(train_index)], Y.iloc[list(test_index)]
#     st = StandardScaler()
#     st.fit(X_train)
#     X_train = st.transform(X_train)
#     X_test = st.transform(X_test)
    
#     # Build Model
    
    
#     X_train = np.reshape(X_train,(X_train.shape[0], X_train.shape[1]))
#     X_test = np.reshape(X_test,(X_test.shape[0], X_test.shape[1]))
#     print(X_train.shape,X_test.shape)
    
    
#     model = Sequential()
# #     model.add(LSTM(32, batch_input_shape=(1,  X_train.shape[1], 1), stateful=True,return_sequences=True))
#     model.add(LSTM(4, batch_input_shape=(1,  X_train.shape[1], 1), stateful=True))
#     model.add(Dense(1))
#     model.compile(loss = 'mean_squared_error',optimizer = 'adam')
#     early_stop = EarlyStopping(monitor='loss', patience=2, verbose=1)
#     model.fit(X_train, y_train, epochs=6, batch_size=1, verbose=1, callbacks=[early_stop], shuffle=False)
    
#     y_pred_train = model.predict(X_train,batch_size=1)
#     y_pred_train = np.array([y_pred_train[i][0] for i in range(len(y_pred_train)) ])
#     print("train RMSE Score is",rmse(y_train,y_pred_train)) 
    
    
#     y_pred = model.predict(X_test,batch_size=1)
#     y_pred = np.array([y_pred[i][0] for i in range(len(y_pred)) ])
#     print("test RMSE Score is",rmse(y_test,y_pred)) 
    
    


Run started
params {'alpha': 1} -40.47946295846964
Run started
params {'alpha': 0.7} -40.47123243002248


In [None]:
# For LSTM (uses `model` and `st` from the commented-out LSTM loop above)

test_df = test_df[cols_for_model]
test_df = st.transform(test_df)

demand_pred = model.predict(test_df,batch_size=1)
demand_pred = np.array([demand_pred[i][0] for i in range(len(demand_pred))])

sub_df["demand"] = demand_pred
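
The commented-out LSTM declares `batch_input_shape=(1, X_train.shape[1], 1)`, which describes 3-D input, while the reshape calls next to it keep the data 2-D. Depending on the Keras version, a 2-D array may be rejected. The NumPy-only sketch below shows the reshape that matches the declared shape, assuming each feature column is treated as one timestep with a single channel:

```python
import numpy as np

# Stand-in for the scaled feature matrix: 5 rows, 10 feature columns.
n_rows, n_features = 5, 10
X_scaled = np.arange(n_rows * n_features, dtype=float).reshape(n_rows, n_features)

# An LSTM layer declared with batch_input_shape=(1, n_features, 1) expects
# 3-D input (samples, timesteps, channels). Here each feature column becomes
# one "timestep" with a single channel.
X_lstm = X_scaled.reshape(n_rows, n_features, 1)
print(X_scaled.shape, "->", X_lstm.shape)
```

Treating calendar features as timesteps is what the declared shape implies, not necessarily a meaningful sequence. A window of lagged demand values would be the more conventional LSTM input.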

In [None]:
# For NN (uses `nn_model` from the commented-out dense-network loop above)
test_df = test_df[cols_for_model]
demand_pred = nn_model.predict(test_df)
demand_pred = np.array([demand_pred[i][0] for i in range(len(demand_pred))])

In [None]:
demand_pred

In [None]:
# For XGBoost (uses `xgm` and `st` from the commented-out XGB loop above)
test_df = test_df[cols_for_model]
test_df = st.transform(test_df)
test_df = pd.DataFrame(test_df,columns = cols_for_model )
demand_pred = xgm.predict(test_df)

In [None]:
test_df["demand_pred"] = demand_pred
sub_df["demand"] = demand_pred

In [None]:
sub_df.to_csv("sub_lstm_3.csv",index=False)

In [None]:
# test_df["demand_pred"] = glm.predict(st.transform(test_df)).astype("int")
# sub_df['demand'] = test_df['demand_pred']
sub_df.to_csv("sub_glm.csv",index=False)
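
Before writing a submission it can help to tidy the raw predictions. The sketch below works on made-up numbers and rests on two assumptions the notebook does not test: that demand is never negative, and that whole numbers are wanted. Note that `astype("int")` on its own truncates toward zero instead of rounding.

```python
import numpy as np
import pandas as pd

# Made-up raw model outputs; a linear model can produce negative values.
raw_pred = np.array([-3.2, 0.4, 57.6, 101.49])

# Clip at zero (demand cannot be negative), then round to whole rentals.
clean_pred = np.clip(raw_pred, 0, None).round().astype(int)

submission = pd.DataFrame({"demand": clean_pred})
print(submission["demand"].tolist())
```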

In [None]:
test_df.head(2)

In [None]:
sub_df.head(2)