### Car Rental Demand Prediction

### Problem statement

#### ABC is a car rental company based in Bangalore. It rents cars for both in-station and out-station trips at affordable prices. Users can rent different types of cars, such as sedans, hatchbacks, SUVs, MUVs, minivans and so on.

#### Demand for cars has risen recently, so the company would like to tackle the problem of supply and demand. Its ultimate goal is to strike a balance between supply and demand in order to meet user expectations.

#### The company has collected the details of each rental. Based on this past data, it would like to forecast the demand for car rentals on an hourly basis.

### Objective
#### The main objective is to develop a machine learning approach to forecast the demand for car rentals on an hourly basis.


#### Data Dictionary
#### Train Data
#### Variable : Description
#### date : Date (yyyy-mm-dd)
#### hour : Hour of the day
#### demand : No. of car rentals in an hour


#### Data Dictionary
#### Test Data
#### Variable : Description
#### date : Date (yyyy-mm-dd)
#### hour : Hour of the day

#### Data Dictionary
#### Submission Data
#### Variable : Description
#### date : Date (yyyy-mm-dd)
#### hour : Hour of the day
#### demand : No. of car rentals in an hour
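
A minimal sketch of how the layout described in the data dictionaries could be checked after loading. The three-row in-memory sample and the helper name `check_schema` are assumptions made for illustration; they are not part of the competition files.

```python
import io
import pandas as pd

# Hypothetical three-row sample that mimics the train layout (date, hour, demand)
sample_csv = io.StringIO(
    "date,hour,demand\n"
    "2018-08-18,9,91\n"
    "2018-08-18,10,21\n"
    "2018-08-18,13,23\n"
)
sample = pd.read_csv(sample_csv, parse_dates=["date"])

def check_schema(df, expected_columns):
    """Return True when df has exactly the expected columns, in order."""
    return list(df.columns) == list(expected_columns)

schema_ok = check_schema(sample, ["date", "hour", "demand"])
# Hours of the day should fall in the range 0..23
hours_ok = bool(sample["hour"].between(0, 23).all())
print(schema_ok, hours_ok)
```

The same helper could be called with `["date", "hour"]` for the test layout.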

## Table of Contents

### Step 1: Importing the Relevant Libraries
### Step 2: Data Inspection
### Step 3: Data Cleaning
### Step 4: Exploratory Data Analysis
### Step 5: Building Model


### Step 1: Importing the Relevant Libraries

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

import warnings
# silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

### Step 2: Data Inspection

In [None]:
train = pd.read_csv("/train_E1GspfA.csv")
test = pd.read_csv("/test_6QvDdzb.csv")

In [None]:
train.shape,test.shape

((18247, 3), (7650, 3))

In [None]:
train.head(5)

Unnamed: 0,date,hour,demand
0,2018-08-18,9,91
1,2018-08-18,10,21
2,2018-08-18,13,23
3,2018-08-18,14,104
4,2018-08-18,15,81


In [None]:
test.head(5)

Unnamed: 0,date,hour
0,2021-03-01,0
1,2021-03-01,1
2,2021-03-01,2
3,2021-03-01,3
4,2021-03-01,5


#### The train set has 18247 rows and 3 columns, whereas the test set has 7650 rows and 2 columns.

In [None]:
# percentage of null values in each column of the train set
train.isnull().sum()/train.shape[0] *100

In [None]:
# percentage of null values in each column of the test set
test.isnull().sum()/test.shape[0] *100

In [None]:
# show the data types for each column of the train set
train.dtypes

In [None]:
# show the data types for each column of the test set
test.dtypes

#### There are no missing values in either the train data or the test data

### Step 3: Data Cleaning

#### No data cleaning is needed
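
The null check above can be wrapped in a small helper so that the train and test frames are inspected the same way. This is only a sketch: the name `null_percentage` and the four-row demo frame are hypothetical.

```python
import numpy as np
import pandas as pd

def null_percentage(df):
    """Percentage of missing values in each column."""
    return df.isnull().mean() * 100

# Hypothetical frame with one missing demand value out of four rows
demo = pd.DataFrame({
    "hour": [0, 1, 2, 3],
    "demand": [10.0, np.nan, 30.0, 40.0],
})
pct = null_percentage(demo)
print(pct)
```

`isnull().mean()` is equivalent to dividing the null count by the number of rows, which is what the cells above do by hand.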

In [None]:
#categorical features
categorical = train.select_dtypes(include=[object])  # np.object was removed in NumPy 1.24
print("Categorical Features in Train Set:",categorical.shape[1])

#numerical features
numerical= train.select_dtypes(include =[np.float64,np.int64])
print("Numerical Features in Train Set:",numerical.shape[1])

In [None]:
#categorical features
categorical = test.select_dtypes(include=[object])  # np.object was removed in NumPy 1.24
print("Categorical Features in Test Set:",categorical.shape[1])

#numerical features
numerical= test.select_dtypes(include =[np.float64,np.int64])
print("Numerical Features in Test Set:",numerical.shape[1])

#### Date Variable (Categorical)

In [None]:
# frequency table of a variable will give us the count of each category in that variable
train['date'].value_counts()

In [None]:
# bar plot to visualize the frequency
train['date'].value_counts().plot.bar()

In [None]:
# bar plot to visualize the frequency
test['date'].value_counts().plot.bar()

In [None]:
ax1 = plt.subplot(121)
train['hour'].hist(bins=20, figsize=(12,4))
ax1.set_title("Train")

ax2 = plt.subplot(122)
test['hour'].hist(bins=20)
ax2.set_title("Test")

### Step 4: Exploratory Data Analysis

In [None]:
# calculate and visualize correlation matrix
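# correlate the numeric columns only ('date' is stored as strings at this point)
matrix = train.select_dtypes(include='number').corr()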
f, ax = plt.subplots(figsize=(9, 6))
sns.heatmap(matrix, vmax=1, square=True, cmap="BuPu", annot=True)

matrix
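
A correlation coefficient only measures linear association, so a per-hour average can be a useful complement when looking at how demand varies over the day. The sketch below uses a hypothetical four-row frame, not the competition data.

```python
import pandas as pd

# Hypothetical rentals: two observations each for hours 9 and 10
demo = pd.DataFrame({
    "hour": [9, 9, 10, 10],
    "demand": [90, 110, 20, 40],
})

# Average demand for each hour of the day
hourly_mean = demo.groupby("hour")["demand"].mean()
print(hourly_mean)
```

On the real data, `hourly_mean.plot.bar()` would show the daily profile in one chart.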

In [None]:
### Data preprocessing

train['source'] = 'train'
test['source'] = 'test'

In [None]:
dataset = pd.concat([train, test])


#### There are no irrelevant features to drop

In [None]:
### Label encoding
# LabelEncoder maps each distinct value to an integer code (its position among the sorted unique values)
label_encoder_date = LabelEncoder()
dataset['date'] = label_encoder_date.fit_transform(dataset['date'])

label_encoder_hour = LabelEncoder()
dataset['hour'] = label_encoder_hour.fit_transform(dataset['hour'])

label_encoder_demand = LabelEncoder()
dataset['demand'] = label_encoder_demand.fit_transform(dataset['demand'])
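
To make the effect of label encoding on a numeric column concrete, here is a small sketch with hypothetical demand values. It shows that the encoded column holds rank-like codes rather than rental counts, and that `inverse_transform` recovers the original values from those codes.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical demand values
demand = np.array([91, 21, 23, 104, 21])

encoder = LabelEncoder()
# Each value is replaced by its position among the sorted unique values
codes = encoder.fit_transform(demand)
# The codes can be mapped back to the original values
restored = encoder.inverse_transform(codes)

print(codes)
print(restored)
```

Because the codes discard the spacing between values, an error metric computed on an encoded target is not expressed in rentals per hour.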


In [None]:
### separating the train and test

# .copy() gives independent frames, so later edits do not act on a slice of dataset
train = dataset.loc[dataset['source'] == 'train'].copy()
test = dataset.loc[dataset['source'] == 'test'].copy()

In [None]:
train = train.drop('source', axis = 1)
# demand is all-NaN for the test rows after the concat, so drop it from the test features as well
test = test.drop(['source', 'demand'], axis = 1)

In [None]:
#### separating into X and Y

X = train.drop("demand", axis = 1)
Y = train["demand"]

In [None]:
#### Test data is further divided into Public (40%) and Private (60%)

In [None]:
#### hold out 40% of the train data as a validation set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.4, random_state = 50)


In [None]:
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("Y_train shape:", Y_train.shape)
print("Y_test shape:", Y_test.shape)
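
The split above is random. For a forecasting task, a chronological hold-out is a common alternative, because the test period lies after the training period. The sketch below illustrates that different technique, not what this notebook does; `chronological_split` and the ten-day demo frame are hypothetical.

```python
import pandas as pd

def chronological_split(df, date_col, valid_fraction=0.4):
    """Keep the earliest rows for training and the latest rows for validation."""
    ordered = df.sort_values(date_col).reset_index(drop=True)
    n_valid = int(round(len(ordered) * valid_fraction))
    cut = len(ordered) - n_valid
    return ordered.iloc[:cut], ordered.iloc[cut:]

# Hypothetical ten consecutive days of data
demo = pd.DataFrame({
    "date": pd.date_range("2018-08-18", periods=10, freq="D"),
    "demand": range(10),
})
train_part, valid_part = chronological_split(demo, "date")
print(len(train_part), len(valid_part))
```

With this kind of split, every validation row is later than every training row, which mirrors how the model would be used.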

#### KNN

In [None]:
from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor()

In [None]:
knn.fit(X_train, Y_train)

In [None]:
Y_pred_knn = knn.predict(X_test)


#### Linear regression

In [None]:
lin_reg = LinearRegression()


In [None]:
lin_reg.fit(X_train, Y_train)
Y_pred_lin_reg = lin_reg.predict(X_test)

#### Decision tree

In [None]:
from sklearn.tree import DecisionTreeRegressor
dec_tree = DecisionTreeRegressor()

In [None]:
dec_tree.fit(X_train, Y_train)

In [None]:
Y_pred_dec = dec_tree.predict(X_test)

#### Random forest regressor

In [None]:
from sklearn.ensemble import RandomForestRegressor
ran_for = RandomForestRegressor()

In [None]:
ran_for.fit(X_train, Y_train)

In [None]:
Y_pred_ran_for = ran_for.predict(X_test)

#### XGB regressor

In [None]:
from xgboost import XGBRegressor
xgb = XGBRegressor(random_state = 50)

In [None]:
xgb.fit(X_train, Y_train)

In [None]:
Y_pred_xgb = xgb.predict(X_test)

#### Model Evaluation

In [None]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

print("Linear Regression: ")
print("RMSE:",np.sqrt(mean_squared_error(Y_test, Y_pred_lin_reg)))
print("R2 score:", r2_score(Y_test, Y_pred_lin_reg))

In [None]:
print("KNN regression: ")
print("RMSE:",np.sqrt(mean_squared_error(Y_test, Y_pred_knn)))
print("R2 score:", r2_score(Y_test, Y_pred_knn))

In [None]:
print("Decision tree regression: ")
print("RMSE:",np.sqrt(mean_squared_error(Y_test, Y_pred_dec)))
print("R2 score:", r2_score(Y_test, Y_pred_dec))

In [None]:
print("Random forest regression: ")
print("RMSE:",np.sqrt(mean_squared_error(Y_test, Y_pred_ran_for)))
print("R2 score:", r2_score(Y_test, Y_pred_ran_for))

In [None]:
print("XGB regression: ")
print("RMSE:",np.sqrt(mean_squared_error(Y_test, Y_pred_xgb)))
print("R2 score:", r2_score(Y_test, Y_pred_xgb))
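
The five evaluation cells repeat the same two lines. A small helper can compute both metrics in one place; the sketch below uses a hypothetical function name `report` and made-up targets to show what RMSE and R2 look like for a perfect predictor and for a constant (mean) predictor.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def report(name, y_true, y_pred):
    """Print and return (RMSE, R2) for one set of predictions."""
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    r2 = float(r2_score(y_true, y_pred))
    print(f"{name}: RMSE={rmse:.3f}, R2={r2:.3f}")
    return rmse, r2

# Hypothetical targets and predictions
y_true = np.array([10.0, 20.0, 30.0, 40.0])
perfect = report("perfect", y_true, y_true)
constant = report("mean only", y_true, np.full(4, 25.0))
```

A model that always predicts the mean has an R2 of zero, so any useful regressor should score above that.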

#### Logistic regression

In [None]:
# import StratifiedKFold from sklearn and fit the model
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [None]:
# stratified 5 folds, shuffle each stratification of the data before splitting into batches

mean_accuracy = []
i = 1
kf = StratifiedKFold(n_splits=5, random_state=1, shuffle=True)

for train_index, test_index in kf.split(X, Y):
    print('\n{} of kfold {}'.format(i, kf.n_splits))
    xtr, xvl = X.loc[train_index], X.loc[test_index]
    ytr, yvl = Y[train_index], Y[test_index]
    
    model = LogisticRegression(random_state=1)
    model.fit(xtr, ytr)
    pred_test = model.predict(xvl)
    score = accuracy_score(yvl, pred_test)
    mean_accuracy.append(score)
    print('accuracy_score', score)
    i+=1
    
print("\nMean validation accuracy: ", sum(mean_accuracy)/len(mean_accuracy))


# make predictions on the test set
pred_test = model.predict(test)


# class-probability estimates for the last validation fold
# column k of predict_proba is the probability of class model.classes_[k]
pred = model.predict_proba(xvl)[:,1]


1 of kfold 5
accuracy_score 0.010684931506849316

2 of kfold 5
accuracy_score 0.012876712328767123

3 of kfold 5
accuracy_score 0.01068785968758564

4 of kfold 5
accuracy_score 0.009865716634694436

5 of kfold 5
accuracy_score 0.011510002740476843

Mean validation accuracy:  0.011125044579674672


ValueError: ignored

#### Logistic regression will not be used
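
Accuracy counts only exact matches, which is a harsh criterion when the target is a count with many distinct values. One alternative is to cross-validate a regressor and score it by RMSE. The sketch below does this on synthetic data (the feature and target are made up) and assumes scikit-learn 0.22 or later for the `neg_root_mean_squared_error` scorer.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 23, size=(200, 1))            # hypothetical 'hour' feature
y_demo = 3.0 * X_demo[:, 0] + rng.normal(0, 1, 200)   # hypothetical demand

kf = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(LinearRegression(), X_demo, y_demo,
                         cv=kf, scoring="neg_root_mean_squared_error")
# scikit-learn reports the negated RMSE so that larger is better
rmse_per_fold = -scores
print(rmse_per_fold.mean())
```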

In [None]:
# import library
from sklearn import tree

In [None]:
mean_accuracy = []
i=1
kf = StratifiedKFold(n_splits=5,random_state=1,shuffle=True)
for train_index,test_index in kf.split(X,Y):
    print('\n{} of kfold {}'.format(i,kf.n_splits))
    xtr,xvl = X.loc[train_index],X.loc[test_index]
    ytr,yvl = Y[train_index],Y[test_index]
    
    model = tree.DecisionTreeClassifier(random_state=1)
    model.fit(xtr, ytr)
    pred_test = model.predict(xvl)
    score = accuracy_score(yvl,pred_test)
    mean_accuracy.append(score)
    print('accuracy_score',score)
    i+=1
    
print("\nMean validation accuracy: ", sum(mean_accuracy)/len(mean_accuracy))
pred_test = model.predict(test)


1 of kfold 5
accuracy_score 0.0052054794520547945

2 of kfold 5
accuracy_score 0.010136986301369864

3 of kfold 5
accuracy_score 0.007673335160317895

4 of kfold 5
accuracy_score 0.007673335160317895

5 of kfold 5
accuracy_score 0.010139764318991504

Mean validation accuracy:  0.00816578007861039


ValueError: ignored

#### Random forest

In [None]:
# import library
from sklearn.ensemble import RandomForestClassifier

In [None]:
mean_accuracy = []
i=1
kf = StratifiedKFold(n_splits=5,random_state=1,shuffle=True)
for train_index,test_index in kf.split(X, Y):
    print('\n{} of kfold {}'.format(i,kf.n_splits))
    xtr,xvl = X.loc[train_index],X.loc[test_index]
    ytr,yvl = Y[train_index],Y[test_index]
    
    model = RandomForestClassifier(random_state=1, max_depth=10, n_estimators=10)
    model.fit(xtr, ytr)
    pred_test = model.predict(xvl)
    score = accuracy_score(yvl,pred_test)
    mean_accuracy.append(score)
    print('accuracy_score',score)
    i+=1
    
print("\nMean validation accuracy: ", sum(mean_accuracy)/len(mean_accuracy))
pred_test = model.predict(test)


1 of kfold 5
accuracy_score 0.010684931506849316

2 of kfold 5
accuracy_score 0.011232876712328766

3 of kfold 5
accuracy_score 0.010961907371882707

4 of kfold 5
accuracy_score 0.01068785968758564

5 of kfold 5
accuracy_score 0.010961907371882707

Mean validation accuracy:  0.010905896530105827


ValueError: ignored

#### Grid search CV

In [None]:
# import library
from sklearn.model_selection import GridSearchCV

In [None]:
# Provide range for max_depth from 1 to 20 with an interval of 2 and from 1 to 200 with an interval of 20 for n_estimators
paramgrid = {'max_depth': list(range(1, 20, 2)), 'n_estimators': list(range(1, 200, 20))}

In [None]:
# cv is left at its default (5-fold in scikit-learn 0.22 and later, 3-fold in earlier versions)
grid_search = GridSearchCV(RandomForestClassifier(random_state=1), paramgrid)

In [None]:
# split the data
from sklearn.model_selection import train_test_split
x_train, x_cv, y_train, y_cv = train_test_split(X, Y, test_size =0.3, random_state=1)

In [None]:
# fit the grid search model
grid_search.fit(x_train, y_train)

GridSearchCV(estimator=RandomForestClassifier(random_state=1),
             param_grid={'max_depth': [1, 3, 5, 7, 9, 11, 13, 15, 17, 19],
                         'n_estimators': [1, 21, 41, 61, 81, 101, 121, 141, 161,
                                          181]})

In [None]:
# estimate the optimized value
grid_search.best_estimator_

RandomForestClassifier(max_depth=3, n_estimators=141, random_state=1)
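
The same grid search pattern also works with a regressor and an RMSE-based scorer. The following is a sketch on synthetic data with a deliberately tiny grid; the data, the grid values and the choice of `RandomForestRegressor` are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 23, size=(120, 1))   # hypothetical 'hour' feature
# Hypothetical demand with a daily cycle plus noise
y_demo = 50 + 30 * np.sin(X_demo[:, 0] / 24 * 2 * np.pi) + rng.normal(0, 2, 120)

param_grid = {"max_depth": [1, 5], "n_estimators": [10, 30]}
search = GridSearchCV(RandomForestRegressor(random_state=1), param_grid,
                      cv=3, scoring="neg_root_mean_squared_error")
search.fit(X_demo, y_demo)

# best_score_ is the negated RMSE of the best parameter combination
print(search.best_params_, -search.best_score_)
```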

In [None]:
mean_accuracy = []
i=1
kf = StratifiedKFold(n_splits=5,random_state=1,shuffle=True)
for train_index,test_index in kf.split(X,Y):
    print('\n{} of kfold {}'.format(i,kf.n_splits))
    xtr,xvl = X.loc[train_index],X.loc[test_index]
    ytr,yvl = Y[train_index],Y[test_index]
    
    model = RandomForestClassifier(random_state=1, max_depth=3, n_estimators=141)
    model.fit(xtr, ytr)
    pred_test = model.predict(xvl)
    score = accuracy_score(yvl,pred_test)
    mean_accuracy.append(score)
    print('accuracy_score',score)
    i+=1
    
print("\nMean validation accuracy: ", sum(mean_accuracy)/len(mean_accuracy))
pred_test = model.predict(test)
pred2=model.predict_proba(test)[:,1]


1 of kfold 5
accuracy_score 0.00904109589041096

2 of kfold 5
accuracy_score 0.013424657534246575

3 of kfold 5
accuracy_score 0.009317621266100301

4 of kfold 5
accuracy_score 0.00822143052891203

5 of kfold 5
accuracy_score 0.013154288846259249

Mean validation accuracy:  0.010631818813185824


ValueError: ignored

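#### Logistic regression gives the highest mean validation accuracy of the classifiers tried (about 1.1%), but every classifier scores close to 1%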

### Step 5: Building Model

In [None]:
from datetime import datetime


In [None]:
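# reload the raw files and parse 'date' as datetime, because the column was label-encoded above
train = pd.read_csv("/train_E1GspfA.csv", parse_dates=['date'])
test = pd.read_csv("/test_6QvDdzb.csv", parse_dates=['date'])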

In [None]:
train['year']= train['date'].dt.year
train['month']= train['date'].dt.month
train['day']= train['date'].dt.day

In [None]:
test['year']= test['date'].dt.year
test['month']= test['date'].dt.month
test['day']= test['date'].dt.day
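
The `.dt` accessor exposes other calendar fields in the same way. As a sketch of a possible extension (not something this notebook uses), the day of the week and a weekend flag could be derived as follows; the two-row frame is hypothetical.

```python
import pandas as pd

demo = pd.DataFrame({"date": pd.to_datetime(["2018-08-18", "2018-08-20"])})
demo["year"] = demo["date"].dt.year
demo["month"] = demo["date"].dt.month
demo["day"] = demo["date"].dt.day
# Monday is 0 and Sunday is 6
demo["dayofweek"] = demo["date"].dt.dayofweek
demo["is_weekend"] = demo["dayofweek"] >= 5
print(demo)
```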

In [None]:
# drop the original date column now that year, month and day have been extracted
train = train.drop('date', axis=1)
test = test.drop('date', axis=1)

In [None]:
train.columns

Index(['hour', 'demand', 'year', 'month', 'day'], dtype='object')

In [None]:
test.columns

Index(['hour', 'year', 'month', 'day'], dtype='object')

In [None]:
# Separate Features and Target
X = train.drop(columns=['demand'])
y = train['demand']

In [None]:
# 40% data as validation set
X_train,X_valid,y_train,y_valid = train_test_split(X,y,test_size=0.4,random_state=22)

In [None]:
# Model Building
xgbm = XGBRegressor()
xgbm.fit(X_train,y_train)
pred3 = xgbm.predict(X_valid)
np.sqrt(mean_squared_error(y_valid, pred3))



35.94541583326699
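
An RMSE value is easier to judge next to a naive baseline. One way to get such a reference point is a predictor that always returns the training mean; the sketch below uses scikit-learn's `DummyRegressor` on made-up numbers, not on the competition data.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Hypothetical targets; the features are ignored by the dummy model
y_train_demo = np.array([10.0, 20.0, 30.0, 40.0])
y_valid_demo = np.array([15.0, 35.0])
X_train_demo = np.zeros((4, 1))
X_valid_demo = np.zeros((2, 1))

# Always predicts the mean of the training targets
baseline = DummyRegressor(strategy="mean").fit(X_train_demo, y_train_demo)
baseline_pred = baseline.predict(X_valid_demo)
baseline_rmse = float(np.sqrt(mean_squared_error(y_valid_demo, baseline_pred)))
print(baseline_rmse)
```

Running the same baseline on `X_train`, `y_train`, `X_valid` and `y_valid` would show how much the XGBoost model improves on predicting the mean.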

In [None]:
submission = pd.read_csv('/sample_4E0BhPN.csv')
final_predictions = xgbm.predict(test)
submission['demand'] = final_predictions
#only positive predictions for the target variable
submission['demand'] = submission['demand'].apply(lambda x: 0 if x<0 else x)
submission.to_csv('/my_submission_xgbm.csv', index=False)
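
The lambda in the cell above floors negative predictions at zero. The same effect can be expressed with the vectorized `Series.clip`, shown here on hypothetical values.

```python
import pandas as pd

# Hypothetical raw predictions, one of them negative
preds = pd.Series([-3.2, 0.0, 12.7])

# Replace anything below zero with zero and leave other values unchanged
clipped = preds.clip(lower=0)
print(clipped.tolist())
```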

In [None]:
# NOTE: y_pred is not defined in any cell shown here; it must hold predictions for X_valid
from math import sqrt
MSE = metrics.mean_squared_error(y_valid, y_pred)
rmse = sqrt(MSE)
print("Root Mean Squared Error:", rmse)

Root Mean Squared Error: 40.96222754220579
