<a href="https://colab.research.google.com/github/JasonLimJS/Grab-AI-for-SEA/blob/master/Algorithm%20to%20Test%20The%20Model%20with%20Test%20Data%20Set.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**For fast execution, switch the runtime to Google's GPU (the runtime uses the local CPU by default). Go to Runtime > Change runtime type, then select Python 3 as the runtime type and GPU as the hardware accelerator. The code documented here is meant to be run with Python 3 on a GPU runtime.**

1. Mount Google Drive.

In [105]:
from google.colab import drive
drive.mount('drive') 
#If you are mounting your Google Drive for the first time, running this cell shows a link below;
#follow it to obtain the activation key, then enter the key at the prompt

Drive already mounted at drive; to attempt to forcibly remount, call drive.mount("drive", force_remount=True).


2. Upload the test data to your Google Drive, then specify the path where it is stored. (In the example below, it is stored in 'drive/My Drive/' as 'testing.csv'.)

In [0]:
test_data_path= 'drive/My Drive/testing.csv'

3. Install all relevant libraries.

In [107]:
!pip install pandas
!pip install numpy
!pip install scikit-learn
!pip install python-geohash



4. Load all relevant libraries.

In [0]:
from google.colab import drive
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error as MSE
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.decomposition import PCA
import joblib  #sklearn.externals.joblib was removed in scikit-learn 0.23
import warnings
import geohash
warnings.filterwarnings('ignore')

5. Load the uploaded test data set from Google Drive into a pandas DataFrame.

In [0]:
test_dat= pd.read_csv(test_data_path)

6. Before proceeding, check that the test data set has the same structure as the provided 'training.csv'. The model only runs successfully on data with matching columns and data types. The structure must match the one below:

In [110]:
print(test_dat.info())
print('\n')
print(test_dat.head(10))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 4 columns):
geohash6     150000 non-null object
day          150000 non-null int64
timestamp    150000 non-null object
demand       150000 non-null float64
dtypes: float64(1), int64(1), object(2)
memory usage: 4.6+ MB
None


  geohash6  day timestamp    demand
0   qp09d6   32      6:30  0.335199
1   qp02zg   25       1:0  0.042383
2   qp08gm   40      4:30  0.016331
3   qp0974   59      5:45  0.097868
4   qp0973   37      3:45  0.001806
5   qp09sb   24       0:0  0.105745
6   qp09uv   30       3:0  0.304235
7   qp09gx   60     23:30  0.121063
8   qp0d0b   35      8:45  0.000623
9   qp02zj   17     13:30  0.069005
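The structure check in step 6 can also be automated. The following is a minimal sketch, not part of the original notebook: the helper name `check_schema` and the exact rules it applies (integer `day`, numeric `demand`, `H:M` timestamps) are assumptions inferred from the `info()` output above.

```python
import pandas as pd

# Columns expected by the rest of this notebook (taken from the info() output above)
EXPECTED_COLUMNS = ["geohash6", "day", "timestamp", "demand"]

def check_schema(df):
    """Hypothetical helper: return a list of problems, empty if the frame looks usable."""
    problems = []
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        problems.append("missing columns: " + ", ".join(missing))
        return problems  # the remaining checks need every column
    if not pd.api.types.is_integer_dtype(df["day"]):
        problems.append("day should be an integer column")
    if not pd.api.types.is_numeric_dtype(df["demand"]):
        problems.append("demand should be a numeric column")
    # Timestamps are written as hour:minute without zero padding, e.g. '1:0' or '23:30'
    bad_ts = ~df["timestamp"].astype(str).str.match(r"^\d{1,2}:\d{1,2}$")
    if bad_ts.any():
        problems.append(str(int(bad_ts.sum())) + " timestamps are not in H:M form")
    return problems

# Two rows copied from the head() output above
sample = pd.DataFrame({
    "geohash6": ["qp09d6", "qp02zg"],
    "day": [32, 25],
    "timestamp": ["6:30", "1:0"],
    "demand": [0.335199, 0.042383],
})
print(check_schema(sample))
```

Calling `check_schema(test_dat)` right after loading the file would flag a mismatched data set before the feature engineering steps run.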


7. Preprocess the raw data and engineer the features.

In [0]:
test_dat.dropna(inplace=True)

#Decode geohash6 into latitudes('lat') and longitudes('long')
func1= np.vectorize(geohash.decode)
array1= np.asarray(func1(test_dat.geohash6),dtype='float64')
test_dat['lat']= array1[0,:]
test_dat['long']=array1[1,:]

#Decode timestamp into hour and minute, then assign time interval as hour*60 + minute
col_labels=['hour','minute']
test_dat[col_labels]=test_dat.timestamp.str.split(':',expand=True).astype(int)
test_dat['time_min']= test_dat.hour*60 + test_dat.minute
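To make the two transformations above concrete, here is a small standard-library sketch of what they compute for a single row. It is an illustration only: `decode_geohash` and `to_minutes` are hypothetical helpers written for this example, and the notebook itself relies on `geohash.decode` from python-geohash and on pandas string methods.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def decode_geohash(code):
    """Return the (lat, lon) centre of the cell named by a geohash string."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    is_lon = True  # bits alternate between the two axes, starting with longitude
    for ch in code:
        value = BASE32.index(ch)
        for shift in range(4, -1, -1):  # 5 bits per character, most significant first
            bit = (value >> shift) & 1
            if is_lon:
                mid = (lon_lo + lon_hi) / 2
                if bit:
                    lon_lo = mid  # keep the upper half of the longitude range
                else:
                    lon_hi = mid  # keep the lower half
            else:
                mid = (lat_lo + lat_hi) / 2
                if bit:
                    lat_lo = mid
                else:
                    lat_hi = mid
            is_lon = not is_lon
    return (lat_lo + lat_hi) / 2, (lon_lo + lon_hi) / 2

def to_minutes(timestamp):
    """Convert an 'H:M' string such as '2:45' into minutes since midnight."""
    hour, minute = timestamp.split(":")
    return int(hour) * 60 + int(minute)

lat, lon = decode_geohash("qp02yc")
print(round(lat, 6), round(lon, 6))  # -5.484924 90.653687
print(to_minutes("2:45"))            # 165
```

The decoded coordinates agree with the `lat` and `long` values that the notebook reports for geohash `qp02yc`.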

In [112]:
#Create 1-day lagged demand variable for each time interval as this feature is required for Model2 of the ensemble model to function

test_dat.sort_values(by=['geohash6','time_min','day'],inplace=True)
test_dat['demand_lag_id']= test_dat.groupby(['geohash6','time_min']).demand.shift(1)

#If a data point has no 1-day lagged demand, its lagged demand is set equal to its own demand

test_dat['demand_lag_1']= np.where(test_dat.demand_lag_id.isna(),test_dat.demand,test_dat.demand_lag_id)
test_dat.drop(['demand_lag_id'],axis=1,inplace=True)

print(test_dat.info())
print('\n')
test_dat.head(10)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 150000 entries, 144865 to 103495
Data columns (total 10 columns):
geohash6        150000 non-null object
day             150000 non-null int64
timestamp       150000 non-null object
demand          150000 non-null float64
lat             150000 non-null float64
long            150000 non-null float64
hour            150000 non-null int64
minute          150000 non-null int64
time_min        150000 non-null int64
demand_lag_1    150000 non-null float64
dtypes: float64(4), int64(4), object(2)
memory usage: 12.6+ MB
None




       geohash6  day timestamp    demand       lat       long  hour  minute  time_min  demand_lag_1
144865   qp02yc   11       2:0  0.040662 -5.484924  90.653687     2       0       120      0.040662
92468    qp02yc   39      2:45  0.220175 -5.484924  90.653687     2      45       165      0.220175
92824    qp02yc    8      3:45  0.004063 -5.484924  90.653687     3      45       225      0.004063
139501   qp02yc   37      3:45  0.027649 -5.484924  90.653687     3      45       225      0.004063
119132   qp02yc   61      4:45  0.014507 -5.484924  90.653687     4      45       285      0.014507
59974    qp02yc   50      5:15  0.019776 -5.484924  90.653687     5      15       315      0.019776
134266   qp02yc   59      5:15  0.008128 -5.484924  90.653687     5      15       315      0.019776
115278   qp02yc   15      5:30  0.016947 -5.484924  90.653687     5      30       330      0.016947
1467     qp02yc   34      5:30  0.000046 -5.484924  90.653687     5      30       330      0.016947
32151    qp02yc   21      5:45  0.050540 -5.484924  90.653687     5      45       345      0.050540
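The lag construction can be traced on a toy frame. The sketch below uses made-up demand values and mirrors the sort, group and shift logic of the cell above; `fillna` is used here as an equivalent of the `np.where` fallback. Note that `shift(1)` picks up the previous available row for the same location and time slot, which is not necessarily the calendar day before (in the output above, day 37 at 3:45 inherits the demand of day 8).

```python
import pandas as pd

# Toy data with invented demand values: three days for one location/time slot, one row for another
toy = pd.DataFrame({
    "geohash6": ["qp02yc", "qp02yc", "qp02yc", "qp09d6"],
    "time_min": [225, 225, 225, 390],
    "day": [37, 8, 20, 32],
    "demand": [0.03, 0.01, 0.02, 0.30],
})

# Order rows so that, within each (geohash6, time_min) group, days are ascending
toy = toy.sort_values(by=["geohash6", "time_min", "day"])

# Demand of the previous available day in the same group (NaN for the first row of each group)
previous = toy.groupby(["geohash6", "time_min"])["demand"].shift(1)

# Fall back to the row's own demand when there is no earlier observation
toy["demand_lag_1"] = previous.fillna(toy["demand"])

print(toy)
```

Rows without an earlier observation end up with a lag equal to their own demand, which is the same convention the notebook applies to the test set.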


In [0]:
#kNN_Model1.pkl can be downloaded from https://drive.google.com/open?id=1-4d8Ij5wWFmbZS59TSHweOfQv5ra4q1d
model1= joblib.load('drive/My Drive/kNN_Model1.pkl')
#whole_lag_1d_kNN.pkl can be downloaded from https://drive.google.com/open?id=1Qwiy4uyXps0Z-PC49ouZgUdXmBoFlAvH
model2= joblib.load('drive/My Drive/whole_lag_1d_kNN.pkl')

In [0]:
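#Ensemble weights for Model1 and Model2
weight_1= 0.6
weight_2= 0.4

#Feature matrix in the column order expected by the models, plus the actual demand
arranged_col= ['lat','long','time_min','demand_lag_1']
X_test= test_dat[arranged_col].values
y_test= test_dat.demand.values

#Model1 uses only lat, long and time_min (the lagged demand is dropped)
X_test_1= X_test[:,:-1]

#Hard-coded standardization constants for Model1 (mean and standard deviation per feature)
mean_vect= np.array([ -5.34803514,  90.76456439, 611.01511076])
sd_vect= np.array([5.74593198e-02, 1.02564882e-01, 3.91570353e+02])

X_test_scaled1= (X_test_1- mean_vect)/sd_vect

#Hard-coded standardization and PCA constants for Model2 (all four features, three components)
norm_mean= np.array([-5.34740478e+00,9.07638854e+01,6.10487686e+02,1.05574095e-01])
norm_std= np.array([5.65621040e-02,1.02704766e-01,3.92263089e+02,1.59564436e-01])
pca_mean= np.array([ 6.14163081e-15 ,1.66324146e-14 ,2.60231469e-16,-2.49625107e-14])
pca_comp= np.array([[ 0.64256504 ,0.72425171, -0.21969293, -0.11960199],
                    [-0.21087891,  0.07957761, -0.71483013,  0.66197838],
                    [ 0.51445685, -0.2009283,   0.46599514,  0.69123838]])

#Standardize, centre on the PCA mean, then project onto the principal components
td0= (X_test- norm_mean)/norm_std
td1 = td0 - pca_mean
X_test_scaled2 = td1.dot(pca_comp.T)

#Predict with both models and blend the predictions with the ensemble weights
y_pred1= model1.predict(X_test_scaled1)
y_pred2= model2.predict(X_test_scaled2)
y_pred_final= weight_1*y_pred1 + weight_2*y_pred2

#Root mean squared error of the blended prediction
rmse= MSE(y_test,y_pred_final)**0.5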
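The manual standardization and projection in the cell above has the same form as the `transform` of a fitted `StandardScaler` followed by `PCA`. The sketch below checks that equivalence on random data. It assumes, without confirmation from the notebook, that the hard-coded constants were exported from such a pipeline; the data and pipeline here are purely illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative random data with four features, standing in for lat, long, time_min, demand_lag_1
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))

# Fit a scaler followed by a 3-component PCA (assumed setup, not taken from the notebook)
pipe = Pipeline([("scale", StandardScaler()), ("pca", PCA(n_components=3))]).fit(X)
scaler = pipe.named_steps["scale"]
pca = pipe.named_steps["pca"]

# Manual version: standardize, subtract the PCA mean, project onto the components
standardized = (X - scaler.mean_) / scaler.scale_
manual = (standardized - pca.mean_).dot(pca.components_.T)

# Both routes give the same projected features
print(np.allclose(manual, pipe.transform(X)))
```

Under that assumption, `scaler.mean_`, `scaler.scale_`, `pca.mean_` and `pca.components_` would play the roles of `norm_mean`, `norm_std`, `pca_mean` and `pca_comp`.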

In [115]:
print('RMSE: ' + str(rmse))

RMSE: 0.055680232118465414
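The blended prediction and its RMSE can be reproduced by hand on a few numbers. The sketch below uses invented values for the actual demand and for the two model outputs; only the 0.6 and 0.4 weights come from the notebook.

```python
import numpy as np

# Invented values for illustration only
y_true = np.array([0.0, 0.2, 0.4])
pred_model1 = np.array([0.1, 0.2, 0.3])
pred_model2 = np.array([0.0, 0.3, 0.5])

# Same ensemble weights as in the notebook
weight_1, weight_2 = 0.6, 0.4
blended = weight_1 * pred_model1 + weight_2 * pred_model2

# RMSE is the square root of the mean squared difference
rmse_toy = np.sqrt(np.mean((y_true - blended) ** 2))

print(blended)
print(rmse_toy)
```

Because RMSE is expressed in the same units as `demand`, it can be read directly against the demand values shown earlier.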


In [116]:
pd.DataFrame({'Actual':y_test,'Predicted demand':y_pred_final})

     Actual  Predicted demand
0  0.040662          0.037249
1  0.220175          0.076578
2  0.004063          0.024288
3  0.027649          0.024288
4  0.014507          0.025111
5  0.019776          0.021925
6  0.008128          0.021925
7  0.016947          0.022823
8  0.000046          0.022823
9  0.050540          0.021968
