Reference: https://cloud.google.com/ml-engine/docs/scikit/getting-started-training

## Training a model on AI Platform
- Create a Python training module.
    - Add code to download the data from Cloud Storage so that AI Platform can use it.
    - Add code to export and save the model to Cloud Storage after AI Platform finishes training the model.

- Prepare a training application package.

- Submit the training job.
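
As a rough sketch of the packaging step, a training application package is an ordinary Python package directory holding an `__init__.py` and the trainer module. The names `DecisionTreeTrainer` and `training` match the `--package-path` and `--module-name` values used later in this notebook; the helper `make_trainer_package` and the placeholder file contents are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def make_trainer_package(root, package="DecisionTreeTrainer", module="training"):
    """Create an empty trainer package skeleton under `root` (illustrative helper)."""
    pkg_dir = Path(root) / package
    pkg_dir.mkdir(parents=True, exist_ok=True)
    # An __init__.py marks the directory as an importable Python package.
    (pkg_dir / "__init__.py").touch()
    # Placeholder trainer module; the real training code would go here.
    (pkg_dir / (module + ".py")).write_text("# download data, train, export model\n")
    return pkg_dir


# Build the skeleton in a throwaway directory and list what was created.
with tempfile.TemporaryDirectory() as tmp:
    pkg = make_trainer_package(tmp)
    created = sorted(p.name for p in pkg.iterdir())
    print(created)
```

In a real project the skeleton would be created next to the notebook rather than in a temporary directory.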

**Since I am using an sklearn model, the deployment code and trainer module will have slight changes.**

### To train the model on AI Platform, the data stored in Cloud Storage must already be preprocessed (categorical features converted to numerical ones), and for the sklearn model the features and target labels need to be stored in separate files. Let's go ahead and implement it.
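
To make the preprocessing idea concrete, here is a minimal sketch on made-up rows. The column names `day_period` and `TimeTaken` come from this notebook; the `agency` values and the toy numbers are assumptions. Sorted category codes stand in for `LabelEncoder` here so that the example needs only pandas.

```python
import pandas as pd

# Toy rows standing in for the service-request data (values are made up).
toy = pd.DataFrame({
    "day_period": ["morning", "night", "morning"],
    "agency": ["NYPD", "DOT", "NYPD"],
    "TimeTaken": [1.5, 20.0, 3.25],
})

# One-hot encode a low-cardinality categorical column.
one_hot = pd.get_dummies(toy["day_period"]).astype(int)

# Integer-code a categorical column using sorted category codes.
agency_code = toy["agency"].astype("category").cat.codes.rename("agency_encode")

encoded = pd.concat([agency_code, one_hot, toy["TimeTaken"]], axis=1)

# Features are every column except the last; the target is the last column.
features, target = encoded.iloc[:, :-1], encoded.iloc[:, -1]
print(features.columns.tolist())
print(target.tolist())
```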

In [1]:
import subprocess
import os

In [2]:
import pandas as pd
import numpy as np

In [3]:
import datetime
import os
import subprocess
import sys
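# sklearn.externals.joblib works with the older scikit-learn used here;
# newer scikit-learn releases removed it in favour of `import joblib`.
from sklearn.externals import joblib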

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split



In [4]:
X_cols = ['zip_encode', 'location_encode', 'community_encode', 'agency_encode',
       'complaint_encode', 'afternoon', 'evening', 'morning', 'night',
       'Fri-Sat-Sun', 'Mon-Tue', 'Wed-Thu']
y_col = 'TimeTaken'

In [5]:
cols = ['index', 'day_period', 'day_of_week', \
             'zip_encode', 'location_encode', \
             'community_encode', 'agency_encode', \
             'complaint_encode', 'TimeTaken']

In [6]:
#Bucket where files are saved
path = 'gs://nyc_servicerequest/processedInput/*'

**Downloading files from the GCS Storage bucket for preprocessing**

In [38]:
!gsutil cp 'gs://nyc_servicerequest/processedInput/evalx2.csv' 'localsave/eval2.csv'

Copying gs://nyc_servicerequest/processedInput/evalx2.csv...
| [1 files][180.4 MiB/180.4 MiB]                                                
Operation completed over 1 objects/180.4 MiB.                                    


In [50]:
files = list(os.listdir('localsave'))

In [54]:
#Preprocess each downloaded file and save the result locally (uploaded to the bucket in the next step).
for each in files[6:]:
    try:
        df = pd.read_csv('localsave/'+each, header=None)
        df.columns = cols
        df.drop('index', axis=1, inplace=True)

        one_hot_dp = pd.get_dummies(df['day_period'])
        df = df.drop('day_period', axis=1)
        df = df.join(one_hot_dp)

        one_hot_week = pd.get_dummies(df['day_of_week'])
        df = df.drop('day_of_week', axis=1)
        df = df.join(one_hot_week)

        df_new = df[['zip_encode', 'location_encode', 'community_encode', 'agency_encode',
               'complaint_encode']].apply(LabelEncoder().fit_transform)

        df_old = df[['afternoon', 'evening', 'morning',
            'night', 'Fri-Sat-Sun', 'Mon-Tue', 'Wed-Thu','TimeTaken']]

        ddf = pd.concat([df_new, df_old], axis=1)

        df1 = ddf[ddf['TimeTaken'] < 100]
        df1.reset_index(inplace=True)

        df1.to_csv('localsave/fix_'+each, index=False, header=False)
    except Exception as exc:
        # Report which file failed and why instead of silently swallowing the error
        print(each, exc)
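
One detail of the loop above is worth spelling out: calling `reset_index` without `drop=True` keeps the old row labels as a new leading column. This is likely why the saved files carry an extra first column and why later cells slice with `.iloc[:, 1:]`. A minimal sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({"feature": [10, 20, 30], "TimeTaken": [5.0, 150.0, 7.0]})

# Filtering keeps the original row labels (0 and 2 here).
filtered = df[df["TimeTaken"] < 100]

# Without drop=True, the old labels become a new first column named "index".
with_index = filtered.reset_index()
# With drop=True, the rows are renumbered and the old labels are discarded.
without_index = filtered.reset_index(drop=True)

print(with_index.columns.tolist())
print(without_index.columns.tolist())
```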


**Saving the files back to GCS bucket**

In [55]:
!gsutil cp localsave/fix_* gs://nyc_servicerequest/encodedInput/

Copying file://localsave/fix_eval1.csv [Content-Type=text/csv]...
Copying file://localsave/fix_eval2.csv [Content-Type=text/csv]...
Copying file://localsave/fix_train0.csv [Content-Type=text/csv]...
Copying file://localsave/fix_train1.csv [Content-Type=text/csv]...
|
==> NOTE: You are performing a sequence of gsutil operations that may
run significantly faster if you instead use gsutil -m cp ... Please
see the -m section under "gsutil help options" for further information
about when gsutil -m can be advantageous.

Copying file://localsave/fix_train2.csv [Content-Type=text/csv]...
Copying file://localsave/fix_train3.csv [Content-Type=text/csv]...
Copying file://localsave/fix_train4.csv [Content-Type=text/csv]...
Copying file://localsave/fix_train5.csv [Content-Type=text/csv]...
Copying file://localsave/fix_train6.csv [Content-Type=text/csv]...
Copying file://localsave/fix_train7.csv [Content-Type=text/csv]...
-
Operation completed over 10 objects/456.9 MiB.                              

**Test one file**

In [58]:
!gsutil cp gs://nyc_servicerequest/encodedInput/train0.csv data/demo2.csv

Copying gs://nyc_servicerequest/encodedInput/train0.csv...
/ [1 files][ 45.9 MiB/ 45.9 MiB]                                                
Operation completed over 1 objects/45.9 MiB.                                     


In [60]:
pd.read_csv('data/demo2.csv', header=None).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,2,1,1,1,3,2,0,0,1,0,0,0,1,1.273
1,5,1,3,1,3,2,0,0,1,0,0,0,1,85.838
2,6,1,3,1,3,2,1,0,0,0,1,0,0,59.059
3,9,1,3,1,3,2,1,0,0,0,0,0,1,79.434
4,10,1,0,1,2,2,0,0,1,0,0,0,1,26.4


#### The training and evaluation files are now stored in the bucket

#### To use the sklearn model, the features and target labels must be stored separately, i.e. df.iloc[:, :-1] in one file and df.iloc[:, -1] in another.

In [74]:
pd.read_csv('localsave/fix_eval1.csv', header=None).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,0,3,3,0,3,1,1,0,0,0,0,0,1,22.433
1,4,3,3,0,2,1,0,0,1,0,0,0,1,13.531
2,6,3,0,0,3,1,1,0,0,0,0,0,1,22.133
3,7,3,3,0,3,1,0,1,0,0,1,0,0,65.283
4,8,3,3,0,3,1,0,0,0,1,0,1,0,37.917


In [62]:
def func_(filename):
    df = pd.read_csv(filename, header=None)
    return df.iloc[:, :-1], df.iloc[:, -1]

In [96]:
list_of_files = os.listdir('localsave/')

In [101]:
j = 0
for i in range(len(list_of_files)):
    each = list_of_files[i]
    if 'fix_' in each: 
        x, y = func_('localsave/'+each)
        x.to_csv('localsave2/x_train'+str(j)+'.csv', header=None, index=None)
        y.to_csv('localsave2/y_train'+str(j)+'.csv', header=None, index=None)
        j = j + 1



In [94]:
#To read file
#pd.read_csv('localsave2/x_eval1.csv', header=None).iloc[:, 1:]

Save the files in the GCP Storage bucket

In [102]:
!gsutil cp localsave2/* gs://nyc_servicerequest/sklearnInput/

Copying file://localsave2/x_train0.csv [Content-Type=text/csv]...
Copying file://localsave2/x_train1.csv [Content-Type=text/csv]...
Copying file://localsave2/x_train2.csv [Content-Type=text/csv]...
Copying file://localsave2/x_train3.csv [Content-Type=text/csv]...
/
==> NOTE: You are performing a sequence of gsutil operations that may
run significantly faster if you instead use gsutil -m cp ... Please
see the -m section under "gsutil help options" for further information
about when gsutil -m can be advantageous.

Copying file://localsave2/x_train4.csv [Content-Type=text/csv]...
Copying file://localsave2/x_train5.csv [Content-Type=text/csv]...
Copying file://localsave2/x_train6.csv [Content-Type=text/csv]...
Copying file://localsave2/x_train7.csv [Content-Type=text/csv]...
Copying file://localsave2/x_train8.csv [Content-Type=text/csv]...
Copying file://localsave2/x_train9.csv [Content-Type=text/csv]...
Copying file://localsave2/y_train0.csv [Content-Type=text/csv]...
Copying file://local

**Test Operations**

In [109]:
a = pd.read_csv('localsave2/x_train0.csv', header=None).iloc[:, 1:]
b = pd.read_csv('localsave2/x_train1.csv', header=None).iloc[:, 1:]

In [111]:
pd.concat([a, b]).shape, a.shape, b.shape

((2380564, 12), (1191200, 12), (1189364, 12))

In [113]:
#pd.read_csv('localsave2/y_train0.csv', header=None)

#### Merging all the data points and saving them in one file for the job.

In [135]:
xs = pd.DataFrame()
ys = pd.DataFrame()

xeval = pd.DataFrame()
yeval = pd.DataFrame()

for each in list(os.listdir('localsave2')):
    each = 'localsave2/'+each
    if('x_train' in each):
        a = pd.read_csv(each, header=None).iloc[:, 1:]
        xs = pd.concat([xs, a])
    elif('y_train' in each):
        a = pd.read_csv(each, header=None)
        ys = pd.concat([ys, a])
    elif('y_eval' in each):
        a = pd.read_csv(each, header=None)
        yeval = pd.concat([yeval, a])
    elif('x_eval' in each):
        a = pd.read_csv(each, header=None).iloc[:, 1:]
        xeval = pd.concat([xeval, a])
    else:
        pass
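
Calling `pd.concat` inside the loop copies the accumulated frame on every iteration. A possible alternative, sketched here with a hypothetical helper `concat_frames`, is to collect the pieces in a list and concatenate once. Note that `ignore_index=True` renumbers the rows, which differs from the cell above, where the original row labels are kept.

```python
import pandas as pd


def concat_frames(frames):
    """Concatenate an iterable of DataFrames with a single pd.concat call."""
    frames = list(frames)
    if not frames:
        # Nothing to merge: return an empty frame, like the initial `xs` above.
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)


parts = [pd.DataFrame({"a": [1, 2]}), pd.DataFrame({"a": [3]})]
merged = concat_frames(parts)
print(merged.shape)
```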

In [136]:
xs.shape, ys.shape, xeval.shape, yeval.shape

((9526573, 12), (9526573, 1), (2380440, 12), (2380440, 1))

In [138]:
xs.to_csv('localsave2/x_all_train.csv')
ys.to_csv('localsave2/y_all_train.csv')
xeval.to_csv('localsave2/x_all_eval.csv')
yeval.to_csv('localsave2/y_all_eval.csv')
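
Because `to_csv` is called with its defaults here, each merged file contains a header row and the index as its first column. The sketch below, on a tiny made-up frame and an in-memory buffer, shows that round trip and a variant that writes neither.

```python
import io

import pandas as pd

frame = pd.DataFrame([[1, 2], [3, 4]])

# Default to_csv() writes a header row plus the index as the first column.
buffer = io.StringIO()
frame.to_csv(buffer)
buffer.seek(0)
round_trip = pd.read_csv(buffer)
print(round_trip.columns.tolist())

# Dropping the first column recovers the original values.
recovered = round_trip.iloc[:, 1:].values.tolist()
print(recovered)

# Writing without index and header avoids the extra row and column altogether.
clean = io.StringIO()
frame.to_csv(clean, index=False, header=False)
clean.seek(0)
clean_values = pd.read_csv(clean, header=None).values.tolist()
print(clean_values)
```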

In [139]:
!gsutil cp localsave2/* gs://nyc_servicerequest/sklearnInput/ 

Copying file://localsave2/x_all_eval.csv [Content-Type=text/csv]...
Copying file://localsave2/x_all_train.csv [Content-Type=text/csv]...
==> NOTE: You are uploading one or more large file(s), which would run
significantly faster if you enable parallel composite uploads. This
feature can be enabled by editing the
"parallel_composite_upload_threshold" value in your .boto
configuration file. However, note that if you do this large files will
be uploaded as `composite objects
<https://cloud.google.com/storage/docs/composite-objects>`_,which
means that any user who downloads such objects will need to have a
compiled crcmod installed (see "gsutil help crcmod"). This is because
without a compiled crcmod, computing checksums on composite objects is
so slow that gsutil disables downloads of composite objects.

Copying file://localsave2/x_eval1.csv [Content-Type=text/csv]...
Copying file://localsave2/x_eval2.csv [Content-Type=text/csv]...
-
==> NOTE: You are performing a sequence of gsutil opera

In [148]:
pd.read_csv('localsave2/y_all_eval.csv').iloc[:, 1:].head()

Unnamed: 0,0
0,79.386
1,27.974
2,85.845
3,80.797
4,67.71


### Sample Code for the Model 
**DO NOT RUN**

In [1]:
import datetime
import os
import subprocess
import sys
import pandas as pd
import pickle

# Fill in your Cloud Storage bucket name
BUCKET_NAME = 'nyc_servicerequest'

In [None]:
data_filename = 'x_all_train.csv'
target_filename = 'y_all_train.csv'
data_dir = 'gs://nyc_servicerequest/sklearnInput'

# gsutil outputs everything to stderr so we need to divert it to stdout.
subprocess.check_call(['gsutil', 'cp', os.path.join(data_dir, data_filename), data_filename], stderr=sys.stdout)

subprocess.check_call(['gsutil', 'cp', os.path.join(data_dir, target_filename), target_filename], stderr=sys.stdout)


In [None]:
# The merged files downloaded above were written with a header row and an
# index column, so read them with the header and drop the first column.
xs = pd.read_csv(data_filename).iloc[:, 1:]
ys = pd.read_csv(target_filename).iloc[:, 1:]

In [None]:
# Load data into pandas, then use `.values` to get NumPy arrays
data = xs.values
target = ys.values

# Convert one-column 2D array into 1D array for use with scikit-learn
target = target.reshape((target.size,))

In [None]:
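# Imports needed by this cell
from sklearn.externals import joblib
from sklearn.tree import DecisionTreeRegressor

#Train the model (DecisionTreeRegressor has no `verbose` argument)
dec = DecisionTreeRegressor()
dec.fit(data, target)

#Export the regressor to a file
model_filename = 'model.joblib'
joblib.dump(dec, model_filename)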

In [None]:
# Alternative export with pickle; `dec` is the model trained above
with open('model.pkl', 'wb') as model_file:
    pickle.dump(dec, model_file)


In [None]:
# Upload the saved model file to Cloud Storage
gcs_model_path = os.path.join('gs://', BUCKET_NAME, 'models',
    datetime.datetime.now().strftime('decision_tree_%Y%m%d_%H%M%S') , model_filename)
subprocess.check_call(['gsutil', 'cp', model_filename, gcs_model_path],
    stderr=sys.stdout)
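
The upload path above is assembled with `os.path.join`, which uses the local operating system's separator. A small sketch of an OS-independent variant follows; the helper name `gcs_model_path` and its parameters are hypothetical, and the default prefix mirrors the model directory name listed in the bucket later in this notebook.

```python
import datetime
import posixpath


def gcs_model_path(bucket, model_filename, now=None, prefix="decision_tree_"):
    """Build gs://<bucket>/models/<prefix><timestamp>/<model_filename>."""
    now = now or datetime.datetime.now()
    timestamp_dir = now.strftime(prefix + "%Y%m%d_%H%M%S")
    # posixpath always joins with '/', which is what Cloud Storage URIs use.
    return posixpath.join("gs://", bucket, "models", timestamp_dir, model_filename)


example = gcs_model_path(
    "nyc_servicerequest",
    "model.joblib",
    now=datetime.datetime(2019, 7, 27, 23, 12, 18),
)
print(example)
```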

**Do not run locally till this point.**

In [115]:
#Make sure you put the correct values here !!!
BUCKET='nyc_servicerequest'
PROJECT='summerai'
REGION='us-west1'

In [116]:
import os
os.environ['BUCKET'] = BUCKET
os.environ['PROJECT'] = PROJECT
os.environ['REGION'] = REGION

### Run trainer Locally

Below are the commands to run the trainer locally. I won't run them here, since the dataset is large and I want to leave the computation to AI Platform, but they are useful if you want to test the trainer and fix any errors before submitting it as a job.

In [149]:
TRAINING_PACKAGE_PATH='./DecisionTreeTrainer/'
MAIN_TRAINER_MODULE='DecisionTreeTrainer.training'

In [None]:
%%bash
gcloud ai-platform local train \
    --package-path $TRAINING_PACKAGE_PATH \
    --module-name $MAIN_TRAINER_MODULE
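
If a local run on the full dataset is too heavy, one option is to exercise the trainer on only the first rows of each file. The sketch below assumes a hypothetical helper `load_sample` and an in-memory CSV standing in for a large training file; it is not part of the trainer module used in this notebook.

```python
import io

import pandas as pd


def load_sample(path_or_buffer, n_rows=1000):
    """Read only the first `n_rows` data rows of a CSV (illustrative helper)."""
    return pd.read_csv(path_or_buffer, nrows=n_rows)


# In-memory CSV with 100 data rows standing in for a large training file.
csv_text = "f1,f2,target\n" + "\n".join(
    "{},{},{}".format(i, i * 2, i * 0.5) for i in range(100)
)
sample = load_sample(io.StringIO(csv_text), n_rows=10)
print(sample.shape)
```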

### Submit Training Job

In [150]:
%%bash
BUCKET_NAME="nyc_servicerequest"
JOB_NAME="decision_tree_$(date +"%Y%m%d_%H%M%S")"
JOB_DIR=gs://$BUCKET_NAME/temp/
TRAINING_PACKAGE_PATH="./DecisionTreeTrainer/"
MAIN_TRAINER_MODULE="DecisionTreeTrainer.training"
REGION=us-west1
RUNTIME_VERSION=1.14
PYTHON_VERSION=2.7
SCALE_TIER=BASIC

In [154]:
%%bash
gcloud ai-platform jobs submit training "decision_tree_$(date +"%Y%m%d_%H%M%S")" \
--job-dir=gs://nyc_servicerequest/temp/ \
--package-path=./DecisionTreeTrainer/ \
--module-name=DecisionTreeTrainer.training \
--region=us-west1 \
--runtime-version=1.14 \
--python-version=2.7 \
--scale-tier=BASIC


jobId: decision_tree_20190727_230952
state: QUEUED


Job [decision_tree_20190727_230952] submitted successfully.
Your job is still active. You may view the status of your job with the command

  $ gcloud ai-platform jobs describe decision_tree_20190727_230952

or continue streaming the logs with the command

  $ gcloud ai-platform jobs stream-logs decision_tree_20190727_230952


### Verify your model file in Cloud Storage

In [156]:
!gsutil ls gs://nyc_servicerequest/models/

gs://nyc_servicerequest/models/
gs://nyc_servicerequest/models/decision_tree_20190727_231218/


### Deploying the model to AI Platform for online predictions

- The model is already saved in Cloud Storage.
- Create a model resource on AI Platform.
- Create a model version, linking the saved model.
- Make online predictions.
- Check the accuracy.

1. Create a model resource for your model versions, filling in your desired name for the model without enclosing brackets

In [161]:
%%bash
REGION='us-west1'
MODEL_NAME='decision_tree_model_1'
MODEL_VERSION='v1'
FRAMEWORK='SCIKIT_LEARN'
gcloud ai-platform models create $MODEL_NAME
MODEL_LOCATION=$(gsutil ls gs://nyc_servicerequest/models/ | tail -1)

gcloud ai-platform versions create $MODEL_VERSION \
--model $MODEL_NAME \
--origin $MODEL_LOCATION \
--runtime-version=1.14 \
--framework $FRAMEWORK \
--python-version=2.7


Created ml engine model [projects/summerai/models/decision_tree_model_1].
Creating version (this might take a few minutes)......

### Get information about the new version

In [164]:
%%bash
gcloud ai-platform versions describe v1 \
--model='decision_tree_model_1'

createTime: '2019-07-27T23:29:54Z'
deploymentUri: gs://nyc_servicerequest/models/decision_tree_20190727_231218/
etag: RAvQjIdybsA=
framework: SCIKIT_LEARN
isDefault: true
machineType: mls1-c1-m2
name: projects/summerai/models/decision_tree_model_1/versions/v1
pythonVersion: '2.7'
runtimeVersion: '1.14'
state: READY


![Deployed Model](Images/5.png)

### The Model is finally Deployed :) :) :)

### Send online Prediction Requests

Now we need to check the predictions for our Test/Eval datasets to see how the model performs.

In [2]:
test1 = pd.read_csv('localsave2/x_all_eval.csv', header=None).iloc[1:, 1:]

In [3]:
# `.values` replaces the deprecated `as_matrix()`
test_features = test1.values.tolist()


In [9]:
test_labels = pd.read_csv('localsave2/y_all_eval.csv', header=None).iloc[1:, 1:]\
                .values.tolist()


In [10]:
len(test_features)/100

23804.4

In [11]:
import googleapiclient.discovery

# Fill in your PROJECT_ID, VERSION_NAME and MODEL_NAME before running
# this code.
PROJECT_ID = 'summerai'
VERSION_NAME = 'v1'
MODEL_NAME = 'decision_tree_model_1'

service = googleapiclient.discovery.build('ml', 'v1')
name = 'projects/{}/models/{}'.format(PROJECT_ID, MODEL_NAME)
name += '/versions/{}'.format(VERSION_NAME)

"""
Sample features - 
['zip_encode', 'location_encode', 'community_encode', 'agency_encode',
'complaint_encode', 'afternoon', 'evening', 'morning', 'night', 'Fri-Sat-Sun', 'Mon-Tue', 'Wed-Thu']
[0, 3, 1, 0, 2, 0, 1, 0, 0, 0, 1, 0]        
"""
data = test_features[:int(len(test_features)/100)] # pandas_dataframe(list[list])

responses = service.projects().predict(name=name, body={'instances': data}).execute()

if 'error' in responses:
    print(responses['error'])
else:
    # Print the first 10 responses
    for i, response in enumerate(responses['predictions'][:10]):
        print('Prediction: {}\t\tActual: {}'.format(response, test_labels[i][0]))

Prediction: 30.373710374885594		Actual: 79.38600000000002
Prediction: 35.84330128427631		Actual: 27.974
Prediction: 35.14199171671147		Actual: 85.845
Prediction: 38.73326135615535		Actual: 80.797
Prediction: 36.035388609715234		Actual: 67.71
Prediction: 21.873060107837222		Actual: 15.383
Prediction: 22.253109202714228		Actual: 22.483
Prediction: 22.446519035223197		Actual: 14.45
Prediction: 22.446519035223197		Actual: 1.867
Prediction: 20.00250884152769		Actual: 32.189
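
The request above sends about 1% of the eval rows in a single call. Online prediction requests may be subject to size limits, so scoring a larger share of the data would probably mean splitting the instances into batches and sending each batch with the same `predict` call. A minimal sketch of the splitting step follows; the helper `chunked` and the batch size are assumptions.

```python
def chunked(instances, batch_size):
    """Yield successive slices of `instances` with at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(instances), batch_size):
        yield instances[start:start + batch_size]


# Seven made-up feature rows split into batches of three.
rows = [[i, i + 1] for i in range(7)]
batches = list(chunked(rows, 3))
print([len(b) for b in batches])
```

Each batch would then go into `body={'instances': batch}`, and the returned predictions would be appended to one list in order.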


In [13]:
def error_anaylsis(y_pred, y_actual):
    # Accept plain lists (e.g. the API response) as well as NumPy arrays
    y_pred = np.asarray(y_pred, dtype=float)
    y_actual = np.asarray(y_actual, dtype=float)

    # m is the number of evaluated examples
    m = len(y_actual)

    # sum of squared residuals
    ssr = np.sum((y_pred - y_actual)**2)

    # mean squared error
    mse = ssr / m

    # root mean squared error
    rmse = np.sqrt(mse)

    #  total sum of squares
    sst = np.sum((y_actual - np.mean(y_actual))**2)

    # R2 score
    r2_score = 1 - (ssr/sst)

    return mse, rmse, ssr, sst, r2_score
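
The metrics used here can be checked by hand on a tiny example. The three predictions and targets below are made up; the formulas are the standard ones for mean squared error, root mean squared error and the R2 score.

```python
import numpy as np

y_pred = np.array([2.0, 4.0, 6.0])
y_actual = np.array([1.0, 5.0, 6.0])

# Residuals are 1, -1 and 0, so the sum of squared residuals is 2.
sse = np.sum((y_pred - y_actual) ** 2)
# Mean squared error divides by the number of examples.
mse = sse / len(y_actual)
rmse = np.sqrt(mse)
# The mean of y_actual is 4, so the total sum of squares is 9 + 1 + 4 = 14.
sst = np.sum((y_actual - np.mean(y_actual)) ** 2)
r2 = 1 - sse / sst
print(sse, mse, rmse, sst, r2)
```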

In [15]:
mse, rmse, ssr, sst, r2_score = error_anaylsis(responses['predictions'], np.concatenate(test_labels[:int(len(test_features)/100)]))


In [17]:
print("RMSE for the Decision Tree model on the first 1% of the eval set ", rmse)

RMSE for the Decision Tree model on the first 1% of the eval set  29.208398873435645
