# Image Classification from scratch with TPUs on Cloud ML Engine using ResNet

This notebook demonstrates how to do image classification from scratch on a dataset of sidewalk image crops using TPUs and the ResNet trainer.

In [1]:
import os
PROJECT = 'sidewalk-dl' # REPLACE WITH YOUR PROJECT ID
BUCKET = 'sidewalk_crops_subset' # REPLACE WITH YOUR BUCKET NAME
REGION = 'us-central1' # REPLACE WITH YOUR BUCKET REGION e.g. us-central1

# do not change these
os.environ['PROJECT'] = PROJECT
os.environ['BUCKET'] = BUCKET
os.environ['REGION'] = REGION
os.environ['TFVERSION'] = '1.9'

In [2]:
%%bash
gcloud config set project $PROJECT
gcloud config set compute/region $REGION

Updated property [core/project].
Updated property [compute/region].


## Convert JPEG images to TensorFlow Records

My dataset consists of JPEG images in Google Cloud Storage. Two CSV files (train_set.csv and eval_set.csv) list the images, one per line, in the following format:
   image-name, category

Instead of reading the images from JPEG each time, we'll convert the JPEG data and store it as TF Records.


In [3]:
%%bash
gsutil cat gs://sidewalk_crops_subset/train_set.csv | head -5 > /tmp/input.csv
cat /tmp/input.csv

gs://sidewalk_crops_subset/imgs/10.jpg,curb_ramp
gs://sidewalk_crops_subset/imgs/100.jpg,curb_ramp
gs://sidewalk_crops_subset/imgs/10000.jpg,curb_ramp
gs://sidewalk_crops_subset/imgs/100000.jpg,curb_ramp
gs://sidewalk_crops_subset/imgs/100001.jpg,curb_ramp


In [4]:
%%bash
gsutil cat gs://sidewalk_crops_subset/train_set.csv  | sed 's/,/ /g' | awk '{print $2}' | sort | uniq > /tmp/labels.txt
cat /tmp/labels.txt

curb_ramp
missing_ramp
no_sidewalk
obstruction
occlusion
other
surface_problem
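
The sed/awk pipeline above can also be written in Python. What follows is a minimal sketch for Python 3, assuming every CSV row has exactly the two fields shown earlier (image URI, category). The function name `unique_labels` and the second sample row are illustrative and not part of the original notebook.

```python
import csv
import io


def unique_labels(csv_text):
    """Return the sorted, de-duplicated categories from 'uri,category' rows."""
    reader = csv.reader(io.StringIO(csv_text))
    # Ignore blank or malformed rows; strip stray whitespace around the category.
    return sorted({row[1].strip() for row in reader if len(row) >= 2})


# Rows in the same shape as train_set.csv (the 'example.jpg' row is made up).
sample = (
    "gs://sidewalk_crops_subset/imgs/10.jpg,curb_ramp\n"
    "gs://sidewalk_crops_subset/imgs/example.jpg,obstruction\n"
    "gs://sidewalk_crops_subset/imgs/100.jpg,curb_ramp\n"
)
labels = unique_labels(sample)
print(labels)
```

Using the csv module rather than splitting on commas by hand keeps the sketch correct if a field is ever quoted.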


## Clone the TPU repo

Let's git clone the repo and get the preprocessing and model files. The model code has imports of the form:
<pre>
import resnet_model as model_lib
</pre>
We will need to change this to:
<pre>
from . import resnet_model as model_lib
</pre>
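
The import rewrite described above can be sketched in Python with a regular expression. This is an illustration only, not the method the notebook uses: the function name `make_relative` is hypothetical, and the word-boundary check is an extra safeguard added for this sketch.

```python
import re


def make_relative(source, module_names):
    """Rewrite top-level 'import <module>' lines as 'from . import <module>'."""
    for name in module_names:
        # Anchor at the start of a line and require a word boundary after the
        # name, so 'resnet_model' does not also match 'resnet_model_v2'.
        pattern = r"^import {}\b".format(re.escape(name))
        replacement = "from . import {}".format(name)
        source = re.sub(pattern, replacement, source, flags=re.MULTILINE)
    return source


original = "import os\nimport resnet_model as model_lib\nimport imagenet_input\n"
rewritten = make_relative(original, ["resnet_model", "imagenet_input"])
print(rewritten)
```

Only the sibling modules named in the list are rewritten; standard-library imports such as `import os` are left alone.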


In [44]:
%%writefile copy_resnet_files.sh
#!/bin/bash
rm -rf tpu
git clone https://github.com/tensorflow/tpu
cd tpu
TFVERSION=$1
echo "Switching to version r$TFVERSION"
git checkout r$TFVERSION
cd ..
  
MODELCODE=tpu/models/official/resnet
OUTDIR=mymodel
rm -rf $OUTDIR

# preprocessing
cp -r imgclass $OUTDIR   # brings in setup.py and __init__.py
cp tpu/tools/datasets/jpeg_to_tf_record.py $OUTDIR/trainer/preprocess.py

# model: fix imports
for FILE in $(ls -p $MODELCODE | grep -v /); do
    CMD="cat $MODELCODE/$FILE "
    for f2 in $(ls -p $MODELCODE | grep -v /); do
        # Strip only a trailing ".py"; an unescaped dot would match any character
        MODULE=$(echo $f2 | sed 's/\.py$//')
        CMD="$CMD | sed 's/^import ${MODULE}/from . import ${MODULE}/g' "
    done
    CMD="$CMD > $OUTDIR/trainer/$FILE"
    eval $CMD
done
find $OUTDIR
echo "Finished copying files into $OUTDIR"

Overwriting copy_resnet_files.sh


In [45]:
!bash ./copy_resnet_files.sh $TFVERSION

Cloning into 'tpu'...
remote: Enumerating objects: 1, done.
remote: Counting objects: 100% (1/1), done.
remote: Total 2355 (delta 0), reused 0 (delta 0), pack-reused 2354
Receiving objects: 100% (2355/2355), 1.38 MiB | 14.75 MiB/s, done.
Resolving deltas: 100% (1436/1436), done.
Switching to version r1.9
Branch 'r1.9' set up to track remote branch 'r1.9' from 'origin'.
Switched to a new branch 'r1.9'
mymodel
mymodel/setup.py
mymodel/trainer
mymodel/trainer/.gitignore
mymodel/trainer/imagenet_input.py
mymodel/trainer/preprocess.py
mymodel/trainer/README.md
mymodel/trainer/resnet_k8s.yaml
mymodel/trainer/resnet_main.py
mymodel/trainer/resnet_model.py
mymodel/trainer/resnet_preprocessing.py
mymodel/trainer/__init__.py
Finished copying files into mymodel


## Enable TPU service account

Allow Cloud ML Engine to access the TPU and bill usage to your project.

690616814548-compute@developer.gserviceaccount.com

In [16]:
%%writefile enable_tpu_mlengine.sh
SVC_ACCOUNT=$(curl -H "Authorization: Bearer $(gcloud auth print-access-token)"  \
    https://ml.googleapis.com/v1/projects/${PROJECT}:getConfig \
              | grep tpuServiceAccount | tr '"' ' ' | awk '{print $3}' )
echo "Enabling TPU service account $SVC_ACCOUNT to act as Cloud ML Service Agent for project $PROJECT"
gcloud projects add-iam-policy-binding $PROJECT \
    --member serviceAccount:$SVC_ACCOUNT --role roles/ml.serviceAgent
echo "Done"

Overwriting enable_tpu_mlengine.sh
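
The script above pulls `tpuServiceAccount` out of the raw response text with grep, tr and awk, which depends on how the JSON happens to be laid out. A minimal sketch of the same extraction with a JSON parser follows. The response body here is a hypothetical placeholder (including the account name), and `find_key` is an illustrative helper; the real response may nest the field differently, which is why the helper searches the whole document.

```python
import json


def find_key(node, key):
    """Depth-first search for `key` in nested dicts/lists; first value or None."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


# Hypothetical response body; the account name is a placeholder.
response_text = (
    '{"config": {"tpuServiceAccount": '
    '"service-123@cloud-tpu.iam.gserviceaccount.com"}}'
)
svc_account = find_key(json.loads(response_text), "tpuServiceAccount")
print(svc_account)
```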


In [7]:
%%writefile enable_tpu_mlengine.sh
SVC_ACCOUNT=690616814548-compute@developer.gserviceaccount.com
echo "Enabling TPU service account $SVC_ACCOUNT to act as Cloud ML Service Agent for project $PROJECT"
gcloud projects add-iam-policy-binding $PROJECT \
    --member serviceAccount:$SVC_ACCOUNT --role roles/ml.serviceAgent
echo "Done"

Overwriting enable_tpu_mlengine.sh


In [17]:
!bash ./enable_tpu_mlengine.sh

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   235    0   235    0     0    594      0 --:--:-- --:--:-- --:--:--   593
Enabling TPU service account service-149546808362@cloud-tpu.iam.gserviceaccount.com to act as Cloud ML Service Agent for project sidewalk-dl
bindings:
- members:
  - serviceAccount:service-690616814548@compute-system.iam.gserviceaccount.com
  role: roles/compute.serviceAgent
- members:
  - serviceAccount:service-690616814548@container-engine-robot.iam.gserviceaccount.com
  role: roles/container.serviceAgent
- members:
  - serviceAccount:service-690616814548@dataflow-service-producer-prod.iam.gserviceaccount.com
  role: roles/dataflow.serviceAgent
- members:
  - serviceAccount:690616814548-compute@developer.gserviceaccount.com
  - serviceAccount:690616814548@cloudservices.gserviceaccount.com
  - serviceAccount:service-690616814548@containerregistry.iam.g

Now run the preprocessing over the full training and evaluation datasets. This will happen in Cloud Dataflow.

In [17]:
%%bash
export PYTHONPATH=${PYTHONPATH}:${PWD}/mymodel
gsutil -m rm -rf gs://${BUCKET}/tpu/resnet/data
python -m trainer.preprocess \
       --train_csv gs://sidewalk_crops_subset/train_set.csv \
       --validation_csv gs://sidewalk_crops_subset/eval_set.csv \
       --labels_file gs://sidewalk_crops_subset/labels.txt \
       --project_id $PROJECT \
       --output_dir gs://${BUCKET}/tpu/resnet/data

Collecting apache-beam==2.8.0
  Using cached https://files.pythonhosted.org/packages/81/ea/11cec69a659af024f7f37e928ff533ad5e30b7a519d9982e2bb5b81fcb52/apache-beam-2.8.0.zip
  Saved /tmp/tmpAmj08Q/apache-beam-2.8.0.zip
Successfully downloaded apache-beam
Collecting apache-beam==2.8.0
  Using cached https://files.pythonhosted.org/packages/0f/63/ea5453ba656d060936acf41d2ec057f23aafd69649e2129ac66fdda67d48/apache_beam-2.8.0-cp27-cp27mu-manylinux1_x86_64.whl
  Saved /tmp/tmpAmj08Q/apache_beam-2.8.0-cp27-cp27mu-manylinux1_x86_64.whl
Successfully downloaded apache-beam
Read in 5 labels, from curb_ramp to nullcrop


CommandException: 1 files/objects could not be removed.
CommandException: 1 files/objects could not be removed.
Instructions for updating:
Use tf.gfile.GFile.
  standard_options = transform_node.inputs[0].pipeline.options.view_as(


The above preprocessing step will take <b>15-20 minutes</b>. Wait for the job to finish before you proceed. Navigate to the [Cloud Dataflow section of the GCP web console](https://console.cloud.google.com/dataflow) to monitor job progress. You will see something like this: <img src="dataflow.png" />

Alternatively, you can simply copy my already preprocessed files and proceed to the next step:
<pre>
gsutil -m cp gs://cloud-training-demos/tpu/resnet/data/* gs://${BUCKET}/tpu/resnet/copied_data
</pre>

In [20]:
%%bash
gsutil ls gs://${BUCKET}/tpu/resnet/data

gs://sidewalk_crops_subset/tpu/resnet/data/train-00000-of-00087
gs://sidewalk_crops_subset/tpu/resnet/data/train-00001-of-00087
gs://sidewalk_crops_subset/tpu/resnet/data/train-00002-of-00087
gs://sidewalk_crops_subset/tpu/resnet/data/train-00003-of-00087
gs://sidewalk_crops_subset/tpu/resnet/data/train-00004-of-00087
gs://sidewalk_crops_subset/tpu/resnet/data/train-00005-of-00087
gs://sidewalk_crops_subset/tpu/resnet/data/train-00006-of-00087
gs://sidewalk_crops_subset/tpu/resnet/data/train-00007-of-00087
gs://sidewalk_crops_subset/tpu/resnet/data/train-00008-of-00087
gs://sidewalk_crops_subset/tpu/resnet/data/train-00009-of-00087
gs://sidewalk_crops_subset/tpu/resnet/data/train-00010-of-00087
gs://sidewalk_crops_subset/tpu/resnet/data/train-00011-of-00087
gs://sidewalk_crops_subset/tpu/resnet/data/train-00012-of-00087
gs://sidewalk_crops_subset/tpu/resnet/data/train-00013-of-00087
gs://sidewalk_crops_subset/tpu/resnet/data/train-00014-of-00087
gs://sidewalk_crops_subset/tpu/resnet/da

## Train on the Cloud

In [29]:
%%bash
echo -n "--num_train_images=$(gsutil cat gs://sidewalk_crops_subset/train_set.csv | wc -l)  "
echo -n "--num_eval_images=$(gsutil cat gs://sidewalk_crops_subset/eval_set.csv | wc -l)  "
echo -n "--num_label_classes=$(gsutil cat gs://sidewalk_crops_subset/labels.txt | wc -l)"

--num_train_images=223912  --num_eval_images=24430  --num_label_classes=4
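
Note that `wc -l` counts newline characters, so a file whose last line has no trailing newline is reported one short. That is one possible explanation (an assumption, not checked against the bucket) for the cell printing 4 label classes while the preprocessing log reported 5 labels and the training command uses 5. A minimal sketch of a count that does not depend on the trailing newline, using made-up label names:

```python
def count_records(text):
    """Count non-empty lines, whether or not the last one ends with a newline."""
    return sum(1 for line in text.splitlines() if line.strip())


# Hypothetical labels file: five entries, no newline after the last one.
labels_text = "label_a\nlabel_b\nlabel_c\nlabel_d\nlabel_e"

newline_count = labels_text.count("\n")    # what `wc -l` would report
record_count = count_records(labels_text)  # the actual number of labels
print(newline_count, record_count)
```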

In [28]:
%%bash
TOPDIR=gs://${BUCKET}/tpu/resnet
OUTDIR=${TOPDIR}/trained
JOBNAME=imgclass_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR  # Comment out this line to continue training from the last time
gcloud ml-engine jobs submit training $JOBNAME \
  --python-version=2.7 \
  --region=$REGION \
  --module-name=trainer.resnet_main \
  --package-path=$(pwd)/mymodel/trainer \
  --job-dir=$OUTDIR \
  --staging-bucket=gs://$BUCKET \
  --scale-tier=BASIC_TPU \
  --runtime-version=$TFVERSION \
  -- \
  --data_dir=${TOPDIR}/data \
  --model_dir=${OUTDIR} \
  --resnet_depth=18 \
  --train_batch_size=128 --eval_batch_size=32 --skip_host_call=True \
  --steps_per_eval=250 --train_steps=1000 \
  --num_train_images=223912  --num_eval_images=24430  --num_label_classes=5 \
  --export_dir=${OUTDIR}/export


gs://sidewalk_crops_subset/tpu/resnet/trained us-central1 imgclass_181206_194319
jobId: imgclass_181206_194319
state: QUEUED


CommandException: 1 files/objects could not be removed.
Job [imgclass_181206_194319] submitted successfully.
Your job is still active. You may view the status of your job with the command

  $ gcloud ml-engine jobs describe imgclass_181206_194319

or continue streaming the logs with the command

  $ gcloud ml-engine jobs stream-logs imgclass_181206_194319


The above training job will take 15-20 minutes.
Wait for the job to finish before you proceed.
Navigate to the [Cloud ML Engine section of the GCP web console](https://console.cloud.google.com/mlengine)
to monitor job progress.
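
For a sense of how much data this job actually processes, the flags passed to the trainer can be turned into a rough epoch count. The numbers below are copied from the training command; the evaluation-round count assumes the trainer alternates between training and evaluation every `steps_per_eval` steps, which this notebook does not state explicitly.

```python
train_batch_size = 128
train_steps = 1000
steps_per_eval = 250
num_train_images = 223912

# Each step consumes one batch, so this is the number of images processed.
images_seen = train_batch_size * train_steps
# Fraction of the training set covered (float() keeps this correct on Python 2).
epochs = images_seen / float(num_train_images)
# Assumed: one evaluation after every `steps_per_eval` training steps.
eval_rounds = train_steps // steps_per_eval

print("images seen: {}, epochs: {:.2f}, eval rounds: {}".format(
    images_seen, epochs, eval_rounds))
```

With these settings the job sees a little over half of the training set once, so a longer run (more `train_steps`) is one thing to try before concluding anything about model quality.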

In [50]:
%%bash
gsutil ls gs://${BUCKET}/tpu/resnet/trained/export/

gs://sidewalk_crops_subset/tpu/resnet/trained/export/
gs://sidewalk_crops_subset/tpu/resnet/trained/export/1543473192/
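
The export subdirectory name looks like a Unix timestamp (an assumption based on the listings in this notebook), so the most recent export is the one with the largest number. The deployment step later relies on `tail -1` and the sort order of `gsutil ls` to pick it; a minimal sketch of making that choice explicit is shown here, with `latest_export` as an illustrative helper name.

```python
def latest_export(paths):
    """Return the export directory whose numeric (timestamp) name is largest."""
    candidates = []
    for path in paths:
        name = path.rstrip("/").rsplit("/", 1)[-1]
        if name.isdigit():  # skips the parent 'export/' entry itself
            candidates.append((int(name), path))
    if not candidates:
        raise ValueError("no timestamped export directories found")
    return max(candidates)[1]


# Paths in the shape printed by `gsutil ls`; both timestamps appear in this notebook.
listing = [
    "gs://sidewalk_crops_subset/tpu/resnet/trained/export/",
    "gs://sidewalk_crops_subset/tpu/resnet/trained/export/1543473192/",
    "gs://sidewalk_crops_subset/tpu/resnet/trained/export/1543964310/",
]
print(latest_export(listing))
```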


You can look at the training charts with TensorBoard:

In [51]:
OUTDIR = 'gs://{}/tpu/resnet/trained/'.format(BUCKET)
from google.datalab.ml import TensorBoard
TensorBoard().start(OUTDIR)

ImportError: No module named datalab.ml

In [95]:
TensorBoard().stop(11531)
print("Stopped Tensorboard")

Stopped Tensorboard


These were the charts I got (I set smoothing to be zero):
<img src="resnet_traineval.png" height="50"/>
As you can see, the final blue dot (eval) is quite close to the lowest training loss, indicating that the model hasn't overfit. The top_1 accuracy on the evaluation dataset, however, is 80%, which isn't that great. More data would help.
<img src="resnet_accuracy.png" height="50"/>

## Deploying and predicting with model

Deploy the model:

In [7]:
%%bash
MODEL_NAME="sidewalk"
MODEL_VERSION=resnet
MODEL_LOCATION=$(gsutil ls gs://${BUCKET}/tpu/resnet/trained/export/ | tail -1)
echo "Deleting/deploying $MODEL_NAME $MODEL_VERSION from $MODEL_LOCATION ... this will take a few minutes"

# Comment/uncomment the appropriate lines before running. The first time around, you need only the two create calls.
# During development, you may need to replace a version by deleting it and creating it again.

gcloud ml-engine versions delete --quiet ${MODEL_VERSION} --model ${MODEL_NAME}
gcloud ml-engine models delete ${MODEL_NAME}
gcloud ml-engine models create ${MODEL_NAME} --regions $REGION
gcloud ml-engine versions create ${MODEL_VERSION} --model ${MODEL_NAME} --origin ${MODEL_LOCATION} --runtime-version=$TFVERSION

Deleting/deploying sidewalk resnet from gs://sidewalk_crops_subset/tpu/resnet/trained/export/1543964310/ ... this will take a few minutes


Deleting version [resnet]......
.....................................done.
This will delete model [sidewalk]...

Do you want to continue (Y/n)?  Please enter 'y' or 'n':  Please enter 'y' or 'n':  
Deleting model [sidewalk]...
done.


We can use saved_model_cli to find out what inputs the model expects:

In [81]:
%%bash
saved_model_cli show --dir $(gsutil ls gs://${BUCKET}/tpu/resnet/trained/export/ | tail -1) --tag_set serve --signature_def serving_default

The given SavedModel SignatureDef contains the following input(s):
  inputs['image_bytes'] tensor_info:
      dtype: DT_STRING
      shape: (-1)
      name: Placeholder:0
The given SavedModel SignatureDef contains the following output(s):
  outputs['classes'] tensor_info:
      dtype: DT_INT64
      shape: (-1)
      name: ArgMax:0
  outputs['probabilities'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, 5)
      name: softmax_tensor:0
Method name is: tensorflow/serving/predict


  from ._conv import register_converters as _register_converters


As you can see, the model expects image_bytes. In a JSON request, this is typically sent base64-encoded.
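
As a minimal sketch of what that encoding involves, the snippet below wraps some bytes in the `{"image_bytes": {"b64": ...}}` form and checks that decoding recovers them. The bytes are a placeholder rather than a real JPEG, and `make_instance` is an illustrative helper name.

```python
import base64
import json


def make_instance(image_bytes):
    """Wrap raw image bytes in the {"image_bytes": {"b64": ...}} form."""
    # b64encode returns bytes; JSON needs text, so decode to an ASCII string.
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {"image_bytes": {"b64": encoded}}


# Placeholder bytes standing in for the contents of a JPEG file.
fake_jpeg = b"\xff\xd8\xff\xe0 not a real image \xff\xd9"
line = json.dumps(make_instance(fake_jpeg))

# Decoding the payload recovers the original bytes exactly.
decoded = base64.b64decode(json.loads(line)["image_bytes"]["b64"])
print(decoded == fake_jpeg)
```

Base64 is used because JSON cannot carry arbitrary binary data directly; the encoded form is roughly a third larger than the original file.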

To predict with the model, let's take one of the example images available on Google Cloud Storage <img src="http://storage.googleapis.com/cloud-ml-data/img/flower_photos/sunflowers/1022552002_2b93faf9e7_n.jpg" /> and convert it to a base64-encoded string.

In [25]:
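import base64, json
import tensorflow as tf
# Read the JPEG as raw bytes ('rb'); GFile understands gs:// paths
with tf.gfile.GFile('gs://sidewalk_crops_subset/imgs/10.jpg', 'rb') as ifp:
  with open('test.json', 'w') as ofp:
    image_data = ifp.read()
    # b64encode returns bytes on Python 3; decode so json.dump receives text
    img = base64.b64encode(image_data).decode('utf-8')
    json.dump({"image_bytes": {"b64": img}}, ofp)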

In [26]:
!ls -l test.json

-rwxrwxrwx 1 gweld gweld 295832 Dec  6 11:36 test.json


Send it to the prediction service

In [27]:
%%bash
gcloud ml-engine predict --model=sidewalk --version=nullcrop --json-instances=./test.json

CLASSES  PROBABILITIES
0        [0.43101173639297485, 0.17029161751270294, 0.014127411879599094, 0.005831782706081867, 0.37766680121421814, 0.0005188215873204172, 0.0005518007092177868]
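
To make the output above easier to read, the probabilities can be paired with label names. The sketch below assumes this model version was trained with the seven labels listed in /tmp/labels.txt, in that order, which the notebook does not confirm; the probabilities are copied from the prediction output.

```python
# Labels in the order they appear in /tmp/labels.txt (assumed to match the
# class indices used by the deployed model version).
labels = ["curb_ramp", "missing_ramp", "no_sidewalk", "obstruction",
          "occlusion", "other", "surface_problem"]
probabilities = [0.43101173639297485, 0.17029161751270294,
                 0.014127411879599094, 0.005831782706081867,
                 0.37766680121421814, 0.0005188215873204172,
                 0.0005518007092177868]

# Sort (probability, label) pairs from most to least likely.
ranked = sorted(zip(probabilities, labels), reverse=True)
for prob, label in ranked[:3]:
    print("{:<16s} {:.3f}".format(label, prob))
```

Under that assumption, the two leading classes are fairly close in probability, so the top-1 label alone hides how uncertain this prediction is.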


What does class no. 3 correspond to? (Remember that class indices are 0-based.)

In [92]:
%%bash
head -4 /tmp/labels.txt | tail -1

sunflowers


Here's how you would invoke those predictions without using gcloud:

In [93]:
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials
import base64
import tensorflow as tf

# Read the image as raw bytes ('rb'); GFile understands gs:// paths
with tf.gfile.GFile('gs://cloud-ml-data/img/flower_photos/sunflowers/1022552002_2b93faf9e7_n.jpg', 'rb') as ifp:
  credentials = GoogleCredentials.get_application_default()
  api = discovery.build('ml', 'v1', credentials=credentials,
            discoveryServiceUrl='https://storage.googleapis.com/cloud-ml/discovery/ml_v1_discovery.json')
  
  request_data = {'instances':
  [
      # Decode so the base64 payload is text and can be serialized as JSON
      {"image_bytes": {"b64": base64.b64encode(ifp.read()).decode('utf-8')}}
  ]}

  parent = 'projects/%s/models/%s/versions/%s' % (PROJECT, 'flowers', 'resnet')
  response = api.projects().predict(body=request_data, name=parent).execute()
  # print() with a single argument works on both Python 2 and Python 3
  print("response={0}".format(response))

response={u'predictions': [{u'probabilities': [0.0012481402372941375, 0.0010495249880477786, 7.82029837864684e-06, 0.9976732134819031, 2.1333773474907503e-05], u'classes': 3}]}


<pre>
# Copyright 2018 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
</pre>