# Image Classification from scratch with TPUs on Cloud ML Engine using ResNet

This notebook demonstrates how to train an image classifier from scratch on a flowers dataset using TPUs and the ResNet trainer.

In [62]:
import os
PROJECT = 'cloud-training-demos' # REPLACE WITH YOUR PROJECT ID
BUCKET = 'cloud-training-demos-ml' # REPLACE WITH YOUR BUCKET NAME
REGION = 'us-central1' # REPLACE WITH YOUR BUCKET REGION e.g. us-central1

# do not change these
os.environ['PROJECT'] = PROJECT
os.environ['BUCKET'] = BUCKET
os.environ['REGION'] = REGION
os.environ['TFVERSION'] = '1.8'

In [15]:
%bash
gcloud config set project $PROJECT
gcloud config set compute/region $REGION

Updated property [core/project].
Updated property [compute/region].


## Convert JPEG images to TensorFlow Records

My dataset consists of JPEG images in Google Cloud Storage. I have two CSV files (one for training, one for evaluation) that are formatted as follows:
   image-path,category

Instead of reading the images from JPEG each time, we'll convert the JPEG data and store it as TF Records.


In [5]:
%bash
gsutil cat gs://cloud-ml-data/img/flower_photos/train_set.csv | head -5 > /tmp/input.csv
cat /tmp/input.csv

gs://cloud-ml-data/img/flower_photos/daisy/754296579_30a9ae018c_n.jpg,daisy
gs://cloud-ml-data/img/flower_photos/dandelion/18089878729_907ed2c7cd_m.jpg,dandelion
gs://cloud-ml-data/img/flower_photos/dandelion/284497199_93a01f48f6.jpg,dandelion
gs://cloud-ml-data/img/flower_photos/dandelion/3554992110_81d8c9b0bd_m.jpg,dandelion
gs://cloud-ml-data/img/flower_photos/daisy/4065883015_4bb6010cb7_n.jpg,daisy


In [6]:
%bash
gsutil cat gs://cloud-ml-data/img/flower_photos/train_set.csv  | sed 's/,/ /g' | awk '{print $2}' | sort | uniq > /tmp/labels.txt
cat /tmp/labels.txt

daisy
dandelion
roses
sunflowers
tulips
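
The shell pipeline above can also be sketched in Python. This is only an illustration, assuming each CSV line has the form `path,label` as in the sample shown earlier; the helper name `unique_labels` is hypothetical and not part of the notebook or the TPU repo.

```python
import csv
import io


def unique_labels(csv_text):
    """Return the sorted, de-duplicated labels from 'path,label' CSV text."""
    reader = csv.reader(io.StringIO(csv_text))
    # The label is the second column; rows with fewer columns are skipped.
    return sorted({row[1].strip() for row in reader if len(row) >= 2})


# Three of the sample lines printed earlier in this notebook.
sample = """gs://cloud-ml-data/img/flower_photos/daisy/754296579_30a9ae018c_n.jpg,daisy
gs://cloud-ml-data/img/flower_photos/dandelion/18089878729_907ed2c7cd_m.jpg,dandelion
gs://cloud-ml-data/img/flower_photos/daisy/4065883015_4bb6010cb7_n.jpg,daisy
"""
labels = unique_labels(sample)
print(labels)  # ['daisy', 'dandelion']
```

The position of a label in this sorted list is what later becomes its class index.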


## Enable TPU service account

Allow Cloud ML Engine to access the TPU and bill usage to your project.

In [None]:
%bash
SVC_ACCOUNT=$(curl -H "Authorization: Bearer $(gcloud auth print-access-token)"  \
    https://ml.googleapis.com/v1/projects/${PROJECT}:getConfig \
              | grep tpuServiceAccount | tr '"' ' ' | awk '{print $3}' )
echo "Enabling TPU service account $SVC_ACCOUNT to act as Cloud ML Service Agent"
gcloud projects add-iam-policy-binding $PROJECT \
    --member serviceAccount:$SVC_ACCOUNT --role roles/ml.serviceAgent
echo "Done"
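
The `grep | tr | awk` pipeline above pulls the `tpuServiceAccount` field out of the JSON returned by `getConfig`. Text-matching JSON is fragile, so here is a rough Python sketch of the same lookup. I'm assuming only that the field is named `tpuServiceAccount` (taken from the `grep`), not where it sits in the response; the response body and account name below are made-up placeholders, and `find_key` is a hypothetical helper.

```python
import json


def find_key(obj, key):
    """Depth-first search for `key` in nested dicts/lists; None if absent."""
    if isinstance(obj, dict):
        if key in obj:
            return obj[key]
        children = list(obj.values())
    elif isinstance(obj, list):
        children = obj
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


# Hypothetical response body; the account name is a placeholder, not real output.
response_text = '{"config": {"tpuServiceAccount": "service-123@example.iam.gserviceaccount.com"}}'
svc_account = find_key(json.loads(response_text), "tpuServiceAccount")
print(svc_account)
```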

## Clone the TPU repo

Let's git clone the repo and get the preprocessing and model files. The model code has imports of the form:
<pre>
import resnet_model as model_lib
</pre>
Because we package these files as the <code>trainer</code> module, we will need to change this to a relative import:
<pre>
from . import resnet_model as model_lib
</pre>
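
The rewrite described above can be sketched in Python with a regular expression. This is not what the notebook runs (the script below uses `sed`); `make_imports_relative` is a hypothetical helper that only illustrates the transformation.

```python
import re


def make_imports_relative(source, modules):
    """Rewrite top-level 'import <module>' lines to 'from . import <module>'."""
    for module in modules:
        # With re.MULTILINE, ^ anchors at the start of every line (like sed's ^).
        # \b keeps a module name from matching a longer name that starts with it.
        pattern = r"^import {}\b".format(re.escape(module))
        source = re.sub(pattern, "from . import {}".format(module),
                        source, flags=re.MULTILINE)
    return source


code = "import os\nimport resnet_model as model_lib\n"
print(make_imports_relative(code, ["resnet_model"]))
```

Only imports of the listed sibling modules are touched; standard-library imports such as `import os` are left alone.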


In [64]:
%writefile copy_resnet_files.sh
#!/bin/bash
rm -rf tpu
git clone https://github.com/tensorflow/tpu
cd tpu
TFVERSION=$1
echo "Switching to version r$TFVERSION"
git checkout r$TFVERSION
cd ..
  
MODELCODE=tpu/models/official/resnet
OUTDIR=mymodel
rm -rf $OUTDIR

# preprocessing
cp -r imgclass $OUTDIR   # brings in setup.py and __init__.py
cp tpu/tools/datasets/jpeg_to_tf_record.py $OUTDIR/trainer/preprocess.py

# model: fix imports
for FILE in $(ls -p $MODELCODE | grep -v /); do
    CMD="cat $MODELCODE/$FILE "
    for f2 in $(ls -p $MODELCODE | grep -v /); do
        MODULE=$(echo $f2 | sed 's/\.py$//')   # strip only a trailing .py extension
        CMD="$CMD | sed 's/^import ${MODULE}/from . import ${MODULE}/g' "
    done
    CMD="$CMD > $OUTDIR/trainer/$FILE"
    eval $CMD
done
find $OUTDIR
echo "Finished copying files into $OUTDIR"

Overwriting copy_resnet_files.sh


In [None]:
!bash ./copy_resnet_files.sh $TFVERSION

## Try preprocessing locally

In [41]:
%bash
export PYTHONPATH=${PYTHONPATH}:${PWD}/mymodel
  
rm -rf /tmp/out
python -m trainer.preprocess \
       --train_csv /tmp/input.csv \
       --validation_csv /tmp/input.csv \
       --labels_file /tmp/labels.txt \
       --project_id $PROJECT \
       --output_dir /tmp/out --runner=DirectRunner

Read in 5 labels, from daisy to tulips


2018-06-26 00:20:44.080585: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA


In [42]:
!ls -l /tmp/out

total 384
-rw-r--r-- 1 root root 195698 Jun 26 00:20 train-00000-of-00001
-rw-r--r-- 1 root root 195698 Jun 26 00:20 validation-00000-of-00001


In [43]:
!head /tmp/out/train-00000*

[Binary output omitted. The file is a TFRecord: each record carries an image/encoded feature holding the raw JPEG (JFIF) bytes, so it is not human-readable.]
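
The output of `head` is unreadable because TFRecord is a binary container format. As a rough sketch of what is inside, here is a pure-Python reader for the record framing. It assumes the standard TFRecord layout (a little-endian 64-bit length, a 4-byte checksum of the length, the payload, and a 4-byte checksum of the payload) and skips the checksums instead of verifying them. `iter_tfrecord_payloads` and `fake_record` are hypothetical helpers; in practice you would read these files with TensorFlow itself.

```python
import io
import struct


def iter_tfrecord_payloads(stream):
    """Yield the raw payload bytes of each record in a TFRecord byte stream.

    Checksums are skipped, not verified, so corrupt files are not detected.
    """
    while True:
        header = stream.read(12)  # 8-byte length + 4-byte length checksum
        if len(header) < 12:
            return  # end of stream
        (length,) = struct.unpack("<Q", header[:8])
        payload = stream.read(length)
        stream.read(4)  # skip the payload checksum
        yield payload


def fake_record(payload):
    """Frame a payload like a TFRecord entry, with zeroed (invalid) checksums."""
    return struct.pack("<Q", len(payload)) + b"\x00" * 4 + payload + b"\x00" * 4


demo = io.BytesIO(fake_record(b"first") + fake_record(b"second"))
payloads = list(iter_tfrecord_payloads(demo))
print(len(payloads))  # 2
```

To try it on the real output, pass `open('/tmp/out/train-00000-of-00001', 'rb')` as the stream; each payload should then be one serialized example containing the `image/encoded` bytes seen above.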

Now run it over the full training and evaluation datasets.  This will happen in Cloud Dataflow.

In [None]:
%bash
export PYTHONPATH=${PYTHONPATH}:${PWD}/mymodel
gsutil -m rm -rf gs://${BUCKET}/tpu/resnet/data
python -m trainer.preprocess \
       --train_csv gs://cloud-ml-data/img/flower_photos/train_set.csv \
       --validation_csv gs://cloud-ml-data/img/flower_photos/eval_set.csv \
       --labels_file /tmp/labels.txt \
       --project_id $PROJECT \
       --output_dir gs://${BUCKET}/tpu/resnet/data

The above preprocessing step will take <b>15-20 minutes</b>. Wait for the job to finish before you proceed. Navigate to the [Cloud Dataflow section of the GCP web console](https://console.cloud.google.com/dataflow) to monitor job progress. You will see something like this: <img src="dataflow.png" />

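Alternatively, you can simply copy my already preprocessed files and proceed to the next step. Note that this command copies into <code>copied_data</code> rather than <code>data</code>, so either change the destination to <code>data</code> or pass the <code>copied_data</code> path as <code>--data_dir</code> when training: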
<pre>
gsutil -m cp gs://cloud-training-demos/tpu/resnet/data/* gs://${BUCKET}/tpu/resnet/copied_data
</pre>

In [45]:
%bash
gsutil ls gs://${BUCKET}/tpu/resnet/data

gs://cloud-training-demos-ml/tpu/resnet/data/train-00000-of-00013
gs://cloud-training-demos-ml/tpu/resnet/data/train-00001-of-00013
gs://cloud-training-demos-ml/tpu/resnet/data/train-00002-of-00013
gs://cloud-training-demos-ml/tpu/resnet/data/train-00003-of-00013
gs://cloud-training-demos-ml/tpu/resnet/data/train-00004-of-00013
gs://cloud-training-demos-ml/tpu/resnet/data/train-00005-of-00013
gs://cloud-training-demos-ml/tpu/resnet/data/train-00006-of-00013
gs://cloud-training-demos-ml/tpu/resnet/data/train-00007-of-00013
gs://cloud-training-demos-ml/tpu/resnet/data/train-00008-of-00013
gs://cloud-training-demos-ml/tpu/resnet/data/train-00009-of-00013
gs://cloud-training-demos-ml/tpu/resnet/data/train-00010-of-00013
gs://cloud-training-demos-ml/tpu/resnet/data/train-00011-of-00013
gs://cloud-training-demos-ml/tpu/resnet/data/train-00012-of-00013
gs://cloud-training-demos-ml/tpu/resnet/data/validation-00000-of-00003
gs://cloud-training-demos-ml/tpu/resnet/data/validation-00001-of-00003


## Train on the Cloud

In [51]:
%bash
echo -n "--num_train_images=$(gsutil cat gs://cloud-ml-data/img/flower_photos/train_set.csv | wc -l)  "
echo -n "--num_eval_images=$(gsutil cat gs://cloud-ml-data/img/flower_photos/eval_set.csv | wc -l)  "
echo "--num_label_classes=$(cat /tmp/labels.txt | wc -l)"

--num_train_images=3300  --num_eval_images=370  --num_label_classes=5


In [None]:
%bash
TOPDIR=gs://${BUCKET}/tpu/resnet
OUTDIR=${TOPDIR}/trained
JOBNAME=imgclass_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR  # Comment out this line to continue training from the last time
gcloud ml-engine jobs submit training $JOBNAME \
  --region=$REGION \
  --module-name=trainer.resnet_main \
  --package-path=$(pwd)/mymodel/trainer \
  --job-dir=$OUTDIR \
  --staging-bucket=gs://$BUCKET \
  --scale-tier=BASIC_TPU \
  --runtime-version=$TFVERSION \
  -- \
  --data_dir=${TOPDIR}/data \
  --model_dir=${OUTDIR} \
  --resnet_depth=18 \
  --train_batch_size=128 --eval_batch_size=32 --skip_host_call=True \
  --train_steps=1000 \
  --num_train_images=3300  --num_eval_images=370  --num_label_classes=5 \
  --export_dir=${OUTDIR}/export

The above training job will take 15-20 minutes. 
Wait for the job to finish before you proceed. 
Navigate to the [Cloud ML Engine section of the GCP web console](https://console.cloud.google.com/mlengine) 
to monitor job progress.
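
To get a feel for how much training the flags above request, here is some back-of-the-envelope arithmetic. It assumes that `train_batch_size` is the total number of images consumed per training step; that is my reading of the flag, not something this notebook verifies.

```python
train_steps = 1000
train_batch_size = 128
num_train_images = 3300

# Total images processed over the whole job.
images_seen = train_steps * train_batch_size
# float() keeps the division exact on both Python 2 and Python 3.
steps_per_epoch = float(num_train_images) / train_batch_size
epochs = float(images_seen) / num_train_images

print("steps per epoch: {:.1f}, epochs: {:.1f}".format(steps_per_epoch, epochs))
# steps per epoch: 25.8, epochs: 38.8
```

So 1000 steps is roughly 39 passes over the 3300 training images.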

In [72]:
%bash
gsutil ls gs://${BUCKET}/tpu/resnet/trained/export/

gs://cloud-training-demos-ml/tpu/resnet/trained/export/
gs://cloud-training-demos-ml/tpu/resnet/trained/export/1529987998/


## Deploying and predicting with model

Deploy the model:

In [76]:
%bash
MODEL_NAME="flowers"
MODEL_VERSION=resnet
MODEL_LOCATION=$(gsutil ls gs://${BUCKET}/tpu/resnet/trained/export/ | tail -1)
echo "Deleting and deploying $MODEL_NAME $MODEL_VERSION from $MODEL_LOCATION ... this will take a few minutes"
#gcloud ml-engine versions delete --quiet ${MODEL_VERSION} --model ${MODEL_NAME}
#gcloud ml-engine models delete ${MODEL_NAME}
#gcloud ml-engine models create ${MODEL_NAME} --regions $REGION
gcloud ml-engine versions create ${MODEL_VERSION} --model ${MODEL_NAME} --origin ${MODEL_LOCATION} --runtime-version=$TFVERSION

Deleting and deploying flowers resnet from gs://cloud-training-demos-ml/tpu/resnet/trained/export/1529987998/ ... this will take a few minutes


Creating version (this might take a few minutes)......
..................................................................................................done.
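
The deploy cell picks the model directory with `gsutil ls ... | tail -1`, that is, the last path in the listing. Here is a sketch of a more explicit selection in Python, assuming (as in the listing above) that each export sits in a directory named with a numeric timestamp. `latest_export` is a hypothetical helper.

```python
def latest_export(paths):
    """Return the path whose last component is the largest integer timestamp."""
    candidates = []
    for path in paths:
        name = path.rstrip("/").rsplit("/", 1)[-1]
        # Skip the parent 'export/' entry and anything else non-numeric.
        if name.isdigit():
            candidates.append((int(name), path))
    if not candidates:
        raise ValueError("no timestamped export directories found")
    return max(candidates)[1]


# The listing printed by `gsutil ls` earlier in this notebook.
listing = [
    "gs://cloud-training-demos-ml/tpu/resnet/trained/export/",
    "gs://cloud-training-demos-ml/tpu/resnet/trained/export/1529987998/",
]
print(latest_export(listing))
```

Comparing the timestamps as integers avoids relying on the order in which the paths happen to be listed.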


We can use saved_model_cli to find out what inputs the model expects:

In [81]:
%bash
saved_model_cli show --dir $(gsutil ls gs://${BUCKET}/tpu/resnet/trained/export/ | tail -1) --tag_set serve --signature_def serving_default

The given SavedModel SignatureDef contains the following input(s):
  inputs['image_bytes'] tensor_info:
      dtype: DT_STRING
      shape: (-1)
      name: Placeholder:0
The given SavedModel SignatureDef contains the following output(s):
  outputs['classes'] tensor_info:
      dtype: DT_INT64
      shape: (-1)
      name: ArgMax:0
  outputs['probabilities'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, 5)
      name: softmax_tensor:0
Method name is: tensorflow/serving/predict




As you can see, the model expects image_bytes, a string tensor.  In a JSON request, binary data like this is sent base64-encoded.

To predict with the model, let's take one of the example images that is available on Google Cloud Storage <img src="http://storage.googleapis.com/cloud-ml-data/img/flower_photos/sunflowers/1022552002_2b93faf9e7_n.jpg" /> and convert it to a base64-encoded string:

In [87]:
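import base64, json
import tensorflow as tf
# Open in binary mode ('rb') so the JPEG bytes are not decoded as text.
with tf.gfile.FastGFile('gs://cloud-ml-data/img/flower_photos/sunflowers/1022552002_2b93faf9e7_n.jpg', 'rb') as ifp:
  with open('test.json', 'w') as ofp:
    image_data = ifp.read()
    # b64encode returns bytes; decode to text so json.dump can serialize it.
    img = base64.b64encode(image_data).decode('utf-8')
    json.dump({"image_bytes": {"b64": img}}, ofp)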

In [88]:
!ls -l test.json

-rw-r--r-- 1 root root 56992 Jun 26 05:33 test.json
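
Before sending the request, it can be worth checking that the file decodes back to image bytes. Here is a minimal sketch of that round trip, using a few stand-in bytes rather than the real image. `decode_instance` is a hypothetical helper, and the only thing it assumes is the `{"image_bytes": {"b64": ...}}` layout written above.

```python
import base64
import json


def decode_instance(json_line):
    """Decode the base64 image payload from one JSON instance line."""
    instance = json.loads(json_line)
    return base64.b64decode(instance["image_bytes"]["b64"])


# Stand-in payload: the opening bytes of a JPEG file, not a valid image.
fake_jpeg = b"\xff\xd8\xff\xe0" + b"JFIF"
line = json.dumps(
    {"image_bytes": {"b64": base64.b64encode(fake_jpeg).decode("ascii")}})
decoded = decode_instance(line)
# JPEG files start with the two bytes FF D8.
print(decoded[:2] == b"\xff\xd8")  # True
```

With the real file, `decode_instance(open('test.json').read())` should return bytes that start with the same two-byte marker.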


Send it to the prediction service

In [89]:
%bash
gcloud ml-engine predict --model=flowers --version=resnet --json-instances=./test.json

CLASSES  PROBABILITIES
3        [0.0012481402372941375, 0.0010495249880477786, 7.82029837864684e-06, 0.9976732134819031, 2.1333773474907503e-05]


What does class 3 correspond to? (Remember that class indices are 0-based, so class 3 is the fourth line of the labels file.)

In [92]:
%bash
head -4 /tmp/labels.txt | tail -1

sunflowers
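
The same lookup can be sketched in Python, using the probabilities returned by the prediction service and the label order from /tmp/labels.txt. The variable names are just for illustration.

```python
# Labels in the order they appear in /tmp/labels.txt.
labels = ["daisy", "dandelion", "roses", "sunflowers", "tulips"]
# Probabilities copied from the prediction output above.
probabilities = [0.0012481402372941375, 0.0010495249880477786,
                 7.82029837864684e-06, 0.9976732134819031,
                 2.1333773474907503e-05]

# Index of the highest probability; this should match the CLASSES column.
best = max(range(len(probabilities)), key=probabilities.__getitem__)
print(best, labels[best])  # 3 sunflowers
```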


Hurray!

<pre>
# Copyright 2018 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
</pre>