<h1> Text Classification using TensorFlow on Cloud ML Engine </h1>

This notebook illustrates:
<ol>
<li> Creating datasets for Machine Learning using BigQuery
<li> Creating a text classification model using a CNN and custom Estimator 
<li> Training on Cloud ML Engine
<li> Deploying the model
<li> Predicting with the model
</ol>

In [9]:
# change these to try this notebook out
BUCKET = 'vijays-sandbox-ml'
PROJECT = 'vijays-sandbox'
REGION = 'us-central1'

In [10]:
import os
os.environ['BUCKET'] = BUCKET
os.environ['PROJECT'] = PROJECT
os.environ['REGION'] = REGION
os.environ['TFVERSION'] = '1.8'

In [3]:
import tensorflow as tf
print(tf.__version__)

1.8.0



Ensure you have the training files. If you don't, go back and run the <a href="txtcls_fromscratch.ipynb">txtcls_fromscratch</a> notebook first.

In [4]:
!wc -l data/txtcls/*.csv

  24041 data/txtcls/eval.csv
  72162 data/txtcls/train.csv
  96203 total


#### Download Pre-trained Embedding

To provide words as inputs to a neural network, we have to convert words to numbers. Ideally, we want related words to have numbers that are close to each other. This is what an embedding (such as word2vec) does. Here, I'll use the <a href="https://nlp.stanford.edu/projects/glove/">GloVe</a> embedding from Stanford just because it is smaller than <a href="https://code.google.com/archive/p/word2vec/">word2vec</a> from Google (1.5 GB): the full glove.6B.zip download is 822 MB, and the glove.6B.200d.txt file uploaded below is 661 MiB.
<p>
For testing purposes, I will also create a smaller file, consisting of the 1000 most common words.
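
To make "related words have numbers that are close to each other" concrete, here is a minimal sketch that parses lines in the GloVe text layout (a word followed by space-separated floats) and compares words by cosine similarity. The three 3-dimensional vectors are made-up toy values, not real GloVe weights, and the helper names `parse_glove_lines` and `cosine` are illustrative only.

```python
import math


def parse_glove_lines(lines):
    """Parse 'word v1 v2 ...' lines into a {word: [floats]} dict."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(' ')
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(v) for v in parts[1:]]
    return vectors


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (assumed values, for illustration only).
toy_lines = [
    'court 0.9 0.1 0.0',
    'judge 0.8 0.2 0.1',
    'docker 0.0 0.1 0.9',
]
vectors = parse_glove_lines(toy_lines)
sim_related = cosine(vectors['court'], vectors['judge'])
sim_unrelated = cosine(vectors['court'], vectors['docker'])
print(sim_related > sim_unrelated)
```

With real GloVe vectors the same two functions could be applied to lines read from one of the `glove.6B.*d.txt` files.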

In [5]:
!wget http://nlp.stanford.edu/data/glove.6B.zip

--2018-07-30 15:55:26--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2018-07-30 15:55:27--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2018-07-30 15:56:55 (9.32 MB/s) - ‘glove.6B.zip’ saved [862182613/862182613]



In [1]:
!unzip glove.6B.zip

Archive:  glove.6B.zip
  inflating: glove.6B.50d.txt        
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       
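
The smaller test file mentioned above could be produced along these lines. This is a sketch, not the notebook's own code: it assumes the rows of the GloVe text file are ordered from most to least frequent word, so that the first 1000 lines are the 1000 most common words, and the output name `glove.6B.200d.top1000.txt` is a hypothetical choice.

```python
import io
import itertools
import os


def write_head(src_path, dst_path, num_lines):
    """Copy the first num_lines lines of src_path to dst_path.

    Returns the number of lines actually written.
    """
    written = 0
    with io.open(src_path, 'r', encoding='utf-8') as src, \
         io.open(dst_path, 'w', encoding='utf-8') as dst:
        for line in itertools.islice(src, num_lines):
            dst.write(line)
            written += 1
    return written


# Only runs if the unzipped embedding file is present in the working directory.
if os.path.exists('glove.6B.200d.txt'):
    write_head('glove.6B.200d.txt', 'glove.6B.200d.top1000.txt', 1000)
```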


In [13]:
!gsutil cp glove.6B.200d.txt gs://$BUCKET/txtcls/

Copying file://glove.6B.200d.txt [Content-Type=text/plain]...
==> NOTE: You are uploading one or more large file(s), which would run          
significantly faster if you enable parallel composite uploads. This
feature can be enabled by editing the
"parallel_composite_upload_threshold" value in your .boto
configuration file. However, note that if you do this large files will
be uploaded as `composite objects
<https://cloud.google.com/storage/docs/composite-objects>`_,which
means that any user who downloads such objects will need to have a
compiled crcmod installed (see "gsutil help crcmod"). This is because
without a compiled crcmod, computing checksums on composite objects is
so slow that gsutil disables downloads of composite objects.

\ [1 files][661.3 MiB/661.3 MiB]    5.2 MiB/s                                   
Operation completed over 1 objects/661.3 MiB.                                    


<h2> TensorFlow code </h2>

Please explore the code in this <a href="txtclsmodel/trainer">directory</a>: `model.py` contains the TensorFlow model, and `task.py` parses the command-line arguments and launches the training job.
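
The real `task.py` lives in the linked directory and is not reproduced here. As a rough sketch of what such an entry point might look like, the parser below accepts only the flags that the commands in this notebook pass (`--output_dir`, `--train_data_path`, `--eval_data_path`, `--num_epochs`) plus the `--job-dir` flag that Cloud ML Engine forwards to the trainer. The function name `parse_args`, the defaults, and the help strings are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse trainer flags; argv=None means read from sys.argv."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--output_dir', required=True,
                        help='Where checkpoints and exports are written')
    parser.add_argument('--train_data_path', required=True,
                        help='Local or gs:// path to train.csv')
    parser.add_argument('--eval_data_path', required=True,
                        help='Local or gs:// path to eval.csv')
    # A float so that fractional values such as 0.1 are accepted.
    parser.add_argument('--num_epochs', type=float, default=1.0,
                        help='Passes over the training data (assumed default)')
    # Forwarded by the service when --job-dir is given to gcloud.
    parser.add_argument('--job-dir', default=None,
                        help='Job directory supplied by Cloud ML Engine')
    args, _unknown = parser.parse_known_args(argv)
    return args


example = parse_args([
    '--output_dir=trained',
    '--train_data_path=data/txtcls/train.csv',
    '--eval_data_path=data/txtcls/eval.csv',
    '--num_epochs=0.1',
])
print(example.num_epochs)
```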

Let's make sure the code works locally on a small dataset for a few steps.

In [60]:
%%bash
rm -rf trained
gcloud ml-engine local train \
   --module-name=trainer.task \
   --package-path=${PWD}/txtclsmodel/trainer \
   -- \
   --output_dir=trained \
   --train_data_path=${PWD}/data/txtcls/train.csv \
   --eval_data_path=${PWD}/data/txtcls/eval.csv \
   --num_epochs=0.1

input_fn: mode: train
input_fn: x_shape: (72162, 50)
input_fn: y_shape: (72162,)
input_fn: mode: eval
input_fn: x_shape: (24041, 50)
input_fn: y_shape: (24041,)


2018-07-30 15:25:11.765428: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA


When I ran it, I got about 55% accuracy after a few steps. With batchsize=32, 200 steps covers only 6,400 examples, less than a tenth of the roughly 72,000 examples in the training set. Even so, the model already does much better than the roughly 33% that random guessing would give across the three publications.
<p>
Once the code works in standalone mode, you can run it on Cloud ML Engine. You can monitor the job from the GCP console in the Cloud Machine Learning Engine section. With about 72,000 examples and batchsize=32, one epoch is about 2,250 steps, so train_steps=36,000 corresponds to roughly 16 epochs; the job below asks for num_epochs=20 instead, which is roughly 45,000 steps.
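
The step and epoch arithmetic can be checked with a few lines of Python. The example count is the line count of `train.csv` reported earlier, and the batch size of 32 is the value quoted in the text; the function names are illustrative.

```python
import math

NUM_TRAIN_EXAMPLES = 72162  # line count of data/txtcls/train.csv
BATCH_SIZE = 32             # batch size quoted in the text


def steps_for_epochs(num_epochs, num_examples=NUM_TRAIN_EXAMPLES,
                     batch_size=BATCH_SIZE):
    """Training steps needed for num_epochs passes over the data."""
    return int(math.ceil(num_epochs * num_examples / float(batch_size)))


def epochs_for_steps(num_steps, num_examples=NUM_TRAIN_EXAMPLES,
                     batch_size=BATCH_SIZE):
    """Passes over the data covered by num_steps training steps."""
    return num_steps * batch_size / float(num_examples)


print(steps_for_epochs(0.1))               # local test run: 226 steps
print(steps_for_epochs(20))                # cloud job: 45102 steps
print(round(epochs_for_steps(36000), 2))   # about 15.96 epochs
```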

In [12]:
!echo gs://$BUCKET/txtcls/

gs://vijays-sandbox-ml/txtcls/


In [27]:
%%bash
gsutil cp data/txtcls/*.csv gs://${BUCKET}/txtcls/

Copying file://data/txtcls/eval.csv [Content-Type=text/csv]...
/ [1 files][  1.4 MiB/  1.4 MiB]
Copying file://data/txtcls/train.csv [Content-Type=text/csv]...
/ [2 files][  5.4 MiB/  5.4 MiB]
Operation completed over 2 objects/5.4 MiB.                                      


In [61]:
%%bash
OUTDIR=gs://${BUCKET}/txtcls/trained
JOBNAME=txtcls_$(date -u +%y%m%d_%H%M%S)
REGION=us-central1
gsutil -m rm -rf $OUTDIR
gcloud ml-engine jobs submit training $JOBNAME \
 --region=$REGION \
 --module-name=trainer.task \
 --package-path=${PWD}/txtclsmodel/trainer \
 --job-dir=$OUTDIR \
 --scale-tier=BASIC_GPU \
 --runtime-version=$TFVERSION \
 -- \
 --output_dir=$OUTDIR \
 --train_data_path=gs://${BUCKET}/txtcls/train.csv \
 --eval_data_path=gs://${BUCKET}/txtcls/eval.csv \
 --num_epochs=20

jobId: txtcls_180730_153202
state: QUEUED


Removing gs://vijays-sandbox-ml/txtcls/trained/#1532964624555475...
Removing gs://vijays-sandbox-ml/txtcls/trained/checkpoint#1532964625612026...
Removing gs://vijays-sandbox-ml/txtcls/trained/eval/#1532964455123613...
Removing gs://vijays-sandbox-ml/txtcls/trained/eval/events.out.tfevents.1532964455.cmle-training-10112636399707147332#1532964628882603...
Removing gs://vijays-sandbox-ml/txtcls/trained/events.out.tfevents.1532964444.cmle-training-10112636399707147332#1532964629087309...
Removing gs://vijays-sandbox-ml/txtcls/trained/graph.pbtxt#1532964447178078...
Removing gs://vijays-sandbox-ml/txtcls/trained/model.ckpt-3001.index#1532964542728516...
Removing gs://vijays-sandbox-ml/txtcls/trained/model.ckpt-2001.data-00000-of-00001#1532964512670155...
Removing gs://vijays-sandbox-ml/txtcls/trained/model.ckpt-2001.index#1532964512909393...
Removing gs://vijays-sandbox-ml/txtcls/trained/model.ckpt-2001.meta#1532964514014036...
Removing gs://vijays-sandbox-ml/txtcls/trained/model.ckpt-3001

Training finished with an accuracy of 73%. This model was trained on a really small dataset; with more data, the accuracy will hopefully improve further.

<h2> Deploy trained model </h2>
<p>
Deploying the trained model to act as a REST web service is a simple gcloud call.

In [26]:
%%bash
gsutil ls gs://${BUCKET}/txtcls1/trained_model/export/Servo/

gs://cloud-training-demos-ml/txtcls1/trained_model/export/Servo/
gs://cloud-training-demos-ml/txtcls1/trained_model/export/Servo/1515090618/


In [None]:
%%bash
MODEL_NAME="txtcls"
MODEL_VERSION="v1"
MODEL_LOCATION=$(gsutil ls gs://${BUCKET}/txtcls1/trained_model/export/Servo/ | tail -1)
echo "Deleting and deploying $MODEL_NAME $MODEL_VERSION from $MODEL_LOCATION ... this will take a few minutes"
#gcloud ml-engine versions delete ${MODEL_VERSION} --model ${MODEL_NAME}
#gcloud ml-engine models delete ${MODEL_NAME}
#gcloud ml-engine models create ${MODEL_NAME} --regions $REGION
gcloud ml-engine versions create ${MODEL_VERSION} --model ${MODEL_NAME} --origin ${MODEL_LOCATION}
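
The `tail -1` above relies on `gsutil ls` listing the newest export last. A slightly more explicit alternative, sketched here in Python under the assumption that each export lives in a subdirectory named with a numeric timestamp (as in the listing above), picks the largest timestamp directly. `latest_export` is a hypothetical helper, not part of gsutil or the trainer.

```python
def latest_export(listing):
    """Return the path whose final component is the largest integer timestamp."""
    candidates = []
    for path in listing:
        name = path.rstrip('/').rsplit('/', 1)[-1]
        if name.isdigit():
            candidates.append((int(name), path))
    if not candidates:
        raise ValueError('no timestamped export directory found')
    return max(candidates)[1]


# Paths copied from the gsutil listing shown above.
listing = [
    'gs://cloud-training-demos-ml/txtcls1/trained_model/export/Servo/',
    'gs://cloud-training-demos-ml/txtcls1/trained_model/export/Servo/1515090618/',
]
print(latest_export(listing))
```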

<h2> Use model to predict </h2>
<p>
Send a JSON request to the endpoint of the service to make it predict which publication an article is most likely to run in. These are actual titles of articles in the New York Times, GitHub, and TechCrunch on June 19. These titles were not part of the training or evaluation datasets.

In [29]:
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials
import json

credentials = GoogleCredentials.get_application_default()
api = discovery.build('ml', 'v1', credentials=credentials,
            discoveryServiceUrl='https://storage.googleapis.com/cloud-ml/discovery/ml_v1_discovery.json')

request_data = {'instances':
  [
      {
        'title': 'Supreme Court to Hear Major Case on Partisan Districts'.lower()
      },
      {
        'title': 'Furan -- build and push Docker images from GitHub to target'.lower()
      },
      {
        'title': 'Time Warner will spend $100M on Snapchat original shows and ads'.lower()
      },
  ]
}

# Fully qualified name of the deployed model version
parent = 'projects/%s/models/%s/versions/%s' % (PROJECT, 'txtcls', 'v1')
response = api.projects().predict(body=request_data, name=parent).execute()
print("response={0}".format(response))

response={u'predictions': [{u'source': u'nytimes', u'prob': [0.8374465107917786, 0.04888102412223816, 0.11367242783308029], u'class': 0}, {u'source': u'github', u'prob': [0.012753208167850971, 0.987160325050354, 8.644349873065948e-05], u'class': 1}, {u'source': u'techcrunch', u'prob': [0.006752816028892994, 0.0005263009225018322, 0.992720901966095], u'class': 2}]}


As you can see, the trained model predicts that the Supreme Court article is 84% likely to come from the New York Times. According to the service, the Docker article is 99% likely to be from GitHub, and the Time Warner one is 99% likely to be from TechCrunch.
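
The percentages quoted above can be read off the response programmatically. The sketch below hard-codes the response shown earlier and assumes that `class` is the index into `prob` of the predicted label, which is consistent with that output; `summarize` is an illustrative helper name.

```python
# Response copied from the prediction output above.
response = {'predictions': [
    {'source': 'nytimes', 'class': 0,
     'prob': [0.8374465107917786, 0.04888102412223816, 0.11367242783308029]},
    {'source': 'github', 'class': 1,
     'prob': [0.012753208167850971, 0.987160325050354, 8.644349873065948e-05]},
    {'source': 'techcrunch', 'class': 2,
     'prob': [0.006752816028892994, 0.0005263009225018322, 0.992720901966095]},
]}


def summarize(prediction):
    """Return (source, probability assigned to the predicted class)."""
    return prediction['source'], prediction['prob'][prediction['class']]


summary = [summarize(p) for p in response['predictions']]
for source, prob in summary:
    print('{0}: {1:.0%}'.format(source, prob))
```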

Copyright 2017 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License