<h1> Text Classification using TensorFlow/Keras on Cloud ML Engine </h1>
<h3>Leveraging a pre-trained embedding</h3>

This notebook illustrates:
<ol>
<li> Downloading a pre-trained text embedding
<li> Creating a text classification model using Keras and the Estimator API 
<li> Training on Cloud ML Engine
<li> Deploying the model
<li> Predicting with the model
</ol>

In [38]:
# change these to try this notebook out
BUCKET = 'vijays-sandbox-ml'
PROJECT = 'vijays-sandbox'
REGION = 'us-central1'

In [39]:
import os
os.environ['BUCKET'] = BUCKET
os.environ['PROJECT'] = PROJECT
os.environ['REGION'] = REGION
os.environ['TFVERSION'] = '1.8'

In [40]:
import tensorflow as tf
print(tf.__version__)

1.8.0


### Pre-requisites
Ensure you have the training files generated from the txtcls_fromscratch notebook. If you don't, go back and run the <a href="txtcls_fromscratch.ipynb">txtcls_fromscratch</a> notebook first.

In [None]:
!wc -l data/txtcls/*.tsv

### Download Pre-trained Embedding

In the previous notebook we trained our word embedding from scratch. We often get better performance by leveraging a pre-trained embedding. This is similar in concept to transfer learning in image classification.

We will use the popular GloVe embedding, which is trained on Wikipedia as well as various news sources like the NYTimes.

You can read more about GloVe at the project homepage: https://nlp.stanford.edu/projects/glove/

*Note: The download is about 900MB so the following cell may take some time to run*

In [None]:
%%bash
wget http://nlp.stanford.edu/data/glove.6B.zip
unzip glove.6B.zip -d data/txtcls/
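
Each line of an unzipped GloVe text file such as `glove.6B.200d.txt` holds a token followed by its space-separated vector components. The following is a minimal sketch of parsing that format. It uses a two-line in-memory sample with 3-dimensional, made-up vectors rather than the real file, and the function name `parse_glove_lines` is hypothetical.

```python
import io

def parse_glove_lines(handle):
    """Map each token to its vector (a list of floats)."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(' ')
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

# Tiny stand-in for the real file; the values are invented for illustration.
sample = io.StringIO("the 0.1 0.2 0.3\ncloud -0.4 0.5 0.6\n")
glove = parse_glove_lines(sample)
print(sorted(glove))
print(len(glove['cloud']))
```

For the real file, the same function could be given an open file handle instead of the in-memory sample.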

### TensorFlow/Keras Code

Please explore the code in this <a href="txtclsmodel/trainer">directory</a>: `model.py` contains the TensorFlow model and `task.py` parses command line arguments and launches the training job.

This is the same code as in the previous notebook. The only difference is that we invoke it with the `--embedding_path` parameter.
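
The trainer code is not reproduced in this notebook, so the following is only a sketch of the usual technique. Each word in the vocabulary is looked up in the pre-trained vectors and its row of the embedding matrix is initialized with that vector, while words missing from the pre-trained set keep a zero row. The names `build_embedding_matrix`, `word_index`, and `vectors` are assumptions for illustration and may not match `model.py`.

```python
def build_embedding_matrix(word_index, vectors, dim):
    """Return a (len(word_index) + 1) x dim matrix as nested lists.

    Row 0 is reserved for padding; rows for words without a
    pre-trained vector stay at zero.
    """
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, idx in word_index.items():
        vec = vectors.get(word)
        if vec is not None:
            matrix[idx] = list(vec)
    return matrix

# Toy data: a 3-word vocabulary and 2-dimensional "pre-trained" vectors.
word_index = {'uber': 1, 'cloud': 2, 'zzyzx': 3}
vectors = {'uber': [0.5, -0.5], 'cloud': [0.25, 0.75]}
matrix = build_embedding_matrix(word_index, vectors, dim=2)
print(matrix)
```

In Keras, a matrix like this would typically be supplied as the initial weights of an `Embedding` layer.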

### Run Locally
Let's make sure the code runs locally by training for a fraction of an epoch. Note the new `--embedding_path` parameter.

In [None]:
%%bash
rm -rf trained
gcloud ml-engine local train \
   --module-name=trainer.task \
   --package-path=${PWD}/txtclsmodel/trainer \
   -- \
   --output_dir=trained \
   --train_data_path=${PWD}/data/txtcls/train.tsv \
   --eval_data_path=${PWD}/data/txtcls/eval.tsv \
   --embedding_path=${PWD}/data/txtcls/glove.6B.200d.txt \
   --num_epochs=0.1

### Train on the Cloud

Let's first copy our embedding file to the cloud:

In [None]:
!gsutil cp data/txtcls/glove.6B.200d.txt gs://$BUCKET/txtcls/

In [None]:
%%bash
OUTDIR=gs://${BUCKET}/txtcls/trained_finetune
JOBNAME=txtcls_$(date -u +%y%m%d_%H%M%S)
REGION=us-central1
gsutil -m rm -rf $OUTDIR
gcloud ml-engine jobs submit training $JOBNAME \
 --region=$REGION \
 --module-name=trainer.task \
 --package-path=${PWD}/txtclsmodel/trainer \
 --job-dir=$OUTDIR \
 --scale-tier=BASIC_GPU \
 --runtime-version=$TFVERSION \
 -- \
 --output_dir=$OUTDIR \
 --train_data_path=gs://${BUCKET}/txtcls/train.tsv \
 --eval_data_path=gs://${BUCKET}/txtcls/eval.tsv \
 --embedding_path=gs://${BUCKET}/txtcls/glove.6B.200d.txt \
 --num_epochs=20

### Monitor training with TensorBoard
If TensorBoard appears blank, try refreshing after 5 minutes.

In [None]:
from google.datalab.ml import TensorBoard
TensorBoard().start('gs://{}/txtcls/trained_finetune'.format(BUCKET))

In [None]:
for pid in TensorBoard.list()['pid']:
  TensorBoard().stop(pid)
  print('Stopped TensorBoard with pid {}'.format(pid))

### Results
What accuracy did you get? Was it an improvement over training the embedding from scratch? 

While the final accuracy may not change significantly, you should notice the model was able to converge to it much more quickly given the pre-trained embedding.

### Deploy trained model 

Once your training completes you will see your exported models in the output directory specified in Google Cloud Storage. 

You should see one model for each training checkpoint (default is every 1000 steps).

In [None]:
%%bash
gsutil ls gs://${BUCKET}/txtcls/trained_finetune/export/exporter/

In [None]:
%%bash
MODEL_NAME="txtcls"
MODEL_VERSION="v1_finetune"
MODEL_LOCATION=$(gsutil ls gs://${BUCKET}/txtcls/trained_finetune/export/exporter/ | tail -1)
#gcloud ml-engine versions delete ${MODEL_VERSION} --model ${MODEL_NAME} --quiet
#gcloud ml-engine models delete ${MODEL_NAME}
gcloud ml-engine models create ${MODEL_NAME} --regions $REGION
gcloud ml-engine versions create ${MODEL_VERSION} --model ${MODEL_NAME} --origin ${MODEL_LOCATION} --runtime-version=$TFVERSION
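
The `tail -1` above picks the last entry that `gsutil ls` prints. That works on the assumption that each export lives in a directory named with a numeric timestamp, so the last entry in the sorted listing is the newest. Here is a sketch of the same selection in Python, with invented paths and a hypothetical helper name `latest_export`:

```python
def latest_export(paths):
    """Return the export path whose final component is the largest number."""
    def timestamp(path):
        name = path.rstrip('/').split('/')[-1]
        return int(name) if name.isdigit() else -1
    candidates = [p for p in paths if timestamp(p) >= 0]
    return max(candidates, key=timestamp) if candidates else None

# Illustrative listing; the bucket name and timestamps are invented.
listing = [
    'gs://my-bucket/txtcls/trained_finetune/export/exporter/',
    'gs://my-bucket/txtcls/trained_finetune/export/exporter/1532000000/',
    'gs://my-bucket/txtcls/trained_finetune/export/exporter/1532003600/',
]
print(latest_export(listing))
```

Comparing the timestamps as integers rather than strings keeps the choice correct even if the directory names differ in length.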

### Get Predictions

Here are some actual Hacker News headlines gathered from July 2018. These titles were not part of the training or evaluation datasets.

In [None]:
techcrunch=[
  'Uber shuts down self-driving trucks unit',
  'Grover raises €37M Series A to offer latest tech products as a subscription',
  'Tech companies can now bid on the Pentagon’s $10B cloud contract'
]
nytimes=[
  '‘Lopping,’ ‘Tips’ and the ‘Z-List’: Bias Lawsuit Explores Harvard’s Admissions',
  'A $3B Plan to Turn Hoover Dam into a Giant Battery',
  'A MeToo Reckoning in China’s Workplace Amid Wave of Accusations'
]
github=[
  'Show HN: Moon – 3kb JavaScript UI compiler',
  'Show HN: Hello, a CLI tool for managing social media',
  'Firefox Nightly added support for time-travel debugging'
]

Our serving input function expects the already tokenized representations of the headlines, so we do that pre-processing in the code before calling the REST API.

Note: Ideally we would do these transformations in the TensorFlow graph directly instead of relying on separate client pre-processing code (see: [training-serving skew](https://developers.google.com/machine-learning/guides/rules-of-ml/#training_serving_skew)). However, the Keras pre-processing functions we're using are not native TensorFlow functions, so this is not possible.
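
To make the client-side step concrete, here is a plain-Python sketch of what padding to a fixed length does. It mirrors the default behaviour of Keras `pad_sequences`, which pads and truncates at the front of each sequence. The helper name `pad_pre` and the token ids are made up.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    if maxlen <= 0:
        return []
    trimmed = sequence[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed

print(pad_pre([7, 12, 3], maxlen=5))          # [0, 0, 7, 12, 3]
print(pad_pre([1, 2, 3, 4, 5, 6], maxlen=5))  # [2, 3, 4, 5, 6]
```

Whatever padding convention was used during training has to be reproduced exactly by the client, which is why keeping this logic outside the graph invites training-serving skew.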

In [57]:
import pickle
from tensorflow.python.keras.preprocessing import sequence
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials
import json

requests = techcrunch+nytimes+github

# Tokenize and pad sentences using same mapping used in the deployed model
with open("txtclsmodel/tokenizer.pickled", "rb") as f:
  tokenizer = pickle.load(f)

requests_tokenized = tokenizer.texts_to_sequences(requests)
requests_tokenized = sequence.pad_sequences(requests_tokenized,maxlen=50)

# JSON format the requests
request_data = {'instances':requests_tokenized.tolist()}

# Authenticate and call CMLE prediction API 
credentials = GoogleCredentials.get_application_default()
api = discovery.build('ml', 'v1', credentials=credentials,
            discoveryServiceUrl='https://storage.googleapis.com/cloud-ml/discovery/ml_v1_discovery.json')

parent = 'projects/%s/models/%s/versions/%s' % (PROJECT, 'txtcls', 'v1_finetune')
response = api.projects().predict(body=request_data, name=parent).execute()

# Format and print response
for i in range(len(requests)):
  print('\n{}'.format(requests[i]))
  print(' github    : {}'.format(response['predictions'][i]['dense_1'][0]))
  print(' nytimes   : {}'.format(response['predictions'][i]['dense_1'][1]))
  print(' techcrunch: {}'.format(response['predictions'][i]['dense_1'][2]))


Uber shuts down self-driving trucks unit
 github    : 1.11331166863e-06
 nytimes   : 0.999956488609
 techcrunch: 4.24026038672e-05

Grover raises €37M Series A to offer latest tech products as a subscription
 github    : 3.69702429452e-05
 nytimes   : 0.97977989912
 techcrunch: 0.0201830491424

Tech companies can now bid on the Pentagon’s $10B cloud contract
 github    : 0.000980124576017
 nytimes   : 0.891622960567
 techcrunch: 0.107396923006

‘Lopping,’ ‘Tips’ and the ‘Z-List’: Bias Lawsuit Explores Harvard’s Admissions
 github    : 3.89208285013e-14
 nytimes   : 1.0
 techcrunch: 8.62220503328e-11

A $3B Plan to Turn Hoover Dam into a Giant Battery
 github    : 0.0126319322735
 nytimes   : 0.978058815002
 techcrunch: 0.00930932350457

A MeToo Reckoning in China’s Workplace Amid Wave of Accusations
 github    : 2.3520786962e-14
 nytimes   : 1.0
 techcrunch: 4.22925854494e-14

Show HN: Moon – 3kb JavaScript UI compiler
 github    : 1.0
 nytimes   : 1.3027193087e-15
 techcrunch: 8.5340
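
Each prediction is a list of three probabilities in the order github, nytimes, techcrunch, as the print statements above show. Here is a small sketch for turning one such list into a label. The helper `top_label` is hypothetical, and the sample values are the first result printed above.

```python
LABELS = ['github', 'nytimes', 'techcrunch']

def top_label(probabilities):
    """Return (label, probability) for the highest-scoring class."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return LABELS[best], probabilities[best]

# Scores for 'Uber shuts down self-driving trucks unit' from the output above.
scores = [1.11331166863e-06, 0.999956488609, 4.24026038672e-05]
print(top_label(scores))
```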

### Bonus: Native Tensorflow Predictions


#### Why Native?

Up until now we've been using pure Python functions to do our data pre-processing. This is fine during training, but during serving it adds the limitation that we need a Python client in the prediction pipeline.

This limits your serving flexibility. For example, let's say you want to serve this model locally (offline) on a mobile phone. How would you do it? It's non-trivial to execute Python code on Android.

A better way is to do all of our serving pre-processing with native TensorFlow operations. As long as we stick to native operations, we can take advantage of TensorFlow's hardware-agnostic execution engine and the considerable engineering effort the TensorFlow team has put into making sure our code works whether we're running on a server, a mobile device, or an embedded device.

### TensorFlow/Keras Code (Native)

Please explore the code in this <a href="txtclsmodel/trainer">directory</a>: `model_native.py` contains the TensorFlow model and `task.py` parses command line arguments and launches the training job.

In particular, look for the following:

1. [tf.keras.preprocessing.text.Tokenizer.fit_on_texts()](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer#fit_on_texts) to generate a mapping from our word vocabulary to integers
2. [tf.gfile](https://www.tensorflow.org/api_docs/python/tf/gfile/GFile) to write the vocabulary mapping to disk
3. [tf.contrib.lookup.index_table_from_file()](https://www.tensorflow.org/api_docs/python/tf/contrib/lookup/index_table_from_file) to encode our sentences into a tensor of their respective word-integers, based on the vocabulary mapping written to disk in the previous step
4. [tf.pad](https://www.tensorflow.org/api_docs/python/tf/pad) and [tf.slice](https://www.tensorflow.org/api_docs/python/tf/slice) to pad all sequences to be the same length
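
Steps 1 and 2 amount to building a word-to-integer mapping and saving it. The sketch below uses only the standard library and a local temporary file, so it stands in for `Tokenizer.fit_on_texts()` and `tf.gfile` rather than calling them. The frequency-ranked indexing starting at 1 follows the Keras tokenizer's convention, ties are broken alphabetically here for determinism, and the `word,index` line format is an assumption. All function names are hypothetical.

```python
import collections
import os
import tempfile

def fit_vocabulary(texts):
    """Rank lower-cased, whitespace-split words by frequency, starting at 1."""
    counts = collections.Counter(w for t in texts for w in t.lower().split())
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i + 1 for i, word in enumerate(ranked)}

def write_vocabulary(word_index, path):
    """Write one 'word,index' line per vocabulary entry, ordered by index."""
    with open(path, 'w') as f:
        for word, index in sorted(word_index.items(), key=lambda kv: kv[1]):
            f.write('{},{}\n'.format(word, index))

def read_vocabulary(path):
    """Read the mapping back from disk."""
    with open(path) as f:
        pairs = (line.rstrip('\n').rsplit(',', 1) for line in f)
        return {word: int(index) for word, index in pairs}

titles = ['show hn moon', 'show hn hello']
word_index = fit_vocabulary(titles)
path = os.path.join(tempfile.mkdtemp(), 'vocab.txt')
write_vocabulary(word_index, path)
print(read_vocabulary(path) == word_index)
```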

Note that we will leave our training/evaluation input_fn as is. However, we will modify our serving_input_fn to use **3.** and **4.**, which are both native TensorFlow functions.
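
The effect of steps 3 and 4 inside the serving function can be illustrated without TensorFlow. This plain-Python sketch looks each word up in a vocabulary, maps unknown words to a default id, pads at the end, and slices to a fixed length. It is an analogy for `index_table_from_file()`, `tf.pad`, and `tf.slice`, not the code in `model_native.py`. The end-padding, the default id of 0, and all names are assumptions.

```python
MAX_LEN = 5       # assumed fixed sequence length
UNKNOWN_ID = 0    # assumed id for out-of-vocabulary words

def encode_title(title, vocab, max_len=MAX_LEN):
    """Lookup, pad, then slice, mirroring the order of the native ops."""
    ids = [vocab.get(word, UNKNOWN_ID) for word in title.lower().split()]
    padded = ids + [0] * max_len   # pad: append max_len zeros
    return padded[:max_len]        # slice: keep the first max_len ids

vocab = {'show': 1, 'hn': 2, 'moon': 3}
print(encode_title('Show HN Moon', vocab))
print(encode_title('Show HN Moon is a tiny UI compiler', vocab))
```

Under these assumptions both titles encode to the same ids, because a padded position and an unknown word both map to 0. That is the concern about padwords and unknown tokens listed under "Issues to vet".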

### Run Locally (Native)
Let's make sure the code runs locally by training for a fraction of an epoch. Note the new `--native` parameter.

In [52]:
%%bash
OUTDIR=${PWD}/txtcls_trained
rm -rf $OUTDIR
gcloud ml-engine local train \
   --module-name=trainer.task \
   --package-path=${PWD}/txtclsmodel/trainer \
   -- \
   --output_dir=$OUTDIR \
   --train_data_path=${PWD}/data/txtcls/train.tsv \
   --eval_data_path=${PWD}/data/txtcls/eval.tsv \
   --embedding_path=${PWD}/data/txtcls/glove.6B.200d.txt \
   --native \
   --num_epochs=0.1

  from ._conv import register_converters as _register_converters
Traceback (most recent call last):
  File "/usr/local/envs/py2env/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/local/envs/py2env/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/content/datalab/training-data-analyst/courses/machine_learning/deepdive/09_sequence/txtclsmodel/trainer/task.py", line 62, in <module>
    model_native.train_and_evaluate(output_dir, hparams)
  File "trainer/model_native.py", line 319, in train_and_evaluate
    f.write("{},{}\n".format(word, index))
  File "/usr/local/envs/py2env/lib/python2.7/site-packages/tensorflow/python/lib/io/file_io.py", line 103, in write
    self._prewrite_check()
  File "/usr/local/envs/py2env/lib/python2.7/site-packages/tensorflow/python/lib/io/file_io.py", line 89, in _prewrite_check
    compat.as_bytes(self.__name), compat.as_bytes(self.__mode), status)
  File "/usr/

### Train on the Cloud (Native)

In [None]:
%%bash
OUTDIR=gs://${BUCKET}/txtcls/trained_finetune_native
JOBNAME=txtcls_$(date -u +%y%m%d_%H%M%S)
REGION=us-central1
gsutil -m rm -rf $OUTDIR
gcloud ml-engine jobs submit training $JOBNAME \
 --region=$REGION \
 --module-name=trainer.task \
 --package-path=${PWD}/txtclsmodel/trainer \
 --job-dir=$OUTDIR \
 --scale-tier=BASIC_GPU \
 --runtime-version=$TFVERSION \
 -- \
 --output_dir=$OUTDIR \
 --train_data_path=gs://${BUCKET}/txtcls/train.tsv \
 --eval_data_path=gs://${BUCKET}/txtcls/eval.tsv \
 --embedding_path=gs://${BUCKET}/txtcls/glove.6B.200d.txt \
 --native \
 --num_epochs=20

### Deploy trained model (Native)

In [46]:
%%bash
MODEL_NAME="txtcls"
MODEL_VERSION="v1_finetune_native"
MODEL_LOCATION=$(gsutil ls gs://${BUCKET}/txtcls/trained_finetune_native/export/exporter/ | tail -1)
#gcloud ml-engine versions delete ${MODEL_VERSION} --model ${MODEL_NAME} --quiet
#gcloud ml-engine models delete ${MODEL_NAME}
#gcloud ml-engine models create ${MODEL_NAME} --regions $REGION
gcloud ml-engine versions create ${MODEL_VERSION} --model ${MODEL_NAME} --origin ${MODEL_LOCATION} --runtime-version=$TFVERSION

Creating version (this might take a few minutes)......
...............................................................................................................done.


### Get Predictions (Native)

In [None]:
techcrunch=[
  'Uber shuts down self-driving trucks unit',
  'Grover raises €37M Series A to offer latest tech products as a subscription',
  'Tech companies can now bid on the Pentagon’s $10B cloud contract'
]
nytimes=[
  '‘Lopping,’ ‘Tips’ and the ‘Z-List’: Bias Lawsuit Explores Harvard’s Admissions',
  'A $3B Plan to Turn Hoover Dam into a Giant Battery',
  'A MeToo Reckoning in China’s Workplace Amid Wave of Accusations'
]
github=[
  'Show HN: Moon – 3kb JavaScript UI compiler',
  'Show HN: Hello, a CLI tool for managing social media',
  'Firefox Nightly added support for time-travel debugging'
]

Note how we can now feed the titles directly to the model! All the pre-processing is done for us inside the TensorFlow serving_input_fn.
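
The difference shows up in the request body. Here is a sketch of the two JSON payloads side by side. The integer ids in the first payload are invented for illustration and shortened to five positions.

```python
import json

title = 'Uber shuts down self-driving trucks unit'

# Client-side pre-processing: each instance is a padded list of word ids.
tokenized_body = {'instances': [[0, 0, 14, 872, 35]]}  # made-up ids

# Native pre-processing: each instance is the raw title string.
native_body = {'instances': [title]}

print(json.dumps(tokenized_body))
print(json.dumps(native_body))
```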

In [56]:
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials
import json

# JSON format the requests
requests = techcrunch+nytimes+github
request_data = {'instances': requests}

# Authenticate and call CMLE prediction API 
credentials = GoogleCredentials.get_application_default()
api = discovery.build('ml', 'v1', credentials=credentials,
            discoveryServiceUrl='https://storage.googleapis.com/cloud-ml/discovery/ml_v1_discovery.json')

parent = 'projects/%s/models/%s/versions/%s' % (PROJECT, 'txtcls', 'v1_finetune_native')
response = api.projects().predict(body=request_data, name=parent).execute()

# Format and print response
for i in range(len(requests)):
  print('\n{}'.format(requests[i]))
  print(' github    : {}'.format(response['predictions'][i]['dense_1'][0]))
  print(' nytimes   : {}'.format(response['predictions'][i]['dense_1'][1]))
  print(' techcrunch: {}'.format(response['predictions'][i]['dense_1'][2]))


Uber shuts down self-driving trucks unit
 github    : 0.0131800081581
 nytimes   : 0.442704319954
 techcrunch: 0.544115662575

Grover raises €37M Series A to offer latest tech products as a subscription
 github    : 2.01788429877e-06
 nytimes   : 0.00564260361716
 techcrunch: 0.99435544014

Tech companies can now bid on the Pentagon’s $10B cloud contract
 github    : 0.00708839157596
 nytimes   : 0.0283307153732
 techcrunch: 0.964580893517

‘Lopping,’ ‘Tips’ and the ‘Z-List’: Bias Lawsuit Explores Harvard’s Admissions
 github    : 0.0764058530331
 nytimes   : 0.224069595337
 techcrunch: 0.699524581432

A $3B Plan to Turn Hoover Dam into a Giant Battery
 github    : 0.0555238202214
 nytimes   : 0.270869225264
 techcrunch: 0.673606932163

A MeToo Reckoning in China’s Workplace Amid Wave of Accusations
 github    : 0.178793072701
 nytimes   : 0.382247179747
 techcrunch: 0.438959777355

Show HN: Moon – 3kb JavaScript UI compiler
 github    : 0.999206602573
 nytimes   : 2.54288829638e-06
 

#### Issues to vet
- Is the vocab.txt embedded with the model? (try deleting GCS vocab.txt, does serving still work?)
- Is all the code the same except the serving_input_fn? If so no need for two versions! Just talk about why you need the serving_input_fn to be different from the get go.
- Does distributed training work on CMLE?
- Padwords and unknown tokens have the same representation. Is this a significant issue? Could add a mapping for the padword to remedy
- Training/serving skew (redo input fn after all so padding)
- Are you using only the top_N words? (test in sandbox)

#### References
- This implementation is based on code from Google's 'eng-edu' team: https://github.com/google/eng-edu/tree/master/ml/guides/text_classification.
- See the full text classification tutorial at: https://developers.google.com/machine-learning/guides/text-classification/

Copyright 2017 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.