<h1> Text Classification using TensorFlow on Cloud ML Engine (using pretrained embedding) </h1>

This notebook illustrates:
<ol>
<li> Creating datasets for Machine Learning using BigQuery
<li> Creating a text classification model using the high-level Estimator API and a pre-trained embedding. (This is the difference from <a href="txtcls1.ipynb"> txtcls1.ipynb </a>)
<li> Training on Cloud ML Engine
<li> Deploying model
<li> Predicting with model
</ol>

In [3]:
# change these to try this notebook out
BUCKET = 'cloud-training-demos-ml'
PROJECT = 'cloud-training-demos'
REGION = 'us-central1'

In [4]:
import os
os.environ['BUCKET'] = BUCKET
os.environ['PROJECT'] = PROJECT
os.environ['REGION'] = REGION

In [5]:
!gcloud config set project $PROJECT

Updated property [core/project].


In [1]:
import tensorflow as tf
print tf.__version__

1.4.1


The idea is to look at the title of an article and figure out whether it came from the New York Times, TechCrunch, or GitHub. Look at <a href="txtcls1.ipynb"> txtcls1.ipynb </a> for a solution that learns word embeddings as part of the problem itself. In this notebook, I will show how to use a pretrained embedding instead.

<h2> Data exploration and preprocessing in BigQuery </h2>
<p>
See <a href="txtcls1.ipynb"> txtcls1.ipynb </a> for an explanation. Here, I simply repeat the key steps to create the dataset.

In [2]:
import google.datalab.bigquery as bq
query="""
SELECT source, LOWER(REGEXP_REPLACE(title, '[^a-zA-Z0-9 $.-]', ' ')) AS title FROM
(SELECT
  ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.'))[OFFSET(1)] AS source,
  title
FROM
  `bigquery-public-data.hacker_news.stories`
WHERE
  REGEXP_CONTAINS(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.com$')
  AND LENGTH(title) > 10
)
WHERE (source = 'github' OR source = 'nytimes' OR source = 'techcrunch')
"""

traindf = bq.Query(query + " AND MOD(ABS(FARM_FINGERPRINT(title)),4) > 0").execute().result().to_dataframe()
evaldf  = bq.Query(query + " AND MOD(ABS(FARM_FINGERPRINT(title)),4) = 0").execute().result().to_dataframe()

import os, shutil
DATADIR='data/txtcls2'
shutil.rmtree(DATADIR, ignore_errors=True)
os.makedirs(DATADIR)
traindf.to_csv( os.path.join(DATADIR,'train.csv'), header=False, index=False, encoding='utf-8', sep='\t')
evaldf.to_csv( os.path.join(DATADIR,'eval.csv'), header=False, index=False, encoding='utf-8', sep='\t')
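
The `MOD(ABS(FARM_FINGERPRINT(title)),4)` clause in the query splits the data deterministically: a given title always hashes to the same bucket, so about three quarters of the titles land in the training set and one quarter in the evaluation set, and the split is repeatable across runs. Here is a minimal Python sketch of the same idea, assuming MD5 as a stand-in hash (it is not FARM_FINGERPRINT, so the bucket assignments will differ from BigQuery's) and a hypothetical helper named `assign_split`:

```python
import hashlib

def assign_split(title, num_buckets=4):
    """Return 'eval' for hash bucket 0 and 'train' for every other bucket."""
    # MD5 is only a stand-in for BigQuery's FARM_FINGERPRINT (an assumption).
    digest = hashlib.md5(title.encode('utf-8')).hexdigest()
    bucket = int(digest, 16) % num_buckets
    return 'eval' if bucket == 0 else 'train'

titles = ['some title', 'a longer title', 'an even longer title']
for title in titles:
    print(title, '->', assign_split(title))
```

Because the assignment depends only on the title text, rerunning the query never moves an example from one set to the other.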

In [None]:
%bash
gsutil cp data/txtcls2/*.csv gs://${BUCKET}/txtcls2/

## Pre-trained embedding

To provide words as inputs to a neural network, we have to convert words to numbers. Ideally, we want related words to have numbers that are close to each other. This is what an embedding (such as word2vec) does. Here, I'll use the <a href="https://nlp.stanford.edu/projects/glove/">GloVe</a> embedding from Stanford just because, at 160MB, it is smaller than <a href="https://code.google.com/archive/p/word2vec/">word2vec</a> from Google (1.5 GB).
<p>
For testing purposes, I will also create a smaller file, consisting of the 1000 most common words.
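
To make "close to each other" concrete: closeness between embedding vectors is commonly measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors (illustrative values, not taken from GloVe) and a hypothetical helper `cosine_similarity`:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors, invented for illustration only.
toy = {
    'cat': [0.9, 0.8, 0.1],
    'dog': [0.8, 0.9, 0.2],
    'stock': [0.1, 0.2, 0.9],
}
print(cosine_similarity(toy['cat'], toy['dog']))    # related words: close to 1
print(cosine_similarity(toy['cat'], toy['stock']))  # unrelated words: much lower
```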

In [None]:
%bash
wget http://nlp.stanford.edu/data/glove.6B.zip

In [None]:
%bash
unzip -p glove.6B.zip glove.6B.50d.txt | gzip > pretrained_embedding.txt.gz
#rm glove.6B.zip

rm -f subset_embedding.txt*
zcat pretrained_embedding.txt.gz | head -1000 > subset_embedding.txt
gzip subset_embedding.txt

In [8]:
%bash
zcat subset_embedding.txt.gz | head -1
gsutil cp *_embedding.txt.gz gs://${BUCKET}/txtcls2/

the 0.418 0.24968 -0.41242 0.1217 0.34527 -0.044457 -0.49688 -0.17862 -0.00066023 -0.6566 0.27843 -0.14767 -0.55677 0.14658 -0.0095095 0.011658 0.10204 -0.12792 -0.8443 -0.12181 -0.016801 -0.33279 -0.1552 -0.23131 -0.19181 -1.8823 -0.76746 0.099051 -0.42125 -0.19526 4.0071 -0.18594 -0.52287 -0.31681 0.00059213 0.0074449 0.17778 -0.15897 0.012041 -0.054223 -0.29871 -0.15749 -0.34758 -0.045637 -0.44251 0.18785 0.0027849 -0.18411 -0.11514 -0.78581



gzip: stdout: Broken pipe
Copying file://pretrained_embedding.txt.gz [Content-Type=text/plain]...

In [None]:
%bash
gsutil ls -l gs://${BUCKET}/txtcls2/*.txt.gz

In [9]:
PADWORD = 'ZYXW'
from tensorflow.python.lib.io import file_io

class Word2Vec:
  '''
  vocab, embeddings
  '''
  def vocab_size(self):
    return len(self.vocab)
  
  def embed_dim(self):
    return len(self.embeddings[0])
  
  def __init__(self, filename):
    import gzip, StringIO
    import numpy as np
    self.vocab = [PADWORD]
    self.embeddings = [0]
    with file_io.FileIO(filename, mode='rb') as f:
      compressedFile = StringIO.StringIO(f.read())
      decompressedFile = gzip.GzipFile(fileobj=compressedFile)
      for line in decompressedFile:
        pieces = line.split()
        self.vocab.append(pieces[0])
        self.embeddings.append(np.asarray(pieces[1:], dtype='float32'))
    self.embeddings[0] = np.zeros_like(self.embeddings[1])
    self.vocab.append('') # for out-of-vocabulary words
    self.embeddings.append(np.ones_like(self.embeddings[1]))
    self.embeddings = np.array(self.embeddings)
    print('Loaded {}D vectors for {} words from {}'.format(self.embed_dim(), self.vocab_size(), filename))

In [10]:
#wv = Word2Vec('gs://{}/txtcls2/pretrained_embedding.txt.gz'.format(BUCKET))
wv = Word2Vec('subset_embedding.txt.gz')
print wv.embeddings.shape

Loaded 50D vectors for 1002 words from subset_embedding.txt.gz
(1002, 50)
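
Each line of the embedding file is a word followed by its vector components, separated by spaces, as in the `the 0.418 ...` line printed earlier. Here is a self-contained sketch of that parsing step using only the standard library. The in-memory gzip payload and its two short rows are made up for illustration, and `parse_embedding_lines` is a hypothetical helper, not part of the trainer code:

```python
import gzip
import io

def parse_embedding_lines(fileobj):
    """Return (vocab, vectors) parsed from 'word v1 v2 ...' lines."""
    vocab, vectors = [], []
    for raw in fileobj:
        pieces = raw.decode('utf-8').split()
        vocab.append(pieces[0])
        vectors.append([float(p) for p in pieces[1:]])
    return vocab, vectors

# Build a tiny gzip-compressed payload in memory (illustrative values).
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as gz:
    gz.write(b'the 0.418 0.24968\n, 0.013441 0.23682\n')
buf.seek(0)

with gzip.GzipFile(fileobj=buf, mode='rb') as gz:
    vocab, vectors = parse_embedding_lines(gz)
print(vocab, vectors)
```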


<h2> TensorFlow code </h2>

Please explore the code in this <a href="txtcls2/trainer">directory</a> -- <a href="txtcls2/trainer/model.py">model.py</a> contains the key TensorFlow model and <a href="txtcls2/trainer/task.py">task.py</a> has a main() that launches the training job.

The following cells should give you an idea of what the model code does. The idea is to load up the embedding file and get vectors corresponding to the words in that file's vocabulary.  For example "the" might be mapped to a 50D vector.  Now, whenever we see "the" in the input document, we need to replace it by the 50D vector.
The method that does this:
<pre>
tf.nn.embedding_lookup
</pre>
requires the *index* of the word "the" in the original file (perhaps that index=1). To find the index, we will use VocabularyProcessor.
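
The lookup can be pictured in two steps: map each word to its row index in the vocabulary, then read that row out of the embedding matrix. Here is a plain-Python sketch, assuming a made-up four-word vocabulary with 2D vectors and a hypothetical `lookup_rows` helper. It mimics the idea behind `tf.nn.embedding_lookup` and is not TensorFlow code:

```python
# Row 0 is reserved for the padding word, as in the Word2Vec class above.
vocab = ['ZYXW', 'the', 'title', 'longer']
embeddings = [[0.0, 0.0], [0.4, 0.2], [0.1, -0.3], [0.7, 0.5]]  # made-up values
word_to_id = {word: i for i, word in enumerate(vocab)}

def lookup_rows(matrix, ids):
    """Return the rows of matrix selected by ids, in order."""
    return [matrix[i] for i in ids]

ids = [word_to_id[w] for w in 'the longer title'.split()]
print(ids)                           # [1, 3, 2]
print(lookup_rows(embeddings, ids))  # one 2D vector per word
```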

In [12]:
import tensorflow as tf
from tensorflow.contrib import lookup
from tensorflow.python.platform import gfile
import numpy as np

print tf.__version__
MAX_DOCUMENT_LENGTH = 5  

# raw input
lines = ['Some title', 'A longer title', 'An even longer title', 'This is longer than doc length']
lines = [line.lower() for line in lines]
#lines = tf.constant(lines)  # vocabprocessor doesn't work

# we first create word-ids for each of the words in the glove embedding file
vocab_processor = tf.contrib.learn.preprocessing.VocabularyProcessor(MAX_DOCUMENT_LENGTH)
vocab_processor.fit(wv.vocab)  # word to word-id
wordid_to_embed = tf.convert_to_tensor(wv.embeddings) # word-id to embedding

# take lines of input and find word-ids; then lookup the embedding for each word-id
tensorids = np.array(list(vocab_processor.transform(lines)))
numbers = tf.nn.embedding_lookup(wordid_to_embed, tensorids)

with tf.Session() as sess:
  print "numbers=", numbers.eval()[0], numbers.shape

1.2.1
numbers= [[ 0.13403     0.89178002 -0.76761001 -0.64183998  0.86203998  1.31219995
  -0.64017999  0.82067001  0.32782999  0.021457   -0.095194    0.40825
  -0.63602    -0.018275    0.69708002 -0.29530999 -1.19120002 -0.23897
   0.34340999 -0.33195999  0.23702     1.83640003  0.12295    -0.18624
   0.86502999 -2.63599992 -0.7791      0.20299999  0.18985    -0.79896998
   2.98819995  0.44336    -0.28367001 -0.19588     0.061875    0.38558
  -0.027622    0.71846998  0.17156    -1.21679997  0.081636    0.17293
  -0.31718001 -0.37039     0.18977    -0.89174998  0.18492    -1.62510002
   0.039134   -0.10279   ]
 [ 0.79667002 -0.42120001  0.48504999  0.40887001 -0.073664   -0.21618
  -0.83293003  0.20614     0.31885001 -0.39559999 -0.096698   -0.54991001
  -0.53323001 -0.6437      1.01269996 -0.23283    -1.18599999 -0.78666002
  -0.33329001 -0.28738999  0.60349     0.54689997  0.77043998 -0.40391999
  -0.011623   -1.56649995 -0.75793999  0.3075     -0.59415001  0.57292998
   3.04690003 

However, [as pointed out by Dennis Murray](https://stackoverflow.com/questions/35687678/using-a-pre-trained-word-embedding-word2vec-or-glove-in-tensorflow), tf.constants are not memory efficient. To avoid storing multiple copies of the wordid_to_embed tensor, we should use a Variable. 
<p>
Also, although the VocabularyProcessor has a convenient transform() method, it is pure Python and cannot handle Tensors. In practice, our "lines" will be a tensor, so we have to build an index table and do the lookup with that. This code also differs in how it handles out-of-vocabulary words: they go to a single out-of-vocabulary bucket whose embedding is all ones (because PADWORD is mapped to zeros), whereas the vocab processor uses zeros.
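
The pad-then-slice step in the next cell forces every title to exactly MAX_DOCUMENT_LENGTH ids, and `num_oov_buckets=1` sends every unknown word to one extra id just past the known vocabulary. Here is a plain-Python sketch of that logic, with a made-up vocabulary and a hypothetical `to_ids` helper:

```python
PADWORD = 'ZYXW'
MAX_DOCUMENT_LENGTH = 5
vocab = [PADWORD, 'some', 'title', 'longer']  # made-up vocabulary
word_to_id = {word: i for i, word in enumerate(vocab)}
OOV_ID = len(vocab)  # the single out-of-vocabulary bucket

def to_ids(line):
    """Look up ids, append padding zeros, then keep the first MAX_DOCUMENT_LENGTH."""
    ids = [word_to_id.get(w, OOV_ID) for w in line.split()]
    padded = ids + [0] * MAX_DOCUMENT_LENGTH  # id 0 is PADWORD
    return padded[:MAX_DOCUMENT_LENGTH]

print(to_ids('some title'))                      # [1, 2, 0, 0, 0]
print(to_ids('this is longer than doc length'))  # [4, 4, 3, 4, 4]
```

Padding by the full document length and then slicing avoids having to compute how many pad ids each title needs.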

In [13]:
import tensorflow as tf
from tensorflow.contrib import lookup
from tensorflow.python.platform import gfile
import numpy as np

print tf.__version__
MAX_DOCUMENT_LENGTH = 5  

# raw input
lines = ['Some title', 'A longer title', 'An even longer title', 'This is longer than doc length']
lines = [line.lower() for line in lines]
lines = tf.constant(lines)

wordid_to_embed = tf.Variable(tf.constant(0.0, shape=[wv.vocab_size(), wv.embed_dim()]), trainable=False, name="embedding")
embedding_placeholder = tf.placeholder(tf.float32, [wv.vocab_size(), wv.embed_dim()])
embedding_init = wordid_to_embed.assign(embedding_placeholder)
  
# take lines of input and find word-ids; then lookup the embedding for each word-id
table = tf.contrib.lookup.index_table_from_tensor(tf.convert_to_tensor(wv.vocab[:-1]), num_oov_buckets=1)

words = tf.string_split(lines)
densewords = tf.sparse_tensor_to_dense(words, default_value=PADWORD)
numbers = table.lookup(densewords)
padding = tf.constant([[0,0],[0,MAX_DOCUMENT_LENGTH]])
padded = tf.pad(numbers, padding)
sliced = tf.slice(padded, [0,0], [-1, MAX_DOCUMENT_LENGTH])
embeds = tf.nn.embedding_lookup(wordid_to_embed, sliced)

with tf.Session() as sess:
  tf.tables_initializer().run()
  tf.get_default_session().run(embedding_init, feed_dict={embedding_placeholder: wv.embeddings})
  print "embeds=", embeds.eval()[0], embeds.shape

1.2.1
embeds= [[  9.28709984e-01  -1.08340003e-01   2.14969993e-01  -5.02370000e-01
    1.03790000e-01   2.27280006e-01  -5.41980028e-01  -2.90080011e-01
   -6.46070004e-01   1.26640007e-01  -4.14869994e-01  -2.93430001e-01
    3.68550003e-01  -4.17329997e-01   6.91160023e-01   6.73409998e-02
    1.97150007e-01  -3.04649994e-02  -2.17230007e-01  -1.22379994e+00
    9.54690017e-03   1.95940003e-01   5.65949976e-01  -6.74730018e-02
    5.92079982e-02  -1.39090002e+00  -8.92750025e-01  -1.35460004e-01
    1.62000000e-01  -4.02099997e-01   4.16440010e+00   3.78160000e-01
    1.57969996e-01  -4.88920003e-01   2.31309995e-01   2.32580006e-01
   -2.53140002e-01  -1.99770004e-01  -1.22579999e-01   1.56200007e-01
   -3.19950014e-01   3.83139998e-01   4.72660005e-01   8.76999974e-01
    3.22230011e-01   1.32919999e-03  -4.98600006e-01   5.55800021e-01
   -7.03589976e-01  -5.26929975e-01]
 [ -1.23829997e+00   9.94870007e-01  -7.36769974e-01   1.07729995e+00
    2.54099995e-01   8.17380026e-02   1

In [24]:
%bash
grep -E "def |class " txtcls1/trainer/model.py

class Word2Vec:
  def vocab_size(self):
  def embed_dim(self):
  def __init__(self, filename):
  def init(sess):
  def get_embedding(self, lines):
def init(hparams):
def save_vocab(trainfile, txtcolname, outfilename):
def read_dataset(hparams, prefix, batch_size=BATCH_SIZE):
  def _input_fn():
def get_embedding(hparams, titles, embed_size):
def cnn_model(features, labels, mode, params):
def serving_input_fn():
def get_train(hparams):
def get_valid(hparams, batch_size):
def init_embedding_hooks(hparams):
  class InitEmbeddingHook(tf.train.SessionRunHook):
    def after_create_session(self, session, coord):
def make_experiment_fn(output_dir, hparams):
  def experiment_fn(output_dir):


Let's make sure the code works locally on a small dataset for a few steps. Because of the size of the graph, though, this will take a *long* time and may crash on smaller machines (it has to evaluate the graph five times and write out 5 checkpoints).

In [None]:
%bash
echo "bucket=${BUCKET}"
rm -rf outputdir
export PYTHONPATH=${PYTHONPATH}:${PWD}/txtcls1
python -m trainer.task \
   --bucket=${BUCKET} \
   --output_dir=outputdir \
   --glove_embedding=gs://${BUCKET}/txtcls2/subset_embedding.txt.gz \
   --job-dir=./tmp --train_steps=200 

When I ran it, I got 37% accuracy after a few steps. Once the code works in standalone mode, you can run it on Cloud ML Engine. You can monitor the job from the GCP console in the Cloud Machine Learning Engine section. Since we have 72,000 examples and a batch size of 32, train_steps=36,000 amounts to 16 epochs (36,000 * 32 / 72,000 = 16).

In [None]:
%bash
OUTDIR=gs://${BUCKET}/txtcls2/trained_model
JOBNAME=txtcls_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR
gsutil cp txtcls1/trainer/*.py $OUTDIR
gcloud ml-engine jobs submit training $JOBNAME \
   --region=$REGION \
   --module-name=trainer.task \
   --package-path=$(pwd)/txtcls1/trainer \
   --job-dir=$OUTDIR \
   --staging-bucket=gs://$BUCKET \
   --scale-tier=BASIC_GPU \
   --runtime-version=1.4 \
   -- \
   --bucket=${BUCKET} \
   --output_dir=${OUTDIR} \
   --glove_embedding=gs://${BUCKET}/txtcls2/pretrained_embedding.txt.gz \
   --train_steps=36000

Training finished with an accuracy of 54.5%.

<h2> Deploy trained model </h2>
<p>
Deploying the trained model to act as a REST web service is a simple gcloud call.

In [18]:
%bash
gsutil ls gs://${BUCKET}/txtcls2/trained_model/export/Servo/

gs://cloud-training-demos-ml/txtcls2/trained_model/export/Servo/
gs://cloud-training-demos-ml/txtcls2/trained_model/export/Servo/1515526356/


In [None]:
%bash
MODEL_NAME="txtcls"
MODEL_VERSION="v2"
MODEL_LOCATION=$(gsutil ls gs://${BUCKET}/txtcls2/trained_model/export/Servo/ | tail -1)
echo "Deleting and deploying $MODEL_NAME $MODEL_VERSION from $MODEL_LOCATION ... this will take a few minutes"
#gcloud ml-engine versions delete ${MODEL_VERSION} --model ${MODEL_NAME}
#gcloud ml-engine models delete ${MODEL_NAME}
#gcloud ml-engine models create ${MODEL_NAME} --regions $REGION
gcloud ml-engine versions create ${MODEL_VERSION} --model ${MODEL_NAME} --origin ${MODEL_LOCATION}

<h2> Use model to predict </h2>
<p>
Send a JSON request to the endpoint of the service to make it predict which publication the article is more likely to run in. These are actual titles of articles in the New York Times, GitHub, and TechCrunch on June 19. These titles were not part of the training or evaluation datasets.

In [22]:
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials
import json

credentials = GoogleCredentials.get_application_default()
api = discovery.build('ml', 'v1', credentials=credentials,
            discoveryServiceUrl='https://storage.googleapis.com/cloud-ml/discovery/ml_v1_discovery.json')

request_data = {'instances':
  [
      {
        'title': 'Supreme Court to Hear Major Case on Partisan Districts'.lower()
      },
      {
        'title': 'Furan -- build and push Docker images from GitHub to target'.lower()
      },
      {
        'title': 'Time Warner will spend $100M on Snapchat original shows and ads'.lower()
      },
  ]
}

parent = 'projects/%s/models/%s/versions/%s' % (PROJECT, 'txtcls', 'v2')
response = api.projects().predict(body=request_data, name=parent).execute()
print "response={0}".format(response)

[2018-01-09 21:44:07,479] {discovery.py:273} INFO - URL being requested: GET https://storage.googleapis.com/cloud-ml/discovery/ml_v1_discovery.json
[2018-01-09 21:44:07,657] {discovery.py:273} INFO - URL being requested: GET https://ml.googleapis.com/$discovery/rest?version=v1
[2018-01-09 21:44:07,988] {discovery.py:863} INFO - URL being requested: POST https://ml.googleapis.com/v1/projects/cloud-training-demos/models/txtcls/versions/v2:predict?alt=json
[2018-01-09 21:44:07,990] {client.py:614} INFO - Attempting refresh to obtain initial access_token
[2018-01-09 21:44:07,991] {client.py:903} INFO - Refreshing access_token
response={u'predictions': [{u'source': u'nytimes', u'prob': [0.8374465107917786, 0.04888102412223816, 0.11367242783308029], u'class': 0}, {u'source': u'github', u'prob': [0.012753208167850971, 0.987160325050354, 8.644349873065948e-05], u'class': 1}, {u'source': u'techcrunch', u'prob': [0.006752816028892994, 0.0005263009225018322, 0.992720901966095], u'class': 2}]}
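
Each prediction carries a `prob` list with one probability per class, and `class` is the index of the largest entry. Here is a short sketch of reading such a response, using values rounded from the output above. The `SOURCES` ordering is inferred from that output, so treat it as an assumption, and `top_source` is a hypothetical helper:

```python
SOURCES = ['nytimes', 'github', 'techcrunch']  # order inferred from the response

# Probabilities rounded from the response printed above.
response = {'predictions': [
    {'source': 'nytimes', 'prob': [0.837, 0.049, 0.114], 'class': 0},
    {'source': 'github', 'prob': [0.013, 0.987, 0.0001], 'class': 1},
    {'source': 'techcrunch', 'prob': [0.007, 0.0005, 0.993], 'class': 2},
]}

def top_source(pred):
    """Return the source name whose probability is highest."""
    probs = pred['prob']
    best = max(range(len(probs)), key=lambda i: probs[i])
    return SOURCES[best]

for pred in response['predictions']:
    print(top_source(pred), max(pred['prob']))
```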


Copyright 2017 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License