<h1> Text Classification using TensorFlow on Cloud ML Engine </h1>

This notebook illustrates:
<ol>
<li> Creating datasets for Machine Learning using BigQuery
<li> Creating a text classification model using the high-level Estimator API 
<li> Training on Cloud ML Engine
<li> Deploying the model
<li> Predicting with the model
</ol>

In [2]:
# change these to try this notebook out
BUCKET = 'cloud-training-demos-ml'
PROJECT = 'cloud-training-demos'
REGION = 'us-central1'

In [3]:
import os
os.environ['BUCKET'] = BUCKET
os.environ['PROJECT'] = PROJECT
os.environ['REGION'] = REGION

In [8]:
%datalab project set -p $PROJECT

In [None]:
!pip install --upgrade tensorflow

In [1]:
import tensorflow as tf
print(tf.__version__)

1.2.0


The idea is to look at the title of an article and figure out whether it came from the New York Times, TechCrunch, or Wired. There are far more sophisticated approaches that we could try, but for now, let's go with something very simple.

<h2> Data exploration and preprocessing in BigQuery </h2>
<p>
What does the Hacker News dataset look like?

In [10]:
%bq query
SELECT
  url, title, score
FROM
  `bigquery-public-data.hacker_news.stories`
WHERE
  LENGTH(title) > 10
  AND score > 10
LIMIT 10

url,title,score
,Ask HN: What are some good gift ideas for hacker types?,44
,"Ask HN: After Google Apps and Outlook, which email provider for custom domain?",22
http://code.ipstenu.org/2011/the-legality-of-forking/,The Legality of Forking,17
https://medium.com/@polarrist/where-are-chernobyl-s-children-a-photojournalist-s-honest-project-in-the-age-of-disaster-tourism-4cd333ab80c7,Where Are Chernobyl’s Children?,50
http://www.bbc.co.uk/news/technology-19597437,Twitter hands over messages at heart of Occupy case,61
http://weblogs.asp.net/fbouma/archive/2013/08/13/windows-store-account-getting-rid-of-it-is-as-hard-as-signing-up.aspx,Windows Store dev account: getting rid of it is as hard as signing up,51
http://www.forbes.com/sites/andyellwood/2012/01/18/being-a-regular/,Being A Regular,28
,Ask HN: The Road to Becoming an Angel or VC,12
http://personalmba.com/best-business-books/,Best Business Books,20
http://www.couch.io/migrating-to-couchdb,Migrating to CouchDB,48


Let's do some regular expression parsing in BigQuery to get the source of the newspaper article from the URL. For example, if the url is http://mobile.nytimes.com/...., I want to be left with <i>nytimes</i>. To ensure that the parsing works for all URLs of interest, I'll group by the source to make sure there are no weird names left. This was an iterative process.

In [60]:
query="""
SELECT
  ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.'))[OFFSET(1)] AS source,
  COUNT(title) AS num_articles
FROM
  `bigquery-public-data.hacker_news.stories`
WHERE
  REGEXP_CONTAINS(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '\\\.nytimes.com$|\\\.techcrunch.com$|\\\.wired.com$')
  AND LENGTH(title) > 10
  AND score > 10
GROUP BY
  source
"""

In [61]:
import google.datalab.bigquery as bq
df = bq.Query(query).execute().result().to_dataframe()
df.head()

Unnamed: 0,source,num_articles
0,nytimes,5795
1,wired,2339
2,techcrunch,1377
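
To see what the SQL expression does to a single URL, here is a rough Python equivalent. The function name `extract_source` is hypothetical and the example URLs are made up for illustration; only the regular expression and the reverse-split-index logic are taken from the query above.

```python
import re

# Same pattern as the REGEXP_EXTRACT call in the query: capture the host
# between "://" and the first "/".
HOST_PATTERN = re.compile(r'.*://(.[^/]+)/')


def extract_source(url):
    """Return the second-to-last label of the host, e.g. 'nytimes'."""
    match = HOST_PATTERN.search(url)
    if match is None:
        return None  # no scheme, or no slash after the host
    labels = match.group(1).split('.')
    if len(labels) < 2:
        return None  # BigQuery would fail on OFFSET(1); here we return None
    # ARRAY_REVERSE(...)[OFFSET(1)] picks the label just before the TLD.
    return labels[::-1][1]


print(extract_source('http://mobile.nytimes.com/some/article'))  # nytimes
print(extract_source('https://www.wired.com/story/'))            # wired
```

Note that this approach assumes a single-label top-level domain. A host such as `www.bbc.co.uk` would yield `co`, which is one reason the query restricts itself to three known `.com` sites.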


Now that we have good parsing of the URL to get the source, let's put together a dataset of source and titles. This will be our labeled dataset for machine learning.

In [62]:
query="""
SELECT
  ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.'))[OFFSET(1)] AS source,
  title
FROM
  `bigquery-public-data.hacker_news.stories`
WHERE
  REGEXP_CONTAINS(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '\\\.nytimes.com$|\\\.techcrunch.com$|\\\.wired.com$')
  AND LENGTH(title) > 10
  AND score > 10
"""
df = bq.Query(query + " LIMIT 10").execute().result().to_dataframe()
df.head()

Unnamed: 0,source,title
0,nytimes,The High Line Opens Its Third and Final Phase
1,wired,The World's most Ingenious Thief
2,techcrunch,Google To Acquire DocVerse; Office War Heats Up
3,wired,"Snow Leopard Update Blocks Intel Atom, Kills H..."
4,wired,Facebook's Human-Powered Assistant May Just Su...


For ML training, we will need to split our dataset into training and evaluation datasets (and perhaps an independent test dataset if we are going to do model or feature selection based on the evaluation dataset).  A simple way to do this is to use the hash of a well-distributed column in our data (See https://www.oreilly.com/learning/repeatable-sampling-of-data-sets-in-bigquery-for-machine-learning).
<p>
So, let's do that and save the results as CSV files.

In [64]:
traindf = bq.Query(query + " AND MOD(ABS(FARM_FINGERPRINT(title)),4) > 0").execute().result().to_dataframe()
evaldf  = bq.Query(query + " AND MOD(ABS(FARM_FINGERPRINT(title)),4) = 0").execute().result().to_dataframe()
traindf.head()

Unnamed: 0,source,title
0,wired,The Mystery of the Canadian Whiskey Fungus
1,wired,Signature of Antimatter Detected in Lightning
2,wired,Confirmed: Ice on Mars. News broken by Twitter.
3,wired,Adobe Plays the Porn Card in Flash Campaign Ag...
4,wired,Fat? Sick? Blame your grandparents' bad habits
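
The split is repeatable because a hash of the title is deterministic: the same title always lands in the same bucket, no matter when or how often the query runs. The sketch below shows the idea in plain Python. It uses `hashlib.md5` as a stand-in for `FARM_FINGERPRINT` (an assumption, since FarmHash is not in the Python standard library), so the bucket assignments will not match BigQuery's. The function names are hypothetical.

```python
import hashlib


def bucket_of(title, num_buckets=4):
    """Map a title to a stable bucket in [0, num_buckets)."""
    digest = hashlib.md5(title.encode('utf-8')).hexdigest()
    return int(digest, 16) % num_buckets


def split(titles, num_buckets=4):
    """Bucket 0 goes to evaluation; all other buckets go to training."""
    train = [t for t in titles if bucket_of(t, num_buckets) > 0]
    evaluation = [t for t in titles if bucket_of(t, num_buckets) == 0]
    return train, evaluation


titles = ['The Legality of Forking', 'Best Business Books',
          'Migrating to CouchDB', 'Being A Regular']
train, evaluation = split(titles)
print(len(train), len(evaluation))
```

With four buckets, about three quarters of the titles are expected to end up in training. Duplicate titles always fall on the same side of the split, which avoids leaking an identical example from training into evaluation.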


In [77]:
traindf.to_csv('train.csv', header=False, index=False, encoding='utf-8', sep='\t')
evaldf.to_csv('eval.csv', header=False, index=False, encoding='utf-8', sep='\t')

In [78]:
!head -3 train.csv

wired	Is Free Will An Illusion?
wired	How “Gamification” Can Make Your Customer Service Worse
wired	How GitHub Helps You Hack the Government


In [79]:
!head -3 eval.csv

wired	The Mystery of the Canadian Whiskey Fungus
wired	Signature of Antimatter Detected in Lightning 
wired	Confirmed: Ice on Mars.  News broken by Twitter.
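
Because the files are written with `sep='\t'` and no header, each line is a label, a tab, and a title. Here is a small sketch of reading that layout back, using two lines from the `train.csv` output above as sample input. The helper name `parse_line` is hypothetical.

```python
from collections import Counter

# Sample lines in the same layout as train.csv: label <TAB> title.
sample_lines = [
    'wired\tIs Free Will An Illusion?',
    'wired\tHow GitHub Helps You Hack the Government',
]


def parse_line(line):
    """Split on the first tab only, so a tab inside a title survives."""
    label, title = line.rstrip('\n').split('\t', 1)
    return label, title


rows = [parse_line(line) for line in sample_lines]
label_counts = Counter(label for label, _ in rows)
print(label_counts)
```

This simple split ignores quoting. With its default settings, pandas quotes fields that contain the separator or a double quote, so for such titles Python's `csv` module with `delimiter='\t'` is likely the safer reader.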


In [65]:
%bash
gsutil cp *.csv gs://${BUCKET}/txtcls1/

Copying file://eval.csv [Content-Type=text/csv]...
Copying file://train.csv [Content-Type=text/csv]...
Copying file://vocab.csv [Content-Type=text/csv]...
Operation completed over 3 objects/933.5 KiB.


<h2> TensorFlow code </h2>

Please explore the code in this <a href="txtcls1/trainer">directory</a> -- <a href="txtcls1/trainer/model.py">model.py</a> contains the key TensorFlow model and <a href="txtcls1/trainer/task.py">task.py</a> has a main() that launches the training job.

In [112]:
%bash
grep "^def" txtcls1/trainer/model.py

def init(bucket, num_epochs):
def save_vocab(trainfile, txtcolname, outfilename):
def read_dataset(prefix, batch_size=20):
def cnn_model(features, target, mode):
def serving_input_fn():
def get_train():
def get_valid():
def experiment_fn(output_dir):


Let's make sure the code works locally on a small dataset for a few epochs.

In [None]:
%bash
echo "bucket=${BUCKET}"
rm -rf outputdir
export PYTHONPATH=${PYTHONPATH}:${PWD}/txtcls1
python -m trainer.task \
   --bucket=${BUCKET} \
   --output_dir=outputdir \
   --job-dir=./tmp --num_epochs=2

When I ran it, I got 62% accuracy after two epochs.
<p>
Once the code works in standalone mode, you can run it on Cloud ML Engine. You can monitor the job from the GCP console in the Cloud Machine Learning Engine section.

In [9]:
%bash
OUTDIR=gs://${BUCKET}/txtcls1/trained_model
JOBNAME=txtcls_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR
gsutil cp txtcls1/trainer/*.py $OUTDIR
gcloud ml-engine jobs submit training $JOBNAME \
   --region=$REGION \
   --module-name=trainer.task \
   --package-path=$(pwd)/txtcls1/trainer \
   --job-dir=$OUTDIR \
   --staging-bucket=gs://$BUCKET \
   --scale-tier=BASIC --runtime-version=1.2 \
   -- \
   --bucket=${BUCKET} \
   --output_dir=${OUTDIR} \
   --num_epochs=10

gs://cloud-training-demos-ml/txtcls1/trained_model us-central1 txtcls_170619_203611
jobId: txtcls_170619_203611
state: QUEUED


Removing gs://cloud-training-demos-ml/txtcls1/trained_model/__init__.py#1497903836107816...
Removing gs://cloud-training-demos-ml/txtcls1/trained_model/model.py#1497903836360439...
Removing gs://cloud-training-demos-ml/txtcls1/trained_model/task.py#1497903836579730...
Operation completed over 3 objects.
Copying file://txtcls1/trainer/__init__.py [Content-Type=text/x-python]...
Copying file://txtcls1/trainer/model.py [Content-Type=text/x-python]...

Training finished with an accuracy of 61%. The model was trained on a really small dataset, so the number itself is not the point. I hope the sample serves as a template that you can apply to your *real* data.
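
One way to put the 61% figure in context is to compare it with the accuracy of always guessing the most common source. The sketch below uses the article counts from the GROUP BY output earlier in this notebook. Those counts cover the whole filtered dataset rather than just the evaluation split, so treat the result as an approximation.

```python
# Article counts per source, copied from the GROUP BY query output.
num_articles = {'nytimes': 5795, 'wired': 2339, 'techcrunch': 1377}

total = sum(num_articles.values())
majority_source = max(num_articles, key=num_articles.get)
baseline_accuracy = num_articles[majority_source] / float(total)

print(majority_source)              # nytimes
print(round(baseline_accuracy, 3))  # 0.609
```

The baseline comes out at roughly 61%, about the same as the reported accuracy. That suggests this small model may have learned little beyond the class frequencies, and that more data or a richer model would be needed to beat it.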

<h2> Deploy trained model </h2>
<p>
Deploying the trained model to act as a REST web service is a simple gcloud call.

In [11]:
%bash
gsutil ls gs://${BUCKET}/txtcls1/trained_model/export/Servo/

gs://cloud-training-demos-ml/txtcls1/trained_model/export/Servo/
gs://cloud-training-demos-ml/txtcls1/trained_model/export/Servo/1497904940/


In [12]:
%bash
MODEL_NAME="txtcls"
MODEL_VERSION="v1"
MODEL_LOCATION=$(gsutil ls gs://${BUCKET}/txtcls1/trained_model/export/Servo/ | tail -1)
echo "Deleting and deploying $MODEL_NAME $MODEL_VERSION from $MODEL_LOCATION ... this will take a few minutes"
#gcloud ml-engine versions delete ${MODEL_VERSION} --model ${MODEL_NAME}
#gcloud ml-engine models delete ${MODEL_NAME}
gcloud ml-engine models create ${MODEL_NAME} --regions $REGION
gcloud ml-engine versions create ${MODEL_VERSION} --model ${MODEL_NAME} --origin ${MODEL_LOCATION}

Deleting and deploying txtcls v1 from gs://cloud-training-demos-ml/txtcls1/trained_model/export/Servo/1497904940/ ... this will take a few minutes


Creating version (this might take a few minutes)......
................................................................................done.
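
The `tail -1` in the cell above picks the last line of the `gsutil ls` listing, which works as long as the newest export sorts last. As a sketch, the same selection can be made explicit in Python by comparing the directory names numerically. This assumes export directories are named with integer timestamps, as in the listing above; the function name `latest_export` is hypothetical.

```python
# Listing copied from the gsutil ls output above.
listing = [
    'gs://cloud-training-demos-ml/txtcls1/trained_model/export/Servo/',
    'gs://cloud-training-demos-ml/txtcls1/trained_model/export/Servo/1497904940/',
]


def latest_export(paths):
    """Return the path whose last component is the largest integer."""
    candidates = []
    for path in paths:
        name = path.rstrip('/').rsplit('/', 1)[-1]
        if name.isdigit():  # skips the parent 'Servo' directory itself
            candidates.append((int(name), path))
    if not candidates:
        raise ValueError('no timestamped export directory found')
    return max(candidates)[1]


print(latest_export(listing))
```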


<h2> Use model to predict </h2>
<p>
Send a JSON request to the endpoint of the service to make it predict which publication the article is more likely to run in. These are actual titles of articles in the New York Times, TechCrunch, and Wired on June 19.

In [18]:
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials
import json

credentials = GoogleCredentials.get_application_default()
api = discovery.build('ml', 'v1beta1', credentials=credentials,
            discoveryServiceUrl='https://storage.googleapis.com/cloud-ml/discovery/ml_v1beta1_discovery.json')

request_data = {'instances':
  [
      {
        'inputs': 'Supreme Court to Hear Major Case on Partisan Districts'
      },
      {
        'inputs': 'Time Warner will spend $100M on Snapchat original shows and ads'
      },
      {
        'inputs': 'This Dark Matter Theory Could Solve a Celestial Conundrum'
      },
  ]
}

parent = 'projects/%s/models/%s/versions/%s' % (PROJECT, 'txtcls', 'v1')
response = api.projects().predict(body=request_data, name=parent).execute()
print("response={0}".format(response))

response={u'predictions': [{u'source': u'nytimes', u'prob': [0.6047254800796509, 0.24059338867664337, 0.15468111634254456], u'class': 0}, {u'source': u'nytimes', u'prob': [0.6047254800796509, 0.24059338867664337, 0.15468111634254456], u'class': 0}, {u'source': u'nytimes', u'prob': [0.6047254800796509, 0.24059338867664337, 0.15468111634254456], u'class': 0}]}
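
The raw response is hard to read at a glance. A minimal sketch of pairing each title with its predicted source and top probability follows. The response literal is copied from the output above, and the code assumes that predictions come back in the same order as the submitted instances. The function name `summarize` is hypothetical.

```python
# Response copied from the output above (unicode prefixes dropped).
probabilities = [0.6047254800796509, 0.24059338867664337, 0.15468111634254456]
response = {'predictions': [
    {'source': 'nytimes', 'prob': list(probabilities), 'class': 0},
    {'source': 'nytimes', 'prob': list(probabilities), 'class': 0},
    {'source': 'nytimes', 'prob': list(probabilities), 'class': 0},
]}

titles = [
    'Supreme Court to Hear Major Case on Partisan Districts',
    'Time Warner will spend $100M on Snapchat original shows and ads',
    'This Dark Matter Theory Could Solve a Celestial Conundrum',
]


def summarize(titles, response):
    """Pair each title with its predicted source and top probability."""
    summary = []
    for title, prediction in zip(titles, response['predictions']):
        confidence = round(max(prediction['prob']), 3)
        summary.append((title, prediction['source'], confidence))
    return summary


for row in summarize(titles, response):
    print(row)
```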


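Each prediction contains the predicted source, the index of the predicted class, and the probability assigned to each class.
<p>
In this run, all three titles came back as <i>nytimes</i> with identical probabilities, which suggests that the model trained on this small dataset is not yet distinguishing between the inputs.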

Copyright 2017 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License