<h1> Text Classification using TensorFlow on Cloud ML Engine </h1>

This notebook illustrates:
<ol>
<li> Creating datasets for Machine Learning using BigQuery
<li> Creating a text classification model using a CNN and custom Estimator 
<li> Training on Cloud ML Engine
<li> Deploying model
<li> Predicting with model
</ol>

In [39]:
# change these to try this notebook out
BUCKET = 'vijays-sandbox-ml'
PROJECT = 'vijays-sandbox'
REGION = 'us-central1'

In [40]:
import os
os.environ['BUCKET'] = BUCKET
os.environ['PROJECT'] = PROJECT
os.environ['REGION'] = REGION
os.environ['TFVERSION'] = '1.8'

In [41]:
import tensorflow as tf
print(tf.__version__)

1.8.0




The idea is to look at the title of an article and figure out whether it came from the New York Times, TechCrunch, or GitHub. There are very sophisticated approaches that we can try, but for now, let's go with something very simple.

<h2> Data exploration and preprocessing in BigQuery </h2>
<p>
What does the Hacker News dataset look like?

In [16]:
%bq query
SELECT
  url, title, score
FROM
  `bigquery-public-data.hacker_news.stories`
WHERE
  LENGTH(title) > 10
  AND score > 10
LIMIT 10

url,title,score
,"Ask HN: What's your speciality, and what's your ""FizzBuzz"" equivalent?",260
,Ask HN: Books with a high signal to noise ratio?,260
,Ask HN: Why can't I make as much as I make?,262
,Ask HN: Is it OK to submit job postings on HN?,11
,Ask HN: Can I help you be more awesome today? (No strings. Inquire within.),11
,Show HN: Free and Open Source book to teach Firefox OS app development,11
http://empowerunited.com/,Could this be the solution for the 99%?,11
https://github.com/Groundworkstech/Submicron,Deep-Submicron Backdoors,11
http://vancouver.en.craigslist.ca/van/roo/2035580033.html,Best Roommate Ad Ever,11
https://www.kickstarter.com/projects/carlosxcl/code-cards,"Show HN: Code Cards, Like Texas hold 'em for people who want to code",11


Let's do some regular expression parsing in BigQuery to get the source of the newspaper article from the URL. For example, if the url is http://mobile.nytimes.com/...., I want to be left with <i>nytimes</i>. To ensure that the parsing works for all URLs of interest, I'll group by the source to make sure there are no weird names left. This was an iterative process.

In [17]:
query="""
SELECT
  ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.'))[OFFSET(1)] AS source,
  COUNT(title) AS num_articles
FROM
  `bigquery-public-data.hacker_news.stories`
WHERE
  REGEXP_CONTAINS(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.com$')
  AND LENGTH(title) > 10
GROUP BY
  source
ORDER BY num_articles DESC
LIMIT 10
"""

In [18]:
import google.datalab.bigquery as bq
df = bq.Query(query).execute().result().to_dataframe()
df

Unnamed: 0,source,num_articles
0,blogspot,41386
1,github,36525
2,techcrunch,30891
3,youtube,30848
4,nytimes,28787
5,medium,18422
6,google,18235
7,wordpress,17667
8,arstechnica,13749
9,wired,12841
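
To make the SQL expression above easier to follow, here is a rough Python analogue applied to a single URL. It is only a sketch: `source_from_url` is a hypothetical helper name, the nytimes path is made up for illustration, and Python's `re` module stands in for BigQuery's regular expression functions, so edge cases may differ.

```python
import re

def source_from_url(url):
    """Approximate the BigQuery source extraction for one URL (illustrative only)."""
    # REGEXP_EXTRACT(url, '.*://(.[^/]+)/') keeps the host part of the URL.
    match = re.search(r'.*://(.[^/]+)/', url)
    if match is None:
        return None
    host = match.group(1)
    # REGEXP_CONTAINS(host, '.com$') keeps hosts ending in "com" preceded by any character.
    if re.search(r'.com$', host) is None:
        return None
    # ARRAY_REVERSE(SPLIT(host, '.'))[OFFSET(1)] picks the label just before the TLD.
    labels = host.split('.')[::-1]
    return labels[1] if len(labels) > 1 else None

print(source_from_url('http://mobile.nytimes.com/2018/07/30/example.html'))
print(source_from_url('https://github.com/Groundworkstech/Submicron'))
```

Note that the `.` in `'.com$'` is unescaped in the query, so it matches any character; the sketch mirrors that behaviour rather than tightening it.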


Now that we have good parsing of the URL to get the source, let's put together a dataset of source and titles. This will be our labeled dataset for machine learning.

In [19]:
query="""
SELECT source, LOWER(REGEXP_REPLACE(title, '[^a-zA-Z0-9 $.-]', ' ')) AS title FROM
(SELECT
  ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.'))[OFFSET(1)] AS source,
  title
FROM
  `bigquery-public-data.hacker_news.stories`
WHERE
  REGEXP_CONTAINS(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.com$')
  AND LENGTH(title) > 10
)
WHERE (source = 'github' OR source = 'nytimes' OR source = 'techcrunch')
"""
df = bq.Query(query + " LIMIT 10").execute().result().to_dataframe()
df.head()

Unnamed: 0,source,title
0,github,django outbox
1,github,webscrapper using node.js deferred cheerio...
2,github,a git user s guide to svn because at least 10...
3,github,show hn cmake module to take care of git subm...
4,github,play motion sensing game on chrome
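
The title clean-up in the query (`LOWER(REGEXP_REPLACE(...))`) can be tried on a single string in Python. This is a minimal sketch with a hypothetical helper name, `clean_title`; it assumes Python's `re.sub` treats the character class the same way BigQuery does for these ASCII examples.

```python
import re

def clean_title(title):
    """Replace every character outside [a-zA-Z0-9 $.-] with a space, then lowercase."""
    return re.sub(r'[^a-zA-Z0-9 $.-]', ' ', title).lower()

print(clean_title("A Git User's Guide"))
```

Each removed character becomes a single space, which is why the cleaned titles contain runs of spaces where punctuation used to be.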


For ML training, we will need to split our dataset into training and evaluation datasets (and perhaps an independent test dataset if we are going to do model or feature selection based on the evaluation dataset).  A simple way to do this is to use the hash of a well-distributed column in our data (See https://www.oreilly.com/learning/repeatable-sampling-of-data-sets-in-bigquery-for-machine-learning).
<p>
So, let's do that and save the results as CSV files.

In [20]:
traindf = bq.Query(query + " AND MOD(ABS(FARM_FINGERPRINT(title)),4) > 0").execute().result().to_dataframe()
evaldf  = bq.Query(query + " AND MOD(ABS(FARM_FINGERPRINT(title)),4) = 0").execute().result().to_dataframe()
traindf.head()

Unnamed: 0,source,title
0,github,this guy just found out how to bypass adblocker
1,github,show hn dodo command line task management f...
2,github,show hn webservicemock mock out external ca...
3,github,magento category attributes dependency
4,github,write actionscript in swift whaa
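
The idea behind the hash-based split is that the same title always lands in the same bucket, so the split is repeatable. Here is a small sketch of that idea in Python. It uses MD5 purely as a stand-in, so it will not reproduce the assignments made by BigQuery's `FARM_FINGERPRINT`, and `split_for` is a hypothetical helper name.

```python
import hashlib

def split_for(title, num_buckets=4):
    """Assign a title to 'train' or 'eval' from a hash of its text."""
    # Assumption: MD5 is only a stand-in; FARM_FINGERPRINT yields different values.
    digest = hashlib.md5(title.encode('utf-8')).hexdigest()
    bucket = int(digest, 16) % num_buckets
    # Bucket 0 goes to evaluation (about 1/4), buckets 1-3 to training (about 3/4).
    return 'eval' if bucket == 0 else 'train'

titles = ['django outbox', 'play motion sensing game on chrome', 'django outbox']
print([split_for(t) for t in titles])
```

Because the assignment depends only on the title text, duplicate titles never end up on both sides of the split.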


In [21]:
traindf['source'].value_counts()

github        27445
techcrunch    23131
nytimes       21586
Name: source, dtype: int64

In [22]:
evaldf['source'].value_counts()

github        9080
techcrunch    7760
nytimes       7201
Name: source, dtype: int64
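
As a quick sanity check on these counts, the following sketch computes the train/eval ratio and the accuracy of always guessing the most common class. The numbers are copied from the two `value_counts()` outputs above; the variable names are mine.

```python
train_counts = {'github': 27445, 'techcrunch': 23131, 'nytimes': 21586}
eval_counts = {'github': 9080, 'techcrunch': 7760, 'nytimes': 7201}

n_train = sum(train_counts.values())
n_eval = sum(eval_counts.values())

# Share of all rows that ended up in the training set.
train_fraction = n_train / float(n_train + n_eval)
# Accuracy of a model that always predicts the largest training class.
majority_baseline = max(train_counts.values()) / float(n_train)

print('train rows: {}, eval rows: {}'.format(n_train, n_eval))
print('train fraction: {:.3f}'.format(train_fraction))
print('majority-class baseline: {:.3f}'.format(majority_baseline))
```

The split comes out close to 3:1, as expected from `MOD(..., 4)`, and always predicting github would be right about 38% of the time on the training data.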

In [23]:
import os, shutil
DATADIR='data/txtcls'
shutil.rmtree(DATADIR, ignore_errors=True)
os.makedirs(DATADIR)
traindf.to_csv( os.path.join(DATADIR,'train.csv'), header=False, index=False, encoding='utf-8', sep='\t')
evaldf.to_csv( os.path.join(DATADIR,'eval.csv'), header=False, index=False, encoding='utf-8', sep='\t')

In [24]:
!head -3 data/txtcls/train.csv

github	this guy just found out how to bypass adblocker
github	show hn  dodo   command line task management for developers
github	show hn  webservicemock   mock out external calls for local development


In [25]:
!wc -l data/txtcls/*.csv

  24041 data/txtcls/eval.csv
  72162 data/txtcls/train.csv
  96203 total
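
Each line of these files is a label, a tab, and a cleaned title. The sketch below reads such lines back into `(class_index, title)` pairs. The label-to-index mapping is the one shown later in this notebook; whether the trainer package parses the files this way is an assumption, and `read_examples` is a hypothetical helper.

```python
import csv
import io

CLASSES = {'github': 0, 'nytimes': 1, 'techcrunch': 2}

# Two rows copied from the `head` output above, written tab-separated.
sample = (u'github\tthis guy just found out how to bypass adblocker\n'
          u'github\tshow hn  dodo   command line task management for developers\n')

def read_examples(handle):
    """Yield (class_index, title) pairs from a tab-separated file handle."""
    for source, title in csv.reader(handle, delimiter='\t'):
        yield CLASSES[source], title

examples = list(read_examples(io.StringIO(sample)))
print(examples[0])
```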


<h2> TensorFlow code </h2>

Please explore the code in this <a href="txtclsmodel/trainer">directory</a>: `model.py` contains the TensorFlow model and `task.py` parses command line arguments and launches off the training job.

Let's make sure the code works locally on a small dataset for a few steps.

In [None]:
%%bash
rm -rf txtcls_trained
gcloud ml-engine local train \
   --module-name=trainer.task \
   --package-path=${PWD}/txtclsmodel/trainer \
   -- \
   --output_dir=${PWD}/txtcls_trained \
   --train_data_path=${PWD}/data/txtcls/train.csv \
   --eval_data_path=${PWD}/data/txtcls/eval.csv \
   --num_epochs=0.1

When I ran it, I got about 55% accuracy after a few steps. The command above sets `--num_epochs=0.1`, so training sees only about 7,200 of the roughly 72,000 training examples (about 225 steps if batchsize=32), which is not even one pass over the full dataset. And already, we are doing much better than random chance, which is about 33% for three classes.
<p>
Once the code works in standalone mode, you can run it on Cloud ML Engine. You can monitor the job from the GCP console in the Cloud Machine Learning Engine section. With 72,000 examples and batchsize=32, one epoch is about 2,250 steps, so the `--num_epochs=20` setting used for the cloud job corresponds to roughly 45,000 steps.
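
The bookkeeping between steps, batch size, and epochs is simple arithmetic. Here is a small sketch applied to the `--num_epochs` values used in the commands in this notebook. It takes the rounded 72,000 examples and batchsize=32 from the text; the helper names are mine.

```python
NUM_EXAMPLES = 72000  # rounded training-set size from the text
BATCH_SIZE = 32       # batch size quoted in the text

def steps_for(epochs, batch_size=BATCH_SIZE, num_examples=NUM_EXAMPLES):
    """Number of training steps needed to cover `epochs` passes over the data."""
    return int(round(epochs * num_examples / float(batch_size)))

def epochs_for(steps, batch_size=BATCH_SIZE, num_examples=NUM_EXAMPLES):
    """Number of passes over the data covered by `steps` training steps."""
    return steps * batch_size / float(num_examples)

print(steps_for(0.1))    # local test run
print(steps_for(20))     # cloud training job
print(epochs_for(2250))  # one full pass
```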

#### Local Predict

In [46]:
%%writefile tokenized_sentences.json
[1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10]

Overwriting tokenized_sentences.json
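
The file above holds one instance: a list of 50 token ids, which matches the `maxlen=50` padding used later in this notebook. `gcloud` reads `--json-instances` files as one JSON value per line, so a file with several instances would have several such lines. Here is a sketch of building that payload in Python; the ids are placeholders, not real tokens.

```python
import json

MAX_LEN = 50  # sequence length, matching pad_sequences(maxlen=50) in this notebook

# Placeholder token ids 1..10 repeated, matching the file written above.
instance = [(i % 10) + 1 for i in range(MAX_LEN)]

# One JSON value per line, one line per instance.
lines = [json.dumps(instance)]
payload = '\n'.join(lines) + '\n'
print(payload)
```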


In [49]:
%%bash
gcloud ml-engine local predict \
  --model-dir=${PWD}/txtcls_trained/export/exporter/1532985497 \
  --json-instances=./tokenized_sentences.json

DENSE_1
[0.428494930267334, 0.5679579377174377, 0.0035471213050186634]
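
The three numbers in `DENSE_1` line up with the class indices used elsewhere in this notebook (github=0, nytimes=1, techcrunch=2). Here is a short sketch of reading off the most likely class and checking that the values behave like probabilities. Since the input was a placeholder sequence, the resulting label carries no meaning here.

```python
CLASSES = {'github': 0, 'nytimes': 1, 'techcrunch': 2}
INDEX_TO_CLASS = {index: name for name, index in CLASSES.items()}

# Probabilities copied from the local prediction output above.
dense_1 = [0.428494930267334, 0.5679579377174377, 0.0035471213050186634]

# Index of the largest probability.
best_index = max(range(len(dense_1)), key=lambda i: dense_1[i])
print(INDEX_TO_CLASS[best_index], dense_1[best_index])
print(sum(dense_1))
```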





In [27]:
%%bash
gsutil cp data/txtcls/*.csv gs://${BUCKET}/txtcls/

Copying file://data/txtcls/eval.csv [Content-Type=text/csv]...
Copying file://data/txtcls/train.csv [Content-Type=text/csv]...
Operation completed over 2 objects/5.4 MiB.


#### Train on CMLE

In [None]:
%%bash
OUTDIR=gs://${BUCKET}/txtcls/trained
JOBNAME=txtcls_$(date -u +%y%m%d_%H%M%S)
REGION=us-central1
gsutil -m rm -rf $OUTDIR
gcloud ml-engine jobs submit training $JOBNAME \
 --region=$REGION \
 --module-name=trainer.task \
 --package-path=${PWD}/txtclsmodel/trainer \
 --job-dir=$OUTDIR \
 --scale-tier=BASIC_GPU \
 --runtime-version=$TFVERSION \
 -- \
 --output_dir=$OUTDIR \
 --train_data_path=gs://${BUCKET}/txtcls/train.csv \
 --eval_data_path=gs://${BUCKET}/txtcls/eval.csv \
 --num_epochs=20

Training finished with an accuracy of 73%. This model was trained on a fairly small dataset (about 72,000 titles), so more data would likely improve accuracy further.

<h2> Deploy trained model </h2>
<p>
Deploying the trained model to act as a REST web service is a simple gcloud call.

In [54]:
%%bash
gsutil ls gs://${BUCKET}/txtcls/trained/export/exporter/

gs://vijays-sandbox-ml/txtcls/trained/export/exporter/
gs://vijays-sandbox-ml/txtcls/trained/export/exporter/1532985010/
gs://vijays-sandbox-ml/txtcls/trained/export/exporter/1532985046/
gs://vijays-sandbox-ml/txtcls/trained/export/exporter/1532985080/
gs://vijays-sandbox-ml/txtcls/trained/export/exporter/1532985114/
gs://vijays-sandbox-ml/txtcls/trained/export/exporter/1532985149/
gs://vijays-sandbox-ml/txtcls/trained/export/exporter/1532985185/
gs://vijays-sandbox-ml/txtcls/trained/export/exporter/1532985220/
gs://vijays-sandbox-ml/txtcls/trained/export/exporter/1532985256/
gs://vijays-sandbox-ml/txtcls/trained/export/exporter/1532985291/
gs://vijays-sandbox-ml/txtcls/trained/export/exporter/1532985326/
gs://vijays-sandbox-ml/txtcls/trained/export/exporter/1532985362/
gs://vijays-sandbox-ml/txtcls/trained/export/exporter/1532985398/
gs://vijays-sandbox-ml/txtcls/trained/export/exporter/1532985416/
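
The export directories are named with numeric timestamps, and the deployment step picks the newest one with `tail -1`, which works because the names all have the same number of digits. Here is a sketch of the same selection done numerically in Python. `latest_export` is a hypothetical helper, and the list is a subset of the listing above.

```python
def latest_export(paths):
    """Return the export directory with the largest numeric timestamp, or None."""
    candidates = []
    for path in paths:
        name = path.rstrip('/').rsplit('/', 1)[-1]
        if name.isdigit():  # skips the parent "exporter/" entry
            candidates.append((int(name), path))
    return max(candidates)[1] if candidates else None

listing = [
    'gs://vijays-sandbox-ml/txtcls/trained/export/exporter/',
    'gs://vijays-sandbox-ml/txtcls/trained/export/exporter/1532985010/',
    'gs://vijays-sandbox-ml/txtcls/trained/export/exporter/1532985398/',
    'gs://vijays-sandbox-ml/txtcls/trained/export/exporter/1532985416/',
]
print(latest_export(listing))
```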


In [56]:
%%bash
MODEL_NAME="txtcls"
MODEL_VERSION="v_fromscratch"
MODEL_LOCATION=$(gsutil ls gs://${BUCKET}/txtcls/trained/export/exporter/ | tail -1)
echo "Deleting and deploying $MODEL_NAME $MODEL_VERSION from $MODEL_LOCATION ... this will take a few minutes"
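# First run: uncomment `models create` below so the model exists before a version is added.
# To redeploy under the same version name, uncomment `versions delete` first.
#gcloud ml-engine versions delete ${MODEL_VERSION} --model ${MODEL_NAME}
#gcloud ml-engine models delete ${MODEL_NAME}
#gcloud ml-engine models create ${MODEL_NAME} --regions $REGION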
gcloud ml-engine versions create ${MODEL_VERSION} --model ${MODEL_NAME} --origin ${MODEL_LOCATION} --runtime-version=$TFVERSION

Deleting and deploying txtcls v_fromscratch from gs://vijays-sandbox-ml/txtcls/trained/export/exporter/1532985416/ ... this will take a few minutes


Creating version (this might take a few minutes)......
..............................................................................................................done.


<h2> Use model to predict </h2>
<p>
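Send a JSON request to the endpoint of the service to make it predict which publication an article most likely came from. The first call below reuses the placeholder token sequence in tokenized_sentences.json as a quick check that the deployed version responds. The titles used afterwards are actual titles of articles from the New York Times, GitHub, and TechCrunch on June 19; they were not part of the training or evaluation datasets.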

In [60]:
%%bash
gcloud ml-engine predict \
  --model=txtcls \
  --version=v_fromscratch \
  --json-instances=./tokenized_sentences.json

DENSE_1
[6.640195060469978e-09, 0.05344170704483986, 0.9465582966804504]


In [85]:
techcrunch+nytimes

['Uber shuts down self-driving trucks unit',
 'Grover raises \xe2\x82\xac37M Series A to offer latest tech products as a subscription',
 'Tech companies can now bid on the Pentagon\xe2\x80\x99s $10B cloud contract',
 '\xe2\x80\x98Lopping,\xe2\x80\x99 \xe2\x80\x98Tips\xe2\x80\x99 and the \xe2\x80\x98Z-List\xe2\x80\x99: Bias Lawsuit Explores Harvard\xe2\x80\x99s Admissions',
 'A $3B Plan to Turn Hoover Dam into a Giant Battery',
 'A MeToo Reckoning in China\xe2\x80\x99s Workplace Amid Wave of Accusations']

In [None]:
import pickle
from tensorflow.python.keras.preprocessing import sequence

techcrunch=[
  'Uber shuts down self-driving trucks unit',
  'Grover raises €37M Series A to offer latest tech products as a subscription',
  'Tech companies can now bid on the Pentagon’s $10B cloud contract'
]
nytimes=[
  '‘Lopping,’ ‘Tips’ and the ‘Z-List’: Bias Lawsuit Explores Harvard’s Admissions',
  'A $3B Plan to Turn Hoover Dam into a Giant Battery',
  'A MeToo Reckoning in China’s Workplace Amid Wave of Accusations'
]
github=[
  'Show HN: Moon – 3kb JavaScript UI compiler',
  'Show HN: Hello, a CLI tool for managing social media',
  'Firefox Nightly added support for time-travel debugging'
]

tokenizer = pickle.load( open( "txtclsmodel/tokenizer.pickled", "rb" ) )

x= tokenizer.texts_to_sequences(techcrunch)
x = sequence.pad_sequences(x,maxlen=50)
x.tolist()
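
With its default settings, `pad_sequences` pads with zeros at the front and, for sequences longer than `maxlen`, keeps the last `maxlen` items. The pure-Python sketch below mimics that default behaviour for one sequence so the effect is easy to see; `pad_pre` is a hypothetical helper, not part of Keras.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items if too long."""
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

print(pad_pre([7, 3, 9], 5))              # [0, 0, 7, 3, 9]
print(pad_pre([1, 2, 3, 4, 5, 6, 7], 5))  # [3, 4, 5, 6, 7]
```

Every title therefore becomes a fixed-length list of 50 integers, which is the shape the deployed model expects.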

In [91]:
classes = {'github': 0, 'nytimes': 1, 'techcrunch': 2}
len(classes)

3

In [90]:
import pickle
from tensorflow.python.keras.preprocessing import sequence
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials
import json

techcrunch=[
  'Uber shuts down self-driving trucks unit',
  'Grover raises €37M Series A to offer latest tech products as a subscription',
  'Tech companies can now bid on the Pentagon’s $10B cloud contract'
]
nytimes=[
  '‘Lopping,’ ‘Tips’ and the ‘Z-List’: Bias Lawsuit Explores Harvard’s Admissions',
  'A $3B Plan to Turn Hoover Dam into a Giant Battery',
  'A MeToo Reckoning in China’s Workplace Amid Wave of Accusations'
]
github=[
  'Show HN: Moon – 3kb JavaScript UI compiler',
  'Show HN: Hello, a CLI tool for managing social media',
  'Firefox Nightly added support for time-travel debugging'
]

requests = techcrunch+nytimes+github

# Tokenize and pad sentences using same mapping we did in training
tokenizer = pickle.load( open( "txtclsmodel/tokenizer.pickled", "rb" ) )

requests_tokenized = tokenizer.texts_to_sequences(requests)
requests_tokenized = sequence.pad_sequences(requests_tokenized,maxlen=50)

# JSON format the requests
request_data = {'instances':requests_tokenized.tolist()}

# Authenticate and call CMLE prediction API 
credentials = GoogleCredentials.get_application_default()
api = discovery.build('ml', 'v1', credentials=credentials,
            discoveryServiceUrl='https://storage.googleapis.com/cloud-ml/discovery/ml_v1_discovery.json')

# No version is named here, so the request goes to the model's default version
parent = 'projects/%s/models/%s' % (PROJECT, 'txtcls')
response = api.projects().predict(body=request_data, name=parent).execute()

# Format and print response (class order: github=0, nytimes=1, techcrunch=2)
for title, prediction in zip(requests, response['predictions']):
  probs = prediction['dense_1']
  print('\n{}'.format(title))
  print(' github    : {}'.format(probs[0]))
  print(' nytimes   : {}'.format(probs[1]))
  print(' techcrunch: {}'.format(probs[2]))


Uber shuts down self-driving trucks unit
 github    : 1.7230887579e-06
 nytimes   : 0.00622563203797
 techcrunch: 0.993772685528

Grover raises €37M Series A to offer latest tech products as a subscription
 github    : 1.07748448386e-07
 nytimes   : 0.0014502487611
 techcrunch: 0.998549640179

Tech companies can now bid on the Pentagon’s $10B cloud contract
 github    : 1.50089090312e-06
 nytimes   : 0.00121769437101
 techcrunch: 0.998780786991

‘Lopping,’ ‘Tips’ and the ‘Z-List’: Bias Lawsuit Explores Harvard’s Admissions
 github    : 2.77822749695e-07
 nytimes   : 0.902024507523
 techcrunch: 0.0979751124978

A $3B Plan to Turn Hoover Dam into a Giant Battery
 github    : 0.000192578372662
 nytimes   : 0.50758099556
 techcrunch: 0.492226421833

A MeToo Reckoning in China’s Workplace Amid Wave of Accusations
 github    : 1.95428145888e-10
 nytimes   : 0.999726235867
 techcrunch: 0.000273719837423

Show HN: Moon – 3kb JavaScript UI compiler
 github    : 0.989028930664
 nytimes   : 0.00

How many of your predictions were correct? Do the predictions match your intuition?

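As you can see, the trained model is more than 99% confident that each of the three TechCrunch titles came from TechCrunch. Two of the New York Times titles are assigned to nytimes with 90% and over 99% probability, while the Hoover Dam title is nearly a toss-up between nytimes (51%) and techcrunch (49%). The first GitHub title, the Moon UI compiler post, is about 99% likely to be from GitHub according to the service.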
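
To answer the question of how many predictions were correct, one option is to compare the most probable class for each title against its true source. The sketch below does this for the six titles whose probabilities are printed in full above. The values are rounded copies of that output, and the variable and helper names are mine.

```python
CLASS_NAMES = ['github', 'nytimes', 'techcrunch']  # index order from the `classes` dict

# (true source, short tag for the title, rounded probabilities from the output above)
results = [
    ('techcrunch', 'Uber trucks',    [1.72e-06, 0.00623, 0.99377]),
    ('techcrunch', 'Grover raises',  [1.08e-07, 0.00145, 0.99855]),
    ('techcrunch', 'Pentagon cloud', [1.50e-06, 0.00122, 0.99878]),
    ('nytimes',    'Harvard',        [2.78e-07, 0.90202, 0.09798]),
    ('nytimes',    'Hoover Dam',     [0.00019, 0.50758, 0.49223]),
    ('nytimes',    'MeToo',          [1.95e-10, 0.99973, 0.00027]),
]

def predicted_class(probs):
    """Name of the class with the highest probability."""
    return CLASS_NAMES[max(range(len(probs)), key=lambda i: probs[i])]

correct = sum(1 for truth, _, probs in results if predicted_class(probs) == truth)
print('{} of {} correct'.format(correct, len(results)))
```

Counting this way hides how close some calls are: the Hoover Dam title counts as correct even though the two leading probabilities differ by less than two percentage points.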

Copyright 2017 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License