# Text Classification Using PyTorch and CMLE

This notebook illustrates CMLE's new custom prediction feature. It lets us run arbitrary Python pre-processing code before invoking a model, and post-processing code on the predictions it produces. In addition, you can use a model built with your favourite Python-based framework!

This is all done server-side so that the client can pass data directly to CMLE in the unprocessed state.

We will build a text classification model using PyTorch, while performing text preprocessing using Keras.
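The contract the rest of this notebook relies on is small: a class with a `from_path` class method that loads artefacts from a directory, and a `predict` method that maps raw instances to JSON-serialisable outputs. The sketch below shows that shape only. `WordCountPredictor` and its word-count "model" are hypothetical stand-ins invented for illustration; they are not part of CMLE or of the model built later.

```python
class WordCountPredictor(object):
    """Toy predictor: labels a text 'short' or 'long' by its word count."""

    def __init__(self, threshold):
        self._threshold = threshold

    def _preprocess(self, instances):
        # Raw strings in, numeric features out (here: number of words).
        return [len(text.split()) for text in instances]

    def _postprocess(self, scores):
        # Numeric outputs in, human-readable labels out.
        return ['long' if s > self._threshold else 'short' for s in scores]

    def predict(self, instances, **kwargs):
        return self._postprocess(self._preprocess(instances))

    @classmethod
    def from_path(cls, model_dir):
        # A real implementation would load saved files from model_dir;
        # this toy version just hard-codes its single parameter.
        return cls(threshold=3)


predictor = WordCountPredictor.from_path('.')
print(predictor.predict(['hello world', 'a much longer piece of text']))
```

The client only ever sends the raw strings; both the word counting and the label lookup happen inside `predict`.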

In [0]:
%load_ext autoreload
%autoreload 2

In [0]:
!pip install tensorflow==1.12 torch 

Collecting tensorflow==1.12
  Downloading https://files.pythonhosted.org/packages/bd/68/ec26b2cb070a5760707ec8d9491a24e5be72f4885f265bb04abf70c0f9f1/tensorflow-1.12.0-cp27-cp27mu-manylinux1_x86_64.whl (83.1MB)
    100% |████████████████████████████████| 83.1MB 252kB/s 
Collecting tensorboard<1.13.0,>=1.12.0 (from tensorflow==1.12)
  Downloading https://files.pythonhosted.org/packages/51/ae/9840c4837c6f54034ac942b5344396e8c3d74686a9bd29beafdf633cc221/tensorboard-1.12.2-py2-none-any.whl (3.0MB)
    100% |████████████████████████████████| 3.1MB 7.0MB/s 
Installing collected packages: tensorboard, tensorflow
  Found existing installation: tensorboard 1.13.1
    Uninstalling tensorboard-1.13.1:
      Successfully uninstalled tensorboard-1.13.1
  Found existing installation: tensorflow 1.13.1
    Uninstalling tensorflow-1.13.1:
      Successfully uninstalled tensorflow-1.13.1
Successfully installed tensorboard-1.12.2 tensorflow-1.12.0


In [0]:
import tensorflow as tf
import torch
print(tf.__version__) 
print(torch.__version__) 

tf.estimator package not installed.
tf.estimator package not installed.
1.12.0
1.0.1.post2


## Setup

In [0]:
from google.colab import auth
auth.authenticate_user()

In [0]:
import os

PROJECT='vijays-sandbox' # CHANGE TO YOUR GCP PROJECT NAME
BUCKET='vijays-sandbox-ml' # CHANGE TO YOUR GCS BUCKET NAME
ROOT='torch_text_classification'
MODEL_DIR=os.path.join(ROOT,'models')
PACKAGES_DIR=os.path.join(ROOT,'packages')
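`os.path.join` uses the separator of the local operating system. That is `/` on Colab, but it would yield backslashes on Windows, which are not valid in `gs://` paths. As a hedged alternative, the sketch below builds the URIs with `posixpath`, which always joins with forward slashes. The helper name `gcs_uri` and the bucket `my-bucket` are placeholders, not names used elsewhere in this notebook.

```python
import posixpath


def gcs_uri(bucket, *parts):
    """Join path components into a gs:// URI using forward slashes."""
    return 'gs://' + posixpath.join(bucket, *parts)


ROOT = 'torch_text_classification'
model_uri = gcs_uri('my-bucket', ROOT, 'models')
packages_uri = gcs_uri('my-bucket', ROOT, 'packages')
print(model_uri)
print(packages_uri)
```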

In [0]:
# Delete any previous artefacts from Google Cloud Storage
!gsutil rm -r gs://{BUCKET}/{ROOT}

Removing gs://vijays-sandbox-ml/torch_text_classification/models/processor_state.pkl#1551360019520520...
Removing gs://vijays-sandbox-ml/torch_text_classification/models/torch_saved_model.pt#1551360016536407...
Removing gs://vijays-sandbox-ml/torch_text_classification/packages/my_package-0.1.tar.gz#1551360065067790...
/ [3 objects]                                                                   
Operation completed over 3 objects.                                              


In [0]:
!gcloud config set project {PROJECT}

Updated property [core/project].


## Download and Explore Data

In [0]:
%%bash
gsutil cp gs://cloud-training-demos/blogs/CMLE_custom_prediction/keras_text_pre_processing/train.tsv .
gsutil cp gs://cloud-training-demos/blogs/CMLE_custom_prediction/keras_text_pre_processing/eval.tsv .

Copying gs://cloud-training-demos/blogs/CMLE_custom_prediction/keras_text_pre_processing/train.tsv...
/ [1 files][  4.0 MiB/  4.0 MiB]
Operation completed over 1 objects/4.0 MiB.
Copying gs://cloud-training-demos/blogs/CMLE_custom_prediction/keras_text_pre_processing/eval.tsv...
/ [1 files][  1.4 MiB/  1.4 MiB]
Operation completed over 1 objects/1.4 MiB.


In [0]:
!head eval.tsv

## Preprocessing

### Pre-processing class to be used in both training and serving

In [0]:
%%writefile preprocess.py

from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.preprocessing import text

class TextPreprocessor(object):
  def __init__(self, vocab_size, MAX_SEQUENCE_LENGTH):
    self._vocab_size = vocab_size
    self._MAX_SEQUENCE_LENGTH = MAX_SEQUENCE_LENGTH
    self._tokenizer = None

  def fit(self, text_list):        
    # Create vocabulary from input corpus.
    tokenizer = text.Tokenizer(num_words=self._vocab_size)
    tokenizer.fit_on_texts(text_list)
    self._tokenizer = tokenizer

  def transform(self, text_list):        
    # Transform text to sequence of integers
    text_sequence = self._tokenizer.texts_to_sequences(text_list)

    # Fix sequence length to max value. Sequences shorter than the length are
    # padded in the beginning and sequences longer are truncated
    # at the beginning.
    padded_text_sequence = sequence.pad_sequences(
      text_sequence, maxlen=self._MAX_SEQUENCE_LENGTH)
    return padded_text_sequence

Writing preprocess.py


### Test Preprocessing Locally

In [0]:
from preprocess import TextPreprocessor

processor = TextPreprocessor(5, 5)
processor.fit(['hello machine learning'])
processor.transform(['hello machine learning'])

array([[0, 0, 1, 2, 3]], dtype=int32)
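The two leading zeros in the output above come from the padding step. The plain-Python sketch below imitates the behaviour described in the comments of `transform`: short sequences are padded at the beginning, long ones are truncated at the beginning. `pad_pre` is a hypothetical helper written for illustration; it is not the Keras implementation and ignores its other options.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences and drop leading items from long ones."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last maxlen items
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


short_result = pad_pre([[1, 2, 3]], 5)
long_result = pad_pre([[1, 2, 3, 4, 5, 6, 7]], 5)
print(short_result)  # [[0, 0, 1, 2, 3]]
print(long_result)   # [[3, 4, 5, 6, 7]]
```

Truncating at the beginning means that, for headlines longer than the maximum length, the model only sees the final words.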

## Model Creation

### Metadata

In [0]:
CLASSES = {'github': 0, 'nytimes': 1, 'techcrunch': 2}  # label-to-int mapping
NUM_CLASSES = 3
VOCAB_SIZE = 20000  # Limit on the vocabulary size used for tokenization
MAX_SEQUENCE_LENGTH = 50  # Sentences will be truncated/padded to this length
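The serving class defined later in this notebook hard-codes the label names in index order. One way to keep the two in sync, sketched here as a suggestion rather than as something the notebook itself does, is to derive the ordered label list from `CLASSES`. The name `LABELS` is introduced only for this example.

```python
CLASSES = {'github': 0, 'nytimes': 1, 'techcrunch': 2}

# The ids must be exactly 0..N-1 for a list lookup by index to be valid.
assert sorted(CLASSES.values()) == list(range(len(CLASSES)))

# Order the label names by their integer id.
LABELS = [name for name, _ in sorted(CLASSES.items(), key=lambda kv: kv[1])]
NUM_CLASSES = len(LABELS)
print(LABELS)
```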

### Prepare data for training and evaluation

In [0]:
import pandas as pd
import numpy as np
from preprocess import TextPreprocessor

def load_data(train_data_path, eval_data_path):
    # Parse the tab-separated files using pandas
    column_names = ('label', 'text')
    
    df_train = pd.read_csv(train_data_path, names=column_names, sep='\t')
    # Shuffle the training rows
    df_train = df_train.sample(frac=1)
    
    df_eval = pd.read_csv(eval_data_path, names=column_names, sep='\t')

    return ((list(df_train['text']), np.array(df_train['label'].map(CLASSES))),
            (list(df_eval['text']), np.array(df_eval['label'].map(CLASSES))))


((train_texts, train_labels), (eval_texts, eval_labels)) = load_data(
       'train.tsv', 'eval.tsv')

# Create vocabulary from training corpus.
processor = TextPreprocessor(VOCAB_SIZE, MAX_SEQUENCE_LENGTH)
processor.fit(train_texts)

# Preprocess the data
train_texts_vectorized = processor.transform(train_texts)
eval_texts_vectorized = processor.transform(eval_texts)

### Build the model

In [0]:
%%writefile torch_model.py

import torch
import torch.nn as nn
import torch.nn.functional as F

class TorchTextClassifier(nn.Module):
    
    def __init__(self, vocab_size, embedding_dim, seq_length, num_classes, 
                 num_filters, kernel_size, pool_size, dropout_rate):
        super(TorchTextClassifier, self).__init__()

        self.embeddings = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim)
        
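        # Note: the embedding output has shape (batch, seq_length, embedding_dim),
        # so this layer treats the sequence positions as input channels and
        # convolves along the embedding dimension.
        self.conv1 = nn.Conv1d(seq_length, num_filters, kernel_size)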
        self.max_pool1 = nn.MaxPool1d(pool_size)
        self.conv2 = nn.Conv1d(num_filters, num_filters*2, kernel_size)
        
        self.dropout = nn.Dropout(dropout_rate)
        self.dense = nn.Linear(num_filters*2, num_classes)
        

    def forward(self, x):
        
        x = self.embeddings(x)
        x = self.dropout(x)

        x = self.conv1(x)
        x = F.relu(x)
        x = self.max_pool1(x)
        
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool1d(x, x.size()[2]).squeeze(2)
        
        x = self.dropout(x)
        x = self.dense(x)
        x = F.softmax(x, 1)
        return x

Writing torch_model.py
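To see why the final `nn.Linear` layer takes `num_filters*2` inputs, it helps to trace the tensor lengths through the network. The sketch below does this with plain arithmetic, using the output-length formula from the PyTorch `Conv1d` documentation and the hyperparameter values from the training cell that follows (`EMBEDDING_DIM=200`, `KERNEL_SIZE=3`, `POOL_SIZE=3`). It assumes the model exactly as written above, where the embedding output of shape `(batch, seq_length, embedding_dim)` is fed straight into `conv1`, so the convolutions slide along the embedding axis. The helper names are hypothetical.

```python
def conv1d_out_len(length, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of a 1-D convolution (formula from the PyTorch docs)."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


def maxpool1d_out_len(length, pool_size):
    """nn.MaxPool1d uses stride == kernel_size unless told otherwise."""
    return conv1d_out_len(length, pool_size, stride=pool_size)


EMBEDDING_DIM, KERNEL_SIZE, POOL_SIZE = 200, 3, 3

after_conv1 = conv1d_out_len(EMBEDDING_DIM, KERNEL_SIZE)
after_pool = maxpool1d_out_len(after_conv1, POOL_SIZE)
after_conv2 = conv1d_out_len(after_pool, KERNEL_SIZE)
print('{} -> {} -> {}'.format(after_conv1, after_pool, after_conv2))
```

The global max pool in `forward` then reduces the remaining length to 1, leaving one value per output channel of `conv2`, which is `num_filters*2`.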


### Train and save the model

In [0]:
import torch
from torch.autograd import Variable
import torch.nn.functional as F

LEARNING_RATE=.001
FILTERS=64
DROPOUT_RATE=0.2
EMBEDDING_DIM=200
KERNEL_SIZE=3
POOL_SIZE=3

NUM_EPOCH=1
BATCH_SIZE=128

train_size = len(train_texts)
steps_per_epoch = int(len(train_labels)/BATCH_SIZE)

print("Train size: {}".format(train_size))
print("Batch size: {}".format(BATCH_SIZE))
print("Number of epochs: {}".format(NUM_EPOCH))
print("Steps per epoch: {}".format(steps_per_epoch))
print("Vocab Size: {}".format(VOCAB_SIZE))
print("Embed Dimensions: {}".format(EMBEDDING_DIM))
print("Sequence Length: {}".format(MAX_SEQUENCE_LENGTH))
print("")


def get_batch(step):
    start_index = step*BATCH_SIZE
    end_index = start_index + BATCH_SIZE
    x = Variable(torch.Tensor(train_texts_vectorized[start_index:end_index]).long())
    y = Variable(torch.Tensor(train_labels[start_index:end_index]).long())
    return x, y


from torch_model import TorchTextClassifier

model = TorchTextClassifier(VOCAB_SIZE, 
                            EMBEDDING_DIM, 
                            MAX_SEQUENCE_LENGTH, 
                            NUM_CLASSES, 
                            FILTERS, 
                            KERNEL_SIZE, 
                            POOL_SIZE, 
                            DROPOUT_RATE)

model.train()
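# Note: F.cross_entropy applies log-softmax to its input, and the model
# already ends in a softmax, so the softmax is effectively applied twice.
loss_metric = F.cross_entropy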
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

for epoch in range(NUM_EPOCH):

    for step in range(steps_per_epoch):

        x, y = get_batch(step)

        optimizer.zero_grad()

        y_pred = model(x)

        loss = loss_metric(y_pred, y) 

        loss.backward()
        optimizer.step()

        if step % 50 == 0:
            print('Batch [{}/{}] Loss: {}'.format(step+1, steps_per_epoch, round(loss.item(),5)))


    print('Epoch [{}/{}] Loss: {}'.format(epoch+1, NUM_EPOCH, round(loss.item(),5)))

print('Final Loss: {}'.format(round(loss.item(),5)))

torch.save(model, 'torch_saved_model.pt')

Train size: 72162
Batch size: 128
Number of epochs: 1
Steps per epoch: 563
Vocab Size: 20000
Embed Dimensions: 200
Sequence Length: 50

Batch [1/563] Loss: 1.11598
Batch [51/563] Loss: 1.07404
Batch [101/563] Loss: 1.08879
Batch [151/563] Loss: 1.05261
Batch [201/563] Loss: 1.05945
Batch [251/563] Loss: 1.00419
Batch [301/563] Loss: 0.98083
Batch [351/563] Loss: 0.95064
Batch [401/563] Loss: 0.94694
Batch [451/563] Loss: 0.89967
Batch [501/563] Loss: 0.86751
Batch [551/563] Loss: 0.94085
Epoch [1/1] Loss: 0.86216
Final Loss: 0.86216
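The logged loss starts near 1.1 and falls slowly. One plausible contributor, offered here as an observation rather than a diagnosis, is that `forward` returns softmax probabilities while `F.cross_entropy` applies a log-softmax of its own. The plain-Python sketch below reproduces that computation for three classes and shows that, when the inputs are probabilities, the loss cannot drop below about 0.55 even for a perfect prediction. `cross_entropy_from_scores` is a hypothetical helper, not a PyTorch function.

```python
import math


def cross_entropy_from_scores(scores, target):
    """Log-softmax followed by negative log-likelihood for one example."""
    log_norm = math.log(sum(math.exp(s) for s in scores))
    return log_norm - scores[target]


# Inputs are probabilities, as produced by the softmax at the end of forward().
uniform_probs = [1 / 3.0] * 3
perfect_probs = [1.0, 0.0, 0.0]

start_loss = cross_entropy_from_scores(uniform_probs, 0)
floor_loss = cross_entropy_from_scores(perfect_probs, 0)
print(round(start_loss, 4))
print(round(floor_loss, 4))
```

If the model returned raw scores from the dense layer instead, the same loss function could approach zero; `argmax` over raw scores picks the same class as `argmax` over probabilities, so predicted labels would be unaffected by that change.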


### Save pre-processing object

We need to save the fitted pre-processing object so that the same tokenizer used during training can pre-process requests at serving time.

In [0]:
import pickle
with open('./processor_state.pkl', 'wb') as f:
  pickle.dump(processor, f)
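The pickle written here is later loaded by a Python 2.7 serving runtime, and Python 2 can only read pickle protocols up to 2. When the notebook itself runs on Python 2.7, as the `cp27` wheels above suggest, the default protocol is already compatible. The sketch below pins the protocol explicitly and checks a round trip in memory, which may be a useful safeguard if the training side ever moves to Python 3. The `state` dictionary is a hypothetical stand-in for the fitted processor.

```python
import pickle

# Hypothetical stand-in for the fitted pre-processing state.
state = {'word_index': {'hello': 1, 'machine': 2, 'learning': 3}, 'maxlen': 5}

# Protocol 2 is the highest protocol that Python 2 can read.
payload = pickle.dumps(state, protocol=2)
restored = pickle.loads(payload)

print(restored == state)
```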

## Custom Model Prediction Preparation

### Copy model and pre-processing object to GCS

In [0]:
!gsutil cp torch_saved_model.pt gs://{BUCKET}/{MODEL_DIR}/
!gsutil cp processor_state.pkl gs://{BUCKET}/{MODEL_DIR}/

Copying file://torch_saved_model.pt [Content-Type=application/octet-stream]...
-
Operation completed over 1 objects/15.4 MiB.                                     
Copying file://processor_state.pkl [Content-Type=application/octet-stream]...
/ [1 files][  3.3 MiB/  3.3 MiB]                                                
Operation completed over 1 objects/3.3 MiB.                                      


### Define Model Class

In [0]:
%%writefile model_prediction.py

import os
import pickle
import numpy as np
import torch
from torch.autograd import Variable


class CustomModelPrediction(object):
  def __init__(self, model, processor):
    self._model = model
    self._processor = processor

  def _postprocess(self, predictions):
    labels = ['github', 'nytimes', 'techcrunch']
    label_indexes = [np.argmax(prediction) 
                     for prediction in predictions.detach().numpy()]
    return [labels[label_index] for label_index in label_indexes]


  def predict(self, instances, **kwargs):
    preprocessed_data = self._processor.transform(instances)
    predictions = self._model(Variable(torch.Tensor(preprocessed_data).long()))
    labels = self._postprocess(predictions)
    return labels


  @classmethod
  def from_path(cls, model_dir):
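    # torch_model must be importable so that torch.load can unpickle the
    # TorchTextClassifier class that was saved with torch.save(model, ...).
    import torch_model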
    model = torch.load(os.path.join(model_dir,'torch_saved_model.pt'))
    model.eval()
    with open(os.path.join(model_dir, 'processor_state.pkl'), 'rb') as f:
      processor = pickle.load(f)

    return cls(model, processor)
    

Writing model_prediction.py
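`_postprocess` returns only the winning label. Since post-processing is ordinary Python, it could just as well return the score alongside the label. The sketch below is one hypothetical variant, assuming the predictions have first been converted to plain lists of floats (for example with `.tolist()`) and that the service accepts any JSON-serialisable value as a prediction. `postprocess_with_confidence` is not part of the notebook's class.

```python
LABELS = ['github', 'nytimes', 'techcrunch']


def postprocess_with_confidence(probability_rows):
    """Return the best label and its score for each row of class scores."""
    results = []
    for row in probability_rows:
        best = max(range(len(row)), key=lambda i: row[i])
        results.append({'label': LABELS[best], 'confidence': float(row[best])})
    return results


example = postprocess_with_confidence([[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]])
print(example)
```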


### Test Model Class Locally

In [0]:
# Headlines for Predictions

techcrunch=[
  'Uber shuts down self-driving trucks unit',
  'Grover raises €37M Series A to offer latest tech products as a subscription',
  'Tech companies can now bid on the Pentagon’s $10B cloud contract'
]
nytimes=[
  '‘Lopping,’ ‘Tips’ and the ‘Z-List’: Bias Lawsuit Explores Harvard’s Admissions',
  'A $3B Plan to Turn Hoover Dam into a Giant Battery',
  'A MeToo Reckoning in China’s Workplace Amid Wave of Accusations'
]
github=[
  'Show HN: Moon – 3kb JavaScript UI compiler',
  'Show HN: Hello, a CLI tool for managing social media',
  'Firefox Nightly added support for time-travel debugging'
]
requests = (techcrunch+nytimes+github)

In [0]:
from model_prediction import CustomModelPrediction

model = CustomModelPrediction.from_path('.')
model.predict(requests)

['nytimes',
 'techcrunch',
 'techcrunch',
 'nytimes',
 'github',
 'nytimes',
 'github',
 'github',
 'github']

### Package up files and copy to GCS

In [0]:
%%writefile setup.py

from setuptools import setup

REQUIRED_PACKAGES = ['torch', 'keras']

setup(
  name="my_package",
  version="0.1",
  include_package_data=True,
  scripts=["preprocess.py", "model_prediction.py", "torch_model.py"],
  install_requires=REQUIRED_PACKAGES
)

Writing setup.py


In [0]:
!python setup.py sdist
!gsutil cp ./dist/my_package-0.1.tar.gz gs://{BUCKET}/{PACKAGES_DIR}/my_package-0.1.tar.gz

running sdist
running egg_info
creating my_package.egg-info
writing requirements to my_package.egg-info/requires.txt
writing my_package.egg-info/PKG-INFO
writing top-level names to my_package.egg-info/top_level.txt
writing dependency_links to my_package.egg-info/dependency_links.txt
writing manifest file 'my_package.egg-info/SOURCES.txt'
reading manifest file 'my_package.egg-info/SOURCES.txt'
writing manifest file 'my_package.egg-info/SOURCES.txt'

running check


creating my_package-0.1
creating my_package-0.1/my_package.egg-info
copying files to my_package-0.1...
copying model_prediction.py -> my_package-0.1
copying preprocess.py -> my_package-0.1
copying setup.py -> my_package-0.1
copying torch_model.py -> my_package-0.1
copying my_package.egg-info/PKG-INFO -> my_package-0.1/my_package.egg-info
copying my_package.egg-info/SOURCES.txt -> my_package-0.1/my_package.egg-info
copying my_package.egg-info/dependency_links.txt -> my_package-0.1/my_package.egg-info
copying my_package.egg-inf

## Model Deployment to CMLE

In [0]:
MODEL_NAME='torch_text_classification'
VERSION_NAME='v201903'
RUNTIME_VERSION='1.12'
REGION='us-central1'

In [0]:
!gcloud ml-engine models create {MODEL_NAME} --regions {REGION}

ERROR: (gcloud.ml-engine.models.create) Resource in project [vijays-sandbox] is the subject of a conflict: Field: model.name Error: A model with the same name already exists.
- '@type': type.googleapis.com/google.rpc.BadRequest
  fieldViolations:
  - description: A model with the same name already exists.
    field: model.name


In [0]:
!gcloud ml-engine versions delete {VERSION_NAME} --model {MODEL_NAME} --quiet # run if version already created

ERROR: (gcloud.ml-engine.versions.delete) NOT_FOUND: Field: name Error: The specified model version was not found.
- '@type': type.googleapis.com/google.rpc.BadRequest
  fieldViolations:
  - description: The specified model version was not found.
    field: name


In [0]:
!gcloud alpha ml-engine versions create {VERSION_NAME} --model {MODEL_NAME} \
--origin=gs://{BUCKET}/{MODEL_DIR}/ \
--python-version=2.7 \
--runtime-version={RUNTIME_VERSION} \
--framework='SCIKIT_LEARN' \
--package-uris=gs://{BUCKET}/{PACKAGES_DIR}/my_package-0.1.tar.gz \
--machine-type=mls1-c4-m4 \
--model-class=model_prediction.CustomModelPrediction

## Online Predictions from CMLE

In [0]:
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials
import json

# JSON format the requests
request_data = {'instances': requests}

# Authenticate and call CMLE prediction API 
credentials = GoogleCredentials.get_application_default()
api = discovery.build('ml', 'v1', credentials=credentials,
            discoveryServiceUrl='https://storage.googleapis.com/cloud-ml/discovery/ml_v1_discovery.json')

parent = 'projects/{}/models/{}/versions/{}'.format(PROJECT, MODEL_NAME, VERSION_NAME)
print("Model full name: {}".format(parent))
response = api.projects().predict(body=request_data, name=parent).execute()

print(response['predictions'])

Model full name: projects/vijays-sandbox/models/torch_text_classification/versions/v201903
[u'nytimes', u'techcrunch', u'techcrunch', u'nytimes', u'github', u'nytimes', u'github', u'github', u'github']
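Because `requests` was built as three TechCrunch, three New York Times and three GitHub headlines in that order, the response can be scored against the known sources. The sketch below does so using the predictions printed above, copied in by hand so that the snippet runs on its own.

```python
# Sources of the nine headlines, in the order they were concatenated.
expected = ['techcrunch'] * 3 + ['nytimes'] * 3 + ['github'] * 3

# Predictions as returned by the deployed model in the output above.
predicted = ['nytimes', 'techcrunch', 'techcrunch',
             'nytimes', 'github', 'nytimes',
             'github', 'github', 'github']

correct = sum(1 for e, p in zip(expected, predicted) if e == p)
print('{}/{} correct'.format(correct, len(expected)))
```

Nine hand-picked headlines are far too few to say anything about model quality; the held-out `eval.tsv` data would be the place to measure that.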


## License

Authors: Khalid Salama & Vijay Reddy 

---
**Disclaimer**: This is not an official Google product. The sample code is provided for educational purposes.

---

Copyright 2019 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0.

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.