##### Copyright 2019 The TensorFlow Authors.

In [0]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Text classification with TensorFlow Lite model customization with TensorFlow 2.0

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/tensorflow/examples/blob/master/tensorflow_examples/lite/model_customization/demo/text_classification.ipynb">
    <img src="https://www.tensorflow.org/images/colab_logo_32px.png" />
    Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/tensorflow/examples/blob/master/tensorflow_examples/lite/model_customization/demo/text_classification.ipynb">
    <img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />
    View source on GitHub</a>
  </td>
</table>

The TensorFlow Lite model customization library simplifies the process of adapting and converting a TensorFlow neural-network model to particular input data when deploying this model for on-device ML applications.

This notebook shows an end-to-end example that uses the model customization library to adapt and convert a commonly used text classification model to classify movie reviews on a mobile device.

## Prerequisites

To run this example, we first need to install several required packages, including the model customization package from the GitHub [repo](https://github.com/tensorflow/examples).

In [0]:
!pip uninstall -q -y tensorflow google-colab grpcio
!pip install -q tf-nightly
!pip install -q git+https://github.com/tensorflow/examples

Import the required packages.

In [0]:
from __future__ import absolute_import, division, print_function, unicode_literals

import numpy as np
import os
import tensorflow as tf
assert tf.__version__.startswith('2')

from tensorflow_examples.lite.model_customization.core.data_util.text_dataloader import TextClassifierDataLoader
from tensorflow_examples.lite.model_customization.core.model_export_format import ModelExportFormat
import tensorflow_examples.lite.model_customization.core.task.text_classifier as text_classifier

## Simple End-to-End Example

Let's get some text data to use in this simple end-to-end example. You could replace it with your own text folders.

In [0]:
data_path = tf.keras.utils.get_file(
      fname='aclImdb',
      origin='http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz',
      untar=True)

The example consists of just 4 lines of code, shown below, each of which represents one step of the overall process.

1.   Load train and test data specific to an on-device ML app.

In [0]:
train_data = TextClassifierDataLoader.from_folder(os.path.join(data_path, 'train'), class_labels=['pos', 'neg'])
test_data = TextClassifierDataLoader.from_folder(os.path.join(data_path, 'test'), shuffle=False)

2. Customize the TensorFlow model.

In [0]:
model = text_classifier.create(train_data)

3. Evaluate the model.

In [0]:
loss, acc = model.evaluate(test_data)

4. Export to a TensorFlow Lite model.

In [0]:
model.export('movie_review_classifier.tflite', 'text_label.txt', 'vocab.txt')

After these 4 simple steps, we can use the TensorFlow Lite model file and label file in on-device applications, such as the [text classification](https://github.com/tensorflow/examples/tree/master/lite/examples/text_classification) reference app.

## Detailed Process

Above, we tried the simple end-to-end example. The following sections walk through it step by step in more detail.

### Step 1: Load Input Data Specific to an On-device ML App

The IMDB dataset contains 25,000 movie reviews for training and 25,000 movie reviews for testing from the [Internet Movie Database](https://www.imdb.com/). The dataset has two classes: positive and negative movie reviews.

Download the archive version of the dataset and untar it.

The IMDB dataset has the following directory structure:

<pre>
<b>aclImdb</b>
|__ <b>train</b>
    |______ <b>pos</b>: [1962_10.txt, 2499_10.txt, ...]
    |______ <b>neg</b>: [104_3.txt, 109_2.txt, ...]
    |______ unsup: [12099_0.txt, 1424_0.txt, ...]
|__ <b>test</b>
    |______ <b>pos</b>: [1384_9.txt, 191_9.txt, ...]
    |______ <b>neg</b>: [1629_1.txt, 21_1.txt, ...]

</pre>

Note that the text files under the `train/unsup` folder are unlabeled documents for unsupervised learning; they are ignored in this tutorial.


In [0]:
data_path = tf.keras.utils.get_file(
      fname='aclImdb',
      origin='http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz',
      untar=True)

Use `TextClassifierDataLoader` to load data.

The `from_folder()` method loads data from a folder. It assumes that text files of the same class are in the same subdirectory and that the subfolder name is the class name. Each text file contains one movie review sample.

The `class_labels` parameter specifies which subfolders should be considered. For the `train` folder, it is used to skip the `unsup` subfolder.
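
The folder convention described above can be illustrated with a small standard-library sketch. This is not the library's implementation: `load_texts_from_folder` is a hypothetical helper, and assigning label ids by position in `class_labels` (or by sorted subfolder name when no list is given) is an assumption made for illustration.

```python
import os
import tempfile


def load_texts_from_folder(folder, class_labels=None):
    """Hypothetical helper: returns (texts, label_ids, index_to_label)."""
    if class_labels is None:
        # Assumption: without an explicit list, every subfolder is a class.
        class_labels = sorted(
            name for name in os.listdir(folder)
            if os.path.isdir(os.path.join(folder, name)))
    texts, label_ids = [], []
    for label_id, label in enumerate(class_labels):
        class_dir = os.path.join(folder, label)
        for filename in sorted(os.listdir(class_dir)):
            path = os.path.join(class_dir, filename)
            with open(path, encoding='utf-8') as f:
                texts.append(f.read())  # One file holds one review.
            label_ids.append(label_id)
    return texts, label_ids, list(class_labels)


# Build a tiny stand-in for aclImdb/train, including an `unsup` folder.
demo_root = tempfile.mkdtemp()
for label, review in [('pos', 'great film'), ('neg', 'dull plot'),
                      ('unsup', 'unlabeled text')]:
    os.makedirs(os.path.join(demo_root, label))
    with open(os.path.join(demo_root, label, '0_0.txt'), 'w',
              encoding='utf-8') as f:
        f.write(review)

# Listing only `pos` and `neg` skips the `unsup` subfolder.
demo_texts, demo_label_ids, demo_index_to_label = load_texts_from_folder(
    demo_root, class_labels=['pos', 'neg'])
print(demo_texts, demo_label_ids, demo_index_to_label)
```

Because only the listed subfolders are visited, the unlabeled review never reaches the returned samples.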


In [0]:
train_data = TextClassifierDataLoader.from_folder(os.path.join(data_path, 'train'), class_labels=['pos', 'neg'])
test_data = TextClassifierDataLoader.from_folder(os.path.join(data_path, 'test'), shuffle=False)

Take a glance at 25 training samples.


In [0]:
for text, label in train_data.dataset.take(25):
  print('%s: %s' % (train_data.index_to_label[label.numpy()], text.numpy()))

### Step 2: Customize the TensorFlow Model

Create a custom text classifier model based on the loaded data. Currently, the library only supports the average word embedding method.

In [0]:
model = text_classifier.create(train_data)

Have a look at the detailed model structure.

In [0]:
model.summary()

### Step 3: Evaluate the Customized Model

Evaluate the model to get its loss and accuracy on `test_data`. If no data is given, the results are evaluated on the test data that was split off in the `create` method.

In [0]:
loss, acc = model.evaluate(test_data)

### Step 4: Export to TensorFlow Lite Model

Convert the existing model to the TensorFlow Lite model format so that it can later be used in an on-device ML application. Meanwhile, save the text labels in a label file and the vocabulary in a vocab file.

In [0]:
model.export('movie_review_classifier.tflite', 'text_label.txt', 'vocab.txt')

The TensorFlow Lite model file and label file can be used in the [text classification](https://github.com/tensorflow/examples/tree/master/lite/examples/text_classification) reference app.

In detail, we could add `movie_review_classifier.tflite`, `text_label.txt` and `vocab.txt` to the [assets](https://github.com/tensorflow/examples/tree/master/lite/examples/text_classification/android/app/src/main/assets) folder, and then change the filenames in the [code](https://github.com/tensorflow/examples/blob/master/lite/examples/text_classification/android/app/src/main/java/org/tensorflow/lite/examples/textclassification/TextClassificationClient.java#L43).

Here, we also demonstrate how to use the above files to run and evaluate the TensorFlow Lite model.

In [0]:
# Read TensorFlow Lite model from TensorFlow Lite file.
with tf.io.gfile.GFile('movie_review_classifier.tflite', 'rb') as f:
  model_content = f.read()

# Read label names from label file.
with tf.io.gfile.GFile('text_label.txt', 'r') as f:
  label_names = f.read().split('\n')

# Initialize the TensorFlow Lite interpreter.
interpreter = tf.lite.Interpreter(model_content=model_content)
interpreter.allocate_tensors()
input_index = interpreter.get_input_details()[0]['index']
output = interpreter.tensor(interpreter.get_output_details()[0]["index"])

# Run predictions on each test data and calculate accuracy.
accurate_count = 0
for i, (text, label) in enumerate(model.test_data.dataset):
    # Pre-processing should remain the same.
    text, label = model.preprocess_text(text, label)
    # Add batch dimension and convert to float32 to match with the model's input
    # data format. Convert to a NumPy array last, since `set_tensor` expects one.
    text = tf.expand_dims(text, 0)
    text = tf.cast(text, tf.float32).numpy()

    # Run inference.
    interpreter.set_tensor(input_index, text)
    interpreter.invoke()

    # Post-processing: remove batch dimension and find the label with highest
    # probability.
    predict_label = np.argmax(output()[0])
    # Get label name with label index.
    predict_label_name = label_names[predict_label]
    
    accurate_count += (predict_label == label.numpy())

accuracy = accurate_count * 1.0 / model.test_data.size
print('TensorFlow Lite model accuracy = %.4f' % accuracy)

Note that preprocessing for inference should be the same as for training. Currently, preprocessing consists of splitting the text into tokens by `\W`, encoding the tokens to ids, and then padding the text with `pad_id` to a length of `sentence_length`.
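
The three preprocessing steps can be sketched in plain Python as follows. This is an illustration rather than the library's `preprocess_text`: the helper name, the pad and unknown ids, the lowercasing, and the truncation of long texts are all assumptions.

```python
import re

PAD_ID = 0      # Assumed id used for padding.
UNKNOWN_ID = 1  # Assumed id for tokens missing from the vocabulary.


def preprocess(text, vocab, sentence_len):
    """Splits on non-word characters, encodes to ids, then pads or truncates."""
    # Assumption: text is lowercased before splitting on runs of `\W`.
    tokens = [t for t in re.split(r'\W+', text.lower()) if t]
    ids = [vocab.get(token, UNKNOWN_ID) for token in tokens]
    ids = ids[:sentence_len]  # Assumption: longer texts are truncated.
    return ids + [PAD_ID] * (sentence_len - len(ids))


toy_vocab = {'great': 2, 'movie': 3, 'boring': 4}
encoded = preprocess('Great movie, really great!', toy_vocab, sentence_len=6)
print(encoded)  # [2, 3, 1, 2, 0, 0]
```

Every encoded text ends up with exactly `sentence_len` ids, which is what lets the model take fixed-size input at both training and inference time.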

## Advanced Usage

The `create` function is the critical part of this library. It performs the following steps:

1.   Split the data into training, validation, and testing data according to the parameters `validation_ratio` and `test_ratio`. Both default to `0.1`.
2.   Tokenize the text and select the `num_words` most frequent words to generate the vocabulary. The default value of `num_words` is `10000`.
3.   Encode the text string tokens to int ids.
4.   Create the text classifier model. Currently, this library supports one model: it averages the word embeddings of the text with ReLU activation, then uses a softmax dense layer for classification. For the [Embedding layer](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding), `input_dim` is the size of the vocabulary, `output_dim` is the `create` function's parameter `wordvec_dim`, whose default value is `16`, and `input_length` is the `create` function's parameter `sentence_len`, whose default value is `256`.
5.   Train the classifier model. The default number of epochs is `2` and the default batch size is `32`.
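
The first two steps can be sketched in plain Python to make the roles of `validation_ratio`, `test_ratio`, and `num_words` concrete. The helpers below are hypothetical and do not reproduce the library's code: the absence of shuffling, the lowercasing, and reserving ids `0` and `1` for padding and unknown tokens are assumptions.

```python
import collections
import re


def split_data(samples, validation_ratio=0.1, test_ratio=0.1):
    """Splits a list into (train, validation, test) parts by ratio."""
    num_validation = int(len(samples) * validation_ratio)
    num_test = int(len(samples) * test_ratio)
    num_train = len(samples) - num_validation - num_test
    train = samples[:num_train]
    validation = samples[num_train:num_train + num_validation]
    test = samples[num_train + num_validation:]
    return train, validation, test


def build_vocab(texts, num_words=10000, first_id=2):
    """Maps the `num_words` most frequent tokens to ids from `first_id` up."""
    counter = collections.Counter(
        token for text in texts
        for token in re.split(r'\W+', text.lower()) if token)
    most_common = [token for token, _ in counter.most_common(num_words)]
    return {token: i + first_id for i, token in enumerate(most_common)}


reviews = ['good good film', 'bad film', 'good plot', 'bad bad bad acting',
           'fine film', 'weak plot', 'good cast', 'bad script',
           'good music', 'bad pacing']
train_texts, validation_texts, test_texts = split_data(reviews)
toy_vocab = build_vocab(train_texts, num_words=3)
print(len(train_texts), len(validation_texts), len(test_texts))  # 8 1 1
print(toy_vocab)  # {'bad': 2, 'good': 3, 'film': 4}
```

Building the vocabulary from the training portion only, as done here, is a common design choice because it keeps held-out words from influencing the model; whether the library does the same is not stated in this notebook.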

In this section, we describe several advanced topics, including adjusting the model and changing the training hyperparameters.


### Adjust the model

We could adjust model architecture parameters such as `wordvec_dim` and `sentence_len`.



*   `wordvec_dim`: Dimension of word embedding.
*   `sentence_len`: Length of the sentence.

For example, we could train with a larger `wordvec_dim`.

In [0]:
model = text_classifier.create(train_data, wordvec_dim=32)
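
To see where `wordvec_dim` and `sentence_len` enter the model, here is a dependency-free sketch of an average word embedding classifier. It is one plausible reading of the architecture described above, not the library's code: the hidden layer size, the omission of biases, the random weights, and the inclusion of padding ids in the average are all assumptions.

```python
import math
import random


def make_model(vocab_size, wordvec_dim, num_classes, seed=0):
    """Creates small random weights for the illustrative model."""
    rng = random.Random(seed)

    def rand(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                for _ in range(rows)]

    return {
        'embedding': rand(vocab_size, wordvec_dim),  # One row per token id.
        'hidden': rand(wordvec_dim, wordvec_dim),    # ReLU dense layer.
        'output': rand(num_classes, wordvec_dim),    # Softmax dense layer.
    }


def dense(weights, inputs):
    """Multiplies a weight matrix (one row per output unit) by a vector."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


def predict(toy_model, token_ids):
    """Returns class probabilities for one padded sentence of token ids."""
    vectors = [toy_model['embedding'][i] for i in token_ids]
    # Average over the sentence axis: the result has wordvec_dim entries.
    mean = [sum(column) / len(vectors) for column in zip(*vectors)]
    hidden = [max(0.0, h) for h in dense(toy_model['hidden'], mean)]
    logits = dense(toy_model['output'], hidden)
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


sentence_len = 8
token_ids = [2, 3, 1, 2] + [0] * (sentence_len - 4)  # Padded token ids.
for wordvec_dim in (16, 32):
    toy_model = make_model(vocab_size=10, wordvec_dim=wordvec_dim,
                           num_classes=2)
    probs = predict(toy_model, token_ids)
    print(wordvec_dim, 'embedding weights:', 10 * wordvec_dim, probs)
```

In this sketch, doubling `wordvec_dim` doubles the size of the embedding table, while `sentence_len` only changes how many token vectors are averaged.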

### Change the training hyperparameters
We could also change training hyperparameters such as `epochs` and `batch_size`, which affect the model accuracy. For instance,

*   `epochs`: more epochs could achieve better accuracy until convergence, but training for too many epochs may lead to overfitting.
*   `batch_size`: number of samples to use in one training step.
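
As a rough worked example of how these two hyperparameters interact, the sketch below estimates the number of training steps for the 25,000 IMDB training reviews. `training_steps` is a hypothetical helper; it assumes the default `validation_ratio` and `test_ratio` of `0.1` are applied as simple fractions and that the last partial batch is kept.

```python
import math


def training_steps(num_samples, epochs=2, batch_size=32,
                   validation_ratio=0.1, test_ratio=0.1):
    """Returns (steps per epoch, total steps) under the stated assumptions."""
    num_held_out = (int(num_samples * validation_ratio)
                    + int(num_samples * test_ratio))
    num_train = num_samples - num_held_out
    steps_per_epoch = math.ceil(num_train / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


print(training_steps(25000))                 # (625, 1250)
print(training_steps(25000, epochs=5))       # (625, 3125)
print(training_steps(25000, batch_size=64))  # (313, 626)
```

Raising `epochs` increases the total number of weight updates, while raising `batch_size` reduces the number of updates per epoch.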

For example, we could train with more epochs.

In [0]:
model = text_classifier.create(train_data, epochs=5)

Evaluate the newly retrained model with 5 training epochs.

In [0]:
loss, accuracy = model.evaluate()