**Introduction**

> The objective for this model is to learn about TensorFlow Hub functionality and to acquaint myself with a second way to do sentiment classification for text. The code in this model is derived from a TensorFlow tutorial.

> Sentiment classification is a key marketing use case for machine learning; it is critical for services such as social listening. The data for the model is IMDb movie reviews, and the task is binary classification.

> The code for this model uses tfds.load() to ingest the data and hub.KerasLayer() to create the text embeddings. This approach does no text cleaning or standardization beyond creating the embeddings. The model itself is a tf.keras.Sequential() model that stacks the embedding layer and two Dense layers. The embedding comes from Google and has 50 dimensions. It is loaded by URL from TensorFlow Hub: "https://tfhub.dev/google/nnlm-en-dim50/2".

**1. Setup**

In [1]:
import os
import numpy as np

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds

print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("Hub version: ", hub.__version__)
print("GPU is", "available" if tf.config.list_physical_devices("GPU") else "NOT AVAILABLE")

Version:  2.7.0
Eager mode:  True
Hub version:  0.12.0
GPU is available


**2. Pre-process Data**

> a) Download data from TensorFlow Datasets

> The tfds.load() command fetches data from TensorFlow Datasets. Its signature is: tfds.load(name: str, *, split: Optional[Tree[splits_lib.SplitArg]] = None, data_dir: Optional[str] = None, batch_size: tfds.typing.Dim = None, shuffle_files: bool = False, download: bool = True, as_supervised: bool = False, decoders: Optional[TreeDict[decode.partial_decode.DecoderArg]] = None, read_config: Optional[tfds.ReadConfig] = None, with_info: bool = False, builder_kwargs: Optional[Dict[str, Any]] = None, download_and_prepare_kwargs: Optional[Dict[str, Any]] = None, as_dataset_kwargs: Optional[Dict[str, Any]] = None, try_gcs: bool = False). The tfds.load() command returns a tf.data.Dataset object, i.e. the dataset requested. When split is a tuple, as it is here, it returns one dataset per requested split.

> Each example is the text of a movie review and a corresponding label. The text is not preprocessed in any way. The label is an integer value of either 0 or 1, where 0 is a negative review and 1 is a positive review. Here we split the raw data into different sets. The original training data is split 60/40 into training and validation sets, and the original test data is used entirely for testing. This yields 15,000 examples for training, 10,000 for validation, and 25,000 for testing.

In [2]:
train_data, validation_data, test_data = tfds.load(
    name="imdb_reviews", 
    split=('train[:60%]', 'train[60%:]', 'test'),
    as_supervised=True)
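
> The split sizes quoted above follow directly from the slice percentages. Here is a minimal sketch of that arithmetic in plain Python. It assumes the original train split holds 25,000 reviews (the sum of the 15,000 and 10,000 figures). `split_sizes` is a hypothetical helper for illustration and is not part of tfds.

```python
def split_sizes(total, percent):
    """Return (head, tail) sizes for slices like 'train[:60%]' and 'train[60%:]'."""
    # Integer arithmetic; exact here because 60% of 25,000 is a whole number.
    head = total * percent // 100
    return head, total - head


# Assumption: 25,000 reviews in the original train split.
train_size, validation_size = split_sizes(25_000, 60)
print(train_size, validation_size)  # 15000 10000
```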

> b) Explore data

> train_data is a variable defined by the above tfds.load() command. ".batch()" is a method of the tf.data.Dataset object. Its signature is: batch(batch_size, drop_remainder=False, num_parallel_calls=None, deterministic=None, name=None). The number '10' below is the size of the batch. "iter()" creates an iterator over the batched dataset and "next()" retrieves the first batch.

> The same statement also unpacks the batch into two separate objects, the movie reviews and the labels. Both are tensors.
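
> The batch, iter, next pattern can be mimicked with the standard library alone. The sketch below is an analogy, not tf.data itself. The `batch` generator and the `toy_data` pairs are hypothetical stand-ins for the dataset and its (review, label) examples.

```python
from itertools import islice


def batch(examples, batch_size):
    """Yield lists of up to batch_size items, keeping a final partial batch."""
    iterator = iter(examples)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk


# Hypothetical (text, label) pairs standing in for the IMDb reviews.
toy_data = [(f"review {i}", i % 2) for i in range(25)]

# Take the first batch of 10, then separate the reviews from the labels.
first_batch = next(iter(batch(toy_data, 10)))
texts, labels = zip(*first_batch)
print(len(texts), labels)
```

> Keeping the final partial batch mirrors the drop_remainder=False default mentioned above.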

In [3]:
train_examples_batch, train_labels_batch = next(iter(train_data.batch(10)))
train_examples_batch

<tf.Tensor: shape=(10,), dtype=string, numpy=
array([b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.",
       b'I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell 

In [4]:
train_labels_batch

<tf.Tensor: shape=(10,), dtype=int64, numpy=array([0, 0, 0, 1, 1, 1, 0, 0, 0, 0])>

**3. Create the Model**

> a) Embedding vectors



> The neural network is created by stacking layers. This requires three main architectural decisions: 1) how to represent the text, 2) how many layers to use in the model, and 3) how many hidden units to use for each layer. In this example, the input data consists of sentences. The labels to predict are either 0 or 1.

> One way to represent the text is to convert sentences into embedding vectors. Using a pre-trained text embedding as the first layer has three advantages: 1) there is no need to worry about text preprocessing, 2) the model benefits from transfer learning, and 3) the embedding has a fixed size, so it is simpler to process. This example uses a pre-trained text embedding model from TensorFlow Hub called google/nnlm-en-dim50/2.

> There are many other pre-trained text embeddings from TFHub: 1) google/nnlm-en-dim128/2 - trained with the same NNLM architecture on the same data as google/nnlm-en-dim50/2, but with a larger embedding dimension. Larger dimensional embeddings can improve on your task but it may take longer to train your model. 2) google/nnlm-en-dim128-with-normalization/2 - the same as google/nnlm-en-dim128/2, but with additional text normalization such as removing punctuation. This can help if the text in your task contains additional characters or punctuation. 3) google/universal-sentence-encoder/4 - a much larger model yielding 512 dimensional embeddings trained with a deep averaging network (DAN) encoder.

> The Keras layer ("hub.KerasLayer()") uses a TensorFlow Hub model ("https://tfhub.dev/google/nnlm-en-dim50/2") to embed the sentences. Here I experiment with a couple of input examples. Note that no matter the length of the input text, the output shape of the embeddings is: (num_examples, embedding_dimension).
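
> To see why the output shape does not depend on the length of the text, consider a toy scheme that averages one vector per word. This scheme is an assumption chosen for illustration only. It is not a description of how google/nnlm-en-dim50/2 computes its embeddings. `toy_word_vector` and `embed` are hypothetical names.

```python
import hashlib


def toy_word_vector(word, dim):
    """Map a word to a deterministic pseudo-random vector with values in [-1, 1)."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    # The digest has 32 bytes, so bytes are reused when dim exceeds 32.
    return [digest[i % len(digest)] / 128.0 - 1.0 for i in range(dim)]


def embed(sentence, dim=50):
    """Average the word vectors so every sentence maps to exactly dim numbers."""
    words = sentence.split() or [""]
    vectors = [toy_word_vector(word, dim) for word in words]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


sentences = [
    "Great film",
    "I could barely sit through this long and tedious movie",
]
embeddings = [embed(sentence) for sentence in sentences]
print((len(embeddings), len(embeddings[0])))  # (2, 50)
```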

> The signature for creating a Keras layer is: hub.KerasLayer(handle, trainable=False, arguments=None, _sentinel=None, tags=None, signature=None, signature_outputs_as_dict=None, output_key=None, output_shape=None, load_options=None, **kwargs). The function wraps a callable object for use as a Keras layer. Below, trainable=True overrides the default, so the embedding weights are fine-tuned along with the rest of the model. Note that hub_layer is itself callable and takes the input text as its argument.

In [5]:
embedding = "https://tfhub.dev/google/nnlm-en-dim50/2"
hub_layer = hub.KerasLayer(embedding, input_shape=[], 
                           dtype=tf.string, trainable=True)
hub_layer(train_examples_batch[:3])

<tf.Tensor: shape=(3, 50), dtype=float32, numpy=
array([[ 0.5423194 , -0.01190171,  0.06337537,  0.0686297 , -0.16776839,
        -0.10581177,  0.168653  , -0.04998823, -0.31148052,  0.07910344,
         0.15442258,  0.01488661,  0.03930155,  0.19772716, -0.12215477,
        -0.04120982, -0.27041087, -0.21922147,  0.26517656, -0.80739075,
         0.25833526, -0.31004202,  0.2868321 ,  0.19433866, -0.29036498,
         0.0386285 , -0.78444123, -0.04793238,  0.41102988, -0.36388886,
        -0.58034706,  0.30269453,  0.36308962, -0.15227163, -0.4439151 ,
         0.19462997,  0.19528405,  0.05666233,  0.2890704 , -0.28468323,
        -0.00531206,  0.0571938 , -0.3201319 , -0.04418665, -0.08550781,
        -0.55847436, -0.2333639 , -0.20782956, -0.03543065, -0.17533456],
       [ 0.56338924, -0.12339553, -0.10862677,  0.7753425 , -0.07667087,
        -0.15752274,  0.01872334, -0.08169781, -0.3521876 ,  0.46373403,
        -0.08492758,  0.07166861, -0.00670818,  0.12686071, -0.19326551,
 

> b) Model architecture

> The signature for a Dense layer is: tf.keras.layers.Dense(units, activation=None, use_bias=True, kernel_initializer='glorot_uniform', bias_initializer='zeros', kernel_regularizer=None, bias_regularizer=None, activity_regularizer=None, kernel_constraint=None, bias_constraint=None, **kwargs). The arguments set here are the number of units for each layer and a ReLU activation on the first Dense layer; the final layer keeps the default linear activation.

> The layers are stacked sequentially to build the classifier: The first layer is a TensorFlow Hub layer. For this NNLM model, the embedding_dimension is 50. This fixed-length output vector is piped through a fully-connected (Dense) layer with 16 hidden units. The last layer is densely connected with a single unit.

In [6]:
model = tf.keras.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(1))
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 keras_layer (KerasLayer)    (None, 50)                48190600  
                                                                 
 dense (Dense)               (None, 16)                816       
                                                                 
 dense_1 (Dense)             (None, 1)                 17        
                                                                 
Total params: 48,191,433
Trainable params: 48,191,433
Non-trainable params: 0
_________________________________________________________________
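
> The Dense parameter counts in the summary can be checked by hand: a Dense layer has one weight per input-unit pair plus one bias per unit. The sketch below redoes that arithmetic. `dense_params` is a hypothetical helper, and the embedding count is copied from the summary rather than derived.

```python
def dense_params(inputs, units, use_bias=True):
    """Weights (inputs * units) plus one bias per unit."""
    return inputs * units + (units if use_bias else 0)


embedding_params = 48_190_600  # copied from the model summary
hidden = dense_params(50, 16)  # 50 * 16 + 16
output = dense_params(16, 1)   # 16 * 1 + 1
total = embedding_params + hidden + output
print(hidden, output, total)  # 816 17 48191433
```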


**4. Compile the Model**

> A model needs a loss function and an optimizer for training. Both are defined in the compile step. Since this is a binary classification problem and the model outputs logits (a single-unit layer with a linear activation), the loss function is binary cross-entropy, configured with from_logits=True so that the sigmoid is applied inside the loss.

> This isn't the only choice for a loss function. You could, for instance, choose mean_squared_error. Generally, though, binary_crossentropy is better for dealing with probabilities: it measures the "distance" between probability distributions, or in our case, between the ground-truth distribution and the predictions.

> Adam is a standard optimizer for many applications. It makes sense to use it here.

> Model performance will be based on accuracy, so this is the metric we set inside the compile function.

In [7]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])
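
> To make the from_logits=True setting concrete, here is a sketch of binary cross-entropy computed two ways: from a probability, and directly from a logit using a numerically stable form. The helper names are hypothetical, and this illustrates the formula rather than Keras's internal implementation.

```python
import math


def sigmoid(z):
    """Squash a logit into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def bce_from_probability(y, p):
    """Binary cross-entropy for label y (0 or 1) and predicted probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def bce_from_logit(y, z):
    """Same loss written on the logit: max(z, 0) - z*y + log(1 + exp(-|z|))."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


# Hypothetical (label, logit) pairs; both columns should agree.
for y, z in [(1, 2.0), (0, 2.0), (1, -0.5)]:
    via_logit = bce_from_logit(y, z)
    via_probability = bce_from_probability(y, sigmoid(z))
    print(y, z, round(via_logit, 4), round(via_probability, 4))
```

> Working on the logit avoids taking the logarithm of a probability that has rounded to 0 or 1, which is one reason to leave the sigmoid out of the final layer.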

**5. Train the Model**

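> The model.fit() function trains the model for 10 epochs in mini-batches of 512 samples. Each epoch is one pass over all samples in train_data, so training makes 10 passes in total. The training data is shuffled with a buffer of 10,000 before batching, and validation_data is evaluated at the end of each epoch.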

> In the run below, validation accuracy gets worse in the tenth and final epoch, which suggests the model has begun to overfit the training data. Further increases in training accuracy are irrelevant at that point. It may have been better to stop training after the 6th epoch, since the incremental gains on the validation set after that are almost zero.

In [8]:
history = model.fit(train_data.shuffle(10000).batch(512),
                    epochs=10,
                    validation_data=validation_data.batch(512),
                    verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
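
> The batch size of 512 determines how many steps make up one pass over each dataset. Here is a small sketch of that arithmetic, using the split sizes stated earlier. `steps_per_epoch` is a hypothetical helper. The result for the test set agrees with the 49/49 step count in the evaluation output in section 6.

```python
import math


def steps_per_epoch(num_examples, batch_size):
    """Batches needed to cover the data once, counting a final partial batch."""
    return math.ceil(num_examples / batch_size)


print(steps_per_epoch(15_000, 512))  # 30 training steps per epoch
print(steps_per_epoch(10_000, 512))  # 20 validation steps
print(steps_per_epoch(25_000, 512))  # 49 test steps
```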


**6. Evaluate the Model**

> The model is then tested on the test set of data to evaluate its true performance. The performance here at 85% accuracy is very close to the accuracy measure on the validation set, but much lower than the 99% accuracy on the training set.

> model.evaluate() is used to generate the final accuracy numbers on the test set. 

In [9]:
results = model.evaluate(test_data.batch(512), verbose=2)

49/49 - 3s - loss: 0.3567 - accuracy: 0.8518 - 3s/epoch - 68ms/step


In [10]:
for name, value in zip(model.metrics_names, results):
    print(f"{name}: {value:.3f}")

loss: 0.357
accuracy: 0.852
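
> Because the final layer emits logits rather than probabilities, turning a model output into a class label takes one more step. The sketch below assumes the usual convention of a sigmoid followed by a 0.5 cutoff. `predict_label` is a hypothetical helper, and the logits are made-up values, not outputs of the trained model.

```python
import math


def predict_label(logit, threshold=0.5):
    """Convert a raw logit to (probability, label) with a sigmoid and a cutoff."""
    probability = 1.0 / (1.0 + math.exp(-logit))
    return probability, int(probability >= threshold)


# Hypothetical logits, not outputs of the trained model.
for logit in (-2.0, 0.0, 3.0):
    probability, label = predict_label(logit)
    print(f"logit={logit:+.1f} probability={probability:.3f} label={label}")
```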


**7. Final Comments**

> 85% isn't bad for this model. However, the model appears to be overfitted to the training data. Training the model for fewer epochs could improve its performance on the test data.
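
> One common way to pick the number of epochs automatically is early stopping with a patience window. The sketch below shows the idea in plain Python. The validation accuracies are made up for illustration and are not results from this notebook. `best_epoch_with_patience` is a hypothetical helper.

```python
def best_epoch_with_patience(val_accuracy, patience=2):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops once `patience` consecutive epochs fail to beat the best
    validation accuracy seen so far.
    """
    best_epoch, best_value, waited = 0, float("-inf"), 0
    for epoch, value in enumerate(val_accuracy, start=1):
        if value > best_value:
            best_epoch, best_value, waited = epoch, value, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_accuracy), best_epoch


# Made-up validation accuracies for illustration only.
toy_history = [0.70, 0.78, 0.82, 0.84, 0.85, 0.86, 0.86, 0.859, 0.858, 0.85]
stop_epoch, best_epoch = best_epoch_with_patience(toy_history, patience=2)
print(stop_epoch, best_epoch)  # 8 6
```

> Keras provides tf.keras.callbacks.EarlyStopping for this purpose. The sketch only mirrors the idea and is not that callback's implementation.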

> Since a GPU was available for running this model, a larger embedding is affordable, and using embeddings with more dimensions could improve model performance.

> I re-ran this experiment, limiting the training to 5 epochs and using Google's text embeddings with 128 dimensions (google/nnlm-en-dim128/2). The accuracy on the test data ended up at 86%. The validation set reached 87%, but neither is significantly different from the original run of this model with Google's 50-dimension embeddings.