<a href="https://colab.research.google.com/github/SiriuXProtocoL/Tensorflow_examples/blob/main/03_text__classification_tensorflow_hub.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

###Movie Sentiment Analysis using TensorFlow Hub
- This notebook demonstrates a basic application of transfer learning with TensorFlow Hub and Keras.
- We'll use the IMDB dataset, which contains the text of 50,000 movie reviews from the Internet Movie Database.
- We'll also use tensorflow_hub, a library for loading trained models from TensorFlow Hub in a single line of code.

In [None]:
!pip install -q tfds-nightly
!pip install -q tensorflow-hub

In [None]:
import os
import numpy as np

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds

# Load compressed models from tensorflow_hub
os.environ["TFHUB_MODEL_LOAD_FORMAT"] = "COMPRESSED"

print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("Hub version: ", hub.__version__)
print("GPU is", "available" if tf.config.list_physical_devices("GPU") else "NOT AVAILABLE")

###Download the IMDB dataset
- The IMDB dataset is available from its original source (IMDB reviews) or through TensorFlow Datasets, which we use here.

In [None]:
# Split the 25,000-example training set 60/40, giving 15,000 examples for
# training and 10,000 for validation; the 25,000-example test set is used as is.
# The dataset is downloaded to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0.
train_data, validation_data, test_data = tfds.load(
    name="imdb_reviews", 
    split=('train[:60%]', 'train[60%:]', 'test'),
    as_supervised=True)
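
The split sizes quoted in the comment above can be checked with plain arithmetic. This is a minimal sketch that assumes the IMDB training and test splits each hold 25,000 reviews (half of the 50,000 total); `split_sizes` is a hypothetical helper, not part of TensorFlow Datasets.

```python
# Sizes of the three splits, derived from the percentage slices.
TRAIN_TOTAL = 25_000  # assumed size of the IMDB 'train' split
TEST_TOTAL = 25_000   # assumed size of the IMDB 'test' split


def split_sizes(total, fraction_pct):
    """Return (head, tail) sizes when slicing `total` items at `fraction_pct` percent."""
    head = total * fraction_pct // 100
    return head, total - head


# 'train[:60%]' and 'train[60%:]'
train_size, validation_size = split_sizes(TRAIN_TOTAL, 60)
print(train_size, validation_size, TEST_TOTAL)
```

On a live dataset, `len(train_data)` should report the same count once the data has been downloaded.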


###Explore the data
- Print the first 10 examples.

In [None]:
train_examples_batch, train_labels_batch = next(iter(train_data.batch(10)))
train_examples_batch


- Print the first 10 labels (0 is a negative review, 1 is a positive one).

In [None]:
train_labels_batch


###Build the model
- We can use a pre-trained text embedding as the first layer, which has three advantages:

    - we don't have to worry about text preprocessing,
    - we can benefit from transfer learning,
    - the embedding has a fixed size, so it's simpler to process.

- For this example we will use a pre-trained text embedding model from TensorFlow Hub called google/nnlm-en-dim50/2.

- There are many other text embedding models available on TensorFlow Hub.

In [None]:
embedding = "https://tfhub.dev/google/nnlm-en-dim50/2"
hub_layer = hub.KerasLayer(embedding, input_shape=[], 
                           dtype=tf.string, trainable=True)
hub_layer(train_examples_batch[:3])

In [None]:
model = tf.keras.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(1))

model.summary()

- The layers are stacked sequentially to build the classifier:

    - The first layer is a TensorFlow Hub layer. This layer uses a pre-trained SavedModel to map a sentence into its embedding vector. The pre-trained text embedding model that we are using (google/nnlm-en-dim50/2) splits the sentence into tokens, embeds each token and then combines the token embeddings into one vector. The resulting dimensions are: (num_examples, embedding_dimension). For this NNLM model, the embedding_dimension is 50.
    - This fixed-length output vector is piped through a fully-connected (Dense) layer with 16 hidden units.
    - The last layer is densely connected with a single output node that produces one logit per review.
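
The data flow described above can be traced with NumPy alone. This is only a sketch under loud assumptions: the five-word vocabulary, the random vectors, and the use of a plain average to combine token embeddings are all made up for illustration, and the real NNLM model has its own vocabulary, trained weights, and combining rule. The point is the shapes at each stage.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the hub layer: a hypothetical 5-word vocabulary with
# random 50-dimensional vectors (not the real NNLM weights).
EMBEDDING_DIM = 50
vocab = {"this": 0, "movie": 1, "was": 2, "great": 3, "terrible": 4}
token_table = rng.normal(size=(len(vocab), EMBEDDING_DIM))


def embed_sentence(sentence):
    """Split on whitespace, look up each known token, and average the vectors."""
    ids = [vocab[t] for t in sentence.lower().split() if t in vocab]
    if not ids:
        return np.zeros(EMBEDDING_DIM)
    return token_table[ids].mean(axis=0)


sentences = ["This movie was great", "This movie was terrible", "great"]
embeddings = np.stack([embed_sentence(s) for s in sentences])  # (3, 50)

# Dense(16, relu) followed by Dense(1), with random weights for illustration.
w1, b1 = rng.normal(size=(EMBEDDING_DIM, 16)), np.zeros(16)
w2, b2 = rng.normal(size=(16, 1)), np.zeros(1)

hidden = np.maximum(embeddings @ w1 + b1, 0.0)  # ReLU activation, (3, 16)
logits = hidden @ w2 + b2                       # one logit per sentence, (3, 1)

# Trainable parameters in the two Dense layers only.
dense_params = w1.size + b1.size + w2.size + b2.size
print(embeddings.shape, hidden.shape, logits.shape, dense_params)
```

Sentences of any length map to a vector of the same size, which is why the Dense layers can be stacked directly on top of the embedding.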

###Loss function and optimizer

- Since this is a binary classification problem and the model outputs logits (a single-unit layer with a linear activation), we'll use the binary_crossentropy loss function with from_logits=True.
- It measures the "distance" between probability distributions, or in our case, between the ground-truth distribution and the predictions.


In [None]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])
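
To make the role of `from_logits=True` concrete, here is a minimal NumPy sketch of binary cross-entropy computed straight from logits, compared with the textbook version computed from sigmoid probabilities. The helper names and the sample values are made up for illustration, and this is not the Keras implementation itself; it uses the numerically stable form `max(x, 0) - x*z + log(1 + exp(-|x|))`.

```python
import numpy as np


def bce_from_logits(labels, logits):
    """Mean binary cross-entropy computed directly from logits (stable form)."""
    labels = np.asarray(labels, dtype=np.float64)
    logits = np.asarray(logits, dtype=np.float64)
    per_example = (np.maximum(logits, 0.0) - logits * labels
                   + np.log1p(np.exp(-np.abs(logits))))
    return per_example.mean()


def bce_from_probabilities(labels, probs):
    """Textbook mean binary cross-entropy on probabilities in (0, 1)."""
    labels = np.asarray(labels, dtype=np.float64)
    probs = np.asarray(probs, dtype=np.float64)
    return -(labels * np.log(probs) + (1 - labels) * np.log(1 - probs)).mean()


labels = np.array([1, 0, 1, 0])
logits = np.array([2.0, -1.0, -0.5, 0.3])  # made-up model outputs
probs = 1.0 / (1.0 + np.exp(-logits))      # sigmoid turns logits into probabilities

loss_a = bce_from_logits(labels, logits)
loss_b = bce_from_probabilities(labels, probs)
print(loss_a, loss_b)  # the two values agree up to floating-point error
```

Working from logits avoids taking the log of a probability that has been rounded to exactly 0 or 1, which is one reason to leave the last layer without a sigmoid.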

###Train the model
- Train the model for 10 epochs in mini-batches of 512 samples.
- This is 10 iterations over all samples in the train_data dataset.
- While training, monitor the model's loss and accuracy on the 10,000 samples from the validation set:

In [None]:
history = model.fit(train_data.shuffle(10000).batch(512),
                    epochs=10,
                    validation_data=validation_data.batch(512),
                    verbose=1)
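
The numbers in the fit call above translate into a fixed count of gradient updates. A small sketch of that arithmetic, assuming the 15,000/10,000 split sizes from earlier and the default behaviour of `batch`, which keeps the final partial batch:

```python
import math

TRAIN_EXAMPLES = 15_000
VALIDATION_EXAMPLES = 10_000
BATCH_SIZE = 512
EPOCHS = 10

# Batches per pass over the training data; the last one is smaller.
steps_per_epoch = math.ceil(TRAIN_EXAMPLES / BATCH_SIZE)
last_batch_size = TRAIN_EXAMPLES - (steps_per_epoch - 1) * BATCH_SIZE

# Batches evaluated on the validation set at the end of each epoch.
validation_steps = math.ceil(VALIDATION_EXAMPLES / BATCH_SIZE)

# Total optimizer updates across the whole run.
total_updates = steps_per_epoch * EPOCHS

print(steps_per_epoch, last_batch_size, validation_steps, total_updates)
```

Note that `shuffle(10000)` uses a buffer smaller than the 15,000 training examples, so each epoch is shuffled only approximately; a buffer at least as large as the dataset would give a full shuffle.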

###Evaluate the model
- Two values are returned: loss (a number that represents our error; lower values are better) and accuracy.

In [None]:
results = model.evaluate(test_data.batch(512), verbose=2)

for name, value in zip(model.metrics_names, results):
  print("%s: %.3f" % (name, value))
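
Because the final layer outputs logits, a raw prediction is not yet a probability. The sketch below shows one way to turn logits into probabilities and class labels; the logit values are made up, and in the notebook they would come from something like `model.predict(...)`.

```python
import numpy as np


def sigmoid(x):
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


logits = np.array([-2.0, -0.1, 0.0, 0.4, 3.0])  # made-up values
probabilities = sigmoid(logits)

# Label 1 (positive review) when the probability exceeds 0.5.
predicted_labels = (probabilities > 0.5).astype(int)

# Thresholding the probability at 0.5 is the same as thresholding the logit at 0.
same_as_sign = np.array_equal(predicted_labels, (logits > 0).astype(int))

print(predicted_labels, same_as_sign)
```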


- This fairly naive approach achieves an accuracy of about 85%.
- With more advanced approaches, the model should get closer to 95%.
- Let's try another model from TensorFlow Hub and see what happens to the accuracy.

In [None]:
embedding = "https://tfhub.dev/google/universal-sentence-encoder/4"
hub_layer = hub.KerasLayer(embedding, input_shape=[], 
                           dtype=tf.string, trainable=True)
hub_layer(train_examples_batch[:3])

In [None]:
model = tf.keras.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(1))

model.summary()

In [None]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

In [None]:
history = model.fit(train_data.shuffle(10000).batch(512),
                    epochs=10,
                    validation_data=validation_data.batch(512),
                    verbose=1)

In [None]:
results = model.evaluate(test_data.batch(512), verbose=2)

for name, value in zip(model.metrics_names, results):
  print("%s: %.3f" % (name, value))

Yes! The accuracy increased to 87.6%.