<table ><tr><td valign='center' bgcolor='white'>
  <a href="https://web.facebook.com/DAT.KUSRC/" target="_blank"><img src="https://drive.google.com/uc?id=1dNBiKikzW1-osi6lleLOgSOKQ65IIfMC" height="50px"></a>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
</td><td valign='center' bgcolor='white'>
  <a href="https://www.ku.ac.th/" target="_blank"><img src="https://drive.google.com/uc?id=1ZfGOBmxAwg8SAhyseFziyinzxBGme78a" height="80px"></a>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
</td><td valign='center' bgcolor='white'>
<a href="https://www.tensorflow.org/" target="_blank"><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/11/TensorFlowLogo.svg/1200px-TensorFlowLogo.svg.png" height="80px"></a>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
</td><td valign='center' bgcolor='white'>
  <a href="https://mike.cpe.ku.ac.th/" target="_blank"><img src="https://drive.google.com/uc?id=1s6r3iG_Slpu_NSWqdt5zBp8Z9hV0-zh6" height="50px"></a>
</td></tr></table>

---

<center><h1>Text Classification with TensorFlow: Movie reviews</h1></center>

---

* Credit: https://www.tensorflow.org/tutorials/keras/text_classification_with_hub


In [0]:
print('Text Classification with TensorFlow: Movie reviews...')
print('  Brought to you by K.Toto@MikeLab.Net')

---

This notebook classifies movie reviews as <font color=ff00ff>positive</font> or <font color=ff00ff>negative</font> using the text of the review. This is an example of a <font color=ff00ff>binary</font> (or two-class) classification problem, an important and widely applicable kind of machine learning problem.

We will use the [IMDB dataset](https://www.tensorflow.org/api_docs/python/tf/keras/datasets/imdb) that contains the text of 50,000 movie reviews from the [Internet Movie Database](https://www.imdb.com/). The data has been split into 25,000 reviews for training and 25,000 reviews for testing. The training and testing sets are <font color=ff00ff>balanced</font>, meaning they contain an equal number of positive and negative reviews. 

This notebook uses [tf.keras](https://www.tensorflow.org/guide/keras), a high-level API to build and train models in TensorFlow, and [TensorFlow Hub](https://www.tensorflow.org/hub), a library and platform for transfer learning. For a more advanced text classification tutorial using `tf.keras`, see the [MLCC Text Classification Guide](https://developers.google.com/machine-learning/guides/text-classification/).

## Idea behind the text classification

Text classification has some unique challenges, so before we start coding, let's step through some of them.

<center><img src="https://drive.google.com/uc?id=1E_Zjj2bon6pf05USkdpzrRcptzdodtk3" height="300px"></center>

First of all, neural networks typically deal with <font color=ff00ff>numbers</font>, not text, when learning patterns that can be used for prediction or classification. In this case, we want the network to learn from the movie reviews whether each review is positive or negative. The first step is therefore to convert the words into numbers that represent them.


These numbers are then processed a little further, into vectors that help determine the sentiment of the words.
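As a minimal sketch of this first step, the snippet below maps each word of a tiny made-up corpus to an integer, with more frequent words receiving smaller numbers. The corpus and the `build_index` and `encode` helpers are hypothetical and only illustrate the idea; the IMDB dataset used later ships with its own, much larger dictionary.

```python
from collections import Counter

# A tiny, made-up corpus (an assumption for illustration only).
toy_reviews = [
    "this film was great",
    "this film was bad",
    "great film",
]

def build_index(texts):
    """Map each word to an integer rank: 1 = most frequent word."""
    counts = Counter(word for text in texts for word in text.split())
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ranked, start=1)}

def encode(text, index):
    """Replace every known word with its integer; unknown words become 0."""
    return [index.get(word, 0) for word in text.split()]

toy_index = build_index(toy_reviews)
print(toy_index)
print(encode("this film was great", toy_index))  # [3, 1, 4, 2]
```

Here "film" is the most frequent word and so receives the smallest integer, which mirrors the frequency-ranked convention described later for the IMDB dictionary.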

<center><img src="https://drive.google.com/uc?id=1LY8QqggesnWKrRxON1eJ2SBa15yvId-8" height="270px"></center>

We will build a deep neural network model, feed the movie reviews into it, and let the model infer whether each review expresses positive or negative sentiment.

<center><img src="https://drive.google.com/uc?id=15tw_npTw52qa2jkCUWM9gcv5kqBA5dln" height="400px"></center>


## Import and check the TensorFlow version

In [0]:
''' OLD Tensorflow import code that will be deleted soon
import tensorflow as tf
from tensorflow import keras

import numpy as np

print(tf.__version__)
'''

In [0]:
from __future__ import absolute_import, division, print_function, unicode_literals

import numpy as np

import tensorflow as tf

!pip install -q tensorflow-hub
!pip install -q tfds-nightly
import tensorflow_hub as hub
import tensorflow_datasets as tfds

from tensorflow import keras

print("Version: ", tf.__version__)
print("Keras version: ", keras.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("Hub version: ", hub.__version__)
print("GPU is", "available" if tf.config.experimental.list_physical_devices("GPU") else "NOT AVAILABLE")

## Download the IMDB dataset

The IMDB dataset comes packaged with TensorFlow. It has already been preprocessed so that the reviews (sequences of words) have been converted to sequences of integers, where each integer represents a specific word in a dictionary.

In that dictionary, words are ranked by frequency: the most common words get the lowest integers, and less common words get higher ones.

The following code downloads the IMDB dataset.

In [0]:
imdb = keras.datasets.imdb
(train_data, train_label), (test_data, test_label) = imdb.load_data(num_words=10000)

The argument `num_words=10000` keeps only the 10,000 most frequently occurring words in the training data, i.e. the 10,000 words used most often across all the reviews. Rarer words are discarded (each is replaced by a single "unknown" index) to keep the size of the data manageable.
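To make the effect of `num_words` concrete, here is a small sketch of the same idea in plain Python. It assumes the convention used by the Keras loader, where any index at or above the cutoff is replaced by a single out-of-vocabulary index (2 by default); the `limit_vocabulary` helper and the sample sequence are made up for illustration.

```python
OOV_INDEX = 2  # index Keras uses by default for out-of-vocabulary words

def limit_vocabulary(sequence, num_words, oov_index=OOV_INDEX):
    """Keep indices below num_words; replace rarer words with oov_index."""
    return [i if i < num_words else oov_index for i in sequence]

sample = [1, 14, 22, 12045, 530, 88000]  # made-up encoded review
limited = limit_vocabulary(sample, num_words=10000)
print(limited)  # [1, 14, 22, 2, 530, 2]
```

After this step every index is smaller than 10,000, which is what allows the model later on to use a fixed vocabulary size.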

After loading, we have our <font color=ff00ff>training</font> data and labels, as well as our <font color=ff00ff>test</font> data and labels.

First we will look at our training data. We can see that we have a total of 25,000 items of data and 25,000 labels describing them.

In [0]:
len(train_data), len(train_label)

## Explore the data

The dataset comes preprocessed; each example is an array of integers representing the words of the movie review. 

Each label is an integer value of either 0 or 1, where <font color=ff00ff>0</font> indicates a <font color=ff00ff>negative</font> review and <font color=ff00ff>1</font> a <font color=ff00ff>positive</font> review.

In [0]:
print(f'Training entries: {len(train_data)}, labels: {len(train_label)}')

In [0]:
for i in range(10):
  if train_label[i] == 1:
    print('pos: ', end='')
  else:
    print('neg: ', end='')
  print(train_data[i])

Each review has been encoded as an <font color=ff00ff>array of word indices</font>. A review starts with the integer <font color=00ffff>**1**</font>, which marks the <font color=ffff00>\<start\></font> of the review. For example, in the first review the indices <font color=00ffff>**14**</font> and <font color=00ffff>**22**</font> correspond to the words <font color=ffff00>**this**</font> and <font color=ffff00>**film**</font>. The lengths of the reviews vary. For example, the lengths of the first and second reviews are as follows.

In [0]:
len(train_data[0]), len(train_data[1])

### Convert the integers back to words

It may be useful to know how to convert integers back to text. Here, we'll create a helper function that queries a dictionary object containing the integer-to-string mapping.

In [0]:
# a dictionary mapping words to an integer index
word_index = imdb.get_word_index()

for k,v in word_index.items():
  if v <= 4:
    print(f'{k}: {v} ', end=' ')
print('| this:', word_index['this'], ' film:', word_index['film'])

# the first indices are reserved
# see https://stackoverflow.com/questions/42821330/restore-original-text-from-keras-s-imdb-dataset/44891281
word_index = {k:(v+3) for k,v in word_index.items()} 
word_index['<PAD>'] = 0
word_index['<START>'] = 1
word_index['<UNK>'] = 2 # unknown
word_index['<UNUSED>'] = 3

for k,v in word_index.items():
  if v <= 4:
    print(f'{k}: {v} ', end=' ')
print('| this:', word_index['this'], ' film:', word_index['film'])

In [0]:
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

def decode_review(text):
  return ' '.join([reverse_word_index.get(i, '?') for i in text])

for i in range(10):
  if train_label[i] == 1:
    print('pos: ', end='')
  else:
    print('neg: ', end='')
  print(decode_review(train_data[i]))

In [0]:
print(train_data[3])

## Prepare the data

The reviews (<font color=ff00ff>the arrays of integers</font>) must be converted to <font color=00ffff>tensors</font> before being fed into the neural network. This conversion can be done in a couple of ways:
*   <font color=ffff00>One-hot-encode</font> the arrays to convert them into vectors of 0s and 1s. For example, the sequence [14, 22] would become a 10,000-dimensional vector that is all zeros except for the indices 14 and 22, which are ones. Then, make the first layer in our network a <font color=ff00ff>Dense layer</font>, which can handle floating-point vector data. This approach is memory intensive, though, requiring a matrix of size `num_words * num_reviews`.
*   Alternatively, we can <font color=ffff00>pad the arrays</font> so they all have the same length, then create an integer tensor of shape `num_examples * max_length`. We can use an <font color=ff00ff>embedding layer</font> capable of handling this shape as the first layer in our network.
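
The two options can be compared on a toy example. The sketch below uses plain NumPy rather than Keras; the `multi_hot` and `pad_post` helpers and the small sizes are assumptions chosen for readability. Note that `pad_post` cuts over-long reviews from the end, whereas Keras `pad_sequences` truncates from the front by default.

```python
import numpy as np

def multi_hot(sequences, num_words):
    """Option 1: one row per review, 1.0 at every word index that occurs."""
    out = np.zeros((len(sequences), num_words), dtype=np.float32)
    for row, seq in enumerate(sequences):
        out[row, seq] = 1.0
    return out

def pad_post(sequences, max_length, pad_value=0):
    """Option 2: right-pad (or cut) every review to max_length integers."""
    out = np.full((len(sequences), max_length), pad_value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        trimmed = seq[:max_length]
        out[row, :len(trimmed)] = trimmed
    return out

toy = [[14, 22], [1, 5, 7, 5]]          # made-up encoded reviews
encoded = multi_hot(toy, num_words=30)
padded = pad_post(toy, max_length=6)

print(encoded.shape)   # (2, 30)
print(padded)

# Rough memory cost for the full training set at 4 bytes per value.
print(25000 * 10000 * 4 / 1e9, "GB for multi-hot vectors")
print(25000 * 256 * 4 / 1e6, "MB for padded sequences")
```

The last two lines show why padding is the cheaper choice here: the padded integer tensor is far smaller than the matrix of 10,000-dimensional vectors.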

Let's look again at the lengths of the first two movie reviews:

In [0]:
len(train_data[0]), len(train_data[1])

Here we will go with the second approach. Since the movie reviews must all have the same length, we will use the `pad_sequences()` function to standardize the lengths:

In [0]:
train_data = keras.preprocessing.sequence.pad_sequences(train_data, 
                                                        value=word_index['<PAD>'], 
                                                        padding='post', 
                                                        maxlen=256)
test_data = keras.preprocessing.sequence.pad_sequences(test_data, 
                                                       value=word_index['<PAD>'], 
                                                       padding='post', 
                                                       maxlen=256)

Now, let's look at the lengths of the examples after padding:

In [0]:
len(train_data[0]), len(train_data[1])

And inspect the (now padded) first review:

In [0]:
print(train_data[0])

In [0]:
print(decode_review(train_data[0]))

## Build the model

The neural network is created by <font color=ffff00>stacking the layers</font> -- this requires two main architectural decisions:
*   How many <font color=ffff00>*layers*</font> to use in the model?
*   How many <font color=ffff00>*hidden units*</font> to use for each layer?

In this example, the input data consists of an array of word-indices. The labels to predict are either 0 or 1. Let's build a model for this problem.

In [0]:
# input shape is the vocabulary count used for the movie reviews (10,000 words)
vocab_size = 10000

model = keras.Sequential()
model.add(keras.layers.Embedding(vocab_size, 16))
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(16, activation=tf.nn.relu))
model.add(keras.layers.Dense(1, activation=tf.nn.sigmoid))

model.summary() 

The layers are stacked sequentially to build the classifier:
1.   The first layer is an <font color=00ffff>`Embedding`</font> layer. This layer takes the integer-encoded vocabulary and looks up the embedding vector for each word-index. These vectors are learned as the model trains. The vectors add a dimension to the output array. The resulting dimensions are: (`batch`, `sequence`, `embedding`).
2.   Next, the <font color=00ffff>`GlobalAveragePooling1D`</font> layer returns a fixed-length output vector for each example by averaging over the sequence dimension. This allows the model to handle input of variable length in the simplest way possible.
3.   This fixed-length output vector is piped through a fully-connected (<font color=00ffff>`Dense`</font>) layer with 16 hidden units.
4.   The last layer is densely connected with a single output node. Using the `sigmoid` activation function, this value is a float between 0 and 1, representing a probability, or confidence level.
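
To see how the shapes change from layer to layer, the following NumPy sketch pushes a random batch through the same four steps. The weights are random placeholders rather than trained values, and the variable names are made up for illustration; it mirrors the structure of the model above, not the Keras implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, hidden_units = 10000, 16, 16
batch, sequence = 4, 256

# Random stand-ins for the trainable weights (untrained, illustration only).
embedding_table = rng.normal(size=(vocab_size, embed_dim))
w1, b1 = rng.normal(size=(embed_dim, hidden_units)), np.zeros(hidden_units)
w2, b2 = rng.normal(size=(hidden_units, 1)), np.zeros(1)

word_ids = rng.integers(0, vocab_size, size=(batch, sequence))

embedded = embedding_table[word_ids]         # 1. lookup  -> (4, 256, 16)
pooled = embedded.mean(axis=1)               # 2. average -> (4, 16)
hidden = np.maximum(pooled @ w1 + b1, 0.0)   # 3. Dense + ReLU -> (4, 16)
logits = hidden @ w2 + b2                    # 4. Dense -> (4, 1)
probs = 1.0 / (1.0 + np.exp(-logits))        #    sigmoid squashes to (0, 1)

for name, arr in [("embedded", embedded), ("pooled", pooled),
                  ("hidden", hidden), ("probs", probs)]:
    print(name, arr.shape)
```

The averaging step is what removes the sequence dimension, so everything after it works on one fixed-size vector per review.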



**Hidden units**

The above model has two intermediate or 'hidden' layers between the input and output. The number of outputs (units, nodes, or neurons) is the dimension of the representational space of the layer. In other words, it is the amount of freedom the network is allowed when learning an internal representation.

If a model has more hidden units (a higher-dimensional representation space) and/or more layers, the network can learn more complex representations. However, this makes the network more computationally expensive and may lead to learning unwanted patterns: patterns that improve performance on the training data but not on the test data. This is called <font color=ff00ff>*overfitting*</font>.
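
One way to make the cost of extra units tangible is to count trainable parameters. The short calculation below does this by hand for the layer sizes used in this notebook; the `dense_params` helper is a made-up name, and the pooling layer is left out because it has no weights.

```python
def dense_params(inputs, units):
    """Weights plus one bias per unit in a fully connected layer."""
    return inputs * units + units

vocab_size, embed_dim = 10000, 16

embedding = vocab_size * embed_dim          # one 16-d vector per word
hidden = dense_params(embed_dim, 16)        # 16 inputs -> 16 hidden units
output = dense_params(16, 1)                # 16 inputs -> 1 output unit
total = embedding + hidden + output

print(embedding, hidden, output, total)     # 160000 272 17 160289

# Doubling the hidden units to 32 doubles that layer's parameter count.
print(dense_params(embed_dim, 32))          # 544
```

Almost all of the parameters sit in the embedding table, so in this model the vocabulary size and embedding dimension dominate the cost far more than the hidden layer does.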

### Loss function and optimizer

A model needs a <font color=ff00ff>loss function</font> and an <font color=ff00ff>optimizer</font> for training. Since this is a binary classification problem and the model outputs a probability (a single-unit layer with a sigmoid activation), we'll use the <font color=ffff00>`binary_crossentropy`</font> loss function. 

This isn't the only choice for a loss function; you could, for instance, choose `mean_squared_error`. But, generally, `binary_crossentropy` is better for dealing with probabilities: it measures the "distance" between probability distributions, or in our case, between the ground-truth distribution and the predictions.
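
To illustrate the difference, the sketch below computes both losses by hand with NumPy for a few made-up predictions. The function names and sample values are assumptions for illustration, and the small `eps` clip is a common numerical safeguard rather than something specified in this notebook.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    p = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

def mean_squared_error(y_true, y_pred):
    """Mean of (y - p)^2 over all examples."""
    return float(np.mean((y_true - y_pred) ** 2))

y_true = np.array([1.0, 0.0, 1.0])
confident_right = np.array([0.9, 0.1, 0.9])
confident_wrong = np.array([0.1, 0.9, 0.1])

for name, pred in [("right", confident_right), ("wrong", confident_wrong)]:
    print(name,
          round(binary_crossentropy(y_true, pred), 3),
          round(mean_squared_error(y_true, pred), 3))
```

For the confidently wrong predictions the cross-entropy is about 2.3 while the squared error stays below 1, so cross-entropy penalizes confident mistakes much more heavily.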

Now, configure the model to use an optimizer and a loss function:

In [0]:
model.compile(optimizer=tf.optimizers.Adam(),
              loss='binary_crossentropy',
              metrics=['accuracy'])

## Create a validation set


![alt text](https://drive.google.com/uc?id=1gfkidJF3a-VxKAjcXzMEcF0Pl3kFyhb5)

When training, we want to check the accuracy of the model on data it hasn't seen before. So, we create a <font color=ff00ff>*validation* set</font> by setting apart 10,000 examples from the original 25,000 samples of the training data.


Why not use the <font color=ff00ff>testing set</font> now? Our goal is to develop and tune our model using only the training data, and then use the test data just once to evaluate the accuracy.

In [0]:
x_val = train_data[:10000]
partial_x_train = train_data[10000:]

y_val = train_label[:10000]
partial_y_train = train_label[10000:]

## Train the model

Train the model for 40 epochs in mini-batches of 512 samples. This is 40 iterations over all samples in the `partial_x_train` and `partial_y_train` tensors. While training, monitor the model's loss and accuracy on the 10,000 samples from the validation set:

In [0]:
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=40,
                    batch_size=512,
                    validation_data=(x_val, y_val),
                    verbose=1)

## Evaluate the model

Now let's see how the model performs. Two values are returned: <font color=ff00ff>loss</font> (a number that represents the error; lower is better) and <font color=ff00ff>accuracy</font>.

In [0]:
results = model.evaluate(test_data, test_label)
print(results)

We can see that this fairly naive approach achieves an accuracy of about 87%. With more advanced approaches, the model could get closer to 95%.
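
For reference, accuracy here is the fraction of reviews whose predicted probability falls on the correct side of a threshold. The sketch below assumes the usual 0.5 cutoff for a sigmoid output; the helper name and the sample numbers are made up for illustration.

```python
def binary_accuracy(labels, probabilities, threshold=0.5):
    """Fraction of examples where (probability > threshold) matches the label."""
    predicted = [1 if p > threshold else 0 for p in probabilities]
    correct = sum(int(pred == label) for pred, label in zip(predicted, labels))
    return correct / len(labels)

toy_labels = [1, 0, 1, 1, 0]
toy_probs = [0.92, 0.30, 0.41, 0.77, 0.65]   # made-up sigmoid outputs

print(binary_accuracy(toy_labels, toy_probs))  # 3 of 5 correct -> 0.6
```

Accuracy only records which side of the cutoff a prediction lands on, which is why the loss is reported alongside it: the loss also reflects how confident each prediction was.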

### Create a graph of accuracy and loss over time

`model.fit()` returns a `History` object that contains a dictionary with everything that happened during training:

In [0]:
history_dict = history.history
history_dict.keys() 

There are four entries: one for each monitored metric during training and validation. We can use these to plot the training and validation loss for comparison, as well as the training and validation accuracy.

In [0]:
import matplotlib.pyplot as plt

acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(acc)+1)

# "bo" is for "blue dot"
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'g', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.show()

In [0]:
plt.clf() # clear figure
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'g', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.show()

In these plots, the blue dots represent the training loss and accuracy, and the solid green lines are the validation loss and accuracy.

Notice that the training loss *decreases* with each epoch and the training accuracy *increases* with each epoch. This is expected when using gradient descent optimization: it should minimize the desired quantity on every iteration.

This isn't the case for the validation loss and accuracy: they seem to peak after about fifteen to twenty epochs. This is an example of overfitting: the model performs better on the training data than it does on data it has never seen before. After this point, the model over-optimizes and learns representations <font color=ff00ff>*specific*</font> to the training data that do not <font color=ff00ff>*generalize*</font> to test data.

For this particular case, we could prevent overfitting by simply stopping the training after fifteen to twenty epochs. This can be done automatically with a callback (left for the reader to explore!).
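
The stopping rule itself is simple enough to sketch in plain Python. The function below is an illustrative stand-in, not the Keras implementation: it scans a made-up list of validation losses and reports the epoch at which training would stop once the loss has failed to improve for `patience` epochs in a row.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch), both 1-based.

    Training stops at the first epoch where the validation loss has not
    improved on its best value for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch  # never triggered: ran all epochs

# Made-up validation losses that bottom out at epoch 4.
toy_val_loss = [0.69, 0.55, 0.42, 0.35, 0.36, 0.38, 0.41]
print(early_stopping_epoch(toy_val_loss, patience=2))  # (6, 4)
```

In Keras, the corresponding callback is `keras.callbacks.EarlyStopping`. Passing something like `keras.callbacks.EarlyStopping(monitor='val_loss', patience=2, restore_best_weights=True)` in the `callbacks` list of `model.fit` applies a comparable rule during training; the `patience` value shown is only an example.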

In [0]:
# Let's also test it against a couple of planted reviews.
# One of them is a bunch of random words, and the other
# is the biased review we created.

rand_review = np.random.randint(10000, size=256)
biased_review = np.full(256, 530) # 530 is the word 'brilliant'
test_data = np.append(test_data, [rand_review], axis=0)
test_data = np.append(test_data, [biased_review], axis=0)

# And here is the prediction.
# It's an array of predictions for the test data at the particular index,
# so let's manually look at the first few, as well as the ones we planted
# to the end.
print(model.predict(test_data))
print(decode_review(test_data[-2]))
print(decode_review(test_data[-1]))