
This lab is adapted from the TensorFlow tutorial for text classification with BERT:

https://colab.research.google.com/github/tensorflow/text/blob/master/docs/tutorials/classify_text_with_bert.ipynb#scrollTo=EqL7ihkN_862


# Classify text with BERT

Steps

- Load a BERT model from TensorFlow Hub
- Load the movie review data
- Build a model by combining BERT with a classifier
- Fine-tune BERT to create a model
- Save the model and use it to classify texts


## About BERT
The BERT family of models uses the Transformer encoder architecture to process each token of input text in the full context of all tokens before and after. This is the reason for the name: Bidirectional Encoder Representations from Transformers. 

BERT models are usually pre-trained on a large corpus of text, then fine-tuned for specific tasks. In this lab, we take one of the small pre-trained BERT models and fine-tune it on the movie review corpus.


## Setup


In [None]:
# A dependency of the preprocessing for BERT inputs
!pip install -q -U "tensorflow-text==2.8.*"

We will use the AdamW optimizer, which is widely used for fine-tuning BERT, from [tensorflow/models](https://github.com/tensorflow/models).

In [None]:
!pip install -q "tf-models-official==2.7.0"

In [None]:
!pip install "numpy==1.21"

In [None]:
# import libraries
import os
import shutil

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text
from official.nlp import optimization  # to create AdamW optimizer

import matplotlib.pyplot as plt

tf.get_logger().setLevel('ERROR')

***
### Load the movie review dataset
We will use the NLTK movie review dataset that we installed in a previous chapter, building a `tf.data.Dataset` from the movie review directory.

***

In [None]:
# load the movie review dataset
batch_size = 32
AUTOTUNE = tf.data.AUTOTUNE

# each class subdirectory of ./movie_reviews becomes one label
raw_ds = tf.keras.utils.text_dataset_from_directory('./movie_reviews',
                                                    batch_size=batch_size)
print(raw_ds)
print(len(raw_ds))  # number of batches, not number of reviews

class_names = raw_ds.class_names
print(class_names)
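
Before loading, it can help to confirm that the directory has the layout `text_dataset_from_directory` expects: one subdirectory per class, each holding `.txt` files. The helper below is a minimal sketch using only the standard library; the function name `count_files_per_class` is made up for this illustration, and the path `./movie_reviews` is taken from the cell above.

```python
from pathlib import Path

def count_files_per_class(root):
    """Return {class_name: number of .txt files} for each subdirectory of root."""
    counts = {}
    for class_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        counts[class_dir.name] = sum(1 for _ in class_dir.glob('*.txt'))
    return counts

# Only runs the check if the directory from the lab is present.
if Path('./movie_reviews').is_dir():
    print(count_files_per_class('./movie_reviews'))
```

The subdirectory names returned here should match the `class_names` printed above, since class names are taken from the subdirectories in sorted order.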


***
## Split the data into training, validation and test sets
***

In [None]:
import math

def get_dataset_partitions_tf(ds, ds_size, train_split=0.8, val_split=0.1, test_split=0.1, shuffle=True, shuffle_size=1000):
    # compare with a tolerance: floating-point sums of fractions are not always exactly 1
    assert math.isclose(train_split + val_split + test_split, 1.0)
    
    if shuffle:
        # Fix the seed and disable reshuffling so the batch order is the same on
        # every pass; otherwise take/skip would select different batches each
        # epoch and the three splits would overlap
        ds = ds.shuffle(shuffle_size, seed=42, reshuffle_each_iteration=False)
    
    train_size = int(train_split * ds_size)
    val_size = int(val_split * ds_size)
    
    # ds is already batched, so take/skip count batches rather than single reviews
    train_ds = ds.take(train_size)    
    val_ds = ds.skip(train_size).take(val_size)
    test_ds = ds.skip(train_size).skip(val_size)
    
    return train_ds, val_ds, test_ds

train_ds,val_ds,test_ds = get_dataset_partitions_tf(raw_ds,len(raw_ds))

# check the sizes (in batches) of the train, test, and val sets
print(train_ds)
print(len(train_ds))
print(len(test_ds))
print(len(val_ds))
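
The sizes printed above are batch counts. The arithmetic behind them can be sketched in plain Python; `partition_sizes` is a hypothetical helper for illustration, and the figure of 63 batches assumes 2,000 reviews at 32 per batch.

```python
def partition_sizes(num_batches, train_split=0.8, val_split=0.1):
    """Mirror the take/skip arithmetic: the test set gets whatever is left over."""
    train_size = int(train_split * num_batches)
    val_size = int(val_split * num_batches)
    test_size = num_batches - train_size - val_size
    return train_size, val_size, test_size

# Assumed: 2,000 reviews / 32 per batch, rounded up, gives 63 batches
print(partition_sizes(63))  # (50, 6, 7)
```

Because `int()` rounds down, the test set absorbs the remainder and can end up slightly larger than the validation set.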



***
## Check a few examples of data and labels
***

In [None]:
for text_batch, label_batch in train_ds.take(1):
  for i in range(3):
    print(f'Review: {text_batch.numpy()[i]}')
    label = label_batch.numpy()[i]
    print(f'Label : {label} ({class_names[label]})')

***
## Loading models from TensorFlow Hub

We will use one of the smaller BERT models so that it runs in a reasonable amount of time on a desktop computer: `small_bert/bert_en_uncased_L-4_H-512_A-8/1`.
* L=4 hidden layers (that is, Transformer blocks)
* H=512 hidden size
* A=8 attention heads

This model was trained on Wikipedia and BooksCorpus.
Many larger and smaller models can be downloaded from TensorFlow Hub.
***

In [None]:
# load a BERT model
bert_model_name = 'small_bert/bert_en_uncased_L-4_H-512_A-8'  

map_name_to_handle = {
    'small_bert/bert_en_uncased_L-4_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1',
}

map_model_to_preprocess = {
    'small_bert/bert_en_uncased_L-4_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
}

tfhub_handle_encoder = map_name_to_handle[bert_model_name]
tfhub_handle_preprocess = map_model_to_preprocess[bert_model_name]

print(f'BERT model selected           : {tfhub_handle_encoder}')
print(f'Preprocess model auto-selected: {tfhub_handle_preprocess}')
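
The model name encodes its size. As a small illustration, the sketch below pulls the L, H, and A values out of a name with a regular expression; `parse_bert_name` is a hypothetical helper, not part of TensorFlow Hub.

```python
import re

def parse_bert_name(name):
    """Extract layer count, hidden size, and attention heads from a BERT model name."""
    match = re.search(r'L-(\d+)_H-(\d+)_A-(\d+)', name)
    if match is None:
        raise ValueError(f'no L/H/A fields found in {name!r}')
    layers, hidden, heads = (int(g) for g in match.groups())
    return {
        'layers': layers,
        'hidden_size': hidden,
        'attention_heads': heads,
        # each attention head works on an equal slice of the hidden vector
        'head_size': hidden // heads,
    }

print(parse_bert_name('small_bert/bert_en_uncased_L-4_H-512_A-8'))
```

For the model used here this gives 4 layers, a hidden size of 512, and 8 heads of size 64 each.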

## Preprocessing

Text inputs are transformed to numeric token ids and arranged in several Tensors before being input to BERT. 

TensorFlow Hub provides a matching preprocessing model for each of the BERT models discussed above.

We will load the preprocessing model into a [hub.KerasLayer](https://www.tensorflow.org/hub/api_docs/python/hub/KerasLayer) to compose the fine-tuned model. 

In [None]:
bert_preprocess_model = hub.KerasLayer(tfhub_handle_preprocess)

Let's try the preprocessing model on some text and see the output:


In [None]:
test_text = ['sure is a great movie. i like it']
print(test_text)

text_preprocessed = bert_preprocess_model(test_text)

print(f'Keys       : {list(text_preprocessed.keys())}')
print(f'Shape      : {text_preprocessed["input_word_ids"].shape}')
print(f'Word Ids   : {text_preprocessed["input_word_ids"][0, :12]}')
print(f'Input Mask : {text_preprocessed["input_mask"][0, :12]}')
print(f'Type Ids   : {text_preprocessed["input_type_ids"][0, :12]}')

As you can see, you now have the 3 outputs from the preprocessing that a BERT model would use (`input_word_ids`, `input_mask` and `input_type_ids`).

Since this text preprocessor is a TensorFlow model, it can be included in the BERT model directly.
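
To make the three outputs concrete, here is a toy re-creation in plain Python. It is only a sketch: the whitespace tokenizer and the `TOY_VOCAB` word ids are invented for illustration (the real model uses WordPiece tokenization and a much larger vocabulary), while the special-token ids are the ones used by the standard uncased English BERT vocabulary. The real preprocessing model pads to a longer sequence length than the 12 used here.

```python
PAD_ID, UNK_ID, CLS_ID, SEP_ID = 0, 100, 101, 102  # special tokens in the uncased BERT vocab

# Hypothetical mini-vocabulary; these word ids are made up for this example.
TOY_VOCAB = {'great': 1001, 'movie': 1002, 'i': 1003, 'like': 1004, 'it': 1005}

def toy_preprocess(sentence, seq_length=12):
    """Build BERT-style inputs for a single sentence using a whitespace tokenizer."""
    tokens = sentence.lower().split()[:seq_length - 2]  # leave room for [CLS] and [SEP]
    ids = [CLS_ID] + [TOY_VOCAB.get(t, UNK_ID) for t in tokens] + [SEP_ID]
    padding = seq_length - len(ids)
    return {
        'input_word_ids': ids + [PAD_ID] * padding,      # token ids, padded with 0
        'input_mask': [1] * len(ids) + [0] * padding,    # 1 = real token, 0 = padding
        'input_type_ids': [0] * seq_length,              # all 0 for single-sentence input
    }

for key, value in toy_preprocess('great movie i like it').items():
    print(f'{key:<15}: {value}')
```

The mask tells the encoder which positions to attend to, and the type ids would only differ from 0 for the second segment of a sentence-pair input.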

## Using the BERT model

Before putting BERT into our model, we can look at its outputs. Load it from TF Hub and see the returned values.

In [None]:
bert_model = hub.KerasLayer(tfhub_handle_encoder)

In [None]:
bert_results = bert_model(text_preprocessed)

print(f'Loaded BERT: {tfhub_handle_encoder}')
print(f'Pooled Outputs Shape:{bert_results["pooled_output"].shape}')
print(f'Pooled Outputs Values:{bert_results["pooled_output"][0, :12]}')
print(f'Sequence Outputs Shape:{bert_results["sequence_output"].shape}')
print(f'Sequence Outputs Values:{bert_results["sequence_output"][0, :12]}')

The BERT models return a map with 3 important keys: `pooled_output`, `sequence_output`, `encoder_outputs`:

- `pooled_output` represents each input sequence as a whole. The shape is `[batch_size, H]`. You can think of this as an embedding for the entire text.
- `sequence_output` represents each input token in the context. The shape is `[batch_size, seq_length, H]`. You can think of this as a contextual embedding for every token in the text.
- `encoder_outputs` are the intermediate activations of the `L` Transformer blocks. `outputs["encoder_outputs"][i]` is a Tensor of shape `[batch_size, seq_length, H]` (H is 512 for the model used here) with the outputs of the i-th Transformer block, for `0 <= i < L`. The last value of the list is equal to `sequence_output`.

For the fine-tuning we are going to use the `pooled_output` array.

## Define the model

Here we create a very simple fine-tuned model from the preprocessing model, the selected BERT model, a Dropout layer, and one Dense layer. Increasing the Dropout rate (0.1 here) applies stronger regularization, which can help if the model overfits.


In [None]:
def build_classifier_model():
    # raw strings go in; preprocessing happens inside the model
    text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
    preprocessing_layer = hub.KerasLayer(tfhub_handle_preprocess, name='preprocessing')
    encoder_inputs = preprocessing_layer(text_input)
    # trainable=True so that the BERT weights are updated during fine-tuning
    encoder = hub.KerasLayer(tfhub_handle_encoder, trainable=True, name='BERT_encoder')
    outputs = encoder(encoder_inputs)
    # one embedding per review, shape [batch_size, H]
    net = outputs['pooled_output']
    net = tf.keras.layers.Dropout(0.1)(net)
    # single output unit with no activation: the model returns a logit
    net = tf.keras.layers.Dense(1, activation=None, name='classifier')(net)
    return tf.keras.Model(text_input, net)

Let's check that the model runs on raw text, with the preprocessing model built in.

In [None]:
classifier_model = build_classifier_model()
bert_raw_result = classifier_model(tf.constant(test_text))
print(tf.sigmoid(bert_raw_result))

The output is meaningless because the model has not been trained yet, but it verifies that the model runs end to end, preprocessing included.

Let's take a look at the model's structure.

In [None]:
tf.keras.utils.plot_model(classifier_model)

***
## Model training
We now have all the pieces to train a model, including the preprocessing module, BERT encoder, data, and classifier.
***

### Loss function

Since this is a binary classification problem (that is, there are only two outcomes, "positive" and "negative"), we'll use the `losses.BinaryCrossentropy` loss function. A problem with several possible outcomes, such as an intent identification problem, would use categorical cross-entropy.
Similarly, the metric should be "binary accuracy" rather than "accuracy".
Cross-entropy estimates the loss by scoring the average difference between the actual and predicted probability distributions. Because the final Dense layer has no activation, the model outputs logits, so the loss is created with `from_logits=True`.


In [None]:
loss = tf.keras.losses.BinaryCrossentropy(from_logits=True)

# the model outputs logits, so threshold at 0.0 (equivalent to a probability of 0.5)
metrics = tf.metrics.BinaryAccuracy(threshold=0.0)
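
To see what binary cross-entropy on logits computes, here is a minimal plain-Python sketch for a single example. It is written for illustration only (the helper names are made up, and this is not the Keras implementation), using a commonly used numerically stable form of the formula.

```python
import math

def sigmoid(logit):
    """Map a raw model output (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))

def bce_from_logit(label, logit):
    """Binary cross-entropy for one example, computed stably from the logit."""
    # algebraically equal to -(y*log(p) + (1-y)*log(1-p)) with p = sigmoid(logit)
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

# (label, logit) pairs: confident and right, undecided, confident and wrong
for label, logit in [(1, 2.0), (1, 0.0), (0, 2.0)]:
    print(f'label={label} logit={logit:+.1f} '
          f'p={sigmoid(logit):.3f} loss={bce_from_logit(label, logit):.3f}')
```

The loss is small when the predicted probability agrees with the label and grows quickly when the model is confidently wrong. A logit of 0 corresponds to a probability of exactly 0.5.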

### Optimizer

For fine-tuning, we'll use the same optimizer that BERT was originally trained with: "Adaptive Moments" (Adam). The variant used here, AdamW, minimizes the prediction loss and regularizes by weight decay that is applied separately from the moment estimates.

In line with the BERT paper, the initial learning rate for fine-tuning is small: the paper recommends trying 5e-5, 3e-5, and 2e-5 and keeping the best. The learning rate is warmed up linearly over the first 10% of training steps and then decays.

In [None]:
epochs = 15

steps_per_epoch = tf.data.experimental.cardinality(train_ds).numpy()
print(steps_per_epoch)

num_train_steps = steps_per_epoch * epochs
# a linear warmup phase over the first 10%
num_warmup_steps = int(0.1*num_train_steps)

init_lr = 3e-5 

optimizer = optimization.create_optimizer(init_lr=init_lr,
                                          num_train_steps=num_train_steps,
                                          num_warmup_steps=num_warmup_steps,
                                          optimizer_type='adamw')
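
The shape of a warmup-then-decay schedule can be sketched in plain Python. This is a simplified illustration that assumes a linear warmup followed by a linear decay to zero; the exact schedule built by `create_optimizer` may differ in detail. The step counts (750 total, 75 warmup) are illustrative and assume 50 training batches per epoch for 15 epochs.

```python
def learning_rate(step, init_lr, num_train_steps, num_warmup_steps):
    """Linear warmup to init_lr, then linear decay to zero (simplified sketch)."""
    if step < num_warmup_steps:
        return init_lr * step / num_warmup_steps
    remaining = max(0, num_train_steps - step)
    return init_lr * remaining / (num_train_steps - num_warmup_steps)

# Illustrative values only: 50 batches/epoch * 15 epochs, 10% warmup
for step in (0, 25, 75, 400, 750):
    print(f'step {step:>3}: lr = {learning_rate(step, 3e-5, 750, 75):.2e}')
```

The warmup avoids large updates to the pre-trained weights while the newly added classifier layer is still random.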

### Compile the BERT model and training

Using the `classifier_model` you created earlier, you can compile the model with the loss, metric and optimizer, and take a look at the summary.
It's a good idea to check the model before starting a lengthy training process to make sure the model is as expected.

In [None]:
classifier_model.compile(optimizer=optimizer,
                         loss=loss,
                         metrics=metrics)
classifier_model.summary()

Training time will vary depending on the complexity of the selected BERT model.
For this model, dataset, and number of epochs, the training should take less than an hour on a CPU.
Setting `verbose=2` prints one line of feedback per epoch (`verbose=1` shows a progress bar instead). This is useful for spotting that things are not going as expected so that the training can be stopped.

In [None]:
print(f'Training model with {tfhub_handle_encoder}')

history = classifier_model.fit(x=train_ds,
                               validation_data=val_ds,
                               verbose = 2,
                               epochs=epochs)

### Evaluate the model

Let's see how the model performs on the test data. Two values will be returned -- loss and accuracy.

In [None]:
loss, accuracy = classifier_model.evaluate(test_ds)

print(f'Loss: {loss}')
print(f'Accuracy: {accuracy}')

### Plot the accuracy and loss over time

With the `History` object returned by `model.fit()`, you can plot the training and validation loss for comparison, as well as the training and validation accuracy:

In [None]:
import matplotlib.pyplot as plt
# %matplotlib inline

history_dict = history.history
print(history_dict.keys())

acc = history_dict['binary_accuracy']
val_acc = history_dict['val_binary_accuracy']
train_loss = history_dict['loss']
val_loss = history_dict['val_loss']

# use a new name so the integer `epochs` defined earlier is not overwritten
epoch_range = range(1, len(acc) + 1)
fig = plt.figure(figsize=(10, 6))

plt.subplot(2, 1, 1)
# 'r' with linestyle="dashed" draws a dashed red line
plt.plot(epoch_range, train_loss, 'r', linestyle="dashed", label='Training loss')
# 'b' draws a solid blue line
plt.plot(epoch_range, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
# plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.subplot(2, 1, 2)
plt.plot(epoch_range, acc, 'r', linestyle="dashed", label='Training acc')
plt.plot(epoch_range, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
# adjust the spacing once both subplots exist
fig.tight_layout()
plt.show()

***
In this plot, the dashed red lines represent the training loss and accuracy, and the solid blue lines are the validation loss and accuracy.
***

## Export for inference

Now save your fine-tuned model for later use.

In [None]:
dataset_name = 'movie_reviews'
saved_model_path = './{}_bert'.format(dataset_name.replace('/', '_'))

classifier_model.save(saved_model_path, include_optimizer=False)

Let's reload the model, so you can try it side by side with the model that is still in memory.

In [None]:
reloaded_model = tf.saved_model.load(saved_model_path)

In [None]:
def print_my_examples(inputs, results):
  result_for_printing = \
    [f'input: {inputs[i]:<30} : score: {results[i][0]:.6f}'
                         for i in range(len(inputs))]
  print(*result_for_printing, sep='\n')
  print()


examples = [
    'this is such an amazing movie!',
    'The movie was great!',
    'The movie was meh.',
    'The movie was okish.',
    'The movie was terrible...'
]

reloaded_results = tf.sigmoid(reloaded_model(tf.constant(examples)))
original_results = tf.sigmoid(classifier_model(tf.constant(examples)))

print('Results from the saved model:')
print_my_examples(examples, reloaded_results)
print('Results from the model in memory:')
print_my_examples(examples, original_results)
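
The scores printed above are probabilities after the sigmoid. Turning them into class labels is a simple thresholding step, sketched below. `scores_to_labels` is a hypothetical helper, and the default class order `('neg', 'pos')` is an assumption based on the subdirectory names of the movie review corpus; in practice, pass the `class_names` printed when the dataset was loaded.

```python
def scores_to_labels(scores, class_names=('neg', 'pos'), threshold=0.5):
    """Map sigmoid scores to class names: index 1 if score >= threshold, else index 0."""
    return [class_names[int(score >= threshold)] for score in scores]

# Stand-in scores for illustration; with the real model you could pass
# [row[0] for row in original_results.numpy()] instead.
print(scores_to_labels([0.93, 0.5, 0.08]))
```

Scores close to 0.5 indicate that the model is undecided, so for borderline reviews such as "meh" or "okish" the label alone hides useful information.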