### About BERT

[BERT](https://arxiv.org/abs/1810.04805) and other Transformer encoder architectures have been wildly successful on a variety of tasks in NLP (natural language processing). They compute vector-space representations of natural language that are suitable for use in deep learning models. The BERT family of models uses the Transformer encoder architecture to process each token of input text in the full context of all tokens before and after, hence the name: Bidirectional Encoder Representations from Transformers. 

### Classify text with BERT

In this notebook, we:

- Load the IMDB dataset
- Load a BERT model from TensorFlow Hub
- Build a model by combining BERT with a classifier
- Train the model, fine-tuning BERT in the process
- Save the model and use it to classify sentences

### Sentiment analysis

This notebook trains a sentiment analysis model to classify movie reviews as *positive* or *negative*, based on the text of the review.

We'll use the [Large Movie Review Dataset](https://ai.stanford.edu/~amaas/data/sentiment/) that contains the text of 50,000 movie reviews from the [Internet Movie Database](https://www.imdb.com/).

In [1]:
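import sys
import os

import tensorflow as tf  # the cells below reference `tf` directly

# Add the parent directory of the nlpmaps package to sys.path (machine-specific path)
sys.path.append(os.path.abspath('/home/meenal/nlpmaps'))
from nlpmaps.bert import *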


ModuleNotFoundError: No module named 'nlpmaps'

The cells below prepare the IMDB dataset for sentiment analysis of movie reviews. The `text_dataset_from_directory` utility creates labeled `tf.data.Dataset` objects for the training, validation, and test sets. The batch size is 32, and a random seed of 42 keeps the split reproducible. The `validation_split` argument holds out 20% of the training data as a validation set. All three datasets are cached and prefetched with `AUTOTUNE` to speed up data loading during training. The class names for the sentiment labels are read from the training dataset.

In [None]:
url = 'https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
dataset_dir = download_and_extract_dataset(url)

In [None]:
AUTOTUNE = tf.data.AUTOTUNE
batch_size = 32
seed = 42
validation_split = 0.2

train_ds, val_ds, test_ds, class_names = load_imdb_dataset(batch_size=batch_size, validation_split=validation_split, seed=seed)


In [None]:
from nlpmaps.bert import print_first_batch
print_first_batch(train_ds, class_names)
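
To make the seeded 20% hold-out concrete, here is a minimal pure-Python sketch of a reproducible train/validation split. It is an illustration only, not the code Keras or `load_imdb_dataset` runs. The function name `split_indices` is hypothetical, and the figure of 25,000 labeled training reviews is an assumption taken from the dataset's published 25,000/25,000 train/test split.

```python
import math
import random


def split_indices(num_examples, validation_split, seed):
    """Shuffle example indices with a fixed seed and hold out the last fraction."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)  # same seed -> same shuffle every run
    num_val = int(num_examples * validation_split)
    train_idx = indices[:num_examples - num_val]
    val_idx = indices[num_examples - num_val:]
    return train_idx, val_idx


# Assumption: 25,000 labeled training reviews before the validation hold-out.
train_idx, val_idx = split_indices(num_examples=25_000, validation_split=0.2, seed=42)
batches_per_epoch = math.ceil(len(train_idx) / 32)  # batch_size = 32

print(len(train_idx), len(val_idx), batches_per_epoch)
```

Because the shuffle is seeded, rerunning the split yields the same training and validation examples, which is why the notebook fixes `seed`.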

### Loading models from TensorFlow Hub

Here you can choose which BERT model you will load from TensorFlow Hub and fine-tune. There are multiple BERT models available.

  - [BERT-Base, Uncased](https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3) and [seven more models](https://tfhub.dev/google/collections/bert/1) with trained weights released by the original BERT authors.
  - [Small BERTs](https://tfhub.dev/google/collections/bert/1) have the same general architecture but fewer and/or smaller Transformer blocks, which lets you explore tradeoffs between speed, size and quality.
  - [ALBERT](https://tfhub.dev/google/collections/albert/1): four different sizes of "A Lite BERT" that reduces model size (but not computation time) by sharing parameters between layers.
  - [BERT Experts](https://tfhub.dev/google/collections/experts/bert/1): eight models that all have the BERT-base architecture but offer a choice between different pre-training domains, to align more closely with the target task.
  - [Electra](https://tfhub.dev/google/collections/electra/1) has the same architecture as BERT (in three different sizes), but gets pre-trained as a discriminator in a set-up that resembles a Generative Adversarial Network (GAN).
  - BERT with Talking-Heads Attention and Gated GELU [[base](https://tfhub.dev/tensorflow/talkheads_ggelu_bert_en_base/1), [large](https://tfhub.dev/tensorflow/talkheads_ggelu_bert_en_large/1)] has two improvements to the core of the Transformer architecture.

We have the option to choose which BERT model we want to load from TensorFlow Hub and fine-tune for our sentiment analysis task. We decided to go with Small BERTs since they have fewer and/or smaller Transformer blocks, making them faster and more memory-efficient to train and deploy. This is particularly useful since our current model is taking 45 minutes per epoch to train, and we have limited computational resources. Additionally, Small BERTs still maintain a good level of performance on NLP tasks like sentiment analysis. The other BERT models available have different trade-offs between speed, size, and quality, and we can choose one of them if we want even better accuracy or have more resources available. All these models can be loaded from their corresponding TensorFlow Hub URLs using the code block provided.

You'll see in the code below that switching the tfhub.dev URL is enough to try any of these models, because all the differences between them are encapsulated in the SavedModels from TF Hub.

The variable `bert_model_name` is a string naming one of the BERT models listed above. A dictionary, `map_name_to_handle`, maps each model name to its TensorFlow Hub URL, and `select_bert_model` returns the encoder handle together with the handle of the matching preprocessing model. These URLs are then used to load the pre-trained models from TensorFlow Hub.

In [None]:
bert_model_name = 'small_bert/bert_en_uncased_L-4_H-512_A-8'
tfhub_handle_encoder, tfhub_handle_preprocess = select_bert_model(bert_model_name)
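
The name-to-URL lookup described above might look roughly like the sketch below. This is not the `nlpmaps` implementation: `lookup_handles` and `map_model_to_preprocess` are hypothetical names, only two models are listed, and the version suffixes on the URLs are assumptions that should be checked against tfhub.dev before use.

```python
# Hypothetical lookup tables; the helper used in this notebook may differ.
map_name_to_handle = {
    'bert_en_uncased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3',
    'small_bert/bert_en_uncased_L-4_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1',
}

# Both encoders above are uncased English models, so they share one preprocessor.
map_model_to_preprocess = {
    name: 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3'
    for name in map_name_to_handle
}


def lookup_handles(model_name):
    """Return (encoder_url, preprocess_url) for a known model name."""
    if model_name not in map_name_to_handle:
        raise ValueError(f'Unknown BERT model: {model_name!r}')
    return map_name_to_handle[model_name], map_model_to_preprocess[model_name]


encoder_url, preprocess_url = lookup_handles('small_bert/bert_en_uncased_L-4_H-512_A-8')
print(encoder_url)
print(preprocess_url)
```

Keeping the two tables keyed by the same name is what guarantees that an encoder is always paired with its own preprocessing model.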

### The preprocessing model

We need to transform text inputs into numeric token ids and arrange them in Tensors before inputting them to BERT. Luckily, TensorFlow Hub provides a matching preprocessing model for each BERT model we choose, which implements this transformation using TF ops from the TF.text library. This means we don't need to run pure Python code outside of our TensorFlow model to preprocess text.

It's important to use the preprocessing model that is referenced in the documentation of the BERT model we choose, which we can find at the model's TensorFlow Hub URL. In this notebook, `select_bert_model` returns the matching preprocessing handle along with the encoder handle, so the pairing is handled automatically.

Note: To load the preprocessing model into our fine-tuned model, we'll use the hub.KerasLayer API. This is the preferred way to load a TF2-style SavedModel from TF Hub into a Keras model.

We can try the preprocessing model on some text and observe the output.

Running the cell below returns the three outputs of the preprocessing model that a BERT model uses: `input_word_ids`, `input_mask`, and `input_type_ids`.

Note that the input is padded or truncated to 128 tokens, although the number of tokens can be customized as required.

The `input_type_ids` contain only one value (0) because this is a single-sentence input. For an input with multiple segments, each token carries the index of the segment it belongs to (0 for the first segment, 1 for the second).

Since this text preprocessor is a TensorFlow model, we can include it directly in our model.

In [None]:
text_test = ['this is such an amazing movie!']
text_preprocessed = preprocess_text(tfhub_handle_preprocess, text_test)

print(f'Keys       : {list(text_preprocessed.keys())}')
print(f'Shape      : {text_preprocessed["input_word_ids"].shape}')
print(f'Word Ids   : {text_preprocessed["input_word_ids"][0, :12]}')
print(f'Input Mask : {text_preprocessed["input_mask"][0, :12]}')
print(f'Type Ids   : {text_preprocessed["input_type_ids"][0, :12]}')
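
To show how the three preprocessing outputs relate to each other, here is a toy pure-Python sketch for a single-segment input. It does not tokenize anything and is not the TF Hub preprocessing model: `pack_inputs` is a hypothetical helper, and the token ids in `token_ids` are placeholders standing in for what a WordPiece tokenizer would produce. The ids 101 and 102 are the `[CLS]` and `[SEP]` ids in the standard uncased BERT vocabulary.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0


def pack_inputs(token_ids, seq_length=128):
    """Wrap token ids in [CLS]/[SEP], then truncate or pad to seq_length."""
    ids = [CLS_ID] + token_ids[:seq_length - 2] + [SEP_ID]
    num_real = len(ids)
    padding = seq_length - num_real
    return {
        'input_word_ids': ids + [PAD_ID] * padding,
        'input_mask': [1] * num_real + [0] * padding,  # 1 = real token, 0 = padding
        'input_type_ids': [0] * seq_length,            # single segment -> all zeros
    }


# Placeholder ids; a real tokenizer would derive these from the text.
token_ids = [2023, 2003, 2107, 2019, 6429, 3185, 999]
packed = pack_inputs(token_ids)

for key, values in packed.items():
    print(f'{key:<15}: {values[:12]}')
```

The mask lets the encoder ignore the padding positions, which is why its ones line up exactly with the non-padding entries of `input_word_ids`.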

### Using the BERT model

Now that we have our preprocessed inputs, let's see how we can use the BERT model. We will load the BERT model from TF Hub and examine its outputs.

First, we will load the BERT model from the TF Hub using hub.KerasLayer(). Then, we will pass our preprocessed inputs to the BERT model to get its outputs. The outputs from the BERT model are embeddings for each token in the input sequence, along with a pooled embedding for the entire sequence.

The embeddings for each token can be used for various NLP tasks such as classification, question answering, etc. The pooled embedding is a condensed representation of the entire input sequence that can be used for tasks such as sentence similarity, clustering, etc.

Let's take a closer look at the BERT model outputs and see how we can use them in our downstream tasks.

In [None]:
bert_outputs = get_bert_outputs(text_preprocessed, tfhub_handle_encoder)

print(f'Loaded BERT: {tfhub_handle_encoder}')
print(f'Pooled Outputs Shape: {bert_outputs["pooled_output"].shape}')
print(f'Pooled Outputs Values: {bert_outputs["pooled_output"][0, :12]}')
print(f'Sequence Outputs Shape: {bert_outputs["sequence_output"].shape}')
print(f'Sequence Outputs Values: {bert_outputs["sequence_output"][0, :12]}') 

With the BERT model loaded and our text preprocessed, we can inspect the model's outputs. The BERT model returns a dictionary with three keys: `pooled_output`, `sequence_output`, and `encoder_outputs`.

- `pooled_output` is the embedding representation of the entire sequence. The shape of this tensor is [batch_size, H], where H is the hidden size of the BERT model.

- `sequence_output` is the contextual embedding of each token in the input sequence. The shape of this tensor is [batch_size, seq_length, H], where seq_length is the padded length of the input sequence.

- `encoder_outputs` are the intermediate activations of the L Transformer blocks, where outputs["encoder_outputs"][i] is a tensor of shape [batch_size, seq_length, H] with the outputs of the i-th Transformer block, for 0 <= i < L. The last value of the list is equal to `sequence_output`.

For fine-tuning our BERT model, we will use the `pooled_output` tensor, which represents the entire sequence as an embedding.
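
As a quick sanity check on these shapes, the sketch below derives the expected output shapes from a model name. It assumes the usual BERT naming convention, in which `L` is the number of Transformer blocks, `H` the hidden size, and `A` the number of attention heads. `expected_output_shapes` is a hypothetical helper, not part of TensorFlow Hub or `nlpmaps`.

```python
import re


def expected_output_shapes(model_name, batch_size, seq_length=128):
    """Derive BERT output shapes from the L-/H-/A- fields in a model name."""
    match = re.search(r'L-(\d+)_H-(\d+)_A-(\d+)', model_name)
    if match is None:
        raise ValueError(f'No L/H/A fields found in {model_name!r}')
    num_layers, hidden_size, _num_heads = (int(g) for g in match.groups())
    token_shape = (batch_size, seq_length, hidden_size)
    return {
        'pooled_output': (batch_size, hidden_size),
        'sequence_output': token_shape,
        'encoder_outputs': [token_shape] * num_layers,  # one entry per block
    }


shapes = expected_output_shapes('small_bert/bert_en_uncased_L-4_H-512_A-8', batch_size=1)
print(shapes['pooled_output'])
print(shapes['sequence_output'])
print(len(shapes['encoder_outputs']))
```

Comparing these values with the shapes printed by the cell above is an easy way to confirm that the intended model was loaded.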

### Define your model

In this step, we are going to define our own fine-tuned model. We will use the text preprocessing model and the BERT model that we loaded in the previous steps, along with a Dense layer and a Dropout layer.

The preprocessing model produces inputs in the format the BERT encoder expects, so we don't need to handle that ourselves. For more information about the base model's inputs and outputs, refer to the documentation at its URL.

Our model is deliberately simple. We combine the `pooled_output` of the BERT model with a Dense layer and a Dropout layer, which helps reduce overfitting.

To check that the model works, we run it on the sample text from earlier.

First, we call the `build_classifier_model` function, which builds the classification model around the BERT encoder.

Then, we pass the raw sample text `text_test` directly to the model. The preprocessing model is part of the classifier, so no separate preprocessing step is needed.

Finally, we check that the model returns a valid output. The value itself is not meaningful yet, because the model has not been trained.

In [None]:
classifier_model = build_classifier_model(tfhub_handle_encoder, tfhub_handle_preprocess)
# obtain the raw (pre-sigmoid) output of the classifier for the sample text
bert_raw_result = classifier_model(tf.constant(text_test))
# apply the sigmoid function to map the raw output to a probability in [0, 1]
print(tf.sigmoid(bert_raw_result))
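
The step from the pooled embedding to a probability can be sketched in plain Python. This is a toy illustration under stated assumptions: the 4-dimensional `pooled` vector, `weights`, and `bias` are made-up values (the real head has H inputs and learned weights), `classifier_head` is a hypothetical name, and Dropout is omitted because it is inactive at inference time.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def classifier_head(pooled, weights, bias):
    """Single-unit Dense layer: a dot product plus a bias, giving one raw logit."""
    return sum(w * p for w, p in zip(weights, pooled)) + bias


# Made-up values for illustration only.
pooled = [0.5, -1.0, 0.25, 2.0]
weights = [0.2, 0.1, -0.4, 0.3]
bias = 0.05

logit = classifier_head(pooled, weights, bias)
probability = sigmoid(logit)
print(f'logit={logit:.3f}  probability={probability:.3f}')
```

A logit of 0 maps to a probability of 0.5, so the sign of the raw output already tells us which class the model leans toward.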

Let's take a look at the model's structure.

In [None]:
tf.keras.utils.plot_model(classifier_model)

### Model training

Now that we have all the necessary components such as the preprocessing module, BERT encoder, data, and classifier, we can train the model. Training the model involves the following steps:

- Defining the optimizer, loss function, and evaluation metrics.
- Compiling the model with the optimizer and loss function.
- Fitting the model to the training data with a specified batch size and number of epochs.

The goal of training the model is to update the weights of the neural network such that the loss function is minimized and the evaluation metrics are maximized. Once the model has been trained, we can evaluate its performance on the validation and test sets.

### Loss function

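Since this is a binary classification problem and the model has a single-unit output layer, we'll use the `losses.BinaryCrossentropy` loss function. Note that the cells in this notebook apply `tf.sigmoid` to the model's output, which means the output is a raw logit rather than a probability, so the loss should be constructed with `from_logits=True`.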

In [None]:
# Call the function and assign the results to variables
binary_loss, metrics = build_loss_and_metrics()
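
For intuition about what the loss measures, here is a minimal pure-Python sketch of binary cross-entropy computed from a raw logit for a single example. It uses the standard numerically stable formulation and is not the Keras implementation. The function name is hypothetical.

```python
import math


def binary_crossentropy_from_logit(logit, label):
    """Binary cross-entropy for one example, given a raw logit and a 0/1 label.

    Equivalent to -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(logit),
    rearranged so that large |logit| values do not overflow.
    """
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


for logit, label in [(0.0, 1), (3.0, 1), (3.0, 0), (-3.0, 0)]:
    loss = binary_crossentropy_from_logit(logit, label)
    print(f'logit={logit:+.1f} label={label} loss={loss:.4f}')
```

Confident correct predictions give a loss near zero, while confident wrong predictions are penalized roughly in proportion to the size of the logit.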

### Optimizer

The cell below configures the optimizer. Training runs for 5 epochs, so the total number of training steps is the number of batches per epoch multiplied by the number of epochs. The first 10% of those steps are used for learning-rate warm-up (`num_warmup_steps`), and the initial learning rate (`init_lr`) is 3e-5.

We will then use the `model.fit()` function in TensorFlow to train the model. Training passes batches of movie reviews to the model and updates its weights based on the loss calculated by the loss function.

During training, we track the loss and accuracy of the model on both the training and validation sets. This lets us monitor progress and spot overfitting or underfitting.

Early stopping can be used to limit overfitting: it monitors the validation loss and stops training if the loss has not improved for a set number of epochs. Note that the `fit()` call in this notebook does not pass an early-stopping callback, so training runs for all `epochs`.

Once the model is trained, we will evaluate its performance on the test set to get an estimate of its real-world performance.
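
The early-stopping rule can be sketched in a few lines of plain Python. This shows the patience logic only and is not the Keras callback. `early_stopping_epoch` is a hypothetical name, and the validation losses are made-up numbers chosen for illustration.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 0-based epoch at which training would stop, or None.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best = float('inf')
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Made-up validation losses for illustration only.
val_losses = [0.50, 0.42, 0.40, 0.41, 0.43]
print(early_stopping_epoch(val_losses, patience=2))
print(early_stopping_epoch(val_losses, patience=3))
```

A larger `patience` tolerates more noise in the validation loss at the cost of training for longer after the best epoch.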

In [None]:
# set the parameters
epochs = 5
steps_per_epoch = tf.data.experimental.cardinality(train_ds).numpy()
num_train_steps = steps_per_epoch * epochs
num_warmup_steps = int(0.1*num_train_steps)
init_lr = 3e-5

# create the optimizer
optimizer = create_optimizer(init_lr, num_train_steps, num_warmup_steps)
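
To see what the warm-up parameters imply, here is a pure-Python sketch of a learning-rate schedule with linear warm-up followed by linear decay to zero, a schedule commonly used for BERT fine-tuning. Whether `create_optimizer` implements exactly this schedule is an assumption, and `learning_rate` is a hypothetical function. The value of 625 batches per epoch is also an assumption (20,000 training reviews at a batch size of 32).

```python
def learning_rate(step, init_lr, num_train_steps, num_warmup_steps):
    """Linear warm-up to init_lr, then linear decay to zero at num_train_steps."""
    if step < num_warmup_steps:
        return init_lr * step / num_warmup_steps
    remaining = max(num_train_steps - step, 0)
    return init_lr * remaining / (num_train_steps - num_warmup_steps)


epochs = 5
steps_per_epoch = 625  # assumption: 20,000 training examples / batch size 32
num_train_steps = steps_per_epoch * epochs
num_warmup_steps = int(0.1 * num_train_steps)
init_lr = 3e-5

for step in (0, num_warmup_steps // 2, num_warmup_steps, num_train_steps):
    lr = learning_rate(step, init_lr, num_train_steps, num_warmup_steps)
    print(f'step {step:>4}: lr = {lr:.2e}')
```

The warm-up keeps early updates small while the randomly initialized classifier head settles, and the decay reduces the step size as training converges.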

### Loading the BERT model and training

Using the `classifier_model` we created earlier, you can compile the model with the loss, metric and optimizer.

In [None]:
classifier_model.compile(optimizer=optimizer,
                         loss=binary_loss,
                         metrics=metrics)

Note: training time will vary depending on the size and complexity of the BERT model you have selected.

In [None]:
print(f'Training model with {tfhub_handle_encoder}')
history = classifier_model.fit(x=train_ds,
                               validation_data=val_ds,
                               epochs=epochs)

### Evaluate the model

Let's see how the model performs. Two values are returned: the loss (a number that represents the error, where lower values are better) and the accuracy.

In [None]:
loss, accuracy = classifier_model.evaluate(test_ds)

print(f'Loss: {loss}')
print(f'Accuracy: {accuracy}')

### Plot the accuracy and loss over time

Based on the `History` object returned by `model.fit()`, we can plot the training and validation loss for comparison, as well as the training and validation accuracy. In this plot, the red lines represent the training loss and accuracy, and the blue lines represent the validation loss and accuracy.

In [None]:
plot_history(history)

### Export for inference

Now just save your fine-tuned model for later use.

In [None]:
dataset_name = 'imdb'
saved_model_path = './{}_bert'.format(dataset_name.replace('/', '_'))
save_model(classifier_model, dataset_name, saved_model_path)

Here you can test your model on any sentence you want. Just add it to the `examples` variable below.

In [None]:
examples = [
    'this is such a spectacular movie!', 
    'The movie was great!',
    'The movie was meh.',
    'The movie was okish.',
    'The movie was terrible...'
]

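# Reload the saved model (assumes save_model wrote a TensorFlow SavedModel to saved_model_path)
reloaded_model = tf.saved_model.load(saved_model_path)
results = tf.sigmoid(reloaded_model(tf.constant(examples)))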

print('Results from the model:')
print_my_examples(examples, results)
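
To turn the sigmoid scores into labels, a threshold of 0.5 is the usual choice. The sketch below shows one way to do that in plain Python. It is not the `print_my_examples` helper: `label_scores` is a hypothetical function, the class order `('neg', 'pos')` is an assumption based on the dataset's directory names, and the scores are made-up values, not model output.

```python
def label_scores(examples, scores, class_names=('neg', 'pos'), threshold=0.5):
    """Pair each example with its score and the class chosen by thresholding."""
    labeled = []
    for text, score in zip(examples, scores):
        label = class_names[1] if score >= threshold else class_names[0]
        labeled.append((text, score, label))
    return labeled


sample_examples = ['The movie was great!', 'The movie was terrible...']
sample_scores = [0.97, 0.02]  # made-up scores for illustration only

for text, score, label in label_scores(sample_examples, sample_scores):
    print(f'{text:<30} score={score:.2f} -> {label}')
```

Scores close to 0.5 indicate that the model is unsure, so it can be worth inspecting those examples by hand.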