# Deep Learning
## Formative assessment
### Week 7: Transformers

#### Instructions

In this notebook, you will write code to implement and train a Transformer classifier model.

Some code cells are provided for you in the notebook. You should avoid editing provided code, and make sure to execute the cells in order to avoid unexpected errors. Some cells begin with the line:

`#### GRADED CELL ####`

These cells require you to write your own code to complete them.

#### Let's get started!

We'll start by running some imports, and loading the dataset.

In [1]:
#### PACKAGE IMPORTS ####

# Run this cell to import all required packages. 

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (TextVectorization, Dense, MultiHeadAttention, LayerNormalization, 
                                     Layer, Embedding, Dropout, GlobalAveragePooling1D)
import numpy as np
import matplotlib.pyplot as plt

<img src="figures/IMDb.png" title="IMDb" style="width: 550px;"/>  
  
#### The IMDb dataset

In this assignment, you will use the [IMDb dataset](https://www.imdb.com/interfaces/). This is a sentiment analysis dataset of movie reviews with binary labels. It contains 25,000 training examples and a further 25,000 for testing. 

* Maas, A.L.,  Daly, R.E.,  Pham, P.T.,  Huang, D.,  Ng, A.Y. & Potts, C. (2011), "Learning Word Vectors for Sentiment Analysis", _Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies_, 142-150.

Your goal is to build and train an encoder-only Transformer classifier model to predict the sentiment labels from the review text.

#### Load and prepare the data
For this assignment, you will load the IMDb dataset from the TensorFlow Datasets library:

In [2]:
# Run this cell to load the data and print the element_spec

import tensorflow_datasets as tfds

train_data = tfds.load("imdb_reviews", split="train", read_config=tfds.ReadConfig(try_autocache=False))
test_data = tfds.load("imdb_reviews", split="test", read_config=tfds.ReadConfig(try_autocache=False))

train_data.element_spec


{'label': TensorSpec(shape=(), dtype=tf.int64, name=None),
 'text': TensorSpec(shape=(), dtype=tf.string, name=None)}

In [3]:
# View some samples

for example in train_data.shuffle(100).take(2):
    print(example['text'].numpy().decode("utf-8"))
    print(f"Label: {example['label'].numpy()}")
    print()

This is a film which should be seen by anybody interested in, effected by, or suffering from an eating disorder. It is an amazingly accurate and sensitive portrayal of bulimia in a teenage girl, its causes and its symptoms. The girl is played by one of the most brilliant young actresses working in cinema today, Alison Lohman, who was later so spectacular in 'Where the Truth Lies'. I would recommend that this film be shown in all schools, as you will never see a better on this subject. Alison Lohman is absolutely outstanding, and one marvels at her ability to convey the anguish of a girl suffering from this compulsive disorder. If barometers tell us the air pressure, Alison Lohman tells us the emotional pressure with the same degree of accuracy. Her emotional range is so precise, each scene could be measured microscopically for its gradations of trauma, on a scale of rising hysteria and desperation which reaches unbearable intensity. Mare Winningham is the perfect choice to play her mot

#### Tokenizing the input sentences

We will need to convert the text into integer tokens to be able to process them in the Transformer. To do this we will use a `TextVectorization` layer and adapt it to the training data. You should now complete the following function to create and prepare this layer as follows:

* The function takes a `dataset` (a `tf.data.Dataset` object) as an argument, which has the same spec as `train_data` or `test_data` above. It also takes a `max_tokens` argument
* The `TextVectorization` should be configured to use a maximum of `max_tokens` tokens (including the masking and OOV tokens)
* It should standardize the input text by lower-casing the text and removing punctuation
* It should split the text on whitespace
* You should use the `adapt` method to compute the vocabulary using `dataset`
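
To make the configuration above concrete, the following plain-Python sketch mimics what the layer does conceptually: lower-casing, stripping punctuation, splitting on whitespace, and building a frequency-ranked vocabulary capped at `max_tokens`. It is an illustration only, not the Keras implementation. The helper names `standardize`, `build_vocab` and `encode` are hypothetical, and the sketch assumes that index 0 is reserved for padding and index 1 for out-of-vocabulary (OOV) words.

```python
import string
from collections import Counter

# Hypothetical helpers for illustration; TextVectorization does this internally.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def standardize(text):
    """Lower-case the text and strip ASCII punctuation."""
    return text.lower().translate(_PUNCT_TABLE)

def build_vocab(texts, max_tokens):
    """Map the most frequent words to integer ids 2, 3, ...

    Ids 0 (padding) and 1 (OOV) are assumed to be reserved, so at most
    max_tokens - 2 words receive their own id.
    """
    counts = Counter(word for t in texts for word in standardize(t).split())
    most_common = [w for w, _ in counts.most_common(max_tokens - 2)]
    return {word: idx for idx, word in enumerate(most_common, start=2)}

def encode(text, vocab):
    """Convert text to a list of integer tokens; unknown words map to 1."""
    return [vocab.get(word, 1) for word in standardize(text).split()]

corpus = ["A great film, great cast!", "A dull film."]
vocab = build_vocab(corpus, max_tokens=5)   # room for 3 real words
tokens = encode("Great film; dull plot.", vocab)
print(vocab)
print(tokens)
```

Words that fall outside the capped vocabulary ("dull" and "plot" here) all collapse to the OOV id, which is why a larger `max_tokens` preserves more of the review text.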

In [4]:
#### GRADED CELL ####

# Complete the following function.
# Make sure not to change the function name or arguments.

def configure_textvectorization(dataset, max_tokens):
    """
    This function should create a TextVectorization layer and configure it as above.
    The function should then return the TextVectorization layer.
    """
    textvectorization = TextVectorization(max_tokens=max_tokens, 
                                          standardize='lower_and_strip_punctuation',  # default
                                          split='whitespace'  # default
                                         )
    textvectorization.adapt(dataset.map(lambda x: x['text']))  # Can also batch the dataset
    return textvectorization

In [5]:
# Use your function to create and configure the TextVectorization layer

MAX_TOKENS = 20000
text_vectorization = configure_textvectorization(train_data, max_tokens=MAX_TOKENS)

In [6]:
# Test your TextVectorization layer

for example in train_data.shuffle(100).take(2):
    print(text_vectorization(example['text']))

tf.Tensor(
[    2 13077   171    14   946     6     1    21     2  1045  1245     5
     2  8454  3374     5     2   167   398   131     2  5611     1   754
  3121     8    17 19606     1 14928     5     2   262   186   450     2
     1  2623   177     3    47    50   225   981     2    18 10402   354
    16    47  5578   233    30     4     1   361    32     2    97   697
   875     6 12123     2   330   361  1200     7    47  1521   260    36
  3587   172  2876     1     4   164  7176  1615   640    16   454    59
    26  1618 14219     2  8897     5  4736  1159   708     4  2963     1
    39  1982  4816   227    26   224   140    17     2     1   818    19
     1  4326     9    15    41   154     1  3416   100    84    12  1588
     1   764  6299   430   145   161   132   576   607    15  1454  1596
     3 12944    31     2   167    62  1230   811  9189  3131  1810   261
   126     2   413    15     4     1  1900   668     1     7    57   152
   172    19   825   183     6   545    

#### Preprocess the datasets

You should now complete the following function `preprocess_dataset` which you will use to preprocess the `train_data` and `test_data` Dataset objects according to the following spec:

* The function takes `dataset`, `text_vectorization_layer`, `max_seq_len`, `batch_size` and `shuffle_buffer_size` as arguments
    * `dataset` is a `tf.data.Dataset` object with the same spec as `train_data` or `test_data` above
* The `text_vectorization_layer` should be used to convert the text into integer tokens
* The maximum length of the token sequences should be `max_seq_len`. Any token sequences longer than this should be truncated
* The Datasets should return a tuple of `(tokens, label)` Tensors
* The Datasets should be shuffled with buffer size `shuffle_buffer_size`, and then batched with `batch_size`. Note that the sequences will not be the same length, so the batches should be padded with zero masking tokens where necessary (see [the docs](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#padded_batch))
* The function should then return the preprocessed `dataset` Dataset object
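
The truncation and padding steps in the spec can be pictured without TensorFlow. The sketch below uses a hypothetical helper `truncate_and_pad` on plain Python lists. It assumes that each batch is padded with zeros only up to the longest sequence in that batch, so the padded length can vary from batch to batch.

```python
def truncate_and_pad(sequences, max_seq_len, pad_value=0):
    """Truncate each sequence to max_seq_len, then right-pad to a common length."""
    truncated = [seq[:max_seq_len] for seq in sequences]
    batch_len = max(len(seq) for seq in truncated)
    return [seq + [pad_value] * (batch_len - len(seq)) for seq in truncated]

batch = truncate_and_pad([[5, 8, 2, 9, 4], [7, 3], [6]], max_seq_len=4)
for row in batch:
    print(row)
```

The first sequence loses its final token to truncation, while the two shorter sequences gain trailing zeros.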

In [7]:
#### GRADED CELL ####

# Complete the following function.
# Make sure not to change the function name or arguments.

def preprocess_dataset(dataset, text_vectorization_layer, max_seq_len, batch_size, shuffle_buffer_size):
    """
    This function should preprocess the Dataset object as above.
    The function should then return the preprocessed Dataset.
    """
    def _inputs_and_targets(example):
        return example['text'], example['label']
    
    def _integer_tokens(text, label):
        return text_vectorization_layer(text), label
    
    def _truncate_seq(tokens, label):
        return tokens[:max_seq_len], label
    
    dataset = dataset.map(_inputs_and_targets).map(_integer_tokens).map(_truncate_seq)
    dataset = dataset.shuffle(shuffle_buffer_size).padded_batch(batch_size)
    
    return dataset

In [8]:
# Run your function to preprocess the Datasets

MAX_SEQ_LEN = 200
train_data = preprocess_dataset(train_data, text_vectorization, MAX_SEQ_LEN, 
                                batch_size=32, shuffle_buffer_size=500)
test_data = preprocess_dataset(test_data, text_vectorization, MAX_SEQ_LEN, 
                               batch_size=32, shuffle_buffer_size=500)

In [9]:
# Print the element_spec

train_data.element_spec

(TensorSpec(shape=(None, None), dtype=tf.int64, name=None),
 TensorSpec(shape=(None,), dtype=tf.int64, name=None))

In [10]:
# Inspect a data minibatch

for example in train_data.take(1):
    print(example)

(<tf.Tensor: shape=(32, 200), dtype=int64, numpy=
array([[   29,     5,    56, ...,     0,     0,     0],
       [   11,     7,  1402, ...,  8526,    21,     4],
       [   10,   378,   632, ...,     0,     0,     0],
       ...,
       [ 2724,   356,     3, ...,     0,     0,     0],
       [   49,  3372,   984, ...,    32,     8,    32],
       [    9,   110, 14022, ...,    42,  3459,  1533]])>, <tf.Tensor: shape=(32,), dtype=int64, numpy=
array([1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0])>)


Note that when we pass this integer tokens Tensor through our Transformer, we will need to be careful to not use the zero padding tokens. The mechanism to handle this is masking (see [this guide](https://www.tensorflow.org/guide/keras/masking_and_padding)), and our custom layers will need to make use of this masking mechanism.

#### Transformer architecture

We will use an encoder-only Transformer classifier architecture for the task of sentiment prediction. This will consist of a single encoder block, followed by a classifier head.

<img src="figures/encoder-only_transformer.png" alt="Transformer" style="width: 250px;"/>  

#### Positional encodings and Embedding layer

You will now implement the input embedding and positional encoding stage of the Transformer. Your model will use the deterministic positional encoding scheme $\mathbf{P}\in\mathbb{R}^{n\times d_{model}}$ as in the original Transformer:

$$
\begin{align}
P_{ti} &= \begin{cases}
\sin(\omega_k t) & \text{for } i=2k+1 \quad (\text{for some } k\in\mathbb{N}_0),\\
\cos(\omega_k t) & \text{for } i=2k+2 \quad (\text{for some } k\in\mathbb{N}_0),
\end{cases}\\
\omega_k &= \frac{1}{10000^{2k/d_{model}}},
\end{align}
$$

where $t=1,\ldots,n$ and $i=1,\ldots,d_{model}$.

You should now complete the following `positional_encodings` function to compute the positional encoding $\mathbf{P}\in\mathbb{R}^{n\times d_{model}}$.

* The function takes `seq_len` (the number of time steps) and `d_model` (the embedding dimension) as integer arguments
* The function should compute a 2D Tensor of shape `(seq_len, d_model)` of positional encodings according to the above equations (be careful with Python's zero-indexing: the Python indices are off by one from the mathematical indices above)
* The function should then return the Tensor of positional encodings with type `tf.float32`
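
As a cross-check for a vectorised implementation, here is a deliberately slow loop-based sketch in plain Python. The name `positional_encodings_reference` is hypothetical. The sketch assumes zero-based Python indices in which column `2k` holds the sine term and column `2k + 1` the cosine term, and in which the first row corresponds to an angle of zero; adjust the position offset if your convention differs.

```python
import math

def positional_encodings_reference(seq_len, d_model, max_wavelength=10000.0):
    """Return a seq_len x d_model nested list of sinusoidal encodings."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for col in range(d_model):
            k = col // 2                                   # frequency index
            omega = 1.0 / max_wavelength ** (2 * k / d_model)
            angle = omega * pos
            # Even columns hold sines, odd columns hold cosines
            pe[pos][col] = math.sin(angle) if col % 2 == 0 else math.cos(angle)
    return pe

pe = positional_encodings_reference(seq_len=4, d_model=6)
print(pe[0])   # sines are 0 and cosines are 1 at position 0
```

Comparing a few entries of this reference against the Tensor returned by your own function is a quick way to catch index mix-ups between the sine and cosine columns.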

In [11]:
#### GRADED CELL ####

# Complete the following function.
# Make sure not to change the function name or arguments.

def positional_encodings(seq_len, d_model):
    """
    This function should compute the positional encodings as above.
    The function should then return the Tensor of positional encodings.
    """
    max_wavelength = 10000.

    pos = np.arange(seq_len)
    inx = np.arange(d_model)

    I, P = np.meshgrid(inx, pos)
    pe_even = np.sin(P / max_wavelength**(I/d_model))
    pe_odd = np.cos(P / max_wavelength**(I/d_model))
        
    pe = np.zeros((seq_len, d_model))
    pe[:, ::2] = pe_even[:, ::2]
    pe[:, 1::2] = pe_odd[:, ::2]
    return tf.constant(pe, dtype=tf.float32)

In [12]:
# Run your function to get the positional encodings

D_MODEL = 32
pos_encodings = positional_encodings(MAX_SEQ_LEN, D_MODEL)

The positional encodings should be added to the token embeddings in the first stage of the Transformer.

You should now complete the `__init__` and `call` methods for the following custom layer `InputEmbeddings`, which converts the integer token sequence into a sequence of embeddings and then adds the positional encodings.

* The initialiser takes the following required arguments:
    * `d_model`: the dimension of the embedding vectors
    * `pos_encodings`: the Tensor of positional encodings of shape `(max_seq_len, d_model)`, as computed by the `positional_encodings` function
    * `max_tokens`: The maximum number of integer tokens used in the input (including the masking and OOV tokens)
* The custom layer should create an `Embedding` layer with the correct input and output dimensions, that is set to mask incoming zero tokens. This layer should be set as the instance attribute `embedding`
* The `call` method takes a Tensor of integer tokens of shape `(batch_size, n)` as input, where `n` is the maximum sequence length in the batch
* The `call` method should use the `Embedding` lookup layer to convert the inputs to embedding vectors, and return the sum of the embedding vectors and positional encodings

_NB: The custom layer also implements the `compute_mask` method, which has been completed for you. This method should be implemented whenever a layer should produce a mask (see [this guide](https://www.tensorflow.org/guide/keras/masking_and_padding) for more information). Note that this method references `self.embedding`. In this instance we want to pass on the same mask that is produced by the `Embedding` layer, so the `compute_mask` method simply calls the existing method from the `Embedding` layer._

In [13]:
#### GRADED CELL ####

# Complete the following class.
# Make sure not to change the class name, method names or arguments.

class InputEmbeddings(Layer):
    """
    This custom layer should take a batch of integer tokens as input, 
    convert them into embeddings and add the positional encodings.
    """
    
    def __init__(self, d_model, pos_encodings, max_tokens, name='input_embeddings', **kwargs):
        super().__init__(name=name, **kwargs)
        self.pos_encodings = pos_encodings
        self.embedding = Embedding(max_tokens, d_model, mask_zero=True)
        
    def compute_mask(self, inputs, mask=None):
        return self.embedding.compute_mask(inputs)
        
    def call(self, inputs):
        """
        inputs is an integer Tensor of shape (batch_size, n), where n is 
        the maximum sequence length in this batch (n <= max_seq_len)
        """       
        n = tf.shape(inputs)[-1]
        pos_encodings = self.pos_encodings[:n, :]
        h = self.embedding(inputs)
        return h + pos_encodings

In [14]:
# Create an instance of your custom layer

input_embeddings = InputEmbeddings(D_MODEL, pos_encodings, MAX_TOKENS)

In [15]:
# Test your custom layer on an input

for tokens, _ in train_data.take(1):
    h = input_embeddings(tokens)

In [16]:
# Check that our custom layer is producing a mask

h._keras_mask

<tf.Tensor: shape=(32, 200), dtype=bool, numpy=
array([[ True,  True,  True, ..., False, False, False],
       [ True,  True,  True, ..., False, False, False],
       [ True,  True,  True, ..., False, False, False],
       ...,
       [ True,  True,  True, ..., False, False, False],
       [ True,  True,  True, ..., False, False, False],
       [ True,  True,  True, ..., False, False, False]])>

#### Encoder block

You will now implement the encoder block of the Transformer model. This block consists of a multi-head attention block with a residual connection followed by layer normalisation, and then a pointwise feedforward network with residual connection again followed by layer normalisation.

The multi-head attention block will need to account for the masking corresponding to the zero padding tokens in the input. The way the `MultiHeadAttention` layer handles this is through the `attention_mask` argument when it is called. 

The incoming mask is a boolean Tensor with shape `(batch_size, seq_len)`, where `seq_len` is the length of the sequence of embedding vectors being input to the `MultiHeadAttention` layer. The multi-head attention is performing self-attention, so the shape of the mask required by the `attention_mask` argument will be `(batch_size, seq_len, seq_len)`.

Before implementing the encoder block, you should complete the following function `get_attention_mask`, which takes a single argument `mask`, which is a boolean Tensor of shape `(batch_size, seq_len)`, or `None`. If `mask` is `None`, then this function should return `None`. Otherwise, the function should return a boolean Tensor of shape `(batch_size, seq_len, seq_len)` which will be used by the `MultiHeadAttention` layer.

For a single example in the batch, suppose the vector mask is $\mathbf{m}\in\{0,1\}^n$, where $n$ is the sequence length. We would like to convert this vector mask to a matrix mask $\mathbf{M}\in\{0,1\}^{n\times n}$, where the $i,j$-th element is $M_{ij} = m_i \wedge m_j$; that is, position $i$ may attend to position $j$ only if both are real tokens. Because the padding always follows the real tokens, this is the same as taking element $\max(i,j)$ of the vector $\mathbf{m}$.

For example, if the incoming boolean mask was as follows:

```
mask = [[True, True, False],
        [True, False, False]]
```

where the batch size is 2 and the sequence length is 3, then the mask returned by the function `get_attention_mask` should look like:

```
returned_mask = [[[True, True, False],
                  [True, True, False],
                  [False, False, False]],
                 [[True, False, False],
                  [False, False, False],
                  [False, False, False]]]
```
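
One way to see where the expected output comes from: entry `[b][i][j]` of the returned mask is `True` only when both token `i` and token `j` of example `b` are real (unpadded) tokens. The plain-Python sketch below spells this out with nested loops. It is an illustration on Python lists rather than the graded TensorFlow function, and the name `attention_mask_reference` is hypothetical.

```python
def attention_mask_reference(mask):
    """Expand a (batch, seq_len) boolean mask to (batch, seq_len, seq_len).

    Entry [b][i][j] is True only if tokens i and j of example b are both real.
    """
    return [[[m_i and m_j for m_j in row] for m_i in row] for row in mask]

mask = [[True, True, False],
        [True, False, False]]
returned_mask = attention_mask_reference(mask)
for example in returned_mask:
    for row in example:
        print(row)
    print()
```

A broadcasting implementation on Tensors should produce the same values as this loop version, which makes it a convenient reference when testing.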

In [17]:
#### GRADED CELL ####

# Complete the following function.
# Make sure not to change the function name or arguments.

def get_attention_mask(mask=None):
    """
    This function should compute the attention mask as described above.
    The function should then return the boolean Tensor.
    """
    if mask is None:
        return None
    mask1 = mask[:, :, None]
    mask2 = mask[:, None, :]
    return mask1 & mask2

In [18]:
# Test your mask function

input_mask = tf.constant([[True, True, False], [True, False, False]])
get_attention_mask(input_mask)

<tf.Tensor: shape=(2, 3, 3), dtype=bool, numpy=
array([[[ True,  True, False],
        [ True,  True, False],
        [False, False, False]],

       [[ True, False, False],
        [False, False, False],
        [False, False, False]]])>

You should complete the following `EncoderBlock` custom layer, that implements the encoder block. 

* The initialiser takes the following required arguments:
    * `num_heads`: the number of attention heads to use in the multi-head attention
    * `key_dim`: the key (and query and value) dimension to use in the multi-head attention
    * `d_model`: the embedding dimension of the Transformer
    * `ff_dim`: the width of the hidden layer in the feedforward network in the encoder block
* The initialiser sets `self.supports_masking = True` (this has been done for you), so that the incoming boolean mask will also be included in the output of the `EncoderBlock` layer
* The `EncoderBlock` layer should create `MultiHeadAttention`, `LayerNormalization` and `Dense` layers in the initializer as required 
    * The feedforward network should have one hidden layer of size `ff_dim` with a ReLU activation
    * The output layer of the feedforward network should have size `d_model` and no activation
* The operation of the custom layer is as follows:
    * The input to the layer is a Tensor of shape `(batch_size, seq_len, d_model)`
    * The call method also takes a `mask` argument, which will be a boolean mask of shape `(batch_size, seq_len)`. The call method should use your `get_attention_mask` function to compute the attention mask
    * Pass the input through the `MultiHeadAttention` layer (performing self-attention), passing the attention mask in the `attention_mask` argument. Add the resulting Tensor output to the input (residual connection) and pass through a `LayerNormalization` layer
    * Pass the resulting Tensor $h$ through the feedforward network, add this to $h$ (residual connection) and pass through a `LayerNormalization` layer
    * The custom layer should then return the resulting Tensor
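
To illustrate what the attention mask does inside the attention step, here is a single-head scaled dot-product self-attention sketch in NumPy. It is a simplification for intuition, not the Keras `MultiHeadAttention` implementation: there are no learned projections, and masked pairs are suppressed by replacing their scores with a large negative number (`-1e9`, an assumed value) before the softmax. The function name `masked_self_attention` is hypothetical.

```python
import numpy as np

def masked_self_attention(x, attention_mask):
    """Single-head self-attention without learned projections.

    x: (seq_len, d_model) array; attention_mask: (seq_len, seq_len) booleans.
    Returns the attended values and the attention weights.
    """
    d_model = x.shape[-1]
    scores = x @ x.T / np.sqrt(d_model)
    scores = np.where(attention_mask, scores, -1e9)        # suppress masked pairs
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
mask = np.array([True, True, False])             # last token is padding
attention_mask = mask[:, None] & mask[None, :]
output, weights = masked_self_attention(x, attention_mask)
print(weights.round(3))
```

In this sketch the real tokens place zero weight on the padded key. The padded query row ends up with uniform weights, which is harmless as long as later stages ignore padded positions via the propagated mask.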

In [19]:
#### GRADED CELL ####

# Complete the following class.
# Make sure not to change the class name, method names or arguments.

class EncoderBlock(Layer):
    """
    This custom layer should take a Tensor of shape (batch_size, seq_len, d_model) as input.
    It should carry out the operations as described above and return the resulting Tensor.
    """
    
    def __init__(self, num_heads, key_dim, d_model, ff_dim, name='encoder_block', **kwargs):
        super().__init__(name=name, **kwargs)
        self.supports_masking = True  # This will pass on any incoming mask
        self.num_heads = num_heads
        self.key_dim = key_dim
        self.d_model = d_model
        self.ff_dim = ff_dim
        self.multihead_attention = MultiHeadAttention(num_heads, key_dim)
        self.ff = Sequential([
            Dense(ff_dim, activation='relu'),
            Dense(d_model)
        ])
        self.layernorm1 = LayerNormalization()
        self.layernorm2 = LayerNormalization()
        
    def call(self, inputs, mask=None):
        """
        inputs is a Tensor of shape (batch_size, seq_len, d_model)
        """    
        attention_mask = get_attention_mask(mask)
        h = self.multihead_attention(inputs, inputs, attention_mask=attention_mask)
        h = self.layernorm1(inputs + h)
        
        h_ff = self.ff(h)
        return self.layernorm2(h + h_ff)

In [20]:
# Create an EncoderBlock instance

encoder_block = EncoderBlock(num_heads=2, key_dim=16, d_model=32, ff_dim=32)

In [21]:
# Test your layer on a dummy input

inputs = tf.random.normal((16, 200, 32))
h = encoder_block(inputs)

In [22]:
# Test your layer on the output from the input_embeddings layer

for tokens, _ in train_data.take(1):
    h = input_embeddings(tokens)
    h = encoder_block(h)

In [23]:
# Check that the mask has been propagated

h._keras_mask

<tf.Tensor: shape=(32, 200), dtype=bool, numpy=
array([[ True,  True,  True, ...,  True,  True,  True],
       [ True,  True,  True, ..., False, False, False],
       [ True,  True,  True, ..., False, False, False],
       ...,
       [ True,  True,  True, ...,  True,  True,  True],
       [ True,  True,  True, ...,  True,  True,  True],
       [ True,  True,  True, ...,  True,  True,  True]])>

#### Classifier head

The final stage of our Transformer model is the classifier head. This stage consists of the following layers:

* A `GlobalAveragePooling1D` layer, that takes an incoming Tensor of shape `(batch_size, seq_len, d_model)` and reduces out the time axis to produce a Tensor of shape `(batch_size, d_model)`
* A dropout layer
* A dense layer with ReLU activation
* A dropout layer
* A dense layer with a single neuron output and sigmoid activation function

The final dense layer outputs the probability of a positive sentiment label.

You should now complete the following `get_classifier_head` function, which takes the arguments `d_model`, `dropout_rate` and `units`. The function should build and return a Sequential Model object, according to the above specification, where `dropout_rate` is used in both `Dropout` layers and `units` is used to define the width of the intermediate `Dense` layer. The `d_model` input should be used to set the `input_shape` in the first layer.

Note that the [`GlobalAveragePooling1D`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/GlobalAveragePooling1D) layer will automatically use the incoming mask when used in the `Sequential` model, see [the guide](https://www.tensorflow.org/guide/keras/masking_and_padding) here.
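
The effect of the mask on average pooling can be sketched in NumPy. This is an illustration of a masked mean with a hypothetical function name `masked_average_pool`, not the Keras source, and it assumes every sequence contains at least one real token.

```python
import numpy as np

def masked_average_pool(h, mask):
    """Average h over the time axis, counting only unmasked steps.

    h: (batch, seq_len, d_model) floats; mask: (batch, seq_len) booleans.
    """
    weights = mask[..., None].astype(h.dtype)    # (batch, seq_len, 1)
    total = (h * weights).sum(axis=1)            # sum over real tokens only
    count = weights.sum(axis=1)                  # number of real tokens
    return total / count

h = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])  # last step is padding
mask = np.array([[True, True, False]])
pooled = masked_average_pool(h, mask)
print(pooled)            # averages only the first two steps
print(h.mean(axis=1))    # naive mean is skewed by the padded step
```

Without the mask, the values sitting at padded positions would leak into the pooled representation and depend on how much padding a batch happens to contain.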

In [24]:
#### GRADED CELL ####

# Complete the following function.
# Make sure not to change the function name or arguments.

def get_classifier_head(d_model, dropout_rate, units):
    """
    This function should build the classifier head model as described above.
    The function should then return the Model object.
    """
    model = Sequential([
        GlobalAveragePooling1D(input_shape=(None, d_model)),
        Dropout(dropout_rate),
        Dense(units, activation='relu'),
        Dropout(dropout_rate),
        Dense(1, activation='sigmoid')
    ])
    return model

In [25]:
# Create an instance of the classifier head

classifier_head = get_classifier_head(D_MODEL, dropout_rate=0.1, units=20)

In [26]:
# Print the model summary

classifier_head.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 global_average_pooling1d (  (None, 32)                0         
 GlobalAveragePooling1D)                                         
                                                                 
 dropout_1 (Dropout)         (None, 32)                0         
                                                                 
 dense_2 (Dense)             (None, 20)                660       
                                                                 
 dropout_2 (Dropout)         (None, 20)                0         
                                                                 
 dense_3 (Dense)             (None, 1)                 21        
                                                                 
Total params: 681 (2.66 KB)
Trainable params: 681 (2.66 KB)
Non-trainable params: 0 (0.00 Byte)
________________________

In [27]:
# Test your classifier head model on a dummy input

inputs = tf.random.normal((8, 200, D_MODEL))
classifier_head(inputs)

<tf.Tensor: shape=(8, 1), dtype=float32, numpy=
array([[0.48969847],
       [0.4935625 ],
       [0.5036215 ],
       [0.49018058],
       [0.50510937],
       [0.5138685 ],
       [0.49028182],
       [0.5169365 ]], dtype=float32)>

#### Build the Transformer classifier

We now have all the components to build the complete Transformer classifier. You should now complete the following function `get_transformer_classifier` to build and compile the model.

The function takes the arguments `input_embeddings_layer`, `encoder_block_layer` and `classifier_head_layer`. It should use these layers to build a `Sequential` model, and compile it with a suitable loss function and optimizer, and an accuracy metric.
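
Since the classifier head ends in a single sigmoid unit, one natural candidate for the loss is binary cross-entropy. The following plain-Python sketch shows what that loss computes for a handful of predictions. The function name and the clipping constant `eps` are illustrative assumptions, not the Keras implementation.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    losses = []
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
        losses.append(-(y * math.log(p) + (1 - y) * math.log(1.0 - p)))
    return sum(losses) / len(losses)

uninformative = binary_crossentropy([1, 0], [0.5, 0.5])
confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(uninformative, confident_right, confident_wrong)
```

The loss is smallest when the predicted probability is close to the true label and grows quickly for confident mistakes, which is the behaviour wanted when training a probabilistic sentiment classifier.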

In [28]:
#### GRADED CELL ####

# Complete the following function.
# Make sure not to change the function name or arguments.

def get_transformer_classifier(input_embeddings_layer, encoder_block_layer, classifier_head_layer):
    """
    This function should build and compile the Transformer classifier model as described above.
    The function should then return the Model object.
    """
    model = Sequential([
        input_embeddings_layer,
        encoder_block_layer,
        classifier_head_layer
    ])
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

In [29]:
# Get the compiled Transformer classifier model

input_embeddings = InputEmbeddings(D_MODEL, pos_encodings, MAX_TOKENS)
encoder_block = EncoderBlock(num_heads=2, key_dim=16, d_model=32, ff_dim=32)
classifier_head = get_classifier_head(D_MODEL, dropout_rate=0.1, units=20)
transformer = get_transformer_classifier(input_embeddings, encoder_block, classifier_head)

In [30]:
# Test the Transformer classifier model

for tokens, _ in train_data.take(1):
    outputs = transformer(tokens)

#### Train the model

In [31]:
history = transformer.fit(train_data, validation_data=test_data, epochs=2)

Epoch 1/2
Epoch 2/2


#### Test on unlabelled data

The IMDb dataset also contains a split without labels. The following cell loads this dataset split and applies a shuffle.

In [32]:
unsupervised_data = tfds.load("imdb_reviews", split="unsupervised", 
                              read_config=tfds.ReadConfig(try_autocache=False))
unsupervised_data = unsupervised_data.shuffle(1000)

Now let's take a look at some model predictions.

In [33]:
for example in unsupervised_data.take(1):
    print(example['text'].numpy().decode("utf-8"))
    tokens = text_vectorization(example['text'])
    tokens = tokens[tf.newaxis, :MAX_SEQ_LEN]  # Add dummy batch dimension and truncate
    prob = transformer(tokens).numpy().squeeze()
    print(f"\nTransformer probability of positive label: {prob}")

The Alpha Video release seems to be fairly complete with the entire story intact (except for some splicy sections in what was probably a 16mm television print: The story does make sense in this version which has the entire explanation of why the criminals are on the ship in the first place and what the doctor's motivations are.<br /><br />It is mysterious that the film runs about 63 minutes when the main IMDb description has it released at 57 minutes. That's probably incorrect and doesn't represent the original theatrical release, but rather some random individual's timing from a DVD or VHS tape that wasn't complete in the first place.

Transformer probability of positive label: 0.03219905495643616


Congratulations on completing this week's assignment! You have now implemented and trained an encoder-only Transformer classifier model for the task of sentiment prediction.