<a href="https://colab.research.google.com/github/KCL-Health-NLP/nlp_examples/blob/master/ann/transformer_classification_with_keras.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transformers for text classification

***This is the "question" version of this notebook. It differs from the "answer" version in that it runs on a pre-vectorized version of IMDb. You are asked to amend it to run on raw text input.***

Adapted from a [Keras team example](https://keras.io/examples/nlp/text_classification_with_transformer/)

The original has been changed as follows:

* More text cell explanations and code comments
* Inspection of intermediate steps
* Plotting of training loss and accuracy
* Separated out parameters into a single cell
* Some renaming of variables
* Small differences in use of imports
* Code added to adapt to run on raw text input

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.

You may obtain a copy of the License at

[http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

## Introduction

This practical uses Keras to build a transformer-based text classifier, and trains it on the IMDb movie review dataset. It is very similar to the previous CNN practical, and assumes that you have run and understood that practical. Explanations are not always repeated when they were given in the CNN practical.

The initial model does not classify raw text. Instead, it classifies a version of the IMDb dataset that has already been represented as integer vectors. You are asked to amend it to use raw text, by adapting the code from the previous CNN notebook.

## Using with GPUs

The execution time of TensorFlow-based code will benefit from the use of GPUs. To select a GPU runtime in Colab:

* Select the *Runtime* menu
* Select the *Change runtime type* submenu
* In the dialog that appears, under *Hardware accelerator* select *GPU*
* Your existing runtime will disconnect, and you will be allocated and connected to a new GPU runtime.

We will also improve execution time through the way in which we fetch and cache data, in one of the steps below.

## Packages

First, the imports. You will need Keras, the high-level API that ships with TensorFlow, which is itself a widely used neural network library.

**Note if running locally:** in order for the visualisation to work, you will need to have pydot and graphviz installed, e.g. 

```
sudo apt-get install graphviz
pip3 install pydot
```

In [None]:
# Basics
import tensorflow as tf
from tensorflow import keras

# Keras package to handle directories of text
from tensorflow.keras.utils import text_dataset_from_directory

# Model layers - we need these!
from tensorflow.keras import layers

# We use these next two when pre-processing strings
import string
import re

# For plotting
import matplotlib.pyplot as plt

## Parameters

Now let's set up some parameters, such as number of features, embedding dimensions, batch size, epochs etc.

In [None]:
# How many documents in a batch?
batch_size = 32

# Maximum or padded length (in tokens) of a text sequence
sequence_length = 200

# Maximum number of features in our text vector space.
# i.e. how many different tokens in our vocabulary
max_features = 20000

# Dimensions in text embedding
embedding_dim = 32

# Number of training epochs
epochs = 2

## Build a transformer block

We will create new classes to model a transformer block and a transformer embedding layer. Some of you will be familiar with creating classes in Python, but for those of you who are not, here is a brief explanation.

We will define new layers as classes, as this will encapsulate and hide the details. In our code, we will be able to refer to a whole transformer block in one line, without having to write out its component layers each time. If we packaged up the classes, we could use them in other code by importing them.

We define a class like this:

```class TransformerBlock(layers.Layer):```

Inside the class definition we can put methods and data attributes. In the parentheses, we can name any *superclasses*, i.e. classes from which our new class inherits functionality. There are some special methods defined in the class.

The ```__init__(...)``` method is called whenever an object of this class is made. We define the different layers of our class in here. When you do the following, the ```__init__(...)``` method is run:


```newLayer = TransformerBlock(...)```


The ```call(...)``` method defines the structure and computation of our class. Here we connect the layers to each other and to the input. It runs (a) when you apply this layer to the output of another layer while building the model and (b) when the network is run. (You do not call it directly: Keras invokes it from another method, ```__call__```.)


When you do the following, the code in ```__call__(...)``` is run in the last step:


```
x = SomeLayer(...)
y = TransformerBlock(...)
z = y(x)
```


Alternatively, when you run the following functional style of Keras code, an object of the class is created and ```__init__(...)``` is run, then ```__call__ ```  is run on that object:

```
x = SomeLayer(...)
x = TransformerBlock(...)(x)
```
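
The call chain described above can be illustrated with a toy sketch in plain Python. `ToyLayer` and `Scale` are hypothetical names made up for this illustration, not Keras classes, and the real `layers.Layer.__call__` does considerably more bookkeeping than is shown here.

```python
class ToyLayer:
    """Minimal stand-in for a Keras-style layer (hypothetical, illustration only)."""

    def __call__(self, inputs):
        # Using an object as a function runs __call__, which
        # delegates the actual computation to call().
        return self.call(inputs)


class Scale(ToyLayer):
    def __init__(self, factor):
        # Runs once, when the object is created.
        self.factor = factor

    def call(self, inputs):
        # Runs each time the object is used as a function.
        return [self.factor * value for value in inputs]


double = Scale(2)                   # __init__ runs here
result = double([1, 2, 3])          # __call__ runs, which in turn runs call()
also_result = Scale(3)([1, 2, 3])   # create and call in a single line
```

The last line mirrors the functional style `TransformerBlock(...)(x)`: the first pair of parentheses creates the object, the second pair calls it.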




In [None]:
# A transformer block, inherits from Layer
class TransformerBlock(layers.Layer):

    # __init__(...) is called when you create a new object of
    # this class.
    #
    # __init__(...) creates the layers we will need in our block.
    #
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super().__init__()

        # We define some layers on initialisation, but leave building
        # the network until we know what the input is, in call()

        # A multi-head attention layer
        self.att = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)

        # Feed forward consisting of two dense layers
        self.ffn = keras.Sequential(
            [layers.Dense(ff_dim, activation="relu"), layers.Dense(embed_dim),]
        )

        # Normalization layers
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)

        # Dropout layers
        self.dropout1 = layers.Dropout(rate)
        self.dropout2 = layers.Dropout(rate)

    # call(...) is called by __call__(...) which is itself called
    # when you use an object of this class as a function,
    # e.g. someTransformerBlock(x)
    #
    # call(...) combines the layers created by __init__(...)
    # with the inputs from other layers.
    #
    def call(self, inputs, training=None):

        # attention then dropout
        attn_output = self.att(inputs, inputs)
        attn_output = self.dropout1(attn_output, training=training)

        # residual connection: add the attention output to the input,
        # then normalize
        out1 = self.layernorm1(inputs + attn_output)

        # Feed forward then dropout
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)

        # residual connection: add the feed forward output to the
        # previous normalized attention output, then normalize
        return self.layernorm2(out1 + ffn_output)
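
To make the two central steps of the block more concrete, here is a minimal sketch in plain Python of self-attention followed by "add and normalize". It makes several simplifying assumptions that do not hold for the Keras layers: there is a single attention head, queries, keys and values are the raw token vectors with no learned projections, and layer normalization has no learned scale or offset. All function names are hypothetical.

```python
import math

def softmax(scores):
    """Turn a list of scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention without learned projections."""
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        # Similarity of this token to every token in the sequence
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        weights = softmax(scores)
        # Each output is a weighted average of all token vectors
        outputs.append([sum(w * value[i] for w, value in zip(weights, tokens))
                        for i in range(dim)])
    return outputs

def layer_norm(vector, epsilon=1e-6):
    """Rescale one vector to zero mean and roughly unit variance."""
    mean = sum(vector) / len(vector)
    variance = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(variance + epsilon) for v in vector]

def add_and_norm(inputs, sublayer_outputs):
    """Residual connection (element-wise sum) followed by layer normalization."""
    return [layer_norm([a + b for a, b in zip(x, y)])
            for x, y in zip(inputs, sublayer_outputs)]

# Three tokens, each a made-up vector of four numbers
tokens = [[1.0, 0.0, 2.0, -1.0], [0.5, 1.5, -0.5, 0.0], [2.0, 1.0, 0.0, 1.0]]
attended = self_attention(tokens)
out1 = add_and_norm(tokens, attended)
```

Note that the residual step is an element-wise sum, so the output keeps the same shape as the input. This is why `embed_dim` is used for the final `Dense` layer of the feed forward network in the block.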

## Build an embedding layer

We have already seen token embeddings. In transformers, we extend this to also create an embedding that encodes the position of the token. We do this because self-attention on its own has no information about the order of tokens.

A very simple approach is to learn an embedding for each position in the same way we learn a token embedding. This is what we will do here. There are more sophisticated approaches.

Our embedding layer therefore contains two separate embeddings:

* Token embedding
* Token position embedding

We will add these together element-wise, so the output has the same dimension as each individual embedding.
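
As a small worked example, the sketch below uses plain Python lists in place of the two `Embedding` layers. The layer in this notebook sums the token vector and the position vector element-wise, and the sketch does the same. The table values are hand-picked and hypothetical; in the real layer they are learned during training.

```python
# Hypothetical embedding tables with an embedding dimension of 2
token_table = {
    0: [0.0, 0.0],    # padding token
    1: [1.0, 0.5],
    2: [-0.5, 2.0],
}
position_table = [
    [0.1, 0.0],       # position 0
    [0.0, 0.1],       # position 1
    [0.1, 0.1],       # position 2
]

def embed(token_ids):
    """Look up each token and its position, then add the two vectors."""
    output = []
    for position, token_id in enumerate(token_ids):
        tok = token_table[token_id]
        pos = position_table[position]
        output.append([t + p for t, p in zip(tok, pos)])
    return output

# Token 1 appears at positions 0 and 2
embedded = embed([1, 2, 1])
```

Although token 1 appears twice, `embedded[0]` and `embedded[2]` differ, because a different position vector was added each time. This is how the model can tell the two occurrences apart.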

In [None]:
# An embedding layer for tokens and their positions.
class TokenAndPositionEmbedding(layers.Layer):

    # Define the two embeddings
    def __init__(self, maxlen, vocab_size, embed_dim):
        super().__init__()
        self.token_emb = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)
        self.pos_emb = layers.Embedding(input_dim=maxlen, output_dim=embed_dim)

    # Create an embedding for the position, and
    # one for the token. Add them element-wise
    def call(self, x):
        maxlen = tf.shape(x)[-1]
        positions = tf.range(start=0, limit=maxlen, delta=1)
        positions = self.pos_emb(positions)
        x = self.token_emb(x)
        return x + positions

## Build the model

We build the model layers using the Keras [functional API](https://keras.io/guides/functional_api/), as in the CNN practical.

In [None]:
# We will create a network including a transformer block
# and other layers.

num_heads = 2  # Number of attention heads
ff_dim = 32  # Hidden layer size in feed forward network inside transformer

# Input layer, to take vectorized text
inputs = layers.Input(shape=(sequence_length,))

# Token and position embedding
embedding_layer = TokenAndPositionEmbedding(sequence_length, max_features, embedding_dim)
x = embedding_layer(inputs)

# Transformer
transformer_block = TransformerBlock(embedding_dim, num_heads, ff_dim)
x = transformer_block(x)

# Pooling / dropout / dense / dropout / dense layers
x = layers.GlobalAveragePooling1D()(x)
x = layers.Dropout(0.1)(x)
x = layers.Dense(20, activation="relu")(x)
x = layers.Dropout(0.1)(x)
outputs = layers.Dense(2, activation="softmax")(x)
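
The pooling step is what turns one vector per token into one vector per review. As a rough sketch in plain Python (the function name is hypothetical, and the real `GlobalAveragePooling1D` layer works on batched tensors), averaging over the sequence looks like this:

```python
def global_average_pool(sequence):
    """Average a list of token vectors into a single vector."""
    length = len(sequence)
    dim = len(sequence[0])
    return [sum(token[i] for token in sequence) / length for i in range(dim)]

# Three token vectors of dimension 2, as might come out of the transformer block
sequence = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]
pooled = global_average_pool(sequence)   # a single vector of dimension 2
```

Whatever the sequence length, the result has `embedding_dim` values, which is what the dense layers that follow expect.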

Now we have all of our layers, let's put them into a model, by using our input layer and our final predictions layer as parameters. The model wraps up the layers, adding training and inference functionality.

We can then compile our model, i.e. configure it for training by providing parameters for the loss function, optimiser, and metrics we will use.

In [None]:
model = keras.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
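
To give a feel for what the loss measures, here is a minimal sketch of sparse categorical cross-entropy for a single example, in plain Python. The labels are integers (0 or 1), and the model outputs a probability for each class from its softmax layer. The function below is a hypothetical re-implementation for illustration; the Keras version also averages over the batch and guards against taking the log of zero.

```python
import math

def sparse_categorical_crossentropy(label, probabilities):
    """Loss for one example: negative log of the probability of the true class."""
    return -math.log(probabilities[label])

# Hypothetical softmax output for one review: [P(class 0), P(class 1)]
prediction = [0.1, 0.9]

# If the true label is 1, the model was right and the loss is small (about 0.105)
loss_if_right = sparse_categorical_crossentropy(1, prediction)

# If the true label is 0, the model was confidently wrong and the loss is large (about 2.303)
loss_if_wrong = sparse_categorical_crossentropy(0, prediction)
```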

## Dataset

**For the exercise at the end of this notebook, you will need to comment out the cell below. For now, leave it uncommented and run it.**

The previous CNN practical used raw text from the IMDb dataset. In this practical, we will start by using a [version of the IMDb dataset that ships with Keras](https://keras.io/api/datasets/imdb/). This has already been pre-processed, with each review encoded as a sequence of integers. Using such a dataset makes our job a bit easier when developing and experimenting with model architectures, as we do not need to deal with text pre-processing.

Once you have got the model working with the pre-vectorized version of IMDb, you are asked to adapt it to work with raw text.

We read in vectors named as follows:
* ```x_``` : text features
* ```y_``` : labels
* ```_train``` : training
* ```_val``` : validation

In [None]:
# Read in the IMDb dataset from Keras

(x_train, y_train), (x_val, y_val) = keras.datasets.imdb.load_data(num_words=max_features)
print(len(x_train), "Training sequences")
print(len(x_val), "Validation sequences")

# Pad sequences to our maximum length
x_train = keras.utils.pad_sequences(x_train, maxlen=sequence_length)
x_val = keras.utils.pad_sequences(x_val, maxlen=sequence_length)
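
With its default settings, `pad_sequences` pads at the start of a short sequence with zeros, and drops tokens from the start of a long one. The plain Python sketch below mimics that default behaviour for a single sequence; `pad_pre` is a hypothetical helper written for illustration, not part of Keras.

```python
def pad_pre(sequence, maxlen, value=0):
    """Keep the last `maxlen` items, then left-pad with `value`."""
    truncated = sequence[-maxlen:]
    return [value] * (maxlen - len(truncated)) + truncated

short_review = pad_pre([5, 6, 7], maxlen=5)              # [0, 0, 5, 6, 7]
long_review = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)   # [3, 4, 5, 6, 7]
```

One consequence is that for any review longer than `sequence_length` tokens, the model only sees the end of the review.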

## Training the model

**For the exercise at the end of this notebook, you will need to comment out the cell below. For now, leave it uncommented and run it.**

Now let's train it. Keras will evaluate against our validation data at the end of each epoch, showing us loss and accuracy as it goes, and saving these into a ```History``` object. We can use this ```History``` to display the results of each epoch after training has finished.

In [None]:
history = model.fit(
    x_train, y_train, batch_size=batch_size, epochs=epochs, validation_data=(x_val, y_val)
)
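
The per-epoch values are stored in the dictionary `history.history`, with one list per metric. Given the `compile` settings above, the keys should be `loss`, `accuracy`, `val_loss` and `val_accuracy`. Below is a minimal sketch of a helper that formats such a dictionary one epoch per line; `summarise_history` is a hypothetical name, not a Keras function.

```python
def summarise_history(history_dict):
    """Return one formatted line per epoch from a History.history-style dict."""
    lines = []
    n_epochs = len(history_dict["loss"])
    for epoch in range(n_epochs):
        parts = [f"{name}={values[epoch]:.4f}"
                 for name, values in history_dict.items()]
        lines.append(f"epoch {epoch + 1}: " + ", ".join(parts))
    return lines

# After training, in the notebook:
# for line in summarise_history(history.history):
#     print(line)
```

The same lists could instead be passed to `plt.plot` to draw loss and accuracy curves.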

# Exercise

Using the example from the previous CNN notebook,

* Comment out the above two cells (**Dataset** and **Training the model**).
* Write new code to get the IMDb ***text*** dataset from [https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz](https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz)
* Read it into Keras datasets, one each for training, validation and held out testing.
* Preprocess the text, vectorize it, and use it to train the model.
* Evaluate the model against the held out test set.
* **Optional:** visualise models, and write an end-to-end model that will accept text as input and return a classification.
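
As a hint for the preprocessing step, the sketch below shows in plain Python the kind of cleaning that IMDb reviews typically need: lowercasing, removing the HTML line breaks that appear in the raw text, and stripping punctuation. This is an assumption about what your standardisation should do, not the required solution, and `standardise` is a hypothetical name. In the notebook you would express the same steps with TensorFlow string operations so that they can run inside the vectorization layer, as in the CNN practical.

```python
import re
import string

def standardise(text):
    """Lowercase, replace HTML line breaks with spaces, and strip punctuation."""
    lowered = text.lower()
    no_breaks = lowered.replace("<br />", " ")
    return re.sub(f"[{re.escape(string.punctuation)}]", "", no_breaks)

cleaned = standardise("Great film!<br />Loved it.")   # "great film loved it"
```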