# Text Classification with Transformers

```{article-info}
:avatar: https://avatars.githubusercontent.com/u/25820201?v=4
:avatar-link: https://github.com/PhotonicGluon/
:author: "[Ryan Kan](https://github.com/PhotonicGluon/)"
:date: "Jun 26, 2024"
:read-time: "{sub-ref}`wordcount-minutes` min read"
```

*This notebook is largely inspired by the Keras code example [Text classification with Transformer](https://keras.io/examples/nlp/text_classification_with_transformer/) by Apoorv Nandan.*

In this example, we will classify the sentiment of movie reviews using Keras-MML’s transformer implementation.

:::{note}
We will use the `jax` backend for faster execution of the code. Feel free to ignore the cell below.
:::

In [1]:
import os
os.environ["KERAS_BACKEND"] = "jax"
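
Keras reads the `KERAS_BACKEND` environment variable once, when `keras` is first imported, so the variable has to be set before that import. The following is a minimal sketch of a guard that makes this ordering explicit. The helper name `select_backend` is hypothetical and is not part of Keras.

```python
import os
import sys


def select_backend(name: str) -> bool:
    """Set KERAS_BACKEND if Keras has not been imported yet.

    Returns True when the variable was set, False when it was too late.
    `select_backend` is a hypothetical helper, not a Keras function.
    """
    if "keras" in sys.modules:
        # Keras already chose its backend at import time, so changing
        # the variable now would not affect the running session.
        return False
    os.environ["KERAS_BACKEND"] = name
    return True


if select_backend("jax"):
    print("Backend set to", os.environ["KERAS_BACKEND"])
else:
    print("Keras was already imported; restart the kernel to switch backends")
```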

## Preparing the Data

The dataset we will use is the [IMDB movie review dataset](https://ai.stanford.edu/~amaas/data/sentiment/). The copy bundled with Keras provides 25,000 movie reviews for training and another 25,000 that we will use for validation. Each review is labeled as having positive or negative sentiment.

The dataset is available for importing in Keras, where, for convenience, the reviews have already been preprocessed. Each preprocessed review is encoded as a list of word indices, where the index of a word indicates its frequency rank in the dataset. For example, a word encoded as `3` is the third most frequent word in the dataset. The index `0` is reserved for padding.
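
To make the frequency-rank encoding concrete, here is a toy sketch that builds such an index from a three-sentence corpus. The corpus and variable names are made up for illustration, and this is not Keras's actual preprocessing code: the real encoding also reserves a few low indices for start-of-sequence and unknown-word markers, which are omitted here.

```python
from collections import Counter

# Toy corpus standing in for the reviews (hypothetical data).
corpus = [
    "the film was great",
    "the film was dull",
    "the plot of the film was thin",
]

# Count every word, then order words from most to least frequent.
counts = Counter(word for review in corpus for word in review.split())
ranked = [word for word, _ in counts.most_common()]

# Index 0 is reserved for padding, so ranks start at 1.
word_index = {word: rank for rank, word in enumerate(ranked, start=1)}

# Encode each review as a list of frequency ranks.
encoded = [[word_index[w] for w in review.split()] for review in corpus]

print(word_index)
print(encoded)
```

In this toy corpus, "the" occurs most often, so it receives index `1`.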

For our purposes, we will consider only the top 20000 words. This will be our vocabulary size (`VOCAB_SIZE`).
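
The effect of capping the vocabulary can be sketched in a few lines: any index at or above the cap is replaced by a single out-of-vocabulary marker. The function `cap_vocabulary` and the constants below are illustrative names, and the marker value `2` is an assumption based on the Keras default for this dataset.

```python
TOY_VOCAB_SIZE = 5   # tiny stand-in for the 20000 used in this notebook
OOV_INDEX = 2        # assumed marker for out-of-vocabulary words


def cap_vocabulary(sequence, vocab_size, oov_index=OOV_INDEX):
    """Replace every index >= vocab_size with the OOV marker."""
    return [i if i < vocab_size else oov_index for i in sequence]


review = [1, 14, 3, 9, 4, 250]
print(cap_vocabulary(review, TOY_VOCAB_SIZE))  # [1, 2, 3, 2, 4, 2]
```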

In [2]:
import keras

In [3]:
VOCAB_SIZE = 20000

In [4]:
(x_train, y_train), (x_val, y_val) = keras.datasets.imdb.load_data(num_words=VOCAB_SIZE)

How many sequences did we load?

In [5]:
print(len(x_train), "training sequences")
print(len(x_val), "validation sequences")

25000 training sequences
25000 validation sequences


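We will limit each sequence to a length of 200 (`MAX_LEN`). Sequences longer than `MAX_LEN` will be truncated, while shorter sequences will be padded with zeros to `MAX_LEN`. By default, `pad_sequences` truncates and pads at the start of a sequence, so the last `MAX_LEN` words of a long review are the ones that are kept.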

In [6]:
MAX_LEN = 200

In [7]:
x_train = keras.utils.pad_sequences(x_train, maxlen=MAX_LEN)
x_val = keras.utils.pad_sequences(x_val, maxlen=MAX_LEN)
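
The default behaviour of `pad_sequences` (both `padding` and `truncating` set to `"pre"`) can be mimicked for a single sequence in plain Python. This is only an illustrative sketch; `pad_or_truncate` is a hypothetical helper, not a Keras function.

```python
def pad_or_truncate(sequence, max_len, pad_value=0):
    """Mimic the default ("pre") behaviour of keras.utils.pad_sequences
    for one sequence given as a Python list."""
    if len(sequence) >= max_len:
        # Truncate from the front: keep the last max_len tokens.
        return sequence[len(sequence) - max_len:]
    # Pad at the front so the real tokens sit at the end.
    return [pad_value] * (max_len - len(sequence)) + sequence


print(pad_or_truncate([1, 2, 3], 5))              # [0, 0, 1, 2, 3]
print(pad_or_truncate([1, 2, 3, 4, 5, 6, 7], 5))  # [3, 4, 5, 6, 7]
```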

## Creating the Model

Keras-MML provides a `TransformerBlockMML` layer. It acts similarly to the transformer architecture described in [*Attention Is All You Need*](https://arxiv.org/pdf/1706.03762v7) and outputs one vector per time step of the input. Each output vector is a refined embedding that encodes not only its own token but also information from the rest of the sequence.
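
For intuition, the sketch below implements the standard scaled dot-product attention from the paper for a single head, in plain Python. It illustrates the conventional formulation only. It is not how Keras-MML implements `TransformerBlockMML` internally, and the function names are illustrative.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention for one head.

    Each argument is a list of vectors, one per time step.
    Returns one output vector per query time step.
    """
    d_k = len(keys[0])
    dim = len(values[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
        )
    return outputs


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 time steps, 2 features
out = attention(tokens, tokens, tokens)        # self-attention
print(len(out), "output vectors of size", len(out[0]))
```

Note that the output has one vector per input time step, which matches the behaviour described above.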

In [8]:
import keras_mml

We first specify three hyperparameters for the model.
- The `EMBEDDING_DIM` gives the dimensionality of the embedding vector for each token in the sequence.
- The `NUM_HEADS` gives the number of heads to use in the multi-head attention part of the transformer layer.
- The `FFN_DIM` gives the intermediate (i.e., hidden) layer size of the feed-forward network (FFN) in the transformer.

For this example, we choose small values to keep the model lightweight.

In [9]:
EMBEDDING_DIM = 32
NUM_HEADS = 2
FFN_DIM = 32
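
As a quick sanity check on these values, the sketch below works out how many features each attention head would see. It assumes, as in standard multi-head attention, that the embedding is split evenly across heads; whether `TransformerBlockMML` requires this is not stated here.

```python
EMBEDDING_DIM = 32
NUM_HEADS = 2
FFN_DIM = 32

# Assumption: the embedding is divided evenly among the heads,
# as in conventional multi-head attention.
assert EMBEDDING_DIM % NUM_HEADS == 0, "embedding must split evenly across heads"
head_dim = EMBEDDING_DIM // NUM_HEADS
print("Features per head:", head_dim)  # 16

# Shape flow through a conventional position-wise feed-forward network.
print("FFN:", EMBEDDING_DIM, "->", FFN_DIM, "->", EMBEDDING_DIM)
```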

The architecture of our model is as follows.
- We first create embeddings for the tokens in the sequence. We add embeddings for the positions of the tokens to create an initial embedding.
- This initial embedding will be fed into the transformer block layer. The output will be refined embeddings that should encode more information about the sentence as a whole.
- Afterwards we take the mean (i.e., average) across all time steps using a standard `GlobalAveragePooling1D` layer available in the base Keras package.
- Finally, we will use a fully-connected network (which is several dense layers) on top of it to classify the sentiment of the review.

We will add some dropout in the final fully-connected network to act as regularization and reduce overfitting.
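
The first and third steps above can be sketched with toy numbers: token and position embeddings are added element-wise, and the per-step vectors are later averaged into a single vector. The lookup tables below are made-up values, and the transformer block in between is skipped for brevity.

```python
# Toy lookup tables (hypothetical values): 2 features per embedding.
token_table = {5: [0.5, 1.0], 9: [1.5, -1.0]}
position_table = [[0.0, 0.1], [0.2, 0.3], [0.4, 0.5]]

sequence = [5, 9, 5]  # token 5 appears at positions 0 and 2

# Initial embedding = token embedding + position embedding.
initial = [
    [t + p for t, p in zip(token_table[tok], position_table[pos])]
    for pos, tok in enumerate(sequence)
]

# Global average pooling: mean over time steps, one vector per sequence.
num_features = len(initial[0])
pooled = [
    sum(step[j] for step in initial) / len(initial)
    for j in range(num_features)
]

print(initial)
print(pooled)
```

Because of the position embeddings, the two occurrences of token `5` end up with different initial vectors.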

In [10]:
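model = keras.models.Sequential(
    layers=[
        keras.layers.Input(shape=(MAX_LEN,)),
        # Token embeddings plus position embeddings form the initial embedding
        keras_mml.layers.TokenEmbedding(MAX_LEN, VOCAB_SIZE, EMBEDDING_DIM, with_positions=True),
        # Refine the embeddings with the transformer block
        keras_mml.layers.TransformerBlockMML(EMBEDDING_DIM, FFN_DIM, NUM_HEADS),
        # Average across all time steps to get one vector per review
        keras.layers.GlobalAveragePooling1D(),
        # Fully-connected classifier, with dropout for regularization
        keras.layers.Dropout(0.1),
        keras.layers.Dense(20, activation="relu"),
        keras.layers.Dropout(0.1),
        # One output probability per sentiment class
        keras.layers.Dense(2, activation="softmax"),
    ]
)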

model.summary()
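
To illustrate what the dropout layers do during training, here is a plain-Python sketch of inverted dropout: each value is zeroed with probability `rate`, and the survivors are scaled up by `1 / (1 - rate)` so the expected sum is unchanged. The function is illustrative and is not the Keras implementation; at inference time, dropout is simply inactive.

```python
import random


def dropout(values, rate, rng):
    """Inverted dropout: zero each value with probability `rate`
    and scale the survivors by 1 / (1 - rate)."""
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
activations = [1.0] * 10
dropped = dropout(activations, rate=0.1, rng=rng)
print(dropped)
```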

We will train the model to minimise the sparse categorical crossentropy, which works directly with integer class labels like ours, and we will monitor the accuracy of the model as a metric.

In [11]:
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
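
The "sparse" variant of the loss takes an integer label instead of a one-hot vector, but computes the same quantity: the negative log of the probability assigned to the true class. A small sketch for a single example, using illustrative function names rather than the Keras ones:

```python
import math


def sparse_categorical_crossentropy(label, probs):
    """Loss for one example given an integer class label."""
    return -math.log(probs[label])


def categorical_crossentropy(one_hot, probs):
    """Loss for one example given a one-hot encoded label."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))


probs = [0.2, 0.8]  # hypothetical softmax output for one review
sparse = sparse_categorical_crossentropy(1, probs)
dense = categorical_crossentropy([0, 1], probs)
print(f"{sparse:.4f} {dense:.4f}")  # 0.2231 0.2231
```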

Let's train the model!

In [12]:
model.fit(
    x_train, y_train, batch_size=32, epochs=3, validation_data=(x_val, y_val)
)

Epoch 1/3
782/782 ━━━━━━━━━━━━━━━━━━━━ 31s 36ms/step - accuracy: 0.6802 - loss: 0.5444 - val_accuracy: 0.8592 - val_loss: 0.3278
Epoch 2/3
782/782 ━━━━━━━━━━━━━━━━━━━━ 24s 28ms/step - accuracy: 0.9318 - loss: 0.1858 - val_accuracy: 0.8576 - val_loss: 0.3708
Epoch 3/3
782/782 ━━━━━━━━━━━━━━━━━━━━ 22s 29ms/step - accuracy: 0.9707 - loss: 0.0848 - val_accuracy: 0.8521 - val_loss: 0.4232
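
The `782/782` in the training log is the number of batches per epoch. It follows from the 25,000 training sequences and the batch size of 32, as this short calculation shows:

```python
import math

num_samples = 25000
batch_size = 32

# The final batch is smaller, so the step count rounds up.
steps_per_epoch = math.ceil(num_samples / batch_size)
last_batch_size = num_samples - (steps_per_epoch - 1) * batch_size

print(steps_per_epoch)   # 782
print(last_batch_size)   # 8
```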

How well did the model do?

In [13]:
val_loss, val_acc = model.evaluate(x_val, y_val)
print(f"Validation loss:     {val_loss:.5f}")
print(f"Validation accuracy: {val_acc * 100:.2f}%")

782/782 ━━━━━━━━━━━━━━━━━━━━ 7s 9ms/step - accuracy: 0.8514 - loss: 0.4266
Validation loss:     0.42323
Validation accuracy: 85.21%
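
The per-epoch numbers from the training log can also be read for signs of overfitting. The sketch below copies the logged values into a dictionary by hand (it does not use the real `History` object) and finds the epoch with the lowest validation loss.

```python
# Values copied from the training log above.
history = {
    "accuracy": [0.6802, 0.9318, 0.9707],
    "val_accuracy": [0.8592, 0.8576, 0.8521],
    "val_loss": [0.3278, 0.3708, 0.4232],
}

# Epoch (1-based) with the lowest validation loss.
val_losses = history["val_loss"]
best_epoch = min(range(len(val_losses)), key=lambda i: val_losses[i]) + 1

# Gap between training and validation accuracy after the last epoch.
final_gap = history["accuracy"][-1] - history["val_accuracy"][-1]

print("Best epoch by validation loss:", best_epoch)  # 1
print(f"Final accuracy gap: {final_gap:.4f}")        # 0.1186
```

Validation loss is lowest after the first epoch while training accuracy keeps climbing, which suggests that the later epochs mostly fit the training set. Keras's `EarlyStopping` callback can automate this kind of check.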


## Conclusion

In this notebook, we demonstrated how to use Keras-MML’s `TransformerBlockMML` layer as a matmul-free replacement for the traditional transformer architecture. In a text classification example on the IMDB dataset, the model reached a validation accuracy of 85.21% after three epochs.