# Text classification with Transformer

From [Keras API](https://keras.io/examples/nlp/text_classification_with_transformer/) by [Apoorv Nandan](https://twitter.com/NandanApoorv)

In this example we implement a Transformer encoder block and use it for text classification.


## Dataset IMDB

The dataset consists of film reviews labelled with positive or negative sentiment (Maas et al., 2011).

### Size

* Train: 25,000
* Test: 25,000
* Unlabelled: 50,000


Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).

In [1]:
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences

vocab_size = 20000 # size of lexicon
index_offset = 3
max_len = 200

(x_train, y_train), (x_val, y_val) = imdb.load_data(num_words=vocab_size, index_from=index_offset)

x_train = pad_sequences(x_train, maxlen=max_len)
x_val = pad_sequences(x_val, maxlen=max_len)

# Example label and sequence
print(y_val[0])
print(x_val[0])

# From (https://stackoverflow.com/a/44891281)
word_to_idx = imdb.get_word_index()
word_to_idx = {k: (v+index_offset) for k, v in word_to_idx.items()}
# reserved tokens
word_to_idx["<PAD>"] = 0
word_to_idx["<START>"] = 1
word_to_idx["<UNK>"] = 2
word_to_idx["<UNUSED>"] = 3
idx_to_word = {v:k for k,v in word_to_idx.items()}

def from_seq_to_text(seq):
    return ' '.join(idx_to_word[idx] for idx in seq)

print(from_seq_to_text(x_val[0]))



0
[    0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     1   591   202    14    31     6   717    10    10 18142 10698     5
     4   360     7     4   177  5760   394   354     4   123     9  1035
  1035  1035    10    10    13    92   124    89 

## Architecture
![Architecture network](img/text_classifier_arch.png)

### Parameters
* Embedding size (Embed) = 32
* Number of heads = 2
* Feedforward network dimension = 32
* Dropout = 10 %
* First dense layer units (dim1) = 20
* Last dense layer units (Classes) = 2
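As a rough check on model size, the parameters above can be turned into a hand count of trainable weights. This is only a sketch: it assumes the vocabulary size (20,000) and sequence length (200) set in the data-loading cell, and the `dense_params` helper is a hypothetical name introduced here for illustration.

```python
# Hand count of trainable weights for the parameters listed above.
# Assumption: vocab_size and max_len are the values from the data-loading cell.
vocab_size, max_len = 20000, 200
embed_dim, ff_dim = 32, 32
dense_dim1, num_classes = 20, 2


def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer (hypothetical helper)."""
    return n_in * n_out + n_out


# Word embedding matrix plus positional embedding matrix
embedding = vocab_size * embed_dim + max_len * embed_dim

# Query, key, value and output projections are all embed_dim -> embed_dim
attention = 4 * dense_params(embed_dim, embed_dim)

# Two dense layers of the feed forward network
ffn = dense_params(embed_dim, ff_dim) + dense_params(ff_dim, embed_dim)

# Each of the two LayerNormalization layers learns a scale and an offset vector
layer_norms = 2 * (2 * embed_dim)

transformer_block = attention + ffn + layer_norms
head = dense_params(embed_dim, dense_dim1) + dense_params(dense_dim1, num_classes)

print(embedding, transformer_block, head)  # 646400 6464 702
```

The first two numbers agree with the `Param #` column that `model.summary()` prints for the embedding layer and the transformer block. The number of heads does not appear in the count, because splitting into heads only reshapes the projected tensors.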

### Embedding Block

It consists of the word embeddings and the positional embeddings. Keras provides an embedding layer for both.

The embedding layer takes non-negative integers as input and turns them into dense vectors of the embedding size.

For word embeddings the integers are the indices of the words in the lexicon, whereas for positional embeddings they are the positions in the sequence.

In each embedding layer the model learns one matrix $W_{embed}$ of size (input_dim x embedding_size), where input_dim is the vocabulary size for the word embeddings and the maximum sequence length for the positional embeddings.

The output of the embedding block is the sum of the two lookups, $W_{word} + W_{pos}$: the row of $W_{word}$ for each token plus the row of $W_{pos}$ for its position.
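The lookup-and-sum can be illustrated with plain NumPy. This is a minimal sketch, not the Keras implementation: the two matrices are random stand-ins for weights that the real model learns, and the toy batch of token indices is made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, embed_dim = 20000, 200, 32

# Random stand-ins for the two learned matrices (assumption: trained in the real model)
W_word = rng.normal(size=(vocab_size, embed_dim))
W_pos = rng.normal(size=(max_len, embed_dim))

# A toy batch of 2 sequences of token indices, already padded to max_len
tokens = rng.integers(0, vocab_size, size=(2, max_len))
positions = np.arange(max_len)

word_vectors = W_word[tokens]          # (2, 200, 32): one row of W_word per token
pos_vectors = W_pos[positions]         # (200, 32): identical for every sequence
embedded = word_vectors + pos_vectors  # broadcast over the batch axis

print(embedded.shape)  # (2, 200, 32)
```

Indexing a matrix with an integer array is what an embedding layer does in its forward pass; the positional term is added to every sequence in the batch through broadcasting.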

In [2]:
import tensorflow as tf
from tensorflow.keras.layers import Layer, Embedding

class TokenAndPositionEmbedding(Layer):
    def __init__(self, maxlen, vocab_size, embed_dim):
        super(TokenAndPositionEmbedding, self).__init__()
        self.token_emb = Embedding(input_dim=vocab_size, output_dim=embed_dim)
        self.pos_emb = Embedding(input_dim=maxlen, output_dim=embed_dim)

    def call(self, x):
        maxlen = tf.shape(x)[-1]
        positions = tf.range(start=0, limit=maxlen, delta=1)
        positions = self.pos_emb(positions)
        x = self.token_emb(x)
        return x + positions

## Transformer block


![Transformer block architecture](img/transformer_block_arch.png) 
![Transformer block architecture in the paper](img/transformer_block_paper.png)

### Elements
* Multi-head self-attention (more details below)

* LayerNormalization. It normalises the activations to mean 0 and standard deviation 1. The normalisation is computed per sample, across the last axis (or the axes specified).

* Feed forward network (FFN). An independent network (trained jointly with the rest of the model) that is applied in parallel to every 'word' of the attention output. We use a Sequential model with two dense layers: the first has a ReLU activation, and its output is fed to the second.

    * Equation: $FFN(x) = \max(0, xW_1 + b_1) W_2 + b_2$

    Where $W_1$ has a size of (embed_dim x ffn_dim) and $W_2$ has a size of (ffn_dim x embed_dim).
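The normalisation and the feed forward equation above can be sketched in NumPy as follows. The weights are random placeholders, the sizes are toy values, and the learnable scale and offset of the Keras `LayerNormalization` layer are left out to keep the example short.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    """Normalise each vector along the last axis to mean 0 and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to every position at once."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2


rng = np.random.default_rng(0)
batch, seq_len, embed_dim, ffn_dim = 2, 5, 32, 32

x = rng.normal(size=(batch, seq_len, embed_dim))  # stand-in for the attention output
W1, b1 = rng.normal(size=(embed_dim, ffn_dim)), np.zeros(ffn_dim)
W2, b2 = rng.normal(size=(ffn_dim, embed_dim)), np.zeros(embed_dim)

ffn_out = ffn(x, W1, b1, W2, b2)
out = layer_norm(x + ffn_out)  # residual connection, then normalisation

print(out.shape)  # (2, 5, 32)
```

Because the matrix products act only on the last axis, each position is transformed with the same weights and independently of its neighbours, which is what "applied in parallel to every word" means here.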
    
    
## Multi-head self-attention block
![Multi-head self-attention architecture](img/multi_head_self_attention_arch.png)


### Elements
* Query, Key and Value matrices (all heads concatenated), each produced by a dense layer.
    * Input (BatchxSeqxEmbed)
    * Output (BatchxSeqx(Heads*d_model)). In this example d_model is calculated automatically as `embed_dim // num_heads` (`projection_dim` in the code).
* Separate heads. It reorganises the query, key and value tensors so that we can calculate the attention per head using the batch and heads dimensions as batch dimensions. 
    * Input (BatchxSeqx(Heads*d_model))
    * Output (BatchxHeadsxSeqxd_model)
    * Process: reshape the input to (BatchxSeqxHeadsxd_model), then permute the Seq and Heads dimensions to obtain (BatchxHeadsxSeqxd_model)
* Self attention:
    * Equation: $\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$, where $d_k$ is the size of the last dimension of the key tensor (d_model above)
* Concatenate heads. Again, this reorganises the tensors so that all heads are concatenated, as in the Query, Key and Value matrices.
    * Input (BatchxHeadsxSeqxd_model)
    * Output (BatchxSeqxEmbed)
    * Process: permute the Heads and Seq dimensions, and then reshape to (BatchxSeqxEmbed)
* Multi-head matrix. A dense layer combines the output of all heads into one tensor with the same shape as the embedding tensor, because that is what the feed forward network expects: one vector per word.
    * Equation: $\text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O$
    
    where $\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$
    * The size of $W^O$ is ((Heads*d_model)xEmbed), which here equals (EmbedxEmbed) because Heads*d_model = Embed
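The steps listed above can be traced end to end with a small NumPy sketch. It is an illustration under simplifying assumptions rather than the layer used in this notebook: the projection matrices are random placeholders, biases are omitted, no mask is applied, and the helper names are chosen for this example.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def separate_heads(x, num_heads):
    """(Batch, Seq, Heads*d_model) -> (Batch, Heads, Seq, d_model)."""
    batch, seq_len, embed_dim = x.shape
    x = x.reshape(batch, seq_len, num_heads, embed_dim // num_heads)
    return x.transpose(0, 2, 1, 3)


def concatenate_heads(x):
    """(Batch, Heads, Seq, d_model) -> (Batch, Seq, Heads*d_model)."""
    batch, num_heads, seq_len, d_model = x.shape
    return x.transpose(0, 2, 1, 3).reshape(batch, seq_len, num_heads * d_model)


def attention(q, k, v):
    """Scaled dot-product attention computed independently for every head."""
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(k.shape[-1])
    weights = softmax(scores)  # (Batch, Heads, Seq, Seq)
    return weights @ v, weights


rng = np.random.default_rng(0)
batch, seq_len, embed_dim, num_heads = 2, 5, 32, 2
x = rng.normal(size=(batch, seq_len, embed_dim))

# Random stand-ins for the four learned projection matrices
W_q, W_k, W_v, W_o = (rng.normal(size=(embed_dim, embed_dim)) * 0.1 for _ in range(4))

q, k, v = (separate_heads(x @ W, num_heads) for W in (W_q, W_k, W_v))
heads, weights = attention(q, k, v)
output = concatenate_heads(heads) @ W_o

print(q.shape, weights.shape, output.shape)  # (2, 2, 5, 16) (2, 2, 5, 5) (2, 5, 32)
```

Each row of `weights` sums to 1, so every output vector is a weighted average of the value vectors of its head. `concatenate_heads` is the exact inverse of `separate_heads`, which is why the final tensor has the same shape as the input.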

In [9]:
# Multi-head self-attention
from tensorflow.keras.layers import Dense

def create_look_ahead_mask(size):
    mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
    return mask


class MultiHeadSelfAttention(Layer):
    def __init__(self, embed_dim, num_heads=8):
        super(MultiHeadSelfAttention, self).__init__()
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        if embed_dim % num_heads != 0:
            raise ValueError(
                f"embedding dimension = {embed_dim} should be divisible by number of heads = {num_heads}"
            )
        self.projection_dim = embed_dim // num_heads
        self.query_dense = Dense(embed_dim)
        self.key_dense = Dense(embed_dim)
        self.value_dense = Dense(embed_dim)
        self.combine_heads = Dense(embed_dim)

    def attention(self, query, key, value, mask):
        score = tf.matmul(query, key, transpose_b=True) # (batch_size, num_heads, seq_len, seq_len)
        dim_key = tf.cast(tf.shape(key)[-1], tf.float32) # d_model
        scaled_score = score / tf.math.sqrt(dim_key)
        if mask is not None:
            # look ahead mask
            scaled_score += (mask * -1e9) # -inf for masked values         
        weights = tf.nn.softmax(scaled_score, axis=-1)
        output = tf.matmul(weights, value)
        return output, weights

    def separate_heads(self, x, batch_size):
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.projection_dim))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, inputs, mask=None):
        # inputs.shape = [batch_size, seq_len, embedding_dim]
        batch_size = tf.shape(inputs)[0]
        query = self.query_dense(inputs)  # (batch_size, seq_len, embed_dim)
        key = self.key_dense(inputs)  # (batch_size, seq_len, embed_dim)
        value = self.value_dense(inputs)  # (batch_size, seq_len, embed_dim)
        query = self.separate_heads(
            query, batch_size
        )  # (batch_size, num_heads, seq_len, projection_dim)
        key = self.separate_heads(
            key, batch_size
        )  # (batch_size, num_heads, seq_len, projection_dim)
        value = self.separate_heads(
            value, batch_size
        )  # (batch_size, num_heads, seq_len, projection_dim)
        attention, weights = self.attention(query, key, value, mask)
        attention = tf.transpose(
            attention, perm=[0, 2, 1, 3]
        )  # (batch_size, seq_len, num_heads, projection_dim)
        concat_attention = tf.reshape(
            attention, (batch_size, -1, self.embed_dim)
        )  # (batch_size, seq_len, embed_dim)
        output = self.combine_heads(
            concat_attention
        )  # (batch_size, seq_len, embed_dim)
        return output
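The cell above defines `create_look_ahead_mask`, although the classifier in this notebook never passes a mask to the attention layer. To show what such a mask would do if it were used, here is a NumPy sketch of the same idea; `np.tril` stands in for `tf.linalg.band_part(..., -1, 0)`, and the all-zero scores are a made-up input chosen so the effect is easy to read.

```python
import numpy as np


def look_ahead_mask(size):
    """1 marks a future position to hide, 0 a position that stays visible."""
    return 1.0 - np.tril(np.ones((size, size)))


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


mask = look_ahead_mask(4)
scores = np.zeros((4, 4))  # toy scores: every position equally relevant

# Same trick as in the attention method: add a large negative number where mask == 1
weights = softmax(scores + mask * -1e9)

print(mask)
print(weights.round(2))
```

With equal scores, row `i` of `weights` spreads its attention evenly over positions `0..i` and assigns zero weight to every later position.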

In [10]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LayerNormalization, Dropout

class TransformerBlock(Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super(TransformerBlock, self).__init__()
        self.att = MultiHeadSelfAttention(embed_dim, num_heads)
        self.ffn = Sequential(
            [Dense(ff_dim, activation="relu"), Dense(embed_dim),]
        ) # FeedForward Network
        self.layernorm1 = LayerNormalization(epsilon=1e-6)
        self.layernorm2 = LayerNormalization(epsilon=1e-6)
        self.dropout1 = Dropout(rate)
        self.dropout2 = Dropout(rate)

    def call(self, inputs, training=None):
        attn_output = self.att(inputs)
        # With training=True the layer applies dropout; in inference mode it does nothing.
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output) # Residual connection (+)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)

## Classifier

Now let's create the classifier model.

![Classifier architecture model](img/text_classifier_arch.png)

In [11]:
from tensorflow.keras.layers import Input, GlobalAveragePooling1D
from tensorflow.keras.models import Model

embed_dim = 32  # Embedding size for each token
num_heads = 2  # Number of attention heads
ff_dim = 32  # Hidden layer size in feed forward network inside transformer
num_classes = 2
dense_dim1 = 20 # Reducing dimensions after attention

inputs = Input(shape=(max_len,))
embedding_layer = TokenAndPositionEmbedding(max_len, vocab_size, embed_dim)
x = embedding_layer(inputs)
print('----------------------- dimensions -----------------------')
print('Shape embedding layer: ')
print(x.shape)
transformer_block = TransformerBlock(embed_dim, num_heads, ff_dim)
x = transformer_block(x)
print('Shape Transformer block: ')
print(x.shape)
x = GlobalAveragePooling1D()(x)
print('Shape global average pooling layer: ')
print(x.shape)
x = Dropout(0.1)(x)
print('Shape dropout layer: ')
print(x.shape)
x = Dense(dense_dim1, activation="relu")(x)
print('Shape dense 1 layer: ')
print(x.shape)
x = Dropout(0.1)(x)
print('Shape dropout layer: ')
print(x.shape)
outputs = Dense(num_classes, activation="softmax")(x)
print('Shape dense 2 (output) layer: ')
print(outputs.shape)

model = Model(inputs=inputs, outputs=outputs)
print(model.summary())

----------------------- dimensions -----------------------
Shape embedding layer: 
(None, 200, 32)
Shape Transformer block: 
(None, 200, 32)
Shape global average pooling layer: 
(None, 32)
Shape dropout layer: 
(None, 32)
Shape dense 1 layer: 
(None, 20)
Shape dropout layer: 
(None, 20)
Shape dense 2 (output) layer: 
(None, 2)
Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_3 (InputLayer)         [(None, 200)]             0         
_________________________________________________________________
token_and_position_embedding (None, 200, 32)           646400    
_________________________________________________________________
transformer_block_2 (Transfo (None, 200, 32)           6464      
_________________________________________________________________
global_average_pooling1d (Gl (None, 32)                0         
______________________________________________________________

In [12]:
# Train model
model.compile("adam", "sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=32, epochs=2, validation_data=(x_val, y_val))

Train on 25000 samples, validate on 25000 samples
Epoch 1/2
Epoch 2/2


<tensorflow.python.keras.callbacks.History at 0x7ff83469eb70>

In [13]:
# Predictions
predictions = model.predict(x_val)

In [14]:
# examples
print(from_seq_to_text(x_val[0]))
print(predictions[0])
print(y_val[0])

<PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <START> please give this one a miss br br kristy swanson and the rest of the cast rendered terrible performances the show is flat flat flat br br i don't know how michael madison could have allowed this one o