### Multi-head Attention with 128-dimensional embedding

Author: Jeanne Elizabeth Daniel

November 2019

We employ the multi-head attention encoder of the Transformer (Vaswani et al., 2017) to model the multilingual questions. The Transformer is a new type of encoder-decoder model that relies solely on attention to draw global dependencies between the input and output sequences.

Attention allows the model to focus on different parts of the input sequence at every step of the output sequence. This makes it possible to model dependencies regardless of their distance in the sequences. The architecture contains no recurrence or convolutions, so its training can be parallelized.

The encoder component is a stack of encoders, identical in structure, but all with their own set of weights. Each encoder consists of a self-attention layer, followed by a fully-connected feedforward layer. 

The attention used by the Transformer is the scaled dot-product attention with a set of queries in matrix $\boldsymbol{Q}$, a set of keys in matrix $\boldsymbol{K}$, and a set of values in matrix $\boldsymbol{V}$, and is computed as follows: 

\begin{equation}
    \mathrm{Attention} (\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V}) = \mathrm{softmax} \Bigg(\frac{\boldsymbol{Q} \boldsymbol{K}^{\top}}{ \sqrt{d_K}}\Bigg) \boldsymbol{V},
\end{equation}

where $d_K$ is the dimension of the keys and acts as a scaling factor. Multi-headed attention allows for attention to be aggregated across $h$ different, randomly-initialized representation subspaces. Thus,

\begin{equation}
    \mathrm{MultiHead} (\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V}) = \mathrm{Concat} (\mathrm{head}_1, \dots, \mathrm{head}_h) \boldsymbol{W}^{O},
\end{equation}

where Concat refers to concatenating each head, defined as: 

\begin{equation}
    \mathrm{head}_i = \mathrm{Attention} (\boldsymbol{Q}\boldsymbol{W}^{Q}_i, \boldsymbol{K}\boldsymbol{W}^{K}_i, \boldsymbol{V}\boldsymbol{W}^{V}_i ),
\end{equation}

with $\boldsymbol{W}^{Q}_i \in \mathbb{R}^{d_{\mathrm{model}} \times d_K }, \boldsymbol{W}^K_i \in \mathbb{R}^{d_{\mathrm{model}} \times d_K }$, 
$\boldsymbol{W}_i^V \in \mathbb{R}^{d_{\mathrm{model}} \times d_V}$,
and $\boldsymbol{W}^O \in \mathbb{R}^{h d_V \times d_{\mathrm{model}}}$. 

The scalar $d_V$ represents the dimension of the values and $d_{\mathrm{model}}$ denotes the dimension of the model's embedding space. The heads are independent of one another, so multi-head attention can be computed in parallel and its training distributed across multiple machines. Because attention on its own is insensitive to the order of its inputs, the authors also inject information about the relative and absolute positions of the tokens in the sequence using positional encodings, so that the model can make use of word order.
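As a small illustration of the attention equations above, the following NumPy sketch computes scaled dot-product attention and a two-head multi-head attention for a single toy sequence. The function names, the random weights, and the toy sizes (`seq_len = 5`, `d_model = 8`, `h = 2`) are assumptions made for the example and are not part of the model trained below.

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the row maximum for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_K)) V
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V, weights

def multi_head(X, W_q, W_k, W_v, W_o):
    # one attention call per head, then concatenate and project with W^O
    heads = [attention(X @ wq, X @ wk, X @ wv)[0]
             for wq, wk, wv in zip(W_q, W_k, W_v)]
    return np.concatenate(heads, axis=-1) @ W_o

rng = np.random.default_rng(0)
seq_len, d_model, h = 5, 8, 2
d_k = d_v = d_model // h

X = rng.normal(size=(seq_len, d_model))   # self-attention: Q = K = V = X
W_q = rng.normal(size=(h, d_model, d_k))
W_k = rng.normal(size=(h, d_model, d_k))
W_v = rng.normal(size=(h, d_model, d_v))
W_o = rng.normal(size=(h * d_v, d_model))

out = multi_head(X, W_q, W_k, W_v, W_o)
_, head_weights = attention(X @ W_q[0], X @ W_k[0], X @ W_v[0])
print(out.shape, head_weights.shape)      # (5, 8) (5, 5)
```

Each row of `head_weights` sums to one: it is the distribution over input positions that one query position attends to.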

This is done by adding positional encodings to the input embeddings. The encodings are defined as 
\begin{eqnarray}
PE_{(pos, 2i)}  &  = & \sin (pos/10000^{2i/d_{\mathrm{model}}}),\\
PE_{(pos, 2i+1)} & = & \cos (pos/10000^{2i/d_{\mathrm{model}}}).
\end{eqnarray}
Combining all these elements results in state-of-the-art embeddings that, compared with previous models, have reduced computational complexity per layer, parallelizable computation, and better modelling of long-range dependencies.
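The two positional-encoding formulas can be transcribed directly into NumPy. The sketch below is illustrative only; the function name `sinusoidal_encoding` and the sizes passed to it are assumptions. Dimensions $2i$ and $2i+1$ share the exponent $2i/d_{\mathrm{model}}$, which is written here as `2 * (j // 2) / d_model` for a dimension index `j`.

```python
import numpy as np

def sinusoidal_encoding(max_pos, d_model):
    pos = np.arange(max_pos)[:, None]      # (max_pos, 1)
    j = np.arange(d_model)[None, :]        # (1, d_model)
    # dimensions 2i and 2i+1 use the same frequency 1 / 10000^(2i / d_model)
    angles = pos / np.power(10000.0, 2 * (j // 2) / d_model)
    pe = np.zeros((max_pos, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])  # even dimensions
    pe[:, 1::2] = np.cos(angles[:, 1::2])  # odd dimensions
    return pe

pe = sinusoidal_encoding(max_pos=30, d_model=128)
print(pe.shape)  # (30, 128)
```

At position 0 every sine entry is 0 and every cosine entry is 1, and for any position each sine/cosine pair lies on the unit circle.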

In [1]:
import sys
import os
#sys.path.append(os.path.join("..")) # path to source relative to current directory

In [3]:
import numpy as np
import gensim

In [4]:
import preprocess_data
import pandas as pd

In [24]:
import tensorflow as tf
physical_devices = tf.config.experimental.list_physical_devices("GPU")
if physical_devices:  # only configure memory growth when a GPU is present
    tf.config.experimental.set_memory_growth(physical_devices[0], True)
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import GlobalAveragePooling1D, GlobalMaxPooling1D
from tensorflow.keras.layers import Dense, Dropout, Embedding, LSTM, Bidirectional, TimeDistributed, Input, Flatten, AdditiveAttention

In [49]:
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

In [5]:
data = pd.read_csv('dataset_7B', delimiter = ';', engine = 'python')
data_text = data.loc[data['set'] == 'Train'][['helpdesk_question']]
number_of_classes = data.loc[data['set'] == 'Train']['helpdesk_reply'].value_counts().shape[0]
data = data[['helpdesk_question', 'helpdesk_reply', 'set', 'low_resource']] 

In [6]:
responses = pd.DataFrame(data.loc[data['set'] == 'Train']['helpdesk_reply'].value_counts()).reset_index()
responses['reply'] = responses['index']
responses['index'] = responses.index
responses = dict(responses.set_index('reply')['index'])
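
The cell above relies on `value_counts().reset_index()` producing a column named `index`. That held for the pandas releases available when this notebook was written, but, as far as we are aware, pandas 2.0 and later name the resulting columns differently. As a version-independent sketch of the same idea (most frequent reply gets ID 0, the next gets ID 1, and so on), the hypothetical helper `build_response_ids` below uses only the standard library. The order of replies with equal counts may differ from the pandas result.

```python
from collections import Counter

def build_response_ids(replies):
    # map each distinct reply to an integer ID, ordered by decreasing frequency
    counts = Counter(replies)
    return {reply: idx for idx, (reply, _) in enumerate(counts.most_common())}

toy_replies = ["A", "B", "A", "C", "A", "B"]   # toy data, not from dataset_7B
toy_ids = build_response_ids(toy_replies)
print(toy_ids)  # {'A': 0, 'B': 1, 'C': 2}
```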

In [7]:
len(responses)

89

In [8]:
data_text['index'] = data_text.index
documents = data_text

In [9]:
dictionary = preprocess_data.create_dictionary(data_text, 1, 0.25, 95000) #our entire vocabulary

In [10]:
df_train = data.loc[data['set'] == 'Train']
df_train = df_train.reset_index()[['helpdesk_question', 'helpdesk_reply']]

df_valid = data.loc[data['set'] == 'Valid']
df_valid = df_valid.reset_index()[['helpdesk_question', 'helpdesk_reply']]

df_test = data.loc[data['set'] == 'Test']
df_test = df_test.reset_index()[['helpdesk_question', 'helpdesk_reply']]

df_LR = data.loc[(data['set'] == 'Test') & (data['low_resource'] == 'True') ]
df_LR = df_LR.reset_index()[['helpdesk_question', 'helpdesk_reply']]

In [11]:
df_train.shape

(96412, 2)

In [12]:
unique_words = dictionary

In [13]:
len(unique_words) + 1

57545

In [14]:
max_length = 30
min_token_length = 0

In [15]:
word_to_id, id_to_word = preprocess_data.create_lookup_tables(unique_words)

#### Transforming the input sentence into a sequence of word IDs

In [16]:
train_x_word_ids = []
for question in df_train['helpdesk_question'].apply(preprocess_data.preprocess_question, 
                                                    args = [unique_words, min_token_length]):
    word_ids = preprocess_data.transform_sequence_to_word_ids(question, word_to_id)
    train_x_word_ids.append(np.array(word_ids, dtype = float))
train_x_word_ids = np.stack(train_x_word_ids)
print(train_x_word_ids.shape)
    
val_x_word_ids = []
for question in data['helpdesk_question'].loc[data['set'] == 'Valid'].apply(preprocess_data.preprocess_question, 
                                                                          args = [unique_words, min_token_length]):
    word_ids = preprocess_data.transform_sequence_to_word_ids(question, word_to_id)
    val_x_word_ids.append(np.array(word_ids, dtype = float))
val_x_word_ids = np.stack(val_x_word_ids)

test_x_word_ids = []
for question in data['helpdesk_question'].loc[data['set'] == 'Test'].apply(preprocess_data.preprocess_question, 
                                                                          args = [unique_words, min_token_length]):
    word_ids = preprocess_data.transform_sequence_to_word_ids(question, word_to_id)
    test_x_word_ids.append(np.array(word_ids, dtype = float))
    
test_x_word_ids = np.stack(test_x_word_ids)

LR_x_word_ids = []
for question in data['helpdesk_question'].loc[(data['set'] == 'Test') & 
                                              (data['low_resource'] == 'True')].apply(preprocess_data.preprocess_question, 
                                                                          args = [unique_words, min_token_length]):
    word_ids = preprocess_data.transform_sequence_to_word_ids(question, word_to_id)
    LR_x_word_ids.append(np.array(word_ids, dtype = float))
LR_x_word_ids = np.stack(LR_x_word_ids)

(96412, 30, 1)
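
The helpers in `preprocess_data` are not shown in this notebook. As a rough sketch of what this step does, the hypothetical function `to_word_ids` below maps tokens to IDs and pads or truncates the result to a fixed length. It assumes that out-of-vocabulary tokens are dropped, that ID 0 is reserved for padding, and that padding is added on the right; the actual implementation may differ on any of these points.

```python
def to_word_ids(tokens, word_to_id, max_length=30, pad_id=0):
    # keep known tokens only, truncate to max_length, then right-pad with pad_id
    ids = [word_to_id[t] for t in tokens if t in word_to_id][:max_length]
    return ids + [pad_id] * (max_length - len(ids))

toy_vocab = {"my": 1, "phone": 2, "is": 3, "broken": 4}   # toy vocabulary
toy_ids = to_word_ids("my phone is totally broken".split(), toy_vocab, max_length=6)
print(toy_ids)  # [1, 2, 3, 4, 0, 0]
```

Under these assumptions a question with no known tokens becomes a row of zeros, which is the case handled further below.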


In [17]:
def get_dummies(reply, all_responses):
    
    """ Constructs a one-hot vector for replies
    
    Args:
        reply: query item 
        all_responses: dict containing all the template responses with their corresponding IDs
    
    Return:
        a one-hot vector where the corresponding ID of the reply is the one-hot index
    
    """
    
    Y = np.zeros(len(all_responses), dtype = int)
    Y[all_responses[reply]] += 1
    return Y 

In [18]:
train_y = np.array(list(df_train['helpdesk_reply'].apply(get_dummies, args = [responses])))
valid_y = np.array(list(df_valid['helpdesk_reply'].apply(get_dummies, args = [responses])))
test_y  = np.array(list(df_test['helpdesk_reply'].apply(get_dummies,  args = [responses])))
LR_y    = np.array(list(df_LR['helpdesk_reply'].apply(get_dummies,    args = [responses])))

In [19]:
train_x_word_ids = train_x_word_ids.reshape(train_x_word_ids.shape[:-1])
val_x_word_ids   = val_x_word_ids.reshape(val_x_word_ids.shape[:-1])
test_x_word_ids  = test_x_word_ids.reshape(test_x_word_ids.shape[:-1])
LR_x_word_ids    = LR_x_word_ids.reshape(LR_x_word_ids.shape[:-1])

#### Transform vectors where the input sentence yields a sequence of length 0

In [20]:
train_zero_vectors = np.where(train_x_word_ids.sum(axis = 1) == 0.0)[0]
for t in range(train_zero_vectors.shape[0]):
    train_x_word_ids[train_zero_vectors[t]][0] += 1

In [21]:
val_zero_vectors = np.where(val_x_word_ids.sum(axis = 1) == 0.0)[0]
for t in range(val_zero_vectors.shape[0]):
    val_x_word_ids[val_zero_vectors[t]][0] += 1
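
The two loops above can also be written without explicit iteration. The sketch below is an equivalent vectorized form on a toy array; the helper name `mark_empty_rows` and its `fill_id` argument are assumptions for illustration. It sets the first position of every all-zero row to 1, as the loops do for the training and validation sets.

```python
import numpy as np

def mark_empty_rows(x_word_ids, fill_id=1):
    # rows whose IDs sum to zero contain no known tokens
    x = x_word_ids.copy()
    empty = x.sum(axis=1) == 0
    x[empty, 0] = fill_id
    return x, int(empty.sum())

toy_x = np.array([[3., 7., 0.],
                  [0., 0., 0.],
                  [5., 0., 0.]])
fixed, n_empty = mark_empty_rows(toy_x)
print(n_empty)   # 1
print(fixed[1])  # [1. 0. 0.]
```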

### Building the encoder (from the Transformer)

Original code obtained from https://www.tensorflow.org/tutorials/text/transformer with minor adaptations

In [25]:
def get_angles(pos, i, d_model):
    
    """ Multiplying angle rates and positions gives a map of the position encoding angles as a 
    function of depth. The angle rates range from 1 [rads/step] to min_rate [rads/step] over the 
    vector depth.
    
    Args:
        pos: vector of positions
        i: vector of embedding dimension indices
        d_model: dimension of embedding vector
        
    Returns:
        Vector of angle radians
    
    """
    
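    # NOTE: the exponent uses 2*i for every dimension index i, whereas the formula in
    # Vaswani et al. (2017) uses 2*(i//2) so that each sine/cosine pair shares one frequency.
    angle_rate = 1/np.power(10000, ((2*i)/np.float32(d_model)))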
    return pos * angle_rate

def positional_encoding(position, d_model):
    
    """ Calculate positional encodings to inject information about relative and absolute positions.
    The positional encodings are obtained by taking the sine and cosine of the angle radians.
    
    Args:
        position: maximum position encoding
        d_model: dimension of embedding vector
    
    Returns:
        A positional encoding vector
    
    """
    
    angle_rads = get_angles(np.arange(position)[:, np.newaxis], 
                            np.arange(d_model)[np.newaxis, :], 
                            d_model)
    
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])
    pos_encoding = angle_rads[np.newaxis, ...]
    return tf.cast(pos_encoding, dtype=tf.float32)

In [26]:
def scaled_dot_product_attention(q, k, v, mask=None):
    
    """ Calculate the attention weights. q, k, v must have matching leading dimensions.
    k, v must have matching penultimate dimension, i.e.: seq_len_k = seq_len_v.
    The mask has different shapes depending on its type(padding or look ahead) 
    but it must be broadcastable for addition.

    Args:
        q: query shape == (..., seq_len_q, depth)
        k: key shape == (..., seq_len_k, depth)
        v: value shape == (..., seq_len_v, depth_v)
        mask: Float tensor with shape broadcastable 
          to (..., seq_len_q, seq_len_k). Defaults to None.

    Returns:
        output, attention_weights
    """

    matmul_qk = tf.matmul(q, k, transpose_b=True)  # (..., seq_len_q, seq_len_k)

    # scale matmul_qk
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
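    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

    # apply the mask (if any) so that masked positions get near-zero weight after the softmax
    if mask is not None:
        scaled_attention_logits += (mask * -1e9)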

    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)  # (..., seq_len_q, seq_len_k)
    output = tf.matmul(attention_weights, v)  # (..., seq_len_q, depth_v)
    return output, attention_weights

class MultiHeadAttention(tf.keras.layers.Layer):
    
    """ Multi-head attention consists of four parts: linear layers that split into heads, 
    scaled dot-product attention, the concatenation of heads, and a final linear layer.

    """
    
    def __init__(self, d_model, num_heads):
        
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model

        assert d_model % self.num_heads == 0

        self.depth = d_model // self.num_heads
        self.wq = tf.keras.layers.Dense(d_model)
        self.wk = tf.keras.layers.Dense(d_model)
        self.wv = tf.keras.layers.Dense(d_model)

        self.dense = tf.keras.layers.Dense(d_model)
        
    def split_heads(self, x, batch_size):
        
        """ Split the last dimension into (num_heads, depth). 
        Transpose the result such that the shape is (batch_size, num_heads, seq_len, depth)
        
        Args:
            x: tensor of shape (batch_size, seq_len, d_model)
            batch_size: number of items in a batch
            
        Returns:
            tensor of shape (batch_size, num_heads, seq_len, depth)
        
        """
        
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])
    
    def call(self, v, k, q, mask):
        
        """ Call function to split the heads of the linear layers. 
        Returns the scaled attention dense layer and attention weights
        
        Args:
            q: query shape == (..., seq_len_q, depth)
            k: key shape == (..., seq_len_k, depth)
            v: value shape == (..., seq_len_v, depth_v)
            mask: float tensor with shape broadcastable to (..., seq_len_q, seq_len_k)
            
        Returns:
            output, attention_weights
        
        """
        
        batch_size = tf.shape(q)[0]

        q = self.wq(q)  # (batch_size, seq_len, d_model)
        k = self.wk(k)  # (batch_size, seq_len, d_model)
        v = self.wv(v)  # (batch_size, seq_len, d_model)

        q = self.split_heads(q, batch_size)  # (batch_size, num_heads, seq_len_q, depth)
        k = self.split_heads(k, batch_size)  # (batch_size, num_heads, seq_len_k, depth)
        v = self.split_heads(v, batch_size)  # (batch_size, num_heads, seq_len_v, depth)

        # scaled_attention.shape == (batch_size, num_heads, seq_len_q, depth)
        # attention_weights.shape == (batch_size, num_heads, seq_len_q, seq_len_k)
        scaled_attention, attention_weights = scaled_dot_product_attention(
            q, k, v, mask)

        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])  # (batch_size, 
                                                                              #seq_len_q, num_heads, depth)
        concat_attention = tf.reshape(scaled_attention, 
                                      (batch_size, -1, self.d_model))  # (batch_size, seq_len_q, d_model)
        output = self.dense(concat_attention)  # (batch_size, seq_len_q, d_model)
        return output, attention_weights


def point_wise_feed_forward_network(d_model, dff):
    
    """ Construct a two-layer feedforward NN with layer dimensions d_model and dff respectively 
    and ReLU activations between layers.
    
    Args:
        d_model: dimension of embedding layer
        dff: dimension of the second layer
    
    Returns:
        A two-layer feedforward NN 
        
    """
    
    return tf.keras.Sequential([
        tf.keras.layers.Dense(dff, activation='relu'),  # (batch_size, seq_len, dff)
        tf.keras.layers.Dense(d_model)  # (batch_size, seq_len, d_model)
    ])

class EncoderLayer(tf.keras.layers.Layer):
    
    """ Each encoder layer consists of Multi-head attention (with padding mask) and pointwise 
    feedforward networks.
   
    """
    
    def __init__(self, d_model, num_heads, dff, rate=0.1):
        super(EncoderLayer, self).__init__()

        self.mha = MultiHeadAttention(d_model, num_heads)
        self.ffn = point_wise_feed_forward_network(d_model, dff)

        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

        # NOTE: the `rate` argument is not used here; dropout in each encoder layer is fixed at 0.1
        self.dropout1 = tf.keras.layers.Dropout(0.1)
        self.dropout2 = tf.keras.layers.Dropout(0.1)
    
    def call(self, x, training=False, mask=None):
        
        """ Constructs the encoder layer.
        
        Args:
            x: input tensor of shape (batch_size, input_seq_len, d_model)
            training: flag indicating training or testing
            mask: float tensor with shape broadcastable to (..., seq_len_q, seq_len_k), or None
        
        """

        attn_output, _ = self.mha(x, x, x, mask)  # (batch_size, input_seq_len, d_model)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(x + attn_output)  # (batch_size, input_seq_len, d_model)

        ffn_output = self.ffn(out1)  # (batch_size, input_seq_len, d_model)
        ffn_output = self.dropout2(ffn_output, training=training)
        out2 = self.layernorm2(out1 + ffn_output)  # (batch_size, input_seq_len, d_model)

        return out2
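
The `mask` argument is threaded through the layers above, but the model in this notebook is built without one (`mask=None`), so padded positions take part in attention like any other token. For reference, the sketch below shows in NumPy how a padding mask is conventionally built and applied to the attention logits. It assumes that ID 0 marks padding; the helper names `padding_mask` and `masked_softmax` are illustrative and not part of the notebook.

```python
import numpy as np

def padding_mask(word_ids, pad_id=0):
    # 1.0 where the token is padding; shape (batch, 1, 1, seq_len) broadcasts
    # over heads and query positions
    return (word_ids == pad_id).astype(np.float32)[:, None, None, :]

def masked_softmax(logits, mask=None):
    if mask is not None:
        logits = logits + mask * -1e9      # push masked logits towards -infinity
    logits = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)

toy_ids = np.array([[4, 9, 0, 0]])         # two real tokens, two padding tokens
toy_mask = padding_mask(toy_ids)           # shape (1, 1, 1, 4)
toy_logits = np.zeros((1, 2, 4, 4))        # (batch, heads, seq_len_q, seq_len_k)
toy_weights = masked_softmax(toy_logits, toy_mask)
print(toy_weights[0, 0, 0])                # [0.5 0.5 0.  0. ]
```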

In [27]:
from tensorflow.keras.constraints import Constraint
from tensorflow.keras import regularizers

In [28]:
class Encoder(tf.keras.layers.Layer):
    
    """ The Encoder consists of an input embedding, summed with positional encoding, and N encoder layers. 
    The summation is the input to the encoder layers. In the original Transformer the output of the encoder 
    is the input to the decoder; in this notebook it is pooled and passed to a classification layer instead.
    
    """
    
    def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size, 
                 maximum_position_encoding, rate=0.1):
        super(Encoder, self).__init__()
        
        self.d_model = d_model
        self.num_layers = num_layers       
        self.embedding = Embedding(input_vocab_size, d_model,)
        self.pos_encoding = positional_encoding(maximum_position_encoding, self.d_model)
        self.enc_layers = [EncoderLayer(d_model, num_heads, dff, rate) for _ in range(num_layers)]
        self.dropout = Dropout(rate)
        
    def call(self, x, training, mask=None):
        
        """ This function constructs the encoder.
        Note we move the dropout to right before the summation (of embedding and positional encodings).
        
        Args: 
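            x: tensor of word IDs with shape (batch_size, input_seq_len)
            training: flag indicating training or testing
            mask: float tensor with shape broadcastable to (..., seq_len_q, seq_len_k), or None
            
        Returns:
            Encoded sequence of shape (batch_size, input_seq_len, d_model)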
        """
        
        seq_len = tf.shape(x)[1]        
        x = self.embedding(x)
        x = self.dropout(x, training = training)
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        x += self.pos_encoding[:, :seq_len, :]
        #x = self.dropout(x, training = training)
        for i in range(self.num_layers):
            x = self.enc_layers[i](x, training, mask)
            
        return x

In [29]:
def multihead_attention_encoder(num_layers, max_features, input_length=30, model_dim=512, dff = 128, 
                                num_heads=4):
    
    """ Constructs a multihead attention encoder model
    
    Args:
        num_layers: number of encoder layers
        max_features: size of vocabulary
        input_length: length of input sequence
        model_dim: dimension of embedding vector
        dff: dimension of second layer in pointwise FFNN
        num_heads: number of heads to split
    
    Returns:
        Model object
    
    """
    
    inputs = Input(shape=(input_length, ))
    x = Encoder(num_layers, model_dim, num_heads, dff, max_features, maximum_position_encoding = 10000, 
                rate=0.5)(inputs)
    x = GlobalAveragePooling1D()(x)
    outputs = Dense(89, activation='softmax')(x)
    return Model(inputs=inputs, outputs=outputs)

#### Multi-head Attention Encoder with Average Pooling

We use average pooling to construct a single feature vector from the variable-length sequence of encodings produced by the MHA encoder. This is then connected to a classification layer. Our MHA encoder has 8 heads and 2 layers. To regularize the model during training, dropout of 50% is applied to the input embeddings; the dropout inside each encoder layer is fixed at 10% in `EncoderLayer`.

In [47]:
max_features = len(unique_words) + 1
num_layers = 2

model = multihead_attention_encoder(num_layers, max_features, input_length=30, model_dim=128,
                                    num_heads=8)

In [48]:
model.summary()

Model: "model_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_3 (InputLayer)         [(None, 30)]              0         
_________________________________________________________________
encoder_2 (Encoder)          (None, 30, 128)           7564928   
_________________________________________________________________
global_average_pooling1d_2 ( (None, 128)               0         
_________________________________________________________________
dense_38 (Dense)             (None, 89)                11481     
Total params: 7,576,409
Trainable params: 7,576,409
Non-trainable params: 0
_________________________________________________________________
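
The parameter counts in the summary can be reproduced by hand from the layer definitions above. The short calculation below is a sanity check, using the vocabulary size (57545), `d_model = 128`, the default `dff = 128`, 2 encoder layers, and 89 output classes; the helper `dense_params` is introduced only for this check. The positional encodings are fixed and contribute no parameters.

```python
def dense_params(n_in, n_out):
    # weights plus biases of a fully-connected layer
    return n_in * n_out + n_out

vocab, d_model, dff, num_layers, classes = 57545, 128, 128, 2, 89

embedding = vocab * d_model                                     # 7,365,760
attention = 4 * dense_params(d_model, d_model)                  # wq, wk, wv and output projection
ffn = dense_params(d_model, dff) + dense_params(dff, d_model)   # two-layer feedforward network
layer_norms = 2 * 2 * d_model                                   # two LayerNorms, each with scale and offset

encoder = embedding + num_layers * (attention + ffn + layer_norms)
classifier = dense_params(d_model, classes)

print(encoder, classifier, encoder + classifier)  # 7564928 11481 7576409
```

Almost all of the parameters sit in the embedding matrix; the two encoder layers together account for just under 200,000.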


### Training

In [50]:
es = EarlyStopping(monitor='val_accuracy', verbose=1, restore_best_weights=True, patience=10)
model.compile(loss='categorical_crossentropy', 
              optimizer=tf.keras.optimizers.Adadelta(learning_rate=0.25, rho=0.95),
              metrics=['accuracy'])

In [51]:
model.fit(train_x_word_ids, train_y, 
          batch_size=32,
          epochs=500,
          callbacks=[es],
          validation_data=(val_x_word_ids, valid_y))

Train on 96412 samples, validate on 31955 samples
Epoch 1/500
...
Epoch 66/500
Epoch 00066: early stopping


<tensorflow.python.keras.callbacks.History at 0x7f4227c7bc50>

In [52]:
def classifier_score_top_1(word_ids, y_true, model):
    
    """ Computes classification accuracy for model.
    
    Args:
        word_ids: matrix where each row is a sequence of word IDs
        y_true: ground truth labels
        model: trained model
    
    Returns:
        None
    
    """
        
    score = 0
    probs = model.predict(word_ids)
    for i in range(word_ids.shape[0]):
        if y_true[i].argmax() == np.argsort(probs[i])[-1]:
            score += 1
        
    print("Overall Accuracy:", score/word_ids.shape[0])

### Validation accuracy

In [53]:
classifier_score_top_1(val_x_word_ids, valid_y, model)

Overall Accuracy: 0.6127053669222344


### Test accuracy 

In [54]:
classifier_score_top_1(test_x_word_ids, test_y, model)

Overall Accuracy: 0.617503800452952


### LR test accuracy

In [55]:
classifier_score_top_1(LR_x_word_ids, LR_y, model)

Overall Accuracy: 0.5441706730769231


### Top-5 accuracy

In [56]:
def classifier_score_top_5(word_ids, y_true, model):
    
    """ Computes top-5 classification accuracy for model.
    
    Args:
        word_ids: matrix where each row is a sequence of word IDs
        y_true: true labels
        model: trained model
        
    Returns:
        None
    
    """
    
    score = 0
    probs = model.predict(word_ids)
    for i in range(word_ids.shape[0]):
        if y_true[i].argmax() in np.argsort(probs[i])[-5:]:
            score += 1
        
    print("Overall Accuracy:", score/word_ids.shape[0])
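
The top-1 and top-5 scoring functions differ only in how many of the highest-probability classes are considered. As a sketch, both can be expressed by one vectorized helper that takes the predicted probabilities directly. The name `top_k_accuracy` and the toy arrays are assumptions for illustration; unlike the functions above it returns the score instead of printing it.

```python
import numpy as np

def top_k_accuracy(probs, y_true, k=1):
    # probs: (n_samples, n_classes) predicted probabilities
    # y_true: (n_samples, n_classes) one-hot ground truth
    true_ids = y_true.argmax(axis=1)
    top_k = np.argsort(probs, axis=1)[:, -k:]          # indices of the k largest probabilities
    hits = (top_k == true_ids[:, None]).any(axis=1)
    return hits.mean()

toy_probs = np.array([[0.1, 0.6, 0.3],
                      [0.5, 0.2, 0.3],
                      [0.2, 0.3, 0.5]])
toy_y = np.array([[0, 1, 0],
                  [0, 0, 1],
                  [1, 0, 0]])
top1 = top_k_accuracy(toy_probs, toy_y, k=1)   # 1 of 3 correct
top2 = top_k_accuracy(toy_probs, toy_y, k=2)   # 2 of 3 correct
```

With a trained model this would be called as `top_k_accuracy(model.predict(test_x_word_ids), test_y, k=5)`.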

In [57]:
classifier_score_top_5(test_x_word_ids, test_y, model)

Overall Accuracy: 0.9068966587038129


In [58]:
classifier_score_top_5(LR_x_word_ids, LR_y, model)

Overall Accuracy: 0.8233173076923077
