### Siamese Triplets with majority class downsampled

Siamese triplet loss training creates embedding spaces where similar items are pulled closer to one another, and dissimilar items are pushed away from one another. Siamese networks were independently introduced by both Bromley et al. (1993) and Baldi and Chauvin (1993) as a similarity-learning algorithm for signature verification and fingerprint verification, respectively.

Instead of predicting a class label, these networks directly measure the similarity between samples of the same and differing classes. This is useful for scenarios where the number of classes is very large or unknown during training, or where there are only a few training samples per class (Chopra et al., 2005).

For the sampling of triplets, we employ a technique called online semi-hard mining (Schroff et al., 2015). For a given minibatch, we first compute the embeddings for all the samples in the minibatch. To make up the triplets for the minibatch, all possible anchor-positive pairs $(\boldsymbol{x}_a, \boldsymbol{x}_p)$ are selected, and each pair is matched with a semi-hard negative $\boldsymbol{x}_n$ that satisfies $D(\boldsymbol{x}_a, \boldsymbol{x}_p) < D(\boldsymbol{x}_a, \boldsymbol{x}_n) < D(\boldsymbol{x}_a, \boldsymbol{x}_p) + m$, where $D(\cdot)$ is the distance function and $m$ is the margin.
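The semi-hard condition can be illustrated with a small NumPy sketch. This is not the notebook's `losses.triplet_semihard_loss` implementation; the function name `semi_hard_negative_mask` and the toy 1-D embeddings with absolute-difference distance are assumptions made purely for illustration.

```python
import numpy as np

def semi_hard_negative_mask(dist, labels, margin):
    """Return a boolean array M where M[a, p, n] is True when (a, p, n) is a
    valid triplet whose negative is semi-hard:
    D(a, p) < D(a, n) < D(a, p) + margin."""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]      # same[i, j]: i and j share a class
    not_self = ~np.eye(len(labels), dtype=bool)
    pos = same & not_self                          # valid (anchor, positive) pairs
    neg = ~same                                    # valid (anchor, negative) pairs
    d_ap = dist[:, :, None]                        # broadcast over negatives
    d_an = dist[:, None, :]                        # broadcast over positives
    semi_hard = (d_ap < d_an) & (d_an < d_ap + margin)
    return pos[:, :, None] & neg[:, None, :] & semi_hard

# Toy minibatch (hypothetical data): four 1-D "embeddings", absolute difference as D.
emb = np.array([0.0, 1.0, 1.3, 5.0])
labels = np.array([0, 0, 1, 1])
dist = np.abs(emb[:, None] - emb[None, :])

mask = semi_hard_negative_mask(dist, labels, margin=0.5)
triplets = [tuple(int(i) for i in t) for t in np.argwhere(mask)]
print(triplets)  # (anchor, positive, negative) index triples
```

In this toy batch only two of the candidate triplets survive: negatives that are closer to the anchor than the positive (hard) or further away than the margin (easy) are filtered out.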

Further, we downsample the majority class (which makes up about 22% of the training set) to allow the model to learn more from the minority classes. 

We train the multi-head attention encoder architecture using Siamese triplet loss.

In [1]:
import sys
import os
#sys.path.append(os.path.join("..")) # path to source relative to current directory

In [3]:
import numpy as np
import gensim

In [33]:
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

In [25]:
import tensorflow as tf
physical_devices = tf.config.experimental.list_physical_devices("GPU")
if physical_devices:  # only configure memory growth when a GPU is present
    tf.config.experimental.set_memory_growth(physical_devices[0], True)
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import GlobalAveragePooling1D, GlobalMaxPooling1D
from tensorflow.keras.layers import Dense, Dropout, Embedding, LSTM, Bidirectional, TimeDistributed, Input, Flatten, AdditiveAttention

In [4]:
import preprocess_data
import losses
import pandas as pd

In [5]:
data = pd.read_csv('dataset_7B', delimiter = ';', engine = 'python')
data_text = data.loc[data['set'] == 'Train'][['helpdesk_question']]
number_of_classes = data.loc[data['set'] == 'Train']['helpdesk_reply'].value_counts().shape[0]
data = data[['helpdesk_question', 'helpdesk_reply', 'set', 'low_resource']] 

In [6]:
responses = pd.DataFrame(data.loc[data['set'] == 'Train']['helpdesk_reply'].value_counts()).reset_index()
responses['reply'] = responses['index']
responses['index'] = responses.index
responses = dict(responses.set_index('reply')['index'])

In [7]:
len(responses)

89

In [8]:
data_text['index'] = data_text.index
documents = data_text

In [9]:
dictionary = preprocess_data.create_dictionary(data_text, 1, 0.25, 95000) #our entire vocabulary

In [10]:
df_train = data.loc[data['set'] == 'Train']
df_train = df_train.reset_index()[['helpdesk_question', 'helpdesk_reply']]
df_train_keep = df_train
#df_train = df_train.drop_duplicates()

df_valid = data.loc[data['set'] == 'Valid']
df_valid = df_valid.reset_index()[['helpdesk_question', 'helpdesk_reply']]

df_test = data.loc[data['set'] == 'Test']
df_test = df_test.reset_index()[['helpdesk_question', 'helpdesk_reply']]

df_LR = data.loc[(data['set'] == 'Test') & (data['low_resource'] == 'True') ]
df_LR = df_LR.reset_index()[['helpdesk_question', 'helpdesk_reply']]

In [11]:
df_train.shape

(96412, 2)

In [12]:
unique_words = dictionary

In [13]:
len(unique_words) + 1

57545

In [14]:
max_length = 30
min_token_length = 0

In [15]:
word_to_id, id_to_word = preprocess_data.create_lookup_tables(unique_words)

#### Transforming the input sentence into a sequence of word IDs

In [16]:
train_x_word_ids = []
for question in df_train['helpdesk_question'].apply(preprocess_data.preprocess_question, 
                                                    args = [unique_words, min_token_length]):
    word_ids = preprocess_data.transform_sequence_to_word_ids(question, word_to_id)
    train_x_word_ids.append(np.array(word_ids, dtype = float))
train_x_word_ids = np.stack(train_x_word_ids)
print(train_x_word_ids.shape)
    
val_x_word_ids = []
for question in data['helpdesk_question'].loc[data['set'] == 'Valid'].apply(preprocess_data.preprocess_question, 
                                                                          args = [unique_words, min_token_length]):
    word_ids = preprocess_data.transform_sequence_to_word_ids(question, word_to_id)
    val_x_word_ids.append(np.array(word_ids, dtype = float))
val_x_word_ids = np.stack(val_x_word_ids)

test_x_word_ids = []
for question in data['helpdesk_question'].loc[data['set'] == 'Test'].apply(preprocess_data.preprocess_question, 
                                                                          args = [unique_words, min_token_length]):
    word_ids = preprocess_data.transform_sequence_to_word_ids(question, word_to_id)
    test_x_word_ids.append(np.array(word_ids, dtype = float))
    
test_x_word_ids = np.stack(test_x_word_ids)

LR_x_word_ids = []
for question in data['helpdesk_question'].loc[(data['set'] == 'Test') & 
                                              (data['low_resource'] == 'True')].apply(preprocess_data.preprocess_question, 
                                                                          args = [unique_words, min_token_length]):
    word_ids = preprocess_data.transform_sequence_to_word_ids(question, word_to_id)
    LR_x_word_ids.append(np.array(word_ids, dtype = float))
LR_x_word_ids = np.stack(LR_x_word_ids)

(96412, 30, 1)


In [17]:
def get_dummies(reply, all_responses):
    
    """ Constructs a one-hot vector for replies
    
    Args:
        reply: query item 
        all_responses: dict containing all the template responses with their corresponding IDs
    
    Return:
        a one-hot vector where the corresponding ID of the reply is the one-hot index
    
    """
    
    Y = np.zeros(len(all_responses), dtype = int)
    Y[all_responses[reply]] += 1
    return Y 

In [18]:
def get_label_id(reply, all_responses):
    
    """ Returns integer ID corresponding to response for easy comparison and classification
    
    Args:
        reply: query item 
        all_responses: dict containing all the template responses with their corresponding IDs
        
    Return: 
        integer corresponding to each response     
        
    """
        
    return all_responses[reply]

In [19]:
train_y = np.array(list(df_train['helpdesk_reply'].apply(get_dummies, args = [responses])))
valid_y = np.array(list(df_valid['helpdesk_reply'].apply(get_dummies, args = [responses])))
test_y  = np.array(list(df_test['helpdesk_reply'].apply(get_dummies,  args = [responses])))
LR_y    = np.array(list(df_LR['helpdesk_reply'].apply(get_dummies,    args = [responses])))

In [20]:
train_x_word_ids = train_x_word_ids.reshape(train_x_word_ids.shape[:-1])
val_x_word_ids   = val_x_word_ids.reshape(val_x_word_ids.shape[:-1])
test_x_word_ids  = test_x_word_ids.reshape(test_x_word_ids.shape[:-1])
LR_x_word_ids    = LR_x_word_ids.reshape(LR_x_word_ids.shape[:-1])

#### Handle vectors where the input sentence yields a sequence of length 0

In [21]:
train_zero_vectors = np.where(train_x_word_ids.sum(axis = 1) == 0.0)[0]
for t in range(train_zero_vectors.shape[0]):
    train_x_word_ids[train_zero_vectors[t]][0] += 1

In [22]:
val_zero_vectors = np.where(val_x_word_ids.sum(axis = 1) == 0.0)[0]
for t in range(val_zero_vectors.shape[0]):
    val_x_word_ids[val_zero_vectors[t]][0] += 1
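The two cells above do not drop the empty sequences; they set the first position of every all-zero row to 1 so that no input is entirely padding. A vectorised sketch of the same step follows. The helper name `mark_empty_sequences` and the `placeholder_id` parameter are hypothetical, and the sketch assumes word IDs are non-negative so that a row sum of zero means the row is all zeros.

```python
import numpy as np

def mark_empty_sequences(word_ids, placeholder_id=1):
    """Set the first position of every all-zero row to `placeholder_id`
    (in place) and return the indices of the rows that were changed."""
    empty_rows = np.where(word_ids.sum(axis=1) == 0.0)[0]
    word_ids[empty_rows, 0] = placeholder_id
    return empty_rows

# Toy padded ID matrix (hypothetical data): the middle row is empty.
toy = np.array([[4.0, 7.0, 0.0],
                [0.0, 0.0, 0.0],
                [9.0, 0.0, 0.0]])
changed = mark_empty_sequences(toy)
print(changed, toy[1])
```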

### Building the encoder (from the Transformer)

Original code obtained from https://www.tensorflow.org/tutorials/text/transformer with minor adaptations.

In [26]:
def get_angles(pos, i, d_model):
    
    """ Multiplying angle rates and positions gives a map of the position encoding angles as a 
    function of depth. The angle rates range from 1 [rads/step] to min_rate [rads/step] over the 
    vector depth.
    
    Args:
        pos: vector of positions
        i: vector of indices along the embedding dimension
        d_model: dimension of embedding vector
        
    Returns:
        Vector of angle radians
    
    """
    
    angle_rate = 1/np.power(10000, ((2*i)/np.float32(d_model)))
    return pos * angle_rate

def positional_encoding(position, d_model):
    
    """ Calculate positional encodings to inject information about relative and absolute positions.
    The positional encodings are obtained by taking the sine and cosine of the angle radians.
    
    Args:
        position: maximum position encoding
        d_model: dimension of embedding vector
    
    Returns:
        A positional encoding vector
    
    """
    
    angle_rads = get_angles(np.arange(position)[:, np.newaxis], 
                            np.arange(d_model)[np.newaxis, :], 
                            d_model)
    
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])
    pos_encoding = angle_rads[np.newaxis, ...]
    return tf.cast(pos_encoding, dtype=tf.float32)
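To see what the encoding looks like, here is a NumPy-only sketch that uses the same angle formula as the cell above, without the TensorFlow cast. The function name `positional_encoding_np` is hypothetical; the sizes match the sequence length (30) and model dimension (128) used later in this notebook.

```python
import numpy as np

def positional_encoding_np(num_positions, d_model):
    """NumPy-only version of the encoding above (same angle formula)."""
    pos = np.arange(num_positions)[:, np.newaxis]   # (num_positions, 1)
    i = np.arange(d_model)[np.newaxis, :]           # (1, d_model)
    angles = pos / np.power(10000, (2 * i) / np.float32(d_model))
    angles[:, 0::2] = np.sin(angles[:, 0::2])       # even columns: sine
    angles[:, 1::2] = np.cos(angles[:, 1::2])       # odd columns: cosine
    return angles[np.newaxis, ...]                  # (1, num_positions, d_model)

pe = positional_encoding_np(num_positions=30, d_model=128)
print(pe.shape)       # one encoding row per position
print(pe[0, 0, :4])   # position 0: sin(0) = 0 in even columns, cos(0) = 1 in odd columns
```

Because every entry is a sine or cosine, the encoding stays within [-1, 1], which is why the embeddings are scaled by the square root of `d_model` before the two are summed in the encoder.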

In [27]:
def scaled_dot_product_attention(q, k, v, mask):
    
    """ Calculate the attention weights. q, k, v must have matching leading dimensions.
    k, v must have matching penultimate dimension, i.e.: seq_len_k = seq_len_v.
    The mask has different shapes depending on its type(padding or look ahead) 
    but it must be broadcastable for addition.

    Args:
        q: query shape == (..., seq_len_q, depth)
        k: key shape == (..., seq_len_k, depth)
        v: value shape == (..., seq_len_v, depth_v)
        mask: Float tensor with shape broadcastable 
          to (..., seq_len_q, seq_len_k). Defaults to None.

    Returns:
        output, attention_weights
    """

    matmul_qk = tf.matmul(q, k, transpose_b=True)  # (..., seq_len_q, seq_len_k)

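    # scale matmul_qk
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

    # add the mask to the scaled logits so that masked positions receive ~zero weight
    if mask is not None:
        scaled_attention_logits += (mask * -1e9)

    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)  # (..., seq_len_q, seq_len_k)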
    output = tf.matmul(attention_weights, v)  # (..., seq_len_q, depth_v)
    return output, attention_weights

class MultiHeadAttention(tf.keras.layers.Layer):
    
    """ Multi-head attention consists of four parts: linear layers that split into heads, 
    scaled dot-product attention, the concatenation of heads, and a final linear layer.

    """
    
    def __init__(self, d_model, num_heads):
        
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model

        assert d_model % self.num_heads == 0

        self.depth = d_model // self.num_heads
        self.wq = tf.keras.layers.Dense(d_model)
        self.wk = tf.keras.layers.Dense(d_model)
        self.wv = tf.keras.layers.Dense(d_model)

        self.dense = tf.keras.layers.Dense(d_model)
        
    def split_heads(self, x, batch_size):
        
        """ Split the last dimension into (num_heads, depth). 
        Transpose the result such that the shape is (batch_size, num_heads, seq_len, depth)
        
        Args:
            x: feed forward layer
            batch_size: number of items in a batch
            
        Returns:
            tuple containing (batch size, number of heads, sequence length, depth)
        
        """
        
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])
    
    def call(self, v, k, q, mask):
        
        """ Call function to split the heads of the linear layers. 
        Returns the scaled attention dense layer and attention weights
        
        Args:
            q: query shape == (..., seq_len_q, depth)
            k: key shape == (..., seq_len_k, depth)
            v: value shape == (..., seq_len_v, depth_v)
            mask: float tensor with shape broadcastable 
            
        Returns:
            output, attention_weights
        
        """
        
        batch_size = tf.shape(q)[0]

        q = self.wq(q)  # (batch_size, seq_len, d_model)
        k = self.wk(k)  # (batch_size, seq_len, d_model)
        v = self.wv(v)  # (batch_size, seq_len, d_model)

        q = self.split_heads(q, batch_size)  # (batch_size, num_heads, seq_len_q, depth)
        k = self.split_heads(k, batch_size)  # (batch_size, num_heads, seq_len_k, depth)
        v = self.split_heads(v, batch_size)  # (batch_size, num_heads, seq_len_v, depth)

        # scaled_attention.shape == (batch_size, num_heads, seq_len_q, depth)
        # attention_weights.shape == (batch_size, num_heads, seq_len_q, seq_len_k)
        scaled_attention, attention_weights = scaled_dot_product_attention(
            q, k, v, mask)

        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])  # (batch_size, 
                                                                              #seq_len_q, num_heads, depth)
        concat_attention = tf.reshape(scaled_attention, 
                                      (batch_size, -1, self.d_model))  # (batch_size, seq_len_q, d_model)
        output = self.dense(concat_attention)  # (batch_size, seq_len_q, d_model)
        return output, attention_weights


def point_wise_feed_forward_network(d_model, dff):
    
    """ Construct a two-layer feedforward NN with layer dimensions d_model and dff respectively 
    and ReLU activations between layers.
    
    Args:
        d_model: dimension of embedding layer
        dff: dimension of the second layer
    
    Returns:
        A two-layer feedforward NN 
        
    """
    
    return tf.keras.Sequential([
        tf.keras.layers.Dense(dff, activation='relu'),  # (batch_size, seq_len, dff)
        tf.keras.layers.Dense(d_model)  # (batch_size, seq_len, d_model)
    ])

class EncoderLayer(tf.keras.layers.Layer):
    
    """ Each encoder layer consists of Multi-head attention (with padding mask) and pointwise 
    feedforward networks.
   
    """
    
    def __init__(self, d_model, num_heads, dff, rate=0.1):
        super(EncoderLayer, self).__init__()

        self.mha = MultiHeadAttention(d_model, num_heads)
        self.ffn = point_wise_feed_forward_network(d_model, dff)

        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

        # NOTE: the `rate` argument is not used here; both dropout layers are fixed at 0.1
        self.dropout1 = tf.keras.layers.Dropout(0.1)
        self.dropout2 = tf.keras.layers.Dropout(0.1)
    
    def call(self, x, training=False, mask=None):
        
        """ Constructs the encoder layer.
        
        Args:
            x: sequential layer
            training: flag indicating training or testing
            mask: float tensor with shape broadcastable 
        
        """

        attn_output, _ = self.mha(x, x, x, mask)  # (batch_size, input_seq_len, d_model)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(x + attn_output)  # (batch_size, input_seq_len, d_model)

        ffn_output = self.ffn(out1)  # (batch_size, input_seq_len, d_model)
        ffn_output = self.dropout2(ffn_output, training=training)
        out2 = self.layernorm2(out1 + ffn_output)  # (batch_size, input_seq_len, d_model)

        return out2
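The core of the cell above is scaled dot-product attention, softmax(q kᵀ / √d_k) v. The sketch below reproduces that computation in plain NumPy (no masking, no heads) so the shapes can be checked on a toy example. The names `softmax` and `attention_np` and the random inputs are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_np(q, k, v):
    """softmax(q k^T / sqrt(d_k)) v for arrays of shape (..., seq_len, depth)."""
    d_k = k.shape[-1]
    logits = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)   # (..., seq_len_q, seq_len_k)
    weights = softmax(logits, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
q = rng.normal(size=(2, 5, 16))   # (batch, seq_len_q, depth)
k = rng.normal(size=(2, 7, 16))   # (batch, seq_len_k, depth)
v = rng.normal(size=(2, 7, 32))   # (batch, seq_len_v, depth_v)

out, w = attention_np(q, k, v)
print(out.shape, w.shape)
```

Each row of the weight matrix sums to 1, so every output position is a convex combination of the value vectors.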

In [28]:
class Encoder(tf.keras.layers.Layer):
    
    """ The Encoder consists of an input embedding, summed with positional encoding, and N encoder layers. 
    The summation is the input to the encoder layers. The output of the encoder is the input to the decoder.
    
    """
    
    def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size, 
                 maximum_position_encoding, rate=0.1):
        super(Encoder, self).__init__()
        
        self.d_model = d_model
        self.num_layers = num_layers       
        self.embedding = Embedding(input_vocab_size, d_model)
        self.pos_encoding = positional_encoding(maximum_position_encoding, self.d_model)
        self.enc_layers = [EncoderLayer(d_model, num_heads, dff, rate) for _ in range(num_layers)]
        self.dropout = Dropout(rate)
        
    def call(self, x, training, mask=None):
        
        """ This function constructs the encoder.
        Note we move the dropout to right before the summation (of embedding and positional encodings).
        
        Args: 
            x: sequential layer
            training: flag indicating training or testing
            mask: float tensor with shape broadcastable 
            
        Returns:
            An encoder model 
        """
        
        seq_len = tf.shape(x)[1]        
        x = self.embedding(x)
        x = self.dropout(x, training = training)
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        x += self.pos_encoding[:, :seq_len, :]
        #x = self.dropout(x, training = training)
        for i in range(self.num_layers):
            x = self.enc_layers[i](x, training=training, mask=mask)
            
        return x

In [29]:
def multihead_attention_encoder(num_layers, max_features, input_length=30, model_dim=512, dff = 128, 
                                num_heads=4):
    
    """ Constructs a multihead attention encoder model
    
    Args:
        num_layers: number of encoder layers
        max_features: size of vocabulary
        input_length: length of input sequence
        model_dim: dimension of embedding vector
        dff: dimension of second layer in pointwise FFNN
        num_heads: number of heads to split
    
    Returns:
        Model object
    
    """
    
    inputs = Input(shape=(input_length, ))
    x = Encoder(num_layers, model_dim, num_heads, dff, max_features, maximum_position_encoding = 10000, 
                rate=0.5)(inputs)
    x = GlobalAveragePooling1D()(x)
    outputs = Dense(300, activation=None)(x)
    return Model(inputs=inputs, outputs=outputs)

#### Multi-head Attention Encoder with Average Pooling

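We use average pooling to construct a single feature vector from the variable-length sequence of encodings produced by the MHA encoder. This vector is then passed to a single dense layer with 300 units and no activation. Our MHA encoder has 8 heads and 2 layers. A dropout rate of 50% is applied to the embedding output to regularize the model during training; the dropout layers inside each encoder layer use a fixed rate of 10% in the code above.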
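The pooling-then-projection step can be sketched in NumPy as follows. The random `encodings` array stands in for the encoder output, and the weights `W` and `b` are hypothetical placeholders for the dense layer's learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, d_model, out_dim = 4, 30, 128, 300

encodings = rng.normal(size=(batch, seq_len, d_model))  # stand-in for the encoder output
pooled = encodings.mean(axis=1)                         # average over the sequence axis

# Hypothetical dense-layer parameters; in the notebook these are learned.
W = rng.normal(size=(d_model, out_dim))
b = np.zeros(out_dim)
sentence_embedding = pooled @ W + b                     # linear projection, no activation

print(pooled.shape, sentence_embedding.shape)
```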

In [30]:
max_features = len(unique_words) + 1
num_layers = 2

model = multihead_attention_encoder(num_layers, max_features, input_length=30, model_dim=128,
                                    num_heads=8)

In [31]:
model.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 30)]              0         
_________________________________________________________________
encoder (Encoder)            (None, 30, 128)           7564928   
_________________________________________________________________
global_average_pooling1d (Gl (None, 128)               0         
_________________________________________________________________
dense_12 (Dense)             (None, 300)               38700     
Total params: 7,603,628
Trainable params: 7,603,628
Non-trainable params: 0
_________________________________________________________________
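The parameter counts in the summary can be reproduced by hand, which is a useful check that the architecture is wired as intended. The sketch below assumes the values used in this notebook: a vocabulary of 57545, `model_dim=128`, the default `dff=128`, and 2 encoder layers. The helper `dense_params` is a hypothetical name.

```python
def dense_params(n_in, n_out):
    return n_in * n_out + n_out  # weights + biases

vocab_size, d_model, dff, num_layers, out_dim = 57545, 128, 128, 2, 300

embedding = vocab_size * d_model
attention = 4 * dense_params(d_model, d_model)      # wq, wk, wv and the output projection
feed_forward = dense_params(d_model, dff) + dense_params(dff, d_model)
layer_norms = 2 * (2 * d_model)                     # gamma and beta for each LayerNormalization
encoder = embedding + num_layers * (attention + feed_forward + layer_norms)
head = dense_params(d_model, out_dim)

print(encoder, head, encoder + head)  # 7564928 38700 7603628
```

The embedding table accounts for about 7.37 million of the 7.6 million parameters, so the model size is dominated by the vocabulary rather than by the attention layers.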


### Siamese Triplet Loss Training

We perform the Siamese triplet loss training with a mini-batch size of 256, cosine distance as our distance function, and a margin $m$ of 0.5. The online triplet sampling operates on these same batches of 256; larger batch sizes consumed too much memory.
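For a single triplet, the standard triplet loss is $\max(0, D(\boldsymbol{x}_a, \boldsymbol{x}_p) - D(\boldsymbol{x}_a, \boldsymbol{x}_n) + m)$. The sketch below evaluates it with cosine distance and the margin of 0.5 used here. It illustrates the formula only; the internals of `losses.triplet_semihard_loss` are not shown in this notebook, and the function names and toy vectors are assumptions.

```python
import numpy as np

def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def triplet_loss(anchor, positive, negative, margin=0.5):
    """max(0, D(a, p) - D(a, n) + m) with cosine distance as D."""
    return max(0.0, cosine_distance(anchor, positive)
                    - cosine_distance(anchor, negative) + margin)

# Toy 2-D embeddings (hypothetical data).
anchor = np.array([1.0, 0.0])
positive = np.array([1.0, 0.0])
near_negative = np.array([1.0, 1.0])    # cosine distance to the anchor: 1 - 1/sqrt(2)
far_negative = np.array([-1.0, 0.0])    # cosine distance to the anchor: 2

loss_near = triplet_loss(anchor, positive, near_negative)
loss_far = triplet_loss(anchor, positive, far_negative)
print(loss_near, loss_far)
```

The near negative lies inside the margin and therefore contributes a positive loss, while the far negative already satisfies the margin and contributes nothing.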

In [32]:
loss = losses.triplet_semihard_loss(margin=0.5, metric="cosine")

In [34]:
es = EarlyStopping(monitor='val_loss', verbose=1, restore_best_weights=True, patience=50)
model.compile(loss=loss, optimizer=tf.keras.optimizers.Adadelta(learning_rate= 0.05))

### Balanced Batches

Create balanced batches by downsampling the majority class (which makes up about 22% of the training set).
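As a rough picture of what undersampling the majority class does to the label distribution, consider the NumPy sketch below. It assumes that the majority class is cut down to the size of the smallest class, which is our understanding of `sampling_strategy='majority'`; the exact behaviour of `RandomUnderSampler` inside `BalancedBatchGenerator` may differ. The function name `undersample_majority` and the toy labels are hypothetical.

```python
import numpy as np

def undersample_majority(labels, seed=0):
    """Return sorted indices that keep every other sample and a random subset
    of the majority class, cut down to the size of the smallest class."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    majority = classes[np.argmax(counts)]
    target = counts.min()
    majority_idx = np.flatnonzero(labels == majority)
    kept_majority = rng.choice(majority_idx, size=target, replace=False)
    other_idx = np.flatnonzero(labels != majority)
    return np.sort(np.concatenate([other_idx, kept_majority]))

# Toy label vector (hypothetical data): class 0 dominates.
labels = np.array([0] * 10 + [1] * 4 + [2] * 3)
kept = undersample_majority(labels)
print(np.bincount(labels[kept]))  # class counts after undersampling
```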

In [36]:
#pip install -U imbalanced-learn

In [37]:
from imblearn.keras import BalancedBatchGenerator
from imblearn.under_sampling import NearMiss, RandomUnderSampler

In [38]:
training_generator = BalancedBatchGenerator(train_x_word_ids, 
                                            np.array(df_train['helpdesk_reply'].apply(get_label_id, 
                                                                                      args = [responses])),
                                            sampler = RandomUnderSampler(sampling_strategy='majority'),                                        
                                            batch_size = 256)

In [39]:
# Model.fit accepts generators directly in TF >= 2.1, where fit_generator is deprecated
model.fit(training_generator, steps_per_epoch=360, epochs=1000,
          callbacks=[es],
          validation_data=(val_x_word_ids, np.array(df_valid['helpdesk_reply'].apply(get_label_id,
                                                                                     args = [responses]))))

Epoch 1/1000
...
Epoch 195/1000
Epoch 00195: early stopping


<tensorflow.python.keras.callbacks.History at 0x7f74694d82e8>

In [40]:
def label_preprocess(entry):
    """ Maps a reply to its integer ID; replies not seen in training get a default unknown class """
    # dict.get with a default replaces the explicit `!= None` check
    return responses.get(entry, len(responses))

In [41]:
x_train = model.predict(train_x_word_ids)
y_train = df_train_keep['helpdesk_reply'].apply(label_preprocess)

In [42]:
x_valid = model.predict(val_x_word_ids)
y_valid = df_valid['helpdesk_reply'].apply(label_preprocess)

In [43]:
x_test = model.predict(test_x_word_ids)
y_test = df_test['helpdesk_reply'].apply(label_preprocess)

In [44]:
x_LR = model.predict(LR_x_word_ids)
y_LR = df_LR['helpdesk_reply'].apply(label_preprocess)

In [45]:
from sklearn.neighbors import KNeighborsClassifier

In [46]:
def train_knn_model(x_train, y_train, metric, k, weights):
    print(k, 'Nearest Neighbours')
    clf = KNeighborsClassifier(n_neighbors=k, weights= weights, metric = metric)
    clf.fit(x_train, y_train)
    #print("Train accuracy", clf.score(x_train, y_train))
        
    return clf
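To make the role of `weights='distance'` concrete, here is a small NumPy sketch of a distance-weighted k-NN vote with cosine distance. It is not scikit-learn's implementation: the function `knn_predict`, the clipping of very small distances, and the toy data are assumptions for illustration.

```python
import numpy as np

def knn_predict(x_train, y_train, x_query, k):
    """Distance-weighted k-NN vote with cosine distance for a single query."""
    train_unit = x_train / np.linalg.norm(x_train, axis=1, keepdims=True)
    query_unit = x_query / np.linalg.norm(x_query)
    dist = 1.0 - train_unit @ query_unit
    nearest = np.argsort(dist)[:k]
    weights = 1.0 / np.maximum(dist[nearest], 1e-12)   # closer neighbours get larger votes
    votes = {}
    for label, weight in zip(y_train[nearest], weights):
        votes[int(label)] = votes.get(int(label), 0.0) + weight
    return max(votes, key=votes.get)

# Toy embeddings (hypothetical data): two class-0 points near the query, three class-1 points far away.
x_train = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]])
y_train = np.array([0, 0, 1, 1, 1])
query = np.array([0.8, 0.2])

pred_1nn = knn_predict(x_train, y_train, query, k=1)
pred_5nn = knn_predict(x_train, y_train, query, k=5)
print(pred_1nn, pred_5nn)
```

With `k=5` an unweighted majority vote would pick class 1 (three neighbours against two), but the inverse-distance weights let the two very close class-0 neighbours dominate.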

### Validation accuracy

In [47]:
clf_1NN = train_knn_model(x_train = x_train, y_train = y_train, metric = 'cosine', 
                          k = 1, weights = 'distance')
score = clf_1NN.score(x_train, y_train)
print("Train accuracy", score)
score = clf_1NN.score(x_valid, y_valid)
print("Validation accuracy", score)

1 Nearest Neighbours
Train accuracy 0.9678048375720865
Validation accuracy 0.5196682835237052


In [48]:
clf_5NN = train_knn_model(x_train = x_train, y_train = y_train, metric = 'cosine', 
                          k = 5, weights = 'distance')
score = clf_5NN.score(x_valid, y_valid)
print("Validation accuracy", score)

5 Nearest Neighbours
Validation accuracy 0.5605695509309967


In [49]:
clf_25NN = train_knn_model(x_train = x_train, y_train = y_train, metric = 'cosine', 
                          k = 25, weights = 'distance')
score = clf_25NN.score(x_valid, y_valid)
print("Validation accuracy", score)

25 Nearest Neighbours
Validation accuracy 0.5897355656391801


In [50]:
clf_50NN = train_knn_model(x_train = x_train, y_train = y_train, metric = 'cosine', 
                          k = 50, weights = 'distance')
score = clf_50NN.score(x_valid, y_valid)
print("Validation accuracy", score)

50 Nearest Neighbours
Validation accuracy 0.591237677984666


### Test score 

In [51]:
score = clf_1NN.score(x_test, y_test)
print("Test accuracy on 1-NN", score)
score = clf_5NN.score(x_test, y_test)
print("Test accuracy on 5-NN", score)
score = clf_25NN.score(x_test, y_test)
print("Test accuracy on 25-NN", score)
score = clf_50NN.score(x_test, y_test)
print("Test accuracy on 50-NN", score)

Test accuracy on 1-NN 0.5270375081438278
Test accuracy on 5-NN 0.5672757732758353
Test accuracy on 25-NN 0.592870660503211
Test accuracy on 50-NN 0.5944218657897186


### LR test score

In [52]:
score = clf_1NN.score(x_LR, y_LR)
print("LR Test accuracy on 1-NN", score)
score = clf_5NN.score(x_LR, y_LR)
print("LR Test accuracy on 5-NN", score)
score = clf_25NN.score(x_LR, y_LR)
print("LR Test accuracy on 25-NN", score)
score = clf_50NN.score(x_LR, y_LR)
print("LR Test accuracy on 50-NN", score)

LR Test accuracy on 1-NN 0.42367788461538464
LR Test accuracy on 5-NN 0.4690504807692308
LR Test accuracy on 25-NN 0.5120192307692307
LR Test accuracy on 50-NN 0.5141225961538461


### Assessing the quality of cross-lingual embeddings

We design a small experiment to assess the quality of the cross-lingual embeddings for English and Zulu. The translations were obtained using Google Translate and verified by a Zulu speaker. We compute the sentence embedding for each English-Zulu translation pair and calculate the cosine distance between the two embeddings.
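As a reminder of how to read the numbers, cosine distance is 1 minus the cosine similarity: it is close to 0 for vectors pointing in the same direction, 1 for orthogonal vectors, and 2 for opposite vectors. A minimal NumPy sketch follows; the helper name and the toy vectors are hypothetical, and the inputs are flattened so that arrays of shape (1, d) are accepted.

```python
import numpy as np

def cosine_distance(u, v):
    u, v = np.ravel(u), np.ravel(v)    # accept (1, d) arrays as well as 1-D vectors
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([[1.0, 2.0, 3.0]])
same_direction = cosine_distance(a, 2 * a)
orthogonal = cosine_distance(a, np.array([[3.0, 0.0, -1.0]]))
opposite = cosine_distance(a, -a)
print(same_direction, orthogonal, opposite)
```

A translation pair that the model embeds well should therefore score near 0, while a score near 1 means the two embeddings are almost unrelated.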

In [71]:
eng_A  = "can you drink coca cola when you are pregnant"
zulu_A = "ungayiphuza yini i-coca cola uma ukhulelwe"

eng_B  = "when can i stop breastfeeding"
zulu_B = "ngingakuyeka nini ukuncelisa ibele"

eng_C  = "when can I start feeding my baby solid food"
zulu_C = "ngingaqala nini ukondla ingane yami ukudla okuqinile"

eng_D  = "what are the signs of labour"
zulu_D = "yiziphi izimpawu zokubeletha"

eng_E  = "when can I learn the gender of my baby"
zulu_E = "ngingabazi ubulili bengane yami"

In [72]:
unique_words['yami']

128

In [73]:
def create_sentence_embeddings(question, model, unique_words, min_token_length, word_to_id):
    q = preprocess_data.preprocess_question(question, unique_words, min_token_length)
    word_ids = preprocess_data.transform_sequence_to_word_ids(q, word_to_id)
    word_ids = np.array(word_ids, dtype = float)
    word_ids = word_ids.reshape((1, word_ids.shape[0]))
    embedding = model.predict(word_ids)
    return embedding    

In [74]:
embed_eng_A = create_sentence_embeddings(eng_A, model, unique_words, min_token_length, word_to_id)
embed_eng_B = create_sentence_embeddings(eng_B, model, unique_words, min_token_length, word_to_id)
embed_eng_C = create_sentence_embeddings(eng_C, model, unique_words, min_token_length, word_to_id)
embed_eng_D = create_sentence_embeddings(eng_D, model, unique_words, min_token_length, word_to_id)
embed_eng_E = create_sentence_embeddings(eng_E, model, unique_words, min_token_length, word_to_id)

In [75]:
embed_zulu_A = create_sentence_embeddings(zulu_A, model, unique_words, min_token_length, word_to_id)
embed_zulu_B = create_sentence_embeddings(zulu_B, model, unique_words, min_token_length, word_to_id)
embed_zulu_C = create_sentence_embeddings(zulu_C, model, unique_words, min_token_length, word_to_id)
embed_zulu_D = create_sentence_embeddings(zulu_D, model, unique_words, min_token_length, word_to_id)
embed_zulu_E = create_sentence_embeddings(zulu_E, model, unique_words, min_token_length, word_to_id)

In [76]:
from scipy.spatial.distance import cosine

In [77]:
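# model.predict returns arrays of shape (1, 300); flatten them because cosine expects 1-D vectors
print("Sentence A:", cosine(embed_eng_A.ravel(), embed_zulu_A.ravel()))
print("Sentence B:", cosine(embed_eng_B.ravel(), embed_zulu_B.ravel()))
print("Sentence C:", cosine(embed_eng_C.ravel(), embed_zulu_C.ravel()))
print("Sentence D:", cosine(embed_eng_D.ravel(), embed_zulu_D.ravel()))
print("Sentence E:", cosine(embed_eng_E.ravel(), embed_zulu_E.ravel()))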

Sentence A: 0.36851388216018677
Sentence B: 0.2721353769302368
Sentence C: 0.1550511121749878
Sentence D: 0.11843031644821167
Sentence E: 0.9600929133594036
