# Task-2: Binary Classification of Sentences

## Importing Necessary Libraries

I will be using spaCy for tokenization and word embeddings. The binary classification model is built with TensorFlow, and Scikit-Learn provides the evaluation metrics.

In [1]:
import numpy as np
import spacy
import tensorflow as tf
from sklearn.metrics import f1_score, precision_score, recall_score

import csv

## Data

The model is trained on the following very small, fabricated dataset.

In [2]:
sentences = [
    'you won a billion dollars , great work !',
    'click here for cs685 midterm answers',
    'read important cs685 news',
    'send me your bank account info asap'
]
labels = [1, 1, 0, 1]

## Data Preparation

The **en_core_web_sm** model provided by spaCy is used to calculate the embedding vector for each word. One issue with **en_core_web_sm**, i.e. the small model, is that it does not come with static word vectors or a corresponding vocabulary. Hence there is no way to build an Embedding Matrix, which requires a vocabulary size and token ids. For this reason, the tokens are converted to vectors in a pre-processing step before being input to the model.

In [3]:
nlp = spacy.load("en_core_web_sm") # Loading Spacy Model
EMBED_DIM = len(nlp("Hi").vector) # Spacy's Embedding Dimension Size
MAX_LEN = 10 # The max length of the sentences

In [4]:
tokenizer = nlp.tokenizer
# Tokenizing and embedding the words using the Spacy Model
train_data = [[nlp(str(word)).vector for word in list(tokenizer(sent))] for sent in sentences]

Since the padding and the word-to-vector conversion are done before the data is fed into the model, it is easiest to generate the corresponding mask for the padding tokens during this same step.

In [5]:
mask = list()
for sent in train_data:
    k = len(sent)
    sent_mask = [1 for _ in range(k)] # These positions have a word, so mask value is 1
    for _ in range(MAX_LEN - k):
        sent.append(np.zeros(EMBED_DIM)) # Padding positions get all-zero embedding vectors so the data can be cast into a matrix
        sent_mask.append(0) # The remaining positions don't have a word and are padding, so mask value is 0
    mask.append(sent_mask)
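
The padding step can be checked on toy data. The following is a minimal NumPy sketch, not the notebook's own code: `pad_with_mask` and the toy sizes are hypothetical names and values chosen for illustration. Unlike the loop above, it raises an error when a sentence is longer than the maximum length instead of leaving it unpadded.

```python
import numpy as np

EMBED_DIM = 4  # toy embedding size (the notebook uses 96)
MAX_LEN = 5    # toy maximum length (the notebook uses 10)

def pad_with_mask(vectors, max_len, embed_dim):
    """Pad a list of word vectors to max_len rows and return (matrix, mask)."""
    k = len(vectors)
    if k > max_len:
        raise ValueError("sentence longer than max_len")
    padded = np.zeros((max_len, embed_dim), dtype=np.float32)
    if k:
        padded[:k] = np.asarray(vectors, dtype=np.float32)
    mask = np.zeros(max_len, dtype=np.float32)
    mask[:k] = 1.0  # 1 marks a real token, 0 marks padding
    return padded, mask

# three fake word vectors standing in for spaCy embeddings
toy_sentence = [np.ones(EMBED_DIM) * (i + 1) for i in range(3)]
padded, mask = pad_with_mask(toy_sentence, MAX_LEN, EMBED_DIM)
print(padded.shape)  # (5, 4)
print(mask)          # [1. 1. 1. 0. 0.]
```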

In [6]:
train_data = tf.cast(np.array(train_data), dtype=tf.float32) # Embedded training data
mask = tf.cast(np.array(mask), dtype=tf.float32)[:, tf.newaxis, tf.newaxis, :] # Padding Mask
labels = tf.cast(np.array(labels), dtype=tf.float32) # Labels
print("Train Data Shape:", train_data.shape)
print("Padding Mask Shape:", mask.shape)
print("Labels Shape:", labels.shape)

Train Data Shape: (4, 10, 96)
Padding Mask Shape: (4, 1, 1, 10)
Labels Shape: (4,)




## Model Building

The following are the model parameters

In [7]:
POS_ENC_ANGLE_DENO = 10000 # Denominator angle in Positional Encoding
NUM_ENC_LAYERS =  2 # Number of Encoder Blocks or Layers
NUM_HEADS =  2 # Number of heads
EMBED_DIM = 96 # Embedding Dimension
FEED_FORWARD_DIM = 32 # Feed Forward NNs number of units in hidden layer
DROPOUT_RATE = 0.1 # Dropout Rate
MAX_LEN = 10 # Max length of each tokenized sentence
BATCH_SIZE = 4 # Training Batch Size
EPOCHS = 10 # Number of epochs to train model

We calculate the positional encoding that is added to each word vector using
$$PE_{(pos, 2i)} = \sin(pos/10000^{2i/d_{model}})$$
$$PE_{(pos, 2i+1)} = \cos(pos/10000^{2i/d_{model}})$$
where $2i$ and $2i+1$ are the even and odd column indices of the word embedding vector, $d_{model}$ is the embedding dimension, and $pos$ is the position of the word in the padded tokenized sequence.

In [8]:
def pos_enc(max_len, d_model):
    # returns the positional encoding matrix which needs to be added to the embedding matrix
    angles = np.arange(max_len)[:, np.newaxis] / np.power(POS_ENC_ANGLE_DENO, 2*(np.arange(d_model)[np.newaxis, :]//2/np.float32(d_model)))
    pos_encode = np.zeros((max_len, d_model))
    pos_encode[:, 0::2] = np.sin(angles[:, 0::2])
    pos_encode[:, 1::2] = np.cos(angles[:, 1::2])
    return tf.cast(pos_encode[np.newaxis, :], dtype=tf.float32)
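
As a sanity check of the formula, the encoding can be reproduced in plain NumPy and compared against values that follow directly from the definition. This is a sketch with a hypothetical function name (`pos_enc_np`); it mirrors the computation above but returns a 2-D array without the leading batch axis.

```python
import numpy as np

def pos_enc_np(max_len, d_model, base=10000.0):
    """Sinusoidal positional encoding as a (max_len, d_model) NumPy array."""
    pos = np.arange(max_len)[:, np.newaxis]   # (max_len, 1)
    col = np.arange(d_model)[np.newaxis, :]   # (1, d_model)
    # columns 2i and 2i+1 share the exponent 2i / d_model
    angles = pos / np.power(base, 2 * (col // 2) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])  # even columns use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])  # odd columns use cosine
    return pe

pe = pos_enc_np(max_len=10, d_model=96)
print(pe.shape)               # (10, 96)
print(pe[0, :4])              # position 0: sin(0) = 0 in even columns, cos(0) = 1 in odd columns
print(pe[1, 0], np.sin(1.0))  # column 0 has exponent 0, so PE(1, 0) = sin(1)
```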

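The helper below builds a padding mask from a matrix of token ids and adds $2$ extra dimensions so that the mask broadcasts over the heads and over the query positions of the attention scores. Note that it marks padding (entries equal to $0$) with $1$, whereas the mask built during data preparation marks real words with $1$. It is not called elsewhere in this notebook; the model is fed the mask built during data preparation.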

In [9]:
def pad_mask(mat):
    # builds a padding mask (1 where mat == 0) with two extra broadcast dimensions
    mask = tf.cast(tf.math.equal(mat, 0), dtype=tf.float32)
    return mask[:, tf.newaxis, tf.newaxis, :]

We compute the attention score of each word with every other word using the query, key and value vectors of the words, which when stacked form the $Q$, $K$ and $V$ matrices. The attention vector of each word (stacked to form the attention matrix) is calculated using the scaled dot-product attention formula given by
$$Attention(Q,K,V) = softmax_k(\frac{QK^T}{\sqrt{d_k}})V$$

In [10]:
def scaled_dot_prod_attn(q, k, v, mask):
    # returns the scaled dot product attention based on queries, keys and values and mask
    qk = tf.matmul(q, k, transpose_b=True) # calculates the numerator of the softmax input
    dk = tf.cast(tf.shape(k)[-1], dtype=tf.float32)
    pre_softmax = qk / tf.sqrt(dk) # scaled scores (logits) that are input to the softmax
    if mask is not None:
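        # mask is 1 for real words and 0 for padding (see Data Preparation), so a large
        # negative value is added at the padded keys and softmax gives them almost zero weight
        pre_softmax += (1.0 - mask) * -1e9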
    attn_wts = tf.nn.softmax(pre_softmax, axis=-1) # attention weights per word for other words
    final_attention = tf.matmul(attn_wts, v) # value vectors weighted average with attention weights
    return final_attention
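
Additive masking only has an effect when the term added at padded positions is a large negative number, since softmax then assigns those keys a weight of practically zero. The following NumPy sketch illustrates this for a single sentence and a single head. It assumes the convention used in the data preparation step (mask value 1 for real tokens, 0 for padding); the function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention(q, k, v, mask):
    """q, k, v: (seq, d); mask: (seq,) with 1 for real tokens and 0 for padding."""
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # push the scores of padded keys towards -infinity before the softmax
    scores = scores + (1.0 - mask)[np.newaxis, :] * -1e9
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
q = k = v = rng.normal(size=(4, 8))
mask = np.array([1.0, 1.0, 0.0, 0.0])  # the last two positions are padding
_, weights = masked_attention(q, k, v, mask)
print(weights.round(3))  # the two padded columns receive zero weight in every row
```

Adding a tiny positive constant such as `1e-9` instead would leave the scores, and therefore the attention weights, essentially unchanged.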

Now, the $Q$, $K$ and $V$ matrices are linearly transformed so that the model can learn queries, keys and values in several different representation subspaces. For each of these representations, the attention is calculated in a separate head as described above, and the outputs of the heads are then concatenated and passed to the next layers.

In [11]:
class MHA(tf.keras.layers.Layer):
    # class for multi-head attention
    def __init__(self, *, d_model, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        
        assert d_model % num_heads == 0 # checks that d_model can be split evenly among the heads
        self.d_head = self.d_model // self.num_heads
        
        self.linear_q = tf.keras.layers.Dense(self.d_model)
        self.linear_k = tf.keras.layers.Dense(self.d_model)
        self.linear_v = tf.keras.layers.Dense(self.d_model)
        
        self.dense = tf.keras.layers.Dense(d_model)
        
    def div_heads(self, dat, batch_size):
        dat = tf.reshape(dat, (batch_size, -1, self.num_heads, self.d_head))
        return tf.transpose(dat, perm=[0, 2, 1, 3])
    
    def call(self, V, K, Q, mask):
        batch_size = tf.shape(Q)[0]
        
        # Linear Transformation for different representation in different heads
        Q = self.linear_q(Q)
        K = self.linear_k(K)
        V = self.linear_v(V)
        
        # split K, Q, V matrix among heads
        q = self.div_heads(Q, batch_size)
        k = self.div_heads(K, batch_size)
        v = self.div_heads(V, batch_size)
        
        # calculate the scaled dot product in each head
        scaled_attn = scaled_dot_prod_attn(q, k, v, mask)
        scaled_attn = tf.transpose(scaled_attn, perm=[0, 2, 1, 3])
        # concatenate the attention vectors from each head
        concat_attn = tf.reshape(scaled_attn, (batch_size, -1, self.d_model)) 
        
        final_output = self.dense(concat_attn)
        return final_output
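
The reshape and transpose used to split the projected matrices among the heads, and to concatenate the head outputs again, can be verified in isolation. This is a NumPy sketch with hypothetical function names (`split_heads`, `merge_heads`), using the notebook's shapes of batch size 4, length 10, dimension 96 and 2 heads.

```python
import numpy as np

def split_heads(x, num_heads):
    """(batch, seq, d_model) -> (batch, num_heads, seq, d_head)."""
    batch, seq, d_model = x.shape
    assert d_model % num_heads == 0
    x = x.reshape(batch, seq, num_heads, d_model // num_heads)
    return x.transpose(0, 2, 1, 3)

def merge_heads(x):
    """(batch, num_heads, seq, d_head) -> (batch, seq, d_model)."""
    batch, num_heads, seq, d_head = x.shape
    return x.transpose(0, 2, 1, 3).reshape(batch, seq, num_heads * d_head)

x = np.arange(4 * 10 * 96, dtype=np.float32).reshape(4, 10, 96)
heads = split_heads(x, num_heads=2)
print(heads.shape)                              # (4, 2, 10, 48)
print(np.array_equal(heads[:, 0], x[:, :, :48]))  # True: head 0 sees the first 48 features
print(np.array_equal(merge_heads(heads), x))    # True: merging undoes the split
```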

After the attention is calculated, we need more non-linearity in our model, since apart from the softmax inside the attention all the transformations so far are linear. For this reason, the attention vectors generated by the MHA are passed through a Feed-Forward Neural Network.

In [12]:
def post_MHA_FF_Net(d_model, d_ff):
    # returns the post MHA feed forward neural network
    FF_Net = tf.keras.Sequential([
        tf.keras.layers.Dense(d_ff, activation='relu'),
        tf.keras.layers.Dense(d_model)
    ])
    return FF_Net

Now, the above computation i.e. Multi-Headed Attention followed by Feed Forward Neural Network (with skip connections across both MHA and Feed Forward part) forms a single layer/block of the Encoder part of a Transformer.

In [13]:
class EncoderLayer(tf.keras.layers.Layer):
    # encoder block: MHA, skip connection and layer norm, followed by feed forward network, skip connection and layer norm
    def __init__(self, *, d_model, num_heads, num_nodes, drop_rate=0.1):
        super().__init__()
        
        self.mha = MHA(d_model=d_model, num_heads=num_heads)
        self.ffn = post_MHA_FF_Net(d_model, num_nodes)
        
        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        
        self.dropout1 = tf.keras.layers.Dropout(drop_rate)
        self.dropout2 = tf.keras.layers.Dropout(drop_rate)
        
    def call(self, x, train, mask):
        # calculating the multi head attention
        attn_output = self.mha(x, x, x, mask)
        attn_output = self.dropout1(attn_output, training=train)
        output1 = self.layernorm1(x + attn_output) # skip connection
        
        # feeding the concatenated attention matrix output of MHA into Feed-Forward Network
        ffn_output = self.ffn(output1)
        ffn_output = self.dropout2(ffn_output, training = train)
        final_output = self.layernorm2(output1 + ffn_output) # skip connection around the feed-forward network
        return final_output
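
The usual post-norm residual pattern is `LayerNorm(input + sublayer(input))`, where each sub-layer's skip connection starts from that sub-layer's own input. The NumPy sketch below illustrates the pattern with stand-in sub-layers. It is an assumption-laden simplification: the layer norm has no learned scale or shift, dropout is omitted, and the helper names are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize over the last axis (no learned scale/shift in this sketch)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def residual_block(x, sublayer):
    """Post-norm residual: LayerNorm(x + sublayer(x))."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 10, 96))
w = rng.normal(size=(96, 96)) * 0.1

out1 = residual_block(x, lambda t: t @ w)                  # stand-in for the attention sub-layer
out2 = residual_block(out1, lambda t: np.maximum(t, 0.0))  # stand-in for the feed-forward sub-layer
print(out2.shape)                                          # (4, 10, 96)
print(np.allclose(out2.mean(axis=-1), 0.0, atol=1e-6))     # True: each vector has zero mean
```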

After the positional encoding is added, we repeat the Encoder Block twice, as set by the `NUM_ENC_LAYERS` parameter, to build the final Encoder part of the Transformer.

In [14]:
class Encoder(tf.keras.layers.Layer):
    # layer which takes the embeddings as input along with the mask, adds positional encoding
    # and applies num_layers encoder blocks
    def __init__(self, *, num_layers, d_model, num_heads, num_nodes, drop_rate=0.1):
        super().__init__()
        
        self.d_model = d_model
        self.num_layers = num_layers
        self.pos_encoding = pos_enc(MAX_LEN, self.d_model)
        
        self.enc_layers = [
            EncoderLayer(d_model=d_model, num_heads=num_heads, num_nodes=num_nodes, drop_rate=drop_rate)
            for _ in range(num_layers)
        ]
        
        self.dropout = tf.keras.layers.Dropout(drop_rate)
        
    def call(self, x, mask, train=True):
        sent_len = tf.shape(x)[1]
        x *= tf.math.sqrt(tf.cast(self.d_model, dtype=tf.float32))
        x += self.pos_encoding[:, :sent_len, :] # adding positional encoding
        
        x = self.dropout(x, training=train)
        # passing the embedded and position encoded data through 1st encoder layer and the output is passed through
        # another encoder layer
        for i in range(self.num_layers):
            x = self.enc_layers[i](x, train, mask)
        
        return x

Now, as mentioned earlier the model inputs are going to be the pre-processed train data and the padding mask.

In [15]:
input1 = tf.keras.layers.Input(shape=(MAX_LEN, EMBED_DIM)) # Word Embedding Matrix input with padding up to MAX_LEN
input2 = tf.keras.layers.Input(shape=(1, 1, MAX_LEN)) # Padding Mask Matrix 
transformer = Encoder(
    num_layers=NUM_ENC_LAYERS, 
    d_model=EMBED_DIM, 
    num_heads=NUM_HEADS, 
    num_nodes=FEED_FORWARD_DIM, 
    drop_rate=DROPOUT_RATE
)

The Encoder learns a feature vector for each word, which assists the classification. For classification we average the feature vectors of the words across the temporal axis. This pooled vector is then passed through another feed-forward neural network with dropout and is finally connected to a $2$-neuron output layer with softmax activation, where the first neuron gives the predicted probability that the sentence belongs to label $0$ and the second neuron gives that for label $1$.

In [16]:
# For classification we take the representation from the transformer and feed it into the following feed forward
# network for classification
x = transformer(input1, input2, True)
x = tf.keras.layers.GlobalAveragePooling1D()(x) # averaging over the temporal axis of the resultant vector from transformer
x = tf.keras.layers.Dropout(DROPOUT_RATE)(x)
x = tf.keras.layers.Dense(FEED_FORWARD_DIM, activation='relu')(x) # Feed Forward Network for classification
x = tf.keras.layers.Dropout(DROPOUT_RATE)(x)
output = tf.keras.layers.Dense(2, activation='softmax')(x) # 0th unit corresponds to label 0 and 1st unit to label 1
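
Averaging over the temporal axis reduces a `(batch, MAX_LEN, EMBED_DIM)` tensor to `(batch, EMBED_DIM)`. The NumPy sketch below shows this on made-up data. A plain mean also includes the padded positions; the mask-aware variant shown next to it is only an illustrative alternative and is not what the notebook's model computes.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 10, 96))  # made-up encoder output: (batch, MAX_LEN, EMBED_DIM)

mask = np.zeros((4, 10))
mask[:, :5] = 1.0                        # pretend every sentence has 5 real tokens

# plain average over the temporal axis, padded positions included
plain = features.mean(axis=1)

# alternative (an assumption, not in the notebook): average over real tokens only
masked = (features * mask[:, :, np.newaxis]).sum(axis=1) / mask.sum(axis=1, keepdims=True)

print(plain.shape, masked.shape)  # (4, 96) (4, 96)
```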

In [17]:
model = tf.keras.Model(inputs=[input1, input2], outputs=output)

The following summarizes our model.

In [18]:
model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_1 (InputLayer)           [(None, 10, 96)]     0           []                               
                                                                                                  
 input_2 (InputLayer)           [(None, 1, 1, 10)]   0           []                               
                                                                                                  
 encoder (Encoder)              (None, 10, 96)       87808       ['input_1[0][0]',                
                                                                  'input_2[0][0]']                
                                                                                                  
 global_average_pooling1d (Glob  (None, 96)          0           ['encoder[0][0]']            

The model is trained using the Adam optimizer, and the loss function chosen is the sparse categorical crossentropy because the true labels are scalar class indices rather than one-hot vectors. A batch size of $4$ was used and the model was trained for $10$ epochs.

In [19]:
# Training Model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
history = model.fit(
    [train_data, mask], labels, batch_size=BATCH_SIZE, epochs=EPOCHS 
)

Epoch 1/10




Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
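
For reference, the sparse categorical crossentropy is the mean negative log-probability that the model assigns to the true class of each sample. The following NumPy sketch computes it by hand. The probabilities are made up for illustration and are not outputs of the trained model, and the clipping constant `eps` is an assumption of this sketch.

```python
import numpy as np

def sparse_categorical_crossentropy(y_true, probs, eps=1e-7):
    """Mean of -log p[true class]; y_true holds integer class ids."""
    probs = np.clip(probs, eps, 1.0 - eps)          # avoid log(0)
    picked = probs[np.arange(len(y_true)), y_true]  # probability of the true class per row
    return float(-np.log(picked).mean())

y_true = np.array([1, 1, 0, 1])   # the notebook's labels
probs = np.array([[0.2, 0.8],     # made-up softmax outputs, one row per sentence
                  [0.4, 0.6],
                  [0.9, 0.1],
                  [0.3, 0.7]])
loss = sparse_categorical_crossentropy(y_true, probs)
print(loss)
```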


In [20]:
pred_class_probs = model.predict([train_data, mask])





We take the index of the maximum of the two predicted probabilities for each sentence to be the predicted label of that sentence.

In [21]:
pred_labels = np.argmax(pred_class_probs, axis=-1) # Model output is maximum of the two probabilities

It can be seen that the model predicts all the labels correctly, i.e. with accuracy $1$. But of course the model is overfit because the training data is very small, and the performance on the training data alone is an unreliable estimate of the actual model performance.

In [22]:
pred_labels

array([1, 1, 0, 1])

The precision score, recall score and F1 score are reported below.

In [23]:
print("*** Model Performance ***")
print("Precision Score:", precision_score(labels, pred_labels))
print("Recall Score:", recall_score(labels, pred_labels))
print("F1 Score:", f1_score(labels, pred_labels))

*** Model Performance ***
Precision Score: 1.0
Recall Score: 1.0
F1 Score: 1.0
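
These three scores follow from the counts of true positives, false positives and false negatives for the positive class (label 1). The pure-Python sketch below computes them by hand. The helper name `binary_prf` is hypothetical, and returning 0.0 when a score is undefined is a choice made here for simplicity.

```python
def binary_prf(y_true, y_pred):
    """Precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# the notebook's labels and predictions: no false positives or false negatives
print(binary_prf([1, 1, 0, 1], [1, 1, 0, 1]))  # (1.0, 1.0, 1.0)

# a made-up imperfect prediction: tp = 2, fp = 1, fn = 1
print(binary_prf([1, 1, 0, 1], [1, 0, 1, 1]))
```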


Finally, the results are written to a CSV file.

In [24]:
filename = "TensorFlow_results.csv"
fields = ['precision', 'recall', 'F1'] 
results = [precision_score(labels, pred_labels), recall_score(labels, pred_labels), f1_score(labels, pred_labels)]
with open(filename, 'w', newline='') as csvfile: # newline='' is recommended for files passed to csv.writer
    csvwriter = csv.writer(csvfile) 
    csvwriter.writerow(fields)
    csvwriter.writerow(results)
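
To confirm that the file has the expected layout, it can be read back with `csv.DictReader`. The sketch below does a round trip in a temporary directory. The file name `results_demo.csv` is hypothetical and the scores are the values reported above; note that the CSV reader returns every value as a string.

```python
import csv
import os
import tempfile

fields = ["precision", "recall", "F1"]
results = [1.0, 1.0, 1.0]  # the scores reported above

# throwaway location so the sketch does not overwrite the real results file
path = os.path.join(tempfile.mkdtemp(), "results_demo.csv")

with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(fields)   # header row
    writer.writerow(results)  # one row of scores

with open(path, newline="") as f:
    rows = list(csv.DictReader(f))

print(rows[0]["precision"], rows[0]["recall"], rows[0]["F1"])  # 1.0 1.0 1.0
```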