# Transformers

This notebook focuses on implementing a customizable Transformer model from scratch and training it on the `Tweets` dataset for classification.

## Without Transformer's Module

We follow the architecture from the original paper [Attention is All You Need](https://arxiv.org/pdf/1706.03762.pdf?ref=blog.paperspace.com), primarily using `Keras` and `TensorFlow`.

In [1]:
# Importing all the needed modules
import tensorflow as tf
from tensorflow.keras.layers import MultiHeadAttention, LayerNormalization, Dropout, Layer
from tensorflow.keras.layers import Embedding, Input, GlobalAveragePooling1D, Dense
from tensorflow.keras.models import Sequential, Model


### Transformer Block

This represents one of the blocks in the Transformer architecture. We will use the following kinds of layers in this block:

1. We use the `MultiHeadAttention` layer provided by `TensorFlow` directly, which saves us from implementing the following steps ourselves:
- Apply a learned linear projection to the `Q`, `K` and `V` inputs and split the results across the attention heads.
- Perform scaled dot-product attention within each head.
- Concatenate the outputs of all heads and apply a final linear projection.

   The residual addition and normalization of the attention output are not part of this layer; the block applies them separately.

2. A `Feed Forward Neural Network` with `ff_dim` hidden neurons and an output of size `embed_dim`, built as a simple `Sequential` model.

3. `LayerNormalization` layers to normalize the activations after each of the two sub-layers.
4. A `Dropout` layer that randomly sets input units to 0 with a frequency of `rate` at each step during training, which helps prevent overfitting. Inputs not set to 0 are scaled up by `1 / (1 - rate)` so that the expected sum over all inputs is unchanged. At inference time the layer passes its inputs through unchanged.
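
The attention steps listed under item 1 can be illustrated with a small NumPy sketch. This is not how Keras implements `MultiHeadAttention` internally; it is a minimal version that omits biases and dropout, and the weight matrices `wq`, `wk`, `wv`, `wo` and the toy sizes are made up for illustration. Note that in Keras, `key_dim` is the size of each head, which corresponds to `head_dim` here.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, wq, wk, wv, wo, num_heads):
    # x: (seq_len, embed_dim); wq/wk/wv: (embed_dim, num_heads * head_dim)
    # wo: (num_heads * head_dim, embed_dim)
    seq_len, _ = x.shape
    head_dim = wq.shape[1] // num_heads

    def project(w):
        # Linear projection, then split into heads: (num_heads, seq_len, head_dim)
        return (x @ w).reshape(seq_len, num_heads, head_dim).transpose(1, 0, 2)

    q, k, v = project(wq), project(wk), project(wv)
    # Scaled dot product between every pair of positions, per head
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)  # (num_heads, seq_len, seq_len)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                     # (num_heads, seq_len, head_dim)
    # Concatenate the heads and apply the final linear projection
    concat = heads.transpose(1, 0, 2).reshape(seq_len, num_heads * head_dim)
    return concat @ wo, weights

rng = np.random.default_rng(0)
seq_len, embed_dim, num_heads, head_dim = 5, 8, 2, 4  # hypothetical toy sizes
x = rng.normal(size=(seq_len, embed_dim))
wq, wk, wv = (rng.normal(size=(embed_dim, num_heads * head_dim)) for _ in range(3))
wo = rng.normal(size=(num_heads * head_dim, embed_dim))

out, weights = multi_head_self_attention(x, wq, wk, wv, wo, num_heads)
print(out.shape, weights.shape)
```

The output keeps the input shape `(seq_len, embed_dim)`, which is what allows it to be added back to the inputs in the residual connection.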

We do not use a `Masked Multi-Head Attention` layer in our implementation because this transformer is used for classification, not generation, so there is no need to mask the next words in the input. Also, due to the nature of the classification task, we do not need separate `Encoder` and `Decoder` blocks, since without masking their architectures would overlap greatly.
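
For contrast, here is a rough NumPy sketch of what the causal mask we are leaving out would do in a generative model. The uniform `scores` matrix is a made-up stand-in for real attention scores, chosen so the effect of the mask is easy to see.

```python
import numpy as np

seq_len = 4
# Lower-triangular mask: position i may attend to positions 0..i only
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

scores = np.zeros((seq_len, seq_len))            # hypothetical uniform attention scores
masked = np.where(causal_mask, scores, -np.inf)  # hidden positions get -inf before softmax
weights = np.exp(masked) / np.exp(masked).sum(axis=-1, keepdims=True)
print(weights)
```

Each row only distributes attention over the current and earlier positions. In our classifier every token is allowed to attend to the whole tweet, so no such mask is applied.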

In [2]:
class TransformerBlock(Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super(TransformerBlock, self).__init__()
        self.attentionLayer = MultiHeadAttention(
            num_heads=num_heads,
            key_dim=embed_dim,  # size of each attention head, not the total across heads
        )
        self.feedForwardNN = Sequential(
            [
                Dense(ff_dim, activation="relu"),
                Dense(embed_dim),
            ]
        )
        self.normalization1 = LayerNormalization(epsilon=1e-6)
        self.normalization2 = LayerNormalization(epsilon=1e-6)
        self.dropout1 = Dropout(rate)
        self.dropout2 = Dropout(rate)

    # inputs: Tensor of shape (batch_size, seq_len, embed_dim)
    # training: Boolean indicating whether the layer runs in training or inference mode
    def call(self, inputs, training=None):
        # Self-attention: query = value = inputs, and key defaults to value
        attn_output = self.attentionLayer(inputs, inputs)
        attn_output = self.dropout1(attn_output, training=training)
        # out1 adds the attention output to the inputs (residual connection) and normalizes it before the FFN
        out1 = self.normalization1(inputs + attn_output)

        ffn_output = self.feedForwardNN(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.normalization2(out1 + ffn_output)
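
The block applies the same pattern twice: `LayerNorm(x + Dropout(sublayer(x)))`. The NumPy sketch below illustrates that pattern on made-up data. It is a simplification: Keras' `LayerNormalization` also has learnable scale and offset parameters, which are left out here, and `sublayer_out` is just random data standing in for the attention or feed-forward output.

```python
import numpy as np

def layer_norm(x, epsilon=1e-6):
    # Normalize each token vector (last axis) to zero mean and unit variance
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + epsilon)

def inverted_dropout(x, rate, rng):
    # Zero each unit with probability `rate`; scale the survivors by 1 / (1 - rate)
    keep = rng.random(x.shape) >= rate
    return np.where(keep, x / (1.0 - rate), 0.0)

rng = np.random.default_rng(0)
inputs = rng.normal(size=(5, 8))        # (seq_len, embed_dim), hypothetical values
sublayer_out = rng.normal(size=(5, 8))  # stands in for the attention or FFN output

# Residual connection followed by normalization, as in the block above
out = layer_norm(inputs + inverted_dropout(sublayer_out, rate=0.1, rng=rng))
print(out.shape)
print(out.mean(axis=-1).round(6))
```

After normalization every token vector has a mean close to 0 and a standard deviation close to 1, regardless of the scale of the summed inputs.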

### Embedding Blocks

This section defines two layers, one for the `tokens` and one for the `token index positions`.
1. The first layer maps each token index from the `vocab` to a learned embedding in a space of `embed_dim` dimensions.
2. The second layer adds positional information. Rather than learning a position embedding, it precomputes fixed sinusoidal encodings of shape `(seq_len, emb_dim)` and adds them to the token embeddings, so it has no trainable weights.

For this layer, it is assumed that all the inputs have been padded to the fixed length `maxlen`.

In [3]:
import numpy as np


class TokenEmbedding(Layer):
    def __init__(self, vocab_size, embed_dim):
        super(TokenEmbedding, self).__init__()
        self.token_emb = Embedding(input_dim=vocab_size, output_dim=embed_dim)

    def call(self, inputs):
        return self.token_emb(inputs)


class PositionalEncoding(Layer):
    def __init__(self, emb_dim, seq_len=5000):
        super(PositionalEncoding, self).__init__()
        position = tf.range(0, seq_len, dtype=tf.float32)[:, tf.newaxis]
        denominator = tf.exp(
            -tf.range(0, emb_dim, 2, dtype=tf.float32) * np.log(10000) / emb_dim
        )
        sin_vals = tf.sin(position * denominator)
        cos_vals = tf.cos(position * denominator)
        position_embedding = tf.concat([sin_vals, cos_vals], axis=-1)
        self.position_embedding = position_embedding[tf.newaxis, ...]

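    def call(self, inputs):
        # Slice to the actual input length so sequences shorter than seq_len also work
        return inputs + self.position_embedding[:, : tf.shape(inputs)[1], :]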
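
To check the shape and value range of the encoding without building a model, the same computation can be reproduced in NumPy. This is a sketch that mirrors the layer above; the function name `sinusoidal_encoding` is made up for illustration. As in the layer, the sine and cosine values are concatenated as two halves, whereas the original paper interleaves them across even and odd dimensions.

```python
import numpy as np

def sinusoidal_encoding(seq_len, emb_dim):
    # emb_dim is assumed to be even so the sine and cosine halves have equal width
    position = np.arange(seq_len, dtype=np.float64)[:, np.newaxis]      # (seq_len, 1)
    inv_freq = np.exp(-np.arange(0, emb_dim, 2) * np.log(10000.0) / emb_dim)
    angles = position * inv_freq                                         # (seq_len, emb_dim // 2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

pe = sinusoidal_encoding(seq_len=100, emb_dim=32)
print(pe.shape)
print(pe[0, :16], pe[0, 16:])  # position 0: sine half and cosine half
```

At position 0 the sine half is all zeros and the cosine half is all ones, and every value stays within `[-1, 1]`, so the encoding is on a comparable scale to small embedding values.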

### Preparing Data and Building the Model

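The following code does the following:
- Load the dataset and hold out 25% of it, giving a `1:3` held-out-to-training split. The held-out portion is then split again, with 75% of it used for validation and 25% for testing.
- Map each word to a distinct integer index starting at `1`; index `0` is reserved for padding.
- Pad all the tweets to `maxlen` to make them the same length.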
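
The indexing and padding steps can be sketched in plain Python. This is only a rough approximation of what `Tokenizer` and `pad_sequences` do: it lowercases and splits on whitespace but does not strip punctuation, and the helper names and example tweets are made up.

```python
from collections import Counter

def build_word_index(texts):
    # The most frequent word gets index 1; index 0 stays reserved for padding
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def encode_and_pad(text, word_index, maxlen):
    # Words that were not seen while building the index are dropped
    ids = [word_index[w] for w in text.lower().split() if w in word_index]
    # Keep the last `maxlen` tokens (the default truncating="pre" behaviour)
    ids = ids[-maxlen:]
    # Append zeros up to maxlen, which corresponds to padding="post"
    return ids + [0] * (maxlen - len(ids))

texts = ["good morning world", "good night"]  # hypothetical tweets
word_index = build_word_index(texts)
print(word_index)
print(encode_and_pad("good night moon", word_index, maxlen=5))
```

The unseen word `moon` is dropped, and the remaining two indices are followed by three zeros to reach the fixed length.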

In the next block, we put together the model with the following layers:
- Embedding layer (token embedding + positional encoding)
- Transformer block (multi-head attention + feed-forward neural network)
- `GlobalAveragePooling1D` layer
- `Dropout` layer
- `Dense` hidden layer followed by another `Dropout` layer
- `Dense` output layer with `softmax` for classification
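
To make the tensor shapes concrete, here is a NumPy sketch of the pooling and classification head applied to random data. The weights are random placeholders, not trained values, and the dropout layers are skipped because they do nothing at inference time.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, maxlen, embed_dim = 4, 100, 32  # batch size is a hypothetical choice

x = rng.normal(size=(batch, maxlen, embed_dim))  # stand-in for the transformer block output
pooled = x.mean(axis=1)                          # global average pooling -> (batch, embed_dim)

w1, b1 = rng.normal(size=(embed_dim, 20)), np.zeros(20)
w2, b2 = rng.normal(size=(20, 2)), np.zeros(2)

hidden = np.maximum(pooled @ w1 + b1, 0.0)       # Dense(20, activation="relu")
logits = hidden @ w2 + b2
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)       # Dense(2, activation="softmax")
print(pooled.shape, hidden.shape, probs.shape)
```

Pooling collapses the sequence axis, so each tweet is represented by a single `embed_dim`-sized vector before classification. Since no padding mask is passed along in this model, the average also includes the padded positions.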

In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Load the dataset
df = pd.read_csv("data/sample_tweets.csv", sep=",", names=["label", "text"], header=0)
tweets = df["text"].values
y = df["label"].values

# Hold out 25% of the data; the remaining 75% is the training set
tweets_train, tweets_test, y_train, y_test = train_test_split(
    tweets, y, test_size=0.25, random_state=1000
)

# Split the held-out data again: 25% becomes the test set, 75% the validation set
tweets_test, tweets_val, y_test, y_val = train_test_split(
    tweets_test, y_test, test_size=0.75, random_state=1000
)

tokenizer = Tokenizer(num_words=2500)
tokenizer.fit_on_texts(tweets_train)

X_train = tokenizer.texts_to_sequences(tweets_train)
X_test = tokenizer.texts_to_sequences(tweets_test)
X_val = tokenizer.texts_to_sequences(tweets_val)

vocab_size = (
    len(tokenizer.word_index) + 1
)  # Adding 1 because of reserved 0 index for padding


maxlen = 100
X_train = pad_sequences(X_train, padding="post", maxlen=maxlen)
X_test = pad_sequences(X_test, padding="post", maxlen=maxlen)
X_val = pad_sequences(X_val, padding="post", maxlen=maxlen)
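
Because the data is split twice, the final proportions are easy to misread. The arithmetic below works them out for a made-up dataset size `n`; it assumes scikit-learn rounds the size of the `test_size` portion up and gives the remainder to the other portion.

```python
import math

n = 1000  # hypothetical number of tweets

# First split: test_size=0.25 holds out a quarter of the data
n_holdout = math.ceil(0.25 * n)
n_train = n - n_holdout

# Second split: test_size=0.75 of the held-out data goes to validation
n_val = math.ceil(0.75 * n_holdout)
n_test = n_holdout - n_val

print(n_train, n_val, n_test)
print(n_train / n, n_val / n, n_test / n)
```

Overall this gives roughly 75% training, 19% validation and 6% test data, so the final test set is small.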

In [5]:
embed_dim = 32  # Embedding size for each token
num_heads = 2  # Number of attention heads
ff_dim = 32  # Hidden layer size in feed forward network inside transformer

inputs = Input(shape=(maxlen,))
embedding_layer = TokenEmbedding(vocab_size, embed_dim)
x = embedding_layer(inputs)
position_encoding = PositionalEncoding(embed_dim, maxlen)
x = position_encoding(x)
transformer_block = TransformerBlock(embed_dim, num_heads, ff_dim)
x = transformer_block(x)
x = GlobalAveragePooling1D()(x)
x = Dropout(0.1)(x)
x = Dense(20, activation="relu")(x)
x = Dropout(0.1)(x)
outputs = Dense(2, activation="softmax")(x)

model = Model(inputs=inputs, outputs=outputs)

### Compilation and Evaluation of the Model

In [6]:
# Compile and train the model
model.compile(
    optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"]
)

history = model.fit(
    X_train,
    y_train,
    batch_size=64,
    epochs=50,
    validation_data=(X_val, y_val),
)

# Save the weights of the model
model.save_weights("data/training_checkpoints/transformer1.h5")

# Evaluate the model on the testing data and print the results
print("\n\n\n")
results = model.evaluate(X_test, y_test, verbose=2)
for name, value in zip(model.metrics_names, results):
    print("%s: %.3f" % (name, value))

1/1 - 0s - loss: 1.0396 - accuracy: 0.8065 - 18ms/epoch - 18ms/step
loss: 1.040
accuracy: 0.806
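
A loss above 1 alongside an accuracy around 0.8 can look contradictory. The NumPy sketch below shows how the two metrics are computed on made-up predictions, and how a single confidently wrong prediction can dominate the loss. This is one possible explanation for numbers like the ones above, not a diagnosis of this particular run.

```python
import numpy as np

def sparse_categorical_crossentropy(y_true, probs, eps=1e-7):
    # Mean negative log-probability assigned to the true class
    picked = probs[np.arange(len(y_true)), y_true]
    return float(-np.log(np.clip(picked, eps, 1.0)).mean())

def accuracy(y_true, probs):
    # Fraction of samples whose most probable class matches the label
    return float((probs.argmax(axis=-1) == y_true).mean())

y_true = np.array([0, 1, 1, 0])  # hypothetical labels
probs = np.array([
    [0.90, 0.10],
    [0.20, 0.80],
    [0.99, 0.01],  # confidently wrong prediction
    [0.60, 0.40],
])

acc = accuracy(y_true, probs)
loss = sparse_categorical_crossentropy(y_true, probs)
print("accuracy: %.3f" % acc)
print("loss: %.3f" % loss)
```

Three of the four predictions are correct, yet the third sample alone contributes about 4.6 to the summed loss, which pulls the mean loss above 1.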
