# In Class Follow Along Notebook


First things first: we'll set up the data!

In [None]:
NUM_LABELS = 2

In [None]:
import pandas as pd

cleaned_tweets = pd.read_csv("processed_tweets (1).csv")

In [None]:
cleaned_tweets.head()

In [None]:
# The selected columns are already pandas Series, so no extra wrapping is needed.
X, y = cleaned_tweets['tidy_tweet'], cleaned_tweets['label']

In [None]:
from sklearn.model_selection import train_test_split
X_train_sub, X_test, y_train_sub, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

In [None]:
X_train, X_valid, y_train, y_valid = train_test_split(X_train_sub, y_train_sub, test_size=0.2, stratify=y_train_sub, random_state=42)
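
The two chained splits above determine the final share of data in each set. The arithmetic can be checked with a small sketch; the variable names here are hypothetical and only the two `test_size` values come from the cells above.

```python
from fractions import Fraction

test_frac = Fraction(3, 10)           # test_size=0.3 in the first split
valid_frac_of_rest = Fraction(2, 10)  # test_size=0.2 in the second split

rest = 1 - test_frac                  # what remains after removing the test set
valid_frac = rest * valid_frac_of_rest
train_frac = rest - valid_frac

print(f"train={float(train_frac):.0%}, valid={float(valid_frac):.0%}, test={float(test_frac):.0%}")
# train=56%, valid=14%, test=30%
```

So roughly 56% of the tweets end up in the training set, 14% in the validation set, and 30% in the test set.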

## Positional Embedding Layer

We'll build a positional embedding layer, following the positional encoding idea from the "Attention Is All You Need" paper!

The idea behind positional encoding is fairly simple: to give the model access to token order information, we add an embedding of each token's position in the sentence to its word embedding.

Thus, one input word embedding will have two components: the usual token vector representing the token independent of any specific context, and a position vector representing the position of the token in the current sequence.
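
To make the two components concrete, here is a framework-free sketch in plain Python. The lookup tables and their values are hypothetical and chosen only so the sums are easy to trace by hand; in the real layer both tables are learned.

```python
# Hypothetical lookup tables with embedding size 2.
token_table = {0: [0.0, 0.0], 1: [0.5, -0.5], 2: [1.0, 2.0]}  # token id -> vector
position_table = [[0.1, 0.1], [0.2, 0.2], [0.3, 0.3]]          # position -> vector

def embed(token_ids):
    """Return token vector + position vector for every token in the sequence."""
    return [
        [t + p for t, p in zip(token_table[tok], position_table[pos])]
        for pos, tok in enumerate(token_ids)
    ]

# Token 2 appears at positions 0 and 2 and gets a different vector each time.
embedded = embed([2, 1, 2])
for pos, vec in enumerate(embedded):
    print(pos, vec)
```

The same token id produces different vectors at different positions, which is exactly the order information the model would otherwise lack.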

In [None]:
### Positional Embedding
from tensorflow.keras import layers as L
import tensorflow as tf
from tensorflow import keras

class PositionalEmbedding(L.Layer):
    def __init__(self, sequence_length, input_dim, output_dim, **kwargs):
        # Call the base constructor first so Keras can track the sub-layers below.
        super().__init__(**kwargs)
        self.token_embeddings =  # YOUR CODE HERE
        self.position_embeddings =  # YOUR CODE HERE
        self.sequence_length = sequence_length
        self.input_dim = input_dim
        self.output_dim = output_dim
        
    def call(self, inputs):
        length = tf.shape(inputs)[-1]
        positions = tf.range(start=0, limit=length, delta=1)
        embedded_tokens = self.token_embeddings(inputs)
        embedded_positions = self.position_embeddings(positions)
        return embedded_tokens + embedded_positions
        
    def get_config(self):
        config = super().get_config()
        config.update({
            "output_dim": self.output_dim,
            "sequence_length": self.sequence_length,
            "input_dim": self.input_dim,
        })
        return config
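
As a hint for the two blanks, the sketch below shows one plausible way to combine two `Embedding` layers outside of any class. The sizes and variable names are hypothetical, and this is not necessarily the intended solution.

```python
import tensorflow as tf
from tensorflow.keras import layers as L

# Hypothetical sizes, chosen only for this demo.
vocab_size, seq_len, embed_dim = 100, 6, 8

token_emb = L.Embedding(input_dim=vocab_size, output_dim=embed_dim)  # one vector per token id
pos_emb = L.Embedding(input_dim=seq_len, output_dim=embed_dim)       # one vector per position

token_ids = tf.constant([[5, 12, 7, 0, 0, 0], [3, 9, 44, 2, 1, 0]])
positions = tf.range(start=0, limit=tf.shape(token_ids)[-1], delta=1)

# (2, 6, 8) + (6, 8): the position vectors broadcast over the batch axis.
summed = token_emb(token_ids) + pos_emb(positions)
print(summed.shape)
```

Because the position table has only `seq_len` rows, inputs longer than that would index out of range, which is why the notebook pads every sequence to a fixed maximum length.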

## Transformer Block

Most natural language processing tasks are now dominated by the Transformer architecture, introduced in [Attention Is All You Need](https://arxiv.org/abs/1706.03762), which uses a simple mechanism called neural attention as one of its building blocks. As the title suggests, this architecture does not need any recurrent layers. We will now build a text classifier using attention and positional embeddings.

Transformer (attention) Block.

The concept of neural attention is fairly simple: not all input information seen by a model is equally important to the task at hand. This idea already appears in other places, e.g., max pooling in ConvNets, but the kind of attention we are looking for should be context-aware.

A general attention mechanism lets the model focus on the relevant parts of the input while producing each output. Self-attention lets the inputs interact with each other: for each input, it computes attention scores with respect to all the other inputs.
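
A minimal sketch of scaled dot-product self-attention in plain Python follows. It is a simplification: the inputs themselves serve as queries, keys, and values, whereas a real attention layer first applies learned projections. The function names and the toy vectors are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """x: list of token vectors, used directly as queries, keys and values."""
    d = len(x[0])
    # Score every token against every other token, scaled by sqrt(d).
    scores = [
        [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        for q in x
    ]
    weights = [softmax(row) for row in scores]
    # Each output is a weighted average of all token vectors.
    outputs = [
        [sum(w * v[j] for w, v in zip(w_row, x)) for j in range(d)]
        for w_row in weights
    ]
    return weights, outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, outputs = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1, and the first token attends less to the second token (which is orthogonal to it) than to itself or to the third.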

In the paper, the authors proposed another type of attention mechanism called multi-head attention: the output space of the self-attention layer is factored into a set of independent subspaces learned separately, and each subspace is called a "head". You need to implement the multi-head attention layer, supplying values for two parameters: `num_heads` and `key_dim`.

A learnable dense projection follows the multi-head attention; it enables the layer to actually learn something, as opposed to being a purely stateless transformation. The `dense_proj` attribute uses `keras.Sequential` to stack two dense layers:

 1. first dense layer with `dense_dim` units and activation function `relu`;
 2. second dense layer with `embed_dim` units and no activation function.

In [None]:
class TransformerBlock(L.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        # Call the base constructor first so Keras can track the sub-layers below.
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads
        self.attention =  # YOUR CODE HERE
        self.dense_proj = keras.Sequential([
            L.Dense(dense_dim, activation='relu'),
            L.Dense(embed_dim)
            ])
        self.layernorm1 = L.LayerNormalization()
        self.layernorm2 = L.LayerNormalization()
    
    def call(self, inputs, mask=None):
        if mask is not None:
            mask = mask[:, tf.newaxis, :]  # (batch, seq) -> (batch, 1, seq)
        attention_output = self.attention(inputs, inputs, attention_mask=mask)
        proj_input = self.layernorm1(inputs + attention_output)
        proj_output = self.dense_proj(proj_input)
        return self.layernorm2(proj_input + proj_output)
    
    def get_config(self):
        config = super().get_config()
        config.update({
            "embed_dim": self.embed_dim,
            "num_heads": self.num_heads,
            "dense_dim": self.dense_dim
        })
        return config
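
The sketch below shows how a Keras `MultiHeadAttention` layer behaves when used for self-attention, and what shape an expanded padding mask has. The sizes are hypothetical, and the choice of `key_dim` is only an example, not necessarily the value the exercise expects.

```python
import tensorflow as tf
from tensorflow.keras import layers as L

# Hypothetical sizes, chosen only for this demo.
batch, seq_len, embed_dim = 4, 10, 16

attention = L.MultiHeadAttention(num_heads=2, key_dim=embed_dim)

x = tf.random.normal((batch, seq_len, embed_dim))
# Self-attention: the same tensor is passed as query and value (key defaults to value).
attended = attention(x, x)
print(attended.shape)  # same shape as the input

# A (batch, seq_len) padding mask gains a middle axis so that it can broadcast
# against attention scores of shape (batch, target_len, source_len).
padding_mask = tf.ones((batch, seq_len), dtype=tf.bool)
expanded_mask = padding_mask[:, tf.newaxis, :]
print(expanded_mask.shape)
```

The output keeps the input's shape, which is what makes the residual connection `inputs + attention_output` in `call` possible.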

## Transformer Model in Keras

Let's build it!

In [None]:
VOCAB_SIZE = 10_000
EMBED_DIM = 256
DENSE_DIM = 32
NUM_HEADS = 2
MAX_LEN = 256

Tokenizer.

The tokenizer is a simple tool to convert a text into a sequence of tokens. It is used to convert the training data into a sequence of integers, which are then used as input to the model.

Use Tokenizer to create a tokenizer for the training data. Set the num_words parameter to the number of words to keep in the vocabulary, and oov_token to be "\<unk>".

In [None]:
from keras.preprocessing.text import Tokenizer
tokenizer = # YOUR CODE HERE
tokenizer.fit_on_texts(X_train)

Pad the sequences.

The tokenizer outputs a sequence of integers for each text, but the model expects sequences of a fixed length. To pad the sequences to the same length, use `pad_sequences`, imported below from `keras.utils`.

Complete the function `preprocess` below to 1) tokenize the texts and 2) pad the sequences to the same length.

In [None]:
from keras.utils import pad_sequences

def preprocess(texts, tokenizer, maxlen: int = MAX_LEN):
    seqs =  # YOUR CODE HERE
    tokenized_text =   # YOUR CODE HERE
    return tokenized_text
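
To see what tokenizing and padding do to the data, here is a plain-Python stand-in that imitates the behaviour of the Keras utilities rather than calling them. The helper names `fit_vocab` and `to_padded` are hypothetical. It assumes the Keras conventions that index 0 is reserved for padding, the out-of-vocabulary token gets index 1, and padding and truncation both happen at the front of the sequence by default; unlike the Keras tokenizer, it splits on whitespace only and does not strip punctuation.

```python
from collections import Counter

def fit_vocab(texts, num_words, oov_token="<unk>"):
    """Map the most frequent words to integer ids; id 1 is the OOV token."""
    counts = Counter(word for text in texts for word in text.lower().split())
    vocab = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        if rank >= num_words:  # keep only ids below num_words
            break
        vocab[word] = rank
    return vocab

def to_padded(texts, vocab, maxlen, oov_token="<unk>"):
    """Convert texts to id sequences, then truncate/pad them at the front."""
    rows = []
    for text in texts:
        ids = [vocab.get(w, vocab[oov_token]) for w in text.lower().split()]
        ids = ids[-maxlen:]                           # keep the last maxlen ids
        rows.append([0] * (maxlen - len(ids)) + ids)  # left-pad with zeros
    return rows

vocab = fit_vocab(["good morning", "good night"], num_words=10)
padded = to_padded(["good evening", "good good morning night"], vocab, maxlen=3)
print(padded)  # [[0, 2, 1], [2, 3, 4]]
```

The unseen word "evening" maps to the OOV id 1, the short text is left-padded with 0, and the long text loses its first token.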

Preprocess the data.

Use preprocess to preprocess the training, validation, and test data.

In [None]:
X_train =  # YOUR CODE HERE
X_valid =  # YOUR CODE HERE
X_test  =  # YOUR CODE HERE

Define the model with the following architecture:

* Input Layer
* Positional Embeddings
* Transformer Block
* Pooling
* Dropout
* Output Layer

If you are not familiar with the Keras functional API, read the guide [here](https://keras.io/guides/functional_api/).

In [None]:
inputs = keras.Input(shape=(None, ), dtype="int64")
x = PositionalEmbedding(MAX_LEN, VOCAB_SIZE, EMBED_DIM)(inputs) # YOUR CODE HERE
x = TransformerBlock(EMBED_DIM, DENSE_DIM, NUM_HEADS)(x) # YOUR CODE HERE
x = L.GlobalMaxPooling1D()(x)
x = L.Dropout(0.1)(x)
outputs = L.Dense(1, activation='sigmoid')(x)

model = keras.Model(inputs, outputs)

Compile model.

Use 'adam' for the optimizer and accuracy for the metrics, and supply the correct value for the loss.

Remember, this is a binary classification task!
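
For a single example with true label y and predicted probability p, binary cross-entropy is -(y·log(p) + (1 − y)·log(1 − p)). The plain-Python sketch below evaluates that formula by hand; the function name and the clipping constant `eps` are assumptions made for this illustration.

```python
import math

def binary_crossentropy(y_true, p_pred, eps=1e-7):
    """Per-example binary cross-entropy; p_pred is clipped to avoid log(0)."""
    p = min(max(p_pred, eps), 1 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

confident_right = binary_crossentropy(1, 0.9)  # about 0.105
confident_wrong = binary_crossentropy(1, 0.1)  # about 2.303
print(round(confident_right, 3), round(confident_wrong, 3))
```

A confident wrong prediction is penalised far more heavily than a confident correct one, which is why this loss pairs naturally with a single sigmoid output.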

In [None]:
model.compile(
    optimizer='adam', # YOUR CODE HERE
    loss='binary_crossentropy', # YOUR CODE HERE
    metrics=['accuracy']) # YOUR CODE HERE

In [None]:
model.summary()

Add [EarlyStopping](https://keras.io/api/callbacks/early_stopping/) and [ReduceLROnPlateau](https://keras.io/api/callbacks/reduce_lr_on_plateau/) callbacks: the first stops training, and the second lowers the learning rate, when a monitored metric has not improved for a given number of epochs.

Create an EarlyStopping object named es to stop training if the validation loss does not improve after 5 epochs. Set verbose to display messages when the callback takes an action and set restore_best_weights to restore model weights from the epoch with the best value of the monitored metric.

Use ReduceLROnPlateau to reduce the learning rate if the validation loss does not improve after 3 epochs. Set verbose to display messages when the callback takes an action and use default values for other parameters.

In [None]:
es =  # YOUR CODE HERE
rlp =  # YOUR CODE HERE
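
One way the two callbacks described above could be created is sketched here. It assumes that "does not improve after N epochs" maps to the `patience` argument and that the validation loss is monitored under the name `val_loss`.

```python
from tensorflow import keras

# Stop when val_loss has not improved for 5 epochs, then roll back to the best weights.
es = keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=5,
    verbose=1,
    restore_best_weights=True,
)

# Lower the learning rate when val_loss has not improved for 3 epochs.
rlp = keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss",
    patience=3,
    verbose=1,
)
```

Since the learning-rate patience is shorter than the early-stopping patience, the optimizer gets a chance to recover at a lower learning rate before training is stopped. Callbacks like these are passed to `model.fit` as a list through its `callbacks` argument.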

Train the model.

Supply both EarlyStopping and ReduceLROnPlateau for callbacks. Set epochs to 100.

In [None]:
history = model.fit(
    X_train, y_train, 
    validation_data=(X_valid, y_valid),
    # YOUR CODE HERE
    epochs=100
)

Evaluate the trained model on the test data.
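
The sketch below shows the shape of what `evaluate` returns, using a tiny stand-in model and random data so that it runs on its own. `demo_model`, `X_demo`, and `y_demo` are hypothetical; the numbers it prints say nothing about the tweet classifier.

```python
import numpy as np
from tensorflow import keras

# Hypothetical stand-in data: 32 padded sequences of length 8 with binary labels.
rng = np.random.default_rng(0)
X_demo = rng.integers(1, 50, size=(32, 8))
y_demo = rng.integers(0, 2, size=(32,))

demo_model = keras.Sequential([
    keras.Input(shape=(8,), dtype="int64"),
    keras.layers.Embedding(input_dim=50, output_dim=4),
    keras.layers.GlobalMaxPooling1D(),
    keras.layers.Dense(1, activation="sigmoid"),
])
demo_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# With one metric configured, evaluate returns [loss, accuracy].
test_loss, test_acc = demo_model.evaluate(X_demo, y_demo, verbose=0)
print(f"test loss: {test_loss:.4f}, test accuracy: {test_acc:.4f}")
```

In this notebook the equivalent call would be `model.evaluate(X_test, y_test)` on the preprocessed test set.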

Visualize both loss and accuracy curves for the training and validation data.
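
A possible plotting helper is sketched below. It assumes the history dictionary uses the keys `loss`, `accuracy`, `val_loss`, and `val_accuracy`, which is what Keras records when the model is compiled with `metrics=['accuracy']` and trained with validation data. The helper name `plot_history` is hypothetical, and the numbers in `demo_history` are placeholders for illustration only, not results from this model.

```python
import matplotlib.pyplot as plt

def plot_history(history_dict):
    """Plot training and validation curves for loss and accuracy side by side."""
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    for ax, metric in zip(axes, ("loss", "accuracy")):
        ax.plot(history_dict[metric], label=f"train {metric}")
        ax.plot(history_dict[f"val_{metric}"], label=f"validation {metric}")
        ax.set_xlabel("epoch")
        ax.set_ylabel(metric)
        ax.legend()
    fig.tight_layout()
    return fig

# Placeholder values, for illustration only.
demo_history = {
    "loss": [0.9, 0.6, 0.4],
    "val_loss": [0.95, 0.7, 0.6],
    "accuracy": [0.5, 0.7, 0.8],
    "val_accuracy": [0.5, 0.65, 0.7],
}
fig = plot_history(demo_history)
```

After training, the real curves could be drawn with `plot_history(history.history)`. A widening gap between the training and validation curves is a common sign of overfitting.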