## **Building a transformer from scratch for a multi-label classification task**
#### **Sections:**
 - Importing libraries
 - Data Cleaning & Analysis
 - Transformer Blocks
 - Preparing Train/Test Data
 - Model Training
 - Model Evaluation


### **Importing Libraries** 

In [None]:
import numpy as np  #For numerical operations
import pandas as pd    #For data manipulation
import string          #For string operations
import re              #python regular expressions
import nltk            #For NLP operations
from nltk.corpus import stopwords    #For removing stopwords from text
nltk.download('stopwords')
from nltk.stem import WordNetLemmatizer   #For text lemmatization
nltk.download('wordnet')
from collections import Counter    #For counting words in a list

import torch  #Pytorch for neural networks
import torch.nn as nn 
import torch.optim as optim
from torch.utils.data import DataLoader, random_split  #For handling dataset and batches for training
from sklearn.model_selection import train_test_split    #For train/test data splitting
import math    #For mathematical operations

from transformers import BertTokenizer  #For text tokenization
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score  #Metrics for evaluating the model


### **Data Cleaning**

**Creating functions for removing punctuation and special characters, tokenizing text, removing stop words and lemmatizing text**

In [None]:
def clean_text(text):
    '''
    This function cleans a text by applying the following operations:
    1- Transform to lower case
    2- Remove punctuation and special characters listed in string.punctuation
    3- Replace newline and tab characters with spaces
    4- Collapse repetitive letter sequences [ex: extrrrra --> extra]
    '''
    text = text.lower()   #transforming all text to lower case

    #using a list comprehension to keep every character
    #that is not listed in string.punctuation
    cleaned_text_list = [char for char in text if char not in string.punctuation]

    #rebuilding the string by joining the remaining characters
    cleaned_text = ''.join(cleaned_text_list)

    #using regular expressions to substitute '\n' and '\t' with spaces
    cleaned_text = re.sub(r"\n"," ",cleaned_text)
    cleaned_text = re.sub(r"\t"," ",cleaned_text)
    #collapsing any run of a repeated character into a single character
    #note: this also shortens legitimate double letters [ex: good --> god]
    cleaned_text = re.sub(r'(.)\1+', r'\1', cleaned_text)
    return cleaned_text
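
The repeated-character rule in `clean_text()` is worth checking on real words. The sketch below compares the pattern used above with a hypothetical alternative (`collapse_long_runs`, which is not part of the original notebook) that only shortens runs of three or more characters, so ordinary double letters survive.

```python
import re

def collapse_all_runs(text):
    # Same pattern as clean_text(): any run of a repeated character becomes one character
    return re.sub(r'(.)\1+', r'\1', text)

def collapse_long_runs(text):
    # Hypothetical alternative: only runs of 3 or more are shortened, down to two characters
    return re.sub(r'(.)\1{2,}', r'\1\1', text)

sample = "good extrrrra"
print(collapse_all_runs(sample))   # god extra
print(collapse_long_runs(sample))  # good extrra
```

Neither variant is strictly better: the first normalizes elongated words fully but damages words such as "good", while the second keeps real words intact at the cost of leaving some elongations partly in place.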

In [None]:
def tokenize(text):
    '''
    This function is responsible for the tokenization process,
    which splits a string into a list of words
    '''
    #using the split function to separate words
    #r'\W+' splits the text on runs of non-word characters
    #(a raw string avoids Python treating '\W' as an invalid escape sequence)
    splitted_text = re.split(r'\W+',text)
    return splitted_text
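
One detail of splitting on `\W+` is that leading or trailing whitespace produces empty strings in the result. The short check below illustrates this and shows `re.findall` as a possible alternative; the sample text is made up for illustration.

```python
import re

text = " hello world "

split_tokens = re.split(r'\W+', text)    # splitting on runs of non-word characters
found_tokens = re.findall(r'\w+', text)  # collecting runs of word characters instead

print(split_tokens)  # ['', 'hello', 'world', '']
print(found_tokens)  # ['hello', 'world']
```

The empty strings are harmless for the later BERT tokenization, but they can show up when counting the most common words per label.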

In [None]:
def remove_stop_words(text):
    '''
    This function removes stop words from a list of words using the NLTK library.
    Some example stop words are: i, me, you, my, myself
    '''
    #all English stop words in the library, stored in a set for fast membership checks
    stopword = set(stopwords.words('english'))

    #creating a list with all words, excluding the stop words
    cleaned_text = [word for word in text if word not in stopword]

    return cleaned_text

In [None]:
def lemmatize(text):
    '''
    This function applies lemmatization to a list of words,
    which reduces each word to its dictionary form (lemma),
    and joins the result back into a single string
    '''
    lemm = WordNetLemmatizer()
    lemmatized_text = ' '.join([lemm.lemmatize(word) for word in text])
    return lemmatized_text

#Lemmatization gave much better results on this text than stemming.

In [None]:
og_data = pd.read_csv('train.csv')
og_data.info()

**There are two string columns and six integer columns, and no missing values in the data.**

In [None]:
og_data.head()

**Expected results from applying the following functions to the text:**  

clean_text(): returns the text in lower case with punctuation, special characters and repetitive letters removed  

tokenize(): splits the text into individual words and returns a list of words  

remove_stop_words(): returns the list of words with all stop words removed  

lemmatize(): lemmatizes the words in the list and joins them back into a sentence

In [None]:
#applying the clean_text() function on the data
og_data['comment_text'] = og_data['comment_text'].apply(lambda text:clean_text(text))
og_data['comment_text']

In [None]:
og_data['comment_text'] = og_data['comment_text'].apply(lambda text:tokenize(text))
og_data['comment_text']

In [None]:
og_data['comment_text'] = og_data['comment_text'].apply(lambda text:remove_stop_words(text))
og_data['comment_text']

In [None]:
og_data['tokenized_txt'] = og_data['comment_text']  #Creating a new column with the tokenized text to be used in analysis

In [None]:
og_data['comment_text'] = og_data['comment_text'].apply(lambda text:lemmatize(text))
og_data['comment_text']

**The text is now fully cleaned and lemmatized**

I returned the text as sentences rather than as token lists because I will use the BERT tokenizer. Most of the resources I found used it, so I chose it to make sure the text is tokenized correctly for training.

### **Analysis**

In [None]:
og_data.head()

In [None]:
labels_columns = og_data.columns.tolist()[2:8]  #creating a list of the six label names
label_counts = og_data[labels_columns].sum().sort_values()  #counting the number of occurrences of each label and sorting them
labels_columns

In [None]:
label_counts

Dividing the data into toxic and clean text to check for data imbalance

In [None]:
toxic_data = og_data[og_data[labels_columns].sum(axis=1)>0] #toxic comments: at least one label column has a value of 1
clean_data = og_data[og_data[labels_columns].sum(axis=1)==0] #clean comments: every label column has a value of 0

In [None]:
print(toxic_data.shape)
print(clean_data.shape)

There is a big difference between the number of toxic and clean comments  

Handling data imbalance:  
Implementing random under-sampling, which randomly samples from the majority class (the clean comments) so that it matches the size of the toxic class


In [None]:
#Sampling as many clean comments as there are toxic comments (16225)
sampled_clean = clean_data.sample(n=len(toxic_data), random_state=42)
#combine toxic and the sampled clean data
balanced_df = pd.concat([toxic_data, sampled_clean], axis=0)
#Shuffling the data
balanced_df = balanced_df.sample(frac=1, random_state=42)

balanced_df.head()

In [None]:
print(toxic_data.shape)
print(sampled_clean.shape)
print(balanced_df.shape)

The result is now a balanced dataframe with a total of 32450 text samples  
___

 Finding the most common words in each class to check whether a word can belong to multiple classes.  
 This is done by looping over the text of each label, collecting the words in a list and counting the 5 most common words per label.

In [None]:
for label in labels_columns:
    words = []
    for comment in balanced_df[balanced_df[label]==1]['tokenized_txt']:
        words.extend(comment)
    common_words = Counter(words).most_common(5)
    print(f'Common words in the {label} class: {common_words}')

A word can belong to multiple labels at the same time  
___

### **Transformer Blocks**

This section includes the main building blocks of a transformer which are:  
- Positional Encoding
- Feed Forward
- Multi Head Attention
- Encoder
- Decoder
- Combined Transformer  

The following code cells contain a class for each building block  
___

**Positional Encoding block structure:**  
- The constructor takes three inputs: d_model (the model dimension), seq_len (the maximum sequence length) and dropout (for regularization)
- Creates a tensor of zeros to be filled with the positional encodings
- Calculates the positional encodings based on the formulas from the paper
- Adds the positional encodings to the input
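
For reference, the sinusoidal formulas from the original Transformer paper are written out here, with $pos$ the position in the sequence and $i$ the index of a sine/cosine pair of embedding dimensions:

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)
$$

The `div_term` in the code computes the same denominator in log space, using the identity

$$
\frac{1}{10000^{2i/d_{model}}} = \exp\!\left(-\frac{2i\,\ln 10000}{d_{model}}\right)
$$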

In [None]:
#Positional Encoding class

#creates, for every position in a sentence, a vector of size d_model that tells the model where a word sits

class PositionalEncoding(nn.Module):

    #seq_len: maximum sentence length, dropout: regularization to reduce overfitting
    def __init__(self, d_model: int, seq_len: int, dropout: float) -> None:
        super().__init__()
        self.d_model = d_model
        self.seq_len = seq_len
        self.dropout = nn.Dropout(dropout)

        # Tensor of zeros with shape (seq_len, d_model), will be filled with positional encodings
        pe = torch.zeros(seq_len, d_model)

        #The formulas used in the paper to calculate the positional encodings

        # A column vector of positions with shape (seq_len, 1)
        position = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1)
        # The denominator of the formula, computed in log space
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))

        #Applying sin to the even embedding dimensions and cos to the odd ones
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)

        #Adding a batch dimension so the shape becomes (1, seq_len, d_model)
        pe = pe.unsqueeze(0)

        #Keeping the tensor to be saved with the model but not as a learnable parameter
        self.register_buffer('pe', pe)

    def forward(self, x):
        #x has shape (batch, seq, d_model); the encodings are sliced to the actual sequence length
        #buffers carry no gradient, so no requires_grad_ call is needed
        x = x + self.pe[:, :x.shape[1], :]
        return self.dropout(x)

**Feed Forward Block structure:** 
- The constructor takes two inputs: d_model and d_ff which is the dimension of the inner layer
- Creates 2 fully connected layers
- Defines the activation function
- the input passes to the 1st layer --> activation function --> 2nd layer

In [None]:
#Feed Forward Block
class FeedForwardBlock(nn.Module):
    #d_ff: dimensionality of the inner layer
    def __init__(self, d_model: int, d_ff: int) -> None:
        super().__init__()
        #two fully connected layers
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff,d_model)
        #activation function
        self.relu = nn.ReLU()

    def forward(self,x):
        #input x -> fc1 -> activation function -> fc2
        return self.fc2(self.relu(self.fc1(x)))
         

**Multi-head Attention Block Structure:**  
- The constructor takes two inputs: d_model and num_heads (number of attention heads)
- Apply a linear transformation to each of the inputs Q, K, V
- Split the results into num_heads heads
- Apply scaled dot-product attention in each head
- Combine the heads back and apply the output linear transformation

In [None]:
#Multi head attention
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        # Check that d_model is divisible by the number of heads (num_heads)
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        
        self.d_model = d_model # model dimension
        self.num_heads = num_heads # num of attention heads
        self.d_k = d_model // num_heads # dimension of k,q,v
        
        self.W_q = nn.Linear(d_model, d_model) # query linear transformation
        self.W_k = nn.Linear(d_model, d_model) # key linear transformation
        self.W_v = nn.Linear(d_model, d_model) # value linear transformation
        self.W_o = nn.Linear(d_model, d_model) # output linear transformation
        
    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        
        attn_scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)  #calculating the attention scores with matrix multiplication
        
        if mask is not None:
            attn_scores = attn_scores.masked_fill(mask == 0, -1e9)   #Applying mask to filter out unnecessary part like padding
            #the mask sets the padding to a very large negative value which then is turned to zero by the softmax function
        
        attn_probs = torch.softmax(attn_scores, dim=-1)   #applying softmax to calculate the probability
        
        output = torch.matmul(attn_probs, V)  #multiply the probabilities by the value 
        return output
        
    def split_heads(self, x):  #splits the last dimension into num_heads heads of size d_k: (batch, num_heads, seq_len, d_k)
        batch_size, seq_len, d_model = x.size()
        return x.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        
    def combine_heads(self, x):   #combines the split heads back into the original shape (batch, seq_len, d_model)
        batch_size, _, seq_length, d_k = x.size()
        return x.transpose(1, 2).contiguous().view(batch_size, seq_length, self.d_model)
        
    def forward(self, Q, K, V, mask=None):
        # applying linear transformation and splitting heads to the q,k,v
        Q = self.split_heads(self.W_q(Q))
        K = self.split_heads(self.W_k(K))
        V = self.split_heads(self.W_v(V))
        
        attn_output = self.scaled_dot_product_attention(Q, K, V, mask)  #apply scaled dot-product attention in each head
        
        output = self.W_o(self.combine_heads(attn_output)) #combine the heads, then apply the output linear transformation
        return output
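
To make the masking step concrete, here is a small standalone sketch of scaled dot-product attention with a padding mask. The tensor sizes and token ids are toy values chosen for illustration, and the mask is built the same way as in the `generate_mask` method used later, assuming `0` is the padding id.

```python
import math
import torch

torch.manual_seed(0)
batch, heads, seq, d_k = 1, 2, 4, 8   # toy sizes chosen for illustration

Q = torch.randn(batch, heads, seq, d_k)
K = torch.randn(batch, heads, seq, d_k)
V = torch.randn(batch, heads, seq, d_k)

token_ids = torch.tensor([[5, 9, 3, 0]])            # 0 stands for the padding id
mask = (token_ids != 0).unsqueeze(1).unsqueeze(2)   # shape (1, 1, 1, 4), broadcast over heads and queries

scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)   # shape (1, 2, 4, 4)
scores = scores.masked_fill(mask == 0, -1e9)                     # the padded key gets a huge negative score
weights = torch.softmax(scores, dim=-1)                          # each row sums to 1
output = torch.matmul(weights, V)                                # shape (1, 2, 4, 8)

print(weights.shape, output.shape)
print(weights[0, 0])   # the last column, belonging to the padded key, is effectively zero
```

Because the mask has singleton dimensions for heads and queries, the same padding pattern is applied to every head and every query position.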

**Encoder Block Structure:**  
- Constructor takes four inputs: d_model, num_heads, d_ff, dropout
- the input is passed to the self attention mechanism
- adding residual connection to the self attention output and normalizing it
- passing the output to the feed forward network
- adding another residual connection
- applying normalization to stabilize the final output

In [None]:
#Encoder Block
class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)  
        self.feed_forward = FeedForwardBlock(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x, mask):
        attn_output = self.self_attn(x, x, x, mask)  #self attention mechanism
        x = self.norm1(x + self.dropout(attn_output))  # residual connection and layer normalization
        ff_output = self.feed_forward(x)   # feed forward network
        x = self.norm2(x + self.dropout(ff_output))   #residual connection and layer normalization
        return x

**Decoder Block Structure:**  
- Constructor takes four inputs: d_model, num_heads, d_ff, dropout
- the input is passed to the self attention mechanism
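- the output of the self attention, after a residual connection and normalization, is passed to a second attention instance 'cross_attn', which aligns the decoder output with the relevant parts of the input sequence
- the cross attention output, again after a residual connection and normalization, is passed to the feed forward network
- a final residual connection and normalization are applied for stability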

In [None]:
#Decoder Block
class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = FeedForwardBlock(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x, enc_output, src_mask, tgt_mask):
        attn_output = self.self_attn(x, x, x, tgt_mask)  #self attention mechanism
        x = self.norm1(x + self.dropout(attn_output))  #residual connection and normalization
        attn_output = self.cross_attn(x, enc_output, enc_output, src_mask) #cross attention with input as the encoder output
        x = self.norm2(x + self.dropout(attn_output))  #residual connection and normalization
        ff_output = self.feed_forward(x)  #feed forward network
        x = self.norm3(x + self.dropout(ff_output))  #residual connection and normalization
        return x

**Transformer Block Structure:**  
- Constructor takes the following parameters: vocab_size: vocabulary size, num_classes: number of classes, d_model: model dimensionality, num_heads: number of attention heads, num_layers: number of encoder layers, d_ff: dimension of the inner layer of the feed forward network, seq_len: maximum sequence length, dropout: dropout rate
- the inputs input_ids and attention_mask are passed to the model
- the input ids are passed to the word embedding layer and then to the positional encoding
- the attention mask is multiplied with the embeddings so that padding positions are zeroed out
- the embeddings are then passed to the encoder layers, with each layer applying multi-head self attention and a feed forward network
- average pooling calculates the average of each feature across seq_len
- the pooled vector then passes to a fully connected layer to obtain the logits

In [None]:
class Transformer(nn.Module):
    def __init__(self, vocab_size, num_classes, d_model, num_heads, num_layers, d_ff, seq_len, dropout):
        super().__init__()

        self.embedding = nn.Embedding(vocab_size, d_model)  #word embedding
        self.positional_encoding = PositionalEncoding(d_model, seq_len,dropout)  #positional encoding layer

        self.encoder_layers = nn.ModuleList([EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])  #encoder layers

        self.fc = nn.Linear(d_model, num_classes)  #Fully connected layer
        self.dropout = nn.Dropout(dropout)  # Dropout for regularization

    def generate_mask(self, src):  #Generating mask to deal with paddings
        src_mask = (src != 0).unsqueeze(1).unsqueeze(2)
        return src_mask
    
    def forward(self, input_ids, attention_mask):
        src_mask = self.generate_mask(input_ids)  #generating the mask
        src_embedded = self.dropout(self.positional_encoding(self.embedding(input_ids)))  #applying embedding, positional encoding and dropout

        # Apply the attention mask to the embeddings
        src_embedded = src_embedded * attention_mask.unsqueeze(-1)  #applying attention mask to the input embeddings

        enc_output = src_embedded
        for enc_layer in self.encoder_layers:
            enc_output = enc_layer(enc_output, src_mask)  #passing the input to the encoder layers

        avg_pooled = torch.mean(enc_output, dim=1)  #average pooling

        logits = self.fc(avg_pooled)  #passing the average pooled to the fully connected layer to output the logits
        return logits
    #logits: unnormalized predictions for each class
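
The average pooling step in `forward()` takes the mean over every position, including padded ones. A common alternative, sketched here as an assumption rather than as part of the original model, weights the mean by the attention mask so that only real tokens count. The tensors below are toy values chosen for illustration.

```python
import torch

# One sequence of three positions with two features; the last position is padding
enc_output = torch.tensor([[[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]]])   # shape (1, 3, 2)
attention_mask = torch.tensor([[1, 1, 0]])                               # 1 = real token, 0 = padding

plain_mean = enc_output.mean(dim=1)   # averages over all three positions

mask = attention_mask.unsqueeze(-1).float()                # shape (1, 3, 1)
summed = (enc_output * mask).sum(dim=1)                    # sum over real tokens only
masked_mean = summed / mask.sum(dim=1).clamp(min=1.0)      # divide by the number of real tokens

print(plain_mean)    # dominated by the padded position
print(masked_mean)   # tensor([[2., 2.]])
```

The `clamp(min=1.0)` guards against division by zero for a sequence that consists only of padding.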


**NOTE:** I did my best to research and understand each block of the transformer and how it works. Still, the implementation may contain methods and concepts whose details or mathematical intuition I don't fully understand. I checked multiple resources and references to make sure each block is implemented the way those references describe.

### **Preparing Train/Test Data**

- In this section I take a 60% sample of the balanced dataset to work with
- split the sample into train and test data
- pass the text to the BERT tokenizer for tokenization
- pass the tokenized text to the data loader to shuffle the data and load it in batches for training

In [None]:
#Dropping the id and tokenized text columns
balanced_df = balanced_df.drop('id', axis=1)
balanced_df = balanced_df.drop('tokenized_txt', axis=1)
balanced_df.head()

###### **Dataframe now contains the comment texts and the labels**

In [None]:
len(balanced_df) #checking length of the data before taking a sample

Taking a 60% sample from the dataset for easy processing

In [None]:
percentage = 60
num_rows = int(len(balanced_df) * (percentage / 100))
selected_rows = balanced_df.sample(n=num_rows, random_state=42)
sample_df = pd.DataFrame(selected_rows)
sample_df.head()

In [None]:
len(sample_df)

The data length now is 19470 instead of 32450

In [None]:
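# Split the dataset into training and testing sets
# The labels are selected by name: after dropping 'id', the first label sits at column index 1,
# so slicing with iloc[:, 2:] would silently leave out the 'toxic' column
train_txt, test_txt, train_lbl, test_lbl = train_test_split(sample_df['comment_text'], sample_df[labels_columns], test_size=0.2, random_state=42)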

In [None]:
#converting the text data to a list of comments
train_txt = train_txt.tolist()
test_txt = test_txt.tolist()

In [None]:
#Converting the labels to tensors
train_labels = torch.tensor(train_lbl.values.tolist())
test_labels = torch.tensor(test_lbl.values.tolist())

In [None]:
#Loading Bert tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

In [None]:
seq_len = 128  #Maximum sequence length

# Tokenize the training set
train_encodings = tokenizer(train_txt, max_length=seq_len, truncation=True, padding=True, return_tensors='pt')
# Tokenize the testing set
test_encodings = tokenizer(test_txt, max_length=seq_len , truncation=True, padding=True, return_tensors='pt')

In [None]:
# Define the DataLoader for the training set
train_dataset = torch.utils.data.TensorDataset(train_encodings['input_ids'], train_encodings['attention_mask'], train_labels)
train_Data_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)

In [None]:
# Define the DataLoader for the test set (no shuffling is needed for evaluation)
test_dataset = torch.utils.data.TensorDataset(test_encodings['input_ids'], test_encodings['attention_mask'], test_labels)
test_Data_loader = DataLoader(test_dataset, batch_size=16, shuffle=False)

The data is now loaded in the data loader for shuffling and batching the data for the training and evaluation loops

### **Model Training**

Transformer input arguments:  
- vocab_size: vocabulary size of the pretrained BERT tokenizer
- num_classes: 6 (number of classes in the data)
- d_model: 128
- num_heads: 8
- num_layers: 8
- d_ff: 2048
- seq_len: 128
- dropout: 0.2

In [None]:
# Define the model, optimizer and loss function
model = Transformer(vocab_size=len(tokenizer), num_classes=len(train_labels[0]), d_model=128, num_heads=8, num_layers=8, d_ff=2048, seq_len=seq_len, dropout=0.2)
optimizer = optim.Adam(model.parameters(), lr=0.001)
# Binary cross-entropy on logits: in multi-label classification each label is an independent yes/no decision,
# which matches the per-label sigmoid thresholding used in the evaluation loop
criterion = nn.BCEWithLogitsLoss()
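
The choice of loss function matters for multi-label targets. The check below is a minimal sketch with made-up logits for three hypothetical labels. It assumes a PyTorch version in which `nn.CrossEntropyLoss` accepts floating point class-probability targets (1.10 or later). For a comment with no label set, cross-entropy gives no training signal at all, while binary cross-entropy on logits still penalizes confident positive scores.

```python
import torch
import torch.nn as nn

logits = torch.tensor([[2.0, -1.0, 0.5]])   # raw scores for three hypothetical labels
clean_target = torch.zeros(1, 3)            # a comment with none of the labels set

# Cross-entropy multiplies the log-softmax by the target, so an all-zero target yields zero loss
ce_loss = nn.CrossEntropyLoss()(logits, clean_target)

# Binary cross-entropy treats every label as its own yes/no decision
bce_loss = nn.BCEWithLogitsLoss()(logits, clean_target)

print(ce_loss.item())    # zero: the clean comment contributes nothing to training
print(bce_loss.item())   # positive: the positive logits are penalized
```

In a dataset where half of the comments are clean, this means cross-entropy would ignore roughly half of the training samples.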

Training loop Structure:  
- defining the number of epochs
- sending the model to the GPU if available
- setting the model to training mode and iterating through epochs
- iterate through batches of data using the data loader
- Forward pass to make prediction
- calculate loss
- Backpropagation
- Optimizer step to update parameters
- calculate the average loss per epoch

In [None]:
#Training Loop

epochs = 10  #number of training epochs

device = torch.device("cuda" if torch.cuda.is_available() else "cpu") #Check if a supported GPU is available
model.to(device)  # move the model to the selected device (GPU if available, otherwise CPU)

for epoch in range(epochs):
    model.train()  #setting model in training mode
    total_loss = 0  #accumulated loss for the epoch
    for batch in train_Data_loader:  #iterate through data in batches using the data loader

        input_ids = batch[0].to(device)      #input IDs
        attention_mask = batch[1].to(device)   #attention mask
        labels = batch[2].to(device, dtype=torch.float32)  #target labels as floats, which the loss function expects

        optimizer.zero_grad()  #resetting the gradients accumulated in the previous step
        output = model(input_ids, attention_mask)  #Forward pass
        loss = criterion(output, labels)  #Calculate the loss
        loss.backward()  #Backpropagation
        optimizer.step()  #Optimizer step to update the parameters

        total_loss += loss.item()  #sum the loss per batch
    avg_loss = total_loss / len(train_Data_loader)   #average loss per epoch
    print(f"Epoch {epoch + 1}/{epochs}, Loss: {avg_loss:.4f}")  #printing epochs and average loss


The model has now been trained for 10 epochs on the training data and is ready for evaluation

### **Model Evaluation**

Evaluation loop structure:  
- Set the model to evaluation mode
- Initialize the lists of true and predicted labels
- Disable gradient computation, since no backpropagation is needed
- Iterate through data batches using the data loader
- Forward pass to obtain the logits, which are the raw scores produced by the model for each class
- Apply sigmoid to turn the logits into per-label probabilities
- Threshold the probabilities at 0.5 to get the predictions
- Compute the evaluation metrics for the model
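
When reading the metrics, it helps to know that scikit-learn's `accuracy_score` on multi-label data is the subset (exact match) accuracy: a sample counts as correct only if all of its labels are predicted correctly. The toy arrays below are made up to show how it can differ from the micro-averaged F1 score.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Three samples with three hypothetical labels each
y_true = np.array([[1, 0, 1],
                   [0, 0, 0],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 0],    # one label missed, so the whole sample counts as wrong
                   [0, 0, 0],
                   [1, 1, 0]])

subset_accuracy = accuracy_score(y_true, y_pred)         # 2 of 3 samples match exactly
micro_f1 = f1_score(y_true, y_pred, average='micro')     # 3 true positives, 0 false positives, 1 false negative

print(subset_accuracy)
print(micro_f1)
```

With six labels per comment, exact match is a strict criterion, so the accuracy can sit well below the micro-averaged scores.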

In [None]:
# Evaluate the model

model.eval()  #Setting the model to evaluation mode
y_true, y_pred = [], []  #lists to store true and predicted labels

with torch.no_grad():  #disabling gradient computation as we don't need backpropagation

    for batch in test_Data_loader:  #Iterate through batches of data using the dataloader

        input_ids = batch[0].to(device)  #input IDs
        attention_mask = batch[1].to(device) #Attention mask

        outputs = model(input_ids, attention_mask) #Forward pass, returns one logit per label
        #sigmoid maps each logit to a probability; a label is predicted when it exceeds 0.5
        predictions = (torch.sigmoid(outputs) > 0.5).int().cpu().numpy()

        y_true.extend(batch[2].cpu().numpy())  #true labels in a list
        y_pred.extend(predictions)    # predictions in a list

# Compute the metrics
micro_precision = precision_score(y_true, y_pred, average='micro')
micro_recall = recall_score(y_true, y_pred, average='micro')
micro_f1 = f1_score(y_true, y_pred, average='micro')
accuracy = accuracy_score(y_true, y_pred)  #for multi-label data this is the exact-match (subset) accuracy
print('micro_precision' , micro_precision)
print('micro_recall' , micro_recall)
print('micro_f1' , micro_f1)
print('accuracy' , accuracy)


In [None]:
print(y_true)
print(y_pred)

**Final Notes**: On the first attempt the model got very low scores (0.02). I then revised the data cleaning, resolved the data imbalance problem and adjusted some of the model's input arguments, and the accuracy is now 0.12. That is still very low, and it doesn't change much even when the arguments are changed. I tried a lot but could not find a solution for the problem at this point. I suspect the problem may be either in the evaluation loop or in the Transformer() class.  

Finally, I understand the overall code and how a transformer works, but there are some methods and modules whose inner workings, the reasons for choosing them, and the mathematical intuition behind some operations I don't yet fully understand. Thank you