#  **Introduction**

> Recently I've been reading "**Attention Is All You Need**" paper aka **Tranformer by Google in 2017**. Later on, **Andrej Karpathy** had explained this paper on simple understandable chuncks. Also I've recently switch to pytorch from tensorflow. So, this notebook is of a begineer trying to implement what he has been learning recently. Pytorch has a inbuilt Transformer on `nn.transformer` but still I've tried my best to implement the decoder only transformer architecture using basic pytorch and this notebook will be guide for someone like me. 

In [1]:
import pandas as pd 
import os
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from sklearn.preprocessing import LabelEncoder
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer
from sklearn.metrics import accuracy_score

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\rodolfo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
  from .autonotebook import tqdm as notebook_tqdm


#  Directory Reading

# Data Reading
> As per the dataset description, each file has 4 column i.e. twitter Id, entity, sentiment and text. I've loaded the both training data and validation data for preprocessing.

In [2]:
columns_name = ["t_id", "entity", "sentiment", "text"]
training_data = pd.read_csv("dataset/twitter_training.csv", header=None, names=columns_name, index_col=False)
validation_data = pd.read_csv("dataset/twitter_validation.csv", header=None, names=columns_name, index_col=False)

Just tried to peek into the data.

In [3]:
training_data.head(), validation_data.head()

(   t_id       entity sentiment  \
 0  2401  Borderlands  Positive   
 1  2401  Borderlands  Positive   
 2  2401  Borderlands  Positive   
 3  2401  Borderlands  Positive   
 4  2401  Borderlands  Positive   
 
                                                 text  
 0  im getting on borderlands and i will murder yo...  
 1  I am coming to the borders and I will kill you...  
 2  im getting on borderlands and i will kill you ...  
 3  im coming on borderlands and i will murder you...  
 4  im getting on borderlands 2 and i will murder ...  ,
    t_id     entity   sentiment  \
 0  3364   Facebook  Irrelevant   
 1   352     Amazon     Neutral   
 2  8312  Microsoft    Negative   
 3  4371      CS-GO    Negative   
 4  4433     Google     Neutral   
 
                                                 text  
 0  I mentioned on Facebook that I was struggling ...  
 1  BBC News - Amazon boss Jeff Bezos rejects clai...  
 2  @Microsoft Why do I pay for WORD when it funct...  
 3  CSGO matchm

> from four columns, only **text** and **sentiment** will be used for the sentiment analysis. text will be the input features and sentiment will be the target. Also I'm printing the shape of trainin_data and validation_data to look into their shape before and after preprocessing

In [4]:
training_data = training_data[["text", "sentiment"]]
validation_data = validation_data[["text", "sentiment"]]
print(f"Shape of Training_data:{training_data.shape} | Shape of validation_data:{validation_data.shape}")

Shape of Training_data:(74682, 2) | Shape of validation_data:(1000, 2)


# Data Preprocessing

> This is a basic text cleaning preprocess. It convert text to lowercase, remove urls, remove hashtag for mentions, remove some punctuation and remove numbers. I think these are irrelevant for sentiment analysis. You can add as per your need.

In [5]:
def clean_tweet(text):
    if isinstance(text, str):
        text = text.lower()  # Convert to lowercase
        text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)  # Remove URLs
        text = re.sub(r'\@\w+|\#', '', text)  # Remove mentions and hashtags
        text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
        text = re.sub(r'\d+', '', text)  # Remove numbers
    else:
        text = ''  # Handle non-string inputs like float (NaN) by returning an empty string or handling as needed
    
    return text

In [6]:
training_data["text"] = training_data["text"].apply(clean_tweet)
validation_data["text"] = validation_data["text"].apply(clean_tweet)

> Also I've removed stopwords from the text as it is irrelevant to sentiment analysis.

In [7]:
stop_words = set(stopwords.words('english'))
print(f"Length of stopwords:{len(stop_words)}")

def remove_stopwords(text):
    return ' '.join([word for word in text.split() if word not in stop_words])

training_data["text"] = training_data["text"].apply(remove_stopwords)
validation_data["text"] = validation_data["text"].apply(remove_stopwords)
print(f" Shape of Training data: {training_data.shape} | Shape of Validation data: {validation_data.shape}")

Length of stopwords:179
 Shape of Training data: (74682, 2) | Shape of Validation data: (1000, 2)


> Removing duplicates is relevant to sentiment analysis. Look at the shape of data before and after removing the duplicates.

In [8]:
training_data.drop_duplicates(subset=["text"], inplace=True)
validation_data.drop_duplicates(subset=["text"], inplace=True)
print(f" Shape of Training data: {training_data.shape} | Shape of Validation data: {validation_data.shape}")

 Shape of Training data: (62166, 2) | Shape of Validation data: (997, 2)


# Label encoding
I've used LabelEncoder from sklean. You can use other available modules.
There are four classes as Sentiment. They are **Positive, Negative, Neutral & Irrelevant**

In [9]:
label_encoder = LabelEncoder()

training_data["sentiment"] = label_encoder.fit_transform(training_data["sentiment"])
validation_data["sentiment"] = label_encoder.fit_transform(validation_data["sentiment"])
label_classes = label_encoder.classes_
num_classes = len(label_classes)
num_classes

4

# Custom Dataset from Pytorch
> For a cutom dataset, we should override three functions. They are `__init__`, `__len__` and `__getitem__`

In [10]:
class TwitterSentimentDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_length):
        self.data = dataframe
        self.tokenizer = tokenizer
        self.max_length = max_length
        
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, index):
        text = self.data.iloc[index]["text"]
        sentiment = self.data.iloc[index]["sentiment"]
        
        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length = self.max_length,
            padding = 'max_length',
            truncation = True,
            return_attention_mask = True,
            return_tensors = 'pt' 
        )
        label = torch.tensor(sentiment, dtype=torch.long)
        
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask':encoding['attention_mask'].flatten(),
            'labels':label
        }

# Tokenization
> I've used `BertTokenizer` from huggingface. You can use other as well. Like `tiktoken` (tokenizer of OpenAI), `Sentencepice` by google and many more. In near future I'll use custom tokenizer.

> Be careful about `max_length` and `batch_size`. `Max_length` is the maximum length of token to feed into tokenizer and `batch_size` control the number of data that a model interact in each pass per iteration. Having a low batch_size is time expensive and high batch_size is memory expensive and model can't learn the pattern in the data.

In [11]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
max_length = 128

dataset = TwitterSentimentDataset(
    training_data,
    tokenizer=tokenizer,
    max_length=max_length
)

batch_size = 512
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)



> This is the vocabulary size aka vocab_size of the tokenizer. Tokenizer can use 30522 vocab_size (gpt-2 tokenizer had 50,257) to encode and decode text.

In [12]:
vocab_size = tokenizer.vocab_size
vocab_size

30522

> These are the hyperparameters for the transformer.

In [13]:
n_embed = 32
block_size = 4
dropout = 0.1
n_head = 6
n_layer = 6
device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

# Transformer Block -> Self Attention, Multihead Attention

In [14]:
class Self_Attention(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embed, head_size, bias=False)
        self.query = nn.Linear(n_embed, head_size, bias=False)
        self.value = nn.Linear(n_embed, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(max_length, max_length)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)
        q = self.query(x)
        v = self.value(x)

        wei = torch.matmul(q, k.transpose(-2, -1)) * (k.shape[-1]**-0.5)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))  #comment this line for encoder only transformer
        wei = nn.Softmax(dim=-1)(wei)
        wei = self.dropout(wei)
        out = torch.matmul(wei, v)
        return out

class Multi_Head_Attention(nn.Module):
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Self_Attention(head_size) for _ in range(num_heads)])
        self.projection = nn.Linear(head_size * num_heads, n_embed)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([sa(x) for sa in self.heads], dim=-1)
        out = self.dropout(self.projection(out))
        return out

class Feed_Forward_Network(nn.Module):
    def __init__(self, n_embed):
        super().__init__()
        self.layer = nn.Sequential(
            nn.Linear(n_embed, 4 * n_embed),
            nn.ReLU(),
            nn.Linear(4 * n_embed, n_embed),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.layer(x)

class Block(nn.Module):
    def __init__(self, n_embed, n_head):
        super().__init__()
        head_size = n_embed // n_head
        self.self_attention = Multi_Head_Attention(n_head, head_size)
        self.feed_forward = Feed_Forward_Network(n_embed)
        self.layer_norm_1 = nn.LayerNorm(n_embed)
        self.layer_norm_2 = nn.LayerNorm(n_embed)

    def forward(self, x):
        x = x + self.self_attention(self.layer_norm_1(x))
        x = x + self.feed_forward(self.layer_norm_2(x))
        return x

# SentimentAnalysisTransformer

In [15]:
class SentimentAnalysisTransformer(nn.Module):
    def __init__(self):  
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embed)
        self.positional_embedding_table = nn.Embedding(max_length, n_embed)
        self.block = nn.Sequential(*[Block(n_embed, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embed)
        self.lm_head = nn.Linear(n_embed, num_classes)
        
    def forward(self, x):
        B, T = x.shape
        tok_emb = self.token_embedding_table(x)  # (B, T, n_embed)
        pos_emb = self.positional_embedding_table(torch.arange(T, device=x.device))  # (T, n_embed)
        x = tok_emb + pos_emb  # (B, T, n_embed)
        x = self.block(x)  # (B, T, n_embed)
        x = self.ln_f(x)  # (B, T, n_embed)
        logits = self.lm_head(x.mean(dim=1))  # (B, T, vocab_size)
        return logits

In [16]:
model = SentimentAnalysisTransformer()
model.to(device)

SentimentAnalysisTransformer(
  (token_embedding_table): Embedding(30522, 32)
  (positional_embedding_table): Embedding(128, 32)
  (block): Sequential(
    (0): Block(
      (self_attention): Multi_Head_Attention(
        (heads): ModuleList(
          (0-5): 6 x Self_Attention(
            (key): Linear(in_features=32, out_features=5, bias=False)
            (query): Linear(in_features=32, out_features=5, bias=False)
            (value): Linear(in_features=32, out_features=5, bias=False)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (projection): Linear(in_features=30, out_features=32, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (feed_forward): Feed_Forward_Network(
        (layer): Sequential(
          (0): Linear(in_features=32, out_features=128, bias=True)
          (1): ReLU()
          (2): Linear(in_features=128, out_features=32, bias=True)
          (3): Dropout(p=0.1, inplace=False)
        )
      )
      (la

# Loss function and Optimizer

In [17]:
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Training

In [18]:
num_epochs= 60

for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0
    
    for batch in dataloader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        
        y_hat = model(input_ids)
        loss = loss_fn(y_hat, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        running_loss += loss.item()        
    epoch_loss = running_loss / len(dataloader)
    print(f"Epoch {epoch+1}/{num_epochs}, Loss: {epoch_loss:.4f}")

Epoch 1/60, Loss: 1.3426
Epoch 2/60, Loss: 1.2323
Epoch 3/60, Loss: 1.1179
Epoch 4/60, Loss: 1.0277
Epoch 5/60, Loss: 0.9509
Epoch 6/60, Loss: 0.8700
Epoch 7/60, Loss: 0.7985
Epoch 8/60, Loss: 0.7346
Epoch 9/60, Loss: 0.6737
Epoch 10/60, Loss: 0.6174
Epoch 11/60, Loss: 0.5663
Epoch 12/60, Loss: 0.5225
Epoch 13/60, Loss: 0.4809
Epoch 14/60, Loss: 0.4408
Epoch 15/60, Loss: 0.4087
Epoch 16/60, Loss: 0.3783
Epoch 17/60, Loss: 0.3523
Epoch 18/60, Loss: 0.3237
Epoch 19/60, Loss: 0.3040
Epoch 20/60, Loss: 0.2851
Epoch 21/60, Loss: 0.2665
Epoch 22/60, Loss: 0.2471
Epoch 23/60, Loss: 0.2409
Epoch 24/60, Loss: 0.2188
Epoch 25/60, Loss: 0.2077
Epoch 26/60, Loss: 0.1935
Epoch 27/60, Loss: 0.1840
Epoch 28/60, Loss: 0.1793
Epoch 29/60, Loss: 0.1658
Epoch 30/60, Loss: 0.1576
Epoch 31/60, Loss: 0.1539
Epoch 32/60, Loss: 0.1439
Epoch 33/60, Loss: 0.1413
Epoch 34/60, Loss: 0.1341
Epoch 35/60, Loss: 0.1298
Epoch 36/60, Loss: 0.1200
Epoch 37/60, Loss: 0.1179
Epoch 38/60, Loss: 0.1099
Epoch 39/60, Loss: 0.

# Validation Dataset and Dataloader

In [19]:
validation_dataset = TwitterSentimentDataset(validation_data, tokenizer=tokenizer, max_length=max_length)
validation_loader = DataLoader(validation_dataset, batch_size=batch_size, shuffle=False)

# Prediction

In [20]:
model.eval()
all_pred = []
all_labels = []

with torch.inference_mode():
    for batch in validation_loader:
        input_ids = batch['input_ids'].to(device)
        labels = batch['labels'].to(device)

    outputs = model(input_ids)
    _, preds = torch.max(outputs, dim=1)

    all_pred.extend(preds.cpu().numpy())
    all_labels.extend(labels.cpu().numpy())

# Accuracy Calculation

In [21]:
accuracy = accuracy_score(all_labels, all_pred)
print(f"validation accuracy is:{accuracy}")

validation accuracy is:0.931958762886598
