## Building a Transformer for a Multi-classification NLP Problem
## Project Phases:
### 1. Analysis & Cleaning
### 2. Preparing Input & Target
### 3. Building Model
### 4. Training
### 5. Testing
### 6. Optimizing

###### Import libraries and modules

In [1]:
import pandas as pd
import numpy as np
import math
import re
from collections import Counter

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data
from torch.nn.utils.rnn import pad_sequence

from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split

###### Load data

In [2]:
df = pd.read_csv("train.csv")

In [3]:
df.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 8 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   id             159571 non-null  object
 1   comment_text   159571 non-null  object
 2   toxic          159571 non-null  int64 
 3   severe_toxic   159571 non-null  int64 
 4   obscene        159571 non-null  int64 
 5   threat         159571 non-null  int64 
 6   insult         159571 non-null  int64 
 7   identity_hate  159571 non-null  int64 
dtypes: int64(6), object(2)
memory usage: 9.7+ MB


In [5]:
#shows if there are nulls in any column
df.isna().apply(pd.value_counts)

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
False,159571,159571,159571,159571,159571,159571,159571,159571


### What's concluded from the above is as follows:
- there are 8 columns and 159571 rows
- the id column has no use for the problem
- the comment_text column is of type string and holds the comments to be cleaned, analysed, and finally used as the input to the encoder block of the transformer
- the remaining six columns mark each category a comment belongs to with a 0 or 1 (they might be combined into one column to use as the target for the transformer)
- there are no nulls
- the comments are full of punctuation and special characters (which need to be removed)
- the comments vary in length, so they will need padding

### Next step:

- 
#### Cleaning:
    - start with primary cleaning (lowercase the data; remove links, emails, punctuation, special characters, numbers, and stop words) to allow faster computation and easier further analysis (next step)
- 
#### Analysis:
    - determine frequency of words
    - identify most and least common words
    - spot any irregular phrases or patterns that might create noise (need to be removed)
    - identify patterns in the categorization (is a category associated with certain words?)
    - check if there are any uncategorized comments (find a way to categorize them or remove them)
-
#### Further Cleaning:
    - based on the conclusions of the analysis, I will modify the data by removing or changing specific values (removing certain words or categorizing uncategorized comments)

### 1. Analysis & Cleaning

- using regular expressions, the function below lowercases the text and removes links, punctuation and special characters, emails, and numbers
- the data will then be analyzed

In [6]:
def clean(text):
    clean_text = text.lower()
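    # remove links
    clean_text = re.sub(r"https?://\S+|www\.\S+", "", clean_text)
    # remove punctuation and special characters; newlines and "@" are deleted too,
    # with no space inserted, so words split only by a newline fuse (e.g. "explanationwhy")
    clean_text = re.sub(r"[^a-zA-Z0-9 ]", "", clean_text)
    # remove emails (no "@" is left after the previous step, so this pattern no longer matches)
    clean_text = re.sub(r"([\w\.\-\_]+@[\w\.\-\_]+)", "", clean_text)
    # remove numbers
    clean_text = re.sub(r"(\d+)", "", clean_text)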
    return clean_text
df["comment_text"] = df["comment_text"].apply(clean)
df["comment_text"]

0         explanationwhy the edits made under my usernam...
1         daww he matches this background colour im seem...
2         hey man im really not trying to edit war its j...
3         morei cant make any real suggestions on improv...
4         you sir are my hero any chance you remember wh...
                                ...                        
159566    and for the second time of asking when your vi...
159567    you should be ashamed of yourself that is a ho...
159568    spitzer umm theres no actual article for prost...
159569    and it looks like it was actually you who put ...
159570    and  i really dont think you understand  i cam...
Name: comment_text, Length: 159571, dtype: object
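
The output above shows fused words such as "explanationwhy" and "morei", because newlines are deleted without leaving a space. The following is a minimal sketch of a variant, not the function used in this notebook: the name clean_variant and the sample string are assumptions made for illustration. It removes emails while "@" is still present and turns every other non-letter run into a single space.

```python
import re

def clean_variant(text: str) -> str:
    """Hypothetical variant of clean(): same goals, different order of steps."""
    text = text.lower()
    # remove links first, while "://" and "." are still present
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    # remove emails before punctuation is stripped, while "@" still exists
    text = re.sub(r"[\w.\-]+@[\w.\-]+", " ", text)
    # drop apostrophes without a space, so "I'm" still becomes "im"
    text = text.replace("'", "")
    # replace every run of non-letters (punctuation, digits, newlines) with one space
    text = re.sub(r"[^a-z]+", " ", text)
    # drop leading and trailing spaces
    return text.strip()

sample = "Explanation\nWhy mail me@example.com or see https://example.org/page 42 times!"
print(clean_variant(sample))
```

Running this variant on the dataset would change every later output in the notebook, so the numbers reported below still refer to the original clean function.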

- since "daww" in the second row is an irregular phrase, I looked for it in other rows
- but there were none
- I did the same with "umm", which matched 3533 rows; str.contains matches substrings, so this count also includes rows with words that merely contain "umm", such as "summarize" in row 40
- I decided to add "daww" and "umm" to the stop words that will be removed

In [7]:
df.loc[df["comment_text"].str.contains('daww')]

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
1,000103f0d9cfb60f,daww he matches this background colour im seem...,0,0,0,0,0,0


In [8]:
df.loc[df["comment_text"].str.contains('umm')]

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
40,001735f961a23fc4,sure but the lead must briefly summarize arme...,0,0,0,0,0,0
80,00328eadb85b3010,minimization of textile effluenta proposed del...,0,0,0,0,0,0
98,003d77a20601cec1,thanks much however if its been resolved why ...,0,0,0,0,0,0
142,00587c559177dcf2,replyare you being facetious if not you would ...,0,0,0,0,0,0
251,00a1aabcab9d44a0,yes they are indeed ive replaced that second ...,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...
159183,f9c943b15a015b99,important notice the actions that you have tak...,0,0,0,0,0,0
159203,fa35aba966be5657,fourth examination th december additional sta...,0,0,0,0,0,0
159455,fdefec22131d625a,i would value your opinion on this the recent ...,0,0,0,0,0,0
159496,fefb3cc85d4d74b2,from wpthird opinion the inclusion of that se...,0,0,0,0,0,0
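
Because str.contains matches substrings, a search for "umm" also hits words that merely contain it (row 40 above contains "summarize"). A minimal sketch of counting only rows where "umm" is a whole token, using a toy Series whose contents are made up for illustration:

```python
import pandas as pd

# toy comments, invented for this example
comments = pd.Series([
    "spitzer umm theres no actual article",
    "the lead must briefly summarize the topic",
    "summary of the discussion",
])

# substring match: True for every row containing the letters "umm"
substring_hits = comments.str.contains("umm")

# token match: split on whitespace and test membership of the exact word
token_hits = comments.str.split().apply(lambda tokens: "umm" in tokens)

print(substring_hits.sum(), token_hits.sum())
```

The token check is the one that corresponds to what stop-word removal does later, since stop words are compared against whole tokens.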


- as part of the analysis, I check whether there are uncategorized comments (all six category columns equal to 0)
- there are 143,346 such rows
- I will look for patterns in the categorized comments to try to label them manually

In [9]:
df.loc[(df["toxic"] == 0) & (df["severe_toxic"] == 0) & (df["obscene"] == 0) & (df["threat"] == 0) & (df["insult"] == 0) & (df["identity_hate"] == 0)]

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,explanationwhy the edits made under my usernam...,0,0,0,0,0,0
1,000103f0d9cfb60f,daww he matches this background colour im seem...,0,0,0,0,0,0
2,000113f07ec002fd,hey man im really not trying to edit war its j...,0,0,0,0,0,0
3,0001b41b1c6bb37e,morei cant make any real suggestions on improv...,0,0,0,0,0,0
4,0001d958c54c6e35,you sir are my hero any chance you remember wh...,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...
159566,ffe987279560d7ff,and for the second time of asking when your vi...,0,0,0,0,0,0
159567,ffea4adeee384e90,you should be ashamed of yourself that is a ho...,0,0,0,0,0,0
159568,ffee36eab5c267c9,spitzer umm theres no actual article for prost...,0,0,0,0,0,0
159569,fff125370e4aaaf3,and it looks like it was actually you who put ...,0,0,0,0,0,0
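
The six-way condition above can also be written against a list of label columns. A minimal sketch on a toy frame (the name label_cols and the toy rows are assumptions for illustration):

```python
import pandas as pd

label_cols = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# three toy rows: unlabeled, toxic + obscene, unlabeled
toy = pd.DataFrame(
    [[0, 0, 0, 0, 0, 0],
     [1, 0, 1, 0, 0, 0],
     [0, 0, 0, 0, 0, 0]],
    columns=label_cols,
)

# True where every label column is 0, i.e. the row has no category
unlabeled = toy[label_cols].eq(0).all(axis=1)

print(unlabeled.tolist())
print(int(unlabeled.sum()), "unlabeled rows")
```

The same mask can be reused wherever the notebook needs the uncategorized rows, instead of repeating the long condition.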


In [10]:
df.loc[df["obscene"] == 1]

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
6,0002bcb3da6cb337,cocksucker before you piss around on my work,1,1,1,0,1,0
42,001810bf8c45bf5f,you are gay or antisemmitian archangel white t...,1,0,1,0,1,1
43,00190820581d90ce,fuck your filthy mother in the ass dry,1,0,1,0,1,0
51,001dc38a83d420cf,get fucked up get fuckeeed up got a drink tha...,1,0,1,0,0,0
55,0020e7119b96eeeb,stupid peace of shit stop deleting my stuff as...,1,1,1,0,1,0
...,...,...,...,...,...,...,...,...
159411,fd2f53aafe8eefcc,fat piece of shit you obese piece of shit i th...,1,0,1,0,1,0
159493,fef142420a215b90,fucking faggot lolwat,1,0,1,0,1,0
159494,fef4cf7ba0012866,our previous conversation you fucking shit ea...,1,0,1,0,1,1
159541,ffa33d3122b599d6,your absurd edits your absurd edits on great w...,1,0,1,0,1,0


In [11]:
df["comment_text"][51]

'get fucked up get fuckeeed up  got a drink that you cant put down get fuck up get fucked up  im fucked up right now'

- the cell above shows the comments categorized as obscene, of which there are 8449
- as shown, words like "fuck", "ass", "shit", "asshole", and "hell" appear in many of the obscene comments
- so they can be used as a condition to label the uncategorized comments
- in the rows shown, the obscene comments are all categorized as toxic as well
- so I will check whether this holds for the rest by checking whether the number of comments categorized as both toxic and obscene is also 8449
- if it is, comments categorized as obscene can automatically be categorized as toxic as well

In [12]:
df.loc[(df["toxic"] == 1)  & (df["obscene"] == 1)]

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
6,0002bcb3da6cb337,cocksucker before you piss around on my work,1,1,1,0,1,0
42,001810bf8c45bf5f,you are gay or antisemmitian archangel white t...,1,0,1,0,1,1
43,00190820581d90ce,fuck your filthy mother in the ass dry,1,0,1,0,1,0
51,001dc38a83d420cf,get fucked up get fuckeeed up got a drink tha...,1,0,1,0,0,0
55,0020e7119b96eeeb,stupid peace of shit stop deleting my stuff as...,1,1,1,0,1,0
...,...,...,...,...,...,...,...,...
159411,fd2f53aafe8eefcc,fat piece of shit you obese piece of shit i th...,1,0,1,0,1,0
159493,fef142420a215b90,fucking faggot lolwat,1,0,1,0,1,0
159494,fef4cf7ba0012866,our previous conversation you fucking shit ea...,1,0,1,0,1,1
159541,ffa33d3122b599d6,your absurd edits your absurd edits on great w...,1,0,1,0,1,0


- as shown above, there are 7926 comments that are both toxic and obscene
- so there are 523 comments (8449 - 7926) that are obscene but not toxic
- nevertheless, treating obscene comments as toxic is a rough approximation that can be used
- below, I will label the uncategorized comments that contain "fuck", "ass", "shit", "asshole", or "hell" as obscene and toxic

In [13]:
def categorize_obscene(df):
    condition = (df["comment_text"].str.contains("fuck|ass|shit|asshole|hell")) & (df["toxic"] == 0) & (df["severe_toxic"] == 0) & (df["obscene"] == 0) & (df["threat"] == 0) & (df["insult"] == 0) & (df["identity_hate"] == 0)
    df.loc[condition, ["toxic", "obscene"]] = 1
    return df
df = categorize_obscene(df)
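
The pattern passed to str.contains is matched anywhere inside the text, so it also fires inside longer words. A minimal sketch comparing it with a whole-word pattern; word_pattern and the toy comments are assumptions for illustration, not what the notebook ran:

```python
import pandas as pd

# toy comments, invented for this example
comments = pd.Series([
    "what the hell is this",
    "hello to the whole class",
    "pass the glass",
])

substring_pattern = "fuck|ass|shit|asshole|hell"       # the pattern used above
word_pattern = r"\b(?:fuck|ass|shit|asshole|hell)\b"   # hypothetical whole-word variant

# substring match: "hell" is found in "hello", "ass" in "class", "pass", "glass"
print(comments.str.contains(substring_pattern).tolist())

# whole-word match: only the first comment contains one of the words on its own
print(comments.str.contains(word_pattern).tolist())
```

The whole-word variant trades one error for another: it no longer matches "class", but it also misses inflected forms such as "fucking", so neither pattern is a precise labeling rule.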

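- the function above builds a condition that selects comments that are not labeled and contain any of the strings mentioned
- str.contains matches substrings, so the pattern also matches inside longer words (for example "ass" in "class" or "hell" in "hello"), and "asshole" is already covered by "ass"
- it then sets the toxic and obscene columns to 1 for the rows that meet the condition
- as shown below, comments labeled obscene and toxic are now 26494
- they were 7,926 before, which is an increase of 18,568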

In [14]:
df.loc[(df["toxic"] == 1)  & (df["obscene"] == 1)]

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
6,0002bcb3da6cb337,cocksucker before you piss around on my work,1,1,1,0,1,0
15,00078f8ce7eb276d,juelz santanas agein juelz santana was years...,1,0,1,0,0,0
22,000c0dfd995809fa,snowflakes are not always symmetrical under g...,1,0,1,0,0,0
27,000ffab30195c5e1,yes because the mother of the child in the cas...,1,0,1,0,0,0
33,001363e1dbe91225,i was able to post the above list so quickly b...,1,0,1,0,0,0
...,...,...,...,...,...,...,...,...
159524,ff66383a2793fa14,sorry i find that theres nothing honorable abo...,1,0,1,0,0,0
159541,ffa33d3122b599d6,your absurd edits your absurd edits on great w...,1,0,1,0,1,0
159554,ffbdbb0483ed0841,and im going to keep posting the stuff u delet...,1,0,1,0,1,0
159557,ffc7bbb177c3c966,it is my opinion that that happens to be offto...,1,0,1,0,0,0


- as shown below, the unlabeled rows were 143,346 and are now 124,778
- the remaining rows are left unlabeled, on the assumption that they are not offensive enough to fall under any of the categories
- the comments shown below do not show aggressiveness and do not include offensive words like the ones mentioned earlier

In [15]:
df.loc[(df["toxic"] == 0) & (df["severe_toxic"] == 0) & (df["obscene"] == 0) & (df["threat"] == 0) & (df["insult"] == 0) & (df["identity_hate"] == 0)]

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,explanationwhy the edits made under my usernam...,0,0,0,0,0,0
1,000103f0d9cfb60f,daww he matches this background colour im seem...,0,0,0,0,0,0
2,000113f07ec002fd,hey man im really not trying to edit war its j...,0,0,0,0,0,0
3,0001b41b1c6bb37e,morei cant make any real suggestions on improv...,0,0,0,0,0,0
4,0001d958c54c6e35,you sir are my hero any chance you remember wh...,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...
159566,ffe987279560d7ff,and for the second time of asking when your vi...,0,0,0,0,0,0
159567,ffea4adeee384e90,you should be ashamed of yourself that is a ho...,0,0,0,0,0,0
159568,ffee36eab5c267c9,spitzer umm theres no actual article for prost...,0,0,0,0,0,0
159569,fff125370e4aaaf3,and it looks like it was actually you who put ...,0,0,0,0,0,0


- continuing the analysis, I will identify the most and least common words
- the comments are first tokenized so that a counter can be applied to the tokens

In [16]:
df["comment_text"] = df["comment_text"].apply(nltk.word_tokenize)
df["comment_text"]

0         [explanationwhy, the, edits, made, under, my, ...
1         [daww, he, matches, this, background, colour, ...
2         [hey, man, im, really, not, trying, to, edit, ...
3         [morei, cant, make, any, real, suggestions, on...
4         [you, sir, are, my, hero, any, chance, you, re...
                                ...                        
159566    [and, for, the, second, time, of, asking, when...
159567    [you, should, be, ashamed, of, yourself, that,...
159568    [spitzer, umm, theres, no, actual, article, fo...
159569    [and, it, looks, like, it, was, actually, you,...
159570    [and, i, really, dont, think, you, understand,...
Name: comment_text, Length: 159571, dtype: object

In [17]:
word_counts = Counter()
for tokens in df["comment_text"]:
    word_counts.update(tokens)

word_counts.most_common(15)

[('the', 490086),
 ('to', 296181),
 ('of', 223797),
 ('and', 221106),
 ('a', 213713),
 ('you', 201238),
 ('i', 192393),
 ('is', 175356),
 ('that', 153583),
 ('in', 142792),
 ('it', 127627),
 ('for', 101599),
 ('not', 96168),
 ('this', 94764),
 ('on', 88881)]

- shown above are the 15 most common words, which are all stop words that will be removed
- the remove_stop_words function below uses NLTK's list of English stop words, to which I added "daww" and "umm"
- the function takes in the tokenized comments and returns them without the stop words

In [18]:
# build the stop-word set once instead of rebuilding it for every comment
custom_stopwords = ["daww", "umm"]
stop_words = set(stopwords.words("english"))
stop_words.update(custom_stopwords)

def remove_stop_words(text):
    # keep only the tokens that are not stop words
    return [word for word in text if word not in stop_words]
df["comment_text"] = df["comment_text"].apply(remove_stop_words)
df["comment_text"]

0         [explanationwhy, edits, made, username, hardco...
1         [matches, background, colour, im, seemingly, s...
2         [hey, man, im, really, trying, edit, war, guy,...
3         [morei, cant, make, real, suggestions, improve...
4                [sir, hero, chance, remember, page, thats]
                                ...                        
159566    [second, time, asking, view, completely, contr...
159567          [ashamed, horrible, thing, put, talk, page]
159568    [spitzer, theres, actual, article, prostitutio...
159569    [looks, like, actually, put, speedy, first, ve...
159570    [really, dont, think, understand, came, idea, ...
Name: comment_text, Length: 159571, dtype: object

In [19]:
word_counts = Counter()

for tokens in df["comment_text"]:
    word_counts.update(tokens)

word_counts.most_common(15)

[('article', 53866),
 ('page', 43972),
 ('wikipedia', 33869),
 ('talk', 31130),
 ('would', 29092),
 ('please', 27891),
 ('one', 27714),
 ('like', 27572),
 ('dont', 25879),
 ('see', 21101),
 ('also', 19945),
 ('think', 19925),
 ('know', 18911),
 ('im', 18643),
 ('people', 17504)]

- after removing the stop words, the most common words make more sense
- the entries of the word counter show the frequency of each word
- they also show a lot of repeated letters such as cccccc, kkkkkk, zzzzzzzz and the repeated pattern hahahaha
- most of the words with a count of 3 or less are repeated letters like those above, gibberish, and misspelled words

In [20]:
def remove_infrequent_words(column, n=4):
    word_counts = Counter()

    for row in column:
        word_counts.update(row)

    frequent_words = {word for word, frequency in word_counts.items() if frequency > n}
    
    def filter_words(tokens):
        return [word for word in tokens if word in frequent_words]

    column = column.apply(filter_words)

    return column
df["comment_text"] = remove_infrequent_words(df["comment_text"])
df["comment_text"]

0         [edits, made, username, hardcore, metallica, f...
1         [matches, background, colour, im, seemingly, s...
2         [hey, man, im, really, trying, edit, war, guy,...
3         [morei, cant, make, real, suggestions, improve...
4                [sir, hero, chance, remember, page, thats]
                                ...                        
159566    [second, time, asking, view, completely, contr...
159567          [ashamed, horrible, thing, put, talk, page]
159568    [spitzer, theres, actual, article, prostitutio...
159569    [looks, like, actually, put, speedy, first, ve...
159570    [really, dont, think, understand, came, idea, ...
Name: comment_text, Length: 159571, dtype: object

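- after removing words that occur 4 times or fewer (the function above keeps only words with frequency > n, where n = 4), the comments are converted to numerical values
- the lookup table passed as vocab is word_counts, so each token is replaced by its frequency count rather than by a unique vocabulary index (for example, "page" becomes 43972), and words with the same count share the same value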

In [21]:
def text_to_indices(tokenized_text, vocab):
    return [[vocab[token] for token in tokens] for tokens in tokenized_text]

df["numerical_text"] = text_to_indices(df["comment_text"], word_counts)
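
For comparison, here is a minimal sketch of a vocabulary that gives every word its own index, so that a lookup returns an index rather than a count. The names build_vocab, encode, and PAD_INDEX are assumptions for illustration, as is reserving index 0 for padding.

```python
from collections import Counter

PAD_INDEX = 0  # assumption: index 0 is kept free so it can be used for padding

def build_vocab(tokenized_comments):
    """Map each distinct word to a unique index, starting at 1."""
    counts = Counter(token for tokens in tokenized_comments for token in tokens)
    # more frequent words get smaller indices; every word gets a distinct one
    return {word: index for index, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(tokenized_comments, vocab):
    """Replace every token with its vocabulary index."""
    return [[vocab[token] for token in tokens] for tokens in tokenized_comments]

# toy tokenized comments, invented for this example
toy = [["page", "talk", "page"], ["article", "page"]]
vocab = build_vocab(toy)
print(vocab)
print(encode(toy, vocab))
```

With such a mapping, the embedding layer's vocabulary size would be len(vocab) + 1, counting the padding index.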

### What's concluded from the above is as follows:
- the comments are clean and ready to be embedded as the model's input
- the categories will be set as the target for the model

### Next step:

- 
#### Preparing Input:
    - set the input
- 
#### Preparing Target:
    - set categories columns as the target
- 
#### Split Data:
    - split data for training and testing
    - pad the input sequences to a common length

### 2. Preparing Input & Target

- the input of the transformer will be the numerical_text column

In [22]:
Input = df["numerical_text"]
Input

0         [9745, 9520, 1760, 147, 35, 979, 3851, 452, 38...
1         [341, 696, 207, 18643, 164, 336, 12005, 31130,...
2         [2505, 2643, 18643, 9157, 4975, 17377, 3751, 1...
3         [14, 6101, 12879, 3679, 715, 448, 108, 9788, 3...
4                       [537, 217, 1068, 2067, 43972, 6823]
                                ...                        
159566    [2821, 15141, 1305, 3788, 2194, 113, 619, 4573...
159567                 [161, 251, 5209, 6139, 31130, 43972]
159568             [5, 3249, 1648, 53866, 66, 217, 27, 153]
159569    [2139, 27572, 5958, 6139, 4817, 10566, 3274, 8...
159570    [9157, 25879, 19925, 4834, 2097, 3130, 3369, 8...
Name: numerical_text, Length: 159571, dtype: object

- because each category is in a separate column, the columns will be combined to represent a comment's categories (the target)
- in the following cell, the columns are merged into one column "Target", whose values are stored as lists and then converted to a tensor
- in this way each comment has a corresponding list of 6 zeros and/or ones indicating its categories

In [23]:
def merge_columns():
    columns_to_merge = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
    df["Target"] = df[columns_to_merge].values.tolist()
    return torch.tensor(df["Target"], dtype=torch.int32)
target = merge_columns()
target.shape

torch.Size([159571, 6])

- the input and target are split into 40% training and 60% testing (test_size = 0.60)

In [24]:
X_train, X_test, y_train, y_test = train_test_split(Input,target, test_size = 0.60)
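
A minimal sketch of what test_size = 0.60 does to the row counts, on ten toy samples. The toy data and the random_state argument are assumptions added for illustration; the cell above does not fix a seed, so its split differs from run to run.

```python
from sklearn.model_selection import train_test_split

toy_inputs = [[i] for i in range(10)]       # ten toy "comments"
toy_targets = [i % 2 for i in range(10)]    # ten toy labels

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_inputs,
    toy_targets,
    test_size=0.60,     # 60% of the rows go to the test set
    random_state=0,     # assumption: fixed seed so the split is reproducible
)

print(len(toy_X_train), len(toy_X_test))
```

Applied to the 159571 rows of this dataset, the same proportion leaves 63828 rows for training, which is the size that appears in the shapes printed further down.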

- the training input, X_train, is padded using the function below
- the function converts each sequence in the column to a tensor, then pads the sequences with zeros using pad_sequence

In [25]:
def pad_column(column):
    # Convert each sequence in the column to a tensor
    column_data = [torch.tensor(sequence, dtype=torch.int32) for sequence in column]

    # Pad the tensors using pad_sequence
    padded_data = pad_sequence(column_data, padding_value = 0)

    return padded_data

In [26]:
X_train_padded = pad_column(X_train)
X_train_padded

tensor([[ 2139,    93, 27891,  ...,   302, 15489,  5812],
        [   32,    73,  8199,  ...,    12,  2941, 10405],
        [ 4674,  1042,  5608,  ...,  2194,    82, 33869],
        ...,
        [    0,     0,     0,  ...,     0,     0,     0],
        [    0,     0,     0,  ...,     0,     0,     0],
        [    0,     0,     0,  ...,     0,     0,     0]], dtype=torch.int32)

In [27]:
X_train_padded.shape

torch.Size([1250, 63828])

In [28]:
X_train_padded = np.transpose(X_train_padded)
X_train_padded

tensor([[ 2139,    32,  4674,  ...,     0,     0,     0],
        [   93,    73,  1042,  ...,     0,     0,     0],
        [27891,  8199,  5608,  ...,     0,     0,     0],
        ...,
        [  302,    12,  2194,  ...,     0,     0,     0],
        [15489,  2941,    82,  ...,     0,     0,     0],
        [ 5812, 10405, 33869,  ...,     0,     0,     0]], dtype=torch.int32)

In [29]:
X_train_padded.shape

torch.Size([63828, 1250])
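
The transpose is needed because pad_sequence returns a tensor of shape (max_length, batch) by default. A minimal sketch, on toy sequences invented for illustration, of getting the (batch, max_length) layout directly with batch_first=True:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# two toy sequences of different lengths
sequences = [torch.tensor([5, 3, 8]), torch.tensor([2])]

# default layout: (max_length, batch)
time_major = pad_sequence(sequences, padding_value=0)

# batch_first=True: (batch, max_length), no transpose needed afterwards
batch_major = pad_sequence(sequences, batch_first=True, padding_value=0)

print(time_major.shape, batch_major.shape)
print(batch_major)
```

Both routes give the same values; batch_first=True only saves the separate transpose step.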

- cast y_train to a 64-bit integer tensor (torch.LongTensor)

In [30]:
y_train = y_train.type(torch.LongTensor)
y_train

tensor([[0, 0, 0, 0, 0, 0],
        [1, 0, 1, 0, 0, 0],
        [0, 0, 0, 0, 0, 0],
        ...,
        [0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0]])

In [31]:
y_train.shape

torch.Size([63828, 6])

- the same is done for the test data

In [32]:
X_test_padded = np.transpose(pad_column(X_test))
X_test_padded

tensor([[    7,  8962,   329,  ...,     0,     0,     0],
        [   17,  6139,   729,  ...,     0,     0,     0],
        [   21,   846,  2107,  ...,     0,     0,     0],
        ...,
        [ 2505,  5837,  2194,  ...,     0,     0,     0],
        [12005,  5907,   912,  ...,     0,     0,     0],
        [  230,   500,   450,  ...,     0,     0,     0]], dtype=torch.int32)

In [33]:
y_test = y_test.type(torch.LongTensor)
y_test

tensor([[0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0],
        [1, 0, 0, 0, 0, 0],
        ...,
        [0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0]])

### What's concluded from the above is as follows:
- the input and target of the model are set
- the data is split into training and testing

### Next step:

- 
#### Building the model:
    - build all the transformer's components

### 3. Building Model

- 
#### The steps of building the transformer model are as follows:
    - Building the basic blocks: Multi-head Attention, Position-wise Feed-Forward Networks, Positional Encoding
    - Building the Encoder block
    - Building the Decoder block
    - Combining the Encoder and Decoder 

In [34]:
class MultiHeadAttention(nn.Module):
    # the class is initialized with the embedding dimension and the number of heads
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        # make sure the model's dimension is divisible by the number of heads,
        # to be able to compute the embedding size of each head (d_k below)
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"

        self.d_model = d_model # embedding dimension, e.g. 128
        self.num_heads = num_heads # number of heads, e.g. 8
        # compute the dimension of one head
        self.d_k = d_model // num_heads # e.g. 128 / 8 = 16 for each key, query, value

        # create a linear layer for each of the query, key, and value projections
        # weight shape: d_model x d_model (e.g. 128 x 128)
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        
        # the linear layer that produces the final output from the combined heads
        self.W_o = nn.Linear(d_model, d_model)
        
    def scaled_dot_product_attention(self, Q, K, V, mask = None):
        
        # compute the attention scores: the dot product of the queries and keys,
        # divided by the square root of the key dimension d_k

        attn_scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)

        # where the mask equals zero, the score is set to a large negative value (-1e9),
        # so the softmax assigns that position a weight close to zero
        # in the decoder this blocks positions ahead of the current time step
        # (masked multi-head self-attention)
        if mask is not None:
            attn_scores = attn_scores.masked_fill(mask == 0, -1e9)
            
        # the attention scores are put into a softmax to compute the attention weights
        attn_probs = torch.softmax(attn_scores, dim=-1)
        
        # the weighted sum of values is the multiplication of the attention weights and the value vector
        output = torch.matmul(attn_probs, V)
        
        return output
        
    # this method takes a tensor holding the q (or k or v) projections of all heads
    # and splits it so that each head's slice is in a separate matrix
    def split_heads(self, x):

        # get the batch size and sequence length to reshape the tensor accordingly
        # the split involves two steps:
        # 1- reshaping: the last axis (d_model) is split into (num_heads, d_k),
        #    giving shape (batch_size, seq_length, num_heads, d_k)
        # 2- transpose: swaps the second and third axes so that each head's rows are together,
        #    giving shape (batch_size, num_heads, seq_length, d_k)
        batch_size, seq_length, d_model = x.size()
        return x.view(batch_size, seq_length, self.num_heads, self.d_k).transpose(1, 2)
        
    def combine_heads(self, x):
        
        # here it reverses what the previous function did
        batch_size, _, seq_length, d_k = x.size()
        return x.transpose(1, 2).contiguous().view(batch_size, seq_length, self.d_model)
        
    def forward(self, Q, K, V, mask=None):
        
        # project Q, K, V with their linear layers, then split them into heads
        Q = self.split_heads(self.W_q(Q))
        K = self.split_heads(self.W_k(K))
        V = self.split_heads(self.W_v(V))

        # the output is generated by calling the scaled_dot_product_attention method,
        # then the heads are recombined and passed through the output linear layer
        attn_output = self.scaled_dot_product_attention(Q, K, V, mask)
        output = self.W_o(self.combine_heads(attn_output))
        return output
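
A minimal standalone sketch of the scaled dot-product step, to make the tensor shapes concrete. It mirrors the method above but is written as a free function that also returns the attention weights; the toy sizes are assumptions chosen to match the 128 / 8 = 16 example in the comments.

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    # scores: similarity of every query with every key, scaled by sqrt(d_k)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(Q.size(-1))
    if mask is not None:
        # masked positions get a large negative score, so softmax gives them ~0 weight
        scores = scores.masked_fill(mask == 0, -1e9)
    weights = torch.softmax(scores, dim=-1)
    # weighted sum of the value vectors
    return torch.matmul(weights, V), weights

batch_size, num_heads, seq_length, d_k = 2, 8, 5, 16   # toy sizes, d_model = 8 * 16 = 128
Q = torch.randn(batch_size, num_heads, seq_length, d_k)
K = torch.randn(batch_size, num_heads, seq_length, d_k)
V = torch.randn(batch_size, num_heads, seq_length, d_k)

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape)    # (batch, heads, seq_length, d_k)
print(weights.shape)   # (batch, heads, seq_length, seq_length)
```

Each row of weights sums to 1, since it is a softmax over the keys for one query position.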

In [35]:
class PositionWiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super(PositionWiseFeedForward, self).__init__()        
        # the position-wise feed-forward network is used in both the encoder and the decoder layers
        # it consists of two linear layers with a ReLU activation in between
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.fc2(self.relu(self.fc1(x)))

In [36]:
class PositionalEncoding(nn.Module):
    # this module adds the positional encoding to the input tensor
    # so that the transformer can identify word positions, using sine and cosine functions
    def __init__(self, d_model, max_seq_length):
        super(PositionalEncoding, self).__init__()        
        # a tensor of zeros is created with shape (max_seq_length, d_model)
        # each position gets a vector of size d_model (e.g. 128), the same size as a token embedding,
        # so that pe can be added to the input tensor
        pe = torch.zeros(max_seq_length, d_model)
        
        # position holds the position indices 0 .. max_seq_length - 1 as a column vector
        # div_term holds the factors that scale the position indices, one per pair of embedding dimensions
        # the sine is written to the even embedding dimensions and the cosine to the odd ones
        position = torch.arange(0, max_seq_length, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))
        
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        
        self.register_buffer('pe', pe.unsqueeze(0))
        
    # here the stored positional encoding values are added to the input tensor    
    def forward(self, x):
        return x + self.pe[:, :x.size(1)]
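
The buffer built above corresponds to the sinusoidal encoding PE(pos, 2i) = sin(pos / 10000^(2i / d_model)) and PE(pos, 2i + 1) = cos(pos / 10000^(2i / d_model)). A minimal sketch that rebuilds the table with toy sizes (an assumption for illustration) and checks one entry against that formula:

```python
import math
import torch

d_model, max_seq_length = 8, 10   # toy sizes for illustration

# same construction as in PositionalEncoding.__init__
position = torch.arange(0, max_seq_length, dtype=torch.float).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))
pe = torch.zeros(max_seq_length, d_model)
pe[:, 0::2] = torch.sin(position * div_term)   # even embedding dimensions
pe[:, 1::2] = torch.cos(position * div_term)   # odd embedding dimensions

# check embedding dimensions 2i = 4 and 2i + 1 = 5 at position 3
pos, i = 3, 2
angle = pos / (10000 ** (2 * i / d_model))
print(math.isclose(pe[pos, 2 * i].item(), math.sin(angle), abs_tol=1e-5))
print(math.isclose(pe[pos, 2 * i + 1].item(), math.cos(angle), abs_tol=1e-5))
```

The exp/log form of div_term is just another way of writing 10000^(-2i / d_model).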

In [37]:
class EncoderLayer(nn.Module):
    # the encoder layer encapsulates its components: the multi-head attention layer and the feed-forward layer
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff)
        # and adds two normalization layers: one after the multi-head attention and one after the feed forward
        # layer normalization rescales each token's features to zero mean and unit variance, which stabilizes training
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # in addition to a dropout layer, which is applied before each normalization layer
        # dropout helps prevent overfitting by randomly zeroing some activations during training
        # its rate is a hyperparameter that could be tuned to optimize the model later
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x, mask):
        # the encoder layer runs as follows:
        # 1- the input (word embedding + positional encoding) is fed into the multi-head attention layer
        attn_output = self.self_attn(x, x, x, mask)
        # 2- dropout is applied to the attention output, which is added back to the input
        #    (the residual connection) and then normalized
        x = self.norm1(x + self.dropout(attn_output))
        # 3- the output of the previous step goes into the feed forward layer
        ff_output = self.feed_forward(x)
        # 4- finally, dropout, the second residual connection, and the second normalization are applied
        x = self.norm2(x + self.dropout(ff_output))
        return x
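
A minimal sketch of the "add and normalize" pattern used twice in the layer above, with random tensors standing in for the input and the sub-layer output (toy sizes and the seed are assumptions for illustration). Dropout is switched to evaluation mode so the check is deterministic.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 128

x = torch.randn(2, 5, d_model)                # (batch, seq_length, d_model)
sublayer_output = torch.randn(2, 5, d_model)  # stand-in for attention or feed-forward output

norm = nn.LayerNorm(d_model)
dropout = nn.Dropout(0.1)
dropout.eval()  # evaluation mode: dropout passes values through unchanged

# residual connection followed by layer normalization
y = norm(x + dropout(sublayer_output))

print(y.shape)
# after LayerNorm, each token's feature vector has mean ~0 and variance ~1
print(y.mean(dim=-1).abs().max().item())
print(y.var(dim=-1, unbiased=False).mean().item())
```

The shape is unchanged by this step, which is what allows several encoder layers to be stacked.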

In [38]:
class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(DecoderLayer, self).__init__()
        # the decoder layer consists of two multi-head attention layers: a masked self-attention layer
        # and a cross-attention layer over the encoder output
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        # a feed forward layer
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff)
        # and three normalization layers and corresponding dropout layers
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x, enc_output, src_mask, tgt_mask):
        # the decoder layer runs as follows:
        # 1- the target goes into the masked self-attention layer
        attn_output = self.self_attn(x, x, x, tgt_mask)
        # 2- dropout is applied to the attention output, which is added back to the target
        #    (the first residual connection) and then normalized
        x = self.norm1(x + self.dropout(attn_output))
        # 3- the result is used as the query of the cross-attention layer,
        #    with the encoder block's output as the keys and values
        attn_output = self.cross_attn(x, enc_output, enc_output, src_mask)
        # 4- dropout, the second residual connection, and the second normalization are applied
        x = self.norm2(x + self.dropout(attn_output))
        # 5- the output of the second residual connection goes into the feed forward layer
        # 6- and finally, the output goes through the third residual connection and normalization
        ff_output = self.feed_forward(x)
        x = self.norm3(x + self.dropout(ff_output))
        return x

In [39]:
class Transformer(nn.Module):
    # the transformer takes in all parameters including: source vocab size, target vocab size, embedding dimension,
    # number of heads, number of layers of encoder and decoder, feed-forward layer dimension, maximum sequence length,
    # and dropout rate
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, num_heads, num_layers, d_ff, max_seq_length, dropout):
        super(Transformer, self).__init__()
        # the transformer takes in the components: embedding layers for the encoder and decoder and the positional encoding
        self.encoder_embedding = nn.Embedding(src_vocab_size, d_model)
        self.decoder_embedding = nn.Embedding(tgt_vocab_size, d_model)
        self.positional_encoding = PositionalEncoding(d_model, max_seq_length)
        # the stacks of encoder and decoder layers (num_layers of each; 6 in this project)
        self.encoder_layers = nn.ModuleList([EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.decoder_layers = nn.ModuleList([DecoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        # linear layer to map the output of the decoder to the target vocabulary size
        self.fc = nn.Linear(d_model, tgt_vocab_size)
        # dropout layer to prevent overfitting
        self.dropout = nn.Dropout(dropout)

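    # this method generates the padding masks and the look-ahead (no-peek) mask
    def generate_mask(self, src, tgt):
        # padding masks: True where the token id is not 0 (the padding index)
        src_mask = (src != 0).unsqueeze(1).unsqueeze(2)
        tgt_mask = (tgt != 0).unsqueeze(1).unsqueeze(3)
        seq_length = tgt.size(1)
        # lower-triangular mask that stops each position from attending to later positions;
        # it is created on the same device as tgt so the & below also works on a GPU
        nopeak_mask = (1 - torch.triu(torch.ones(1, seq_length, seq_length, device=tgt.device), diagonal=1)).bool()
        tgt_mask = tgt_mask & nopeak_mask
        return src_mask, tgt_mask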

    def forward(self, src, tgt):
        src_mask, tgt_mask = self.generate_mask(src, tgt)
        src_embedded = self.dropout(self.positional_encoding(self.encoder_embedding(src)))
        tgt_embedded = self.dropout(self.positional_encoding(self.decoder_embedding(tgt)))

        enc_output = src_embedded
        for enc_layer in self.encoder_layers:
            enc_output = enc_layer(enc_output, src_mask)

        dec_output = tgt_embedded
        for dec_layer in self.decoder_layers:
            dec_output = dec_layer(dec_output, enc_output, src_mask, tgt_mask)

        output = self.fc(dec_output)
        return output
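
To see what `generate_mask` produces, the same tensor operations can be run on a tiny batch. This is a sketch under assumptions: the token ids below are made up, and 0 is taken to be the padding index, as the `!= 0` comparisons suggest.

```python
import torch

# hypothetical token-id batches; 0 is assumed to be the padding index
src = torch.tensor([[5, 7, 0, 0]])   # (batch=1, src_len=4), last two positions padded
tgt = torch.tensor([[3, 4, 2]])      # (batch=1, tgt_len=3), no padding

# padding masks, shaped so they broadcast over heads and query positions
src_mask = (src != 0).unsqueeze(1).unsqueeze(2)       # (1, 1, 1, 4)
tgt_pad_mask = (tgt != 0).unsqueeze(1).unsqueeze(3)   # (1, 1, 3, 1)

# look-ahead mask: position i may only attend to positions <= i
seq_length = tgt.size(1)
nopeak_mask = (1 - torch.triu(torch.ones(1, seq_length, seq_length), diagonal=1)).bool()

tgt_mask = tgt_pad_mask & nopeak_mask                 # broadcasts to (1, 1, 3, 3)

print(src_mask.shape, tgt_mask.shape)
print(tgt_mask[0, 0])   # lower-triangular pattern of True values
```

The source mask hides padded positions only, while the target mask additionally hides every position to the right of the current one.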

### What's concluded from above is as follows:
- all of the model's components are built

### Next step:

#### Train the model:
- prepare the data by batching
- identify the model's arguments, such as source and target vocab size, embedding dimension, number of heads and more
- run the training loop

### 4. Training

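- get the size of the vocabulary of the comments using the function below
- it iterates over the output of the Counter's `elements()` method and adds 1 at each iteration
- because `elements()` repeats each word as many times as it occurs, the returned total is the number of tokens in the corpus, which is an upper bound on the number of distinct words (`len(word_counts)`, assuming `word_counts` is a Counter)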

In [40]:
def get_vocab_size(text):
    element_iterator = text.elements()
    element_count = sum(1 for i in element_iterator)
    return element_count

- the value returned is 5,413,393, which is used here as the vocabulary size
- it will be used later as an argument for the transformer

In [41]:
input_vocab_size = get_vocab_size(word_counts)
input_vocab_size

5413393
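
As a small illustration of what counting over `Counter.elements()` returns, the sketch below compares it with the number of distinct words. The toy sentence is an assumption standing in for the cleaned comments; only standard `collections.Counter` behavior is used.

```python
from collections import Counter

# hypothetical toy corpus standing in for the cleaned comments
tokens = "the cat sat on the mat the end".split()
word_counts = Counter(tokens)

# same logic as get_vocab_size: elements() yields each word once per occurrence
total_tokens = sum(1 for _ in word_counts.elements())

# number of distinct words, i.e. the number of keys in the Counter
distinct_words = len(word_counts)

print(total_tokens)
print(distinct_words)
```

An embedding layer needs one row per distinct token id, so the distinct-word count (plus any special ids such as padding) is usually the relevant figure, depending on how the ids were assigned during preprocessing.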

- PyTorch's Dataset class is used as the first step to batch the model's data
- PyTorch's DataLoader then calls its `__len__` and `__getitem__` methods to build the batches

In [42]:
class CustomDataset(Dataset):
    def __init__(self, data, targets):
        self.data = data
        self.targets = targets
    
    # this method returns the total number of samples 
    def __len__(self):
        return len(self.data)
    
    # this method retrieves individual samples (the comment and its category) from the dataset during training
    def __getitem__(self, index):
        x = self.data[index]
        y = self.targets[index]
        return x, y

In [43]:
# keep the DataLoader itself (not iter(...)) so that it can be looped over again in every epoch
loader = DataLoader(CustomDataset(X_train_padded, y_train), batch_size=15, shuffle=True)
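
A `DataLoader` can be looped over once per epoch, whereas an iterator created from it with `iter(...)` is used up after a single pass. The sketch below shows the difference on toy tensors; `features` and `labels` are hypothetical stand-ins for `X_train_padded` and `y_train`.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# hypothetical toy tensors standing in for X_train_padded and y_train
features = torch.arange(12).view(6, 2)
labels = torch.arange(6)
dataset = TensorDataset(features, labels)

# an iterator over the loader: the second pass yields nothing
wrapped = iter(DataLoader(dataset, batch_size=2))
first_pass = sum(1 for _ in wrapped)
second_pass = sum(1 for _ in wrapped)

# the loader itself: a fresh iterator is created for every loop
plain = DataLoader(dataset, batch_size=2)
plain_passes = [sum(1 for _ in plain) for _ in range(2)]

print(first_pass, second_pass)
print(plain_passes)
```

In a multi-epoch training loop this means that only the first epoch sees any batches if the loader was wrapped in `iter(...)` beforehand.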

In [44]:
# define the model's arguments
src_vocab_size = input_vocab_size
tgt_vocab_size = 6
d_model = 128
num_heads = 8
num_layers = 6
d_ff = 2048
max_seq_length = len(X_train_padded[0])
dropout = 0.1

# initialize the model
transformer = Transformer(src_vocab_size, tgt_vocab_size, d_model, num_heads, num_layers, d_ff, max_seq_length, dropout)

In [45]:
# loss function and optimizer
# note: ignore_index=0 excludes every target equal to 0 from the loss
criterion = nn.CrossEntropyLoss(ignore_index=0)
optimizer = optim.Adam(transformer.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)

transformer.train()

Transformer(
  (encoder_embedding): Embedding(5413393, 128)
  (decoder_embedding): Embedding(6, 128)
  (positional_encoding): PositionalEncoding()
  (encoder_layers): ModuleList(
    (0-5): 6 x EncoderLayer(
      (self_attn): MultiHeadAttention(
        (W_q): Linear(in_features=128, out_features=128, bias=True)
        (W_k): Linear(in_features=128, out_features=128, bias=True)
        (W_v): Linear(in_features=128, out_features=128, bias=True)
        (W_o): Linear(in_features=128, out_features=128, bias=True)
      )
      (feed_forward): PositionWiseFeedForward(
        (fc1): Linear(in_features=128, out_features=2048, bias=True)
        (fc2): Linear(in_features=2048, out_features=128, bias=True)
        (relu): ReLU()
      )
      (norm1): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
      (norm2): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
  )
  (decoder_layers): ModuleList(
    (0-5): 6 x DecoderLayer(
 

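- the training loop doesn't generate any errors; however, it keeps running without displaying anything
- the loop only prints once per epoch, after a full pass over the training data, so no output is expected until the first epoch finishes
- I restarted the whole notebook countless times, but nothing changed
- it may simply have needed more time to run
- I manually interrupted the cell, as shown below, to close the notebook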
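
One possible contributor to the long running time is the size of the encoder embedding. The rough estimate below uses only the numbers from the model summary (`Embedding(5413393, 128)`); the float32 storage and the factor of four (weights, gradient, and Adam's two moment buffers) are assumptions about a default PyTorch setup, so the result is a back-of-the-envelope figure rather than a measurement.

```python
# values taken from the printed model summary
src_vocab_size = 5_413_393
d_model = 128

# one d_model-sized row per vocabulary entry
embedding_params = src_vocab_size * d_model

# assumption: float32 weights, 4 bytes each
bytes_fp32 = embedding_params * 4

# assumption: weights + dense gradient + Adam's two moment buffers = 4 copies
approx_training_bytes = bytes_fp32 * 4

print(f"{embedding_params:,} embedding weights")
print(f"{bytes_fp32 / 1e9:.2f} GB for the weights alone")
print(f"{approx_training_bytes / 1e9:.2f} GB including gradient and Adam state")
```

Since the optimizer updates every one of these weights at each step, a smaller vocabulary size would likely shorten each iteration considerably.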

In [46]:
# training loop
for epoch in range(3):
    total_loss = 0.0
    
    for i, batch_data in enumerate(loader):
        optimizer.zero_grad()
        # unpack the batch
        src_data, tgt_data = batch_data  
        
        # forward pass
        output = transformer(src_data, tgt_data[:, :-1])
        
        # compute loss
        loss = criterion(output.contiguous().view(-1, tgt_vocab_size), tgt_data[:, 1:].contiguous().view(-1))
        
        # backpropagation
        loss.backward()
        
        # update weights
        optimizer.step()
        
        total_loss += loss.item()
    
    # print the average loss for each epoch
    print(f"Epoch: {epoch + 1}, Loss: {total_loss / len(loader)}")

KeyboardInterrupt: 
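
To tell a slow loop from a stuck one, it can help to report the running loss every few batches instead of once per epoch. The sketch below shows the pattern on a toy problem; the random data, the `nn.Linear` model, and the `log_every` interval are all hypothetical and only serve to keep the example small and self-contained.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# hypothetical toy data and model, only to demonstrate the logging pattern
X = torch.randn(64, 10)
y = torch.randint(0, 2, (64,))
loader = DataLoader(TensorDataset(X, y), batch_size=8, shuffle=True)

model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

log_every = 4    # hypothetical reporting interval, in batches
history = []     # running averages, kept so they can be plotted later

model.train()
for epoch in range(2):
    running_loss = 0.0
    for step, (inputs, targets) in enumerate(loader, start=1):
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        if step % log_every == 0:
            avg = running_loss / log_every
            history.append(avg)
            print(f"epoch {epoch + 1}, step {step}/{len(loader)}, avg loss {avg:.4f}")
            running_loss = 0.0
```

Running the real loop on a small subset of the training data first is another quick way to estimate how long a full epoch would take.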

### 5. Testing

- create a test loader with the test input and target

In [None]:
# shuffling is not needed for evaluation
test_loader = DataLoader(CustomDataset(X_test_padded, y_test), batch_size=15, shuffle=False)

- this is the testing loop, which evaluates the model on the test split (the 60%)

In [None]:
# testing loop
transformer.eval()
total_loss = 0.0

with torch.no_grad():
    for i, batch_data in enumerate(test_loader):
        src_data, tgt_data = batch_data

        # forward pass
        output = transformer(src_data, tgt_data[:, :-1])

        # compute loss
        loss = criterion(output.contiguous().view(-1, tgt_vocab_size), tgt_data[:, 1:].contiguous().view(-1))

        total_loss += loss.item()

# average the loss over the number of test batches
test_loss = total_loss / len(test_loader)
test_loss

### 6. Optimizing

- optimizing the model will involve changing hyperparameters such as the model's dimension and the batch size
- regularization is another option: the dropout rate can be changed to reduce overfitting
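
One simple way to organize such experiments is to enumerate every combination of the candidate values. The sketch below only builds the list of configurations; the candidate values are illustrative assumptions, and the training and evaluation of each configuration is left as a placeholder comment.

```python
from itertools import product

# hypothetical candidate values; d_model stays divisible by num_heads = 8
search_space = {
    "d_model": [64, 128],
    "batch_size": [15, 32],
    "dropout": [0.1, 0.3],
}

# one dict per combination of the candidate values
configs = [
    dict(zip(search_space, values))
    for values in product(*search_space.values())
]

for config in configs:
    # a real run would build the model with these values, train it,
    # and record the validation loss for comparison
    print(config)
```

With two values per hyperparameter this already gives eight runs, so a small grid or a subset of the data keeps the total training time manageable.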