# LLM From Scratch
#### (By: Mark Ehab Aziz)
##### (Built under Python 3.11.4)

Graduation Project for ***Sprints***, to build a LLM (Large Language Model) from scratch without the use of any Transformer related libraries, in order to classify comments scraped from some given website.

Notebook will include:
- List of imported libraries.
- Deep dive into the data.
- Markdown cells to explain the detail of every step, along with reason.
- Custom defined functions and classes.
- Own Transformer from scratch.

List of Libraries/Dependencies:
- [Pandas](https://pandas.pydata.org/docs/index.html)
- [NumPy](https://numpy.org/doc/stable/)
- [NLTK](https://www.nltk.org/)
    - [NLTK Regex Tokenizer](https://www.nltk.org/howto/tokenize.html)
    - [NLTK Snowball Stemmer](https://www.nltk.org/howto/stem.html)
    - [NLTK WordNet Lemmatizer](https://www.nltk.org/howto/wordnet.html)
    - [NLTK English Words (Stopwords too)](https://www.nltk.org/howto/corpus.html)
- [PyTorch (`torch` and `torch.nn`)](https://pytorch.org/docs/stable/index.html)

Note: NLTK will be used later for tokenization using RegEx.

# Library Imports
Importing libraries that will be used in order to implement our own LLM.

In [1]:
import pandas as pd                             # Pandas for DataFrame manipulation
import numpy as np                              # Linear Algebra and Mathematical Operations
import nltk                                     # Downloading word sets
from nltk.stem import SnowballStemmer           # Stemming
from nltk.stem import WordNetLemmatizer         # Lemmatization (Better word yields)
from nltk.corpus import stopwords               # Stopwords
from nltk.tokenize import regexp_tokenize       # Tokenization using RegEx (Regular Expressions)

In [2]:
# Downloading external dependencies to be used
# Such as Stopwords, English-only words

# Stopwords
nltk.download('stopwords')

# English Only
nltk.download('words')

# Wordnet for Lemmatizer
nltk.download('wordnet')

# Setting constants for both normal and stopwords in English
ENGLISH_STOPWORDS = set(stopwords.words('english'))
ENGLISH_WORDS = set(nltk.corpus.words.words())

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# All About Data

## Data Import
Loading the data using Panda's `read_csv()` function.

In [3]:
# Reading the data with 2 ways
# 1 - Reading within my Github Repo
txt_dat = pd.read_csv('../dataset/train.csv')

# 2 - Reading within the same folder
#txt_dat = pd.read_csv('train.csv')

## Data Exploration
Looking at the data from multiple perspectives.

Using `head()` and `tail()` methods to look at what columns there are within the dataframe, which may be useful and which are not.

In [4]:
# Defining n rows
n = 5

# Calling and showing first and last n rows
display(txt_dat.head(n), txt_dat.tail(n), txt_dat.shape)

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\r\nWhy the edits made under my use...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\r\nMore\r\nI can't make any real suggestions...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
159566,ffe987279560d7ff,""":::::And for the second time of asking, when ...",0,0,0,0,0,0
159567,ffea4adeee384e90,You should be ashamed of yourself \r\n\r\nThat...,0,0,0,0,0,0
159568,ffee36eab5c267c9,"Spitzer \r\n\r\nUmm, theres no actual article ...",0,0,0,0,0,0
159569,fff125370e4aaaf3,And it looks like it was actually you who put ...,0,0,0,0,0,0
159570,fff46fc426af1f9a,"""\r\nAnd ... I really don't think you understa...",0,0,0,0,0,0


(159571, 8)

Looking at the data, we can see that we have about $159571$ entries.
We can also see that there are columns as such:
- `id`: Shows message id code from whatever platform this was scraped from. (Will be dropped)
- `comment_text`: main star of the show, text we have to process, it has upper-case letters, escape characters (`\n`,`\r`, etc.), and most likely special characters (non-latin-alphabet), URLs, IP addresses; removal needed.
- `toxic` (And its derivatives): Labelling the comment if it were a toxic or not.

## Defining What To Clean
In the above mentioned description of the data, there exists aspects that need removal, explanation will follow within this cell.

- Removal of id: Due to the nature of what this project is about, and usually a general case, IDs are not usually if at all used to gain insights from data, due to them being unique per entry and being nothing but an enumeration of entries.
- Cleaning within `comment_text`: Cleaning the comments will be separated into $x$ steps, namely:
    - Space and Tab removal: Removing of Newline Characters (`\n`) and Tabs (`\t`) will help remove special characters from text and avoiding noise within.
    - Uppercase: Due to standarisations within the NLP field, it has been agreed upon to change any uppercase letter into lowercase, to mitigate the face that words can be written using multiple permutations of the same letters but with different cases, so in order to for the machine to recognise the word and not have to account for $n^{52}$ different combinations for the word (Where $n$ is number of characters to represent a word, and $52$ is due to both upper and lower cases of a letter)
    - URL: Removal of URLs will prove beneficial, as it doesn't contribute much to the corpus nor is considered a baseline for labelling the comment.
    - IP Address: For security reasons.
    - Special Characters: Due to the non-existence of any in the Latin Alphabet which English uses, it would be useless to bother with them, although if this was a multi-lingual dataset, some characters from different languages would be needed to keep.
- Derivatives of `toxic`: For purposes of simplicity, I have decided to *"collapse"* the values that follow after the *`toxic`* column, as in summing the values into said column, then swapping values $>1$ to be just $1$ to indicate toxicity, implying that $0$ would be for non-toxic comments; as a result, this will water down into just a "Binary Classification" problem based on words.

In [5]:
# Collapsing the column values onto toxic
txt_dat['toxic'] = txt_dat.iloc[:, 2:].sum(axis = 1)

# Drop the collapsed columns
# Along with the id column
txt_dat.drop(columns = ['id', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate'], inplace = True)

In [6]:
# Printing the head and describing the toxic column
display(txt_dat.head(n), txt_dat.describe())

Unnamed: 0,comment_text,toxic
0,Explanation\r\nWhy the edits made under my use...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\r\nMore\r\nI can't make any real suggestions...",0
4,"You, sir, are my hero. Any chance you remember...",0


Unnamed: 0,toxic
count,159571.0
mean,0.219952
std,0.74826
min,0.0
25%,0.0
50%,0.0
75%,0.0
max,6.0


As noted, due to collapsing the values onto `toxic`, values over 1 arose hence they will need handling in order to keep the labels as $1$ and $0$. A function doing just so will be implemented later.

### Regex Patterns
Identified Patterns that will be required to be used to capture specific instances of removable slices of text within the comments.

In [7]:
# Newline, Tab spaces, etc.
newline_tabspace = r'[\r\n\t]'

# Match words starting with Uppercase letters
upper_words = r'([A-Z])\w+'

# Match Words that start with either Upper/lowercase letters
upperlower_words = r'[A-Za-z]\w+'

# Sub/Superscript characters
# Encountered previously
sub_sup_scripts = r'\w[²³¹⁰ⁱ⁴⁵⁶⁷⁸⁹⁺⁻⁼⁽⁾ⁿ]+'

# Punctuation
punc_pattern = r'[!\?.,\':;"]'

# Single Letters
single_letter = r'((?<=^)|(?<= )).((?=$)|(?= ))'

# Match URLs
url_pattern = r'(http|ftp|https):\/\/([\w+?\.\w+])+([a-zA-Z0-9\~\!\@\#\$\%\^\&\*\(\)_\-\=\+\\\/\?\.\:\;\'\,]*)?'

# Cleaning

## Functions and such
Functions to be used and applied on the dataframe.

Each function will be commented, whilst also writing an explanation to a grouped cell of functions indicating their use.

In [8]:
def binarize(df: pd.DataFrame, colname: str):
    return np.where(df[colname] > 0, 1, 0)

def clean_comment(df: pd.DataFrame, colname: str):
    # Remove the newlines, tabs, etc.
    df[colname].replace(newline_tabspace, ' ', regex = True, inplace = True)

    # Remove URLs
    df[colname].replace(url_pattern, ' ', regex = True, inplace = True)

    # Remove superscript, subscript
    df[colname].replace(sub_sup_scripts, ' ', regex = True, inplace = True)

    # Remove punctuation
    df[colname].replace(punc_pattern, ' ', regex = True, inplace = True)

    # Remove Single Letters
    df[colname].replace(single_letter, ' ', regex = True, inplace = True)

    return df[colname]

A walkthrough the above functions:
- `binarize(df, colname)`: Function that takes a `pd.DataFrame`, `string` for input, dictating which DataFrame and Column within said DataFrame to carry out the operations on.
Due to the rise of values larger than 1 after collapsing the other column values, a requirement for binarizing (Chose to make the labels binary, for simplicity's sake) the values, $1$ for Toxic and $0$ for Non-Toxic, this is achieved by assigning any value larger than 0 to 1 otherwise stays as 0.

- `clean_comment(df, colname)`: Function that takes a `pd.DataFrame`, `string` for input, dictating which DataFrame and Column within said DataFrame to carry out the operations on.
The nature of the data given, is that it has a lot of *whitespaces*, *tabs*, *special characters*, *urls*, *punctuations*, *single letters*, which are all bound to be removed.
The function removes:
    - Newlines
    - Tabs
    - URLs
    - Sub/Super Scripts
    - Punctuation
    - Single Letters (From removal of some punctuation symbols)

Both functions return the updated column.

In [9]:
txt_dat['comment_text'] = clean_comment(txt_dat, 'comment_text')

txt_dat['toxic'] = binarize(txt_dat, 'toxic')

display(txt_dat.head(n), txt_dat.tail(n), txt_dat.describe())

Unnamed: 0,comment_text,toxic
0,Explanation Why the edits made under my usern...,0
1,aww He matches this background colour s...,0
2,Hey man really not trying to edit war It...,0
3,More can make any real suggestions on ...,0
4,You sir are my hero Any chance you remember...,0


Unnamed: 0,comment_text,toxic
159566,And for the second time of asking when ...,0
159567,You should be ashamed of yourself That is ...,0
159568,Spitzer Umm theres no actual article for ...,0
159569,And it looks like it was actually you who put ...,0
159570,And really don think you understand...,0


Unnamed: 0,toxic
count,159571.0
mean,0.101679
std,0.302226
min,0.0
25%,0.0
50%,0.0
75%,0.0
max,1.0


As we can see, there are no extra characters, like apostrophe's, extra punctuations, etc.

As well as having the maximum value to be 1 within the `toxic` column, although the mean is closer to 0 which suggest that the data is really imbalanced, which will require fixing later down the line, ideally it would be better to be closer to 0.5.

# NLP Preprocessing
Preprocessing the data with ways that are under the NLP PreProcessing Standards, as in:
- Tokenizing.
- Lowercasing.
- Removal of Stopwords.
- Removing Non-English words.
- Lemmatization/Stemming.

## Tokenization
Tokenization is the process of splitting up a sentence or a corpus into plain words (tokens) for variable reasons:
- To know which words are present.
- Count them.
- Et cetera.

In my implementation, I will be using the `regexp_tokenizer()` in order to tokenize the text within the `comment_text` column.

As well as keeping the tokens within the same dataframe as a list.

In [10]:
# Function to tokenize a specific column within a passed DataFrame
def tokenize(dataframe: pd.DataFrame, colname: str):
    return [regexp_tokenize(row, upperlower_words) for row in dataframe[colname]]

In [11]:
# Calling the tokenize() function on a new column within the main DataFrame
txt_dat['tokens'] = tokenize(txt_dat, 'comment_text')

# Checking if the function worked
txt_dat.head(n)

Unnamed: 0,comment_text,toxic,tokens
0,Explanation Why the edits made under my usern...,0,"[Explanation, Why, the, edits, made, under, my..."
1,aww He matches this background colour s...,0,"[aww, He, matches, this, background, colour, s..."
2,Hey man really not trying to edit war It...,0,"[Hey, man, really, not, trying, to, edit, war,..."
3,More can make any real suggestions on ...,0,"[More, can, make, any, real, suggestions, on, ..."
4,You sir are my hero Any chance you remember...,0,"[You, sir, are, my, hero, Any, chance, you, re..."


Notice how the sentences have been tokenised, each word is its own string, within a list of strings under the new `tokens` column.

## Lower Case Words
By conventional standards, changing the case of the words into lowercase has been agreed upon to negate the need to account for case sensitivity with the operations that follow to preprocess text.

In [12]:
# function to lowercase text
def token_lower(token_list: list):
    return [token.lower() for token in token_list]

In [13]:
# Iterating over the column
# Iterable is of type "list"
txt_dat['tokens'] = [token_lower(row) for row in txt_dat['tokens']]

# Displaying results of applying the above function
txt_dat.head()

Unnamed: 0,comment_text,toxic,tokens
0,Explanation Why the edits made under my usern...,0,"[explanation, why, the, edits, made, under, my..."
1,aww He matches this background colour s...,0,"[aww, he, matches, this, background, colour, s..."
2,Hey man really not trying to edit war It...,0,"[hey, man, really, not, trying, to, edit, war,..."
3,More can make any real suggestions on ...,0,"[more, can, make, any, real, suggestions, on, ..."
4,You sir are my hero Any chance you remember...,0,"[you, sir, are, my, hero, any, chance, you, re..."


Explaining how the function works first before observation.

Take the column `tokens`, it contains lists of tokens for each comment like: `[Explanation, Why, the, edits, made, under, my...]`.

Said list is what is being iterated on, which gets passed to the function, so for every iteration a new list of words gets passed to the function; but within the function, we iterate over each item within said list, so `'Explanantion` and `'Why'`, etc..

Therefore, passing over each token, we apply the `.lower()` function for the `str` data type; thus we get a lowercase version of the string.

After every iteration of lists, they get returned and reassigned where the original list used to be.

As for observation, we can see that the words did indeed become lowercase.

## Stopword Removal
Removing of stopwords drastically decreases the amount of words that need to be taken into consideration by the model, and contribute little to nothing regarding meaning.

Removing them would shave off redundant computations.

In [14]:
# Function to remove the stopwords
def stopword_remover(wordlist: list):
    return [word for word in wordlist if word not in ENGLISH_STOPWORDS]

In [15]:
# Displaying word count for first few lists
print([len(tokenlist) for tokenlist in txt_dat['tokens'].iloc[:5]])

# Iterating over column
# Iterable is the list of
# Lowercase english words
txt_dat['tokens'] = [stopword_remover(row) for row in txt_dat['tokens']]

print([len(tokenlist) for tokenlist in txt_dat['tokens'].iloc[:5]])

txt_dat.head(n)

[41, 13, 41, 103, 13]
[23, 10, 21, 50, 5]


Unnamed: 0,comment_text,toxic,tokens
0,Explanation Why the edits made under my usern...,0,"[explanation, edits, made, username, hardcore,..."
1,aww He matches this background colour s...,0,"[aww, matches, background, colour, seemingly, ..."
2,Hey man really not trying to edit war It...,0,"[hey, man, really, trying, edit, war, guy, con..."
3,More can make any real suggestions on ...,0,"[make, real, suggestions, improvement, wondere..."
4,You sir are my hero Any chance you remember...,0,"[sir, hero, chance, remember, page]"


As we can see by the decrease in the word count per token list, there was indeed a removal of redundant words.

As for how the function works, it is similar to how the previous ones work, passing the iterated list to iterate over the tokens within and apply a function; in our case it is a conditional to just include words that are not within the collection of words which are considered stopwords.

## Lemmatization & Stemming
Lemmatising/Stemming both have the same target in mind, to reach the root of the word but their difference is the algorithm used.

Stemming utilises a "Porter" algorithm which essentially just chops off common word endings, due to Stemmer using a crude old method, which is aimed for speed and efficiency, unlike lemmatizaton which morphologically analyses lexical changes in words to revert them back to their roots, unlike the chopping of "commonly found prefixes/suffixes" which stemming does.

In [16]:
def lemmatize(wordlist: list):
    return [WordNetLemmatizer().lemmatize(token) for token in wordlist]

def stem(wordlist):
    return [SnowballStemmer("english").stem(token) for token in wordlist]

Essentially, after some look up, lemmatization is beneficial to reduce the word to its base, in a more human context and understanding, mainly used in chatbots, etc.

Whilest the stemmer is used to purely chop off the words' endings to get to the base, used more in sentiment analysis, which is our current case with the binary classification, hence it will be used after lemmatizer and english word remover are used.

## Non-English Words Removal
Due to the nature of the internet, it is bound that people don't use proper English words, but due to current knowledge of NLP process, it would be better to remove words that do not belong to the English language as a whole.

An important note to keep in mind is, calling this after lemmatization/stemming, due to `words` not having all forms of a singular word which may cause it to be removed.

In [17]:
# Define Non-English word remover
def englishify(wordlist: list):
    return [word for word in wordlist if word in ENGLISH_WORDS]

In [18]:
# Call the function
txt_dat['tokens'] = [englishify(row) for row in txt_dat['tokens']]

# Print some entries
txt_dat.head(n)

Unnamed: 0,comment_text,toxic,tokens
0,Explanation Why the edits made under my usern...,0,"[explanation, made, fan, closure, gas, new, yo..."
1,aww He matches this background colour s...,0,"[background, colour, seemingly, stuck, thanks,..."
2,Hey man really not trying to edit war It...,0,"[hey, man, really, trying, edit, war, guy, con..."
3,More can make any real suggestions on ...,0,"[make, real, improvement, section, statistics,..."
4,You sir are my hero Any chance you remember...,0,"[sir, hero, chance, remember, page]"


## Wrap Up NLP Preprocessing
In this section, a whole function will take in a column in order to apply ***all*** the above functions in one go.

Description and explanation for function order will be explained in either comments or a markdown cell following it.

In [19]:
def preprocess(df: pd.DataFrame, colname: str):
    # List to hold the lists of tokens
    tokenlist = tokenize(df, colname)

    # Get number of tokens originally
    token_num = sum([len(t_list) for t_list in tokenlist])

    # Lowercase list of token lists
    # t_list -> (t)oken_list
    lower_tokens = [token_lower(t_list) for t_list in tokenlist]

    # Stopword-free list
    # tl_list -> (t)oken(l)lowercase_list
    stopwordless = [stopword_remover(tl_list) for tl_list in lower_tokens]

    # List of token lemmas
    # tls_list -> (t)oken(l)owercase(s)topwordless_list
    lemmas = [lemmatize(tls_list) for tls_list in stopwordless]

    # List of Lemmas that only exist in english
    # tlsl_list -> (t)oken(l)owercase(s)topwordless(l)lemma_list
    englishified = [englishify(tlsl_list) for tlsl_list in lemmas]

    # Get number of tokens after processing
    # (Englishify is the last function to remove)
    proc_num = sum([len(tlsle_list) for tlsle_list in englishified])

    # Reduction
    reduced_by = 1 - proc_num / token_num

    # Print counts and percentage
    print("Original Token Count : {}\nProcessed Token Count: {}\nReduction Percentage : {:.2%}".format(token_num, proc_num, reduced_by))

    # Return tuple of token list and processed tokens
    return [stem(word) for word in englishified]

Explanation for how the wrap-up function works:
- Passing the DataFrame to be processed, along with the column name to be tokenized.
    - Each row within the column gets tokenized with use of the predefined RegEx expressions up above.
- Calculating the number of tokens to get a feel for how much we'll be dealing with.
- Applying a lowercase function to ever list within the list of lists assigned to the output of the tokenization function.
- Removing stopwords from the tokens.
- Lemmatizing the words, to reach the root of the word (Useful for removing non-English words later).
- Removal of Non-English words.
- Calculating the amount of tokens, `englishify()` is the last function that removes any tokens, afterwhich the reduction of tokens is calculated.
- Printing the calculated numbers.
- Returning a tuple of tokenised words and stemmed words.

In [20]:
# Assigning output tuple to "new" column of tokens and new for processed
txt_dat['processed'] = preprocess(txt_dat, 'comment_text')

# Printing to see progress
txt_dat.head(n)

Original Token Count : 10130199
Processed Token Count: 4342586
Reduction Percentage : 57.13%


Unnamed: 0,comment_text,toxic,tokens,processed
0,Explanation Why the edits made under my usern...,0,"[explanation, made, fan, closure, gas, new, yo...","[explan, made, fan, vandal, closur, gas, new, ..."
1,aww He matches this background colour s...,0,"[background, colour, seemingly, stuck, thanks,...","[match, background, colour, seem, stuck, thank..."
2,Hey man really not trying to edit war It...,0,"[hey, man, really, trying, edit, war, guy, con...","[hey, man, realli, tri, edit, war, guy, consta..."
3,More can make any real suggestions on ...,0,"[make, real, improvement, section, statistics,...","[make, real, suggest, improv, section, statist..."
4,You sir are my hero Any chance you remember...,0,"[sir, hero, chance, remember, page]","[sir, hero, chanc, rememb, page]"


# Transformer
Using PyTorch, a transformer will be implemented from scratch to classify comments into toxic or non-toxic.

Research Paper used as reference:
- [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf)

Website Articles that helped:
- [The Transformer Model](https://machinelearningmastery.com/the-transformer-model/)
- [Text Classifier with PyTorch Transformer](https://n8henrie.com/2021/08/writing-a-transformer-classifier-in-pytorch/)
- [Basic Transformer With PyTorch](https://pytorch.org/tutorials/beginner/transformer_tutorial.html)
- [Positional Encoding](https://theaisummer.com/positional-embeddings/)

In [21]:
import math
import torch                                    # PyTorch
import torch.nn    as nn                        # PyTorch Neural Networks
import torch.optim as optim                     # Optimizers
from torch.utils.data import Dataset            # Dataset of words
from torch.utils.data import DataLoader
from torch.utils.data import random_split
from collections import Counter
from sklearn.model_selection import train_test_split

## Fundamental Building Blocks

### Input Embedding
Similarly to other sequence transduction models, we use learned embeddings to convert the input
tokens and output tokens to vectors of dimension $d_{model}$. We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities. In
our model, we share the same weight matrix between the two embedding layers and the pre-softmax
linear transformation. In the embedding layers, we multiply those weights by $\sqrt{d_{model}}$.

In [22]:
class Embedding(nn.Module):
    def __init__(self, num_embeddings, embedding_dim):
        super(Embedding, self).__init__()
        self.num_embeddings = num_embeddings
        self.embedding_dim = embedding_dim

        self.weight = nn.Parameter(torch.Tensor(num_embeddings, embedding_dim))
        nn.init.xavier_uniform_(self.weight)

    def forward(self, input):
        return self.weight[input]

### Positional Encoding
Since our model contains no recurrence and no convolution, in order for the model to make use of the
order of the sequence, we must inject some information about the relative or absolute position of the
tokens in the sequence. To this end, we add ***"positional encodings"*** to the input embeddings at the
bottoms of the encoder and decoder stacks. The positional encodings have the same dimension dmodel
as the embeddings, so that the two can be summed.

Thus, we will be working with the following trig functions of different frequencies:
$$ PE(pos, 2i) = sin(pos / 1000^{2\cdot i / d_{model}}) $$
$$ PE(pos, 2i + 1) = cos(pos / 1000^{2\cdot i / d_{model}}) $$

where $pos$ is the position and $i$ is the dimension. That is, each dimension of the positional encoding
corresponds to a sinusoid. The wavelengths form a geometric progression from $2\pi$ to $10000 \cdot 2\pi^{[1]}$. We
chose this function because we hypothesized it would allow the model to easily learn to attend by
relative positions, since for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of
$PE_{pos}$.

In [23]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_seq_len):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=0.1)

        pe = torch.zeros(max_seq_len, d_model)
        
        position = torch.arange(0, max_seq_len, dtype=torch.float32).unsqueeze(1)
        
        div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32) * -(math.log(10000.0) / d_model))
        
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        
        pe = pe.unsqueeze(0).transpose(0, 1)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + self.pe[:x.size(0), :]
        return self.dropout(x)

### Attention
The main star of the show and the key that emphasizes why the transformer is the better architecture.

Through remembering the prompts and contextually "understanding" the words whilst taking into consideration the meaning and value of the prior words.

Hence, the Multi-Head Attetntion formula is described by the following equation:
$$ MultiHeadAttention(Q, K, V) = Concat(head_1, \ldots, head_h)W^O$$

Where each $h_i$ is derived from the following formula:
$$ Attention(Q, K, V) = softmax(\frac{Q\cdot K^T}{\sqrt{d_{model}}})\cdot V$$

In this particular paper, the model has the hyperparameter of $h^{[1]}$ set to $8$, such that $d_k = d_v = d_{model} / h = 64$

Softmax:
$$ softmax(\vec{z})_{i} = \frac{e^{z_{i}}}{\sum^{n}_{j=1} e^{z_{j}}}$$
Where:
- $\vec{z} \rightarrow$ Input Vector.
- $e^{z_i} \rightarrow$ Standard Exponentiation for Input vector.
- $e^{z_j} \rightarrow$ Standard Exponentiation for Output Vector.
- $n \rightarrow$ number of classes in multi-class classifier.

---
[1]: h is the number of attention layers.

In [24]:
class MultiheadAttention(nn.Module):
    def __init__(self, d_model, nhead, dropout=0.01):
        super(MultiheadAttention, self).__init__()
        self.nhead = nhead
        self.d_model = d_model
        self.dim_per_head = d_model // nhead

        self.query_linear = nn.Linear(d_model, d_model)
        self.key_linear = nn.Linear(d_model, d_model)
        self.value_linear = nn.Linear(d_model, d_model)

        self.out_linear = nn.Linear(d_model, d_model)

        self.dropout = nn.Dropout(dropout)
        self.softmax = nn.Softmax(dim=2)

    def forward(self, query, key, value):
        batch_size = query.size(0)
        query = self.query_linear(query)
        key = self.key_linear(key)
        value = self.value_linear(value)

        query = query.view(batch_size, -1, self.nhead, self.dim_per_head).transpose(1, 2)
        key = key.view(batch_size, -1, self.nhead, self.dim_per_head).transpose(1, 2)
        value = value.view(batch_size, -1, self.nhead, self.dim_per_head).transpose(1, 2)

        attention_scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(self.dim_per_head)
        attention_probs = self.softmax(attention_scores)

        context = torch.matmul(self.dropout(attention_probs), value)
        context = context.transpose(1, 2).contiguous().view(batch_size, -1, self.nhead * self.dim_per_head)

        output = self.out_linear(context)
        return output

### Position-wise Feed Forward Network
In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully
connected feed-forward network, which is applied to each position separately and identically. This
consists of two linear transformations with a ReLU activation in between.

$$ FFN(x) = max(0, xW_{1} + b_{1})W_{2} + b_{2}$$

While the linear transformations are the same across different positions, they use different parameters
from layer to layer. Another way of describing this is as two convolutions with kernel size 1.
The dimensionality of input and output is $d_{model} = {512}$, and the inner-layer has dimensionality
$d_{ff} = {2048}$.

### Layer Normalization


In [25]:
class LayerNormalization(nn.Module):
    def __init__(self, normalized_shape, eps=1e-5):
        super(LayerNormalization, self).__init__()
        self.norm_shape = normalized_shape
        self.eps = eps
        self.alpha = nn.Parameter(torch.ones(self.norm_shape))
        self.beta = nn.Parameter(torch.zeros(self.norm_shape))

    def forward(self, x):
        mean = x.mean(dim = -1, keepdim = True)
        std = x.std(dim = -1, keepdim = True)
        output = (x - mean) / (std + self.eps)
        output = (output * self.alpha.unsqueeze(0)) + self.beta.unsqueeze(0)
        return output

### Encoder Block
Consisting of $6^{[1]}$ layers of the same process. From the original design we can also see that there is a skip forward from:
- Before the MHA (Multi-Head Attention)
- Before the FFN (Feed Forward Network)

As well as consisting of $4$ blocks in total for each Encoder Block:
- Multi-Head Attention.
- First Normalization.
- Feed-Foward Network.
- Second Normalization.

It would be pointless if we just continuously passed the output from one block without transforming it in any way or shape, hence the introudction of the $LayerNorm(x + SubLayer(x))$ function, where $SubLayer(x)$ is a function produced by the layer itself to facilitate the residual connections, and produces an output of dimension $d_{model} = 512$.

---

[1]: Included in the original Paper, Section 3.1; Encoder Paragraph.

In [26]:
# Single layer of encoder
class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model, nhead, dim_feedforward=128, dropout=0.0):
        super(TransformerEncoderLayer, self).__init__()
        
        self.self_attention = MultiheadAttention(d_model, nhead)
        self.norm1 = LayerNormalization(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.d_model = d_model
        
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, dim_feedforward),
            nn.ReLU(),
            nn.Linear(dim_feedforward, d_model)
        )
        self.norm2 = LayerNormalization(d_model)
        self.dropout2 = nn.Dropout(dropout)
        
    def forward(self, src):
        attn_output = self.self_attention(src, src, src)
        attn_output = attn_output.view(-1, src.size(1), self.d_model)
        src = src + self.dropout1(attn_output)
        src = self.norm1(src)
        
        ffn_output = self.feed_forward(src)
        src = src + self.dropout2(ffn_output)
        src = self.norm2(src)
        
        return src

# N Layers of encoders
class TransformerEncoder(nn.Module):
    def __init__(self, d_model, nhead, num_layers, dim_feedfwd):
        super(TransformerEncoder, self).__init__()
        self.layers = nn.ModuleList(
            [TransformerEncoderLayer(d_model, nhead, dim_feedfwd) for _ in range(num_layers)]
        )

    def forward(self, src):
        output = src
        for layer in self.layers:
            output = layer(output)
        return output

### Decoder Block
The decoder is also composed of a stack of $N = 6$ identical layers. In addition to the two
sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections
around each of the sub-layers, followed by layer normalization. We also modify the self-attention
sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This
masking, combined with fact that the output embeddings are offset by one position, ensures that the
predictions for position $i$ can depend only on the known outputs at positions less than $i$.

#### Note: Due to our implementation for it to be a Text-Classifier version of the Transformer, This will be ignored.

## Classification Transformer
For our purpose of this project, we are tasked to classify the given data of Comments into a binary label of `toxic`, as mentioned before the $1$ stands for `toxic` and the $0$ stands for `non-toxic`.

This is achieved by only using the Encoder part of the transformer, which means that only the N-Layers of the Encoder will be at work, No need for Cross-Multi-Head Attention or any of the Expected Output to be put into a Decoder, Just the Encoder.

In [27]:
class ClassifierTransformer(nn.Module):
    def __init__(self, d_model: int = 512, num_classes: int = 2, nhead: int = 8, num_encoder_layers: int = 6, vocab_size: int = 1000, feedfwd_dim: int = 1024, max_seq_len: int = 100):
        super().__init__()
        self.model_dim = d_model
        self.num_head = nhead
        self.num_enc_layers = num_encoder_layers
        self.num_classes = num_classes
        self.d_ff = feedfwd_dim
        self.vocab_size = vocab_size

        self.embedding = nn.Embedding(self.vocab_size, self.model_dim)
        self.transformer_enc = TransformerEncoder(self.model_dim, self.num_head, self.num_enc_layers, feedfwd_dim)

        self.fc = nn.Linear(self.model_dim, self.num_classes)
        self.positional_encoding = PositionalEncoding(self.model_dim, max_seq_len)

    def forward(self, src):
        embedded = self.embedding(src)
        encoded = self.positional_encoding(embedded)
        output = self.transformer_enc(encoded)
        output = output.permute(1, 0, 2)
        logits = self.fc(output[:, -1, :])
        return logits

## Training and Evaluation

In [28]:
def evaluate(model, test_data):
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for inputs, labels in test_data:
            logits = model(inputs)
            _, predicted = torch.max(logits, dim=1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    accuracy = correct / total
    return accuracy

In [29]:
def train_model(model, train_loader, criterion, optimizer, num_epochs, device):
    model.to(device)
    model.train()
    loss_over_epoch = []
    for epoch in range(num_epochs):
        running_loss = 0.0
        for i, data in enumerate(train_loader):
            inputs, labels = data
            inputs = inputs.to(device)
            labels = labels.to(device)
            optimizer.zero_grad()
            outputs = model(inputs)
            if (outputs.shape[0] == labels.shape[0]):
                loss = criterion(outputs, labels)
                loss.backward()
                optimizer.step()
                running_loss += loss.item()
                print("Epoch [{}/{}]\n\tSub-Epoch {}:\n\t loss: {:.4f}".format(epoch + 1, num_epochs, i, loss))
                loss_over_epoch.append(loss)

        average_loss = running_loss / len(train_loader)
        print("Epoch [{}/{}], Average Loss: {:.6f}".format(epoch + 1, num_epochs, average_loss))

In [30]:
def evaluate_model(model, test_loader, criterion, device):
    model.to(device)

    model.eval()
    with torch.no_grad():
        total_loss = 0.0
        total_correct = 0
        total_samples = 0

        for i, data in enumerate(test_loader):
            inputs, labels = data
            inputs = inputs.to(device)
            labels = labels.to(device)
            
            outputs = model(inputs)
            if (outputs.shape[0] == labels.shape[0]):
                loss = criterion(outputs, labels)
                total_loss += loss.item()
                
                _, predicted = torch.max(outputs, dim=1)
                total_correct += (predicted == labels).sum().item()
                total_samples += labels.size(0)

        average_loss = total_loss / len(test_loader)
        accuracy = total_correct / total_samples

        print("Test Loss: {:.4f}, Accuracy: {:.4f}\nOver {} samples, {} were correct".format(average_loss, accuracy, total_samples, total_correct))

In [31]:
class VocabDataset(Dataset):
    def __init__(self, data, seq_len:int):
        self.data = data
        self.seq_len = seq_len

        # Create vocabulary and index mapping here
        vocab = Counter([token for tokens in txt_dat['processed'] for token in tokens])
        min_word_count = 5
        filtered_vocab = [word for word, count in vocab.items() if count >= min_word_count]
        word_to_idx = {word: idx + 1 for idx, word in enumerate(filtered_vocab)}
        word_to_idx['<PAD>'] = 0
        self.w_t_idx = word_to_idx
        self.vocab_size = len(word_to_idx)
    
    def __len__(self):
        return len(self.data)
    
    def voc_size(self):
        return len(self.w_t_idx)
    
    def __getitem__(self, index):
        item = self.data.iloc[index]
        src = item.iloc[0]  # Input sequence
        label = item.iloc[1]  # Binary label

        # Convert tokens to numerical indices and pad sequences
        src = [self.w_t_idx.get(token, 0) for token in src]  # Use word_to_idx with a default value of 0
        src = src[:self.seq_len] + [0] * (self.seq_len - len(src))  # Padding with 0
        src = torch.tensor(src, dtype=torch.long)
        label = torch.tensor(label, dtype=torch.long)
        return src, label

In [32]:
def train_test_split_dataset(dataset, test_size=0.2):
    dataset_size = len(dataset)
    num_test_samples = int(test_size * dataset_size)
    num_train_samples = dataset_size - num_test_samples

    train_dataset, test_dataset = random_split(dataset, [num_train_samples, num_test_samples])
    
    return train_dataset, test_dataset

# Bringing it All Together
Dividing Data and turning it into pytorch tensor compatible input.

Input into the transformer implemented above and evaluating.

In [33]:
# Ready the Data
whole_data = VocabDataset(txt_dat[['processed', 'toxic']], 100)

train_ds, test_ds = train_test_split_dataset(whole_data, 0.2)

# Loaders
train_load = DataLoader(train_ds, 100)
test_load = DataLoader(test_ds, 100)

In [34]:
whole_data[700]

(tensor([ 120,  212, 1415,  413,  170,  344,  429, 3245, 3246,   60,  170, 1987,
         3247,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0]),
 tensor(1))

In [35]:
# Defining some params
Epochs = 1
criterion = nn.CrossEntropyLoss()
lr = 0.0001
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cpu


In [36]:
# Training Loops
model = ClassifierTransformer(d_model = 100, num_classes = 2, nhead = 4, num_encoder_layers = 2, vocab_size = whole_data.voc_size())

optimizer = optim.Adam(model.parameters(), lr = lr)

train_model(model, train_load, criterion, optimizer, Epochs, device)

Epoch 1/1
	Sub-Epoch 0:
	loss: 0.324348
Epoch 1/1
	Sub-Epoch 1:
	loss: 0.560213
Epoch 1/1
	Sub-Epoch 2:
	loss: 0.318567
Epoch 1/1
	Sub-Epoch 3:
	loss: 0.396154
Epoch 1/1
	Sub-Epoch 4:
	loss: 0.343527
Epoch 1/1
	Sub-Epoch 5:
	loss: 0.349932
Epoch 1/1
	Sub-Epoch 6:
	loss: 0.397620
Epoch 1/1
	Sub-Epoch 7:
	loss: 0.291852
Epoch 1/1
	Sub-Epoch 8:
	loss: 0.201165
Epoch 1/1
	Sub-Epoch 9:
	loss: 0.525195
Epoch 1/1
	Sub-Epoch 10:
	loss: 0.374164
Epoch 1/1
	Sub-Epoch 11:
	loss: 0.223297
Epoch 1/1
	Sub-Epoch 12:
	loss: 0.299547
Epoch 1/1
	Sub-Epoch 13:
	loss: 0.276613
Epoch 1/1
	Sub-Epoch 14:
	loss: 0.398002
Epoch 1/1
	Sub-Epoch 15:
	loss: 0.307150
Epoch 1/1
	Sub-Epoch 16:
	loss: 0.354912
Epoch 1/1
	Sub-Epoch 17:
	loss: 0.321545
Epoch 1/1
	Sub-Epoch 18:
	loss: 0.353293
Epoch 1/1
	Sub-Epoch 19:
	loss: 0.422772
Epoch 1/1
	Sub-Epoch 20:
	loss: 0.383915
Epoch 1/1
	Sub-Epoch 21:
	loss: 0.317327
Epoch 1/1
	Sub-Epoch 22:
	loss: 0.336479
Epoch 1/1
	Sub-Epoch 23:
	loss: 0.305252
Epoch 1/1
	Sub-Epoch 24:
	

In [37]:
# Evaluation
evaluate_model(model, test_load, criterion, device)

ValueError: Expected input batch_size (100) to match target batch_size (14).