# LLM From Scratch
#### (By: Mark Ehab Aziz)
##### (Built under Python 3.11.4)

Graduation project for ***Sprints***: building an LLM (Large Language Model) from scratch, without the use of any Transformer-related libraries, in order to classify comments scraped from a website.

Notebook will include:
- List of imported libraries.
- Deep dive into the data.
- Markdown cells explaining each step in detail, along with the reasoning behind it.
- Custom defined functions and classes.
- My own Transformer, built from scratch.

List of Libraries/Dependencies:
- [Pandas](https://pandas.pydata.org/docs/index.html)
- [NumPy](https://numpy.org/doc/stable/)
- [NLTK](https://www.nltk.org/)
    - [NLTK Regex Tokenizer](https://www.nltk.org/howto/tokenize.html)
    - [NLTK Snowball Stemmer](https://www.nltk.org/howto/stem.html)
    - [NLTK WordNet Lemmatizer](https://www.nltk.org/howto/wordnet.html)
    - [NLTK English Words (Stopwords too)](https://www.nltk.org/howto/corpus.html)
- [PyTorch (`torch`)](https://pytorch.org/docs/stable/index.html)

Note: NLTK will be used later for tokenization using RegEx.

# Library Imports
Importing libraries that will be used in order to implement our own LLM.

In [1]:
import pandas as pd                             # Pandas for DataFrame manipulation
import numpy as np                              # Linear Algebra and Mathematical Operations
import nltk                                     # Downloading word sets
from nltk.stem import SnowballStemmer           # Stemming
from nltk.stem import WordNetLemmatizer         # Lemmatization (Better word yields)
from nltk.corpus import stopwords               # Stopwords
from nltk.tokenize import regexp_tokenize       # Tokenization using RegEx (Regular Expressions)
import torch                                    # PyTorch
import torch.nn                                 # PyTorch Neural Networks

In [None]:
# Downloading external dependencies to be used
# Such as Stopwords, English-only words

# Stopwords
nltk.download('stopwords')

# English Only
nltk.download('words')

# Setting constants for both normal and stopwords in English
ENGLISH_STOPWORDS = set(stopwords.words('english'))
ENGLISH_WORDS = set(nltk.corpus.words.words())

# All About Data

## Data Import
Loading the data using pandas' `read_csv()` function.

In [3]:
# Two ways of reading the data
# 1 - Reading from within my GitHub repo
txt_dat = pd.read_csv('../dataset/train.csv')

# 2 - Reading within the same folder
#txt_dat = pd.read_csv('train.csv')

## Data Exploration
Looking at the data from multiple perspectives.

Using the `head()` and `tail()` methods to see which columns the dataframe contains, and which of them may be useful and which may not.

In [4]:
# Defining n rows
n = 5

# Calling and showing first and last n rows
display(txt_dat.head(n), txt_dat.tail(n), txt_dat.shape)

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\r\nWhy the edits made under my use...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\r\nMore\r\nI can't make any real suggestions...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
159566,ffe987279560d7ff,""":::::And for the second time of asking, when ...",0,0,0,0,0,0
159567,ffea4adeee384e90,You should be ashamed of yourself \r\n\r\nThat...,0,0,0,0,0,0
159568,ffee36eab5c267c9,"Spitzer \r\n\r\nUmm, theres no actual article ...",0,0,0,0,0,0
159569,fff125370e4aaaf3,And it looks like it was actually you who put ...,0,0,0,0,0,0
159570,fff46fc426af1f9a,"""\r\nAnd ... I really don't think you understa...",0,0,0,0,0,0


(159571, 8)

Looking at the data, we can see that we have $159571$ entries.
The columns are:
- `id`: Message ID from whatever platform this was scraped from. (Will be dropped)
- `comment_text`: The main star of the show and the text we have to process. It contains uppercase letters, escape characters (`\n`, `\r`, etc.), and most likely special characters (outside the Latin alphabet), URLs and IP addresses, all of which need removing.
- `toxic` (and its derivatives): Labels marking whether the comment is toxic or not.

## Defining What To Clean
The description above names several things that need removing or reshaping; the reasoning for each follows.

- Removal of `id`: IDs are unique per entry and merely enumerate the rows, so they carry no signal for classifying a comment.
- Cleaning within `comment_text`: Cleaning the comments is split into five steps:
    - Newline and tab removal: Replacing carriage returns (`\r`), newlines (`\n`) and tabs (`\t`) strips escape characters from the text and reduces noise.
    - Uppercase: By common NLP convention, all uppercase letters are converted to lowercase. A word can otherwise be written with many different mixes of upper- and lowercase letters, and the machine would have to treat up to $2^n$ spellings as the same word (where $n$ is the number of letters in the word, and each letter is either upper or lower case).
    - URL: URLs contribute little to the corpus and are not a basis for labelling a comment, so they are removed.
    - IP Address: Removed for security reasons.
    - Special Characters: Characters outside the Latin alphabet used by English add nothing here, so they are removed; in a multilingual dataset, some characters from other languages would need to be kept.
- Derivatives of `toxic`: For simplicity, I have decided to *"collapse"* the columns that follow the *`toxic`* column by summing their values into it, then setting any value $>1$ to just $1$ to indicate toxicity, with $0$ marking non-toxic comments; this reduces the task to a "Binary Classification" problem based on words.
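
To make the lowercasing argument concrete, here is a minimal sketch that enumerates every upper/lower-case spelling of a word. The helper name `case_variants` and the sample word are illustrative assumptions, not part of the notebook.

```python
from itertools import product

def case_variants(word: str) -> set[str]:
    """Return every upper/lower-case spelling of `word` (hypothetical helper)."""
    # Each letter contributes two choices: its lowercase or its uppercase form
    choices = [(ch.lower(), ch.upper()) for ch in word]
    return {''.join(combo) for combo in product(*choices)}

variants = case_variants('toxic')
print(len(variants))                    # 2 ** 5 spellings for a 5-letter word
print({v.lower() for v in variants})    # lowercasing collapses them to one token
```

Without lowercasing, each of those spellings would count as a separate word in the vocabulary.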

In [5]:
# Collapsing the column values onto toxic
txt_dat['toxic'] = txt_dat.iloc[:, 2:].sum(axis = 1)

# Drop the collapsed columns
# Along with the id column
txt_dat.drop(columns = ['id', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate'], inplace = True)

In [6]:
# Printing the head and describing the toxic column
display(txt_dat.head(n), txt_dat.describe())

Unnamed: 0,comment_text,toxic
0,Explanation\r\nWhy the edits made under my use...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\r\nMore\r\nI can't make any real suggestions...",0
4,"You, sir, are my hero. Any chance you remember...",0


Unnamed: 0,toxic
count,159571.0
mean,0.219952
std,0.74826
min,0.0
25%,0.0
50%,0.0
75%,0.0
max,6.0


As noted, collapsing the values onto `toxic` produced values over 1, so they will need handling in order to keep the labels as $1$ and $0$. A function doing just that will be implemented later.
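
The collapse-then-binarize idea can be seen on a toy frame. This is only a sketch: the three rows are made-up values, and `np.where` is used here as one possible way of mapping any positive sum to $1$.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the six label columns (made-up values, not from the dataset)
toy = pd.DataFrame({
    'toxic':         [0, 1, 1],
    'severe_toxic':  [0, 0, 1],
    'obscene':       [0, 1, 1],
    'threat':        [0, 0, 0],
    'insult':        [0, 0, 1],
    'identity_hate': [0, 0, 0],
})

# Sum the labels row-wise, then map any positive sum to 1
collapsed = toy.sum(axis = 1)
binary = np.where(collapsed > 0, 1, 0)

print(collapsed.tolist())   # [0, 2, 4]
print(binary.tolist())      # [0, 1, 1]
```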

### Regex Patterns
Regex patterns used to capture specific removable slices of text within the comments.

In [7]:
# Newline, Tab spaces, etc.
newline_tabspace = r'[\r\n\t]'

# Match words starting with Uppercase letters
upper_words = r'([A-Z])\w+'

# Match Words that start with either Upper/lowercase letters
upperlower_words = r'[A-Za-z]\w+'

# Sub/Superscript characters
# Encountered previously
sub_sup_scripts = r'\w[²³¹⁰ⁱ⁴⁵⁶⁷⁸⁹⁺⁻⁼⁽⁾ⁿ]+'

# Punctuation
punc_pattern = r'[!\?.,\'"]'

# Single Letters
single_letter = r'((?<=^)|(?<= )).((?=$)|(?= ))'

# Match URLs
url_pattern = r'(http|ftp|https):\/\/([\w+?\.\w+])+([a-zA-Z0-9\~\!\@\#\$\%\^\&\*\(\)_\-\=\+\\\/\?\.\:\;\'\,]*)?'
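
The cleaning plan lists IP addresses, but none of the patterns above targets them. A minimal sketch of what such a pattern could look like follows; `ip_pattern` is a hypothetical name and not part of the notebook. It matches any four dot-separated groups of one to three digits, so it does not check the 0-255 range and would also catch things like four-part version numbers.

```python
import re

# Hypothetical pattern (an assumption, not from the notebook):
# four groups of 1-3 digits separated by dots, e.g. 192.168.0.1
ip_pattern = r'\b(?:\d{1,3}\.){3}\d{1,3}\b'

sample = 'Vandalism from 192.168.0.1 was reverted'
cleaned = re.sub(ip_pattern, ' ', sample)

print(cleaned.split())
```

If used, a pattern like this would have to run before punctuation removal, since `punc_pattern` strips the dots that it relies on.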

# Cleaning

## Functions and such
Functions to be used and applied on the dataframe.

Each function is commented, and each grouped cell of functions is followed by an explanation of their use.

In [8]:
def binarize(df: pd.DataFrame, colname: str):
    # Any value above 0 becomes 1 (toxic); 0 stays 0 (non-toxic)
    return np.where(df[colname] > 0, 1, 0)

def clean_comment(df: pd.DataFrame, colname: str):
    # Work on the column as a Series and reassign after every step;
    # chained `inplace = True` on df[colname] can act on a copy in newer pandas
    col = df[colname]

    # Remove the newlines, tabs, etc.
    col = col.replace(newline_tabspace, ' ', regex = True)

    # Remove URLs
    col = col.replace(url_pattern, ' ', regex = True)

    # Remove superscript, subscript
    col = col.replace(sub_sup_scripts, ' ', regex = True)

    # Remove punctuation
    col = col.replace(punc_pattern, ' ', regex = True)

    # Remove Single Letters
    col = col.replace(single_letter, ' ', regex = True)

    return col

A walkthrough of the above functions:
- `binarize(df, colname)`: Takes a `pd.DataFrame` and a `string`, naming the DataFrame and the column within it to operate on.
Collapsing the other label columns produced values larger than 1, so the values need binarizing (binary labels were chosen for simplicity's sake): $1$ for Toxic and $0$ for Non-Toxic. Any value larger than 0 becomes 1; otherwise it stays 0.

- `clean_comment(df, colname)`: Takes a `pd.DataFrame` and a `string`, naming the DataFrame and the column within it to operate on.
The data contains a lot of *whitespace*, *tabs*, *special characters*, *URLs*, *punctuation* and *single letters*, all of which are to be removed.
The function removes:
    - Newlines
    - Tabs
    - URLs
    - Sub/Super Scripts
    - Punctuation
    - Single Letters (left over from removing some punctuation symbols)

Both functions return the updated column.
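
To see why single letters appear after punctuation removal, the same steps can be traced on one string with the standard `re` module. This is a sketch: `clean_text` is a hypothetical single-string helper that mirrors only part of `clean_comment`, and the sample comment is made up from the first rows shown earlier.

```python
import re

# Patterns copied from the Regex Patterns cell
newline_tabspace = r'[\r\n\t]'
punc_pattern = r'[!\?.,\'"]'
single_letter = r'((?<=^)|(?<= )).((?=$)|(?= ))'

def clean_text(text: str) -> str:
    # Hypothetical helper: apply each pattern in order, replacing matches with a space
    for pattern in (newline_tabspace, punc_pattern, single_letter):
        text = re.sub(pattern, ' ', text)
    return text

raw = "D'aww! He matches\r\nthis colour"
cleaned = clean_text(raw)

# The apostrophe turns "D'aww" into "D aww"; the stray "D" is then removed
print(cleaned.split())   # ['aww', 'He', 'matches', 'this', 'colour']
```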

In [9]:
txt_dat['comment_text'] = clean_comment(txt_dat, 'comment_text')

txt_dat['toxic'] = binarize(txt_dat, 'toxic')

display(txt_dat.head(n), txt_dat.tail(n), txt_dat.describe())

Unnamed: 0,comment_text,toxic
0,Explanation Why the edits made under my usern...,0
1,aww He matches this background colour s...,0
2,Hey man really not trying to edit war It...,0
3,More can make any real suggestions on ...,0
4,You sir are my hero Any chance you remember...,0


Unnamed: 0,toxic
count,159571.0
mean,0.101679
std,0.302226
min,0.0
25%,0.0
50%,0.0
75%,0.0
max,1.0


As we can see, there are no extra characters left, such as apostrophes or extra punctuation.

The maximum value in the `toxic` column is now 1. However, the mean is about 0.10, far from the 0.5 of a balanced dataset, which suggests the data is heavily imbalanced; this will need fixing later down the line.
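
Since the labels are now only $0$ and $1$, the mean is simply the fraction of toxic comments, so the size of the imbalance can be estimated from the `describe()` output alone. A rough sketch, using the rounded mean shown above (the counts are therefore approximate):

```python
n_rows = 159571          # number of rows, from the shape printed earlier
toxic_mean = 0.101679    # mean of the binary `toxic` column (rounded by describe())

# With 0/1 labels, mean * rows gives the approximate number of toxic comments
n_toxic = round(n_rows * toxic_mean)
n_clean = n_rows - n_toxic
ratio = n_clean / n_toxic

print(n_toxic, n_clean)   # roughly 16 thousand toxic vs. 143 thousand non-toxic
print(round(ratio, 2))    # non-toxic comments per toxic comment
```

Roughly one comment in ten is toxic, so a model that always predicts "non-toxic" would already be right about 90% of the time, which is why the imbalance matters.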

# NLP Preprocessing
Preprocessing the data following the standard NLP preprocessing steps:
- Tokenizing.
- Lowercasing.
- Lemmatization/Stemming.
- Removal of Stopwords.
- Removing Non-English words.

## Tokenization
Tokenization is the process of splitting a sentence or a corpus into plain words (tokens), for various reasons:
- To know which words are present.
- Count them.
- Et cetera.

In my implementation, I will be using `regexp_tokenize()` to tokenize the text within the `comment_text` column.
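
As a rough illustration of regex-based tokenization, the sketch below uses the standard library's `re.findall` so that it runs without NLTK. It assumes the `upperlower_words` pattern from the Regex Patterns cell is the one used, which the notebook does not state; for a pattern like this with no capturing groups, `regexp_tokenize(text, pattern)` is expected to give the same tokens. The `tokenize` helper is hypothetical.

```python
import re

# Pattern copied from the Regex Patterns cell:
# a letter followed by one or more word characters (so at least 2 characters)
upperlower_words = r'[A-Za-z]\w+'

def tokenize(text: str) -> list[str]:
    # Hypothetical helper: lowercase first, then keep every pattern match as a token
    return re.findall(upperlower_words, text.lower())

print(tokenize('You sir are my hero'))   # ['you', 'sir', 'are', 'my', 'hero']
```

Because the pattern requires at least two characters, any remaining single-letter words are dropped at this stage as well.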

# Transformer
With hopes and dreams of at least using TensorFlow or PyTorch ;-;