# Check if GPU is Online

In [1]:
!nvidia-smi

Wed May 24 22:49:33 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA GeForce GTX 1650 Ti      On | 00000000:01:00.0 Off |                  N/A |
| N/A   51C    P3               17W /  50W|    539MiB /  4096MiB |     29%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

# Import Dependencies

In [2]:
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn
import tensorflow as tf
from tensorflow.keras.layers import (LSTM, Bidirectional, Dense, Dropout,
                                     Embedding, TextVectorization)
from tensorflow.keras.models import Sequential

2023-05-24 22:49:35.143896: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-05-24 22:49:35.184065: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-05-24 22:49:35.184623: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


# Import Data

## Manage Import

Creating the path to `train.csv` within `./assets/data`.

In [3]:
pathToTrain = os.path.join('assets', 'data', 'train.csv')

Importing `train.csv` to a dataframe called `df`.

In [4]:
df = pd.read_csv(pathToTrain)

## Explore Imported Data

In [5]:
df.tail()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
159566,ffe987279560d7ff,""":::::And for the second time of asking, when ...",0,0,0,0,0,0
159567,ffea4adeee384e90,You should be ashamed of yourself \n\nThat is ...,0,0,0,0,0,0
159568,ffee36eab5c267c9,"Spitzer \n\nUmm, theres no actual article for ...",0,0,0,0,0,0
159569,fff125370e4aaaf3,And it looks like it was actually you who put ...,0,0,0,0,0,0
159570,fff46fc426af1f9a,"""\nAnd ... I really don't think you understand...",0,0,0,0,0,0


Check all of the columns of the dataset.

In [6]:
df.columns

Index(['id', 'comment_text', 'toxic', 'severe_toxic', 'obscene', 'threat',
       'insult', 'identity_hate'],
      dtype='object')

Check the details of a single comment, here the one at index 8. First, the comment itself.

In [7]:
df.iloc[8]['comment_text']

"Sorry if the word 'nonsense' was offensive to you. Anyway, I'm not intending to write anything in the article(wow they would jump on me for vandalism), I'm merely requesting that it be more encyclopedic so one can use it for school as a reference. I have been to the selective breeding page but it's almost a stub. It points to 'animal breeding' which is a short messy article that gives you no info. There must be someone around with expertise in eugenics? 93.161.107.169"

Then, its labels.

In [8]:
df[df.columns[2:]].iloc[8]

toxic            0
severe_toxic     0
obscene          0
threat           0
insult           0
identity_hate    0
Name: 8, dtype: int64

Check how many comments in our dataset have been labeled as severely toxic.

In [9]:
df[df['severe_toxic'] == 1].shape[0]

1595
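The same kind of count can be taken for every label at once. Below is a minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration; only the six column names come from the dataset above).

```python
import pandas as pd

# Hypothetical toy data with the same label columns as train.csv
toy = pd.DataFrame({
    'toxic':         [1, 0, 1, 0],
    'severe_toxic':  [1, 0, 0, 0],
    'obscene':       [1, 0, 0, 0],
    'threat':        [0, 0, 0, 0],
    'insult':        [1, 0, 1, 0],
    'identity_hate': [0, 0, 0, 0],
})

# Number of positive samples per label (column-wise sum of 0/1 flags)
label_counts = toy.sum()

# Number of comments that carry no label at all
clean_count = int((toy.sum(axis=1) == 0).sum())

print(label_counts)
print(clean_count)
```

On the real data, the equivalent would be `df[df.columns[2:]].sum()`, which gives a quick view of how imbalanced each label is.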

# Preprocess Comments

To preprocess the comments, we use the `TextVectorization` layer from `tensorflow`. It's able to preprocess the samples through the following steps:
- Standardize each example (usually lowercasing + punctuation stripping)
- Split each example into substrings (usually words)
- Recombine substrings into tokens (usually ngrams)
- Index tokens (associate a unique int value with each token)
- Transform each example using this index, either into a vector of ints or a dense float vector.

More information [here](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization). 
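To make the steps above concrete, here is a simplified pure-Python sketch of what an integer-mode vectorizer does. It is not the actual `TextVectorization` implementation: the helper names (`standardize`, `build_vocabulary`, `vectorize`) are hypothetical, n-grams are skipped, and tie-breaking between equally frequent tokens may differ from TensorFlow's.

```python
import string
from collections import Counter

def standardize(text):
    # Lowercase and strip ASCII punctuation
    return text.lower().translate(str.maketrans('', '', string.punctuation))

def build_vocabulary(texts, max_tokens):
    # Count whitespace-separated tokens over the whole corpus
    counts = Counter(tok for t in texts for tok in standardize(t).split())
    # Index 0 is reserved for padding and index 1 for unknown tokens
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return ['', '[UNK]'] + most_common

def vectorize(text, vocabulary, output_sequence_length):
    index = {tok: i for i, tok in enumerate(vocabulary)}
    # Unknown tokens map to 1, then truncate and pad with 0
    ids = [index.get(tok, 1) for tok in standardize(text).split()]
    ids = ids[:output_sequence_length]
    return ids + [0] * (output_sequence_length - len(ids))

corpus = ['Hello world!', 'Hello again, world.', 'Hello there']
vocab = build_vocabulary(corpus, max_tokens=5)
vec = vectorize('Hello, strange world', vocab, output_sequence_length=5)
print(vocab)
print(vec)
```

The word `strange` never appears in the toy corpus, so it is mapped to the unknown index, and the two unused trailing positions are filled with the padding index.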

Here's the documentation for the `TextVectorization` layer.

In [10]:
TextVectorization??

Init signature:
TextVectorization(
    max_tokens=None,
    standardize='lower_and_strip_punctuation',
    split='whitespace',
    ngrams=None,
    output_mode='int',
    output_sequence_length=None,
    pad_to_max_tokens=False,
    vocabulary=None,
    idf_weights=None,
    sparse=False,
    ragged=

## Create `X` and `y` Arrays

Create our X vector.

In [11]:
X = df['comment_text']
X.shape

(159571,)

In [12]:
X

0         Explanation\nWhy the edits made under my usern...
1         D'aww! He matches this background colour I'm s...
2         Hey man, I'm really not trying to edit war. It...
3         "\nMore\nI can't make any real suggestions on ...
4         You, sir, are my hero. Any chance you remember...
                                ...                        
159566    ":::::And for the second time of asking, when ...
159567    You should be ashamed of yourself \n\nThat is ...
159568    Spitzer \n\nUmm, theres no actual article for ...
159569    And it looks like it was actually you who put ...
159570    "\nAnd ... I really don't think you understand...
Name: comment_text, Length: 159571, dtype: object

Convert it into a NumPy `ndarray`.

In [13]:
X.values

array(["Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27",
       "D'aww! He matches this background colour I'm seemingly stuck with. Thanks.  (talk) 21:51, January 11, 2016 (UTC)",
       "Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.",
       ...,
       'Spitzer \n\nUmm, theres no actual article for prostitution ring.  - Crunch Captain.',
       'And it looks like it was actually you who put on the speedy to have the first version deleted now that I look at it.',
       '"\nAnd ... I really don\'t think you understand.  I came here and my idea was bad right away.  What kind of communit

Create our y vector.

In [14]:
y = df[df.columns[2:]]
y.shape

(159571, 6)

In [15]:
y

Unnamed: 0,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0,0,0,0,0,0
1,0,0,0,0,0,0
2,0,0,0,0,0,0
3,0,0,0,0,0,0
4,0,0,0,0,0,0
...,...,...,...,...,...,...
159566,0,0,0,0,0,0
159567,0,0,0,0,0,0
159568,0,0,0,0,0,0
159569,0,0,0,0,0,0


Convert the y vector to a NumPy `ndarray`.

In [16]:
y.values

array([[0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       ...,
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0]])

Define the maximum size of our vocabulary. This affects how large the model is and how long it takes to train. This hyperparameter needs tuning to trade model size against accuracy.

In [17]:
MAX_FEATURES = 100000
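One rough way to reason about this trade-off is to measure how much of the text a vocabulary of a given size would cover. The sketch below is an assumption-laden illustration, not part of the original workflow: `coverage` and `toy_comments` are hypothetical names, and tokens are split on whitespace only.

```python
from collections import Counter

def coverage(token_lists, vocab_size):
    """Fraction of token occurrences covered by the vocab_size most frequent tokens."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    total = sum(counts.values())
    covered = sum(c for _, c in counts.most_common(vocab_size))
    return covered / total

# Hypothetical toy corpus, already tokenized
toy_comments = [
    'the page is fine'.split(),
    'the article is a stub'.split(),
    'please fix the page'.split(),
]

for size in (1, 3, 5):
    print(size, round(coverage(toy_comments, size), 2))
```

If coverage flattens out well below the chosen vocabulary size, a smaller `MAX_FEATURES` may cost little accuracy. Note that the real vectorizer also reserves two slots for padding and unknown tokens.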

## Build Vectorizer Model

Here we pass in the maximum number of features, the output length and the type of value we expect for each word.

In [18]:
vectorizer = TextVectorization(
    # Define the size of the vocab
    max_tokens=MAX_FEATURES,
    # Define the max length of each comment to be vectorized
    output_sequence_length=1800,
    # Define the vector for each word to be an int
    output_mode='int'
)

2023-05-24 22:49:38.610087: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-05-24 22:49:38.636874: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1956] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...


## Train Vectorizer Model

The `TextVectorization` layer can be fitted to the data using the `adapt()` method, like so:

In [19]:
vectorizer.adapt(X.values)

## Get Vocabulary from Model

In [20]:
vocabulary = vectorizer.get_vocabulary()
len(vocabulary)

100000

Here's the list of all the unique words in our vocabulary. The index of a word in this list is its `int` token ID.

In [21]:
vocabulary

['',
 '[UNK]',
 'the',
 'to',
 'of',
 'and',
 'a',
 'you',
 'i',
 'is',
 'that',
 'in',
 'it',
 'for',
 'this',
 'not',
 'on',
 'be',
 'as',
 'have',
 'are',
 'your',
 'with',
 'if',
 'article',
 'was',
 'or',
 'but',
 'page',
 'my',
 'an',
 'from',
 'by',
 'do',
 'at',
 'about',
 'me',
 'so',
 'wikipedia',
 'can',
 'what',
 'there',
 'all',
 'has',
 'will',
 'talk',
 'please',
 'would',
 'its',
 'no',
 'one',
 'just',
 'like',
 'they',
 'he',
 'dont',
 'which',
 'any',
 'been',
 'should',
 'more',
 'we',
 'some',
 'other',
 'who',
 'see',
 'here',
 'also',
 'his',
 'think',
 'im',
 'because',
 'know',
 'how',
 'am',
 'people',
 'why',
 'edit',
 'articles',
 'only',
 'out',
 'up',
 'when',
 'were',
 'use',
 'then',
 'may',
 'time',
 'did',
 'them',
 'now',
 'being',
 'their',
 'than',
 'thanks',
 'even',
 'get',
 'make',
 'good',
 'had',
 'very',
 'information',
 'does',
 'could',
 'well',
 'want',
 'such',
 'sources',
 'way',
 'name',
 'these',
 'deletion',
 'pages',
 'first',
 'help'

The word at index 288 is,

In [22]:
vocabulary[288]

'hello'

In a sentence,

In [23]:
vectorizer('Hello World! How do you like my vectorizer?!')

<tf.Tensor: shape=(1800,), dtype=int64, numpy=array([288, 263,  73, ...,   0,   0,   0])>

It's clear that only the words present in the sentence are mapped to non-zero `int` indices. The remaining positions of the 1800-long sequence are padded with 0. It might be worth finding the longest comment in our original dataset and setting `output_sequence_length` close to that value, so the padded matrix is no sparser than it needs to be. `max_tokens` only caps the vocabulary size and has no effect on padding.
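A minimal sketch of how comment lengths could be inspected before picking a sequence length. The helper names and `toy_comments` are hypothetical, and whitespace splitting is only a rough proxy for the vectorizer's tokenization.

```python
import math

def token_lengths(texts):
    # Whitespace token count per text, sorted ascending
    return sorted(len(t.split()) for t in texts)

def length_at_percentile(lengths, pct):
    # Nearest-rank percentile on an already sorted list
    rank = max(1, math.ceil(pct / 100 * len(lengths)))
    return lengths[rank - 1]

# Hypothetical toy comments
toy_comments = [
    'short one',
    'a slightly longer comment here',
    'ok',
    'this comment is the longest one in the toy list by far',
]

lengths = token_lengths(toy_comments)
longest = lengths[-1]
p75 = length_at_percentile(lengths, 75)
print(lengths, longest, p75)
```

Using a high percentile instead of the absolute maximum keeps a few very long outliers from inflating the padded width for every sample, at the cost of truncating those outliers.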

The integer indices for the first 5 words of the test sentence are,

In [24]:
vectorizer('Hello World! How do you like my vectorizer?!')[:5]

<tf.Tensor: shape=(5,), dtype=int64, numpy=array([288, 263,  73,  33,   7])>
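The two reserved indices can be seen in isolation with a tiny vectorizer. This is a sketch under stated assumptions: `toy_vectorizer` is a hypothetical name, and its two-word vocabulary is supplied directly through the `vocabulary` argument instead of being learned with `adapt()`.

```python
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

# Hypothetical toy vectorizer with a fixed two-word vocabulary
toy_vectorizer = TextVectorization(
    output_mode='int',
    output_sequence_length=6,
    vocabulary=['hello', 'world'],
)

# Index 0 ('') is padding and index 1 ('[UNK]') is out-of-vocabulary
print(toy_vectorizer.get_vocabulary())

ids = toy_vectorizer(tf.constant(['Hello world!', 'hello there'])).numpy()
print(ids)
```

The word `there` is not in the toy vocabulary, so it is encoded as 1, while positions past the end of each sentence are encoded as 0.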

## Vectorize Text

Here's where we pass each of the comments in our dataset into the vectorizer to get our complete vectorized textual input.

In [25]:
vectorizedText = vectorizer(X.values)

2023-05-24 22:49:48.638614: W tensorflow/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 2297822400 exceeds 10% of free system memory.


This now serves as a numerical representation of all our text in the form of an integer vector.

In [26]:
vectorizedText

<tf.Tensor: shape=(159571, 1800), dtype=int64, numpy=
array([[  645,    76,     2, ...,     0,     0,     0],
       [    1,    54,  2489, ...,     0,     0,     0],
       [  425,   441,    70, ...,     0,     0,     0],
       ...,
       [32445,  7392,   383, ...,     0,     0,     0],
       [    5,    12,   534, ...,     0,     0,     0],
       [    5,     8,   130, ...,     0,     0,     0]])>
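The size in the allocation warning logged during vectorization follows directly from the shape and dtype of this tensor. A short worked check, using only the numbers shown above:

```python
num_comments = 159571      # rows in the dataset
sequence_length = 1800     # output_sequence_length of the vectorizer
bytes_per_int64 = 8        # the tensor dtype is int64

total_bytes = num_comments * sequence_length * bytes_per_int64
print(total_bytes)                      # 2297822400
print(round(total_bytes / 2**30, 2))    # 2.14 (GiB)
```

Since the whole padded matrix is held in memory, a shorter sequence length reduces this footprint proportionally.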

## Create TensorFlow Data Pipeline

A TensorFlow data pipeline is a mechanism used to efficiently process and feed data to deep learning models in TensorFlow. It involves a series of steps that preprocess, transform, and prepare data for training or inference. The primary goal of a data pipeline is to optimize data loading and processing, ensuring that the model receives data in a timely manner and with minimal performance overhead.

By using TensorFlow data pipelines, you can streamline the data preparation process, improve training efficiency, and ensure that your NLP models receive high-quality and properly formatted input data.

There are 5 steps to create a TensorFlow data pipeline, commonly remembered by the acronym `MCSHBAP`. They are as follows:
1. M - Map, build the dataset from in-memory tensors using `tf.data.Dataset.from_tensor_slices()`
2. C - Cache, to keep the data in memory after the first pass and speed up later reads
3. Sh - Shuffle, to randomize sample order using a buffer of `BUFFER_SIZE` samples
4. B - Batch, to group the samples into batches of `BATCH_SIZE`
5. P - Prefetch, to prevent bottlenecks by preparing `PREFETCH_SIZE` batches ahead of time
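The five steps can be sketched end to end on a small in-memory array. This is an illustrative toy, not the notebook's pipeline: `features`, `labels` and `toy_dataset` are hypothetical names, and `tf.data.AUTOTUNE` is used here in place of a fixed prefetch size as one possible choice.

```python
import numpy as np
import tensorflow as tf

# Hypothetical toy data: 10 samples with 2 features each
features = np.arange(20, dtype=np.int64).reshape(10, 2)
labels = np.arange(10, dtype=np.int64)

toy_dataset = (
    tf.data.Dataset.from_tensor_slices((features, labels))  # Map
    .cache()                                                # Cache
    .shuffle(buffer_size=10)                                # Shuffle
    .batch(4)                                               # Batch
    .prefetch(tf.data.AUTOTUNE)                             # Prefetch
)

# Collect the shape of every batch: the last batch holds the remainder
batch_shapes = [(x.shape, y.shape) for x, y in toy_dataset.as_numpy_iterator()]
print(batch_shapes)
```

With 10 samples and a batch size of 4, the last batch is partial. Passing `drop_remainder=True` to `batch()` would discard it if fixed-size batches are required.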

### Define Hyperparameters

In [27]:
BUFFER_SIZE = 160000
BATCH_SIZE = 16
PREFETCH_SIZE = 8

### Map Data to a TensorFlow Dataset

In [28]:
dataset = tf.data.Dataset.from_tensor_slices((vectorizedText, y))

### Cache, Shuffle, Batch and Prefetch

In [29]:
dataset = dataset.cache()

dataset = dataset.shuffle(BUFFER_SIZE)

# Representing each batch as BATCH_SIZE number of samples
dataset = dataset.batch(BATCH_SIZE)

# Prevent bottlenecks in batches by prefetching
dataset = dataset.prefetch(PREFETCH_SIZE)

### Accessing the Dataset

To access the dataset, we create a `numpy` iterator over its batches with `dataset.as_numpy_iterator()`. The iterator can be stored in a variable and advanced to the next batch with its `next()` method.

Displaying the first batch of the dataset.

In [30]:
dataset.as_numpy_iterator().next()

2023-05-24 22:49:50.039863: W tensorflow/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 2297822400 exceeds 10% of free system memory.
2023-05-24 22:49:51.061319: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_1' with dtype int64 and shape [159571,6]
	 [[{{node Placeholder/_1}}]]
2023-05-24 22:49:51.061656: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_0' with dtype int64 and shape [159571,1800]
	 [[{{node Placeholder/_0}}]]


(array([[90938,    46,   445, ...,     0,     0,     0],
        [  265,   265,    13, ...,     0,     0,     0],
        [ 8245,    43,  1247, ...,     0,     0,     0],
        ...,
        [   60,   586,  1229, ...,     0,     0,     0],
        [ 1416,     4,     2, ...,     0,     0,     0],
        [  168, 12423,   412, ...,     0,     0,     0]]),
 array([[0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0],
        [1, 0, 1, 0, 1, 0],
        [0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0]]))

Creating a `numpy` generator for the dataset. 

In [31]:
datasetGenerator = dataset.as_numpy_iterator()

Storing the next batch's `X` and `y` by unpacking the batch.

In [32]:
batchX, batchY = datasetGenerator.next()

In [33]:
batchX, batchY

(array([[  540,  5889,     8, ...,     0,     0,     0],
        [  136,   313,    21, ...,     0,     0,     0],
        [   41,    20,   258, ...,     0,     0,     0],
        ...,
        [  534,    52,     7, ...,     0,     0,     0],
        [22567,  1200,     0, ...,     0,     0,     0],
        [    2,   762,  1079, ...,     0,     0,     0]]),
 array([[1, 0, 1, 0, 1, 0],
        [0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0],
        [1, 0, 0, 0, 1, 0],
        [0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0],
        [1, 0, 1, 0, 1, 0]]))

In [34]:
batchX.shape, batchY.shape

((16, 1800), (16, 6))

## Train, Validation and Test Split

Get the total number of batches in the dataset.

In [35]:
numberBatches = len(dataset)

Split the dataset into consecutive ranges of batches using the `take()` and `skip()` methods.

In [36]:
train = dataset.take(int(numberBatches * 0.7))
validation = dataset.skip(int(numberBatches * 0.7)).take(int(numberBatches * 0.2))
test = dataset.skip(int(numberBatches * 0.9)).take(int(numberBatches * 0.1))
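One caveat worth checking with this kind of split: `shuffle()` reshuffles on every iteration by default, so `take()` and `skip()` may draw from different orderings and the splits can overlap. The sketch below illustrates this on a toy range dataset. `split_overlap` is a hypothetical helper, and the fixed `seed` together with `reshuffle_each_iteration=False` is one possible remedy, not something the original pipeline does.

```python
import tensorflow as tf

def split_overlap(reshuffle):
    # Toy dataset of the integers 0..99, shuffled and batched by 10
    ds = tf.data.Dataset.range(100).shuffle(
        100, seed=0, reshuffle_each_iteration=reshuffle).batch(10)
    train = ds.take(7)
    test = ds.skip(7)
    train_ids = {int(i) for batch in train.as_numpy_iterator() for i in batch}
    test_ids = {int(i) for batch in test.as_numpy_iterator() for i in batch}
    # Number of samples that appear in both splits
    return len(train_ids & test_ids)

print(split_overlap(reshuffle=False))  # 0: both splits see the same ordering
print(split_overlap(reshuffle=True))   # may be non-zero: orderings can differ
```

An alternative that avoids the issue entirely is to split the arrays before building the datasets, and shuffle only the training portion.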

Remember, these numbers are the number of batches and not the number of samples.

In [37]:
len(train), len(validation), len(test)

(6981, 1994, 997)
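These batch counts can be reproduced by hand from the dataset size and `BATCH_SIZE`. A short worked check (the lowercase variable names are chosen here for illustration):

```python
import math

NUM_SAMPLES = 159571
BATCH_SIZE = 16

# The final batch is partial, so the batch count is rounded up
number_batches = math.ceil(NUM_SAMPLES / BATCH_SIZE)

train_batches = int(number_batches * 0.7)
validation_batches = int(number_batches * 0.2)
# The test split skips 90% of the batches, then takes at most 10%
test_batches = min(int(number_batches * 0.1),
                   number_batches - int(number_batches * 0.9))

# Truncating with int() leaves a few batches out of every split
unused_batches = number_batches - (
    train_batches + validation_batches + test_batches)

print(number_batches, train_batches, validation_batches, test_batches)
print(unused_batches)
```

Because each fraction is truncated independently, two batches end up in none of the three splits. This is harmless here but worth knowing when the sample counts need to add up.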

The number of samples in the `train` dataset would be,

In [38]:
len(train) * BATCH_SIZE

111696

# Constructing the Neural Network