# Check if GPU is Online

In [None]:
!nvidia-smi

# Import Dependencies

In [None]:
import os
from datetime import datetime

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn
import tensorflow as tf
from tensorflow.keras.layers import (LSTM, Bidirectional, Dense, Dropout,
                                     Embedding, TextVectorization)
from tensorflow.keras.models import Sequential

# Import Data

## Data Description

The dataset is composed of a large number of Wikipedia comments which have been labeled by human raters for toxic behavior. The types of toxicity are:
```text
    toxic
    severe_toxic
    obscene
    threat
    insult
    identity_hate
```

## Manage Import

Create the path to `train.csv` within `assets/data`.

In [None]:
pathToTrain = os.path.join('assets', 'data', 'train.csv')

Importing `train.csv` to a dataframe called `df`.

In [None]:
df = pd.read_csv(pathToTrain)

## Explore Imported Data

In [None]:
df.tail()

Check all of the columns of the dataset.

In [None]:
df.columns

Check the details of the comment at index 8. First, the comment itself.

In [None]:
df.iloc[8]['comment_text']

Then, its labels.

In [None]:
df[df.columns[2:]].iloc[8]

Check how many comments in our dataset have been labeled as severely toxic.

In [None]:
df[df['severe_toxic'] == 1].shape[0]

# Preprocess Comments

To preprocess the comments, we use the `TextVectorization` layer from `tensorflow`. It's able to preprocess the samples through the following steps:
- Standardize each example (usually lowercasing + punctuation stripping)
- Split each example into substrings (usually words)
- Recombine substrings into tokens (usually ngrams)
- Index tokens (associate a unique int value with each token)
- Transform each example using this index, either into a vector of ints or a dense float vector.

More information [here](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization). 
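
To make these steps concrete, here is a minimal pure-Python sketch of what a vectorizer does conceptually. It is not how `TextVectorization` is implemented; the helper names (`standardize`, `build_vocab`, `vectorize`) are hypothetical, the n-gram step is left out, and the convention of reserving 0 for padding and 1 for unknown words is assumed to mirror the layer's `int` output mode.

```python
import re
from collections import Counter


def standardize(text):
    """Lowercase the text and strip punctuation."""
    return re.sub(r"[^\w\s]", "", text.lower())


def build_vocab(texts, max_tokens):
    """Index tokens by frequency; 0 is padding and 1 is out-of-vocabulary."""
    counts = Counter(tok for t in texts for tok in standardize(t).split())
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return {tok: i + 2 for i, tok in enumerate(most_common)}


def vectorize(text, vocab, seq_len):
    """Map tokens to ints, truncate to seq_len and pad with zeros."""
    ids = [vocab.get(tok, 1) for tok in standardize(text).split()][:seq_len]
    return ids + [0] * (seq_len - len(ids))


# Toy corpus, made up for illustration
corpus = ["Hello world!", "Hello there, world.", "Goodbye world"]
vocab = build_vocab(corpus, max_tokens=10)
print(vectorize("Hello unseen world?!", vocab, seq_len=6))
```

The unseen word maps to 1 and the unused positions are filled with 0, which is the same pattern the real layer produces further down in this notebook.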

Here's the documentation for the `TextVectorization` layer.

In [None]:
TextVectorization??

## Create `X` and `y` Arrays

Create our X vector.

In [None]:
X = df['comment_text']
X.shape

In [None]:
X

Convert it into a NumPy `ndarray`.

In [None]:
X.values

Create our y vector.

In [None]:
y = df[df.columns[2:]]
y.shape

In [None]:
y

Convert the y vector to a NumPy `ndarray`.

In [None]:
y.values

## Build Vectorizer Model

Define the maximum size of our vocabulary. This affects how large the model is and how long it takes to train. Tune this hyperparameter to trade off model size against accuracy.

In [None]:
MAX_FEATURES = 100000
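
One rough way to reason about the vocabulary size is to measure how much of the text the most frequent words already cover. The sketch below is only an illustration under assumptions: `coverage` is a hypothetical helper, and the toy `docs` list stands in for the tokenized comments.

```python
from collections import Counter


def coverage(token_lists, vocab_size):
    """Fraction of token occurrences covered by the vocab_size most frequent tokens."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    total = sum(counts.values())
    covered = sum(c for _, c in counts.most_common(vocab_size))
    return covered / total if total else 0.0


# Toy tokenized documents, made up for illustration
docs = [["a", "b", "a"], ["a", "c"], ["b", "d"]]
for k in (1, 2, 4):
    print(k, round(coverage(docs, k), 3))
```

If coverage flattens out well below the chosen vocabulary size, a smaller value may give a lighter model at little cost in accuracy.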

Here we pass in the maximum number of features, the output sequence length and the output mode (an integer index for each word).

In [None]:
vectorizer = TextVectorization(
    # Define the size of the vocab
    max_tokens=MAX_FEATURES,
    # Define the max length of each comment to be vectorized
    output_sequence_length=1800,
    # Define the vector for each word to be an int
    output_mode='int'
)

## Train Vectorizer Model

The `TextVectorization` layer learns its vocabulary through the `adapt()` method, like so:

In [None]:
vectorizer.adapt(X.values)

## Get Vocabulary from Model

In [None]:
vocabulary = vectorizer.get_vocabulary()
len(vocabulary)

Here's the list of all the unique words in our vocabulary. The index of a word in this list is the `int` it gets encoded as.

In [None]:
vocabulary

The word at index 288 is,

In [None]:
vocabulary[288]

In a sentence,

In [None]:
vectorizer('Hello World! How do you like my vectorizer?!')

Only the words present in the sentence are mapped to non-zero `int` values; the remaining positions, up to the output length of 1800, are padded with 0. It might be worth finding the longest comment in our original dataset and setting `output_sequence_length` to that value, so that the matrix is no sparser than it needs to be.
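
A quick way to explore this is to look at the distribution of comment lengths. The sketch below uses a nearest-rank percentile over whitespace-separated tokens; `token_length_percentile` is a hypothetical helper, and whitespace splitting only approximates how the vectorizer tokenizes text.

```python
import math


def token_length_percentile(comments, q):
    """Return the q-th percentile (nearest rank) of whitespace-token counts."""
    lengths = sorted(len(c.split()) for c in comments)
    if not lengths:
        raise ValueError("comments must not be empty")
    rank = max(1, math.ceil(q / 100 * len(lengths)))
    return lengths[rank - 1]


# Toy comments, made up for illustration
comments = ["short one", "a slightly longer comment here", "x " * 50]
print(token_length_percentile(comments, 100))  # length of the longest comment
print(token_length_percentile(comments, 50))   # median length
```

In this notebook, the same helper could be called as `token_length_percentile(X.values, 99)` to see what length would fit 99% of the comments without truncation.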

The vectors for the first 5 words in the test sentence are,

In [None]:
vectorizer('Hello World! How do you like my vectorizer?!')[:5]

## Vectorize Text

Here's where we pass each of the comments in our dataset into the vectorizer to get our complete vectorized textual input.

In [None]:
vectorizedText = vectorizer(X.values)

This now serves as a numerical representation of all our text in the form of an integer vector.

In [None]:
vectorizedText

# Create Tensorflow Data Pipeline

A TensorFlow data pipeline is a mechanism used to efficiently process and feed data to deep learning models in TensorFlow. It involves a series of steps that preprocess, transform, and prepare data for training or inference. The primary goal of a data pipeline is to optimize data loading and processing, ensuring that the model receives data in a timely manner and with minimal performance overhead.

By using TensorFlow data pipelines, you can streamline the data preparation process, improve training efficiency, and ensure that your NLP models receive high-quality and properly formatted input data.

There are 5 steps to create a TensorFlow data pipeline, commonly remembered by the acronym `MCSHBAP`. They are as follows:
1. M - Map using `tf.data.Dataset.from_tensor_slices()`
2. C - Cache, to cache the data to enhance memory management and response time in accessing data
3. Sh - Shuffle, a good shuffle is always good practice using a `BUFFER_SIZE`
4. B - Batch, separate the data into batches by `BATCH_SIZE`
5. P - Prefetch, to prevent bottlenecks by prefetching `PREFETCH_SIZE` of data
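
The shuffle and batch steps can be pictured with plain Python generators. This is a conceptual sketch of buffer-based shuffling, not TensorFlow's implementation; `buffered_shuffle` and `batch` are hypothetical helper names.

```python
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in random order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]        # emit a random buffered element
        buffer[idx] = item       # and replace it with the new one
    rng.shuffle(buffer)          # drain whatever is left
    yield from buffer


def batch(items, batch_size):
    """Group items into lists of batch_size; the last one may be shorter."""
    chunk = []
    for item in items:
        chunk.append(item)
        if len(chunk) == batch_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


print(list(batch(buffered_shuffle(range(10), buffer_size=100, seed=0), 4)))
```

A buffer at least as large as the dataset gives a full shuffle, while a buffer of size 1 leaves the order unchanged. That is the reasoning behind choosing a `BUFFER_SIZE` on the order of the number of samples.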

### Define Hyperparameters

In [None]:
BUFFER_SIZE = 160000
BATCH_SIZE = 16
PREFETCH_SIZE = 8

### Map Data to a Tensorflow Dataset

In [None]:
dataset = tf.data.Dataset.from_tensor_slices((vectorizedText, y))

### Cache, Shuffle, Batch and Prefetch

In [None]:
dataset = dataset.cache()

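# Keep the shuffle order fixed across iterations so that the take()/skip()
# split below yields the same train, validation and test batches every epoch
dataset = dataset.shuffle(BUFFER_SIZE, reshuffle_each_iteration=False)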

# Representing each batch as BATCH_SIZE number of samples
dataset = dataset.batch(BATCH_SIZE)

# Prevent bottlenecks in batches by prefetching
dataset = dataset.prefetch(PREFETCH_SIZE)

### Accessing the Dataset

To access the dataset, we create a `numpy` generator to iterate over the batches of the dataset. We create an iterator with `dataset.as_numpy_iterator()`. This can be saved to an iterator variable and called when we move the iterator to the next batch using the `next()` method.

Displaying the first batch of the dataset.

In [None]:
dataset.as_numpy_iterator().next()

Creating a `numpy` generator for the dataset. 

In [None]:
datasetGenerator = dataset.as_numpy_iterator()

Storing the next batch's `X` and `y` by unpacking the batch.

In [None]:
batchX, batchY = datasetGenerator.next()

In [None]:
batchX, batchY

In [None]:
batchX.shape, batchY.shape

## Train-test and Validation Split

Get the total number of batches in the dataset.

In [None]:
numberBatches = len(dataset)

Split the dataset into partitions using the `take()` and `skip()` methods.

In [None]:
train = dataset.take(int(numberBatches * 0.7))
validation = dataset.skip(int(numberBatches * 0.7)).take(int(numberBatches * 0.2))
test = dataset.skip(int(numberBatches * 0.9)).take(int(numberBatches * 0.1))

Remember, these numbers are the number of batches and not the number of samples.

In [None]:
len(train), len(validation), len(test)

The number of samples in the `train` dataset would be,

In [None]:
len(train) * BATCH_SIZE
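
The split arithmetic can be checked by hand. The sketch below assumes a hypothetical dataset of 1000 samples (not the size of `train.csv`) and uses a hypothetical helper, `split_batches`, that mirrors the `int(numberBatches * fraction)` truncation used above.

```python
import math


def split_batches(n_samples, batch_size, fractions=(0.7, 0.2, 0.1)):
    """Return the total batch count and the truncated size of each partition."""
    n_batches = math.ceil(n_samples / batch_size)
    return n_batches, tuple(int(n_batches * f) for f in fractions)


# Hypothetical sample count, chosen only for illustration
n_batches, (n_train, n_val, n_test) = split_batches(1000, 16)
print(n_batches, n_train, n_val, n_test)
print("unused batches:", n_batches - (n_train + n_val + n_test))
```

Because each fraction is truncated with `int()`, the partitions can add up to fewer batches than the dataset holds, so a batch or two at the end may go unused.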

# Construct the Neural Network

Here's an overview of what our model will look like with the entire deep learning workflow for this project. 

![](./assets/images/project-workflow.png)

Our deep learning model consists of an embedding layer and a bidirectional LSTM layer, followed by three fully-connected layers and an output layer that produces one value for each of the six independent classes.

## Instantiate Model Using the Sequential API

In [None]:
model = Sequential()

## Add Embedding Layer

Add an embedding layer to our sequential model. This serves as a personality test of sorts for each word, which the model learns through its training. In this case, we don't pass any pre-trained embeddings; the model learns the embedding for each word from scratch.

In [None]:
Embedding??

In [None]:
model.add(Embedding(MAX_FEATURES + 1, 32))

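The `+1` on `input_dim` leaves one spare index as headroom. Note that `TextVectorization` already counts the padding token (index 0) and the unknown-word token `[UNK]` (index 1) within `max_tokens`. Each word's embedding is 32 values long.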

## Create LSTM Layer

The 32-value embeddings are fed, one word at a time, into a bidirectional LSTM with 32 units and a `tanh` activation. We use `tanh` because TensorFlow's GPU-accelerated (cuDNN) LSTM kernel is only used when the layer keeps this activation.

In [None]:
Bidirectional??

In [None]:
model.add(Bidirectional(LSTM(32, activation='tanh')))

We use `Bidirectional` to capture sequences of words since it lets us pass embeddings not just in a singular direction but in two as the name suggests.

## Create Feature Extractors (FC Layers)

In [None]:
Dense??

Here's the array of our fully-connected/feature extractor layers.

In [None]:
model.add(Dense(128, activation='relu'))
model.add(Dense(256, activation='relu'))
model.add(Dense(128, activation='relu'))

## Create Output Layer

This maps to the number of outputs we need from our network, i.e. six classes. We use `sigmoid` to squash each output into the range `[0, 1]`.

In [None]:
model.add(Dense(6, activation='sigmoid'))

## Compile Model

In [None]:
model.compile??

In [None]:
model.compile(loss='BinaryCrossentropy', optimizer='Adam')

We use `BinaryCrossentropy` rather than `CategoricalCrossentropy` because the output should not be exactly one of the six classes but any combination of them. For example, a comment that is `severe_toxic` can also be a `threat` and/or an `insult`.

So it's as if we're running six different binary classifiers.
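
As a small worked example of that idea, the sketch below computes the mean binary cross-entropy over six independent labels in plain Python. The label order follows the dataset's columns; the probabilities are made-up values for illustration, and `binary_crossentropy` here is a hand-written helper rather than the Keras loss.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over independent labels."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


# toxic, severe_toxic, obscene, threat, insult, identity_hate
labels = [1, 1, 0, 0, 1, 0]
probs = [0.9, 0.8, 0.1, 0.2, 0.7, 0.1]  # hypothetical sigmoid outputs

print(binary_crossentropy(labels, probs))

# Each label is thresholded on its own, so several can be active at once
predicted = [int(p >= 0.5) for p in probs]
print(predicted)
```

Each output is scored separately, so the probabilities do not need to sum to 1 the way they would with a softmax output.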

## Summarize Model

In [None]:
model.summary()
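
The parameter counts in the summary can be sanity-checked by hand. The sketch below assumes the layer sizes used above and default layer settings (biases enabled); the helper names are hypothetical.

```python
def embedding_params(input_dim, output_dim):
    """One output_dim-sized vector per vocabulary index."""
    return input_dim * output_dim


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """A weight matrix plus one bias per unit."""
    return input_dim * units + units


MAX_FEATURES = 100000

counts = {
    "embedding": embedding_params(MAX_FEATURES + 1, 32),
    # Two LSTMs (forward and backward); their outputs are concatenated to 64 values
    "bidirectional_lstm": 2 * lstm_params(32, 32),
    "dense_128_a": dense_params(64, 128),
    "dense_256": dense_params(128, 256),
    "dense_128_b": dense_params(256, 128),
    "output": dense_params(128, 6),
}

for name, n in counts.items():
    print(f"{name:>20}: {n:,}")
print(f"{'total':>20}: {sum(counts.values()):,}")
```

Almost all of the parameters sit in the embedding layer, which is why the vocabulary size has such a large effect on the size of the model.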

# Train the Model

In [None]:
EPOCHS = 10

In [None]:
model.fit??

In [None]:
history = model.fit(
    train,
    epochs=EPOCHS,
    validation_data=validation
)

# Save Model Weights

In [None]:
model.save_weights??

In [None]:
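# Generate timestamp
timestamp = datetime.now().strftime("%Y%m%d%H%M%S")

# Define the filename with timestamp
# (Keras 3 expects weight files to end in `.weights.h5`; adjust if needed)
filename = f"weights_{timestamp}.h5"

# Make sure the target directory exists, then save the model weights
weightsDir = os.path.join('assets', 'weights')
os.makedirs(weightsDir, exist_ok=True)
pathToWeights = os.path.join(weightsDir, filename)
model.save_weights(pathToWeights)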

# Write Training History to JSON

In [None]:
import json

# Save the history variable to a JSON file
with open('history.json', 'w') as f:
    json.dump(history.history, f)
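
Once the history is on disk, it can be loaded again without retraining. The sketch below writes and reads a small placeholder history (the numbers are made up, not results from this model) and picks the epoch with the lowest validation loss; `best_epoch` is a hypothetical helper.

```python
import json
import os
import tempfile


def best_epoch(history, metric="val_loss"):
    """Return the 1-based epoch with the lowest value of metric, and that value."""
    values = history[metric]
    idx = min(range(len(values)), key=values.__getitem__)
    return idx + 1, values[idx]


# Placeholder values, for illustration only
example_history = {"loss": [0.08, 0.05, 0.04], "val_loss": [0.06, 0.05, 0.055]}

# Round-trip through a temporary JSON file
path = os.path.join(tempfile.mkdtemp(), "history.json")
with open(path, "w") as f:
    json.dump(example_history, f)
with open(path) as f:
    loaded = json.load(f)

print(best_epoch(loaded))
```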

# Analyse Model During Training

In [None]:
history??

In [None]:
history.history

In [None]:
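# Plot onto an explicitly sized axes; DataFrame.plot() would otherwise open a new figure
fig, ax = plt.subplots(figsize=(15, 10), dpi=120)

pd.DataFrame(history.history).plot(ax=ax)
ax.set_xlabel('Epoch')

plt.show()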