### The goal of this phase is to have your text classifier ready to use: you will not only train it on a labeled dataset, but also export it in a format that the API can load later.

## Prepare the dataset for training

You want to "teach" a machine to distinguish between spam and ham. Unfortunately, machines prefer to speak numbers rather than words. So you need to transform the human-readable CSV file above into a format that, albeit less readable by us puny humans, is better suited to the subsequent task of training the classifier. You will express (a cleaned-up version of) the text as a sequence of numbers, each representing a token (one word) of the message text.

More precisely:

1. first you'll initialize a "tokenizer", asking it to build a dictionary (i.e. a token/number mapping) best suited for the texts at hand;
2. then, you'll use the tokenizer to reduce all messages into (variable-length) sequences of numbers;
3. these sequences will be "padded", i.e. you'll make sure they end up all having the same length: in this way, the whole dataset will be represented by a rectangular matrix of integer numbers, each row possibly having leading zeroes;
4. the "spam/ham" column of the input dataset is recast with the "*one-hot encoding*": that is, it will become two columns, one for "spamminess" and one for "hamminess", both admitting the values zero or one (but with a single "one" per row): this turns out to be a formulation much friendlier to categorical classification tasks in general;
5. finally you'll split the labeled dataset into two disjoint parts, "training" and "testing". This is a very important concept: the effectiveness of a model should always be validated on data points *not used during training*.

All these steps can be largely automated by using data-science Python packages such as `pandas`, `numpy`, `tensorflow/keras`.

### Overview
The dataset preparation starts with the CSV file you saw earlier and ends by exporting the new data format to the `training/prepared_dataset` directory. Two observations are in order:

- the "big matrix of numbers" encoding the messages and the (narrower) one containing their spam/ham status are useless without the tokenizer: after all, to process a new message you would need to make it into a sequence of numbers using this very same mapping. For this reason, it is important to export the tokenizer as well, in order to later use the classifier.
- the `pickle` protocol used in writing the reformulated data is strictly Python-specific and should not be treated as a long-term (or interoperable!) format. Later we discuss a sensible way to store model, tokenizer and metadata on disk.

### Preamble

In [None]:
import tensorflow as tf
import timeit

# Test for GPU
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
    print(
      '\n\nThis error most likely means that this notebook is not configured to use a GPU. '
      'Change this in Notebook Settings via the command palette (cmd/ctrl-shift-P) or the Edit menu.\n\n')
    #raise SystemError('GPU device not found')

# Disable GPU
tf.config.set_visible_devices([], 'GPU')

# List TF devices
print(f"Physical Devices: {tf.config.list_physical_devices()}")
print(f"Logical Devices:  {tf.config.list_logical_devices()}")

In [None]:
import sys
import pickle
import json
import pandas as pd
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split

In [None]:
# set the input file
datasetInputFile = '../training/dataset/spam-dataset.csv'
# set the output file
trainingDumpFile = '../training/prepared_dataset/spam_training_data.pickle'

### Reading and transforming the input

#### Reading the input file and preparing legend info

In [None]:
# Load Datasets into a Pandas DataFrame
df = pd.read_csv(datasetInputFile)

# Convert Dataset to Lists
labels = df['label'].tolist()
texts = df['text'].tolist()

# Now we need to map our labels from being text values to being integer values. It's pretty simple:
labelLegend = {'ham': 0, 'spam': 1}

# The inverted legend is there to help us when we need to add a label to our predictions later.
labelLegendInverted = {'%i' % v: k for k,v in labelLegend.items()}
labelsAsInt = [labelLegend[x] for x in labels]

**Look at:** the contents of `texts`,
`labelLegend`,
`labelLegendInverted`,
`labels`,
`labelsAsInt`

In [None]:
## Uncomment any one of the following and press Shift+Enter to print the variable
# texts
# labelLegend
# labelLegendInverted
# labels
# labelsAsInt
df.head()

#### Tokenization of texts
The Keras `Tokenizer` will convert our raw text into sequences of integers, one number per token. Converting text to numbers is a required step for any machine learning model (not just Keras).
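
To make the token/number mapping concrete, here is a minimal pure-Python sketch of the idea behind `fit_on_texts` and `texts_to_sequences`. It is a simplified imitation, not the Keras implementation: the helper names `build_word_index` and `to_sequences`, the toy messages and the naive whitespace splitting are all assumptions made for illustration. As in Keras, index 0 is left unused (it is reserved for padding) and more frequent words get lower indices.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer, most frequent word first (indices start at 1)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by decreasing count; ties keep first-seen order
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index, num_words=None):
    """Replace words by their index, dropping unknown words and those past the cap."""
    sequences = []
    for text in texts:
        seq = [word_index[w] for w in text.lower().split() if w in word_index]
        if num_words is not None:
            # only indices strictly below num_words survive
            seq = [i for i in seq if i < num_words]
        sequences.append(seq)
    return sequences

# Hypothetical toy messages, not taken from the dataset
toy_texts = ["free prize now", "call me now", "free free call"]
toy_index = build_word_index(toy_texts)
print(toy_index)
print(to_sequences(toy_texts, toy_index, num_words=4))
```

Note how the cap removes the rarest words ("prize" and "me") from the sequences while the messages themselves keep their variable length.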

In [None]:
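# MAX_NUM_WORDS caps the vocabulary size: the tokenizer keeps only the most
# frequent (MAX_NUM_WORDS - 1) words and drops all others when building sequences.
# It is NOT a limit on message length (that is MAX_SEQ_LENGTH, set further below).
# The value 280 simply echoes the maximum length of a post (tweet) on Twitter.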
MAX_NUM_WORDS = 280
tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

**Look at:** `tokenizer.word_index`, `inverseWordIndex`, `sequences` and how they play together:

In [None]:
# This is only needed for demonstration purposes, will not be dumped with the rest:
inverseWordIndex = {v: k for k, v in tokenizer.word_index.items()}

## Uncomment any one of the following and press Shift+Enter to print the variable
# tokenizer.word_index
# inverseWordIndex
# sequences
# [[inverseWordIndex[i] for i in seq] for seq in sequences]

#### Create `X`, `y` training sets

In machine learning, it's common to denote the training inputs as `X` and their corresponding labels (the outputs) as `y`. 

Let's start with the `X` data (aka the text) by padding all of our tokenized sequences. This ensures all training inputs are the same shape (aka size). 

Sentences, paragraphs and whole conversations rarely have the same length: some will coincide, but certainly not all of them. Yet we want to categorize every message as either `spam` or `ham`, that is, to map data of arbitrary length onto data of known length.

This means we have two challenges:
- Matrix multiplication has strict rules
- Spoken or written language rarely adheres to strict rules.

What to do?

`X` is the new representation of the `text` column of our raw dataset. As stated above, there's a very small chance that all sequences have exactly the same length, so we'll use the built-in tool called `pad_sequences` to correct for the inconsistent length. The resulting dimensions are called the *shape* of the data, a term with its roots in linear algebra (matrix multiplication).
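
As an illustration of what padding does, the following pure-Python sketch mimics the default behaviour of `pad_sequences`, which pads with zeroes at the front and, for sequences that are too long, drops items from the front as well. The helper name `pad_left` and the toy sequences are assumptions for this example only.

```python
def pad_left(sequences, maxlen, value=0):
    """Left-pad (or left-truncate) every sequence to exactly maxlen items.

    Assumes maxlen >= 1.
    """
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]  # keep at most the last maxlen items
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded

# Hypothetical tokenized messages of different lengths
toy_sequences = [[1, 2], [3, 4, 5, 6, 7], []]
toy_padded = pad_left(toy_sequences, maxlen=4)
for row in toy_padded:
    print(row)
```

Every row now has four entries, so the list of lists can be treated as a rectangular matrix.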

In [None]:
MAX_SEQ_LENGTH = 300
X = pad_sequences(sequences, maxlen=MAX_SEQ_LENGTH)

**Look at:** `sequences`, `X` and compare their shape and contents:

In [None]:
## Uncomment any one of the following and press Shift+Enter to print the variable
# [len(s) for s in sequences]
# len(sequences)
X.shape
# type(X)
# X

#### Switch to categorical form for labels
We convert `labelsAsIntArray` into a matrix with one column per label (instead of just a list of ints) by using the built-in `to_categorical` function. The number of labels does not have to be 2 (as we have) but it should be at least 2.
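
The one-hot idea can be sketched in a few lines of plain Python. This is an illustration of the encoding, not of how `to_categorical` is implemented; `one_hot` and the toy labels are hypothetical names introduced here.

```python
def one_hot(int_labels, num_classes):
    """Turn integer labels into rows holding a single 1.0 and zeroes elsewhere."""
    rows = []
    for label in int_labels:
        row = [0.0] * num_classes
        row[label] = 1.0  # the column index is the integer label
        rows.append(row)
    return rows

toy_legend = {'ham': 0, 'spam': 1}
toy_labels = ['ham', 'spam', 'spam', 'ham']
toy_y = one_hot([toy_legend[x] for x in toy_labels], num_classes=len(toy_legend))
for label, row in zip(toy_labels, toy_y):
    print(label, row)
```

Column 0 plays the role of "hamminess" and column 1 of "spamminess", following the order fixed by the legend.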

In [None]:
labelsAsIntArray = np.asarray(labelsAsInt)
y = to_categorical(labelsAsIntArray)

**Look at:** `labelsAsIntArray`, `y` and how they relate to `labels` and `labelLegend`:

In [None]:
## Uncomment any one of the following and press Shift+Enter to print the variable
# labelsAsIntArray
# labelsAsIntArray.shape
y.shape
# y
# labels
# labelLegend

#### Splitting the labeled dataset (train/test)

If we trained on all of our data, our model might fit that training data very *well*, but we would have no way of knowing whether it performs well on new data. A model that merely memorizes its training set is mostly useless.

Since we have the `X` and `y` designations, we split the data into at least 2 corresponding sets: training data and validation data for each designation resulting in `X_train`, `X_test`, `y_train`, `y_test`.
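
Conceptually, the split shuffles the row indices once and then cuts them into two disjoint groups, applying the same cut to `X` and `y` so that inputs and labels stay aligned. The sketch below shows this with the standard library only; `split_rows` is a hypothetical helper and is not meant to reproduce the exact partition that any library would return for the same seed.

```python
import math
import random

def split_rows(X, y, test_size=0.33, seed=42):
    """Shuffle row indices and split X and y consistently into train/test parts."""
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of rows")
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)  # reproducible shuffle
    n_test = math.ceil(test_size * len(X))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(rows, idx):
        return [rows[i] for i in idx]

    return pick(X, train_idx), pick(X, test_idx), pick(y, train_idx), pick(y, test_idx)

# Hypothetical toy data: ten rows, each tagged with its own row number
toy_X = [[i, i] for i in range(10)]
toy_y = [i % 2 for i in range(10)]
tX_train, tX_test, ty_train, ty_test = split_rows(toy_X, toy_y)
print(len(tX_train), len(tX_test))
```

Fixing the seed makes the split reproducible, which matters when you want to compare several training runs against the same held-out rows.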

An easy way (but not the only way) is to use `scikit-learn` for this:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

**Look at:** the shape of the four resulting numpy 2D arrays:

In [None]:
print(f"X_train: {X_train.shape}")
print(f"X_test:  {X_test.shape}")
print(f"y_train: {y_train.shape}")
print(f"y_test:  {y_test.shape}")

#### Save everything to file

As we'll see soon, the test sets (aka `X_test` and `y_test`) are used to evaluate how well our AI model is learning (aka its performance). This means it's often a good idea to save the test sets and reuse them in future training runs rather than splitting the data all over again. Evaluating against the same test set over and over shows how our model's performance changes over time.

In [None]:
trainingData = {
    'X_train': X_train, 
    'X_test': X_test,
    'y_train': y_train,
    'y_test': y_test,
    'max_words': MAX_NUM_WORDS,
    'max_seq_length': MAX_SEQ_LENGTH,
    'label_legend': labelLegend,
    'label_legend_inverted': labelLegendInverted, 
    'tokenizer': tokenizer,
}
with open(trainingDumpFile, 'wb') as f:
    pickle.dump(trainingData, f)
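
A quick way to gain confidence in a dump is to read it straight back and compare. The following sketch does a pickle round trip on a small stand-in dictionary in a temporary directory; the dictionary contents and the file name are hypothetical. Remember that unpickling can execute arbitrary code, so only load pickle files that you created or that you trust.

```python
import os
import pickle
import tempfile

# Stand-in for the real trainingData dictionary
toy_data = {'max_seq_length': 300, 'label_legend': {'ham': 0, 'spam': 1}}

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, 'toy_training_data.pickle')
    with open(toy_path, 'wb') as f:
        pickle.dump(toy_data, f)
    with open(toy_path, 'rb') as f:
        toy_restored = pickle.load(f)

print(toy_restored == toy_data)
```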

## Train the Model

It is time to train the model, i.e. fit a neural network to the task of associating a spam/ham label to a text message. Well, actually the task is now more like "associating probabilities for 0/1 to a sequence of integer numbers (padded to fixed length with leading zeroes)".

The training works as follows:

1. All variables created and stored in the previous steps are loaded back to memory;
2. A specific neural network architecture is created, still a "blank slate" in terms of what it "knows". Its core structure is that of an [LSTM](https://en.wikipedia.org/wiki/Long_short-term_memory) (long short-term memory) network, a specific kind of recurrent neural network with some clever modifications aimed at enhancing its ability to "remember" things between non-adjacent locations in a sequence, such as two distant positions in a string of text;
3. The neural network (your classifier) is trained: that means it will progressively adapt its (many thousands of) internal parameters in order to best reproduce the input training set. Each individual neuron in the network is a relatively simple component: the "intelligence" comes from their sheer quantity and from the particular choice of parameters determining which neurons affect which others, and by how much;
4. Once the training process has finished, the script carefully saves everything (model, tokenizer and associated metadata) in a format that can be later loaded by the API in a stand-alone way.


### Preamble

In [None]:
import pickle
import json
import sys
import numpy as np
#
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Embedding, LSTM, SpatialDropout1D
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import tokenizer_from_json
from tensorflow.keras import models

# in
trainingDumpFile = '../training/prepared_dataset/spam_training_data.pickle'
# out
trainedModelFile = '../training/trained_model_v1/spam_model.h5'
trainedMetadataFile = '../training/trained_model_v1/spam_metadata.json'
trainedTokenizerFile = '../training/trained_model_v1/spam_tokenizer.json'

### Load the training data from previous steps

In [None]:
# load the training data and extract its parts
print('    Loading training data ... ', end ='')
with open(trainingDumpFile, 'rb') as f:
    data = pickle.load(f)
X_test = data['X_test']
X_train = data['X_train']
y_test = data['y_test']
y_train = data['y_train']
labelLegendInverted = data['label_legend_inverted']
labelLegend = data['label_legend']
maxSeqLength = data['max_seq_length']
maxNumWords = data['max_words']
tokenizer = data['tokenizer']
print('done')

### Define the Model

In [None]:
# Model preparation
embedDim = 128
LstmOut = 196
#
model = Sequential()
model.add(Embedding(maxNumWords, embedDim, input_length=X_train.shape[1]))
model.add(SpatialDropout1D(0.4))
model.add(LSTM(LstmOut, dropout=0.3, recurrent_dropout=0.3))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# summary() prints the table itself (wrapping it in print() would also print "None")
model.summary()
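
To get a feeling for where the "many thousands of parameters" come from, the count can be worked out by hand. The sketch below assumes the sizes used in this notebook (a vocabulary cap of 280, `embedDim = 128`, `LstmOut = 196`, two output classes) and the default Keras layer settings with bias terms enabled; the helper functions are hypothetical names introduced for this example. The result should agree with the figures reported by `model.summary()`.

```python
def embedding_params(vocab_size, embed_dim):
    # one embed_dim-long vector per vocabulary entry
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # four weight blocks (input, forget and output gates plus the cell candidate),
    # each with input weights, recurrent weights and a bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # weight matrix plus one bias per output unit
    return input_dim * units + units

toy_counts = {
    'embedding': embedding_params(280, 128),
    'lstm': lstm_params(128, 196),
    'dense': dense_params(196, 2),
}
toy_total = sum(toy_counts.values())
print(toy_counts)
print('total:', toy_total)
```

The dropout layers contribute no trainable parameters, so they do not appear in the count.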

### Train the Model

#### Plotting Convergence

During the training phase, we aim to minimize the error rate while making sure that the model generalizes well to new data. Plotting loss (or error) vs. epoch and accuracy vs. epoch lets us see whether the model is underfitting or overfitting.

- Overfitting (high variance): the model fits the training examples almost perfectly but has limited ability to generalize.
- Underfitting (high bias): the model did not learn the data well enough.

The training process should be stopped when the validation error trend changes from descending to ascending.
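
The stopping rule can be expressed as a small function over the per-epoch validation losses. This is a pure-Python sketch of the idea with a hypothetical helper `early_stop_epoch` and made-up loss values; in practice Keras ships an `EarlyStopping` callback that serves this purpose during `fit`. The `patience` argument says how many consecutive non-improving epochs are tolerated before stopping.

```python
def early_stop_epoch(val_losses, patience=1):
    """Return (epoch at which training stops, epoch with the best validation loss).

    Epochs are numbered from 1. Training stops after `patience` consecutive
    epochs without a new best validation loss, or at the last epoch otherwise.
    """
    best = float('inf')
    best_epoch = 0
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Made-up validation losses: descending for three epochs, then ascending
toy_val_losses = [0.60, 0.45, 0.40, 0.42, 0.47]
print(early_stop_epoch(toy_val_losses, patience=1))
print(early_stop_epoch(toy_val_losses, patience=2))
```

A larger `patience` guards against stopping on a single noisy epoch, at the cost of a few extra epochs of training.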

In [None]:
import numpy as np 
from tensorflow import keras
from matplotlib import pyplot as plt
from IPython.display import clear_output

class PlotLearning(keras.callbacks.Callback):
    """
    Callback to plot the learning curves of the model during training.
    """
    def on_train_begin(self, logs=None):
        # Start every training run with an empty history
        self.metrics = {}

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        # Storing metrics
        for metric, value in logs.items():
            self.metrics.setdefault(metric, []).append(value)
        
        # Plotting: one panel per training metric, validation curves are overlaid
        metrics = [x for x in logs if not x.startswith('val_')]
        if not metrics:
            return
        
        f, axs = plt.subplots(1, len(metrics), figsize=(15,5))
        # plt.subplots returns a bare Axes object when there is a single panel
        axs = np.atleast_1d(axs)
        clear_output(wait=True)

        for i, metric in enumerate(metrics):
            axs[i].plot(range(1, epoch + 2), 
                        self.metrics[metric], 
                        label=metric)
            # Overlay the validation curve only if Keras reported one
            # (testing the value itself would fail on a missing key or on 0.0)
            if 'val_' + metric in logs:
                axs[i].plot(range(1, epoch + 2), 
                            self.metrics['val_' + metric], 
                            label='val_' + metric)
                
            axs[i].legend()
            axs[i].grid()

        plt.tight_layout()
        plt.show()

In [None]:
#callbacks_list = [PlotLearning()]

print('    Training (it will take some minutes) ... ', end ='')
batchSize = 32
epochs = 3
model.fit(X_train, y_train,
            validation_data=(X_test, y_test),
            batch_size=batchSize, verbose=1,
            epochs=epochs,
            shuffle=True,
#            callbacks=callbacks_list
            )
print('done')

### Export the Model, Metadata, and Tokenizer

#### Save the model proper (the model has its own format and its I/O methods)

In [None]:
print('    Saving model ... ', end ='')
model.save(trainedModelFile)
print('done')

#### Save the metadata needed to 'run' the model as JSON

In [None]:
print('    Saving metadata ... ', end ='')
metadataForExport = {
    'label_legend_inverted': labelLegendInverted,
    'label_legend': labelLegend,
    'max_seq_length': maxSeqLength,
    'max_words': maxNumWords,
}
with open(trainedMetadataFile, 'w') as f:
    json.dump(metadataForExport, f)
print('done')
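
One detail worth knowing about this JSON file: object keys in JSON are always strings. That is why the inverted legend was built with string keys in the first place, and why lookups into it must use `str(i)`. The sketch below demonstrates the effect on a toy legend; the variable names are hypothetical.

```python
import json

toy_legend = {'ham': 0, 'spam': 1}

# An inverted legend with integer keys does not survive a JSON round trip unchanged
int_keyed = {v: k for k, v in toy_legend.items()}
restored = json.loads(json.dumps(int_keyed))
print(restored)

# Building it with string keys from the start avoids the surprise
str_keyed = {'%i' % v: k for k, v in toy_legend.items()}
print(json.loads(json.dumps(str_keyed)) == str_keyed)
```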

#### Save the tokenizer

In [None]:
print('    Saving tokenizer ... ', end ='')
tokenizerJson = tokenizer.to_json()
with open(trainedTokenizerFile, 'w') as f:
    f.write(tokenizerJson)
print('done')
#
print('FINISHED')

Take a look in the output directory (`ls training/trained_model_v1`) to find:

- A (small) JSON file with some metadata describing some features of the model;
- A (larger) JSON file containing the full definition of the tokenizer. This has been created, and will be loaded, using helper functions provided with the tokenizer itself for our convenience;
- A (rather large) binary file containing "the model". That means, among other things, the shape and topology of the neural network and all "weights", i.e. the parameters dictating which neurons will affect which others, and by how much. Saving and loading this file, which is in the HDF5 format, is best left to routines kindly offered by Keras.

### Test the trained model

Before moving on to the API section, make sure that the saved trained model is self-contained: that is, check that by loading the contents of `training/trained_model_v1`, and nothing else, you are able to perform meaningful estimates of the spam/ham status for a new arbitrary piece of text.

Note that the output is given in terms of "probabilities", or "confidence". One can interpret a result like {'ham': 0.92, 'spam': 0.08} as meaning the input is ham with 92% confidence. Indeed, generally speaking, ML-based classifiers are very sophisticated and specialized machines for statistical inference.
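
The two numbers add up to one because the last layer of the network applies a softmax, which turns raw scores into a probability distribution. The sketch below shows that transformation and how a label is then picked; the raw scores are made up, and `softmax` here is a plain-Python illustration rather than the Keras routine.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into positive numbers that sum to one."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # shifting by the max avoids overflow
    total = sum(exps)
    return [e / total for e in exps]

toy_legend_inverted = {'0': 'ham', '1': 'spam'}
toy_probs = softmax([2.0, -0.5])  # hypothetical raw scores for one message
toy_prediction = {toy_legend_inverted[str(i)]: p for i, p in enumerate(toy_probs)}
toy_best = max(toy_prediction, key=toy_prediction.get)
print(toy_prediction)
print('most likely label:', toy_best)
```

Picking the label with the highest probability is the simplest decision rule; an application could just as well require, say, a minimum confidence before flagging a message as spam.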

If you look at the (very simple) code of the function below, you will see how the model, once loaded, is used to make predictions. It all boils down to the model's `predict` method, but first the input text must be recast as a sequence of numbers with the aid of the tokenizer, and likewise the result must be made readable by humans again.

In [None]:
# Load tokenizer and metadata:
#   (in metadata, we'll need keys 'label_legend_inverted' and 'max_seq_length')
with open(trainedTokenizerFile) as f:
    tokenizer = tokenizer_from_json(f.read())
with open(trainedMetadataFile) as f:
    metadata = json.load(f)
# Load the model:
model = models.load_model(trainedModelFile)

# a function for testing:
def predictSpamStatus(text, spamModel, pMaxSequence, pLabelLegendInverted, pTokenizer):
    sequences = pTokenizer.texts_to_sequences([text])
    xInput = pad_sequences(sequences, maxlen=pMaxSequence)
    yOutput = spamModel.predict(xInput)
    preds = yOutput[0]
    labeledPredictions = {pLabelLegendInverted[str(i)]: x for i, x in enumerate(preds)}
    return labeledPredictions

In [None]:
st = 'This is a nice touch, adding a sense of belonging and coziness. Thank you so much.'
preds = predictSpamStatus(st, model, metadata['max_seq_length'], metadata['label_legend_inverted'], tokenizer)
print('TEXT       = %s' % st)
print('PREDICTION = %s' % str(preds))

In [None]:
st = 'Click here to WIN A FREE IPHONE and this and that.'
preds = predictSpamStatus(st, model, metadata['max_seq_length'], metadata['label_legend_inverted'], tokenizer)
print('TEXT       = %s' % st)
print('PREDICTION = %s' % str(preds))