# Train the Model
The goal of this phase is to have your text classifier model ready to use: you will not only train it on a labeled dataset, but also export it in a format that the API can load later.

## Inspect the starting dataset
Open the file `training/dataset/spam-dataset.csv` and have a look at the lines there.
> Tip: you can open a file in Gitpod by locating it with the "File Explorer" on your left, but if you like using the keyboard you may simply issue the command `gp open training/dataset/spam-dataset.csv` from the `bash` Console at the bottom.

This is a CSV file with three columns (separated by commas):

- whether the line is spam or "ham" (i.e. the opposite of spam),
- a short piece of text (a "message"),
- the tag identifying the source of this datapoint (this will be ignored by the scripts).
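
To get a feel for this layout without opening the real file, here is a minimal sketch that parses a few made-up rows with Python's standard `csv` module. The `label` and `text` column names match those used by the scripts later on, while the `source` header name and all row contents are assumptions made only for illustration.

```python
import csv
import io
from collections import Counter

# Made-up rows mimicking the three-column layout described above.
# The "source" header name and the tag values are assumptions, not taken from the real file.
sample_csv = io.StringIO(
    "label,text,source\n"
    "ham,See you at lunch?,sms\n"
    "spam,WIN a FREE prize now!!!,sms\n"
    "spam,Check out my channel,youtube\n"
)

rows = list(csv.DictReader(sample_csv))

# Count how many messages carry each label.
label_counts = Counter(row["label"] for row in rows)
print(label_counts)
```

Counting the labels like this is a quick way to check how balanced a dataset is before training on it.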


The third column betrays the mixed origin of the data. To create our labeled dataset of 7,500 messages, two sets made available by the UCI Machine Learning Repository have been merged:
- [SMS Spam Collection Data Set](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection)
- [YouTube Spam Collection Data Set](https://archive.ics.uci.edu/ml/datasets/YouTube+Spam+Collection)

Luckily, the (not always fun) task of cleaning, validating and normalizing the heterogeneous (and usually imperfect) data has already been done for you -- something that is seldom the case, alas, in a real-world task.

Look at line 352 of this file for example. Is that message spam or ham?
> Tip: hit Ctrl-G in the Gitpod editor to jump to a specific line number.

<details>
<summary>Show me that line in Gitpod's editor</summary>
<img src="../images/gitpod_gotoline.png" />
</details>

## Prepare the dataset for training

You want to "teach" a machine to distinguish between spam and ham. Unfortunately, machines prefer to speak numbers rather than words. You therefore need to transform the human-readable CSV file above into a format that, albeit less readable by us puny humans, is better suited to the subsequent task of training the classifier. You will express (a cleaned-up version of) the text as a sequence of numbers, each representing a token (one word) of the message text.

More precisely:

1. first you'll initialize a "tokenizer", asking it to build a dictionary (i.e. a token/number mapping) best suited for the texts at hand;
2. then, you'll use the tokenizer to reduce all messages into (variable-length) sequences of numbers;
3. these sequences will be "padded", i.e. you'll make sure they end up all having the same length: in this way, the whole dataset will be represented by a rectangular matrix of integer numbers, each row possibly having leading zeroes;
4. the "spam/ham" column of the input dataset is recast with the "*one-hot encoding*": that is, it will become two columns, one for "spamminess" and one for "hamminess", both admitting the values zero or one (but with a single "one" per row): this turns out to be a formulation much friendlier to categorical classification tasks in general;
5. finally you'll split the labeled dataset into two disjoint parts, a "training" set and a "testing" set. This is a very important concept: the effectiveness of a model should always be validated on data points *not used during training*.

All these steps can be largely automated by using data-science Python packages such as `pandas`, `numpy`, `tensorflow/keras`.

### Overview
The dataset preparation starts with the CSV file you saw earlier and ends by exporting the new data format to the `training/prepared_dataset` directory. Two observations are in order:

- the "big matrix of numbers" encoding the messages and the (narrower) one containing their spam/ham status are useless without the tokenizer: after all, to process a new message you would need to make it into a sequence of numbers using this very same mapping. For this reason, it is important to export the tokenizer as well, in order to later use the classifier.
- the `pickle` protocol used in writing the reformulated data is strictly Python-specific and should not be treated as a long-term (or interoperable!) format. Later we discuss a sensible way to store model, tokenizer and metadata on disk.
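
As a small illustration of the second point, the sketch below round-trips a toy payload through `pickle` and then tries the same with JSON. The payload is made up for this example; it simply mixes plain values with Python-specific types (a set and a tuple).

```python
import json
import pickle

# Toy payload (hypothetical): plain values mixed with Python-specific types.
payload = {
    "label_legend": {"ham": 0, "spam": 1},
    "vocabulary": {"free", "prize"},  # a set
    "shape": (2, 3),                  # a tuple
}

# pickle restores the object exactly, but only Python can read these bytes.
restored = pickle.loads(pickle.dumps(payload))

# JSON is interoperable, but it only understands plain types.
try:
    json.dumps(payload)
    json_ok = True
except TypeError:
    json_ok = False  # a set cannot be encoded as JSON

# The plain part of the payload encodes without trouble.
legend_json = json.dumps(payload["label_legend"])
```

The trade-off is convenience versus portability: `pickle` accepts almost anything, whereas an interoperable format forces you to decide explicitly how each piece is represented.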

### Preamble

In [None]:
import sys
import pickle
import json
import pandas as pd
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split

In [None]:
# set the input file
datasetInputFile = '../training/dataset/spam-dataset.csv'
# set the output file
trainingDumpFile = '../training/prepared_dataset/spam_training_data.pickle'

### Reading and transforming the input

#### Reading the input file and preparing legend info

In [None]:
# Load Datasets into a Pandas DataFrame
df = pd.read_csv(datasetInputFile)

# Convert Dataset to Lists
labels = df['label'].tolist()
texts = df['text'].tolist()

# Now we need to map our labels from being text values to being integer values. It's pretty simple:
labelLegend = {'ham': 0, 'spam': 1}

# The inverted legend is there to help us when we need to add a label to our predictions later.
labelLegendInverted = {'%i' % v: k for k,v in labelLegend.items()}
labelsAsInt = [labelLegend[x] for x in labels]

**Look at:** the contents of `texts`,
`labelLegend`,
`labelLegendInverted`,
`labels`,
`labelsAsInt`

In [None]:
## Uncomment any one of the following and press Shift+Enter to print the variable
# texts
# labelLegend
# labelLegendInverted
# labels
# labelsAsInt
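
If you'd rather see the legend logic in isolation, here is a minimal sketch that applies the same two dictionaries to a handful of made-up labels (the `labels` list and the `predicted` value below are invented for the example).

```python
labelLegend = {'ham': 0, 'spam': 1}
labelLegendInverted = {'%i' % v: k for k, v in labelLegend.items()}

# Made-up labels, standing in for the first column of the dataset.
labels = ['ham', 'spam', 'spam', 'ham']
labelsAsInt = [labelLegend[x] for x in labels]

print(labelsAsInt)          # [0, 1, 1, 0]
print(labelLegendInverted)  # {'0': 'ham', '1': 'spam'}

# Going back from a (hypothetical) predicted class index to its name.
# Note that the inverted legend has *string* keys.
predicted = 1
predictedLabel = labelLegendInverted[str(predicted)]
```

The string keys come from the `'%i' % v` formatting in the dictionary comprehension, so an integer prediction must be converted with `str()` before the lookup.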

#### Tokenization of texts
The Keras `Tokenizer` will convert our raw text into numeric vectors. Converting texts to vectors is a required step for any machine learning model (not just Keras).

In [None]:
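# MAX_NUM_WORDS caps the tokenizer vocabulary: only the most frequent words are kept
# when texts are converted to sequences, and rarer words are dropped.
# The value 280 is borrowed from the max length of a post (tweet) on Twitter; note that it
# limits the number of distinct words, not the length of a message.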
MAX_NUM_WORDS = 280
tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

**Look at:** `tokenizer.word_index`, `inverseWordIndex`, `sequences` and how they play together:

In [None]:
# This is only needed for demonstration purposes, will not be dumped with the rest:
inverseWordIndex = {v: k for k, v in tokenizer.word_index.items()}

## Uncomment any one of the following and press Shift+Enter to print the variable
# tokenizer.word_index
# inverseWordIndex
# sequences
# [[inverseWordIndex[i] for i in seq] for seq in sequences]
# texts
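
To make the idea concrete without depending on Keras, here is a simplified, pure-Python stand-in for what a frequency-based tokenizer does. It is a sketch under assumptions, not the Keras implementation: the helper names `fit_word_index` and `to_sequences` are hypothetical, the word-splitting rule is a crude regular expression, and the texts are made up.

```python
import re
from collections import Counter

def split_words(text):
    # Crude normalization (an assumption): lowercase, keep letters, digits and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())

def fit_word_index(texts):
    """Build a word -> integer mapping; 1 is the most frequent word."""
    counts = Counter()
    for text in texts:
        counts.update(split_words(text))
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index, num_words=None):
    """Replace each known word by its index, dropping indices >= num_words."""
    sequences = []
    for text in texts:
        ids = [word_index[w] for w in split_words(text) if w in word_index]
        if num_words is not None:
            ids = [i for i in ids if i < num_words]
        sequences.append(ids)
    return sequences

toy_texts = ["free prize now", "call me now", "free free call"]
toy_index = fit_word_index(toy_texts)
all_words = to_sequences(toy_texts, toy_index)
capped = to_sequences(toy_texts, toy_index, num_words=4)

print(toy_index)  # {'free': 1, 'now': 2, 'call': 3, 'prize': 4, 'me': 5}
print(all_words)  # [[1, 4, 2], [3, 5, 2], [1, 1, 3]]
print(capped)     # [[1, 2], [3, 2], [1, 1, 3]]
```

Index 0 is deliberately left unused, which keeps it available as the padding value in the next step. The capped output shows the effect of limiting the vocabulary: rare words simply disappear from the sequences.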

#### Create `X`, `y` training sets

In machine learning, it's common to denote the training inputs as `X` and their corresponding labels (the outputs) as `y`. 

Let's start with the `X` data (aka the text) by padding all of our tokenized sequences. This ensures all training inputs are the same shape (aka size). 

The sentences and paragraphs in a conversation rarely have the same length: a few of them may coincide, but certainly not all. Yet we want to categorize every sentence (or paragraph) as either `spam` or `ham` -- that is, to map data of arbitrary length onto data of a known, fixed length.

This means we have two challenges:
- Matrix multiplication has strict rules
- Spoken or written language rarely adheres to strict rules.

What to do?

We use `X` as the new representation of the `text` from our raw dataset. As stated above, it is very unlikely that all data in this group has exactly the same length, so we'll use the built-in tool called `pad_sequences` to correct for the inconsistent lengths. This length is actually called shape because of its roots in linear algebra (matrix multiplication).

In [None]:
MAX_SEQ_LENGTH = 300
X = pad_sequences(sequences, maxlen=MAX_SEQ_LENGTH)

**Look at:** `sequences`, `X` and compare their shape and contents:

In [None]:
## Uncomment any one of the following and press Shift+Enter to print the variable
# [len(s) for s in sequences]
# len(sequences)
# X.shape
# type(X)
# X
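
The padding step can be sketched in plain Python as follows. This is a simplified stand-in, assuming the behavior of `pad_sequences` with its default settings (zeros are added at the front, and sequences that are too long lose their first elements); the helper name `pad_left` is hypothetical.

```python
def pad_left(sequences, maxlen, value=0):
    """Return a rectangular list of lists, each row exactly maxlen items long."""
    padded = []
    for seq in sequences:
        kept = seq[-maxlen:]  # too long: keep only the last maxlen items
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded

toy_sequences = [[1, 4, 2], [3], [1, 2, 3, 4, 5, 6]]
toy_X = pad_left(toy_sequences, maxlen=4)

print(toy_X)  # [[0, 1, 4, 2], [0, 0, 0, 3], [3, 4, 5, 6]]
```

Every row now has the same length, which is what allows the whole dataset to be stored as a single rectangular matrix.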

#### Switch to categorical form for labels
We convert our `labelsAsIntArray` into the corresponding one-hot matrix (instead of just a list of ints) by using the built-in `to_categorical` function. The number of distinct labels does not have to be 2 (as in our case), but it should be at least 2.

In [None]:
labelsAsIntArray = np.asarray(labelsAsInt)
y = to_categorical(labelsAsIntArray)

**Look at:** `labelsAsIntArray`, `y` and how they relate to `labels` and `labelLegend`:

In [None]:
## Uncomment any one of the following and press Shift+Enter to print the variable
# labelsAsIntArray
# labelsAsIntArray.shape
# y.shape
# y
# labels
# labelLegend
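
For reference, one-hot encoding itself is simple enough to write by hand. The sketch below is a pure-Python illustration of the idea, not the Keras function; the helper name `one_hot` is hypothetical and the labels are made up.

```python
def one_hot(int_labels, num_classes=None):
    """Turn integer class labels into rows holding a single 1.0 each."""
    if num_classes is None:
        num_classes = max(int_labels) + 1  # infer the class count from the data
    return [
        [1.0 if i == label else 0.0 for i in range(num_classes)]
        for label in int_labels
    ]

toy_y = one_hot([0, 1, 1, 0])

print(toy_y)  # [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
```

With `labelLegend = {'ham': 0, 'spam': 1}`, the first column plays the role of "hamminess" and the second that of "spamminess".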

#### Splitting the labeled dataset (train/test)

If we trained on all of our data, our model might fit that training data very *well*, yet we would have no way of checking whether it also performs well on new data; a model that merely memorizes its training set is mostly useless.

Since we have the `X` and `y` designations, we split each of them into at least two corresponding sets, training data and test data, resulting in `X_train`, `X_test`, `y_train`, `y_test`.

An easy way (but not the only way) is to use `scikit-learn` for this:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

**Look at:** the shape of the four resulting numpy 2D arrays:

In [None]:
## Uncomment any one of the following and press Shift+Enter to print the variable
# X_train.shape
# X_test.shape
# y_train.shape
# y_test.shape
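
Conceptually, the split amounts to shuffling the row indices once and cutting the shuffled list in two. Here is a minimal pure-Python sketch of that idea; it is not how `scikit-learn` is implemented, the helper name `split_indices` is hypothetical, and with the same seed it will not reproduce the exact rows picked by `train_test_split`.

```python
import math
import random

def split_indices(n_samples, test_size=0.33, seed=42):
    """Shuffle the row indices reproducibly and cut them into two disjoint parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed: same split on every run
    n_test = math.ceil(n_samples * test_size)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(100)

print(len(train_idx), len(test_idx))  # 67 33
```

Because the same indices are applied to both `X` and `y`, each message stays paired with its own label after the split.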

#### Save everything to file

As we'll see soon, the test sets (aka `X_test` and `y_test`) are used to evaluate how well our AI model is learning (aka its performance). This means it's often a good idea to save the test sets for future training runs instead of splitting the data all over again. Using the same test set over and over will show how our model's performance changes over time.

In [None]:
trainingData = {
    'X_train': X_train, 
    'X_test': X_test,
    'y_train': y_train,
    'y_test': y_test,
    'max_words': MAX_NUM_WORDS,
    'max_seq_length': MAX_SEQ_LENGTH,
    'label_legend': labelLegend,
    'label_legend_inverted': labelLegendInverted, 
    'tokenizer': tokenizer,
}
with open(trainingDumpFile, 'wb') as f:
    pickle.dump(trainingData, f)
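
Reading the dump back is the mirror operation. The sketch below demonstrates the round trip on a small stand-in dictionary written to a temporary directory, so that it runs on its own; the toy values and the file name are made up, and only the key names follow the `trainingData` dictionary above.

```python
import os
import pickle
import tempfile

# Small stand-in for trainingData (hypothetical values, same style of keys).
toy_training_data = {
    'X_test': [[0, 0, 3, 2]],
    'y_test': [[0.0, 1.0]],
    'max_seq_length': 4,
    'label_legend_inverted': {'0': 'ham', '1': 'spam'},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    dump_path = os.path.join(tmp_dir, 'toy_training_data.pickle')

    # Write in binary mode, exactly as done for the real dump...
    with open(dump_path, 'wb') as f:
        pickle.dump(toy_training_data, f)

    # ...and read it back, also in binary mode.
    with open(dump_path, 'rb') as f:
        loaded = pickle.load(f)

print(loaded['label_legend_inverted'])
```

Keep in mind that unpickling the real file also restores the `tokenizer` object, so the environment doing the loading needs the same Keras classes to be importable.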