# A general procedure for supervised learning in Keras

For most supervised learning tasks, the procedure we follow consists of the following steps:

### Step 1: Data exploration
The first step is normally to load the data and try to understand its properties. A few things that are usually useful:
1. Check data formats.
2. Visual inspection of data.
3. Investigate (get some type of understanding for) how hard the problem is. 


### Step 2: Data preprocessing
1. Normalise (or scale) input data. 
2. Convert the data to a different type, or organize it differently for the optimization (e.g. Numpy arrays, subsets of the dataset, etc.)
3. Encode input and output data in a suitable form. For instance, we often use one-hot encoding to represent categorical (e.g. string) variables.
4. Split data into training, validation and test sets.


### Step 3: Training
1. Build a tentative network architecture (could be the simplest one you think could work, or based on previous successes).
2. Select optimizer, performance measures and a few more hyperparameters. 
3. Train the network. 
4. Analyze performance on the training and validation sets. Adjust design decisions accordingly.


### Step 4: Assessment
1. Use the network for predictions in the test set.
2. Evaluate the final quality of the model. **Attention**: Once this is done, you shouldn't alter your model anymore, otherwise you need a new test set (if you want a good estimate of your model's generalization capacity).

We are going to apply most of these steps to the task of correctly classifying an Iris plant, given its morphologic features present in the IRIS dataset.

# 1. Data exploration

### 1.1  Import the necessary modules

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import seaborn as sns

%matplotlib inline

### 1.2 Read the dataset

In [None]:
dataset = pd.read_csv("iris.csv")

In [None]:
dataset.head()

### 1.3 Analyzing the data

For this task, we'll use all of the data, not focus on only one of the species or a subset of the features. The `plot` method can help us obtain different types of visualizations of the data in the `DataFrame`. For instance, we can use it to plot histograms of each feature.

In [None]:
dataset.plot(kind='hist', bins=30, alpha=0.7, figsize=[15,6]);

This is somewhat informative, but we could get an even better grasp of the data by first separating it into the different species (it seems likely that different species will have different feature distributions), and then plotting the histograms.

However, if we let the `plot` method automatically create the histogram bins where it wants, each histogram might have different ranges, which would make it harder to compare them. Instead, we create the bins ourselves and pass that as an argument.

In [None]:
# Remove the 'species' column, so we get only the numeric values of the dataset
features_dataset = dataset.drop('species', axis=1)

# Find maximum and minimum values
maxval = np.max(features_dataset.values)
minval = np.min(features_dataset.values)

# Create 30 linearly spaced numbers in this range
my_bins = np.linspace(minval, maxval, 30)
print(my_bins)

In [None]:
# Get the names of the species
species_names = dataset['species'].unique()
print(species_names)

In [None]:
# For each species name, plot a histogram
for name in species_names:
    dataset[dataset["species"]==name].plot(kind="hist", bins=my_bins, alpha=0.7, figsize=[15,4], title=name);

This confirms that different species do have substantial differences in the distributions of each feature, e.g. the Setosa species has shorter sepals than the others, etc. 

Another way to gain more insight into the data is the `pairplot` function from the seaborn Python module. This shows scatter plots between all feature pairs (hence the number of plots, and the time required to run it, grows quadratically with the number of features!) and histograms for each feature, color-coded by the species.

In [None]:
sns.pairplot(dataset, hue='species');

It's also helpful to check if the dataset is balanced. We can do so like this:

In [None]:
# Fill in a dictionary with the number of occurrences of each species
n = {}
for name in species_names:
    extract_rule = dataset['species']==name
    n[name] = len(dataset[extract_rule])
    
print(n)

This shows that each species occurs exactly 50 times in the dataset, so it's perfectly balanced.
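
The same check can be written more compactly with pandas' `value_counts`. The snippet below is only a sketch on a small made-up `DataFrame` (the names `toy`, `counts` and `is_balanced` are hypothetical and not part of the notebook); with the real data one would call it on `dataset['species']`.

```python
import pandas as pd

# Toy stand-in for the 'species' column (made-up data, not the real file)
toy = pd.DataFrame(
    {"species": ["setosa", "virginica", "setosa", "versicolor", "virginica", "setosa"]}
)

# Count how many rows belong to each class
counts = toy["species"].value_counts()
print(counts)

# The data is balanced if every class has the same number of rows
is_balanced = counts.nunique() == 1
print(is_balanced)
```

Here the toy data is deliberately imbalanced, so `is_balanced` comes out as `False`.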

# 2. Data preprocessing

Now we need to prepare the data for the training. The first thing we should do is define the input and the output arrays for our network. 

Defining the input is as simple as extracting only the numeric columns of the dataset (this can also be conveniently done using the `drop` method, as done before).

In [None]:
# Extract numerical values
x = dataset[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']].values

# Print first 10 rows
print(x[:10])

In [None]:
type(x)

Note that we used the `values` attribute to create `x` as a numpy array. This is important, since Keras expects numpy arrays when training our model. Trying to use `DataFrame` objects with Keras unfortunately results in very non-informative error messages, so make sure to have this in mind.

Creating the output vector requires one more step, because of the way we'll train our network. Since the optimizer needs to be able to compare the predictions made by the neural network (i.e. a numeric vector), with the desired output vector in order to decide how to alter the weights, it's a good idea to encode the output vector in a numeric format. One way of doing this is using [one-hot encoding](https://www.quora.com/What-is-one-hot-encoding-and-when-is-it-used-in-data-science).

One of the easiest ways of one-hot encoding a string column in pandas is to pass that column to the [`get_dummies`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) function.

In [None]:
dataset['species'][:5]

In [None]:
# Extract the species column and one-hot encode it
y = pd.get_dummies(dataset['species']).values

# Print first 5 rows
print(y[:5])
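
To see what the encoding does on a small scale, here is a sketch on a made-up series of labels (the names `labels` and `one_hot` are hypothetical). It passes `dtype=float` so that the result is numeric; this is an assumption about your pandas version, since some versions return boolean columns by default.

```python
import pandas as pd

# Toy labels (made up), standing in for the 'species' column
labels = pd.Series(["setosa", "virginica", "versicolor", "setosa"])

# One column per class; dtype=float forces numeric 0/1 values
# (assumption: some pandas versions would otherwise return booleans)
one_hot = pd.get_dummies(labels, dtype=float)

# The columns are the class names, in sorted order
print(list(one_hot.columns))

# Each row has a single 1 in the column of its class
print(one_hot.values)
```

The column order matters later on: the index of the 1 in each row is what identifies the class.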

Finally, in order to assess how well our classifier generalizes to new, unseen data, we would like to withhold part of the dataset from the training process. This withheld part is usually called the test set. 

Scikit-learn provides an easy way to do so, with the `train_test_split` function. 

In [None]:
from sklearn.model_selection import train_test_split

This function randomly chooses which examples will be withheld, and here we want the test set to consist of approximately 30% of the samples.

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=10)

In [None]:
x_train.shape

In [None]:
x_train[:5]

In [None]:
y_train.shape

In [None]:
y_train[:5]

In [None]:
x_test.shape

In [None]:
y_test.shape

Now we can use `x_train` and `y_train` to train the network, and `x_test` and `y_test` to evaluate it.

# 3. Training

First we import the required classes from Keras.

In [None]:
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

Now we create a `Sequential` model, and add one layer of neurons to it.

In [None]:
model = Sequential()
model.add(Dense(3, input_dim=4, activation='softmax'))

This layer has 3 neurons, and each neuron has 4 inputs to it. The output from these 3 neurons will correspond to our prediction vector, so we use a softmax activation to make it possible to interpret this as a probability mass function. 

This way we can conveniently compare our prediction vector with the correct output vector for each example using the categorical cross-entropy loss.
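
To make these two ideas concrete, the following is a minimal numpy sketch of a softmax and of the categorical cross-entropy for a single example. It is not how Keras implements them internally; the functions `softmax` and `categorical_cross_entropy` and the numbers in `z` are made up for illustration.

```python
import numpy as np

def softmax(z):
    # Subtracting the maximum improves numerical stability without changing the result
    e = np.exp(z - np.max(z))
    return e / e.sum()

def categorical_cross_entropy(y_true, y_prob):
    # y_true is one-hot, so only the log-probability of the correct class contributes
    return -np.sum(y_true * np.log(y_prob))

z = np.array([2.0, 1.0, 0.1])   # made-up pre-activations of the 3 output neurons
p = softmax(z)                  # non-negative values that sum to 1
y = np.array([1.0, 0.0, 0.0])   # one-hot vector of the correct class

loss = categorical_cross_entropy(y, p)
print(p, p.sum())
print(loss)
```

The loss is small when the network assigns a high probability to the correct class, and grows as that probability approaches 0.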

**Task**: to make sure you clearly understand what we're doing here, draw the network on a piece of paper.

Just as we did in the linear regression part of this computer session, we now compile the model (i.e. configure its learning process).

Here, instead of SGD, we use the Adam optimizer with a learning rate of $0.1$. As said before, we use the categorical cross-entropy loss for the optimization routine.

In [None]:
model.compile(Adam(lr=0.1), 
              loss='categorical_crossentropy', 
              metrics=['accuracy'])

Note the new argument we're using here, `metrics`. This is because we want to track the accuracy of our model during training. When we have a classification task with a balanced dataset, the accuracy metric can help us rank different models and estimate their predictive power.

The accuracy is defined as:

$ Acc = \frac{\# \text{Samples correctly classified}}{\# \text{Samples}} $

From the definition, we see that accuracy is always a value between 0 and 1. An accuracy of 0 means that every single prediction made by our model is wrong, and an accuracy of 1 means that every prediction is correct. For real-life tasks, we usually end up somewhere in between, and we aim to make the accuracy as high as possible.
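
As a small illustration of the definition, accuracy can be computed by hand from two arrays of class indices. The labels below are made up, and the names `y_true_toy`, `y_pred_toy` and `acc_toy` are hypothetical.

```python
import numpy as np

# Made-up class indices for 5 samples
y_true_toy = np.array([0, 1, 2, 2, 1])
y_pred_toy = np.array([0, 1, 1, 2, 1])

# Fraction of samples where the prediction matches the true class
acc_toy = np.mean(y_true_toy == y_pred_toy)
print(acc_toy)
```

In this toy case 4 of the 5 predictions match, which gives an accuracy of 0.8.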

(There are also [other metrics in Keras](https://keras.io/metrics/) that we can use to evaluate our classifier.)

**Task**: what if our dataset was imbalanced? Would it still be a good idea to use accuracy?

Now we're ready to train the model. The only difference between this command and the `fit` call we used in the linear regression problem is that we now specify a `validation_split` parameter, which tells Keras to withhold part of the provided data from the training process while still computing the loss and accuracy on it (this is useful for several reasons, e.g. assessing whether we are overfitting the model, deciding when to stop the optimization, ranking hyperparameter choices, etc.). 

This withheld part is usually referred to as the validation set. [Here](https://stats.stackexchange.com/questions/19048/what-is-the-difference-between-test-set-and-validation-set) you can find more info about the difference between the training, validation and test sets, and why we usually divide the data this way.
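
A rough sketch of what such a hold-out amounts to is shown below. It assumes the validation examples are taken from the end of the arrays, which is how the Keras documentation describes `validation_split`; the data and the names `x_toy`, `n_fit`, `x_fit` and `x_val` are made up for illustration.

```python
import numpy as np

n_samples = 105                  # made-up number of training examples
x_toy = np.arange(n_samples * 4).reshape(n_samples, 4)
val_fraction = 0.4

# Number of examples used for fitting; the remaining ones are held out
n_fit = n_samples - int(round(n_samples * val_fraction))

# Assumption: the held-out part is the tail of the array
x_fit, x_val = x_toy[:n_fit], x_toy[n_fit:]
print(x_fit.shape, x_val.shape)
```

If the hold-out is indeed taken from the tail, data that is sorted by class should be shuffled before it is passed to `fit`.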

In [None]:
model.fit(x_train, y_train, epochs=20, validation_split=0.4);

For this problem, the classes aren't so hard to separate in the feature-space, so it's common to obtain a very high accuracy in the training and the validation set (almost 100%).

# 4. Assessment

Finally, we would like to be able to evaluate how well the model can predict the class of new, unseen samples. This was the reason for withholding part of our data from the training process, so that now we have fresh, unseen samples. 

The idea now is to use the trained model to predict the class of each new sample, given its features, and then compare the predicted label with the correct label for each sample.

---

To compare the labels, we can use different techniques. As we saw before, we can compute the accuracy, but this time on the test set samples. However, although this helps us to evaluate the model's performance, it provides an incomplete picture. For instance, it doesn't tell us which types of misclassifications we are making.

So that we can gather more information about the quality of our classifier, we'll also compute the confusion matrix of its predictions. The confusion matrix is a table of the classifier's predictions in which each row corresponds to a true class and each column to a predicted class.

---

To illustrate, imagine we train a classifier on samples that are either from the 'dog' class or the 'cat' class. After training, we show it 50 new samples. 30 of these new samples are cats, and 20 are dogs.

For the new cats, our classifier correctly predicts 28 of them, but in 2 samples it thinks they are from the 'dog' class. Further, the classifier correctly predicts 15 of the new dogs, and in 5 samples it thinks they are actually from the 'cat' class. 

The resulting confusion matrix for this example would be

<table>
  <tr>
    <th colspan="2" rowspan="2"></th>
      <th colspan="2"><b>Predicted label</b></th>
  </tr>
  <tr>
    <td>Cat</td>
    <td>Dog</td>
  </tr>
  <tr>
      <td rowspan="2"><b>True label</b></td>
    <td>Cat</td>
    <td>28</td>
    <td>2</td>
  </tr>
  <tr>
    <td>Dog</td>
    <td>5</td>
    <td>15</td>
  </tr>
</table>

Note that the element $C_{ij}$ ($i$-th row, $j$-th column) corresponds to the number of samples whose true class is the $i$-th class and which were predicted as the $j$-th class. This is not universal: some sources define the confusion matrix as the transpose of the one shown here. However, `sklearn` defines confusion matrices like this, so we'll adhere to this definition.
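
The table above can be reproduced with scikit-learn by writing out the 50 true and predicted labels explicitly. This is just a sketch of the cat/dog example; the lists `true_labels` and `pred_labels` are constructed by hand to match the counts in the text.

```python
from sklearn.metrics import confusion_matrix

# 30 cats followed by 20 dogs
true_labels = ["cat"] * 30 + ["dog"] * 20

# Cats: 28 predicted as cat, 2 as dog. Dogs: 5 predicted as cat, 15 as dog.
pred_labels = ["cat"] * 28 + ["dog"] * 2 + ["cat"] * 5 + ["dog"] * 15

# `labels` fixes the order of rows and columns to (cat, dog);
# rows are true labels and columns are predicted labels, as in the table
cm = confusion_matrix(true_labels, pred_labels, labels=["cat", "dog"])
print(cm)
```

The diagonal holds the correctly classified samples, so the accuracy in this example is (28 + 15) / 50 = 0.86.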

A handy way of computing the accuracy and the confusion matrix, given the predictions and the true labels, is to use the functions `accuracy_score` and `confusion_matrix` from the scikit-learn module.

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix

Now, the first step is to compute our predictions in the test set. For this, we'll use the `predict` method from our model.

In [None]:
y_pred = model.predict(x_test)

In [None]:
# Set options for pretty printing the numpy array
np.set_printoptions(precision=3, suppress=True)

print(y_pred[:5])

The correct labels are stored in `y_test`:

In [None]:
print(y_test[:5])

Note that both the prediction and the expected output for each sample are 3-dimensional vectors. The prediction is an approximation of the probability mass function of the classes for that sample, given its features, while the expected output puts all of the probability mass on the correct class. 

We can instead convert this into a "hard", single-value prediction by choosing the index of the element with the highest probability, for each sample. 

This can be easily done with the `argmax` function from numpy. The `axis` keyword specifies along which dimension we search for the maximum value: `axis=1` returns one index per row, while `axis=0` returns one index per column. If `axis` is omitted, `argmax` searches the flattened array and returns a single index.
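
The effect of `axis` is easiest to see on a tiny array. The probabilities below are made up, and the names `probs`, `per_row` and `per_col` are only for illustration.

```python
import numpy as np

# Two samples (rows), three classes (columns)
probs = np.array([[0.1, 0.7, 0.2],
                  [0.8, 0.1, 0.1]])

per_row = np.argmax(probs, axis=1)   # one index per sample: the most likely class
per_col = np.argmax(probs, axis=0)   # one index per class: the sample with the largest value

print(per_row)
print(per_col)
```

For class predictions we want one index per sample, which is why `axis=1` is the right choice here.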

In [None]:
y_test_class = np.argmax(y_test, axis=1)
y_pred_class = np.argmax(y_pred, axis=1)

In [None]:
print(y_pred_class[:5])
print(y_test_class[:5])

So we can see here which of the first 5 elements were correctly classified. 

Now, to compute the accuracy, we use the `accuracy_score` function mentioned earlier.

In [None]:
acc = accuracy_score(y_test_class, y_pred_class)
print("Accuracy: %.2f" % acc)

And lastly, we can compute the confusion matrix using the `confusion_matrix` function from scikit-learn.

In [None]:
confusion_matrix(y_test_class, y_pred_class)

**Task**: What can you conclude from this confusion matrix? Which classes are easy/hard to separate?