## MNIST model

Let's see how we can train a simple model on the MNIST dataset

### What we will do:
- we will get the data (MNIST dataset)
- we will define the target (recognize the digits)
- we will build the model
- we will train the model
- we will verify that the model can classify randomly selected samples

We will start by importing some useful tools

- **keras** is for building and training the neural network
- **numpy** is for handling numerical data
- **matplotlib**, **IPython** and **tabulate** are tools for printing and plotting (e.g. tables or images)

In [None]:
%%capture
import keras
from keras.datasets import *
from keras.models import Sequential, Model
from keras.layers import *
from keras.activations import softmax, relu
from keras.losses import categorical_crossentropy
from keras.optimizers import *
from keras.utils import to_categorical
import keras.backend as K

import numpy as np
import matplotlib.pyplot as plt
from IPython.display import Image
from tabulate import tabulate

## The data

Now we need the data

The data are small (28x28 pixels) grayscale images of handwritten digits.

This is what the data look like
![mnist_sample](https://www.researchgate.net/profile/Steven_Young11/publication/306056875/figure/fig1/AS:393921575309346@1470929630835/Example-images-from-the-MNIST-dataset.png)

Let's load the data...

In [None]:
(X_train, Y_train), (X_val, Y_val) = mnist.load_data()
labels_names = 'zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine'

All the data that we use when training DL models are actually n-dimensional ***arrays*** with **numerical** values

No matter if it was originally a video, an image, a voice record or a text, in the end **everything is transformed to arrays**

The **shape** of the array (as well as the **range** of its values) is important, since models are built to handle arrays of a specific shape

In [None]:
data_info = [(name, d.shape, d.min(), d.max()) 
             for name, d in zip(('X_train', 'Y_train', 'X_val', 'Y_val'),
                                (X_train, Y_train, X_val, Y_val))]

print(tabulate(data_info, headers=['name', 'shape', 'minimum', 'maximum']))

Originally the images' pixels have values in [0, 255]

However, such large values are not easy for the networks to handle

Thus we usually change the input values to something more "model friendly"

This is called data preprocessing

In our case the preprocessing is just a mapping of the values from [0, 255] to [0, 1]

by dividing the array's values by 255

In [None]:
X_train, X_val = X_train / 255, X_val / 255

But what do these "images" look like?

Let's see what is inside the first "image" of our training dataset

In [None]:
index = 0
img = X_train[index]
for r in np.round(img, 2):
  print(*r)

In [None]:
plt.imshow(X_train[index], cmap='gray')
print('label:', Y_train[index])
plt.show()

In the next cell we define some functions for getting random images from the dataset and plotting them

Don't pay too much attention to them for the moment

In [None]:
r, c = 3, 3

def get_random_imgs_labels(X_set, Y_set, n_imgs):
  inds = np.random.randint(0, len(X_set), n_imgs)
  images, labels = X_set[inds], Y_set[inds]
  return images, labels


def plot_images(images, labels, labels_names, preds=None):
  labels = labels.flatten()
  fig, axs = plt.subplots(r, c)
  cnt = 0
  for i in range(r):
    for j in range(c):
      axs[i, j].imshow(images[cnt], cmap='gray')
      axs[i, j].axis('off')
      title = labels_names[labels[cnt]] if preds is None else '%s/%s' % (labels_names[labels[cnt]], labels_names[preds[cnt]])
      axs[i, j].set_title(title, fontsize=12)
      cnt += 1
  plt.show()
  
    
def get_subset(xs, ys, l):
  n_xs, n_ys = [], []
  for x, y in zip(xs, ys):
    if y in l:
      n_xs.append(x)
      n_ys.append(y)
  n_xs, n_ys = np.array(n_xs), np.array(n_ys)
  return n_xs, n_ys
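
The same filtering can also be done without a Python loop. This is only a sketch: `get_subset_vectorized` is a hypothetical name that does not appear elsewhere in the notebook, and the tiny arrays are made up to stand in for images and labels.

```python
import numpy as np

def get_subset_vectorized(xs, ys, wanted_labels):
    """Keep only the samples whose label is in wanted_labels (hypothetical helper)."""
    # Take one label per sample, whether ys has shape (n,) or (n, 1)
    ys_flat = np.asarray(ys).reshape(len(ys), -1)[:, 0]
    mask = np.isin(ys_flat, wanted_labels)
    return xs[mask], ys[mask]

# Made-up data: five 2x2 "images" and their labels
xs = np.arange(5 * 2 * 2).reshape(5, 2, 2)
ys = np.array([0, 3, 1, 0, 7])

sub_x, sub_y = get_subset_vectorized(xs, ys, (0, 1))
print(sub_x.shape, sub_y)
```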

Let's plot some of the images together with their labels to see what they look like

In [None]:
images, labels = get_random_imgs_labels(X_train, Y_train, r*c)
plot_images(images, labels, labels_names)

## The model

Now we have to build the model that we will use to predict the label of a given image

In [None]:
K.clear_session()

model = Sequential([
    Flatten(input_shape=(28, 28)),
    Dense(32, activation='relu'),
    Dense(10, activation='softmax')])

model.summary()

Our model has a *Flatten* layer and 2 *Fully connected* layers
![Flatten layer](https://sds-platform-private.s3-us-east-2.amazonaws.com/uploads/73_blog_image_1.png)
![Fully connected model](https://sds-platform-private.s3-us-east-2.amazonaws.com/uploads/74_blog_image_1.png)
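
To make the three layers more concrete, here is a rough NumPy sketch of what one image goes through. It is not how Keras is implemented; the weights are random, the initialization scale (0.05) is an arbitrary assumption, and the image is a random stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0)

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

# Random weights with the same sizes as the layers of the model
W1, b1 = rng.normal(0, 0.05, (784, 32)), np.zeros(32)
W2, b2 = rng.normal(0, 0.05, (32, 10)), np.zeros(10)

image = rng.random((28, 28))        # stand-in for one normalized image
flat = image.reshape(-1)            # Flatten: (28, 28) -> (784,)
hidden = relu(flat @ W1 + b1)       # Dense(32, activation='relu')
output = softmax(hidden @ W2 + b2)  # Dense(10, activation='softmax')

n_params = W1.size + b1.size + W2.size + b2.size
print(flat.shape, hidden.shape, output.shape, n_params)
```

The parameter count (784 * 32 + 32 for the first dense layer, 32 * 10 + 10 for the second) should match the total reported by `model.summary()`.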

The model's output is a list of 10 numbers; one for each category of our dataset

By using ***softmax*** as activation of the last layer we constrain these numbers to:
- be between 0 and 1
- have their sum equal to 1

This way we can interpret them as probabilities for the categories

We use the label with the highest probability as the predicted one

For example if the output is:

```0.002, 0.013, 0.017, 0.006, 0.027, 0.109, 0.024, 0.789, 0.002, 0.011```

The predicted label will be: ***seven***
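
A small NumPy sketch of both steps, assuming made-up raw scores for the softmax part. The second part reuses the example output above and picks the label with the highest probability.

```python
import numpy as np

def softmax(scores):
    """Map arbitrary scores to positive numbers that sum to 1."""
    exps = np.exp(scores - np.max(scores))  # shift for numerical stability
    return exps / exps.sum()

scores = np.array([1.0, 2.0, 0.5, -1.0])  # made-up raw scores
probs = softmax(scores)
print(probs.round(3), probs.sum())

labels_names = ('zero', 'one', 'two', 'three', 'four',
                'five', 'six', 'seven', 'eight', 'nine')
output = np.array([0.002, 0.013, 0.017, 0.006, 0.027,
                   0.109, 0.024, 0.789, 0.002, 0.011])
predicted = labels_names[int(np.argmax(output))]
print(predicted)  # seven
```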

Before we start the training of the model we need to define
- the target
- the way to achieve it

In our case we want
- the model's outputs (the predicted probabilities)
- the true probabilities for each label

to be as close as possible to each other

Since each image has only one correct label, ideally the model should return probability 1 for the correct label and 0 for all the others
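
Such a target is often called a one-hot vector. A minimal sketch of the encoding, where `one_hot` is a hypothetical helper; the `to_categorical` function from Keras, used in the training cells of this notebook, produces this kind of encoding.

```python
import numpy as np

def one_hot(labels, n_classes):
    """Turn integer labels into rows containing a single 1 (hypothetical helper)."""
    labels = np.asarray(labels).reshape(-1)
    return np.eye(n_classes)[labels]

targets = one_hot([7, 0, 3], n_classes=10)
print(targets[0])  # the 1 sits at index 7
```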

To achieve this we use a ***loss function***, which measures how far the model's outputs are from the true labels

In our case the loss function will be the *categorical crossentropy*:

$$H(p,q) = - \sum_x p(x) \log(q(x))$$
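
Here $p$ is the true distribution (the label) and $q$ is the distribution predicted by the model. As a worked step, assume $p$ is 1 for the correct class and 0 everywhere else, and call the correct class $c$ (a symbol introduced here only for illustration). Every term of the sum except the one for $c$ then vanishes:

$$H(p,q) = - \log(q(c))$$

For instance, if the model gives the correct class a probability of 0.789, the loss is $-\log(0.789) \approx 0.237$ (natural logarithm).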

In [None]:
y_true = 0, 0, 0, 0, 0, 0, 0, 1, 0, 0
y_pred = 0.002, 0.013, 0.017, 0.006, 0.027, 0.109, 0.024, 0.789, 0.002, 0.011

loss = -sum([p*np.log(q) for p, q in zip(y_true, y_pred)])
print('loss based on formula:       ', np.round(loss, 5))

loss = keras.losses.categorical_crossentropy(K.variable(y_true), K.variable(y_pred))
print('loss based on keras function:', np.round(K.eval(loss), 5))

We can change the numbers above and see how the resulting loss changes
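
For example, the sketch below uses a made-up 3-class case and shows how the loss grows as the probability given to the correct class shrinks. It assumes strictly positive predictions, since the logarithm of 0 is not defined.

```python
import numpy as np

def cross_entropy(y_true, y_pred):
    """Categorical crossentropy for one sample (assumes y_pred > 0 everywhere)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(-np.sum(y_true * np.log(y_pred)))

y_true = [0, 0, 1]  # the correct class is the last one
candidates = ([0.05, 0.05, 0.90],   # confident and correct
              [0.25, 0.25, 0.50],   # unsure
              [0.45, 0.45, 0.10])   # confident and wrong

losses = [cross_entropy(y_true, y_pred) for y_pred in candidates]
for y_pred, loss in zip(candidates, losses):
    print(y_pred, '->', round(loss, 3))
```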

We also need to define a method by which the model will try to **minimize** the loss function.

The method (also called an **optimizer**, since it optimizes the model's parameters) that we will use is ***Adam***

We don't need to go into too much detail about this one
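
Still, the basic idea of minimizing a loss fits in a few lines. The sketch below uses plain gradient descent on a made-up loss with a single parameter. It is a simpler method than Adam and not what Keras runs here (Adam additionally adapts the step size for each parameter), and the starting value and learning rate are arbitrary choices.

```python
def loss(w):
    return (w - 3.0) ** 2      # toy loss with its minimum at w = 3

def gradient(w):
    return 2.0 * (w - 3.0)     # derivative of the toy loss

w = 0.0                        # arbitrary starting value
learning_rate = 0.1            # illustrative step size

for step in range(100):
    w -= learning_rate * gradient(w)  # move against the gradient

print(round(w, 4), round(loss(w), 8))
```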

In [None]:
model.compile(optimizer=Adam(), loss=categorical_crossentropy, metrics=['acc'])

At this point the model has only been randomly initialized, which means that its outputs will be mostly wrong

Let's see some examples

In [None]:
images, labels = get_random_imgs_labels(X_val, Y_val, r*c)
predictions = model.predict_on_batch(images)
predictions = np.argmax(predictions, -1)

print('correct: %d out of %d' % (np.sum(labels == predictions), len(images)))
plot_images(images, labels, labels_names, predictions)

## Training

Now let's train the model for some epochs and see if we can improve the results

In [None]:
history = model.fit(X_train, to_categorical(Y_train),
                    validation_data=(X_val, to_categorical(Y_val)),
                    batch_size=64, epochs=10)

In [None]:
images, labels = get_random_imgs_labels(X_val, Y_val, 9)
predictions = model.predict_on_batch(images)
predictions = np.argmax(predictions, 1)

print('correct: %d out of %d' % (np.sum(labels == predictions), len(labels)))
plot_images(images, labels, labels_names, predictions)

That was it!

We trained a model to classify images of handwritten digits

### To summarize:
- we got the data (MNIST dataset)
- we defined the target (recognize the digits)
- we built the model
- we trained the model
- we verified that the model can classify randomly selected samples

## The end

### of the simple MNIST classifier

## Data matters

It is commonly said that in DL you need **Big Data**

But how big must your data be?

In the previous case we had 60,000 training examples

and we achieved ~97% accuracy

which is quite good given the simplicity of the model and the training process

But what happens if we have **fewer** data?

In [None]:
short_sample = 50

X_train_short = X_train[:short_sample]
Y_train_short = Y_train[:short_sample]

X_train_short.shape, Y_train_short.shape

In [None]:
K.clear_session()

model = Sequential([
    Flatten(input_shape=(28, 28)),
    Dense(32, activation='relu'),
    Dense(10, activation='softmax')
    ])

model.compile(optimizer=Adam(), loss=categorical_crossentropy, metrics=['acc'])

history = model.fit(X_train_short, to_categorical(Y_train_short),
                    validation_data=(X_val, to_categorical(Y_val)),
                    batch_size=50, epochs=50)

In [None]:
images, labels = get_random_imgs_labels(X_val, Y_val, 9)
predictions = model.predict_on_batch(images)
predictions = np.argmax(predictions, 1)

print('correct: %d out of %d' % (np.sum(labels == predictions), len(labels)))
plot_images(images, labels, labels_names, predictions)

In this case we see that the accuracy of the model on the training data is very high

But on the validation data it is significantly lower

This means that the model has ***memorized*** the training data

But it cannot generalize to new/unseen images

This is called ***Overfitting***
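
One simple way to put a number on it is the gap between the training and the validation accuracy. The sketch below uses made-up accuracies for illustration only, not results from this notebook. In practice the lists would come from the per-epoch values that `model.fit` records in `history.history` (the exact key names depend on the Keras version). `generalization_gap` is a hypothetical helper.

```python
import numpy as np

def generalization_gap(train_acc, val_acc):
    """Difference between the final training and validation accuracy."""
    return float(train_acc[-1] - val_acc[-1])

# Made-up per-epoch accuracies, for illustration only
train_acc = np.array([0.40, 0.80, 0.95, 1.00])
val_acc = np.array([0.35, 0.55, 0.60, 0.61])

gap = generalization_gap(train_acc, val_acc)
print('gap: %.2f' % gap)
```

A gap close to 0 suggests that the model generalizes about as well as it fits the training data; a large gap is a typical sign of overfitting.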

There are some ways to reduce overfitting, but they are outside the scope of this example

And if the data are too few, there is not much to be done

However, if the problem were simpler, the same amount of data might be enough

Let's say for example that we want to distinguish only between 0s and 1s

In [None]:
X_train_01, Y_train_01 = get_subset(X_train, Y_train, (0, 1))
X_val_01, Y_val_01 = get_subset(X_val, Y_val, (0, 1))

Let's plot some of the images together with their labels to see what they look like

In [None]:
images, labels = get_random_imgs_labels(X_train_01, Y_train_01, r*c)
plot_images(images, labels, labels_names)

And now let's keep only a few of the training data

In [None]:
short_sample = 50

X_train_01_short = X_train_01[:short_sample]
Y_train_01_short = Y_train_01[:short_sample]

X_train_01_short.shape, Y_train_01_short.shape

Now we will train a new model on the new data

Pay attention to the number of units in the output layer

Since we have only 2 possible labels, we have 2 output units

In [None]:
K.clear_session()

model = Sequential([
    Flatten(input_shape=(28, 28)),
    Dense(32, activation='relu'),
    Dense(2, activation='softmax')
    ])

model.compile(optimizer=Adam(), loss=categorical_crossentropy, metrics=['acc'])

history = model.fit(X_train_01_short, to_categorical(Y_train_01_short),
                    validation_data=(X_val_01, to_categorical(Y_val_01)),
                    batch_size=50, epochs=50)

The results are significantly better

Let's plot some examples

In [None]:
images, labels = get_random_imgs_labels(X_val_01, Y_val_01, r*c)
predictions = model.predict_on_batch(images)
predictions = np.argmax(predictions, 1)

print('correct: %d out of %d' % (np.sum(labels == predictions), len(labels)))
plot_images(images, labels, labels_names, predictions)

This means that:
- even with the **same type** of data
- for a **different task**
- **different amount** of data might be needed
- for the **same** level of accuracy

## Model matters

What about the model?

It is also said that deeper models have better performance

This is why it is called **Deep** learning after all

In the previous example the model already performed well

But now let's try the same model on a more difficult task

We will use CIFAR-10, a dataset of small (32x32 pixels) color images from 10 different categories

### CIFAR 10 samples
![CIFAR 10 samples](https://storage.googleapis.com/kaggle-competitions/kaggle/3649/media/cifar-10.png)

In [None]:
(X_train, Y_train), (X_val, Y_val) = cifar10.load_data()

Let's see what our data look like this time

In [None]:
data_info = [(name, d.shape, d.min(), d.max()) 
             for name, d in zip(('X_train', 'Y_train', 'X_val', 'Y_val'),
                                (X_train, Y_train, X_val, Y_val))]

print(tabulate(data_info, headers=['name', 'shape', 'minimum', 'maximum']))

And let's normalize the data to the range [0, 1]

In [None]:
X_train, X_val = X_train / 255, X_val / 255
labels_names = 'airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck'

These are some samples of our new dataset along with their categories

In [None]:
images, labels = get_random_imgs_labels(X_train, Y_train, r*c)
plot_images(images, labels, labels_names)

Let's train the original model on the new dataset

In [66]:
K.clear_session()

model = Sequential([
    Flatten(input_shape=(32, 32, 3)),
    Dense(32, activation='relu'),
    Dense(10, activation='softmax')
    ])

model.compile(optimizer=Adam(), loss=categorical_crossentropy, metrics=['acc'])

history = model.fit(X_train, to_categorical(Y_train),
                    validation_data=(X_val, to_categorical(Y_val)),
                    batch_size=128, epochs=30)

Train on 50000 samples, validate on 10000 samples


The performance is much worse this time

Let's see if adding more layers will make any difference

In [67]:
K.clear_session()

model = Sequential([
    Flatten(input_shape=(32, 32, 3)),
    Dense(32, activation='relu'),
    Dense(64, activation='relu'),  # New layer
    Dense(128, activation='relu'),  # New layer
    Dense(256, activation='relu'),  # New layer
    Dense(10, activation='softmax')
    ])

model.compile(optimizer=Adam(), loss=categorical_crossentropy, metrics=['acc'])

history = model.fit(X_train, to_categorical(Y_train),
                    validation_data=(X_val, to_categorical(Y_val)),
                    batch_size=128, epochs=30)

Train on 50000 samples, validate on 10000 samples


Making the model deeper improved its performance a bit

But can we do any better?

Now let's try a different type of model

This type is called ***Convolutional*** and uses a specific type of layer which is very popular for image related tasks

The idea is that at every layer of the model there are some filters that learn specific patterns or characteristics.

The first layers learn low-level (simple) patterns. The deeper layers learn to recognize more complex patterns

Here is an example of filters trained to recognize human faces

See how the complexity of the patterns increases as we go deeper

![convolutional kernels](https://devblogs.nvidia.com/wp-content/uploads/2015/11/hierarchical_features.png)
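
To get a feeling for what a single filter does, here is a rough NumPy sketch that slides a 3x3 filter over a tiny made-up image. The filter values are hand-written to respond to vertical edges, whereas in a CNN they are learned during training. `conv2d_valid` is a hypothetical helper, not a Keras function.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image without padding and sum the products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Made-up 5x5 image: dark on the left, bright on the right
image = np.array([[0, 0, 1, 1, 1]] * 5, dtype=float)
# Hand-written filter that responds to dark-to-bright vertical edges
kernel = np.array([[-1, 0, 1]] * 3, dtype=float)

response = conv2d_valid(image, kernel)
print(response.shape)
print(response)
```

The response is high where the window covers the edge and 0 where the window is uniformly bright. Without padding the output is smaller than the input (5x5 becomes 3x3 here, and 32x32 becomes 30x30 for a 3x3 filter), while `padding='same'` keeps the size.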

In [68]:
K.clear_session()

model = Sequential([
    Conv2D(32, 3, activation='relu', input_shape=(32, 32, 3)),  # New layer
    Flatten(),
    Dense(256, activation='relu'),  # New layer
    Dense(10, activation='softmax')
    ])

model.compile(optimizer=Adam(), loss=categorical_crossentropy, metrics=['acc'])

history = model.fit(X_train, to_categorical(Y_train),
                    validation_data=(X_val, to_categorical(Y_val)),
                    batch_size=128, epochs=30)

Train on 50000 samples, validate on 10000 samples


Indeed, the Convolutional Neural Network (CNN) performed better than the Fully Connected one

However it seems to overfit

We can add some layers in order to deal with the overfitting problem
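
The two kinds of layers used for this are max pooling and dropout. The NumPy sketch below only illustrates the idea and is not the Keras implementation: `max_pool_2x2` and `dropout` are hypothetical helpers, and the input values are made up.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Keep the largest value of every non-overlapping 2x2 block."""
    h, w = feature_map.shape
    cropped = feature_map[:h - h % 2, :w - w % 2]  # drop an odd last row/column
    blocks = cropped.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

def dropout(x, rate, rng):
    """Zero a random fraction of the values and rescale the rest (training only)."""
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

feature_map = np.arange(16, dtype=float).reshape(4, 4)
pooled = max_pool_2x2(feature_map)
print(pooled)  # 4x4 input becomes 2x2

rng = np.random.default_rng(0)
dropped = dropout(np.ones(1000), rate=0.25, rng=rng)
print((dropped == 0).mean(), dropped.mean())
```

Pooling shrinks the feature maps, so the dense layer that follows has fewer weights to fit. Dropout randomly silences units during training, so the model cannot rely too much on any single one of them.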

In [71]:
K.clear_session()

model = Sequential([
    Conv2D(32, 3, activation='relu', input_shape=(32, 32, 3)),
    MaxPool2D(),  # New layer
    Dropout(0.25),  # New layer
  
    Flatten(),
    Dense(256, activation='relu'),
    Dense(10, activation='softmax')
    ])

model.compile(optimizer=Adam(), loss=categorical_crossentropy, metrics=['acc'])

history = model.fit(X_train, to_categorical(Y_train),
                    validation_data=(X_val, to_categorical(Y_val)),
                    batch_size=128, epochs=30)

Train on 50000 samples, validate on 10000 samples


And finally let's see how the performance changes if we add more layers

In [70]:
K.clear_session()

model = Sequential([
    Conv2D(32, 3, padding='same', activation='relu', input_shape=(32, 32, 3)),
    Conv2D(32, 3, padding='same', activation='relu'),  # New layer
  
    MaxPool2D(),
    Dropout(0.25),
  
    Conv2D(64, 3, padding='same', activation='relu'),  # New layer
    Conv2D(64, 3, padding='same', activation='relu'),  # New layer
  
    MaxPool2D(),  # New layer
    Dropout(0.25),  # New layer
  
    Conv2D(128, 3, padding='same', activation='relu'),  # New layer
    Conv2D(128, 3, padding='same', activation='relu'),  # New layer
  
    MaxPool2D(),  # New layer
    Dropout(0.25),  # New layer
  
    Flatten(),
    Dense(512, activation='relu'),
    Dropout(0.25),
  
    Dense(10, activation='softmax')])

model.compile(optimizer=RMSprop(lr=0.0001, decay=1e-6), loss=categorical_crossentropy, metrics=['acc'])

history = model.fit(X_train, to_categorical(Y_train),
                    validation_data=(X_val, to_categorical(Y_val)),
                    batch_size=128, epochs=30)

Train on 50000 samples, validate on 10000 samples


This means that:
- the **complexity** of the patterns requires more sophisticated approaches
- the **size** of the model is important
- the **type** of the layers is important
- the **architecture** of the network is important
- there are ways to get **better results** for the same task on the same data

## The end
### of the data and model experimentation