# Multilayer perceptrons (MLPs)

Now that we have characterised multilayer perceptrons (MLPs) mathematically (see the lecture material), let us implement one ourselves.

This notebook is organised in two parts:

* the first part implements an MLP from scratch, going through all the necessary steps.
* the second part uses the Keras `Sequential` model API for a concise implementation.


---

## Note for ST456

The following command installs the `d2l` package, which provides the TensorFlow helper functions used by the reference book.

If you get a message saying **you need to restart the runtime**, please **do so before** running the rest of the code.

In [None]:
!pip install d2l==0.17.1

In [None]:
# importing necessary libraries
import tensorflow as tf
from d2l import tensorflow as d2l

### Loading the dataset

To compare against our previous results
achieved with softmax regression (**see Week 01 - Homework**), we will continue to work with
the Fashion-MNIST image classification dataset.

In [None]:
# minibatch size
batch_size = 256
# load the dataset
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)

## First approach: MLP from scratch

### Initializing model parameters

Recall that Fashion-MNIST contains 10 classes,
and that each image consists of a $28 \times 28 = 784$
grid of grayscale pixel values.
Again, we will disregard the spatial structure
among the pixels for now,
so we can think of this as simply a classification dataset
with 784 input features and 10 classes.

To begin, we will **implement an MLP
with one hidden layer and 256 hidden units.**
Note that we can regard both of these quantities
as hyperparameters. Typically, we choose layer widths in powers of 2,
which tend to be computationally efficient because
of how memory is allocated and addressed in hardware.

Again, we will represent our parameters with several tensors.
Note that *for every layer*, we must keep track of
one weight matrix and one bias vector.
In TensorFlow, we wrap each of them in a `tf.Variable`,
so that the gradients of the loss with respect to these parameters can be tracked during training.


In [None]:
# hyperparameters for the MLP
num_inputs, num_outputs, num_hiddens = 784, 10, 256

W1 = tf.Variable(tf.random.normal(shape=(num_inputs, num_hiddens), mean=0, stddev=0.01))
b1 = tf.Variable(tf.zeros(num_hiddens))
W2 = tf.Variable(tf.random.normal(shape=(num_hiddens, num_outputs), mean=0, stddev=0.01))
b2 = tf.Variable(tf.random.normal([num_outputs], stddev=.01))

params = [W1, b1, W2, b2]
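
As a quick sanity check on the shapes above, the following plain-Python sketch counts how many scalars the four tensors hold. It only redoes the arithmetic implied by the layer sizes; the variable names `hidden_layer_params`, `output_layer_params` and `total_params` are introduced here for illustration.

```python
# Layer sizes, as defined in the cell above.
num_inputs, num_outputs, num_hiddens = 784, 10, 256

# Each fully connected layer has one weight matrix (fan_in x fan_out)
# and one bias vector (fan_out).
hidden_layer_params = num_inputs * num_hiddens + num_hiddens   # W1 and b1
output_layer_params = num_hiddens * num_outputs + num_outputs  # W2 and b2
total_params = hidden_layer_params + output_layer_params

print(hidden_layer_params)  # 200960
print(output_layer_params)  # 2570
print(total_params)         # 203530
```

Almost all of the parameters sit in `W1`, because the input layer is much wider than the output layer.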

### Activation function

To make sure we know how everything works, for now,
we will **implement the ReLU activation** ourselves
using the maximum function rather than
invoking the built-in `relu` function directly.


In [None]:
# custom implementation of the ReLU function
def relu(X):
    return tf.math.maximum(X, 0)
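
To see what the maximum does to individual values, here is a minimal sketch in plain Python rather than TensorFlow. The helper `relu_scalar` and the sample inputs are made up for this illustration; `tf.math.maximum(X, 0)` performs the same comparison on every entry of a tensor.

```python
def relu_scalar(x):
    """ReLU for a single number: negative values are clipped to zero."""
    return max(x, 0.0)

# Hypothetical pre-activation values.
inputs = [-2.0, -0.5, 0.0, 1.5, 3.0]
outputs = [relu_scalar(x) for x in inputs]

print(outputs)  # [0.0, 0.0, 0.0, 1.5, 3.0]
```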

### Model definition

Because we are disregarding spatial structure,
we `reshape` each two-dimensional image into
a flat vector of length `num_inputs`.

Finally, we **implement our model** with just a few lines of code.


In [None]:
# custom MLP model
def net(X):
    # input layer
    X = tf.reshape(X, (-1, num_inputs))
    # hidden layer
    H = relu(tf.matmul(X, W1) + b1)
    # output layer
    return tf.matmul(H, W2) + b2
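
It can help to trace the tensor shapes through `net`. The sketch below does this with plain Python tuples instead of real tensors. The helpers `flatten_shape` and `dense_shape` are hypothetical names, and the input shape `(256, 28, 28, 1)` is an assumption about how one minibatch is stored.

```python
def flatten_shape(shape):
    """Shape after reshaping to (batch, -1): keep axis 0, merge the rest."""
    batch, *rest = shape
    features = 1
    for dim in rest:
        features *= dim
    return (batch, features)


def dense_shape(input_shape, weight_shape):
    """Shape of X @ W + b, where b broadcasts over the batch axis."""
    batch, fan_in = input_shape
    expected_fan_in, fan_out = weight_shape
    if fan_in != expected_fan_in:
        raise ValueError(f"cannot multiply {input_shape} by {weight_shape}")
    return (batch, fan_out)


num_inputs, num_outputs, num_hiddens = 784, 10, 256

# Assumption: one minibatch of 256 grayscale images of size 28 x 28 x 1.
x_shape = (256, 28, 28, 1)
flat_shape = flatten_shape(x_shape)                                    # input layer
hidden_shape = dense_shape(flat_shape, (num_inputs, num_hiddens))     # hidden layer
output_shape = dense_shape(hidden_shape, (num_hiddens, num_outputs))  # output layer

print(flat_shape, hidden_shape, output_shape)
```

ReLU acts elementwise, so it leaves the shape of the hidden layer unchanged. The output has one row per image and one column per class.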

### Loss function

To ensure numerical stability,
and because we already implemented
loss functions from scratch in Week 01,
we use the built-in cross-entropy loss from the high-level API.
Passing `from_logits=True` tells it that `net` returns unnormalised scores,
so the softmax is applied inside the loss.


In [None]:
def loss(y_hat, y):
    return tf.losses.sparse_categorical_crossentropy(y, y_hat, from_logits=True)
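
To illustrate why numerical stability matters here, the following plain-Python sketch compares a naive cross-entropy with one that uses the log-sum-exp trick for a single example. It is not how TensorFlow implements its loss internally; the function names and the sample logits are chosen for illustration only.

```python
import math


def naive_cross_entropy(logits, label):
    """Softmax followed by -log(p[label]); exp() overflows for large logits."""
    exps = [math.exp(z) for z in logits]
    return -math.log(exps[label] / sum(exps))


def stable_cross_entropy(logits, label):
    """Same quantity, computed as logsumexp(logits) - logits[label]."""
    m = max(logits)  # subtracting the maximum keeps every exponent <= 0
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


small_logits = [2.0, 1.0, 0.1]
large_logits = [1000.0, 0.0, -1000.0]

# For moderate logits both versions agree up to rounding error.
print(naive_cross_entropy(small_logits, 0), stable_cross_entropy(small_logits, 0))

# For large logits the naive version fails, the stable one does not.
try:
    naive_cross_entropy(large_logits, 0)
except OverflowError as err:
    print("naive version failed:", err)
print(stable_cross_entropy(large_logits, 0))
```

Working directly on the logits is what allows the loss to avoid forming the softmax probabilities explicitly.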

### Training the model

Fortunately, **the training loop for MLPs
is exactly the same as for softmax regression**.

Leveraging the `d2l` package again,
we call the `train_ch3` function,
setting the number of epochs to 10 and the learning rate to 0.1.


In [None]:
# training parameters
num_epochs, lr = 10, 0.1
# d2l.Updater applies minibatch SGD to the listed parameters
updater = d2l.Updater([W1, W2, b1, b2], lr)
# training the model
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, updater)
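
As a rough sketch of what a single minibatch SGD update does, the plain-Python example below moves each parameter against its gradient. It assumes that the gradient is summed over the minibatch and then divided by the batch size, which is our reading of the `d2l` helper rather than something stated above. The function `sgd_step` and the numbers are hypothetical.

```python
def sgd_step(params, grads, lr, batch_size):
    """Return updated parameters: p <- p - lr * g / batch_size."""
    return [p - lr * g / batch_size for p, g in zip(params, grads)]


# Hypothetical scalar parameters and their summed minibatch gradients.
params = [1.0, -2.0]
grads = [4.0, -8.0]

new_params = sgd_step(params, grads, lr=0.1, batch_size=4)
print(new_params)
```

A parameter with a positive gradient decreases and one with a negative gradient increases, with the step size scaled by the learning rate.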

### Testing the model

To evaluate the learned model,
we **apply it to some test data**.


In [None]:
d2l.predict_ch3(net, test_iter)

### Summary of the first approach

* Implementing a simple MLP is relatively easy, even when done manually.
* However, with a large number of layers, implementing MLPs from scratch can get messy (e.g., naming and keeping track of our model's parameters).
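
One way to reduce this bookkeeping, sketched here in plain Python, is to generate the parameter names and shapes from a list of layer widths. The helper `make_layer_shapes` and the deeper architecture `[784, 256, 128, 10]` are hypothetical and not part of the model trained above.

```python
def make_layer_shapes(layer_sizes):
    """Return {name: shape} for the weights and biases of an MLP."""
    shapes = {}
    pairs = zip(layer_sizes[:-1], layer_sizes[1:])
    for i, (fan_in, fan_out) in enumerate(pairs, start=1):
        shapes[f"W{i}"] = (fan_in, fan_out)  # weight matrix of layer i
        shapes[f"b{i}"] = (fan_out,)         # bias vector of layer i
    return shapes


# Hypothetical network with two hidden layers.
shapes = make_layer_shapes([784, 256, 128, 10])
for name, shape in shapes.items():
    print(name, shape)
```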


## Second approach: using `keras` Sequential model

### Model definition

Compared with our concise implementation
of softmax regression (Week 01),
the only difference is that we now use
*two fully-connected layers* instead of one: the first is **our hidden layer**,
which contains 256 hidden units
and applies the ReLU activation function, and the second is our output layer.

In [None]:
# model definition
net = tf.keras.models.Sequential([
    # input layer (flattens each image into a vector)
    tf.keras.layers.Flatten(),
    # hidden layer (number of units, activation function)
    tf.keras.layers.Dense(256, activation='relu'),
    # output layer (number of output classes)
    tf.keras.layers.Dense(10)])

### Model and training hyperparameters

In [None]:
# training hyperparameters
batch_size, lr, num_epochs = 256, 0.1, 10
# loss function and optimizer
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
trainer = tf.keras.optimizers.SGD(learning_rate=lr)


### Training the model

In [None]:
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer)

### Evaluating the model

In [None]:
d2l.predict_ch3(net, test_iter)

### Summary of the second approach

* Using high-level APIs, we can implement MLPs much more concisely.
* For the same classification problem, the implementation of an MLP is the same as that of softmax regression except for additional hidden layers with activation functions.
