# Week 4

### Defining models

Within this API, models are represented by the `tf.keras.Model` class. Keras models are computational units that transform an input $x$ into an output $\hat{y}$ and that can be trained via SGD or similar algorithms.

We will define our model using predefined `Layer`s. Compared to Keras models, layers are more atomic computational units that can be reused. For example, the `Dense` layer is an implementation of the MLP layer equation $\sigma(\mathbf{Wx} + \mathbf{b})$.
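
To make the equation concrete, the following sketch builds a single `Dense` layer and reproduces its output by hand. The layer size, the sigmoid activation and the random input are illustrative choices, not part of the lab. Keras stores samples as rows, so the layer computes $\sigma(\mathbf{xW} + \mathbf{b})$, which is the same equation written for row vectors.

```python
import numpy as np
from tensorflow import keras

# Illustrative sizes: a batch of 5 samples with 4 features, mapped to 3 units
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4)).astype("float32")

layer = keras.layers.Dense(units=3, activation="sigmoid")
layer_output = layer(x).numpy()  # The first call creates W and b, then applies them

# get_weights returns the kernel W (4 x 3) and the bias b (3,) as NumPy arrays
W, b = layer.get_weights()
manual_output = 1.0 / (1.0 + np.exp(-(x @ W + b)))

# Should print True: both computations agree up to float32 rounding
print(np.allclose(layer_output, manual_output, atol=1e-5))
```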

In [None]:
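import numpy as np  # Used later for batching in the custom training loop
import tensorflow as tf
from tensorflow import keras


class MultilayerPerceptron(keras.Model):  # Subclassing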
    
    def __init__(self, dim_output, dim_hidden):
        super(MultilayerPerceptron, self).__init__(name='multilayer_perceptron')
        self.dim_output = dim_output
        self.dim_hidden = dim_hidden

        # Within Model.__init__ we initialize all the layers we will use.
        # No activation is passed here, so both layers are linear (activation=None).
        self.layer_1 = keras.layers.Dense(
            units=dim_hidden)  # units = how many neurons in the layer
        self.layer_2 = keras.layers.Dense(
            units=dim_output)

    def call(self, x):  # call defines the flow of the computation, e.g. in this particular model
                        # we simply call the two layers one after the other
        h = self.layer_1(x)
        y = self.layer_2(h)
        return y


### Training models

We will train this model to classify the _Iris_ dataset from the previous lab. Training models defined like this is really easy:

In [None]:
model = MultilayerPerceptron(  # We create a new model
    dim_output=3,
    dim_hidden=32)

model.compile(  # By compiling we prepare the model for training
    optimizer=keras.optimizers.SGD(learning_rate=0.003),  # We pick an optimizer algorithm
    loss='mean_squared_error',  # We pick a loss function
    metrics=['accuracy'])  # We pick evaluation metrics

model.fit(  # Fit runs the training over provided data
    x=data.x,
    y=data.y,
    batch_size=4,
    epochs=20)


This is the selling point of modern neural frameworks. The model is trained via SGD, but we do not need to calculate the derivatives ourselves. Instead, they are calculated automatically by TF. We also do not need to program how SGD works, nor do we need to define the loss functions or metrics.

Everything we did manually last week is now hidden behind the `fit` function. You should already be familiar with all the concepts that were introduced in the code above, such as `epochs`, `batch_size`, `metrics`, `loss`, `optimizer`, etc.
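
As a reminder of what the `loss` and `metrics` arguments stand for, here is a minimal NumPy sketch of mean squared error and accuracy. It assumes one-hot encoded labels, and the arrays are made up for illustration. The two functions follow the usual textbook definitions and are not Keras' internal code.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    # Average of the squared differences over all samples and outputs
    return np.mean((y_true - y_pred) ** 2)

def accuracy(y_true, y_pred):
    # A prediction counts as correct when its largest output matches the one-hot label
    return np.mean(np.argmax(y_true, axis=1) == np.argmax(y_pred, axis=1))

# Made-up example: 3 samples, 3 classes
y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
y_pred = np.array([[0.8, 0.1, 0.1], [0.3, 0.4, 0.3], [0.6, 0.3, 0.1]])

print('MSE:', mean_squared_error(y_true, y_pred))
print('Accuracy:', accuracy(y_true, y_pred))  # 2 of the 3 samples are classified correctly
```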

In [None]:
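# Note: this call assumes MultilayerPerceptron has been extended to accept
# num_layers and activation; the class defined above does not take them.
model = MultilayerPerceptron(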
    dim_output=3,
    dim_hidden=32,
    num_layers=3,
    activation=keras.activations.sigmoid)

# compile and fit are the same as above
model.compile(
    optimizer=keras.optimizers.SGD(learning_rate=0.01),
    loss='mean_squared_error',
    metrics=['accuracy'])

model.fit(
    x=data.x,
    y=data.y,
    batch_size=4,
    epochs=20)


## Gradient Tape

`fit` is a very convenient way of training neural models, but sometimes we need more flexibility and control. For example, with `fit` we cannot easily track the training step by step (e.g. for debugging). The model is compiled into a computation graph in the background, so a debugging `print` placed inside the model will not show the actual values at each step. E.g., try printing the value of `h` in the model's `call`.

Instead, we can use the so-called `GradientTape`. With this tape, the debugging print of `h` will run. Check the training code below; it is very similar to how we defined SGD in previous labs:
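
Before the full training loop, here is a tiny standalone sketch of what the tape does. It uses a single variable `w` and the toy loss $w^2$, both chosen only for illustration, and applies one SGD update by hand instead of through an optimizer.

```python
import tensorflow as tf

# A single trainable parameter, standing in for the model weights
w = tf.Variable(3.0)

with tf.GradientTape() as tape:
    # Operations on w are recorded while the tape is open
    loss = w ** 2
    print('loss inside the tape:', loss.numpy())  # Runs eagerly, so the value is available

# d(w^2)/dw = 2w, so the gradient at w = 3 is 6
gradient = tape.gradient(loss, w)
print('gradient:', gradient.numpy())

# One plain SGD update by hand: w <- w - learning_rate * gradient
learning_rate = 0.1
w.assign_sub(learning_rate * gradient)
print('updated w:', w.numpy())
```

The training code does the same thing with a list of variables (`model.trainable_variables`) and lets the optimizer perform the update.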

In [None]:
model = MultilayerPerceptron(
    dim_output=3,
    dim_hidden=32)

optimizer = keras.optimizers.SGD(learning_rate=0.01)
loss_function = keras.losses.MeanSquaredError()

# loss_function = keras.losses.CategoricalCrossentropy()
# You can use cross-entropy loss if you completed PA 4.3
    
def step(xs, ys):  # This has the same meaning as step function in previous labs
    
    with tf.GradientTape() as tape:
        preds = model(xs)  # Model predictions
        loss = loss_function(ys, preds)  # The value of loss function comparing the true
                                         # values ys with predictions

    gradient = tape.gradient(
        target=loss,
        sources=model.trainable_variables)  # Calculate the gradient of loss function w.r.t. model parameters.
                                            # This behaves the same as gradient methods from previous labs.
        
    optimizer.apply_gradients(zip(gradient, model.trainable_variables))  # Applies the computed gradient on current
                                                                         # parameter values.
    
def loss(xs, ys):
    preds = model(xs)
    return loss_function(ys, preds)
    
num_epochs = 100
batch_size = 5
num_samples = len(data.x)

# Training loop (without shuffling for simplicity)
for e in range(num_epochs):
    for i in np.arange(0, num_samples, batch_size):  # Batching
        step(data.x[i:i+batch_size], data.y[i:i+batch_size])
    print('Epoch:', e, 'Loss:', loss(data.x, data.y).numpy())
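
The loop above visits the samples in the same order in every epoch. A possible way to add shuffling is sketched below. The helper `shuffled_batches` is a hypothetical name introduced here, and the sketch assumes that `data.x` and `data.y` are NumPy arrays that can be indexed with an array of integers.

```python
import numpy as np

def shuffled_batches(num_samples, batch_size, rng):
    # Draw a fresh random order of sample indices, then cut it into batches
    order = rng.permutation(num_samples)
    for start in range(0, num_samples, batch_size):
        yield order[start:start + batch_size]

rng = np.random.default_rng(0)  # Fixed seed only to make the sketch repeatable
batches = list(shuffled_batches(num_samples=12, batch_size=5, rng=rng))
print([len(b) for b in batches])  # The last batch is smaller when sizes do not divide evenly
```

Inside the epoch loop, each index array `idx` produced by the helper would then be used as `step(data.x[idx], data.y[idx])`.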
        