# Deep Learning with TensorFlow
## Formative assessment
### Week 1: Introduction to Deep Learning

#### Instructions

In this notebook, you will write code to implement a linear regression model in TensorFlow. You will implement the analytic solution, as well as a low-level training loop to update parameters using stochastic gradient descent. Finally, you will train a deep learning MLP regression model using the high-level Keras API.

Some code cells are provided for you in the notebook. You should avoid editing provided code, and make sure to execute the cells in order to avoid unexpected errors. Some cells begin with the line:

`#### GRADED CELL ####`

Don't move or edit this first line - this is what the automatic grader looks for to recognise graded cells. These cells require you to write your own code to complete them, and are automatically graded when you submit the notebook. Don't edit the function name or signature provided in these cells, otherwise the automatic grader might not function properly.

#### How to submit

Complete all the tasks you are asked for in the notebook. When you have finished and are happy with your code, commit and push your changes to your repository. This will trigger the automated tests, which you will be able to check on GitHub.

Make sure not to change the name or location of this notebook within your repository, or the automated tests will not be able to find it.

#### Let's get started!

We'll start by running some imports, and loading the dataset. Do not edit the existing imports in the following cell. If you would like to make further TensorFlow imports, you should add them here.

In [1]:
#### PACKAGE IMPORTS ####

# Run this cell first to import all required packages. Do not make any imports elsewhere in the notebook

import tensorflow as tf
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from pathlib import Path

# If you would like to make further imports from TensorFlow, add them here




<img src="figures/life_expectancy_wikipedia.png" alt="Life expectancy" style="width: 450px;"/>
<center><font style="font-size:12px">source: <a href=https://en.wikipedia.org/wiki/List_of_countries_by_life_expectancy>wikipedia</a></font></center>

#### The WHO Life Expectancy dataset
In this formative assessment, you will use the [WHO Life Expectancy dataset](https://www.kaggle.com/kumarajarshi/life-expectancy-who) from Kaggle. This dataset was collected from the Global Health Observatory (GHO) data repository under the World Health Organization (WHO), for the purpose of health data analysis. The dataset includes multiple factors affecting life expectancy across 133 countries, divided into the broad categories of immunization-related factors, mortality factors, economic factors and social factors.

Your goal is to use TensorFlow to model the dataset using linear regression and deep MLP networks.

#### Load and subset the data

In [2]:
# Run this cell to load and describe the data

df = pd.read_csv(Path("./data/Life Expectancy Data.csv"))
df.describe()


We will work with the following columns from the DataFrame:

In [None]:
# This is the list of columns to use from the DataFrame

cols = ['Life expectancy ', 'Adult Mortality', 'Alcohol', ' BMI ',
        'Polio', 'Total expenditure', 'Diphtheria ', ' HIV/AIDS', 
        'GDP', 'Income composition of resources', 'Schooling']

You should now complete the following function, according to the following specifications:

* Extract the columns above from the loaded DataFrame
* Remove any rows with `NaN` values
* Define a 1-D numpy array using the values in the `Life expectancy ` column. This will be the target variable
* Define a 2-D numpy array using the values from all remaining columns. This array should have shape `(num_examples, num_features)`. These will be the input variables
* The function should then return the tuple of constant `tf.Tensor` objects `(input_variables, target_variable)` of type `tf.float32`
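
The steps above can be tried on a toy DataFrame first. The sketch below uses made-up column names and values (`target`, `a`, `b` and `unused` are illustrative assumptions, not columns of the WHO dataset) to show how column selection, `dropna` and `tf.constant` behave together.

```python
import numpy as np
import pandas as pd
import tensorflow as tf

# Hypothetical toy data: column names and values are illustrative only
toy_df = pd.DataFrame({
    "target": [1.0, 2.0, np.nan, 4.0],
    "a": [0.1, 0.2, 0.3, np.nan],
    "b": [10.0, 20.0, 30.0, 40.0],
    "unused": [0, 0, 0, 0],
})
toy_cols = ["target", "a", "b"]  # first entry plays the role of the target column

# Select the columns of interest, then drop every row containing a NaN
toy_subset = toy_df[toy_cols].dropna()

# The first column becomes the 1-D target, the remaining columns the 2-D inputs
toy_y = tf.constant(toy_subset[toy_cols[0]].values, dtype=tf.float32)
toy_X = tf.constant(toy_subset[toy_cols[1:]].values, dtype=tf.float32)

print(toy_X.shape, toy_y.shape)
```

Two of the four toy rows contain a `NaN`, so two examples remain after `dropna`.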

In [None]:
#### GRADED CELL ####

# Complete the following function. 
# Make sure to not change the function name or arguments.

def get_inputs_and_targets(dataframe, columns):
    """
    This function takes in the loaded DataFrame and column list as above, and extracts the
    numpy arrays as described above.
    Your function should return a tuple (input_variables, target_variable) of Tensors.
    """
    dataframe = dataframe[columns]
    dataframe = dataframe.dropna()
    y = tf.constant(dataframe[columns[0]].values, dtype=tf.float32)
    X = tf.constant(dataframe[columns[1:]].values, dtype=tf.float32)
    return X, y

In [None]:
# Run your function to get the input and target Tensors

X, y = get_inputs_and_targets(df, cols)

In [None]:
# Split the data into training and test sets and standardise the input scales

X_train, X_test, y_train, y_test = train_test_split(X.numpy(), y.numpy(), test_size=0.2)  # add random_state=100 for a reproducible split

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

X_train, y_train = tf.constant(X_train), tf.constant(y_train)
X_test, y_test = tf.constant(X_test), tf.constant(y_test)

#### Linear regression model

We will fit a simple model of the form

$$
y = f_\theta(\mathbf{x}) + \epsilon,
$$

where $y\in\mathbb{R}$ is the target variable, $\mathbf{x}\in\mathbb{R}^{10}$ are the input features, $\theta\in\mathbb{R}^{11}$ are the model parameters, $\epsilon\sim\mathcal{N}(0, 1)$ is the observation noise random variable, and $f_\theta:\mathbb{R}^{10}\to\mathbb{R}$ is given by

$$
\begin{align}
f_\theta(\mathbf{x}) &= \theta_0 + \sum_{m=1}^{10} \theta_m x_m\\
&= \sum_{m=0}^{10} \theta_m x_m.
\end{align}
$$

In the second line above we have defined $x_0=1$ to be the constant feature. The maximum likelihood solution is given by the normal equation

$$
\theta_{ML} = \left(\mathbf{X}^T \mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{y},
$$

where $\mathbf{X}\in\mathbb{R}^{N\times M}$ is the data matrix, $\mathbf{y}\in\mathbb{R}^N$ are the targets, $N$ is the number of data examples, and $M$ is the number of features (including the constant feature).
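
For reference, here is a short sketch of where the normal equation comes from, under the added assumption that $\mathbf{X}^T\mathbf{X}$ is invertible. With unit-variance Gaussian noise, maximising the likelihood is equivalent to minimising the sum of squared errors, and setting the gradient to zero gives a linear system in $\theta$:

$$
\begin{align}
\theta_{ML} &= \arg\min_\theta \frac{1}{2}\left\|\mathbf{y} - \mathbf{X}\theta\right\|^2,\\
\nabla_\theta \frac{1}{2}\left\|\mathbf{y} - \mathbf{X}\theta\right\|^2 &= -\mathbf{X}^T\left(\mathbf{y} - \mathbf{X}\theta\right) = 0
\quad\Longrightarrow\quad \mathbf{X}^T\mathbf{X}\theta = \mathbf{X}^T\mathbf{y}.
\end{align}
$$

Left-multiplying both sides by $\left(\mathbf{X}^T \mathbf{X}\right)^{-1}$ recovers the expression for $\theta_{ML}$ above.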

You should now complete the following function to implement the normal equation to compute the maximum likelihood solution. Your code should only use TensorFlow functions.

* The arguments to the function are an `inputs` Tensor of shape `(num_examples, num_features)`, and a `targets` Tensor of shape `(num_examples,)`
* The function should add a column of ones as the first column to the `inputs` Tensor for the constant feature
* The function should output a 1-D Tensor of parameters of shape `(num_features + 1,)` (the first entry will be the bias)

_Hint: check [the docs](https://www.tensorflow.org/api/stable) for relevant TensorFlow functions, including the_ [`tf.linalg`](https://www.tensorflow.org/api_docs/python/tf/linalg) _module._
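
As a sanity check, the normal equation can be tried on a toy problem with a known answer. The sketch below is one possible variant, not the required solution: it solves the linear system $\mathbf{X}^T\mathbf{X}\theta = \mathbf{X}^T\mathbf{y}$ with `tf.linalg.solve` rather than forming the inverse explicitly. The function name `normal_equation_solve` and the toy data are assumptions made for illustration.

```python
import tensorflow as tf

def normal_equation_solve(inputs, targets):
    """Illustrative variant: solve X^T X theta = X^T y without an explicit inverse."""
    # Prepend a column of ones for the constant feature
    ones = tf.ones((tf.shape(inputs)[0], 1), dtype=inputs.dtype)
    X = tf.concat([ones, inputs], axis=1)
    XtX = tf.linalg.matmul(X, X, transpose_a=True)                       # (M, M)
    Xty = tf.linalg.matmul(X, targets[:, tf.newaxis], transpose_a=True)  # (M, 1)
    theta = tf.linalg.solve(XtX, Xty)                                    # (M, 1)
    return tf.squeeze(theta, axis=1)                                     # (M,)

# Toy data generated exactly by y = 1 + 2x, so the bias is 1 and the weight is 2
toy_inputs = tf.constant([[0.], [1.], [2.], [3.]])
toy_targets = tf.constant([1., 3., 5., 7.])
toy_theta = normal_equation_solve(toy_inputs, toy_targets)
print(toy_theta)
```

Because the toy targets are noise-free, the recovered parameters should match the generating bias and weight up to floating point error.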

In [None]:
#### GRADED CELL ####

# Complete the following function. 
# Make sure to not change the function name or arguments.

def normal_equation(inputs, targets):
    """
    This function takes in inputs and targets Tensors, and implements the normal equation
    as above, only using TensorFlow functions.
    Your function should return a Tensor for the maximum likelihood solution for the parameters.
    """
    N = inputs.shape[0]
    inputs = tf.concat((tf.ones((N, 1), dtype=tf.float32), inputs), axis=1)
    Xt = tf.transpose(inputs)
    XtX = tf.linalg.matmul(Xt, inputs)
    XtXinv = tf.linalg.inv(XtX)
    Xty = tf.tensordot(Xt, targets, axes=1)
    return tf.tensordot(XtXinv, Xty, axes=1)

In [None]:
# Run your function to compute the ML estimate

theta_ml = normal_equation(X_train, y_train)
bias_ml, weights_ml = theta_ml[0], theta_ml[1:]
print("MLE weights:")
print(weights_ml)
print("MLE bias:")
print(bias_ml)

#### Stochastic gradient descent

You will now implement the stochastic gradient descent (SGD) algorithm to find the MLE using optimization. To do this, you will make use of the `tf.Variable` class. Recall that a Variable object is a special kind of Tensor that is _mutable_, so we will use it for the model parameters.

First, you should complete the following `get_variables` function to create Variable objects for the weights and bias of the linear regression model, as well as an iteration counter Variable.

* The function takes `num_features` as an argument
* The bias should be a `tf.Variable` with scalar shape, type `tf.float32`, and an initial value of zero. Set the name argument of this Variable to `"bias"`
* The weights should be a 1-D `tf.Variable` of length `num_features`, type `tf.float32`, and with initial values sampled from a standard normal distribution. Set the name argument of this Variable to `"weights"`
* Both weights and bias Variables should be trainable
* Finally, the function should create a scalar Variable of type `tf.int32`, initialised to zero, with name argument set to `"iteration"`. This Variable should be non-trainable
* The function should return the tuple of Variables `(weights, bias, iteration)`
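
The mutability of `tf.Variable` objects can be seen in a small standalone sketch. The names `toy_counter` and `toy_scale` are assumptions chosen for illustration and are not part of the graded function.

```python
import tensorflow as tf

# A non-trainable integer counter, similar in spirit to an iteration counter
toy_counter = tf.Variable(0, dtype=tf.int32, trainable=False, name="toy_counter")
toy_counter.assign_add(1)  # in-place increment

# A trainable float Variable (trainable=True is the default)
toy_scale = tf.Variable([1., 2.], dtype=tf.float32, name="toy_scale")
toy_scale.assign_sub([0.5, 0.5])  # in-place subtraction, as used in gradient updates

print(toy_counter.numpy(), toy_counter.trainable)
print(toy_scale.numpy(), toy_scale.trainable)
```

Both `assign_add` and `assign_sub` modify the existing Variable rather than creating a new Tensor.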

In [None]:
#### GRADED CELL ####

# Complete the following function. 
# Make sure to not change the function name or arguments.

def get_variables(num_features):
    """
    This function takes in the number of features as an argument, and creates tf.Variable objects
    for the linear regression model weights and bias, as well as an iteration counter Variable.
    Your function should return a tuple of three tf.Variable objects (weights, bias, iteration).
    """
    bias = tf.Variable(0., dtype=tf.float32, name='bias')
    weights = tf.Variable(tf.random.normal((num_features,)), dtype=tf.float32, name='weights')
    iteration = tf.Variable(0, dtype=tf.int32, name='iteration', trainable=False)
    return weights, bias, iteration

In [None]:
# Run your function to create the Variables

weights, bias, iteration = get_variables(num_features=10)

Now define the model itself by completing the following function. This function implements $f_\theta(\mathbf{x}) = \theta_0 + \sum_{m=1}^{10} \theta_m x_m$ as above.

* The function takes an `inputs` Tensor, `weights` and `bias` Variables as input
* The `inputs` Tensor could be a batch of inputs of shape `(batch_size, num_features)`, or a single set of inputs of shape `(num_features,)`
* The function should return the output Tensor $f_\theta(\mathbf{x})$
* The output Tensor should have shape `(batch_size,)` (if passed a batch of inputs), or else should be a scalar
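
One way to handle both input shapes with a single expression is `tf.tensordot` with `axes=1`, which contracts the last axis of the first argument with the first axis of the second. The sketch below uses three features and hand-picked values (all assumptions for illustration) so the outputs can be checked by hand.

```python
import tensorflow as tf

toy_weights = tf.constant([1., 2., 3.])
toy_bias = tf.constant(0.5)

# A batch of two inputs, shape (2, 3)
toy_batch = tf.constant([[1., 0., 0.],
                         [0., 1., 0.]])
# A single input, shape (3,)
toy_single = tf.constant([1., 1., 1.])

batch_out = toy_bias + tf.tensordot(toy_batch, toy_weights, axes=1)    # shape (2,)
single_out = toy_bias + tf.tensordot(toy_single, toy_weights, axes=1)  # scalar

print(batch_out.shape, single_out.shape)
```

Here the first batch row picks out the first weight and the second row picks out the second weight, each shifted by the bias.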

In [None]:
#### GRADED CELL ####

# Complete the following function. 
# Make sure to not change the function name or arguments.

def f(inputs, weights, bias):
    """
    This function takes in an inputs Tensor, weights and bias Variables. It should compute and 
    return the output Tensor prediction. 
    """
    return bias + tf.tensordot(inputs, weights, axes=1)

In [None]:
# Test your function on some dummy inputs

inputs = tf.random.normal((3, 10), dtype=tf.float32)
print(f(inputs, weights, bias))

inputs = tf.random.normal((10,), dtype=tf.float32)
print(f(inputs, weights, bias))

We will need to define the loss function to optimise. As we have assumed Gaussian noise $\epsilon\sim\mathcal{N}(0, 1)$ and we are looking to find the maximum likelihood solution, this will be the mean squared error loss. Recall that SGD provides a cheaper estimate of the full gradient, by computing the gradient on a minibatch of data points, instead of the full dataset. The loss function that you should implement is therefore:

$$
\tilde{L}_{MSE}(\theta) = \frac{1}{M} \sum_{\mathbf{x}_i, y_i\in\mathcal{D}_m} (y_i - \hat{y}_i)^2
$$

where $\hat{y}_i = f_\theta(\mathbf{x}_i)$, $(\mathbf{x}_i, y_i)$ is an example input and output from the randomly sampled minibatch $\mathcal{D}_m$ of training data points, and $M = |\mathcal{D}_m|$ is the size of the minibatch. The function specifications are as follows:

* The `mse` function takes two Tensors as arguments: `y_true` and `y_pred`
* As SGD computes gradients on minibatches, these two Tensors will have shape `(batch_size,)`
* The loss function should compute and return the mean squared error loss (MSE) as a scalar Tensor
* Use only TensorFlow functions inside your function
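
A small worked example may help to check an implementation by hand. With the made-up values below (assumptions for illustration), the errors are $0$, $1$ and $-2$, the squared errors are $0$, $1$ and $4$, and their mean is $5/3$.

```python
import tensorflow as tf

toy_y_true = tf.constant([1., 2., 3.])
toy_y_pred = tf.constant([1., 1., 5.])

# Squared error per example, then the mean over the minibatch
squared_errors = tf.square(toy_y_true - toy_y_pred)
toy_mse = tf.reduce_mean(squared_errors)  # scalar Tensor

print(squared_errors.numpy(), toy_mse.numpy())
```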

In [None]:
#### GRADED CELL ####

# Complete the following function. 
# Make sure to not change the function name or arguments.

def mse(y_true, y_pred):
    """
    This function takes a batch of 'ground truth' values y_true and a corresponding batch
    of model predictions y_pred, and computes the mean squared error.
    Your function should return the MSE as a scalar Tensor.
    """
    return tf.reduce_mean(tf.square(y_true - y_pred))

In [None]:
# Compute the initial loss on a batch of inputs

mse(y_train[:32], f(X_train[:32], weights, bias))

In [None]:
# Compute the train and test loss of the MLE

print("MLE train loss: {}".format(mse(y_train, f(X_train, weights_ml, bias_ml))))
print("MLE test loss: {}".format(mse(y_test, f(X_test, weights_ml, bias_ml))))

The following function implements the update step of SGD, which we will use inside the training loop. Recall this update uses the gradient of the loss with respect to the model parameters to make the update:

$$
\theta_{t+1} = \theta_{t} - \eta \nabla_\theta \tilde{L}_{MSE}(\theta_t),\qquad t\in\mathbb{N}_0,
$$

where $\eta>0$ is the learning rate.

* The `sgd_update` function takes the following arguments:
  * `model_fn` is the function that defines the predictive function (the function `f` above)
  * `inputs` and `targets` are the minibatch inputs and targets Tensors, of shape `(batch_size, 10)` and `(batch_size,)` respectively
  * `w` and `b` are the Variables that represent the model parameters
  * The `learning_rate` is the SGD hyperparameter
* The function should compute the SGD update step (assuming the mean squared error loss as above), updating the `w` and `b` Variables accordingly, using the `learning_rate` passed in. It will not return anything; the Variables are updated in-place.
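
The gradients in this update can be derived by hand or obtained by automatic differentiation. The sketch below takes the second route with `tf.GradientTape`, which is a swapped-in technique rather than something this assessment requires; it can be useful as a cross-check of a hand-derived update. The names `sgd_update_autodiff` and `toy_model`, and the toy values, are assumptions for illustration.

```python
import tensorflow as tf

def toy_model(inputs, w, b):
    # Linear model: bias plus dot product of inputs and weights
    return b + tf.tensordot(inputs, w, axes=1)

def sgd_update_autodiff(model_fn, inputs, targets, w, b, learning_rate=0.01):
    """Illustrative SGD step that lets TensorFlow compute the MSE gradients."""
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(targets - model_fn(inputs, w, b)))
    w_grad, b_grad = tape.gradient(loss, [w, b])
    # Update the Variables in place
    w.assign_sub(learning_rate * w_grad)
    b.assign_sub(learning_rate * b_grad)

toy_w = tf.Variable([0., 0.])
toy_b = tf.Variable(0.)
toy_x = tf.constant([[1., 0.], [0., 1.]])
toy_y = tf.constant([1., 2.])

sgd_update_autodiff(toy_model, toy_x, toy_y, toy_w, toy_b, learning_rate=0.1)
print(toy_w.numpy(), toy_b.numpy())
```

For this toy batch the initial predictions are zero, so the MSE gradient is $(-1, -2)$ for the weights and $-3$ for the bias; with a learning rate of $0.1$ the weights move to $(0.1, 0.2)$ and the bias to $0.3$.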

In [None]:
#### GRADED CELL ####

# Complete the following function. 
# Make sure to not change the function name or arguments.

def sgd_update(model_fn, inputs, targets, w, b, learning_rate=0.01):
    """
    This function takes the model function, inputs batch, targets batch, weights Variable,
    bias Variable and learning rate as arguments. It should update the Variables w and b
    using the SGD update rule above for the MSE loss.
    """
    y_pred = model_fn(inputs, w, b)
    error = 2 * tf.expand_dims(y_pred - targets, axis=1)  # (batch_size, 1)
    w_grads = inputs * error  # (batch_size, num_features)
    b_grad = error
    average_w_grad = tf.reduce_mean(w_grads, axis=0)  # (num_features,)
    average_b_grad = tf.reduce_mean(b_grad)  # scalar
    w.assign_sub(learning_rate * average_w_grad)
    b.assign_sub(learning_rate * average_b_grad)

In [None]:
# Test your SGD update function

print("Before the update:")
print(weights)
print(bias)
sgd_update(f, X_train[:32], y_train[:32], weights, bias, learning_rate=0.05)
print("\nAfter the update:")
print(weights)
print(bias)

You are now ready to write the training loop in the following function. The training loop consists of a pre-defined number of epochs, where one epoch is one complete pass through the training dataset. Within an epoch, there is an inner loop where the algorithm iterates through the training data, pulling out a minibatch of data at each iteration, and using it to update the weights and biases according to the SGD update rule. 

You should complete the following `training_loop` function according to the specifications:

* The function takes the following arguments:
  * `num_epochs`: a positive integer that defines the number of epochs to run the training loop
  * `model_fn`: as before, the function that defines the predictive function
  * `training_data`: a 2-tuple of Tensors `(inputs, targets)` for the complete training data
  * `batch_size`: a positive integer that defines the number of examples in each minibatch
  * `w`: the Variable that represents the model weights
  * `b`: the Variable that represents the model bias
  * `iteration`: a Variable used for counting the total number of updates
  * `mse`: the loss function to evaluate the model (this will be your `mse` function above)
  * `sgd_update`: the function that implements the SGD update (this will be your `sgd_update` function above)
  * `learning_rate`: the learning rate for the SGD update
* The function should iterate through the training data `num_epochs` times
* On each pass through the training data, the function should extract `batch_size` examples from the inputs and targets provided in `training_data`
  * The minibatches should be pulled from the training data in sequence. That is, the first minibatch will be the first `batch_size` elements of the inputs and targets, the second minibatch will be the next `batch_size` elements, and so on. Bear in mind that the last minibatch of the epoch may be a different size
  * At each iteration, the `iteration` Variable should be incremented by one
* For each minibatch, the parameters `w` and `b` should be updated according to `sgd_update`, using the `learning_rate` provided
* After every update, the model loss should be evaluated on the current minibatch using the `mse` function and appended to a list as a scalar float
* The list of losses should then be returned by the function
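
The sequential minibatch slicing described above can be sketched on a toy Tensor. The example assumes 10 examples and a batch size of 4 (both made up for illustration), so the last minibatch is smaller than the others.

```python
import tensorflow as tf

toy_data = tf.range(10)  # stands in for 10 training examples
toy_batch_size = 4

# Ceiling division: number of minibatches needed to cover every example
num_batches = -(-toy_data.shape[0] // toy_batch_size)

# Slices that run past the end of the Tensor are truncated automatically
toy_batches = [toy_data[i * toy_batch_size:(i + 1) * toy_batch_size]
               for i in range(num_batches)]
batch_sizes = [int(batch.shape[0]) for batch in toy_batches]

print(num_batches, batch_sizes)
```

The same slicing expression works for the inputs and the targets, which keeps each input row aligned with its target.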

In [None]:
#### GRADED CELL ####

# Complete the following function. 
# Make sure to not change the function name or arguments.

def training_loop(num_epochs, model_fn, training_data, batch_size, w, b, iteration, 
                  mse=mse, sgd_update=sgd_update, learning_rate=0.01):
    """
    This function executes the training loop according to the specifications above. 
    It should run for num_epochs passes through the training data, updating the model
    parameters using the sgd_update function at every iteration.
    The function should return the list of losses computed on every minibatch at each iteration,
    using the mse function.
    """
    train_inputs, train_targets = training_data
    num_train_examples = train_inputs.shape[0]
    if num_train_examples % batch_size == 0:
        iterations_per_epoch = num_train_examples // batch_size
    else:
        iterations_per_epoch = num_train_examples // batch_size + 1
    losses = []
    for epoch in range(num_epochs):
        print("Epoch {}".format(epoch))
        for i in range(iterations_per_epoch):
            iteration.assign_add(1)
            minibatch_inputs = train_inputs[i*batch_size: (i+1)*batch_size]
            minibatch_targets = train_targets[i*batch_size: (i+1)*batch_size]
            sgd_update(model_fn, minibatch_inputs, minibatch_targets, w, b, learning_rate)
            losses.append(mse(minibatch_targets, model_fn(minibatch_inputs, w, b)).numpy())
    print("Training completed!")
    return losses

In [None]:
# Re-initialise the model parameters and run the training loop

weights, bias, iteration = get_variables(num_features=10)
losses = training_loop(20, f, (X_train, y_train), 128, weights, bias, iteration=iteration, 
                       mse=mse, sgd_update=sgd_update, learning_rate=0.01)

In [None]:
# Plot the losses

plt.plot(losses)
plt.title("Loss vs iterations")
plt.xlabel("Iterations")
plt.ylabel("MSE loss")
plt.show()

In [None]:
# Compute the train and test loss of the learned weights

print("Model train loss: {}".format(mse(y_train, f(X_train, weights, bias))))
print("Model test loss: {}".format(mse(y_test, f(X_test, weights, bias))))

Compare your learned weights and bias to the exact solution computed earlier. They should be fairly close:

In [None]:
# Print the learned weights and bias

print("Learned weights:")
print(weights.numpy())
print("Learned bias:")
print(bias.numpy())

In [None]:
# Print the exact weights and bias

print("Exact ML weights:")
print(weights_ml.numpy())
print("Exact ML bias:")
print(bias_ml.numpy())

#### MLP with Keras API

In the final part of this assignment, you will use the Keras API to build an MLP model to fit the data.

First, we will see how the linear regression model above can be implemented much more quickly using the `Sequential` class.

In the following function, you should build a `Sequential` model with just one `Dense` layer, which has a single output unit, and no activation function. This is the same as the linear regression model above.

* The function takes the `input_shape` as an argument, which should be passed to the `Dense` layer constructor to specify the input shape
* The function should build and return the `Sequential` object with one Dense layer with a single neuron, and no activation function
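
To see why a single `Dense` unit with no activation is the same as the linear model, the layer output can be compared with a manual computation from its weights. The sketch below is an illustration only: it declares the input shape with `tf.keras.Input`, which is one alternative to passing `input_shape` to the layer, and the name `toy_linear` and the random inputs are assumptions.

```python
import tensorflow as tf

toy_linear = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(1, activation=None),
])

# get_weights returns NumPy arrays: kernel of shape (10, 1) and bias of shape (1,)
kernel_np, bias_np = toy_linear.layers[0].get_weights()

toy_inputs = tf.random.normal((4, 10))
keras_out = toy_linear(toy_inputs).numpy()             # shape (4, 1)
manual_out = toy_inputs.numpy() @ kernel_np + bias_np  # same computation by hand

num_params = toy_linear.count_params()  # 10 weights + 1 bias
print(keras_out.shape, num_params)
```

Note that the Keras output has shape `(batch_size, 1)`, whereas the function `f` above returns shape `(batch_size,)`.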

In [None]:
#### GRADED CELL ####

# Complete the following function. 
# Make sure to not change the function name or arguments.

def sequential_linear_regression(input_shape):
    """
    This function takes the input_shape as argument to build a Sequential model as 
    specified above. 
    The function should then return the Sequential model.
    """
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(1, activation=None, input_shape=input_shape)
    ])
    return model

In [None]:
# Run your function to build the model and print the model summary

model = sequential_linear_regression(input_shape=(10,))
model.summary()

You should now compile and fit the model to the training data. 

* The following function takes the following arguments:
  * `sequential_model`: a Sequential model to fit to the training data
  * `num_epochs`: a positive integer that defines the number of epochs to train the model
  * `training_data`: a 2-tuple of Tensors (inputs, targets) for the complete training data
  * `batch_size`: a positive integer that defines the number of examples in each minibatch
* The function should compile the model with the mean squared error loss and the SGD optimizer
* The function should then fit the model to the training data for `num_epochs` epochs and save the returned history object
* Your function should then return the history object
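
The history object returned by `fit` stores one loss value per epoch in its `history` dictionary. The sketch below shows this on synthetic data; the toy arrays, the three input features and the name `toy_model` are all assumptions for illustration.

```python
import numpy as np
import tensorflow as tf

# Synthetic regression data with a known linear relationship
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(64, 3)).astype("float32")
toy_y = toy_X @ np.array([1., -2., 0.5], dtype="float32")

toy_model = tf.keras.Sequential([
    tf.keras.Input(shape=(3,)),
    tf.keras.layers.Dense(1),
])
toy_model.compile(loss="mse", optimizer="sgd")

# verbose=0 suppresses the per-epoch progress output
toy_history = toy_model.fit(toy_X, toy_y, epochs=3, batch_size=16, verbose=0)
epoch_losses = toy_history.history["loss"]

print(epoch_losses)
```

Unlike the per-minibatch losses collected in the low-level training loop, these values are averaged over each epoch.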

In [None]:
#### GRADED CELL ####

# Complete the following function. 
# Make sure to not change the function name or arguments.

def compile_and_fit(sequential_model, num_epochs, training_data, batch_size):
    """
    This function should compile and fit the sequential_model as described above. 
    The function should then return the history object that is returned from the fit method.
    """
    X_train, y_train = training_data
    sequential_model.compile(loss='mse', optimizer='sgd')
    history = sequential_model.fit(X_train, y_train, epochs=num_epochs, batch_size=batch_size)
    return history

In [None]:
# Run your function to compile and fit the model

history = compile_and_fit(model, num_epochs=20, training_data=(X_train, y_train), batch_size=128)

In [None]:
# Plot the losses

plt.plot(history.history['loss'])
plt.title("Loss vs epochs")
plt.xlabel("Epochs")
plt.ylabel("MSE loss")
plt.show()

In [None]:
# Compute the train and test loss of the Sequential model

print("Model train loss: {}".format(model.evaluate(X_train, y_train, verbose=0)))
print("Model test loss: {}".format(model.evaluate(X_test, y_test, verbose=0)))

Compare your model's weights and bias to the exact ML solution:

In [None]:
# Print the model's weights and bias

print("Learned weights:")
print(model.layers[0].kernel.numpy())
print("Learned bias:")
print(model.layers[0].bias.numpy())

In [None]:
# Print the exact weights and bias

print("Exact ML weights:")
print(weights_ml.numpy())
print("Exact ML bias:")
print(bias_ml.numpy())

Let's see if we can improve the model's performance by increasing its capacity. 

You should now complete the following function to build, compile and fit a new Sequential model.

* This function takes the following arguments:
  * `input_shape`: to use in the first layer of the model to set the input shape
  * `num_epochs`: a positive integer that defines the number of epochs to train the model
  * `training_data`: a 2-tuple of Tensors (inputs, targets) for the complete training data
  * `batch_size`: a positive integer that defines the number of examples in each minibatch
* This `Sequential` model should use two `Dense` layers:
  * The first `Dense` layer will be a hidden layer with 16 units and a sigmoid activation function
  * The output layer will again be a `Dense` layer with a single neuron and no activation function
* You should again compile your model with the mean squared error loss function and SGD optimizer
* You should again fit your model for `num_epochs` epochs with a batch size of `batch_size` on the `training_data`
* Your function should return a tuple containing the model and the history object
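
It can be useful to check how much capacity the hidden layer adds. For 10 input features, the hidden layer has $10 \times 16 + 16 = 176$ parameters and the output layer has $16 + 1 = 17$, giving 193 in total, compared with 11 for the linear model. The sketch below confirms this count; it builds an untrained model with `tf.keras.Input` declaring the input shape, and the name `toy_mlp` is an assumption for illustration.

```python
import tensorflow as tf

toy_mlp = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(16, activation="sigmoid"),
    tf.keras.layers.Dense(1),
])

hidden_params = toy_mlp.layers[0].count_params()  # weights and biases of the hidden layer
output_params = toy_mlp.layers[1].count_params()  # weights and bias of the output layer
total_params = toy_mlp.count_params()

print(hidden_params, output_params, total_params)
```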

In [None]:
#### GRADED CELL ####

# Complete the following function. 
# Make sure to not change the function name or arguments.

def build_and_train_mlp(input_shape, num_epochs, training_data, batch_size):
    """
    This function takes the input_shape, num_epochs, training_data tuple and batch_size
    as arguments. It should build, compile and fit the MLP model as described above.
    The function should then return the tuple (mlp_model, history)
    """
    X_train, y_train = training_data
    mlp_model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation='sigmoid', input_shape=input_shape),
        tf.keras.layers.Dense(1)
    ])
    mlp_model.compile(loss='mse', optimizer='sgd')
    history = mlp_model.fit(X_train, y_train, epochs=num_epochs, batch_size=batch_size)
    return mlp_model, history

In [None]:
# Run your function to build and train the MLP model

mlp_model, history = build_and_train_mlp(input_shape=(10,), num_epochs=20, 
                                         training_data=(X_train, y_train), batch_size=128)

In [None]:
# Plot the losses

plt.plot(history.history['loss'])
plt.title("Loss vs epochs")
plt.xlabel("Epochs")
plt.ylabel("MSE loss")
plt.show()

In [None]:
# Compute the train and test loss of the MLP

print("Model train loss: {}".format(mlp_model.evaluate(X_train, y_train, verbose=0)))
print("Model test loss: {}".format(mlp_model.evaluate(X_test, y_test, verbose=0)))

Did the model performance improve? 

Further gains could be made by training the model for longer, and/or increasing its capacity further. However, we need to be aware of overfitting, and should use a held-out validation set for model selection.

In the next week of the module we will see how to properly validate our models, as well as regularisation methods to combat overfitting. We will also expand our options for network optimisation, including studying the all-important backpropagation algorithm, and further develop our skills with TensorFlow.