# Introduction to deep learning
Lisa Bonheme and Marek Grzes

University of Kent

COMP6360/8360, Teaching week 12

Last modified 13/11/2022

## Content
- [Installing Tensorflow on Anaconda](#Installing-Tensorflow-on-Anaconda)
- [Overview of the Moon dataset](#Overview-of-the-Moon-dataset)
- [Question 1 - Comparing logistic regression and deep learning](#Question-1---Comparing-logistic-regression-and-deep-learning)
- [Question 2 - The decision boundaries of deep neural networks](#Question-2---The-decision-boundaries-of-deep-neural-networks)
- [Question 3 - Implement the backpropagation algorithm](#Question-3---Implement-the-backpropagation-algorithm)
- [Question 4 - Tune your model](#Question-4---Tune-your-model)


### Installing Tensorflow on Anaconda
During this class, we will need tensorflow 2, which is not installed in miniconda by default.
You can install it using the anaconda navigator as follows:
- **Step 1**: Open anaconda navigator.<br/>
<img src="img/anaconda-step1.jpg"></img><br/>

- **Step 2**: Click on the environments tab and search for the Tensorflow 2 package.<br/>
<img src="img/anaconda-step2.jpg"></img><br/>

- **Step 3**: Select the Tensorflow 2 package and install it.<br/>
<img src="img/anaconda-step3.jpg"></img><br/>

(Images are taken from [this tutorial](https://www.tutorialspoint.com/add-packages-to-anaconda-environment-in-python))

You can also install `tensorflow` and other Python packages using `pip`. To do so, open a terminal window and, assuming that your Python environment is active in that window, install `tensorflow` by typing: `pip install tensorflow`.
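
Before starting, it may be worth checking that the packages used in this class can be found by your Python environment. The snippet below is a minimal sketch that only uses the standard library; the helper name `is_installed` is our own, and the list of package names is an assumption based on the imports used later in this notebook.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be found in the current environment."""
    return importlib.util.find_spec(package_name) is not None


# Package names assumed from the imports used in this notebook
for name in ("tensorflow", "sklearn", "seaborn", "matplotlib", "numpy"):
    print(name, "is installed" if is_installed(name) else "is MISSING")
```

If a package is reported as missing, you can install it with the navigator or with `pip` as described above.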

### Overview of the Moon dataset
The Moon dataset is an artificial dataset with two intertwined moon shapes belonging to two different classes.

The dataset is composed of 10000 data points with the following features:
- x position
- y position

So, our data matrix is of size (10000, 2); that is (`nb_data_points`, `nb_features`).

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [10, 10]
from sklearn import datasets

# Here we add some noise so that some points are a bit outside of the moon shapes
X, y = datasets.make_moons(n_samples=10000, noise=0.05)

print("The data shape is {}\n".format(X.shape))
print("The first 5 data points are \n{}\n".format(X[:5]))
print("The first 5 labels are {}\n".format(y[:5]))
print("The last 5 data points are \n{}\n".format(X[-5:]))
print("The last 5 labels are {}".format(y[-5:]))

Now, let us visualise what this dataset looks like.

In [None]:
import seaborn as sns
import numpy as np
ax = sns.scatterplot(x=X[:,0], y=X[:,1], hue=y)

### Question 1 - Comparing logistic regression and deep learning
Now that we have our dataset, we will create a function to plot the decision boundary of our models and use it to compare the behaviour of a logistic regression and a deep learning model. Note that logistic regression is another name for our familiar delta rule when the sigmoid activation function is used.

#### Plotting decision boundaries
Nothing to do here, but you can explore this function to see what it does if you are curious about it.

Note that this part of the class is inspired by [scikit learn example on multinomial logistic regression](https://scikit-learn.org/stable/auto_examples/linear_model/plot_logistic_multinomial.html#sphx-glr-auto-examples-linear-model-plot-logistic-multinomial-py).
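
If you would like to see how the function below turns a 2D grid into a list of points that a model can classify, here is a small sketch on a tiny grid. The grid limits and the stand-in `fake_predict` function are assumptions made for illustration only; the real function calls `clf.predict` on a much finer grid.

```python
import numpy as np

# A tiny grid: x takes the values 0, 1, 2 and y takes the values 0, 1
xx, yy = np.meshgrid(np.arange(0, 3, 1), np.arange(0, 2, 1))

# Flatten both grids and stack them column-wise: one (x, y) row per grid point
grid_points = np.c_[xx.ravel(), yy.ravel()]


def fake_predict(points):
    # Hypothetical stand-in for clf.predict: class 1 when x > y, class 0 otherwise
    return (points[:, 0] > points[:, 1]).astype(int)


# Predictions come back as a flat array, so we reshape them to the grid shape,
# which is the layout that plt.contourf expects
Z = fake_predict(grid_points).reshape(xx.shape)

print(xx.shape)           # one row per y value, one column per x value
print(grid_points.shape)  # one row per grid point, two features each
print(Z)
```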

In [None]:
# This function will plot the decision boundaries of any model given our 2D dataset. We will use this function many times in this class.
def plot_decision_boundaries(X, y, clf, step=None):
    # Create a mesh to plot in
    h = 0.02  # step size in the mesh
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    
    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.contourf(xx, yy, Z, cmap=plt.cm.Pastel1)
    plt.axis("tight")
    
    # If we are plotting the decision boundaries sequentially, we mention the corresponding epoch
    if step is not None:
        plt.title("Decision boundaries at epoch {}".format(step))

    # Also plot the training points, using one colour per class.
    # A named colour is passed directly, so no colormap is needed here.
    colors = ["blue", "orange", "green"]
    for i, color in zip(range(3), colors):
        idx = np.where(y == i)
        plt.scatter(X[idx, 0], X[idx, 1], c=color, edgecolor="black", s=20)
    plt.show()

#### The decision boundary of a logistic regression model
Let us train a logistic regression model and plot its decision boundaries after training.

In [None]:
from sklearn.linear_model import LogisticRegression
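# The next line fits the model to the data. Note that scikit-learn optimises the weights
# with the L-BFGS solver by default rather than with the plain delta rule; the model itself is the same.
clf = LogisticRegression(max_iter=100, random_state=0).fit(X, y)
print("The decision boundary of the logistic regression")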
plot_decision_boundaries(X, y, clf)

- **(1-1):** Is the model accurately classifying every data point? Justify your answer.

#### The decision boundary of a deep neural network
Now, let us create a simple neural network with no hidden layers and train it for a few epochs.

At the end of each epoch, we will plot the decision boundaries using the `BoundariesCallback` defined below.
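
To see which epochs will produce a plot, recall that Keras numbers epochs from 0 and that the callback below plots whenever `epoch % plot_freq == 0`. The following sketch is plain Python, with a helper name of our own (`plotted_epochs`), and simply lists the plotted epochs for a given setting.

```python
def plotted_epochs(epochs, plot_freq):
    """Return the epoch indices at which the decision boundaries would be plotted."""
    return [epoch for epoch in range(epochs) if epoch % plot_freq == 0]


print(plotted_epochs(epochs=11, plot_freq=2))  # [0, 2, 4, 6, 8, 10]
print(plotted_epochs(epochs=11, plot_freq=5))  # [0, 5, 10]
```

With `epochs=11` and `plot_freq=2`, the last training epoch (index 10) is among the plotted ones, so the final plot shows the model at the end of training.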

In [None]:
import tensorflow as tf 

class BoundariesCallback(tf.keras.callbacks.Callback):
    def __init__(self, X, y, clf, plot_freq=2):
        self._X = X
        self._y = y
        self._clf = clf
        self._plot_freq = plot_freq

    def on_epoch_end(self, epoch, logs=None):
        if epoch % self._plot_freq == 0:
            plot_decision_boundaries(self._X, self._y, self._clf, step=epoch)

Below is the deep model that we will use in the rest of this section.

In [None]:
from tensorflow import keras

class DeepModel:
    def __init__(self, n_units=[], activation_functions=[], learning_rate=0.005):
        self._history = None

        # This sequential model can be used to sequentially add layers.
        self._model = keras.Sequential()
        input_dim=2
        
        # We set the linear layers according to the given parameters
        for units, act in zip(n_units, activation_functions):
            self._model.add(keras.layers.Dense(units=units, activation=act, input_dim=input_dim))
            if input_dim is not None:
                input_dim = None
                
        self._model.add(keras.layers.Dense(units=1, input_dim=input_dim, activation='sigmoid'))
        
        
        # This will output the final architecture of the model
        self._model.summary()
        
        # We compile the model with a specific loss and optimiser; this is where the learning rate is set
        self._model.compile(optimizer=tf.optimizers.Adam(learning_rate=learning_rate), loss='binary_crossentropy', metrics=['accuracy'])
        
    def predict(self, X):
        return np.round(self._model.predict(X))
    
    def fit(self, X, y, callbacks=[], epochs=11, validation_split=0.2):
        # If you don't want to see the training log, you can add verbose=0 to the fit method's arguments below.
        self._history = self._model.fit(X, y, validation_split=validation_split, epochs=epochs, 
                                        batch_size=50, callbacks=callbacks)
        return self._history

Now, we have the definition of our deep learning model, and we can compute its decision boundaries during training.

Below, we are going to create and train a very simple neural network with no hidden layers.

In [None]:
tf.random.set_seed(0)
no_hidden_layer = DeepModel()

# You can change how often the decision boundaries are displayed by modifying plot_freq below. 
# Here we display the boundaries every 2 epochs; set it to 1 to display them after every epoch, for example.
callbacks = [BoundariesCallback(X, y, no_hidden_layer, plot_freq=2)]

# If you want to see how the model evolves after more training time, you can increase the epochs parameter below
res = no_hidden_layer.fit(X, y, callbacks=callbacks, epochs=11)

- **(1-2)** Compare the decision boundaries of this neural network with the decision boundaries of the logistic regression model computed earlier. Explain and justify your observations.

### Question 2 - The decision boundaries of deep neural networks
Let us repeat the last experiment with a hidden layer added to our neural network.

In [None]:
tf.random.set_seed(0)
# We add one hidden layer with 50 neurons and a linear activation.
# Passing None as the activation gives a linear activation, which is the Keras default.
one_hidden_layer = DeepModel(n_units=[50], activation_functions=[None])

# You can change how often the decision boundaries are displayed by modifying plot_freq below. 
# Here we display the boundaries every 2 epochs; set it to 1 to display them after every epoch, for example.
callbacks = [BoundariesCallback(X, y, one_hidden_layer, plot_freq=2)]

# If you want to see how the model evolves after more training time, you can increase the epochs parameter below
res = one_hidden_layer.fit(X, y, callbacks=callbacks, epochs=11)

- **(2-1)** Is the new model with a hidden layer better than the previous one, i.e., is the new decision boundary more accurate?
- **(2-2)** How do the training stages of both models differ? What similarities / differences can you identify?

#### Impact of the activation function
Using a deep neural network with one hidden layer, let us now investigate the impact of activation functions on our hidden layer and the final predictions.

Here, we use the "ReLU" function, which stands for Rectified Linear Unit. This function transforms the outputs of your hidden layer by keeping only the positive part, so that $f(x) = \max(0, x)$. In this formula, $x$ is the net input and $f(x)$ is the activation.
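
As a quick illustration of the formula above, here is a minimal NumPy sketch of ReLU applied to a few net inputs. The function name `relu` and the example values are our own; Keras provides its own implementation when you pass `"relu"` as the activation.

```python
import numpy as np


def relu(x):
    # Element-wise maximum between 0 and the net input
    return np.maximum(0, x)


net_inputs = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
activations = relu(net_inputs)

# Negative net inputs are set to 0, positive ones are left unchanged
print(activations)
```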

In [None]:
tf.random.set_seed(0)
# We add one hidden layer with 50 neurons and relu activation function
one_hidden_layer_relu = DeepModel(n_units=[50], activation_functions=["relu"])

# You can change how often the decision boundaries are displayed by modifying plot_freq below. 
# Here we display the boundaries every 2 epochs; set it to 1 to display them after every epoch, for example.
callbacks = [BoundariesCallback(X, y, one_hidden_layer_relu, plot_freq=2)]

# If you wish to see how the model evolves after more training time,
# you can increase the epochs parameter below.
res = one_hidden_layer_relu.fit(X, y, callbacks=callbacks, epochs=11)

- **(2-3)** How different are the decision boundaries learned with the new activation function? Explain and justify those differences.

#### Impact of the learning rate
The default learning rate is 0.005. Change its value, then observe and analyse the results. You are encouraged to run this test with both linear and ReLU activations in the hidden layer. See the comments in the code below.

In [None]:
tf.random.set_seed(0)

# Choose your learning rate here
lr = 0.005

# This network has linear units in the hidden layer.
# To test a network with ReLU in the hidden layer, use activation_functions=["relu"] instead.
# A new model is created each time so that the chosen learning rate is actually used.
one_hidden_layer_lr = DeepModel(n_units=[50], activation_functions=[None], learning_rate=lr)
callbacks = [BoundariesCallback(X, y, one_hidden_layer_lr, plot_freq=2)]
res = one_hidden_layer_lr.fit(X, y, callbacks=callbacks, epochs=11)

- **(2-4)** What happens with a very small learning rate?
- **(2-5)** What happens with a high learning rate?
- **(2-6)** Explain and justify the observations that you made in the last two steps.

### Question 3 - Implement the backpropagation algorithm (from scratch)
Now that you have seen how a deep model could be implemented using tensorflow 2, you will create your own implementation of a deep learning algorithm! We ask you to code the backpropagation algorithm that was presented in our lectures. All the equations required for this implementation are in our lecture slides. You will need to transfer them to your Python code below.

#### The sigmoid function
We have previously used the sigmoid activation function in the last layer of our deep model (and ReLU in its hidden layer), and we will need the sigmoid again for this question. If we use the sigmoid in this section, we can reuse the equations that we have in our lecture slides.

- **(3-1)** Implement the sigmoid function below using the formula seen in the lectures. You can also find it [here](https://en.wikipedia.org/wiki/Sigmoid_function).

In [None]:
def sigmoid(x):
    # TODO: Comment out the line below and implement me!
    raise NotImplementedError("Implement the sigmoid function before testing.")
    return 0

#### The backpropagation algorithm
Now that our sigmoid function is ready, let us define a skeleton for our custom deep learning algorithm.

The backpropagation algorithm for a feedforward network of two layers of sigmoid units can be defined as follows:

**For each** $(x, y)$ in the training examples, **DO**:
<ul>
    <li>
        <i>Propagate the input forward through the network:</i>
        <ul>
            <li> 
                1. Input the instance $x$ to the network and compute the output $o_u$ of every unit $u$ in the network.
            </li>
        </ul>
    </li>
    <li>
        <i>Propagate the errors backward through the network:</i>
        <ul>
            <li>
                2. For each network output unit $k$, calculate its error term
                $\delta_k \leftarrow -(y_k - o_k)$<br/>
                (here, $y_k$ is our $t_k$, and we don't multiply by $o_k(1-o_k)$ because we assume the CE error)
                </li>
            <li>
                3. For each hidden unit $h$, calculate its error term
                $\delta_h \leftarrow o_h(1- o_h) \sum_{k \in outputs} w_{kh}\delta_k$
            </li>
         </ul>    
    </li>
    <li>
        <i>Update the network weights:</i>
        <ul>
            <li>
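                4. Update each network weight $w_{ji} = w_{ji} + \Delta w_{ji}$<br/>where
                $\Delta w_{ji} = -\epsilon\delta_jx_{i}$ (for a weight from input $x_i$ to unit $j$) or $\Delta w_{ji} = -\epsilon\delta_jh_{i}$ (for a weight from hidden unit $i$, whose output is $h_i$, to unit $j$) and $\epsilon$ is the learning rate.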
            </li>
        </ul>
    </li>
</ul>

This procedure is based on Table 4.2 in _Machine learning, Tom Mitchell, McGraw-Hill Education, 1997_. Note that the example in this book assumes that the output unit uses sigmoid activation and the SSE error is optimised. In our discussion of the delta rule with sigmoid activation, we assumed the CE error, which made the update equation of the output units slightly simpler. We used this assumption in the pseudocode above.
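
To see where the simpler output term comes from, here is a short derivation sketch. It writes $net_k$ for the net input of output unit $k$ (a symbol that does not appear in the pseudocode above) and assumes $o_k = \sigma(net_k)$ with a binary target $y_k$:

$$
\begin{aligned}
E_{CE} &= -\left[\, y_k \ln o_k + (1 - y_k)\ln(1 - o_k) \,\right] \\
\frac{\partial E_{CE}}{\partial o_k} &= \frac{o_k - y_k}{o_k(1 - o_k)}, \qquad \frac{\partial o_k}{\partial net_k} = o_k(1 - o_k) \\
\delta_k = \frac{\partial E_{CE}}{\partial net_k} &= \frac{o_k - y_k}{o_k(1 - o_k)} \; o_k(1 - o_k) = -(y_k - o_k)
\end{aligned}
$$

With the SSE error $E_{SSE} = \frac{1}{2}(y_k - o_k)^2$, the factor $o_k(1 - o_k)$ does not cancel, and we obtain $\frac{\partial E_{SSE}}{\partial net_k} = -(y_k - o_k)\, o_k(1 - o_k)$ instead.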

Using the class `MyDeepModel` below, do the following:
- **(3-2)** Implement the `forward()` method which corresponds to step 1 of the backpropagation algorithm.
- **(3-3)** Implement the `backward()` method which corresponds to steps 2 and 3 of the backpropagation algorithm.
- **(3-4)** Implement the `update()` method which corresponds to step 4 of the backpropagation algorithm.

In [None]:
class Layer:
    """ This is a convenience class that we will use to define layers in our deep learning model
    """
    def __init__(self, input_dim, units, random_state):
        self._random_state = random_state
        # we initialise the weights randomly
        self.weights = self._random_state.normal(size=(input_dim, units))
        # we store the bias separately from the weights to retrieve it more easily afterwards.
        self.bias = self._random_state.normal(size=units)

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

class MyDeepModel:
    # learning rate seems to be a sensitive parameter
    def __init__(self, learning_rate=0.02):
        self._random_state = np.random.RandomState(0)
        self._history = []
        self.learning_rate = learning_rate

        self.layers = [
            # This is the hidden layer, corresponding to something like 
            # self._model.add(keras.layers.Dense(10, input_dim=2, activation="sigmoid"))
            # in our DeepModel
            Layer(2, 10, self._random_state),
            
            # This is the output layer corresponding to 
            # self._model.add(keras.layers.Dense(units=1, activation="sigmoid"))
            # in our DeepModel. Note that Layer only stores weights and a bias, so the sigmoid
            # function has to be applied explicitly in the forward pass.
            Layer(10, 1, self._random_state)
        ]
        
    def predict(self, X):
        """ Takes input data of shape (n_items, n_features) and returns an array of shape (n_items,) 
        containing the labels predicted for each input x
        """
        y_pred = []
        
        # Here we register the predictions of the model for each input
        # and store the results in a list
        for x in X:
            y_pred.append(self.forward(x)[-1])
        y_pred = np.round(np.array(y_pred))
        return y_pred
    
    def fit(self, X, y, epochs=11, validation_split=0.2):
        """ Train the model for a given number of epochs and plot the decision boundaries after each epoch.
        This method performs one step of the backpropagation algorithm for each data example
        by calling the forward, backward and update methods.
        """
        self._history = {"train": [], "test": []}
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=validation_split, 
                                                            random_state=self._random_state)
        
        # One epoch corresponds to the training steps needed to train the model on the whole dataset once
        for i in range(epochs):
            for xt, yt in zip(X_train, y_train):
                # The forward step predicts the class of the input
                try:
                    outputs = self.forward(xt)
                    # The backward step backpropagates the error
                    gradients = self.backward(outputs, yt)
                    # Now, we update the weights
                    self.update(xt, outputs, gradients)
                except NotImplementedError as e:
                    print(e)
                    return
            
            y_pred = self.predict(X_train)
            acc_train = accuracy_score(y_train, y_pred)   
            self._history["train"].append(acc_train)
            y_pred = self.predict(X_test)
            acc_test = accuracy_score(y_test, y_pred)
            self._history["test"].append(acc_test)
            print("Accuracy at epoch {}:\ntrain: {}\ntest: {}".format(i, acc_train, acc_test))
            plot_decision_boundaries(X, y, self, step=i)
        return self._history

    
    def forward(self, x):
        """ Forward pass of the backpropagation algorithm
        Takes one data example as input and returns the output values of each layer
        """
        outputs = []
        # TODO: Comment out the line below and implement me!
        raise NotImplementedError("Implement the forward function before testing.")
        
        return outputs

    def backward(self, outputs, y_true):
        """ Backward pass of the backpropagation algorithm
        Takes the outputs of the layers obtained in the forward pass, computes and backpropagates the error,
        and returns the obtained gradients
        """
        gradients = []
        # TODO: Comment out the line below and implement me!
        raise NotImplementedError("Implement the backward function before testing.")

        return gradients
    
    def update(self, x, outputs, gradients):
        """ Update the weights after the backward pass of the backpropagation algorithm
        Takes the gradients obtained during backpropagation and updates the weights of each layer
        """
        # TODO: Comment out the line below and implement me!
        raise NotImplementedError("Implement the update function before testing.")

        return None


- **(3-5)** Test your implementation and compare its results with the Keras implementation using the code below.
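
If your model behaves very differently from the Keras one, a common way to look for bugs in a hand-written backward pass is a finite-difference gradient check. The sketch below is not part of the original exercise: `numerical_gradient` is a hypothetical helper, and the quadratic `toy_loss` is a stand-in whose gradient is known exactly. To use it on your own model, you would replace `toy_loss` with the loss of your network viewed as a function of one weight array.

```python
import numpy as np


def numerical_gradient(loss_fn, weights, eps=1e-6):
    """Approximate the gradient of loss_fn with respect to weights using central differences."""
    grad = np.zeros_like(weights, dtype=float)
    for idx in np.ndindex(weights.shape):
        original = weights[idx]
        weights[idx] = original + eps
        loss_plus = loss_fn(weights)
        weights[idx] = original - eps
        loss_minus = loss_fn(weights)
        weights[idx] = original  # restore the weight before moving on
        grad[idx] = (loss_plus - loss_minus) / (2 * eps)
    return grad


def toy_loss(w):
    # Toy loss whose exact gradient with respect to w is w itself
    return 0.5 * np.sum(w ** 2)


w = np.array([[1.0, -2.0], [0.5, 3.0]])
analytic = w.copy()                       # the gradient we computed by hand
numeric = numerical_gradient(toy_loss, w)  # the gradient estimated numerically

# The two gradients should agree up to a small numerical error
print(np.allclose(analytic, numeric, atol=1e-4))
```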

In [None]:
my_model = MyDeepModel()
print("Training my model")
res = my_model.fit(X, y, epochs=11)

In [None]:
tf_model = DeepModel(n_units=[10], activation_functions=["sigmoid"])
print("Training Tensorflow model")
callbacks = [BoundariesCallback(X, y, tf_model, plot_freq=1)]
res = tf_model.fit(X, y, callbacks=callbacks, epochs=11)

### Question 4 - Tune your model
If you have finished early, and you'd like something more challenging to do, you can add a few options to refine your model. For example:
- backpropagate through multiple hidden layers,
- handle different batch sizes,
- do backpropagation with other activation functions,
- implement a stopping condition.
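
As a starting point for the last item, here is one possible stopping condition, sketched in plain Python. It assumes that you record one validation accuracy per epoch (as `fit` does in `self._history["test"]`); the function name `should_stop` and the `patience` and `min_delta` parameters are our own choices, not something defined earlier in this notebook.

```python
def should_stop(val_accuracies, patience=3, min_delta=0.0):
    """Return True when the validation accuracy has not improved during the last `patience` epochs."""
    if len(val_accuracies) <= patience:
        # Not enough epochs yet to make a decision
        return False
    best_before = max(val_accuracies[:-patience])
    recent_best = max(val_accuracies[-patience:])
    return recent_best <= best_before + min_delta


history = [0.80, 0.85, 0.88, 0.88, 0.87, 0.88]
print(should_stop(history, patience=3))      # True: no improvement over 0.88 in the last 3 epochs
print(should_stop(history[:4], patience=3))  # False: 0.88 improves on the earlier best of 0.80
```

You could call such a function at the end of each epoch in `fit` and leave the training loop when it returns `True`.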

In [None]:
import datetime
print("Last modified: ", datetime.datetime.now().strftime("%d/%m/%Y %H:%M:%S") + "\n")