# Lab 6: Non-linear Classifiers (Part 2: ANNs)
CSE2510 Machine Learning 2023/2024  

*Originally developed for TI3145TU Machine Learning and Introduction to AI*  
*Revised for CSE2510 Machine Learning*

* **What?** This nonmandatory lab consists of several programming tasks and pen-and-paper questions. 

* **Why?** The exercises are meant to help you learn about the concepts of neural networks.

* **How?** Follow the exercises in the notebook on your own or with a friend. For questions and feedback please consult the TAs during the lab session.

$\newcommand{\q}[1]{\rightarrow \textbf{Question #1}}$
$\newcommand{\ex}[1]{\rightarrow \textbf{Exercise #1}}$

$\ex{1}$ During the lecture you learned about the building block of neural networks: the perceptron. Although there have been many developments in the field since its invention in 1958, almost all newer forms of neural networks are based on the same idea of interconnected perceptrons. A single perceptron is limited because it is a linear classifier, while many classification tasks are not linearly separable. Nevertheless, it is important to understand how a single perceptron works before we move on to neural networks.

![Image of a perceptron](images/perceptron_1.png)
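
As a minimal sketch of the computation a single perceptron performs, the snippet below computes the weighted sum $z$ and applies the Heaviside step function. The weights and bias are hypothetical placeholders chosen for illustration; they are not the values from the figure.

```python
def heaviside(z):
    """Heaviside step function: 1 if z > 0, else 0."""
    return 1 if z > 0 else 0


def perceptron(x1, x2, w1, w2, b):
    """Return the weighted sum z and the output y of a two-input perceptron."""
    z = w1 * x1 + w2 * x2 + b
    return z, heaviside(z)


# Hypothetical weights and bias, NOT the ones shown in the figure
w1, w2, b = 0.5, 0.5, -0.25

for x1, x2 in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
    z, y = perceptron(x1, x2, w1, w2, b)
    print(f"x1={x1:2d}, x2={x2:2d} -> z={z:5.2f}, y={y}")
```

With these placeholder values the perceptron outputs 1 only for the input $(1, 1)$, so it behaves like a logical AND.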




<div style="background-color:#c2eafa">

$\q{1.1}$ Above you can see a perceptron with two inputs and one bias. It uses the Heaviside step function ($H(x) = 1$ if $x > 0$, else $0$) as the activation function. Compute $z$ (the weighted sum of the inputs) and the output $y$ for each of the given inputs.

|         x1        |         x2        |     z     |     y     |
|-------------------|-------------------|-----------|-----------|
|         -1        |         -1        |           |           |
|         -1        |          1        |           |           |
|          1        |         -1        |           |           |
|          1        |          1        |           |           |


<div style="background-color:#f1be3e">
    
[//]: # (START ANSWER)
_Write your answer here._

[//]: # (END ANSWER)

Let's create a set of four samples describing the XOR problem above.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

xs = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]])
ys = np.array([1, 0, 0, 1])

plt.scatter(xs[:, 0], xs[:, 1], c=ys)
plt.title("XOR problem")

plt.show()

<div style="background-color:#c2eafa"> 
    
$\q{1.2}$ Can the pattern from above be learned by a single perceptron? Why (not)?

<div style="background-color:#f1be3e">

[//]: # (START ANSWER)
_Write your answer here._

[//]: # (END ANSWER)


<div style="background-color:#c2eafa"> 

$\q{1.3}$ Can the same pattern be learned by a two-layer perceptron? Give an informal argument why.

<div style="background-color:#f1be3e">

[//]: # (START ANSWER)
_Write your answer here._

[//]: # (END ANSWER)

$\ex{2}$ A multilayer perceptron (MLP) is an example of a neural network: a set of perceptrons organized into layers, where the outputs of layer $n$ feed into layer $n + 1$. MLPs are powerful classifiers that are used in many different classification tasks. 

<div style="background-color:#c2eafa">

$\q{2.1}$ Explain in 2-3 sentences the idea of the backpropagation algorithm.

<div style="background-color:#f1be3e">
    
[//]: # (START ANSWER)
_Write your answer here._

[//]: # (END ANSWER)

<div style="background-color:#c2eafa">

$\q{2.2}$ Why do we need to apply backpropagation to train large neural networks?

<div style="background-color:#f1be3e">

[//]: # (START ANSWER)
_Write your answer here._

[//]: # (END ANSWER)

<div style="background-color:#c2eafa">

$\q{2.3}$ What is a forward pass in neural networks? What do we compute during the forward pass?

<div style="background-color:#f1be3e">

[//]: # (START ANSWER)
_Write your answer here._

[//]: # (END ANSWER)

$\ex{3}$ We will now implement the multi-layer perceptron that you have seen discussed during the lecture (see below), and use it to classify the samples of the XOR problem. Our network will consist of one hidden layer with _k_ nodes. To build the MLP, we will make use of the [MLPClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier) from a very popular Python machine learning library, `scikit-learn`, which you will also use in the bonus assignment.

![Multi-layer perceptron](images/multilayer_perceptron.png)

In [None]:
# We will use this function to show the decision boundary of our model
def plot_model(model, X):
    h = 0.005  # step size in the mesh
    # create a mesh to plot in
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))


    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=0.5)
    plt.show()

<div style="background-color:#c2eafa">

$\q{3.1}$ Set the values of hyper-parameters `k` and `learning_rate_init` such that our network can learn the pattern within 10 epochs. Feel free to experiment with the values (e.g. try what happens when the number of hidden neurons changes).
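
To build some intuition for the role of the learning rate, here is a minimal sketch of plain gradient descent on a one-dimensional toy loss. The function `gradient_descent` and the loss $f(w) = (w - 3)^2$ are assumptions made for illustration; they are unrelated to the loss that `MLPClassifier` actually optimizes.

```python
def gradient_descent(learning_rate, steps=10, w=0.0):
    """Minimise the toy loss f(w) = (w - 3)^2 with plain gradient descent."""
    for _ in range(steps):
        grad = 2 * (w - 3)            # derivative of (w - 3)^2
        w = w - learning_rate * grad  # step against the gradient
    return w


# The minimum is at w = 3; compare how close each learning rate gets in 10 steps
for lr in (0.01, 0.4, 1.1):
    print(f"learning rate {lr}: w after 10 steps = {gradient_descent(lr):.3f}")
```

In this toy setting, a very small learning rate barely moves towards the minimum in 10 steps, while a learning rate that is too large overshoots and moves further away with every step.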

In [None]:
from sklearn.neural_network import MLPClassifier

# The number of nodes in the hidden layer (index k in the diagram above)
k = None
# The (initial) learning rate for our multilayer perceptron
learning_rate_init = None

# START ANSWER
# END ANSWER

# Our MLP classifier may have more than one hidden layer, hence sizes are given as an n-tuple
hidden_layer_sizes = (k,)
model = MLPClassifier(hidden_layer_sizes=hidden_layer_sizes,
                      solver='sgd', # We will use Stochastic Gradient Descent to optimize the loss function
                      activation = 'relu', # Rectified linear unit activation function: f(x) = max(0, x)
                      learning_rate_init=learning_rate_init, # Learning rate: controls the step size of the weight updates
                      random_state = 42)
 
# We will train the model over 10 epochs
epochs = 10
for i in range(epochs):
    # We use partial_fit() to update the model in a single iteration over training data
    model.partial_fit(xs, ys, np.unique(ys))
    # We plot the decision boundary of the model
    plt.scatter(xs[:, 0], xs[:, 1], c=ys)
    plt.title(f"Epoch: {i + 1}, training set accuracy: {model.score(xs, ys)}")
    plot_model(model, xs)

<div style="background-color:#c2eafa">

$\q{3.2}$ Consider how the shape of the decision boundary changes between the iterations. Would you say it is easy to predict how the decision boundary of an (arbitrary) MLP model will evolve over time? Why is that the case?

<div style="background-color:#f1be3e"> 
    
[//]: # (START ANSWER)
_Write your answer here._

[//]: # (END ANSWER)

<div style="background-color:#c2eafa">

$\q{3.3}$ How do the changes in the shape of the decision boundary of an MLP model compare to what we would expect from a Decision Tree? Which model tends to behave more predictably?

<div style="background-color:#f1be3e"> 
    
[//]: # (START ANSWER)
_Write your answer here._

[//]: # (END ANSWER)

$\ex{4}$ Although scikit-learn can train artificial neural networks, its support for them is rather rudimentary. Instead, we will use [**Keras**](https://keras.io/getting_started/), part of Google's TensorFlow library for machine learning, to train neural networks. In this exercise we will go through the steps of creating a neural network with Keras that will allow us to classify images of clothing. To that end, we will load the [Fashion MNIST](https://github.com/zalandoresearch/fashion-mnist) dataset, which is already split into a training set of 60000 images and a test set of 10000 images. Each sample corresponds to a 28 by 28 pixel image.

In [None]:
from tensorflow import keras
import numpy as np

(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()
X_train_full.shape, X_test.shape

As we have not worked with Fashion-MNIST in previous labs, it is useful to see what the various images look like. For this, run the cell below.

In [None]:
labels = {0: "T-shirt/top", 1: "Trouser", 2: "Pullover", 3: "Dress", 4: "Coat",
          5: "Sandal", 6: "Shirt", 7: "Sneaker", 8: "Bag", 9: "Ankle boot"}

fig, axs = plt.subplots(10, 5, figsize=(12, 12))
axs = axs.flatten()
for index, (image, ax) in enumerate(zip(X_train_full, axs)):
    ax.imshow(image, cmap='gray', interpolation="nearest")
    ax.set_title(labels[y_train_full[index]])
    ax.axis('off')
    
fig.tight_layout()
plt.show()

<div style="background-color:#c2eafa">

$\q{4.1}$ The values of the features range from $0$ to $255$. We would like to have them mapped to the range from $0$ to $1$. Convert all images from the training set and the test set.

In [None]:
# START ANSWER
# END ANSWER

assert np.isclose(np.amin(X_train_full), 0.0)
assert np.isclose(np.amax(X_train_full), 1.0)
assert np.isclose(np.amin(X_test), 0.0)
assert np.isclose(np.amax(X_test), 1.0)

<div style="background-color:#c2eafa">

$\q{4.2}$ Before we start to develop a model, we would also like to use 10% of the training set as a validation set. Use the function [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) from scikit-learn to generate a training set of 90% and validation set of 10%.

In [None]:
from sklearn.model_selection import train_test_split
random_state = 42
X_train, X_validation, y_train, y_validation = None, None, None, None

# START ANSWER
# END ANSWER

assert X_train.shape == (54000, 28, 28)
assert X_validation.shape == (6000, 28, 28)

Although Keras allows for the creation of arbitrarily complex neural networks, we will stick to a simple example of a so-called Fully Connected Network. We can define our neural network as a [`Sequential`](https://keras.io/api/models/sequential/) model (representing a sequence of layers one after the other) where each layer takes one array as input and outputs another array. Such a model starts with an input layer, followed by a number of hidden layers, and finally an output layer. For our purposes all layers will be `Dense`, which means that every single neuron in layer $k$ is connected to every single neuron in layer $k + 1$. 
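
What flattening and a single dense layer compute can be sketched in plain NumPy. This is only an illustration under stated assumptions: the random "image", the 5 neurons, and the random weights are hypothetical, and the sketch does not reproduce how Keras implements its layers internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# One hypothetical 28x28 "image" with values in [0, 1)
image = rng.random((28, 28))

# Flatten: 28 x 28 pixels -> a vector of 784 features
x = image.reshape(-1)

# Dense layer with 5 neurons (an arbitrary number chosen for illustration):
# every input feature is connected to every neuron, hence a 784 x 5 weight matrix
W = rng.normal(scale=0.01, size=(784, 5))
b = np.zeros(5)

# ReLU activation applied to the weighted sums: f(z) = max(0, z)
hidden = np.maximum(0, x @ W + b)

print(x.shape, W.shape, hidden.shape)
```

The output of this layer would in turn serve as the input of the next dense layer.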

<div style="background-color:#c2eafa">

$\q{4.3}$ Inspect and finalize the code below.

In [None]:
# input_shape is a list of numbers corresponding to the shape of a sample, i.e. the shape of a single image
input_shape = [None, None]

# START ANSWER
# END ANSWER

# Number of neurons in the hidden layer(s) is a hyper-parameter which needs to be optimized
# We can guess that the number required here would be several dozen, maybe several hundred neurons
neurons_1 = None
neurons_2 = None

# START ANSWER
# END ANSWER

# Number of neurons in the output layer corresponds to the number of classes
output_neurons = 0

# START ANSWER
# END ANSWER

model = keras.models.Sequential(
    [
        # Flatten layer converts the input into a 1-dimensional array
        keras.layers.Flatten(input_shape=input_shape),
        # Dense layers allow for several different activation functions, we will use ReLU which is a popular choice
        keras.layers.Dense(neurons_1, activation="relu"),
        # Each Dense layer can also receive regularizers for the bias, weights, and output
        keras.layers.Dense(neurons_2, activation="relu"),
        # Output layers commonly use the softmax function as activation but other options are also possible
        keras.layers.Dense(output_neurons, activation="softmax"),
    ]
)

# We can get the summary of the model which includes the number of parameters that require optimization in training
model.summary()

The number of parameters of each dense layer above is the number of connections between the neurons (the weights) plus the number of biases. Very complex neural networks may have billions of parameters, but ours shouldn't have more than 300-600 thousand. By default, all parameters are trainable (hence `Non-trainable params: 0`); however, in some cases we may want to keep a selection of parameters fixed at constant values, which Keras also allows.
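
The parameter count of a fully connected network can be checked by hand. Below is a minimal sketch; the helper `dense_params` and the hidden layer sizes are hypothetical and chosen only for illustration.

```python
def dense_params(n_inputs, n_neurons):
    """Trainable parameters of a dense layer: one weight per connection plus one bias per neuron."""
    return n_inputs * n_neurons + n_neurons


# Hypothetical layer sizes: 784 inputs, two hidden layers, 10 outputs
layer_sizes = [784, 100, 50, 10]

total = sum(dense_params(n_in, n_out)
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))
print(total)
```

You can compare the result of such a calculation for your own layer sizes with the numbers reported by `model.summary()`.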

Before we can train a model, we need to compile it. The `compile` function accepts various parameters, of which we specify three:
* `loss` is the quantity that is minimized during training. We will use [Sparse Categorical Cross-entropy](https://keras.io/api/losses/probabilistic_losses/#sparsecategoricalcrossentropy-class), but you don't need to understand how it works.
* `optimizer` is the technique used to change the weights of a model in order to minimize `loss`.
* `metrics` describe the performance of a model but are not directly optimized in training.
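
Although the details of the loss are not needed for this lab, the core idea fits in a few lines. The sketch below is a simplified NumPy version written for illustration, with hypothetical predictions; it omits the numerical safeguards a library implementation would use and is not the Keras code itself.

```python
import numpy as np


def sparse_categorical_crossentropy(y_true, y_prob):
    """Mean negative log-probability assigned to the correct class.

    y_true: integer class labels, shape (n_samples,)
    y_prob: predicted class probabilities, shape (n_samples, n_classes)
    """
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    # Pick, for every sample, the probability predicted for its true class
    p_correct = y_prob[np.arange(len(y_true)), y_true]
    return -np.mean(np.log(p_correct))


# Hypothetical predictions for three samples and three classes
y_true = [0, 2, 1]
confident = [[0.9, 0.05, 0.05], [0.1, 0.1, 0.8], [0.05, 0.9, 0.05]]
unsure = [[0.4, 0.3, 0.3], [0.3, 0.3, 0.4], [0.3, 0.4, 0.3]]

print(sparse_categorical_crossentropy(y_true, confident))
print(sparse_categorical_crossentropy(y_true, unsure))
```

The loss is lower when the model assigns a high probability to the correct class, which is why minimizing it pushes the network towards confident, correct predictions.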

In [None]:
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="sgd",
              metrics=["accuracy"])

<div style="background-color:#c2eafa">

$\q{4.4}$ Finally, run the code below to train our model:

In [None]:
%%time
# validation_data is a tuple containing the feature values and labels
validation_data = (X_validation, y_validation)

# In case the code takes more than 2-3 minutes, consider lowering the number of epochs
history = model.fit(X_train, y_train, epochs=15, validation_data=validation_data)

When our model is trained, we can inspect it to learn a lot of useful information. For example:

In [None]:
weights, biases = model.layers[2].get_weights()
print(weights.shape, biases.shape)
weights[0, :10], biases[:10]

<div style="background-color:#c2eafa">

$\q{4.5}$ What do the arrays above represent? Why are they shaped like this?

<div style="background-color:#f1be3e"> 
    
[//]: # (START ANSWER)
_Write your answer here._

[//]: # (END ANSWER)


We can also learn about the training process by plotting the loss on the training set and validation set against the number of epochs. This data is directly available in the `history` variable which is generated during the training of a model.

In [None]:
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Loss during training')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['training set', 'validation set'], loc='upper right')
plt.show()

Similarly, we may want to know how the accuracy of our model changed during training.

In [None]:
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Accuracy during training')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['training set', 'validation set'])
plt.show()

<div style="background-color:#c2eafa">
    
$\q{4.6}$ Does this model seem to have converged?

<div style="background-color:#f1be3e"> 

[//]: # (START ANSWER)
_Write your answer here._

[//]: # (END ANSWER)


Finally, we can evaluate the model on the test set. Keras calculates the `loss` and the performance `metrics` specified when the model was compiled and returns them as a list of two numbers. In our case these are `sparse_categorical_crossentropy` and `accuracy`.

In [None]:
performance = np.round(model.evaluate(X_test, y_test), 4)
print(f"Test set loss: {performance[0]}, test set accuracy: {performance[1]}")
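
The accuracy reported above is the fraction of test samples for which the class with the highest predicted probability matches the true label. A minimal sketch of that calculation, using hypothetical probabilities and labels rather than the output of our model:

```python
import numpy as np

# Hypothetical softmax outputs for four samples and three classes
probabilities = np.array([[0.7, 0.2, 0.1],
                          [0.1, 0.3, 0.6],
                          [0.2, 0.5, 0.3],
                          [0.6, 0.3, 0.1]])
true_labels = np.array([0, 2, 2, 0])

# The predicted class is the one with the highest probability
predicted_labels = probabilities.argmax(axis=1)

# Accuracy is the fraction of samples whose prediction matches the label
accuracy = np.mean(predicted_labels == true_labels)
print(predicted_labels, accuracy)
```

In this toy example three of the four predictions are correct, so the accuracy is 0.75.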

$\ex{5}$ As a demonstration of the capabilities of (deep) neural networks in the domain of image recognition, we will take a look at [ResNet50](https://keras.io/api/applications/resnet/#resnet50-function), which is also available via Keras (which offers [many other pretrained networks](https://keras.io/api/applications/)).

Many advanced neural networks that classify images use [convolutions](https://www.ibm.com/topics/convolutional-neural-networks) as building blocks. ResNet50 is a deep neural network with 50 layers, almost all of which are convolutional. It was trained on the foundational ImageNet dataset of over 1 million training images representing 1000 different classes.
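
To give a rough idea of what a convolution does, the sketch below slides a small kernel over an image and sums the element-wise products at every position. The helper `convolve2d`, the tiny image, and the hand-picked kernel are assumptions made for illustration; in a convolutional network the kernel values are learned during training. Strictly speaking this operation is a cross-correlation, which is what deep learning libraries usually call a convolution.

```python
import numpy as np


def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the element-wise products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            output[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return output


# Hypothetical 4x4 image: dark left half, bright right half
image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)

# A hand-picked kernel that responds to vertical edges (dark -> bright)
kernel = np.array([[-1, 1],
                   [-1, 1]], dtype=float)

print(convolve2d(image, kernel))
```

The output is non-zero only in the middle column, i.e. exactly where the image changes from dark to bright.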

<div style="background-color:#c2eafa">

$\q{5.1}$ What do you think is the biggest challenge when training neural networks from scratch?

<div style="background-color:#f1be3e"> 
    
[//]: # (START ANSWER)
_Write your answer here._

[//]: # (END ANSWER)


<div style="background-color:#c2eafa">

$\q{5.2}$ What is a good strategy to avoid this issue?  
**Hint:** take a look at the link for ResNet50


<div style="background-color:#f1be3e"> 


[//]: # (START ANSWER)
_Write your answer here._

[//]: # (END ANSWER)


We will see if ResNet can deal with the classification of the following images (but feel free to also try it with your own selection of images later!):

<img src="images/polar_bear.jpg" alt="Polar bear" width="200"/>
<img src="images/watermelon.jpg" alt="Watermelon" width="200"/>
<img src="images/tulip.jpg" alt="Tulip flower" width="200"/>

Let's load the neural network from the Keras library.

In [None]:
from tensorflow.keras.applications.resnet50 import ResNet50
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.resnet50 import preprocess_input, decode_predictions

model = ResNet50(weights='imagenet')

Here, `weights='imagenet'` indicates that we want to use the neural network weights that are the result of training on ImageNet. Because of this, we can directly use ResNet50 for classification. 

In [None]:
img_1 = image.load_img('images/polar_bear.jpg', target_size=(224, 224))
polar_bear = image.img_to_array(img_1)
polar_bear = np.expand_dims(polar_bear, axis=0)
polar_bear = preprocess_input(polar_bear)

img_2 = image.load_img('images/watermelon.jpg', target_size=(224, 224))
watermelon = image.img_to_array(img_2)
watermelon = np.expand_dims(watermelon, axis=0)
watermelon = preprocess_input(watermelon)

img_3 = image.load_img('images/tulip.jpg', target_size=(224, 224))
tulip = image.img_to_array(img_3)
tulip = np.expand_dims(tulip, axis=0)
tulip = preprocess_input(tulip)

preds_1 = model.predict(polar_bear)
preds_2 = model.predict(watermelon)
preds_3 = model.predict(tulip)

print('Predicted for polar bear:', decode_predictions(preds_1, top=3)[0])
print('Predicted for watermelon:', decode_predictions(preds_2, top=3)[0])
print('Predicted for tulip:', decode_predictions(preds_3, top=3)[0])
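
Conceptually, asking for the top 3 predictions amounts to sorting the predicted class probabilities and keeping the three largest. A minimal NumPy sketch with hypothetical class names and probabilities (this is not the implementation of `decode_predictions`):

```python
import numpy as np

# Hypothetical class names and predicted probabilities for a single image
class_names = np.array(["class_a", "class_b", "class_c", "class_d", "class_e"])
probabilities = np.array([0.05, 0.60, 0.10, 0.20, 0.05])

# Indices of the three largest probabilities, from highest to lowest
top3 = np.argsort(probabilities)[::-1][:3]

for name, p in zip(class_names[top3], probabilities[top3]):
    print(f"{name}: {p:.2f}")
```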

As you can see, ResNet has done really well with the polar bear image (prediction of `ice_bear` with a probability of 99.9%). It correctly identified the watermelon image as fruit (all three top predictions are in fact fruits); however, it incorrectly assumed that the image represented a `fig` (probability of 83.8%). 

In the last case, ResNet was completely off, identifying our flower as a type of yellow butterfly. Nevertheless, the assigned probability is rather low, so it seems that the network wasn't very certain of its own classification.

<div style="background-color:#c2eafa"> 
    
$\q{5.3}$ Why do you think ResNet wasn't able to identify the flower at all?  
**Hint:** You can inspect the full list of the classes of ResNet [here](https://gist.githubusercontent.com/yrevar/942d3a0ac09ec9e5eb3a/raw/238f720ff059c1f82f368259d1ca4ffa5dd8f9f5/imagenet1000_clsidx_to_labels.txt).

<div style="background-color:#f1be3e">

[//]: # (START ANSWER)
_Write your answer here._

[//]: # (END ANSWER)