# SWS3009 Hands On - Neural Networks

Please fill in your student number and name below. Each student is to make one submission.


## 1. Introduction

The objectives of this lab are:

    1. To familiarize you with how to create dense neural networks using Keras.
    2. To familiarize you with how to encode input and output vectors for neural networks.
    3. To give you some insight into how hyperparameters like learning rate and momentum affect training.
    
To save time we will train each experiment only for 50 epochs. This will lead to less than optimal results but is enough for you to make observations.

**HINT: YOU CAN HIT SHIFT-ENTER TO RUN EACH CELL. NOTE THAT IF A CELL IS DEPENDENT ON A PREVIOUS CELL, YOU WILL NEED TO RUN THE PREVIOUS CELL(S) FIRST.**


## 2. The Irises Dataset

We will now work again on the Irises Dataset, which we used in Lab 1, for classifying iris flowers into one of three possible types. As before we will consider four factors:

    1. Sepal length in cm
    2. Sepal width in cm
    3. Petal length in cm
    4. Petal width in cm

In this dataset there are 150 sample points. The code below loads the dataset and prints the first 10 rows so we have an idea of what it looks like.

In [1]:
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

iris = load_iris()

print("First 10 rows of data:")
print(iris.data[:10])

First 10 rows of data:
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]]


### 2.2 Scaling the Data

We make use of the MinMaxScaler to scale the inputs to between 0 and 1.  The code below does this and prints the first 10 rows again, to show us the difference.

In the next section we will investigate what happens if we use unscaled data.

In [2]:
scaler = MinMaxScaler()
scaler.fit(iris.data)
X = scaler.transform(iris.data)

print("First 10 rows of SCALED data.")
print(X[:10])

First 10 rows of SCALED data.
[[0.22222222 0.625      0.06779661 0.04166667]
 [0.16666667 0.41666667 0.06779661 0.04166667]
 [0.11111111 0.5        0.05084746 0.04166667]
 [0.08333333 0.45833333 0.08474576 0.04166667]
 [0.19444444 0.66666667 0.06779661 0.04166667]
 [0.30555556 0.79166667 0.11864407 0.125     ]
 [0.08333333 0.58333333 0.06779661 0.08333333]
 [0.19444444 0.58333333 0.08474576 0.04166667]
 [0.02777778 0.375      0.06779661 0.04166667]
 [0.16666667 0.45833333 0.08474576 0.        ]]
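
To make the scaling step concrete, here is a minimal pure-Python sketch of per-column min-max scaling, which is what MinMaxScaler does with its default range of 0 to 1. The helper name min_max_scale is made up for this illustration, and the four sample rows are taken from the first 10 rows printed earlier. Because the minimum and maximum here come from only these four rows rather than all 150, the scaled values differ from the output above.

```python
def min_max_scale(rows):
    """Scale each column to [0, 1] using (value - min) / (max - min)."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            # A constant column has zero range; map it to 0.0 to avoid dividing by zero.
            (value - lo) / (hi - lo) if hi != lo else 0.0
            for value, lo, hi in zip(row, lows, highs)
        ])
    return scaled

# Four rows copied from the first 10 rows of the dataset shown earlier.
sample = [
    [5.1, 3.5, 1.4, 0.2],
    [5.4, 3.9, 1.7, 0.4],
    [4.4, 2.9, 1.4, 0.2],
    [4.9, 3.1, 1.5, 0.1],
]

scaled_sample = min_max_scale(sample)
for row in scaled_sample:
    print([round(value, 3) for value in row])
```

In each column the smallest value maps to 0 and the largest to 1, so features measured on different scales end up in the same range.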


### 2.3 Encoding the Targets

In Lab 1 we saw that the target values (the type of iris flower) are stored as a vector of integers from 0 to 2. We can see the 150 labels below:


In [3]:
print(iris.target)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


We could train the neural network on these integer labels directly, but we will instead use "one-hot" encoding, where each label becomes a vector of _n_ entries (one per class) that is all 0's except for a single 1 at the index of the class. The table below shows how one-hot encoding works:

|   Value    |    One-Hot Encoding    |
|:----------:|:----------------------:|
| 0 | \[1 0 0\] |
| 1 | \[0 1 0\] |
| 2 | \[0 0 1\] |

Keras provides the to_categorical function to create one-hot vectors:



In [4]:
from tensorflow.keras.utils import to_categorical

# Pass the labels positionally: the first argument has been named
# differently across Keras versions.
Y = to_categorical(iris.target, num_classes = 3)
print(Y)

[[1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 ...
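
The mapping in the one-hot table can also be written out by hand. The following is a small pure-Python sketch, not the Keras implementation; the helper names one_hot and decode are made up for this illustration. It also shows the reverse step, recovering the integer label as the index of the largest entry, which is how a softmax output is usually turned back into a class.

```python
def one_hot(labels, num_classes):
    """Return one row per label, with 1.0 at the label's index and 0.0 elsewhere."""
    return [
        [1.0 if index == label else 0.0 for index in range(num_classes)]
        for label in labels
    ]

def decode(rows):
    """Recover integer labels: the index of the largest entry in each row."""
    return [row.index(max(row)) for row in rows]

labels = [0, 1, 2, 1]
encoded = one_hot(labels, num_classes=3)

print(encoded)
print(decode(encoded))
```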

Now let's split the data into training and testing data:



In [5]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, 
                                                    random_state = 1)


### 2.4 Building our Neural Network

Let's now build a simple neural network with a single hidden layer of 100 neurons, using the Stochastic Gradient Descent (SGD) optimizer, the ReLU activation function for the hidden layer, and softmax for the output layer.

The code to do this is shown below:

In [6]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD

# Create the neural network
nn = Sequential()
nn.add(Dense(100, input_shape = (4, ), activation = 'relu'))
nn.add(Dense(3, activation = 'softmax'))

# Create our optimizer
sgd = SGD(learning_rate = 0.1)

# 'Compile' the network to associate it with a loss function,
# an optimizer, and what metrics we want to track
# (metrics is passed as a list, which all Keras versions accept)
nn.compile(loss='categorical_crossentropy', optimizer=sgd, 
          metrics = ['accuracy'])
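
To give a feel for what the activations and the loss compute, here is a simplified pure-Python sketch of ReLU, softmax, and categorical cross-entropy for a single sample. It is an illustration only, not the Keras implementation (Keras, for example, clips probabilities to avoid taking the log of zero), and the input numbers are arbitrary.

```python
import math

def relu(values):
    """ReLU keeps positive values and replaces negative ones with 0."""
    return [max(0.0, v) for v in values]

def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(values)
    # Subtracting the largest score does not change the result
    # but keeps the exponentials numerically stable.
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot_target, probabilities):
    """Negative log of the probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probabilities))

hidden = relu([-1.0, 0.5, 2.0])          # arbitrary example activations
probs = softmax([2.0, 1.0, 0.1])         # arbitrary example output scores
loss = categorical_crossentropy([1.0, 0.0, 0.0], probs)

print(hidden)
print([round(p, 3) for p in probs])
print(round(loss, 3))
```

The loss shrinks towards 0 as the probability assigned to the correct class approaches 1.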



### 2.5 Training the Neural Network

As is usually the case, we can call the "fit" method to train the neural network for 50 epochs. We will shuffle the training data between epochs, and provide validation data.

In [7]:
nn.fit(X_train, Y_train, shuffle = True, epochs = 50, 
      validation_data = (X_test, Y_test))

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


<keras.callbacks.History at 0x161c56e20>
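
The fit method returns a History object (the line printed above), whose history attribute is a dictionary of per-epoch metric lists. With accuracy as the metric and validation data supplied, the keys are typically loss, accuracy, val_loss and val_accuracy. The sketch below summarises such a dictionary; the helper name summarise_history and the three-epoch values are invented for illustration and are not results from this lab.

```python
def summarise_history(history):
    """Summarise a dict that maps metric names to lists of per-epoch values."""
    train_acc = history["accuracy"]
    val_acc = history["val_accuracy"]
    # Epochs are numbered from 1, list indices from 0.
    best_epoch = max(range(len(val_acc)), key=lambda i: val_acc[i]) + 1
    return {
        "final_train_acc": train_acc[-1],
        "final_val_acc": val_acc[-1],
        "train_minus_val": train_acc[-1] - val_acc[-1],
        "best_val_epoch": best_epoch,
    }

# Hypothetical values, NOT output from the network above.
example_history = {
    "accuracy": [0.40, 0.65, 0.80],
    "val_accuracy": [0.35, 0.70, 0.75],
}

summary = summarise_history(example_history)
print(summary)
```

With the real network you would keep the return value, for example history = nn.fit(...), and pass history.history to the helper.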

---
#### Question 1

Run the code above. Do you see evidence of underfitting? Overfitting? Justify your answers. ***(4 MARKS)***

**Answer: Type your answer here. Do not press Return to start a new line; just let the text wrap around.**

_(For TA) Marks awarded: ____ / 4_

---

#### Question 2a

Consult the documentation for the SGD optimizer [here](https://keras.io/api/optimizers/sgd/). What does the learning_rate parameter (called lr in older versions of Keras) do? ***(1 MARK)***

**Answer: Type your answer here. Do not press Return to start a new line; just let the text wrap around.**

#### Question 2b

The documentation states that the momentum parameter "accelerates gradient descent in the relevant direction and dampens oscillations". Using Google or other means, illustrate what this means. ***(2 MARKS)***

**Answer: Type your answer here. Do not press Return to start a new line; just let the text wrap around.**

_(For TA) Marks awarded: ____ / 3_

----

#### Question 3a

We will now experiment with the learning rate (the learning_rate argument of SGD, shown as lr in the table below). Set it to each of the following values in turn and record the final training and validation accuracies in the respective columns. Also observe how the accuracies change over the training period, and record your observation in the "Remarks" column, e.g. "progresses steadily", "some oscillation", etc. ***(3 MARKS)***

**Answer: Fill in the table below.**

|  lr    | Training Acc. | Validation Acc. |      Remarks      |
|:------:|---------------|-----------------|-------------------|
|0.01    |               |                 |                   |
|0.1     |               |                 |                   |
|1.0     |               |                 |                   |
|10.0    |               |                 |                   |
|100     |               |                 |                   |
|1000    |               |                 |                   |
|10000   |               |                 |                   |
| 100000 |               |                 |                   |


#### Question 3b

Based on your observations above, comment on the effect of small and very large learning rates on the learning. ***(2 MARKS)***

**Answer: Type your answer here. Do not press Return to start a new line; just let the text wrap around.**

_(For TA) Marks awarded: ____ / 5_

### 2.5 Using Momentum

We will now experiment with the momentum term. To do this:

    1. Change the learning rate to 0.1.
    2. Set the momentum to 0.1. Note: Do not enable the nesterov parameter; leave it as False.
    
Run your neural network.
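
In Keras the momentum is set through the optimizer, for example SGD(learning_rate = 0.1, momentum = 0.1). As a rough illustration of the update rule given in the Keras SGD documentation (velocity = momentum * velocity - learning_rate * gradient, then w = w + velocity), the sketch below applies it to a toy one-dimensional function f(w) = w**2. The function, starting point, and helper name sgd_momentum are assumptions chosen only for this illustration; it is not a model of the iris network.

```python
def sgd_momentum(gradient, w, learning_rate, momentum, steps):
    """Run gradient descent with momentum and return every value of w visited."""
    velocity = 0.0
    path = [w]
    for _ in range(steps):
        # The velocity is a decaying sum of past gradient steps.
        velocity = momentum * velocity - learning_rate * gradient(w)
        w = w + velocity
        path.append(w)
    return path

def toy_gradient(w):
    """Gradient of the toy function f(w) = w**2."""
    return 2.0 * w

plain = sgd_momentum(toy_gradient, w=1.0, learning_rate=0.1, momentum=0.0, steps=2)
with_momentum = sgd_momentum(toy_gradient, w=1.0, learning_rate=0.1, momentum=0.5, steps=2)

print([round(w, 4) for w in plain])
print([round(w, 4) for w in with_momentum])
```

The first step is identical in both runs because the velocity starts at zero; from the second step onwards the previous step is partly carried over.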

---

#### Question 4a

Keeping the learning rate at 0.1, complete the table below using the momentum values shown. Again record any observations in the "Remarks" column. ***(3 MARKS)***

**Answer: Fill the table below**

| momentum | Training Acc. | Validation Acc. |      Remarks      |
|:--------:|---------------|-----------------|-------------------|
|0.001     |               |                 |                   |
|0.01      |               |                 |                   |
|0.1       |               |                 |                   |
|1.0       |               |                 |                   |

#### Question 4b

Based on your observations above, does the momentum term help in learning? ***(2 MARKS)***

**Answer: Type your answer here. Do not press Return to start a new line; just let the text wrap around.**

_(For TA) Marks awarded: ____ / 5_

---

### 2.6 Using Raw Unscaled Data

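We now repeat the experiment with the raw, unscaled inputs. The targets are still the one-hot vectors in Y, since the categorical cross-entropy loss expects one-hot targets. The code below creates 120 training samples and 30 testing samples (20% of the total of 150 samples):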

In [8]:
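# Raw measurements in cm, without MinMaxScaler applied
X_unscaled = iris.data
# Integer labels, kept for reference only; the split below uses the one-hot Y
Y_raw = iris.target
X_utrain, X_utest, Y_utrain, Y_utest = train_test_split(X_unscaled, Y,
                                                        test_size = 0.2,
                                                        random_state = 1)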

---

#### Question 5

Create a new neural network called "nn2" below using a single hidden layer of 100 neurons. Train using the data in X_utrain and Y_utrain, and validate with X_utest and Y_utest. Again use the SGD optimizer with a learning rate of 0.1 and no momentum, and train for 50 epochs. ***(3 MARKS)***

In [9]:
"""
Enter your code for Question 5 below. To TA: 3 marks for code.
"""




**(Question 5 continues)**

Observe the training and validation error. Does leaving the inputs unscaled affect the training? Why do you think this is so? What is the advantage of scaling? ***(5 MARKS)***

**Answer: Type your answer here. Do not press Return to start a new line; just let the text wrap around.**

_(For TA) Marks awarded: ____ / 8_

---

## 3 Conclusion

In this lab we saw how to create a simple dense neural network in Keras for the relatively simple task of classifying irises according to their sepal and petal measurements, and we observed how the learning rate, momentum, and input scaling affect training.


---

***FOR TA ONLY***

| Question |  Marks  |
|:--------:|:-------:|
|1         |     /4  |
|2         |     /3  |
|3         |     /5  |
|4         |     /5  |
|5         |     /8  |
|Total:    |     /25 |

