**Instructions:**

- For questions that require coding, you need to write the relevant code and display its output. Your output should either be the direct answer to the question or clearly display the answer in it.
- For questions that require a written answer (sometimes along with the code), you need to put your answer in a Markdown cell. Writing the answer as a comment or as a print line is not acceptable.
- Receiving help from classmates and/or ideas from Generative AI is allowed. **However, you must submit your own original work.** 
- You need to render this file as HTML using Quarto and submit the HTML file. **Please note that this is a requirement and not optional.** A submission cannot be graded until it is properly rendered.

Import all the libraries and tools you need below.

In [28]:
# Import all packages
import tensorflow as tf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.metrics import accuracy_score, confusion_matrix
from keras.models import Sequential
from keras.layers import Dense, InputLayer
from keras.utils import to_categorical

Run the line given below to read the MNIST dataset. Reshape the training and test predictors. (This should be ready from the previous in-class assignment.)

In [18]:
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

In [19]:
# Flatten each 28 x 28 image into a 784-element vector, the input shape a dense (fully connected) network expects
x_train_new = x_train.copy()
x_test_new = x_test.copy()
x_train_new = np.reshape(x_train_new, (60000, 28 * 28))
x_test_new = np.reshape(x_test_new, (10000, 28 * 28))

# Test the shape
print(x_train_new.shape)
print(x_test_new.shape)

(60000, 784)
(10000, 784)
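As a side note, the same flattening can be written without hard-coding the number of images by passing `-1` for the dimension NumPy should infer. The sketch below is only an illustration: `images` is a small made-up stand-in for `x_train`, so it runs without downloading MNIST.

```python
import numpy as np

# Hypothetical stand-in for x_train: 3 "images" of 28 x 28 pixels
images = np.arange(3 * 28 * 28).reshape(3, 28, 28)

# Keep the first axis (number of images) and let NumPy infer the rest (28 * 28 = 784)
flat = images.reshape(images.shape[0], -1)

print(flat.shape)  # (3, 784)
```

Written this way, the same line would work for both the training and the test predictors.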


### 1)

Using a **keras tool**, one-hot-encode the training target/response values. **(10 points)**

In [20]:
# One-hot-encode the training response values (Using to_categorical)
y_train_encoded = to_categorical(y_train)

# Print y_train_encoded for sanity check
y_train_encoded

array([[0., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.]])
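To see what the encoding does, here is a minimal NumPy sketch of the same idea. It is not how Keras implements `to_categorical`; the `labels` array is illustrative, and indexing an identity matrix is just one assumed way to build the rows.

```python
import numpy as np

labels = np.array([5, 0, 4, 1, 9])  # illustrative digit labels
num_classes = 10

# Row k of the identity matrix is the one-hot vector for class k
one_hot = np.eye(num_classes)[labels]

# argmax along each row recovers the original label
recovered = np.argmax(one_hot, axis=1)

print(one_hot[0])   # a single 1 at index 5, zeros elsewhere
print(recovered)
```

Each row contains exactly one 1, in the position of the class it encodes, so no class is treated as larger or smaller than another.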

### 2)

Explain why one-hot-encoding the target values is necessary for classification with a neural network. (We have only one-hot-encoded categorical predictors up until this point!) **(10 points)**

In neural network classification, the underlying cost function, categorical cross-entropy, is defined in terms of the one-hot-encoded response values $y_{c}^{(i)}$, so the targets must be supplied in that form. The output layer must also have one node per class so that its shape matches the encoded target. With a single output node, the network would behave like a regression model and treat the labels as numbers. Without one-hot encoding, the model could therefore misinterpret the labels $\{0, 1, \dots, 9\}$ as having an inherent order or magnitude, which does not hold for a classification problem.
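As a rough illustration of why the loss needs a one-hot target, the sketch below computes the categorical cross-entropy for one observation, $-\sum_c y_c^{(i)} \log \hat{y}_c^{(i)}$, where $\hat{y}_c^{(i)}$ (notation assumed here) is the softmax probability for class $c$. The logits and the three-class setup are made up for the example.

```python
import numpy as np

def softmax(z):
    """Turn raw scores into probabilities that sum to 1."""
    z = z - np.max(z)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def categorical_cross_entropy(y_one_hot, y_prob):
    """Cross-entropy for one observation with a one-hot target."""
    return -np.sum(y_one_hot * np.log(y_prob))

logits = np.array([2.0, 1.0, 0.1])   # made-up output scores for 3 classes
probs = softmax(logits)
y_true = np.array([1.0, 0.0, 0.0])   # one-hot target: the true class is 0

loss = categorical_cross_entropy(y_true, probs)
print(probs.sum())
print(loss, -np.log(probs[0]))
```

Because the target is one-hot, only the log-probability of the true class survives the sum, so the two values on the last line agree.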

### 3)

Create a **five-layer** network. Use 200, 100 and 50 nodes for each hidden layer, respectively. **You need to use both `InputLayer` and `Dense` objects (and only them) for credit.** 

Add the proper non-linear functions to the hidden and output layers. For the hidden layers, you should use the **most common** function that avoids the vanishing gradients problem. (Use the actual function, not its modified versions. The extensions will come later.) For the output layer, you are expected to know the only proper function to use.

**In or between the layers, do not add any extra inputs that are not instructed. All components of a neural network will appear gradually in future in-class assignments.**

**(30 points)**

In [24]:
# Create a five-layer network: input layer, three hidden layers, output layer
network_mnist = Sequential()
network_mnist.add(InputLayer(shape= (x_train_new.shape[1],))) # Input layer: number of nodes = number of predictors (784)
network_mnist.add(Dense(200, activation= "relu", kernel_initializer= 'HeNormal'))
network_mnist.add(Dense(100, activation= "relu", kernel_initializer= 'HeNormal'))
network_mnist.add(Dense(50, activation= "relu", kernel_initializer= 'HeNormal'))
network_mnist.add(Dense(y_train_encoded.shape[1], activation= "softmax", name = 'output')) # Output layer: number of nodes = number of classes (10)
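For reference, the number of trainable parameters implied by these layer sizes can be worked out by hand. The sketch below assumes the standard dense-layer count (one weight per input-output pair plus one bias per output node) and uses a hypothetical `layer_sizes` list rather than the model object.

```python
# Layer widths: input (784), three hidden layers, output (10)
layer_sizes = [784, 200, 100, 50, 10]

# Each dense layer has n_in * n_out weights plus n_out biases
params_per_layer = [
    n_in * n_out + n_out
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]

print(params_per_layer)       # [157000, 20100, 5050, 510]
print(sum(params_per_layer))  # 182660
```

Most of the parameters sit in the first hidden layer, which is one practical consequence of narrowing the network as it gets deeper.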


### 4)

Why is it very common practice to have fewer nodes in deeper hidden layers? **(10 points)**

It is common because the earlier layers learn many low-level features, such as edges and diagonal lines, and need many nodes to capture them. Deeper layers combine these features into fewer, more abstract representations of patterns and spatial structure. For example, in a face-recognition network, a single node in a deep layer can respond to a face that is built up from the thousands of pixels and lines detected by the earlier layers.

### 5)

`compile` the network **only** with the cost function. You need to use the proper cost function for the task at hand and specify it as a **string, not as an object**. Do not account for any sparsity in the data.

Print the network summary.

**(10 points)**

In [25]:
# Use compile with loss function 
network_mnist.compile(loss = 'categorical_crossentropy', optimizer = 'adam') 

network_mnist.summary()  # summary() prints the table itself; wrapping it in print() also prints None


### 6)

Train the network for 5 epochs and a batch size of 100 (more on batch size later). **Do not use any other inputs.** Save the training line to an output variable, named `history`.

Print the test accuracy and the test confusion matrix. Note that you need to process the direct model output.

**(30 points)**
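As a quick check on what the batch size means here: with 60,000 training images and a batch size of 100, each epoch consists of 600 weight updates, which is the `600/600` counter in the training log. The same arithmetic explains the `313/313` counter during prediction, since `predict` uses a default batch size of 32. A small sketch, with variable names chosen only for this example:

```python
import math

n_train, train_batch_size = 60000, 100
n_test, predict_batch_size = 10000, 32  # 32 is the Keras default for predict()

# Number of batches needed to pass over the data once (last batch may be partial)
steps_per_epoch = math.ceil(n_train / train_batch_size)
predict_steps = math.ceil(n_test / predict_batch_size)

print(steps_per_epoch)  # 600
print(predict_steps)    # 313
```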

In [26]:
# Train the network (no validation split or early stopping)
history = network_mnist.fit(x_train_new, y_train_encoded, epochs= 5, batch_size= 100)

Epoch 1/5
600/600 ━━━━━━━━━━━━━━━━━━━━ 3s 3ms/step - loss: 6.1107
Epoch 2/5
600/600 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - loss: 0.4405
Epoch 3/5
600/600 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - loss: 0.2940
Epoch 4/5
600/600 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - loss: 0.2047
Epoch 5/5
600/600 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - loss: 0.1773


In [31]:
# Print the test accuracy and the test confusion matrix
y_pred = network_mnist.predict(x_test_new)
y_pred_classified = np.argmax(y_pred, axis= 1) # For each row, take the index of the highest predicted probability (the predicted digit)

# Test Accuracy
print(f'Accuracy: {accuracy_score(y_test, y_pred_classified):.4f}')
print(f'Confusion Matrix: {confusion_matrix(y_test, y_pred_classified)}')


313/313 ━━━━━━━━━━━━━━━━━━━━ 0s 861us/step
Accuracy: 0.9537
Confusion Matrix: [[ 966    0    2    0    0    2    0    1    7    2]
 [   0 1122    4    1    0    2    1    0    4    1]
 [   7    2  983    6   10    1    2    5   16    0]
 [   0    0   12  967    0    6    0    5   18    2]
 [   3    0    5    0  951    1    5    3    1   13]
 [   8    0    2   34    0  819    6    3   16    4]
 [   8    2    4    0   12    4  921    0    7    0]
 [   0    6   13    2   13    2    0  958   17   17]
 [   2    3   10   18    5    3    2    3  926    2]
 [  12    3    0    8   29    5    0   11   17  924]]
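The confusion matrix carries more information than the single accuracy figure. The sketch below copies the matrix printed above into a NumPy array and, assuming scikit-learn's convention that rows are true digits and columns are predicted digits, recovers the overall accuracy, the recall for each digit, and the most frequent single confusion.

```python
import numpy as np

# Confusion matrix copied from the output above (rows = true digit, columns = predicted digit)
cm = np.array([
    [966,    0,   2,   0,   0,   2,   0,   1,   7,   2],
    [  0, 1122,   4,   1,   0,   2,   1,   0,   4,   1],
    [  7,    2, 983,   6,  10,   1,   2,   5,  16,   0],
    [  0,    0,  12, 967,   0,   6,   0,   5,  18,   2],
    [  3,    0,   5,   0, 951,   1,   5,   3,   1,  13],
    [  8,    0,   2,  34,   0, 819,   6,   3,  16,   4],
    [  8,    2,   4,   0,  12,   4, 921,   0,   7,   0],
    [  0,    6,  13,   2,  13,   2,   0, 958,  17,  17],
    [  2,    3,  10,  18,   5,   3,   2,   3, 926,   2],
    [ 12,    3,   0,   8,  29,   5,   0,  11,  17, 924],
])

# Accuracy = correctly classified (diagonal) / all test images
accuracy = np.trace(cm) / cm.sum()

# Recall per digit = correct predictions for that digit / number of true cases
recall = np.diag(cm) / cm.sum(axis=1)
worst_digit = int(np.argmin(recall))

# Largest off-diagonal entry = most common single mistake
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)
true_digit, pred_digit = (int(i) for i in np.unravel_index(off_diag.argmax(), off_diag.shape))

print(f"Accuracy: {accuracy:.4f}")                      # Accuracy: 0.9537
print(f"Lowest recall: digit {worst_digit}")            # Lowest recall: digit 9
print(f"Most common mistake: {true_digit} -> {pred_digit}")  # Most common mistake: 5 -> 3
```

The accuracy matches the value reported by `accuracy_score`, and the per-digit view shows where the remaining errors concentrate.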


**You will fine-tune this network in the next in-class assignment!**