### Solving MNIST with a Convolutional Neural Network (CNN)

In [1]:
from tensorflow import keras
from tensorflow.keras import layers

The convolutional part of the CNN we use to solve MNIST alternates between two types of layers:
* Conv2D ([docs](https://keras.io/api/layers/convolution_layers/convolution2d/))
* MaxPooling2D ([docs](https://keras.io/api/layers/pooling_layers/max_pooling2d/))

In [2]:
inputs = keras.Input(shape=(28, 28, 1))
x = layers.Conv2D(filters=32, kernel_size=3, activation="relu")(inputs)
x = layers.MaxPooling2D(pool_size=2)(x)
x = layers.Conv2D(filters=64, kernel_size=3, activation="relu")(x)
x = layers.MaxPooling2D(pool_size=2)(x)
x = layers.Conv2D(filters=128, kernel_size=3, activation="relu")(x)
x = layers.Flatten()(x)
outputs = layers.Dense(10, activation="softmax")(x)
model = keras.Model(inputs=inputs, outputs=outputs)

In [3]:
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 28, 28, 1)]       0         
                                                                 
 conv2d (Conv2D)             (None, 26, 26, 32)        320       
                                                                 
 max_pooling2d (MaxPooling2D  (None, 13, 13, 32)       0         
 )                                                               
                                                                 
 conv2d_1 (Conv2D)           (None, 11, 11, 64)        18496     
                                                                 
 max_pooling2d_1 (MaxPooling  (None, 5, 5, 64)         0         
 2D)                                                             
                                                                 
 conv2d_2 (Conv2D)           (None, 3, 3, 128)         73856 

**To the Student**:
* What is the effect of `MaxPooling2D` on the spatial dimension? 
* How many trainable parameters does a `MaxPooling2D` layer have? Why?
* How many `Conv2D`-`MaxPooling2D` pairs do we have in this CNN?
* How many trainable parameters are there for the `Flatten` layer? What is its function? (see its [docs](https://keras.io/api/layers/reshaping_layers/flatten/) if needed)
* What is the activation of the `Dense` layer, and what does it compute? See its [docs](https://keras.io/api/layers/core_layers/dense/), as well as the [docs](https://keras.io/api/layers/activations/) for the `softmax` activation function (scroll down the page).
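
One way to check your answers against the summary above is to redo the arithmetic by hand. The sketch below is not part of the original notebook. The helper names are hypothetical, and it assumes the Keras defaults used in cell `In [2]` (`Conv2D` with `padding="valid"` and stride 1, `MaxPooling2D` with strides equal to `pool_size`).

```python
# Sketch: reproduce the output shapes and parameter counts of the CNN summary.
# Assumes Keras defaults: Conv2D uses padding="valid" and stride 1;
# MaxPooling2D uses padding="valid" and strides equal to pool_size.

def conv2d_output_size(size, kernel_size):
    """Spatial size after a 'valid' convolution with stride 1."""
    return size - kernel_size + 1

def maxpool_output_size(size, pool_size):
    """Spatial size after 'valid' max pooling (leftover rows/columns are dropped)."""
    return size // pool_size

def conv2d_param_count(kernel_size, in_channels, filters):
    """One kernel_size x kernel_size x in_channels kernel plus one bias per filter."""
    return (kernel_size * kernel_size * in_channels + 1) * filters

size, channels = 28, 1
shapes, params = [], []
# (filters, followed_by_pooling) for the three Conv2D layers in cell In [2]
for filters, pooled in [(32, True), (64, True), (128, False)]:
    params.append(conv2d_param_count(3, channels, filters))
    size, channels = conv2d_output_size(size, 3), filters
    shapes.append((size, size, channels))
    if pooled:
        size = maxpool_output_size(size, 2)
        shapes.append((size, size, channels))

print(shapes)  # [(26, 26, 32), (13, 13, 32), (11, 11, 64), (5, 5, 64), (3, 3, 128)]
print(params)  # [320, 18496, 73856]
```

The printed values should match the `Output Shape` and `Param #` columns of `model.summary()`.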

#### Obtaining the MNIST data

See [Keras Datasets](https://keras.io/api/datasets/) and [MNIST dataset](https://keras.io/api/datasets/mnist/) for a description of the `keras.datasets` functionality and the `MNIST` data.

In [4]:
from tensorflow.keras.datasets import mnist

(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

#### Preprocessing the data

**To the Student**: 
* List all the preprocessing used below. What is the purpose of each?

In [5]:
train_images = train_images.reshape((60000, 28, 28, 1))
train_images = train_images.astype("float32") / 255
test_images = test_images.reshape((10000, 28, 28, 1))
test_images = test_images.astype("float32") / 255
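
To see what these steps do without downloading MNIST, the following sketch applies them to a small synthetic batch. The random array and the `preprocess` helper are illustrative assumptions, not part of the original notebook. Using `-1` for the first dimension lets NumPy infer the number of images instead of hard-coding 60000 or 10000.

```python
import numpy as np

# Hypothetical stand-in for a few MNIST images: uint8 pixels in [0, 255], no channel axis.
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(8, 28, 28), dtype=np.uint8)

def preprocess(images):
    """Add a trailing channel axis and rescale uint8 pixels to float32 in [0, 1]."""
    images = images.reshape((-1, 28, 28, 1))  # -1 infers the number of images
    return images.astype("float32") / 255

prepared = preprocess(fake_images)
print(fake_images.shape, fake_images.dtype)  # before preprocessing
print(prepared.shape, prepared.dtype)        # after preprocessing
print(prepared.min(), prepared.max())        # values now lie in [0, 1]
```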

#### Training the Model

Links to relevant docs (if necessary):
* [Model Training APIs (compile and fit)](https://keras.io/api/models/model_training_apis/)
* [rmsprop](https://keras.io/api/optimizers/rmsprop/)
* [sparse categorical crossentropy](https://keras.io/api/losses/probabilistic_losses/#sparsecategoricalcrossentropy-class)
* [accuracy](https://keras.io/api/metrics/accuracy_metrics/)
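
As a rough illustration of what the loss and metric compute, here is a minimal NumPy sketch of softmax, sparse categorical crossentropy, and accuracy on two made-up examples with three classes. It is a simplification under stated assumptions: the function names and numbers are hypothetical, and the actual Keras implementation also clips probabilities for numerical stability.

```python
import numpy as np

def softmax(logits):
    """Turn each row of raw scores into probabilities that sum to 1."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def sparse_categorical_crossentropy(labels, probs):
    """Per-example loss: -log of the probability assigned to the true integer label."""
    picked = probs[np.arange(len(labels)), labels]
    return -np.log(picked)

# Made-up scores for 2 examples and 3 classes (illustrative values only)
logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 0.5, 3.0]])
labels = np.array([0, 2])  # integer class labels, as in train_labels

probs = softmax(logits)
losses = sparse_categorical_crossentropy(labels, probs)
accuracy = (probs.argmax(axis=-1) == labels).mean()

print(probs.sum(axis=-1))  # each row sums to 1
print(losses.mean())       # the batch loss is the mean of the per-example losses
print(accuracy)            # fraction of examples whose most probable class is the label
```

The "sparse" variant takes integer labels directly, which is why the labels are not one-hot encoded during preprocessing.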

In [6]:
model.compile(optimizer="rmsprop",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"])
model.fit(train_images, train_labels, epochs=5, batch_size=64)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x1f7340edf10>

#### Evaluating the Model

In [7]:
test_loss, test_acc = model.evaluate(test_images, test_labels)
print(f"Test accuracy: {test_acc:.3f}")

Test accuracy: 0.993


#### Comparison of a CNN to a Dense Neural Network

It can be interesting to compare this accuracy to that of a dense neural network. Below we define and train a dense neural network for comparison with the CNN. Note that neither the network architecture (e.g. number of layers and their sizes) nor the training method (e.g. optimizer, learning rate, initialization) has been optimized.

In [14]:
# define a simple dense neural network for MNIST

inputs = keras.Input(shape=(28,28,1))
x = layers.Flatten()(inputs)
x = layers.Dense(512, activation="relu")(x)
x = layers.Dense(256, activation="relu")(x)
x = layers.Dense(128, activation="relu")(x)
outputs = layers.Dense(10, activation="softmax")(x)
dense_model = keras.Model(inputs=inputs, outputs=outputs)


In [15]:
dense_model.summary()

Model: "model_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_4 (InputLayer)        [(None, 28, 28, 1)]       0         
                                                                 
 flatten_2 (Flatten)         (None, 784)               0         
                                                                 
 dense_9 (Dense)             (None, 512)               401920    
                                                                 
 dense_10 (Dense)            (None, 256)               131328    
                                                                 
 dense_11 (Dense)            (None, 128)               32896     
                                                                 
 dense_12 (Dense)            (None, 10)                1290      
                                                                 
Total params: 567,434
Trainable params: 567,434
Non-trainab
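
The `Param #` column for the dense network can be reproduced by hand: a `Dense` layer has one weight per input-output pair plus one bias per unit. The sketch below uses a hypothetical helper and is not part of the original notebook.

```python
def dense_param_count(in_features, units):
    """Weights (in_features * units) plus one bias per unit."""
    return in_features * units + units

# Flatten turns each 28x28x1 image into 784 features; the rest are the Dense layer widths.
layer_sizes = [28 * 28 * 1, 512, 256, 128, 10]
per_layer = [dense_param_count(n_in, n_out)
             for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

print(per_layer)       # [401920, 131328, 32896, 1290]
print(sum(per_layer))  # 567434
```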

In [16]:
# train the model
dense_model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
dense_model.fit(train_images, train_labels, epochs=5, batch_size=64)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x1f736e51950>

**To the Student**:
* What is the difference in the number of parameters? 
* What is the difference in the training time? 
* What is the difference in the accuracy?
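
To compare training times, one option is to wrap each call in a small timing helper. The sketch below is an assumption-laden illustration: `timed` and `busy_work` are hypothetical names, and `busy_work` is only a stand-in workload so that the example runs without TensorFlow.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def busy_work(n):
    """Stand-in workload; in the notebook this would be a call such as model.fit."""
    return sum(i * i for i in range(n))

result, seconds = timed(busy_work, 100_000)
print(f"busy_work took {seconds:.4f} s")
```

In the notebook, the same helper could wrap the training calls, for example `timed(model.fit, train_images, train_labels, epochs=5, batch_size=64)` and the equivalent call on `dense_model`. Wall-clock time depends on the hardware, so compare both models on the same machine. For the accuracy comparison, `dense_model.evaluate(test_images, test_labels)` can be used in the same way as in cell `In [7]`.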