**1. What are the advantages of a CNN over a fully connected DNN for image classification?**

Convolutional layers are only *partially connected*: each neuron sees a small receptive field rather than the whole image, so the number of parameters stays manageable even for large images. In addition, all the neurons in a feature map share the same weights and bias term, which means that a) there are far fewer parameters and b) once the network learns to detect a feature, it can detect that feature anywhere in the image, whereas a fully connected DNN has to learn the same feature separately at every location.
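
Point b) can be illustrated with a one-dimensional toy example. This is only a sketch in plain Python: `correlate1d` and the two step signals are made up for illustration. The same two-weight kernel is slid over the whole signal, so it responds to the step wherever the step occurs.

```python
def correlate1d(signal, kernel):
    """Slide one shared kernel over the signal (no padding, stride 1)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

# A single set of shared weights that responds to a rising step.
edge_kernel = [-1, 1]

step_on_left = [0, 1, 1, 1, 1, 1]
step_on_right = [0, 0, 0, 0, 1, 1]

left_response = correlate1d(step_on_left, edge_kernel)
right_response = correlate1d(step_on_right, edge_kernel)

# The peak of the response moves with the feature; the weights do not change.
left_peak = left_response.index(max(left_response))
right_peak = right_response.index(max(right_response))
print(left_peak, right_peak)
```

A fully connected layer would need a separate set of weights for each position at which the step might appear.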

**2. Consider a CNN composed of three convolutional layers, each with 3x3 kernels, a stride of 2, and SAME padding. The lowest layer outputs 100 feature maps, the middle one outputs 200, and the top one outputs 400. The input images are RGB images of 200x300 pixels. What is the total number of parameters in the CNN? If we are using 32-bit floats, at least how much RAM will this network require when making a prediction for a single instance? What about when training on a mini-batch of 50 images?**

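The number of parameters in a convolutional layer is (kernel_area x num_input_channels + 1 (for the bias)) x num_filters, where the number of input channels is the number of feature maps output by the previous layer (3 for the RGB input). Thus, we have

( (3x3x3+1) x 100 ) + ( (3x3x100+1) x 200 ) + ( (3x3x200+1) x 400 ) = 2,800 + 180,200 + 720,400 = 903,400 total parameters.

With 32-bit (4-byte) floats, the parameters take 903,400 x 4 = 3,613,600 bytes, or about 3.6 MB. Since every layer uses a stride of 2 with SAME padding, the first layer outputs 100x150x100 = 1,500,000 values (6 MB), the second outputs 50x75x200 = 750,000 values (3 MB), and the third outputs 25x38x400 = 380,000 values (about 1.5 MB). When making a prediction for a single instance, we only need enough RAM to hold the two largest consecutive layers at once, because a layer's output can be freed as soon as the next layer has been computed. That is 6 + 3 = 9 MB for the first two layers, plus 3.6 MB for the parameters, or about 12.6 MB in total.

When training on a mini-batch of 50 images, the outputs of every layer must be kept for the backward pass, so we need at least 50 x (6 + 3 + 1.52) = 526 MB for the feature maps, plus 50 x (200x300x3x4 bytes) = 36 MB for the input images and 3.6 MB for the parameters. That comes to roughly 566 MB, and it is a lower bound, since the gradients need memory as well.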
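
The arithmetic for this exercise can be checked with a few lines of Python that use only the layer sizes given in the question. This is a rough sketch: the variable names are illustrative, megabytes are taken as 10^6 bytes, and the training figure ignores gradients and optimizer state, so it is only a lower bound.

```python
import math

BYTES_PER_FLOAT = 4  # 32-bit floats

def conv_params(kernel_size, in_channels, filters):
    """Weights plus one bias per filter for a square-kernel conv layer."""
    return (kernel_size * kernel_size * in_channels + 1) * filters

height, width, channels = 200, 300, 3
input_bytes = height * width * channels * BYTES_PER_FLOAT

total_params = 0
activation_counts = []
for filters in [100, 200, 400]:
    total_params += conv_params(3, channels, filters)
    # SAME padding with stride 2 rounds the halved size up.
    height, width = math.ceil(height / 2), math.ceil(width / 2)
    activation_counts.append(height * width * filters)
    channels = filters

param_bytes = total_params * BYTES_PER_FLOAT
activation_bytes = [n * BYTES_PER_FLOAT for n in activation_counts]

# Inference: parameters plus the largest pair of consecutive layers.
layer_bytes = [input_bytes] + activation_bytes
largest_pair = max(a + b for a, b in zip(layer_bytes, layer_bytes[1:]))
inference_bytes = param_bytes + largest_pair

# Training on 50 images: every layer's output is kept for the backward pass.
batch_size = 50
training_bytes = param_bytes + batch_size * (input_bytes + sum(activation_bytes))

print(total_params, activation_counts)
print(inference_bytes / 1e6, training_bytes / 1e6)
```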

**3. If your GPU runs out of memory while training a CNN, what are five things you could try to solve the problem?**

- Decrease the batch size

- Use a larger stride in one or more layers to reduce the size of the feature maps.

- Use fewer layers.

- Use a pooling layer.

- Use 16-bit floats rather than 32-bit.

**4. Why would you want to add a max pooling layer rather than a convolutional layer with the same stride?**

A convolutional layer with a stride greater than 1 also outputs feature maps smaller than its input, but it adds parameters that grow with the kernel size, the number of input channels, and the number of feature maps. A max pooling layer shrinks the image in the same way but has no weights at all, so there is nothing to train. This makes pooling layers a cheap way to reduce computation and memory use, and having fewer parameters also lowers the risk of overfitting.
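
As a rough illustration, 2x2 max pooling can be written in plain Python without any weights. The `max_pool_2x2` helper and the 4x4 image are made up for this sketch, and an even height and width are assumed. For comparison, the last lines count the parameters of a 3x3 convolutional layer with 64 input channels and 64 filters, a size chosen arbitrarily.

```python
def max_pool_2x2(image):
    """2x2 max pooling with stride 2 on a list-of-lists image.

    Assumes the height and width are both even.
    """
    return [
        [
            max(image[r][c], image[r][c + 1],
                image[r + 1][c], image[r + 1][c + 1])
            for c in range(0, len(image[0]), 2)
        ]
        for r in range(0, len(image), 2)
    ]

image = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
pooled = max_pool_2x2(image)  # halves each dimension, nothing to learn
pooling_params = 0

# A strided 3x3 conv layer with 64 input channels and 64 filters
# shrinks the image just as much but has weights and biases to train.
strided_conv_params = (3 * 3 * 64 + 1) * 64

print(pooled, pooling_params, strided_conv_params)
```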

**5. When would you want to add a local response normalization layer?**

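A local response normalization layer makes the most strongly activated neurons inhibit the neurons at the same location in neighboring feature maps. This encourages the feature maps to specialize and to differ from one another, so the network explores a wider range of features, which tends to improve generalization. It is typically added in the lower layers, where a larger pool of low-level features gives the upper layers more to build on.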
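
The inhibition effect can be sketched in plain Python for the activations of neighboring feature maps at a single image location. The function below follows the usual form of the operation, each activation divided by (k + alpha x sum of squared neighbors) raised to the power beta. The function name, the default hyperparameters, and the sample activations are illustrative assumptions, not values from this notebook.

```python
def local_response_norm(activations, k=1.0, alpha=1e-4, beta=0.75, radius=2):
    """Normalize each activation by the squared activations of the
    feature maps within `radius` of it (same spatial location)."""
    n = len(activations)
    normalized = []
    for i, a in enumerate(activations):
        low = max(0, i - radius)
        high = min(n - 1, i + radius)
        squared_sum = sum(activations[j] ** 2 for j in range(low, high + 1))
        normalized.append(a / (k + alpha * squared_sum) ** beta)
    return normalized

# One strongly activated feature map (10.0) between two weak ones (1.0).
# alpha is exaggerated here so the effect is easy to see.
crowded = local_response_norm([0.0, 1.0, 10.0, 1.0, 0.0], alpha=0.1, radius=1)

# The same weak activation with no strong neighbor.
isolated = local_response_norm([1.0], alpha=0.1, radius=1)

# The weak activation is suppressed far more when it sits next to a strong one.
print(crowded[1], isolated[0])
```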

**6. Can you name the main innovations in AlexNet compared to LeNet-5? What about the main innovations in GoogLeNet and ResNet?**

The main innovations of AlexNet compared to LeNet-5 are that it is much larger and deeper, and that it stacks convolutional layers directly on top of one another instead of placing a pooling layer after each one. AlexNet also used local response normalization.

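The main innovation of GoogLeNet is the inception module, which lets the network go much deeper than earlier CNNs while using fewer parameters. The main innovation of ResNet is the introduction of skip connections, which make it possible to train networks well beyond 100 layers.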
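
A skip connection simply adds a block's input to its output. The toy sketch below uses plain Python lists in place of tensors, and `residual_block` and `untrained_layers` are hypothetical names. It shows why this helps with depth: if the wrapped layers output values near zero, as they roughly do early in training, the block behaves like the identity and the signal still passes through.

```python
def residual_block(x, layer_fn):
    """Return layer_fn(x) + x elementwise: the skip connection adds the input back."""
    fx = layer_fn(x)
    return [f + xi for f, xi in zip(fx, x)]

def untrained_layers(x):
    # Stand-in for layers whose weights are close to zero.
    return [0.0 for _ in x]

x = [0.5, -1.0, 2.0]
out = residual_block(x, untrained_layers)
print(out)  # the input passes through unchanged
```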

**7. Build your own CNN and try to achieve the highest possible accuracy on MNIST.**

In [1]:
# What I want to implement in this CNN
# 1. Data augmentation
# 2. Not too large, about the size of AlexNet
# 3. Some regularization techniques
# 4. *Maybe* implementing local response normalization

import tensorflow as tf

In [38]:
img_input = tf.keras.layers.Input((28,28,1))
conv1 = tf.keras.layers.Conv2D(5, 3, strides=2, padding='SAME',
                               activation=tf.keras.activations.elu,
                               kernel_regularizer=tf.keras.regularizers.l2(0.02))(img_input)
batch_norm_a = tf.keras.layers.BatchNormalization()(conv1)
pool2 = tf.keras.layers.MaxPool2D(2)(batch_norm_a)
conv3 = tf.keras.layers.Conv2D(10, 3, strides=2, padding='SAME',
                               activation=tf.keras.activations.elu,
                               kernel_regularizer=tf.keras.regularizers.l2(0.02))(pool2)
batch_norm_b = tf.keras.layers.BatchNormalization()(conv3)
flatten4 = tf.keras.layers.Flatten()(batch_norm_b)
dense5 = tf.keras.layers.Dense(50, activation=tf.keras.activations.elu)(flatten4)
dropout6 = tf.keras.layers.Dropout(.5)(dense5)
dense7 = tf.keras.layers.Dense(10, activation='softmax')(dropout6)

In [39]:
cnn = tf.keras.models.Model(inputs=[img_input], outputs=[dense7])
cnn.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [40]:
cnn.summary()

Model: "model_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_4 (InputLayer)         [(None, 28, 28, 1)]       0         
_________________________________________________________________
conv2d_8 (Conv2D)            (None, 14, 14, 5)         50        
_________________________________________________________________
batch_normalization (BatchNo (None, 14, 14, 5)         20        
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 7, 7, 5)           0         
_________________________________________________________________
conv2d_9 (Conv2D)            (None, 4, 4, 10)          460       
_________________________________________________________________
batch_normalization_1 (Batch (None, 4, 4, 10)          40        
_________________________________________________________________
flatten_3 (Flatten)          (None, 160)               0   
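
The output shapes and parameter counts in the summary above can be reproduced by hand. The helper names below are illustrative. The sketch assumes the Keras defaults used in this model: SAME padding on the convolutions, and a pooling stride equal to the pool size with no padding.

```python
import math

def conv_same(size, stride):
    """Output size of a SAME-padded convolution."""
    return math.ceil(size / stride)

def pool_valid(size, pool):
    """Output size of an unpadded pooling layer whose stride equals its pool size."""
    return (size - pool) // pool + 1

def conv_params(kernel_size, in_channels, filters):
    return (kernel_size * kernel_size * in_channels + 1) * filters

def batch_norm_params(channels):
    # gamma, beta, moving mean and moving variance for each channel
    return 4 * channels

size = 28
size = conv_same(size, 2)    # first conv layer
after_conv1 = size
size = pool_valid(size, 2)   # max pooling
after_pool = size
size = conv_same(size, 2)    # second conv layer
after_conv2 = size
flattened = size * size * 10

param_counts = [
    conv_params(3, 1, 5),
    batch_norm_params(5),
    conv_params(3, 5, 10),
    batch_norm_params(10),
]
print(after_conv1, after_pool, after_conv2, flattened, param_counts)
```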

In [46]:
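# Load in MNIST data
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape((-1,28,28,1))
x_test = x_test.reshape((-1,28,28,1))
# Only rescaled to [0, 1]: unlike the training batches produced by the generator
# below, x_test is not featurewise centered or standardized.
x_test = x_test/255.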

In [42]:
datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    featurewise_std_normalization=True, featurewise_center=True,
    rotation_range=20, shear_range=3,
    rescale=1/255.,
)

In [43]:
datagen.fit(x_train)

In [44]:
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss',patience=2,restore_best_weights=True)
tensorboard = tf.keras.callbacks.TensorBoard(histogram_freq=1)

In [47]:
cnn.fit(datagen.flow(x_train, y_train, batch_size=32), validation_data=(x_test, y_test), 
        callbacks=[early_stop, tensorboard], epochs=15)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15


<tensorflow.python.keras.callbacks.History at 0x1792e5aaa60>
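
The log above ends at epoch 7 of the 15 requested, which is what one would expect if the `EarlyStopping` callback fired. The patience logic can be sketched in plain Python. The function name is hypothetical, and the validation losses are made up for illustration. They are not the values from this run.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (epoch at which training stops, epoch with the best loss).

    Epochs are 1-indexed. Training stops once the loss has failed to
    improve for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Hypothetical validation losses, one per epoch.
hypothetical_losses = [0.30, 0.22, 0.18, 0.15, 0.14, 0.16, 0.15, 0.13, 0.12]

stopped_at, best_at = early_stopping_epoch(hypothetical_losses, patience=2)
print(stopped_at, best_at)
```

With `restore_best_weights=True`, the model would end up with the weights from the best epoch rather than those from the epoch at which training stopped.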

**9. Transfer learning for large image classification.**

*a. Load in the flowers dataset.*