**1. Is it okay to initialize all the weights to the same value as long as that value is selected randomly using He initialization?**

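No, definitely not. All the weights should be sampled independently. If every weight in a layer has the same value, every neuron in that layer computes the same output and receives the same gradient during backpropagation, so the neurons stay identical and the layer behaves as if it had a single neuron. Random sampling is what breaks this symmetry. Note that the +/- sqrt(6/(n_inputs+n_outputs)) interval is the Glorot uniform range; He initialization uses a variance of 2/n_inputs instead (a uniform range of +/- sqrt(6/n_inputs)). Both aim to keep the variance of a layer's outputs close to the variance of its inputs, which only works when the weights are drawn independently.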
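To make the symmetry argument concrete, here is a minimal NumPy sketch. The layer sizes, the fake input batch and the variable names are illustrative assumptions, not part of the exercise. It compares a layer whose weights all share one randomly drawn value with a layer drawn independently using He normal initialization (standard deviation sqrt(2/n_inputs)).

```python
import numpy as np

rng = np.random.default_rng(42)
n_inputs, n_neurons = 784, 100          # hypothetical layer sizes
X = rng.normal(size=(32, n_inputs))     # a fake batch of 32 instances

# He normal initialization: independent draws with std = sqrt(2 / n_inputs)
he_std = np.sqrt(2.0 / n_inputs)
W_he = rng.normal(scale=he_std, size=(n_inputs, n_neurons))

# "Same value everywhere": one random draw copied to every weight
W_same = np.full((n_inputs, n_neurons), rng.normal(scale=he_std))

out_he = X @ W_he
out_same = X @ W_same

# With identical weights every neuron (column) computes the same output,
# so every neuron would also receive the same gradient.
identical_columns = np.allclose(out_same, out_same[:, :1])
distinct_columns = not np.allclose(out_he, out_he[:, :1])
print(identical_columns, distinct_columns)
```

Both flags should come out true: the shared-value layer produces 100 copies of one neuron, while the independently sampled layer does not.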

**2. Is it okay to initialize the bias terms to 0?**

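Yes. The randomly initialized weights already break the symmetry between neurons, so starting the biases at 0 does no harm. Some people initialize them like the weights, but it makes little difference.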

**3. Name three advantages of the ELU activation function over ReLU.**

- ELU gives a nice, smooth gradient for negative z values while the ReLU has gradient 0 if z<0.

- ELU has negative values when z<0, meaning that the average outputs will be close to 0. This will alleviate vanishing gradient problems.

- "Training time was reduced and the neural network performed better on the test set."
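The first two points can be checked numerically with a small NumPy sketch. The function names and the sample inputs below are illustrative assumptions, and alpha=1 is assumed for ELU.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def elu(z, alpha=1.0):
    # alpha * (exp(z) - 1) for z < 0, z otherwise
    return np.where(z < 0, alpha * (np.exp(np.minimum(z, 0.0)) - 1.0), z)

def relu_grad(z):
    return (z > 0).astype(float)

def elu_grad(z, alpha=1.0):
    # derivative is alpha * exp(z) for z < 0, 1 otherwise
    return np.where(z < 0, alpha * np.exp(np.minimum(z, 0.0)), 1.0)

z = np.linspace(-3.0, 3.0, 7)   # [-3, -2, -1, 0, 1, 2, 3]
print("ReLU grad:", relu_grad(z))
print("ELU grad: ", np.round(elu_grad(z), 3))
print("mean ReLU output:", relu(z).mean())
print("mean ELU output: ", elu(z).mean())
```

On these inputs the ReLU gradient is exactly 0 for every negative z while the ELU gradient stays positive, and the mean ELU output is closer to 0 than the mean ReLU output.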

**4. In which cases would you want to use each of the following activation functions?**

- ELU: good for reducing the vanishing gradients problem

- leaky ReLU: if you notice that a lot of your neurons are "dying", simply changing the activation to leaky_ReLU may help.

- ReLU: the fastest of the options to compute, good if you want a very quick lightweight model.

- tanh: same as logistic but goes between -1 and 1. Could be used as the output layer if you want a -1 definitely false, +1 definitely true kind of confidence scale.

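- logistic: Could be used as the output layer when you need to estimate a probability, e.g. for binary classification: the output lies between 0 (definitely false) and 1 (definitely true).

- softmax: Used as the output layer in multi-class classification tasks with mutually exclusive classes. It outputs one probability per class, summing to 1; the predicted class is the one with the highest probability.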
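As a rough illustration of the output ranges described above, the following NumPy sketch evaluates the logistic, tanh and softmax functions. The logits are made-up numbers, and the helper names are assumptions for this example only.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(logits):
    # subtract the max for numerical stability
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

logits = np.array([2.0, 1.0, 0.1])      # made-up scores for three classes
probas = softmax(logits)                # one probability per class
predicted_class = int(np.argmax(probas))  # choosing the class is a separate step

print(probas, probas.sum(), predicted_class)
print(logistic(0.0), np.tanh(0.0))      # midpoints of the (0, 1) and (-1, 1) ranges
```

Softmax itself only produces the probability distribution; picking the winning class is done afterwards with an argmax.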

**5. What may happen if you set the momentum hyperparameter too close to 1 (e.g., 0.99999) when using the SGD optimizer?**

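With almost no friction, the algorithm picks up a lot of speed and will likely shoot right past the minimum, then turn back and overshoot again. It may oscillate this way many times before converging, so training takes much longer than with a smaller momentum value.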
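The effect can be sketched on a toy problem. The example below is an assumption-laden illustration (a 1-D quadratic loss, a learning rate of 0.1 and 200 steps, all chosen arbitrarily), not the exercise's model; it applies the classic momentum update with two different momentum values.

```python
def momentum_descent(beta, lr=0.1, steps=200, x0=1.0):
    """Minimize f(x) = x**2 / 2 (gradient = x) with momentum optimization."""
    x, m = x0, 0.0
    path = [x]
    for _ in range(steps):
        m = beta * m - lr * x   # momentum vector accumulates past gradients
        x = x + m
        path.append(x)
    return path

moderate = momentum_descent(beta=0.9)
extreme = momentum_descent(beta=0.99999)

print("beta=0.9     final |x|:", abs(moderate[-1]))
print("beta=0.99999 largest |x| in last 50 steps:", max(abs(v) for v in extreme[-50:]))
print("beta=0.99999 lowest x reached:", min(extreme))
```

With momentum 0.9 the iterate settles near the minimum at 0, whereas with 0.99999 it keeps swinging from one side of the minimum to the other with almost undiminished amplitude.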

**6. Name three ways you can produce a sparse model.**

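1. Train the model normally, then set any weights whose absolute value is below a threshold to 0.

2. Use strong l1 regularization during training, which pushes the optimizer to zero out as many weights as it can. (l2 regularization shrinks weights but rarely makes them exactly 0.)

3. Combine l1 regularization with dual averaging, e.g. by using a Follow the Regularized Leader (FTRL) optimizer.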
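The difference between these approaches can be sketched with NumPy on a vector of stand-in weights. Everything here is illustrative: the weights are random numbers rather than a trained model, and the shrinkage functions are simplified single-step versions of what l1 and l2 penalties do during training.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(scale=0.1, size=1000)   # stand-in for trained weights

def prune_small_weights(w, threshold):
    """Zero out weights whose magnitude is below the threshold."""
    return np.where(np.abs(w) < threshold, 0.0, w)

def l1_proximal_step(w, strength):
    """Soft-thresholding: the shrinkage step induced by an l1 penalty."""
    return np.sign(w) * np.maximum(np.abs(w) - strength, 0.0)

def l2_shrink_step(w, strength):
    """l2 (weight decay) shrinkage: scales weights down without zeroing them."""
    return w * (1.0 - strength)

pruned = prune_small_weights(weights, threshold=0.05)
l1_w = l1_proximal_step(weights, strength=0.05)
l2_w = l2_shrink_step(weights, strength=0.05)

for name, w in [("pruned", pruned), ("l1", l1_w), ("l2", l2_w)]:
    print(name, "fraction of zeros:", np.mean(w == 0.0))
```

Thresholding and the l1 step both produce a substantial fraction of exact zeros, while the l2 step leaves every weight nonzero.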

**7. Does dropout slow down training? Does it slow down inference?**

Yes, dropout does slow down training, but it is typically worth the slowdown in terms of building a more robust model. It will not slow down inference however because dropout is not applied during inference.
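A minimal NumPy sketch of inverted dropout shows why inference is unaffected. The function name, its signature and the all-ones activations are assumptions made for this illustration; Keras handles the training flag internally.

```python
import numpy as np

def dropout(activations, rate, training, rng):
    """Inverted dropout: only active when training=True."""
    if not training:
        return activations                     # inference: inputs pass through untouched
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob      # rescale to keep the expected value

rng = np.random.default_rng(1)
a = np.ones((1000, 100))
train_out = dropout(a, rate=0.5, training=True, rng=rng)
infer_out = dropout(a, rate=0.5, training=False, rng=rng)

print("zeroed during training:", np.mean(train_out == 0.0))
print("unchanged at inference:", np.array_equal(infer_out, a))
```

During training about half of the activations are zeroed and the rest are scaled up, whereas at inference the function returns its input unchanged.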

**8. Deep Learning!**

*a. Build a DNN with 5 hidden layers of 100 neurons each, He initialization, and the ELU activation function.*

In [32]:
import tensorflow as tf
import numpy as np
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28,28)),
    # 'he_normal' is He initialization; 'glorot_uniform' would be Glorot (Xavier) initialization
    tf.keras.layers.Dense(100, kernel_initializer='he_normal',activation=tf.keras.activations.elu),
    tf.keras.layers.Dense(100, kernel_initializer='he_normal',activation=tf.keras.activations.elu),
    tf.keras.layers.Dense(100, kernel_initializer='he_normal',activation=tf.keras.activations.elu),
    tf.keras.layers.Dense(100, kernel_initializer='he_normal',activation=tf.keras.activations.elu),
    tf.keras.layers.Dense(100, kernel_initializer='he_normal',activation=tf.keras.activations.elu),
    tf.keras.layers.Dense(5, activation='softmax')
])

*b. Using Adam optimization and early stopping, try training it on MNIST but only on digits 0 to 4, as we will use transfer learning for digits 5 to 9 in the next exercise.*

In [33]:
model.compile(optimizer='adam',loss='sparse_categorical_crossentropy',metrics=['accuracy'])
early_stop = tf.keras.callbacks.EarlyStopping()
tensorboard = tf.keras.callbacks.TensorBoard()

In [34]:
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train / 255.
x_test = x_test / 255.

In [35]:
train_filter = np.isin(y_train,np.arange(0,5))
test_filter = np.isin(y_test, np.arange(0,5))

In [36]:
model.fit(x_train[train_filter], y_train[train_filter], validation_data=(x_test[test_filter],y_test[test_filter]),
         epochs=20, callbacks=[early_stop,tensorboard])

Epoch 1/20
Epoch 2/20
Epoch 3/20


<tensorflow.python.keras.callbacks.History at 0x19982e572b0>

*c. Now try adding Batch Normalization and compare the learning curves: is it converging faster than before? Does it produce a better model?*

In [37]:
model2 = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28,28)),
    tf.keras.layers.Dense(100, kernel_initializer='he_normal',activation=tf.keras.activations.elu),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(100, kernel_initializer='he_normal',activation=tf.keras.activations.elu),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(100, kernel_initializer='he_normal',activation=tf.keras.activations.elu),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(100, kernel_initializer='he_normal',activation=tf.keras.activations.elu),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(100, kernel_initializer='he_normal',activation=tf.keras.activations.elu),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(5, activation='softmax')
])
model2.compile(optimizer='adam',loss='sparse_categorical_crossentropy',metrics=['accuracy'])

In [38]:
model2.fit(x_train[train_filter], y_train[train_filter], validation_data=(x_test[test_filter],y_test[test_filter]),
         epochs=20, callbacks=[early_stop,tensorboard])

Epoch 1/20
Epoch 2/20


<tensorflow.python.keras.callbacks.History at 0x1998257e8b0>

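With Batch Normalization the model converges in fewer epochs: early stopping halted training after 2 epochs instead of 3. Each epoch takes slightly longer because of the extra normalization computations, but needing fewer epochs usually makes up for it.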
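For reference, the transformation that a Batch Normalization layer applies during training can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions: it ignores the moving averages used at inference time, the function name is made up, and the input is random data rather than real layer activations.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-3):
    """Training-time Batch Normalization over the batch axis."""
    mean = x.mean(axis=0)                      # per-feature batch mean
    var = x.var(axis=0)                        # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)    # zero-center and normalize
    return gamma * x_hat + beta                # learned scale and shift

rng = np.random.default_rng(7)
x = rng.normal(loc=5.0, scale=3.0, size=(256, 100))   # badly scaled fake activations
out = batch_norm_forward(x, gamma=np.ones(100), beta=np.zeros(100))

print(out.mean(axis=0)[:3], out.std(axis=0)[:3])
```

Whatever the scale of the incoming activations, each feature leaves the layer with a mean close to 0 and a standard deviation close to 1 (before the learned scale and shift move it), which is the extra per-batch work that makes each epoch slightly slower.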

*e. Is the model overfitting the training set? Try adding dropout to every layer and try again. Does it help?*

In [41]:
model2_dropout = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28,28)),
    tf.keras.layers.Dense(100, kernel_initializer='he_normal',activation=tf.keras.activations.elu),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dropout(.5),
    tf.keras.layers.Dense(100, kernel_initializer='he_normal',activation=tf.keras.activations.elu),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dropout(.5),
    tf.keras.layers.Dense(100, kernel_initializer='he_normal',activation=tf.keras.activations.elu),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dropout(.5),
    tf.keras.layers.Dense(100, kernel_initializer='he_normal',activation=tf.keras.activations.elu),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dropout(.5),
    tf.keras.layers.Dense(100, kernel_initializer='he_normal',activation=tf.keras.activations.elu),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dropout(.5),
    tf.keras.layers.Dense(5, activation='softmax')
])
model2_dropout.compile(optimizer='adam',loss='sparse_categorical_crossentropy',metrics=['accuracy'])

In [42]:
model2_dropout.fit(x_train[train_filter], y_train[train_filter], validation_data=(x_test[test_filter],y_test[test_filter]),
         epochs=20, callbacks=[early_stop,tensorboard])

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20


<tensorflow.python.keras.callbacks.History at 0x19982bdf640>

**9. Transfer learning**

*a. Create a new DNN that reuses all the pretrained hidden layers of the previous model, freezes them and replaces the softmax output layer with a new one.*

In [48]:
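transfer_model = tf.keras.models.Sequential()
# Reuse every layer except the old softmax output, freezing each one.
# The layers are shared with model2_dropout, so freezing them affects that model too.
for layer in model2_dropout.layers[:-1]:
    layer.trainable = False
    transfer_model.add(layer)
transfer_model.add(tf.keras.layers.Dense(5, activation='softmax'))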

*b. Train the new DNN on digits 5 to 9, using only 100 images per digit, and time how long it takes. Can you achieve high precision?*

In [120]:
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255., x_test / 255.

In [121]:
# Keep only digits 5 to 9, filtering images and labels with the same mask so they stay aligned
train_filter = np.isin(y_train, np.arange(5, 10))
test_filter = np.isin(y_test, np.arange(5, 10))
x_train, y_train = x_train[train_filter], y_train[train_filter]
x_test, y_test = x_test[test_filter], y_test[test_filter]

In [124]:
# Shift labels from 5-9 to 0-4. sparse_categorical_crossentropy expects integer class ids,
# so the labels must not be one-hot encoded with to_categorical.
y_train = y_train - 5
y_test = y_test - 5

In [125]:
transfer_model.compile(optimizer='adam',loss='sparse_categorical_crossentropy',metrics=['accuracy'])

In [128]:
# The first 500 filtered images are not guaranteed to contain exactly 100 images per digit
transfer_model.fit(x_train[:500],y_train[:500],
                   validation_data=(x_test,y_test), epochs=20, callbacks=[tensorboard,early_stop])

In [79]:
transfer_model.summary()

Model: "sequential_13"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_7 (Flatten)          (None, 784)               0         
_________________________________________________________________
dense_49 (Dense)             (None, 100)               78500     
_________________________________________________________________
batch_normalization_10 (Batc (None, 100)               400       
_________________________________________________________________
dropout_5 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_50 (Dense)             (None, 100)               10100     
_________________________________________________________________
batch_normalization_11 (Batc (None, 100)               400       
_________________________________________________________________
dropout_6 (Dropout)          (None, 100)             

**10. Pretraining on an auxiliary task.**

*a. Build a DNN that compares two MNIST digit images and predicts whether they represent the same digit or not.*

In [145]:
input_a = tf.keras.layers.Input((28,28),name='input_a')
# Flatten each 28x28 image so the Dense layers see a 784-feature vector
flat_a = tf.keras.layers.Flatten(name='flatten_a')(input_a)
dense1 = tf.keras.layers.Dense(100, activation=tf.keras.activations.elu,name='hidden1_a')(flat_a)
dense2 = tf.keras.layers.Dense(100, activation=tf.keras.activations.elu,name='hidden2_a')(dense1)
output_a = tf.keras.layers.Dense(100, activation=tf.keras.activations.elu,name='hidden3_a')(dense2)

In [143]:
input_b = tf.keras.layers.Input((28,28),name='input_b')
flat_b = tf.keras.layers.Flatten(name='flatten_b')(input_b)
dense1 = tf.keras.layers.Dense(100, activation=tf.keras.activations.elu,name='hidden1_b')(flat_b)
dense2 = tf.keras.layers.Dense(100, activation=tf.keras.activations.elu,name='hidden2_b')(dense1)
output_b = tf.keras.layers.Dense(100, activation=tf.keras.activations.elu,name='hidden3_b')(dense2)

In [147]:
# Both branches now output shape (None, 100), so concatenating on axis 1 gives (None, 200)
concat = tf.keras.layers.Concatenate(axis=1,name='concat')([output_a,output_b])
dense_total = tf.keras.layers.Dense(100,activation=tf.keras.activations.elu,name='hidden_total')(concat)
output = tf.keras.layers.Dense(1, activation='sigmoid')(dense_total)

In [148]:
model = tf.keras.models.Model(inputs=[input_a,input_b], outputs=[output])

In [152]:
# Generate image of model
tf.keras.utils.plot_model(model,to_file='model.png')

Failed to import pydot. You must install pydot and graphviz for `pydotprint` to work.


In [154]:
# A single sigmoid output with 0/1 labels calls for binary cross-entropy
model.compile(optimizer='adam',loss='binary_crossentropy',metrics=[tf.keras.metrics.BinaryAccuracy()])

*b. Split the MNIST training set in two sets: split #1 should contain 55,000 images, and split #2 should contain 5,000 images. Create a function that generates a training batch where each instance is a pair of MNIST images picked from split #1. Half of the training instances should be pairs of images that belong to the same class, while the other half should be images from different classes. For each pair, the training label should be 0 if the images are from the same class, or 1 if they are from different classes.*

In [158]:
(x_train, y_train), (_,_) = tf.keras.datasets.mnist.load_data()
x_test, y_test = x_train[-5000:], y_train[-5000:]
x_train, y_train = x_train[:-5000], y_train[:-5000]
print(len(x_train),len(x_test))

55000 5000


In [159]:
organized = [x_train[np.where(y_train==val)] for val in range(10)]

In [164]:
def generate_batch(batch_size=64):
    # each sample has shape (2, 28, 28), so batch_x has shape (batch_size, 2, 28, 28) and batch_y has shape (batch_size,)
    # labels follow the exercise statement: 0 = same class, 1 = different classes
    batch_x = []
    batch_y = []
    for _ in range(batch_size):
        num1 = np.random.randint(10)
        # each pair is a match with probability 0.5, so batches are balanced on average
        if np.random.random() < .5:
            # matching pair: both images come from the same digit
            num2 = num1
            label = 0
        else:
            # non-matching pair: pick any digit other than num1
            num2 = np.random.choice([d for d in range(10) if d != num1])
            label = 1
        index_1 = np.random.randint(len(organized[num1]))
        index_2 = np.random.randint(len(organized[num2]))
        batch_x.append([organized[num1][index_1], organized[num2][index_2]])
        batch_y.append(label)
    return np.array(batch_x), np.array(batch_y)

x,y = generate_batch(batch_size=10)
print(x.shape,y.shape)