### Problem 1

### The Vanishing Gradient Problem
Let's begin by investigating to what extent a proper choice of an activation function and weight initialization can mitigate the vanishing gradient problem. Consider the following toy data set:

```python
from sklearn.datasets import make_moons, make_circles, make_blobs
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import MinMaxScaler

np.random.seed(42)

X, y = make_circles(n_samples=1000 , noise=0.08, random_state=1)
scaler = MinMaxScaler(feature_range=(-1, 1))
X = scaler.fit_transform(X)

fig, ax = plt.subplots(figsize=(8, 8))
for i in range(2):
  samples_i = np.asarray(y == i)
  ax.scatter(X[samples_i, 0], X[samples_i, 1], label=str(i))

ax.legend()
ax.set_aspect(1)
```
When you run this cell, the output should look similar to this:

![](https://firebasestorage.googleapis.com/v0/b/prismia.appspot.com/o/user-images%252Fimage-35be2dfd-ce7d-49ba-b878-0c797e088ec9.png?alt=media&token=9712a25b-0c69-409c-b05e-9465241653be)



Let's do our train-test split and define a `plot_training` function to display the results of model training runs.
```python
# Train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=.8, random_state=42)
# Remark: sklearn's train_test_split function randomly shuffles the data
# before doing the specified split.  It is good practice to do this.


def plot_training(model):
  history = model.fit(
    X_train, y_train,
    validation_data=(X_test, y_test),
    epochs=300,
    verbose=0
  )



  _, train_accuracy = model.evaluate(X_train, y_train, verbose=0)
  _, test_accuracy = model.evaluate(X_test, y_test, verbose=0)
  print('Train: %.2f, Test: %.2f' % (train_accuracy, test_accuracy))



  print(history.history.keys())



  plt.plot(history.history['accuracy'], label='train_accuracy')
  plt.plot(history.history['val_accuracy'], label='val_accuracy')
  plt.legend()

'done!'
```



Now let's build and train a not-so-good model.
```python
# Build a basic MLP
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Dense

keras.backend.clear_session()
np.random.seed(42)
tf.random.set_seed(42)

model = keras.models.Sequential()
init = keras.initializers.RandomUniform(minval=-1, maxval=1)

for i in range(3):
  model.add(Dense(5, 
            activation='tanh',
            kernel_initializer=init))

model.add(Dense(1, activation='sigmoid', kernel_initializer=init))

model.compile(
    loss='binary_crossentropy', 
    optimizer="sgd", 
    metrics=['accuracy']
)



plot_training(model)
model.summary()
```



A Bayes classifier for this dataset has an accuracy of about 88%.  The accuracy values we got above are nowhere near this value.  The performance chart from the previous cell may look something like the following (the key point is that performance is bad):

![](https://firebasestorage.googleapis.com/v0/b/prismia.appspot.com/o/user-images%252Fimage-6a1734a7-0a2c-4e19-ad0d-b9964c25bc0d.png?alt=media&token=aee7b420-f61d-412c-94aa-841cb7d7888d)
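
One way to see why the `tanh` model above can struggle is to look at the local slope of the activation function. The sketch below is a simplification, not an analysis of the trained model: it ignores the weight matrices and simply raises the activation derivative to the power of the number of hidden layers, using hypothetical pre-activation values `z`.

```python
import numpy as np

def tanh_grad(z):
    """Derivative of tanh at pre-activation z."""
    return 1.0 - np.tanh(z) ** 2

def relu_grad(z):
    """Derivative of ReLU at pre-activation z (taken as 0 for z <= 0)."""
    return (z > 0).astype(float)

z = np.array([0.5, 2.0, 4.0])  # hypothetical pre-activation values
depth = 3                      # number of hidden layers in the model above

# Backpropagation multiplies one activation-derivative factor per layer,
# so a small slope shrinks quickly as depth grows.
print("tanh:", tanh_grad(z) ** depth)
print("relu:", relu_grad(z) ** depth)
```

For large `|z|` the `tanh` factor is close to zero, while the ReLU factor stays at 1 for any positive `z`.
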
® In the solution cell below, make a version of the code above in which you change the inner-layer activation function(s) to something more appropriate and the initialization method to one that makes the model more easily trainable. You can also try using a different optimizer.  

Your final output should look something like this:

![](https://cdn.mathpix.com/snip/images/s18OCim_ftu0EiMTzSf6XAUpeYxc8ENWsjjNPXPt8_Y.original.fullsize.png)


```python
keras.backend.clear_session()
np.random.seed(42)
tf.random.set_seed(42)


model = keras.models.Sequential()
init = keras.initializers.HeNormal(seed=42)


for i in range(3):
  model.add(Dense(5, activation='relu', kernel_initializer=init))


model.add(Dense(1, activation='sigmoid', kernel_initializer=init))


model.compile(
    loss='binary_crossentropy', 
    optimizer="adam", 
    metrics=['accuracy']
)


plot_training(model)
model.summary()
```
In the end, I got a train and test accuracy of 0.87 each, and the graph looks like the one above. The initialization method I used is HeNormal, and the activation function I used for the inner layers is "relu".



### Problem 2

## Learning Rates & Schedules
_Learning rate selection is more of an art than a science. The learning rate may be chosen by trial and error, but it is usually best to choose it by monitoring learning curves that plot the objective function as a function of time. This is more of an art than a science, and most guidance on this subject should be regarded with some skepticism._ -  Goodfellow et al., [Deep Learning, 2016](https://www.deeplearningbook.org/)


#### Understanding Learning Rate
® Summarize the approach suggested in [this article](https://machinelearningmastery.com/understand-the-dynamics-of-learning-rate-on-deep-learning-neural-networks/) for selecting an initial learning rate.  

Also, explain what learning rate scheduling is, and four different approaches to learning rate scheduling.  Which ones are built into `keras` or `tf.keras`?


A learning rate that is too large may cause training to converge to a suboptimal solution, whereas a learning rate that is too small makes convergence too slow. We should therefore tune the learning rate and adjust it during training, for example as a function of time, the epoch number, or the current learning rate.
`ReduceLROnPlateau` is built into Keras and can be passed as a callback to `fit()`. You can also define your own function that modifies the learning rate during training and pass it to the `LearningRateScheduler` callback in Keras. In addition, Keras has built-in optimizers with adaptive learning rates, such as `RMSprop`, `Adagrad`, and `Adam`.
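
As a rough illustration of what a user-defined schedule function can look like, here is a minimal sketch of two common schedule shapes, power scheduling and piecewise-constant scheduling. The function names, default values, and boundaries are hypothetical choices for illustration. Each function takes the epoch index and returns a learning rate, which is the kind of function one would hand to `LearningRateScheduler`.

```python
def power_schedule(epoch, lr0=0.01, steps=10.0, power=1.0):
    """Power scheduling: the rate decays like lr0 / (1 + epoch / steps) ** power."""
    return lr0 / (1.0 + epoch / steps) ** power

def piecewise_constant_schedule(epoch, boundaries=(5, 15),
                                values=(0.01, 0.005, 0.001)):
    """Piecewise-constant scheduling: hold one rate until each boundary epoch."""
    for boundary, value in zip(boundaries, values):
        if epoch < boundary:
            return value
    return values[-1]  # rate used after the last boundary

for epoch in (0, 5, 10, 20):
    print(epoch, power_schedule(epoch), piecewise_constant_schedule(epoch))
```
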


### Problem 3

### Turn on your GPU
Training deep networks is computationally demanding, and the sessions that run in Binder are not beefy enough for doing it seriously. Most crucially, they do not allow access to a GPU, and your laptop probably doesn't either (for sure it doesn't if you have a Mac; if you have a PC then maybe).

Therefore, for the remainder of the homework we're going to switch to Colab for compute. This Google product offers free basic GPU access, and we recommend considering the purchase of Colab Pro for $10 a month to give upgraded RAM and better GPUs.

The easiest way to do this is to click on the help icon (top right corner of this window) and download this assignment as a Jupyter notebook. Then go to [colab.research.google.com](https://colab.research.google.com) and click "Upload" (right edge of the orange navigation bar), and select the downloaded notebook.

After you open the notebook, make sure that you **switch to a runtime that includes a GPU**. To do so, click `Colab:Runtime -> Change Runtime Type` and choose GPU as your Hardware Accelerator.


To check that you have successfully switched your runtime to include a GPU, you can utilize NVIDIA's System Management Interface, or NVIDIA SMI.
```python
!nvidia-smi
```

If you see a GPU "0", then you have a GPU! Congrats! (Devices are indexed starting from 0; you want to be looking at the line just below the line of equals signs.) I have a



### Network and dataset
The code below defines a simple CNN classifier for the [CIFAR-10 dataset](https://www.cs.toronto.edu/~kriz/cifar.html).
```python
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation, Flatten
from tensorflow.keras.layers import Conv2D, MaxPooling2D

num_classes = 10

# The data, split between train and test sets:
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255

def get_model():
  model = Sequential()
  model.add(Conv2D(32, (3, 3), padding='same',
                  input_shape=x_train.shape[1:]))
  model.add(Activation('relu'))
  model.add(Conv2D(32, (3, 3)))
  model.add(Activation('relu'))
  model.add(MaxPooling2D(pool_size=(2, 2)))
  model.add(Dropout(0.25))

  model.add(Conv2D(64, (3, 3), padding='same'))
  model.add(Activation('relu'))
  model.add(Conv2D(64, (3, 3)))
  model.add(Activation('relu'))
  model.add(MaxPooling2D(pool_size=(2, 2)))
  model.add(Dropout(0.25))

  model.add(Flatten())
  model.add(Dense(512))
  model.add(Activation('relu'))
  model.add(Dropout(0.5))
  model.add(Dense(num_classes))
  model.add(Activation('softmax'))

  return model
```



### Selecting a learning rate
® Try training the above model using 4 rates over a geometric range from $10^{-5}$ to $10^{-1}$, using  `RMSProp` and shuffling. In the last line of your solution state what you think the best rate is as a comment.
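
A geometric range like the one requested here can be generated with `np.geomspace`, which spaces values evenly on a log scale. This is only a sketch of how to build the grid of rates; the variable name `learning_rates` is an assumption, and the training loop itself is left to your solution.

```python
import numpy as np

# Four learning rates, evenly spaced on a log scale between 1e-5 and 1e-1.
learning_rates = np.geomspace(1e-5, 1e-1, num=4)

for lr in learning_rates:
    print(f"{lr:.2e}")
```

With four points, consecutive rates differ by a constant factor of $10^{4/3} \approx 21.5$.
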

While exploring the optimal learning rate, keep in mind the competing advice from the textbook and this advice from _Deep Learning_:
> Typically, the optimal initial learning rate, in terms of total training time and the final cost value, is higher than the learning rate that yields the best performance after the first 100 iterations or so. Therefore, it is usually best to monitor the first several iterations and use a learning rate that is higher than the best-performing learning rate at this time, but not so high that it causes severe instability.

Or, when you don't have so much compute, 3 iterations. :-)


### Problem 4

### Learning Rate Schedules
After choosing a good rate above, use **exponential scheduling** to decrease the learning rate while training until the model achieves good accuracy on the test set.  Run for 20 epochs. Include early stopping with `patience=5`.
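
For reference, one common form of exponential scheduling divides the learning rate by 10 every `s` epochs. The sketch below assumes that form; `lr0=1e-3` and `s=10` are placeholder values, not recommendations. The inner function accepts the epoch index (and optionally the current rate), so it has the shape expected by the `LearningRateScheduler` callback.

```python
def exponential_decay(lr0, s):
    """Return a schedule that divides the learning rate by 10 every `s` epochs."""
    def schedule(epoch, lr=None):
        # `lr` (the current rate) is accepted for compatibility with callbacks
        # that pass it; this schedule depends only on the epoch index.
        return lr0 * 0.1 ** (epoch / s)
    return schedule

schedule = exponential_decay(lr0=1e-3, s=10)  # placeholder values
for epoch in (0, 5, 10, 20):
    print(epoch, schedule(epoch))
```
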

**IMPORTANT**: save the weights of the final model; we will use them in the next problem. The following code fragment is one way to achieve this:
```python
def save_model(model):
  import os
  model_name = 'keras_cifar10_trained_model.h5'
  save_dir = os.path.join(os.getcwd(), 'saved_models')
  
  # Save model and weights
  if not os.path.isdir(save_dir):
      os.makedirs(save_dir)
  model_path = os.path.join(save_dir, model_name)
  model.save(model_path)
  print('Saved trained model at %s ' % model_path)
```



## Transfer learning
The previous model was trained on 10 different classes.  It took quite a bit of time to train.

In this exercise, use **transfer learning** to create a new classifier for the following categories
- cars (label 1) 
- cats (label 3)
- trucks (label 9)
- other (labels 0, 2, 4, 5, 6, 7, 8). 


The code below does an appropriate relabeling of the `y` variables and assigns the results to the variables `y4_train` and `y4_test`.
```python
# The data, split between train and test sets:
(x_train, y_train), (x_test, y_test) = cifar10.load_data()

num_classes = 4

def label_map(label):
  label_dict = {1: 0, 3: 1, 9: 2}
  return label_dict.get(label, 3)

y4_train = np.array([label_map(label) for label in y_train.ravel()])
y4_test = np.array([label_map(label) for label in y_test.ravel()])
```
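
To check what the relabeling does, the following sketch applies the same `label_map` function to each of the ten original CIFAR-10 labels. Every label other than 1, 3, and 9 falls through to the default value 3, i.e. the "other" class.

```python
import numpy as np

def label_map(label):
    # Same mapping as above: cars -> 0, cats -> 1, trucks -> 2, everything else -> 3
    label_dict = {1: 0, 3: 1, 9: 2}
    return label_dict.get(label, 3)

original = np.arange(10)  # the ten original CIFAR-10 labels
mapped = np.array([label_map(label) for label in original])
print(dict(zip(original.tolist(), mapped.tolist())))
```
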



You could just use the classifier from the previous part. However, that classifier was trained to do the more difficult task of distinguishing 10 classes. We hope that we can do better if we develop a classifier that only has to distinguish 4 classes. On the other hand, we don't want to train our new model from scratch. 

® The solution is to reuse layers (specifically, the convolutional ones) of the previous model. To do so, freeze the convolutional layers and train the model on your new, modified data set for approximately 5-15 epochs. 

_Hint: You can use this to confirm which layers are currently trainable._
```python
def check_trainable(model):
  # Print the trainable flag of every layer in the model that was passed in
  for layer in model.layers:
    print(layer.trainable, layer)
```



### Problem 5

### Unfreezing more layers
® Try squeezing out some additional performance by unfreezing the top two convolutional layers and training your model for a further 5-15 epochs.  The textbook recommends a lower learning rate for this last step.


### Problem 6

### Compare Models
Compare the performance of the original model and your `new_model`.  Which one performs better?  Include code for the comparison and your answer below.


### Visualizations with TensorBoard
This week we'll continue working with TensorBoard. Let's begin by getting it set up.
```python
#imports
import tensorflow as tf
import numpy as np
from sklearn.datasets import make_circles
from matplotlib import pyplot
import tensorflow.keras as keras
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

# setup tensorboard, directories
!rm -rf ./logs
!mkdir ./logs/
!mkdir ./logs/hw2

log_dir="./logs/hw2/"
def tensorboard_callback(exp_name):
  return tf.keras.callbacks.TensorBoard(log_dir=log_dir + exp_name, profile_batch=0, histogram_freq=1)
# launch tensorboard with specific directory
%load_ext tensorboard
%tensorboard --logdir logs/hw2 
```

If this cell hangs when you first execute it, and you do not see any output, please interrupt execution and run it again.



## Regularization
In this problem you will experiment with `l1` and `l2` layer regularization and early stopping for neural nets.

Consider the following toy data set.


```python
X, y = make_circles(n_samples=600, noise=0.25, random_state=0, factor=0.5)
X_train, X_other = X[:200], X[200:]
y_train, y_other = y[:200], y[200:]

X_valid, X_test = X_other[:200], X_other[200:]
y_valid, y_test = y_other[:200], y_other[200:]

colors = ['blue' if label == 1 else 'red' for label in y]
pyplot.scatter(X[:,0], X[:,1], color=colors)
```

Run the code below to show that the proposed model overfits the data.



```python
def get_model():
  init = keras.initializers.glorot_uniform(seed=66)
  tf.random.set_seed(0)
  
  model = Sequential()
  model.add(Dense(20, input_dim=2, activation='relu', kernel_initializer=init))
  for i in range(6):
    model.add(Dense(20, activation='relu', kernel_initializer=init))
  model.add(Dense(1, activation='sigmoid'))
  return model

model = get_model()

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

history = model.fit(
    X_train,
    y_train,
    verbose=0, 
    epochs=1000, 
    batch_size=32,
    validation_data = (X_valid, y_valid),
    callbacks=[tensorboard_callback('hw2.4-baseline')]
)


def plot_decision_boundary(X, y, model, steps=100, cmap='Paired'):
  cmap = pyplot.get_cmap(cmap)
  
  # Define region of interest by data limits
  xmin, xmax = X[:,0].min() - 1, X[:,0].max() + 1
  ymin, ymax = X[:,1].min() - 1, X[:,1].max() + 1
  x_span = np.linspace(xmin, xmax, steps)
  y_span = np.linspace(ymin, ymax, steps)
  xx, yy = np.meshgrid(x_span, y_span)
  
  # Make predictions across region of interest
  labels = np.rint(model.predict(np.c_[xx.ravel(), yy.ravel()]))
  z = labels.reshape(xx.shape)
  
  fig, ax = pyplot.subplots(figsize=(16, 12), dpi=80,)
  ax.contourf(xx, yy, z, cmap=cmap, alpha=0.5)
  
  ax.scatter(X[:,0], X[:,1], c=y, cmap=cmap, lw=0)
  
  return fig, ax

plot_decision_boundary(X, y, model, cmap='RdBu')

_, train_accuracy = model.evaluate(X_train, y_train, callbacks=[])
_, valid_accuracy = model.evaluate(X_valid, y_valid, callbacks=[])
_, test_accuracy = model.evaluate(X_test, y_test, callbacks=[])

print("final training accuracy:", train_accuracy)
print("final validation accuracy:", valid_accuracy)
print("final test accuracy:", test_accuracy)
```



Over the next few problems, we'll implement the following regularization techniques:

- early stopping
- $l_1$ regularization
- $l_2$ regularization


® Implement early stopping for the dataset and neural network above. Assess whether the model is still overfitting and comment on the shape of the decision boundary.
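
To make the idea of patience concrete, here is a framework-free sketch of the stopping rule: training stops once the validation loss has failed to improve for `patience` consecutive epochs. The function name and the list of validation losses are hypothetical; in Keras the same idea is provided by the `EarlyStopping` callback, whose exact behavior may differ in details from this sketch.

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=0.0):
    """Return (epoch at which training stops, epoch with the best loss)."""
    best_loss = float("inf")
    best_epoch = -1
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Never ran out of patience: training ran through all epochs
    return len(val_losses) - 1, best_epoch

# Hypothetical validation losses: improvement stalls after epoch 2
val_losses = [0.9, 0.7, 0.6, 0.62, 0.61, 0.65, 0.7, 0.8]
stopped_at, best = early_stopping_epoch(val_losses, patience=3)
print(stopped_at, best)
```
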


### Problem 7

® Implement $l_1$ regularization for the dataset and neural network above. Assess whether the model is still overfitting and comment on the shape of the decision boundary.


### Problem 8

® Implement $l_2$ regularization for the dataset and neural network above. Assess whether the model is still overfitting and comment on the shape of the decision boundary.
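
For orientation, the two penalties differ only in how they measure the size of the weights: $l_1$ adds $\lambda \sum |w|$ to the loss and $l_2$ adds $\lambda \sum w^2$. The NumPy sketch below computes both for a hypothetical weight matrix `W` and a hypothetical strength `lam`. In Keras, penalties of this kind are attached to a layer through its `kernel_regularizer` argument, for example with `keras.regularizers.l1` or `keras.regularizers.l2`.

```python
import numpy as np

def l1_penalty(weights, lam):
    """l1 penalty: lam times the sum of absolute weight values."""
    return lam * np.sum(np.abs(weights))

def l2_penalty(weights, lam):
    """l2 penalty: lam times the sum of squared weight values."""
    return lam * np.sum(np.square(weights))

W = np.array([[0.5, -1.0],
              [0.0,  2.0]])  # hypothetical weight matrix
lam = 0.01                   # hypothetical regularization strength

print("l1:", l1_penalty(W, lam))
print("l2:", l2_penalty(W, lam))
```
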


### Problem 9

You can also share your TensorBoard at a custom web link using [TensorBoard.dev](https://tensorboard.dev/):

```python
!tensorboard dev upload \
  --logdir logs \
  --name "DATA1010 hw experiments" \
  --description "Each hw in its own sub-directory" \
  --one_shot
```

What's great about this is that the window will also graph your other models, as long as you store their logs properly and then upload them using the command above. 

When you run this cell, answer the yes/no prompt and authenticate.  It will take 3 or 4 minutes to transfer all of the training logs to TensorBoard.dev.

®  Provide a link to the resulting tensorboard.dev website for the hw2 experiments that used the tensorboard callback.



### Problem 10

## Conceptual problems
® Consider the computational problem of training a multilayer perceptron. During forward propagation, we cache the activations computed at each layer. Why do we cache these values (as opposed to discarding them after they have been used to compute the next layer's activations)? Select the correct answer choice and explain.
- (a) It is used to keep track of the hyperparameters that we are searching over, to speed up computation.
- (b) We use it to pass variables computed during backward propagation to the corresponding forward propagation step. It contains useful values for forward propagation to compute activations. 
- (c) We use it to pass variables computed during forward propagation to the corresponding backward propagation step. It contains useful values for backward propagation to compute derivatives. 
- (d) It is used to cache the intermediate values of the cost function during training.


### Problem 11

® Provide a brief conceptual explanation of (i) why deeper models might perform better and (ii) why they are more difficult to train.


### Problem 12

® What is the vanishing gradient problem, and why are some activation functions better than others at mitigating it? Compare leaky ReLU, ReLU and the Sigmoid function.


### Problem 13

### Chain rule
® The figure below illustrates a simple neural network that we looked at in DATA 1010. The goal of this network is to say for each point in the unit square whether it's to the left of the semicircle shown. For more explanation, check it out in the [gallery](https://prismia.chat/shared/gallery).

- (a) Click the "show best" button to set the values of the parameters to a particular collection which was obtained by training this neural network. You can move the point around to see that the decision boundary for the prediction function is indeed quite close to the semicircle. Now, wiggle the green slider connecting the top neuron in the input layer to the top neuron in the hidden layer. How much does this affect the predicted yellow probability (the top post-softmax value shown in the output layer)? Can you use this information to exactly determine the derivative of the loss function with respect to this particular weight (without doing any calculations)?
- (b) At the original input point (0.2, 0.3), there are two dead hidden-layer neurons. Does this mean that those neurons are dead for _every_ input point? Try moving the point around within the square to investigate.
- (c) Move the input point to approximately (0.85, 0.15). Wiggle the point left and right to change the value in the top input neuron. Along how many pathways in the computational graph does this change influence the value in the top _output_ neuron? Apply the chain rule to compute the derivative of the logit printed in the top output neuron with respect to the top input neuron (for an input of (0.85, 0.15) and parameters set to the 'show best' values).


_Note: for the last part, you don't need to frame the problem in symbolic mathematical terms to solve it. Think directly about how small changes propagate in the network. You can experiment with [this mathlet](https://prismia.chat/shared/9NY8-45CQ) if you want to think through propagation of small changes in a simpler context first._


### Problem 14