# Chapter 10: Introduction to Artificial Neural Networks with Keras Exercises

## 1.

> The TensorFlow Playground is a handy neural network simulator built by the TensorFlow team. 

> In this exercise, you will train several binary classifiers in just a few clicks, and tweak the model's architecture and its hyperparameters to gain some intuition on how neural networks work and what their hyperparameters do. Take some time to explore the following:

>> a. The patterns learned by a neural net.
>>
>>> - Try training the default neural network by clicking the Run button (top left).
>>>
>>> - Notice how it quickly finds a good solution for the classification task.
>>>
>>> - The neurons in the first hidden layer have learned simple patterns, while the neurons in the second hidden layer have learned to combine the simple patterns of the first hidden layer into more complex patterns.
>>>
>>> - In general, the more layers there are, the more complex the patterns can be.
>>
>> b. Activation functions.
>>
>>> - Try replacing the tanh activation function with a ReLU activation function, and train the network again.
>>>
>>> - Notice that it finds a solution even faster, but this time the boundaries are linear. This is due to the shape of the ReLU function.
>>
>> c. The risk of local minima.
>>
>>> - Modify the network architecture to have just one hidden layer with three neurons.
>>>
>>> - Train it multiple times (to reset the network weights, click the Reset button next to the Play button).
>>>
>>> - Notice that the training time varies a lot, and sometimes it even gets stuck in a local minimum.
>>
>> d. What happens when neural nets are too small.
>>
>>> - Remove one neuron to keep just two.
>>>
>>> - Notice that the neural network is now incapable of finding a good solution, even if you try multiple times.
>>>
>>> - The model has too few parameters and systematically underfits the training set.
>>
>> e. What happens when neural nets are large enough.
>>
>>> - Set the number of neurons to eight, and train the network several times.
>>>
>>> - Notice that it is now consistently fast and never gets stuck.
>>>
>>> - This highlights an important finding in neural network theory: large neural networks almost never get stuck in local minima, and even when they do these local optima are almost as good as the global optimum.
>>>
>>> - However, they can still get stuck on long plateaus for a long time.
>>
>> f. The risk of vanishing gradients in deep networks.
>>
>>> - Select the spiral dataset (the bottom-right dataset under "DATA"), and change the network architecture to have four hidden layers with eight neurons each.
>>>
>>> - Notice that training takes much longer and often gets stuck on plateaus for long periods of time.
>>>
>>> - Also notice that the neurons in the highest layers (on the right) tend to evolve faster than the neurons in the lowest layers (on the left).
>>>
>>> - This problem, called the "vanishing gradients" problem, can be alleviated with better weight initialization and other techniques, better optimizers (such as AdaGrad or Adam), or Batch Normalization (discussed in Chapter 11).
>>
>> g. Go further.
>>
>>> - Take an hour or so to play around with other parameters and get a feel for what they do, to build an intuitive understanding of neural networks.

a.
- I can see that for the 1st hidden layer, the patterns learned are linear decision boundaries.
- For the 2nd hidden layer, the decision boundaries are more complicated.

b.
- Switching to ReLU, I can see that it converges much faster than tanh (in fewer epochs).
- The boundaries are made of straight line segments (a hexagonal decision boundary), compared to tanh's circular decision boundary.

c.
- I noticed that ReLU tended to get stuck in a local minimum much more often than tanh.
- When using tanh, it would get stuck in a local minimum for a little while but eventually converge correctly.

d.
- I see that by decreasing the number of neurons to just 2, the neural network fails to find a good decision boundary (with test loss at about 0.24 vs. 0.005).
- It is underfitting the training set.

e.
- With 8 neurons, the neural network never gets stuck at a local optimum and always converges.

f.
- With 4 hidden layers of 8 neurons each, training takes a significantly longer time.
- I also see that it tends to get stuck on plateaus a lot more often.

g.
- I can see how too large a learning rate can cause the neural network to diverge.
- Too small a learning rate makes it take forever to converge.
- Despite having many layers (6), the neural network can still fail to converge if there just aren't enough neurons (1 per layer).
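
The learning-rate behaviour I saw in the Playground can be reproduced on a much simpler problem. This is only a toy sketch, assuming plain gradient descent on $f(x) = x^2$ rather than a neural network; the function name and the three learning rates are my own illustrative choices.

```python
def gradient_descent(learning_rate, n_steps=50, x0=1.0):
    """Minimize f(x) = x**2 with plain gradient descent and return the final x."""
    x = x0
    for _ in range(n_steps):
        grad = 2.0 * x              # derivative of x**2
        x -= learning_rate * grad   # standard gradient descent update
    return x

# Illustrative learning rates (assumptions, not values from the Playground).
too_small = gradient_descent(0.001)   # barely moves away from x0
reasonable = gradient_descent(0.1)    # shrinks steadily toward the minimum at 0
too_large = gradient_descent(1.1)     # every step overshoots, so |x| keeps growing

print(too_small, reasonable, too_large)
```

Each update multiplies `x` by `1 - 2 * learning_rate`, so the iterate shrinks only while that factor has magnitude below 1. Beyond a learning rate of 1 the magnitude exceeds 1 and the iterates diverge.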

## 2.

> - Draw an ANN using the original artificial neurons (like the ones in Figure 10-3 "logic operations") that computes $ A \oplus B $ (where $\oplus$ represents the XOR operation).

>> Hint: $ A \oplus B = (A \wedge \neg B) \lor (\neg A \wedge B) $.
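
Before drawing the network, the decomposition in the hint can be checked in code. This is a minimal sketch, assuming simple weighted threshold units (the `step_neuron` helper, its weights, and its thresholds are my own choices, not the exact neuron convention of Figure 10-3): two hidden units compute "A and not B" and "not A and B", and an output unit ORs them.

```python
def step_neuron(inputs, weights, threshold):
    """Fire (return 1) when the weighted sum of the inputs reaches the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0

def xor(a, b):
    # Hidden layer: a negative weight plays the role of an inhibitory connection.
    a_and_not_b = step_neuron([a, b], [1, -1], threshold=1)
    not_a_and_b = step_neuron([a, b], [-1, 1], threshold=1)
    # Output layer: OR of the two hidden units.
    return step_neuron([a_and_not_b, not_a_and_b], [1, 1], threshold=1)

truth_table = {(a, b): xor(a, b) for a in (0, 1) for b in (0, 1)}
for inputs, output in truth_table.items():
    print(inputs, "->", output)
```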

## 3.

> - Why is it generally preferable to use a Logistic Regression classifier rather than a classical Perceptron (i.e., a single layer of threshold logic units trained using the Perceptron training algorithm)?

> - How can you tweak a Perceptron to make it equivalent to a Logistic Regression classifier?

## 4.

> - Why was the logistic activation function a key ingredient in training the first MLPs?

## 5.

> - Name three popular activation functions.

> - Can you draw them?
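
As a starting point for drawing them, here is a minimal sketch of three commonly used activation functions (sigmoid, tanh, and ReLU). It uses only the standard library; evaluating each function on a grid of `z` values gives the points needed for a plot.

```python
import math

def sigmoid(z):
    """Logistic function: S-shaped, output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Hyperbolic tangent: S-shaped, output in (-1, 1)."""
    return math.tanh(z)

def relu(z):
    """Rectified linear unit: 0 for negative inputs, identity otherwise."""
    return max(0.0, z)

for z in (-2.0, 0.0, 2.0):
    print(f"z={z:+.1f}  sigmoid={sigmoid(z):.3f}  tanh={tanh(z):.3f}  relu={relu(z):.1f}")
```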

## 6.

> Suppose you have an MLP composed of one input layer with 10 passthrough neurons, followed by one hidden layer with 50 artificial neurons, and finally one output layer with 3 artificial neurons. All artificial neurons use the ReLU activation function.

> - What is the shape of the input matrix $\mathbf{X}$?

> - What are the shapes of the hidden layer's weight matrix $\mathbf{W}_h$ and its bias vector $\mathbf{b}_h$?

> - What are the shapes of the output layer's weight matrix $\mathbf{W}_o$ and its bias vector $\mathbf{b}_o$?

> - What is the shape of the network's output matrix $\mathbf{Y}$?

> - Write the equation that computes the network's output matrix $\mathbf{Y}$ as a function of $\mathbf{X}$, $\mathbf{W}_h$, $\mathbf{b}_h$, $\mathbf{W}_o$, and $\mathbf{b}_o$.
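
The shapes can be checked by running a forward pass. This is a sketch under a few assumptions: the batch size `m = 4` is arbitrary, the weights are random placeholders, and the matrices are plain Python lists so that it runs without any library (in practice NumPy or Keras would do this). It computes $\mathbf{Y} = \text{ReLU}(\text{ReLU}(\mathbf{X}\mathbf{W}_h + \mathbf{b}_h)\mathbf{W}_o + \mathbf{b}_o)$.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add_bias(a, bias):
    """Add the bias vector to every row of the matrix."""
    return [[x + b for x, b in zip(row, bias)] for row in a]

def relu(a):
    """Apply ReLU element-wise."""
    return [[max(0.0, x) for x in row] for row in a]

def shape(a):
    return (len(a), len(a[0]))

random.seed(0)
m = 4  # hypothetical batch size; any number of instances works
X = [[random.gauss(0, 1) for _ in range(10)] for _ in range(m)]      # (m, 10)
W_h = [[random.gauss(0, 1) for _ in range(50)] for _ in range(10)]   # (10, 50)
b_h = [0.0] * 50                                                     # length 50
W_o = [[random.gauss(0, 1) for _ in range(3)] for _ in range(50)]    # (50, 3)
b_o = [0.0] * 3                                                      # length 3

H = relu(add_bias(matmul(X, W_h), b_h))   # hidden activations, (m, 50)
Y = relu(add_bias(matmul(H, W_o), b_o))   # network output, (m, 3)

print(shape(X), shape(W_h), shape(H), shape(W_o), shape(Y))
```

The inner dimensions have to match at each product: 10 input features feed 50 hidden neurons, and 50 hidden activations feed 3 output neurons, so the output has one row per instance and one column per output neuron.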

## 7.

> - How many neurons do you need in the output layer if you want to classify email into spam or ham?

> - What activation function should you use in the output layer?

> - If instead you want to tackle MNIST, how many neurons do you need in the output layer?

> - Which activation function should you use?

> - What about for getting your network to predict housing prices, as in Chapter 2?

## 8.

> - What is backpropagation and how does it work?

> - What is the difference between backpropagation and reverse-mode autodiff?
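
To make the forward pass / backward pass idea concrete, here is a toy sketch on a single sigmoid neuron with a squared-error loss. The setup, the helper names, and the input values are my own assumptions for illustration. The backward pass applies the chain rule from the loss back to the parameters, and a finite-difference estimate is used as a sanity check.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_fn(w, b, x, y):
    """Forward pass: squared error of a single sigmoid neuron."""
    a = sigmoid(w * x + b)
    return 0.5 * (a - y) ** 2

def gradients(w, b, x, y):
    """Backward pass: apply the chain rule from the loss back to w and b."""
    a = sigmoid(w * x + b)          # forward pass, keeping the intermediate value
    dloss_da = a - y                # derivative of 0.5 * (a - y)**2 w.r.t. a
    da_dz = a * (1.0 - a)           # derivative of the sigmoid w.r.t. z
    dloss_dz = dloss_da * da_dz
    return dloss_dz * x, dloss_dz   # dz/dw = x and dz/db = 1

def numerical_gradient(f, value, eps=1e-6):
    """Central finite-difference estimate of df/dvalue."""
    return (f(value + eps) - f(value - eps)) / (2.0 * eps)

w, b, x, y = 0.5, -0.2, 1.5, 1.0  # arbitrary illustrative values
grad_w, grad_b = gradients(w, b, x, y)
check_w = numerical_gradient(lambda v: loss_fn(v, b, x, y), w)
check_b = numerical_gradient(lambda v: loss_fn(w, v, x, y), b)
print(grad_w, check_w)
print(grad_b, check_b)
```

A gradient descent step would then subtract the learning rate times `grad_w` and `grad_b` from `w` and `b`.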

## 9.

> - Can you list all the hyperparameters you can tweak in a basic MLP?

> - If the MLP overfits the training data, how could you tweak these hyperparameters to try to solve the problem?

## 10.

> 1. Train a deep MLP on the MNIST dataset.

>> You can load it using `keras.datasets.mnist.load_data()`.

> 2. See if you can get over 98% precision.

> 3. Try searching for the optimal learning rate by using the approach presented in this chapter.

>> i.e., by growing the learning rate exponentially, plotting the loss, and finding the point where the loss shoots up.

> 4. Try adding all the bells and whistles - save checkpoints, use early stopping, and plot learning curves using TensorBoard.
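
For step 3, the learning rate search boils down to two pieces: a schedule that multiplies the learning rate by a constant factor at every step, and a way to spot where the loss shoots up. The sketch below covers only that logic in plain Python. Both helper names and the `ratio=2.0` threshold are my own assumptions, and in a real run the losses would come from training the Keras model for one step per learning rate.

```python
def exponential_learning_rates(min_lr, max_lr, n_steps):
    """Return n_steps learning rates growing by a constant factor from min_lr to max_lr."""
    factor = (max_lr / min_lr) ** (1.0 / (n_steps - 1))
    return [min_lr * factor ** step for step in range(n_steps)]

def first_blowup_index(losses, ratio=2.0):
    """Index of the first loss exceeding `ratio` times the best loss so far, or None."""
    best = float("inf")
    for i, loss in enumerate(losses):
        if loss > ratio * best:
            return i
        best = min(best, loss)
    return None

# Illustrative range (an assumption): 500 steps from 1e-5 up to 10.
rates = exponential_learning_rates(1e-5, 10.0, n_steps=500)
print(rates[0], rates[-1])
```

Given the recorded losses, `rates[first_blowup_index(losses)]` is roughly where training starts to diverge, so a good learning rate to train with sits somewhat below that value.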