**1. Draw an ANN using the original artificial neurons that computes A XOR B. Hint: A XOR B = (A and !B) or (!A and B)**

![XOR](./XOR.jpg)
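The hint can also be checked in code. The sketch below is an assumption about how to model the neurons: it uses simple threshold units with hand-picked weights (a negative weight stands in for an inhibitory connection), which may not match the drawing exactly. The names `neuron` and `xor` are hypothetical.

```python
def neuron(inputs, weights, threshold):
    """Fire (return 1) when the weighted sum of binary inputs reaches the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0


def xor(a, b):
    # Hidden layer: one neuron per term of the hint
    a_and_not_b = neuron([a, b], [1, -1], threshold=1)  # fires only for (A, B) = (1, 0)
    not_a_and_b = neuron([a, b], [-1, 1], threshold=1)  # fires only for (A, B) = (0, 1)
    # Output layer: logical OR of the two hidden neurons
    return neuron([a_and_not_b, not_a_and_b], [1, 1], threshold=1)


truth_table = [(a, b, xor(a, b)) for a in (0, 1) for b in (0, 1)]
for a, b, out in truth_table:
    print(a, b, out)
```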

**2. Why is it generally preferable to use a Logistic Regression classifier rather than a classical Perceptron? How can you tweak a Perceptron to make it equivalent to a Logistic Regression classifier?**

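The classical Perceptron uses the step function as its activation, so it only outputs a hard 0 or 1 for each class, and its training algorithm converges only when the data is linearly separable. Logistic Regression, on the other hand, outputs a probability for each class and still finds a reasonable solution when the classes are not linearly separable. To make a Perceptron equivalent, replace the step function with the sigmoid (logistic) function, or softmax for multiple classes, and train it with Gradient Descent on the log loss (cross-entropy).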
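To make the difference concrete, here is a small sketch that applies both activations to the same weighted sum. The weights, bias, and the two sample points are made up for illustration.

```python
import math


def weighted_sum(x, w, b):
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def step(z):
    """Classical Perceptron activation: a hard 0/1 decision."""
    return 1 if z >= 0 else 0


def sigmoid(z):
    """Logistic activation: a value in (0, 1) that can be read as a probability."""
    return 1.0 / (1.0 + math.exp(-z))


w, b = [2.0, -1.0], 0.5        # hypothetical weights and bias
near_boundary = [0.0, 0.6]     # weighted sum is about -0.1
far_from_boundary = [-3.0, 2.0]  # weighted sum is -7.5

z_near = weighted_sum(near_boundary, w, b)
z_far = weighted_sum(far_from_boundary, w, b)

# Both points get the same hard label from the step function...
print(step(z_near), step(z_far))
# ...but the sigmoid shows how confident the model is in each case.
print(round(sigmoid(z_near), 3), round(sigmoid(z_far), 5))
```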

**3. Why was the logistic activation function a key ingredient in training the first MLPs?**

With the step function, the derivative is undefined at x = 0 and is 0 everywhere else. Gradient Descent cannot move on a flat surface, so the gradient carries no information about which direction reduces the loss. The logistic function has a well-defined, nonzero derivative everywhere, so every gradient step gives the weights a direction in which to move.
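A quick numerical check of this argument, as a sketch: it estimates the slope of the step function with central differences at a few arbitrary points away from 0 and compares it with the analytic derivative of the logistic function, s(z)(1 - s(z)).

```python
import math


def step(z):
    return 1.0 if z >= 0 else 0.0


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def numerical_grad(f, z, eps=1e-6):
    """Central-difference estimate of df/dz."""
    return (f(z + eps) - f(z - eps)) / (2 * eps)


points = [-2.0, -0.5, 0.5, 2.0]  # arbitrary sample points away from z = 0
step_grads = [numerical_grad(step, z) for z in points]
sigmoid_grads = [sigmoid_grad(z) for z in points]

print(step_grads)     # flat everywhere: no signal for Gradient Descent
print(sigmoid_grads)  # strictly positive: there is always a slope to follow
```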

**4. Name three popular activation functions. Can you draw them?**

![Activation functions](./activations.jpg)
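For reference, three commonly used activation functions are the logistic (sigmoid) function, the hyperbolic tangent, and ReLU. The sketch below assumes these are the three of interest (the drawing may show a different selection) and evaluates each at a few points so the shapes can be compared.

```python
import math


def logistic(z):
    """S-shaped, output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    """S-shaped, output in (-1, 1), centered on 0."""
    return math.tanh(z)


def relu(z):
    """0 for negative inputs, identity for positive inputs."""
    return max(0.0, z)


for z in (-2.0, 0.0, 2.0):
    print(f"z={z:+.1f}  logistic={logistic(z):.3f}  tanh={tanh(z):+.3f}  relu={relu(z):.1f}")
```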

**5. Suppose you have an MLP composed of one input layer with 10 passthrough neurons, followed by one hidden layer with 50 artificial neurons, and finally one output layer with 3 artificial neurons. All artificial neurons use the ReLU activation function.**


- **What is the shape of the input matrix X?**

The shape would be (batch size, 10) because there are *batch size* instances each with 10 features.

- **What about the shape of the hidden layer's weight vector Wh and the shape of its bias vector bh?**

The weight matrix Wh has shape (10, 50): one row per input feature and one column per hidden neuron. The bias vector bh has shape (50,) because each of the 50 hidden neurons has its own bias term.

- **What is the shape of the output layer's weight vector Wo and its bias vector bo?**

The weight matrix Wo has shape (50, 3): one row per hidden neuron and one column per output neuron. The bias vector bo has shape (3,), one bias term per output neuron.

- **What is the shape of the network's output matrix Y?**

It is (batch size, 3).

- **Write the equation that computes the network's output matrix Y as a function of X, Wh, bh, Wo, bo.**

**Y** = ReLU(ReLU(**X** Wh + bh) Wo + bo)
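One way to check these shapes is to run the forward pass in NumPy. The sketch below assumes a batch size of 4 and random weights (both made up), with `W_h` of shape (10, 50), `b_h` of shape (50,), `W_o` of shape (50, 3), and `b_o` of shape (3,). The bias vectors are broadcast across the batch dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size = 4  # hypothetical; any batch size works

X = rng.normal(size=(batch_size, 10))  # one row per instance, 10 features each
W_h = rng.normal(size=(10, 50))        # inputs x hidden neurons
b_h = np.zeros(50)                     # one bias per hidden neuron
W_o = rng.normal(size=(50, 3))         # hidden neurons x output neurons
b_o = np.zeros(3)                      # one bias per output neuron


def relu(z):
    return np.maximum(0, z)


H = relu(X @ W_h + b_h)  # hidden activations, shape (batch_size, 50)
Y = relu(H @ W_o + b_o)  # network output, shape (batch_size, 3)

print(X.shape, W_h.shape, b_h.shape, W_o.shape, b_o.shape, Y.shape)
```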

**6. How many neurons do you need in the output layer if you want to classify email into spam or ham? What activation function should you use in the output layer? If instead you wanted to tackle MNIST, how many neurons do you need in the output layer, using what activation function? Answer the same questions for getting your network to predict housing prices as in Chapter 2.**


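- Email: 1 output neuron with the sigmoid (logistic) activation, which outputs the probability of spam. Two output neurons with softmax also work but are redundant for a binary problem.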

- MNIST: 10 output neurons with softmax

- Housing prices: 1 output neuron with no activation function
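The binary and multiclass cases are closely related: a softmax over two logits gives the same probability as a sigmoid applied to their difference. The sketch below checks this and also runs softmax over ten logits as in MNIST. All logit values are made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Spam vs. ham: two softmax outputs vs. one sigmoid output
spam_logit, ham_logit = 1.7, 0.5  # hypothetical values
p_spam_softmax = softmax([spam_logit, ham_logit])[0]
p_spam_sigmoid = sigmoid(spam_logit - ham_logit)

# MNIST: ten logits, one probability per digit
digit_logits = [0.1, 2.0, -1.0, 0.3, 0.0, 0.5, -0.7, 1.1, 0.2, -0.3]  # hypothetical
digit_probs = softmax(digit_logits)
predicted_digit = digit_probs.index(max(digit_probs))

print(p_spam_softmax, p_spam_sigmoid)
print(predicted_digit, sum(digit_probs))
```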

**7. What is backpropagation and how does it work? What is the difference between backpropagation and reverse-mode autodiff?**

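Backpropagation is the algorithm used to train the network. For each training batch, it first runs a forward pass and measures the output error. It then goes backward through the network, using the chain rule to compute how much each connection in the last hidden layer contributed to that error, then how much of those contributions came from the layer before it, and so on until it reaches the input layer. This backward pass yields the gradient of the loss with respect to every weight and bias, and a Gradient Descent step then updates the parameters. Reverse-mode autodiff is only the technique that computes the gradients efficiently in the backward pass; backpropagation refers to the whole training procedure that uses those gradients to update the parameters.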
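The backward pass can be written out by hand for a tiny network. The sketch below assumes a single input, one sigmoid hidden neuron, a linear output, and a squared-error loss; all of these choices and the parameter values are made up for illustration. It computes the gradients with the chain rule, checks them against finite differences, and then applies one Gradient Descent step.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, x):
    w1, b1, w2, b2 = params
    h = sigmoid(w1 * x + b1)  # hidden activation
    y = w2 * h + b2           # linear output
    return h, y


def loss_fn(params, x, target):
    _, y = forward(params, x)
    return (y - target) ** 2


def backward(params, x, target):
    """Chain rule applied from the output back to the input."""
    w1, b1, w2, b2 = params
    h, y = forward(params, x)
    dy = 2.0 * (y - target)   # dL/dy
    dw2, db2 = dy * h, dy     # output layer gradients
    dh = dy * w2              # error passed back to the hidden neuron
    dz = dh * h * (1.0 - h)   # through the sigmoid derivative
    dw1, db1 = dz * x, dz     # hidden layer gradients
    return [dw1, db1, dw2, db2]


params = [0.5, -0.2, 1.5, 0.1]  # hypothetical w1, b1, w2, b2
x, target = 2.0, 1.0            # hypothetical training instance

grads = backward(params, x, target)

# Finite-difference check of each gradient
eps = 1e-6
numeric = []
for i in range(len(params)):
    up = params.copy()
    up[i] += eps
    down = params.copy()
    down[i] -= eps
    numeric.append((loss_fn(up, x, target) - loss_fn(down, x, target)) / (2 * eps))

# One Gradient Descent step using the gradients from the backward pass
learning_rate = 0.1
new_params = [p - learning_rate * g for p, g in zip(params, grads)]

print(loss_fn(params, x, target), loss_fn(new_params, x, target))
```

The `backward` function corresponds to the gradient computation alone; the update of `new_params` is the extra step that turns it into training.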

**8. Can you list all the hyperparameters you can tweak in an MLP? If the MLP overfits the training data, how could you tweak these hyperparameters to try to solve the problem?**

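The main hyperparameters are the number of hidden layers, the number of neurons in each hidden layer, and the activation function used in each layer. Training hyperparameters such as the learning rate, batch size, and number of epochs can be tuned as well. If the MLP overfits, reduce the number of neurons per layer or the number of hidden layers, since both lower the complexity the model can represent. Changing the activation function should have little effect on overfitting.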
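To get a feel for how much these two knobs shrink a model, the sketch below counts the trainable parameters of a fully connected network. The layer sizes mirror the 784-150-150-150-10 MNIST model from exercise 9; the two smaller variants are hypothetical. Parameter count is only a rough proxy for model capacity.

```python
def count_parameters(layer_sizes):
    """Weights plus biases for a stack of dense layers."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


baseline = count_parameters([784, 150, 150, 150, 10])
fewer_neurons = count_parameters([784, 50, 50, 50, 10])  # hypothetical variant
fewer_layers = count_parameters([784, 150, 10])          # hypothetical variant

print(baseline, fewer_neurons, fewer_layers)
```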

**9. Train a deep MLP on the MNIST dataset and see if you can get over 98% precision. Just like in the last exercise of Chapter 9, try adding all the bells and whistles (i.e., save checkpoints, restore the last checkpoint in case of an interruption, plot learning curves using TensorBoard, and so on).**

In [8]:
import tensorflow as tf
from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
# Scale pixel values from [0, 255] to [0, 1]
x_train = x_train / 255.0
x_test = x_test / 255.0

In [11]:
# Define the model: three hidden ReLU layers of 150 units, each followed by dropout
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28,28)),
    tf.keras.layers.Dense(150, activation='relu'),
    tf.keras.layers.Dropout(.33),
    tf.keras.layers.Dense(150, activation='relu'),
    tf.keras.layers.Dropout(.33),
    tf.keras.layers.Dense(150, activation='relu'),
    tf.keras.layers.Dropout(.33),
    tf.keras.layers.Dense(10, activation='softmax')
])

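# Labels are integers (0-9), so use the sparse variant of categorical cross-entropy
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# Stop once the training loss has not improved for 2 epochs and restore the best weights
early_stop = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=2, restore_best_weights=True)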

In [13]:
model.fit(x_train, y_train, callbacks=[early_stop], batch_size=32, epochs=20)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20


<tensorflow.python.keras.callbacks.History at 0x2b1028bf940>

In [14]:
model.evaluate(x_test, y_test) # Achieves an accuracy of 98.14% after 31 epochs of training!



[0.08138713985681534, 0.9814000129699707]
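The run above uses only early stopping on the training loss. The exercise also asks for checkpoints and TensorBoard logs, so here is a minimal sketch of how those callbacks could be set up. The directory name `mnist_run`, the file name `best_model.h5`, and the helper `make_callbacks` are hypothetical, and monitoring `val_loss` assumes that `fit` is given validation data, for example through `validation_split`.

```python
import os

import tensorflow as tf


def make_callbacks(run_dir):
    """Build callbacks for checkpointing, TensorBoard logging, and early stopping."""
    os.makedirs(run_dir, exist_ok=True)
    # Save the model whenever the validation loss improves
    checkpoint = tf.keras.callbacks.ModelCheckpoint(
        filepath=os.path.join(run_dir, "best_model.h5"),
        monitor="val_loss",
        save_best_only=True,
    )
    # Write learning curves that TensorBoard can plot
    tensorboard = tf.keras.callbacks.TensorBoard(log_dir=os.path.join(run_dir, "logs"))
    # Stop when the validation loss stalls, rather than the training loss
    early_stop = tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=2, restore_best_weights=True
    )
    return [checkpoint, tensorboard, early_stop]


callbacks = make_callbacks("mnist_run")  # hypothetical run directory

# Usage with the model and data defined earlier in this notebook:
# model.fit(x_train, y_train, validation_split=0.1,
#           batch_size=32, epochs=20, callbacks=callbacks)
```

After an interruption, the last saved model could be reloaded with `tf.keras.models.load_model("mnist_run/best_model.h5")` and training resumed from there. The learning curves can be viewed by running `tensorboard --logdir mnist_run/logs`.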