---
# Cairo University Faculty of Engineering
## Deep Learning
## Assignment 1

---

Please write your full name here
- **Name** : "-----------"

## Table of Contents
- [Part1: Tensorflow and Python](#1)
    - [1.1 - Sigmoid function, tf.exp()](#1-1)
    - [1.2 - Sigmoid gradient](#1-2)
    - [1.3 - Reshaping arrays](#1-3)
    - [1.4 - Normalizing rows](#1-4)
        - [normalize_rows](#1-4-1)
        - [softmax](#1-4-2)
    - [2 - Vectorization](#2)
        - [2.1 - Implement the L1 and L2 loss functions](#2-1)
            - [L1](#2-1-1)
            - [L2](#2-1-2)
- [Part2: Intro to TensorFlow](#3)

In [None]:
import math
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import random

np.random.seed(1234)
tf.random.set_seed(1234)
random.seed(1234)

<a name='1'></a>
# Part1: Tensorflow and Python
## **Instructions:**

- Avoid using for-loops and while-loops, unless you are explicitly told to do so.
- After coding your function, run the cell right below it to check if your result is correct.
- **Use tensorflow in all your code unless stated otherwise** ⏰

**You only need to write code between the ### START CODE HERE ### and ### END CODE HERE ### comments.**

<a name='1-1'></a>
## 1 - Building basic functions with tensorflow ##

### 1.1 - Sigmoid function ###

**Exercise**: Build a function that returns the sigmoid of a real number x. Use math.exp(x) for the exponential function.

**Reminder**:
$sigmoid(x) = \frac{1}{1+e^{-x}}$ is sometimes also known as the logistic function. It is a non-linear function used not only in Machine Learning (Logistic Regression), but also in Deep Learning.

<img src="https://i.ibb.co/4fw1Qzk/sigmoid.png" alt="sigmoid" border="0">

In [None]:
# GRADED FUNCTION: basic_sigmoid

def basic_sigmoid(x):
    """
    Compute sigmoid of x.

    Arguments:
    x -- A scalar

    Return:
    s -- sigmoid(x)
    """

    ### START CODE HERE ### (≈ 1 line of code)

    ### END CODE HERE ###

    return s

In [None]:
basic_sigmoid(3)

In [None]:
### One reason why we use "tf" instead of "math" in Deep Learning ###
x = [1, 2, 3]
basic_sigmoid(x) # you will see this give an error when you run it, because x is a vector.

**Exercise**: Implement the sigmoid function using TENSORFLOW.

**Instructions**: x could now be either a real number, a vector, or a matrix.
$$ \text{For } x \in \mathbb{R}^n \text{,     } sigmoid(x) = sigmoid\begin{pmatrix}
    x_1  \\
    x_2  \\
    ...  \\
    x_n  \\
\end{pmatrix} = \begin{pmatrix}
    \frac{1}{1+e^{-x_1}}  \\
    \frac{1}{1+e^{-x_2}}  \\
    ...  \\
    \frac{1}{1+e^{-x_n}}  \\
\end{pmatrix}\tag{1} $$
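
As a quick illustration of the difference noted above, the following sketch (not part of the graded code) contrasts `math.exp`, which accepts only a single number, with `tf.exp`, which applies the exponential to every entry of a tensor. The names `values`, `math_exp_failed`, and `elementwise` are chosen here purely for illustration.

```python
import math
import tensorflow as tf

values = [1.0, 2.0, 3.0]

# math.exp expects one real number, so passing a list raises a TypeError.
try:
    math.exp(values)
    math_exp_failed = False
except TypeError:
    math_exp_failed = True

# tf.exp works element-wise on a tensor of any shape.
elementwise = tf.exp(tf.constant(values))

print("math.exp failed on a list:", math_exp_failed)
print("tf.exp result:", elementwise.numpy())
```

This element-wise behaviour is what allows one expression to evaluate equation (1) for a whole vector at once.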

In [None]:
# GRADED FUNCTION: sigmoid

def sigmoid(x):
    """
    Compute the sigmoid of x

    Arguments:
    x -- A scalar or numpy array of any size

    Return:
    s -- sigmoid(x)
    """

    ### START CODE HERE ### (≈ 1 line of code)


    ### END CODE HERE ###

    return s

In [None]:
x = np.array([1, 2, 3], dtype=float)
sigmoid(x)

<a name='1-2'></a>
### 1.2 - Sigmoid gradient

As you've seen, you will need to compute gradients to optimize loss functions. Let's calculate the gradient of the sigmoid function.

**Exercise**: Calculate the gradient/derivative of the sigmoid function using LaTeX. SHOW YOUR WORK
$$\sigma(x) = \frac{1}{1+e^{-x}}$$

**Answer**




Let's code your first gradient function.

**Exercise**: Implement the function sigmoid_derivative() to compute the gradient of the sigmoid function with respect to its input x. Use the formula you calculated in the previous exercise.
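
One way to sanity-check the formula you derived, without relying on it, is to compare it against a finite-difference approximation. The sketch below uses only the standard `math` module. The helper names `numerical_derivative` and `logistic` and the step size `h` are assumptions made for this illustration, not part of the assignment.

```python
import math

def numerical_derivative(f, x, h=1e-5):
    """Central-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

def logistic(x):
    """Scalar sigmoid, used only as the function to differentiate."""
    return 1.0 / (1.0 + math.exp(-x))

# Approximate slope of the sigmoid at a few points.
for point in (-2.0, 0.0, 2.0):
    print(point, numerical_derivative(logistic, point))

approx_at_zero = numerical_derivative(logistic, 0.0)
```

If your analytic gradient is correct, evaluating it at the same points should agree with these approximations to several decimal places.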

In [None]:
# GRADED FUNCTION: sigmoid_derivative

def sigmoid_derivative(x):
    """
    Compute the gradient (also called the slope or derivative) of the sigmoid function with respect to its input x.
    You can store the output of the sigmoid function into variables and then use it to calculate the gradient.

    Arguments:
    x -- A scalar or numpy array

    Return:
    ds -- Your computed gradient.
    """

    ### START CODE HERE ### (≈ 2 lines of code)


    ### END CODE HERE ###

    return ds

In [None]:
x = np.array([1, 2, 3], dtype=float)
print ("sigmoid_derivative(x) = " + str(sigmoid_derivative(x)))

<a name='1-3'></a>
### 1.3 - Reshaping arrays ###

Two common functions used in deep learning are `tf.shape` and `tf.reshape`.
- `tf.shape(X)` (or the `X.shape` attribute) is used to get the shape (dimension) of a matrix/vector X.
- `tf.reshape(X, ...)` is used to reshape X into some other dimension.

For example, in computer science, an image is represented by a 3D array of shape $(length, height, depth = 3)$. However, when you read an image as the input of an algorithm, you convert it to a vector of shape $(length*height*3, 1)$. In other words, you "unroll", or reshape, the 3D array into a single column vector.



<img src="https://i.ibb.co/2PVNWKL/image2vector-kiank.png" alt="image2vector-kiank" border="0">

**Exercise**: Implement `image2vector()` that takes an input of shape (length, height, 3) and returns a vector of shape (length\*height\*3, 1).

- Please don't hardcode the dimensions of image as a constant. Instead look up the quantities you need.
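
To show what "looking up the quantities you need" can look like, here is a small sketch on a different reshape: it collapses the first two axes of a tensor of shape (a, b, c) into one, giving shape (a\*b, c). The tensor `v` and the names `a`, `b`, `c`, and `flat` are made up for this example.

```python
import tensorflow as tf

# A dummy tensor of shape (4, 5, 2); the values do not matter here.
v = tf.zeros((4, 5, 2))

# Read the dimensions from the tensor instead of typing them in by hand.
a, b, c = v.shape.as_list()

# Merge the first two axes: (a, b, c) -> (a*b, c).
flat = tf.reshape(v, (a * b, c))

print("original shape:", v.shape)
print("reshaped shape:", flat.shape)
```

Because the sizes come from `v.shape`, the same lines keep working if the input has different dimensions.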

In [None]:
# GRADED FUNCTION: image2vector
def image2vector(image):
    """
    Argument:
    image -- a numpy array of shape (length, height, depth)

    Returns:
    v -- a vector of shape (length*height*depth, 1)
    """

    ### START CODE HERE ### (≈ 1 line of code)

    ### END CODE HERE ###

    return v

In [None]:
# This is a 3 by 3 by 2 array, typically images will be (num_px_x, num_px_y,3) where 3 represents the RGB values
image = np.array([[[ 0.67826139,  0.29380381],
        [ 0.90714982,  0.52835647],
        [ 0.4215251 ,  0.45017551]],

       [[ 0.92814219,  0.96677647],
        [ 0.85304703,  0.52351845],
        [ 0.19981397,  0.27417313]],

       [[ 0.60659855,  0.00533165],
        [ 0.10820313,  0.49978937],
        [ 0.34144279,  0.94630077]]])

print ("image2vector(image) = " + str(image2vector(image)))

<a name='1-4'></a>
### 1.4 - Normalizing Rows ####


Another common technique we use in Machine Learning and Deep Learning is to normalize our data. It often leads to better performance because gradient descent converges faster after normalization. Here, by normalization we mean changing x to $ \frac{x}{\| x\|} $ (dividing each row vector of x by its norm).

For example, if
$$x = \begin{bmatrix}
        0 & 3 & 4 \\
        2 & 6 & 4 \\
\end{bmatrix}\tag{3}$$
then
$$\| x\| =  \begin{bmatrix}
    5 \\
    \sqrt{56} \\
\end{bmatrix}\tag{4} $$
and
$$ x\_normalized = \frac{x}{\| x\|} = \begin{bmatrix}
    0 & \frac{3}{5} & \frac{4}{5} \\
    \frac{2}{\sqrt{56}} & \frac{6}{\sqrt{56}} & \frac{4}{\sqrt{56}} \\
\end{bmatrix}\tag{5}$$

Note that you can divide matrices of different sizes and it works fine: this is called broadcasting.

HINTS:
- `keepdims`
- tf.norm has another parameter `ord` where we specify which norm to compute (in the exercise below you'll use the 2-norm).
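
The effect of `keepdims` on broadcasting can be seen in the sketch below. It deliberately uses a different reduction (the row maximum via `tf.reduce_max`) rather than the norm, so it is an analogy and not the exercise solution. All variable names are illustrative.

```python
import tensorflow as tf

x = tf.constant([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])

# With keepdims=True the reduced axis is kept with size 1: shape (2, 1).
row_max = tf.reduce_max(x, axis=1, keepdims=True)

# Without keepdims the reduced axis is dropped: shape (2,).
row_max_flat = tf.reduce_max(x, axis=1)

# (2, 3) / (2, 1) broadcasts: each row is divided by its own maximum.
# Dividing by row_max_flat instead would fail, because shapes (2, 3)
# and (2,) are not broadcast-compatible.
scaled = x / row_max

print(row_max.shape, row_max_flat.shape, scaled.shape)
print(scaled.numpy())
```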

<a name='1-4-1'></a>
#### 1.4.1 - Normalize_rows
Implement normalize_rows() to normalize the rows of a matrix. After applying this function to an input matrix x, each row of x should be a vector of unit length (meaning length 1).

In [None]:
# GRADED FUNCTION: normalize_rows

def normalize_rows(x):
    """
    Implement a function that normalizes each row of the matrix x (to have unit length).

    Argument:
    x -- A numpy matrix of shape (n, m)

    Returns:
    x -- The normalized (by row) numpy matrix. You are allowed to modify x.
    """

    #(≈ 2 lines of code)
    # Compute x_norm as the norm 2 of x. Use tf.norm
    # x_norm =
    # Divide x by its norm.
    # x =
    # YOUR CODE STARTS HERE


    # YOUR CODE ENDS HERE

    return x

In [None]:
x = np.array([[0, 3, 4],
              [1, 6, 4]], dtype=float)
print("normalize_rows(x) = " + str(normalize_rows(x)))

**Note**:
In normalize_rows(), you can try to print the shapes of x_norm and x, and then rerun the assessment. You'll find out that they have different shapes. This is normal given that x_norm takes the norm of each row of x. So x_norm has the same number of rows but only 1 column. So how did it work when you divided x by x_norm? This is called broadcasting!

<a name='1-4-2'></a>
#### 1.4.2 - Softmax function ####

**Exercise**: Implement a softmax function using tensorflow. You can think of softmax as a normalizing function (it makes the features of each sample sum to 1) used when your algorithm needs to classify two or more classes. You will learn more about softmax later in the course.

**Instructions**:
- $ \text{for } x \in \mathbb{R}^{1\times n} \text{,     } softmax(x) = softmax(\begin{bmatrix}
    x_1  &&
    x_2 &&
    ...  &&
    x_n  
\end{bmatrix}) = \begin{bmatrix}
     \frac{e^{x_1}}{\sum_{j}e^{x_j}}  &&
    \frac{e^{x_2}}{\sum_{j}e^{x_j}}  &&
    ...  &&
    \frac{e^{x_n}}{\sum_{j}e^{x_j}}
\end{bmatrix} $

- $\text{for a matrix } x \in \mathbb{R}^{m \times n} \text{,  $x_{ij}$ maps to the element in the $i^{th}$ row and $j^{th}$ column of $x$, thus we have: }$  $$softmax(x) = softmax\begin{bmatrix}
    x_{11} & x_{12} & x_{13} & \dots  & x_{1n} \\
    x_{21} & x_{22} & x_{23} & \dots  & x_{2n} \\
    \vdots & \vdots & \vdots & \ddots & \vdots \\
    x_{m1} & x_{m2} & x_{m3} & \dots  & x_{mn}
\end{bmatrix} = \begin{bmatrix}
    \frac{e^{x_{11}}}{\sum_{j}e^{x_{1j}}} & \frac{e^{x_{12}}}{\sum_{j}e^{x_{1j}}} & \frac{e^{x_{13}}}{\sum_{j}e^{x_{1j}}} & \dots  & \frac{e^{x_{1n}}}{\sum_{j}e^{x_{1j}}} \\
    \frac{e^{x_{21}}}{\sum_{j}e^{x_{2j}}} & \frac{e^{x_{22}}}{\sum_{j}e^{x_{2j}}} & \frac{e^{x_{23}}}{\sum_{j}e^{x_{2j}}} & \dots  & \frac{e^{x_{2n}}}{\sum_{j}e^{x_{2j}}} \\
    \vdots & \vdots & \vdots & \ddots & \vdots \\
    \frac{e^{x_{m1}}}{\sum_{j}e^{x_{mj}}} & \frac{e^{x_{m2}}}{\sum_{j}e^{x_{mj}}} & \frac{e^{x_{m3}}}{\sum_{j}e^{x_{mj}}} & \dots  & \frac{e^{x_{mn}}}{\sum_{j}e^{x_{mj}}}
\end{bmatrix} = \begin{pmatrix}
    softmax\text{(first row of x)}  \\
    softmax\text{(second row of x)} \\
    ...  \\
    softmax\text{(last row of x)} \\
\end{pmatrix} $$

**NOTE**

"m" is used to represent the "number of training examples".
Softmax should be performed for all features of each training example, so softmax would be performed on the rows.

$m$ is the number of rows and $n$ is the number of columns.

In [None]:
# GRADED FUNCTION: softmax

def softmax(x):
    """Calculates the softmax for each row of the input x.

    Your code should work for a row vector and also for matrices of shape (m,n).

    Argument:
    x -- A numpy matrix of shape (m,n)

    Returns:
    s -- A numpy matrix equal to the softmax of x, of shape (m,n)
    """

    ### START CODE HERE ### (≈ 3 lines of code)
    # Apply exp() element-wise to x to get x_exp.
    x_exp =
    # Create a vector x_sum that sums each row of x_exp.
    x_sum =
    # Compute softmax(x) by dividing results of 2 previous steps.
    s =
    ### END CODE HERE ###

    return s

In [None]:
x = np.array([
    [9, 2, 5, 0, 0],
    [7, 5, 0, 0 ,0]], dtype=float)
print("softmax(x) = " + str(softmax(x)))
print("sum of each row of softmax(x) = " + str(tf.reduce_sum(softmax(x), axis=1)))

**Note**:
- If you print the shapes of x_exp, x_sum and s above and rerun the assessment cell, you will see that x_sum is of shape (2,1) while x_exp and s are of shape (2,5). **x_exp/x_sum** works due to broadcasting.

<font color='blue'>
**What you need to remember:**

- tf.exp(x) works for any np.array x and applies the exponential function to every coordinate
- the sigmoid function and its gradient
- image2vector is commonly used in deep learning
- tf.reshape is widely used. In the future, you'll see that keeping your matrix/vector dimensions straight will go toward eliminating a lot of bugs.
- broadcasting is extremely useful

<a name='2'></a>
## 2 - Vectorization

In deep learning, you deal with very large datasets. Hence, a non-computationally-optimal function can become a huge bottleneck in your algorithm and can result in a model that takes ages to run. To make sure that your code is computationally efficient, you will use vectorization.
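
As a rough illustration of what vectorization means in practice, the sketch below computes the same dot product twice: once with an explicit Python loop and once with a single tensor expression. The vectors `a` and `b` are arbitrary test data chosen for this example, and no timing is measured here.

```python
import tensorflow as tf

# Two arbitrary vectors of length 1000.
a = tf.linspace(0.0, 1.0, 1000)
b = tf.linspace(1.0, 2.0, 1000)

# Non-vectorized: iterate over the elements one by one in Python.
loop_dot = 0.0
for a_i, b_i in zip(a.numpy().tolist(), b.numpy().tolist()):
    loop_dot += a_i * b_i

# Vectorized: one element-wise multiply followed by one reduction.
vec_dot = float(tf.reduce_sum(a * b))

print("loop:      ", loop_dot)
print("vectorized:", vec_dot)
```

Both versions give the same value up to floating-point rounding, but the vectorized form hands the whole computation to optimized library code instead of the Python interpreter.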

<a name='2-1'></a>
### 2.1 Implement the L1 and L2 loss functions
<a name='2-1-1'></a>
#### 2.1.1 L1 loss:
**Exercise**: Implement the vectorized version of the L1 loss. You may find the function tf.abs(x) (absolute value of x) useful.

**Reminder**:
- The loss is used to evaluate the performance of your model. The bigger your loss is, the more different your predictions ($ \hat{y} $) are from the true values ($y$). In deep learning, you use optimization algorithms like Gradient Descent to train your model and to minimize the cost.
- L1 loss is defined as:
$$\begin{align*} & L_1(\hat{y}, y) = \frac{1}{m}\sum_{i=1}^m|y^{(i)} - \hat{y}^{(i)}| \end{align*}\tag{6}$$

In [None]:
# GRADED FUNCTION: L1
def L1(yhat, y):
    """
    Arguments:
    yhat -- vector of size m (predicted labels)
    y -- vector of size m (true labels)

    Returns:
    loss -- the value of the L1 loss function defined above
    """

    ### START CODE HERE ### (≈ 1 line of code)

    ### END CODE HERE ###

    return loss

In [None]:
yhat = np.array([.9, 0.2, 0.1, .4, .9])
y = np.array([1, 0, 0, 1, 1])
print("L1 = " + str(L1(yhat,y)))

<a name='2-1-2'></a>
#### 2.1.2 L2 loss:
**Exercise**: Implement the vectorized version of the L2 loss. There are several ways of implementing the L2 loss.

- L2 loss is defined as $$\begin{align*} & L_2(\hat{y},y) = \frac{1}{m}\sum_{i=1}^m(y^{(i)} - \hat{y}^{(i)})^2 \end{align*}\tag{7}$$

In [None]:
# GRADED FUNCTION: L2

def L2(yhat, y):
    """
    Arguments:
    yhat -- vector of size m (predicted labels)
    y -- vector of size m (true labels)

    Returns:
    loss -- the value of the L2 loss function defined above
    """

    ### START CODE HERE ### (≈ 1 line of code)

    ### END CODE HERE ###

    return loss

In [None]:
yhat = np.array([.9, 0.2, 0.1, .4, .9])
y = np.array([1, 0, 0, 1, 1])
print("L2 = " + str(L2(yhat,y)))

<font color='blue'>
**What to remember:**

- Vectorization is very important in deep learning. It provides computational efficiency and clarity.
- You have reviewed the L1 and L2 loss.
- You are now familiar with several tensorflow functions.

<a name='3'></a>
# Part2: Intro to TensorFlow

In this part of the assignment, you'll get exposure to using TensorFlow and learn how it can be used for solving deep learning tasks.

## 1.1 Why is TensorFlow called TensorFlow?

TensorFlow is called 'TensorFlow' because it handles the flow (node/mathematical operation) of Tensors, which are data structures that you can think of as multi-dimensional arrays.
The ```shape``` of a Tensor defines its number of dimensions and the size of each dimension. The ```rank``` of a Tensor provides the number of dimensions (n-dimensions) -- you can also think of this as the Tensor's order or degree.
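
The relationship between `shape` and `rank` can be checked directly, as in this short sketch. The tensors `scalar`, `vector`, and `cube` are illustrative examples of ranks 0, 1, and 3.

```python
import tensorflow as tf

scalar = tf.constant(3.0)               # rank 0, shape ()
vector = tf.constant([1.0, 2.0, 3.0])   # rank 1, shape (3,)
cube = tf.ones((2, 3, 4))               # rank 3, shape (2, 3, 4)

for name, tensor in (("scalar", scalar), ("vector", vector), ("cube", cube)):
    # tf.rank returns a Tensor; .numpy() converts it to a plain integer.
    print(name, "rank:", tf.rank(tensor).numpy(), "shape:", tuple(tensor.shape))
```

In each case the rank equals the number of entries in the shape.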

In [None]:
### Defining higher-order Tensors ###

'''TODO: Define a 2-d Tensor'''
matrix = # TODO

assert isinstance(matrix, tf.Tensor), "matrix must be a tf Tensor object"
assert tf.rank(matrix).numpy() == 2

In [None]:
'''TODO: Define a 4-d Tensor.'''
# Use tf.zeros to initialize a 4-d Tensor of zeros with size 10 x 256 x 256 x 3.
#   You can think of this as 10 images where each image is RGB 256 x 256.
images = # TODO

assert isinstance(images, tf.Tensor), "images must be a tf Tensor object"
assert tf.rank(images).numpy() == 4, "images must be of rank 4"
assert tf.shape(images).numpy().tolist() == [10, 256, 256, 3], "images has an incorrect shape"

## 1.2 Computations on Tensors

A convenient way to think about and visualize computations in TensorFlow is in terms of graphs. We can define this graph in terms of Tensors, which hold data, and the mathematical operations that act on these Tensors in some order. Let's look at a simple example, and define this computation using TensorFlow:

![alt text](https://raw.githubusercontent.com/aamini/introtodeeplearning/master/lab1/img/add-graph.png)

In [None]:
# Create the nodes in the graph, and initialize values
a = tf.constant(15)
b = tf.constant(61)

# Add them!
c1 = tf.add(a,b)
c2 = a + b # TensorFlow overrides the "+" operation so that it is able to act on Tensors
print(c1)
print(c2)

Notice that the output is a Tensor with value 76: TensorFlow has built a computation graph out of these operations, executed it, and returned the result.

Now let's consider a slightly more complicated example:

<img src="https://i.ibb.co/VQvRN3y/computational-graph.png" alt="computational-graph" border="0">

Here, we take four inputs, `a, b, x, y`, and compute an output `f`. Each node in the graph represents an operation that takes some input, does some computation, and passes its output to another node.

Let's define a simple function in TensorFlow to construct this computation function:

In [None]:
### Defining Tensor computations ###

# Construct a simple computation function
def func(x,a,b,y):
  '''TODO: Define the operation for c, d, e, f (use tf.add, tf.subtract, tf.multiply).'''
  c =
  d =
  e =
  f =
  return f

Now, we can call this function to execute the computation graph given some inputs `x, a, b, y`:

In [None]:
# Consider example values for x, a, b, y
x, a, b, y = 1.5, 2.5, 2, 4
# Execute the computation
f_out = func(x,a,b,y)
print(f_out)

Notice how our output is a Tensor with value defined by the output of the computation, and that the output has no shape as it is a single scalar value.

## 1.3 Gradient Computations

`GradientTape` provides an extremely flexible framework for automatic differentiation. In order to back propagate errors through a neural network, we track forward passes on the Tape, use this information to determine the gradients, and then use these gradients for optimization using SGD.

In [None]:
### Gradient computation with GradientTape ###

# y = x^2
# Example: x = 3.0
x = tf.Variable(3.0)

# Initiate the gradient tape
with tf.GradientTape() as tape:
  # Define the function
  y = x * x

# Access the gradient -- derivative of y with respect to x
dy_dx = tape.gradient(y, x)

assert dy_dx.numpy() == 6.0

In training neural networks, we use differentiation and stochastic gradient descent (SGD) to optimize a loss function. Now that we have a sense of how `GradientTape` can be used to compute and access derivatives, we will look at an example where we use automatic differentiation and SGD to find the minimum of
$$L=(wx-y_{true})^2$$
Here $y_{true}$ is a variable for a desired value we are trying to optimize for; $L$ represents a loss that we are trying to  minimize. While we can clearly solve this problem analytically ($w_{min}=\frac{y_{true}}{x}$), considering how we can compute this using `GradientTape` sets us up nicely for future assignments where we use gradient descent to optimize entire neural network losses.

In [None]:
### Function minimization with automatic differentiation and SGD ###
tf.random.set_seed(1234)
# Initialize a random value for our initial w
w = tf.Variable([tf.random.normal([1])])
print("Initializing w={}".format(w.numpy()))

x = 2.0

learning_rate = 3e-2 # learning rate for SGD
history = []
history.append(w.numpy()[0])
# Define the target value
y_true = 4.0

# We will run SGD for a number of iterations. At each iteration, we compute the loss,
#   compute the derivative of the loss with respect to w, and perform the SGD update.
for i in range(500):
    with tf.GradientTape() as tape:
        '''TODO: define the loss as described above'''
        # TODO

    # loss minimization using gradient tape
    grad =  # TODO: compute the derivative of the loss with respect to w
    new_w =  # TODO: SGD update equation
    # TODO: update the value of w

    history.append(w.numpy()[0])

# Plot the evolution of wx as we optimize it towards y_true!
fig = plt.figure(figsize = (10,7))
plt.plot(np.array(history)*x)
plt.plot([0, 500],[y_true,y_true])
plt.legend(('Predicted', 'True'))
plt.xlabel('Iteration')
plt.ylabel('w * x')

In [None]:
# print the final value of w
print(w)

The following cell shows how the value of w evolves during gradient descent, starting from its initial value.

In [None]:
w = np.linspace(-2, 6, 200)
loss_f = (w*x-y_true)**2
loss = (np.array(history)*x-y_true)**2
fig = plt.figure(figsize = (10,7))
plt.title("Evolution of the cost function during gradient descent", fontsize=15)
plt.plot(w, loss_f, label = "Cost function")
plt.plot(history, loss, '*', label = "SGD iterates")
plt.xlabel('Weight', fontsize=11)
plt.ylabel('Loss', fontsize=11)
plt.legend(loc = "upper right")
plt.show()

#### Learning Rate

**Exercise**: Try the previous code blocks with learning rates $\{0.3, 0.000005\}$
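
Before running the experiments, it may help to see why the step size matters for this particular loss. Since $\frac{dL}{dw} = 2x(wx - y_{true})$, one update $w \leftarrow w - \eta \frac{dL}{dw}$ multiplies the error $w - \frac{y_{true}}{x}$ by the factor $1 - 2\eta x^2$. The plain-Python sketch below checks this by hand, without `GradientTape`. The helper names `run_sgd` and `contraction_factor`, and the starting point `w0 = 0.0`, are assumptions made for this illustration.

```python
def contraction_factor(learning_rate, x=2.0):
    """Factor by which the error (w - y_true/x) is multiplied per update."""
    return 1.0 - 2.0 * learning_rate * x ** 2

def run_sgd(learning_rate, w0=0.0, x=2.0, y_true=4.0, steps=500):
    """Gradient descent on L = (w*x - y_true)^2 with a hand-written gradient."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * x * (w * x - y_true)  # dL/dw
        w = w - learning_rate * grad
    return w

for lr in (3e-2, 0.3, 5e-6):
    print("lr =", lr,
          "| factor =", contraction_factor(lr),
          "| w after 20 steps =", run_sgd(lr, steps=20))
```

A factor with magnitude below 1 shrinks the error at every step, a magnitude above 1 makes it grow, and a factor very close to 1 means w barely moves. Keep this in mind when interpreting the plots you produce in the cells below.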

In [None]:
## TODO
#### Implement SGD with learning_rate = 0.3
tf.random.set_seed(1234)



In [None]:
## TODO
#### Plot w value evolution against cost function
#### -- make sure to remove the nans and inf from history before plotting
#### -- constrain the value of w and history to be between -10 and 10 before plotting



In [None]:
## TODO
#### Implement SGD with learning_rate = 0.000005

tf.random.set_seed(1234)



In [None]:
## TODO
#### Plot w value evolution against cost function



# Part3: A Neural Network

In the tutorial we learned how to create a network model that predicts the handwritten digits from the MNIST dataset. This time we are trying to recognize different items of clothing, using a model trained on a dataset containing 10 different types.

The Fashion MNIST data is available directly in the tf.keras datasets API.

### Question 1 Loading and Viewing data

- **Q** Load it like we did in the tutorial from keras.

In [None]:
 #TODO

- **Q** Normalize it like we did in the tutorial.

In [None]:
 #TODO

- **Q** Display 10 *random* images from the training images in 1 figure.

In [None]:
random.seed(1234)
#TODO#


### Question 2 The Model

Let's now design the model. Design a sequential neural network model with 2 hidden dense layers and 1 output dense layer. Use ReLU as the activation function for the hidden layers and softmax as the activation for the output layer.

For the hidden layers' neurons, their number is left to your decision 😏

In [None]:
#TODO# Sequential Model

model =

Create the same model, but using the `Functional` API.

In [None]:
#TODO# Functional Model

model =

Compile the model with the appropriate crossentropy loss (take note of type of label) and Adam optimizer. Use accuracy for the metrics.

In [None]:
#TODO# Compile the model


In [None]:
model.fit(training_images, training_labels, epochs=5)

In [None]:
model.evaluate(test_images, test_labels)

Run the code below. It creates a set of classifications for each of the test images, and then prints the first entry in the classifications. The output is a list of numbers.

In [None]:
classifications = model.predict(test_images)

print(classifications[0])

Hint: try running print(test_labels[0]) -- and you'll get a 9. Does that help you understand why this list looks the way it does?

In [None]:
print(test_labels[0])

- **Q** What does this list represent?


1.   It's 10 random meaningless values
2.   It's the first 10 classifications that the computer made
3.   It's the probability that this item is each of the 10 classes


- TODO (Answer)

**Q** How do you know that this list tells you that the item is an ankle boot?


1.   There's not enough information to answer that question
2.   The 10th element on the list is the biggest, and the ankle boot is labelled 9
3.   The ankle boot is label 9, and there are 0->9 elements in the list


- TODO (Answer)

Let's now look at the layers in your model. Experiment with different values for the dense layer. What different results do you get for loss, training time etc? Why do you think that's the case?


**Q** Use a larger number of neurons in the 2 hidden layers -- What's the impact?


- TODO (Answer)

In [None]:
#TODO# create model
model =

#TODO# Compile the model



model.fit(training_images, training_labels, epochs=5)

model.evaluate(test_images, test_labels)

classifications = model.predict(test_images)

print(classifications[0])
print(test_labels[0])

**Q** Before you trained, you normalized the data, going from values that were 0-255 to values that were 0-1. What would be the impact of removing that? Here's the complete code to give it a try. Why do you think you get different results?

- TODO (Answer)

In [None]:
#### CODE FOR DATA NOT NORMALIZED
mnist = tf.keras.datasets.mnist
(training_images, training_labels), (test_images, test_labels) = mnist.load_data()

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(),
  tf.keras.layers.Dense(512, activation=tf.nn.relu),
  tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
model.fit(training_images, training_labels, epochs=5)

In [None]:
#### CODE FOR DATA NORMALIZED
mnist = tf.keras.datasets.mnist
(training_images, training_labels), (test_images, test_labels) = mnist.load_data()

training_images = training_images/255.0


model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(),
  tf.keras.layers.Dense(512, activation=tf.nn.relu),
  tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
model.fit(training_images, training_labels, epochs=5)

References:
- MIT 6.S191
- DL.ai
