In [None]:
# Neural Network

The previous chapters taught you how to build models in TensorFlow 2. In this chapter, you will apply those same tools to build, train, and make predictions with neural networks. You will learn how to define dense layers, apply activation functions, select an optimizer, and apply regularization to reduce overfitting. You will take advantage of TensorFlow's flexibility by using both low-level linear algebra and high-level Keras API operations to define and train models.

# (1) Dense layers

## The linear regression model

<img src="image/Screenshot 2021-01-24 232039.png">

## What is a neural network?

<img src="image/Screenshot 2021-01-24 232132.png">

<img src="image/Screenshot 2021-01-24 232213.png">

A dense layer applies weights to all nodes from the previous layer.

## A simple dense layer

```
import tensorflow as tf
```

```
# Define inputs (features)
inputs = tf.constant([[1, 35]], tf.float32)
```

```
# Define weights
weights = tf.Variable([[-0.05], [-0.01]])
```

```
# Define the bias
bias = tf.Variable([0.5])
```

```
# Multiply input (features) by the weights
product = tf.matmul(inputs, weights)
```

```
# Define dense layer
dense = tf.keras.activations.sigmoid(product+bias)
```
<img src="image/Screenshot 2021-01-24 232706.png">
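As a cross-check on the arithmetic, the same dense layer can be evaluated by hand. The sketch below uses plain Python instead of TensorFlow; the helper name `sigmoid` is an illustrative assumption and not part of the original code.

```python
import math

def sigmoid(z):
    """Logistic function: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# Same numbers as the TensorFlow example above
inputs = [1.0, 35.0]
weights = [-0.05, -0.01]
bias = 0.5

# Weighted sum of the inputs: 1*(-0.05) + 35*(-0.01) = -0.40
product = sum(x * w for x, w in zip(inputs, weights))

# Add the bias and apply the activation: sigmoid(-0.40 + 0.5) = sigmoid(0.1)
dense = sigmoid(product + bias)
print(round(dense, 3))  # 0.525
```

An output slightly above 0.5 reflects a pre-activation value just above zero.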

## Defining a complete model

```
import tensorflow as tf
```

```
# Define input (features) layer
inputs = tf.constant(data, tf.float32)
```

```
# Define first dense layer
dense1 = tf.keras.layers.Dense(10, activation='sigmoid')(inputs)
```

```
# Define second dense layer
dense2 = tf.keras.layers.Dense(5, activation='sigmoid')(dense1)
```

```
# Define output (predictions) layer
outputs = tf.keras.layers.Dense(1, activation='sigmoid')(dense2)
```

<img src="image/Screenshot 2021-01-24 233110.png">

## High-level versus low-level approach

- **High-level approach**
    - High-level API operations

```
dense = keras.layers.Dense(10, activation='sigmoid')
```

- **Low-level approach**
    - Linear-algebraic operations

```
prod = matmul(inputs, weights)
dense = keras.activations.sigmoid(prod)
```


# Exercise I: The linear algebra of dense layers

There are two ways to define a dense layer in `tensorflow`. The first involves the use of low-level, linear algebraic operations. The second makes use of high-level `keras` operations. In this exercise, we will use the first method to construct the network shown in the image below.

<img src="image/3_2_1_network2.png">

The input layer contains 3 features -- education, marital status, and age -- which are available as `borrower_features`. The hidden layer contains 2 nodes and the output layer contains a single node.

For each layer, you will take the previous layer as an input, initialize a set of weights, compute the product of the inputs and weights, and then apply an activation function. Note that `Variable()`, `ones()`, and `matmul()`, as well as the `keras` module, have been imported from `tensorflow`.

### Instructions 

- Initialize `weights1` as a variable using a 3x2 tensor of ones.
- Compute the product of `borrower_features` by `weights1` using matrix multiplication.
- Use a sigmoid activation function to transform `product1 + bias1`.

- Initialize `weights2` as a variable using a 2x1 tensor of ones.
- Compute the product of `dense1` by `weights2` using matrix multiplication.
- Use a sigmoid activation function to transform `product2 + bias2`.




In [None]:
# From previous step
bias1 = Variable(1.0)
weights1 = Variable(ones((3, 2)))
product1 = matmul(borrower_features, weights1)
dense1 = keras.activations.sigmoid(product1 + bias1)

# Initialize bias2 and weights2
bias2 = Variable(1.0)
weights2 = Variable(ones((2, 1)))

# Perform matrix multiplication of dense1 and weights2
product2 = matmul(dense1, weights2)

# Apply activation to product2 + bias2 and print the prediction
prediction = keras.activations.sigmoid(product2 + bias2)
print('\n prediction: {}'.format(prediction.numpy()[0,0]))
print('\n actual: 1')

# Exercise II: The low-level approach with multiple examples

In this exercise, we'll build further intuition for the low-level approach by constructing the first dense hidden layer for the case where we have multiple examples. We'll assume the model is trained and the first layer weights, `weights1`, and bias, `bias1`, are available. We'll then perform matrix multiplication of the `borrower_features` tensor by the `weights1` variable. Recall that the `borrower_features` tensor includes education, marital status, and age. Finally, we'll apply the sigmoid function to the elements of `products1 + bias1`, yielding `dense1`.

$products1 = \begin{bmatrix} 3 & 3 & 23 \\ 2 & 1 & 24 \\ 1 & 1 & 49 \\ 1 & 1 & 49 \\ 2 & 1 & 49\end{bmatrix}\begin{bmatrix} -0.6 & 0.6 \\ 0.8 & -0.3 \\ -0.09 & -0.08\end{bmatrix}$  
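To see what this product looks like, here is a small NumPy sketch using the two matrices above. The bias value of 1.0 is a placeholder assumption (the exercise does not state the trained `bias1`), and NumPy stands in for TensorFlow only to keep the example self-contained.

```python
import numpy as np

# Five borrowers, three features each (education, marital status, age)
borrower_features = np.array([[3, 3, 23],
                              [2, 1, 24],
                              [1, 1, 49],
                              [1, 1, 49],
                              [2, 1, 49]], dtype=np.float32)

# First-layer weights: three inputs feeding two hidden nodes
weights1 = np.array([[-0.6, 0.6],
                     [0.8, -0.3],
                     [-0.09, -0.08]], dtype=np.float32)

# Placeholder bias (assumed value, for illustration only)
bias1 = np.float32(1.0)

# (5, 3) @ (3, 2) -> (5, 2): one row per example, one column per hidden node
products1 = borrower_features @ weights1

# The scalar bias broadcasts across every element before the sigmoid
dense1 = 1.0 / (1.0 + np.exp(-(products1 + bias1)))

print(products1.shape, dense1.shape)  # (5, 2) (5, 2)
print(products1[0])                   # approximately [-1.47 -0.94]
```

The number of rows follows the number of examples, while the number of columns follows the number of nodes in the layer.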

Note that `matmul()` and the `keras` module have been imported from `tensorflow`.

### Instructions


- Compute `products1` by matrix multiplying the features tensor by the weights.
- Use a sigmoid activation function to transform `products1 + bias1`.
- Print the shapes of `borrower_features`, `weights1`, `bias1`, and `dense1`.


In [None]:
# Compute the product of borrower_features and weights1
products1 = matmul(borrower_features, weights1)

# Apply a sigmoid activation function to products1 + bias1
dense1 = keras.activations.sigmoid(products1 + bias1)

# Print the shapes of borrower_features, weights1, bias1, and dense1
print('\n shape of borrower_features: ', borrower_features.shape)
print('\n shape of weights1: ', weights1.shape)
print('\n shape of bias1: ', bias1.shape)
print('\n shape of dense1: ', dense1.shape)

# Exercise III: Using the dense layer operation

We've now seen how to define dense layers in `tensorflow` using linear algebra. In this exercise, we'll skip the linear algebra and let `keras` work out the details. This will allow us to construct the network below, which has 2 hidden layers and 10 features, using less code than we needed for the network with 1 hidden layer and 3 features.

<img src="image/10_7_3_1_network.png">

To construct this network, we'll need to define three dense layers, each of which takes the previous layer as an input, multiplies it by weights, and applies an activation function. Note that input data has been defined and is available as a 100x10 tensor: `borrower_features`. Additionally, the `keras.layers` module is available.
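The way shapes flow through the three layers can be traced without Keras. The following NumPy sketch uses random placeholder data and a hypothetical helper, `dense_sigmoid`; it is an assumption-laden illustration of the matrix shapes and does not reproduce how Keras initializes or stores its parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense_sigmoid(x, n_out):
    """Hypothetical stand-in for a dense layer with a sigmoid activation."""
    weights = rng.normal(size=(x.shape[1], n_out))
    bias = np.zeros(n_out)
    return 1.0 / (1.0 + np.exp(-(x @ weights + bias)))

# 100 examples with 10 features each (random placeholder data)
sample_features = rng.normal(size=(100, 10))

layer1 = dense_sigmoid(sample_features, 7)  # (100, 10) @ (10, 7) -> (100, 7)
layer2 = dense_sigmoid(layer1, 3)           # (100, 7) @ (7, 3)   -> (100, 3)
output = dense_sigmoid(layer2, 1)           # (100, 3) @ (3, 1)   -> (100, 1)

print(layer1.shape, layer2.shape, output.shape)
```

Each layer changes only the number of columns; the rows, one per example, pass through unchanged.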

### Instructions


- Set `dense1` to be a dense layer with 7 output nodes and a sigmoid activation function.
- Define `dense2` to be a dense layer with 3 output nodes and a sigmoid activation function.
- Define `predictions` to be a dense layer with 1 output node and a sigmoid activation function.
- Print the shapes of `dense1`, `dense2`, and `predictions` in that order using the `.shape` attribute. Why does each of these tensors have 100 rows?


In [None]:
# Define the first dense layer
dense1 = keras.layers.Dense(7, activation='sigmoid')(borrower_features)

# Define a dense layer with 3 output nodes
dense2 = keras.layers.Dense(3, activation='sigmoid')(dense1)

# Define a dense layer with 1 output node
predictions = keras.layers.Dense(1, activation='sigmoid')(dense2)

# Print the shapes of dense1, dense2, and predictions
print('\n shape of dense1: ', dense1.shape)
print('\n shape of dense2: ', dense2.shape)
print('\n shape of predictions: ', predictions.shape)

# (2) Activation functions

## What is an activation function?

- **Components of a typical hidden layer**
    - **Linear**: Matrix multiplication
    - **Nonlinear**: Activation function
    
## Why nonlinearities are important

<img src="image/Screenshot 2021-01-25 000204.png">

<img src="image/Screenshot 2021-01-25 000254.png">

## A simple example

```
import numpy as np
import tensorflow as tf

# Define example borrower features
young, old = 0.3, 0.6
low_bill, high_bill = 0.1, 0.5
```

```
# Apply matrix multiplication step for all feature combinations
young_high = 1.0*young + 2.0*high_bill
young_low = 1.0*young + 2.0*low_bill
old_high = 1.0*old + 2.0*high_bill
old_low = 1.0*old + 2.0*low_bill
```

```
# Difference in default predictions for young
print(young_high - young_low)

# Difference in default predictions for old
print(old_high - old_low)
```

```
# Difference in default predictions for young
print(tf.keras.activations.sigmoid(young_high).numpy() - \
    tf.keras.activations.sigmoid(young_low).numpy())

# Difference in default predictions for old
print(tf.keras.activations.sigmoid(old_high).numpy() - \
    tf.keras.activations.sigmoid(old_low).numpy())
```
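The same comparison can be reproduced without TensorFlow. This is a minimal sketch in plain Python; the `sigmoid` and `linear` helpers are assumptions introduced for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def linear(age, bill):
    # Linear step: weight 1.0 on age and 2.0 on the bill amount
    return 1.0 * age + 2.0 * bill

young, old = 0.3, 0.6
low_bill, high_bill = 0.1, 0.5

# Without an activation, a higher bill has the same effect for both groups
linear_young = linear(young, high_bill) - linear(young, low_bill)
linear_old = linear(old, high_bill) - linear(old, low_bill)

# With a sigmoid, the effect depends on where each group sits on the curve
sigmoid_young = sigmoid(linear(young, high_bill)) - sigmoid(linear(young, low_bill))
sigmoid_old = sigmoid(linear(old, high_bill)) - sigmoid(linear(old, low_bill))

print(linear_young, linear_old)
print(sigmoid_young, sigmoid_old)
```

The linear differences are both about 0.8, while the sigmoid differences come out near 0.16 for young and 0.14 for old borrowers, so the nonlinearity lets the effect of the bill amount vary with age.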

## The sigmoid activation function

- **Sigmoid activation function**
    - Binary classification
    - Low-level: `tf.keras.activations.sigmoid()`
    - High-level: `sigmoid`

<img src="image/Screenshot 2021-01-25 001511.png">

## The ReLU activation function

- **ReLU activation function**
    - Hidden layers
    - Low-level: `tf.keras.activations.relu()`
    - High-level: `relu`

<img src="image/Screenshot 2021-01-25 001708.png">

## The softmax activation function

- **Softmax activation function**
    - Output layer (>2 classes)
    - Low-level: `tf.keras.activations.softmax()`
    - High-level: `softmax`
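For reference, the three activations can be written out in a few lines. The sketch below uses plain Python with illustrative helper names; it mirrors the standard formulas rather than TensorFlow's actual implementation.

```python
import math

def sigmoid(z):
    """Squashes a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Passes positive values through and clips negatives to zero."""
    return max(0.0, z)

def softmax(zs):
    """Turns a list of scores into probabilities that sum to 1."""
    shift = max(zs)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(z - shift) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

print(sigmoid(0.0))           # 0.5
print(relu(-2.0), relu(3.0))  # 0.0 3.0
probs = softmax([1.0, 2.0, 3.0])
print(sum(probs))             # approximately 1.0
```

Softmax preserves the ordering of the scores, so the largest score always receives the largest probability.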

## Activation functions in neural networks

```
import tensorflow as tf
```

```
# Define dense layer 1
dense1 = tf.keras.layers.Dense(16, activation='relu')(inputs)
```

```
# Define dense layer 2
dense2 = tf.keras.layers.Dense(8, activation='sigmoid')(dense1)
```

```
# Define dense layer output
outputs = tf.keras.layers.Dense(4, activation='softmax')(dense2)
```

# Exercise IV: Binary classification problems

In this exercise, you will again make use of credit card data. The target variable, `default`, indicates whether a credit card holder defaults on his or her payment in the following period. Since there are only two options--default or not--this is a binary classification problem. While the dataset has many features, you will focus on just three: the size of the three latest credit card bills. Finally, you will compute predictions from your untrained network, `outputs`, and compare those to the target variable, `default`.

The tensor of features has been loaded and is available as `bill_amounts`. Additionally, the `constant()`, `float32`, and `keras.layers.Dense()` operations are available.

### Instructions


- Define `inputs` as a 32-bit floating point constant tensor using `bill_amounts`.
- Set `dense1` to be a dense layer with 3 output nodes and a `relu` activation function.
- Set `dense2` to be a dense layer with 2 output nodes and a `relu` activation function.
- Set the output layer to be a dense layer with a single output node and a `sigmoid` activation function.


In [None]:
# Construct input layer from features
inputs = constant(bill_amounts, float32)

# Define first dense layer
dense1 = keras.layers.Dense(3, activation='relu')(inputs)

# Define second dense layer
dense2 = keras.layers.Dense(2, activation='relu')(dense1)

# Define output layer
outputs = keras.layers.Dense(1, activation='sigmoid')(dense2)

# Print error for first five examples
error = default[:5] - outputs.numpy()[:5]
print(error)

# Exercise V: Multiclass classification problems

In this exercise, we expand beyond binary classification to cover multiclass problems. A multiclass problem has targets that can take on three or more values. In the credit card dataset, the education variable can take on 6 different values, each corresponding to a different level of education. We will use that as our target in this exercise and will also expand the feature set from 3 to 10 columns.

As in the previous problem, you will define an input layer, dense layers, and an output layer. You will also print the untrained model's predictions, which are probabilities assigned to the classes. The tensor of features has been loaded and is available as `borrower_features`. Additionally, the `constant()`, `float32`, and `keras.layers.Dense()` operations are available.

### Instructions


- Define the input layer as a 32-bit constant tensor using `borrower_features`.
- Set the first dense layer to have 10 output nodes and a `sigmoid` activation function.
- Set the second dense layer to have 8 output nodes and a rectified linear unit activation function.
- Set the output layer to have 6 output nodes and the appropriate activation function.


In [None]:
# Construct input layer from borrower features
inputs = constant(borrower_features, float32)

# Define first dense layer
dense1 = keras.layers.Dense(10, activation='sigmoid')(inputs)

# Define second dense layer
dense2 = keras.layers.Dense(8, activation='relu')(dense1)

# Define output layer
outputs = keras.layers.Dense(6, activation='softmax')(dense2)

# Print first five predictions
print(outputs.numpy()[:5])

# (3) Optimizers

## How to find a minimum

<img src="image/Screenshot 2021-01-25 003337.png">

## The gradient descent optimizer

- **Stochastic gradient descent (SGD) optimizer**
    - `tf.keras.optimizers.SGD()`
    - `learning_rate`
- **Simple and easy to interpret**
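The update rule behind gradient descent is short enough to write by hand. The following sketch minimizes a hypothetical one-variable loss, $(x - 3)^2$, chosen only for illustration; it uses the exact gradient rather than a minibatch estimate, so it shows plain gradient descent rather than the stochastic variant.

```python
def loss(x):
    return (x - 3.0) ** 2

def gradient(x):
    # Derivative of (x - 3)^2 with respect to x
    return 2.0 * (x - 3.0)

learning_rate = 0.1
x = 0.0  # arbitrary starting point

for step in range(100):
    # Move against the gradient, scaled by the learning rate
    x = x - learning_rate * gradient(x)

print(x)  # close to 3.0
```

A larger learning rate takes bigger steps and may overshoot the minimum; a smaller one converges more slowly.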

## The RMS prop optimizer

- **Root mean squared (RMS) propagation optimizer**
    - Applies different learning rates to each feature
    - `tf.keras.optimizers.RMSprop()`
    - `learning_rate`
    - `momentum`
    - `decay`
- **Allows for momentum to both build and decay**

## The adam optimizer

- **Adaptive moment (Adam) optimizer**
    - `tf.keras.optimizers.Adam()`
    - `learning_rate`
    - `beta_1`
- **Performs well with default parameter values**

## A complete example

```
import tensorflow as tf

# Define the model function
def model(bias, weights, features=borrower_features):
    product = tf.matmul(features, weights)
    return tf.keras.activations.sigmoid(product+bias)
```

```
# Compute the predicted values and loss
def loss_function(bias, weights, targets=default, features=borrower_features):
    predictions = model(bias, weights, features)
    return tf.keras.losses.binary_crossentropy(targets, predictions)
```

```
# Minimize the loss function with RMS propagation
opt = tf.keras.optimizers.RMSprop(learning_rate=0.01, momentum=0.9)
opt.minimize(lambda: loss_function(bias, weights), var_list=[bias, weights])
```

# Exercise VI: The dangers of local minima

Consider the plot of the following loss function, `loss_function()`, which contains a global minimum, marked by the dot on the right, and several local minima, including the one marked by the dot on the left.

<img src="image/local_minima_dots_4_10.png">

In this exercise, you will try to find the global minimum of `loss_function()` using `keras.optimizers.SGD()`. You will do this twice, each time with a different initial value of the input to `loss_function()`. First, you will use `x_1`, which is a variable with an initial value of 6.0. Second, you will use `x_2`, which is a variable with an initial value of 0.3. Note that `loss_function()` has been defined and is available.
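The effect of the starting point can be seen with any loss that has more than one minimum. The sketch below uses a hypothetical loss, $(x^2 - 1)^2 + 0.3x$, which is not the `loss_function()` from this exercise, and a hand-written gradient descent loop in place of `keras.optimizers.SGD()`.

```python
def toy_loss(x):
    # Hypothetical loss: global minimum near x = -1, local minimum near x = +1
    return (x ** 2 - 1.0) ** 2 + 0.3 * x

def toy_gradient(x):
    # Derivative of toy_loss with respect to x
    return 4.0 * x * (x ** 2 - 1.0) + 0.3

def descend(x, learning_rate=0.01, steps=1000):
    for _ in range(steps):
        x = x - learning_rate * toy_gradient(x)
    return x

x_right = descend(2.0)   # starts to the right of the local minimum
x_left = descend(-2.0)   # starts to the left of the global minimum

print(x_right, toy_loss(x_right))
print(x_left, toy_loss(x_left))
```

Both runs stop where the gradient vanishes, but only the run started at -2.0 ends at the lower loss.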

### Instructions


- Set `opt` to use the stochastic gradient descent optimizer (SGD) with a learning rate of 0.01.
- Perform minimization using the loss function, `loss_function()`, and the variable with an initial value of 6.0, `x_1`.
- Perform minimization using the loss function, `loss_function()`, and the variable with an initial value of 0.3, `x_2`.
- Print `x_1` and `x_2` as `numpy` arrays and check whether the values differ. These are the minima that the algorithm identified.


In [None]:
# Initialize x_1 and x_2
x_1 = Variable(6.0, dtype=float32)
x_2 = Variable(0.3, dtype=float32)

# Define the optimization operation
opt = keras.optimizers.SGD(learning_rate=0.01)

for j in range(100):
	# Perform minimization using the loss function and x_1
	opt.minimize(lambda: loss_function(x_1), var_list=[x_1])
	# Perform minimization using the loss function and x_2
	opt.minimize(lambda: loss_function(x_2), var_list=[x_2])

# Print x_1 and x_2 as numpy arrays
print(x_1.numpy(), x_2.numpy())

# Exercise VII: Avoiding local minima

The previous problem showed how easy it is to get stuck in local minima. We had a simple optimization problem in one variable and gradient descent still failed to deliver the global minimum when we had to travel through local minima first. One way to avoid this problem is to use momentum, which allows the optimizer to break through local minima. We will again use the loss function from the previous problem, which has been defined and is available for you as `loss_function()`.

<img src="image/local_minima_dots_4_10.png">

Several optimizers in `tensorflow` have a momentum parameter, including `SGD` and `RMSprop`. You will make use of `RMSprop` in this exercise. Note that `x_1` and `x_2` have been initialized to the same value this time. Furthermore, `keras.optimizers.RMSprop()` has also been imported for you from `tensorflow`.

### Instructions


- Set the `opt_1` operation to use a learning rate of 0.01 and a momentum of 0.99.
- Set `opt_2` to use the root mean square propagation (RMS) optimizer with a learning rate of 0.01 and a momentum of 0.00.
- Define the minimization operation for `opt_2`.
- Print `x_1` and `x_2` as numpy arrays.


In [None]:
# Initialize x_1 and x_2
x_1 = Variable(0.05, dtype=float32)
x_2 = Variable(0.05, dtype=float32)

# Define the optimization operation for opt_1 and opt_2
opt_1 = keras.optimizers.RMSprop(learning_rate=0.01, momentum=0.99)
opt_2 = keras.optimizers.RMSprop(learning_rate=0.01, momentum=0.00)

for j in range(100):
    opt_1.minimize(lambda: loss_function(x_1), var_list=[x_1])
    # Define the minimization operation for opt_2
    opt_2.minimize(lambda: loss_function(x_2), var_list=[x_2])

# Print x_1 and x_2 as numpy arrays
print(x_1.numpy(), x_2.numpy())

# (4) Training a network in TensorFlow

<img src="image/Screenshot 2021-01-25 010036.png">

## Random initializers

- **Often need to initialize thousands of variables**
    - `tf.ones()` may perform poorly
    - Tedious and difficult to initialize variables individually
- **Alternatively, draw initial values from a distribution**
    - Normal
    - Uniform
    - Glorot initializer

## Initializing variables in TensorFlow

```
import tensorflow as tf

# Define 500x500 random normal variable
weights = tf.Variable(tf.random.normal([500, 500]))

# Define 500x500 truncated random normal variable
weights = tf.Variable(tf.random.truncated_normal([500, 500]))
```

```
# Define a dense layer with the default initializer
dense = tf.keras.layers.Dense(32, activation='relu')

# Define a dense layer with the zeros initializer
dense = tf.keras.layers.Dense(32, activation='relu', kernel_initializer='zeros')
```

## Neural Networks and overfitting

<img src="image/Screenshot 2021-01-25 011037.png">

## Apply dropout

<img src="image/Screenshot 2021-01-25 011131.png">

## Implementing dropout in a network

```
import numpy as np
import tensorflow as tf

# Define input data
inputs = np.array(borrower_features, np.float32)
```

```
# Define dense layer 1
dense1 = tf.keras.layers.Dense(32, activation='relu')(inputs)
```

```
# Define dense layer 2
dense2 = tf.keras.layers.Dense(16, activation='relu')(dense1)
```

```
# Apply dropout operation
dropout1 = tf.keras.layers.Dropout(0.25)(dense2)
```

```
# Define output layer
outputs = tf.keras.layers.Dense(1, activation='sigmoid')(dropout1)
```
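To make the dropout step concrete, here is a NumPy sketch of inverted dropout, the scheme the Keras `Dropout` layer applies in training mode: a fraction `rate` of the activations is zeroed and the survivors are scaled by `1 / (1 - rate)`. The helper `dropout` and the sample activations are illustrative assumptions.

```python
import numpy as np

def dropout(activations, rate, rng):
    """Zero a random fraction `rate` of entries and rescale the rest."""
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(42)
activations = np.ones((4, 16))  # placeholder output of a dense layer

dropped = dropout(activations, rate=0.25, rng=rng)

# Every entry is either 0 (dropped) or 1 / 0.75 (kept and rescaled)
print(np.unique(dropped))
```

Keras applies dropout only when the layer runs in training mode (for example, when called with `training=True`); at inference time the layer passes its input through unchanged.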

# Exercise VIII: Initialization in TensorFlow

A good initialization can reduce the amount of time needed to find the global minimum. In this exercise, we will initialize weights and biases for a neural network that will be used to predict credit card default decisions. To build intuition, we will use the low-level, linear algebraic approach, rather than making use of convenience functions and high-level `keras` operations. We will also expand the set of input features from 3 to 23. Several names have been imported from `tensorflow`: `Variable()`, `ones()`, and the `random` module.

### Instructions


- Initialize the layer 1 weights, `w1`, as a `Variable()` with shape `[23, 7]`, drawn from a normal distribution.
- Initialize the layer 1 bias using ones.
- Use a draw from the normal distribution to initialize `w2` as a `Variable()` with shape `[7, 1]`.
- Define `b2` as a `Variable()` and set its initial value to 0.0.


In [None]:
# Define the layer 1 weights
w1 = Variable(random.normal([23, 7]))

# Initialize the layer 1 bias
b1 = Variable(ones([7]))

# Define the layer 2 weights
w2 = Variable(random.normal([7, 1]))

# Define the layer 2 bias
b2 = Variable(0.0)

# Exercise IX: Defining the model and loss function

In this exercise, you will train a neural network to predict whether a credit card holder will default. The features and targets you will use to train your network are available in the Python shell as `borrower_features` and `default`. You defined the weights and biases in the previous exercise.

Note that the `predictions` layer is defined as $\sigma(layer1 * w2 + b2)$, where $\sigma$ is the sigmoid activation, `layer1` is a tensor of nodes for the first hidden dense layer, `w2` is a tensor of weights, and `b2` is the bias tensor.

The trainable variables are `w1`, `b1`, `w2`, and `b2`. Additionally, the following operations have been imported for you: `keras.activations.relu()` and `keras.layers.Dropout()`.

### Instructions

- Apply a rectified linear unit activation function to the first layer.
- Apply 25% dropout to `layer1`.
- Pass the target, `targets`, and the predicted values, `predictions`, to the cross entropy loss function.
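For intuition about the loss, binary cross-entropy can be written out directly. The sketch below is plain Python with hypothetical targets and predictions; the `eps` clipping is an assumption added to avoid `log(0)`, and its value may not match what Keras uses internally.

```python
import math

def binary_crossentropy(targets, predictions, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    total = 0.0
    for y, p in zip(targets, predictions):
        p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(targets)

# Hypothetical labels: 1 = default, 0 = no default
sample_targets = [1.0, 0.0, 1.0]

mostly_right = binary_crossentropy(sample_targets, [0.9, 0.1, 0.8])
mostly_wrong = binary_crossentropy(sample_targets, [0.1, 0.9, 0.2])

print(mostly_right, mostly_wrong)
```

Predictions that put high probability on the correct class give a small loss, and confident mistakes are penalized heavily.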


In [None]:
# Define the model
def model(w1, b1, w2, b2, features=borrower_features):
    # Apply relu activation functions to layer 1
    layer1 = keras.activations.relu(matmul(features, w1) + b1)
    # Apply dropout
    dropout = keras.layers.Dropout(0.25)(layer1)
    return keras.activations.sigmoid(matmul(dropout, w2) + b2)

# Define the loss function
def loss_function(w1, b1, w2, b2, features=borrower_features, targets=default):
    # Pass features through so the argument is not silently ignored
    predictions = model(w1, b1, w2, b2, features)
    # Pass targets and predictions to the cross entropy loss
    return keras.losses.binary_crossentropy(targets, predictions)

# Exercise X: Training neural networks with TensorFlow

In the previous exercise, you defined a model, `model(w1, b1, w2, b2, features)`, and a loss function, `loss_function(w1, b1, w2, b2, features, targets)`, both of which are available to you in this exercise. You will now train the model and then evaluate its performance by predicting default outcomes in a test set, which consists of `test_features` and `test_targets` and is available to you. The trainable variables are `w1`, `b1`, `w2`, and `b2`. Additionally, the following operations have been imported for you: `keras.activations.relu()` and `keras.layers.Dropout()`.

### Instructions


- Set the optimizer to perform minimization.
- Add the four trainable variables to `var_list` in the order in which they appear as arguments to `loss_function()`.
- Use the model and `test_features` to predict the values for `test_targets`.
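The model outputs probabilities, so they are typically thresholded before being compared with the targets. The following plain-Python sketch counts the four outcomes of a binary confusion matrix; the 0.5 threshold, the sample values, and the helper name `binary_confusion_counts` are assumptions for illustration and are not the `confusion_matrix()` helper used in this exercise.

```python
def binary_confusion_counts(targets, probabilities, threshold=0.5):
    """Return (true_neg, false_pos, false_neg, true_pos) counts."""
    tn = fp = fn = tp = 0
    for y, p in zip(targets, probabilities):
        predicted = 1 if p >= threshold else 0
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1 and predicted == 0:
            fn += 1
        elif y == 0 and predicted == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

# Hypothetical targets and predicted default probabilities
sample_targets = [0, 0, 1, 1, 0]
sample_probabilities = [0.2, 0.7, 0.9, 0.4, 0.1]

counts = binary_confusion_counts(sample_targets, sample_probabilities)
print(counts)  # (2, 1, 1, 1)
```

The off-diagonal counts (false positives and false negatives) show which kind of mistake the model makes more often.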


In [None]:
# Train the model
for j in range(100):
    # Complete the optimizer
    opt.minimize(lambda: loss_function(w1, b1, w2, b2),
                 var_list=[w1, b1, w2, b2])

# Make predictions with model
model_predictions = model(w1, b1, w2, b2, test_features)

# Construct the confusion matrix
confusion_matrix(test_targets, model_predictions)