### The linear algebra of dense layers
- The input layer contains 3 features -- education, marital status, and age -- which are available as borrower_features. The hidden layer contains 2 nodes and the output layer contains a single node.
- For each layer, you will take the previous layer as an input, initialize a set of weights, compute the product of the inputs and weights, and then apply an activation function.

In [None]:
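# Import the TensorFlow operations used in this section
from tensorflow import Variable, ones, matmul, keras

# Initialize bias1
bias1 = Variable(1.0)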

# Initialize weights1 as 3x2 variable of ones
weights1 = Variable(ones((3,2)))

# Perform matrix multiplication of borrower_features and weights1
product1 = matmul(borrower_features,weights1)

# Apply sigmoid activation function to product1 + bias1
dense1 = keras.activations.sigmoid(product1 + bias1)

# Print shape of dense1
print("\n dense1's output shape: {}".format(dense1.shape))

- Initialize weights2 as a variable using a 2x1 tensor of ones.
- Compute the product of dense1 by weights2 using matrix multiplication.
- Use a sigmoid activation function to transform product2 + bias2.

In [None]:
# From previous step
bias1 = Variable(1.0)
weights1 = Variable(ones((3, 2)))
product1 = matmul(borrower_features, weights1)
dense1 = keras.activations.sigmoid(product1 + bias1)

# Initialize bias2 and weights2
bias2 = Variable(1.0)
weights2 = Variable(ones((2, 1)))

# Perform matrix multiplication of dense1 and weights2
product2 = matmul(dense1,weights2)

# Apply activation to product2 + bias2 and print the prediction
prediction = keras.activations.sigmoid(product2 + bias2)
print('\n prediction: {}'.format(prediction.numpy()[0,0]))
print('\n actual: 1')
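
The two cells above can also be traced by hand. The following is a minimal NumPy sketch of the same forward pass, not the TensorFlow code itself; the single row in `borrower_features` and the helper `sigmoid` are hypothetical values chosen only to make the shapes visible.

```python
import numpy as np

def sigmoid(z):
    """Elementwise logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical single borrower: education, marital status, age
borrower_features = np.array([[2.0, 1.0, 24.0]])  # shape (1, 3)

# Layer 1: (1, 3) @ (3, 2) -> (1, 2)
weights1 = np.ones((3, 2))
bias1 = 1.0
dense1 = sigmoid(borrower_features @ weights1 + bias1)

# Layer 2: (1, 2) @ (2, 1) -> (1, 1)
weights2 = np.ones((2, 1))
bias2 = 1.0
prediction = sigmoid(dense1 @ weights2 + bias2)

print(dense1.shape, prediction.shape)
print(prediction[0, 0])
```

The inner dimensions of each product must agree: 3 features meet a 3x2 weight matrix, and the 2 hidden nodes meet a 2x1 weight matrix.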

### Using the dense layer operation
- In this example, we'll skip the linear algebra and let keras work out the details. This will allow us to construct the network below, which has 2 hidden layers and 10 features, using less code than we needed for the network with 1 hidden layer and 3 features.
- To construct this network, we'll need to define three dense layers, each of which takes the previous layer as an input, multiplies it by weights, and applies an activation function. Note that the input data has been defined and is available as a 100x10 tensor, borrower_features.

In [None]:
# Define the first dense layer
dense1 = keras.layers.Dense(7, activation='sigmoid')(borrower_features)

# Define a dense layer with 3 output nodes
dense2 = keras.layers.Dense(3,activation='sigmoid')(dense1)

# Define a dense layer with 1 output node
predictions = keras.layers.Dense(1,activation='sigmoid')(dense2)

# Print the shapes of dense1, dense2, and predictions
print('\n shape of dense1: ', dense1.shape)
print('\n shape of dense2: ', dense2.shape)
print('\n shape of predictions: ', predictions.shape)
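
To see roughly what each `Dense` call is doing with the shapes, here is a small NumPy sketch. The function `dense` is a hypothetical stand-in, not the Keras implementation: it draws weights from a normal distribution for simplicity, whereas Keras uses its own default initializer, and the input data is random.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(inputs, units, rng):
    """Hypothetical stand-in for a sigmoid dense layer with freshly drawn weights."""
    weights = rng.normal(size=(inputs.shape[1], units))  # (n_inputs, units)
    bias = np.zeros(units)
    return 1.0 / (1.0 + np.exp(-(inputs @ weights + bias)))

# Hypothetical 100x10 input, matching the shape described above
borrower_features = rng.normal(size=(100, 10))

dense1 = dense(borrower_features, 7, rng)   # (100, 10) -> (100, 7)
dense2 = dense(dense1, 3, rng)              # (100, 7)  -> (100, 3)
predictions = dense(dense2, 1, rng)         # (100, 3)  -> (100, 1)

print(dense1.shape, dense2.shape, predictions.shape)
```

Each layer only needs to be told its number of output nodes, because the number of inputs can be read from the previous layer's shape.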

### Activation functions
- An activation function is a mathematical function applied to the output of a neuron. Its purpose is to introduce non-linearity into the model, allowing the network to learn and represent complex patterns in the data.
- A typical hidden layer consists of two operations:
    - Linear - performs matrix multiplication
    - Non-linear - applies an activation function
- Examples of activation functions include:
    - The sigmoid activation - typically applied in the output layer of binary classification problems
    - The relu activation - commonly applied in the hidden layers
    - The softmax activation - applied in the output layer of classification problems with more than two classes

In [None]:
# Import TensorFlow
import tensorflow as tf

# Define input layer
inputs = tf.constant(borrower_features, tf.float32)

# Define dense layer 1
dense1 = tf.keras.layers.Dense(16,activation='relu')(inputs)

# Define dense layer 2
dense2 = tf.keras.layers.Dense(8,activation='sigmoid')(dense1)

# Define output layer
output1 = tf.keras.layers.Dense(4,activation='softmax')(dense2)
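
The three activations listed above can be written in a few lines each. This is a NumPy sketch for intuition rather than the Keras implementations; the function names and the sample input `z` are chosen here for illustration.

```python
import numpy as np

def sigmoid(z):
    # Squashes each value into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # Keeps positive values, replaces negative values with 0
    return np.maximum(0.0, z)

def softmax(z):
    # Subtract the row maximum for numerical stability, then normalize
    shifted = z - z.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

z = np.array([[-2.0, 0.0, 3.0]])
print(sigmoid(z))
print(relu(z))
print(softmax(z))
```

Sigmoid and relu act on each element independently, while softmax normalizes across a row so that the outputs can be read as class probabilities.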

### Binary classification problems
- In this exercise, you will again make use of credit card data. The target variable, default, indicates whether a credit card holder defaults on his or her payment in the following period. Since there are only two options--default or not--this is a binary classification problem.
- Define inputs as a 32-bit floating point constant tensor using bill_amounts.
- Set dense1 to be a dense layer with 3 output nodes and a relu activation function.
- Set dense2 to be a dense layer with 2 output nodes and a relu activation function.
- Set the output layer to be a dense layer with a single output node and a sigmoid activation function.


In [None]:
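# Import the TensorFlow operations used in this cell
from tensorflow import constant, float32, keras

# Construct input layer from features
inputs = constant(bill_amounts, float32)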

# Define first dense layer
dense1 = keras.layers.Dense(3, activation='relu')(inputs)

# Define second dense layer
dense2 = keras.layers.Dense(2,activation='relu')(dense1)

# Define output layer
outputs = keras.layers.Dense(1,activation='sigmoid')(dense2)

# Print error for first five examples
error = default[:5] - outputs.numpy()[:5]
print(error)
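
The printed error is the signed difference between each target and the predicted probability. A small NumPy sketch with made-up numbers shows how such outputs might be turned into class labels; the values in `default` and `outputs`, and the 0.5 cutoff, are assumptions for illustration. Both arrays are given the same (5, 1) shape so that the subtraction is elementwise.

```python
import numpy as np

# Hypothetical targets and sigmoid outputs for five borrowers
default = np.array([[1.0], [0.0], [0.0], [1.0], [0.0]])
outputs = np.array([[0.62], [0.48], [0.51], [0.30], [0.07]])

# Signed error per example: positive means the probability was too low
error = default - outputs

# Convert probabilities to 0/1 labels (the 0.5 cutoff is an assumption)
labels = (outputs >= 0.5).astype(int)

# Fraction of examples where the label matches the target
accuracy = (labels == default).mean()

print(error.ravel())
print(labels.ravel(), accuracy)
```

Because the network in the cell above has not been trained, its errors reflect only the random initial weights.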

### Multiclass classification problems
- In this exercise, we expand beyond binary classification to cover multiclass problems. A multiclass problem has targets that can take on three or more values. In the credit card dataset, the education variable can take on 6 different values, each corresponding to a different level of education.
- Define the input layer as a 32-bit constant tensor using borrower_features.
- Set the first dense layer to have 10 output nodes and a sigmoid activation function.
- Set the second dense layer to have 8 output nodes and a rectified linear unit activation function.
- Set the output layer to have 6 output nodes and the appropriate activation function.

In [None]:
# Construct input layer from borrower features
inputs = constant(borrower_features,float32)

# Define first dense layer
dense1 = keras.layers.Dense(10, activation='sigmoid')(inputs)

# Define second dense layer
dense2 = keras.layers.Dense(8, activation='relu')(dense1)

# Define output layer
outputs = keras.layers.Dense(6, activation='softmax')(dense2)

# Print first five predictions
print(outputs.numpy()[:5])
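
Each row printed above holds 6 probabilities, one per education level. As a rough NumPy illustration of how such rows can be read, the sketch below applies softmax to hypothetical pre-activation values (`logits`, drawn at random) and picks the most likely class per row.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pre-activation values for 5 borrowers and 6 classes
logits = rng.normal(size=(5, 6))

# Softmax across each row, shifted by the row maximum for stability
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

# The predicted class is the column with the highest probability
predicted_class = probs.argmax(axis=1)

print(probs.sum(axis=1))   # each row sums to 1
print(predicted_class)
```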

### The dangers of local minima
- Consider the plot of the following loss function, loss_function(), which contains a global minimum, marked by the dot on the right, and several local minima, including the one marked by the dot on the left.
- In this exercise, you will try to find the global minimum of loss_function() using keras.optimizers.SGD(). You will do this twice, each time with a different initial value of the input to loss_function(). First, you will use x_1, which is a variable with an initial value of 6.0. Second, you will use x_2, which is a variable with an initial value of 0.3. Note that loss_function() has been defined and is available.
   - Set opt to use the stochastic gradient descent optimizer (SGD) with a learning rate of 0.01.
   - Perform minimization using the loss function, loss_function(), and the variable with an initial value of 6.0, x_1.
   - Perform minimization using the loss function, loss_function(), and the variable with an initial value of 0.3, x_2.
   - Print x_1 and x_2 as numpy arrays and check whether the values differ. These are the minima that the algorithm identified.

In [None]:
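# Initialize x_1 and x_2 (dtype is passed by keyword because the
# second positional argument of Variable is trainable, not dtype)
x_1 = Variable(6.0, dtype=float32)
x_2 = Variable(0.3, dtype=float32)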

# Define the optimization operation
opt = keras.optimizers.SGD(learning_rate=0.01)

for j in range(100):
	# Perform minimization using the loss function and x_1
	opt.minimize(lambda: loss_function(x_1), var_list=[x_1])
	# Perform minimization using the loss function and x_2
	opt.minimize(lambda: loss_function(x_2), var_list=[x_2])

# Print x_1 and x_2 as numpy arrays
print(x_1.numpy(), x_2.numpy())
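
The dependence on the starting point can be reproduced without TensorFlow. The sketch below uses plain gradient descent on a hypothetical loss, x^4 - 3x^2 - x, which is not the loss_function() from the exercise; it was chosen here because it has one local and one global minimum, and its gradient is written out by hand.

```python
def loss_function(x):
    """Hypothetical loss with a local minimum near -1.13 and a global minimum near 1.30."""
    return x**4 - 3.0 * x**2 - x

def gradient(x):
    # Derivative of the hypothetical loss above
    return 4.0 * x**3 - 6.0 * x - 1.0

def gradient_descent(x, learning_rate=0.01, steps=100):
    # Repeatedly step against the gradient
    for _ in range(steps):
        x -= learning_rate * gradient(x)
    return x

x_1 = gradient_descent(2.0)    # starts to the right of the global minimum
x_2 = gradient_descent(-2.0)   # starts to the left of the local minimum

print(x_1, loss_function(x_1))
print(x_2, loss_function(x_2))
```

Both runs stop at a point where the gradient is close to zero, but only the first one reaches the lower of the two minima.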

### Avoiding local minima
- The problem above showed how easy it is to get stuck in local minima. We had a simple optimization problem in one variable and gradient descent still failed to deliver the global minimum when we had to travel through local minima first. One way to avoid this problem is to use momentum, which allows the optimizer to break through local minima. We will again use the loss function from the previous problem, which has been defined and is available for you as loss_function().
- Several optimizers in TensorFlow have a momentum parameter, including SGD and RMSprop. You will make use of RMSprop in this exercise.
    - Set the opt_1 operation to use a learning rate of 0.01 and a momentum of 0.99.
    - Set opt_2 to use the root mean square propagation (RMSprop) optimizer with a learning rate of 0.01 and a momentum of 0.00.
    - Define the minimization operation for opt_2.
    - Print x_1 and x_2 as numpy arrays.

In [None]:
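# Initialize x_1 and x_2 (dtype is passed by keyword because the
# second positional argument of Variable is trainable, not dtype)
x_1 = Variable(0.05, dtype=float32)
x_2 = Variable(0.05, dtype=float32)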

# Define the optimization operation for opt_1 and opt_2
opt_1 = keras.optimizers.RMSprop(learning_rate=0.01, momentum=0.99)
opt_2 = keras.optimizers.RMSprop(learning_rate=0.01, momentum=0.00)

for j in range(100):
    opt_1.minimize(lambda: loss_function(x_1), var_list=[x_1])
    # Define the minimization operation for opt_2
    opt_2.minimize(lambda: loss_function(x_2), var_list=[x_2])

# Print x_1 and x_2 as numpy arrays
print(x_1.numpy(), x_2.numpy())
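
To illustrate the idea of momentum in isolation, here is a pure-Python sketch. It uses the classical SGD-style momentum update (a velocity that accumulates past gradients) rather than RMSprop, and a hypothetical double-well loss, x^4 - 3x^2 - x, instead of the exercise's loss_function(). The momentum value of 0.9 and the 200 steps are also choices made for this sketch.

```python
def gradient(x):
    # Derivative of the hypothetical loss x**4 - 3*x**2 - x, which has a
    # local minimum near -1.13 and a global minimum near 1.30
    return 4.0 * x**3 - 6.0 * x - 1.0

def descend(x, momentum, learning_rate=0.01, steps=200):
    velocity = 0.0
    for _ in range(steps):
        # The velocity keeps a decaying memory of earlier gradient steps
        velocity = momentum * velocity - learning_rate * gradient(x)
        x += velocity
    return x

x_plain = descend(-2.0, momentum=0.0)      # no momentum
x_momentum = descend(-2.0, momentum=0.9)   # with momentum

print(x_plain, x_momentum)
```

Starting from the same point, the run without momentum settles in the first minimum it meets, while the accumulated velocity carries the other run over the barrier between the two minima. With a different loss, learning rate, or momentum value, the outcome may differ.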

### Initialization in TensorFlow
- A good initialization can reduce the amount of time needed to find the global minimum. In this exercise, we will initialize weights and biases for a neural network that will be used to predict credit card default decisions. To build intuition, we will use the low-level, linear algebraic approach, rather than making use of convenience functions and high-level keras operations. We will also expand the set of input features from 3 to 23.
     - Initialize the layer 1 weights, w1, as a Variable() with shape [23, 7], drawn from a normal distribution.
     - Initialize the layer 1 bias using ones.
     - Use a draw from the normal distribution to initialize w2 as a Variable() with shape [7, 1].
     - Define b2 as a Variable() and set its initial value to 0.0.

In [None]:
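# Import TensorFlow's random module (this shadows Python's built-in random)
from tensorflow import random

# Define the layer 1 weights
w1 = Variable(random.normal([23, 7]))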

# Initialize the layer 1 bias
b1 = Variable(ones([7]))

# Define the layer 2 weights
w2 = Variable(random.normal([7,1]))

# Define the layer 2 bias with an initial value of 0.0
b2 = Variable(0.0)
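
A quick way to check that the shapes chosen above fit together is to push a small batch through them. The NumPy sketch below mirrors the same initialization outside TensorFlow; the random generator seed and the 4-row `features` batch are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Same shapes as in the cell above, drawn with NumPy instead of TensorFlow
w1 = rng.normal(size=(23, 7))   # 23 input features -> 7 hidden nodes
b1 = np.ones(7)                 # one bias per hidden node
w2 = rng.normal(size=(7, 1))    # 7 hidden nodes -> 1 output node
b2 = 0.0                        # scalar bias for the output node

# Hypothetical batch of 4 borrowers with 23 features each
features = rng.normal(size=(4, 23))

hidden = np.maximum(0.0, features @ w1 + b1)   # (4, 23) @ (23, 7) -> (4, 7)
logits = hidden @ w2 + b2                      # (4, 7) @ (7, 1)   -> (4, 1)

print(hidden.shape, logits.shape)
```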

### Defining the model and loss function
- In this exercise, you will train a neural network to predict whether a credit card holder will default. The features and targets you will use to train your network are available in the Python shell as borrower_features and default. You defined the weights and biases in the previous exercise.
- Note that the predictions layer is defined as $\sigma(layer1 \cdot w2 + b2)$, where $\sigma$ is the sigmoid activation, layer1 is a tensor of nodes for the first hidden dense layer, w2 is a tensor of weights, and b2 is the bias tensor.
- The trainable variables are w1, b1, w2, and b2. Additionally, the following operations have been imported for you: keras.activations.relu() and keras.layers.Dropout().
    - Apply a rectified linear unit activation function to the first layer.
    - Apply 25% dropout to layer1.
    - Pass the target, targets, and the predicted values, predictions, to the cross entropy loss function.

In [None]:
# Define the model
def model(w1, b1, w2, b2, features=borrower_features):
    # Apply relu activation function to layer 1
    layer1 = keras.activations.relu(matmul(features, w1) + b1)
    # Apply dropout rate of 0.25
    dropout = keras.layers.Dropout(0.25)(layer1)
    return keras.activations.sigmoid(matmul(dropout, w2) + b2)

# Define the loss function
def loss_function(w1, b1, w2, b2, features=borrower_features, targets=default):
    # Forward the features argument so the loss can be evaluated on any data set
    predictions = model(w1, b1, w2, b2, features)
    # Pass targets and predictions to the cross entropy loss
    return keras.losses.binary_crossentropy(targets, predictions)
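
The two building blocks used above, dropout and binary cross entropy, can be sketched in NumPy for intuition. These are simplified stand-ins, not the Keras implementations: `dropout` always applies a random mask, `binary_crossentropy` averages over all examples, and the sample arrays are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, rate, rng):
    # Zero out roughly `rate` of the entries and rescale the survivors
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

def binary_crossentropy(targets, predictions, eps=1e-7):
    # Clip to avoid log(0), then average the per-example losses
    p = np.clip(predictions, eps, 1.0 - eps)
    return -np.mean(targets * np.log(p) + (1.0 - targets) * np.log(1.0 - p))

# Hypothetical targets and three sets of predicted probabilities
targets = np.array([1.0, 0.0, 1.0, 0.0])
confident = np.array([0.9, 0.1, 0.8, 0.2])
uncertain = np.array([0.5, 0.5, 0.5, 0.5])
wrong = 1.0 - confident

print(binary_crossentropy(targets, confident))
print(binary_crossentropy(targets, uncertain))
print(binary_crossentropy(targets, wrong))

dropped = dropout(np.ones((2, 8)), 0.25, rng)
print(dropped)
```

The loss is smallest when the predicted probabilities are close to the targets and grows quickly for confident mistakes, which is what the optimizer exploits during training.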

### Training neural networks with TensorFlow
- In the sample above, you defined a model, model(w1, b1, w2, b2, features), and a loss function, loss_function(w1, b1, w2, b2, features, targets), both of which are available to you in this exercise. You will now train the model and then evaluate its performance by predicting default outcomes in a test set, which consists of test_features and test_targets and is available to you. The trainable variables are w1, b1, w2, and b2.
  - Set the optimizer to perform minimization.
  - Add the four trainable variables to var_list in the order in which they appear as arguments to loss_function().
  - Use the model and test_features to predict the values for test_targets.

In [None]:
# Train the model
# (opt is assumed to be an optimizer created earlier; it is not defined in this section)
for j in range(100):
    # Complete the optimizer
    opt.minimize(lambda: loss_function(w1, b1, w2, b2),
                 var_list=[w1, b1, w2, b2])

# Make predictions with model using test features
model_predictions = model(w1, b1, w2, b2, test_features)

# Construct the confusion matrix
confusion_matrix(test_targets, model_predictions)
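
The model returns probabilities, so a confusion matrix needs them converted to 0/1 labels at some point. The source does not show how confusion_matrix() handles this, so the following NumPy sketch is only one possible reading: the helper `confusion_matrix_2x2`, the 0.5 threshold, and the sample data are all hypothetical.

```python
import numpy as np

def confusion_matrix_2x2(targets, probabilities, threshold=0.5):
    """Rows are actual classes (0, 1); columns are predicted classes (0, 1)."""
    labels = (np.asarray(probabilities).ravel() >= threshold).astype(int)
    truth = np.asarray(targets).ravel().astype(int)
    matrix = np.zeros((2, 2), dtype=int)
    for actual, predicted in zip(truth, labels):
        matrix[actual, predicted] += 1
    return matrix

# Hypothetical test targets and predicted default probabilities
test_targets = np.array([1, 0, 0, 1, 0, 1])
model_predictions = np.array([0.8, 0.3, 0.6, 0.4, 0.1, 0.9])

matrix = confusion_matrix_2x2(test_targets, model_predictions)
print(matrix)
```

The diagonal counts correct predictions, and the off-diagonal entries count the two kinds of mistakes: predicted defaults that did not happen and missed defaults.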