# Ch. 1 Basics of Deep Learning and Neural Networks
- Neural networks are a powerful modeling approach that accounts for interactions between variables really well
- Deep learning uses especially powerful neural networks and can model many kinds of data
    - Text, images, videos, audio, source code, and really anything else
- Neural Network Structure
    - input layer
    - hidden layer(s): consist of nodes that represent aggregations of information from our input data. More nodes generally means the model can account for more interactions
    - output layer

### Forward Propagation
- Input data are multiplied by weights and summed at each hidden layer node. This repeats node by node, moving forward through each hidden layer until the output is reached.

Bank transaction example
- Make predictions based on:
    - Number of Children
    - Number of existing accounts
    
#### Writing Code to Forward Propagate a Small Neural Network

In [1]:
import numpy as np
import pandas as pd

In [2]:
# Input data and weights from DataCamp
input_data = np.array([3,5])
weights = {'node_0': np.array([2, 4]), 'node_1': np.array([ 4, -5]), 'output': np.array([2, 7])}

# Calculate node 0 value: node_0_value
node_0_value = (input_data * weights['node_0']).sum()

# Calculate node 1 value: node_1_value
node_1_value = (input_data * weights['node_1']).sum()

# Put node values into array: hidden_layer_outputs
hidden_layer_outputs = np.array([node_0_value, node_1_value])

# Calculate output: output
output = (hidden_layer_outputs * weights['output']).sum()

# Print output
print(output)

-39
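The per-node sums above can also be written as matrix products. This is a minimal sketch using the same inputs and weights; the names `hidden_weights` and `output_weights` are illustrative and not part of the course code.

```python
import numpy as np

input_data = np.array([3, 5])
# Each row holds the weights feeding one hidden node (node_0, then node_1)
hidden_weights = np.array([[2, 4],
                           [4, -5]])
output_weights = np.array([2, 7])

# Weighted sums at both hidden nodes in one step
hidden_layer_outputs = hidden_weights @ input_data   # array([26, -13])

# Weighted sum at the output node
output = output_weights @ hidden_layer_outputs
print(output)  # -39
```

Stacking the node weights as rows means one matrix product replaces one line of code per node, which is how libraries typically organize the computation.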


### Activation Functions
- Applied in the hidden layers; they allow the model to capture non-linearity
- If the relationships in the data are not straight-line functions, we need activation functions that can capture the non-linearity
- Applied to the input coming into the node; the result is stored and used as that node's output
- The standard today is the Rectified Linear Activation (ReLU) function
    - relu(x) = 0 if x < 0, and x if x >= 0
        - relu(3) = 3
        - relu(-3) = 0
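
To see why the non-linearity matters, the sketch below stacks two layers with no activation function. The weights `w1` and `w2` are hypothetical. The result matches a single linear layer, so the extra layer adds nothing until an activation such as ReLU is placed between the layers.

```python
import numpy as np

x = np.array([3, 5])
w1 = np.array([[2, 4], [4, -5]])   # first layer weights (hypothetical)
w2 = np.array([[1, -1], [2, 1]])   # second layer weights (hypothetical)

# Two linear layers with no activation in between...
two_linear_layers = w2 @ (w1 @ x)
# ...collapse into one linear layer with weights w2 @ w1
one_linear_layer = (w2 @ w1) @ x

# With ReLU between the layers the result generally differs
with_relu = w2 @ np.maximum(w1 @ x, 0)

print(two_linear_layers, one_linear_layer, with_relu)
```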

In [3]:
def relu(input):
    '''Define your relu activation function here'''
    # Calculate the value for the output of the relu function: output
    output = max(input, 0)
    
    # Return the value just calculated
    return(output)

In [4]:
# Input data and weights from DataCamp
input_data = np.array([3,5])
weights = {'node_0': np.array([2, 4]), 'node_1': np.array([ 4, -5]), 'output': np.array([2, 7])}

# Calculate node 0 value: node_0_output
node_0_input = (input_data * weights['node_0']).sum()
node_0_output = relu(node_0_input)

# Calculate node 1 value: node_1_output
node_1_input = (input_data * weights['node_1']).sum()
node_1_output = relu(node_1_input)

# Put node values into array: hidden_layer_outputs
hidden_layer_outputs = np.array([node_0_output, node_1_output])

# Calculate model output (do not apply relu)
model_output = (hidden_layer_outputs * weights['output']).sum()

# Print model output
print(model_output)

52


The model predicts 52 transactions. Without the activation function, the prediction would have been negative (-39, as in the first example). The real power of activation functions shows once the model weights are tuned.

##### Applying the network to many observations

In [5]:
input_data = np.array([[3, 5], [ 1, -1],[0, 0], [8, 4]])
weights = {'node_0': np.array([2, 4]), 'node_1': np.array([ 4, -5]), 'output': np.array([2, 7])}

# Define predict_with_network()
def predict_with_network(input_data_row, weights):

    # Calculate node 0 value
    node_0_input = (input_data_row * weights['node_0']).sum()
    node_0_output = relu(node_0_input)

    # Calculate node 1 value
    node_1_input = (input_data_row * weights['node_1']).sum()
    node_1_output = relu(node_1_input)

    # Put node values into array: hidden_layer_outputs
    hidden_layer_outputs = np.array([node_0_output, node_1_output])
    
    # Calculate model output
    input_to_final_layer = (hidden_layer_outputs * weights['output']).sum()
    model_output = relu(input_to_final_layer)
    
    # Return model output
    return(model_output)


# Create empty list to store prediction results
results = []
for input_data_row in input_data:
    # Append prediction to results
    results.append(predict_with_network(input_data_row, weights))

# Print results
print(results)


[52, 63, 0, 148]


### Deeper Networks
- Neural networks have become much better as we've been able to add more hidden layers
- Subsequent hidden layers take the outputs of the previous layer's nodes as their input, until eventually reaching the output node, or result

Representation Learning
- Deep networks internally build representations of patterns in the data
- Partially replace the need for feature engineering
- Subsequent layers build increasingly sophisticated representations of the raw data, until we get to the prediction stage
- The modeler does not need to specify which interactions to look for

In [6]:
input_data = np.array([3,5])
weights = {'node_0_0': np.array([2, 4]), 'node_0_1': np.array([ 4, -5]),
           'node_1_0': np.array([-1, 2]), 'node_1_1': np.array([1, 2]),
           'output': np.array([2, 7])}

def predict_with_network(input_data):
    # Calculate node 0 in the first hidden layer
    node_0_0_input = (input_data * weights['node_0_0']).sum()
    node_0_0_output = relu(node_0_0_input)

    # Calculate node 1 in the first hidden layer
    node_0_1_input = (input_data * weights['node_0_1']).sum()
    node_0_1_output = relu(node_0_1_input)

    # Put node values into array: hidden_0_outputs
    hidden_0_outputs = np.array([node_0_0_output, node_0_1_output])
    
    # Calculate node 0 in the second hidden layer
    node_1_0_input = (hidden_0_outputs * weights['node_1_0']).sum()
    node_1_0_output = relu(node_1_0_input)

    # Calculate node 1 in the second hidden layer
    node_1_1_input = (hidden_0_outputs * weights['node_1_1']).sum()
    node_1_1_output = relu(node_1_1_input)

    # Put node values into array: hidden_1_outputs
    hidden_1_outputs = np.array([node_1_0_output, node_1_1_output])

    # Calculate model output: model_output
    model_output = (hidden_1_outputs * weights['output']).sum()
    
    # Return model_output
    return(model_output)

output = predict_with_network(input_data)
print(output)

182
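
The two-hidden-layer cell above can be generalized by looping over layers. This is a sketch under the same weights; the `forward` helper and the list-of-matrices layout are assumptions made for illustration, not the course's code.

```python
import numpy as np

def relu(x):
    # Element-wise ReLU so a whole layer can be activated at once
    return np.maximum(x, 0)

def forward(input_data, hidden_layers, output_weights):
    """Pass the input through each hidden layer, then the output node."""
    activations = input_data
    for layer_weights in hidden_layers:
        # Each row of layer_weights feeds one node in that layer
        activations = relu(layer_weights @ activations)
    # No activation on the output node, matching the cell above
    return output_weights @ activations

hidden_layers = [
    np.array([[2, 4], [4, -5]]),    # node_0_0, node_0_1
    np.array([[-1, 2], [1, 2]]),    # node_1_0, node_1_1
]
print(forward(np.array([3, 5]), hidden_layers, np.array([2, 7])))  # 182
```

Adding a layer then only means appending another weight matrix to `hidden_layers`.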


# Ch. 2 Optimizing a Neural Network with Backward Propagation
### The Need for Optimization
- The perfect weights for one data point are unlikely to be perfect for another
- When developing a model based on multiple points, you want to find the weights that minimize the loss function.
- Goal: find the weights that give the lowest value for the loss function
- <b>Gradient Descent</b> algorithm: aims to find the lowest value of the loss function. Think of it as finding the bottom of a valley. Where the ground is very steep, you can take a larger step down the hill before measuring again. As the ground flattens, you take smaller steps. The steps keep shrinking until any further step, no matter how small, would move you uphill and away from the lowest point
    - start at a random point
    - find the slope
    - take a step
    - repeat until at lowest point
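
The steps above can be sketched on a one-weight problem. The loss function `(w - 3) ** 2`, the fixed starting point (used instead of a random one so the run is reproducible), and the learning rate are all assumptions chosen for illustration.

```python
def loss(w):
    # A simple "valley" whose lowest point is at w = 3
    return (w - 3) ** 2

def slope(w):
    # Derivative of the loss with respect to w
    return 2 * (w - 3)

w = 0.0                 # starting point
learning_rate = 0.1

for step in range(100):
    # Step downhill; the step shrinks as the slope flattens
    w = w - learning_rate * slope(w)

print(w, loss(w))
```

Because each step is the learning rate times the slope, the steps get smaller automatically as the bottom of the valley approaches.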
    
#### Coding How Weight Changes Affect Accuracy

In [7]:
# Define predict_with_network()
def predict_with_network(input_data_row, weights):

    # Calculate node 0 value
    node_0_input = (input_data_row * weights['node_0']).sum()
    node_0_output = relu(node_0_input)

    # Calculate node 1 value
    node_1_input = (input_data_row * weights['node_1']).sum()
    node_1_output = relu(node_1_input)

    # Put node values into array: hidden_layer_outputs
    hidden_layer_outputs = np.array([node_0_output, node_1_output])
    
    # Calculate model output
    input_to_final_layer = (hidden_layer_outputs * weights['output']).sum()
    model_output = relu(input_to_final_layer)
    
    # Return model output
    return(model_output)

In [8]:
# The data point you will make a prediction for
input_data = np.array([0, 3])

# Sample weights
weights_0 = {'node_0': np.array([2, 1]),
             'node_1': np.array([1, 2]),
             'output': np.array([1, 1])
            }

# The actual target value, used to calculate the error
target_actual = 3

# Make prediction using original weights
model_output_0 = predict_with_network(input_data, weights_0)

# Calculate error: error_0
error_0 = model_output_0 - target_actual

# Create weights that cause the network to make perfect prediction (3): weights_1
weights_1 = {'node_0': np.array([2, 1]),
             'node_1': np.array([1, 2]),
             'output': np.array([1, 0])
            }

# Make prediction using new weights: model_output_1
model_output_1 = predict_with_network(input_data, weights_1)

# Calculate error: error_1
error_1 = model_output_1 - target_actual

# Print error_0 and error_1
print(error_0)
print(error_1)


6
0


#### Scaling to Multiple Data Points

In [9]:
input_data = np.array(([0, 3], [1, 2], [-1, -2], [4, 0]))
weights_0 = {'node_0': np.array([2, 1]), 
             'node_1': np.array([1, 2]), 
             'output': np.array([1, 1])
            }
weights_1 = {'node_0': np.array([2, 1]),
             'node_1': np.array([1. , 1.5]),
             'output': np.array([1. , 1.5])
            }
target_actuals = [1, 3, 5, 7]

from sklearn.metrics import mean_squared_error

# Create model_output_0 
model_output_0 = []
# Create model_output_1
model_output_1 = []

# Loop over input_data
for row in input_data:
    # Append prediction to model_output_0
    model_output_0.append(predict_with_network(row, weights_0))
    
    # Append prediction to model_output_1
    model_output_1.append(predict_with_network(row, weights_1))

# Calculate the mean squared error for model_output_0: mse_0
mse_0 = mean_squared_error(target_actuals, model_output_0)

# Calculate the mean squared error for model_output_1: mse_1
mse_1 = mean_squared_error(target_actuals, model_output_1)

# Print mse_0 and mse_1
print("Mean squared error with weights_0: %f" %mse_0)
print("Mean squared error with weights_1: %f" %mse_1)

Mean squared error with weights_0: 37.500000
Mean squared error with weights_1: 49.890625


## Backpropagation
- Takes the error from the output layer and sends it backwards through the hidden layers toward the input layer
- Allows gradient descent to update all weights in the neural network (by getting gradients for all weights)
- Comes from the chain rule of calculus
- It is important to understand the process, but you will generally use a library to implement it

Process
- Backpropagation estimates the slope of the loss function with respect to each weight
- Do forward propagation to calculate predictions and errors before backpropagation
- Go back one layer at a time
- The gradient for a weight is the product of:
    - Node value feeding into that weight
    - Slope of the loss function with respect to the node it feeds into
    - Slope of the activation function at the node it feeds into
- Also need to keep track of the slopes of the loss function with respect to node values
- The slope for a node value is the sum of the slopes for all weights that come out of it
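
The three-factor product can be checked on the smallest possible case. The sketch assumes a network with no hidden layer, an identity output (so the activation slope is 1), a squared-error loss, and illustrative values for `weights`, `input_data`, and `target`. The finite-difference comparison is only a sanity check added here.

```python
import numpy as np

weights = np.array([0.0, 2.0, 1.0])
input_data = np.array([1.0, 2.0, 3.0])
target = 0.0

def squared_error(w):
    # Loss for one data point: (prediction - target) ** 2
    return ((w * input_data).sum() - target) ** 2

# Forward propagation
prediction = (weights * input_data).sum()   # 7.0
error = prediction - target                 # 7.0

# Product of the three factors:
#   node value feeding into the weight      -> input_data
#   slope of loss w.r.t. the output node    -> 2 * error
#   slope of the activation at that node    -> 1 (identity output)
slope = input_data * (2 * error) * 1        # [14., 28., 42.]

# Numerical check with central differences, one weight at a time
eps = 1e-6
numeric = np.array([
    (squared_error(weights + eps * np.eye(3)[i])
     - squared_error(weights - eps * np.eye(3)[i])) / (2 * eps)
    for i in range(3)
])
print(slope, numeric)
```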

Recap of Backpropagation
- Start at some random set of weights
- Use forward propagation to make a prediction
- Use backward propagation to calculate the slope of the loss function with respect to each weight
- Multiply that slope by the learning rate, and subtract the result from the current weights
- Repeat the cycle until we get to a "flat" part of the curve
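
The recap can be turned into a short loop. This is a sketch for a single data point and a network with no hidden layer; the starting weights are fixed rather than random so the run is reproducible, and `get_error` is a hypothetical helper.

```python
import numpy as np

def get_error(weights, input_data, target):
    # Forward propagation followed by the prediction error
    return (weights * input_data).sum() - target

input_data = np.array([1.0, 2.0, 3.0])
target = 0.0
weights = np.array([0.0, 2.0, 1.0])
learning_rate = 0.01

squared_errors = []
for _ in range(20):
    error = get_error(weights, input_data, target)
    squared_errors.append(error ** 2)
    # Slope of the squared error with respect to each weight
    slope = 2 * input_data * error
    # Multiply by the learning rate and subtract from the current weights
    weights = weights - learning_rate * slope

print(squared_errors[0], squared_errors[-1])
```

With this learning rate the squared error falls on every pass, which is the "moving toward the flat part of the curve" behavior described above.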

#### Stochastic Gradient Descent
- calculate the slopes on only a subset of the data, or "batch"
- use a different batch of data to calculate the next update
- start over from the beginning once all data is used
- each time through the training data is called an epoch
- Slopes are calculated on one batch at a time: stochastic gradient descent
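
A minimal sketch of the batch and epoch loop, assuming a plain linear model, noise-free synthetic data, and mean squared error. Reshuffling the rows at the start of every epoch is a choice made in this sketch; the names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))            # synthetic predictors
true_weights = np.array([2.0, -1.0])
y = X @ true_weights                     # noise-free targets

weights = np.zeros(2)
learning_rate = 0.1
batch_size = 20

for epoch in range(20):                  # one epoch = one pass over all data
    order = rng.permutation(len(X))      # new batch assignment each epoch
    for start in range(0, len(X), batch_size):
        batch = order[start:start + batch_size]
        errors = X[batch] @ weights - y[batch]
        # Slope of the mean squared error, computed on this batch only
        slope = 2 * X[batch].T @ errors / len(batch)
        weights = weights - learning_rate * slope

print(weights)
```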

# Ch. 3 Building Models with Keras
## Model Building Steps
- Specify Architecture: number of layers, number of nodes, activation function
- Compile the model: specify loss function and details about optimization
- Fit the model: cycle of forward and backward propagation
- Predict

### Model Specification

In [10]:
import pandas as pd
df = pd.read_csv('wages.txt')
target = df['wage_per_hour']
predictors = df.drop('wage_per_hour', axis=1)

In [11]:
import pandas as pd
import numpy as np
from keras.layers import Dense
from keras.models import Sequential

# Load Data
df = pd.read_csv('wages.txt')
target = df['wage_per_hour']
predictors = df.drop('wage_per_hour', axis=1)

# Find the number of nodes in the input layer, equal to number of input features
n_cols = predictors.shape[1]

# Dense means every node connects to every node in the next layer
# Specify the model
model = Sequential()
model.add(Dense(50, activation='relu', input_shape = (n_cols,))) # connects all inputs to 50 nodes
model.add(Dense(32, activation='relu')) # connects all 50 nodes to all 32 nodes in this layer
model.add(Dense(1, activation='relu')) # converges into 1 final output node

### Compiling the Model
#### Why you need to compile the model
- Specify the Optimizer
    - Many options and mathematically complex
    - best to choose versatile option and use that for most problems
    - "Adam" is usually a good choice. It adjusts the learning rate as it does gradient descent to ensure reasonable values throughout the weight optimization process
- Loss Function
    - MSE is common for regression problems
    - Classification has a different default loss function

In [12]:
# Compile the Model
model.compile(optimizer='adam', loss='mean_squared_error')

#### What is Fitting a model
- Applying backpropagation and gradient descent with your data to update the weights
- Scaling data before fitting can ease optimization

In [13]:
# Fit the model
model.fit(predictors, target)



<tensorflow.python.keras.callbacks.History at 0x29627c44040>

### Full Process - Specify, Compile, Fit

In [14]:
# Specify the model
model = Sequential()
model.add(Dense(50, activation='relu', input_shape = (n_cols,)))
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='relu')) 
# Compile the Model
model.compile(optimizer='adam', loss='mean_squared_error')
# Fit the model
model.fit(predictors, target)



<tensorflow.python.keras.callbacks.History at 0x296299d40d0>

### Classification Models
- Use 'categorical_crossentropy' as the loss function. A lower score is better
- Add metrics=['accuracy'] to the compile step for easy-to-understand diagnostics
- The output layer has a separate node for each possible outcome and uses the "softmax" activation function
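
To make the loss concrete, here is a hand-rolled sketch of softmax and categorical cross-entropy for a two-node output layer. The helper functions and the raw scores are illustrative; Keras computes these internally and its implementation details may differ.

```python
import math

def softmax(scores):
    """Turn raw output-node scores into probabilities that sum to 1."""
    shifted = [s - max(scores) for s in scores]   # for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot_target, probabilities):
    """Negative log of the probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probabilities))

probabilities = softmax([1.0, 3.0])   # hypothetical scores for two outcomes

# Loss when the true class is node 1 (the confident, correct case)
good = categorical_crossentropy([0, 1], probabilities)
# Loss when the true class is node 0 (the confident, wrong case)
bad = categorical_crossentropy([1, 0], probabilities)

print(probabilities, good, bad)
```

The loss is small when the model puts high probability on the true class and large when it does not, which is why a lower score is better.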

In [15]:
# Import necessary modules
import keras
from keras.layers import Dense
from keras.models import Sequential
from keras.utils import to_categorical

# Load Data
df = pd.read_csv('titanic.txt')
target = to_categorical(df.survived)
predictors = df.drop('survived', axis=1)
predictors.replace(False, 0, inplace=True)
predictors.replace(True, 1, inplace=True)
n_cols = predictors.shape[1]

# Set up the model
model = Sequential()

# Add the first layer
model.add(Dense(32, activation='relu', input_shape=(n_cols,)))

# Add the output layer
model.add(Dense(2, activation='softmax'))

# Compile the model
model.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy'])

# Fit the model
model.fit(predictors, target, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x2962ac555b0>

### Using Models
- Save model after training
- Reload the model
- Make predictions


In [16]:
pred_data = np.array([[2, 34.0, 0, 0, 13.0, 1, False, 0, 0, 1],
       [2, 31.0, 1, 1, 26.25, 0, False, 0, 0, 1],
       [1, 11.0, 1, 2, 120.0, 1, False, 0, 0, 1],
       [3, 0.42, 0, 1, 8.5167, 1, False, 1, 0, 0]])

# Calculate predictions: predictions
predictions = model.predict(pred_data)

# Calculate predicted probability of survival: predicted_prob_true
predicted_prob_true = predictions[:,1]

# print predicted_prob_true
print(predicted_prob_true)

[0.26049826 0.43580088 0.5783743  0.5376731 ]


# Ch. 4 Fine Tuning Keras Models
Why optimization is hard
- simultaneously optimizing 1000s of parameters with complex relationships
- Updates may not improve model meaningfully
- updates too small (if learning rate is low) or too large (if learning rate is high)
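
The effect of the learning rate can be seen on a one-weight toy problem. This sketch assumes the loss `w ** 2` and three hand-picked learning rates; it is not the network trained below.

```python
def run_gradient_descent(learning_rate, steps=5, start=1.0):
    """Minimize loss(w) = w**2, whose slope is 2*w, and return the final loss."""
    w = start
    for _ in range(steps):
        w = w - learning_rate * 2 * w
    return w ** 2

# Too small, reasonable, and too large for this particular loss
final_losses = {lr: run_gradient_descent(lr) for lr in [0.000001, 0.1, 1.5]}

for lr, final_loss in final_losses.items():
    print(lr, final_loss)
```

The smallest rate barely moves the loss from its starting value of 1, the middle rate reduces it steadily, and the largest rate overshoots so far that the loss grows on every step.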

In [17]:
# Import necessary modules
import keras
from keras.layers import Dense
from keras.models import Sequential
from keras.utils import to_categorical
from keras.optimizers import SGD

# Load Data
df = pd.read_csv('titanic.txt')
target = to_categorical(df.survived)
predictors = df.drop('survived', axis=1)
predictors.replace(False, 0, inplace=True)
predictors.replace(True, 1, inplace=True)
n_cols = predictors.shape[1]

def get_new_model(n_cols):
    model = Sequential()
    model.add(Dense(100, activation='relu', input_shape=(n_cols,)))
    model.add(Dense(100, activation='relu'))
    model.add(Dense(2, activation='softmax'))
    return(model)

# Create list of learning rates: lr_to_test
lr_to_test = [0.000001, 0.01, 1]

# Loop over learning rates
for lr in lr_to_test:
    print('\n\nTesting model with learning rate: %f\n'%lr )
    
    # Build new model to test, unaffected by previous models
    model = get_new_model(n_cols)
    
    # Create SGD optimizer with specified learning rate: my_optimizer
    my_optimizer = SGD(learning_rate=lr)
    
    # Compile the model
    model.compile(optimizer=my_optimizer, loss='categorical_crossentropy')
    
    # Fit the model
    model.fit(predictors, target, epochs=5)



Testing model with learning rate: 0.000001

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


Testing model with learning rate: 0.010000

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


Testing model with learning rate: 1.000000

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


## Model Validation
### Validation in Deep Learning
- Commonly use a validation split rather than cross-validation
- Deep learning is widely used on large datasets, where cross-validation would be computationally expensive
- A single validation score is based on a large amount of data, so it is reliable
- "validation_split" can be used when fitting the model and takes a decimal for the fraction of data to hold out for validation
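
A sketch of what a validation split does to the data. Keras documents `validation_split` as holding out the last fraction of the samples, before any shuffling; the helper below and its rounding rule are assumptions for illustration, not the Keras implementation.

```python
import numpy as np

def split_for_validation(predictors, target, validation_split):
    """Hold out the last fraction of rows for validation."""
    n_val = round(len(predictors) * validation_split)
    n_train = len(predictors) - n_val
    return (predictors[:n_train], target[:n_train],
            predictors[n_train:], target[n_train:])

predictors = np.arange(20).reshape(10, 2)   # 10 hypothetical rows
target = np.arange(10)

x_train, y_train, x_val, y_val = split_for_validation(predictors, target, 0.3)
print(len(x_train), len(x_val))  # 7 3
```

Because the held-out rows come from the end of the data, it can be worth shuffling a dataset that is sorted in some meaningful order before fitting.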

Early stopping can be used to stop optimizing once the validation score stops improving. Patience is how many epochs the model can go without improving before training stops.
- from keras.callbacks import EarlyStopping
- early_stopping_monitor = EarlyStopping(patience=2)
- model.fit(predictors, target, validation_split=0.3,
            epochs=20, callbacks=[early_stopping_monitor])
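
The patience rule can be sketched in plain Python. This is a simplified model of the idea, not the Keras callback itself; `epochs_run` and the list of validation losses are hypothetical.

```python
def epochs_run(validation_losses, patience):
    """Return how many epochs run before early stopping triggers."""
    best_loss = float('inf')
    epochs_without_improvement = 0
    for epoch, val_loss in enumerate(validation_losses, start=1):
        if val_loss < best_loss:
            best_loss = val_loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch          # stop training here
    return len(validation_losses)     # never triggered

# Hypothetical validation losses for 8 epochs
losses = [0.70, 0.60, 0.55, 0.56, 0.57, 0.50, 0.49, 0.48]
print(epochs_run(losses, patience=2))  # 5
print(epochs_run(losses, patience=3))  # 8
```

With `patience=2` training stops at epoch 5 and never reaches the later improvement at epoch 6, which shows the trade-off in choosing a small patience value.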

In [18]:
# Imports
import keras
from keras.callbacks import EarlyStopping
from keras.layers import Dense
from keras.models import Sequential
from keras.utils import to_categorical
# Load Data
df = pd.read_csv('titanic.txt')
target = to_categorical(df.survived)
predictors = df.drop('survived', axis=1)
predictors.replace(False, 0, inplace=True)
predictors.replace(True, 1, inplace=True)
n_cols = predictors.shape[1]

# Specify the model
model = Sequential()
model.add(Dense(100, activation='relu', input_shape=(n_cols,)))
model.add(Dense(100, activation='relu'))
model.add(Dense(2, activation='softmax'))

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Define early_stopping_monitor
early_stopping_monitor = EarlyStopping(patience=2)

# Fit the model (store the returned History separately so `model` stays usable)
history = model.fit(predictors, target,
                    validation_split=0.3, epochs=30,
                    callbacks=[early_stopping_monitor],
                    verbose=1)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30


## Thinking about Model Capacity
### Workflow for optimizing model capacity
- start with a small network and get the validation score
- gradually increase capacity as long as score keeps improving
- Keep increasing capacity until validation score is no longer improving
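
The workflow can be sketched as a simple search loop. It assumes a validation score where higher is better; `choose_capacity` and the scores per layer width are hypothetical stand-ins for actually training and validating each model.

```python
def choose_capacity(candidate_sizes, validation_score):
    """Grow the network until the validation score stops improving."""
    best_size, best_score = None, float('-inf')
    for size in candidate_sizes:
        score = validation_score(size)
        if score <= best_score:
            break                       # no improvement: stop growing
        best_size, best_score = size, score
    return best_size

# Hypothetical validation accuracies for increasing hidden-layer widths
scores = {10: 0.71, 50: 0.78, 100: 0.81, 200: 0.80, 400: 0.79}
print(choose_capacity(sorted(scores), scores.get))  # 100
```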

## Stepping up to images
Recognizing handwritten digits
- MNIST Dataset
- 28 x 28 grid flattened to 784 values for each image
- value in each part of array denotes darkness of that pixel
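
A small sketch of the flattening step, using a made-up image rather than real MNIST data. The value 255 for the darkest pixel is an assumption for this example.

```python
import numpy as np

# Hypothetical 28 x 28 image: a dark vertical stroke on a blank background
image = np.zeros((28, 28), dtype=np.uint8)
image[4:24, 14] = 255          # rows 4..23 of column 14 are dark

flat = image.reshape(-1)       # row-major flattening to 784 values
print(flat.shape)              # (784,)

# Pixel (row, col) lands at index row * 28 + col in the flattened array
print(flat[4 * 28 + 14])       # 255
```

Flattening discards the explicit 2D layout, so a dense network sees each pixel as just one of 784 independent input columns.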

In [19]:
import pandas as pd
import keras
from keras.layers import Dense
from keras.models import Sequential
from keras.utils import to_categorical
from keras.callbacks import EarlyStopping

# Import and format data
df = pd.read_csv('mnist.txt', header=None)
y = to_categorical(df[0])
X = df.drop([0], axis=1)
n_cols = X.shape[1]

# Check the shapes of the target and predictors
print(y.shape)
print(X.shape)

# Create the model: model
model = Sequential()

# Add the first hidden layer
model.add(Dense(50, activation='relu', input_shape=(n_cols,)))

# Add the second hidden layer
model.add(Dense(50, activation='relu'))

# Add the output layer
model.add(Dense(10, activation='softmax'))

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Define early_stopping_monitor
early_stopping_monitor = EarlyStopping(patience=2)

# Fit the model
model.fit(X, y, validation_split=0.3, epochs=30, callbacks=[early_stopping_monitor])