*Accompanying code examples of the book "Introduction to Artificial Neural Networks and Deep Learning: A Practical Guide with Applications in Python" by [Sebastian Raschka](https://sebastianraschka.com). All code examples are released under the [MIT license](https://github.com/rasbt/deep-learning-book/blob/master/LICENSE). If you find this content useful, please consider supporting the work by buying a [copy of the book](https://leanpub.com/ann-and-deeplearning).*
  
Other code examples and content are available on [GitHub](https://github.com/rasbt/deep-learning-book). The PDF and ebook versions of the book are available through [Leanpub](https://leanpub.com/ann-and-deeplearning).

In [1]:
%load_ext watermark
%watermark -a 'Sebastian Raschka' -v -p tensorflow

Sebastian Raschka 

CPython 3.6.0
IPython 6.0.0

tensorflow 1.1.0


# Model Zoo -- Multilayer Perceptron with Dropout

Typically, dropout is applied after the non-linear activation function (a). However, when using rectified linear units (ReLUs), it might make sense to apply dropout before the non-linear activation (b) for reasons of computational efficiency depending on the particular code implementation.

> (a):  Fully connected, linear activation -> ReLU -> Dropout -> ...  
> (b):  Fully connected, linear activation -> Dropout -> ReLU -> ...

Why do (a) and (b) produce the same results in case of ReLU?. Let's answer this question with a simple example starting with the following *logits* (outputs of the linear activation of the fully connected layer):

> `[-1, -2, -3, 4, 5, 6]`

Let's walk through scenario (a), applying the ReLU activation first. The output of the non-linear ReLU functions are as follows:

> `[0, 0, 0, 4, 5, 6]`

Remember, the ReLU activation function is defined as $f(x) = max(0, x)$; thus, all non-zero values will be changed to zeros. Now, applying dropout with a probability 0f 50%, let's assume that the units being deactivated are units 2, 4, and 6:


> `[0*2, 0, 0*2, 0, 0*2, 0] = [0, 0, 0, 0, 10, 0]`


Note that in dropout, units are deactivated randomly by default. In the preceding example, we assumed that the 2nd, 4th, and 6th unit were deactivated during the training iteration. Also, because we applied dropout with 50% dropout probability, we scaled the remaining units by a factor of 2.

Now, let's take a look at scenario (b). Again, we assume a 50% dropout rate and that units 2, 4, and 6 are deactivated:

> `[-1, -2, -3, 4, 5, 6] ->  [-1*2, 0, -3*2, 0, 5*2, 0]`


Now, if we pass this array to the ReLU function, the resulting array will look exactly like the one in scenario (a):


> `[-2, 0, -6, 0, 10, 0] -> [0, 0, 0, 0, 10, 0]`

### Low-level Implementation

In [2]:
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data


##########################
### DATASET
##########################

mnist = input_data.read_data_sets("./", one_hot=True)


##########################
### SETTINGS
##########################

# Hyperparameters
learning_rate = 0.1
training_epochs = 20
batch_size = 64
dropout_keep_proba = 0.5

# Architecture
n_hidden_1 = 128
n_hidden_2 = 256
n_input = 784
n_classes = 10


##########################
### GRAPH DEFINITION
##########################

g = tf.Graph()
with g.as_default():

    # Dropout settings
    keep_proba = tf.placeholder(tf.float32, None, name='keep_proba')
    
    # Input data
    tf_x = tf.placeholder(tf.float32, [None, n_input], name='features')
    tf_y = tf.placeholder(tf.float32, [None, n_classes], name='targets')

    # Model parameters
    weights = {
        'h1': tf.Variable(tf.truncated_normal([n_input, n_hidden_1], stddev=0.1)),
        'h2': tf.Variable(tf.truncated_normal([n_hidden_1, n_hidden_2], stddev=0.1)),
        'out': tf.Variable(tf.truncated_normal([n_hidden_2, n_classes], stddev=0.1))
    }
    biases = {
        'b1': tf.Variable(tf.zeros([n_hidden_1])),
        'b2': tf.Variable(tf.zeros([n_hidden_2])),
        'out': tf.Variable(tf.zeros([n_classes]))
    }

    # Multilayer perceptron
    layer_1 = tf.add(tf.matmul(tf_x, weights['h1']), biases['b1'])
    layer_1 = tf.nn.relu(layer_1)
    layer_1 = tf.nn.dropout(layer_1, keep_prob=keep_proba)
    
    layer_2 = tf.add(tf.matmul(layer_1, weights['h2']), biases['b2'])
    layer_2 = tf.nn.relu(layer_2)
    layer_2 = tf.nn.dropout(layer_2, keep_prob=keep_proba)
    
    out_layer = tf.add(tf.matmul(layer_2, weights['out']), biases['out'], name='logits')

    # Loss and optimizer
    loss = tf.nn.softmax_cross_entropy_with_logits(logits=out_layer, labels=tf_y)
    cost = tf.reduce_mean(loss, name='cost')
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
    train = optimizer.minimize(cost, name='train')

    # Prediction
    correct_prediction = tf.equal(tf.argmax(tf_y, 1), tf.argmax(out_layer, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32), name='accuracy')

    
##########################
### TRAINING & EVALUATION
##########################

with tf.Session(graph=g) as sess:
    sess.run(tf.global_variables_initializer())

    for epoch in range(training_epochs):
        avg_cost = 0.
        total_batch = mnist.train.num_examples // batch_size

        for i in range(total_batch):
            batch_x, batch_y = mnist.train.next_batch(batch_size)
            _, c = sess.run(['train', 'cost:0'], feed_dict={'features:0': batch_x,
                                                            'targets:0': batch_y,
                                                            'keep_proba:0': dropout_keep_proba})
            avg_cost += c
        
        train_acc = sess.run('accuracy:0', feed_dict={'features:0': mnist.train.images,
                                                      'targets:0': mnist.train.labels,
                                                      'keep_proba:0': 1.0})
        valid_acc = sess.run('accuracy:0', feed_dict={'features:0': mnist.validation.images,
                                                      'targets:0': mnist.validation.labels,
                                                      'keep_proba:0': 1.0})
        
        print("Epoch: %03d | AvgCost: %.3f" % (epoch + 1, avg_cost / (i + 1)), end="")
        print(" | Train/Valid ACC: %.3f/%.3f" % (train_acc, valid_acc))
        
    test_acc = sess.run(accuracy, feed_dict={'features:0': mnist.test.images,
                                             'targets:0': mnist.test.labels,
                                             'keep_proba:0': 1.0})                                             
    print('Test ACC: %.3f' % test_acc)

Extracting ./train-images-idx3-ubyte.gz
Extracting ./train-labels-idx1-ubyte.gz
Extracting ./t10k-images-idx3-ubyte.gz
Extracting ./t10k-labels-idx1-ubyte.gz
Epoch: 001 | AvgCost: 0.690 | Train/Valid ACC: 0.926/0.933
Epoch: 002 | AvgCost: 0.379 | Train/Valid ACC: 0.946/0.951
Epoch: 003 | AvgCost: 0.313 | Train/Valid ACC: 0.956/0.962
Epoch: 004 | AvgCost: 0.274 | Train/Valid ACC: 0.960/0.965
Epoch: 005 | AvgCost: 0.254 | Train/Valid ACC: 0.964/0.966
Epoch: 006 | AvgCost: 0.232 | Train/Valid ACC: 0.968/0.968
Epoch: 007 | AvgCost: 0.215 | Train/Valid ACC: 0.969/0.970
Epoch: 008 | AvgCost: 0.205 | Train/Valid ACC: 0.971/0.969
Epoch: 009 | AvgCost: 0.196 | Train/Valid ACC: 0.974/0.971
Epoch: 010 | AvgCost: 0.187 | Train/Valid ACC: 0.976/0.972
Epoch: 011 | AvgCost: 0.178 | Train/Valid ACC: 0.978/0.971
Epoch: 012 | AvgCost: 0.173 | Train/Valid ACC: 0.979/0.972
Epoch: 013 | AvgCost: 0.167 | Train/Valid ACC: 0.979/0.971
Epoch: 014 | AvgCost: 0.161 | Train/Valid ACC: 0.980/0.972
Epoch: 015 | Avg

### tensorflow.layers Abstraction

Bote that we define the *dropout rate*, not the *keep probability* when we are using dropout from `tf.layers`.

In [3]:
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
from tensorflow.python.ops import init_ops


##########################
### DATASET
##########################

mnist = input_data.read_data_sets("./", one_hot=True)


##########################
### SETTINGS
##########################

# Hyperparameters
learning_rate = 0.1
training_epochs = 20
batch_size = 64
dropout_rate = 0.5 
# note that we define the dropout rate, not
# the "keep probability" when using
# dropout from tf.layers

# Architecture
n_hidden_1 = 128
n_hidden_2 = 256
n_input = 784
training_epochs = 15


##########################
### GRAPH DEFINITION
##########################

g = tf.Graph()
with g.as_default():

    # Dropout settings
    drop_rate = tf.placeholder(tf.float32, None, name='dropout_rate')
    
    # Input data
    tf_x = tf.placeholder(tf.float32, [None, n_input], name='features')
    tf_y = tf.placeholder(tf.float32, [None, n_classes], name='targets')

    # Multilayer perceptron
    layer_1 = tf.layers.dense(tf_x, n_hidden_1, activation=tf.nn.relu, 
                              kernel_initializer=init_ops.truncated_normal_initializer(stddev=0.1))
    layer_1 = tf.layers.dropout(layer_1, rate=drop_rate)
    
    layer_2 = tf.layers.dense(layer_1, n_hidden_2, activation=tf.nn.relu,
                              kernel_initializer=init_ops.truncated_normal_initializer(stddev=0.1))
    layer_2 = tf.layers.dropout(layer_1, rate=drop_rate)
    
    out_layer = tf.layers.dense(layer_2, n_classes, activation=None, name='logits')

    # Loss and optimizer
    loss = tf.nn.softmax_cross_entropy_with_logits(logits=out_layer, labels=tf_y)
    cost = tf.reduce_mean(loss, name='cost')
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
    train = optimizer.minimize(cost, name='train')

    # Prediction
    correct_prediction = tf.equal(tf.argmax(tf_y, 1), tf.argmax(out_layer, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32), name='accuracy')


##########################
### TRAINING & EVALUATION
##########################
    
with tf.Session(graph=g) as sess:
    sess.run(tf.global_variables_initializer())

    for epoch in range(training_epochs):
        avg_cost = 0.
        total_batch = mnist.train.num_examples // batch_size

        for i in range(total_batch):
            batch_x, batch_y = mnist.train.next_batch(batch_size)
            _, c = sess.run(['train', 'cost:0'], feed_dict={'features:0': batch_x,
                                                            'targets:0': batch_y,
                                                            'dropout_rate:0': dropout_rate})
            avg_cost += c
        
        train_acc = sess.run('accuracy:0', feed_dict={'features:0': mnist.train.images,
                                                      'targets:0': mnist.train.labels,
                                                      'dropout_rate:0': 0.0})
        valid_acc = sess.run('accuracy:0', feed_dict={'features:0': mnist.validation.images,
                                                      'targets:0': mnist.validation.labels,
                                                      'dropout_rate:0': 0.0})
        
        print("Epoch: %03d | AvgCost: %.3f" % (epoch + 1, avg_cost / (i + 1)), end="")
        print(" | Train/Valid ACC: %.3f/%.3f" % (train_acc, valid_acc))
        
    test_acc = sess.run('accuracy:0', feed_dict={'features:0': mnist.test.images,
                                                 'targets:0': mnist.test.labels,
                                                 'dropout_rate:0': 0.0})
    print('Test ACC: %.3f' % test_acc)

Extracting ./train-images-idx3-ubyte.gz
Extracting ./train-labels-idx1-ubyte.gz
Extracting ./t10k-images-idx3-ubyte.gz
Extracting ./t10k-labels-idx1-ubyte.gz
Epoch: 001 | AvgCost: 0.383 | Train/Valid ACC: 0.931/0.935
Epoch: 002 | AvgCost: 0.207 | Train/Valid ACC: 0.953/0.954
Epoch: 003 | AvgCost: 0.159 | Train/Valid ACC: 0.963/0.962
Epoch: 004 | AvgCost: 0.130 | Train/Valid ACC: 0.969/0.966
Epoch: 005 | AvgCost: 0.110 | Train/Valid ACC: 0.973/0.968
Epoch: 006 | AvgCost: 0.095 | Train/Valid ACC: 0.977/0.971
Epoch: 007 | AvgCost: 0.083 | Train/Valid ACC: 0.979/0.972
Epoch: 008 | AvgCost: 0.075 | Train/Valid ACC: 0.982/0.972
Epoch: 009 | AvgCost: 0.067 | Train/Valid ACC: 0.985/0.974
Epoch: 010 | AvgCost: 0.061 | Train/Valid ACC: 0.985/0.976
Epoch: 011 | AvgCost: 0.055 | Train/Valid ACC: 0.986/0.975
Epoch: 012 | AvgCost: 0.050 | Train/Valid ACC: 0.989/0.978
Epoch: 013 | AvgCost: 0.046 | Train/Valid ACC: 0.990/0.977
Epoch: 014 | AvgCost: 0.042 | Train/Valid ACC: 0.990/0.978
Epoch: 015 | Avg

**Note:** The slight difference to the low-level implementation is due to using a non-fixed random seed in the weight initialization. However, even setting the random seed to a fixed value would still cause the individual runs to be non-deterministic due the random shuffling procedure in `tensorflow.examples.tutorials.mnist`