Deep Learning
=============

Assignment 3
------------

Previously in `2_fullyconnected.ipynb`, you trained a logistic regression and a neural network model.

The goal of this assignment is to explore regularization techniques.

In [1]:
# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
from __future__ import print_function
import numpy as np
import tensorflow as tf
from six.moves import cPickle as pickle

First reload the data we generated in _1_notmnist.ipynb_.

In [2]:
pickle_file = 'notMNIST.pickle'

with open(pickle_file, 'rb') as f:
    save = pickle.load(f)
    train_dataset = save['train_dataset']
    train_labels = save['train_labels']
    valid_dataset = save['valid_dataset']
    valid_labels = save['valid_labels']
    test_dataset = save['test_dataset']
    test_labels = save['test_labels']
    del save  # hint to help gc free up memory
    print('Training set', train_dataset.shape, train_labels.shape)
    print('Validation set', valid_dataset.shape, valid_labels.shape)
    print('Test set', test_dataset.shape, test_labels.shape)

Training set (200000, 28, 28) (200000,)
Validation set (10000, 28, 28) (10000,)
Test set (10000, 28, 28) (10000,)


Reformat into a shape that's more adapted to the models we're going to train:
- data as a flat matrix,
- labels as float 1-hot encodings.

In [3]:
image_size = 28
num_labels = 10

def reformat(dataset, labels):
    dataset = dataset.reshape((-1, image_size * image_size)).astype(np.float32)
    # Map 0 to [1.0, 0.0, 0.0 ...], 1 to [0.0, 1.0, 0.0 ...]
    labels = (np.arange(num_labels) == labels[:,None]).astype(np.float32)
    return dataset, labels

train_dataset, train_labels = reformat(train_dataset, train_labels)
valid_dataset, valid_labels = reformat(valid_dataset, valid_labels)
test_dataset, test_labels = reformat(test_dataset, test_labels)

print('Training set', train_dataset.shape, train_labels.shape)
print('Validation set', valid_dataset.shape, valid_labels.shape)
print('Test set', test_dataset.shape, test_labels.shape)

Training set (200000, 784) (200000, 10)
Validation set (10000, 784) (10000, 10)
Test set (10000, 784) (10000, 10)


In [4]:
def accuracy(predictions, labels):
    return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1))/ predictions.shape[0])
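
As a quick sanity check of the `accuracy` helper, here is a minimal sketch on made-up data. The arrays `toy_predictions` and `toy_labels` are hypothetical and exist only for illustration; the function body is the same computation as in the cell above.

```python
import numpy as np

def accuracy(predictions, labels):
    # Percentage of rows whose arg-max prediction matches the one-hot label.
    return 100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1)) / predictions.shape[0]

# Hypothetical toy data: 4 examples, 3 classes.
toy_predictions = np.array([[0.7, 0.2, 0.1],
                            [0.1, 0.8, 0.1],
                            [0.3, 0.3, 0.4],
                            [0.6, 0.3, 0.1]])
toy_labels = np.array([[1, 0, 0],
                       [0, 1, 0],
                       [0, 0, 1],
                       [0, 1, 0]], dtype=np.float32)

# The first three rows are classified correctly, the last one is not.
print(accuracy(toy_predictions, toy_labels))  # 75.0
```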

---
Problem 1
---------

Introduce and tune L2 regularization for both logistic and neural network models. Remember that L2 amounts to adding a penalty on the norm of the weights to the loss. In TensorFlow, you can compute the L2 loss for a tensor `t` using `nn.l2_loss(t)`. The right amount of regularization should improve your validation / test accuracy.
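
To make the penalty concrete, here is a small NumPy sketch of the quantity that `tf.nn.l2_loss` is documented to compute, `sum(t ** 2) / 2`, and of how it is added to the loss. The weights and the cross-entropy value are made-up numbers, not results from this notebook.

```python
import numpy as np

def l2_loss(t):
    # Same quantity tf.nn.l2_loss documents: sum(t ** 2) / 2 (no square root).
    return np.sum(np.square(t)) / 2.0

toy_weights = np.array([[1.0, -2.0], [0.5, 0.0]])  # hypothetical weights
data_loss = 0.8                                     # hypothetical cross-entropy value
beta = 0.005                                        # regularization strength

penalty = beta * l2_loss(toy_weights)
total_loss = data_loss + penalty
print(l2_loss(toy_weights), penalty, total_loss)
```

A larger `beta` pulls the weights harder toward zero, which is why the tuning runs try several orders of magnitude.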

---

L2 regularization results:
1. sgd logistic model w/ regularization on 1 set of weights:
    a. beta = 0.0005: 87.9%
    b. beta = 0.005: 88.4%
    c. beta = 0.05: 86%
2. sgd nnet model w/ regularization on h1 and o1 weights:
    a. beta = 0.0005: 91.4%
    b. beta = 0.005: 91.8%
    c. beta = 0.05: 87.6%

In [5]:
# SGD logistic model
batch_size = 128 # 128 best
beta = 0.005 # 0.005 best

graph = tf.Graph()
with graph.as_default():
    
    tf_train_dataset = tf.placeholder(tf.float32,shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)
    
    # Variables.
    weights = tf.Variable(tf.truncated_normal([image_size * image_size, num_labels]))
    biases = tf.Variable(tf.zeros([num_labels]))
    
    # Training computation.  
    logits = tf.matmul(tf_train_dataset, weights) + biases
    # Keyword arguments avoid depending on the positional order of logits and labels.
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=tf_train_labels))
    # PENALTY APPLIED TO SINGLE SET OF LOGISTIC WEIGHTS
    loss = loss + beta * tf.nn.l2_loss(weights)
    
    # Optimizer.
    optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
    
    # Predictions for the training, validation, and test data.
    train_prediction = tf.nn.softmax(logits)
    valid_prediction = tf.nn.softmax(tf.matmul(tf_valid_dataset, weights) + biases)
    test_prediction = tf.nn.softmax(tf.matmul(tf_test_dataset, weights) + biases)

In [6]:
num_steps = 3001

with tf.Session(graph=graph) as session:
    tf.initialize_all_variables().run()
    print("Initialized")
    for step in range(num_steps):
        # Define offset
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare dictionary
        feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
        
        _, l, predictions = session.run([optimizer, loss, train_prediction], feed_dict=feed_dict)
        
        if (step % 500 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(valid_prediction.eval(), valid_labels))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 33.813164
Minibatch accuracy: 10.2%
Validation accuracy: 12.3%
Minibatch loss at step 500: 1.684052
Minibatch accuracy: 78.1%
Validation accuracy: 78.2%
Minibatch loss at step 1000: 0.660732
Minibatch accuracy: 85.2%
Validation accuracy: 80.9%
Minibatch loss at step 1500: 0.870532
Minibatch accuracy: 81.2%
Validation accuracy: 80.9%
Minibatch loss at step 2000: 0.755892
Minibatch accuracy: 79.7%
Validation accuracy: 81.9%
Minibatch loss at step 2500: 0.624915
Minibatch accuracy: 85.9%
Validation accuracy: 82.0%
Minibatch loss at step 3000: 0.638544
Minibatch accuracy: 83.6%
Validation accuracy: 81.5%
Test accuracy: 88.5%


In [48]:
batch_size = 128
hidden_layer1_size = 1024
hidden_layer2_size = 305
hidden_lastlayer_size = 75

use_multilayers = True

regularization_meta=0.03


graph = tf.Graph()
with graph.as_default():

  # Input data. For the training data, we use a placeholder that will be fed
  # at run time with a training minibatch.
  tf_train_dataset = tf.placeholder(tf.float32,
                                    shape=(batch_size, image_size * image_size))
  tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
  tf_valid_dataset = tf.constant(valid_dataset)
  tf_test_dataset = tf.constant(test_dataset)
  
  # Variables.
  keep_prob = tf.placeholder(tf.float32)

  weights_layer1 = tf.Variable(
    tf.truncated_normal([image_size * image_size, hidden_layer1_size], stddev=0.0517))
  biases_layer1 = tf.Variable(tf.zeros([hidden_layer1_size]))

  if use_multilayers:
    weights_layer2 = tf.Variable(
      tf.truncated_normal([hidden_layer1_size, hidden_layer1_size], stddev=0.0441))
    biases_layer2 = tf.Variable(tf.zeros([hidden_layer1_size]))

    weights_layer3 = tf.Variable(
      tf.truncated_normal([hidden_layer1_size, hidden_layer2_size], stddev=0.0441))
    biases_layer3 = tf.Variable(tf.zeros([hidden_layer2_size]))
    
    weights_layer4 = tf.Variable(
      tf.truncated_normal([hidden_layer2_size, hidden_lastlayer_size], stddev=0.0809))
    biases_layer4 = tf.Variable(tf.zeros([hidden_lastlayer_size]))


  weights = tf.Variable(
    tf.truncated_normal([hidden_lastlayer_size if use_multilayers else hidden_layer1_size, num_labels], stddev=0.1632))
  biases = tf.Variable(tf.zeros([num_labels]))
  
    
  # get the NN models
  def getNN4Layer(dSet, use_dropout):
    # Hidden layer 1
    input_to_layer1 = tf.matmul(dSet, weights_layer1) + biases_layer1
    hidden_layer1_output = tf.nn.relu(input_to_layer1)

    # Hidden layer 2 (dropout on the previous activations during training only)
    logits_hidden1 = None
    if use_dropout:
      dropout_hidden1 = tf.nn.dropout(hidden_layer1_output, keep_prob)
      logits_hidden1 = tf.matmul(dropout_hidden1, weights_layer2) + biases_layer2
    else:
      logits_hidden1 = tf.matmul(hidden_layer1_output, weights_layer2) + biases_layer2

    hidden_layer2_output = tf.nn.relu(logits_hidden1)

    # Hidden layer 3
    logits_hidden2 = None
    if use_dropout:
      dropout_hidden2 = tf.nn.dropout(hidden_layer2_output, keep_prob)
      logits_hidden2 = tf.matmul(dropout_hidden2, weights_layer3) + biases_layer3
    else:
      logits_hidden2 = tf.matmul(hidden_layer2_output, weights_layer3) + biases_layer3

    hidden_layer3_output = tf.nn.relu(logits_hidden2)

    # Hidden layer 4
    logits_hidden3 = None
    if use_dropout:
      dropout_hidden3 = tf.nn.dropout(hidden_layer3_output, keep_prob)
      logits_hidden3 = tf.matmul(dropout_hidden3, weights_layer4) + biases_layer4
    else:
      logits_hidden3 = tf.matmul(hidden_layer3_output, weights_layer4) + biases_layer4

    hidden_layer4_output = tf.nn.relu(logits_hidden3)

    # Output layer
    logits = None
    if use_dropout:
      dropout_hidden4 = tf.nn.dropout(hidden_layer4_output, keep_prob)
      logits = tf.matmul(dropout_hidden4, weights) + biases
    else:
      logits = tf.matmul(hidden_layer4_output, weights) + biases

    return logits

  # get the NN models
  def getNN1Layer(dSet, use_dropout, w1, b1, w, b):
    input_to_layer1 = tf.matmul(dSet, w1) + b1
    hidden_layer1_output = tf.nn.relu(input_to_layer1)
        
    logits = None
    if use_dropout:
      dropout_hidden1 = tf.nn.dropout(hidden_layer1_output, keep_prob)
      logits = tf.matmul(dropout_hidden1, w) + b
    else:
      logits = tf.matmul(hidden_layer1_output, w) + b
    
    return logits

  
  
  # Training computation.
  logits = getNN4Layer(tf_train_dataset, True)  
  logits_valid = getNN4Layer(tf_valid_dataset, False)
  logits_test = getNN4Layer(tf_test_dataset, False)
    
  
  loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=tf_train_labels))
  #loss_l2 = loss + (regularization_meta * (tf.nn.l2_loss(weights)))
  
  global_step = tf.Variable(0)  # count the number of steps taken.
  learning_rate = tf.train.exponential_decay(0.3, global_step, 3500, 0.86, staircase=True)
  
    
  # Optimizer.
  optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)
  
  # Predictions for the training, validation, and test data.
  train_prediction = tf.nn.softmax(logits)
  valid_prediction = tf.nn.softmax(logits_valid)
  test_prediction = tf.nn.softmax(logits_test)

In [51]:
num_steps = 5001

with tf.Session(graph=graph) as session:
    tf.initialize_all_variables().run()
    print("Initialized")
    for step in range(num_steps):
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels, keep_prob:0.75}
        _, l, predictions = session.run([optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 500 == 0):
            print("Minibatch loss at step", step, ":", l)
            print("Minibatch accuracy: %.1f%%" % accuracy(train_prediction.eval(feed_dict={tf_train_dataset : batch_data, tf_train_labels : batch_labels, keep_prob:1.0}), batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(valid_prediction.eval(feed_dict={keep_prob:1.0}), valid_labels))
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(feed_dict={keep_prob:1.0}), test_labels))
    
# https://discussions.udacity.com/t/assignment-4-problem-2/46525/33

Initialized
Minibatch loss at step 0 : 2.45052
Minibatch accuracy: 20.3%
Validation accuracy: 14.4%
Minibatch loss at step 500 : 0.485938
Minibatch accuracy: 88.3%
Validation accuracy: 83.9%
Minibatch loss at step 1000 : 0.385559
Minibatch accuracy: 92.2%
Validation accuracy: 85.9%
Minibatch loss at step 1500 : 0.576546
Minibatch accuracy: 89.1%
Validation accuracy: 86.4%
Minibatch loss at step 2000 : 0.422729
Minibatch accuracy: 92.2%
Validation accuracy: 87.6%
Minibatch loss at step 2500 : 0.416127
Minibatch accuracy: 90.6%
Validation accuracy: 87.7%
Minibatch loss at step 3000 : 0.402459
Minibatch accuracy: 93.0%
Validation accuracy: 87.7%
Minibatch loss at step 3500 : 0.310197
Minibatch accuracy: 93.8%
Validation accuracy: 88.1%
Minibatch loss at step 4000 : 0.306649
Minibatch accuracy: 92.2%
Validation accuracy: 88.5%
Minibatch loss at step 4500 : 0.47194
Minibatch accuracy: 89.8%
Validation accuracy: 89.1%
Minibatch loss at step 5000 : 0.379968
Minibatch accuracy: 92.2%
Validatio

In [86]:
# SGD nnet model

import math
batch_size = 128

# depth / nodes
layer1_size = 1024
layer2_size = 305
layer3_size = 75

# regularization
regularize = False
layer1_beta = 0.001
layer2_beta = 0.001
layer3_beta = 0.001

# dropout
dropout = True
layer1_keep_prob = 0.7
layer2_keep_prob = 0.6
layer3_keep_prob = 0.5

# learning
learning = True
start_learn = 0.25
learn_decay = 0.75
learn_step = 1000
stair = True

seed = None  # Set to None for random seed.

graph = tf.Graph()
with graph.as_default():
    
    tf_train_dataset = tf.placeholder(tf.float32,shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)
    
    # initialize variables
    #keep_prob = tf.placeholder(tf.float32)
    
    ## layer 1
    layer1_weights = tf.Variable(tf.truncated_normal([image_size * image_size, layer1_size],
                                                 stddev = math.sqrt(2.0/(image_size*image_size))))
    layer1_biases = tf.Variable(tf.zeros([layer1_size]))
    
    ## layer 2
    layer2_weights = tf.Variable(tf.truncated_normal([layer1_size, layer2_size],
                                                 stddev = math.sqrt(2.0/layer1_size)))
    layer2_biases = tf.Variable(tf.zeros([layer2_size]))
    
    ## layer 3
    layer3_weights = tf.Variable(tf.truncated_normal([layer2_size, layer3_size],
                                                 stddev = math.sqrt(2.0/layer2_size)))
    layer3_biases = tf.Variable(tf.zeros([layer3_size]))

    ## output layer - softmax linear
    out_weights = tf.Variable(tf.truncated_normal([layer3_size, num_labels],
                                                  stddev=math.sqrt(2.0/layer3_size)))
    out_biases = tf.Variable(tf.zeros([num_labels]))
    
    ## forward pass; dropout is applied only when `dropout` is True
    def forward_prop(in_data, dropout):
        
        layer1_output = tf.nn.relu(tf.matmul(in_data, layer1_weights) + layer1_biases)
        layer1_logits = None
        if dropout:
            layer1_dropout = tf.nn.dropout(layer1_output, layer1_keep_prob)
            layer1_logits = tf.matmul(layer1_dropout, layer2_weights) + layer2_biases
        else:
            layer1_logits = tf.matmul(layer1_output, layer2_weights) + layer2_biases
        
        layer2_output = tf.nn.relu(layer1_logits)
        layer2_logits = None
        if dropout:
            layer2_dropout = tf.nn.dropout(layer2_output, layer2_keep_prob)
            layer2_logits = tf.matmul(layer2_dropout, layer3_weights) + layer3_biases
        else:
            layer2_logits = tf.matmul(layer2_output, layer3_weights) + layer3_biases
        
        layer3_output = tf.nn.relu(layer2_logits)
        layer3_logits = None
        if dropout:
            layer3_dropout = tf.nn.dropout(layer3_output, layer3_keep_prob)
            layer3_logits = tf.matmul(layer3_dropout, out_weights) + out_biases
        else:
            layer3_logits = tf.matmul(layer3_output, out_weights) + out_biases
            
        return layer3_logits
    
    # Loss
    logits_train = forward_prop(tf_train_dataset, True)
    logits_valid = forward_prop(tf_valid_dataset, False)
    logits_test = forward_prop(tf_test_dataset, False)
    
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits_train, labels=tf_train_labels))
    if regularize:
        layer1_l2_loss = layer1_beta * tf.nn.l2_loss(layer1_weights)
        layer2_l2_loss = layer2_beta * tf.nn.l2_loss(layer2_weights)
        layer3_l2_loss = layer3_beta * tf.nn.l2_loss(layer3_weights)
        loss = loss + layer1_l2_loss + layer2_l2_loss + layer3_l2_loss

    # Learning rate
    if learning: 
        global_step = tf.Variable(0)  # count the number of steps taken.
        learning_rate = tf.train.exponential_decay(start_learn, global_step, learn_step, learn_decay, staircase=stair)
        optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)
    else:
        optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

    # Predictions for the training, validation, and test data.
    train_prediction = tf.nn.softmax(logits_train)
    valid_prediction = tf.nn.softmax(logits_valid)
    test_prediction = tf.nn.softmax(logits_test)

In [87]:
num_steps = 6001

with tf.Session(graph=graph) as session:
    tf.initialize_all_variables().run()
    print("Initialized")
    for step in range(num_steps):
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        
        # Prepare a dictionary 
        feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
        
        _, l, predictions = session.run([optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 500 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            #if learning: print("Learning rate at step %d: %f" % (step, learning_rate.eval()))
            print("Minibatch accuracy: %.1f%%" % accuracy(
                    train_prediction.eval(feed_dict={tf_train_dataset : batch_data, 
                                                     tf_train_labels : batch_labels}), batch_labels))
            print("Validation accuracy: %.1f%%\n" % accuracy(valid_prediction.eval(), valid_labels))
            
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 2.619740
Minibatch accuracy: 18.8%
Validation accuracy: 14.8%

Minibatch loss at step 500: 0.591181
Minibatch accuracy: 82.0%
Validation accuracy: 84.1%

Minibatch loss at step 1000: 0.438579
Minibatch accuracy: 91.4%
Validation accuracy: 85.5%

Minibatch loss at step 1500: 0.515658
Minibatch accuracy: 84.4%
Validation accuracy: 86.4%

Minibatch loss at step 2000: 0.550223
Minibatch accuracy: 85.9%
Validation accuracy: 86.6%

Minibatch loss at step 2500: 0.497638
Minibatch accuracy: 87.5%
Validation accuracy: 87.0%

Minibatch loss at step 3000: 0.408930
Minibatch accuracy: 88.3%
Validation accuracy: 87.4%

Minibatch loss at step 3500: 0.379106
Minibatch accuracy: 88.3%
Validation accuracy: 87.8%

Minibatch loss at step 4000: 0.335615
Minibatch accuracy: 89.1%
Validation accuracy: 88.0%

Minibatch loss at step 4500: 0.440812
Minibatch accuracy: 85.9%
Validation accuracy: 88.2%

Minibatch loss at step 5000: 0.345681
Minibatch accuracy: 90.6%
Validati

---
Problem 2
---------
Let's demonstrate an extreme case of overfitting. Restrict your training data to just a few batches. What happens?

---

The idea here is to overtrain (3001 steps) on small batches, which causes overfitting. In some of these runs, training accuracy hit 100%.

Batch size:
1. sgd logistic model w/out regularization and batch_size:
    a. 100: 86.3% (-2% vs. 0.005 beta, 128 batch_size)
    b. 10: 82.5% (-6% vs. 0.005 beta, 128 batch_size)
    c. 1: 76.9% (-12% vs. 0.005 beta, 128 batch_size)
2. sgd nnet model w/ 0.005 beta and batch_size:
    a. 128: 91.5% (-0.3% vs. 0.005 beta, 128 batch_size)
    b. 100: 91.3% (-0.5% vs. 0.005 beta, 128 batch_size)
    c. 50: 89.7% (-2% vs. 0.005 beta, 128 batch_size)
3. sgd nnet model w/out regularization and batch_size:
    a. 128: 
    b. 100: 86% (-6% vs. 0.005 beta, 128 batch_size)
    c. 50: 89.7% (-2% vs. 0.005 beta, 128 batch_size)
    d. 10: 57.9% (-34% vs. 0.005 beta, 128 batch_size)
    e. 1: error
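
The runs above shrink `batch_size`. The problem statement's other reading, cycling over only a few fixed minibatches, could be sketched as follows. This is a minimal illustration, not the code used for the numbers above; `num_batches` and the two helper functions are hypothetical names introduced here.

```python
batch_size = 128
num_batches = 3          # hypothetical: how many distinct minibatches to allow
train_size = 200000      # size of the training set loaded earlier

def restricted_offset(step):
    # Cycle through only the first `num_batches` minibatches.
    return (step % num_batches) * batch_size

def full_offset(step):
    # Offset formula used in the training loops of this notebook.
    return (step * batch_size) % (train_size - batch_size)

restricted = sorted({restricted_offset(s) for s in range(3001)})
full = {full_offset(s) for s in range(3001)}

print(restricted)   # [0, 128, 256]
print(len(full))    # many more distinct offsets than the restricted case
```

Swapping `restricted_offset(step)` in for the `offset` line of a training loop means the model sees the same 384 examples over and over, which is the extreme overfitting setup the problem asks about.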

---
Problem 3
---------
Introduce Dropout on the hidden layer of the neural network. Remember: Dropout should only be introduced during training, not evaluation, otherwise your evaluation results would be stochastic as well. TensorFlow provides `nn.dropout()` for that, but you have to make sure it's only inserted during training.

What happens to our extreme overfitting case?

---

The idea here is that dropout should reduce overfitting, which shows up most clearly when overworking a smaller dataset.

Dropout:
1. sgd nnet model w/ regularization:
    a. 128 batch, 0.005 beta: 90.9% (-0.5% vs. 0.005 beta, 128 batch_size)
    c. 100 batch, 0.005 beta: 90.5% (-0.8% vs. 0.005 beta, 100 batch_size)
    d. 50 batch, 0.005 beta: 88.8% (-1% vs. 0.005 beta, 50 batch_size)
    e. 20 batch error
    f. 10 batch error
2. sgd nnet model w/out regularization:
    a. 128 batch, 0.0 beta: 87.3%
        + -4% vs. 0.005 beta, 128 batch_size, w/out dropout
        + -3% vs. 0.005 beta, 128 batch_size, w/ dropout
    b. 100 batch, 0.0 beta: 85%
        + -1% vs. 0 beta, 100 batch_size, w/out dropout
        + -5% vs. 0 beta, 100 batch_size, w/ dropout
    c. 50 batch, error
    d. 10 batch, error
    
So, the gist seems to be that dropout acts as a regularizer: it reduces overfitting, particularly when you're using a powerful model on a smaller dataset. However, the results seem to indicate that dropout and L2 regularization are somewhat redundant.
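
For intuition about why dropout must be training-only, here is a NumPy sketch of the behaviour `tf.nn.dropout` documents: each element is kept with probability `keep_prob` and scaled by `1 / keep_prob`, otherwise set to zero. The `dropout` function, the seed, and the all-ones activations are assumptions made for this illustration, not TensorFlow's implementation.

```python
import numpy as np

def dropout(activations, keep_prob, rng):
    # Inverted dropout: zero each unit with probability 1 - keep_prob and
    # scale the survivors by 1 / keep_prob so the expected value is unchanged.
    mask = rng.uniform(size=activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.RandomState(0)                   # fixed seed, for a repeatable sketch
hidden = np.ones((4, 1000), dtype=np.float32)    # hypothetical hidden activations

train_out = dropout(hidden, 0.5, rng)   # training: stochastic output
eval_out = hidden                       # evaluation: dropout is skipped entirely

print(np.unique(train_out))   # only 0.0 and 2.0 appear
print(train_out.mean())       # close to 1.0 on average
```

Because the training output is random, applying the same op at evaluation time would make validation and test accuracy noisy, which is why the graphs above build separate forward passes with dropout disabled.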
    

---
Problem 4
---------

Try to get the best performance you can using a multi-layer model! The best reported test accuracy using a deep network is [97.1%](http://yaroslavvb.blogspot.com/2011/09/notmnist-dataset.html?showComment=1391023266211#c8758720086795711595).

One avenue you can explore is to add multiple layers.

Another one is to use learning rate decay:

    global_step = tf.Variable(0)  # count the number of steps taken.
    learning_rate = tf.train.exponential_decay(0.5, global_step, ...)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)
 
---


Optimization hyper-parameters:
1. model
    a. all: neural net
2. SGD vs. GD
    a. all: SGD
3. activation function
    a. all: relu
4. batch size
    a. 128
5. number of steps
    a. 5001
    f. 6001
6. layers
    a. 3 (1024, 305, 75)
    b. 3 (500, 1000, 300)
    c. 3 (392, 1176, 235)
    d. 3 (1024, 305, 75)
7. regularization (beta)
    a. no
    d. yes, (0.005, 0.005, 0.005)
    e. yes, (0.001, 0.001, 0.001)
    j. yes, (0.001, 0.001, 0.001)
    k. no
8. dropout
    a. no
    g. yes, (0.5, 0.5, 0.5)
    h. yes, (0.5, 0.6, 0.7)
    i. yes, (0.7, 0.6, 0.5)
    j. yes, (0.7, 0.6, 0.5)
    k. yes, (0.7, 0.6, 0.5)
9. learning rate decay (initial rate, decay rate, decay steps, staircase)
    a. no
    k. yes (0.25, 0.75, 1000, True)
    l. yes (0.25, 0.96, 100, True)
    m. yes (0.1, 0.96, 100, True)
    n. yes (0.5, 0.96, 1000, True)
    o. yes (0.3, 0.95, 500, True)
    p. yes (0.3, 0.95, 500, False)
10. score
    a. 93.3%
    b. 93.3%
    c. 93% - nodes/layer don't make a lot of difference
    d. 86.7% - regularization of 0.005 had negative effect.
    e. 92.9% - regularization of 0.001 still negative effect. model still learning? 
    f. 92.5% - increased steps. even worse. regularization not valuable here.
    g. 93.4% - dropout (0.5) has negligible effect
    h. 94% - graduated dropout. wonder if it's related to shape
    i. 94.3% - declining dropout. dropout valuable. 
    j. 92.9% - declining dropout w/ regularization. regularization not valuable. 
    k. 94.8% - declining dropout w/ learning (initial rate: 0.25, decay rate: 0.75, decay steps: 1000). 
    Given formula: decayed_learning_rate = learning_rate * decay_rate ^ (global_step / decay_steps)
    
    1000 0.25 * 0.75^(1000/1000) = 0.188
    ...
    6000 0.25 * 0.75^(6000/1000) = 0.044
    
    l. 94.2% - (initial rate: 0.25, decay rate: 0.96, decay steps: 100, stair: True)
    
    1000 0.25 * 0.96^(1000/100) = 0.166
    ...
    6000 0.25 * 0.96^(6000/100) = 0.022
    
    m. 92.9% - (0.1, 0.96, 100, True). 
    
    1000: 0.1 * 0.96^(1000/100) = 0.0665
    ...
    6000: 0.1 * 0.96^(6000/100) = 0.0086
    
    Here the learning rate starts small and decays to a very small value, so it probably needs to start higher.
    
    n. 94.6% - (0.5, 0.96, 1000, True)
    
    1000: 0.5 * 0.96^(1000/1000) = 0.48
    ...
    6000: 0.5 * 0.96^(6000/1000) = 0.39
    
    o. 94.7% - (0.3, 0.95, 500, True)
    
    1000: 0.3 * 0.95^(1000/500) = 0.27
    ...
    6000: 0.3 * 0.95^(6000/500) = 0.16
    
    p. % - (0.25, 0.75, 1000, False)
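
The hand calculations in these notes can be double-checked with a few lines of plain Python. This is a sketch of the decay formula quoted in the notes; `decayed_learning_rate` is a hypothetical helper name, and the staircase branch assumes integer division of `global_step` by `decay_steps`, which is how `staircase=True` is documented for `tf.train.exponential_decay`.

```python
def decayed_learning_rate(initial_rate, decay_rate, decay_steps, global_step, staircase=True):
    # staircase=True uses integer division, so the rate drops in discrete steps.
    if staircase:
        exponent = global_step // decay_steps
    else:
        exponent = global_step / float(decay_steps)
    return initial_rate * decay_rate ** exponent

# (initial rate, decay rate, decay steps) for a few of the runs listed in the notes.
schedules = [(0.25, 0.75, 1000), (0.25, 0.96, 100), (0.1, 0.96, 100), (0.3, 0.95, 500)]

for initial_rate, decay_rate, decay_steps in schedules:
    for step in (1000, 6000):
        rate = decayed_learning_rate(initial_rate, decay_rate, decay_steps, step)
        print("%s step %d: %.4f" % ((initial_rate, decay_rate, decay_steps), step, rate))
```

With `staircase=False` the exponent is fractional, so the rate decays smoothly between multiples of `decay_steps` instead of in jumps.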
    
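    So, the winner at 3 layers, with 94.8%, was run k: no regularization, declining dropout (0.7, 0.6, 0.5), and learning rate decay (0.25, 0.75, 1000, True).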
    
