# Creating a deep learning Architecture

*Author*: Frank Fichtenmueller <br>
*Goal*: Sample implementation to learn the syntax of TensorFlow<br>
*Date*: 12/05/2017

<hr>
By stacking multiple layers, the goal is to let the network learn 2-D spatial features that improve the accuracy of the prediction.

Building on top of [2017-05-12-ff-NeuralNetwork](http://localhost:8891/notebooks/Model_Implementations/2017-05-12-ff-NeuralNetwork.ipynb), we now exploit the 2-D layout of the images by using a convolutional neural network, which compresses the input and learns spatial features that help distinguish the harder-to-decipher parts of the data.

Architecture: <br>
- A convolutional layer learns on spatial subsets (patches) of the image and, over time, generalizes to 2-D feature maps that respond to specific digit shapes.
- A pooling layer then compresses these feature maps into a smaller set of patterns. Max pooling has no trainable parameters of its own; it acts as a bottleneck that keeps the model from overfitting the specifics and improves generalization.
- [convolution, pooling] is repeated twice. The second combination learns patterns in the arrangement of the features found by the first combination, and therefore learns more abstract patterns.
- The output is then fed into a fully connected layer, whose weights and biases learn to combine the individual features into classification results.
- 10 individual neurons, one per digit class, feed a softmax function for multi-class classification, which turns their outputs into class probabilities and sharpens the separation between high-valued and low-valued predictions.
- The last step implements the loss function, which measures the prediction error, and starts backpropagation: the gradients of the loss are used to adjust the weights and bias terms of the fully connected layer and are then passed back to the layer before it. This continues through all layers.

Reduce Overfitting: <br>
- Our model has enough degrees of freedom to learn all relevant features of the training data perfectly. The risk of overfitting to sample-specific details is therefore high.
- We use dropout on the fully connected layer to force the classifier to learn distributed submodels on the same data, so that it does not rely too much on the presence of specific features (nodes).

In [2]:
import tensorflow as tf

In [3]:
# Get Data
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz


In [4]:
sess = tf.InteractiveSession()

In [5]:
# Define the placeholders for MNIST input data
x = tf.placeholder(tf.float32, shape=[None, 784])
y_ = tf.placeholder(tf.float32, [None, 10])

# Reshape each flattened 784-vector into a 28x28 image with 1 channel
# (the -1 lets TensorFlow infer the batch size)
x_image = tf.reshape(x, [-1, 28,28,1], name='x_image')
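
The reshape above regroups each flat row of 784 values into a 28x28 grid in row-major order. A minimal pure-Python sketch of that regrouping for a single image, assuming a hypothetical helper name `to_grid` and made-up pixel values:

```python
def to_grid(flat_pixels, height=28, width=28):
    """Regroup a flat, row-major pixel list into `height` rows of `width` values."""
    if len(flat_pixels) != height * width:
        raise ValueError("expected %d values, got %d"
                         % (height * width, len(flat_pixels)))
    return [flat_pixels[row * width:(row + 1) * width] for row in range(height)]

# Toy image: each pixel value equals its flat index, so positions are easy to check.
flat = list(range(784))
grid = to_grid(flat)
print(len(grid), len(grid[0]))  # 28 28
print(grid[1][0])               # 28, the first pixel of the second row
```

The TensorFlow version additionally adds a trailing channel axis of size 1 and handles the whole batch at once.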

In [6]:
# We use ReLU as the activation function, so we initialize the weights with
# small random values and the biases with a small positive constant, so that
# the ReLU units are not zeroed out right away

def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)

def bias_variable(shape):
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)
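
`tf.truncated_normal` draws from a normal distribution and re-draws any value that lies more than two standard deviations from the mean. The following is a rough pure-Python sketch of that behaviour, not TensorFlow's implementation; the function name and the seed are illustrative assumptions:

```python
import random

def truncated_normal(count, stddev=0.1, mean=0.0, seed=None):
    """Draw `count` normal samples, re-drawing any that fall beyond 2 stddev."""
    rng = random.Random(seed)
    samples = []
    while len(samples) < count:
        value = rng.gauss(mean, stddev)
        if abs(value - mean) <= 2 * stddev:
            samples.append(value)
    return samples

# As many values as a [5,5,1,32] weight tensor holds.
weights = truncated_normal(5 * 5 * 1 * 32, stddev=0.1, seed=0)
print(len(weights))                         # 800
print(max(abs(w) for w in weights) <= 0.2)  # True
```

With `stddev=0.1`, every initial weight therefore lies in the range [-0.2, 0.2].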

We isolate the creation of the convolution and pooling layers in helper functions, so that we can set their parameters for the whole network in a single place.

- The convolution layers set the stride and the padding
- Max pooling sets the kernel size, which determines the size of the window we pool together

In [7]:
# Create functions to set up convolution and pooling layers for us
def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1,1,1,1], padding='SAME')

def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize=[1,2,2,1],
                         strides=[1,2,2,1], padding='SAME')
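
To make the pooling step concrete, here is a small pure-Python sketch of 2x2 max pooling with stride 2 on a single-channel grid. The helper name `max_pool_2x2_grid` and the toy values are assumptions for illustration; the sketch only handles grids with even height and width, where `'SAME'` padding adds nothing.

```python
def max_pool_2x2_grid(grid):
    """2x2 max pooling with stride 2 on a 2-D list with even height and width."""
    pooled = []
    for row in range(0, len(grid), 2):
        pooled_row = []
        for col in range(0, len(grid[0]), 2):
            window = [grid[row][col], grid[row][col + 1],
                      grid[row + 1][col], grid[row + 1][col + 1]]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled

toy = [[1, 3, 2, 0],
       [4, 2, 1, 1],
       [0, 0, 5, 6],
       [1, 2, 7, 8]]
print(max_pool_2x2_grid(toy))  # [[4, 2], [2, 8]]
```

Each 2x2 window is reduced to its largest value, so height and width are halved while the strongest response in each window survives.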

## Defining the Layers of the Neural Network

We initialize the layers and implement the architecture by setting the parameters of each layer.

### 1. Convolutional Layer

We convolve over patches of 5x5 pixels and learn 32 individual features, each with its own set of weights and an individual bias term.

- Therefore we create a 4-tensor weight matrix 'W_conv1': [5,5,1,32]
    - 5x5 patch size
    - 1 input channel (for greyscale)
    - 32 output features
- A 1-tensor bias variable 'b_conv1': [32]

In [14]:
W_conv1 = weight_variable([5,5,1,32])
b_conv1 = bias_variable([32])

# Do convolution on images, add bias and push through RELU activation
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
# Take the results and run them through max_pool
h_pool1 = max_pool_2x2(h_conv1)
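
The cell above applies a convolution, adds a bias, and passes the result through ReLU. As a sketch of what that means for one input channel and one feature, the following pure-Python function slides a kernel over a zero-padded image with stride 1 (a cross-correlation, which is what `tf.nn.conv2d` computes). The name `conv2d_same` and the toy values are assumptions; the padding logic assumes an odd kernel size.

```python
def conv2d_same(image, kernel, bias=0.0):
    """Single-channel, stride-1, zero-padded ('SAME') cross-correlation plus ReLU."""
    height, width = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    pad_top, pad_left = (k_h - 1) // 2, (k_w - 1) // 2
    output = []
    for row in range(height):
        out_row = []
        for col in range(width):
            total = bias
            for i in range(k_h):
                for j in range(k_w):
                    r, c = row + i - pad_top, col + j - pad_left
                    # Positions outside the image count as zero padding.
                    if 0 <= r < height and 0 <= c < width:
                        total += image[r][c] * kernel[i][j]
            out_row.append(max(0.0, total))  # ReLU
        output.append(out_row)
    return output

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
# Identity kernel: every output equals ReLU(pixel + bias).
identity = [[0, 0, 0],
            [0, 1, 0],
            [0, 0, 0]]
print(conv2d_same(image, identity, bias=-5.0))
# [[0.0, 0.0, 0.0], [0.0, 0.0, 1.0], [2.0, 3.0, 4.0]]
```

The output has the same height and width as the input, which is why the 28x28 images stay 28x28 after the convolution and only shrink in the pooling step.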

### 2. Convolutional Layer

This layer processes the output of layer 1 in 5x5 patches and returns 64 features, each with its own weights and bias term.

- Therefore we create a 4-tensor weight matrix 'W_conv2': [5,5,32,64]
    - 5x5 patch size
    - 32 input channels (features from layer 1)
    - 64 output features
- A 1-tensor bias variable 'b_conv2': [64]

In [15]:
# Process the 32 features from conv1 in 5x5 patches. Return 64 features, each with its own bias
W_conv2 = weight_variable([5,5,32,64])
b_conv2 = bias_variable([64])
# Do convolution on the output of layer 1. Pool results
h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
h_pool2 = max_pool_2x2(h_conv2)

### 3. Implement a fully connected Layer

This layer receives the 7x7 representation of each image, flattens it, and connects it to 1024 neurons. Their outputs are later passed to the readout layer, which classifies the labels 0-9.

- The input is a 7x7 feature map with 64 features per position (7*7*64 = 3136 values per image)
- The layer consists of 1024 neurons in total

In [16]:
# Implementing the Fully Connected Layer
W_fc1 = weight_variable([7*7*64, 1024])
b_fc1 = bias_variable([1024])

# Flatten the output of pooling layer 2 and use it as input to the fully connected layer
h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*64])
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)
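
The `7*7*64` in the cell above follows from the layer settings: with `'SAME'` padding, the output size along each spatial axis is `ceil(input / stride)`, so the stride-1 convolutions keep the size and each stride-2 pooling step halves it. A minimal sketch that traces the shapes through the network (the function names are hypothetical; the layer sizes are the ones used in this notebook):

```python
import math

def same_output_size(size, stride):
    """Output length along one axis for padding='SAME': ceil(size / stride)."""
    return math.ceil(size / stride)

def trace_shapes(height=28, width=28):
    """Return (layer name, output shape) pairs, ignoring the batch dimension."""
    shapes = [("input", (height, width, 1))]
    channels_per_block = [32, 64]
    for idx, channels in enumerate(channels_per_block, start=1):
        # Convolution with stride 1 keeps height and width.
        height, width = same_output_size(height, 1), same_output_size(width, 1)
        shapes.append(("conv%d" % idx, (height, width, channels)))
        # 2x2 max pooling with stride 2 halves height and width.
        height, width = same_output_size(height, 2), same_output_size(width, 2)
        shapes.append(("pool%d" % idx, (height, width, channels)))
    shapes.append(("flatten", (height * width * channels_per_block[-1],)))
    shapes.append(("fully_connected", (1024,)))
    shapes.append(("readout", (10,)))
    return shapes

for name, shape in trace_shapes():
    print(name, shape)
# input (28, 28, 1)
# conv1 (28, 28, 32)
# pool1 (14, 14, 32)
# conv2 (14, 14, 64)
# pool2 (7, 7, 64)
# flatten (3136,)
# fully_connected (1024,)
# readout (10,)
```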

As this very powerful model can easily overfit the comparatively small dataset we use for training, we apply dropout to the fully connected layer before passing the results to the classification output.

In [17]:
keep_prob = tf.placeholder(tf.float32)
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)
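
`tf.nn.dropout` keeps each activation with probability `keep_prob` and scales the survivors by `1 / keep_prob`, so the expected sum of the activations stays the same. A rough pure-Python sketch of that idea, with a hypothetical function name and made-up activations:

```python
import random

def dropout(activations, keep_prob, rng=random):
    """Keep each value with probability keep_prob; scale survivors by 1/keep_prob."""
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("keep_prob must be in (0, 1]")
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

rng = random.Random(0)  # fixed seed, only to make the sketch repeatable
dropped = dropout([1.0] * 10000, keep_prob=0.5, rng=rng)
print(sorted(set(dropped)))         # [0.0, 2.0]
print(sum(dropped) / len(dropped))  # close to 1.0, the original mean

# keep_prob=1.0 disables dropout, which is what evaluation uses.
print(dropout([0.3, 0.7], keep_prob=1.0))  # [0.3, 0.7]
```

This is why `keep_prob` is a placeholder: training feeds 0.5, while the accuracy evaluations feed 1.0.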

### 4. Implementing the 'Readout Layer'

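This layer takes the 1024 values from the fully connected layer and computes one score (logit) per class. The softmax inside the loss function turns these scores into probability statements about the class prediction.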

In [18]:
# Implementing the Layer
W_fc2 = weight_variable([1024, 10])
b_fc2 = bias_variable([10])

# Defining the model
y_conv = tf.matmul(h_fc1_drop, W_fc2) + b_fc2
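
With all four weight layers defined, a quick count of the trainable parameters shows where the model's capacity sits. This is plain arithmetic on the shapes used above; the helper names are hypothetical.

```python
def conv_params(patch_h, patch_w, in_channels, out_channels):
    """Weights of a convolution layer plus one bias per output feature."""
    return patch_h * patch_w * in_channels * out_channels + out_channels

def dense_params(inputs, outputs):
    """Weights of a fully connected layer plus one bias per output neuron."""
    return inputs * outputs + outputs

layers = [
    ("conv1", conv_params(5, 5, 1, 32)),
    ("conv2", conv_params(5, 5, 32, 64)),
    ("fc1", dense_params(7 * 7 * 64, 1024)),
    ("fc2", dense_params(1024, 10)),
]
for name, count in layers:
    print(name, count)
total = sum(count for _, count in layers)
print("total", total)
# conv1 832
# conv2 51264
# fc1 3212288
# fc2 10250
# total 3274634
```

Almost all parameters belong to the first fully connected layer, which is consistent with applying dropout there.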

Implementing the loss function that backpropagation will minimize

In [19]:
# Loss measurement
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
    logits=y_conv, labels=y_))

# loss optimization
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
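
The loss above applies a softmax to the logits and then measures the cross-entropy against the one-hot labels, averaged over the batch. A minimal pure-Python sketch of that computation, with hypothetical function names and made-up logits:

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the maximum avoids overflow without changing the result.
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_cross_entropy(logits, one_hot_label):
    """Negative log-probability that the softmax assigns to the true class."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(one_hot_label, probs) if t > 0)

def mean_cross_entropy(batch_logits, batch_labels):
    """Average loss over a batch, like tf.reduce_mean over the per-example losses."""
    losses = [softmax_cross_entropy(l, t)
              for l, t in zip(batch_logits, batch_labels)]
    return sum(losses) / len(losses)

label = [1] + [0] * 9  # one-hot label for digit 0
undecided = [0.0] * 10  # all classes equally likely
confident = [10.0] + [0.0] * 9  # strong score for the true class

print(round(softmax_cross_entropy(undecided, label), 4))  # 2.3026, i.e. ln(10)
print(softmax_cross_entropy(confident, label) < 0.001)    # True
```

An untrained network that spreads its probability evenly over the 10 digits therefore starts near a loss of ln(10), and the loss approaches 0 as the correct class receives most of the probability.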

Defining the accuracy calculation

In [20]:
# A prediction is correct when the class with the highest score matches the label
correct_prediction = tf.equal(tf.argmax(y_conv, 1), tf.argmax(y_,1))
# Accuracy is the fraction of correct predictions
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
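
The same accuracy logic can be sketched in plain Python: take the index of the largest logit, compare it with the index of the 1 in the one-hot label, and average the matches. Function names and toy values are assumptions for illustration.

```python
def argmax(values):
    """Index of the largest value (the first one if there are ties)."""
    return max(range(len(values)), key=lambda i: values[i])

def batch_accuracy(batch_logits, batch_one_hot_labels):
    """Fraction of examples whose predicted class matches the label."""
    correct = [argmax(logits) == argmax(label)
               for logits, label in zip(batch_logits, batch_one_hot_labels)]
    return sum(correct) / len(correct)

logits = [[2.0, 0.1, -1.0],
          [0.3, 0.2, 0.9],
          [1.5, 1.7, 0.0],   # predicts class 1, but the label is class 0
          [0.0, 0.0, 3.0]]
labels = [[1, 0, 0],
          [0, 0, 1],
          [1, 0, 0],
          [0, 0, 1]]
print(batch_accuracy(logits, labels))  # 0.75
```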

In [21]:
# Initialize all of the variables
sess.run(tf.global_variables_initializer())

Training the model

In [26]:
# Set variables to control the training iterations
import time
num_steps = 3000
display_every = 100

# Training Loop
start_time = time.time()
end_time = time.time()

for i in range(num_steps):
    batch = mnist.train.next_batch(50)
    train_step.run(feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5})
    
    # Periodic status display
    if i%display_every == 0:
        train_accuracy = accuracy.eval(feed_dict= {
            x:batch[0], y_: batch[1], keep_prob: 1.0})
        end_time = time.time()
        print("step {0}, elapsed time {1: .2f} seconds, training accuracy {2: .3f}%".
              format(i, end_time-start_time, train_accuracy* 100))

step 0, elapsed time  0.24 seconds, training accuracy  22.000%
step 100, elapsed time  17.84 seconds, training accuracy  84.000%
step 200, elapsed time  35.40 seconds, training accuracy  92.000%
step 300, elapsed time  52.90 seconds, training accuracy  86.000%
step 400, elapsed time  70.40 seconds, training accuracy  94.000%
step 500, elapsed time  87.91 seconds, training accuracy  86.000%
step 600, elapsed time  105.41 seconds, training accuracy  98.000%
step 700, elapsed time  122.80 seconds, training accuracy  94.000%
step 800, elapsed time  140.26 seconds, training accuracy  94.000%
step 900, elapsed time  157.66 seconds, training accuracy  92.000%
step 1000, elapsed time  175.04 seconds, training accuracy  98.000%
step 1100, elapsed time  192.58 seconds, training accuracy  96.000%
step 1200, elapsed time  210.09 seconds, training accuracy  98.000%
step 1300, elapsed time  227.57 seconds, training accuracy  98.000%
step 1400, elapsed time  245.09 seconds, training accuracy  96.000%
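
Two pieces of arithmetic help when reading this log. The sketch below assumes the standard 55,000-image training split that `read_data_sets` provides; the batch size and step count are taken from the training cell.

```python
batch_size = 50
num_steps = 3000
train_examples = 55000  # assumption: default MNIST training split size

examples_seen = batch_size * num_steps
epochs = examples_seen / train_examples
accuracy_resolution = 100.0 / batch_size  # percentage points per example

print(examples_seen)        # 150000
print(round(epochs, 2))     # 2.73
print(accuracy_resolution)  # 2.0
```

Under that assumption, the run passes over the training set a little less than three times. Because each reported training accuracy comes from a single batch of 50 images, it can only change in steps of 2 percentage points, which explains the jumpy values in the log.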

In [27]:
# Display summary
end_time = time.time()
print('Total training time for {0} batches: {1:.2f} seconds'.format(i+1, end_time-start_time))

# Accuracy on the test set
print("Test accuracy {0:.3f}%".format(accuracy.eval(feed_dict={
    x: mnist.test.images, y_: mnist.test.labels, keep_prob: 1.0
})*100))

Total training time for 3000 batches: 525.46 seconds
Test accuracy 98.100%
