# Neural Networks on TensorFlow

*The material for this lab session was inspired by [Google's Deep Learning course on Udacity](https://www.udacity.com/course/deep-learning--ud730).*

**Introduction.** This tutorial extends the previous one to use stochastic gradient descent (SGD). Most of the steps are identical, so we avoid repetition and <font color="red">comment only on the parts that are new</font>, which are marked in red.

**Goals.** In this tutorial, we will train a Multinomial Logistic Regressor with stochastic gradient descent.

**Import packages.**

In [None]:
import tensorflow as tf
import os, struct
import numpy as np
import matplotlib.pyplot as plt
%matplotlib notebook
from sklearn.datasets import fetch_mldata

**Load the data.**

In [None]:
# Load the dataset
mnist_dict = fetch_mldata("MNIST original")

# Get the data
X_all = mnist_dict['data']
Y_all = mnist_dict['target']
print("Shape of X_all: ", X_all.shape)
print("Shape of Y_all: ", Y_all.shape)

# Get the number of classes (10) and the dimensionality of the input data (28*28)
num_classes = len(np.unique(Y_all))
num_pixels = X_all.shape[1]
print("Number of classes: ", num_classes)
print("Number of pixels: ", num_pixels)

**Split into training/test/validation sets.** We will split the data into a training set of 60000 images, a test set of 8000 images, and a validation set of 2000 images.

In [None]:
# Number of observations in each group
n_training_cases = 60000
n_test_cases = 8000

# Create a permutation vector to shuffle the images
perm = np.random.permutation(X_all.shape[0])

# Split into training/test/validation
X_train = X_all[perm[:n_training_cases], :].astype(np.float32)
Y_train = Y_all[perm[:n_training_cases], None].astype(np.int32)

X_test = X_all[perm[n_training_cases:n_training_cases+n_test_cases], :].astype(np.float32)
Y_test = Y_all[perm[n_training_cases:n_training_cases+n_test_cases], None].astype(np.int32)

X_val = X_all[perm[n_training_cases+n_test_cases:], :].astype(np.float32)
Y_val = Y_all[perm[n_training_cases+n_test_cases:], None].astype(np.int32)

print("Shape of X_train: ", X_train.shape, "\t Shape of Y_train: ", Y_train.shape)
print("Shape of X_test: ", X_test.shape, "\t Shape of Y_test: ", Y_test.shape)
print("Shape of X_val: ", X_val.shape, "\t Shape of Y_val: ", Y_val.shape)

**<font color='red'>[Only if your computer is slow]</font> Create a smaller dataset.** You can choose the number of images to keep. We will first try with the full $60000$ training instances.

In [None]:
# Optionally take a subset of the training data to speed up the computation
train_subset = 60000   # <--- We are using the full dataset. You can use a smaller number of images

X_train_small = X_train[:train_subset, :]  # First "train_subset" images:
Y_train_small = Y_train[:train_subset, :]  # We don't have to choose a random set because images have already been shuffled

print("Shape of X_train_small: ", X_train_small.shape, "\t Shape of Y_train_small: ", Y_train_small.shape)

## The Computational Graph

**Accuracy.**

In [None]:
def accuracy(predictions, labels):
    # Return % of correctly classified images
    return (100.0 * np.sum(np.argmax(predictions, axis=1) == np.squeeze(labels)) / predictions.shape[0])
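
As a quick sanity check of `accuracy`, the sketch below applies it to a toy example. The function is repeated so that the snippet runs on its own, and the arrays `toy_predictions` and `toy_labels` are made-up values used only for illustration.

```python
import numpy as np

def accuracy(predictions, labels):
    # Same definition as above: % of rows whose arg-max matches the label
    return (100.0 * np.sum(np.argmax(predictions, axis=1) == np.squeeze(labels)) / predictions.shape[0])

# Made-up class probabilities for 4 images and 3 classes (illustration only)
toy_predictions = np.array([[0.7, 0.2, 0.1],
                            [0.1, 0.8, 0.1],
                            [0.3, 0.3, 0.4],
                            [0.6, 0.3, 0.1]])

# Labels stored as a column vector, like Y_train above
toy_labels = np.array([[0], [1], [2], [1]])

# Predicted classes are 0, 1, 2, 0, so 3 out of 4 images are correct
toy_accuracy = accuracy(toy_predictions, toy_labels)
print("Toy accuracy: %.1f%%" % toy_accuracy)
```

The `np.squeeze` call matters here: comparing a vector of shape `(4,)` with a column of shape `(4, 1)` would broadcast to a `(4, 4)` matrix, and the count of correct images would be wrong.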

<font color='red'>**(a) Input data.**</font> In stochastic gradient descent, we choose a new minibatch of data at each iteration. Thus, the dataset effectively "changes" between iterations. For that reason, we need *placeholders*: we do not know the value of the minibatch of training data at this point. Roughly, a placeholder says: "right now I can't tell you the value of this variable, but I can guarantee that I will tell you the value whenever you need it later."
```python
    # (a) Input data
    #     The training minibatch is fed through placeholders; the validation
    #     and test data are loaded into constants that are attached to the graph
    tf_train_data = tf.placeholder(tf.float32, shape=(batch_size, num_pixels))
    tf_train_labels = tf.placeholder(tf.int32, shape=(batch_size,))
    tf_test_data = tf.constant(X_test)
    tf_val_data = tf.constant(X_val)
```

**(b) Variables.** This block doesn't change.

In [None]:
def create_weight(shape):
    # creates and initializes a weight matrix of the specified size
    return tf.Variable( tf.truncated_normal(shape, stddev=0.01) )

def create_bias(shape):
    # creates and initializes a bias term of the specified size
    return tf.Variable( tf.constant(0.1, shape=shape) )

```python
    # (b) Variables
    #     Indicate the parameters that we need to infer
    weights = create_weight( [num_pixels, num_classes] )
    biases = create_bias( [num_classes] )
```

**(c) Computations.** This doesn't change. (Well, it changes only conceptually: the loss is now averaged over a minibatch, as if we had a dataset with fewer observations.)


In [None]:
def all_nn_computations(X, weights, biases):
    return tf.matmul(X, weights) + biases

```python
    # (c) Computations
    #     Indicate the computations that we want to perform with the variables and data
    train_logits = all_nn_computations(tf_train_data, weights, biases)
    # Named arguments avoid any ambiguity about the order of logits and labels
    loss = tf.reduce_mean( tf.nn.sparse_softmax_cross_entropy_with_logits(logits=train_logits, labels=tf_train_labels) )
```
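
To see what this loss computes on a minibatch, here is a minimal NumPy sketch of a mean softmax cross-entropy with integer labels. It is not TensorFlow's implementation; the helper name `sparse_softmax_cross_entropy` and the toy arrays are assumptions made for illustration.

```python
import numpy as np

def sparse_softmax_cross_entropy(logits, labels):
    # Subtract the row-wise maximum for numerical stability
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    # Log of the softmax probabilities, one row per observation
    log_probs = shifted - np.log(np.sum(np.exp(shifted), axis=1, keepdims=True))
    # For each row, pick the log-probability of its true class and negate it
    return -log_probs[np.arange(logits.shape[0]), labels]

# Toy minibatch: 3 observations and 4 classes (made-up numbers)
toy_logits = np.array([[2.0, 1.0, 0.1, 0.0],
                       [0.0, 0.0, 0.0, 0.0],
                       [1.0, 3.0, 0.2, 0.5]])
toy_labels = np.array([0, 2, 1])

# One loss value per observation, then the average over the minibatch
per_example = sparse_softmax_cross_entropy(toy_logits, toy_labels)
minibatch_loss = np.mean(per_example)
print(per_example, minibatch_loss)
```

The second row has identical logits for all four classes, so its loss equals $\log 4$, the loss of a uniform guess. Averaging over the rows is what makes the minibatch loss a noisy estimate of the loss on the full training set.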

**<font color="red">(d) Optimization.</font>** We are going to find the minimum of this loss using SGD. In plain SGD, the step size must be decreased over the iterations to ensure convergence. Instead of designing that schedule by hand, we use a method such as RMSProp or Adam, which adapts the step size of each parameter automatically from the history of its gradients. You don't need to know any further details about what these methods do; just keep in mind that they take care of the step size for us. We still have to choose a base learning rate, which is typically very small (in this case, $10^{-5}$). Here, we will use the Adam method.
```python
    # (d) Optimizer
    #     Indicate the optimization procedure that we want to use
    optimizer = tf.train.AdamOptimizer(learning_rate=1.0e-5).minimize(loss)
```
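
For the curious, the following NumPy sketch shows one update of the standard Adam rule, to make the idea of an automatically adapted step size concrete. The function `adam_step` and the toy values are hypothetical, and TensorFlow's internal implementation may differ in details; the default values of `beta1`, `beta2` and `epsilon` are the commonly used ones.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, learning_rate=1.0e-5,
              beta1=0.9, beta2=0.999, epsilon=1.0e-8):
    # Exponential moving averages of the gradient and of its square
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    # Bias correction, needed because m and v start at zero
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    # Each parameter gets its own step, scaled by its gradient history
    theta = theta - learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)
    return theta, m, v

# Toy parameters whose gradients have very different magnitudes
theta = np.array([1.0, -2.0, 0.5])
grad = np.array([10.0, -0.001, 0.3])
m = np.zeros_like(theta)
v = np.zeros_like(theta)

# First iteration (t=1)
new_theta, m, v = adam_step(theta, grad, m, v, t=1)
step_taken = theta - new_theta
print(step_taken)
```

On the first iteration every parameter moves by roughly `learning_rate` in the direction of its gradient's sign, regardless of the gradient's magnitude. This normalization is why a single small learning rate can work for all parameters.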

**(e) Other tasks.** This doesn't change.
```python
    # (e) Other tasks
    #     Compute predictions on training, test, and validation
    train_prediction = tf.nn.softmax( train_logits )
    val_prediction = tf.nn.softmax( all_nn_computations(tf_val_data, weights, biases) )
    test_prediction = tf.nn.softmax( all_nn_computations(tf_test_data, weights, biases) ) 
```

**All together.**

In [None]:
# Choose the minibatch size (i.e., the number of datapoints that we will
# use at each iteration of gradient descent)
batch_size = 200

# Create the computational graph
graph_MLR_SGD = tf.Graph()

with graph_MLR_SGD.as_default():
    # (a) Input data
    #     The training minibatch is fed through placeholders; the validation
    #     and test data are loaded into constants that are attached to the graph
    tf_train_data = tf.placeholder(tf.float32, shape=(batch_size, num_pixels))
    tf_train_labels = tf.placeholder(tf.int32, shape=(batch_size,))
    tf_test_data = tf.constant(X_test)
    tf_val_data = tf.constant(X_val)
    
    # (b) Variables
    #     Indicate the parameters that we need to infer
    weights = create_weight( [num_pixels, num_classes] )
    biases = create_bias( [num_classes] )
    
    # (c) Computations
    #     Indicate the computations that we want to perform with the variables and data
    train_logits = all_nn_computations(tf_train_data, weights, biases)
    # Named arguments avoid any ambiguity about the order of logits and labels
    loss = tf.reduce_mean( tf.nn.sparse_softmax_cross_entropy_with_logits(logits=train_logits, labels=tf_train_labels) )
    
    # (d) Optimizer
    #     Indicate the optimization procedure that we want to use
    optimizer = tf.train.AdamOptimizer(learning_rate=1.0e-5).minimize(loss)
    
    # (e) Other tasks
    #     Compute predictions on training, test, and validation
    train_prediction = tf.nn.softmax( train_logits )
    val_prediction = tf.nn.softmax( all_nn_computations(tf_val_data, weights, biases) )
    test_prediction = tf.nn.softmax( all_nn_computations(tf_test_data, weights, biases) )

## The Session

**<font color="red">Stochastic gradient descent.</font>** We will run 20000 SGD steps.
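
To put this number in perspective, the short calculation below converts SGD steps into the equivalent number of passes over the training set. It assumes the values used in this tutorial (a minibatch of 200 images and the full set of 60000 training images); since minibatches are drawn at random, this is an average rather than an exact count per image.

```python
# Values used in this tutorial
max_iterations = 20000
batch_size = 200
train_subset = 60000

# Total number of images processed during training
images_processed = max_iterations * batch_size

# Equivalent number of passes (epochs) over the training set
equivalent_epochs = images_processed / float(train_subset)
print("Images processed: %d" % images_processed)
print("Equivalent epochs: %.1f" % equivalent_epochs)
```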

**<font color="red">Minibatches of data.</font>** At each iteration of SGD, we need to subsample a new minibatch of data. We do that using an auxiliary function.

In [None]:
def next_minibatch(X_, Y_, batch_size):
    # Shuffle all row indices and keep the first batch_size of them
    # (i.e., sample batch_size distinct images at random)
    perm = np.random.permutation(X_.shape[0])
    perm = perm[:batch_size]
    # Generate the minibatch
    X_batch = X_[perm, :]
    Y_batch = Y_[perm, :]
    # Return the images and the labels
    return X_batch, Y_batch
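
A common alternative, which is not the approach used in this tutorial, is to shuffle the data once per epoch and walk through it in consecutive chunks, so that every image is visited exactly once per epoch. The sketch below illustrates the idea; the generator `epoch_minibatches` and the toy arrays are hypothetical names introduced for this example.

```python
import numpy as np

def epoch_minibatches(X_, Y_, batch_size, rng):
    # Shuffle the row indices once for this epoch
    perm = rng.permutation(X_.shape[0])
    # Walk through the shuffled indices in consecutive chunks
    for start in range(0, X_.shape[0], batch_size):
        idx = perm[start:start + batch_size]
        yield X_[idx, :], Y_[idx, :]

# Toy dataset: 10 observations with 2 features and column-vector labels
rng = np.random.RandomState(0)
toy_X = np.arange(20, dtype=np.float32).reshape(10, 2)
toy_Y = np.arange(10, dtype=np.int32)[:, None]

# Collect the minibatches of one epoch
batches = list(epoch_minibatches(toy_X, toy_Y, batch_size=4, rng=rng))
batch_sizes = [xb.shape[0] for xb, _ in batches]
seen_labels = sorted(int(y) for _, yb in batches for y in yb[:, 0])
print(batch_sizes, seen_labels)
```

Note that the last minibatch can be smaller when the dataset size is not a multiple of `batch_size`. With placeholders of fixed shape `(batch_size, num_pixels)`, such a batch would have to be dropped; with 60000 images and a minibatch of 200 the issue does not arise.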

**Plotting function.**

In [None]:
def plot_weights(weights):
    plt.figure()
    for j in range(num_classes):
        # Create and choose subplot
        ax = plt.subplot(1,num_classes,j+1)
        # Obtain the weights corresponding to class j
        weights_j = weights[:,j]
        # Reshape
        weights_reshaped = np.reshape(weights_j,(28, 28))
        # Plot
        ax.imshow(weights_reshaped, cmap=plt.get_cmap('Greys'))
        plt.axis('off')
        plt.title('digit #'+str(j), fontsize=7.0)
    plt.show()

**<font color="red">Steps in the code.</font>** The only differences now are that:

- At each iteration of SGD, we need to obtain a new minibatch of data.

- This minibatch is passed to the placeholders through the `feed_dict` parameter of the `run()` method.

In [None]:
max_iterations = 20000

with tf.Session(graph=graph_MLR_SGD) as session:

    # 1. Initialize the weights and biases. This is a one-time operation
    tf.global_variables_initializer().run()
    print("Initialized")
    
    # 2. Run iterations of SGD
    for step in range(max_iterations):
        
        # Pick a subset of the datapoints at random
        X_batch, Y_batch = next_minibatch(X_train_small, Y_train_small, batch_size)
        
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = { tf_train_data   : X_batch,
                      tf_train_labels : np.squeeze(Y_batch)}
        
        # Run the computations
        _, l, predictions = session.run( [optimizer, loss, train_prediction], feed_dict=feed_dict )
        
        # Report every 500 iterations
        if (step % 500 == 0):
            # Print the loss
            print("Minibatch loss at step %d: %f" % (step, l))
            # Obtain and print the accuracy on the current minibatch
            print(" +Minibatch accuracy: %.1f%%" % accuracy(predictions, Y_batch))
            # Obtain and print the accuracy on the validation set
            print(" +Validation accuracy: %.1f%%" % accuracy(val_prediction.eval(), Y_val))

    # 3. Accuracy on the test set
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), Y_test))


**<font color="red">[Questions]</font>** Run the code above.

1. Is the resulting accuracy on the test set better than it was in the previous tutorial? Why/Why not?

2. Which line(s) of code would you need to change or add if you wanted to regularize the unobserved parameters?