# Neural Networks on TensorFlow

*The material for this lab session was inspired by [Google's Deep Learning course on Udacity](https://www.udacity.com/course/deep-learning--ud730).*

**Introduction.** Neural networks and deep learning currently provide the best solutions to many problems in image recognition, speech recognition, and natural language processing. Neural networks provide an information processing paradigm that is loosely inspired by the way biological nervous systems, such as the brain, process information. They are composed of a large number of highly interconnected processing elements (neurones) that work together to solve specific problems. 

**Goals.** In this tutorial, we will train a Multinomial Logistic Regressor with gradient descent.

**Packages.** We will use [TensorFlow](https://www.tensorflow.org). We import the package with

```python
import tensorflow as tf
```

**Data.** We will use MNIST as an example. MNIST is widely used as a testbed for neural networks, and it is still used in research publications. It contains 70000 images of handwritten digits, each of which belongs to one of ten possible classes (the digits 0 to 9). Each image is of size $28\times 28$, i.e., $784$ pixels in total.

Thus, in contrast to other lab sessions, here we will consider *multiclass classification*. That is, we will predict the probability of each of the 10 different classes (digits).

## Loading and Preprocessing the Data

MNIST is so common that it has been included in [`scikit-learn`](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_mldata.html#sklearn.datasets.fetch_mldata). Just use
```python
mnist_dict = fetch_mldata("MNIST original")
```
to load it.

**Import packages.**

In [None]:
import tensorflow as tf
import os, struct
import numpy as np
import matplotlib.pyplot as plt
%matplotlib notebook
from sklearn.datasets import fetch_mldata

**Load the data.**

In [None]:
# Load the dataset
mnist_dict = fetch_mldata("MNIST original")

# Get the data
X_all = mnist_dict['data']
Y_all = mnist_dict['target']
print("Shape of X_all: ", X_all.shape)
print("Shape of Y_all: ", Y_all.shape)

# Get the number of classes (10) and the dimensionality of the input data (28*28)
num_classes = len(np.unique(Y_all))
num_pixels = X_all.shape[1]
print("Number of classes: ", num_classes)
print("Number of pixels: ", num_pixels)

**Plot two of the images.**

In [None]:
# Choose two indexes
idx1 = 0       # <--- You can choose any index between 0 and 69999 (each corresponds to a different image)
idx2 = 40000   # <--- You can choose any index between 0 and 69999 (each corresponds to a different image)

# Plot the two images
plt.figure()
ax1 = plt.subplot(121)
ax1.imshow(np.reshape(X_all[idx1,:], (28,28)), cmap=plt.get_cmap('gray'))
ax2 = plt.subplot(122)
ax2.imshow(np.reshape(X_all[idx2,:], (28,28)), cmap=plt.get_cmap('gray'))
plt.show()

**Split into training/test/validation sets.** We will split the data into a training set of 60000 images, a test set of 8000 images, and a validation set of 2000 images.

In [None]:
# Number of observations in each group
n_training_cases = 60000
n_test_cases = 8000

# Create a permutation vector to shuffle the images
perm = np.random.permutation(X_all.shape[0])

# Split into training/test/validation
X_train = X_all[perm[:n_training_cases], :].astype(np.float32)
Y_train = Y_all[perm[:n_training_cases], None].astype(np.int32)

X_test = X_all[perm[n_training_cases:n_training_cases+n_test_cases], :].astype(np.float32)
Y_test = Y_all[perm[n_training_cases:n_training_cases+n_test_cases], None].astype(np.int32)

X_val = X_all[perm[n_training_cases+n_test_cases:], :].astype(np.float32)
Y_val = Y_all[perm[n_training_cases+n_test_cases:], None].astype(np.int32)

print("Shape of X_train: ", X_train.shape, "\t Shape of Y_train: ", Y_train.shape)
print("Shape of X_test: ", X_test.shape, "\t Shape of Y_test: ", Y_test.shape)
print("Shape of X_val: ", X_val.shape, "\t Shape of Y_val: ", Y_val.shape)
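
The permutation-based split can be checked on a small synthetic array. The sketch below is only illustrative: the array `X_demo`, the group sizes, and the fixed seed are made-up values for this example and are not part of the lab. It verifies that the three index sets are disjoint and together cover every row exactly once.

```python
import numpy as np

rng = np.random.RandomState(0)          # fixed seed so the shuffle is reproducible (assumption)
X_demo = np.arange(20).reshape(10, 2)   # 10 toy "images" with 2 "pixels" each
n_train_demo, n_test_demo = 6, 3        # the remaining row becomes the validation set

# Shuffle the row indexes once, then cut the permutation into three pieces
perm_demo = rng.permutation(X_demo.shape[0])
train_idx = perm_demo[:n_train_demo]
test_idx = perm_demo[n_train_demo:n_train_demo + n_test_demo]
val_idx = perm_demo[n_train_demo + n_test_demo:]

# The three index sets are disjoint and together cover every row exactly once
all_idx = np.concatenate([train_idx, test_idx, val_idx])
print(sorted(all_idx.tolist()) == list(range(10)))
print(len(train_idx), len(test_idx), len(val_idx))
```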

**Also create a smaller dataset.** We will not apply any special preprocessing to the data. However, we will initially consider a smaller subset of the training set, with only $1000$ observations, in order to speed up the computations. (We will return to the full $60000$ training instances later, when we consider stochastic gradient descent.)

In [None]:
# For now, we take a subset of the training data to speed-up the computation
train_subset = 1000   # number of observations to keep

X_train_small = X_train[:train_subset, :]  # First 1000 images:
Y_train_small = Y_train[:train_subset, :]  # We don't have to choose a random set because images have already been shuffled

print("Shape of X_train_small: ", X_train_small.shape, "\t Shape of Y_train_small: ", Y_train_small.shape)

## Multinomial Logistic Regression (MLR) with TensorFlow

**Introduction.** We will implement a neural network without hidden layers. This model is more commonly known as multinomial logistic regression, and it is a multiclass generalization of logistic regression. Like logistic regression, it has weights and biases. However, we now have one weight vector $w_j$ and one bias $b_j$ for each class $j$. While in logistic regression we use the sigmoid function to model the probability of the outputs, in multinomial logistic regression we use the softmax function. Thus, the probability (likelihood) of the $j$-th label for datapoint $n$ is given by
$$
p(y_n=j\;|\; x_n, \{w_j, b_j\})=\frac{\text{e}^{(w^\top_j x_n + b_j)}}{\sum_{j^\prime=1}^{J} \text{e}^{(w^\top_{j^\prime} x_n + b_{j^\prime})}},\qquad j=1,\ldots,J
$$
The softmax function converts a vector of real numbers into a probability vector. In this case, these real numbers are the "strength" of the prediction for each class. For instance, if $w^\top_1 x_n + b_1$ is very large, then class $1$ will be the most likely. If $w^\top_1 x_n + b_1$ and $w^\top_2 x_n + b_2$ are equal but they are much larger than the rest, then classes 1 and 2 will be equally likely (with probabilities close to $0.5$ each), while the rest of the classes will have probabilities very close to zero. TensorFlow incorporates a `softmax` function that passes its argument through the softmax function in a numerically stable manner and returns a set of probabilities (one for each class).
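
To make the behaviour described above concrete, here is a minimal NumPy sketch of a row-wise softmax. The helper `softmax_rows` and the score values are hypothetical names and numbers chosen for this example. Subtracting the row maximum before exponentiating is one common way to obtain numerical stability; it is an assumption about the technique, not a description of TensorFlow's internals.

```python
import numpy as np

def softmax_rows(z):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp."""
    z = np.asarray(z, dtype=np.float64)
    shifted = z - np.max(z, axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=1, keepdims=True)

# Two equal, dominant scores followed by much smaller ones
scores = np.array([[10.0, 10.0, 0.0, -5.0]])
probs = softmax_rows(scores)
print(probs)        # the first two entries are equal and close to 0.5
print(probs.sum())  # each row sums to 1 (up to rounding)

# Very large scores would overflow a naive exp(z) / sum(exp(z));
# the shifted version stays finite
big = softmax_rows(np.array([[1000.0, 1000.0, 0.0]]))
print(big)
```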

The goal is to find the weights $w_j$ and biases $b_j$ such that the probability of the observed outcomes in the training set is maximized. As you know, this corresponds to *maximum likelihood (ML) estimation*.

**Accuracy.** We define a very simple auxiliary function to compute the accuracy, which we define as the *percentage of correctly classified images*. The auxiliary function takes as input the predictions (either the arguments of the softmax or the output of the softmax) and the observed labels, and it simply computes the percentage of correctly classified images.

In [None]:
def accuracy(predictions, labels):
    # Return % of correctly classified images
    return (100.0 * np.sum(np.argmax(predictions, axis=1) == np.squeeze(labels)) / predictions.shape[0])
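
As a quick sanity check, the function can be applied to a handful of made-up predictions. The arrays `toy_predictions` and `toy_labels` below are hypothetical; the definition of `accuracy` is repeated only so that the example runs on its own. Because the logarithm is monotone, passing log-scores instead of probabilities leaves the `argmax`, and therefore the accuracy, unchanged.

```python
import numpy as np

def accuracy(predictions, labels):
    # Same definition as above, repeated so the example is self-contained
    return (100.0 * np.sum(np.argmax(predictions, axis=1) == np.squeeze(labels)) / predictions.shape[0])

# Made-up probabilities for 4 images and 3 classes
toy_predictions = np.array([[0.7, 0.2, 0.1],
                            [0.1, 0.8, 0.1],
                            [0.3, 0.3, 0.4],
                            [0.6, 0.3, 0.1]])
toy_labels = np.array([[0], [1], [2], [1]])   # column vector, like Y_train

acc_probs = accuracy(toy_predictions, toy_labels)
acc_logs = accuracy(np.log(toy_predictions), toy_labels)
print(acc_probs)   # 3 of the 4 images are classified correctly
print(acc_logs)    # same value: the argmax is unchanged by a monotone transform
```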

### The Computational Graph

**Introduction.** In TensorFlow, you first describe the computation that you would like to see performed. That is, you describe the inputs, the variables, and the operations. In this description, no actual computation is performed. This is like writing down the model on a piece of paper: you just specify which model you are using, but you do not compute anything yet. In TensorFlow, this is known as the "computational graph." Each variable is a "node" on this graph, and the edges of the computational graph connect nodes that interact together in some operation.
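
The idea of describing a computation first and running it later can be illustrated with a few lines of plain Python. The `Node` class below is a hypothetical toy written for this example; it is not TensorFlow's API and says nothing about how TensorFlow is implemented. It only shows that building the graph and evaluating it are two separate steps.

```python
class Node:
    """A toy graph node: stores an operation and its inputs, computes nothing yet."""

    def __init__(self, op, inputs=(), value=None):
        self.op = op
        self.inputs = inputs
        self.value = value

    def run(self):
        # Evaluation happens only here, by recursively evaluating the inputs
        if self.op == "const":
            return self.value
        args = [node.run() for node in self.inputs]
        if self.op == "add":
            return args[0] + args[1]
        if self.op == "mul":
            return args[0] * args[1]
        raise ValueError("unknown op: " + self.op)

# Describe y = w * x + b; no arithmetic happens in these four lines
x = Node("const", value=3.0)
w = Node("const", value=2.0)
b = Node("const", value=0.5)
y = Node("add", inputs=(Node("mul", inputs=(w, x)), b))

result = y.run()   # the computation is performed only now
print(result)
```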

For now, all we need to know is that the definition of the computational graph appears under the block starting with
```python
with graph_name.as_default():
```
Within this block, we will call the function `all_nn_computations()` that we define next. This function simply takes the weights and biases and performs the operation $x_n^\top w_j + b_j$ for each class $j$.

The computational graph consists of steps (a)-(e), which we cover below.

**(a) Input data.** As input, we need the training images and their labels, the validation images, and the test images. `tf.constant` creates a tensor whose value is fixed, so it will not change during optimization.

```python
    # (a) Input data
    #     Load the training, validation and test data into constants that are
    #     attached to the graph
    tf_train_data = tf.constant(X_train_small)  # use the small dataset
    tf_train_labels = tf.constant(np.squeeze(Y_train_small))
    tf_test_data = tf.constant(X_test)
    tf_val_data = tf.constant(X_val)
```

**(b) Variables.** The variables are the weights $w_j$, which we group in a $D\times J$ matrix (number of pixels $\times$ number of classes), and the biases $b_j$, which we group in a $J$-dimensional vector. Since we will do gradient descent, we need an initial guess for these variables. We define auxiliary functions to create and initialize the weights and biases. We follow the typical approach in neural networks to initialize the variables:

- Initialize the weights randomly (using a truncated Gaussian distribution with small standard deviation).

- Initialize the biases to some small positive constant.

In [None]:
def create_weight(shape):
    # creates and initializes a weight matrix of the specified size
    return tf.Variable( tf.truncated_normal(shape, stddev=0.01) )

def create_bias(shape):
    # creates and initializes a bias term of the specified size
    return tf.Variable( tf.constant(0.1, shape=shape) )

Thus, block (b) becomes

```python
    # (b) Variables
    #     Indicate the parameters that we need to infer
    weights = create_weight( [num_pixels, num_classes] )
    biases = create_bias( [num_classes] )
```
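
For intuition about this initialization, the sketch below mimics it in NumPy. The helper `truncated_normal_np` is a hypothetical name; it assumes the usual truncation rule of re-drawing any value that falls more than two standard deviations from the mean, and the fixed seed is only there for reproducibility.

```python
import numpy as np

def truncated_normal_np(shape, stddev=0.01, seed=0):
    """Sample N(0, stddev^2) and re-draw any value farther than 2*stddev from 0."""
    rng = np.random.RandomState(seed)
    samples = rng.normal(0.0, stddev, size=shape)
    too_far = np.abs(samples) > 2.0 * stddev
    while np.any(too_far):
        # Replace only the offending entries with fresh draws
        samples[too_far] = rng.normal(0.0, stddev, size=int(np.sum(too_far)))
        too_far = np.abs(samples) > 2.0 * stddev
    return samples

demo_weights = truncated_normal_np((784, 10))   # same shape as the weights in (b)
demo_biases = np.full((10,), 0.1)               # small positive constant

print(demo_weights.shape, demo_biases.shape)
print(np.abs(demo_weights).max() <= 0.02)       # all weights lie within 2 standard deviations
```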

**(c) Computations.** The computations of a neural network with no hidden layers are simple: multiply the inputs by the weights and add the biases. (We should additionally pass the result through the softmax function, but we will do this last computation later in the code.) We create a function for this.

In [None]:
def all_nn_computations(X, weights, biases):
    return tf.matmul(X, weights) + biases

We also compute the **loss function** here. We need the log-probability of the *observed* class $\hat{y}_n$ for every observation in our dataset, and we need to sum (or average) these log-probabilities over all data points. These operations are so common that TensorFlow has a function for them, called `sparse_softmax_cross_entropy_with_logits`. This function returns the negative log-probabilities (under the softmax) of the *observed* classes, one per observation. Averaging them with `reduce_mean` produces
$$\mathcal{L} = - \frac{1}{N}\sum_{n=1}^N \log p(y_n=\hat{y}_n\;|\; x_n, W, b)$$
(TensorFlow works with the negative log-likelihood and *minimizes* this objective, which is equivalent to maximizing the likelihood.)



Thus, block (c) becomes:
```python
    # (c) Computations
    #     Indicate the computations that we want to perform with the variables and data
    train_logits = all_nn_computations(tf_train_data, weights, biases)
    loss = tf.reduce_mean( tf.nn.sparse_softmax_cross_entropy_with_logits(logits=train_logits, labels=tf_train_labels) )
```
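
The quantity computed in block (c) can be reproduced in NumPy on a toy problem. This is a minimal sketch: the function `mean_sparse_softmax_cross_entropy` and all the `_toy` arrays are hypothetical names and made-up numbers, and the code is meant to illustrate the formula for $\mathcal{L}$, not to mirror TensorFlow's implementation.

```python
import numpy as np

def mean_sparse_softmax_cross_entropy(logits, labels):
    """Average of -log softmax(logits)[n, labels[n]] over the N rows."""
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - logits.max(axis=1, keepdims=True)   # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    n = logits.shape[0]
    # Pick the log-probability of the observed class in each row
    return -np.mean(log_probs[np.arange(n), labels])

# Toy problem: N=3 observations, D=4 "pixels", J=2 classes
X_toy = np.array([[1.0, 0.0, 2.0, 0.0],
                  [0.0, 1.0, 0.0, 1.0],
                  [1.0, 1.0, 1.0, 1.0]])
W_toy = np.zeros((4, 2))
b_toy = np.zeros(2)
y_toy = np.array([0, 1, 1])

logits_toy = X_toy.dot(W_toy) + b_toy    # shape (N, J), as in all_nn_computations
loss_toy = mean_sparse_softmax_cross_entropy(logits_toy, y_toy)
print(logits_toy.shape)
print(loss_toy)   # equals log(2): with all-zero parameters both classes have probability 0.5
```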

**(d) Optimization.** We need to indicate that we want to minimize the loss using gradient descent. The optimizer requires that we specify the step size (also called learning rate). Importantly, **we do not need to provide the gradient**. TensorFlow will compute the gradients for us. We don't need to do anything else. This is called *automatic differentiation*.
```python
    # (d) Optimizer
    #     Indicate the optimization procedure that we want to use
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.5).minimize(loss)
```
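
To see what the optimizer spares us from, the sketch below writes out the gradient of the mean negative log-likelihood by hand in NumPy, checks one entry against a finite difference, and runs a few plain gradient descent steps with the same step size. All names (`loss_and_grads`, the toy data, the seed, the 50 steps) are assumptions made for this example. TensorFlow obtains its gradients by automatic differentiation, not from this hand-derived formula.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def loss_and_grads(X, y, W, b):
    """Mean negative log-likelihood of softmax regression and its gradients."""
    n = X.shape[0]
    probs = softmax(X.dot(W) + b)
    loss = -np.mean(np.log(probs[np.arange(n), y]))
    delta = probs.copy()
    delta[np.arange(n), y] -= 1.0      # probabilities minus one-hot labels
    return loss, X.T.dot(delta) / n, delta.mean(axis=0)

rng = np.random.RandomState(0)         # fixed seed (assumption, for reproducibility)
X = rng.normal(size=(20, 5))           # 20 toy observations with 5 features
y = rng.randint(0, 3, size=20)         # labels from 3 classes
W = np.zeros((5, 3))
b = np.zeros(3)

# Check one entry of the analytic gradient against a central finite difference
loss0, grad_W, grad_b = loss_and_grads(X, y, W, b)
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[0, 0] += eps
W_minus[0, 0] -= eps
numeric = (loss_and_grads(X, y, W_plus, b)[0] - loss_and_grads(X, y, W_minus, b)[0]) / (2 * eps)
grad_ok = abs(numeric - grad_W[0, 0]) < 1e-6
print(grad_ok)

# Plain gradient descent with the step size used in (d)
for _ in range(50):
    _, grad_W, grad_b = loss_and_grads(X, y, W, b)
    W -= 0.5 * grad_W
    b -= 0.5 * grad_b
loss_final = loss_and_grads(X, y, W, b)[0]
print(loss0, loss_final)
```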

**(e) Other tasks.** You may want to create other variables for debugging purposes, or to obtain the performance on test/validation. Here, we obtain the predictions (the output of the softmax) for the training, testing, and validation sets. This will be useful later to print the performance during optimization.
```python
    # (e) Other tasks
    #     Compute predictions on training, test, and validation
    train_prediction = tf.nn.softmax( train_logits )
    val_prediction = tf.nn.softmax( all_nn_computations(tf_val_data, weights, biases) )
    test_prediction = tf.nn.softmax( all_nn_computations(tf_test_data, weights, biases) ) 
```

**All together.** Putting it all together, we get the following piece of code:

In [None]:
# Create a Tensorflow graph for multinomial logistic regression
graph_MLR = tf.Graph()

with graph_MLR.as_default():
    # (a) Input data
    #     Load the training, validation and test data into constants that are
    #     attached to the graph
    tf_train_data = tf.constant(X_train_small)  # use the small dataset
    tf_train_labels = tf.constant(np.squeeze(Y_train_small))
    tf_test_data = tf.constant(X_test)
    tf_val_data = tf.constant(X_val)
    
    # (b) Variables
    #     Indicate the parameters that we need to infer
    weights = create_weight( [num_pixels, num_classes] )
    biases = create_bias( [num_classes] )
    
    # (c) Computations
    #     Indicate the computations that we want to perform with the variables and data
    train_logits = all_nn_computations(tf_train_data, weights, biases)
    loss = tf.reduce_mean( tf.nn.sparse_softmax_cross_entropy_with_logits(logits=train_logits, labels=tf_train_labels) )
    
    # (d) Optimizer
    #     Indicate the optimization procedure that we want to use
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.5).minimize(loss)
    
    # (e) Other tasks
    #     Compute predictions on training, test, and validation
    train_prediction = tf.nn.softmax( train_logits )
    val_prediction = tf.nn.softmax( all_nn_computations(tf_val_data, weights, biases) )
    test_prediction = tf.nn.softmax( all_nn_computations(tf_test_data, weights, biases) )

### TensorFlow Session

**Introduction.** The code above specifies the computational graph. It does *not* compute anything. It only specifies the way in which variables interact with each other.

If we want to actually perform the optimization, we need to call the `run` method. Each time we call `run`, TensorFlow will execute the operations in the graph. The runtime operations are indicated in the block starting with
```python
with tf.Session(graph=graph_name) as session:
```

**Gradient descent.** Now, we will run 500 gradient descent steps using TensorFlow. Recall that for gradient descent, we need the gradient of the objective function. As we mentioned above, the greatest advantage of TensorFlow is that it automatically computes all the derivatives for us. We do not need to do the math ourselves.

*Note:* As a convergence criterion, we will simply stop after 500 iterations. TensorFlow also allows more complicated stopping criteria, such as stopping when the magnitude of the gradient is small. For simplicity, we will not do that here.
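
For reference, a gradient-magnitude stopping rule can be sketched outside TensorFlow in a few lines of NumPy. The function `gradient_descent`, the tolerance `tol`, and the quadratic test objective are all hypothetical choices for this illustration; the lab code itself keeps the fixed number of iterations.

```python
import numpy as np

def gradient_descent(grad_fn, w0, learning_rate=0.5, max_iterations=500, tol=1e-3):
    """Stop when the gradient norm drops below tol, or after max_iterations steps."""
    w = np.array(w0, dtype=np.float64)
    for step in range(max_iterations):
        g = grad_fn(w)
        if np.linalg.norm(g) < tol:
            return w, step            # converged before using all iterations
        w = w - learning_rate * g
    return w, max_iterations

# Toy objective 0.5 * ||w - target||^2, whose gradient is (w - target)
target = np.array([1.0, -2.0])
w_final, n_steps = gradient_descent(lambda w: w - target, w0=[0.0, 0.0])
print(n_steps, w_final)
```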

**Plotting function.** We use an auxiliary function for plotting the weights.

In [None]:
def plot_weights(weights):
    plt.figure()
    for j in range(num_classes):
        # Create and choose subplot
        ax = plt.subplot(1,num_classes,j+1)
        # Obtain the weights corresponding to class j
        weights_j = weights[:,j]
        # Reshape
        weights_reshaped = np.reshape(weights_j,(28, 28))
        # Plot
        ax.imshow(weights_reshaped, cmap=plt.get_cmap('Greys'))
        plt.axis('off')
        plt.title('digit #'+str(j), fontsize=7.0)
    plt.show()

**Steps in the code.** In the code below, we

1. Initialize the variables (weights and biases) using `initialize_all_variables()`. This might give a deprecation warning, depending on your TensorFlow version (more recent releases provide `tf.global_variables_initializer()` as its replacement).

2. Write a `for` loop for the gradient descent algorithm. At each step, we call `session.run()` to perform one optimization step. Every 100 iterations, we compute the loss function and the accuracy on the training and validation sets.

3. Compute the performance on the test set.

4. Plot the weight matrices for each class.

In [None]:
max_iterations = 500

with tf.Session(graph=graph_MLR) as session:
    # 1. Initialize the weights and biases. This is a one-time operation
    tf.initialize_all_variables().run()
    print('Initialized')
    
    # 2. Run iterations of gradient descent
    for step in range(max_iterations):
        # Run the computations. We tell .run() that we want to run the optimizer
        _, l, predictions = session.run([optimizer, loss, train_prediction])
        
        # Report every 100 iterations
        if (step % 100 == 0):
            # Print the loss
            print('Loss at step %d: %f' % (step, l))
            # Obtain and print the accuracy on the training set
            print('  +Training accuracy: %.1f%%' % accuracy(predictions, Y_train[:train_subset, :]))
            # Obtain and print the accuracy on the validation set
            print('  +Validation accuracy: %.1f%%' % accuracy(val_prediction.eval(), Y_val))
            
    # 3. Accuracy on the test set
    print('Test accuracy: %.1f%%' % accuracy(test_prediction.eval(), Y_test))
    
    # 4. Plot the weights
    plot_weights(weights.eval())

**[Questions]** Run the code above.

1. What happens to the training accuracy? Does it differ from the test accuracy? Why?

2. Look at the resulting plots of the weights. What do they represent? Why do they look like actual digits?