# Tutorial on deep learning using Keras and TensorFlow

TensorFlow (2015) and Keras (2015) are gradient-based optimization libraries (e.g. deep learning libraries) developed at Google.

Why and when to use TensorFlow?

- TensorFlow automatically computes the gradient of complex functions (such as multi-layered or recurrent neural networks) and performs gradient-based optimization with state-of-the-art optimizers (e.g. stochastic gradient descent, the Adam optimizer, etc.).
- It seamlessly runs the computations on one or several GPUs (up to 50x faster than on a CPU).
- It is ideal for training deep networks (or even shallow networks) on small or large datasets.

Why and when to use Keras?

- Keras is a high-level library (coarser 'bricks') running on top of TensorFlow (or Theano).
- It simplifies the code (a lot) when creating classical networks.
- When using Keras, one can always customize the network more finely by returning to TensorFlow.
- Use Keras whenever you would use TensorFlow, and return to TensorFlow to customize the network.

Alternative deep learning libraries: among many others, PyTorch (developed by Facebook) and Theano (MILA group, will be discontinued!).


In [1]:
# import modules
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
%matplotlib inline

## Multilayer Perceptron

A multilayer perceptron (MLP) is the most basic form of artificial neural network. Inputs are transformed into outputs through multiple layers of feedforward processing.

For an input vector $\mathbf{x}$, a simple 1-hidden-layer MLP will convert the input to an output vector $\mathbf{y}$, through a hidden layer of neurons $\mathbf{h}$. For simplicity, we will ignore the bias terms.

\begin{eqnarray}
\mathbf{h} &=& f(W_h\mathbf{x}) \\
\mathbf{y} &=& W_y\mathbf{h}
\end{eqnarray}

$W_h, W_y$ are the weight matrices, and $f(\cdot)$ is the activation function for the hidden layer. The trainable parameters will be collectively referred to as the parameters $\theta$.

Our goal is to find parameters ($W_h, W_y$) that bring the predicted output $\mathbf{y}$ close to the target output $\hat{y}$. We define the loss as the mean squared error, where $\langle\cdot\rangle$ denotes an average over samples:
\begin{equation}
L = \langle(\mathbf{y}-\hat{y})^2\rangle
\end{equation}


## Stochastic gradient descent
We will use (variants of) stochastic gradient descent to decrease the loss. At each step, we will evaluate the loss using a minibatch of inputs.
\begin{equation}
\tilde{L} = \langle(\mathbf{y}_i-\hat{y}_i)^2\rangle_i, \quad i = 1, \ldots, \mathrm{batch\ size}
\end{equation}

We move the parameters in the opposite direction of the gradient:

\begin{equation}
\Delta \theta =-\alpha \frac{\partial \tilde{L}}{\partial \theta},
\end{equation}
where $\alpha$ is the learning rate.
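
As an illustration of this update rule, here is a minimal NumPy sketch of a few gradient descent steps on the one-hidden-layer MLP defined above, with the gradients derived by hand. The layer sizes, the seed, the learning rate `alpha` and the helper name `forward_and_grads` are assumptions made for this example. Samples are stored as rows, so the code computes `x @ W_h` rather than $W_h\mathbf{x}$, and for simplicity it reuses one fixed minibatch instead of drawing a new one at each step.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, n_x, n_h, n_y = 5, 10, 100, 3   # assumed sizes

x = rng.random((batch_size, n_x))           # one fixed minibatch of inputs
y_target = rng.random((batch_size, n_y))    # random targets
W_h = rng.standard_normal((n_x, n_h)) * np.sqrt(2. / (n_x + n_h))
W_y = rng.standard_normal((n_h, n_y)) * np.sqrt(2. / (n_h + n_y))

def forward_and_grads(x, y_target, W_h, W_y):
    """Return the minibatch loss and its gradients w.r.t. W_h and W_y."""
    a = x @ W_h                      # hidden pre-activation
    h = np.maximum(a, 0.)            # ReLU
    y = h @ W_y                      # linear readout
    err = y - y_target
    loss = np.mean(err ** 2)
    d_y = 2. * err / err.size        # dL/dy
    grad_W_y = h.T @ d_y             # dL/dW_y
    d_a = (d_y @ W_y.T) * (a > 0.)   # backpropagate through the ReLU
    grad_W_h = x.T @ d_a             # dL/dW_h
    return loss, grad_W_h, grad_W_y

alpha = 0.01                         # learning rate
losses = []
for step in range(10):
    loss, grad_W_h, grad_W_y = forward_and_grads(x, y_target, W_h, W_y)
    losses.append(loss)
    # move the parameters against the gradient
    W_h -= alpha * grad_W_h
    W_y -= alpha * grad_W_y

print(losses[0], losses[-1])
```

Libraries such as TensorFlow compute the quantities `grad_W_h` and `grad_W_y` automatically, which is what makes deeper architectures practical.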


## Multilayer perceptron in Numpy

In [None]:
# Numpy way
batch_size = 5
n_x = 10
n_h = 100
n_y = 3
x_np = np.random.rand(batch_size, n_x)
def relu(x):
    return x * (x>0.)

W_h_np = np.random.randn(n_x, n_h) * np.sqrt(2./(n_x + n_h))
W_y_np = np.random.randn(n_h, n_y) * np.sqrt(2./(n_h + n_y))

h_np = relu(np.dot(x_np, W_h_np))
y_np = np.dot(h_np, W_y_np)

y_target_np = np.random.rand(batch_size, n_y)
loss_np = np.mean((y_target_np - y_np)**2)

## Multilayer perceptron in TensorFlow

In [None]:
# Tensorflow way
tf.reset_default_graph()
x = tf.placeholder(tf.float32, (batch_size, n_x))

# Define variables
W_h = tf.get_variable('W_h', shape=(n_x, n_h), initializer=tf.constant_initializer(W_h_np))
W_y = tf.get_variable('W_y', shape=(n_h, n_y), initializer=tf.constant_initializer(W_y_np))

# Build the computational graph
h = tf.nn.relu(tf.matmul(x, W_h))
y = tf.matmul(h, W_y)

print(y)

sess = tf.Session()
# with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
y_val = sess.run(y, feed_dict={x: x_np})
sess.close()
# This should be (batch_size, n_y)
print(y_val)    

plt.scatter(y_val, y_np)

## Multilayer perceptron in Keras

In [None]:
# Keras way

import keras
from keras.layers import Dense
from keras.models import Sequential
keras.backend.clear_session()

model = Sequential()
model.add(Dense(n_h, input_dim=n_x, activation='relu', use_bias=False))
model.add(Dense(n_y, activation=None, use_bias=False))
model.layers[0].set_weights([W_h_np])
model.layers[1].set_weights([W_y_np])

model.summary()

# Q0: Compare the outputs of the two networks
The NumPy network and the Keras network are identical. Verify it by feeding the NumPy network and the Keras network with 10 random Gaussian inputs. Compare the response of the two networks to these inputs. Are they the same?

Help: to obtain the output Y of the Keras network for an input X, use the function
Y = model.predict(X)
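
When comparing, keep in mind that Keras computes in 32-bit floating point by default while NumPy uses 64-bit, so the two outputs may agree only up to rounding error. The sketch below illustrates this with NumPy alone: it runs the same network in both precisions and compares the results with `np.allclose`. The helper `mlp_forward`, the seed and the tolerance `atol=1e-5` are assumptions for this example, and the float32 pass merely stands in for the Keras network.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_h, n_y = 10, 100, 3
W_h = rng.standard_normal((n_x, n_h)) * np.sqrt(2. / (n_x + n_h))
W_y = rng.standard_normal((n_h, n_y)) * np.sqrt(2. / (n_h + n_y))

def mlp_forward(X, W_h, W_y):
    """One-hidden-layer ReLU network without biases (samples are rows)."""
    return np.maximum(X @ W_h, 0.) @ W_y

X = rng.standard_normal((10, n_x))          # 10 random Gaussian inputs

Y64 = mlp_forward(X, W_h, W_y)              # float64 reference
Y32 = mlp_forward(X.astype(np.float32),     # float32 stand-in for Keras
                  W_h.astype(np.float32),
                  W_y.astype(np.float32))

max_abs_diff = np.max(np.abs(Y64 - Y32))
same = np.allclose(Y64, Y32, atol=1e-5)     # equality up to a tolerance
print('max abs difference:', max_abs_diff)
print('equal up to tolerance:', same)
```

With the actual Keras model, `Y32` would be replaced by `model.predict(X)`.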


## Training a Keras network on MNIST

In [None]:
#Load MNIST dataset
from keras.datasets import mnist

(x_train, y_train),(x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

print(x_train.shape)

for i in range(3):
    plt.figure(figsize = (2,2))
    plt.imshow(x_train[i])
    plt.colorbar()

In [None]:
#Create network and train on MNIST

from keras.layers import Dense, Flatten, Conv2D, Dropout

model = Sequential()
model.add(Flatten(input_shape=(28,28)))
model.add(Dense(500, activation='relu'))
model.add(Dense(10, activation='softmax'))

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.summary()

model.fit(x_train, y_train, epochs=5, batch_size = 32)

score = model.evaluate(x_test, y_test)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
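
The output non-linearity and the loss used in the cell above can be written out in a few lines of NumPy. The sketch below is only an illustration: the function names `softmax` and `sparse_categorical_crossentropy_np` and the example logits are made up for this example, and this is not the Keras implementation. "Sparse" refers to the labels being integer class indices rather than one-hot vectors.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax: positive outputs that sum to 1 along the last axis."""
    z = z - np.max(z, axis=-1, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def sparse_categorical_crossentropy_np(labels, probs):
    """Mean negative log-probability assigned to the integer class labels."""
    picked = probs[np.arange(len(labels)), labels]
    return -np.mean(np.log(picked))

# two samples, three classes (made-up pre-softmax activations)
logits = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])
labels = np.array([0, 2])

probs = softmax(logits)
loss = sparse_categorical_crossentropy_np(labels, probs)
print(probs)
print(loss)
```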

# Q1: Observation questions
1.1 How many layers of neurons are there in this network? <br/>
1.2 How many neurons are there in each layer? <br/>
1.3 What is the non-linearity used in the first layer? <br/>
1.4 What is the non-linearity used in the second layer? What is the softmax function? <br/>
1.5 Why do we use the layer 'Flatten'? <br/>
1.6 What is the optimization method used? <br/>
1.7 What loss function are we using? <br/>
1.8 What batch size are we using? <br/>
1.9 What is the accuracy of the trained network on the training set? And on the testing set? Is the network overfitting?

# Q2: Play with the network hyperparameters

Copy the original network in different cells of the notebook and perform the following changes.

2.1 Set the number of neurons in the first layer to 10. How does this affect training accuracy? Testing accuracy? Is the network overfitting?<br/>

2.2 Add a second layer of neurons identical to the first layer. How does it affect performance?<br/>

2.3 Remove the non-linearity of the first layer. How does it affect performance?

2.4 (optional) What is a 'Dropout' layer? Is it active during training only or also during testing? Add a dropout layer after the first layer. How does it affect the training and testing accuracies?


# Q3: Visualize the inner workings of the network

### 3.1 Neuron activations in the hidden layer
Select two neurons of the first layer (i.e. hidden layer) and plot their activations across the testing set in a scatterplot (x-axis : activation of neuron1, y-axis : activation of neuron2). Are these neurons correlated?<br/>

In [None]:
# CODE HELP
# get the output of layer 'layer_nb' for an input (here 600 random images with pixels drawn uniformly in [0, 1)):
from keras import backend as K
layer_nb = 1
inp = model.input # input placeholder
out = model.layers[layer_nb].output # layer output
func = K.function([inp], [out]) # function relating input to output
output = func([np.random.rand(600,28,28)])[0]  #  apply function to a particular input
print('Output shape:', output.shape)

###  3.2 Visualize hidden layer representation in Principal Component space

a) Perform SVD on the activation matrix of the hidden layer (dim1: MNIST samples, dim2: neuron responses). Represent each MNIST sample in a scatterplot where the x-axis is the projection of the sample representation along PC1 and the y-axis the projection along PC2. Color the points by the corresponding class of the sample (0 to 9). Are the classes well separated in PC space? <br/>

b) Re-do this analysis before training the network. Are the classes better separated after training?  

c) Re-do this analysis (a and b) in t-SNE space instead of PC space, before and after training. <br/>
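
One way to obtain the PC projections asked for in (a) is sketched below with NumPy on synthetic data. The matrix `A`, its size and the seed are placeholders standing in for the hidden-layer activation matrix. The sketch centers each column before the SVD, which is the usual convention for PCA, and it passes `full_matrices=False` to keep the factors small; both are choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic stand-in for the activation matrix: (samples, neurons)
A = rng.standard_normal((600, 50)) @ rng.standard_normal((50, 50))

# center each neuron's response across samples
A_centered = A - A.mean(axis=0)

# rows of Vt are the principal directions, S the singular values
U, S, Vt = np.linalg.svd(A_centered, full_matrices=False)

# coordinates of every sample along each PC (identical to U * S)
proj = A_centered @ Vt.T
pc1, pc2 = proj[:, 0], proj[:, 1]

# fraction of the total variance captured by each PC
explained = S ** 2 / np.sum(S ** 2)
print(explained[:2])
```

Plotting `pc1` against `pc2` for the samples of each class then gives the requested scatterplot. The columns of `U` differ from `proj` only by a scaling of each axis.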




In [None]:
# CODE HELP
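# do SVD on a matrix (e.g. on matrix 'output')
# note: np.linalg.svd returns the third factor already transposed (its rows are the PC directions)
U,S,V = np.linalg.svd(output)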
print('U shape:', U.shape)
print('S shape:', S.shape)
print('V shape:', V.shape)

# scatterplot
plt.figure(figsize = (5,5))
for i in range(10):  
    targets = y_test[0:600]==i
    plt.plot(U[targets,0],U[targets,1],'.')

# do TSNE on a matrix
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, perplexity=30.0)
R = tsne.fit_transform(output)
R.shape

# scatterplot
plt.figure(figsize = (5,5))
for i in range(10):  
    targets = y_test[0:600]==i
    plt.plot(R[targets,0],R[targets,1],'.')

###  3.3 Receptive fields of the hidden layer
a) Select a few neurons of the first layer and represent their receptive field (i.e. weights on the input image). <br/>

b) Create a white noise stimulus database with 10,000 samples. Feed it to the network. Perform an activation-weighted average (i.e. STA) for the neurons selected in (a). Do you recover the neurons' receptive fields? <br/>
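
As a sketch of what the activation-weighted average in (b) computes, the NumPy example below applies it to a single synthetic ReLU neuron rather than to the trained network. The weight vector `w`, the noise statistics (zero-mean Gaussian), the seed and the sample count are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pixels, n_samples = 28 * 28, 10000

w = rng.standard_normal(n_pixels)                   # hypothetical receptive field
stim = rng.standard_normal((n_samples, n_pixels))   # white noise stimuli (rows)
act = np.maximum(stim @ w, 0.)                      # ReLU activation per stimulus

# activation-weighted average of the stimuli
sta = act @ stim / act.sum()

# similarity between the estimate and the true weights
corr = np.corrcoef(sta, w)[0, 1]
rf_image = sta.reshape(28, 28)                      # ready for plt.imshow
print('correlation between STA and weights:', corr)
```

For the real network, `act` would be the activation of the selected neuron for each white noise image, obtained as in the code help of 3.1.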




In [None]:
# get weights and biases of a layer 'layer_nb'
W,b = model.layers[layer_nb].get_weights()
print('Weights shape:', W.shape, "biases shape", b.shape)

In [None]:
model.layers[1].output.shape

### BONUS QUESTION: 
Find the RF of the neurons of the second layer by performing STA. Alternatively, find the RF of these neurons by doing a one step gradient descent on the activation of this neuron from a gray image. Why are these two computations giving the same result? <br/>

In [None]:
# NO HELP!!!

# Other Resources


## Installing TensorFlow and Keras (CPU-only, already installed on the lab computers)
1) Download and install Anaconda for Python 3.6 https://www.anaconda.com/download/#macos <br/>
2) Install TensorFlow from the command line: conda install -c conda-forge tensorflow <br/>
3) Install Keras from the command line: conda install -c conda-forge keras <br/>
Need access to a FREE GPU in a few clicks? Check out Colab: https://medium.com/deep-learning-turkey/google-colab-free-gpu-tutorial-e113627b9f5d
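
After installing, a quick way to check that the packages are visible to Python is sketched below. The helper name `installed` is made up for this example. It only looks the packages up without importing them, so it does not confirm that they actually load or which version is present.

```python
import importlib.util

def installed(package_name):
    """Return True if Python can locate the package (without importing it)."""
    return importlib.util.find_spec(package_name) is not None

# packages used in this tutorial
status = {name: installed(name) for name in ('numpy', 'tensorflow', 'keras')}
for name, ok in status.items():
    print(name, 'found' if ok else 'MISSING')
```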

## Training the network with TensorFlow

In [None]:
# Tensorflow way
tf.reset_default_graph()
keras.backend.clear_session()

x = tf.placeholder(tf.float32, (batch_size, n_x))

# Define variables
W_h = tf.get_variable('W_h', shape=(n_x, n_h), initializer=tf.constant_initializer(W_h_np))
W_y = tf.get_variable('W_y', shape=(n_h, n_y), initializer=tf.constant_initializer(W_y_np))

# Build the computational graph
h = tf.nn.relu(tf.matmul(x, W_h))
y = tf.matmul(h, W_y)

print(y)

sess = tf.Session()
# with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
y_val = sess.run(y, feed_dict={x: x_np})
sess.close()
# This should be (batch_size, n_y)
print(y_val)    

plt.scatter(y_val, y_np)


# Defining the target output and the loss
y_target = tf.placeholder(tf.float32, (None, n_y))
loss = tf.reduce_mean((y_target - y)**2)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    loss_val = sess.run(loss, feed_dict={x: x_np, y_target: y_target_np})
    
print(loss_np)
print(loss_val)

# Compute the gradient and train
opt = tf.train.GradientDescentOptimizer(learning_rate=0.01)
train_op = opt.minimize(loss)
# opt = tf.train.AdamOptimizer(learning_rate=0.01)


# We will overfit the network on the random target output we generated
steps = list()
losses = list()
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(10):
        _, loss_val = sess.run([train_op, loss],
                               feed_dict={x: x_np, y_target: y_target_np})
        if step % 1 == 0:
            print(step, loss_val)
            steps.append(step)
            losses.append(loss_val)

plt.figure()
plt.plot(steps, losses)

## Convolutional Network

In [None]:
from keras.layers import Dense, Flatten, Conv2D, Dropout, MaxPooling2D


model = Sequential()
model.add(Conv2D(32, input_shape = (28,28,1), kernel_size=(3, 3), activation='relu'))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.summary()

model.fit(x_train[..., np.newaxis], y_train, epochs=5)
model.evaluate(x_test[..., np.newaxis], y_test)
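
The layer sizes reported by `model.summary()` for this network can be checked by hand. The following sketch redoes the arithmetic in plain Python, assuming the Keras defaults used above (stride 1 and 'valid' padding for `Conv2D`, non-overlapping windows for `MaxPooling2D`). The helper names are made up for this example.

```python
def conv_out(size, kernel):
    """Spatial size after a stride-1, unpadded ('valid') convolution."""
    return size - kernel + 1

def conv_params(kernel, c_in, c_out):
    """Weights plus biases of a square-kernel convolution layer."""
    return kernel * kernel * c_in * c_out + c_out

def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

s = conv_out(28, 3)      # after Conv2D(32, 3x3)
s = conv_out(s, 3)       # after Conv2D(64, 3x3)
s = s // 2               # after MaxPooling2D(2x2)
n_flat = s * s * 64      # after Flatten (Dropout keeps the shape)

n_params = (conv_params(3, 1, 32) + conv_params(3, 32, 64)
            + dense_params(n_flat, 128) + dense_params(128, 10))
print('flattened size:', n_flat)
print('trainable parameters:', n_params)
```

Most of the parameters sit in the first dense layer, not in the convolutions.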

## Recurrent network

In [None]:
from keras.layers import Dense, Flatten, Conv2D, Dropout, MaxPooling2D, GRU
from keras import Input

x_train_rnn = x_train.reshape((x_train.shape[0], -1, 1))
x_test_rnn = x_test.reshape((x_test.shape[0], -1, 1))

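x = Input((None, x_train_rnn.shape[-1]))
# layer = tf.keras.layers.SimpleRNN(100)
layer = GRU(20)
h = layer(x)
# read out the 10 class probabilities from the final GRU state
# (added so that the output matches the sparse categorical cross-entropy loss)
y = Dense(10, activation='softmax')(h)
print(x)
model = keras.models.Model(inputs=x, outputs=y)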
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.summary()

model.fit(x_train_rnn, y_train, epochs=5)
model.evaluate(x_test_rnn, y_test)

In [None]:
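# This cell appears to assume that `model` is the MLP trained on MNIST above
# (Flatten, Dense(500), Dense(10)), not the convolutional or recurrent model.
from keras import backend as K

# the "loss" here is the output of one neuron (unit 4 of the output layer)
loss = model.layers[2].output[:,4]
input_img = model.input

# we compute the gradient of this loss with respect to the input picture
grads = K.gradients(loss, input_img)[0]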

# this function returns the loss and grads given the input picture
iterate = K.function([input_img], [loss, grads])

# we start from a gray image
input_img_data = np.zeros((1,28,28))

# we run gradient ascent for 1 step so it's just a computation of the gradient
loss_value, grads_value = iterate([input_img_data])

plt.figure()
plt.imshow(grads_value[0])