# Applied Deep Learning Tutorial 
contact: Mark.schutera@kit.edu
# Deep Learning Foundations

## Introduction
In this tutorial, you will get a first understanding of neural networks and the TensorFlow library. You will learn about backpropagation, convolutional layers, fully connected layers, activation functions, and loss functions, and bring all of this together in your first neural network architecture.

<img src="graphics/set_sails.jpg" width="700"><br>
<center> Fig. 1: Setting sails for the deep learning journey. Image from [pixabay](https://pixabay.com/de/photos/) </center>



## Backpropagation

Backpropagation is the central part of iterative optimization. Starting from the error determined by the loss function, it uses the local derivative of each unit in the neural network. The chain rule connects these local derivatives, which yields the derivative of the loss function with respect to the weights of every layer. The resulting gradient shows how each weight affects the output and its error. To minimize the error, the derivatives thus tell you in which direction to update the associated weights.


Define an input vector $x$ with two entries (0.2 and 0.4) and a weight matrix $W$ with two weights (-0.3 and 0.8) that connect both inputs to a single unit. The ground truth is defined as 1.

\begin{align}
	x =\begin{bmatrix}
	0.2 \\ 0.4 
	\end{bmatrix}
\end{align}	

\begin{align}
	y = 1
\end{align}	

In [None]:
# We will first do some experiments in numpy. No worries, there will be enough TensorFlow soon.
import numpy as np

x = np.array([[0.2], [0.4]])
W = np.array([[-0.3, 0.8]])
y = 1


Next we need to define our architecture to obtain the prediction of our one-unit neural network. We split the equation into two functions so that we can define the partial derivatives step by step later on:

\begin{align}
	 f(x,W)= || W*x ||^2.
\end{align}

In [None]:
def multiplication(W,x):
    q = '''implement the scalar (dot) product of W and x with numpy'''
    return q

def prediction(q):
    f_1 = '''implement the square function in two steps'''
    y_pred = '''implement the square function in two steps'''
   
    return y_pred
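
One possible way to fill in the two placeholders above, given as a minimal sketch rather than the reference solution. It assumes the input values from the text (0.2 and 0.4) and the weights -0.3 and 0.8; `f_1` holds the element-wise squares and `y_pred` their sum.

```python
import numpy as np

# Values assumed from the text above, for illustration
x = np.array([[0.2], [0.4]])   # input, shape (2, 1)
W = np.array([[-0.3, 0.8]])    # weights, shape (1, 2)

def multiplication(W, x):
    # Scalar (dot) product q = W * x, returned with shape (1, 1)
    return np.dot(W, x)

def prediction(q):
    # Squared L2 norm in two steps: square each entry, then sum
    f_1 = q ** 2
    y_pred = np.sum(f_1)
    return y_pred

q = multiplication(W, x)
y_pred = prediction(q)
print("q:", q, "y_pred:", y_pred)
```

Keeping `q` as an intermediate result pays off later, because the backward pass needs it to evaluate the local derivative of the square.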

To calculate the error we need to specify a loss function or objective function and our weight update step:

In [None]:
def update(W, grad_W, learning_rate):
    W = '''implement the weight update step'''
    return W

In [None]:
def prediction_loss(y_pred, y):
    error = '''implement the error'''
    loss = '''implement the squared error loss function'''
    return loss
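
A minimal sketch of how the two functions above could look, assuming plain gradient descent for the update and a squared error loss. The gradient values used in the usage lines are hypothetical and only illustrate the call.

```python
import numpy as np

def update(W, grad_W, learning_rate):
    # Gradient descent step: move the weights against the gradient
    return W - learning_rate * grad_W

def prediction_loss(y_pred, y):
    # Squared error between prediction and ground truth
    error = y_pred - y
    loss = error ** 2
    return loss

W = np.array([[-0.3, 0.8]])
grad_W = np.array([[1.0, -2.0]])   # hypothetical gradient, for illustration only
W_new = update(W, grad_W, learning_rate=1e-2)
loss = prediction_loss(0.5, 1)
print(W_new, loss)
```

The minus sign in `update` is what turns the gradient, which points towards increasing loss, into a step that reduces the loss.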

Now it is time to determine the partial derivatives of our neural network. We aim to determine the gradient of our function 
$f(x,W)$ with respect to the input values of our network $x$ and the network's parameters $W$.

\begin{align}
	\frac{\partial f(x,W)}{\partial W}
\end{align}

\begin{align}
    \frac{\partial f(x,W)}{\partial x}
\end{align}

By using the chain rule we can easily propagate the error through the local derivatives at each operation and unit.

The derivative of the last stage of our unit can be written as: 
\begin{align}
	f(q)= ||q||^2 = q_{1}^2+...+q_{n}^2, \qquad \frac{\partial f(q)}{\partial q_{i}} = 2*q_{i}.
\end{align}

The next node has two inputs, $W$ and $x$. Here we only need the local derivative with respect to $W$, since the weights are the part we can adjust.
\begin{align}
	q = W*x	
\end{align}

\begin{align}
    \frac{\partial q_{k}}{\partial W_{i,j}}=\delta_{k,i}*x_{j}
\end{align}

Here $\delta_{k,i}$ is 1 if $k = i$ and 0 otherwise, so row $i$ of $W$ only influences the output $q_{i}$.

With the chain rule, we can now backpropagate the error to those weights:

\begin{align}
    \frac{\partial f(x,W)}{\partial W} = \frac{\partial f(q)}{\partial q} * \frac{\partial q}{\partial W} = 2*q*x^{T}
\end{align}
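
The result $2*q*x^{T}$ can be checked numerically. The following sketch compares it with central finite differences, assuming the input values from the text (0.2 and 0.4) and the weights -0.3 and 0.8; the step size `eps` is an arbitrary choice.

```python
import numpy as np

def f(x, W):
    # f(x, W) = ||W * x||^2
    q = np.dot(W, x)
    return float(np.sum(q ** 2))

x = np.array([[0.2], [0.4]])
W = np.array([[-0.3, 0.8]])

# Analytic gradient from the chain rule: 2 * q * x^T
q = np.dot(W, x)
grad_analytic = np.dot(2 * q, x.T)

# Numerical gradient: perturb one weight at a time
eps = 1e-6
grad_numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        grad_numeric[i, j] = (f(x, W_plus) - f(x, W_minus)) / (2 * eps)

print("analytic:", grad_analytic)
print("numeric: ", grad_numeric)
```

Such a gradient check is a cheap way to catch sign or transpose mistakes before trusting a hand-derived backward pass.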



In [None]:
# Starting with the loss function
def gradient_loss(y_pred, y):
    grad_loss = '''implement the derivative at the loss function'''
    return grad_loss

# Over the square operator
def gradient_prediction(q, grad_loss):
    grad_q = '''implement the derivative at the unit'''
    return grad_q

# Finally to the multiplication of our inputs
def gradient_multiplication(x, grad_q):
    if len(x) == 1:
        grad_W = '''implement the derivative at the input'''
    else:
        grad_W = '''implement the derivative at the input'''
    return grad_W
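
A possible sketch of the three backward functions, assuming a squared error loss $(y_{pred} - y)^2$. It ignores the special case for a single input that the stub above distinguishes, and uses the input 0.2, 0.4 and the weights -0.3, 0.8 for illustration.

```python
import numpy as np

def gradient_loss(y_pred, y):
    # Derivative of (y_pred - y)^2 with respect to y_pred
    return 2 * (y_pred - y)

def gradient_prediction(q, grad_loss):
    # Derivative of ||q||^2 is 2 * q, scaled by the upstream gradient
    return 2 * q * grad_loss

def gradient_multiplication(x, grad_q):
    # Derivative of q = W * x with respect to W: upstream gradient times x^T
    return np.dot(grad_q, x.T)

x = np.array([[0.2], [0.4]])
W = np.array([[-0.3, 0.8]])
y = 1

# Forward pass, needed to evaluate the local derivatives
q = np.dot(W, x)
y_pred = np.sum(q ** 2)

# Backward pass, from the loss back to the weights
grad_loss = gradient_loss(y_pred, y)
grad_q = gradient_prediction(q, grad_loss)
grad_W = gradient_multiplication(x, grad_q)
print(grad_W)
```

Each function multiplies its local derivative with the gradient handed down from the stage after it, which is the chain rule applied one node at a time.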

Finally, we bring everything together in the iterative weight update process, the optimization routine for our neural network.

In [None]:
learning_rate = 1e-2

weightlist1 = []
weightlist2 = []
predictionlist = []
groundtruthlist = []

for t in range(10):
    
    weightlist1.append(W[0][0])
    weightlist2.append(W[0][1])
    
    # Forward pass
    q = '''with the functions above implement the forward pass'''
    y_pred = '''with the functions above implement the forward pass'''
    loss = '''with the functions above implement the forward pass'''
    print("Training loss: ", loss)
    
    # Backpropagation
    grad_loss = '''with the functions above implement the backward pass'''
    grad_q = '''with the functions above implement the backward pass'''
    grad_W = '''with the functions above implement the backward pass'''
    W = '''with the functions above implement the backward pass'''
    predictionlist.append(y_pred)
    groundtruthlist.append(y)
    print("Current prediction: ", y_pred)
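
For reference, a compact sketch of the whole routine with the forward and backward pass written inline instead of calling the functions above. It assumes the input 0.2, 0.4, the weights -0.3, 0.8, a squared error loss, and plain gradient descent.

```python
import numpy as np

x = np.array([[0.2], [0.4]])
W = np.array([[-0.3, 0.8]])
y = 1
learning_rate = 1e-2
losses = []

for t in range(10):
    # Forward pass
    q = np.dot(W, x)
    y_pred = np.sum(q ** 2)
    loss = (y_pred - y) ** 2
    losses.append(loss)

    # Backward pass (chain rule, from the loss back to the weights)
    grad_loss = 2 * (y_pred - y)
    grad_q = 2 * q * grad_loss
    grad_W = np.dot(grad_q, x.T)

    # Gradient descent update
    W = W - learning_rate * grad_W

print("first loss:", losses[0], "last loss:", losses[-1])
```

With this small learning rate the loss shrinks only slowly over the ten iterations, which makes the effect of each update easy to follow.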

In [None]:
# How do the weights change with respect to the given input? Elaborate.

import matplotlib.pyplot as plt
plt.plot(weightlist1, weightlist2)
plt.xlabel('Weight 1')
plt.ylabel('Weight 2')
plt.show()

In [None]:
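# Compare the predictions with the ground truth over the iterations
plt.plot(predictionlist, label='Prediction')
plt.plot(groundtruthlist, label='Groundtruth')
plt.xlabel('Iteration')
plt.legend()
plt.show()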

# TensorFlow bring it on!
Now we are going to get our hands dirty with TensorFlow, where most of the hard work is already done for you. This way we can finally focus on "adding layers" and "going deeper".

## Optimization
First we concentrate on optimization, which basically mirrors what we did with numpy a few cells above. This time we will implement a full neural network unit, including an activation function. We also add a third input and introduce a bias for the unit.

In [None]:
import tensorflow as tf
import numpy as np

# For reproducibility
np.random.seed(2)

In [None]:
# Neural Network architecture
n_input = 3
n_output = 1
n_units = 1

# Training parameters
n_updates = 10


# Define graph / network
weights = {
    'h1': tf.Variable(np.reshape([np.float32(2.0), np.float32(2.0), np.float32(2.0)], (3, 1))),
    # 'h1': tf.Variable(tf.random.normal([n_input, n_units])),
}

biases = {
    'b1': tf.Variable(np.reshape([np.float32(4.0)], (1, 1))),
    # 'b1': tf.Variable(tf.random.normal([n_units])),
}

# This is where we design our unit
class unit(tf.keras.Model):
    def __init__(self, c_weights, c_biases):
        super(unit, self).__init__(name='')
        self.c_weights = c_weights
        self.c_biases = c_biases

    def call(self, x0):
        # unit / neuron structure
        layer_1 = tf.add(tf.matmul(tf.cast(x0, tf.float32), self.c_weights['h1']), self.c_biases['b1'])
        # activation function
        y_pred = tf.nn.relu(layer_1)
        return y_pred


# Input
x = tf.Variable(np.reshape([1.0, 1.0, 3.0], (1, 3)))
# Model (call it on x to obtain the predicted output)
y_pred = unit(weights, biases)
# Expected output
y_gt = tf.Variable(np.reshape([10.0], (1, 1)))

In [None]:
cost = tf.keras.losses.MeanSquaredError()

# 1. Stochastic gradient descent
opt = tf.keras.optimizers.SGD(learning_rate=0.0001)

# 2. Momentum optimizer
# opt = tf.keras.optimizers.SGD(learning_rate=0.0001, momentum=0.75)

# 3. Adaptive moment estimation (Adam) optimizer
# opt = tf.keras.optimizers.Adam(learning_rate=0.0001)

# 4. Adam optimizer with adjusted initial learning rate
# opt = tf.keras.optimizers.Adam(learning_rate=0.1)


In [None]:
for step in range(n_updates):
    # Open a GradientTape.
    with tf.GradientTape() as tape:
        # Forward pass.
        prediction = y_pred(x)
        # Loss value for this update step.
        loss_value = cost(y_gt, prediction)

    # Get gradients of loss wrt the weights.
    gradients = tape.gradient(loss_value, y_pred.trainable_weights)

    # Update the weights of the model.
    opt.apply_gradients(zip(gradients, y_pred.trainable_weights))

    print("Step", step, "loss:", float(loss_value))

print(y_pred.trainable_weights)

Try the different optimizers and see whether you can reproduce the findings presented in the lecture.
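
To build some intuition for the difference between plain gradient descent and momentum, here is a small sketch in plain Python. The toy objective $f(w) = w^2$, the start value, and the learning rate are assumptions chosen for illustration; only the momentum of 0.75 is taken from the cell above. The momentum rule shown is the classical form with a velocity term.

```python
def grad(w):
    # Gradient of the toy objective f(w) = w^2 (assumed for illustration)
    return 2.0 * w

learning_rate = 0.01   # hypothetical value, larger than in the cell above
momentum = 0.75
w_sgd, w_mom, velocity = 2.0, 2.0, 0.0

for step in range(10):
    # Plain gradient descent: step against the current gradient
    w_sgd = w_sgd - learning_rate * grad(w_sgd)

    # Momentum: accumulate a decaying sum of past gradients in `velocity`
    velocity = momentum * velocity - learning_rate * grad(w_mom)
    w_mom = w_mom + velocity

print("gradient descent:", w_sgd)
print("momentum:        ", w_mom)
```

After the same number of updates, the momentum variant has moved closer to the minimum at 0, because consecutive gradients point in the same direction and add up in the velocity.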

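Notice the design of TensorFlow: we first define the model, then record the forward pass inside a `tf.GradientTape`, which is used afterwards to compute the gradients for the backward pass. Earlier TensorFlow 1.x code instead built a static graph first and then ran it in a session, where the placeholders were filled and the forward and backward pass were executed.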

## Regularization
Next we will get an intuition for regularization approaches (evaluate the different designs 0. to 2.). We are also working with the Keras library, which is based on TensorFlow and has a more pythonic feel. Can you think of why I would say so?

In [None]:
import tensorflow as tf

# Define model
model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),

  # Design 0. without regularization:
  tf.keras.layers.Dense(512, activation=tf.nn.relu),
  tf.keras.layers.Dense(512, activation=tf.nn.relu),
  tf.keras.layers.Dense(512, activation=tf.nn.relu),
  tf.keras.layers.Dense(512, activation=tf.nn.relu),

  # Design 1. L2 Parameter norm penalty by kernel regularizer:
  # tf.keras.layers.Dense(512, activation=tf.nn.relu, kernel_regularizer=tf.keras.regularizers.l2(0.01)),
  # tf.keras.layers.Dense(512, activation=tf.nn.relu, kernel_regularizer=tf.keras.regularizers.l2(0.01)),
  # tf.keras.layers.Dense(512, activation=tf.nn.relu, kernel_regularizer=tf.keras.regularizers.l2(0.01)),
  # tf.keras.layers.Dense(512, activation=tf.nn.relu, kernel_regularizer=tf.keras.regularizers.l2(0.01)),

  # Design 2. Dropout:
  # tf.keras.layers.Dense(512, activation=tf.nn.relu),
  # tf.keras.layers.Dropout(0.5),
  # tf.keras.layers.Dense(512, activation=tf.nn.relu),
  # tf.keras.layers.Dropout(0.5),
  # tf.keras.layers.Dense(512, activation=tf.nn.relu),
  # tf.keras.layers.Dropout(0.5),
  # tf.keras.layers.Dense(512, activation=tf.nn.relu),
  # tf.keras.layers.Dropout(0.5),

  tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
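
What designs 1. and 2. do can be sketched in numpy. This is an illustration under assumptions, not the Keras implementation: the weight matrix and activations are made up, the L2 penalty is computed as the factor times the sum of squared weights, and dropout is shown in its inverted form, where the surviving units are scaled up at training time.

```python
import numpy as np

rng = np.random.default_rng(0)

# L2 parameter norm penalty: factor * sum of squared weights
W = np.array([[0.5, -1.0], [2.0, 0.0]])   # hypothetical weight matrix
l2_factor = 0.01
l2_penalty = l2_factor * np.sum(W ** 2)   # added to the data loss

# Inverted dropout at training time: zero units with probability `rate` and
# scale the survivors by 1 / (1 - rate) so the expected activation is unchanged
rate = 0.5
activations = np.ones((1, 1000))          # hypothetical layer output
mask = rng.random(activations.shape) >= rate
dropped = activations * mask / (1.0 - rate)

print("L2 penalty:", l2_penalty)
print("mean activation after dropout:", dropped.mean())
```

The penalty grows with the size of the weights and therefore pushes them towards zero, while dropout forces the network not to rely on any single unit.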

In [None]:
# Define training parameters, feel free to play with the different optimizers as well.
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Load training data (reduce training data to 10k samples)
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_validation, y_validation) = mnist.load_data()

# Normalize input images (comply with activation function)
x_train, x_validation = x_train / 255.0, x_validation / 255.0

In [None]:
# 3. Augmentation
# x_train = x_train.reshape(x_train.shape[0], 28, 28, 1)
# datagenerator = tf.keras.preprocessing.image.ImageDataGenerator(
#                                                                 featurewise_center=True,
#                                                                 featurewise_std_normalization=True,
#                                                                 rotation_range=20,
#                                                                 width_shift_range=0.2,
#                                                                 height_shift_range=0.2,
#                                                                 horizontal_flip=False
#                                                                 )
# # The feature-wise statistics have to be computed on the training data first
# datagenerator.fit(x_train)
#
# for e in range(10):
#     print('Epoch', e)
#     batches = 0
#     for x_batch, y_batch in datagenerator.flow(x_train, y_train, batch_size=32):
#         model.fit(np.reshape(x_batch, (-1, 28, 28)), y_batch, shuffle=True)
#         batches += 1
#         if batches >= len(x_train) / 32:
#             # we need to break the loop by hand because
#             # the generator loops indefinitely
#             break

# 4. Early stopping (monitors the validation loss here; the validation accuracy is a common alternative)
# es = tf.keras.callbacks.EarlyStopping(monitor='val_loss',
#                                      min_delta=0,
#                                      patience=1,
#                                      mode='auto'
#                                      )
#
# Fit model on training data (with callback)
# model.fit(x_train, 
#          y_train, 
#          epochs=10, 
#          shuffle=True, 
#          callbacks=[es],
#          validation_data=(x_validation, y_validation)
#         )

# Fit model on training data (without early stopping callback)
model.fit(x_train, y_train, epochs=10, shuffle=True)

# Evaluate performance on validation set
_, validation_acc = model.evaluate(x_validation, y_validation)
print('validation accuracy:', validation_acc)
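
The idea behind early stopping with a patience parameter can be sketched in a few lines of plain Python. This is a simplified illustration, not the Keras callback itself; the function name `early_stopping_epoch` and the validation losses are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=1, min_delta=0.0):
    """Return the index of the epoch after which training would stop."""
    best = float("inf")
    wait = 0
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best - min_delta:
            # Improvement: remember the new best value and reset the counter
            best = val_loss
            wait = 0
        else:
            # No improvement: stop once `patience` such epochs have passed
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses) - 1

val_losses = [0.30, 0.21, 0.18, 0.19, 0.17]   # hypothetical validation losses
stop = early_stopping_epoch(val_losses, patience=1)
print("stopped after epoch", stop)
```

With `patience=1` the run stops at the first epoch without improvement, even though the loss would have improved again one epoch later. A larger patience trades longer training for robustness against such noise.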


## Next steps to take it from here

- Keep in mind that [MNIST](http://yann.lecun.com/exdb/mnist/) is the "Hello world" of machine learning. It is the go-to dataset if you are in need of a toy problem. Most new ideas in machine learning are presented with the aid of MNIST, so make sure that you familiarize yourself with the dataset and its characteristics.
- Try different regularization approaches and their influence.
- Can you think of a good task to introduce Convolutional Neural Networks, including receptive field, strides, semantic information, and so on? (Send your jupyter notebook to mark.schutera@kit.edu for a chance to earn bonus points).
