# Intro to Keras, TensorFlow and advanced NN, part 2


<center><img src="figures/keras-tensorflow-logo.jpg"></center>

## Summary of first part

## https://keras.io

* Keras is a high-level neural networks API (front-end), written in Python
* Capable of running on top of TensorFlow, CNTK, or Theano (backends)
* Built to simplify access to more complex backend libraries

## https://tensorflow.org

Use *TensorFlow* if you want a finer level of control:

* Build your own NN layers
* Personalized cost function
* More complex architectures than those available in Keras
    
We will mostly be writing Python code using Keras libraries, but "under the hood" Keras is using TensorFlow libraries.

The documentation is at [keras.io](https://keras.io).

## Here's what a NN layer looks like in TensorFlow:

* 7 samples in batch
* 784 inputs
* 500 outputs

<center><img src="figures/run_metadata_graph.png"></center>
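As a rough sketch of what such a layer computes, the snippet below reproduces the same shapes by hand with low-level TensorFlow ops. The variable names, the random initialization and the ReLU activation are assumptions made for illustration; the figure itself does not specify them.

```python
import tensorflow as tf

# Shapes taken from the slide: 7 samples, 784 inputs, 500 outputs
batch_size, n_inputs, n_outputs = 7, 784, 500

x = tf.random.normal((batch_size, n_inputs))                   # a batch of inputs
W = tf.Variable(tf.random.normal((n_inputs, n_outputs), stddev=0.05))  # weights
b = tf.Variable(tf.zeros((n_outputs,)))                        # one bias per output

# A fully connected layer is a matrix product, a bias add, and an activation
y = tf.nn.relu(tf.matmul(x, W) + b)
print(y.shape)
```

This is the bookkeeping that a single `Dense(500)` call in Keras takes care of.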

## A neural network in Keras is called a Model

The simplest kind of model is of the Sequential kind:

In [1]:
from tensorflow.keras.models import Sequential

model = Sequential()

This is an "empty" model: it has no layers, and no inputs or outputs are defined yet.

Adding layers is easy:

In [2]:
from tensorflow.keras.layers import Dense

model.add(Dense(units=3, activation='relu', input_dim=3))
model.add(Dense(units=2, activation='softmax'))



A "Dense" layer is a fully connected layer, like the ones we have seen in multi-layer perceptrons.
The code above is equivalent to this network:

<center><img src="figures/simplenet.png"></center>


If we want to see the layers in the model so far, we can just call:

In [3]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 3)                 12        
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 8         
Total params: 20
Trainable params: 20
Non-trainable params: 0
_________________________________________________________________
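The `Param #` column can be checked by hand: a Dense layer has one weight per input-unit pair plus one bias per unit. The helper below is a hypothetical name introduced only for this check.

```python
def dense_params(n_inputs, n_units):
    """Number of trainable parameters in a Dense layer: weights plus biases."""
    return n_inputs * n_units + n_units

first = dense_params(3, 3)    # 3 inputs -> 3 units
second = dense_params(3, 2)   # 3 units  -> 2 units

print(first, second, first + second)
```

The three printed numbers should match the 12, 8 and 20 reported by `model.summary()`.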


Using "model.add()" keeps stacking layers on top of what we have:

In [4]:
model.add(Dense(units=2, activation=None))
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 3)                 12        
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 8         
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 6         
Total params: 26
Trainable params: 26
Non-trainable params: 0
_________________________________________________________________


## Part 2, more Keras layers (https://keras.io/api/layers/)

Common layers (we will cover all of these!)

* Trainable
    * <font color='red'>Dense (fully connected/MLP)</font>
    * Conv1D (2D/3D)
    * MaxPooling1D (2D/3D)
    * Recurrent: LSTM/GRU/Bidirectional


* Non-trainable
    * <font color='red'>Dropout</font>
    * Flatten
    * Merge (Add/Multiply/Subtract/Concatenate)
    * <font color='red'>Lambda (apply your own function)</font>
    * <font color='red'>Activation (Softmax/ReLU/Sigmoid/...)</font>

## Dropout is a regularization layer

* It's applied to a previous layer's output
* Takes those outputs and randomly sets each of them to 0 with probability p
* The remaining outputs are scaled up by 1/(1-p), so that the expected sum of the outputs remains unchanged
* For p = 0.5: `model.add(Dropout(0.5))`

In [8]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Dropout
from tensorflow.keras import backend as K

tf.random.set_seed(0)
drop = Dropout(0.5, input_shape=(4,))
data = tf.reshape(tf.range(1.0,13.0), (3, 4))

print("Before:", data, sep="\n")
output = drop(data, training=True)
print("After:", K.eval(output), sep="\n")

Before:
tf.Tensor(
[[ 1.  2.  3.  4.]
 [ 5.  6.  7.  8.]
 [ 9. 10. 11. 12.]], shape=(3, 4), dtype=float32)
After:
[[ 0.  0.  6.  8.]
 [ 0. 12.  0. 16.]
 [18.  0. 22. 24.]]
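To make the mechanism explicit, here is a minimal NumPy sketch of the same idea (often called inverted dropout). It is not how Keras implements the layer internally; the function name, the `training` flag and the seed are assumptions chosen for illustration.

```python
import numpy as np

def dropout(x, p, rng, training=True):
    """Zero each element with probability p and rescale the survivors."""
    if not training or p == 0.0:
        return x                              # no-op outside of training
    keep_mask = rng.random(x.shape) >= p      # True where the value survives
    return x * keep_mask / (1.0 - p)          # survivors are scaled by 1/(1-p)

rng = np.random.default_rng(0)                # hypothetical seed
data = np.arange(1.0, 13.0).reshape(3, 4)

dropped = dropout(data, p=0.5, rng=rng)
print(dropped)
```

With p = 0.5 every surviving value is doubled, which is the pattern visible in the Keras output above.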


## Dropout is a regularization layer

* Applying the same input twice will give different results
* Means that it is harder for the network to memorize patterns
* Helps curb overfitting
* Especially used with Dense() layers which are prone to overfitting
* Active only at training time

In [7]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Dropout
from tensorflow.keras import backend as K

tf.random.set_seed(0)
drop = Dropout(0.5, input_shape=(4,))
data = tf.reshape(tf.range(1.0,13.0), (3, 4))

print("Before:", data, sep="\n")
output = drop(data, training=True)
print("After:", K.eval(output), sep="\n")

Before:
tf.Tensor(
[[ 1.  2.  3.  4.]
 [ 5.  6.  7.  8.]
 [ 9. 10. 11. 12.]], shape=(3, 4), dtype=float32)
After:
[[ 0.  0.  6.  8.]
 [ 0. 12.  0. 16.]
 [18.  0. 22. 24.]]


## Lambda layers

* Work like regular lambda functions
* Inputs and outputs are tensors, functions inside must be keras/tensorflow functions
* Function has to be differentiable

In [11]:
import tensorflow as tf
from tensorflow.keras.layers import Lambda
from tensorflow.keras import backend as K

def sum_two_tensors(inputs):

    x, y = inputs
    sum_of_tensors = x + y

    return sum_of_tensors

input_tensor_1 = tf.range(0, 9)
input_tensor_2 = tf.range(0, 9)
print(input_tensor_1)
lambda_out = Lambda(sum_two_tensors)([input_tensor_1, input_tensor_2])
K.eval(lambda_out)

tf.Tensor([0 1 2 3 4 5 6 7 8], shape=(9,), dtype=int32)


array([ 0,  2,  4,  6,  8, 10, 12, 14, 16], dtype=int32)

## Keras activations (https://keras.io/api/layers/activations/)

Activation functions for regression or inner layers:
* Sigmoid
* Tanh
* ReLU
* LeakyReLU
* Linear (None)

THE activation function for classification (output layer only):
* Softmax (outputs probabilities for each class)

## Softmax

It's an activation function that is applied to an output vector z with K elements (one per class) and outputs a probability distribution over the classes:

<table><tr>
<td><img src="figures/simplenet.png" width=200></td>
<td><img src="figures/softmax.svg"></td>
</tr></table>

What makes softmax your favorite activation:

* The K outputs sum to 1
* K probabilities proportional to the exponentials of the input numbers
* No negative outputs
* Monotonically increasing output with increasing input

Softmax is usually only used to activate the last layer of a NN
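The properties listed above can be checked with a small NumPy sketch of softmax, where each output is exp(z_i) divided by the sum of exp(z_j) over all classes. The input values are made up, and the max-subtraction is a common numerical-stability trick rather than part of the definition.

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis, shifted by max(z) for numerical stability."""
    shifted = z - np.max(z, axis=-1, keepdims=True)
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

z = np.array([2.0, 1.0, -1.0])   # hypothetical pre-activations for K = 3 classes
p = softmax(z)

print(p)          # all entries are positive
print(p.sum())    # the entries sum to 1 (up to floating point error)
```

Note that a negative input still yields a positive probability, and that the ordering of the inputs is preserved in the outputs.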

## ReLU vs. old-school logistic functions

* Historically, sigmoid and tanh were the most used activation functions
* Easy derivative
* Bounded outputs (e.g. from 0 to 1)
* They look like this:

<img src="figures/logistic_curve.png" width=400>

## ReLU vs. old-school logistic functions

* Problems arise when we are at large $|x|$
* The derivative in that area becomes small (saturation)
* Remember what the chain rule said?

<img src="figures/logistic_curve.png" width=400>


## ReLU vs. old-school logistic functions

* When we have $n$ layers, we go through $n$ activation functions
* At layer $n$ the derivative is proportional to:
$$\begin{eqnarray} 
\frac{\partial L(w,b|x)}{\partial w_{l_n}} & \propto &  \frac{\partial a_{l_n}}{\partial z_{l_n}}
\end{eqnarray}$$
* At layer 1 the derivative is proportional to:
$$\begin{eqnarray} 
\frac{\partial L(w,b|x)}{\partial w_{l_1}} & \propto &  \frac{\partial a_{l_n}}{\partial z_{l_n}} \times \frac{\partial a_{l_{n-1}}}{\partial z_{l_{n-1}}} \times \frac{\partial a_{l_{n-2}}}{\partial z_{l_{n-2}}} \times \ldots \times \frac{\partial a_{l_1}}{\partial z_{l_1}}
\end{eqnarray}$$
* It is the product of many numbers $< 1$
* Gradient becomes smaller and smaller for the initial layers
* Gradient vanishing problem
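A quick numerical illustration of the product above, as a sketch that looks only at the activation derivatives and ignores the weight terms. The sigmoid derivative is at most 0.25 (at z = 0), so even in the best case the product shrinks geometrically with depth. The layer counts are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

best_case = sigmoid_derivative(0.0)   # largest value the derivative can take

# Product of n identical best-case factors, one per layer
for n_layers in (1, 5, 10, 20):
    print(n_layers, best_case ** n_layers)
```

In the saturated regions (large $|z|$) each factor is much smaller than 0.25, so the real situation is usually worse than this best case.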

<!--- <img src="figures/large_net.png" width=400> -->

## ReLU is the first activation to address the issue

<center><img src="figures/relu.png" width=400></center>

Used in "internal" layers, usually not at last layer

Pros:
* Easy derivative (1 for x > 0, 0 elsewhere)
* Derivative doesn't saturate for x > 0: alleviates gradient vanishing
* Non-linear

Cons:
* Not differentiable at 0
* Dead neurons if x << 0 for all data instances
* Potential gradient explosion

Let's try this on TensorFlow playground: http://playground.tensorflow.org
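For reference, a minimal NumPy sketch of ReLU and its derivative. The derivative is undefined at exactly 0; using 0 there is a convention assumed for this sketch.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_derivative(x):
    # 1 for x > 0, 0 elsewhere (including x == 0, by convention in this sketch)
    return (x > 0).astype(float)

x = np.array([-100.0, -1.0, 0.0, 1.0, 100.0])
print(relu(x))
print(relu_derivative(x))
```

The derivative stays at 1 no matter how large x gets, which is why the product of derivatives does not shrink for active units. For very negative x it is exactly 0, which is the "dead neuron" case.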

## Other ReLU-like activations

LeakyReLU/PReLU
* y = $\alpha$x at x < 0
* In PReLU $\alpha$ is learned

<center><img src="figures/leakyrelu.png?0" width=400></center>


## <font color="white">Other</font>

ELU
* Differentiable at 0
* Non-zero at x < 0

<center><img src="figures/elu.png?0" width=400></center>

In [1]:
from IPython.display import IFrame 
IFrame('https://polarisation.github.io/tfjs-activation-functions/', width=860, height=470) 
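The ReLU variants above can also be written out in a few lines of NumPy. This is a sketch for comparison only; the `alpha` defaults are assumptions chosen for illustration, not necessarily the Keras defaults.

```python
import numpy as np

def leaky_relu(x, alpha=0.1):
    # Small linear slope alpha for x < 0 instead of a flat 0
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    # alpha * (exp(x) - 1) for x < 0; np.minimum avoids overflow for large x
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(leaky_relu(x))
print(elu(x))
```

Both functions keep a non-zero output (and gradient) for negative inputs, which is what lets a unit recover instead of staying dead.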

## Setting activations in Keras

We can add activations as string parameters, or as functions:

In [14]:
model = Sequential() 
model.add(Dense(units=2, activation='sigmoid'))
model.add(Dense(units=2, activation='relu'))
model.add(Dense(units=2, activation=tf.keras.activations.relu))
model.add(Dense(units=2, activation='softmax'))

But also as separate layers

In [15]:
import tensorflow as tf
from tensorflow.keras.layers import Activation

model = Sequential() 
model.add(Dense(units=2))
model.add(Activation('sigmoid'))
model.add(Dense(units=2))
model.add(Activation('relu'))
model.add(Dense(units=2))
model.add(Activation(tf.keras.activations.relu))
model.add(Dense(units=2))
model.add(Activation('softmax'))

## Passing classes as parameters

* Some parameters can be set by passing a string (optimizer='rmsprop')
* We need to explicitly import the object if we want better control (optimizer=RMSprop())

In [17]:
from tensorflow.keras.optimizers import RMSprop
model.compile(optimizer=RMSprop(),                    #adaptive learning rate method
              loss='sparse_categorical_crossentropy', #loss function for classification problems with integer labels
              metrics=['accuracy'])                   #the metric doesn't influence the training

model.optimizer.get_config()

{'name': 'RMSprop',
 'learning_rate': 0.001,
 'decay': 0.0,
 'rho': 0.9,
 'momentum': 0.0,
 'epsilon': 1e-07,
 'centered': False}

## Passing classes as parameters

* Some parameters can be set by passing a string (optimizer='rmsprop')
* We need to explicitly import the object if we want better control (optimizer=RMSprop())

In [18]:
from tensorflow.keras.optimizers import RMSprop
model.compile(optimizer=RMSprop(learning_rate=1.0),   #adaptive learning rate method
              loss='sparse_categorical_crossentropy', #loss function for classification problems with integer labels
              metrics=['accuracy'])                   #the metric doesn't influence the training

model.optimizer.get_config()

{'name': 'RMSprop',
 'learning_rate': 1.0,
 'decay': 0.0,
 'rho': 0.9,
 'momentum': 0.0,
 'epsilon': 1e-07,
 'centered': False}

## There are multiple ways to pass data to fit()

* You can load all of the data into memory and pass it as:
    * numpy array or list of arrays (if you have multiple inputs/outputs)
    * TensorFlow tensors
    * A dictionary to map input names to arrays/tensors

```python
data = np.genfromtxt('path/to/dataset.csv',delimiter=',')

X_train = data[:,0:10]
y_train = data[:,10]

model.fit(X_train, y_train,...)
```

## There are multiple ways to pass data to fit()


* Or you can pass it an object/function that generates data for you:
    * A generator() function
    * A keras.utils.Sequence object
    * A tensorflow.data.Dataset object

Here is a quick example of a generator that loads data from a list of files (images, pickle objects, csv files...) on the filesystem:


```python
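import numpy as np

def generator(input_list):
    # input_list: path to a text file that lists one data file per line
    with open(input_list, 'r') as input_list_file:
        while True:
            for next_file in input_list_file:
                # Assumes each listed file is a CSV with 10 features and 1 target per row
                data = np.genfromtxt(next_file.strip(), delimiter=',')
                X = data[:, 0:10]
                y = data[:, 10]

                yield X, y
            # Rewind the file list so the generator can keep serving data
            input_list_file.seek(0)

model.fit(generator(train_data_list),...)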
```
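For the `tensorflow.data.Dataset` option, a minimal sketch could look like the following. The random arrays stand in for real data, and the shapes, buffer size and batch size are assumptions for illustration.

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-in data: 100 samples, 10 features, 1 target (assumed shapes)
X_train = np.random.rand(100, 10).astype("float32")
y_train = np.random.rand(100).astype("float32")

# Slice the arrays sample by sample, shuffle them, and group them into batches
dataset = (
    tf.data.Dataset.from_tensor_slices((X_train, y_train))
    .shuffle(buffer_size=100)
    .batch(32)
)

# Peek at the first batch to check the shapes
for X_batch, y_batch in dataset.take(1):
    print(X_batch.shape, y_batch.shape)
```

A dataset built this way already yields batches, so it can be passed to `fit()` in place of the `X_train, y_train` pair.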

## Even more Keras layers

* Dense is the classic FFNN where all nodes between layers are connected
* Most of the other layers seen today are not trainable
* What other layers are trainable then?

## Convolutional layers

* Used where the _spatial_ relationship between inputs is significant
* Classic example: imaging
* Different types: 1D, 2D, 3D

```python
from tensorflow.keras.layers import Conv2D

model.add(Conv2D(filters, kernel_size, strides=(1, 1), padding="valid"))
```
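The `kernel_size`, `strides` and `padding` arguments determine the spatial size of the output. The helper below is a hypothetical function that sketches the usual arithmetic along one dimension; the 28-pixel input is an arbitrary example.

```python
import math

def conv_output_size(input_size, kernel_size, stride=1, padding="valid"):
    """Output length along one spatial dimension of a convolution."""
    if padding == "valid":
        # No padding: the kernel must fit entirely inside the input
        return (input_size - kernel_size) // stride + 1
    if padding == "same":
        # Input is padded so that only the stride reduces the size
        return math.ceil(input_size / stride)
    raise ValueError("padding must be 'valid' or 'same'")

print(conv_output_size(28, 3))                   # 26
print(conv_output_size(28, 3, stride=2))         # 13
print(conv_output_size(28, 3, padding="same"))   # 28
```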

## Convolutional layers

<img src="figures/Typical_cnn.png"></img>

[(source)](https://en.wikipedia.org/wiki/Convolutional_neural_network)

## Recurrent layers

* Used when the _temporal_ relationship between inputs is significant
* Examples: audio, text
* Different types: LSTM, GRU...

```python
from tensorflow.keras.layers import LSTM
model.add(LSTM(units, activation="tanh", recurrent_activation="sigmoid"))
```

## Recurrent layers

<img src=figures/rnn.png></img>


## Embedding layers

* Used to transform a discrete input into a vector
* Example: text input is made of words, how do we translate that into NN inputs?
* "cat" -> `[0.1, 0.003, 1.2 ..., 0]`

```python
from tensorflow.keras.layers import Embedding
model.add(Embedding(input_dim, output_dim))
```
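Conceptually, an embedding is a lookup table with one row per discrete input value. The NumPy sketch below mimics that lookup; the vocabulary, the vector size and the random values are made up, and in a real layer the table would be learned during training.

```python
import numpy as np

vocabulary = {"cat": 0, "dog": 1, "fish": 2}    # hypothetical word -> index map
input_dim, output_dim = len(vocabulary), 4      # table size: 3 words x 4 dimensions

rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(input_dim, output_dim))  # trainable in a real layer

# Translate words into integer indices, then pick the matching rows
word_ids = np.array([vocabulary[w] for w in ["cat", "fish", "cat"]])
vectors = embedding_matrix[word_ids]

print(vectors.shape)
```

The same word always maps to the same vector, so both occurrences of "cat" produce identical rows.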

## Embedding layers

* Example: map amino acid names to 2D space
* Which amino acids are most similar to tryptophan (W)?

<img src=figures/aa_embed.png></img>

## The functional API in Keras

* https://keras.io/guides/functional_api
* Sequential() is quite simple, but limited
* What if we want to have multiple input/output layers?
* What if we want a model that is not just a linear sequence of layers?

<img src="figures/functional_api_40_0.png">
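As a minimal sketch of what the functional API allows, the model below has two inputs and two outputs, something `Sequential()` cannot express. All layer sizes and names are hypothetical and do not correspond to the figure above.

```python
import numpy as np
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Concatenate, Dense

# Two hypothetical inputs with different numbers of features
input_a = Input(shape=(3,), name="input_a")
input_b = Input(shape=(5,), name="input_b")

# Each layer is called on a tensor and returns a new tensor
branch_a = Dense(4, activation="relu")(input_a)
branch_b = Dense(4, activation="relu")(input_b)
merged = Concatenate()([branch_a, branch_b])

# Two output heads sharing the merged representation
class_output = Dense(2, activation="softmax", name="class_output")(merged)
value_output = Dense(1, activation=None, name="value_output")(merged)

model = Model(inputs=[input_a, input_b], outputs=[class_output, value_output])

# Run a dummy batch of 6 samples through the untrained model
probs, values = model.predict([np.zeros((6, 3)), np.zeros((6, 5))], verbose=0)
print(probs.shape, values.shape)
```

The model is defined by its input and output tensors, so any graph of layers connecting them is allowed, not just a linear stack.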

## Exercise 2 (reprise)

* Remember the XOR classifier?
* Can you apply some of the things we have learned today to yesterday's XOR classifier?
* Do they help?

## Exercise 3: how do we build a regressor?

* We have only seen classifiers so far
* What are some things we need to change to predict continuous values?
* Check the exercise notebook!