# A brief introduction to deep learning
**By: Santiago Hincapie-Potes**

## Outline
1. Overview
2. Applications
3. Training Deep Neural Net
    + Neural what?
    + Vanishing gradient problem 
    + Transfer Learning
    + Regularization

## What is deep learning?
Deep learning is a class of machine learning algorithms that:
* Use a cascade of multiple layers of nonlinear processing units for feature extraction and transformation. Each successive layer uses the output from the previous layer as input.
* Learn in supervised (e.g., classification) and/or unsupervised (e.g., pattern analysis) manners.
* Learn multiple levels of representations that correspond to different levels of abstraction; the levels form a hierarchy of concepts.
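
As a rough illustration of the "cascade of layers" idea, here is a minimal pure-Python sketch of a forward pass in which each layer consumes the output of the previous one. The helper names (`dense`, `forward`) and the weight values are made up for this example; real networks learn their weights from data.

```python
import math

def dense(inputs, weights, biases, activation):
    """Apply one layer: a weighted sum per unit, then a nonlinearity."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

# Hypothetical weights for a 2 -> 2 -> 1 network (illustrative values only).
hidden_w, hidden_b = [[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]
output_w, output_b = [[1.0, 1.0]], [0.0]

def forward(x):
    h = dense(x, hidden_w, hidden_b, math.tanh)     # first level of representation
    return dense(h, output_w, output_b, math.tanh)  # built on top of the first

y = forward([1.0, 1.0])
print(y)
```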

## Why deep learning?

![](img/d2.png)

![](img/d1.png)

## Why now?

![](img/d3.png)

### Further Reading
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444. https://doi.org/10.1038/nature14539


Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61, 85–117. https://doi.org/10.1016/j.neunet.2014.09.003

# Applications
* Automatic speech recognition
* Image recognition
* Natural language processing

## OK, and in biology?
+ [Drug discovery](https://doi.org/10.1016/j.drudis.2018.01.039)
+ [Biomarker discovery](http://www.aging.ai/)
+ [Proteomics](https://github.com/tavanaei/Cancer-Suppressor-Gene-Deep-Learning)
+ [Metabolomics](https://pubs.acs.org/doi/full/10.1021/acs.jproteome.7b00595)
+ Genomics
    * [Variant calling](https://github.com/google/deepvariant)
    * [Gene expression](https://www.biorxiv.org/content/early/2015/12/15/034421)
    * [Predicting enhancers and regulatory regions](https://www.nature.com/articles/nmeth.3547)
    * [Non-coding RNA](https://link.springer.com/article/10.1007%2Fs13721-016-0129-2)
    * [Methylation](https://www.nature.com/articles/srep19598)
+ [Systems biology](https://www.nature.com/articles/nmeth.4627/)

### More information
Ching, T., Himmelstein, D. S., Beaulieu-Jones, B. K., Kalinin, A. A., Do, B. T., Way, G. P., … Greene, C. S. (2017). Opportunities And Obstacles For Deep Learning In Biology And Medicine. Cold Spring Harbor Laboratory. https://doi.org/10.1101/142760


Angermueller, C., Pärnamaa, T., Parts, L., & Stegle, O. (2016). Deep learning for computational biology. Molecular Systems Biology, 12(7), 878. https://doi.org/10.15252/msb.20156651


Ravi, D., Wong, C., Deligianni, F., Berthelot, M., Andreu-Perez, J., Lo, B., & Yang, G.-Z. (2017). Deep Learning for Health Informatics. IEEE Journal of Biomedical and Health Informatics, 21(1), 4–21. https://doi.org/10.1109/jbhi.2016.2636665

# Training Deep Neural Net

## Neural Nets
![](img/d4.png)

## Train neural nets
* Gradient Descent
* Backpropagation

### Gradient Descent
![](img/d5.png)
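
Gradient descent repeatedly moves the parameters a small step against the gradient of the loss. A minimal sketch on a one-dimensional quadratic, assuming a hand-written gradient and an illustrative learning rate (the function name `gradient_descent` is made up for this example):

```python
def gradient_descent(grad, theta, learning_rate=0.1, steps=100):
    """Repeat theta <- theta - learning_rate * grad(theta) for a fixed number of steps."""
    for _ in range(steps):
        theta = theta - learning_rate * grad(theta)
    return theta

# Toy loss J(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta_min = gradient_descent(lambda t: 2.0 * (t - 3.0), theta=0.0)
print(theta_min)  # close to the minimum at theta = 3
```

With a learning rate that is too large the same loop diverges, which is one reason the optimizers in the next section are worth knowing.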

### Backpropagation
![](img/d6.png)

![](img/d7.png)
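
Backpropagation is the chain rule applied layer by layer, from the loss back to each weight. The sketch below works it out by hand for a toy network with one tanh unit followed by one sigmoid unit, then checks the result against finite differences. The network, the squared-error loss and all values are assumptions chosen to keep the example small.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w1, w2, x, target):
    h = math.tanh(w1 * x)
    y = sigmoid(w2 * h)
    return 0.5 * (y - target) ** 2

def backprop(w1, w2, x, target):
    # Forward pass: keep the intermediate values.
    h = math.tanh(w1 * x)
    y = sigmoid(w2 * h)
    # Backward pass: propagate the error from the output to each weight.
    dL_dy = y - target
    dL_dz2 = dL_dy * y * (1.0 - y)   # sigmoid'(z) = y * (1 - y)
    dL_dw2 = dL_dz2 * h
    dL_dh = dL_dz2 * w2
    dL_dz1 = dL_dh * (1.0 - h * h)   # tanh'(z) = 1 - tanh(z)^2
    dL_dw1 = dL_dz1 * x
    return dL_dw1, dL_dw2

w1, w2, x, target = 0.5, -0.3, 1.5, 1.0
g1, g2 = backprop(w1, w2, x, target)

# Numerical check with central differences.
eps = 1e-6
num_g1 = (loss(w1 + eps, w2, x, target) - loss(w1 - eps, w2, x, target)) / (2 * eps)
num_g2 = (loss(w1, w2 + eps, x, target) - loss(w1, w2 - eps, x, target)) / (2 * eps)
print(g1, num_g1)
print(g2, num_g2)
```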

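```python
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

def get_model():
    """Build a small fully connected classifier: 784 inputs, 10 classes."""
    tf.keras.backend.clear_session()  # drop state left over from earlier models
    model = Sequential()
    model.add(Dense(100, activation='tanh', input_dim=784))
    model.add(Dense(200, activation='tanh'))
    model.add(Dense(10, activation='softmax'))  # one probability per class
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

model = get_model()
model.summary()

# Recompile to train with plain stochastic gradient descent instead of Adam.
model.compile(loss='categorical_crossentropy',
              optimizer='sgd',
              metrics=['accuracy'])

# `data` is assumed to hold one-hot encoded train/test splits loaded elsewhere.
# Fit on the training split and keep the test split for validation only.
model.fit(data.x_train, data.y_train, epochs=40, batch_size=64,
          validation_data=(data.x_test, data.y_test))
```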

## Faster Optimizers
* Momentum optimization
* Nesterov Accelerated Gradient
* AdaGrad
* RMSProp
* Adam Optimization

### Momentum optimization
$$ v_t = \gamma v_{t-1} + \nu \nabla_{\theta} J(\theta) $$
$$ \theta = \theta - v_t $$
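
The two update equations translate almost line by line into code. A minimal sketch on a toy quadratic loss, where `lr` plays the role of $\nu$ and `gamma` the role of $\gamma$ (the function name and the values are illustrative assumptions):

```python
def momentum_step(theta, v, grad, lr=0.1, gamma=0.9):
    """One momentum update: accumulate a velocity, then move by it."""
    v = gamma * v + lr * grad(theta)  # v_t = gamma * v_{t-1} + nu * grad J(theta)
    theta = theta - v                 # theta = theta - v_t
    return theta, v

grad = lambda t: 2.0 * (t - 3.0)      # gradient of J(theta) = (theta - 3)^2
theta, v = 0.0, 0.0
for _ in range(300):
    theta, v = momentum_step(theta, v, grad)
print(theta)  # oscillates around 3 at first, then settles close to it
```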


### Nesterov Accelerated Gradient
$$ v_t = \gamma v_{t-1} + \nu \nabla_{\theta} J(\theta - \gamma v_{t-1}) $$
$$ \theta = \theta - v_t $$
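
The only change with respect to plain momentum is where the gradient is evaluated: at the look-ahead point $\theta - \gamma v_{t-1}$ instead of at $\theta$. A minimal sketch on a toy quadratic, with `lr` standing for $\nu$ (names and values are illustrative assumptions):

```python
def nesterov_step(theta, v, grad, lr=0.1, gamma=0.9):
    """One Nesterov update: measure the gradient slightly ahead, in the direction of the momentum."""
    lookahead = theta - gamma * v
    v = gamma * v + lr * grad(lookahead)
    return theta - v, v

grad = lambda t: 2.0 * (t - 3.0)      # gradient of J(theta) = (theta - 3)^2
theta, v = 0.0, 0.0
for _ in range(200):
    theta, v = nesterov_step(theta, v, grad)
print(theta)  # close to the minimum at theta = 3
```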


![](http://ruder.io/content/images/2016/09/contours_evaluation_optimizers.gif)

**Stochastic gradient descent**
```python
from tensorflow.keras import optimizers

optimizers.SGD(learning_rate=0.01)
```
**Momentum optimization**
```python
optimizers.SGD(learning_rate=0.01, momentum=0.9)
```
**Nesterov Accelerated Gradient**
```python
optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
```
**AdaGrad**
```python
optimizers.Adagrad(learning_rate=0.01)
```
**RMSProp**
```python
optimizers.RMSprop(learning_rate=0.001, rho=0.9)
```
**Adam Optimization**
```python
optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, amsgrad=False)
```

**How to use**
```python
# Build the optimizer object, then hand it to compile().
sgd = optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
model.compile(loss='mean_squared_error', optimizer=sgd)
```

## Vanishing Gradient Problem
During backpropagation, gradients often get smaller and smaller as the algorithm progresses down to the lower layers, so the weights of those layers are barely updated.

It was only around 2010 that significant progress was made in understanding this problem.

![](img/d8.png)
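
A back-of-the-envelope way to see the effect: the derivative of the logistic function is at most 0.25, so each sigmoid layer the gradient passes through can shrink it by a factor of four or more. The sketch below assumes a toy chain with one unit per layer, the same weight everywhere and a pre-activation of 0, which is a strong simplification of a real network.

```python
import math

def sigmoid_prime(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def gradient_scale(depth, weight=1.0, z=0.0):
    """Factor applied to a gradient after flowing back through `depth` sigmoid layers."""
    return (weight * sigmoid_prime(z)) ** depth

for depth in (1, 5, 10, 20):
    print(depth, gradient_scale(depth))
```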

### Xavier and He Initialization
$$ Var(w_i) = \frac{2}{n_{in} + n_{out}} $$

| Activation function | Uniform distribution $[-r, r]$                 | Normal distribution (mean 0, standard deviation $\sigma$) |
|---------------------|------------------------------------------------|-----------------------------------------------------------|
| Logistic            | $r = \sqrt{\frac{6}{n_{in} + n_{out}}}$        | $\sigma = \sqrt{\frac{2}{n_{in} + n_{out}}}$              |
| $\tanh$             | $r = 4\sqrt{\frac{6}{n_{in} + n_{out}}}$       | $\sigma = 4\sqrt{\frac{2}{n_{in} + n_{out}}}$             |
| ReLU and variants   | $r = \sqrt{2}\sqrt{\frac{6}{n_{in} + n_{out}}}$| $\sigma = \sqrt{2}\sqrt{\frac{2}{n_{in} + n_{out}}}$      |

```python
model.add(Dense(64,
                activation='sigmoid',
                # 'he_uniform' also works; the Keras default is 'glorot_uniform' (Xavier)
                kernel_initializer='he_normal'))
```
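
As a sanity check on the logistic row, sampling weights uniformly in $[-r, r]$ with $r = \sqrt{6 / (n_{in} + n_{out})}$ should give a variance of $r^2 / 3 = 2 / (n_{in} + n_{out})$. A small pure-Python sketch; the helper name `glorot_uniform` and the layer sizes are chosen for this example only.

```python
import math
import random

def glorot_uniform(n_in, n_out, rng):
    """Sample an n_in x n_out weight matrix uniformly in [-r, r]."""
    r = math.sqrt(6.0 / (n_in + n_out))
    return [[rng.uniform(-r, r) for _ in range(n_out)] for _ in range(n_in)]

rng = random.Random(0)
n_in, n_out = 200, 300
limit = math.sqrt(6.0 / (n_in + n_out))
weights = glorot_uniform(n_in, n_out, rng)

flat = [w for row in weights for w in row]
empirical_var = sum(w * w for w in flat) / len(flat)  # the mean is approximately 0
target_var = 2.0 / (n_in + n_out)
print(empirical_var, target_var)
```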

**Initialization is an active research field**

### Nonsaturating Activation Functions
One of the insights in the 2010 paper by Glorot and Bengio was that the vanishing/exploding gradients problems were in part due to a poor choice of activation function.

* Until then, most people had assumed that if Mother Nature had chosen to use roughly sigmoid activation functions in biological neurons, they must be an excellent choice.

* ReLU rocks! It does not saturate for positive values and is cheap to compute.

* Dying ReLUs :c<br>
During training, some neurons effectively die, meaning they stop outputting anything other than 0. In some cases, you may find that half of your network's neurons are dead, especially if you used a large learning rate.

* To avoid this, use a variant: leaky ReLU, SeLU or eLU.

**Leaky ReLU**
```python
keras.activations.relu(x, alpha=0.1, max_value=None, threshold=0.0)
```

**SeLU**
```python
keras.activations.selu(x)  # scale and alpha are fixed constants, so x is the only argument
```

**eLU**
```python
keras.activations.elu(x, alpha=1.0)
```
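
To make the difference between these variants concrete, here is a plain-Python sketch of the scalar functions. Unlike ReLU, each variant keeps a non-zero output (and gradient) for negative inputs, which is what keeps neurons from dying. The SELU constants are the standard fixed values; the `alpha` defaults for the other two are illustrative.

```python
import math

def relu(z):
    return max(0.0, z)

def leaky_relu(z, alpha=0.1):
    return z if z > 0 else alpha * z  # small slope instead of a flat zero

def elu(z, alpha=1.0):
    return z if z > 0 else alpha * (math.exp(z) - 1.0)  # smooth, saturates at -alpha

SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(z):
    return SELU_SCALE * elu(z, SELU_ALPHA)  # a scaled ELU with fixed constants

for z in (-2.0, 0.0, 2.0):
    print(z, relu(z), leaky_relu(z), elu(z), selu(z))
```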


#### Use normalized data
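One common way to normalize inputs is to standardize each feature to zero mean and unit variance, using statistics computed on the training split only. A minimal sketch for a single feature column; the function names and the data are made up for this example.

```python
import math

def fit_standardizer(values):
    """Compute the mean and standard deviation of one training feature."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance)
    if std == 0.0:
        std = 1.0  # constant feature: avoid dividing by zero
    return mean, std

def standardize(values, mean, std):
    return [(v - mean) / std for v in values]

train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 10.0]
mean, std = fit_standardizer(train)            # statistics come from the training data
train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)     # the test data reuses the same statistics
print(train_scaled, test_scaled)
```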

#### Shuffle data
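When shuffling, inputs and labels have to be permuted together so that each example keeps its label. A minimal sketch with a hypothetical helper `shuffled_pairs`; Keras' `fit` also shuffles the training data before each epoch by default (`shuffle=True`).

```python
import random

def shuffled_pairs(xs, ys, seed=None):
    """Return xs and ys reordered with the same random permutation."""
    order = list(range(len(xs)))
    random.Random(seed).shuffle(order)
    return [xs[i] for i in order], [ys[i] for i in order]

xs = list(range(10))
ys = [x % 2 for x in xs]          # toy labels: parity of the input
sx, sy = shuffled_pairs(xs, ys, seed=0)
print(sx)
print(sy)
```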

## Regularization

### Dropout
![](img/d10.png)

```python
model.add(Dense(200, activation='tanh'))
model.add(Dropout(rate=0.5))  # drop half of the units in training; convnets usually use a lower rate
```
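
What the layer does can be sketched in a few lines: during training each activation is zeroed with probability `rate`, and the survivors are scaled by `1 / (1 - rate)` so the expected value is unchanged; at inference time the layer does nothing. This is a simplified pure-Python version of that idea, not the Keras implementation.

```python
import random

def dropout(activations, rate, rng, training=True):
    """Inverted dropout on a list of activations."""
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(42)
train_out = dropout([1.0] * 10000, rate=0.5, rng=rng)
test_out = dropout([1.0] * 4, rate=0.5, rng=rng, training=False)
print(sum(train_out) / len(train_out))  # the mean stays close to 1.0
print(test_out)
```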

### Early Stopping
![](img/d9.png)

In Keras, early stopping is available as the `keras.callbacks.EarlyStopping` callback.
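
The logic behind the callback is simple: track the best validation loss seen so far and stop once it has not improved for `patience` epochs in a row. The sketch below reimplements that rule in plain Python on a made-up loss curve. `patience` and `min_delta` mirror the names of the Keras arguments, but the function itself is hypothetical.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses."""
    best, best_epoch, wait = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best, best_epoch, wait = loss, epoch, 0   # improvement: reset the counter
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch              # no improvement for `patience` epochs
    return len(val_losses) - 1, best_epoch            # ran out of epochs first

# Made-up validation curve: improves, then starts to overfit.
val_losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.75, 0.9]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch, best_epoch)
```

In Keras the same behaviour comes from passing the callback to `fit` through its `callbacks` argument; setting `restore_best_weights=True` additionally rolls the model back to the best epoch.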

### Batch normalization
To increase the stability of a neural network, batch normalization normalizes the output of a previous activation layer by subtracting the batch mean and dividing by the batch standard deviation.


Batch normalization also lets each layer of a network learn a little more independently of the other layers.
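
For a single feature, the training-time computation can be sketched as follows: subtract the batch mean, divide by the batch standard deviation, then apply a learned scale `gamma` and shift `beta`. The small `eps` guards against division by zero. This is a simplified sketch that leaves out the moving averages a real layer keeps for inference.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-3):
    """Normalize one feature across a batch, then rescale and shift."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

batch = [1.0, 2.0, 3.0, 4.0, 5.0]
normalized = batch_norm(batch)
norm_mean = sum(normalized) / len(normalized)
norm_var = sum((x - norm_mean) ** 2 for x in normalized) / len(normalized)
print(norm_mean, norm_var)  # approximately 0 and 1
```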

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, BatchNormalization, Dropout
from tensorflow.keras.optimizers import SGD

# instantiate model
model = Sequential()

# we can think of this chunk as the input layer
model.add(Dense(64, input_dim=14))
model.add(BatchNormalization())
model.add(Activation('tanh'))
model.add(Dropout(0.5))

# we can think of this chunk as the hidden layer
model.add(Dense(64, kernel_initializer='he_uniform'))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.5))

# we can think of this chunk as the output layer
model.add(Dense(2))
model.add(BatchNormalization())
model.add(Activation('softmax'))

# setting up the optimization of our weights
sgd = SGD(learning_rate=0.1, momentum=0.9, nesterov=True)
model.compile(loss='binary_crossentropy', optimizer=sgd)
model.summary()
```

```
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_8 (Dense)              (None, 64)                960       
_________________________________________________________________
batch_normalization_7 (Batch (None, 64)                256       
_________________________________________________________________
activation_7 (Activation)    (None, 64)                0         
_________________________________________________________________
dropout_4 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_9 (Dense)              (None, 64)                4160      
_________________________________________________________________
batch_normalization_8 (Batch (None, 64)                256       
_________________________________________________________________
activation_8 (Activation)    (None, 64)                0         
__________
```

# Tomorrow:
* ConvNets
* Transfer learning