# Science of Science Summer School (S4) 2021
## Day 4: Deep Learning
### Neural networks, optimization
- Daniel E. Acuna, School of Information, Syracuse University

# Object recognition problem
- A bit of neuroscience.
- From logistic regression to neural networks.
- The perceptron.
- Multi-layer perceptron.
- Training neural networks.
- Demo with multi-layer perceptron.

# The object recognition problem
- Detect an object present in an image.  
<br>
<center><img src="./images/unit-10/unit-10-0_csordl1.png" width="100%" align="center"></center>  

<br>
<center>Caltech101 dataset (year 2003)</center>
<center>101 categories of objects, each category has 40 to 800 images</center>


# A high dimensional problem
- Each image is 300 x 200 x 3 pixels (width x height x channels.)
  - 180,000 "raw" features.  
  
- Relationship between raw features and classes is highly non-linear.
  - Spatial features that could appear anywhere on the image.
  - Local features (e.g., shapes) vs global features (e.g., contrast.)
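
The feature count above can be checked with a minimal NumPy sketch. The random array is only a stand-in for an image (an assumption, not Caltech101 data), and it uses NumPy's usual height x width x channels ordering.

```python
import numpy as np

# Hypothetical stand-in for one 300 x 200 RGB image (height x width x channels).
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(200, 300, 3), dtype=np.uint8)

# Flattening turns every pixel value into one "raw" feature.
raw_features = image.reshape(-1)
print(raw_features.shape)  # (180000,)
```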

# A high dimensional problem (2)
<br>
<center><img src="./images/unit-10/unit-10-0_csordl2.png" width="70%" align="center"></center>
<br>
<center><sup>Credit: https://www.st-andrews.ac.uk/~www_pa/Scots_Guide/info/signals/pixels/</sup></center>

# A problem with logistic regression
- We are trying to learn:  

$$p(y \mid X) = \frac{1}{1 + \exp(-(\theta_0 + \sum_{j>0}{x_j\theta_j}))}$$  

- So the model captures only a linear relationship between pixels $x$ and the log-odds of class $y$.
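
A minimal sketch of the formula above, with made-up values for `x`, `theta0`, and `theta` (all assumptions for illustration). Converting the predicted probability back to log-odds recovers exactly the linear term, which is why the model cannot represent non-linear pixel interactions on its own.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, theta0, theta):
    """p(y = 1 | x) for logistic regression with bias theta0 and weights theta."""
    return sigmoid(theta0 + x @ theta)

# Illustrative values only.
x = np.array([0.5, -1.0, 2.0])
theta = np.array([1.0, 0.5, -0.25])
theta0 = 0.0

p = predict_proba(x, theta0, theta)
log_odds = np.log(p / (1.0 - p))   # equals theta0 + x @ theta
print(p, log_odds)
```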

# Classic data science
- Features fed into the algorithms are meaningful and provided by experts.  

- Machine learning algorithms only provide 1 or 2 step transformations.  

- Complexity of problems is relatively small.

# A bit of neuroscience
- Deep learning takes loose inspiration from how the brain works.  

- At an architectural level, the brain works by combining specialized neurons in hierarchies.  

- E.g., the visual system is organized into several areas (around 8.)

<center><img src="./images/unit-10/unit-10-0_csordl3.png" width="80%" align="center"></center>

<center><img src="./images/unit-10/unit-10-0_csordl4.png" width="100%" align="center"></center>

# A bit of neuroscience (2)
- The brain has relatively "simple" processing units: neurons. <br>   
<br>
<center><img src="./images/unit-10/unit-10-0_csordl5.png" width="80%" align="center"></center>
<br>
<center><sup>Credit: http://www.intechopen.com/source/html/39067/media/image1.png</sup></center>

<center><img src="./images/unit-10/unit-10-0_csordl6.png" width="80%" align="center"></center>  

<center><img src="./images/unit-10/unit-10-0_csordl7.png" width="80%" align="center"></center>  
$$p(y \mid X) = \frac{1}{1 + \exp(-(\theta_0 + \sum_{j>0}{x_j\theta_j}))}$$  
**<center>A model for the probability of an action potential!</center>**

# Of course reality is a lot more complicated
<br>
<center><img src="./images/unit-10/unit-10-0_csordl8.png" width="60%" align="center"></center>

# Of course reality is a lot more complicated (2)
<br>
<center><img src="./images/unit-10/unit-10-0_csordl8_1.gif" width="60%" align="center"></center>

# The general idea of artificial neural networks
- Simple processing units with linear or non-linear functions $f$.  
<br>
<center><img src="./images/unit-10/unit-10-0_csordl9.png" width="50%" align="center"></center>

# The general idea of artificial neural networks (2)
- Multiple units belong to layers and those layers are interconnected to other layers.    
<br>
<center><img src="./images/unit-10/unit-10-0_csordl10.png" width="70%" align="center"></center>

# A highly non-linear function
- The output now depends on nested functions:  


$$y = \underbrace{f_3\left(\;\underbrace{f_2\left(\;\underbrace{f_1(x)}_{\text{first layer}} + b_0\right)}_{\text{second layer}} + b_1\right) + b_2}_{\text{third layer (nested non-linearities)}}$$
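
A minimal sketch of this nesting, assuming each layer is a matrix product plus a bias followed by a sigmoid. The layer sizes and random weights are hypothetical and chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Hypothetical layer sizes: 4 inputs -> 5 units -> 3 units -> 1 output.
sizes = [4, 5, 3, 1]
weights = [rng.normal(size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]

def forward(x):
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)   # each layer wraps the output of the previous one
    return a

x = rng.normal(size=4)
y = forward(x)
print(y)
```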

# Advantages and disadvantages of ANN
- Pros:
  - Can fit complex non-linear relationships.
  - Easy to try different types of layers.
  - Inspired by a very effective machine (the brain!)  
  
- Cons:
  - Many parameters.
  - Easy to overfit.
  - Needs lots of training data.
  - Takes a long time to fit.
  - The use of brain research is really minimal.

# How to train a neural network?
- We define a loss function and try to find the parameters $\Theta$ that minimize it:  

$$\arg\min_{\Theta} {l(\;f_\Theta(X),y)}$$

- For example:  

$$l(\;f_\Theta(X),y) = \sum(\;f_\Theta(X_i) - y_i)^2$$  
<br>
<center>where $f_\Theta(X)$ is the function represented by the entire network.</center>
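
As a small illustration of the quadratic loss, the sketch below uses made-up predictions and targets (assumptions) in place of the output of a real network.

```python
import numpy as np

def quadratic_loss(predictions, targets):
    """Sum of squared differences between predictions and targets."""
    return np.sum((predictions - targets) ** 2)

# Illustrative values standing in for f_Theta(X_i) and y_i.
predictions = np.array([0.9, 0.2, 0.6])
targets = np.array([1.0, 0.0, 1.0])

loss = quadratic_loss(predictions, targets)
print(loss)  # 0.01 + 0.04 + 0.16, up to floating-point rounding
```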


# Optimizing the loss function
- There is no closed-form solution for minimizing the loss function:  

$$\arg\min_{\Theta} {l(\;f_\Theta(X),y)}$$  

- In all but the trivial cases, the loss function is *non-convex* (hard to optimize.)  

- A very effective approach is to perform stochastic gradient descent.

# Gradient descent
- Gradient descent is a simple approach to iteratively minimize a loss function.  

- Intuitively, we want to follow the *negative* of the gradient to find the minimum of a loss function:  

$$\Theta_{t+1} = \Theta_{t} - \eta \nabla l(\;f_\Theta(X),y)$$ 
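
A minimal sketch of this update rule on a one-parameter toy loss, $(\theta - 3)^2$. The loss, starting point, and learning rate are assumptions picked so that the minimum is known in advance.

```python
def grad(theta):
    # Derivative of the toy loss (theta - 3)^2.
    return 2.0 * (theta - 3.0)

theta = 0.0   # starting point
eta = 0.1     # learning rate
for _ in range(100):
    theta = theta - eta * grad(theta)   # step against the gradient

print(theta)  # close to the minimizer, 3
```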

# Gradient descent: example
<br>
<center><img src="./images/unit-10/unit-10-0_csordl11.png" width="100%" align="center"></center>

# Gradient descent: example (2)
<br>
<center><img src="./images/unit-10/unit-10-0_csordl12.png" width="100%" align="center"></center>

# Gradient descent: example (3)
<br>
<center><img src="./images/unit-10/unit-10-0_csordl13.png" width="100%" align="center"></center>

# Gradient descent: example (4)
<br>
<center><img src="./images/unit-10/unit-10-0_csordl14.png" width="100%" align="center"></center>

# Gradient descent: example (5)
<br>
<center><img src="./images/unit-10/unit-10-0_csordl15.png" width="100%" align="center"></center>

# Gradient descent: problems
- It finds the global optimum when the function is convex (e.g., a parabola) and the learning rate is appropriate.
- But:
<br>
<center><img src="./images/unit-10/unit-10-0_csordl16.png" width="50%" align="center"></center>

# Stochastic gradient descent
- If we look at a fraction of the training data, we only observe a (noisy) sample of the loss function.  
<br>
<center><img src="./images/unit-10/unit-10-0_csordl17.png" width="80%" align="center"></center>

# Stochastic gradient descent (2)
- If we look at a fraction of the training data, we only observe a (noisy) sample of the loss function.  
<br>
<center><img src="./images/unit-10/unit-10-0_csordl18.png" width="80%" align="center"></center>

# Stochastic gradient descent (3)
- If we look at a fraction of the training data, we only observe a (noisy) sample of the loss function.  

- Stochastic gradient descent is good for big data (we only need pieces at a time.)  

- Stochastic gradient descent can escape local minima.  

- As long as we can compute gradients, we can use any function we want!
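
A minimal sketch of mini-batch stochastic gradient descent, assuming a synthetic linear problem ($y = 2x + 1$ plus noise) so the answer is known. The data, batch size, and learning rate are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): y = 2 * x + 1 plus a little noise.
X = rng.uniform(-1.0, 1.0, size=1000)
y = 2.0 * X + 1.0 + 0.01 * rng.normal(size=1000)

w, b = 0.0, 0.0
eta, batch_size = 0.1, 32
for step in range(2000):
    idx = rng.integers(0, len(X), size=batch_size)   # random mini-batch
    err = (w * X[idx] + b) - y[idx]
    # Gradient of the mean squared error, estimated on the mini-batch only.
    w -= eta * 2.0 * np.mean(err * X[idx])
    b -= eta * 2.0 * np.mean(err)

print(w, b)  # near 2 and 1
```

Each step touches only 32 of the 1000 points, which is what makes the approach practical when the full dataset does not fit in memory.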

# Worked-out example
- A simple neural network with one neuron and one layer.  

- More assumptions: sigmoid activation $f$, and the activation is the output ($a = y$).  
<br>
<center><img src="./images/unit-10/unit-10-0_csordl19.png" width="70%" align="center"></center>

# Worked-out example (2)
- A simple neural network with one neuron and one layer.  

- More assumptions: sigmoid activation $f$, and the activation is the output ($a = y$).  
<br>
<center><img src="./images/unit-10/unit-10-0_csordl20.png" width="70%" align="center"></center>  
**<center>Similar to Logistic Regression!</center>**


# Worked-out example (3)
- Using a quadratic loss function, let’s work out the stochastic gradient descent update rule:  

$$\Theta_{t+1} = \Theta_{t} - \eta \nabla l(\;f_\Theta(X),y)$$    

$$l(\;f_\Theta(X),y) = (\;f(z) - y)^2 = \left(\sigma\left(\sum{x_j\theta_j}\right) - y\right)^2$$  

$$\nabla{l(\;f_\Theta(X),y)} = \left[\frac{dl}{d\theta_0}, \frac{dl}{d\theta_1}, \ldots, \frac{dl}{d\theta_m}\right]$$

# Worked-out example (4)
- Let’s pick one component of the gradient:  

\begin{align}
\nabla{l(\;f_\Theta(X),y)} &= \left[\frac{dl}{d\theta_0}, \frac{dl}{d\theta_1}, \ldots, \frac{dl}{d\theta_m}\right] \\
\frac{dl}{d\theta_j} &= \frac{d\left(\sigma\left(\sum{x_j\theta_j}\right) - y\right)^2}{d\theta_j}
\end{align}

- We are trying to compute the gradient of a nested function. What should we do?

# Worked-out example (5)
<br>
<left><img src="./images/unit-10/unit-10-0_csordl21.png" width="20%" align="left"></left>

- We can apply basic calculus.  

- By representing each nested function on its own:  

$$\frac{d\left(\sigma\left(\sum{x_j\theta_j}\right) - y\right)^2}{d\theta_j} = \frac{d}{d\theta_j} l(a(z(\theta_0, \cdots, \theta_m, x)), y)$$

- Where $a$ is the activation, and $z$ is the summation.  

- Now we can simply apply the chain rule!

# Worked-out example (6)
- We can apply basic calculus.
- By representing each nested function on its own:
$$\frac{d}{d\theta_j} l(a(z(\theta_0, \cdots, \theta_m, x)), y) = \frac{dl}{da}\frac{da}{dz}\frac{dz}{d\theta_j}$$

$$
\frac{dl}{da} = 2(a - y) \qquad
\frac{da}{dz} = a(1-a) \qquad
\frac{dz}{d\theta_j} = x_j
$$

\begin{align}
\theta_j^{\text{new}} &\leftarrow \theta_j^{\text{old}} - \eta \;{2(a-y)} \;{a(1-a)} \;{x_j} \\
&\leftarrow \theta_j^{\text{old}} - \eta \;2{\left(\sigma\left(\sum{x_j\theta_j}\right) - y\right)} \;{\sigma\left(\sum{x_j\theta_j}\right)} \;{\left(1 - \sigma\left(\sum{x_j\theta_j}\right)\right)} {x_j}
\end{align}
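
The chain-rule gradient $2(a-y)\,a(1-a)\,x_j$ for the single sigmoid neuron can be checked numerically. The sketch below compares it against central finite differences and then applies one update step. The input, target, weights, and learning rate are made-up values for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(theta, x, y):
    return (sigmoid(x @ theta) - y) ** 2

def gradient(theta, x, y):
    a = sigmoid(x @ theta)
    # dl/da * da/dz * dz/dtheta_j, for every j at once
    return 2.0 * (a - y) * a * (1.0 - a) * x

# Illustrative values; x[0] = 1 plays the role of the bias input.
x = np.array([1.0, 0.5, -1.5])
y = 1.0
theta = np.array([0.1, -0.2, 0.3])

analytic = gradient(theta, x, y)

# Central finite differences, one coordinate at a time.
eps = 1e-6
numeric = np.array([
    (loss(theta + eps * e, x, y) - loss(theta - eps * e, x, y)) / (2.0 * eps)
    for e in np.eye(len(theta))
])

eta = 0.5
theta_new = theta - eta * analytic   # one gradient descent step
print(analytic, numeric)
print(loss(theta, x, y), loss(theta_new, x, y))
```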

# Backpropagation
- The previous example is the basis of stochastic gradient descent for several layers.
- **Backpropagation** is a general algorithm that exploits the structure of the chain rule when applied to layers.
- To understand backpropagation, we need to understand middle layers:  
<br>
<center><img src="./images/unit-10/unit-10-0_csordl22.png" width="60%" align="center"></center>

# Backpropagation (2)
- $\theta_{ij}^{L-2}$: connection from neuron $j$ in layer $L-2$ to neuron $i$ in layer $L-1$.

- Representing all connections from $L-2$ to $L-1$ as one matrix $\theta^{L-2}$, we can compute all activations of layer $L-1$ as:  

$$a^{L-1} = f(\theta^{L-2} a^{L-2})$$  
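
A minimal sketch of the matrix form, assuming $f$ is a sigmoid and using hypothetical sizes (3 neurons in layer $L-2$, 2 neurons in layer $L-1$). It checks that the matrix product matches the neuron-by-neuron sum.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# theta[i, j]: connection from neuron j (layer L-2) to neuron i (layer L-1).
theta = np.array([[0.2, -0.4,  0.1],
                  [0.5,  0.3, -0.2]])
a_prev = np.array([1.0, 0.0, 0.5])    # activations of layer L-2

a_next = sigmoid(theta @ a_prev)      # all activations of layer L-1 at once

# Same computation, one neuron at a time.
a_loop = np.array([
    sigmoid(sum(theta[i, j] * a_prev[j] for j in range(3)))
    for i in range(2)
])
print(a_next, a_loop)
```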

<br>
<center><img src="./images/unit-10/unit-10-0_csordl22.png" width="50%" align="center"></center>

# Backpropagation (3)
- Then the gradient of a connection $\theta_{ij}^{L-2}$ with respect to the loss function will be:  

$$\frac{dl}{d\theta_{ij}^{L-2}} = \frac{dz_i^{L-1}}{d\theta_{ij}^{L-2}}\frac{da_i^{L-1}}{dz_i^{L-1}} \sum{\frac{dz^L}{da_i^{L-1}} \frac{da^L}{dz^L} \frac{dl}{da^L}}$$  

<br>
<center><img src="./images/unit-10/unit-10-0_csordl22.png" width="50%" align="center"></center>

# Backpropagation (4)
- Then the gradient of a connection $\theta_{jk}^{L-3}$ with respect to the loss function will be:  

$$\frac{dl}{d\theta_{jk}^{L-3}} = \frac{dz_j^{L-2}}{d\theta_{jk}^{L-3}}\frac{da_j^{L-2}}{dz_j^{L-2}} \sum{\frac{dz_i^{L-1}}{da_j^{L-2}} \frac{da_i^{L-1}}{dz_i^{L-1}}} \sum{\frac{dz^L}{da_i^{L-1}} \frac{da^L}{dz^L} \frac{dl}{da^L}} $$  

<br>
<center><img src="./images/unit-10/unit-10-0_csordl22.png" width="50%" align="center"></center>

# Backpropagation (5)
- The same structure gets repeated during the chain rule:

$$\frac{dl}{d\theta_{ij}^{L-2}} = \frac{dz_i^{L-1}}{d\theta_{ij}^{L-2}}\frac{da_i^{L-1}}{dz_i^{L-1}} \sum{\frac{dz^L}{da_i^{L-1}} \frac{da^L}{dz^L} \frac{dl}{da^L}}$$  

$$\frac{dl}{d\theta_{jk}^{L-3}} = \frac{dz_j^{L-2}}{d\theta_{jk}^{L-3}}\frac{da_j^{L-2}}{dz_j^{L-2}} \sum{\frac{dz_i^{L-1}}{da_j^{L-2}} \frac{da_i^{L-1}}{dz_i^{L-1}}} \sum{\frac{dz^L}{da_i^{L-1}} \frac{da^L}{dz^L} \frac{dl}{da^L}}$$

- Therefore, we can *backpropagate* the gradient of the loss function.  
- Notice also, that we reuse $a$ and $z$ multiple times, therefore we could precompute them.  
- Consequently, we could forward propagate $a$ and $z$ and *backpropagate* the errors.

# Backpropagation algorithm

Forward propagation
1. From layer $1$ through $L$  
  1.1 Compute $a$ and $z$

Backpropagation
1. Define $\delta_i^L = a_i^L - y_i$ (gradient of the loss function at the output)
2. From layer $L-1$ to $1$  
  2.1 Define $\delta_j^{L-1} = a_j^{L-1} (1 - a_j^{L-1}) \sum_i{\theta_{ij}^{L-1} \delta_i^L}$

The gradients will be computed as follows (the second form applies to hidden layers, $k+1 < L$):  
$$\frac{dl}{d\theta_{ij}^k} = a_j^k \, \delta_i^{k+1} = a_j^k \, a_i^{k+1} (1 - a_i^{k+1}) \sum_m{\theta_{mi}^{k+1} \delta_m^{k+2}}$$
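
A minimal sketch of the forward and backward passes for a small sigmoid network without biases. It rests on two assumptions: the loss is cross-entropy on a sigmoid output, for which the output delta is exactly $a^L - y$, and the weight gradient is taken as $a_j^k \delta_i^{k+1}$, which is what the chain rule gives for these deltas. Layer sizes, weights, and data are hypothetical. One gradient entry is checked against finite differences.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(thetas, x):
    """Return the activations of every layer, input included."""
    activations = [x]
    for theta in thetas:
        activations.append(sigmoid(theta @ activations[-1]))
    return activations

def cross_entropy(thetas, x, y):
    a = forward(thetas, x)[-1]
    return -np.sum(y * np.log(a) + (1.0 - y) * np.log(1.0 - a))

def backprop(thetas, x, y):
    activations = forward(thetas, x)
    delta = activations[-1] - y                      # delta at the output layer
    grads = [None] * len(thetas)
    for k in reversed(range(len(thetas))):
        grads[k] = np.outer(delta, activations[k])   # dl/dtheta_ij = a_j * delta_i
        if k > 0:
            a = activations[k]
            delta = a * (1.0 - a) * (thetas[k].T @ delta)   # push delta one layer back
    return grads

rng = np.random.default_rng(0)
sizes = [3, 4, 2]                                    # hypothetical layer sizes
thetas = [rng.normal(size=(n_out, n_in))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]
x = rng.normal(size=3)
y = np.array([1.0, 0.0])

grads = backprop(thetas, x, y)

# Finite-difference check on a single weight of the first layer.
eps = 1e-6
plus = [t.copy() for t in thetas]
minus = [t.copy() for t in thetas]
plus[0][1, 2] += eps
minus[0][1, 2] -= eps
numeric = (cross_entropy(plus, x, y) - cross_entropy(minus, x, y)) / (2.0 * eps)
print(numeric, grads[0][1, 2])
```

The forward pass stores every activation once, and the backward loop reuses them, which is the precomputation the algorithm relies on.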

# Demo

# Problems with classic methods
- The more layers, the more complex representations can be learned.  

- However, more layers also imply longer chains in the chain rule.  

- This produces numerical errors and very small weight updates during learning.  

- A groundbreaking paper in 2006 proposed a novel way to train deep architectures.
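
A rough illustration of why long chains lead to tiny updates, assuming sigmoid activations and ignoring the weights (which can make the product larger or smaller). The sigmoid derivative $a(1-a)$ is at most 0.25, so a chain of $n$ sigmoid layers contributes at best a factor of $0.25^n$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# The sigmoid derivative a * (1 - a) peaks at z = 0.
max_slope = sigmoid(0.0) * (1.0 - sigmoid(0.0))

# Best-case factor contributed by the activations alone in a chain of n layers.
factors = {n_layers: max_slope ** n_layers for n_layers in (2, 5, 10, 20)}
for n_layers, factor in factors.items():
    print(n_layers, factor)
```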

# Deep learning
- Also, new developments in:
  - Training techniques (2006)
  - GPU computing (2008)
  - Big data.
  - New types of architectures (new neurons and layers.)
  
- This opened the door to train increasingly more complex models.  

- Next class we will take a look at these advances.


# Other architectures
- The multi-layer perceptron loses spatial information (especially in images.)  

- Some features are independent of location.

<br>
<center><img src="./images/unit-10/unit-10-0_csordl23.png" width="80%" align="center"></center>

# Convolutional neural networks
- Create translation-invariant spatial filters (shared weights.)

<br>
<center><img src="./images/unit-10/unit-10-0_csordl24_3.png" width="90%" align="center"></center>
<br>
<center><sup>http://cs231n.github.io/convolutional-networks/</sup></center>
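
A minimal one-dimensional sketch of weight sharing, using a hand-picked edge filter and a toy signal (both assumptions). The same three weights are applied at every position, so shifting the input shifts the response, and the strongest response is the same wherever the feature appears.

```python
import numpy as np

# Hand-picked edge-detecting filter; in a real network these weights are learned.
w = np.array([1.0, 0.0, -1.0])

# Toy signal with one "bump" feature, and the same signal shifted by 2 positions.
x = np.zeros(10)
x[3:5] = 1.0
x_shifted = np.roll(x, 2)

# Slide the same weights over every position (cross-correlation, no padding).
out = np.correlate(x, w, mode="valid")
out_shifted = np.correlate(x_shifted, w, mode="valid")

print(out)          # [ 0. -1. -1.  1.  1.  0.  0.  0.]
print(out_shifted)  # [ 0.  0.  0. -1. -1.  1.  1.  0.]
```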

# Convolutional neural networks (2)
- It can learn filters similar to what is observed in nature!

<br>
<center><img src="./images/unit-10/unit-10-0_csordl25.png" width="90%" align="center"></center>

# Convolutional neural networks (3)
- Digits classification: classic MNIST problem.

<br>
<center><img src="./images/unit-10/unit-10-0_csordl26.png" width="50%" align="center"></center>
<br>
<center><sup>http://cs.stanford.edu/people/karpathy/convnetjs/demo/mnist.html</sup></center>

# Dealing with temporal data
- Sometimes we want to capture temporal invariance:
  - E.g., in chess, the next state depends only on the current state and the move played.
  - Makes learning more efficient.
  - Makes results easier to interpret.
  
<center><img src="./images/unit-10/unit-10-0_csordl27.png" width="55%" align="center"></center>

# Recurrent neural networks
<br>
<center><img src="./images/unit-10/unit-10-0_csordl28.png" width="75%" align="center"></center>

# Recurrent neural networks (2)
- Naïve recurrent neural network training leads to numerical problems:
$$y = h(A(A(A(\ldots A(y_{t-T})\ldots))))$$  

- Why? How would the chain rule apply here?  

- Other, more advanced neural networks make it easier to learn long-term dependencies.  
<center><img src="./images/unit-10/unit-10-0_csordl29.png" width="70%" align="center"></center>
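
One way to approach the chain-rule question, sketched under a strong simplifying assumption: $A$ is a linear map (here a scaled identity matrix), so the chain rule multiplies the same Jacobian $T$ times. Depending on the scale, the product either vanishes or explodes.

```python
import numpy as np

T = 50   # number of time steps the gradient travels through
norms = {}
for scale in (0.5, 1.5):
    A = scale * np.eye(3)                     # hypothetical linear recurrence
    jacobian = np.linalg.matrix_power(A, T)   # product of T identical Jacobians
    norms[scale] = np.linalg.norm(jacobian)
    print(scale, norms[scale])
```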

# Recurrent neural networks (3)
- Handwriting generation:
  http://www.cs.toronto.edu/~graves/handwriting.html  
  
  
- Generating Wikipedia articles:  
<center><img src="./images/unit-10/unit-10-0_csordl30.png" width="100%" align="center"></center>

# More complex temporal problems
<br>
<center><img src="./images/unit-10/unit-10-0_csordl31.png" width="100%" align="center"></center>  
<br>
<center>Can you think of examples for each of these cases?</center>

# More complex temporal problems (2)
<br>
<center><img src="./images/unit-10/unit-10-0_csordl32.png" width="100%" align="center"></center>  

# More complex temporal problems (3)
<br>
<center><img src="./images/unit-10/unit-10-0_csordl33.png" width="100%" align="center"></center>  

# More complex temporal problems (4)
<br>
<center><img src="./images/unit-10/unit-10-0_csordl34.png" width="100%" align="center"></center>  

# Reinforcement learning
- Given semi-supervised feedback, learn how to perform actions so as to maximize total reward.  

- Classical framework to study behavior in humans and animals.  

<br>
<center><img src="./images/unit-10/unit-10-0_csordl35.png" width="50%" align="center"></center>  

# Reinforcement learning (2)
<br>
<center><img src="./images/unit-10/unit-10-0_csordl36.png" width="100%" align="center"></center>  

# Reinforcement learning (3)
<br>
<center>http://cs.stanford.edu/people/karpathy/convnetjs/demo/rldemo.html</center>  

# Inferring intentions
![](./images/dota.jpg)

# WaveNet
<br>
<center><img src="./images/unit-10/unit-10-0_csordl38.png" width="60%" align="center"></center>  
<br>
<center><sup>https://deepmind.com/blog/wavenet-generative-model-raw-audio/</sup></center>


# WaveNet (2)
<br>
<center><img src="./images/unit-10/unit-10-0_csordl38_4.png" width="60%" align="center"></center>  
<br>
<center><sup>https://deepmind.com/blog/wavenet-generative-model-raw-audio/</sup></center>


# WaveNet (3)
<br>
<center><img src="./images/unit-10/unit-10-0_csordl38_5.png" width="90%" align="center"></center>  
<br>
<center><sup>https://deepmind.com/blog/wavenet-generative-model-raw-audio/</sup></center>


# Adversarial Neural Networks
<br>
<center><img src="./images/unit-10/unit-10-0_csordl39.png" width="90%" align="center"></center>

# Adversarial Neural Networks (2)
https://thispersondoesnotexist.com/
<center><img src="./images/unit-10/unit-10-0_csordl40.png" width="90%" align="center"></center>

# Adversarial Neural Networks (3)
<br>
<center><img src="./images/unit-10/unit-10-0_csordl41.png" width="90%" align="center"></center>

# Adversarial Neural Networks (4)
<center><img src="./images/unit-10/unit-10-0_csordl42.png" width="55%" align="center"></center>

<center><img src="./images/unit-10/unit-10-0_csordl43.png" width="100%" align="center"></center>

# Take home messages
- Neural networks are loosely inspired by how the brain works: hierarchical layers of non-linear functions.  

- Neural networks take time to train and overfit badly.  

- Neural networks need big data.  

- Neural networks are hard or impossible to interpret.  

- Backpropagation allows efficient computation of gradients.