# Model types & architectures

[Peer Herholz (he/him)](https://peerherholz.github.io/)  
Postdoctoral researcher - [NeuroDataScience lab](https://neurodatascience.github.io/) at [MNI](https://www.mcgill.ca/neuro/)/[McGill](https://www.mcgill.ca/), [UNIQUE](https://sites.google.com/view/unique-neuro-ai)  
Member - [BIDS](https://bids-specification.readthedocs.io/en/stable/), [ReproNim](https://www.repronim.org/), [Brainhack](https://brainhack.org/), [Neuromod](https://www.cneuromod.ca/), [OHBM SEA-SIG](https://ohbm-environment.org/) 

<img align="left" src="https://raw.githubusercontent.com/G0RELLA/gorella_mwn/master/lecture/static/Twitter%20social%20icons%20-%20circle%20-%20blue.png" alt="logo" title="Twitter" width="32" height="20" /> <img align="left" src="https://raw.githubusercontent.com/G0RELLA/gorella_mwn/master/lecture/static/GitHub-Mark-120px-plus.png" alt="logo" title="Github" width="30" height="20" />   &nbsp;&nbsp;@peerherholz 

<img align="right" src="https://raw.githubusercontent.com/PeerHerholz/ML-DL_workshop_SynAGE/master/lecture/static/ml-dl_workshop.png" alt="logo" title="Github" width="400" height="280" />


### A brief recap & first overview

<img align="right" src="https://raw.githubusercontent.com/PeerHerholz/ML-DL_workshop_SynAGE/master/lecture/static/AI.png" alt="logo" title="Github" width="320" height="120" />

**Artificial intelligence (AI)** is [intelligence](https://en.wikipedia.org/wiki/Intelligence) demonstrated by [machines](https://en.wikipedia.org/wiki/Machine), as opposed to the natural intelligence [displayed by humans](https://en.wikipedia.org/wiki/Human_intelligence) or [animals](https://en.wikipedia.org/wiki/Animal_cognition). Leading AI textbooks define the field as the study of ["intelligent agents"](https://en.wikipedia.org/wiki/Intelligent_agent): any system that perceives its environment and takes actions that maximize its chance of achieving its goals. Some popular accounts use the term "artificial intelligence" to describe machines that mimic "cognitive" functions that humans associate with the [human mind](https://en.wikipedia.org/wiki/Human_mind), such as "learning" and "problem solving", however this definition is rejected by major AI researchers.

[https://en.wikipedia.org/wiki/Artificial_intelligence](https://en.wikipedia.org/wiki/Artificial_intelligence)




<img align="right" src="https://raw.githubusercontent.com/PeerHerholz/ML-DL_workshop_SynAGE/master/lecture/static/AI_ML.png" alt="logo" title="Github" width="320" height="120" />

**Machine learning (ML)** is the study of computer [algorithms](https://en.wikipedia.org/wiki/Algorithm) that can improve automatically through experience and by the use of data. It is seen as a part of [artificial intelligence](https://en.wikipedia.org/wiki/Artificial_intelligence). Machine learning algorithms build a model based on sample data, known as ["training data"](https://en.wikipedia.org/wiki/Training_data), in order to make predictions or decisions without being explicitly programmed to do so. A subset of machine learning is closely related to [computational statistics](https://en.wikipedia.org/wiki/Computational_statistics), which focuses on making predictions using computers; but not all machine learning is statistical learning. The study of [mathematical optimization](https://en.wikipedia.org/wiki/Mathematical_optimization) delivers methods, theory and application domains to the field of machine learning. [Data mining](https://en.wikipedia.org/wiki/Data_mining) is a related field of study, focusing on [exploratory data analysis](https://en.wikipedia.org/wiki/Exploratory_data_analysis) through [unsupervised learning](https://en.wikipedia.org/wiki/Unsupervised_learning). Some implementations of machine learning use data and [neural networks](https://en.wikipedia.org/wiki/Neural_networks) in a way that mimics the working of a biological brain.

[https://en.wikipedia.org/wiki/Machine_learning](https://en.wikipedia.org/wiki/Machine_learning)


<img align="right" src="https://raw.githubusercontent.com/PeerHerholz/ML-DL_workshop_SynAGE/master/lecture/static/AI_ML_DL.png" alt="logo" title="Github" width="320" height="120" />

**Deep learning** (also known as deep structured learning) is part of a broader family of [machine learning](https://en.wikipedia.org/wiki/Machine_learning) methods based on [artificial neural networks](https://en.wikipedia.org/wiki/Artificial_neural_networks) with [representation learning](https://en.wikipedia.org/wiki/Representation_learning). Learning can be [supervised](https://en.wikipedia.org/wiki/Supervised_learning), [semi-supervised](https://en.wikipedia.org/wiki/Semi-supervised_learning) or [unsupervised](https://en.wikipedia.org/wiki/Unsupervised_learning). [Artificial neural networks (ANNs)](https://en.wikipedia.org/wiki/Artificial_neural_network) were inspired by information processing and distributed communication nodes in [biological systems](https://en.wikipedia.org/wiki/Biological_system). ANNs have various differences from biological [brains](https://en.wikipedia.org/wiki/Brain). Specifically, neural networks tend to be static and symbolic, while the biological brain of most living organisms is dynamic (plastic) and analogue. The adjective "deep" in deep learning refers to the use of multiple layers in the network. Early work showed that a linear [perceptron](https://en.wikipedia.org/wiki/Perceptron) cannot be a universal classifier, but that a network with a nonpolynomial activation function with one hidden layer of unbounded width can. 


[https://en.wikipedia.org/wiki/Deep_learning](https://en.wikipedia.org/wiki/Deep_learning)



- very important: **deep learning is machine learning**
    - DL is a specific subset of ML
    - structured vs. unstructured input
    - linearity
    - model architectures


- you and "the machine"
    - ML models can become better at a specific task, however they need some form of guidance
    - DL models in contrast require less human intervention

- Why the buzz? 

    - works amazingly well on structured input
    - highly flexible → universal function approximator 

- What are the challenges?

    - large number of parameters → data hungry 
    - large number of hyper-parameters → difficult to train

- When do I use it?

    - if you have highly-structured input, e.g. medical images
    - if you have a lot of data and computational resources


<img align="center" src="https://raw.githubusercontent.com/PeerHerholz/ML-DL_workshop_SynAGE/master/lecture/static/core_aspects_examples.png" alt="logo" title="Github" width="500" height="280" />

Why go `deep learning` in `neuroscience`? (all highly discussed)

- complexity of biological systems
    - integrate knowledge of biological systems in computational systems
      (excitation vs. inhibition, normalization, LIF)
    - linear-nonlinear processing
    - utilize computational systems as `model systems`

Why go `deep learning` in `neuroscience`? (all highly discussed)

- limitations of "simple models"
    - fail to capture diversity of biological systems
      (response heterogeneity, sensitivity vs. specificity, etc.)
    - fail to perform as well as biological systems

Why go `deep learning` in `neuroscience`? (all highly discussed)

- addressing the "why question"
    - why do biological systems work in the way they do
    - insights into objectives and constraints defined by evolutionary pressure

### Aim(s) of this section

- learn about basics behind deep learning, specifically artificial neural networks
- become aware of central building blocks and aspects of artificial neural networks
- get to know different model types and architectures

### Outline for this section

1. Deep learning - basics & reasoning
    - learning problems
    - representations
2. From biological to artificial neural networks
    - neurons 
    - universal function approximation
3. Components of ANNs
    - building parts
    - learning
4. ANN architectures
    - Multilayer perceptrons
    - Convolutional neural networks

### Deep learning - basics & reasoning

- as said before: `deep learning` is (a subset of) `machine learning` 
- it thus includes the core aspects we talked about in the [previous section]() and builds upon them:
    - different learning problems and resulting models/architectures
    - loss function & optimization
    - training, evaluation, validation
    - biases & problems

- this furthermore transfers to the key components you as a user have to think about
    - objective function (What is the goal?)
    - learning rule (How should weights be updated to improve the objective function?)
    - network architecture (What are the network parts and how are they connected?)
    - initialisation (How are weights initially defined?)
    - environment (What kind of data is provided for/during the learning?)

##### Learning problems

As in [machine learning]() in general, we have `supervised` & `unsupervised learning problems` again:

<img align="center" src="https://raw.githubusercontent.com/PeerHerholz/ML-DL_workshop_SynAGE/master/lecture/static/supervised_unsupervised.png" alt="logo" title="Github" width="1200" height="350" />

However, within the world of `deep learning`, we have three more `learning problems`:

- [reinforcement learning](https://en.wikipedia.org/wiki/Reinforcement_learning)

<img align="center" src="https://raw.githubusercontent.com/PeerHerholz/ML-DL_workshop_SynAGE/master/lecture/static/RL.png" alt="logo" title="Github" width="600" height="350" />

- [semi-supervised learning](https://en.wikipedia.org/wiki/Semi-supervised_learning)

<img align="center" src="https://raw.githubusercontent.com/PeerHerholz/ML-DL_workshop_SynAGE/master/lecture/static/semisupervised.png" alt="logo" title="Github" width="600" height="350" />

- [self-supervised learning](https://en.wikipedia.org/wiki/Self-supervised_learning)

<img align="center" src="https://raw.githubusercontent.com/PeerHerholz/ML-DL_workshop_SynAGE/master/lecture/static/self-supervised.png" alt="logo" title="Github" width="600" height="350" />

- depending on the data and task, these `learning problems` can be employed within a diverse set of [artificial neural network](https://en.wikipedia.org/wiki/Artificial_neural_network) architectures (most commonly):
    - [Multilayer perceptrons](https://en.wikipedia.org/wiki/Multilayer_perceptron)
    - [Convolutional neural networks](https://en.wikipedia.org/wiki/Convolutional_neural_network)
    - [Recurrent neural networks](https://en.wikipedia.org/wiki/Recurrent_neural_network)    

But why employ [artificial neural networks](https://en.wikipedia.org/wiki/Artificial_neural_network) at all?

##### The problem of variance & how representations can help

Think about all the things you as a `biological agent` do on a typical day ... Everything (or at least most things) you do appears very easy to you. Then why is it so hard for `artificial agents` to achieve a comparable `behavior` and/or `performance`?

One major problem is the `variance` of the input we encounter which subsequently makes it very hard to find appropriate `transformations` that can lead to/help to achieve `generalizable behavior`. 

How about an example? We'll keep it very simple and focus on `recognizing` a certain `category` of the natural world.

You all waited for it and now it's finally happening: cute cats! 

- let's assume we want to learn to recognize, label and predict "cats" based on a set of images that look like this

<img align="center" src="https://raw.githubusercontent.com/PeerHerholz/ML-DL_workshop_SynAGE/master/lecture/static/cat_prototype.png" alt="logo" title="Github" width="200" height="450" />

- utilizing the `models` and `approaches` we talked about so far, we would use `predetermined transformations` (`features`) of our data `X`:

<img align="center" src="https://raw.githubusercontent.com/PeerHerholz/ML-DL_workshop_SynAGE/master/lecture/static/cat_ml.png" alt="logo" title="Github" width="600" height="350" />

- this constitutes a form of [inductive bias](https://en.wikipedia.org/wiki/Inductive_bias), i.e. `assumptions` we include in the `learning problem` and thus bake into the respective `models`

- however, this is by far not the only way we could encounter a cat ... there are lots of sources of variation in our data `X`, including:

- illumination

<img align="center" src="https://raw.githubusercontent.com/PeerHerholz/ML-DL_workshop_SynAGE/master/lecture/static/cat_illumination.png" alt="logo" title="Github" width="400" height="250" />

- deformation

<img align="center" src="https://raw.githubusercontent.com/PeerHerholz/ML-DL_workshop_SynAGE/master/lecture/static/cat_deformation.png" alt="logo" title="Github" width="600" height="350" />

- occlusion

<img align="center" src="https://raw.githubusercontent.com/PeerHerholz/ML-DL_workshop_SynAGE/master/lecture/static/cat_occlusion.png" alt="logo" title="Github" width="600" height="350" />

- background clutter

<img align="center" src="https://raw.githubusercontent.com/PeerHerholz/ML-DL_workshop_SynAGE/master/lecture/static/cat_background.png" alt="logo" title="Github" width="600" height="350" />

- and intraclass variation

<img align="center" src="https://raw.githubusercontent.com/PeerHerholz/ML-DL_workshop_SynAGE/master/lecture/static/cat_variation.png" alt="logo" title="Github" width="600" height="350" />

- these variations (and many more) are usually not accounted for and our mapping from `X` to `Y` would fail

- what we want to learn to prevent this are `invariant representations` that capture `latent variables` which are variables you (most likely) cannot directly observe, but that affect the variables you can observe 

<img align="center" src="https://raw.githubusercontent.com/PeerHerholz/ML-DL_workshop_SynAGE/master/lecture/static/cat_dl.png" alt="logo" title="Github" width="600" height="350" />

- the "simple models" we talked about so far work with `predetermined transformations` and thus perform `shallow learning`, whereas more "complex models" perform `deep learning`, using their `hidden layers` to learn `representations`

<img align="center" src="https://media1.giphy.com/media/26ufdipQqU2lhNA4g/giphy.gif?cid=ecf05e47wv88pqvnas5utdrw2qap9xn9lmjvwv4kn3qenjr9&rid=giphy.gif&ct=g" alt="logo" title="Github" width="300" height="300" />

<sub><sup><sub><sup><sup>https://media1.giphy.com/media/26ufdipQqU2lhNA4g/giphy.gif?cid=ecf05e47wv88pqvnas5utdrw2qap9xn9lmjvwv4kn3qenjr9&rid=giphy.gif&ct=g
</sup></sup></sub></sup></sub>

But how?

### From biological to artificial neurons and networks

- decades ago researchers started to create artificial neurons to tackle tasks "conventional algorithms" couldn't handle
- inspired by the learning and performance of biological neurons and networks
- mimic defining aspects of biological neurons and networks 
- examples are: [integrate and fire neurons](https://en.wikipedia.org/wiki/Biological_neuron_model#Leaky_integrate-and-fire), [rectified linear rate neuron](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)), [perceptrons](https://en.wikipedia.org/wiki/Perceptron), [multilayer perceptrons](https://en.wikipedia.org/wiki/Multilayer_perceptron), [convolutional neural networks](https://en.wikipedia.org/wiki/Convolutional_neural_network), [recurrent neural networks](https://en.wikipedia.org/wiki/Recurrent_neural_network), [autoencoders](https://en.wikipedia.org/wiki/Autoencoder), [generative adversarial networks](https://en.wikipedia.org/wiki/Generative_adversarial_network) 

<img align="center" src="https://upload.wikimedia.org/wikipedia/en/5/52/Mark_I_perceptron.jpeg" alt="logo" title="Github" width="300" height="300" />

<sub><sup><sub><sup><sup>https://upload.wikimedia.org/wikipedia/en/5/52/Mark_I_perceptron.jpeg
</sup></sup></sub></sup></sub>

- using biological neurons and networks as the basis for artificial neurons and networks might therefore also help to learn `invariant representations` that capture `latent variables`
- `deep learning` = `representation learning`
- our minds (most likely) contain `(invariant) representations` about the world that allow us to interact with it
    - `task optimization`
    - `generalizability` 

Back to biology...

- `neurons` receive one or more inputs
    - [excitatory postsynaptic potentials](https://en.wikipedia.org/wiki/Excitatory_postsynaptic_potential)
    - [inhibitory postsynaptic potentials](https://en.wikipedia.org/wiki/Inhibitory_postsynaptic_potential)
- inputs are summed up to produce an output
    - an activation
- inputs are separately [weighted](https://en.wikipedia.org/wiki/Weighting) and their sum is passed through a [non-linear function](https://en.wikipedia.org/wiki/Non-linear_function)
    - [activation](https://en.wikipedia.org/wiki/Activation_function) or [transfer function](https://en.wikipedia.org/wiki/Transfer_function)

<img align="right" src="https://upload.wikimedia.org/wikipedia/commons/thumb/a/ac/Neuron3.svg/2560px-Neuron3.svg.png" alt="logo" title="Github" width="300" height="300" />

<sub><sup><sub><sup><sup>https://upload.wikimedia.org/wikipedia/commons/thumb/a/ac/Neuron3.svg/2560px-Neuron3.svg.png
</sup></sup></sub></sup></sub>

- these processes can be translated into mathematical problems including the input `X`, its weights `W` and the activation function `f`

<img align="center" src="https://miro.medium.com/max/1400/1*BMSfafFNEpqGFCNU4smPkg.png" alt="logo" title="Github" width="600" height="300" />

<sub><sup><sub><sup><sup>https://miro.medium.com/max/1400/1*BMSfafFNEpqGFCNU4smPkg.png
</sup></sup></sub></sup></sub>
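To make the translation into math a bit more tangible, here is a minimal sketch of a single `artificial neuron` in NumPy. The input values, the weights, the bias `b` and the choice of `ReLU` as activation function `f` are assumptions made purely for illustration.

```python
import numpy as np


def relu(z):
    """Rectified linear unit: passes positive values, sets negative ones to zero."""
    return np.maximum(z, 0.0)


def artificial_neuron(x, w, b, f):
    """Weight the inputs, sum them up (plus a bias) and apply the activation f."""
    z = np.dot(w, x) + b  # weighted sum of the inputs
    return f(z)           # non-linear activation


# hypothetical example values
x = np.array([0.5, 1.0, 2.0])   # inputs X
w = np.array([0.8, -0.2, 0.5])  # weights W (a negative weight acts "inhibitory")
b = 0.1                         # bias (assumption, not shown in every formulation)

output = artificial_neuron(x, w, b, relu)
print(output)
```

Swapping `relu` for another function changes the type of neuron, while the weighted sum stays the same.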



- the thing about `activation function`s...

    - they define the resulting type of an `artificial neuron`
    - thus they also define its capabilities
    - require non-linearity
        - because otherwise the network could only represent linear functions and linear decision boundaries
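Why non-linearity is required can be checked in a few lines. The following sketch uses hand-picked, hypothetical weight matrices `W1` and `W2` to show that two stacked layers without an activation function collapse into a single linear map, while inserting a `ReLU` in between yields a function no single linear layer could compute.

```python
import numpy as np

# hypothetical weights, chosen so that the effect is easy to verify by hand
W1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])
W2 = np.array([[1.0, 1.0]])
x = np.array([2.0, 1.0])

# two layers without activation function ...
two_linear_layers = W2 @ (W1 @ x)
# ... are equivalent to one layer with the combined weight matrix W2 @ W1
one_linear_layer = (W2 @ W1) @ x

# with a non-linearity in between the layers no longer collapse:
# relu(x1 - x2) + relu(x2 - x1) equals |x1 - x2|, which is not a linear function
with_relu = W2 @ np.maximum(W1 @ x, 0.0)

print(two_linear_layers, one_linear_layer, with_relu)
```

With these weights the purely linear network outputs zero for every input, whereas the version with `ReLU` computes the absolute difference of the two inputs.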

- the thing about `activation function`s...


$$\begin{array}{l}
\text { Non-linear transfer functions}\\
\begin{array}{llc}
\hline \text { Name } & \text { Formula } & \text { Year } \\
\hline \text { none } & \mathrm{y}=\mathrm{x} & - \\
\text { sigmoid } & \mathrm{y}=\frac{1}{1+e^{-x}} & 1986 \\
\tanh & \mathrm{y}=\frac{e^{2 x}-1}{e^{2 x}+1} & 1986 \\
\text { ReLU } & \mathrm{y}=\max (\mathrm{x}, 0) & 2010 \\
\text { (centered) SoftPlus } & \mathrm{y}=\ln \left(e^{x}+1\right)-\ln 2 & 2011 \\
\text { LReLU } & \mathrm{y}=\max (\mathrm{x}, \alpha \mathrm{x}), \alpha \approx 0.01 & 2011 \\
\text { maxout } & \mathrm{y}=\max \left(W_{1} \mathrm{x}+b_{1}, W_{2} \mathrm{x}+b_{2}\right) & 2013 \\
\text { APL } & \mathrm{y}=\max (\mathrm{x}, 0)+\sum_{s=1}^{S} a_{i}^{s} \max \left(0,-x+b_{i}^{s}\right) & 2014 \\
\text { VLReLU } & \mathrm{y}=\max (\mathrm{x}, \alpha \mathrm{x}), \alpha \in 0.1,0.5 & 2014 \\
\text { RReLU } & \mathrm{y}=\max (\mathrm{x}, \alpha \mathrm{x}), \alpha=\operatorname{random}(0.1,0.5) & 2015 \\
\text { PReLU } & \mathrm{y}=\max (\mathrm{x}, \alpha \mathrm{x}), \alpha \text { is learnable } & 2015 \\
\text { ELU } & \mathrm{y}=\mathrm{x}, \text { if } \mathrm{x} \geq 0, \text { else } \alpha\left(e^{x}-1\right) & 2015 \\
\hline
\end{array}
\end{array}$$
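A few of the functions from the table, written out in NumPy as a sketch. The formulas follow the table, while the function names, the default `alpha` values and the evaluation grid `x` are assumptions for illustration (the table leaves `alpha` open for `ELU`).

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def tanh(x):
    e2x = np.exp(2.0 * x)
    return (e2x - 1.0) / (e2x + 1.0)


def relu(x):
    return np.maximum(x, 0.0)


def centered_softplus(x):
    return np.log(np.exp(x) + 1.0) - np.log(2.0)


def leaky_relu(x, alpha=0.01):
    return np.maximum(x, alpha * x)


def elu(x, alpha=1.0):
    # alpha=1.0 is an assumed default
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))


x = np.linspace(-3.0, 3.0, 7)
for f in (sigmoid, tanh, relu, centered_softplus, leaky_relu, elu):
    print(f"{f.__name__:>18}: {np.round(f(x), 3)}")
```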

```python
from IPython.display import IFrame

# embed an interactive visualisation of common activation functions
IFrame(src='https://polarisation.github.io/tfjs-activation-functions/', width=700, height=400)
```


- historically, either [sigmoid](https://en.wikipedia.org/wiki/Logistic_function) or [tanh](https://en.wikipedia.org/wiki/Hyperbolic_function#Hyperbolic_tangent) was used
- even though they are non-linear functions, their properties make them insufficient for most problems, especially `sigmoid`
    - rather simple `polynomials`  
    - mainly work for `binary problems`
    - computationally expensive
    - they saturate, i.e. their gradients approach zero for large positive or negative inputs, causing the neuron and thus the network to "die", i.e. stop `learning`
- modern `ANN`s frequently use `non-saturating activation functions` like the [Rectified Linear Unit](https://deepai.org/machine-learning-glossary-and-terms/rectified-linear-units) (`ReLU`)
    - doesn't saturate for positive inputs
    - faster training and convergence
    - introduces network sparsity
Still, the question is: how does this help us?

Let's imagine the following situation:

<img align="center" src="https://raw.githubusercontent.com/PeerHerholz/ML-DL_workshop_SynAGE/master/lecture/static/UAT_problem.png" alt="logo" title="Github" width="600" height="350" />

- we could try to iterate over all possible `transformations`/`functions` necessary to enable and/or optimize the `output`

However, we could also introduce a [hidden layer]() that learns or more precisely `approximates` what those `transformations`/`functions` are on its own:

<img align="center" src="https://raw.githubusercontent.com/PeerHerholz/ML-DL_workshop_SynAGE/master/lecture/static/UAT_hiddenlayer.png" alt="logo" title="Github" width="600" height="350" />


The idea: there is a `neural network` such that for every possible input `X`, its output approximates `f(X)`.

Importantly, the [hidden layer]() consists of [artificial neurons]() that receive `weighted inputs` `w` and apply [non-linear]() ([non-saturating]()) [activation functions]() `v`, whose `output` is then used for the `task` at hand

<img align="center" src="https://raw.githubusercontent.com/PeerHerholz/ML-DL_workshop_SynAGE/master/lecture/static/UAT_hiddenlayer_function.png" alt="logo" title="Github" width="600" height="350" />


It gets even better: this holds true even if there are multiple `inputs` and `outputs`:

<img align="center" src="https://raw.githubusercontent.com/PeerHerholz/ML-DL_workshop_SynAGE/master/lecture/static/UAT_generalizability.png" alt="logo" title="Github" width="600" height="350" />


- this is referred to as `universality` and finally brings us to one core aspect of `deep learning`

##### Universal function approximation theorem

- `artificial neural networks` are considered `universal function approximators`
    - the possibility of `approximating` a(ny) `function` to some accuracy with
      (a set of) [artificial neurons]() in [hidden layer(s)]()
    - instead of providing a predetermined set of `transformations` or `functions`,
      the `ANN` learns/approximates them by itself

- two problems:
    - the theorem doesn't tell us how many [artificial neurons]() we need
    - it requires either an arbitrary number of artificial neurons in a hidden layer ("arbitrary width" case) or
      an arbitrary number of hidden layers, each containing a limited number of artificial neurons ("arbitrary depth" case)
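A small numerical illustration of the idea, not a proof of the theorem: the sketch approximates `sin(x)` once without a hidden layer (a purely linear model) and once with a single hidden layer of `ReLU` units. To keep it short, the input weights of the hidden layer are random and fixed, and only the output weights are fitted via least squares. This is a simplifying assumption and not how `ANN`s are usually trained. The target function, the number of hidden units and all variable names are likewise chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

x = np.linspace(-np.pi, np.pi, 200)
y = np.sin(x)  # the function we want to approximate


def rmse(prediction, target):
    return float(np.sqrt(np.mean((prediction - target) ** 2)))


# no hidden layer: y_hat = a * x + c
A_linear = np.column_stack([x, np.ones_like(x)])
coef_linear, *_ = np.linalg.lstsq(A_linear, y, rcond=None)
rmse_linear = rmse(A_linear @ coef_linear, y)

# one hidden layer with 50 ReLU units (random, fixed input weights and biases)
n_hidden = 50
w = rng.choice([-1.0, 1.0], size=n_hidden)
b = rng.uniform(-np.pi, np.pi, size=n_hidden)
hidden = np.maximum(np.outer(x, w) + b, 0.0)  # shape: (200, 50)

# fit only the output weights (plus an output bias) via least squares
A_hidden = np.column_stack([hidden, np.ones_like(x)])
coef_hidden, *_ = np.linalg.lstsq(A_hidden, y, rcond=None)
rmse_hidden = rmse(A_hidden @ coef_hidden, y)

print(f"RMSE without hidden layer: {rmse_linear:.3f}")
print(f"RMSE with one hidden layer: {rmse_hidden:.3f}")
```

Each hidden unit contributes one "kink", and the weighted sum of many kinks traces the curve. More units generally allow a closer fit, which is the "arbitrary width" case in miniature.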

- going back to "shallow learning": we provide pre-extracted/pre-computed `features` of our `data` `X` and maybe apply further `preprocessing` before letting our model `M` `learn` the mapping to our outcome `Y` via `optimization` (minimizing the `loss function`)

<img align="center" src="https://raw.githubusercontent.com/PeerHerholz/ML-DL_workshop_SynAGE/master/lecture/static/core_aspects_preprocessing.png" alt="logo" title="Github" width="500" height="280" />

- what `deep learning` does instead is to `learn` `features` by itself, namely those that are most useful for the `objective function`, i.e. `task` as defined by `optimization`

<img align="center" src="https://raw.githubusercontent.com/PeerHerholz/ML-DL_workshop_SynAGE/master/lecture/static/dl_features.png" alt="logo" title="Github" width="500" height="280" />

To bring the things we talked about so far together, we will focus on `ANN` components and how `learning` takes place next...but first, let's take a breather.

<img align="center" src="https://media4.giphy.com/media/1LmBFphV4XNSw/giphy.gif?cid=ecf05e47og07li3vrdt89rgz8uux1qjicb3ykg2z5qdgigu7&rid=giphy.gif&ct=g" alt="logo" title="Github" width="300" height="300" />

<sub><sup><sub><sup><sup>https://media4.giphy.com/media/1LmBFphV4XNSw/giphy.gif?cid=ecf05e47og07li3vrdt89rgz8uux1qjicb3ykg2z5qdgigu7&rid=giphy.gif&ct=g
</sup></sup></sub></sup></sub>

#### Components of `ANN`s

- now that we've spent quite some time on the `neurobiologically informed` underpinnings, it's time to put the respective pieces together and see how they are actually employed within `ANN`s
- for this we will talk about two aspects:
    - building blocks of `ANN`s
    - learning in `ANN`s

##### Building blocks of `ANN`s

- we've actually already seen quite a few important building blocks before but didn't define them appropriately

<img align="center" src="https://raw.githubusercontent.com/PeerHerholz/ML-DL_workshop_SynAGE/master/lecture/static/UAT_generalizability.png" alt="logo" title="Github" width="600" height="350" />


<img align="center" src="https://raw.githubusercontent.com/PeerHerholz/ML-DL_workshop_SynAGE/master/lecture/static/ANN_layer.png" alt="logo" title="Github" width="600" height="350" />


| Term         | Definition | 
|--------------|:-----:|
| Layer |  Structure or network topology in the architecture of the model that consists of `nodes` and is connected to other layers, receiving and passing information. |
| Input layer |  The layer that receives the external input data. |
| Hidden layer(s) |  The layer(s) between `input` and `output layer` which perform `transformations` via `non-linear activation functions`. |
| Output layer |  The layer that produces the final output/task. |




<img align="center" src="https://raw.githubusercontent.com/PeerHerholz/ML-DL_workshop_SynAGE/master/lecture/static/ANN_subparts.png" alt="logo" title="Github" width="600" height="350" />


| Term         | Definition | 
|--------------|:-----:|
| Node |  `Artificial neurons`. |
| Connection | Connection between `nodes`, providing `output` of one `node`/`neuron` as `input` to the next `node`/`neuron`.  |
| Weight |  The relative importance of the `connection`. |
| Bias |  The bias term that can be added to the `propagation function`, i.e. input to a neuron computed from the outputs of its predecessor neurons and their connections as a weighted sum. |



- `ANN`s can be described based on their number of `hidden layers` (`depth`) and the number of `nodes` per layer (`width`)

<img align="center" src="https://raw.githubusercontent.com/PeerHerholz/ML-DL_workshop_SynAGE/master/lecture/static/ANN_multilayer.png" alt="logo" title="Github" width="600" height="350" />
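The building blocks from the two tables can be put together in a short NumPy sketch of a forward pass. The layer sizes, the random initialisation and the use of `ReLU` in the `hidden layers` are assumptions for illustration, not a prescribed setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical network: input layer, two hidden layers, output layer
layer_sizes = [3, 5, 4, 2]

# one weight matrix (connections) and one bias vector per pair of consecutive layers
weights = [rng.normal(scale=0.5, size=(n_out, n_in))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]


def forward(x, weights, biases):
    """Pass the input through all layers; ReLU in hidden layers, linear output."""
    activation = x
    for i, (W, b) in enumerate(zip(weights, biases)):
        z = W @ activation + b  # propagation function: weighted sum plus bias
        is_output_layer = i == len(weights) - 1
        activation = z if is_output_layer else np.maximum(z, 0.0)
    return activation


x = rng.normal(size=layer_sizes[0])
y_hat = forward(x, weights, biases)

depth = len(layer_sizes) - 2    # number of hidden layers
width = max(layer_sizes[1:-1])  # nodes in the widest hidden layer
n_parameters = sum(W.size + b.size for W, b in zip(weights, biases))

print(y_hat, depth, width, n_parameters)
```

Even this tiny network already has 54 parameters, which hints at why larger `ANN`s are data hungry.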

- having talked about `overt building blocks` of `ANN`s we need to talk about `building blocks` that are rather `covert`, that is the aspects that define how `ANN`s learn...

##### Learning in `ANN`s

- let's go back a few hours and talk about `model fitting` again

- when talking about `model fitting`, we need to talk about three central aspects:
    - the model
    - the loss function
    - the optimization

| Term         | Definition | 
|--------------|:-----:|
| Model |  A set of parameters that makes a prediction based on a given input. The parameter values are fitted to available data.|
| Loss function | A function that evaluates how well your algorithm models your dataset |
| Optimization | A function that tries to minimize the loss via updating model parameters. |
	

#### An example: linear regression

- Model:  $$y=\beta_{0}+\beta_{1} x_{1}^{2}+\beta_{2} x_{2}^{2}$$
- Loss function: $$ M S E=\frac{1}{n} \sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)^{2}$$
- optimization: [Gradient descent]()


- `Gradient descent` with a `single input variable` and `n samples`
    - Start with random weights (`β0` and `β1`) $$\hat{y}_{i}=\beta_{0}+\beta_{1} X_{i}$$
    - Compute loss (i.e. `MSE`) $$M S E=\frac{1}{n} \sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)^{2}$$
    - Update `weights` based on the `gradient`
    
<img align="center" src="https://cdn.hackernoon.com/hn-images/0*D7zG46WrdKx54pbU.gif" alt="logo" title="Github" width="550" height="280" />
<sub><sup><sub><sup><sup>https://cdn.hackernoon.com/hn-images/0*D7zG46WrdKx54pbU.gif
</sup></sup></sub></sup></sub>
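The three steps can be sketched in NumPy for the single-input case. The simulated data (true `β0 = 2`, `β1 = 3`), the learning rate and the number of steps are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# simulated data with a single input variable and n samples
n = 100
X = rng.uniform(-1.0, 1.0, size=n)
y = 2.0 + 3.0 * X + rng.normal(scale=0.1, size=n)

# 1. start with random weights
beta0, beta1 = rng.normal(size=2)
learning_rate = 0.1

losses = []
for step in range(500):
    # model prediction
    y_hat = beta0 + beta1 * X
    residual = y - y_hat

    # 2. compute the loss (MSE)
    losses.append(np.mean(residual ** 2))

    # 3. update the weights based on the gradient of the MSE
    grad_beta0 = -2.0 * np.mean(residual)
    grad_beta1 = -2.0 * np.mean(residual * X)
    beta0 -= learning_rate * grad_beta0
    beta1 -= learning_rate * grad_beta1

print(f"beta0 = {beta0:.2f}, beta1 = {beta1:.2f}, final MSE = {losses[-1]:.4f}")
```

Because the `MSE` of a linear model is convex, the updates move towards the same solution regardless of the random starting point.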


- `Gradient descent` for complex models with `non-convex loss functions`
    - Start with random weights (`β0` and `β1`) $$\hat{y}_{i}=\beta_{0}+\beta_{1} X_{i}$$
    - Compute loss (i.e. `MSE`) $$M S E=\frac{1}{n} \sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)^{2}$$
    - Update `weights` based on the `gradient`
    
<img align="center" src="https://raw.githubusercontent.com/PeerHerholz/ML-DL_workshop_SynAGE/master/lecture/static/gradient_descent_complex_models.png" alt="logo" title="Github" width="500" height="280" />
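What changes with a `non-convex loss function` can be shown with a one-dimensional toy example. The loss `w^4 - 3w^2 + w` is a hypothetical stand-in for the loss surface of a complex model; it has two minima, and which one `gradient descent` finds depends on where it starts.

```python
def loss(w):
    """Hypothetical non-convex loss with one global and one local minimum."""
    return w**4 - 3.0 * w**2 + w


def grad(w):
    """Derivative of the loss with respect to w."""
    return 4.0 * w**3 - 6.0 * w + 1.0


def gradient_descent(w_start, learning_rate=0.01, n_steps=1000):
    w = w_start
    for _ in range(n_steps):
        w -= learning_rate * grad(w)
    return w


w_left = gradient_descent(-2.0)   # ends in the global minimum
w_right = gradient_descent(2.0)   # gets stuck in a local minimum

print(f"start at -2: w = {w_left:.3f}, loss = {loss(w_left):.3f}")
print(f"start at +2: w = {w_right:.3f}, loss = {loss(w_right):.3f}")
```

Both runs converge, as the gradient is practically zero at the end, but only one of them reaches the lower loss. This is one reason why initialisation matters for `ANN`s.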

### ANN architectures

- now that we've gone through the underlying basics and important building blocks of `ANN`s, we will check out a few of the most commonly used architectures
- in general we can [group `ANN`s based on their `architecture`](https://en.wikipedia.org/wiki/Types_of_artificial_neural_networks), that is how their building blocks are defined and integrated


- possible `architectures` include (only a very tiny subset listed):
    - [feedforward](https://en.wikipedia.org/wiki/Types_of_artificial_neural_networks#Feedforward) (information moves in a forward fashion through the ANN, without cycles and/or loops)
        - [Multilayer perceptrons](https://en.wikipedia.org/wiki/Multilayer_perceptron)
        - [Convolutional neural networks](https://en.wikipedia.org/wiki/Convolutional_neural_network)
        - [autoencoders](https://en.wikipedia.org/wiki/Autoencoder)
    - [recurrent](https://en.wikipedia.org/wiki/Types_of_artificial_neural_networks#Recurrent_neural_network) (information moves in a forward and a backward fashion through the ANN)
        - [fully recurrent](https://en.wikipedia.org/wiki/Types_of_artificial_neural_networks#Fully_recurrent)
        - [Long short-term memory](https://en.wikipedia.org/wiki/Types_of_artificial_neural_networks#Long_short-term_memory)
    - [radial basis function](https://en.wikipedia.org/wiki/Types_of_artificial_neural_networks#Radial_basis_function_(RBF)) (networks that use radial basis functions as activation function)
        - [General regression network](https://en.wikipedia.org/wiki/Types_of_artificial_neural_networks#General_regression_neural_network)
        - [Deep belief networks](https://en.wikipedia.org/wiki/Types_of_artificial_neural_networks#Deep_belief_network)

- we will take a closer look at `feedforward` and `recurrent architectures` as they will (most likely) be the ones you see frequently utilized within `neuroscience`
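The difference between the two families can be sketched with a single layer. In the `feedforward` case the output depends only on the current input, whereas in the `recurrent` case the previous hidden state is fed back in as well. All sizes, weights and the `tanh` activation are assumptions for illustration, and the recurrent step shown is only one simple variant.

```python
import numpy as np

rng = np.random.default_rng(0)

n_inputs, n_hidden = 3, 4
W_in = rng.normal(scale=0.5, size=(n_hidden, n_inputs))   # input -> hidden
W_rec = rng.normal(scale=0.5, size=(n_hidden, n_hidden))  # hidden -> hidden (feedback)


def feedforward_step(x):
    """Output depends on the current input only."""
    return np.tanh(W_in @ x)


def recurrent_step(x, h_previous):
    """Output depends on the current input and the previous hidden state."""
    return np.tanh(W_in @ x + W_rec @ h_previous)


sequence = rng.normal(size=(5, n_inputs))  # hypothetical sequence of 5 time points

feedforward_outputs = np.array([feedforward_step(x) for x in sequence])

h = np.zeros(n_hidden)
recurrent_states = []
for x in sequence:
    h = recurrent_step(x, h)
    recurrent_states.append(h)
recurrent_states = np.array(recurrent_states)

print(feedforward_outputs.shape, recurrent_states.shape)
```

At the first time point both versions agree, because the hidden state starts at zero. From the second time point on, the recurrent states carry information about earlier inputs.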



- show sigmoid, tanh, ReLU and name problems of first two
- not sufficient for universal function approximation, ReLU needed
- explain universal function approximation and why at least one hidden layer necessary (learn functions themselves)
- MLPs
- CNNs
- mention RNN, VAE, GANs

- learning problems, architectures 
- introduce common problems
    - variation, etc.
- importance representations to address those variations, learn latent variables
    - how derive them?
    - based on performance of biological systems folks start to create artificial neurons
    - limitation of LIF (activation function) -> only linear
- universal function approximators    
    - missing non-linearity
    - cumbersome to impossible to iterate over all possible functions and underlying parameters
    - thus universal function approximators -> hidden layers learn functions by themselves
- MLPs as most simple ANNs
    - show example
    - non-linear activation functions
    - fully connected
    - softmax layer
    - outline problems with that which lead to CNNs
- introduce CNNs with convolution, pooling layers, etc.    
    - hierarchy, etc.

Why are fully connected layers required?

We can divide the whole neural network (for classification) into two parts:

- Feature extraction: In conventional classification algorithms, like SVMs, we have to extract features from the data to make the classification work. The convolutional layers serve the same purpose of feature extraction. CNNs learn better representations of the data, and hence we don't need to do feature engineering.
- Classification: After feature extraction we need to classify the data into various classes, which can be done using a fully connected (FC) neural network. In place of fully connected layers, we could also use a conventional classifier like an SVM, but we generally add FC layers to make the model end-to-end trainable. The fully connected layers learn a (possibly non-linear) function of the high-level features output by the convolutional layers.
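The two parts can be sketched in plain NumPy. This is a toy forward pass with random, untrained weights; the image size, the number of filters and classes, and the helper names `conv2d`, `max_pool` and `softmax` are assumptions for illustration and not taken from a specific library.

```python
import numpy as np

rng = np.random.default_rng(0)


def conv2d(image, kernel):
    """'Valid' 2D cross-correlation, the operation CNN layers call convolution."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling; crops rows/columns that don't fit."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


image = rng.random((8, 8))            # hypothetical grayscale image
kernels = rng.normal(size=(2, 3, 3))  # two 3x3 filters

# part 1: feature extraction (convolution -> ReLU -> pooling)
feature_maps = [max_pool(np.maximum(conv2d(image, k), 0.0)) for k in kernels]
features = np.concatenate([fm.ravel() for fm in feature_maps])

# part 2: classification (fully connected layer -> softmax)
n_classes = 3
W_fc = rng.normal(scale=0.1, size=(n_classes, features.size))
b_fc = np.zeros(n_classes)
class_probabilities = softmax(W_fc @ features + b_fc)

print(features.shape, class_probabilities)
```

In a real CNN both the filters and the FC weights would be learned jointly via `gradient descent`, which is what "end-to-end trainable" refers to.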

- introduce inductive bias/hierarchy when introducing CNNs