# Motivation: The tale of two intelligences

The very concept of intelligence is a tricky one, and what we regard as the "core" or "essence" of intelligence has shifted considerably over time. In fact, many argue that "intelligence" is not a unified thing in itself, but comprises multiple activities. Among those who hold this view, one of the most famous is the Nobel laureate Daniel Kahneman.

He proposed that what we call everyday problem solving in fact consists of two complementary systems of functioning:

A rational, analytical system that evaluates data and reasons through knowledge (he called this **"System 2"**), and a fast, heuristic system that is good at learned pattern recognition and quick judgment (**"System 1"** in his naming convention).

<img src="https://investopress.com/sites/default/files/2018-06/fast-slow.jpg" width=45%>

<img src="https://miro.medium.com/max/1731/1*o__4FykMIU_gwfv3PagX0A.png" width=55%>

As in the theories of natural intelligence, this important dichotomy is strongly present in the history of AI itself, and is the motivating factor behind the different "schools" or "trends" of AI, which are themselves subject to the cycle of hype and disillusionment.

<img src="https://www.actuaries.digital/wp-content/uploads/2018/08/ai1.jpg" width=75%>

<img src="http://drive.google.com/uc?export=view&id=1jcNu8YiX29wy6Jx41a3KdMP5miGT0J7A" width=75%>
<img src="http://drive.google.com/uc?export=view&id=1mVX0eN5Tf7b55ZvpUBiYEXHwbAy7sKNl" width=75%>

It is important to note, based on the charts above, that historically **symbolic and pattern-based AI could not yet really co-exist harmoniously in one system**. We always had to choose one over the other and accept its limitations.

More on this topic [here](https://towardsdatascience.com/explainable-ai-vs-explaining-ai-part-2-statistical-intuitive-vs-symbolic-reasoning-systems-8b05f8e0a3a0).

Recently, with the rapid advancement of Deep Learning, and then the growing concerns about **the limitations of Deep Learning**, the debate of symbolic vs. pattern-based AI came into focus again. In the now famous debate between the cognitive scientist Gary Marcus and Yoshua Bengio, the former called the DL paradigm into question because of its lack of symbolic reasoning and generalization abilities. In his answer, as well as in his NeurIPS keynote shortly before, the DL pioneer Bengio highlighted exactly this: he named the merger of System 1 and System 2 as the focal point of AI research, and declared it a prime target that has to be tackled with new DL approaches.

In [1]:
from IPython.display import IFrame

IFrame('https://www.youtube.com/embed/EeqwFjqFvJA', width=1000, height=500)


It is well worth studying Prof. Bengio's **"agenda setting" keynote in detail**, [here](https://slideslive.com/38921750/from-system-1-deep-learning-to-system-2-deep-learning).

Regardless of the approaches we take, we can say that **this is in a sense THE frontier we have to concern ourselves with.**

# Limitations (as of now)

As put forward by Prof. Bengio in the above-mentioned keynote, despite all the progress Deep Learning has made during the last decade, there are some fundamental limitations which hamper progress towards more "intelligent" models.

<img src="http://drive.google.com/uc?export=view&id=1uD_4g0t-fr9ju2GxeWRW-cy2AVY3x9_k" width=55%>

## Narrowness of models

All the models we are capable of training today are to be categorized as narrow AI, that is, they are **task-specific models** that can fulfill only one narrowly defined purpose. There is no doubt that in some cases they exceed human-level accuracy.

For example, in ImageNet image classification the human-level top-5 accuracy is 94.9%, while Noisy Student (EfficientNet-L2) claims 98.7%.

<img src="http://drive.google.com/uc?export=view&id=1uFTfI1nc2bCKsP0egdCCiXdIaTKtsDWC" width=55%>

(The top-1 accuracy is printed on the chart.)

Source for the leaderboard [here](https://paperswithcode.com/sota/image-classification-on-imagenet)

Source for the human accuracy level [here](https://arxiv.org/pdf/1409.0575.pdf)

Some interesting analysis about human accuracy on ImageNet can be found [here](https://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/).

Nevertheless, if we transfer this model to a new domain, it will already lose some of its performance, let alone if we move it to a **different task**, or even to a **different modality**.

Transfer learning techniques, as well as multi-task learning, try to mitigate these limitations, but no definite breakthrough has been achieved as of yet.


## Too data hungry

Humans typically do not require more than one or two examples of a certain object to form a good generalization of it. In contrast, in the ImageNet dataset detailed above, "More than __14 million images__ have been hand-annotated by the project to indicate what objects are pictured and in at least one million of the images, bounding boxes are also provided. ImageNet contains more than __20,000 categories__ with __a typical category, such as "balloon" or "strawberry", consisting of several hundred images.__"

With this in mind, building large enough supervised datasets for all machine learning problems is a prohibitively costly endeavor.

(More on the cost analysis of ImageNet scale crowdsourced database annotation can be found [here](http://image-net.org/tutorials/cvpr2015/crowdsourcing_slides.pdf).)


This problem can be decomposed into two sub-problems:

### Annotation driven by humans

#### "The machine does not explore"

In the supervised paradigm of machine learning, the data containing the teaching signals has to be "given", that is, produced by human annotators before the training (or at least during it, in the online learning case). The machine learning models and systems themselves do not actively seek out information; they just passively consume it.

For this problem, the approaches of **active learning** and, more importantly, **reinforcement learning** can be the answer. In the latter, the system ("agent") is presented with a complex environment, in which **it seeks out new input through its own behavior, effectively generating its own training data**.

#### Generalization comes from the infusion of human concepts

Basically we can assume that the major source of generalization in machine learning is the labels themselves, since the human annotators infuse their **knowledge about the true classes**, and thus (hopefully) about the real causes behind the observed sensory data.

This is a major limitation: on the one hand, our own hard work infuses even our own prejudices into learning systems; on the other hand, it makes "exploratory" learning on brand new, poorly understood data domains extremely challenging.

**Causal machine learning**, as well as **unsupervised learning**, aim at mitigating this problem.

### Sample efficiency

Even in cases where we can successfully "outsource" the task of exploration, and thus the generation of training data, to a machine learning system (e.g. with reinforcement learning), **ML systems are not sample efficient**, meaning that **they derive less (generalizable) knowledge from a single datapoint** than a human being does.

Take for example the case of gameplay:

<img src="https://www.alexirpan.com/public/rl-hard/rainbow_dqn.png" width=35%>

"The y-axis is “median human-normalized score”. This is computed by training 57 DQNs, one for each Atari game, normalizing the score of each agent such that human performance is 100%, then plotting the median performance across the 57 games. RainbowDQN passes the 100% threshold at about 18 million frames. This corresponds to about 83 hours of play experience, plus however long it takes to train the model. A lot of time, for an Atari game that most humans pick up within a few minutes."

A very good discussion of this (and other) limitation(s) can be found in the essay [here](https://www.alexirpan.com/2018/02/14/rl-hard.html).
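The "about 83 hours" figure in the quote above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes that the emulator renders 60 frames per second, which is not stated in the quote itself, and the names used are purely illustrative.

```python
# Back-of-the-envelope check of the play time quoted above.
FRAMES_TO_HUMAN_LEVEL = 18_000_000  # frames needed by RainbowDQN, from the quote
FRAMES_PER_SECOND = 60              # assumption: not given in the quote


def frames_to_hours(frames, fps=FRAMES_PER_SECOND):
    """Convert a number of game frames into hours of real-time play."""
    seconds = frames / fps
    return seconds / 3600


hours = frames_to_hours(FRAMES_TO_HUMAN_LEVEL)
print(f"{hours:.1f} hours of play")
```

Under this assumption the result lands a little above 83 hours, which is consistent with the quoted number.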

This limitation is still very much present, though some [approaches](https://arxiv.org/abs/1806.07366) try to marry traditional models (e.g. differential equations) with neural learning models to achieve strong generalization, and some techniques of **few-shot learning** look very promising.


## Stupid mistakes

Though the performance of machine learning models is often impressive, when they fail, they fail really badly.

### They fail towards "common correlations"

<img src="http://drive.google.com/uc?export=view&id=1pHTxMCnZzoKOSPYTWL87r6k_wp1C9qfm" width=45%>


### They fail in "uncommon contexts"

<img src="https://zdnet4.cbsistatic.com/hub/i/r/2018/11/30/92ad1ec6-428a-4c7e-b92d-23557902cc2b/resize/370xauto/5d02cdac55bce949d767306f1b612000/google-inception-fooled-into-mis-classification.png" width=55%>

### They over rely on surface patterns instead of shapes

<img src="https://miro.medium.com/max/3112/1*uvXG1xULxUuI_8FtUd4_xA.png" width=55%>


### They fail miserably under "engineered noise"

<img src="https://m-cdn.dashdigital.com/communications/november_2016/data/articles/img/Pc0140100.jpg" width=55%>

Sadly, there is even some evidence that for a great multitude of problems the [existence of adversarial examples is unavoidable](https://arxiv.org/abs/1809.02104).

The problem of adversarial examples is deeply connected with generalization, since we would expect the model to come up with the _"real" defining features_ (whatever they might be, and if they exist at all) that are robust to this kind of noise. (Or to be more cynical: they fail in the same direction as we do. :-)

There is also a connection to explainability, since humans wish for a good explanation of the model's inner representations and decision algorithm, so as to ensure that such errors do not occur.

## More art than science

Machine learning, especially Deep Learning, is plagued by the **lack of a unifying theory**. Considerable successes have been achieved and considerable empirical knowledge has been acquired about what works and what does not, but there is still a painful lack of a general theory that would explain the "why"-s behind the phenomena.

This in itself led some well-known researchers to [compare the state of machine learning to alchemy](http://www.argmin.net/2017/12/05/kitchen-sinks/), the "proto-chemistry" which had great partial and practical successes, but a completely misguided theory.

Whether we agree with this or not, it is a fact that the best practices of machine learning change at a frustratingly quick pace; it thus requires considerable effort to follow them, and one can never be sure that the applied approach is the best possible. Some **standardization would help a lot**, which we - in the absence of a unified theory - can try to **achieve by automation and rigorous adherence to the principles of [reproducible research](https://en.wikipedia.org/wiki/Reproducibility#Reproducible_research)**.


-----------------------------------------------

As a summary, the key notion of this course is that:

**It is not enough to use existing models and paradigms on bigger datasets; some fundamentally new approaches have to be taken to push the frontier of what AI is capable of.**


# Main frontiers (non-exhaustive)

## Level of techniques (some examples only!)

There is an overwhelming amount of progress in the tooling and detailed techniques of Deep Learning, which is simply impossible to cover comprehensively, but some areas are general enough to be notable, e.g.:

### New inputs (eg. Learning on graphs)

As machine learning models and the modeling process itself get more mature and achieve solid results in multiple fields, the "appetite" for applying ML in yet untested domains and problems grows rapidly.

Though the "traditional" settings of **tabular data**, **spatial (image) data** and **time series** cover a remarkably large proportion of interesting business problems, there are some areas in which the data lends itself to alternative representations.

One of these forms is **graph-like data**, which can be found in a surprisingly broad set of domains, like social networks, interactions between financial actors, molecules, and graphs representing knowledge or meaning.

<img src="https://miro.medium.com/max/758/1*atP7qJV_Un2mtmE43NalzQ.png" width=55%>

The latter is all the more interesting, since **it was the de-facto form of representation for classical "good old-fashioned AI"-like models of reasoning and knowledge representation**; hence the ability to do statistical machine learning on graphs can be understood as **a step towards the merger of "System 1" and "System 2"-like intelligence.**

The area of graph based (deep) machine learning got a huge boost recently, with great progress in the field and **multiple dedicated graph ML frameworks popping up**.

A nice collection of graph based ML and DL methods can be found in the repositories and libraries of eg. [Benedek Rozemberczki](https://github.com/benedekrozemberczki) as well as a promising book [here](https://www.manning.com/books/graph-powered-machine-learning).

### New optimizers

With the progress of experimentation with (deep) machine learning models and the "conquest" of newer areas, it became more and more clear that the **stability and effectiveness of the training process across models and datasets** is one of the crucial bottlenecks of progress.

Setting and tuning the learning rate has been key to good performance, so multiple avenues of research have tried to **come up with schedules or regimes of learning rates for optimizers** that ensure good convergence with minimal tuning.

Even by now classical optimization methods, like [Adam](https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/), aimed at generally stable training with (ideally) no parameters to tune. Adam still had some strong limitations, like worse generalization performance and slow convergence.

The breakthrough in this problem came from a somewhat surprising direction. 

#### Rectified Adam or RAdam

A great motivation behind some of the new innovations in the area of optimizers was the widespread usage of "cyclic learning rates" (and "superconvergence" as a side effect), made popular by Fast.AI.

<img src="https://www.jeremyjordan.me/content/images/2018/02/Screen-Shot-2018-02-25-at-8.44.49-PM.png" width=45%>

The "one cycle" policy, especially its first part, the **"warmup" phase**, where we **start with a small initial learning rate and gradually increase it**, can have unexpected side effects in the case of adaptive learning rate optimizers, especially **Adam**, since **warmup mitigates the generalization problem of adaptive LR methods**.
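To make the shape of such a schedule concrete, here is a minimal sketch of a warmup followed by an annealing phase. It is a simplified stand-in and not the exact Fast.AI policy: the linear warmup, the cosine decay, and all parameter names and default values (`max_lr`, `start_lr`, `warmup_frac`) are assumptions chosen for illustration.

```python
import math


def one_cycle_lr(step, total_steps, max_lr=1e-2, start_lr=1e-4, warmup_frac=0.3):
    """Learning rate at `step` for a simplified one-cycle-like schedule.

    Linear warmup from start_lr to max_lr, then cosine decay back to start_lr.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Warmup phase: increase linearly from start_lr towards max_lr.
        return start_lr + (max_lr - start_lr) * step / warmup_steps
    # Annealing phase: cosine decay from max_lr back down to start_lr.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return start_lr + 0.5 * (max_lr - start_lr) * (1 + math.cos(math.pi * progress))


schedule = [one_cycle_lr(s, total_steps=100) for s in range(101)]
print(f"first: {schedule[0]:.5f}, peak: {max(schedule):.5f}, last: {schedule[-1]:.5f}")
```

The learning rate starts small, peaks at the end of the warmup, and returns to a small value by the end of training.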

The crucial insight of the paper ["On the Variance of the Adaptive Learning Rate and Beyond"](https://arxiv.org/abs/1908.03265) is that much of the loss in generalization performance in the case of Adam comes from its naive over-reliance on the **initial variance in gradients**; the **weight distribution gets quickly distorted** and never fully finds its way back to fruitful, more global optima.

<img src="http://drive.google.com/uc?export=view&id=1mYfms1HJeL7O_OuvPcDg3HZpdVpyh2ow" width=50%>

So the newest, state of the art optimization method seems to be a form of Adam, namely **rectified Adam (or RAdam)** that incorporates some learnings from the warm-up method and variable learning rates.

_It basically tries to set up an adaptive regularization scheme that balances the amount of Adam's adaptive LR properties: in the extreme case it can act as SGD, disregarding the aggregated variance data, and only when appropriate does it start to behave like Adam._

Or with the words of the authors:

"Comparing these two strategies (warmup and RAdam), RAdam deactivates the adaptive learning rate when its variance is divergent, thus avoiding undesired instability in the first few updates."
 
A nice introduction can be found [here](https://medium.com/@lessw/new-state-of-the-art-ai-optimizer-rectified-adam-radam-5d854730807b).
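The switch between "SGD-like" and "Adam-like" behaviour is controlled by a rectification term in the RAdam paper. The sketch below computes only that term, following the formulas as I read them in the paper (with the tractability threshold of 4). It is not a full optimizer, the function name is made up for this example, and `step` is assumed to start at 1.

```python
import math


def radam_rectification(step, beta2=0.999):
    """Return the rectification factor r_t for a given step (step >= 1),
    or None while the variance of the adaptive learning rate is treated
    as intractable and the adaptive part is switched off."""
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    beta2_t = beta2 ** step
    rho_t = rho_inf - 2.0 * step * beta2_t / (1.0 - beta2_t)
    if rho_t <= 4.0:
        return None  # fall back to an SGD-with-momentum style update
    return math.sqrt(
        ((rho_t - 4.0) * (rho_t - 2.0) * rho_inf)
        / ((rho_inf - 4.0) * (rho_inf - 2.0) * rho_t)
    )


for t in (1, 10, 100, 1000, 10000):
    print(t, radam_rectification(t))
```

In the first few steps the factor is undefined (the adaptive learning rate is deactivated), then it grows from a small value towards 1, at which point the update behaves like plain Adam. This gradual growth is what resembles a built-in warmup.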

It promises to be fast, but more importantly **robust across a wide selection of learning rates**. 

<img src="https://miro.medium.com/max/700/1*BMwu8Km-CtPsvaH8OM5_-g.jpeg" width=65%>

Since the [paper](https://arxiv.org/abs/1908.03265) describing the method is still pretty fresh, the jury is still out, but it looks very promising! (Implementations are not yet mainstream...)

#### Lookahead

Though the inspiration is not that direct and obvious, the idea of storing some weights during training had some "spin-off" ideas in optimization.

<img src="http://ruder.io/content/images/2017/11/snapshot_ensembles.png" width=55%>


In their recent paper [LookAhead optimizer: k steps forward, 1 step back](https://arxiv.org/abs/1907.08610) Zhang et al. proposed an optimization method where they **keep a copy of the weights, and use two optimization regimes, one "slow" and one "fast" for the network.**

<img src="http://drive.google.com/uc?export=view&id=1EhpRMvpgKowimKMnuVrkcXtd6hB79iIo" width=75%>

**After a short period (some iterations, e.g. 5) they then "synchronize" the weights.**

Or with the words of the authors: 

"Lookahead maintains a set of slow weights $φ$ and fast weights $θ$, which get synced with the fast weights every $k$ updates. The fast weights are updated through applying $A$, any standard optimization algorithm, to batches of training examples sampled from the dataset $D$. After $k$ inner optimizer updates using $A$, the slow weights are updated towards the fast weights by linearly interpolating in weight space, $θ − φ$. We denote the slow weights learning rate as $α$. After each slow weights update, the fast weights are reset to the current slow weights value."

Why is this good?

As this [excellent description](https://medium.com/@lessw/new-deep-learning-optimizer-ranger-synergistic-combination-of-radam-lookahead-for-the-best-of-2dc83f79a48d) puts it:

"...in effect it allows a faster set of weights to ‘look ahead’ or explore while the slower weights stay behind to provide longer term stability. The result is reduced variance during training, and much less sensitivity to sub-optimal hyper-parameters and reduces the need for extensive hyper-parameter tuning... By way of simple analogy, LookAhead can be thought of as the following. Imagine you are at the top of a mountain range, with various dropoff’s all around. One of them leads to the bottom and success, but others are simply crevasses with no good ending. To explore by yourself would be hard because you’d have to drop down each one, and assuming it was a dead end, find your way back out. But, if you had a buddy who would stay at or near the top and help pull you back up if things didn’t look good, you’d probably make a lot more progress towards finding the best way down because exploring the full terrain would proceed much more quickly and with far less likelihood of being stuck in a bad crevasse."

<img src="http://drive.google.com/uc?export=view&id=1A43MSBp0s-zKO8H8EtUxY_B1rcTLJnRC" width=85%>

The thing seems to actually work!
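The update rule described in the authors' quote above can be sketched in a few lines. This is a toy illustration on a scalar objective, with plain SGD assumed as the inner optimizer; the function name and the default values for `k`, `alpha` and `inner_lr` are choices made for this example only.

```python
def lookahead_minimize(grad_fn, start, k=5, alpha=0.5, inner_lr=0.1, outer_steps=20):
    """Minimize a scalar function with Lookahead wrapped around plain SGD."""
    slow = start  # slow weights (phi)
    for _ in range(outer_steps):
        fast = slow  # fast weights (theta) are reset to the slow weights
        for _ in range(k):
            # k inner updates with the inner optimizer (here: SGD)
            fast -= inner_lr * grad_fn(fast)
        # Move the slow weights towards the fast weights by interpolation
        slow += alpha * (fast - slow)
    return slow


# Toy objective f(x) = x^2 with gradient 2x, minimum at x = 0
result = lookahead_minimize(lambda x: 2.0 * x, start=5.0)
print(result)
```

With `alpha=1.0` the slow weights simply copy the fast weights, so the method reduces to the inner optimizer; smaller values of `alpha` make the slow weights more conservative.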

#### Surprise: a combination, Ranger

Well, if both RAdam and Lookahead achieved impressive new state-of-the-art results, why not combine the two?

This is exactly what happened, thus the new optimization method [Ranger](https://medium.com/@lessw/new-deep-learning-optimizer-ranger-synergistic-combination-of-radam-lookahead-for-the-best-of-2dc83f79a48d) was born.

**The results were impressive: the normalization effects of RAdam at the beginning and the overall stabilization of Lookahead combine seamlessly!**

As of September 2019, this was the state of the art, but the saga continues (even on some parallel avenues)...

There is a notable side effect, or better to say **secondary aim**, in the development of these new optimization methods: to mitigate the fragility of the training process, and so to come up with **generally robust, problem-agnostic, and preferably parameter-free or _automatically self-regulating_ training methods**, enabling a higher level of automation of the modeling work.

### Automation

One of the trends in machine learning is the automation of some of the crucial but tedious parts of the data science process. This approach is made possible by the **increasingly widespread adoption of common tools** (such as the scikit-learn syntax and pipeline concept) as well as the **relative standardization of the data science process itself**, from the high level (see e.g. the [CRISP DM](https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining) methodology) to the lower levels (e.g. standard data preprocessing steps).

This kind of "maturation" in the field enables the rollout of standard, hence (at least partially) automatable solutions. These can greatly advance the productivity of data scientists by enabling shortcuts in certain parts of the job, like simplified data exploration and wrangling (e.g. [Bamboolib](https://bamboolib.8080labs.com/)), automated error discovery in the data (e.g. [Cleanlab](https://github.com/cgnorthcutt/cleanlab)) or automatic visualization tools.

**[Data2Vis](https://arxiv.org/abs/1804.03126)** represents an extreme point in this, since it is a machine learning model that learned to translate between the datasets themselves and the short description languages of visualization tools, like Vega-Lite, thus **generating the code of the visualizations by itself**, given only the dataset.

<img src="https://hci.stanford.edu/~cagatay/data2vis/static/assets/model.jpg" width=55%>
<img src="https://miro.medium.com/max/3736/1*MIpc69nBQMU-IBEMsCL-iA.jpeg" width=55%>


#### AutoML

This approach is the usage of machine learning techniques for solving machine learning problems themselves, like **hyperparameter optimization**, **model choice**, or even **model architecture search / generation**.

A very nice overview of this topic was given by **[Jeff Dean at ICML 2019](https://slideslive.com/38917526/an-overview-of-googles-work-on-automl-and-future-directions)**. (The whole workshop on AutoML is available [here](https://nuit-blanche.blogspot.com/2019/06/saturday-video-morning-automl-workshop.html).)

He defined the levels of possible automation as:


1. Handcraft predictors, learn nothing.
2. Handcraft features, learn prediction. Automated Hyperparameter Optimization (HPO) tools like Hyperopt, Optuna, SMAC3, scikit-optimize, etc.
3. Handcraft algorithm, learn features and predictions end-to-end. HPO + some tools like featuretools, tsfresh, boruta, etc.
4. Handcraft nothing. Learn algorithm, features, and predictions end-to-end. Automated Algorithm (Model) Selection tools like Auto-sklearn, TPOT, H2O, auto_ml, MLBox, etc.

([source](https://towardsdatascience.com/overview-of-automl-from-pycon-jp-2019-c8996954692f))
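The simplest of these ideas, hyperparameter optimization, can be illustrated with a random search. The sketch below is deliberately naive and does not use the API of any of the tools listed above (which typically apply smarter strategies); `toy_validation_loss` is a hypothetical stand-in for "train a model and measure its validation loss", with its optimum placed arbitrarily.

```python
import math
import random


def toy_validation_loss(learning_rate):
    """Hypothetical stand-in for training a model and returning its loss.
    The minimum is placed at learning_rate = 1e-2 purely for illustration."""
    return (math.log10(learning_rate) + 2.0) ** 2


def random_search(objective, n_trials=50, low=1e-4, high=1.0, seed=0):
    """Sample learning rates log-uniformly and keep the best one."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        lr = 10 ** rng.uniform(math.log10(low), math.log10(high))
        trials.append((objective(lr), lr))
    best_loss, best_lr = min(trials)
    return best_lr, best_loss, trials


best_lr, best_loss, trials = random_search(toy_validation_loss)
print(f"best learning rate: {best_lr:.4g}, loss: {best_loss:.4g}")
```

Sampling on a logarithmic scale is a common choice for learning rates, since plausible values span several orders of magnitude.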

In research settings, even bolder initiatives are pursued; see the case of SWISH below.

This field has matured enough that AutoML services like Google's [Cloud AutoML](https://cloud.google.com/automl), Microsoft's [AutoML](https://www.microsoft.com/en-us/research/project/automl/) or Amazon's [AWS AutoPilot](https://techcrunch.com/2019/12/03/aws-autopilot-gives-you-more-visible-automl-in-sagemaker-studio/) are already available as products of this paradigm.



### New building blocks

Besides the considerable amount of innovation regarding the architectural concepts of deep neural networks (basically, innovation in connectivity), there is notable progress with regard to the **basic building blocks** as well.

**New activation functions**, as well as **new mechanisms** are also being deployed on a large scale. 

#### New activation functions

From the initial approach, which considered the **logistic sigmoid** the most relevant and plausible **non-linear activation function essential for the functioning of neural networks**, a considerable amount of innovation has happened.

<img src="https://miro.medium.com/max/1192/1*4ZEDRpFuCIpUjNgjDdT2Lg.png" width=55%>

It turned out that the initially assumed **"boundedness" of the functions is not essential**; thus the still dominantly popular ReLU was born.

Then, as a next step, came the **dying neuron problem** that plagued ReLUs: they get stuck in the sub-zero domain, and since there is no gradient there, they can not leave that territory anymore. To mitigate this, some solutions, like **LeakyReLU, re-introduced some activation in the negative domain.**
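The difference between these functions is easiest to see in their gradients on the negative side. The following scalar sketch is only meant as an illustration; the `negative_slope` default of 0.01 is a common but arbitrary choice.

```python
import math


def sigmoid(x):
    """Logistic sigmoid: bounded in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    """ReLU: unbounded above, exactly zero for negative inputs."""
    return max(0.0, x)


def relu_grad(x):
    """Gradient of ReLU: zero for negative inputs (the 'dying' regime)."""
    return 1.0 if x > 0 else 0.0


def leaky_relu(x, negative_slope=0.01):
    """LeakyReLU: keeps a small activation in the negative domain."""
    return x if x > 0 else negative_slope * x


def leaky_relu_grad(x, negative_slope=0.01):
    """Gradient of LeakyReLU: small but non-zero for negative inputs."""
    return 1.0 if x > 0 else negative_slope


x = -3.0
print("ReLU:     ", relu(x), "gradient:", relu_grad(x))
print("LeakyReLU:", leaky_relu(x), "gradient:", leaky_relu_grad(x))
```

A unit whose input stays negative receives no gradient through ReLU, while LeakyReLU still passes a small signal that allows it to recover.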

##### SWISH

During a more in depth evaluation of possible activation functions, the research group at Google Brain utilized some innovative techniques, and came up with the SWISH activation function:

<img src="https://pbs.twimg.com/media/DMaTpdsW4AAklGn.jpg:large" width=55%>

$$\text{Swish}(x) = x \cdot \sigma(x)$$

Interestingly enough, SWISH was found by an **exhaustive automated search in the function space** (see https://arxiv.org/pdf/1710.05941.pdf), and led to a quite general increase in performance.
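The formula above translates directly into code. This is a scalar sketch for illustration only, not the implementation used in the paper.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def swish(x):
    """Swish(x) = x * sigmoid(x), as in the formula above."""
    return x * sigmoid(x)


for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"swish({x:+.1f}) = {swish(x):+.4f}")
```

Note that, unlike ReLU, the function dips slightly below zero for moderately negative inputs and approaches zero again in the far negative.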

##### MISH

After the success of automated methods producing SWISH, some researchers tried to analyze the root causes of the good performance of SWISH.

It turns out that **the local slope around zero** was the key property that made SWISH successful. So it is not only true that you can have a **non-upper-bounded** function **with some decision boundary** and **near-zero activation in the far negative**, but also that you **need some steeper local gradients around zero**, the decision boundary.

After realizing this, a modified, new activation function MISH was born.

<img src="https://miro.medium.com/max/1158/1*JPL4dbd95hS5g7EGVLs1pQ.png" width=45%>

More information on MISH can be found [here](https://towardsdatascience.com/mish-8283934a72df).
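For reference, MISH is commonly defined as $x \cdot \tanh(\text{softplus}(x))$, with $\text{softplus}(x) = \ln(1 + e^x)$. The text above shows the function only as a figure, so the scalar sketch below relies on that commonly cited definition.

```python
import math


def softplus(x):
    """Numerically stable softplus: ln(1 + e^x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x):
    """Mish(x) = x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"mish({x:+.1f}) = {mish(x):+.4f}")
```

Like SWISH, the function is unbounded above, close to zero in the far negative, and slightly negative for moderately negative inputs.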

<img src="https://miro.medium.com/max/1758/1*QlrS1CzhWqxu8qu2p8wrkg.png" width=55%>

As one might notice, MISH is really a slight, expert-driven adjustment of SWISH, for a slight performance increase.

(This, by the way, is a nice story of how **automated discovery and human expertise can work "in turns".** And the story is far from over...)

#### Attention!

Though the roots of their main mechanisms can be traced back at least to the work of [Hochreiter and Schmidhuber](https://www.bioinf.jku.at/publications/older/2604.pdf) on LSTM networks, explicit attention mechanisms are usually dated to the seminal work of Bahdanau et al., [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473).


<img src="http://drive.google.com/uc?export=view&id=1qgNpCVfAlVg8ptO7CsuxCli3OnKi6HrL" width=75%>

Source: [Attention In Detail - author: carrie.ywj@alibaba-inc.com](https://alitech-private.oss-cn-beijing.aliyuncs.com/1562049342375/Attention%20%20PPT%20.pdf?Expires=1581095403&OSSAccessKeyId=LTAIqKGWQyF6Vd3W&Signature=/U%2BDxDtre9keSfAfxk3h6D7DL7k%3D)

Attention mechanisms were first proposed as a means of extending the capabilities of LSTM-based seq2seq models **by enabling the decoder to focus on localized parts of the encoder input**, thus enabling the learning of **more precise correlations between parts of the input and the output**. Later on though, very much inspired by the "state transport" mechanisms of LSTMs and residual networks (especially [Highway networks](https://arxiv.org/abs/1507.06228)), the concept of **self attention** appeared in the seminal paper [Attention is all you need](https://arxiv.org/abs/1706.03762).

The main insight of this approach was that the **input and the output of an attention layer can be one and the same ("self-attention")**, so the layer computes localized transformations on top of the input itself and projects the result back to the input space, much like a residual block computes some transformations on the input "passing through". In this approach, the engineering of "residuality" was a necessary step towards unlocking the potential of **localized transformations over the input, in the input space**, that is, a kind of successive transformation of it. (Hence the name Transformer in the paper.)
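A stripped-down version of self-attention can be written in a few lines. The sketch below uses scaled dot-product attention with queries, keys and values all set to the raw input; the learned projection matrices, multiple heads and residual connections of the actual Transformer are omitted as a simplification, and the names are illustrative.

```python
import math


def softmax(scores):
    """Softmax over a list of scores, shifted by the maximum for stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x.

    x is a list of token vectors (lists of floats of equal length).
    Returns the transformed tokens and the attention weights."""
    d = len(x[0])
    weights = []
    for query in x:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in x]
        weights.append(softmax(scores))
    # Each output token is a weighted average of all input tokens.
    output = [
        [sum(w * value[j] for w, value in zip(row, x)) for j in range(d)]
        for row in weights
    ]
    return output, weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
output, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

The output has the same shape as the input, which is what makes it possible to stack such layers and to combine them with residual connections.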

<img src="https://wiki.pathmind.com/images/wiki/attention_translation_grid.png" width=35%>

A very nice summary of the different proposed attention mechanisms and their usage contexts can be found [here](https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html).

One noteworthy aspect of locality, and of the mechanism by which it is achieved, is the presence of an **addressing mechanism** for calculating locations over an input space. The promised "locality", which is a delicate balance between the distributed and differentiable nature of neural computation and the atomicity of symbolic operations, is indeed not yet fully explored.

Attention mechanisms can be thought of as a promising avenue to bridge the gap between fully differentiable networks and potentially atomic, non-differentiable storage mechanisms - as demonstrated in the case of [neural Turing machines](https://arxiv.org/abs/1410.5401) - and thus, again, can form a **bridge between "System 1" and "System 2"** type solutions.

(It seems, that Bengio himself entertains this view in his above cited NeurIPS talk.)

It is also worth mentioning that some argue (with some valid points) that attention mechanisms are not really new at all, so marking them as a revolutionary new device is misguided. See for example [Attention Mechanisms in Deep Learning — Not So Special](https://towardsdatascience.com/attention-mechanisms-in-deep-learning-not-so-special-26de2a824f45).

It is also worth noting that in the more recent, optimized versions of transformers, namely [Reformers](https://arxiv.org/abs/2001.04451), the attention step itself is approximated:

<img src="http://drive.google.com/uc?export=view&id=1_66OMIbMj9_z3ZCev7C5J9JE3yfHFKdS" width=75%>

"In the paper, Reformer: The Efficient Transformer, the authors improve the computational time complexity of Transformer from quadratic to $N*logN$ time by approximating the attention step instead of fully calculating it - where N is sequence length. This self attention step amounts to a mutual interaction, i.e matrix-matrix multiply, $QK*$, hence the $O(N²)$. The authors instead employ a locality-sensitive hashing (LSH) mechanism to only compute the interaction with a select few of the neighbors - those closest to $q_i$ for any given query $q_i$. This drops time complexity to $N*logN$. This works since most values in the resulting weight distribution are $~0$, since the exponential function in the softmax formula exaggerates the larger values and suppresses the smaller ones. One can just ignore the small ones and focus on those $k_j$ closest to $q_i$. The Locality-sensitive hashing scheme assigns hashes to vectors randomly within a space such that vectors close together in the space have a higher likelihood and receiving the same hash. All in all, it is a welcome approximate algorithm which improves on the time complexity of the Transformer without overly sacrificing learning performance."

This again shows the pattern observed before in the case of the non-linearities:

**There are innovations which work well, but we still have not captured what the essential mechanisms behind them are.**
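To make the locality-sensitive hashing idea from the Reformer quote above more tangible, here is a minimal sketch. It uses the sign of random projections (random hyperplanes), which is a simpler LSH variant than the scheme used in the Reformer paper; all function names are made up for this example.

```python
import random


def make_hyperplane_hasher(dim, n_bits=4, seed=0):
    """Create a hash function based on the signs of random projections."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

    def hash_vector(vec):
        bits = 0
        for plane in planes:
            dot = sum(p * v for p, v in zip(plane, vec))
            # One bit per hyperplane: on which side does the vector fall?
            bits = (bits << 1) | (1 if dot >= 0 else 0)
        return bits

    return hash_vector


def bucket_by_hash(vectors, hasher):
    """Group vector indices by hash, so that attention could be restricted
    to vectors that share a bucket."""
    buckets = {}
    for index, vec in enumerate(vectors):
        buckets.setdefault(hasher(vec), []).append(index)
    return buckets


vectors = [[1.0, 0.1], [2.0, 0.2], [-1.0, -0.1]]
hasher = make_hyperplane_hasher(dim=2)
print(bucket_by_hash(vectors, hasher))
```

Vectors pointing in the same direction always share a bucket here, while vectors pointing in opposite directions do not; for vectors that are merely close, sharing a bucket is likely but not guaranteed, which is why the resulting attention is only an approximation.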


## Level of theory

### Generalization of overparametrized models

We are all too familiar with the drawbacks of empirical risk minimization, that is: we are very much afraid of learning the "quirks of the dataset" (Hinton) and overfitting. Furthermore, we wholeheartedly subscribed to the insights of **statistical learning theory** about the **relationship between complexity and overfitting / generalization**, as follows:

<img src="http://drive.google.com/uc?export=view&id=1aPYe74krTI3RY1Nf2rYO5G06qZVbI3Sc" width=35%>

But what if this is not the while picture? What if the **connection between complexity and generalization is not that simple?**

Some recent observations (for example in the paper [Reconciling modern machine learning practice and the bias-variance trade-off](https://arxiv.org/abs/1812.11118)) point in the other direction. Maybe the effect of more capacity is detrimental **only in the case of models smaller than the "memorization capacity"**, and maybe **we should have gone even bigger!**

<img src="http://drive.google.com/uc?export=view&id=1xvtWpiUxYkiYqPzrCx7BwFe1_YYh3eo5" width=85%>

Much is still unknown.
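
One way to probe this question yourself is a small random-features experiment: fix the training set, grow the number of features past the number of training points, and watch train and test error. The sketch below is only an illustration under stated assumptions (random cosine features, a minimum-norm least squares fit via `np.linalg.lstsq`, synthetic data); it is not the setup of the cited paper, and whether a clear second descent shows up depends on noise level and seed.

```python
import numpy as np

def random_feature_errors(n_features, X_tr, y_tr, X_te, y_te, rng):
    """Fit a linear readout on random cosine features; return (train, test) MSE."""
    d = X_tr.shape[1]
    W = rng.normal(size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    phi = lambda X: np.cos(X @ W + b)
    # lstsq returns the minimum-norm solution once n_features > n_train.
    coef, *_ = np.linalg.lstsq(phi(X_tr), y_tr, rcond=None)
    train_mse = np.mean((phi(X_tr) @ coef - y_tr) ** 2)
    test_mse = np.mean((phi(X_te) @ coef - y_te) ** 2)
    return train_mse, test_mse

rng = np.random.default_rng(0)
n_train, n_test, d = 30, 500, 5            # toy sizes (assumptions)
w_true = rng.normal(size=d)
X_tr = rng.normal(size=(n_train, d))
X_te = rng.normal(size=(n_test, d))
y_tr = np.sin(X_tr @ w_true) + 0.1 * rng.normal(size=n_train)
y_te = np.sin(X_te @ w_true)

results = {}
for p in [5, 15, 30, 60, 300]:             # 30 = number of training points
    results[p] = random_feature_errors(p, X_tr, y_tr, X_te, y_te, rng)
    print(f"features={p:4d}  train MSE={results[p][0]:.2e}  test MSE={results[p][1]:.2e}")
```

With more features than training points the model interpolates the training data, so the interesting quantity is how the test error behaves around and beyond that threshold.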

There are also some interesting results pointing in somewhat the opposite direction, raising the question:


### Do we need training and big networks at all?

To say that the field of deep learning is in flux is a major understatement. We at least assumed it to hold true that we need large networks, and that we need to train them for an extensive period of time with sophisticated methods.

But research with the practical motivation of getting models to run on small, constrained hardware (through pruning and floating point manipulation) had some side effects that called this notion into question in two ways:

#### The "Lottery Ticket Hypothesis"

"...after training a network, **set all weights smaller than some threshold to zero (prune them), rewind the rest of the weights to their initial configuration, and then retrain the network from this starting configuration keeping the pruned weights frozen (not trained).** Using this approach, they obtained two intriguing results.

First, they showed that the pruned networks performed well. **Aggressively pruned networks (with 99.5 percent to 95 percent of weights pruned) showed no drop in performance compared to the much larger, unpruned network. Moreover, networks only moderately pruned (with 50 percent to 90 percent of weights pruned) often outperformed their unpruned counterparts.**

Second, as compelling as these results were, the characteristics of the remaining network structure and weights were just as interesting. Normally, if you take a trained network, re-initialize it with random weights, and then re-train it, its performance will be about the same as before. But with the skeletal Lottery Ticket (LT) networks, this property does not hold. The network trains well only if it is rewound to its initial state, including the specific initial weights that were used. Reinitializing it with new weights causes it to train poorly. As pointed out in Frankle and Carbin’s study, it would appear that the **specific combination of pruning mask** (a per-weight binary value indicating whether or not to delete the weight) **and weights underlying the mask form a lucky sub-network** found within the larger network, or, as named by the original study, a winning “Lottery Ticket.”"

<img src="https://1fykyq3mdn5r21tpna3wkdyi-wpengine.netdna-ssl.com/wp-content/uploads/2019/05/blog_header_2-1068x458.png" width=85%>

[Original paper](https://arxiv.org/pdf/1803.03635.pdf)

A [more thorough analysis](https://eng.uber.com/deconstructing-lottery-tickets/) 

**Takeaways:**
- Much of the capacity of deep models and the associated training time is wasted
- Initialization is a dominant factor; maybe the size of networks only matters because it gives randomness enough room to come up with "lottery tickets"
- There is a very interesting interplay between structure, learning and performance in deep networks
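
The prune, rewind, retrain recipe from the quote can be sketched in a few lines. This is a toy illustration only: it uses a linear model trained by gradient descent and a single pruning round, whereas the original study works with deep networks. The helper names `magnitude_mask` and `train`, as well as all sizes, are assumptions made for this example.

```python
import numpy as np

def magnitude_mask(trained, prune_fraction):
    """Boolean mask that drops the smallest-magnitude weights."""
    n_pruned = int(round(prune_fraction * trained.size))
    if n_pruned == 0:
        return np.ones(trained.shape, dtype=bool)
    threshold = np.sort(np.abs(trained).ravel())[n_pruned - 1]
    return np.abs(trained) > threshold

def train(w_init, mask, X, y, steps=500, lr=0.1):
    """Gradient descent on squared error; pruned weights stay frozen at zero."""
    w = w_init * mask
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad * mask
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
w_true = np.zeros(20)
w_true[:5] = [3.0, -2.0, 1.5, -1.0, 2.5]     # only 5 weights really matter
y = X @ w_true
w_init = rng.normal(scale=0.1, size=20)      # the initialization we rewind to

dense = train(w_init, np.ones(20, dtype=bool), X, y)   # 1. train the full model
mask = magnitude_mask(dense, prune_fraction=0.75)      # 2. prune small weights
ticket = train(w_init, mask, X, y)                     # 3. rewind and retrain

print("kept weights:", int(mask.sum()), "of", mask.size)
print("ticket MSE:", np.mean((X @ ticket - y) ** 2))
```

To mimic the second finding from the quote, one could replace `w_init` in step 3 by a fresh random draw and compare; in this convex toy problem the difference will not show, which is exactly why the effect in deep networks is so intriguing.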
    
#### Weight agnostic networks

"...Schmidhuber et al. have shown that a randomly-initialized LSTM [13] with a learned linear output layer can predict time series... we aim to search for **weight agnostic neural networks**, architectures with strong inductive biases that **can already perform various tasks with random weights.**"

<img src="https://storage.googleapis.com/quickdraw-models/sketchRNN/wann/png/mnist_cover.png" width=85%>


[Original](https://weightagnostic.github.io)

**Takeaways:**
- Maybe we do not need (that much) training at all?
- Structure can be key, as well as the right inductive bias!
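
The observation cited in the quote, a randomly initialized recurrent network with only a learned linear output layer, can be tried out on a toy time series. The sketch below swaps the LSTM for a plain `tanh` recurrent network (an echo-state style setup) and fits the readout with ridge regression; all sizes, the spectral radius of 0.9 and the ridge constant are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_hidden, washout = 50, 50

# Random, never-trained recurrent weights.
W_in = rng.uniform(-0.5, 0.5, size=n_hidden)
W = rng.normal(size=(n_hidden, n_hidden))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # rescale spectral radius to 0.9

def run_network(signal):
    """Collect the hidden states of the fixed random network."""
    h = np.zeros(n_hidden)
    states = []
    for u in signal:
        h = np.tanh(W @ h + W_in * u)
        states.append(h)
    return np.array(states)

signal = np.sin(0.2 * np.arange(600))
states = run_network(signal[:-1])     # state after seeing signal[t]
targets = signal[1:]                  # task: predict signal[t + 1]

# Add a bias column and drop the initial transient.
H = np.hstack([states, np.ones((len(states), 1))])[washout:]
Y = targets[washout:]

split, ridge = 400, 1e-6
A = H[:split]
# The linear readout is the ONLY trained part.
w_out = np.linalg.solve(A.T @ A + ridge * np.eye(A.shape[1]), A.T @ Y[:split])
test_mse = np.mean((H[split:] @ w_out - Y[split:]) ** 2)
print(f"held-out MSE of the linear readout: {test_mse:.2e}")
```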

## Level of learning paradigms

Perhaps the most important change in the field is the emergence and gradual popularization of learning settings and paradigms that go beyond simple supervised learning. Some of these approaches are only add-ons, while others represent radical shifts of framework and thinking.

### Few-shot and zero-shot learning

In few-shot learning, the goal is to expand the "reach" of the generalization of trained models beyond the classes they saw during training.

"The usual setup is that you have categories with many examples you can use at training time; then at test time, you are given novel categories (usually 5) with only a few examples per category (usually 1 or 5; called “support-set”) and query images from the same categories."

<img src="https://miro.medium.com/max/1116/1*Iip2Ydfig9_EcM9Tifk6VQ.png" width=45%>

This would have obvious implications for the amount of data needed for machine learning model training.

A good summary of recent advancements can be found [here](https://towardsdatascience.com/few-shot-learning-in-cvpr19-6c6892fc8c5)

There is considerable research going on in this area, which is just now gaining traction.
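
The "N-way, K-shot" episode structure described in the quote above can be sketched as follows. The data here is synthetic (Gaussian clusters standing in for image embeddings), and the classifier is a simple nearest class mean baseline rather than any particular published method; `sample_episode` and `nearest_prototype` are hypothetical helper names.

```python
import numpy as np

def sample_episode(features, labels, n_way, k_shot, n_query, rng):
    """Draw a support set and a query set from n_way randomly chosen classes."""
    classes = rng.choice(np.unique(labels), size=n_way, replace=False)
    s_idx, s_lab, q_idx, q_lab = [], [], [], []
    for episode_label, c in enumerate(classes):
        idx = rng.permutation(np.flatnonzero(labels == c))
        s_idx.extend(idx[:k_shot])
        s_lab.extend([episode_label] * k_shot)
        q_idx.extend(idx[k_shot:k_shot + n_query])
        q_lab.extend([episode_label] * n_query)
    return (features[s_idx], np.array(s_lab)), (features[q_idx], np.array(q_lab))

def nearest_prototype(support_x, support_y, query_x):
    """Assign each query to the class whose support mean is closest."""
    classes = np.unique(support_y)
    prototypes = np.stack([support_x[support_y == c].mean(axis=0) for c in classes])
    dists = ((query_x[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
    return classes[np.argmin(dists, axis=1)]

rng = np.random.default_rng(0)
n_classes, per_class, dim = 20, 30, 16          # toy "novel" classes
centers = 3.0 * rng.normal(size=(n_classes, dim))
labels = np.repeat(np.arange(n_classes), per_class)
features = centers[labels] + rng.normal(size=(n_classes * per_class, dim))

(support_x, support_y), (query_x, query_y) = sample_episode(
    features, labels, n_way=5, k_shot=1, n_query=15, rng=rng)
pred = nearest_prototype(support_x, support_y, query_x)
accuracy = (pred == query_y).mean()
print(f"5-way 1-shot accuracy on this episode: {accuracy:.2f}")
```

In a real few-shot system the hard part is learning the embedding in which such a simple rule works; here the clusters are well separated by construction.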

### Active learning

#### Reflection: not all datapoints are created equal

It is interesting to note that the use of "informative" examples can drastically reduce the amount of data a classifier needs in order to learn, that is to say: **not all datapoints contribute equally to the performance** of a classifier.

In a paper submitted to ICLR 2019, [An Empirical Study of Example Forgetting During Deep Neural Network Learning](https://openreview.net/pdf?id=BJlxm30cKm), Toneva et al. show that there are in fact less important datapoints: those which the network never "forgets", that is, never misclassifies again during the training run once it has learned them. In fact, a shocking amount of data of this type can be removed without having much of an effect on the learning procedure!

<img src="https://crazyoscarchang.github.io/images/2019-02-16-seven-myths-in-machine-learning-research/myth_4_2.png" width=65%>

"Shockingly, 30% of the datapoints in CIFAR-10 can be removed, without changing test accuracy by much."

[source](https://crazyoscarchang.github.io/2019/02/16/seven-myths-in-machine-learning-research/#myth-4)
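
The bookkeeping behind such "forgetting" statistics is simple. Assuming we record, after every epoch, whether each training example is classified correctly, a forgetting event is a transition from correct to incorrect. The sketch below uses a tiny hand-made history; the function names are hypothetical.

```python
import numpy as np

def forgetting_counts(correct):
    """correct: bool array of shape (n_epochs, n_examples)."""
    correct = np.asarray(correct, dtype=bool)
    # Forgetting event: correct at epoch t, wrong at epoch t + 1.
    forgotten = correct[:-1] & ~correct[1:]
    return forgotten.sum(axis=0)

def unforgettable(correct):
    """Examples that were learned at some point and never forgotten afterwards."""
    correct = np.asarray(correct, dtype=bool)
    return correct.any(axis=0) & (forgetting_counts(correct) == 0)

# Rows are epochs, columns are four training examples (made-up history).
history = np.array([
    [0, 0, 1, 0],
    [1, 0, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 1, 1],
], dtype=bool)

print(forgetting_counts(history))   # forgetting events per example
print(unforgettable(history))       # candidates for removal
```

Here only the first example is learned once and then kept; the second is never learned at all, so it is not counted as unforgettable.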

Maybe we have hope for small, not just big data?

Often - though not always - the informativeness of a datapoint is connected to the decision "margin". Informative points can be "support vectors". (See the discussion on "support vector machines".)

<img src="https://images.slideplayer.com/15/4793236/slides/slide_9.jpg" width=55%>

The concept of "influential points" also has connections to "outliers". Oftentimes outliers have a disproportionate influence (or even represent ["label noise"](https://www.semanticscholar.org/paper/A-comprehensive-introduction-to-label-noise-Fr%C3%A9nay-Kab%C3%A1n/c44f388832d6f309b1bb9ccdeddee491f195e6cd), that is, annotation errors), so it is not clear whether an extensive focus on them helps or hurts performance.

The [active learning](https://en.wikipedia.org/wiki/Active_learning_(machine_learning)) approach capitalizes on exactly this effect: it poses targeted queries to an "oracle" (typically a human expert) so that it maximizes learning from the least possible amount of annotated data. 

<img src="https://cdn-images-1.medium.com/max/490/0*doMj6A96nyLxrzIU.png" width=55%>

This approach can greatly reduce the hunger of models for annotated data.
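
A common way to choose the queries is uncertainty sampling: ask the oracle about the pool example the current model is least sure about. The following sketch is one possible minimal version, assuming a logistic regression learner, a synthetic 2D pool and a hidden labeling rule that plays the role of the human expert; none of these choices come from a specific paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg(X, y, steps=300, lr=0.5):
    """Plain gradient descent on the logistic loss."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * X.T @ (sigmoid(X @ w) - y) / len(y)
    return w

def most_uncertain(X_pool, w, candidates):
    """Index of the unlabeled example whose prediction is closest to 0.5."""
    p = sigmoid(X_pool[candidates] @ w)
    return int(candidates[np.argmin(np.abs(p - 0.5))])

rng = np.random.default_rng(0)
X_pool = np.hstack([rng.normal(size=(300, 2)), np.ones((300, 1))])  # bias column
true_labels = (X_pool[:, 0] + X_pool[:, 1] > 0).astype(float)       # hidden from the learner

def oracle(i):
    """Stands in for the human annotator."""
    return true_labels[i]

# Seed set: one example from each class (assumed to be given in advance).
score = X_pool[:, 0] + X_pool[:, 1]
labeled = {i: oracle(i) for i in (int(np.argmax(score)), int(np.argmin(score)))}

for _ in range(20):                                  # 20 targeted queries
    idx = np.array(sorted(labeled))
    w = fit_logreg(X_pool[idx], np.array([labeled[i] for i in idx]))
    candidates = np.array([i for i in range(len(X_pool)) if i not in labeled])
    query = most_uncertain(X_pool, w, candidates)
    labeled[query] = oracle(query)

idx = np.array(sorted(labeled))
w = fit_logreg(X_pool[idx], np.array([labeled[i] for i in idx]))
accuracy = ((sigmoid(X_pool @ w) > 0.5) == (true_labels > 0.5)).mean()
print(f"labels used: {len(labeled)}, pool accuracy: {accuracy:.2f}")
```

The queried points concentrate near the decision boundary, which connects back to the "margin" remark above.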

### Reinforcement Learning

Reinforcement learning is in essence a cognitively very well motivated learning paradigm, which broadens the horizon of how we pose the machine learning problem itself.

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/1b/Reinforcement_learning_diagram.svg/250px-Reinforcement_learning_diagram.svg.png" width=35%>

Instead of relying only on input-output pairs for a numerical model, our framework shifts to an agent and an environment. In this environment, the agent can execute actions, thus actively modifying the state of the world, and observe the resulting state, together with some eventual, possibly much delayed feedback in the form of a scalar number that is computed by some mechanism of the environment.

In this paradigm, if there is an environment and a reward available, and **if experimentation is sufficiently cheap or harmless** (say in a simulation), the agent can directly learn complex behavioral functions (policies), not just individual classification or prediction tasks.
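
The agent-environment loop with delayed scalar feedback can be sketched with tabular Q-learning on a toy "corridor" world. The environment class `Chain`, the reward of 1 at the last state and the hyperparameters are all assumptions made for illustration.

```python
import numpy as np

class Chain:
    """Corridor of n_states cells; reward only when the last cell is reached."""
    def __init__(self, n_states=5):
        self.n_states = n_states

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):                      # 0 = left, 1 = right
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.n_states - 1)
        done = self.state == self.n_states - 1
        return self.state, (1.0 if done else 0.0), done

rng = np.random.default_rng(0)
env = Chain()
Q = np.zeros((env.n_states, 2))                  # action-value table
alpha, gamma, epsilon = 0.5, 0.9, 0.2

for episode in range(200):
    s, done = env.reset(), False
    while not done:
        # Epsilon-greedy action choice, random on ties.
        if rng.random() < epsilon or Q[s, 0] == Q[s, 1]:
            a = int(rng.integers(2))
        else:
            a = int(np.argmax(Q[s]))
        s_next, r, done = env.step(a)
        # Q-learning update towards reward plus discounted best next value.
        target = r + (0.0 if done else gamma * Q[s_next].max())
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next

policy = Q[:-1].argmax(axis=1)                   # greedy action per non-terminal state
print("greedy policy (1 = right):", policy)
```

Note that no state is ever labeled with the "correct" action: the policy emerges purely from interaction and the delayed reward.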

Reinforcement learning is an extremely hot topic, and it accounts for some of the major, remarkable breakthroughs achieved by AI models, from playing Go to online games.


### Unsupervised learning

The interesting dichotomy is that, though annotated data is in extremely short supply, non-annotated, raw data is painfully abundant. (Think of all the problems of big data management!)

As a main proponent of this approach, the Turing Award winner Yann LeCun asserts that the supervised learning paradigm will not be enough for achieving the next AI breakthrough.

With a street graffiti analogy:

<img src="https://bruceturkel.com/wp-content/uploads/2018/04/Revolution.png" width=55%>
	
<img src="http://drive.google.com/uc?export=view&id=1KdpCVTXrdzXc-E1PI6YDudiAmz9V735M" width=55%>

More specifically, he argues that the human learning process overwhelmingly relies on "unsupervised" or "predictive" approaches, and that all other paradigms are just the "icing on the cake".

<img src="https://miro.medium.com/max/1176/0*sQmcKODThlssh2V5.png" width=55%>

See more on this [here](https://medium.com/syncedreview/yann-lecun-cake-analogy-2-0-a361da560dae).

LeCun's talk about unsupervised learning (one among many) can be found [here](https://www.youtube.com/watch?v=HzgfPNeqJuQ). (Skip 5 min intro!)

One of the prominent avenues for unsupervised learning is the **"generative adversarial" paradigm**, which has recently shown great promise and notable achievements.

### Complex view of the world

Another, so far only partially fulfilled expectation towards machine learning models is that they would - just as we do - synergistically learn about the world in general, not just about a given task or a given perceptual modality, and would form a valid representation of the causal structure of the environment. 

#### Multi modal / task learning


Basic idea: it is advantageous to force a model to learn multiple tasks concurrently!
The promise is that the model learns something general about the world, since it is constrained by a variety of data. 

An example, which incidentally is connected to "few-shot learning" (where "few" is zero in this case):

##### "Zero-shot translation"
[Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation](https://arxiv.org/pdf/1611.04558.pdf)

Or in a more readable format [here](https://ai.googleblog.com/2016/11/zero-shot-translation-with-googles.html)

- Seq2seq, "normal" encoder-decoder + attention mechanism machine translation

<img src="https://ai2-s2-public.s3.amazonaws.com/figures/2017-08-08/0245d8cbc0a405e553311bd20a01099c9aa11c14/4-Figure1-1.png" width=55%>

- Parallel corpora on _language pairs_
- The only addition is a separate tag for the target language, placed before the input sequence (to inform the model...)

**Result:**

1. It learns to translate between language pairs for which _it was never given data_
2. Its inner representation is _language independent_

<img src="https://2.bp.blogspot.com/-AmBczBtfi3Q/WDSB0M3InDI/AAAAAAAABbQ/1U_51u5ynl4FK4L0KOEllfRCq0Oauzy5wCEw/s640/image00.png" width=55%>

This result amazed even the researchers, and much analysis went into it.
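
The data preparation trick behind this is tiny, which is part of what makes the result so striking. The sketch below shows it on made-up toy sentence pairs; the `<2es>`-style token format and the helper name `add_target_tag` are assumptions for illustration, and no actual translation model is trained here.

```python
def add_target_tag(source_sentence, target_lang):
    """Prepend an artificial token that names the desired output language."""
    return f"<2{target_lang}> {source_sentence}"

# Toy parallel data (made up): (source language, target language, source, target)
parallel = [
    ("en", "es", "Hello", "Hola"),
    ("es", "en", "Hola", "Hello"),
    ("en", "pt", "Hello", "Olá"),
    ("pt", "en", "Olá", "Hello"),
]

# One shared model would be trained on all of these tagged pairs at once.
training_examples = [(add_target_tag(src, tgt_lang), tgt)
                     for _, tgt_lang, src, tgt in parallel]
seen_pairs = {(src_lang, tgt_lang) for src_lang, tgt_lang, _, _ in parallel}

# Zero-shot request: Portuguese -> Spanish never occurs in the training data,
# yet the input is formed in exactly the same way.
zero_shot_pair = ("pt", "es")
zero_shot_input = add_target_tag("Olá", "es")

print(training_examples[0])
print(zero_shot_input, "| pair seen in training:", zero_shot_pair in seen_pairs)
```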

##### Multi-modality

An even more ambitious experiment: Google's ["One Model To Learn Them All"](https://arxiv.org/abs/1706.05137)

- Single model that yields good results on a number of problems spanning multiple domains: trained concurrently on ImageNet, multiple translation tasks, image captioning (COCO dataset), a speech recognition corpus, and an English parsing task. 
- Model architecture: convolutional layers, an attention mechanism, and sparsely-gated layers
- Each of these computational blocks is crucial for a subset of the tasks we train on.
- Even if a block is not crucial for a task, we observe that adding it never hurts performance and in most cases improves it on all tasks. 
- We also show that tasks with less data benefit largely from joint training with other tasks, while performance on large tasks degrades only slightly if at all.
- A model has been given a host of tasks in parallel (language as well as visual tasks!).

A general description of multi-task learning:

[Ruder: Multi task learning](https://arxiv.org/abs/1706.05098), or a more "friendly" description [here](http://ruder.io/multi-task-learning-nlp/).

#### The real causes

The other huge expectation towards our models is that they should not just predict the occurrences of the world well, but do so based on the intuition that **there should be a finite, rather small number of real causal factors contributing to the emergence of the phenomena**. We, as natural observers as well as scientists, operate under the assumption that the **causes of the phenomena we observe can be disentangled, and only a few meaningful real causes are present**. 

<img src="https://github.com/google-research/disentanglement_lib/raw/master/sample.gif?raw=true" width=55%>

source: [disentanglement_lib](https://github.com/google-research/disentanglement_lib)

The topic of causal machine learning is also strongly connected to unsupervised learning, since such disentanglement constraints are frequently applied to unsupervised models that try to extract the real causal mechanisms. Large-scale research is being carried out on this topic with a multitude of tools, since **if we can unlock efficient causal learning, we can use the whole apparatus of causal reasoning on the resulting representations** - thus, we can again try to achieve unity between "System 1" and "System 2" type intelligence. 

### Reasoning and the merger of the two big schools

The final goal is, in a sense, to merge the advantageous properties of the approaches above and come up with a system that can efficiently self-explore, learn from raw data, detect the real causes behind phenomena, and finally do all this in a way that is compatible with symbolic knowledge and reasoning - so that its knowledge can be transmitted to humans, analyzed and validated. This is the grand goal that all the researchers on the frontier are working towards.

#### Side goal: XAI

As machine learning models became more and more complex and were deployed in a multitude of mission-critical environments, one of the concerns that emerged is the issue of **explainable machine learning** (or XAI). All the stakeholders involved in the operation of a machine learning model want to transparently understand, audit and correct the decision mechanisms that operate in it. This in itself poses serious problems, since a huge overparametrized neural network does not lend itself to easy analysis. Though a considerable amount of effort has gone into explainability, we can assume that whatever steps we take towards the merger of symbolic and distributed learning systems would greatly enhance the explainability of these systems.

#### Fewer stupid mistakes

Last, but not least, we hope that with all the (sometimes even small) steps we take in merging "System 1" with "System 2", our models will, like ourselves, be able to operate more correctly and suffer less from simple stupid mistakes. Maybe if they do it well enough, we might even be able to learn something from them about mitigating our own biases.

# What will we cover in this course?

From the broad frontier sketched above, we are unable to dwell in depth on all the topics, so we had to restrict ourselves to three:

1. Causal machine learning
2. Unsupervised learning with Generative Adversarial Networks
3. Reinforcement Learning (and its application in NLP)