<a id="representation"></a>
# Representation learning revisited

After gaining insight about the architectures and procedures that make training of deep neural networks possible, let's get back to the topic of representations. Originally we motivated the usage of deep learning with two things: 
1. Superior performance

That is the ability to surpass traditional models in relevant measures.

2. End-to-end learning

The ability to forego much of the manual feature engineering process necessary for classical models.

The representation ability of deep learning models is strongly connected to the latter. 
Let us examine this in detail!

## Learned representations in "shallow" networks: word2vec
**"Don't count, predict!"**


<a href="http://drive.google.com/uc?export=view&id=1uu00eAJi3-3tH2Iz8IMP9vz3iWtgdmqM"><img src="https://drive.google.com/uc?export=view&id=1XVg5kBnHm3N5NrnRvTgwKf6Pk_0-TM7u" width=600 heigth=600></a>


With the publication of "Distributed Representations of Words and Phrases and their Compositionality" by [Mikolov et al. 2013](https://arxiv.org/pdf/1310.4546.pdf), a _huge_ shift occurred in the NLP community: it moved away from frequency-based methods and introduced prediction-based methods for building efficient language models (at first at the word level).

### Schematics

<a href="https://raw.githubusercontent.com/rohan-varma/paper-analysis/master/word2vec-papers/models.png"><img src="https://drive.google.com/uc?export=view&id=1ZUwSGzEj4EWHCEabrrvmXadfRsdRF7_f" width=600 heigth=600></a>

<a href="https://i.stack.imgur.com/igSuE.png"><img src="https://drive.google.com/uc?export=view&id=1R5hI-k4Zu8P6bE3oGgmJ5cMpDYshU7ML" width=300 heigth=300></a>

(It is important to note that hierarchical softmax and negative sampling became widely used through this line of research, since an output layer as wide as the vocabulary $v$ consumed an extreme amount of computation (by 2013 standards) for a vocabulary of 300k words. Thanks to these tricks there are efficient, CPU-friendly implementations of word2vec available "out of the box", e.g. in [Gensim](https://radimrehurek.com/gensim/models/word2vec.html).)
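The skip-gram variant in the schematics above is trained on (center word, context word) pairs. As a minimal sketch of how such pairs could be generated from a tokenized sentence (the function name `skipgram_pairs` and the window size are illustrative assumptions, not taken from the original word2vec code):

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for a symmetric context window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


sentence = "the quick brown fox jumps".split()
pairs = skipgram_pairs(sentence, window=1)
print(pairs)
```

The model is then asked to predict the context word from the center word (skip-gram) or the other way round (CBOW); the embeddings are a byproduct of solving this task.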

### Advantage

The real advantage of these "word embeddings" (which have been the workhorse of NLP ever since) was not that they were useful for predicting the next word "autocomplete style" (as we have seen before in our training), but that they serve as general dense vector representations, "embeddings", of words. The main breakthrough of Mikolov et al. was the discovery of the deep structure that the vector spaces exhibit after training!
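The kind of structure meant here is often demonstrated with analogy arithmetic on the vectors. The sketch below uses hand-made, two-dimensional toy vectors (they are an assumption for illustration, NOT trained embeddings) to show the mechanics: subtract, add, then look for the nearest word by cosine similarity.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hand-made toy embeddings (NOT trained): axis 0 ~ "royalty", axis 1 ~ "gender"
emb = {
    "king": [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man": [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.1],
}


def analogy(a, b, c):
    """Return the word closest to emb[b] - emb[a] + emb[c], excluding the inputs."""
    target = [y - x + z for x, y, z in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))


print(analogy("man", "king", "woman"))
```

With real embeddings the same lookup would be run over the whole vocabulary; libraries such as Gensim offer this kind of query for trained models.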


<a href="http://drive.google.com/uc?export=view&id=1heogQhMfvtiOSfPtKvmc2OtyGadAsOmd"><img src="https://drive.google.com/uc?export=view&id=1HbQDy8orwiRH7SiaJjE5Gu5g99iavZa5" width=600 heigth=600></a>


A good analysis of this topic can be found in [Marek Rei's blogpost](http://www.marekrei.com/blog/dont-count-predict/).

(Naturally, progress did not stop here, you can read up on successive generations of vector embedding for NLP [here](https://medium.com/huggingface/universal-word-sentence-embeddings-ce48ddc8fc3a).)

### Why can this be?

For the model to solve the prediction task effectively, it has to come up with a representation that captures the salient features of the data in the most compact way. In other words, it memorizes and compresses the data lossily, and in doing so captures its main features.

**It turns out that the decisive advantage of deep learning based methods is exactly this: the "byproduct" of learning hierarchic, meaningful features during training.** Throughout this class, we will examine the effects and possibilities arising from this. 

Remember:
**Representation is everything!**

## New distance-metric oriented supervised tasks and loss functions

The growing recognition that models trained on particular supervised tasks can learn representations useful for a wide range of different purposes led to research into tasks and objectives that are more conducive to learning good distributed representations.

One of the explicit goals of these objectives is to learn useful similarity distance metrics over the domain.

Image similarity recognition (in particular face recognition) was the most important problem that motivated this research, but in the last few years the methods have been heavily used in other domains as well, e.g. for voice verification/identification. The most important characteristics of these tasks and objectives are that they

- work with coarse-grained ranking data about similarity; crucially, they can be used on labeled/classification data sets where the only available similarity information comes from class membership (examples belong to the same or to different classes)
- produce (similarly to Word2Vec) _dense distributed representations_ / _embeddings_ of the input 
- try to measure and maximize the quality of the distance-based similarity metric provided by the learned representations
- typically aim to keep members of the same class close to each other and distant from those of other classes (in the embedding space)

### Contrastive loss (2005)

The first step in this direction was the introduction of __contrastive loss__ by Hadsell et al. in 2005 ([Dimensionality Reduction by Learning an Invariant Mapping](http://yann.lecun.com/exdb/publis/pdf/hadsell-chopra-lecun-06.pdf)), which is a loss function for __pairs of examples__ $\langle \mathbf x_1, \mathbf x_2\rangle$:

$$\mathcal L_{\mathrm{Contrastive}} = (1-y)\frac{1}{2}D(\mathbf x_1, \mathbf x_2)^2 + y \frac{1}{2}\max(0, m-D(\mathbf x_1, \mathbf x_2))^2$$

where $y$ is 0 if the two examples are similar and 1 if not, and $D$ is a distance metric between the embeddings, e.g. Euclidean or cosine distance. The intention is clear: bring similar examples closer together and separate dissimilar examples by at least a margin $m$.  
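As a minimal sketch, the formula above can be written down directly for a single pair of embeddings. Euclidean distance and the default margin of 1.0 are assumptions chosen for illustration; in practice the embeddings would come from the network and the loss would be averaged over a batch.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(e1, e2, y, margin=1.0):
    """Contrastive loss for one pair; y = 0 for similar, y = 1 for dissimilar."""
    d = euclidean(e1, e2)
    similar_term = (1 - y) * 0.5 * d ** 2
    dissimilar_term = y * 0.5 * max(0.0, margin - d) ** 2
    return similar_term + dissimilar_term


# A similar pair is penalized for any distance between its members
loss_similar = contrastive_loss([0.0, 0.0], [0.3, 0.4], y=0)
# A dissimilar pair is penalized only while it is closer than the margin
loss_dissimilar_close = contrastive_loss([0.0, 0.0], [0.3, 0.4], y=1)
loss_dissimilar_far = contrastive_loss([0.0, 0.0], [3.0, 4.0], y=1)
print(loss_similar, loss_dissimilar_close, loss_dissimilar_far)
```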

The architecture used with this kind of loss is typically a so-called "Siamese network", in which embeddings are produced with the same network topology and weights for both examples:

<a href="https://miro.medium.com/max/841/1*fUY-bpGFoUMWBkh_rychCQ.jpeg"><img src="https://drive.google.com/uc?export=view&id=1GnyGX-FsC_29NLGYEyJRNZYVi6kPabpj" width="350px"></a>

(Image source: [Siamese Networks for Visual Tracking](https://medium.com/intel-student-ambassadors/siamese-networks-for-visual-tracking-96262eaaba77))

__Challenges__

Perhaps the most important challenge for contrastive loss is that it incentivizes the model to keep examples from the same class very close to each other, so it is prone to concentrating each class into a small area of the space. As a consequence, the learned representations lack distinctions within the classes.

###  Triplet loss (2010)

In many respects a follow-up to contrastive loss, triplet loss tries to solve the class concentration problem by considering data point _triplets_ and enforcing only the correct _relative distances_. The inputs are $\langle A, P, N\rangle$ triplets containing a so-called _anchor example_ $A$, a positive example $P$ and a negative example $N$; the positive belongs to the same class as the anchor, the negative to a different one.

The objective is to keep an (SVM-like) positive margin between the anchor-positive distance and the anchor-negative distance:

$$
\mathcal L(A, P, N) = \max(D(A, P) - D(A, N) + \mathrm{margin}, 0)
$$

where the distance function $D(\cdot, \cdot)$ is commonly the Euclidean or cosine distance between the embeddings of the examples.
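A minimal sketch of the triplet loss for a single triplet of embeddings follows. Euclidean distance and the margin value 0.2 are illustrative assumptions.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(D(A, P) - D(A, N) + margin, 0) for one triplet of embeddings."""
    d_ap = euclidean(anchor, positive)
    d_an = euclidean(anchor, negative)
    return max(d_ap - d_an + margin, 0.0)


anchor = [0.0, 0.0]
positive = [0.1, 0.0]

# The negative is far away: the relative distance constraint holds, no loss
loss_satisfied = triplet_loss(anchor, positive, [2.0, 0.0])
# The negative is closer than the positive: the constraint is violated
loss_violated = triplet_loss(anchor, positive, [0.05, 0.0])
print(loss_satisfied, loss_violated)
```

Note that, unlike contrastive loss, nothing here pushes the positive all the way onto the anchor: once the margin is satisfied the loss is zero.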


<a href="https://omoindrot.github.io/assets/triplet_loss/triplet_loss.png"><img src="https://drive.google.com/uc?export=view&id=1w7Yx2yJbTJ7pdaLaUfOtWDNi9BRQLDiz" width="500px"></a>

([Image source](https://omoindrot.github.io/triplet-loss))

__Challenges__

Training with triplet loss is known to be challenging, since the results depend on the "difficulty" of the triplets that are fed into the network. As training on all possible triplets is usually infeasible and would not be desirable anyway, smart and often resource-intensive "triplet mining" techniques have to be employed to select the "hard" triplets, in which the negative example is closer to the anchor than the positive, and the "semi-hard" triplets, in which the negative is farther away than the positive but by less than the margin:

<a href="https://omoindrot.github.io/assets/triplet_loss/triplets.png"><img src="https://drive.google.com/uc?export=view&id=1Ur_mBDuhDRdSacfarQ9CnhGGCstUJadX" width="400px"></a>

([Image source](https://omoindrot.github.io/triplet-loss))
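The categories in the figure can be sketched as a simple rule on the two distances of a triplet. The function names and the example distances below are hypothetical; the thresholds follow the usual definitions (hard: $D(A,N) < D(A,P)$; semi-hard: $D(A,P) \le D(A,N) < D(A,P) + \mathrm{margin}$; easy otherwise).

```python
def triplet_kind(d_ap, d_an, margin):
    """Classify a triplet by anchor-positive and anchor-negative distance."""
    if d_an < d_ap:
        return "hard"  # the negative is closer to the anchor than the positive
    if d_an < d_ap + margin:
        return "semi-hard"  # correct order, but the margin is not yet satisfied
    return "easy"  # the loss is already zero, nothing to learn from


def mine(triplet_distances, margin):
    """Keep only triplets that still produce a non-zero loss."""
    return [t for t in triplet_distances if triplet_kind(t[0], t[1], margin) != "easy"]


# Hypothetical (d_ap, d_an) pairs for three triplets
distances = [(1.0, 0.5), (1.0, 1.1), (1.0, 2.0)]
kinds = [triplet_kind(d_ap, d_an, margin=0.2) for d_ap, d_an in distances]
selected = mine(distances, margin=0.2)
print(kinds, selected)
```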

__See also__

+ A good, TF-based introduction from which the above figures were taken: [Triplet Loss and Online Triplet Mining in TensorFlow](https://omoindrot.github.io/triplet-loss)
+ The classic, original triplet loss paper on image similarity detection: [Chechik et al.: Large Scale Online Learning of Image Similarity Through Ranking (2010)](http://www.jmlr.org/papers/volume11/chechik10a/chechik10a.pdf)
+ Using triplet loss for face recognition: [Schroff et al: FaceNet: A Unified Embedding for Face Recognition and Clustering (2015)](https://arxiv.org/pdf/1503.03832.pdf)

### New "softmax variants"

The training difficulties associated with triplet loss led to the development of new, similarity metric-oriented variants of the traditional softmax--cross entropy objective for classification. Recall that for one-hot encoded classification training data, softmax-cross entropy boils down to a negative log-likelihood objective:

$$\mathcal L_\mathrm{Softmax} = - \log(P(y)) $$

where thinking in terms of the last fully connected layer of a classification network we have further

$$-\log(P(y)) = -\log\left(\frac{\exp(\mathbf w_y \mathbf x + b_y)}{\sum_c \exp(\mathbf w_c \mathbf x + b_c) }\right)$$

where $\mathbf x$ is the embedding and $y$ the correct class of the example in question.
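For reference, the loss above can be sketched in a few lines for a single example. The weight matrix `W`, the biases `b` and the embedding `x` below are made-up toy values; the log-sum-exp trick is used only for numerical stability and does not change the result.

```python
import math


def softmax_cross_entropy(x, W, b, y):
    """-log P(y) for embedding x, per-class weight vectors W and biases b."""
    logits = [sum(wi * xi for wi, xi in zip(w, x)) + bc for w, bc in zip(W, b)]
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_sum_exp - logits[y]


x = [1.0, 0.0]                # embedding of the example
W = [[2.0, 0.0], [0.0, 2.0]]  # one weight vector per class
b = [0.0, 0.0]

loss_correct = softmax_cross_entropy(x, W, b, y=0)  # x points towards w_0
loss_wrong = softmax_cross_entropy(x, W, b, y=1)
print(loss_correct, loss_wrong)
```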

#### Angular softmax (A-softmax, 2018)

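Angular softmax is based on the recognition that with normalized weight vectors and zero biases (this is of course an important modification in itself!) the softmax loss can be rewritten in terms of angles; if the embedding $\mathbf x$ is also normalized to unit length, the last equality holds as well: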

$$-\log\left(\frac{\exp(\mathbf w_y \mathbf x)}{\sum_c \exp(\mathbf w_c \mathbf x) }\right) = -\log\left(\frac{\exp(\|\mathbf x\| \cos \theta_{\mathbf x, \mathbf w_y })}{\sum_c \exp( \|\mathbf x\| \cos \theta_{\mathbf x, \mathbf w_c })}\right) = -\log\left(\frac{\exp(\cos \theta_{\mathbf x, \mathbf w_y })}{\sum_c \exp( \cos \theta_{\mathbf x, \mathbf w_c })}\right)$$

that is, this modified version requires the angle between the embedding and the correct class's weight vector to be smaller than the angles to those of the other classes. The main idea of angular softmax is to strengthen this requirement to the criterion that $m$ times the angle should still be significantly smaller (introducing an "angular margin"), where $m\geq 2$ is a hyperparameter, so we have

$$
-\log\left(\frac{\exp(\cos (m  \theta_{\mathbf x, \mathbf w_y }))}{\exp(\cos (m \theta_{\mathbf x, \mathbf w_y })) + \sum_{c\neq y}\exp( \cos \theta_{\mathbf x, \mathbf w_c })}\right)
$$

A slight problem is caused by the fact that $\cos(m\theta)$ is monotonically decreasing only while $m\theta \in [0, \pi]$, so $\theta_{\mathbf x, \mathbf w_y}$ here would have to be restricted to the $[0, \frac{\pi}{m}]$ interval. The problem can be solved by using a function derived from the cosine which is monotonically decreasing on the full $[0, \pi]$ interval, concretely

$$\phi(\theta) = (-1)^k \cos(m \theta) - 2k$$
$$\left(\theta\in \left[\frac{k\pi}{m},\frac{(k+1)\pi}{m}\right], k \in [0, m-1]\right).$$ 

With this change we have the final form which is

$$
\mathcal L_{\mathrm{A-softmax}} = -\log\left(\frac{\exp(\phi (\theta_{\mathbf x, \mathbf w_y }))}{\exp(\phi ( \theta_{\mathbf x, \mathbf w_y })) + \sum_{c\neq y}\exp( \cos \theta_{\mathbf x, \mathbf w_c })}\right).
$$

A-softmax was originally proposed for face recognition in SphereFace (Liu et al., 2017); for further details and its application to speaker verification see [Huang et al: Angular Softmax for Short-Duration Text-independent Speaker Verification (2018)](https://www.isca-speech.org/archive/Interspeech_2018/pdfs/1545.pdf)
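The piecewise definition of $\phi(\theta)$ is easier to trust after checking it numerically. The sketch below implements the formula above (the choice $m = 4$ and the grid of 101 angles are arbitrary assumptions) and verifies that the function never increases on $[0, \pi]$.

```python
import math


def phi(theta, m):
    """Monotonic replacement for cos(m * theta), for theta in [0, pi]."""
    # k is the index of the interval [k*pi/m, (k+1)*pi/m] containing theta
    k = min(int(theta * m / math.pi), m - 1)
    return (-1) ** k * math.cos(m * theta) - 2 * k


m = 4
thetas = [i * math.pi / 100 for i in range(101)]
values = [phi(t, m) for t in thetas]

# Each value should be no larger than the previous one (up to rounding)
is_decreasing = all(a >= b - 1e-9 for a, b in zip(values, values[1:]))
print(values[0], values[-1], is_decreasing)
```

The function runs from $\phi(0) = 1$ down to $\phi(\pi) = 1 - 2m$, so it agrees with $\cos(m\theta)$ on the first interval and keeps decreasing afterwards.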

#### Additive margin softmax  (AM-softmax, 2018)

Another softmax variant built on the same basic idea is the so-called additive margin softmax loss, which demands that the cosine similarity between the embedding and the correct class's weight vector should exceed the similarity to the other weight vectors by at least a margin $m$:

$$
\mathcal L_{\mathrm{AM-softmax}} = -\log\left(\frac{\exp(s\cdot(\cos \theta_{\mathbf x, \mathbf w_y }-m))}{\exp(s\cdot (\cos \theta_{\mathbf x, \mathbf w_y }-m)) + \sum_{c\neq y}\exp(s\cdot\cos \theta_{\mathbf x, \mathbf w_c })}\right)
$$

where $s$ is a scaling hyperparameter. 

See the original paper for further details: [Wang et al: Additive Margin Softmax for Face Verification (2018)](https://arxiv.org/pdf/1801.05599.pdf)
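A minimal sketch of this loss for a single example is given below. It follows the form used by Wang et al., where the margin is subtracted from the cosine of the correct class, $s\cdot(\cos\theta - m)$; the toy vectors and the default values of `s` and `m` are assumptions for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def am_softmax_loss(x, W, y, s=30.0, m=0.35):
    """Additive margin softmax loss for embedding x and class weight vectors W."""
    x = normalize(x)
    # With unit-length vectors the dot product is the cosine of the angle
    cosines = [sum(a * b for a, b in zip(normalize(w), x)) for w in W]
    logits = [s * (c - m) if i == y else s * c for i, c in enumerate(cosines)]
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_sum_exp - logits[y]


x = [1.0, 0.6]
W = [[1.0, 0.0], [0.0, 1.0]]

loss_with_margin = am_softmax_loss(x, W, y=0)
loss_without_margin = am_softmax_loss(x, W, y=0, m=0.0)
print(loss_with_margin, loss_without_margin)
```

For the same embedding the margin makes the loss larger, so the network has to separate the classes by more than plain softmax would require.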

## Autoencoders and GANs

An obvious follow-up question raised by representation learning is whether we can use unsupervised learning techniques to learn good representations of data.

This is all the more important since in most cases we have **orders of magnitude more raw data than labeled data**, so if we could pre-train our models on a broad raw dataset with unsupervised techniques, we could learn a lot about the world.

In fact some scholars, notably [Yann LeCun](https://en.wikipedia.org/wiki/Yann_LeCun), argue that enabling broad-scale unsupervised learning is the key to general intelligence.

<a href="https://i2.wp.com/syncedreview.com/wp-content/uploads/2019/02/image-1a.png?resize=784%2C502&ssl=1"><img src="https://drive.google.com/uc?export=view&id=16PSrj3ujKoA-93zMKYnLMEf0p64vy8cE" width=70%></a>

(There are also deep connections between un/self-supervised learning and theories of mind, see e.g. the theory of [predictive coding](https://en.wikipedia.org/wiki/Predictive_coding).)


### ["The Ganfather"](https://www.technologyreview.com/s/610253/the-ganfather-the-man-whos-given-machines-the-gift-of-imagination/)

In the field of unsupervised learning, [Ian Goodfellow](https://en.wikipedia.org/wiki/Ian_Goodfellow) made great contributions by developing the GAN architecture. LeCun credits him with starting the "Generative Revolution" inside the DL field.

<a href="https://www.deeplearningitalia.com/wp-content/uploads/2018/03/56180123458e517763fae26da757a924.jpg"><img src="https://drive.google.com/uc?export=view&id=1DGE58IvyDD8GvXPg13acRF56Y9F_ac3P" width=400 heigth=400></a>

His in-depth ["Deep Learning Book"](https://www.deeplearningbook.org/) became somewhat of a canonical work, definitely worth reading.

### Architecture of AEs and GANs

The first widespread unsupervised neural models were the so-called autoencoders.

Autoencoders are unsupervised, or more properly **self-supervised learning models** that are trained to reconstruct their input (possibly from a noisy version of it, or by sampling from a learned data distribution). 

"According to the history provided in Schmidhuber, ["Deep learning in neural networks: an overview,", Neural Networks (2015)](https://arxiv.org/abs/1404.7828), auto-encoders were proposed as a method for unsupervised pre-training in Ballard, "Modular learning in neural networks," Proceedings AAAI (1987). It's not clear if that's the first time auto-encoders were used, however; it's just the first time that they were used for the purpose of pre-training ANNs." ([source](https://stats.stackexchange.com/questions/238381/what-is-the-origin-of-the-autoencoder-neural-networks))

Nowadays the purpose of this exercise is not pre-training (since "depth" is more or less conquered), but to learn dense "semantic" representations of the data.

The big "trick" in autoencoders is the usage of the right objective and learning setting, since in the [words of Francois Chollet](https://blog.keras.io/building-autoencoders-in-keras.html):

"In order to get self-supervised models to learn interesting features, you have to come up with an interesting synthetic target and loss function, and that's where problems arise: merely learning to reconstruct your input in minute detail might not be the right choice here. At this point there is significant evidence that focusing on the reconstruction of a picture at the pixel level, for instance, is not conductive to learning interesting, abstract features of the kind that label-supervized learning induces (where targets are fairly abstract concepts "invented" by humans such as "dog", "car"...). In fact, one may argue that the best features in this regard are those that are the worst at exact input reconstruction while achieving high performance on the main task that you are interested in (classification, localization, etc)."

Because of the limitations of autoencoders, [Goodfellow et al.](https://arxiv.org/abs/1406.2661) came up with the idea of a "Generative adversarial network" (GAN) training regime, whereby a generative ("forger") network is trained jointly with a "discriminator" network, whose judgments provide the training signal (gradients) for the generator.
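To make the adversarial setup a bit more concrete, here is a minimal sketch of the two losses computed from the discriminator's outputs. It assumes the discriminator returns the probability that a sample is real, and it uses the common "non-saturating" generator loss; the probability values are made up, and no networks are trained here.

```python
import math


def discriminator_loss(d_real, d_fake):
    """Average of -[log D(x) + log(1 - D(G(z)))] over a batch."""
    total = sum(math.log(r) + math.log(1.0 - f) for r, f in zip(d_real, d_fake))
    return -total / len(d_real)


def generator_loss(d_fake):
    """Non-saturating generator loss: average of -log D(G(z))."""
    return -sum(math.log(f) for f in d_fake) / len(d_fake)


# Hypothetical discriminator outputs (probability of "real") on two batches
d_loss_sharp = discriminator_loss(d_real=[0.9, 0.8], d_fake=[0.1, 0.2])
d_loss_confused = discriminator_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])

# The generator is better off when its fakes are rated as real
g_loss_fooling = generator_loss([0.9, 0.8])
g_loss_caught = generator_loss([0.1, 0.2])
print(d_loss_sharp, d_loss_confused, g_loss_fooling, g_loss_caught)
```

The two losses pull in opposite directions on the same quantity, $D(G(z))$, which is what makes the training dynamics of GANs non-trivial.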

Let's discuss these models in detail!

In [None]:
from IPython.display import HTML

HTML('<iframe src="https://docs.google.com/presentation/d/e/2PACX-1vTeO6wBbmDp4pCyEqd9VPRIdqZ_nV__cPbr83ofA41mtnR5MZXMaQf1-NBnfKpYcxJqcgnHdsSoll0G/embed?start=false&loop=true&delayms=60000" frameborder="0" width="600" height="600" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>')


GANs represent an amazing state of the art in creating novel content, but their main appeal is **unsupervised learning**, with the potential to learn joint and general representations of phenomena, which many, most notably [Yann LeCun from Facebook AI Research](http://yann.lecun.com/), regard as a major step towards general artificial intelligence.

<a href="https://cdn-images-1.medium.com/max/1600/1*KDvA9Fq3lm-eQOyGlcKAKg.png"><img src="https://drive.google.com/uc?export=view&id=1qa_ytGej9ZH_6dSW_TD-ssIG6xRzIhcd" width=600 heigth=600></a>


### What can they be good for?

- GANs are also capable of acting on video streams

In [None]:
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/Nq2xvsVojVo" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')

- They can also handle acoustic inputs, see for example these [voice cloning experiments](https://audiodemos.github.io/)
- They can enhance creativity

In [None]:
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/hW1_Sidq3m8" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')

**More generally:**

- New instance generation (hopefully in a controlled manner)
- Input for longer (classifier) pipelines
- Similarity search, clustering

...and many more things, bounded only by creativity. :-)

### Play with GANs

There is a very nice visualization tool, [Play with Generative Adversarial Networks (GANs) in your browser!](https://poloclub.github.io/ganlab/)

Since the dynamics of GAN training are non-trivial, they are worth studying.

### Conclusions

The generative paradigm shift is considered one of the frontiers of AI (together with reinforcement learning, "zero shot" and "multi task" learning - to name a few). It is well worth watching this space!

### Caveat: We are not yet there!

<a href="https://1.bp.blogspot.com/-KwCuE2PZccs/XL-XNmDenoI/AAAAAAAAEF8/rMwS1PepVk40nuX0TvcK52d9NBv6IBziwCLcBGAs/s640/ground-truth-imagemagick%252Bcoalesce.gif"><img src="https://drive.google.com/uc?export=view&id=1cvid_ggQVNFVkn4kzahNdx7qn4ynfOAf" width=55%></a>

Unsupervised learning of the **real causal factors** and a nice, independent and interpretable ("disentangled") representation of them would be the "holy grail" of machine learning.

Unfortunately, Google AI researchers have conducted large scale experiments with such models, and the results are somewhat disheartening.

"We propose a fair, reproducible experimental protocol to benchmark the state of unsupervised disentanglement learning by implementing **six different state-of-the-art models** (BetaVAE, AnnealedVAE, FactorVAE, DIP-VAE I/II and Beta-TCVAE) and **six disentanglement metrics** (BetaVAE score, FactorVAE score, MIG, SAP, Modularity and DCI Disentanglement). **In total**, we train and evaluate **12,800 such models on seven data sets**. 

Key findings of our study include:

- We **do not find any empirical evidence that the considered models can be used to reliably learn disentangled representations in an unsupervised way**, since **random seeds** and hyperparameters seem to **matter more than the model choice.** In other words, even if one trains a large number of models and some of them are disentangled, these disentangled representations seemingly cannot be identified without access to ground-truth labels. Furthermore, good hyperparameter values do not appear to consistently transfer across the data sets in our study. These results are consistent with the theorem we present in the paper, which states that the unsupervised learning of disentangled representations is impossible without inductive biases on both the data set and the models (i.e., one has to make assumptions about the data set and incorporate those assumptions into the model).
- For the considered models and data sets, **we cannot validate the assumption that disentanglement is useful for downstream tasks**, e.g., that with disentangled representations it is possible to learn with fewer labeled observations."


More detailed results [here](https://ai.googleblog.com/2019/04/evaluating-unsupervised-learning-of.html).

For now, **we have no choice but to use the _right_ inductive bias.**

<a id="rl"></a>
# Reinforcement learning - as outlook


- The most widely applicable learning paradigm in AI
- This generality adds a lot to its appeal, but it is also the cause of its drawbacks: extreme hunger for computation, brittle and slow convergence of learning. But when it succeeds, it succeeds big time! :-)

## Short history, main milestones

In the 50s Richard Bellman investigated problems of "optimal control", which were aiming at optimizing the behavior of physical systems. He proposed the **Bellman equations**, the foundational concepts of dynamic programming, as well as the **Markov Decision Process**, which is the discrete stochastic case of optimal control (see below). He did not mention learning at all.

In parallel to this, the **behaviorist paradigm** rose to dominance in psychology; it interpreted animal learning in terms of a **"trial and error"** process. It took the two fields, optimal control and behaviorism, a couple of decades to meet. In 1989 Chris Watkins formalized the reinforcement paradigm of learning as a Markov Decision Process. This idea saw widespread adoption.

Even before that, around 1983 - 1986, Sutton, Barto and Anderson achieved breakthroughs with so-called temporal-difference learning, which is an important part of today's RL algorithms. This method formed the basis of the TD-Gammon algorithm of Gerald Tesauro, which was successful in Backgammon. In 1992 this was a breakthrough comparable to the one for Go in 2015.

Later, Dimitri Bertsekas and John Tsitsiklis started to experiment with combining dynamic programming and neural networks (1996).


#### Recent success

In 2013 the team of Volodymyr Mnih taught a neural network to play Atari games. During training the agent only received the pixel representation of the screen as input; no information about the game rules was given. One of the classic games on the platform was Breakout, which the DQN algorithm of DeepMind mastered. [Video](https://www.youtube.com/watch?v=TmPfTpjtdgg)

An even bigger breakthrough happened in 2015, when the AlphaGo model of Google DeepMind beat a professional player (Fan Hui) for the first time on a full 19x19 board. Following this it defeated the 9 dan player Lee Sedol 4:1 in 2016, and finally in 2017 Ke Jie, reputedly the best player in the world, 3:0. (In some opinions this was the "Sputnik moment" of Chinese consciousness and accelerated the elaboration of the Chinese state's massive AI initiative.) The next step was the development of AlphaZero, which is able to beat AlphaGo convincingly. The main advantage of the new model is that it does not contain any hand-crafted features which would aid it in recognizing game patterns. It is a truly end-to-end system. Go as a testbed is all the more remarkable since it has a very high branching factor, so the possibilities in its state space explode rapidly. [Summary video](https://www.youtube.com/watch?v=SUbqykXVx0A)

<a href="http://zone.msn.com/images/v9/en-us/game/bckg/380x285_bckg.gif"><img src="https://drive.google.com/uc?export=view&id=1b_zpsn5-nAvfS32z8Bh5MWJLeb3XoWF4"></a>

## Basic RL 

![rlmodel](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQ8ozFJfJXjRQj6vKNaKYu9wi7IKb6YFt5ilCwv00XXGjD0rmL-)

The most important concepts in RL are:
* state (and observation)
* action
* immediate reward, return
* policy
* environment (its dynamics).

We try to illustrate these with the extended example of chess play.
![chesstable](https://upload.wikimedia.org/wikipedia/commons/thumb/6/6f/ChessSet.jpg/250px-ChessSet.jpg)

**State:** Totality of parameters describing the environment. 
- In chess this is the position of the chess pieces. 
- If we remove a piece and all else remains the same, it is considered a new state<br/>

**Observation:** How an agent perceives the state (its representation of the state). 
- If the agent cannot measure everything exactly or has no full access to state information, it may, for lack of information, treat different states as the same. 
- In chess, when we see the board, state == observation, but in the chess variant Kriegsspiel (where the opponent's pieces are hidden) this is not the case. 
- In most cases the observation lacks some information about the state.

**Action:** The agent's means of modifying the system: taking an action causes a state change in the system. 
- In chess this is taking a move with a piece. 
- The space of possible actions is dependent also on the state and its representation!

**Immediate reward, return:** 
- After taking an action the agent can potentially receive feedback, which we call the _immediate reward_. 
- The more interesting element, though, is the so-called _return_ (or utility), which is an accumulated form of the immediate rewards over a long-running action sequence of the agent. 
- The return cannot be an arbitrary function; it has to satisfy the requirement of stationarity illustrated below:

<a href="http://drive.google.com/uc?export=view&id=1Iji_ykwG0WEVSmBkpwwZe8TqgTT5VJad"><img src="https://drive.google.com/uc?export=view&id=19KM7Q-I7WW0F38xOOTlMtx9eUV4u02A1" width="700px"></a>

If it seems like the red trajectory is more rewarding when we decide in $s_0$, then the utility of the green trajectory must be higher than for the brown one. This can be satisfied by two types of returns (sum, discounted):

$$ G(\tau) = \sum_{r_i \in \tau} {r_i},$$

and:

$$ G(\tau) = \sum_{r_i \in \tau} {\gamma^i r_i}.$$

Here $0 < \gamma < 1$, $\tau$ is the trajectory, that is, a sequence ($s_0, a_0, r_0, s_1, a_1, r_1, \dots, s_i, a_i, r_i, \dots$), and $r_i$ is the reward received at step $i$ of the trajectory. In our chess example there is no immediate reward, only a single reward signal at the end for winning or losing. We could try to calculate points during the game as well, but that would be misleading, since it would not properly determine whether we will win or lose.
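Both kinds of return are short to compute from the list of rewards of a trajectory. The sketch below uses a made-up, chess-like reward sequence (zero everywhere, a single reward of 1 at the end) and an arbitrary discount factor of 0.9.

```python
def total_return(rewards):
    """Undiscounted return: the plain sum of the rewards."""
    return sum(rewards)


def discounted_return(rewards, gamma=0.9):
    """Discounted return: the reward at step i is weighted by gamma**i."""
    return sum(gamma ** i * r for i, r in enumerate(rewards))


# Chess-like episode: no feedback during play, a single reward for winning
rewards = [0.0, 0.0, 0.0, 1.0]

g_sum = total_return(rewards)
g_discounted = discounted_return(rewards, gamma=0.9)
print(g_sum, g_discounted)
```

With discounting, the same win is worth less the later it arrives, so the agent is nudged towards reaching rewards sooner.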

**Policy:** The agent is forced to make a decision in every state. In RL we suppose that time is discretized, i.e. that there are "steps". Put simply, a _policy_ is a function assigning an action to a state, $\pi: S \rightarrow A$, or more precisely: 

$$\pi(s) = p(a|s).$$

This formula means that the policy is a function that assigns to each state a probability distribution over the possible actions, that is, it shows how much it is "worth" choosing a given action in a given state (in the form of a probability value). We call this type of policy a "stochastic policy". If the policy assigns probability $1$ to a single action and $0$ to all others, we talk about a deterministic policy.
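As a minimal sketch, a tabular policy can be stored as a dictionary that maps each state to a probability distribution over actions. The state and action names below are hypothetical; a deterministic policy is just the special case with all probability mass on one action.

```python
import random

# Hypothetical one-state policies: action -> probability
stochastic_policy = {"s0": {"left": 0.2, "right": 0.8}}
deterministic_policy = {"s0": {"left": 0.0, "right": 1.0}}


def sample_action(policy, state, rng):
    """Draw an action according to the distribution p(a | state)."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the run is reproducible

samples = [sample_action(stochastic_policy, "s0", rng) for _ in range(1000)]
share_right = samples.count("right") / len(samples)

chosen = {sample_action(deterministic_policy, "s0", rng) for _ in range(100)}
print(share_right, chosen)
```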

**Environment:** The agent acts in an environment, which transitions to a new state after every action. How it does so is defined by the dynamics of the environment. In chess we move AND the opponent moves as well.

Technically the following are possible:
* Environment can be stochastic or deterministic
* Environment can be partially or fully observable (in the latter case the agent sees the complete state)
* Action can be discrete or continuous (move a piece vs. apply more pressure on the accelerator pedal)
* Policy can be deterministic or stochastic


**The goal function of RL is the return itself. We try to maximize this.**
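The concepts above fit together in the agent-environment interaction loop. The sketch below uses a hypothetical toy environment (a five-cell corridor, invented for illustration) that is deterministic, fully observable and has discrete actions; the method names `reset` and `step` are a common convention, not something prescribed by the text.

```python
class Corridor:
    """Toy environment: cells 0..4, start in cell 2, episode ends at either end."""

    def __init__(self):
        self.state = 2

    def reset(self):
        self.state = 2
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); the dynamics are deterministic
        self.state += action
        done = self.state in (0, 4)
        if self.state == 4:
            reward = 1.0   # reaching the right end is a "win"
        elif self.state == 0:
            reward = -1.0  # reaching the left end is a "loss"
        else:
            reward = 0.0
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """Let the policy act until the episode ends; return the reward sequence."""
    state = env.reset()
    rewards = []
    for _ in range(max_steps):
        action = policy(state)
        state, reward, done = env.step(action)
        rewards.append(reward)
        if done:
            break
    return rewards


def go_right(state):
    """A deterministic policy that ignores the state and always moves right."""
    return +1


rewards = run_episode(Corridor(), go_right)
print(rewards, sum(rewards))
```

The return of the episode is what the agent tries to maximize; learning means changing the policy so that episodes with a higher return become more likely.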

## Deep reinforcement learning

**In one sentence: We use deep neural networks to approximate policy and/or value functions and utilize gradient-based or evolutionary learning methods to train them.**

## Where is this relevant?

Reinforcement learning is capable of attacking problems and domains, where:
- Data is unlabeled and potentially unlabelable. It is simply infeasible to label each and every second of a road driving session with a label of "good move - bad move" based on the driver behavior.
- The optimization scenario is complex: we can tell something about "better or worse" situations, but cannot explicitly come up with strategy elements that have to be evaluated.
- **It is an end-to-end learning approach taken to the extreme, meaning: we only know the "reward" but nothing about input processing, features, policy elements, let alone the final policy.**


One of the main areas of application for RL is robotics and process control.

The Japanese company [Fanuc](https://www.fanuc.com/) builds robots that can learn to carry out a new task within a day. (Mainly moving things around.)

Another prominent application is optimizing the energy consumption of complex machinery or production processes (Google DeepMind uses it for the cooling control of server farms).



### The most suitable problems for RL
RL is best suited for:
* control problems (though Sutton's results allow us to generalize to prediction, which is not a common technique)
* systems described by a huge number of complex parameters, where feature engineering is hard
* problems where the intervention signal for control is very complex


### Pros and cons of RL
**Pro:**
- Extremely promising as a general AI learning framework (plausibly we humans do reinforcement learning)
- Can enter domains where nothing else can proceed
- _Huge amount_ of innovation is going on (just like with GANs)

**Cons:**
- Still a distinct subfield of AI (interest is catching up, but it is still a distinct "tribe")
- Training is notoriously **unstable** and exceptionally **computation hungry** (weeks on GPU clusters)
- It is still quite poorly understood.
- **It depends on the ability to try out billions of actions, which is only feasible in a good simulator** (Simulators are themselves costly, and there is a transfer learning problem between the simulator and the real environment... An interesting idea is to use neural models as simulators, which points towards a GAN-style approach, see [World models - can agents learn from their dreams?](https://worldmodels.github.io/))

For a detailed description of the problems see [here](https://www.alexirpan.com/2018/02/14/rl-hard.html).



### Further reading

#### Books

* Sutton and Barto, Reinforcement Learning: An introduction, 2018, [Link](http://incompleteideas.net/book/RLbook2020.pdf) 
* Szepesvári Csaba, Algorithms for Reinforcement Learning, 2009, [Link](https://sites.ualberta.ca/~szepesva/papers/RLAlgsInMDPs-lecture.pdf)

#### Blogs, websites

* Andrej Karpathy, Deep Reinforcement Learning: Pong from Pixels [Link](http://karpathy.github.io/2016/05/31/rl/)
* DeepMind, Deep Reinforcement Learning, [Link](https://deepmind.com/blog/deep-reinforcement-learning/)
* David Silver, Reinforcement Learning Courses, [Link](http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html)

#### Articles

* Mnih et al., Human-level control through deep reinforcement learning, 2015, [Link](https://web.stanford.edu/class/psych209/Readings/MnihEtAlHassibis15NatureControlDeepRL.pdf)

* Silver et al., Mastering the game of Go without human knowledge, 2017, [Link](https://www.nature.com/articles/nature24270.pdf)

* Mnih et al., Asynchronous Methods for Deep Reinforcement Learning, 2016, [Link](https://arxiv.org/pdf/1602.01783.pdf)