# Recaps and intro

[Paper](https://medium.com/generate-vision/how-to-increase-the-value-of-your-anonymized-data-c182ba4fc8db)

# Deep reinforcement learning

## RL course outline

### Day 3

1. Introduction (What is RL? Why do we need this?)
2. N-armed bandits
3. Monte Carlo Tree search (MCTS)
4. Bellman equation
5. Practical sessions (gridworld and N-armed)

### Day 4

1. Recap of Day 3
2. Taxonomy
3. Value-based RL with classic methods
4. Value-based RL with deep networks
5. Examples
6. Practical sessions (mountain car, Atari with DQN)


### Day 5

1. Policy-based methods, REINFORCE
2. Policy gradient methods, natural gradients
3. Actor-critic methods
4. Other directions
5. Exploration-exploitation tricks
6. Practical sessions (finance)

## What is RL?



“If one of the goals that we work for here is AI then it is at the core of that. Reinforcement Learning is a very general framework for learning sequential decision making tasks. And Deep Learning, on the other hand, is of course the best set of algorithms we have to learn representations. And combinations of these two different models is the best answer so far we have in terms of learning very good state representations of very challenging tasks that are not just for solving toy domains but actually to solve challenging real world problems.”

-- Koray Kavukcuoglu, the director of research at DeepMind

Reinforcement learning is the learning paradigm that uses interactions to learn a behavior. This seems a natural learning process if we think of a human or an animal. It is rare that we are handed both the questions and the answers and only have to memorize them. It is more natural to learn things by doing them.

The field of reinforcement learning provides a framework for learning from the experience gained while interacting with and exploring the world.

## Success of RL

### Backgammon

A computer backgammon program was developed by Gerald Tesauro at IBM's Thomas J. Watson Research Center in 1992. It was called TD-Gammon.

It learned a good strategy for playing the game with a TD-learning (temporal difference learning) algorithm. We will discuss TD-learning later on.

It managed to reach the level of human players.

The board of the game:
<img src="http://drive.google.com/uc?export=view&id=1KY6VwerZxL1Vjq-cReemeja-oK-IocOD" width=45%>

Characteristics of the game:

* complexity of the game
    * number of states: $10^{20}$
    * branching factor: about $400$
* stochasticity of the opponent
* easy to simulate the steps on the board

### Flying a helicopter

[A helicopter learns to fly](https://www.youtube.com/watch?v=M-QUkgk3HyE)

This result is by Pieter Abbeel and Andrew Ng, from 2008.

Andrew Ng had previously worked on training a helicopter to perform complicated maneuvers. He developed the algorithm called [Pegasus](https://arxiv.org/abs/1301.3878).

Characteristics of the problem:
* continuous control signals
* difficult to model and understand the reaction of the helicopter
* aerodynamic uncertainty
* real-time interactions

### Atari games

<img src="http://drive.google.com/uc?export=view&id=1Gs6s2riwVswUSPyvg75fIeK3GnBmZiIV" width=55%>

This was an important breakthrough in the history of reinforcement learning: the first time deep neural networks were successfully combined with reinforcement learning (DQN and its variants, see later on). The results were published in 2013-2015 in several related papers.

Learning uses only the raw frames (with slight preprocessing) and the scores achieved in the game. Neither the rules nor the exact state of the game engine are known. The algorithm has to deduce this information from the image sequence and the scores.

The algorithm achieved human-level performance on more than half of the Atari games (49 games were played) with the same hyper-parameters. It was trained on each game separately but with the same settings. [Video playing Atari](https://www.youtube.com/watch?v=V1eYniJ0Rnk)

<img src="http://drive.google.com/uc?export=view&id=1i2hn7vDsYdyrr6BRdktmynzeRB5tKyt4" width=55%>

Characteristics of the task:
* the number of states: $10^{12000}$
* the number of actions is small (2-10)
* only the frames can be seen

### Playing Go at human level and beyond

"Go is an abstract strategy board game for two players, in which the aim is to surround more territory than the opponent. The game was invented in China more than 2,500 years ago and is believed to be the oldest board game continuously played to the present day. A 2016 survey by the International Go Federation's 75 member nations found that there are over 46 million people worldwide who know how to play Go and over 20 million current players"

-- [Wikipedia article](https://en.wikipedia.org/wiki/Go_(game))

<img src="http://drive.google.com/uc?export=view&id=1mtEp02WIb9d4Ww1KU9_UKtM45gZM2dP6" width=55%>

In October 2015, AlphaGo played its first match against the reigning three-time European Champion, Mr Fan Hui. AlphaGo won the first ever game against a Go professional with a score of 5-0.

[AlphaGo vs Fan Hui](https://en.wikipedia.org/wiki/AlphaGo_versus_Fan_Hui)

AlphaGo then competed against legendary Go player Mr Lee Sedol, the winner of 18 world titles, who is widely considered the greatest player of the past decade. AlphaGo's 4-1 victory in Seoul, South Korea, in March 2016 was watched by over 200 million people worldwide. This landmark achievement was a decade ahead of its time.

[AlphaGo vs Lee Sedol](https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol)

In January 2017, DeepMind revealed an improved, online version of AlphaGo called Master. This online player achieved 60 straight wins in time-control games against top international players. 

[AlphaGo vs AlphaGo](https://deepmind.com/alphago-vs-alphago)

Four months later, AlphaGo took part in the Future of Go Summit in China, the birthplace of Go. Following the summit, DeepMind revealed **AlphaGo Zero**. While AlphaGo learnt the game by playing thousands of matches with amateur and professional players, AlphaGo Zero learnt by playing against itself, starting from completely random play.

<img src="https://miro.medium.com/max/900/1*MlwJE5cDvc3WL76v_ViERw.gif" width=55%>

-- [Source](https://medium.com/predict/a-step-towards-agi-the-story-of-alphago-e0fafd83e6b9)

DeepMind developed a more advanced version, called **MuZero**. It is a remarkable milestone in RL because it achieves strong results in Chess, Shogi, Go and Atari!

<img src="https://miro.medium.com/max/653/1*EsI4_4cfhtUi-iAgaUeyvQ.png" width=55%>

-- [Source](https://towardsdatascience.com/deepmind-unveils-muzero-a-new-agent-that-mastered-chess-shogi-atari-and-go-without-knowing-the-d755dc80ff08)

Characteristics of the game:
* a lot of possible sequences of moves: $250^{150}$
* board can be seen, moves are well defined
* opponent is stochastic
* the game is complex, requires intuition and creativity

**References:**

[Story of AlphaGo](https://deepmind.com/research/case-studies/alphago-the-story-so-far)

[Mastering the game of Go paper](https://www.nature.com/articles/nature16961.pdf)

[MuZero paper](https://arxiv.org/abs/1911.08265)

### Playing Dota2

Dota2 is a really complex game. It requires strategy and the collaborative control of a team. There is a diverse set of actions. Strategies have to take the future into account, which means a long-term horizon. It is very hard for an RL agent to be proactive.

A successful approach was OpenAI Five.

"We started working with Dota 2 because we expected it to be a good testbed for developing general-purpose AI technologies. It has additionally turned out to be a great avenue for helping people experience modern AI — which we expect to become a high-stakes part of people’s lives in the future, starting with systems like self-driving cars."

-- [Source](https://openai.com/blog/openai-five-finals/)

**Game rules and goal**

* 5 heroes on the field
* each hero has 4 abilities
* objective: destroy the ancient of the enemy
* 2 sides: radiant, dire

<img src="http://drive.google.com/uc?export=view&id=1c9LlTjAMSDESEtxQrhH6cjCASO6ROS3K" width=30%>

<img src="http://drive.google.com/uc?export=view&id=1bqs42fr2REpMH2p0R5zTdytOnEK1vVim" width=50%>

[Dota2 gameplay](https://www.youtube.com/watch?time_continue=1&v=UZHTNBMAfAA&feature=emb_logo)

An enormous amount of computing power was used during training:

* 128,000 CPU cores
* 256 P100 GPUs
* 10,000 years' worth of experience (180 years per day)
* 64,000 watts

OpenAI Five wins back-to-back games versus Dota 2 world champions OG at Finals, becoming the first AI to beat the world champions in an esports game.

**References:**

[Official webpage](https://openai.com/projects/five/)

[Dota2 paper](https://cdn.openai.com/dota-2.pdf)

### Playing Starcraft

StarCraft is a game similar to Dota2. The goal and the difficulty are similar as well. The project is maintained by DeepMind.

"AlphaStar is the first AI to reach the top league of a widely popular esport without any game restrictions. This January, a preliminary version of AlphaStar challenged two of the world's top players in StarCraft II, one of the most enduring and popular real-time strategy video games of all time. Since then, we have taken on a much greater challenge: playing the full game at a Grandmaster level under professionally approved conditions."

-- [Source](https://deepmind.com/blog/article/AlphaStar-Grandmaster-level-in-StarCraft-II-using-multi-agent-reinforcement-learning)

<img src="https://fudzilla.com/media/k2/items/cache/40f24d728e5f4849828b04937e4b92a2_L.jpg" width=45%>

How AlphaStar improved over time until it beat top players of the world:

<img src="https://cdn.vox-cdn.com/thumbor/I0h7LHOwssieeS_lYYif7LYJqiU=/0x0:1440x810/920x0/filters:focal(0x0:1440x810):format(webp):no_upscale()/cdn.vox-cdn.com/uploads/chorus_asset/file/19332213/Figure_3_static.jpg" width=65%>

Characteristics of the game:
* long-term horizon
* partially-observable environment
* a lot of possible actions
* a lot of possible situations, decision sequences

**References:**

[AlphaStar](https://deepmind.com/blog/article/AlphaStar-Grandmaster-level-in-StarCraft-II-using-multi-agent-reinforcement-learning)

[API](https://github.com/deepmind/pysc2)

[AI master online](https://www.nature.com/articles/d41586-019-03343-4)

### Decreasing energy consumption in Google data centers

[Decreasing energy consumption](https://sustainability.google/projects/machine-learning/)

## Real world applications

### Nascar racing

[Nascar racing example](https://youtu.be/lrv8ga02VNg?t=852)

### Osaro

[osaro](https://www.osaro.com)

Uses RL for:
* picking tasks
* assembly tasks
* creating other robots

### Covariant

[covariant](https://covariant.ai/)

Uses RL for robotics.

### Bonsai

[bonsai](https://www.bons.ai/)

Creates an RL platform for industrial usage.

### Facebook's Horizon

[Horizon](https://research.fb.com/publications/horizon-facebooks-open-source-applied-reinforcement-learning-platform/)

An open-source RL platform.

### Waymo

[waymo](https://waymo.com/)

RL may have applications here, although this is not confirmed.

### J.P.Morgan

[RL for foreign exchange](https://www.jpmorgan.com/global/markets/machine-learning-fx)

## Background, history

Two major areas led to today's reinforcement learning: psychology-inspired research and optimal control research.

Regarding **psychology**, trial-and-error as a principle of learning was expressed by Edward Thorndike in 1911:

*Of several responses made to the same situation, those which are accompanied or closely followed by satisfaction to the animal will, other things being equal, be more firmly connected with the situation, so that, when it recurs, they will be more likely to recur; those which are accompanied or closely followed by discomfort to the animal will, other things being equal, have their connections with that situation weakened, so that, when it recurs, they will be less likely to occur. The greater the satisfaction or discomfort, the greater the strengthening or weakening of the bond.*

This is basically the "Law of Effect".

The term "reinforce" came later, in the 1927 English translation of Pavlov's monograph on conditioned reflexes. The idea of implementing trial-and-error learning in computers appeared in a 1948 report by Alan Turing:

*When a configuration is reached for which the action is undetermined, a random choice for the missing data is made and the appropriate entry is made in the description, tentatively, and is applied. When a pain stimulus occurs all tentative entries are cancelled, and when a pleasure stimulus occurs they are all made permanent.*

-- (Turing, 1948)

[Pigeons playing ping pong](https://www.youtube.com/watch?v=vGazyH6fQQ4)

[Deep RL algorithm plays ping pong](https://www.youtube.com/watch?v=YOW8m2YGtRg)

Regarding **optimal control**, we should look back to the late 1950s, when researchers tried to solve the problem of designing a controller that minimizes a measure of a dynamical system's behavior over time. One of the approaches was suggested by Richard Bellman, who used dynamic programming to formulate an equation whose solution is the optimal solution of the system. Later, such equations became known as Bellman equations.

Over several decades, reinforcement learning slowly emerged at the intersection of the two. It has since become a standalone research field with a large body of research.

It is a promising and emerging field, so it is worth knowing its principles and the current state of the art.

## Comparison to other learning methods

There are 3 major paradigms of machine learning:
* supervised learning
* unsupervised learning
* reinforcement learning

The first two have a common characteristic: both rely on data that has already been gathered, cleaned and made available.

For **supervised learning**, the data has two parts: input and expected output (a.k.a. target, label). The learning algorithm therefore has to learn a mapping from the input to the output. Deep neural networks have attracted a lot of attention due to their success in this.

For **unsupervised learning**, the data is not a set of pairs. However, the structure (like clusters, anomalies, relationships) can be deduced by a well-chosen model.

Neither of the two can change the distribution of the data, because they cannot gather new data. That is the task of the engineer.

Reinforcement learning starts with no data. It gets data (information) by interacting with the world, its surroundings or a simulator. One of the main difficulties is therefore that the behavior (the way it interacts) affects the distribution of the data it gathers, while the learning itself is based on the data encountered so far.

## RL architecture

<img src="http://drive.google.com/uc?export=view&id=1VqFRFYo8Tv2YDG__6ntVliUNMb-RlEmU" width=65%>

## N-armed bandits

Bandits are decision problems without states (or, equivalently, with only one state).

The problem first appeared in a paper written by William R. Thompson in 1933. He examined medical trials and how to adapt the treatment allocation on the fly as the drug appears more or less effective.

The name itself originates from the 1950s, and was coined by Frederick Mosteller and Robert Bush. They studied animal learning on mice. The mice faced the dilemma of choosing to go left or right in a T-shaped maze. In one of the directions there was food. However, the location of the food was not known; it was random.

They ran a similar experiment to study human learning with two-armed bandits. The name came from the one-armed bandit, a lever-operated slot machine.

<img src="http://drive.google.com/uc?export=view&id=12gOhzjbuBsnQDrEPsONZi7im5fsJLPco" width=75%>

Practical applications of bandits:
* news recommendation
* dynamic pricing
* ad placement
* network routing

And lots of others.

Example of a classical 2-armed bandit:

<img src="http://drive.google.com/uc?export=view&id=1nRIT78KdQ7vsxZTmyUK0E2DFEmi3vST6" width=75%>

**Definition of stochastic bandits:**

A stochastic bandit is a collection of distributions $\nu = (P_a: a \in A)$, where $A$ is the set of available actions. The learner and the environment interact sequentially over $n$ rounds, where $n$ is the horizon. In each round $t \in \left\{1, \ldots, n\right\}$, the learner chooses an action $A_t \in A$, which is fed to the environment. The environment then samples a reward $X_t \in \mathbb{R}$ from distribution $P_{A_t}$ and reveals $X_t$ to the learner. The interaction between the learner (or policy) and environment induces a probability measure on the sequence of outcomes $A_1, X_1, A_2, X_2, \ldots, A_n, X_n$.
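The interaction protocol in this definition can be sketched in a few lines of Python. This is only an illustration under assumptions that are not part of the definition: the reward distributions are Bernoulli, arms are indexed from 0, and the names `BernoulliBandit`, `interact` and `round_robin` are made up for this example.

```python
import random


class BernoulliBandit:
    """Toy stochastic bandit: arm a pays 1 with probability means[a], else 0."""

    def __init__(self, means, seed=0):
        self.means = list(means)
        self.rng = random.Random(seed)

    @property
    def k(self):
        # Number of available actions |A|.
        return len(self.means)

    def pull(self, action):
        # The environment samples X_t from P_{A_t} and reveals it.
        return 1.0 if self.rng.random() < self.means[action] else 0.0


def interact(bandit, policy, n):
    """Run n rounds and return the outcome sequence [(A_1, X_1), ..., (A_n, X_n)]."""
    history = []
    for _ in range(n):
        action = policy(history, bandit.k)  # the learner only sees past outcomes
        reward = bandit.pull(action)
        history.append((action, reward))
    return history


def round_robin(history, k):
    # A trivial policy that cycles through the arms.
    return len(history) % k


bandit = BernoulliBandit([0.3, 0.7])
history = interact(bandit, round_robin, n=10)
```

The policy receives only the history of actions and rewards, never the means, which mirrors the fact that the learner does not know the distributions $P_a$.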

**Definition of the regret:**

To optimize, we need an objective. The method used to choose the action in each round is called the policy. Then, for a given bandit instance and policy, the **regret** is defined as follows:

$$R_n = n\mu^* - E\left[ \sum_{t=1}^n X_t \right]$$

where,

$$\mu^* = \max_a \mu_a$$
$$\mu_a = \int_{-\infty}^\infty {X \cdot dP_a(X)} $$

$\mu_a$ is basically the mean of the rewards, coming from arm $a$ (or when action $a$ was chosen). $\mu^*$ is the optimal mean reward, when the optimal action is selected all the time. The goal is to **minimize the regret**.
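To make the regret formula concrete, the sketch below estimates $R_n$ by Monte Carlo simulation for two simple policies on a toy Bernoulli bandit. The bandit means, the number of runs and the function names are assumptions chosen for illustration. Note that `always_best` is an oracle that knows the means, so it is not a real learning policy; it only shows that always choosing the optimal arm gives a regret of about zero.

```python
import random


def estimate_regret(means, n, policy, runs=2000, seed=0):
    """Monte Carlo estimate of R_n = n * mu_star - E[sum_t X_t] on a Bernoulli bandit."""
    rng = random.Random(seed)
    mu_star = max(means)
    total_reward = 0.0
    for _ in range(runs):
        for t in range(n):
            a = policy(t, len(means), rng)
            total_reward += 1.0 if rng.random() < means[a] else 0.0
    # Average total reward over the runs approximates the expectation.
    return n * mu_star - total_reward / runs


def uniform_policy(t, k, rng):
    # Picks an arm uniformly at random in every round.
    return rng.randrange(k)


def always_best(means):
    # Oracle policy: uses the true means, which a learner does not know.
    best = max(range(len(means)), key=lambda a: means[a])
    return lambda t, k, rng: best


means = [0.3, 0.7]
n = 100
regret_uniform = estimate_regret(means, n, uniform_policy)
regret_oracle = estimate_regret(means, n, always_best(means))
```

For the uniform policy the expected reward per round is the average of the means, $0.5$, so the exact regret is $n(\mu^* - 0.5) = 100 \cdot 0.2 = 20$; the estimate should land close to this value, and the regret grows linearly in $n$.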

**Types of bandits:**

1. structured bandits
2. unstructured bandits
3. contextual bandits
4. adversarial bandits

**Structured vs unstructured**

A bandit is unstructured if learning about one arm $a$ reveals no information about any other arm.
Otherwise it is structured.

**Contextual bandits**

In the case of a recommendation system, the policy could take into account geographical and personal information (age, sex etc.). In short, contextual information can be included in the problem, and it affects what the optimal policy is.

**Adversarial bandits**

Imagine that the environment has the power to see your algorithm and can alter how the rewards are signaled. Therefore the actions change the environment itself (they change the distribution functions).

**ETC and UCB**

Here, we will discuss two algorithms to solve N-armed (or multi-armed) finite stochastic bandits. Both of them can be applied successfully if the probability distributions of the arms are subgaussian with a fixed $\sigma$.

Definition of subgaussianity: a random variable $X$ is $\sigma$-subgaussian if for all $\lambda \in \mathbb{R}$ it holds that
$$E\left[ e^{\lambda X} \right] \leq \exp\left( \frac{\lambda^2 \sigma^2}{2} \right)$$
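As a quick numerical illustration of the definition, consider a random variable that takes the values $-1$ and $+1$ with probability $1/2$ each (a Rademacher variable, chosen here as an example). Its moment generating function is $\cosh(\lambda)$, and it is known to be $1$-subgaussian. The sketch below checks the inequality on a finite grid of $\lambda$ values, which is a sanity check rather than a proof; the helper names are made up for this example.

```python
import math


def rademacher_mgf(lam):
    """E[exp(lam * X)] for X uniform on {-1, +1}; this equals cosh(lam)."""
    return 0.5 * math.exp(lam) + 0.5 * math.exp(-lam)


def subgaussian_bound(lam, sigma):
    """Right-hand side of the definition: exp(lam^2 * sigma^2 / 2)."""
    return math.exp(lam ** 2 * sigma ** 2 / 2.0)


# Grid of lambda values in [-5, 5] with step 0.1.
lambdas = [x / 10.0 for x in range(-50, 51)]

# The inequality holds on the whole grid for sigma = 1 ...
holds_sigma_1 = all(
    rademacher_mgf(lam) <= subgaussian_bound(lam, 1.0) for lam in lambdas
)
# ... but fails for a sigma that is too small, e.g. sigma = 0.5.
holds_sigma_half = all(
    rademacher_mgf(lam) <= subgaussian_bound(lam, 0.5) for lam in lambdas
)
```

Intuitively, $\sigma$ controls how fast the tails have to decay: a smaller $\sigma$ is a stronger requirement, which the $\pm 1$ variable no longer satisfies.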

**ETC (explore-then-commit)**

The number of actions: $k$

The number of times the algorithm explores each arm: $m$

As the name suggests, the algorithm has two phases: 
1. exploration of the arms
2. exploitation of the best arm found so far

The ETC algorithm spends exactly $k \cdot m$ rounds on exploration. The score it uses to decide which arm to exploit is the average reward received from arm $i$ up to round $t$.

The score:

$$\mu_i(t) = \frac{1}{T_i(t)} \cdot \sum_{s=1}^t{I\left[ A_s = i \right] X_s}$$

and 

$$T_i(t) = \sum_{s=1}^t{I\left[ A_s = i \right]}$$

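Here $A_s$ is the action chosen in round $s$, $X_s$ is the reward received in round $s$, and $I[\cdot]$ is the indicator function, so $T_i(t)$ counts how many times arm $i$ was chosen up to round $t$.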

Pseudo code:

1. $m$ is given by the user
2. In round $t$ choose the action according to:

$$A_t = \begin{cases} (t \bmod k) + 1, & t \leq mk \\ \arg \max_i \mu_i(mk), & t > mk \end{cases}$$

This is a simple algorithm but it is powerful. With the right choice of $m$ it is able to get close to the optimal solution.
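The two phases can be sketched as follows. This is a minimal illustration, assuming Bernoulli rewards and 0-indexed rounds and arms (so the exploration rule becomes `t % k` instead of $(t \bmod k) + 1$); the function name and the example parameters are not from the text above.

```python
import random


def explore_then_commit(means, n, m, seed=0):
    """Run ETC for n rounds on a toy Bernoulli bandit; returns (actions, committed arm)."""
    if m < 1:
        raise ValueError("m must be at least 1 so that every arm is tried")
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k    # T_i(t): number of pulls of arm i so far
    sums = [0.0] * k    # cumulative reward of arm i so far
    actions = []
    committed = None
    for t in range(n):
        if t < m * k:
            # Exploration phase: round robin, each arm is pulled exactly m times.
            a = t % k
        else:
            if committed is None:
                # Commit once to the arm with the best average reward after m*k rounds.
                committed = max(range(k), key=lambda i: sums[i] / counts[i])
            a = committed
        x = 1.0 if rng.random() < means[a] else 0.0
        counts[a] += 1
        sums[a] += x
        actions.append(a)
    return actions, committed


actions, committed = explore_then_commit([0.1, 0.9], n=200, m=20)
```

The choice of $m$ is a trade-off: a small $m$ risks committing to the wrong arm, while a large $m$ wastes rounds on arms that are clearly worse.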

**UCB (Upper Confidence Bound)**

This algorithm has many different forms, depending on the distributional assumptions on the noise. However, the main idea is common to all of them: optimism. Optimism in the sense that the algorithm views the world as a nice place where it is worth exploring new things. We will encounter this approach later in the course as well.

The optimistic algorithm is implemented by assigning each arm a score that is high at the beginning (it assumes a high value for arms not explored so far). As new data is gathered, the score (or utility) of each arm converges to its real utility.

Optimistic principle: use the observed data to assign to each arm a value, called the **upper confidence bound**. 
An upper confidence bound is a value that, with high probability, overestimates the unknown mean.

It is calculated as follows:

$$UCB_i(t-1, \delta) = \begin{cases} \infty, & T_i(t-1) = 0 \\ \mu_i(t-1) + \sqrt{\frac{2 \cdot \log(1 / \delta)}{T_i(t-1)}}, & \text{otherwise} \end{cases}$$

Pseudo code:
1. $k$ and $\delta$ are given as input ($\delta$: error probability)
2. For $t = 1, \ldots, n$, choose the action
$$A_t = \arg \max_i UCB_i(t-1, \delta)$$
then observe $X_t$ and update the upper confidence bounds.
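The pseudo code above can be turned into a short sketch. It assumes Bernoulli rewards and 0-indexed arms, and the example uses $\delta = 1/n^2$, which is one common choice and not something fixed by the text; the function names are made up for this illustration.

```python
import math
import random


def ucb_index(mean_estimate, pulls, delta):
    """Upper confidence bound of one arm given its average reward and pull count."""
    if pulls == 0:
        return math.inf  # unexplored arms are treated as maximally promising
    return mean_estimate + math.sqrt(2.0 * math.log(1.0 / delta) / pulls)


def run_ucb(means, n, delta, seed=0):
    """Run UCB for n rounds on a toy Bernoulli bandit; returns pull counts per arm."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k    # T_i(t-1)
    sums = [0.0] * k    # cumulative reward per arm
    for _ in range(n):
        indices = [
            ucb_index(sums[i] / counts[i] if counts[i] else 0.0, counts[i], delta)
            for i in range(k)
        ]
        a = max(range(k), key=lambda i: indices[i])  # optimistic choice
        x = 1.0 if rng.random() < means[a] else 0.0
        counts[a] += 1
        sums[a] += x
    return counts


n = 1000
counts = run_ucb([0.2, 0.8], n=n, delta=1.0 / n ** 2)
```

Because the bonus term shrinks as $T_i(t-1)$ grows, an arm that looks bad is still tried occasionally, but less and less often. Unlike ETC, there is no separate exploration phase whose length has to be chosen in advance.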