# <center>Deep Reinforcement Learning</center>

## <center>Instructor: Professor R. Venkatesh Babu</center>


## Table of Contents

1) Introduction and Applications

2) Formal Definition and Mathematical Tools (MDPs)

3) How to Implement RL?


# <center> Some Recent Achievements with Reinforcement Learning </center>

## RL for Games

* AI for ATARI 2600 Games: 2014 [Link](https://arxiv.org/pdf/1312.5602v1.pdf)

* AI for the game of Go: 2016 [Link](https://www.nature.com/articles/nature16961)

* AI for the game of DOTA: 2017 [Link](https://blog.openai.com/dota-2/)

* AlphaGo Zero, learning from scratch: 2017 [Link](https://www.nature.com/articles/nature24270)

* AI for Poker, Libratus: 2017 [Link](https://www.ijcai.org/proceedings/2017/0772.pdf)

* RL for 3D Video Games Review: 2017 [Link](https://arxiv.org/pdf/1708.07902.pdf)

## Robotics


* Multiple Visuomotor Tasks in Robotics: 2016 [Link](https://arxiv.org/abs/1504.00702)

* Learning to Navigate in Complex Environments: 2017 [Link](https://arxiv.org/abs/1611.03673)

## Natural Language Processing

* Dialogue Systems, Machine Translation, Text Generation: [Link](https://arxiv.org/pdf/1701.07274.pdf)

## Operations Research 

* Business Management, Finance, Healthcare, Transportation: [Link](https://arxiv.org/pdf/1701.07274.pdf)


## Computer Vision

* **2016** 
  * Reinforcement Learning for Visual Object Detection
  * Active Object Localization With Deep Reinforcement Learning

* **2017**
  * A2-RL: Aesthetics Aware Reinforcement Learning for Image Cropping
  * Tracking as Online Decision-Making: Learning a Policy From Streaming Videos With Reinforcement Learning
  * Learning Cooperative Visual Dialog Agents With Deep Reinforcement Learning
  * Attention-Aware Deep Reinforcement Learning for Video Face Recognition
  * 3DCNN-DQN-RNN: A Deep Reinforcement Learning Framework for Semantic Parsing of Large-Scale 3D Point Clouds
  * Deep Reinforcement Learning-Based Image Captioning With Embedding Reward
  * Attention-Aware Face Hallucination via Deep Reinforcement Learning
  * Deep Variation-Structured Reinforcement Learning for Visual Relationship and Attribute Detection
  * Collaborative Deep Reinforcement Learning for Joint Object Search
  * Action-Decision Networks for Visual Tracking With Deep Reinforcement Learning
  * PoseAgent: Budget-Constrained 6D Object Pose Estimation via Reinforcement Learning
  * A Reinforcement Learning Approach to the View Planning Problem
  * A Joint Speaker-Listener-Reinforcer Model for Referring Expressions
  * Ask the Right Questions: Active Question Reformulation with Reinforcement Learning
  * Deep Reinforcement Learning for Unsupervised Video Summarization with Diversity-Representativeness Reward
  * Deep Reinforcement Learning for Image Hashing
  * Modeling Attention in Panoramic Video: A Deep Reinforcement Learning Approach
  * Video Captioning via Hierarchical Reinforcement Learning
  * Accelerated Methods for Deep Reinforcement Learning

### <center> 19 papers in ICLR 2018, 24 papers in NIPS 2017! </center>

# <center>RL Formal Definition</center>


"Reinforcement learning is the problem of getting an agent to act in the world so as to maximize its rewards."

 For example, consider teaching a dog a new trick: you cannot tell it what to do, but you can reward/punish it if it does the right/wrong thing. It has to figure out what it did that made it get the reward/punishment, which is known as the credit assignment problem.

# <center>RL vs Supervised Learning vs Unsupervised Learning</center>

### RL vs Supervised Learning
* Training Examples
    * Supervised Learning: Training examples from a knowledgeable external supervisor (situation together with a label).
    * RL: No such training examples.
* Objective Functions
    * Supervised Learning: The aim is to extrapolate, or generalize, so that the system acts correctly in situations not present in the training set.
    * RL: It is often impractical to obtain examples of desired behavior that are both correct and representative of all situations, so the agent must be able to learn from its own experience.

### RL vs Unsupervised Learning
* Unsupervised Learning is about finding structure hidden in collections of unlabeled data.
* Uncovering structure in an agent’s experience can certainly be useful in reinforcement learning, but by itself does not address the reinforcement learning agent’s problem of maximizing a reward signal.

# <center>Important RL Terms and Definitions</center>

## Goal of RL (Reward Hypothesis)
<center><img src="img/3.png" alt="RewardHypothesis" style="width: 1000px;"/></center>


## Interaction between Agent and Environment
<center><img src="img/7.png" alt="RewardHypothesis" style="width: 1000px;"/></center>

## History and State
<center><img src="img/8.png" alt="HistoryandState" style="width: 1000px;"/></center>

## Information State
<center><img src="img/11.png" alt="HistoryandState" style="width: 1000px;"/></center>

## Major Components of an RL Agent
An RL agent may include one or more of these components:
* **Policy**: agent’s behaviour function
* **Value function**: how good is each state and/or action
* **Model**: agent’s representation of the environment

## Policy
* A policy is the agent’s behaviour
* It is a map from state to action, e.g.
* Deterministic policy: $a = π(s)$
* Stochastic policy: $π(a|s) = P[A_t = a|S_t = s]$
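
The two kinds of policy can be sketched in a few lines of Python. This is only an illustration: the states `s0`/`s1`, the actions `left`/`right`, the probabilities, and the helper `sample_action` are all hypothetical and do not come from the lecture.

```python
import random

# Hypothetical two-state, two-action problem, used only for illustration.
STATES = ["s0", "s1"]
ACTIONS = ["left", "right"]

# Deterministic policy: a lookup table implementing a = pi(s).
deterministic_pi = {"s0": "right", "s1": "left"}

# Stochastic policy: pi(a|s) stored as one probability table per state.
stochastic_pi = {
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 0.5, "right": 0.5},
}

def sample_action(policy, state, rng):
    """Draw an action a ~ pi(.|state) from a stochastic policy table."""
    actions = list(policy[state].keys())
    weights = list(policy[state].values())
    return rng.choices(actions, weights=weights, k=1)[0]

rng = random.Random(0)
print(deterministic_pi["s0"])                   # always the same action
print(sample_action(stochastic_pi, "s0", rng))  # may differ between calls
```

A deterministic policy is the special case of a stochastic one that puts probability 1 on a single action in each state.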

## Value function
* Value function is a prediction of future reward
* Used to evaluate the goodness/badness of states
* And therefore to select between actions, e.g.
$$v_π(s) = E_π[R_{t+1} + γ R_{t+2} + γ^{2} R_{t+3} + \dots \mid S_t = s]$$

## Model
* A model predicts what the environment will do next
* **Transitions**: P predicts the next state (i.e. dynamics)
* **Rewards**: R predicts the next (immediate) reward, e.g.
$$ P^{a}_{ss'} = P[S_{t+1} = s' | S_t = s, A_t = a]$$
$$ R^{a}_{s} = E[R_{t+1} | S_t = s, A_t = a]$$
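
For a small discrete problem, a model of this kind can be stored as two lookup tables. The sketch below assumes a made-up problem with two states (`0`, `1`) and two actions (`stay`, `go`); the numbers and the helper `predict` are hypothetical and only show how the transition and reward tables would be used.

```python
import random

# Hypothetical 2-state, 2-action model (all numbers are made up).
# P[s][a][s_next] = P[S_{t+1} = s_next | S_t = s, A_t = a]
P = {
    0: {"stay": {0: 0.9, 1: 0.1}, "go": {0: 0.2, 1: 0.8}},
    1: {"stay": {0: 0.0, 1: 1.0}, "go": {0: 0.7, 1: 0.3}},
}
# R[s][a] = E[R_{t+1} | S_t = s, A_t = a]
R = {
    0: {"stay": 0.0, "go": -1.0},
    1: {"stay": 1.0, "go": 0.0},
}

def predict(state, action, rng):
    """Use the model to sample a next state and look up the expected reward."""
    next_states = list(P[state][action].keys())
    probs = list(P[state][action].values())
    s_next = rng.choices(next_states, weights=probs, k=1)[0]
    return s_next, R[state][action]

rng = random.Random(0)
print(predict(0, "go", rng))
```

An agent that holds such tables can plan by simulating transitions without touching the real environment.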

## Maze Example
<center><img src="img/e1.png" alt="HistoryandState" style="width: 1000px;"/></center>


## Maze Example: Policy
<center><img src="img/e2.png" alt="HistoryandState" style="width: 1000px;"/></center>

* Arrows represent the policy $\pi(s)$ for each state $s$.

## Maze Example: Value Function
<center><img src="img/e3.png" alt="HistoryandState" style="width: 1000px;"/></center>

## Maze Example: Model
<center><img src="img/e4.png" alt="HistoryandState" style="width: 1000px;"/></center>

# <center>Building Blocks of MDP</center>

## <center>The Agent Environment Interface</center>


<center> <img src="img/agent_env.PNG" alt="MarkovProperty Definition" style="width:700px;"/> </center>

## <center>Markov Decision Process</center>
A Markov decision process (MDP) is a Markov reward process with decisions. It is an environment in which all states are Markov.

<center><img src="img/4.png" alt="Matrix" style="width: 700px;"/></center>

## Policy in MDP notation
<center><img src="img/6.png" alt="Matrix" style="width: 700px;"/></center>

* A policy fully defines the behaviour of an agent
* MDP policies depend on the current state (not the history)
* i.e. Policies are **stationary** (time-independent),
    $A_t ∼ π(·|S_t ), \forall t > 0$
    

## Return
<center><img src="img/3 - Copy.png" alt="Matrix" style="width: 700px;"/></center>

* The discount $γ ∈ [0, 1]$ is the present value of future rewards
* The value of receiving reward R after k + 1 time-steps is $γ^k R$.
* This values immediate reward above delayed reward.
    * $γ$ close to 0 leads to "myopic" evaluation
    * $γ$ close to 1 leads to "far-sighted" evaluation
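
The effect of the discount can be checked numerically. The sketch below assumes a made-up reward sequence in which the only reward arrives on the fourth step; the function name `discounted_return` is hypothetical.

```python
def discounted_return(rewards, gamma):
    """Compute G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ..."""
    g = 0.0
    # Work backwards so that each step applies G = R + gamma * G_next.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# A delayed reward of 10 received on the fourth step (illustrative numbers).
rewards = [0.0, 0.0, 0.0, 10.0]
for gamma in (0.0, 0.5, 1.0):
    print(gamma, discounted_return(rewards, gamma))
```

With these rewards the return is 0.0 for $γ = 0$ (the myopic agent ignores the delayed reward), 1.25 for $γ = 0.5$ (that is, $0.5^3 \cdot 10$), and 10.0 for $γ = 1$.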

## Value Function in MDP notation
<center><img src="img/7 - Copy.png" alt="Matrix" style="width: 700px;"/></center>


# <center>Bellman Optimality Equation</center>

## Optimal Value Function
<center><img src="img/o.png" alt="Matrix" style="width: 700px;"/></center>

## Optimal Policy
<center><img src="img/o1.png" alt="Matrix" style="width: 700px;"/></center>

## Bellman Optimality Equation for $V^{*}$
<center><img src="img/op3.png" alt="Matrix" style="width: 700px;"/></center>

## Bellman Optimality Equation for $Q^{*}$
<center><img src="img/op4.png" alt="Matrix" style="width: 700px;"/></center>

# <center> How to Learn $Q^{*}$ or $V^{*}$ ? </center> 

## Monte Carlo Methods



Roughly speaking, Monte Carlo methods wait until the return following the visit is known, then use that return as a target for $V(S_t)$.



#### Constant $\alpha$ MC

$$
V(S_t) \leftarrow V(S_t) + \alpha\big[ G_t - V(S_t) \big]
$$



* $G_t$ is the actual return following time t,
* $\alpha$ is a constant step-size parameter.
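
A minimal sketch of constant-$\alpha$ MC prediction follows. It assumes a toy environment that is not part of the lecture: a five-state random walk (states 1 to 5, started in the middle) in which the agent moves left or right with equal probability and receives reward 1 only when it exits on the right. The function names and parameter values are illustrative choices.

```python
import random

def run_episode(rng, n_states=5):
    """Random walk on states 1..n_states; returns [(state, reward), ...]."""
    s = (n_states + 1) // 2
    trajectory = []
    while 1 <= s <= n_states:
        s_next = s + rng.choice((-1, 1))
        reward = 1.0 if s_next == n_states + 1 else 0.0
        trajectory.append((s, reward))
        s = s_next
    return trajectory

def constant_alpha_mc(episodes=5000, alpha=0.01, gamma=1.0, seed=0):
    rng = random.Random(seed)
    V = {s: 0.5 for s in range(1, 6)}
    for _ in range(episodes):
        trajectory = run_episode(rng)  # the episode must finish first
        g = 0.0
        # Walk backwards so g is the return G_t that followed each visit.
        for s, r in reversed(trajectory):
            g = r + gamma * g
            V[s] += alpha * (g - V[s])  # V(S_t) <- V(S_t) + alpha [G_t - V(S_t)]
    return V

print(constant_alpha_mc())
```

For this walk the true values are $1/6, 2/6, \dots, 5/6$, so the estimates should increase from left to right. Note that no update happens until the episode has ended, which is the defining constraint of Monte Carlo methods.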

## Monte Carlo Backup

<center><img src="img/MCback.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>

## DP methods



DP methods update at every iteration, but they need a model of the environment: each update uses the transition probabilities $p(s',r|s,a)$.



#### Iterative policy evaluation in DP
$$
V(s) \leftarrow \sum_{a}\pi(a|s) \sum_{s',r}{p(s',r|s,a)\big[ r+\gamma V(s') \big]}
$$
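
The update above can be sketched for a problem whose model is fully known. The example assumes a toy five-state random walk that is not part of the lecture: the policy picks left or right with probability 0.5, the dynamics are deterministic given the action, and the reward is 1 only when the walk exits on the right. The function name and the stopping threshold `theta` are illustrative.

```python
def iterative_policy_evaluation(n_states=5, gamma=1.0, theta=1e-10):
    """Evaluate the uniform-random policy on a random walk with a known model."""
    actions = (-1, +1)
    pi = 0.5                     # pi(a|s) for both actions
    V = [0.0] * (n_states + 2)   # indices 0 and n_states+1 are terminal (value 0)
    while True:
        delta = 0.0
        for s in range(1, n_states + 1):
            new_v = 0.0
            for a in actions:
                s_next = s + a   # deterministic dynamics: p(s', r | s, a) = 1
                r = 1.0 if s_next == n_states + 1 else 0.0
                new_v += pi * (r + gamma * V[s_next])
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v         # in-place sweep over all states
        if delta < theta:
            return V[1:n_states + 1]

print(iterative_policy_evaluation())
```

No episodes are sampled here: every sweep uses the model to take a full expectation over actions and next states, and the values converge to $1/6, 2/6, \dots, 5/6$.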

## DP backup

<center><img src="img/DPback.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>

## Temporal Difference Prediction



TD methods need to wait only until the next time step.



#### TD(0)

$$
V(S_t) \leftarrow V(S_t) + \alpha\big[ R_{t+1} +\gamma V(S_{t+1})- V(S_t) \big]
$$



* The target for the Monte Carlo update is $G_t$, whereas the target for the TD update is $R_{t+1} + \gamma V(S_{t+1})$.
* Because the TD method bases its update in part on an existing estimate, we say that it is a bootstrapping method, like DP.
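
A minimal TD(0) sketch follows. It assumes a toy five-state random walk that is not part of the lecture (equal-probability left/right moves, reward 1 only when exiting on the right); the function name and parameter values are illustrative.

```python
import random

def td0_random_walk(episodes=5000, alpha=0.02, gamma=1.0, seed=0, n_states=5):
    rng = random.Random(seed)
    # Index 0 and n_states+1 are terminal states whose value stays 0.
    V = [0.0] + [0.5] * n_states + [0.0]
    for _ in range(episodes):
        s = (n_states + 1) // 2
        while 1 <= s <= n_states:
            s_next = s + rng.choice((-1, 1))
            r = 1.0 if s_next == n_states + 1 else 0.0
            # Update immediately, using the current estimate V[s_next] as part
            # of the target (bootstrapping) instead of waiting for the return.
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next
    return V[1:-1]

print(td0_random_walk())
```

Unlike the Monte Carlo update, each value is adjusted inside the episode loop, one step after the state is visited, and no model of the transition probabilities is needed.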


## TD Backup

<center><img src="img/TDback.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>

#### Monte Carlo Backup

<center><img src="img/MCback.JPG" alt="Multi-armed Bandit" style="width: 200px;"/></center>

#### DP backup

<center><img src="img/DPback.JPG" alt="Multi-armed Bandit" style="width: 200px;"/></center>

#### TD Backup

<center><img src="img/TDback.JPG" alt="Multi-armed Bandit" style="width: 200px;"/></center>


<center><img src="img/bootsam.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>

# <center>Q-learning: Off-Policy TD Control</center>



$$
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \big]
$$

**The learned action-value function, $Q$, directly approximates $Q^{*}$, the optimal action-value function, independent of the policy being followed.**

## Q-learning: An off-policy TD control algorithm

**Initialize** $Q(s, a), \forall s \in S, a \in A(s)$, arbitrarily, and $Q(\text{terminal-state}, \cdot) = 0$<br>
**Repeat** (for each episode):<br>
$\quad$Initialize $S$<br>
$\quad$**Repeat** (for each step of episode):<br>
$\quad \quad$ Choose $A$ from $S$ using policy derived from $Q$ (e.g., $\epsilon$-greedy)<br>
$\quad \quad$ Take action $A$, observe $R, S'$<br>
$\quad \quad$ $Q(S, A) \leftarrow Q(S, A) + \alpha \big[ R + \gamma \max_a Q(S', a) - Q(S, A) \big]$ <br>
$\quad \quad$ $S \leftarrow S'$<br>
$\quad$ until $S$ is terminal
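
The algorithm above can be sketched as tabular Python code. The environment is an assumption made for illustration only: a five-state chain (states 0 to 4, start at 0, state 4 terminal) with actions left and right and reward 1 on reaching the terminal state. The names `step`, `epsilon_greedy`, `q_learning` and the hyperparameter values are hypothetical.

```python
import random

N_STATES = 5       # states 0..4; state 4 is terminal (hypothetical chain)
ACTIONS = (0, 1)   # 0 = left, 1 = right

def step(state, action):
    """Toy deterministic environment: reward 1 only on reaching the last state."""
    s_next = max(0, state - 1) if action == 0 else state + 1
    reward = 1.0 if s_next == N_STATES - 1 else 0.0
    return reward, s_next

def epsilon_greedy(Q, state, epsilon, rng):
    """Behaviour policy derived from Q: explore with probability epsilon."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(Q[state])
    return rng.choice([a for a in ACTIONS if Q[state][a] == best])  # random tie-break

def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]  # Q(terminal, .) stays 0
    for _ in range(episodes):
        s = 0
        while s != N_STATES - 1:
            a = epsilon_greedy(Q, s, epsilon, rng)
            r, s_next = step(s, a)
            # The target uses max over next actions, not the action the
            # behaviour policy will actually take: this is the off-policy part.
            Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
            s = s_next
    return Q

Q = q_learning()
print([max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N_STATES - 1)])
```

The final line prints the greedy action in each non-terminal state. In this chain the optimal action is always to move right, so the learned $Q$ values for "right" should exceed those for "left" even though the agent occasionally explored.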

# Summary

1) Introduction to RL and its Applications

2) The intuitive and formal formulation of mathematical tools such as MDPs

3) A brief overview of methods to implement RL

# Next Session

1) Deep Q-learning

2) Policy Gradient

3) Latest Methods: TRPO, PPO, AlphaGo Zero Review.

# <center> Any Questions? </center>