# <center>Deep Reinforcement Learning</center>

## <center>Part - II</center>
<br>
<br>

## <center>Instructor: Professor R. Venkatesh Babu</center>


# Slide Credits:

1) Sutton & Barto Book: Reinforcement Learning: An Introduction: [Link To Book](http://incompleteideas.net/book/bookdraft2017nov5.pdf)

2) David Silver: Introduction to Reinforcement Learning: [Link to Video Lectures](https://www.youtube.com/watch?v=2pWv7GOvuf0&list=PL7-jPKtc4r78-wCZcQn5IqyuWhBZ8fOxT)


## Table of Contents

1) Reintroduction

2) Deep Q-Learning

3) Policy Gradient Methods

4) Introduction to Advanced Methods


# <center>Introduction</center>


## <center>"Reinforcement learning is the problem of getting an agent to act in the world so as to maximize its rewards."</center>
<br><br>
<br><br>
<center><img src="img/multiarmedbandit.jpg" alt="Multi-armed Bandit" style="width: 400px;"/></center>


## <center>The Agent Environment Interface</center>


<center> <img src="img/agent_env.PNG" alt="MarkovProperty Definition" style="width:700px;"/> </center>

## Major Components of a RL Agent
An RL agent may include one or more of these components:
* **Policy**: agent’s behaviour function
* **Value function**: how good is each state and/or action
* **Model**: agent’s representation of the environment

## Maze Example
<center><img src="img/e1.png" alt="HistoryandState" style="width: 1000px;"/></center>


# Maze Example: Policy
<center><img src="img/e2.png" alt="HistoryandState" style="width: 1000px;"/></center>

* Arrows represent the policy $\pi(s)$ for each state $s$.

# Maze Example: Value Function
<center><img src="img/e3.png" alt="HistoryandState" style="width: 1000px;"/></center>

# Maze Example: Model
<center><img src="img/e4.png" alt="HistoryandState" style="width: 1000px;"/></center>

## <center>Markov Decision Process</center>
A Markov decision process (MDP) is a Markov reward process with decisions. It is an environment in which all states are Markov.

<center><img src="img/4.png" alt="Matrix" style="width: 700px;"/></center>

## Value Function in MDP notation
<center><img src="img/7 - Copy.png" alt="Matrix" style="width: 700px;"/></center>


## Bellman Optimality Equation for $V^{*}$
<center><img src="img/op3.png" alt="Matrix" style="width: 700px;"/></center>
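
The Bellman optimality equation can be used directly as an update rule (value iteration). The following is a minimal sketch, assuming a hypothetical three-state deterministic MDP that is invented here for illustration and does not come from the slides. It repeatedly applies $v(s) \leftarrow \max_a \big[ r + \gamma v(s') \big]$ until the values stop changing.

```python
# Hypothetical toy MDP (an assumption for illustration, not from the slides):
# states 0, 1, 2; state 2 is terminal. Transitions are deterministic.
# transitions[s][a] = (next_state, reward)
transitions = {
    0: {"stay": (0, 0.0), "right": (1, 0.0)},
    1: {"left": (0, 0.0), "right": (2, 1.0)},
}
GAMMA = 0.9


def value_iteration(transitions, gamma, terminal_states, tol=1e-10):
    """Apply the Bellman optimality backup until the largest change is below tol."""
    v = {s: 0.0 for s in transitions}
    for s in terminal_states:
        v[s] = 0.0  # the value of a terminal state is zero by definition
    while True:
        delta = 0.0
        for s, actions in transitions.items():
            # Deterministic case: the expectation over s' collapses to one term.
            best = max(r + gamma * v[s2] for (s2, r) in actions.values())
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < tol:
            return v


v_star = value_iteration(transitions, GAMMA, terminal_states=[2])
print(v_star)
```

In this toy problem the optimal value of state 1 is the immediate reward for stepping into the terminal state, and state 0 is worth that value discounted once by $\gamma$.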

# <center> How to Learn $Q^{*}$ or $V^{*}$ ? </center> 

## Monte Carlo Backup

<center><img src="img/MCback.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>

## DP backup

<center><img src="img/DPback.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>

## TD Backup

<center><img src="img/TDback.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>

# <center>Q-learning: Off-Policy TD Control</center>



$$
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \big]
$$

**The learned action-value function, $Q$, directly approximates $q_*$, the optimal action-value function, independent of the policy being followed.**
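
To make the update rule concrete, here is a small worked example of a single Q-learning update. The table values, reward, and step sizes are hypothetical numbers chosen only for illustration, and `q_learning_update` is a helper name introduced here.

```python
def q_learning_update(q, s, a, r, s_next, alpha, gamma):
    """Apply one Q-learning update to q[s][a] in place and return the TD error."""
    td_target = r + gamma * max(q[s_next])  # bootstrap from the best next action
    td_error = td_target - q[s][a]
    q[s][a] += alpha * td_error
    return td_error


# Hypothetical 2-state, 2-action table (values are for illustration only).
q = [[0.0, 0.5],
     [1.0, 2.0]]
err = q_learning_update(q, s=0, a=1, r=1.0, s_next=1, alpha=0.5, gamma=0.9)
print(err, q[0][1])
```

The target uses the maximum over next actions regardless of which action the agent actually takes next, which is what makes the method off-policy.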

## Q-learning: An off-policy TD control algorithm

**Initialize** $Q(s, a)$ arbitrarily for all $s \in S, a \in A(s)$, and set $Q(\text{terminal-state}, \cdot) = 0$<br>
**Repeat** (for each episode):<br>
$\quad$Initialize $S$<br>
$\quad$**Repeat** (for each step of episode):<br>
$\quad \quad$ Choose $A$ from $S$ using policy derived from $Q$ (e.g., $\epsilon$-greedy)<br>
$\quad \quad$ Take action $A$, observe $R, S'$<br>
$\quad \quad$ $Q(S, A) \leftarrow Q(S, A) + \alpha \big[ R + \gamma \max_a Q(S', a) - Q(S, A) \big]$ <br>
$\quad \quad$ $S \leftarrow S'$<br>
$\quad$ until $S$ is terminal
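
The action-selection step of the algorithm ("policy derived from $Q$, e.g. $\epsilon$-greedy") could be sketched as follows. The function name `epsilon_greedy` and the example Q-values are assumptions made for illustration.

```python
import random


def epsilon_greedy(q_row, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, otherwise the greedy one.

    q_row is the list of Q-values for the current state, one entry per action.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])


# Hypothetical Q-values for one state with three actions.
action = epsilon_greedy([0.1, 0.7, 0.3], epsilon=0.1)
print(action)
```

With `epsilon=0` the rule is purely greedy; with `epsilon=1` it is purely random.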

# <center>A Simple example of Tabular Q-learning</center>


#### Credits to Arthur Juliani [link](https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-0-q-learning-with-tables-and-neural-networks-d195264329d0)


<center><img src="img/frozen_1.jpg" alt="Multi-armed Bandit" style="width: 500px;"/></center>

* A frozen lake with holes.
* Avoid the holes and reach the safe zone.



<center><img src="img/frozen_2.jpg" alt="Multi-armed Bandit" style="width: 700px;"/></center>

* We have 16 possible states (one for each block)
* 4 possible actions (the four directions of movement)
* giving us a 16x4 table of Q-values.

In [None]:
import gym
import numpy as np

env = gym.make('FrozenLake-v0')

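#Initialize the Q-table with all zeros: one row per state, one column per action
Q = np.zeros([env.observation_space.n,env.action_space.n])
# Set learning parameters
lr = .8   # learning rate (alpha in the update rule)
y = .95   # discount factor (gamma in the update rule)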
num_episodes = 2000
#create lists to contain total rewards and steps per episode
#jList = []
rList = []
for i in range(num_episodes):
    #Reset environment and get first new observation
    s = env.reset()
    rAll = 0
    d = False
    j = 0
    #The Q-Table learning algorithm
    while j < 99:
        j+=1
        #Choose an action by greedily (with noise) picking from Q table
        a = np.argmax(Q[s,:] + np.random.randn(1,env.action_space.n)*(1./(i+1)))
        #Get new state and reward from environment
        s1,r,d,_ = env.step(a)
        #Update Q-Table with new knowledge
        Q[s,a] = Q[s,a] + lr*(r + y*np.max(Q[s1,:]) - Q[s,a])
        rAll += r
        s = s1
        if d:
            break
    #jList.append(j)
    rList.append(rAll)

# <center>Value Function Approximation</center>
## <center>Using Deep Learning</center>

## Introduction

Reinforcement learning can be used to solve large problems, e.g.
<br><br>
* Backgammon: $10^{20}$ states
<br><br>
* Computer Go: $10^{170}$ states
<br><br>
* Helicopter: continuous state space


How can we use the methods learnt previously on such huge state-spaces?
<br><br><br>
For example, a tabular method that stores only the state-value function (in half precision, 2 bytes per state) for the game of Backgammon would need:

* Memory requirement: $\frac{10^{20} \times 2 \text{ bytes}}{10^{15} \text{ bytes per PB}} = 2 \times 10^{5}$ petabytes (2 lakh PB)
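
The same arithmetic as a quick check, assuming decimal petabytes ($10^{15}$ bytes) and 2 bytes per half-precision value:

```python
NUM_STATES = 10**20          # Backgammon state count quoted above
BYTES_PER_VALUE = 2          # half precision (float16)
BYTES_PER_PETABYTE = 10**15  # decimal petabyte

total_petabytes = NUM_STATES * BYTES_PER_VALUE // BYTES_PER_PETABYTE
print(total_petabytes)  # 200000
```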

## Story so far...

* So far we have represented the value function by a lookup table;
<br><br>
* Every state $s$ has an entry $V(s)$<br>
  Or every state-action pair $s,a$ has an entry $Q(s, a)$.

* Problem with large MDPs:
  * There are too many states and/or actions to store in memory
  * It is too slow to learn the value of each state individually

* **Solution for large MDPs**:
  * Estimate the value function with function approximation<br>
### $$\hat{v}(s, w) \approx v_\pi(s)$$ <br>
    or<br>
### $$\hat{q}(s, a, w) \approx q_\pi(s, a)$$ <br>

* Generalise from seen states to unseen states<br>

* Update the parameter $w$ using MC or TD learning

## Value Approximation
### Types of Value Function Approximation

<center><img src="img/func_approx.JPG" alt="Multi-armed Bandit" style="width: 600px;"/></center>

##  The Prediction Objective (MSVE)
* Mean Squared Value Error, or MSVE:
### $$ MSVE(\theta) = \sum_{s \in S} d(s) \big[ v_\pi(s) - \hat{v}(s,\theta) \big]^2 $$

The weighting (or distribution) $d(s) \geq 0$ represents how much we care about the error in each state $s$.

* **More states than weights**
  * Making one state’s estimate more accurate leads to making others’ less accurate.
  * We must specify which states we care most about.
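
As a rough illustration of the objective, the sketch below evaluates the MSVE for a hypothetical three-state problem in which the approximation has a single weight shared by all states. All numbers and the helper name `msve` are assumptions made for this example.

```python
def msve(d, v_true, v_hat):
    """Weighted squared error: sum over states of d(s) * (v_pi(s) - v_hat(s))^2."""
    return sum(d_s * (vt - vh) ** 2 for d_s, vt, vh in zip(d, v_true, v_hat))


# Hypothetical numbers for a 3-state problem (for illustration only).
d = [0.5, 0.3, 0.2]        # state weighting d(s); non-negative, sums to 1
v_true = [1.0, 2.0, 3.0]   # true values v_pi(s)
v_hat = [1.0, 1.0, 1.0]    # one shared weight: the same estimate in every state

error = msve(d, v_true, v_hat)
print(error)
```

With one weight and three states, the estimate can match at most one of the true values, so the choice of $d(s)$ decides which states the fit favours.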

## <center>Stochastic Gradient Descent for MSVE</center>
<center><img src="img/fa_slides2.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>

## <center> Feature Vector</center>

<center><img src="img/fa_slides3.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>

## <center>Linear Value Function Approximation</center>

<center><img src="img/fa_slides4.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>

## <center>Incremental Prediction Methods</center>

<center><img src="img/fa_slides6.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>

## <center>TD Learning with Value Function Approximation</center>

<center><img src="img/fa_slides8.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>
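
A single TD(0) update with a linear value function might look like the sketch below. It assumes $\hat{v}(s, w) = x(s)^\top w$ with a hand-written feature vector; the function names and numbers are hypothetical and chosen only for illustration.

```python
def linear_value(w, x):
    """v_hat(s, w) = x(s)^T w for a feature vector x(s)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def td0_update(w, x, r, x_next, alpha, gamma, terminal=False):
    """Return new weights after one semi-gradient TD(0) step."""
    v_next = 0.0 if terminal else linear_value(w, x_next)
    td_error = r + gamma * v_next - linear_value(w, x)
    # For a linear approximator the gradient of v_hat with respect to w is x(s).
    return [wi + alpha * td_error * xi for wi, xi in zip(w, x)]


# Hypothetical transition between two states with one-hot features.
w = [0.0, 1.0]
w = td0_update(w, x=[1.0, 0.0], r=1.0, x_next=[0.0, 1.0], alpha=0.1, gamma=0.9)
print(w)
```

Only the weights that are active in the current state's features change; the target $r + \gamma \hat{v}(s', w)$ is treated as a constant when taking the gradient.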

## <center>Control with Value Function Approximation</center>

<center><img src="img/fa_slides10.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>

# <center>A Simple example of approximate Q-learning</center>


#### Credits to Arthur Juliani [link](https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-0-q-learning-with-tables-and-neural-networks-d195264329d0)


<center><img src="img/frozen_1.jpg" alt="Multi-armed Bandit" style="width: 500px;"/></center>

* A frozen lake with holes.
* Avoid the holes and reach the safe zone.




<center><img src="img/frozen_2.jpg" alt="Multi-armed Bandit" style="width: 700px;"/></center>


* We have 16 possible states (one for each block)
* 4 possible actions (the four directions of movement)
* giving us a 16x4 table of Q-values.

### The Network

In [None]:

import tensorflow as tf  # this cell uses the TensorFlow 1.x graph API (placeholders, sessions)

#These lines establish the feed-forward part of the network used to choose actions
inputs1 = tf.placeholder(shape=[1,16],dtype=tf.float32)
W = tf.Variable(tf.random_uniform([16,4],0,0.01))
Qout = tf.matmul(inputs1,W)
predict = tf.argmax(Qout,1)

#Below we obtain the loss by taking the sum of squares difference between the target and prediction Q values.
nextQ = tf.placeholder(shape=[1,4],dtype=tf.float32)
loss = tf.reduce_sum(tf.square(nextQ - Qout))
trainer = tf.train.GradientDescentOptimizer(learning_rate=0.1)
updateModel = trainer.minimize(loss)

### The Training

In [None]:
# Exploration rate and logging lists. The value of e is an assumption (it is not
# defined in this cell); y and num_episodes come from the tabular example above.
e = 0.1
jList = []
rList = []
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    for i in range(num_episodes):
        #Reset environment and get first new observation
        s = env.reset()
        rAll = 0
        d = False
        j = 0
        #The Q-Network
        while j < 99:
            j+=1
            #Choose an action greedily (with e chance of random action) from the Q-network
            a,allQ = sess.run([predict,Qout],feed_dict={inputs1:np.identity(16)[s:s+1]})
            if np.random.rand(1) < e:
                a[0] = env.action_space.sample()
            #Get new state and reward from environment
            s1,r,d,_ = env.step(a[0])
            #Obtain the Q' values by feeding the new state through our network
            Q1 = sess.run(Qout,feed_dict={inputs1:np.identity(16)[s1:s1+1]})
            #Obtain maxQ' and set our target value for chosen action.
            maxQ1 = np.max(Q1)
            targetQ = allQ
            targetQ[0,a[0]] = r + y*maxQ1
            #Train our network using target and predicted Q values
            _,W1 = sess.run([updateModel,W],feed_dict={inputs1:np.identity(16)[s:s+1],nextQ:targetQ})
            rAll += r
            s = s1
            #Stop stepping once the episode has terminated
            if d:
                break
        jList.append(j)
        rList.append(rAll)
print("Percent of successful episodes: " + str(100*sum(rList)/num_episodes) + "%")

# <center>DQN in ATARI</center>

## The model

<center><img src="img/fa2_ex1.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>

## Performance

<center><img src="img/fa2_ex2.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>

## Benefits of Experience Replay and Double DQN

<center><img src="img/fa2_ex3.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>
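
The two ideas named in this slide can be sketched in a few lines. This is a minimal illustration using plain Python lists in place of network outputs, not the implementation used in the DQN papers; the class and function names are assumptions.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (s, a, r, s_next, done) transitions."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped first

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform sampling breaks the correlation between consecutive steps.
        return random.sample(list(self.buffer), batch_size)


def double_dqn_target(r, done, q_online_next, q_target_next, gamma):
    """Select the next action with the online net, evaluate it with the target net."""
    if done:
        return r
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return r + gamma * q_target_next[best_action]


buf = ReplayBuffer(capacity=3)
for t in range(5):
    buf.add((t, 0, 0.0, t + 1, False))  # hypothetical transitions
batch = buf.sample(2)

# Hypothetical Q-values for the next state from the online and target networks.
target = double_dqn_target(1.0, False, [0.2, 0.9], [0.5, 0.1], gamma=0.9)
print(len(buf.buffer), target)
```

In the example the online network prefers action 1, so the target uses the target network's value for action 1 rather than the target network's own maximum, which is how Double DQN reduces over-estimation.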

# <center> Deep Q-Learning in Computer Vision</center>

### Action-Decision Networks for Visual Tracking with Deep Reinforcement Learning
[Link](http://openaccess.thecvf.com/content_cvpr_2017/papers/Yun_Action-Decision_Networks_for_CVPR_2017_paper.pdf)

Video : [https://www.youtube.com/watch?v=q8HU_bK8LOk](https://www.youtube.com/watch?v=q8HU_bK8LOk)

# <center>Policy Gradients</center>

## <center> Introduction </center>
<center><img src="img/pg_1.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>

## <center> Why Policy-Based RL</center>
<center><img src="img/pg_3.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>

## <center> Can Learning Policy be easier than Learning Values of states?</center>
* The policy may be a simpler function to approximate.
* This is the simplest advantage that policy parameterization may have over action-value parameterization.

Why?
* Problems vary in the complexity of their policies and action-value functions. 
* For some, the action-value function is simpler and thus easier to approximate. 
* For others, the policy is simpler. 


**In the latter case, a policy-based method will typically be faster to learn and yield a superior asymptotic policy.**

Example: robotics tasks with continuous action spaces.

## <center> Example of Stochastic Optimal Policy</center>
<center><img src="img/pg_4.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>

## <center>REINFORCE: Simplest Policy Gradient Method</center>

## <center>Quality Measure of Policy</center>
<center><img src="img/pg_8.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>

## <center>Analytic Gradient Ascent</center>
<center><img src="img/pg_12.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>

## <center>Example- Softmax Policy</center>
<center><img src="img/pg_13.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>
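
For a softmax policy with linear preferences, $\pi_\theta(s, a) \propto e^{\phi(s,a)^\top \theta}$, the score function is $\nabla_\theta \log \pi_\theta(s, a) = \phi(s, a) - \mathbb{E}_{\pi_\theta}[\phi(s, \cdot)]$. The sketch below computes both, assuming hypothetical one-hot action features; the function names are introduced here for illustration.

```python
import math


def softmax_policy(theta, features):
    """features[a] is phi(s, a); returns pi(a | s) proportional to exp(phi(s, a)^T theta)."""
    logits = [sum(t * f for t, f in zip(theta, phi)) for phi in features]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]


def softmax_score(theta, features, a):
    """grad_theta log pi(a | s) = phi(s, a) - E_pi[phi(s, .)]."""
    probs = softmax_policy(theta, features)
    n = len(theta)
    expected = [sum(p * phi[i] for p, phi in zip(probs, features)) for i in range(n)]
    return [features[a][i] - expected[i] for i in range(n)]


# Hypothetical state with two actions and one-hot action features.
features = [[1.0, 0.0], [0.0, 1.0]]
theta = [0.0, 0.0]
probs = softmax_policy(theta, features)
score = softmax_score(theta, features, a=0)
print(probs, score)
```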

## <center>Example- Gaussian Policy</center>
<center><img src="img/pg_14.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>
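
For a Gaussian policy with mean $\mu(s) = \phi(s)^\top \theta$ and fixed standard deviation $\sigma$, the score function is $\nabla_\theta \log \pi_\theta(s, a) = \frac{(a - \mu(s))\,\phi(s)}{\sigma^2}$. A minimal sketch with hypothetical numbers (the function name is an assumption):

```python
def gaussian_score(theta, phi_s, a, sigma):
    """Score of a Gaussian policy with linear mean phi(s)^T theta and fixed sigma."""
    mu = sum(t * f for t, f in zip(theta, phi_s))
    return [(a - mu) * f / sigma ** 2 for f in phi_s]


# Hypothetical state features, parameters and sampled action.
score = gaussian_score(theta=[0.5, -1.0], phi_s=[2.0, 1.0], a=1.0, sigma=0.5)
print(score)  # [8.0, 4.0]
```

An action above the current mean pushes the mean upward along the state features, and a smaller $\sigma$ makes that push stronger.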

## <center>One-step MDP</center>
<center><img src="img/pg_15.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>

## <center>Monte-Carlo Policy Gradient (REINFORCE)</center>
<center><img src="img/pg_17.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>
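
To illustrate the REINFORCE update $\theta \leftarrow \theta + \alpha \, \nabla_\theta \log \pi_\theta(s, a) \, G$, the sketch below runs it on a hypothetical two-armed bandit, where each episode has a single step and the return $G$ is just the immediate reward. The bandit, its rewards, and the hyperparameters are assumptions chosen for illustration.

```python
import math
import random


def softmax(prefs):
    """Convert action preferences into probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def reinforce_bandit(rewards, alpha=0.1, episodes=2000, seed=0):
    """Run REINFORCE with a softmax policy over one preference per arm."""
    rng = random.Random(seed)
    theta = [0.0] * len(rewards)
    for _ in range(episodes):
        probs = softmax(theta)
        a = rng.choices(range(len(rewards)), weights=probs)[0]
        g = rewards[a]  # one-step episode: the return is the immediate reward
        for b in range(len(theta)):
            # Gradient of log pi(a) with respect to preference b.
            grad_log = (1.0 if b == a else 0.0) - probs[b]
            theta[b] += alpha * g * grad_log
    return softmax(theta)


# Hypothetical bandit: arm 0 always pays 0, arm 1 always pays 1.
final_probs = reinforce_bandit(rewards=[0.0, 1.0])
print(final_probs)
```

After training, almost all of the probability mass sits on the rewarding arm. With noisy rewards the same update still works in expectation, but the variance of the gradient estimate grows, which motivates the critic introduced later.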

# <center>Policy Gradient Example</center>

[https://www.youtube.com/watch?v=m-DiH_Fq6lg](https://www.youtube.com/watch?v=m-DiH_Fq6lg)

# <center>Policy Gradient in Computer Vision</center>

### Visual Tracking by Reinforced Decision Making
[Link](https://arxiv.org/pdf/1702.06291.pdf)


<center><img src="img/pg_ex_2.jpg" alt="Multi-armed Bandit" style="width: 700px;"/></center>

# <center> Actor Critic Methods</center>

## <center> Value-Based Vs Policy-Based RL</center>
<center><img src="img/pg_2.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>

## <center> Reducing Variance Using a Critic</center>
<center><img src="img/pg_19.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>

## <center> Estimating the Action-Value Function</center>
<center><img src="img/pg_20.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>

## <center> Action Value Actor Critic</center>
<center><img src="img/pg_21.JPG" alt="Multi-armed Bandit" style="width: 700px;"/></center>
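
One step of an action-value actor-critic can be sketched as below, assuming tabular (one-hot) features so that the critic $Q_w(s, a)$ is a table and the actor is a softmax over per-state preferences. The function names and all numbers are hypothetical.

```python
import math


def softmax(prefs):
    """Convert action preferences into probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def qac_step(theta, w, s, a, r, s_next, a_next, alpha, beta, gamma, done=False):
    """Update actor preferences theta and critic table w in place; return the TD error."""
    q_sa = w[s][a]
    q_next = 0.0 if done else w[s_next][a_next]
    td_error = r + gamma * q_next - q_sa
    probs = softmax(theta[s])
    # Actor: theta <- theta + alpha * grad log pi(a | s) * Q_w(s, a)
    for b in range(len(theta[s])):
        grad_log = (1.0 if b == a else 0.0) - probs[b]
        theta[s][b] += alpha * grad_log * q_sa
    # Critic: w <- w + beta * td_error * phi(s, a), where phi is one-hot here
    w[s][a] += beta * td_error
    return td_error


# Hypothetical 2-state, 2-action problem.
theta = [[0.0, 0.0], [0.0, 0.0]]
w = [[1.0, 0.0], [0.0, 2.0]]
delta = qac_step(theta, w, s=0, a=0, r=1.0, s_next=1, a_next=1,
                 alpha=0.1, beta=0.5, gamma=0.9)
print(delta, theta[0], w[0])
```

The critic is trained with a TD(0) target, while the actor uses the critic's estimate in place of the Monte-Carlo return used by REINFORCE.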

# <center>The Path Ahead</center>

## From here, the field branches out in several directions:

  * Additional enhancements for reducing variance: Experience Replay, AC with baseline, etc.


  * More complex architectures: Double DQN, A3C


  * Other optimization routes: TRPO


  * Other kinds of RL: Inverse RL, Imitation Learning. 

# <center>Summary</center>

1) Deep Q-learning

2) Policy Gradient

3) Actor Critic 

# <center>Any Questions?</center>



## Reminder: Quiz on 26th March