# <center>Finite Markov Decision Processes</center>

### <center> Reference: Chapter 3, Sutton and Barto </center> 

# <center>Contents</center>


* Why MDPs?

* Markov Property

* Building Blocks of MDP
    * Episodic vs Continuing Tasks
    * State Transition Matrix
    * Return
    * Discount
    * Value Function
    

* MDP Parameters
    * Policy in MDP notation
    * Value Functions in MDP notation

* Bellman Expectation Equations

* Bellman Optimality Equations


# <center>Why Markov Decision Process?</center>

* Markov decision processes formally **describe an environment** for reinforcement learning
* Where the environment is **fully observable**
* i.e. The **current state** completely characterises the process
* Almost all RL problems can be formalised as MDPs, e.g.
    * Optimal control primarily deals with continuous MDPs
    * Partially observable problems can be converted into MDPs
    * Bandits are MDPs with one state

# <center>Markov Property</center>

“The future is independent of the past given the present”
<center> <img src="img/1.png" alt="MarkovProperty Definition" style="width:700px;"/> </center>

* The state captures all relevant information from the history
* Once the state is known, the history may be thrown away
* i.e. The state is a sufficient statistic of the future

# <center>Building Blocks of MDP</center>

## Episodic vs Continuing Tasks

### Episodic Tasks
* Each episode ends in a special state called the terminal state, 
* Followed by a reset to a standard starting state or to a sample from a standard distribution of starting states. 

### Continuing Tasks

* The agent–environment interaction does not break naturally into identifiable episodes.
* It goes on continually without limit. 

## Unified Notation for Episodic and Continuing Tasks





#### Return for Episodic Tasks
A sum over a finite number of terms.

#### Return for Continuing Tasks
A sum over an infinite number of terms.

We need one convention to obtain a single notation that covers both episodic and continuing tasks.

How to do that?

These can be unified by treating episode termination as entry into a **special absorbing state** that **transitions only to itself** and **generates only rewards of zero**. For example, consider the following state transition diagram:

<center><img src="img/unified.png" alt="Matrix" style="width: 1000px;"/></center>

Hence, the return can be written as:

<center><img src="img/return.png" alt="Matrix" style="width: 200px;"/></center>
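
The absorbing-state convention can be checked numerically. The following is a minimal sketch, not part of the original slides: the function name `return_from_rewards` and the reward values are hypothetical. It shows that appending the zero rewards generated by the absorbing state leaves the return unchanged, so the same sum works for both kinds of task.

```python
def return_from_rewards(rewards, gamma):
    """Discounted return of a finite reward list, accumulated backwards."""
    g = 0.0
    # Backward recursion: G_t = R_{t+1} + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Hypothetical episode: three rewards, then termination.
episode_rewards = [1.0, 1.0, 1.0]

# Absorbing state: only zero rewards afterwards (truncated to 50 steps here).
padded_rewards = episode_rewards + [0.0] * 50

g_episode = return_from_rewards(episode_rewards, gamma=0.9)
g_padded = return_from_rewards(padded_rewards, gamma=0.9)
print(g_episode, g_padded)
```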

## State Transition Matrix
<center><img src="img/2.png" alt="Matrix" style="width: 1000px;"/></center>
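
As an illustration, a state transition matrix can be stored as a list of rows, where row `s` holds the probabilities of moving from state `s` to each successor. The sketch below uses a hypothetical 3-state chain and helper names (`is_stochastic`, `sample_next_state`) that are assumptions for this example. It checks that every row sums to 1 and samples a successor state.

```python
import random

# Hypothetical 3-state chain; P[s][s2] = probability of moving from s to s2.
P = [
    [0.9, 0.1, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],  # state 2 is absorbing
]

def is_stochastic(matrix, tol=1e-9):
    """Each row must be a probability distribution over successor states."""
    return all(
        all(p >= 0.0 for p in row) and abs(sum(row) - 1.0) <= tol
        for row in matrix
    )

def sample_next_state(matrix, s, rng):
    """Draw a successor of state s according to row s of the matrix."""
    return rng.choices(range(len(matrix)), weights=matrix[s], k=1)[0]

rng = random.Random(0)
print(is_stochastic(P), sample_next_state(P, 0, rng))
```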

## Return
<center><img src="img/3.png" alt="Matrix" style="width: 1000px;"/></center>

* The discount $γ ∈ [0, 1]$ is the present value of future rewards
* The value of receiving reward $R$ after $k + 1$ time-steps is $γ^k R$.
* This values immediate reward above delayed reward.
    * $γ$ close to 0 leads to "myopic" evaluation
    * $γ$ close to 1 leads to "far-sighted" evaluation
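
To see the effect of $γ$ on evaluation, the sketch below compares two hypothetical reward sequences: a small reward now versus a larger reward three steps later. The sequences and the function name `discounted_return` are assumptions chosen for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum over k of gamma**k * rewards[k], for a finite list of rewards."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Hypothetical sequences: small reward now vs large reward later.
immediate = [2.0, 0.0, 0.0, 0.0]
delayed = [1.0, 0.0, 0.0, 10.0]

for gamma in (0.1, 0.9):
    g_now = discounted_return(immediate, gamma)
    g_later = discounted_return(delayed, gamma)
    preferred = "immediate" if g_now > g_later else "delayed"
    print(f"gamma={gamma}: immediate={g_now:.3f}, delayed={g_later:.3f} -> prefers {preferred}")
```

With the small $γ$ the immediate sequence has the larger return, and with the large $γ$ the delayed one does.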

## Discount 

Most Markov reward and decision processes are discounted. Why?
* Mathematically convenient to discount rewards
* Avoids infinite returns in cyclic Markov processes
* Uncertainty about the future may not be fully represented
* If the reward is financial, immediate rewards may earn more interest than delayed rewards
* Animal/human behaviour shows preference for immediate reward
* It is sometimes possible to use undiscounted Markov reward processes (i.e. $γ = 1$), e.g. if all sequences terminate.
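
The point about avoiding infinite returns can be made concrete with a short worked bound. It assumes every reward satisfies $|R| \le R_{\max}$ for some constant $R_{\max}$ (a symbol introduced here for illustration) and that $γ < 1$:

$$|G_t| = \left| \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \right| \le \sum_{k=0}^{\infty} \gamma^k R_{\max} = \frac{R_{\max}}{1 - \gamma}$$

With $γ = 1$ the geometric series no longer converges, which is why the undiscounted case needs every sequence to terminate.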

## Value Function
The value function $v(s)$ gives the long-term value of state $s$, i.e. the expected return when starting from $s$.
<center><img src="img/5.png" alt="Matrix" style="width: 1000px;"/></center>
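
Because the value of a state is an expected return, one way to approximate it is to average sampled returns. The sketch below is only an illustration under stated assumptions: the one-state reward process, the names `step`, `sample_return` and `mc_value`, and the episode count are all hypothetical. From state 0 the process collects reward 1 and then either stays or terminates with equal probability, so the exact value is $1 / (1 - 0.5γ)$.

```python
import random

GAMMA = 0.9
TERMINAL = 1

def step(state, rng):
    """Hypothetical reward process: reward 1, then stay (p=0.5) or terminate (p=0.5)."""
    reward = 1.0
    next_state = 0 if rng.random() < 0.5 else TERMINAL
    return reward, next_state

def sample_return(start, rng):
    """Roll out one episode from `start` and accumulate the discounted return."""
    g, discount, state = 0.0, 1.0, start
    while state != TERMINAL:
        reward, state = step(state, rng)
        g += discount * reward
        discount *= GAMMA
    return g

def mc_value(start, episodes=20000, seed=0):
    """Average of sampled returns: an estimate of v(start)."""
    rng = random.Random(seed)
    return sum(sample_return(start, rng) for _ in range(episodes)) / episodes

estimate = mc_value(0)
exact = 1.0 / (1.0 - GAMMA * 0.5)
print(f"estimate={estimate:.3f}, exact={exact:.3f}")
```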


# <center>MDP Parameters</center>

A Markov decision process (MDP) is a Markov reward process with decisions. It is an environment in which all states are Markov.

<center><img src="img/4.png" alt="Matrix" style="width: 1000px;"/></center>

## Policy in MDP notation
<center><img src="img/6.png" alt="Matrix" style="width: 1000px;"/></center>
* A policy fully defines the behaviour of an agent
* MDP policies depend on the current state (not the history)
* i.e. Policies are **stationary** (time-independent),
    $A_t ∼ π(·|S_t ), \forall t > 0$
    

## Policy in MDP notation
Given an MDP $M = \left \langle S, A, P, R, \gamma \right \rangle$ and a policy $\pi$, the state and reward sequence forms a Markov reward process with policy-averaged dynamics:

$$P_{ss'}^{\pi} = \sum_{a \in A} \pi(a|s) P_{ss'}^{a}$$
$$R_{s}^{\pi} = \sum_{a \in A} \pi(a|s) R_{s}^{a}$$
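
The two averages can be computed directly. This is a minimal sketch on a hypothetical MDP with 2 states and 2 actions: the numbers and the names `P`, `R`, `pi`, `P_pi` and `R_pi` are assumptions for this example.

```python
# Hypothetical MDP with states 0, 1 and actions "left", "right".
# P[a][s][s2]: probability of s -> s2 under action a.
P = {
    "left":  [[1.0, 0.0], [0.8, 0.2]],
    "right": [[0.3, 0.7], [0.0, 1.0]],
}
# R[a][s]: expected immediate reward for taking action a in state s.
R = {"left": [0.0, 1.0], "right": [2.0, 0.0]}
# pi[s][a]: probability that the policy picks action a in state s.
pi = [{"left": 0.5, "right": 0.5}, {"left": 1.0, "right": 0.0}]

n_states = 2

# P_pi[s][s2] = sum over a of pi(a|s) * P[a][s][s2]
P_pi = [
    [sum(pi[s][a] * P[a][s][s2] for a in P) for s2 in range(n_states)]
    for s in range(n_states)
]
# R_pi[s] = sum over a of pi(a|s) * R[a][s]
R_pi = [sum(pi[s][a] * R[a][s] for a in R) for s in range(n_states)]

print(P_pi, R_pi)
```

Each row of `P_pi` is again a probability distribution, since it is a weighted average of distributions.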

## Example: Recycling Robot
<center><img src="img/robot.jpg" alt="Matrix" style="width: 100px;"/></center>

**Task:**

Collect empty soda cans in an office.

**Sensors and actuators:**

1) Detector: for detecting cans

2) Arm + gripper: to pick up a can and place it in the onboard bin

**<center>How can we formulate this as an MDP?</center>**

**We first need to identify the States (S), Actions (A) and Rewards (R).**

**Actions:**

1) {Search} - Actively search for a can

2) {Wait} - Remain stationary and wait for someone to bring a can (this drains less battery)

3) {Recharge} - Head back to the home base to recharge

**States:**

1) high - The battery is well charged

2) low - The battery charge is low

**Rewards:**

1) Zero most of the time

2) Positive when the robot secures an empty can

3) Negative if the battery runs all the way down

**<center>How can we now formulate this as an MDP?</center>**

## Transition Probabilities and Expected Rewards
<center><img src="img/pic1.png" alt="Matrix" style="width: 500px;"/></center>

## Transition Graph
<center><img src="img/pic2.png" alt="Matrix" style="width: 500px;"/></center>
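
The robot's dynamics can also be written down as a lookup table. The sketch below follows the recycling robot example in Sutton and Barto (Example 3.3) and assumes the table above matches it. The numeric values of `ALPHA`, `BETA`, `R_SEARCH` and `R_WAIT` are placeholders chosen here, and the helper `expected_reward` is hypothetical. As in the book, recharging is only available in the low state, and running the battery flat while searching costs a reward of -3.

```python
# Assumed placeholder parameters (not taken from the slides).
ALPHA, BETA = 0.8, 0.6        # probability the battery level stays the same while searching
R_SEARCH, R_WAIT = 2.0, 1.0   # expected cans collected; searching is assumed to pay more

# dynamics[(state, action)] = list of (probability, next_state, reward)
dynamics = {
    ("high", "search"):  [(ALPHA, "high", R_SEARCH), (1 - ALPHA, "low", R_SEARCH)],
    ("low", "search"):   [(BETA, "low", R_SEARCH), (1 - BETA, "high", -3.0)],  # rescued if depleted
    ("high", "wait"):    [(1.0, "high", R_WAIT)],
    ("low", "wait"):     [(1.0, "low", R_WAIT)],
    ("low", "recharge"): [(1.0, "high", 0.0)],
}

def expected_reward(state, action):
    """Expected immediate reward for taking `action` in `state`."""
    return sum(p * r for p, _, r in dynamics[(state, action)])

for (state, action), outcomes in dynamics.items():
    print(state, action, outcomes, expected_reward(state, action))
```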


## Value Function in MDP notation
<center><img src="img/7.png" alt="Matrix" style="width: 1000px;"/></center>


# <center>Bellman Expectation Equation</center>

## Bellman Expectation Equation
<center><img src="img/b1.png" alt="Matrix" style="width: 1000px;"/></center>

## Bellman Expectation Equation for $V^\pi$
<center><img src="img/b2.png" alt="Matrix" style="width: 1000px;"/></center>

## Bellman Expectation Equation for $Q^\pi$
<center><img src="img/b3.png" alt="Matrix" style="width: 1000px;"/></center>

## Bellman Expectation Equation for $v_\pi$
<center><img src="img/b4.png" alt="Matrix" style="width: 1000px;"/></center>

## Bellman Expectation Equation for $q_\pi$
<center><img src="img/b5.png" alt="Matrix" style="width: 1000px;"/></center>
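
The Bellman expectation equation can be used as an update rule: start from an arbitrary guess for the value function and repeatedly replace it with the right-hand side until it stops changing. The sketch below does this for a hypothetical two-state MDP under a uniformly random policy. The MDP, the tolerance, and the names `q_from_v` and `evaluate_policy` are assumptions for illustration. In state 0, "stay" gives reward 1 and "go" moves to an absorbing zero-reward state, so the exact value is $v_\pi(0) = 0.5 / (1 - 0.5γ)$.

```python
GAMMA = 0.9

# dynamics[s][a] = list of (probability, next_state, reward); state 1 is absorbing.
dynamics = {
    0: {"stay": [(1.0, 0, 1.0)], "go": [(1.0, 1, 0.0)]},
    1: {"stay": [(1.0, 1, 0.0)], "go": [(1.0, 1, 0.0)]},
}
# Uniformly random policy: policy[s][a] = probability of action a in state s.
policy = {s: {"stay": 0.5, "go": 0.5} for s in dynamics}

def q_from_v(v, s, a):
    """Action value: expected reward plus discounted value of the successor."""
    return sum(p * (r + GAMMA * v[s2]) for p, s2, r in dynamics[s][a])

def evaluate_policy(tol=1e-10):
    """Apply the Bellman expectation backup until the values stop changing."""
    v = {s: 0.0 for s in dynamics}
    while True:
        new_v = {
            s: sum(policy[s][a] * q_from_v(v, s, a) for a in dynamics[s])
            for s in dynamics
        }
        if max(abs(new_v[s] - v[s]) for s in dynamics) < tol:
            return new_v
        v = new_v

v_pi = evaluate_policy()
q_pi = {(s, a): q_from_v(v_pi, s, a) for s in dynamics for a in dynamics[s]}
print(v_pi, q_pi)
```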

# <center>Bellman Optimality Equation</center>

## Optimal Value Function
<center><img src="img/o.png" alt="Matrix" style="width: 1000px;"/></center>

## Optimal Policy
<center><img src="img/o1.png" alt="Matrix" style="width: 1000px;"/></center>

## Finding an Optimal Policy
<center><img src="img/o2.png" alt="Matrix" style="width: 1000px;"/></center>

## Bellman Optimality Equation for $v_{*}$
<center><img src="img/op1.png" alt="Matrix" style="width: 1000px;"/></center>

## Bellman Optimality Equation for $Q_{*}$
<center><img src="img/op2.png" alt="Matrix" style="width: 1000px;"/></center>

## Bellman Optimality Equation for $V^{*}$
<center><img src="img/op3.png" alt="Matrix" style="width: 1000px;"/></center>

## Bellman Optimality Equation for $Q^{*}$
<center><img src="img/op4.png" alt="Matrix" style="width: 1000px;"/></center>
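
One common way to solve the Bellman optimality equation numerically is to apply it repeatedly as an update, a procedure usually called value iteration, and then act greedily with respect to the result. The slides do not prescribe this method; the sketch below is an illustration on a hypothetical two-state MDP, and the names `backup`, `value_iteration` and `greedy_policy` are assumptions. Always choosing "stay" in state 0 earns reward 1 forever, so $v_*(0) = 1 / (1 - γ)$.

```python
GAMMA = 0.9

# dynamics[s][a] = list of (probability, next_state, reward); state 1 is absorbing.
dynamics = {
    0: {"stay": [(1.0, 0, 1.0)], "go": [(1.0, 1, 0.0)]},
    1: {"stay": [(1.0, 1, 0.0)], "go": [(1.0, 1, 0.0)]},
}

def backup(v, s, a):
    """One-step lookahead: expected reward plus discounted successor value."""
    return sum(p * (r + GAMMA * v[s2]) for p, s2, r in dynamics[s][a])

def value_iteration(tol=1e-10):
    """Repeat v(s) <- max over a of backup(v, s, a) until convergence."""
    v = {s: 0.0 for s in dynamics}
    while True:
        new_v = {s: max(backup(v, s, a) for a in dynamics[s]) for s in dynamics}
        if max(abs(new_v[s] - v[s]) for s in dynamics) < tol:
            return new_v
        v = new_v

def greedy_policy(v):
    """Pick, in each state, an action that maximises the one-step lookahead."""
    return {s: max(dynamics[s], key=lambda a: backup(v, s, a)) for s in dynamics}

v_star = value_iteration()
pi_star = greedy_policy(v_star)
print(v_star, pi_star)
```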

## Summary
* We looked into the MDP formulation of an RL problem.
* We looked into the formulation of value functions.
    * state-value functions
    * action-value functions
* We covered the motivation for, and necessity of, the Bellman Expectation Equations and Bellman Optimality Equations.