# Finite Markov Decision Processes
Notes taken from: "Reinforcement Learning" by Richard S. Sutton and Andrew G. Barto <br>
Link: http://www.incompleteideas.net/book/the-book.html 

## What is an MDP
A Markov Decision Process (MDP) formalizes sequential decision making in which each action influences not just the immediate reward but also subsequent situations, and thus future rewards.
An MDP has two main components, an **Agent** and the **Environment**:
* **Agent** - the learner and decision maker. Typically anything that can be freely and directly changed by the decision maker is considered part of the agent.
* **Environment** - everything outside of the agent. The agent acts on parts of the environment, and the environment determines what rewards the agent receives.

The agent and environment interact over a sequence of **discrete** time steps. At each step the agent performs an action $A_t$ on the environment, which responds with a reward $R_{t+1}$ and a new state $S_{t+1}$ that are fed back to the agent.<br>
![image.png](attachment:361b4606-335a-4d4c-8286-4af0327271bc.png)

Through these interactions the MDP produces a **trajectory** of states, actions, and rewards:<br>
Trajectory = $S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, \ldots$ <br>
**Note:** in a **finite** MDP the sets of states, actions, and rewards are finite

For each next state and reward there is a discrete probability of them occurring, given the preceding state and action:<br>
$$p(s_{t+1}, r_{t+1} \mid s_t, a_t)$$
<br>
For every state-action pair, these probabilities sum to 1 over all possible next states and rewards:
$$\sum_{s_{t+1} \in \mathcal{S}} \sum_{r_{t+1} \in \mathcal{R}} p(s_{t+1}, r_{t+1} \mid s_t, a_t) = 1$$
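
As a minimal sketch, the dynamics function can be stored as a lookup table and the sum-to-1 property checked directly. The two-state MDP below (states `low`/`high`, actions `wait`/`search`, and all of the numbers) is invented purely for illustration and does not come from the book.

```python
# Hypothetical two-state MDP, invented for illustration only.
# dynamics[(s, a)] maps each (next_state, reward) outcome to its probability.
dynamics = {
    ("low", "wait"):    {("low", 0.0): 1.0},
    ("low", "search"):  {("low", 1.0): 0.4, ("high", -3.0): 0.6},
    ("high", "wait"):   {("high", 0.0): 1.0},
    ("high", "search"): {("high", 1.0): 0.7, ("low", 1.0): 0.3},
}

def p(next_state, reward, state, action):
    """Return p(s', r | s, a); outcomes that are not listed have probability 0."""
    return dynamics[(state, action)].get((next_state, reward), 0.0)

def is_normalized(dynamics, tol=1e-9):
    """Check that the outcome probabilities sum to 1 for every (s, a) pair."""
    return all(abs(sum(outcomes.values()) - 1.0) <= tol
               for outcomes in dynamics.values())

print(is_normalized(dynamics))
print(p("high", -3.0, "low", "search"))
```

Because the sets of states, actions, and rewards are finite, a plain dictionary is enough to hold the whole function.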

**Note:** In a **Markov** decision process each state and reward depend only on the immediately preceding state and action, not on any earlier states or actions. Non-Markov processes can be represented as MDPs by using a more complicated state variable, for example one that includes the relevant history.


## Goals and Rewards

At each step the agent receives a reward, which is a single number. The agent's objective is to maximize the cumulative reward over the long run, not the immediate reward.<br>
<br>
The reward signal should communicate only *what* the agent should achieve, not *how* it should achieve it. For example, a chess robot should be rewarded only for winning, not for taking the opponent's pieces.

### Discounting
Rewards in the future are often considered less valuable than rewards in the present. Future rewards are therefore multiplied by powers of a **discount rate** $\gamma$, with $0 \le \gamma \le 1$. The discounted return from time $t$ is then:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
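
A small sketch of the discounted sum for a finite list of rewards. The helper name `discounted_return` and the sample rewards are assumptions made for this example.

```python
def discounted_return(rewards, gamma):
    """Sum gamma**k * rewards[k], where rewards[0] plays the role of R_{t+1}."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

rewards = [1.0, 0.0, 2.0]  # R_{t+1}, R_{t+2}, R_{t+3}
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
```

With $\gamma = 0$ only the immediate reward counts, and with $\gamma = 1$ every reward counts equally.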

## Returns and Episodes

As stated above, the agent's objective is to maximize the cumulative reward. However, the cumulative reward is unknown until the end of the trajectory, so the agent is really trying to maximize the *expected* cumulative reward (the expected return).

In general, MDPs can be broken into two types, episodic and continuing tasks:
* **Episodic tasks** - the agent-environment interaction breaks naturally into subsequences (episodes), each ending in a **terminal state**, after which the environment is reset to a standard starting state. An example is a game of chess, which ends in a terminal state of winning or losing followed by a game reset.
* **Continuing tasks** - as the name suggests, the agent-environment interaction goes on without end.

However, we would like to frame both continuing and episodic MDP problems in a way that allows for the same mathematical representation. Therefore, we treat the terminal state of an episodic MDP as an absorbing state that transitions only to itself and generates a reward of 0 forever. For example:

![image.png](attachment:fbfdc43e-c9e2-4b0e-815c-a907cb987834.png)

<br>
Therefore the return $G_t$ for all MDPs can be written as
$$G_t = \sum_{k=t+1}^T \gamma^{k-t-1}R_k$$
where either $T = \infty$ or $\gamma = 1$, but not both.
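
The sum above implies the recursion $G_t = R_{t+1} + \gamma G_{t+1}$, so every return in an episode can be computed in one backward pass. The sketch below also checks that padding an episode with the zero rewards of the absorbing state leaves the returns unchanged. The function name and the sample episode are assumptions for illustration.

```python
def returns_from_rewards(rewards, gamma):
    """Return [G_0, ..., G_{T-1}] given the rewards [R_1, ..., R_T]."""
    returns = [0.0] * len(rewards)
    g = 0.0  # G_T = 0: nothing is collected after the terminal state
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g  # G_t = R_{t+1} + gamma * G_{t+1}
        returns[t] = g
    return returns

episode = [1.0, 1.0, 1.0]     # R_1, R_2, R_3, then the terminal state
padded = episode + [0.0] * 5  # the absorbing state keeps emitting reward 0

print(returns_from_rewards(episode, 0.5))     # [1.75, 1.5, 1.0]
print(returns_from_rewards(padded, 0.5)[:3])  # [1.75, 1.5, 1.0]
```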

## Policies and Value Functions
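* **Value functions** estimate how "good" it is for the agent to be in a given state, or to take a given action in a given state, measured in expected return.
* A **policy** is an action strategy for the agent. Formally, it is a mapping from each state to a probability distribution over the available actions. It is written $\pi$, where $\pi(a \mid s)$ is the probability of taking action $a$ in state $s$.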
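
A minimal sketch of a stochastic policy as a table of action probabilities per state, with a helper that samples from it. The state names, action names, and probabilities are hypothetical.

```python
import random

# Hypothetical stochastic policy: policy[s][a] is pi(a | s).
policy = {
    "low":  {"wait": 0.9, "search": 0.1},
    "high": {"wait": 0.2, "search": 0.8},
}

def sample_action(policy, state, rng=random):
    """Draw an action a with probability pi(a | state)."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]

print(sample_action(policy, "high"))
```

A deterministic policy is the special case where one action in each state has probability 1.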

![image.png](attachment:b80cc7c7-b1ec-45ee-b175-e49882003874.png)
<br>
![image.png](attachment:46802cf1-7021-4746-a75a-53e93d1fa15f.png)
<br>
The value functions $v_\pi$ and $q_\pi$ can be estimated from experience using Monte Carlo methods, which average the returns observed after visiting each state or state-action pair.
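
As a rough sketch of the Monte Carlo idea, the code below averages sampled returns from the start state of a made-up one-state task and compares the result with the exact value. The task, the function names, and the parameters are assumptions for illustration, and the policy is folded into the episode generator since there is only one way to act.

```python
import random

def run_episode(rng, p_stop=0.5):
    """Hypothetical one-state task: each step either ends the episode with
    reward 1 (probability p_stop) or continues with reward 0."""
    rewards = []
    while True:
        if rng.random() < p_stop:
            rewards.append(1.0)
            return rewards
        rewards.append(0.0)

def mc_state_value(num_episodes, gamma, seed=0):
    """Average the sampled returns G_0 observed from the start state."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_episodes):
        rewards = run_episode(rng)
        total += sum(gamma ** k * r for k, r in enumerate(rewards))
    return total / num_episodes

estimate = mc_state_value(num_episodes=20_000, gamma=0.9)
exact = 0.5 / (1 - 0.5 * 0.9)  # solves v = 0.5 * 1 + 0.5 * (0 + 0.9 * v)
print(estimate, exact)
```

The estimate should approach the exact value as the number of episodes grows.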

### Bellman equation

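The Bellman equation expresses the value of a state in terms of the values of its successor states. For each action the policy might take, and each next state and reward that could result, add the reward to the discounted value of the next state, then average these quantities weighted by their probabilities.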
![image.png](attachment:c07afcae-618e-4138-8c77-51b01407cfb1.png)
The following is known as a *backup diagram*. It is a visual representation of the Bellman equation.
<br>
![image.png](attachment:62062f1e-62ad-4f2e-bf58-6ad50f1f6393.png)
#### Grid World
The agent gets a reward of -1 for any action that would take it off the board, +10 for jumping from A to A', and +5 for jumping from B to B'. All other moves give a reward of 0.
![image.png](attachment:e4b3ac29-a34d-4b02-9fbc-4d052a5b88fc.png)

Note that in Grid World each tile is effectively a state.
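
The Bellman equation can be applied repeatedly to Grid World until the values stop changing. The sketch below makes several assumptions that are not stated in these notes: a 5x5 board, the positions chosen for A, A', B, and B', a policy that picks each of the four moves with probability 0.25, a discount rate of 0.9, and that moving off the board leaves the agent where it was.

```python
N = 5                                       # assumed 5x5 board
A, A_PRIME = (0, 1), (4, 1)                 # assumed positions of A and A'
B, B_PRIME = (0, 3), (2, 3)                 # assumed positions of B and B'
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(state, move):
    """Deterministic dynamics: return (next_state, reward)."""
    if state == A:
        return A_PRIME, 10.0
    if state == B:
        return B_PRIME, 5.0
    row, col = state[0] + move[0], state[1] + move[1]
    if 0 <= row < N and 0 <= col < N:
        return (row, col), 0.0
    return state, -1.0                      # off the board: stay put, reward -1

def evaluate_random_policy(gamma=0.9, tol=1e-8):
    """Sweep the Bellman equation for the equiprobable policy until it settles."""
    v = [[0.0] * N for _ in range(N)]
    while True:
        new_v = [[0.0] * N for _ in range(N)]
        delta = 0.0
        for row in range(N):
            for col in range(N):
                for move in MOVES:
                    (nr, nc), reward = step((row, col), move)
                    new_v[row][col] += 0.25 * (reward + gamma * v[nr][nc])
                delta = max(delta, abs(new_v[row][col] - v[row][col]))
        v = new_v
        if delta < tol:
            return v

v = evaluate_random_policy()
for row in v:
    print([round(x, 1) for x in row])
```

A quick sanity check on the result is that the value at A equals $10 + \gamma$ times the value at A', since every action from A leads to A'.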

**Note:** A coded example of Grid World is in the Dynamic Programming notebook.

## Optimal Policies and Optimal Value Functions

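An optimal policy is one whose value is at least as high as that of every other policy in every state. The optimal state-value and action-value functions are:
<br>
$$v_*(s) = \max_\pi v_\pi(s)$$
<br>
$$q_*(s, a) = \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a]$$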
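
One simple way to see these definitions in action is to combine the one-step lookahead for $q_*$ with the fact that $v_*(s) = \max_a q_*(s, a)$, and repeat until the values settle. This is value iteration, a technique not covered in these notes, and the two-state MDP below is hypothetical, so treat it as a sketch only.

```python
# Hypothetical MDP: dynamics[s][a] is a list of (probability, next_state, reward).
dynamics = {
    "s0": {"stay": [(1.0, "s0", 1.0)], "go": [(1.0, "s1", 0.0)]},
    "s1": {"stay": [(1.0, "s1", 3.0)], "go": [(1.0, "s0", 0.0)]},
}

def q_from_v(v, state, action, gamma):
    """One-step lookahead: E[R_{t+1} + gamma * v(S_{t+1}) | s, a]."""
    return sum(prob * (reward + gamma * v[next_state])
               for prob, next_state, reward in dynamics[state][action])

def optimal_values(gamma=0.5, tol=1e-10):
    """Repeatedly apply v(s) <- max_a q(s, a) until the values stop changing."""
    v = {s: 0.0 for s in dynamics}
    while True:
        new_v = {s: max(q_from_v(v, s, a, gamma) for a in dynamics[s])
                 for s in dynamics}
        if max(abs(new_v[s] - v[s]) for s in dynamics) < tol:
            return new_v
        v = new_v

def greedy_action(v, state, gamma=0.5):
    """Pick the action with the largest one-step lookahead value."""
    return max(dynamics[state], key=lambda a: q_from_v(v, state, a, gamma))

v_star = optimal_values()
print(v_star)
print({s: greedy_action(v_star, s) for s in dynamics})
```

In this toy problem, staying in `s1` forever is worth $3 / (1 - 0.5) = 6$, so the greedy action in `s0` is to give up its own reward of 1 and move to `s1`.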