# 1. Markov Decision Processes
We are now going to formalize some of the concepts that we have learned about in reinforcement learning. So far we have learned about the terms:
> * **Agent**
* **Environment**
* **Action**
* **State**
* **Reward**
* **Episode**

This section is about putting these concepts into a formal framework called **Markov Decision Processes**.

## 1.1 Gridworld
In this section we describe the game that we will use for the rest of this course. It is in some ways simpler than tic-tac-toe, but it has some features that let us explore the more interesting properties of RL.

<img src="images/gridworld.png">

In this game our agent is a robot, and the environment is a grid. The agent is allowed to move in 4 directions: up, down, left, and right. Grid world is generally built in the following way: 
> * at position (1, 1) there is a wall, so if the robot tries to go there it will bump into the wall. 
* (0,3) is a winning state (terminal state with a +1 reward)
* (1, 3) is a losing state (terminal state with a -1 reward)

One thing we will notice about gridworld is that it has a much smaller number of states than tic-tac-toe: there are only 12 positions, 11 states (the positions the robot can occupy, since one position is a wall), and 4 actions. That is a small game! However, there are many concepts to be learned!
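
As a rough illustration, the layout above could be encoded as follows. This is only a sketch: it assumes a 3x4 grid indexed as (row, column) with (0, 0) in the top-left corner, and it assumes that walking into the wall or off the edge leaves the robot where it is. The names `WALL`, `TERMINAL_REWARDS`, `ACTIONS`, `STATES`, and `move` are made up for this example.

```python
# Hypothetical encoding of the gridworld described above (3 rows x 4 columns assumed).
ROWS, COLS = 3, 4
WALL = (1, 1)
TERMINAL_REWARDS = {(0, 3): +1, (1, 3): -1}

# Each action is a (row, column) offset.
ACTIONS = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}

# Every position except the wall is a state the robot can occupy.
STATES = [(i, j) for i in range(ROWS) for j in range(COLS) if (i, j) != WALL]


def move(state, action):
    """Return the next state; bumping into the wall or an edge leaves the robot in place."""
    di, dj = ACTIONS[action]
    nxt = (state[0] + di, state[1] + dj)
    if nxt == WALL or not (0 <= nxt[0] < ROWS and 0 <= nxt[1] < COLS):
        return state
    return nxt


print(len(STATES))        # 11
print(move((1, 0), "R"))  # (1, 0): blocked by the wall
```

Twelve positions minus one wall gives the 11 states mentioned above.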

---

# 2. The Markov Property
Let us first review the **Markov Property** in the strict mathematical sense. Suppose we have a sequence:

#### $$\{x_1, x_2,...,x_t\}$$

We can define a conditional probability on $x_t$, given all the previous $x$'s:

#### $$p\{x_t \mid x_{t-1}, x_{t-2}, ..., x_1\}$$

Generally speaking, this can't be simplified. However, if we assume that the Markov property holds, then it can be simplified. The Markov property specifies how many previous $x$'s the current $x$ depends on. So, **First-Order Markov** means that $x_t$ depends only on $x_{t-1}$:

#### $$p\{x_t \mid x_{t-1}, x_{t-2}, ..., x_1\} = p \{x_t \mid x_{t-1} \}$$

**Second-Order Markov** means that $x_t$ depends only on $x_{t-1}$ and $x_{t-2}$:

#### $$p\{x_t \mid x_{t-1}, x_{t-2}, ..., x_1\} = p \{x_t \mid x_{t-1}, x_{t-2} \}$$

For now we will be working with the first-order Markov property only, and we typically refer to this as *the Markov property*.
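
To make the first-order case concrete, here is a small sketch that samples a sequence in which each element is drawn using only the previous element. The two-state "weather" chain and its probabilities are invented for illustration, and `sample_chain` is a hypothetical helper name.

```python
import random

# Made-up transition probabilities p(x_t | x_{t-1}), for illustration only.
TRANSITIONS = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}


def sample_chain(start, length, seed=0):
    """Sample a sequence where each element depends only on the previous one."""
    rng = random.Random(seed)
    sequence = [start]
    for _ in range(length - 1):
        options = TRANSITIONS[sequence[-1]]  # only x_{t-1} is consulted
        nxt = rng.choices(list(options), weights=list(options.values()))[0]
        sequence.append(nxt)
    return sequence


print(sample_chain("sunny", 5))
```

Note that the loop never looks further back than `sequence[-1]`; that is the whole content of the first-order assumption.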

## 2.1 Simple Example
Consider the sentence: **"Let's do a simple example"**. Let's say that you are given:

> "Let's do a simple"

In this case it is relatively easy to guess that the next word is "example". Now, all we are given is:

> "simple"

It is no longer easy to predict the next word. It can get even harder! For instance, what if we were just given:

> "a"

Now it is _very_ difficult to predict the next word, "simple". That is what the Markov assumption amounts to: predicting from only the most recent item. We can clearly see that it can be quite limiting. However, we can also define the problem so that it is not.

## 2.2 Markov Property in RL
So, what exactly does the Markov property look like in RL? Recall that taking an action $A(t)$ while in state $S(t)$ produces two things: the next state $S(t+1)$ and a reward $R(t+1)$:

#### $$\{S(t), A(t) \} \rightarrow \{ S(t+1), R(t+1)\}$$

In this case, the Markov property says that $S(t+1)$ and $R(t+1)$ depend only on $A(t)$ and $S(t)$, and not on any $A$ or $S$ before that:

#### $$p\big(S_{t+1}, R_{t+1} \mid S_t, A_t, S_{t-1}, A_{t-1},...,S_0, A_0\big) = p \big( S_{t+1}, R_{t+1} \mid S_t, A_t\big) $$

For convenience, we can also use the shorthand symbols we have mentioned earlier: $s, a,  r, s'$:

#### $$p(s', r \mid s, a) = p(S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a)$$ 

So, how is this different from the way that we usually write the Markov property? Well, notice that this is a _joint distribution_ on $s'$ and $r$. So, it is telling us the joint distribution of two variables, conditioned on two other variables. This is different from the usual Markov form, where we have one variable on the left, and one variable on the right.

## 2.3 Other Conditional Distributions
Given the above joint conditional distribution, it is of course just a matter of using the rules of probability to find the marginal and conditional distributions. For example, if we just want to know $s'$ given $s$ and $a$ we can use:

#### $$p(s' \mid s, a) = \sum_{r \in R}p(s', r \mid s, a )$$

And if we just wanted to know $r$ given $s$ and $a$:

#### $$p(r \mid s,a) = \sum_{s' \in S} p(s', r \mid s, a)$$

Also, note that for essentially all cases that we will consider, these distributions will be deterministic, meaning that they put probability 1 on a single outcome. That means that the reward you get for going to a state will always be the same reward, and taking an action in a state will always take you to the same next state.
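
The two marginalizations above can be sketched in a few lines. The joint distribution here is a made-up table for a single $(s, a)$ pair, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical joint distribution p(s', r | s, a) for one fixed (s, a) pair,
# keyed by (next_state, reward). The numbers are made up and sum to 1.
joint = {
    ("s1", 0): 0.5,
    ("s1", 1): 0.2,
    ("s2", 1): 0.3,
}


def marginal_next_state(joint):
    """p(s' | s, a): sum the joint over all rewards r."""
    out = defaultdict(float)
    for (s_next, _r), p in joint.items():
        out[s_next] += p
    return dict(out)


def marginal_reward(joint):
    """p(r | s, a): sum the joint over all next states s'."""
    out = defaultdict(float)
    for (_s_next, r), p in joint.items():
        out[r] += p
    return dict(out)


print(marginal_next_state(joint))
print(marginal_reward(joint))
```

In the deterministic case the table would contain a single entry with probability 1, and both marginals would be trivial.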

## 2.4 Is the Markov Assumption Limiting?
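Let's look at a recent application of RL to demonstrate that the Markov assumption is not necessarily limiting. DeepMind used the concatenation of the 4 most recent frames to represent the current state when playing Atari games. A single frame cannot tell you which way, or how fast, objects are moving; stacking several frames puts that information into the state itself. In other words, we are free to define the state so that it contains whatever history is needed for the Markov property to be a reasonable assumption.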
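
One way to picture a state built from the 4 most recent frames is a fixed-length buffer. The sketch below is not DeepMind's code; `FrameStack` is a hypothetical class, frames are stand-in strings, and padding with copies of the first frame at the start of an episode is an assumption.

```python
from collections import deque

N_FRAMES = 4  # number of recent frames that make up one state


class FrameStack:
    """Keep the most recent frames and expose them together as one state."""

    def __init__(self, first_frame, n_frames=N_FRAMES):
        # Assumption: pad with copies of the first frame at episode start.
        self.frames = deque([first_frame] * n_frames, maxlen=n_frames)

    def push(self, frame):
        self.frames.append(frame)  # the oldest frame is dropped automatically
        return self.state()

    def state(self):
        return tuple(self.frames)


stack = FrameStack("f0")
stack.push("f1")
print(stack.push("f2"))  # ('f0', 'f0', 'f1', 'f2')
```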

---

# 3. Markov Decision Processes (MDPs)
We have essentially been looking at MDPs this entire time, but just not referring to them by name. Any RL task with a set of states, actions, and rewards that follows the Markov property is a Markov decision process.

Formally speaking, the MDP is a 5-tuple, made up of:

> * **Set of states**
* **Set of actions**
* **Set of rewards**
* **State-transition probabilities, reward probabilities (as defined jointly earlier)**
* **Discount factor**
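
The 5-tuple maps naturally onto a small container type. The following is a sketch under assumptions: the class name `MDP`, its field names, and the toy two-state example are all invented here, and the dynamics are stored as the joint $p(s', r \mid s, a)$ defined earlier.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Hashable, Tuple

State = Hashable
Action = Hashable


@dataclass(frozen=True)
class MDP:
    states: FrozenSet[State]
    actions: FrozenSet[Action]
    rewards: FrozenSet[float]
    # dynamics[(s, a)] maps (s', r) to p(s', r | s, a)
    dynamics: Dict[Tuple[State, Action], Dict[Tuple[State, float], float]]
    gamma: float

    def is_valid(self, tol=1e-9):
        """Check that each conditional distribution sums to 1."""
        return all(abs(sum(d.values()) - 1.0) <= tol for d in self.dynamics.values())


# Toy example with made-up states and rewards.
toy = MDP(
    states=frozenset({"A", "B"}),
    actions=frozenset({"go"}),
    rewards=frozenset({0.0, 1.0}),
    dynamics={("A", "go"): {("B", 1.0): 1.0}, ("B", "go"): {("B", 0.0): 1.0}},
    gamma=0.9,
)
print(toy.is_valid())  # True
```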

## 3.1 Policy
There is one final piece needed to complete our puzzle. The other key term in Markov decision process is **decision**. The way that we make decisions, and choose what actions to take in what states, is called a policy. We generally denote the policy with the symbol $\pi$. Technically, $\pi$ is not part of the MDP itself, but it is part of the solution, along with the value function.

The reason we are just talking about the policy now is because it is somewhat of a weird symbol. We write down $\pi$ as a mathematical symbol, but there is no equation for $\pi$. For example, if $\pi$ is epsilon-greedy, how do we write that as an equation? It is more like an algorithm. The only exception to this is when you want to write down the **optimal policy**, which can be defined mathematically, in terms of the *value function*; we will discuss this later. So, for now we can just think of $\pi$ as the shorthand notation for the algorithm that the agent is using to navigate the environment. 
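
To illustrate the point that a policy is often easier to state as an algorithm than as an equation, here is a minimal sketch of an epsilon-greedy policy. The function name, the `q_values` argument (a dict of estimated values per action), and the numbers are assumptions made for this example.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the best-known action.

    q_values: dict mapping action -> estimated value (a hypothetical input).
    """
    if rng.random() < epsilon:
        return rng.choice(list(q_values))
    return max(q_values, key=q_values.get)


estimates = {"U": 0.1, "D": -0.2, "L": 0.0, "R": 0.5}  # made-up estimates
print(epsilon_greedy(estimates, epsilon=0.0))  # R
```

With `epsilon=0.0` the random branch is never taken, so the policy is purely greedy.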

## 3.2 State-transition Probability
Let's look at the state transition probability again:

#### $$p(s' \mid s, a)$$

Recall that we said that this is typically deterministic, but that is not always the case. Why might that be so? Recall that the state is only what the agent derives from what it senses of the environment; it is not the environment itself. The state can be an imperfect representation of the environment, in which case you would expect the state transition to be probabilistic. For example, the state you measure could represent multiple configurations of the environment. As an example of an imperfect representation of the environment, think about blackjack; you may think of the dealer's next card as part of the state. But, as the agent, you can't see the next card, so it is not part of your state. It _is_ part of the environment.

## 3.3 Actions
When we think of actions, we typically think of joystick inputs (up/down/left/right/jump) or blackjack moves (hit/stand). However, actions can be very broad as well, such as how to distribute government funding. So, RL can be applied to making political decisions as well. 

## 3.4 Agent vs Environment
Sometimes there is a bit of confusion surrounding what constitutes the agent vs. the environment. You are navigating your environment, but what constitutes you? Are you your body? Your body is, more correctly, part of the environment! Your body isn't making decisions or learning; your body has sensors which pass on signals to your brain, but it is your brain and mind that make all decisions and do all learning! 

---

# 4. Future Rewards
## 4.1 Total Reward
We are now going to formalize the idea of **total future reward**. This refers to everything from $t+1$ and onward. We call this the **return**, $G(t)$:

#### $$G(t) = \sum_{\tau = 1}^\infty R(t + \tau)$$

Notice how it does not depend on the current reward, $R(t)$. This is because when we arrive at a state, we receive the reward for that state; there is nothing to predict about it, because it has already happened.

Now, think of a very long task, potentially containing thousands of steps. Your goal is to maximize your total reward. However, is there a difference between getting a reward now, and getting that same reward 10 years from now? Think about finance; we know that \$1000 today is worth less than \$1000 was 10 years ago. Would you rather get \$1000 now, or 10 years from now? Choose today!

## 4.2 Discount Factor
This causes us to introduce a discount factor on future rewards. We call the discount factor $\gamma$, and we use a number between 0 and 1 to represent it:

#### $$G(t) = \sum_{\tau = 0}^ \infty \gamma ^{\tau} R(t + \tau + 1)$$

A $\gamma = 1$ means that we don't care how far in the future a reward is; all rewards are weighted equally. A $\gamma = 0$ means that we don't care about future rewards at all, which gives a truly greedy algorithm, since the agent would only try to maximize its immediate reward. Usually we choose something close to 1, such as 0.9. If we have a very short episodic task, it may not be worth discounting at all. An intuitive reason for discounting future rewards is that the further you look into the future, the harder it is to predict. Hence, there is not much sense in counting on a reward 10 years from now, unless you are sure you can make it happen and that your circumstances won't change.
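
A short sketch of the discounted return, using a finite list of made-up future rewards in place of the infinite sum. The function name `discounted_return` is hypothetical.

```python
def discounted_return(rewards, gamma):
    """G(t) = sum over tau of gamma**tau * R(t + tau + 1).

    rewards: [R(t+1), R(t+2), ...], a finite list for illustration.
    """
    return sum((gamma ** tau) * r for tau, r in enumerate(rewards))


future_rewards = [0, 0, 1]  # made-up rewards; the only payoff is 3 steps away
print(discounted_return(future_rewards, gamma=1.0))  # 1.0
print(discounted_return(future_rewards, gamma=0.9))
print(discounted_return(future_rewards, gamma=0.0))
```

With $\gamma = 1$ the distant reward counts in full, with $\gamma = 0.9$ it is scaled by $0.9^2$, and with $\gamma = 0$ only the immediate reward $R(t+1)$ survives.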

## 4.3 Merging Continuous and Episodic Tasks
You may notice that the sum for the return goes from 0 to $\infty$; this implies that we are looking at a continuous task, when in reality the tasks we have looked at so far (tic-tac-toe) are episodic. This is a mathematical subtlety, but we actually want to write all of our equations in continuous form; simply put, it makes the math a little easier to work with. 

There is a way to merge episodic and continuous tasks so that they are equivalent. The way you do it is this: The episodic task has a terminal state. Pretend that there is a state transition from the terminal state to itself, that always happens with probability 1, and always yields a reward of 0. In this way, the episodic task remains the same, but since it goes on forever, it is technically a continuous task.

<img src="images/continuous-episodic.png">
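
A quick numerical check of this trick: padding an episode with zero-reward steps (the terminal state looping back to itself) leaves the return unchanged. The rewards and the number of padding steps are made up, and `discounted_return` is a hypothetical helper.

```python
def discounted_return(rewards, gamma):
    """Discounted sum of a finite list of rewards."""
    return sum((gamma ** tau) * r for tau, r in enumerate(rewards))


episode_rewards = [0, 0, 1]          # made-up episodic task that ends after 3 steps
padded = episode_rewards + [0] * 50  # terminal state looping to itself with reward 0

gamma = 0.9
print(discounted_return(episode_rewards, gamma) == discounted_return(padded, gamma))  # True
```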

---

# 5. Value Function Introduction & The Bellman Equation
We are now going to go over a very intuitive, graphical explanation of the Bellman equation.

### 5.0 Expected Values
The most important concept in understanding the Bellman equation is the **expected value**. This concept is strange to many people at first; here is why: suppose we have a coin that has heads and tails, where heads is a win, and tails is a loss. Numerically, we encode these as heads = 1 and tails = 0. Now suppose the probability of winning is 60%, i.e. P(win) = 60%. Our expected value is then:

#### $$0.6 * 1 + 0.4 * 0 = 0.6$$

Why is this weird? Because, in this case, the expected value is a value that you can never expect to get! You will never flip a 0.6!

### 5.01 So what is the point?
The point of an expected value is that it tells us the mean, or average, of a random variable. We may be gathering the heights of students in a classroom and find the average height; perhaps no student has exactly that height, yet it is still a useful statistic. Similarly, it doesn't matter that the coin flip will never yield 0.6. If we flip the coin 1000 times, we expect about 600 wins!

### 5.02 Expected Value - Mathematical Definition
We can define expected value to be:

#### $$E(X) = \sum_x p(x)x$$
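
The definition translates directly into code. This sketch reuses the coin from earlier (heads = 1 with probability 0.6, tails = 0 with probability 0.4); the function name `expected_value` is hypothetical.

```python
def expected_value(distribution):
    """E(X) = sum over x of p(x) * x, for a dict mapping outcome x -> p(x)."""
    return sum(p * x for x, p in distribution.items())


coin = {1: 0.6, 0: 0.4}  # heads = 1 (win), tails = 0 (loss)
print(expected_value(coin))  # 0.6
```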

### 5.03 Why are averages so important
Suppose we are playing the game of tic-tac-toe. Realistically, we can't perfectly predict what our opponent is going to do next. They can do any number of things, and sometimes we may end up winning, other times losing. Each time, this will leave us with a different total reward. Again, we can use a tree to represent this idea:

<img src="images/tic-tac-toe-tree.png" width="350">

You may also recall that trees are recursive. So, if we take the tree above and look at one of the child nodes of the root, that is also a tree! In other words, the child of any node in the tree is the root of another tree! This idea of recursiveness will become very important later on.

Now, every time we play this game we are going to take a different path down the tree. Each path has a different probability, but we can still have this concept of:

> _What is the weighted sum of total rewards I get with this tree, with these states and these probabilities._

And so, the value of this state is precisely the **average reward** I will accumulate in the future, by being in this state now. Note this is not the _exact reward_, because that can be different every time. We are saying that it is the _average_, or what we would get on average if we played from this state many times.

## 5.1 The Value Function
So, let's quickly recap what we know so far:

> * At each state $s$, we are going to get a reward $R$.
* The overall return, $G$, is the sum of all the rewards we get. 

However, games are random, so we need to be able to answer: "If we are in state $s$, what is the sum of rewards we will get in the future, on average?" To answer this we can say: 

#### $$V(s) = E( G \mid s)$$

In English this just means "The value at state $s$ is equal to the expected value of the overall return, $G$, given that we are in state $s$." As a note, anything to the left of the $\mid$ (given) symbol is random, while anything to the right is not random. This is called a **conditional expectation**.
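
One way to read $V(s) = E(G \mid s)$ is as the average of many sampled returns, all starting from the same state. The sketch below uses a made-up one-step game (return 1 with probability 0.6, otherwise 0); `sample_return` and `estimate_value` are hypothetical names.

```python
import random


def sample_return(rng):
    """Hypothetical one-step game: return 1 with probability 0.6, else 0."""
    return 1 if rng.random() < 0.6 else 0


def estimate_value(n_episodes, seed=0):
    """Approximate V(s) = E(G | s) by averaging sampled returns from state s."""
    rng = random.Random(seed)
    return sum(sample_return(rng) for _ in range(n_episodes)) / n_episodes


print(estimate_value(10_000))  # close to 0.6, but generally not exactly 0.6
```

Each individual return is 0 or 1, yet the average settles near the expected value.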

## 5.2 Fundamental Concept
The next concept we discuss is one of the most fundamental concepts in RL. The idea is as follows:

We know that every game is going to consist of a series of states and rewards:

<img src="images/rewards-states.png">

Let's pretend for a moment that everything is deterministic, hence the reward we would get is the only reward we could possibly get at any state. This lets us drop the expected value symbol for now. You may recall, the expected value of a value that is not random, is just the value itself. I.e., the expected value of 3 is 3: $E(3) = 3$. 

So, because the value of a state is just the sum of all future rewards (if they are deterministic), we can say that:

#### $$V(s_1) = r_2 + r_3 + r_4 + ... + r_N$$
#### $$V(s_2) = r_3 + r_4 + ... + r_N$$

But, we can now see a special relationship between the values of successive states! In particular, we can plug in the expression for $V(s_2)$ into the expression for $V(s_1)$:

#### $$V(s_1) = r_2 + V(s_2)$$

In other words, this is a **recursive** equation! 

### 5.2.1 Discounted Version
Of course, if you wanted to discount future rewards, that can be done too without any significant changes. We can say that:

#### $$V(s_1) = r_2 + \gamma r_3 + \gamma^2 r_4 + ...$$
#### $$V(s_2) = r_3 + \gamma r_4 + ... $$
#### $$V(s_1) = r_2 + \gamma V(s_2)$$

This recursiveness is going to be a _very_ important theme in this course. We know that this value function is going to be an expected value. And we know from our tic-tac-toe example that our job is to estimate this value. But we can see how it has this recursive structure. The value at this state is an estimate, and it depends on the value at another state, which is also an estimate. So, it is an estimate which itself depends on an estimate. This set of notebooks, in general, is all about how we optimize this array of estimates, which depend on each other, all at the same time.
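
For the deterministic case, the recursion $V(s_1) = r_2 + \gamma V(s_2)$ can be evaluated by working backwards from the end of the trajectory. This is a sketch with made-up rewards; `values_from_rewards` is a hypothetical helper, and it assumes the final state has value 0.

```python
def values_from_rewards(rewards, gamma):
    """Compute V(s_1), ..., V(s_N) for a deterministic trajectory.

    rewards[i] is the reward received on leaving state s_{i+1}
    (so rewards[0] is r_2); the final state is assumed to have value 0.
    """
    values = [0.0] * (len(rewards) + 1)
    # Work backwards: V(s_i) = r_{i+1} + gamma * V(s_{i+1})
    for i in reversed(range(len(rewards))):
        values[i] = rewards[i] + gamma * values[i + 1]
    return values


print(values_from_rewards([0, 0, 1], gamma=1.0))  # [1.0, 1.0, 1.0, 0.0]
```

Each value is computed from the one after it, which is exactly the "estimate that depends on an estimate" structure described above.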

## 5.3 General Terms
So, what can we do now that we have this relationship between the value at state $s$ and the value at state $s'$? (Notice that $s_1$ and $s_2$ have been replaced with $s$ and $s'$ just to be more general.) So:

> **$s$ = current state**<br>
**$s'$ = next state**

We can call the reward $r$, but it can also be denoted as $R(s, s')$ since it is technically the reward we get going from $s$ to $s'$. 

> **$r$ = $R(s, s')$ = reward**

In any case, remember that we said that the rewards and state transitions can all be probabilistic. So, in order to denote that, we just put the expected value symbol back in our equation:

#### $$V(s) = E \Big[ r + \gamma V(s')\Big]$$

This is the essence of the Bellman Equation!
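
As a sketch of what the expectation does, the snippet below averages $r + \gamma V(s')$ over a handful of possible outcomes. The outcome probabilities, rewards, state names, and current value estimates are all made up, and `bellman_backup` is a hypothetical name.

```python
def bellman_backup(outcomes, values, gamma):
    """One-step estimate of V(s) = E[r + gamma * V(s')].

    outcomes: list of (probability, reward, next_state) tuples for state s,
              already averaged over whatever policy is being followed.
    values:   dict mapping next_state -> current estimate of V(next_state).
    """
    return sum(p * (r + gamma * values[s_next]) for p, r, s_next in outcomes)


# Made-up numbers for illustration.
values = {"win": 0.0, "mid": 0.5}
outcomes = [(0.5, 1.0, "win"), (0.5, 0.0, "mid")]
print(bellman_backup(outcomes, values, gamma=0.9))
```

Here half the time we collect a reward of 1 and stop, and half the time we collect nothing but land in a state currently valued at 0.5, so the result is $0.5 \cdot 1 + 0.5 \cdot (0.9 \cdot 0.5) = 0.725$, up to floating-point rounding.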

### 5.3.1 Expansion
Since we can express $V(s)$ as an expected value, we can even expand it out:

#### $$V(s) = E \Big[ r + \gamma E[r' + \gamma V(s'')]\Big]$$
#### $$V(s) = E \Big[ r + \gamma E\big[r' + \gamma E[r''+...]\big]\Big]$$

What we have done above is just expand the recursion. This is mainly done for visual purposes, and we won't actually do this in any of our algorithms. 

### 5.3.2 Adding Details back in 
For simplicity's sake, the conditional was dropped (among other things). We can put that back in easily:

#### $$V(s) = E \Big[ r + \gamma V(s') \mid s\Big]$$

### 5.3.3 Extension
A useful extension of the value function $V$, which depends only on $s$, is another value function, $Q$. $Q$ depends not only on $s$, but also on the action $a$. We call $V(s)$ the state-value function, and $Q(s,a)$ the action-value function:

> **V(s) = state-value function**<br>
What is my expected future return, now that I am in state $s$?<br>
<br>
**Q(s,a) = action-value function**<br>
What is my expected future return, given that I am now in state $s$ and I take action $a$? 

Hence, $Q$ provides a way of incorporating more data into the prediction, and provides more granularity. Since $V$ doesn't depend on $a$, it must somehow account for the choice of action; we just don't know how yet. We will discuss this more in a later section!

---

# 6. Value Function Derivation