<h1>Math</h1>

In this section we are going to introduce the next notebook on Reinforcement Learning, which covers Markov Decision Processes

---

<h3>Markov Decision Processes (MDPs)</h3>

We usually abbreviate this as MDP 

The MDP is probably the most fundamental concept we'll learn in the next few notebooks because it's the framework from which everything else is derived

Any algorithm we learn about, whether it's Q-learning, Deep Q-learning or any advanced version of Deep Q-learning, is derived from the MDP framework

In other words, if we want to do anything in RL, we have to know about MDPs

<img src='extras/54.1.PNG' width='500'></img>

---

<h3>Notebook Outline</h3>

So let's outline what we will learn in this notebook

First, just so what we're doing isn't entirely theoretical, we are going to discuss the Gridworld environment

The Gridworld environment is very useful because it helps us visualize what's going on in reinforcement learning, in particular it makes the states and their transitions visible

Now we might object to that and say, well, Super Mario is obviously visible because in order to play Super Mario we have to look at the screen

Of course that's not the kind of visible we are talking about

<img src='extras/54.2.PNG' width='300'></img>

---

<h3>The Markov Property/Markov Models</h3>

The next step will be to discuss the Markov Property and Markov Models

As we can tell from the name, this is obviously important in learning about MDPs

We might be familiar with Markov models from other fields since they are widely applicable

For example, Google's PageRank algorithm was derived based on Markov models

Hidden Markov models are an important tool for speech recognition and genomics, which is the study of DNA sequences

Markov Chain Monte Carlo or MCMC is a popular statistical method that has become widely used due to its applicability in Bayesian machine learning 

Markov models are used in finance

One immediate application is the random walk hypothesis

Finally, Markov models are useful in control systems, which is traditionally a subject of study under electrical or mechanical engineering rather than computer science

What's very interesting, however, is that these two fields converge once we introduce uncertainty

So if we took a course on stochastic control systems we would also learn about MDPs

---

<h3>Building on MDPs</h3>

After we discuss the basics of what makes up an MDP, we can start defining some new terms and some new ideas which are built on top of MDPs that allow us to build intelligent agents

Specifically, we're going to introduce the concept of the return, which is the sum of future rewards, and the value function, which is the expected sum of future rewards given a state

<img src='extras/54.3.PNG' width='400'></img>

We'll look at the very important Bellman equation, which is what allows us to solve for the value function and to create an agent that behaves optimally

In fact, if we had to summarize the next few notebooks in a single sentence, we could say that they are all about how to solve for the value function

Obviously this doesn't make sense right now but it will by the end of these notebooks

---

<h3>After this notebook</h3>

After this notebook, some of the subsequent notebooks will be devoted to specific solution methods for the Bellman equation

What this means for us is that this notebook is entirely theoretical, so be prepared to think mathematically for some time

This is necessary to build up the prerequisites for the subsequent notebooks, where we discuss actual algorithms and put those algorithms into code

Once we learn the material we should have some design in our minds for how we will implement the algorithms in code

---

<h3>All data is the same</h3>

The last thing we want to do in this section is remind ourselves of the rule

$$\text{All data is the same}$$

This is a rule that beginners often forget

We have to separate the algorithm and the method from the application 

To give one of our favorite examples, consider linear regression 

Linear regression is used in many fields such as biology and finance

But do we ever ask whether there is a different kind of linear regression for biology than there is for finance?

Obviously not, the same linear regression algorithm can be applied no matter what industry we are in

Similarly, since the following notebooks are about reinforcement learning algorithms such as Q-learning, it doesn't make sense to ask how we use Q-learning in biology or how we use Q-learning in finance

The first step is to learn Q-learning in the first place, and after that, whatever our particular specialty is, it is our job to get the right data in order to apply those algorithms

In order to apply reinforcement learning to our particular field, we have to be a subject matter expert in that field

Whether that's finance, biology, economics, online advertising, power systems or whatever else

Keep in mind that this is our responsibility

The important thing to remember is that the Q-learning algorithm and the linear regression algorithm do not care what our data is, those algorithms remain the same

---

<h3>Real-World Example</h3>

To give ourselves a real-world example of this, consider that Google engineers use reinforcement learning both for optimizing their data centers and for YouTube recommendations

Now let's ask, where did those Google engineers learn how to do that?

Did they take a course called "How to use RL for optimizing your data center" or "How to use RL to make YouTube recommendations"?

Obviously the answer is no

They all read the same RL books and took the same RL courses

It was their job to figure out how to apply RL to their particular subfield

So we want to keep that in mind, especially if we are beginners and don't yet have much practice thinking abstractly

<h1>Math</h1>

In this section, we are going to describe the environment we'll be working with for the majority of the notebook, which is called Gridworld

We're also going to take this opportunity to define a few more essential terms

---

<h3>Gridworld</h3>

So what is Gridworld? 

Gridworld is the simplest environment we can use in reinforcement learning that allows us to think about and understand all of the concepts we're going to discuss

If it were smaller it might not get the points across in a way that makes sense, and if it were larger it would be unnecessarily complicated

So Gridworld is the perfect size environment to learn about RL

We can imagine Gridworld as a very simple video game

Hopefully we've played video games like Pacman ourselves in the past

---

So what does Gridworld look like?

Gridworld is a $3 \times 4$ grid that our agent lives on

<img src='extras/54.4.PNG' width='800'></img>

Our agent starts the game at the bottom left corner

Our agent can move up, down, left or right, one square at a time

The goal of the agent, although we use that term loosely, is to get to the top right corner which is the winning state

When the agent arrives in this state it receives a reward of $+1$

Just underneath the winning state, there is the losing state

The agent wants to avoid that state, or in other words it wants to avoid going to that location

When the agent arrives in this state it receives a reward of $-1$

We use the terms winning and losing loosely as well

Remember that in reinforcement learning the agent doesn't have any concept of whether it's won or it's lost, it only receives a reward

The only objective of the agent is to maximize the reward

Thus, while the agent does not know what it means to win or lose, it does know, hey, I want to go to the top right box because that's where I get plus one reward

---

<h3>Gridworld states</h3>

An important aspect of Gridworld is that the states are very easily defined

The state is simply the position of the agent

<img src='extras/54.5.PNG' width='800'></img>

So for example the tuple (2,0) represents the initial state of the agent

The winning state would be represented by the tuple (0,3) and the losing state would be represented by the tuple (1,3)

Note that there is a wall at the location (1,1) which represents a state that the agent cannot occupy
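As a minimal sketch, the layout described above could be written down in Python as follows. The names `ROWS`, `COLS`, `START`, `WALL`, `TERMINAL_REWARDS` and `all_states` are our own choices for illustration and do not come from any library

```python
# Illustrative Gridworld layout; all names here are our own assumptions.
ROWS, COLS = 3, 4

START = (2, 0)                   # bottom-left corner
WALL = (1, 1)                    # the agent can never occupy this cell
TERMINAL_REWARDS = {
    (0, 3): +1,                  # winning state
    (1, 3): -1,                  # losing state
}

# Every (row, col) position except the wall is a valid state.
all_states = [
    (i, j)
    for i in range(ROWS)
    for j in range(COLS)
    if (i, j) != WALL
]

print(len(all_states))  # 11
```

The $3 \times 4$ grid has 12 cells, and since the wall cannot be occupied, 11 of them are positions the agent can actually be in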

---

<h3>Super Mario States Example</h3>

Earlier we said that although a video game like Super Mario is a game that we can see, so that the state is quote-unquote visible, this is not the same as saying the states are visible as in Gridworld

If that wasn't clear when we said it, let's think harder about why that is

In Super Mario, which obviously we can see because it's displayed on our TV screen, the state is actually very complex

One way of recording the state would be to simply take an image of the screen

In fact some algorithms use only screen recordings to train an agent

<img src='extras/54.6.PNG' width='400'></img>

This was how deep Q-learning became popular because researchers were able to train agents to play Atari games using only screen recordings 

If we represent each state of the game as an image of the screen, then we would get an array of size $256 \times 224 \times 3$

This gives us about $172{,}000$ numbers to represent the state, which is much larger than in Gridworld, where each of the $12$ positions can be represented by only two numbers, the vertical coordinate and the horizontal coordinate

Another way to represent the state of the game would be if we had privileged information about the game program

For example the position of Mario, the position of all the enemies, the position of other obstacles and so forth

Of course each of these positions could be stored as a pair of coordinates, ranging from $0$ to $255$ and from $0$ to $223$, but a more likely scenario is that we use one-hot encoding

So again our state is a matrix of size $256 \times 224$ and that's just for the information about what's on the screen at that current moment

We might have other information such as the velocity of various objects, how many lives we have left, how much time we have left before the level is over and so forth

Hopefully we're getting the idea

Unlike with Gridworld, we can't see any of these thousands of numbers

Finally, note that we call the set of all possible states the state space 

For a screenshot, the state space would have a size of approximately $256^{172{,}000}$ (assuming uint8 values), which is pretty much infinite for all intents and purposes
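To get a feel for these numbers, here is a small arithmetic sketch. It assumes, as in the text, a $256 \times 224$ screen with 3 colour channels and one uint8 value per entry. The variable names are ours

```python
import math

# Screenshot dimensions from the text: 256 x 224 pixels, 3 colour channels.
n_values = 256 * 224 * 3
print(n_values)  # 172032

# Each value is a uint8, so it can take 256 different values.
# The state space size is 256 ** n_values; we count its decimal digits
# with logarithms rather than building the huge integer.
n_digits = math.floor(n_values * math.log10(256)) + 1
print(n_digits)  # several hundred thousand digits
```

For comparison, the number of atoms in the observable universe is usually quoted as a number with roughly 80 digits, so enumerating this state space is out of the question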

---

<h3>More Terminology</h3>

So now that we understand Gridworld, let's look at more terminology

The point of reinforcement learning, if we stare really hard at its name, is that we're going to try to reinforce the desired behavior of the agent, and this is what we call the learning process

This implies that the agent will not just play our Gridworld game once, but rather it will play the game many times

By playing the game many times, it gains experience, and we hope that by the end of it the agent will have learned how to maximize its reward using that experience

Now we know that this probably seems obvious, but in fact if we think about it it's not

It may seem obvious because it's very analogous to how humans and animals learn in the real world

But remember computer code is inhuman

And so these concepts are not just things we can take for granted 

Each time our agent plays a game in Gridworld it will end up either in the winning state or the losing state and then that game is over

Instead of calling this a game which is an ambiguous term and can be used in many contexts we're going to call this an episode 

Thus an agent will play many episodes in order to gain experience

How do we know when an episode is over?

We said earlier that the episode ends when the agent lands in either the winning state or the losing state

But let's define this more precisely

In fact, just because we land in a state and get +1 reward or -1 reward, this does not automatically imply that the episode is over

It could be that we get -1 reward in every state no matter the state

So it's important not to get our previous intuition mixed up with the mathematical definitions

Getting a reward is just getting a reward

It says nothing about whether an episode is over or not

What does define whether an episode is over is whether we've reached a terminal state 

A terminal state is any state such that when we land in it the episode is over

In a video game, an example would be losing 100% of our health and losing a life, or falling off a cliff

In chess, this would be any state in which our king or our opponent's king is checkmated
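The sketch below shows one way an episode loop could be written so that termination is decided by the state and not by the reward. It is only an illustration: the helper names (`step`, `is_terminal`, `play_episode`) are ours, the agent acts at random, and we assume that a move into the wall or off the grid leaves the agent where it is

```python
import random

# Illustrative Gridworld constants (our own naming).
ROWS, COLS = 3, 4
WALL = (1, 1)
TERMINAL_REWARDS = {(0, 3): +1, (1, 3): -1}
MOVES = {'U': (-1, 0), 'D': (1, 0), 'L': (0, -1), 'R': (0, 1)}

def step(state, action):
    """Return (next_state, reward). Assumption: illegal moves leave the agent in place."""
    di, dj = MOVES[action]
    nxt = (state[0] + di, state[1] + dj)
    if not (0 <= nxt[0] < ROWS and 0 <= nxt[1] < COLS) or nxt == WALL:
        nxt = state
    return nxt, TERMINAL_REWARDS.get(nxt, 0)

def is_terminal(state):
    # The episode ends because the state is terminal, not because of the reward.
    return state in TERMINAL_REWARDS

def play_episode(start=(2, 0), max_steps=10_000, seed=0):
    """Play one episode with uniformly random actions."""
    rng = random.Random(seed)
    state, total_reward, n_steps = start, 0, 0
    while not is_terminal(state) and n_steps < max_steps:
        action = rng.choice(list(MOVES))
        state, reward = step(state, action)
        total_reward += reward
        n_steps += 1
    return state, total_reward, n_steps

final_state, total_reward, n_steps = play_episode()
print(final_state, total_reward, n_steps)
```

Note that `is_terminal` never looks at the reward, which mirrors the point that getting a reward says nothing about whether an episode is over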

All right, so we just learned three new terms

Episode, terminal state and state space

---

<h3>Episodic Tasks</h3>

So Gridworld and Super Mario are examples of episodic tasks because there are clearly defined episodes

On the other hand, we can have tasks which are non-episodic, also called continuing tasks, because they can just keep going on forever

One example of that is controlling the temperature of a room

There is no meaningful end to this task

Please note that the difference between these is actually not always well-defined

If we think about it a little more we realize that there are some subtleties which can make this difficult

However this will not be an important distinction in this notebook

For most reinforcement learning algorithms we will assume that we are learning an episodic task

---

<h3>Environment</h3>

Another term we want to more clearly define is environment

We've been using this term somewhat informally so far but in fact this is an actual term we use in reinforcement learning 

<img src='extras/54.1.PNG' width='400'></img>

The environment describes the world that the agent lives in 

So Gridworld is an example of an environment

Super Mario is an example of an environment

Pong is an example of an environment and Chess is an example of an environment 

To say we've solved that environment means that we're able to build an agent that achieves some reward that surpasses a threshold that we've set 

For example, we could say if we build an agent that gets over 20 points in Pong then we'll consider it solved

Obviously one environment we are very interested in is the real physical world itself

Of course it's not even clear what the reward is in this environment

Currently we're focused on building robots that can walk, run and drive cars

---

<h3>Policy</h3>

The final concept we want to talk about in this section is the policy

Again this is a concept we've discussed informally so far without naming it

The policy is a kind of function, loosely speaking, that maps the state to an action

In other words, we can think of it as the agent's brain

Below we demonstrate a policy that an agent might want to use to maximize its reward in Gridworld

<img src='extras/54.7.PNG' width='300'></img>

The important part of this is it always leads to the winning state

So if we're to the left of the winning state then obviously the correct action would be to go right

If we're to the left of the losing state then obviously the correct action cannot be to go right

We'll notice that from the initial state there are actually two equally good options at least knowing what we know so far

Specifically we could go up, up, right, right, right, and this would lead to the winning state

Or we could go right, right, up, up, right, and this would also lead to the winning state in the same number of steps
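As a quick check, we can trace both routes from the initial state through the grid. This is only a sketch: the `follow` helper and the `MOVES` table are our own, and we assume that an illegal move (into the wall or off the grid) leaves the agent in place

```python
# Illustrative check that both action sequences reach the winning state.
ROWS, COLS = 3, 4
WALL = (1, 1)
MOVES = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}

def follow(start, actions):
    """Apply the actions in order and return every state visited."""
    path = [start]
    for a in actions:
        di, dj = MOVES[a]
        i, j = path[-1][0] + di, path[-1][1] + dj
        # Assumption: an illegal move leaves the agent where it is.
        if not (0 <= i < ROWS and 0 <= j < COLS) or (i, j) == WALL:
            i, j = path[-1]
        path.append((i, j))
    return path

path_a = follow((2, 0), ['up', 'up', 'right', 'right', 'right'])
path_b = follow((2, 0), ['right', 'right', 'up', 'up', 'right'])
print(path_a[-1], path_b[-1])  # (0, 3) (0, 3)
```

Both paths take five steps, and neither passes through the wall at (1,1) or the losing state at (1,3)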

---

<h3>Representing the Policy</h3>

One point of confusion is how to represent the policy

Is that a function we write in a computer program or is it an equation or is it a neural network

In fact it can be all of these things

The policy can be deterministic or probabilistic


In this notebook we'll look at the most general case where the policy is probabilistic

If we have a deterministic policy we could represent it using

$$\large \text{Deterministic : } a = \pi(s)$$

Here $\pi$ is a function that takes in a state $s$ and returns some action $a$

If we have a probabilistic policy we could represent it using

$$\large \text{Probabilistic : } \pi(a\vert s)$$

Here $\pi$ is a probability distribution over the action $a$ given a state $s$

---

<h3>Policy Examples</h3>

To give ourselves an example of these suppose that we are playing Gridworld and we would like to know what should we do when we're in state (0,0)?

Well we can pass state (0,0) into a dictionary so (0,0) is a key in the dictionary and the corresponding value is which action to perform 

So that's how we might do it in code

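```python
# Illustrative policy table (our own example values): state (row, col) -> action
policy_dict = {(2, 0): 'U', (1, 0): 'U', (0, 0): 'R'}

def deterministic_policy(s):
    # Look up the action stored for state s
    return policy_dict[s]

# use it
action = deterministic_policy((0, 0))
```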

Note that this would be deterministic

What if we did something like epsilon-greedy instead?

In that case we might follow our dictionary 90% of the time, but 10% of the time we choose an action at random by sampling from our action space

Note that if we have four actions this gives 92.5% probability to the action stored in our dictionary (90% plus a quarter of the remaining 10%) and 2.5% to each of the other actions in the action space

<table>
    <tr>
        <td>Up</td>
        <td>Down</td>
        <td>Left</td>
        <td>Right</td>
    </tr>
    <tr>
        <td>0.925</td>
        <td>0.025</td>
        <td>0.025</td>
        <td>0.025</td>
    </tr>
</table>

And to be clear, the action space, analogous to the state space, is the set of all possible actions
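Here is a minimal sketch of the epsilon-greedy idea described above, with $\epsilon = 0.1$ and four actions. The function names and the `ACTIONS` list are our own, and we assume the random action is drawn uniformly from the whole action space (so it can coincide with the stored action)

```python
import random

ACTIONS = ['Up', 'Down', 'Left', 'Right']

def epsilon_greedy_probs(greedy_action, epsilon=0.1):
    """Probability of each action when we explore with probability epsilon."""
    probs = {a: epsilon / len(ACTIONS) for a in ACTIONS}
    probs[greedy_action] += 1 - epsilon
    return probs

def epsilon_greedy_action(greedy_action, epsilon=0.1, rng=random):
    # With probability epsilon pick uniformly from the whole action space
    # (which may pick the greedy action again); otherwise follow the dictionary.
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return greedy_action

probs = epsilon_greedy_probs('Up')
print(probs)  # 'Up' gets about 0.925, every other action about 0.025
```

The first function reproduces the probabilities in the table, while the second shows how we might actually sample an action during an episode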

---

<h3>Summary</h3>

At this point, we've learned about a few more terms: environment, policy and action space

To summarize, here's what we've learned in this section

First, we used Gridworld as a backdrop since it serves as a nice visualization of all the concepts

We learned about the episode, which is one round of a game from the initial state to a terminal state

We learned about terminal states, which are states that end the episode as soon as the agent lands in them

We learned about the environment, which represents the game we are playing

We learned about the policy, which is a function that acts as the agent's brain and tells it how to map states to actions

We learned about the state space, which is the set of all possible states, and the action space, which is the set of all possible actions

<h1>Math</h1>

In the previous section we talked about how our goal in training an agent is to get it to maximize its reward

---

<h3>How are rewards decided?</h3>

In Gridworld, we get a reward of +1 at the winning state and a reward of -1 at the losing state

But this brings up another great point, which is how are rewards decided?

In fact, rewards are engineered by the user, which is us

Our job is to design the reward in such a way that it results in the behavior that we want from our agent

One great example of this is trying to get an agent to solve a maze

Seems like a pretty trivial task, we get +1 reward if we find the maze exit and 0 otherwise 

What's the problem with this approach?

Well, what happens if our agent simply wanders around aimlessly and never solves the maze?

Since the agent only knows about the reward, all it can ever know is that no matter what I do I get zero reward, therefore it doesn't matter what I do

Somewhat depressingly we may know people who think the same way :o

Instead, a better approach may be to assign a reward of -1 for every state that the agent lands in

Now we might think that's weird

We might consider -1 to be losing, but remember the agent has no concept of what it means to win or lose, it sees only numbers

As such, a -1 reward encourages the agent to not wander around because if all it does is go in random directions it's going to continue to accumulate a negative reward

If it exits the maze faster, that reward will get closer to 0
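To make the difference concrete, here is a small arithmetic sketch comparing the two reward schemes for a fast episode and a slow one. The function `episode_return` is our own, it simply adds up the rewards without any discounting, and it assumes the episode ends on the step that reaches the exit

```python
def episode_return(n_steps, step_reward, exit_reward):
    """Total reward for an episode that exits the maze on its last step."""
    return step_reward * (n_steps - 1) + exit_reward

# Scheme 1: 0 everywhere, +1 at the exit. The total ignores episode length.
print(episode_return(10, step_reward=0, exit_reward=1))    # 1
print(episode_return(500, step_reward=0, exit_reward=1))   # 1

# Scheme 2: -1 for every step, including the one that exits.
print(episode_return(10, step_reward=-1, exit_reward=-1))   # -10
print(episode_return(500, step_reward=-1, exit_reward=-1))  # -500
```

Under the first scheme the fast and slow episodes look identical to the agent, while under the second scheme the faster exit is clearly better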

So it's important for us as the machine learning engineers to structure our rewards in a way that is conducive to the agent solving the environment

---

<h3>Agent is not subject to "real-life" rewards</h3>

Remember that when we're programming an agent the agent is not subject to the real life rules of the game

For example suppose we're creating a robot that can play basketball

By the way, this is totally outside our current capabilities, so just remember that this is a fictitious example, we can't yet create robots that can play basketball

But anyway, if we could, we would get three points if we score beyond the three-point line and two points if we score inside the three-point line

But what about things like stealing the ball?

What about stuffing a shot?

These are considered good behaviors even though the rules of the game award no points for them, so we may want to reward the agent for doing them

---

<h3>Walking Example</h3>

One interesting example of this is the task of walking, something many of us probably take for granted

When we teach humans how to walk, they usually end up walking like other humans

But what happens if we tell a robot, get from point A to point B, you'll get more reward the further you go

Well now the robot can end up doing things that look very strange in the eyes of a human, and perhaps that's not exactly the behavior we wanted

---

<h3>Novel Strategies</h3>

But it's important not to take things too far

One thing we learned from AlphaGo and AlphaZero is that they were able to come up with completely novel strategies

They were also able to rediscover well-known human strategies

As we now know, human strategies were actually suboptimal, and that's how the AI was able to win

If we had rewarded agents for performing human strategies, this would have been detrimental to discovering completely new and better strategies

The point of this is that we have to think hard about what it is that we actually want the agent to achieve

How can we assign rewards such that the agent will achieve the goal and not do something that we don't actually want it to do, while leaving it with enough flexibility to discover approaches that are better than what currently exists?

<h1>Math</h1>

We're going to start this section with a famous quote.

$$\text{All models are wrong but some are useful}$$

This is a quote by George Box and it's usually taught to students of statistics, but it's particularly relevant to this section because this section is all about the Markov property

---

<h3>The Markov Property</h3>

What we'll learn in this section is that the Markov property seems completely unrealistic, and despite that it underpins multiple subfields of machine learning including RL

In this way we can think of the Markov property like this, it's definitely wrong but it's also definitely useful

---

<h3>Language Models</h3>

We always like to start our discussions of the Markov property with language modeling

We can think of sentences as sequences of words 

We can think of sequences of words as simply sequences of states

So for the purpose of this example states are words 

A language model might sound complicated but in fact it simply means to build a probability model for sequences of words

Take the sequence, "The quick brown fox jumps over the lazy dog"

Let's suppose we're given this entire sequence except for the last word

And now we want to predict the next word

Well we look at our probability distribution

What is the distribution over the word at time 9 given that the words at times 1 to 8 are equal to "the quick brown fox jumps over the lazy"?

$$\large p(w_9 \vert w_{1\ldots 8} = \text{"the quick brown fox jumps over the lazy"})$$

Obviously building such a language model would require us to consider all possible sentences in the English language

---

<h3>Why is this hard?</h3>

Of course considering all possible sentences in the English language would be a monumental task 

Even with a limited vocabulary of just 20,000 words, consider that a sentence might be anywhere from one to one hundred words long, although the world record for longest English sentence is in the thousands

The number of possibilities is then $20{,}000^1 + 20{,}000^2 + 20{,}000^3 + \ldots + 20{,}000^{100}$

In other words completely intractable
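To see just how intractable, we can compute this sum exactly, since Python integers have arbitrary precision. The sketch assumes a vocabulary of 20,000 words and sentences of one to one hundred words, as in the text. The variable names are ours

```python
VOCAB_SIZE = 20_000
MAX_LENGTH = 100

# Python integers have arbitrary precision, so the sum can be computed exactly.
n_sentences = sum(VOCAB_SIZE ** k for k in range(1, MAX_LENGTH + 1))

# Count the decimal digits of the result.
print(len(str(n_sentences)))  # 431
```

So the number of possible word sequences has 431 digits, and that is before allowing sentences longer than one hundred words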

---

<h3>The Markov Property</h3>

What the Markov property says is, forget about any sequence longer than two

We call any sequence of two words a bigram

So for our sentence "The quick brown fox jumps over the lazy dog"

We would have to consider the probabilities $p(\text{quick} \vert \text{the})$, $p(\text{brown} \vert \text{quick})$ and so forth
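As a small illustration, the bigrams of this sentence can be extracted by pairing each word with the word that follows it. Lowercasing the words is our own simplification

```python
sentence = "The quick brown fox jumps over the lazy dog"
words = sentence.lower().split()

# Pair every word with the word that follows it.
bigrams = list(zip(words[:-1], words[1:]))

# Each bigram corresponds to one conditional probability we would need.
for prev_word, next_word in bigrams:
    print(f"p({next_word} | {prev_word})")
```

A sentence of nine words gives eight bigrams, so we only ever need probabilities over pairs of words rather than over whole sentences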

---

More generally, the Markov property can be stated like this

It says that even though the state at time $t$ could depend on the state at time $t-1$ all the way down to the state at time $1$, let's assume that, given the state at time $t-1$, the state at time $t$ is independent of any earlier state

$$\large p(s_t \vert s_{t-1},s_{t-2},\cdots,s_1) = p(s_t \vert s_{t-1})$$

In other words, let's assume that the state at time $t$ depends only on the state at time $t-1$

We call this the Markov assumption or the Markov property, or more accurately, the first-order Markov assumption

If we depend on two previous states we call that the second-order Markov assumption and so forth

Typically we don't consider anything beyond the first order assumption in most machine learning models
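For comparison, using the same notation as the first-order equation, the second-order Markov assumption would read

$$\large p(s_t \vert s_{t-1},s_{t-2},\cdots,s_1) = p(s_t \vert s_{t-1},s_{t-2})$$

In the language example this would correspond to conditioning on the two previous words instead of one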

<h3>How well does it work?</h3>

Now of course this assumption is somewhat restrictive 

Given only the word 'lazy', what might the next word be?

Sure, we could say it might be 'dog', because we already mentioned the entire sentence, but using that information is cheating

It could just as easily be 'lazy programmer', 'lazy uncle' or 'lazy students'

Consider an extreme example

What if our conditioning word is 'a' or 'the'?

Now it's nearly impossible to guess the next word in a sentence with any accuracy

---

<h3>How well does it work?</h3>

In reality the Markov assumption is not as bad as it seems

Markov models are the basis of hidden Markov models, which for a long time were the state of the art in speech recognition

In addition, consider the fact that what we consider the state is completely up to us

An example of that comes from deep Q-learning

In deep Q-learning, which was applied to Atari games, we said earlier that the authors used screenshots from the game

Of course a single static screenshot cannot tell us much about the game and it would be very difficult to decipher anything about the speed or direction of any of the objects in the game from only a single frame

Instead what the authors did was they took four consecutive screenshots and considered that to be the state 

<img src='extras/54.8.PNG' width='300'></img>

So the lesson here is that, what we consider the state is entirely up to us

The state does not necessarily have to be the observation at a single point in time
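A minimal sketch of this frame-stacking idea is shown below. It is not the authors' implementation: the `FrameStack` class is our own, frames are stand-in strings rather than images, and we assume the first frame is repeated to fill the stack at the start of an episode

```python
from collections import deque

N_FRAMES = 4  # the text describes four consecutive screenshots

class FrameStack:
    """Keeps the most recent N_FRAMES observations and exposes them as one state."""

    def __init__(self, first_frame):
        # Assumption: at the start of an episode, repeat the first frame to fill the stack.
        self.frames = deque([first_frame] * N_FRAMES, maxlen=N_FRAMES)

    def push(self, frame):
        # Appending to a full deque drops the oldest frame automatically.
        self.frames.append(frame)

    def state(self):
        return tuple(self.frames)

stack = FrameStack('frame0')
for name in ['frame1', 'frame2']:
    stack.push(name)
print(stack.state())  # ('frame0', 'frame0', 'frame1', 'frame2')
```

The state handed to the agent is the whole tuple, so information about motion between frames becomes part of the state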

---

<h3>State Transition Matrix</h3>

One of the key ingredients of the Markov model is the state transition matrix

The reason it's a matrix is because in order to express the probability of going from one state to another we need to have two inputs, the current state and the next state, or optionally we can think of this as the previous state and the current state

Either way there are two inputs so there are two dimensions

$A_{ij}$ is the probability of going to state $j$ from state $i$ 

$$\large A_{ij} = p(s_t = j \vert s_{t-1} = i)$$

Notice that there is no time index on $A$

This is because we assume that this relationship holds for all time 

Using maximum likelihood, we know that this probability can be estimated by counting the number of times we were in state $i$ and transitioned to state $j$, and dividing that by the number of times we were in state $i$

$$\large A_{ij} = p(s_t = j \vert s_{t-1} = i) \approx \frac{\text{count}(i \rightarrow j)}{\text{count}(i)}$$
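The counting estimate can be sketched in a few lines. The function name `estimate_transitions` and the toy sequence are our own, and we store $A$ as a nested dictionary rather than a matrix

```python
from collections import Counter

def estimate_transitions(sequence):
    """Estimate A[i][j] = count(i -> j) / count(i) from one observed state sequence."""
    pair_counts = Counter(zip(sequence[:-1], sequence[1:]))
    # count(i) only counts visits that were followed by a transition,
    # so the final state of the sequence is excluded.
    from_counts = Counter(sequence[:-1])
    A = {}
    for (i, j), c in pair_counts.items():
        A.setdefault(i, {})[j] = c / from_counts[i]
    return A

# Hypothetical toy sequence over two states, 0 and 1.
seq = [0, 0, 1, 0, 1, 1, 0]
A = estimate_transitions(seq)
print(A[0])  # state 0 goes to 0 one third of the time and to 1 two thirds of the time
```

Each row of the estimate sums to 1, as it must for a distribution over the next state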

Although in typical Markov model settings we are interested in measuring this probability, we'll see that this is not necessary for the algorithms we discuss later in the following notebooks

The reason we wanted to mention the state transition matrix in this section is because we have something similar in MDPs but with more ingredients

<h1>Math</h1>

In this section we are going to finally discuss the Markov Decision Process or MDP

---

<h3>Markov Decision Processes (MDPs)</h3>

This is mostly just a formality because at this point we've already discussed many of the components that make up an MDP

After this section we will continue to discuss MDPs in terms of the specific quantities we're interested in and how we can use them to solve a task

We're going to start from what might seem like a bit of a hairy definition but we'll see that as we break down this definition in this section it will begin to make a lot of sense

So in a single sentence we can say that an MDP is a discrete-time stochastic control process

We'll discuss more about what all these words mean later in this section, but for now let's continue talking about what makes up an MDP

---

So in essence an MDP is made up of the ingredients that we've already been discussing

<img src='extras/54.9.PNG' width='600'></img>

First we have an Agent and an environment

These represent two distinct entities in an MDP and we've already seen several examples

The agent is the brain or the controller

This is the part that we as machine learning engineers are trying to build in order to accomplish something useful in the environment 

The environment is the surrounding world

This can be as simple as Gridworld or tic-tac-toe or chess but it can also be the physical world or a simulation of the physical world

One important task for reinforcement learning is locomotion, that is learning to walk, run and generally maneuver around

So that's an example of the environment being the real physical world 

Between these two entities, the agent and the environment, we have some signals that pass between them

The agent takes actions in the environment which of course may have an effect on the environment and that subsequently brings the agent to a new state 

For example, if we're playing tic-tac-toe and the board is empty, that's one state

If our action is to put an X down in the middle that brings us to another state 

In return the environment returns some information to the agent 

In particular the environment gives the agent both the next state and the corresponding reward

Remember the reward is just a number, we shouldn't attach any connotations to this word

It's not a reward in the sense of, here's a cookie

It's just a number

It can be positive, negative or zero

---

<h3>Time Indices</h3>

One important thing to get right in reinforcement learning is our time indices

Let's focus on the sequence of events that will happen in a single step of an episode

<img src='extras/54.1.PNG' width='500'></img>

Suppose that an agent reads the state at the current time $t$, we call that $S_t$

The agent uses its policy $\pi(a \vert s)$ to choose which action to perform based on the state it's currently in, we call that $A_t$

Notice that they have the same time index 

The action $A_t$ gets transmitted to the environment

Which brings us to the next time step

So the increment in the time step is implicit in the environment 

After performing the action $A_t$ in the environment, we arrive in the next state $S_{t+1}$ and we receive the reward $R_{t+1}$

Altogether, this forms a tuple which we might think of as a single transition or a single step of an MDP

The tuple is made up of 

$$\large \text{Transition/Step: } \{S_t,A_t,R_{t+1},S_{t+1}\}$$
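The sketch below shows how the time indices line up inside an agent-environment loop. Everything here is illustrative: `Transition`, `collect_episode` and the one-dimensional corridor environment are our own inventions, chosen only to keep the example short

```python
from collections import namedtuple

# One step of an MDP: (S_t, A_t, R_{t+1}, S_{t+1}).
Transition = namedtuple('Transition', ['state', 'action', 'reward', 'next_state'])

def collect_episode(env_step, policy, start_state, is_terminal, max_steps=100):
    """Roll out one episode and record a Transition for every time step t."""
    transitions = []
    s_t = start_state
    for t in range(max_steps):
        if is_terminal(s_t):
            break
        a_t = policy(s_t)                    # A_t is chosen from S_t
        s_next, r_next = env_step(s_t, a_t)  # environment returns S_{t+1}, R_{t+1}
        transitions.append(Transition(s_t, a_t, r_next, s_next))
        s_t = s_next                         # time advances inside the environment step
    return transitions

# Hypothetical corridor: states 0..3, state 3 is terminal and entering it gives +1.
def corridor_step(s, a):
    s_next = min(3, max(0, s + a))
    return s_next, (1 if s_next == 3 else 0)

episode = collect_episode(corridor_step, policy=lambda s: +1,
                          start_state=0, is_terminal=lambda s: s == 3)
print(episode[-1])  # Transition(state=2, action=1, reward=1, next_state=3)
```

Notice that the reward stored in each tuple is the one received on arriving in `next_state`, which is why it carries the index $t+1$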

---

<h3>More Examples</h3>

There are several other ways to make up this tuple including adding the next action 

So we can say 

$$\large\{S_t,A_t,R_{t+1},S_{t+1},A_{t+1}\}$$

We'll see later that this is involved in one of the control algorithms we will study

Yet another method of writing this down as a tuple involves dropping the time indices entirely

We simply write 

$$\large \{s,a,r,s^\prime\}$$

Note that in this notation, the prime symbol does not indicate time $t+1$, since the reward $r$ also occurs at time $t+1$ but does not have a prime symbol; it simply marks the next state

<h3>Everything is Probability</h3>

As we mentioned earlier, the language of reinforcement learning is probability

So having good intuition of probability is important especially now

Both entities in an MDP, the agent and the environment are represented by probability distributions 

<img src='extras/54.10.PNG' width='600'></img>


As we've seen before, the most general form of representing the policy of an agent is through the distribution 

$$\large \pi(a \vert s)$$

This is the most general form because it can still represent deterministic policies: we just give the chosen action probability 1 and the rest 0
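
Here is a small sketch of a policy stored as a table of probabilities, assuming a single state with two actions. The state name `"s0"`, the action names, and the helper `sample_action` are made up for illustration.

```python
import random

# Hypothetical policy tables: pi[state][action] = pi(a | s)
pi_stochastic = {"s0": {"up": 0.7, "right": 0.3}}
pi_deterministic = {"s0": {"up": 1.0, "right": 0.0}}   # all probability on one action

def sample_action(pi, state):
    """Draw an action a with probability pi(a | state)."""
    actions = list(pi[state].keys())
    probs = list(pi[state].values())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action(pi_stochastic, "s0"))      # "up" about 70% of the time
print(sample_action(pi_deterministic, "s0"))   # always "up"
```

The same sampling code handles both cases, which is the practical benefit of treating a deterministic policy as a special case of a distribution.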

The environment is also represented by a probability distribution 

$$\large p(s^\prime , r \vert s,a)$$

This is analogous to the plain Markov model where we have the state transition probability $p(s^\prime \vert s)$ 

Of course in MDPs, we have a few additional variables namely the action and the reward

This makes complete sense if we think about it 

In plain Markov models, the next state simply happens

So the distribution of $s^\prime$ is completely determined by the current state $s$

It's not controlled by an agent doing any actions 

But in an MDP, the next state $s^\prime$ depends both on the current state $s$ and the action taken $a$
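
To make the distribution $p(s^\prime, r \vert s, a)$ concrete, here is a sketch that stores it as a lookup table and samples from it. The states, actions, probabilities, and the helper `env_step` are all hypothetical.

```python
import random

# Hypothetical dynamics table: p[(s, a)] = {(s_next, r): p(s_next, r | s, a)}
p = {
    ("s0", "right"): {("s1", 0): 0.8, ("s0", 0): 0.2},   # stochastic: we may slip and stay
    ("s1", "right"): {("goal", 1): 1.0},                 # deterministic transition
}

def env_step(p, s, a):
    """Sample (s_next, r) from p(s_next, r | s, a)."""
    outcomes = list(p[(s, a)].keys())
    probs = list(p[(s, a)].values())
    return random.choices(outcomes, weights=probs, k=1)[0]

# Each conditional distribution must sum to 1
for dist in p.values():
    assert abs(sum(dist.values()) - 1.0) < 1e-12

print(env_step(p, "s0", "right"))   # ("s1", 0) about 80% of the time
print(env_step(p, "s1", "right"))   # always ("goal", 1)
```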

---

<h3>Verbose</h3>

Note that a more verbose way of writing the state transition probability is to write

$$\large p(s^\prime,r \vert s,a) \equiv p(S_{t+1} = s^\prime, R_{t+1} = r \vert S_t =s, A_t = a)$$

As usual we use capitalised letters to represent the random variable and lowercase letters to represent a specific realization

---

<h3>State Diagram</h3>

Sometimes people find it useful to think of MDPs in terms of state diagrams

<img src='extras/54.11.PNG' width='500'></img>

This might be useful if we have a small number of states and actions and we're using a toy example
to maybe derive an equation or invent a new solution

But generally speaking we don't find this very useful for actually doing work in RL

Of course if we like pictures we are encouraged to look at such pictures as much as we like 

Gridworld could be represented by a picture like this, but we think the actual picture of grid world communicates the exact same information so it's not really useful in this instance

Actually pictures like this are limited because each node is a state and each edge is a probability

But remember that in the transition from one state to another state there are actually two probabilities

One to choose the action and one to choose the next state given the action and previous state

So diagrams like this squash the two together and we can't really tell which is which

---

<h3>State-Action Diagram</h3>

A slightly better diagram shows both the states and actions as different kinds of nodes

And now each edge can show the probability of going to some action from some state and then to some state from some action

<img src='extras/54.12.PNG' width='500'></img>

However notice that we have to repeat the same action multiple times in this diagram

So these nodes do not truly represent the action they actually represent the action coming from a particular state

This is because the next state depends on both the action done and the state we came from

Unfortunately with such a large number of elements, diagrams like these become messy very easily 

Even this complicated looking diagram is an MDP with only three states and two actions

If we were to draw the full state diagram to show even transitions with zero probability it would be even messier

---

<h3>Why Probability</h3>

Let's now talk about why we represent the state transitions of the environment as $p(s^\prime,r \vert s,a)$

This is by the way what we can also refer to as the environment dynamics

In any case it seems like an overly generalized representation of real world environments

We might ask in what environment would the reward be stochastic

If we arrive in a state surely we will receive the same reward each time we arrive in that state 

For example, if we win a chess game or a tic-tac-toe game, the environment is probably built in such a way that we won't get +10 reward in one episode and +20 reward in another: if we win, we win, and we get the winning reward

In fact for some environments that may be true

In that case we can represent the environment dynamics more simply by using the distribution $p(s^\prime \vert s,a)$

$$\large \text{Reward is random : } p(s^\prime, r \vert s,a)$$

$$\large \text{Reward is deterministic : } p(s^\prime \vert s,a)$$

The reward $r$ does not appear in this distribution since it is deterministic

Even the next state $s^\prime$ could be deterministic

For example imagine we're playing Gridworld or Pacman, we would expect that if we press the up button that we go up and if we press the down button we go down

Generally speaking that's how video games work

However even though this is true we would still like to represent our environment dynamics using the distribution $p(s^\prime \vert s,a)$ since that's how everything else we do with MDPs will be derived

Without this distribution, we can't really do anything else

And remember it's perfectly fine to have a distribution to represent a deterministic process

We simply have some events that get probability 1 and other events that get probability zero

---

We should also keep in mind another reason for representing the environment dynamics probabilistically

It's that we may not have perfect information about the environment

Remember that the state is simply a reading from our computer sensors

Think of something like a self-driving car which uses video cameras to measure its state 

Well, what if there are some objects in the surrounding area that are occluded by other objects

In that case our camera won't pick them up but they might have an effect on what happens in the environment

Another great example is online advertising

We have to remember that a lot of the time people enter fake information into their profiles so they should not necessarily be trusted

Sometimes people share accounts so the behavior for some user might be different depending on who is actually using the account 

So our state that we measure may not precisely reflect the entire environment 

Using probability helps us quantify this uncertainty, even if the true nature of the environment is deterministic

---

<h3>Discrete-Time Stochastic Control Process</h3>

So let's break down the phrase discrete time stochastic control process

As we recall stochastic is just a fancy word for random

So there's no real difference between saying stochastic process and random process

An even simpler way to think of a stochastic process is to think of any sequence of random events that happens over time

For example, time series are often modeled as stochastic processes, and a specific example of that is the stock price over time

Another example of a stochastic process is modeling the number of customers that arrive at a grocery store or a website

For that we usually use a Poisson process

When we say discrete time we mean that each step of the game that we play is discrete

This is obviously the case for games like chess, Go, and tic-tac-toe

It's not so clear that it's the case for things like driving a car or controlling the temperature of a room

But keeping in mind that computers themselves are discrete time, we can really only take sample measurements from a continuous environment at specific intervals 

So effectively all of the environments we care to solve in RL will be discrete time 

It's clear from our ingredients that make up an MDP that they are discrete time

We always go from time $t$ to time $t+1$

We never have something like time $t+0.25$, for example

---

The final key component of this is control

To understand how this comes into play it's helpful to think of stochastic processes that are not controlled

A simple example of this is the temperature

If all we have is a thermometer, we're simply measuring the temperature and getting back a time series

We don't have any control over this, the temperature is what it is, it's completely determined by the environment

We might model the temperature over time as a Markov model 

On the other hand, we might be building a thermostat instead of a thermometer and a thermostat does try to control the temperature

So in this case there is an actor or an agent who influences the state 

As we've seen, we might model a system like this as a Markov decision process

<table>
    <tr>
        <td></td>
        <td>States are fully observed</td>
        <td>States are partially observed</td>
    </tr>
    <tr>
        <td>System is autonomous</td>
        <td>Markov Model (ex:Thermometer)</td>
        <td>Hidden Markov Model (HMM)</td>
    </tr>
    <tr>
        <td>System is controlled</td>
        <td>Markov Decision Process (MDP) (ex:Thermostat)</td>
        <td>Partially-Observable MDP (POMDP)</td>
    </tr>
    
</table>

In the top right quadrant, we have the case where there is no actor but the states are hidden, that is, unobserved

For this situation we use a hidden Markov model

In fact we can still apply hidden Markov models to temperature time series or stock price time series

The assumption in using such a model is that there is some hidden cause that we haven't observed that is causing the results that we do observe

So sometimes we say it's partially observed

We think of our observations as noisy measurements of some true underlying state that we never get to see

Finally we can add control to the Hidden Markov Model and that gives us partially observable Markov Decision Processes or POMDPs

Note that in most popular reinforcement learning applications including deep reinforcement learning we do not deal with POMDPs

So if we're interested in modern applications of deep reinforcement learning, then we'll be working with regular MDPs for the most part

<h1>Math</h1>

In this section we are going to get more specific about what our agent is trying to do

---

<h3>Rewards</h3>

If we look back to our previous discussions we'll realize that we spoke very generally about our agent trying to maximize its reward

But this is not specific enough

We know that in an MDP a reward is given at every time step

We denote the reward at time $t$ by $r_t$ 

Given that there will be a reward at every time step, it actually doesn't make sense to say that we want to maximize the reward since there will be $T$ rewards for any episode of length $T$

So what are we actually trying to do?

---

<h3>Sum of Rewards</h3>

In fact our true goal is to maximize the sum of rewards $r_1 + r_2 + \ldots + r_T$

Of course the thing we want to maximize is a scalar and one way to get a scalar from a set of numbers is to simply add them together

<img src='extras/54.13.PNG' width='700'></img>

---

<h3>Sum of Future Rewards</h3>

But we can be even more specific than this

Consider that our actions can have no effect on the past but rather can only affect the future

Thus our goal isn't just to maximize the sum of all rewards but rather the sum of all future rewards

<img src='extras/54.14.PNG' width='700'></img>

Of course if we are in the initial state of the game then this is equal to the sum of total rewards for that episode

---

<h3>The Return</h3>

We call the sum of future rewards the return and we use the letter $G$ to represent it 

Technically since each $R(t)$ is a random variable then $G$ is also a random variable since the sum of random variables is also a random variable

We can say that

$$\large G(t) = R(t+1) + R(t+2) + \ldots = \sum^\infty_{\tau=0}R(t+\tau+1)$$

At any given moment, the goal of our agent is to act such that it maximizes the return, or in other words, the sum of future rewards

It can't maximize the current reward or past rewards because it has already received them

---

<h3>Why is RL Powerful?</h3>

Believe it or not, this is the key feature that makes reinforcement learning so powerful

It's at this point we should take a step back to appreciate what reinforcement learning actually does

Consider a task such as playing chess or Go or balancing a pole on a cart

Suppose we tried to use an approach like supervised learning 

In fact supervised learning at first seems like it should be a viable solution 

For each input state X, simply tell the agent what the best action should be and call that the target Y 

Then train an agent on a data set of Xs and Ys as usual in supervised learning 

Of course, from this perspective our agent isn't really learning at all

It's simply memorising the state action pairs that we gave it

This is contingent on our state action pairs being good in the first place

In addition if all we do is store these state action pairs in say a database table then we really haven't learned anything we're just looking at moves in a dictionary

But let's consider more practical reasons why this would not work

In actuality it's not even feasible

The number of states in chess is something like $10^{50}$ and the number of states in Go is something like $10^{170}$, although there is lots of discussion online about how to calculate these

The point is we will never be able to create a dataset that covers even close to the entire state space

Furthermore this assumes that the human derived strategy is the best strategy which we know is obviously not true

---

<h3>Planning</h3>

What makes reinforcement learning so interesting is this 

In reinforcement learning, the reward signal we are looking for is often very far into the future, as in solving a maze or winning a video game stage

These are all tasks where we don't win until the very end and thus there is an element of planning
that has to take place

Of course we should not give our agent a plan because as mentioned previously we have to assume that our strategy is not a good strategy and that we want the agent to learn and not just copy 

The remarkable thing about reinforcement learning is that this planning is built-in 

By saying we want to maximize the sum of future rewards our agent will come up with whatever strategy is necessary to reach that goal

This strategy finding feature is automatic 

If it takes one million steps to reach the goal,
we don't have to create a dataset with one million data points to specify the correct action at each of those 1 million states which is obviously infeasible 

Often as we've seen, these strategies can look
very weird to us humans but these strategies are often objectively reasonable as measured by the reward

---

<h3>Discounting</h3>

An important addition to the return is discounting

We use a discount factor called $\gamma$ to down-weight rewards that are further into the future

$$\large G(t) = R(t+1) + \gamma R(t+2) + \gamma^2 R(t+3) + \ldots$$

$$\large G(t) = \sum^{\infty}_{\tau=0} \gamma^\tau R(t+\tau+1)$$
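
As a quick sketch of this formula for a finite list of rewards, the helper below (a hypothetical name, `discounted_return`) computes the weighted sum directly. The reward sequence is made up.

```python
def discounted_return(rewards, gamma):
    """rewards[k] plays the role of R(t+k+1); returns sum_k gamma**k * rewards[k]."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [0, 0, 1]   # hypothetical: a reward of 1 arrives three steps from now

print(discounted_return(rewards, 1.0))   # no discounting: the full reward of 1
print(discounted_return(rewards, 0.9))   # roughly 0.81, since the reward is weighted by 0.9**2
print(discounted_return(rewards, 0.0))   # only the immediate reward counts, which is 0 here
```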


Why might we want to discount future rewards?

One simple intuition is to think about finance

We know from finance that we have to account for inflation 

When we're trying to estimate the cost of a project, for example, we calculate that cost in terms of the net present value

We don't say if we have to buy a ten thousand dollar machine one year from now that its current value to us is ten thousand dollars

Instead we discount it usually by some inflation rate

It's the same reason we would rather receive ten thousand dollars now rather than ten thousand dollars 20 years from now

It's simply worth more now, under normal conditions

---

Another intuitive way to think about discounting is to remember that the environment dynamics are probabilistic

Whenever we are working with a stochastic process, we know that the further we look into the future the harder it is to predict

We might have a pretty good idea that it will rain tomorrow but do we know with any confidence whether it will rain 30 days from now

And so discounting allows us to say that the near future matters more because we actually know what's going to happen in the near future

We can't fully trust that someone will pay us back 100 dollars next year

---

<h3>Effect of $\gamma$</h3>

Let's consider what it means to have different values of $\gamma$ 

Suppose $\gamma$ equals zero: that means our return is equal to $R(t+1)$, which means we only care about the reward we get in the next step and we don't care about anything further into the future

This can be thought of as greedy or shortsighted

Now suppose that gamma equals one, that says don't discount the future at all

In fact this is closest to what we truly care about because remember we want to maximize the sum of rewards not the sum of discounted rewards 

Although our algorithms use a discounted return, we still want to see our agent achieve the high score in a game and that high score is obviously not going to be discounted

So there's a mismatch between what our algorithms care about and what we really want the agent to achieve

That is to say the real purpose of discounting rewards is that it's a helpful tool to improve the training process

It doesn't necessarily mean that we want to maximize the discounted return 

If we have an environment that consists of very short episodic tasks, then it may not be worth discounting at all

The discount factor $\gamma$ is a hyperparameter that should be tuned by the user

Usually it's kept very close to one, like 0.97, 0.99, 0.999 and so forth

An experiment worth doing would be to test an algorithm using different discount factors and then plot the total return as a function of the discount factor

Then we will have found the best discount factor for that environment, at least for the current settings of all the other hyperparameters

---

<h3>Continuing vs. Episodic Tasks</h3>

We'll notice that in our definition of the return we sum up to infinity

It might seem strange that we've assumed that we're doing a continuing task, when we said earlier that we'll be working mostly with episodic tasks

The key is the same formula can be applied to both episodic and continuing tasks

The short story is that summing up to infinity makes the math more convenient

So that's why we want to do it

But how can we justify this

Well, we can consider episodic tasks to simply be continuing tasks in which the terminal states have self-loops with probability 1 (and give zero reward from then on)

<img src='extras/54.15.PNG' width='350'></img>

That is whenever we reach a terminal state the state transition probability is 1 for going back to itself and zero otherwise

In this way the environment is technically a continuing task but for practical purposes the episode ends when we reach a terminal state

---

<h3>Another Reason for Discounting</h3>

This brings up yet another reason to use the discount factor

If we have no discount factor and we want to maximize the sum of future rewards what happens when the duration of an episode is infinitely long

Well, the sum of future rewards is potentially infinite, and of course it's not possible to choose the better policy if both policies give us an infinite return
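
To see how discounting avoids this, here is a short worked bound. It assumes every reward is bounded in magnitude by some constant $R_{\max}$ (an assumption introduced here for illustration) and that $0 \le \gamma < 1$, so the geometric series converges

$$\large \vert G(t) \vert \le \sum^{\infty}_{\tau=0} \gamma^\tau R_{\max} = \frac{R_{\max}}{1-\gamma}$$

For example, with $R_{\max} = 1$ and $\gamma = 0.9$ the return can never exceed 10 in magnitude, no matter how long the episode runs, so two policies can always be compared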

<h1>Math</h1>

So in this section we are going to begin with the fact that our agent's goal is to maximize the sum of future rewards

---

<h3>Sum of Future Rewards </h3>

The question we have to ask is: how does the agent know what the sum of future rewards will be in the first place

Clearly this is dependent on our policy

Let's again look at Gridworld

If our policy is to go directly to the winning state using the arrows that we've drawn here 

<img src='extras/54.7.PNG' width='250'></img>

then obviously the sum of future rewards is one as long as we follow this policy

Note that this is the case no matter what state we're in

As long as we follow our policy we're guaranteed to get +1 reward in the future 

---

On the other hand, suppose that we follow this other policy which always leads us to the losing state

<img src='extras/54.16.PNG' width='250'></img>

Now if we follow this other policy, our sum of future rewards will always be -1

Note that this is still the case no matter what state we're in

---

Now let's consider a slightly different environment where we receive -1 reward in every state, and our policy is still to go directly to the winning state

<img src='extras/54.17.PNG' width='250'></img>

The difference between this example and the previous examples is that now our return is dependent on which state we're in

If we start just to the left of the winning state then our sum of future rewards is just +1, the reward at the winning state

But if we start to the left of that state then our sum of future rewards is -1 + 1 which is zero

<img src='extras/54.18.PNG' width='250'></img>

In general the return from any particular state is dependent on what state we are in

This is also true when we have discounting

---

If we have discounting, it's easy to see that the discounted return will change depending on how far away we are from the winning state

If we're right beside the winning state or one step away then the return is 1

But if we're two steps away then the return is 0.9, assuming $\gamma = 0.9$

If we're three steps away then the return is $0.81$

<img src='extras/54.7.PNG' width='250'></img>

$$\large \text{Discount factor: } \gamma = 0.9$$
$$\large \text{Return at state (0,2) is 1}$$
$$\large \text{Return at state (0,1) is 0.9}$$
$$\large \text{Return at state (0,0) is 0.81}$$


The lesson is that in general the return is dependent not just on what policy we are following but also what state we are in
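
The numbers above can be reproduced with a few lines. This sketch assumes the only nonzero reward is the +1 received on the final move into the winning state, and it hard-codes the number of steps from each state, so `steps_to_win` is an illustrative stand-in for the policy drawn in the picture.

```python
gamma = 0.9

# Assumed distances (in moves) to the winning state along the top row of the grid
steps_to_win = {(0, 0): 3, (0, 1): 2, (0, 2): 1}

# The +1 reward arrives on the last of n moves, so it is discounted n - 1 times
returns = {state: gamma ** (n - 1) for state, n in steps_to_win.items()}

for state, g in returns.items():
    print(state, round(g, 2))
```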

---

<h3>The Return is Random</h3>

What about a game like tic-tac-toe or chess

In the versions of Gridworld we've been looking at so far the entire game is deterministic

There are states which are the locations on the grid and our actions simply bring us to neighboring states

But in tic-tac-toe and chess we have an opponent who also can take any number of actions

We have no idea what our opponent will do and therefore the next state we arrive in when it is our turn to make the next move is not deterministic 

We know that in general the environment dynamics are represented by the probability distribution $p(s^\prime,r \vert s,a)$

And so ultimately the future is not known

Thus we should ask, does it make sense to maximize the return when we don't even know what the return will be

How can we maximize the return when the return can be different depending on the trajectory of the game?

---

<h3>Value Function</h3>

In fact the return is a random variable

What we really want to maximize is the expected return or the expected value of the return

Another way of saying that is, we want to maximize the average return 

Of course in order to maximize anything, we need to know how to calculate that thing in the first place

So how do we calculate the expected return? 

The expected return is known as the value or the value function

Unfortunately this is a terrible name since the word value is such a generic word, but nonetheless that's what it's called, so that's what we're going to call it

As mentioned previously the entire purpose of the following notebooks is studying algorithms for solving the value function

So it's pretty important to know what the value function is 

The value function is defined as the expected value of the return given a state $s$ under the policy distribution $\pi$

$$\large V_\pi(s) = E_\pi \left[G(t) \vert S_t = s\right]$$

This is why earlier in this section we demonstrated that the return will be dependent on what state we are in and what policy we are following

In other words, the value function is the expected return at time $t$, given that the state at time $t$ is $s$

---

<h3>Value of a Terminal State</h3>

One important fact to remember is that the value of a terminal state is always zero

Remember that the value is the expected return but the return is the sum of future rewards

If we are in a terminal state then there are no future rewards and hence the value of a terminal state is always zero

<h1>Math</h1>

One useful way to think about value functions is probability trees, although these are somewhat simplistic as we'll see very soon

---

<h3>Expected Value</h3>

As a simple example consider a biased coin with probability of heads equals 0.6

The idea behind the probability tree is that the root node represents where we are now, and all the child nodes represent the states we could possibly go to next

Each branch of the tree has a weight which tells us the probability of going to that state

<img src='extras/54.19.PNG' width='300'></img>

From this graph we can see that we have a 40% chance of going to the left and getting zero reward and we have a 60% chance of going right and getting a reward of 1 

If heads equals 1 and tails equals zero, it's obvious that our expected reward will be $0.4 \times 0 + 0.6 \times 1 = 0.6$

In other words our expected reward is the weighted sum of all the possible rewards where those weights are given by the branches of the tree

---

Note that this applies to a tree that can have any number of children 

<img src='extras/54.20.PNG' width='300'></img>

The rule remains the same, the expected value is the weighted sum of each of the child node values weighted by the respective probabilities

Of course this is just the definition of the expected value for a discrete random variable

If $X$ is our random variable then the expected value of $X$ is just

$$\large E(X) = \sum_x x p(x)$$
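
As a small sketch, the weighted sum over the children of a probability tree can be written as a one-line function. The name `expected_value` and the fair-die example are illustrative additions; the coin is the biased coin from above.

```python
def expected_value(dist):
    """dist maps each outcome x to its probability p(x); returns the weighted sum of x * p(x)."""
    return sum(x * p for x, p in dist.items())

coin = {0: 0.4, 1: 0.6}                  # tails = 0 with probability 0.4, heads = 1 with 0.6
print(expected_value(coin))              # 0.6

die = {x: 1 / 6 for x in range(1, 7)}    # a root node with six children
print(expected_value(die))               # approximately 3.5
```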

---

<h3>Game Tree</h3>

Of course we can extend this concept even further

Let's look at a game of tic-tac-toe

Previously we looked at a very simple tree where there was only a root node and then direct descendants of the root node 

With tic-tac-toe, we have to consider the fact that the game isn't over after just a single move

In tic-tac-toe, a game is played by doing an entire sequence of moves and thus we have an entire sequence of states

Each leaf node in the tree represents a possible state that we might end up in

This tree represents all the different trajectories we might take from where we are in the tree at the current moment

<img src='extras/54.21.PNG' width='500'></img>

Luckily the same thought process still applies

What we want to do is calculate the expected sum of future rewards, which ends up being some kind of weighted sum of the values of the descendants

The nice thing about trees is that they should remind us of recursion

Remember that a tree is a recursive data structure

We can pick any child node in our tree and the sub-tree starting at that child node is also itself a tree

---

<h3>More Recursion</h3>

Let's look carefully at the equation for the return 

We say that it's the sum of future rewards 

$$\large G_1 = R_2 + R_3 + \ldots + R_T$$

But what if we are at the next time step and we want to know what is $G_2$

Well using the definition $G_2$ is just 

$$\large G_2 = \qquad R_3 + \ldots + R_T$$

Notice how $G_2$ is actually embedded into $G_1$

In other words we can substitute $G_2$ into the expression for $G_1$ 

$$\large G_1 = R_2 + G_2$$

$$\large \text{In General : } G(t) = R(t+1) + G(t+1)$$

---

<h3>With Discounting</h3>

We are encouraged to prove to ourselves that this applies to discounted returns as well

$$\large G(t) = R(t+1) + \gamma R(t+2) + \gamma^2 R(t+3) + \ldots$$

$$\large G(t+1) = \qquad R(t+2) + \gamma R(t+3) + \gamma^2 R(t+4) + \ldots$$

$$\large \text{In General : } G(t) = R(t+1) + \gamma G(t+1)$$

We can summarize this by saying the return is recursive
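
Here is a sketch of how the recursion can be used in practice: starting from the end of an episode and working backward, each return is obtained from the next one. The function names and the reward sequence are hypothetical, and the direct sum is included only to check that both computations agree.

```python
def returns_backward(rewards, gamma):
    """rewards[t] holds R(t+1). Builds G with G[t] = R(t+1) + gamma * G[t+1]."""
    G = [0.0] * (len(rewards) + 1)       # G[T] = 0: nothing is left after the episode ends
    for t in reversed(range(len(rewards))):
        G[t] = rewards[t] + gamma * G[t + 1]
    return G[:-1]

def return_direct(rewards, gamma, t):
    """The same return computed from the definition as a discounted sum."""
    return sum(gamma ** k * r for k, r in enumerate(rewards[t:]))

rewards = [-1, -1, 1]   # hypothetical episode
gamma = 0.9
G = returns_backward(rewards, gamma)

# The recursive and the direct computation agree at every time step
assert all(abs(G[t] - return_direct(rewards, gamma, t)) < 1e-12 for t in range(len(rewards)))
print(G)
```

The backward pass touches each reward once, whereas recomputing the sum from scratch at every time step repeats most of the work.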

---

<h3>Value Function</h3>

But let's remember that we are not just interested in the return but rather the expected return 

Having an expected value makes this a bit more complicated

Still this shouldn't stop us from trying

We have an expected value

$$\large V_\pi(s) = E_{\pi} \left[G(t) \vert S_t = s\right]$$

Now let's try to expand our expression for the expected value using the relevant probabilities 

To take our first step in deriving the Bellman equation, let's remember that the return is recursive

So we can replace $G(t)$ with $R(t+1) + \gamma G(t+1)$

$$\large V_\pi(s) = E_\pi \left[ R(t+1) + \gamma G(t+1) \vert S_t = s \right]$$

---

<h3>Law of Total Expectation : $E(E(X \vert Y)) = E(X)$</h3>

The next step is the most tricky step out of this entire derivation 

To understand this step, we first have to understand the law of total expectation 

In general, this says that 

$$\large E(E(X \vert Y)) = E(X)$$

That is if we take the expected value of a conditional expectation, it doesn't matter what we are conditioning on, that thing disappears 

To show why this is true, we only have to expand the expression by the definition of the expected value

We start by expanding the inner expected value which is a conditional expectation

The random variable here is $X$, so we sum over $x$

$$\large E(E(X \vert Y)) = E \left[\sum_x x p(x \vert Y)\right]$$

The next step is to expand the outer expected value

Since the inner expected value has already summed over $x$, the random variable now is $Y$

Thus our outer sum should be over $y$

Since this is an expectation over the random variable $Y$, the appropriate probability distribution is $p(y)$

$$\large E(E(X \vert Y)) = \sum_y \left[\sum_x x p(x \vert y)\right] p(y)$$

In the next step, we remove the extraneous brackets since they don't affect the order of operations

$$ \large = \sum_y \sum_x x p(x \vert y)p(y)$$

In the next step we recognize that $p(x \vert y)p(y)$ is  just the joint probability $p(x,y)$ 

In the next step, we move the sum over $y$ inside, past the $x$

That's allowed since $x$ does not depend on $y$

$$ \large = \sum_x x \sum_y p(x,y)$$

Of course this is just marginalization, the sum of $p(x,y)$ over $y$ just gives us $p(x)$

$$\large = \sum_x x p(x) = E(X)$$
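
We can also check the law of total expectation numerically. The joint distribution below is made up purely for illustration; the sketch computes $E(E(X \vert Y))$ and $E(X)$ separately and compares them.

```python
# Hypothetical joint distribution p(x, y) over x in {0, 1, 2} and y in {"a", "b"}
joint = {
    (0, "a"): 0.10, (1, "a"): 0.20, (2, "a"): 0.10,
    (0, "b"): 0.25, (1, "b"): 0.05, (2, "b"): 0.30,
}

# Marginals p(y) and p(x), obtained by summing the joint over the other variable
p_y, p_x = {}, {}
for (x, y), p in joint.items():
    p_y[y] = p_y.get(y, 0.0) + p
    p_x[x] = p_x.get(x, 0.0) + p

def cond_expectation(y):
    """E(X | Y = y) = sum_x x * p(x | y), where p(x | y) = p(x, y) / p(y)."""
    return sum(x * p / p_y[y] for (x, yy), p in joint.items() if yy == y)

lhs = sum(cond_expectation(y) * p_y[y] for y in p_y)   # E(E(X | Y))
rhs = sum(x * p for x, p in p_x.items())               # E(X)
print(lhs, rhs)   # the two agree up to floating point error
```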

---

<h3>Value Function</h3>

So how can we use the law of total expectation to help us derive the rest of the Bellman equation

$$\large V_\pi(s) = E_\pi \left[G(t) \vert S_t=s\right]$$

$$\large = E_\pi \left[R(t+1) + \gamma G(t+1) \vert S_t = s\right]$$

Well it might help to first recognize that the expectation is a linear operation

So if we see a plus sign inside an expectation we can split up the expectation 

The expectation of a sum is the sum of expectations

We can also bring constants outside like gamma

$$\large = E_\pi \left[R(t+1) \vert S_t=s \right] + \gamma E_\pi \left[G(t+1) \vert S_t = s\right]$$

The important part of this is the second term, the term with the return $G(t+1)$

This is like our $X$ in the law of total expectation 

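By the law of total expectation, we can replace the expectation of $G(t+1)$ with the expectation of a conditional expectation of $G(t+1)$, and we are free to choose what to condition on

But obviously we want to condition on something that makes sense, so how about the next state $s^\prime$ at time $t+1$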


$$\large = E_\pi \left[R(t+1) \vert S_t=s \right] + \gamma E_\pi \left[ E_\pi \left\{ G(t+1) \vert S_{t+1} = s^\prime \right\} \vert S_t = s \right]$$

Of course the inner expectation is nothing but the definition of the value function at the state $s^\prime$ (by the Markov property, once $S_{t+1}$ is given it no longer depends on $S_t$)

And so we can simply replace it with the symbol $V_\pi(s^\prime)$ and put both terms back together under the same expected value

$$\large = E_\pi \left[R(t+1) + \gamma V_\pi(s^\prime) \vert S_t = s\right]$$

---

<h3>The Bellman Equation</h3>

So this is the Bellman equation for the value function, the centrepiece of reinforcement learning 

$$\large V_\pi(s) = E_\pi \left[R(t+1) + \gamma V_\pi(s^\prime) \vert S_t = s\right]$$

Although it seems simple it has one key characteristic, this is that the computation of the value function is recursive 

Calculating the value function at the current state $s$ depends only on the values of the possible next states $s^\prime$

There is no need to say traverse a possibly infinite number of future trajectories to estimate the expected value

In other words we don't need to search an entire probability tree

What this equation says is that in order to calculate $V_\pi(s)$, we only need to look one step ahead, at the value at $s^\prime$

Why is this remarkable?

It means that our agent can plan without actually having to look very far in the future

Even if our goal is 1 million steps ahead, we can plan that entire 1 million steps simply by solving this equation that only looks at the next step

As we'll see this is of particular importance in popular algorithms such as Q-learning

<h1>Math</h1>

In this section we are going to continue our discussion of the Bellman equation

---

<h3>Bellman Equation (continued)</h3>

It's important to recognize what probability distributions in the Bellman equation we are actually summing over

As we discussed earlier the value of a state depends not only on the state but also on the policy we are following

That's the probability distribution $\pi$, and we even write this explicitly as a subscript

What is less clear is that the value also depends on the environment dynamics

Of course now that we say this it's pretty obvious

Obviously the expected sum of future rewards would depend on the environment itself

Recall that in an MDP there are two entities the agent and the environment and they both have their own respective probability distributions

The agent is represented by $\pi(a \vert s)$ and the environment is represented by $p(s^\prime,r \vert s,a)$

Therefore if we expand out the expression for the Bellman equation both of these probabilities should appear 

Note that in this expression the random variables we need to sum over are $s^\prime$ the next state, $r$ the reward and $a$ the action

It might not be clear exactly why expanding the Bellman equation like this is useful but later we'll do some exercises to help us see how to use it

$$\large V_\pi(s) = \sum_{s^\prime} \sum_r \sum_a \pi(a \vert s) p(s^\prime,r \vert s,a) \{r + \gamma V_{\pi}(s^\prime)\}$$

note : so we replace the expected value by the definition $\sum\limits_x x p(x)$ , now $x$ is represented by $\{r+ \gamma V_\pi (s^\prime) \}$, the probability of getting this value is the probability of getting the tuple $(s^\prime,r)$, which is $\sum\limits_a\pi(a \vert s) p(s^\prime,r \vert s,a)$, that is the sum of probabilities of getting $(s^\prime,r)$ while in state $s$ for every action $a$

One way to simplify this equation a bit is to recognize that the policy distribution $\pi$ does not depend on $s^\prime$ or $r$ so it can be brought outside those sums

$$\large V_{\pi}(s) = \sum_a \pi(a \vert s) \sum_{s^\prime}\sum_r p(s^\prime,r \vert s,a) \{r + \gamma V_\pi(s^\prime)\}$$

Now we only have one outermost sum over $a$ and the sums over $s^\prime$ and $r$ can be brought further inside
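
To make the structure of these sums concrete, here is a minimal sketch of the right-hand side of the Bellman equation written as nested loops

The function name `bellman_backup`, the dictionary layout used for `pi` and `p`, and the toy numbers are all assumptions made for illustration, they do not come from any library or from the examples in this notebook

```python
def bellman_backup(s, V, pi, p, gamma=0.9):
    """Right-hand side of the Bellman equation for a single state s.

    pi[s]      maps action -> pi(a | s)
    p[(s, a)]  maps (next_state, reward) -> p(s', r | s, a)
    V          maps state -> current value estimate
    """
    total = 0.0
    for a, pi_a in pi[s].items():                    # outer sum over actions a
        for (s_next, r), prob in p[(s, a)].items():  # inner sums over s' and r
            total += pi_a * prob * (r + gamma * V[s_next])
    return total


# Hypothetical toy MDP: from state "A" we can take action "x" or "y"
pi = {"A": {"x": 0.5, "y": 0.5}}
p = {
    ("A", "x"): {("B", 1.0): 1.0},
    ("A", "y"): {("B", 0.0): 0.5, ("C", 2.0): 0.5},
}
V = {"A": 0.0, "B": 1.0, "C": 2.0}

result = bellman_backup("A", V, pi, p, gamma=0.9)
print(result)
```

Because $\pi(a \vert s)$ is read once in the outer loop, the code mirrors the simplified form of the equation where the policy sits outside the sums over $s^\prime$ and $r$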

---

<h3>Other Notations</h3>

We want to mention that there are some alternative forms of the Bellman equation that we will see out in the wild

We recall we said earlier that sometimes the reward is not considered to be stochastic, just the next state

In that case we can simplify Bellman's equation 

The environment dynamics are now represented by $p(s^\prime \vert s,a)$

$$\large V_{\pi}(s) = \sum_a \pi(a \vert s) \sum_{s^\prime} p(s^\prime \vert s,a) \{r + \gamma V_\pi(s^\prime)\}$$

note : reward is not stochastic, no sum over $r$

---

Another notation which is a little more verbose, makes the dependence of the reward more explicit

Essentially it indexes the reward by a triple: the current state $s$, the action $a$ and the next state $s^\prime$

$$\large V_\pi(s) = \sum_{a} \pi(a \vert s) \sum_{s^\prime} p(s^\prime \vert s,a) \{r(s,a,s^\prime) + \gamma V_\pi(s^\prime)\}$$

In other words the reward is dependent on these three values

---

One alternative notation which is kind of odd in our opinion is to use subscript and superscript instead of explicitly stating what's the random variable and what's being conditioned on

$$\large V_\pi(s) = \sum_a \pi(s,a) \sum_{s^\prime} P^a_{ss^\prime} \{r^a_{ss^\prime} + \gamma V_\pi(s^\prime)\}$$

Of course this is not ideal since we can't see what the meaning of any of these distributions is

It also makes the policy look like a joint distribution rather than a conditional distribution

We will never use this but we want to know that it exists

---

<h3>Trees, or Graphs?</h3>

Finally we want to bring this back to the idea of probability trees

The thing with probability trees is that we can only go in one direction but we know that this is limited

Even in a game like chess, if the players just keep moving their pieces back and forth, although we probably wouldn't want to do so, our game will just go back and forth between states that were already visited

Another example of this is if we are modeling the weather

Suppose we have three states sunny, rainy and cloudy

Obviously we can go from any state to any other state from one day to the next including the option of staying in the same state

For example it's sunny two days in a row 

As we saw previously, calculating expected values in a tree is easy, we just sum over the child nodes

But what if you have a generic graph that has loops?

Now there is no notion of child because we can have two nodes that are pointing at each other

<img src='extras/54.22.PNG' width='700'></img>

---

<h3>Systems of Linear Equations</h3>

Our claim is that solving the Bellman equation becomes nothing but solving a system of linear equations

To remind ourselves of what these look like, although we should already know this, consider a set of two equations 

$$ x+y = 3$$

$$ 2x-y = 1$$

This is two equations and two unknowns and so usually from these we can solve for $x$ and $y$
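
As a quick sketch, this small system can be written in the matrix form $Ax = b$ and handed to NumPy

The variable names `A`, `b` and `solution` are just our own choices here

```python
import numpy as np

# Coefficients of x and y in each equation
A = np.array([[1.0, 1.0],    # x + y  = 3
              [2.0, -1.0]])  # 2x - y = 1
b = np.array([3.0, 1.0])

solution = np.linalg.solve(A, b)
x, y = solution
print(x, y)  # x = 4/3, y = 5/3
```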

Well the Bellman equation is the same

---

<h3>Closer Inspection</h3>

Suppose we have two states, $s_1$ and $s_2$ 

We can go from $s_1$ to $s_2$ or vice versa and we can stay in the same state 

<img src='extras/54.23.PNG' width='250'></img>

Remember that for now we will consider the environment dynamics and policy distribution to be given

If we look carefully at our Bellman equation what do we see?

$$\large V_\pi(s) = \sum_a \pi(a \vert s) \sum_{s^\prime}\sum_r p(s^\prime,r \vert s,a) \{r + \gamma V_\pi(s^\prime)\}$$

First let's consider $\pi$ 

Since this is given, it's just a constant, it's just a number

Now let's consider $p(s^\prime,r \vert s,a)$

Since this is also given it is also a constant

Now let's consider $r$, well this is known too since the distribution $p(s^\prime,r \vert s,a)$ is given

We are just summing over all possible values of $r$ 

How about $V_\pi(s^\prime)$

Well this can only take on two possible values $V_\pi(s_1)$ or $V_\pi(s_2)$

Therefore it's easy to see that the Bellman equation boils down to two equations 

$$\large V(s_1) = b_1 + c_{11} V(s_1) + c_{12} V(s_2)$$

$$\large V(s_2) = b_2 + c_{21} V(s_1) + c_{22} V(s_2)$$


Here the $b$s and $c$s are constants and thus the Bellman equation is really nothing but a set of linear equations
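
Here is a minimal sketch of how those constants could be assembled and solved for two states

We assume the policy and the environment have already been combined, so that `P[i, j]` is the probability of moving from $s_i$ to $s_j$ under the policy and `b[i]` is the expected immediate reward from $s_i$, which gives $c_{ij} = \gamma P_{ij}$

All the numbers below are hypothetical and chosen only for illustration

```python
import numpy as np

gamma = 0.9

# Hypothetical state-to-state transition probabilities under the policy
P = np.array([[0.5, 0.5],
              [0.2, 0.8]])

# Hypothetical expected immediate rewards b_1 and b_2
b = np.array([1.0, 0.0])

# The constants c_ij from the two equations above
C = gamma * P

# V = b + C V  is the same as  (I - C) V = b
V = np.linalg.solve(np.eye(2) - C, b)
print(V)
```

Rearranging $V = b + CV$ into $(I - C)V = b$ is what turns the recursive definition into a standard linear system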

---

<h3>Does it work?</h3>

Of course if this solution worked then the rest of this notebook wouldn't be necessary

So what's wrong with solving the Bellman equation using a system of linear equations?

Well the problem is that the number of variables is the number of possible states

What would we do for games like chess or go which have $10^{50}$ or $10^{170}$ states

In cases like these solving systems of linear equations is not possible

Unfortunately problems like these are also the kinds of problems that we care about

We also haven't yet discussed problems where the state space is infinite

Thus the rest of these notebooks is not just how to solve the Bellman equation but rather how to solve it in a way that is also scalable

<h1>Math</h1>

In this section we are going to continue our discussion on value functions and the Bellman equation

---

<h3>Value Functions Again</h3>

So far we've discussed the value function $V(s)$ 

It says, what is the expected sum of future rewards, given that I am currently in the state $s$ following the policy $\pi$ 

But here's a question to consider

What if we want to ask a different question: what would be the expected sum of future rewards if we took some arbitrary action $a$ and thereafter followed the policy $\pi$

As we know in reinforcement learning, our goal is to get our agent to learn 

To learn, it must perform new actions, actions that are different from its current policy

Thus it's not enough to only consider what is the value of the current state under the current policy

We also need to ask what happens if we try some other action that is perhaps not the action dictated by the policy 

<img src='extras/54.24.PNG' width='400'></img>

Also recognize that this is a question that must automatically be considered if we have a stochastic policy 

A stochastic policy allows us to take any action from a given state

So obviously the sum of future rewards will be different depending on which action is actually taken

---

<h3>2 Kinds of Value Functions</h3>

To address the fact that the sum of future rewards may be different depending on which action is taken, we create a new kind of value function 

In order to distinguish between these two kinds of value functions, we give them separate names 

The value function $V$ which we previously discussed is referred to as the state value

$$\large \text{State - value : } V_\pi(s) = E_\pi \left[ G(t) \vert S_t = s \right]$$

The value function $Q$ which depends on both the current state and the action taken in that state is known as the action value

The definition of the action value is analogous to the state value, it's the expected sum of future rewards conditioned on the current state $s$ and the current action $a$

$$\large \text{Action - value : } Q_\pi(s,a) = E_\pi \left[ G(t) \vert S_t = s, A_t = a \right]$$

Obviously since we are now conditioning on two variables the $Q$ function has two arguments $s$ and $a$

Note that in many places we'll see Q referred to as a $Q$-table

This makes sense because $Q$ has two arguments the state and the action 

Tables are two dimensional, we have rows and columns

So if we line up all the states along the rows and we line up all the actions along the columns and inside the table we store the actual $Q$ values for the given state and action then it's pretty obvious that this is a table of $Q$ values or simply a $Q$-table

<table>
    <tr>
        <td>$Q$</td>
        <td>$a_1$</td>
        <td>$a_2$</td>
        <td>$a_3$</td>
    </tr>
    <tr>
        <td>$s_1$</td>
        <td>$5$</td>
        <td>$10$</td>
        <td>$2$</td>
    </tr>
    <tr>
        <td>$s_2$</td>
        <td>$4$</td>
        <td>$1$</td>
        <td>$3$</td>
    </tr>
    <tr>
        <td>$s_3$</td>
        <td>$7$</td>
        <td>$6$</td>
        <td>$8$</td>
    </tr>
</table>

---

<h3>Bellman Equation for Action-Value</h3>

Just like with the state value, we can derive an expression for the Bellman equation for the action value

As usual we start from the definition 

$$\large Q_\pi(s,a) = E_\pi \left[G(t) \vert S_t = s, A_t = a\right]$$

Same as before, the next step is to replace $G(t)$ with the recursive definition

$$\large = E_\pi \left[R(t+1) + \gamma G(t+1) \vert S_t = s,A_t = a \right]$$

The next step is to replace the return at time $t+1$ with the expectation of the return at time $t+1$ conditioned on the next state $s^\prime$

$$\large = E_\pi \left[ R(t+1) + \gamma E_\pi \{G(t+1) \vert S_{t+1} = s^\prime \} \vert S_t =s, A_t = a\right]$$

Of course this is the state value function for $s^\prime$

Now we might wonder here why don't we condition on the next action $a^\prime$ as well?

Well remember that $a^\prime$ is not a variable being considered under the current expectation

Therefore it has to be marginalized out

So the next step is to additionally condition on the next action $a^\prime$ but to take its expectation over all possible next actions

$$\large = E_\pi \left[ R(t+1) + \gamma \sum_{a^\prime} \pi(a^\prime \vert s^\prime) E_\pi \{G(t+1) \vert S_{t+1} = s^\prime, A_{t+1} = a^\prime \} \vert S_t = s, A_t = a \right]$$

Of course now the inner term is just $Q_\pi(s^\prime,a^\prime)$, and so we end up with a recursive definition of the action value function as well

$$\large Q_\pi(s,a) = E_\pi \left[ R(t+1) + \gamma \sum_{a^\prime} \pi(a^\prime \vert s^\prime) Q_\pi(s^\prime,a^\prime) \vert S_t = s, A_t = a \right]$$
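
For reference, here is a sketch of the same recursion with the outer expectation written out as explicit sums, in the same style as the expanded Bellman equation for $V_\pi$

This expanded form is our own addition, obtained by replacing the expectation over the next state and reward with the environment dynamics $p(s^\prime,r \vert s,a)$

$$\large Q_\pi(s,a) = \sum_{s^\prime}\sum_r p(s^\prime,r \vert s,a) \left\{ r + \gamma \sum_{a^\prime} \pi(a^\prime \vert s^\prime) Q_\pi(s^\prime,a^\prime) \right\}$$

Notice that there is no outer sum over $a$ here because we are conditioning on the current action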

---

<h3>Relationships Between State-Value / Action-Value</h3>

At this point it's useful to tease out a few useful identities from our previous derivation

The critical one is how do we relate the state value to the action value?

This might be easier to see if we express both the state value and the action value in terms of probabilities

So let's start with the state value again

$$\large V_\pi(s) = E_\pi \left[G(t) \vert S_t=s\right] = \sum_a \pi(a \vert s) \sum_{s^\prime}\sum_r p(s^\prime,r \vert s,a)G(t)$$

Now let's write out the expression for the action value

$$\large Q_\pi(s,a) = E_\pi \left[ G(t) \vert S_t=s,A_t=a\right] = \sum_{s^\prime}\sum_r p(s^\prime,r \vert s,a)G(t) $$

Notice how there is no need to sum over $a$ because we are conditioning on $a$ 

Of course this makes it obvious that there is only one additional step for calculating $V$ which is to sum over the action $a$ 

$$\large V_\pi(s) = \sum_a \pi(a \vert s) Q_\pi(s,a)$$

And thus the state value is nothing but the expected value of the action value if we take the average over all possible actions we could have done weighted by the probabilities of those actions
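
As a small sketch, we can apply this identity to the example $Q$-table from the earlier section

The table gives us $Q$ but not a policy, so the uniform policy below (each action with probability $1/3$) is purely an assumption for illustration

```python
import numpy as np

# Rows are states s_1..s_3, columns are actions a_1..a_3 (values from the Q-table)
Q = np.array([[5.0, 10.0, 2.0],
              [4.0, 1.0, 3.0],
              [7.0, 6.0, 8.0]])

# Hypothetical policy: pi[i, j] = pi(a_j | s_i), uniform over the three actions
pi = np.full((3, 3), 1.0 / 3.0)

# V(s) = sum_a pi(a | s) Q(s, a), computed for every state at once
V = (pi * Q).sum(axis=1)
print(V)  # [17/3, 8/3, 7]
```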

---

<h3>What are they useful for?</h3>

At this point we might be wondering what is the action value even useful for?

Well this probably won't be super clear until we get to the next few notebooks where we start to introduce actual algorithms but the basic idea is this 

The state value is useful for evaluating a given policy

That is given a policy, tell us how good it is

Tell us what future reward we can expect to achieve

The action value is used for control

Again this is all abstract because we haven't yet seen any examples of it so we'll just have to take the instructor's word for it at this point 

But the intuition is this, the action value allows us to see what is the best action we can take in the current state

It's telling us the expected future reward broken down by action

Thus let's say we want to compare two actions $a_1$ and $a_2$ for the state we're currently in 

We want to ask what is the best action we can take right now

Well the best action is obviously the one that's going to give us the highest future reward, and thus we can compare $Q(s,a_1)$ and $Q(s,a_2)$ and choose the action that gives us the highest future reward
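
A minimal sketch of this comparison, reusing the numbers from the example $Q$-table shown earlier, is to take the argmax over the actions in each row

The variable names are our own, and the action indices are zero-based so index `0` means $a_1$

```python
import numpy as np

# Rows are states s_1..s_3, columns are actions a_1..a_3 (values from the Q-table)
Q = np.array([[5.0, 10.0, 2.0],
              [4.0, 1.0, 3.0],
              [7.0, 6.0, 8.0]])

# For each state, the index of the action with the highest Q value
best_actions = Q.argmax(axis=1)
print(best_actions)  # [1 0 2], that is a_2 for s_1, a_1 for s_2, a_3 for s_3
```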

This opens us up for the next section which is on optimal policies and optimal value functions

<h1>Math</h1>

In this section we're going to look at the Bellman equation by example

---

<h3>Bellman Equation by Example</h3>

Now of course in the real world we may have thousands or maybe even millions of states, but in this section we're going to look at some very simple models with just a handful of states

We'll see how we can solve the Bellman equation by hand for some very simple cases

Hopefully this helps us gain some intuition about the Bellman equation and realize that it's not so scary

And also keep in mind just from a big picture perspective, this idea of solving for $V(s)$ that's essentially what these entire few following notebooks are about 

This whole thing can be generalized to this idea

We have a model called an MDP and there's this value function $V(s)$, now all we need to do is find it

It's really that simple, so once we've handled the details we can take a step back and appreciate this general framework and how useful this abstraction is for solving a variety of problems

---

<h3>Example #1</h3>

Let's start with a very basic example

We have two states start and end 

The probability of going from start to end is 1

So every time we play this game it's completely deterministic 

Every game just consists of two steps start and end

Let's suppose the end state gives us a reward of 1 and the discount factor is $\gamma = 0.9$

<img src='extras/54.25.PNG' width='300'></img>

The question is what do we need to find?

---

Now if our answer was we need to solve for the value function that's great, we are on the right track

Since there are only two states that means we need to find $V(start)$ and $V(end)$

---

Now remember that the value is the sum of all future rewards 

Since at the end state there are no future rewards, the value of the end state, or in more general terms any terminal state is always $0$ 

Because everything else is deterministic, $V(start)$ is one

$$\large V(end) = 0$$
$$\large V(start) = 1$$

Why is this?

Well remember that we can think of this as a probability tree

So going from start to end is the only possibility, so the weight of that edge is one

Note that we do not multiply this reward by gamma because gamma applies only to future states

So the full equation is actually 

$$V(start) = R + \gamma V(end)$$

where the second term is just zero

---

<h3>Example #2</h3>

For our next example, we're going to add a third state

<img src='extras/54.26.PNG' width='500'></img>

Note how everything is still deterministic

Therefore every game just has three steps start, middle and end 

The reward again is one for arriving in the end state and 0 otherwise 

gamma remains as 0.9 as we already know

Our job is to find $V(start)$, $V(middle)$ and $V(end)$

---

So because end is a terminal state we already know its value $V(end)$ is zero

In this example the middle state takes on the role of the start state from our previous example, so $V(middle)$ is one by the logic we discussed previously

Since this is all deterministic, we can use the Bellman equation in simple form to determine the value of $V(start)$ which is

$$\large V(start) = R(start,mid) + \gamma V(mid) = 0 + 0.9 \times 1 = 0.9$$

---

<h3>Example #3</h3>

In this next example, we're going to make things a little more complicated

We have the same states and everything is still deterministic but now we get a reward of $-0.1$ for going to the middle state

<img src='extras/54.27.PNG' width='500'></img>

Our job again is to find $V(start)$, $V(middle)$ and $V(end)$

---

So because the end state comes after the middle state, $V(end)$ is unaffected and it remains at zero

And since the middle state is only concerned with future rewards, its value is also unaffected, so it remains at $1$

$V(start)$ however is affected

It's equal to $-0.1 + 0.9 \times 1 = 0.8$

So hopefully we're getting the hang of this 
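
Examples 1 to 3 all follow the same pattern, so here is a minimal sketch of that backward pass using the numbers from Example #3

The list `transitions` and the dictionary `V` are our own illustrative data layout, with the transitions listed from the last one to the first

```python
gamma = 0.9

# (state, next_state, reward for arriving in next_state), last transition first
transitions = [
    ("middle", "end", 1.0),
    ("start", "middle", -0.1),
]

# The terminal state has no future rewards
V = {"end": 0.0}

# Work backwards: each value only needs the value of the state that follows it
for state, next_state, reward in transitions:
    V[state] = reward + gamma * V[next_state]

print(V)  # V(middle) = 1, V(start) is approximately 0.8
```

Setting the reward for arriving in the middle state to `0.0` instead of `-0.1` reproduces Example #2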

---

<h3>Example  #4</h3>

In this next example, we're going to introduce some randomness

Our states are $s_1, s_2, s_3 \text{ and } s_4$

$s_4$ is a terminal state and we always get a reward of $1$ when we go here 

$s_2$ and $s_3$ both go to $s_4$ 100% of the time, so this part is deterministic

$s_1$ goes to $s_2$ 50% of the time and to $s_3$ 50% of the time, so this part is random

The reward for arriving in $s_4$ is $1$, the reward for arriving in $s_2$ is $-0.2$ and the reward for arriving in $s_3$ is $-0.3$

Discount factor $\gamma$ = 0.9

<img src='extras/54.28.PNG' width='300'></img>

---

Before we move on to the solution this example introduces an interesting nuance

What does the probability we just mentioned refer to?

If we say we have a 50% chance of going to $s_2$ and a 50% chance of going to $s_3$, what does that really mean?

Because this is reinforcement learning and there are multiple components in an MDP, this is not just a simple coin flip

We don't flip a coin and go which way it tells us to go

We're an intelligent agent and therefore we have an internal policy $\pi(a \vert s)$ that tells us what action to do

Importantly, it's not the probability of where to go, it's the probability of what to do

note : $\pi(a \vert s)$ does not tell us where to go, but what action to do

But if we recall this is not the only relevant probability

We also have $p(s^\prime,r \vert s,a)$

This just means that by doing some action $a$ in state $s$ the result can be random

note : $p(s^\prime,r \vert s,a)$ tells us where we end up

So if our action is flip a coin of course there are multiple possible results , we may get heads or we may get tails

That's one way an action might result in a probabilistic next state 

So when we say we have a 50% chance of going to $s_2$ and a 50% chance of going to $s_3$, that sounds intuitive and simple, but it oversimplifies the MDP and leaves it ambiguous which of the two probabilities we mean

For this particular example, let's assume $p(s^\prime,r \vert s,a)$ is deterministic and the 50% probabilities we mentioned earlier referred to the policy $\pi(a \vert s)$, where the action is simply just to go to the next state

That means the agent itself decides where to go by random chance and it's not a result of the environment

---

So as usual the value of the terminal state $V(s_4)$ is zero

Everything that happens after we arrive in $s_2$ and $s_3$ is deterministic, and so it follows the same logic as the previous examples

So $V(s_2)$ is 1 and $V(s_3)$ is also 1 

$V(s_1)$ however must be calculated using expected values

So we have a 50% chance of going to $s_2$ and receiving a reward of $-0.2$

We have a 50% chance of going to $s_3$ and receiving a reward of $-0.3$

After that, the next state is always $s_4$ which always yields a reward of 1, so that part can be effectively removed from the expected value 

The value of $V(s_1)$ is then just 0.65, since it's just the probability-weighted sum of these two branches

$$\large V(s_1) = p(s_2 \vert s_1)[R_2 + \gamma V(s_2)] + p(s_3 \vert s_1) [R_3 + \gamma V(s_3)]$$ 

$= 0.5(-0.2 + 0.9 \times 1) + 0.5 (-0.3 + 0.9 \times 1)$

$  = 0.5(-0.2-0.3) + 0.9$

$ = -0.25 + 0.9 = 0.65$
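
The same calculation can be sketched in a few lines of Python

As assumed in this example, the 50% probabilities belong to the policy and the environment is deterministic, so the dictionary `pi` below maps each possible next state from $s_1$ to the probability of choosing it, and the dictionary names are our own

```python
gamma = 0.9

# Policy at s1: probability of choosing to move to each next state
pi = {"s2": 0.5, "s3": 0.5}

# Reward for arriving in each state
reward = {"s2": -0.2, "s3": -0.3, "s4": 1.0}

# Terminal state, then the two deterministic states that lead into it
V = {"s4": 0.0}
V["s2"] = reward["s4"] + gamma * V["s4"]
V["s3"] = reward["s4"] + gamma * V["s4"]

# Expected value over the two branches out of s1
V["s1"] = sum(pi[n] * (reward[n] + gamma * V[n]) for n in pi)
print(V["s1"])  # approximately 0.65
```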

---

<h3>Example #5</h3>

In this next example we're going to extend this problem just a little bit further 

The expected value in the previous example was easy to calculate because we always ended up in the same terminal state $s_4$ 

In this example that won't be the case

Note that $s_1$ can go to either $s_2$ or $s_3$ but from $s_2$ or $s_3$ we can end up in either $s_4$ or $s_5$

<img src='extras/54.29.PNG' width='250'></img>

So now we have two terminal states and there are multiple paths to either one

The only relevant rewards for this example are a reward of $+1$ if we end up in $s_5$ and a reward of $-1$ if we end up in $s_4$ 

---

Firstly, we know that $V(s_4)$ and $V(s_5)$ are 0 since they are both terminal states

Second we can calculate $V(s_2)$ and $V(s_3)$ just from the rewards since we know the next state must be terminal

So we don't need to use the full Bellman equation

So $V(s_2) = 0.8 \times (-1) + 0.2 \times 1 = -0.6$

Its expected value is negative because it ends up in the losing state more often 

$V(s_3) = 0.9 \times 1 + 0.1 \times (-1) = 0.8$

Its expected value is positive because it ends up in the winning state more often

---

To calculate $V(s_1)$, we need to take into account both $V(s_2)$ and $V(s_3)$ since they are both possible next states

But it's still simple since the rewards at $s_2$ and $s_3$ are both zero

So it's just the weighted average of $V(s_2)$ and $V(s_3)$ which gives us 0.09 

$V(s_1) = 0.5(0 + \gamma V(s_2)) + 0.5(0 + \gamma V(s_3))$

$= 0.5 \times 0.9 \times (-0.6) + 0.5 \times 0.9 \times 0.8$

$= -0.27 + 0.36 = 0.09$

It's slightly positive because we have an equal chance of landing in $s_2$ or $s_3$, but once we are in $s_3$ our chance of reaching the winning state $s_5$ is higher than the chance of going from $s_2$ to the losing state $s_4$
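
Here is a minimal sketch of this example as a backward pass over a small tree

The transition probabilities are read off the calculations above (the figure itself is not reproduced here), and the dictionary layout is our own assumption

```python
gamma = 0.9

# p[s][s_next]: probability of moving from s to s_next
p = {
    "s1": {"s2": 0.5, "s3": 0.5},
    "s2": {"s4": 0.8, "s5": 0.2},
    "s3": {"s4": 0.1, "s5": 0.9},
}

# Reward for arriving in each state
reward = {"s2": 0.0, "s3": 0.0, "s4": -1.0, "s5": 1.0}

# Terminal states first, then work backwards towards s1
V = {"s4": 0.0, "s5": 0.0}
for s in ["s2", "s3", "s1"]:
    V[s] = sum(prob * (reward[n] + gamma * V[n]) for n, prob in p[s].items())

print(V["s2"], V["s3"], V["s1"])  # approximately -0.6, 0.8 and 0.09
```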

---

<h3>Example #6</h3>

In this next example, we're going to look at a simpler state diagram but with more complex probabilities 

This will be a somewhat more realistic scenario so we can actually picture what's happening

The start state is that we're standing

Now someone throws a ball at us, so we can kind of think of this as dodgeball

We have two choices, we can either decide to jump or duck, those are our two actions

The next state is, we either get hit or we don't get hit which means we're safe

Let's say the reward for getting hit is $-1$ and the reward for not getting hit is $0$

Importantly though, we can see why these state diagrams are somewhat oversimplistic 

<img src='extras/54.30.PNG' width='300'></img>

In particular when we look at a tree like this, we think, as we mentioned earlier, that the action is to go to the next state, but that doesn't include all the components of an MDP

And so we can see how that doesn't work in this scenario, we can't just choose don't get hit

Similarly in life, we can't simply choose that we want to be rich

We do some actions and the next state is going to be somewhat probabilistic

We may start a company and it may fail or it may succeed but we can't decide that

All we can do is do the action

The state that we end up in is probabilistic

---

So to continue on with this example let's use these probabilities 

$$\pi(jump \ \vert \ start) = 0.5$$
$$\pi(duck \ \vert \ start) = 0.5$$


Next we have the state transition probabilities

$$p(hit, \ reward = -1 \ \vert \ jump,start) = 0.8$$

$$p(hit, \ reward = 0 \ \vert \ jump,start) = 0$$

$$p(safe, \ reward = -1 \ \vert \ jump,start) = 0$$

$$p(safe, \ reward = 0 \ \vert \ jump,start) = 0.2$$

$$p(hit, \ reward = -1 \ \vert \ duck,start) = 0.4$$

$$p(hit, \ reward = 0 \ \vert \ duck,start) = 0$$

$$p(safe, \ reward = -1 \ \vert \ duck,start) = 0$$

$$p(safe, \ reward = 0 \ \vert \ duck,start) = 0.6$$

Notice how we define the probability distribution over the reward even though we didn't have to

We just wanted to show the full picture for what the probability space looks like

Notice how we can put in any other number for the reward and the probability is 0 as per the specifications of the problem

So for example, the probability that the reward is 10 is zero because that just doesn't exist in this game

---

We can of course just marginalize these to get rid of the reward completely since it is deterministic

We've already stated that when we get hit that counts as a reward of $-1$, and if we don't get hit it's zero

$$p(hit \ \vert \ jump,start) = 0.8$$

$$p(safe \ \vert \ jump,start) = 0.2$$

$$p(hit \ \vert \ duck,start) = 0.4$$

$$p(safe \ \vert \ duck,start) = 0.6$$


---

So as usual the value for the terminal states is zero

So whether we get hit or we don't get hit the value is zero

It's calculating $V(start)$ that's more challenging for this problem

So we've started using some typesetting for this equation, since now things are starting to look a little more complicated, but it's still a little simpler than the full Bellman equation because we don't need to consider the value function for the next state

$$\large V(start) = \sum_{a \in \{\text{duck,jump}\}} \pi(a \ \vert \ start) \sum_{s^\prime \in \{\text{safe,hit}\}} p(s^\prime \vert start,a) r(s,s^\prime)$$

For this type of thing it's useful to do a little sanity check

We see a summation inside another summation

In code we can think of this as a for loop inside another for loop or in other words a nested for loop

So in the outer loop we have two things to sum over and then the inner loop we have another two things to sum over

And we know that the number of times things run in total is the size of the outer loop times the size of the inner loop

So in total we should have four things to sum over

---

In the equation above we're showing the exact terms we have to sum over

To make things simple we can abbreviate things in our calculations

So we already know we're conditioning on the start state so let's just not show that 

We know that we're going to either jump or duck so we can use the letter $j$ or $d$ which doesn't interfere with any other letter

And so we see that we end up with four terms 

$$V(start) = \pi(j)p(safe \  \vert \ j)\times 0 + \pi(j)p(hit \ \vert \ j)\times(-1) + \pi(d)p(safe \ \vert \ d)\times0 + \pi(d)p(hit \ \vert \ d)\times(-1)$$


The first two are if we decide to jump, we're either safe and we get a reward of zero or we get hit and we get a reward of $-1$

The next two are if we decide to duck, we're either safe and we get a reward of zero or we get hit and we get a reward of $-1$

So plugging in the numbers we get

$0.5 \times 0.2 \times 0 + 0.5 \times 0.8 \times (-1) + 0.5 \times 0.6 \times 0 + 0.5 \times 0.4 \times (-1) = -0.6$
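
The nested for loop we described can be sketched directly, using the policy and the marginalized transition probabilities from this example

The dictionary names and the counter `n_terms` are our own, the counter is only there for the sanity check that we sum over four terms

```python
# Policy in the start state
pi = {"jump": 0.5, "duck": 0.5}

# p[a][s_next]: probability of the next state given the action taken from start
p = {
    "jump": {"hit": 0.8, "safe": 0.2},
    "duck": {"hit": 0.4, "safe": 0.6},
}

# Reward is determined by the next state
reward = {"hit": -1.0, "safe": 0.0}

v_start = 0.0
n_terms = 0
for a in pi:                # outer loop over actions
    for s_next in p[a]:     # inner loop over next states
        v_start += pi[a] * p[a][s_next] * reward[s_next]
        n_terms += 1

print(n_terms, v_start)  # 4 terms, value approximately -0.6
```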

---

<h3>Example #7</h3>

So at this point, probably any state diagram shaped like a tree is going to be too easy for us

By now we should understand that this is recursive

Essentially we work backwards from the end, so the terminal state is 0, the state next to that is whatever reward we get for ending the game, and so on using the Bellman equation

In this next and final example we're going to do something even more complex

We're going to introduce a cycle 

When there's a cycle, it's not possible to just work backwards because there's no notion of backwards, we can just end up in the same place we started 

<img src='extras/54.31.PNG' width='400'></img>

In this example we're going to have just three states, $s_1,s_2$ and $s_3$

$s_3$ is a terminal state so once we reach it the game is over and we get a reward of $1$

$s_1$ and $s_2$ are non-terminal states

They can both go to each other so there's no notion of a starting position 

Only $s_2$ can go to $s_3$, but $s_1$ can go back to itself 

Again $\gamma = 0.9$

Let's assume that the only probabilities here are the actions and that the actions are simply to go to whatever state is next

As we discussed the actions realistically don't have to be choosing which state is next, but it helps to simplify the picture in terms of what probabilities we have to consider

---

So how can we solve this problem

Well we can't use the method of just working backwards like we did previously, that's because of this cycle we have

So the only thing we can do is apply Bellman's equation in terms of the variables involved and just see what that gives us

Let's do $V(s_1)$ first

Using the Bellman equation we can see that it's equal to

$\large V(s_1) = p(s_1 \vert s_1)(R_1 + \gamma V(s_1)) + p(s_2 \vert s_1)(R_2 + \gamma V(s_2))$

$\large V(s_1) = 0.3(-0.1 + 0.9V(s_1)) + 0.7(-0.1 + 0.9V(s_2))$

note : $0.3(-0.1 + 0.9V(s_1))$ is the loop of $s_1$ back to itself

note : $0.7(-0.1 + 0.9V(s_2))$ is the transition going from $s_1$ to $s_2$

$\large V(s_1) = -0.1 + 0.27V(s_1) + 0.63V(s_2)$

Notice that $V(s_3)$ doesn't appear here because we can't get to $s_3$ from $s_1$

---

Let's do $V(s_2)$ next

$\large V(s_2) = p(s_1 \vert s_2)(R_1 + \gamma V(s_1)) + p(s_3 \vert s_2)(R_3 + \gamma V(s_3))$

$\large V(s_2) = 0.6(-0.1  + 0.9 V(s_1)) + 0.4(1 + 0.9 V(s_3))$

$\large V(s_2) = 0.34 + 0.54 V(s_1) + 0.36 V(s_3)$

Using the Bellman equation we can see that there is one term involving $p(s_1 \vert s_2)$, and there's one term involving $p(s_3 \vert s_2)$

So these correspond to going to $s_1$ from $s_2$ and going to $s_3$ from $s_2$

Notice that $V(s_2)$ doesn't appear on the right-hand side here because $s_2$ can't go back to itself

---

And finally for $s_3$ we know that this is just zero because it's a terminal state

$\large V(s_3) = 0$

---

Now if we look at these three equations very carefully what do we see?

$\large V(s_1) = -0.1 + 0.27V(s_1) + 0.63V(s_2)$

$\large V(s_2) = 0.34 + 0.54 V(s_1) + 0.36 V(s_3)$

$\large V(s_3) = 0$


What we can see is that it's actually a system of linear equations just like we used to study in linear algebra

We have three equations in three unknowns

So either we could solve this by substitution or we could use the matrix method

---

The first thing we can do is make this more like a system of linear equations by grouping together like terms and spacing them out appropriately

$\large 0.1 \ \ \ \ \  = -0.73V(s_1) + 0.63V(s_2)$

$\large -0.34 = \ \  \ 0.54V(s_1) - \ \ \ \ \ \ \ V(s_2) + \ \ \ 0.36V(s_3)$

$\large 0 \ \ \ \  \ \ \ \ =  \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad V(s_3)$

---

So we can see that we can neatly separate this equation into three parts 

A vector containing $V(s_1),V(s_2)$ and $V(s_3)$ 

A $3 \times 3$ Matrix to multiply by 

And a length $3$ vector of numbers that just sits by itself

$$\begin{bmatrix}
-0.73 & 0.63 & 0\\
0.54 & -1 & 0.36\\
0 & 0 & 1
\end{bmatrix}
\begin{bmatrix}
V(s_1)\\
V(s_2)\\
V(s_3)
\end{bmatrix} = 
\begin{bmatrix}
0.1\\
-0.34\\
0
\end{bmatrix}
$$

This of course takes the form of $Ax=b$ and we can solve it using the numpy function ```x = np.linalg.solve(A,b)```

As we recall, we never want to solve this using the inverse method, which would be to calculate the inverse of $A$ and multiply that by $b$, since that is slower and less numerically accurate in code
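
As a minimal sketch, the system above can be entered and solved with NumPy. The coefficients are copied from the three rearranged equations; the variable names `A`, `b` and `V` are our own choices for illustration.

```python
import numpy as np

# Coefficient matrix: one row per rearranged equation, one column per unknown
A = np.array([
    [-0.73, 0.63, 0.00],   #  0.1  = -0.73 V(s1) + 0.63 V(s2)
    [0.54, -1.00, 0.36],   # -0.34 =  0.54 V(s1) - V(s2) + 0.36 V(s3)
    [0.00, 0.00, 1.00],    #  0    =  V(s3)
])
b = np.array([0.1, -0.34, 0.0])

# Solve A x = b directly rather than forming the inverse of A
V = np.linalg.solve(A, b)
print(np.round(V, 3))
```

The entries of `V` are ordered as $V(s_1), V(s_2), V(s_3)$.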

---

The result is

$V(s_1) = 0.293$

$V(s_2) = 0.498$

$V(s_3) = 0$

As an exercise we might also want to check if these are correct by plugging them back into the original equations
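
One way to carry out that check, sketched here with the rounded results quoted above: substitute them into the right-hand side of each original equation and confirm that the left-hand side comes back, up to rounding. The tolerance of `1e-3` is our own assumption, chosen to match three decimal places.

```python
import math

# Rounded results from the solution above
V1, V2, V3 = 0.293, 0.498, 0.0

# Right-hand sides of the three original Bellman equations
rhs1 = -0.1 + 0.27 * V1 + 0.63 * V2
rhs2 = 0.34 + 0.54 * V1 + 0.36 * V3
rhs3 = 0.0

for name, lhs, rhs in [("V(s1)", V1, rhs1), ("V(s2)", V2, rhs2), ("V(s3)", V3, rhs3)]:
    consistent = math.isclose(lhs, rhs, abs_tol=1e-3)
    print(f"{name}: lhs={lhs:.3f} rhs={rhs:.3f} consistent={consistent}")
```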

---

<h3>Summary</h3>

So hopefully this section gave us an idea of how the Bellman equation can be reasoned about and used

We've seen that for some simple cases, solving for the value function is no more than just working backwards up a tree

We've also seen for slightly more complex cases, that approach doesn't work but the Bellman equation is general enough such that it can still represent them

We just need slightly more sophisticated techniques to solve them 

And just to give ourselves a big picture view of the following notebooks, the following notebooks are completely and 100% devoted to just one thing

How to Solve the Bellman equation 

From a bird's eye view, that's what reinforcement learning is 

Just like in regular supervised machine learning, we want to get away from the idea that we need to do an example on a finance dataset, then an example on some health data, then work with some house price data and so on

We know that all data is the same and what the algorithm sees is just a table of numbers

The real work is figuring out how the algorithm works and coding it 

So the following notebooks are devoted to just that

In this section we hope to give ourselves a sense that what we need to solve for the value function is really a better algorithm

We saw that the idea of creating a linear system of equations and then calling ```np.linalg.solve``` was pretty general but we know that that's not a scalable solution

So the next few notebooks are going to answer: what are scalable solutions to this problem, what assumptions do they make, and what are their advantages and disadvantages?

<h1>Math</h1>

In this section we are going to answer the question, How can we find the best policy and the best value function?

---

<h3>The Best Policy / The Best Value Function</h3>

Before we start we should think about why this is important

This brings us back to the big picture, in case we've been getting too far into the details of the math

Remember that our goal as the machine learning engineer is to create an intelligent agent

We consider an intelligent agent one that learns to achieve the maximum reward in a game

We've refined this definition to say that the agent should achieve the maximum expected sum of future rewards in a game

One might think of that as the best possible value function

But remember the value function is dependent on what policy we are following

Thus we would like to find some policy that leads us to achieve this best possible value function

It makes sense to call that policy the best policy 

When we achieve that, we might say we've created an intelligent agent, now that might sound abstract but it's really not

Imagine we create a computer program that plays a video game

Obviously a program that does nothing but move around randomly would not be considered very good

On the other hand a program that results in beating the video game at superhuman speed would be considered very intelligent and impressive

And so it's sort of an obvious statement, a good policy is one that displays the desired behavior

---

<h3>Best Value Function</h3>

Let's now discuss more precisely what it means for one value function to be better than another

Of course we must have some notion of better, some notion of order, in order to consider one of these value functions to be the best

The problem with this is that the value function is not just a single number 

In machine learning, it's simple because our objective is usually framed in terms of a single scalar loss 

Obviously a loss of 10 is better than a loss of 20

But the value is given for each state in an environment

---

<h3>How to Compare Policies and Values</h3>

The answer to this is that, a policy can only be considered greater than or equal to another policy if its value function is greater than or equal to the other value function for every state in the state space 

$$\large \pi_1 \ge \pi_2 \text{ iff } V_{\pi_1}(s) \ge V_{\pi_2}(s) \ \ \  \forall_s \in S$$

$\large S = \text{ state space}$

That is there can be no exceptions

It must be better than or at least the same over all states for us to be able to use this greater than or equal to symbol

So this implies that, it's possible to come up with two policies such that policy one is not greater than or equal to policy two but simultaneously policy two is not greater than or equal to policy one

---

<h3>Example</h3>

Here's a simple example with only two states

Imagine that under policy one, $V(s_1) = 2$ and $V(s_2) = 1$ 

But under policy two, $V(s_1) = 1$ and $V(s_2) = 2$

<img src='extras/54.32.PNG' width='400'></img>

$\require{cancel}$

In this case $\pi_1$ is not greater than or equal to $\pi_2$, $\pi_1 \cancel{\ge} \pi_2$, but also $\pi_2$ is not greater than or equal to $\pi_1$, $\pi_2 \cancel{\ge} \pi_1$

---

Here's another example, again with two states 

Under policy one, $V(s_1)=2$ and $V(s_2)=3$ 

Under policy two, $V(s_1)=2$ and $V(s_2)=1$

<img src='extras/54.33.PNG' width='400'></img>

Since under policy one the value is greater than or equal to the value under policy two for all states, we can see that $\pi_1 \ge \pi_2$
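
The comparison rule can be sketched in a few lines of Python. The helper name `at_least_as_good` is our own, and each value function is represented as a plain list with one entry per state, which is an assumption made for illustration.

```python
import numpy as np

def at_least_as_good(V_a, V_b):
    """Return True if V_a(s) >= V_b(s) for every state s."""
    return bool(np.all(np.asarray(V_a) >= np.asarray(V_b)))

# First example: neither policy is greater than or equal to the other
V_pi1, V_pi2 = [2, 1], [1, 2]
print(at_least_as_good(V_pi1, V_pi2), at_least_as_good(V_pi2, V_pi1))  # False False

# Second example: policy one is at least as good in every state
V_pi1, V_pi2 = [2, 3], [2, 1]
print(at_least_as_good(V_pi1, V_pi2), at_least_as_good(V_pi2, V_pi1))  # True False
```

Because the test must hold in every state, this is only a partial ordering: both calls can return `False` for the same pair of policies.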

---

<h3>Optimal Policy and Optimal Value Function</h3>

Now that we have some way of comparing two policies, let's discuss what it means to have an optimal policy and an optimal value function

Both of these can be defined in terms of a state value, $V(s)$

In particular the optimal state value function is

$$\large V^*(s) = \max_\pi V_\pi(s) \ \ \forall_s \in S$$

As we can see, we call this particular state value function $V^*$ 

Similarly the optimal policy $\pi^*$ is the arg max of the same expression 

$$\large \pi^* = \arg \max_\pi V_\pi(s) \ \ \forall_s \in S$$

In the same way we can define the optimal action value $Q^*$ as 

$$\large Q^*(s,a) = \max_\pi Q_\pi(s,a) \ \forall_s \in S, \ \forall_a \in A$$

$\large A = \text{ action space}$

---

<h3>Naive Approach</h3>

Knowing these definitions let's consider a naive approach at finding the optimal policy in a given environment

In fact it's quite simple

All we need is a simple for loop

Let's assume we know how to evaluate a policy,  that is given a policy we find its value function

We confirmed earlier that this is nothing but solving a system of linear equations

Well then all we need to do is loop through all possible policies and for each policy evaluate it

In other words find its $V(s)$ 

If the value of the current policy is better than the current best value, replace the current best value with the new value and replace the current best policy with the new policy

By the end of this loop, we will have found the optimal policy

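```python
import numpy as np

def find_best_policy(all_possible_policies, evaluate, n_states):
    """Brute-force search; evaluate(pi) must return an array V with one value per state."""
    pi_best = None
    V_best = np.full(n_states, -np.inf)  # V_best[s] = -inf for all s in S
    for pi in all_possible_policies:
        V = evaluate(pi)
        # pi replaces the current best only if it is at least as good in every state
        if np.all(V >= V_best):
            V_best = V
            pi_best = pi
    return pi_best
```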
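
To see the loop above run end to end, here is a sketch on a hypothetical toy MDP (not the Gridworld). Everything about this MDP is an assumption made for illustration: two non-terminal states, one terminal state, two actions, deterministic transitions, and $\gamma = 0.9$. Each policy is evaluated by solving its linear system, as discussed earlier.

```python
from itertools import product

import numpy as np

GAMMA = 0.9
N_STATES, N_ACTIONS = 3, 2  # hypothetical toy MDP; state 2 is terminal

# NEXT_STATE[s, a] and REWARD[s, a] for the two non-terminal states
NEXT_STATE = np.array([[1, 2], [0, 2]])
REWARD = np.array([[0.0, 0.5], [0.0, 1.0]])

def evaluate(pi):
    """Solve (I - gamma * P_pi) V = r_pi for a deterministic policy pi."""
    P = np.zeros((N_STATES, N_STATES))
    r = np.zeros(N_STATES)
    for s, a in enumerate(pi):
        P[s, NEXT_STATE[s, a]] = 1.0
        r[s] = REWARD[s, a]
    # The terminal row of P stays zero, so V(terminal) = 0
    return np.linalg.solve(np.eye(N_STATES) - GAMMA * P, r)

pi_best = None
V_best = np.full(N_STATES, -np.inf)
# A policy is a tuple holding one action per non-terminal state
for pi in product(range(N_ACTIONS), repeat=2):
    V = evaluate(pi)
    if np.all(V >= V_best):
        V_best, pi_best = V, pi

print(pi_best, np.round(V_best, 3))
```

With only $2^2 = 4$ policies the loop finishes instantly and should select the policy `(0, 1)`: move from state 0 to state 1, then take the larger terminal reward.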

---

<h3>Problem</h3>

What is the problem with the previous approach?

Well we have to ask how many possible policies are there?

Let's assume we have a discrete finite action space $A$, and a discrete finite state space $S$

This is like saying we have $\vert S \vert$ buckets and we want to know the number of possible ways we can fill these buckets with any of the actions in $A$

That's equal to the cardinality of $A$ to the power of the cardinality of $S$

$$\large \text{Number of possible policies: }\vert A \vert^{\vert S \vert }$$

In other words this grows exponentially with the size of the state space, which makes the problem intractable

Even with our simple grid world example we have 11 possible states that the agent can occupy and 4 possible actions, $4^{11}$ is almost $4.2$ million
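
A quick sketch of the count, using the numbers quoted above. The small enumeration with `itertools.product` is our own illustration of what "all possible policies" means for a problem tiny enough to list.

```python
from itertools import product

# Gridworld figures from the text: 11 states, 4 actions
n_states, n_actions = 11, 4
n_policies = n_actions ** n_states
print(n_policies)  # 4194304

# Listing every policy is only feasible for tiny problems, e.g. 3 states and 2 actions
small_policies = list(product(range(2), repeat=3))
print(len(small_policies))  # 8
```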

---

<h3>Uniqueness</h3>

One interesting rule to remember is that the optimal value function is unique but the optimal policy is not unique 

For a simple example of this, consider our GridWorld game

<img src='extras/54.34.PNG' width='350'></img>

Remember that from the starting position there are two alternate paths to the winning state each with equal length

Thus even with discounting, the value of the initial state is the same no matter which of these two paths we take

This is an example of where the optimal value is unique but the optimal policy is not

---

<h3>Bellman Optimality Equation</h3>

So how can the Bellman equation help us find a good policy?

Let's consider again the standard Bellman equation

It says that the value of the state $s$ is the expected return summed over all possible values of the action $a$ 

But since we're now trying to find the best policy, it doesn't make sense to sum over all possible actions

Instead what we would like to do is pick the best action 

As we previously noted, the best action is the action that yields the best future reward 

We can write this simply by removing the sum over $\pi(a \vert s)$ and replacing it with the $\max$ over $a$ 

We call this the Bellman optimality equation

It tells us how $V^*(s)$ can be described recursively when we are following the optimal policy which is to take the action $a$ that always leads to the maximum expected future reward
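
Written out in the notation used elsewhere in this notebook (this is our reconstruction of the two equations being described), the standard Bellman equation for a policy $\pi$ is

$$\large V_\pi(s) = \sum_a \pi(a \vert s) \sum_{s^\prime} \sum_r p(s^\prime,r \vert s,a) \left\{r + \gamma V_\pi(s^\prime)\right\}$$

and replacing the weighted sum over actions with a $\max$ gives the Bellman optimality equation

$$\large \boxed{V^*(s) = \max_a \sum_{s^\prime} \sum_r p(s^\prime,r \vert s,a) \left\{r + \gamma V^*(s^\prime)\right\}}$$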

---

<h3>Bellman Optimality Equation for Action-Value</h3>

We can do a similar thing for the action value

Recall that the action value can be described using the Bellman equation as follows

$$\large Q_\pi(s,a) = E_\pi \left[R(t+1) + \gamma \sum_{a^\prime} \pi(a^\prime \vert s^\prime) Q_\pi(s^\prime,a^\prime) \vert S_t = s, A_t = a\right]$$

$$\large Q_\pi(s,a) = \sum_{s^\prime}\sum_r p(s^\prime,r \vert s,a) \left\{ r + \gamma \sum_{a^\prime} \pi(a^\prime \vert s^\prime) Q_\pi(s^\prime,a^\prime)\right\}$$

$$\large \boxed{Q^*(s,a) = \sum_{s^\prime}\sum_r p(s^\prime,r \vert s,a) \left\{ r + \gamma \max_{a^\prime} Q^*(s^\prime,a^\prime)\right\}}$$

Note that there is no sum over the action $a$ on the outside and thus this is a little different from our description of the optimal state value

What we can recognize however is that there is a sum over the next action $a^\prime$ and thus the optimal action value is the one that takes the max over $a^\prime$ inside the expected value

We call this the Bellman optimality equation for action values

As we'll see later this equation is very handy when we start talking about Q-learning

---

<h3>Optimal State-Value vs Optimal Action-Value</h3>

Using the definition of the optimal state value and the optimal action value, we can relate the optimal state value to the optimal action value as follows 

$$\large V^*(s) = \max_a Q^*(s,a)$$

The optimal state value at state $s$ is simply the max over all possible actions $a$ of the action value at state $s$ and action $a$

If we take the Bellman optimality equation for $Q^*$

$$\large Q^*(s,a) = \sum_{s^\prime}\sum_r p(s^\prime,r \vert s,a) \left\{ r + \gamma \max_{a^\prime} Q^*(s^\prime,a^\prime)\right\}$$

we can see that it's possible to replace the inner $\max_{a^\prime} Q^*(s^\prime,a^\prime)$ with $V^*(s^\prime)$

$$\large Q^*(s,a) = \sum_{s^\prime}\sum_r p(s^\prime,r \vert s,a) \left\{ r + \gamma V^*(s^\prime)\right\}$$

And thus this is another way to relate $Q^*$ and $V^*$ in the same equation

And of course if we take the max over the current action $a$ on both sides of the equation

$$\large \max_a Q^*(s,a) = \max_a \sum_{s^\prime}\sum_r p(s^\prime,r \vert s,a) \left\{ r + \gamma V^*(s^\prime)\right\}$$


we just recover the Bellman optimality equation for $V^*$

$$\large V^*(s) = \max_a \sum_{s^\prime} \sum_r p(s^\prime,r \vert s,a) \left\{r + \gamma V^*(s^\prime)\right\}$$

<h1>Math</h1>

Later on in the following notebooks we're going to look at an algorithm that allows us to find the optimal value function $V^*$

For now let's assume that there is some method of finding $V^*$ and $Q^*$, we just don't know what these methods might be yet

One interesting question to ask is how do we use $V^*$ or $Q^*$ to find the optimal policy $\pi^*$

---

<h3>Finding the Optimal Policy</h3>

Remember that this is the real goal of reinforcement learning

We don't care about $V^*$ or $Q^*$ so much, we really just want to create an intelligent agent that behaves optimally and of course that behavior is described by $\pi^*$

So finding $V^*$ and $Q^*$ is just a means to an end

---

<h3>Using the Bellman Optimality Equation</h3>

So let's say we have $V^*$

How do we find $\pi^*$?

We can again apply the Bellman optimality equation

$V^*$ is defined recursively by taking the max over all possible actions $a$

$$\large V^*(s) = \max_a \sum_{s^\prime} \sum_r p(s^\prime,r \vert s,a) \left\{ r + \gamma V^*(s^\prime)\right\}$$

Well it makes sense that whatever action $a$ leads to the max will be the best action to take when we are in state $s$

Therefore the optimal policy is to simply take this action $a$ whatever that is 

And hence to find the optimal policy, we just take the $\arg \max$ instead of the $\max$

$$\large \pi^*(s) = \arg \max_a \sum_{s^\prime} \sum_r p(s^\prime,r \vert s,a) \{r + \gamma V^*(s^\prime)\}$$
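
As a sketch of what this $\arg \max$ involves in code, assume we already have $V^*$ and a full model of the environment. The toy MDP below is hypothetical (two non-terminal states, one terminal state, two actions), and the arrays `P` and `R` are our own representation of $p(s^\prime \vert s,a)$ and the reward. Since the reward here is deterministic given $(s, a, s^\prime)$, the sum over $r$ collapses to a single term.

```python
import numpy as np

GAMMA = 0.9
N_STATES, N_ACTIONS = 3, 2  # hypothetical toy MDP; state 2 is terminal

# P[s, a, s2] = p(s2 | s, a) and R[s, a, s2] = reward for that transition
P = np.zeros((N_STATES, N_ACTIONS, N_STATES))
R = np.zeros((N_STATES, N_ACTIONS, N_STATES))
P[0, 0, 1] = 1.0  # state 0, action 0: go to state 1, reward 0
P[0, 1, 2] = 1.0  # state 0, action 1: go to the terminal state
R[0, 1, 2] = 0.5
P[1, 0, 0] = 1.0  # state 1, action 0: go back to state 0, reward 0
P[1, 1, 2] = 1.0  # state 1, action 1: go to the terminal state
R[1, 1, 2] = 1.0

V_star = np.array([0.9, 1.0, 0.0])  # assumed to be known already

# One-step lookahead: expected value of r + gamma * V*(s') for every (s, a)
lookahead = np.sum(P * (R + GAMMA * V_star), axis=2)

# Optimal action in each non-terminal state
pi_star = np.argmax(lookahead[:2], axis=1)
print(pi_star)
```

Note that the lookahead needs the transition probabilities `P`; without a model of the environment this computation is not available.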

---

<h3>Using the Action-Value </h3>

Now let's consider what we will do if we have $Q^*$ 

Remember that $Q^*$ is already conditioned on the action $a$, since $a$ is an argument into the $Q$ table

In fact we can use what we already worked out in the previous subsection, since the expression inside the $\arg \max$ there is exactly $Q^*(s,a)$

$$\large Q^*(s,a) = \sum_{s^\prime}\sum_{r} p(s^\prime,r \vert s,a) \{ r + \gamma V^*(s^\prime)\}$$


Thus the action we want to choose if we are in state $s$ is just the $\arg \max$ over all actions $a$ of $Q^*(s,a)$

$$\large \pi^*(s) = \arg \max_a Q^*(s,a)$$
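
In code this is a one-liner over a table. The sketch below assumes $Q^*$ is stored as a NumPy array with one row per state and one column per action; the numbers are hypothetical.

```python
import numpy as np

# Hypothetical Q* table: rows are states, columns are actions
Q_star = np.array([
    [0.90, 0.50],
    [0.81, 1.00],
])

pi_star = np.argmax(Q_star, axis=1)  # best action in each state
V_star = np.max(Q_star, axis=1)      # V*(s) = max_a Q*(s, a)
print(pi_star, V_star)
```

No transition probabilities or rewards are needed here, only the table itself.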

---

<h3>Which is better?</h3>

At this point we want to answer a question which might not seem obvious to ask 

Which is better, using $V^*$ or using $Q^*$

Since we're using lots of math and lots of symbols, it's easy to forget that these correspond to real world programming problems

Let's compare how we find the optimal policy for $V^*$ compared with $Q^*$

$$\large \pi^*(s) = \arg \max_a \sum_{s^\prime} \sum_r p(s^\prime,r \vert s,a) \{r + \gamma V^*(s^\prime)\}$$

$$\large \pi^*(s) = \arg \max_a Q^*(s,a)$$

Firstly it's clear that the expression when we have $Q^*$ is much simpler

It's just the $\arg \max$ over some $Q$ table 

The expression for $V^*$ is more complicated

We have to sum over two random variables the next state $s^\prime$ and the reward $r$

This might be computationally expensive

If we imagine what we might do when we're playing an actual game in a computer, this might involve having to actually perform an action $a$ to see which next state $s^\prime$ we end up in and of course that's not always possible in a real game like chess

If we do an action and we end up in the next state $s^\prime$, it's not possible to go back in time and pretend we're in state $s$ and we never made that move

---

<h3>Pattern</h3>

The reason we want to mention this now is that this theme is going to be repeated throughout the rest of the following notebooks

The pattern that the next few notebooks are going to follow is like this

For each algorithm we discuss we're going to be answering the questions

<ul>
    <li>How do we find $V$ for a given policy</li>
    <li>How do we find the best policy</li>
</ul>

As we've seen, finding the best policy is much easier when we have $Q$ compared to when we have $V$

We call the act of finding the value function for a given policy the evaluation problem or the prediction problem

We call the act of finding the optimal policy the control problem

In most cases this will involve finding $Q$ 

Thus it's the case that most of the time, the evaluation problem will be associated with finding $V$ whereas the control problem will be associated with finding $Q$ or more precisely $Q^*$

<h1>Math</h1>

In this section we are going to summarise everything we learned in this notebook

---

<h3>Notebook Summary</h3>

This notebook was all about Markov decision processes or MDPs

We started this notebook by discussing the Gridworld environment which we will be using throughout the following notebooks since it is an excellent tool for helping us understand reinforcement learning

Next we used this notebook to define many of the terms that we will be using throughout reinforcement learning

There were quite a few terms to learn, more so than a typical notebook on supervised or  unsupervised learning

We looked at states, actions, rewards, terminal states, episodes, state spaces, action spaces, policies, values and more

We reviewed the Markov property and extended the basic Markov model to Markov decision processes

We saw how using the framework of MDPs we could derive a few more concepts

We recognized that our goal is to maximize the sum of future rewards and that this is a random variable

We named this the return 

Since the return is random, our goal is actually to maximize the expected return which we call the value

We looked at two different kinds of values the state value and the action value

From this we were able to derive the Bellman equation which is a recursive equation describing the value function

This also led us to the notion of optimality for both values and policies

Our goal in reinforcement learning is to find an optimal policy which has an associated value which is the optimal value

---

<h3>Where is the Code?</h3>

We want to remind ourselves that the study of MDPs is by nature theoretical

This is why there was no coding to be done in this notebook

In every one of the subsequent notebooks we will be focusing on specific algorithms that will help us find solutions to the prediction problem and the control problem