# Chapter 3: Finite Markov Decision Processes

### Exercise 3.1

Q: Devise three example tasks of your own that fit into the MDP framework, identifying for each its states, actions, and rewards. Make the three examples as *different* from each other as possible. The framework is abstract and flexible and can be applied in many different ways. Stretch its limits in some way in at least one of your examples.

A: First, let us recall the finite, discrete-time MDP framework as it is defined in the textbook.

An MDP consists of three processes that evolve in discrete time:

* A state process, $S_t \in \mathcal{S}$, describing the evolution of the state that the environment is in (as represented to the agent). The set of all possible states, $\mathcal{S}$, is called the *state space* for the MDP, and for this chapter is assumed to be finite.
* A sequence of actions $A_t$ that the agent takes at each time step $t$. At each time step $t$, $A_t \in \mathcal{A}(S_t)$, where $\mathcal{A}(s)$ is the *action space* representing all available actions when the environment is in state $s \in \mathcal{S}$. When the same set of actions is available irrespective of state, we may denote the action space by $\mathcal{A}$.
* A sequence of rewards $R_t \in \mathcal{R} \subseteq \mathbb{R}$ received by the agent. The book uses the convention that $R_t$ is the reward received at the *start* of a time step, i.e. as a consequence of the agent's *previous* action (at time step $t - 1$). However, it is also possible to define $R_t$ as the reward received at the *end* of a time step, as a consequence of the action taken at that same time step $t$.

The object that turns these three components into a Markov Decision Process is the *dynamics function* $p : \mathcal{S} \times \mathcal{R} \times \mathcal{S} \times \mathcal{A} \to [0, 1]$ that specifies the probability distribution for $(S_t, R_t)$ conditional on the state of the environment and the action taken by the agent at the previous step, $(S_{t - 1}, A_{t - 1})$. In other words

$$p(s', r \mid s, a) \doteq \operatorname{Pr}\left[S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\right].$$
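
As a minimal sketch of how the four-argument dynamics function might be represented in code, the block below stores $p$ for a small two-state MDP as a nested dictionary, checks that each conditional distribution sums to one, and samples a single transition. The state names, action names, rewards and probabilities are all invented for illustration; they do not come from the textbook.

```python
import random

# Hypothetical dynamics p(s', r | s, a), stored as (s, a) -> {(s', r): probability}.
dynamics = {
    ("cold", "heat"): {("warm", -1.0): 0.8, ("cold", -1.0): 0.2},
    ("cold", "idle"): {("cold", 0.0): 1.0},
    ("warm", "heat"): {("warm", -1.0): 1.0},
    ("warm", "idle"): {("warm", 1.0): 0.7, ("cold", 0.0): 0.3},
}


def p(s_next, r, s, a):
    """Four-argument dynamics function p(s', r | s, a)."""
    return dynamics[(s, a)].get((s_next, r), 0.0)


def step(s, a, rng=random):
    """Sample (S_t, R_t) given S_{t-1} = s and A_{t-1} = a."""
    outcomes = list(dynamics[(s, a)])
    weights = [dynamics[(s, a)][o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


# Each conditional distribution must sum to one over all (s', r) pairs.
for (s, a), dist in dynamics.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-12

print(p("warm", -1.0, "cold", "heat"))  # 0.8
print(step("cold", "idle"))             # always ('cold', 0.0)
```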

More generally, one could extend the framework to cover (manageably) infinite state and action spaces, or problems that play out in continuous time.

Three examples of MDPs, ranging from practical to whimsical, are as follows:
1. An intelligent building climate control system. At any point, the state signal may correspond to: the building's current occupancy (and perhaps even locations of occupants); internal temperature sensor readings; external temperature, windspeed and solar illumination; perhaps even extending to the system's memory of past occupancy patterns, occupants' calendars and weather forecasts. The action space may correspond to: opening / closing valves on individual radiators; opening / closing windows; or lowering / raising blinds. The reward received by the system would be a weighted sum of the deviation between actual climate conditions and preferred climate conditions specified by occupants, a measure of the energy consumption (or perhaps CO<sub>2</sub> emissions rate), and perhaps a penalty that applies whenever an occupant overrides the system (e.g. by manually adjusting a radiator, window or blind).
2. A customer relationship management system that aims to increase the value a given customer brings to a business. The state signal may consist of transactions the customer makes with the business, website / mobile app usage data attributable to the customer, external data relating to the customer's spending / interactions with competitors, economic or market data on general consumption / demand patterns for the business's products. The action space may consist of written communications (post, email, SMS, social media, etc), perhaps even composition of the wording or decisions over the format / layout, promotions or other customisation of products offered to the customer. The reward signal at each time step could consist of the estimated change in the customer's lifetime value less the costs (e.g. marketing costs or promotional costs) incurred by the action last taken.
3. A government. The state signal would consist of all the economic, behavioural, polling and forecasting data available to decision makers. The action space consists of the range of legislative changes, fiscal and monetary policy changes, subsidies, service provision / curtailment, capital expenditure and other decisions available to the government. The reward could be the change, say from one month to the next, in some measure of the welfare of the population (the choice of measure depending on one's political leaning!). Or, more cynically, the reward could be zero except when there is an election (where it may be 1 if the incumbent governing party wins the election or -1 if it loses power).

### Exercise 3.2

Q: Is the MDP framework adequate to usefully represent *all* goal-directed learning tasks? Can you think of any clear exceptions?

A: The MDP framework as defined above (and in the textbook) has some restrictions that could limit its universality. Let's consider some of these restrictions and the types of problems that may therefore fall outside of this framework.

* The framework (implicitly) assumes that the agent is able to observe the state of the environment. In practice, it may only see limited information about the environment, or its observations of the environment may be corrupted by noise. This would apply, for example, to any agent that relies on (fallible) sensors to perceive the environment, or in many situations where an agent's observations of state are limited (e.g. by walls, unavailable data, or other limitations on its ability to perceive). However, it is unclear whether this is a strict limitation of the MDP framework: one could simply define the state signal to be whatever information the agent *does* receive about the environment, or perhaps to be the state of the agent's beliefs about the environment. Perhaps the pertinent point is that partial observability may limit how *usefully* the MDP framework can be applied to such problems (even if, in some sense, the MDP framework can technically be made to fit such problems).
* The framework assumes that the probability distribution for the next state $s'$ only depends on the previous state $s$ and previous action $a$. The fact that it only depends on the previous state $s$ is not too restrictive, as we could simply expand the definition of environment state to include as much of a memory of previous states as required to be able to usefully predict the future. However, the dependency on just the previous action $a$ may be more restrictive: it seems to preclude problems where actions may have delayed effects, or even adversarial problems where the environment may actively react to an agent's past behaviour when determining future state transitions. Again, technically, it is possible to expand the definition of an environment's state to include a history of the agent's previous actions and therefore incorporate such problems within the MDP framework. However, practically, it is unclear how useful this extension would be, as the dynamics function will effectively change over time (possibly adversarially): standard algorithms for solving MDPs may not be the most effective for learning a good policy under such circumstances.
* The framework assumes that the dynamics function is stationary. This is actually still quite flexible: statistically stationary systems can still evolve over time, just according to (ultimate) parameters that are constant. A bigger challenge is where the dynamics depend on an external time-varying input (which may not be known to the agent). Technically, we could incorporate time itself within the representation of state: if the time-variability is predictable, this may help the agent learn how to deal with it. However, from the perspective of learnability, it may not be practical to learn a good policy for a problem where the rules governing the system vary wildly and unpredictably over time. (Compare to the problem of collaborating with a highly unpredictable or volatile colleague!)

### Exercise 3.3

Q: Consider the problem of driving. You could define the actions in terms of the accelerator, steering wheel, and brake, that is, where your body meets the machine. Or you could define them farther out — say, where the rubber meets the road, considering your actions to be tire torques. Or you could define them farther in — say, where your brain meets your body, the actions being muscle twitches to control your limbs. Or you could go to a really high level and say that your actions are your choices *where* to drive. What is the right level, the right place to draw the line between agent and environment? On what basis is one location of the line to be preferred over another? Is there any fundamental reason for preferring one location over another, or is it a free choice?

A: In my view, the "right" level depends on where the line falls between what an agent can be assumed to have complete control over and what the agent can only seek to influence. Another perspective is that the same actor may be trying to solve a hierarchy of problems (e.g. navigate across a city to the office *and* drive the car safely and comfortably in the moment *and* control one's muscles to produce desired movements in one's limbs): at each level of the hierarchy one could consider an agent specific to that level that receives instructions from a higher level and determines actions in the form of instructions to a lower level agent. (At the lowest levels, an agent may not be needed at all, as a simple mechanical device could effect the desired change.)

First, let us contrast the option of drawing the interface at the driving controls (accelerator, brake, steering wheel) versus considering actions to be tire torques. The latter approach implicitly assumes that the agent is able to specify actions as precise (changes to?) tire torques. In other words, irrespective of external conditions (e.g. the slope, weather, road surface, tire inflation and temperature, etc), it assumes that the agent knows exactly how to, say, increase the torque on the front right wheel by exactly 1 newton-meter. Practically, that seems unrealistic: as a driver I feel I can precisely control the position of the steering wheel, pressure on the brake pedal, etc; while I have some sense of how this will translate into the car's behaviour on the road, this is not exact and in fact depends on many external factors.

Next, let us contrast the option of drawing the interface at the driving controls versus as electrical signals to my muscles (or, if I were to replace myself with a humanoid robot, to the robot's actuators). If I am not sure exactly how to get my arms to turn the steering wheel at a desired rate, then this would seem reasonable. In fact, when I was first learning to drive a car (or perhaps even now, when driving an unfamiliar vehicle) this may be a sensible choice: I may not yet have a feel for how hard I need to pull the steering wheel to have the desired effect. But for most purposes, this seems to be an overcomplication of the problem: I have a pretty good sense of how to control my limbs, and it would simplify the learning challenge if I take that as a given and concentrate on driving at the more abstract level of operating the car's controls. Note however, that my authority over the car's controls is not absolute: a water bottle could roll under the brake pedal, or I could suddenly get cramp in a leg — so there may be circumstances where if I frame actions purely in terms of changes to the car's controls, I may not be able to realise those actions.

Finally, let us contrast the option of drawing the interface at the driving controls versus at the level of deciding *where* to drive (e.g. drive along the motorway until junction 8, take the first exit at the roundabout, etc). This choice really depends on the level of the problem hierarchy we are dealing with. Once we have assumed: (a) that we have a low level agent who is effectively able to use electrical signals to control the movement of my limbs; (b) a mid-level agent who is effectively able to operate the car's controls to control the car's immediate velocity, rate of acceleration, turn, etc; then the next level of the hierarchy is to use that mid-level agent to get us from A to B. For this purpose, we may consider a high level agent (i.e. a navigator, which could be myself, a passenger or even a sat-nav system) that gives control signals in the form of high-level driving instructions.


### Exercise 3.4

Q: Give a table analogous to that in Example 3.3, but for $p(s', r \mid s, a)$.

A: Example 3.3 in the textbook describes an MDP describing a recycling robot that searches for cans to recycle. The question asks to calculate $p(s', r \mid s, a)$ from $p(s' \mid s, a)$ and $r(s, a, s')$ which are set out in the book.

This is not a particularly enlightening exercise, so we will skip answering it explicitly here. But one interesting point to note is that, technically speaking, $p(s', r \mid s, a)$ contains more information than the other two functions, and hence can't be reconstructed without making additional assumptions. Specifically, if rewards are random (given $s$, $a$, and $s'$), then the function $r$ only gives the expected value of these rewards which, on its own, is insufficient to reconstruct the full joint distribution.

On the other hand though, it is only the expected reward that is important for determining the optimal policy (at least, if the goal being optimised is to maximise the expected, discounted future reward). Hence, we may as well assume that rewards are deterministic in order to reconstruct $p(s', r \mid s, a)$ even if that is not really the case. If we make this assumption, then the following formula immediately follows:

$$p(s', r \mid s, a) = 𝟙_{r = r(s, a, s')} \, p(s' \mid s, a).$$
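
Under that deterministic-reward assumption, the table the question asks for can be generated mechanically. The sketch below applies the formula above to the transition table of Example 3.3; the numeric values chosen for $\alpha$, $\beta$, $r_\text{search}$ and $r_\text{wait}$ are placeholders of my own (the book leaves them symbolic), and transitions with zero probability are omitted.

```python
# Assumed numeric values for the symbolic parameters of Example 3.3.
ALPHA, BETA = 0.9, 0.6
R_SEARCH, R_WAIT = 2.0, 1.0

# (s, a, s') -> (p(s' | s, a), r(s, a, s')), following the book's table.
transitions = {
    ("high", "search", "high"): (ALPHA, R_SEARCH),
    ("high", "search", "low"): (1 - ALPHA, R_SEARCH),
    ("low", "search", "high"): (1 - BETA, -3.0),
    ("low", "search", "low"): (BETA, R_SEARCH),
    ("high", "wait", "high"): (1.0, R_WAIT),
    ("low", "wait", "low"): (1.0, R_WAIT),
    ("low", "recharge", "high"): (1.0, 0.0),
}

# p(s', r | s, a) = 1[r = r(s, a, s')] * p(s' | s, a), keyed by (s, a, s', r).
four_arg_p = {
    (s, a, s_next, r): prob
    for (s, a, s_next), (prob, r) in transitions.items()
    if prob > 0
}

print("s     a         s'    r     p(s', r | s, a)")
for (s, a, s_next, r), prob in sorted(four_arg_p.items()):
    print(f"{s:5} {a:9} {s_next:5} {r:5.1f} {prob:.2f}")
```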


I imagine this is why (as the authors point out at the end of the chapter) it is more conventional to specify an MDP in terms of the state transition function $p(s' \mid s, a)$ and expected reward function $r(s, a, s')$ instead of dealing directly with $p(s', r \mid s, a)$: the latter contains information that is superfluous to solving the MDP.

### Exercise 3.5

Q: The equations in Section 3.1 are for the continuing case and need to be modified (very slightly) to apply to episodic tasks. Show that you know the modifications needed by giving the modified version of (3.3).

A: The key difference with an episodic task is that there is an absorbing (terminal) state and, once it is reached, the task ends. Note that there is no need to have more than a single terminal state in the MDP. (Even if, conceptually, the task could end in many ways, all that matters for learning a policy is the reward collected before the task finally ends.) Therefore, for episodic tasks, we denote by $\mathcal{S}$ all states excluding the terminal state, and by $\mathcal{S}^+$ all states including the terminal state.

Hence, the dynamics function is now a mapping $p: \mathcal{S}^+ \times \mathcal{R} \times \mathcal{S} \times \mathcal{A} \to [0, 1]$. (I.e. the domain of the first argument is now $\mathcal{S}^+$ instead of $\mathcal{S}$.) Nevertheless, $p$ essentially has the same meaning for both continuing and episodic tasks:

$$ p(s', r \mid s, a) \doteq \operatorname{Pr}\left[S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a \right].$$

Turning back to the question, the only difference is that $p$ now satisfies the normalisation property

$$\sum_{s' \in \mathcal{S}^+} \sum_{r \in \mathcal{R}} p(s', r \mid s, a) = 1$$

instead of this equation's counterpart (3.3) in the textbook.

### Exercise 3.6

Q: Suppose you treated pole-balancing as an episodic task but also used discounting, with all rewards zero except for -1 upon failure. What then would the return be at each time? How does this return differ from that in the discounted, continuing formulation of this task?

A: Recall that this task is about vertically balancing a pole hinged to a cart by applying horizontal forces to the cart. Failure occurs if the pole leans more than a certain angle or if the cart runs off the track; the pole and cart position are subsequently reset.

Treating this as a continuing task, the return at time $t$ is therefore given by

$$ G_t \doteq \sum_{i = 1}^{\infty} \gamma^{i - 1} R_{t + i} = - \sum_{e=1}^{\infty} 𝟙_{T_e \geq t} \, \gamma^{T_e - t}$$

where $T_e$ is the time step corresponding to the $e^\text{th}$ failure.

On the other hand, treating this as an episodic task, the return becomes

$$ G_t = - \gamma^{T_{e'(t)} - t}$$

where $e'(t)$ is the episode number corresponding to the next failure, i.e. satisfying the conditions $T_{e'(t)} \geq t$ and $T_{e'(t) - 1} < t$.

Simply put, under the continuing task, the return at time step $t$ depends on the time to failure for *all* future failures, whereas, under the episodic task, the return at time step $t$ depends only on the time remaining to the very next failure. As mentioned in the book though, in both cases, the key to maximising the return is to maximise the time to (every) failure, i.e. to avoid failure for as long as possible.
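
To make the contrast concrete, here is a small sketch that computes both returns from the same reward stream. The stream is invented for illustration (failures arriving 3 and 7 steps ahead), and the continuing stream is truncated after the second failure, so the continuing return shown is only an approximation of the infinite sum. A reward arriving as $R_{t+i}$ is discounted by $\gamma^{i-1}$, as in the definition of $G_t$ above.

```python
def discounted_return(rewards, gamma):
    """Return G_t given the finite list of rewards R_{t+1}, R_{t+2}, ..."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


gamma = 0.9
# Hypothetical rewards R_{t+1}, ..., R_{t+7}: the pole falls on the 3rd and 7th step.
rewards = [0, 0, -1, 0, 0, 0, -1]

# Episodic view: the episode (and hence the return) stops at the first failure.
first_failure = rewards.index(-1)
episodic = discounted_return(rewards[: first_failure + 1], gamma)

# Continuing view: every future failure contributes to the return.
continuing = discounted_return(rewards, gamma)

print(f"episodic:   {episodic:.4f}")
print(f"continuing: {continuing:.4f}")
```

The episodic return only sees the first failure, so it is never smaller than the continuing return for the same stream.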

### Exercise 3.7

Q: Imagine that you are designing a robot to run a maze. You decide to give it a reward of +1 for escaping from the maze and a reward of zero at all other times. The task seems to break down naturally into episodes — the successive runs through the maze — so you decide to treat it as an episodic task, where the goal is to maximize expected total reward (3.7). After running the learning agent for a while, you find that it is showing no improvement in escaping from the maze. What is going wrong? Have you effectively communicated to the agent what you want it to achieve?

A: The key detail in the question is that the experimenter has decided to maximise the (undiscounted) expected total reward. However, as long as the learning agent has a policy that enables it to *eventually* escape the maze (no matter how long that takes, and noting that even a randomly moving agent will ultimately escape the maze), the expected total reward will be one, *independent* of the time taken to escape. Hence, there is no further incentive for the agent to escape the maze in a timely fashion (let alone as quickly as possible).

An obvious improvement would be to change the goal to maximising the expected discounted return. At any time step $t$, this is $\mathbb{E}\left[\gamma^{T-t-1}\right]$, where $T$ is the (random) time step at which the agent escapes and receives its reward $R_T = 1$, or equivalently the episode length. By doing this, we introduce some time-sensitivity to the optimisation goal: by reducing the time to escape, the agent can increase its return. Therefore it is encouraged to learn to escape the maze efficiently.
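
The effect of discounting on this task can be seen in a few lines of code. The sketch below compares the undiscounted and discounted returns at $t = 0$ for three escape times; the escape times and the value $\gamma = 0.95$ are arbitrary choices for illustration.

```python
def episode_returns(escape_time, gamma):
    """Undiscounted and discounted return at t = 0 when the +1 arrives at `escape_time`."""
    rewards = [0.0] * (escape_time - 1) + [1.0]  # R_1, ..., R_T
    undiscounted = sum(rewards)
    discounted = sum(gamma**k * r for k, r in enumerate(rewards))
    return undiscounted, discounted


for T in (5, 50, 500):
    undiscounted, discounted = episode_returns(T, gamma=0.95)
    print(f"T = {T:3d}: undiscounted = {undiscounted}, discounted = {discounted:.6f}")
```

The undiscounted return is 1 regardless of how long the escape takes, whereas the discounted return shrinks geometrically with the escape time.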

### Exercise 3.8

Q: Suppose $\gamma = 0.5$ and, for an episode terminating at time $T=5$, the following sequence of rewards is received: -1, 2, 6, 3, 2. What are the corresponding returns $G_0 \ldots G_5$?

A: Let's solve this using code:

```python
gamma = 0.5
rewards = [-1, 2, 6, 3, 2]  # R_1, ..., R_5
G_backwards = [0]  # no further rewards once terminal step has been reached
for R in reversed(rewards):
    # Work backwards using G_t = R_{t+1} + gamma * G_{t+1}
    G_backwards.append(gamma * G_backwards[-1] + R)
G = G_backwards[::-1]  # reorder as G_0, ..., G_5
print("Returns: ", G)
```

```
Returns:  [2.0, 6.0, 8.0, 4.0, 2.0, 0]
```


### Exercise 3.9

Q: Suppose $\gamma = 0.9$ and the reward sequence is $R_1 = 2$ followed by an infinite sequence of 7s. What are $G_1$ and $G_0$?

A: Recall that $\sum_{i=0}^{\infty} \gamma^i = (1 - \gamma)^{-1}$. Hence

$$G_1 = \frac{7}{1 - 0.9} = 70$$

and

$$G_0 = 2 + 0.9 \times 70 = 65.$$
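
As a quick numerical sanity check of these two values, the sketch below approximates the infinite sum by truncating it after 2000 terms (an arbitrary cut-off, by which point the remaining terms are negligible).

```python
gamma = 0.9

# G_1 = 7 + 7*gamma + 7*gamma^2 + ...; truncate the infinite sum after many terms.
G1 = sum(7 * gamma**k for k in range(2000))
# G_0 = R_1 + gamma * G_1
G0 = 2 + gamma * G1

print(round(G1, 6), round(G0, 6))  # 70.0 65.0
```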

### Exercise 3.10

Q: Prove the second inequality in (3.10)

A: This appears to be a typo in the book, as (3.10) contains the following equality:

$$\sum_{k = 0}^{\infty} \gamma^k = \frac{1}{1 - \gamma}.$$

This is the standard infinite geometric series formula which is derived as follows.

Defining $S_n = \sum_{k=0}^{n-1} \gamma^k$, it follows that
$$
S_n = 1 + \gamma S_{n-1} = 1 - \gamma^n + \gamma S_n.
$$

Therefore
$$S_n = \frac{1 - \gamma^n}{1 - \gamma}.$$

Taking the limit $n \to \infty$, which converges when $|\gamma| < 1$ (since then $\gamma^n \to 0$), we obtain the desired result.

### Exercise 3.11

Q: If the current state is $S_t$, and actions are selected according to stochastic policy $\pi$, then what is the expectation of $R_{t+1}$ in terms of $\pi$ and the four-argument function $p$ (3.2)?

A: At this point, the book formally introduces stochastic policies. A policy $\pi$ is a mapping from each state $s \in \mathcal{S}$ to a probability distribution over the corresponding action space $\mathcal{A}(s)$, the idea being that the agent's next action $A_t$ (having observed it is in state $S_t$) would be drawn from this distribution $\pi(\cdot \mid S_t)$. Notice that, by applying a fixed policy to determine the agent's actions in this manner, we transform an MDP into a Markov chain; the state-transition probability for the Markov chain is obtained by summing over the actions the policy may select:

$$\operatorname{Pr}\left[S_{t+1} = s' \mid S_t = s\right] = \sum_{a \in \mathcal{A}(s)} p(s' \mid s, a) \, \pi(a \mid s)
= \sum_{a \in \mathcal{A}(s)} \sum_{r \in \mathcal{R}} p(s', r \mid s, a) \, \pi(a \mid s).$$

Going back to the question, it follows that

$$\mathbb{E}_{\pi}\left[R_{t+1} \mid S_t = s\right] = \sum_{s' \in \mathcal{S}}\sum_{a \in \mathcal{A}(s)}\sum_{r \in \mathcal{R}}
r \, p(s', r \mid s, a) \, \pi(a \mid s).$$

Note the notation implicitly introduced at this point: by $\mathbb{E}_\pi$, we (and the book) refer to taking expectations of random variables / processes under the (Markov chain) measure induced by applying policy $\pi$ to the originally specified MDP.
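
The triple sum for the expected reward translates directly into code. The sketch below evaluates it for a single state of a made-up MDP; the dictionaries `dynamics` and `policy`, and all the numbers in them, are hypothetical.

```python
# Hypothetical dynamics p(s', r | s, a), stored as (s, a) -> {(s', r): probability}.
dynamics = {
    ("s0", "left"): {("s0", 0.0): 0.5, ("s1", 1.0): 0.5},
    ("s0", "right"): {("s1", 2.0): 1.0},
}
# Hypothetical stochastic policy pi(a | s), stored as s -> {a: probability}.
policy = {"s0": {"left": 0.4, "right": 0.6}}


def expected_reward(s, policy, dynamics):
    """E_pi[R_{t+1} | S_t = s] = sum over a, s', r of r * p(s', r | s, a) * pi(a | s)."""
    return sum(
        policy[s][a] * prob * r
        for a in policy[s]
        for (s_next, r), prob in dynamics[(s, a)].items()
    )


# 0.4 * (0.5 * 0 + 0.5 * 1) + 0.6 * 2 = 1.4
print(round(expected_reward("s0", policy, dynamics), 10))  # 1.4
```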

### Exercise 3.12

Q: Give an equation for $v_\pi$ in terms of $q_\pi$ and $\pi$.

A: Recall $v_\pi$ is the state-value function for policy $\pi$,

$$v_\pi(s) \doteq \mathbb{E}_\pi\left[G_t \mid S_t = s\right],$$

and $q_\pi$ is the action-value function for policy $\pi$,

$$q_\pi(s, a) \doteq \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right].$$

By the law of total expectation it follows that

$$
\begin{align}
v_\pi(s) &= \mathbb{E}_\pi \left[\mathbb{E}_\pi\left[G_t \mid S_t, A_t\right] \mid S_t = s\right] \\
&= \mathbb{E}_\pi \left[q_\pi(S_t, A_t) \mid S_t = s \right] \\
&= \sum_{a \in \mathcal{A}(s)} q_\pi(s, a) \, \pi(a | s).
\end{align}
$$

### Exercise 3.13

Q: Give an equation for $q_\pi$ in terms of $v_\pi$ and the four-argument $p$.

A: From its definition,

$$
\begin{align}
q_\pi(s, a) &\doteq \mathbb{E}_\pi \left[G_t \mid S_t=s, A_t=a\right]\\
&= \sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}} \left[r + \gamma \mathbb{E}_\pi\left[G_{t+1} \mid S_{t+1}=s'\right]\right] p(s', r \mid s, a)\\
&= \sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}} \left[r + \gamma v_\pi(s')\right] p(s', r \mid s, a),
\end{align}
$$

which answers the question.
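
The relations from this exercise and the previous one can be checked against each other numerically. In the sketch below, the two-state MDP and the policy are invented for illustration, and $v_\pi$ is obtained by repeatedly applying the two relations in turn (a simple fixed-point iteration, used here only as a convenient way to get numbers).

```python
GAMMA = 0.9
# Hypothetical two-state MDP: (s, a) -> {(s', r): probability}.
dynamics = {
    ("x", "stay"): {("x", 1.0): 1.0},
    ("x", "go"): {("y", 0.0): 1.0},
    ("y", "stay"): {("y", 2.0): 0.5, ("x", 0.0): 0.5},
    ("y", "go"): {("x", -1.0): 1.0},
}
# Hypothetical policy pi(a | s): s -> {a: probability}.
policy = {"x": {"stay": 0.5, "go": 0.5}, "y": {"stay": 0.8, "go": 0.2}}


def q_from_v(v, s, a):
    """Exercise 3.13: q(s, a) = sum_{s', r} p(s', r | s, a) [r + gamma v(s')]."""
    return sum(
        prob * (r + GAMMA * v[s_next])
        for (s_next, r), prob in dynamics[(s, a)].items()
    )


def v_from_q(q, s):
    """Exercise 3.12: v(s) = sum_a pi(a | s) q(s, a)."""
    return sum(policy[s][a] * q[(s, a)] for a in policy[s])


# Alternate the two relations until v stops changing; the fixed point is v_pi.
v = {"x": 0.0, "y": 0.0}
for _ in range(1000):
    q = {(s, a): q_from_v(v, s, a) for (s, a) in dynamics}
    v = {s: v_from_q(q, s) for s in policy}

print({s: round(value, 4) for s, value in v.items()})
```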

### Exercise 3.14

Q: (Generalising the original question) solve the Bellman equation to recover state-values for Gridworld as shown in Figure 3.2.

A: Let's begin by deriving the Bellman equation: this is an implicit equation satisfied by the state-value function $v_\pi$. We proceed similarly to the previous exercise, as follows:
$$
\begin{align}
v_\pi(s) &\doteq \mathbb{E}_\pi \left[G_t \mid S_t = s\right] \\
 &= \mathbb{E}_\pi \left[R_{t+1} + \gamma G_{t+1} \mid S_t = s\right] \\
 &= \mathbb{E}_\pi \left[R_{t+1} + \gamma \mathbb{E}_\pi\left[G_{t+1} \mid S_{t+1}\right] \mid S_t =s\right],
\end{align}
$$
hence obtaining the Bellman equation:
$$v_\pi(s) = \mathbb{E}_\pi \left[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s\right].$$

We have left the equation in expectation form, rather than the equivalent weighted sum form given in (3.14) in the textbook.

Notice that this is a linear system of equations: we could rewrite it
$$\left(\left(𝟙 - \gamma \hat{B}_\pi\right) v_\pi\right)(s) = r(s),$$
where $𝟙$ is the identity operator, $r(s) \doteq \mathbb{E}_\pi \left[R_{t+1} \mid S_t = s\right]$ is the expected reward given we are in state $s$, and $\hat{B}_\pi$, which maps a function $v$ to its expected value at the subsequent state, is a linear operator on the space of state-value functions:
$$
\left(\hat{B}_\pi \, v\right)(s) \doteq \mathbb{E}_\pi \left[v(S_{t+1}) \mid S_t = s\right].
$$

This means that, at least for finite MDPs, one option for deriving the state-value function $v_\pi$ is to use a linear solver to solve this system of equations. The code below does this for the GridWorld problem defined in the textbook:

```python
import pandas as pd
from IPython.display import display  # available implicitly inside a notebook
from rl.mdp import GridWorld
from rl.mdp.solve import exact_state_values

gridworld = GridWorld(
    size=5,
    wormholes={
        (0, 1): ((4, 1), 10),  # state "A" in the book
        (0, 3): ((2, 3), 5),  # state "B" in the book
    },
)

def pi(a, s):
    """Policy for agent that takes actions at random."""
    return 0.25

v = exact_state_values(gridworld, gamma=0.9, pi=pi)

# Display the result as a DataFrame for easy inspection
display(
    pd.Series(v.values(), index=v.keys())
    .rename_axis(index=["r", "c"])
    .unstack()
    .round(1)
)

# Spot checks against results in textbook
assert round(v[(2, 1)], 1) == 0.7
assert round(v[(0, 3)], 1) == 5.3
```

| r \ c | 0 | 1 | 2 | 3 | 4 |
| --- | --- | --- | --- | --- | --- |
| 0 | 3.3 | 8.8 | 4.4 | 5.3 | 1.5 |
| 1 | 1.5 | 3.0 | 2.3 | 1.9 | 0.5 |
| 2 | 0.1 | 0.7 | 0.7 | 0.4 | -0.4 |
| 3 | -1.0 | -0.4 | -0.4 | -0.6 | -1.2 |
| 4 | -1.9 | -1.3 | -1.2 | -1.4 | -2.0 |
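
For readers without access to the `rl` package used above, the following is a self-contained sketch of the same linear-system approach in pure Python. It is not the implementation behind `exact_state_values`: the helper names are my own, and the small Gauss-Jordan routine stands in for a library solver such as `numpy.linalg.solve`. The gridworld rules (moves off the grid leave the state unchanged with reward -1; every action in A or B jumps to A' or B' with reward +10 or +5) follow the book's description.

```python
GAMMA = 0.9
N = 5
SPECIAL = {(0, 1): ((4, 1), 10.0), (0, 3): ((2, 3), 5.0)}  # A -> A', B -> B'
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # north, south, west, east


def step(state, move):
    """Deterministic gridworld dynamics: return (next_state, reward)."""
    if state in SPECIAL:
        return SPECIAL[state]
    r, c = state[0] + move[0], state[1] + move[1]
    if 0 <= r < N and 0 <= c < N:
        return (r, c), 0.0
    return state, -1.0  # bumped into the edge of the grid


states = [(r, c) for r in range(N) for c in range(N)]
index = {s: i for i, s in enumerate(states)}
n = len(states)

# Augmented matrix [I - gamma * P_pi | r_pi] for the equiprobable random policy.
M = [[0.0] * (n + 1) for _ in range(n)]
for s in states:
    i = index[s]
    M[i][i] += 1.0
    for move in MOVES:
        s_next, reward = step(s, move)
        M[i][index[s_next]] -= GAMMA * 0.25
        M[i][n] += 0.25 * reward

# Gauss-Jordan elimination with partial pivoting.
for col in range(n):
    pivot = max(range(col, n), key=lambda row: abs(M[row][col]))
    M[col], M[pivot] = M[pivot], M[col]
    pivot_value = M[col][col]
    M[col] = [x / pivot_value for x in M[col]]
    for row in range(n):
        if row != col and M[row][col] != 0.0:
            factor = M[row][col]
            M[row] = [x - factor * y for x, y in zip(M[row], M[col])]

v = {s: M[index[s]][n] for s in states}
for r in range(N):
    print("  ".join(f"{v[(r, c)]:5.1f}" for c in range(N)))
```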


### Exercise 3.15

Q: In the gridworld example, rewards are positive for goals, negative for running into the edge of the world, and zero the rest of the time. Are the signs of these rewards important, or only the intervals between them? Prove, using (3.8), that adding a constant $c$ to all the rewards adds a constant, $v_c$, to the values of all states, and thus does not affect the relative values of any states under any policies. What is $v_c$ in terms of $c$ and $\gamma$?

A: Recall the definition of the state-value function:

$$v_\pi(s) \doteq \mathbb{E}_\pi \left[G_t \mid S_t = s\right].$$

If we shift all rewards up by the same constant $c$, then it follows (using the definition of $G_t$ and the formula for an infinite geometric series) that the return is also shifted by a constant:

$$G_t \to G_t + \frac{c}{1 - \gamma}.$$

Thus we find that state values are shifted up by $v_c = \tfrac{c}{1 - \gamma}$. Moreover, since $v_c$ is independent of the policy, the state-value functions for all possible policies are shifted by the same amount. Hence, shifting rewards by a constant has no impact on the optimal policy for a continuing-task MDP.
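
For completeness, the shift in the return can be written out term by term. Denoting the return computed from the shifted rewards $R_k + c$ by $G'_t$ (notation introduced here only for this derivation), and assuming $0 \leq \gamma < 1$:

$$
\begin{align}
G'_t &= \sum_{k=0}^{\infty} \gamma^k \left(R_{t+k+1} + c\right) \\
&= \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} + c \sum_{k=0}^{\infty} \gamma^k \\
&= G_t + \frac{c}{1 - \gamma}.
\end{align}
$$

Taking $\mathbb{E}_\pi\left[\,\cdot \mid S_t = s\right]$ of both sides gives $v'_\pi(s) = v_\pi(s) + \tfrac{c}{1 - \gamma}$ for every state $s$ and every policy $\pi$.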

### Exercise 3.16

Q: Now consider adding a constant $c$ to all rewards in an episodic task, such as maze running. Would this have any effect, or would it leave the task unchanged as in the continuing task above? Why or why not? Give an example.

A: Again, adding a positive constant $c$ to all rewards increases returns and therefore increases state-values (which are, after all, conditional expected returns). However, unlike the continuing case, the increase in state-values is not uniform for an episodic task. This is because, for episodic tasks, non-zero rewards are only collected up until the terminal state is reached; beyond that, rewards effectively remain zero irrespective of the shift.

What does this mean for the optimal policy for the task? As rewards are increased, there is more incentive (or less disincentive) to defer reaching the terminal state. So, intuitively, we would expect the agent to take longer to reach the terminal state (e.g. leave the maze). If the rewards are shifted upwards far enough, there could come a point where the optimal policy is to avoid ever reaching the terminal state so that the agent can collect rewards indefinitely.

How can we confirm this intuition formally? Letting $T_e$ denote the number of time steps remaining in the episode from time $t$ (a random variable), it is straightforward to show using the geometric series formula that, if all rewards are increased by $c$, returns will change as follows:

$$G_t \to G_t + c \frac{1 - \gamma^{T_e}}{1 - \gamma}.$$

Hence, taking conditional expectations, we see that for a fixed policy $\pi$ the state-value function will change as follows:

$$v_\pi(s) \to v_\pi(s) + \frac{c}{1 - \gamma}\left(1 - \mathbb{E}_\pi\left[\gamma^{T_e} \mid S_t = s\right]\right).$$

Notice that, unlike for continuing tasks, state values are now shifted upwards by an amount that depends on the episode length: for states where typical episode lengths are long, the impact of adding $c$ to rewards is greater. This fits the intuition described above.
Notice also that the extent to which state values are now shifted will vary from policy to policy (since $v_c$ is now also a function of $\pi$). Hence, the optimal policy will also typically be different after the constant $c$ has been added.
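
The question also asks for an example. As a minimal sketch, consider a maze variant (my own assumption, not the reward scheme of Exercise 3.7) in which the agent receives -1 on every step until it escapes, and compare a policy that escapes in 3 steps with one that takes 30 steps, before and after adding $c = 2$ to every reward.

```python
def episode_return(step_reward, length, gamma):
    """Discounted return when `step_reward` is received on each of `length` steps."""
    return sum(step_reward * gamma**k for k in range(length))


gamma = 0.9
for c in (0.0, 2.0):
    short = episode_return(-1.0 + c, 3, gamma)   # escape after 3 steps
    long = episode_return(-1.0 + c, 30, gamma)   # escape after 30 steps
    better = "short" if short > long else "long"
    print(f"c = {c}: short = {short:.2f}, long = {long:.2f}, prefers {better} episode")
```

With the original rewards the quick escape is preferred, but after the shift every step pays +1 and the slower escape has the higher return, so the ranking of the two policies is reversed.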

### Exercise 3.17

Q: What is the Bellman equation for action values, that is, for $q_\pi$?

A: Starting from the definition of the action value function:

$$
\begin{align}
q_\pi(s, a) &\doteq \mathbb{E}_\pi\left[G_t \mid S_t=s, A_t=a\right] \\
&= \mathbb{E}_\pi\left[R_{t+1} + \gamma G_{t+1} \mid S_t=s, A_t=a\right] \\
&= \mathbb{E}_\pi\left[R_{t+1} + \gamma \mathbb{E}_\pi\left[G_{t+1} \mid S_{t+1}, A_{t+1} \right] \mid S_t=s, A_t=a\right] \\
&= \mathbb{E}_\pi\left[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t=s, A_t=a\right].
\end{align}
$$

Writing this out in component form for a finite MDP:

$$
q_\pi(s, a) = \sum_{s' \in \mathcal{S}}  p(s' | s, a)
\left[r(s, a, s') + \gamma \sum_{a' \in \mathcal{A}} \pi(a' | s') \, q_\pi(s', a')\right].
$$



### Exercise 3.18

Q: The value of a state depends on the values of the actions possible in that state and on how likely each action is to be taken under the current policy. (...) Give the equation corresponding to this intuition, for the value $v_\pi(s)$ in terms of $q_\pi(s, a)$ given $S_t=s$. This equation should include an expectation conditioned on following the policy, $\pi$. Then give a second equation in which the expected value is written out explicitly in terms of $\pi(a|s)$ such that no expected value notation appears in the equation.

A: In expected value form:

$$ v_\pi(s) = \mathbb{E}_\pi\left[q_\pi(S_t, A_t) \mid S_t=s\right]. $$

In component form:

$$ v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s) \, q_\pi(s, a).$$

### Exercise 3.19

Q: The value of an action, $q_\pi(s, a)$, depends on the expected next reward and the expected sum of the remaining rewards. (...) Give the equation corresponding to this intuition, for the action value, $q_\pi(s, a)$, in terms of the expected next reward, $R_{t+1}$, and the expected next state value, $v_\pi(S_{t+1})$, given that $S_t=s$ and $A_t=a$. This equation should include an expectation but *not* one conditioned on following the policy. Then give a second equation, writing out the expected value explicitly in terms of $p(s', r \mid s, a)$ defined by (3.2), such that no expected value notation appears in the equation.

A: In expected value form:

$$q_\pi(s, a) = \mathbb{E}\left[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t=s, A_t=a\right].$$

In component form:

$$q_\pi(s, a) = \sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}} p(s', r \mid s, a) \left[r + \gamma v_\pi(s')\right].$$

### Optimal policies

The Bellman equations above allow us to calculate state-value and action-value functions for a given, fixed policy $\pi$. However, we are typically interested in *finding* a good policy (or, perhaps, the *best* policy) for the agent, rather than specifying one in advance. How could we do this?

Let us suppose for a moment we already know the optimal policy, $\pi_*$, for an MDP. Then we would also know the state-value function $v_*$. (We could simply use the above Bellman equation to calculate this from $\pi_*$.) However, if we knew $v_*$, then it follows that, in any state $s$, our policy should simply be to pick any action that maximises the sum of the expected next reward and the discounted value of the next state the agent would find itself in. This implies that $v_*$ must satisfy the following implicit equation:

$$v_*(s) = \max_{a \in \mathcal{A}} \mathbb{E}\left[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a\right].$$

Note that this expectation does not depend on the policy $\pi_*$ itself: the two random variables in the expectand, $R_{t+1}$ and $S_{t+1}$, depend only on the current state $S_t$ and the action $A_t$, both of which are conditioned upon.

Thus we have a (now non-linear) set of equations for $v_*$, which can be solved to obtain the optimal state-value function. This is the Bellman optimality equation for the state-value function. From this, we can infer an optimal policy: any policy that, in state $s$, assigns non-zero probabilities only to actions that attain the maximum in the above equation is an optimal policy.

Similarly, we could derive a Bellman optimality equation for the action-value function $q_*$:

$$q_*(s, a) = \mathbb{E}\left[R_{t+1} + \gamma \max_{a' \in \mathcal{A}} q_*(S_{t+1}, a') \,\middle|\, S_t = s, A_t = a\right].$$

In terms of $q_*$, an optimal policy is any policy that, in a state $s$, assigns non-zero probabilities only to actions $a$ that maximise $q_*(s, a)$.
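
One standard way to solve the Bellman optimality equation for $v_*$ numerically is to apply its right-hand side repeatedly as an update until the values stop changing (value iteration, which the book covers later). The sketch below is a minimal version under stated assumptions: the array layout `P[s, a, s']` and `R[s, a]` is a convention chosen here, and the two-state MDP is made up for illustration.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-10, max_iter=10_000):
    """Approximate v_* by iterating v <- max_a (R[s, a] + gamma * sum_s' P[s, a, s'] * v[s']).

    P[s, a, s2] holds transition probabilities and R[s, a] expected immediate
    rewards; this array layout is an assumption made for the sketch.
    """
    v = np.zeros(P.shape[0])
    for _ in range(max_iter):
        v_new = (R + gamma * (P @ v)).max(axis=1)
        converged = np.max(np.abs(v_new - v)) < tol
        v = v_new
        if converged:
            break
    q = R + gamma * (P @ v)   # one-step lookahead recovers q_* from v_*
    return v, q

# Made-up two-state, two-action MDP (not from the book).
P = np.zeros((2, 2, 2))
P[0, 0, 0] = 1.0   # state 0, action 0: stay in state 0
P[0, 1, 1] = 1.0   # state 0, action 1: move to state 1
P[1, 0, 0] = 1.0   # state 1, action 0: move to state 0
P[1, 1, 1] = 1.0   # state 1, action 1: stay in state 1
R = np.array([[0.0, 1.0],
              [0.0, 2.0]])

v_star, q_star = value_iteration(P, R, gamma=0.5)
greedy = q_star.argmax(axis=1)   # a deterministic policy that is greedy w.r.t. q_*
print(v_star, greedy)
```

For this toy MDP the values can be checked by hand: staying in state 1 forever is worth $2 / (1 - 0.5) = 4$, and moving from state 0 to state 1 is worth $1 + 0.5 \times 4 = 3$.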

### Exercise 3.20

Q: Draw or describe the optimal state value function for the golf example.

A: We will not re-describe the example here: please see the book for a full description. When the ball is anywhere on the green, it is possible to get it in the hole in one shot by putting (the best possible outcome); hence anywhere on the green, $v_* = -1$. Off the green, the optimal policy is to use the driver, and thus $v_*$ coincides with $q_*(s, \texttt{driver})$ as drawn in Figure 3.3.

### Exercise 3.21

Q: Draw or describe the contours of the optimal action-value function for putting, $q_*(s,\texttt{putter})$, for the golf example.

A: Within the green, this is $-1$ (since the putter can get the ball in the hole from anywhere on the green). Outside the green, the contours of $q_*(s,\texttt{putter})$ run parallel to, but outside, the contours of $v_*$ described in the previous exercise: the distance between the two sets of contours is the range of the putter when used on the fairway, and each contour of $q_*(s,\texttt{putter})$ has a value one lower than the corresponding contour of $v_*$. This is because we would use the putter once to take us as near the hole as possible before then switching to the optimal policy (using the driver off the green and the putter on it). In the sand traps, $q_*(s,\texttt{putter})$ is one less than the value of $v_*$: after one use of the putter (which will leave us in the sand trap) we then switch to the driver to proceed with the optimal policy.

### Exercise 3.22

Q: Consider the continuing MDP shown to the right. (...) There are exactly two deterministic policies, $\pi_\texttt{left}$ and $\pi_\texttt{right}$. What is the optimal policy if $\gamma=0$? If $\gamma=0.9$? If $\gamma=0.5$?

A: The MDP described in the book consists of three states: A, B and C.

* From state A, taking action $\texttt{left}$ always results in collecting a reward $+1$, following which we always move to state B. From state B we always move back to state A (there is no choice of action available), collecting a reward of $0$.
* From state A, we could alternatively take action $\texttt{right}$, which always results in collecting a reward of $0$ and following which we always move to state C. From state C we always move back to state A (again there is no choice of action available), collecting a reward of $+2$.

If $\gamma=0$, then the agent only cares about the next reward: from state A, action $\texttt{left}$ yields $+1$ immediately whereas action $\texttt{right}$ yields $0$. The optimal policy is therefore $\pi_\texttt{left}$, i.e. to always take the action $\texttt{left}$.

Before solving the MDP for non-zero $\gamma$, note that we can simplify the MDP by eliminating the states B and C, since there are no actions to choose at these states. Specifically, for any given $\gamma$, the above MDP is equivalent to an MDP with a single state, where both actions $\texttt{left}$ and $\texttt{right}$ result in transitioning back to the same state; by taking action $\texttt{left}$ we collect reward $1$ along the way, whereas by taking action $\texttt{right}$ we collect reward $2 \gamma$ along the way. Also note that the effective discount rate for this new MDP is $\gamma^2$, because each transition of the new MDP spans two time steps of the original one.

Therefore, the value of state A under policy $\pi_\texttt{left}$ is $(1 - \gamma^2)^{-1}$, whereas under policy $\pi_\texttt{right}$ it is $2 \gamma (1 - \gamma^2)^{-1}$. Which is the better policy? Comparing the numerators $1$ and $2\gamma$: for any $\gamma < 0.5$, $\pi_\texttt{left}$ is optimal, whereas for any $\gamma > 0.5$, $\pi_\texttt{right}$ is optimal. In particular, $\pi_\texttt{right}$ is optimal for $\gamma = 0.9$, and for $\gamma = 0.5$ both policies deliver the same expected return, so both are optimal.
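
The two closed-form values can be tabulated for the three discount rates asked about. This is a minimal sketch; the function name `policy_values` is hypothetical and simply evaluates the two expressions for the value of state A derived above.

```python
def policy_values(gamma):
    """Value of state A under pi_left and pi_right in the MDP of Exercise 3.22."""
    if not 0 <= gamma < 1:
        raise ValueError("gamma must lie in [0, 1)")
    v_left = 1.0 / (1.0 - gamma**2)            # reward sequence +1, 0, +1, 0, ...
    v_right = 2.0 * gamma / (1.0 - gamma**2)   # reward sequence 0, +2, 0, +2, ...
    return v_left, v_right

for gamma in (0.0, 0.5, 0.9):
    v_left, v_right = policy_values(gamma)
    if abs(v_left - v_right) < 1e-12:
        best = "tie"
    else:
        best = "left" if v_left > v_right else "right"
    print(f"gamma={gamma}: v_left={v_left:.3f}, v_right={v_right:.3f}, best={best}")
```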

### Exercise 3.23

Q: Give the Bellman equation for $q_*$ for the recycling robot.

A: See the book for a description of the recycling robot. The system of Bellman equations is as follows:

$$
\newcommand{\wait}{\texttt{w}}
\newcommand{\recharge}{\texttt{r}}
\newcommand{\search}{\texttt{s}}
\newcommand{\low}{\texttt{l}}
\newcommand{\high}{\texttt{h}}
\begin{align}
q_*(\low, \wait) &= r_\wait + \gamma \max\left[q_*(\low, \wait), q_*(\low, \recharge), q_*(\low, \search)\right] \\
q_*(\low, \recharge) &= \gamma \max\left[q_*(\high, \wait), q_*(\high, \search)\right] \\
q_*(\low, \search) &= \beta\,r_\search - 3(1-\beta)
  + \gamma \, \beta \max\left[q_*(\low, \wait), q_*(\low, \recharge), q_*(\low, \search)\right]
  + \gamma (1 - \beta) \max\left[q_*(\high, \wait), q_*(\high, \search)\right]\\
q_*(\high, \wait) &= r_\wait + \gamma \max\left[q_*(\high, \wait), q_*(\high, \search)\right] \\
q_*(\high, \search) &= r_\search + \gamma \, \alpha \max\left[q_*(\high, \wait), q_*(\high, \search)\right] + \gamma \,
 (1-\alpha) \max\left[ q_*(\low, \wait), q_*(\low, \recharge), q_*(\low, \search) \right].
\end{align}
$$

Extension Q: Can you solve these equations?

A: The solution of these equations clearly depends on the problem parameters $r_\search$, $r_\wait$, $\alpha$, $\beta$ and $\gamma$. For example, should $r_\wait$ be high enough, then the robot could end up waiting indefinitely in the low, or even high, battery state; if $r_\search$ is high enough and $\gamma$ is low enough, then the robot would be willing to risk running out of charge in the low battery state; etc.

So, instead of finding the general solution, let us posit an intuitive optimal policy and see what constraints must be placed on the problem parameters to make this the actually optimal policy. Specifically, let us suppose that in the low battery state it is always most preferable to recharge and in the high battery state it is always most preferable to search. Formally, this means we assume:

$$q_*(\low, \search) < q_*(\low, \recharge),\quad q_*(\low, \wait) < q_*(\low, \recharge) \quad\text{and}\quad q_*(\high, \wait) < q_*(\high, \search).$$

These assumptions allow us to simplify the Bellman equations as follows:

$$
\begin{align}
q_*(\low, \wait) &= r_\wait + \gamma q_*(\low, \recharge) \\
q_*(\low, \recharge) &= \gamma q_*(\high, \search) \\
q_*(\low, \search) &= \beta\,r_\search - 3(1-\beta)
  + \gamma \, \beta q_*(\low, \recharge) + \gamma (1 - \beta) q_*(\high, \search)\\
q_*(\high, \wait) &= r_\wait + \gamma q_*(\high, \search) \\
q_*(\high, \search) &= r_\search + \gamma \, \alpha q_*(\high, \search) + \gamma \,(1-\alpha) q_*(\low, \recharge).
\end{align}
$$

We can solve these equations straightforwardly by substitution to find

$$q_*(\high, \search) = \frac{r_\search}{1 - \gamma \left[\alpha + \gamma (1 - \alpha)\right]}$$

and the remaining action values in terms of $q_*(\high, \search)$:

$$
\begin{align}
q_*(\low, \wait) &= r_\wait + \gamma^2 q_*(\high, \search) \\
q_*(\low, \recharge) &= \gamma q_*(\high, \search) \\
q_*(\low, \search) &= \beta\,r_\search - 3(1-\beta)
  + \gamma \left[1 - \beta(1 - \gamma)\right] q_*(\high, \search)\\
q_*(\high, \wait) &= r_\wait + \gamma q_*(\high, \search).
\end{align}
$$

We now need to insert this solution into the inequalities above representing our initial assumptions, in order to see what constraints need to be placed on the MDP parameters to make this solution the action values for the optimal policy:

$$
\begin{align}
q_*(\low, \wait) &< q_*(\low, \recharge) &\implies r_\wait &< \gamma (1 - \gamma) q_*(\high, \search)\\
q_*(\high, \wait) &< q_*(\high, \search) &\implies r_\wait &< (1 - \gamma) q_*(\high, \search)\\
q_*(\low, \search) &< q_*(\low, \recharge) &\implies r_\search &< 3\frac{1-\beta}{\beta} + \gamma (1 - \gamma)q_*(\high, \search).\\
\end{align}
$$

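Assuming $r_\search > 0$ and $0 < \gamma < 1$, the first inequality implies the second (its right-hand side is smaller by a factor of $\gamma$), so after substituting the expression for $q_*(\high, \search)$ the three conditions reduce to the following two inequalities on the MDP parameters: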

$$
\frac{r_\wait}{r_\search} < \frac{\gamma (1 - \gamma)}{1 - \gamma \left[\alpha + \gamma (1 - \alpha)\right]} \quad\text{and}\quad
r_\search \left[1 - \frac{\gamma (1 - \gamma)}{1 - \gamma \left[\alpha + \gamma (1 - \alpha)\right]} \right] < 3\frac{1 - \beta}{\beta}.
$$

Intuitively, the first inequality means that the reward for waiting, $r_\wait$, should not be so high that the robot is tempted to wait, rather than search or recharge, even under the low battery condition; the second inequality suggests that the reward for searching $r_\search$ should not be so high, nor the risk of getting stranded $1 - \beta$ so low, that the robot is tempted to search instead of recharge when under the low battery condition.
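
The posited policy can be sanity-checked numerically by iterating the five Bellman optimality equations to a fixed point and comparing against the closed-form expression for $q_*(\high, \search)$. The sketch below does this for one parameter setting; the function name `solve_recycling_robot` and the parameter values ($\alpha = 0.8$, $\beta = 0.6$, $\gamma = 0.9$, $r_\search = 2$, $r_\wait = 1$) are assumptions chosen for illustration, not values from the book.

```python
def solve_recycling_robot(alpha, beta, gamma, r_search, r_wait, sweeps=2000):
    """Fixed-point iteration on the five Bellman optimality equations for q_*."""
    q = {("low", "wait"): 0.0, ("low", "recharge"): 0.0, ("low", "search"): 0.0,
         ("high", "wait"): 0.0, ("high", "search"): 0.0}
    for _ in range(sweeps):
        best_low = max(q["low", a] for a in ("wait", "recharge", "search"))
        best_high = max(q["high", a] for a in ("wait", "search"))
        q = {
            ("low", "wait"): r_wait + gamma * best_low,
            ("low", "recharge"): gamma * best_high,
            ("low", "search"): beta * r_search - 3 * (1 - beta)
                + gamma * (beta * best_low + (1 - beta) * best_high),
            ("high", "wait"): r_wait + gamma * best_high,
            ("high", "search"): r_search
                + gamma * (alpha * best_high + (1 - alpha) * best_low),
        }
    return q

# Illustrative parameter values (assumptions, not taken from the book).
alpha, beta, gamma, r_search, r_wait = 0.8, 0.6, 0.9, 2.0, 1.0
q = solve_recycling_robot(alpha, beta, gamma, r_search, r_wait)

# Closed-form value of q_*(high, search) under the posited policy.
closed_form = r_search / (1 - gamma * (alpha + gamma * (1 - alpha)))
print(q["high", "search"], closed_form)

# The three assumed orderings: recharge when low, search when high.
print(q["low", "recharge"] > q["low", "wait"],
      q["low", "recharge"] > q["low", "search"],
      q["high", "search"] > q["high", "wait"])
```

For these parameters the iterated values agree with the closed form and all three orderings hold, which is consistent with the posited policy being optimal in this particular setting.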

### Exercise 3.24

Q: Figure 3.5 gives the optimal value of the best state of the gridworld as 24.4, to one decimal place. Use your knowledge of the optimal policy and (3.8) to express this value symbolically, and then to compute it to three decimal places.

A: The best state is grid cell $\texttt{(0, 1)}$, denoted 'A' in the figure. From this cell, all actions result in a reward of $10$ and move the agent to cell $A'$, which has value $16.0$. This means all actions are "optimal". Hence, the value of state A must be $10 + 0.9 \times 16.0 = 24.4$, which is what we find from checking Figure 3.5.
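
To express the value symbolically, note that an optimal policy cycles: from A the agent jumps to A' (reward $10$) and then walks four cells straight up back to A (reward $0$ at each step), so the reward $10$ recurs every five time steps. Assuming this cycle, as read off the optimal policy in Figure 3.5, the definition of the discounted return (3.8) gives:

$$v_*(A) = \sum_{k=0}^{\infty} \gamma^{5k} \cdot 10 = \frac{10}{1 - \gamma^5} = \frac{10}{1 - 0.9^5} = \frac{10}{0.40951} \approx 24.419.$$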

Going further, let us recompute $v_*$ for the entire grid using a numerical solver, and then give the value of the best state to three decimal places as requested:

```python
import numpy as np
import scipy.optimize
import pandas as pd
from IPython.display import display  # built in to notebooks; imported explicitly for clarity
from rl.mdp import GridWorld

# Setup gridworld specified in the book
gridworld = GridWorld(
    size=5,
    wormholes={
        (0, 1): ((4, 1), 10),  # state "A" in the book
        (0, 3): ((2, 3), 5),  # state "B" in the book
    },
)

# Express Bellman optimality equation as root-finding problem and solve
result = scipy.optimize.root(
    lambda v: v - gridworld.backup_optimal_values(v, gamma=0.9),
    x0=np.zeros(len(gridworld.states)),
    tol=1e-6,
)

# Check some results against values given in Figure 3.5 in the book
assert result.success
assert round(result.x[gridworld.s2i((2, 1))], 1) == 19.8
assert round(result.x[gridworld.s2i((3, 2))], 1) == 16.0
assert round(result.x[gridworld.s2i((0, 1))], 1) == 24.4  # "best" state, asked in the question

# Display the result as a DataFrame for easy inspection
print("State values under an optimal policy:")
display(
    pd.Series(
        result.x,
        index=pd.MultiIndex.from_tuples(gridworld.states, names=["r", "c"]),
    )
    .unstack()
    .round(1)
)

# And answering the question...
print("Best state value to 3 decimal places:", round(np.max(result.x), 3))
```

State values under an optimal policy:

| r \ c | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 22.0 | 24.4 | 22.0 | 19.4 | 17.5 |
| 1 | 19.8 | 22.0 | 19.8 | 17.8 | 16.0 |
| 2 | 17.8 | 19.8 | 17.8 | 16.0 | 14.4 |
| 3 | 16.0 | 17.8 | 16.0 | 14.4 | 13.0 |
| 4 | 14.4 | 16.0 | 14.4 | 13.0 | 11.7 |

Best state value to 3 decimal places: 24.419
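
The `rl.mdp.GridWorld` helper used above is not reproduced in this text, so here is a self-contained sketch that reaches the same number without it. It swaps the root-finder for plain value iteration and encodes the gridworld rules as I understand them from the book's description (reward $-1$ and no movement when stepping off the grid, reward $0$ otherwise, and the two special states A and B). The function name and structure are my own assumptions, not the author's implementation.

```python
import numpy as np

def gridworld_optimal_values(gamma=0.9, size=5, tol=1e-10):
    """Value iteration for the 5x5 gridworld with special states A and B (a sketch)."""
    special = {(0, 1): ((4, 1), 10.0),   # A -> A', reward +10 for every action
               (0, 3): ((2, 3), 5.0)}    # B -> B', reward +5 for every action
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
    v = np.zeros((size, size))
    while True:
        v_new = np.empty_like(v)
        for r in range(size):
            for c in range(size):
                if (r, c) in special:
                    (r2, c2), reward = special[(r, c)]
                    v_new[r, c] = reward + gamma * v[r2, c2]
                    continue
                candidates = []
                for dr, dc in moves:
                    r2, c2 = r + dr, c + dc
                    if 0 <= r2 < size and 0 <= c2 < size:
                        candidates.append(gamma * v[r2, c2])        # ordinary move, reward 0
                    else:
                        candidates.append(-1.0 + gamma * v[r, c])   # off-grid: reward -1, stay put
                v_new[r, c] = max(candidates)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

v_star = gridworld_optimal_values()
print("v_*(A) to 3 decimal places:", round(float(v_star[0, 1]), 3))
```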


### Exercise 3.25

Q: Give an equation for $v_*$ in terms of $q_*$.

A: $v_*(s) = \max_{a \in \mathcal{A}} q_*(s, a)$.

### Exercise 3.26

Q: Give an equation for $q_*$ in terms of $v_*$ and the four-argument $p$.

A:

$$q_*(s, a) = \sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}} \left[r + \gamma v_*(s')\right] p(s', r \mid s, a).$$

### Exercise 3.27

Q: Give an equation for $\pi_*$ in terms of $q_*$.

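A: Writing $\pi_*(s)$ for the action that a deterministic optimal policy selects in state $s$, $\pi_*(s) = \operatorname{arg\,max}_{a \in \mathcal{A}} q_*(s, a)$ (where ties are broken at random). Equivalently, $\pi_*(a \mid s)$ may be non-zero only for actions $a$ that attain this maximum.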

### Exercise 3.28

Q: Give an equation for $\pi_*$ in terms of $v_*$ and the four-argument $p$.

A: Writing $\pi_*(s)$ for the action selected by a deterministic optimal policy in state $s$, and again breaking ties under the arg-max operation at random:

$$
\DeclareMathOperator*{\argmax}{arg\,max}
\pi_*(s) = \argmax_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}} \left[r + \gamma v_*(s')\right] p(s', r \mid s, a).$$
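
Extracting a greedy policy from $v_*$ amounts to a one-step lookahead through the dynamics, with the next-state value discounted by $\gamma$. The sketch below illustrates this on the three-state MDP of Exercise 3.22 with $\gamma = 0.9$, using the closed-form value of state A found there. The dictionary layout and the action name `"continue"` for the forced moves out of B and C are assumptions made for illustration.

```python
gamma = 0.9

# Dynamics of the three-state MDP from Exercise 3.22, stored as
# p[(s, a)] = list of (probability, next_state, reward).
p = {
    ("A", "left"): [(1.0, "B", 1.0)],
    ("A", "right"): [(1.0, "C", 0.0)],
    ("B", "continue"): [(1.0, "A", 0.0)],   # hypothetical name for the forced move
    ("C", "continue"): [(1.0, "A", 2.0)],
}

# Optimal state values for gamma = 0.9, where pi_right is optimal.
v_A = 2 * gamma / (1 - gamma**2)
v_star = {"A": v_A, "B": gamma * v_A, "C": 2 + gamma * v_A}

def q_from_v(s, a):
    """One-step lookahead: sum over (s', r) of p(s', r | s, a) * (r + gamma * v_*(s'))."""
    return sum(prob * (r + gamma * v_star[s2]) for prob, s2, r in p[(s, a)])

def greedy_action(s):
    """An action attaining the arg max of the lookahead (first one found on ties)."""
    actions = [a for (s0, a) in p if s0 == s]
    return max(actions, key=lambda a: q_from_v(s, a))

print(greedy_action("A"), q_from_v("A", "left"), q_from_v("A", "right"))
```

Note that `max` returns the first maximiser rather than a random one, so a random tie-break would need to be added if that behaviour matters.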