# Chapter 3: Finite Markov Decision Processes

### Exercise 3.1

Q: Devise three example tasks of your own that fit into the MDP framework, identifying for each its states, actions, and rewards. Make the three examples as *different* from each other as possible. The framework is abstract and flexible and can be applied in many different ways. Stretch its limits in some way in at least one of your examples.

A: First, let us recall the finite, discrete-time MDP framework as it is defined in the textbook.

An MDP consists of three processes that evolve in discrete time:

* A state process, $S_t \in \mathcal{S}$, describing the evolution of the state that the environment is in (as represented to the agent). The set of all possible states, $\mathcal{S}$, is called the *state space* for the MDP, and for this chapter is assumed to be finite.
* A sequence of actions $A_t$ that the agent takes at each time step $t$. At each time step $t$, $A_t \in \mathcal{A}(S_t)$, where $\mathcal{A}(s)$ is the *action space* representing all available actions when the environment is in state $s \in \mathcal{S}$. When the same set of actions is available irrespective of state, we may denote the action space by $\mathcal{A}$.
* A sequence of rewards $R_t \in \mathcal{R} \subseteq \mathbb{R}$ received by the agent. The book uses the convention that $R_t$ is the reward received at the *start* of a time step, i.e. as a consequence of the agent's *previous* action (at time step $t - 1$). However, it is also possible to define $R_t$ as the reward received at the *end* of a time step, as a consequence of the action taken at that same time step $t$.

The object that turns these three components into a Markov Decision Process is the *dynamics function* $p : \mathcal{S} \times \mathcal{R} \times \mathcal{S} \times \mathcal{A} \to [0, 1]$ that specifies the probability distribution for $(S_t, R_t)$ conditional on the state of the environment and the action taken by the agent at the previous step, $(S_{t - 1}, A_{t - 1})$. In other words:

$$p(s', r \mid s, a) \doteq \operatorname{Pr}\left[S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\right].$$
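To make the definition concrete, here is a minimal sketch of a tabular dynamics function in Python. The two-state MDP, its state and action names, and the helper functions `p`, `is_normalised` and `step` are all invented for illustration; they are not taken from the textbook. The sketch only shows that, for each pair $(s, a)$, the values $p(s', r \mid s, a)$ form a probability distribution over $(s', r)$ that can be checked and sampled from.

```python
import random

# Hypothetical two-state MDP, invented purely for illustration.
# dynamics[(s, a)] maps each outcome (s', r) to p(s', r | s, a).
dynamics = {
    ("low", "wait"): {("low", 0.0): 1.0},
    ("low", "recharge"): {("high", 0.0): 1.0},
    ("high", "wait"): {("high", 0.0): 1.0},
    ("high", "search"): {("high", 1.0): 0.75, ("low", 1.0): 0.25},
}


def p(s_next, r, s, a):
    """Return p(s', r | s, a); outcomes not listed have probability zero."""
    return dynamics[(s, a)].get((s_next, r), 0.0)


def is_normalised(table, tol=1e-9):
    """Check that the outcome probabilities sum to 1 for every (s, a) pair."""
    return all(
        abs(sum(outcomes.values()) - 1.0) <= tol for outcomes in table.values()
    )


def step(s, a, rng):
    """Sample (S_t, R_t) given S_{t-1} = s and A_{t-1} = a."""
    outcomes = dynamics[(s, a)]
    pairs = list(outcomes.keys())
    weights = list(outcomes.values())
    return rng.choices(pairs, weights=weights, k=1)[0]
```

For example, `step("high", "search", random.Random(0))` draws one `(next_state, reward)` pair according to the table. Note that the available actions differ by state here, which corresponds to $\mathcal{A}(s)$ depending on $s$.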

More generally, one could extend the framework to cover (manageably) infinite state and action spaces, or problems that play out in continuous time.

Three examples of MDPs, ranging from practical to whimsical, are as follows:
1. An intelligent building climate control system. At any point, the state signal may correspond to: the building's current occupancy (and perhaps even locations of occupants); internal temperature sensor readings; external temperature, windspeed and solar illumination; perhaps even extending to the system's memory of past occupancy patterns, occupants' calendars and weather forecasts. The action space may correspond to: opening / closing valves on individual radiators; opening / closing windows; or lowering / raising blinds. The reward received by the system would be a weighted sum of the deviation between actual climate conditions and preferred climate conditions specified by occupants, a measure of the energy consumption (or perhaps CO<sub>2</sub> emissions rate), and perhaps a penalty that applies whenever an occupant overrides the system (e.g. by manually adjusting a radiator, window or blind).
2. A customer relationship management system that aims to increase the value a given customer brings to a business. The state signal may consist of transactions the customer makes with the business, website / mobile app usage data attributable to the customer, external data relating to the customer's spending / interactions with competitors, economic or market data on general consumption / demand patterns for the business's products. The action space may consist of written communications (post, email, SMS, social media, etc), perhaps even composition of the wording or decisions over the format / layout, promotions or other customisation of products offered to the customer. The reward signal at each time step could consist of the estimated change in the customer's lifetime value less the costs (e.g. marketing costs or promotional costs) incurred by the action last taken.
3. A government. The state signal would consist of all the economic, behavioural, polling and forecasting data available to decision makers. The action space consists of the decisions available to the government: legislative changes, fiscal and monetary policy changes, subsidies, service provision / curtailment, capital expenditure, etc. The reward could be the change, say from one month to the next, in some measure of the welfare of the population (the choice of measure depending on one's political leaning!). Or, more cynically, the reward could be zero except when there is an election (where it may be 1 if the incumbent governing party wins the election or -1 if it loses power).
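As an illustration of how a reward signal like the one in the first example might be written down, the following is a rough sketch of the climate control reward as a weighted sum. The field names, the weights and the use of a single temperature reading are all assumptions made for this example; a real system would need to choose and tune these itself.

```python
from dataclasses import dataclass


@dataclass
class ClimateReading:
    actual_temp_c: float     # measured indoor temperature
    preferred_temp_c: float  # occupants' preferred temperature
    energy_kwh: float        # energy used during the time step
    overridden: bool         # True if an occupant manually overrode the system


def climate_reward(reading, w_comfort=1.0, w_energy=0.5, override_penalty=2.0):
    """Negated weighted sum of discomfort, energy use and override penalty.

    The weights are hypothetical. The sum is negated so that a higher reward
    means a more comfortable, cheaper and less intrusive outcome.
    """
    discomfort = abs(reading.actual_temp_c - reading.preferred_temp_c)
    penalty = override_penalty if reading.overridden else 0.0
    return -(w_comfort * discomfort + w_energy * reading.energy_kwh + penalty)
```

With the default weights, a reading that is 2 degrees from the preferred temperature, uses 4 kWh and triggers an override gives `climate_reward(ClimateReading(23.0, 21.0, 4.0, True)) == -6.0`.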

### Exercise 3.2

Q: Is the MDP framework adequate to usefully represent *all* goal-directed learning tasks? Can you think of any clear exceptions?

A: The MDP framework as defined above (and in the textbook) has some restrictions that could limit its universality. Let's consider some of these restrictions and the types of problems that may therefore fall outside of this framework.

* The framework (implicitly) assumes that the agent is able to observe the state of the environment. In practice, it may only see limited information about the environment, or its observations of the environment may be corrupted by noise. This would apply, for example, to any agent that relies on (fallible) sensors to perceive the environment, or in many situations where an agent's observations of state are limited (e.g. by walls, unavailable data, or other limitations on its ability to perceive). However, it is unclear whether this is a strict limitation of the MDP framework: one could simply define the state signal to be whatever information the agent *does* receive about the environment, or perhaps to be the state of the agent's beliefs about the environment. Perhaps the pertinent point is that partial observability may limit how *usefully* the MDP framework can be applied to such problems (even if in some sense the MDP framework can technically be made to fit such problems).
* The framework assumes that the probability distribution for the next state $s'$ only depends on the previous state $s$ and previous action $a$. The fact that it only depends on the previous state $s$ is not too restrictive, as we could simply expand the definition of environment state to include as much of a memory of previous states as required to be able to usefully predict the future. However, the dependency on just the previous action $a$ may be more restrictive: it seems to preclude problems where actions may have delayed effects, or even adversarial problems where the environment may actively react to an agent's past behaviour when determining future state transitions. Again, technically, it is possible to expand the definition of an environment's state to include a history of the agent's previous actions and therefore incorporate such problems within the MDP framework. However, practically, it is unclear how useful this extension would be, as the dynamics function will effectively change over time (possibly adversarially): standard algorithms for solving MDPs may not be the most effective for learning a good policy under such circumstances.
* The framework assumes that the dynamics function is stationary. This is actually still quite flexible: statistically stationary systems can still evolve over time, just according to (ultimate) parameters that are constant. A bigger challenge is where the dynamics depend on an external time-varying input (which may not be known to the agent). Technically, we could incorporate time itself within the representation of state: if time-variability is predictable, this may help the agent learn how to deal with it. However, from a perspective of learnability, it may not be practical to learn a good policy for a problem where the rules governing the system vary wildly and unpredictably over time. (Compare to the problem of collaborating with a highly unpredictable or volatile colleague!)
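The state-expansion argument used in the bullets above can be sketched in code. The toy problem below is an assumption made for illustration: an action only takes effect `DELAY` steps after it is chosen, so the position alone is not a Markov state. Folding the queue of pending actions into the state restores the Markov property, at the cost of a larger state space.

```python
DELAY = 2  # hypothetical: an action takes effect two steps after it is chosen


def initial_state(position=0):
    """Augmented state: (position, pending actions with the oldest first)."""
    return (position, (0,) * DELAY)


def augmented_step(state, action):
    """Transition that depends only on the augmented state and the new action.

    The oldest pending action is applied now and the new action joins the
    back of the queue.
    """
    position, pending = state
    position += pending[0]
    pending = pending[1:] + (action,)
    reward = -abs(position)  # hypothetical reward: stay close to the origin
    return (position, pending), reward
```

Starting from `initial_state()`, choosing `+1` leaves the position at 0 for two steps and only moves it to 1 on the third step. The same trick does not obviously help when the delay or the rules themselves change unpredictably, which is the practical limitation discussed above.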

### Exercise 3.3

Q: Consider the problem of driving. You could define the actions in terms of the accelerator, steering wheel, and brake, that is, where your body meets the machine. Or you could define them farther out — say, where the rubber meets the road, considering your actions to be tire torques. Or you could define them farther in — say, where your brain meets your body, the actions being muscle twitches to control your limbs. Or you could go to a really high level and say that your actions are your choices *where* to drive. What is the right level, the right place to draw the line between agent and environment? On what basis is one location of the line to be preferred over another? Is there any fundamental reason for preferring one location over another, or is it a free choice?

A: In my view, the "right" level depends on where the line falls between what an agent can be assumed to have complete control over and what the agent can only seek to influence. Another perspective is that the same actor may be trying to solve a hierarchy of problems (e.g. navigate across a city to the office *and* drive the car safely and comfortably in the moment *and* control one's muscles to produce desired movements in one's limbs): at each level of the hierarchy one could consider an agent specific to that level that receives instructions from a higher level and determines actions in the form of instructions to a lower-level agent. (At the lowest levels, an agent may not be needed at all, as a simple mechanical device could effect the desired change.)

First, let us contrast the option of drawing the interface at the driving controls (accelerator, brake, steering wheel) versus considering actions to be tire torques. The latter approach implicitly assumes that the agent is able to specify actions as precise (changes to?) tire torques. In other words, irrespective of external conditions (e.g. the slope, weather, road surface, tire inflation and temperature, etc), it assumes that the agent knows exactly how to, say, increase the torque on the front right wheel by exactly 1 newton-meter. Practically, that seems unrealistic: as a driver I feel I can precisely control the position of the steering wheel, pressure on the brake pedal, etc; while I have some sense of how this will translate into the car's behaviour on the road, this is not exact and in fact depends on many external factors.

Next, let us contrast the option of drawing the interface at the driving controls versus as electrical signals to my muscles (or, if I were to replace myself with a humanoid robot, to the robot's actuators). If I am not sure exactly how to get my arms to turn the steering wheel at a desired rate, then this would seem reasonable. In fact, when I was first learning to drive a car (or perhaps even now, when driving an unfamiliar vehicle) this may be a sensible choice: I may not yet have a feel for how hard I need to pull the steering wheel to have the desired effect. But for most purposes, this seems to be an overcomplication of the problem: I have a pretty good sense of how to control my limbs, and it would simplify the learning challenge if I take that as a given and concentrate on driving at the more abstract level of operating the car's controls. Note however, that my authority over the car's controls is not absolute: a water bottle could roll under the brake pedal, or I could suddenly get cramp in a leg — so there may be circumstances where if I frame actions purely in terms of changes to the car's controls, I may not be able to realise those actions.

Finally, let us contrast the option of drawing the interface at the driving controls versus at the level of deciding *where* to drive (e.g. drive along the motorway until junction 8, take the first exit at the roundabout, etc). This choice really depends on the level of the problem hierarchy we are dealing with. Once we have assumed (a) a low-level agent that is effectively able to use electrical signals to control the movement of my limbs, and (b) a mid-level agent that is effectively able to operate the car's controls to control the car's immediate velocity, rate of acceleration, turn, etc, then the next level of the hierarchy is to use that mid-level agent to get us from A to B. For this purpose, we may consider a high-level agent (i.e. a navigator, which could be myself, a passenger or even a sat-nav system) that gives control signals in the form of high-level driving instructions.
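The hierarchy described in this answer can be sketched very loosely as two functions, where the action chosen at one level is the input to the level below. Everything here is hypothetical: the instruction names, the control values and the target speeds are invented, and the policies are hand-written rules rather than anything learned.

```python
def navigator(route, step_index):
    """High-level agent: emits the next driving instruction on a fixed route."""
    return route[min(step_index, len(route) - 1)]


def driver(instruction, speed_kmh):
    """Mid-level agent: maps an instruction and the current speed to controls.

    Its actions are settings of the accelerator, brake and steering wheel,
    i.e. the interface "where your body meets the machine".
    """
    if instruction == "stop":
        return {"accelerator": 0.0, "brake": 1.0, "steering": 0.0}
    steering = {"turn_left": -0.5, "turn_right": 0.5}.get(instruction, 0.0)
    target_kmh = 30.0 if steering != 0.0 else 50.0  # hypothetical targets
    if speed_kmh < target_kmh:
        return {"accelerator": 0.3, "brake": 0.0, "steering": steering}
    return {"accelerator": 0.0, "brake": 0.3, "steering": steering}
```

For instance, `driver(navigator(["straight", "turn_left", "stop"], 1), 40.0)` asks the driver to turn left while travelling faster than its turning target, so it brakes gently while steering. From the navigator's point of view the driver is part of the environment, which is one way of seeing why the agent-environment boundary depends on which problem is being solved.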
