# Chapter 3 - Finite MDPs

## 3.1 - The Agent-Environment Interface

<p style="font-size:25px;">
<b>Exercise 3.1</b> 
</p>

Devise three example tasks of your own that fit into the MDP framework,
identifying for each its states, actions, and rewards. Make the three examples as different from each other as possible. The framework is abstract and flexible and can be applied in many different ways. Stretch its limits in some way in at least one of your examples.


<p style="font-size:22px;">
<b>Answer:</b> 
</p>


+ Example 1: MMO game. Each player is an agent. The actions might be anything a player can do in such a game, such as moving forward or backward, attacking, opening the inventory, etc. The states are everything that describes the player in its environment: the position of the agent, the enemies around it, the obstacles, the HP, etc. Positive rewards might be given when a player gains XP or completes a quest, and negative rewards when it loses HP.

+ Example 2: Government policy creation. RL here is applied to determine new policies for a government. The actions might be what to put in each of those policies, such as laws or specific decisions that are then applied by society. The states represent the current set of policies in force, the happiness of the citizens, GDP, trade and diplomatic status with other countries, etc. The rewards might be moment-to-moment measures of key metrics such as employment, GDP, happiness, air quality, etc.

+ Example 3: Surgery. RL can be applied to perform surgery with a very high level of precision. Depending on the level of abstraction of the agent, the actions can be which tool to pick or which operation to perform. The states are the current positions of the patient and the tools, and the patient's health metrics. The reward can be +1 if the operation is successful. The agent could receive negative rewards for risky actions or for anything that destabilizes the patient's health.



<p style="font-size:25px;">
<b>Exercise 3.2</b> 
</p>

Is the MDP framework adequate to usefully represent all goal-directed
learning tasks? Can you think of any clear exceptions?


<p style="font-size:22px;">
<b>Answer:</b> 
</p>

Here are some limitations: 

* Continuous state and action spaces: in this case, tabular methods cannot represent every value, so other methods need to be used.
* Non-Markovian environments: environments where the next state depends on the sequence of previous states and not only on the current one, such as stock trading.
* Tasks where the goal cannot be made clear and measurable enough to be expressed as a reward. Systems with contradictory goals are also an issue, since they require finding some sort of equilibrium between those goals.
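
To make the second point concrete, the Markov property can be written in the notation of this chapter. The statement below is a sketch of the standard formulation (the explicit history $S_0, A_0, R_1, \dots$ is spelled out here for illustration): an environment is Markov when conditioning on the full history adds nothing beyond the current state and action.

$$
\Pr\{S_{t+1}=s', R_{t+1}=r \mid S_t=s, A_t=a\} = \Pr\{S_{t+1}=s', R_{t+1}=r \mid S_0, A_0, R_1, \dots, S_{t-1}, A_{t-1}, R_t, S_t=s, A_t=a\}
$$

A non-Markovian environment is one where this equality fails for the chosen state representation.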

<p style="font-size:25px;">
<b>Exercise 3.3</b> 
</p>

Consider the problem of driving. You could define the actions in terms of
the accelerator, steering wheel, and brake, that is, where your body meets the machine.
Or you could define them farther out—say, where the rubber meets the road, considering your actions to be tire torques. Or you could define them farther in—say, where your brain meets your body, the actions being muscle twitches to control your limbs. Or you could go to a really high level and say that your actions are your choices of where to drive.
What is the right level, the right place to draw the line between agent and environment?
On what basis is one location of the line to be preferred over another? Is there any fundamental reason for preferring one location over another, or is it a free choice?


<p style="font-size:22px;">
<b>Answer:</b> 
</p>

It's important to choose a level where the actions map easily to real-life functions and where the agent can actually do something. Beyond that, it's a matter of balance: we want the level that lets us maximize our rewards.

The level at which actions are defined should align with the goal of the task.

<p style="font-size:25px;">
<b>Exercise 3.4</b> 
</p>

Give a table analogous to that in Example 3.3, but for $p(s', r \mid s, a)$. It should have columns for $s$, $a$, $s'$, $r$, and $p(s', r \mid s, a)$, and a row for every 4-tuple for which $p(s', r \mid s, a) > 0$.


<p style="font-size:22px;">
<b>Answer:</b> 
</p>

$$
\begin{array}{cccc|c}
\hline
\textbf{s} & \textbf{a} & \textbf{s'} & \textbf{r} & \textbf{p(s', r|s,a)}\\
\hline
\text{high} & \text{search} & \text{high} & r_{\text{search}} & \alpha \\
\hline
\text{high} & \text{search} & \text{low} & r_{\text{search}} & 1-\alpha \\
\hline
\text{low} & \text{search} & \text{high} & -3 & 1 - \beta \\
\hline
\text{low} & \text{search} & \text{low} & r_{\text{search}} & \beta \\
\hline
\text{high} & \text{wait} & \text{high} & r_{\text{wait}} & 1 \\
\hline
\text{low} & \text{wait} & \text{low} & r_{\text{wait}} & 1 \\
\hline
\text{low} & \text{recharge} & \text{high} & 0 & 1 \\
\hline
\end{array}
$$
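
As a sanity check, the table can be encoded as a dictionary and summed over $(s', r)$ for each state-action pair. This is only a sketch: the numeric values of `ALPHA`, `BETA`, `R_SEARCH` and `R_WAIT` are hypothetical, picked for illustration, and the helper name `totals` is not from the book.

```python
from collections import defaultdict

# Hypothetical numeric values, chosen only for illustration.
ALPHA, BETA = 0.8, 0.6
R_SEARCH, R_WAIT = 2.0, 1.0

# One entry per table row: (s, a, s', r) -> p(s', r | s, a)
P = {
    ("high", "search", "high", R_SEARCH): ALPHA,
    ("high", "search", "low", R_SEARCH): 1 - ALPHA,
    ("low", "search", "high", -3.0): 1 - BETA,
    ("low", "search", "low", R_SEARCH): BETA,
    ("high", "wait", "high", R_WAIT): 1.0,
    ("low", "wait", "low", R_WAIT): 1.0,
    ("low", "recharge", "high", 0.0): 1.0,
}


def totals(p):
    """Sum p(s', r | s, a) over all (s', r) for each (s, a) pair."""
    out = defaultdict(float)
    for (s, a, _next_state, _reward), prob in p.items():
        out[(s, a)] += prob
    return dict(out)


# Each (s, a) pair should sum to 1, up to floating-point rounding.
for (s, a), total in totals(P).items():
    print(s, a, round(total, 12))
```

Every state-action pair that appears in the table should sum to 1, which is the normalization condition (3.3) restricted to the rows with nonzero probability.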



## 3.2 - Goals and Rewards

## 3.3 - Returns and Episodes

<p style="font-size:25px;">
<b>Exercise 3.5</b> 
</p>

The equations in Section 3.1 are for the continuing case and need to be
modified (very slightly) to apply to episodic tasks. Show that you know the modification needed by giving the modified version of (3.3). 


<p style="font-size:22px;">
<b>Answer:</b> 
</p>

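In the episodic case, a terminal state has no action to take, no reward, and no next step. However, from a state $s$ just before a terminal state (which is still in $\mathcal{S}$), the next state can be the terminal state. The sum over next states must therefore range over $\mathcal{S}^+$, the set of all states including the terminal one, while $s$ still ranges over the nonterminal states $\mathcal{S}$: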

$$
\sum_{s' \in \mathcal{S}^+} \sum_{r \in \mathcal{R}} p(s', r \mid s, a) = 1, \quad \text{for all } s \in \mathcal{S}, a \in \mathcal{A}(s)
$$
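
The effect of summing over $\mathcal{S}^+$ instead of $\mathcal{S}$ can be seen on a toy example. The sketch below uses a made-up episodic MDP (states `"A"` and `"B"`, a terminal state `"T"`, a single action `"go"`), and the function name `total_probability` is hypothetical.

```python
# Hypothetical toy episodic MDP, for illustration only.
S = {"A", "B"}       # nonterminal states
S_PLUS = S | {"T"}   # all states, including the terminal state "T"

# (s, a, s', r) -> p(s', r | s, a)
P = {
    ("A", "go", "B", 0.0): 1.0,
    ("B", "go", "A", 0.0): 0.5,
    ("B", "go", "T", 1.0): 0.5,  # transition into the terminal state
}


def total_probability(p, s, a, next_states):
    """Sum p(s', r | s, a) over all rewards r and all s' in next_states."""
    return sum(
        prob
        for (s0, a0, s1, _reward), prob in p.items()
        if s0 == s and a0 == a and s1 in next_states
    )


print(total_probability(P, "B", "go", S_PLUS))  # 1.0
print(total_probability(P, "B", "go", S))       # 0.5
```

Restricting the sum to $\mathcal{S}$ drops the probability mass that flows into the terminal state, so the distribution no longer sums to 1 for state `"B"`.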
