<a href="https://colab.research.google.com/github/COMP90054/2024-S2-tutorials/blob/main/solution_set_10.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# COMP90054 AI Planning for Autonomy
### Solution Set 10: Policy iteration and reward shaping




### Problem 1: Policy update

Policy iteration has two main steps: policy evaluation and policy update. To evaluate the given policy, we write out the value of each state under that policy:

$\begin{array}{lll}
V^{\pi}(Messi) & = & Q^{\pi}(Messi, Pass)\\
               & = & P_{Pass}(Suarez \mid Messi)[r(Messi,Pass,Suarez) + \gamma \cdot V^{\pi}(Suarez)]\\
               & = & \gamma \cdot V^{\pi}(Suarez) - 1\\[2mm]
V^{\pi}(Suarez) & = & Q^{\pi}(Suarez, Pass)\\
               & = & P_{Pass}(Messi \mid Suarez)[r(Suarez,Pass,Messi) + \gamma \cdot V^{\pi}(Messi)]\\
               & = & \gamma \cdot V^{\pi}(Messi) - 1\\[2mm]
V^{\pi}(Scored) & = & Q^{\pi}(Scored, Return)\\
               & = & P_{Return}(Messi \mid Scored)[r(Scored,Return,Messi) + \gamma \cdot V^{\pi}(Messi)]\\
               & = & \gamma \cdot V^{\pi}(Messi) + 2
\end{array}$

Then we solve the resulting system of simultaneous linear equations (not part of the subject learning outcomes) for $V^{\pi}(Messi)$ and $V^{\pi}(Suarez)$, and substitute the result into $V^{\pi}(Scored)$:

$\begin{array}{lll}
V^{\pi}(Messi) & = & 1/(\gamma -1)\\
V^{\pi}(Suarez) & = & 1/(\gamma -1)\\
V^{\pi}(Scored) & = & 3 + 1/(\gamma -1)
\end{array}$

Substituting $\gamma = 0.8$, the policy evaluation table would be:

Iteration | Q(Messi, Pass) | Q(Messi, Shoot) | Q(Suarez, Pass) | Q(Suarez, Shoot) | Q(Scored, Return)
----------|----------------|-----------------|-----------------|------------------|------------------
0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
1 | -5 | -5.52 | -5 | -4.56 | -2
2 | -4.194 | -4.772 | -4.355 | -3.993 | -1.355
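The Pass and Return entries of iteration 1 can be checked numerically. The following is a minimal sketch of iterative policy evaluation for the initial policy, assuming the deterministic transitions and rewards that the equations above imply (each Pass costs 1, Return earns 2). The names `policy_model` and `evaluate_policy` are illustrative and do not come from the tutorial code.

```python
GAMMA = 0.8

# Transitions under the initial policy (Pass, Pass, Return), read off the
# evaluation equations: state -> (next_state, reward). Assumed deterministic.
policy_model = {
    "Messi": ("Suarez", -1.0),
    "Suarez": ("Messi", -1.0),
    "Scored": ("Messi", 2.0),
}


def evaluate_policy(model, gamma, theta=1e-10):
    """Sweep the Bellman expectation update until values change by less than theta."""
    values = {state: 0.0 for state in model}
    while True:
        delta = 0.0
        for state, (next_state, reward) in model.items():
            new_value = reward + gamma * values[next_state]
            delta = max(delta, abs(new_value - values[state]))
            values[state] = new_value
        if delta < theta:
            return values


values = evaluate_policy(policy_model, GAMMA)
closed_form = 1 / (GAMMA - 1)  # V(Messi) = V(Suarez) from the solved equations

for state, value in values.items():
    print(f"V({state}) = {value:.3f}")
```

The iterative values should agree with the closed-form solution: $1/(\gamma - 1)$ for Messi and Suarez, and $3 + 1/(\gamma - 1)$ for Scored.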

Then we apply two iterations of policy update based on the above table to get:

Iteration  | $\pi$(Messi) | $\pi$(Suarez)  | $\pi$(Scored)
-----------|---|----|----
0  |Pass|Pass|Return
1  |Pass | Shoot| Return
2  |Pass | Shoot |Return
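The policy update step simply picks, in each state, the action with the highest Q-value. As a small sketch, the snippet below applies that rule to the Q-values copied from the policy evaluation table; the dictionary layout and the helper name `greedy_policy` are assumptions made for illustration.

```python
# Q-values copied from the policy evaluation table (iterations 1 and 2).
q_tables = {
    1: {
        "Messi": {"Pass": -5.0, "Shoot": -5.52},
        "Suarez": {"Pass": -5.0, "Shoot": -4.56},
        "Scored": {"Return": -2.0},
    },
    2: {
        "Messi": {"Pass": -4.194, "Shoot": -4.772},
        "Suarez": {"Pass": -4.355, "Shoot": -3.993},
        "Scored": {"Return": -1.355},
    },
}


def greedy_policy(q_values):
    """Policy update: choose the action with the highest Q-value in each state."""
    return {state: max(actions, key=actions.get) for state, actions in q_values.items()}


policies = {iteration: greedy_policy(q) for iteration, q in q_tables.items()}

for iteration, policy in policies.items():
    print(iteration, policy)
```

Both iterations give Pass for Messi, Shoot for Suarez and Return for Scored, so the policy no longer changes after the first update.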

### Problem 2: Potential functions

The important thing for the potential function is that it must consider the agent's next goal, which depends on
whether the key is held. Using normalised Manhattan distance to that goal as the estimate, we can define the
following potential function:

```
if Key == 0:
    return 1 - NormalizedManhattan(s, K)
else if Key == 1 and M == False:
    return 1 - NormalizedManhattan(s, M)
else if Key == 1 and M == True:
    return 1 - NormalizedManhattan(s, R)
else if Key == 2:
    return 1 - NormalizedManhattan(s, R)
```

Others are possible, but this one will help to guide the agent early in the exploration.
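For readers who want to experiment, here is one way the case analysis above might look in Python. It is a sketch under assumptions: the state is split into a position, a key counter and a flag for `M`, and the locations of `K`, `M` and `R` are passed in explicitly. The coordinates and the maximum distance in the usage lines are hypothetical values chosen for illustration, not the grid from the problem statement.

```python
def normalised_manhattan(pos, target, max_distance):
    """Manhattan distance between two (x, y) cells, scaled by max_distance."""
    return (abs(pos[0] - target[0]) + abs(pos[1] - target[1])) / max_distance


def potential(pos, key, m_done, k_pos, m_pos, r_pos, max_distance):
    """Return 1 minus the normalised distance to the next goal."""
    if key == 0:
        target = k_pos  # no key yet: head for K
    elif key == 1 and not m_done:
        target = m_pos  # key held, M not yet done: head for M
    elif (key == 1 and m_done) or key == 2:
        target = r_pos  # remaining cases from the pseudocode: head for R
    else:
        raise ValueError(f"unexpected key value: {key}")
    return 1 - normalised_manhattan(pos, target, max_distance)


# Hypothetical layout, for illustration only.
K_POS, M_POS, R_POS = (2, 3), (0, 5), (6, 6)
MAX_DISTANCE = 12

phi = potential((4, 0), key=1, m_done=False,
                k_pos=K_POS, m_pos=M_POS, r_pos=R_POS,
                max_distance=MAX_DISTANCE)
print(phi)
```

With this made-up layout the agent at `(4, 0)` is 9 steps from `M`, so the potential is `1 - 9/12`.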

### Problem 3: Reward shaping update

Assuming that the Manhattan distance is normalised, we calculate $\Phi$ for the following states.

Let $s$ be the current state, $s_1$ be the state after action Up and $s_2$ be the state after action Right:

$
\begin{array}{lll}
 s & = & ((4,0), Key = 1,M = False)\\
s_1 & = & ((4,1),Key = 1,M = False)\\
s_2 & = & ((5,0),Key = 1,M = False)
\end{array}
$

Then the $\Phi$ value of each state is:
$
\begin{array}{lllll}
\Phi(s) & = & 1 - \frac{9}{12} & = & \frac{3}{12}\\
\Phi(s_1) & = & 1 - \frac{8}{12} & = & \frac{4}{12}\\
\Phi(s_2) & = & 1 - \frac{10}{12} & = & \frac{2}{12}
\end{array}
$

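To update the Up action, using the shaping term $F(s,s') = \gamma \Phi(s') - \Phi(s)$ with $\alpha = 0.2$ and $\gamma = 0.9$: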

$
\begin{array}{lll}
Q(s, Up) & \leftarrow & Q(s, Up) + \alpha[r(s,Up,s_1) + F(s,s_1) + \gamma \max_{a'} Q(s_1,a') - Q(s,Up)]\\
        & \leftarrow & 0 + 0.2 \times [0 + (0.9 \times \frac{4}{12} - \frac{3}{12}) + 0.9 \times 0 - 0]\\
        & \leftarrow & 0.01
\end{array}
$

To update the Right action:

$
\begin{array}{lll}
Q(s, Right) & \leftarrow & Q(s, Right) + \alpha[r(s,Right,s_2) + F(s,s_2) + \gamma \max_{a'} Q(s_2,a') - Q(s,Right)]\\
        & \leftarrow & 0 + 0.2 \times [0 + (0.9 \times \frac{2}{12} - \frac{3}{12}) + 0.9 \times 0 - 0]\\
        & \leftarrow & -0.02
\end{array}
$
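Both updates can be reproduced with a few lines of Python. This is a minimal sketch of a single Q-learning update with potential-based shaping, using the $\Phi$ values computed above and the learning rate 0.2 and discount factor 0.9 that appear in the worked numbers. The function name `shaped_q_update` is illustrative only.

```python
ALPHA = 0.2  # learning rate used in the worked update
GAMMA = 0.9  # discount factor used in the worked update


def shaped_q_update(q_sa, reward, phi_s, phi_next, max_q_next,
                    alpha=ALPHA, gamma=GAMMA):
    """One Q-learning update with the shaping term F = gamma * phi(s') - phi(s)."""
    shaping = gamma * phi_next - phi_s
    td_target = reward + shaping + gamma * max_q_next
    return q_sa + alpha * (td_target - q_sa)


# Potentials of s, s_1 (after Up) and s_2 (after Right).
phi_s, phi_up, phi_right = 3 / 12, 4 / 12, 2 / 12

# All Q-values start at 0 and both transitions give reward 0.
q_up = shaped_q_update(0.0, 0.0, phi_s, phi_up, 0.0)
q_right = shaped_q_update(0.0, 0.0, phi_s, phi_right, 0.0)

print(round(q_up, 2), round(q_right, 2))  # 0.01 -0.02
```

Moving Up increases the potential, so the shaped update is positive, while moving Right decreases it and the update is negative, even though the environment reward is 0 in both cases.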