<a href="https://colab.research.google.com/github/COMP90054/2025-S2-tutorials/blob/main/solution_set_09.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# COMP90054 AI Planning for Autonomy
### Problem Set 09
 - Value Function Approximation




### Key concepts:
- The role of function approximators in RL
- Gradient descent
- DQN



---


### Problem 1:


Discuss the following questions in groups:

- What is the primary benefit of function approximation in RL?
- What problem is batch RL trying to solve?


### Answer:

Function approximation is *not* primarily about reducing storage space for the Q-table. Imagine a computer with infinite space, trying to learn chess using tabular RL. Even though storage is not an issue, it will still fail to learn in any reasonable time, because an update only affects that exact state, and knowledge cannot generalise at all. The primary benefit of function approximation is that experience *generalises*, i.e. what you learn in one state will teach you something about how to act in similar but not identical states. This is fundamental to all efficient learning. 
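
The difference can be seen in a minimal sketch. The 10-state corridor, the two features (a bias and the scaled position), the step size and the target below are all hypothetical choices made only for illustration. A single update at one state changes one entry of the table, but shifts the linear estimate for every state.

```python
# Compare how far one update "spreads" under tabular and linear value estimates.
# All constants are illustrative assumptions, not taken from the problem set.
N_STATES = 10
ALPHA = 0.5
TARGET = 1.0   # pretend a return of 1 was observed from the visited state
VISITED = 4


def features(state):
    """Hypothetical features: a bias term and the position scaled to [0, 1]."""
    return [1.0, state / (N_STATES - 1)]


def linear_value(w, state):
    """Linear value estimate: dot product of weights and features."""
    return sum(wi * xi for wi, xi in zip(w, features(state)))


# Tabular: one independent entry per state, so only the visited entry moves.
table = [0.0] * N_STATES
table[VISITED] += ALPHA * (TARGET - table[VISITED])

# Linear: every state shares the same two weights, so one update moves them all.
w = [0.0, 0.0]
error = TARGET - linear_value(w, VISITED)
w = [wi + ALPHA * error * xi for wi, xi in zip(w, features(VISITED))]

linear_values = [linear_value(w, s) for s in range(N_STATES)]
changed_tabular = sum(v != 0.0 for v in table)
changed_linear = sum(v != 0.0 for v in linear_values)

print("states changed (tabular):", changed_tabular)
print("states changed (linear): ", changed_linear)
```

Whether the shared update helps or hurts depends on how well the features reflect which states are genuinely similar.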

Batch RL tries to solve the problem that incremental learning with function approximators is very sample-inefficient. This is because each experience is only used once, then thrown away. Consider a case where, early in training, the agent finds a key. Because it does not yet know whether the key is useful, no meaningful update occurs. Later in training, the agent learns that having the key is highly valuable. In an incremental setting the opportunity to learn from that initial experience has passed, but in batch mode the agent can revisit the stored experience and learn that the action which picked up the key was an important one.
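
One common way to keep old experience available is a replay buffer. The sketch below is a minimal illustration rather than a reference implementation: the class name `ReplayBuffer`, the string-valued states and actions, and the capacity and batch sizes are all hypothetical. It shows that a transition stored early on (picking up the key) keeps being sampled long after it was collected.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        # A deque with maxlen silently discards the oldest transition when full.
        self.transitions = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.transitions.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement within a batch.
        k = min(batch_size, len(self.transitions))
        return rng.sample(list(self.transitions), k)

    def __len__(self):
        return len(self.transitions)


buffer = ReplayBuffer(capacity=1000)

# Early in training: the key is picked up, but no reward is seen yet.
buffer.add("room", "pick_up_key", 0.0, "room_with_key", False)

# Later experience keeps arriving (placeholder transitions).
for step in range(50):
    buffer.add(f"s{step}", "move", 0.0, f"s{step + 1}", False)

# Every sampled batch is a chance to re-learn from the old key transition.
rng = random.Random(0)
batches = [buffer.sample(8, rng) for _ in range(200)]
key_replays = sum(t[1] == "pick_up_key" for batch in batches for t in batch)

print("stored transitions:", len(buffer))
print("times the key transition was replayed:", key_replays)
```

In an incremental learner the key transition would have contributed exactly one update, made before the agent knew the key mattered.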



### Problem 2:

Consider a simple MDP consisting of a 10x10 grid in which the agent can move up, down, left, or right. The agent starts in (0,0) and gets a reward of +1 when reaching (9,9), which is a terminal state. $\gamma = 0.9$, and all actions are deterministic (there is no randomness).

Andrew the average computer scientist has designed the following feature representation: $\hat{q}((x,y),A, \mathbf{w}) = w_0\cdot x' + w_1 \cdot y'$, where $x'$ and $y'$ are the coordinates of the agent after applying action $A$. 
He is trying to perform incremental control with action-value function approximation using TD(0), i.e. linear SARSA. Imagine the agent moves up from the cell (2,3) and then moves right. The existing weight vector is $[0.1,-0.2]$. 

1. Perform a single weight update from that transition with $\alpha = 0.2$. Recall from the lecture slides, the weight update equation is: $\Delta \mathbf{w} = \alpha ({R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w})} - \hat{q}(S_t, A_t, \mathbf{w})) \nabla_\mathbf{w} \hat{q}(S_t, A_t, \mathbf{w})$
2. What weights would result in an optimal greedy policy?
3. Consider changing the MDP so that the terminal reward state is in (5,5). What problems does this cause? How could you address this?

### Answer:
1.
Since we are using linear function approximation, $\nabla_\mathbf{w} \hat{q}(S_t, A_t, \mathbf{w}) = \mathbf{x}(S_t, A_t) = [x', y']$, the coordinates reached by taking $A_t$ in $S_t$. Moving up from (2,3) gives $(x', y') = (2,4)$, and then moving right gives $(x'', y'') = (3,4)$. No reward is received on this transition, so $R_{t+1} = 0$. This means our update is of the form:

$$\Delta \mathbf{w} = \alpha ({R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w})} - \hat{q}(S_t, A_t, \mathbf{w})) \cdot \mathbf{x}$$
$$  = \alpha ({R_{t+1} + \gamma (w_0\cdot x'' + w_1 \cdot y'')} - (w_0\cdot x' + w_1 \cdot y')) \cdot [x', y']$$
$$  = 0.2 ({0 + 0.9 (0.1\cdot 3 + (-0.2) \cdot 4)} - (0.1\cdot 2 + (-0.2) \cdot 4)) \cdot [2,4] $$
$$  = 0.2 \cdot (-0.45 + 0.6) \cdot [2,4] = 0.03 \cdot [2,4] $$


So $\Delta w_0 = 0.06$, $\Delta w_1 = 0.12$, and the updated weights are $[0.1 + 0.06, -0.2 + 0.12] = [0.16, -0.08]$.
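
The update can also be checked numerically. This is a minimal sketch under stated assumptions: "up" is taken to increase $y$, grid boundaries are ignored because the transition is in the interior, and the helper names `features`, `q_hat` and `sarsa_update` are made up for illustration.

```python
# Assumed action encoding: "up" increases y, "right" increases x.
ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}


def features(state, action):
    """Post-action coordinates (x', y'); walls are ignored in this sketch."""
    dx, dy = ACTIONS[action]
    return (state[0] + dx, state[1] + dy)


def q_hat(state, action, w):
    """Linear action-value estimate w0 * x' + w1 * y'."""
    x = features(state, action)
    return w[0] * x[0] + w[1] * x[1]


def sarsa_update(w, s, a, r, s_next, a_next, alpha, gamma):
    """One linear SARSA (TD(0)) step; returns the new weights and the TD error."""
    td_error = r + gamma * q_hat(s_next, a_next, w) - q_hat(s, a, w)
    x = features(s, a)  # gradient of a linear q_hat with respect to w
    new_w = [wi + alpha * td_error * xi for wi, xi in zip(w, x)]
    return new_w, td_error


w = [0.1, -0.2]
s, a = (2, 3), "up"
s_next = features(s, a)   # deterministic: the next state is the post-action cell
a_next = "right"

new_w, td_error = sarsa_update(w, s, a, 0.0, s_next, a_next, alpha=0.2, gamma=0.9)

print("TD error:", round(td_error, 4))
print("updated weights:", [round(v, 4) for v in new_w])
```

Note that the gradient is the feature vector of $(S_t, A_t)$, i.e. the cell reached after moving up, not the coordinates of the cell the agent started in.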

2. If $w_0>0$ and $w_1>0$, the greedy policy is optimal: squares with larger x and y coordinates are favoured, so the agent travels to (9,9) along a shortest path (any of them).
3. Assuming a greedy (or near-greedy) policy, no combination of feature weights can encode the optimal policy. If $w_0$ is positive, the agent will always prefer to go right; if negative, it will always prefer to go left; and if 0, it will be indifferent. The same holds for $w_1$ in the y-direction. But the optimal policy requires going right when the agent is to the left of (5,5) and left when it is to the right, which these features cannot represent. To resolve this, the features must be altered. One possibility is to use features that represent the distance to the terminal rewarding state in the x and y directions.
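
Both claims can be illustrated with greedy rollouts. The sketch below makes several assumptions not stated in the problem: moves off the grid leave the agent in place, ties between actions are broken by a fixed order, and the weights $(1, 1)$ and $(-1, -1)$ are arbitrary representatives. The function names are hypothetical.

```python
ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
SIZE = 10


def step(state, action):
    """Deterministic move; assumes a move off the grid leaves the agent in place."""
    dx, dy = ACTIONS[action]
    x, y = state[0] + dx, state[1] + dy
    return (x, y) if 0 <= x < SIZE and 0 <= y < SIZE else state


def coordinate_features(state, action, goal):
    """Original features: the post-action coordinates (x', y')."""
    return step(state, action)


def distance_features(state, action, goal):
    """Alternative features: post-action distance to the goal along each axis."""
    x, y = step(state, action)
    return (abs(x - goal[0]), abs(y - goal[1]))


def greedy_rollout(start, goal, w, features, max_steps=50):
    """Follow the greedy policy for the linear q_hat; returns (final state, steps)."""
    state, steps = start, 0
    while state != goal and steps < max_steps:
        action = max(
            ACTIONS,
            key=lambda a: sum(wi * fi for wi, fi in zip(w, features(state, a, goal))),
        )
        state = step(state, action)
        steps += 1
    return state, steps


# Goal (9,9): positive weights on the raw coordinates reach it in 18 steps.
print(greedy_rollout((0, 0), (9, 9), (1, 1), coordinate_features))

# Goal (5,5): the same features overshoot and get stuck in the corner.
print(greedy_rollout((0, 0), (5, 5), (1, 1), coordinate_features))

# Goal (5,5): negative weights on distance features work from either side.
print(greedy_rollout((0, 0), (5, 5), (-1, -1), distance_features))
print(greedy_rollout((9, 9), (5, 5), (-1, -1), distance_features))
```

The distance features work because the sign of the preferred direction now depends on which side of the goal the agent is on, which a single weight on a raw coordinate cannot express.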


---

### Problem 3:




Go to Andrej Karpathy's [DQN demo](https://cs.stanford.edu/people/karpathy/convnetjs/demo/rldemo.html). Read the description of the MDP, and observe the agent learning.

Now, consider the following questions:
 1. How does the given state representation differ from the image you are seeing? What would be the implications of changing to a pixel-based representation? Compare the sizes of the state spaces under these two representations with reasonable assumptions.
 2. Does the current state representation fully describe the entire environment? Is this an MDP?
 3. Try to verbally describe a reasonable policy for this problem (in terms of the current state representation). This will necessarily be a bit vague.
 4. Imagine you are using linear function approximation. What feature weights could correspond to the policy you described above? Are there issues with this? How does using a neural network address this?


### Answer:

1. The state representation is just a list of 27 numbers between 0 and 1, representing distances to 3 different types of object along 9 directions. If we assume the distances are rounded to one decimal place, this gives roughly $10^{27}$ states, a large but not extreme number. Using a pixel-based representation would be challenging: assume we map the world to a 100x100 grid, with each cell containing either food, poison, a wall, or the robot. If we encode these objects using colours in an RGB fashion, with the standard 256 levels per channel, this would be $(256^3)^{10000}$ states, an incomparably larger number. How could we be more efficient?
2. No, the current state only encodes a small fraction of the environment, which means the agent is actually learning over a POMDP. Notice though that the demo does not mention this at all, and we are still using exactly the same methods that we normally use for MDPs. With deep RL we generally have no convergence guarantees in any case, so it is common to just treat a POMDP as an MDP and apply standard methods. This works surprisingly well much of the time.
3. One possible attempt: if the distance to the food is lower than that of the wall and poison along the center line, go forwards. If the distance to the food is lower than that of the wall and poison along a different line, turn left or right to try to center that line. If the food is not closer than the walls or poison along any line, turn towards a direction where walls and poison are far away, to try to get around closer obstacles.
4. If we concentrate on simply the center line, then generally the closer the food, the better, so we would have a positive weight for that feature. The opposite is true for the poison, the closer the worse, so a negative weight would be appropriate. The issue is that the value of going forwards depends quite precisely on which is closer. If the food is 0.5 away, and the poison 0.6, that implies a very different policy than if the poison is 0.5 and the food 0.6. Any linear approximator would have to assign similar values to these states though. By using a neural network, it can learn more complex features, such as distance to food - distance to poison, and introduce non-linear breakpoints to account for this sort of complexity. 
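
The state-space comparison in answer 1 is easier to appreciate on a log scale. The sketch below assumes about 10 distinguishable levels per sensor reading, a 100x100 pixel grid, and 256 levels per colour channel (8 bits). The third line, a categorical encoding with five possible contents per cell (empty, food, poison, wall, robot), is one possible response to the closing question of that answer and is an assumption, not necessarily the intended one.

```python
import math


def log10_states(values_per_variable, n_variables):
    """log10 of values_per_variable ** n_variables, without building the huge integer."""
    return n_variables * math.log10(values_per_variable)


# 27 sensor readings, each rounded to roughly 10 levels.
sensor_log10 = log10_states(10, 27)

# 100 x 100 pixels, each a 24-bit RGB colour (256 levels per channel).
rgb_log10 = log10_states(256 ** 3, 100 * 100)

# Assumed alternative: one of 5 categories per cell instead of a full colour.
categorical_log10 = log10_states(5, 100 * 100)

print(f"sensor representation:  about 10^{sensor_log10:.0f} states")
print(f"RGB pixel grid:         about 10^{rgb_log10:.0f} states")
print(f"categorical pixel grid: about 10^{categorical_log10:.0f} states")
```

Even the categorical grid is vastly larger than the sensor representation, which is why the choice of state features matters so much here.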