# Model-Free Prediction

Notes for the following lecture:

- [Lecture Video](https://www.youtube.com/watch?v=PnHCvfgC_ZA&list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ&index=4)
- [Lecture Slides](https://davidstarsilver.wordpress.com/wp-content/uploads/2025/04/lecture-4-model-free-prediction-.pdf)

In this lecture we discuss:
1. Monte-Carlo Learning
2. Temporal-Difference Learning (TD)
3. TD($\lambda$)


Where this lecture fits in the course:
1. Previously we discussed planning with dynamic programming, which requires knowing the dynamics of the MDP (hence the name planning).
2. This lecture covers Model-Free prediction, where we estimate the value function of an unknown MDP (the dynamics of the system are not known).
3. Next lecture covers Model-Free control, where we optimise the value function of an unknown MDP.


Why is dynamic programming Model-Based (hence the name planning)?
Because the transition probabilities $P(s'| s, a)$ and the reward model $R(s, a)$ must be known in order to solve the problem with dynamic programming.

## Monte-Carlo Learning

Since we are now model-free, we give up the assumption that we know $P(s'| s, a)$ and $R(s, a)$.

This lecture we'll focus on just policy evaluation.

Monte-Carlo:
1. Not the most efficient, but widely used in practice.
2. Episodic: it learns directly from complete episodes of experience. The caveat is that we have to wait for the episode to terminate before we can learn anything.
3. Uses the simplest possible idea: value = mean return.


The idea is to look at episodes of experience, and use that to evaluate the value function.

Given that you have a trajectory for an episode following a certain policy $\pi$:


$$
G_t = R_{t + 1} + \gamma R_{t + 2} + \dots + \gamma^{T - t - 1}R_T
$$
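As a minimal sketch of the return formula, the snippet below computes $G_t$ for every time step of one recorded episode. The function name `discounted_returns` and the convention that `rewards[t]` stores $R_{t+1}$ are assumptions made for illustration.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_{T-1}] for one episode.

    rewards[t] holds R_{t+1}, the reward received after leaving S_t
    (an indexing convention assumed for this sketch).
    """
    returns = [0.0] * len(rewards)
    g = 0.0
    # Walk backwards so each step reuses G_{t+1}: G_t = R_{t+1} + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


print(discounted_returns([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]
```

Walking the episode backwards avoids recomputing the discounted sum from scratch for each $t$.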

The value function for a given policy $\pi$ is then:

$$
v_{\pi}(s) = E_{\pi}[G_t | S_t = s]
$$

Monte-Carlo policy evaluation replaces this expectation with the empirical mean of the returns observed from this state onwards across the sampled trajectories.

One might ask: doesn't the number of samples you need grow exponentially with the size of the state space? The answer is twofold:
1. For a specific state, as long as you sample enough trajectories onwards from that state, the empirical average approaches the true mean, regardless of how large the rest of the state space is.
2. You only need to sample the portions of the state space that you care about (those the policy actually visits), so even if the state space is very large you don't have to explore all of it.

There are 2 flavours of Monte-Carlo:
1. First-visit MC: within each episode, only the return following the first time step at which a state is visited counts towards that state's mean.
2. Every-visit MC: the return following every time step at which a state is visited counts towards that state's mean, even if the state occurs several times in the same episode.

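This is the incremental update rule applied to each visited state in MC, where $\alpha$ is the learning rate (step size). With $\alpha = 1/N(S_t)$, where $N(S_t)$ counts the visits to $S_t$, it is exactly the running mean of the returns seen from this state; with a constant $\alpha$ it becomes an exponentially weighted moving average that gradually forgets old episodes: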

$$
V(S_t) \leftarrow V(S_t) + \alpha(G_t - V(S_t))
$$

$G_t$ in the above formula is the return from time step $t$ until the end of the episode, hence you have to wait for the episode to finish before going back and updating the states visited in it.
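The following is a rough sketch of tabular Monte-Carlo policy evaluation built on that update rule. The function name `mc_prediction`, the episode format (a list of `(state, reward)` pairs, with the reward being $R_{t+1}$), and the `first_visit` switch are assumptions for illustration, not something prescribed by the lecture.

```python
from collections import defaultdict


def mc_prediction(episodes, gamma=1.0, alpha=None, first_visit=True):
    """Estimate V from complete episodes.

    Each episode is a list of (state, reward) pairs, where reward is R_{t+1},
    the reward received after leaving that state (assumed format).
    alpha=None uses the running mean, i.e. step size 1 / N(s).
    """
    V = defaultdict(float)
    counts = defaultdict(int)
    for episode in episodes:
        # Compute G_t for every time step by walking the episode backwards.
        g = 0.0
        returns = []
        for _, reward in reversed(episode):
            g = reward + gamma * g
            returns.append(g)
        returns.reverse()

        seen = set()
        for (state, _), g_t in zip(episode, returns):
            if first_visit and state in seen:
                continue  # only the first occurrence in this episode counts
            seen.add(state)
            counts[state] += 1
            step = alpha if alpha is not None else 1.0 / counts[state]
            V[state] += step * (g_t - V[state])  # V(S_t) <- V(S_t) + alpha (G_t - V(S_t))
    return dict(V)


episode = [("A", 1.0), ("A", 1.0), ("B", 0.0)]
print(mc_prediction([episode], first_visit=True))   # {'A': 2.0, 'B': 0.0}
print(mc_prediction([episode], first_visit=False))  # {'A': 1.5, 'B': 0.0}
```

In the toy episode, state `A` is visited twice with returns 2 and 1, so first-visit MC records 2 while every-visit MC averages both to 1.5.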

## Temporal Difference Learning 

Instead of waiting for an entire trajectory to complete and only then updating the value function along it, TD learns from incomplete episodes: we estimate the remaining return of the trajectory without actually finishing it. Updating a guess towards a guess like this is called bootstrapping.

So unlike MC, where we had to wait for the episode to terminate to get the value of $G_t$, we now use the Bellman equation to form an estimate of $G_t$ from the next reward and the current value estimate of the next state:

$$
V(S_t) \leftarrow V(S_t) + \alpha(R_{t+1} + \gamma V(S_{t + 1}) - V(S_t))
$$

The above update rule is called TD(0).

Some terminology:
1. $R_{t+1} + \gamma V(S_{t + 1})$ is called the TD target
2. $R_{t+1} + \gamma V(S_{t + 1}) - V(S_t)$ is called the TD error
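A single TD(0) step can be sketched as follows. The helper name `td0_update` and the `done` flag (used to treat the terminal state's value as 0) are assumptions for illustration.

```python
from collections import defaultdict


def td0_update(V, state, reward, next_state, done, alpha=0.1, gamma=1.0):
    """Apply one TD(0) update to V in place and return the TD error."""
    # The value of a terminal state is 0, so there is nothing to bootstrap from.
    bootstrap = 0.0 if done else V[next_state]
    td_target = reward + gamma * bootstrap
    td_error = td_target - V[state]
    V[state] += alpha * td_error
    return td_error


V = defaultdict(float)
V["B"] = 1.0
delta = td0_update(V, "A", reward=0.0, next_state="B", done=False, alpha=0.5)
print(delta, V["A"])  # 1.0 0.5
```

Unlike the MC update, this needs only one transition `(state, reward, next_state)`, so it can run after every step of interaction.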


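TD lets you learn from estimated outcomes before the real outcome is known, whereas MC learns only from outcomes that actually happened. For example, suppose you are driving and narrowly avoid a crash. MC sees only the final return of the episode (no crash), so the near-miss states are never penalised. TD updates the value of the preceding state as soon as you reach a state whose estimated value is low (a crash looks imminent), without needing the crash to actually happen.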

Basically:
- TD can learn online after every step, before seeing the final outcome, while MC must wait until the end of the episode, when the return is known.
- TD can learn in continuing (non-terminating) environments, while MC only works in episodic (terminating) environments.

### Bias/Variance Trade-Off
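- MC has no bias, since the return $G_t$ is an unbiased estimate of $v_{\pi}(S_t)$, but it has high variance, because the return depends on many random actions, transitions and rewards.
- TD has low variance, because the TD target depends on only one random action, transition and reward, but it is biased, because it bootstraps from the current estimate $V(S_{t+1})$. (TD(0) still converges to $v_{\pi}(s)$ eventually, though not always with function approximation.)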

### Certainty equivalence
- MC converges to the value function with the minimum mean-squared error against the returns observed in previous trajectories. **MC does not exploit the Markov property**, hence MC is usually more effective in non-Markov environments.
- TD implicitly fits the maximum-likelihood MDP to the data and converges to the value function that solves it (see the lecture for a good example). **TD exploits the Markov property**, hence TD is usually more efficient in Markov environments.
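To make the difference concrete, here is a small sketch that runs batch MC and batch TD(0) on the same fixed set of episodes. The data (one episode `A -> B` with zero reward, six episodes where `B` terminates with reward 1, one where it terminates with reward 0) and the function names are assumptions chosen for illustration, with $\gamma = 1$.

```python
from collections import defaultdict

# Hypothetical batch: each episode is a list of (state, reward, next_state),
# with next_state=None marking termination.
episodes = [[("A", 0.0, "B"), ("B", 0.0, None)]]
episodes += [[("B", 1.0, None)] for _ in range(6)]
episodes += [[("B", 0.0, None)]]


def batch_mc(episodes, gamma=1.0):
    """Average the observed returns from every visit of each state."""
    totals, counts = defaultdict(float), defaultdict(int)
    for episode in episodes:
        g = 0.0
        for state, reward, _ in reversed(episode):
            g = reward + gamma * g
            totals[state] += g
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}


def batch_td0(episodes, gamma=1.0, alpha=0.05, sweeps=1000):
    """Replay the same batch with TD(0) until the values settle."""
    V = defaultdict(float)
    for _ in range(sweeps):
        increments = defaultdict(float)
        for episode in episodes:
            for state, reward, next_state in episode:
                bootstrap = 0.0 if next_state is None else V[next_state]
                increments[state] += alpha * (reward + gamma * bootstrap - V[state])
        # Apply the summed increments only after the whole batch is processed.
        for state, inc in increments.items():
            V[state] += inc
    return dict(V)


print(batch_mc(episodes))   # V(A) = 0.0, V(B) = 0.75
print(batch_td0(episodes))  # V(A) and V(B) both close to 0.75
```

Both methods agree on `B`, but MC assigns `A` the only return it ever observed from `A` (zero), while TD propagates the estimate for `B` back to `A` through the observed transition `A -> B`, which is what solving the fitted Markov model would give.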

## Unified View of RL

![unified_view_of_rl](../../images/unified_view_of_rl.png)

## TD($\lambda$)

This algorithm combines the strengths of MC and TD.
Instead of looking ahead over an entire episode (like MC) or just one time step (like TD(0)), we first look ahead $n$ steps and then bootstrap from the value estimate. This gives the $n$-step return:

$$
G_t^n = R_{t+1} + \gamma R_{t+2} + ... + \gamma ^{n-1} R_{t+n} + \gamma^n V(S_{t+n})
$$

And now our update rule becomes:

$$
V(S_t) \leftarrow V(S_t) + \alpha (G_t^n - V(S_t))
$$
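As a sketch, the $n$-step return can be computed from one recorded episode as below. The function name `n_step_return` and the array layout (`rewards[k]` stores $R_{k+1}$, `values[k]` stores the current estimate $V(S_k)$) are assumptions for illustration.

```python
def n_step_return(rewards, values, t, n, gamma):
    """Compute G_t^n for one recorded episode.

    rewards[k] holds R_{k+1}; values[k] holds the current estimate V(S_k).
    If the episode ends before step t + n, the sum stops at the final reward
    (the terminal state has value 0), which recovers the Monte-Carlo return.
    """
    T = len(rewards)
    end = min(t + n, T)
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if end < T:
        g += gamma ** (end - t) * values[end]  # bootstrap from V(S_{t+n})
    return g


rewards = [1.0, 1.0, 1.0]
values = [0.0, 5.0, 2.0]
print(n_step_return(rewards, values, t=0, n=1, gamma=0.5))  # 3.5  (TD(0) target)
print(n_step_return(rewards, values, t=0, n=2, gamma=0.5))  # 2.0
print(n_step_return(rewards, values, t=0, n=3, gamma=0.5))  # 1.75 (MC return)
```

With $n = 1$ this is the TD(0) target, and once $n$ reaches the end of the episode it is the full Monte-Carlo return.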

What if we want to use every possible value of $n$ rather than pick one? We can take a weighted average of all the $n$-step returns, called the $\lambda$-return, with $\lambda \in [0, 1]$. This is the TD($\lambda$) algorithm:

$$
G_t^{\lambda} = (1 - \lambda) \sum_{n = 1}^{\infty} \lambda^{n - 1} G_t^n
$$

And the update rule becomes:

$$
V(S_t) \leftarrow V(S_t) + \alpha (G_t^\lambda - V(S_t))
$$
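For an episode that terminates, every $n$-step return with $n$ beyond the end of the episode equals the full Monte-Carlo return, so the infinite sum collapses to a finite one with the leftover weight $\lambda^{T-t-1}$ on the final return. The sketch below uses that form; the function name `lambda_return` and the array layout (`rewards[k]` stores $R_{k+1}$, `values[k]` stores $V(S_k)$) are assumptions for illustration.

```python
def lambda_return(rewards, values, t, lam, gamma):
    """Compute the lambda-return G_t^lambda for one finished episode."""
    T = len(rewards)

    def n_step(n):
        # n-step return G_t^n; no bootstrap term once the episode has ended.
        end = min(t + n, T)
        g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
        return g + (gamma ** n * values[end] if end < T else 0.0)

    steps_left = T - t  # every n >= steps_left yields the full Monte-Carlo return
    g_lam = (1 - lam) * sum(lam ** (n - 1) * n_step(n) for n in range(1, steps_left))
    g_lam += lam ** (steps_left - 1) * n_step(steps_left)  # remaining geometric weight
    return g_lam


rewards = [1.0, 1.0, 1.0]
values = [0.0, 5.0, 2.0]
print(lambda_return(rewards, values, t=0, lam=0.0, gamma=0.5))  # 3.5    (TD(0) target)
print(lambda_return(rewards, values, t=0, lam=0.5, gamma=0.5))  # 2.6875
print(lambda_return(rewards, values, t=0, lam=1.0, gamma=0.5))  # 1.75   (MC return)
```

Setting $\lambda = 0$ recovers the TD(0) target and $\lambda = 1$ recovers the Monte-Carlo return, so $\lambda$ interpolates between the two.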

This averages the $n$-step returns with geometrically decaying weights $(1 - \lambda)\lambda^{n-1}$, which sum to 1, as shown in the diagram below:

![td_lambda](../../images/td_lambda.png)

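There is a forward view and a backward view of TD($\lambda$). The forward view above needs the future returns, so like MC it can only be computed once the episode is complete. The backward view is more efficient to compute, works online from incomplete sequences, and is therefore more suitable for non-terminating problems.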
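The backward view is commonly implemented with eligibility traces: each state keeps a trace that is bumped when the state is visited and decays by $\gamma\lambda$ per step, and every TD error is applied to all states in proportion to their trace. The sketch below uses accumulating traces; the function name `td_lambda_episode` and the transition format are assumptions for illustration, and the slides remain the reference for the exact formulation.

```python
from collections import defaultdict


def td_lambda_episode(V, transitions, alpha=0.1, gamma=1.0, lam=0.9):
    """Run backward-view TD(lambda) over one episode, updating V in place.

    V is expected to be a defaultdict(float).
    transitions is a list of (state, reward, next_state, done) tuples
    (an assumed format for this sketch).
    """
    traces = defaultdict(float)
    for state, reward, next_state, done in transitions:
        bootstrap = 0.0 if done else V[next_state]
        td_error = reward + gamma * bootstrap - V[state]
        traces[state] += 1.0  # accumulating trace for the state just visited
        for s in list(traces):
            V[s] += alpha * td_error * traces[s]
            traces[s] *= gamma * lam  # decay every trace after the update
    return V


V = defaultdict(float)
episode = [("A", 0.0, "B", False), ("B", 1.0, None, True)]
td_lambda_episode(V, episode, alpha=0.5, gamma=1.0, lam=0.5)
print(V["A"], V["B"])  # 0.25 0.5
```

In the toy episode the reward arrives only when leaving `B`, yet `A` still receives part of the update through its decayed trace. With `lam=0` the traces vanish after one step and the update reduces to TD(0).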

See the slides for more details about the implementation of TD($\lambda$).