# Model-Free Control

See the lecture [video](https://www.youtube.com/watch?v=0g4j2k_Ggc4&list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ).
See the lecture [notes](https://davidstarsilver.wordpress.com/wp-content/uploads/2025/04/lecture-5-model-free-control-.pdf).


On-policy learning: **Learn on the job**. Learn about policy $\pi$ from experience sampled from $\pi$.

Off-policy learning: **Look over someone's shoulder**. Learn about policy $\pi$ from experience sampled from a different policy $\mu$.


We can combine the generalized policy iteration framework with either Monte-Carlo or TD policy evaluation to form a model-free method for learning the policy.

## Epsilon-Greedy Exploration

In dynamic programming we had a model of the system. Here we don't, so we can't improve the policy greedily from the value function $V$: picking the best action from $V$ requires a one-step lookahead through the model. Instead we use the action-value function $Q$, which can be maximized over actions directly and so abstracts away the model.
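
To make the difference concrete, the two greedy improvement steps can be written side by side. This is a sketch in standard MDP notation that these notes do not define themselves: $\mathcal{R}_s^a$ (expected reward) and $\mathcal{P}_{ss'}^a$ (transition probability) are assumed here to be the model terms.

$$
\pi'(s) = \arg\max_{a} \left( \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a V(s') \right)
\qquad \text{(needs the model)}
$$

$$
\pi'(s) = \arg\max_{a} Q(s, a)
\qquad \text{(model-free)}
$$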

The other issue is that dynamic programming does full sweeps of the state space. It is therefore fine to be fully greedy when improving the policy, since every state is evaluated and you are not prone to getting stuck in a local optimum. With sampling approaches like Monte-Carlo or TD, if you act fully greedily after sampling only a small part of the state space, you may get stuck with a suboptimal policy, because you never find out whether a better solution exists in the part you did not explore.
The solution is $\epsilon$-greedy exploration: with probability $\epsilon$ we take a random action (exploration), and with probability $1 - \epsilon$ we take the greedy action (exploitation).
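
Written as a policy, and assuming $m$ denotes the number of actions available in state $s$ (a symbol introduced here for illustration), $\epsilon$-greedy exploration looks like this. The random draw can also land on the greedy action, which is why the greedy action gets the extra $\epsilon / m$.

$$
\pi(a \mid s) =
\begin{cases}
\epsilon / m + 1 - \epsilon & \text{if } a = \arg\max_{a'} Q(s, a') \\
\epsilon / m & \text{otherwise}
\end{cases}
$$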


We could then use TD 


## Monte-Carlo vs. TD

TD has advantages over MC:
- Lower variance (but higher bias than MC).
- Online: it updates after every step and doesn't have to wait until the end of the episode like MC.
- Works with incomplete sequences (MC requires the episode to terminate).
- Fits well with problems that are Markov (MC fits well with non-Markov problems).
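
The difference in the list above comes from the update target each method uses. The following is a minimal sketch, not taken from the lecture: the function names `mc_targets` and `td_targets` and the toy numbers are made up for illustration. The MC target sums every reward until the end of the episode, while the TD(0) target uses one real reward plus the current estimate of the next state's value.

```python
def mc_targets(rewards, gamma):
    """Monte-Carlo targets: the full discounted return from each step onward."""
    targets = []
    g = 0.0
    # Walk the episode backwards so each return reuses the one after it
    for r in reversed(rewards):
        g = r + gamma * g
        targets.append(g)
    return targets[::-1]


def td_targets(rewards, next_values, gamma):
    """TD(0) targets: one observed reward plus the estimated value of the next state."""
    return [r + gamma * v for r, v in zip(rewards, next_values)]


# Hypothetical episode with three steps
rewards = [0.0, 0.0, 10.0]      # rewards observed along the episode
next_values = [1.0, 4.0, 0.0]   # current estimates V(s') after each step (terminal = 0)

print(mc_targets(rewards, gamma=0.9))
print(td_targets(rewards, next_values, gamma=0.9))
```

The MC targets can only be computed once the episode has ended, whereas each TD target is available right after its step. The TD targets depend on the current (possibly wrong) estimates in `next_values`, which is where the bias comes from.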



## SARSA Algorithm (On-Policy)

The SARSA algorithm uses TD with epsilon greedy exploration to find the optimal policy using the generalized policy iteration framework.
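
The name comes from the quintuple $(S, A, R, S', A')$ used in each update. In the usual notation, with learning rate $\alpha$ and discount factor $\gamma$, the update is:

$$
Q(S, A) \leftarrow Q(S, A) + \alpha \left( R + \gamma \, Q(S', A') - Q(S, A) \right)
$$

Here $A'$ is the action that the $\epsilon$-greedy policy actually picks in $S'$, which is what makes the method on-policy.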

Let's program it out:


In [302]:
import random


def greedy(action_values):
    return max(action_values, key=action_values.get)

def e_greedy(action_values, epsilon):
    if random.random() < epsilon:
        # Return a random action
        random_action = random.choice(list(action_values.keys()))
        return random_action
    else:
        # Return action with the max Q-value
        return greedy(action_values)


def sarsa(Q, R, s_init, s_terminal_set, epsilon, alpha, gamma, simulate, n_episodes, max_iterations=1000):
    """
    Perform the SARSA algorithm. Q is updated in place.
    Inputs:
    - Q: Q-value lookup table, a dict mapping state -> {action: value}.
    - R: Reward lookup table, a dict mapping state -> reward.
    - s_init: Initial state at the beginning of an episode.
    - s_terminal_set: Set of terminal states; an episode stops when it reaches one.
    - epsilon: The initial epsilon for the e-greedy policy (decayed every episode).
    - alpha: The learning rate.
    - gamma: The discount factor.
    - simulate: A function simulate(s, a) that returns the next state.
    - n_episodes: Number of episodes to run.
    - max_iterations: The maximum number of steps per episode.
    """
    for _ in range(n_episodes):
        s = s_init
        a = e_greedy(Q[s], epsilon)
        # Decay epsilon by a factor of (1 - 1 / n_episodes) every episode
        epsilon -= 1 / n_episodes * epsilon
        iteration = 0
        while iteration < max_iterations:
            if s in s_terminal_set:
                # Terminal state: its value is just its reward
                Q[s][a] = R[s]
                break
            r = R[s]
            s_prime = simulate(s, a)
            # On-policy: the next action comes from the same e-greedy policy
            a_prime = e_greedy(Q[s_prime], epsilon)
            Q[s][a] = Q[s][a] + alpha * (r + gamma * Q[s_prime][a_prime] - Q[s][a])
            s = s_prime
            a = a_prime
            iteration += 1

In [330]:
# Perform SARSA over a grid
from collections import defaultdict
import numpy as np

def simulate(s, a):
    return (s[0] + a[0], s[1] + a[1])

N = 5 # grid size

# Initialize Q
Q = defaultdict(dict)
for x in range(N):
    for y in range(N):
        for ax, ay in [(1, 0), (-1, 0), (0, 1), (0, -1)]:
            if (x + ax >= N) or (ax < 0 and x == 0):
                continue
            if (y + ay >= N) or (ay < 0 and y == 0):
                continue
            Q[(x, y)][(ax, ay)] = 0

# Print the greedy Q-value and the greedy action for every state
def print_greedy_Q(Q, N):
    # Integer array: Q-values are truncated to whole numbers for display
    array = np.zeros((N, N), dtype=int)
    a_array = np.full((N, N), "a")
    # Map each action to a letter: down, up, right, left
    arrows = {(1, 0): 'd', (-1, 0): 'u', (0, 1): 'r', (0, -1): 'l'}

    for s, action_values in Q.items():
        greedy_a = greedy(action_values)
        array[s[0]][s[1]] = action_values[greedy_a]
        a_array[s[0]][s[1]] = arrows[greedy_a]

    print(array)
    print(a_array)

print("Q:")
print_greedy_Q(Q, N)

# Initialize R
R = {}
for x in range(N):
    for y in range(N):
        R[(x, y)] = 0
R[(0,0)] = 10

# Create an obstacle for R
R[(0, 1)] = -1000
R[(1, 1)] = -1000
R[(2, 1)] = -1000
R[(3, 1)] = -1000

def print_R(R, N):
    array = np.array([[0 for i in range(N)]for i in range(N)])
    for s, r in R.items():
        array[s[0]][s[1]] = r
    print(array)

print("R:")
print_R(R, N)

Q:
[[0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]]
[['d' 'd' 'd' 'd' 'd']
 ['d' 'd' 'd' 'd' 'd']
 ['d' 'd' 'd' 'd' 'd']
 ['d' 'd' 'd' 'd' 'd']
 ['u' 'u' 'u' 'u' 'u']]
R:
[[   10 -1000     0     0     0]
 [    0 -1000     0     0     0]
 [    0 -1000     0     0     0]
 [    0 -1000     0     0     0]
 [    0     0     0     0     0]]


In [324]:
# Initialize the rest of params
s_init = (4, 4)
s_terminal_set = set([(0, 0), (0, 1), (1, 1), (2, 1), (3, 1)])
epsilon = 0.1
alpha = 0.1
gamma = 1.0
n_episodes = 10000
max_iterations = 1e6
sarsa(Q, R, s_init, s_terminal_set, epsilon, alpha, gamma, simulate, n_episodes, max_iterations)
print_greedy_Q(Q, N)

[[   10 -1000  -311  -312  -212]
 [    9 -1000  -297  -212  -182]
 [    9 -1000  -192  -179  -159]
 [    9 -1000  -122  -149  -143]
 [  -45   -30   -15   -35   -51]]
[['d' 'd' 'r' 'r' 'd']
 ['u' 'd' 'r' 'd' 'd']
 ['u' 'd' 'r' 'r' 'd']
 ['u' 'd' 'd' 'd' 'd']
 ['u' 'l' 'l' 'l' 'l']]
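

The arrow grid can also be read off programmatically by following the greedy action from the start state. The helper below is a sketch that is not part of the original notebook: `greedy_rollout` and the toy corridor are hypothetical names and data chosen so the block runs on its own.

```python
def greedy_rollout(Q, s_init, s_terminal_set, simulate, max_steps=100):
    """Follow the greedy policy from s_init until a terminal state or max_steps."""
    path = [s_init]
    s = s_init
    for _ in range(max_steps):
        if s in s_terminal_set:
            break
        a = max(Q[s], key=Q[s].get)  # greedy action in state s
        s = simulate(s, a)
        path.append(s)
    return path


# Hypothetical 1-D corridor: states 0..3, actions -1 (left) and +1 (right)
toy_Q = {
    1: {-1: 10.0, 1: 0.0},
    2: {-1: 9.0, 1: 0.0},
    3: {-1: 8.0},
}


def toy_simulate(s, a):
    return s + a


print(greedy_rollout(toy_Q, s_init=3, s_terminal_set={0}, simulate=toy_simulate))  # [3, 2, 1, 0]
```

With the grid objects defined in the cells above, the same helper could be called as `greedy_rollout(Q, s_init, s_terminal_set, simulate)`.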


### Observations about SARSA

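SARSA works, but it takes a long time to converge because it is on-policy: the Q-values it learns are those of the $\epsilon$-greedy policy it follows, exploratory moves included. This shows in the output above, where the states near the -1000 obstacle column still have large negative values after 10000 episodes.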

## Q-Learning

Q-learning is an off-policy learning approach: the robot follows one policy (the behavior policy) while improving another (the target policy). In Q-learning, the robot does the following:
1. Follows the $\epsilon$-greedy policy.
2. Improves the fully greedy policy.
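
In the usual notation, with learning rate $\alpha$ and discount factor $\gamma$, the two steps combine into a single update. The action $A$ is chosen by the $\epsilon$-greedy policy, while the target bootstraps from the best action in the next state:

$$
Q(S, A) \leftarrow Q(S, A) + \alpha \left( R + \gamma \max_{a'} Q(S', a') - Q(S, A) \right)
$$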

Let's go through a code example:

In [334]:
def q_learning(Q, R, s_init, s_terminal_set, epsilon, alpha, gamma, simulate, n_episodes, max_iterations=1000):
    """
    Perform the Q-learning algorithm. Q is updated in place.
    Inputs:
    - Q: Q-value lookup table, a dict mapping state -> {action: value}.
    - R: Reward lookup table, a dict mapping state -> reward.
    - s_init: Initial state at the beginning of an episode.
    - s_terminal_set: Set of terminal states; an episode stops when it reaches one.
    - epsilon: The epsilon for the e-greedy policy.
    - alpha: The learning rate.
    - gamma: The discount factor.
    - simulate: A function simulate(s, a) that returns the next state.
    - n_episodes: Number of episodes to run.
    - max_iterations: The maximum number of steps per episode.
    """
    for _ in range(n_episodes):
        s = s_init
        iteration = 0
        while iteration < max_iterations:
            # Behavior policy: act e-greedily
            a = e_greedy(Q[s], epsilon)
            r = R[s]
            if s in s_terminal_set:
                # Terminal state: its value is just its reward
                Q[s][a] = r
                break
            s_prime = simulate(s, a)
            # Target policy: bootstrap from the greedy action in the next state
            a_max = greedy(Q[s_prime])
            Q[s][a] = Q[s][a] + alpha * (r + gamma * Q[s_prime][a_max] - Q[s][a])
            s = s_prime
            iteration += 1

In [338]:
# Initialize the rest of params
s_init = (4, 4)
s_terminal_set = set([(0, 0), (0, 1), (1, 1), (2, 1), (3, 1)])
epsilon = 0.1
alpha = 0.1
gamma = 1.0
n_episodes = 10000
max_iterations = 1e6
q_learning(Q, R, s_init, s_terminal_set, epsilon, alpha, gamma, simulate, n_episodes, max_iterations)
print_greedy_Q(Q, N)

[[   10  -564     2     0     0]
 [    9 -1000     9     1     1]
 [    9 -1000     9     9     9]
 [    9 -1000     9     9     9]
 [    9     9     9     9     9]]
[['d' 'l' 'd' 'd' 'd']
 ['u' 'd' 'd' 'd' 'd']
 ['u' 'd' 'd' 'd' 'd']
 ['u' 'd' 'd' 'd' 'd']
 ['u' 'l' 'l' 'l' 'l']]


### Observations on Q-learning

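In this experiment Q-learning is much more efficient than SARSA: it finds the solution in far fewer iterations. The main reason is that it still explores with the $\epsilon$-greedy policy, but it updates the Q function toward the greedy action in the next state, so exploratory moves do not drag the learned values down. It's a good balance between exploration and exploitation.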
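
The difference between the two targets is easiest to see on a single transition. The sketch below is illustrative only: `sarsa_target`, `q_learning_target`, and the state and action names are hypothetical, and the Q-values are made up to resemble a state next to the obstacle.

```python
def sarsa_target(Q, r, s_next, a_next, gamma):
    """On-policy target: bootstrap from the action the behavior policy actually chose."""
    return r + gamma * Q[s_next][a_next]


def q_learning_target(Q, r, s_next, gamma):
    """Off-policy target: bootstrap from the best action in the next state."""
    return r + gamma * max(Q[s_next].values())


# Hypothetical Q-values for one state next to an obstacle
Q_example = {"s1": {"safe": 9.0, "into_obstacle": -1000.0}}

# Suppose epsilon-greedy exploration happened to pick the bad action in s1
print(sarsa_target(Q_example, r=0.0, s_next="s1", a_next="into_obstacle", gamma=1.0))  # -1000.0
print(q_learning_target(Q_example, r=0.0, s_next="s1", gamma=1.0))                     # 9.0
```

A single exploratory step pulls the SARSA estimate of the previous state toward -1000, while the Q-learning estimate is pulled toward 9 regardless of which action was explored.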