In [52]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import linprog

## 7.4 Reinforcement learning

### 7.4.1 Learning in unknown MDPs

As covered in earlier notebooks (2-B), the main way for an agent to solve a known MDP is value iteration. At each stage we calculate the $Q$ values (the value of taking an action in a state), each of which is the immediate reward of that action plus the discounted, probability-weighted value of the possible next states, and then maximise over actions to get $V$, the value of each state. But this requires knowing the transition probabilities and the rewards. What if we don't know them? It turns out that as long as you observe the reward at each step, you can still converge to an accurate $Q$ (and hence $V$ too) without knowing the transition probabilities.
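For reference, one sweep of value iteration can be written as below. The symbol $P(s' \mid s, a)$ for the transition probabilities is an assumption made here for illustration (the code in this notebook stores them in the `T_aim_*` matrices), and $\beta$ is the discount factor:

$$Q(s,a) = r(s,a) + \beta\sum_{s'} P(s' \mid s,a)\,V(s'), \qquad V(s) = \max_a Q(s,a)$$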

The basic algorithm behind $Q$-learning is very simple. At each step take a random action $a_t$ in state $s_t$, then observe the reward and the new state $s_{t+1}$. Then update the value of the pair $(s_t, a_t)$ as a mixture of the old value and the new sample (reward + discounted future value). I.e.,


$$Q_{t+1}(s_t,a_t) = (1-\alpha_t)Q_t(s_t,a_t)+\alpha_t\bigl(r(s_t,a_t)+\beta V_t(s_{t+1})\bigr)$$

Note how the learning rate $\alpha_t$ is time-dependent. For convergence it must satisfy $\sum_t \alpha_t = \infty$ while $\sum_t \alpha_t^2 < \infty$. E.g., $\alpha_t = \frac{1}{t}$. Here $\beta$ is the discount factor (called `gamma` in the code).
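As a quick numerical illustration of the two conditions on the learning rate, the sketch below compares partial sums for the schedule $\alpha_t = 1/t$. The cutoff `N` is an arbitrary choice made for this example; a finite sum can only suggest the limiting behaviour, not prove it.

```python
import numpy as np

# Partial sums for the candidate schedule alpha_t = 1/t, t = 1..N
N = 1_000_000  # arbitrary cutoff, chosen only for illustration
t = np.arange(1, N + 1, dtype=float)
alpha = 1.0 / t

sum_alpha = alpha.sum()            # keeps growing (roughly like log N)
sum_alpha_sq = (alpha ** 2).sum()  # approaches the finite limit pi^2 / 6

print(f"sum of alpha_t   up to N={N}: {sum_alpha:.3f}")
print(f"sum of alpha_t^2 up to N={N}: {sum_alpha_sq:.6f}")
print(f"pi^2 / 6: {np.pi ** 2 / 6:.6f}")
```

The first sum keeps increasing as `N` grows, while the second stays below $\pi^2/6$, which is the behaviour the conditions ask for.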

Then we just update $V$ in the usual way by taking the maximum of $Q$ over actions. Here is a simple example. Say you have states A, B, and C. B is bad. A and C are both good. You can aim for any state. If you are already in the state you are aiming for, you have a 50% chance of getting it, versus getting B. If you are in another state, you have a 75% chance. This means the optimal strategy is to flip between A and C.

In [53]:
T_aim_A = np.array([
    [0.5,0.5,0], # already in A
    [0.75,0.25,0],
    [0.75,0.25,0],
])
T_aim_B = np.array([
    [0.25,0.75,0],
    [0.25,0.50,0.25], # already in B (split the other options)
    [0.0,0.75,0.25],
])
T_aim_C = np.array([
    [0.0,0.25,0.75],
    [0.0,0.25,0.75],
    [0.0,0.5,0.5], # already in C
])

reward = np.array([5,0,6]) # same rewards-per-state everywhere, C is slightly better than A

**True value iteration**

In [54]:
Q = np.zeros((3,3)) # 3 states, 3 actions
V = np.zeros(3)
gamma = 0.9 # discount factor

for t in range(100):
    for state in range(3):
        # row a = next-state distribution when aiming for state a from `state`
        T_given_state = np.concatenate([T_aim_A[[state]],T_aim_B[[state]],T_aim_C[[state]]],axis=0)
        future_values = T_given_state.dot(V) # expected next-state value for each action
        Q[state] = reward[state] + gamma * future_values
        V[state] = np.max(Q[state])

print("Q",Q)
print("V",V)

Q [[41.25186521 40.12686521 42.77985032]
 [37.37686527 36.3861936  37.77985032]
 [43.37686529 41.26119364 42.52052199]]
V [42.77985032 37.77985032 43.37686529]


**Q-learning**

In [55]:
Q = np.random.rand(3,3) # 3 states, 3 actions
V = np.random.rand(3)
alpha_func = lambda x: (1/(0.01*x+2)) # 0 -> 0.5, 100 -> 0.333, etc

state = 0
for t in range(10000):
    alpha = alpha_func(t)
    action = np.random.randint(3)
    new_state = np.random.choice(3,p=[T_aim_A,T_aim_B,T_aim_C][action][state])
    Q[state,action] = (1-alpha)*Q[state,action] + alpha * (reward[state] + gamma * V[new_state])
    V[state] = np.max(Q[state])
    state = new_state

print("Q",Q)
print("V",V)

Q [[41.09618421 39.78099005 42.63111412]
 [37.32365504 36.04446835 37.44804461]
 [43.17843919 40.52914154 42.02254613]]
V [42.63111412 37.44804461 43.17843919]


Roughly the same...
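One way to make "roughly the same" concrete is to compare the greedy policies implied by the two tables. The sketch below copies the printed numbers from the two outputs above; the variable names are chosen here for illustration only, and the Q-learning numbers come from one random run, so a rerun would give slightly different values.

```python
import numpy as np

# Q tables copied from the printed outputs above
# (rows: current state A, B, C; columns: aim for A, B, C)
Q_value_iteration = np.array([
    [41.25186521, 40.12686521, 42.77985032],
    [37.37686527, 36.3861936, 37.77985032],
    [43.37686529, 41.26119364, 42.52052199],
])
Q_learned = np.array([
    [41.09618421, 39.78099005, 42.63111412],
    [37.32365504, 36.04446835, 37.44804461],
    [43.17843919, 40.52914154, 42.02254613],
])

names = ["A", "B", "C"]
greedy_vi = Q_value_iteration.argmax(axis=1)  # best action per state
greedy_ql = Q_learned.argmax(axis=1)

for s in range(3):
    print(f"in {names[s]}: value iteration aims for {names[greedy_vi[s]]}, "
          f"Q-learning aims for {names[greedy_ql[s]]}")
print("largest gap between the two tables:",
      np.abs(Q_value_iteration - Q_learned).max())
```

Both tables pick the same action in every state: aim for C from A or B, and aim for A from C.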

$Q$-learning will eventually converge to values matching the optimal policy, but it makes no guarantee about the rate of convergence. This is a bit of a problem, as some agents might take such a big loss early on that they can't make it up later, even once they know the optimal policy.

### 7.4.2 Reinforcement learning in zero-sum
Adding other players to the game comes with a number of challenges, as we have covered before. One option is to just ignore the other player and use the above approach. In that scenario the other player is just a stochastic force that affects the transition probabilities. A more intelligent agent would instead model the other player. We can do this for $Q$-learning directly by modifying the formula:

$$Q_{t+1}(s_t,a_t,o_t) = (1-\alpha_t)Q_t(s_t,a_t,o_t)+\alpha_t\bigl(r(s_t,a_t,o_t)+\beta V_t(s_{t+1})\bigr)$$

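Here we are just expanding $Q$ to include the other player's action $o_t$. This is all well and good, but how do we get the value function? Well, in a zero-sum game the maxmin strategies of the two players form a Nash equilibrium. Therefore we might say:

$$V_{t+1}(s_t) = \max_{\pi}\min_{o}\sum_{a}\pi(a)\,Q_{t}(s_t,a,o)$$

where the maximum is over mixed strategies $\pi$, i.e. probability distributions over our own actions (the code below finds it with a linear program).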

One downside of this approach is that the agent may be unable to exploit an opponent who is using a sub-optimal strategy.
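A small sketch of why randomising matters when computing a maxmin value: in matching pennies, the best deterministic action guarantees only -1, while the best mixed strategy guarantees 0. The helper `maxmin_mixed` and the `pennies` matrix are illustrative names introduced here, not part of the notebook's own code, and the linear program is one standard way to set this up rather than the only one.

```python
import numpy as np
from scipy.optimize import linprog

def maxmin_mixed(Q):
    """Row player's maxmin mixed strategy and value for payoffs Q[a, o]."""
    n_a, n_o = Q.shape
    # Variables: pi_0 .. pi_{n_a-1}, v. Maximising v == minimising -v.
    c = np.zeros(n_a + 1)
    c[-1] = -1.0
    # For every opponent action o: v - sum_a pi_a * Q[a, o] <= 0
    A_ub = np.hstack([-Q.T, np.ones((n_o, 1))])
    b_ub = np.zeros(n_o)
    # The probabilities must sum to one
    A_eq = np.append(np.ones(n_a), 0.0).reshape(1, -1)
    b_eq = np.array([1.0])
    bounds = [(0, 1)] * n_a + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return res.x[:n_a], res.x[-1]

# Matching pennies: we win 1 if the actions match, lose 1 otherwise
pennies = np.array([[1.0, -1.0],
                    [-1.0, 1.0]])

pure_value = pennies.min(axis=1).max()   # best worst case over pure actions
pi, mixed_value = maxmin_mixed(pennies)  # best worst case over mixed strategies

print("pure maxmin value: ", pure_value)
print("mixed maxmin value:", mixed_value, "with strategy", pi)
```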

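To give an example of the algorithm, let's say that you are trying to run away from some opponent who is trying to catch you. You live on a grid of 4 squares in a row that loops around, so the leftmost and rightmost squares are adjacent. Both players move at the same time, and you are penalised on every step that starts with both of you on the same square.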

In [81]:
# Define positions: 0=Left, 1=Center left, 2=Center right, 3 = Right
positions = [0, 1, 2, 3]
actions = [0, 1, 2]  # 0=Left, 1=Stay, 2=Right
num_states = len(positions) ** 2
num_actions = len(actions)

# Q-table: Q1[state=(p1,p2), a1, a2]
Q1 = np.ones((4, 4, num_actions, num_actions))

# Value estimates and stochastic policies
V1 = np.ones((4, 4))
policy1 = np.ones((4, 4, num_actions))/num_actions  # probability distribution over actions

reward1 = np.array([
    [-1,  1,  1, 1],
    [ 1, -1,  1, 1],
    [ 1,  1,  -1, 1],
    [ 1,  1,  1, -1],
])
reward2 = -reward1  # zero-sum

# Exploration
explore_prob = 0.5
visits = np.zeros((4, 4), dtype=int)

def alpha_func(n): 
    return 1.0 / (1.0 + 0.01*n)

def sample_action(prob_dist, explore_prob):
    if np.random.rand() < explore_prob:
        return np.random.choice(actions)
    return np.random.choice(actions, p=prob_dist)

def transition(pos1, pos2, a1, a2):
    new1 = (pos1 + (a1 - 1)) % 4
    new2 = (pos2 + (a2 - 1)) % 4
    return int(new1), int(new2)

def solve_minimax(Q_matrix): # linear program to solve for the minimax mixed strategy
    num_actions = Q_matrix.shape[0]
    c = np.zeros(num_actions + 1)
    c[-1] = -1
    A_eq = np.zeros((1, num_actions + 1))
    A_eq[0, :num_actions] = 1
    b_eq = np.array([1.0])
    A_ub = np.zeros((Q_matrix.shape[1], num_actions + 1))
    b_ub = np.zeros(Q_matrix.shape[1])
    for j in range(Q_matrix.shape[1]):
        A_ub[j, :num_actions] = -Q_matrix[:, j]
        A_ub[j, -1] = 1
    bounds = [(0, 1)] * num_actions + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
    if res.success:
        probs = res.x[:num_actions]
        v = res.x[-1]
        return probs, v
    else:
        return np.ones(num_actions) / num_actions, 0.0

gamma = 0.9
pos1, pos2 = 0, 0
for t in range(10000):
    alpha = alpha_func(visits[pos1, pos2])
    visits[pos1, pos2] += 1
    a1 = sample_action(policy1[pos1, pos2], explore_prob)
    a2 = np.random.choice(actions) # player 2 plays uniformly at random: the minimax update doesn't depend on how player 2 chooses, as long as every action keeps being tried
    new1, new2 = transition(pos1, pos2, a1, a2)
    r1 = reward1[pos1, pos2]
    future = r1 + gamma * V1[new1, new2]
    Q1[pos1, pos2, a1, a2] = (1 - alpha) * Q1[pos1, pos2, a1, a2] + alpha * future
    policy, v = solve_minimax(Q1[pos1, pos2])
    policy1[pos1, pos2] = policy
    V1[pos1, pos2] = v
    pos1, pos2 = new1, new2


def encode_state(pos1, pos2, size=4):
    """Encode a state as a string like A__B"""
    symbols = ["_"] * size
    symbols[pos1] = "A"
    symbols[pos2] = "B" if not pos1==pos2 else "X"
    return "__".join(symbols)
    
def format_policy(p):
    # show each probability with 3 decimals
    return "[" + " ".join(f"{x:5.3f}" for x in p) + "]"

for pos1 in positions:
    for pos2 in positions:
        state_code = encode_state(pos1, pos2)
        policy_str = format_policy(policy1[pos1, pos2])
        value_str = f"{V1[pos1, pos2]:.3f}"
        # Print with fixed widths
        print(f"{state_code:10} {policy_str:20} value {value_str:6}")

X_________ [0.344 0.317 0.339]  value 6.397 
A__B______ [1.000 -0.000 0.000] value 9.356 
A_____B___ [-0.000 1.000 0.000] value 9.370 
A________B [0.000 -0.000 1.000] value 9.372 
B__A______ [-0.000 0.000 1.000] value 9.373 
___X______ [0.322 0.348 0.330]  value 6.445 
___A__B___ [1.000 0.000 -0.000] value 9.378 
___A_____B [0.000 1.000 -0.000] value 9.358 
B_____A___ [0.000 1.000 -0.000] value 9.387 
___B__A___ [-0.000 0.000 1.000] value 9.366 
______X___ [0.333 0.331 0.337]  value 6.459 
______A__B [1.000 -0.000 0.000] value 9.360 
B________A [1.000 0.000 -0.000] value 9.371 
___B_____A [0.000 1.000 -0.000] value 9.370 
______B__A [-0.000 0.000 1.000] value 9.362 
_________X [0.331 0.331 0.339]  value 6.435 


In the above you are A and the opponent is B, with a clash marked by X. On the clash squares the strategy is to choose (roughly uniformly) at random between going left, staying put, and going right. Otherwise the strategy is to run away in the obvious direction, or to stay put when the opponent is on the opposite square.
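As a rough cross-check on the learned values, the fixed point for this game can be worked out by hand, assuming the following reading of the rules in the code is right: from any non-clash state you can always pick a move that avoids a clash whatever the opponent does, so you collect reward 1 forever, and from a clash state the best you can do is randomise uniformly, so the opponent lands on your square again with probability 1/3. The names `v_safe` and `v_clash` are introduced here for illustration.

```python
gamma = 0.9  # discount factor, as in the notebook

# Non-clash states: reward 1 now and a non-clash state next, forever.
v_safe = 1.0 / (1.0 - gamma)

# Clash states: reward -1 now; under uniform play the next state is a clash
# with probability 1/3 and a non-clash state with probability 2/3, so
#   v_clash = -1 + gamma * (v_clash / 3 + 2 * v_safe / 3)
# which rearranges to:
v_clash = (-1.0 + gamma * (2.0 / 3.0) * v_safe) / (1.0 - gamma / 3.0)

print(f"non-clash value: {v_safe:.3f}")
print(f"clash value:     {v_clash:.3f}")
```

Under these assumptions the values come out at about 10 and about 7.14. The table above (roughly 9.4 and 6.4) has the same shape but sits somewhat lower, which suggests the estimates had not fully converged after 10,000 steps.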

This algorithm will eventually return the correct values in a zero-sum game. There are also algorithms that settle for a weaker guarantee, a reward of at least maxmin $- \epsilon$ for some $\epsilon$, e.g., the R-max algorithm.

### 7.4.3 Beyond zero-sum stochastic games

$Q$-learning does not generalize to general-sum games. Although well-defined, the above algorithm isn't guaranteed to converge to the maxmin strategy, nor to a Nash equilibrium. There are algorithms for 'pure-coordination' games.

### 7.4.4 Belief-based reinforcement learning

It is possible to have models of the opponent's behaviour, as in fictitious play. You need to add a belief function over the actions that the other player will take, and update it as you learn. There is some indication that these approaches can converge to equilibrium in self-play, but there is no theoretical guarantee.
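A minimal sketch of the idea for a single state, assuming a fictitious-play-style belief that is simply the empirical frequency of the opponent's past actions. Everything here (the `Q_state` numbers, the pseudo-counts, the function names) is hypothetical and chosen for illustration; in a full algorithm the expected value computed below would take the place of the maxmin value in the $V$ update.

```python
import numpy as np

num_opp_actions = 3

# Hypothetical Q-values for one state: Q_state[a, o]
Q_state = np.array([
    [1.0, 0.0, 2.0],
    [0.0, 3.0, 0.0],
    [2.0, 0.0, 1.0],
])

# Belief over the opponent's actions, stored as counts
# (one pseudo-count per action so the belief starts out uniform)
opp_counts = np.ones(num_opp_actions)

def observe(opp_action):
    """Update the belief after seeing the opponent act."""
    opp_counts[opp_action] += 1

def belief():
    """Empirical distribution over the opponent's actions."""
    return opp_counts / opp_counts.sum()

def best_response(Q_state):
    """Our best action and its expected value under the current belief."""
    expected = Q_state @ belief()  # expected value of each of our actions
    return int(expected.argmax()), float(expected.max())

# Suppose the opponent is observed playing action 1 eight times
for _ in range(8):
    observe(1)

action, value = best_response(Q_state)
print("belief:", belief())
print("best action:", action, "expected value:", value)
```

After those observations the belief puts most of its weight on the opponent's action 1, so the agent best-responds with its own action 1.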