In [1]:
import numpy as np
import matplotlib.pyplot as plt

## 7.5 No-regret learning and universal consistency

As discussed earlier (see 7-A) a learning rule is consistent if yields at least what you could get by having adopted a single pure strategy. We define a regret measure as the difference between the average reward you got and the average reward you would have got if you had played a pure strategy $s$ instead. A no-regret learning rule is one which guarantees no positive regret for any pure strategy $s$. We look at the pure strategies in the stage-game, but it doesn't matter if we looked at mixed strategies instead, as mixed strategies just use indifference between pure strategies anyway.

An example of a no-regret strategy is the defect option in the Prisoner's Dilemma. Tit-for-tat is not no-regret, because if you play a defecting opponent you will have regret for having cooperated in the first round.

A variety of no-regret learning algorithms exist, in particular these 2:

**Regret matching:**

At each timestep an action is chosen with probability equal to it's regret (i.e., how much better it would have been to play that action).

Say we have a staghunt:

$$
\begin{array}{c|cc}
\text{} & \text{C} & \text{D} \\
\hline
\text{C} & 4,4 & 0,1 \\
\text{D} & 1,0 & 1,1 \\
\end{array}
$$

In [54]:
# Payoff matrix for (Player1, Player2) given (action1, action2)
# Actions: "C" = 0, "D" = 1
payoff_matrix = {
    (0, 0): (3, 3),  # C, C
    (0, 1): (0, 1),  # C, D
    (1, 0): (1, 0),  # D, C
    (1, 1): (1, 1)   # D, D
}

# Track returns
C1_returns, D1_returns, true1_returns = [], [], []
C2_returns, D2_returns, true2_returns = [], [], []

actions = ["C", "D"]

for iteration in range(20):
    if iteration == 0:
        a1, a2 = 1, 1  # Start with (D, D)
    else:
        # Compute regrets for Player 1
        C1_regret = np.mean(C1_returns) - np.mean(true1_returns)
        D1_regret = np.mean(D1_returns) - np.mean(true1_returns)
        probs1 = np.clip([C1_regret, D1_regret], 0, np.inf)+0.001
        probs1 = probs1 / np.sum(probs1)

        # Compute regrets for Player 2
        C2_regret = np.mean(C2_returns) - np.mean(true2_returns)
        D2_regret = np.mean(D2_returns) - np.mean(true2_returns)
        probs2 = np.clip([C2_regret, D2_regret], 0, np.inf)+0.001
        probs2 = probs2 / np.sum(probs2)

        a1 = np.random.choice([0, 1], p=probs1)
        a2 = np.random.choice([0, 1], p=probs2)

    p1, p2 = payoff_matrix[(a1, a2)]
    true1_returns.append(p1)
    true2_returns.append(p2)
    C1_returns.append(payoff_matrix[(0, a2)][0])
    D1_returns.append(payoff_matrix[(1, a2)][0])
    C2_returns.append(payoff_matrix[(a1, 0)][1])
    D2_returns.append(payoff_matrix[(a1, 1)][1])

print("true1_returns",true1_returns)
print("true2_returns",true1_returns)

true1_returns [1, 0, 1, 1, 0, 0, 3, 1, 0, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
true2_returns [1, 0, 1, 1, 0, 0, 3, 1, 0, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]


There are a couple of times where D is randomly chosen, and a few times where it hurts to be CD, but it pretty quickly ends us with CC the whole time.

**Smooth fictitious play:**

With smooth fictitious play the agents stochastically respond rather than choosing the best action. For example, they might use a softmax over the actions.

## 7.6 Targeted learning

Another approach recognises that doing well against all possible opponents (no-regret) is pretty strict. In some cases we may may care about performance against a targeted group vs the wider one differently. Targeted learning aims for some properties:
1. Targeted optimality - the learning algorithm gives the best response to the target class
2. Satefy - the learning algorithm at least gets the maxmin value against the wider group
3. Autocompatibility - self-play is strictly Pareto efficient (best outcome possible)
4. Efficiency - you shouldn't have to get to infinity to learn. There is some metric for this.

The general algorithm for targeted learning looks like:
1. Start by assuming the opponent is in the target set and play a best response. If it becomes clear this isn't the case, move on.
2. Signal to the opponent whether they have the same learning strategy. If so, coordinate on Pareto outcome.
3. Play to the security level

This gets more complicated the more agents you add, unsuprisingly. 