In [1]:
import numpy as np
import matplotlib.pyplot as plt

## 7.3 Rational learning

This is basically the same as fictitious play, except that agents hold a model over *all* strategies their opponent might employ in the repeated game (for example, tit-for-tat), not just over the actions of a single stage game. Players update their beliefs about the probability of each opponent strategy given the history of play. The book gives an example using the repeated Prisoner's Dilemma:

$$
\begin{array}{c|cc}
\text{} & \text{C} & \text{D} \\
\hline
\text{C} & 3,3 & 0,4 \\
\text{D} & 4,0 & 1,1 \\
\end{array}
$$

Let's say that players employ the trigger strategy up to a time of their choosing, $t$, after which they defect. This strategy is denoted $g_t$. Players don't play mixtures, but they do adjust their best response as their beliefs change. After $j$ rounds of cooperation, the likelihood of that history is 0 under any $g_t$ with $t<j$ and the same (namely 1) under every remaining $g_t$. The update rule is therefore pretty simple: we zero out the eliminated options and re-normalize the prior over the rest. Say I start with the following distribution:

$$
P(t) =
\begin{cases}
\frac{1}{10} & t=0 \\
\frac{2}{10} & t=1 \\
\frac{4}{10} & t=2 \\
\frac{3}{10} & t=3
\end{cases}
$$

After one iteration, if my opponent cooperated, the new distribution is just

$$
P(t) =
\begin{cases}
0 & t=0 \\
\frac{2}{9} & t=1 \\
\frac{4}{9} & t=2 \\
\frac{3}{9} & t=3
\end{cases}
$$

And after two rounds of cooperation:

$$
P(t) =
\begin{cases}
0 & t=0 \\
0 & t=1 \\
\frac{4}{7} & t=2 \\
\frac{3}{7} & t=3
\end{cases}
$$

and so on.
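
The re-normalization above can be sketched in a few lines of Python. The function name `condition_on_cooperation` and the dict representation of the prior are illustrative choices of mine, not the book's notation. Exact fractions are used so the result can be compared with the hand calculation.

```python
from fractions import Fraction

def condition_on_cooperation(prior, j):
    """Posterior over defection times t after observing j rounds of cooperation.

    Assumption: any g_t with t < j is ruled out; the rest keep their prior
    weight and are re-normalized.
    """
    surviving = {t: (p if t >= j else Fraction(0)) for t, p in prior.items()}
    total = sum(surviving.values())
    if total == 0:
        raise ValueError("observed history has zero probability under the prior")
    return {t: p / total for t, p in surviving.items()}

prior = {0: Fraction(1, 10), 1: Fraction(2, 10), 2: Fraction(4, 10), 3: Fraction(3, 10)}
after_one = condition_on_cooperation(prior, 1)
after_two = condition_on_cooperation(prior, 2)
print(after_one)
print(after_two)
```

Note that `Fraction` reduces to lowest terms, so $\frac{3}{9}$ is shown as `Fraction(1, 3)`.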

As another example, let's imagine that in the game above each player plays one of four strategies: tit-for-tat, always cooperate, always defect, or trigger.

In [2]:
# Strategies: 0 = Tit-for-Tat, 1 = Always Cooperate, 2 = Always Defect, 3 = Trigger
def move(strategy, last_self, last_other, step, triggered):
    """Return "C" or "D" for the given strategy.

    last_self and step are unused by these four strategies.
    """
    if strategy == 0:
        # Tit-for-Tat: cooperate first, then copy the opponent's last move
        return "C" if last_other is None else last_other
    elif strategy == 1:
        return "C"
    elif strategy == 2:
        return "D"
    elif strategy == 3:
        # Trigger: cooperate until the opponent has defected once
        return "D" if triggered else "C"
    else:
        raise ValueError("Unknown strategy")

def play_round(strategy1, strategy2, history):
    """Return the moves (m1, m2) both players make after `history`."""
    last1, last2 = (None, None) if not history else history[-1]
    # A player is triggered once the other player has defected in any earlier round
    trig1 = any(m2 == "D" for (_, m2) in history)
    trig2 = any(m1 == "D" for (m1, _) in history)
    m1 = move(strategy1, last1, last2, len(history), trig1)
    m2 = move(strategy2, last2, last1, len(history), trig2)
    return m1, m2

def generate_history(strategy1, strategy2, steps):
    history = []
    for _ in range(steps):
        history.append(play_round(strategy1, strategy2, history))
    return history

hist = generate_history(0, 2, 5) # Tit-for-Tat vs Always Defect
print(hist)

[('C', 'D'), ('D', 'D'), ('D', 'D'), ('D', 'D'), ('D', 'D')]


In [4]:
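def update_beliefs(my_strategy, beliefs, history, player):
    """Zero out every opponent strategy that cannot reproduce `history`.

    Assumes both players keep a fixed strategy for the whole game;
    `player` (1 or 2) says which side of each history tuple is mine.
    """
    new_beliefs = beliefs.copy()
    for other_strategy in range(len(beliefs)):
        # Replay the game from scratch under this hypothesis
        if player == 1:
            possible_history = generate_history(my_strategy, other_strategy, len(history))
        else:
            possible_history = generate_history(other_strategy, my_strategy, len(history))
        if possible_history != history:
            new_beliefs[other_strategy] = 0
    # Re-normalize unless every hypothesis was eliminated
    if np.sum(new_beliefs) > 0:
        new_beliefs = new_beliefs / np.sum(new_beliefs)
    return new_beliefs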

possible_beliefs = np.array([0.25,0.25,0.25,0.25])
print(update_beliefs(0,possible_beliefs,[('C','C')],1)) # rule out defect if they cooperate

[0.33333333 0.33333333 0.         0.33333333]


In [19]:
beliefs1_about2 = np.array([0.25, 0.25, 0.25, 0.25])
beliefs2_about1 = np.array([0.25, 0.25, 0.25, 0.25])

strategy1 = 2  # Always Defect
strategy2 = 3  # Trigger
history = []

for i in range(5):
    history.append(play_round(strategy1, strategy2, history))
    beliefs1_about2 = update_beliefs(strategy1, beliefs1_about2, history, 1)
    beliefs2_about1 = update_beliefs(strategy2, beliefs2_about1, history, 2)
    print("Iteration", i)
    print("History:", history)
    print("Beliefs 1 about 2:", beliefs1_about2)
    print("Beliefs 2 about 1:", beliefs2_about1)

Iteration 0
History: [('D', 'C')]
Beliefs 1 about 2: [0.33333333 0.33333333 0.         0.33333333]
Beliefs 2 about 1: [0. 0. 1. 0.]
Iteration 1
History: [('D', 'C'), ('D', 'D')]
Beliefs 1 about 2: [0.5 0.  0.  0.5]
Beliefs 2 about 1: [0. 0. 1. 0.]
Iteration 2
History: [('D', 'C'), ('D', 'D'), ('D', 'D')]
Beliefs 1 about 2: [0.5 0.  0.  0.5]
Beliefs 2 about 1: [0. 0. 1. 0.]
Iteration 3
History: [('D', 'C'), ('D', 'D'), ('D', 'D'), ('D', 'D')]
Beliefs 1 about 2: [0.5 0.  0.  0.5]
Beliefs 2 about 1: [0. 0. 1. 0.]
Iteration 4
History: [('D', 'C'), ('D', 'D'), ('D', 'D'), ('D', 'D'), ('D', 'D')]
Beliefs 1 about 2: [0.5 0.  0.  0.5]
Beliefs 2 about 1: [0. 0. 1. 0.]


In this case player 2 learns after the first round that player 1 is an always-defect player. Player 1 learns only that player 2 is not. After the second round player 1 also rules out always cooperate. From then on both players defect, and player 1's beliefs never settle on whether player 2 is playing tit-for-tat or the trigger strategy, because the two behave identically against always defect.
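
To see why player 1 gets stuck, compare what tit-for-tat and trigger do against the same sequence of moves. This is a minimal sketch, separate from the notebook functions above; the helper `opponent_moves` and the string labels `"tft"` and `"trigger"` are names I made up for the illustration.

```python
def opponent_moves(strategy, my_moves):
    """Moves a fixed strategy plays against a given sequence of my moves.

    strategy: "tft" or "trigger" (hypothetical labels for this sketch).
    """
    out = []
    for step in range(len(my_moves)):
        seen = my_moves[:step]  # what the opponent has observed so far
        if strategy == "tft":
            out.append(seen[-1] if seen else "C")
        elif strategy == "trigger":
            out.append("D" if "D" in seen else "C")
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return out

always_defect = ["D"] * 5
mixed = ["D", "C", "C", "D", "C"]

same_vs_defect = opponent_moves("tft", always_defect) == opponent_moves("trigger", always_defect)
same_vs_mixed = opponent_moves("tft", mixed) == opponent_moves("trigger", mixed)
print(same_vs_defect, same_vs_mixed)  # True False
```

The two strategies only diverge once I go back to cooperating after a defection, which an always-defect player never does.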

However, in this example we aren't letting the players change strategies! If we allow the players to change, we also need to tweak the belief update so that it does not eliminate options entirely (they might come back), and add a 'best-response' function.

In [None]:
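mistake_prob = 0.05  # weight kept by a hypothesis that fails to reproduce the history

def update_beliefs_2(my_strategy, beliefs, history, player):
    """Soft version of update_beliefs: down-weight mismatches instead of zeroing them.

    Note: the whole history is replayed using my *current* strategy, and the
    penalty is applied once per call, however many rounds mismatch.
    """
    new_beliefs = beliefs.copy()
    for other_strategy in range(len(beliefs)):
        if player == 1:
            possible_history = generate_history(my_strategy, other_strategy, len(history))
        else:
            possible_history = generate_history(other_strategy, my_strategy, len(history))

        if possible_history != history:
            new_beliefs[other_strategy] *= mistake_prob  # small probability for a "mistake"

    # normalize
    new_beliefs /= np.sum(new_beliefs)
    return new_beliefs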

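def best_response(beliefs):
    """Return the strategy index with the highest expected payoff under `beliefs`."""
    # Rows: my strategy; columns: opponent's strategy.
    # Entries are the long-run per-round payoff to the row strategy.
    payoff_table = np.array([
        [3,3,1,3],  # TfT
        [3,3,0,3],  # Always C
        [1,4,1,1],  # Always D
        [3,3,1,3]   # Trigger
    ])
    expected_payoffs = payoff_table.dot(beliefs)
    # np.argmax returns the first maximum, so TfT wins ties with Trigger
    return np.argmax(expected_payoffs)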
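
The hand-written `payoff_table` can be cross-checked by simulation. The sketch below is self-contained rather than reusing the notebook's functions, and it assumes the entries are meant as long-run average per-round payoffs when both strategies stay fixed. `STAGE`, `next_move`, and `average_payoff` are names introduced here for illustration.

```python
import numpy as np

# Stage-game payoffs to the row player, from the Prisoner's Dilemma table above
STAGE = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 4, ("D", "D"): 1}

def next_move(strategy, opp_moves):
    """0 = Tit-for-Tat, 1 = Always Cooperate, 2 = Always Defect, 3 = Trigger."""
    if strategy == 0:
        return opp_moves[-1] if opp_moves else "C"
    if strategy == 1:
        return "C"
    if strategy == 2:
        return "D"
    if strategy == 3:
        return "D" if "D" in opp_moves else "C"
    raise ValueError("Unknown strategy")

def average_payoff(row, col, rounds=1000):
    """Average per-round payoff to `row` when both strategies stay fixed."""
    row_moves, col_moves, total = [], [], 0
    for _ in range(rounds):
        r = next_move(row, col_moves)
        c = next_move(col, row_moves)
        row_moves.append(r)
        col_moves.append(c)
        total += STAGE[(r, c)]
    return total / rounds

simulated = np.array([[average_payoff(r, c) for c in range(4)] for r in range(4)])
print(np.round(simulated, 2))
```

With 1000 rounds the simulated averages agree with the table to within 0.01. The small differences come from the first round, where a defector earns 4 against a cooperating opener.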

In [25]:
# --- Simulation ---
beliefs1 = np.array([0.25,0.25,0.25,0.25])
beliefs2 = np.array([0.7,0.1,0.1,0.1])

strategy1 = 0
strategy2 = 2
history = []

for t in range(5):
    # Play round
    m1, m2 = play_round(strategy1, strategy2, history)
    history.append((m1, m2))

    # Update beliefs
    beliefs1 = update_beliefs_2(strategy1, beliefs1, history, player=1)
    beliefs2 = update_beliefs_2(strategy2, beliefs2, history, player=2)

    # Choose best-response strategy
    strategy1 = best_response(beliefs1)
    strategy2 = best_response(beliefs2)

    print(f"Round {t+1}")
    print("Moves:", m1, m2)
    print("Beliefs 1:", beliefs1)
    print("Beliefs 2:", beliefs2)
    print("Strategies:", strategy1, strategy2)
    print("---")

Round 1
Moves: C D
Beliefs 1: [0.04347826 0.04347826 0.86956522 0.04347826]
Beliefs 2: [0.77348066 0.11049724 0.00552486 0.11049724]
Strategies: 0 0
---
Round 2
Moves: D C
Beliefs 1: [0.04347826 0.04347826 0.86956522 0.04347826]
Beliefs 2: [0.77348066 0.11049724 0.00552486 0.11049724]
Strategies: 0 0
---
Round 3
Moves: C D
Beliefs 1: [0.04347826 0.04347826 0.86956522 0.04347826]
Beliefs 2: [0.77348066 0.11049724 0.00552486 0.11049724]
Strategies: 0 0
---
Round 4
Moves: D C
Beliefs 1: [0.04347826 0.04347826 0.86956522 0.04347826]
Beliefs 2: [0.77348066 0.11049724 0.00552486 0.11049724]
Strategies: 0 0
---
Round 5
Moves: C D
Beliefs 1: [0.04347826 0.04347826 0.86956522 0.04347826]
Beliefs 2: [0.77348066 0.11049724 0.00552486 0.11049724]
Strategies: 0 0
---


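In this example player 1 starts off not knowing anything, while player 2 starts off thinking player 1 is likely TfT. After round 1 both players best-respond with TfT (strategy 0) and stay there. Because round 1 was (C, D), the two TfT players then echo each other's last move, so play goes back and forth between (D, C) and (C, D). The beliefs also freeze after round 1: from round 2 on no fixed opponent strategy reproduces the observed history, so every hypothesis is scaled by the same `mistake_prob` and the normalization cancels it out. I think the belief update and the method of choosing the best response both need to be more precise than what I have above! Anyway, this has some nice properties, such as convergence to an $\epsilon$-equilibrium. But I'm not sure I really *get* it. Players just believe their opponent has a fixed strategy that never changes, same as fictitious play. Except they are changing it...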
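
One possible refinement of the belief update in `update_beliefs_2`, offered as my own sketch and not as the book's method: score each hypothesis on the latest round only, using the moves actually played so far instead of replaying the game from scratch. The names `predicted_move`, `per_round_update`, and `tremble` are assumptions introduced here. The model also assumes each hypothesis plays its prescribed move with probability `1 - tremble`.

```python
import numpy as np

def predicted_move(strategy, observed_moves):
    """Move a fixed strategy would play after seeing `observed_moves` from the other side.

    0 = Tit-for-Tat, 1 = Always Cooperate, 2 = Always Defect, 3 = Trigger.
    """
    if strategy == 0:
        return observed_moves[-1] if observed_moves else "C"
    if strategy == 1:
        return "C"
    if strategy == 2:
        return "D"
    if strategy == 3:
        return "D" if "D" in observed_moves else "C"
    raise ValueError("Unknown strategy")

def per_round_update(beliefs, my_moves, their_moves, tremble=0.05):
    """Bayes update on the latest round only.

    Assumption: a hypothesis plays its prescribed move with probability
    1 - tremble and the other move with probability tremble.
    """
    seen_by_them = my_moves[:-1]  # my moves before the latest round
    latest = their_moves[-1]
    likelihood = np.array([
        1 - tremble if predicted_move(s, seen_by_them) == latest else tremble
        for s in range(len(beliefs))
    ])
    posterior = beliefs * likelihood
    return posterior / posterior.sum()

# Player 1's view of the first three rounds of the alternating history above
my_moves = ["C", "D", "C"]
their_moves = ["D", "C", "D"]
beliefs = np.array([0.25, 0.25, 0.25, 0.25])
for t in range(1, len(my_moves) + 1):
    beliefs = per_round_update(beliefs, my_moves[:t], their_moves[:t])
    print(t, np.round(beliefs, 3))
```

Under this update the beliefs keep moving from round to round instead of freezing, because each hypothesis is judged against what it would have done given the real history.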