In [5]:
import numpy as np
import matplotlib.pyplot as plt
from itertools import product
import sys
import os
sys.path.append(os.path.abspath('..'))
import utils

## 6.2 Stochastic games

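In a stochastic game the players repeatedly play games drawn from a finite set. Which game is played next depends probabilistically on the game just played and on the actions the players took in it.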

### 6.2.1 Definition

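A stochastic game consists of a finite set of games (the states), a finite set of players, a finite set of actions available to each player, transition probabilities giving the chance of moving to each game given the current game and the actions taken, and a payoff function for each game. The per-game payoffs are combined into an overall payoff using either the average reward or the discounted reward.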
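The two ways of aggregating payoffs can be sketched on a finite payoff stream. This is only an illustration: the function names and the payoff stream are made up, and $\beta$ denotes the discount factor.

```python
def discounted_reward(rewards, beta):
    """Sum of beta**t * r_t over a finite stream of per-game payoffs."""
    return sum((beta ** t) * r for t, r in enumerate(rewards))


def average_reward(rewards):
    """Mean per-game payoff over a finite stream."""
    return sum(rewards) / len(rewards)


# Hypothetical stream: one player earns 4, 4, 2, 2 in four successive games.
stream = [4, 4, 2, 2]
print(discounted_reward(stream, beta=0.5))  # 4 + 2 + 0.5 + 0.25 = 6.75
print(average_reward(stream))               # 12 / 4 = 3.0
```

In the infinite-horizon setting the discounted sum always converges for $\beta < 1$, whereas the average is defined as a limit over ever longer streams.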

### 6.2.2 Strategies and equilibria

A player needs to decide what to do given the history of all previous games. We can restrict this in several ways:

**Behavioural strategies**

We can limit the players to choosing an action for each game independently. That is, the action I choose for the $n$th game is based on the history of all games up to $n$, rather than on a complete plan fixed before play begins. A behavioural strategy is then just a probability distribution over the available actions in the current game, conditioned on the history so far.

**Markov strategies**

This can be restricted further by making each action distribution depend only on the most recent game (the last state in the history) and the time step, not on how that game was reached. This is the Markov property.

**Stationary strategies**

Finally, we can remove the time component: given the same game, each agent's strategy is the same whenever that game occurs. In the plain Markov case a player may use a different strategy for the same game at times $a$ and $b$, even if the history of games is otherwise the same.
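One way to see the difference between the three classes is by what each strategy is allowed to take as input. The sketch below is purely illustrative: the function names, the game labels `"A"` and `"B"`, the C/D action labels, and the rules themselves are all assumptions.

```python
def behavioural_strategy(history):
    """Conditions on the full history: a list of (game, joint_action) pairs.

    Example rule: cooperate until anybody has defected, then defect forever.
    """
    anyone_defected = any("D" in joint_action for _, joint_action in history)
    return {"C": 0.0, "D": 1.0} if anyone_defected else {"C": 1.0, "D": 0.0}


def markov_strategy(t, game):
    """Conditions only on the time step t and the latest game.

    Example rule: cooperate in game "A" for the first 10 steps, then defect.
    """
    if game == "A" and t < 10:
        return {"C": 1.0, "D": 0.0}
    return {"C": 0.0, "D": 1.0}


def stationary_strategy(game):
    """Conditions only on the latest game; the time step is ignored."""
    return {"C": 1.0, "D": 0.0} if game == "A" else {"C": 0.0, "D": 1.0}


history = [("A", "CC"), ("A", "CD")]
print(behavioural_strategy(history))                      # a D appears, so defect
print(markov_strategy(3, "A"), markov_strategy(30, "A"))  # changes with t
print(stationary_strategy("A"))                           # same at every t
```

Each function returns a probability distribution over actions, so a mixed strategy is expressed by returning probabilities strictly between 0 and 1.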

There are a few theorems about the equilibria in a stochastic game:

**Theorem:** Every discounted-reward stochastic game has a Markov perfect equilibrium.

A Markov perfect equilibrium is a profile in which every player uses a Markov strategy and which is a Nash equilibrium regardless of the starting game.

There is no comparable theorem for average-reward games in general, as the average reward might not converge (e.g., imagine a cycle). However, if every game has a non-zero probability of being reached regardless of the strategies played, then the average-reward game does at least have a Nash equilibrium. We call such games irreducible stochastic games. This leads to the following theorem:

**Theorem:** For every two-player, general-sum irreducible stochastic game, every feasible outcome whose payoff vector $r$ gives each player at least their minmax value is the payoff vector of some Nash equilibrium. This holds for average-reward games and for discounted-reward games with a large enough discount factor.

### 6.2.3 Computing equilibria

In a couple of cases computing the equilibria of a stochastic game is simple: when only one player affects the transition probabilities (this is just an MDP), or when the game has a structure called 'separable reward, state-independent transition'. In general, though, you need to solve a nonlinear problem.

Often a variant of value iteration is used, but this can get stuck in local optima. To run it we enumerate every state (here, every combination of recent history that we track) and every combination of the players' actions.

The standard Bellman formulation states:

$$Q^\pi(s,a) = r(s,a) + \beta \sum_{\hat{s}} P(\hat{s} | s, a) V^\pi(\hat{s}) $$

$$V^\pi(s) = Q^\pi(s,\pi(s)) $$

We then take the maximum over $\pi$ for the second equation and plug it into the first to get:

$$V(s) = \max_a \left[ r(s,a) + \beta \sum_{\hat{s}} P(\hat{s} | s, a) V(\hat{s}) \right] $$

This equation holds at the optimum, and its right-hand side is a contraction mapping for $\beta < 1$, so we can find the fixed point by applying it repeatedly. This is value iteration.
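As a minimal sketch of single-agent value iteration, the loop below applies the Bellman optimality update to a tiny MDP until the values stop changing. The two-state, two-action MDP and all of its numbers are made up for illustration.

```python
# Hypothetical MDP: S and A index states and actions; all numbers are made up.
S = [0, 1]
A = [0, 1]
R = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 2.0}   # r(s, a)
P = {(0, 0): [1.0, 0.0], (0, 1): [0.2, 0.8],               # P(. | s, a)
     (1, 0): [0.9, 0.1], (1, 1): [0.0, 1.0]}
beta = 0.9


def backup(V, s, a):
    """Right-hand side of the Bellman equation for one state-action pair."""
    return R[(s, a)] + beta * sum(P[(s, a)][s2] * V[s2] for s2 in S)


V = [0.0, 0.0]
for _ in range(1000):
    new_V = [max(backup(V, s, a) for a in A) for s in S]
    delta = max(abs(new_V[s] - V[s]) for s in S)
    V = new_V
    if delta < 1e-10:   # the contraction property guarantees this is reached
        break

# Greedy policy with respect to the converged values.
policy = [max(A, key=lambda a: backup(V, s, a)) for s in S]
print(V, policy)
```

Because each sweep shrinks the distance to the fixed point by a factor of at most $\beta$, the starting values do not matter, only the number of sweeps needed.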

With a two player game we instead have two different $Q$, $r$, and $V$ functions, as well as different (possibly mixed) policies $\pi_1$,$\pi_2$ and different actions:

$$Q_1^{\pi_1,\pi_2}(s,a_1,a_2) = r_1(s,a_1,a_2) + \beta \sum_{\hat{s}} P(\hat{s} | s, a_1, a_2) V_1^{\pi_1,\pi_2}(\hat{s}) $$

$$Q_2^{\pi_1,\pi_2}(s,a_1,a_2) = r_2(s,a_1,a_2) + \beta \sum_{\hat{s}} P(\hat{s} | s, a_1, a_2) V_2^{\pi_1,\pi_2}(\hat{s}) $$

If we know the values of $Q_1$ and $Q_2$ in a state, they define a one-shot two-player game whose Nash equilibrium gives the policies $\pi_1$ and $\pi_2$ for that state. In the pure case this means a single action per state, but the policies could be mixed. The value function is then the expected $Q$ value given the current state and both policies, $V_i(s) = \sum_{a_1,a_2} \pi_1(a_1 | s)\, \pi_2(a_2 | s)\, Q_i(s,a_1,a_2)$.

This makes the computation quite simple. For a given state we update the Q values, compute a Nash equilibrium of the resulting one-shot game, then use that equilibrium to update the value of the state.
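A single update of this kind can be sketched as follows. As a simplification, the equilibrium step here is a brute-force search over pure strategies, not a general Nash solver, so it fails on games with only mixed equilibria. The names `pure_nash` and `nash_backup`, and the example numbers, are assumptions made for illustration.

```python
from itertools import product


def pure_nash(q1, q2):
    """All pure-strategy Nash equilibria (a1, a2) of the bimatrix game (q1, q2)."""
    n1, n2 = len(q1), len(q1[0])
    found = []
    for a1, a2 in product(range(n1), range(n2)):
        best_for_1 = all(q1[a1][a2] >= q1[b][a2] for b in range(n1))
        best_for_2 = all(q2[a1][a2] >= q2[a1][b] for b in range(n2))
        if best_for_1 and best_for_2:
            found.append((a1, a2))
    return found


def nash_backup(r1_s, r2_s, P_s, v1, v2, beta):
    """One update for a single state: Q values, equilibrium, new state values."""
    n1, n2 = len(r1_s), len(r1_s[0])

    def q_values(r_s, v):
        return [[r_s[a][b] + beta * sum(p * x for p, x in zip(P_s[a][b], v))
                 for b in range(n2)] for a in range(n1)]

    q1, q2 = q_values(r1_s, v1), q_values(r2_s, v2)
    a1, a2 = pure_nash(q1, q2)[0]   # arbitrary choice if several equilibria exist
    return q1[a1][a2], q2[a1][a2], (a1, a2)


# Hypothetical state: a prisoner's dilemma with two possible next states.
r1_s = [[4, 1], [5, 2]]
r2_s = [[4, 5], [1, 2]]
P_s = [[[0.9, 0.1], [0.5, 0.5]],
       [[0.5, 0.5], [0.5, 0.5]]]    # P_s[a1][a2] = next-state probabilities
result = nash_backup(r1_s, r2_s, P_s, v1=[0.0, 0.0], v2=[0.0, 0.0], beta=0.99)
print(result)   # new value for player 1, new value for player 2, joint action
```

With zero continuation values the Q matrices equal the immediate payoffs, so the only pure equilibrium is mutual defection, action index 1 for both players.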

The problem (other than a lack of convergence guarantees) is that we force the players to make the best decision in *each* game rather than over all the games. Any Nash equilibria that rely on strategies spanning several games may therefore be missed.

**Example**

Let's say we are both farming some land together. Each of us can either farm responsibly (C) or farm to maximise immediate revenue (D). The reward we get from farming depends on the state of the land, which in turn depends on what we do. The land can be healthy (H) or eroded (E). In both states the game is a prisoner's dilemma, but in the latter the payoffs are worse:

*Healthy*

$
\begin{array}{c|cc}
\text{} & C & D \\
\hline
\text{C} & 4,4 & 1,5 \\
\text{D} & 5,1 & 2,2 \\
\end{array}
$

*Eroded*

$
\begin{array}{c|cc}
\text{} & C & D \\
\hline
\text{C} & 2,2 & 0,3 \\
\text{D} & 3,0 & 1,1 \\
\end{array}
$

If the land is healthy and both players cooperate, it stays healthy with 90% probability; if either player defects, it erodes with 50% probability.
If the land is eroded and both players cooperate, it recovers with 30% probability; otherwise it remains eroded.

The state is just the last game that was played and what the players did, followed by the current game in play. E.g., HCCH means the land was healthy, both players cooperated, and now the land is healthy for this game.

In [112]:
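# Each state label is: previous land state, player 1's action, player 2's action, current land state.
states = ["".join(s) for s in product(["H","E"],["C","D"],["C","D"],["H","E"])]
# Payoff matrices indexed [state][action 1][action 2], with 0 = C and 1 = D.
r1 = []
r2 = []
for state in states:
    if state[-1]=="H": # only the current land state determines the payoffs
        r1.append([[4,1],[5,2]]) # CC, CD, DC, DD
        r2.append([[4,5],[1,2]])
    else:
        r1.append([[2,0],[3,1]])
        r2.append([[2,3],[0,1]])
r1 = np.array(r1)
r2 = np.array(r2)
transitionMatrix = np.zeros((len(states),2,2,len(states))) # current state, action 1, action 2, new state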
for s in range(len(states)):
    for a1 in range(2):
        for a2 in range(2):
            healthyOutcome = states[s][-1] + ("C" if a1==0 else "D") + ("C" if a2==0 else "D") + "H"
            erodedOutcome = states[s][-1] + ("C" if a1==0 else "D") + ("C" if a2==0 else "D") + "E"
            if a1==0 and a2==0: # CC
                if states[s][-1]=="H":
                    transitionMatrix[s,a1,a2,states.index(healthyOutcome)]=0.9
                    transitionMatrix[s,a1,a2,states.index(erodedOutcome)]=0.1
                else:
                    transitionMatrix[s,a1,a2,states.index(healthyOutcome)]=0.3
                    transitionMatrix[s,a1,a2,states.index(erodedOutcome)]=0.7
            else:
                if states[s][-1]=="H":
                    transitionMatrix[s,a1,a2,states.index(healthyOutcome)]=0.5
                    transitionMatrix[s,a1,a2,states.index(erodedOutcome)]=0.5
                else:
                    transitionMatrix[s,a1,a2,states.index(healthyOutcome)]=0.0
                    transitionMatrix[s,a1,a2,states.index(erodedOutcome)]=1.0

print("states",states)

states ['HCCH', 'HCCE', 'HCDH', 'HCDE', 'HDCH', 'HDCE', 'HDDH', 'HDDE', 'ECCH', 'ECCE', 'ECDH', 'ECDE', 'EDCH', 'EDCE', 'EDDH', 'EDDE']


Now we run the value iteration: for each state we calculate the Q values, get the Nash strategies of the resulting one-shot game, and use them to update v.

In [113]:
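# Warning! Might take a while...
discount = 0.99
# Start from arbitrary values; no seed is set, so the starting point differs between runs.
v1 = np.random.rand(len(states))
v2 = np.random.rand(len(states))
for iteration in range(10000):
    old_v1 = v1.copy()
    old_v2 = v2.copy()
    for s in range(len(states)):
        # Q values in state s: immediate payoff plus discounted expected value of the next state.
        q1 = r1[s] + discount*np.sum(transitionMatrix[s]*v1.reshape(1,1,-1),axis=2)
        q2 = r2[s] + discount*np.sum(transitionMatrix[s]*v2.reshape(1,1,-1),axis=2)
        # Nash strategies of the one-shot game (q1, q2), from the notebook's own utils module.
        nash_strategy1, nash_strategy2 = utils.lemke_howson(q1,q2)
        # Probability of each joint action under the two strategies.
        pCC = nash_strategy1[0]*nash_strategy2[0]
        pCD = nash_strategy1[0]*nash_strategy2[1]
        pDC = nash_strategy1[1]*nash_strategy2[0]
        pDD = nash_strategy1[1]*nash_strategy2[1]
        # New state value: expected Q value under those strategies.
        v1[s]=pCC*q1[0,0]+pCD*q1[0,1]+pDC*q1[1,0]+pDD*q1[1,1]
        v2[s]=pCC*q2[0,0]+pCD*q2[0,1]+pDC*q2[1,0]+pDD*q2[1,1]
    # Stop once neither value function moved by more than the tolerance in this sweep.
    if np.max(np.abs(v1 - old_v1)) < 1e-4 and np.max(np.abs(v2 - old_v2)) < 1e-4:
        print("Stopping early!")
        break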

Stopping early!


We can see what the strategies are in each state:

In [114]:
for s in range(len(states)):
    q1 = r1[s] + discount*np.sum(transitionMatrix[s]*v1.reshape(1,1,-1),axis=2)
    q2 = r2[s] + discount*np.sum(transitionMatrix[s]*v2.reshape(1,1,-1),axis=2)
    nash_strategy1, nash_strategy2 = utils.lemke_howson(q1,q2)
    print(states[s],nash_strategy1,nash_strategy2)

HCCH [0. 1.] [0. 1.]
HCCE [0. 1.] [0. 1.]
HCDH [0. 1.] [0. 1.]
HCDE [0. 1.] [0. 1.]
HDCH [0. 1.] [0. 1.]
HDCE [0. 1.] [0. 1.]
HDDH [0. 1.] [0. 1.]
HDDE [0. 1.] [0. 1.]
ECCH [0. 1.] [0. 1.]
ECCE [0. 1.] [0. 1.]
ECDH [0. 1.] [0. 1.]
ECDE [0. 1.] [0. 1.]
EDCH [0. 1.] [0. 1.]
EDCE [0. 1.] [0. 1.]
EDDH [0. 1.] [0. 1.]
EDDE [0. 1.] [0. 1.]


The resulting strategy is to always defect, because defecting is the best option in every game considered on its own. The procedure does not allow a player to use their behaviour in one game as a threat in another.
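To illustrate what such a threat could achieve, the sketch below compares the value of cooperating forever with the value of defecting once and then facing mutual defection forever (a grim-trigger style punishment). This is not the method used above. It is a hedged side calculation on a compact two-state version of the farming game, and it checks only this one kind of deviation. All names are assumptions.

```python
# Compact farming game: state 0 = healthy, state 1 = eroded.
BETA = 0.99
R_CC = [4.0, 2.0]                     # payoff per state when both cooperate
R_DD = [2.0, 1.0]                     # payoff per state when both defect
R_DC = [5.0, 3.0]                     # payoff per state to a lone defector
P_CC = [[0.9, 0.1], [0.3, 0.7]]       # next-state probabilities after (C, C)
P_OTHER = [[0.5, 0.5], [0.0, 1.0]]    # next-state probabilities after any defection


def evaluate(reward, P, beta=BETA, tol=1e-10):
    """Iterative evaluation of a fixed joint policy on the two land states."""
    v = [0.0, 0.0]
    while True:
        new_v = [reward[s] + beta * sum(P[s][t] * v[t] for t in range(2))
                 for s in range(2)]
        if max(abs(a - b) for a, b in zip(new_v, v)) < tol:
            return new_v
        v = new_v


v_coop = evaluate(R_CC, P_CC)          # both cooperate forever
v_punish = evaluate(R_DD, P_OTHER)     # both defect forever
# Defect once against a cooperator, then face mutual defection forever.
v_deviate = [R_DC[s] + BETA * sum(P_OTHER[s][t] * v_punish[t] for t in range(2))
             for s in range(2)]

for s, name in enumerate(["healthy", "eroded"]):
    print(name, "cooperate:", round(v_coop[s], 2), "deviate:", round(v_deviate[s], 2))
```

In both land states cooperating is worth more than the one-off deviation, so with this discount factor the threat of permanent defection is enough to make deviating unattractive.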