Consider an infinitely repeated duopoly market with stochastic demand. Specifically, firms 1 and 2 interact over time $t \in\{1, \ldots, \infty\}$, with the demand in each period being either high (H) or low (L). If the demand is H, the inverse demand function is $p_{H}=120-q_{1}-q_{2}$; if the demand is L, the inverse demand function is $p_{L}=60-q_{1}-q_{2}$ (where $q_{1}$ is the quantity chosen by firm 1 and $q_{2}$ is the quantity chosen by firm 2). Furthermore, suppose that firms discount the future with discount factor $\beta=0.99$ and that demand evolves according to the transition probability matrix $M=\left[\begin{array}{rr}0.8 & 0.2 \\ 0.3 & 0.7\end{array}\right]$. For example, the probability that the demand in period $t+1$ is H given that the demand in period $t$ is H is $0.8$. For simplicity, suppose that each firm produces at zero marginal cost.

a) Suppose that firm 1's policy is to produce 30 in every period. Determine the value function and the optimal policy function for firm 2. Hint: there are two possible states to consider, (H) and (L). Your goal is to find the value of being in each state and the optimal decision by firm 2.

b) Suppose that firm 1's policy is to produce 25 in every round as long as neither firm has exceeded 35 in any past round. If either firm has exceeded 35, firm 1 will produce 40. Determine the value and the policy function for firm 2. Hint: there are four possible states to consider: (H, neither firm exceeded 35), (H, one of the firms exceeded 35), (L, neither firm exceeded 35), (L, one of the firms exceeded 35).

In [1]:
import numpy as np

M = np.array([[0.8, 0.2], [0.3, 0.7]])

p_h = lambda q1, q2: 120 - q1 - q2
p_l = lambda q1, q2: 60 - q1 - q2

beta = 0.99

H = 0
L = 1

## a

Firm 1's policy is constant: $\pi_1(s) = 30$ for every state $s$.


For firm 2, the state space is $S = \{H, L\}$.  
The action space is $A = \{a \in \mathbb{R} : 0 \leq a < 90\}$; the code below searches only over the integer quantities from 1 to 90.  
We want to find firm 2's policy $\pi_2$ by value iteration.  

The value function for $\pi_2$ is

$$
V(s) = \max_{a} \sum_{s' \in S} \mathcal{T}(s, a, s') \left(R(s, a, s') + \beta V(s')\right)
$$

Given the current state $s$ and the chosen action $a$, $\mathcal{T}(s, a, s')$ is the probability that the next state is $s'$; it can be read off from $M$.  
Similarly, $R(s, a, s')$ is the reward received when taking action $a$ in state $s$ and arriving in state $s'$.  

Once value iteration has converged to the optimal value function $V^*$, the optimal policy is $$\pi_2(s) = \underset{a}{\arg\max} \sum_{s' \in S} \mathcal{T} (s, a, s') (R(s, a, s') + \beta V^*(s'))$$ 

In this problem the action does not affect the state transition, so we can simplify $\mathcal{T}(s, a, s')$ to $\mathcal{T}(s, s')$.

In [2]:
def trans(s0, s1):
    # Probability of moving from demand state s0 to s1 (0 = H, 1 = L).
    return M[s0, s1]

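The code below evaluates the reward at the next state $s'$, which amounts to assuming that firm 2 commits to its quantity before that period's demand is realized. Under this convention the current state enters only through the transition probabilities, so we can simplify $R(s, a, s')$ to $R(a, s')$. If demand were instead observed before the quantity is chosen, the reward would depend on $s$ rather than $s'$.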

In [3]:
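# Only revenue matters because marginal cost is zero.
def rew(a, s1):
    # Firm 2's revenue from quantity a when firm 1 produces 30 and demand is s1.
    assert s1 in (H, L)

    if s1 == H:  # next state is H
        return p_h(30, a) * a
    else:  # next state is L
        return p_l(30, a) * a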
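
Because the action does not enter the transition probabilities, the continuation term in the Bellman equation is the same for every action, so the maximization reduces to a one-period problem. The derivation below is a sketch under the notebook's convention that the reward is evaluated at next period's demand. The symbols $A_H = 120$, $A_L = 60$ (demand intercepts), $M_{ss'}$ (entries of $M$) and $\bar{A}_s$ (expected intercept given state $s$) are notation introduced here for illustration.

$$
\begin{aligned}
V(s) &= \max_{a} \sum_{s'} M_{ss'} \left[(A_{s'} - 30 - a)\,a + \beta V(s')\right] \\
     &= \max_{a} \left(\bar{A}_s - 30 - a\right) a + \beta \sum_{s'} M_{ss'} V(s'), \qquad \bar{A}_s = \sum_{s'} M_{ss'} A_{s'} \\
a^*(s) &= \frac{\bar{A}_s - 30}{2}
\end{aligned}
$$

With $\bar{A}_H = 0.8 \cdot 120 + 0.2 \cdot 60 = 108$ and $\bar{A}_L = 0.3 \cdot 120 + 0.7 \cdot 60 = 78$, this gives $a^*(H) = 39$ and $a^*(L) = 24$ when quantities are treated as continuous.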

Value function and $\pi_2$

In [4]:
v = np.zeros(2)  # [0]: H, [1]: L

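def bf_argmax_a(fn, l, h):
    # Brute-force search over the integers l, ..., h (upper bound included).
    # Returns (best action, best value).
    ind = np.arange(l, h + 1)
    vs = fn(ind)
    return l + np.argmax(vs), np.max(vs)

def pi_2(s):
    # Returns (greedy action, its value) in state s under the current estimate v.
    # TODO: v_a can be simplified as matrix product
    v_a = lambda a: trans(s, 0) * (rew(a, 0) + beta * v[0]) + trans(s, 1) * (rew(a, 1) + beta * v[1])
    return bf_argmax_a(v_a, 1, 90)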

In [5]:
def iteration(max_iter, delta):
    global v
    for _ in range(max_iter):
        new_v = np.array([pi_2(0)[1], pi_2(1)[1]])

        # Stop once the largest absolute change falls below delta.
        diff = np.max(np.abs(new_v - v))
        v = new_v
        if diff <= delta:
            break

In [6]:
iteration(10000, 0.01)

In [7]:
print(f"If current state is H, firm 2 should produce: {pi_2(0)[0]}")
print(f"If current state is L, firm 2 should produce: {pi_2(1)[0]}")

If current state is H, firm 2 should produce: 39
If current state is L, firm 2 should produce: 24
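
Part a also asks for the value function, which the cells above do not print. The following is a minimal cross-check, assuming the notebook's timing convention (reward evaluated at next period's demand) and continuous quantities. The names `intercepts`, `expected_intercept`, `best_q2`, `expected_reward` and `values` are introduced here for illustration; the problem data are restated so the snippet runs on its own.

```python
import numpy as np

# Problem data, restated so the snippet is self-contained.
M = np.array([[0.8, 0.2], [0.3, 0.7]])   # demand transitions, order (H, L)
intercepts = np.array([120.0, 60.0])     # demand intercepts for H and L
beta = 0.99
q1 = 30.0                                # firm 1's fixed quantity in part a

# Expected demand intercept next period, given the current state.
expected_intercept = M @ intercepts

# Maximizing (expected_intercept - q1 - a) * a over a gives the best response.
best_q2 = (expected_intercept - q1) / 2
expected_reward = (expected_intercept - q1 - best_q2) * best_q2

# Under a fixed policy the Bellman equation is linear: V = r + beta * M V.
values = np.linalg.solve(np.eye(2) - beta * M, expected_reward)

print("best response (H, L):", best_q2)
print("value function (H, L):", values)
```

Solving the linear system directly avoids the stopping tolerance of value iteration, so it can serve as a reference for the iterated `v`.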


## b

$$
V(s) = \max_{a} \sum_{s' \in S} \mathcal{T}(s, a, s') \left(R(s, a, s') + \beta V(s')\right)
$$

This question appears to be underspecified. Following the hint, there are four states to consider, but the corresponding $4 \times 4$ transition matrix $\mathcal{T}$ is not given. We know $M$, but that alone does not tell us, for example, the probability $\Pr\big((H, q > 35) \rightarrow (H, q \leq 35)\big)$. The reward also differs across the current states $s$. Thus we cannot solve this problem without full knowledge of $\mathcal{T}$.
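
One possible way to pin down $\mathcal{T}$ is to assume that the "exceeded 35" flag is updated deterministically: it is set as soon as firm 2 produces more than 35 and never resets (firm 1's punishment quantity of 40 also exceeds 35). Demand would then still follow $M$, and the four-state transition is the combination of the two. The sketch below runs value iteration under that assumption, keeps the notebook's convention that the reward is evaluated at next period's demand, and restricts firm 2 to an integer grid. The flag rule, the grid and the names `bellman_backup`, `actions`, `intercepts` and `policy` are all assumptions made for illustration, not part of the problem statement.

```python
import numpy as np

M = np.array([[0.8, 0.2], [0.3, 0.7]])   # demand transitions, order (H, L)
intercepts = np.array([120.0, 60.0])     # demand intercepts for H and L
beta = 0.99
actions = np.arange(0, 91)               # integer quantity grid for firm 2 (assumed)

def bellman_backup(V):
    """One backup. V[d, f]: d = demand (0 H, 1 L), f = 1 if 35 was ever exceeded."""
    Q = np.empty((2, 2, actions.size))
    for d in range(2):
        for f in range(2):
            q1 = 25.0 if f == 0 else 40.0
            # Assumed flag rule: sticky once set; firm 2 sets it by producing more than 35.
            next_f = np.where((actions > 35) | (f == 1), 1, 0)
            # Revenue for each next demand state (rows) and each action (columns).
            revenue = (intercepts[:, None] - q1 - actions[None, :]) * actions[None, :]
            Q[d, f] = M[d] @ (revenue + beta * V[:, next_f])
    return Q

V = np.zeros((2, 2))
for _ in range(20000):
    Q = bellman_backup(V)
    new_V = Q.max(axis=2)
    converged = np.max(np.abs(new_V - V)) < 1e-6
    V = new_V
    if converged:
        break

policy = actions[Q.argmax(axis=2)]
print("policy [demand, flag]:\n", policy)
print("values [demand, flag]:\n", V)
```

Under a different flag rule or timing convention the transition structure, and therefore the result, would change.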