# Lab 11: Average-Reward Markov Decision Processes

Made & Presented by Bo Tang

In this lab, we will explore Markov Decision Processes (MDPs) under the average-reward criterion in the infinite-horizon setting. Unlike the discounted case, where future rewards are geometrically devalued, the average-reward framework optimizes the long-run average reward (or cost) per period. We will formulate the problem as a Linear Program (LP) and implement it using Gurobi.

In [1]:
# install gurobipy first
! pip install gurobipy


In [2]:
import gurobipy as gp
import numpy as np
from gurobipy import GRB

### **1 Markov Decision Processes Review**

We begin by reviewing the standard framework of Markov Decision Processes (MDPs), which provide a mathematical model for sequential decision-making under uncertainty.

A typical MDP consists of:
- A finite set of states $S$
- For each state $s \in S$, a finite set of actions $A(s)$ available in that state
- A reward function $r(s, a)$ that gives the immediate reward for taking action $a$ in state $s$
- A transition probability function $P(t \mid s, a)$, representing the probability of moving to state $t$ after taking action $a$ in state $s$
- A decision policy $\pi_s$, which specifies the action to take in each state

The goal is to find a policy $\pi$ that maximizes some measure of long-term reward — typically, either the expected total discounted reward or the average reward per time step.

### **2 Markov Chain and Steady-State Distribution**

Once an MDP is governed by a fixed (stationary) policy, the evolution of the system becomes a **Markov Chain**. In such a chain:

- The system can be in one of $n$ possible states.
- The **transition probability** from state $i$ to $j$ is denoted $p_{ij}$.
- The **steady-state distribution** $\pi = (\pi_1, \ldots, \pi_n)$ describes the long-run probability of being in each state. From this point on, $\pi$ denotes this distribution rather than a policy.

The defining condition for a steady-state distribution is:

$$
\pi_j = \sum_{i=1}^{n} \pi_i \cdot p_{ij}, \quad \text{for all } j
$$

This ensures that the distribution remains **unchanged over time**, i.e., $\pi = \pi P$.
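As a small illustration of the condition $\pi = \pi P$, the sketch below computes the steady-state distribution of a two-state chain with NumPy. The transition matrix and the helper name `steady_state` are hypothetical, chosen only for this example; the approach stacks the balance equations with the normalization $\sum_j \pi_j = 1$ and solves the resulting linear system by least squares.

```python
import numpy as np

def steady_state(P):
    """Solve pi = pi P together with sum(pi) = 1 (hypothetical helper)."""
    P = np.asarray(P, dtype=float)
    n = P.shape[0]
    # pi (P - I) = 0 is equivalent to (P^T - I) pi^T = 0;
    # the extra row of ones enforces the normalization.
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.append(np.zeros(n), 1.0)
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

# Illustrative two-state chain (numbers are assumptions, not from the lab).
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
pi = steady_state(P)
print(pi)        # long-run probability of each state
print(pi @ P)    # should reproduce pi
```

For this chain the balance equation $0.1\,\pi_1 = 0.5\,\pi_2$ gives $\pi = (5/6,\ 1/6)$.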

### **3 Average Reward Formulation**

We now shift our focus from discounted total reward to the long-run **average reward per time step**, which leads to a different modeling framework based on **steady-state analysis**.

Instead of computing the present value of future rewards, we focus on the **long-run frequency** with which the system visits each state-action pair.

In the MDP setting, we extend the idea from Markov Chains:  
Each action $a$ taken in state $s$ leads to a different transition matrix $P_a$. We define $\pi_{sa}$ as the **steady-state joint probability** that the system is in state $s$ and takes action $a$.

Assuming that every stationary policy induces a Markov chain with a single recurrent class, so that a unique steady-state distribution exists, the **average reward** can be expressed as:

$$
\bar{r} = \sum_{s \in S} \sum_{a \in A(s)} r(s, a) \cdot \pi_{sa}
$$

In other words, instead of computing a value function for each state, we now optimize over the **frequencies** of visiting each state-action pair. The total average reward becomes a **weighted sum of immediate rewards**, where the weights reflect how often each $(s, a)$ is used under the system's long-run behavior.
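To make the weighted-sum view concrete, here is a minimal sketch that fixes a deterministic policy on a toy two-state MDP, computes the induced steady-state distribution, and evaluates $\bar{r}$. The MDP data, the action labels `"a"` and `"b"`, and all variable names are assumptions made up for this illustration.

```python
import numpy as np

# Hypothetical two-state MDP: P[a][s][t] = P(t | s, a), r[a][s] = r(s, a).
P = {"a": np.array([[0.9, 0.1], [0.5, 0.5]]),
     "b": np.array([[0.2, 0.8], [0.1, 0.9]])}
r = {"a": np.array([1.0, 2.0]),
     "b": np.array([0.0, 3.0])}

policy = ["b", "b"]          # action chosen in state 0 and in state 1
n = len(policy)

# Markov chain induced by the policy: row s comes from the chosen action.
P_policy = np.array([P[policy[s]][s] for s in range(n)])

# Steady-state distribution: pi = pi P_policy with sum(pi) = 1.
A = np.vstack([P_policy.T - np.eye(n), np.ones((1, n))])
b = np.append(np.zeros(n), 1.0)
state_dist, *_ = np.linalg.lstsq(A, b, rcond=None)

# Under a deterministic policy, pi_sa is the state probability placed
# on the single action the policy selects in that state.
pi_sa = {(s, policy[s]): state_dist[s] for s in range(n)}
avg_reward = sum(r[a][s] * p for (s, a), p in pi_sa.items())
print(pi_sa, avg_reward)
```

Here the chain spends $1/9$ of the time in state 0 and $8/9$ in state 1, so $\bar{r} = 0 \cdot \tfrac{1}{9} + 3 \cdot \tfrac{8}{9} = \tfrac{8}{3}$.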

##### **3.1 Objective and Decision Variables**

Given this setup, we formulate a linear program where the decision variable is the joint steady-state distribution $\pi_{sa}$. The goal is to maximize the long-run average reward:

$$
\max_{\pi} \sum_{s \in S} \sum_{a \in A(s)} r(s, a) \pi_{sa}
$$

However, in order for this optimization problem to be meaningful and well-defined, we must ensure that the variables $\pi_{sa}$ represent a valid steady-state distribution. This requires several key constraints:

##### **3.2 Flow Balance Constraints**

The flow balance constraints ensure that, in steady-state, the total probability "flowing into" each state equals the total probability "flowing out". This reflects a stable long-run system where no probability mass accumulates or vanishes in any state.

For every state $t \in S$, the balance condition is:

$$
\sum_{a \in A(t)} \pi_{ta} = \sum_{s \in S} \sum_{a \in A(s)} P(t \mid s, a) \cdot \pi_{sa}
$$

- Left-hand side: The total probability of being in state $t$ and taking some action $a$.
- Right-hand side: The total probability of transitioning into state $t$ from any state-action pair $(s, a)$.

This constraint guarantees the internal consistency of the joint distribution $\pi_{sa}$ under the system dynamics.
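One way to sanity-check a candidate $\pi_{sa}$ is to compute, for each state $t$, the difference between the two sides of the balance condition. The sketch below does this in plain Python; the helper name `flow_balance_residuals`, the dictionary layout `P[(s, a)] = {t: probability}`, and the toy numbers are all assumptions for illustration.

```python
def flow_balance_residuals(states, P, pi_sa):
    """Return {t: inflow - occupancy}; all zeros means pi_sa is balanced."""
    residuals = {}
    for t in states:
        # Left-hand side: probability of being in t (summed over actions).
        occupancy = sum(p for (s, a), p in pi_sa.items() if s == t)
        # Right-hand side: probability of transitioning into t.
        inflow = sum(P[s, a].get(t, 0.0) * p for (s, a), p in pi_sa.items())
        residuals[t] = inflow - occupancy
    return residuals

# Hypothetical two-state example where only action "b" is used.
states = [0, 1]
P = {(0, "b"): {0: 0.2, 1: 0.8},
     (1, "b"): {0: 0.1, 1: 0.9}}

balanced = {(0, "b"): 1 / 9, (1, "b"): 8 / 9}
unbalanced = {(0, "b"): 0.5, (1, "b"): 0.5}

print(flow_balance_residuals(states, P, balanced))
print(flow_balance_residuals(states, P, unbalanced))
```

The first distribution satisfies the constraint in both states, while the second leaves a nonzero residual, so it cannot be a steady-state distribution for these dynamics.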

##### **3.3 Probability Normalization Constraint**

Since $\pi_{sa}$ represents a joint probability distribution over state-action pairs, its total sum must equal 1:

$$
\sum_{s \in S} \sum_{a \in A(s)} \pi_{sa} = 1
$$

This ensures that $\pi_{sa}$ corresponds to a proper probability distribution.

##### **3.4 Non-Negativity Constraints**

All probabilities must be non-negative:

$$
\pi_{sa} \geq 0, \quad \forall s \in S,\ a \in A(s)
$$


##### **3.5 Linear Programming Formulation**

Together, the objective and these constraints form a complete linear program for solving average reward MDPs:

$$
\max_{\pi} \sum_{s \in S} \sum_{a \in A(s)} r(s, a) \cdot \pi_{sa}
$$

Subject to:

$$
\sum_{a \in A(t)} \pi_{ta} = \sum_{s \in S} \sum_{a \in A(s)} P(t \mid s, a) \cdot \pi_{sa}
$$

$$
\sum_{s \in S} \sum_{a \in A(s)} \pi_{sa} = 1
$$

$$
\pi_{sa} \geq 0, \quad \forall s \in S,\ a \in A(s)
$$

### **4 Problem Statement**

A warehouse has an end-of-period capacity of 3 units. During a period in which production takes place, a setup cost of \$4 is incurred. A \$1 holding cost is assessed against each unit of a period's ending inventory. Also, a variable production cost of \$1 per unit is incurred. During each period, demand is equally likely to be 1 or 2 units. All demand must be met on time. Minimize the expected long-run average cost per period over an infinite horizon.

Try implementing this LP model. Because the problem is stated in terms of costs, the objective is minimized rather than maximized:

In [None]:
# define parameters
states = [0, 1, 2, 3] # states

# define allowed actions per state
actions = {
    0: [2, 3, 4],
    1: [1, 2, 3],
    2: [0, 1, 2],
    3: [0, 1]
}

# TODO: define transition probabilities P(t | s, a)
P = {
    # s = 0

    # s = 1

    # s = 2

    # s = 3

}

# TODO: define the cost with s-a pairs
cost = {
    # s = 0

    # s = 1

    # s = 2

    # s = 3

}
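If you prefer not to type every entry by hand, the two dictionaries can be generated from the inventory dynamics. The sketch below is one possible construction under stated assumptions: the state $s$ is the inventory at the start of a period, the action $a$ is the number of units produced, the next state is $s + a - d$ for demand $d$, and the holding cost is taken in expectation over the ending inventory. The layout `P[(s, a)] = {t: probability}` and the constant names are hypothetical.

```python
states = [0, 1, 2, 3]
actions = {0: [2, 3, 4], 1: [1, 2, 3], 2: [0, 1, 2], 3: [0, 1]}

demand_prob = {1: 0.5, 2: 0.5}        # demand is 1 or 2, equally likely
SETUP_COST, UNIT_COST, HOLDING_COST = 4, 1, 1

P = {}
cost = {}
for s in states:
    for a in actions[s]:
        P[s, a] = {}
        expected_holding = 0.0
        for d, prob in demand_prob.items():
            t = s + a - d             # ending inventory = next state
            P[s, a][t] = P[s, a].get(t, 0.0) + prob
            expected_holding += prob * HOLDING_COST * t
        setup = SETUP_COST if a > 0 else 0
        cost[s, a] = setup + UNIT_COST * a + expected_holding

for key in cost:
    print(key, P[key], cost[key])
```

Every listed action keeps the ending inventory between 0 and 3 for both demand values, which is why the allowed actions differ by state.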

In [None]:
# create a model
m = gp.Model("MDP")
# variables: one steady-state probability pi[s, a] per state-action pair
pi = m.addVars(cost, lb=0, name="steady-state distribution")
# TODO: objective function: minimize the long-run average cost
m.setObjective
# TODO: constraints (flow balance for each state, normalization)
m.addConstr
# solve
m.optimize()
# output results
print("\nOptimal steady-state policy (pi_sa > 0):")
for (s, a) in cost:
    if pi[s, a].X > 1e-6:
        print(f"State {s}, Action {a} -> π = {pi[s, a].X:.4f}")
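Because this instance has only $3 \cdot 3 \cdot 3 \cdot 2 = 54$ deterministic stationary policies, the LP result can be cross-checked by brute force. The sketch below is not part of the LP approach: it enumerates every policy, computes a steady-state distribution of the induced chain with NumPy, and keeps the cheapest one. It assumes the state is the start-of-period inventory, the action is the production quantity, and the holding cost is charged on the expected ending inventory; the helper names are hypothetical.

```python
import itertools
import numpy as np

states = [0, 1, 2, 3]
actions = {0: [2, 3, 4], 1: [1, 2, 3], 2: [0, 1, 2], 3: [0, 1]}

def step_cost(s, a):
    # setup + production + expected holding (mean demand is 1.5)
    return (4 if a > 0 else 0) + a + (s + a - 1.5)

def average_cost(policy):
    """Long-run average cost of a deterministic policy {state: action}."""
    n = len(states)
    P = np.zeros((n, n))
    for s in states:
        for d in (1, 2):                      # each demand has probability 0.5
            P[s, s + policy[s] - d] += 0.5
    # Solve pi = pi P with sum(pi) = 1. If a policy induces several
    # recurrent classes, lstsq returns one stationary distribution.
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.append(np.zeros(n), 1.0)
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return float(sum(pi[s] * step_cost(s, policy[s]) for s in states))

best_policy, best_cost = None, float("inf")
for choice in itertools.product(*(actions[s] for s in states)):
    policy = dict(zip(states, choice))
    c = average_cost(policy)
    if c < best_cost:
        best_policy, best_cost = policy, c

print(best_policy, round(best_cost, 4))
```

Under these assumptions, the enumeration selects producing 4 units in state 0, 3 units in state 1, and nothing in states 2 and 3, with an average cost of $44/9 \approx 4.889$ per period. The state-action pairs with $\pi_{sa} > 0$ in your LP solution should describe the same policy.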