# Lab 10 Discounted Markov Decision Processes

Made & Presented by Bo Tang

In this lab, we will explore the concept of Markov Decision Processes in the infinite horizon discounted case, understand how to formulate them, and implement solution methods including Linear Programming, Policy Iteration, and Value Iteration.

In [None]:
# install gurobipy first
! pip install gurobipy

In [None]:
import gurobipy as gp
import numpy as np
from gurobipy import GRB

### **1 Dynamic Programming Review**

In Dynamic Programming (DP), the problem is divided into several stages, each with:
- **Stages** $n = 1, 2, \cdots, N$: A sequence of decision points.
- **State Variables** $s_n$: Describe the current state, containing all necessary information to make the next decision.
- **Decision Variables** $x_n$: The choices available at each stage, which affect the next state.
- **Transition Relations** $s_{n+1} = g_n(s_n, x_n)$: Define how the system moves from one state to another based on the decision made.
- **Immediate Return** $h_n(s_n, x_n)$: The reward or cost associated with each decision at each stage.

The solution to a dynamic programming problem typically involves **backward recursion**, where we start from the last stage and move backward to calculate the optimal solution. With these basic concepts, we can express the optimal return from stage $n$ to the end of the horizon as:
$$
f_n(s_n) = \max_{x_n} \big[ h_n(s_n, x_n) + f_{n+1}\big(g_n(s_n, x_n)\big) \big]
$$
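
The recursion can be sketched in a few lines of Python. The toy problem below is purely illustrative and not part of the lab: it assumes three stages with hypothetical selling prices, a state equal to the units still in stock, and a decision to sell 0 or 1 unit per stage. The names `backward_recursion` and `prices`, and the lambdas standing in for $g_n$ and $h_n$, are assumptions made for this sketch.

```python
def backward_recursion(N, states, decisions, g, h):
    """Compute f_n(s) for n = N, ..., 1 with terminal value f_{N+1}(s) = 0."""
    f = {N + 1: {s: 0.0 for s in states}}
    best = {}
    for n in range(N, 0, -1):          # start from the last stage and move backward
        f[n], best[n] = {}, {}
        for s in states:
            values = {x: h(n, s, x) + f[n + 1][g(n, s, x)] for x in decisions(n, s)}
            best[n][s] = max(values, key=values.get)
            f[n][s] = values[best[n][s]]
    return f, best

# Hypothetical data: price per unit sold at each stage
prices = {1: 1.0, 2: 3.0, 3: 2.0}
f, best = backward_recursion(
    N=3,
    states=[0, 1, 2, 3],
    decisions=lambda n, s: [0, 1] if s > 0 else [0],  # sell at most one unit per stage
    g=lambda n, s, x: s - x,                          # transition relation
    h=lambda n, s, x: prices[n] * x,                  # immediate return
)
print("optimal return from stage 1 with 2 units:", f[1][2])
print("optimal first decision with 2 units:", best[1][2])
```

With two units in stock, the recursion holds back at stage 1 and sells at the two higher-priced stages.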

### **2 From DP to MDP**

#### **2.1 Discount with Infinite Horizon**

In previous DP problems, we worked with multiple stages, but those stages $n$ were not always interpreted as time periods. In the context of MDPs, we now make a key assumption: each stage represents a **time period** $t$, and the process continues **indefinitely**.

For instance, consider a company that expects to operate for the foreseeable future without a predefined endpoint. In such settings, we model the system over an infinite horizon and incorporate a **discount factor** $\beta \in (0, 1)$ to prioritize immediate costs or rewards over distant ones.

Under these assumptions, the optimal value function $f_t(s_t)$ for each state $s_t$ satisfies the Bellman equation:

$$
f_t(s_t) = \max_{x_t} \big[ h_t(s_t, x_t) + \beta \cdot f_{t+1}\big(g_t(s_t, x_t)\big) \big]
$$

This recursive equation expresses the optimal total return starting from state $s_t$, as the best immediate reward plus the discounted future value.
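
To see why discounting keeps an infinite-horizon total finite, consider a constant return $c$ received every period: its discounted total is the geometric series $c + \beta c + \beta^2 c + \cdots = c / (1 - \beta)$. The sketch below checks this numerically. The helper name `discounted_total` and the numbers ($c = 6$, $\beta = 0.8$) are chosen only for illustration.

```python
def discounted_total(returns, beta):
    """Sum of beta**t * returns[t] over t = 0, 1, 2, ..."""
    return sum(beta ** t * r for t, r in enumerate(returns))

beta = 0.8
c = 6.0
approx = discounted_total([c] * 200, beta)  # truncate the infinite sum at 200 periods
closed_form = c / (1 - beta)                # limit of the geometric series
print(round(approx, 6), round(closed_form, 6))
```

The truncated sum and the closed form agree up to floating-point rounding, because $\beta^{200}$ is negligible.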

#### **2.2 Stochastic Transitions**

So far, in both classical Dynamic Programming and our previous formulations, we assumed deterministic transitions—that is, given the current state $s_t$ and action $x_t$, the next state $s_{t+1}$ is fully determined by a function $g_t(s_t, x_t)$.

However, in many real-world settings, transitions are inherently stochastic. That is, after taking action $x_t$ in state $s_t$, the next state $s_{t+1}$ is not deterministic, but follows a probability distribution:

$$
\text{Pr}(s_{t+1} \mid s_t, x_t)
$$

To incorporate this uncertainty, we modify the Bellman equation to take the expected value over all possible next states. The value function now becomes:

$$
f_t(s_t) = \max_{x_t} \left[ h_t(s_t, x_t) + \beta \cdot \sum_{s_{t+1}} \Pr(s_{t+1} \mid s_t, x_t) \cdot f_{t+1}(s_{t+1}) \right]
$$

This expression reflects the fact that, rather than knowing exactly what happens next, we optimize based on the expected return across all possible outcomes weighted by their transition probabilities.
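
A single stochastic Bellman backup for one state can be sketched as follows. The action names, probabilities, and next-period values are hypothetical numbers chosen for illustration, and `expected_backup` is a helper introduced here, not something defined elsewhere in the lab.

```python
def expected_backup(actions, f_next, beta):
    """One Bellman backup for a single state (maximization).

    actions maps an action name to (immediate_return, {next_state: probability}).
    """
    values = {}
    for a, (h, probs) in actions.items():
        assert abs(sum(probs.values()) - 1.0) < 1e-12  # probabilities must sum to 1
        values[a] = h + beta * sum(p * f_next[s2] for s2, p in probs.items())
    best = max(values, key=values.get)
    return best, values[best], values

# Hypothetical next-period values and actions
f_next = {"A": 2.0, "B": 6.0, "C": 0.0}
actions = {
    "safe":  (1.0, {"A": 1.0}),             # certain transition to A
    "risky": (0.0, {"B": 0.5, "C": 0.5}),   # coin flip between B and C
}
best, value, values = expected_backup(actions, f_next, beta=0.5)
print(best, value, values)
```

Here the risky action has the higher expected next-period value (3 versus 2), but after discounting and adding the immediate return the safe action is preferred.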

#### **2.3 Notation**

To simplify notation in the infinite-horizon case, we drop the explicit time index $t$, and assume the system is stationary—that is, transition probabilities and rewards do not change over time.

- Let $s$ denote the current state
- Let $a$ denote an action taken in state $s$, where $a \in A(s)$ and $A(s)$ is the set of actions feasible from state $s$
- Let $r(s, a)$ be the immediate cost (or negative reward)
- Let $P(s' \mid s, a)$ be the probability of transitioning to state $s'$ after taking action $a$ in state $s$
- Let $\beta \in (0, 1)$ be the discount factor
- Let $V(s)$ be the value function, representing the expected discounted cost starting from state $s$

With this notation, the Bellman equation becomes:

$$
V(s) = \min_{a \in A(s)} \left[ r(s, a) + \beta \cdot \sum_{s'} P(s' \mid s, a) \cdot V(s') \right]
$$

### **3 Problem Statement**

A warehouse has an end-of-period capacity of 3 units. During a period in which production takes place, a setup cost of \$4 is incurred. A \$1 holding cost is assessed against each unit of a period’s ending inventory. Also, a variable production cost of \$1 per unit is incurred. During each period, demand is equally likely to be 1 or 2 units. All demand must be met on time, and $\beta = 0.8$. Minimize expected discounted costs over an infinite horizon.

##### Question:
To formulate this problem as a dynamic program or MDP, identify the following elements:
- Stages:
- States:
- Decisions:

In [None]:
# states
states = [0, 1, 2, 3]
# discount factor
beta = 0.8
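
One possible way to encode the remaining ingredients of the model is sketched below. It assumes the state is the inventory on hand at the start of a period and the action is the production quantity. It follows the cost coefficients used in this lab's policy-iteration cell, which charge the \$1 holding cost on the inventory carried into the period. The helper names `feasible_actions`, `cost`, and `transitions` are introduced here for illustration only.

```python
states = [0, 1, 2, 3]        # inventory at the start of a period
beta = 0.8                   # discount factor
demands = {1: 0.5, 2: 0.5}   # demand value -> probability
capacity = 3                 # end-of-period warehouse capacity

def feasible_actions(s):
    """Production x must cover the largest demand and respect the capacity."""
    return [x for x in range(0, 5)
            if s + x >= max(demands) and s + x - min(demands) <= capacity]

def cost(s, x):
    """Setup cost (4 if producing) + 1 per unit produced + 1 per unit of s (assumed convention)."""
    return (4 if x > 0 else 0) + 1 * x + 1 * s

def transitions(s, x):
    """Return {next_state: probability} after producing x in state s."""
    return {s + x - d: p for d, p in demands.items()}

for s in states:
    print(s, {x: (cost(s, x), transitions(s, x)) for x in feasible_actions(s)})
```

Keeping the model in three small functions means the same definitions can be reused by any of the solution methods that follow.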

#### **3.1 Linear Programming**

One way to solve an infinite-horizon discounted MDP is to formulate it as a linear program (LP). This approach is based on the Bellman equations.

For each state $s$, we introduce a variable $V(s)$ representing the minimum expected discounted cost starting from state $s$. Each $V(s)$ must satisfy the Bellman equation:
$$
V(s) = \min_{a \in A(s)} \left[ r(s, a) + \beta \cdot \sum_{s'} P(s' \mid s, a) \cdot V(s') \right]
$$

Because a minimum is never larger than any of the terms it ranges over, this equation implies one linear inequality for each feasible action $a$ in each state $s$:

$$
V(s) \leq r(s, a) + \beta \cdot \sum_{s'} P(s' \mid s, a) \cdot V(s') \quad \forall a \in A(s)
$$

To complete the linear program, we define the **objective function** as:

$$
\max \sum_s V(s)
$$

This objective ensures that the inequalities are **tight** for at least one action in each state—that is, the Bellman inequality becomes an equality for the optimal action. We can recover the optimal policy once we solve the LP and obtain the value function $V(s)$.
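
The tightness property can be checked without an LP solver. The sketch below does not solve the LP. It takes a candidate value function, verifies every Bellman inequality, and lists the actions whose inequality holds with equality. The candidate values are an assumption: they come from evaluating the policy $(4, 3, 0, 0)$ by hand, using the cost coefficients from this lab's policy-iteration cell. Exact fractions are used so that equality can be tested without rounding error.

```python
from fractions import Fraction as F

states = [0, 1, 2, 3]
beta = F(4, 5)
actions = {0: [2, 3, 4], 1: [1, 2, 3], 2: [0, 1, 2], 3: [0, 1]}

def cost(s, x):
    # assumed cost convention: setup 4 if producing, 1 per unit produced, 1 per unit of s
    return (4 if x > 0 else 0) + x + s

def rhs(s, x, V):
    """Right-hand side of the Bellman inequality for state s and action x."""
    return cost(s, x) + beta * sum(F(1, 2) * V[s + x - d] for d in (1, 2))

# Candidate values (assumption: hand evaluation of the policy 4, 3, 0, 0)
V = {0: F(1290, 49), 1: F(1290, 49), 2: F(1130, 49), 3: F(1115, 49)}

# LP feasibility: V(s) <= r(s, a) + beta * E[V(s')] for every state and action
feasible = all(V[s] <= rhs(s, x, V) for s in states for x in actions[s])
# Tight constraints identify the minimizing action in each state
tight = {s: [x for x in actions[s] if V[s] == rhs(s, x, V)] for s in states}
print("LP-feasible:", feasible)
print("tight actions:", tight)
```

A feasible $V$ with at least one tight action in every state satisfies the Bellman equation, so the tight actions give a policy that attains the minimum.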


Try implementing this LP model:

In [None]:
# create a model
m = gp.Model("MDP")
# variables
v = m.addVars(states, lb=-GRB.INFINITY, name="value") # value
# TODO: obj func
m.setObjective
# TODO: constr
# v0 = min x in {2,3,4}
m.addConstr
# v1 = min x in {1,2,3}
m.addConstr
# v2 = min x in {0,1,2}
m.addConstr
# v3 = min x in {0,1}
m.addConstr
# solve
m.optimize()
# value
print("Model Solution:")
for s in states:
    print("v_{} = {:.2f}".format(s, v[s].x), end=" ")

##### Question:
What are the actions of the optimal policy?

$x^*(0)=4, x^*(1)=3, x^*(2)=0, x^*(3)=0$

#### **3.2 Policy Iteration**

This method consists of two main steps, which are repeated until convergence:

1. **Policy Evaluation**:  
   Given a fixed policy $\pi$, compute the value function $V^\pi(s)$ for all states $s$.  
   This involves solving a system of linear equations:

   $$
   V^\pi(s) = r(s, \pi(s)) + \beta \cdot \sum_{s'} P(s' \mid s, \pi(s)) \cdot V^\pi(s') \quad \forall s
   $$

2. **Policy Improvement**:  
   Given the current value function $V^\pi$, update the policy by choosing the best action:

   $$
   \pi_{\text{new}}(s) = \arg\min_{a \in A(s)} \left[ r(s, a) + \beta \cdot \sum_{s'} P(s' \mid s, a) \cdot V^\pi(s') \right]
   $$

Start with an initial policy $\pi$ and repeat these two steps until the policy stops changing. At that point, we have found the **optimal policy**.
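
The two steps can be combined into a loop, as in the minimal sketch below. It is not the lab's reference solution. It assumes the cost coefficients used in this lab's evaluation cell (setup 4, 1 per unit produced, 1 per unit of starting inventory) and solves the evaluation system with a small hand-written elimination routine, `solve_linear`, so that no solver is required. `numpy.linalg.solve` could be used instead.

```python
states = [0, 1, 2, 3]
beta = 0.8
actions = {0: [2, 3, 4], 1: [1, 2, 3], 2: [0, 1, 2], 3: [0, 1]}

def cost(s, x):
    # assumed convention: setup 4 if producing, 1 per unit produced, 1 per unit of s
    return (4 if x > 0 else 0) + x + s

def q_value(s, x, V):
    """Cost of taking x in s now and following values V afterwards."""
    return cost(s, x) + beta * sum(0.5 * V[s + x - d] for d in (1, 2))

def solve_linear(A, b):
    """Gauss-Jordan elimination with partial pivoting for a small dense system."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                factor = M[r][c] / M[c][c]
                M[r] = [a - factor * m for a, m in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def evaluate(policy):
    """Policy evaluation: solve (I - beta * P_pi) V = r_pi."""
    A = [[float(i == j) for j in states] for i in states]
    for s in states:
        for d in (1, 2):
            A[s][s + policy[s] - d] -= beta * 0.5
    b = [cost(s, policy[s]) for s in states]
    return dict(zip(states, solve_linear(A, b)))

def improve(V):
    """Policy improvement: pick the action with the smallest q-value in each state."""
    return {s: min(actions[s], key=lambda x: q_value(s, x, V)) for s in states}

policy = {0: 2, 1: 1, 2: 0, 3: 0}
for it in range(1, 21):                 # iteration cap as a safeguard
    V = evaluate(policy)
    new_policy = improve(V)
    print(it, {s: round(v, 2) for s, v in V.items()}, new_policy)
    if new_policy == policy:            # policy stopped changing
        break
    policy = new_policy
```

The loop stops as soon as an improvement step returns the policy it was given, which is the convergence test described above.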

We start from this initial policy:

In [None]:
init_policy = {
    0: 2,   # min([2, 3, 4])
    1: 1,   # min([1, 2, 3])
    2: 0,   # min([0, 1, 2])
    3: 0    # min([0, 1])
}

##### Question:
Why is it reasonable to initialize our policy in this way?

Because it chooses the smallest feasible action in each state, that is, it produces just enough to guarantee that demand is met.

Try implementing two iterations of this algorithm by writing code to:
- Evaluate a given policy (solve a linear system)
- Improve the policy based on the current value function

In [None]:
# first iteration
print("Iteration 1:")

# evaluation
m = gp.Model("Eval")
# turn off output log
m.setParam("OutputFlag", 0)
# variables
v = m.addVars(states, lb=-GRB.INFINITY, name="value")
# linear equations
m.addConstr(v[0] == 6 + beta * (0.5 * v[0] + 0.5 * v[1]))
m.addConstr(v[1] == 6 + beta * (0.5 * v[0] + 0.5 * v[1]))
m.addConstr(v[2] == 2 + beta * (0.5 * v[0] + 0.5 * v[1]))
m.addConstr(v[3] == 3 + beta * (0.5 * v[1] + 0.5 * v[2]))
# dummy objective (we just want feasibility)
m.setObjective(0, GRB.MINIMIZE)
# solve
m.optimize()
V = {s: v[s].X for s in states}
print("\nValue after first evaluation:", {k: round(v, 2) for k, v in V.items()})

# improvement
policy = {}
# state = 0 and a in (2, 3, 4)
pi0 = {2: 6 + beta * (0.5 * V[0] + 0.5 * V[1]),
       3: 7 + beta * (0.5 * V[1] + 0.5 * V[2]),
       4: 8 + beta * (0.5 * V[2] + 0.5 * V[3])}
policy[0] = min(pi0, key=pi0.get)
# state = 1 and a in (1, 2, 3)
pi1 = {1: 6 + beta * (0.5 * V[0] + 0.5 * V[1]),
       2: 7 + beta * (0.5 * V[1] + 0.5 * V[2]),
       3: 8 + beta * (0.5 * V[2] + 0.5 * V[3])}
policy[1] = min(pi1, key=pi1.get)
# state = 2 and a in (0, 1, 2)
pi2 = {0: 2 + beta * (0.5 * V[0] + 0.5 * V[1]),
       1: 7 + beta * (0.5 * V[1] + 0.5 * V[2]),
       2: 8 + beta * (0.5 * V[2] + 0.5 * V[3])}
policy[2] = min(pi2, key=pi2.get)
# state = 3 and a in (0, 1)
pi3 = {0: 3 + beta * (0.5 * V[1] + 0.5 * V[2]),
       1: 8 + beta * (0.5 * V[2] + 0.5 * V[3])}
policy[3] = min(pi3, key=pi3.get)
print("\nPolicy after first improvement:", policy)

Iteration 1:

Value after first evaluation: {0: 30.0, 1: 30.0, 2: 26.0, 3: 25.4}

Policy after first improvement: {0: 4, 1: 3, 2: 0, 3: 0}


In [None]:
# second iteration
print("Iteration 2:")

# TODO: evaluation

# TODO: improvement

##### Question:
Has the policy converged?

#### **3.3 Value Iteration**

Value iteration updates the value function directly by applying the Bellman equation repeatedly, without maintaining an explicit policy.

We start from an initial guess for the value function—**often all zeros**—and repeatedly update each state's value using:

$$
V_{k+1}(s) = \min_{a \in A(s)} \left[ r(s, a) + \beta \sum_{s'} P(s' \mid s, a) \cdot V_k(s') \right]
$$

This recursive process continues until the value function converges, that is, until the maximum change in value across all states is below a predefined small value $\epsilon$.

Once the value function has converged, the optimal policy can be recovered by selecting, for each state, the action that minimizes the right-hand side of the Bellman equation.

Try implementing value iteration starting from $V(s) = 0$ with 100 iterations:

In [None]:
# init
V = {0:0, 1:0, 2:0, 3:0}

In [None]:
def valueIterationUpdate(V, beta):
    V_new = {}
    # TODO: Update values
    # state = 0 and a in (2, 3, 4)

    # state = 1 and a in (1, 2, 3)

    # state = 2 and a in (0, 1, 2)

    # state = 3 and a in (0, 1)
    return V_new

In [None]:
# iterations
for it in range(100):
    V = valueIterationUpdate(V, beta)
    print(f"Value after iteration {it+1}:", {k: round(v, 2) for k, v in V.items()})
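
The fixed 100-iteration loop above can also be replaced by the $\epsilon$ stopping rule described earlier, followed by policy recovery. The sketch below is one possible implementation, not the lab's reference solution. It assumes the cost coefficients from the policy-iteration cell, and the names `q_value`, `value_iteration`, `eps`, and `max_iter` are introduced here for illustration.

```python
states = [0, 1, 2, 3]
beta = 0.8
actions = {0: [2, 3, 4], 1: [1, 2, 3], 2: [0, 1, 2], 3: [0, 1]}

def q_value(s, x, V):
    # assumed cost convention: setup 4 if producing, 1 per unit produced, 1 per unit of s
    immediate = (4 if x > 0 else 0) + x + s
    return immediate + beta * sum(0.5 * V[s + x - d] for d in (1, 2))

def value_iteration(eps=1e-8, max_iter=10_000):
    V = {s: 0.0 for s in states}        # initial guess: all zeros
    for k in range(1, max_iter + 1):
        V_new = {s: min(q_value(s, x, V) for x in actions[s]) for s in states}
        change = max(abs(V_new[s] - V[s]) for s in states)
        V = V_new
        if change < eps:                # largest update is below the tolerance
            break
    # recover a policy by minimizing the right-hand side of the Bellman equation
    policy = {s: min(actions[s], key=lambda x: q_value(s, x, V)) for s in states}
    return V, policy, k

V_star, policy_star, n_iter = value_iteration()
print(n_iter, {s: round(v, 2) for s, v in V_star.items()}, policy_star)
```

Because each update shrinks the error by at least a factor of $\beta$, a smaller `eps` costs only a few more iterations.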