# Introduction to Markov Decision Processes (MDPs)


## Scenario Overview & Model

**States:**  
$S$: Set of states (demand levels), indexed by $s$.

**Actions:**  
$A$: Set of possible actions (prices) that can be applied, indexed by $a$.

**Transition Probabilities:**  
$P_{s,s'}^a$: Probability of transitioning from state $s$ to state $s'$ under action $a$.

**Rewards:**  
$R_{s,a}$: Expected reward (revenue) received when action $a$ is taken in state $s$, regardless of which state follows.

**Discount Factor:**  
$\gamma$: The discount factor, where $0 \leq \gamma \leq 1$. Because the horizon is finite, $\gamma = 1$ (no discounting) is allowed.

**Decision Variables:**  
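$\pi_t(s)$: Policy function, indicating the action to be taken in state $s$ at period $t$. With a finite horizon, the best action can depend on how many periods remain, so the policy is indexed by time.

**Objective:**  
Maximize the expected total discounted reward over the time horizon $N$:  
$\max_\pi \mathbb{E}\left[\sum_{t=1}^N \gamma^{t-1} R_{s_t,\pi_t(s_t)}\right]$, where $s_t$ is the state in period $t$ and $s_{t+1}$ is drawn with probability $P_{s_t,s_{t+1}}^{\pi_t(s_t)}$.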
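
One standard way to solve a finite-horizon problem of this kind is backward induction. As a sketch, write $V_t(s)$ for the largest expected discounted reward obtainable from period $t$ onward when the system is in state $s$, and assume a terminal value of zero after the last period (no salvage value). The values then satisfy the recursion

$$
V_{N+1}(s) = 0, \qquad
V_t(s) = \max_{a \in A} \left[ R_{s,a} + \gamma \sum_{s' \in S} P_{s,s'}^a \, V_{t+1}(s') \right], \quad t = N, N-1, \dots, 1.
$$

An optimal action in state $s$ at period $t$ is any maximizer of the bracketed term, and the optimal objective value for an initial state $s_1$ is $V_1(s_1)$.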

**Constraints:**  
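1. **Transition Probability Constraints:** For each $s \in S$ and $a \in A$, the probabilities over next states are non-negative and sum to one:  
$P_{s,s'}^a \geq 0$ for all $s' \in S$, and $\sum_{s' \in S} P_{s,s'}^a = 1$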

2. **Policy Constraints:** The policy must select a valid action in every state: $\pi(s) \in A$ for each $s \in S$ (and for each period, when the policy depends on time).
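
Both constraints can be checked numerically before solving. The sketch below assumes transition probabilities are stored as an array indexed `P[s, a, s']` and that actions are encoded as integers `0..A-1`. The helper names `is_valid_transition_tensor` and `is_valid_policy` are hypothetical, and the example numbers are made up for illustration.

```python
import numpy as np

def is_valid_transition_tensor(P, tol=1e-9):
    """Return True if every P[s, a, :] is a probability distribution."""
    P = np.asarray(P, dtype=float)
    nonnegative = np.all(P >= 0)
    rows_sum_to_one = np.allclose(P.sum(axis=-1), 1.0, atol=tol)
    return bool(nonnegative and rows_sum_to_one)

def is_valid_policy(policy, num_actions):
    """Return True if every policy entry is an integer action index in 0..num_actions-1."""
    policy = np.asarray(policy)
    is_integer = np.issubdtype(policy.dtype, np.integer)
    return bool(is_integer and np.all((policy >= 0) & (policy < num_actions)))

# Illustrative data: 2 states, 2 actions (values are assumptions, not from any dataset)
P_example = np.array([[[0.9, 0.1], [0.5, 0.5]],
                      [[0.2, 0.8], [0.0, 1.0]]])

print(is_valid_transition_tensor(P_example))        # True: each row sums to 1
print(is_valid_transition_tensor(P_example * 2.0))  # False: each row sums to 2
print(is_valid_policy([0, 1], num_actions=2))       # True: both indices are in range
```

A check like this is cheap to run and catches unnormalized probabilities, which would otherwise inflate the computed values without raising an error.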


```python
import numpy as np

# Parameters
N = 12  # Time periods
S = 5   # Number of states (demand levels)
A = 3   # Number of actions (price levels)
gamma = 1  # Discount factor (no discounting; acceptable because the horizon is finite)

# Transition probabilities: P[s, a, s'] = probability of moving from s to s' under action a.
# Random placeholder values (no seed is set, so results vary between runs); should be based on data.
P = np.random.rand(S, A, S)
P /= P.sum(axis=2, keepdims=True)  # Normalize so that each P[s, a, :] sums to 1

# Reward function: R[s, a] = expected revenue for action a in state s
R = np.random.rand(S, A)  # Placeholder rewards, should be based on data

# Value function and policy initialization.
# V[t, s] is the best expected reward from period t onward; V[N, :] = 0 is the terminal condition.
V = np.zeros((N + 1, S))
policy = np.zeros((N, S), dtype=int)

# Dynamic programming (backward induction) to solve the MDP
for t in range(N - 1, -1, -1):
    for s in range(S):
        Q = np.zeros(A)
        for a in range(A):
            # Immediate reward plus discounted expected value of the next state
            Q[a] = R[s, a] + gamma * np.sum(P[s, a, :] * V[t + 1, :])
        V[t, s] = np.max(Q)
        policy[t, s] = np.argmax(Q)

# Rows are time periods, columns are states
print("Optimal Policy (Price Levels per State and Time):")
print(policy)
```


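Example output from one run (`P` and `R` are random, so the policy differs between runs; rows are time periods, columns are states):

```
Optimal Policy (Price Levels per State and Time):
[[1 0 1 0 0]
 [1 0 1 0 0]
 [1 0 1 0 0]
 [1 0 1 0 0]
 [1 0 1 0 0]
 [1 0 1 0 0]
 [1 0 1 0 0]
 [1 0 1 0 0]
 [1 0 1 0 0]
 [1 0 1 0 0]
 [1 0 1 0 0]
 [0 0 1 0 2]]
```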
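
The backward recursion can also be written without the inner loops over states and actions, and the resulting policy can be sanity-checked by evaluating it directly. The following is a minimal sketch under stated assumptions: it uses the array layout `P[s, a, s']` and `R[s, a]`, random placeholder data with an arbitrarily chosen seed, and the hypothetical function names `backward_induction` and `evaluate_policy`.

```python
import numpy as np

def backward_induction(P, R, N, gamma=1.0):
    """Finite-horizon backward induction.

    P[s, a, s'] are transition probabilities and R[s, a] expected rewards.
    Returns V with shape (N + 1, S) and policy with shape (N, S).
    """
    S, A = R.shape
    V = np.zeros((N + 1, S))            # V[N, :] = 0 is the terminal condition
    policy = np.zeros((N, S), dtype=int)
    for t in range(N - 1, -1, -1):
        Q = R + gamma * (P @ V[t + 1])  # Q[s, a]; the product sums over s'
        V[t] = Q.max(axis=1)
        policy[t] = Q.argmax(axis=1)
    return V, policy

def evaluate_policy(P, R, policy, gamma=1.0):
    """Expected discounted reward of following `policy`, per period and state."""
    N, S = policy.shape
    states = np.arange(S)
    V_pi = np.zeros((N + 1, S))
    for t in range(N - 1, -1, -1):
        a = policy[t]                   # action chosen in each state at period t
        V_pi[t] = R[states, a] + gamma * (P[states, a] @ V_pi[t + 1])
    return V_pi

rng = np.random.default_rng(0)          # arbitrary seed, only for repeatability
S, A, N = 5, 3, 12
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)       # make each P[s, a, :] a distribution
R = rng.random((S, A))

V, policy = backward_induction(P, R, N)
V_pi = evaluate_policy(P, R, policy)
print(np.allclose(V, V_pi))             # the greedy policy attains the optimal values
```

Evaluating the returned policy and comparing it with `V` is a quick consistency check. Any other policy, such as always choosing action 0, should yield values no larger than `V` in every state and period.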
