# Computing the state value of every state (iteratively solving the Bellman equation for a numerical solution)

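## Problem description
The figure below corresponds to the example in Section 2.5 of Chapter 2 (page 22) of 《强化学习的数学原理》 (*Mathematical Foundations of Reinforcement Learning*). The goal is to compute the state value of every state s under a given policy.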

<img src="./picture1_3.png" alt="Grid world for the Section 2.5 example" width="40%">

## Building the environment from the figure above

In [2]:
import numpy as np
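S = ["s1", "s2", "s3", "s4"]  # state set
A = ["a1", "a2", "a3", "a4", "a5"]  # action set: up, right, down, left, stay (order implied by the transitions in P)
# State transition function: "state-action-next_state" -> probability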
P = {   "s1-a1-s1": 1.0,
        "s1-a2-s2": 1.0,
        "s1-a3-s3": 1.0,
        "s1-a4-s1": 1.0,
        "s1-a5-s1": 1.0,

        "s2-a1-s2": 1.0,
        "s2-a2-s2": 1.0,
        "s2-a3-s4": 1.0,
        "s2-a4-s1": 1.0,
        "s2-a5-s2": 1.0,

        "s3-a1-s1": 1.0,
        "s3-a2-s4": 1.0,
        "s3-a3-s3": 1.0,
        "s3-a4-s3": 1.0,
        "s3-a5-s3": 1.0,

        "s4-a1-s2": 1.0,
        "s4-a2-s4": 1.0,
        "s4-a3-s4": 1.0,
        "s4-a4-s3": 1.0,
        "s4-a5-s4": 1.0}
# Reward function: "state-action" -> immediate reward
R = {"s1-a1": 0,
    "s1-a2": -1,
    "s1-a3": 0,
    "s1-a4": 0,
    "s1-a5": 0,
    "s2-a1": -1,
    "s2-a2": -1,
    "s2-a3": 1,
    "s2-a4": 0,
    "s2-a5": -1,
    "s3-a1": 0,
    "s3-a2": 1,
    "s3-a3": 0,
    "s3-a4": 0,
    "s3-a5": 0,
    "s4-a1": -1,
    "s4-a2": 0,
    "s4-a3": 0,
    "s4-a4": 0,
    "s4-a5": 1}
gamma = 0.9  # discount factor; the evaluations below use 0.9
MDP = (S, A, P, R, gamma)

## Policy 1 (Example 1 on page 22)

In [3]:
Pi_1 = {"s1-a3": 1,
        "s2-a3": 1,
        "s3-a2": 1,
        "s4-a5": 1}   # a deterministic policy: one action per state with probability 1

### Solving for the state values under Policy 1 by iteration

In [4]:
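# Row i holds the transition probabilities out of the i-th state under Policy 1
P_from_mdp_to_mrp=[
    [0,0,1,0],  # s1 -> s3
    [0,0,0,1],  # s2 -> s4
    [0,0,0,1],  # s3 -> s4
    [0,0,0,1]   # s4 -> s4
]
# Convert to a 2-D NumPy array
P1 = np.array(P_from_mdp_to_mrp)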

In [5]:
# Immediate rewards under Policy 1
# reward of s1: 0
# reward of s2: 1
# reward of s3: 1
# reward of s4: 1
R1 = np.array([0, 1, 1, 1])
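
The hand-written `P1` and `R1` can also be derived mechanically from the MDP dictionaries and the policy. The following is a minimal sketch, assuming the same `"state-action-next_state"` key format used in the environment cell. The helper name `mdp_to_mrp` is hypothetical and not part of the original notebook, and only the dictionary entries that Policy 1 actually uses are repeated so that the snippet runs on its own.

```python
import numpy as np

S = ["s1", "s2", "s3", "s4"]

# Only the transitions and rewards that Policy 1 needs (copied from the MDP).
P = {"s1-a3-s3": 1.0, "s2-a3-s4": 1.0, "s3-a2-s4": 1.0, "s4-a5-s4": 1.0}
R = {"s1-a3": 0, "s2-a3": 1, "s3-a2": 1, "s4-a5": 1}
Pi_1 = {"s1-a3": 1, "s2-a3": 1, "s3-a2": 1, "s4-a5": 1}


def mdp_to_mrp(S, P, R, Pi):
    """Hypothetical helper: collapse an MDP and a policy into (P_pi, r_pi)."""
    index = {s: i for i, s in enumerate(S)}
    P_pi = np.zeros((len(S), len(S)))
    r_pi = np.zeros(len(S))
    for state_action, action_prob in Pi.items():
        s, a = state_action.split("-")
        # Expected immediate reward: sum over actions of pi(a|s) * r(s, a)
        r_pi[index[s]] += action_prob * R[state_action]
        # Transition probability: sum over actions of pi(a|s) * p(s'|s, a)
        for transition, p in P.items():
            s_from, a_taken, s_to = transition.split("-")
            if s_from == s and a_taken == a:
                P_pi[index[s], index[s_to]] += action_prob * p
    return P_pi, r_pi


P_pi1, r_pi1 = mdp_to_mrp(S, P, R, Pi_1)
print(P_pi1)
print(r_pi1)
```

For Policy 1 this should reproduce the matrix `P1` and the vector `R1` written by hand. A stochastic policy is handled by the same weighted sums.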

In [6]:
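# Numerical (iterative) solution; see page 27 of 《强化学习的数学原理》
def policy_evaluation(P, R, gamma, theta):
    """
    Iteratively compute the state-value function V(s) under a fixed policy.

    Parameters:
    P : ndarray, shape (n_states, n_states)
        State transition matrix under the policy
    R : ndarray, shape (n_states,)
        Expected immediate reward of each state under the policy
    gamma : float
        Discount factor
    theta : float
        Convergence threshold
    Returns:
    V : ndarray, shape (n_states,)
        State value of each state
    """
    n_states = P.shape[0]
    V = np.zeros(n_states)  # initial guess: all zeros
    delta = float('inf')

    # Sweep until the largest change in any state value is at most theta
    while delta > theta:
        delta = 0
        V_new = np.copy(V)
        for s in range(n_states):
            # Bellman update: immediate reward + discounted expected next value
            V_new[s] = R[s] + gamma * np.sum(P[s] * V)
            delta = max(delta, abs(V_new[s] - V[s]))
        V = V_new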

    return V
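
In matrix-vector form, each sweep of the loop in `policy_evaluation` applies the standard Bellman update for a fixed policy. The symbols are shorthand introduced here rather than names from the code: $v_k$ is the vector `V` after $k$ sweeps, $r_\pi$ is `R`, and $P_\pi$ is `P`.

$$
v_{k+1}(s) = r_\pi(s) + \gamma \sum_{s'} P_\pi(s, s')\, v_k(s'),
\qquad \text{i.e.} \qquad
v_{k+1} = r_\pi + \gamma P_\pi v_k, \quad v_0 = 0 .
$$

The loop stops once $\max_s |v_{k+1}(s) - v_k(s)| \le \theta$. For $\gamma < 1$ the iterates converge to the solution of the Bellman equation $v_\pi = r_\pi + \gamma P_\pi v_\pi$.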

In [7]:
V_pi1 = policy_evaluation(P1, R1, 0.9, theta=1e-7)
print(V_pi1)

[8.9999991 9.9999991 9.9999991 9.9999991]
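
As a cross-check on the iterative output, the same Bellman equation can be solved directly as a linear system, $(I - \gamma P_\pi) v_\pi = r_\pi$. The sketch below is an addition for verification rather than part of the original notebook; it repeats `P1` and `R1` so that it runs standalone and uses `np.linalg.solve`.

```python
import numpy as np

P1 = np.array([[0, 0, 1, 0],
               [0, 0, 0, 1],
               [0, 0, 0, 1],
               [0, 0, 0, 1]], dtype=float)
R1 = np.array([0, 1, 1, 1], dtype=float)
gamma = 0.9

# Solve (I - gamma * P) v = r for v instead of iterating.
V_closed = np.linalg.solve(np.eye(len(R1)) - gamma * P1, R1)
print(V_closed)
```

The exact values are 9 for s1 and 10 for the other three states, which the iterative result matches to within about 1e-6 at `theta=1e-7`.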


## Policy 2 (Example 2 on page 24)

In [8]:
Pi_2 = {"s1-a3": 0.5,
        "s1-a2": 0.5,
        "s2-a3": 1,
        "s3-a2": 1,
        "s4-a5": 1}   # a stochastic policy: s1 picks a2 or a3 with probability 0.5 each

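A quick sanity check for any policy written in this dictionary form is that the action probabilities of each state sum to 1. A minimal sketch follows; the helper name `check_policy` is hypothetical and not part of the original notebook.

```python
from collections import defaultdict
import math

Pi_2 = {"s1-a3": 0.5, "s1-a2": 0.5, "s2-a3": 1, "s3-a2": 1, "s4-a5": 1}


def check_policy(Pi, states):
    """Hypothetical helper: True if every state's action probabilities sum to 1."""
    totals = defaultdict(float)
    for state_action, prob in Pi.items():
        state, _action = state_action.split("-")
        totals[state] += prob
    # A state with no entry has total 0.0 and therefore fails the check.
    return all(math.isclose(totals[s], 1.0) for s in states)


print(check_policy(Pi_2, ["s1", "s2", "s3", "s4"]))
```

For Policy 2 the check passes, since s1 splits its probability evenly between two actions and every other state has a single action with probability 1.
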
### Solving for the state values under Policy 2 by iteration

In [9]:
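# Row i holds the transition probabilities out of the i-th state under Policy 2
P2_from_mdp_to_mrp=[
    [0,0.5,0.5,0],  # s1 -> s2 or s3, each with probability 0.5
    [0,0,0,1],      # s2 -> s4
    [0,0,0,1],      # s3 -> s4
    [0,0,0,1]       # s4 -> s4
]
# Convert to a 2-D NumPy array
P2 = np.array(P2_from_mdp_to_mrp)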

In [10]:
# Expected immediate rewards under Policy 2
# reward of s1: 0.5*(-1) + 0.5*0 = -0.5
# reward of s2: 1
# reward of s3: 1
# reward of s4: 1
R2 = np.array([-0.5, 1, 1, 1])

In [11]:
V_pi2 = policy_evaluation(P2, R2, 0.9, theta=1e-10)
print(V_pi2)

[ 8.5 10.  10.  10. ]
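
The per-state loop inside `policy_evaluation` can equivalently be written as one matrix-vector product per sweep. The variant below is a sketch of that idea, not the original implementation; the name `policy_evaluation_vectorized` and the returned sweep count are additions for illustration.

```python
import numpy as np


def policy_evaluation_vectorized(P, R, gamma, theta):
    """Same update as the loop version, written as V <- R + gamma * P @ V."""
    P = np.asarray(P, dtype=float)
    R = np.asarray(R, dtype=float)
    V = np.zeros(len(R))
    sweeps = 0
    while True:
        V_new = R + gamma * P @ V          # one Bellman sweep over all states
        delta = np.max(np.abs(V_new - V))  # largest change in this sweep
        V = V_new
        sweeps += 1
        if delta <= theta:
            return V, sweeps


P2 = [[0, 0.5, 0.5, 0],
      [0, 0, 0, 1],
      [0, 0, 0, 1],
      [0, 0, 0, 1]]
R2 = [-0.5, 1, 1, 1]

V, sweeps = policy_evaluation_vectorized(P2, R2, 0.9, theta=1e-10)
print(V, sweeps)
```

With the same `theta`, this should agree with the `[8.5, 10, 10, 10]` result for Policy 2 up to the convergence tolerance.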
