# Computing the state value of every state (closed-form solution of the Bellman equation via matrix operations)

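## Problem description
The figure below corresponds to the example in Chapter 2, Section 2.5 (page 22) of 《强化学习的数学原理》 (*Mathematical Foundations of Reinforcement Learning*). The goal is to compute the state value of every state s under a given policy.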

<img src="./picture1_3.png" alt="Figure for the Section 2.5 example" width="40%">

## Building the environment from the figure above

In [12]:
import numpy as np
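S = ["s1", "s2", "s3", "s4"]  # set of states
A = ["a1", "a2", "a3", "a4", "a5"]  # set of actions: up, right, down, left, stay (the order implied by P below)
# State transition function: key "s-a-s'" maps to the probability p(s' | s, a)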
P = {   "s1-a1-s1": 1.0,
        "s1-a2-s2": 1.0,
        "s1-a3-s3": 1.0,
        "s1-a4-s1": 1.0,
        "s1-a5-s1": 1.0,

        "s2-a1-s2": 1.0,
        "s2-a2-s2": 1.0,
        "s2-a3-s4": 1.0,
        "s2-a4-s1": 1.0,
        "s2-a5-s2": 1.0,

        "s3-a1-s1": 1.0,
        "s3-a2-s4": 1.0,
        "s3-a3-s3": 1.0,
        "s3-a4-s3": 1.0,
        "s3-a5-s3": 1.0,

        "s4-a1-s2": 1.0,
        "s4-a2-s4": 1.0,
        "s4-a3-s4": 1.0,
        "s4-a4-s3": 1.0,
        "s4-a5-s4": 1.0}
# Reward function: key "s-a" maps to the immediate reward r(s, a)
R = {"s1-a1": 0,
    "s1-a2": -1,
    "s1-a3": 0,
    "s1-a4": 0,
    "s1-a5": 0,
    "s2-a1": -1,
    "s2-a2": -1,
    "s2-a3": 1,
    "s2-a4": 0,
    "s2-a5": -1,
    "s3-a1": 0,
    "s3-a2": 1,
    "s3-a3": 0,
    "s3-a4": 0,
    "s3-a5": 0,
    "s4-a1": -1,
    "s4-a2": 0,
    "s4-a3": 0,
    "s4-a4": 0,
    "s4-a5": 1}
gamma = 0.9  # discount factor; matches the gamma=0.9 passed to compute() below
MDP = (S, A, P, R, gamma)

## Policy 1 (Example 1, page 22)

In [13]:
Pi_1 = {"s1-a3": 1,
        "s2-a3": 1,
        "s3-a2": 1,
        "s4-a5": 1,}   # this is a deterministic policy

### Closed-form solution for the state values under Policy 1
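
For reference, the closed-form solution used in this section follows from the matrix form of the Bellman equation. The symbols are assumptions that follow the usual textbook notation rather than names used in the notebook: $v_\pi$ is the vector of state values, $r_\pi$ the vector of expected immediate rewards, and $P_\pi$ the state transition matrix under policy $\pi$.

$$
v_\pi = r_\pi + \gamma P_\pi v_\pi
\;\Longrightarrow\;
(I - \gamma P_\pi)\, v_\pi = r_\pi
\;\Longrightarrow\;
v_\pi = (I - \gamma P_\pi)^{-1} r_\pi
$$

$$
[P_\pi]_{ij} = \sum_a \pi(a \mid s_i)\, p(s_j \mid s_i, a),
\qquad
[r_\pi]_i = \sum_a \pi(a \mid s_i)\, r(s_i, a)
$$

Because every row of $P_\pi$ sums to 1, the matrix $I - \gamma P_\pi$ is invertible whenever $0 \le \gamma < 1$.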

In [14]:
P_from_mdp_to_mrp=[
    [0,0,1,0],
    [0,0,0,1],
    [0,0,0,1],
    [0,0,0,1]
]
# Convert to a 2-D NumPy array
P_from_mdp_to_mrp = np.array(P_from_mdp_to_mrp)

In [15]:
# Rewards under Policy 1 (immediate reward of the action chosen in each state)
# reward in s1: 0
# reward in s2: 1
# reward in s3: 1
# reward in s4: 1
R_from_mdp_to_mrp = [0,1,1,1]
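
The transition matrix and reward vector above are written out by hand. As a minimal sketch, they could also be derived from the dictionaries, assuming the key conventions of this notebook (`"s-a-s'"` for transitions, `"s-a"` for rewards and for the policy). The helper `mdp_to_mrp` and the `*_demo` names are hypothetical, and the demo only lists the state-action pairs that Policy 1 actually uses.

```python
import numpy as np

def mdp_to_mrp(states, P, R, Pi):
    """Collapse an MDP and a policy into the MRP quantities (P_pi, r_pi).

    Assumes keys of the form "s-a-s'" in P and "s-a" in R and Pi.
    Keys that are missing from R or Pi are treated as 0.
    """
    index = {s: i for i, s in enumerate(states)}
    n = len(states)
    P_pi = np.zeros((n, n))
    r_pi = np.zeros(n)
    # Expected immediate reward: r_pi(s) = sum_a pi(a|s) * r(s, a)
    for key, prob in Pi.items():
        s, _a = key.split("-")
        r_pi[index[s]] += prob * R.get(key, 0.0)
    # Transition matrix: P_pi[s, s'] = sum_a pi(a|s) * p(s'|s, a)
    for key, p in P.items():
        s, a, s_next = key.split("-")
        P_pi[index[s], index[s_next]] += Pi.get(f"{s}-{a}", 0.0) * p
    return P_pi, r_pi

# Demo data: only the entries of P and R that Policy 1 touches.
states_demo = ["s1", "s2", "s3", "s4"]
P_demo = {"s1-a3-s3": 1.0, "s2-a3-s4": 1.0, "s3-a2-s4": 1.0, "s4-a5-s4": 1.0}
R_demo = {"s1-a3": 0, "s2-a3": 1, "s3-a2": 1, "s4-a5": 1}
Pi_demo = {"s1-a3": 1, "s2-a3": 1, "s3-a2": 1, "s4-a5": 1}

P_pi, r_pi = mdp_to_mrp(states_demo, P_demo, R_demo, Pi_demo)
print(P_pi)
print(r_pi)
```

With the full `S`, `P`, `R` and `Pi_1` from the cells above, the same call should reproduce the hand-written `P_from_mdp_to_mrp` and `R_from_mdp_to_mrp`. The helper also accepts policies that assign probabilities below 1 to several actions in one state.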

In [16]:
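def compute(P, rewards, gamma, states_num):
    ''' Compute the closed-form solution from the matrix form of the Bellman equation; states_num is the number of states in the MRP '''
    rewards = np.array(rewards).reshape((-1, 1))  # reshape rewards into a column vector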
    value = np.dot(np.linalg.inv(np.eye(states_num, states_num) - gamma * P),
                   rewards)
    return value
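
`compute` forms the inverse of `I - gamma * P` explicitly. A common alternative, sketched here as an assumption rather than as the notebook's method, is to solve the linear system `(I - gamma * P) v = r` directly with `np.linalg.solve`, which avoids building the inverse. The function name `compute_with_solve` and the `*_example` variables are hypothetical; the example data repeats the Policy 1 matrix and rewards defined above.

```python
import numpy as np

def compute_with_solve(P, rewards, gamma):
    """Solve (I - gamma * P) v = r for v without forming the inverse."""
    P = np.asarray(P, dtype=float)
    r = np.asarray(rewards, dtype=float).reshape(-1, 1)  # column vector
    A = np.eye(P.shape[0]) - gamma * P
    return np.linalg.solve(A, r)

# Policy 1 transition matrix and rewards, copied from the cells above.
P_example = [[0, 0, 1, 0],
             [0, 0, 0, 1],
             [0, 0, 0, 1],
             [0, 0, 0, 1]]
r_example = [0, 1, 1, 1]

v_example = compute_with_solve(P_example, r_example, gamma=0.9)
print(v_example)
```

For a 4-state problem the two approaches give the same values up to floating-point rounding. The difference mainly matters for larger or poorly conditioned systems.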

In [17]:
V_pi1 = compute(P=P_from_mdp_to_mrp, rewards=R_from_mdp_to_mrp, gamma=0.9, states_num=4)
print(V_pi1)  # state values of s1 to s4

[[ 9.]
 [10.]
 [10.]
 [10.]]
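
The printed values can be checked by hand. With $\gamma = 0.9$ and the Policy 1 transitions and rewards above, the Bellman equation for each state reads as follows (solving for $s_4$ first, since it only depends on itself):

$$
\begin{aligned}
v(s_4) &= 1 + 0.9\, v(s_4) \;\Rightarrow\; v(s_4) = \frac{1}{1 - 0.9} = 10 \\
v(s_3) &= 1 + 0.9\, v(s_4) = 1 + 9 = 10 \\
v(s_2) &= 1 + 0.9\, v(s_4) = 1 + 9 = 10 \\
v(s_1) &= 0 + 0.9\, v(s_3) = 9
\end{aligned}
$$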


## Policy 2 (Example 2, page 24)

In [18]:
Pi_2 = {"s1-a3": 0.5,
        "s1-a2": 0.5,
        "s2-a3": 1,
        "s3-a2": 1,
        "s4-a5": 1}   # this is a stochastic policy

### Closed-form solution for the state values under Policy 2

In [19]:
P2_from_mdp_to_mrp=[
    [0,0.5,0.5,0],
    [0,0,0,1],
    [0,0,0,1],
    [0,0,0,1]
]
# Convert to a 2-D NumPy array
P2_from_mdp_to_mrp = np.array(P2_from_mdp_to_mrp)

In [20]:
# Rewards under Policy 2 (expected immediate reward in each state)
# reward in s1: 0.5*(-1) + 0.5*0 = -0.5
# reward in s2: 1
# reward in s3: 1
# reward in s4: 1
R2_from_mdp_to_mrp = [-0.5,1,1,1]

In [21]:
V_pi2 = compute(P=P2_from_mdp_to_mrp, rewards=R2_from_mdp_to_mrp, gamma=0.9, states_num=4)
print(V_pi2)  # state values of s1 to s4

[[ 8.5]
 [10. ]
 [10. ]
 [10. ]]
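
As a cross-check of the closed-form result, the same values can be approximated by repeatedly applying the Bellman update `v <- r + gamma * P v` until it stops changing. This is a different technique (iterative policy evaluation), not the closed-form method of this notebook, and the sketch below is only a sanity check. The function name `evaluate_iteratively`, the tolerance and the iteration cap are assumptions chosen for illustration.

```python
import numpy as np

def evaluate_iteratively(P, rewards, gamma, tol=1e-10, max_iter=10_000):
    """Approximate v = r + gamma * P v by fixed-point iteration."""
    P = np.asarray(P, dtype=float)
    r = np.asarray(rewards, dtype=float)
    v = np.zeros_like(r)  # arbitrary starting guess
    for _ in range(max_iter):
        v_next = r + gamma * (P @ v)
        # Stop once the largest per-state change is below the tolerance.
        if np.max(np.abs(v_next - v)) < tol:
            return v_next
        v = v_next
    return v

# Policy 2 transition matrix and rewards, copied from the cells above.
P2_example = [[0, 0.5, 0.5, 0],
              [0, 0, 0, 1],
              [0, 0, 0, 1],
              [0, 0, 0, 1]]
r2_example = [-0.5, 1, 1, 1]

v2_example = evaluate_iteratively(P2_example, r2_example, gamma=0.9)
print(np.round(v2_example, 4))
```

The iteration converges for any `0 <= gamma < 1` because each update shrinks the error by a factor of at most `gamma`. The rounded result should agree with the closed-form values printed above.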
