# Markov
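* Implement the environment shown in the figure below, including its state-transition (dynamics) function.
* Implement an agent that follows a random policy, and use simulation to estimate the value function of each state as the empirical average of the sampled returns. Check the simulated results against the values computed in the courseware. (Simulate with γ = 0.5 and γ = 1.)
* There are many ways to find an optimal policy in reinforcement learning; exhaustive search is the naive one. Because every finite MDP has an optimal deterministic policy, it suffices to enumerate all deterministic policies and compare them. In this MDP only four states require a decision and each of them has two available actions, so there are 2^4 = 16 deterministic policies. Write an algorithm that enumerates all of them and outputs the optimal one. (Consider γ = 0.5 and γ = 1.)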

![](http://ww2.sinaimg.cn/large/006tNc79ly1g4xyd7ugfxj30ey0e6755.jpg)

In [1]:
import numpy as np
from code2_markov import Env, Agent

env = Env()
env.step('s4', 'review')

('s5', 10, True)

In [2]:
agent = Agent()
agent.random_policy('s1')

'phone'

In [3]:
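def simulate_eval(s, gamma, max_step=100, N=10000):
    """Monte Carlo estimate of V(s): the average discounted return over N episodes.

    Uses the module-level `env` and `agent`. An episode that has not
    terminated after max_step steps is truncated.
    """
    gs = []
    for _ in range(N):
        g = 0            # discounted return of the current episode
        curr_s = s
        curr_gamma = 1   # gamma ** t, the discount applied at step t
        for step in range(max_step):
            a = agent.policy(curr_s)
            curr_s, r, term = env.step(curr_s, a)
            g += curr_gamma * r
            curr_gamma *= gamma
            if term:
                break
        gs.append(g)
    return np.average(gs)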

In [5]:
for s in ['s1', 's2', 's3', 's4']:
    print('V({}): {}'.format(s, simulate_eval(s, 1)))

V(s1): -4.6086
V(s2): -3.5991
V(s3): 0.4792
V(s4): 2.8575


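These estimates can be compared with the courseware results. Increasing max_step and N reduces the error: a larger N lowers the sampling variance of the average, and a larger max_step lowers the bias caused by truncating episodes.
> The courseware results were obtained by converting the MDP into an MRP under the given policy and then solving for the values in matrix form.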
![](http://ww3.sinaimg.cn/large/006tNc79ly1g4xz0d5cnij30f80e6t9k.jpg)
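
The matrix approach mentioned in the note can be sketched as follows. Under a fixed policy π, the MDP reduces to an MRP with transition matrix P_π[s, s'] = Σ_a π(a|s) P(s'|s, a) and reward vector R_π[s] = Σ_a π(a|s) R(s, a), and the values satisfy v = R_π + γ P_π v, so v = (I − γ P_π)^(-1) R_π. The function name `solve_mrp` and the 3-state MRP below are hypothetical and chosen only for illustration; they are not the courseware MDP.

```python
import numpy as np

def solve_mrp(P, R, gamma):
    """Solve the Bellman equation v = R + gamma * P v for the non-terminal states."""
    P = np.asarray(P, dtype=float)
    R = np.asarray(R, dtype=float)
    n = len(R)
    # (I - gamma * P) v = R is a linear system; solve it directly.
    return np.linalg.solve(np.eye(n) - gamma * P, R)

# Hypothetical 3-state MRP. A row of P may sum to less than 1: the missing
# probability mass leads to a terminal state whose value is 0.
P = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.0, 0.0, 0.0]]
R = [-1.0, -2.0, 10.0]

v = solve_mrp(P, R, gamma=0.5)
print(v)
```

Treating the terminal state implicitly keeps I − γ P_π invertible in this example even for γ = 1, as long as every state eventually reaches the terminal state.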

In [6]:
for s in ['s1', 's2', 's3', 's4']:
    print('V({}): {}'.format(s, simulate_eval(s, 0.5)))

V(s1): -1.3025859339088348
V(s2): -1.9081474046711446
V(s3): -0.3386011561062558
V(s4): 2.6385232126889866
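
Each estimate above is a sample mean over N simulated returns, so its sampling error can be gauged with the standard error of the mean, which shrinks roughly as 1/√N. The sketch below is a minimal illustration: `mean_and_stderr` is a hypothetical helper, and the returns are synthetic normal samples standing in for the list `gs` collected inside `simulate_eval`, not values produced by the environment.

```python
import numpy as np

def mean_and_stderr(returns):
    """Return the sample mean and the standard error of the mean."""
    returns = np.asarray(returns, dtype=float)
    stderr = returns.std(ddof=1) / np.sqrt(len(returns))
    return returns.mean(), stderr

rng = np.random.default_rng(0)
stderrs = {}
for n in (100, 10000):
    # Synthetic stand-in for n sampled returns (illustrative numbers only).
    synthetic_returns = rng.normal(loc=0.0, scale=2.0, size=n)
    mean, stderrs[n] = mean_and_stderr(synthetic_returns)
    print('N={}: mean={:.3f}, stderr={:.3f}'.format(n, mean, stderrs[n]))
```

Multiplying N by 100 divides the standard error by about 10, which is why a larger N gives visibly more stable estimates.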


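Next, we search for the optimal policy by enumerating all deterministic policies:

- First, determine which deterministic policies exist.
- Then, decide which of them is best.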

In [7]:
deterministic_policies = [
    dict(s1=a1, s2=a2, s3=a3, s4=a4)
    for a1 in agent.available_actions['s1']
    for a2 in agent.available_actions['s2']
    for a3 in agent.available_actions['s3']
    for a4 in agent.available_actions['s4']
]

for policy in deterministic_policies:
    print(policy)

{'s1': 'phone', 's2': 'phone', 's3': 'study', 's4': 'review'}
{'s1': 'phone', 's2': 'phone', 's3': 'study', 's4': 'noreview'}
{'s1': 'phone', 's2': 'phone', 's3': 'sleep', 's4': 'review'}
{'s1': 'phone', 's2': 'phone', 's3': 'sleep', 's4': 'noreview'}
{'s1': 'phone', 's2': 'study', 's3': 'study', 's4': 'review'}
{'s1': 'phone', 's2': 'study', 's3': 'study', 's4': 'noreview'}
{'s1': 'phone', 's2': 'study', 's3': 'sleep', 's4': 'review'}
{'s1': 'phone', 's2': 'study', 's3': 'sleep', 's4': 'noreview'}
{'s1': 'quit', 's2': 'phone', 's3': 'study', 's4': 'review'}
{'s1': 'quit', 's2': 'phone', 's3': 'study', 's4': 'noreview'}
{'s1': 'quit', 's2': 'phone', 's3': 'sleep', 's4': 'review'}
{'s1': 'quit', 's2': 'phone', 's3': 'sleep', 's4': 'noreview'}
{'s1': 'quit', 's2': 'study', 's3': 'study', 's4': 'review'}
{'s1': 'quit', 's2': 'study', 's3': 'study', 's4': 'noreview'}
{'s1': 'quit', 's2': 'study', 's3': 'sleep', 's4': 'review'}
{'s1': 'quit', 's2': 'study', 's3': 'sleep', 's4': 'noreview'}
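
The nested comprehension above needs one `for` clause per state. A sketch that generalizes to any number of states uses `itertools.product`. The helper name `enumerate_policies` is hypothetical, and the `available_actions` dict below is read off the printed policies; it is an assumption that it mirrors `agent.available_actions`.

```python
from itertools import product

def enumerate_policies(available_actions):
    """List every deterministic policy as a dict mapping state -> action."""
    states = list(available_actions)
    action_lists = [available_actions[s] for s in states]
    # product() yields one action per state; the last state varies fastest.
    return [dict(zip(states, actions)) for actions in product(*action_lists)]

# Assumed to match agent.available_actions (inferred from the output above).
available_actions = {
    's1': ['phone', 'quit'],
    's2': ['phone', 'study'],
    's3': ['study', 'sleep'],
    's4': ['review', 'noreview'],
}

policies = enumerate_policies(available_actions)
print(len(policies))
```

The policies come out in the same order as the nested comprehension, because `product` iterates its last argument fastest.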


In [8]:
def wrap_policy(deterministic_policies, index):
    policy_dict = deterministic_policies[index]
    def policy(s):
        return policy_dict[s]
    return policy

agent.policy = wrap_policy(deterministic_policies, 0)
for s in ['s1', 's2', 's3', 's4']:
    print('{}: {}'.format(s, agent.policy(s)))

s1: phone
s2: phone
s3: study
s4: review


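Next, we estimate the value function achieved by each of these policies.

- Only the result for γ = 0.5 is computed here; γ = 1 is handled the same way.
- To reduce the computational cost, max_step and N are kept small.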

In [9]:
gamma = 0.5
values = []
for index in range(len(deterministic_policies)):
    print('Simulate {}th deterministic policy'.format(index))
    value_for_policy = []
    agent.policy = wrap_policy(deterministic_policies, index)
    for s in ['s1', 's2', 's3', 's4']:
        value_for_policy.append(simulate_eval(s, gamma, max_step=100, N=1000))
    values.append(value_for_policy)

Simulate 0th deterministic policy
Simulate 1th deterministic policy
Simulate 2th deterministic policy
Simulate 3th deterministic policy
Simulate 4th deterministic policy
Simulate 5th deterministic policy
Simulate 6th deterministic policy
Simulate 7th deterministic policy
Simulate 8th deterministic policy
Simulate 9th deterministic policy
Simulate 10th deterministic policy
Simulate 11th deterministic policy
Simulate 12th deterministic policy
Simulate 13th deterministic policy
Simulate 14th deterministic policy
Simulate 15th deterministic policy


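Finally, we pick the policy whose value function is largest. An optimal policy is at least as good as every other policy in every state, so with exact values it also maximizes the sum of the state values; we therefore compare the policies by that sum. Because the values here are Monte Carlo estimates, the comparison is only as reliable as the estimates.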

In [10]:
# Score each policy by the sum of its estimated state values and keep the best one.
max_sum_value = float('-inf')
max_index = -1
for i, value in enumerate(values):
    sum_value = sum(value)
    if sum_value > max_sum_value:
        max_sum_value = sum_value
        max_index = i
print('Optimal policy')
print(deterministic_policies[max_index])

Optimal policy
{'s1': 'quit', 's2': 'study', 's3': 'study', 's4': 'review'}
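
Summing the state values is a shortcut. The definition of optimality is state-wise: an optimal policy attains the best value in every state. A minimal sketch of that check is given below; the helper name `dominating_policies`, the tolerance `tol` (meant to absorb Monte Carlo noise), and the small value tables are hypothetical and are not results from the simulation above.

```python
import numpy as np

def dominating_policies(values, tol=0.0):
    """Indices of policies whose value is within tol of the best in every state."""
    values = np.asarray(values, dtype=float)   # shape: (n_policies, n_states)
    best_per_state = values.max(axis=0)
    mask = np.all(values >= best_per_state - tol, axis=1)
    return np.flatnonzero(mask).tolist()

# Hypothetical table: 3 policies evaluated in 2 states.
toy_values = [[1.0, 2.0],
              [1.5, 2.5],
              [1.5, 0.0]]
best = dominating_policies(toy_values)
print(best)

# With noisy estimates, a tolerance keeps near-ties instead of picking arbitrarily.
noisy_values = [[1.00, 2.00],
                [1.02, 1.98]]
near_ties = dominating_policies(noisy_values, tol=0.05)
print(near_ties)
```

If the function returns an empty list for estimated values, no policy dominates within the tolerance, which suggests the estimates are too noisy and N should be increased.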
