# FrozenLake-v0
- The agent controls the movement of a character in a grid world. Some tiles of the grid are walkable, and others lead to the agent falling into the water. Additionally, the movement direction of the agent is uncertain and only partially depends on the chosen direction. The agent is rewarded for finding a walkable path to a goal tile.

- Winter is here. You and your friends were tossing around a frisbee at the park when you made a wild throw that left the frisbee out in the middle of the lake. The water is mostly frozen, but there are a few holes where the ice has melted. If you step into one of those holes, you'll fall into the freezing water. At this time, there's an international frisbee shortage, so it's absolutely imperative that you navigate across the lake and retrieve the disc. However, the ice is slippery, so you won't always move in the direction you intend.

```
SFFF       (S: starting point, safe)
FHFH       (F: frozen surface, safe)
FFFH       (H: hole, fall to your doom)
HFFG       (G: goal, where the frisbee is located)
```
```
LEFT = 0
DOWN = 1
RIGHT = 2
UP = 3
```

In [1]:
import gym
import numpy as np

In [2]:
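def get_environment():
    """Create the FrozenLake-v0 environment and report the sizes of its state and action spaces."""
    env = gym.make('FrozenLake-v0')
    state_space_n = env.observation_space.n  # number of discrete states (one per grid tile)
    action_space_n = env.action_space.n      # number of discrete actions (LEFT, DOWN, RIGHT, UP)
    print(state_space_n)
    print(action_space_n)
    return env, state_space_n, action_space_n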

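# Policy iteration [based on a greedy policy]
In any state, the agent compares the values of all possible successor states, picks the state with the highest value, and then chooses the action that leads to that state.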
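
The greedy choice described above can be sketched on a toy model. This is a minimal illustration, not the notebook's own code: `toy_P`, `toy_V`, and `greedy_action` are hypothetical names, and the toy transitions only mimic the layout of `env.P`. Because outcomes are stochastic, the sketch scores each action by its expected one-step return rather than by a single successor state.

```python
# Hypothetical toy model in the same layout as env.P:
# P[state][action] -> list of (probability, next_state, reward, done)
toy_P = {
    0: {
        0: [(1.0, 0, 0.0, False)],                       # action 0: stay in state 0
        1: [(0.5, 1, 0.0, False), (0.5, 2, 1.0, True)],  # action 1: two equally likely outcomes
    },
}
toy_V = [0.2, 0.4, 0.0]  # illustrative state values, not taken from FrozenLake


def greedy_action(P, V, s, gamma=1.0):
    """Return (best_action, q_values) for state s under the value estimates V."""
    q_values = {}
    for a, outcomes in P[s].items():
        # Expected return of action a: sum over outcomes of p * (r + gamma * V[s'])
        q_values[a] = sum(p * (r + gamma * V[s2]) for p, s2, r, _ in outcomes)
    best = max(q_values, key=q_values.get)  # action with the largest expected return
    return best, q_values


best, q = greedy_action(toy_P, toy_V, 0)
print(best, q)
```

Here action 1 wins because its expected return (about 0.7) exceeds the 0.2 obtained by staying put.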

In [3]:
env, state_space_n, action_space_n = get_environment()

16
4


In [4]:
value_table = np.zeros(state_space_n)
value_table

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [5]:
env.P[0][0]

[(0.3333333333333333, 0, 0.0, False),
 (0.3333333333333333, 0, 0.0, False),
 (0.3333333333333333, 4, 0.0, False)]
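
Each entry of `env.P[s][a]` is a tuple `(probability, next_state, reward, done)`. For state 0 and action LEFT, the agent stays in state 0 with probability 2/3 and slips down to state 4 with probability 1/3. The sketch below shows how such a list turns into an expected return. The helper `expected_return` and the numbers in `values` are assumptions made up for illustration; only the transition list is copied from the output above.

```python
# Transition list copied from the env.P[0][0] output:
# each entry is (probability, next_state, reward, done).
transitions = [
    (1 / 3, 0, 0.0, False),
    (1 / 3, 0, 0.0, False),
    (1 / 3, 4, 0.0, False),
]


def expected_return(transitions, values, gamma=1.0):
    """Probability-weighted sum of reward plus discounted value of the next state."""
    return sum(p * (r + gamma * values[s2]) for p, s2, r, _ in transitions)


# Hypothetical value estimates, chosen only to make the arithmetic visible.
values = [0.0] * 16
values[0] = 0.3
values[4] = 0.6

q_left = expected_return(transitions, values)
print(q_left)  # roughly (0.3 + 0.3 + 0.6) / 3
```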

In [6]:
env.render()

SFFF
FHFH
FFFH
HFFG


In [7]:
def convergence_flag(value_table1, value_table2, threshold=1e-2):
    # Converged when the summed absolute change across all states falls below the threshold
    return np.sum(np.fabs(value_table1 - value_table2)) < threshold

In [8]:
iteration_num = 10000000  # upper bound on the number of sweeps over the state space
gamma = 1.0  # discount factor

## Value iteration update:
$$
V_{k+1}(s) = \sum_{a \in A}\pi(a|s)(R_{s}^{a} + \gamma \sum_{s^{'}\in S}P^{a}_{ss^{'}}V_{k}(s^{'}))
$$
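
As a worked step, assume $\pi$ is greedy with respect to $V_{k}$, as the section on the greedy policy describes. Such a policy puts probability 1 on a maximizing action and 0 on every other action, so the sum over actions reduces to a maximum:

$$
V_{k+1}(s) = \max_{a \in A}\left(R_{s}^{a} + \gamma \sum_{s^{'}\in S}P^{a}_{ss^{'}}V_{k}(s^{'})\right)
$$

This is the form the update loop evaluates when it takes `max(q_value)` for each state.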

In [9]:
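for i in range(iteration_num):
    value_table_1 = np.copy(value_table)  # snapshot of the value table from the previous sweep
    for s in range(state_space_n):  # update v(s) for every state
        q_value = []  # one Q-value per action available in state s
        for a in range(action_space_n):  # loop over every action available in state s
            next_states_rewards = []  # probability-weighted return of each outcome of taking a in s
            for next_s in env.P[s][a]:  # loop over every state reachable by taking a in s
                prob, next_state, reward, done = next_s
                # prob is the transition probability P(s' | s, a), not the policy.
                # The policy is greedy, so the max-Q action is taken with probability 1.
                next_states_rewards.append(prob * (reward + gamma * value_table_1[next_state]))
            q_value.append(np.sum(next_states_rewards))  # Q(s, a): expected return of action a
        value_table[s] = max(q_value)  # greedy backup: keep the value of the best action
    if convergence_flag(value_table_1, value_table):
        break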

In [10]:
value_table.reshape([4,4])

array([[0.78269257, 0.76903016, 0.7593322 , 0.75430111],
       [0.78566665, 0.        , 0.50023373, 0.        ],
       [0.79139792, 0.79946853, 0.74349026, 0.        ],
       [0.        , 0.86526845, 0.93231167, 0.        ]])