## MountainCar-v0 Problem

### Observation
The observation is the agent's state: the car's position and velocity. Their ranges are:  
Position: -1.2 to 0.6  
Velocity: -0.07 to 0.07

### Actions

The agent has exactly three actions:  
0 (push left)  
1 (no push)  
2 (push right)

### Reward

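Every step yields a reward of -1 until the car reaches the goal on the right hill, at position 0.5.
Reaching the top of the left hill has no special reward; the left boundary simply acts as a wall.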

### Starting State

The position is drawn uniformly at random from [-0.6, -0.4], and the velocity is 0.
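
To make the description above concrete, here is a minimal pure-Python sketch of a single environment step. The constants (force 0.001, gravity 0.0025) and the update order are assumptions based on the classic Gym implementation, the goal check is simplified to the position alone, and the name `mountain_car_step` is hypothetical. It is an illustration only, not a replacement for `env.step`.

```python
import math

MIN_POSITION, MAX_POSITION = -1.2, 0.6
MAX_SPEED = 0.07
GOAL_POSITION = 0.5
FORCE, GRAVITY = 0.001, 0.0025  # assumed constants, not stated in this notebook


def mountain_car_step(position, velocity, action):
    """Advance the car one step; action is 0 (left), 1 (no push) or 2 (right)."""
    velocity += (action - 1) * FORCE - math.cos(3 * position) * GRAVITY
    velocity = max(-MAX_SPEED, min(MAX_SPEED, velocity))
    position += velocity
    position = max(MIN_POSITION, min(MAX_POSITION, position))
    # The left boundary behaves like a wall: the car is stopped there
    if position == MIN_POSITION and velocity < 0:
        velocity = 0.0
    done = position >= GOAL_POSITION
    reward = -1.0  # every step costs -1, including the one that reaches the goal
    return position, velocity, reward, done


# Pushing left while already at the left boundary: the car stays put
print(mountain_car_step(-1.2, -0.07, 0))
# Pushing right just below the goal: the episode ends
print(mountain_car_step(0.49, 0.07, 2))
```

Since the reward is -1 on every step, the return of an episode is simply minus its length, so a better policy is one that reaches the goal in fewer steps.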

In [1]:
import gym
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Model

In [2]:
# Create the environment for the MountainCar problem
env = gym.make("MountainCar-v0")
env.reset()

array([-0.56164963,  0.        ])

# Parameter

In [3]:
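# Basic parameters
alpha = 0.5 # learning rate
gama = 0.95 # discount factor
episodes_num = 5000 # number of training episodes
display_num = 200 # print progress every this many episodes
Qtable_size = 20 # number of bins per state dimension in the Q-table

# Discretization parameters
discrete_size = [Qtable_size] * len(env.observation_space.high) # [20, 20]
discrete_step = (env.observation_space.high - env.observation_space.low) / discrete_size # width of one bin per dimension
q_table = np.zeros(discrete_size + [env.action_space.n]) # Q-table initialised to zeros, shape 20*20*3

# ε parameters
# ε starts at 1 and shrinks as training progresses
epsilon = 1
start_episode = 1
end_episode = episodes_num//2
# amount subtracted from ε after each episode
de_step = epsilon/(end_episode - start_episode)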
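
A quick way to see what these ε parameters produce is to replay the decay rule on its own. The sketch below assumes ε is reduced by `de_step` once per episode while `start_episode <= episode <= end_episode`, which is how the training loop later in this notebook applies it. `epsilon_schedule` is a hypothetical helper, not part of the notebook.

```python
def epsilon_schedule(episodes_num, epsilon=1.0, start_episode=1):
    """Return the ε in effect during each episode under the linear decay rule."""
    end_episode = episodes_num // 2
    de_step = epsilon / (end_episode - start_episode)
    used = []
    for episode in range(episodes_num):
        used.append(epsilon)  # ε used while this episode runs
        if end_episode >= episode >= start_episode:
            epsilon -= de_step
    return used


schedule = epsilon_schedule(5000)
print(schedule[0], schedule[1250], schedule[-1])
```

Because the inclusive range contains one more episode than the divisor `end_episode - start_episode`, ε finishes marginally below zero (about -0.0004). That is harmless for an ε-greedy test of the form `np.random.random() < epsilon`, which is then never true, so the second half of training acts fully greedily.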

# Auxiliary Function

In [4]:
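# Helper functions
# Discretize the state: map a continuous state to Q-table bin indices
def get_discrete_state (state):
    discrete_state = (state - env.observation_space.low) // discrete_step
    # Clamp so a value sitting exactly on the upper bound still maps to the last bin
    discrete_state = np.clip(discrete_state, 0, np.array(discrete_size) - 1)
    return tuple(discrete_state.astype(int))

# Choose an action with the ε-greedy policy
def take_epilon_greedy_action(state, epsilon):
    discrete_state = get_discrete_state(state)
    if np.random.random() < epsilon:
        # explore: pick a uniformly random action
        action = np.random.randint(0,env.action_space.n)
    else:
        # exploit: pick the action with the highest Q-value
        action = np.argmax(q_table[discrete_state])
    return action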
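
To make the binning concrete, here is a worked example in plain Python using the observation bounds listed at the top of this notebook and 20 bins per dimension. The helper `discretize` is hypothetical and mirrors the floor-division idea of `get_discrete_state`; the clamp to `bins - 1` is an added safeguard for values that sit exactly on the upper bound.

```python
def discretize(state, low, high, bins):
    """Map each continuous component of state to a bin index in [0, bins - 1]."""
    indices = []
    for value, lo, hi in zip(state, low, high):
        step = (hi - lo) / bins            # width of one bin
        index = int((value - lo) // step)  # floor division picks the bin
        indices.append(min(max(index, 0), bins - 1))
    return tuple(indices)


LOW, HIGH = (-1.2, -0.07), (0.6, 0.07)

print(discretize((-0.5, 0.01), LOW, HIGH, 20))  # → (7, 11)
print(discretize((-1.2, -0.07), LOW, HIGH, 20)) # → (0, 0)
print(discretize((0.6, 0.07), LOW, HIGH, 20))   # → (19, 19)
```

With these bounds a position bin is 0.09 wide and a velocity bin is 0.007 wide, so the 20*20 grid treats all states inside one cell as identical.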

# Train

In [5]:
# Training
for episode in range(episodes_num):
    # Report progress every display_num episodes
    if episode % display_num == 0:
        print("episode: {}".format(episode))
        render = True  # flag is set here but not used anywhere in this loop
    else:
        render = False

    state = env.reset()
    done = False
    A_list = [] # actions taken
    S_list = [] # states visited
    R_list = [] # rewards received
    Done_list = [] # done flag of each step
    num_T = 0 # episode length
    # Generate one full episode
    while not done:
        S_list.append(state)
        # Behaviour policy: St -> At
        action = take_epilon_greedy_action(state, epsilon)
        A_list.append(action)
        # (St, At) -> (Rt+1, St+1)
        next_state, reward, done, _ = env.step(action)
        R_list.append(reward)
        Done_list.append(done)
        state = next_state
        num_T = num_T + 1
    Gt = 0
    # Walk the episode backwards and update the Q-table
    for i in range(num_T):
        temp_t = num_T - i - 1
        index = get_discrete_state(S_list[temp_t]) + (A_list[temp_t],)
        if not Done_list[temp_t]:
            Gt = gama * Gt + R_list[temp_t] # accumulate the discounted return
            td_target = Gt # the target is the full return, i.e. a Monte Carlo target
            q_table[index] += alpha * (td_target - q_table[index])
        else:
            # Final step of the episode: its value is pinned to 0.
            # Note that done is also True when the 200-step time limit is hit.
            q_table[index] = 0

    # Linearly decay ε during the first half of training
    if end_episode >= episode >= start_episode:
        epsilon -= de_step

episode: 0
episode: 200
episode: 400
episode: 600
episode: 800
episode: 1000
episode: 1200
episode: 1400
episode: 1600
episode: 1800
episode: 2000
episode: 2200
episode: 2400
episode: 2600
episode: 2800
episode: 3000
episode: 3200
episode: 3400
episode: 3600
episode: 3800
episode: 4000
episode: 4200
episode: 4400
episode: 4600
episode: 4800
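
The update in the training loop above uses the discounted return of each step as its target, which makes it a constant-step-size Monte Carlo update rather than a one-step TD update. The backward recursion behind it can be sketched on a toy reward list. The helper `discounted_returns` is hypothetical, and unlike the notebook loop (which skips the final step and pins its Q-value to 0) it also counts the reward of the last step.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every step, computed backwards."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


print(discounted_returns([-1, -1, -1], 0.5))  # → [-1.75, -1.5, -1.0]
```

Iterating from the last step to the first lets each return reuse the one after it, so a whole episode is processed in a single pass.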


# Test

In [6]:
# Test the trained policy by acting greedily on the Q-table
done = False
state = env.reset()
while not done:
    action = np.argmax(q_table[get_discrete_state(state)])
    next_state, _, done, _ = env.step(action)
    state = next_state
    env.render()

env.close()