# What Is Reinforcement Learning

<img src="files/figures/what_is_reinforcement_learning.svg" style="width: 300px;" />

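# Installing and Configuring OpenAI Gym

* Open https://github.com/openai/gym in a browser. This is the GitHub repository for OpenAI Gym.
* Open a Command Prompt, change to a suitable folder, and run `git clone https://github.com/openai/gym`
* Run `cd gym` to enter the gym folder
* Run `pip install -e .` to install the package
* To verify the installation, run `python` at the command prompt
* Once the Python prompt appears, enter `import gym`. If no error is raised, gym is installed (enter `exit()` to quit, or press CTRL+D, which is CTRL+Z followed by Enter in the Windows Command Prompt)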
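
The verification step can also be scripted instead of typed at the interactive prompt. The following is a minimal sketch that uses only the standard library; the helper name `is_installed` is hypothetical and is not part of Gym.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be found on the import path."""
    return importlib.util.find_spec(package_name) is not None


# 'gym' is the package installed in the steps above
print("gym installed:", is_installed("gym"))
```

Unlike a bare `import gym`, this check reports a missing package as `False` rather than raising an error.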

# Importing OpenAI Gym

In [7]:
# Import OpenAI gym library
import gym

# Import environments from OpenAI gym
from gym import envs

In [9]:
# See the number of environments in OpenAI gym
len(envs.registry.all())

797

# Visualizing the MountainCar-v0 Environment

In [None]:
# Mountain car environment
import gym
env = gym.make('MountainCar-v0')
env.reset()
for _ in range(1000):
    env.render()
    env.step(env.action_space.sample())
env.close()

# Visualizing the CartPole-v0 Environment

In [None]:
# Cart pole environment
import gym
env = gym.make('CartPole-v0')
env.reset()
for _ in range(1000):
    env.render()
    env.step(env.action_space.sample())
env.close()

# Optional slower variant: uncomment to pause 0.1 s between frames
# import gym
# import time 
# env = gym.make('CartPole-v0')
# env.reset()
# for step in range(1000):
#     env.render()
#     env.step(env.action_space.sample())
#     time.sleep(0.1)
# env.close()

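# Observations

To do better than choosing a random action at each step, we need to know what our actions do to the environment.

The environment's `step` function returns exactly this information. In fact, `step` returns four values:

* `observation` (object): an environment-specific object representing our observation of the environment. Examples include pixel data from a camera, the joint angles and joint velocities of a robot, or the board state in a board game.
* `reward` (float): the reward obtained by the previous action. The magnitude and scale vary between environments, but the goal is always to maximize the total reward.
* `done` (boolean): whether the environment needs to be `reset` again. Most tasks (but not all) are divided into well-defined episodes, and `done` being `True` means the episode has ended (for example, the pole tipped too far, or the last life in a game was lost).
* `info` (dict): diagnostic information useful for debugging. It can sometimes help with learning (for example, it might contain the raw probabilities behind the environment's last state change). However, official evaluations of an agent do not allow this information to be used for learning.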
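
The four return values are typically consumed in a loop that runs until `done` is `True`. The sketch below illustrates that loop under stated assumptions: `ToyEnv` is a hypothetical stand-in, not a Gym environment, and exists only so that the example runs without Gym installed. It mimics the `reset()`/`step()` interface and the 4-tuple described above.

```python
import random


class ToyEnv:
    """Hypothetical stand-in environment (NOT part of Gym).

    It mimics the reset()/step() interface described above.
    An episode ends after `max_steps` steps.
    """

    def __init__(self, max_steps=5, seed=0):
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.steps = 0

    def reset(self):
        self.steps = 0
        return 0.0  # initial observation

    def step(self, action):
        self.steps += 1
        observation = self.rng.random()       # environment-specific observation
        reward = 1.0 if action == 1 else 0.0  # reward for the action just taken
        done = self.steps >= self.max_steps   # True once the episode is over
        info = {"steps": self.steps}          # diagnostic information only
        return observation, reward, done, info


def run_episode(env, policy):
    """Run one episode with `policy` and return the total reward."""
    observation = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = policy(observation)
        observation, reward, done, info = env.step(action)
        total_reward += reward
    return total_reward


# A policy that always chooses action 1 earns 1.0 per step in this toy setup
total = run_episode(ToyEnv(max_steps=5), policy=lambda obs: 1)
print(total)
```

The policy here ignores `info`, consistent with the note above that it is meant for diagnostics rather than learning.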

# Exploring the Actions of CartPole-v0

In [2]:
import gym
env = gym.make('CartPole-v0')

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.


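Reset the environment. The return value is the cart-pole observation $(x, \dot{x}, \theta, \dot{\theta})$:

* $x$: the position of the cart, that is, its horizontal coordinate
* $\dot{x}$: the velocity of the cart
* $\theta$: the angle of the pole
* $\dot{\theta}$: the angular velocity of the pole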

In [5]:
env.reset()

array([ 0.00206123, -0.03788145, -0.00525245, -0.04826139])

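Inspect the available actions. `Discrete(2)` means there are two discrete actions to choose from: the first has index `0` and pushes the cart to the left, and the second has index `1` and pushes the cart to the right.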

In [5]:
env.action_space

Discrete(2)

Push the cart to the left. The return value is $\left( (x, \dot{x}, \theta, \dot{\theta}), \mbox{reward}, \mbox{done}, \mbox{info} \right)$:

In [6]:
env.step(0)

(array([ 0.0013036 , -0.23292769, -0.00621768,  0.24275972]), 1.0, False, {})
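
Comparing the two outputs above shows how a single step changes the state. In the sketch below, the position and angle after `env.step(0)` match the previous values advanced by the previous velocities over a time step of 0.02 s. That time step is an assumption inferred from these numbers, not something the outputs state, and the names `before`, `after`, and `tau` are illustrative only.

```python
import math

# Observation returned by env.reset(): (x, x_dot, theta, theta_dot)
before = (0.00206123, -0.03788145, -0.00525245, -0.04826139)
# Observation returned by env.step(0)
after = (0.0013036, -0.23292769, -0.00621768, 0.24275972)

tau = 0.02  # assumed time step in seconds, inferred from the numbers above

x, x_dot, theta, theta_dot = before
predicted_x = x + tau * x_dot              # position advanced by the old velocity
predicted_theta = theta + tau * theta_dot  # angle advanced by the old angular velocity

print(math.isclose(predicted_x, after[0], abs_tol=1e-7))
print(math.isclose(predicted_theta, after[2], abs_tol=1e-7))

# Pushing left (action 0) makes the cart velocity more negative
print(after[1] < before[1])
```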

# Random Search

In [None]:
import numpy as np
import gym


class Harness:
    """Runs an agent in an environment for one episode."""

    def run_episode(self, env, agent):
        observation = env.reset()
        total_reward = 0
        # Cap the episode at 1000 steps in case `done` is never returned
        for _ in range(1000):
            action = agent.next_action(observation)
            observation, reward, done, info = env.step(action)
            total_reward += reward
            if done:
                break
        return total_reward


class LinearAgent:
    """Chooses an action from the sign of a weighted sum of the observation."""

    def __init__(self):
        # Four weights drawn uniformly from [-1, 1), one per observation component
        self.parameters = np.random.rand(4) * 2 - 1

    def next_action(self, observation):
        # 0 pushes the cart to the left, 1 pushes it to the right
        return 0 if np.matmul(self.parameters, observation) < 0 else 1


def random_search():
    env = gym.make('CartPole-v0')
    best_params = None
    best_reward = 0
    agent = LinearAgent()
    harness = Harness()

    for step in range(10000):
        # Draw a fresh random parameter vector and evaluate it for one episode
        agent.parameters = np.random.rand(4) * 2 - 1
        reward = harness.run_episode(env, agent)
        if reward > best_reward:
            best_reward = reward
            best_params = agent.parameters
            # CartPole-v0 episodes are capped at 200 steps, so 200 is the maximum reward
            if reward == 200:
                print('200 achieved on step {}'.format(step))

    print(best_params)
    env.close()

random_search()

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
200 achieved on step 9
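
The sample-and-keep-the-best loop used by `random_search()` can be sketched without Gym. The example below is an illustration under stated assumptions: `toy_score` is a hypothetical objective that stands in for the episode reward, and `random_search_toy` is a made-up name, so neither reflects how CartPole itself scores an agent.

```python
import random


def toy_score(params):
    """Hypothetical objective: higher is better, with a maximum of 0 at `target`."""
    target = (0.5, -0.25, 0.1, 0.0)
    return -sum((p - t) ** 2 for p, t in zip(params, target))


def random_search_toy(num_samples=500, seed=0):
    """Sample parameters uniformly from [-1, 1]^4 and keep the best-scoring set."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    history = []  # best score seen after each sample
    for _ in range(num_samples):
        params = tuple(rng.uniform(-1, 1) for _ in range(4))
        score = toy_score(params)
        if score > best_score:
            best_params, best_score = params, score
        history.append(best_score)
    return best_params, best_score, history


best_params, best_score, history = random_search_toy()
print(best_params, best_score)
```

Each sample is independent of the previous ones, so the best score can only stay the same or improve as more samples are drawn.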


# Hill Climbing

In [32]:
import numpy as np
import gym

class Harness:

    def run_episode(self, env, agent):
        observation = env.reset()
        total_reward = 0
        for _ in range(1000):
            action = agent.next_action(observation)
            observation, reward, done, info = env.step(action)
            total_reward += reward
            if done:
                break
        return total_reward


class LinearAgent:

    def __init__(self):
        self.parameters = np.random.rand(4) * 2 - 1

    def next_action(self, observation):
        return 0 if np.matmul(self.parameters, observation) < 0 else 1


def hill_climbing():
    env = gym.make('CartPole-v0')
    noise_scaling = 0.1
    agent = LinearAgent()
    harness = Harness()
    # Start from the agent's random initial parameters
    best_params = agent.parameters
    best_reward = 0

    for step in range(10000):
        # Perturb the best parameters found so far and evaluate that candidate
        agent.parameters = best_params + (np.random.rand(4) * 2 - 1) * noise_scaling
        reward = harness.run_episode(env, agent)
        if reward > best_reward:
            # Keep the candidate only if it improves on the best reward
            best_reward = reward
            best_params = agent.parameters
            if reward == 200:
                print('200 achieved on step {}'.format(step))
                break
    print('best reward: {}'.format(best_reward))
    print('best parameters: {}'.format(best_params))
    env.close()

hill_climbing()

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
200 achieved on step 60
best reward: 200.0
best parameters: [-0.23058888  0.27271016 -0.50275668  0.34309585]
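
To make the linear policy concrete, the sketch below applies the best parameters printed above to the observation returned by `env.step(0)` in the earlier section. It uses plain Python rather than NumPy so that it runs on its own; the variable names are illustrative.

```python
# Best parameters printed by hill_climbing() above
parameters = [-0.23058888, 0.27271016, -0.50275668, 0.34309585]
# Observation (x, x_dot, theta, theta_dot) returned by env.step(0) earlier
observation = [0.0013036, -0.23292769, -0.00621768, 0.24275972]

# Weighted sum of the observation; np.matmul on two 1-D arrays
# computes the same dot product
score = sum(p * o for p, o in zip(parameters, observation))

# Same rule as LinearAgent.next_action:
# 0 (push left) if the score is negative, otherwise 1 (push right)
action = 0 if score < 0 else 1
print(score, action)
```

For this observation the weighted sum is positive, so the agent would push the cart to the right.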


# First Edited

May 13, 2018

# References

[1] https://www.udemy.com/hands-on-reinforcement-learning-with-python

[2] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J. and Zaremba, W., 2016. OpenAI Gym. arXiv preprint arXiv:1606.01540.