# Getting Started with Baselines – DQN on Cart-Pole

By the end of this notebook we will have trained an RL agent that achieves the following results on CartPole:

In [19]:
# Render the episodes
import base64
from IPython.display import HTML, display

def render(episodes_to_watch=1):
    # `env` must be the Monitor-wrapped environment created later in this notebook
    for episode in range(episodes_to_watch):
        # read the video recorded for this episode (read-only, binary mode)
        path = f"./gym-results/openaigym.video.{env.file_infix}.video{episode:06d}.mp4"
        with open(path, "rb") as video_file:
            encoded = base64.b64encode(video_file.read())
        # embed the video in the notebook as a base64 data URI
        display(
            HTML(
                data="""
            <video width="360" height="auto" alt="test" controls><source src="data:video/mp4;base64,{0}" type="video/mp4" /></video>""".format(
                    encoded.decode("ascii")
                )
            )
        )

In [20]:
render()
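
The `render` helper relies on the file naming pattern used in its f-string. As a minimal sketch, the same pattern can be isolated in a small function to check which videos exist before trying to display them. The helper name `video_path` and the infix value `"0.12345"` are hypothetical and only serve as an illustration; the real infix comes from `env.file_infix`.

```python
from pathlib import Path


def video_path(file_infix, episode, directory="./gym-results"):
    """Build the path of a recorded episode, following the pattern used by render()."""
    return Path(directory) / f"openaigym.video.{file_infix}.video{episode:06d}.mp4"


# Hypothetical infix, for illustration only
path = video_path("0.12345", 3)
print(path.name)      # openaigym.video.0.12345.video000003.mp4
print(path.exists())  # True only if that episode was actually recorded
```

Checking `path.exists()` first gives a clearer error than a failed `open` when fewer episodes were recorded than requested.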

## From the command line

Train DQN on CartPole with a single command:

<div class="alert alert-warning">

**Note:** The following cell will take some time.

</div>

In [1]:
! python -m baselines.run --alg=deepq --env=CartPole-v0 --save_path=./cartpole_model.pkl --num_timesteps=1e5

Logging to /var/folders/yq/3ns6lnvj3670mmd5kk_dlljm0000gn/T/openai-2020-05-01-15-36-18-955327
env_type: classic_control
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.




2020-05-01 15:36:20.615649: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-05-01 15:36:20.637248: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fd3ec8ddbf0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-05-01 15:36:20.637271: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version

Training deepq o

You can safely ignore the warnings printed by the previous command.

Run the model to see the results:

<div class="alert alert-warning">

**Note:** The following command opens another window in which you can see the cartpole agent.

</div>

<div class="alert alert-warning">

**Note:** Stop the following cell manually; otherwise it will run forever.

</div>

In [None]:
# Load the model saved in cartpole_model.pkl and visualize the learned policy
! python -m baselines.run --alg=deepq --env=CartPole-v0 --load_path=./cartpole_model.pkl --num_timesteps=0 --play
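
The two commands above differ only in a few flags. The following sketch assembles them from Python so they can be reused with other settings. The helper `baselines_command` is hypothetical (it is not part of baselines), and it only uses the flags that appear in the commands above. It builds the argument list without launching anything.

```python
def baselines_command(alg, env, num_timesteps, save_path=None, load_path=None, play=False):
    """Assemble a `python -m baselines.run` command line as a list of arguments."""
    args = [
        "python", "-m", "baselines.run",
        f"--alg={alg}",
        f"--env={env}",
        f"--num_timesteps={int(num_timesteps)}",
    ]
    if save_path is not None:
        args.append(f"--save_path={save_path}")
    if load_path is not None:
        args.append(f"--load_path={load_path}")
    if play:
        args.append("--play")
    return args


train = baselines_command("deepq", "CartPole-v0", 1e5, save_path="./cartpole_model.pkl")
play = baselines_command("deepq", "CartPole-v0", 0, load_path="./cartpole_model.pkl", play=True)
print(" ".join(train))
print(" ".join(play))
```

The resulting lists could then be passed to `subprocess.run(train, check=True)` if you prefer launching the training from a script rather than from a notebook cell.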

## Using Python

1. Imports:

In [6]:
import gym

# Import the desired algorithm from baselines
from baselines import deepq


2. Define a callback that tells baselines when to stop training. The callback should return True once the average reward is satisfactory:

In [8]:
def callback(locals, globals):
    """
    Function called at every step with the state of the algorithm.
    If the callback returns True, training stops.
    """
    # stop training once the average return reaches 199:
    # more than 100 timesteps must have elapsed, and the mean of the
    # last 100 returns (excluding the most recent entry) must be >= 199
    is_solved = (
        locals["t"] > 100 and sum(locals["episode_rewards"][-101:-1]) / 100 >= 199
    )
    return is_solved
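
To see how the stopping condition behaves, the callback can be exercised with hand-built dictionaries instead of a real training run. This is only a sketch: the three states below are hypothetical, and it assumes that the last entry of `episode_rewards` is the episode still in progress, which is why the slice `[-101:-1]` skips it.

```python
def callback(locals_, globals_):
    """Return True once the last 100 completed episodes average at least 199."""
    recent = locals_["episode_rewards"][-101:-1]
    return locals_["t"] > 100 and sum(recent) / 100 >= 199


# Hypothetical training states (assumption: the last entry is the running episode)
solved_state = {"t": 20_000, "episode_rewards": [200.0] * 100 + [37.0]}
early_state = {"t": 50, "episode_rewards": [200.0] * 100 + [37.0]}
unsolved_state = {"t": 20_000, "episode_rewards": [150.0] * 100 + [200.0]}

print(callback(solved_state, {}))    # True: 100 finished episodes average 200
print(callback(early_state, {}))     # False: fewer than 100 timesteps elapsed
print(callback(unsolved_state, {}))  # False: the average is only 150
```

Note that the sum is always divided by 100, so with fewer than 100 finished episodes the average is underestimated and training does not stop early.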

3. Now let’s create the environment and prepare the algorithm parameters: 

In [9]:
# create the environment
env = gym.make("CartPole-v0")

# Prepare learning parameters: network and learning rate
# the policy is a multi-layer perceptron
network = "mlp"
# set learning rate of the algorithm
learning_rate = 1e-3

4. We can use the function `deepq.learn()` to start the training and solve the task:

<div class="alert alert-warning">

**Note:** The following cell will take some time.

</div>

In [10]:
# launch learning on this environment using DQN
# ignore the exploration parameter for now
actor = deepq.learn(
    env,
    network=network,
    lr=learning_rate,
    total_timesteps=100_000,  # integer step count
    buffer_size=50_000,  # integer replay buffer capacity
    exploration_fraction=0.1,
    exploration_final_eps=0.02,
    print_freq=10,
    callback=callback,
)

Logging to /var/folders/yq/3ns6lnvj3670mmd5kk_dlljm0000gn/T/openai-2020-05-01-15-53-15-333494
--------------------------------------
| % time spent exploring  | 97       |
| episodes                | 10       |
| mean 100 episode reward | 24.4     |
| steps                   | 219      |
--------------------------------------


--------------------------------------
| % time spent exploring  | 96       |
| episodes                | 20       |
| mean 100 episode reward | 21.3     |
| steps                   | 403      |
--------------------------------------
--------------------------------------
| % time spent exploring  | 94       |
| episodes                | 30       |
| mean 100 episode reward | 20.4     |
| steps                   | 590      |
--------------------------------------
--------------------------------------
| % time spent exploring  | 92       |
| episodes                | 40       |
| mean 100 episode reward | 20.7     |
| steps                   | 805      |
--------------------------------------
--------------------------------------
| % time spent exploring  | 90       |
| episodes                | 50       |
| mean 100 episode reward | 20       |
| steps                   | 980      |
--------------------------------------
--------------------------------------
| % time spent exploring 

--------------------------------------
| % time spent exploring  | 2        |
| episodes                | 370      |
| mean 100 episode reward | 123      |
| steps                   | 2.84e+04 |
--------------------------------------
--------------------------------------
| % time spent exploring  | 2        |
| episodes                | 380      |
| mean 100 episode reward | 115      |
| steps                   | 2.88e+04 |
--------------------------------------
--------------------------------------
| % time spent exploring  | 2        |
| episodes                | 390      |
| mean 100 episode reward | 108      |
| steps                   | 2.9e+04  |
--------------------------------------
--------------------------------------
| % time spent exploring  | 2        |
| episodes                | 400      |
| mean 100 episode reward | 95.6     |
| steps                   | 2.92e+04 |
--------------------------------------
--------------------------------------
| % time spent exploring 

--------------------------------------
| % time spent exploring  | 2        |
| episodes                | 720      |
| mean 100 episode reward | 81.3     |
| steps                   | 5.98e+04 |
--------------------------------------
--------------------------------------
| % time spent exploring  | 2        |
| episodes                | 730      |
| mean 100 episode reward | 83.3     |
| steps                   | 6.01e+04 |
--------------------------------------
--------------------------------------
| % time spent exploring  | 2        |
| episodes                | 740      |
| mean 100 episode reward | 83.4     |
| steps                   | 6.03e+04 |
--------------------------------------
--------------------------------------
| % time spent exploring  | 2        |
| episodes                | 750      |
| mean 100 episode reward | 76.1     |
| steps                   | 6.05e+04 |
--------------------------------------
--------------------------------------
| % time spent exploring 
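
The "% time spent exploring" column in the log follows from `exploration_fraction` and `exploration_final_eps`. The sketch below assumes that epsilon is annealed linearly from 1.0 to the final value over the first `exploration_fraction * total_timesteps` steps and then held constant. The function `linear_epsilon` is a hypothetical stand-in written for this illustration, not the baselines implementation.

```python
def linear_epsilon(t, total_timesteps=100_000, exploration_fraction=0.1,
                   initial_eps=1.0, final_eps=0.02):
    """Linearly anneal epsilon, then keep it at final_eps (assumed schedule)."""
    schedule_steps = int(exploration_fraction * total_timesteps)
    fraction = min(t / schedule_steps, 1.0)
    return initial_eps + fraction * (final_eps - initial_eps)


# Step counts taken from the first five log tables above
for step in (219, 403, 590, 805, 980):
    print(step, int(100 * linear_epsilon(step)))
```

Under this assumption the printed percentages are 97, 96, 94, 92 and 90, which matches the first five tables of the log. After the first 10,000 steps epsilon stays at about 0.02, consistent with the later tables reporting 2.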

5. Now we can save our actor so that we can reuse it without re-training:

In [11]:
print("Saving model to cartpole_model.pkl")
actor.save("cartpole_model.pkl")

Saving model to cartpole_model.pkl


6. Now it is possible to use the model and visualize the agent’s behaviour.
The actor returned by `deepq.learn` is a callable that implements the agent's policy: given a batch of observations, it returns the selected actions. Below we pass the current observation with an added batch dimension (`observation[None]`) and take the first returned action.

In [15]:
# Needed to show the environment in a notebook
from gym import wrappers

In [16]:
env = wrappers.Monitor(
    env, "./gym-results", force=True, video_callable=lambda episode_id: True
)

# visualize the policy
n_episodes = 5
n_timesteps = 1000
for episode in range(n_episodes):
    observation = env.reset()
    episode_return = 0
    for timestep in range(n_timesteps):
        # render the environment
        env.render()

        # select the action according to the actor
        action = actor(observation[None])[0]

        # call env.step function
        observation, reward, done, _ = env.step(action)

        # since the return is undiscounted, we can simply add each reward to the cumulative return
        episode_return += reward

        if done:
            break
    
    # the episode has ended: print the return and the number of steps taken
    # (timestep is zero-based, hence the + 1)
    print(f"Episode return {episode_return}, Number of steps: {timestep + 1}")
env.close()

Episode return 200.0, Number of steps: 200
Episode return 200.0, Number of steps: 200
Episode return 200.0, Number of steps: 200
Episode return 200.0, Number of steps: 200
Episode return 200.0, Number of steps: 200
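
The loop above accumulates an undiscounted return, which is why a plain sum is enough. As a small sketch of the difference, the hypothetical helper below computes the return for an arbitrary discount factor `gamma`. It uses a reward list of 200 ones, since CartPole-v0 gives a reward of 1 per step and caps episodes at 200 steps.

```python
def discounted_return(rewards, gamma=1.0):
    """Compute sum_k gamma**k * rewards[k] by accumulating from the last step backwards."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


rewards = [1.0] * 200  # one reward of 1 per step for a full-length episode
print(discounted_return(rewards))              # 200.0, same as the plain sum
print(discounted_return(rewards, gamma=0.99))  # smaller, since late rewards count less
```

With `gamma=1.0` the result equals the episode return printed above. With any `gamma` below 1 it is smaller, so the two quantities should not be compared directly.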


Let's render the recorded episodes in the notebook:

In [21]:
render(5)