# RLlib Sample Application: MountainCar-v0

This example uses [RLlib](https://ray.readthedocs.io/en/latest/rllib.html) to train a policy with the `MountainCar-v0` environment:

 - <https://gym.openai.com/envs/MountainCar-v0/>

For more background about this problem, see:

  - ["Efficient memory-based learning for robot control"](https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-209.pdf)  
    [Andrew William Moore](https://www.cl.cam.ac.uk/~awm22/)  
    University of Cambridge (1990)
  - ["Solving Mountain Car with Q-Learning"](https://medium.com/@ts1829/solving-mountain-car-with-q-learning-b77bf71b1de2)  
    [Tim Sullivan](https://twitter.com/ts_1829)
  
---

First, let's make sure that Ray and RLlib are installed…

In [None]:
!pip install ray[rllib]
!pip install gym

Then start Ray…

In [1]:
import ray
import ray.rllib.agents.ppo as ppo

ray.shutdown()
ray.init(ignore_reinit_error=True)

2020-07-06 15:26:21,380	INFO resource_spec.py:212 -- Starting Ray with 3.52 GiB memory available for workers and up to 1.78 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-07-06 15:26:21,898	INFO services.py:1165 -- View the Ray dashboard at localhost:8265


{'node_ip_address': '192.168.1.65',
 'raylet_ip_address': '192.168.1.65',
 'redis_address': '192.168.1.65:6379',
 'object_store_address': '/tmp/ray/session_2020-07-06_15-26-21_364551_84337/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2020-07-06_15-26-21_364551_84337/sockets/raylet',
 'webui_url': 'localhost:8265',
 'session_dir': '/tmp/ray/session_2020-07-06_15-26-21_364551_84337'}

After a successful launch, the Ray dashboard will be running on a local port:

In [2]:
print("Dashboard URL: http://{}".format(ray.get_webui_url()))

Dashboard URL: http://localhost:8265


Open that URL in another tab to view the Ray dashboard as the example runs. We'll also set up a checkpoint location to store the trained policy:

In [9]:
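import os
import shutil

# Start from a clean checkpoint directory (relative to the working directory).
CHECKPOINT_ROOT = "tmp/ppo/moun"
shutil.rmtree(CHECKPOINT_ROOT, ignore_errors=True)

# Clear earlier RLlib logs so TensorBoard shows only this run.
# Note: this deletes everything under ~/ray_results.
ray_results = os.path.expanduser("~/ray_results/")
shutil.rmtree(ray_results, ignore_errors=True)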

Next we'll configure RLlib to train a PPO policy with the `MountainCar-v0` environment <https://gym.openai.com/envs/MountainCar-v0/>

In [10]:
SELECT_ENV = "MountainCar-v0"

config = ppo.DEFAULT_CONFIG.copy()
config["log_level"] = "WARN"

config["num_workers"] = 4               # default = 2
config["train_batch_size"] = 10000      # default = 4000
config["sgd_minibatch_size"] = 256      # default = 128
config["evaluation_num_episodes"] = 50  # default = 10

agent = ppo.PPOTrainer(config, env=SELECT_ENV)

2020-07-06 15:32:34,053	INFO (unknown file):0 -- gc.collect() freed 923 refs in 0.29063460799989116 seconds


By default, training runs for `40` iterations. Increase the `N_ITER` setting if you want to see the resulting rewards improve.

In [11]:
N_ITER = 40
s = "{:3d} reward {:6.2f}/{:6.2f}/{:6.2f} len {:6.2f} saved {}"

for n in range(N_ITER):
    result = agent.train()
    file_name = agent.save(CHECKPOINT_ROOT)

    print(s.format(
        n + 1,
        result["episode_reward_min"],
        result["episode_reward_mean"],
        result["episode_reward_max"],
        result["episode_len_mean"],
        file_name
        ))

  1 reward -200.00/-200.00/-200.00 len 200.00 saved tmp/ppo/moun/checkpoint_1/checkpoint-1
  2 reward -200.00/-200.00/-200.00 len 200.00 saved tmp/ppo/moun/checkpoint_2/checkpoint-2
  3 reward -200.00/-200.00/-200.00 len 200.00 saved tmp/ppo/moun/checkpoint_3/checkpoint-3
  4 reward -200.00/-200.00/-200.00 len 200.00 saved tmp/ppo/moun/checkpoint_4/checkpoint-4
  5 reward -200.00/-200.00/-200.00 len 200.00 saved tmp/ppo/moun/checkpoint_5/checkpoint-5
  6 reward -200.00/-200.00/-200.00 len 200.00 saved tmp/ppo/moun/checkpoint_6/checkpoint-6
  7 reward -200.00/-200.00/-200.00 len 200.00 saved tmp/ppo/moun/checkpoint_7/checkpoint-7
  8 reward -200.00/-200.00/-200.00 len 200.00 saved tmp/ppo/moun/checkpoint_8/checkpoint-8
  9 reward -200.00/-200.00/-200.00 len 200.00 saved tmp/ppo/moun/checkpoint_9/checkpoint-9
 10 reward -200.00/-200.00/-200.00 len 200.00 saved tmp/ppo/moun/checkpoint_10/checkpoint-10
 11 reward -200.00/-200.00/-200.00 len 200.00 saved tmp/ppo/moun/checkpoint_11/checkpoin

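Do the episode rewards increase after multiple iterations?
That shows whether the policy is improving. In `MountainCar-v0` each step earns a reward of -1 and an episode is cut off after 200 steps, so a reward of `-200.00` means the car never reached the goal.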
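
One way to answer that question without reading the log by eye is to collect `result["episode_reward_mean"]` from each iteration and compare the start of training with the end. The sketch below is illustrative only: `reward_improvement` and its `window` parameter are hypothetical names, not part of RLlib, and the sample list simply mirrors the flat rewards printed above.

```python
from statistics import mean


def reward_improvement(mean_rewards, window=5):
    """Return (mean of the last `window` values) - (mean of the first `window`).

    A positive number suggests the policy improved over the run;
    a value near zero suggests it has not started learning yet.
    """
    if len(mean_rewards) < 2 * window:
        raise ValueError("need at least 2 * window iterations to compare")
    return mean(mean_rewards[-window:]) - mean(mean_rewards[:window])


# Hypothetical history: one episode_reward_mean value per training iteration.
flat_history = [-200.0] * 11
print(reward_improvement(flat_history))
```

In the training loop you would append `result["episode_reward_mean"]` to a list on each iteration and call the helper once the loop finishes.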

Also, print out the policy and model to see the results of training in detail…

In [12]:
policy = agent.get_policy()
model = policy.model
print(model.base_model.summary())

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
observations (InputLayer)       [(None, 2)]          0                                            
__________________________________________________________________________________________________
fc_1 (Dense)                    (None, 256)          768         observations[0][0]               
__________________________________________________________________________________________________
fc_value_1 (Dense)              (None, 256)          768         observations[0][0]               
__________________________________________________________________________________________________
fc_2 (Dense)                    (None, 256)          65792       fc_1[0][0]                       
______________________________________________________________________________________________
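
The parameter counts in the summary can be checked by hand: a fully connected (Dense) layer has one weight per input-unit pair plus one bias per unit. The helper below is a small sketch of that arithmetic; `dense_param_count` is a hypothetical name introduced here, not a Keras or RLlib function.

```python
def dense_param_count(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


# The observation has 2 values (position and velocity), feeding 256 units.
print(dense_param_count(2, 256))    # fc_1 and fc_value_1
# The second hidden layer maps 256 units to 256 units.
print(dense_param_count(256, 256))  # fc_2
```

The two results, 768 and 65792, match the `Param #` column for `fc_1`, `fc_value_1`, and `fc_2`.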

Next we'll use the [`rollout` script](https://ray.readthedocs.io/en/latest/rllib-training.html#evaluating-trained-policies) to evaluate the trained policy.

This visualizes the "car" agent operating within the simulation: rocking back and forth to gain momentum to overcome the mountain.
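
To see why rocking helps, here is a minimal, self-contained sketch that does not use Gym or RLlib. It assumes the classic Mountain Car dynamics and constants as commonly implemented for `MountainCar-v0` (treat the numbers as an assumption, not a quotation of the library), and compares an idle car with a hand-written "push in the direction of travel" rule. The function and policy names are hypothetical.

```python
import math

# Assumed constants, following the commonly used MountainCar-v0 dynamics.
FORCE, GRAVITY, MAX_SPEED = 0.001, 0.0025, 0.07
MIN_POSITION, MAX_POSITION, GOAL_POSITION = -1.2, 0.6, 0.5


def simulate(policy, start_position=-0.5, max_steps=200):
    """Run one episode; return (steps_taken, furthest_right_position)."""
    position, velocity = start_position, 0.0
    furthest = position
    for step in range(1, max_steps + 1):
        action = policy(position, velocity)  # 0 = push left, 1 = idle, 2 = push right
        velocity += (action - 1) * FORCE - GRAVITY * math.cos(3 * position)
        velocity = max(-MAX_SPEED, min(MAX_SPEED, velocity))
        position = max(MIN_POSITION, min(MAX_POSITION, position + velocity))
        if position == MIN_POSITION and velocity < 0:
            velocity = 0.0  # the left wall stops the car
        furthest = max(furthest, position)
        if position >= GOAL_POSITION:
            return step, furthest
    return max_steps, furthest


def idle_policy(position, velocity):
    return 1  # never push


def momentum_policy(position, velocity):
    # Push in the direction the car is already moving (left when at rest).
    return 2 if velocity > 0 else 0


for name, policy in [("idle", idle_policy), ("momentum", momentum_policy)]:
    steps, furthest = simulate(policy)
    print("{:8s} steps={:3d} furthest position={:+.3f}".format(name, steps, furthest))
```

The engine alone is too weak to climb the hill from the valley floor, so the idle car just settles around the bottom, while the momentum rule adds energy on every swing and travels much further up the slope. A trained policy has to discover a similar strategy from the reward signal alone.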

In [13]:
! rllib rollout \
    tmp/ppo/moun/checkpoint_20/checkpoint-20 \
    --config "{\"env\": \"MountainCar-v0\"}" \
    --run PPO \
    --steps 2000

2020-07-06 15:41:38,869	INFO resource_spec.py:212 -- Starting Ray with 3.47 GiB memory available for workers and up to 1.74 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-07-06 15:41:39,496	INFO services.py:1165 -- View the Ray dashboard at localhost:8266
2020-07-06 15:41:40,642	INFO trainer.py:585 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
2020-07-06 15:41:40,642	INFO trainer.py:612 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
2020-07-06 15:41:45,007	INFO trainable.py:423 -- Restored on 192.168.1.65 from checkpoint: tmp/ppo/moun/checkpoint_20/checkpoint-20
2020-07-06 15:41:45,007	INFO trainable.py:430 -- Current state after restoring: {'_iteration': 20, '_timesteps_total': None, '_time_total': 201.64923691749573, '_episodes_total': 1040}
Episode #0: reward: -200.0
Episode #1: reward: -200.0
Epi

2020-07-06 16:19:38,435	ERROR worker.py:1049 -- listen_error_messages_raylet: Connection closed by server.
2020-07-06 16:19:38,429	ERROR import_thread.py:93 -- ImportThread: Connection closed by server.
2020-07-06 16:19:38,427	ERROR worker.py:949 -- print_logs: Connection closed by server.


The rollout uses the checkpoint saved after iteration 20, evaluated for `2000` steps.
Modify the path to view other checkpoints.
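
The training output above shows checkpoints saved as `checkpoint_N/checkpoint-N` under the checkpoint root. Assuming that layout (it is what this run printed; other Ray versions may name checkpoints differently), a small helper can build the path for any iteration. `checkpoint_path` is a hypothetical name used only for this sketch.

```python
def checkpoint_path(root, iteration):
    """Build the checkpoint path for one training iteration.

    Assumes the `<root>/checkpoint_<n>/checkpoint-<n>` layout seen in
    this notebook's training output.
    """
    return "{}/checkpoint_{}/checkpoint-{}".format(
        root.rstrip("/"), iteration, iteration
    )


print(checkpoint_path("tmp/ppo/moun", 20))
```

Passing a different iteration number gives the path to substitute into the `rllib rollout` command.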

---

Finally, launch [TensorBoard](https://ray.readthedocs.io/en/latest/rllib-training.html#getting-started), then follow its instructions (copy and paste the URL it prints) to visualize key metrics from training with RLlib…

In [None]:
!pip install tensorflow
!tensorboard --logdir=$HOME/ray_results