# Tutorial ACS_UPB_LAB1: Running Sumo Simulations

__Credits: most of the credit for this notebook goes to https://github.com/flow-project/flow/tree/master/tutorials__

This tutorial walks through the process of running non-RL traffic simulations in Flow. Simulations of this form act as non-autonomous baselines and depict the behavior of human dynamics on a network. Similar simulations may also be used to evaluate the performance of hand-designed controllers on a network. This tutorial focuses primarily on the former use case, while an example of the latter may be found in `exercise07_controllers.ipynb`.

In this exercise, we simulate an initially perturbed single-lane ring road. We witness in simulation that, as time advances, the initial perturbations do not dissipate, but instead propagate and expand until vehicles are forced to periodically stop and accelerate. For more information on this behavior, we refer the reader to the following article [1].

## 1.1 Components of a Simulation
All simulations, both in the presence and absence of RL, require two components: a *network* and an *environment*. Networks describe the features of the transportation network used in simulation. This includes the positions and properties of nodes and edges constituting the lanes and junctions, as well as properties of the vehicles, traffic lights, inflows, etc. in the network. Environments, on the other hand, initialize, reset, and advance simulations, and act as the primary interface between the reinforcement learning algorithm and the network. Moreover, custom environments may be used to modify the dynamical features of a network.
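To make the initialize/reset/advance cycle concrete, here is a minimal sketch of the loop an environment supports. `ToyEnv`, its reward, and `rollout` are hypothetical stand-ins and not Flow classes; only the shape of the loop (reset once, step repeatedly, sum the rewards) mirrors the description above.

```python
class ToyEnv:
    """Hypothetical stand-in for an environment (not a Flow class)."""

    def __init__(self, horizon):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        # Initialize the simulation and return the first observation.
        self.t = 0
        return [0.0]

    def step(self, action):
        # Advance the simulation by one step.
        self.t += 1
        state = [float(self.t)]
        reward = -abs(action)          # toy reward: penalize large actions
        done = self.t >= self.horizon  # the episode ends at the horizon
        return state, reward, done, {}


def rollout(env, policy):
    """Run one episode and return the cumulative reward."""
    state = env.reset()
    total, done = 0.0, False
    while not done:
        state, reward, done, _ = env.step(policy(state))
        total += reward
    return total


episode_return = rollout(ToyEnv(horizon=5), policy=lambda state: 1.0)
print(episode_return)
```

The cumulative reward returned by such a loop is the kind of quantity reported as "return" in the experiment output later in this notebook.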

## 1.2 Setting up the environment of current lab (ENV1)
Load configurations for lab 1.

## 2. Setting up a Network
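Flow contains a plethora of pre-designed networks used to replicate highways, intersections, and merges in both closed and open settings. All of these networks are located in `flow/networks`. In this lab, the network class is not imported directly but read from the lab configuration `ENV`; as the output below shows, it resolves to `FigureEightNetwork` rather than the `RingNetwork` that would be used for a plain ring road.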

In [1]:
from flow.envs.nemodrive_lab import ENV2 as ENV

# from flow.networks.figure_eight import FigureEightNetwork
network_name = ENV["NETWORK"]
print(network_name.__name__)

FigureEightNetwork


This network, as well as all other networks in Flow, is parametrized by the following arguments: 
* name
* vehicles
* net_params
* initial_config
* traffic_lights

These parameters allow a single network to be recycled for a multitude of different network settings. For example, `RingNetwork` may be used to create ring roads of variable length with a variable number of lanes and vehicles.

### 2.1 Name
The `name` argument is a string variable depicting the name of the network. This has no effect on the type of network created.

In [2]:
name = network_name.__name__

### 2.2 VehicleParams
The `VehicleParams` class stores state information on all vehicles in the network. This class is used to identify the dynamical behavior of a vehicle and whether it is controlled by a reinforcement learning agent. Moreover, information pertaining to the observations and reward function can be collected from various get methods within this class.

The initial configuration of this class describes the number of vehicles in the network at the start of every simulation, as well as the properties of these vehicles. We begin by creating an empty `VehicleParams` object.

In [3]:
vehicles = ENV["VEHICLES"]()

# code in get_vehicles 
# from flow.core.params import VehicleParams

# vehicles = VehicleParams()

Once this object is created, vehicles may be introduced using the `add` method. This method specifies the types and quantities of vehicles at the start of a simulation rollout. For a description of the various arguments associated with the `add` method, we refer the reader to the following documentation ([VehicleParams.add](https://flow.readthedocs.io/en/latest/flow.core.html?highlight=vehicleparam#flow.core.params.VehicleParams)).

When adding vehicles, their dynamical behaviors may be specified either by the simulator (default), or by user-generated models. For longitudinal (acceleration) dynamics, several prominent car-following models are implemented in Flow. For this example, the acceleration behavior of all vehicles will be defined by the Intelligent Driver Model (IDM) [2].

In [4]:
# code in get_vehicles 
# from flow.controllers.car_following_models import IDMController
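As a rough illustration of what a car-following model computes, the sketch below implements the IDM acceleration formula from [2] as a standalone function. The function name and the default parameter values are assumptions chosen for illustration; they are not read from the lab configuration or from `IDMController`.

```python
import math


def idm_acceleration(v, v_lead, headway,
                     v0=30.0, T=1.0, a=1.0, b=1.5, delta=4, s0=2.0):
    """Intelligent Driver Model acceleration.

    v: own speed (m/s), v_lead: leader speed (m/s), headway: gap (m).
    Default parameter values are illustrative assumptions.
    """
    # Desired gap: jam distance + time-headway term + braking term.
    s_star = s0 + max(0.0, v * T + v * (v - v_lead) / (2 * math.sqrt(a * b)))
    return a * (1 - (v / v0) ** delta - (s_star / headway) ** 2)


free_road = idm_acceleration(v=0.0, v_lead=0.0, headway=1e6)
at_desired_speed = idm_acceleration(v=30.0, v_lead=30.0, headway=1e6)
closing_in = idm_acceleration(v=20.0, v_lead=10.0, headway=15.0)
print(free_road, at_desired_speed, closing_in)
```

On an empty road the vehicle accelerates at close to `a`, at the desired speed `v0` the acceleration is close to zero, and when approaching a slower leader at a short gap the model brakes.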

Another controller we define is for the vehicle's routing behavior. For closed networks, where the route of any vehicle is repeated, the `ContinuousRouter` controller is used to perpetually reroute all vehicles to the initial set route.

In [5]:
# code in get_vehicles 
# from flow.controllers.routing_controllers import ContinuousRouter
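The idea of perpetual rerouting can be sketched without Flow: whenever a vehicle reaches the last edge of its route, it is handed a fresh route that starts at that edge. The helper `reroute`, the edge names, and the `routes` table below are hypothetical and only illustrate the principle, not the `ContinuousRouter` implementation.

```python
def reroute(current_edge, route, routes_by_start_edge):
    """Return a new route if the vehicle reached the end of its route.

    Hypothetical helper: returns None when no rerouting is needed.
    """
    if not route:
        return None
    if current_edge == route[-1]:
        # Restart from the current edge so the vehicle keeps circulating.
        return routes_by_start_edge[current_edge]
    return None


# Four-edge closed loop; each route starts at a different edge.
edges = ["bottom", "right", "top", "left"]
routes = {e: edges[i:] + edges[:i] for i, e in enumerate(edges)}

route = routes["bottom"]  # bottom -> right -> top -> left
mid_route = reroute("right", route, routes)
at_end = reroute("left", route, routes)
print(mid_route, at_end)
```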

Finally, we add 22 vehicles of type "human" with the above acceleration and routing behavior to the `VehicleParams` object.

In [6]:
# (E.g. code in get_vehicles)
# vehicles.add("human",
#              acceleration_controller=(IDMController, {}),
#              routing_controller=(ContinuousRouter, {}),
#              num_vehicles=22)

### 2.3 NetParams

`NetParams` are network-specific parameters used to define the shape and properties of a network. Unlike most other parameters, `NetParams` may vary drastically depending on the specific network configuration, and accordingly most of its parameters are stored in `additional_params`. In order to determine which `additional_params` variables may be needed for a specific network, we refer to the `ADDITIONAL_NET_PARAMS` variable located in the network file.

In [7]:
# from flow.networks.ring import ADDITIONAL_NET_PARAMS

ADDITIONAL_NET_PARAMS = ENV["ADDITIONAL_NET_PARAMS"]

print(ADDITIONAL_NET_PARAMS)

{'radius_ring': 60, 'lanes': 2, 'speed_limit': 30, 'resolution': 40}


Printing the `ADDITIONAL_NET_PARAMS` dict, we see that the required parameters are:

* **radius_ring**: radius of the circular portions of the network
* **lanes**: number of lanes
* **speed_limit**: speed limit for all edges
* **resolution**: resolution of the curved portions of the network


At times, other inputs may be needed from `NetParams` to recreate proper network features/behavior. These requirements can be found in the network's documentation. For the network used here, no attributes are needed aside from the `additional_params` terms. Furthermore, for this exercise, we use the network's default parameters when creating the `NetParams` object.

In [8]:
from flow.core.params import NetParams

net_params = NetParams(additional_params=ADDITIONAL_NET_PARAMS)
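If a different geometry is wanted, one cautious pattern is to copy the defaults before changing them, so the shared dict is not mutated. The sketch below uses the values printed above; the name `custom_params` and the key check are illustrative additions, not part of the lab code.

```python
# Defaults as printed above; in the notebook this dict comes from ENV.
ADDITIONAL_NET_PARAMS = {
    "radius_ring": 60, "lanes": 2, "speed_limit": 30, "resolution": 40,
}

# Copy first so the shared defaults are not mutated in place.
custom_params = ADDITIONAL_NET_PARAMS.copy()
custom_params["lanes"] = 1

# Check that no required key was dropped or misspelled.
missing = set(ADDITIONAL_NET_PARAMS) - set(custom_params)
if missing:
    raise KeyError("missing net params: {}".format(sorted(missing)))

print(custom_params)
```

The copy could then be passed as `NetParams(additional_params=custom_params)` in place of the defaults.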

### 2.4 InitialConfig

`InitialConfig` specifies parameters that affect the positioning of vehicles in the network at the start of a simulation. These parameters can be used to limit the edges and number of lanes vehicles originally occupy, and provide a means of adding randomness to the starting positions of vehicles. In order to introduce an initial disturbance to the system of vehicles in the network, the lab configuration sets the `perturbation` term in `InitialConfig` (50 m in the output below) and uses random spacing.

In [9]:
from flow.core.params import InitialConfig
initial_config_param = ENV["INITIAL_CONFIG_PARAMS"]
print(initial_config_param)

initial_config = InitialConfig(**initial_config_param)

{'spacing': 'random', 'perturbation': 50}
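To give a feel for what a perturbation term does, the sketch below places vehicles evenly on a closed loop and then shifts each one by Gaussian noise. This is only an illustration for the evenly spaced case: the function name, the loop `length`, the `seed`, and the choice of a normal distribution are assumptions, and the lab configuration above uses `'random'` spacing instead.

```python
import random


def perturbed_start_positions(num_vehicles, length, perturbation, seed=0):
    """Evenly spaced positions on a closed loop plus Gaussian noise.

    Illustrative only: `length`, `seed`, and the use of a normal
    distribution are assumptions, not taken from the lab code.
    """
    rng = random.Random(seed)
    spacing = length / num_vehicles
    return [
        (i * spacing + rng.gauss(0, perturbation)) % length  # wrap around
        for i in range(num_vehicles)
    ]


positions = perturbed_start_positions(num_vehicles=22, length=230.0,
                                      perturbation=1.0)
print([round(p, 1) for p in positions[:5]])
```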


### 2.5 TrafficLightParams

`TrafficLightParams` are used to describe the positions and types of traffic lights in the network. These inputs are outside the scope of this tutorial, and instead are covered in `exercise06_traffic_lights.ipynb`. For our example, we create an empty `TrafficLightParams` object, thereby ensuring that none are placed on any nodes.

In [10]:
from flow.core.params import TrafficLightParams

traffic_lights = TrafficLightParams()

## 3. Setting up an Environment

Several environments in Flow exist to train autonomous agents of different forms (e.g. autonomous vehicles, traffic lights) to perform a variety of different tasks. These environments are often network- or task-specific; however, some can be deployed on an arbitrary set of networks as well. One such environment, `AccelEnv`, may be used to train a variable number of vehicles in a fully observable network with a *static* number of vehicles.

In [11]:
# from flow.envs.nemodrive_lab.env1_lab import LaneChangeAccelEnv1
env_name = ENV["ENVIRONMENT"]
print(env_name)

<class 'flow.envs.nemodrive_lab.env2_lab.LaneChangeAccelEnv2'>


Although we will not be training any autonomous agents in this exercise, the use of an environment allows us to view the cumulative reward simulation rollouts receive in the absence of autonomy.

Environments in Flow are parametrized by three components:
* `EnvParams`
* `SumoParams`
* `Network`

### 3.1 SumoParams
`SumoParams` specifies simulation-specific variables. These variables include the length of a simulation step (in seconds) and whether to render the GUI when running the experiment. For this example, we consider a simulation step length of 0.1s and activate the GUI.

Another useful parameter is `emission_path`, which specifies the path where the emission output files will be generated. These files contain a lot of information about the simulation, for instance the position and speed of each car at each time step. If you do not specify an emission path, the emission file will not be generated. More on this in Section 5.

In [12]:
from flow.core.params import SumoParams

sumo_params = SumoParams(sim_step=0.1, render=True, emission_path='data', restart_instance=True)

### 3.2 EnvParams

`EnvParams` specify environment and experiment-specific parameters that either affect the training process or the dynamics of various components within the network. Much like `NetParams`, the attributes associated with this parameter are mostly environment specific, and can be found in the environment's `ADDITIONAL_ENV_PARAMS` dictionary.

In [13]:
# from flow.envs.nemodrive_lab.env1_lab import ADDITIONAL_ENV1_PARAMS
ADDITIONAL_ENV_PARAMS = ENV["ADDITIONAL_ENV_PARAMS"]

print(ADDITIONAL_ENV_PARAMS)

{'max_accel': 3, 'max_decel': 3, 'lane_change_duration': 0, 'target_velocity': 10, 'sort_vehicles': False, 'forward_progress_gain': 0.1, 'collision_reward': -1, 'lane_change_reward': -0.1, 'frontal_collision_distance': 2.0, 'lateral_collision_distance': 3.0, 'action_space_box': False, 'pos_noise_std': [0.5, 2], 'pos_noise_steps_reset': 100, 'speed_noise_std': [0.2, 0.8], 'acc_noise_std': [0.2, 0.4]}


Printing the `ADDITIONAL_ENV_PARAMS` variable, we see that it contains several entries. One of them, "target_velocity", is used when computing the reward function associated with the environment. We use these default values when generating the `EnvParams` object.

In [14]:
from flow.core.params import EnvParams

env_params = EnvParams(additional_params=ADDITIONAL_ENV_PARAMS, horizon=ENV["HORIZON"])
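As an illustration of how a `target_velocity` entry can enter a reward, the sketch below scores a set of vehicle speeds by their distance from the target, normalized to the range [0, 1]. It is modeled on a desired-velocity style reward and is an assumption for illustration; the lab environments also expose entries such as `collision_reward` and `lane_change_reward`, so their actual reward is likely a combination of several terms.

```python
import math


def desired_velocity_reward(speeds, target_velocity):
    """Reward in [0, 1]: 1 when every vehicle drives at the target speed.

    Sketch only; assumes target_velocity > 0.
    """
    if not speeds:
        return 0.0
    # Normalizer: deviation norm when all vehicles are at a standstill.
    max_cost = math.sqrt(len(speeds)) * target_velocity
    cost = math.sqrt(sum((v - target_velocity) ** 2 for v in speeds))
    return max(max_cost - cost, 0.0) / max_cost


all_on_target = desired_velocity_reward([10.0, 10.0, 10.0], 10)
all_stopped = desired_velocity_reward([0.0, 0.0, 0.0], 10)
mixed = desired_velocity_reward([10.0, 5.0], 10)
print(all_on_target, all_stopped, round(mixed, 3))
```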

## 4. Setting up and Running the Experiment
Once the inputs to the network and environment classes are ready, we are ready to set up an `Experiment` object.

In [15]:
from flow.core.experiment import Experiment

These objects may be used to simulate rollouts in the absence of reinforcement learning agents, as well as acquire behaviors and rewards that may be used as a baseline with which to compare the performance of the learning agent. In this case, we choose to run our experiment for one rollout consisting of 3000 steps (300 s).

**Note**: When executing the code below, remember to click on the <img style="display:inline;" src="img/play_button.png"> Play button after the GUI is rendered.

In [16]:
# create the network object
network = network_name(name="ring_example",
                       vehicles=vehicles,
                       net_params=net_params,
                       initial_config=initial_config,
                       traffic_lights=traffic_lights)



In [34]:
# create the environment object
sumo_params.render = True
env = env_name(env_params, sumo_params, network)

# create the experiment object
exp = Experiment(env)
_ = exp.run(1, 3000, convert_to_csv=True)


FatalTraCIError: connection closed by SUMO

Run a still agent that always outputs the zero action `[0, 0]`.

In [20]:
sumo_params.render = False
env = env_name(env_params, sumo_params, network)

# create the experiment object
exp = Experiment(env)

rl_actions = lambda state: [0, 0]

_ = exp.run(1, 3000, convert_to_csv=True, rl_actions=rl_actions)

Round 0, return: -3000.0
Average, std return: -3000.0, 0.0
Average, std speed: 4.8421383321199025, 0.0


Run a random agent.

Use __FullExperiment__ to test an agent that expects _state, reward, done, info_.

In [25]:
from flow.core.experiment_with_reward import FullExperiment
import numpy as np

class RandomAgent():
    def __init__(self, env):
        self.action_space = env.action_space
        self.max_decel = env.env_params.additional_params["max_decel"]
        self.max_accel = env.env_params.additional_params["max_accel"]
        self.change_lane_step_freq = 1
        self.num_steps = 0
        
    def act(self, state, reward, done, info):
        self.num_steps += 1
        d = 0
        if self.num_steps % self.change_lane_step_freq == 0:
            d = np.random.randint(3)

        acc = np.random.uniform(-self.max_decel, self.max_accel)
        action =  np.array([acc, d])
        yield action

sumo_params.render = False
env = env_name(env_params, sumo_params, network)

exp = FullExperiment(env)

agent = RandomAgent(env)

_ = exp.run(10, 3000, convert_to_csv=True, rl_actions=agent.act)

Round 0, return: -213.70367132378917
Round 1, return: -1128.7147838285787
Round 2, return: -627.554034545858
Round 3, return: -681.9083110341325
Round 4, return: -119.14582699761802
Round 5, return: -446.7533413121279
Round 6, return: -425.7369412641068
Round 7, return: -475.6252823138319
Round 8, return: -716.9215183259731
Round 9, return: -841.8928955496672
Average, std return: -567.7956606495684, 282.592630864858
Average, std speed: 5.680333411899924, 0.7670237487883679
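One Python detail of the agent above is worth spelling out: because `act` ends with `yield`, calling it returns a generator, not the action itself. How `FullExperiment` consumes that generator is not shown in this notebook. The sketch below only demonstrates the plain-Python behavior with a hypothetical `YieldingAgent`.

```python
import random


class YieldingAgent:
    """Hypothetical agent whose act() uses yield, like RandomAgent above."""

    def __init__(self, max_accel, max_decel, seed=0):
        self.max_accel = max_accel
        self.max_decel = max_decel
        self.rng = random.Random(seed)

    def act(self, state, reward, done, info):
        acc = self.rng.uniform(-self.max_decel, self.max_accel)
        lane_change = self.rng.randrange(3)  # 0, 1 or 2
        yield [acc, lane_change]


agent = YieldingAgent(max_accel=3, max_decel=3)
result = agent.act(state=None, reward=0.0, done=False, info={})

# Calling act() only creates a generator; next() runs the body.
action = next(result)
print(type(result).__name__, action)
```

An agent written with `return` instead of `yield` would hand back the action directly, so the two styles are not interchangeable from the caller's point of view.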


### Results RandomRun:

Average, std return: -567.7956606495684, 282.592630864858

Average, std speed: 5.680333411899924, 0.7670237487883679


In [58]:
from flow.core.experiment_with_reward import FullExperiment
import numpy as np

class PIDAgent():
    def __init__(self, Kp, Ki, Kd, env):
        self.action_space = env.action_space
        self.max_decel = env.env_params.additional_params["max_decel"]
        self.max_accel = env.env_params.additional_params["max_accel"]
        self.change_lane_step_freq = 2
        self.num_steps = 0
        self.Ki = Ki
        self.Kp = Kp
        self.Kd = Kd
        self.vd = env.env_params.additional_params["target_velocity"]
        self.env = env
        
        self.distance_check = 6
        self.prev_speed = 0
        self.sum_err = 0
    @property   
    def vehicle_id(self):
        return "rl_0"
    @property
    def lane(self):
        return self.env.k.vehicle.get_lane(self.vehicle_id)

    def lane_change_check(self):
        
        closest_dist = self.env.k.vehicle.get_headway(self.vehicle_id)
        safe_distance = self.env.env_params.additional_params["frontal_collision_distance"] * self.distance_check
        
        
        myedge = self.env.k.vehicle.get_edge(self.vehicle_id)        
        nr_lanes = self.env.k.network.num_lanes(myedge)
        
        if (nr_lanes > 1) and (closest_dist > 0) and (closest_dist < safe_distance):
            #check other lanes 
            #import pdb; pdb.set_trace()
            otherlanes = self.env.k.vehicle.get_lane_headways(self.vehicle_id)
            otherlane = self.lane+1 if self.lane<nr_lanes-1 else self.lane-1
            dist = otherlanes[otherlane]
            
            if closest_dist < dist and dist>safe_distance:
                return 1 if self.lane == nr_lanes-1 else 2
       
        return 0 # no lane change
        
    def act(self, state, reward, done, info):
        self.num_steps += 1

        v = self.env.k.vehicle.get_speed(self.vehicle_id)
        
        dv = self.prev_speed - v
        
        self.sum_err += self.vd - v
        
        acc = self.Kp*(self.vd - v) + self.Kd*dv + self.Ki*self.sum_err
        
        acc = np.clip(acc, -self.max_decel, self.max_accel)
        
        self.prev_speed = v
        d = 0
        #import pdb; pdb.set_trace()
        if self.num_steps % self.change_lane_step_freq == 0:
            d = self.lane_change_check()

        action =  np.array([acc, d])
        yield action

sumo_params.render = False
env = env_name(env_params, sumo_params, network)

exp = FullExperiment(env)

agent = PIDAgent(env=env,
                Ki=0.004,
                Kp=0.2,
                Kd=0.3)

_ = exp.run(10, 3000, convert_to_csv=True, rl_actions=agent.act)

Round 0, return: 368.22066219162355
Round 1, return: 169.9672823130634
Round 2, return: 393.0464766466746
Round 3, return: 483.6021698201408
Round 4, return: 686.6617658491106
Round 5, return: 169.01411824244326
Round 6, return: 175.21081685515858
Round 7, return: 207.65770702442296
Round 8, return: 243.2925879221222
Round 9, return: 452.8956719968303
Average, std return: 334.95692588615896, 164.03261325573877
Average, std speed: 6.327870921677684, 0.22107667209877344
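To see the speed-control part of `PIDAgent` in isolation, the sketch below applies the same control law, with the same gains and acceleration limits, to an idealized vehicle whose speed simply integrates the commanded acceleration. The point-mass model, the function name, and the 0.1 s step are assumptions for illustration; in the lab the speed comes from the SUMO simulation and is affected by surrounding traffic.

```python
def pid_speed_trace(kp, ki, kd, target, steps, dt=0.1,
                    max_accel=3.0, max_decel=3.0):
    """Simulate the PID law on an idealized vehicle with dv/dt = accel.

    The point-mass vehicle model is an assumption for illustration.
    """
    v, prev_v, sum_err = 0.0, 0.0, 0.0
    trace = []
    for _ in range(steps):
        err = target - v
        sum_err += err      # integral term (per-step sum of errors)
        d_err = prev_v - v  # change in error since the last step
        acc = kp * err + kd * d_err + ki * sum_err
        acc = max(-max_decel, min(max_accel, acc))  # clip to the limits
        prev_v = v
        v += acc * dt       # integrate the acceleration
        trace.append(v)
    return trace


trace = pid_speed_trace(kp=0.2, ki=0.004, kd=0.3, target=10.0, steps=3000)
print(round(max(trace), 2), round(trace[-1], 2))
```

In this idealized setting the speed overshoots the target because of the integral term and then settles close to it; the average speeds reported above are lower, since the real vehicle also has to react to other traffic.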


### Results PID Run:

Average, std return: 334.95692588615896, 164.03261325573877

Average, std speed: 6.327870921677684, 0.22107667209877344

In [17]:
env_params

<flow.core.params.EnvParams at 0x7f9b01f7c5c0>

Re-running the `PIDAgent` cell above unchanged (same gains, 10 rollouts of 3000 steps) gives the following output:

Round 0, return: -194.25226637140568
Round 1, return: 645.1726348031408
Round 2, return: -8.057047273020647
Round 3, return: 127.11625496919676
Round 4, return: 131.45050575089874
Round 5, return: 75.54286546257498
Round 6, return: 675.3547626437601
Round 7, return: -341.11844498315025
Round 8, return: -774.20189448934
Round 9, return: 308.22323390325136
Average, std return: 64.52306044159063, 414.537672709776
Average, std speed: 7.426823532001501, 0.2581054114423834


### Results PID Run (Env2):

Average, std return: 64.52306044159063, 414.537672709776

Average, std speed: 7.426823532001501, 0.2581054114423834



Feel free to experiment with all these problems and more!

## Bibliography
[1] Sugiyama, Yuki, et al. "Traffic jams without bottlenecks—experimental evidence for the physical mechanism of the formation of a jam." New journal of physics 10.3 (2008): 033001.

[2] Treiber, Martin, Ansgar Hennecke, and Dirk Helbing. "Congested traffic states in empirical observations and microscopic simulations." Physical review E 62.2 (2000): 1805.

## 5 Setting up Flow Parameters

RLlib experiments generate a `params.json` file for each experiment run. For RLlib experiments, the parameters defining the Flow network and environment must be stored as well. As such, in this section we define the dictionary `flow_params`, which contains the variables required by the utility function `make_create_env`. `make_create_env` is a higher-order function which returns a function `create_env` that initializes a Gym environment corresponding to the Flow network specified.

In [59]:
# Creating flow_params. Make sure the dictionary keys are as specified. 
sumo_params.render = False
sumo_params.print_warnings=False
flow_params = dict(
    # name of the experiment
    exp_tag=name,
    # name of the flow environment the experiment is running on
    env_name=env_name,
    # name of the network class the experiment uses
    network=network_name,
    # simulator that is used by the experiment
    simulator='traci',
    # sumo-related parameters (see flow.core.params.SumoParams)
    sim=sumo_params,
    # environment related parameters (see flow.core.params.EnvParams)
    env=env_params,
    # network-related parameters (see flow.core.params.NetParams and
    # the network's documentation or ADDITIONAL_NET_PARAMS component)
    net=net_params,
    # vehicles to be placed in the network at the start of a rollout
    # (see flow.core.params.VehicleParams)
    veh=vehicles,
    # (optional) parameters affecting the positioning of vehicles upon 
    # initialization/reset (see flow.core.params.InitialConfig)
    initial=initial_config
)

In [60]:
flow_params


{'exp_tag': 'FigureEightNetwork',
 'env_name': flow.envs.nemodrive_lab.env1_lab.LaneChangeAccelEnv1,
 'network': flow.networks.figure_eight.FigureEightNetwork,
 'simulator': 'traci',
 'sim': <flow.core.params.SumoParams at 0x7f4eb0406f28>,
 'env': <flow.core.params.EnvParams at 0x7f4eb0412128>,
 'net': <flow.core.params.NetParams at 0x7f4eb0406780>,
 'veh': <flow.core.params.VehicleParams at 0x7f4ee4654390>,
 'initial': <flow.core.params.InitialConfig at 0x7f4eb0406cc0>}
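The higher-order pattern behind `make_create_env` can be sketched in plain Python: an outer function receives the parameters and returns both a zero-argument factory and the name under which it should be registered. Everything below is hypothetical (`make_create_env_sketch`, the `-v0` style name, and the dict standing in for a Gym environment); it is not Flow's implementation.

```python
def make_create_env_sketch(params, version=0):
    """Return (create_env, env_name) for the given parameters.

    Hypothetical illustration of the higher-order pattern; it builds
    a plain dict instead of a Gym environment.
    """
    env_name = "{}-v{}".format(params["env_name"], version)

    def create_env(*_):
        # The closure captures `params`, so callers need no arguments.
        return {"name": env_name, "network": params["network"]}

    return create_env, env_name


create_env, gym_name = make_create_env_sketch(
    {"env_name": "ToyEnv", "network": "ToyNetwork"}, version=0)
env = create_env()
print(gym_name, env)
```

Returning a factory instead of an environment instance lets each parallel worker build its own copy of the environment on demand.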

## 4 Running RL experiments in Ray

### 4.1 Import 

First, we must import modules required to run experiments in Ray. The `json` package is required to store the Flow experiment parameters in the `params.json` file, as is `FlowParamsEncoder`. Ray-related imports are required: the PPO algorithm agent, `ray.tune`'s experiment runner, and environment helper methods `register_env` and `make_create_env`.

In [24]:
import json

import ray
try:
    from ray.rllib.agents.agent import get_agent_class
except ImportError:
    from ray.rllib.agents.registry import get_agent_class
from ray.tune import run_experiments
from ray.tune.registry import register_env

from flow.utils.registry import make_create_env
from flow.utils.rllib import FlowParamsEncoder



### 4.2 Initializing Ray
Here, we initialize Ray and experiment-based constant variables specifying parallelism in the experiment as well as experiment batch size in terms of number of rollouts.

In [25]:
# number of parallel workers
N_CPUS = 8
# number of rollouts per training iteration
N_ROLLOUTS = 20

ray.init(num_cpus=N_CPUS)

2019-12-07 19:32:16,913	INFO node.py:498 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-12-07_19-32-16_912746_21740/logs.
2019-12-07 19:32:17,078	INFO services.py:409 -- Waiting for redis server at 127.0.0.1:45340 to respond...
2019-12-07 19:32:17,231	INFO services.py:409 -- Waiting for redis server at 127.0.0.1:14915 to respond...
2019-12-07 19:32:17,235	INFO services.py:809 -- Starting Redis shard with 1.67 GB max memory.
2019-12-07 19:32:17,352	INFO node.py:512 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-12-07_19-32-16_912746_21740/logs.
2019-12-07 19:32:17,369	INFO services.py:1475 -- Starting the Plasma object store with 2.5 GB memory using /dev/shm.


{'node_ip_address': '192.168.1.188',
 'redis_address': '192.168.1.188:45340',
 'object_store_address': '/tmp/ray/session_2019-12-07_19-32-16_912746_21740/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2019-12-07_19-32-16_912746_21740/sockets/raylet',
 'webui_url': None,
 'session_dir': '/tmp/ray/session_2019-12-07_19-32-16_912746_21740'}

### 4.3 Configuration and Setup
Here, we copy and modify the default configuration for the [PPO algorithm](https://arxiv.org/abs/1707.06347). The agent has the number of parallel workers specified, a batch size corresponding to `N_ROLLOUTS` rollouts (each of which has length `HORIZON` steps), a discount rate $\gamma$ of 0.999, two hidden layers of size 16, uses Generalized Advantage Estimation, $\lambda$ of 0.97, and other parameters as set below.
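For readers unfamiliar with Generalized Advantage Estimation, the following is a minimal standalone sketch of the recursion that the $\gamma$ and $\lambda$ parameters control. It is written from the standard definition and is not RLlib's code; the function name and argument layout are assumptions.

```python
def gae_advantages(rewards, values, last_value, gamma=0.999, lam=0.97):
    """Generalized Advantage Estimation for one rollout.

    rewards[t] and values[t] are per-step; last_value bootstraps the
    value of the state after the final step (0.0 if terminal).
    """
    advantages = [0.0] * len(rewards)
    next_value, running = last_value, 0.0
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


adv = gae_advantages(rewards=[1.0, 1.0, 1.0], values=[0.0, 0.0, 0.0],
                     last_value=0.0, gamma=1.0, lam=1.0)
print(adv)
```

With $\gamma = \lambda = 1$ and zero value estimates, as in the usage line, each advantage reduces to the sum of the remaining rewards. Smaller $\lambda$ values rely more on the value function and less on long reward sums.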

Once `config` contains the desired parameters, a JSON string corresponding to the `flow_params` specified in Section 5 is generated. The `FlowParamsEncoder` maps objects to string representations so that the experiment can be reproduced later. That string representation is stored within the `env_config` section of the `config` dictionary. Later, `config` is written out to the file `params.json`.
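The object-to-string mapping can be illustrated with a small custom `json.JSONEncoder`. The encoder and the toy classes below are hypothetical; `FlowParamsEncoder` itself may serialize objects differently.

```python
import json


class ParamsEncoderSketch(json.JSONEncoder):
    """Hypothetical encoder; FlowParamsEncoder itself may differ."""

    def default(self, obj):
        # Classes are stored by name so they can be looked up on reload.
        if isinstance(obj, type):
            return obj.__name__
        # Plain parameter objects are stored as their attribute dict.
        if hasattr(obj, "__dict__"):
            return vars(obj)
        return super().default(obj)


class ToyNetwork:  # stand-in for a network class
    pass


class ToyParams:  # stand-in for a parameter object
    def __init__(self):
        self.sim_step = 0.1
        self.render = False


params = {"network": ToyNetwork, "sim": ToyParams()}
encoded = json.dumps(params, cls=ParamsEncoderSketch, sort_keys=True)
print(encoded)
```

Without the `cls` argument, `json.dumps` would raise a `TypeError` on the class and the parameter object, which is why a dedicated encoder is needed.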

Next, we call `make_create_env` and pass in the `flow_params` to return a function we can use to register our Flow environment with Gym. 

In [26]:
# The algorithm or model to train. This may refer to the name of a
# built-in algorithm (e.g. RLlib's DQN or PPO), or a user-defined
# trainable function or class registered in the tune registry.
alg_run = "PPO"
HORIZON = 100

agent_cls = get_agent_class(alg_run)
config = agent_cls._default_config.copy()
config["num_workers"] = N_CPUS - 1  # number of parallel workers
config["train_batch_size"] = HORIZON * N_ROLLOUTS  # batch size
config["gamma"] = 0.999  # discount rate
config["model"].update({"fcnet_hiddens": [16, 16]})  # size of hidden layers in network
config["use_gae"] = True  # using generalized advantage estimation
config["lambda"] = 0.97  # GAE lambda parameter
config["sgd_minibatch_size"] = min(16 * 1024, config["train_batch_size"])  # SGD minibatch size
config["kl_target"] = 0.02  # target KL divergence
config["num_sgd_iter"] = 500  # number of SGD iterations
config["horizon"] = HORIZON  # rollout horizon

# save the flow params for replay
flow_json = json.dumps(flow_params, cls=FlowParamsEncoder, sort_keys=True,
                       indent=4)  # generating a string version of flow_params
config['env_config']['flow_params'] = flow_json  # adding the flow_params to config dict
config['env_config']['run'] = alg_run

# Call the utility function make_create_env to be able to 
# register the Flow env for this experiment
create_env, gym_name = make_create_env(params=flow_params, version=0)

# Register as rllib env with Gym
register_env(gym_name, create_env)
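The sizes implied by this configuration can be checked with a few lines of arithmetic, using only the constants defined in the cells above.

```python
N_CPUS = 8
N_ROLLOUTS = 20
HORIZON = 100

num_workers = N_CPUS - 1                 # rollout workers
train_batch_size = HORIZON * N_ROLLOUTS  # steps collected per iteration
sgd_minibatch_size = min(16 * 1024, train_batch_size)

# Steps sampled after 15 training iterations.
steps_after_15 = 15 * train_batch_size

print(num_workers, train_batch_size, sgd_minibatch_size, steps_after_15)
```

These numbers line up with the training log further down, which reports `num_healthy_workers: 7`, `episodes_this_iter: 20`, and `num_steps_sampled: 30000` at `iterations_since_restore: 15`. Because the training batch is smaller than 16 * 1024 steps, each SGD minibatch here is the full batch.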

### 4.4 Running Experiments

Here, we use the `run_experiments` function from `ray.tune`. The function takes a dictionary with one key, a name corresponding to the experiment, and one value, itself a dictionary containing parameters for training.

In [27]:
trials = run_experiments({
    flow_params["exp_tag"]: {
        "run": alg_run,
        "env": gym_name,
        "config": {
            **config
        },
        "checkpoint_freq": 1,  # number of iterations between checkpoints
        "checkpoint_at_end": True,  # generate a checkpoint at the end
        "max_failures": 999,
        "stop": {  # stopping conditions
            "training_iteration": 500,  # number of iterations to stop after
        },
    },
})

2019-12-07 19:32:48,152	INFO trial_runner.py:176 -- Starting a new experiment.


== Status ==
Using FIFO scheduling algorithm.
Resources requested: 0/8 CPUs, 0/0 GPUs
Memory usage on this node: 3.2/8.3 GB



2019-12-07 19:32:48,833	ERROR log_sync.py:34 -- Log sync requires cluster to be setup with `ray up`.


== Status ==
Using FIFO scheduling algorithm.
Resources requested: 8/8 CPUs, 0/0 GPUs
Memory usage on this node: 3.2/8.3 GB
Result logdir: /home/osboxes/ray_results/FigureEightNetwork
Number of trials: 1 ({'RUNNING': 1})
RUNNING trials:
 - PPO_LaneChangeAccelEnv1-v0_0:	RUNNING

(pid=23396) 2019-12-07 19:33:00,577	INFO rollout_worker.py:319 -- Creating policy evaluation worker 0 on CPU (please ignore any CUDA init errors)
(pid=23396) 2019-12-07 19:33:00.581416: I tensorflow/core/platform/cpu_feature_guard.cc:14




Result for PPO_LaneChangeAccelEnv1-v0_0:
  custom_metrics: {}
  date: 2019-12-07_19-58-20
  done: false
  episode_len_mean: 100.0
  episode_reward_max: 69.50867156616505
  episode_reward_mean: 35.634418287290224
  episode_reward_min: -3.355474409195453
  episodes_this_iter: 20
  episodes_total: 300
  experiment_id: 1b1c2dbd7f554809834cc3d9e101cde6
  hostname: osboxes
  info:
    grad_time_ms: 9760.744
    learner:
      default_policy:
        cur_kl_coeff: 0.10000000149011612
        cur_lr: 4.999999873689376e-05
        entropy: 1.687409520149231
        entropy_coeff: 0.0
        kl: 0.01927327737212181
        policy_loss: -0.023318080231547356
        total_loss: 111.45574951171875
        vf_explained_var: 0.00014859437942504883
        vf_loss: 111.47715759277344
    load_time_ms: 5.43
    num_steps_sampled: 30000
    num_steps_trained: 30000
    sample_time_ms: 74735.513
    update_time_ms: 62.578
  iterations_since_restore: 15
  node_ip: 192.168.1.188
  num_healthy_workers: 7



Result for PPO_LaneChangeAccelEnv1-v0_0:
  custom_metrics: {}
  date: 2019-12-07_20-04-24
  done: false
  episode_len_mean: 100.0
  episode_reward_max: 73.98961190383628
  episode_reward_mean: 41.657861975101724
  episode_reward_min: -10.32828043686163
  episodes_this_iter: 20
  episodes_total: 360
  experiment_id: 1b1c2dbd7f554809834cc3d9e101cde6
  hostname: osboxes
  info:
    grad_time_ms: 13436.795
    learner:
      default_policy:
        cur_kl_coeff: 0.10000000149011612
        cur_lr: 4.999999873689376e-05
        entropy: 1.6526142358779907
        entropy_coeff: 0.0
        kl: 0.02459453046321869
        policy_loss: -0.031037544831633568
        total_loss: 140.8939208984375
        vf_explained_var: 4.035234451293945e-05
        vf_loss: 140.92250061035156
    load_time_ms: 6.003
    num_steps_sampled: 36000
    num_steps_trained: 36000
    sample_time_ms: 79629.71
    update_time_ms: 57.017
  iterations_since_restore: 18
  node_ip: 192.168.1.188
  num_healthy_workers: 7




== Status ==
Using FIFO scheduling algorithm.
Resources requested: 8/8 CPUs, 0/0 GPUs
Memory usage on this node: 5.8/8.3 GB
Result logdir: /home/osboxes/ray_results/FigureEightNetwork
Number of trials: 1 ({'RUNNING': 1})
RUNNING trials:
 - PPO_LaneChangeAccelEnv1-v0_0:	RUNNING, [8 CPUs, 0 GPUs], [pid=23396], 1828 s, 18 iter, 36000 ts, 41.7 rew





Result for PPO_LaneChangeAccelEnv1-v0_0:
  custom_metrics: {}
  date: 2019-12-07_20-07-26
  done: false
  episode_len_mean: 100.0
  episode_reward_max: 75.3036598274489
  episode_reward_mean: 43.180380028661894
  episode_reward_min: -13.505895989818127
  episodes_this_iter: 20
  episodes_total: 380
  experiment_id: 1b1c2dbd7f554809834cc3d9e101cde6
  hostname: osboxes
  info:
    grad_time_ms: 14852.505
    learner:
      default_policy:
        cur_kl_coeff: 0.10000000149011612
        cur_lr: 4.999999873689376e-05
        entropy: 1.78208589553833
        entropy_coeff: 0.0
        kl: 0.06268581002950668
        policy_loss: -0.04571098834276199
        total_loss: 152.60423278808594
        vf_explained_var: 7.861852645874023e-05
        vf_loss: 152.64369201660156
    load_time_ms: 6.824
    num_steps_sampled: 38000
    num_steps_trained: 38000
    sample_time_ms: 89550.569
    update_time_ms: 94.39
  iterations_since_restore: 19
  node_ip: 192.168.1.188
  num_healthy_workers: 7
  



== Status ==
Using FIFO scheduling algorithm.
Resources requested: 8/8 CPUs, 0/0 GPUs
Memory usage on this node: 5.5/8.3 GB
Result logdir: /home/osboxes/ray_results/FigureEightNetwork
Number of trials: 1 ({'RUNNING': 1})
RUNNING trials:
 - PPO_LaneChangeAccelEnv1-v0_0:	RUNNING, [8 CPUs, 0 GPUs], [pid=23396], 2009 s, 19 iter, 38000 ts, 43.2 rew

Result for PPO_LaneChangeAccelEnv1-v0_0:
  custom_metrics: {}
  date: 2019-12-07_20-10-23
  done: false
  episode_len_mean: 100.0
  episode_reward_max: 79.4252730148665
  episode_reward_mean: 46.03513157153842
  episode_reward_min: -13.505895989818127
  episodes_this_iter: 20
  episodes_total: 400
  experiment_id: 1b1c2dbd7f554809834cc3d9e101cde6
  hostname: osboxes
  info:
    grad_time_ms: 16019.727
    learner:
      default_policy:
        cur_kl_coeff: 0.15000000596046448
        cur_lr: 4.999999873689376e-05
        entropy: 1.5401828289031982
        entropy_coeff: 0.0
        kl: 0.022949233651161194
        policy_loss: -0.0280502736568



Result for PPO_LaneChangeAccelEnv1-v0_0:
  custom_metrics: {}
  date: 2019-12-07_20-15-44
  done: false
  episode_len_mean: 100.0
  episode_reward_max: 79.4252730148665
  episode_reward_mean: 51.38459438736306
  episode_reward_min: -13.505895989818127
  episodes_this_iter: 20
  episodes_total: 440
  experiment_id: 1b1c2dbd7f554809834cc3d9e101cde6
  hostname: osboxes
  info:
    grad_time_ms: 19236.471
    learner:
      default_policy:
        cur_kl_coeff: 0.15000000596046448
        cur_lr: 4.999999873689376e-05
        entropy: 1.5613056421279907
        entropy_coeff: 0.0
        kl: 0.018155183643102646
        policy_loss: -0.021425001323223114
        total_loss: 210.87547302246094
        vf_explained_var: 1.7881393432617188e-07
        vf_loss: 210.89414978027344
    load_time_ms: 8.177
    num_steps_sampled: 44000
    num_steps_trained: 44000
    sample_time_ms: 114083.767
    update_time_ms: 128.513
  iterations_since_restore: 22
  node_ip: 192.168.1.188
  num_healthy_worker



== Status ==
Using FIFO scheduling algorithm.
Resources requested: 8/8 CPUs, 0/0 GPUs
Memory usage on this node: 5.2/8.3 GB
Result logdir: /home/osboxes/ray_results/FigureEightNetwork
Number of trials: 1 ({'RUNNING': 1})
RUNNING trials:
 - PPO_LaneChangeAccelEnv1-v0_0:	RUNNING, [8 CPUs, 0 GPUs], [pid=23396], 2507 s, 22 iter, 44000 ts, 51.4 rew

Result for PPO_LaneChangeAccelEnv1-v0_0:
  custom_metrics: {}
  date: 2019-12-07_20-18-34
  done: false
  episode_len_mean: 100.0
  episode_reward_max: 79.4252730148665
  episode_reward_mean: 54.07624869744975
  episode_reward_min: -13.505895989818127
  episodes_this_iter: 20
  episodes_total: 460
  experiment_id: 1b1c2dbd7f554809834cc3d9e101cde6
  hostname: osboxes
  info:
    grad_time_ms: 21285.376
    learner:
      default_policy:
        cur_kl_coeff: 0.15000000596046448
        cur_lr: 4.999999873689376e-05
        entropy: 1.5739854574203491
        entropy_coeff: 0.0
        kl: 0.02502565085887909
        policy_loss: -0.03397645801305



Result for PPO_LaneChangeAccelEnv1-v0_0:
  custom_metrics: {}
  date: 2019-12-07_20-22-14
  done: false
  episode_len_mean: 100.0
  episode_reward_max: 79.4252730148665
  episode_reward_mean: 53.5694804593051
  episode_reward_min: -6.308523354474603
  episodes_this_iter: 20
  episodes_total: 480
  experiment_id: 1b1c2dbd7f554809834cc3d9e101cde6
  hostname: osboxes
  info:
    grad_time_ms: 21724.106
    learner:
      default_policy:
        cur_kl_coeff: 0.15000000596046448
        cur_lr: 4.999999873689376e-05
        entropy: 1.6964528560638428
        entropy_coeff: 0.0
        kl: 0.026512622833251953
        policy_loss: -0.030397556722164154
        total_loss: 135.02236938476562
        vf_explained_var: 2.390146255493164e-05
        vf_loss: 135.04879760742188
    load_time_ms: 9.375
    num_steps_sampled: 48000
    num_steps_trained: 48000
    sample_time_ms: 130378.14
    update_time_ms: 165.962
  iterations_since_restore: 24
  node_ip: 192.168.1.188
  num_healthy_workers: 7



Result for PPO_LaneChangeAccelEnv1-v0_0:
  custom_metrics: {}
  date: 2019-12-07_20-25-05
  done: false
  episode_len_mean: 100.0
  episode_reward_max: 77.3488925976265
  episode_reward_mean: 54.31941474793829
  episode_reward_min: -6.308523354474603
  episodes_this_iter: 20
  episodes_total: 500
  experiment_id: 1b1c2dbd7f554809834cc3d9e101cde6
  hostname: osboxes
  info:
    grad_time_ms: 23088.26
    learner:
      default_policy:
        cur_kl_coeff: 0.15000000596046448
        cur_lr: 4.999999873689376e-05
        entropy: 1.489917516708374
        entropy_coeff: 0.0
        kl: 0.025259025394916534
        policy_loss: -0.03300931677222252
        total_loss: 203.27394104003906
        vf_explained_var: -4.5299530029296875e-06
        vf_loss: 203.30316162109375
    load_time_ms: 12.446
    num_steps_sampled: 50000
    num_steps_trained: 50000
    sample_time_ms: 136968.725
    update_time_ms: 168.816
  iterations_since_restore: 25
  node_ip: 192.168.1.188
  num_healthy_workers:



Result for PPO_LaneChangeAccelEnv1-v0_0:
  custom_metrics: {}
  date: 2019-12-07_21-10-05
  done: false
  episode_len_mean: 100.0
  episode_reward_max: 84.24218977878039
  episode_reward_mean: 47.08847134731819
  episode_reward_min: -18.980767290147917
  episodes_this_iter: 20
  episodes_total: 1320
  experiment_id: 1b1c2dbd7f554809834cc3d9e101cde6
  hostname: osboxes
  info:
    grad_time_ms: 7631.684
    learner:
      default_policy:
        cur_kl_coeff: 0.5062500238418579
        cur_lr: 4.999999873689376e-05
        entropy: 2.459232807159424
        entropy_coeff: 0.0
        kl: 0.014644804410636425
        policy_loss: -0.025912748649716377
        total_loss: 183.94227600097656
        vf_explained_var: 5.960464477539063e-08
        vf_loss: 183.96072387695312
    load_time_ms: 2.911
    num_steps_sampled: 132000
    num_steps_trained: 132000
    sample_time_ms: 56363.599
    update_time_ms: 50.965
  iterations_since_restore: 66
  node_ip: 192.168.1.188
  num_healthy_workers:



Result for PPO_LaneChangeAccelEnv1-v0_0:
  custom_metrics: {}
  date: 2019-12-08_00-05-34
  done: false
  episode_len_mean: 100.0
  episode_reward_max: 88.8500167605849
  episode_reward_mean: 45.916642866181164
  episode_reward_min: -20.426864345984914
  episodes_this_iter: 20
  episodes_total: 4580
  experiment_id: 1b1c2dbd7f554809834cc3d9e101cde6
  hostname: osboxes
  info:
    grad_time_ms: 7449.75
    learner:
      default_policy:
        cur_kl_coeff: 0.0721626877784729
        cur_lr: 4.999999873689376e-05
        entropy: 2.1943624019622803
        entropy_coeff: 0.0
        kl: 0.01371838804334402
        policy_loss: -0.0180059801787138
        total_loss: 218.78656005859375
        vf_explained_var: 5.960464477539063e-08
        vf_loss: 218.8035888671875
    load_time_ms: 3.252
    num_steps_sampled: 458000
    num_steps_trained: 458000
    sample_time_ms: 57769.27
    update_time_ms: 55.378
  iterations_since_restore: 229
  node_ip: 192.168.1.188
  num_healthy_workers: 7
 



== Status ==
Using FIFO scheduling algorithm.
Resources requested: 8/8 CPUs, 0/0 GPUs
Memory usage on this node: 5.7/8.3 GB
Result logdir: /home/osboxes/ray_results/FigureEightNetwork
Number of trials: 1 ({'RUNNING': 1})
RUNNING trials:
 - PPO_LaneChangeAccelEnv1-v0_0:	RUNNING, [8 CPUs, 0 GPUs], [pid=23396], 16287 s, 229 iter, 458000 ts, 45.9 rew

Result for PPO_LaneChangeAccelEnv1-v0_0:
  custom_metrics: {}
  date: 2019-12-08_00-06-40
  done: false
  episode_len_mean: 100.0
  episode_reward_max: 88.8500167605849
  episode_reward_mean: 49.151862089010095
  episode_reward_min: -16.64669317227277
  episodes_this_iter: 20
  episodes_total: 4600
  experiment_id: 1b1c2dbd7f554809834cc3d9e101cde6
  hostname: osboxes
  info:
    grad_time_ms: 7641.178
    learner:
      default_policy:
        cur_kl_coeff: 0.0721626877784729
        cur_lr: 4.999999873689376e-05
        entropy: 2.1763038635253906
        entropy_coeff: 0.0
        kl: 0.01974409632384777
        policy_loss: -0.015180132351



Result for PPO_LaneChangeAccelEnv1-v0_0:
  custom_metrics: {}
  date: 2019-12-08_00-29-34
  done: false
  episode_len_mean: 100.0
  episode_reward_max: 84.06426507077151
  episode_reward_mean: 44.51797312463662
  episode_reward_min: -26.634411635765478
  episodes_this_iter: 20
  episodes_total: 5020
  experiment_id: 1b1c2dbd7f554809834cc3d9e101cde6
  hostname: osboxes
  info:
    grad_time_ms: 7549.051
    learner:
      default_policy:
        cur_kl_coeff: 0.09133090078830719
        cur_lr: 4.999999873689376e-05
        entropy: 2.62837815284729
        entropy_coeff: 0.0
        kl: 1.1308518648147583
        policy_loss: 0.15088513493537903
        total_loss: 194.2867431640625
        vf_explained_var: 5.960464477539063e-08
        vf_loss: 194.03253173828125
    load_time_ms: 3.719
    num_steps_sampled: 502000
    num_steps_trained: 502000
    sample_time_ms: 57809.635
    update_time_ms: 57.224
  iterations_since_restore: 251
  node_ip: 192.168.1.188
  num_healthy_workers: 7
 



Result for PPO_LaneChangeAccelEnv1-v0_0:
  custom_metrics: {}
  date: 2019-12-08_01-25-38
  done: false
  episode_len_mean: 100.0
  episode_reward_max: 89.9668034167035
  episode_reward_mean: 46.83074935128216
  episode_reward_min: -20.624290843245756
  episodes_this_iter: 20
  episodes_total: 6040
  experiment_id: 1b1c2dbd7f554809834cc3d9e101cde6
  hostname: osboxes
  info:
    grad_time_ms: 8339.282
    learner:
      default_policy:
        cur_kl_coeff: 0.08787578344345093
        cur_lr: 4.999999873689376e-05
        entropy: 2.4306716918945312
        entropy_coeff: 0.0
        kl: 0.010581533424556255
        policy_loss: -0.012990077957510948
        total_loss: 232.55853271484375
        vf_explained_var: -1.1920928955078125e-07
        vf_loss: 232.570556640625
    load_time_ms: 3.553
    num_steps_sampled: 604000
    num_steps_trained: 604000
    sample_time_ms: 58209.7
    update_time_ms: 52.904
  iterations_since_restore: 302
  node_ip: 192.168.1.188
  num_healthy_workers:



Result for PPO_LaneChangeAccelEnv1-v0_0:
  custom_metrics: {}
  date: 2019-12-08_01-52-03
  done: false
  episode_len_mean: 100.0
  episode_reward_max: 87.67255446521061
  episode_reward_mean: 48.298916359649176
  episode_reward_min: 3.4091422524439325
  episodes_this_iter: 20
  episodes_total: 6520
  experiment_id: 1b1c2dbd7f554809834cc3d9e101cde6
  hostname: osboxes
  info:
    grad_time_ms: 7652.92
    learner:
      default_policy:
        cur_kl_coeff: 0.8445600867271423
        cur_lr: 4.999999873689376e-05
        entropy: 2.1132476329803467
        entropy_coeff: 0.0
        kl: 0.006247576791793108
        policy_loss: -0.001950915320776403
        total_loss: 214.23251342773438
        vf_explained_var: 0.0
        vf_loss: 214.2292022705078
    load_time_ms: 3.154
    num_steps_sampled: 652000
    num_steps_trained: 652000
    sample_time_ms: 58225.473
    update_time_ms: 57.687
  iterations_since_restore: 326
  node_ip: 192.168.1.188
  num_healthy_workers: 7
  off_policy_es



Result for PPO_LaneChangeAccelEnv1-v0_0:
  custom_metrics: {}
  date: 2019-12-08_01-53-11
  done: false
  episode_len_mean: 100.0
  episode_reward_max: 87.67255446521061
  episode_reward_mean: 49.63425567681815
  episode_reward_min: 4.756393778155246
  episodes_this_iter: 20
  episodes_total: 6540
  experiment_id: 1b1c2dbd7f554809834cc3d9e101cde6
  hostname: osboxes
  info:
    grad_time_ms: 7657.375
    learner:
      default_policy:
        cur_kl_coeff: 0.42228004336357117
        cur_lr: 4.999999873689376e-05
        entropy: 2.7014074325561523
        entropy_coeff: 0.0
        kl: 0.0029405783861875534
        policy_loss: -0.008713132701814175
        total_loss: 221.19195556640625
        vf_explained_var: 0.0
        vf_loss: 221.1994171142578
    load_time_ms: 3.161
    num_steps_sampled: 654000
    num_steps_trained: 654000
    sample_time_ms: 58187.935
    update_time_ms: 74.229
  iterations_since_restore: 327
  node_ip: 192.168.1.188
  num_healthy_workers: 7
  off_policy_e



Result for PPO_LaneChangeAccelEnv1-v0_0:
  custom_metrics: {}
  date: 2019-12-08_02-13-05
  done: false
  episode_len_mean: 100.0
  episode_reward_max: 89.45290085672285
  episode_reward_mean: 42.55227232485717
  episode_reward_min: -34.69760614389792
  episodes_this_iter: 20
  episodes_total: 6900
  experiment_id: 1b1c2dbd7f554809834cc3d9e101cde6
  hostname: osboxes
  info:
    grad_time_ms: 7937.466
    learner:
      default_policy:
        cur_kl_coeff: 0.08907469362020493
        cur_lr: 4.999999873689376e-05
        entropy: 2.637213706970215
        entropy_coeff: 0.0
        kl: 0.12325534224510193
        policy_loss: 0.03857117518782616
        total_loss: 219.1698760986328
        vf_explained_var: 5.960464477539063e-08
        vf_loss: 219.1202850341797
    load_time_ms: 3.758
    num_steps_sampled: 690000
    num_steps_trained: 690000
    sample_time_ms: 58359.3
    update_time_ms: 50.834
  iterations_since_restore: 345
  node_ip: 192.168.1.188
  num_healthy_workers: 7
  o



Result for PPO_LaneChangeAccelEnv1-v0_0:
  custom_metrics: {}
  date: 2019-12-08_03-02-06
  done: false
  episode_len_mean: 100.0
  episode_reward_max: 86.81801748202903
  episode_reward_mean: 43.86011950191112
  episode_reward_min: -24.043855342208857
  episodes_this_iter: 20
  episodes_total: 7780
  experiment_id: 1b1c2dbd7f554809834cc3d9e101cde6
  hostname: osboxes
  info:
    grad_time_ms: 8197.25
    learner:
      default_policy:
        cur_kl_coeff: 0.09028996527194977
        cur_lr: 4.999999873689376e-05
        entropy: 2.012434720993042
        entropy_coeff: 0.0
        kl: 0.03239030763506889
        policy_loss: -0.012761090882122517
        total_loss: 229.48171997070312
        vf_explained_var: 0.0
        vf_loss: 229.4915313720703
    load_time_ms: 4.933
    num_steps_sampled: 778000
    num_steps_trained: 778000
    sample_time_ms: 58878.309
    update_time_ms: 50.504
  iterations_since_restore: 389
  node_ip: 192.168.1.188
  num_healthy_workers: 7
  off_policy_est



Result for PPO_LaneChangeAccelEnv1-v0_0:
  custom_metrics: {}
  date: 2019-12-08_03-05-27
  done: false
  episode_len_mean: 100.0
  episode_reward_max: 86.82223199414038
  episode_reward_mean: 43.3073571078071
  episode_reward_min: -24.043855342208857
  episodes_this_iter: 20
  episodes_total: 7840
  experiment_id: 1b1c2dbd7f554809834cc3d9e101cde6
  hostname: osboxes
  info:
    grad_time_ms: 8147.262
    learner:
      default_policy:
        cur_kl_coeff: 0.13543494045734406
        cur_lr: 4.999999873689376e-05
        entropy: 2.9406211376190186
        entropy_coeff: 0.0
        kl: 0.02556709013879299
        policy_loss: -0.0016321887960657477
        total_loss: 223.71156311035156
        vf_explained_var: 5.960464477539063e-08
        vf_loss: 223.70974731445312
    load_time_ms: 4.335
    num_steps_sampled: 784000
    num_steps_trained: 784000
    sample_time_ms: 58907.897
    update_time_ms: 57.901
  iterations_since_restore: 392
  node_ip: 192.168.1.188
  num_healthy_worker



Result for PPO_LaneChangeAccelEnv1-v0_0:
  custom_metrics: {}
  date: 2019-12-08_03-16-34
  done: false
  episode_len_mean: 100.0
  episode_reward_max: 82.70701611425812
  episode_reward_mean: 43.61100778330859
  episode_reward_min: -12.591537936194118
  episodes_this_iter: 20
  episodes_total: 8040
  experiment_id: 1b1c2dbd7f554809834cc3d9e101cde6
  hostname: osboxes
  info:
    grad_time_ms: 7948.289
    learner:
      default_policy:
        cur_kl_coeff: 0.5142295360565186
        cur_lr: 4.999999873689376e-05
        entropy: 2.9686391353607178
        entropy_coeff: 0.0
        kl: 0.019441736862063408
        policy_loss: 0.0016572466120123863
        total_loss: 214.40493774414062
        vf_explained_var: 5.960464477539063e-08
        vf_loss: 214.393310546875
    load_time_ms: 4.616
    num_steps_sampled: 804000
    num_steps_trained: 804000
    sample_time_ms: 58596.695
    update_time_ms: 57.632
  iterations_since_restore: 402
  node_ip: 192.168.1.188
  num_healthy_workers:



== Status ==
Using FIFO scheduling algorithm.
Resources requested: 8/8 CPUs, 0/0 GPUs
Memory usage on this node: 6.3/8.3 GB
Result logdir: /home/osboxes/ray_results/FigureEightNetwork
Number of trials: 1 ({'RUNNING': 1})
RUNNING trials:
 - PPO_LaneChangeAccelEnv1-v0_0:	RUNNING, [8 CPUs, 0 GPUs], [pid=23396], 27740 s, 402 iter, 804000 ts, 43.6 rew

Result for PPO_LaneChangeAccelEnv1-v0_0:
  custom_metrics: {}
  date: 2019-12-08_03-17-42
  done: false
  episode_len_mean: 100.0
  episode_reward_max: 82.70701611425812
  episode_reward_mean: 41.26483607830621
  episode_reward_min: -12.591537936194118
  episodes_this_iter: 20
  episodes_total: 8060
  experiment_id: 1b1c2dbd7f554809834cc3d9e101cde6
  hostname: osboxes
  info:
    grad_time_ms: 8217.666
    learner:
      default_policy:
        cur_kl_coeff: 0.5142295360565186
        cur_lr: 4.999999873689376e-05
        entropy: 3.2025773525238037
        entropy_coeff: 0.0
        kl: 0.07864521443843842
        policy_loss: 0.038644757121



Result for PPO_LaneChangeAccelEnv1-v0_0:
  custom_metrics: {}
  date: 2019-12-08_03-25-34
  done: false
  episode_len_mean: 100.0
  episode_reward_max: 82.84175214227051
  episode_reward_mean: 49.949162050564546
  episode_reward_min: -3.9680306138729247
  episodes_this_iter: 20
  episodes_total: 8200
  experiment_id: 1b1c2dbd7f554809834cc3d9e101cde6
  hostname: osboxes
  info:
    grad_time_ms: 8059.126
    learner:
      default_policy:
        cur_kl_coeff: 0.5785082578659058
        cur_lr: 4.999999873689376e-05
        entropy: 2.4293594360351562
        entropy_coeff: 0.0
        kl: 0.0415521003305912
        policy_loss: 0.019723720848560333
        total_loss: 209.8416748046875
        vf_explained_var: 5.960464477539063e-08
        vf_loss: 209.79788208007812
    load_time_ms: 3.615
    num_steps_sampled: 820000
    num_steps_trained: 820000
    sample_time_ms: 59161.398
    update_time_ms: 75.559
  iterations_since_restore: 410
  node_ip: 192.168.1.188
  num_healthy_workers: 



Result for PPO_LaneChangeAccelEnv1-v0_0:
  custom_metrics: {}
  date: 2019-12-08_03-39-04
  done: false
  episode_len_mean: 100.0
  episode_reward_max: 91.44294979947047
  episode_reward_mean: 41.431497092588614
  episode_reward_min: -7.56517764513083
  episodes_this_iter: 20
  episodes_total: 8440
  experiment_id: 1b1c2dbd7f554809834cc3d9e101cde6
  hostname: osboxes
  info:
    grad_time_ms: 7862.989
    learner:
      default_policy:
        cur_kl_coeff: 0.08135272562503815
        cur_lr: 4.999999873689376e-05
        entropy: 3.389782667160034
        entropy_coeff: 0.0
        kl: 0.011886080726981163
        policy_loss: -0.012140102684497833
        total_loss: 211.68600463867188
        vf_explained_var: -1.1920928955078125e-07
        vf_loss: 211.69725036621094
    load_time_ms: 3.403
    num_steps_sampled: 844000
    num_steps_trained: 844000
    sample_time_ms: 59381.285
    update_time_ms: 54.639
  iterations_since_restore: 422
  node_ip: 192.168.1.188
  num_healthy_worke



== Status ==
Using FIFO scheduling algorithm.
Resources requested: 8/8 CPUs, 0/0 GPUs
Memory usage on this node: 6.3/8.3 GB
Result logdir: /home/osboxes/ray_results/FigureEightNetwork
Number of trials: 1 ({'RUNNING': 1})
RUNNING trials:
 - PPO_LaneChangeAccelEnv1-v0_0:	RUNNING, [8 CPUs, 0 GPUs], [pid=23396], 29088 s, 422 iter, 844000 ts, 41.4 rew

Result for PPO_LaneChangeAccelEnv1-v0_0:
  custom_metrics: {}
  date: 2019-12-08_03-40-14
  done: false
  episode_len_mean: 100.0
  episode_reward_max: 91.44294979947047
  episode_reward_mean: 42.24801551435731
  episode_reward_min: -7.56517764513083
  episodes_this_iter: 20
  episodes_total: 8460
  experiment_id: 1b1c2dbd7f554809834cc3d9e101cde6
  hostname: osboxes
  info:
    grad_time_ms: 8032.954
    learner:
      default_policy:
        cur_kl_coeff: 0.08135272562503815
        cur_lr: 4.999999873689376e-05
        entropy: 2.404134511947632
        entropy_coeff: 0.0
        kl: 0.01642674021422863
        policy_loss: -0.0196765623986



Result for PPO_LaneChangeAccelEnv1-v0_0:
  custom_metrics: {}
  date: 2019-12-08_03-43-36
  done: false
  episode_len_mean: 100.0
  episode_reward_max: 91.44294979947047
  episode_reward_mean: 46.89840341092116
  episode_reward_min: -4.646977633687869
  episodes_this_iter: 20
  episodes_total: 8520
  experiment_id: 1b1c2dbd7f554809834cc3d9e101cde6
  hostname: osboxes
  info:
    grad_time_ms: 7862.225
    learner:
      default_policy:
        cur_kl_coeff: 0.08135272562503815
        cur_lr: 4.999999873689376e-05
        entropy: 2.472339153289795
        entropy_coeff: 0.0
        kl: 0.015566040761768818
        policy_loss: -0.020016226917505264
        total_loss: 231.56468200683594
        vf_explained_var: 0.0
        vf_loss: 231.58343505859375
    load_time_ms: 4.405
    num_steps_sampled: 852000
    num_steps_trained: 852000
    sample_time_ms: 59683.634
    update_time_ms: 71.068
  iterations_since_restore: 426
  node_ip: 192.168.1.188
  num_healthy_workers: 7
  off_policy_e



Result for PPO_LaneChangeAccelEnv1-v0_0:
  custom_metrics: {}
  date: 2019-12-08_03-44-43
  done: false
  episode_len_mean: 100.0
  episode_reward_max: 84.41544600537311
  episode_reward_mean: 46.88422813881291
  episode_reward_min: -4.646977633687869
  episodes_this_iter: 20
  episodes_total: 8540
  experiment_id: 1b1c2dbd7f554809834cc3d9e101cde6
  hostname: osboxes
  info:
    grad_time_ms: 8123.529
    learner:
      default_policy:
        cur_kl_coeff: 0.08135272562503815
        cur_lr: 4.999999873689376e-05
        entropy: 2.3620824813842773
        entropy_coeff: 0.0
        kl: 0.01888294331729412
        policy_loss: -0.015348576009273529
        total_loss: 230.37127685546875
        vf_explained_var: 0.0
        vf_loss: 230.38507080078125
    load_time_ms: 4.361
    num_steps_sampled: 854000
    num_steps_trained: 854000
    sample_time_ms: 59537.888
    update_time_ms: 89.835
  iterations_since_restore: 427
  node_ip: 192.168.1.188
  num_healthy_workers: 7
  off_policy_e



Result for PPO_LaneChangeAccelEnv1-v0_0:
  custom_metrics: {}
  date: 2019-12-08_04-15-07
  done: false
  episode_len_mean: 100.0
  episode_reward_max: 84.34350969188348
  episode_reward_mean: 44.975279493327754
  episode_reward_min: -16.65530870833946
  episodes_this_iter: 20
  episodes_total: 9080
  experiment_id: 1b1c2dbd7f554809834cc3d9e101cde6
  hostname: osboxes
  info:
    grad_time_ms: 8274.497
    learner:
      default_policy:
        cur_kl_coeff: 0.347496896982193
        cur_lr: 4.999999873689376e-05
        entropy: 2.8323802947998047
        entropy_coeff: 0.0
        kl: 0.06804783642292023
        policy_loss: 0.024645056575536728
        total_loss: 206.89007568359375
        vf_explained_var: -1.1920928955078125e-07
        vf_loss: 206.8417510986328
    load_time_ms: 5.072
    num_steps_sampled: 908000
    num_steps_trained: 908000
    sample_time_ms: 59120.108
    update_time_ms: 46.043
  iterations_since_restore: 454
  node_ip: 192.168.1.188
  num_healthy_workers:



Result for PPO_LaneChangeAccelEnv1-v0_0:
  custom_metrics: {}
  date: 2019-12-08_04-16-14
  done: false
  episode_len_mean: 100.0
  episode_reward_max: 84.34350969188348
  episode_reward_mean: 44.16206091064825
  episode_reward_min: -16.65530870833946
  episodes_this_iter: 20
  episodes_total: 9100
  experiment_id: 1b1c2dbd7f554809834cc3d9e101cde6
  hostname: osboxes
  info:
    grad_time_ms: 8624.712
    learner:
      default_policy:
        cur_kl_coeff: 0.5212453603744507
        cur_lr: 4.999999873689376e-05
        entropy: 3.325767993927002
        entropy_coeff: 0.0
        kl: 0.0429525226354599
        policy_loss: 0.008992427960038185
        total_loss: 209.10305786132812
        vf_explained_var: 0.0
        vf_loss: 209.07167053222656
    load_time_ms: 5.197
    num_steps_sampled: 910000
    num_steps_trained: 910000
    sample_time_ms: 58681.618
    update_time_ms: 54.661
  iterations_since_restore: 455
  node_ip: 192.168.1.188
  num_healthy_workers: 7
  off_policy_estim



Result for PPO_LaneChangeAccelEnv1-v0_0:
  custom_metrics: {}
  date: 2019-12-08_04-32-06
  done: false
  episode_len_mean: 100.0
  episode_reward_max: 81.27364688211736
  episode_reward_mean: 38.30264872393976
  episode_reward_min: -18.21821840609351
  episodes_this_iter: 20
  episodes_total: 9380
  experiment_id: 1b1c2dbd7f554809834cc3d9e101cde6
  hostname: osboxes
  info:
    grad_time_ms: 8163.833
    learner:
      default_policy:
        cur_kl_coeff: 0.21990036964416504
        cur_lr: 4.999999873689376e-05
        entropy: 2.8802742958068848
        entropy_coeff: 0.0
        kl: 0.033574171364307404
        policy_loss: 0.005935338791459799
        total_loss: 213.27194213867188
        vf_explained_var: -1.1920928955078125e-07
        vf_loss: 213.25856018066406
    load_time_ms: 3.142
    num_steps_sampled: 938000
    num_steps_trained: 938000
    sample_time_ms: 59831.423
    update_time_ms: 51.582
  iterations_since_restore: 469
  node_ip: 192.168.1.188
  num_healthy_worke