# Imitation Learning over Heterogeneous Agents with Restraining Bolts
## Project development

Source: [De Giacomo, 2019](http://www.diag.uniroma1.it/degiacom/papers/2020/icaps2020dfip.pdf)

This notebook runs one of the examples from the code provided with the implementation of the research paper.


# Breakout

The following code is based on the file `breakout/__main__.py`, slightly modified so that it can be executed in this notebook. The goal is to show the flow and the values of the variables across the execution.

In [17]:
"""This is the main entry-point for the experiments with the Breakout environment."""
import logging
import os
from argparse import ArgumentParser

import yaml

from breakout.learner import run_learner
from breakout.expert import run_expert
from rl_algorithm.utils import Map, learn_dfa



In [2]:
## Show video method

from IPython import display as ipythondisplay
from IPython.display import HTML
import io
import base64
import glob

def show_video(list_paths, take_first=True):
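    '''
    Show a list of videos next to each other, for comparison purposes.

    For each path, one .mp4 file found under <path>videos/ is embedded:
    the first match, or the last one if take_first is False.
    '''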
    
    encoded_videos = []
    for i, video_path in enumerate(list_paths):
                
        # glob returns matches in arbitrary order; sort so first/last are well defined
        mp4list = sorted(glob.glob(video_path + 'videos/*.mp4'))
        ind = 0
        if len(mp4list) > 0:
            if not take_first:
                 ind = -1

            mp4 = mp4list[ind]
            print(i+1,': ', mp4)
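            # Read-only binary mode is enough, and the context manager closes the file
            with io.open(mp4, 'rb') as video_file:
                video = video_file.read()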
            encoded = base64.b64encode(video)
            encoded_videos.append('''<video alt="test" autoplay 
                        loop controls style="height: 400px;">
                        <source src="data:video/mp4;base64,{0}" type="video/mp4" />
                     </video> '''.format(encoded.decode('ascii')))
        else:
            print("Could not find video of path: ", video_path)
    
    if len(encoded_videos) > 0:
        ipythondisplay.display(HTML(data='''<div>''' \
                                        + ''.join(encoded_videos) \
                                        + '''</div>'''))

In [3]:
# Preview. This will not work the first time the training is run, since the video files will not exist yet.
show_video(['experiments/breakout-output/expert/',
            'experiments/breakout-output/learner/'], False)

1 :  experiments/breakout-output/expert/videos\openaigym.video.1.4670.video000001.mp4
2 :  experiments/breakout-output/learner/videos\openaigym.video.2.4670.video000001.mp4


In [3]:
logging.getLogger("temprl").setLevel(level=logging.DEBUG)
logging.getLogger("matplotlib").setLevel(level=logging.INFO)
logging.getLogger("rl_algorithm").setLevel(level=logging.INFO)

## Argument parsing
Manually add some default values to run the example.
* The suggested command on the CLI is:
`python3 -m breakout --rows 3 --cols 3 --output-dir experiments/breakout-output --overwrite`

* `overwrite` is set to `True` in the arguments, as the experiments do not exist yet. This option **deletes any training** previously performed.


* The specific arguments of the expert and the learner are in the `.yaml` files
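
Passing an explicit list to `parse_args` is what lets the parser run inside a notebook: with `args=[]` it ignores the kernel's own command line and falls back to the defaults, while a non-empty list is parsed like the CLI command above. A minimal sketch with a reduced parser follows. It keeps only three of the options, `build_parser` is a hypothetical helper name, and `--overwrite` is left at the usual `store_true` default of `False` rather than the `True` used in this notebook.

```python
from argparse import ArgumentParser


def build_parser():
    """Reduced copy of the notebook's parser, for illustration only."""
    parser = ArgumentParser()
    parser.add_argument("--cols", type=int, default=3, help="Number of columns.")
    parser.add_argument("--rows", type=int, default=3, help="Number of rows.")
    parser.add_argument("--overwrite", action="store_true",
                        help="Overwrite the content of the output directory.")
    return parser


# An empty list means "no command-line flags": every option takes its default.
defaults = build_parser().parse_args(args=[])

# A non-empty list is parsed like the equivalent shell command.
cli_like = build_parser().parse_args(args=["--rows", "4", "--cols", "5", "--overwrite"])

print(defaults)
print(cli_like)
```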

In [27]:
parser = ArgumentParser()
parser.add_argument("--cols", type=int, default=3, help="Number of columns.")
parser.add_argument("--rows", type=int, default=3, help="Number of rows.")
parser.add_argument("--brick-reward", type=int, default=5, help="The reward for breaking a brick.")
parser.add_argument("--step-reward", type=float, default=-0.01, help="The reward for each step.")
parser.add_argument("--goal-reward", type=int, default=1000, help="The reward for satisfying the temporal goal.")
parser.add_argument("--overwrite", action="store_true", default=True, help="Overwrite the content of the output directory.")
parser.add_argument("--seed", type=int, default=42, help="Random seed.")
parser.add_argument("--output-dir", type=str, default="experiments/breakout-output", help="Output directory for the experiment results.")
parser.add_argument("--expert-config", type=str, default="breakout/expert_config.yaml", help="RL configuration for the expert.")
parser.add_argument("--learner-config", type=str, default="breakout/learner_config.yaml", help="RL configuration for the learner.")
arguments = parser.parse_args(args=[])
arguments

Namespace(brick_reward=5, cols=3, expert_config='breakout/expert_config.yaml', goal_reward=1000, learner_config='breakout/learner_config.yaml', output_dir='experiments/breakout-output', overwrite=True, rows=3, seed=42, step_reward=-0.01)

# Main

In [28]:
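# Load both configurations, closing the files once they are parsed
with open(arguments.expert_config) as config_file:
    expert_config = Map(yaml.safe_load(config_file))
with open(arguments.learner_config) as config_file:
    learner_config = Map(yaml.safe_load(config_file))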
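
`Map` comes from `rl_algorithm.utils` and its source is not shown in this notebook. Since the expert code reads fields as attributes (for example `configuration.gamma`) and the object prints like a plain dictionary, it presumably behaves like a dictionary with attribute access. A minimal sketch of such a wrapper, under that assumption (`AttrDict` is a hypothetical stand-in, not the library's class):

```python
class AttrDict(dict):
    """Dictionary whose keys can also be read and written as attributes."""

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so dict methods still work.
        try:
            return self[name]
        except KeyError as exc:
            raise AttributeError(name) from exc

    def __setattr__(self, name, value):
        # Attribute assignment stores the value as a dictionary entry.
        self[name] = value


config = AttrDict({"gamma": 0.99, "alpha": 0.1, "algorithm": "sarsa"})
print(config.gamma, config["alpha"], config.algorithm)
```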

### Run the expert

Some facts about this training:
* It uses either SARSA or Q-Learning as the training algorithm.
* The specific arguments of the expert and the learner are in the `.yaml` files.
* The actual algorithms, SARSA and Q-Learning, are referred to in the code as the 'brain' and are implemented in the `brains.py` file. They are used by the `Agent`.
* The expert uses the `true_automaton` (diagram shown below, in Automaton) to learn the goal.
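
SARSA and Q-Learning differ only in the value they bootstrap from. A minimal tabular sketch of one update step for each, assuming no eligibility traces; the function names, the dictionary-based Q-table and the default `alpha` and `gamma` values are illustrative and are not the contents of `brains.py`:

```python
from collections import defaultdict


def make_q_table(n_actions):
    """Q-table mapping each state to a list of action values, initialised to zero."""
    return defaultdict(lambda: [0.0] * n_actions)


def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99, done=False):
    """On-policy: bootstrap from the action actually chosen in the next state."""
    target = r if done else r + gamma * q[s_next][a_next]
    q[s][a] += alpha * (target - q[s][a])


def q_learning_update(q, s, a, r, s_next, alpha=0.1, gamma=0.99, done=False):
    """Off-policy: bootstrap from the greedy action in the next state."""
    target = r if done else r + gamma * max(q[s_next])
    q[s][a] += alpha * (target - q[s][a])


# Same transition, same next-state values, different targets.
q_sarsa = make_q_table(4)
q_sarsa["s1"] = [0.0, 2.0, 10.0, 0.0]
sarsa_update(q_sarsa, "s0", 1, 5.0, "s1", a_next=1)

q_ql = make_q_table(4)
q_ql["s1"] = [0.0, 2.0, 10.0, 0.0]
q_learning_update(q_ql, "s0", 1, 5.0, "s1")

print(q_sarsa["s0"][1], q_ql["s0"][1])
```

In this toy transition SARSA bootstraps from the value of the exploratory next action (2.0), while Q-Learning bootstraps from the best next action (10.0), so the Q-Learning estimate moves further.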

`expert.py` (just this method):

``` python 

def run_expert(arguments, configuration):
    agent_dir = Path(arguments.output_dir) / "expert"
    if arguments.overwrite:
        shutil.rmtree(arguments.output_dir, ignore_errors=True)
    agent_dir.mkdir(parents=True, exist_ok=False)

    config = BreakoutConfiguration(brick_rows=arguments.rows, brick_cols=arguments.cols,
                                   brick_reward=arguments.brick_reward, step_reward=arguments.step_reward,
                                   ball_enabled=False, fire_enabled=True)
    env = make_env(config, arguments.output_dir, arguments.goal_reward)

    np.random.seed(arguments.seed)
    env.seed(arguments.seed)

    policy = AutomataPolicy((-2, ), nb_steps=configuration.nb_exploration_steps, value_max=1.0, value_min=configuration.min_eps)

    algorithm = Sarsa if configuration.algorithm == "sarsa" else QLearning
    agent = Agent(algorithm(None,
                        env.action_space,
                        gamma=configuration.gamma,
                        alpha=configuration.alpha,
                        lambda_=configuration.lambda_),
                  policy=policy,
                  test_policy=EpsGreedyQPolicy(eps=0.01))

    history = agent.fit(
        env,
        nb_steps=configuration.nb_steps,
        visualize=configuration.visualize_training,
        callbacks=[
            ModelCheckpoint(str(agent_dir / "checkpoints" / "agent-{}.pkl")),
            TrainEpisodeLogger()
        ]
    )
    history.save(agent_dir / "history.json")
    agent.save(Path(agent_dir, "checkpoints", "agent.pkl"))
    plot_history(history, agent_dir)

    agent = Agent.load(agent_dir / "checkpoints" / "agent.pkl")
    agent.test(Monitor(env, agent_dir / "videos"), nb_episodes=5, visualize=True)

    env.close()
```
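
The `nb_steps`, `value_max` and `value_min` arguments passed to `AutomataPolicy` suggest an exploration rate that is annealed from `value_max` down to `value_min` over the exploration steps. How `AutomataPolicy` actually schedules it is not shown in this notebook, so the sketch below is only a plain linear schedule under that assumption. `linear_epsilon` is a hypothetical helper, and its defaults mirror `nb_exploration_steps` and `min_eps` from the expert configuration.

```python
def linear_epsilon(step, nb_steps=10000, value_max=1.0, value_min=0.01):
    """Linearly decay epsilon from value_max to value_min over nb_steps, then hold."""
    if nb_steps <= 0:
        return value_min
    # Fraction of the annealing period already elapsed, clamped to [0, 1].
    fraction = min(max(step / nb_steps, 0.0), 1.0)
    return value_max - fraction * (value_max - value_min)


for step in (0, 2500, 5000, 10000, 20000):
    print(step, round(linear_epsilon(step), 4))
```

A single global schedule like this one can only decrease. The `mean-eps` values logged during training also rise at times (for instance 0.673 in episode 3 and 0.689 in episode 4), so the real policy evidently does more than this, possibly keeping separate exploration counters per automaton state.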

In [29]:
expert_config

{'nb_steps': 75000,
 'nb_exploration_steps': 10000,
 'min_eps': 0.01,
 'visualize_training': False,
 'reward_shaping': True,
 'gamma': 0.99,
 'alpha': 0.1,
 'lambda_': 0.0,
 'algorithm': 'sarsa'}
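
With `reward_shaping` enabled, the goal reward of 1000 does not appear as a single lump in the training log below. The first entry of each `Non-zero goal rewards` line is a multiple of 1000/3, which is consistent with a potential-based scheme in which every automaton state has a potential and each transition pays the difference. The sketch below reproduces the logged values. The potentials are inferred from the log, not read from `temprl`, and all names are hypothetical.

```python
GOAL_REWARD = 1000.0
N_COLUMNS = 3

# Automaton state ids as they appear in the log. 0 -> 1 -> 3 -> 4 looks like
# the successful path (one more column cleared in order at each step) and
# 2 looks like a failure state. These potentials are an inference.
POTENTIAL = {
    0: 0.0,
    1: GOAL_REWARD / N_COLUMNS,
    3: 2 * GOAL_REWARD / N_COLUMNS,
    4: GOAL_REWARD,
    2: -GOAL_REWARD / N_COLUMNS,
}


def shaped_reward(state, next_state):
    """Reward paid on an automaton transition: the difference of potentials."""
    return POTENTIAL[next_state] - POTENTIAL[state]


for transition in [(0, 1), (1, 3), (3, 4), (0, 2), (1, 2)]:
    print(transition, shaped_reward(*transition))
```

Under this reading the three rewards along the successful path add up to the full goal reward, while falling into state 2 takes back whatever had been paid so far plus a penalty of 1000/3.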

In [30]:
print("Run the expert.")
run_expert(arguments, expert_config)

Run the expert.


[2020-05-11 08:44:52,416][graphviz.files][save][DEBUG]: write 1117 bytes to 'experiments/breakout-output/true_automaton'
[2020-05-11 08:44:52,427][graphviz.backend][run][DEBUG]: run ['dot', '-Tsvg', '-O', 'true_automaton']
[2020-05-11 08:44:52,593][temprl.automata][step][DEBUG]: transition idxs: 0, 2
[2020-05-11 08:44:52,602][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [-333.3333333333333, 0.0]


Formula: <(!c0 & !c1 & !c2)*;c0;(!c0 & !c1 & !c2)*;c1;(!c0 & !c1 & !c2)*;c2>tt
Original automaton at experiments/breakout-output/true_automaton.svg
Training for 75000 steps ...


[2020-05-11 08:44:54,172][temprl.automata][step][DEBUG]: transition idxs: 0, 2
[2020-05-11 08:44:54,182][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [-333.3333333333333, 0.0]


  2549/75000: episode: 1, duration:  1.483s, episode steps:  2549, steps per second:  1718, episode reward:   -313.823, mean reward:  -0.123 [  -328.343,      4.990], mean action:  1.475 [ 0.000,  3.000], mean observation:  1.612 [ 0.000,  9.000], q_value:      0.017, mean-eps:      0.937


[2020-05-11 08:44:55,415][temprl.automata][step][DEBUG]: transition idxs: 0, 2
[2020-05-11 08:44:55,426][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [-333.3333333333333, 0.0]


  5250/75000: episode: 2, duration:  1.306s, episode steps:  2701, steps per second:  2068, episode reward:   -330.343, mean reward:  -0.122 [  -328.343,      4.990], mean action:  1.525 [ 0.000,  3.000], mean observation:  3.085 [ 0.000,  9.000], q_value:     -0.107, mean-eps:      0.807


[2020-05-11 08:44:56,846][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:44:56,854][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


  7951/75000: episode: 3, duration:  1.411s, episode steps:  2701, steps per second:  1914, episode reward:   -325.343, mean reward:  -0.120 [  -328.343,      4.990], mean action:  1.531 [ 0.000,  3.000], mean observation:  2.580 [ 0.000,  9.000], q_value:     -0.176, mean-eps:      0.673


[2020-05-11 08:44:57,071][temprl.automata][step][DEBUG]: transition idxs: 1, 2
[2020-05-11 08:44:57,082][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [-666.6666666666666, 0.0]
[2020-05-11 08:44:57,331][temprl.automata][step][DEBUG]: transition idxs: 0, 2
[2020-05-11 08:44:57,337][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [-333.3333333333333, 0.0]


  8606/75000: episode: 4, duration:  0.478s, episode steps:   655, steps per second:  1370, episode reward:   -294.883, mean reward:  -0.450 [  -661.677,    338.323], mean action:  1.524 [ 0.000,  3.000], mean observation:  1.632 [ 0.000,  9.000], q_value:     -0.359, mean-eps:      0.689


[2020-05-11 08:44:58,661][temprl.automata][step][DEBUG]: transition idxs: 0, 2
[2020-05-11 08:44:58,669][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [-333.3333333333333, 0.0]


 11307/75000: episode: 5, duration:  1.306s, episode steps:  2701, steps per second:  2068, episode reward:   -325.343, mean reward:  -0.120 [  -328.343,      4.990], mean action:  1.550 [ 0.000,  3.000], mean observation:  2.025 [ 0.000,  9.000], q_value:     -0.057, mean-eps:      0.671


[2020-05-11 08:44:59,721][temprl.automata][step][DEBUG]: transition idxs: 0, 2
[2020-05-11 08:44:59,730][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [-333.3333333333333, 0.0]


 13557/75000: episode: 6, duration:  1.060s, episode steps:  2250, steps per second:  2122, episode reward:   -310.833, mean reward:  -0.138 [  -328.343,      4.990], mean action:  1.265 [ 0.000,  3.000], mean observation:  2.632 [ 0.000,  9.000], q_value:      0.047, mean-eps:      0.622


[2020-05-11 08:45:01,021][temprl.automata][step][DEBUG]: transition idxs: 0, 2
[2020-05-11 08:45:01,030][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [-333.3333333333333, 0.0]


 16258/75000: episode: 7, duration:  1.253s, episode steps:  2701, steps per second:  2156, episode reward:   -325.343, mean reward:  -0.120 [  -328.343,      4.990], mean action:  1.541 [ 0.000,  3.000], mean observation:  2.098 [ 0.000,  9.000], q_value:     -0.140, mean-eps:      0.616


[2020-05-11 08:45:01,732][temprl.automata][step][DEBUG]: transition idxs: 0, 2
[2020-05-11 08:45:01,742][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [-333.3333333333333, 0.0]


 17705/75000: episode: 8, duration:  0.754s, episode steps:  1447, steps per second:  1918, episode reward:   -302.803, mean reward:  -0.209 [  -328.343,      4.990], mean action:  2.366 [ 0.000,  3.000], mean observation:  2.038 [ 0.000,  9.000], q_value:     -0.252, mean-eps:      0.609


[2020-05-11 08:45:02,982][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:02,990][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 19967/75000: episode: 9, duration:  1.142s, episode steps:  2262, steps per second:  1981, episode reward:   -310.953, mean reward:  -0.137 [  -328.343,      4.990], mean action:  1.232 [ 0.000,  3.000], mean observation:  1.389 [ 0.000,  9.000], q_value:     -0.270, mean-eps:      0.605


[2020-05-11 08:45:03,220][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:03,230][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:03,874][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:03,882][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:04,000][temprl.automata][step][DEBUG]: transition idxs: 0, 2
[2020-05-11 08:45:04,008][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [-333.3333333333333, 0.0]


 21609/75000: episode: 10, duration:  1.047s, episode steps:  1642, steps per second:  1568, episode reward:   1028.580, mean reward:   0.626 [    -0.010,    338.323], mean action:  1.498 [ 0.000,  3.000], mean observation:  1.595 [ 0.000,  9.000], q_value:     -0.382, mean-eps:      0.648


[2020-05-11 08:45:05,432][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:05,442][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 24310/75000: episode: 11, duration:  1.394s, episode steps:  2701, steps per second:  1938, episode reward:   -320.343, mean reward:  -0.119 [  -328.343,      4.990], mean action:  1.232 [ 0.000,  3.000], mean observation:  2.027 [ 0.000,  9.000], q_value:     -0.204, mean-eps:      0.728


[2020-05-11 08:45:05,631][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:05,638][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:06,011][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:06,017][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:06,164][temprl.automata][step][DEBUG]: transition idxs: 0, 2
[2020-05-11 08:45:06,174][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [-333.3333333333333, 0.0]


 25350/75000: episode: 12, duration:  0.690s, episode steps:  1040, steps per second:  1507, episode reward:   1034.600, mean reward:   0.995 [    -0.010,    338.323], mean action:  1.514 [ 0.000,  3.000], mean observation:  1.950 [ 0.000,  9.000], q_value:      2.113, mean-eps:      0.718
 28051/75000: episode: 13, duration:  1.463s, episode steps:  2701, steps per second:  1846, episode reward:   -325.343, mean reward:  -0.120 [  -328.343,      4.990], mean action:  1.178 [ 0.000,  3.000], mean observation:  0.851 [ 0.000,  9.000], q_value:     -0.066, mean-eps:      0.704


[2020-05-11 08:45:07,794][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:07,803][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:07,988][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:07,995][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:08,447][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:08,454][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:08,649][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:08,657][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 28925/75000: episode: 14, duration:  0.918s, episode steps:   874, steps per second:   952, episode reward:   1036.260, mean reward:   1.186 [    -0.010,    338.323], mean action:  1.474 [ 0.000,  3.000], mean observation:  1.505 [ 0.000,  8.000], q_value:      1.971, mean-eps:      0.695


[2020-05-11 08:45:08,778][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:08,782][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:08,896][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:08,901][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]


 29489/75000: episode: 15, duration:  0.430s, episode steps:   564, steps per second:  1312, episode reward:   1039.360, mean reward:   1.843 [    -0.010,    338.323], mean action:  1.610 [ 0.000,  3.000], mean observation:  1.484 [ 0.000,  8.000], q_value:      7.748, mean-eps:      0.681


[2020-05-11 08:45:09,185][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:09,190][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:09,255][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:09,261][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:09,579][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:09,590][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:09,751][temprl.automata][step][DEBUG]: transition idxs: 0, 2
[2020-05-11 08:45:09,757][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [-333.3333333333333, 0.0]


 30513/75000: episode: 16, duration:  0.682s, episode steps:  1024, steps per second:  1501, episode reward:   1034.760, mean reward:   1.011 [    -0.010,    338.323], mean action:  1.539 [ 0.000,  3.000], mean observation:  1.435 [ 0.000,  8.000], q_value:      4.393, mean-eps:      0.665


[2020-05-11 08:45:11,413][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:11,422][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 33214/75000: episode: 17, duration:  1.618s, episode steps:  2701, steps per second:  1669, episode reward:   -320.343, mean reward:  -0.119 [  -328.343,      4.990], mean action:  1.510 [ 0.000,  3.000], mean observation:  1.088 [ 0.000,  9.000], q_value:      0.509, mean-eps:      0.651


[2020-05-11 08:45:11,623][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:11,629][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:11,890][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:11,898][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:12,079][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:12,084][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 33962/75000: episode: 18, duration:  0.654s, episode steps:   748, steps per second:  1143, episode reward:   1037.520, mean reward:   1.387 [    -0.010,    338.323], mean action:  1.540 [ 0.000,  3.000], mean observation:  1.865 [ 0.000,  9.000], q_value:      9.632, mean-eps:      0.644


[2020-05-11 08:45:12,191][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:12,197][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:12,308][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:12,313][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:12,444][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:12,449][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 34467/75000: episode: 19, duration:  0.397s, episode steps:   505, steps per second:  1271, episode reward:   1039.950, mean reward:   2.059 [    -0.010,    338.323], mean action:  1.487 [ 0.000,  3.000], mean observation:  1.714 [ 0.000,  9.000], q_value:      9.101, mean-eps:      0.631


[2020-05-11 08:45:12,577][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:12,582][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:13,029][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:13,038][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:13,176][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:13,181][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 35673/75000: episode: 20, duration:  0.720s, episode steps:  1206, steps per second:  1675, episode reward:   1032.940, mean reward:   0.857 [    -0.010,    338.323], mean action:  1.512 [ 0.000,  3.000], mean observation:  2.294 [ 0.000,  9.000], q_value:     15.196, mean-eps:      0.615


[2020-05-11 08:45:13,312][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:13,322][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:13,444][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:13,448][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:13,565][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:13,570][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 36198/75000: episode: 21, duration:  0.392s, episode steps:   525, steps per second:  1338, episode reward:   1039.750, mean reward:   1.980 [    -0.010,    338.323], mean action:  1.554 [ 0.000,  3.000], mean observation:  1.658 [ 0.000,  9.000], q_value:     15.443, mean-eps:      0.597


[2020-05-11 08:45:13,780][temprl.automata][step][DEBUG]: transition idxs: 1, 2
[2020-05-11 08:45:13,787][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [-666.6666666666666, 0.0]
[2020-05-11 08:45:14,302][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:14,310][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 37383/75000: episode: 22, duration:  0.745s, episode steps:  1185, steps per second:  1590, episode reward:   -300.183, mean reward:  -0.253 [  -661.677,    338.323], mean action:  1.550 [ 0.000,  3.000], mean observation:  2.261 [ 0.000,  9.000], q_value:      3.804, mean-eps:      0.585


[2020-05-11 08:45:14,457][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:14,461][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:14,546][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:14,551][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:14,679][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:14,687][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 37875/75000: episode: 23, duration:  0.343s, episode steps:   492, steps per second:  1436, episode reward:   1040.080, mean reward:   2.114 [    -0.010,    338.323], mean action:  1.283 [ 0.000,  3.000], mean observation:  1.710 [ 0.000,  8.000], q_value:     19.027, mean-eps:      0.579


[2020-05-11 08:45:14,847][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:14,854][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:14,976][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:14,984][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:15,106][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:15,110][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 38376/75000: episode: 24, duration:  0.423s, episode steps:   501, steps per second:  1183, episode reward:   1039.990, mean reward:   2.076 [    -0.010,    338.323], mean action:  1.311 [ 0.000,  3.000], mean observation:  1.753 [ 0.000,  9.000], q_value:     22.730, mean-eps:      0.569


[2020-05-11 08:45:15,264][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:15,273][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:15,395][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:15,400][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:15,543][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:15,548][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 38941/75000: episode: 25, duration:  0.402s, episode steps:   565, steps per second:  1406, episode reward:   1039.350, mean reward:   1.840 [    -0.010,    338.323], mean action:  1.531 [ 0.000,  3.000], mean observation:  1.681 [ 0.000,  9.000], q_value:     21.972, mean-eps:      0.558


[2020-05-11 08:45:15,756][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:15,761][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:15,882][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:15,888][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:15,995][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:16,000][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:16,094][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:16,099][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 39629/75000: episode: 26, duration:  0.482s, episode steps:   688, steps per second:  1426, episode reward:   1038.120, mean reward:   1.509 [    -0.010,    338.323], mean action:  1.705 [ 0.000,  3.000], mean observation:  1.335 [ 0.000,  9.000], q_value:     20.640, mean-eps:      0.546


[2020-05-11 08:45:16,179][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:16,184][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:16,320][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:16,325][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 40047/75000: episode: 27, duration:  0.290s, episode steps:   418, steps per second:  1442, episode reward:   1040.820, mean reward:   2.490 [    -0.010,    338.323], mean action:  1.871 [ 0.000,  3.000], mean observation:  1.551 [ 0.000,  9.000], q_value:     28.260, mean-eps:      0.535


[2020-05-11 08:45:16,452][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:16,461][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:16,588][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:16,594][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:16,699][temprl.automata][step][DEBUG]: transition idxs: 0, 2
[2020-05-11 08:45:16,705][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [-333.3333333333333, 0.0]


 40576/75000: episode: 28, duration:  0.403s, episode steps:   529, steps per second:  1311, episode reward:   1039.710, mean reward:   1.965 [    -0.010,    338.323], mean action:  1.539 [ 0.000,  3.000], mean observation:  1.725 [ 0.000,  9.000], q_value:     31.004, mean-eps:      0.526


[2020-05-11 08:45:17,660][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:17,669][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 42447/75000: episode: 29, duration:  0.983s, episode steps:  1871, steps per second:  1903, episode reward:   -307.043, mean reward:  -0.164 [  -328.343,      4.990], mean action:  1.537 [ 0.000,  3.000], mean observation:  1.545 [ 0.000,  9.000], q_value:     -0.130, mean-eps:      0.518


[2020-05-11 08:45:17,861][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:17,866][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:17,971][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:17,977][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:18,116][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:18,121][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 42996/75000: episode: 30, duration:  0.386s, episode steps:   549, steps per second:  1423, episode reward:   1039.510, mean reward:   1.893 [    -0.010,    338.323], mean action:  1.590 [ 0.000,  3.000], mean observation:  1.604 [ 0.000,  9.000], q_value:     31.222, mean-eps:      0.512


[2020-05-11 08:45:18,206][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:18,210][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:18,320][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:18,326][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:18,452][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:18,459][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 43443/75000: episode: 31, duration:  0.334s, episode steps:   447, steps per second:  1338, episode reward:   1040.530, mean reward:   2.328 [    -0.010,    338.323], mean action:  1.515 [ 0.000,  3.000], mean observation:  1.889 [ 0.000,  9.000], q_value:     37.614, mean-eps:      0.502


[2020-05-11 08:45:18,571][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:18,574][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:18,703][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:18,718][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:18,796][temprl.automata][step][DEBUG]: transition idxs: 0, 2
[2020-05-11 08:45:18,800][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [-333.3333333333333, 0.0]


 43920/75000: episode: 32, duration:  0.380s, episode steps:   477, steps per second:  1257, episode reward:   1040.230, mean reward:   2.181 [    -0.010,    338.323], mean action:  1.294 [ 0.000,  3.000], mean observation:  1.846 [ 0.000,  9.000], q_value:     37.559, mean-eps:      0.493


[2020-05-11 08:45:20,518][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:20,527][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 46616/75000: episode: 33, duration:  1.713s, episode steps:  2696, steps per second:  1574, episode reward:   -315.293, mean reward:  -0.117 [  -328.343,      4.990], mean action:  1.149 [ 0.000,  3.000], mean observation:  1.642 [ 0.000,  9.000], q_value:     -0.321, mean-eps:      0.486


[2020-05-11 08:45:20,677][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:20,683][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:20,787][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:20,791][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:20,878][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:20,883][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:20,993][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:20,998][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 47131/75000: episode: 34, duration:  0.344s, episode steps:   515, steps per second:  1495, episode reward:   1039.850, mean reward:   2.019 [    -0.010,    338.323], mean action:  1.656 [ 0.000,  3.000], mean observation:  1.752 [ 0.000,  9.000], q_value:     40.056, mean-eps:      0.481


[2020-05-11 08:45:21,132][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:21,137][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:21,278][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:21,282][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 47579/75000: episode: 35, duration:  0.344s, episode steps:   448, steps per second:  1303, episode reward:   1040.520, mean reward:   2.323 [    -0.010,    338.323], mean action:  1.250 [ 0.000,  3.000], mean observation:  1.756 [ 0.000,  9.000], q_value:     43.870, mean-eps:      0.472


[2020-05-11 08:45:21,407][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:21,415][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:21,616][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:21,627][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:21,731][temprl.automata][step][DEBUG]: transition idxs: 0, 2
[2020-05-11 08:45:21,737][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [-333.3333333333333, 0.0]


 48107/75000: episode: 36, duration:  0.479s, episode steps:   528, steps per second:  1103, episode reward:   1039.720, mean reward:   1.969 [    -0.010,    338.323], mean action:  1.373 [ 0.000,  3.000], mean observation:  1.849 [ 0.000,  9.000], q_value:     46.471, mean-eps:      0.462


[2020-05-11 08:45:23,143][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:23,150][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:23,250][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:23,254][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 50808/75000: episode: 37, duration:  1.399s, episode steps:  2701, steps per second:  1930, episode reward:   -325.343, mean reward:  -0.120 [  -328.343,      4.990], mean action:  1.133 [ 0.000,  3.000], mean observation:  0.814 [ 0.000,  9.000], q_value:     -0.191, mean-eps:      0.454


[2020-05-11 08:45:23,389][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:23,393][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:23,578][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:23,586][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 51287/75000: episode: 38, duration:  0.338s, episode steps:   479, steps per second:  1416, episode reward:   1040.210, mean reward:   2.172 [    -0.010,    338.323], mean action:  1.687 [ 0.000,  3.000], mean observation:  1.791 [ 0.000,  9.000], q_value:     47.291, mean-eps:      0.450


[2020-05-11 08:45:23,801][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:23,808][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:23,916][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:23,922][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]


 51841/75000: episode: 39, duration:  0.516s, episode steps:   554, steps per second:  1073, episode reward:   1039.460, mean reward:   1.876 [    -0.010,    338.323], mean action:  1.352 [ 0.000,  3.000], mean observation:  1.380 [ 0.000,  9.000], q_value:     44.020, mean-eps:      0.439


[2020-05-11 08:45:24,332][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:24,337][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:24,449][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:24,453][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:24,559][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:24,567][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:24,724][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:24,730][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 52421/75000: episode: 40, duration:  0.638s, episode steps:   580, steps per second:   909, episode reward:   1039.200, mean reward:   1.792 [    -0.010,    338.323], mean action:  1.440 [ 0.000,  3.000], mean observation:  1.356 [ 0.000,  9.000], q_value:     48.908, mean-eps:      0.428


[2020-05-11 08:45:24,856][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:24,860][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:24,993][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:25,004][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:25,123][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:25,133][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 52877/75000: episode: 41, duration:  0.422s, episode steps:   456, steps per second:  1079, episode reward:   1040.440, mean reward:   2.282 [    -0.010,    338.323], mean action:  1.375 [ 0.000,  3.000], mean observation:  1.675 [ 0.000,  8.000], q_value:     50.349, mean-eps:      0.418


[2020-05-11 08:45:25,280][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:25,285][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:25,371][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:25,377][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:25,452][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:25,457][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:25,550][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:25,554][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 53316/75000: episode: 42, duration:  0.364s, episode steps:   439, steps per second:  1207, episode reward:   1040.610, mean reward:   2.370 [    -0.010,    338.323], mean action:  1.745 [ 0.000,  3.000], mean observation:  1.774 [ 0.000,  8.000], q_value:     52.987, mean-eps:      0.409


[2020-05-11 08:45:25,680][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:25,684][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:25,768][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:25,772][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:25,842][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:25,850][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 53779/75000: episode: 43, duration:  0.299s, episode steps:   463, steps per second:  1551, episode reward:   1040.370, mean reward:   2.247 [    -0.010,    338.323], mean action:  1.851 [ 0.000,  3.000], mean observation:  1.886 [ 0.000,  9.000], q_value:     56.905, mean-eps:      0.400


[2020-05-11 08:45:25,997][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:26,002][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:26,084][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:26,089][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:26,179][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:26,188][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 54238/75000: episode: 44, duration:  0.313s, episode steps:   459, steps per second:  1466, episode reward:   1040.410, mean reward:   2.267 [    -0.010,    338.323], mean action:  1.645 [ 0.000,  3.000], mean observation:  2.174 [ 0.000,  9.000], q_value:     61.255, mean-eps:      0.391


[2020-05-11 08:45:26,412][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:26,423][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:26,600][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:26,605][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 54628/75000: episode: 45, duration:  0.420s, episode steps:   390, steps per second:   928, episode reward:   1041.100, mean reward:   2.669 [    -0.010,    338.323], mean action:  2.036 [ 0.000,  3.000], mean observation:  1.779 [ 0.000,  9.000], q_value:     63.066, mean-eps:      0.383


[2020-05-11 08:45:26,693][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:26,702][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:26,956][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:26,970][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:27,165][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:27,169][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 55069/75000: episode: 46, duration:  0.526s, episode steps:   441, steps per second:   839, episode reward:   1040.590, mean reward:   2.360 [    -0.010,    338.323], mean action:  2.252 [ 0.000,  3.000], mean observation:  1.772 [ 0.000,  9.000], q_value:     67.574, mean-eps:      0.374


[2020-05-11 08:45:27,336][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:27,341][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:27,444][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:27,448][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:27,609][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:27,617][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 55638/75000: episode: 47, duration:  0.461s, episode steps:   569, steps per second:  1234, episode reward:   1039.310, mean reward:   1.827 [    -0.010,    338.323], mean action:  1.401 [ 0.000,  3.000], mean observation:  1.721 [ 0.000,  9.000], q_value:     66.967, mean-eps:      0.364


[2020-05-11 08:45:27,756][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:27,761][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:27,882][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:27,887][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:27,981][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:27,985][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:28,045][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:28,050][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 56278/75000: episode: 48, duration:  0.435s, episode steps:   640, steps per second:  1471, episode reward:   1038.600, mean reward:   1.623 [    -0.010,    338.323], mean action:  1.081 [ 0.000,  3.000], mean observation:  1.751 [ 0.000,  9.000], q_value:     69.228, mean-eps:      0.352


[2020-05-11 08:45:28,150][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:28,155][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:28,240][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:28,245][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:28,341][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:28,346][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 56672/75000: episode: 49, duration:  0.256s, episode steps:   394, steps per second:  1541, episode reward:   1041.060, mean reward:   2.642 [    -0.010,    338.323], mean action:  1.944 [ 0.000,  3.000], mean observation:  1.876 [ 0.000,  9.000], q_value:     67.994, mean-eps:      0.342


[2020-05-11 08:45:28,475][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:28,485][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:28,569][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:28,573][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:28,667][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:28,671][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 57134/75000: episode: 50, duration:  0.326s, episode steps:   462, steps per second:  1418, episode reward:   1040.380, mean reward:   2.252 [    -0.010,    338.323], mean action:  1.781 [ 0.000,  3.000], mean observation:  2.131 [ 0.000,  9.000], q_value:     69.203, mean-eps:      0.334


[2020-05-11 08:45:28,781][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:28,786][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:28,864][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:28,868][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:28,974][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:28,979][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 57580/75000: episode: 51, duration:  0.296s, episode steps:   446, steps per second:  1506, episode reward:   1040.540, mean reward:   2.333 [    -0.010,    338.323], mean action:  1.868 [ 0.000,  3.000], mean observation:  1.922 [ 0.000,  9.000], q_value:     70.133, mean-eps:      0.325


[2020-05-11 08:45:29,106][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:29,111][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:29,200][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:29,204][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:29,266][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:29,271][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 58078/75000: episode: 52, duration:  0.322s, episode steps:   498, steps per second:  1547, episode reward:   1040.020, mean reward:   2.088 [    -0.010,    338.323], mean action:  1.480 [ 0.000,  3.000], mean observation:  2.013 [ 0.000,  9.000], q_value:     74.740, mean-eps:      0.315


[2020-05-11 08:45:29,385][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:29,389][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]


 58502/75000: episode: 53, duration:  0.269s, episode steps:   424, steps per second:  1577, episode reward:   1040.760, mean reward:   2.455 [    -0.010,    338.323], mean action:  2.031 [ 0.000,  3.000], mean observation:  2.025 [ 0.000,  9.000], q_value:     74.392, mean-eps:      0.306


[2020-05-11 08:45:29,762][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:29,767][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:29,872][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:29,876][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:30,017][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:30,021][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:30,103][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:30,107][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:30,224][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:30,229][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 59655/75000: episode: 54, duration:  0.624s, episode steps:  1153, steps per second:  1848, episode reward:   1033.470, mean reward:   0.896 [    -0.010,    338.323], mean action:  1.154 [ 0.000,  3.000], mean observation:  1.353 [ 0.000,  9.000], q_value:     83.352, mean-eps:      0.297


[2020-05-11 08:45:30,418][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:30,423][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:30,507][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:30,512][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 60190/75000: episode: 55, duration:  0.388s, episode steps:   535, steps per second:  1380, episode reward:   1039.650, mean reward:   1.943 [    -0.010,    338.323], mean action:  1.652 [ 0.000,  3.000], mean observation:  1.983 [ 0.000,  9.000], q_value:     75.210, mean-eps:      0.287


[2020-05-11 08:45:30,683][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:30,690][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:30,824][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:30,829][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:30,914][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:30,919][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:31,006][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:31,027][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 60718/75000: episode: 56, duration:  0.394s, episode steps:   528, steps per second:  1339, episode reward:   1039.720, mean reward:   1.969 [    -0.010,    338.323], mean action:  1.820 [ 0.000,  3.000], mean observation:  2.033 [ 0.000,  9.000], q_value:     79.192, mean-eps:      0.278


[2020-05-11 08:45:31,300][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:31,305][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:31,394][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:31,398][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:31,479][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:31,484][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 61311/75000: episode: 57, duration:  0.464s, episode steps:   593, steps per second:  1277, episode reward:   1039.070, mean reward:   1.752 [    -0.010,    338.323], mean action:  1.722 [ 0.000,  3.000], mean observation:  2.463 [ 0.000,  9.000], q_value:     81.577, mean-eps:      0.270


[2020-05-11 08:45:31,575][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:31,584][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:31,711][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:31,716][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 61697/75000: episode: 58, duration:  0.264s, episode steps:   386, steps per second:  1461, episode reward:   1041.140, mean reward:   2.697 [    -0.010,    338.323], mean action:  2.394 [ 0.000,  3.000], mean observation:  1.813 [ 0.000,  8.000], q_value:     93.752, mean-eps:      0.262


[2020-05-11 08:45:31,822][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:31,829][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:31,934][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:31,938][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:32,013][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:32,020][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:32,134][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:32,138][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 62092/75000: episode: 59, duration:  0.343s, episode steps:   395, steps per second:  1152, episode reward:   1041.050, mean reward:   2.636 [    -0.010,    338.323], mean action:  2.603 [ 0.000,  3.000], mean observation:  1.948 [ 0.000,  9.000], q_value:    103.648, mean-eps:      0.257


[2020-05-11 08:45:32,251][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:32,259][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:32,390][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:32,395][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 62487/75000: episode: 60, duration:  0.319s, episode steps:   395, steps per second:  1237, episode reward:   1041.050, mean reward:   2.636 [    -0.010,    338.323], mean action:  2.392 [ 0.000,  3.000], mean observation:  1.781 [ 0.000,  8.000], q_value:    102.696, mean-eps:      0.253


[2020-05-11 08:45:32,515][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:32,519][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:32,596][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:32,601][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:32,679][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:32,683][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 62896/75000: episode: 61, duration:  0.327s, episode steps:   409, steps per second:  1250, episode reward:   1040.910, mean reward:   2.545 [    -0.010,    338.323], mean action:  2.301 [ 0.000,  3.000], mean observation:  1.770 [ 0.000,  8.000], q_value:    105.247, mean-eps:      0.250


[2020-05-11 08:45:32,828][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:32,838][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:32,920][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:32,925][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:33,014][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:33,019][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:33,106][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:33,111][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 63357/75000: episode: 62, duration:  0.326s, episode steps:   461, steps per second:  1415, episode reward:   1040.390, mean reward:   2.257 [    -0.010,    338.323], mean action:  2.115 [ 0.000,  3.000], mean observation:  1.715 [ 0.000,  8.000], q_value:    105.308, mean-eps:      0.246


[2020-05-11 08:45:33,200][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:33,204][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:33,286][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:33,290][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 63779/75000: episode: 63, duration:  0.271s, episode steps:   422, steps per second:  1560, episode reward:   1040.780, mean reward:   2.466 [    -0.010,    338.323], mean action:  2.332 [ 0.000,  3.000], mean observation:  1.667 [ 0.000,  8.000], q_value:    116.195, mean-eps:      0.243


[2020-05-11 08:45:33,422][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:33,427][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:33,512][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:33,516][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:33,595][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:33,600][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 64233/75000: episode: 64, duration:  0.308s, episode steps:   454, steps per second:  1475, episode reward:   1040.460, mean reward:   2.292 [    -0.010,    338.323], mean action:  2.200 [ 0.000,  3.000], mean observation:  1.753 [ 0.000,  8.000], q_value:    111.491, mean-eps:      0.239


[2020-05-11 08:45:33,790][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:33,794][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:33,875][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:33,886][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:33,965][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:33,969][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:34,080][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:34,085][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 64774/75000: episode: 65, duration:  0.365s, episode steps:   541, steps per second:  1484, episode reward:   1039.590, mean reward:   1.922 [    -0.010,    338.323], mean action:  2.002 [ 0.000,  3.000], mean observation:  1.693 [ 0.000,  8.000], q_value:    108.241, mean-eps:      0.234


[2020-05-11 08:45:34,170][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:34,175][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:34,264][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:34,269][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:34,377][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:34,381][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 65193/75000: episode: 66, duration:  0.284s, episode steps:   419, steps per second:  1478, episode reward:   1040.810, mean reward:   2.484 [    -0.010,    338.323], mean action:  2.298 [ 0.000,  3.000], mean observation:  1.776 [ 0.000,  8.000], q_value:    112.796, mean-eps:      0.229


[2020-05-11 08:45:34,464][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:34,468][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:34,548][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:34,552][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:34,659][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:34,667][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 65628/75000: episode: 67, duration:  0.284s, episode steps:   435, steps per second:  1529, episode reward:   1040.650, mean reward:   2.392 [    -0.010,    338.323], mean action:  1.828 [ 0.000,  3.000], mean observation:  1.835 [ 0.000,  8.000], q_value:    117.054, mean-eps:      0.225


[2020-05-11 08:45:34,771][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:34,776][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:34,850][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:34,855][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:34,929][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:34,933][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 66058/75000: episode: 68, duration:  0.308s, episode steps:   430, steps per second:  1394, episode reward:   1040.700, mean reward:   2.420 [    -0.010,    338.323], mean action:  2.240 [ 0.000,  3.000], mean observation:  1.774 [ 0.000,  8.000], q_value:    116.942, mean-eps:      0.222


[2020-05-11 08:45:35,019][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:35,026][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:35,108][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:35,113][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:35,192][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:35,201][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 66415/75000: episode: 69, duration:  0.245s, episode steps:   357, steps per second:  1460, episode reward:   1041.430, mean reward:   2.917 [    -0.010,    338.323], mean action:  2.857 [ 0.000,  3.000], mean observation:  1.879 [ 0.000,  8.000], q_value:    128.464, mean-eps:      0.219


[2020-05-11 08:45:35,295][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:35,302][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:35,380][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:35,385][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:35,453][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:35,458][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 66773/75000: episode: 70, duration:  0.263s, episode steps:   358, steps per second:  1360, episode reward:   1041.420, mean reward:   2.909 [    -0.010,    338.323], mean action:  2.874 [ 0.000,  3.000], mean observation:  1.885 [ 0.000,  8.000], q_value:    127.745, mean-eps:      0.216


[2020-05-11 08:45:35,550][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:35,554][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:35,632][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:35,640][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:35,712][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:35,717][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 67132/75000: episode: 71, duration:  0.249s, episode steps:   359, steps per second:  1439, episode reward:   1041.410, mean reward:   2.901 [    -0.010,    338.323], mean action:  2.883 [ 0.000,  3.000], mean observation:  1.873 [ 0.000,  8.000], q_value:    134.202, mean-eps:      0.214


[2020-05-11 08:45:35,810][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:35,818][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:35,910][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:35,918][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 67488/75000: episode: 72, duration:  0.263s, episode steps:   356, steps per second:  1353, episode reward:   1041.440, mean reward:   2.925 [    -0.010,    338.323], mean action:  2.879 [ 0.000,  3.000], mean observation:  1.882 [ 0.000,  8.000], q_value:    135.178, mean-eps:      0.212


[2020-05-11 08:45:36,076][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:36,086][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:36,164][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:36,168][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:36,256][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:36,261][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:36,342][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:36,350][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 67844/75000: episode: 73, duration:  0.344s, episode steps:   356, steps per second:  1036, episode reward:   1041.440, mean reward:   2.925 [    -0.010,    338.323], mean action:  2.865 [ 0.000,  3.000], mean observation:  1.883 [ 0.000,  8.000], q_value:    138.243, mean-eps:      0.209


[2020-05-11 08:45:36,447][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:36,452][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:36,529][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:36,535][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:36,604][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:36,610][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 68202/75000: episode: 74, duration:  0.280s, episode steps:   358, steps per second:  1276, episode reward:   1041.420, mean reward:   2.909 [    -0.010,    338.323], mean action:  2.874 [ 0.000,  3.000], mean observation:  1.876 [ 0.000,  8.000], q_value:    138.856, mean-eps:      0.207


[2020-05-11 08:45:36,694][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:36,702][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:36,800][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:36,805][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:36,879][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:36,884][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 68558/75000: episode: 75, duration:  0.249s, episode steps:   356, steps per second:  1432, episode reward:   1041.440, mean reward:   2.925 [    -0.010,    338.323], mean action:  2.871 [ 0.000,  3.000], mean observation:  1.885 [ 0.000,  9.000], q_value:    142.137, mean-eps:      0.207


[2020-05-11 08:45:36,970][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:36,975][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:37,051][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:37,056][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:37,131][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:37,136][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 68914/75000: episode: 76, duration:  0.267s, episode steps:   356, steps per second:  1332, episode reward:   1041.440, mean reward:   2.925 [    -0.010,    338.323], mean action:  2.890 [ 1.000,  3.000], mean observation:  1.883 [ 0.000,  8.000], q_value:    144.391, mean-eps:      0.207


[2020-05-11 08:45:37,223][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:37,228][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:37,323][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:37,328][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:37,405][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:37,409][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 69272/75000: episode: 77, duration:  0.248s, episode steps:   358, steps per second:  1442, episode reward:   1041.420, mean reward:   2.909 [    -0.010,    338.323], mean action:  2.874 [ 0.000,  3.000], mean observation:  1.888 [ 0.000,  8.000], q_value:    146.244, mean-eps:      0.207


[2020-05-11 08:45:37,507][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:37,512][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:37,602][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:37,607][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:37,681][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:37,685][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 69628/75000: episode: 78, duration:  0.276s, episode steps:   356, steps per second:  1289, episode reward:   1041.440, mean reward:   2.925 [    -0.010,    338.323], mean action:  2.899 [ 0.000,  3.000], mean observation:  1.883 [ 0.000,  8.000], q_value:    150.981, mean-eps:      0.207


[2020-05-11 08:45:37,804][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:37,809][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:37,903][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:37,908][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:38,000][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:38,004][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 69983/75000: episode: 79, duration:  0.291s, episode steps:   355, steps per second:  1219, episode reward:   1041.450, mean reward:   2.934 [    -0.010,    338.323], mean action:  2.899 [ 0.000,  3.000], mean observation:  1.882 [ 0.000,  8.000], q_value:    156.103, mean-eps:      0.207


[2020-05-11 08:45:38,151][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:38,155][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:38,259][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:38,270][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 70379/75000: episode: 80, duration:  0.335s, episode steps:   396, steps per second:  1182, episode reward:   1041.040, mean reward:   2.629 [    -0.010,    338.323], mean action:  2.899 [ 0.000,  3.000], mean observation:  2.045 [ 0.000,  8.000], q_value:    155.973, mean-eps:      0.207


[2020-05-11 08:45:38,377][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:38,386][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:38,506][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:38,514][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:38,608][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:38,613][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:38,700][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:38,704][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 70736/75000: episode: 81, duration:  0.347s, episode steps:   357, steps per second:  1028, episode reward:   1041.430, mean reward:   2.917 [    -0.010,    338.323], mean action:  2.899 [ 1.000,  3.000], mean observation:  1.880 [ 0.000,  8.000], q_value:    157.869, mean-eps:      0.207


[2020-05-11 08:45:38,816][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:38,820][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:38,930][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:38,935][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:39,030][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:39,036][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 71092/75000: episode: 82, duration:  0.310s, episode steps:   356, steps per second:  1149, episode reward:   1041.440, mean reward:   2.925 [    -0.010,    338.323], mean action:  2.907 [ 1.000,  3.000], mean observation:  1.883 [ 0.000,  8.000], q_value:    163.108, mean-eps:      0.207


[2020-05-11 08:45:39,153][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:39,159][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:39,235][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:39,240][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:39,345][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:39,353][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 71448/75000: episode: 83, duration:  0.322s, episode steps:   356, steps per second:  1105, episode reward:   1041.440, mean reward:   2.925 [    -0.010,    338.323], mean action:  2.882 [ 0.000,  3.000], mean observation:  1.883 [ 0.000,  8.000], q_value:    165.046, mean-eps:      0.207


[2020-05-11 08:45:39,468][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:39,474][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:39,564][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:39,568][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:39,636][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:39,644][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 71804/75000: episode: 84, duration:  0.305s, episode steps:   356, steps per second:  1168, episode reward:   1041.440, mean reward:   2.925 [    -0.010,    338.323], mean action:  2.896 [ 0.000,  3.000], mean observation:  1.883 [ 0.000,  8.000], q_value:    166.147, mean-eps:      0.207


[2020-05-11 08:45:39,754][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:39,760][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:39,843][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:39,848][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:39,935][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:39,939][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 72160/75000: episode: 85, duration:  0.283s, episode steps:   356, steps per second:  1258, episode reward:   1041.440, mean reward:   2.925 [    -0.010,    338.323], mean action:  2.890 [ 1.000,  3.000], mean observation:  1.883 [ 0.000,  8.000], q_value:    168.089, mean-eps:      0.207


[2020-05-11 08:45:40,034][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:40,039][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:40,148][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:40,153][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 72518/75000: episode: 86, duration:  0.269s, episode steps:   358, steps per second:  1332, episode reward:   1041.420, mean reward:   2.909 [    -0.010,    338.323], mean action:  2.899 [ 1.000,  3.000], mean observation:  1.889 [ 0.000,  8.000], q_value:    169.437, mean-eps:      0.207


[2020-05-11 08:45:40,276][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:40,283][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:40,362][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:40,366][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:40,440][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:40,444][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:40,512][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:40,517][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 72874/75000: episode: 87, duration:  0.324s, episode steps:   356, steps per second:  1099, episode reward:   1041.440, mean reward:   2.925 [    -0.010,    338.323], mean action:  2.888 [ 0.000,  3.000], mean observation:  1.883 [ 0.000,  8.000], q_value:    171.387, mean-eps:      0.207


[2020-05-11 08:45:40,601][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:40,607][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:40,688][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:40,692][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:40,770][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:40,775][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 73230/75000: episode: 88, duration:  0.238s, episode steps:   356, steps per second:  1496, episode reward:   1041.440, mean reward:   2.925 [    -0.010,    338.323], mean action:  2.874 [ 0.000,  3.000], mean observation:  1.888 [ 0.000,  8.000], q_value:    167.512, mean-eps:      0.207


[2020-05-11 08:45:40,869][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:40,874][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:40,953][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:40,958][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:41,024][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:41,030][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 73586/75000: episode: 89, duration:  0.259s, episode steps:   356, steps per second:  1372, episode reward:   1041.440, mean reward:   2.925 [    -0.010,    338.323], mean action:  2.879 [ 0.000,  3.000], mean observation:  1.877 [ 0.000,  8.000], q_value:    167.110, mean-eps:      0.207


[2020-05-11 08:45:41,127][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:41,138][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:41,219][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:41,224][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:41,296][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:41,302][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 73941/75000: episode: 90, duration:  0.253s, episode steps:   355, steps per second:  1406, episode reward:   1041.450, mean reward:   2.934 [    -0.010,    338.323], mean action:  2.899 [ 1.000,  3.000], mean observation:  1.883 [ 0.000,  8.000], q_value:    167.518, mean-eps:      0.207


[2020-05-11 08:45:41,393][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:41,400][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:41,485][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:41,489][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:41,560][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:41,565][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 74298/75000: episode: 91, duration:  0.259s, episode steps:   357, steps per second:  1378, episode reward:   1041.430, mean reward:   2.917 [    -0.010,    338.323], mean action:  2.857 [ 0.000,  3.000], mean observation:  1.882 [ 0.000,  8.000], q_value:    164.402, mean-eps:      0.207


[2020-05-11 08:45:41,657][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:41,663][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]
[2020-05-11 08:45:41,762][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:41,767][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:41,845][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:41,851][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]


 74654/75000: episode: 92, duration:  0.265s, episode steps:   356, steps per second:  1345, episode reward:   1041.440, mean reward:   2.925 [    -0.010,    338.323], mean action:  2.890 [ 1.000,  3.000], mean observation:  1.882 [ 0.000,  8.000], q_value:    166.804, mean-eps:      0.207


[2020-05-11 08:45:41,945][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:41,950][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]


 75011/75000: episode: 93, duration:  0.276s, episode steps:   357, steps per second:  1291, episode reward:   1041.430, mean reward:   2.917 [    -0.010,    338.323], mean action:  2.896 [ 1.000,  3.000], mean observation:  1.880 [ 0.000,  8.000], q_value:    170.422, mean-eps:      0.206
done, took 49.449 seconds




Testing for 5 episodes ...


[2020-05-11 08:45:47,674][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:47,682][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:50,917][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:45:50,922][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:45:54,365][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:45:54,372][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]


Episode 1: reward: 1041.440, steps: 356


[2020-05-11 08:45:58,111][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:45:58,118][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:46:01,618][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:46:01,623][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:46:04,722][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:46:04,729][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]


Episode 2: reward: 1041.450, steps: 355


[2020-05-11 08:46:05,174][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:46:05,178][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:46:05,463][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:46:05,466][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:46:05,770][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:46:05,775][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]


Episode 3: reward: 1041.450, steps: 355


[2020-05-11 08:46:06,110][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:46:06,115][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:46:06,403][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:46:06,408][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:46:06,696][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:46:06,700][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]


Episode 4: reward: 1041.440, steps: 356


[2020-05-11 08:46:07,027][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 08:46:07,031][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:46:07,326][temprl.automata][step][DEBUG]: transition idxs: 1, 3
[2020-05-11 08:46:07,332][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333, 0.0]
[2020-05-11 08:46:07,626][temprl.automata][step][DEBUG]: transition idxs: 3, 4
[2020-05-11 08:46:07,630][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337, 0.0]


Episode 5: reward: 1041.430, steps: 357
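
The episode rewards logged above can be checked by hand. The sketch below assumes values inferred from the log, not taken from the configuration: a goal reward of 1000/3 per automaton transition, a brick reward of 5, a step reward of -0.01 (so the largest single-step reward is 333.333 + 5 - 0.01 = 338.323), and 9 bricks in total. `episode_reward` is a hypothetical helper, not part of the project code.

```python
def episode_reward(goals, bricks, steps,
                   goal_reward=1000 / 3, brick_reward=5.0, step_reward=-0.01):
    """Total reward of one episode under the assumed reward values.

    goals  -- number of automaton transitions that paid a goal reward
    bricks -- number of bricks broken
    steps  -- number of environment steps (each one pays step_reward)
    """
    return goals * goal_reward + bricks * brick_reward + steps * step_reward


# Expert run: 3 goal rewards, 9 bricks, 356 steps.
print(f"{episode_reward(3, 9, 356):.3f}")  # 1041.440
# Same episode shape with 355 steps.
print(f"{episode_reward(3, 9, 355):.3f}")  # 1041.450
```

Under these assumptions, a shorter episode earns slightly more because each step costs 0.01, which matches the 355-step episodes scoring 1041.450 and the 396-step episode scoring 1041.040.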



### Automaton

* DFA = Deterministic Finite-state Automaton.
* The expert learns from a true automaton, shown below.
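
To make the DFA definition concrete, here is a minimal sketch of a DFA as a transition table. `SimpleDFA` and the symbols `"0"`, `"1"`, `"2"` are hypothetical names chosen for illustration; this is not the `pythomata` API and not the automaton used in the experiment.

```python
class SimpleDFA:
    """A deterministic finite-state automaton stored as a transition table."""

    def __init__(self, initial, accepting, transitions):
        self.initial = initial          # start state
        self.accepting = set(accepting) # set of accepting states
        self.transitions = transitions  # maps (state, symbol) -> next state

    def accepts(self, word):
        """Return True if reading `word` from the initial state ends in an accepting state."""
        state = self.initial
        for symbol in word:
            key = (state, symbol)
            if key not in self.transitions:
                # A missing transition acts as an implicit rejecting sink.
                return False
            state = self.transitions[key]
        return state in self.accepting


# Hypothetical automaton: accept only when the symbols 0, 1, 2 are seen in this order.
example = SimpleDFA(
    initial=0,
    accepting={3},
    transitions={(0, "0"): 1, (1, "1"): 2, (2, "2"): 3},
)

print(example.accepts("012"))  # True
print(example.accepts("021"))  # False
```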

**True Automaton**

![](experiments/breakout-output/true_automaton.svg)

**Learned Automaton**

<img src="experiments/breakout-output/learned_automaton.svg" alt="drawing" width="300"/>

In [31]:
print("Learn the automaton from traces.")
dfa = learn_dfa(arguments)

[2020-05-11 08:47:34,957][inferrer][__init__][INFO]: Created Active Learner [LSTAR] instance with Passive oracle
[2020-05-11 08:47:34,959][inferrer][learn][INFO]: Start learning.
[2020-05-11 08:47:34,959][inferrer][_initialise][INFO]: Initialising the table.
[2020-05-11 08:47:34,960][inferrer][_build_automaton][INFO]: Building DFA from the table.
[2020-05-11 08:47:34,961][inferrer][learn][INFO]: Submitting equivalence query.
[2020-05-11 08:47:34,962][inferrer][learn][INFO]: Oracle return 012 as counterexample.
[2020-05-11 08:47:34,966][inferrer][_useq][INFO]: Updating table by adding the following prefixes: , 01, 0, 012
[2020-05-11 08:47:34,967][inferrer][_consistent][INFO]: Making the table consistent by adding a column.
[2020-05-11 08:47:34,970][inferrer][_find_inconsistent][INFO]: Trying to find two inconsistent rows in the table.
[2020-05-11 08:47:34,971][inferrer][_find_inconsistent][INFO]: Found two inconsistent rows  and 01
[2020-05-11 08:47:34,971][inferrer][_consistent][INFO]:

Learn the automaton from traces.


In [41]:
dfa_dot_file = os.path.join(arguments.output_dir, "learned_automaton")
dfa.to_dot(dfa_dot_file)
print("Check the file {}.svg".format(dfa_dot_file))

[2020-05-11 09:01:50,090][graphviz.files][save][DEBUG]: write 464 bytes to 'experiments/breakout-output/learned_automaton'
[2020-05-11 09:01:50,097][graphviz.backend][run][DEBUG]: run ['dot', '-Tsvg', '-O', 'learned_automaton']


Check the file experiments/breakout-output/learned_automaton.svg


### Learner

Method from `learner.py`:
``` python

def run_learner(arguments, configuration, dfa: pythomata.dfa.DFA):
    # Start from a clean output directory for the learner.
    agent_dir = Path(arguments.output_dir) / "learner"
    shutil.rmtree(agent_dir, ignore_errors=True)
    agent_dir.mkdir(parents=True, exist_ok=False)

    # Build the Breakout environment and wrap it with the learned automaton.
    config = BreakoutConfiguration(brick_rows=arguments.rows, brick_cols=arguments.cols,
                                   brick_reward=arguments.brick_reward, step_reward=arguments.step_reward,
                                   fire_enabled=False, ball_enabled=True)
    env = make_env_from_dfa(config, dfa)

    # Seed NumPy and the environment for reproducibility.
    np.random.seed(arguments.seed)
    env.seed(arguments.seed)

    # Exploration policy: epsilon goes from value_max to value_min over nb_steps.
    policy = AutomataPolicy((-1, ), nb_steps=configuration.nb_exploration_steps, value_max=0.8, value_min=configuration.min_eps)

    # Choose the learning algorithm from the configuration (Sarsa or Q-learning).
    algorithm = Sarsa if configuration.algorithm == "sarsa" else QLearning
    agent = Agent(algorithm(None,
                            env.action_space,
                            gamma=configuration.gamma,
                            alpha=configuration.alpha,
                            lambda_=configuration.lambda_),
                  policy=policy,
                  test_policy=EpsGreedyQPolicy(eps=0.001))

    # Train, saving checkpoints and logging one line per episode.
    history = agent.fit(
        env,
        nb_steps=configuration.nb_steps,
        visualize=configuration.visualize_training,
        callbacks=[
            ModelCheckpoint(str(agent_dir / "checkpoints" / "agent-{}.pkl")),
            TrainEpisodeLogger()
        ]
    )

    # Persist the training history, the final agent, and the plots.
    history.save(agent_dir / "history.json")
    agent.save(agent_dir / "checkpoints" / "agent.pkl")
    plot_history(history, agent_dir)

    # Reload the saved agent and record 5 test episodes as videos.
    agent = Agent.load(agent_dir / "checkpoints" / "agent.pkl")
    agent.test(Monitor(env, agent_dir / "videos"), nb_episodes=5, visualize=True)

    env.close()
```
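
`run_learner` selects `Sarsa` with `gamma`, `alpha`, and `lambda_` parameters. As a rough illustration of what a tabular Sarsa(λ) step does, the sketch below applies one update with accumulating eligibility traces. It is a textbook-style sketch, not the project's `Sarsa` class; the function name, the dictionary-based Q-table, and the choice of accumulating (rather than replacing) traces are assumptions.

```python
from collections import defaultdict


def sarsa_lambda_update(q, traces, s, a, r, s_next, a_next, done,
                        alpha=0.1, gamma=0.99, lambda_=0.99):
    """Apply one tabular Sarsa(lambda) update in place and return the TD error."""
    # On-policy target: bootstrap from the action actually chosen in s_next.
    target = r if done else r + gamma * q[(s_next, a_next)]
    delta = target - q[(s, a)]

    # Accumulating trace for the visited state-action pair (assumption).
    traces[(s, a)] += 1.0

    # Propagate the TD error to every recently visited pair, then decay its trace.
    for key in list(traces):
        q[key] += alpha * delta * traces[key]
        traces[key] *= gamma * lambda_

    if done:
        traces.clear()  # traces do not carry over between episodes
    return delta


q = defaultdict(float)       # Q-table, defaults to 0.0
traces = defaultdict(float)  # eligibility traces

# Two-step episode: no reward first, then reward 1 on the terminal step.
sarsa_lambda_update(q, traces, s=0, a=0, r=0.0, s_next=1, a_next=0, done=False)
sarsa_lambda_update(q, traces, s=1, a=0, r=1.0, s_next=None, a_next=None, done=True)
print(q[(1, 0)], q[(0, 0)])
```

With `lambda_` close to 1, as in the configuration used here, the reward at the end of the episode also updates the earlier state-action pair, which is what lets sparse goal rewards propagate backwards quickly.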

In [97]:
learner_config

{'nb_steps': 75000,
 'nb_exploration_steps': 10000,
 'min_eps': 0.01,
 'visualize_training': False,
 'reward_shaping': True,
 'gamma': 0.99,
 'alpha': 0.1,
 'lambda_': 0.99,
 'algorithm': 'sarsa'}
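
The exploration settings (`value_max=0.8` in `run_learner`, `min_eps` and `nb_exploration_steps` from this configuration) suggest an annealed epsilon. The sketch below assumes a linear schedule, which is not confirmed by the source; `linear_epsilon` and `mean_epsilon` are hypothetical helpers, not functions of the project.

```python
def linear_epsilon(step, value_max=0.8, value_min=0.01, nb_steps=10_000):
    """Epsilon after `step` steps, annealed linearly from value_max to value_min."""
    fraction = min(max(step, 0), nb_steps) / nb_steps
    return value_max - (value_max - value_min) * fraction


def mean_epsilon(first_step, last_step):
    """Average epsilon over the half-open step range [first_step, last_step)."""
    steps = range(first_step, last_step)
    return sum(linear_epsilon(t) for t in steps) / len(steps)


print(round(mean_epsilon(0, 151), 3))    # 0.794
print(round(mean_epsilon(151, 440), 3))  # 0.777
```

These two values agree with the `mean-eps` reported for the first two training episodes (steps 0 to 151 and 151 to 440). Later episodes in the log show `mean-eps` rising again, so a single linear schedule does not describe the whole run.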

In [42]:
print("Running the learner.")
run_learner(arguments, learner_config, dfa)

Running the learner.
Training for 75000 steps ...
   151/75000: episode: 1, duration:  0.245s, episode steps:   151, steps per second:   616, episode reward:      3.490, mean reward:   0.023 [    -0.010,      4.990], mean action:  0.848 [ 0.000,  2.000], mean observation:  7.614 [ 0.000,  47.000], q_value:     -0.000, mean-eps:      0.794
   440/75000: episode: 2, duration:  0.513s, episode steps:   289, steps per second:   564, episode reward:      7.110, mean reward:   0.025 [    -0.010,      4.990], mean action:  0.682 [ 0.000,  2.000], mean observation:  7.262 [ 0.000,  47.000], q_value:      0.001, mean-eps:      0.777
   591/75000: episode: 3, duration:  0.359s, episode steps:   151, steps per second:   421, episode reward:      3.490, mean reward:   0.023 [    -0.010,      4.990], mean action:  0.742 [ 0.000,  2.000], mean observation:  7.909 [ 0.000,  47.000], q_value:      0.003, mean-eps:      0.759
   742/75000: episode: 4, duration:  0.160s, episode steps:   151, steps per 

  8999/75000: episode: 29, duration:  0.380s, episode steps:   281, steps per second:   739, episode reward:      7.190, mean reward:   0.026 [    -0.010,      4.990], mean action:  0.658 [ 0.000,  2.000], mean observation:  7.587 [ 0.000,  47.000], q_value:      0.312, mean-eps:      0.098
  9280/75000: episode: 30, duration:  0.350s, episode steps:   281, steps per second:   803, episode reward:      7.190, mean reward:   0.026 [    -0.010,      4.990], mean action:  0.509 [ 0.000,  2.000], mean observation:  7.242 [ 0.000,  47.000], q_value:      0.287, mean-eps:      0.076
  9431/75000: episode: 31, duration:  0.124s, episode steps:   151, steps per second:  1214, episode reward:      3.490, mean reward:   0.023 [    -0.010,      4.990], mean action:  0.291 [ 0.000,  2.000], mean observation:  7.496 [ 0.000,  47.000], q_value:      0.126, mean-eps:      0.059
  9850/75000: episode: 32, duration:  0.679s, episode steps:   419, steps per second:   617, episode reward:     10.810, mea

[2020-05-11 09:02:42,771][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:02:42,781][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]


 10696/75000: episode: 34, duration:  1.144s, episode steps:   565, steps per second:   494, episode reward:    347.683, mean reward:   0.615 [    -0.010,    338.323], mean action:  0.411 [ 0.000,  2.000], mean observation:  7.112 [ 0.000,  47.000], q_value:      0.283, mean-eps:      0.064
 11255/75000: episode: 35, duration:  1.220s, episode steps:   559, steps per second:   458, episode reward:     14.410, mean reward:   0.026 [    -0.010,      4.990], mean action:  0.386 [ 0.000,  2.000], mean observation:  7.293 [ 0.000,  47.000], q_value:      1.023, mean-eps:      0.402


[2020-05-11 09:02:45,225][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:02:45,233][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]


 11820/75000: episode: 36, duration:  1.214s, episode steps:   565, steps per second:   465, episode reward:    347.683, mean reward:   0.615 [    -0.010,    338.323], mean action:  0.412 [ 0.000,  2.000], mean observation:  7.071 [ 0.000,  47.000], q_value:      2.703, mean-eps:      0.402


[2020-05-11 09:02:46,405][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:02:46,415][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]


 12385/75000: episode: 37, duration:  1.182s, episode steps:   565, steps per second:   478, episode reward:    347.683, mean reward:   0.615 [    -0.010,    338.323], mean action:  0.435 [ 0.000,  2.000], mean observation:  7.153 [ 0.000,  47.000], q_value:      5.509, mean-eps:      0.399
 12800/75000: episode: 38, duration:  0.686s, episode steps:   415, steps per second:   605, episode reward:     10.850, mean reward:   0.026 [    -0.010,      4.990], mean action:  0.439 [ 0.000,  2.000], mean observation:  7.467 [ 0.000,  47.000], q_value:      0.812, mean-eps:      0.396


[2020-05-11 09:02:48,348][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:02:48,358][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:02:49,103][temprl.automata][step][DEBUG]: transition idxs: 1, 2
[2020-05-11 09:02:49,114][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:02:49,789][temprl.automata][step][DEBUG]: transition idxs: 2, 3
[2020-05-11 09:02:49,797][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337]


 13673/75000: episode: 39, duration:  2.424s, episode steps:   873, steps per second:   360, episode reward:   1036.270, mean reward:   1.187 [    -0.010,    338.323], mean action:  0.565 [ 0.000,  2.000], mean observation:  6.574 [ 0.000,  45.000], q_value:      5.411, mean-eps:      0.421
 13954/75000: episode: 40, duration:  0.401s, episode steps:   281, steps per second:   700, episode reward:      7.190, mean reward:   0.026 [    -0.010,      4.990], mean action:  0.715 [ 0.000,  2.000], mean observation:  7.493 [ 0.000,  47.000], q_value:      0.457, mean-eps:      0.590


[2020-05-11 09:02:51,238][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:02:51,250][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]


 14519/75000: episode: 41, duration:  1.320s, episode steps:   565, steps per second:   428, episode reward:    347.683, mean reward:   0.615 [    -0.010,    338.323], mean action:  0.437 [ 0.000,  2.000], mean observation:  7.087 [ 0.000,  47.000], q_value:      6.584, mean-eps:      0.590


[2020-05-11 09:02:52,609][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:02:52,618][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]


 15214/75000: episode: 42, duration:  1.785s, episode steps:   695, steps per second:   389, episode reward:    351.383, mean reward:   0.506 [    -0.010,    338.323], mean action:  0.506 [ 0.000,  2.000], mean observation:  7.251 [ 0.000,  47.000], q_value:     10.300, mean-eps:      0.588


[2020-05-11 09:02:54,292][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:02:54,301][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]


 15779/75000: episode: 43, duration:  1.213s, episode steps:   565, steps per second:   466, episode reward:    347.683, mean reward:   0.615 [    -0.010,    338.323], mean action:  0.437 [ 0.000,  2.000], mean observation:  7.136 [ 0.000,  47.000], q_value:     15.090, mean-eps:      0.585


[2020-05-11 09:02:55,508][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:02:55,518][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:02:56,231][temprl.automata][step][DEBUG]: transition idxs: 1, 2
[2020-05-11 09:02:56,240][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]


 16587/75000: episode: 44, duration:  2.080s, episode steps:   808, steps per second:   388, episode reward:    698.587, mean reward:   0.865 [    -0.010,    338.323], mean action:  0.563 [ 0.000,  2.000], mean observation:  6.588 [ 0.000,  47.000], q_value:     11.365, mean-eps:      0.582


[2020-05-11 09:02:57,592][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:02:57,602][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:02:58,292][temprl.automata][step][DEBUG]: transition idxs: 1, 2
[2020-05-11 09:02:58,301][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:02:59,052][temprl.automata][step][DEBUG]: transition idxs: 2, 3
[2020-05-11 09:02:59,062][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337]


 17498/75000: episode: 45, duration:  2.408s, episode steps:   911, steps per second:   378, episode reward:   1035.890, mean reward:   1.137 [    -0.010,    338.323], mean action:  0.560 [ 0.000,  2.000], mean observation:  6.419 [ 0.000,  45.000], q_value:     12.640, mean-eps:      0.575


[2020-05-11 09:03:00,023][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:03:00,029][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]


 18063/75000: episode: 46, duration:  1.222s, episode steps:   565, steps per second:   462, episode reward:    347.683, mean reward:   0.615 [    -0.010,    338.323], mean action:  0.421 [ 0.000,  2.000], mean observation:  7.127 [ 0.000,  47.000], q_value:     20.153, mean-eps:      0.568


[2020-05-11 09:03:01,258][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:03:01,266][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]


 18628/75000: episode: 47, duration:  1.215s, episode steps:   565, steps per second:   465, episode reward:    347.683, mean reward:   0.615 [    -0.010,    338.323], mean action:  0.450 [ 0.000,  2.000], mean observation:  7.141 [ 0.000,  47.000], q_value:     19.855, mean-eps:      0.567


[2020-05-11 09:03:02,464][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:03:02,471][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:03:03,154][temprl.automata][step][DEBUG]: transition idxs: 1, 2
[2020-05-11 09:03:03,162][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:03:03,910][temprl.automata][step][DEBUG]: transition idxs: 2, 3
[2020-05-11 09:03:03,917][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337]


 19539/75000: episode: 48, duration:  2.386s, episode steps:   911, steps per second:   382, episode reward:   1035.890, mean reward:   1.137 [    -0.010,    338.323], mean action:  0.563 [ 0.000,  2.000], mean observation:  6.374 [ 0.000,  45.000], q_value:     15.993, mean-eps:      0.563


[2020-05-11 09:03:04,856][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:03:04,866][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]


 20104/75000: episode: 49, duration:  1.180s, episode steps:   565, steps per second:   479, episode reward:    347.683, mean reward:   0.615 [    -0.010,    338.323], mean action:  0.427 [ 0.000,  2.000], mean observation:  7.075 [ 0.000,  47.000], q_value:     26.743, mean-eps:      0.557


[2020-05-11 09:03:06,140][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:03:06,150][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]


 20669/75000: episode: 50, duration:  1.269s, episode steps:   565, steps per second:   445, episode reward:    347.683, mean reward:   0.615 [    -0.010,    338.323], mean action:  0.457 [ 0.000,  2.000], mean observation:  7.141 [ 0.000,  47.000], q_value:     28.543, mean-eps:      0.555
 21084/75000: episode: 51, duration:  0.699s, episode steps:   415, steps per second:   594, episode reward:     10.850, mean reward:   0.026 [    -0.010,      4.990], mean action:  0.465 [ 0.000,  2.000], mean observation:  7.467 [ 0.000,  47.000], q_value:      2.226, mean-eps:      0.554


[2020-05-11 09:03:08,046][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:03:08,054][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]


 21779/75000: episode: 52, duration:  1.674s, episode steps:   695, steps per second:   415, episode reward:    351.383, mean reward:   0.506 [    -0.010,    338.323], mean action:  0.540 [ 0.000,  2.000], mean observation:  7.287 [ 0.000,  47.000], q_value:     24.043, mean-eps:      0.553


[2020-05-11 09:03:09,755][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:03:09,766][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]


 22344/75000: episode: 53, duration:  1.231s, episode steps:   565, steps per second:   459, episode reward:    347.683, mean reward:   0.615 [    -0.010,    338.323], mean action:  0.442 [ 0.000,  2.000], mean observation:  7.177 [ 0.000,  47.000], q_value:     31.693, mean-eps:      0.550
 22625/75000: episode: 54, duration:  0.389s, episode steps:   281, steps per second:   721, episode reward:      7.190, mean reward:   0.026 [    -0.010,      4.990], mean action:  0.751 [ 0.000,  2.000], mean observation:  7.263 [ 0.000,  47.000], q_value:      1.523, mean-eps:      0.548
 22906/75000: episode: 55, duration:  0.390s, episode steps:   281, steps per second:   720, episode reward:      7.190, mean reward:   0.026 [    -0.010,      4.990], mean action:  0.466 [ 0.000,  2.000], mean observation:  7.062 [ 0.000,  47.000], q_value:      0.866, mean-eps:      0.548
 23187/75000: episode: 56, duration:  0.377s, episode steps:   281, steps per second:   745, episode reward:      7.190, mea

[2020-05-11 09:03:12,123][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:03:12,135][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]


 23752/75000: episode: 57, duration:  1.181s, episode steps:   565, steps per second:   478, episode reward:    347.683, mean reward:   0.615 [    -0.010,    338.323], mean action:  0.455 [ 0.000,  2.000], mean observation:  7.175 [ 0.000,  47.000], q_value:     28.644, mean-eps:      0.548


[2020-05-11 09:03:13,348][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:03:13,358][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]


 24317/75000: episode: 58, duration:  1.216s, episode steps:   565, steps per second:   465, episode reward:    347.683, mean reward:   0.615 [    -0.010,    338.323], mean action:  0.450 [ 0.000,  2.000], mean observation:  7.089 [ 0.000,  47.000], q_value:     29.766, mean-eps:      0.547


[2020-05-11 09:03:14,526][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:03:14,538][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]


 24882/75000: episode: 59, duration:  1.188s, episode steps:   565, steps per second:   476, episode reward:    347.683, mean reward:   0.615 [    -0.010,    338.323], mean action:  0.460 [ 0.000,  2.000], mean observation:  7.117 [ 0.000,  47.000], q_value:     34.622, mean-eps:      0.545
 25297/75000: episode: 60, duration:  0.715s, episode steps:   415, steps per second:   580, episode reward:     10.850, mean reward:   0.026 [    -0.010,      4.990], mean action:  0.482 [ 0.000,  2.000], mean observation:  7.460 [ 0.000,  47.000], q_value:      2.654, mean-eps:      0.544


[2020-05-11 09:03:16,451][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:03:16,466][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:03:17,122][temprl.automata][step][DEBUG]: transition idxs: 1, 2
[2020-05-11 09:03:17,129][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:03:17,746][temprl.automata][step][DEBUG]: transition idxs: 2, 3
[2020-05-11 09:03:17,754][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337]


 26170/75000: episode: 61, duration:  2.226s, episode steps:   873, steps per second:   392, episode reward:   1036.270, mean reward:   1.187 [    -0.010,    338.323], mean action:  0.565 [ 0.000,  2.000], mean observation:  6.599 [ 0.000,  45.000], q_value:     23.250, mean-eps:      0.542


[2020-05-11 09:03:18,696][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:03:18,706][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]


 26735/75000: episode: 62, duration:  1.182s, episode steps:   565, steps per second:   478, episode reward:    347.683, mean reward:   0.615 [    -0.010,    338.323], mean action:  0.469 [ 0.000,  2.000], mean observation:  7.171 [ 0.000,  47.000], q_value:     36.950, mean-eps:      0.536


[2020-05-11 09:03:20,406][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:03:20,418][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:03:21,090][temprl.automata][step][DEBUG]: transition idxs: 1, 2
[2020-05-11 09:03:21,098][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]


 27543/75000: episode: 63, duration:  2.544s, episode steps:   808, steps per second:   318, episode reward:    698.587, mean reward:   0.865 [    -0.010,    338.323], mean action:  0.575 [ 0.000,  2.000], mean observation:  6.567 [ 0.000,  47.000], q_value:     27.923, mean-eps:      0.533
 28096/75000: episode: 64, duration:  1.154s, episode steps:   553, steps per second:   479, episode reward:     14.470, mean reward:   0.026 [    -0.010,      4.990], mean action:  0.380 [ 0.000,  2.000], mean observation:  7.236 [ 0.000,  47.000], q_value:      2.310, mean-eps:      0.528


[2020-05-11 09:03:23,601][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:03:23,610][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]


 28661/75000: episode: 65, duration:  1.188s, episode steps:   565, steps per second:   476, episode reward:    347.683, mean reward:   0.615 [    -0.010,    338.323], mean action:  0.439 [ 0.000,  2.000], mean observation:  7.152 [ 0.000,  47.000], q_value:     31.106, mean-eps:      0.528


[2020-05-11 09:03:24,781][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:03:24,790][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:03:25,493][temprl.automata][step][DEBUG]: transition idxs: 1, 2
[2020-05-11 09:03:25,502][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]


 29469/75000: episode: 66, duration:  2.022s, episode steps:   808, steps per second:   400, episode reward:    698.587, mean reward:   0.865 [    -0.010,    338.323], mean action:  0.531 [ 0.000,  2.000], mean observation:  6.698 [ 0.000,  47.000], q_value:     26.261, mean-eps:      0.525


[2020-05-11 09:03:26,831][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:03:26,840][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:03:27,496][temprl.automata][step][DEBUG]: transition idxs: 1, 2
[2020-05-11 09:03:27,504][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]


 30277/75000: episode: 67, duration:  1.975s, episode steps:   808, steps per second:   409, episode reward:    698.587, mean reward:   0.865 [    -0.010,    338.323], mean action:  0.551 [ 0.000,  2.000], mean observation:  6.761 [ 0.000,  47.000], q_value:     30.115, mean-eps:      0.519
 30830/75000: episode: 68, duration:  1.096s, episode steps:   553, steps per second:   505, episode reward:     14.470, mean reward:   0.026 [    -0.010,      4.990], mean action:  0.443 [ 0.000,  2.000], mean observation:  7.019 [ 0.000,  47.000], q_value:      1.506, mean-eps:      0.514
 31533/75000: episode: 69, duration:  1.586s, episode steps:   703, steps per second:   443, episode reward:     17.970, mean reward:   0.026 [    -0.010,      4.990], mean action:  0.320 [ 0.000,  2.000], mean observation:  7.095 [ 0.000,  47.000], q_value:      8.915, mean-eps:      0.514


[2020-05-11 09:03:31,503][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:03:31,514][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]


 32098/75000: episode: 70, duration:  1.178s, episode steps:   565, steps per second:   480, episode reward:    347.683, mean reward:   0.615 [    -0.010,    338.323], mean action:  0.467 [ 0.000,  2.000], mean observation:  7.155 [ 0.000,  47.000], q_value:     41.533, mean-eps:      0.514


[2020-05-11 09:03:32,680][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:03:32,690][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]


 32793/75000: episode: 71, duration:  1.588s, episode steps:   695, steps per second:   438, episode reward:    351.383, mean reward:   0.506 [    -0.010,    338.323], mean action:  0.469 [ 0.000,  2.000], mean observation:  7.210 [ 0.000,  47.000], q_value:     34.591, mean-eps:      0.512


[2020-05-11 09:03:34,295][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:03:34,306][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]


 33358/75000: episode: 72, duration:  1.194s, episode steps:   565, steps per second:   473, episode reward:    347.683, mean reward:   0.615 [    -0.010,    338.323], mean action:  0.485 [ 0.000,  2.000], mean observation:  7.117 [ 0.000,  47.000], q_value:     43.406, mean-eps:      0.508


[2020-05-11 09:03:35,493][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:03:35,502][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]


 33923/75000: episode: 73, duration:  1.195s, episode steps:   565, steps per second:   473, episode reward:    347.683, mean reward:   0.615 [    -0.010,    338.323], mean action:  0.474 [ 0.000,  2.000], mean observation:  7.015 [ 0.000,  47.000], q_value:     40.511, mean-eps:      0.507


[2020-05-11 09:03:36,706][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:03:36,714][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]


 34488/75000: episode: 74, duration:  1.197s, episode steps:   565, steps per second:   472, episode reward:    347.683, mean reward:   0.615 [    -0.010,    338.323], mean action:  0.448 [ 0.000,  2.000], mean observation:  7.075 [ 0.000,  47.000], q_value:     42.676, mean-eps:      0.505


[2020-05-11 09:03:37,913][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:03:37,922][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:03:38,619][temprl.automata][step][DEBUG]: transition idxs: 1, 2
[2020-05-11 09:03:38,627][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]


 35296/75000: episode: 75, duration:  2.042s, episode steps:   808, steps per second:   396, episode reward:    698.587, mean reward:   0.865 [    -0.010,    338.323], mean action:  0.575 [ 0.000,  2.000], mean observation:  6.707 [ 0.000,  47.000], q_value:     33.539, mean-eps:      0.502


[2020-05-11 09:03:39,924][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:03:39,934][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:03:40,604][temprl.automata][step][DEBUG]: transition idxs: 1, 2
[2020-05-11 09:03:40,612][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]


 36106/75000: episode: 76, duration:  1.968s, episode steps:   810, steps per second:   412, episode reward:    698.567, mean reward:   0.862 [    -0.010,    338.323], mean action:  0.538 [ 0.000,  2.000], mean observation:  6.786 [ 0.000,  47.000], q_value:     33.748, mean-eps:      0.496


[2020-05-11 09:03:41,922][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:03:41,933][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]


 36939/75000: episode: 77, duration:  2.015s, episode steps:   833, steps per second:   413, episode reward:    355.003, mean reward:   0.426 [    -0.010,    338.323], mean action:  0.468 [ 0.000,  2.000], mean observation:  7.125 [ 0.000,  47.000], q_value:     32.508, mean-eps:      0.489


[2020-05-11 09:03:43,953][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:03:43,962][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:03:44,618][temprl.automata][step][DEBUG]: transition idxs: 1, 2
[2020-05-11 09:03:44,624][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:03:45,384][temprl.automata][step][DEBUG]: transition idxs: 2, 3
[2020-05-11 09:03:45,390][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337]


 37850/75000: episode: 78, duration:  2.361s, episode steps:   911, steps per second:   386, episode reward:   1035.890, mean reward:   1.137 [    -0.010,    338.323], mean action:  0.581 [ 0.000,  2.000], mean observation:  6.456 [ 0.000,  45.000], q_value:     29.527, mean-eps:      0.482


[2020-05-11 09:03:46,347][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:03:46,357][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:03:47,010][temprl.automata][step][DEBUG]: transition idxs: 1, 2
[2020-05-11 09:03:47,018][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]


 38658/75000: episode: 79, duration:  2.010s, episode steps:   808, steps per second:   402, episode reward:    698.587, mean reward:   0.865 [    -0.010,    338.323], mean action:  0.540 [ 0.000,  2.000], mean observation:  6.772 [ 0.000,  47.000], q_value:     34.508, mean-eps:      0.474


[2020-05-11 09:03:48,345][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:03:48,354][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]


 39223/75000: episode: 80, duration:  1.183s, episode steps:   565, steps per second:   478, episode reward:    347.683, mean reward:   0.615 [    -0.010,    338.323], mean action:  0.465 [ 0.000,  2.000], mean observation:  7.107 [ 0.000,  47.000], q_value:     48.249, mean-eps:      0.469


[2020-05-11 09:03:49,526][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:03:49,538][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:03:50,206][temprl.automata][step][DEBUG]: transition idxs: 1, 2
[2020-05-11 09:03:50,213][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:03:50,834][temprl.automata][step][DEBUG]: transition idxs: 2, 3
[2020-05-11 09:03:50,844][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337]


 40096/75000: episode: 81, duration:  2.229s, episode steps:   873, steps per second:   392, episode reward:   1036.270, mean reward:   1.187 [    -0.010,    338.323], mean action:  0.550 [ 0.000,  2.000], mean observation:  6.544 [ 0.000,  45.000], q_value:     35.194, mean-eps:      0.466
 40649/75000: episode: 82, duration:  1.094s, episode steps:   553, steps per second:   506, episode reward:     14.470, mean reward:   0.026 [    -0.010,      4.990], mean action:  0.356 [ 0.000,  2.000], mean observation:  7.166 [ 0.000,  47.000], q_value:      3.916, mean-eps:      0.460


[2020-05-11 09:03:52,862][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:03:52,874][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:03:53,548][temprl.automata][step][DEBUG]: transition idxs: 1, 2
[2020-05-11 09:03:53,556][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:03:54,234][temprl.automata][step][DEBUG]: transition idxs: 2, 3
[2020-05-11 09:03:54,245][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337]


 41522/75000: episode: 83, duration:  2.281s, episode steps:   873, steps per second:   383, episode reward:   1036.270, mean reward:   1.187 [    -0.010,    338.323], mean action:  0.552 [ 0.000,  2.000], mean observation:  6.625 [ 0.000,  45.000], q_value:     36.218, mean-eps:      0.458


[2020-05-11 09:03:55,296][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:03:55,301][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]


 42087/75000: episode: 84, duration:  1.295s, episode steps:   565, steps per second:   436, episode reward:    347.683, mean reward:   0.615 [    -0.010,    338.323], mean action:  0.460 [ 0.000,  2.000], mean observation:  7.184 [ 0.000,  47.000], q_value:     51.323, mean-eps:      0.452


[2020-05-11 09:03:56,514][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:03:56,526][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:03:57,213][temprl.automata][step][DEBUG]: transition idxs: 1, 2
[2020-05-11 09:03:57,221][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]


 42897/75000: episode: 85, duration:  2.053s, episode steps:   810, steps per second:   395, episode reward:    698.567, mean reward:   0.862 [    -0.010,    338.323], mean action:  0.526 [ 0.000,  2.000], mean observation:  6.711 [ 0.000,  47.000], q_value:     28.466, mean-eps:      0.450


[2020-05-11 09:03:58,600][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:03:58,609][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:03:59,306][temprl.automata][step][DEBUG]: transition idxs: 1, 2
[2020-05-11 09:03:59,314][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:03:59,940][temprl.automata][step][DEBUG]: transition idxs: 2, 3
[2020-05-11 09:03:59,949][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337]


 43772/75000: episode: 86, duration:  2.323s, episode steps:   875, steps per second:   377, episode reward:   1036.250, mean reward:   1.184 [    -0.010,    338.323], mean action:  0.569 [ 0.000,  2.000], mean observation:  6.571 [ 0.000,  46.000], q_value:     26.431, mean-eps:      0.443


[2020-05-11 09:04:00,909][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:04:00,918][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:04:01,606][temprl.automata][step][DEBUG]: transition idxs: 1, 2
[2020-05-11 09:04:01,614][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:04:02,349][temprl.automata][step][DEBUG]: transition idxs: 2, 3
[2020-05-11 09:04:02,358][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337]


 44683/75000: episode: 87, duration:  2.393s, episode steps:   911, steps per second:   381, episode reward:   1035.890, mean reward:   1.137 [    -0.010,    338.323], mean action:  0.544 [ 0.000,  2.000], mean observation:  6.308 [ 0.000,  45.000], q_value:     27.634, mean-eps:      0.435
 45094/75000: episode: 88, duration:  0.717s, episode steps:   411, steps per second:   573, episode reward:     10.890, mean reward:   0.026 [    -0.010,      4.990], mean action:  0.640 [ 0.000,  2.000], mean observation:  7.129 [ 0.000,  47.000], q_value:      0.377, mean-eps:      0.429


[2020-05-11 09:04:04,019][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:04:04,030][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:04:04,681][temprl.automata][step][DEBUG]: transition idxs: 1, 2
[2020-05-11 09:04:04,688][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]


 45902/75000: episode: 89, duration:  1.980s, episode steps:   808, steps per second:   408, episode reward:    698.587, mean reward:   0.865 [    -0.010,    338.323], mean action:  0.552 [ 0.000,  2.000], mean observation:  6.681 [ 0.000,  47.000], q_value:     34.897, mean-eps:      0.427


[2020-05-11 09:04:05,996][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:04:06,006][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]


 46467/75000: episode: 90, duration:  1.188s, episode steps:   565, steps per second:   476, episode reward:    347.683, mean reward:   0.615 [    -0.010,    338.323], mean action:  0.432 [ 0.000,  2.000], mean observation:  7.158 [ 0.000,  47.000], q_value:     37.808, mean-eps:      0.422


[2020-05-11 09:04:07,208][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:04:07,218][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:04:07,887][temprl.automata][step][DEBUG]: transition idxs: 1, 2
[2020-05-11 09:04:07,894][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:04:08,708][temprl.automata][step][DEBUG]: transition idxs: 2, 3
[2020-05-11 09:04:08,716][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337]


 47378/75000: episode: 91, duration:  2.433s, episode steps:   911, steps per second:   374, episode reward:   1035.890, mean reward:   1.137 [    -0.010,    338.323], mean action:  0.596 [ 0.000,  2.000], mean observation:  6.420 [ 0.000,  45.000], q_value:     35.720, mean-eps:      0.419
 47931/75000: episode: 92, duration:  1.937s, episode steps:   553, steps per second:   286, episode reward:     14.470, mean reward:   0.026 [    -0.010,      4.990], mean action:  0.472 [ 0.000,  2.000], mean observation:  7.071 [ 0.000,  47.000], q_value:      4.777, mean-eps:      0.412
 48082/75000: episode: 93, duration:  0.293s, episode steps:   151, steps per second:   516, episode reward:      3.490, mean reward:   0.023 [    -0.010,      4.990], mean action:  0.623 [ 0.000,  2.000], mean observation:  8.041 [ 0.000,  47.000], q_value:      3.483, mean-eps:      0.412


[2020-05-11 09:04:12,850][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:04:12,858][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:04:13,559][temprl.automata][step][DEBUG]: transition idxs: 1, 2
[2020-05-11 09:04:13,570][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:04:14,180][temprl.automata][step][DEBUG]: transition idxs: 2, 3
[2020-05-11 09:04:14,189][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337]


 48955/75000: episode: 94, duration:  3.202s, episode steps:   873, steps per second:   273, episode reward:   1036.270, mean reward:   1.187 [    -0.010,    338.323], mean action:  0.593 [ 0.000,  2.000], mean observation:  6.540 [ 0.000,  45.000], q_value:     41.672, mean-eps:      0.411


[2020-05-11 09:04:15,134][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:04:15,141][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:04:15,790][temprl.automata][step][DEBUG]: transition idxs: 1, 2
[2020-05-11 09:04:15,797][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:04:16,545][temprl.automata][step][DEBUG]: transition idxs: 2, 3
[2020-05-11 09:04:16,554][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337]


 49866/75000: episode: 95, duration:  2.338s, episode steps:   911, steps per second:   390, episode reward:   1035.890, mean reward:   1.137 [    -0.010,    338.323], mean action:  0.536 [ 0.000,  2.000], mean observation:  6.334 [ 0.000,  45.000], q_value:     40.107, mean-eps:      0.403


[2020-05-11 09:04:17,482][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:04:17,489][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]


 50431/75000: episode: 96, duration:  1.165s, episode steps:   565, steps per second:   485, episode reward:    347.683, mean reward:   0.615 [    -0.010,    338.323], mean action:  0.448 [ 0.000,  2.000], mean observation:  7.168 [ 0.000,  47.000], q_value:     53.551, mean-eps:      0.396


[2020-05-11 09:04:18,705][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:04:18,714][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:04:19,440][temprl.automata][step][DEBUG]: transition idxs: 1, 2
[2020-05-11 09:04:19,448][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:04:20,054][temprl.automata][step][DEBUG]: transition idxs: 2, 3
[2020-05-11 09:04:20,066][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337]


 51304/75000: episode: 97, duration:  2.317s, episode steps:   873, steps per second:   377, episode reward:   1036.270, mean reward:   1.187 [    -0.010,    338.323], mean action:  0.553 [ 0.000,  2.000], mean observation:  6.561 [ 0.000,  45.000], q_value:     41.799, mean-eps:      0.393


[2020-05-11 09:04:20,998][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:04:21,010][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:04:21,686][temprl.automata][step][DEBUG]: transition idxs: 1, 2
[2020-05-11 09:04:21,690][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:04:22,452][temprl.automata][step][DEBUG]: transition idxs: 2, 3
[2020-05-11 09:04:22,462][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337]


 52215/75000: episode: 98, duration:  2.373s, episode steps:   911, steps per second:   384, episode reward:   1035.890, mean reward:   1.137 [    -0.010,    338.323], mean action:  0.562 [ 0.000,  2.000], mean observation:  6.446 [ 0.000,  46.000], q_value:     41.574, mean-eps:      0.385


[2020-05-11 09:04:23,427][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:04:23,438][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:04:24,100][temprl.automata][step][DEBUG]: transition idxs: 1, 2
[2020-05-11 09:04:24,108][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]


 53023/75000: episode: 99, duration:  2.003s, episode steps:   808, steps per second:   403, episode reward:    698.587, mean reward:   0.865 [    -0.010,    338.323], mean action:  0.587 [ 0.000,  2.000], mean observation:  6.551 [ 0.000,  47.000], q_value:     39.511, mean-eps:      0.377


[2020-05-11 09:04:25,705][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:04:25,719][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:04:26,514][temprl.automata][step][DEBUG]: transition idxs: 1, 2
[2020-05-11 09:04:26,520][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:04:27,603][temprl.automata][step][DEBUG]: transition idxs: 2, 3
[2020-05-11 09:04:27,612][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337]


 53896/75000: episode: 100, duration:  3.118s, episode steps:   873, steps per second:   280, episode reward:   1036.270, mean reward:   1.187 [    -0.010,    338.323], mean action:  0.568 [ 0.000,  2.000], mean observation:  6.609 [ 0.000,  45.000], q_value:     46.982, mean-eps:      0.371


[2020-05-11 09:04:29,010][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:04:29,022][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:04:29,676][temprl.automata][step][DEBUG]: transition idxs: 1, 2
[2020-05-11 09:04:29,681][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:04:30,452][temprl.automata][step][DEBUG]: transition idxs: 2, 3
[2020-05-11 09:04:30,458][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337]


 54807/75000: episode: 101, duration:  2.906s, episode steps:   911, steps per second:   313, episode reward:   1035.890, mean reward:   1.137 [    -0.010,    338.323], mean action:  0.566 [ 0.000,  2.000], mean observation:  6.494 [ 0.000,  45.000], q_value:     43.865, mean-eps:      0.363


[2020-05-11 09:04:31,473][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:04:31,482][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:04:32,097][temprl.automata][step][DEBUG]: transition idxs: 1, 2
[2020-05-11 09:04:32,106][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:04:32,818][temprl.automata][step][DEBUG]: transition idxs: 2, 3
[2020-05-11 09:04:32,826][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337]


 55718/75000: episode: 102, duration:  2.261s, episode steps:   911, steps per second:   403, episode reward:   1035.890, mean reward:   1.137 [    -0.010,    338.323], mean action:  0.564 [ 0.000,  2.000], mean observation:  6.506 [ 0.000,  45.000], q_value:     44.946, mean-eps:      0.354


[2020-05-11 09:04:33,784][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:04:33,794][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]


 56413/75000: episode: 103, duration:  1.608s, episode steps:   695, steps per second:   432, episode reward:    351.383, mean reward:   0.506 [    -0.010,    338.323], mean action:  0.462 [ 0.000,  2.000], mean observation:  7.168 [ 0.000,  47.000], q_value:     35.433, mean-eps:      0.347


[2020-05-11 09:04:35,386][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:04:35,394][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:04:36,057][temprl.automata][step][DEBUG]: transition idxs: 1, 2
[2020-05-11 09:04:36,065][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:04:36,651][temprl.automata][step][DEBUG]: transition idxs: 2, 3
[2020-05-11 09:04:36,656][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337]


 57286/75000: episode: 104, duration:  2.196s, episode steps:   873, steps per second:   398, episode reward:   1036.270, mean reward:   1.187 [    -0.010,    338.323], mean action:  0.597 [ 0.000,  2.000], mean observation:  6.599 [ 0.000,  45.000], q_value:     47.090, mean-eps:      0.342


[2020-05-11 09:04:37,578][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:04:37,589][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:04:38,228][temprl.automata][step][DEBUG]: transition idxs: 1, 2
[2020-05-11 09:04:38,238][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:04:38,923][temprl.automata][step][DEBUG]: transition idxs: 2, 3
[2020-05-11 09:04:38,932][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337]


 58197/75000: episode: 105, duration:  2.256s, episode steps:   911, steps per second:   404, episode reward:   1035.890, mean reward:   1.137 [    -0.010,    338.323], mean action:  0.561 [ 0.000,  2.000], mean observation:  6.518 [ 0.000,  45.000], q_value:     47.771, mean-eps:      0.334


[2020-05-11 09:04:39,860][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:04:39,870][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:04:40,497][temprl.automata][step][DEBUG]: transition idxs: 1, 2
[2020-05-11 09:04:40,506][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:04:41,078][temprl.automata][step][DEBUG]: transition idxs: 2, 3
[2020-05-11 09:04:41,084][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337]


 59070/75000: episode: 106, duration:  2.143s, episode steps:   873, steps per second:   407, episode reward:   1036.270, mean reward:   1.187 [    -0.010,    338.323], mean action:  0.569 [ 0.000,  2.000], mean observation:  6.744 [ 0.000,  45.000], q_value:     50.375, mean-eps:      0.326


[2020-05-11 09:04:42,267][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:04:42,278][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:04:43,019][temprl.automata][step][DEBUG]: transition idxs: 1, 2
[2020-05-11 09:04:43,027][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:04:44,142][temprl.automata][step][DEBUG]: transition idxs: 2, 3
[2020-05-11 09:04:44,150][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337]


 59981/75000: episode: 107, duration:  3.028s, episode steps:   911, steps per second:   301, episode reward:   1035.890, mean reward:   1.137 [    -0.010,    338.323], mean action:  0.575 [ 0.000,  2.000], mean observation:  6.352 [ 0.000,  45.000], q_value:     53.632, mean-eps:      0.318
 60392/75000: episode: 108, duration:  1.070s, episode steps:   411, steps per second:   384, episode reward:     10.890, mean reward:   0.026 [    -0.010,      4.990], mean action:  0.664 [ 0.000,  2.000], mean observation:  7.106 [ 0.000,  47.000], q_value:      0.540, mean-eps:      0.312


[2020-05-11 09:04:46,311][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:04:46,329][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:04:46,948][temprl.automata][step][DEBUG]: transition idxs: 1, 2
[2020-05-11 09:04:46,957][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:04:47,542][temprl.automata][step][DEBUG]: transition idxs: 2, 3
[2020-05-11 09:04:47,549][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337]


 61265/75000: episode: 109, duration:  2.311s, episode steps:   873, steps per second:   378, episode reward:   1036.270, mean reward:   1.187 [    -0.010,    338.323], mean action:  0.580 [ 0.000,  2.000], mean observation:  6.551 [ 0.000,  45.000], q_value:     52.739, mean-eps:      0.310


[2020-05-11 09:04:48,977][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:04:48,986][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:04:49,690][temprl.automata][step][DEBUG]: transition idxs: 1, 2
[2020-05-11 09:04:49,698][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]


 62399/75000: episode: 110, duration:  3.028s, episode steps:  1134, steps per second:   374, episode reward:    695.327, mean reward:   0.613 [    -0.010,    338.323], mean action:  0.392 [ 0.000,  2.000], mean observation:  6.343 [ 0.000,  47.000], q_value:      2.365, mean-eps:      0.302


[2020-05-11 09:04:51,504][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:04:51,511][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:04:52,122][temprl.automata][step][DEBUG]: transition idxs: 1, 2
[2020-05-11 09:04:52,130][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:04:52,709][temprl.automata][step][DEBUG]: transition idxs: 2, 3
[2020-05-11 09:04:52,717][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337]


 63272/75000: episode: 111, duration:  2.113s, episode steps:   873, steps per second:   413, episode reward:   1036.270, mean reward:   1.187 [    -0.010,    338.323], mean action:  0.581 [ 0.000,  2.000], mean observation:  6.563 [ 0.000,  45.000], q_value:     56.418, mean-eps:      0.293


[2020-05-11 09:04:53,668][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:04:53,678][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:04:54,313][temprl.automata][step][DEBUG]: transition idxs: 1, 2
[2020-05-11 09:04:54,322][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:04:54,896][temprl.automata][step][DEBUG]: transition idxs: 2, 3
[2020-05-11 09:04:54,902][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337]


 64145/75000: episode: 112, duration:  2.159s, episode steps:   873, steps per second:   404, episode reward:   1036.270, mean reward:   1.187 [    -0.010,    338.323], mean action:  0.566 [ 0.000,  2.000], mean observation:  6.586 [ 0.000,  45.000], q_value:     59.862, mean-eps:      0.287
 64698/75000: episode: 113, duration:  1.088s, episode steps:   553, steps per second:   508, episode reward:     14.470, mean reward:   0.026 [    -0.010,      4.990], mean action:  0.416 [ 0.000,  2.000], mean observation:  7.142 [ 0.000,  47.000], q_value:      5.792, mean-eps:      0.284
 64849/75000: episode: 114, duration:  0.144s, episode steps:   151, steps per second:  1050, episode reward:      3.490, mean reward:   0.023 [    -0.010,      4.990], mean action:  0.272 [ 0.000,  2.000], mean observation:  7.419 [ 0.000,  47.000], q_value:      0.550, mean-eps:      0.284


[2020-05-11 09:04:57,521][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:04:57,531][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:04:58,326][temprl.automata][step][DEBUG]: transition idxs: 1, 2
[2020-05-11 09:04:58,336][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:04:59,234][temprl.automata][step][DEBUG]: transition idxs: 2, 3
[2020-05-11 09:04:59,242][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337]


 65760/75000: episode: 115, duration:  3.075s, episode steps:   911, steps per second:   296, episode reward:   1035.890, mean reward:   1.137 [    -0.010,    338.323], mean action:  0.574 [ 0.000,  2.000], mean observation:  6.311 [ 0.000,  45.000], q_value:     42.224, mean-eps:      0.283
 66319/75000: episode: 116, duration:  1.337s, episode steps:   559, steps per second:   418, episode reward:     14.410, mean reward:   0.026 [    -0.010,      4.990], mean action:  0.385 [ 0.000,  2.000], mean observation:  7.196 [ 0.000,  47.000], q_value:     25.803, mean-eps:      0.279


[2020-05-11 09:05:01,591][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:05:01,602][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:05:02,229][temprl.automata][step][DEBUG]: transition idxs: 1, 2
[2020-05-11 09:05:02,238][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:05:02,815][temprl.automata][step][DEBUG]: transition idxs: 2, 3
[2020-05-11 09:05:02,824][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337]


 67192/75000: episode: 117, duration:  2.205s, episode steps:   873, steps per second:   396, episode reward:   1036.270, mean reward:   1.187 [    -0.010,    338.323], mean action:  0.564 [ 0.000,  2.000], mean observation:  6.554 [ 0.000,  45.000], q_value:     60.703, mean-eps:      0.279


[2020-05-11 09:05:04,222][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:05:04,233][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:05:04,914][temprl.automata][step][DEBUG]: transition idxs: 1, 2
[2020-05-11 09:05:04,922][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]


 68328/75000: episode: 118, duration:  2.944s, episode steps:  1136, steps per second:   386, episode reward:    695.307, mean reward:   0.612 [    -0.010,    338.323], mean action:  0.424 [ 0.000,  2.000], mean observation:  6.330 [ 0.000,  47.000], q_value:      5.818, mean-eps:      0.275


[2020-05-11 09:05:06,662][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:05:06,667][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:05:07,301][temprl.automata][step][DEBUG]: transition idxs: 1, 2
[2020-05-11 09:05:07,310][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:05:07,979][temprl.automata][step][DEBUG]: transition idxs: 2, 3
[2020-05-11 09:05:07,986][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337]


 69239/75000: episode: 119, duration:  2.190s, episode steps:   911, steps per second:   416, episode reward:   1035.890, mean reward:   1.137 [    -0.010,    338.323], mean action:  0.568 [ 0.000,  2.000], mean observation:  6.355 [ 0.000,  45.000], q_value:     64.394, mean-eps:      0.270


[2020-05-11 09:05:08,933][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:05:08,942][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:05:09,602][temprl.automata][step][DEBUG]: transition idxs: 1, 2
[2020-05-11 09:05:09,610][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]


 70047/75000: episode: 120, duration:  1.956s, episode steps:   808, steps per second:   413, episode reward:    698.587, mean reward:   0.865 [    -0.010,    338.323], mean action:  0.569 [ 0.000,  2.000], mean observation:  6.677 [ 0.000,  47.000], q_value:     56.600, mean-eps:      0.266


[2020-05-11 09:05:10,860][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:05:10,870][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:05:11,510][temprl.automata][step][DEBUG]: transition idxs: 1, 2
[2020-05-11 09:05:11,518][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:05:12,094][temprl.automata][step][DEBUG]: transition idxs: 2, 3
[2020-05-11 09:05:12,102][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337]


 70920/75000: episode: 121, duration:  2.132s, episode steps:   873, steps per second:   409, episode reward:   1036.270, mean reward:   1.187 [    -0.010,    338.323], mean action:  0.581 [ 0.000,  2.000], mean observation:  6.558 [ 0.000,  45.000], q_value:     66.025, mean-eps:      0.263


[2020-05-11 09:05:12,997][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:05:13,005][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:05:13,639][temprl.automata][step][DEBUG]: transition idxs: 1, 2
[2020-05-11 09:05:13,648][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:05:14,210][temprl.automata][step][DEBUG]: transition idxs: 2, 3
[2020-05-11 09:05:14,218][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337]


 71793/75000: episode: 122, duration:  2.099s, episode steps:   873, steps per second:   416, episode reward:   1036.270, mean reward:   1.187 [    -0.010,    338.323], mean action:  0.559 [ 0.000,  2.000], mean observation:  6.549 [ 0.000,  45.000], q_value:     66.352, mean-eps:      0.260


[2020-05-11 09:05:15,134][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:05:15,146][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:05:15,754][temprl.automata][step][DEBUG]: transition idxs: 1, 2
[2020-05-11 09:05:15,763][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:05:16,308][temprl.automata][step][DEBUG]: transition idxs: 2, 3
[2020-05-11 09:05:16,317][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337]


 72666/75000: episode: 123, duration:  2.082s, episode steps:   873, steps per second:   419, episode reward:   1036.270, mean reward:   1.187 [    -0.010,    338.323], mean action:  0.572 [ 0.000,  2.000], mean observation:  6.670 [ 0.000,  45.000], q_value:     62.372, mean-eps:      0.256


[2020-05-11 09:05:17,211][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:05:17,219][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:05:17,851][temprl.automata][step][DEBUG]: transition idxs: 1, 2
[2020-05-11 09:05:17,857][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:05:18,559][temprl.automata][step][DEBUG]: transition idxs: 2, 3
[2020-05-11 09:05:18,567][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337]


 73577/75000: episode: 124, duration:  2.230s, episode steps:   911, steps per second:   409, episode reward:   1035.890, mean reward:   1.137 [    -0.010,    338.323], mean action:  0.597 [ 0.000,  2.000], mean observation:  6.354 [ 0.000,  45.000], q_value:     63.326, mean-eps:      0.253


[2020-05-11 09:05:19,729][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:05:19,732][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:05:20,666][temprl.automata][step][DEBUG]: transition idxs: 1, 2
[2020-05-11 09:05:20,674][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:05:21,302][temprl.automata][step][DEBUG]: transition idxs: 2, 3
[2020-05-11 09:05:21,310][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337]


 74450/75000: episode: 125, duration:  2.725s, episode steps:   873, steps per second:   320, episode reward:   1036.270, mean reward:   1.187 [    -0.010,    338.323], mean action:  0.574 [ 0.000,  2.000], mean observation:  6.591 [ 0.000,  45.000], q_value:     71.571, mean-eps:      0.248


[2020-05-11 09:05:22,356][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:05:22,361][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:05:22,984][temprl.automata][step][DEBUG]: transition idxs: 1, 2
[2020-05-11 09:05:22,992][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:05:23,613][temprl.automata][step][DEBUG]: transition idxs: 2, 3
[2020-05-11 09:05:23,621][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337]


 75323/75000: episode: 126, duration:  2.282s, episode steps:   873, steps per second:   383, episode reward:   1036.270, mean reward:   1.187 [    -0.010,    338.323], mean action:  0.569 [ 0.000,  2.000], mean observation:  6.632 [ 0.000,  45.000], q_value:     59.632, mean-eps:      0.245
done, took 179.600 seconds




Testing for 5 episodes ...


[2020-05-11 09:05:39,566][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:05:39,572][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:05:44,845][temprl.automata][step][DEBUG]: transition idxs: 1, 2
[2020-05-11 09:05:44,850][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:05:50,391][temprl.automata][step][DEBUG]: transition idxs: 2, 3
[2020-05-11 09:05:50,400][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337]


Episode 1: reward: 1035.890, steps: 911


[2020-05-11 09:06:03,580][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:06:03,589][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:06:09,281][temprl.automata][step][DEBUG]: transition idxs: 1, 2
[2020-05-11 09:06:09,286][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:06:14,924][temprl.automata][step][DEBUG]: transition idxs: 2, 3
[2020-05-11 09:06:14,933][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337]


Episode 2: reward: 1035.890, steps: 911


[2020-05-11 09:06:16,304][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:06:16,306][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:06:16,852][temprl.automata][step][DEBUG]: transition idxs: 1, 2
[2020-05-11 09:06:16,857][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:06:17,403][temprl.automata][step][DEBUG]: transition idxs: 2, 3
[2020-05-11 09:06:17,407][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337]


Episode 3: reward: 1035.890, steps: 911


[2020-05-11 09:06:18,634][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:06:18,638][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:06:19,139][temprl.automata][step][DEBUG]: transition idxs: 1, 2
[2020-05-11 09:06:19,146][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:06:19,699][temprl.automata][step][DEBUG]: transition idxs: 2, 3
[2020-05-11 09:06:19,701][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337]


Episode 4: reward: 1035.890, steps: 911


[2020-05-11 09:06:20,916][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:06:20,919][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:06:21,468][temprl.automata][step][DEBUG]: transition idxs: 1, 2
[2020-05-11 09:06:21,472][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:06:21,994][temprl.automata][step][DEBUG]: transition idxs: 2, 3
[2020-05-11 09:06:21,999][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337]


Episode 5: reward: 1035.890, steps: 911


<Figure size 432x288 with 0 Axes>

In [2]:
# Show the last recorded video of each run side by side (expert first, learner second).
show_video(['experiments/breakout-output/expert/',
            'experiments/breakout-output/learner/'], take_first=False)

1 :  experiments/breakout-output/expert/videos\openaigym.video.1.4670.video000001.mp4
2 :  experiments/breakout-output/learner/videos\openaigym.video.2.4670.video000001.mp4


## Conclusions and facts

The training runs above support two observations:

* In this experiment the expert and the learner run the same algorithm (SARSA) with identical agent training parameters.
* Only the environment differs: the expert is trained in 'fire' mode, while the learner is trained in 'ball' mode. The difference is clearly visible in the videos.

![Expert and learner](experiments/breakout-output/Expert_Learner.png)
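
In the training log above, episodes whose reward is around 1036 are preceded by three `temprl` automaton transitions (0→1, 1→2, 2→3), each logged with a goal reward of about 333.33, while episodes with no transitions end with a reward near 10. The sketch below is one way to check this pattern programmatically. It assumes the log keeps the exact format printed above and that the `temprl` debug lines of an episode always appear before that episode's summary line. The helper name `summarize` and the regular expressions are illustrative, not part of the project code, and the summary lines in the sample are trimmed after the `episode reward` field.

```python
import re

# Sample copied from the training log above (episodes 108 and 109).
# The summary lines are trimmed after the "episode reward" field.
SAMPLE_LOG = """\
 60392/75000: episode: 108, duration:  1.070s, episode steps:   411, steps per second:   384, episode reward:     10.890
[2020-05-11 09:04:46,311][temprl.automata][step][DEBUG]: transition idxs: 0, 1
[2020-05-11 09:04:46,329][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:04:46,948][temprl.automata][step][DEBUG]: transition idxs: 1, 2
[2020-05-11 09:04:46,957][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.3333333333333]
[2020-05-11 09:04:47,542][temprl.automata][step][DEBUG]: transition idxs: 2, 3
[2020-05-11 09:04:47,549][temprl.wrapper][step][DEBUG]: Non-zero goal rewards: [333.33333333333337]
 61265/75000: episode: 109, duration:  2.311s, episode steps:   873, steps per second:   378, episode reward:   1036.270
"""

# Hypothetical patterns written for the log format shown in this notebook.
GOAL_RE = re.compile(r"Non-zero goal rewards: \[([^\]]+)\]")
EPISODE_RE = re.compile(r"episode:\s*(\d+),.*?episode reward:\s*(-?[\d.]+)")


def summarize(log_text):
    """Return (episode, goal_reward_sum, episode_reward) for each episode summary line."""
    summaries = []
    goal_sum = 0.0  # goal rewards seen since the previous summary line
    for line in log_text.splitlines():
        goal = GOAL_RE.search(line)
        if goal:
            goal_sum += sum(float(v) for v in goal.group(1).split(","))
            continue
        episode = EPISODE_RE.search(line)
        if episode:
            summaries.append((int(episode.group(1)), goal_sum, float(episode.group(2))))
            goal_sum = 0.0
    return summaries


for episode, goal_sum, reward in summarize(SAMPLE_LOG):
    print(f"episode {episode}: goal rewards {goal_sum:.2f}, episode reward {reward:.2f}")
```

Comparing the two columns separates the share of each episode reward that comes from the automaton goal rewards from the share that comes from the environment itself.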