# Voice leading reinforcement learning agents.

## Introduction.
[...]

$$
\mathbb{E}[\text{return}|\alpha, s]
\ =\ 
\text{reward}+\underset{\ \beta\ \in\ \mathcal{A}_{\alpha(s)}\!\!}{\text{max}}\mathbb{E}\left[\text{return}|\beta, \alpha(s)\right]
$$

$$
v(\alpha, s)
\ =\ 
R(\alpha)+\underset{\ \beta\ \in\ \mathcal{A}_{\alpha(s)}\!\!}{\text{max}}v\big(\beta, \alpha(s)\big)
$$
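
To make the recursion concrete, here is a minimal sketch on a toy problem. Everything in it is a hypothetical stand-in chosen for illustration (the four-note state space, the `GOAL` note, the reward values, and the finite `horizon` that guarantees the recursion terminates); it is not the agent defined later in this notebook.

```python
from functools import lru_cache

GOAL = 3            # hypothetical terminal state
STATES = range(4)   # hypothetical toy state space {0, 1, 2, 3}

def reward(state, next_state):
    # Hypothetical reward: +10 for reaching the goal, -1 for any other step.
    return 10.0 if next_state == GOAL else -1.0

@lru_cache(maxsize=None)
def value(state, next_state, horizon):
    """v(alpha, s): reward for the step state -> next_state plus the best continuation."""
    immediate = reward(state, next_state)
    if next_state == GOAL or horizon == 1:
        return immediate
    # Actions available at next_state: move up or down by one, staying in range.
    candidates = [n for n in (next_state - 1, next_state + 1) if n in STATES]
    return immediate + max(value(next_state, n, horizon - 1) for n in candidates)

print(value(0, 1, 5))  # 8.0: two -1 steps, then +10 for reaching the goal
```

The `max` over `candidates` plays the role of the maximum over $\beta \in \mathcal{A}_{\alpha(s)}$ in the equations above.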

In [73]:
import copy

import random
import math
import numpy as np
from inspect import isfunction

import torch
import torch.nn as nn

from tqdm import tqdm

___

## Classes for various aspects of music theory.
The Python classes we define in this section collect aspects of music theory relevant to the problem of voice leading. Using the MIDI standard encoding, for instance, every note can be assigned an integer value between $0$ and $127$. In this way, a solution to any voice leading problem can be encoded completely numerically. However, the reward functions for the sequence of step-by-step actions that constitute a proposed solution to a voice leading problem depend on music-theoretic considerations. We will use the classes we define in the present section to evaluate rewards for our agent's actions.
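
As a quick illustration of the MIDI encoding, the sketch below names a MIDI value using sharps. It assumes the common convention in which middle C (value $60$) is written `C4`, which agrees with the `Notes` class defined next; the helper name `midi_to_name` is hypothetical.

```python
SHARP_NAMES = ['C', 'C♯', 'D', 'D♯', 'E', 'F', 'F♯', 'G', 'G♯', 'A', 'A♯', 'B']

def midi_to_name(value):
    """Name a MIDI value using sharps, with middle C (60) written as C4."""
    if not 0 <= value <= 127:
        raise ValueError('MIDI note values lie between 0 and 127')
    # Assumed convention: octave numbers start at -1 for the lowest twelve values.
    octave = value // 12 - 1
    return '{}{}'.format(SHARP_NAMES[value % 12], octave)

print(midi_to_name(60))  # C4
print(midi_to_name(39))  # D♯2
```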

### Classes related to harmony and melody.

#### Class: `Notes`
Parent(s): *none*

Constructor arguments: *none*

In [74]:
class Notes():
    def __init__(self):
        
        valmod12_to_class = {0: ('C', 'C'),
            1: ('C♯', 'D♭'),
            2: ('D', 'D'),
            3: ('D♯', 'E♭'),
            4: ('E', 'E'),
            5: ('F', 'F'),
            6: ('F♯', 'G♭'),
            7: ('G', 'G'),
            8: ('G♯', 'A♭'),
            9: ('A', 'A'),
            10: ('A♯', 'B♭'),
            11: ('B', 'B')}
        self.valmod12_to_class = valmod12_to_class
        
        all_note_class_names = []
        for key in self.valmod12_to_class:
            class_pair = self.valmod12_to_class[key]
            all_note_class_names.append(class_pair[0])
            all_note_class_names.append(class_pair[1])
        self.all_note_class_names = all_note_class_names
        
        class_to_valmod12 = {}
        for key in self.valmod12_to_class:
            class_pair = self.valmod12_to_class[key]
            for entry in class_pair:
                class_to_valmod12.update({entry: key})
        self.class_to_valmod12 = class_to_valmod12
        
        value_to_class = {}
        for value in range(128):
            valmod12 = value%12
            class_pair = self.valmod12_to_class[valmod12]
            value_to_class.update({value: class_pair})
        self.value_to_class = value_to_class
        
        note_to_value = {}
        for value in self.value_to_class:
            class_pair = self.value_to_class[value]
            sharp_class = class_pair[0]
            flat_class = class_pair[1]
            valmod12 = value%12
            octave = -1 + int((value - valmod12)/12)
            note_to_value.update({sharp_class+'{}'.format(octave): value})
            note_to_value.update({flat_class+'{}'.format(octave): value})
        self.note_to_value = note_to_value

Testing:

In [75]:
notes = Notes()
print(notes.valmod12_to_class[8][0] == 'G♯')
print(notes.class_to_valmod12['E♭'] == 3)
print(notes.value_to_class[54] == ('F♯', 'G♭'))
print(notes.note_to_value['E♭2'] == 39)

True
True
True
True


#### Class: `Scales`
Parent(s): *none*

Constructor arguments: *none*

In [76]:
class Scales():
    def __init__(self):
        
        # Construct modern mode degrees, ascending and descending, as attributes:
        self.long_step_sequence = [2, 2, 1, 2, 2, 2, 1, 2, 2, 1, 2, 2, 2]
        
        self.mode_start = {'Ionian': 0,
            'Dorian': 1,
            'Phrygian': 2,
            'Lydian': 3,
            'Mixolydian': 4,
            'Aeolian': 5,
            'Locrian': 6}
        
        modern_mode_steps = {}
        for key, value in self.mode_start.items():
            mode = key
            start_position = value
            current_mode_steps = [self.long_step_sequence[i] for i in range(start_position, start_position+7)]
            modern_mode_steps.update({mode: current_mode_steps})
        self.modern_mode_steps = modern_mode_steps
        
        updown_mode_degrees = {}
        for key, value in self.modern_mode_steps.items():
            mode = key
            step_sequence = value
            degree_sequence = [0]
            for i, step in enumerate(step_sequence):
                scale_degree = degree_sequence[i]
                new_scale_degree = (scale_degree + step)%12
                degree_sequence.append(new_scale_degree)
                rev_degree_sequence = degree_sequence[::-1]
            updown_mode_degrees.update({mode: {'up': copy.deepcopy(degree_sequence),
                                     'down': copy.deepcopy(rev_degree_sequence)}})
        
        # Construct Major mode degrees, ascending and descending, as attributes:
        major_updown = updown_mode_degrees['Ionian']
        updown_mode_degrees.update({'Major': copy.deepcopy(major_updown)})

        # Construct Natural minor mode degrees, ascending and descending, as attributes:
        natural_minor_updown = updown_mode_degrees['Aeolian']
        updown_mode_degrees.update({'Natural_minor': copy.deepcopy(natural_minor_updown)})

        # Construct Harmonic minor mode degrees, ascending and descending, as attributes:
        harmonic_minor_steps = [2, 1, 2, 2, 1, 3, 1]
        harmonic_minor_degree_sequence = [0]
        for i, step in enumerate(harmonic_minor_steps):
            scale_degree = harmonic_minor_degree_sequence[i]
            new_scale_degree = (scale_degree + step)%12
            harmonic_minor_degree_sequence.append(new_scale_degree)
            rev_harmonic_minor_degree_sequence = harmonic_minor_degree_sequence[::-1]
        updown_mode_degrees.update({'Harmonic_minor': {'up': copy.deepcopy(harmonic_minor_degree_sequence),
                                                     'down': copy.deepcopy(rev_harmonic_minor_degree_sequence)}})

        # Construct Melodic minor mode degrees, ascending and descending, as attributes:
        melodic_minor_steps_up = [2, 2, 1, 2, 2, 2, 1]
        melodic_minor_degrees_up = [0]
        for i in range(7):
            current_degree = melodic_minor_degrees_up[i]
            next_degree = (current_degree + melodic_minor_steps_up[i])%12
            melodic_minor_degrees_up.append(next_degree)
        melodic_minor_steps_down = [2, 2, 1, 2, 1, 2, 2]
        melodic_minor_degrees_down = [0]
        for i in range(7):
            current_degree = melodic_minor_degrees_down[i]
            next_degree = (current_degree - melodic_minor_steps_down[i])%12
            melodic_minor_degrees_down.append(next_degree)
        updown_mode_degrees.update({'Melodic_minor': {'up': copy.deepcopy(melodic_minor_degrees_up),
                                                    'down': copy.deepcopy(melodic_minor_degrees_down)}})

        # Combine all ascending and descending mode degrees into attribute dictionary:
        self.updown_mode_degrees = updown_mode_degrees

        # Collect all modes constructed as list attribute:
        mode_list = [key for key in self.updown_mode_degrees]
        self.mode_list = mode_list
 
    # Method for querying the ascending/descending mode degree dictionary attribute:
    def updown_degrees(self, mode):
        assert mode in self.mode_list
        output = self.updown_mode_degrees[mode]
        return output
        

Testing:

In [77]:
scales = Scales()
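
The conversion from step sizes to scale degrees used by `Scales` can also be sketched on its own. The helper `mode_degrees` below is a hypothetical name, not part of the class, and it uses `itertools.accumulate` (with `initial`, available from Python 3.8) in place of the explicit loop.

```python
from itertools import accumulate

# Same whole-step/half-step pattern as `Scales.long_step_sequence`.
LONG_STEP_SEQUENCE = [2, 2, 1, 2, 2, 2, 1, 2, 2, 1, 2, 2, 2]

def mode_degrees(start_position):
    """Ascending degrees (mod 12) of the mode that starts at `start_position`."""
    steps = LONG_STEP_SEQUENCE[start_position:start_position + 7]
    # Running sum of the steps, starting from the tonic at degree 0.
    return [degree % 12 for degree in accumulate(steps, initial=0)]

print(mode_degrees(0))  # Ionian:  [0, 2, 4, 5, 7, 9, 11, 0]
print(mode_degrees(5))  # Aeolian: [0, 2, 3, 5, 7, 8, 10, 0]
```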

#### Class: `Key`
Parent(s): *none*

Constructor arguments:
* *root* = `'C'`, 
* *mode* = `'Major'`

**Important.** The constructor for the `Key` class constructs an instance each of the `Notes` and `Scales` classes as attributes of `Key`.

In [78]:
class Key():
    def __init__(self,
                 root = 'C',
                 mode = 'Major'):
        
        self.notes = Notes()
        self.scales = Scales()
        
        assert root in self.notes.all_note_class_names
        assert mode in self.scales.mode_list
        
        self.root_class = root
        self.root_valmod12 = self.notes.class_to_valmod12[self.root_class]
        
        self.mode = mode
        
        self.scale_degrees = self.scales.updown_degrees(mode = self.mode)
        self.up_degrees = self.scale_degrees['up']
        self.triad_degrees = [self.up_degrees[i] for i in [0,2,4]]
        self.triad_valsmod12 = [(self.root_valmod12 + degree)%12 for degree in self.triad_degrees]

Testing:

In [79]:
key = Key(root = 'E', mode = 'Melodic_minor')
print(key.triad_degrees == [0, 4, 7])
print(key.triad_valsmod12 == [4, 8, 11])

True
True


## Classes for rewards.

### Classes for progress-to-final-interval rewards.

#### Class: `SmallProgtoFinSchema`
Parent(s): *none*

Constructor arguments: *none*

In [80]:
class SmallProgtoFinSchema():
    def __init__(self):
        pass
        
    def reward(self,
        chord_0 = np.array([73, 76]),
        chord_1 = np.array([72, 74]),
        final_chord = np.array([35, 45])):
        
        assert len(chord_0) == len(chord_1) == len(final_chord)
        
        n = len(final_chord)
        
        centroid_0 = np.sum(chord_0)/n
        centroid_1 = np.sum(chord_1)/n
        final_centroid = np.sum(final_chord)/n
        
        needed_change = final_centroid - centroid_0
        #needed_direction = np.sign(needed_change)
        
        actual_change = centroid_1 - centroid_0
        #actual_direction = np.sign(actual_change)
        
        if actual_change == 0.:
            change_ratio = -100.
        else:
            change_ratio = needed_change/actual_change
        
        return change_ratio

Testing:

In [81]:
small_prog_to_fin_scheme = SmallProgtoFinSchema()

print(small_prog_to_fin_scheme.reward(chord_0 = np.array([73, 76]),
                                      chord_1 = np.array([73, 75]),
                                      final_chord = np.array([35, 45])))

print(small_prog_to_fin_scheme.reward(chord_0 = np.array([73, 76]),
                                      chord_1 = np.array([73, 75]),
                                      final_chord = np.array([73, 77])))

print(small_prog_to_fin_scheme.reward(chord_0 = np.array([73, 76]),
                                      chord_1 = np.array([73, 76]),
                                      final_chord = np.array([74, 78])))

69.0
-1.0
-100.0
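
As a worked check of the first test call, write $c_0$, $c_1$, and $c_f$ for the centroids of `chord_0`, `chord_1`, and `final_chord` (these symbols are introduced here only for illustration):

$$
c_0=\tfrac{73+76}{2}=74.5,\ \ \ \ \ \ c_1=\tfrac{73+75}{2}=74,\ \ \ \ \ \ c_f=\tfrac{35+45}{2}=40,
$$

$$
\text{change ratio}\ =\ \frac{c_f-c_0}{c_1-c_0}\ =\ \frac{-34.5}{-0.5}\ =\ 69.
$$

A ratio of $1$ means the step lands exactly on the final centroid, and a negative ratio means the step moves the centroid away from it.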


___

## Agent classes.

#### Class: `RandomChord`
Parent(s): *none*

Constructor arguments:
* *chord_size* = `3`
* *lower_limit* = `0`
* *upper_limit* = `127`

**Remark.** It occurs to me that an earlier version of this sampler, the `randchord` function, introduced an inductive bias into the agent. It randomly selected an `np.array` $[i_0, i_1, \dots, i_{n-1}]$ by selecting $i_0$, then selecting $i_1>i_0$, and so on, within a given range. With a bit of thought though, it becomes clear that this method does not sample the admissible chords *uniformly*: the entries are not *independent and identically distributed* (iid), since each draw depends on the previous one. For instance, if we draw a pair of integers $i_0 < i_1$ from the set $\{0,1,2\}$ using this method, then the probabilities end up being
$$\rho([0,1])=\tfrac{1}{4},\ \ \ \ \ \ \rho([0,2])=\tfrac{1}{4},\ \ \ \ \ \ \text{and}\ \ \ \ \ \ \rho([1,2])=\tfrac{1}{2}.$$
A uniform sampler would instead give each of the three pairs probability $\tfrac{1}{3}$. The `RandomChord` class below enumerates every admissible chord once and samples from that list with `random.choice`.
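
The probabilities in the remark can be checked exactly. The sketch below is a reconstruction of the sequential scheme as described in the remark (the function name `sequential_probabilities` is hypothetical), and it compares the result with a uniform distribution over the same chords.

```python
from fractions import Fraction
from itertools import combinations

def sequential_probabilities(values, size):
    """Exact chord probabilities when each note is drawn uniformly above the previous one."""
    probs = {}

    def extend(chord, prob):
        if len(chord) == size:
            probs[chord] = probs.get(chord, Fraction(0)) + prob
            return
        remaining = size - len(chord)        # notes still to draw, including this one
        start = values.index(chord[-1]) + 1 if chord else 0
        stop = len(values) - remaining + 1   # leave room for the notes that follow
        choices = values[start:stop]
        for v in choices:
            extend(chord + (v,), prob / len(choices))

    extend((), Fraction(1))
    return probs

values = [0, 1, 2]
sequential = sequential_probabilities(values, 2)
uniform = {chord: Fraction(1, 3) for chord in combinations(values, 2)}
for chord in sorted(sequential):
    print(chord, 'sequential:', sequential[chord], 'uniform:', uniform[chord])
```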

In [82]:
class RandomChord():
    def __init__(self, chord_size = 3, lower_limit = 0, upper_limit = 127):
        assert isinstance(chord_size, int)
        assert chord_size > 0
        assert isinstance(lower_limit, int)
        assert isinstance(upper_limit, int)
        assert lower_limit <= upper_limit
        assert chord_size <= upper_limit - lower_limit + 1
        
        self.chord_size = chord_size
        self.lower_limit = lower_limit
        self.upper_limit = upper_limit
        
        admissible_chords = [[k] for k in range(self.lower_limit, self.upper_limit + 2 - self.chord_size)]
        for i in range(1, self.chord_size):
            new_admissible_chords = []
            for running_chord in admissible_chords:
                new_lower_limit = running_chord[-1] + 1
                for k in range(new_lower_limit, self.upper_limit + 2 - self.chord_size + i):
                    new_chord = running_chord + [k] 
                    new_admissible_chords.append(new_chord)
            admissible_chords = new_admissible_chords
        self.admissible_chords = admissible_chords
        
    
    def sample(self):
        chord = random.choice(self.admissible_chords)
        return chord

Testing:

In [83]:
random_chord = RandomChord(chord_size = 3, lower_limit = 0, upper_limit = 5)

for i in range(4):
    output = random_chord.sample()
    print(output[0]<output[1]<output[2])

True
True
True
True
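
For comparison, the same enumeration can be sketched with the standard library. This is an alternative under the assumption that chords are strictly increasing lists of integers, as in `RandomChord`; the function name `admissible_chords` is hypothetical.

```python
from itertools import combinations

def admissible_chords(chord_size, lower_limit, upper_limit):
    """All strictly increasing chords with notes in [lower_limit, upper_limit]."""
    notes = range(lower_limit, upper_limit + 1)
    return [list(chord) for chord in combinations(notes, chord_size)]

chords = admissible_chords(3, 0, 5)
print(len(chords))            # 20
print(chords[0], chords[-1])  # [0, 1, 2] [3, 4, 5]
```

Sampling with `random.choice(chords)` is then uniform over the admissible chords.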


#### Class: `Policy_01`
Parent(s): `torch.nn.Module`

Constructor arguments:
* *chord_size* = `3`
* *lower_limit* = `0`
* *upper_limit* = `5`
* *layer_count* = `8`
* *layer_features* = `1000`

**Remark: `ReLU` versus `Softmax`.** Because we're implicitly using the *greedy policy*, which, at each state $s$, always selects the action $\alpha$ that maximizes the action-value $v_{\text{greed}}(s,\alpha)$, it might appear that the neural network that approximates $v_{\text{greed}}(s,\alpha)$ should use *softmax* activation at its final layer. However, the specific value of $v_{\text{greed}}(s,\alpha)$ is also important. This action-value function $v_{\text{greed}}(s,\alpha)$ is supposed to output the *expected return* $\mathbb{E}_{\text{greed}}[G|\alpha,\pi]$, which is a (potentially weighted) sum of all future rewards that the agent will obtain under the greedy policy. Because we've already specified our rewards implicitly in the various reward functions we defined above, we will run into trouble if we use softmax. Indeed, $0\le \text{softmax}(x)\le 1$, whereas our reward functions can take all sorts of values, sometimes negative and often well outside $[0,1]$. Thus it makes more sense to use `ReLU` or `LeakyReLU` for activation in a network that approximates action values. Note that the `Policy_01` network defined below outputs a probability distribution over next chords rather than an expected return, so its final layer does use softmax.
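
The range argument in the remark can be seen numerically. The sketch below uses plain Python stand-ins for the two activations (it does not call `torch`) and applies them to the three reward values printed in the reward test above.

```python
import math

def softmax(xs):
    """Numerically stable softmax: outputs lie in [0, 1] and sum to 1."""
    shift = max(xs)
    exps = [math.exp(x - shift) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def leaky_relu(xs, negative_slope=0.01):
    """LeakyReLU keeps positive values and scales negative ones, so it is unbounded."""
    return [x if x >= 0 else negative_slope * x for x in xs]

raw_outputs = [69.0, -1.0, -100.0]  # the three rewards from the reward test above
print(softmax(raw_outputs))     # every entry is squeezed into [0, 1]
print(leaky_relu(raw_outputs))  # the positive value 69.0 passes through unchanged
```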

In [84]:
def onehot_tensor(index, length):
    assert isinstance(index, int)
    assert isinstance(length, int)
    # A valid one-hot index must be strictly less than the vector length.
    assert 0 <= index < length
    
    onehot = torch.Tensor([float(i == index) for i in range(length)])
    
    return onehot

In [72]:
class Policy_01(nn.Module):
    def __init__(self,
                 chord_size = 3,
                 lower_limit = 0,
                 upper_limit = 5,
                 layer_count = 8,
                 layer_features = 1000):
        super().__init__()
        
        self.chord_size = chord_size
        self.lower_limit = lower_limit
        self.upper_limit = upper_limit
        
        self.random_chord = RandomChord(chord_size = self.chord_size,
                                        lower_limit = self.lower_limit,
                                        upper_limit = self.upper_limit)
        
        self.admissible_chord_count = len(self.random_chord.admissible_chords)
        
        self.index_to_chord = {i: chord for i, chord in enumerate(self.random_chord.admissible_chords)}
        self.chord_to_index = {tuple(chord): i for i, chord in enumerate(self.random_chord.admissible_chords)}
        self.chord_to_tensor = {tuple(chord): onehot_tensor(index, self.admissible_chord_count) \
                                for index, chord in enumerate(self.random_chord.admissible_chords)}
        
        
        assert isinstance(layer_count, int)
        assert layer_count > 0
        assert isinstance(layer_features, int)
        assert layer_features > 0
        
        self.layer_count = layer_count
        self.layer_features = layer_features
        
        self.layers = nn.ModuleList()
        # Critical here: the `2` in our `in_features = 2 * self.admissible_chord_count` comes from:
        # 0: current state
        # 1: final (goal) state
        # The next state is not an input: the last layer outputs one score per admissible next chord.
        # The action variable here is implicit in the assignment `step current state ← next state`
        self.layers.append(nn.Linear(in_features = 2 * self.admissible_chord_count, out_features = self.layer_features))
        for k in range(self.layer_count-2):
            self.layers.append(nn.Linear(in_features = self.layer_features, out_features = self.layer_features))
            nn.init.kaiming_normal_(self.layers[-1].weight, nonlinearity='leaky_relu', a=0.01)
            nn.init.constant_(self.layers[-1].bias, 0)
        self.layers.append(nn.Linear(in_features = self.layer_features, out_features = self.admissible_chord_count))
        nn.init.normal_(self.layers[-1].weight, mean=0, std=1/float(self.layer_features))
        nn.init.constant_(self.layers[-1].bias, 0)
        
        self.intermediate_activation = nn.LeakyReLU(negative_slope=0.01, inplace=False)
        #nn.Softmax(dim=-1)
        self.last_activation = nn.Softmax(dim=-1)
        
        # Define proportion of neurons to drop out
        self.dropout = nn.Dropout(0.1)

        
    def forward(self, x):
        activated_features = x
        for i, layer in enumerate(self.layers):
            if 1 <= i < self.layer_count-1:
                # Hidden layers keep the feature width, so add a residual (skip) connection.
                features = layer(activated_features) + activated_features
            else:
                # The first and last layers change the feature width: no skip connection.
                features = layer(activated_features)
            # Apply LeakyReLU first, then dropout on the activated features.
            activated_features = self.dropout(self.intermediate_activation(features))
        # Softmax is applied to the raw features of the last layer.
        activated_output = self.last_activation(features)
        
        return activated_output
    
    
    def action(self, chord_0, final_chord):
        assert isinstance(chord_0, list)
        assert isinstance(final_chord, list)
        assert len(chord_0) == len(final_chord) == self.random_chord.chord_size
        for i in range(self.random_chord.chord_size):
            assert isinstance(chord_0[i], int)
            assert self.random_chord.lower_limit <= chord_0[i] <= self.random_chord.upper_limit
            assert isinstance(final_chord[i], int)
            assert self.random_chord.lower_limit <= final_chord[i] <= self.random_chord.upper_limit
            
        onehot_0 = self.chord_to_tensor[tuple(chord_0)]
        final_onehot = self.chord_to_tensor[tuple(final_chord)]

        policy_input = torch.cat((onehot_0, final_onehot))
    
        policy_output = self.forward(policy_input)
            
        max_index = torch.argmax(policy_output).item()
        
        chord_that_maximizes = self.index_to_chord[max_index]
                
        return chord_that_maximizes

Testing:

In [65]:
policy = Policy_01(chord_size = 3,
                   lower_limit = 36,
                   upper_limit = 60,
                   layer_count = 5,
                   layer_features = 500)

x = torch.rand(2 * policy.admissible_chord_count)
action = policy.action([36, 38, 40], [37, 39, 44])
print(action)

[39, 57, 58]
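
The width of the network's first and last layers follows from counting admissible chords. A small sketch of that arithmetic, assuming strictly increasing chords as produced by `RandomChord` (the function name `policy_layer_sizes` is hypothetical, and `math.comb` needs Python 3.8 or later):

```python
import math

def policy_layer_sizes(chord_size, lower_limit, upper_limit):
    """Input and output widths implied by one-hot encoding of strictly increasing chords."""
    note_count = upper_limit - lower_limit + 1
    chord_count = math.comb(note_count, chord_size)
    # Input: one-hot current chord concatenated with one-hot final chord.
    # Output: one score per candidate next chord.
    return 2 * chord_count, chord_count

print(policy_layer_sizes(3, 36, 60))  # (4600, 2300)
print(policy_layer_sizes(3, 0, 127))  # (682752, 341376)
```

The one-hot encoding therefore grows combinatorially with the note range and the chord size.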


### "First species" voice leading reinforcement learning agent.

#### Class: `Agent_01`

Constructor arguments:
* *policy* = `Policy_01()`, 
* *start_chord* = `[0, 1, 2]`, 
* *final_chord* = `[3, 4, 5]`

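**What it does.** Wraps a `Policy_01` instance together with a start chord and a final (goal) chord, and records the episode as a list of chords and a list of one-hot tensors. The method `next_interval` asks the policy for the next chord and appends it to the episode. The method `last_reward` scores the most recent step with `SmallProgtoFinSchema` and adds a bonus of `6.0` when that step lands on the final chord.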

**Remark: Tips for future `Agent_##`s.** I

In [66]:
class Agent_01():
    def __init__(self,
                 policy = Policy_01(),
                 start_chord = [0, 1, 2],
                 final_chord = [3, 4, 5]):
        
        assert isinstance(policy, Policy_01)
        
        self.policy = policy
        
        self.chord_size = self.policy.random_chord.chord_size
        
        self.lower_limit = self.policy.random_chord.lower_limit
        self.upper_limit = self.policy.random_chord.upper_limit
        
        self.chord_to_tensor = self.policy.chord_to_tensor
        
        assert isinstance(start_chord, list)
        assert isinstance(final_chord, list)
        assert len(start_chord) == len(final_chord) == self.chord_size
        for i in range(self.chord_size):
            assert isinstance(start_chord[i], int)
            assert isinstance(final_chord[i], int)
            assert self.lower_limit <= start_chord[i] <= self.upper_limit
            assert self.lower_limit <= final_chord[i] <= self.upper_limit
        
        self.start_chord = start_chord
        self.start_tensor = self.chord_to_tensor[tuple(self.start_chord)]
        
        self.final_chord = final_chord
        self.final_tensor = self.chord_to_tensor[tuple(self.final_chord)]
        
        self.chord_episode = [self.start_chord]
        self.tensor_episode = [self.start_tensor]
        
        self.small_prog_to_fin_scheme = SmallProgtoFinSchema()
        
        
    def next_interval(self):
        chord_0 = self.chord_episode[-1]
        
        # Use the agent's own policy rather than a global variable named `policy`.
        next_chord = self.policy.action(chord_0, self.final_chord)
        next_tensor = self.chord_to_tensor[tuple(next_chord)]
        
        self.chord_episode.append(next_chord)
        self.tensor_episode.append(next_tensor)
    

    def last_reward(self):
        assert len(self.chord_episode) > 1
        
        last_chord = self.chord_episode[-2]
        last_action = self.chord_episode[-1]
        
        reward = self.small_prog_to_fin_scheme.reward(chord_0 = np.array(last_chord),
                                                          chord_1 = np.array(last_action),
                                                          final_chord = np.array(self.final_chord))
        
        if last_action == self.final_chord:
            reward += 6.0
            
        return reward

Testing:

In [67]:
policy = Policy_01(chord_size = 3,
                   lower_limit = 36,
                   upper_limit = 60,
                   layer_count = 4,
                   layer_features = 500)

agent = Agent_01(policy = policy,
                 start_chord = [36, 38, 40],
                 final_chord = [37, 39, 44])

print(agent.tensor_episode, '\n')

agent.next_interval()
print(agent.tensor_episode)
print(agent.last_reward(), '\n')

agent.next_interval()
print(agent.tensor_episode)
print(agent.last_reward(), '\n')

agent.next_interval()
print(agent.tensor_episode)
print(agent.last_reward(), '\n')

agent.next_interval()
print(agent.tensor_episode)
print(agent.last_reward(), '\n')

[tensor([0., 0., 0.,  ..., 0., 0., 0.])] 

[tensor([0., 0., 0.,  ..., 0., 0., 0.]), tensor([0., 0., 0.,  ..., 0., 0., 0.])]
1.4999999999999973 

[tensor([0., 0., 0.,  ..., 0., 0., 0.]), tensor([0., 0., 0.,  ..., 0., 0., 0.]), tensor([0., 0., 0.,  ..., 0., 0., 0.])]
0.5 

[tensor([0., 0., 0.,  ..., 0., 0., 0.]), tensor([0., 0., 0.,  ..., 0., 0., 0.]), tensor([0., 0., 0.,  ..., 0., 0., 0.]), tensor([0., 0., 0.,  ..., 0., 0., 0.])]
-0.24999999999999867 

[tensor([0., 0., 0.,  ..., 0., 0., 0.]), tensor([0., 0., 0.,  ..., 0., 0., 0.]), tensor([0., 0., 0.,  ..., 0., 0., 0.]), tensor([0., 0., 0.,  ..., 0., 0., 0.]), tensor([0., 0., 0.,  ..., 0., 0., 0.])]
-0.625000000000001 



___

## Training loop(s).

### Attempt 1.
#### Verdict(s): 

In [70]:
policy = Policy_01(chord_size = 1,
                   lower_limit = 48,
                   upper_limit = 60,
                   layer_count = 5,
                   layer_features = 1500)

print(policy.random_chord.admissible_chords)
print(len(policy.random_chord.admissible_chords))

agent = Agent_01(policy = policy,
                 start_chord = [55],
                 final_chord = [55])

[[48], [49], [50], [51], [52], [53], [54], [55], [56], [57], [58], [59], [60]]
13
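
The training cell that follows builds each return by summing a slice of the reward list, which is quadratic in the episode length. A minimal sketch of the usual backward pass is given here, under the assumption that `present_bias` is intended as a discount factor (the cell defines it but never uses it). The function name `discounted_returns` is hypothetical.

```python
def discounted_returns(rewards, discount=1.0):
    """Return G_t = r_t + discount * G_{t+1} for every step t, computed right to left."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + discount * running
        returns[t] = running
    return returns

print(discounted_returns([-4.0, 7.0]))       # [3.0, 7.0]
print(discounted_returns([-4.0, 7.0], 0.5))  # [-0.5, 7.0]
```

With `discount = 1.0` this reproduces the plain sums of future rewards.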


In [71]:
max_sequence_length = 12

episode_count = 20000
present_bias = 1.0
learning_rate = 1e-7

# Optimizers specified in the torch.optim package
optimizer = torch.optim.SGD(policy.parameters(), lr = learning_rate)

for episode_number in tqdm(range(episode_count)):
    
    # print('\n________________________________________________________________________\n')
    
    print('Epsiode count:', episode_number)
    
    random_starting_chord = policy.random_chord.sample()
    starting_tensor = policy.chord_to_tensor[tuple(random_starting_chord)]
    
    agent.chord_episode = [random_starting_chord]
    agent.tensor_episode = [starting_tensor]
    
    agent.final_chord = policy.random_chord.sample()
    agent.final_tensor = agent.chord_to_tensor[tuple(agent.final_chord)]
    
    print('Epsiode goal:', agent.final_chord)
    
    reward_list = []
    simple_reward_list = []
    
    while (agent.chord_episode[-1] != agent.final_chord) and (len(agent.chord_episode) <= max_sequence_length):
        
        agent.next_interval()
        
        reward = torch.Tensor([agent.last_reward()])
        reward_list.append(reward)
        simple_reward_list.append(reward.item())
        
        
    return_list = []
    simple_return_list = []
    
    for index in range(len(reward_list)):
        term_list = reward_list[index:]
        present_return = sum(term_list)
        return_list.append(present_return)
        simple_return_list.append(present_return.item())
        
    print('Epsiode state sequence length = {}.\n'.format(len(agent.chord_episode)), agent.chord_episode)
    print('Reward list (length = {}):\n'.format(len(simple_reward_list)), simple_reward_list, '\n')
#    print('Return list (length = {}):\n'.format(len(return_list)), return_list, '\n')
        
    for index in range(len(agent.chord_episode)-1):
        
        optimizer.zero_grad()
        
        in_tensor = agent.tensor_episode[index]
        
        out_chord = agent.chord_episode[index+1]
        out_index = agent.policy.chord_to_index[tuple(out_chord)]
        
        policy_value = policy(torch.cat((in_tensor, agent.final_tensor)))[out_index].unsqueeze(-1)
        
        # Weight the log-probability by the return from this step onward, not the single-step reward.
        present_return = return_list[index]
        
        loss = -present_return * torch.log(policy_value)
        
        loss.backward()

        # Adjust learning weights
        optimizer.step()

  0%|                                         | 2/20000 [00:00<25:58, 12.83it/s]

Epsiode count: 0
Epsiode goal: [58]
Epsiode state sequence length = 3.
 [[50], [48], [58]]
Reward list (length = 2):
 [-4.0, 7.0] 

Epsiode count: 1
Epsiode goal: [51]
Epsiode state sequence length = 9.
 [[58], [59], [59], [59], [54], [56], [53], [58], [51]]
Reward list (length = 8):
 [-7.0, -100.0, -100.0, 1.600000023841858, -1.5, 1.6666666269302368, -0.4000000059604645, 7.0] 

Epsiode count: 2
Epsiode goal: [58]
Epsiode state sequence length = 11.
 [[54], [56], [54], [53], [54], [48], [48], [54], [56], [48], [58]]
Reward list (length = 10):
 [2.0, -1.0, -4.0, 5.0, -0.6666666865348816, -100.0, 1.6666666269302368, 2.0, -0.25, 7.0] 



  0%|                                         | 4/20000 [00:00<27:41, 12.03it/s]

Epsiode count: 3
Epsiode goal: [58]
Epsiode state sequence length = 3.
 [[59], [54], [58]]
Reward list (length = 2):
 [0.20000000298023224, 7.0] 

Epsiode count: 4
Epsiode goal: [57]
Epsiode state sequence length = 13.
 [[59], [53], [53], [60], [53], [53], [52], [53], [53], [52], [53], [51], [53]]
Reward list (length = 12):
 [0.3333333432674408, -100.0, 0.5714285969734192, 0.4285714328289032, -100.0, -4.0, 5.0, -100.0, -4.0, 5.0, -2.0, 3.0] 



  0%|                                         | 6/20000 [00:00<41:08,  8.10it/s]

Epsiode count: 5
Epsiode goal: [52]
Epsiode state sequence length = 13.
 [[56], [55], [58], [60], [58], [56], [58], [56], [58], [60], [58], [56], [58]]
Reward list (length = 12):
 [4.0, -1.0, -3.0, 4.0, 3.0, -2.0, 3.0, -2.0, -3.0, 4.0, 3.0, -2.0] 

Epsiode count: 6
Epsiode goal: [51]
Epsiode state sequence length = 1.
 [[51]]
Reward list (length = 0):
 [] 

Epsiode count: 7
Epsiode goal: [50]
Epsiode state sequence length = 13.
 [[48], [48], [60], [60], [51], [53], [51], [51], [56], [54], [51], [51], [58]]
Reward list (length = 12):
 [-100.0, 0.1666666716337204, -100.0, 1.1111111640930176, -0.5, 1.5, -100.0, -0.20000000298023224, 3.0, 1.3333333730697632, -100.0, -0.1428571492433548] 



  0%|                                         | 8/20000 [00:00<36:25,  9.15it/s]

Epsiode count: 8
Epsiode goal: [52]
Epsiode state sequence length = 13.
 [[58], [58], [49], [51], [58], [57], [55], [48], [58], [56], [58], [57], [58]]
Reward list (length = 12):
 [-100.0, 0.6666666865348816, 1.5, 0.1428571492433548, 6.0, 2.5, 0.4285714328289032, 0.4000000059604645, 3.0, -2.0, 6.0, -5.0] 

Epsiode count: 9
Epsiode goal: [54]
Epsiode state sequence length = 13.
 [[49], [51], [58], [56], [58], [56], [56], [58], [56], [60], [53], [52], [52]]
Reward list (length = 12):
 [2.5, 0.4285714328289032, 2.0, -1.0, 2.0, -100.0, -1.0, 2.0, -0.5, 0.8571428656578064, -1.0, -100.0] 



  0%|                                        | 13/20000 [00:01<29:27, 11.31it/s]

Epsiode count: 10
Epsiode goal: [56]
Epsiode state sequence length = 2.
 [[52], [56]]
Reward list (length = 1):
 [7.0] 

Epsiode count: 11
Epsiode goal: [50]
Epsiode state sequence length = 5.
 [[53], [48], [58], [48], [50]]
Reward list (length = 4):
 [0.6000000238418579, 0.20000000298023224, 0.800000011920929, 7.0] 

Epsiode count: 12
Epsiode goal: [52]
Epsiode state sequence length = 3.
 [[60], [53], [52]]
Reward list (length = 2):
 [1.1428571939468384, 7.0] 

Epsiode count: 13
Epsiode goal: [52]
Epsiode state sequence length = 13.
 [[56], [58], [60], [58], [57], [56], [55], [58], [57], [56], [57], [58], [58]]
Reward list (length = 12):
 [-2.0, -3.0, 4.0, 6.0, 5.0, 4.0, -1.0, 6.0, 5.0, -4.0, -5.0, -100.0] 



  0%|                                        | 15/20000 [00:01<37:19,  8.92it/s]

Epsiode count: 14
Epsiode goal: [52]
Epsiode state sequence length = 13.
 [[51], [57], [57], [58], [57], [56], [58], [56], [57], [59], [58], [57], [57]]
Reward list (length = 12):
 [0.1666666716337204, -100.0, -5.0, 6.0, 5.0, -2.0, 3.0, -4.0, -2.5, 7.0, 6.0, -100.0] 

Epsiode count: 15
Epsiode goal: [56]
Epsiode state sequence length = 13.
 [[52], [60], [48], [48], [58], [48], [57], [57], [58], [60], [53], [60], [57]]
Reward list (length = 12):
 [0.5, 0.3333333432674408, -100.0, 0.800000011920929, 0.20000000298023224, 0.8888888955116272, -100.0, -1.0, -1.0, 0.5714285969734192, 0.4285714328289032, 1.3333333730697632] 



  0%|                                        | 17/20000 [00:01<37:58,  8.77it/s]

Epsiode count: 16
Epsiode goal: [55]
Epsiode state sequence length = 5.
 [[56], [56], [59], [59], [55]]
Reward list (length = 4):
 [-100.0, -0.3333333432674408, -100.0, 7.0] 

Epsiode count: 17
Epsiode goal: [53]
Epsiode state sequence length = 3.
 [[48], [58], [53]]
Reward list (length = 2):
 [0.5, 7.0] 

Epsiode count: 18
Epsiode goal: [58]
Epsiode state sequence length = 1.
 [[58]]
Reward list (length = 0):
 [] 

Epsiode count: 19
Epsiode goal: [51]
Epsiode state sequence length = 4.
 [[60], [53], [52], [51]]
Reward list (length = 3):
 [1.2857142686843872, 2.0, 7.0] 

Epsiode count: 20
Epsiode goal: [50]
Epsiode state sequence length = 13.
 [[54], [56], [52], [55], [55], [58], [60], [60], [52], [48], [48], [48], [50]]
Reward list (length = 12):
 [-2.0, 1.5, -0.6666666865348816, -100.0, -1.6666666269302368, -4.0, -100.0, 1.25, 0.5, -100.0, -100.0, 7.0] 



  0%|                                        | 21/20000 [00:02<30:07, 11.05it/s]

Epsiode count: 21
Epsiode goal: [57]
Epsiode state sequence length = 5.
 [[52], [56], [55], [58], [57]]
Reward list (length = 4):
 [1.25, -1.0, 0.6666666865348816, 7.0] 

Epsiode count: 22
Epsiode goal: [60]
Epsiode state sequence length = 11.
 [[57], [58], [57], [52], [52], [58], [48], [58], [56], [48], [60]]
Reward list (length = 10):
 [3.0, -2.0, -0.6000000238418579, -100.0, 1.3333333730697632, -0.20000000298023224, 1.2000000476837158, -1.0, -0.5, 7.0] 



  0%|                                        | 23/20000 [00:02<31:15, 10.65it/s]

Epsiode count: 23
Epsiode goal: [55]
Epsiode state sequence length = 9.
 [[56], [50], [48], [56], [57], [56], [58], [57], [55]]
Reward list (length = 8):
 [0.1666666716337204, -2.5, 0.875, -1.0, 2.0, -0.5, 3.0, 7.0] 

Epsiode count: 24
Epsiode goal: [53]
Epsiode state sequence length = 8.
 [[52], [51], [56], [49], [48], [48], [55], [53]]
Reward list (length = 7):
 [-1.0, 0.4000000059604645, 0.4285714328289032, -4.0, -100.0, 0.7142857313156128, 7.0] 



  0%|                                        | 25/20000 [00:02<32:21, 10.29it/s]

Episode count: 25
Episode goal: [60]
Episode state sequence length = 13.
 [[50], [56], [58], [48], [57], [58], [56], [58], [56], [58], [56], [59], [49]]
Reward list (length = 12):
 [1.6666666269302368, 2.0, -0.20000000298023224, 1.3333333730697632, 3.0, -1.0, 2.0, -1.0, 2.0, -1.0, 1.3333333730697632, -0.10000000149011612] 

Episode count: 26
Episode goal: [50]
Episode state sequence length = 13.
 [[51], [56], [55], [58], [49], [54], [56], [60], [53], [52], [53], [54], [56]]
Reward list (length = 12):
 [-0.20000000298023224, 6.0, -1.6666666269302368, 0.8888888955116272, 0.20000000298023224, -2.0, -1.5, 1.4285714626312256, 3.0, -2.0, -3.0, -2.0] 



  0%|                                        | 28/20000 [00:03<41:59,  7.93it/s]

Episode count: 27
Episode goal: [57]
Episode state sequence length = 13.
 [[53], [53], [58], [56], [58], [52], [51], [55], [53], [52], [59], [55], [60]]
Reward list (length = 12):
 [-100.0, 0.800000011920929, 0.5, 0.5, 0.1666666716337204, -5.0, 1.5, -1.0, -4.0, 0.7142857313156128, 0.5, 0.4000000059604645] 



  0%|                                        | 29/20000 [00:03<45:30,  7.31it/s]

Episode count: 28
Episode goal: [49]
Episode state sequence length = 13.
 [[50], [60], [55], [56], [48], [58], [56], [58], [58], [56], [53], [51], [58]]
Reward list (length = 12):
 [-0.10000000149011612, 2.200000047683716, -6.0, 0.875, 0.10000000149011612, 4.5, -3.5, -100.0, 4.5, 2.3333332538604736, 2.0, -0.2857142984867096] 

Episode count: 29
Episode goal: [53]
Episode state sequence length = 13.
 [[51], [56], [55], [56], [48], [56], [48], [56], [60], [56], [55], [56], [58]]
Reward list (length = 12):
 [0.4000000059604645, 3.0, -2.0, 0.375, 0.625, 0.375, 0.625, -0.75, 1.75, 3.0, -2.0, -1.5] 



  0%|                                        | 32/20000 [00:03<41:59,  7.92it/s]

Episode count: 30
Episode goal: [55]
Episode state sequence length = 2.
 [[54], [55]]
Reward list (length = 1):
 [7.0] 

Episode count: 31
Episode goal: [48]
Episode state sequence length = 13.
 [[58], [56], [55], [56], [55], [57], [51], [56], [57], [56], [55], [53], [52]]
Reward list (length = 12):
 [5.0, 8.0, -7.0, 8.0, -3.5, 1.5, -0.6000000238418579, -8.0, 9.0, 8.0, 3.5, 5.0] 



  0%|                                        | 34/20000 [00:03<34:04,  9.77it/s]

Episode count: 32
Episode goal: [60]
Episode state sequence length = 6.
 [[56], [56], [58], [57], [56], [60]]
Reward list (length = 5):
 [-100.0, 2.0, -2.0, -3.0, 7.0] 

Episode count: 33
Episode goal: [51]
Episode state sequence length = 4.
 [[56], [60], [49], [51]]
Reward list (length = 3):
 [-1.25, 0.8181818127632141, 7.0] 

Episode count: 34
Episode goal: [56]
Episode state sequence length = 6.
 [[49], [51], [58], [57], [58], [56]]
Reward list (length = 5):
 [3.5, 0.7142857313156128, 2.0, -1.0, 7.0] 

Episode count: 35
Episode goal: [59]
Episode state sequence length = 1.
 [[59]]
Reward list (length = 0):
 [] 

Episode count: 36
Episode goal: [56]
Episode state sequence length = 1.
 [[56]]
Reward list (length = 0):
 [] 

Episode count: 37
Episode goal: [57]
Episode state sequence length = 7.
 [[56], [55], [53], [52], [53], [60], [57]]
Reward list (length = 6):
 [-1.0, -1.0, -4.0, 5.0, 0.5714285969734192, 7.0] 



  0%|                                        | 40/20000 [00:04<23:53, 13.93it/s]

Episode count: 38
Episode goal: [50]
Episode state sequence length = 2.
 [[52], [50]]
Reward list (length = 1):
 [7.0] 

Episode count: 39
Episode goal: [58]
Episode state sequence length = 10.
 [[51], [54], [60], [60], [55], [51], [52], [51], [51], [58]]
Reward list (length = 9):
 [2.3333332538604736, 0.6666666865348816, -100.0, 0.4000000059604645, -0.75, 7.0, -6.0, -100.0, 7.0] 

Episode count: 40
Episode goal: [57]
Episode state sequence length = 8.
 [[58], [53], [51], [56], [58], [60], [53], [57]]
Reward list (length = 7):
 [0.20000000298023224, -2.0, 1.2000000476837158, 0.5, -0.5, 0.4285714328289032, 7.0] 



  0%|                                        | 42/20000 [00:04<23:35, 14.10it/s]

Episode count: 41
Episode goal: [53]
Episode state sequence length = 3.
 [[49], [60], [53]]
Reward list (length = 2):
 [0.3636363744735718, 7.0] 

Episode count: 42
Episode goal: [60]
Episode state sequence length = 13.
 [[52], [52], [55], [48], [58], [48], [54], [58], [59], [52], [56], [50], [49]]
Reward list (length = 12):
 [-100.0, 2.6666667461395264, -0.7142857313156128, 1.2000000476837158, -0.20000000298023224, 2.0, 1.5, 2.0, -0.1428571492433548, 2.0, -0.6666666865348816, -10.0] 



  0%|                                        | 47/20000 [00:04<22:05, 15.06it/s]

Episode count: 43
Episode goal: [55]
Episode state sequence length = 2.
 [[48], [55]]
Reward list (length = 1):
 [7.0] 

Episode count: 44
Episode goal: [53]
Episode state sequence length = 1.
 [[53]]
Reward list (length = 0):
 [] 

Episode count: 45
Episode goal: [56]
Episode state sequence length = 2.
 [[51], [56]]
Reward list (length = 1):
 [7.0] 

Episode count: 46
Episode goal: [48]
Episode state sequence length = 10.
 [[60], [53], [51], [56], [55], [52], [56], [57], [56], [48]]
Reward list (length = 9):
 [1.7142857313156128, 2.5, -0.6000000238418579, 8.0, 2.3333332538604736, -1.0, -8.0, 9.0, 7.0] 

Episode count: 47
Episode goal: [56]
Episode state sequence length = 2.
 [[57], [56]]
Reward list (length = 1):
 [7.0] 



  0%|                                        | 49/20000 [00:04<24:46, 13.42it/s]

Episode count: 48
Episode goal: [48]
Episode state sequence length = 13.
 [[53], [52], [58], [57], [51], [56], [55], [53], [52], [56], [60], [57], [55]]
Reward list (length = 12):
 [5.0, -0.6666666865348816, 10.0, 1.5, -0.6000000238418579, 8.0, 3.5, 5.0, -1.0, -2.0, 4.0, 4.5] 

Episode count: 49
Episode goal: [54]
Episode state sequence length = 13.
 [[53], [51], [55], [56], [55], [52], [48], [58], [56], [51], [58], [51], [52]]
Reward list (length = 12):
 [-0.5, 0.75, -1.0, 2.0, 0.3333333432674408, -0.5, 0.6000000238418579, 2.0, 0.4000000059604645, 0.4285714328289032, 0.5714285969734192, 3.0] 



  0%|                                        | 51/20000 [00:05<33:49,  9.83it/s]

Episode count: 50
Episode goal: [54]
Episode state sequence length = 13.
 [[58], [56], [55], [56], [58], [48], [49], [48], [51], [56], [56], [49], [51]]
Reward list (length = 12):
 [2.0, 2.0, -1.0, -1.0, 0.4000000059604645, 6.0, -5.0, 2.0, 0.6000000238418579, -100.0, 0.2857142984867096, 2.5] 

Episode count: 51
Episode goal: [57]
Episode state sequence length = 13.
 [[53], [51], [55], [56], [55], [53], [60], [53], [51], [56], [58], [60], [53]]
Reward list (length = 12):
 [-2.0, 1.5, 2.0, -1.0, -1.0, 0.5714285969734192, 0.4285714328289032, -2.0, 1.2000000476837158, 0.5, -0.5, 0.4285714328289032] 



  0%|                                        | 53/20000 [00:05<40:20,  8.24it/s]

Episode count: 52
Episode goal: [58]
Episode state sequence length = 13.
 [[49], [60], [54], [51], [51], [56], [54], [54], [54], [51], [54], [53], [60]]
Reward list (length = 12):
 [0.8181818127632141, 0.3333333432674408, -1.3333333730697632, -100.0, 1.399999976158142, -1.0, -100.0, -100.0, -1.3333333730697632, 2.3333332538604736, -4.0, 0.7142857313156128] 

Episode count: 53
Episode goal: [60]
Episode state sequence length = 7.
 [[55], [48], [58], [59], [59], [59], [60]]
Reward list (length = 6):
 [-0.7142857313156128, 1.2000000476837158, 2.0, -100.0, -100.0, 7.0] 



  0%|                                        | 55/20000 [00:05<40:49,  8.14it/s]

Episode count: 54
Episode goal: [58]
Episode state sequence length = 13.
 [[54], [53], [54], [51], [55], [54], [59], [54], [55], [54], [51], [51], [60]]
Reward list (length = 12):
 [-4.0, 5.0, -1.3333333730697632, 1.75, -3.0, 0.800000011920929, 0.20000000298023224, 4.0, -3.0, -1.3333333730697632, -100.0, 0.7777777910232544] 

Episode count: 55
Episode goal: [58]
Episode state sequence length = 13.
 [[49], [54], [55], [51], [56], [55], [50], [56], [54], [56], [54], [56], [55]]
Reward list (length = 12):
 [1.7999999523162842, 4.0, -0.75, 1.399999976158142, -2.0, -0.6000000238418579, 1.3333333730697632, -1.0, 2.0, -1.0, 2.0, -2.0] 



  0%|                                        | 57/20000 [00:05<46:04,  7.22it/s]

Episode count: 56
Episode goal: [60]
Episode state sequence length = 13.
 [[49], [52], [58], [55], [55], [58], [53], [56], [49], [58], [48], [58], [56]]
Reward list (length = 12):
 [3.6666667461395264, 1.3333333730697632, -0.6666666865348816, -100.0, 1.6666666269302368, -0.4000000059604645, 2.3333332538604736, -0.5714285969734192, 1.2222222089767456, -0.20000000298023224, 1.2000000476837158, -1.0] 

Episode count: 57
Episode goal: [55]
Episode state sequence length = 2.
 [[51], [55]]
Reward list (length = 1):
 [7.0] 



  0%|                                        | 59/20000 [00:06<40:28,  8.21it/s]

Episode count: 58
Episode goal: [50]
Episode state sequence length = 13.
 [[48], [55], [48], [53], [52], [53], [54], [56], [58], [52], [56], [55], [56]]
Reward list (length = 12):
 [0.2857142984867096, 0.7142857313156128, 0.4000000059604645, 3.0, -2.0, -3.0, -2.0, -3.0, 1.3333333730697632, -0.5, 6.0, -5.0] 

Episode count: 59
Episode goal: [60]
Episode state sequence length = 13.
 [[57], [53], [52], [58], [56], [55], [58], [54], [58], [54], [51], [58], [58]]
Reward list (length = 12):
 [-0.75, -7.0, 1.3333333730697632, -1.0, -4.0, 1.6666666269302368, -0.5, 1.5, -0.5, -2.0, 1.2857142686843872, -100.0] 



  0%|                                        | 60/20000 [00:06<43:47,  7.59it/s]

Episode count: 60
Episode goal: [55]
Episode state sequence length = 2.
 [[54], [55]]
Reward list (length = 1):
 [7.0] 

Episode count: 61
Episode goal: [55]
Episode state sequence length = 4.
 [[49], [51], [56], [55]]
Reward list (length = 3):
 [3.0, 0.800000011920929, 7.0] 

Episode count: 62
Episode goal: [48]
Episode state sequence length = 13.
 [[59], [55], [55], [52], [56], [60], [58], [53], [52], [51], [55], [56], [55]]
Reward list (length = 12):
 [2.75, -100.0, 2.3333332538604736, -1.0, -2.0, 6.0, 2.0, 5.0, 4.0, -0.75, -7.0, 8.0] 



  0%|▏                                       | 64/20000 [00:06<38:46,  8.57it/s]

Episode count: 63
Episode goal: [50]
Episode state sequence length = 13.
 [[59], [52], [54], [56], [60], [53], [52], [56], [58], [56], [48], [49], [52]]
Reward list (length = 12):
 [1.2857142686843872, -1.0, -2.0, -1.5, 1.4285714626312256, 3.0, -0.5, -3.0, 4.0, 0.75, 2.0, 0.3333333432674408] 

Episode count: 64
Episode goal: [55]
Episode state sequence length = 5.
 [[56], [56], [51], [56], [55]]
Reward list (length = 4):
 [-100.0, 0.20000000298023224, 0.800000011920929, 7.0] 



  0%|▏                                       | 66/20000 [00:06<37:58,  8.75it/s]

Episode count: 65
Episode goal: [54]
Episode state sequence length = 13.
 [[52], [58], [56], [58], [56], [58], [56], [49], [52], [58], [56], [58], [56]]
Reward list (length = 12):
 [0.3333333432674408, 2.0, -1.0, 2.0, -1.0, 2.0, 0.2857142984867096, 1.6666666269302368, 0.3333333432674408, 2.0, -1.0, 2.0] 

Episode count: 66
Episode goal: [55]
Episode state sequence length = 3.
 [[49], [54], [55]]
Reward list (length = 2):
 [1.2000000476837158, 7.0] 

Episode count: 67
Episode goal: [53]
Episode state sequence length = 4.
 [[49], [50], [50], [53]]
Reward list (length = 3):
 [4.0, -100.0, 7.0] 

Episode count: 68
Episode goal: [59]
Episode state sequence length = 13.
 [[55], [53], [52], [52], [52], [60], [53], [52], [51], [56], [48], [58], [53]]
Reward list (length = 12):
 [-2.0, -6.0, -100.0, -100.0, 0.875, 0.1428571492433548, -6.0, -7.0, 1.600000023841858, -0.375, 1.100000023841858, -0.20000000298023224] 



  0%|▏                                       | 70/20000 [00:07<36:50,  9.02it/s]

Episode count: 69
Episode goal: [52]
Episode state sequence length = 13.
 [[50], [56], [60], [55], [56], [53], [57], [58], [60], [58], [56], [58], [57]]
Reward list (length = 12):
 [0.3333333432674408, -1.0, 1.600000023841858, -3.0, 1.3333333730697632, -0.25, -5.0, -3.0, 4.0, 3.0, -2.0, 6.0] 

Episode count: 70
Episode goal: [58]
Episode state sequence length = 7.
 [[59], [54], [56], [54], [56], [54], [58]]
Reward list (length = 6):
 [0.20000000298023224, 2.0, -1.0, 2.0, -1.0, 7.0] 



  0%|▏                                       | 72/20000 [00:07<38:23,  8.65it/s]

Episode count: 71
Episode goal: [60]
Episode state sequence length = 13.
 [[56], [59], [52], [50], [48], [58], [51], [55], [48], [58], [59], [54], [58]]
Reward list (length = 12):
 [1.3333333730697632, -0.1428571492433548, -4.0, -5.0, 1.2000000476837158, -0.2857142984867096, 2.25, -0.7142857313156128, 1.2000000476837158, 2.0, -0.20000000298023224, 1.5] 

Episode count: 72
Episode goal: [60]
Episode state sequence length = 12.
 [[50], [51], [55], [49], [51], [56], [52], [55], [53], [51], [52], [60]]
Reward list (length = 11):
 [10.0, 2.25, -0.8333333134651184, 5.5, 1.7999999523162842, -1.0, 2.6666667461395264, -2.5, -3.5, 9.0, 7.0] 



  0%|▏                                       | 74/20000 [00:07<44:09,  7.52it/s]

Episode count: 73
Episode goal: [59]
Episode state sequence length = 13.
 [[56], [49], [51], [57], [49], [55], [58], [60], [52], [60], [49], [52], [55]]
Reward list (length = 12):
 [-0.4285714328289032, 5.0, 1.3333333730697632, -0.25, 1.6666666269302368, 1.3333333730697632, 0.5, 0.125, 0.875, 0.09090909361839294, 3.3333332538604736, 2.3333332538604736] 

Episode count: 74
Episode goal: [54]
Episode state sequence length = 13.
 [[60], [58], [56], [58], [58], [51], [51], [58], [48], [53], [48], [58], [56]]
Reward list (length = 12):
 [3.0, 2.0, -1.0, -100.0, 0.5714285969734192, -100.0, 0.4285714328289032, 0.4000000059604645, 1.2000000476837158, -0.20000000298023224, 0.6000000238418579, 2.0] 



  0%|▏                                       | 76/20000 [00:08<49:27,  6.71it/s]

Episode count: 75
Episode goal: [48]
Episode state sequence length = 13.
 [[52], [53], [52], [55], [56], [53], [51], [52], [55], [60], [56], [60], [55]]
Reward list (length = 12):
 [-4.0, 5.0, -1.3333333730697632, -7.0, 2.6666667461395264, 2.5, -3.0, -1.3333333730697632, -1.399999976158142, 3.0, -2.0, 2.4000000953674316] 

Episode count: 76
Episode goal: [48]
Episode state sequence length = 2.
 [[56], [48]]
Reward list (length = 1):
 [7.0] 



  0%|▏                                       | 78/20000 [00:08<42:23,  7.83it/s]

Episode count: 77
Episode goal: [58]
Episode state sequence length = 13.
 [[50], [49], [50], [48], [50], [54], [51], [53], [60], [53], [54], [60], [52]]
Reward list (length = 12):
 [-8.0, 9.0, -4.0, 5.0, 2.0, -1.3333333730697632, 3.5, 0.7142857313156128, 0.2857142984867096, 5.0, 0.6666666865348816, 0.25] 

Episode count: 78
Episode goal: [49]
Episode state sequence length = 13.
 [[56], [55], [56], [58], [55], [58], [53], [53], [58], [56], [58], [56], [49]]
Reward list (length = 12):
 [7.0, -6.0, -3.5, 3.0, -2.0, 1.7999999523162842, -100.0, -0.800000011920929, 4.5, -3.5, 4.5, 7.0] 



  0%|▏                                       | 79/20000 [00:08<46:12,  7.18it/s]

Episode count: 79
Episode goal: [58]
Episode state sequence length = 6.
 [[54], [53], [59], [54], [56], [58]]
Reward list (length = 5):
 [-4.0, 0.8333333134651184, 0.20000000298023224, 2.0, 7.0] 

Episode count: 80
Episode goal: [60]
Episode state sequence length = 13.
 [[58], [57], [58], [57], [59], [48], [57], [58], [56], [57], [53], [52], [59]]
Reward list (length = 12):
 [-2.0, 3.0, -2.0, 1.5, -0.09090909361839294, 1.3333333730697632, 3.0, -1.0, 4.0, -0.75, -7.0, 1.1428571939468384] 



  0%|▏                                       | 83/20000 [00:09<38:52,  8.54it/s]

Episode count: 81
Episode goal: [57]
Episode state sequence length = 2.
 [[53], [57]]
Reward list (length = 1):
 [7.0] 

Episode count: 82
Episode goal: [60]
Episode state sequence length = 13.
 [[51], [59], [49], [51], [58], [48], [57], [59], [52], [54], [48], [54], [56]]
Reward list (length = 12):
 [1.125, -0.10000000149011612, 5.5, 1.2857142686843872, -0.20000000298023224, 1.3333333730697632, 1.5, -0.1428571492433548, 4.0, -1.0, 2.0, 3.0] 

Episode count: 83
Episode goal: [60]
Episode state sequence length = 9.
 [[59], [54], [58], [57], [48], [58], [56], [52], [60]]
Reward list (length = 8):
 [-0.20000000298023224, 1.5, -2.0, -0.3333333432674408, 1.2000000476837158, -1.0, -1.0, 7.0] 



  0%|▏                                       | 85/20000 [00:09<40:49,  8.13it/s]

Episode count: 84
Episode goal: [52]
Episode state sequence length = 11.
 [[49], [54], [57], [58], [56], [56], [60], [58], [60], [49], [52]]
Reward list (length = 10):
 [0.6000000238418579, -0.6666666865348816, -5.0, 3.0, -100.0, -1.0, 4.0, -3.0, 0.7272727489471436, 7.0] 

Episode count: 85
Episode goal: [59]
Episode state sequence length = 13.
 [[60], [60], [60], [52], [60], [53], [52], [53], [52], [53], [52], [60], [60]]
Reward list (length = 12):
 [-100.0, -100.0, 0.125, 0.875, 0.1428571492433548, -6.0, 7.0, -6.0, 7.0, -6.0, 0.875, -100.0] 



  0%|▏                                       | 87/20000 [00:09<48:03,  6.91it/s]

Episode count: 86
Episode goal: [48]
Episode state sequence length = 13.
 [[60], [55], [56], [57], [55], [55], [49], [51], [58], [56], [49], [51], [55]]
Reward list (length = 12):
 [2.4000000953674316, -7.0, -8.0, 4.5, -100.0, 1.1666666269302368, -0.5, -0.4285714328289032, 5.0, 1.1428571939468384, -0.5, -0.75] 

Episode count: 87
Episode goal: [48]
Episode state sequence length = 13.
 [[60], [53], [60], [57], [51], [55], [58], [56], [57], [57], [56], [60], [55]]
Reward list (length = 12):
 [1.7142857313156128, -0.7142857313156128, 4.0, 1.5, -0.75, -2.3333332538604736, 5.0, -8.0, -100.0, 9.0, -2.0, 2.4000000953674316] 



  0%|▏                                       | 89/20000 [00:10<52:37,  6.31it/s]

Episode count: 88
Episode goal: [60]
Episode state sequence length = 13.
 [[52], [59], [59], [52], [58], [56], [56], [58], [56], [48], [52], [48], [59]]
Reward list (length = 12):
 [1.1428571939468384, -100.0, -0.1428571492433548, 1.3333333730697632, -1.0, -100.0, 2.0, -1.0, -0.5, 3.0, -2.0, 1.0909091234207153] 



  0%|▏                                       | 90/20000 [00:10<54:48,  6.05it/s]

Episode count: 89
Episode goal: [57]
Episode state sequence length = 13.
 [[60], [49], [60], [52], [56], [60], [55], [56], [55], [53], [58], [56], [49]]
Reward list (length = 12):
 [0.27272728085517883, 0.7272727489471436, 0.375, 1.25, 0.25, 0.6000000238418579, 2.0, -1.0, -1.0, 0.800000011920929, 0.5, -0.1428571492433548] 

Episode count: 90
Episode goal: [57]
Episode state sequence length = 13.
 [[56], [55], [48], [55], [56], [55], [53], [51], [48], [58], [56], [58], [53]]
Reward list (length = 12):
 [-1.0, -0.2857142984867096, 1.2857142686843872, 2.0, -1.0, -1.0, -2.0, -2.0, 0.8999999761581421, 0.5, 0.5, 0.20000000298023224] 



  0%|▏                                       | 92/20000 [00:10<55:01,  6.03it/s]

Episode count: 91
Episode goal: [58]
Episode state sequence length = 13.
 [[56], [51], [60], [53], [54], [56], [50], [50], [48], [52], [49], [51], [56]]
Reward list (length = 12):
 [-0.4000000059604645, 0.7777777910232544, 0.2857142984867096, 5.0, 2.0, -0.3333333432674408, -100.0, -4.0, 2.5, -2.0, 4.5, 1.399999976158142] 

Episode count: 92
Episode goal: [54]
Episode state sequence length = 13.
 [[49], [51], [58], [56], [58], [56], [48], [55], [58], [57], [49], [58], [56]]
Reward list (length = 12):
 [2.5, 0.4285714328289032, 2.0, -1.0, 2.0, 0.25, 0.8571428656578064, -0.3333333432674408, 4.0, 0.375, 0.5555555820465088, 2.0] 



  0%|▏                                       | 94/20000 [00:10<57:53,  5.73it/s]

Episode count: 93
Episode goal: [50]
Episode state sequence length = 13.
 [[53], [48], [53], [52], [56], [60], [53], [52], [53], [51], [48], [57], [53]]
Reward list (length = 12):
 [0.6000000238418579, 0.4000000059604645, 3.0, -0.5, -1.5, 1.4285714626312256, 3.0, -2.0, 1.5, 0.3333333432674408, 0.2222222238779068, 1.75] 



  0%|▏                                       | 96/20000 [00:11<41:09,  8.06it/s]

Episode count: 94
Episode goal: [50]
Episode state sequence length = 7.
 [[56], [55], [53], [51], [56], [52], [50]]
Reward list (length = 6):
 [6.0, 2.5, 1.5, -0.20000000298023224, 1.5, 7.0] 

Episode count: 95
Episode goal: [55]
Episode state sequence length = 3.
 [[53], [51], [55]]
Reward list (length = 2):
 [-1.0, 7.0] 

Episode count: 96
Episode goal: [51]
Episode state sequence length = 1.
 [[51]]
Reward list (length = 0):
 [] 

Episode count: 97
Episode goal: [59]
Episode state sequence length = 13.
 [[54], [48], [52], [52], [58], [48], [55], [50], [48], [58], [48], [53], [52]]
Reward list (length = 12):
 [-0.8333333134651184, 2.75, -100.0, 1.1666666269302368, -0.10000000149011612, 1.5714285373687744, -0.800000011920929, -4.5, 1.100000023841858, -0.10000000149011612, 2.200000047683716, -6.0] 



  0%|▏                                       | 99/20000 [00:11<40:30,  8.19it/s]

Episode count: 98
Episode goal: [49]
Episode state sequence length = 13.
 [[57], [58], [48], [52], [51], [56], [58], [48], [58], [48], [53], [51], [58]]
Reward list (length = 12):
 [-8.0, 0.8999999761581421, 0.25, 3.0, -0.4000000059604645, -3.5, 0.8999999761581421, 0.10000000149011612, 0.8999999761581421, 0.20000000298023224, 2.0, -0.2857142984867096] 

Episode count: 99
Episode goal: [57]
Episode state sequence length = 4.
 [[54], [48], [58], [57]]
Reward list (length = 3):
 [-0.5, 0.8999999761581421, 7.0] 



  1%|▏                                      | 102/20000 [00:11<32:30, 10.20it/s]

Episode count: 100
Episode goal: [55]
Episode state sequence length = 2.
 [[54], [55]]
Reward list (length = 1):
 [7.0] 

Episode count: 101
Episode goal: [50]
Episode state sequence length = 13.
 [[60], [53], [52], [55], [49], [51], [54], [53], [49], [51], [56], [48], [52]]
Reward list (length = 12):
 [1.4285714626312256, 3.0, -0.6666666865348816, 0.8333333134651184, 0.5, -0.3333333432674408, 4.0, 0.75, 0.5, -0.20000000298023224, 0.75, 0.5] 

Episode count: 102
Episode goal: [57]
Episode state sequence length = 5.
 [[56], [59], [54], [53], [57]]
Reward list (length = 4):
 [0.3333333432674408, 0.4000000059604645, -3.0, 7.0] 



  1%|▏                                      | 107/20000 [00:11<22:30, 14.73it/s]

Episode count: 103
Episode goal: [49]
Episode state sequence length = 5.
 [[59], [55], [58], [56], [49]]
Reward list (length = 4):
 [2.5, -2.0, 4.5, 7.0] 

Episode count: 104
Episode goal: [56]
Episode state sequence length = 1.
 [[56]]
Reward list (length = 0):
 [] 

Episode count: 105
Episode goal: [53]
Episode state sequence length = 3.
 [[48], [56], [53]]
Reward list (length = 2):
 [0.625, 7.0] 

Episode count: 106
Episode goal: [50]
Episode state sequence length = 8.
 [[56], [55], [49], [52], [54], [48], [52], [50]]
Reward list (length = 7):
 [6.0, 0.8333333134651184, 0.3333333432674408, -1.0, 0.6666666865348816, 0.5, 7.0] 

Episode count: 107
Episode goal: [54]
Episode state sequence length = 7.
 [[51], [49], [52], [56], [48], [49], [54]]
Reward list (length = 6):
 [-1.5, 1.6666666269302368, 0.5, 0.25, 6.0, 7.0] 



  1%|▏                                      | 110/20000 [00:12<24:31, 13.52it/s]

Episode count: 108
Episode goal: [57]
Episode state sequence length = 1.
 [[57]]
Reward list (length = 0):
 [] 

Episode count: 109
Episode goal: [48]
Episode state sequence length = 13.
 [[49], [50], [56], [53], [53], [51], [58], [56], [60], [53], [60], [55], [56]]
Reward list (length = 12):
 [-1.0, -0.3333333432674408, 2.6666667461395264, -100.0, 2.5, -0.4285714328289032, 5.0, -2.0, 1.7142857313156128, -0.7142857313156128, 2.4000000953674316, -7.0] 

Episode count: 110
Episode goal: [49]
Episode state sequence length = 2.
 [[60], [49]]
Reward list (length = 1):
 [7.0] 



  1%|▏                                      | 112/20000 [00:12<26:20, 12.58it/s]

Episode count: 111
Episode goal: [57]
Episode state sequence length = 13.
 [[59], [52], [58], [56], [53], [52], [55], [53], [60], [52], [51], [58], [56]]
Reward list (length = 12):
 [0.2857142984867096, 0.8333333134651184, 0.5, -0.3333333432674408, -4.0, 1.6666666269302368, -1.0, 0.5714285969734192, 0.375, -5.0, 0.8571428656578064, 0.5] 

Episode count: 112
Episode goal: [56]
Episode state sequence length = 5.
 [[50], [48], [57], [58], [56]]
Reward list (length = 4):
 [-3.0, 0.8888888955116272, -1.0, 7.0] 



  1%|▏                                      | 114/20000 [00:12<29:27, 11.25it/s]

Episode count: 113
Episode goal: [57]
Episode state sequence length = 13.
 [[59], [58], [56], [48], [58], [53], [58], [56], [55], [52], [55], [56], [55]]
Reward list (length = 12):
 [2.0, 0.5, -0.125, 0.8999999761581421, 0.20000000298023224, 0.800000011920929, 0.5, -1.0, -0.6666666865348816, 1.6666666269302368, 2.0, -1.0] 

Episode count: 114
Episode goal: [54]
Episode state sequence length = 1.
 [[54]]
Reward list (length = 0):
 [] 

Episode count: 115
Episode goal: [54]
Episode state sequence length = 12.
 [[52], [53], [60], [53], [51], [58], [56], [58], [53], [52], [58], [54]]
Reward list (length = 11):
 [2.0, 0.1428571492433548, 0.8571428656578064, -0.5, 0.4285714328289032, 2.0, -1.0, 0.800000011920929, -1.0, 0.3333333432674408, 7.0] 



  1%|▏                                      | 116/20000 [00:12<28:23, 11.67it/s]

Episode count: 116
Episode goal: [51]
Episode state sequence length = 2.
 [[56], [51]]
Reward list (length = 1):
 [7.0] 

Episode count: 117
Episode goal: [53]
Episode state sequence length = 1.
 [[53]]
Reward list (length = 0):
 [] 

Episode count: 118
Episode goal: [56]
Episode state sequence length = 3.
 [[59], [57], [56]]
Reward list (length = 2):
 [1.5, 7.0] 

Episode count: 119
Episode goal: [48]
Episode state sequence length = 13.
 [[52], [56], [55], [58], [56], [49], [52], [58], [49], [60], [49], [52], [57]]
Reward list (length = 12):
 [-1.0, 8.0, -2.3333332538604736, 5.0, 1.1428571939468384, -0.3333333432674408, -0.6666666865348816, 1.1111111640930176, -0.09090909361839294, 1.0909091234207153, -0.3333333432674408, -0.800000011920929] 



  1%|▏                                      | 120/20000 [00:12<23:30, 14.09it/s]

Episode count: 120
Episode goal: [52]
Episode state sequence length = 13.
 [[59], [54], [56], [58], [56], [55], [56], [59], [54], [55], [58], [48], [54]]
Reward list (length = 12):
 [1.399999976158142, -1.0, -2.0, 3.0, 4.0, -3.0, -1.3333333730697632, 1.399999976158142, -2.0, -1.0, 0.6000000238418579, 0.6666666865348816] 



  1%|▏                                      | 122/20000 [00:13<32:47, 10.10it/s]

Episode count: 121
Episode goal: [60]
Episode state sequence length = 13.
 [[53], [52], [59], [48], [58], [53], [58], [56], [49], [58], [57], [53], [49]]
Reward list (length = 12):
 [-7.0, 1.1428571939468384, -0.09090909361839294, 1.2000000476837158, -0.4000000059604645, 1.399999976158142, -1.0, -0.5714285969734192, 1.2222222089767456, -2.0, -0.75, -1.75] 

Episode count: 122
Episode goal: [55]
Episode state sequence length = 6.
 [[48], [57], [57], [56], [48], [55]]
Reward list (length = 5):
 [0.7777777910232544, -100.0, 2.0, 0.125, 7.0] 



  1%|▏                                      | 124/20000 [00:13<34:59,  9.47it/s]

Episode count: 123
Episode goal: [50]
Episode state sequence length = 13.
 [[60], [52], [51], [48], [48], [51], [56], [52], [49], [48], [49], [51], [52]]
Reward list (length = 12):
 [1.25, 2.0, 0.3333333432674408, -100.0, 0.6666666865348816, -0.20000000298023224, 1.5, 0.6666666865348816, -1.0, 2.0, 0.5, -1.0] 



  1%|▏                                      | 126/20000 [00:13<34:04,  9.72it/s]

Episode count: 124
Episode goal: [52]
Episode state sequence length = 13.
 [[54], [53], [60], [58], [57], [56], [60], [58], [56], [55], [56], [55], [52]]
Reward list (length = 12):
 [2.0, -0.1428571492433548, 4.0, 6.0, 5.0, -1.0, 4.0, 3.0, 4.0, -3.0, 4.0, 7.0] 

Episode count: 125
Episode goal: [56]
Episode state sequence length = 2.
 [[60], [56]]
Reward list (length = 1):
 [7.0] 

Episode count: 126
Episode goal: [49]
Episode state sequence length = 13.
 [[60], [55], [56], [55], [53], [51], [55], [52], [51], [56], [58], [55], [58]]
Reward list (length = 12):
 [2.200000047683716, -6.0, 7.0, 3.0, 2.0, -0.5, 2.0, 3.0, -0.4000000059604645, -3.5, 3.0, -2.0] 



  1%|▏                                      | 128/20000 [00:14<40:31,  8.17it/s]

Episode count: 127
Episode goal: [48]
Episode state sequence length = 13.
 [[58], [56], [55], [58], [56], [49], [60], [53], [51], [55], [56], [52], [56]]
Reward list (length = 12):
 [5.0, 8.0, -2.3333332538604736, 5.0, 1.1428571939468384, -0.09090909361839294, 1.7142857313156128, 2.5, -0.75, -7.0, 2.0, -1.0] 

Episode count: 128
Episode goal: [59]
Episode state sequence length = 13.
 [[58], [48], [58], [56], [53], [58], [57], [51], [51], [52], [51], [51], [56]]
Reward list (length = 12):
 [-0.10000000149011612, 1.100000023841858, -0.5, -1.0, 1.2000000476837158, -1.0, -0.3333333432674408, -100.0, 8.0, -7.0, -100.0, 1.600000023841858] 



  1%|▎                                      | 130/20000 [00:14<42:23,  7.81it/s]

Episode count: 129
Episode goal: [57]
Episode state sequence length = 8.
 [[51], [58], [56], [55], [52], [60], [55], [57]]
Reward list (length = 7):
 [0.8571428656578064, 0.5, -1.0, -0.6666666865348816, 0.625, 0.6000000238418579, 7.0] 

Episode count: 130
Episode goal: [57]
Episode state sequence length = 6.
 [[54], [55], [56], [55], [53], [57]]
Reward list (length = 5):
 [3.0, 2.0, -1.0, -1.0, 7.0] 

Episode count: 131
Episode goal: [56]
Episode state sequence length = 3.
 [[59], [58], [56]]
Reward list (length = 2):
 [3.0, 7.0] 



  1%|▎                                      | 134/20000 [00:14<32:15, 10.27it/s]

Episode count: 132
Episode goal: [59]
Episode state sequence length = 13.
 [[51], [52], [58], [56], [60], [49], [51], [48], [53], [54], [48], [58], [60]]
Reward list (length = 12):
 [8.0, 1.1666666269302368, -0.5, 0.75, 0.09090909361839294, 5.0, -2.6666667461395264, 2.200000047683716, 6.0, -0.8333333134651184, 1.100000023841858, 0.5] 

Episode count: 133
Episode goal: [54]
Episode state sequence length = 1.
 [[54]]
Reward list (length = 0):
 [] 

Episode count: 134
Episode goal: [55]
Episode state sequence length = 3.
 [[49], [51], [55]]
Reward list (length = 2):
 [3.0, 7.0] 



  1%|▎                                      | 136/20000 [00:14<33:31,  9.87it/s]

Episode count: 135
Episode goal: [59]
Episode state sequence length = 13.
 [[60], [53], [60], [53], [58], [56], [54], [53], [58], [55], [58], [60], [52]]
Reward list (length = 12):
 [0.1428571492433548, 0.8571428656578064, 0.1428571492433548, 1.2000000476837158, -0.5, -1.5, -5.0, 1.2000000476837158, -0.3333333432674408, 1.3333333730697632, 0.5, 0.125] 

Episode count: 136
Episode goal: [56]
Episode state sequence length = 13.
 [[54], [48], [57], [60], [48], [60], [60], [60], [58], [48], [50], [55], [56]]
Reward list (length = 12):
 [-0.3333333432674408, 0.8888888955116272, -0.3333333432674408, 0.3333333432674408, 0.6666666865348816, -100.0, -100.0, 2.0, 0.20000000298023224, 4.0, 1.2000000476837158, 7.0] 

Episode count: 137
Episode goal: [59]
Episode state sequence length = 5.
 [[58], [56], [58], [53], [59]]
Reward list (length = 4):
 [-0.5, 1.5, -0.20000000298023224, 7.0] 



  1%|▎                                      | 140/20000 [00:15<30:51, 10.73it/s]

Episode count: 138
Episode goal: [52]
Episode state sequence length = 5.
 [[57], [57], [56], [55], [52]]
Reward list (length = 4):
 [-100.0, 5.0, 4.0, 7.0] 

Episode count: 139
Episode goal: [56]
Episode state sequence length = 5.
 [[59], [48], [57], [51], [56]]
Reward list (length = 4):
 [0.27272728085517883, 0.8888888955116272, 0.1666666716337204, 7.0] 

Episode count: 140
Episode goal: [56]
Episode state sequence length = 3.
 [[50], [50], [56]]
Reward list (length = 2):
 [-100.0, 7.0] 

Episode count: 141
Episode goal: [53]
Episode state sequence length = 1.
 [[53]]
Reward list (length = 0):
 [] 

Episode count: 142
Episode goal: [56]
Episode state sequence length = 5.
 [[50], [49], [51], [52], [56]]
Reward list (length = 4):
 [-6.0, 3.5, 5.0, 7.0] 



  1%|▎                                      | 145/20000 [00:15<19:57, 16.57it/s]

Epsiode count: 143
Epsiode goal: [55]
Epsiode state sequence length = 2.
 [[48], [55]]
Reward list (length = 1):
 [7.0] 

Epsiode count: 144
Epsiode goal: [53]
Epsiode state sequence length = 5.
 [[60], [48], [55], [55], [53]]
Reward list (length = 4):
 [0.5833333134651184, 0.7142857313156128, -100.0, 7.0] 

Epsiode count: 145
Epsiode goal: [57]
Epsiode state sequence length = 8.
 [[51], [59], [58], [53], [60], [53], [53], [57]]
Reward list (length = 7):
 [0.75, 2.0, 0.20000000298023224, 0.5714285969734192, 0.4285714328289032, -100.0, 7.0] 

Epsiode count: 146
Epsiode goal: [58]
Epsiode state sequence length = 6.
 [[54], [56], [49], [54], [56], [58]]
Reward list (length = 5):
 [2.0, -0.2857142984867096, 1.7999999523162842, 2.0, 7.0] 



  1%|▎                                      | 147/20000 [00:15<22:04, 14.99it/s]

Epsiode count: 147
Epsiode goal: [54]
Epsiode state sequence length = 13.
 [[59], [48], [58], [49], [48], [58], [56], [58], [51], [58], [56], [58], [56]]
Reward list (length = 12):
 [0.4545454680919647, 0.6000000238418579, 0.4444444477558136, -5.0, 0.6000000238418579, 2.0, -1.0, 0.5714285969734192, 0.4285714328289032, 2.0, -1.0, 2.0] 

  1%|▎                                      | 149/20000 [00:15<30:49, 10.73it/s]

Epsiode count: 148
Epsiode goal: [48]
Epsiode state sequence length = 13.
 [[49], [51], [52], [56], [58], [56], [50], [60], [58], [56], [55], [56], [56]]
Reward list (length = 12):
 [-0.5, -3.0, -1.0, -4.0, 5.0, 1.3333333730697632, -0.20000000298023224, 6.0, 5.0, 8.0, -7.0, -100.0] 

Epsiode count: 149
Epsiode goal: [53]
Epsiode state sequence length = 2.
 [[60], [53]]
Reward list (length = 1):
 [7.0] 

Epsiode count: 150
Epsiode goal: [50]
Epsiode state sequence length = 13.
 [[58], [54], [53], [51], [56], [55], [56], [54], [56], [52], [54], [49], [54]]
Reward list (length = 12):
 [2.0, 4.0, 1.5, -0.20000000298023224, 6.0, -5.0, 3.0, -2.0, 1.5, -1.0, 0.800000011920929, 0.20000000298023224] 



  1%|▎                                      | 151/20000 [00:16<31:00, 10.67it/s]

Epsiode count: 151
Epsiode goal: [59]
Epsiode state sequence length = 13.
 [[52], [58], [52], [55], [56], [58], [53], [60], [53], [52], [56], [55], [48]]
Reward list (length = 12):
 [1.1666666269302368, -0.1666666716337204, 2.3333332538604736, 4.0, 1.5, -0.20000000298023224, 0.8571428656578064, 0.1428571492433548, -6.0, 1.75, -3.0, -0.5714285969734192] 

Epsiode count: 152
Epsiode goal: [58]
Epsiode state sequence length = 8.
 [[49], [54], [55], [54], [56], [54], [48], [58]]
Reward list (length = 7):
 [1.7999999523162842, 4.0, -3.0, 2.0, -1.0, -0.6666666865348816, 7.0] 



  1%|▎                                      | 156/20000 [00:16<30:15, 10.93it/s]

Epsiode count: 153
Epsiode goal: [55]
Epsiode state sequence length = 2.
 [[58], [55]]
Reward list (length = 1):
 [7.0] 

Epsiode count: 154
Epsiode goal: [51]
Epsiode state sequence length = 3.
 [[54], [49], [51]]
Reward list (length = 2):
 [0.6000000238418579, 7.0] 

Epsiode count: 155
Epsiode goal: [54]
Epsiode state sequence length = 11.
 [[56], [55], [53], [48], [49], [52], [58], [53], [51], [59], [54]]
Reward list (length = 10):
 [2.0, 0.5, -0.20000000298023224, 6.0, 1.6666666269302368, 0.3333333432674408, 0.800000011920929, -0.5, 0.375, 7.0] 

Epsiode count: 156
Epsiode goal: [50]
Epsiode state sequence length = 2.
 [[52], [50]]
Reward list (length = 1):
 [7.0] 

Epsiode count: 157
Epsiode goal: [49]
Epsiode state sequence length = 4.
 [[59], [55], [56], [49]]
Reward list (length = 3):
 [2.5, -6.0, 7.0] 

Epsiode count: 158
Epsiode goal: [59]
Epsiode state sequence length = 13.
 [[48], [56], [55], [52], [54], [55], [52], [58], [48], [48], [58], [53], [52]]
Reward list (length = 

  1%|▎                                      | 159/20000 [00:16<28:30, 11.60it/s]

Epsiode count: 159
Epsiode goal: [52]
Epsiode state sequence length = 7.
 [[60], [54], [58], [56], [60], [58], [52]]
Reward list (length = 6):
 [1.3333333730697632, -0.5, 3.0, -1.0, 4.0, 7.0] 

Epsiode count: 160
Epsiode goal: [58]
Epsiode state sequence length = 13.
 [[48], [54], [56], [55], [48], [54], [55], [56], [55], [53], [51], [48], [55]]
Reward list (length = 12):
 [1.6666666269302368, 2.0, -2.0, -0.4285714328289032, 1.6666666269302368, 4.0, 3.0, -2.0, -1.5, -2.5, -2.3333332538604736, 1.4285714626312256] 



  1%|▎                                      | 161/20000 [00:17<31:58, 10.34it/s]

Epsiode count: 161
Epsiode goal: [48]
Epsiode state sequence length = 13.
 [[57], [56], [55], [57], [55], [52], [53], [56], [55], [56], [55], [56], [55]]
Reward list (length = 12):
 [9.0, 8.0, -3.5, 4.5, 2.3333332538604736, -4.0, -1.6666666269302368, 8.0, -7.0, 8.0, -7.0, 8.0] 

  1%|▎                                      | 163/20000 [00:17<38:45,  8.53it/s]

Epsiode count: 162
Epsiode goal: [54]
Epsiode state sequence length = 13.
 [[60], [53], [48], [58], [51], [48], [49], [58], [51], [49], [58], [55], [48]]
Reward list (length = 12):
 [0.8571428656578064, -0.20000000298023224, 0.6000000238418579, 0.5714285969734192, -1.0, 6.0, 0.5555555820465088, 0.5714285969734192, -1.5, 0.5555555820465088, 1.3333333730697632, 0.1428571492433548] 

Epsiode count: 163
Epsiode goal: [57]
Epsiode state sequence length = 2.
 [[58], [57]]
Reward list (length = 1):
 [7.0] 

Epsiode count: 164
Epsiode goal: [55]
Epsiode state sequence length = 4.
 [[58], [54], [56], [55]]
Reward list (length = 3):
 [0.75, 0.5, 7.0] 



  1%|▎                                      | 166/20000 [00:17<30:33, 10.82it/s]

Epsiode count: 165
Epsiode goal: [51]
Epsiode state sequence length = 7.
 [[50], [60], [53], [58], [48], [58], [51]]
Reward list (length = 6):
 [0.10000000149011612, 1.2857142686843872, -0.4000000059604645, 0.699999988079071, 0.30000001192092896, 7.0] 

Epsiode count: 166
Epsiode goal: [57]
Epsiode state sequence length = 7.
 [[55], [55], [58], [55], [53], [52], [57]]
Reward list (length = 6):
 [-100.0, 0.6666666865348816, 0.3333333432674408, -1.0, -4.0, 7.0] 

Epsiode count: 167
Epsiode goal: [54]
Epsiode state sequence length = 3.
 [[60], [49], [54]]
Reward list (length = 2):
 [0.5454545617103577, 7.0] 



  1%|▎                                      | 168/20000 [00:17<27:52, 11.86it/s]

Epsiode count: 168
Epsiode goal: [52]
Epsiode state sequence length = 13.
 [[58], [60], [58], [56], [57], [54], [56], [58], [56], [58], [56], [55], [58]]
Reward list (length = 12):
 [-3.0, 4.0, 3.0, -4.0, 1.6666666269302368, -1.0, -2.0, 3.0, -2.0, 3.0, 4.0, -1.0] 

Epsiode count: 169
Epsiode goal: [51]
Epsiode state sequence length = 3.
 [[52], [53], [51]]
Reward list (length = 2):
 [-1.0, 7.0] 



  1%|▎                                      | 172/20000 [00:18<29:44, 11.11it/s]

Epsiode count: 170
Epsiode goal: [59]
Epsiode state sequence length = 13.
 [[56], [49], [58], [57], [54], [48], [58], [56], [50], [52], [55], [56], [49]]
Reward list (length = 12):
 [-0.4285714328289032, 1.1111111640930176, -1.0, -0.6666666865348816, -0.8333333134651184, 1.100000023841858, -0.5, -0.5, 4.5, 2.3333332538604736, 4.0, -0.4285714328289032] 

Epsiode count: 171
Epsiode goal: [51]
Epsiode state sequence length = 2.
 [[49], [51]]
Reward list (length = 1):
 [7.0] 

  1%|▎                                      | 174/20000 [00:18<29:14, 11.30it/s]

Epsiode count: 172
Epsiode goal: [50]
Epsiode state sequence length = 11.
 [[49], [53], [48], [49], [54], [48], [53], [48], [53], [52], [50]]
Reward list (length = 10):
 [0.25, 0.6000000238418579, 2.0, 0.20000000298023224, 0.6666666865348816, 0.4000000059604645, 0.6000000238418579, 0.4000000059604645, 3.0, 7.0] 

Epsiode count: 173
Epsiode goal: [54]
Epsiode state sequence length = 3.
 [[59], [49], [54]]
Reward list (length = 2):
 [0.5, 7.0] 

Epsiode count: 174
Epsiode goal: [56]
Epsiode state sequence length = 7.
 [[48], [58], [48], [55], [52], [55], [56]]
Reward list (length = 6):
 [0.800000011920929, 0.20000000298023224, 1.1428571939468384, -0.3333333432674408, 1.3333333730697632, 7.0] 



  1%|▎                                      | 176/20000 [00:18<33:08,  9.97it/s]

Epsiode count: 175
Epsiode goal: [58]
Epsiode state sequence length = 13.
 [[49], [54], [55], [57], [54], [56], [53], [54], [56], [54], [50], [56], [54]]
Reward list (length = 12):
 [1.7999999523162842, 4.0, 1.5, -0.3333333432674408, 2.0, -0.6666666865348816, 5.0, 2.0, -1.0, -1.0, 1.3333333730697632, -1.0] 

Epsiode count: 176
Epsiode goal: [50]
Epsiode state sequence length = 13.
 [[60], [56], [58], [52], [48], [58], [56], [52], [58], [56], [49], [51], [52]]
Reward list (length = 12):
 [2.5, -3.0, 1.3333333730697632, 0.5, 0.20000000298023224, 4.0, 1.5, -0.3333333432674408, 4.0, 0.8571428656578064, 0.5, -1.0] 

Epsiode count: 177
Epsiode goal: [54]
Epsiode state sequence length = 13.
 [[53], [49], [58], [56], [58], [56], [49], [58], [56], [55], [52], [56], [60]]
Reward list (length = 12):
 [-0.25, 0.5555555820465088, 2.0, -1.0, 2.0, 0.2857142984867096, 0.5555555820465088, 2.0, 2.0, 0.3333333432674408, 0.5, -0.5] 



  1%|▎                                      | 179/20000 [00:18<42:46,  7.72it/s]

Epsiode count: 178
Epsiode goal: [48]
Epsiode state sequence length = 13.
 [[59], [54], [56], [58], [56], [55], [56], [56], [55], [56], [60], [53], [51]]
Reward list (length = 12):
 [2.200000047683716, -3.0, -4.0, 5.0, 8.0, -7.0, -100.0, 8.0, -7.0, -2.0, 1.7142857313156128, 2.5] 

Epsiode count: 179
Epsiode goal: [58]
Epsiode state sequence length = 4.
 [[53], [52], [54], [58]]
Reward list (length = 3):
 [-5.0, 3.0, 7.0] 



  1%|▎                                      | 182/20000 [00:19<29:18, 11.27it/s]

Epsiode count: 180
Epsiode goal: [59]
Epsiode state sequence length = 2.
 [[51], [59]]
Reward list (length = 1):
 [7.0] 

Epsiode count: 181
Epsiode goal: [57]
Epsiode state sequence length = 4.
 [[55], [53], [52], [57]]
Reward list (length = 3):
 [-1.0, -4.0, 7.0] 

Epsiode count: 182
Epsiode goal: [58]
Epsiode state sequence length = 10.
 [[59], [54], [51], [56], [54], [56], [54], [49], [48], [58]]
Reward list (length = 9):
 [0.20000000298023224, -1.3333333730697632, 1.399999976158142, -1.0, 2.0, -1.0, -0.800000011920929, -9.0, 7.0] 

  1%|▎                                      | 184/20000 [00:19<33:10,  9.96it/s]

Epsiode count: 183
Epsiode goal: [52]
Epsiode state sequence length = 10.
 [[55], [56], [60], [58], [56], [58], [56], [58], [57], [52]]
Reward list (length = 9):
 [-3.0, -1.0, 4.0, 3.0, -2.0, 3.0, -2.0, 6.0, 7.0] 

Epsiode count: 184
Epsiode goal: [52]
Epsiode state sequence length = 3.
 [[48], [58], [52]]
Reward list (length = 2):
 [0.4000000059604645, 7.0] 

Epsiode count: 185
Epsiode goal: [52]
Epsiode state sequence length = 13.
 [[56], [60], [57], [58], [56], [58], [60], [56], [55], [53], [49], [59], [54]]
Reward list (length = 12):
 [-1.0, 2.6666667461395264, -5.0, 3.0, -2.0, -3.0, 2.0, 4.0, 1.5, 0.25, 0.30000001192092896, 1.399999976158142] 



  1%|▎                                      | 186/20000 [00:19<32:53, 10.04it/s]

Epsiode count: 186
Epsiode goal: [50]
Epsiode state sequence length = 13.
 [[49], [49], [49], [48], [58], [56], [55], [52], [56], [55], [56], [55], [50]]
Reward list (length = 12):
 [-100.0, -100.0, -1.0, 0.20000000298023224, 4.0, 6.0, 1.6666666269302368, -0.5, 6.0, -5.0, 6.0, 7.0] 

Epsiode count: 187
Epsiode goal: [49]
Epsiode state sequence length = 13.
 [[56], [55], [56], [58], [56], [50], [55], [56], [58], [56], [58], [57], [53]]
Reward list (length = 12):
 [7.0, -6.0, -3.5, 4.5, 1.1666666269302368, -0.20000000298023224, -6.0, -3.5, 4.5, -3.5, 9.0, 2.0] 



  1%|▎                                      | 188/20000 [00:19<39:28,  8.36it/s]

Epsiode count: 188
Epsiode goal: [55]
Epsiode state sequence length = 6.
 [[48], [54], [60], [58], [56], [55]]
Reward list (length = 5):
 [1.1666666269302368, 0.1666666716337204, 2.5, 1.5, 7.0] 

Epsiode count: 189
Epsiode goal: [60]
Epsiode state sequence length = 13.
 [[55], [56], [58], [53], [52], [58], [56], [58], [56], [58], [54], [58], [56]]
Reward list (length = 12):
 [5.0, 2.0, -0.4000000059604645, -7.0, 1.3333333730697632, -1.0, 2.0, -1.0, 2.0, -0.5, 1.5, -1.0] 



  1%|▎                                      | 191/20000 [00:20<40:57,  8.06it/s]

Epsiode count: 190
Epsiode goal: [53]
Epsiode state sequence length = 10.
 [[50], [58], [56], [55], [49], [48], [58], [56], [55], [53]]
Reward list (length = 9):
 [0.375, 2.5, 3.0, 0.3333333432674408, -4.0, 0.5, 2.5, 3.0, 7.0] 

Epsiode count: 191
Epsiode goal: [59]
Epsiode state sequence length = 2.
 [[54], [59]]
Reward list (length = 1):
 [7.0] 

Epsiode count: 192
Epsiode goal: [48]
Epsiode state sequence length = 13.
 [[56], [58], [56], [58], [55], [52], [56], [57], [55], [56], [55], [57], [55]]
Reward list (length = 12):
 [-4.0, 5.0, -4.0, 3.3333332538604736, 2.3333332538604736, -1.0, -8.0, 4.5, -7.0, 8.0, -3.5, 4.5] 



  1%|▍                                      | 194/20000 [00:20<41:57,  7.87it/s]

Epsiode count: 193
Epsiode goal: [54]
Epsiode state sequence length = 13.
 [[60], [49], [48], [58], [56], [55], [52], [56], [58], [56], [58], [56], [58]]
Reward list (length = 12):
 [0.5454545617103577, -5.0, 0.6000000238418579, 2.0, 2.0, 0.3333333432674408, 0.5, -1.0, 2.0, -1.0, 2.0, -1.0] 

Epsiode count: 194
Epsiode goal: [52]
Epsiode state sequence length = 2.
 [[53], [52]]
Reward list (length = 1):
 [7.0] 

  1%|▍                                      | 196/20000 [00:20<37:51,  8.72it/s]

Epsiode count: 195
Epsiode goal: [57]
Epsiode state sequence length = 13.
 [[54], [58], [56], [58], [56], [58], [56], [55], [52], [58], [48], [55], [55]]
Reward list (length = 12):
 [0.75, 0.5, 0.5, 0.5, 0.5, 0.5, -1.0, -0.6666666865348816, 0.8333333134651184, 0.10000000149011612, 1.2857142686843872, -100.0] 

Epsiode count: 196
Epsiode goal: [48]
Epsiode state sequence length = 13.
 [[49], [50], [56], [58], [51], [56], [55], [52], [55], [56], [55], [58], [56]]
Reward list (length = 12):
 [-1.0, -0.3333333432674408, -4.0, 1.4285714626312256, -0.6000000238418579, 8.0, 2.3333332538604736, -1.3333333730697632, -7.0, 8.0, -2.3333332538604736, 5.0] 



  1%|▍                                      | 198/20000 [00:21<46:01,  7.17it/s]

Epsiode count: 197
Epsiode goal: [59]
Epsiode state sequence length = 13.
 [[52], [58], [60], [53], [57], [53], [51], [56], [57], [49], [52], [55], [56]]
Reward list (length = 12):
 [1.1666666269302368, 0.5, 0.1428571492433548, 1.5, -0.5, -3.0, 1.600000023841858, 3.0, -0.25, 3.3333332538604736, 2.3333332538604736, 4.0] 

  1%|▍                                      | 199/20000 [00:21<48:25,  6.82it/s]

Epsiode count: 198
Epsiode goal: [49]
Epsiode state sequence length = 13.
 [[51], [56], [54], [53], [51], [59], [58], [56], [55], [56], [55], [56], [55]]
Reward list (length = 12):
 [-0.4000000059604645, 3.5, 5.0, 2.0, -0.25, 10.0, 4.5, 7.0, -6.0, 7.0, -6.0, 7.0] 

Epsiode count: 199
Epsiode goal: [51]
Epsiode state sequence length = 2.
 [[53], [51]]
Reward list (length = 1):
 [7.0] 

Epsiode count: 200
Epsiode goal: [52]
Epsiode state sequence length = 7.
 [[59], [56], [55], [58], [56], [55], [52]]
Reward list (length = 6):
 [2.3333332538604736, 4.0, -1.0, 3.0, 4.0, 7.0] 



  1%|▍                                      | 203/20000 [00:21<32:02, 10.30it/s]

Epsiode count: 201
Epsiode goal: [53]
Epsiode state sequence length = 7.
 [[59], [55], [55], [48], [58], [48], [53]]
Reward list (length = 6):
 [1.5, -100.0, 0.2857142984867096, 0.5, 0.5, 7.0] 

Epsiode count: 202
Epsiode goal: [51]
Epsiode state sequence length = 5.
 [[58], [57], [58], [53], [51]]
Reward list (length = 4):
 [7.0, -6.0, 1.399999976158142, 7.0] 

Epsiode count: 203
Epsiode goal: [49]
Epsiode state sequence length = 13.
 [[51], [56], [48], [58], [56], [58], [56], [58], [48], [58], [56], [58], [56]]
Reward list (length = 12):
 [-0.4000000059604645, 0.875, 0.10000000149011612, 4.5, -3.5, 4.5, -3.5, 0.8999999761581421, 0.10000000149011612, 4.5, -3.5, 4.5] 



  1%|▍                                      | 205/20000 [00:21<40:29,  8.15it/s]

Epsiode count: 204
Epsiode goal: [48]
Epsiode state sequence length = 13.
 [[50], [55], [55], [49], [52], [51], [56], [55], [53], [52], [58], [57], [49]]
Reward list (length = 12):
 [-0.4000000059604645, -100.0, 1.1666666269302368, -0.3333333432674408, 4.0, -0.6000000238418579, 8.0, 3.5, 5.0, -0.6666666865348816, 10.0, 1.125] 

Epsiode count: 205
Epsiode goal: [58]
Epsiode state sequence length = 13.
 [[52], [54], [56], [53], [54], [56], [54], [56], [54], [50], [48], [54], [51]]
Reward list (length = 12):
 [3.0, 2.0, -0.6666666865348816, 5.0, 2.0, -1.0, 2.0, -1.0, -1.0, -4.0, 1.6666666269302368, -1.3333333730697632] 



  1%|▍                                      | 206/20000 [00:22<44:09,  7.47it/s]

Epsiode count: 206
Epsiode goal: [58]
Epsiode state sequence length = 3.
 [[48], [54], [58]]
Reward list (length = 2):
 [1.6666666269302368, 7.0] 

Epsiode count: 207
Epsiode goal: [55]
Epsiode state sequence length = 2.
 [[59], [55]]
Reward list (length = 1):
 [7.0] 

Epsiode count: 208
Epsiode goal: [59]
Epsiode state sequence length = 13.
 [[57], [55], [56], [55], [53], [51], [51], [58], [56], [53], [52], [55], [52]]
Reward list (length = 12):
 [-1.0, 4.0, -3.0, -2.0, -3.0, -100.0, 1.1428571939468384, -0.5, -1.0, -6.0, 2.3333332538604736, -1.3333333730697632] 



  1%|▍                                      | 210/20000 [00:22<36:27,  9.05it/s]

Epsiode count: 209
Epsiode goal: [59]
Epsiode state sequence length = 10.
 [[49], [52], [58], [54], [53], [58], [56], [54], [58], [59]]
Reward list (length = 9):
 [3.3333332538604736, 1.1666666269302368, -0.25, -5.0, 1.2000000476837158, -0.5, -1.5, 1.25, 7.0] 

Epsiode count: 210
Epsiode goal: [50]
Epsiode state sequence length = 13.
 [[56], [55], [56], [55], [52], [54], [56], [58], [56], [55], [56], [55], [53]]
Reward list (length = 12):
 [6.0, -5.0, 6.0, 1.6666666269302368, -1.0, -2.0, -3.0, 4.0, 6.0, -5.0, 6.0, 2.5] 



  1%|▍                                      | 211/20000 [00:22<40:22,  8.17it/s]

Epsiode count: 211
Epsiode goal: [57]
Epsiode state sequence length = 4.
 [[56], [55], [58], [57]]
Reward list (length = 3):
 [-1.0, 0.6666666865348816, 7.0] 

Epsiode count: 212
Epsiode goal: [60]
Epsiode state sequence length = 13.
 [[51], [59], [48], [53], [51], [58], [57], [58], [59], [54], [58], [56], [55]]
Reward list (length = 12):
 [1.125, -0.09090909361839294, 2.4000000953674316, -3.5, 1.2857142686843872, -2.0, 3.0, 2.0, -0.20000000298023224, 1.5, -1.0, -4.0] 



  1%|▍                                      | 216/20000 [00:23<30:33, 10.79it/s]

Epsiode count: 213
Epsiode goal: [58]
Epsiode state sequence length = 2.
 [[51], [58]]
Reward list (length = 1):
 [7.0] 

Epsiode count: 214
Epsiode goal: [49]
Epsiode state sequence length = 1.
 [[49]]
Reward list (length = 0):
 [] 

Epsiode count: 215
Epsiode goal: [50]
Epsiode state sequence length = 13.
 [[53], [52], [51], [56], [55], [52], [56], [53], [54], [51], [56], [49], [54]]
Reward list (length = 12):
 [3.0, 2.0, -0.20000000298023224, 6.0, 1.6666666269302368, -0.5, 2.0, -3.0, 1.3333333730697632, -0.20000000298023224, 0.8571428656578064, 0.20000000298023224] 

Epsiode count: 216
Epsiode goal: [56]
Epsiode state sequence length = 2.
 [[58], [56]]
Reward list (length = 1):
 [7.0] 



  1%|▍                                      | 219/20000 [00:23<23:37, 13.95it/s]

Epsiode count: 217
Epsiode goal: [54]
Epsiode state sequence length = 4.
 [[58], [56], [58], [54]]
Reward list (length = 3):
 [2.0, -1.0, 7.0] 

Epsiode count: 218
Epsiode goal: [56]
Epsiode state sequence length = 5.
 [[50], [48], [58], [55], [56]]
Reward list (length = 4):
 [-3.0, 0.800000011920929, 0.6666666865348816, 7.0] 

Epsiode count: 219
Epsiode goal: [51]
Epsiode state sequence length = 6.
 [[56], [55], [60], [58], [54], [51]]
Reward list (length = 5):
 [5.0, -0.800000011920929, 4.5, 1.75, 7.0] 

Epsiode count: 220
Epsiode goal: [55]
Epsiode state sequence length = 2.
 [[59], [55]]
Reward list (length = 1):
 [7.0] 

Epsiode count: 221
Epsiode goal: [58]
Epsiode state sequence length = 3.
 [[57], [54], [58]]
Reward list (length = 2):
 [-0.3333333432674408, 7.0] 



  1%|▍                                      | 222/20000 [00:23<19:42, 16.72it/s]

Epsiode count: 222
Epsiode goal: [59]
Epsiode state sequence length = 13.
 [[57], [58], [56], [58], [56], [55], [48], [58], [56], [55], [52], [53], [51]]
Reward list (length = 12):
 [2.0, -0.5, 1.5, -0.5, -3.0, -0.5714285969734192, 1.100000023841858, -0.5, -3.0, -1.3333333730697632, 7.0, -3.0] 

Epsiode count: 223
Epsiode goal: [53]
Epsiode state sequence length = 5.
 [[49], [51], [56], [55], [53]]
Reward list (length = 4):
 [2.0, 0.4000000059604645, 3.0, 7.0] 



  1%|▍                                      | 224/20000 [00:23<24:13, 13.60it/s]

Epsiode count: 224
Epsiode goal: [60]
Epsiode state sequence length = 13.
 [[54], [56], [52], [51], [57], [56], [58], [49], [52], [58], [57], [58], [56]]
Reward list (length = 12):
 [3.0, -1.0, -8.0, 1.5, -3.0, 2.0, -0.2222222238779068, 3.6666667461395264, 1.3333333730697632, -2.0, 3.0, -1.0] 

  1%|▍                                      | 226/20000 [00:23<33:45,  9.76it/s]

Epsiode count: 225
Epsiode goal: [58]
Epsiode state sequence length = 13.
 [[56], [54], [51], [54], [56], [54], [48], [55], [53], [52], [54], [56], [53]]
Reward list (length = 12):
 [-1.0, -1.3333333730697632, 2.3333332538604736, 2.0, -1.0, -0.6666666865348816, 1.4285714626312256, -1.5, -5.0, 3.0, 2.0, -0.6666666865348816] 

Epsiode count: 226
Epsiode goal: [60]
Epsiode state sequence length = 13.
 [[59], [52], [59], [48], [58], [56], [54], [56], [52], [52], [59], [49], [54]]
Reward list (length = 12):
 [-0.1428571492433548, 1.1428571939468384, -0.09090909361839294, 1.2000000476837158, -1.0, -2.0, 3.0, -1.0, -100.0, 1.1428571939468384, -0.10000000149011612, 2.200000047683716] 



  1%|▍                                      | 228/20000 [00:24<40:06,  8.21it/s]

Epsiode count: 227
Epsiode goal: [50]
Epsiode state sequence length = 13.
 [[52], [55], [56], [55], [56], [54], [53], [54], [56], [53], [52], [54], [56]]
Reward list (length = 12):
 [-0.6666666865348816, -5.0, 6.0, -5.0, 3.0, 4.0, -3.0, -2.0, 2.0, 3.0, -1.0, -2.0] 

Epsiode count: 228
Epsiode goal: [49]
Epsiode state sequence length = 10.
 [[59], [52], [59], [57], [58], [56], [55], [58], [56], [49]]
Reward list (length = 9):
 [1.4285714626312256, -0.4285714328289032, 5.0, -8.0, 4.5, 7.0, -2.0, 4.5, 7.0] 



  1%|▍                                      | 230/20000 [00:24<38:08,  8.64it/s]

Epsiode count: 229
Epsiode goal: [57]
Epsiode state sequence length = 6.
 [[49], [55], [53], [60], [53], [57]]
Reward list (length = 5):
 [1.3333333730697632, -1.0, 0.5714285969734192, 0.4285714328289032, 7.0] 

Epsiode count: 230
Epsiode goal: [51]
Epsiode state sequence length = 12.
 [[54], [56], [58], [56], [55], [56], [58], [56], [49], [60], [53], [51]]
Reward list (length = 11):
 [-1.5, -2.5, 3.5, 5.0, -4.0, -2.5, 3.5, 0.7142857313156128, 0.1818181872367859, 1.2857142686843872, 7.0] 



  1%|▍                                      | 232/20000 [00:24<39:50,  8.27it/s]

Epsiode count: 231
Epsiode goal: [57]
Epsiode state sequence length = 8.
 [[58], [56], [58], [56], [58], [56], [58], [57]]
Reward list (length = 7):
 [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 7.0] 

Epsiode count: 232
Epsiode goal: [58]
Epsiode state sequence length = 13.
 [[51], [48], [53], [54], [56], [54], [55], [56], [54], [55], [56], [54], [58]]
Reward list (length = 12):
 [-2.3333332538604736, 2.0, 5.0, 2.0, -1.0, 4.0, 3.0, -1.0, 4.0, 3.0, -1.0, 7.0] 



  1%|▍                                      | 234/20000 [00:24<42:26,  7.76it/s]

Epsiode count: 233
Epsiode goal: [51]
Epsiode state sequence length = 9.
 [[56], [55], [52], [56], [57], [58], [53], [54], [51]]
Reward list (length = 8):
 [5.0, 1.3333333730697632, -0.25, -5.0, -6.0, 1.399999976158142, -2.0, 7.0] 

Epsiode count: 234
Epsiode goal: [53]
Epsiode state sequence length = 4.
 [[59], [54], [48], [53]]
Reward list (length = 3):
 [1.2000000476837158, 0.1666666716337204, 7.0] 

Epsiode count: 235
Epsiode goal: [55]
Epsiode state sequence length = 3.
 [[58], [56], [55]]
Reward list (length = 2):
 [1.5, 7.0] 

  1%|▍                                      | 237/20000 [00:25<32:47, 10.04it/s]

Epsiode count: 236
Epsiode goal: [50]
Epsiode state sequence length = 10.
 [[60], [53], [51], [55], [56], [53], [51], [56], [55], [50]]
Reward list (length = 9):
 [1.4285714626312256, 1.5, -0.25, -5.0, 2.0, 1.5, -0.20000000298023224, 6.0, 7.0] 

Epsiode count: 237
Epsiode goal: [56]
Epsiode state sequence length = 5.
 [[52], [60], [57], [58], [56]]
Reward list (length = 4):
 [0.5, 1.3333333730697632, -1.0, 7.0] 

Epsiode count: 238
Epsiode goal: [53]
Epsiode state sequence length = 11.
 [[50], [49], [48], [58], [56], [58], [56], [48], [55], [56], [53]]
Reward list (length = 10):
 [-3.0, -4.0, 0.5, 2.5, -1.5, 2.5, 0.375, 0.7142857313156128, -2.0, 7.0] 



  1%|▍                                      | 241/20000 [00:25<32:51, 10.02it/s]

Epsiode count: 239
Epsiode goal: [51]
Epsiode state sequence length = 7.
 [[50], [58], [56], [58], [56], [49], [51]]
Reward list (length = 6):
 [0.125, 3.5, -2.5, 3.5, 0.7142857313156128, 7.0] 

Epsiode count: 240
Epsiode goal: [49]
Epsiode state sequence length = 8.
 [[48], [55], [56], [58], [55], [58], [56], [49]]
Reward list (length = 7):
 [0.1428571492433548, -6.0, -3.5, 3.0, -2.0, 4.5, 7.0] 

  1%|▍                                      | 243/20000 [00:25<32:36, 10.10it/s]

Epsiode count: 241
Epsiode goal: [56]
Epsiode state sequence length = 3.
 [[54], [55], [56]]
Reward list (length = 2):
 [2.0, 7.0] 

Epsiode count: 242
Epsiode goal: [52]
Epsiode state sequence length = 12.
 [[51], [49], [51], [56], [58], [56], [58], [56], [58], [56], [55], [52]]
Reward list (length = 11):
 [-0.5, 1.5, 0.20000000298023224, -2.0, 3.0, -2.0, 3.0, -2.0, 3.0, 4.0, 7.0] 

Epsiode count: 243
Epsiode goal: [51]
Epsiode state sequence length = 13.
 [[58], [56], [52], [58], [57], [56], [58], [57], [58], [56], [58], [53], [60]]
Reward list (length = 12):
 [3.5, 1.25, -0.1666666716337204, 7.0, 6.0, -2.5, 7.0, -6.0, 3.5, -2.5, 1.399999976158142, -0.2857142984867096] 

Epsiode count: 244
Epsiode goal: [48]
Epsiode state sequence length = 13.
 [[54], [56], [49], [58], [56], [49], [52], [56], [55], [53], [52], [55], [56]]
Reward list (length = 12):
 [-3.0, 1.1428571939468384, -0.1111111119389534, 5.0, 1.1428571939468384, -0.3333333432674408, -1.0, 8.0, 3.5, 5.0, -1.3333333730697632, -7.0] 



  1%|▍                                      | 246/20000 [00:26<42:57,  7.66it/s]

Epsiode count: 245
Epsiode goal: [48]
Epsiode state sequence length = 13.
 [[54], [53], [58], [56], [55], [56], [55], [49], [51], [58], [56], [55], [56]]
Reward list (length = 12):
 [6.0, -1.0, 5.0, 8.0, -7.0, 8.0, 1.1666666269302368, -0.5, -0.4285714328289032, 5.0, 8.0, -7.0] 

Epsiode count: 246
Epsiode goal: [56]
Epsiode state sequence length = 1.
 [[56]]
Reward list (length = 0):
 [] 

  1%|▍                                      | 248/20000 [00:26<38:39,  8.52it/s]

Epsiode count: 247
Epsiode goal: [58]
Epsiode state sequence length = 13.
 [[53], [56], [54], [54], [54], [53], [49], [48], [56], [54], [53], [51], [57]]
Reward list (length = 12):
 [1.6666666269302368, -1.0, -100.0, -100.0, -4.0, -1.25, -9.0, 1.25, -1.0, -4.0, -2.5, 1.1666666269302368] 

Epsiode count: 248
Epsiode goal: [59]
Epsiode state sequence length = 13.
 [[48], [57], [58], [52], [55], [52], [58], [56], [53], [52], [52], [57], [58]]
Reward list (length = 12):
 [1.2222222089767456, 2.0, -0.1666666716337204, 2.3333332538604736, -1.3333333730697632, 1.1666666269302368, -0.5, -1.0, -6.0, -100.0, 1.399999976158142, 2.0] 



  1%|▍                                      | 251/20000 [00:26<36:46,  8.95it/s]

Epsiode count: 249
Epsiode goal: [56]
Epsiode state sequence length = 6.
 [[51], [48], [53], [48], [55], [56]]
Reward list (length = 5):
 [-1.6666666269302368, 1.600000023841858, -0.6000000238418579, 1.1428571939468384, 7.0] 

Epsiode count: 250
Epsiode goal: [55]
Epsiode state sequence length = 7.
 [[49], [60], [50], [48], [58], [57], [55]]
Reward list (length = 6):
 [0.5454545617103577, 0.5, -2.5, 0.699999988079071, 3.0, 7.0] 

Epsiode count: 251
Epsiode goal: [48]
Epsiode state sequence length = 13.
 [[60], [57], [51], [56], [57], [59], [53], [51], [56], [55], [58], [53], [51]]
Reward list (length = 12):
 [4.0, 1.5, -0.6000000238418579, -8.0, -4.5, 1.8333333730697632, 2.5, -0.6000000238418579, 8.0, -2.3333332538604736, 2.0, 2.5] 



  1%|▍                                      | 254/20000 [00:27<33:18,  9.88it/s]

Epsiode count: 252
Epsiode goal: [53]
Epsiode state sequence length = 3.
 [[59], [48], [53]]
Reward list (length = 2):
 [0.5454545617103577, 7.0] 

Epsiode count: 253
Epsiode goal: [52]
Epsiode state sequence length = 7.
 [[56], [58], [48], [58], [56], [55], [52]]
Reward list (length = 6):
 [-2.0, 0.6000000238418579, 0.4000000059604645, 3.0, 4.0, 7.0] 

Epsiode count: 254
Epsiode goal: [52]
Epsiode state sequence length = 13.
 [[51], [55], [56], [48], [57], [49], [55], [56], [55], [56], [58], [56], [55]]
Reward list (length = 12):
 [0.25, -3.0, 0.5, 0.4444444477558136, 0.625, 0.5, -3.0, 4.0, -3.0, -2.0, 3.0, 4.0] 



  1%|▍                                      | 256/20000 [00:27<33:40,  9.77it/s]

Epsiode count: 255
Epsiode goal: [51]
Epsiode state sequence length = 3.
 [[58], [56], [51]]
Reward list (length = 2):
 [3.5, 7.0] 

Epsiode count: 256
Epsiode goal: [50]
Epsiode state sequence length = 13.
 [[57], [56], [55], [56], [53], [49], [51], [56], [48], [51], [56], [54], [48]]
Reward list (length = 12):
 [7.0, 6.0, -5.0, 2.0, 0.75, 0.5, -0.20000000298023224, 0.75, 0.6666666865348816, -0.20000000298023224, 3.0, 0.6666666865348816] 



  1%|▌                                      | 258/20000 [00:27<32:16, 10.19it/s]

Epsiode count: 257
Epsiode goal: [52]
Epsiode state sequence length = 1.
 [[52]]
Reward list (length = 0):
 [] 

Epsiode count: 258
Epsiode goal: [60]
Epsiode state sequence length = 12.
 [[58], [48], [58], [53], [57], [58], [56], [55], [56], [55], [58], [60]]
Reward list (length = 11):
 [-0.20000000298023224, 1.2000000476837158, -0.4000000059604645, 1.75, 3.0, -1.0, -4.0, 5.0, -4.0, 1.6666666269302368, 7.0] 

Epsiode count: 259
Epsiode goal: [56]
Epsiode state sequence length = 5.
 [[50], [48], [58], [57], [56]]
Reward list (length = 4):
 [-3.0, 0.800000011920929, 2.0, 7.0] 



  1%|▌                                      | 260/20000 [00:27<33:10,  9.92it/s]

Epsiode count: 260
Epsiode goal: [59]
Epsiode state sequence length = 11.
 [[60], [53], [51], [55], [53], [57], [49], [50], [58], [57], [59]]
Reward list (length = 10):
 [0.1428571492433548, -3.0, 2.0, -2.0, 1.5, -0.25, 10.0, 1.125, -1.0, 7.0] 

Epsiode count: 261
Epsiode goal: [48]
Epsiode state sequence length = 13.
 [[59], [51], [55], [53], [49], [51], [58], [56], [55], [56], [57], [58], [56]]
Reward list (length = 12):
 [1.375, -0.75, 3.5, 1.25, -0.5, -0.4285714328289032, 5.0, 8.0, -7.0, -8.0, -9.0, 5.0] 



  1%|▌                                      | 264/20000 [00:28<34:08,  9.63it/s]

Epsiode count: 262
Epsiode goal: [49]
Epsiode state sequence length = 3.
 [[52], [53], [49]]
Reward list (length = 2):
 [-3.0, 7.0] 

Epsiode count: 263
Epsiode goal: [51]
Epsiode state sequence length = 8.
 [[59], [58], [56], [60], [58], [56], [58], [51]]
Reward list (length = 7):
 [8.0, 3.5, -1.25, 4.5, 3.5, -2.5, 7.0] 

Epsiode count: 264
Epsiode goal: [60]
Epsiode state sequence length = 1.
 [[60]]
Reward list (length = 0):
 [] 

Epsiode count: 265
Epsiode goal: [52]
Epsiode state sequence length = 13.
 [[48], [58], [56], [55], [56], [58], [56], [58], [60], [55], [56], [58], [56]]
Reward list (length = 12):
 [0.4000000059604645, 3.0, 4.0, -3.0, -2.0, 3.0, -2.0, -3.0, 1.600000023841858, -3.0, -2.0, 3.0] 



  1%|▌                                      | 268/20000 [00:28<31:47, 10.34it/s]

Epsiode count: 266
Epsiode goal: [53]
Epsiode state sequence length = 2.
 [[55], [53]]
Reward list (length = 1):
 [7.0] 

Epsiode count: 267
Epsiode goal: [48]
Epsiode state sequence length = 13.
 [[50], [55], [56], [55], [56], [55], [56], [55], [56], [58], [56], [58], [55]]
Reward list (length = 12):
 [-0.4000000059604645, -7.0, 8.0, -7.0, 8.0, -7.0, 8.0, -7.0, -4.0, 5.0, -4.0, 3.3333332538604736] 

Epsiode count: 268
Epsiode goal: [51]
Epsiode state sequence length = 3.
 [[59], [49], [51]]
Reward list (length = 2):
 [0.800000011920929, 7.0] 



  1%|▌                                      | 272/20000 [00:28<21:09, 15.55it/s]

Epsiode count: 269
Epsiode goal: [52]
Epsiode state sequence length = 2.
 [[51], [52]]
Reward list (length = 1):
 [7.0] 

Epsiode count: 270
Epsiode goal: [55]
Epsiode state sequence length = 3.
 [[57], [56], [55]]
Reward list (length = 2):
 [2.0, 7.0] 

Epsiode count: 271
Epsiode goal: [51]
Epsiode state sequence length = 3.
 [[55], [53], [51]]
Reward list (length = 2):
 [2.0, 7.0] 

Epsiode count: 272
Epsiode goal: [60]
Epsiode state sequence length = 13.
 [[55], [58], [56], [58], [56], [55], [58], [57], [57], [51], [58], [56], [59]]
Reward list (length = 12):
 [1.6666666269302368, -1.0, 2.0, -1.0, -4.0, 1.6666666269302368, -2.0, -100.0, -0.5, 1.2857142686843872, -1.0, 1.3333333730697632] 



  1%|▌                                      | 274/20000 [00:28<23:02, 14.27it/s]

Epsiode count: 273
Epsiode goal: [52]
Epsiode state sequence length = 1.
 [[52]]
Reward list (length = 0):
 [] 

Epsiode count: 274
Epsiode goal: [55]
Epsiode state sequence length = 4.
 [[59], [60], [57], [55]]
Reward list (length = 3):
 [-4.0, 1.6666666269302368, 7.0] 

Epsiode count: 275
Epsiode goal: [60]
Epsiode state sequence length = 13.
 [[52], [51], [56], [48], [58], [56], [59], [58], [56], [58], [52], [58], [56]]
Reward list (length = 12):
 [-8.0, 1.7999999523162842, -0.5, 1.2000000476837158, -1.0, 1.3333333730697632, -1.0, -1.0, 2.0, -0.3333333432674408, 1.3333333730697632, -1.0] 



  1%|▌                                      | 276/20000 [00:28<26:08, 12.58it/s]

Epsiode count: 276
Epsiode goal: [51]
Epsiode state sequence length = 5.
 [[48], [53], [52], [58], [51]]
Reward list (length = 4):
 [0.6000000238418579, 2.0, -0.1666666716337204, 7.0] 

Epsiode count: 277
Epsiode goal: [56]
Epsiode state sequence length = 3.
 [[57], [58], [56]]
Reward list (length = 2):
 [-1.0, 7.0] 

Epsiode count: 278
Epsiode goal: [48]
Epsiode state sequence length = 13.
 [[54], [53], [58], [56], [55], [56], [55], [58], [56], [55], [53], [52], [55]]
Reward list (length = 12):
 [6.0, -1.0, 5.0, 8.0, -7.0, 8.0, -2.3333332538604736, 5.0, 8.0, 3.5, 5.0, -1.3333333730697632] 



  1%|▌                                      | 279/20000 [00:29<27:05, 12.13it/s]

Epsiode count: 279
Epsiode goal: [57]
Epsiode state sequence length = 3.
 [[51], [56], [57]]
Reward list (length = 2):
 [1.2000000476837158, 7.0] 

Epsiode count: 280
Epsiode goal: [49]
Epsiode state sequence length = 13.
 [[59], [51], [55], [53], [58], [56], [55], [48], [58], [54], [58], [56], [55]]
Reward list (length = 12):
 [1.25, -0.5, 3.0, -0.800000011920929, 4.5, 7.0, 0.8571428656578064, 0.10000000149011612, 2.25, -1.25, 4.5, 7.0] 



  1%|▌                                      | 283/20000 [00:29<28:35, 11.49it/s]

Epsiode count: 281
Epsiode goal: [60]
Epsiode state sequence length = 13.
 [[54], [56], [51], [49], [51], [58], [56], [48], [58], [56], [58], [56], [55]]
Reward list (length = 12):
 [3.0, -0.800000011920929, -4.5, 5.5, 1.2857142686843872, -1.0, -0.5, 1.2000000476837158, -1.0, 2.0, -1.0, -4.0] 

Epsiode count: 282
Epsiode goal: [54]
Epsiode state sequence length = 1.
 [[54]]
Reward list (length = 0):
 [] 

Epsiode count: 283
Epsiode goal: [57]
Epsiode state sequence length = 2.
 [[59], [57]]
Reward list (length = 1):
 [7.0] 



  1%|▌                                      | 285/20000 [00:29<26:36, 12.35it/s]

Epsiode count: 284
Epsiode goal: [52]
Epsiode state sequence length = 7.
 [[60], [53], [50], [48], [58], [57], [52]]
Reward list (length = 6):
 [1.1428571939468384, 0.3333333432674408, -1.0, 0.4000000059604645, 6.0, 7.0] 

Epsiode count: 285
Epsiode goal: [49]
Epsiode state sequence length = 1.
 [[49]]
Reward list (length = 0):
 [] 

Epsiode count: 286
Epsiode goal: [57]
Epsiode state sequence length = 12.
 [[60], [60], [53], [54], [55], [58], [53], [58], [56], [55], [58], [57]]
Reward list (length = 11):
 [-100.0, 0.4285714328289032, 4.0, 3.0, 0.6666666865348816, 0.20000000298023224, 0.800000011920929, 0.5, -1.0, 0.6666666865348816, 7.0] 



  1%|▌                                      | 287/20000 [00:29<26:38, 12.33it/s]

Epsiode count: 287
Epsiode goal: [53]
Epsiode state sequence length = 1.
 [[53]]
Reward list (length = 0):
 [] 

Epsiode count: 288
Epsiode goal: [56]
Epsiode state sequence length = 5.
 [[53], [60], [53], [57], [56]]
Reward list (length = 4):
 [0.4285714328289032, 0.5714285969734192, 0.75, 7.0] 

Epsiode count: 289
Epsiode goal: [54]
Epsiode state sequence length = 13.
 [[55], [53], [51], [52], [56], [55], [52], [51], [58], [59], [49], [51], [58]]
Reward list (length = 12):
 [0.5, -0.5, 3.0, 0.5, 2.0, 0.3333333432674408, -2.0, 0.4285714328289032, -4.0, 0.5, 2.5, 0.4285714328289032] 



  1%|▌                                      | 290/20000 [00:30<26:35, 12.36it/s]

Epsiode count: 290
Epsiode goal: [54]
Epsiode state sequence length = 13.
 [[50], [55], [49], [58], [53], [52], [49], [52], [58], [56], [55], [58], [52]]
Reward list (length = 12):
 [0.800000011920929, 0.1666666716337204, 0.5555555820465088, 0.800000011920929, -1.0, -0.6666666865348816, 1.6666666269302368, 0.3333333432674408, 2.0, 2.0, -0.3333333432674408, 0.6666666865348816] 

Epsiode count: 291
Epsiode goal: [52]
Epsiode state sequence length = 3.
 [[60], [58], [52]]
Reward list (length = 2):
 [4.0, 7.0] 



  1%|▌                                      | 292/20000 [00:30<28:37, 11.48it/s]

Epsiode count: 292
Epsiode goal: [53]
Epsiode state sequence length = 5.
 [[51], [48], [56], [58], [53]]
Reward list (length = 4):
 [-0.6666666865348816, 0.625, -1.5, 7.0] 

Epsiode count: 293
Epsiode goal: [50]
Epsiode state sequence length = 13.
 [[59], [49], [55], [52], [53], [49], [54], [55], [56], [55], [52], [54], [56]]
Reward list (length = 12):
 [0.8999999761581421, 0.1666666716337204, 1.6666666269302368, -2.0, 0.75, 0.20000000298023224, -4.0, -5.0, 6.0, 1.6666666269302368, -1.0, -2.0] 



  1%|▌                                      | 294/20000 [00:30<31:54, 10.29it/s]

Epsiode count: 294
Epsiode goal: [58]
Epsiode state sequence length = 13.
 [[53], [51], [56], [55], [54], [56], [55], [53], [54], [55], [56], [59], [52]]
Reward list (length = 12):
 [-2.5, 1.399999976158142, -2.0, -3.0, 2.0, -2.0, -1.5, 5.0, 4.0, 3.0, 0.6666666865348816, 0.1428571492433548] 

Epsiode count: 295
Epsiode goal: [51]
Epsiode state sequence length = 8.
 [[54], [58], [56], [49], [58], [56], [49], [51]]
Reward list (length = 7):
 [-0.75, 3.5, 0.7142857313156128, 0.2222222238779068, 3.5, 0.7142857313156128, 7.0] 



  1%|▌                                      | 297/20000 [00:31<36:13,  9.07it/s]

Epsiode count: 296
Epsiode goal: [59]
Epsiode state sequence length = 9.
 [[51], [56], [53], [58], [53], [52], [53], [58], [59]]
Reward list (length = 8):
 [1.600000023841858, -1.0, 1.2000000476837158, -0.20000000298023224, -6.0, 7.0, 1.2000000476837158, 7.0] 

Epsiode count: 297
Epsiode goal: [56]
Epsiode state sequence length = 1.
 [[56]]
Reward list (length = 0):
 [] 

Epsiode count: 298
Epsiode goal: [55]
Epsiode state sequence length = 3.
 [[59], [57], [55]]
Reward list (length = 2):
 [2.0, 7.0] 

Epsiode count: 299
Epsiode goal: [52]
Epsiode state sequence length = 1.
 [[52]]
Reward list (length = 0):
 [] 

Epsiode count: 300
Epsiode goal: [52]
Epsiode state sequence length = 11.
 [[49], [48], [58], [56], [55], [54], [56], [55], [56], [55], [52]]
Reward list (length = 10):
 [-3.0, 0.4000000059604645, 3.0, 4.0, 3.0, -1.0, 4.0, -3.0, 4.0, 7.0] 



  2%|▌                                      | 301/20000 [00:31<25:38, 12.80it/s]

Epsiode count: 301
Epsiode goal: [57]
Epsiode state sequence length = 12.
 [[56], [55], [53], [58], [56], [55], [53], [51], [58], [56], [55], [57]]
Reward list (length = 11):
 [-1.0, -1.0, 0.800000011920929, 0.5, -1.0, -1.0, -2.0, 0.8571428656578064, 0.5, -1.0, 7.0] 

Epsiode count: 302
Epsiode goal: [53]
Epsiode state sequence length = 5.
 [[57], [55], [56], [55], [53]]
Reward list (length = 4):
 [2.0, -2.0, 3.0, 7.0] 



  2%|▌                                      | 303/20000 [00:31<34:03,  9.64it/s]

Epsiode count: 303
Epsiode goal: [49]





KeyboardInterrupt: 
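
Each record in the output above has the same four parts: an episode count, a goal, a state sequence, and a reward list. Progress-bar lines are interleaved between records and can also land inside one. The following is a minimal sketch of how such output could be parsed for later inspection, assuming the printed text has been captured as a string. The helper `parse_episodes` and the field names `return` and `reached_goal` are hypothetical and not part of the notebook. In particular, `reached_goal` only compares the last state with the goal, which is an assumption about how success might be judged.

```python
import ast
import re

# Hypothetical helper, not part of the notebook: it reads the printed
# episode records back into Python dictionaries.

# A tqdm progress line looks like "1%|...| 260/20000 [...]" once stripped.
PROGRESS = re.compile(r"^\d+%\|")

# One complete record, after blank and progress lines have been dropped.
# The label spelling matches the output exactly as printed.
RECORD = re.compile(
    r"Epsiode count: (\d+)\n"
    r"Epsiode goal: (\[.*\])\n"
    r"Epsiode state sequence length = \d+\.\n"
    r"(\[.*\])\n"
    r"Reward list \(length = \d+\):\n"
    r"(\[.*\])"
)


def parse_episodes(log_text):
    """Return one dict per complete episode record found in log_text."""
    lines = [line.strip() for line in log_text.splitlines()]
    # Drop blank lines and progress bars so that a record split by a
    # progress bar becomes contiguous again.
    lines = [line for line in lines if line and not PROGRESS.match(line)]

    records = []
    for match in RECORD.finditer("\n".join(lines)):
        count, goal, states, rewards = match.groups()
        goal = ast.literal_eval(goal)
        states = ast.literal_eval(states)
        rewards = ast.literal_eval(rewards)
        records.append({
            "count": int(count),
            "goal": goal,
            "states": states,
            "rewards": rewards,
            # Undiscounted sum of the logged rewards.
            "return": sum(rewards),
            # Assumption: success means the final state equals the goal.
            "reached_goal": states[-1] == goal,
        })
    return records
```

Incomplete records, such as one cut off by an interrupt, do not match the pattern and are skipped.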