![image.png](attachment:image.png)

Reward and Return
The reward measures how good it was to take an action from state s and reach the next state. It is the crucial component of RL because it drives what the agent learns. The reward at a specific timestep t is given below[5]:



![image.png](attachment:image.png)

This formula says that rt is the reward at timestep t for taking action at from state st and reaching the new state st+1. R denotes the reward function.

On the other hand, the return is the sum of rewards from the current state to the goal state. There are two types of return: the finite-horizon undiscounted return and the infinite-horizon discounted return[5].

Finite-horizon undiscounted return
It is the sum of rewards from the current state to the goal state over a fixed, finite number of timesteps T[5].

![image.png](attachment:image.png)

As the name suggests, this return is undiscounted: because the horizon is finite, the sum is always finite and the rewards do not need to be multiplied by a discount factor.

Infinite-horizon discounted return
It is the sum of all rewards ever obtained by the RL agent, where a discount factor determines how much weight rewards far in the future receive[5].

![image.png](attachment:image.png)

Discounting factor γ
It determines how far into the future rewards are taken into account in the return. The value of γ lies between 0 and 1. At the extremes, γ = 0 means the agent only cares about the immediate reward, and γ = 1 means all future rewards count equally[4]. In between, a larger γ gives a longer effective horizon: with γ = 0.9 the weight γ^t on a reward t steps ahead has already dropped to about 0.53 at the 6th timestep, whereas with γ = 0.99 it takes about 60 timesteps to drop to a similar level (about 0.55).

![image.png](attachment:image.png)

The discount factor is used for both an intuitive and a mathematical reason. Intuitively, a reward now is better than a reward later. Mathematically, an infinite sum of rewards may not converge to a finite value, which makes it hard to work with[5]. With a discount factor below 1 (and bounded rewards), the contribution of far-future rewards shrinks geometrically, so the return converges to a finite value.
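To make the two kinds of return concrete, here is a minimal sketch in plain Python. The function names and the constant reward sequence are illustrative assumptions, not part of any library.

```python
from typing import Sequence


def undiscounted_return(rewards: Sequence[float]) -> float:
    """Finite-horizon undiscounted return: the plain sum of rewards."""
    return float(sum(rewards))


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Discounted return: sum of gamma**t * r_t over the given rewards."""
    return float(sum((gamma ** t) * r for t, r in enumerate(rewards)))


# Hypothetical episode: a reward of 1.0 at each of 1000 timesteps.
rewards = [1.0] * 1000

print(undiscounted_return(rewards))      # grows with the horizon
print(discounted_return(rewards, 0.9))   # approaches 1 / (1 - 0.9) = 10
print(discounted_return(rewards, 0.99))  # approaches 1 / (1 - 0.99) = 100

# Weight given to a reward that arrives t steps in the future.
print(0.9 ** 6, 0.99 ** 60)
```

The discounted sums stay bounded no matter how long the episode runs, which is the convergence argument made above.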

States and Observations
The state s is a complete description of the state of the world; when the agent can see all of it, the environment is said to be fully observable. An observation o is a partial description of the state of the world, which may omit information.

Action and action spaces
The agent performs an action in the environment to move from the current state to the next state. For instance, in a navigation task, turning left or turning right is an example of an action. The set of all valid actions in a given environment is called the action space[5]. There are two types of action space: discrete and continuous. In a discrete action space, only a finite number of actions are possible, for example turning left or right. A continuous action space can have an infinite number of actions, for instance a steering angle instead of turning left or right.

Policy
A policy is a mapping from states to actions. In other words, the policy determines how the agent behaves in a given state. There are two types of policies: deterministic and stochastic.

Deterministic policy
A deterministic policy outputs a single action with probability one. For instance, in a car driving scenario, consider three actions: turn left, go straight, and turn right. An RL agent with a deterministic policy always outputs exactly one of these actions for a given state, with probability 1. That means the agent always chooses the same action in that state, without modelling any uncertainty. Deterministic policies are normally represented with the following notation:

![image.png](attachment:image.png)

Stochastic policy
A stochastic policy outputs a probability distribution over the actions for a given state. For instance, consider three actions from a state: turn left, go straight, and turn right. The output of the policy is a probability distribution over these actions, say 20% to turn left, 50% to go straight, and 30% to turn right. This type of policy is typically used in non-deterministic environments. Stochastic policies are represented with the following notation:

![image.png](attachment:image.png)
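The difference between the two policy types can be sketched in a few lines of NumPy. The action names and the 20% / 50% / 30% distribution come from the example above; the toy rule used for the deterministic policy and the state index are assumptions made only for illustration.

```python
import numpy as np

ACTIONS = ["turn left", "go straight", "turn right"]


def deterministic_policy(state: int) -> str:
    """Always returns the same action for a given state (toy rule)."""
    return ACTIONS[state % len(ACTIONS)]


def stochastic_policy(state: int, rng: np.random.Generator) -> str:
    """Samples an action from a fixed distribution (20% / 50% / 30%)."""
    probs = np.array([0.2, 0.5, 0.3])
    return ACTIONS[rng.choice(len(ACTIONS), p=probs)]


rng = np.random.default_rng(0)
state = 1  # hypothetical state index

# Same state, same action every time.
det_samples = {deterministic_policy(state) for _ in range(5)}

# Same state, but the action varies; frequencies approach the probabilities.
draws = [stochastic_policy(state, rng) for _ in range(10_000)]
freqs = {a: draws.count(a) / len(draws) for a in ACTIONS}
print(det_samples, freqs)
```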

Trajectories
A trajectory τ is a sequence of states and actions[5].

![image.png](attachment:image.png)

Value function
State-value function
The state-value Vπ(s) is the expected total reward (the expected return) when starting from state s and acting according to policy π. If the agent uses a given policy π to select actions, the corresponding value function is given by:

![image.png](attachment:image.png)

Optimal state-value function: it has the highest possible value in every state compared with the value function of any other policy.

![image.png](attachment:image.png)

If we know optimal value function, then the policy that corresponds to optimal value function is optimal policy 𝛑*.

![image.png](attachment:image.png)

Action-value function
It is the expected return for an agent starting from state s, taking an arbitrary action a, and then forever after acting according to policy 𝛑.

The optimal Q-function Q*(s, a) is the highest possible Q-value for an agent starting from state s and choosing action a. Therefore, Q*(s, a) indicates how good it is for an agent to pick action a while in state s.

Since V*(s) is the maximum expected total reward when starting from state s, it is the maximum of Q*(s, a) over all possible actions. Therefore, the relationship between Q*(s, a) and V*(s) is easily obtained as:

![image.png](attachment:image.png)

If we know the optimal Q-function Q*(s, a), the optimal policy can be easily extracted by choosing, for each state s, the action a that gives the maximum Q*(s, a).

![image.png](attachment:image.png)
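As a small worked example of these two relationships, suppose the optimal Q-values are already known and stored in a table. The numbers below are made up purely for illustration.

```python
import numpy as np

# Hypothetical optimal Q-table: 3 states (rows) x 2 actions (columns).
q_star = np.array([
    [1.0, 2.5],
    [0.3, 0.1],
    [4.0, 4.2],
])

v_star = q_star.max(axis=1)      # V*(s) = max over a of Q*(s, a)
pi_star = q_star.argmax(axis=1)  # pi*(s) = argmax over a of Q*(s, a)

print(v_star.tolist())   # [2.5, 0.3, 4.2]
print(pi_star.tolist())  # [1, 0, 1]
```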

Policy iteration and value iteration
In policy iteration, a random policy is selected initially, and its value function is computed in the evaluation step. A new policy is then derived from that value function in the improvement step. The process repeats until it finds the optimal policy. In this approach, the policy is manipulated directly.

![image.png](attachment:image.png)

In value iteration, a random value function is selected initially, and a new, improved value function is then computed from it. This process is repeated until it finds the optimal value function. The intuition is that the policy that follows the optimal value function will be the optimal policy. Here, the policy is manipulated implicitly.
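A minimal sketch of value iteration on a tiny, made-up MDP may help here. The transition tensor `P`, the reward table `R`, and the function name `value_iteration` are assumptions for illustration; unlike the model-free methods discussed later, this approach requires the transition probabilities to be known.

```python
import numpy as np

# Hypothetical MDP with 3 states and 2 actions.
# P[s, a, s2] is the probability of moving from s to s2 under action a.
P = np.zeros((3, 2, 3))
P[0, 0, 0] = 1.0  # state 0, action 0: stay in state 0
P[0, 1, 1] = 1.0  # state 0, action 1: move to state 1
P[1, 0, 0] = 1.0  # state 1, action 0: move back to state 0
P[1, 1, 2] = 1.0  # state 1, action 1: move to state 2
P[2, :, 2] = 1.0  # state 2 is absorbing

# R[s, a] is the immediate reward; only the step from state 1 to 2 pays off.
R = np.zeros((3, 2))
R[1, 1] = 1.0
gamma = 0.9


def value_iteration(P, R, gamma, tol=1e-8):
    """Repeats the Bellman optimality update until the values stop changing."""
    v = np.zeros(P.shape[0])
    while True:
        q = R + gamma * (P @ v)  # Q(s, a) = R(s, a) + gamma * sum_s2 P * V(s2)
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=1)
        v = v_new


v_opt, greedy_policy = value_iteration(P, R, gamma)
print(v_opt.tolist(), greedy_policy.tolist())
```

The greedy policy is read off from the converged values at the end, which is what "implicitly manipulated" means above.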

Policy Gradient algorithm
The policy gradient algorithm is a policy iteration approach in which the policy is directly manipulated to reach the optimal policy that maximises the expected return. This type of algorithm belongs to model-free reinforcement learning (RL). Model-free means that there is no prior knowledge of the model of the environment. In other words, we do not know the environment dynamics, i.e. the transition probability, which is written as below:



![image.png](attachment:image.png)

It can be read as the probability of reaching the next state st+1 by taking action at from the current state st. The transition probability is sometimes confused with the policy. The policy 𝜋 is a distribution over actions given states. In other words, the policy defines the behaviour of the agent.

![image.png](attachment:image.png)

Whereas, transition probability explains the dynamics of the environment which is not readily available in many practical applications.

Return and reward
We can define our return as the sum of rewards from the current state to the goal state, i.e. the sum of rewards in a trajectory (here we only consider the finite-horizon undiscounted return).

![image.png](attachment:image.png)

where τ = (s0, a0, …, sT−1, aT−1).

Objective function
In policy gradient methods, the policy is usually modelled as a function parameterized by θ, πθ(a|s). From a mathematical perspective, an objective function is something to minimise or maximise. We consider a stochastic, parameterized policy πθ and aim to maximise the expected return using the objective function J(πθ)[7].

![image.png](attachment:image.png)


Here R(st, at) is the reward obtained at timestep t by performing action at from state st. Summing R(st, at) over all timesteps of a trajectory gives the return R(τ) defined above.

We maximise the objective function J, and hence the expected return, by adjusting the policy parameter θ to get the best policy. The best policy is the one that maximises the expected return. Gradient ascent is the optimisation algorithm that iteratively searches for the parameters that maximise the objective function.

If we can find out the gradient ∇ of the objective function J, as shown below:

![image.png](attachment:image.png)

Then we can update the policy parameter θ (for simplicity, we write θ instead of πθ) using the gradient ascent rule. This moves the parameters θ in the direction of the gradient (remember that the gradient gives the direction of maximum increase, and its magnitude gives the maximum rate of change). The gradient update rule is as shown below:



![image.png](attachment:image.png)

Let’s derive the policy gradient expression
The expectation of a discrete random variable X can be defined as:

![image.png](attachment:image.png)

where x is the value of random variable X and P(x) is the probability function of x.

Now we can rewrite our gradient as below:

Here X = R(τ), each value xi is the return R(τ) of one trajectory, and P(x) = P(τ|θ).

![image.png](attachment:image.png)

We can derive this equation as follows[6][7][9]

![image.png](attachment:image.png)

The probability of a trajectory given the parameter θ, P(τ|θ), can be expanded as follows[6][7]:



![image.png](attachment:image.png)

Where p(s0) is the probability distribution of starting state and P(st+1|st, at) is the transition probability of reaching new state st+1 by performing the action at from the state st.

If we take the log-probability of the trajectory, then it can be derived as below[7]:


![image.png](attachment:image.png)

Taking the gradient of the log-probability of a trajectory then gives[6][7]:


![image.png](attachment:image.png)

We can simplify this expression as shown below. The initial-state distribution p(s0) and the transition probability model P(st+1∣st, at) do not depend on θ, so their gradients are zero and these terms disappear. This is why the model-free policy gradient algorithm does not need the transition probability model.

![image.png](attachment:image.png)

We can now go back to the expectation in our gradient and replace the gradient of the log-probability of a trajectory with the expression derived above.

![image.png](attachment:image.png)

Now the policy gradient expression is derived as

![image.png](attachment:image.png)
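For reference, the chain of steps in this derivation can be summarised in one place. This is a sketch in the notation used in the text, with τ = (s0, a0, …, sT−1, aT−1) as defined earlier and the expectation written as a sum over trajectories; the index ranges in the figures above may differ slightly.

```latex
\begin{aligned}
\log P(\tau \mid \theta)
  &= \log p(s_0) + \sum_{t=0}^{T-1} \Big[ \log P(s_{t+1} \mid s_t, a_t) + \log \pi_\theta(a_t \mid s_t) \Big] \\[4pt]
\nabla_\theta J(\pi_\theta)
  &= \nabla_\theta \, \mathbb{E}_{\tau \sim \pi_\theta}\big[ R(\tau) \big]
   = \nabla_\theta \sum_{\tau} P(\tau \mid \theta)\, R(\tau) \\
  &= \sum_{\tau} \nabla_\theta P(\tau \mid \theta)\, R(\tau) \\
  &= \sum_{\tau} P(\tau \mid \theta)\, \nabla_\theta \log P(\tau \mid \theta)\, R(\tau)
     \qquad \text{since } \nabla_\theta P = P \, \nabla_\theta \log P \\
  &= \mathbb{E}_{\tau \sim \pi_\theta}\big[ \nabla_\theta \log P(\tau \mid \theta)\, R(\tau) \big] \\
  &= \mathbb{E}_{\tau \sim \pi_\theta}\Big[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau) \Big]
\end{aligned}
```

The last step uses the fact that only the policy terms in log P(τ|θ) depend on θ.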

REINFORCE
REINFORCE is the Monte-Carlo variant of the policy gradient method. That means the RL agent samples complete trajectories from the starting state to the goal state directly from the environment, rather than bootstrapping as other methods such as Temporal Difference learning and dynamic programming do.

![image.png](attachment:image.png)

We can rewrite our policy gradient expression in the context of Monte-Carlo sampling

![image.png](attachment:image.png)

where N is the number of trajectories sampled for one gradient update[6].

Pseudocode for the REINFORCE algorithm[11]:
1. Sample N trajectories by following the policy πθ.
2. Evaluate the gradient using the below expression:

![image.png](attachment:image.png)

3. Update the policy parameters:

![image.png](attachment:image.png)

4. Repeat steps 1 to 3 until we find the optimal policy πθ.
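The loop above can be sketched on a problem small enough to run in pure NumPy. The example below is an illustration under stated assumptions, not the notebook's implementation: the environment is a hypothetical 2-armed bandit (every trajectory has a single step), the policy is a softmax over two logits, and the reward means, noise level, learning rate, and function names are all made up.

```python
import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def reinforce_bandit(num_updates=500, n_traj=16, lr=0.1, seed=0):
    """REINFORCE on a hypothetical 2-armed bandit (one-step episodes)."""
    rng = np.random.default_rng(seed)
    mean_reward = np.array([0.2, 1.0])  # arm 1 is better (assumed values)
    theta = np.zeros(2)                 # policy parameters (logits)
    for _ in range(num_updates):
        probs = softmax(theta)
        grad = np.zeros_like(theta)
        # Step 1: sample N trajectories (here, one action each).
        for _ in range(n_traj):
            a = rng.choice(2, p=probs)
            ret = mean_reward[a] + 0.1 * rng.standard_normal()
            # Step 2: for a softmax policy, the gradient of log pi(a)
            # with respect to the logits is onehot(a) - probs.
            grad_log_pi = -probs.copy()
            grad_log_pi[a] += 1.0
            grad += grad_log_pi * ret
        # Step 3: gradient ascent on the averaged estimate.
        theta += lr * grad / n_traj
    return softmax(theta)


final_probs = reinforce_bandit()
print(final_probs)  # probability mass should concentrate on arm 1
```

No baseline is subtracted from the return here, so the estimate is noisier than it needs to be; the Actor-Critic section later in the notebook addresses this.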
 

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

Actor-Critic methods

Actor-Critic methods are temporal difference (TD) learning methods that represent the policy function independent of the value function.

A policy function (or policy) returns a probability distribution over actions that the agent can take based on the given state. A value function determines the expected return for an agent starting at a given state and acting according to a particular policy forever after.

In the Actor-Critic method, the policy is referred to as the actor that proposes a set of possible actions given a state, and the estimated value function is referred to as the critic, which evaluates actions taken by the actor based on the given policy.

In this tutorial, both the Actor and Critic will be represented using one neural network with two outputs.

CartPole-v0

In the CartPole-v0 environment, a pole is attached to a cart moving along a frictionless track. The pole starts upright and the goal of the agent is to prevent it from falling over by applying a force of -1 or +1 to the cart. A reward of +1 is given for every time step the pole remains upright. An episode ends when: 1) the pole is more than 12 degrees from vertical; or 2) the cart moves more than 2.4 units from the center.

![image.png](attachment:image.png)

This environment is part of the Classic Control environments. Please read that page first for general information.

| Property | Value |
|---|---|
| Action Space | Discrete(2) |
| Observation Shape | (4,) |
| Observation High | [4.8 inf 0.42 inf] |
| Observation Low | [-4.8 -inf -0.42 -inf] |
| Import | gym.make("CartPole-v1") |

Description
This environment corresponds to the version of the cart-pole problem described by Barto, Sutton, and Anderson in “Neuronlike Adaptive Elements That Can Solve Difficult Learning Control Problem”. A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The pendulum is placed upright on the cart and the goal is to balance the pole by applying forces in the left and right direction on the cart.

Action Space
The action is a ndarray with shape (1,) which can take values {0, 1} indicating the direction of the fixed force the cart is pushed with.

| Num | Action |
|---|---|
| 0 | Push cart to the left |
| 1 | Push cart to the right |

Note: The velocity that is reduced or increased by the applied force is not fixed and it depends on the angle the pole is pointing. The center of gravity of the pole varies the amount of energy needed to move the cart underneath it

Observation Space
The observation is a ndarray with shape (4,) with the values corresponding to the following positions and velocities:

| Num | Observation | Min | Max |
|---|---|---|---|
| 0 | Cart Position | -4.8 | 4.8 |
| 1 | Cart Velocity | -Inf | Inf |
| 2 | Pole Angle | ~ -0.418 rad (-24°) | ~ 0.418 rad (24°) |
| 3 | Pole Angular Velocity | -Inf | Inf |

Note: While the ranges above denote the possible values for observation space of each element, it is not reflective of the allowed values of the state space in an unterminated episode. Particularly:

The cart x-position (index 0) can take values between (-4.8, 4.8), but the episode terminates if the cart leaves the (-2.4, 2.4) range.

The pole angle can be observed between (-.418, .418) radians (or ±24°), but the episode terminates if the pole angle is not in the range (-.2095, .2095) (or ±12°)

Rewards
Since the goal is to keep the pole upright for as long as possible, a reward of +1 for every step taken, including the termination step, is allotted. The threshold for rewards is 475 for v1.

Starting State
All observations are assigned a uniformly random value in (-0.05, 0.05)

Episode End
The episode ends if any one of the following occurs:

Termination: Pole Angle is greater than ±12°

Termination: Cart Position is greater than ±2.4 (center of the cart reaches the edge of the display)

Truncation: Episode length is greater than 500 (200 for v0)
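The two termination conditions can be written as a small predicate. This is only a sketch of the rule as described in the text above, with hypothetical names; it is not the environment's own code, and it leaves out the step-limit truncation.

```python
import math

X_THRESHOLD = 2.4                     # cart position limit, from the text
THETA_THRESHOLD = 12 * math.pi / 180  # 12 degrees in radians (about 0.2095)


def is_terminated(cart_position: float, pole_angle: float) -> bool:
    """True if either termination condition described above is met."""
    return (abs(cart_position) > X_THRESHOLD
            or abs(pole_angle) > THETA_THRESHOLD)


print(is_terminated(0.0, 0.0))   # False: upright pole, centred cart
print(is_terminated(2.5, 0.0))   # True: cart left the allowed range
print(is_terminated(0.0, 0.21))  # True: pole tilted past 12 degrees
```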

Arguments
gym.make('CartPole-v1')
No additional arguments are currently supported.

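For CartPole-v0, the problem is considered "solved" when the average total reward per episode reaches 195 over 100 consecutive trials. For CartPole-v1, which the code below creates, the corresponding reward threshold is 475, as noted above.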

In [1]:

import collections
import gym
import numpy as np
import statistics
import tensorflow as tf
import tqdm

from matplotlib import pyplot as plt
from tensorflow.keras import layers
from typing import Any, List, Sequence, Tuple


# Create the environment
env = gym.make("CartPole-v1")

# Set seed for experiment reproducibility
seed = 42
tf.random.set_seed(seed)
np.random.seed(seed)

# Small epsilon value for stabilizing division operations
eps = np.finfo(np.float32).eps.item()

In [2]:
env

<TimeLimit<OrderEnforcing<PassiveEnvChecker<CartPoleEnv<CartPole-v1>>>>>

seed = 42: Assigns a fixed seed value. (You can use any number; 42 is commonly used in examples.)
tf.random.set_seed(seed): Sets the seed for TensorFlow's random operations.
np.random.seed(seed): Sets the seed for NumPy's random operations.

np.finfo(np.float32): Retrieves information about the floating-point type float32.
eps: A property of np.finfo that gives the machine epsilon, i.e. the gap between 1.0 and the next larger number representable in float32, so that 1.0 + eps ≠ 1.0.
In other words, eps is the smallest spacing the data type can resolve around 1.0.
.item(): Converts the value to a Python scalar.
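A quick check of what eps means in practice, and of why it is useful as a guard in divisions. The variable names are illustrative only.

```python
import numpy as np

eps32 = np.finfo(np.float32).eps
one = np.float32(1.0)

print(one + eps32 != one)                  # True: 1.0 + eps is a distinct float32
print(one + eps32 / np.float32(2) == one)  # True: half of eps rounds back to 1.0

# Typical use: guard a division against a zero denominator.
std = np.float32(0.0)
safe = np.float32(3.0) / (std + eps32)
print(np.isfinite(safe))
```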

The model
The Actor and Critic will be modeled using one neural network that generates the action probabilities and Critic value respectively. This tutorial uses model subclassing to define the model.

During the forward pass, the model will take in the state as the input and will output both action probabilities and critic value V, which models the state-dependent value function. The goal is to train a model that chooses actions based on a policy π that maximizes expected return.

For CartPole, there are four values representing the state: cart position, cart velocity, pole angle and pole angular velocity, respectively. The agent can take two actions to push the cart left (0) and right (1), respectively.

In [3]:
class ActorCritic(tf.keras.Model):
  """Combined actor-critic network."""

  def __init__(
      self,
      num_actions: int,
      num_hidden_units: int):
    """Initialize."""
    super().__init__()

    self.common = layers.Dense(num_hidden_units, activation="relu")
    self.actor = layers.Dense(num_actions)
    self.critic = layers.Dense(1)

  def call(self, inputs: tf.Tensor) -> Tuple[tf.Tensor, tf.Tensor]:
    x = self.common(inputs)
    return self.actor(x), self.critic(x)

num_actions:
The number of possible actions in the environment. This determines the size of the output for the actor head.

num_hidden_units:
The number of units in the shared hidden layer.

Components of the Network:

self.common:

A shared dense layer (fully connected layer) with num_hidden_units neurons and ReLU activation.
It processes the input state and learns features useful for both actor and critic.
self.actor:

Outputs the action logits (unnormalized log-probabilities) for each possible action; a softmax turns them into action probabilities.
num_actions neurons correspond to the number of actions.
self.critic:

Outputs a single scalar value that represents the state value (a prediction of how good the current state is).
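To see the shapes involved, the same shared-layer, two-head structure can be mimicked in plain NumPy. This is only a sketch: the weights are random stand-ins for trained parameters, biases are omitted for brevity, and the names `W_common`, `W_actor`, `W_critic`, and `forward` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, hidden, n_actions = 4, 128, 2  # sizes used for CartPole here

# Randomly initialised weights stand in for trained parameters.
W_common = rng.standard_normal((state_dim, hidden)) * 0.1
W_actor = rng.standard_normal((hidden, n_actions)) * 0.1
W_critic = rng.standard_normal((hidden, 1)) * 0.1


def forward(states: np.ndarray):
    """Shared ReLU layer followed by an actor head and a critic head."""
    x = np.maximum(states @ W_common, 0.0)  # shared features
    return x @ W_actor, x @ W_critic        # action logits, state value


def softmax(logits: np.ndarray) -> np.ndarray:
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


batch = rng.uniform(-0.05, 0.05, size=(1, state_dim))  # one state, batch of 1
logits, value = forward(batch)
probs = softmax(logits)
print(logits.shape, value.shape, probs.sum())
```

Both heads read the same hidden features, which is the design choice behind using one network with two outputs.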


In [4]:
num_actions = env.action_space.n  # 2
num_hidden_units = 128

model = ActorCritic(num_actions, num_hidden_units)

In [5]:
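# Build the model on a dummy batch first; a subclassed model has no weights
# to summarize until it has been called once.
model(tf.zeros((1, env.observation_space.shape[0])))
model.summary()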

Train the agent
To train the agent, you will follow these steps:

1. Run the agent on the environment to collect training data per episode.
2. Compute expected return at each time step.
3. Compute the loss for the combined Actor-Critic model.
4. Compute gradients and update network parameters.
5. Repeat 1-4 until either success criterion or max episodes has been reached.

Collect training data
As in supervised learning, in order to train the actor-critic model, you need to have training data. However, in order to collect such data, the model would need to be "run" in the environment.

Training data is collected for each episode. Then at each time step, the model's forward pass will be run on the environment's state in order to generate action probabilities and the critic value based on the current policy parameterized by the model's weights.

The next action will be sampled from the action probabilities generated by the model, which would then be applied to the environment, causing the next state and reward to be generated.

This process is implemented in the run_episode function, which uses TensorFlow operations so that it can later be compiled into a TensorFlow graph for faster training. Note that tf.TensorArrays were used to support Tensor iteration on variable length arrays.

In [6]:
# Wrap Gym's `env.step` call as an operation in a TensorFlow function.
# This would allow it to be included in a callable TensorFlow graph.

@tf.numpy_function(Tout=[tf.float32, tf.int32, tf.int32])
def env_step(action: np.ndarray) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
  """Returns state, reward and done flag given an action."""
  print("enter env_step")
  state, reward, done, truncated, info = env.step(action)
  return (state.astype(np.float32),
          np.array(reward, np.int32),
          np.array(done, np.int32))

This code defines a function env_step that interacts with a reinforcement learning (RL) environment and wraps it in TensorFlow's @tf.numpy_function decorator to allow it to be used as part of a TensorFlow computational graph. Let’s break it down step by step:

action: A NumPy array representing the action selected by the agent.
This is usually an integer or a vector of floats, depending on the RL environment.
Returns:
state: The new state after the action is performed.
reward: The scalar reward for the action.
done: A flag (1 or 0) indicating whether the episode has ended.

A TensorArray is a TensorFlow data structure designed to hold multiple tensors of the same data type and shape. It is useful for handling variable-length data or when you need dynamic-sized arrays during computations (e.g., in loops or sequences).
Unlike regular tensors, a TensorArray allows for dynamic growth, making it ideal for applications like storing intermediate results in training or rollout steps.


dtype=tf.float32:

Specifies that the data type of the tensors stored in this TensorArray will be tf.float32. This is typical when working with neural network computations, probabilities, or gradients.
size=0:

Indicates the initial size of the TensorArray is zero. Since the size is dynamic, you can append or write tensors to it without being constrained by an initial fixed size.
dynamic_size=True:

Allows the TensorArray to grow dynamically as more tensors are written to it. This is important for tasks where the number of elements is not known beforehand (e.g., episodes of different lengths in reinforcement learning).

In [7]:
def run_episode(
    initial_state: tf.Tensor,
    model: tf.keras.Model,
    max_steps: int) -> Tuple[tf.Tensor, tf.Tensor, tf.Tensor]:
  """Runs a single episode to collect training data."""
  print("enter run_episode")
  action_probs = tf.TensorArray(dtype=tf.float32, size=0, dynamic_size=True)
  values = tf.TensorArray(dtype=tf.float32, size=0, dynamic_size=True)
  rewards = tf.TensorArray(dtype=tf.int32, size=0, dynamic_size=True)

  initial_state_shape = initial_state.shape
  state = initial_state
  print("Initial state:",state)

  for t in tf.range(max_steps):
    # Convert state into a batched tensor (batch size = 1)
    state = tf.expand_dims(state, 0)
    print("State:",state)

    # Run the model to get action logits and critic value
    action_logits_t, value = model(state)
    print("Action logits:",action_logits_t)
    print("Value:",value)
    model.summary()

    # Sample next action from the action probability distribution
    action = tf.random.categorical(action_logits_t, 1)[0, 0]
    print("Action:",action)
    action_probs_t = tf.nn.softmax(action_logits_t)
    print("Action probs:",action_probs_t)


    # Store critic values
    values = values.write(t, tf.squeeze(value))
    print("Values:",values)

    # Store probability of the action chosen (the log is taken in the loss)
    action_probs = action_probs.write(t, action_probs_t[0, action])
    print("Action probs:", action_probs)

    # Apply action to the environment to get next state and reward
    state, reward, done = env_step(action)
    print(f"state: {state} \n reward: {reward} \n done: {done}")
    state.set_shape(initial_state_shape)

    # Store reward
    rewards = rewards.write(t, reward)
    print("Rewards:",rewards)

    if tf.cast(done, tf.bool):
      break

  action_probs = action_probs.stack()
  print("Action probs:", action_probs)
  values = values.stack()
  print("Values:", values)
  rewards = rewards.stack()
  print("Rewards:", rewards)

  return action_probs, values, rewards

Compute the expected returns
The sequence of rewards for each timestep $t$, $\{r_t\}_{t=1}^{T}$, collected during one episode is converted into a sequence of expected returns $\{G_t\}_{t=1}^{T}$ in which the sum of rewards is taken from the current timestep $t$ to $T$ and each reward is multiplied with an exponentially decaying discount factor $\gamma$:

$$G_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$$

Since $\gamma \in (0, 1)$, rewards further out from the current timestep are given less weight.

Intuitively, expected return simply implies that rewards now are better than rewards later. In a mathematical sense, it is to ensure that the sum of the rewards converges.

To stabilize training, the resulting sequence of returns is also standardized (i.e. to have zero mean and unit standard deviation).
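The same computation can be sketched in plain NumPy, which makes the back-to-front accumulation easy to follow. The function name `discounted_returns` and its default `eps` are assumptions for this illustration; the TensorFlow version used for training follows in the next cell.

```python
import numpy as np


def discounted_returns(rewards, gamma, standardize=True, eps=1e-8):
    """G_t = sum over t' >= t of gamma**(t' - t) * r_t', built back to front."""
    rewards = np.asarray(rewards, dtype=np.float64)
    returns = np.zeros_like(rewards)
    running = 0.0
    for i in range(len(rewards) - 1, -1, -1):
        running = rewards[i] + gamma * running
        returns[i] = running
    if standardize:
        returns = (returns - returns.mean()) / (returns.std() + eps)
    return returns


raw = discounted_returns([1, 1, 1], gamma=0.5, standardize=False)
print(raw.tolist())  # [1.75, 1.5, 1.0]

scaled = discounted_returns([1, 1, 1], gamma=0.5)
print(scaled.mean(), scaled.std())  # close to 0 and 1
```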

In [8]:
def get_expected_return(
    rewards: tf.Tensor,
    gamma: float,
    standardize: bool = True) -> tf.Tensor:
  """Compute expected returns per timestep."""
  print("enter get_expected_return")
  n = tf.shape(rewards)[0]
  print("n",n)
  returns = tf.TensorArray(dtype=tf.float32, size=n)
  print("returns", returns)

  # Start from the end of `rewards` and accumulate reward sums
  # into the `returns` array
  rewards = tf.cast(rewards[::-1], dtype=tf.float32)
  print("rewards", rewards)
  discounted_sum = tf.constant(0.0)
  print("discounted_sum", discounted_sum)
  discounted_sum_shape = discounted_sum.shape
  print("discounted_sum_shape", discounted_sum_shape)
  for i in tf.range(n):
    reward = rewards[i]
    discounted_sum = reward + gamma * discounted_sum
    discounted_sum.set_shape(discounted_sum_shape)
    returns = returns.write(i, discounted_sum)
  returns = returns.stack()[::-1]
  print("Returns", returns)

  if standardize:
    returns = ((returns - tf.math.reduce_mean(returns)) /
               (tf.math.reduce_std(returns) + eps))
    

  return returns

The Actor-Critic loss
Since you're using a hybrid Actor-Critic model, the chosen loss function is a combination of Actor and Critic losses for training, as shown below:

$$L = L_{actor} + L_{critic}$$


The Actor loss
The Actor loss is based on policy gradients with the Critic as a state dependent baseline and computed with single-sample (per-episode) estimates.

$$L_{actor} = -\sum_{t=1}^{T} \log \pi_{\theta}(a_t | s_t)\left[G(s_t, a_t) - V^{\pi}_{\theta}(s_t)\right]$$

where:

$T$: the number of timesteps per episode, which can vary per episode
$s_t$: the state at timestep $t$
$a_t$: the chosen action at timestep $t$ given state $s_t$
$\pi_{\theta}$: the policy (Actor) parameterized by $\theta$
$V^{\pi}_{\theta}$: the value function (Critic), also parameterized by $\theta$
$G = G_t$: the expected return for a given state-action pair at timestep $t$

A negative term is added to the sum since the idea is to maximize the probabilities of actions yielding higher rewards by minimizing the combined loss.


The Advantage
The <math xmlns="http://www.w3.org/1998/Math/MathML">
  <mi>G</mi>
  <mo>&#x2212;<!-- − --></mo>
  <mi>V</mi>
</math>
 term in our L actor
 formulation is called the Advantage, which indicates how much better an action is given a particular state over a random action selected according to the policy  pi
 for that state.

While it's possible to exclude a baseline, this may result in high variance during training. And the nice thing about choosing the critic V
 as a baseline is that it trained to be as close as possible to G
, leading to a lower variance.

In addition, without the Critic, the algorithm would try to increase probabilities for actions taken on a particular state based on expected return, which may not make much of a difference if the relative probabilities between actions remain the same.

For instance, suppose that two actions for a given state would yield the same expected return. Without the Critic, the algorithm would try to raise the probability of these actions based on the objective. With the Critic, it may turn out that there's no Advantage (G − V = 0), and thus no benefit gained in increasing the actions' probabilities, so the algorithm would set the gradients to zero.
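The two-action example can be sketched in a few lines of plain Python. The helper `action_weights` is a hypothetical name introduced here: it returns the factor that multiplies each action's log-probability gradient, which is G alone without a baseline and G − V with the Critic as baseline. The numbers are invented for illustration.

```python
def action_weights(returns, values=None):
    """Weight on each log-probability gradient: G without a baseline, G - V with one."""
    if values is None:
        return list(returns)
    return [g - v for g, v in zip(returns, values)]

# Two actions from the same state with the same expected return (made-up numbers).
returns = [10.0, 10.0]
critic = [10.0, 10.0]   # a critic that has learned this state's value

no_baseline = action_weights(returns)            # [10.0, 10.0]
with_baseline = action_weights(returns, critic)  # [0.0, 0.0]
print(no_baseline, with_baseline)
```

Without the baseline both actions get the same large positive weight, so both probabilities are pushed up even though neither is preferable; with the baseline the weights vanish and no update is made.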

The Critic loss
Training V to be as close as possible to G can be set up as a regression problem with the following loss function:
<math xmlns="http://www.w3.org/1998/Math/MathML" display="block">
  <msub>
    <mi>L</mi>
    <mrow class="MJX-TeXAtom-ORD">
      <mi>c</mi>
      <mi>r</mi>
      <mi>i</mi>
      <mi>t</mi>
      <mi>i</mi>
      <mi>c</mi>
    </mrow>
  </msub>
  <mo>=</mo>
  <msub>
    <mi>L</mi>
    <mrow class="MJX-TeXAtom-ORD">
      <mi>&#x03B4;<!-- δ --></mi>
    </mrow>
  </msub>
  <mo stretchy="false">(</mo>
  <mi>G</mi>
  <mo>,</mo>
  <msubsup>
    <mi>V</mi>
    <mrow class="MJX-TeXAtom-ORD">
      <mi>&#x03B8;<!-- θ --></mi>
    </mrow>
    <mrow class="MJX-TeXAtom-ORD">
      <mi>&#x03C0;<!-- π --></mi>
    </mrow>
  </msubsup>
  <mo stretchy="false">)</mo>
</math>



where <math xmlns="http://www.w3.org/1998/Math/MathML">
  <msub>
    <mi>L</mi>
    <mrow class="MJX-TeXAtom-ORD">
      <mi>&#x03B4;<!-- δ --></mi>
    </mrow>
  </msub>
</math>
 is the Huber loss, which is less sensitive to outliers in data than squared-error loss.
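As a rough illustration of why the Huber loss is less sensitive to outliers, the sketch below compares it with the squared error on a small and a large residual. The functions `huber` and `squared_error` are written here for illustration and are not the Keras implementation; `delta=1.0` is an assumed threshold.

```python
def huber(error, delta=1.0):
    """Quadratic for small residuals, linear once |error| exceeds delta."""
    abs_err = abs(error)
    if abs_err <= delta:
        return 0.5 * error ** 2
    return delta * (abs_err - 0.5 * delta)

def squared_error(error):
    """Plain squared error, which grows quadratically everywhere."""
    return error ** 2

for err in (0.5, 10.0):
    print(err, huber(err), squared_error(err))
# 0.5  -> Huber 0.125, squared error 0.25
# 10.0 -> Huber 9.5,   squared error 100.0
```

For the large residual the Huber loss grows only linearly, so a single badly estimated return has much less influence on the critic's gradient than it would under squared error.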

In [9]:
huber_loss = tf.keras.losses.Huber(reduction=tf.keras.losses.Reduction.SUM)

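def compute_loss(
    action_probs: tf.Tensor,
    values: tf.Tensor,
    returns: tf.Tensor) -> tf.Tensor:
  """Computes the combined Actor-Critic loss."""
  # Note: when called from inside a tf.function, these print() calls run only
  # while the graph is being traced, so they show symbolic tensors.
  print("enter compute loss")
  # Advantage: how much better the return was than the critic's estimate
  advantage = returns - values
  print("advantage", advantage)

  # Actor loss: negative log-probability of each chosen action,
  # weighted by its advantage
  action_log_probs = tf.math.log(action_probs)
  print("action_log_probs", action_log_probs)
  actor_loss = -tf.math.reduce_sum(action_log_probs * advantage)
  print("actor_loss", actor_loss)

  # Critic loss: Huber regression between value estimates and returns
  critic_loss = huber_loss(values, returns)
  print("critic_loss", critic_loss)

  return actor_loss + critic_loss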

Define the training step to update parameters
All of the steps above are combined into a training step that is run every episode. All steps leading up to the loss function are executed with the tf.GradientTape context to enable automatic differentiation.

This tutorial uses the Adam optimizer to apply the gradients to the model parameters.

The sum of the undiscounted rewards, episode_reward, is also computed in this step. This value will be used later on to evaluate if the success criterion is met.
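The difference between the undiscounted episode reward and the discounted returns used in the loss can be seen in a small plain-Python sketch. The function `discounted_returns` is a simplified stand-in written for this example; it is not the notebook's `get_expected_return`, which is defined elsewhere and may apply further processing.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every timestep of one episode."""
    returns = []
    running = 0.0
    # Accumulate from the last timestep backwards
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns

rewards = [1, 1, 1]                          # one reward per step survived
episode_reward = sum(rewards)                # undiscounted sum: 3
returns = discounted_returns(rewards, 0.5)   # [1.75, 1.5, 1.0]
print(episode_reward, returns)
```

The undiscounted sum is a single number per episode and is what the success criterion looks at, while the discounted returns give one training target per timestep.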

The tf.function decorator is applied to the train_step function so that it can be compiled into a callable TensorFlow graph, which can lead to a 10x speedup in training.

In [10]:
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)


@tf.function
def train_step(
    initial_state: tf.Tensor,
    model: tf.keras.Model,
    optimizer: tf.keras.optimizers.Optimizer,
    gamma: float,
    max_steps_per_episode: int) -> tf.Tensor:
  """Runs a model training step."""
  print("enter training step")
  with tf.GradientTape() as tape:

    # Run the model for one episode to collect training data
    action_probs, values, rewards = run_episode(
        initial_state, model, max_steps_per_episode)
    print("action_probs: ", action_probs)
    print("values: ", values)
    print("rewards: ", rewards)
    # Calculate the expected returns
    returns = get_expected_return(rewards, gamma)
    print("returns: ", returns)

    # Convert training data to appropriate TF tensor shapes
    action_probs, values, returns = [
        tf.expand_dims(x, 1) for x in [action_probs, values, returns]]
    print("action_probs: ", action_probs)
    print("values: ", values)
    print("returns: ", returns)

    # Calculate the loss values to update our network
    loss = compute_loss(action_probs, values, returns)
    print("loss: ", loss)

  # Compute the gradients from the loss
  grads = tape.gradient(loss, model.trainable_variables)
  print("grads: ", grads)

  # Apply the gradients to the model's parameters
  optimizer.apply_gradients(zip(grads, model.trainable_variables))
  print("optimizer: ", optimizer)

  # Undiscounted sum of rewards, used later to check the success criterion
  episode_reward = tf.math.reduce_sum(rewards)

  return episode_reward

Run the training loop
Training is executed by running the training step until either the success criterion or maximum number of episodes is reached.

A running record of episode rewards is kept in a deque with a maximum length of 100. Once 100 episodes have been recorded, appending the newest reward at the right end automatically discards the oldest one from the left end. The running reward is the mean of the rewards currently in the deque, recomputed after every episode.

Depending on your runtime, training can finish in less than a minute.

A deque is a data structure that acts like a list but is optimized for appending and popping items from both ends with O(1) time complexity.
It is particularly useful when you need to maintain a sliding window or a fixed-size list of recent elements.
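The sliding-window behaviour can be checked with a short standard-library example. The window size of 3 and the reward values are arbitrary choices for illustration; the training loop uses a window of 100.

```python
import collections
import statistics

# Keep only the 3 most recent rewards (illustrative window size)
window = collections.deque(maxlen=3)

for reward in [10, 20, 30, 40]:
    # Once the deque is full, append() drops the oldest item from the left
    window.append(reward)

running_reward = statistics.mean(window)
print(list(window), running_reward)  # [20, 30, 40] 30
```

Because `maxlen` is set, no explicit `popleft()` call is needed to keep the window at a fixed size.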

tqdm is a Python library for creating progress bars in loops.
It helps visualize the progress of time-consuming operations, like training a machine learning model or iterating over a large dataset.

trange():

tqdm.trange() is a shorthand for combining Python's built-in range() function with tqdm's progress bar.
It returns an iterator that behaves like range() but displays a progress bar when used in a loop.

Episode 5:  50%|█████     | 5/10 [00:02<00:02,  2.00it/s]


In [None]:
%%time

min_episodes_criterion = 100
max_episodes = 10000
max_steps_per_episode = 500

# `CartPole-v1` is considered solved if average reward is >= 475 over 100
# consecutive trials
reward_threshold = 475
running_reward = 0

# The discount factor for future rewards
gamma = 0.99

# Keep the last episodes reward
episodes_reward: collections.deque = collections.deque(maxlen=min_episodes_criterion)

t = tqdm.trange(max_episodes)

print("t", t)
for i in t:
    initial_state, info = env.reset()
    print("initial_state", initial_state)
    print("info", info)
    initial_state = tf.constant(initial_state, dtype=tf.float32)
    print("initial_state", initial_state)
    episode_reward = int(train_step(
        initial_state, model, optimizer, gamma, max_steps_per_episode))
    print("episode_reward", episode_reward)

    episodes_reward.append(episode_reward)
    print("episodes_reward", episodes_reward)

    # Mean reward over the most recent episodes (at most 100)
    running_reward = statistics.mean(episodes_reward)

    t.set_postfix(
        episode_reward=episode_reward, running_reward=running_reward)

    # Show the average episode reward every 10 episodes
    if i % 10 == 0:
        pass  # print(f'Episode {i}: average reward: {running_reward}')

    if running_reward > reward_threshold and i >= min_episodes_criterion:
        print(f'\nSolved at episode {i}: average reward: {running_reward:.2f}!')
        break
else:
    # The loop ran out of episodes without meeting the success criterion
    print(f'\nNot solved after {max_episodes} episodes: '
          f'average reward: {running_reward:.2f}')

  0%|          | 0/10000 [00:00<?, ?it/s]

t   0%|          | 0/10000 [00:00<?, ?it/s]
initial_state [-0.02267222 -0.03225593  0.00733247  0.02886541]
info {}
initial_state tf.Tensor([-0.02267222 -0.03225593  0.00733247  0.02886541], shape=(4,), dtype=float32)
enter training step
enter run_episode
Initial state: Tensor("initial_state:0", shape=(4,), dtype=float32)
State: Tensor("while/ExpandDims:0", shape=(1, 4), dtype=float32)
Action logits: Tensor("while/actor_critic_1/dense_1_2/Add:0", shape=(1, 2), dtype=float32)
Value: Tensor("while/actor_critic_1/dense_2_1/Add:0", shape=(1, 1), dtype=float32)


Action: Tensor("while/strided_slice:0", shape=(), dtype=int64)
Action probs: Tensor("while/Softmax:0", shape=(1, 2), dtype=float32)
Values: <tensorflow.python.ops.tensor_array_ops.TensorArray object at 0x00000129DB6F0DD0>
Action probs: <tensorflow.python.ops.tensor_array_ops.TensorArray object at 0x00000129DB732950>
state: Tensor("while/PyFunc:0", dtype=float32, device=/job:localhost/replica:0/task:0) 
 reward: Tensor("while/PyFunc:1", dtype=int32, device=/job:localhost/replica:0/task:0) 
 done: Tensor("while/PyFunc:2", dtype=int32, device=/job:localhost/replica:0/task:0)
Rewards: <tensorflow.python.ops.tensor_array_ops.TensorArray object at 0x00000129DA19C890>
Action probs: Tensor("TensorArrayV2Stack/TensorListStack:0", shape=(None,), dtype=float32)
Values: Tensor("TensorArrayV2Stack_1/TensorListStack:0", shape=(None,), dtype=float32)
Rewards: Tensor("TensorArrayV2Stack_2/TensorListStack:0", dtype=int32)
action_probs:  Tensor("TensorArrayV2Stack/TensorListStack:0", shape=(None,), dtyp

Action: Tensor("while/strided_slice:0", shape=(), dtype=int64)
Action probs: Tensor("while/Softmax:0", shape=(1, 2), dtype=float32)
Values: <tensorflow.python.ops.tensor_array_ops.TensorArray object at 0x00000129DB70C050>
Action probs: <tensorflow.python.ops.tensor_array_ops.TensorArray object at 0x00000129DBCF7B90>
state: Tensor("while/PyFunc:0", dtype=float32, device=/job:localhost/replica:0/task:0) 
 reward: Tensor("while/PyFunc:1", dtype=int32, device=/job:localhost/replica:0/task:0) 
 done: Tensor("while/PyFunc:2", dtype=int32, device=/job:localhost/replica:0/task:0)
Rewards: <tensorflow.python.ops.tensor_array_ops.TensorArray object at 0x00000129DB5940D0>
Action probs: Tensor("TensorArrayV2Stack/TensorListStack:0", shape=(None,), dtype=float32)
Values: Tensor("TensorArrayV2Stack_1/TensorListStack:0", shape=(None,), dtype=float32)
Rewards: Tensor("TensorArrayV2Stack_2/TensorListStack:0", dtype=int32)
action_probs:  Tensor("TensorArrayV2Stack/TensorListStack:0", shape=(None,), dtyp

  if not isinstance(terminated, (bool, np.bool8)):
  0%|          | 5/10000 [00:07<3:15:58,  1.18s/it, episode_reward=118, running_reward=35.9]

episode_reward 34
episodes_reward deque([34], maxlen=100)
initial_state [-0.04440443 -0.01816463 -0.03256057 -0.00164035]
info {}
initial_state tf.Tensor([-0.04440443 -0.01816463 -0.03256057 -0.00164035], shape=(4,), dtype=float32)
episode_reward 23
episodes_reward deque([34, 23], maxlen=100)
initial_state [ 0.0468011  -0.00259242 -0.03016618 -0.00077777]
info {}
initial_state tf.Tensor([ 0.0468011  -0.00259242 -0.03016618 -0.00077777], shape=(4,), dtype=float32)

  0%|          | 15/10000 [00:08<44:44,  3.72it/s, episode_reward=30, running_reward=39.8]  

initial_state [-0.00581337  0.04923714 -0.04610572 -0.04571027]
info {}
initial_state tf.Tensor([-0.00581337  0.04923714 -0.04610572 -0.04571027], shape=(4,), dtype=float32)

  0%|          | 20/10000 [00:08<28:35,  5.82it/s, episode_reward=56, running_reward=45.2] 

initial_state [ 0.04347823 -0.00895929 -0.01116709  0.03316412]
info {}
initial_state tf.Tensor([ 0.04347823 -0.00895929 -0.01116709  0.03316412], shape=(4,), dtype=float32)
episode_reward 42
episodes_reward deque([34, 23, 34, 23, 26, 19, 16, 30, 118, 68, 41, 48, 53, 49, 22, 42, 30, 42], maxlen=100)
initial_state [ 0.00330263  0.03530178 -0.02277077 -0.01932517]
info 

  0%|          | 28/10000 [00:08<16:58,  9.79it/s, episode_reward=85, running_reward=53.2]

episode_reward 83
episodes_reward deque([34, 23, 34, 23, 26, 19, 16, 30, 118, 68, 41, 48, 53, 49, 22, 42, 30, 42, 43, 57, 111, 55, 56, 83], maxlen=100)
initial_state [-0.03837399 -0.00480212  0.01257677 -0.02500296]
info {}
initial_state tf.Tensor([-0.03837399 -0.00480212  0.01257677 -0.02500296], shape=(4,), dtype=float32)

  0%|          | 28/10000 [00:08<16:58,  9.79it/s, episode_reward=116, running_reward=58.4]

episode_reward 110
episodes_reward deque([34, 23, 34, 23, 26, 19, 16, 30, 118, 68, 41, 48, 53, 49, 22, 42, 30, 42, 43, 57, 111, 55, 56, 83, 67, 39, 175, 85, 110], maxlen=100)
initial_state [-0.00734438 -0.04689434 -0.04272613 -0.035665  ]
info {}
initial_state tf.Tensor([-0.00734438 -0.04689434 -0.04272613 -0.035665  ], shape=(4,), dtype=float32)

  0%|          | 32/10000 [00:08<15:07, 10.99it/s, episode_reward=104, running_reward=73.6]

episode_reward 228
episodes_reward deque([34, 23, 34, 23, 26, 19, 16, 30, 118, 68, 41, 48, 53, 49, 22, 42, 30, 42, 43, 57, 111, 55, 56, 83, 67, 39, 175, 85, 110, 95, 116, 228], maxlen=100)
initial_state [-0.04362499  0.04585437  0.00498482  0.02720767]
info {}
initial_state tf.Tensor([-0.04362499  0.04585437  0.00498482  0.02720767], shape=(4,), dtype=float32)

  0%|          | 38/10000 [00:09<12:35, 13.18it/s, episode_reward=127, running_reward=77]  

initial_state [ 0.00432146  0.04000869  0.01663728 -0.02051998]
info {}
initial_state tf.Tensor([ 0.00432146  0.04000869  0.01663728 -0.02051998], shape=(4,), dtype=float32)

  0%|          | 41/10000 [00:09<10:59, 15.10it/s, episode_reward=24, running_reward=76.4] 

episode_reward 116
episodes_reward deque([34, 23, 34, 23, 26, 19, 16, 30, 118, 68, 41, 48, 53, 49, 22, 42, 30, 42, 43, 57, 111, 55, 56, 83, 67, 39, 175, 85, 110, 95, 116, 228, 361, 104, 133, 46, 117, 127, 116], maxlen=100)
initial_state [ 0.03849838 -0.00346414 -0.00341208  0.04965422]
info {}
initial_state tf.Tensor([ 0.03849838 -0.00346414 -0.00341208  0.04965422], shape=(4,), dtype=float32)
episode_reward 35
episodes_reward

  0%|          | 50/10000 [00:09<07:41, 21.55it/s, episode_reward=115, running_reward=76.6]

episode_reward 143
episodes_reward deque([34, 23, 34, 23, 26, 19, 16, 30, 118, 68, 41, 48, 53, 49, 22, 42, 30, 42, 43, 57, 111, 55, 56, 83, 67, 39, 175, 85, 110, 95, 116, 228, 361, 104, 133, 46, 117, 127, 116, 35, 129, 27, 127, 53, 24, 143], maxlen=100)
initial_state [ 0.0176503  -0.03833391 -0.02214547  0.02085873]
info {}
initial_state tf.Tensor([ 0.0176503  -0.03833391 -0.02214547  0.02085873], shape=(4,), dtype=float32)

  1%|          | 53/10000 [00:09<08:04, 20.55it/s, episode_reward=34, running_reward=77.7] 

initial_state [-0.03654895  0.02704557 -0.02116253 -0.018233  ]
info {}
initial_state tf.Tensor([-0.03654895  0.02704557 -0.02116253 -0.018233  ], shape=(4,), dtype=float32)

  1%|          | 53/10000 [00:09<08:04, 20.55it/s, episode_reward=315, running_reward=82.1]


  1%|          | 56/10000 [00:10<10:23, 15.96it/s, episode_reward=159, running_reward=86.1]

episode_reward 235
episodes_reward deque([34, 23, 34, 23, 26, 19, 16, 30, 118, 68, 41, 48, 53, 49, 22, 42, 30, 42, 43, 57, 111, 55, 56, 83, 67, 39, 175, 85, 110, 95, 116, 228, 361, 104, 133, 46, 117, 127, 116, 35, 129, 27, 127, 53, 24, 143, 33, 34, 38, 107, 115, 123, 134, 34, 315, 235], maxlen=100)
initial_state [-0.03121899  0.03373321 -0.02414295 -0.00857247]
info {}
initial_state tf.Tensor([-0.03121899  0.03373321 -0.02414295 -0.00857247], shape=(4,), dtype=float32)

  1%|          | 56/10000 [00:10<10:23, 15.96it/s, episode_reward=261, running_reward=89.1]

episode_reward 261
episodes_reward deque([34, 23, 34, 23, 26, 19, 16, 30, 118, 68, 41, 48, 53, 49, 22, 42, 30, 42, 43, 57, 111, 55, 56, 83, 67, 39, 175, 85, 110, 95, 116, 228, 361, 104, 133, 46, 117, 127, 

  1%|          | 59/10000 [00:10<13:42, 12.09it/s, episode_reward=92, running_reward=93.7] 

episode_reward 183
episodes_reward deque([34, 23, 34, 23, 26, 19, 16, 30, 118, 68, 41, 48, 53, 49, 22, 42, 30, 42, 43, 57, 111, 55, 56, 83, 67, 39, 175, 85, 110, 95, 116, 228, 361, 104, 133, 46, 117, 127, 116, 35, 129, 27, 127, 53, 24, 143, 33, 34, 38, 107, 115, 123, 134, 34, 315, 235, 159, 261, 183], maxlen=100)
initial_state [ 0.04315516  0.03220192  0.04111085 -0.01825038]
info {}
initial_state tf.Tensor([ 0.04315516  0.03220192  0.04111085 -0.01825038], shape=(4,), dtype=float32)

  1%|          | 61/10000 [00:10<14:29, 11.43it/s, episode_reward=92, running_reward=93.7]

initial_state [ 0.01435593  0.04597715  0.01946861 -0.00057371]
info {}
initial_state tf.Tensor([ 0.01435593  0.04597715  0.01946861 -0.00057371], shape=(4,), dtype=float32)

  1%|          | 63/10000 [00:11<17:57,  9.23it/s, episode_reward=258, running_reward=103]

episode_reward 487
episodes_reward deque([34, 23, 34, 23, 26, 19, 16, 30, 118, 68, 41, 48, 53, 49, 22, 42, 30, 42, 43, 57, 111, 55, 56, 83, 67, 39, 175, 85, 110, 95, 116, 228, 361, 104, 133, 46, 117, 127, 116, 35, 129, 27, 127, 53, 24, 143, 33, 34, 38, 107, 115, 123, 134, 34, 315, 235, 159, 261, 183, 271, 92, 487], maxlen=100)
initial_state [-0.03557863  0.04166614  0.00486708  0.0094837 ]
info {}
initial_state tf.Tensor([-0.03557863  0.04166614  0.00486708  0.0094837 ], shape=(4,), dtype=float32)

  1%|          | 63/10000 [00:11<17:57,  9.23it/s, episode_reward=360, running_reward=107]


  1%|          | 66/10000 [00:11<19:54,  8.31it/s, episode_reward=230, running_reward=110]

episode_reward 215
episodes_reward deque([34, 23, 34, 23, 26, 19, 16, 30, 118, 68, 41, 48, 53, 49, 22, 42, 30, 42, 43, 57, 111, 55, 56, 83, 67, 39, 175, 85, 110, 95, 116, 228, 361, 104, 133, 46, 117, 127, 116, 35, 129, 27, 127, 53, 24, 143, 33, 34, 38, 107, 115, 123, 134, 34, 315, 235, 159, 261, 183, 271, 92, 487, 258, 360, 215], maxlen=100)
initial_state [-0.01405145  0.01315653 -0.02328361 -0.02752738]
info {}
initial_state tf.Tensor([-0.01405145  0.01315653 -0.0232836

  1%|          | 67/10000 [00:11<21:07,  7.83it/s, episode_reward=334, running_reward=113]


  1%|          | 68/10000 [00:11<21:45,  7.61it/s, episode_reward=284, running_reward=116]

episode_reward 284
episodes_reward deque([34, 23, 34, 23, 26, 19, 16, 30, 118, 68, 41, 48, 53, 49, 22, 42, 30, 42, 43, 57, 111, 55, 56, 83, 67, 39, 175, 85, 110, 95, 116, 228,

  1%|          | 69/10000 [00:11<23:48,  6.95it/s, episode_reward=500, running_reward=121]

episode_reward 500
episodes_reward deque([34, 23, 34, 23, 26, 19, 16, 30, 118, 68, 41, 48, 53, 49, 22, 42, 30, 42, 43, 57, 111, 55, 56, 83, 67, 39, 175, 85, 110, 95, 116, 228, 361, 104, 133, 46, 117, 127, 116, 35, 129, 27, 127, 53, 24, 143, 33, 34, 38, 107, 115, 123, 134, 34, 315, 235, 159, 261, 183, 271, 92, 487, 258, 360, 215, 230, 334, 284, 500], maxlen=100)
initial_state [ 0.01957249  0.00694912 -0.02086524 -0.02072479]
info {}
initial_state tf.Tensor([ 0.01957249  0.00694912 -0.02086524 -0.02072479], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_st

  1%|          | 70/10000 [00:12<27:13,  6.08it/s, episode_reward=500, running_reward=127]

episode_reward 500
episodes_reward deque([34, 23, 34, 23, 26, 19, 16, 30, 118, 68, 41, 48, 53, 49, 22, 42, 30, 42, 43, 57, 111, 55, 56, 83, 67, 39, 175, 85, 110, 95, 116, 228, 361, 104, 133, 46, 117, 127, 116, 35, 129, 27, 127, 53, 24, 143, 33, 34, 38, 107, 115, 123, 134, 34, 315, 235, 159, 261, 183, 271, 92, 487, 258, 360, 215, 230, 334, 284, 500, 500], maxlen=100)
initial_state [-0.02564248  0.0378058  -0.00200169 -0.01277638]
info {}
initial_state tf.Tensor([-0.02564248  0.0378058  -0.00200169 -0.01277638], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter e

  1%|          | 71/10000 [00:12<30:29,  5.43it/s, episode_reward=500, running_reward=132]

episode_reward 500
episodes_reward deque([34, 23, 34, 23, 26, 19, 16, 30, 118, 68, 41, 48, 53, 49, 22, 42, 30, 42, 43, 57, 111, 55, 56, 83, 67, 39, 175, 85, 110, 95, 116, 228, 361, 104, 133, 46, 117, 127, 116, 35, 129, 27, 127, 53, 24, 143, 33, 34, 38, 107, 115, 123, 134, 34, 315, 235, 159, 261, 183, 271, 92, 487, 258, 360, 215, 230, 334, 284, 500, 500, 500], maxlen=100)
initial_state [-0.00969582 -0.00563833  0.04090779  0.01038947]
info {}
initial_state tf.Tensor([-0.00969582 -0.00563833  0.04090779  0.01038947], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
en

  1%|          | 73/10000 [00:12<29:59,  5.52it/s, episode_reward=345, running_reward=140]

initial_state [-0.02561566  0.02942779 -0.01907903  0.04480657]
info {}
initial_state tf.Tensor([-0.02561566  0.02942779 -0.01907903  0.04480657], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
e

  1%|          | 74/10000 [00:12<29:57,  5.52it/s, episode_reward=419, running_reward=144]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  1%|          | 75/10000 [00:13<31:26,  5.26it/s, episode_reward=500, running_reward=148]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  1%|          | 77/10000 [00:13<24:25,  6.77it/s, episode_reward=210, running_reward=149]

episode_reward 156
episodes_reward deque([34, 23, 34, 23, 26, 19, 16, 30, 118, 68, 41, 48, 53, 49, 22, 42, 30, 42, 43, 57, 111, 55, 56, 83, 67, 39, 175, 85, 110, 95, 116, 228, 361, 104, 133, 46, 117, 127, 116, 35, 129, 27, 127, 53, 24, 143, 33, 34, 38, 107, 115, 123, 134, 34, 315, 235, 159, 261, 183, 271, 92, 487, 258, 360, 215, 230, 334, 284, 500, 500, 500, 476, 345, 419, 500, 156], maxlen=100)
initial_state [-0.02022521 -0.01258377  0.04247963 -0.04161829]
info {}
initial_state tf.Tensor([-0.02022521 -0.01258377  0.04247963 -0.04161829], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter e

  1%|          | 78/10000 [00:13<23:44,  6.96it/s, episode_reward=500, running_reward=156]

episode_reward 327
episodes_reward deque([34, 23, 34, 23, 26, 19, 16, 30, 118, 68, 41, 48, 53, 49, 22, 42, 30, 42, 43, 57, 111, 55, 56, 83, 67, 39, 175, 85, 110, 95, 116, 228, 361, 104, 133, 46, 117, 127, 116, 35, 129, 27, 127, 53, 24, 143, 33, 34, 38, 107, 115, 123, 134, 34, 315, 235, 159, 261, 183, 271, 92, 487, 258, 360, 215, 230, 334, 284, 500, 500, 500, 476, 345, 419, 500, 156, 210, 327], maxlen=100)
initial_state [-0.00853356 -0.03873906 -0.03093519 -0.00224386]
info {}
initial_state tf.Tensor([-0.00853356 -0.03873906 -0.03093519 -0.00224386], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_st

  1%|          | 81/10000 [00:13<20:32,  8.05it/s, episode_reward=172, running_reward=156]

initial_state [ 0.01587066  0.00275837  0.03278347 -0.01275898]
info {}
initial_state tf.Tensor([ 0.01587066  0.00275837  0.03278347 -0.01275898], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
e

  1%|          | 83/10000 [00:14<16:45,  9.86it/s, episode_reward=138, running_reward=155]

episode_reward 126
episodes_reward deque([34, 23, 34, 23, 26, 19, 16, 30, 118, 68, 41, 48, 53, 49, 22, 42, 30, 42, 43, 57, 111, 55, 56, 83, 67, 39, 175, 85, 110, 95, 116, 228, 361, 104, 133, 46, 117, 127, 116, 35, 129, 27, 127, 53, 24, 143, 33, 34, 38, 107, 115, 123, 134, 34, 315, 235, 159, 261, 183, 271, 92, 487, 258, 360, 215, 230, 334, 284, 500, 500, 500, 476, 345, 419, 500, 156, 210, 327, 500, 167, 172, 126], maxlen=100)
initial_state [ 0.0203263   0.02548441  0.01536369 -0.00704381]
info {}
initial_state tf.Tensor([ 0.0203263   0.02548441  0.01536369 -0.00704381], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter e

  1%|          | 87/10000 [00:14<12:49, 12.89it/s, episode_reward=152, running_reward=155]

initial_state [ 0.00841803  0.04192651 -0.0078736   0.01323659]
info {}
initial_state tf.Tensor([ 0.00841803  0.04192651 -0.0078736   0.01323659], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
e

  1%|          | 91/10000 [00:14<11:05, 14.88it/s, episode_reward=126, running_reward=154]

initial_state [-0.0125045   0.01281465 -0.01120003  0.02744897]
info {}
initial_state tf.Tensor([-0.0125045   0.01281465 -0.01120003  0.02744897], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
e

  1%|          | 94/10000 [00:14<09:48, 16.83it/s, episode_reward=129, running_reward=152]

episode_reward 85
episodes_reward deque([34, 23, 34, 23, 26, 19, 16, 30, 118, 68, 41, 48, 53, 49, 22, 42, 30, 42, 43, 57, 111, 55, 56, 83, 67, 39, 175, 85, 110, 95, 116, 228, 361, 104, 133, 46, 117, 127, 116, 35, 129, 27, 127, 53, 24, 143, 33, 34, 38, 107, 115, 123, 134, 34, 315, 235, 159, 261, 183, 271, 92, 487, 258, 360, 215, 230, 334, 284, 500, 500, 500, 476, 345, 419, 500, 156, 210, 327, 500, 167, 172, 126, 150, 129, 138, 149, 117, 152, 112, 131, 126, 85], maxlen=100)
initial_state [ 0.03863456 -0.00672337 -0.03381245 -0.0467406 ]
info {}
initial_state tf.Tensor([ 0.03863456 -0.00672337 -0.03381245 -0.0467406 ], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
ente

  1%|          | 96/10000 [00:14<10:26, 15.82it/s, episode_reward=190, running_reward=152]

initial_state [ 0.00196564  0.03362392 -0.03975436 -0.04365167]
info {}
initial_state tf.Tensor([ 0.00196564  0.03362392 -0.03975436 -0.04365167], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
e

  1%|          | 98/10000 [00:15<12:40, 13.01it/s, episode_reward=104, running_reward=151]

initial_state [-0.03102466 -0.04177629 -0.0262639   0.01560436]
info {}
initial_state tf.Tensor([-0.03102466 -0.04177629 -0.0262639   0.01560436], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
e

  1%|          | 100/10000 [00:15<13:50, 11.92it/s, episode_reward=128, running_reward=152]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  1%|          | 102/10000 [00:15<13:50, 11.92it/s, episode_reward=118, running_reward=155]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
episode_reward 122
episodes_reward deque([34, 23, 26, 19, 16, 30, 118, 68, 41, 48, 53, 49, 22, 42, 30, 42, 43, 57, 111, 55, 56, 83, 67, 39, 175, 85, 110, 95, 116, 228, 361, 104, 133, 46, 117, 127, 116, 35, 129, 27, 127, 53, 24, 143, 33, 34, 38, 107, 115, 123, 134, 34, 315, 235, 159, 261, 183, 271, 92, 487, 258, 360, 215, 230, 334, 284, 500, 500, 500, 476, 345, 419, 500, 156, 210, 327, 500, 167, 172, 126, 150, 129, 138, 149, 117, 152, 112, 131, 126, 85, 57, 161, 129, 112, 190, 108, 104, 170, 128, 122], maxlen=100)
initial_state [-0.01938753  0.02539697 -0.02506209  0.02692543]
info {}
initial_state tf.Tensor([-0.01938753  0.02539697 -0.02506209  0.02692543], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter e

  1%|          | 106/10000 [00:15<13:43, 12.02it/s, episode_reward=110, running_reward=157]

initial_state [-0.00524567  0.02117952 -0.04095489 -0.00818122]
info {}
initial_state tf.Tensor([-0.00524567  0.02117952 -0.04095489 -0.00818122], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
e

  1%|          | 108/10000 [00:15<14:40, 11.24it/s, episode_reward=139, running_reward=159]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
episode_reward 146
episodes_reward deque([30, 118, 68, 41, 48, 53, 49, 22, 42, 30, 42, 43, 57, 111, 55, 56, 83, 67, 39, 175, 85, 110, 95, 116, 228, 361, 104, 133, 46, 117, 127, 116, 35, 129, 27, 127, 53, 24, 143, 33, 34, 38, 107, 115, 123, 134, 34, 315, 235, 159, 261, 183, 271, 92, 487, 258, 360, 215, 230, 334, 284, 500, 500, 500, 476, 345, 419, 500, 156, 210, 327, 500, 167, 172, 126, 150, 129, 138, 149, 117, 152, 112, 131, 126, 85, 57, 161, 129, 112, 190, 108, 104, 170, 128, 122, 113, 118, 139, 110, 146], maxlen=100)
initial_state [-0.03335394  0.03610076 -0.02495882  0.01246864]
info {}
initial_state 

  1%|          | 110/10000 [00:16<15:09, 10.87it/s, episode_reward=34, running_reward=159] 

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
episode_reward 150
episodes_reward deque([68, 41, 48, 53, 49, 22, 42, 30, 42, 43, 57, 111, 55, 56, 83, 67, 39, 175, 85, 110, 95, 116, 228, 361, 104, 133, 46, 117, 127, 116, 35, 129, 27, 127, 53, 24, 143, 33, 34, 38, 107, 115, 123, 134, 34, 315, 235, 159, 261, 183, 271, 92, 487, 258, 360, 215, 230, 334, 284, 500, 500, 500, 476, 345, 419, 5

  1%|          | 113/10000 [00:16<11:38, 14.16it/s, episode_reward=39, running_reward=159]

initial_state [ 0.01975493  0.03120496 -0.00335289 -0.04739154]
info {}
initial_state tf.Tensor([ 0.01975493  0.03120496 -0.00335289 -0.04739154], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
episode_reward 5

  1%|          | 115/10000 [00:16<15:40, 10.51it/s, episode_reward=62, running_reward=160]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  1%|          | 119/10000 [00:16<13:16, 12.40it/s, episode_reward=69, running_reward=162]

episode_reward 82
episodes_reward deque([42, 43, 57, 111, 55, 56, 83, 67, 39, 175, 85, 110, 95, 116, 228, 361, 104, 133, 46, 117, 127, 116, 35, 129, 27, 127, 53, 24, 143, 33, 34, 38, 107, 115, 123, 134, 34, 315, 235, 159, 261, 183, 271, 92, 487, 258, 360, 215, 230, 334, 284, 500, 500, 500, 476, 345, 419, 500, 156, 210, 327, 500, 167, 172, 126, 150, 129, 138, 149, 117, 152, 112, 131, 126, 85, 57, 161, 129, 112, 190, 108, 104, 170, 128, 122, 113, 118, 139, 110, 146, 139, 150, 78, 52, 34, 54, 39, 84, 62, 82], maxlen=100)
initial_state [-0.03480365 -0.02870883  0.00039194  0.01896262]
info {}
initial_state tf.Tensor([-0.03480365 -0.02870883  0.00039194  0.01896262], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
en

  1%|          | 121/10000 [00:17<14:22, 11.45it/s, episode_reward=130, running_reward=163]

episode_reward 124
episodes_reward deque([111, 55, 56, 83, 67, 39, 175, 85, 110, 95, 116, 228, 361, 104, 133, 46, 117, 127, 116, 35, 129, 27, 127, 53, 24, 143, 33, 34, 38, 107, 115, 123, 134, 34, 315, 235, 159, 261, 183, 271, 92, 487, 258, 360, 215, 230, 334, 284, 500, 500, 500, 476, 345, 419, 500, 156, 210, 327, 500, 167, 172, 126, 150, 129, 138, 149, 117, 152, 112, 131, 126, 85, 57, 161, 129, 112, 190, 108, 104, 170, 128, 122, 113, 118, 139, 110, 146, 139, 150, 78, 52, 34, 54, 39, 84, 62, 82, 99, 69, 124], maxlen=100)
initial_state [ 0.02092637  0.02320517 -0.03016713 -0.01075231]
info {}
initial_state tf.Tensor([ 0.02092637  0.02320517 -0.03016713 -0.01075231], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step


  1%|          | 123/10000 [00:17<14:29, 11.35it/s, episode_reward=133, running_reward=165]

initial_state [ 0.0208016  -0.04063834  0.01710142  0.02159538]
info {}
initial_state tf.Tensor([ 0.0208016  -0.04063834  0.01710142  0.02159538], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
e

  1%|▏         | 125/10000 [00:17<14:32, 11.31it/s, episode_reward=168, running_reward=166]

initial_state [-0.0224164  -0.02686095  0.03810284  0.00497099]
info {}
initial_state tf.Tensor([-0.0224164  -0.02686095  0.03810284  0.00497099], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
e

  1%|▏         | 127/10000 [00:17<16:38,  9.89it/s, episode_reward=173, running_reward=167]

episode_reward 141
episodes_reward deque([175, 85, 110, 95, 116, 228, 361, 104, 133, 46, 117, 127, 116, 35, 129, 27, 127, 53, 24, 143, 33, 34, 38, 107, 115, 123, 134, 34, 315, 235, 159, 261, 183, 271, 92, 487, 258, 360, 215, 230, 334, 284, 500, 500, 500, 476, 345, 419, 500, 156, 210, 327, 500, 167, 172, 126, 150, 129, 138, 149, 117, 152, 112, 131, 126, 85, 57, 161, 129, 112, 190, 108, 104, 170, 128, 122, 113, 118, 139, 110, 146, 139, 150, 78, 52, 34, 54, 39, 84, 62, 82, 99, 69, 124, 154, 130, 147, 133, 168, 141], maxlen=100)
initial_state [ 0.03634976 -0.03710467 -0.00805256  0.04738257]
info {}
initial_state tf.Tensor([ 0.03634976 -0.03710467 -0.00805256  0.04738257], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  1%|▏         | 127/10000 [00:17<16:38,  9.89it/s, episode_reward=145, running_reward=168]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
episode_reward 129
episodes_reward deque([110, 95, 116, 228, 361, 104, 133, 46, 117, 127, 116, 35, 129, 27, 127, 53, 24, 143, 33, 34, 38, 107, 115, 123, 134, 34, 315, 235, 159, 261, 183, 271, 92, 487, 258, 360, 215, 230, 334, 284, 500, 500, 500, 476, 345, 419, 500,

  1%|▏         | 130/10000 [00:17<17:53,  9.19it/s, episode_reward=129, running_reward=168]

initial_state [ 0.01765709  0.04877453 -0.02235374 -0.01954375]
info {}
initial_state tf.Tensor([ 0.01765709  0.04877453 -0.02235374 -0.01954375], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
e

  1%|▏         | 132/10000 [00:18<18:06,  9.08it/s, episode_reward=167, running_reward=167]

episode_reward 127
episodes_reward deque([228, 361, 104, 133, 46, 117, 127, 116, 35, 129, 27, 127, 53, 24, 143, 33, 34, 38, 107, 115, 123, 134, 34, 315, 235, 159, 261, 183, 271, 92, 487, 258, 360, 215, 230, 334, 284, 500, 500, 500, 476, 345, 419, 500, 156, 210, 327, 500, 167, 172, 126, 150, 129, 138, 149, 117, 152, 112, 131, 126, 85, 57, 161, 129, 112, 190, 108, 104, 170, 128, 122, 113, 118, 139, 110, 146, 139, 150, 78, 52, 34, 54, 39, 84, 62, 82, 99, 69, 124, 154, 130, 147, 133, 168, 141, 173, 129, 145, 129, 127], maxlen=100)
initial_state [ 0.04707036  0.04653681 -0.00399868  0.04478892]
info {}
initial_state tf.Tensor([ 0.04707036  0.04653681 -0.00399868  0.04478892], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter en

  1%|▏         | 133/10000 [00:18<19:04,  8.62it/s, episode_reward=122, running_reward=166]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  1%|▏         | 135/10000 [00:18<18:52,  8.71it/s, episode_reward=158, running_reward=167]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  1%|▏         | 137/10000 [00:18<16:44,  9.82it/s, episode_reward=138, running_reward=168]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
episode_reward 101
episodes_reward deque([127, 116, 35, 129, 27, 127, 53, 24, 143, 33, 34, 38, 107, 115, 123, 134, 34, 315, 235, 159, 261, 183, 271, 92, 487, 258, 360, 215, 230, 334, 284, 500, 500, 500, 476, 345, 419, 500, 156, 210, 327, 500, 167, 172, 126, 150, 129, 138, 149, 117, 152, 112, 131, 126, 85, 57, 161, 129, 112, 190, 108, 104, 170, 128, 122, 113, 118, 139, 110, 146, 139, 150, 78, 52, 34, 54, 39, 84, 62, 82, 99, 69, 124, 154, 130, 147, 133, 168, 141, 173, 129, 145, 129, 127, 167, 193, 122, 165, 158, 101], maxlen=100)
initial_state [-0.00628122 -0.04668389  0.01939652  0.00456218]
info {}
initial_state tf.Tensor([-0.00628122 -0.04668389  0.01939652  0.00456218], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter e

  1%|▏         | 141/10000 [00:19<15:20, 10.71it/s, episode_reward=134, running_reward=169]

initial_state [0.02513871 0.04189767 0.03194761 0.01107351]
info {}
initial_state tf.Tensor([0.02513871 0.04189767 0.03194761 0.01107351], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env

  1%|▏         | 143/10000 [00:19<15:15, 10.76it/s, episode_reward=155, running_reward=171]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  1%|▏         | 145/10000 [00:19<15:28, 10.61it/s, episode_reward=189, running_reward=174]

episode_reward 173
episodes_reward deque([24, 143, 33, 34, 38, 107, 115, 123, 134, 34, 315, 235, 159, 261, 183, 271, 92, 487, 258, 360, 215, 230, 334, 284, 500, 500, 500, 476, 345, 419, 500, 156, 210, 327, 500, 167, 172, 126, 150, 129, 138, 149, 117, 152, 112, 131, 126, 85, 57, 161, 129, 112, 190, 108, 104, 170, 128, 122, 113, 118, 139, 110, 146, 139, 150, 78, 52, 34, 54, 39, 84, 62, 82, 99, 69, 124, 154, 130, 147, 133, 168, 141, 173, 129, 145, 129, 127, 167, 193, 122, 165, 158, 101, 154, 138, 156, 134, 176, 155, 173], maxlen=100)
initial_state [ 0.031073   -0.04360844  0.03026859 -0.01529265]
info {}
initial_state tf.Tensor([ 0.031073   -0.04360844  0.03026859 -0.01529265], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
ente

  1%|▏         | 147/10000 [00:19<15:54, 10.33it/s, episode_reward=152, running_reward=175]

initial_state [-0.03792018  0.02619896 -0.03097291  0.02758788]
info {}
initial_state tf.Tensor([-0.03792018  0.02619896 -0.03097291  0.02758788], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
e

  1%|▏         | 149/10000 [00:19<16:17, 10.08it/s, episode_reward=146, running_reward=178]

episode_reward 208
episodes_reward deque([38, 107, 115, 123, 134, 34, 315, 235, 159, 261, 183, 271, 92, 487, 258, 360, 215, 230, 334, 284, 500, 500, 500, 476, 345, 419, 500, 156, 210, 327, 500, 167, 172, 126, 150, 129, 138, 149, 117, 152, 112, 131, 126, 85, 57, 161, 129, 112, 190, 108, 104, 170, 128, 122, 113, 118, 139, 110, 146, 139, 150, 78, 52, 34, 54, 39, 84, 62, 82, 99, 69, 124, 154, 130, 147, 133, 168, 141, 173, 129, 145, 129, 127, 167, 193, 122, 165, 158, 101, 154, 138, 156, 134, 176, 155, 173, 156, 189, 152, 208], maxlen=100)
initial_state [ 0.03512775 -0.02687343  0.00396286  0.01932393]
info {}
initial_state tf.Tensor([ 0.03512775 -0.02687343  0.00396286  0.01932393], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
e

  2%|▏         | 151/10000 [00:20<18:15,  8.99it/s, episode_reward=392, running_reward=181]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  2%|▏         | 152/10000 [00:20<18:14,  9.00it/s, episode_reward=198, running_reward=182]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  2%|▏         | 154/10000 [00:20<20:30,  8.00it/s, episode_reward=211, running_reward=186]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
episode_reward 348
episodes_reward deque([34, 315, 235, 159, 261, 183, 271, 92, 487, 258, 360, 215, 230, 334, 284, 500, 500, 500, 476, 345, 419, 500, 156, 210, 327, 500, 167, 172, 126, 150, 129, 138, 149, 117, 152, 112, 131, 126, 85, 57, 161, 129, 112, 190, 108, 104, 170, 128, 122, 113, 118, 139, 110, 146, 139, 150, 78, 52, 34, 54, 39, 84, 62, 82, 99, 69, 124, 154, 130, 147, 133, 168, 141, 173, 129, 145, 129, 127, 167, 193, 122, 165, 158, 10

  2%|▏         | 155/10000 [00:20<22:42,  7.23it/s, episode_reward=308, running_reward=186]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  2%|▏         | 156/10000 [00:20<22:21,  7.34it/s, episode_reward=205, running_reward=185]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  2%|▏         | 157/10000 [00:21<28:31,  5.75it/s, episode_reward=465, running_reward=188]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  2%|▏         | 159/10000 [00:21<24:37,  6.66it/s, episode_reward=306, running_reward=189]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
episode_reward 190
episodes_reward deque([183, 271, 92, 487, 258, 360, 215, 230, 334, 284, 500, 500, 500, 476, 345, 419, 500, 156, 210, 327, 500, 167, 172, 126, 150, 129, 138, 149, 117, 152, 112, 131, 126, 85, 57, 161, 129, 112, 190, 108, 104, 170, 128, 122, 113, 118, 139, 110, 146, 139, 150, 78, 52, 34, 54, 39, 84, 62, 82,

  2%|▏         | 159/10000 [00:21<24:37,  6.66it/s, episode_reward=500, running_reward=191]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  2%|▏         | 161/10000 [00:21<26:00,  6.30it/s, episode_reward=255, running_reward=193]

initial_state [-0.04678959 -0.04807194  0.04717474 -0.00295984]
info {}
initial_state tf.Tensor([-0.04678959 -0.04807194  0.04717474 -0.00295984], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
e

  2%|▏         | 162/10000 [00:21<24:17,  6.75it/s, episode_reward=272, running_reward=191]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
episode_reward 272
episodes_reward deque([258, 360, 215, 230, 334, 284, 500, 500, 500, 476, 345, 419, 500, 156, 210, 327, 500, 167, 172, 126, 150, 129, 138, 149, 117, 152, 112, 131, 126, 85, 57, 161, 129, 112, 190, 108, 104, 170, 128, 122, 113, 118, 139, 110, 146, 139, 150, 78, 52, 34, 54, 39, 84, 62, 82, 99, 69, 124, 154, 130, 147, 133, 168, 141, 173, 129, 145, 129, 127, 167, 193, 122, 165, 158, 101, 154, 138, 156, 134, 176, 155, 173, 156, 189, 152, 208, 155, 146, 392, 198, 348, 211, 308, 205, 465, 190, 306, 500, 255, 272], maxlen=100)
initial_state [-0.00378738 -0.03544762 -0.00267891  0.03655322]
info {}
initial_state tf.Tensor([-0.00378738 -0.03544762 -0.00267891  0.03655322], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_ste

  2%|▏         | 163/10000 [00:22<27:30,  5.96it/s, episode_reward=448, running_reward=193]

episode_reward 448
episodes_reward deque([360, 215, 230, 334, 284, 500, 500, 500, 476, 345, 419, 500, 156, 210, 327, 500, 167, 172, 126, 150, 129, 138, 149, 117, 152, 112, 131, 126, 85, 57, 161, 129, 112, 190, 108, 104, 170, 128, 122, 113, 118, 139, 110, 146, 139, 150, 78, 52, 34, 54, 39, 84, 62, 82, 99, 69, 124, 154, 130, 147, 133, 168, 141, 173, 129, 145, 129, 127, 167, 193, 122, 165, 158, 101, 154, 138, 156, 134, 176, 155, 173, 156, 189, 152, 208, 155, 146, 392, 198, 348, 211, 308, 205, 465, 190, 306, 500, 255, 272, 448], maxlen=100)
initial_state [-0.02383895 -0.01005277 -0.0323072   0.00943565]
info {}
initial_state tf.Tensor([-0.02383895 -0.01005277 -0.0323072   0.00943565], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_ste

  2%|▏         | 164/10000 [00:22<31:12,  5.25it/s, episode_reward=409, running_reward=196]

episode_reward 500
episodes_reward deque([215, 230, 334, 284, 500, 500, 500, 476, 345, 419, 500, 156, 210, 327, 500, 167, 172, 126, 150, 129, 138, 149, 117, 152, 112, 131, 126, 85, 57, 161, 129, 112, 190, 108, 104, 170, 128, 122, 113, 118, 139, 110, 146, 139, 150, 78, 52, 34, 54, 39, 84, 62, 82, 99, 69, 124, 154, 130, 147, 133, 168, 141, 173, 129, 145, 129, 127, 167, 193, 122, 165, 158, 101, 154, 138, 156, 134, 176, 155, 173, 156, 189, 152, 208, 155, 146, 392, 198, 348, 211, 308, 205, 465, 190, 306, 500, 255, 272, 448, 500], maxlen=100)
initial_state [ 0.00743406 -0.03627849  0.00113637 -0.03887296]
info {}
initial_state tf.Tensor([ 0.00743406 -0.03627849  0.00113637 -0.03887296], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_ste

  2%|▏         | 166/10000 [00:22<30:03,  5.45it/s, episode_reward=256, running_reward=196]

initial_state [ 0.0007056  -0.00515353 -0.03482153 -0.02698923]
info {}
initial_state tf.Tensor([ 0.0007056  -0.00515353 -0.03482153 -0.02698923], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
e

  2%|▏         | 167/10000 [00:22<33:04,  4.96it/s, episode_reward=320, running_reward=196]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  2%|▏         | 168/10000 [00:23<30:54,  5.30it/s, episode_reward=158, running_reward=195]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  2%|▏         | 169/10000 [00:23<43:50,  3.74it/s, episode_reward=500, running_reward=195]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  2%|▏         | 169/10000 [00:23<43:50,  3.74it/s, episode_reward=351, running_reward=193]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  2%|▏         | 171/10000 [00:23<39:00,  4.20it/s, episode_reward=170, running_reward=190]

initial_state [-0.03765458  0.01010043 -0.02083189 -0.04684743]
info {}
initial_state tf.Tensor([-0.03765458  0.01010043 -0.02083189 -0.04684743], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
e

  2%|▏         | 172/10000 [00:24<32:44,  5.00it/s, episode_reward=160, running_reward=185]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
episode_reward 147
episodes_reward deque([345, 419, 500, 156, 210, 327, 500, 167, 172, 126, 150, 129, 138, 149, 117, 152, 112, 131, 126, 85, 57, 161, 129, 112, 190, 108, 104, 170, 128, 122, 113, 118, 139, 110, 146, 139, 150, 78, 52, 34, 54, 39, 84, 62, 82, 99, 69, 124, 154, 130, 147, 133, 168, 141, 173, 129, 145, 129, 127, 167, 193, 122, 165, 158, 101, 154, 138, 156, 134, 176, 155, 173, 156, 189, 152, 208, 155, 

  2%|▏         | 174/10000 [00:24<27:08,  6.03it/s, episode_reward=179, running_reward=182]

initial_state [ 0.01353722 -0.0056216  -0.01971177  0.01936319]
info {}
initial_state tf.Tensor([ 0.01353722 -0.0056216  -0.01971177  0.01936319], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
e

  2%|▏         | 175/10000 [00:24<25:48,  6.35it/s, episode_reward=190, running_reward=179]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
episode_reward 190
episod

  2%|▏         | 176/10000 [00:24<28:12,  5.80it/s, episode_reward=260, running_reward=180]

episode_reward 260
episodes_reward deque([210, 327, 500, 167, 172, 126, 150, 129, 138, 149, 117, 152, 112, 131, 126, 85, 57, 161, 129, 112, 190, 108, 104, 170, 128, 122, 113, 118, 139, 110, 146, 139, 150, 78, 52, 34, 54, 39, 84, 62, 82, 99, 69, 124, 154, 130, 147, 133, 168, 141, 173, 129, 145, 129, 127, 167, 193, 122, 165, 158, 101, 154, 138, 156, 134, 176, 155, 173, 156, 189, 152, 208, 155, 146, 392, 198, 348, 211, 308, 205, 465, 190, 306, 500, 255, 272, 448, 500, 409, 256, 320, 158, 500, 351, 170, 147, 160, 179, 190, 260], maxlen=100)
initial_state [-0.00988515 -0.02404932  0.03673315 -0.00744918]
info {}
initial_state tf.Tensor([-0.00988515 -0.02404932  0.03673315 -0.007449

  2%|▏         | 177/10000 [00:24<31:55,  5.13it/s, episode_reward=294, running_reward=181]

  2%|▏         | 178/10000 [00:25<30:57,  5.29it/s, episode_reward=85, running_reward=176] 

  2%|▏         | 180/10000 [00:25<22:47,  7.18it/s, episode_reward=102, running_reward=175]

episode_reward 109
episodes_reward deque([172, 126, 150, 129, 138, 149, 117, 152, 112, 131, 126, 85, 57, 161, 129, 112, 190, 108, 104, 170, 128, 122, 113, 118, 139, 110, 146, 139, 150, 78, 52, 34, 54, 39, 84, 62, 82, 99, 69, 124, 154, 130, 147, 133, 168, 141, 173, 129, 145, 129, 127, 167, 193, 122, 165, 158, 101, 154, 138, 156, 134, 176, 155, 173, 156, 189, 152, 208, 155, 146, 392, 198, 348, 211, 308, 205, 465, 

  2%|▏         | 182/10000 [00:25<19:26,  8.42it/s, episode_reward=116, running_reward=174]

episode_reward 107
episodes_reward deque([150, 129, 138, 149, 117, 152, 112, 131, 126, 85, 57, 161, 129, 112, 190, 108, 104, 170, 128, 122, 113, 118, 139, 110, 146, 139, 150, 78, 52, 34, 54, 39, 84, 62, 82, 99, 69, 124, 154, 130, 147, 133, 168, 141, 173, 129, 145, 129, 127, 167, 193, 122, 165, 158, 101, 154, 138, 156, 134, 176, 155, 173, 156, 189, 152, 208, 155, 146, 392, 198, 348, 211, 308, 205, 465, 190, 306, 500, 255, 272, 448, 500, 409, 256, 320, 158, 500, 351, 170, 147, 160, 179, 190, 260, 294, 233, 85, 109, 102, 107], maxlen=100)
initial_state [0.01275614 0.03246063 0.01023909 0.00702862]
info {}
initial_state tf.Tensor([0.01275614 0.03246063 0.01023909 0.00702862], shape=(4,), dtype=float32)

  2%|▏         | 184/10000 [00:25<17:15,  9.48it/s, episode_reward=95, running_reward=174] 

initial_state [ 0.03209061  0.03432763 -0.03327183 -0.01701912]
info {}
initial_state tf.Tensor([ 0.03209061  0.03432763 -0.03327183 -0.01701912], shape=(4,), dtype=float32)

  2%|▏         | 186/10000 [00:25<19:28,  8.40it/s, episode_reward=110, running_reward=175]

episode_reward 330
episod

  2%|▏         | 188/10000 [00:26<19:18,  8.47it/s, episode_reward=232, running_reward=176]

  2%|▏         | 190/10000 [00:26<19:50,  8.24it/s, episode_reward=240, running_reward=178]

episode_reward 190
episodes_reward deque([131, 126, 85, 57, 161, 129, 112, 190, 108, 104, 170, 128, 122, 113, 118, 139, 110, 146, 139, 150, 78, 52, 34, 54, 39, 84, 62, 82, 99, 69, 124, 154, 130, 147, 133, 168, 141, 173, 129, 145, 129, 127, 167, 193, 122, 165, 158, 101, 154, 138, 156, 134, 176, 155, 173, 156, 189, 152, 208, 155, 146, 392, 198, 348, 211, 308, 205, 465, 190, 306, 500, 255, 272, 448, 500, 409, 256, 320, 158, 500, 351, 170, 147, 160, 179, 190, 260, 294, 233, 85, 109, 102, 107, 110, 116, 95, 330, 110, 232, 190], maxlen=100)
initial_state [0.00359529 0.01310333 0.01883093 0.01161899]
info {}
initial_state tf.Tensor([0.00359529 0.01310333 0.01883093 0.01161899], shape=(4,), dtype=float32)

  2%|▏         | 191/10000 [00:26<20:42,  7.90it/s, episode_reward=235, running_reward=179]

  2%|▏         | 193/10000 [00:26<22:25,  7.29it/s, episode_reward=286, running_reward=183]

episode_reward 245
episodes_reward deque([57, 161, 129, 112, 190, 108, 104, 170, 128, 122, 113, 118, 139, 110, 146, 139, 150, 78, 52, 34, 54, 39, 84, 62, 82, 99, 69, 124, 154, 130, 147, 133, 168, 141, 173, 129, 145, 129, 127, 167, 193, 122, 165, 158, 101, 154, 138, 156, 134, 176, 155, 173, 156, 189, 152, 208, 155, 146, 392, 198, 348, 211, 308, 205, 465, 190, 306, 500, 255, 272, 448, 500, 409, 256, 320, 158, 500, 351, 170, 147, 160, 179, 190, 260, 294, 233, 85, 109, 102, 107, 110, 116, 95, 330, 110, 232, 190, 240, 235, 245], maxlen=100)
initial_state [ 0.02668685 -0.02142066  0.01029355  0.04827315]
info {}
initial_state tf.Tensor([ 0.02668685 -0.02142066  0.01029355  0.04827315], shape=(4,), dtype=float32)

  2%|▏         | 194/10000 [00:26<23:52,  6.85it/s, episode_reward=202, running_reward=183]

  2%|▏         | 196/10000 [00:27<21:08,  7.73it/s, episode_reward=201, running_reward=185]

episode_reward 202
episodes_reward deque([112, 190, 108, 104, 170, 128, 122, 113, 118, 139, 110, 146, 139, 150, 78, 52, 34, 54, 39, 84, 62, 82, 99, 69, 124, 154, 130, 147, 133, 168, 141, 173, 129, 145, 129, 127, 167, 193, 122, 165, 158, 101, 154, 138, 156, 134, 176, 155, 173, 156, 189, 152, 208, 155, 146, 392, 198, 348, 211, 308, 205, 465, 190, 306, 500, 255, 272, 448, 500, 409, 256, 320, 158, 500, 351, 170, 147, 160, 179, 190, 260, 294, 233, 85, 109, 102, 107, 110, 116,

  2%|▏         | 198/10000 [00:27<19:27,  8.39it/s, episode_reward=207, running_reward=186]

episode_reward 170
episodes_reward deque([108, 104, 170, 128, 122, 113, 118, 139, 110, 146, 139, 150, 78, 52, 34, 54, 39, 84, 62, 82, 99, 69, 124, 154, 130, 147, 133, 168, 141, 173, 129, 145, 129, 127, 167, 193, 122, 165, 158, 101, 154, 138, 156, 134, 176, 155, 173, 156, 189, 152, 208, 155, 146, 392, 198, 348, 211, 308, 205

  2%|▏         | 199/10000 [00:27<25:49,  6.33it/s, episode_reward=341, running_reward=192]

episode_reward 500
episodes_reward deque([170, 128, 122, 113, 118, 139, 110, 146, 139, 150, 78, 52, 34, 54, 39, 84, 62, 82, 99, 69, 124, 154, 130, 147, 133, 168, 141, 173, 129, 145, 129, 127, 167, 193, 122, 165, 158, 101, 154, 138, 156, 134, 176, 155, 173, 156, 189, 152, 208, 155, 146, 392, 198, 348, 211, 308, 205, 465, 190, 306, 500, 255, 272, 448, 500, 409, 256, 320, 158, 500, 351, 170, 147, 160, 179, 190, 260, 294, 233, 85, 109, 102, 107, 110, 116, 95, 330, 110, 232, 190, 240, 235, 245, 286, 202, 202, 201, 170, 207, 500], maxlen=100)
initial_state [ 0.03859917 -0.04050899  0.00646435  0.04147761]
info {}
initial_state tf.Tensor([ 0.03859917 -0.04050899  0.00646435  0.04147761], shape=(4,), dtype=float32)

  2%|▏         | 200/10000 [00:27<27:07,  6.02it/s, episode_reward=341, running_reward=192]

initial_state [-0.01684466 -0.02568466  0.01636148 -0.03823461]
info {}
initial_state tf.Tensor([-0.01684466 -0.02568466  0.01636148 -0.03823461], shape=(4,), dtype=float32)

  2%|▏         | 201/10000 [00:28<33:07,  4.93it/s, episode_reward=500, running_reward=195]

  2%|▏         | 202/10000 [00:28<35:29,  4.60it/s, episode_reward=456, running_reward=199]

  2%|▏         | 203/10000 [00:28<37:53,  4.31it/s, episode_reward=500, running_reward=202]

episode_reward 500
episodes_reward deque([118, 139, 110, 146, 139, 150, 78, 52, 34, 54, 39, 84, 62, 82, 99, 69, 124, 154, 130, 147, 133, 168, 141, 173, 129, 145, 129, 127, 167, 193, 122, 165, 158, 101, 154, 138, 156, 134, 176, 155, 173, 156, 189, 152, 208, 155, 146, 392, 198, 348, 211, 308, 205, 465, 190, 306, 500, 255, 272, 448, 500, 409, 256, 320, 158, 500, 351, 170, 147, 160, 179, 190, 260, 294, 233, 85, 109, 102, 107, 110, 116, 95, 330, 110, 232, 190, 240, 235, 245, 286, 202, 202, 201, 170, 207, 500, 341, 500, 456, 500], maxlen=100)
initial_state [0.01512958 0.00372735 0.01273971 0.00659199]
info {}
initial_state tf.Tensor([0.01512958 0.00372735 0.01273971 0.00659199], shape=(4,), dtype=float32)

  2%|▏         | 204/10000 [00:28<39:33,  4.13it/s, episode_reward=500, running_reward=206]

episode_reward 500
episodes_reward deque([139, 110, 146, 139, 150, 78, 52, 34, 54, 39, 84, 62, 82, 99, 69, 124, 154, 130, 147, 133, 168, 141, 173, 129, 145, 129, 127, 167, 193, 122, 165, 158, 101, 154, 138, 156, 134, 176, 155, 173, 156, 189, 152, 208, 155, 146, 392, 198, 348, 211, 308, 205, 465, 190, 306, 500, 255, 272, 448, 500, 409, 256, 320, 158, 500, 351, 170, 147, 160, 179, 190, 260, 294, 233, 85, 109, 102, 107, 110, 116, 95, 330, 110, 232, 190, 240, 235, 245, 286, 202, 202, 201, 170, 207, 500, 341, 500, 456, 500, 500], maxlen=100)
initial_state [ 0.04227404 -0.03229893  0.02112112 -0.01542849]
info {}
initial_state tf.Tensor([

  2%|▏         | 205/10000 [00:29<41:24,  3.94it/s, episode_reward=500, running_reward=210]

  2%|▏         | 205/10000 [00:29<41:24,  3.94it/s, episode_reward=500, running_reward=214]

  2%|▏         | 207/10000 [00:29<36:23,  4.49it/s, episode_reward=234, running_reward=215]

initial_state [-0.0156443  -0.0476175   0.02940942  0.04618611]
info {}
initial_state tf.Tensor([-0.0156443  -0.0476175   0.02940942  0.04618611], shape=(4,), dtype=float32)

  2%|▏         | 208/10000 [00:30<42:22,  3.85it/s, episode_reward=500, running_reward=218]

episode_reward 500
episodes_reward deque([150, 78, 52, 34, 54, 39, 84, 62, 82, 99, 69, 124, 154, 130, 147, 133, 168, 141, 173, 129, 145, 129, 127, 167, 193, 122, 165, 158, 101, 154, 138, 156, 134, 176, 155, 173, 156, 189, 152, 208, 155, 146, 392, 198, 348, 211, 308, 205, 465, 190, 306, 500, 255, 272, 448, 500, 409, 256, 320, 158, 500, 351, 170, 147, 160, 179, 190, 260, 294, 233, 85, 109, 102, 107, 110, 116, 95, 330, 110, 232, 190, 240, 235, 245, 286, 202, 202, 201, 170, 207, 500, 341, 500, 456, 500, 500, 500, 500, 234, 500], maxlen=100)
initial_state [-0.03301277 -0.02836618  0.0472766   0.03578316]
info {}
initial_state tf.Tensor([-0.03301277 -0.

  2%|▏         | 209/10000 [00:30<51:40,  3.16it/s, episode_reward=500, running_reward=222]

  2%|▏         | 210/10000 [00:30<56:42,  2.88it/s, episode_reward=500, running_reward=226]

  2%|▏         | 210/10000 [00:31<56:42,  2.88it/s, episode_reward=500, running_reward=231]

  2%|▏         | 211/10000 [00:31<1:03:15,  2.58it/s, episode_reward=500, running_reward=231]

initial_state [ 0.03595323  0.02576861 -0.04543555 -0.02365622]
info {}
initial_state tf.Tensor([ 0.03595323  0.02576861 -0.04543555 -0.02365622], shape=(4,), dtype=float32)

  2%|▏         | 212/10000 [00:31<1:06:55,  2.44it/s, episode_reward=500, running_reward=235]

episode_reward 500
episodes_reward deque([54, 39, 84, 62, 82, 99, 69, 124, 154, 130, 147, 133, 168, 141, 173, 129, 145, 129, 127, 167, 193, 122, 165, 158, 101, 154, 138, 156, 134, 176, 155, 173, 156, 189, 152, 208, 155, 146, 392, 198, 348, 211, 308, 205, 465, 190, 306, 500, 255, 272, 448, 500, 409, 256, 320, 158, 500, 351, 170, 147, 160, 179, 190, 260, 294, 233, 85, 109, 102, 107, 110, 116, 95, 330, 110, 232, 190, 240, 235, 245, 286, 202, 202, 201, 170, 207, 500, 341, 500, 456, 500, 500, 500, 500, 234, 500, 500, 500, 500, 500], maxlen=100)
initial_state [ 0.01609719 -0.01737196  0.04797253  0.00612463]
info {}
initial_state tf.Tensor([ 0.01609719 -0.01737196  0.04797253  0.00612463], shape=(4,), dtype=float32)

  2%|▏         | 213/10000 [00:32<1:04:20,  2.54it/s, episode_reward=500, running_reward=240]

  2%|▏         | 213/10000 [00:32<1:04:20,  2.54it/s, episode_reward=500, running_reward=244]

enter env_step

  2%|▏         | 214/10000 [00:32<1:05:56,  2.47it/s, episode_reward=500, running_reward=244]

initial_state [ 0.00070061 -0.01969157 -0.01816016 -0.03129357]
info {}
initial_state tf.Tensor([ 0.00070061 -0.01969157 -0.01816016 -0.03129357], shape=(4,), dtype=float32)
enter env_step

  2%|▏         | 215/10000 [00:32<1:02:04,  2.63it/s, episode_reward=500, running_reward=248]

enter env_step

  2%|▏         | 216/10000 [00:33<57:45,  2.82it/s, episode_reward=500, running_reward=253]  

episode_reward 500
episodes_reward deque([82, 99, 69, 124, 154, 130, 147, 133, 168, 141, 173, 129, 145, 129, 127, 167, 193, 122, 165, 158, 101, 154, 138, 156, 134, 176, 155, 173, 156, 189, 152, 208, 155, 146, 392, 198, 348, 211, 308, 205, 465, 190, 306, 500, 255, 272, 448, 500, 409, 256, 320, 158, 500, 351, 170, 147, 160, 179, 190, 260, 294, 233, 85, 109, 102, 107, 110, 116, 95, 330, 110, 232, 190, 240, 235, 245, 286, 202, 202, 201, 170, 207, 500, 341, 500, 456, 500, 500, 500, 500, 234, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [ 0.0159758  -0.03807856 -0.02879488 -0.04478619]
info {}
initial_state tf.Tensor([ 0.0159758  -0.03807856 -0.02879488 -0.04478619], shape=(4,), dtype=float32)
enter env_step

  2%|▏         | 217/10000 [00:33<53:20,  3.06it/s, episode_reward=500, running_reward=257]

episode_reward 500
episodes_reward deque([99, 69, 124, 154, 130, 147, 133, 168, 141, 173, 129, 145, 129, 127, 167, 193, 122, 165, 158, 101, 154, 138, 156, 134, 176, 155, 173, 156, 189, 152, 208, 155, 146, 392, 198, 348, 211, 308, 205, 465, 190, 306, 500, 255, 272, 448, 500, 409, 256, 320, 158, 500, 351, 170, 147, 160, 179, 190, 260, 294, 233, 85, 109, 102, 107, 110, 116, 95, 330, 110, 232, 190, 240, 235, 245, 286, 202, 202, 201, 170, 207, 500, 341, 500, 456, 500, 500, 500, 500, 234, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [-0.01872367 -0.01512844 -0.01628173  0.04128001]
info {}
initial_state tf.Tensor([-0.01872367 -0.01512844 -0.01628173  0.04128001], shape=(4,), dtype=float32)
enter env_step

  2%|▏         | 218/10000 [00:33<52:16,  3.12it/s, episode_reward=500, running_reward=261]

enter env_step

  2%|▏         | 219/10000 [00:34<49:48,  3.27it/s, episode_reward=500, running_reward=265]

enter env_step

  2%|▏         | 220/10000 [00:34<47:48,  3.41it/s, episode_reward=500, running_reward=269]

episode_reward 500
episodes_reward deque([154, 130, 147, 133, 168, 141, 173, 129, 145, 129, 127, 167, 193, 122, 165, 158, 101, 154, 138, 156, 134, 176, 155, 173, 156, 189, 152, 208, 155, 146, 392, 198, 348, 211, 308, 205, 465, 190, 306, 500, 255, 272, 448, 500, 409, 256, 320, 158, 500, 351, 170, 147, 160, 179, 190, 260, 294, 233, 85, 109, 102, 107, 110, 116, 95, 330, 110, 232, 190, 240, 235, 245, 286, 202, 202, 201, 170, 207, 500, 341, 500, 456, 500, 500, 500, 500, 234, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [-0.01918261  0.00465195  0.02704662 -0.02298808]
info {}
initial_state tf.Tensor([-0.01918261  0.00465195  0.02704662 -0.02298808], shape=(4,), dtype=float32)
enter env_step

  2%|▏         | 221/10000 [00:34<47:24,  3.44it/s, episode_reward=500, running_reward=272]

episode_reward 500
episodes_reward deque([130, 147, 133, 168, 141, 173, 129, 145, 129, 127, 167, 193, 122, 165, 158, 101, 154, 138, 156, 134, 176, 155, 173, 156, 189, 152, 208, 155, 146, 392, 198, 348, 211, 308, 205, 465, 190, 306, 500, 255, 272, 448, 500, 409, 256, 320, 158, 500, 351, 170, 147, 160, 179, 190, 260, 294, 233, 85, 109, 102, 107, 110, 116, 95, 330, 110, 232, 190, 240, 235, 245, 286, 202, 202, 201, 170, 207, 500, 341, 500, 456, 500, 500, 500, 500, 234, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [ 0.0175629   0.03272622 -0.02737097 -0.03212547]
info {}
initial_state tf.Tensor([ 0.0175629   0.03272622 -0.02737097 -0.03212547], shape=(4,), dtype=float32)
enter env_step

  2%|▏         | 222/10000 [00:34<45:51,  3.55it/s, episode_reward=500, running_reward=276]

episode_reward 500
episodes_reward deque([147, 133, 168, 141, 173, 129, 145, 129, 127, 167, 193, 122, 165, 158, 101, 154, 138, 156, 134, 176, 155, 173, 156, 189, 152, 208, 155, 146, 392, 198, 348, 211, 308, 205, 465, 190, 306, 500, 255, 272, 448, 500, 409, 256, 320, 158, 500, 351, 170, 147, 160, 179, 190, 260, 294, 233, 85, 109, 102, 107, 110, 116, 95, 330, 110, 232, 190, 240, 235, 245, 286, 202, 202, 201, 170, 207, 500, 341, 500, 456, 500, 500, 500, 500, 234, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [ 0.02256579  0.00739323 -0.01720794 -0.00373275]
info {}
initial_state tf.Tensor([ 0.02256579  0.00739323 -0.01720794 -0.00373275], shape=(4,), dtype=float32)
enter env_step

  2%|▏         | 223/10000 [00:35<44:04,  3.70it/s, episode_reward=500, running_reward=280]

episode_reward 500
episodes_reward deque([133, 168, 141, 173, 129, 145, 129, 127, 167, 193, 122, 165, 158, 101, 154, 138, 156, 134, 176, 155, 173, 156, 189, 152, 208, 155, 146, 392, 198, 348, 211, 308, 205, 465, 190, 306, 500, 255, 272, 448, 500, 409, 256, 320, 158, 500, 351, 170, 147, 160, 179, 190, 260, 294, 233, 85, 109, 102, 107, 110, 116, 95, 330, 110, 232, 190, 240, 235, 245, 286, 202, 202, 201, 170, 207, 500, 341, 500, 456, 500, 500, 500, 500, 234, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [-0.02566531 -0.00513291 -0.00797736  0.03210044]
info {}
initial_state tf.Tensor([-0.02566531 -0.00513291 -0.00797736  0.03210044], shape=(4,), dtype=float32)
enter env_step

  2%|▏         | 224/10000 [00:35<45:35,  3.57it/s, episode_reward=500, running_reward=283]

enter env_step

  2%|▏         | 225/10000 [00:35<49:25,  3.30it/s, episode_reward=500, running_reward=287]

episode_reward 500
episodes_reward deque([141, 173, 129, 145, 129, 127, 167, 193, 122, 165, 158, 101, 154, 138, 156, 134, 176, 155, 173, 156, 189, 152, 208, 155, 146, 392, 198, 348, 211, 308, 205, 465, 190, 306, 500, 255, 272, 448, 500, 409, 256, 320, 158, 500, 351, 170, 147, 160, 179, 190, 260, 294, 233, 85, 109, 102, 107, 110, 116, 95, 330, 110, 232, 190, 240, 235, 245, 286, 202, 202, 201, 170, 207, 500, 341, 500, 456, 500, 500, 500, 500, 234, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [-0.04853325 -0.02024443 -0.0269582  -0.03585042]
info {}
initial_state tf.Tensor([-0.04853325 -0.02024443 -0.0269582  -0.03585042], shape=(4,), dtype=float32)
enter env_step

  2%|▏         | 226/10000 [00:36<48:43,  3.34it/s, episode_reward=500, running_reward=290]

enter env_step
episode_reward 500
episodes_reward deque([173, 129, 145, 129, 127, 167, 193, 122, 165, 158, 101, 154, 138, 156, 134, 176, 155, 173, 156, 189, 152, 208, 155, 146, 392, 198, 348, 211, 308, 205, 465, 190, 306, 500, 255, 272, 448, 500, 409, 256, 320, 158, 500, 351, 170, 147, 160, 179, 190, 260, 294, 233, 85, 109, 102, 107, 110, 116, 95, 330, 110, 232, 190, 240, 235, 245, 286, 202, 202, 201, 170, 207, 500, 341, 500, 456, 500, 500, 500, 500, 234, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 

  2%|▏         | 227/10000 [00:36<48:33,  3.35it/s, episode_reward=500, running_reward=294]

enter env_step

  2%|▏         | 228/10000 [00:36<47:15,  3.45it/s, episode_reward=500, running_reward=297]

episode_reward 500
episodes_reward deque([145, 129, 127, 167, 193, 122, 165, 158, 101, 154, 138, 156, 134, 176, 155, 173, 156, 189, 152, 208, 155, 146, 392, 198, 348, 211, 308, 205, 465, 190, 306, 500, 255, 272, 448, 500, 409, 256, 320, 158, 500, 351, 170, 147, 160, 179, 190, 260, 294, 233, 85, 109, 102, 107, 110, 116, 95, 330, 110, 232, 190, 240, 235, 245, 286, 202, 202, 201, 170, 207, 500, 341, 500, 456, 500, 500, 500, 500, 234, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [-0.0310006   0.01235388  0.01336836  0.04860901]
info {}
initial_state tf.Tensor([-0.0310006   0.01235388  0.01336836  0.04860901], shape=(4,), dtype=float32)
enter env_step

  2%|▏         | 228/10000 [00:37<47:15,  3.45it/s, episode_reward=500, running_reward=301]

enter env_step

  2%|▏         | 229/10000 [00:37<51:00,  3.19it/s, episode_reward=500, running_reward=301]

initial_state [-0.03039841  0.02076653  0.03058554 -0.04803947]
info {}
initial_state tf.Tensor([-0.03039841  0.02076653  0.03058554 -0.04803947], shape=(4,), dtype=float32)
enter env_step

  2%|▏         | 230/10000 [00:37<55:02,  2.96it/s, episode_reward=500, running_reward=305]

enter env_step

  2%|▏         | 230/10000 [00:37<55:02,  2.96it/s, episode_reward=500, running_reward=308]

enter env_step

  2%|▏         | 231/10000 [00:37<58:09,  2.80it/s, episode_reward=500, running_reward=308]

initial_state [-0.01931148 -0.03436536 -0.02377265  0.04328324]
info {}
initial_state tf.Tensor([-0.01931148 -0.03436536 -0.02377265  0.04328324], shape=(4,), dtype=float32)
enter env_step

  2%|▏         | 232/10000 [00:38<1:02:30,  2.60it/s, episode_reward=500, running_reward=312]

episode_reward 500
episodes_reward deque([193, 122, 165, 158, 101, 154, 138, 156, 134, 176, 155, 173, 156, 189, 152, 208, 155, 146, 392, 198, 348, 211, 308, 205, 465, 190, 306, 500, 255, 272, 448, 500, 409, 256, 320, 158, 500, 351, 170, 147, 160, 179, 190, 260, 294, 233, 85, 109, 102, 107, 110, 116, 95, 330, 110, 232, 190, 240, 235, 245, 286, 202, 202, 201, 170, 207, 500, 341, 500, 456, 500, 500, 500, 500, 234, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [ 0.04200706  0.03498891 -0.02245535  0.00140696]
info {}
initial_state tf.Tensor([ 0.04200706  0.03498891 -0.02245535  0.00140696], shape=(4,), dtype=float32)
enter env_step

  2%|▏         | 233/10000 [00:38<1:04:34,  2.52it/s, episode_reward=500, running_reward=315]

episode_reward 500
episodes_reward deque([122, 165, 158, 101, 154, 138, 156, 134, 176, 155, 173, 156, 189, 152, 208, 155, 146, 392, 198, 348, 211, 308, 205, 465, 190, 306, 500, 255, 272, 448, 500, 409, 256, 320, 158, 500, 351, 170, 147, 160, 179, 190, 260, 294, 233, 85, 109, 102, 107, 110, 116, 95, 330, 110, 232, 190, 240, 235, 245, 286, 202, 202, 201, 170, 207, 500, 341, 500, 456, 500, 500, 500, 500, 234, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [ 0.0210661   0.04854957 -0.03842244  0.04689028]
info {}
initial_state tf.Tensor([ 0.0210661   0.04854957 -0.03842244  0.04689028], shape=(4,), dtype=float32)
enter env_step

  2%|▏         | 233/10000 [00:39<1:04:34,  2.52it/s, episode_reward=500, running_reward=318]

enter env_step

  2%|▏         | 234/10000 [00:39<1:05:53,  2.47it/s, episode_reward=500, running_reward=318]

initial_state [-0.02969386 -0.00022323  0.02117301 -0.04698949]
info {}
initial_state tf.Tensor([-0.02969386 -0.00022323  0.02117301 -0.04698949], shape=(4,), dtype=float32)
enter env_step

  2%|▏         | 235/10000 [00:39<1:02:19,  2.61it/s, episode_reward=390, running_reward=321]

enter env_step
episode_reward 390
episodes_reward deque([158, 101, 154, 138, 156, 134, 176, 155, 173, 156, 189, 152, 208, 155, 146, 392, 198, 348, 211, 308, 205, 465, 190, 306, 500, 255, 272, 448, 500, 409, 256, 320, 158, 500, 351, 170, 147, 160, 179, 190, 260, 294, 233, 85, 109, 102, 107, 110, 116, 95, 330, 110, 232, 190, 

  2%|▏         | 235/10000 [00:39<1:02:19,  2.61it/s, episode_reward=344, running_reward=323]

enter env_step

  2%|▏         | 236/10000 [00:39<57:43,  2.82it/s, episode_reward=344, running_reward=323]  

initial_state [-0.02615773  0.0411133   0.03686468 -0.00927232]
info {}
initial_state tf.Tensor([-0.02615773  0.0411133   0.03686468 -0.00927232], shape=(4,), dtype=float32)
enter env_step

  2%|▏         | 237/10000 [00:40<58:52,  2.76it/s, episode_reward=500, running_reward=327]

enter env_step

  2%|▏         | 238/10000 [00:40<56:05,  2.90it/s, episode_reward=500, running_reward=330]

episode_reward 500
episodes_reward deque([138, 156, 134, 176, 155, 173, 156, 189, 152, 208, 155, 146, 392, 198, 348, 211, 308, 205, 465, 190, 306, 500, 255, 272, 448, 500, 409, 256, 320, 158, 500, 351, 170, 147, 160, 179, 190, 260, 294, 233, 85, 109, 102, 107, 110, 116, 95, 330, 110, 232, 190, 240, 235, 245, 286, 202, 202, 201, 170, 207, 500, 341, 500, 456, 500, 500, 500, 500, 234, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 390, 344, 500, 500], maxlen=100)
initial_state [ 0.02250731 -0.03731199  0.01995146  0.03639168]
info {}
initial_state tf.Tensor([ 0.02250731 -0.03731199  0.01995146  0.03639168], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
ent

  2%|▏         | 239/10000 [00:40<54:42,  2.97it/s, episode_reward=500, running_reward=334]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  2%|▏         | 241/10000 [00:41<44:11,  3.68it/s, episode_reward=271, running_reward=338]

episode_reward 500
episodes_reward deque([134, 176, 155, 173, 156, 189, 152, 208, 155, 146, 392, 198, 348, 211, 308, 205, 465, 190, 306, 500, 255, 272, 448, 500, 409, 256, 320, 158, 500, 351, 170, 147, 160, 179, 190, 260, 294, 233, 85, 109, 102, 107, 110, 116, 95, 330, 110, 232, 190, 240, 235, 245, 286, 202, 202, 201, 170, 207, 500, 341, 500, 456, 500, 500, 500, 500, 234, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 390, 344, 500, 500, 500, 500], maxlen=100)
initial_state [-0.02716829 -0.03235304  0.0103634  -0.01517358]
info {}
initial_state tf.Tensor([-0.02716829 -0.03235304  0.0103634  -0.01517358], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
ent

  2%|▏         | 242/10000 [00:41<41:23,  3.93it/s, episode_reward=394, running_reward=341]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  2%|▏         | 244/10000 [00:41<28:06,  5.78it/s, episode_reward=82, running_reward=339]

episode_reward 114
episodes_reward deque([173, 156, 189, 152, 208, 155, 146, 392, 198, 348, 211, 308, 205, 465, 190, 306, 500, 255, 272, 448, 500, 409, 256, 320, 158, 500, 351, 170, 147, 160, 179, 190, 260, 294, 233, 85, 109, 102, 107, 110, 116, 95, 330, 110, 232, 190, 240, 235, 245, 286, 202, 202, 201, 170, 207, 500, 341, 500, 456, 500, 500, 500, 500, 234, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 390, 344, 500, 500, 500, 500, 271, 394, 114], maxlen=100)
initial_state [-0.03063913  0.03505427 -0.00535024 -0.03648394]
info {}
initial_state tf.Tensor([-0.03063913  0.03505427 -0.00535024 -0.03648394], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
ent

  2%|▏         | 248/10000 [00:41<17:10,  9.46it/s, episode_reward=110, running_reward=336]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
episode_reward 154
episodes_reward deque([152, 208, 155, 146, 392, 198, 348, 211, 308, 205, 465, 190, 306, 500, 255, 272, 448, 500, 409, 256, 320, 158, 500, 351, 170, 147, 160, 179, 190, 260, 294, 233, 85, 109, 102, 107, 110, 116, 95, 330, 110, 232, 190, 240, 235, 245, 286, 202, 202, 201, 170, 207, 500, 341, 500, 456, 500, 500, 500, 500, 234, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 390, 344, 500, 500, 500, 500, 271, 394, 114, 109, 82, 154], maxlen=100)
initial_state [ 0.04608076  0.00027175 -0.04633032 -0.03987687]
info {}
initial_state tf.Tensor([ 0.04608076  0.00027175 -0.04633032 -0.03987687], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
ente

  2%|▎         | 250/10000 [00:42<15:41, 10.35it/s, episode_reward=126, running_reward=334]

initial_state [0.00665858 0.0336121  0.00670198 0.0378124 ]
info {}
initial_state tf.Tensor([0.00665858 0.0336121  0.00670198 0.0378124 ], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env

  3%|▎         | 252/10000 [00:42<14:47, 10.99it/s, episode_reward=125, running_reward=331]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
episode_reward 119
episodes_reward deque([348, 211, 308, 205, 465, 190, 306, 500, 255, 272, 448, 500, 409, 256, 320, 158, 500, 351

  3%|▎         | 254/10000 [00:42<15:48, 10.27it/s, episode_reward=120, running_reward=330]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
episode_reward 120
episodes_reward deque([308, 205, 465, 190, 306, 500

  3%|▎         | 256/10000 [00:42<18:50,  8.62it/s, episode_reward=148, running_reward=330]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  3%|▎         | 258/10000 [00:42<17:31,  9.26it/s, episode_reward=115, running_reward=324]

enter env_step
enter env_step
episode_reward 127
episodes_reward deque([190, 306, 500, 255, 272, 448, 500, 409, 256, 320, 158, 500, 351, 170, 147, 160, 179, 190, 260, 294, 233, 85, 109, 102, 107, 110, 116, 95, 330, 110, 232, 190, 240, 235, 245, 286, 202, 202, 201, 170, 207, 500, 341, 500, 456, 500, 500, 500, 500, 234, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 390, 344, 500, 500, 500, 500, 271, 394, 114, 109, 82, 154, 52, 129, 110, 146, 126, 119, 125, 120, 406, 148, 127], maxlen=100)
initial_state [-0.02150937 -0.01326185  0.03240913  0.04275018]
info {}
initial_state tf.Tensor([-0.02150937 -0.01326185  0.03240913  0.04275018], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter

  3%|▎         | 260/10000 [00:43<16:01, 10.13it/s, episode_reward=103, running_reward=319]

initial_state [ 0.00109107 -0.04509362  0.00953288  0.00077348]
info {}
initial_state tf.Tensor([ 0.00109107 -0.04509362  0.00953288  0.00077348], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
e

  3%|▎         | 262/10000 [00:43<14:16, 11.37it/s, episode_reward=96, running_reward=313]

initial_state [ 0.0343637  -0.0073126  -0.0072095  -0.00217197]
info {}
initial_state tf.Tensor([ 0.0343637  -0.0073126  -0.0072095  -0.00217197], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
episode_reward 29
episodes_reward deque([448, 500, 409, 256, 320, 158, 500, 351, 170, 147, 160, 179, 190, 260, 294, 233, 85, 109, 102, 107, 110, 116, 95, 330, 110, 232, 190, 240, 235, 245, 286, 202, 202, 201, 170, 207, 500, 341, 500, 456, 500, 500, 500, 500, 234, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 50

  3%|▎         | 264/10000 [00:43<18:27,  8.79it/s, episode_reward=500, running_reward=313]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  3%|▎         | 266/10000 [00:43<17:35,  9.23it/s, episode_reward=147, running_reward=308]

initial_state [-0.03625597 -0.02054708  0.03874961 -0.02247309]
info {}
initial_state tf.Tensor([-0.03625597 -0.02054708  0.03874961 -0.02247309], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
e

  3%|▎         | 268/10000 [00:43<15:59, 10.14it/s, episode_reward=108, running_reward=302]

episode_reward 127
episodes_reward deque([158, 500, 351, 170, 147, 160, 179, 190, 260, 294, 233, 85, 109, 102, 107, 110, 116, 95, 330, 110, 232, 190, 240, 235, 245, 286, 202, 202, 201, 170, 207, 500, 341, 500, 456, 500, 500, 500, 500, 234, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 390, 344, 500, 500, 500, 500, 271, 394, 114, 109, 82, 154, 52, 129, 110, 146, 126, 119, 125, 120, 406, 148, 127, 130, 115, 110, 103, 29, 96, 500, 69, 147, 127], maxlen=100)
initial_state [-0.04597812  0.01684915 -0.03553956  0.03539432]
info {}
initial_state tf.Tensor([-0.04597812  0.01684915 -0.03553956  0.03539432], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter en

  3%|▎         | 270/10000 [00:44<23:48,  6.81it/s, episode_reward=500, running_reward=304]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  3%|▎         | 271/10000 [00:44<32:05,  5.05it/s, episode_reward=500, running_reward=307]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  3%|▎         | 272/10000 [00:45<38:33,  4.21it/s, episode_reward=500, running_reward=310]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  3%|▎         | 273/10000 [00:45<43:39,  3.71it/s, episode_reward=500, running_reward=314]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
episode_reward 500
episodes_reward deque([179, 190, 260, 294, 233, 85, 109, 102, 107, 110, 116, 95, 330, 110, 232, 190, 240, 235, 245, 286, 202, 202, 201, 170, 207, 500, 341, 500, 456, 500, 500, 500, 500, 234, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 390, 344, 500, 500, 500, 500, 271, 394, 114, 109, 82, 154, 52, 129, 110, 146, 126, 119, 125, 120, 406, 148, 127, 130, 115, 110, 103, 29, 96, 500, 69, 147, 127, 118, 108, 500, 500, 500, 500], maxlen=100)
initial_state [-0.02023784  0.03825769 -0.00985331  0.00629778]
info {}
initial_state tf.Tensor([-0.02023784  0.03825769 -0.00985331  0.00629778], shape=(4,), dtype=float32)
enter env_step
enter en

  3%|▎         | 274/10000 [00:46<48:51,  3.32it/s, episode_reward=102, running_reward=316]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
episode_reward 500
episodes_reward deque([190, 260, 294, 233, 85, 109, 102, 107, 110, 116, 95, 330, 110, 232, 190, 240, 235, 245, 286, 202, 202, 201, 170, 207, 500, 341, 500, 456, 500, 500, 500, 500, 234, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 390, 344, 500, 500, 500, 500, 271, 394, 114, 109, 82, 154, 52, 129, 110, 146, 126, 119, 125, 120, 406, 148, 127, 130, 115, 110, 103, 29, 96, 500, 69, 147, 127, 118, 108, 500, 500, 500, 500, 500], maxlen=100)
initial_state [ 0.03109401 -0.04584027 -0.00628318 -0.03520425]
info {}
initial_state tf.Tensor([ 0.03109401 -0.04584027 -0.00628318 -0.03520425], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter en

  3%|▎         | 276/10000 [00:46<44:33,  3.64it/s, episode_reward=500, running_reward=319]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  3%|▎         | 277/10000 [00:46<50:53,  3.18it/s, episode_reward=500, running_reward=321]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  3%|▎         | 278/10000 [00:47<53:45,  3.01it/s, episode_reward=500, running_reward=323]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  3%|▎         | 279/10000 [00:47<54:19,  2.98it/s, episode_reward=500, running_reward=327]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  3%|▎         | 280/10000 [00:47<53:11,  3.05it/s, episode_reward=500, running_reward=331]

episode_reward 500
episodes_reward deque([102, 107, 110, 116, 95, 330, 110, 232, 190, 240, 235, 245, 286, 202, 202, 201, 170, 207, 500, 341, 500, 456, 500, 500, 500, 500, 234, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 390, 344, 500, 500, 500, 500, 271, 394, 114, 109, 82, 154, 52, 129, 110, 146, 126, 119, 125, 120, 406, 148, 127, 130, 115, 110, 103, 29, 96, 500, 69, 147, 127, 118, 108, 500, 500, 500, 500, 500, 102, 500, 500, 500, 500, 500], maxlen=100)
initial_state [ 0.00349623 -0.04992458  0.013384    0.04242913]
info {}
initial_state tf.Tensor([ 0.00349623 -0.04992458  0.013384    0.04242913], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter e

  3%|▎         | 281/10000 [00:48<52:30,  3.09it/s, episode_reward=121, running_reward=335]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  3%|▎         | 283/10000 [00:48<43:43,  3.70it/s, episode_reward=500, running_reward=339]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  3%|▎         | 283/10000 [00:48<43:43,  3.70it/s, episode_reward=500, running_reward=343]

enter env_step

  3%|▎         | 284/10000 [00:48<44:09,  3.67it/s, episode_reward=500, running_reward=343]

initial_state [-0.03146115  0.0475545  -0.01892064 -0.03534974]
info {}
initial_state tf.Tensor([-0.03146115  0.0475545  -0.01892064 -0.03534974], shape=(4,), dtype=float32)
enter env_step

  3%|▎         | 285/10000 [00:49<45:07,  3.59it/s, episode_reward=500, running_reward=347]

enter env_step

  3%|▎         | 286/10000 [00:49<47:32,  3.41it/s, episode_reward=500, running_reward=349]

episode_reward 500
episodes_reward deque([110, 232, 190, 240, 235, 245, 286, 202, 202, 201, 170, 207, 500, 341, 500, 456, 500, 500, 500, 500, 234, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 390, 344, 500, 500, 500, 500, 271, 394, 114, 109, 82, 154, 52, 129, 110, 146, 126, 119, 125, 120, 406, 148, 127, 130, 115, 110, 103, 29, 96, 500, 69, 147, 127, 118, 108, 500, 500, 500, 500, 500, 102, 500, 500, 500, 500, 500, 500, 121, 500, 500, 500, 500], maxlen=100)
initial_state [-0.04837527 -0.03116997 -0.04053343  0.02140818]
info {}
initial_state tf.Tensor([-0.04837527 -0.03116997 -0.04053343  0.02140818], shape=(4,), dtype=float32)
enter env_step

  3%|▎         | 287/10000 [00:49<47:27,  3.41it/s, episode_reward=500, running_reward=353]

enter env_step
episode_reward 500
episodes_reward deque([232, 190, 240, 235, 245, 286, 202, 202, 201, 170, 207, 500, 341, 500, 456, 500, 500, 500, 500, 234, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 390, 344, 500, 500, 500, 500, 271, 394, 114, 109, 82, 154, 52, 129, 110, 146, 126, 119, 125, 120, 406, 148, 127, 130, 115, 110, 103, 29, 96, 500, 69, 147, 127, 118, 108, 500, 500, 500, 500, 500, 102, 500, 500, 500, 500, 500, 500, 121, 500, 500, 500, 500, 500], maxlen=100)
initial_state [ 0.01687968  0.00746258  0.031

  3%|▎         | 288/10000 [00:50<42:19,  3.83it/s, episode_reward=348, running_reward=354]

enter env_step
episode_reward 348
episodes_reward deque([190, 240, 235, 245, 286, 202, 202, 201, 170, 207, 500, 341, 500, 456, 500, 500, 500, 500, 234, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 390, 344, 500, 500, 500

  3%|▎         | 289/10000 [00:50<43:54,  3.69it/s, episode_reward=500, running_reward=357]

enter env_step

  3%|▎         | 290/10000 [00:50<43:56,  3.68it/s, episode_reward=113, running_reward=358]

initial_state [-0.01119541  0.02184445 -0.02267656  0.04882118]
info {}
initial_state tf.Tensor([-0.01119541  0.02184445 -0.02267656  0.04882118], shape=(4,), dtype=float32)
enter env_step

  3%|▎         | 292/10000 [00:50<36:12,  4.47it/s, episode_reward=500, running_reward=361]

enter env_step

  3%|▎         | 293/10000 [00:51<41:27,  3.90it/s, episode_reward=500, running_reward=363]

episode_reward 500
episodes_reward deque([202, 202, 201, 170, 207, 500, 341, 500, 456, 500, 500, 500, 500, 234, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 390, 344, 500, 500, 500, 500, 271, 394, 114, 109, 82, 154, 52, 129, 110, 146, 126, 119, 125, 120, 406, 148, 127, 130, 115, 110, 103, 29, 96, 500, 69, 147, 127, 118, 108, 500, 500, 500, 500, 500, 102, 500, 500, 500, 500, 500, 500, 121, 500, 500, 500, 500, 500, 348, 500, 500, 113, 500, 500], maxlen=100)
initial_state [-0.04147956 -0.01107444  0.01384948 -0.04453497]
info {}
initial_state tf.Tensor([-0.04147956 -0.01107444  0.01384948 -0.04453497], shape=(4,), dtype=float32)
enter env_step

  3%|▎         | 293/10000 [00:51<41:27,  3.90it/s, episode_reward=500, running_reward=366]

enter env_step

  3%|▎         | 294/10000 [00:51<47:45,  3.39it/s, episode_reward=500, running_reward=366]

initial_state [-0.00647949 -0.0231065   0.03320239  0.00575632]
info {}
initial_state tf.Tensor([-0.00647949 -0.0231065   0.03320239  0.00575632], shape=(4,), dtype=float32)
enter env_step

  3%|▎         | 295/10000 [00:52<59:31,  2.72it/s, episode_reward=500, running_reward=369]

enter env_step
episode_reward 500
episodes_reward deque([201, 170, 207, 500, 341, 500, 456, 500, 500, 500, 500, 234, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 390, 344, 500, 500, 500, 500, 271, 394, 114, 109, 82, 154, 52, 129, 110, 146, 126, 119, 125, 120, 406, 148, 127, 130, 115, 110, 103, 29, 96, 500, 69, 147, 127, 118, 108, 500, 500, 500, 500, 500, 102, 500, 500, 500, 500, 500, 500, 121, 500, 500, 500, 500, 500, 348, 500, 500, 113, 500, 500, 500, 500], maxlen=100)
initial_state [-0.0285774  -0.02064749  0.032

  3%|▎         | 295/10000 [00:52<59:31,  2.72it/s, episode_reward=500, running_reward=372]

enter env_step

  3%|▎         | 296/10000 [00:52<1:04:25,  2.51it/s, episode_reward=500, running_reward=372]

initial_state [-0.00491039  0.04706926 -0.04503821  0.01422692]
info {}
initial_state tf.Tensor([-0.00491039  0.04706926 -0.04503821  0.01422692], shape=(4,), dtype=float32)
enter env_step

  3%|▎         | 297/10000 [00:53<1:06:13,  2.44it/s, episode_reward=500, running_reward=375]

episode_reward 500
episodes_reward deque([207, 500, 341, 500, 456, 500, 500, 500, 500, 234, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 390, 344, 500, 500, 500, 500, 271, 394, 114, 109, 82, 154, 52, 129, 110, 146, 126, 119, 125, 120, 406, 148, 127, 130, 115, 110, 103, 29, 96, 500, 69, 147, 127, 118, 108, 500, 500, 500, 500, 500, 102, 500, 500, 500, 500, 500, 500, 121, 500, 500, 500, 500, 500, 348, 500, 500, 113, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [-0.04826293 -0.00301144 -0.04293631  0.00110469]
info {}
initial_state tf.Tensor([-0.04826293 -0.00301144 -0.04293631  0.00110469], shape=(4,), dtype=float32)
enter env_step

  3%|▎         | 298/10000 [00:53<1:08:28,  2.36it/s, episode_reward=500, running_reward=378]

episode_reward 500
episodes_reward deque([500, 341, 500, 456, 500, 500, 500, 500, 234, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 390, 344, 500, 500, 500, 500, 271, 394, 114, 109, 82, 154, 52, 129, 110, 146, 126, 119, 125, 120, 406, 148, 127, 130, 115, 110, 103, 29, 96, 500, 69, 147, 127, 118, 108, 500, 500, 500, 500, 500, 102, 500, 500, 500, 500, 500, 500, 121, 500, 500, 500, 500, 500, 348, 500, 500, 113, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [-0.03312075  0.02441748 -0.00112274 -0.04626407]
info {}
initial_state tf.Tensor([-0.03312075  0.02441748 -0.00112274 -0.04626407], shape=(4,), dtype=float32)
enter env_step

  3%|▎         | 298/10000 [00:54<1:08:28,  2.36it/s, episode_reward=500, running_reward=378]

enter env_step

  3%|▎         | 299/10000 [00:54<1:07:23,  2.40it/s, episode_reward=500, running_reward=378]

initial_state [-0.00193376  0.0304219   0.03451307  0.03315598]
info {}
initial_state tf.Tensor([-0.00193376  0.0304219   0.03451307  0.03315598], shape=(4,), dtype=float32)
enter env_step

  3%|▎         | 300/10000 [00:54<1:08:06,  2.37it/s, episode_reward=500, running_reward=380]

episode_reward 500
episodes_reward deque([500, 456, 500, 500, 500, 500, 234, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 390, 344, 500, 500, 500, 500, 271, 394, 114, 109, 82, 154, 52, 129, 110, 146, 126, 119, 125, 120, 406, 148, 127, 130, 115, 110, 103, 29, 96, 500, 69, 147, 127, 118, 108, 500, 500, 500, 500, 500, 102, 500, 500, 500, 500, 500, 500, 121, 500, 500, 500, 500, 500, 348, 500, 500, 113, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [ 0.02523114 -0.02368444  0.0383695   0.03913577]
info {}
initial_state tf.Tensor([ 0.02523114 -0.02368444  0.0383695   0.03913577], shape=(4,), dtype=float32)
enter env_step

  3%|▎         | 301/10000 [00:54<1:08:30,  2.36it/s, episode_reward=500, running_reward=380]

episode_reward 500
episodes_reward deque([456, 500, 500, 500, 500, 234, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 390, 344, 500, 500, 500, 500, 271, 394, 114, 109, 82, 154, 52, 129, 110, 146, 126, 119, 125, 120, 406, 148, 127, 130, 115, 110, 103, 29, 96, 500, 69, 147, 127, 118, 108, 500, 500, 500, 500, 500, 102, 500, 500, 500, 500, 500, 500, 121, 500, 500, 500, 500, 500, 348, 500, 500, 113, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [0.04443916 0.03928763 0.00142029 0.03546554]
info {}
initial_state tf.Tensor([0.04443916 0.03928763 0.00142029 0.03546554], shape=(4,), dtype=float32)
enter env_step

  3%|▎         | 302/10000 [00:55<1:03:44,  2.54it/s, episode_reward=500, running_reward=380]

enter env_step

enter env_step

  3%|▎         | 303/10000 [00:55<59:03,  2.74it/s, episode_reward=500, running_reward=380]  

initial_state [0.02091543 0.02084102 0.03433522 0.00469204]
info {}
initial_state tf.Tensor([0.02091543 0.02084102 0.03433522 0.00469204], shape=(4,), dtype=float32)
enter env_step

  3%|▎         | 304/10000 [00:55<57:25,  2.81it/s, episode_reward=500, running_reward=380]

enter env_step

  3%|▎         | 304/10000 [00:56<57:25,  2.81it/s, episode_reward=500, running_reward=380]

enter env_step

  3%|▎         | 305/10000 [00:56<54:34,  2.96it/s, episode_reward=500, running_reward=380]

initial_state [ 0.02271928 -0.03844029 -0.04522346 -0.00480818]
info {}
initial_state tf.Tensor([ 0.02271928 -0.03844029 -0.04522346 -0.00480818], shape=(4,), dtype=float32)
enter env_step

  3%|▎         | 306/10000 [00:56<52:08,  3.10it/s, episode_reward=500, running_reward=380]

episode_reward 500
episodes_reward deque([234, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 390, 344, 500, 500, 500, 500, 271, 394, 114, 109, 82, 154, 52, 129, 110, 146, 126, 119, 125, 120, 406, 148, 127, 130, 115, 110, 103, 29, 96, 500, 69, 147, 127, 118, 108, 500, 500, 500, 500, 500, 102, 500, 500, 500, 500, 500, 500, 121, 500, 500, 500, 500, 500, 348, 500, 500, 113, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [ 0.0237767  -0.00289704  0.04501394  0.03030607]
info {}
initial_state tf.Tensor([ 0.0237767  -0.00289704  0.04501394  0.03030607], shape=(4,), dtype=float32)
enter env_step

  3%|▎         | 307/10000 [00:56<51:32,  3.13it/s, episode_reward=500, running_reward=383]

enter env_step
episode_reward 500
episodes_reward deque([500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 390, 344, 500, 500, 500, 500, 271, 394, 114, 109, 82, 154, 52, 129, 110, 

  3%|▎         | 308/10000 [00:57<51:22,  3.14it/s, episode_reward=500, running_reward=383]

episode_reward 500
episodes_reward deque([500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 390, 344, 500, 500, 500, 500, 271, 394, 114, 109, 82, 154, 52, 129, 110, 146, 126, 119, 125, 120, 406, 148, 127, 130, 115, 110, 103, 29, 96, 500, 69, 147, 127, 118, 108, 500, 500, 500, 500, 500, 102, 500, 500, 500, 500, 500, 500, 121, 500, 500, 500, 500, 500, 348, 500, 500, 113, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [ 0.00475181  0.0011575  -0.00325448 -0.02280125]
info {}
initial_state tf.Tensor([ 0.00475181  0.0011575  -0.00325448 -0.02280125], shape=(4,), dtype=float32)
enter env_step

  3%|▎         | 309/10000 [00:57<52:36,  3.07it/s, episode_reward=500, running_reward=383]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  3%|▎         | 310/10000 [00:57<51:00,  3.17it/s, episode_reward=500, running_reward=383]

episode_reward 500
episodes_reward deque([500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 390, 344, 500, 500, 500, 500, 271, 394, 114, 109, 82, 154, 52, 129, 110, 146, 126, 119, 125, 120, 406, 148, 127, 130, 115, 110, 103, 29, 96, 500, 69, 147, 127, 118, 108, 500, 500, 500, 500, 500, 102, 500, 500, 500, 500, 500, 500, 121, 500, 500, 500, 500, 500, 348, 500, 500, 113, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [-0.048207    0.02774276  0.00259927  0.02348295]
info {}
initial_state tf.Tensor([-0.048207    0.02774276  0.00259927  0.02348295], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter 

  3%|▎         | 311/10000 [00:58<52:00,  3.11it/s, episode_reward=500, running_reward=383]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
episode_reward 500
episodes_reward deque([500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 390, 344, 500, 500, 500, 500, 271, 394, 114, 109, 82, 154, 52, 129, 110, 146, 126, 119, 125, 120, 406, 148, 127, 130, 115, 110, 103, 29, 96, 500, 69, 147, 127, 118, 108, 500, 500, 500, 500, 500, 102, 500, 500, 500, 500, 500, 500, 121, 500, 500, 500, 500, 500, 348, 500, 500, 113, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500

  3%|▎         | 312/10000 [00:58<50:12,  3.22it/s, episode_reward=500, running_reward=383]

episode_reward 500
episodes_reward deque([500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 390, 344, 500, 500, 500, 500, 271, 394, 114, 109, 82, 154, 52, 129, 110, 146, 126, 119, 125, 120, 406, 148, 127, 130, 115, 110, 103, 29, 96, 500, 69, 147, 127, 118, 108, 500, 500, 500, 500, 500, 102, 500, 500, 500, 500, 500, 500, 121, 500, 500, 500, 500, 500, 348, 500, 500, 113, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [-0.00130884  0.01301463 -0.00935122  0.02358047]
info {}
initial_state tf.Tensor([-0.00130884  0.01301463 -0.00935122  0.02358047], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter 

  3%|▎         | 313/10000 [00:58<50:31,  3.20it/s, episode_reward=500, running_reward=383]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  3%|▎         | 314/10000 [00:59<54:32,  2.96it/s, episode_reward=500, running_reward=383]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  3%|▎         | 315/10000 [00:59<57:31,  2.81it/s, episode_reward=500, running_reward=383]

episode_reward 500
episodes_reward deque([500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 390, 344, 500, 500, 500, 500, 271, 394, 114, 109, 82, 154, 52, 129, 110, 146, 126, 119, 125, 120, 406, 148, 127, 130, 115, 110, 103, 29, 96, 500, 69, 147, 127, 118, 108, 500, 500, 500, 500, 500, 102, 500, 500, 500, 500, 500, 500, 121, 500, 500, 500, 500, 500, 348, 500, 500, 113, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [-0.00979917 -0.02583158 -0.00373683 -0.03309635]
info {}
initial_state tf.Tensor([-0.00979917 -0.02583158 -0.00373683 -0.03309635], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter 

  3%|▎         | 316/10000 [00:59<1:00:19,  2.68it/s, episode_reward=500, running_reward=383]

episode_reward 500
episodes_reward deque([500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 390, 344, 500, 500, 500, 500, 271, 394, 114, 109, 82, 154, 52, 129, 110, 146, 126, 119, 125, 120, 406, 148, 127, 130, 115, 110, 103, 29, 96, 500, 69, 147, 127, 118, 108, 500, 500, 500, 500, 500, 102, 500, 500, 500, 500, 500, 500, 121, 500, 500, 500, 500, 500, 348, 500, 500, 113, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [-0.0008789   0.02304877  0.03715757 -0.02996839]
info {}
initial_state tf.Tensor([-0.0008789   0.02304877  0.03715757 -0.02996839], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter 

  3%|▎         | 317/10000 [01:00<1:10:23,  2.29it/s, episode_reward=500, running_reward=383]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
episode_reward 500
episodes_reward deque([500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 390, 344, 500, 500, 500, 500, 271, 394, 114, 109, 82, 154, 52, 129, 110, 146, 126, 119, 125, 120, 406, 148, 127, 130, 115, 110, 103, 29, 96, 500, 69, 147, 127, 118, 108, 500, 500, 500, 500, 500, 102, 500, 500, 500, 500, 500, 500, 121, 500, 500, 500, 500, 500, 348, 500, 500, 113, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500

  3%|▎         | 318/10000 [01:00<1:14:28,  2.17it/s, episode_reward=500, running_reward=383]

episode_reward 500
episodes_reward deque([500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 390, 344, 500, 500, 500, 500, 271, 394, 114, 109, 82, 154, 52, 129, 110, 146, 126, 119, 125, 120, 406, 148, 127, 130, 115, 110, 103, 29, 96, 500, 69, 147, 127, 118, 108, 500, 500, 500, 500, 500, 102, 500, 500, 500, 500, 500, 500, 121, 500, 500, 500, 500, 500, 348, 500, 500, 113, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [-0.04388517  0.00389874  0.01128112  0.03129701]
info {}
initial_state tf.Tensor([-0.04388517  0.00389874  0.01128112  0.03129701], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter 

  3%|▎         | 319/10000 [01:01<1:13:09,  2.21it/s, episode_reward=500, running_reward=383]

episode_reward 500
episodes_reward deque([500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 390, 344, 500, 500, 500, 500, 271, 394, 114, 109, 82, 154, 52, 129, 110, 146, 126, 119, 125, 120, 406, 148, 127, 130, 115, 110, 103, 29, 96, 500, 69, 147, 127, 118, 108, 500, 500, 500, 500, 500, 102, 500, 500, 500, 500, 500, 500, 121, 500, 500, 500, 500, 500, 348, 500, 500, 113, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [ 0.01753598 -0.04341508  0.0275643   0.00457312]
info {}
initial_state tf.Tensor([ 0.01753598 -0.04341508  0.0275643   0.00457312], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter 

  3%|▎         | 319/10000 [01:01<1:13:09,  2.21it/s, episode_reward=500, running_reward=383]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  3%|▎         | 320/10000 [01:01<1:10:33,  2.29it/s, episode_reward=500, running_reward=383]

initial_state [-0.02590725 -0.01858153  0.01866866  0.00536789]
info {}
initial_state tf.Tensor([-0.02590725 -0.01858153  0.01866866  0.00536789], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
e

  3%|▎         | 321/10000 [01:02<1:09:31,  2.32it/s, episode_reward=500, running_reward=383]

initial_state [ 0.02439089  0.04924943 -0.03849088 -0.02816305]
info {}
initial_state tf.Tensor([ 0.02439089  0.04924943 -0.03849088 -0.02816305], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
e

  3%|▎         | 322/10000 [01:02<1:03:13,  2.55it/s, episode_reward=500, running_reward=383]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
episode_reward 500
episodes_reward deque([500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 390, 344, 500, 500, 500, 500, 271, 394, 114, 109, 82, 154, 52, 129, 110, 

  3%|▎         | 323/10000 [01:02<59:40,  2.70it/s, episode_reward=500, running_reward=383]  

episode_reward 500
episodes_reward deque([500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 390, 344, 500, 500, 500, 500, 271, 394, 114, 109, 82, 154, 52, 129, 110, 146, 126, 119, 125, 120, 406, 148, 127, 130, 115, 110, 103, 29, 96, 500, 69, 147, 127, 118, 108, 500, 500, 500, 500, 500, 102, 500, 500, 500, 500, 500, 500, 121, 500, 500, 500, 500, 500, 348, 500, 500, 113, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [ 0.01547137 -0.03190304  0.03898662 -0.02512883]
info {}
initial_state tf.Tensor([ 0.01547137 -0.03190304  0.03898662 -0.02512883], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter 

  3%|▎         | 324/10000 [01:03<56:26,  2.86it/s, episode_reward=500, running_reward=383]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  3%|▎         | 325/10000 [01:03<52:49,  3.05it/s, episode_reward=500, running_reward=383]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  3%|▎         | 326/10000 [01:03<53:38,  3.01it/s, episode_reward=500, running_reward=383]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  3%|▎         | 327/10000 [01:04<53:03,  3.04it/s, episode_reward=500, running_reward=383]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
episode_reward 500
episodes_reward deque([500, 500, 500, 500, 500, 500, 500, 390, 344, 500, 500, 500, 500, 271, 394, 114, 109, 82, 154, 52, 129, 110, 146, 126, 119, 125, 120, 406, 148, 127, 130, 115, 110, 103, 29, 96, 500, 69, 147, 127, 118, 108, 500, 500, 500, 500, 500, 102, 500, 500, 500, 500, 500, 500, 121, 500, 500, 500, 500, 500, 348, 500, 500, 113, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [-0.02681816  0.00071495 -0.03908523  0.03806349]
info {}
initial_state tf.Tensor([-0.02681816  0.00071495 -0.03908523  0.03806349], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter 

  3%|▎         | 328/10000 [01:04<49:57,  3.23it/s, episode_reward=500, running_reward=383]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  3%|▎         | 328/10000 [01:04<49:57,  3.23it/s, episode_reward=500, running_reward=383]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  3%|▎         | 329/10000 [01:04<48:30,  3.32it/s, episode_reward=500, running_reward=383]

initial_state [-0.00148218  0.0107487   0.0326052  -0.0420675 ]
info {}
initial_state tf.Tensor([-0.00148218  0.0107487   0.0326052  -0.0420675 ], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
e

  3%|▎         | 330/10000 [01:04<47:19,  3.41it/s, episode_reward=500, running_reward=383]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
episode_reward 500
episodes_reward deque([500, 500, 500, 500, 390, 344, 500, 500, 500, 500, 271, 394, 114, 109, 82, 154, 52, 129, 110, 146, 126, 119, 125, 120, 406, 148, 127, 130, 115, 110, 103, 29, 96, 500, 69, 147, 127, 118, 108, 500, 500, 500, 500, 500, 102, 500, 500, 500, 500, 500, 500, 121, 500, 500, 500, 500, 500, 348, 500, 500, 113, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [ 0.03697388  0.00394416 -0.0348307  -0.02481306]
info {}
initial_state tf.Tensor([ 0.03697388  0.00394416 -0.0348307  -0.02481306], shape=(4,), dtype=float32)
enter env_step
enter 

  3%|▎         | 331/10000 [01:05<48:04,  3.35it/s, episode_reward=500, running_reward=383]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  3%|▎         | 332/10000 [01:05<50:17,  3.20it/s, episode_reward=500, running_reward=383]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
episode_reward 500
episodes_reward deque([500, 500, 390, 344, 500, 500, 500, 500, 271, 394, 114, 109, 82, 154, 52, 129, 110, 146, 126, 119, 125, 120, 406, 148, 127, 130, 115, 110, 103, 29, 96, 500, 69, 147, 127, 118, 108, 500, 500, 500, 500, 500, 102, 500, 500, 500, 500, 500, 500, 121, 500, 500, 500, 500, 500, 348, 500, 500, 113, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500

  3%|▎         | 333/10000 [01:05<50:59,  3.16it/s, episode_reward=500, running_reward=383]

episode_reward 500
episodes_reward deque([500, 390, 344, 500, 500, 500, 500, 271, 394, 114, 109, 82, 154, 52, 129, 110, 146, 126, 119, 125, 120, 406, 148, 127, 130, 115, 110, 103, 29, 96, 500, 69, 147, 127, 118, 108, 500, 500, 500, 500, 500, 102, 500, 500, 500, 500, 500, 500, 121, 500, 500, 500, 500, 500, 348, 500, 500, 113, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [-0.00227284  0.00333446  0.0380304  -0.00996785]
info {}
initial_state tf.Tensor([-0.00227284  0.00333446  0.0380304  -0.00996785], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter 

  3%|▎         | 334/10000 [01:06<52:36,  3.06it/s, episode_reward=455, running_reward=383]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  3%|▎         | 335/10000 [01:06<56:06,  2.87it/s, episode_reward=500, running_reward=384]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  3%|▎         | 335/10000 [01:07<56:06,  2.87it/s, episode_reward=500, running_reward=385]

enter env_step

  3%|▎         | 336/10000 [01:07<1:02:27,  2.58it/s, episode_reward=500, running_reward=385]

initial_state [-0.03247894  0.0438397   0.0084716  -0.03668659]
info {}
initial_state tf.Tensor([-0.03247894  0.0438397   0.0084716  -0.03668659], shape=(4,), dtype=float32)
enter env_step

  3%|▎         | 337/10000 [01:07<1:06:42,  2.41it/s, episode_reward=500, running_reward=385]

episode_reward 500
episodes_reward deque([500, 500, 500, 271, 394, 114, 109, 82, 154, 52, 129, 110, 146, 126, 119, 125, 120, 406, 148, 127, 130, 115, 110, 103, 29, 96, 500, 69, 147, 127, 118, 108, 500, 500, 500, 500, 500, 102, 500, 500, 500, 500, 500, 500, 121, 500, 500, 500, 500, 500, 348, 500, 500, 113, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 455, 500, 500, 500], maxlen=100)
initial_state [ 0.00891912 -0.03704495  0.04870311 -0.0416446 ]
info {}
initial_state tf.Tensor([ 0.00891912 -0.03704495  0.04870311 -0.0416446 ], shape=(4,), dtype=float32)
enter env_step

  3%|▎         | 337/10000 [01:07<1:06:42,  2.41it/s, episode_reward=500, running_reward=385]

enter env_step

  3%|▎         | 339/10000 [01:08<51:55,  3.10it/s, episode_reward=126, running_reward=381]  

initial_state [ 0.00219115  0.02488327 -0.02023412  0.03723564]
info {}
initial_state tf.Tensor([ 0.00219115  0.02488327 -0.02023412  0.03723564], shape=(4,), dtype=float32)
enter env_step

  3%|▎         | 339/10000 [01:08<51:55,  3.10it/s, episode_reward=307, running_reward=380]

enter env_step

  3%|▎         | 340/10000 [01:08<50:57,  3.16it/s, episode_reward=307, running_reward=380]

initial_state [ 0.0174482  -0.04787714  0.0160294  -0.03652694]
info {}
initial_state tf.Tensor([ 0.0174482  -0.04787714  0.0160294  -0.03652694], shape=(4,), dtype=float32)
enter env_step

  3%|▎         | 341/10000 [01:08<59:37,  2.70it/s, episode_reward=500, running_reward=382]

episode_reward 500
episodes_reward deque([394, 114, 109, 82, 154, 52, 129, 110, 146, 126, 119, 125, 120, 406, 148, 127, 130, 115, 110, 103, 29, 96, 500, 69, 147, 127, 118, 108, 500, 500, 500, 500, 500, 102, 500, 500, 500, 500, 500, 500, 121, 500, 500, 500, 500, 500, 348, 500, 500, 113, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 455, 500, 500, 500, 500, 126, 307, 500], maxlen=100)
initial_state [-0.03192708 -0.03096731  0.0450218   0.00785151]
info {}
initial_state tf.Tensor([-0.03192708 -0.03096731  0.0450218   0.00785151], shape=(4,), dtype=float32)
enter env_step

  3%|▎         | 342/10000 [01:09<1:00:47,  2.65it/s, episode_reward=500, running_reward=383]

enter env_step

  3%|▎         | 343/10000 [01:09<58:09,  2.77it/s, episode_reward=500, running_reward=387]  

enter env_step
episode_reward 500
episodes_reward deque([109, 82, 154, 52, 129, 110, 146, 126, 119, 125, 120, 406, 148, 127, 130, 115, 110, 103, 29, 96, 500, 69, 147, 127, 118, 108, 500, 500, 500, 500, 500, 102, 500, 500, 500, 500, 500, 500, 121, 500, 500, 500, 500, 500, 348, 500, 500, 113, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500

  3%|▎         | 344/10000 [01:09<48:09,  3.34it/s, episode_reward=203, running_reward=388]

episode_reward 203
episodes_reward deque([82, 154, 52, 129, 110, 146, 126, 119, 125, 120, 406, 148, 127, 130, 115, 110, 103, 29, 96, 500, 69, 147, 127, 118, 108, 500, 500, 500, 500, 500, 102, 500, 500, 500, 500, 500, 500, 121, 500, 500, 500, 500, 500, 348, 500, 500, 113, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 455, 500, 500, 500, 500, 126, 307, 500, 500, 500, 203], maxlen=100)
initial_state [ 0.04617335 -0.04927642  0.01016017  0.00475244]
info {}
initial_state tf.Tensor([ 0.04617335 -0.04927642  0.01016017  0.00475244], shape=(4,), dtype=float32)
enter env_step

  3%|▎         | 345/10000 [01:10<48:19,  3.33it/s, episode_reward=139, running_reward=392]

enter env_step
episode_reward 500
episodes_reward deque([154, 52, 129, 110, 146, 126, 119, 125, 120,

  3%|▎         | 347/10000 [01:10<32:32,  4.94it/s, episode_reward=131, running_reward=393]

initial_state [-0.04629687 -0.04805342  0.04506364  0.04507042]
info {}
initial_state tf.Tensor([-0.04629687 -0.04805342  0.04506364  0.04507042], shape=(4,), dtype=float32)
enter env_step

  3%|▎         | 348/10000 [01:10<40:12,  4.00it/s, episode_reward=500, running_reward=396]

enter env_step
episode_reward 500
episodes_reward deque([110, 146, 126, 119, 125, 120, 406, 148, 127, 130, 115, 110, 103, 29, 96, 500, 69, 147, 127, 118, 108, 500, 500, 500, 500, 500, 102, 500, 500, 500, 500, 500, 500, 121, 500, 500, 500, 500, 500, 348, 500, 500, 113, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 455, 500, 500, 500, 500, 126, 307, 500, 500, 500, 203, 500, 139, 131, 500], maxlen=100)
initial_state [-0.02919184 -0.04016663 -0.03434803  0.04691039]
info {}
initial_state tf.Tensor([-0.02919184 -0.04016663 -0.03434803  0.04691039], shape=(4,), dtype=float32)
enter env_step

  3%|▎         | 349/10000 [01:10<37:26,  4.30it/s, episode_reward=242, running_reward=398]

enter env_step
episode_reward 242
episodes_reward deque([146, 126, 119, 125, 120, 406, 148, 127, 130, 115, 110, 103, 29, 96, 500, 69, 147, 127, 118, 108, 500, 500, 500, 500, 500, 102, 500, 500, 500, 500, 500, 500, 121, 500, 500, 500, 500, 500, 348, 500, 500, 113, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 455, 500, 500, 500, 500, 126, 307, 500, 500, 500, 203, 500, 139, 131, 500, 242], maxlen=100)
initial_state [ 0.03572853  0.01154319  0.04458446 -0.03580245]
info {}
initial_state tf

  3%|▎         | 349/10000 [01:11<37:26,  4.30it/s, episode_reward=500, running_reward=401]

enter env_step

  4%|▎         | 350/10000 [01:11<41:55,  3.84it/s, episode_reward=500, running_reward=401]

initial_state [-0.00177722 -0.02672692 -0.03344574 -0.04095237]
info {}
initial_state tf.Tensor([-0.00177722 -0.02672692 -0.03344574 -0.04095237], shape=(4,), dtype=float32)
enter env_step

  4%|▎         | 351/10000 [01:11<40:16,  3.99it/s, episode_reward=371, running_reward=404]

episode_reward 371
episodes_reward deque([119, 125, 120, 406, 148, 127, 130, 115, 110, 103, 29, 96, 500, 69, 147, 127, 118, 108, 500, 500, 500, 500, 500, 102, 500, 500, 500, 500, 500, 500, 121, 500, 500, 500, 500, 500, 348, 500, 500, 113, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 455, 500, 500, 500, 500, 126, 307, 500, 500, 500, 203, 500, 139, 131, 500, 242, 500, 371], maxlen=100)
initial_state [-0.03693197  0.01526863  0.04380486 -0.0093191 ]
info {}
initial_state tf.Tensor([-0.03693197  0.01526863  0.04380486 -0.0093191 ], shape=(4,), dtype=float32)
enter env_step

  4%|▎         | 352/10000 [01:11<45:03,  3.57it/s, episode_reward=500, running_reward=407]

enter env_step
episode_reward 500
episodes_reward deque([125, 120, 406, 148, 127, 130, 115, 110, 103, 29, 96, 500, 

  4%|▎         | 352/10000 [01:11<45:03,  3.57it/s, episode_reward=500, running_reward=411]

enter env_step

  4%|▎         | 353/10000 [01:12<43:36,  3.69it/s, episode_reward=500, running_reward=415]

initial_state [-0.0221059  -0.01903975 -0.04536121 -0.0025462 ]
info {}
initial_state tf.Tensor([-0.0221059  -0.01903975 -0.04536121 -0.0025462 ], shape=(4,), dtype=float32)
enter env_step

  4%|▎         | 354/10000 [01:12<39:53,  4.03it/s, episode_reward=452, running_reward=415]

initial_state [ 0.02325192 -0.01486139 -0.00804738 -0.04860378]
info {}
initial_state tf.Tensor([ 0.02325192 -0.01486139 -0.00804738 -0.04860378], shape=(4,), dtype=float32)
enter env_step

  4%|▎         | 357/10000 [01:12<25:48,  6.23it/s, episode_reward=205, running_reward=416]

initial_state [ 0.04529205 -0.0276618   0.04552941  0.04129202]
info {}
initial_state tf.Tensor([ 0.04529205 -0.0276618   0.04552941  0.04129202], shape=(4,), dtype=float32)
enter env_step

  4%|▎         | 357/10000 [01:12<25:48,  6.23it/s, episode_reward=500, running_reward=419]

enter env_step

  4%|▎         | 359/10000 [01:12<26:44,  6.01it/s, episode_reward=279, running_reward=421]

initial_state [-0.04848338 -0.0015237   0.04271589  0.02616817]
info {}
initial_state tf.Tensor([-0.04848338 -0.0015237   0.04271589  0.02616817], shape=(4,), dtype=float32)
enter env_step

  4%|▎         | 359/10000 [01:13<26:44,  6.01it/s, episode_reward=500, running_reward=425]

enter env_step

  4%|▎         | 361/10000 [01:13<29:22,  5.47it/s, episode_reward=281, running_reward=427]

initial_state [ 0.00976312 -0.03922023 -0.02840649  0.03257671]
info {}
initial_state tf.Tensor([ 0.00976312 -0.03922023 -0.02840649  0.03257671], shape=(4,), dtype=float32)
enter env_step

  4%|▎         | 363/10000 [01:13<31:05,  5.17it/s, episode_reward=204, running_reward=432]

episode_reward 455
episodes_reward deque([96, 500, 69, 147, 127, 118, 108, 500, 500, 500, 500, 500, 102, 500, 500, 500, 500, 500, 500, 121, 500, 500, 500, 500, 500, 348, 500, 500, 113, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 455, 500, 500, 500, 500, 126, 307, 500, 500, 500, 203, 500, 139, 131, 500, 242, 500, 371, 500, 500, 500, 452, 105, 205, 500, 279, 500, 281, 455], maxlen=100)
initial_state [ 0.0336073  -0.00761883  0.02665045  0.0346342 ]
info {}
initial_state tf.Tensor([ 0.0336073  -0.00761883  0.02665045  0.0346342 ], shape=(4,), dtype=float32)
enter env_step

  4%|▎         | 364/10000 [01:14<38:07,  4.21it/s, episode_reward=500, running_reward=432]

episode_reward 500
episodes_reward deque([69, 147, 127, 118, 108, 500, 500, 500, 500, 500, 102, 500, 500, 500, 500, 500, 500, 121, 500, 500, 500, 500, 500, 348, 500, 500, 113, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 455, 500, 500, 500, 500, 126, 307, 500, 500, 500, 203, 500, 139, 131, 500, 242, 500, 371, 500, 500, 500, 452, 105, 205, 500, 279, 500, 281, 455, 204, 500], maxlen=100)
initial_state [ 0.04851546  0.04595792 -0.01939285 -0.01737857]
info {}
initial_state tf.Tensor([ 0.04851546  0.04595792 -0.01939285 -0.01737857], shape=(4,), dtype=float32)
enter env_step

  4%|▎         | 365/10000 [01:14<43:18,  3.71it/s, episode_reward=500, running_reward=436]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step

  4%|▎         | 366/10000 [01:14<51:30,  3.12it/s, episode_reward=500, running_reward=440]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step

  4%|▎         | 367/10000 [01:15<52:46,  3.04it/s, episode_reward=500, running_reward=444]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step

  4%|▎         | 368/10000 [01:15<51:44,  3.10it/s, episode_reward=500, running_reward=447]

episode_reward 500
episodes_reward deque([108, 500, 500, 500, 500, 500, 102, 500, 500, 500, 500, 500, 500, 121, 500, 500, 500, 500, 500, 348, 500, 500, 113, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 455, 500, 500, 500, 500, 126, 307, 500, 500, 500, 203, 500, 139, 131, 500, 242, 500, 371, 500, 500, 500, 452, 105, 205, 500, 279, 500, 281, 455, 204, 500, 500, 500, 500, 500], maxlen=100)
initial_state [ 0.03676288 -0.04722179  0.01228821  0.03355048]
info {}
initial_state tf.Tensor([ 0.03676288 -0.04722179  0.01228821  0.03355048], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step

  4%|▎         | 369/10000 [01:15<52:29,  3.06it/s, episode_reward=500, running_reward=451]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step

  4%|▎         | 370/10000 [01:16<45:03,  3.56it/s, episode_reward=255, running_reward=449]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step

  4%|▎         | 370/10000 [01:16<45:03,  3.56it/s, episode_reward=500, running_reward=449]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step

  4%|▎         | 371/10000 [01:16<46:16,  3.47it/s, episode_reward=500, running_reward=449]

initial_state [ 0.03334244 -0.00876593 -0.03196651 -0.04272186]
info {}
initial_state tf.Tensor([ 0.03334244 -0.00876593 -0.03196651 -0.04272186], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step

  4%|▎         | 372/10000 [01:16<44:53,  3.57it/s, episode_reward=500, running_reward=449]

episode_reward 500
episodes_reward deque([500, 500, 102, 500, 500, 500, 500, 500, 500, 121, 500, 500, 500, 500, 500, 348, 500, 500, 113, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 455, 500, 500, 500, 500, 126, 307, 500, 500, 500, 203, 500, 139, 131, 500, 242, 500, 371, 500, 500, 500, 452, 105, 205, 500, 279, 500, 281, 455, 204, 500, 500, 500, 500, 500, 500, 255, 500, 500], maxlen=100)
initial_state [ 0.00152246 -0.02437927  0.02072765 -0.01661211]
info {}
initial_state tf.Tensor([ 0.00152246 -0.02437927  0.02072765 -0.01661211], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step

  4%|▎         | 373/10000 [01:16<45:58,  3.49it/s, episode_reward=500, running_reward=449]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step

  4%|▎         | 374/10000 [01:17<46:24,  3.46it/s, episode_reward=500, running_reward=449]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step

  4%|▎         | 374/10000 [01:17<46:24,  3.46it/s, episode_reward=500, running_reward=453]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step

  4%|▍         | 375/10000 [01:17<44:38,  3.59it/s, episode_reward=500, running_reward=453]

initial_state [0.04854175 0.02159265 0.01960695 0.01288726]
info {}
initial_state tf.Tensor([0.04854175 0.02159265 0.01960695 0.01288726], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step

  4%|▍         | 376/10000 [01:17<42:19,  3.79it/s, episode_reward=500, running_reward=453]

episode_reward 500
episodes_reward deque([500, 500, 500, 500, 500, 121, 500, 500, 500, 500, 500, 348, 500, 500, 113, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 455, 500, 500, 500, 500, 126, 307, 500, 500, 500, 203, 500, 139, 131, 500, 242, 500, 371, 500, 500, 500, 452, 105, 205, 500, 279, 500, 281, 455, 204, 500, 500, 500, 500, 500, 500, 255, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [-0.00465627 -0.01109952  0.03704552  0.03870262]
info {}
initial_state tf.Tensor([-0.00465627 -0.01109952  0.03704552  0.03870262], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step

  4%|▍         | 377/10000 [01:17<42:13,  3.80it/s, episode_reward=500, running_reward=453]

episode_reward 500
episodes_reward deque([500, 500, 500, 500, 121, 500, 500, 500, 500, 500, 348, 500, 500, 113, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 455, 500, 500, 500, 500, 126, 307, 500, 500, 500, 203, 500, 139, 131, 500, 242, 500, 371, 500, 500, 500, 452, 105, 205, 500, 279, 500, 281, 455, 204, 500, 500, 500, 500, 500, 500, 255, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [ 0.01522464  0.01368341 -0.03515057 -0.01134821]
info {}
initial_state tf.Tensor([ 0.01522464  0.01368341 -0.03515057 -0.01134821], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step

  4%|▍         | 378/10000 [01:18<40:53,  3.92it/s, episode_reward=500, running_reward=453]

episode_reward 500
episodes_reward deque([500, 500, 500, 121, 500, 500, 500, 500, 500, 348, 500, 500, 113, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 455, 500, 500, 500, 500, 126, 307, 500, 500, 500, 203, 500, 139, 131, 500, 242, 500, 371, 500, 500, 500, 452, 105, 205, 500, 279, 500, 281, 455, 204, 500, 500, 500, 500, 500, 500, 255, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [-0.01243406  0.0350026  -0.04629065 -0.03771126]
info {}
initial_state tf.Tensor([-0.01243406  0.0350026  -0.04629065 -0.03771126], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step

  4%|▍         | 379/10000 [01:18<40:46,  3.93it/s, episode_reward=500, running_reward=453]

episode_reward 500
episodes_reward deque([500, 500, 121, 500, 500, 500, 500, 500, 348, 500, 500, 113, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 455, 500, 500, 500, 500, 126, 307, 500, 500, 500, 203, 500, 139, 131, 500, 242, 500, 371, 500, 500, 500, 452, 105, 205, 500, 279, 500, 281, 455, 204, 500, 500, 500, 500, 500, 500, 255, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [ 0.04628154 -0.039173    0.01815241  0.04544087]
info {}
initial_state tf.Tensor([ 0.04628154 -0.039173    0.01815241  0.04544087], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step

  4%|▍         | 380/10000 [01:18<40:53,  3.92it/s, episode_reward=500, running_reward=453]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
episode_reward 500
episodes_reward deque([500, 121, 500, 500, 500, 500, 500, 348, 500, 500, 113, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 455, 500, 500, 500, 500, 126, 307, 500, 500, 500, 203, 500, 139, 131, 500, 242, 500, 371, 500, 500, 500, 452, 105, 205, 500, 279, 500, 281, 455, 204, 500, 500, 500, 500, 500, 500, 255, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [-0.04382783  0.01688889 -0.00390467 -0.00687555]
info {}
initial_state tf.Tensor([-0.04382783  0.01688889 -0.00390467 -0.00687555], shape=(4,), d

  4%|▍         | 381/10000 [01:18<39:54,  4.02it/s, episode_reward=500, running_reward=453]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step

  4%|▍         | 382/10000 [01:19<41:13,  3.89it/s, episode_reward=500, running_reward=457]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step

  4%|▍         | 383/10000 [01:19<40:39,  3.94it/s, episode_reward=500, running_reward=457]

initial_state [ 0.01527357 -0.02751173  0.00746342  0.01191344]
info {}
initial_state tf.Tensor([ 0.01527357 -0.02751173  0.00746342  0.01191344], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step

  4%|▍         | 384/10000 [01:19<41:18,  3.88it/s, episode_reward=500, running_reward=457]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
episode_reward 500
episodes_reward deque([500, 500, 500, 348, 500, 500, 113, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 455, 500, 500, 500, 500, 126, 307, 500, 500, 500, 203, 500, 139, 131, 500, 242, 500, 371, 500, 500, 500, 452, 105

  4%|▍         | 385/10000 [01:19<40:22,  3.97it/s, episode_reward=500, running_reward=457]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step

  4%|▍         | 385/10000 [01:20<40:22,  3.97it/s, episode_reward=500, running_reward=457]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step

  4%|▍         | 386/10000 [01:20<42:41,  3.75it/s, episode_reward=500, running_reward=457]

initial_state [-0.00799188 -0.0283912   0.01599894 -0.0467485 ]
info {}
initial_state tf.Tensor([-0.00799188 -0.0283912   0.01599894 -0.0467485 ], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step

  4%|▍         | 387/10000 [01:20<43:55,  3.65it/s, episode_reward=500, running_reward=457]

episode_reward 500
episodes_reward deque([348, 500, 500, 113, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 455, 500, 500, 500, 500, 126, 307, 500, 500, 500, 203, 500, 139, 131, 500, 242, 500, 371, 500, 500, 500, 452, 105, 205, 500, 279, 500, 281, 455, 204, 500, 500, 500, 500, 500, 500, 255, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [ 0.00373904  0.00183693  0.0123401  -0.04088661]
info {}
initial_state tf.Tensor([ 0.00373904  0.00183693  0.0123401  -0.04088661], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step

  4%|▍         | 388/10000 [01:20<48:41,  3.29it/s, episode_reward=500, running_reward=458]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step

  4%|▍         | 389/10000 [01:21<49:34,  3.23it/s, episode_reward=450, running_reward=458]

episode_reward 450
episodes_reward deque([500, 113, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 455, 500, 500, 500, 500, 126, 307, 500, 500, 500, 203, 500, 139, 131, 500, 242, 500, 371, 500, 500, 500, 452, 105, 205, 500, 279, 500, 281, 455, 204, 500, 500, 500, 500, 500, 500, 255, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 450], maxlen=100)
initial_state [ 0.02431392 -0.03344918 -0.0130055  -0.02251129]
info {}
initial_state tf.Tensor([ 0.02431392 -0.03344918 -0.0130055  -0.02251129], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step

  4%|▍         | 391/10000 [01:21<47:20,  3.38it/s, episode_reward=168, running_reward=458]

episode_reward 484
episodes_reward deque([113, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 455, 500, 500, 500, 500, 126, 307, 500, 500, 500, 203, 500, 139, 131, 500, 242, 500, 371, 500, 500, 500, 452, 105, 205, 500, 279, 500, 281, 455, 204, 500, 500, 500, 500, 500, 500, 255, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 450, 484], maxlen=100)
initial_state [ 0.01052024  0.01987426 -0.03510967 -0.04574724]
info {}
initial_state tf.Tensor([ 0.01052024  0.01987426 -0.03510967 -0.04574724], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
e

  4%|▍         | 392/10000 [01:22<43:41,  3.67it/s, episode_reward=252, running_reward=456]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  4%|▍         | 393/10000 [01:22<41:47,  3.83it/s, episode_reward=233, running_reward=453]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  4%|▍         | 394/10000 [01:22<43:37,  3.67it/s, episode_reward=387, running_reward=452]

episode_reward 387
episodes_reward deque([500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 455, 500, 500, 500, 500, 126, 307, 500, 500, 500, 203, 500, 139, 131, 500, 242, 500, 371, 500, 500, 500, 452, 105, 205, 500, 279, 500, 281, 455, 204, 500, 500, 500, 500, 500, 500, 255, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 450, 484, 168, 252, 233, 387], maxlen=100)
initial_state [ 0.00983065 -0.03435686  0.0278008  -0.04856992]
info {}
initial_state tf.Tensor([ 0.00983065 -0.03435686  0.0278008  -0.04856992], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
e

  4%|▍         | 395/10000 [01:22<47:34,  3.36it/s, episode_reward=500, running_reward=452]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  4%|▍         | 396/10000 [01:23<46:12,  3.46it/s, episode_reward=361, running_reward=450]

episode_reward 425
episodes_reward deque([500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 455, 500, 500, 500, 500, 126, 307, 500, 500, 500, 203, 500, 139, 131, 500, 242, 500, 371, 500, 500, 500, 452, 105, 205, 500, 279, 500, 281, 455, 204, 500, 500, 500, 500, 500, 500, 255, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 450, 484, 168, 252, 233, 387, 500, 425], maxlen=100)
initial_state [-0.01027474 -0.0414828  -0.025582    0.02612003]
info {}
initial_state tf.Tensor([-0.01027474 -0.0414828  -0.025582    0.02612003], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
e

  4%|▍         | 397/10000 [01:23<41:46,  3.83it/s, episode_reward=134, running_reward=446]

initial_state [0.0473184  0.03933169 0.03488106 0.02953025]
info {}
initial_state tf.Tensor([0.0473184  0.03933169 0.03488106 0.02953025], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env

  4%|▍         | 399/10000 [01:23<32:08,  4.98it/s, episode_reward=351, running_reward=445]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
episode_reward 351
episodes_reward deque([500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 455, 500, 500, 500, 500, 126, 307, 500, 500, 500, 203, 500, 139, 131, 500, 242, 500, 371, 500, 500, 500, 452, 105, 205, 500, 279, 500, 281, 455

  4%|▍         | 401/10000 [01:23<28:12,  5.67it/s, episode_reward=251, running_reward=439]

episode_reward 187
episodes_reward deque([500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 455, 500, 500, 500, 500, 126, 307, 500, 500, 500, 203, 500, 139, 131, 500, 242, 500, 371, 500, 500, 500, 452, 105, 205, 500, 279, 500, 281, 455, 204, 500, 500, 500, 500, 500, 500, 255, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 450, 484, 168, 252, 233, 387, 500, 425, 361, 134, 351, 187], maxlen=100)
initial_state [-0.03893187 -0.0199115   0.00133957 -0.00565203]
info {}
initial_state tf.Tensor([-0.03893187 -0.0199115   0.00133957 -0.00565203], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
e

  4%|▍         | 402/10000 [01:24<26:15,  6.09it/s, episode_reward=124, running_reward=432]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  4%|▍         | 404/10000 [01:24<22:51,  7.00it/s, episode_reward=150, running_reward=426]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
episode_re

  4%|▍         | 406/10000 [01:24<21:48,  7.33it/s, episode_reward=132, running_reward=420]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  4%|▍         | 406/10000 [01:24<21:48,  7.33it/s, episode_reward=367, running_reward=419]

initial_state [-0.01312706 -0.02668695  0.01196083  0.01201415]
info {}
initial_state tf.Tensor([-0.01312706 -0.02668695  0.01196083  0.01201415], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
e

  4%|▍         | 409/10000 [01:25<22:47,  7.02it/s, episode_reward=270, running_reward=417]

initial_state [ 0.01814495 -0.04722658  0.04035895  0.04766061]
info {}
initial_state tf.Tensor([ 0.01814495 -0.04722658  0.04035895  0.04766061], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
e

  4%|▍         | 411/10000 [01:25<19:06,  8.36it/s, episode_reward=133, running_reward=409]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  4%|▍         | 413/10000 [01:25<17:18,  9.23it/s, episode_reward=150, running_reward=399]

episode_reward 158
episodes_reward deque([500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 455, 500, 500, 500, 500, 126, 307, 500, 500, 500, 203, 500, 139, 131, 500, 242, 500, 371, 500, 500, 500, 452, 105, 205, 500, 279, 500, 281, 455, 204, 500, 500, 500, 500, 500, 500, 255, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 450, 484, 168, 252, 233, 387, 500, 425, 361, 134, 351, 187, 251, 219, 124, 218, 150, 286, 132, 367, 270, 155, 133, 158], maxlen=100)
initial_state [ 0.02556882 -0.04646792  0.00454222  0.02255762]
info {}
initial_state tf.Tensor([ 0.02556882 -0.04646792  0.00454222  0.02255762], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
e

  4%|▍         | 417/10000 [01:25<14:42, 10.86it/s, episode_reward=135, running_reward=388]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
episode_reward 129
episodes_reward deque([500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 455, 500, 500, 500, 500, 126, 307, 500, 500, 500, 203, 500, 139, 131, 500, 242, 500, 371, 500, 500, 500, 452, 105, 205, 500, 279, 500, 281, 455, 204, 500, 500, 500, 500, 500, 500, 255, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 450, 484, 168, 252, 233, 387, 500, 425, 361, 134, 351, 187, 251, 219, 124, 218, 150, 286, 132, 367, 270, 155, 133, 158, 162, 150, 129], maxlen=100)
initial_state [-0.00971154  0.03596218 -0.0276663  -0.04501123]
info {}
initial_state tf.Tensor([-0.00971154  0.03596218 -0.0276663  -0.04501123], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
e

  4%|▍         | 417/10000 [01:25<14:42, 10.86it/s, episode_reward=312, running_reward=383]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  4%|▍         | 419/10000 [01:26<15:53, 10.04it/s, episode_reward=303, running_reward=381]

initial_state [ 0.04713472  0.04509296  0.02124926 -0.00017916]
info {}
initial_state tf.Tensor([ 0.04713472  0.04509296  0.02124926 -0.00017916], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
e

  4%|▍         | 421/10000 [01:26<17:34,  9.09it/s, episode_reward=134, running_reward=371]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  4%|▍         | 425/10000 [01:26<13:31, 11.79it/s, episode_reward=131, running_reward=364]

initial_state [ 0.00520216  0.01937407 -0.0473874  -0.00985286]
info {}
initial_state tf.Tensor([ 0.00520216  0.01937407 -0.0473874  -0.00985286], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
e

  4%|▍         | 425/10000 [01:26<13:31, 11.79it/s, episode_reward=299, running_reward=362]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
episode_reward 299
episodes_reward deque([500, 500, 500, 500, 500, 500, 500, 455, 500, 500, 500, 500, 126, 307, 500, 500, 500, 203, 500, 139, 131, 500, 242, 500, 371, 500, 500, 500, 452, 105, 205, 500, 279, 500, 281, 455, 204, 500, 500, 500, 500, 500, 500, 255, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 450, 484, 168, 252, 233, 387, 500, 425, 361, 134, 351, 187, 251, 219, 124, 218, 150, 286, 132, 367, 270, 155, 133, 158, 162, 150, 129, 136, 135, 141, 312, 303, 317, 105, 134, 123, 131, 299], maxlen=100)
initial_state [ 0.0434462  -0.04708213  0.04672233  0.0

  4%|▍         | 429/10000 [01:26<15:45, 10.13it/s, episode_reward=116, running_reward=354]

episode_reward 413
episodes_reward deque([500, 500, 500, 500, 500, 500, 455, 500, 500, 500, 500, 126, 307, 500, 500, 500, 203, 500, 139, 131, 500, 242, 500, 371, 500, 500, 500, 452, 105, 205, 500, 279, 500, 281, 455, 204, 500, 500, 500, 500, 500, 500, 255, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 450, 484, 168, 252, 233, 387, 500, 425, 361, 134, 351, 187, 251, 219, 124, 218, 150, 286, 132, 367, 270, 155, 133, 158, 162, 150, 129, 136, 135, 141, 312, 303, 317, 105, 134, 123, 131, 299, 413], maxlen=100)
initial_state [0.00753535 0.01467142 0.04912058 0.00525871]
info {}
initial_state tf.Tensor([0.00753535 0.01467142 0.04912058 0.00525871], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env

  4%|▍         | 431/10000 [01:27<14:19, 11.13it/s, episode_reward=138, running_reward=343]

episode_reward 123
episodes_reward deque([500, 500, 500, 455, 500, 500, 500, 500, 126, 307, 500, 500, 500, 203, 500, 139, 131, 500, 242, 500, 371, 500, 500, 500, 452, 105, 205, 500, 279, 500, 281, 455, 204, 500, 500, 500, 500, 500, 500, 255, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 450, 484, 168, 252, 233, 387, 500, 425, 361, 134, 351, 187, 251, 219, 124, 218, 150, 286, 132, 367, 270, 155, 133, 158, 162, 150, 129, 136, 135, 141, 312, 303, 317, 105, 134, 123, 131, 299, 413, 153, 116, 123], maxlen=100)
initial_state [ 0.03876144 -0.02067803 -0.00100095  0.00592773]
info {}
initial_state tf.Tensor([ 0.03876144 -0.02067803 -0.00100095  0.00592773], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
e

  4%|▍         | 433/10000 [01:27<14:24, 11.07it/s, episode_reward=118, running_reward=335]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
episode_reward 123
episodes_reward deque([455, 500, 500, 500, 500, 126, 307, 500, 500, 500, 203, 500, 139, 131, 500, 242, 500, 371, 500, 500, 500, 452, 105, 205, 500, 279, 500, 281, 455, 204, 500, 500, 500, 500, 500, 500, 255, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 450, 484, 168, 252, 233, 387, 500, 425, 361, 134, 351, 187, 251, 219, 124, 218, 150, 286, 132, 367, 270, 155, 133, 158, 162, 150, 129, 136, 135

  4%|▍         | 435/10000 [01:27<15:22, 10.37it/s, episode_reward=127, running_reward=328]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
episode_reward 137
episodes_reward deque([500, 500, 500, 126, 307, 500, 500, 500, 203, 500, 139, 131, 500, 242, 500, 371, 500, 500, 500, 452, 105, 205, 500, 279, 500, 281, 455, 204, 500, 500, 500, 500, 500, 500, 255, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 450, 484, 168, 252, 233, 387, 500, 425, 361, 134, 351, 187, 251, 219, 124, 218, 150, 286, 132, 367, 270, 155, 133, 158, 162, 150, 129, 136, 135, 141, 312, 303, 317, 105, 134, 123, 131, 299, 413, 153, 116, 123, 130, 138, 123, 118, 137], maxlen=100)
initial_state [-0.04253619  0.02642863  0.02412503 -0.02826911]
info {}
initial_state tf.Tensor([-0.04253619  0.02642863  0.02412503 -0.02826911]

  4%|▍         | 437/10000 [01:27<17:11,  9.27it/s, episode_reward=120, running_reward=324]

initial_state [ 0.01100993  0.02423906  0.00666075 -0.04684232]
info {}
initial_state tf.Tensor([ 0.01100993  0.02423906  0.00666075 -0.04684232], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
e

  4%|▍         | 439/10000 [01:27<17:06,  9.31it/s, episode_reward=113, running_reward=318]

episode_reward 116
episodes_reward deque([126, 307, 500, 500, 500, 203, 500, 139, 131, 500, 242, 500, 371, 500, 500, 500, 452, 105, 205, 500, 279, 500, 281, 455, 204, 500, 500, 500, 500, 500, 500, 255, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 450, 484, 168, 252, 233, 387, 500, 425, 361, 134, 351, 187, 251, 219, 124, 218, 150, 286, 132, 367, 270, 155, 133, 158, 162, 150, 129, 136, 135, 141, 312, 303, 317, 105, 134, 123, 131, 299, 413, 153, 116, 123, 130, 138, 123, 118, 137, 127, 120, 116], maxlen=100)
initial_state [-0.00596298 -0.00781953 -0.02732292 -0.02624564]
info {}
initial_state tf.Tensor([-0.00596298 -0.00781953 -0.02732292 -0.02624564], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
e

  4%|▍         | 441/10000 [01:28<18:20,  8.69it/s, episode_reward=99, running_reward=314] 

initial_state [-0.02882951 -0.03618689 -0.04489461 -0.01583027]
info {}
initial_state tf.Tensor([-0.02882951 -0.03618689 -0.04489461 -0.01583027], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
e

  4%|▍         | 443/10000 [01:28<18:56,  8.41it/s, episode_reward=104, running_reward=307]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
episode_reward 108
episodes_reward deque([500, 203, 500, 139, 131, 500, 242, 500, 371, 500, 500, 500, 452, 105, 205, 500, 279, 500, 281, 455, 204, 500, 500, 500, 500, 500, 500, 255, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 450, 484, 168, 252, 233

  4%|▍         | 444/10000 [01:28<19:48,  8.04it/s, episode_reward=121, running_reward=306]

enter env_step

  4%|▍         | 445/10000 [01:28<22:00,  7.24it/s, episode_reward=130, running_reward=302]

enter env_step

  4%|▍         | 447/10000 [01:29<20:07,  7.91it/s, episode_reward=116, running_reward=298]

episode_reward 113
episodes_reward deque([131, 500, 242, 500, 371, 500, 500, 500, 452, 105, 205, 500, 279, 500, 281, 455, 204, 500, 500, 500, 500, 500, 500, 255, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 450, 484, 168, 252, 233, 387, 500, 425, 361, 134, 351, 187, 251, 219, 124, 218, 150, 286, 132, 367, 270, 155, 133, 158, 162, 150, 129, 136, 135, 141, 312, 303, 317, 105, 134, 123, 131, 299, 413, 153, 116, 123, 130, 138, 123, 118, 137, 127, 120, 116, 122, 113, 99, 108, 104, 121, 130, 113], maxlen=100)
initial_state [ 0.0010237   0.02591804 -0.01391099  0.02202808]
info {}
initial_state tf.Tensor([ 0.0010237   0.02591804 -0.01391099  0.02202808], shape=(4,), dtype=float32)
enter env_step

  4%|▍         | 449/10000 [01:29<16:39,  9.56it/s, episode_reward=104, running_reward=292]

initial_state [-0.01131277  0.00968414 -0.02257893 -0.01735388]
info {}
initial_state tf.Tensor([-0.01131277  0.00968414 -0.02257893 -0.01735388], shape=(4,), dtype=float32)
enter env_step

  5%|▍         | 452/10000 [01:29<16:29,  9.65it/s, episode_reward=119, running_reward=286]

enter env_step
episode_reward 110
episodes_reward deque([500, 500, 500, 452, 105, 205, 500, 279, 500, 281, 455, 204

  5%|▍         | 454/10000 [01:29<16:41,  9.53it/s, episode_reward=105, running_reward=278]

enter env_step
episode_reward 111
episodes_reward deque([500, 452, 105, 205, 500, 279, 500, 281, 455, 204, 500, 500, 500, 500, 500, 500, 255, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 450, 484, 168, 252, 233, 387, 500, 425, 361, 134, 351, 187, 251, 219, 124, 218, 150, 286, 132, 367, 270, 155, 133, 158, 162, 150, 129, 136, 135, 141, 312, 303, 317, 105, 134, 123, 131, 299, 413, 153

  5%|▍         | 455/10000 [01:29<17:16,  9.20it/s, episode_reward=121, running_reward=275]

enter env_step

  5%|▍         | 457/10000 [01:29<17:13,  9.24it/s, episode_reward=123, running_reward=274]

enter env_step

  5%|▍         | 458/10000 [01:30<18:07,  8.77it/s, episode_reward=117, running_reward=269]

enter env_step
episode_reward 118
episodes_reward deque([279, 500, 281, 455, 204, 500, 500, 500, 500, 500, 500, 255, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 450, 484, 168, 252, 233, 387, 500, 425, 361, 134, 351, 187, 251, 219, 124, 218, 150, 286, 132, 367, 270, 155, 133, 158, 162, 150, 129, 136, 135, 141, 312, 303, 317, 105, 134, 123, 131, 299, 413, 153, 116, 123, 130, 138, 123, 118, 137, 127, 120, 116, 122, 113, 99, 108, 104, 121, 130, 113, 115, 116, 106, 104, 110, 119, 111, 105, 120, 121, 123, 118], maxlen=100)
initial_state [ 0.04496645  0.00744894 -0.00715963  0.04421232]
info {}
initial_state tf.Tensor([ 0.04496645  0.00744894 -0.00715963  0.04421232], shape=(4,), dtype=float32)
enter env_step

  5%|▍         | 462/10000 [01:30<15:41, 10.13it/s, episode_reward=132, running_reward=260]

episode_reward 146
episodes_reward deque([281, 455, 204, 500, 500, 500, 500, 500, 500, 255, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 450, 484, 168, 252, 233, 387, 500, 425, 361, 134, 351, 187, 251, 219, 124, 218, 150, 286, 132, 367, 270, 155, 133, 158, 162, 150, 129, 136, 135, 141, 312, 303, 317, 105, 134, 123, 131, 299, 413, 153, 116, 123, 130, 138, 123, 118, 137, 127, 120, 116, 122, 113, 99, 108, 104, 121, 130, 113, 115, 116, 106, 104, 110, 119, 111, 105, 120, 121, 123, 118, 117, 146], maxlen=100)
initial_state [-0.02999835  0.03420676 -0.03303032 -0.0142254 ]
info {}
initial_state tf.Tensor([-0.02999835  0.03420676 -0.03303032 -0.0142254 ], shape=(4,), dtype=float32)
enter env_step

  5%|▍         | 464/10000 [01:30<21:22,  7.44it/s, episode_reward=132, running_reward=260]

enter env_step
episode_reward 500
episodes_reward deque([500, 500, 500, 500, 500, 500, 255, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 450, 484, 168, 252, 233, 387, 500, 425, 361, 134, 351, 187, 251, 219, 124, 218, 150, 286, 132, 367, 270, 155, 133, 158, 162, 150, 129, 136, 135, 141, 312, 303, 317, 105, 134

  5%|▍         | 466/10000 [01:31<19:54,  7.98it/s, episode_reward=223, running_reward=253]

enter env_step

  5%|▍         | 467/10000 [01:31<19:23,  8.20it/s, episode_reward=157, running_reward=250]

enter env_step

  5%|▍         | 468/10000 [01:31<22:52,  6.94it/s, episode_reward=122, running_reward=245]

enter env_step

  5%|▍         | 471/10000 [01:31<18:50,  8.43it/s, episode_reward=174, running_reward=241]

enter env_step
episode_reward 155
episodes_reward deque([500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 450, 484, 168, 252, 233, 387, 500, 425, 361, 134, 351, 187, 251, 219, 124, 218, 150, 286, 132, 367, 270, 155, 133, 158, 162, 150, 129, 136, 135, 141, 312, 303, 317, 105, 134, 123, 131, 299, 413, 153, 116, 123, 130, 138, 123, 118, 137, 127, 120, 116, 122, 113, 99, 108, 104, 121, 130, 113, 115, 116, 106, 104, 110, 119, 111, 105, 120, 121, 123, 118, 117, 146, 119, 132, 500, 132, 142, 223, 157, 400, 122, 155], maxlen=100)
initial_state [-0.03214222 -0.03462313 -0.01802946  0.03253108]
info {}
initial_state tf.Tensor([-0.03214222 -0.0346

  5%|▍         | 471/10000 [01:31<18:50,  8.43it/s, episode_reward=500, running_reward=241]

enter env_step

  5%|▍         | 472/10000 [01:32<24:28,  6.49it/s, episode_reward=148, running_reward=237]

initial_state [-0.00635695  0.00303063  0.00460968 -0.04461022]
info {}
initial_state tf.Tensor([-0.00635695  0.00303063  0.00460968 -0.04461022], shape=(4,), dtype=float32)
enter env_step

  5%|▍         | 474/10000 [01:32<24:48,  6.40it/s, episode_reward=500, running_reward=237]

enter env_step

  5%|▍         | 474/10000 [01:32<24:48,  6.40it/s, episode_reward=500, running_reward=237]

enter env_step

  5%|▍         | 477/10000 [01:32<23:08,  6.86it/s, episode_reward=140, running_reward=231]

initial_state [-0.00186703 -0.03869462 -0.04125494  0.01951211]
info {}
initial_state tf.Tensor([-0.00186703 -0.03869462 -0.04125494  0.01951211], shape=(4,), dtype=float32)
enter env_step

  5%|▍         | 477/10000 [01:32<23:08,  6.86it/s, episode_reward=500, running_reward=231]

enter env_step

  5%|▍         | 479/10000 [01:33<25:25,  6.24it/s, episode_reward=168, running_reward=227]

initial_state [ 0.01305809  0.02679615  0.03515298 -0.03147428]
info {}
initial_state tf.Tensor([ 0.01305809  0.02679615  0.03515298 -0.03147428], shape=(4,), dtype=float32)
enter env_step

  5%|▍         | 479/10000 [01:33<25:25,  6.24it/s, episode_reward=500, running_reward=227]

enter env_step

  5%|▍         | 480/10000 [01:33<29:30,  5.38it/s, episode_reward=115, running_reward=223]

initial_state [ 0.0004792   0.04555591 -0.03669428 -0.00314886]
info {}
initial_state tf.Tensor([ 0.0004792   0.04555591 -0.03669428 -0.00314886], shape=(4,), dtype=float32)
enter env_step

  5%|▍         | 482/10000 [01:33<26:47,  5.92it/s, episode_reward=500, running_reward=223]

enter env_step

  5%|▍         | 483/10000 [01:33<30:04,  5.28it/s, episode_reward=500, running_reward=223]

enter env_step

  5%|▍         | 484/10000 [01:34<32:07,  4.94it/s, episode_reward=500, running_reward=223]

enter env_step

  5%|▍         | 485/10000 [01:34<35:50,  4.43it/s, episode_reward=500, running_reward=223]

episode_reward 500
episodes_reward deque([500, 500, 500, 450, 484, 168, 252, 233, 387, 500, 425, 361, 134, 351, 187, 251, 219, 124, 218, 150, 286, 132, 367, 270, 155, 133, 158, 162, 150, 129, 136, 135, 141, 312, 303, 317, 105, 134, 123, 131, 299, 413, 153, 116, 123, 130, 138, 123, 118, 137, 127, 120, 116, 122, 113, 99, 108, 104, 121, 130, 113, 115, 116, 106, 104, 110, 119, 111, 105, 120, 121, 123, 118, 117, 146, 119, 132, 500, 132, 142, 223, 157, 400, 122, 155, 174, 500, 148, 500, 500, 180, 140, 500, 168, 500, 115, 500, 500, 500, 500], maxlen=100)
initial_state [-0.02795684  0.03991829 -0.01394737  0.00369901]
info {}
initial_state tf.Tensor([-0.02795684  0.03991829 -0.01394737  0.00369901], shape=(4,), dtype=float32)
enter env_step

  5%|▍         | 486/10000 [01:34<40:33,  3.91it/s, episode_reward=500, running_reward=223]

enter env_step

  5%|▍         | 486/10000 [01:35<40:33,  3.91it/s, episode_reward=500, running_reward=223]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step

  5%|▍         | 487/10000 [01:35<52:16,  3.03it/s, episode_reward=500, running_reward=223]

initial_state [-0.04653909  0.01896668 -0.00053964  0.04406255]
info {}
initial_state tf.Tensor([-0.04653909  0.01896668 -0.00053964  0.04406255], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step

  5%|▍         | 488/10000 [01:35<59:37,  2.66it/s, episode_reward=500, running_reward=223]

episode_reward 500
episodes_reward deque([450, 484, 168, 252, 233, 387, 500, 425, 361, 134, 351, 187, 251, 219, 124, 218, 150, 286, 132, 367, 270, 155, 133, 158, 162, 150, 129, 136, 135, 141, 312, 303, 317, 105, 134, 123, 131, 299, 413, 153, 116, 123, 130, 138, 123, 118, 137, 127, 120, 116, 122, 113, 99, 108, 104, 121, 130, 113, 115, 116, 106, 104, 110, 119, 111, 105, 120, 121, 123, 118, 117, 146, 119, 132, 500, 132, 142, 223, 157, 400, 122, 155, 174, 500, 148, 500, 500, 180, 140, 500, 168, 500, 115, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [-0.02823426  0.0042461   0.0486779   0.00130607]
info {}
initial_state tf.Tensor([-0.02823426  0.0042461   0.0486779   0.00130607], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step

  5%|▍         | 488/10000 [01:36<59:37,  2.66it/s, episode_reward=500, running_reward=224]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step

  5%|▍         | 489/10000 [01:36<1:01:20,  2.58it/s, episode_reward=500, running_reward=224]

initial_state [ 0.01863338 -0.0489522   0.01889732  0.03552968]
info {}
initial_state tf.Tensor([ 0.01863338 -0.0489522   0.01889732  0.03552968], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step

  5%|▍         | 490/10000 [01:36<55:02,  2.88it/s, episode_reward=172, running_reward=222]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
episode_reward 328
episodes_reward deque([168, 252, 233, 387, 500, 425, 361, 134, 351, 187, 251, 219, 124, 218, 150, 286, 132, 367, 270, 155, 133, 158, 162, 150, 129, 136, 135, 141, 312, 303, 317, 105, 134, 123, 131, 299, 413, 153, 116, 123, 130, 138, 123, 118, 137, 127, 120, 116, 122, 113, 99, 108, 104, 121, 130, 113, 115, 116, 106, 104, 110, 119, 111, 105, 120, 121, 123, 118, 117, 146, 119, 132, 500, 132, 142, 223, 157, 400, 122, 155, 174, 500, 148, 500, 500, 180, 140, 500, 168, 500, 115, 500, 500, 500, 500, 500, 500, 500, 500, 328], maxlen=100)
initial_state [0.00698064 0.00582829 0.01385692 0.00888493]
info {}
initial_state tf.Tensor([0.00698064 0.00582829 0.01385692 0.00888493], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step

  5%|▍         | 492/10000 [01:36<38:11,  4.15it/s, episode_reward=193, running_reward=222]

initial_state [ 0.02611045  0.02969951 -0.02702937 -0.00417858]
info {}
initial_state tf.Tensor([ 0.02611045  0.02969951 -0.02702937 -0.00417858], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step

  5%|▍         | 494/10000 [01:36<30:22,  5.22it/s, episode_reward=200, running_reward=219]

episode_reward 139
episodes_reward deque([387, 500, 425, 361, 134, 351, 187, 251, 219, 124, 218, 150, 286, 132, 367, 270, 155, 133, 158, 162, 150, 129, 136, 135, 141, 312, 303, 317, 105, 134, 123, 131, 299, 413, 153, 116, 123, 130, 138, 123, 118, 137, 127, 120, 116, 122, 113, 99, 108, 104, 121, 130, 113, 115, 116, 106, 104, 110, 119, 111, 105, 120, 121, 123, 118, 117, 146, 119, 132, 500, 132, 142, 223, 157, 400, 122, 155, 174, 500, 148, 500, 500, 180, 140, 500, 168, 500, 115, 500, 500, 500, 500, 500, 500, 500, 500, 328, 172, 193, 139], maxlen=100)
initial_state [-0.04760591 -0.0491954  -0.00619704  0.00160564]
info {}
initial_state tf.Tensor([-0.04760591 -0.0491954  -0.00619704  0.00160564], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step

  5%|▍         | 495/10000 [01:37<35:00,  4.52it/s, episode_reward=500, running_reward=219]

episode_reward 500
episodes_reward deque([425, 361, 134, 351, 187, 251, 219, 124, 218, 150, 286, 132, 367, 270, 155, 133, 158, 162, 150, 129, 136, 135, 141, 312, 303, 317, 105, 134, 123, 131, 299, 413, 153, 116, 123, 130, 138, 123, 118, 137, 127, 120, 116, 122, 113, 99, 108, 104, 121, 130, 113, 115, 116, 106, 104, 110, 119, 111, 105, 120, 121, 123, 118, 117, 146, 119, 132, 500, 132, 142, 223, 157, 400, 122, 155, 174, 500, 148, 500, 500, 180, 140, 500, 168, 500, 115, 500, 500, 500, 500, 500, 500, 500, 500, 328, 172, 193, 139, 200, 500], maxlen=100)
initial_state [0.02078378 0.0324921  0.0269707  0.01251564]
info {}
initial_state tf.Tensor([0.02078378 0.0324921  0.0269707  0.01251564], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step

  5%|▍         | 496/10000 [01:37<35:56,  4.41it/s, episode_reward=371, running_reward=218]

episode_reward 371
episodes_reward deque([361, 134, 351, 187, 251, 219, 124, 218, 150, 286, 132, 367, 270, 155, 133, 158, 162, 150, 129, 136, 135, 141, 312, 303, 317, 105, 134, 123, 131, 299, 413, 153, 116, 123, 130, 138, 123, 118, 137, 127, 120, 116, 122, 113, 99, 108, 104, 121, 130, 113, 115, 116, 106, 104, 110, 119, 111, 105, 120, 121, 123, 118, 117, 146, 119, 132, 500, 132, 142, 223, 157, 400, 122, 155, 174, 500, 148, 500, 500, 180, 140, 500, 168, 500, 115, 500, 500, 500, 500, 500, 500, 500, 500, 328, 172, 193, 139, 200, 500, 371], maxlen=100)
initial_state [ 0.01479985 -0.00963176 -0.03913525 -0.03411767]
info {}
initial_state tf.Tensor([ 0.01479985 -0.00963176 -0.03913525 -0.03411767], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step

  5%|▍         | 497/10000 [01:37<37:22,  4.24it/s, episode_reward=420, running_reward=219]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
episode_reward 420
episodes_reward deque([134, 351, 187, 251, 219, 124, 218, 150, 286, 132, 367, 270, 155, 133, 158, 162, 150, 129, 136, 135, 141, 312, 303, 317, 105, 134, 123, 131, 299, 413, 153, 116, 123, 130, 138, 123, 118, 137, 127, 120, 116, 122, 113, 99, 108, 104, 121, 130, 113, 115, 116, 106, 104, 110, 119, 111, 105, 120, 121, 123, 118, 117, 146, 119, 132, 500, 132, 142, 223, 157, 400, 122, 155, 174, 500, 148, 500, 500, 180, 140, 500, 168, 500, 115, 500, 500, 500, 500, 500, 500, 500, 500, 328, 172, 193, 139, 200, 500, 371, 420], maxlen=100)
initial_state [-0.02338291  0.04468509 -0.03970225  0.04579434]
info {}
initial_state tf.Tensor([-0.02338291  0.04468509 -0.03970225  0.04579434], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step

  5%|▍         | 498/10000 [01:38<37:44,  4.20it/s, episode_reward=427, running_reward=222]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
episode_reward 427
episodes_reward deque([351, 187, 251, 219, 124, 218, 150, 286, 132, 367, 270, 155, 133, 158, 162, 150, 129, 136, 135, 141, 312, 303, 317, 105

  5%|▍         | 499/10000 [01:38<34:57,  4.53it/s, episode_reward=286, running_reward=221]

enter env_step
episode_reward 286
episodes_reward deque([187, 251, 219, 124, 218, 150, 286, 132, 367, 270, 155, 133, 158, 162, 150, 129, 136, 135, 141, 312, 303, 317, 105, 134, 123, 131, 299, 413, 153, 116, 123, 130, 138, 123, 118, 137, 127, 120, 116, 122, 113, 99, 108, 104, 121, 130, 113, 115, 116, 106, 104, 110, 119, 111, 105, 120, 121, 123, 118, 117, 146, 119, 132, 500, 132, 142, 223, 157, 400, 122, 155, 174, 500, 148, 500, 500, 180, 140, 500, 168, 500, 115, 500, 500, 500, 500, 500, 500, 500, 500, 328, 172, 193, 139, 200, 500, 371, 420, 427, 286], maxlen=100)
initial_state [ 0.02698328  0.00433487 -0.03737435 -0.04744673]
info {}
initial_state tf.Tensor([ 0.02698328  0.00433487 -0.03737435 -0.04744673], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step

  5%|▍         | 499/10000 [01:38<34:57,  4.53it/s, episode_reward=500, running_reward=224]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step

  5%|▌         | 500/10000 [01:38<42:25,  3.73it/s, episode_reward=500, running_reward=224]

initial_state [-0.01780996  0.03323216 -0.03949654  0.02081499]
info {}
initial_state tf.Tensor([-0.01780996  0.03323216 -0.03949654  0.02081499], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step

  5%|▌         | 501/10000 [01:38<43:25,  3.65it/s, episode_reward=500, running_reward=227]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step

  5%|▌         | 502/10000 [01:39<43:05,  3.67it/s, episode_reward=500, running_reward=230]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step

  5%|▌         | 503/10000 [01:39<43:48,  3.61it/s, episode_reward=500, running_reward=233]

initial_state [ 0.01730972  0.00151455  0.00855892 -0.0047785 ]
info {}
initial_state tf.Tensor([ 0.01730972  0.00151455  0.00855892 -0.0047785 ], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step

  5%|▌         | 504/10000 [01:39<42:34,  3.72it/s, episode_reward=500, running_reward=236]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
episode_reward 500
episodes_reward deque([150, 286, 132, 367, 270, 155, 133, 158, 162, 150, 129, 136, 135, 141, 312, 303, 317, 105, 134, 123, 131, 299, 413, 153, 116, 123, 130, 138, 123, 118, 137, 127, 120, 116, 122, 113, 99, 108, 104, 121, 130, 113, 115, 116, 106, 104, 110, 119, 111, 105, 120, 121, 123, 118, 117, 146, 119, 132, 500, 132, 142, 223, 157, 400, 122, 155, 174, 500, 148, 500, 500, 180, 140, 500, 168, 500, 115, 500, 500, 500, 500, 500, 500, 500, 500, 328, 172, 193, 139, 200, 500, 371, 420, 427, 286, 500, 500, 500, 500, 500], maxlen=100)
initial_state [ 0.00742946 -0.0097175   0.01698025  0.00151264]
info {}
initial_state 

  5%|▌         | 505/10000 [01:39<41:06,  3.85it/s, episode_reward=500, running_reward=240]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step

  5%|▌         | 506/10000 [01:40<42:04,  3.76it/s, episode_reward=500, running_reward=242]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step

  5%|▌         | 506/10000 [01:40<42:04,  3.76it/s, episode_reward=500, running_reward=246]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step

  5%|▌         | 507/10000 [01:40<41:01,  3.86it/s, episode_reward=500, running_reward=246]

initial_state [ 0.04304405 -0.03860223  0.03277298  0.04629942]
info {}
initial_state tf.Tensor([ 0.04304405 -0.03860223  0.03277298  0.04629942], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step

  5%|▌         | 508/10000 [01:40<39:53,  3.97it/s, episode_reward=500, running_reward=247]

episode_reward 500
episodes_reward deque([270, 155, 133, 158, 162, 150, 129, 136, 135, 141, 312, 303, 317, 105, 134, 123, 131, 299, 413, 153, 116, 123, 130, 138, 123, 118, 137, 127, 120, 116, 122, 113, 99, 108, 104, 121, 130, 113, 115, 116, 106, 104, 110, 119, 111, 105, 120, 121, 123, 118, 117, 146, 119, 132, 500, 132, 142, 223, 157, 400, 122, 155, 174, 500, 148, 500, 500, 180, 140, 500, 168, 500, 115, 500, 500, 500, 500, 500, 500, 500, 500, 328, 172, 193, 139, 200, 500, 371, 420, 427, 286, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [-0.01130771 -0.02914559  0.0023392   0.04644326]
info {}
initial_state tf.Tensor([-0.01130771 -0.02914559  0.0023392   0.04644326], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step

  5%|▌         | 509/10000 [01:40<37:41,  4.20it/s, episode_reward=500, running_reward=249]

episode_reward 500
episodes_reward deque([155, 133, 158, 162, 150, 129, 136, 135, 141, 312, 303, 317, 105, 134, 123, 131, 299, 413, 153, 116, 123, 130, 138, 123, 118, 137, 127, 120, 116, 122, 113, 99, 108, 104, 121, 130, 113, 115, 116, 106, 104, 110, 119, 111, 105, 120, 121, 123, 118, 117, 146, 119, 132, 500, 132, 142, 223, 157, 400, 122, 155, 174, 500, 148, 500, 500, 180, 140, 500, 168, 500, 115, 500, 500, 500, 500, 500, 500, 500, 500, 328, 172, 193, 139, 200, 500, 371, 420, 427, 286, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [-0.01431635 -0.03146389 -0.04552206  0.04444909]
info {}
initial_state tf.Tensor([-0.01431635 -0.03146389 -0.04552206  0.04444909], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step

  5%|▌         | 510/10000 [01:41<37:48,  4.18it/s, episode_reward=500, running_reward=253]

episode_reward 500
episodes_reward deque([133, 158, 162, 150, 129, 136, 135, 141, 312, 303, 317, 105, 134, 123, 131, 299, 413, 153, 116, 123, 130, 138, 123, 118, 137, 127, 120, 116, 122, 113, 99, 108, 104, 121, 130, 113, 115, 116, 106, 104, 110, 119, 111, 105, 120, 121, 123, 118, 117, 146, 119, 132, 500, 132, 142, 223, 157, 400, 122, 155, 174, 500, 148, 500, 500, 180, 140, 500, 168, 500, 115, 500, 500, 500, 500, 500, 500, 500, 500, 328, 172, 193, 139, 200, 500, 371, 420, 427, 286, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [ 0.04399071  0.03505573  0.02337403 -0.02066192]
info {}
initial_state tf.Tensor([ 0.04399071  0.03505573  0.02337403 -0.02066192], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step

  5%|▌         | 511/10000 [01:41<36:46,  4.30it/s, episode_reward=355, running_reward=255]

episode_reward 355
episodes_reward deque([158, 162, 150, 129, 136, 135, 141, 312, 303, 317, 105, 134, 123, 131, 299, 413, 153, 116, 123, 130, 138, 123, 118, 137, 127, 120, 116, 122, 113, 99, 108, 104, 121, 130, 113, 115, 116, 106, 104, 110, 119, 111, 105, 120, 121, 123, 118, 117, 146, 119, 132, 500, 132, 142, 223, 157, 400, 122, 155, 174, 500, 148, 500, 500, 180, 140, 500, 168, 500, 115, 500, 500, 500, 500, 500, 500, 500, 500, 328, 172, 193, 139, 200, 500, 371, 420, 427, 286, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 355], maxlen=100)
initial_state [ 0.02217548 -0.00161812 -0.02153531 -0.00946126]
info {}
initial_state tf.Tensor([ 0.02217548 -0.00161812 -0.02153531 -0.00946126], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step

  5%|▌         | 512/10000 [01:41<41:58,  3.77it/s, episode_reward=500, running_reward=258]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step

  5%|▌         | 513/10000 [01:42<45:39,  3.46it/s, episode_reward=500, running_reward=262]

episode_reward 500
episodes_reward deque([150, 129, 136, 135, 141, 312, 303, 317, 105, 134, 123, 131, 299, 413, 153, 116, 123, 130, 138, 123, 118, 137, 127, 120, 116, 122, 113, 99, 108, 104, 121, 130, 113, 115, 116, 106, 104, 110, 119, 111, 105, 120, 121, 123, 118, 117, 146, 119, 132, 500, 132, 142, 223, 157, 400, 122, 155, 174, 500, 148, 500, 500, 180, 140, 500, 168, 500, 115, 500, 500, 500, 500, 500, 500, 500, 500, 328, 172, 193, 139, 200, 500, 371, 420, 427, 286, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 355, 500, 500], maxlen=100)
initial_state [-0.0182456   0.01054906  0.0006684  -0.02250665]
info {}
initial_state tf.Tensor([-0.0182456   0.01054906  0.0006684  -0.02250665], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step

  5%|▌         | 514/10000 [01:42<50:55,  3.10it/s, episode_reward=500, running_reward=265]

  5%|▌         | 514/10000 [01:42<50:55,  3.10it/s, episode_reward=500, running_reward=269]

  5%|▌         | 515/10000 [01:42<54:57,  2.88it/s, episode_reward=500, running_reward=269]

initial_state [ 0.04423584  0.01356177 -0.01972394  0.01927063]
info {}
initial_state tf.Tensor([ 0.04423584  0.01356177 -0.01972394  0.01927063], shape=(4,), dtype=float32)

  5%|▌         | 516/10000 [01:43<51:19,  3.08it/s, episode_reward=284, running_reward=270]

episode_reward 284
episodes_reward deque([135, 141, 312, 303, 317, 105, 134, 123, 131, 299, 413, 153, 116, 123, 130, 138, 123, 118, 137, 127, 120, 116, 122, 113, 99, 108, 104, 121, 130, 113, 115, 116, 106, 104, 110, 119, 111, 105, 120, 121, 123, 118, 117, 146, 119, 132, 500, 132, 142, 223, 157, 400, 122, 155, 174, 500, 148, 500, 500, 180, 140, 500, 168, 500, 115, 500, 500, 500, 500, 500, 500, 500, 500, 328, 172, 193, 139, 200, 500, 371, 420, 427, 286, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 355, 500, 500, 500, 500, 284], maxlen=100)
initial_state [-0.04008051 -0.01985937 -0.00618522 -0.00928094]
info {}
initial_state tf.Tensor([-0.04008051 -0.01985937 -0.00618522 -0.00928094], shape=(4,), dtype=float32)

  5%|▌         | 517/10000 [01:43<49:20,  3.20it/s, episode_reward=385, running_reward=273]

episode_reward 385
episodes_reward deque([141, 312, 303, 317, 105, 134, 123, 131, 299, 413, 153, 116, 123, 130, 138, 123, 118, 137, 127, 120, 116, 122, 113, 99, 108, 104, 121, 130, 113, 115, 116, 106, 104, 110, 119, 111, 105, 120, 121, 123, 118, 117, 146, 119, 132, 500, 132, 142, 223, 157, 400, 122, 155, 174, 500, 148, 500, 500, 180, 140, 500, 168, 500, 115, 500, 500, 500, 500, 500, 500, 500, 500, 328, 172, 193, 139, 200, 500, 371, 420, 427, 286, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 355, 500, 500, 500, 500, 284,

  5%|▌         | 517/10000 [01:43<49:20,  3.20it/s, episode_reward=500, running_reward=276]

  5%|▌         | 518/10000 [01:43<50:18,  3.14it/s, episode_reward=500, running_reward=276]

initial_state [0.01179931 0.00363612 0.02276052 0.02476133]
info {}
initial_state tf.Tensor([0.01179931 0.00363612 0.02276052 0.02476133], shape=(4,), dtype=float32)

  5%|▌         | 519/10000 [01:44<51:40,  3.06it/s, episode_reward=500, running_reward=278]

  5%|▌         | 520/10000 [01:44<57:27,  2.75it/s, episode_reward=500, running_reward=280]

  5%|▌         | 521/10000 [01:44<54:38,  2.89it/s, episode_reward=500, running_reward=282]

episode_reward 500
episodes_reward deque([105, 134, 123, 131, 299, 413, 153, 116, 123, 130, 138, 123, 118, 137, 127, 120, 116, 122, 113, 99, 108, 104, 121, 130, 113, 115, 116, 106, 104, 110, 119, 111, 105, 120, 121, 123, 118, 117, 146, 119, 132, 500, 132, 142, 223, 157, 400, 122, 155, 174, 500, 148, 500, 500, 180, 140, 500, 168, 500, 115, 500, 500, 500, 500, 500, 500, 500, 500, 328, 172, 193, 139, 200, 500, 371, 420, 427, 286, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 355, 500, 500, 500, 500, 284, 385, 500, 500, 500, 500], maxlen=100)
initial_state [ 0.01405194  0.006494    0.00665284 -0.01128011]
info {}
initial_state tf.Tensor([ 0.01405194  0.006494    0.00665284 -0.01128011], shape=(4,), dtype=float32)

  5%|▌         | 522/10000 [01:45<54:22,  2.91it/s, episode_reward=500, running_reward=286]

  5%|▌         | 523/10000 [01:45<53:36,  2.95it/s, episode_reward=500, running_reward=290]

episode_reward 500
episodes_reward deque([123, 131, 299, 413, 153, 116, 123, 130, 138, 123, 118, 137, 127, 120, 116, 122, 113, 99, 108, 104, 121, 130, 113, 115, 116, 106, 104, 110, 119, 111, 105, 120, 121, 123, 118, 117, 146, 119, 132, 500, 132, 142, 223, 157, 400, 122, 155, 174, 500, 148, 500, 500, 180, 140, 500, 168, 500, 115, 500, 500, 500, 500, 500, 500, 500, 500, 328, 172, 193, 139, 200, 500, 371, 420, 427, 286, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 355, 500, 500, 500, 500, 284, 385, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [ 0.02427598  0.03527068 -0.03005897  0.02851599]
info {}
initial_state tf.Tensor([ 0.02427598  0.03527068 -0.03005897  0.02851599], shape=(4,), dtype=float32)

  5%|▌         | 524/10000 [01:45<52:09,  3.03it/s, episode_reward=500, running_reward=294]

  5%|▌         | 525/10000 [01:46<53:30,  2.95it/s, episode_reward=500, running_reward=297]

episode_reward 500
episodes_reward deque([299, 413, 153, 116, 123, 130, 138, 123, 118, 137, 127, 120, 116, 122, 113, 99, 108, 104, 121, 130, 113, 115, 116, 106, 104, 110, 119, 111, 105, 120, 121, 123, 118, 117, 146, 119, 132, 500, 132, 142, 223, 157, 400, 122, 155, 174, 500, 148, 500, 500, 180, 140, 500, 168, 500, 115, 500, 500, 500, 500, 500, 500, 500, 500, 328, 172, 193, 139, 200, 500, 371, 420, 427, 286, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 355, 500, 500, 500, 500, 284, 385, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [ 0.03531014 -0.00689131  0.01933403 -0.03202702]
info {}
initial_state tf.Tensor([ 0.03531014 -0.00689131  0.01933403 -0.03202702], shape=(4,), dtype=float32)

  5%|▌         | 526/10000 [01:46<54:37,  2.89it/s, episode_reward=500, running_reward=299]

  5%|▌         | 527/10000 [01:46<53:26,  2.95it/s, episode_reward=500, running_reward=300]

episode_reward 500
episodes_reward deque([153, 116, 123, 130, 138, 123, 118, 137, 127, 120, 116, 122, 113, 99, 108, 104, 121, 130, 113, 115, 116, 106, 104, 110, 119, 111, 105, 120, 121, 123, 118, 117, 146, 119, 132, 500, 132, 142, 223, 157, 400, 122, 155, 174, 500, 148, 500, 500, 180, 140, 500, 168, 500, 115, 500, 500, 500, 500, 500, 500, 500, 500, 328, 172, 193, 139, 200, 500, 371, 420, 427, 286, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 355, 500, 500, 500, 500, 284, 385, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [-0.0422048  -0.02863982 -0.01894005 -0.01267235]
info {}
initial_state tf.Tensor([-0.0422048  -0.02863982 -0.01894005 -0.01267235], shape=(4,), dtype=float32)

  5%|▌         | 528/10000 [01:47<51:57,  3.04it/s, episode_reward=500, running_reward=304]

episode_reward 500
episodes_reward deque([116, 123, 130, 138, 123, 118, 137, 127, 120, 116, 122, 113, 99, 108, 104, 121, 130, 113, 115, 116, 106, 104, 110, 119, 111, 105, 120, 121, 123, 118, 117, 146, 119, 132, 500, 132, 142, 223, 157, 400, 122, 155, 174, 500, 148, 500, 500, 180,

  5%|▌         | 529/10000 [01:47<54:15,  2.91it/s, episode_reward=500, running_reward=307]

episode_reward 500
episodes_reward deque([123, 130, 138, 123, 118, 137, 127, 120, 116, 122, 113, 99, 108, 104, 121, 130, 113, 115, 116, 106, 104, 110, 119, 111, 105, 120, 121, 123, 118, 117, 146, 119, 132, 500, 132, 142, 223, 157, 400, 122, 155, 174, 500, 148, 500, 500, 180, 140, 500, 168, 500, 115, 500, 500, 500, 500, 500, 500, 500, 500, 328, 172, 193, 139, 200, 500, 371, 420, 427, 286, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 355, 500, 500, 500, 500, 284, 385, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [-0.01729611 -0.01885734  0.02667454 -0.01492673]
info {}
initial_state tf.Tensor([-0.01729611 -0.01885734  0.02667454 -0.01492673], shape=(4,), dtype=float32)

  5%|▌         | 530/10000 [01:47<57:48,  2.73it/s, episode_reward=500, running_reward=311]

episode_reward 500
episodes_reward deque([130, 138, 123, 118, 137, 127, 120, 116, 122, 113, 99, 108, 104, 121, 130, 113, 115, 116, 106, 104, 110, 119, 111, 105, 120, 121, 123, 118, 117, 146, 119, 132, 500, 132, 142, 223, 157, 400, 122, 155, 174, 500, 148, 500, 500, 180, 140, 500, 168, 500, 115, 500, 500, 500, 500, 500, 500, 500, 500, 328, 172, 193, 139, 200, 500, 371, 420, 427, 286, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 355, 500, 500, 500, 500, 284, 385, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [-0.02637347  0.04732118 -0.03468706 -0.01526129]
info {}
initial_state tf.Tensor([-0.02637347  0.04732118 -0.03468706 -0.01526129], shape=(4,), dtype=float32)

  5%|▌         | 531/10000 [01:48<58:14,  2.71it/s, episode_reward=500, running_reward=315]

  5%|▌         | 532/10000 [01:48<1:05:25,  2.41it/s, episode_reward=500, running_reward=318]

episode_reward 500
episodes_reward deque([123, 118, 137, 127, 120, 116, 122, 113, 99, 108, 104, 121, 130, 113, 115, 116, 106, 104, 110, 119, 111, 105, 120, 121, 123, 118, 117, 146, 119, 132, 500, 132, 142, 223, 157, 400, 122, 155, 174, 500, 148, 500, 500, 180, 140, 500, 168, 500, 115, 500, 500, 500, 500, 500, 500, 500, 500, 328, 172, 193, 139, 200, 500, 371, 420, 427, 286, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 355, 500, 500, 500, 500, 284, 385, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [0.01544321 0.01454417 0.04358817 0.03552124]
info {}
initial_state tf.Tensor([0.01544321 0.01454417 0.04358817 0.03552124], shape=(4,), dtype=float32)

  5%|▌         | 533/10000 [01:49<1:11:46,  2.20it/s, episode_reward=500, running_reward=322]

episode_reward 500
episodes_reward deque([118, 137, 127, 120, 116, 122, 113, 99, 108, 104, 121, 130, 113, 115, 116, 106, 104, 110, 119, 111, 105, 120, 121, 123, 118, 117, 146, 119, 132, 500, 132, 142, 223, 157, 400, 122, 155, 174, 500, 148, 500, 500, 180, 140, 500, 168, 500, 115, 500, 500, 500, 500, 500, 500, 500, 500, 328, 172, 193, 139, 200, 500, 371, 420, 427, 286, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 355, 500, 500, 500, 500, 284, 385, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [ 0.02299045  0.00753132  0.00904557 -0.02677466]
info {}
initial_state tf.Tensor([ 0.02299045  0.00753132  0.00904557 -0.02677466], shape=(4,), dtype=float32)

  5%|▌         | 534/10000 [01:49<1:14:13,  2.13it/s, episode_reward=500, running_reward=326]

episode_reward 500
episodes_reward deque([137, 127, 120, 116, 122, 113, 99, 108, 104, 121, 130, 113, 115, 116, 106, 104, 110, 119, 111, 105, 120, 121, 123, 118, 117, 146, 119, 132, 500, 132, 142, 223, 157, 400, 122, 155, 174, 500, 148, 500, 500, 180, 140, 500, 168, 500, 115, 500, 500, 500, 500, 500, 500, 500, 500, 328, 172, 193, 139, 200, 500, 371, 420, 427, 286, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 355, 500, 500, 500, 500, 284, 385, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [ 0.01272916 -0.01659673  0.01988183 -0.03234858]
info {}
initial_state tf.Tensor([ 0.01272916 -0.01659673  0.01988183 -0.03234858], shape=(4,), dtype=float32)

  5%|▌         | 535/10000 [01:50<1:13:04,  2.16it/s, episode_reward=500, running_reward=330]

episode_reward 500

  5%|▌         | 536/10000 [01:51<1:23:44,  1.88it/s, episode_reward=500, running_reward=333]

episode_reward 500
episodes_reward deque([120, 116, 122, 113, 99, 108, 104, 121, 130, 113, 115, 116, 106, 104, 110, 119, 111, 105, 120, 121, 123, 118, 117, 146, 119, 132, 500, 132, 142, 223, 157, 400, 122, 155, 174, 500, 148, 500, 500, 180, 140, 500, 168, 500, 115, 500, 500, 500, 500, 500, 500, 500, 500, 328, 172, 193, 139, 200, 500, 371, 420, 427, 286, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 355, 500, 500, 500, 500, 284, 385, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [ 0.00157836  0.04260166 -0.01942452  0.03184291]
info {}
initial_state tf.Tensor([ 0.00157836  0.04260166 -0.01942452  0.03184291], shape=(4,), dtype=float32)

  5%|▌         | 537/10000 [01:51<1:21:17,  1.94it/s, episode_reward=500, running_reward=337]

episode_reward 500
episodes_reward deque([116, 122, 113, 99, 108, 104, 121, 130, 113, 115, 116, 106, 104, 110, 119, 111, 105, 120, 121, 123, 118, 117, 146, 119, 132, 500, 132, 142, 223, 157, 400, 122, 155, 174, 500, 148, 500, 500, 180, 140, 500, 168, 500, 115, 500, 500, 500, 500, 500, 500, 500, 500, 328, 172, 193, 139, 200, 500, 371, 420, 427, 286, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 355, 500, 500, 500, 500, 284, 385, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [-0.02756893 -0.02347567 -0.00318483 -0.03154145]
info {}
initial_state tf.Tensor([-0.02756893 -0.02347567 -0.00318483 -0.03154145], shape=(4,), dtype=float32)

  5%|▌         | 538/10000 [01:52<1:19:25,  1.99it/s, episode_reward=500, running_reward=341]

episode_reward 500
episodes_reward deque([122, 113, 99, 108, 104, 121, 130, 113, 115, 116, 106, 104, 110, 119, 111, 105, 120, 121, 123, 118, 117, 146, 119, 132, 500, 132, 142, 223, 157, 400, 122, 155, 174, 500, 148, 500, 500, 180, 140, 500, 168, 500, 115, 500, 500, 500, 500, 500, 500, 500, 500, 328, 172, 193, 139, 200, 500, 371, 420, 427, 286, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 355, 500, 500, 500, 500, 284, 385, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [ 0.00675465  0.03084019 -0.03580144  0.01265709]
info {}
initial_state tf.Tensor([ 0.00675465  0.03084019 -0.03580144  0.01265709], shape=(4,), dtype=float32)

  5%|▌         | 539/10000 [01:52<1:21:32,  1.93it/s, episode_reward=500, running_reward=345]

episode_reward 500
episodes_reward deque([113, 99, 108, 104, 121, 130, 113, 115, 116, 106, 104, 110, 119, 111, 105, 120, 121, 123, 118, 117, 146, 119, 132, 500, 132, 142, 223, 157, 400, 122, 155, 174, 500, 148, 500, 500, 180, 140, 500, 168, 500, 115, 500, 500, 500, 500, 500, 500, 500, 500, 328, 172, 193, 139, 200, 500, 371, 420, 427, 286, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 355, 500, 500, 500,

  5%|▌         | 540/10000 [01:52<1:15:39,  2.08it/s, episode_reward=500, running_reward=349]

  5%|▌         | 541/10000 [01:53<1:09:03,  2.28it/s, episode_reward=500, running_reward=353]

episode_reward 500
episodes_reward deque([108, 104, 121, 130, 113, 115, 116, 106, 104, 110, 119, 111, 105, 120, 121, 123, 118, 117, 146, 119, 132, 500, 132, 142, 223, 157, 400, 122, 155, 174, 500, 148, 500, 500, 180, 140, 500, 168, 500, 115, 500, 500, 500, 500, 500, 500, 500, 500, 328, 172, 193, 139, 200, 500, 371, 420, 427, 286, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 355, 500, 500, 500, 500, 284, 385, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [-0.0402178   0.02953322 -0.00725337 -0.01396023]
info {}
initial_state tf.Tensor([-0.0402178   0.02953322 -0.00725337 -0.01396023], shape=(4,), dtype=float32)
enter env_step

  5%|▌         | 542/10000 [01:53<1:09:27,  2.27it/s, episode_reward=500, running_reward=357]

episode_reward 500
episodes_reward deque([104, 121, 130, 113, 115, 116, 106, 104, 110, 119, 111, 105, 120, 121, 123, 118, 117, 146, 119, 132, 500, 132, 142, 223, 157, 400, 122, 155, 174, 500, 148, 500, 500, 180, 140, 500, 168, 500, 115, 500, 500, 500, 500, 500, 500, 500, 500, 328, 172, 193, 139, 200, 500, 371, 420, 427, 286, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 355, 500, 500, 500, 500, 284, 385, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [-0.03348019  0.03419713 -0.01906389 -0.01792044]
info {}
initial_state tf.Tensor([-0.03348019  0.03419713 -0.01906389 -0.01792044], shape=(4,), dtype=float32)
enter env_step

  5%|▌         | 543/10000 [01:54<1:04:38,  2.44it/s, episode_reward=500, running_reward=361]

enter env_step

  5%|▌         | 544/10000 [01:54<1:00:43,  2.59it/s, episode_reward=500, running_reward=364]

episode_reward 500
episodes_reward deque([130, 113, 115, 116, 106, 104, 110, 119, 111, 105, 120, 121, 123, 118, 117, 146, 119, 132, 500, 132, 142, 223, 157, 400, 122, 155, 174, 500, 148, 500, 500, 180, 140, 500, 168, 500, 115, 500, 500, 500, 500, 500, 500, 500, 500, 328, 172, 193, 139, 200, 500, 371, 420, 427, 286, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 355, 500, 500, 500, 500, 284, 385, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [ 0.01609123  0.0175725  -0.04745775  0.01730865]
info {}
initial_state tf.Tensor([ 0.01609123  0.0175725  -0.04745775  0.01730865], shape=(4,), dtype=float32)
enter env_step

  5%|▌         | 545/10000 [01:54<56:17,  2.80it/s, episode_reward=500, running_reward=368]  

enter env_step
episode_reward 500
episodes_reward deque

  5%|▌         | 546/10000 [01:54<51:25,  3.06it/s, episode_reward=500, running_reward=372]

enter env_step

  5%|▌         | 547/10000 [01:55<49:54,  3.16it/s, episode_reward=500, running_reward=376]

episode_reward 500
episodes_reward deque([116, 106, 104, 110, 119, 111, 105, 120, 121, 123, 118, 117, 146, 119, 132, 500, 132, 142, 223, 157, 400, 122, 155, 174, 500, 148, 500, 500, 180, 140, 500, 168, 500, 115, 500, 500, 500, 500, 500, 500, 500, 500, 328, 172, 193, 139, 200, 500, 371, 420, 427, 286, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 355, 500, 500, 500, 500, 284, 385, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [ 0.04707238 -0.01996641  0.02665508  0.04981277]
info {}
initial_state tf.Tensor([ 0.04707238 -0.01996641  0.02665508  0.04981277], shape=(4,), dtype=float32)
enter env_step

  5%|▌         | 548/10000 [01:55<47:50,  3.29it/s, episode_reward=500, running_reward=380]

enter env_step
episode_reward 500
episodes_reward deque([106, 104, 110, 119, 111, 105, 120, 121, 123, 118, 117, 146, 119, 132, 500, 132, 142, 223, 157, 400, 122, 155, 174, 500, 148, 500, 500, 180, 140, 500, 168, 500, 115

  5%|▌         | 549/10000 [01:55<47:05,  3.35it/s, episode_reward=500, running_reward=384]

enter env_step

  6%|▌         | 550/10000 [01:56<44:06,  3.57it/s, episode_reward=500, running_reward=388]

enter env_step

  6%|▌         | 551/10000 [01:56<44:56,  3.50it/s, episode_reward=500, running_reward=391]

enter env_step
episode_re

  6%|▌         | 552/10000 [01:56<49:42,  3.17it/s, episode_reward=500, running_reward=395]

enter env_step
episode_reward 500
episodes_reward deque([111, 105, 120, 121, 123, 118, 117, 146, 119, 132, 500, 132, 142, 223, 157, 400, 122, 155, 174, 500, 148, 500, 500, 180, 140, 500, 168, 500, 115, 500, 500, 500, 500, 500, 500, 500, 500, 328, 172, 193, 139, 200, 500, 371, 420, 427, 286, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 355, 500, 500, 500, 500, 284, 385, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [-0.00020933  0.04031317 -0.04946039 -0.04129069]
info {}
initial_state tf.Tensor([-0.00020933  0.04031317 -0.04946039 -0.04129069], shape=(4,), dtype=float32)
enter env_step

  6%|▌         | 553/10000 [01:57<52:54,  2.98it/s, episode_reward=500, running_reward=399]

episode_reward 500
episodes_reward deque([105, 120, 121, 123, 118, 117, 146, 119, 132, 500, 132, 142, 223, 157, 400, 122, 155, 174, 500, 148, 500, 500, 180, 140, 500, 168, 500, 115, 500, 500, 500, 500, 500, 500, 500, 500, 328, 172, 193, 139, 200, 500, 371, 420, 427, 286, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 355, 500, 500, 500, 500, 284, 385, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [-0.03941081 -0.00410096 -0.03409927 -0.00484835]
info {}
initial_state tf.Tensor([-0.03941081 -0.00410096 -0.03409927 -0.00484835], shape=(4,), dtype=float32)
enter env_step

  6%|▌         | 554/10000 [01:57<1:01:40,  2.55it/s, episode_reward=500, running_reward=403]

episode_reward 500
episodes_reward deque([120, 121, 123, 118, 117, 146, 119, 132, 500, 132, 142, 223, 157, 400, 122, 155, 174, 500, 148, 500, 500, 180, 140, 500, 168, 500, 115, 500, 500, 500, 500, 500, 500, 500, 500, 328, 172, 193, 139, 200, 500, 371, 420, 427, 286, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 355, 500, 500, 500, 500, 284, 385, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [-0.01708176  0.0233711   0.01604763 -0.00428166]
info {}
initial_state tf.Tensor([-0.01708176  0.0233711   0.01604763 -0.00428166], shape=(4,), dtype=float32)
enter env_step

  6%|▌         | 555/10000 [01:58<1:08:19,  2.30it/s, episode_reward=500, running_reward=407]

enter env_step
episode_reward 500
episodes_reward deque([121, 123, 118, 117, 146, 119, 132, 500, 132, 142, 223, 157, 400, 122, 155, 174, 500, 148, 500, 500, 180, 140, 500, 168, 500, 115, 500, 500, 500, 500, 500, 500, 500, 500, 328, 172, 193, 139, 200, 500, 371, 420, 427, 286, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 355, 500, 500, 500, 500, 284, 385, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500

  6%|▌         | 556/10000 [01:58<1:07:25,  2.33it/s, episode_reward=500, running_reward=411]

enter env_step

  6%|▌         | 557/10000 [01:58<1:06:18,  2.37it/s, episode_reward=500, running_reward=414]

enter env_step
episode_reward 500
episodes_reward deque([118, 117, 146, 119, 132, 500, 132, 142, 223, 157, 400, 122, 155, 174, 500, 148, 500, 500, 180, 140, 500, 168, 500, 115, 500, 500, 500, 500, 500, 500, 500, 500, 328, 172, 193, 139

  6%|▌         | 558/10000 [01:59<1:03:26,  2.48it/s, episode_reward=500, running_reward=418]

episode_reward 500
episodes_reward deque([117, 146, 119, 132, 500, 132, 142, 223, 157, 400, 122, 155, 174, 500, 148, 500, 500, 180, 140, 500, 168, 500, 115, 500, 500, 500, 500, 500, 500, 500, 500, 328, 172, 193, 139, 200, 500, 371, 420, 427, 286, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 355, 500, 500, 500, 500, 284, 385, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [0.01199057 0.03907576 0.02577784 0.01381921]
info {}
initial_state tf.Tensor([0.01199057 0.03907576 0.02577784 0.01381921], shape=(4,), dtype=float32)
enter env_step

  6%|▌         | 559/10000 [01:59<1:02:55,  2.50it/s, episode_reward=500, running_reward=422]

enter env_step

  6%|▌         | 560/10000 [02:00<1:01:40,  2.55it/s, episode_reward=500, running_reward=426]

enter env_step

  6%|▌         | 561/10000 [02:00<59:21,  2.65it/s, episode_reward=500, running_reward=429]  

episode_reward 500
episodes_reward deque([132, 500, 132, 142, 223, 157, 400, 122, 155, 174, 500, 148, 500, 500, 180, 140, 500, 168, 500, 115, 500, 500, 500, 500, 500, 500, 500, 500, 328, 172, 193, 139, 200, 500, 371, 420, 427, 286, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 355, 500, 500, 500, 500, 284, 385, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [-0.03294292  0.03032349 -0.02645254 -0.00351418]
info {}
initial_state tf.Tensor([-0.03294292  0.03032349 -0.02645254 -0.00351418], shape=(4,), dtype=float32)
enter env_step

  6%|▌         | 562/10000 [02:00<55:51,  2.82it/s, episode_reward=500, running_reward=433]

enter env_step
episode_reward 500
episodes_reward deque([500, 132, 142, 223, 157, 400, 122, 155, 174, 500, 148, 500, 500, 180, 140, 500, 168, 500, 115, 500, 500, 500, 500, 500, 500, 500, 500, 328, 172, 193, 139, 200, 500, 371, 420, 427, 286, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 355, 500, 500, 500, 500, 284, 385, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [ 0.0198399

  6%|▌         | 563/10000 [02:01<54:58,  2.86it/s, episode_reward=500, running_reward=433]

episode_reward 500
episodes_reward deque([132, 142, 223, 157, 400, 122, 155, 174, 500, 148, 500, 500, 180, 140, 500, 168, 500, 115, 500, 500, 500, 500, 500, 500, 500, 500, 328, 172, 193, 139, 200, 500, 371, 420, 427, 286, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 355, 500, 500, 500, 500, 284, 385, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [-0.04757033 -0.02884054 -0.04474346  0.00767487]
info {}
initial_state tf.Tensor([-0.04757033 -0.02884054 -0.04474346  0.00767487], shape=(4,), dtype=float32)
enter env_step

  6%|▌         | 564/10000 [02:01<51:34,  3.05it/s, episode_reward=500, running_reward=437]

episode_reward 500
episodes_reward deque([142, 223, 157, 400, 122, 155, 174, 500, 148, 500, 500, 180, 140, 500, 168, 500, 115, 500, 500, 500, 500, 500, 500, 500, 500, 328, 172, 193, 139, 200, 500, 371, 420, 427, 286, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 355, 500, 500, 500, 500, 284, 385, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [-0.00944144 -0.02157675 -0.04013111  0.03571663]
info {}
initial_state tf.Tensor([-0.00944144 -0.02157675 -0.04013111  0.03571663], shape=(4,), dtype=float32)
enter env_step

  6%|▌         | 565/10000 [02:01<49:12,  3.20it/s, episode_reward=500, running_reward=440]

enter env_step
episode_reward 500
episodes_reward deque([223, 157, 400, 122, 155, 174, 500, 148, 500, 500, 180, 140, 500, 168, 500, 115, 500, 500, 500, 500, 500, 500, 500, 500, 328, 172, 193, 139, 200, 500, 371, 420, 427, 286, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 355, 500, 500, 500, 500, 284, 385, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [-0.04217938 -0.02582205  0.0187107  -0.03188831]
info {}
initial_state tf.Tensor([-0.04217938 -0.02582205  0.0187107  -0.03188831], shape=(4,), dtype=float32)
enter env_step

  6%|▌         | 566/10000 [02:01<47:00,  3.34it/s, episode_reward=500, running_reward=443]

enter env_step

  6%|▌         | 566/10000 [02:02<47:00,  3.34it/s, episode_reward=500, running_reward=447]

enter env_step

  6%|▌         | 567/10000 [02:02<45:53,  3.43it/s, episode_reward=500, running_reward=447]

initial_state [-0.04913851 -0.02084727  0.02573309 -0.02560139]
info {}
initial_state tf.Tensor([-0.04913851 -0.02084727  0.02573309 -0.02560139], shape=(4,), dtype=float32)
enter env_step

  6%|▌         | 568/10000 [02:02<45:06,  3.49it/s, episode_reward=500, running_reward=448]

episode_reward 500
episodes_reward deque([122, 155, 174, 500, 148, 500, 500, 180, 140, 500, 168, 500, 115, 500, 500, 500, 500, 500, 500, 500, 500, 328, 172, 193, 139, 200, 500, 371, 420, 427, 286, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 355, 500, 500, 500, 500, 284, 385, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [-0.00517156 -0.04016832  0.00334856 -0.00401415]
info {}
initial_state tf.Tensor([-0.00517156 -0.04016832  0.00334856 -0.00401415], shape=(4,), dtype=float32)
enter env_step

  6%|▌         | 569/10000 [02:02<46:01,  3.42it/s, episode_reward=500, running_reward=451]

enter env_step

  6%|▌         | 570/10000 [02:03<45:48,  3.43it/s, episode_reward=500, running_reward=455]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  6%|▌         | 571/10000 [02:03<44:25,  3.54it/s, episode_reward=500, running_reward=458]

episode_reward 500
episodes_reward deque([500, 148, 500, 500, 180, 140, 500, 168, 500, 115, 500, 500, 500, 500, 500, 500, 500, 500, 328, 172, 193, 139, 200, 500, 371, 420, 427, 286, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 355, 500, 500, 500, 500, 284, 385, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [-0.00918414 -0.02500752 -0.01433513 -0.03612218]
info {}
initial_state tf.Tensor([-0.00918414 -0.02500752 -0.01433513 -0.03612218], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
e

  6%|▌         | 572/10000 [02:03<44:22,  3.54it/s, episode_reward=500, running_reward=458]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
episode_reward 500
episod

  6%|▌         | 573/10000 [02:03<43:04,  3.65it/s, episode_reward=500, running_reward=462]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  6%|▌         | 573/10000 [02:04<43:04,  3.65it/s, episode_reward=500, running_reward=462]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  6%|▌         | 574/10000 [02:04<42:57,  3.66it/s, episode_reward=500, running_reward=462]

initial_state [-0.00535496  0.01508955 -0.01197653  0.03681706]
info {}
initial_state tf.Tensor([-0.00535496  0.01508955 -0.01197653  0.03681706], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
e

  6%|▌         | 575/10000 [02:04<47:56,  3.28it/s, episode_reward=500, running_reward=462]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
episode_reward 500
episodes_reward deque([180, 140, 500, 168, 500, 115, 500, 500, 500, 500, 500, 500, 500, 500, 328, 172, 193, 139, 200, 500, 371, 420, 427, 286, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 355, 500, 500, 500, 500, 284, 385, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=

  6%|▌         | 576/10000 [02:04<50:48,  3.09it/s, episode_reward=500, running_reward=465]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
episode_reward 500
episodes_reward deque([140, 500, 168, 500, 115, 500, 500, 500, 500, 500, 500, 500, 500, 328, 172, 193, 139, 200, 500, 371, 420, 427, 286, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 355, 500, 500, 500, 500, 284, 385, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [-0.01984569 -0.03443446  0.00541245  0.02383523]
info {}
initial_state tf.Tensor([-0.01984569 -0.03443446  0.00541245  0.02383523], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
e

  6%|▌         | 577/10000 [02:05<54:49,  2.86it/s, episode_reward=500, running_reward=468]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
episode_reward 500
episodes_reward deque([500, 168, 500, 115, 500, 500, 500, 500, 500, 500, 500, 500

  6%|▌         | 578/10000 [02:05<57:01,  2.75it/s, episode_reward=500, running_reward=468]

episode_reward 500
episodes_reward deque([168, 500, 115, 500, 500, 500, 500, 500, 500, 500, 500, 328, 172, 193, 139, 200, 500, 371, 420, 427, 286, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 355, 500, 500, 500, 500, 284, 385, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [ 0.01194678  0.00729256 -0.01574359 -0.02489681]
info {}
initial_state tf.Tensor([ 0.01194678  0.00729256 -0.01574359 -0.02489681], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
e

  6%|▌         | 579/10000 [02:06<1:02:30,  2.51it/s, episode_reward=500, running_reward=472]

episode_reward 500
episodes_reward deque([500, 115, 500, 500, 500, 500, 500, 500, 500, 500, 328, 172, 193, 139, 200, 500, 371, 420, 427, 286, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 355, 500, 500, 500, 500, 284, 385, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [-0.03809238 -0.02546486 -0.020545    0.00093429]
info {}
initial_state tf.Tensor([-0.03809238 -0.02546486 -0.020545    0.00093429], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
e

  6%|▌         | 580/10000 [02:06<59:35,  2.63it/s, episode_reward=407, running_reward=471]  

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
episode_reward 407
episodes_reward deque([115, 500, 500, 500, 500, 500, 500, 500, 500, 328, 172, 193, 139, 200, 500, 371, 420, 427, 286, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 355, 500, 500, 500, 500, 284, 385, 500, 500, 500, 500, 500, 500, 500, 500

  6%|▌         | 581/10000 [02:06<1:00:28,  2.60it/s, episode_reward=500, running_reward=475]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
episode_reward 500
episodes_reward deque([500, 500, 500, 500, 500, 500, 500, 500, 328, 172, 193, 139, 200, 500, 371, 420, 427, 286, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 355, 500, 500, 500, 500, 284, 385, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500

  6%|▌         | 582/10000 [02:07<1:03:51,  2.46it/s, episode_reward=500, running_reward=475]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  6%|▌         | 583/10000 [02:07<58:40,  2.68it/s, episode_reward=500, running_reward=475]  

episode_reward 500
episodes_reward deque([500, 500, 500, 500, 500, 500, 328, 172, 193, 139, 200, 500, 371, 420, 427, 286, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 355, 500, 500, 500, 500, 284, 385, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 407, 500, 500, 500], maxlen=100)
initial_state [ 0.02584707  0.04303687  0.00302448 -0.00240438]
info {}
initial_state tf.Tensor([ 0.02584707  0.04303687  0.00302448 -0.00240438], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
e

  6%|▌         | 584/10000 [02:07<55:11,  2.84it/s, episode_reward=500, running_reward=475]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  6%|▌         | 585/10000 [02:08<52:03,  3.01it/s, episode_reward=500, running_reward=475]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  6%|▌         | 586/10000 [02:08<49:03,  3.20it/s, episode_reward=500, running_reward=475]

episode_reward 500
episodes_reward deque([500, 500, 500, 328, 172, 193, 139, 200, 500, 371, 420, 427, 286, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 355, 500, 500, 500, 500, 284, 385, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 407, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [0.04481142 0.00422674 0.0400615  0.01216136]
info {}
initial_state tf.Tensor([0.04481142 0.00422674 0.0400615  0.01216136], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env

  6%|▌         | 587/10000 [02:08<48:29,  3.24it/s, episode_reward=500, running_reward=475]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
episode_reward 500
episodes_reward deque([500, 500, 328, 172, 193, 139, 200, 500, 371, 420, 427, 286, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 355, 500, 500, 500, 500, 284, 385, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 407, 500, 500, 500, 500, 500, 500, 500], maxlen=100)
initial_state [ 0.0386459  -0.01850116  0.02845228  0.02645641]
info {}
initial_state tf.Tensor([ 0.0386459  -0.01850116  0.02845228  0.02645641], shape=(4,), dtype=float32)
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
e

  6%|▌         | 588/10000 [02:09<46:37,  3.36it/s, episode_reward=500, running_reward=475]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  6%|▌         | 589/10000 [02:09<48:24,  3.24it/s, episode_reward=500, running_reward=475]

enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_step
enter env_

  6%|▌         | 589/10000 [02:09<34:32,  4.54it/s, episode_reward=500, running_reward=476]


Solved at episode 589: average reward: 476.39!
CPU times: total: 2min 54s
Wall time: 2min 9s
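
The final line of the training output reports an average reward at the episode where training stopped. The following is a minimal sketch of how such a stopping rule can be computed, assuming the average is taken over a sliding window of the most recent episode rewards and compared against a threshold. The function name `first_solved_episode`, the `window` default of 100, and the `threshold` value of 475 are illustrative assumptions, not values confirmed by this notebook.

```python
from collections import deque
from statistics import mean


def first_solved_episode(episode_rewards, window=100, threshold=475.0):
    """Return (episode_index, running_mean) for the first episode at which the
    mean of the last `window` rewards exceeds `threshold`, or None if that
    never happens. Indices are 0-based."""
    recent = deque(maxlen=window)  # old rewards fall off the left automatically
    for i, reward in enumerate(episode_rewards):
        recent.append(reward)
        running = mean(recent)
        # Only declare success once the window is completely filled
        if len(recent) == window and running > threshold:
            return i, running
    return None


# Hypothetical reward history: 50 weak episodes followed by 200 perfect ones
result = first_solved_episode([100] * 50 + [500] * 200)
print(result)
```

With this hypothetical history the rule fires at index 143, where the window still contains six weak episodes and the running mean is 476. Requiring a full window avoids declaring success from a handful of lucky early episodes.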





The diagram above shows the interaction between an agent and an environment. In reinforcement learning, one or more agents interact with an environment, which may be either a simulation, like CartPole in this tutorial, or a connection to real-world sensors and actuators. At each step, the agent receives an observation (i.e., the state of the environment), takes an action, and usually receives a reward (how often the agent receives a reward depends on the task or problem). Agents learn from repeated trials, each of which is called an episode: the sequence of actions from an initial observation up to a “success” or “failure” that causes the environment to reach its “done” state. The learning portion of an RL framework trains a policy that determines which actions (i.e., sequential decisions) lead agents to maximize their long-term, cumulative rewards.
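
The observation, action, reward, and “done” cycle described above can be sketched without any RL library. The example below is a minimal illustration under stated assumptions: `ToyEnv` is a hypothetical one-dimensional environment invented here (it is not CartPole and does not use the Gym API), and `run_episode`, `always_right`, and `random_policy` are illustrative names.

```python
import random


class ToyEnv:
    """Hypothetical 1-D environment: the agent starts at position 0 and the
    episode ends on reaching +3 (success) or -3 (failure)."""

    def reset(self):
        self.position = 0
        return self.position  # initial observation

    def step(self, action):
        # action 1 moves right, any other action moves left
        self.position += 1 if action == 1 else -1
        done = abs(self.position) >= 3
        reward = 1.0 if self.position >= 3 else 0.0  # reward only on success
        return self.position, reward, done


def run_episode(env, policy, max_steps=1000):
    """Run one episode; return (cumulative_reward, number_of_steps)."""
    observation = env.reset()
    total_reward = 0.0
    steps = 0
    while steps < max_steps:
        action = policy(observation)                  # agent chooses an action
        observation, reward, done = env.step(action)  # environment responds
        total_reward += reward
        steps += 1
        if done:                                      # "done" state reached
            break
    return total_reward, steps


def always_right(observation):
    return 1


def random_policy(observation):
    return random.choice([0, 1])


print(run_episode(ToyEnv(), always_right))
print(run_episode(ToyEnv(), random_policy))
```

A learning algorithm would replace `random_policy` with a policy whose parameters are adjusted between episodes so that the cumulative reward returned by `run_episode` increases over time.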

Visualization
After training, it would be good to visualize how the model performs in the environment. You can run the cells below to generate a GIF animation of one episode run of the model. Note that additional packages need to be installed for Gym to render the environment's images correctly in Colab.

In [None]:
# Render an episode and save as a GIF file

from IPython import display as ipythondisplay
from PIL import Image

render_env = gym.make("CartPole-v1", render_mode='rgb_array')

def render_episode(env: gym.Env, model: tf.keras.Model, max_steps: int):
  """Run one greedy episode and return the rendered frames as PIL images."""
  state, info = env.reset()
  state = tf.constant(state, dtype=tf.float32)
  screen = env.render()
  images = [Image.fromarray(screen)]

  for i in range(1, max_steps + 1):
    # Add a batch dimension before calling the model
    state = tf.expand_dims(state, 0)
    action_probs, _ = model(state)
    # Pick the highest-scoring action (greedy)
    action = np.argmax(np.squeeze(action_probs))

    state, reward, done, truncated, info = env.step(action)
    state = tf.constant(state, dtype=tf.float32)

    # Render screen every 10 steps
    if i % 10 == 0:
      screen = env.render()
      images.append(Image.fromarray(screen))

    # Stop when the episode terminates or the environment's time limit is hit
    if done or truncated:
      break

  return images


# Save GIF image
images = render_episode(render_env, model, max_steps_per_episode)
image_file = 'cartpole-v1.gif'
# loop=0: loop forever, duration=1: play each frame for 1ms
images[0].save(
    image_file, save_all=True, append_images=images[1:], loop=0, duration=1)
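
The rendering loop above takes `np.argmax` over the model's action output, so it always plays the highest-scoring action. A stochastic alternative is to sample an action in proportion to the softmax of the scores. The sketch below contrasts the two in plain Python; it assumes the scores are unnormalized logits, and `softmax`, `greedy_action`, and `sampled_action` are illustrative helper names rather than functions from this notebook.

```python
import math
import random


def softmax(logits):
    """Convert unnormalized scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_action(logits):
    """Index of the highest score (what argmax does in the rendering loop)."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sampled_action(logits, rng=random):
    """Draw an action index with probability given by softmax(logits)."""
    probs = softmax(logits)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [0.0, 2.0]  # hypothetical scores for the two CartPole actions
rng = random.Random(0)
draws = [sampled_action(logits, rng) for _ in range(1000)]
print(greedy_action(logits), draws.count(1) / len(draws))
```

Greedy selection gives a deterministic rollout for a given start state, which suits a demonstration GIF, while sampling keeps some exploration and is the more usual choice while a policy is still being trained.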

In [14]:
import tensorflow_docs.vis.embed as embed
embed.embed_file(image_file)