In [1]:
import gym

In [2]:
e = gym.make('CartPole-v0')

This environment belongs to the "classic control" group. The task is to
control a platform (the cart) that has a stick (the pole) attached to it by its bottom end. The difficulty is that the stick tends to fall to the right or left, and you need to
balance it by moving the platform to the right or left on every step.

In [3]:
obs = e.reset()
obs

array([-0.033498  , -0.01541619,  0.02484131,  0.03696866])

Here we reset the environment and obtain the first observation (we always need
to reset the newly created environment).

The observation of this environment is a vector of four floats: the x
coordinate of the platform, the platform's velocity, the stick's angle from the
vertical, and the stick's angular velocity. Of course, with some math
and physics knowledge, it wouldn't be hard to convert these numbers into
actions that balance the stick, but our problem is much trickier: how
do we learn to balance this system without knowing the exact meaning of the
observed numbers, using only the reward signal?
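
For contrast with the learning problem, here is a minimal sketch of what "applying some math and physics knowledge" could look like. It is not part of the original notebook: the name `lean_policy` is made up for illustration, and the code assumes the usual CartPole layout, where the third and fourth numbers are the stick's angle and angular velocity and positive values mean leaning to the right.

```python
# Hand-written baseline policy (an illustration, not part of the original notebook).
# Assumption: obs = [platform position, platform velocity,
#                    stick angle, stick angular velocity],
# with positive angles meaning the stick leans to the right.

LEFT, RIGHT = 0, 1  # action codes used by CartPole


def lean_policy(obs):
    """Push the platform toward the side the stick is leaning or falling to."""
    _, _, angle, angular_velocity = obs
    # Adding the angular velocity anticipates where the stick is heading.
    return RIGHT if angle + angular_velocity > 0 else LEFT


# The first observation printed above: the stick leans slightly to the right.
print(lean_policy([-0.033498, -0.01541619, 0.02484131, 0.03696866]))  # 1
```

A rule like this only works because we know what each number means. The rest of the notebook deliberately avoids that knowledge.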

In [6]:
# The observation is four numbers; the space descriptions tell us this in advance:
print(e.action_space)
print(e.observation_space)

Discrete(2)
Box(4,)


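The action_space field is of the Discrete type, so our actions will be just 0 or
1, where 0 means pushing the platform to the left and 1 means to the right. The
observation space is Box(4,), which means a vector of four floats, each with its
own lower and upper bound. For CartPole, the position is limited to about ±4.8
and the angle to about ±0.418 radians, while the two velocity components are
effectively unbounded (their limits are the largest float32 values).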

In [7]:
e.step(0)

(array([-0.03380633, -0.2108854 ,  0.02558068,  0.33738461]), 1.0, False, {})

Here we pushed our platform to the left by executing action 0 and got back a
tuple of four elements:
<ul>
<li>A new observation, which is a new vector of four numbers</li>
<li>A reward of 1.0</li>
<li>The done flag, here False, which means that the episode is not over yet and
we're more or less okay</li>
<li>Extra information about the environment, which here is an empty dictionary</li>
</ul>
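
The four-element tuple above is what the Gym version used in this notebook returns. Newer Gym releases (0.26 and later) and Gymnasium return five elements, splitting the done flag into `terminated` and `truncated`. The helper below is a sketch of how both shapes could be handled; `unpack_step` is a hypothetical name, not a Gym function.

```python
def unpack_step(result):
    """Normalize the tuple returned by env.step() to (obs, reward, done, info)."""
    if len(result) == 4:
        # Older Gym API, as used in this notebook.
        obs, reward, done, info = result
    elif len(result) == 5:
        # Newer Gym / Gymnasium API: the done flag is split in two.
        obs, reward, terminated, truncated, info = result
        done = terminated or truncated
    else:
        raise ValueError("unexpected step() result of length {}".format(len(result)))
    return obs, reward, done, info


# A hand-written tuple shaped like the output of e.step(0) above.
old_style = ([-0.0338, -0.2109, 0.0256, 0.3374], 1.0, False, {})
obs, reward, done, info = unpack_step(old_style)
print(reward, done, info)  # 1.0 False {}
```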

In [8]:
e.action_space.sample()

0

In [9]:
e.action_space.sample()

1

In [10]:
e.observation_space.sample()

array([ 1.3731933e+00, -7.2842941e+37,  3.0977464e-01, -5.3017316e+37],
      dtype=float32)

Here we used the sample() method of the Space class on action_space and
observation_space. This method returns a random sample from the underlying
space, which for our Discrete action space means either 0 or 1, chosen at
random, and for the observation space means a random vector of four numbers. A
random sample of the observation space is rarely useful, but
a sample from the action space can be used when we don't know which
action to take.
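
To make the idea of sampling from a discrete space concrete, here is a toy sketch in plain Python. `ToyDiscrete` is a hypothetical stand-in written for this illustration. It is not Gym's implementation, and the `seed` argument is an assumption added so the generator can be made reproducible.

```python
import random


class ToyDiscrete:
    """Hypothetical stand-in for a discrete space with values 0 .. n-1."""

    def __init__(self, n, seed=None):
        self.n = n
        self._rng = random.Random(seed)  # private generator, so seeding stays local

    def sample(self):
        # Every value in 0 .. n-1 is equally likely.
        return self._rng.randrange(self.n)

    def contains(self, x):
        # True only for integers inside the valid range.
        return isinstance(x, int) and 0 <= x < self.n


space = ToyDiscrete(2, seed=0)
samples = [space.sample() for _ in range(10)]
print(samples)  # a list of ten values, each 0 or 1
```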

### Enough of playing around with the environment. Now, let's create a random agent.

In [18]:
env = gym.make('CartPole-v0')
total_reward = 0.0
total_steps = 0
obs = env.reset()

Here, we create the environment and initialize the step counter and the
reward accumulator. On the last line, we reset the environment to obtain the first
observation (which we won't use, as our agent picks its actions at random).

In [19]:
while True:
  action = env.action_space.sample()
  obs, reward, done, _ = env.step(action)
  total_reward += reward
  total_steps += 1
  if done:
    break

print('Episode done in {} steps. \n Total reward : {}'.format(total_steps, total_reward))

Episode done in 32 steps. 
 Total reward : 32.0


In this loop, we sample a random action, then ask the environment to execute it
and return the next observation (obs), the reward, and the done flag. If the
episode is over, we stop the loop and show how many steps we've taken and how
much reward has been accumulated.
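
A single episode of a random agent is noisy, so it can help to wrap the loop in a function and average over several episodes. The sketch below does this under stated assumptions: `run_episode`, `random_action`, and `ToyEnv` are hypothetical names, and `ToyEnv` is a stand-in that only mimics the reset()/step() interface used above so the snippet runs without Gym installed.

```python
import random


class ToyEnv:
    """Hypothetical stand-in environment: every episode lasts `length` steps."""

    def __init__(self, length=5):
        self.length = length
        self.steps = 0

    def reset(self):
        self.steps = 0
        return [0.0, 0.0, 0.0, 0.0]

    def step(self, action):
        self.steps += 1
        done = self.steps >= self.length
        # Same tuple shape as in the notebook: (obs, reward, done, info).
        return [0.0, 0.0, 0.0, 0.0], 1.0, done, {}


def run_episode(env, choose_action):
    """Play one episode and return (total_reward, total_steps)."""
    total_reward, total_steps = 0.0, 0
    obs = env.reset()
    while True:
        obs, reward, done, _ = env.step(choose_action(obs))
        total_reward += reward
        total_steps += 1
        if done:
            return total_reward, total_steps


def random_action(obs):
    # Ignores the observation, like the random agent above.
    return random.choice([0, 1])


rewards = [run_episode(ToyEnv(length=5), random_action)[0] for _ in range(3)]
print(sum(rewards) / len(rewards))  # 5.0
```

With Gym installed, the same function should work on the real environment by passing `gym.make('CartPole-v0')` in place of `ToyEnv`, provided the installed version uses the four-element step() tuple shown in this notebook.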