In [4]:
from tensorboardX import SummaryWriter
import gym 
from collections import namedtuple
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

Our model's core is a one-hidden-layer neural network with ReLU activation
and 128 hidden neurons (an arbitrary choice).
We define constants at the top of the file: the number of neurons in
the hidden layer, the number of episodes we play on every iteration (16), and the
percentile of episodes' total rewards that we use for elite episode filtering
(70, which keeps roughly the top 30% of episodes).

In [5]:
HIDDEN_SIZE = 128
BATCH_SIZE = 16
PERCENTILE = 70

Our network takes a single observation from the
environment as an input vector and outputs a number for every action we can perform. These numbers are raw, unnormalized scores; a softmax applied later converts them into a probability distribution over actions.

In [6]:
class Net(nn.Module):

    def __init__(self, obs_size, hidden_size, n_actions):
        super(Net, self).__init__()
        self.net = nn.Sequential(
                    nn.Linear(obs_size, hidden_size),
                    nn.ReLU(),
                    nn.Linear(hidden_size, n_actions)
                    )

    def forward(self, x):
        return self.net(x)
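
As a quick check of the shapes involved, the sketch below feeds one made-up observation through a network of the same layout and applies softmax to the raw scores. The class name `TinyNet` and the observation values are illustrative assumptions, not part of the notebook; the sizes 4 and 2 match CartPole.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in with the same layout as Net.
class TinyNet(nn.Module):
    def __init__(self, obs_size, hidden_size, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, n_actions),
        )

    def forward(self, x):
        return self.net(x)

net = TinyNet(obs_size=4, hidden_size=128, n_actions=2)

# One made-up observation, written as a nested list to get a batch of shape (1, 4).
obs_v = torch.tensor([[0.01, -0.02, 0.03, 0.04]])

with torch.no_grad():                      # no gradients needed for inference
    scores = net(obs_v)                    # raw action scores, shape (1, 2)
    probs = torch.softmax(scores, dim=1)   # each row now sums to 1

print(scores.shape, probs.sum(dim=1))
```

The scores themselves can be any real numbers; only after the softmax do they form a valid distribution to sample from.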

Now we will define two helper classes that are named tuples from the
collections package in the standard library:

*   EpisodeStep: This represents a single step that our agent made in the episode. It stores the observation from the environment and the action the agent took. We'll use episode steps from elite episodes as training data.
*   Episode: This is a single episode, stored as the total undiscounted reward and a collection of EpisodeStep objects.




In [7]:
EpisodeStep = namedtuple('EpisodeStep', field_names= ['observation', 'action'])
Episode = namedtuple('Episode', field_names= ['reward', 'steps'])
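
To make the two containers concrete, here is a small sketch that builds one episode by hand. The observations, actions, and reward are made-up values chosen only for illustration.

```python
from collections import namedtuple

EpisodeStep = namedtuple('EpisodeStep', field_names=['observation', 'action'])
Episode = namedtuple('Episode', field_names=['reward', 'steps'])

# Made-up two-step episode: each observation is a 4-number vector, each action is 0 or 1.
steps = [
    EpisodeStep(observation=[0.0, 0.1, 0.0, -0.1], action=1),
    EpisodeStep(observation=[0.0, 0.2, 0.0, -0.2], action=0),
]
episode = Episode(reward=2.0, steps=steps)

print(episode.reward)                      # total undiscounted reward
print([s.action for s in episode.steps])   # actions in the order they were taken
```

Named tuples keep the fields readable (`step.action` rather than `step[1]`) without the overhead of writing full classes.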

The following function accepts the environment (the Env class instance from the
Gym library), our neural network, and the number of episodes it should generate
on every iteration. The batch variable accumulates our batch
(a list of Episode instances). We also declare a reward counter for
the current episode and its list of steps (the EpisodeStep objects). Then we reset
our environment to obtain the first observation and create a softmax layer, which
converts the network's output into a probability distribution over
actions.

In [8]:
def iterate_batches(env, net, batch_size):
    batch = []
    episode_reward = 0.0
    episode_steps = []
    obs = env.reset()
    sm = nn.Softmax(dim=1)
    # At every iteration, we convert our current observation to a PyTorch tensor and
    # pass it to the network to obtain action probabilities. There are several things
    # to note here:
    # 1. All nn.Module instances in PyTorch expect a batch of data items, and the
    #    same is true for our network, so we convert our observation (a vector of
    #    four numbers in CartPole) into a tensor of size 1 × 4 (to achieve this, we
    #    pass the observation in a single-element list).
    # 2. As we haven't used a nonlinearity at the output of our network, it outputs
    #    raw action scores, which we need to feed through the softmax function.
    # 3. Both our network and the softmax layer return tensors that track
    #    gradients, so we unpack the result by accessing the tensor.data field
    #    and then converting the tensor into a NumPy array.
    while True:
        obs_v = torch.FloatTensor([obs])
        action_probs_v = sm(net(obs_v))
        action_probs = action_probs_v.data.numpy()[0]
        # Now that we have the probability distribution over actions, we obtain the
        # actual action for the current step by sampling from this distribution
        # with NumPy's random.choice() function.
        action = np.random.choice(len(action_probs), p = action_probs)
        # We pass this action to the environment to get our next observation and reward.
        # (This four-value return is the classic Gym API; Gym 0.26+ and Gymnasium
        # return five values from step().)
        next_obs, reward, is_done, _ = env.step(action)
        # The reward is added to the current episode's total reward, and our list of
        # episode steps is extended with an (observation, action) pair.
        episode_reward += reward
        episode_steps.append(EpisodeStep(observation = obs, action = action))
        # This is how we handle the end of the current episode (in the case of
        # CartPole, the episode ends when the stick has fallen down despite our
        # efforts, or when the 200-step limit of CartPole-v0 is reached).
        # We append the finalized episode to the batch, saving the total reward (the
        # episode is complete, so we have accumulated all of its reward) and the
        # steps we have taken. Then we reset our total reward accumulator and clear
        # the list of steps. After that, we reset our environment to start over.
        #
        # If our batch has reached the desired number of episodes, we return it to
        # the caller for processing, using yield. Our function is a generator, so
        # every time the yield operator is executed, control is transferred to the
        # outer iteration loop; on the next request, execution continues after the
        # yield line.
        if is_done:
            batch.append(Episode(reward = episode_reward, steps = episode_steps))
            episode_reward = 0.0
            episode_steps = []
            next_obs = env.reset()
            if len(batch) == batch_size:
                yield batch
                batch = []
        obs = next_obs
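
The generator pattern used above can be seen in isolation in the following sketch. It assumes a hypothetical toy environment (`CoinFlipEnv`) and a fixed action in place of the network, so it runs without Gym or PyTorch; only the batching and `yield` logic mirrors `iterate_batches`.

```python
import random

class CoinFlipEnv:
    """Hypothetical toy environment: each step ends the episode with probability 0.2."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def reset(self):
        return 0.0  # dummy observation

    def step(self, action):
        is_done = self.rng.random() < 0.2
        return 0.0, 1.0, is_done, {}  # observation, reward, done flag, info

def iterate_toy_batches(env, batch_size):
    batch, episode_reward = [], 0.0
    env.reset()
    while True:
        _, reward, is_done, _ = env.step(action=0)
        episode_reward += reward
        if is_done:
            batch.append(episode_reward)   # store the finished episode's total reward
            episode_reward = 0.0
            env.reset()
            if len(batch) == batch_size:
                yield batch                # hand the full batch to the caller
                batch = []                 # execution resumes here on the next request

gen = iterate_toy_batches(CoinFlipEnv(), batch_size=4)
first = next(gen)
second = next(gen)
print(first, second)
```

Because the function never returns, the caller decides when to stop, which is why the training loop needs its own `break` condition.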

The next function is at the core of the cross-entropy method.

From the given batch of
episodes and percentile value, it calculates a boundary reward, which is used to
filter elite episodes to train on. To obtain the boundary reward, we use
NumPy's percentile function, which takes a list of values and the desired
percentile and returns the percentile's value. Then we calculate the mean reward,
which is used only for monitoring.


In [9]:
def filter_batch(batch, percentile):
    rewards = list(map(lambda s: s.reward, batch))
    reward_bound = np.percentile(rewards, percentile)
    reward_mean = float(np.mean(rewards))

    # Next, we filter our episodes. For every episode in the batch, we check whether
    # its total reward reaches our boundary; if it does, we add its observations
    # and actions to the lists that we will train on.
    train_obs = []
    train_act = []
    for example in batch:
        if example.reward < reward_bound:
            continue
        train_obs.extend(map(lambda step : step.observation, example.steps))
        train_act.extend(map(lambda step : step.action, example.steps))

    # As the final step of the function, we convert the observations and actions
    # from elite episodes into tensors and return a tuple of four: observations,
    # actions, the reward boundary, and the mean reward.
    train_obs_v = torch.FloatTensor(train_obs)
    train_act_v = torch.LongTensor(train_act)
    return train_obs_v, train_act_v, reward_bound, reward_mean
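
A small worked example may help to see what the percentile boundary does. The ten reward values below are made up for illustration; the filtering rule is the same as in `filter_batch` (episodes strictly below the boundary are dropped).

```python
import numpy as np

# Made-up total rewards for a batch of ten episodes.
rewards = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

reward_bound = np.percentile(rewards, 70)   # linear interpolation by default
reward_mean = float(np.mean(rewards))

# Same rule as filter_batch: drop episodes strictly below the boundary.
elite = [r for r in rewards if r >= reward_bound]

print(reward_bound, reward_mean, elite)
```

Here the boundary is about 73, so only the three episodes with rewards 80, 90, and 100 are kept, while the mean of 55 is reported purely for monitoring.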

Now, we will write the final chunk of code that glues everything together and mostly consists of the training loop.


In the training loop, we iterate over our batches (each a list of Episode
objects), then we filter the elite episodes using the filter_batch
function. The result is a tensor of observations, a tensor of the actions taken, the reward
boundary used for filtering, and the mean reward. After that, we zero the gradients of
our network and pass the observations to the network, obtaining its action scores.
These scores are passed to the objective function, which calculates the cross-entropy
between the network output and the actions that the agent took. The idea is
to reinforce our network to carry out those "elite" actions that have led to
good rewards. Then we calculate the gradients of the loss and ask the
optimizer to adjust our network.
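
To illustrate what the objective measures, the sketch below computes the same quantity by hand in NumPy: the mean of the negative log-probability that a softmax over the scores assigns to each action taken. The function name `cross_entropy` and the score values are illustrative assumptions; PyTorch's `nn.CrossEntropyLoss` likewise takes raw scores, applies log-softmax internally, and averages over the batch by default.

```python
import numpy as np

def cross_entropy(scores, actions):
    """Mean of -log softmax(scores)[action] over the batch."""
    scores = np.asarray(scores, dtype=np.float64)
    shifted = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    picked = log_probs[np.arange(len(actions)), actions]   # log-prob of each taken action
    return float(-picked.mean())

actions = [0, 1, 1]

# Equal scores for both actions: the policy is uniform, so the loss is ln(2).
uniform = cross_entropy([[0.0, 0.0]] * 3, actions)

# Scores that favour the actions actually taken give a lower loss.
confident = cross_entropy([[2.0, 0.0], [0.0, 2.0], [0.0, 2.0]], actions)

print(uniform, confident)
```

With two actions, a nearly uniform policy gives a loss close to ln 2 ≈ 0.693, which is consistent with the first values printed in the training log; the loss falls as the network assigns more probability to the elite actions.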

In [11]:
env = gym.make('CartPole-v0')
#env = gym.wrappers.Monitor(env, directory='mon', force=True)
obs_size = env.observation_space.shape[0]
n_actions = env.action_space.n

net = Net(obs_size, HIDDEN_SIZE, n_actions)
objective = nn.CrossEntropyLoss()
optimizer = optim.Adam(params=net.parameters(), lr=0.01)
writer = SummaryWriter(comment="-cartpole")

for iter_no, batch in enumerate(iterate_batches(env, net, BATCH_SIZE)):
    obs_v, act_v, reward_b, reward_m = filter_batch(batch, PERCENTILE)
    optimizer.zero_grad()
    action_scores_v = net(obs_v)
    loss_v = objective(action_scores_v, act_v)
    loss_v.backward()
    optimizer.step()
    print("%d: loss=%.3f, reward_mean=%.1f, reward_bound=%.1f" % (
                iter_no, loss_v.item(), reward_m, reward_b))
    writer.add_scalar('loss', loss_v.item(), iter_no)
    writer.add_scalar('reward_bound', reward_b, iter_no)
    writer.add_scalar('reward_mean', reward_m, iter_no)
    if reward_m > 199:
        print('Solved!')
        break

writer.close()

0: loss=0.684, reward_mean=28.9, reward_bound=30.0
1: loss=0.670, reward_mean=31.5, reward_bound=36.5
2: loss=0.674, reward_mean=34.0, reward_bound=38.5
3: loss=0.634, reward_mean=38.2, reward_bound=46.0
4: loss=0.643, reward_mean=38.9, reward_bound=46.0
5: loss=0.636, reward_mean=33.8, reward_bound=32.0
6: loss=0.630, reward_mean=43.1, reward_bound=46.5
7: loss=0.613, reward_mean=53.1, reward_bound=63.0
8: loss=0.616, reward_mean=60.6, reward_bound=77.5
9: loss=0.600, reward_mean=86.1, reward_bound=93.0
10: loss=0.604, reward_mean=87.1, reward_bound=118.0
11: loss=0.596, reward_mean=92.2, reward_bound=133.0
12: loss=0.577, reward_mean=90.6, reward_bound=112.5
13: loss=0.578, reward_mean=79.1, reward_bound=91.5
14: loss=0.570, reward_mean=106.1, reward_bound=130.5
15: loss=0.571, reward_mean=120.5, reward_bound=134.5
16: loss=0.569, reward_mean=126.9, reward_bound=142.5
17: loss=0.562, reward_mean=155.4, reward_bound=193.5
18: loss=0.548, reward_mean=151.9, reward_bound=196.0
19: loss=