# Reinforcement Learning - Cross-Entropy Method

## What is Cross-Entropy?

Cross-Entropy is a way to measure how different two probability distributions are. Imagine you have a model that makes predictions, and you want to see how close these predictions are to the actual outcomes. Cross-Entropy gives a number that tells you how good your model is. A lower cross-entropy means your model's predictions are better.

**Simple Example:**

* Think of predicting the weather. If you say there's a 70% chance of rain and it actually rains, your prediction is pretty good. Cross-Entropy helps measure how accurate such predictions are.
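
To make the weather example concrete, the short sketch below computes the cross-entropy of two forecasts against the same outcome (it rained). The helper name `cross_entropy` and the second, 30% forecast are illustrative assumptions, not part of the original example.

```python
import math

def cross_entropy(true_dist, predicted_dist):
    """Cross-entropy H(p, q) = -sum(p_i * log(q_i)), measured in nats."""
    return -sum(p * math.log(q) for p, q in zip(true_dist, predicted_dist) if p > 0)

# It actually rained, so the true distribution is [rain=1, no rain=0]
actual = [1.0, 0.0]

good_forecast = [0.7, 0.3]  # "70% chance of rain"
poor_forecast = [0.3, 0.7]  # "30% chance of rain" (made-up comparison)

loss_good = cross_entropy(actual, good_forecast)
loss_poor = cross_entropy(actual, poor_forecast)

print(f"70% rain forecast: {loss_good:.3f}")
print(f"30% rain forecast: {loss_poor:.3f}")
```

The forecast that put more probability on what really happened gets the lower value (about 0.36 against about 1.20).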

## How Does RL Use Cross-Entropy?

In Reinforcement Learning (RL), an agent learns to make decisions by performing actions and receiving rewards. When RL uses Cross-Entropy, it helps the agent improve its actions based on past experiences.

**How It Works:**

* The agent tries different actions in an environment.
* It records which actions lead to good rewards.
* Using Cross-Entropy, it focuses more on the actions that gave better rewards.
* Over time, the agent gets better at choosing actions that maximize rewards.
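
As a rough illustration of how a cross-entropy update pushes a policy toward a rewarded action, the sketch below applies a single gradient step to the action scores of a two-action policy. It is a hand-written toy that assumes a softmax policy and a made-up learning rate; it is not the training code used later in this notebook.

```python
import math

def softmax(scores):
    """Turn raw action scores into probabilities."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [0.0, 0.0]   # the policy starts undecided between two actions
good_action = 1       # assumption: action 1 appeared in a high-reward episode
learning_rate = 1.0   # illustrative value

probs_before = softmax(scores)
loss_before = -math.log(probs_before[good_action])

# For softmax followed by cross-entropy, the gradient of the loss with respect
# to each score is (probability - 1) for the target action and (probability)
# for every other action.
grads = [p - (1.0 if i == good_action else 0.0) for i, p in enumerate(probs_before)]
scores = [s - learning_rate * g for s, g in zip(scores, grads)]

probs_after = softmax(scores)
loss_after = -math.log(probs_after[good_action])

print(f"P(good action): {probs_before[good_action]:.3f} -> {probs_after[good_action]:.3f}")
print(f"cross-entropy:  {loss_before:.3f} -> {loss_after:.3f}")
```

After the step, the rewarded action is more likely and the cross-entropy against it is lower, which is the effect the list above describes.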

## General Step-by-Step Process

Here’s a simple step-by-step guide on how RL with Cross-Entropy works:

1. **Initialize:** Start with a random policy (a set of rules the agent follows to decide actions).
2. **Generate Episodes:** Let the agent perform actions in the environment to create episodes (sequences of states, actions, and rewards).
3. **Evaluate:** Calculate the total rewards for each episode.
4. **Select Top Performers:** Choose the best-performing episodes based on their rewards.
5. **Update Policy:** Use Cross-Entropy to update the policy, making it more likely to choose actions that led to high rewards.
6. **Repeat:** Go back to step 2 and repeat the process until the agent performs well.
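
The loop described above can be sketched on a toy problem without any environment or neural network. In this hand-made example (all names and numbers are assumptions for illustration), the policy is just a list of action probabilities, an episode is five sampled actions, and the reward counts how often action 2 was picked. The update re-estimates the probabilities from the actions seen in the elite episodes; without the small smoothing term, that frequency count is the categorical distribution with the lowest cross-entropy against the elite actions.

```python
import random

random.seed(0)

N_ACTIONS = 3
EPISODE_LEN = 5
BATCH_SIZE = 50
ELITE_FRACTION = 0.3

def run_episode(probs):
    """Sample EPISODE_LEN actions; the toy reward counts how often action 2 was chosen."""
    actions = random.choices(range(N_ACTIONS), weights=probs, k=EPISODE_LEN)
    reward = sum(1 for a in actions if a == 2)
    return actions, reward

# Initialize: start from a uniform random policy
probs = [1.0 / N_ACTIONS] * N_ACTIONS

for iteration in range(10):
    # Generate episodes and evaluate their total reward
    episodes = [run_episode(probs) for _ in range(BATCH_SIZE)]

    # Select top performers: keep the best 30% of the batch
    episodes.sort(key=lambda episode: episode[1], reverse=True)
    elite = episodes[: int(BATCH_SIZE * ELITE_FRACTION)]

    # Update policy: count the actions taken in elite episodes
    # (starting each count at 1 so no action drops to zero probability)
    counts = [1.0] * N_ACTIONS
    for actions, _ in elite:
        for action in actions:
            counts[action] += 1
    total = sum(counts)
    probs = [c / total for c in counts]

print([round(p, 2) for p in probs])
```

After a few iterations almost all of the probability mass sits on action 2, the only action that earns reward in this toy setup.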

## Additional Information

* Advantages:
    * Simplicity: The Cross-Entropy method is straightforward and easy to implement.
    * Efficiency: On simple problems it can find good solutions in relatively few training iterations.


* Applications:
    * Robotics (teaching robots to perform tasks)
    * Game Playing (like training agents to play video games)
    * Optimization Problems (finding the best solutions in complex scenarios)

# Initial Environment Setup

In [1]:
# Ignore warnings
import warnings
warnings.filterwarnings("ignore")

Now I would like to check the graphics card

In [2]:
!nvidia-smi

Thu Feb 13 12:02:25 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.94                 Driver Version: 560.94         CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce GTX 1660 ...  WDDM  |   00000000:01:00.0  On |                  N/A |
| N/A   60C    P8              4W /   60W |     505MiB /   6144MiB |      8%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [3]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Sep_12_02:55:00_Pacific_Daylight_Time_2024
Cuda compilation tools, release 12.6, V12.6.77
Build cuda_12.6.r12.6/compiler.34841621_0


In [4]:
import torch
print(torch.__version__)
print('Is CUDA available: ' + str(torch.cuda.is_available()))

2.4.1+cu124
Is CUDA available: True


In [5]:
import torch
import torch.nn as nn
import torch.optim as optim

import numpy as np
import gymnasium as gym

from collections import namedtuple
from tensorboardX import SummaryWriter
from pathlib import Path

# CartPole

CartPole is one of the most basic environments available in Gymnasium, which makes it well suited to this first example.

In [6]:
import gymnasium as gym

from utils.wrappers import RecordGif

# RUN A RANDOM AGENT FOR DEMO PURPOSES
env = gym.make('CartPole-v1', render_mode='rgb_array')
env = RecordGif(env, f'./gifs/cross-entroph/cartpole', gif_length=200, name_prefix='random-agent')

env.reset(seed=42)
for _ in range(200):
    env.step(env.action_space.sample())
env.close()

<image src="gifs/cross-entroph/cartpole/random-agent-episode-0.gif" style="width: 350px">

This is a recording of the CartPole environment being controlled by random actions.<br>
The environment consists of a pole balanced on a cart that can move left or right to keep the pole upright.<br>
The episode ends when the pole tilts past a given angle or the cart crosses the environment boundaries. For every step the agent keeps the pole balanced, it receives a positive reward.


The environment may look simple, but it is a good candidate for a first reinforcement learning model based on the cross-entropy method.

## Creating the function to yield the batches to train the model

This function creates the batches of data used to train the model. Each batch is built by running the neural network through the environment steps, accumulating the rewards, and grouping the results into collections used during the training steps.

In [7]:
# These named tuples are to store the experience of the agent
Episode = namedtuple('Episode', field_names=['reward', 'steps'])
EpisodeStep = namedtuple('EpisodeStep', field_names=['observation', 'action'])

# this is the function to create the batches
def training_batch(env: gym.Env, net: nn.Module, batch_size: int, device: str):
    '''
    This function will generate a batch of episodes from the environment.
    The batch will be of size batch_size.
    The neural network net will be used to generate the actions.
    The device will be used to store the tensors.

    The function will yield a list of episodes of size batch_size.
    '''

    softmax = nn.Softmax(dim=1)

    while True:
        batchs = []

        for _ in range(batch_size):
            obs,_ = env.reset()

            total_reward = 0.0
            steps = []

            while True:
                obs_v = torch.FloatTensor([obs]).to(device)

                act_probs_v = softmax(net(obs_v))
                act_probs = act_probs_v.data.cpu().numpy()[0]
                action = np.random.choice(len(act_probs), p=act_probs)

                next_obs, reward, done, truncated, _ = env.step(action)
                steps.append(EpisodeStep(observation=obs, action=action))
                
                obs = next_obs
                total_reward += reward

                if done or truncated:
                    batchs.append(Episode(reward=total_reward, steps=steps))
                    break
        
        yield batchs

**Episode**: This tuple stores the information for one episode. For every run in which our neural network interacts with the environment, we accumulate the data in an instance of this tuple. The 'reward' field holds the total reward obtained during the episode, and the 'steps' field stores a list of *EpisodeStep*.

**EpisodeStep**: This tuple stores the information for a single step of the agent in the environment. For every step we need to create a new instance of this tuple and add it to the *Episode*. The 'action' field stores the action the agent took in the step, and the 'observation' field stores the observation the agent used to choose that action.

> It is important to notice that the observation we include in the EpisodeStep is the 'obs' the agent used to choose the action, not the 'next_obs' we receive after the step.

**training_batch**: This function has the single objective of running some episodes, using the neural network agent to choose an action for every step. It then accumulates the data from the episodes and yields it. We can use this function to get the batches we need while training the agent.

> You can see something interesting in the training_batch function. When we choose an action to take in the environment, we do not use the raw output of the neural network directly. We turn it into probabilities with the Softmax activation function and sample a random action according to them. So as the agent becomes more confident about an action for a given scenario, that action is more likely to be chosen, while we still leave room for exploration.

> Another interesting point is that we apply Softmax outside of the neural network! We do this because we do not expect to need it in our final agent, which only has to pick the highest-scoring action, and because `nn.CrossEntropyLoss` expects raw scores and applies the softmax internally during training. During batch creation, though, Softmax gives us the probabilities of every action at every step.
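
The difference between picking the top-scoring action and sampling from the Softmax probabilities can be seen in a small standalone sketch. The scores below are made up, and the `softmax` helper is a plain-Python stand-in for `nn.Softmax`, used here only so the example runs without PyTorch.

```python
import math
import random

random.seed(42)

def softmax(scores):
    """Convert raw network outputs into action probabilities."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for the two CartPole actions (0 = push left, 1 = push right)
logits = [2.0, 1.0]
probs = softmax(logits)

# Greedy choice: always the highest-scoring action, so there is no exploration
greedy_action = max(range(len(logits)), key=lambda i: logits[i])

# Stochastic choice: sample actions in proportion to their probabilities
samples = random.choices(range(len(probs)), weights=probs, k=1000)
share_of_action_1 = samples.count(1) / len(samples)

print(f"probabilities: {[round(p, 3) for p in probs]}")
print(f"greedy action: {greedy_action}")
print(f"action 1 sampled in {share_of_action_1:.0%} of 1000 draws")
```

The greedy rule would never try action 1, while sampling still picks it roughly a quarter of the time, which is what keeps the agent exploring during batch creation.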


Ok, now we have a way to get batches to use during training. But we still have to filter the episodes and discard the ones with bad performance. This is exactly what we are going to do next!

## Filtering the episodes to use for model training (choose the Elite episodes)

To ensure our agent learns from good experiences and ignores the bad ones, we have to... well... discard the bad experiences! We are looking for the Elite ones!

One technique to filter out the bad experiences is to define a minimum reward an episode must reach to be included in our training data. It should not be just an arbitrary value, though, and it has to adapt for every batch, since the average performance should increase over time.

Instead of an arbitrary value, we can organize the rewards in percentiles and keep only the episodes whose reward is higher than the value at a percentile we choose. With this strategy, the value we use to filter episodes adapts for every batch.
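
A small worked example may help show how the bound moves with the batch. The rewards below are made up, and the `percentile` helper is a plain-Python stand-in that follows the linear-interpolation rule `np.percentile` uses by default, so the sketch runs without NumPy.

```python
import math

def percentile(values, q):
    """Percentile with linear interpolation between the two closest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lower = math.floor(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)

# Made-up total rewards for two batches; the later batch comes from a better agent
early_batch = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
later_batch = [110, 120, 130, 140, 150, 160, 170, 180, 190, 200]

early_bound = percentile(early_batch, 70)
later_bound = percentile(later_batch, 70)

# Keep only the episodes whose reward is above the bound
early_elite = [r for r in early_batch if r > early_bound]
later_elite = [r for r in later_batch if r > later_bound]

print(f"early: bound={early_bound:.1f}, elite={early_elite}")
print(f"later: bound={later_bound:.1f}, elite={later_elite}")
```

The same percentile gives a bound of about 73 for the early batch and about 173 for the later one, so the bar rises automatically as the agent improves.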

In [8]:
def filter_batch(batch: list[Episode], device: str, percentile:int = 70):
    '''
    This function will filter the batch of episodes.
    It will only keep the episodes whose reward is greater than the reward at the given percentile.
    It will return the observations and the actions of those episodes,
    plus the reward bound and the mean reward of the whole batch.
    '''
    rewards = [ e.reward for e in batch ]
    reward_bound = np.percentile(rewards, percentile)
    reward_mean = float(np.mean(rewards))

    train_obs = []
    train_act = []

    for episode in batch:
        if episode.reward <= reward_bound:
            continue

        train_obs.extend([step.observation for step in episode.steps])
        train_act.extend([step.action for step in episode.steps])

    train_obs_v = torch.FloatTensor(train_obs).to(device)
    train_act_v = torch.LongTensor(train_act).to(device)

    return train_obs_v, train_act_v, reward_bound, reward_mean


Our filter function finds the reward value at the given percentile (70 by default), uses it to filter the episodes, accumulates the observations and actions of the surviving episodes, and returns the observations, the actions, the reward bound (the reward at the given percentile), and the reward mean.

The reward bound and mean are just metrics and have no direct impact on the agent itself (well... we are going to use the reward mean for early stopping).

## The Agent (A Neural Network responsible for learning and running the episodes)

Now it is time to define our agent! It is basically a neural network that is going to learn from experience and play the episodes of the CartPole game!

In [9]:
class Net(nn.Module):
    def __init__(self, obs_size: int, hidden_size: int, n_actions: int):
        super(Net, self).__init__()

        self.net = nn.Sequential(
            nn.Linear(obs_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, n_actions)
        )

    def forward(self, x):
        return self.net(x)

This neural network is pretty simple (we don't need anything too complex for the CartPole game): two linear layers connected by a ReLU activation function.

The model definition is arbitrary and could be structured in different ways, but this one should work for now.
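
To get a feel for how small this network is, the arithmetic below counts its trainable parameters. It assumes the CartPole sizes (4 observation values, 2 actions) and the hidden size of 128 that this notebook uses for training; `linear_layer_params` is a helper name made up for this sketch.

```python
def linear_layer_params(inputs: int, outputs: int) -> int:
    """A fully connected layer has one weight per input-output pair plus one bias per output."""
    return inputs * outputs + outputs

OBS_SIZE = 4       # cart position, cart velocity, pole angle, pole angular velocity
HIDDEN_SIZE = 128  # hidden width used when training the CartPole agent
N_ACTIONS = 2      # push left or push right

first_layer = linear_layer_params(OBS_SIZE, HIDDEN_SIZE)    # 4 * 128 + 128
second_layer = linear_layer_params(HIDDEN_SIZE, N_ACTIONS)  # 128 * 2 + 2
total_params = first_layer + second_layer

print(f"first layer:  {first_layer}")
print(f"second layer: {second_layer}")
print(f"total:        {total_params}")
```

The ReLU in between has no parameters, so the whole agent fits in fewer than a thousand numbers.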

## Training the Agent

We have now all the pieces we need to start training our agent.

In [10]:
# DEFINE THE HYPERPARAMETERS
NN_HIDDEN_SIZE = 128
LEARNING_RATE = 0.01
BATCH_SIZE = 16

# DEFINE THE ENVIRONMENT
env = gym.make('CartPole-v1', render_mode='rgb_array')
env = RecordGif(env, './gifs/cross-entroph/cartpole', name_prefix='training', gif_length=200, episode_trigger=lambda x: x % 500 == 0)

observation_size = env.observation_space.shape[0]
action_size = env.action_space.n

# DEFINE THE NETWORK
net = Net(observation_size, NN_HIDDEN_SIZE, action_size)

# DEFINE THE OPTIMIZER
optimizer = optim.Adam(net.parameters(), lr=LEARNING_RATE)

# DEFINE THE LOSS FUNCTION
loss_fn = nn.CrossEntropyLoss()

# DEFINE THE DEVICE
device = 'cuda' if torch.cuda.is_available() else 'cpu'
net.to(device)

# DEFINE THE WRITER
writer = SummaryWriter(logdir='runs/cross-entroph/cart_pole', comment=f'-cartpole-pg')

# TRAIN THE AGENT
for iter_n, batch in enumerate(training_batch(env, net, BATCH_SIZE, device)):
    obs_v, acts_v, reward_b, reward_m = filter_batch(batch, device)

    optimizer.zero_grad()
    action_scores_v = net(obs_v)
    loss_v = loss_fn(action_scores_v, acts_v)
    loss_v.backward()
    optimizer.step()

    if iter_n % 25 == 0:
        print(f'{iter_n}: loss={loss_v.item()}, reward_bound={reward_b}, reward_mean={reward_m}')

    writer.add_scalar('loss', loss_v.item(), iter_n)
    writer.add_scalar('reward_bound', reward_b, iter_n)
    writer.add_scalar('reward_mean', reward_m, iter_n)

    if reward_m > 199:
        print(f'{iter_n}: loss={loss_v.item()}, reward_bound={reward_b}, reward_mean={reward_m}')
        print('Solved!')
        break

env.close()
writer.close()

0: loss=0.6793823838233948, reward_bound=24.0, reward_mean=24.0
25: loss=0.5276786088943481, reward_bound=147.5, reward_mean=132.0625
38: loss=0.49733400344848633, reward_bound=236.5, reward_mean=215.6875
Solved!


It took a while, but we have an agent that can play CartPole!

How about comparing some episodes?!

**Episode 0:**<br>
<image src="gifs/cross-entroph/cartpole/training-episode-0.gif" style="width: 250px">

**Episode 500:**<br>
<image src="gifs/cross-entroph/cartpole/training-episode-500.gif" style="width: 250px">

## Looking at the metrics on TensorBoard

TensorBoard is a tool used to record and read metrics from our models. Let's see our metrics!

<img src="./prints/cross-entroph/tensorboard-cartpole.png" style="width: 1000px">

Using TensorBoard we can see the loss going down and the reward going up (the reward increases almost linearly, hahaha).

In [11]:
# saving the model
Path(r'models/cross-entroph').mkdir(exist_ok=True, parents=True)
torch.save(net.state_dict(), 'models/cross-entroph/cartpole-pg.pth')

I'll create a final video of the agent playing the cart pole game!

In [12]:
env = gym.make('CartPole-v1', render_mode='rgb_array')
env = RecordGif(env, './gifs/cross-entroph/cartpole', name_prefix='model', gif_length=500)

net.load_state_dict(torch.load('models/cross-entroph/cartpole-pg.pth'))
net.eval()

obs,_ = env.reset(seed=42)
while True:
    obs_v = torch.FloatTensor([obs]).to(device)
    action = torch.argmax(net(obs_v)).item()
    obs, reward, done, truncated, _ = env.step(action)
    if done or truncated:
        break

env.close()

<image src="gifs/cross-entroph/cartpole/model-episode-0.gif" style="width: 350px">

The agent is quite good!<br>
It may not be able to run indefinitely in every scenario, since the agent never had to care about what happens past roughly 200 steps, the mean reward we set as the threshold to stop the training. Even with this limitation, we produced a good agent!


# Frozen Lake

In [13]:
import torch
import torch.nn as nn
import torch.optim as optim

import numpy as np
import gymnasium as gym

from collections import namedtuple
from tensorboardX import SummaryWriter

import warnings
warnings.filterwarnings("ignore")

## The environment

Frozen Lake is another of the most basic environments in Gymnasium, but it has some differences from CartPole, as we will discuss below.

In [14]:
import gymnasium as gym

from utils.wrappers import RecordGif

# RUN A RANDOM AGENT FOR DEMO PURPOSES
env = gym.make('FrozenLake-v1', render_mode='rgb_array')
env = RecordGif(env, f'./gifs/cross-entroph/frozen-lake', gif_length=15, name_prefix='random-agent', fps=3)

np.random.seed(42)
env.action_space.seed(42)  # action_space.sample() uses the space's own RNG, not NumPy's global one
env.reset(seed=42)
for _ in range(200):
    env.step(env.action_space.sample())

env.close()

<image src="gifs/cross-entroph/frozen-lake/random-agent-episode-0.gif">

The Frozen Lake environment is a 4x4 grid of squares in which the agent must navigate to reach the present. It is also very simple, but there is a big difference from the previous environment.

In CartPole the agent receives a positive reward for every step it keeps the pole balanced. In Frozen Lake, however, the agent gets a positive reward only when it reaches the present. So it does not matter whether the agent keeps walking around or how much time it spends getting to the present: the final reward is the same.

If you remember from CartPole, we have to filter the *Elite* episodes and use them to train the agent. But since the agent receives the same reward regardless of how many steps it spends getting to the present, we need a way to score the episodes in which the agent goes straight to the goal higher than the ones in which it keeps walking around before getting there.

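There is also an extra complexity in this environment: the ice is slippery. With the default settings, the agent moves in the intended direction only about one third of the time; otherwise it slips into one of the two perpendicular directions.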

So let's see how to implement the changes we need!

## Checking the observation space

The observation space from frozen lake is just a number, from 0 to 15, that represents the index of the square the agent is on at the moment.

In [15]:
env = gym.make('FrozenLake-v1', render_mode='rgb_array')
env.observation_space.sample()

np.int64(14)

Well, this kind of representation does not work well for machine learning, so we need to transform it into a format that works better for our models. We are going to use a technique called One-Hot Encoding.

I'll demo how to implement One-Hot Encoding:

In [16]:
original_values = []
encodded_values = []

for _ in range(5):
    obs = env.observation_space.sample() # we get a sample observation
    one_hot = np.zeros(env.observation_space.n) # we create a one hot vector with all items as zero
    one_hot[obs] = 1 # we set the value of the observation to 1

    original_values.append(obs)
    encodded_values.append(one_hot)

for original, encodded in zip(original_values, encodded_values):
    print(f'Original: {original} Encodded: {encodded}')


Original: 9 Encodded: [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
Original: 14 Encodded: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
Original: 13 Encodded: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
Original: 2 Encodded: [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Original: 4 Encodded: [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]


As you can see, only the index corresponding to our observation value is set to one, and every other index stays at zero. This way the model can make better use of our observation data.

So we need to force our environment to return the observation value in the format we need, and we can do that by creating a custom wrapper:

In [17]:
class OneHotWrapper(gym.ObservationWrapper):
    def __init__(self, env):
        super(OneHotWrapper, self).__init__(env)
        self.observation_space = gym.spaces.Box(0.0, 1.0, (env.observation_space.n, ), dtype=np.float32)

    def observation(self, observation):
        res = np.copy(self.observation_space.low)
        res[observation] = 1.0
        return res

In [18]:
env = gym.make('FrozenLake-v1', render_mode='rgb_array')
env = OneHotWrapper(env)
env.reset()

(array([1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       dtype=float32),
 {'prob': 1})

Ok, so now our environment is returning the observation in the format we need!

## The training batch

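The training_batch function keeps the same logic as before. The only difference in this version is that the rollout runs inside `torch.no_grad()` with the network in evaluation mode, since no gradients are needed while collecting episodes :)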

In [19]:
# These named tuples are to store the experience of the agent
Episode = namedtuple('Episode', field_names=['reward', 'steps'])
EpisodeStep = namedtuple('EpisodeStep', field_names=['observation', 'action'])

# this is the function to create the batches
def training_batch(env: gym.Env, net: nn.Module, batch_size: int, device: str):
    '''
    This function will generate a batch of episodes from the environment.
    The batch will be of size batch_size.
    The neural network net will be used to generate the actions.
    The device will be used to store the tensors.

    The function will yield a list of episodes of size batch_size.
    '''

    softmax = nn.Softmax(dim=1)

    while True:
        batch = []

        for _ in range(batch_size):
            obs,_ = env.reset()

            terminated = False
            truncated = False
            total_reward = 0.0
            steps = []

            with torch.no_grad():
                net.eval()

                while not terminated and not truncated:
                    obs_v = torch.FloatTensor([obs]).to(device)

                    act_probs_v = softmax(net(obs_v))
                    act_probs = act_probs_v.data.cpu().numpy()[0]
                    action = np.random.choice(len(act_probs), p=act_probs)

                    next_obs, reward, terminated, truncated, _ = env.step(action)
                    steps.append(EpisodeStep(observation=obs, action=action))
                    
                    obs = next_obs
                    total_reward += reward

            batch.append(Episode(reward=total_reward, steps=steps))
        
        yield batch

## Filter elite episodes

We have 2 changes to make to this function:

* Include a new parameter called gamma, to penalize the rewards of episodes in which the agent took too long to find the goal
* Include a list of elite episodes in the return

The penalty exists to score higher the episodes in which the agent reached the goal faster. An agent that reaches the goal in 10 steps will be scored higher than an agent that spent 20 steps before reaching the goal.

Including the list of elite episodes in the return gives us the opportunity to reuse these episodes and keep them in the training process for longer. This is needed because episodes producing a positive reward are going to be much rarer in Frozen Lake than they were in CartPole.
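
The 10-step versus 20-step comparison can be worked out directly. The sketch below assumes the rule `reward * gamma ** number_of_steps` with `gamma = 0.9`, the default of the filter function in the next cell; `discounted_score` is a name made up for this illustration.

```python
GAMMA = 0.9  # default discount of the filter function

def discounted_score(reward: float, n_steps: int, gamma: float = GAMMA) -> float:
    """Scale the episode reward by gamma ** n_steps, so longer episodes score lower."""
    return reward * gamma ** n_steps

fast = discounted_score(reward=1.0, n_steps=10)   # reached the goal in 10 steps
slow = discounted_score(reward=1.0, n_steps=20)   # reached the goal in 20 steps
failed = discounted_score(reward=0.0, n_steps=5)  # fell into a hole

print(f"10 steps: {fast:.3f}")
print(f"20 steps: {slow:.3f}")
print(f"failed:   {failed:.3f}")
```

Both successful episodes earned the same raw reward of 1, but after discounting the faster one scores about 0.349 and the slower one about 0.122, so a percentile filter can now tell them apart. A failed episode stays at 0 no matter how short it was.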

In [20]:
# These named tuples are to store the experience of the agent
Episode = namedtuple('Episode', field_names=['reward', 'steps'])
EpisodeStep = namedtuple('EpisodeStep', field_names=['observation', 'action'])

def filter_batch_advanced(batch: list[Episode], device: str, gamma=0.9, percentile:int = 70):
    '''
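    This function will filter the batch of episodes.
    Each episode reward is discounted by gamma ** (number of steps), so shorter episodes score higher.
    It will only keep the episodes whose discounted reward is greater than the reward at the given percentile.
    It will return the observations and the actions of the elite episodes, the elite episodes themselves,
    the reward bound, and the mean of the undiscounted rewards.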
    '''
    rewards = [ e.reward * (gamma ** len(e.steps)) for e in batch ]
    reward_bound = np.percentile(rewards, percentile)
    reward_mean = float(np.mean([ e.reward for e in batch ]))

    train_obs = []
    train_act = []
    elite_eps = [] # We need to keep the elite episodes a bit longer in the training process

    for episode, reward in zip(batch, rewards):
        if reward > reward_bound:
            train_obs.extend([step.observation for step in episode.steps])
            train_act.extend([step.action for step in episode.steps])
            elite_eps.append(episode)

    train_obs_v = torch.FloatTensor(train_obs).to(device)
    train_act_v = torch.LongTensor(train_act).to(device)

    return train_obs_v, train_act_v, elite_eps, reward_bound, reward_mean

## The Neural Network

We can keep the same structure as the CartPole neural network. This NN should work fine here as well.

In [21]:
class Net(nn.Module):
    def __init__(self, obs_size: int, hidden_size: int, n_actions: int):
        super(Net, self).__init__()

        self.net = nn.Sequential(
            nn.Linear(obs_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, n_actions)
        )

    def forward(self, x):
        return self.net(x)

## Train the agent!

Yes, we are already there! Training the agent!

Updating the batch creation and filtering are the biggest changes we have to make! But since the Frozen Lake scenario is a bit different from CartPole, we will need to play a bit with the hyperparameters to ensure the agent has the opportunity to reach the goal and find the present!

In [22]:
import datetime

# DEFINE THE HYPERPARAMETERS
NN_HIDDEN_SIZE = 192
LEARNING_RATE = 0.001 # we need to lower the learning rate
BATCH_SIZE = 250 # we need more samples now, as just a few will have a reward > 0
GAMMA = 0.95 # we add the gamma hyperparameter to penalize the reward
PERCENTILE = 55 # we lower the percentile to increase the chance of having positive rewards

# DEFINE THE ENVIRONMENT
env = gym.make('FrozenLake-v1', render_mode='rgb_array')
env = OneHotWrapper(env) # we add the one hot wrapper

episode_trigger = lambda x: x % 200000 == 0
env = RecordGif(env, './gifs/cross-entroph/frozen-lake', name_prefix='training', gif_length=200, fps=5, episode_trigger=episode_trigger)

observation_size = env.observation_space.shape[0]
action_size = env.action_space.n

# DEFINE THE NETWORK
net = Net(observation_size, NN_HIDDEN_SIZE, action_size)

# DEFINE THE OPTIMIZER
optimizer = optim.Adam(net.parameters(), lr=LEARNING_RATE)

# DEFINE THE LOSS FUNCTION
loss_fn = nn.CrossEntropyLoss()

# DEFINE THE DEVICE
device = 'cuda' if torch.cuda.is_available() else 'cpu'
net.to(device)

# DEFINE THE WRITER
writer = SummaryWriter(logdir='runs/cross-entroph/frozen-lake', comment=f'-frozen-lake-pg')

# TRAIN THE AGENT
starting_time = datetime.datetime.now()

max_success_rate = 0.0
max_reward_bound = 0.0
full_batch = []
for iter_n, batch in enumerate(training_batch(env, net, BATCH_SIZE, device)):
    success_rate = sum([ e.reward for e in batch ]) / BATCH_SIZE
    max_success_rate = max(max_success_rate, success_rate)

    if  success_rate > 0.55:
        print(f'{iter_n}: loss={round(loss_v.item(), 3)}, reward_bound={round(reward_b, 3)},'
              f' success_rate={round(success_rate, 3)}, batch={len(full_batch)},'
              f' max_success_rate={round(max_success_rate, 3)}, max_reward_bound={round(max_reward_bound, 3)}')
        print('Solved!')
        break


    obs_v, acts_v, full_batch, reward_b, _ = filter_batch_advanced(full_batch+batch, device, gamma=GAMMA, percentile=PERCENTILE)

    if not len(full_batch):
        continue

    max_reward_bound = max(max_reward_bound, reward_b)
    full_batch = full_batch[-500:] # we keep the most recent elite episodes

    net.train()
    optimizer.zero_grad()
    action_scores_v = net(obs_v)
    loss_v = loss_fn(action_scores_v, acts_v)
    loss_v.backward()
    optimizer.step()

    writer.add_scalar('loss', loss_v.item(), iter_n)
    writer.add_scalar('reward_bound', reward_b, iter_n)
    writer.add_scalar('success_rate', success_rate, iter_n)

    if iter_n % 50 == 0:
        print(f'{iter_n}: loss={round(loss_v.item(), 3)}, reward_bound={round(reward_b, 3)},'
              f' success_rate={round(success_rate, 3)}, batch={len(full_batch)},'
              f' max_success_rate={round(max_success_rate, 3)}, max_reward_bound={round(max_reward_bound, 3)}')

    # we add a time limit to the training
    # it should not take more than 4 hours
    delta_time = datetime.datetime.now() - starting_time
    if delta_time.total_seconds() > 3600 * 4:
        print(f'{iter_n}: loss={round(loss_v.item(), 3)}, reward_bound={round(reward_b, 3)},'
              f' success_rate={round(success_rate, 3)}, batch={len(full_batch)},'
              f' max_success_rate={round(max_success_rate, 3)}, max_reward_bound={round(max_reward_bound, 3)}')
        print('Time limit reached')
        break

writer.close()

0: loss=1.377, reward_bound=0.0, success_rate=0.02, batch=5, max_success_rate=0.02, max_reward_bound=0.0
50: loss=1.286, reward_bound=0.394, success_rate=0.028, batch=201, max_success_rate=0.048, max_reward_bound=0.463
100: loss=1.146, reward_bound=0.341, success_rate=0.032, batch=199, max_success_rate=0.068, max_reward_bound=0.63
150: loss=1.097, reward_bound=0.663, success_rate=0.036, batch=166, max_success_rate=0.08, max_reward_bound=0.663
200: loss=1.067, reward_bound=0.54, success_rate=0.048, batch=202, max_success_rate=0.08, max_reward_bound=0.698
250: loss=1.057, reward_bound=0.599, success_rate=0.064, batch=181, max_success_rate=0.108, max_reward_bound=0.698
300: loss=1.042, reward_bound=0.57, success_rate=0.024, batch=203, max_success_rate=0.108, max_reward_bound=0.698
350: loss=1.058, reward_bound=0.418, success_rate=0.06, batch=200, max_success_rate=0.108, max_reward_bound=0.698
400: loss=0.99, reward_bound=0.54, success_rate=0.076, batch=200, max_success_rate=0.108, max_rew

## Looking at the metrics on TensorBoard

TensorBoard is a tool used to record and read metrics from our models. Let's see our metrics!

<img src="prints/cross-entroph/tensorboard-frozen-lake.png" style="width: 1000px">

As expected, we can see the loss decreasing over time, but the reward is not consistent.

In [23]:
# saving the model
Path('models/cross-entroph').mkdir(exist_ok=True, parents=True)
torch.save(net.state_dict(), 'models/cross-entroph/frozen-lake-pg.pth')

Ok, this is a good example of a scenario in which Cross-Entropy learning does not perform very well. Even after hours of training, it will not reach a success rate much better than 50% of the test executions.

For this kind of scenario there are better techniques.

I'll create a final video of the agent playing the frozen lake game!

In [33]:
env = gym.make('FrozenLake-v1', render_mode='rgb_array')
env = OneHotWrapper(env) # we add the one hot wrapper
env = RecordGif(env, './gifs/cross-entroph/frozen-lake', name_prefix='model', gif_length=500)

net.load_state_dict(torch.load('models/cross-entroph/frozen-lake-pg.pth'))
net.eval()

obs,_ = env.reset(seed=42)
while True:
    obs_v = torch.FloatTensor([obs]).to(device)
    action = torch.argmax(net(obs_v)).item()
    obs, reward, done, truncated, _ = env.step(action)
    if done or truncated:
        break

env.close()

<image src="gifs/cross-entroph/frozen-lake/model-episode-0.gif">

Well... at least it could get the prize :|