# Trainers

In this notebook I test trainer implementations on various environments.

## Monitoring setup
This sets up the logging system so that the agent's training progress can be monitored.

In [None]:
import logging
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "2"

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(levelname)s:%(name)s: %(message)s')
logging.root.setLevel(logging.INFO)

logger = logging.getLogger(__name__)

### Environment factory
This function produces the environment the agent is trained and evaluated on and applies all the necessary wrappers. It is also used by the evaluation function to validate the agent on a fresh environment instance.

In [None]:
def make_env():
    e = gym.make('BreakoutNoFrameskip-v4')
    e = NoOpResetEnv(e)
    e = MaxAndSkipEnv(e)
    e = EpisodicLifeEnv(e)
    e = OriginalReturnWrapper(e)
    e = SignReward(e)
    e = TorchObservation(e)
    e = StackFrames(e, size=4)
    return e

### Evaluation function
To evaluate the agent, a video of one trajectory in the environment is produced and sent to TensorBoard. In TensorBoard one can then analyse the progress of the agent.

In [None]:
def evaluate_agent(trainable):
    e = make_env()
    video, rewards = render_trajectory(e, trainable.policy, reward_infos=['episodic_return'])
    writer.add_video("trajectory", video, trainable.steps_trained, fps=40)
    writer.add_scalar("rewards/total", rewards['total_reward'], trainable.steps_trained)
    writer.add_scalar("returns/total", rewards['total_episodic_return'], trainable.steps_trained)

## Train the Agent
In this part I actually train the agent on one of the testing environments. It is a good idea to start with the gym unit
testing environment and try to solve that first. If this doesn't work, something is wrong with either the
algorithm or the setup. Also make sure to sanity-check the input signal. Here the problem was that the input image was
completely mangled by the torchvision pipeline because I initially used it incorrectly.
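As a rough illustration of what such a sanity check could look like, the sketch below summarises a single observation so that obvious damage (constant frames, NaNs, an unexpected value range or shape) stands out. The helper `describe_observation` and the 84x84 stand-in frame are hypothetical and not part of amarl; a real observation from `make_env()` would first have to be converted to a NumPy array.

```python
import numpy as np


def describe_observation(obs):
    """Summarise an observation so obvious preprocessing damage stands out."""
    arr = np.asarray(obs, dtype=np.float64)
    return {
        "shape": arr.shape,
        "min": float(arr.min()),
        "max": float(arr.max()),
        "mean": float(arr.mean()),
        "is_constant": bool(arr.min() == arr.max()),
        "has_nan": bool(np.isnan(arr).any()),
    }


# Hypothetical stand-in for one preprocessed frame (the size is an assumption).
frame = np.zeros((84, 84), dtype=np.float32)
frame[40:44, 10:30] = 1.0  # a bright bar so the frame is not constant

stats = describe_observation(frame)
print(stats)
```

Looking at the frame itself, for example with `plt.imshow`, is a natural follow-up, since a mangled image can still have plausible-looking statistics.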

In [None]:
import amarl
import gym
from amarl.messenger import monitor, LogMonitor, CombinedMonitor, TensorboardMonitor
from amarl.trainers import A2CTrainer
from amarl.visualisation import render_trajectory
from amarl.wrappers import MultipleEnvs, active_gym, OriginalReturnWrapper, SignReward, TorchObservation, StackFrames, \
    NoOpResetEnv, MaxAndSkipEnv, EpisodicLifeEnv
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter()
log_monitor = LogMonitor(logger, progress_averaging=100, performance_sample_size=10000)
tb_monitor = TensorboardMonitor(writer, 10, scalars=dict(episodic_return='returns/episodic'))

env = MultipleEnvs(make_env, num_envs=16)
with active_gym(env) as env, monitor(CombinedMonitor([log_monitor, tb_monitor])) as monitor:
    com = A2CTrainer(env, config={'rollout_horizon': 5, 'device': 'cuda'})
    amarl.run(com, num_steps=int(1e5), step_frequency_fns={int(2e4): evaluate_agent})

## Plot Results
It is easier to understand how the agent is performing by plotting the metrics of the training. Some bugs can be subtle;
for instance, it was not obvious that the frame stack ordering was wrong, because the agent still learned a good
policy and performed reasonably well. Because of such subtle bugs the algorithm can become more sensitive to certain
hyper-parameters (e.g. gradient clipping), which gives a misleading impression of how important these parameters are. Be
careful!
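One way to catch an ordering bug like this early is to push frames labelled with their time step through the stacker and inspect the order that comes out. The sketch below uses a toy `ReferenceFrameStack`, which is hypothetical and not amarl's `StackFrames`, and it assumes an oldest-first convention. Whichever convention the network expects is the one to assert on.

```python
from collections import deque


class ReferenceFrameStack:
    """Toy frame stack: keeps the last `size` frames, oldest first."""

    def __init__(self, size):
        self.frames = deque(maxlen=size)

    def reset(self, first_frame):
        # Fill the whole stack with the first frame (an assumed reset policy).
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return list(self.frames)

    def step(self, frame):
        # Appending to a full deque drops the oldest frame from the left.
        self.frames.append(frame)
        return list(self.frames)


def is_oldest_first(stacked):
    """True if the frame labels never decrease from first to last entry."""
    return all(a <= b for a, b in zip(stacked, stacked[1:]))


# Label each frame with its time step so the ordering is visible.
stack = ReferenceFrameStack(size=4)
stacked = stack.reset(0)
for t in range(1, 6):
    stacked = stack.step(t)

print(stacked)
```

The same idea carries over to a real wrapper by feeding it frames filled with a constant equal to the step index and checking which slot of the stacked observation holds the newest value.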

In [None]:
import matplotlib.pyplot as plt
import numpy as np

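def moving_average(a, n=3):
    # Window sums via differences of the cumulative sum (with a leading 0 prepended).
    ret = np.cumsum(np.insert(a, 0, 0))
    # The result has len(a) - n + 1 entries, one per full window of size n.
    return (ret[n:] - ret[:-n]) / n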

scores = monitor.captured_returns
avg_window = 100
scores_avg = moving_average(scores, avg_window)
fig, ax = plt.subplots()
ax.plot(range(len(scores)), scores)
start = avg_window // 2
ax.plot(range(start, start + len(scores_avg)), scores_avg)
plt.ylabel('Score')
plt.xlabel('Episode #')
plt.show()
