# Single Agent evaluation using Malmo
This guide loads a trained RLlib checkpoint and evaluates it for a few episodes on the same level it was trained on. We use a PPO checkpoint here; for a checkpoint trained with a different algorithm, load that algorithm's trainer instead.

We do not use the screen capturer in this guide, but it can be added as shown in the other evaluation guide.

In [None]:
# imports
import gym, os, sys, argparse
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '../..')))
from pathlib import Path
import pickle
import numpy as np

# malmoenv imports
import malmoenv
from malmoenv.utils.launcher import launch_minecraft
from malmoenv.utils.wrappers import DownsampleObs

from examples.utils.utils import update_checkpoint_for_rollout, get_config

# ray dependencies
import ray
from ray import tune
from ray.rllib.agents.ppo import PPOTrainer

Define some constants.
Training with Ray Tune can produce multiple checkpoints, so we have to explicitly select the one we would like to evaluate.

In [None]:
EPISODES = 10
ENV_NAME = "malmo"
MISSION_XML = os.path.realpath('../../MalmoEnv/missions/mobchase_single_agent.xml')
xml = Path(MISSION_XML).read_text()

env_config = {
    "xml": xml,
    "port": 8999, # first port's number
}

CHECKPOINT_FREQ = 100     # in terms of number of algorithm iterations
LOG_DIR = "results/"       # creates a new directory and puts results there

NUM_WORKERS = 1
NUM_GPUS = 0
TOTAL_STEPS = int(1e6)
launch_script = "./launchClient_quiet.sh"

checkpoint_file = "examples/checkpoints/PPO_malmo_single_agent/checkpoint_209/checkpoint-209"
update_checkpoint_for_rollout(checkpoint_file)
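
Picking the checkpoint by hand works, but with many checkpoints it can be convenient to select the most recent one programmatically. The sketch below is a hypothetical helper (`latest_checkpoint` is not part of RLlib or of this repository) and assumes the `checkpoint_<N>/checkpoint-<N>` layout seen in the path above.

```python
import re
from pathlib import Path


def latest_checkpoint(run_dir):
    """Return the path of the highest-numbered checkpoint file, or None.

    Hypothetical helper: assumes the layout <run_dir>/checkpoint_<N>/checkpoint-<N>.
    """
    best_iter, best_path = -1, None
    for ckpt_dir in Path(run_dir).glob("checkpoint_*"):
        match = re.fullmatch(r"checkpoint_(\d+)", ckpt_dir.name)
        if match is None or not ckpt_dir.is_dir():
            continue
        iteration = int(match.group(1))
        candidate = ckpt_dir / f"checkpoint-{iteration}"
        # Compare numerically so that checkpoint_1000 beats checkpoint_209.
        if candidate.is_file() and iteration > best_iter:
            best_iter, best_path = iteration, candidate
    return best_path
```

Under these assumptions, `latest_checkpoint("examples/checkpoints/PPO_malmo_single_agent")` would return the `checkpoint_209/checkpoint-209` path used above, provided 209 is the highest-numbered checkpoint in that directory.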

Env creator function. This is the part where the ScreenCapturer can be utilised.
Note that for this sort of checkpoint restoration we have to register the environment.

In [None]:
def create_env(config):
    env = malmoenv.make()
    # config is a dict (see env_config above), so its entries are read by key
    env.init(config["xml"], config["port"], reshape=True)
    env.reward_range = (-float('inf'), float('inf'))

    env = DownsampleObs(env, shape=(84, 84))
    return env

tune.register_env(ENV_NAME, create_env)

The next step is to load the original config and overwrite some parameters. We want the same settings as in training, but not necessarily the same hardware. Say we trained the agent on a server with multiple CPUs and a GPU, but would like to evaluate the checkpoint locally with a single env and without a GPU. To do this we simply overwrite the corresponding entries in the config. We can also disable exploration, as shown below. Depending on the chosen algorithm, other config options may be useful for evaluation; see the RLlib documentation for details.

In [None]:
config = get_config(checkpoint_file)
config["num_workers"] = NUM_WORKERS
config["num_gpus"] = NUM_GPUS
config["explore"] = False
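
The override step can also be wrapped in a small function so that the loaded training configuration is left untouched. This is a minimal sketch: `make_eval_config` is a hypothetical helper, and the dictionary in the usage lines is a stand-in for what `get_config` returns, not a real checkpoint config.

```python
import copy


def make_eval_config(train_config, num_workers=1, num_gpus=0):
    """Return a copy of a training config adjusted for local evaluation."""
    eval_config = copy.deepcopy(train_config)  # leave the original untouched
    eval_config["num_workers"] = num_workers   # single rollout worker
    eval_config["num_gpus"] = num_gpus         # CPU-only evaluation
    eval_config["explore"] = False             # disable exploration
    return eval_config


# Stand-in for a config loaded from a checkpoint (values are illustrative).
train_config = {"num_workers": 8, "num_gpus": 1, "explore": True, "lr": 5e-5}
eval_config = make_eval_config(train_config)
```

Working on a copy keeps the training settings available for comparison, at the cost of one extra dictionary in memory.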

In [None]:
# Load agent
ray.init()
trainer = PPOTrainer(config)
trainer.restore(checkpoint_file)
policy = trainer.get_policy()

As in the previous examples, the next step is to start the Malmo instances. In this version we manually create the environment, which gives us more flexibility over the evaluation.

In [None]:
GAME_INSTANCE_PORTS = [env_config["port"] + 1 + i for i in range(NUM_WORKERS)]
instances = launch_minecraft(GAME_INSTANCE_PORTS, launch_script=launch_script)

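# connect the manually created env to the first launched instance
# (assumption: the trainer config has no top-level "xml"/"port" entries)
env = create_env({**env_config, "port": GAME_INSTANCE_PORTS[0]})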

Creating the environment manually lets us write our own evaluation loop.
RLlib expects a 4-dimensional input of the form [Batch, Width, Height, Channels]. To satisfy this requirement we add a batch dimension to the state with ```np.expand_dims```.
The ```actions``` variable returned by ```policy.compute_actions``` contains not only the selected action but also various algorithm-specific outputs, such as value function estimates, Q-values or action distributions.
The evaluation loop below is a simple example, but it can be extended to extract more information from Malmo. The ```info``` output returns various symbolic information about the current state.
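
To make the shapes concrete, the sketch below uses a stand-in policy instead of a restored RLlib policy. `DummyPolicy` is hypothetical and only mimics a three-part return value `(actions, state_outs, extra_info)`; the `vf_preds` key and the `(84, 84, 3)` observation shape are illustrative assumptions.

```python
import numpy as np


class DummyPolicy:
    """Stand-in that mimics a three-part return value of compute_actions."""

    def compute_actions(self, obs_batch):
        batch_size = obs_batch.shape[0]
        actions = np.zeros(batch_size, dtype=np.int64)   # one action per observation
        state_outs = []                                   # no recurrent state
        extra_info = {"vf_preds": np.zeros(batch_size)}   # algorithm-specific outputs
        return actions, state_outs, extra_info


state = np.zeros((84, 84, 3), dtype=np.uint8)   # single observation, no batch axis
obs_batch = np.expand_dims(state, 0)            # shape (1, 84, 84, 3)

actions, state_outs, extra_info = DummyPolicy().compute_actions(obs_batch)
action = actions[0]   # action for the only observation in the batch
```

Because the tuple is unpacked here, `actions[0]` in this sketch plays the role of `actions[0][0]` in the evaluation loop, where the whole tuple is stored in `actions`.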

In [None]:
# Custom evaluation loop
print(f"running evaluations for {EPISODES} episodes")
for ep in range(EPISODES):
    state = env.reset()
    done = False
    ep_length = 0
    ep_reward = 0
    while not done:
        # compute_actions returns a tuple: its first entry holds the batch of actions,
        # the remaining entries hold algorithm-specific outputs (values, action distribution...)
        actions = policy.compute_actions(np.expand_dims(state, 0))
        state, reward, done, info = env.step(actions[0][0])
        ep_length += 1
        ep_reward += reward
        if done:
            print(f"Episode #{ep} finished in {ep_length} steps with reward {ep_reward}")
            ep_length = 0
            ep_reward = 0
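
The loop above only prints per-episode results. If a summary across episodes is wanted, the returns can be collected in a list and aggregated afterwards. The following is a minimal sketch: `summarise_returns` is a hypothetical helper, and the numbers in the usage lines are made up for illustration, not results from this checkpoint.

```python
import statistics


def summarise_returns(episode_returns):
    """Return count, mean, standard deviation, min and max of per-episode returns."""
    if not episode_returns:
        raise ValueError("no episodes recorded")
    return {
        "episodes": len(episode_returns),
        "mean": statistics.mean(episode_returns),
        # pstdev is also defined for a single episode (it is 0.0 in that case)
        "std": statistics.pstdev(episode_returns),
        "min": min(episode_returns),
        "max": max(episode_returns),
    }


# Illustrative values only; in the loop above one would append ep_reward
# to a list at the end of every episode and pass that list here.
summary = summarise_returns([1.0, 0.0, 2.0, 1.0])
print(summary)
```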