# Single Agent training using Malmo
This guide walks through using Malmo and shows how to train a reinforcement learning agent from [RLlib](https://docs.ray.io/en/master/) in Malmo.

This notebook requires the ```ray``` Python package, which can be installed with pip:
```pip install ray ray[rllib] ray[tune]```

The first steps are the same as for the Random Agent example.

In [None]:
# imports
from pathlib import Path
import os

# malmoenv imports
import malmoenv
from malmoenv.utils.launcher import launch_minecraft
from malmoenv.utils.wrappers import DownsampleObs

import ray
from ray.tune import register_env

The next step is to define some constants.

```MISSION_XML``` is the file that defines the current mission. RLlib can change the current working directory, so we use the file's absolute path. This example has been set up to work with a single worker as well as with multiple workers.

In [None]:
ENV_NAME = "malmo"
MISSION_XML = os.path.realpath('../../MalmoEnv/missions/mobchase_single_agent.xml')
COMMAND_PORT = 8999 # base port; worker i connects to COMMAND_PORT + i
xml = Path(MISSION_XML).read_text()

CHECKPOINT_FREQ = 100     # in terms of number of algorithm iterations
LOG_DIR = "results/"       # creates a new directory and puts results there

NUM_WORKERS = 1
NUM_GPUS = 0
TOTAL_STEPS = int(1e6)
launch_script = "./launchClient_quiet.sh"

Next we create a function that defines how RLlib generates the environment. This is the Python client that connects to the Malmo instances, so make sure the port numbers used here match the ports used later to launch the Minecraft instances.
In RLlib each worker has an index, accessible as ```config.worker_index``` (remote workers are numbered from 1). Using this index we can assign the correct port to each env.
The ```create_env``` function is also a good place to add wrappers; see the ```DownsampleObs``` wrapper used in this example.
We downsample the observations from the default ```(800, 600, 3)``` to ```(84, 84, 3)``` because the default vision models in RLlib only support a few input sizes, and this is one of them. RLlib can work with any vector-based observation, and by default it uses convolutional networks for input sizes of (84, 84) and (42, 42). To work with other input sizes, see the [RLlib documentation](https://docs.ray.io/en/master/rllib-models.html).

Finally, we register the env generator function to make it visible to RLlib.

In [None]:
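def create_env(config):
    # Python client for the Malmo instance assigned to this worker
    env = malmoenv.make()
    # each worker connects to its own port: COMMAND_PORT + worker index
    env.init(xml, COMMAND_PORT + config.worker_index, reshape=True)
    env.reward_range = (-float('inf'), float('inf'))

    # downsample frames to a size supported by RLlib's default vision models
    env = DownsampleObs(env, shape=(84, 84))
    return env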

register_env(ENV_NAME, create_env)

The next step is to start the Minecraft instances. Note that this step might take a few minutes.
In the background, each Malmo instance gets copied to the ```/tmp/malmo_<hash>/malmo``` directory and is executed from there (each Minecraft instance requires its own directory).
After copying, the instances are started using the provided ```launch_script```. This script is where we can define, for example, whether to run Minecraft without rendering a window.

In [None]:
GAME_INSTANCE_PORTS = [COMMAND_PORT + 1 + i for i in range(NUM_WORKERS)]
instances = launch_minecraft(GAME_INSTANCE_PORTS, launch_script=launch_script)
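
The port arithmetic used by the client and by the launcher can be checked without starting Minecraft. The snippet below is a minimal sketch, not part of Malmo or RLlib: ```client_port```, ```instance_ports``` and ```ports_match``` are hypothetical helper names that only mirror the expressions used in ```create_env``` and in ```GAME_INSTANCE_PORTS```. It assumes that RLlib numbers its remote workers from 1 and reserves index 0 for the local worker.

```python
COMMAND_PORT = 8999  # same base port as in the constants cell


def client_port(worker_index, base_port=COMMAND_PORT):
    """Port the Python client of a worker connects to (mirrors create_env)."""
    return base_port + worker_index


def instance_ports(num_workers, base_port=COMMAND_PORT):
    """Ports the Minecraft instances listen on (mirrors GAME_INSTANCE_PORTS)."""
    return [base_port + 1 + i for i in range(num_workers)]


def ports_match(num_workers, base_port=COMMAND_PORT):
    """Check that every remote worker has a matching Minecraft instance.

    Assumption: remote workers have indices 1..num_workers.
    """
    remote_clients = [client_port(i, base_port) for i in range(1, num_workers + 1)]
    return remote_clients == instance_ports(num_workers, base_port)


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(n, instance_ports(n), ports_match(n))
```

If either the base port or one of the offsets is changed, the other expression has to change with it, otherwise the workers try to connect to ports on which no Minecraft instance is listening.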

Once the Malmo instances are set up and running, the next step is to train an agent.
In this example we use Ray's tune API to run the training. The algorithm used here is ```PPO```, but RLlib provides a large collection of RL algorithms. To use a different one, replace the first argument of ```tune.run``` with the name of the desired algorithm, e.g. ```DQN```.

Then we define the ```config```, which includes the environment and the resources we would like Ray to use for training. Note that a custom environment has to be registered with Ray first; after that it can be referred to by its name.
The remaining arguments to ```tune.run``` are optional, but useful in this example. We stop training based on the number of agent-env interactions, save a checkpoint every ```CHECKPOINT_FREQ``` algorithm iterations, and write the log files to a custom location (the default would be ```~/ray_results/```).


In [None]:
ray.tune.run(
    "PPO",
    config={
        "env": ENV_NAME,
        "num_workers": NUM_WORKERS,
        "num_gpus": NUM_GPUS,
    },
    stop={"timesteps_total": TOTAL_STEPS},
    checkpoint_at_end=True,
    checkpoint_freq=CHECKPOINT_FREQ,
    local_dir=LOG_DIR
)

To change the algorithm or its arguments, see these links:
- [Available algorithms](https://docs.ray.io/en/latest/rllib-algorithms.html)
- [Common arguments](https://docs.ray.io/en/master/rllib-training.html#common-parameters)
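
To make switching algorithms less error-prone, the arguments of the training call can be collected in one place. The following is a minimal sketch under stated assumptions: ```build_tune_kwargs``` is a hypothetical helper, not part of Ray, and its defaults simply repeat the constants used in this notebook. The key ```run_or_experiment``` is the name of the first parameter of ```tune.run```. Algorithm-specific settings are not covered here and would need to be added to ```config``` by hand.

```python
def build_tune_kwargs(algorithm="PPO", env_name="malmo", num_workers=1,
                      num_gpus=0, total_steps=int(1e6),
                      checkpoint_freq=100, log_dir="results/"):
    """Collect the keyword arguments for tune.run in a plain dict.

    Hypothetical helper: it only repackages the values used in this notebook.
    """
    return {
        "run_or_experiment": algorithm,  # e.g. "PPO" or "DQN"
        "config": {
            "env": env_name,             # must be registered beforehand
            "num_workers": num_workers,
            "num_gpus": num_gpus,
        },
        "stop": {"timesteps_total": total_steps},
        "checkpoint_at_end": True,
        "checkpoint_freq": checkpoint_freq,
        "local_dir": log_dir,
    }


# Usage (requires ray, a registered env and running Minecraft instances):
#   import ray
#   from ray import tune
#   tune.run(**build_tune_kwargs("DQN"))
```

Keeping the arguments in a dict also makes it easy to print or log the exact settings of a run before it starts.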