# Custom Multi Agent Env with Variable-Length Action Spaces in RLlib

Two RLlib features matter when modelling card games:
- Training multiple agents. Every agent (player) should have its own trajectory, so that its final reward propagates along that trajectory. All the agents may still follow the same policy.
    https://docs.ray.io/en/master/rllib-env.html#multi-agent-and-hierarchical 
- State-dependent action spaces. Depending on the cards on the table, I might not be able to play some of the cards in my hand, so some actions have to be masked out:
    https://docs.ray.io/en/master/rllib-models.html#variable-length-parametric-action-spaces

RLlib can create distinct policies and route each agent's decisions to the policy it is bound to. When an agent first appears in the env, `policy_mapping_fn` is called to determine which policy it is bound to. The assignment is made once, when the agent first enters the episode, and persists for the duration of the episode.

RLlib reports separate training statistics for each policy in the return from train(), along with the combined reward.

If all “agents” in the env are homogeneous, then it is possible to use existing single-agent algorithms for training. Since there is still only a single policy being trained, RLlib only needs to internally aggregate the experiences of the different agents prior to policy optimization.
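To make the binding behaviour concrete, here is a minimal pure-Python sketch. The `EpisodeBindings` class is a hypothetical helper written for this note, not RLlib's internal code; it only mimics the "call `policy_mapping_fn` once per agent per episode" rule described above. The policy name `shared_policy` is also an assumption.

```python
# Illustration only: mimics the binding rule, it is not RLlib's implementation.

def policy_mapping_fn(agent_id):
    # Homogeneous agents: every player is bound to the same policy.
    return "shared_policy"


class EpisodeBindings:
    """Caches the agent -> policy assignment for a single episode."""

    def __init__(self, mapping_fn):
        self._mapping_fn = mapping_fn
        self._bound = {}
        self.calls = 0  # how often the mapping function was invoked

    def policy_for(self, agent_id):
        # The mapping function is only consulted the first time an agent appears.
        if agent_id not in self._bound:
            self.calls += 1
            self._bound[agent_id] = self._mapping_fn(agent_id)
        return self._bound[agent_id]


bindings = EpisodeBindings(policy_mapping_fn)
for agent_id in ["player_1", "player_2", "player_1", "player_2"]:
    bindings.policy_for(agent_id)

print(bindings.calls)  # one call per distinct agent
```

Because both players resolve to the same policy name, their experiences end up in the same training batch, which is the homogeneous case mentioned above.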

## 1. Create the custom environment
https://docs.ray.io/en/latest/rllib-env.html?multi-agent-and-hierarchical#multi-agent-and-hierarchical

```python
# Example: using a multi-agent env
> env = MultiAgentTrafficEnv(num_cars=20, num_traffic_lights=5)

# Observations are a dict mapping agent names to their obs. Only those
# agents' names that require actions in the next call to `step()` will
# be present in the returned observation dict.
> print(env.reset())
{
    "car_1": [[...]],
    "car_2": [[...]],
    "traffic_light_1": [[...]],
}

# In the following call to `step`, actions should be provided for each
# agent that returned an observation before:
> new_obs, rewards, dones, infos = env.step(actions={"car_1": ..., "car_2": ..., "traffic_light_1": ...})

# Similarly, new_obs, rewards, dones, etc. also become dicts
> print(rewards)
{"car_1": 3, "car_2": -1, "traffic_light_1": 0}

# Individual agents can early exit; The entire episode is done when "__all__" = True
> print(dones)
{"car_2": True, "__all__": False}
```

Test the observation space:

In [4]:
from gym.spaces import Discrete, Tuple
s = Tuple((
    # First round
    Tuple((Discrete(4), Discrete(4))),

    # Second round
    Tuple((Discrete(4), Discrete(4))),

    # Third round
    Tuple((Discrete(4), Discrete(4))),
))

state = [
    [0, 0],
    [0, 0],
    [0, 0]
]

s.contains(state)

True

In [1]:
# Implementation of the Multi Agent Env. game

from gym.spaces import Discrete, Tuple
from ray.rllib.env.multi_agent_env import MultiAgentEnv
import random


class Actions:
    # number of actions
    SIZE = 3

    # types of actions
    ROCK = 0
    PAPER = 1
    SCISSORS = 2
    NA = 3  # Not Available, hand not yet played

class RockPaperScissors(MultiAgentEnv):
    """
    Two-player environment for the famous rock paper scissors game, modified:
    - There are two agents which alternate, the action of one agent provides the
        state for the next agent. Since one of the two players begins, the agent
        which starts second should learn to always win! The starting player
        is drawn randomly.
    - The action space changes. The game is divided in three rounds across
        which you can't re-use the same action.
    """

    # Action/State spaces
    ACTION_SPACE = Discrete(Actions.SIZE)
    
    OBSERVATION_SPACE = Tuple((
        # First round
        Tuple((Discrete(4), Discrete(4))),

        # Second round
        Tuple((Discrete(4), Discrete(4))),

        # Third round
        Tuple((Discrete(4), Discrete(4))),
    ))
#     OBSERVATION_SPACE = Dict({
#         "real_obs": Tuple((
#             # First round
#             Tuple((Discrete(4), Discrete(4))),

#             # Second round
#             Tuple((Discrete(4), Discrete(4))),

#             # Third round
#             Tuple((Discrete(4), Discrete(4))),
#         )),
#         # we have to handle changing action spaces
#         "action_mask": Box(0, 1, shape=(Actions.SIZE, )),
# #         "avail_actions": Box(-1, 1, shape=(Actions.SIZE, action_embedding_sz)),
#     })
    
    
    # Reward mapping
    rewards = {
        (Actions.ROCK, Actions.ROCK): (0, 0),
        (Actions.ROCK, Actions.PAPER): (-1, 1),
        (Actions.ROCK, Actions.SCISSORS): (1, -1),
        (Actions.PAPER, Actions.ROCK): (1, -1),
        (Actions.PAPER, Actions.PAPER): (0, 0),
        (Actions.PAPER, Actions.SCISSORS): (-1, 1),
        (Actions.SCISSORS, Actions.ROCK): (-1, 1),
        (Actions.SCISSORS, Actions.PAPER): (1, -1),
        (Actions.SCISSORS, Actions.SCISSORS): (0, 0),
    }

    def __init__(self, config=None):
        
        # state and action spaces
        self.action_space = self.ACTION_SPACE
        self.observation_space = self.OBSERVATION_SPACE

        self.players = ["player_1", "player_2"]        

    def reset(self):
        self.player_scores = [0, 0]  # cumulative scores (tracked, but not returned to the agents)
        self.curr_round = 0
        self.player_pointer = random.randint(0, 1)
        self.state = [
            [3, 3],
            [3, 3],
            [3, 3],
        ]
        # reward is given to the last player with 1 delay
        self.reward_buffer = {p: 0 for p in self.players}
        return {self.players[self.player_pointer]: self.state}

    def step(self, action_dict):
        # Get current player
        curr_player_pointer = self.player_pointer
        curr_player = self.players[self.player_pointer]

        # Get next player
        next_player_pointer = (self.player_pointer + 1) % 2
        next_player = self.players[next_player_pointer]
    
        # Make sure we received an action only for the current player
        assert curr_player in action_dict and len(action_dict) == 1,\
            "{} should be playing but action {} was received.".format(curr_player, action_dict)

        # Play the action
        curr_action = action_dict[curr_player]
        assert self.action_space.contains(curr_action), 'Action {} is not valid'.format(curr_action)
        # NA is defined on the Actions class, not on the environment
        assert self.state[self.curr_round][curr_player_pointer] == Actions.NA,\
            "Player {} has already played in round {}. Here the current state: {}".format(
            curr_player_pointer,
            self.curr_round,
            self.state
        )
        self.state[self.curr_round][curr_player_pointer] = curr_action

        # The game might not be done yet
        done = {"__all__": False}

        # If the next player has already played, the round is done
        game_done = False
        round_done = self.state[self.curr_round][next_player_pointer] != Actions.NA
        if round_done:
            # If the round is done we compute the rewards
            curr_rewards = self.rewards[tuple(self.state[self.curr_round])]
            self.player_scores[0] += curr_rewards[0]
            self.player_scores[1] += curr_rewards[1]            
            self.reward_buffer[curr_player] = curr_rewards[curr_player_pointer]
            
            self.curr_round += 1
            if self.curr_round == 3:
                done = {"__all__": True}
                # Return reward and state for all players
                reward = self.reward_buffer
                obs = {p: self.state for p in self.players}
                game_done = True
        
        # Get the state and reward for the next player
        if not game_done:
            obs = {next_player: self.state}
            reward = {next_player: self.reward_buffer[next_player]}
        
        # Move pointer to next player
        self.player_pointer = next_player_pointer
        return obs, reward, done, {}

In [59]:
# Test the environment
import random

env = RockPaperScissors()
obs = env.reset()
print(obs)

is_done = False
while not is_done:
    print('\nRound {}: {}'.format(env.curr_round, env.players[env.player_pointer]))
    action = {list(obs.keys())[0]: int(input('Insert action (0, 1, 2): '))}
    obs, reward, done, _ = env.step(action)
    print(obs, reward, done)
    is_done = done['__all__']

{'player_2': [[3, 3], [3, 3], [3, 3]]}

Round 0: player_2
Insert action (0, 1, 2): 0
{'player_1': [[3, 0], [3, 3], [3, 3]]} {'player_1': 0} {'__all__': False}

Round 0: player_1
Insert action (0, 1, 2): 1
{'player_2': [[1, 0], [3, 3], [3, 3]]} {'player_2': 0} {'__all__': False}

Round 1: player_2
Insert action (0, 1, 2): 1
{'player_1': [[1, 0], [3, 1], [3, 3]]} {'player_1': 1} {'__all__': False}

Round 1: player_1
Insert action (0, 1, 2): 0
{'player_2': [[1, 0], [0, 1], [3, 3]]} {'player_2': 0} {'__all__': False}

Round 2: player_2
Insert action (0, 1, 2): 2
{'player_1': [[1, 0], [0, 1], [3, 2]]} {'player_1': -1} {'__all__': False}

Round 2: player_1
Insert action (0, 1, 2): 2
{'player_1': [[1, 0], [0, 1], [2, 2]], 'player_2': [[1, 0], [0, 1], [2, 2]]} {'player_1': 0, 'player_2': 0} {'__all__': True}
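The docstring claims that the player who moves second should learn to always win. A small pure-Python sketch shows why, under the action encoding used above (ROCK=0, PAPER=1, SCISSORS=2). The helper names `best_response` and `payoff` are made up for this illustration; `payoff` reproduces the entries of the `rewards` table with modular arithmetic.

```python
ROCK, PAPER, SCISSORS = 0, 1, 2


def best_response(opponent_action):
    """Action that beats `opponent_action` under the encoding above."""
    return (opponent_action + 1) % 3


def payoff(own_action, other_action):
    """Reward for `own_action` against `other_action` (+1 win, 0 draw, -1 loss)."""
    if own_action == other_action:
        return 0
    return 1 if (own_action - other_action) % 3 == 1 else -1


# The first mover uses each action once, as the no-reuse rule requires.
first_mover = [ROCK, PAPER, SCISSORS]
responses = [best_response(a) for a in first_mover]
total = sum(payoff(r, a) for r, a in zip(responses, first_mover))

print(responses, total)
```

The responses are themselves three distinct actions, so the second mover can win every round without re-using an action. Its best possible return over the three rounds is therefore 3, which is why a stop threshold slightly below 3 is a reasonable choice during training.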


## 2. Create the custom model for Variable-Length Action Spaces
https://docs.ray.io/en/master/rllib-models.html#variable-length-parametric-action-spaces
Our policy has to account for the fact that some actions may not be available in the current state.

In [7]:
#Todo
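Until the custom model is written, the idea can be sketched without RLlib. The snippet below is a framework-free illustration, not the model itself: it derives an action mask from the game state (an action is unavailable once the player has used it in an earlier round) and applies it to a vector of logits so that unavailable actions get probability zero. The function names `action_mask` and `masked_softmax` and the example logits are assumptions made for this sketch.

```python
import math

NUM_ACTIONS = 3  # ROCK, PAPER, SCISSORS
NA = 3           # "not yet played", same value as Actions.NA


def action_mask(state, player_idx):
    """Return 1 for actions the player has not used yet, 0 otherwise."""
    used = {rnd[player_idx] for rnd in state if rnd[player_idx] != NA}
    return [0 if a in used else 1 for a in range(NUM_ACTIONS)]


def masked_softmax(logits, mask):
    """Softmax over the actions whose mask entry is 1.

    Assumes at least one action is available, which always holds in this
    game because a player has used at most two actions before round three.
    """
    masked = [l if m else -math.inf for l, m in zip(logits, mask)]
    top = max(masked)
    exps = [math.exp(l - top) for l in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Round 0 is over: player_1 played PAPER (1), player_2 played ROCK (0).
state = [[1, 0], [NA, NA], [NA, NA]]
mask = action_mask(state, player_idx=0)
probs = masked_softmax([0.2, 1.5, -0.3], mask)

print(mask, probs)
```

In a custom model the same effect is usually obtained by adding a very large negative number to the logits of masked actions before they are passed to the action distribution; the mask itself would be delivered as part of the observation, as in the commented-out `Dict` observation space above.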

## 3. Train the Agents

In [2]:
# before training we have to initialize ray
import ray
from ray.rllib.agents.ppo import PPOTrainer

ray.shutdown()
ray.init(num_cpus=4)

2020-06-20 20:34:40,114	INFO resource_spec.py:212 -- Starting Ray with 4.44 GiB memory available for workers and up to 2.24 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-06-20 20:34:40,494	INFO services.py:1170 -- View the Ray dashboard at localhost:8266


{'node_ip_address': '192.168.1.125',
 'raylet_ip_address': '192.168.1.125',
 'redis_address': '192.168.1.125:38515',
 'object_store_address': '/tmp/ray/session_2020-06-20_20-34-40_110513_90627/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2020-06-20_20-34-40_110513_90627/sockets/raylet',
 'webui_url': 'localhost:8266',
 'session_dir': '/tmp/ray/session_2020-06-20_20-34-40_110513_90627'}

### 3.1. Example with tune

In [61]:
from ray import tune

config = {
    "env": RockPaperScissors,
#     "framework": "torch",
}

stop = {
    "episode_reward_mean": 2.90,
#     "timesteps_total": stop_timesteps,
#     "training_iteration": stop_iters,
}

results = tune.run(
    PPOTrainer,
    name='RLlibExample',
    config=config,
    verbose=1,
    stop=stop
)

Trial name,status,loc,iter,total time (s),ts,reward
PPO_RockPaperScissors_00000,TERMINATED,,9,42.7967,36999,2.92286


### 3.2. Plain example without tune

In [65]:
trainer_config = {
    "env": RockPaperScissors
}
trainer = PPOTrainer(config=trainer_config)
for i in range(5):
    res = trainer.train()
    print("Iteration {}. episode_reward_mean: {}".format(i, res['episode_reward_mean']))

2020-06-20 20:30:01,649	INFO trainable.py:217 -- Getting current IP.


Iteration 0. episode_reward_mean: 0.04148783977110158
Iteration 1. episode_reward_mean: 0.848575712143928
Iteration 2. episode_reward_mean: 1.2814285714285714
Iteration 3. episode_reward_mean: 1.9835082458770614
Iteration 4. episode_reward_mean: 2.3605150214592276


### 3.3. Example with multiple policies
Inspired by: https://github.com/ray-project/ray/blob/master/rllib/examples/multi_agent_two_trainers.py

In [4]:
from ray.rllib.agents.ppo.ppo_tf_policy import PPOTFPolicy
from ray.rllib.agents.dqn.dqn_tf_policy import DQNTFPolicy
from ray.rllib.agents.ppo import PPOTrainer
from ray.rllib.agents.dqn import DQNTrainer
from ray.tune.logger import pretty_print


policies = {
    "ppo_policy_1": (PPOTFPolicy, RockPaperScissors.OBSERVATION_SPACE, RockPaperScissors.ACTION_SPACE, {}),
    "dqn_policy_1": (DQNTFPolicy, RockPaperScissors.OBSERVATION_SPACE, RockPaperScissors.ACTION_SPACE, {}),
}

# Define the PPO trainer
ppo_trainer = PPOTrainer(config={
    "env": RockPaperScissors,
    "multiagent": {
        "policies_to_train": ['ppo_policy_1'],
        "policies": policies,
        "policy_mapping_fn": lambda agent_id: "ppo_policy_1" if agent_id=="player_1" else "dqn_policy_1",
    },
    # disable filters, otherwise we would need to synchronize those
    # as well to the DQN agent
    "observation_filter": "NoFilter",
})

# Define the DQN trainer
dqn_trainer = DQNTrainer(config={
    "env": RockPaperScissors,
    "multiagent": {
        "policies_to_train": ['dqn_policy_1'],
        "policies": policies,
        "policy_mapping_fn": lambda agent_id: "ppo_policy_1" if agent_id=="player_1" else "dqn_policy_1",
    },
})

# Alternate training of the two policies
stop_reward = 2.9
for i in range(20):
    print("== Iteration", i, "==")

    # improve the DQN policy
    print("-- DQN --")
    result_dqn = dqn_trainer.train()
    # print(pretty_print(result_dqn))
    print("\tDQN. episode_reward_mean: {}".format(result_dqn['episode_reward_mean']))

    # improve the PPO policy
    print("-- PPO --")
    result_ppo = ppo_trainer.train()
    # print(pretty_print(result_ppo))
    print("\tPPO. episode_reward_mean: {}".format(result_ppo['episode_reward_mean']))

    # Test passed gracefully.
    if (
        result_dqn["episode_reward_mean"] > stop_reward and
        result_ppo["episode_reward_mean"] > stop_reward
    ):
        print("test passed (both agents above requested reward)")
        break

    # swap weights to synchronize (disabled here; the names passed to
    # get_weights must match the keys of `policies`)
#     dqn_trainer.set_weights(ppo_trainer.get_weights(["ppo_policy_1"]))
#     ppo_trainer.set_weights(dqn_trainer.get_weights(["dqn_policy_1"]))

2020-06-20 20:36:55,479	INFO trainable.py:217 -- Getting current IP.
2020-06-20 20:36:58,277	INFO trainable.py:217 -- Getting current IP.


== Iteration 0 ==
-- DQN --
	DQN. episode_reward_mean: -0.05389221556886228
-- PPO --
	PPO. episode_reward_mean: -0.025525525525525526
== Iteration 1 ==
-- DQN --
	DQN. episode_reward_mean: 0.005988023952095809
-- PPO --
	PPO. episode_reward_mean: 0.2912912912912913
== Iteration 2 ==
-- DQN --
	DQN. episode_reward_mean: 0.5654761904761905
-- PPO --
	PPO. episode_reward_mean: 0.7485029940119761
== Iteration 3 ==
-- DQN --
	DQN. episode_reward_mean: 0.38922155688622756
-- PPO --
	PPO. episode_reward_mean: 0.972972972972973
== Iteration 4 ==
-- DQN --
	DQN. episode_reward_mean: 0.6646706586826348
-- PPO --
	PPO. episode_reward_mean: 1.021021021021021
== Iteration 5 ==
-- DQN --
	DQN. episode_reward_mean: 0.9940476190476191
-- PPO --
	PPO. episode_reward_mean: 1.2664670658682635
== Iteration 6 ==
-- DQN --
	DQN. episode_reward_mean: 0.9880239520958084
-- PPO --
	PPO. episode_reward_mean: 1.3063063063063063
== Iteration 7 ==
-- DQN --
	DQN. episode_reward_mean: 1.1197604790419162
-- PPO --


## 4. Evaluate the agents
Execute in the console:
```console
tensorboard --logdir=~/ray_results --host=0.0.0.0
```
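Besides TensorBoard, Tune also writes a `progress.csv` file into each trial directory, so the learning curve can be inspected with the standard library alone. The sketch below assumes the default results location and the run name `RLlibExample` used in section 3.1; `reward_curve` is a hypothetical helper, and the directory layout may differ between Ray versions.

```python
import csv
from pathlib import Path


def reward_curve(progress_csv):
    """Return (training_iteration, episode_reward_mean) pairs from a progress.csv."""
    curve = []
    with open(progress_csv, newline="") as f:
        for row in csv.DictReader(f):
            curve.append(
                (int(row["training_iteration"]), float(row["episode_reward_mean"]))
            )
    return curve


# Assumed location: ~/ray_results/<run name>/<trial dir>/progress.csv
results_dir = Path("~/ray_results/RLlibExample").expanduser()
for csv_path in sorted(results_dir.glob("*/progress.csv")):
    curve = reward_curve(csv_path)
    if curve:
        # Print the last logged iteration and its mean episode reward.
        print(csv_path.parent.name, curve[-1])
```

If the directory does not exist, the loop simply prints nothing.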