# Tutorial 14: Multiagent

This tutorial covers the implementation of multiagent experiments in Flow. It assumes some knowledge of, or experience with, writing custom environments and running experiments with RLlib. The rest of the tutorial is organized as follows: Section 1 describes how custom environments can be extended into multiagent environments, and Section 2 walks through an example of running a multiagent environment in RLlib.

## 1. Creating a Multiagent Environment Class

In this section we set up the steps for creating a multiagent environment. We begin by importing the abstract multi-agent environment class.

In [1]:
# import the base Multi-agent environment 
from flow.envs.multiagent.base import MultiEnv

In multiagent experiments, the agents can either share a policy ("shared policy") or have different policies ("non-shared policy"). The following subsections describe the two cases.

### 1.1 Shared policies
In a multi-agent environment with a shared policy, different agents use the same policy.

We define the environment class, which inherits from `MultiEnv`, the multi-agent version of the base environment.

In [2]:
class SharedMultiAgentEnv(MultiEnv):
    pass

This environment provides the interface for running and modifying the multiagent experiment. Using this class, we can start the simulation (e.g. in SUMO), provide a network that specifies the configuration and controllers, perform simulation steps, and reset the simulation to an initial configuration.

For multi-agent experiments, certain methods of `MultiEnv` are redefined in terms of the agents. Some are defined with respect to a *single* agent, while others are defined with respect to *all* agents.

The observation space and action space are defined for a *single* agent (not all agents):

* **observation_space**
* **action_space**

For instance, in a multiagent traffic light grid where each agent is a single intersection controlling that intersection's traffic lights, the observation space can be defined as the *normalized* velocities of nearby vehicles and their distances to that *single* intersection. The same definition applies to every intersection.

In [3]:
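from gym.spaces import Box
import numpy as np


class SharedMultiAgentEnv(MultiEnv):
    @property
    def observation_space(self):
        """State space that is partially observed.

        Velocities and distance to intersections for nearby
        vehicles ('num_observed') from each direction.
        """
        tl_box = Box(
            low=0.,
            high=1,
            # shape must be a tuple: 2 quantities (velocity, distance) for
            # num_observed vehicles from each of the 4 directions
            shape=(2 * 4 * self.num_observed,),
            dtype=np.float32)
        return tl_box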
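
To make the `2 * 4 * self.num_observed` shape concrete, the following is a minimal sketch of how one intersection's observation vector could be assembled. It is not Flow's implementation: the function `build_observation`, its arguments, and the zero-padding convention are assumptions made for illustration, and plain Python lists stand in for NumPy arrays.

```python
def build_observation(speeds_by_dir, dists_by_dir, num_observed,
                      max_speed, max_dist):
    """Return a flat, normalized observation for ONE intersection.

    speeds_by_dir / dists_by_dir hold four lists each (one per incoming
    direction) with the raw speeds and distances of nearby vehicles.
    All names here are hypothetical.
    """
    def normalize_and_pad(values_by_dir, scale):
        flat = []
        for values in values_by_dir:
            # keep at most num_observed vehicles and scale into [0, 1]
            kept = [min(v / scale, 1.0) for v in values[:num_observed]]
            # pad with zeros so each direction contributes num_observed entries
            kept += [0.0] * (num_observed - len(kept))
            flat.extend(kept)
        return flat

    return (normalize_and_pad(speeds_by_dir, max_speed)
            + normalize_and_pad(dists_by_dir, max_dist))


obs = build_observation(
    speeds_by_dir=[[10.0, 5.0], [], [30.0], [15.0, 0.0, 7.5]],
    dists_by_dir=[[20.0, 50.0], [], [100.0], [10.0, 40.0, 80.0]],
    num_observed=2, max_speed=30.0, max_dist=100.0)
print(len(obs))  # 2 quantities * 4 directions * 2 vehicles = 16
```

Every agent produces a vector of the same length, which is why a single `Box` definition can describe the observation of any intersection.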

The action space can be defined for a *single* intersection as follows:

In [4]:
from gym.spaces import Box, Discrete
import numpy as np


class SharedMultiAgentEnv(MultiEnv):
    @property
    def action_space(self):
        """See class definition."""
        if self.discrete:
            # each intersection is an agent, and the action is simply 0 or 1.
            # - 0 means no change in the traffic light
            # - 1 means switch the direction
            return Discrete(2)
        else:
            return Box(low=0, high=1, shape=(1,), dtype=np.float32)

Conversely, the following methods (including their return values) are defined to take *all* agents into account:

* **apply_rl_actions**
* **get_state**
* **compute_reward**

Instead of computing the actions, state, and reward for a single agent, these methods handle them for all agents in the system. To do so, we use dictionaries with agent ids as keys and the corresponding quantity (action, state, or reward) as values. For example, in the following `_apply_rl_actions` method, the state of each intersection's traffic lights is changed based on that intersection's action (0 or 1).

In [5]:
class SharedMultiAgentEnv(MultiEnv):
    def _apply_rl_actions(self, rl_actions):
        for agent_name in rl_actions:
            action = rl_actions[agent_name]
            # index of this agent's traffic light; assumes agent names follow
            # the 'center<N>' pattern used for node_id below
            tl_num = int(agent_name.split('center')[1])
            # (handling of discrete vs. continuous actions omitted here)

            if self.currently_yellow[tl_num] == 1:  # currently yellow
                self.last_change[tl_num] += self.sim_step
                # check if our timer has exceeded the yellow phase, meaning
                # it should switch to red
                if self.last_change[tl_num] >= self.min_switch_time:
                    if self.direction[tl_num] == 0:
                        self.k.traffic_light.set_state(
                            node_id='center{}'.format(tl_num),
                            state="GrGr")
                    else:
                        self.k.traffic_light.set_state(
                            node_id='center{}'.format(tl_num),
                            state='rGrG')
                    self.currently_yellow[tl_num] = 0
            else:
                if action:
                    # a switch request starts the yellow phase
                    if self.direction[tl_num] == 0:
                        self.k.traffic_light.set_state(
                            node_id='center{}'.format(tl_num),
                            state='yryr')
                    else:
                        self.k.traffic_light.set_state(
                            node_id='center{}'.format(tl_num),
                            state='ryry')
                    self.last_change[tl_num] = 0.0
                    self.direction[tl_num] = not self.direction[tl_num]
                    self.currently_yellow[tl_num] = 1
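
The switching logic above can be tried out without a simulator. The sketch below is a toy stand-in and not part of Flow: the class `LightTimer`, its default parameter values, and the initial `'GrGr'` state are assumptions chosen for illustration. It keeps the same three pieces of per-light bookkeeping (`direction`, `currently_yellow`, `last_change`) and applies a dictionary of actions keyed by agent id.

```python
class LightTimer:
    """Toy model of one intersection's switching state (hypothetical)."""

    def __init__(self, sim_step=1.0, min_switch_time=3.0):
        self.sim_step = sim_step
        self.min_switch_time = min_switch_time
        self.direction = 0         # 0 or 1, toggled on every switch request
        self.currently_yellow = 0  # 1 while the yellow phase is running
        self.last_change = 0.0     # time elapsed since the yellow phase began
        self.state = 'GrGr'        # assumed initial light state

    def step(self, action):
        if self.currently_yellow:
            # during yellow the action is ignored; only the timer advances
            self.last_change += self.sim_step
            if self.last_change >= self.min_switch_time:
                self.state = 'GrGr' if self.direction == 0 else 'rGrG'
                self.currently_yellow = 0
        elif action:
            # a switch request starts the yellow phase
            self.state = 'yryr' if self.direction == 0 else 'ryry'
            self.last_change = 0.0
            self.direction = 1 - self.direction
            self.currently_yellow = 1
        return self.state


def apply_actions(lights, rl_actions):
    """Apply a {agent_id: action} dictionary to the matching lights."""
    return {agent_id: lights[agent_id].step(action)
            for agent_id, action in rl_actions.items()}


lights = {'center0': LightTimer(), 'center1': LightTimer()}
history = [apply_actions(lights, {'center0': a, 'center1': 0})
           for a in [0, 1, 1, 0, 0, 0]]
print([h['center0'] for h in history])
```

Note that the second switch request for `center0` arrives while the light is yellow and is therefore ignored, while `center1`, which never requests a switch, stays in its initial state.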

Similarly, the `get_state` and `compute_reward` methods return dictionaries, storing the observation and reward, respectively, as the value for each corresponding key (the agent id).

In [6]:
import numpy as np
from flow.core import rewards


class SharedMultiAgentEnv(MultiEnv):

    def get_state(self):
        """Observations for each intersection.

        :return: dictionary which contains agent-wise observations as follows:
        - For the self.num_observed number of vehicles closest and incoming
        towards traffic light agent, gives the vehicle velocity and distance to
        intersection.
        """
        # Normalization factors
        max_speed = max(
            self.k.network.speed_limit(edge)
            for edge in self.k.network.get_edge_list())
        # grid dimensions, assumed to be stored in the network parameters
        grid_array = self.net_params.additional_params["grid_array"]
        max_dist = max(grid_array["short_length"], grid_array["long_length"],
                       grid_array["inner_length"])

        # Observed vehicle information
        speeds = []
        dist_to_intersec = []

        for _, edges in self.network.node_mapping:
            local_speeds = []
            local_dists_to_intersec = []
            # .... More code here (removed for simplicity of example)
            # ....

            speeds.append(local_speeds)
            dist_to_intersec.append(local_dists_to_intersec)

        obs = {}
        for agent_id in self.k.traffic_light.get_ids():
            # .... More code here (removed for simplicity of example)
            # ....
            # np.concatenate expects a single sequence of arrays
            observation = np.array(
                np.concatenate((speeds, dist_to_intersec)))
            obs.update({agent_id: observation})
        return obs

    def compute_reward(self, rl_actions, **kwargs):
        if rl_actions is None:
            return {}

        if self.env_params.evaluate:
            rew = -rewards.min_delay_unscaled(self)
        else:
            rew = -rewards.min_delay_unscaled(self) \
                  + rewards.penalize_standstill(self, gain=0.2)

        # each agent receives reward normalized by number of lights
        rew /= self.num_traffic_lights

        # every agent that acted receives the same (shared) reward
        rews = {}
        for rl_id in rl_actions.keys():
            rews[rl_id] = rew
        return rews
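
The dictionary convention used by `compute_reward` can be isolated in a few lines. The following is only a sketch: `per_agent_rewards` and its arguments are hypothetical names, and the network-wide reward is passed in as a plain number instead of being computed from the simulation.

```python
def per_agent_rewards(rl_actions, network_reward, num_traffic_lights):
    """Split one network-wide reward evenly across the acting agents."""
    if rl_actions is None:
        # no actions were applied, so no agent receives a reward
        return {}
    share = network_reward / num_traffic_lights
    # one entry per agent id, mirroring the keys of the action dictionary
    return {agent_id: share for agent_id in rl_actions}


rews = per_agent_rewards({'center0': 1, 'center1': 0},
                         network_reward=-8.0, num_traffic_lights=2)
print(rews)  # {'center0': -4.0, 'center1': -4.0}
```

The reward dictionary shares its keys with the action dictionary, which is what allows each entry to be matched back to its agent.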

### 1.2 Non-shared policies

In a multi-agent environment with non-shared policies, different agents use different policies. In what follows, two agents in a ring road use two different policies, 'adversary' and 'av' (non-adversary).

As with shared policies, the observation space and action space are defined for a *single* agent (not all agents):

* **observation_space**
* **action_space**

The following methods (including their return values) are defined to take *all* agents into account:

* **apply_rl_actions**
* **get_state**
* **compute_reward**

\* Note that because the observation space and action space are defined for a single agent, all agents must have observation and action spaces of the same dimension, even when their policies differ.

Let us start by defining the `_apply_rl_actions` method. To make it work for a non-shared-policy multi-agent ring road, we compute the applied action as the 'av' policy's action plus the 'adversary' policy's action scaled by `perturb_weight`.

In [7]:
class NonSharedMultiAgentEnv(MultiEnv):
    def _apply_rl_actions(self, rl_actions):
        # the names of all autonomous (RL) vehicles in the network
        agent_ids = [
            veh_id for veh_id in self.sorted_ids
            if veh_id in self.k.vehicle.get_rl_ids()
        ]
        # define different actions for different multi-agents 
        av_action = rl_actions['av']
        adv_action = rl_actions['adversary']
        perturb_weight = self.env_params.additional_params['perturb_weight']
        rl_action = av_action + perturb_weight * adv_action
        
        # use the base environment method to convert actions into accelerations for the rl vehicles
        self.k.vehicle.apply_acceleration(agent_ids, rl_action)
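
The way the two policies' outputs are merged can be checked in isolation. This is a minimal sketch under stated assumptions: `combine_actions` is a hypothetical helper, the actions are plain Python lists instead of NumPy arrays, and the value of `perturb_weight` is made up for the example.

```python
def combine_actions(rl_actions, perturb_weight):
    """Add the scaled adversary perturbation to the 'av' action."""
    av_action = rl_actions['av']
    adv_action = rl_actions['adversary']
    # element-wise: av + perturb_weight * adversary
    return [a + perturb_weight * d for a, d in zip(av_action, adv_action)]


combined = combine_actions({'av': [1.0], 'adversary': [-2.0]},
                           perturb_weight=0.25)
print(combined)  # 1.0 + 0.25 * (-2.0) = 0.5
```

A small `perturb_weight` limits how far the adversary can push the applied acceleration away from what the 'av' policy requested.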

In the `get_state` method, we define the state for each of the agents. Remember that the states of the agents can differ. For simplicity, in this example the adversary and non-adversary agents share the same state.

In the `compute_reward` method, the agents receive opposing speed rewards: the adversary's reward is higher when the vehicles are slow, while the non-adversary agent is rewarded for increasing the vehicles' speeds.

In [8]:
import numpy as np
from flow.core import rewards


class NonSharedMultiAgentEnv(MultiEnv):
    def get_state(self, **kwargs):
        state = np.array([[
            self.k.vehicle.get_speed(veh_id) / self.k.network.max_speed(),
            self.k.vehicle.get_x_by_id(veh_id) / self.k.network.length()
        ] for veh_id in self.sorted_ids])
        state = np.ndarray.flatten(state)
        return {'av': state, 'adversary': state}

    def compute_reward(self, rl_actions, **kwargs):
        if self.env_params.evaluate:
            reward = np.mean(self.k.vehicle.get_speed(
                self.k.vehicle.get_ids()))
            return {'av': reward, 'adversary': -reward}
        else:
            reward = rewards.desired_velocity(self, fail=kwargs['fail'])
            return {'av': reward, 'adversary': -reward}
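
The shared state and the opposing rewards can be illustrated with toy numbers. The sketch below does not call Flow: `shared_state` and `opposing_rewards` are hypothetical helpers, the speeds and positions are invented, and the reward is the mean speed as in the evaluation branch above.

```python
def shared_state(speeds, positions, max_speed, ring_length):
    """Normalized (speed, position) pairs, flattened, for both agents."""
    state = []
    for speed, pos in zip(speeds, positions):
        state.extend([speed / max_speed, pos / ring_length])
    # both agents observe the same state in this example
    return {'av': state, 'adversary': state}


def opposing_rewards(speeds):
    """Mean speed for 'av', its negation for 'adversary'."""
    reward = sum(speeds) / len(speeds)
    return {'av': reward, 'adversary': -reward}


state = shared_state([15.0, 30.0], [115.0, 230.0],
                     max_speed=30.0, ring_length=230.0)
rews = opposing_rewards([2.0, 4.0])
print(state['av'])  # [0.5, 0.5, 1.0, 1.0]
print(rews)         # {'av': 3.0, 'adversary': -3.0}
```

Because the two rewards always sum to zero, any gain for one agent is an equal loss for the other.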

## 2. Running a Multiagent Environment in RLlib

When running an experiment that uses a multiagent environment, we specify certain parameters in the `flow_params` dictionary.

As with any other experiment, the following code snippets go into a blank Python file (e.g. `new_multiagent_experiment.py`), which should be saved under the `flow/examples/exp_configs/rl/multiagent/` directory. (All basic imports and variable initializations are omitted in this example for brevity.)

In [9]:
from flow.envs.multiagent import MultiWaveAttenuationPOEnv
from flow.networks import MultiRingNetwork
from flow.core.params import SumoParams, EnvParams, NetParams, VehicleParams, InitialConfig
from flow.controllers import ContinuousRouter, IDMController, RLController

# time horizon of a single rollout
HORIZON = 3000
# Number of rings
NUM_RINGS = 1

vehicles = VehicleParams()
for i in range(NUM_RINGS):
    vehicles.add(
        veh_id='human_{}'.format(i),
        acceleration_controller=(IDMController, {
            'noise': 0.2
        }),
        routing_controller=(ContinuousRouter, {}),
        num_vehicles=21)
    vehicles.add(
        veh_id='rl_{}'.format(i),
        acceleration_controller=(RLController, {}),
        routing_controller=(ContinuousRouter, {}),
        num_vehicles=1)

flow_params = dict(
    # name of the experiment
    exp_tag='multiagent_ring_road',

    # name of the flow environment the experiment is running on
    env_name=MultiWaveAttenuationPOEnv,

    # name of the network class the experiment is running on
    network=MultiRingNetwork,

    # simulator that is used by the experiment
    simulator='traci',

    # sumo-related parameters (see flow.core.params.SumoParams)
    sim=SumoParams(
        sim_step=0.1,
        render=False,
    ),

    # environment related parameters (see flow.core.params.EnvParams)
    env=EnvParams(
        horizon=HORIZON,
        warmup_steps=750,
        additional_params={
            'max_accel': 1,
            'max_decel': 1,
            'ring_length': [230, 230],
            'target_velocity': 4
        },
    ),

    # network-related parameters 
    net=NetParams(
        additional_params={
            'length': 230,
            'lanes': 1,
            'speed_limit': 30,
            'resolution': 40,
            'num_rings': NUM_RINGS
        },
    ),

    # vehicles to be placed in the network at the start of a rollout
    veh=vehicles,

    # parameters specifying the positioning of vehicles upon initialization/
    # reset
    initial=InitialConfig(bunching=20.0, spacing='custom'),
)

Then we run the following code to create the environment:

In [10]:
from flow.utils.registry import make_create_env
from ray.tune.registry import register_env

create_env, env_name = make_create_env(params=flow_params, version=0)

# Register as rllib env
register_env(env_name, create_env)

test_env = create_env()
obs_space = test_env.observation_space
act_space = test_env.action_space

### 2.1 Shared policies

When we run a shared-policy multiagent experiment, every agent is mapped to the same policy. In the example below, all agents use the 'av' policy.

In [11]:
from ray.rllib.agents.ppo.ppo_policy import PPOTFPolicy

def gen_policy():
    """Generate a policy in RLlib."""
    return PPOTFPolicy, obs_space, act_space, {}


# Set up a single policy graph, 'av', shared by every agent
POLICY_GRAPHS = {'av': gen_policy()}


def policy_mapping_fn(_):
    """Map a policy in RLlib."""
    return 'av'


POLICIES_TO_TRAIN = ['av']

### 2.2 Non-shared policies

When we run a non-shared multiagent experiment, each agent is mapped to a different policy. In the example below, the policy graph has two policies, 'adversary' and 'av' (non-adversary).

In [12]:
def gen_policy():
    """Generate a policy in RLlib."""
    return PPOTFPolicy, obs_space, act_space, {}


# Set up two separate policy graphs, one per agent
POLICY_GRAPHS = {'av': gen_policy(), 'adversary': gen_policy()}


def policy_mapping_fn(agent_id):
    """Map a policy in RLlib."""
    return agent_id
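
To see how the two mapping functions differ, the sketch below routes a few agent ids through each of them. It does not import RLlib: placeholder tuples stand in for the `(policy class, observation space, action space, config)` tuples returned by `gen_policy`, the agent ids `rl_0` and `rl_1` are invented, and `build_multiagent_config` is a hypothetical helper. The keys `policies`, `policy_mapping_fn`, and `policies_to_train` are meant to mirror the `multiagent` section of an RLlib config, and the `policies_to_train` list for the non-shared case is an assumption; check both against your installed RLlib version.

```python
def shared_mapping(_agent_id):
    """Every agent uses the single 'av' policy."""
    return 'av'


def non_shared_mapping(agent_id):
    """Each agent uses the policy named after its own id."""
    return agent_id


def build_multiagent_config(policy_graphs, mapping_fn, policies_to_train):
    """Collect the multiagent settings into one dictionary (hypothetical)."""
    return {
        'policies': policy_graphs,
        'policy_mapping_fn': mapping_fn,
        'policies_to_train': policies_to_train,
    }


# placeholder for (policy class, obs space, action space, config)
placeholder = ('PolicyClass', 'obs_space', 'act_space', {})

shared_cfg = build_multiagent_config(
    {'av': placeholder}, shared_mapping, ['av'])
non_shared_cfg = build_multiagent_config(
    {'av': placeholder, 'adversary': placeholder},
    non_shared_mapping, ['av', 'adversary'])

for cfg, agent_ids in ((shared_cfg, ['rl_0', 'rl_1']),
                       (non_shared_cfg, ['av', 'adversary'])):
    for agent_id in agent_ids:
        policy_name = cfg['policy_mapping_fn'](agent_id)
        # every agent must resolve to a policy that actually exists
        assert policy_name in cfg['policies']
        print(agent_id, '->', policy_name)
```

In the non-shared case the mapping only works if the agent ids returned by the environment match the keys of the policy dictionary, which is why `get_state` and `compute_reward` use 'av' and 'adversary' as keys.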

Lastly, as with any other experiment, we run our code using `train_rllib.py` as follows:

    python flow/examples/train_rllib.py new_multiagent_experiment.py