# Pommerman

The Pommerman challenge was originally held on June 3rd, 2018. The goal was to train an agent to play Pommerman, a game based on the classic Bomberman. In this competition, the agent faced three other agents in a free-for-all game. The organizers designed an environment based on the original game and invited everyone to submit their agents. On November 21st a similar competition was held, in which the agent had to learn to cooperate with a teammate against another team of two agents.
We approached this challenge using the advantage actor-critic ([A2C](https://hackernoon.com/intuitive-rl-intro-to-advantage-actor-critic-a2c-4ff545978752)) method, with the goal of training two separate agents to perform better than two hardcoded agents. Each match starts on a randomly drawn 11x11 grid with four agents, one in each corner. An agent's teammate starts in the diagonally opposite corner.

In addition to the agents, the board contains wooden walls and rigid walls. Rigid walls are indestructible. Wooden walls can be destroyed by bombs; each destroyed wall becomes either a passage or a power-up, with equal probability. The agent starts with one bomb. Every time it lays a bomb, its bomb count decreases by one, and when that bomb explodes the count increases by one again. If the agent picks up the **Extra Bomb** power-up, its count increases by one. The agent also has a blast strength that starts at three, which is how far a bomb's blast reaches in the vertical and horizontal directions. If the **Increase Range** power-up is picked up, the blast strength increases by one. If an agent picks up the **Can Kick** power-up, it can kick a bomb by running into it. A bomb has a fuse of 10 time steps. After that, it explodes, and any wooden walls, agents, power-ups or other bombs in its range are destroyed. At each timestep every agent chooses one of these six actions:

* **Stop** : This action is a pass.
* **Up**: Move one position up on the board.
* **Left**: Move one position to the left on the board.
* **Down**: Move one position down on the board.
* **Right**: Move one position to the right on the board.
* **Bomb**: Lay a bomb at the current position.
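
As a rough illustration of the action list above, the sketch below encodes the six actions as integers and applies a move on an empty 11x11 board. The only value confirmed by this notebook is `Bomb = 5` (the training loop checks `actions[agent] == 5`); the order of the movement actions, the `(row, column)` position convention and the helper `next_position` are assumptions made for this example and should be checked against the installed `pommerman` version.

```python
from enum import IntEnum


class Action(IntEnum):
    """Assumed integer encoding of the six actions (verify against pommerman)."""
    STOP = 0
    UP = 1
    DOWN = 2
    LEFT = 3
    RIGHT = 4
    BOMB = 5


# (row, column) offsets for the four movement actions; rows grow downwards.
MOVES = {
    Action.UP: (-1, 0),
    Action.DOWN: (1, 0),
    Action.LEFT: (0, -1),
    Action.RIGHT: (0, 1),
}


def next_position(position, action, board_size=11):
    """Hypothetical helper: target cell, ignoring walls, bombs and other agents."""
    d_row, d_col = MOVES.get(Action(action), (0, 0))
    row, col = position[0] + d_row, position[1] + d_col
    if 0 <= row < board_size and 0 <= col < board_size:
        return (row, col)
    return position  # moving off the board leaves the agent in place


print(next_position((0, 0), Action.DOWN))  # (1, 0)
print(next_position((0, 0), Action.UP))    # (0, 0), blocked by the board edge
print(next_position((5, 5), Action.BOMB))  # (5, 5), laying a bomb does not move
```

Stop and Bomb both leave the position unchanged, which is why they fall through to the `(0, 0)` offset.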

The agents receive the following observations at each timestep:

* **Board**: 121 ints representing the flattened board. All squares outside the agent's view are covered by fog (int value 5).
* **Position**: 2 ints (x, y) each in the range [0, 10].
* **Ammo**: 1 int, the number of bombs the agent currently has left to place.
* **Blast Strength**: 1 int, the blast radius of the agent's bombs.
* **Can Kick**: 1 int, 0 or 1. Indicates whether or not the agent can kick bombs.
* **Teammate**: 1 int in the range [-1, 3]. In the free-for-all environment this is -1. In the cooperation environment it is the id of the agent's teammate.
* **Enemies**: 3 ints in the range [-1, 3]: the ids of the enemies. In the cooperation environment the last id is -1, as there are only 2 enemies.
* **Bombs**: List of ints. The bombs visible to the agent. Contains X int, Y int and Blast strength for each bomb.
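
To make the observation layout above more concrete, here is a minimal sketch that rebuilds the 11x11 grid from 121 flattened values and counts the squares that are not fog. The observation dict is made up for illustration, and value 0 is assumed to mean "passage". Note that the code later in this notebook indexes `obs["board"]` as a 2-D array, so this reshaping step would only be needed for a flattened encoding.

```python
BOARD_SIZE = 11
FOG = 5  # value used for squares outside the agent's view, as listed above

# Hypothetical observation: all fog except a 3x3 window of passages
# (value 0 is assumed to mean "passage") around the agent at (1, 1).
flat_board = [FOG] * (BOARD_SIZE * BOARD_SIZE)
for r in range(3):
    for c in range(3):
        flat_board[r * BOARD_SIZE + c] = 0

obs = {
    "board": flat_board,
    "position": (1, 1),
    "ammo": 1,
    "blast_strength": 3,
    "can_kick": 0,
}

# Rebuild the 11x11 grid from the 121 flattened values (row-major order).
grid = [obs["board"][r * BOARD_SIZE:(r + 1) * BOARD_SIZE] for r in range(BOARD_SIZE)]
visible = sum(cell != FOG for row in grid for cell in row)

print(len(grid), len(grid[0]))  # 11 11
print(visible)                  # 9
```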


In [None]:
# Run this after a change in the Pommerman folder
!py -m pip install -U . --user

In [None]:
import argparse
import pommerman
from pommerman import agents
from pommerman.agents import BaseAgent
from pommerman import characters
import pommerman.envs
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Categorical
import torch.nn.functional as F
import matplotlib.pyplot as plt
import random
from collections import deque

In [None]:
# Use GPU if available
use_cuda = torch.cuda.is_available()


def get_variable(x):
    """ Converts tensors to cuda, if available. """
    if use_cuda:
        return x.cuda()
    return x


def get_numpy(x):
    """ Get numpy array for both cuda and not. """
    if use_cuda:
        return x.cpu().data.numpy()
    return x.data.numpy()

In [None]:
# Initialize parameters
gamma = 0.99
lr = 2.5e-4
eps = 1e-5
alpha = 0.99
tau = 1
entropy_coef = 0.01
value_loss_coef = 0.5
log_interval = 10
num_processes = 20
num_steps = 20

The next part is based on [Ross Wightman's work](https://github.com/rwightman/pytorch-pommerman-rl/blob/master/envs/pommerman.py) on the Pommerman challenge, where the observations are one-hot encoded and compressed. The `features` function at the bottom takes the observations as input and returns a $9\times11\times11$ observation matrix and a $1\times3$ feature vector.
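
The idea of one-hot encoding followed by channel compaction and rescaling can be sketched on a toy board without NumPy. This is only an illustration of the principle, not the `featurize` function itself: the 2x2 board and the helper `one_hot` are made up, while the item values (0 = passage, 1 = rigid wall, 2 = wooden wall), the wall weighting and the rescaling formula follow the code cell that comes next.

```python
def one_hot(board, n_values):
    """Return channels[v][r][c] == 1.0 where board[r][c] == v, else 0.0."""
    return [[[1.0 if cell == v else 0.0 for cell in row] for row in board]
            for v in range(n_values)]


def rescale(x):
    """Map a value from [0, 1] to [-1, 1], as `_rescale` does."""
    return (x - 0.5) * 2.0


# Tiny 2x2 toy board: 0 = passage, 1 = rigid wall, 2 = wooden wall.
board = [[0, 1],
         [2, 0]]
channels = one_hot(board, n_values=3)

# Merge both wall types into one channel: rigid -> 1.0, wood -> 0.5.
walls = [[channels[1][r][c] + 0.5 * channels[2][r][c] for c in range(2)]
         for r in range(2)]
walls_rescaled = [[rescale(v) for v in row] for row in walls]

print(walls)           # [[0.0, 1.0], [0.5, 0.0]]
print(walls_rescaled)  # [[-1.0, 1.0], [0.0, -1.0]]
```

Compacting several item types into one channel with different weights is what reduces the number of board channels to nine.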

In [None]:
DEFAULT_FEATURE_CONFIG = {
    'recode_agents': True,
    'compact_powerups': True,
    'compact_structure': True,
    'rescale': True,
}


def make_np_float(feature):
    return np.array(feature).astype(np.float32)


def _rescale(x):
    return (x - 0.5) * 2.0


def featurize(obs, agent_id, config):
    max_item = pommerman.constants.Item.Agent3.value

    ob = obs["board"]
    ob_bomb_blast_strength = obs["bomb_blast_strength"].astype(np.float32) / pommerman.constants.AGENT_VIEW_SIZE
    ob_bomb_life = obs["bomb_life"].astype(np.float32) / pommerman.constants.DEFAULT_BOMB_LIFE

    # one hot encode the board items
    ob_values = max_item + 1
    ob_hot = np.eye(ob_values)[ob]

    # replace agent item channels with friend, enemy, self channels
    if config['recode_agents']:
        self_value = pommerman.constants.Item.Agent0.value + agent_id
        enemies = np.logical_and(ob >= pommerman.constants.Item.Agent0.value, ob != self_value)
        self = (ob == self_value)
        friends = (ob == pommerman.constants.Item.AgentDummy.value)
        ob_hot[:, :, 9] = friends.astype(np.float32)
        ob_hot[:, :, 10] = self.astype(np.float32)
        ob_hot[:, :, 11] = enemies.astype(np.float32)
        ob_hot = np.delete(ob_hot, np.s_[12::], axis=2)

    if config['compact_powerups']:
        # replace powerups with single channel
        powerup = ob_hot[:, :, 6] * 0.5 + ob_hot[:, :, 7] * 0.66667 + ob_hot[:, :, 8]
        ob_hot[:, :, 6] = powerup
        ob_hot = np.delete(ob_hot, [7, 8], axis=2)

    # replace bomb item channel with bomb life
    ob_hot[:, :, 3] = ob_bomb_life

    if config['compact_structure']:
        ob_hot[:, :, 0] = 0.5 * ob_hot[:, :, 0] + ob_hot[:, :, 5]  # passage + fog
        ob_hot[:, :, 1] = 0.5 * ob_hot[:, :, 2] + ob_hot[:, :, 1]  # rigid + wood walls
        ob_hot = np.delete(ob_hot, [2], axis=2)
        # replace former fog channel with bomb blast strength
        ob_hot[:, :, 5] = ob_bomb_blast_strength
    else:
        # insert bomb blast strength next to bomb life
        ob_hot = np.insert(ob_hot, 4, ob_bomb_blast_strength, axis=2)

    self_ammo = make_np_float([obs["ammo"]])
    self_blast_strength = make_np_float([obs["blast_strength"]])
    self_can_kick = make_np_float([obs["can_kick"]])

    ob_hot = ob_hot.transpose((2, 0, 1))  # PyTorch tensor layout compat

    if config['rescale']:
        ob_hot = _rescale(ob_hot)
        self_ammo = _rescale(self_ammo / 10)
        self_blast_strength = _rescale(self_blast_strength / pommerman.constants.AGENT_VIEW_SIZE)
        self_can_kick = _rescale(self_can_kick)

    return [ob_hot], [np.concatenate([self_ammo, self_blast_strength, self_can_kick])]


def features(obs, feature_config=DEFAULT_FEATURE_CONFIG):
    obs_im, obs_other = featurize(
                obs,
                0,
                feature_config)
    return obs_im, obs_other

A2C is based on two deep neural networks: one acts as the actor and the other as the critic. Both networks assess the current state. The actor uses this assessment to decide which action is most favorable, while the critic estimates the state value, that is, how good the current state is. When training the actor, the change in state value is used as an indication of how good the action was in the given state. As illustrated in the figure below, the two networks share the same architecture up to the last layers, where they separate.

The observations described above are one-hot encoded and compressed into a $9\times11\times11$ matrix $B$ using the function `featurize` from [Ross Wightman's work](https://github.com/rwightman/pytorch-pommerman-rl/blob/master/envs/pommerman.py). The same function also outputs a $1\times3$ vector $F$ with the agent's ammo, blast strength and whether the agent can kick bombs. $B$ is fed to a three-layer convolutional neural network (CNN) with 64 filters in each layer, and batch normalization is performed after every convolutional layer. The output of the last convolutional layer is fed to a dense layer with 1024 nodes, which in turn feeds a dense layer with 512 nodes. $F$ is fed to a dense layer with 3 nodes, and its output is concatenated with the output of the 512-node layer into a vector of length 515. This vector is then fed to a layer of Gated Recurrent Units (GRU) so that previous states are taken into account when assessing the current state. The output of the GRU is the input to two separate dense layers, which correspond to the actor and the critic.

Rectified Linear Units (ReLU) are used as the activation function on the output of the convolutional and dense layers, except for the actor and critic layers, where *softmax* and *tanh* are used respectively. *Tanh* is used on the critic because the state value should be negative when the current state is expected to yield a negative reward in the future. Softmax is used on the actor because its outputs are treated as the probability of each action being chosen during training (in the code, the softmax is applied in `select_action` rather than inside the network).
![Network architecture](Images/Model.png)
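
The layer sizes quoted above can be checked with a few lines of arithmetic. The sketch below uses the standard convolution output-size formula to show where the values 7744 (`n_conv_output`) and 515 (GRU input size) used later in the notebook come from; the helper `conv_output_size` is introduced here only for illustration.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial size after a convolution: floor((size + 2*padding - kernel) / stride) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


board_size = 11
n_filters = 64

size = board_size
for _ in range(3):  # three 3x3 convolutions with stride 1 and padding 1
    size = conv_output_size(size, kernel=3, stride=1, padding=1)

n_conv_output = n_filters * size * size  # flattened output of the last conv layer
gru_input = 512 + 3                      # dense output + processed feature vector F

print(size)           # 11
print(n_conv_output)  # 7744
print(gru_input)      # 515
```

Because the padding of 1 compensates for the 3x3 kernel, the board keeps its 11x11 shape through all three convolutions.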



In [None]:
class Actor_Critic(nn.Module):
    """Actor and critic - networks"""

    def __init__(self, n_inputs, n_outputs, inputs_other, n_conv_output):
        super(Actor_Critic, self).__init__()
        # network
        self.CNN = nn.Sequential(
                    nn.Conv2d(9, 64, 3, stride=1, padding=1),
                    nn.BatchNorm2d(64),
                    nn.ReLU(),
                    nn.Conv2d(64, 64, 3, stride=1, padding=1),
                    nn.BatchNorm2d(64),
                    nn.ReLU(),
                    nn.Conv2d(64, 64, 3, stride=1, padding=1),
                    nn.BatchNorm2d(64),
                    nn.ReLU(),
                    )
        self.CNN_mlp = nn.Sequential(        
                    nn.Linear(n_conv_output, 1024, bias=True),
                    nn.ReLU(),
                    nn.Linear(1024, 512, bias=True),
                    nn.ReLU(),
                    )
                    
        self.fnn_other = nn.Sequential(
                                    nn.Linear(inputs_other, inputs_other, bias=True),
                                    nn.ReLU(),
                                    )
        self.actor = nn.Sequential(
                                    nn.Linear(515, 6, bias=False),
                                    )
        self.state_value = nn.Sequential(
                                    nn.Linear(515, 1, bias=False),
                                    nn.Tanh(),
                                    )
        self.GRU = nn.GRUCell(n_inputs, 515)
        self.rewards = []
        self.values = []
        self.entropies = []
        self.log_prob = []

    def init_hidden(self):
        # This is what we'll initialise our hidden state as
        return get_variable(torch.zeros(1, 515))

    def forward(self, x_im, x_other, hxs, batch_size=1):
        out = []
        x = self.CNN(x_im)
        x = x.view(batch_size, -1)
        x = self.CNN_mlp(x)
        out.append(x)
        # Small feature vector F (ammo, blast strength, can kick)
        x = self.fnn_other(x_other)
        out.append(x)
        out = torch.cat(out, dim=1)
        out = hxs = self.GRU(out, hxs)
        action_scores = self.actor(out)
        state_values = self.state_value(out)
        return action_scores, state_values[0], hxs

In [None]:
class PommermanAgent(BaseAgent):
    # Agent class compatible with the Pommerman environment.
    def __init__(self, character=characters.Bomber, mode='old', model="saved_models/policy_network_power_up3"):
        super(PommermanAgent, self).__init__(character)

        n_inputs = 515
        n_conv_output = 7744
        inputs_other = 3
        n_outputs = 6
        
        self.net = Actor_Critic(n_inputs, n_outputs, inputs_other, n_conv_output)
        
        if use_cuda:
            self.net = self.net.cuda()
        # Load saved weights unless a fresh, untrained network is requested
        if mode != "new":
            map_location = None if use_cuda else 'cpu'
            self.net.load_state_dict(torch.load(model, map_location=map_location))

        self.hxs = self.net.init_hidden()

    def act(self, obs, action_space=None):
        obs_im, obs_other = features(obs)
        self.net.eval()
        with torch.no_grad():
            action_scores, _, hxs = self.net(get_variable(torch.Tensor(obs_im)), get_variable(torch.Tensor(obs_other)), self.hxs)
            self.hxs = hxs
        return action_scores.argmax().item()

In [None]:
# Create a set of agents (exactly four)
agent_list = [
    PommermanAgent(model="saved_models/policy_network_team_agent0"),
    agents.SimpleAgent(),
    PommermanAgent(model="saved_models/policy_network_team_agent2"),
    agents.SimpleAgent(),
]
# Make the "Team" environment using the agent list
env = pommerman.make('PommeTeamCompetitionFast-v0', agent_list)



When training, the environment is first initialized and the observation is fed to the neural network. The critic outputs the state value, while the actor outputs the probability of every action. An action is then sampled from these probabilities: if the actor assigns a 26% probability to laying a bomb, that action is sampled 26% of the time. The action is performed in the environment, which returns a new state and a reward. This corresponds to one step. After every step, the reward, the state value, the log-likelihood of the chosen action and the entropy are saved in the model. After 20 steps, the saved values are used to calculate the gradients that update the critic and the actor.

First, the discounted return $R_t$ is calculated and used to compute the value loss $L_v = \frac{1}{2}\sum_t\left(\hat{V}_\phi^\pi(s_t)-R_t\right)^2$. Next, the advantage is calculated as $\hat{A}^\pi(s_t, a_t) = r(s_t, a_t) + \gamma\hat{V}_\phi^\pi(s_t')- \hat{V}_\phi^\pi(s_t)$. The advantage gives the agent a better understanding of what defines a good action. Without it, all actions would be punished if the agent was in a state with only negative outcomes. With it, an action is regarded as good if the state value after the action is better than before. The policy loss is $L_p = -\sum_t\left[\log\pi_\theta(a_t|s_t)\,\hat{A}^\pi(s_t,a_t)+\beta H(\pi_\theta(\cdot|s_t))\right]$. The entropy $H$ describes the spread of the action distribution; subtracting it from the policy loss encourages exploration by keeping any single action from getting too high a probability compared with all the others. The losses are used to obtain the gradient with respect to the network parameters. The gradient is clipped to prevent overly large parameter updates, which can destabilize the policy. Root Mean Square Propagation (RMSprop) is used as the optimizer. The pseudocode for one parameter update of the actor-critic algorithm is:

* take $N$ timesteps with $a_t\sim\pi_\theta(a_t|s_t)$ and collect $(s_t, a_t, s'_t, r_t)$
* evaluate the return $R_t = \sum_{t'\geq t}\gamma^{t'-t}\, r(s_{t'}, a_{t'})$
* evaluate the value loss $L_v =\frac{1}{2}\sum_t\left(\hat{V}_\phi^\pi(s_t)-R_t\right)^2$
* evaluate the advantage $\hat{A}^\pi(s_t, a_t) = r(s_t, a_t) + \gamma\hat{V}_\phi^\pi(s_t')- \hat{V}_\phi^\pi(s_t)$
* evaluate the policy loss $L_p= -\sum_t\left[\log\pi_\theta(a_t|s_t)\,\hat{A}^\pi(s_t,a_t)+\beta H(\pi_\theta(\cdot|s_t))\right]$
* combine the losses into $L = L_p + 0.5 \cdot L_v$ and calculate the gradient $\nabla_{\theta,\phi} L$
* update the parameters of the network $(\theta,\phi) \leftarrow (\theta,\phi)-\eta\nabla_{\theta,\phi} L$

Here $s_t$ is the state at timestep $t$, $a_t$ is the action at timestep $t$, $s'_t$ is the state that follows $s_t$, $\pi$ is the policy of the actor, $\theta$ are the parameters of the actor, $\hat{V}^\pi_\phi$ is the estimated state value, $\phi$ are the parameters of the critic, $H$ is the entropy of the action distribution, $\beta$ is the entropy coefficient, $\gamma$ is the discount factor and $\eta$ is the learning rate (`lr` in the code). The update is written as a plain gradient step for readability; RMSprop additionally rescales the gradient.
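
A small worked example may help with the return, advantage and value loss defined above. The sketch below uses plain Python and made-up numbers: `gamma = 0.5` is chosen only so the arithmetic is easy to follow (the notebook uses 0.99), the critic outputs in `values` are invented, and the helper names are hypothetical. In the notebook itself, `finish_episode` accumulates these one-step advantages with `tau = 1`.

```python
def discounted_returns(rewards, gamma, bootstrap=0.0):
    """R_t = r_t + gamma * R_{t+1}, starting from the bootstrap value."""
    returns, running = [], bootstrap
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]


def td_advantages(rewards, values, gamma):
    """A_t = r_t + gamma * V(s_{t+1}) - V(s_t); `values` has one extra entry."""
    return [r + gamma * values[t + 1] - values[t] for t, r in enumerate(rewards)]


gamma = 0.5                    # toy value so the arithmetic is easy to follow
rewards = [0.0, 0.0, 1.0]
values = [0.0, 0.5, 0.5, 0.0]  # made-up critic outputs V(s_0..s_3); s_3 is terminal

returns = discounted_returns(rewards, gamma)
advantages = td_advantages(rewards, values, gamma)
# zip stops after three pairs, so the terminal value is not part of the loss
value_loss = 0.5 * sum((v - R) ** 2 for v, R in zip(values, returns))

print(returns)     # [0.25, 0.5, 1.0]
print(advantages)  # [0.25, -0.25, 0.5]
print(value_loss)  # 0.15625
```

The second action gets a negative advantage even though no negative reward was received: the critic simply expected more from that state than what followed.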


In [None]:
def select_action(state, hxs, agent, model):
    obs_im, obs_other = features(state[agent])
    state_im = get_variable(torch.from_numpy(np.array(obs_im)).float())
    state_other = get_variable(torch.from_numpy(np.array(obs_other)).float())
    action_scores, state_value, hxs = model(state_im, state_other, hxs)
    probs = F.softmax(action_scores, dim=-1)
    m = Categorical(probs)
    action = m.sample().detach()
    log_prob = m.log_prob(action)
    entropy = m.entropy()

    model.log_prob.append(log_prob)
    model.values.append(state_value)
    model.entropies.append(entropy)

    return action.item(), hxs
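
The sampling and entropy steps in `select_action` can be illustrated without PyTorch. This is a simplified sketch using only the standard library: the action scores are made up, and `softmax` and `entropy` are stand-ins written for this example, not the `torch` functions used above.

```python
import math
import random


def softmax(scores):
    """Turn raw action scores into probabilities (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; largest when all actions are equally likely."""
    return -sum(p * math.log(p) for p in probs if p > 0)


uniform = softmax([0.0] * 6)                      # six equally scored actions
peaked = softmax([5.0, 0.0, 0.0, 0.0, 0.0, 0.0])  # one clearly preferred action

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = rng.choices(range(6), weights=peaked, k=1000)

print(round(entropy(uniform), 4))          # 1.7918, i.e. log(6)
print(entropy(peaked) < entropy(uniform))  # True
print(samples.count(0) / 1000)             # close to peaked[0]
```

A policy that has collapsed onto one action has low entropy, which is why the entropy bonus in the loss pushes the actor towards keeping the other actions alive during training.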

In [None]:
def finish_episode(done, state, hxs, agent, model, optimizer):
    entropies = model.entropies
    values = model.values
    rewards = model.rewards
    retain_graph = False
    log_probs = model.log_prob
    R = get_variable(torch.zeros(1, 1))
    if not done:
        retain_graph = True
        obs_im, obs_other = features(state[agent])
        state_im = get_variable(torch.from_numpy(np.array(obs_im)).float())
        state_other = get_variable(torch.from_numpy(np.array(obs_other)).float())
        _, state_value, hxs = model(state_im, state_other, hxs)
        R = state_value.detach()

    values.append(R)
    policy_loss = 0
    value_loss = 0
    gae = get_variable(torch.zeros(1, 1))

    for i in reversed(range(len(rewards))):
        R = gamma * R + rewards[i]
        if R > 1:
            R = 1
        error = R - values[i]
        value_loss = value_loss + 0.5*error.pow(2)

        # Generalized Advantage Estimation
        delta_t = rewards[i] + gamma * \
            values[i + 1] - values[i]
        gae = gae * gamma * tau + delta_t

        policy_loss = policy_loss - \
            log_probs[i] * gae.detach() - entropy_coef * entropies[i]
    optimizer.zero_grad()
    (policy_loss + value_loss_coef * value_loss).backward(retain_graph=retain_graph)
    torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)

    optimizer.step()
    del model.rewards[:]
    del model.entropies[:]
    del model.values[:]
    del model.log_prob[:]
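
The gradient clipping in `finish_episode` limits the joint norm of all gradients to 0.5. The following sketch shows the idea on a flat list of numbers. It is a simplification written for this explanation: `clip_by_global_norm` is a hypothetical helper, it works on plain floats rather than tensors, and it leaves out the small constant that PyTorch adds to the denominator.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Scale all gradients by one common factor so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm


grads = [3.0, 4.0]  # joint norm is 5.0
clipped, norm_before = clip_by_global_norm(grads, max_norm=0.5)

print(norm_before)  # 5.0
print(clipped)      # same direction as before, length shrunk to 0.5
```

Because every component is scaled by the same factor, the update direction is preserved and only the step length is limited.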

The model was trained with agents in the top left and bottom right corners. A separate network was trained for the agent in each of these corners. To encourage cooperation and a more aggressive style of play, the rewards described in the table below are used.
 


Scenario         | Reward
-----------------|-------
Lays a bomb      | 0.007
Gets a power-up  | 0.8
Wins the game    | 1
Draws the game   | -1
Dies             | -1
Teammate dies    | -0.5
Teammate visible | 0.008

Due to GPU memory constraints, only one agent is trained in each game.
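
The reward table can be read as a small shaping function applied on top of the environment reward. The sketch below is a simplified, hypothetical restatement of that table, not the reward code used in `main()`: the function `shaped_reward` and its keyword arguments are invented for illustration, and details such as the order in which the bonuses are applied are assumptions.

```python
# Bonus and penalty values copied from the reward table.
BOMB_BONUS = 0.007
POWERUP_BONUS = 0.8
TEAMMATE_VISIBLE_BONUS = 0.008
TEAMMATE_DIED_PENALTY = -0.5
DEATH_PENALTY = -1.0


def shaped_reward(env_reward, *, died=False, teammate_died=False,
                  laid_bomb=False, powerups_collected=0, teammate_visible=False):
    """Hypothetical helper combining the environment reward with the shaping terms."""
    if died:
        return DEATH_PENALTY  # assumption: dying overrides every other term
    reward = TEAMMATE_DIED_PENALTY if teammate_died else env_reward
    if teammate_visible:
        reward += TEAMMATE_VISIBLE_BONUS
    if laid_bomb:
        reward += BOMB_BONUS
    reward += POWERUP_BONUS * powerups_collected
    return reward


print(shaped_reward(0.0, laid_bomb=True))                               # 0.007
print(shaped_reward(0.0, powerups_collected=1, teammate_visible=True))  # roughly 0.808
print(shaped_reward(0.0, died=True, laid_bomb=True))                    # -1.0
```

The bonuses for laying bombs and seeing the teammate are kept two orders of magnitude below the win and loss rewards, so they nudge behavior without outweighing the outcome of the match.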

In [None]:
def main():
    win = []
    eps = np.finfo(np.float32).eps.item()
    number_of_bombs = 0
    running_reward = 10
    i_episode = 0
    seen_positions = []
    power_ups = 0
    agent = 2
    team_agent = 0
    for _ in range(200):
        if agent == 2:
            agent = 0
            team_agent = 2
        else:
            agent = 2
            team_agent = 0
        model = PommermanAgent(mode="test", model="saved_models/policy_network_team_agent"+str(agent)).net
        optimizer = optim.RMSprop(model.parameters(), lr=lr, eps=eps, alpha=alpha)
        model.train()
        for _ in range(1000):
            total_steps = 0
            state = env.reset()
            old_blast_strength = state[agent]['blast_strength']
            old_ammo = state[agent]['ammo']
            old_can_kick = state[agent]['can_kick']
            ammo = 1
            hxs = model.init_hidden()
            model.zero_grad()
            alive0 = 1
            alive2 = 1
            for t in range(10000):  # Don't infinite loop while learning
                for steps in range(num_steps):
                    total_steps += 1
                    actions = env.act(state)
                    actions[agent], hxs = select_action(state, hxs, agent, model)
                    state, reward, done, _ = env.step(actions)
                    # region rewardfunction
                    if alive0 and pommerman.constants.Item.Agent0.value not in state[team_agent]['alive']:
                        if agent == 0:
                            reward[agent] = -1
                            done = 1
                        elif agent == 2:
                            reward[agent] = -0.5
                    if alive2 and pommerman.constants.Item.Agent2.value not in state[agent]['alive']:
                        if agent == 0:
                            reward[agent] = -0.5
                        elif agent == 2:
                            reward[agent] = -1
                            done = 1
                    if state[agent]['teammate'].value in state[agent]['board']:
                        reward[agent] += 0.008
                    if state[agent]['position'] not in seen_positions:
                        seen_positions.append(state[agent]['position'])
                    if actions[agent] == 5 and reward[agent] != -1 and ammo != 0:
                        reward[agent] += 0.007
                        number_of_bombs += 1
                    if state[agent]['blast_strength'] > old_blast_strength:
                        reward[agent] += 0.8
                        power_ups += 1
                    if state[agent]['ammo'] > old_ammo:
                        reward[agent] += 0.8
                        power_ups += 1
                    if state[agent]['can_kick'] > old_can_kick:
                        reward[agent] += 0.8
                        power_ups += 1
                    ammo = state[agent]['ammo'] != 0
                    old_blast_strength = max(state[agent]['blast_strength'], old_blast_strength)
                    old_ammo = max(state[agent]['ammo'], old_ammo)
                    old_can_kick = max(state[agent]['can_kick'], old_can_kick)
                    alive0 = pommerman.constants.Item.Agent0.value in state[team_agent]['alive']
                    alive2 = pommerman.constants.Item.Agent2.value in state[agent]['alive']
                    # endregion
                    model.rewards.append(get_variable(torch.from_numpy(np.array(reward[agent])).float()))
                    if done or reward[agent] == -1:
                        win.append(reward[agent] >= 1)
                        seen_positions = []
                        done = 1
                        break
                finish_episode(done, state, hxs, agent, model, optimizer)
                if done: 
                    break
            running_reward = running_reward * 0.99 + total_steps * 0.01
            if i_episode % log_interval == 0:
                print('Episode {}\tPower ups per match: {:.2f}\tAverage length: {:.2f}\tWin percentage: {:.2f}\tBombs per match: {:.2f}'.format(
                    i_episode, power_ups/log_interval, running_reward, np.mean(win), number_of_bombs/log_interval))
                win = []
                power_ups = 0
                number_of_bombs = 0
            i_episode += 1
        torch.save(model.state_dict(), "saved_models/policy_network_team_agent"+str(agent))

Below, it is possible to run the agents in the environment. The agents have been trained for 100 000 matches each. Keep in mind that performance varies considerably between matches, since the agents still need more training. A weakness of the AI agents is their unwillingness to move away from their starting corner. This is most likely due to overfitting and could be addressed in several ways. One solution could be to assign a random starting position for each game, forcing the agent to adapt to a new environment every game. A consequence of the agents' reluctance to explore is that they rarely see each other (the agents' vision is limited). This is probably the cause of the lack of cooperation between the two AI agents.

In [None]:

# Print all possible environments in the Pommerman registry
print(pommerman.REGISTRY)


# Create a set of agents (exactly four)
agent_list = [
PommermanAgent(mode="test", model="saved_models/policy_network_team_agent0"),
agents.SimpleAgent(),
PommermanAgent(mode="test", model="saved_models/policy_network_team_agent2"),
agents.SimpleAgent(),
]

# Make the Team environment using the agent list

env = pommerman.make('PommeTeamCompetition-v0', agent_list)

# Run the episodes just like OpenAI Gym

state = env.reset()
num_no_mov_agent1 = 0
num_no_mov_agent3 = 0
done = False
while not done:
    env.render()
    actions = env.act(state)
    # An inelegant way to keep agents 1 and 3 from getting stuck:
    # after 20 steps without movement, force a random non-bomb action.
    old_position_agent1 = state[1]['position']
    old_position_agent3 = state[3]['position']
    if num_no_mov_agent1 > 20:
        actions[1] = random.choice(range(0, 5))
    if num_no_mov_agent3 > 20:
        actions[3] = random.choice(range(0, 5))
    state, reward, done, info = env.step(actions)
    if abs(state[1]['position'][0]-old_position_agent1[0])+abs(state[1]['position'][1]-old_position_agent1[1]) == 0:
        num_no_mov_agent1 += 1
    else:
        num_no_mov_agent1 = 0
    if abs(state[3]['position'][0]-old_position_agent3[0])+abs(state[3]['position'][1]-old_position_agent3[1]) == 0:
        num_no_mov_agent3 += 1
    else:
        num_no_mov_agent3 = 0

print(info)
env.close()

Run the cell below to continue training the agents from the models saved after 100 000 matches. The models are automatically saved in the `saved_models` folder every 1000 matches.

In [None]:
try:
    main()
    print('Done')
except KeyboardInterrupt:
    print('Keyboard interrupt')

## Future implementations

One of the biggest challenges we faced in this project was the immense training time required by A2C. Optimizing the training code should therefore be considered in future work. Parallelizing the environment during training, and possibly running it on a cluster, is an obvious next step that would greatly reduce the training time.
One approach to reducing the overfitting of an agent to its starting corner is to randomize the starting position of each agent. This would force the AI agent to learn how to cope with a different environment every game. It could also provide rewards for behavior outside the agent's starting area and so increase the likelihood of the agent leaving that area.
To further improve the performance of the AI agents, they should be trained against opponents other than the simple agents. Training the AI agents against other AI agents could force the agents to learn how to cope with different tactical styles, and would hopefully keep their playing style from overfitting to the simple agents. However, this was attempted and resulted in very slow training because the agents did not leave their corners, so that problem should be addressed first.
If we had a better computer and more time, we would have implemented Actor-Critic using Kronecker-Factored Trust Region (ACKTR), as it has been shown to learn more quickly than A2C, especially when using large batch sizes.