# DroneLeader Architecture

We envision a multi-agent organization whereby:

* A Strategist directs multiple Teams through Tasks/Objectives.

* Each Team is made up of agents of different types (drones and crawlers) and roles (leaders and followers).

* A Team uses its Culture to shape its agents' behaviors by doling out behavioral rewards during training. It uses its Mission to help its agents learn abilities to accomplish individual or group objectives by doling out mission rewards during training.

* In this way, a Strategist that is optimized for strategic decision making can analyze the game space and direct multiple Teams to accomplish more complex and strategic tasks that require more than behavioral skills.

However, we have not been able to reliably train drone leaders with simple 2-layer FC-Softmax policies to minimize the delta between the drone's coordinate and the target coordinate of maximum favorability given by the Strategist.

We will implement several different drone leaders to find out which is the most reliable:

- DroneLeader_FC32: a 2-layer FC-Softmax (32 hidden units) policy with the deltas (between drone and target coordinates) as input
- DroneLeader_FC64: a 2-layer FC-Softmax (64 hidden units) policy with the deltas (between drone and target coordinates) as input
- DroneLeader_CNN1: a CNN policy with a 1-frame game space of drone and goal locations as input

In [1]:
import os
import random
import time
import platform
import torch
import torch.optim as optim
import gym
import numpy as np
import pickle

# This is the Crossing game environment
from xteams_env import CrossingEnv
from xteams_model import *
from interface import *

import matplotlib.pyplot as plt

%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

print("Python version: ", platform.python_version())
print("Pytorch version: {}".format(torch.__version__))
print("OpenAI Gym version: {}".format(gym.__version__))

Python version:  3.6.8
Pytorch version: 1.0.1.post2
OpenAI Gym version: 0.9.2


## Strategist Class

(Wikipedia) A strategist is responsible for the formulation and implementation of a strategy. Strategy generally involves setting goals, determining actions to achieve the goals, and mobilizing resources to execute the actions. It describes how the ends (goals) will be achieved by the means (resources).

An agent belonging to the Strategist class performs the following:

(1) It accepts and abdicates responsibilities for directing teams of agents

(2) It receives game space and metrics from the Environment

(3) It analyzes the game space and metrics to arrive at a "strategic position" for its teams. e.g. a topological map and/or a set of game stats

(4) Based on the strategic position, it decides on a set of goals that need to be accomplished

(5) It surveys its teams of agents and their locations in the game space

(6) For each goal, it picks the best team and assigns it the goal

(7) If necessary, it reorganizes the teams and the agents

(8) It measures the effectiveness of the teams in accomplishing the assigned goals, and whether the "strategic position" has improved for its teams


In [2]:
import os
import random
import time
import platform
import torch
import gym
import numpy as np
from collections import deque
from torch.autograd import Variable

class Strategist():
    
    teams = []
    eyes = []  # Each team has a drone agent that serves as an eye for the strategist
    game_spaces = []
    game_metrics = []

    
    def __init__(self):
        super(Strategist, self).__init__()
        
        # Teams parameters
        self.teams = []
        self.eyes = []
        
        # Teams' game spaces  
        self.game_spaces = []
        
        # Teams' game metrics
        self.game_metrics = []
        
        
        # zone parameters

        # episode history

        return
    
    # This method accepts directorship of a team of agents, but only if the team has a drone agent
    # that can act as eye for the strategist.
    def accept(self, team):
        
        eye_found = False
        
        # Look for drone agent in team
        for agent in team.members:
            if agent.type == "drone":  # compare by value; `is` tests object identity
                self.eyes.append(agent)  # assign agent as team eye
                eye_found = True
                break
        
        # Only accept directorship of a team if there is a team eye
        if eye_found:
            self.teams.append(team)  
        else:
            raise Exception('Cannot accept team directorship! Team {} has no drone.'.format(team.name))
            
        return
    
    # This method abdicates directorship of a team of agents
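    def abdicate(self, team):
        try:
            idx = self.teams.index(team)
        except ValueError:
            print("Cannot abdicate team directorship! Team {} is not under strategist's direction.".format(team.name))
            return
        # Remove the team together with its eye so that self.teams and self.eyes stay aligned
        del self.teams[idx]
        del self.eyes[idx]
        return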

    
    # This method generates a favorability topological map from the game space 
    def _topology(self, game_space):
        
        space = game_space.numpy()
        _,_,x,y = space.shape
        
        topology = np.zeros((x,y))
        
        # Generate favorability topology based on food units in 5x5 target zone
        for ix,iy in np.ndindex(x,y):
            topology[ix,iy] = np.sum(space[0,0,ix:ix+5, iy:iy+5])

        return topology


    # This "black box" method generates a set of goals after analyzing the game space and metrics
    def generate_goals(self, game_space):
        
        # Create a topology of favorability
        topology = self._topology(game_space)
        
        # Find the coordinate of highest favorability
        i,j = np.unravel_index(topology.argmax(), topology.shape)
        
        goals = [(i,j)]  # The goal is to move a team to the coordinate of highest favorability
        
        return goals, topology
    
        
    def _assign_goal(self, goal, team):
        
        # TBD
        
        return  
    
    # This method flushes the strategist's history at the end of a game episode
    def clear_history(self):
        
        return

    # This method resets strategist by abdicating all team directorships
    def reset(self):
        # Abdicate directorship for all teams
        self.teams = []
        self.eyes = []
        self.game_spaces = []
        self.game_metrics = []
        
        return


# Train Team directed by Strategist

For now, a strategist can only direct 1 team with a drone agent, which acts as the "eye" for the strategist. The strategist accesses the game space through the complete obs space of the drone agent.

The code below runs training on 2 teams of 5 agents each. Both teams, the Vikings and the Franks, have Pacifist cultures, so they are unaggressive (they do not fire their lasers). The Vikings have a drone leader and a strategist. The Franks do not.

Our strategist is able to take in the game space provided by its eye and output a goal in the form of a coordinate. 
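
To make the goal-generation step concrete, here is a minimal standalone sketch of the same windowed-sum idea used by `Strategist._topology` and `generate_goals`. The function name `favorability_map` and the 8x10 food layout are made up for illustration; the real game space comes from the eye's observation.

```python
import numpy as np

def favorability_map(food, window=5):
    """Sum the food units in the window x window zone anchored at each cell."""
    rows, cols = food.shape
    topology = np.zeros((rows, cols))
    for r, c in np.ndindex(rows, cols):
        # Slices are clipped at the map edge, so border windows are smaller
        topology[r, c] = food[r:r + window, c:c + window].sum()
    return topology

# Hypothetical 8x10 food layer with two piles (not taken from a real map)
food = np.zeros((8, 10))
food[1:3, 1:3] = 1      # small pile: 4 food units
food[4:7, 5:9] = 1      # large pile: 12 food units

topology = favorability_map(food)
goal = np.unravel_index(topology.argmax(), topology.shape)
print(goal, topology[goal])
```

In this sketch the goal is the anchor (top-left) cell of the best window, not the center of the food pile, because each window extends from `r` to `r + window - 1`. When several windows tie, `argmax` returns the first one in row-major order.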

The Team class must now take this goal ("move the team to this coordinate") and generate the mission reward such that its leader agent learns to move to that coordinate, thus taking many of its followers along in its target zone.
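
One possible way to turn such a goal into a mission reward is sketched below. This is an assumption for illustration only, not the Team class's actual reward logic: the helper names `manhattan` and `mission_rewards` and the `step_reward` value are hypothetical. Each step is rewarded when it brings the leader closer to the goal and penalized when it moves the leader away.

```python
def manhattan(a, b):
    """Manhattan distance between two (x, y) coordinates."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def mission_rewards(locations, goal, step_reward=0.5):
    """Return one reward per step, based on the change in distance to the goal."""
    rewards = []
    for prev, curr in zip(locations, locations[1:]):
        before = manhattan(prev, goal)
        after = manhattan(curr, goal)
        if after < before:
            rewards.append(step_reward)     # moved closer to the goal
        elif after > before:
            rewards.append(-step_reward)    # moved away from the goal
        else:
            rewards.append(0.0)             # no progress either way
    return rewards

# Hypothetical goal and leader trajectory
goal = (10, 9)
path = [(3, 9), (4, 9), (4, 8), (4, 8), (5, 8)]
print(mission_rewards(path, goal))
```

A per-step signal like this is denser than a single reward for reaching the goal, which may matter for a small policy trained with policy gradients.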


## DroneLeader_FC32

DroneLeader_FC32 is a 2-layer fully connected network with 32 hidden units. It takes as input the normalized deltas between the drone leader's coordinate and the target coordinate of maximum favorability, and outputs a probability distribution over actions.
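
The training code calls `calc_norm_deltas` from the `interface` module, whose implementation is not shown in this notebook. The sketch below is only a guess at what such a normalization could look like: the function name `norm_deltas`, the `map_size` argument, and the choice to divide by the map dimensions are all assumptions.

```python
def norm_deltas(goal, location, map_size):
    """Return [dx, dy] from location to goal, each scaled by the map dimension."""
    width, height = map_size
    dx = (goal[0] - location[0]) / width    # stays within [-1, 1] for in-map points
    dy = (goal[1] - location[1]) / height
    return [dx, dy]

# Hypothetical goal, drone location, and map size
print(norm_deltas(goal=(30, 5), location=(3, 9), map_size=(60, 20)))
```

Scaling the inputs to a small range keeps the two features comparable regardless of the map's aspect ratio.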

In [3]:
num_drone_actions = 12
num_goal_params = 2

drone_leader = DroneLeader_FC32(num_goal_params, num_drone_actions, 0)
print (drone_leader)

batch_size = 1
x = torch.randn(batch_size, 2)
output = drone_leader(x)
print(output)

drone_leader = DroneLeader_FC64(num_goal_params, num_drone_actions, 0)
print (drone_leader)

batch_size = 1
x = torch.randn(batch_size, 2)
output = drone_leader(x)
print(output)

DroneLeader_FC32(
  (features): Sequential(
    (0): Linear(in_features=2, out_features=32, bias=True)
    (1): ReLU(inplace)
  )
  (action_head): Linear(in_features=32, out_features=12, bias=True)
)
tensor([[0.0842, 0.0947, 0.0897, 0.0691, 0.1018, 0.1100, 0.0657, 0.0907, 0.0607,
         0.0625, 0.1026, 0.0684]], grad_fn=<SoftmaxBackward>)
DroneLeader_FC64(
  (features): Sequential(
    (0): Linear(in_features=2, out_features=64, bias=True)
    (1): ReLU(inplace)
  )
  (action_head): Linear(in_features=64, out_features=12, bias=True)
)
tensor([[0.0751, 0.0894, 0.0735, 0.0506, 0.0774, 0.0763, 0.0662, 0.0970, 0.1251,
         0.0728, 0.1066, 0.0899]], grad_fn=<SoftmaxBackward>)


## Basic Training (1 Team with 1 Drone Leader)


In [6]:
import sys
from collections import deque
from torch.autograd import Variable

# Initialize environment
game = "Crossing"
num_crawler_actions = 8                     # Crawlers are capable of 8 actions
num_drone_actions = 12                      # Drones are capable of 12 actions
num_goal_params = 2    # Goal has 2 parameters

experiment = '1T-1L/strategist/'    # 1 team of 1 drone leader directed by a strategist

# Map and Parameter sets
map_name = "food_d37_river_w1_d25"  
parameters =[ 
            {'temp_start':2.0, 'river_penalty':-1.0, 'target_reward':0.5, \
             'game_steps':300, 'seed': 0},
            {'temp_start':2.0, 'river_penalty':-1.0, 'target_reward':0.5, \
             'game_steps':300, 'seed': 7},
            ]

temp_end = 1.0   # temp parameter is annealed from the value stored in parameters['temp_start'] to 1.0 

# Initialize training parameters
warm_start = False
num_frames = 7      # environ observation consists of a list of stacked frames per agent
max_episodes = 2000

render = True    # This turns on rendering every save so that agents' behavior can be observed
SPEED = 1/30
second_pile_x = 50  # x-coordinate of the 2nd food pile

log_interval = 10
save_interval = 20

# These trainer parameters work for Atari Breakout
gamma = 0.99  
lr = 1e-2

# Initialize agents parameters
#   1 agent - 1 learning agent, 0 trained agents, 0 random agents
num_learners = 1
num_trained = 0
num_rdn = 0

num_statics = num_trained + num_rdn
num_agents = num_learners + num_statics  

# The main code starts here!!!

for parameter in parameters:   # Go down the list of parameter sets
    
    start = time.clock()  # time the training
    
    torch.manual_seed(parameter['seed'])
    situation = 'droneleaderfc32_seed_' + str(parameter['seed'])  # seed is an int; convert before concatenating
    temp_start = parameter['temp_start']
    river_penalty = parameter['river_penalty']
    max_frames = parameter['game_steps']
    
    # Set up parameters of agents and teams as inputs into CrossingEnv
    teams_params = [
        {'name': 'Vikings', 'color': 'deepskyblue', 
         'culture': {'name':'pacifist_leadfollow','laser_penalty':-1.0,'target_reward':parameter['target_reward']},
         'roles': ['leader','follower'],
         'target_zone': None, 'banned_zone': None},
    ]
    agents_params = [
        {'id': 0, 'team': 'Vikings', 'color': 'royalblue', 'type': 'drone',    \
         'role': 'leader', 'start': (3,9)},
    ]

    # Data structure for agents
    agents = []
    actions = []
    log_probs = []
    tags = []
    rewards = []
    deltas = []   # 6-2-2019 delta coordinates
    optimizers = []

    # Cold start
    if warm_start is False:
   
        # Initialize learner agents, then load static agents (trained followed by random)
        for i in range(num_learners):
            
            print("Learner agent {}".format(i))
            
            # Initialize agent policy based on type
            if agents_params[i]['type'] == 'crawler':
                agents.append(Crawler_Policy(num_frames, num_crawler_actions, i))
            elif agents_params[i]['type'] == 'drone' and agents_params[i]['role'] == 'follower':
                agents.append(Drone_Policy(num_frames, num_drone_actions, i)) 
            elif agents_params[i]['type'] == 'drone' and agents_params[i]['role'] == 'leader':
                print("Load Drone Leader.")
                agents.append(DroneLeader_FC32(num_goal_params, num_drone_actions, i)) 
            else:
                raise Exception('Unexpected agent type: {}'.format(agents_params[i]['type']))
            
            optimizers.append(optim.Adam(agents[i].parameters(), lr=lr))
        
            # set up optimizer - this works for Atari Breakout
            # optimizers.append(optim.RMSprop(agents[i].parameters(), lr=lr, weight_decay=0.1)) 
        
        for i in range(num_learners, num_learners+num_trained):
            # A bare `raise` has no active exception to re-raise, so raise explicitly
            raise NotImplementedError("Learning with trained agents - not implemented yet!")
            """
            Disable for now! No need to train with trained agents.
            agents.append(Crawler_Policy(num_frames, num_crawler_actions, i))
            agents[i].load_weights()         # load weight for static agent        
            """
        for i in range(num_learners+num_trained, num_agents):
            print("Load random agent {}".format(i))
            agents.append(Rdn_Policy())

    
        # Initialize all agent data
        actions = [0 for i in range(num_agents)]
        log_probs = [0 for i in range(num_agents)]
        tags = [0 for i in range(num_agents)]
        rewards = [0 for i in range(num_agents)]
        deltas = [0 for i in range(num_agents)]

        # Keep track of rewards learned by learners
        episode_reward = [0 for i in range(num_learners)]   # reward for an episode
        running_reward = [None for i in range(num_learners)]   # running average
        running_rewards = [[] for i in range(num_learners)]   # history of running averages
        best_reward = [0 for i in range(num_learners)]    # best running average (for storing best_model)
        
        # 6-2-2019 Keep track of distance from goal achieved by droneleader
        episode_delta = 0   # distance from goal for an episode
        running_delta = None   # running distance from goal
        running_deltas = []    # history of running distance from goal
        best_delta = 0    # best running distance from goal (for storing best_model)        
        
        # Keep track of the number of learners who have crossed over to the 2nd food pile
        crossed = [0 for i in range(num_learners)]      # whether an agent has crossed to the 2nd food pile  
        episode_crossed = 0                             # num learners who have crossed for an episode
        running_crossed = None         # running average
        running_crossed_hist = []   # history of running averages

        # This is to support warm start for training
        prior_eps = 0

    # Warm start
    if warm_start:
        # A bare `raise` has no active exception to re-raise, so raise explicitly
        raise NotImplementedError("Cannot warm start")
    
        """
        # Disable for now!  Need to ensure model can support training on GPU and game playing
        # on both CPU and GPU.
    
        data_file = 'results/{}.p'.format(game)

        try:
            with open(data_file, 'rb') as f:
                running_rewards = pickle.load(f)
                running_reward = running_rewards[-1]

            prior_eps = len(running_rewards)

            model_file = 'saved_models/actor_critic_{}_ep_{}.p'.format(game, prior_eps)
            with open(model_file, 'rb') as f:
                # Model Save and Load Update: Include both model and optim parameters
                saved_model = pickle.load(f)
                model, optimizer = saved_model

        except OSError:
            print('Saved file not found. Creating new cold start model.')
            model = Crawler_Policy(input_channels=num_frames, num_actions=num_crawler_actions)
            optimizer = optim.RMSprop(model.parameters(), lr=lr,
                                      weight_decay=0.1)
            running_rewards = []
            prior_eps = 0
        """
    # Attach agents to their teams
    # 4-28-2019 Add roles and types to enable multi-role teams

    teams = []
    # Team Vikings
    teams.append(Team(name=teams_params[0]['name'],color=teams_params[0]['color'], \
                  culture=teams_params[0]['culture'], roles=teams_params[0]['roles'], \
                  agent_policies=[agents[0]], \
                  agent_roles = [agent['role'] for agent in agents_params[0:1]]))
    
    # 5-30-2019  Strategist accepts directorship of a team
    suntzu = Strategist()
    suntzu.accept(teams[0])   # Strategist accepts directorship of Team Viking
    
    env = CrossingEnv(agents=agents_params, teams=teams_params, \
                  map_name=map_name, river_penalty=river_penalty,  \
                  debug_window = False)   
    
    cuda = torch.cuda.is_available()

    if cuda:
        for i in range(num_learners):    # Learning agents need to utilize GPU
            agents[i].cuda()

        
    for ep in range(max_episodes):
    
        print('.', end='')  # To show progress
    
        # Anneal temperature from temp_start to temp_end
        for i in range(num_learners):    # For learning agents
            agents[i].temperature = max(temp_end, temp_start - (temp_start - temp_end) * (ep / max_episodes))

        env_obs = env.reset()  # Env return observations

        # For Debug only
        # print (len(env_obs))
        # print (env_obs[0].shape)
    
        # Unpack observations into data structure compatible with Crawler_Policy
        agents_obs = unpack_env_obs(env_obs)
        
        # 5-30-2019 Strategist uses the obs space of its team eye as the big picture
        game_space = agents_obs[suntzu.eyes[0].idx]
        goals, topology = suntzu.generate_goals(game_space)
        deltas = calc_norm_deltas(goals[0], env.agent_locations[0])
        agents[0].deltas.append(deltas)   # Store a history of deltas for generating mission rewards

        for i in range(num_learners):    # Reset agent info - laser tag statistics
            agents[i].reset_info()   

        # For Debug only
        # print (len(agents_obs))
        # print (agents_obs[0].shape)
    
        """
        For now, we do not stack observations, and we do not implement LSTM
    
        state = np.stack([state]*num_frames)

        # LSTM change - reset LSTM hidden units when episode begins
        cx = Variable(torch.zeros(1, 256))
        hx = Variable(torch.zeros(1, 256))
        if cuda:
            cx = cx.cuda()
            hx = hx.cuda()
        """

        # Initialize reward and agents crossed counters
        episode_reward = [0 for i in range(num_learners)]   # reward for an episode
        episode_delta = 0                               # distance from goal for an episode
        crossed = [0 for i in range(num_learners)]      # whether an agent has crossed to the 2nd food pile  
        episode_crossed = 0                             # num learners who has crossed for an episode
    
        for frame in range(max_frames):

            """
            For now, we do not implement LSTM
            # Select action
            # LSTM Change: Need to cycle hx and cx thru select_action
            action, log_prob, value, (hx,cx)  = select_action(model, state, (hx,cx), cuda)        
            """

            for i in range(num_learners):    # For learning agents
                if agents_params[i]['type'] == 'drone' and agents_params[i]['role'] == 'leader':
                    # 6-02-2019 Simple droneleaders do not require obs space as input
                    actions[i], log_probs[i] = select_action_strat_simple(agents[i], deltas, cuda)
                else:    
                    actions[i], log_probs[i] = select_action(agents[i], agents_obs[i], cuda)
                
                # Only crawlers can fire lasers
                if agents_params[i]['type'] == 'crawler':
                    if int(actions[i]) == 6:  # int() handles both a plain int and a 1-element tensor
                        tags[i] += 1   # record a tag for assessing aggressiveness
                        
                agents[i].saved_actions.append((log_probs[i]))
            
                # Do not implement LSTM for now
                # actions[i].saved_actions.append((log_prob, value))
            
            for i in range(num_learners, num_learners+num_trained):
                raise NotImplementedError("No trained agent exists yet!")
            for i in range(num_learners+num_trained, num_agents):   # For random agents
                actions[i] = agents[i].select_action(agents_obs[i])
                if int(actions[i]) == 6:
                    tags[i] += 1   # record a tag for assessing aggressiveness

            # For Debug only
            # if frame % 20 == 0:
            #    print (actions) 
            #    print (log_probs)
            
            # Perform step        
            env_obs, reward, done, info = env.step(actions)
        
            """
            For Debug only
            print (env_obs)
            print (reward)
            print (done) 
            """
       
            # Unpack observations into data structure compatible with Crawler_Policy
            agents_obs = unpack_env_obs(env_obs)
            
            load_info(agents, agents_params, info, narrate=False)   # Load agent info for AI agents
            
            # 5-30-2019 Strategist uses the obs space of its team eye as the big picture
            game_space = agents_obs[suntzu.eyes[0].idx]
            goals, topology = suntzu.generate_goals(game_space)
            deltas = calc_norm_deltas(goals[0], env.agent_locations[0])
            agents[0].deltas.append(deltas)   # Store a history of deltas for generating mission rewards

            # For learner agents only, generate reward statistics and reward stack for policy gradient
            for i in range(num_learners):
                agents[i].rewards.append(reward[i])  # Stack rewards (for policy gradient)
                episode_reward[i] += reward[i]   # accumulate episode reward 
            
            """
            For now, we do not stack observation, may come in handy later on
        
            # Evict oldest diff add new diff to state
            next_state = np.stack([next_state]*num_frames)
            next_state[1:, :, :] = state[:-1, :, :]
            state = next_state
            """
            
            if render and (ep % save_interval == 0):   # render 1 episode every save
                env.render()
                time.sleep(SPEED)  # Change speed of video rendering

            if any(done):
                print("Done after {} frames".format(frame))
                break

        # Keep track num of agents who gather from 2nd food pile. Note that env.consumption tracks the 
        # agent index and location of apple gathered
        for (i, loc) in env.consumption:
            if loc[0] > second_pile_x:   # If x-cood of gathered apple is beyond a preset value, it is
                                         # in the 2nd pile
                crossed[i] = 1
        episode_crossed = sum(crossed)   # sum up the num agents who crossed to 2nd pile for the episode
                
        # Update reward and crossed statistics for learners
        for i in range(num_learners):
            if running_reward[i] is None:
                running_reward[i] = episode_reward[i]
            running_reward[i] = running_reward[i] * 0.99 + episode_reward[i] * 0.01
            running_rewards[i].append(running_reward[i])
            
        if running_crossed is None:
            running_crossed = episode_crossed
        running_crossed = running_crossed * 0.99 + episode_crossed * 0.01
        running_crossed_hist.append(running_crossed)
        
        # 6-02-2019 Update distance from goal for droneleader
        target_x, target_y = goals[0]
        current_x, current_y = env.agent_locations[0]
        episode_delta = abs(target_x - current_x) + abs(target_y - current_y)
        
        if running_delta is None:
            running_delta = episode_delta
        running_delta = running_delta * 0.99 + episode_delta * 0.01
        running_deltas.append(running_delta)
        
                
        # Track Episode #, temp and highest frames/episode
        if (ep+prior_eps+1) % log_interval == 0: 
            verbose_str = '\nEpisode {} complete'.format(ep+prior_eps+1)
            # verbose_str += '\tTemp = {:.4}'.format(model.temperature)
            print(verbose_str)
    
            # Display rewards and running rewards for learning agents
            for i in range(num_learners):
                verbose_str = 'Learner:{}'.format(i)
                verbose_str += '\tReward total:{}'.format(episode_reward[i])
                verbose_str += '\tRunning mean: {:.4}'.format(running_reward[i])
                verbose_str += '\tNum agents crossed: {}'.format(episode_crossed)
                verbose_str += '\tRunning mean: {:.4}'.format(running_crossed)
                verbose_str += '\tDelta total:{}'.format(episode_delta)
                verbose_str += '\tRunning mean: {:.4}'.format(running_delta)
                print(verbose_str)
    
        # Update model
        total_norms = finish_episode(teams, agents[0:num_learners], optimizers[0:num_learners], gamma, cuda)

        if (ep+prior_eps+1) % log_interval == 0:
            print('Max Norms = ',["%0.2f" % i for i in total_norms])

        if (ep+prior_eps+1) % save_interval == 0: 
            for i in range(num_learners):
                model_dir = 'models/' + experiment + map_name
                results_dir = 'results/' + experiment + map_name

                model_file = model_dir+'/{}/t{}_rp{}_{}gs/MA{}_{}_ep{}.p'.format(situation, \
                        temp_start, river_penalty, max_frames, \
                        i, game, ep+prior_eps+1)
                data_file = results_dir+'/{}/t{}_rp{}_{}gs/MA{}_{}.p'.format(situation, \
                        temp_start, river_penalty, max_frames, \
                        i, game)

                os.makedirs(os.path.dirname(model_file), exist_ok=True)
                os.makedirs(os.path.dirname(data_file), exist_ok=True)
                
                with open(model_file, 'wb') as f:
                    # Model Save and Load Update: Include both model and optim parameters 
                    save_model(f, ep, agents[i], optimizers[i])

                with open(data_file, 'wb') as f:
                    pickle.dump(running_rewards[i], f)    
             
            crossed_file = results_dir+'/{}/t{}_rp{}_{}gs/Crossed.p'.format(situation, \
                        temp_start, river_penalty, max_frames)
            os.makedirs(os.path.dirname(crossed_file), exist_ok=True)
            with open(crossed_file, 'wb') as f:
                pickle.dump(running_crossed_hist, f)

            delta_file = results_dir+'/{}/t{}_rp{}_{}gs/Delta.p'.format(situation, \
                        temp_start, river_penalty, max_frames)
            os.makedirs(os.path.dirname(delta_file), exist_ok=True)
            with open(delta_file, 'wb') as f:
                pickle.dump(running_deltas, f)

    end = time.clock()
    print('\nTraining time: {:.2f} min'.format((end-start)/60.0))  # elapsed seconds / 60 = minutes
            
    env.close()  # Close the environment

Learner agent 0
Load Drone Leader.
....

KeyboardInterrupt: 
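
The training loop above anneals each learner's temperature linearly from `temp_start` to `temp_end` over `max_episodes`. The standalone sketch below reproduces that formula so the schedule can be inspected without running the environment. The function name `annealed_temperature` is made up for illustration; the default values mirror the parameter sets used in this notebook.

```python
def annealed_temperature(ep, max_episodes, temp_start=2.0, temp_end=1.0):
    """Linear decay from temp_start at episode 0 towards temp_end."""
    temp = temp_start - (temp_start - temp_end) * (ep / max_episodes)
    return max(temp_end, temp)   # never drop below temp_end

for ep in (0, 500, 1000, 1999):
    print(ep, annealed_temperature(ep, max_episodes=2000))
```

Because `ep` runs from 0 to `max_episodes - 1`, the temperature ends just above `temp_end` on the last episode. The `max` clamp only matters if training continues past `max_episodes`.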

## Advanced Training (1 Team with 1 Drone Leader + 10 Followers)

In [12]:
env.close()

In [7]:
torch.manual_seed(3)
print(torch.rand(2))

tensor([0.0043, 0.1056])
