# Cultural Parameter Optimization - Warlike

### **3T-9L: 3 Teams composed of 9 agents (3 agents per team)**

Teams with a Warlike culture require much longer training episodes, so we keep a separate notebook specifically for them.

In [1]:
import os
import random
import time
import pickle
import platform
import torch
import gym
import numpy as np

# This is the Gathering Game Environment based on Tribal Organization of agents
from tribes_env import GatheringEnv
from tribes_model import *

import matplotlib.pyplot as plt

%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

print("Python version: ", platform.python_version())
print("Pytorch version: {}".format(torch.__version__))
print("OpenAI Gym version: {}".format(gym.__version__))

Python version:  3.6.4
Pytorch version: 0.4.1.post2
OpenAI Gym version: 0.9.2


## Culture = Warlike

Define the version of finish_episode() specific to this culture.

In [2]:
def finish_episode(learners, optimizers, gamma, cuda):
    """ 
    Note that in RL, the policy gradient is calculated at the end of an episode and only then used to 
    update the weights of an agent's policy. This is very different from image recognition.
    
    The code performs a policy update on each learning agent independently. The reward for each time 
    step is stored in the list policy.rewards[] --> r(t)
    """  
    
    num_learners = len(learners)
    total_norms = [0 for i in range(num_learners)]
    policy_losses = [[] for i in range(num_learners)]
    losses = [[] for i in range(num_learners)]
    T_reward = []

   
    for i in range(num_learners):

        R = 0
        saved_actions = learners[i].saved_actions
        
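        # `tribes` is the module-level list built in the main training cell
        for t in tribes:
            # Compare tribe names by value; `is` tests object identity and is unreliable for strings
            if t.name == learners[i].tribe:
                T_reward = t.tribal_awards(US_hits = learners[i].US_hits,THEM_hits = learners[i].THEM_hits)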
 
                # For debug only
                # print('Agent{} receives tribal award from Tribe{}'.format(i,t.name))
                # print (T_reward)
                # print (learners[i].rewards)
                
        # Do not implement actor-critic for now
        # value_losses = []
        
        rewards = deque()

        for r,T in zip(learners[i].rewards[::-1],T_reward[::-1]):
            # The agent is incentivized to cooperate by an award of 30% of what the tribe takes
            # in by all its members
            R = r + T + gamma * R
            rewards.appendleft(R)
            
        rewards = list(rewards)
        rewards = torch.Tensor(rewards)
        if cuda:
            rewards = rewards.cuda()

        # z-score rewards
        rewards = (rewards - rewards.mean()) / (1.1e-7+rewards.std())
        
        #Debug     
        #print (rewards)       
        
        """
        Do not implement actor-critic for now!!!
        for (log_prob, state_value), r in zip(saved_actions, rewards):
            reward = r - state_value.data[0]
            policy_losses.append(-log_prob * Variable(reward))
            r = torch.Tensor([r])
            if cuda:
                r = r.cuda()
            value_losses.append(torch.nn.functional.smooth_l1_loss(state_value,
                                                               Variable(r)))

        optimizer.zero_grad()
        loss = torch.stack(policy_losses).sum() + torch.stack(value_losses).sum()
        loss.backward()        
        
        
        """
        for log_prob, r in zip(saved_actions, rewards):
            r = torch.Tensor([r])
            if cuda:
                r = r.cuda()
            policy_losses[i].append(-log_prob * Variable(r))

        optimizers[i].zero_grad()
        losses[i] = torch.stack(policy_losses[i]).sum()
        losses[i].backward()
        
        # Gradient Clipping Update: prevent exploding gradient
        total_norms[i] = torch.nn.utils.clip_grad_norm_(learners[i].parameters(), 8000)
        
        optimizers[i].step()
        learners[i].clear_history()   # clear an agent's history at the end of episode


    return total_norms
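
The return that finish_episode() accumulates for each learner can be illustrated in isolation. The following is a minimal sketch, not the notebook's implementation: it uses plain Python lists instead of torch tensors, and the helper names `discounted_returns` and `z_score` are introduced here purely for illustration. It assumes the per-step tribal awards are already available as a list of the same length as the agent's own rewards.

```python
from collections import deque
from statistics import mean, stdev


def discounted_returns(rewards, tribal_rewards, gamma):
    """Walk backwards through the episode: R_t = r_t + T_t + gamma * R_{t+1}."""
    returns = deque()
    R = 0.0
    for r, T in zip(reversed(rewards), reversed(tribal_rewards)):
        R = r + T + gamma * R
        returns.appendleft(R)   # keep the returns in chronological order
    return list(returns)


def z_score(values, eps=1.1e-7):
    """Normalize to zero mean and unit (sample) standard deviation."""
    mu = mean(values)
    sigma = stdev(values)       # sample std; needs at least two values
    return [(v - mu) / (eps + sigma) for v in values]


# Hypothetical three-step episode: own rewards plus tribal awards
returns = discounted_returns([1, 0, 2], [0.5, 0, 0], gamma=0.5)
normalized = z_score(returns)
print(returns)
print(normalized)
```

The small `eps` in the denominator plays the same role as the `1.1e-7` term in finish_episode(): it avoids a division by zero when every return in the episode is identical.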

## Special Case - Warlike Culture

We found that when each training episode lasts only 300 game steps, agents with the Warlike culture spend the first few hundred steps in mutually assured destruction: a field of deadly crossfire from which they cannot escape.

So we lengthen each episode to 1000 game steps instead.

In [3]:
import gym
import sys
import pickle
from collections import deque

import torch
import torch.optim as optim
from torch.autograd import Variable
import numpy as np

# Initialize agents parameters
#   9 agents - 9 learning agents, 0 trained agent, 0 random agent
num_learners = 9
num_trained = 0
num_rdn = 0
num_statics = num_trained + num_rdn
num_agents = num_learners + num_statics  

# Initialize environment
game = "Gather"
num_actions = 8                       # 8 actions in Gathering

# Initialize training parameters
warm_start = False
num_frames = 4      # environ observation consists of a list of 4 stacked frames per agent
max_episodes = 5000
max_frames = 1000
max_frames_ep = 0   # track highest number of frames an episode can last

# These trainer parameters work for Atari Breakout
gamma = 0.99  
lr = 1e-3
temp_start = 1.8  # Temperature for explore/exploit
temp_end = 1.0
log_interval = 200
save_interval = 500


def unpack_env_obs(env_obs):
    """
    Gathering is a partially-observable Markov Game. env_obs returned by GatheringEnv is a numpy 
    array of dimension (num_agent, 800), which represents the agents' observations of the game.

    The 800 elements (view_box) encode 4 layers of 10x20-pixel frames in the format:
    (viewbox_width, viewbox_depth, 4).
    
    This code reshapes the above into stacked frames that can be accepted by the Policy class:
    (batch_idx, in_channel, width, height)
    
    """
    
    num_agents = len(env_obs)  # environ observations is a list of agents' observations
    
    obs = []
    for i in range(num_agents):
        x = env_obs[i]   # take the indexed agent's observation
        x = torch.Tensor(x)   # Convert to tensor
        
        # Policy is a 3-layer CNN
        x = x.view(1, 10, 20, -1)  # reshape into environment defined stacked frames
        x = x.permute(0, 3, 1, 2)  # permute to Policy accepted stacked frames
        obs.append(x)
        
    return obs  # return a list of Tensors


"""
For now, we do not implement LSTM            
# LSTM Change: Need to cycle hx and cx thru function
def select_action(model, state, lstm_hc, cuda):
    hx , cx = lstm_hc 
    num_frames, height, width = state.shape
    state = torch.FloatTensor(state.reshape(-1, num_frames, height, width))

    if cuda:
        state = state.cuda()

    probs, value, (hx, cx) = model((Variable(state), (hx, cx)))

    m = torch.distributions.Categorical(probs)
    action = m.sample()
    log_prob = m.log_prob(action)
    # LSTM Change: Need to cycle hx and cx thru function
    return action.data[0], log_prob, value, (hx, cx)
"""


def select_learner_action(model, obs, cuda):
    """
    This code expects obs to be an array of stacked frames of the following dim:
    (batch_idx, in_channel, width, height)
    
    This is inputted into model - the agent's Policy, which outputs a probability 
    distribution over available actions.
    
    Policy gradient is implemented using torch.distributions.Categorical. 
    """
    
    # Policy is a 3-layer CNN
    # _, num_frames, width, height = obs.shape
    # obs = torch.FloatTensor(obs.reshape(-1, num_frames, width, height))
    
    # Policy is a 2-layer NN for now
    # obs = obs.view(1, -1)
   
    if cuda:
        obs = obs.cuda()
      
    probs = model(obs)
    m = torch.distributions.Categorical(probs)
    action = m.sample()
    log_prob = m.log_prob(action)

    return action.item(), log_prob 


def load_info(agents, narrate=False):
    for i in range(num_agents):    
        agents[i].load_info(info[i])
        if narrate:
            if agents[i].tagged:
                print('frame {}, agent{} is tagged'.format(frame,i))
            if agents[i].laser_fired:
                print('frame {}, agent{} fires its laser'.format(frame,i))
                print('and hit {} US and {} THEM'.format(agents[i].US_hit, agents[i].THEM_hit))
    return


# The main code starts here!!!

cultures =[{'name':'warlike', 'penalty':-1.0, 'reward':0.075},
           {'name':'warlike', 'penalty':-1.0, 'reward':0.025}
          ]

"""
cultures =[
           {'name':'warlike', 'penalty':-1.0, 'reward':0.001},
           {'name':'warlike', 'penalty':-1.0, 'reward':0.005},
           {'name':'warlike', 'penalty':-1.0, 'reward':0.01},
           {'name':'warlike', 'penalty':-1.0, 'reward':0.05},
           {'name':'warlike', 'penalty':-1.0, 'reward':0.1},
           {'name':'warlike', 'penalty':-1.0, 'reward':0.5},
           {'name':'warlike', 'penalty':-1.0, 'reward':1.0}
          ]
"""

# Cultural parameter search
for culture in cultures:   # Go down the list of cultures

    # Data structure for agents
    agents = []
    actions = []
    log_probs = []
    tags = []
    rewards = []
    optimizers = []

    # Cold start
    if warm_start is False:
   
        # Initialize learner agents, then load static agents (trained followed by random)
        for i in range(num_learners):
            print("Learner agent {}".format(i))
            agents.append(Policy(num_frames, num_actions, i)) # No weights loaded for learning agent
            optimizers.append(optim.Adam(agents[i].parameters(), lr=lr))
        
            # set up optimizer - this works for Atari Breakout
            # optimizers.append(optim.RMSprop(agents[i].parameters(), lr=lr, weight_decay=0.1)) 
        
        for i in range(num_learners, num_learners+num_trained):
            # A bare `raise` outside an except block has no exception to re-raise; raise explicitly
            raise RuntimeError("No trained agents exist yet!")
            """
            Disable for now! No trained model exist!!!
            agents.append(Policy(num_frames, num_actions, i))
            agents[i].load_weights()         # load weight for static agent        
            """
        for i in range(num_learners+num_trained, num_agents):
            print("Load random agent {}".format(i))
            agents.append(Rdn_Policy())

    
        # Initialize all agent data
        actions = [0 for i in range(num_agents)]
        log_probs = [0 for i in range(num_agents)]
        tags = [0 for i in range(num_agents)]
        rewards = [0 for i in range(num_agents)]

        # Keep track of rewards learned by learners
        episode_reward = [0 for i in range(num_learners)]   # reward for an episode
        running_reward = [None for i in range(num_learners)]   # running average
        running_rewards = [[] for i in range(num_learners)]   # history of running averages
        best_reward = [0 for i in range(num_learners)]    # best running average (for storing best_model)

        # This is to support warm start for training
        prior_eps = 0

    # Warm start
    if warm_start:
        # A bare `raise` outside an except block has no exception to re-raise; raise explicitly
        raise RuntimeError("Cannot warm start")
    
        """
        # Disable for now!  Need to ensure model can support training on GPU and game playing
        # on both CPU and GPU.
    
        data_file = 'results/{}.p'.format(game)

        try:
            with open(data_file, 'rb') as f:
                running_rewards = pickle.load(f)
                running_reward = running_rewards[-1]

            prior_eps = len(running_rewards)

            model_file = 'saved_models/actor_critic_{}_ep_{}.p'.format(game, prior_eps)
            with open(model_file, 'rb') as f:
                # Model Save and Load Update: Include both model and optim parameters
                saved_model = pickle.load(f)
                model, optimizer = saved_model

        except OSError:
            print('Saved file not found. Creating new cold start model.')
            model = Policy(input_channels=num_frames, num_actions=num_actions)
            optimizer = optim.RMSprop(model.parameters(), lr=lr,
                                      weight_decay=0.1)
            running_rewards = []
            prior_eps = 0
        """

    # Establish tribal association

    tribes = []
    tribes.append(Tribe(name='Vikings',color='blue', culture=culture, \
                    agents=[agents[0], agents[1], agents[2]]))
    tribes.append(Tribe(name='Saxons', color='red', culture=culture, \
                    agents=[agents[3], agents[4], agents[5]]))
    tribes.append(Tribe(name='Franks', color='purple', culture=culture, \
                    agents=[agents[6], agents[7], agents[8]]))
    # tribes.append(Tribe(name='Crazies', color='yellow', agents=[agents[9]]))   # random agents are crazy!!!

    # 9 agents in 3 tribes, using the map defined in default.txt
    agent_colors = [agent.color for agent in agents]
    agent_tribes = [agent.tribe for agent in agents]
    
    env = GatheringEnv(n_agents=num_agents,agent_colors=agent_colors, agent_tribes=agent_tribes, \
                       map_name='default')    
    
    cuda = torch.cuda.is_available()

    if cuda:
        for i in range(num_learners):    # Learning agents need to utilize GPU
            agents[i].cuda()

        
    for ep in range(max_episodes):
    
        print('.', end='')  # To show progress
    
        # Anneal temperature from temp_start to temp_end
        for i in range(num_learners):    # For learning agents
            agents[i].temperature = max(temp_end, temp_start - (temp_start - temp_end) * (ep / max_episodes))

        env_obs = env.reset()  # Env return observations

        # For Debug only
        # print (len(env_obs))
        # print (env_obs[0].shape)
    
        # Unpack observations into data structure compatible with agent Policy
        agents_obs = unpack_env_obs(env_obs)

        for i in range(num_learners):    # Reset agent info - laser tag statistics
            agents[i].reset_info()   

        # For Debug only
        # print (len(agents_obs))
        # print (agents_obs[0].shape)
    
        """
        For now, we do not stack observations, and we do not implement LSTM
    
        state = np.stack([state]*num_frames)

        # LSTM change - reset LSTM hidden units when episode begins
        cx = Variable(torch.zeros(1, 256))
        hx = Variable(torch.zeros(1, 256))
        if cuda:
            cx = cx.cuda()
            hx = hx.cuda()
        """

        episode_reward = [0 for i in range(num_learners)]   # reward for an episode
    
        for frame in range(max_frames):

            """
            For now, we do not implement LSTM
            # Select action
            # LSTM Change: Need to cycle hx and cx thru select_action
            action, log_prob, value, (hx,cx)  = select_action(model, state, (hx,cx), cuda)        
            """

            for i in range(num_learners):    # For learning agents
                actions[i], log_probs[i] = select_learner_action(agents[i], agents_obs[i], cuda)
                if actions[i] == 6:   # compare by value, not identity
                    tags[i] += 1   # record a tag for assessing aggressiveness
                agents[i].saved_actions.append((log_probs[i]))
            
                # Do not implement LSTM for now
                # actions[i].saved_actions.append((log_prob, value))
            
            for i in range(num_learners, num_learners+num_trained):
                # A bare `raise` outside an except block has no exception to re-raise; raise explicitly
                raise RuntimeError("No trained agents exist yet!")
            for i in range(num_learners+num_trained, num_agents):   # For random agents
                actions[i] = agents[i].select_action(agents_obs[i])
                if actions[i] == 6:   # compare by value, not identity
                    tags[i] += 1   # record a tag for assessing aggressiveness

            # For Debug only
            # if frame % 20 == 0:
            #    print (actions) 
            #    print (log_probs)
            
            # Perform step        
            env_obs, reward, done, info = env.step(actions)
        
            """
            For Debug only
            print (env_obs)
            print (reward)
            print (done) 
            """
       
            # Unpack observations into data structure compatible with agent Policy
            agents_obs = unpack_env_obs(env_obs)
            load_info(agents, narrate=False)   # Load agent info for AI agents

            # For learner agents only, generate reward statistics and reward stack for policy gradient
            for i in range(num_learners):
                agents[i].rewards.append(reward[i])  # Stack rewards (for policy gradient)
                episode_reward[i] += reward[i]   # accumulate episode reward 
            
            """
            For now, we do not stack observation, may come in handy later on
        
            # Evict oldest diff add new diff to state
            next_state = np.stack([next_state]*num_frames)
            next_state[1:, :, :] = state[:-1, :, :]
            state = next_state
            """

            if any(done):
                print("Done after {} frames".format(frame))
                break
            
        if frame > max_frames_ep:
            max_frames_ep = frame    # Keep track of highest frames/episode

        # Update reward statistics for learners
        for i in range(num_learners):
            if running_reward[i] is None:
                running_reward[i] = episode_reward[i]
            running_reward[i] = running_reward[i] * 0.99 + episode_reward[i] * 0.01
            running_rewards[i].append(running_reward[i])

        # Track Episode #, temp and highest frames/episode
        if (ep+prior_eps+1) % log_interval == 0: 
            verbose_str = '\nEpisode {} complete'.format(ep+prior_eps+1)
            # verbose_str += '\tTemp = {:.4}'.format(model.temperature)
            # verbose_str += '\tMax frames = {}'.format(max_frames_ep+1)
            print(verbose_str)
    
            # Display rewards and running rewards for learning agents
            for i in range(num_learners):
                verbose_str = 'Learner:{}'.format(i)
                verbose_str += '\tReward total:{}'.format(episode_reward[i])
                verbose_str += '\tRunning mean: {:.4}'.format(running_reward[i])
                print(verbose_str)
    
        # Update model
        total_norms = finish_episode(agents[0:num_learners], optimizers[0:num_learners], gamma, cuda)

        if (ep+prior_eps+1) % log_interval == 0:
            print('Max Norms = ',["%0.2f" % i for i in total_norms])
        
        if (ep+prior_eps+1) % save_interval == 0: 
            for i in range(num_learners):
                model_dir = 'MA_models/3T-9L/'
                results_dir = 'results/3T-9L/'

                model_file = model_dir+'{}/p{}_r{}/MA{}_{}_ep{}.p'.format(culture['name'], culture['penalty'],\
                                 culture['reward'], i, game, ep+prior_eps+1)
                data_file = results_dir+'{}/p{}_r{}/MA{}_{}.p'.format(culture['name'], culture['penalty'],\
                                                      culture['reward'], i, game)
                
                os.makedirs(os.path.dirname(model_file), exist_ok=True)
                os.makedirs(os.path.dirname(data_file), exist_ok=True)
                
                with open(model_file, 'wb') as f:
                    # Model Save and Load Update: Include both model and optim parameters 
                    pickle.dump((agents[i].cpu(), optimizers[i]), f)

                if cuda:
                    agents[i] = agents[i].cuda()

                with open(data_file, 'wb') as f:
                    pickle.dump(running_rewards[i], f)    
            
    env.close()  # Close the environment
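
The `Running mean` column in the training log is an exponential moving average: each episode it keeps 99% of the previous value and adds 1% of the new episode reward. The sketch below isolates that update to show one consequence, under the assumption that a learner collects exactly zero reward in every episode over a stretch of training. The function name `update_running_mean` is introduced here for illustration only; the notebook performs the update inline.

```python
def update_running_mean(running, episode_reward, decay=0.99):
    """Exponential moving average, seeded with the first episode's reward."""
    if running is None:
        running = episode_reward
    return running * decay + episode_reward * (1.0 - decay)


# Start from Learner 3's running mean logged at episode 2400 (0.003087)
# and assume zero reward for each of the next 2200 episodes.
decayed = 0.003087
for _ in range(2200):
    decayed = update_running_mean(decayed, 0)

print("{:.3e}".format(decayed))   # roughly 7.7e-13
```

Under that assumption the value shrinks by a factor of 0.99 per episode, which is consistent with Learner 3's running mean falling from 0.003087 at episode 2400 to 7.707e-13 at episode 4600 in the log. A running mean at that scale therefore suggests many consecutive episodes with no reward at all, rather than merely small rewards.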

Learner agent 0
Learner agent 1
Learner agent 2
Learner agent 3
Learner agent 4
Learner agent 5
Learner agent 6
Learner agent 7
Learner agent 8
........................................................................................................................................................................................................
Episode 200 complete
Learner:0	Reward total:5	Running mean: 3.697
Learner:1	Reward total:28	Running mean: 11.48
Learner:2	Reward total:16	Running mean: 10.01
Learner:3	Reward total:25	Running mean: 6.267
Learner:4	Reward total:26	Running mean: 8.026
Learner:5	Reward total:138	Running mean: 64.6
Learner:6	Reward total:1	Running mean: 8.516
Learner:7	Reward total:48	Running mean: 39.62
Learner:8	Reward total:139	Running mean: 53.64
Max Norms =  ['212.70', '184.58', '260.47', '264.89', '175.34', '222.54', '187.71', '245.93', '226.10']
.....................................................................................................................

........................................................................................................................................................................................................
Episode 2400 complete
Learner:0	Reward total:0	Running mean: 0.1276
Learner:1	Reward total:0	Running mean: 0.1943
Learner:2	Reward total:0	Running mean: 0.1832
Learner:3	Reward total:0	Running mean: 0.003087
Learner:4	Reward total:0	Running mean: 0.002331
Learner:5	Reward total:0	Running mean: 0.004146
Learner:6	Reward total:0	Running mean: 0.005076
Learner:7	Reward total:0	Running mean: 0.01169
Learner:8	Reward total:0	Running mean: 0.00808
Max Norms =  ['0.00', '0.00', '0.17', '0.00', '0.00', '0.27', '0.03', '0.00', '0.00']
........................................................................................................................................................................................................
Episode 2600 complete
Learner:0	Reward total:0	Running mean: 0.440

........................................................................................................................................................................................................
Episode 4600 complete
Learner:0	Reward total:0	Running mean: 0.001833
Learner:1	Reward total:0	Running mean: 0.002821
Learner:2	Reward total:0	Running mean: 0.01664
Learner:3	Reward total:0	Running mean: 7.707e-13
Learner:4	Reward total:0	Running mean: 5.821e-13
Learner:5	Reward total:0	Running mean: 1.035e-12
Learner:6	Reward total:0	Running mean: 1.268e-12
Learner:7	Reward total:0	Running mean: 2.92e-12
Learner:8	Reward total:0	Running mean: 2.018e-12
Max Norms =  ['0.00', '0.00', '0.03', '0.00', '0.00', '1.82', '0.00', '0.00', '0.00']
........................................................................................................................................................................................................
Episode 4800 complete
Learner:0	Reward total:0	Running

Max Norms =  ['189.90', '176.47', '182.85', '322.73', '66.99', '192.45', '217.39', '141.59', '266.06']
........................................................................................................................................................................................................
Episode 1800 complete
Learner:0	Reward total:30	Running mean: 30.14
Learner:1	Reward total:25	Running mean: 27.68
Learner:2	Reward total:54	Running mean: 52.68
Learner:3	Reward total:55	Running mean: 68.17
Learner:4	Reward total:20	Running mean: 27.59
Learner:5	Reward total:59	Running mean: 49.67
Learner:6	Reward total:65	Running mean: 59.86
Learner:7	Reward total:22	Running mean: 29.74
Learner:8	Reward total:71	Running mean: 85.36
Max Norms =  ['128.24', '142.98', '311.65', '252.46', '137.79', '155.83', '193.04', '58.31', '207.15']
.............................................................................................................................................................

........................................................................................................................................................................................................
Episode 4000 complete
Learner:0	Reward total:28	Running mean: 34.68
Learner:1	Reward total:21	Running mean: 14.81
Learner:2	Reward total:33	Running mean: 52.86
Learner:3	Reward total:104	Running mean: 70.27
Learner:4	Reward total:9	Running mean: 18.44
Learner:5	Reward total:44	Running mean: 38.2
Learner:6	Reward total:44	Running mean: 53.2
Learner:7	Reward total:17	Running mean: 31.72
Learner:8	Reward total:60	Running mean: 94.57
Max Norms =  ['89.48', '86.74', '288.18', '303.50', '35.04', '136.84', '284.20', '68.63', '264.93']
........................................................................................................................................................................................................
Episode 4200 complete
Learner:0	Reward total:23	Running mean: 28