<h1>Getting Started with PySC2</h1>
<p>For additional information, please refer to the official GitHub repository linked <a href="https://github.com/deepmind/pysc2">here</a>.</p>
<p>To get started with PySC2, first verify that you have the required Python packages installed, as well as the mini-game maps linked in the PySC2 repository.</p>
<p>Run the following cell to check that your computer has all of the required software set up.</p>
<p>If everything is installed properly, running the cell should open an instance of StarCraft with a randomly acting agent playing on the MoveToBeacon map:</p>
<div style="background-color:#300a24"><b><p style="color:white">python3 -m pysc2.bin.agent --map MoveToBeacon --agent pysc2.agents.random_agent.RandomAgent</p></b></div>

<h3>Actions</h3>
<p>Assuming all of that worked, we will continue by setting up some of the basic configuration for our StarCraft AI. In order to make our AI, we will create a Python class to represent our StarCraft agent. We will make our class inherit from the base agent which PySC2 provides.</p>

<p>Note that we define a list of default actions for our agent. This list restricts which actions the agent is allowed to perform. To view every valid action, enter the following command in a terminal:</p>
<div style="background-color:#300a24"><b><p style="color:white">python3 -m pysc2.bin.valid_actions --hide_specific</p></b></div>


<p>Running this command prints many lines of output listing the numerical IDs of the StarCraft actions our agent can perform. We have selected a small subset of these actions, each commented with its name, so that our agent is not slowed down by having too many actions to learn.</p>

In [1]:
from pysc2.agents.base_agent import BaseAgent
from pysc2.lib import actions, features
import random
import numpy as np

default_actions = [
   0, #no_op                                              ()
   1, #move_camera                                        (1/minimap [64, 64])
#   2, #select_point                                       (6/select_point_act [4]; 0/screen [84, 84])
#   3, #select_rect                                        (7/select_add [2]; 0/screen [84, 84]; 2/screen2 [84, 84])
#   4, #select_control_group                               (4/control_group_act [5]; 5/control_group_id [10])
   5, #select_unit                                        (8/select_unit_act [4]; 9/select_unit_id [500])
#   6, #select_idle_worker                                 (10/select_worker [4])
   7, #select_army                                        (7/select_add [2])
 #  8, #select_warp_gates                                  (7/select_add [2])
 #  9, #select_larva                                       ()
 # 10, #unload                                             (12/unload_id [500])
 # 11, #build_queue                                        (11/build_queue_id [10])
 # 12, #Attack_screen                                      (3/queued [2]; 0/screen [84, 84])
 # 13, #Attack_minimap                                     (3/queued [2]; 1/minimap [64, 64])
 331, #Move_screen                                        (3/queued [2]; 0/screen [84, 84])
 332 #Move_minimap                                       (3/queued [2]; 1/minimap [64, 64])
]


_PLAYER_SELF = features.PlayerRelative.SELF
_PLAYER_NEUTRAL = features.PlayerRelative.NEUTRAL  # beacon/minerals
_PLAYER_ENEMY = features.PlayerRelative.ENEMY


#This represents the base interface for how our agent will work
#We separate this from the StarCraft II agent class so we can focus on the underlying RL
#implementation later...
class Brain:
    def __init__(self, race="T", actions = default_actions):
        self.race = race
        self.actions = actions
    #By default, our brain will just do nothing.
    #We will change this later...
    def step(self, obs):
        return 0, []
        
#        if 331 in obs.observation.available_actions:
#            player_relative = obs.observation.feature_screen.player_relative
#            beacon = _xy_locs(player_relative == _PLAYER_NEUTRAL)
#            if not beacon:
#                return 0, []
#            beacon_center = np.mean(beacon, axis=0).round()
#            return 331, [[0], beacon_center]
#        else:
#            return 7, [[0]]


#This represents the actual agent which will play StarCraft II
class MyAgent(BaseAgent):
    def __init__(self, brain = Brain()):
        super().__init__() #call parent constructor
        assert isinstance(brain, Brain)
        self.brain = brain
        
    def step(self, obs): #This function is called once per frame to give the AI observation data and return its action
        super().step(obs) #call parent base method
        action, params = self.brain.step(obs)
        return actions.FunctionCall(action, params)
        
agent = MyAgent()
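
<p>The commented-out lines in <code>Brain.step</code> call a helper named <code>_xy_locs</code> that this notebook never defines. The following is a minimal pure-Python sketch of what such a helper and the beacon-center computation could look like. It is an illustration only: the names <code>xy_locs</code> and <code>beacon_center</code> are hypothetical, the toy grid stands in for the <code>player_relative</code> screen layer, and the value 3 for neutral units is an assumption about <code>features.PlayerRelative.NEUTRAL</code>.</p>

```python
NEUTRAL = 3  # assumed numeric value of features.PlayerRelative.NEUTRAL


def xy_locs(grid, target):
    """Return (x, y) pairs for every cell of a row-major grid equal to target."""
    return [(x, y)
            for y, row in enumerate(grid)
            for x, value in enumerate(row)
            if value == target]


def beacon_center(grid):
    """Return the rounded mean [x, y] of the neutral cells, or None if there are none."""
    locs = xy_locs(grid, NEUTRAL)
    if not locs:
        return None
    mean_x = sum(x for x, _ in locs) / len(locs)
    mean_y = sum(y for _, y in locs) / len(locs)
    return [round(mean_x), round(mean_y)]


# Toy 5x5 "screen": a 3x3 beacon (3) and one friendly unit (1).
toy_screen = [
    [0, 0, 0, 0, 0],
    [0, 3, 3, 3, 0],
    [0, 3, 3, 3, 0],
    [0, 3, 3, 3, 1],
    [0, 0, 0, 0, 0],
]
center = beacon_center(toy_screen)
print(center)
```

<p>With a center like this in hand, the scripted version of <code>step</code> would return action 331 (Move_screen) with the center as its screen argument whenever that action is available.</p>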

<p>From here, we can test our agent by running the following cell. The first line exports our notebook code as a Python file. The second line actually runs our agent.</p>

<div style="background-color:#300a24"><b><p style="color:white">
jupyter nbconvert --to script PySC2_Basics<br>python3 -m pysc2.bin.agent --map MoveToBeacon --agent PySC2_Basics.MyAgent</p></b></div>

<p>To scale up our training performance, we will be using the synchronous Advantage Actor-Critic (A2C) reinforcement learning algorithm, which lets us train our agent on several environment instances in parallel. Starter code is provided <a href="https://github.com/MG2033/A2C">here</a>. First, we will need to import some modules.</p>

In [2]:
import tensorflow as tf
import numpy as np
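
<p>Before building the networks, it may help to see the two quantities an advantage actor-critic method trains on: the discounted return and the advantage. The following is a minimal pure-Python sketch, not the starter code's implementation. The function names, the discount factor, and the toy rewards and value estimates are all assumptions made for illustration.</p>

```python
def discounted_returns(rewards, bootstrap_value, gamma=0.99):
    """Compute R_t = r_t + gamma * R_{t+1}, working backwards through a rollout.

    bootstrap_value is the critic's estimate for the state after the last
    step (0.0 if the episode ended there).
    """
    returns = []
    running = bootstrap_value
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def advantages(returns, values):
    """Advantage = observed return minus the critic's value estimate."""
    return [ret - value for ret, value in zip(returns, values)]


# Toy rollout: the reward only arrives on the last of three steps.
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.5, 0.5]  # hypothetical critic outputs for each step
returns = discounted_returns(rewards, bootstrap_value=0.0, gamma=0.5)
advs = advantages(returns, values)
print(returns, advs)
```

<p>A positive advantage means the chosen action did better than the critic expected, so the actor would be pushed towards it. A negative advantage pushes the other way.</p>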



In [19]:
UNIT_ELEMENTS = 7
MAXIMUM_CARGO = 10
MAXIMUM_BUILD_QUEUE = 10
MAXIMUM_MULTI_SELECT = 10
class StateNet:
    def __init__(self, scope, nonspatial_actions = len(default_actions),
                 resolution=84, channels=20, max_multi_select=MAXIMUM_MULTI_SELECT,
                 max_cargo=MAXIMUM_CARGO, max_build_queue=MAXIMUM_BUILD_QUEUE,
                 l2_scale=0.01, hidden_size=256):
        self.resolution = resolution
        self.variable_feature_sizes = {
            'multi_select' : max_multi_select,
            'cargo' :  max_cargo, 
            'build_queue' : max_build_queue
        }
        #The following assumes that we will stack our minimap and screen features (and they will have the same size)
        with tf.variable_scope('State-{}'.format(scope)):
            self.structured_observation = tf.placeholder(tf.float32, [None, 11], 'StructuredObservation')
            self.single_select = tf.placeholder(tf.float32, [None, 1, UNIT_ELEMENTS], 'SingleSelect')
            self.cargo = tf.placeholder(tf.float32, [None,  max_cargo, UNIT_ELEMENTS], 'Cargo')
            self.multi_select = tf.placeholder(tf.float32, [None, max_multi_select, UNIT_ELEMENTS], 'Multiselect')
            self.build_queue = tf.placeholder(tf.float32, [None,  max_build_queue, UNIT_ELEMENTS], 'BuildQueue')
            self.units = tf.concat([self.single_select,
                                    self.multi_select,
                                    self.cargo,
                                    self.build_queue], axis=1,
                                    name='Units')
            self.control_groups = tf.placeholder(tf.float32, [None, 10, 2], 'ControlGroups')
            self.available_actions = tf.placeholder(tf.float32, [None, nonspatial_actions], 'AvailableActions')
            self.used_actions = tf.placeholder(tf.float32, [None, nonspatial_actions], 'UsedActions')
            self.actions = tf.concat([self.available_actions,
                                      self.used_actions], axis=1,
                                      name='Actions')
            self.nonspatial_features = tf.concat([
                    self.structured_observation,
                    tf.reshape(self.units, [-1, UNIT_ELEMENTS * (1+sum(self.variable_feature_sizes.values()))]),
                    tf.reshape(self.control_groups, [-1, 20]),
                    tf.reshape(self.actions, [-1, 2 * nonspatial_actions])
                ], axis=1, name='NonspatialFeatures')
            
            self.spatial_features = tf.placeholder(tf.float32,
                                                   [None, resolution, resolution, channels],
                                                   'SpatialFeatures')
            self.conv1 = tf.layers.conv2d(inputs=self.spatial_features, filters=32,
                                          kernel_size=[5, 5],
                                          kernel_regularizer=tf.contrib.layers.l2_regularizer(l2_scale),
                                          activation=tf.nn.relu, name='Convolutional1')
            self.max_pool1 = tf.layers.max_pooling2d(inputs=self.conv1, pool_size=[2, 2],
                                                     strides=2, name='Pool1')
            self.conv2 = tf.layers.conv2d(inputs=self.max_pool1, filters=64,
                                          kernel_size=[5, 5],
                                          kernel_regularizer=tf.contrib.layers.l2_regularizer(l2_scale),
                                          activation=tf.nn.relu, name='Convolutional2')
            self.max_pool2 = tf.layers.max_pooling2d(inputs=self.conv2, pool_size=[2, 2],
                                                     strides=2, name='Pool2')
            self.max_pool2_flat = tf.reshape(self.max_pool2, [-1, 18 * 18 * 64], name='Pool2_Flattened')
            self.state_flattened = tf.concat([self.max_pool2_flat, self.nonspatial_features],
                                             1, name='StateFlattened')
            self.hidden_1 = tf.layers.dense(self.state_flattened, hidden_size, tf.nn.relu, name='Hidden1')
            self.output = self.hidden_1
            for variable_name, tensor in vars(self).items():
                if isinstance(tensor, tf.Tensor):
                    print('{}:\t({} Shape={})'.format(variable_name, tensor.name, tensor.shape))
            
tf.reset_default_graph()
test_state = StateNet('test')
global_state = StateNet('global')

structured_observation:	(State-test/StructuredObservation:0 Shape=(?, 11))
single_select:	(State-test/SingleSelect:0 Shape=(?, 1, 7))
cargo:	(State-test/Cargo:0 Shape=(?, 10, 7))
multi_select:	(State-test/Multiselect:0 Shape=(?, 10, 7))
build_queue:	(State-test/BuildQueue:0 Shape=(?, 10, 7))
units:	(State-test/Units:0 Shape=(?, 31, 7))
control_groups:	(State-test/ControlGroups:0 Shape=(?, 10, 2))
available_actions:	(State-test/AvailableActions:0 Shape=(?, 6))
used_actions:	(State-test/UsedActions:0 Shape=(?, 6))
actions:	(State-test/Actions:0 Shape=(?, 12))
nonspatial_features:	(State-test/NonspatialFeatures:0 Shape=(?, 260))
spatial_features:	(State-test/SpatialFeatures:0 Shape=(?, 84, 84, 20))
conv1:	(State-test/Convolutional1/Relu:0 Shape=(?, 80, 80, 32))
max_pool1:	(State-test/Pool1/MaxPool:0 Shape=(?, 40, 40, 32))
conv2:	(State-test/Convolutional2/Relu:0 Shape=(?, 36, 36, 64))
max_pool2:	(State-test/Pool2/MaxPool:0 Shape=(?, 18, 18, 64))
max_pool2_flat:	(State-test/Pool2_Flattened
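
<p>The shapes printed above can be checked by hand. The sketch below reproduces the arithmetic in plain Python, assuming unpadded ("valid") convolutions and pooling, which is consistent with the 84 to 80 to 40 to 36 to 18 progression in the output. The helper name <code>valid_output_size</code> is hypothetical.</p>

```python
def valid_output_size(size, kernel, stride=1):
    """Output width of an unpadded ('valid') convolution or pooling window."""
    return (size - kernel) // stride + 1


resolution = 84
conv1 = valid_output_size(resolution, kernel=5)        # 5x5 convolution
pool1 = valid_output_size(conv1, kernel=2, stride=2)   # 2x2 max pool, stride 2
conv2 = valid_output_size(pool1, kernel=5)             # 5x5 convolution
pool2 = valid_output_size(conv2, kernel=2, stride=2)   # 2x2 max pool, stride 2
flat_spatial = pool2 * pool2 * 64                      # 64 filters in Convolutional2

UNIT_ELEMENTS = 7
unit_rows = 1 + 10 + 10 + 10       # single_select + multi_select + cargo + build_queue
nonspatial = (11                   # structured observation
              + unit_rows * UNIT_ELEMENTS
              + 10 * 2             # control groups
              + 2 * 6)             # available + used flags for 6 actions
state_flattened = flat_spatial + nonspatial

print(conv1, pool1, conv2, pool2, flat_spatial, nonspatial, state_flattened)
```

<p>This explains the hard-coded <code>18 * 18 * 64</code> in the reshape: it only holds for an 84x84 input, so changing <code>resolution</code> would require recomputing it.</p>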

In [20]:
class QNet:
    def __init__(self, statenet, scope, usable_actions=default_actions):
        self.actions = usable_actions
        with tf.variable_scope(scope):
            self.action_probability_raw = tf.layers.dense(statenet.output,
                                                          len(self.actions),
                                                          tf.nn.relu,
                                                          name='ActionProbRaw')
            self.action_probability = tf.nn.softmax(self.action_probability_raw)
            print('Action probability shape:', self.action_probability.shape)
            self.arguments = {}
            for argument in actions.TYPES:
                self.arguments[argument.name] = [] 
                for dimension, size in enumerate(argument.sizes):
                    if size == 0: #set size for screen/minimap coordinates
                        size = 1
                        if argument.name in ['screen', 'screen2', 'minimap']:
                            size = statenet.resolution
                    argument_layer = tf.layers.dense(statenet.output, 
                                                     size, tf.nn.relu,
                                                     name='{}{}'.format(argument.name,
                                                                        dimension))
                    print('Argument {}[{}] Shape:{}'.format(argument.name, dimension, argument_layer.shape))
                    self.arguments[argument.name].append(argument_layer)
            self.value =  tf.layers.dense(statenet.output, 1, name='Value')
            print('{} Shape: {}'.format(self.value.name, self.value.shape))
        
test_q = QNet(test_state, 'test')
global_q = QNet(global_state, 'global')

Action probability shape: (?, 6)
Argument screen[0] Shape:(?, 84)
Argument screen[1] Shape:(?, 84)
Argument minimap[0] Shape:(?, 84)
Argument minimap[1] Shape:(?, 84)
Argument screen2[0] Shape:(?, 84)
Argument screen2[1] Shape:(?, 84)
Argument queued[0] Shape:(?, 2)
Argument control_group_act[0] Shape:(?, 5)
Argument control_group_id[0] Shape:(?, 10)
Argument select_point_act[0] Shape:(?, 4)
Argument select_add[0] Shape:(?, 2)
Argument select_unit_act[0] Shape:(?, 4)
Argument select_unit_id[0] Shape:(?, 500)
Argument select_worker[0] Shape:(?, 4)
Argument build_queue_id[0] Shape:(?, 10)
Argument unload_id[0] Shape:(?, 500)
test/Value/BiasAdd:0 Shape: (?, 1)
Action probability shape: (?, 6)
Argument screen[0] Shape:(?, 84)
Argument screen[1] Shape:(?, 84)
Argument minimap[0] Shape:(?, 84)
Argument minimap[1] Shape:(?, 84)
Argument screen2[0] Shape:(?, 84)
Argument screen2[1] Shape:(?, 84)
Argument queued[0] Shape:(?, 2)
Argument control_group_act[0] Shape:(?, 5)
Argument control_group_i
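
<p>Each head of <code>QNet</code> produces one score per possible choice. To act, those scores have to become a probability distribution that we can sample from. The following is a pure-Python sketch of that step, with hypothetical helper names and toy scores. Note that the argument heads above end in a ReLU rather than a softmax, so this sketch assumes a softmax is applied to their outputs before sampling.</p>

```python
import math
import random


def softmax(scores):
    """Turn raw scores into probabilities that sum to one."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def sample_index(probabilities, rng=random):
    """Draw one index, weighted by its probability."""
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]


# Toy scores for the 6 default actions; index 2 is favoured.
scores = [0.0, 0.0, 2.0, 0.0, 0.0, 0.0]
probs = softmax(scores)
choice = sample_index(probs, random.Random(0))
print(probs, choice)
```

<p>The same two helpers would apply to every argument head, for example once for <code>screen[0]</code> and once for <code>screen[1]</code> to pick an (x, y) coordinate.</p>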

In [21]:
entropy_factor = 0.1
value_factor = 0.5
gradient_norm_factor = 40
default_optimizer = tf.train.AdamOptimizer(learning_rate=0.001)
class QTrainer:
    def __init__(self, statenet, qnet, scope, optimizer=default_optimizer):
        with tf.variable_scope('QTrainer-{}'.format(scope)):
            self.action = tf.placeholder(tf.int32, [None], name='Action_Placeholder')
            self.action_one_hot = tf.one_hot(self.action, len(qnet.actions), name='Actions_One_Hot')
            self.entropy_base = -tf.reduce_sum(qnet.action_probability * tf.log(
                    tf.clip_by_value(qnet.action_probability, 1e-20, 1)
                ), [1]
            )
            self.arguments = {} #placeholders for argument training
            self.arguments_one_hot = {}
            self.arguments_entropy = {}
            for argument in actions.TYPES:
                self.arguments[argument.name] = [] 
                self.arguments_one_hot[argument.name] =[]
                self.arguments_entropy[argument.name] = []
                for dimension, size in enumerate(argument.sizes):
                    argument_placeholder = tf.placeholder(tf.int32,[None],
                                                          name='{}{}_Placeholder'.format(argument.name,
                                                                                         dimension))
                    self.arguments[argument.name].append(argument_placeholder)
                    argument_one_hot = tf.one_hot(argument_placeholder,
                                                  qnet.arguments[argument.name][dimension].shape[1],
                                                  dtype=tf.float32,
                                                  name='{}{}_One_Hot'.format(argument.name,
                                                                                 dimension))
                    self.arguments_one_hot[argument.name].append(argument_one_hot)
                    print('{} Shape:{}'.format(argument_one_hot.name,
                                                        argument_one_hot.shape))
                    
                    argument_entropy = -tf.reduce_sum(
                        qnet.arguments[argument.name][dimension] * tf.log(
                            tf.clip_by_value(qnet.arguments[argument.name][dimension], 1e-20, 1)
                        ), [1]
                    )
                    self.arguments_entropy[argument.name].append(argument_entropy)
                    self.entropy_base += argument_entropy
            self.target_value = tf.placeholder(tf.float32, shape=[None], name='Target_Value')
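            #Squared error between the critic's value estimate and the target value.
            #qnet.value has shape [None, 1], so we squeeze it to match target_value's shape of [None]
            self.value_loss = tf.square(tf.squeeze(qnet.value, axis=1) - self.target_value)
            self.loss = self.value_loss * value_factor + self.entropy_base * entropy_factor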
            local_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope)
            self.gradients = tf.gradients(self.loss, local_vars)
            self.var_norms = tf.global_norm(local_vars)
            grads, self.grad_norms = tf.clip_by_global_norm(self.gradients, gradient_norm_factor)

            # Apply local gradients to global network
            global_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope)#'global')
            self.apply_grads = optimizer.apply_gradients(zip(grads,global_vars))
            
test_qtrainer = QTrainer(test_state, test_q, 'test')

QTrainer-test/screen0_One_Hot:0 Shape:(?, 84)
QTrainer-test/screen1_One_Hot:0 Shape:(?, 84)
QTrainer-test/minimap0_One_Hot:0 Shape:(?, 84)
QTrainer-test/minimap1_One_Hot:0 Shape:(?, 84)
QTrainer-test/screen20_One_Hot:0 Shape:(?, 84)
QTrainer-test/screen21_One_Hot:0 Shape:(?, 84)
QTrainer-test/queued0_One_Hot:0 Shape:(?, 2)
QTrainer-test/control_group_act0_One_Hot:0 Shape:(?, 5)
QTrainer-test/control_group_id0_One_Hot:0 Shape:(?, 10)
QTrainer-test/select_point_act0_One_Hot:0 Shape:(?, 4)
QTrainer-test/select_add0_One_Hot:0 Shape:(?, 2)
QTrainer-test/select_unit_act0_One_Hot:0 Shape:(?, 4)
QTrainer-test/select_unit_id0_One_Hot:0 Shape:(?, 500)
QTrainer-test/select_worker0_One_Hot:0 Shape:(?, 4)
QTrainer-test/build_queue_id0_One_Hot:0 Shape:(?, 10)
QTrainer-test/unload_id0_One_Hot:0 Shape:(?, 500)
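
<p><code>QTrainer</code> relies on two numerical building blocks: the entropy of a probability distribution (with the probabilities clipped before taking the logarithm) and gradient clipping by global norm. The pure-Python sketch below mirrors both on small lists so the numbers can be checked by hand. The function names are hypothetical, and the clipping rule is the usual one of scaling every gradient by <code>clip_norm / max(global_norm, clip_norm)</code>.</p>

```python
import math


def entropy(probabilities, floor=1e-20):
    """-sum(p * log(p)), clipping p into [floor, 1] inside the log to avoid log(0)."""
    return -sum(p * math.log(min(max(p, floor), 1.0)) for p in probabilities)


def clip_by_global_norm(gradients, clip_norm):
    """Rescale a list of gradient vectors so their joint L2 norm is at most clip_norm."""
    global_norm = math.sqrt(sum(g * g for grad in gradients for g in grad))
    scale = clip_norm / max(global_norm, clip_norm)
    clipped = [[g * scale for g in grad] for grad in gradients]
    return clipped, global_norm


uniform_entropy = entropy([0.5, 0.5])    # maximal for two outcomes: log(2)
certain_entropy = entropy([1.0, 0.0])    # no uncertainty at all
clipped, norm = clip_by_global_norm([[3.0], [4.0]], clip_norm=2.5)
print(uniform_entropy, certain_entropy, clipped, norm)
```

<p>Gradients whose joint norm is already below <code>clip_norm</code> pass through unchanged, because the scale factor is then exactly one.</p>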


In [14]:
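from pysc2.env import environment #provides StepType, which process_observations uses

class RLBrain(Brain):
    def __init__(self, name, sess=None, race="T", actions = default_actions):
        super().__init__(race=race, actions=actions) #pass our settings on to Brain so self.actions is set correctly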
        self.state = StateNet(name, nonspatial_actions=len(actions))
        self.q = QNet(self.state, name, usable_actions=actions)
        self.q_trainer = QTrainer(self.state, self.q, name)
        self.sess = sess
    #By default, our brain will just do nothing.
    #We will change this later...
    def step(self, obs):
        #formatting/processing our observation to feed into state/q nets
        #determine our action probabilities
        #select action randomly with probabilities
        #for each required  argument:
            #determine argument value probabilities
            #select argument value randomly with probabilities
        #return action id, arguments
        return 0, []
    
    def process_observations(self, observation):
        # is episode over?
        episode_end = (observation.step_type == environment.StepType.LAST)
        # reward
        reward = observation.reward #scalar?
        # features (named raw_features so we do not shadow the pysc2 features module)
        raw_features = observation.observation
        # the shapes of some features depend on the state (eg. shape of multi_select depends on number of units)
        # since tf requires fixed input shapes, we set a maximum size then pad the input if it falls short
        processed_features = {}
        for feature_label, feature in raw_features.items():
            if feature_label in ['available_actions', 'last_actions']:
                # encode the action ids as a 0/1 mask over the actions our brain may use
                action_mask = np.zeros(len(self.actions))
                for i, action in enumerate(self.actions):
                    if action in feature:
                        action_mask[i] = 1
                feature = action_mask
            # newer PySC2 releases name these layers feature_minimap/feature_screen, so we accept both
            elif feature_label in ['minimap', 'screen', 'feature_minimap', 'feature_screen']:
                feature = np.stack(feature, axis=2)
            elif feature_label in ['single_select', 'multi_select', 'cargo', 'build_queue']:
                feature = np.asarray(feature).reshape(-1)
                if feature_label in self.state.variable_feature_sizes:
                    padding = np.zeros(self.state.variable_feature_sizes[feature_label] * UNIT_ELEMENTS - len(feature))
                    feature = np.concatenate([feature, padding]) # concatenate takes a sequence of arrays
            processed_features[feature_label] = np.expand_dims(feature, axis=0)
        return reward, processed_features, episode_end
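
<p>The padding step in <code>process_observations</code> can be tried in isolation. The sketch below is a pure-Python version with a hypothetical helper name and made-up unit rows. It also truncates inputs that exceed the maximum, which is an assumption added here; the method above only handles inputs that fall short.</p>

```python
UNIT_ELEMENTS = 7  # values per unit row, matching the constant defined earlier


def pad_units(rows, max_rows, width=UNIT_ELEMENTS):
    """Flatten unit rows and zero-pad (or truncate) to exactly max_rows * width values."""
    flat = [value for row in rows[:max_rows] for value in row]
    return flat + [0.0] * (max_rows * width - len(flat))


# Two selected units with invented feature values, padded to the 10-unit maximum.
selected = [
    [48, 1, 45, 0, 0, 0, 0],
    [48, 1, 40, 0, 0, 0, 0],
]
padded = pad_units(selected, max_rows=10)
print(len(padded))
```

<p>Because every observation is padded to the same length, the result can be fed to a placeholder with a fixed shape such as <code>[None, 10, 7]</code> after reshaping.</p>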


In [17]:
tf.reset_default_graph()
brain = RLBrain('test')

structured_observation:	(State-test/StructuredObservation:0 Shape=(?, 11))
single_select:	(State-test/SingleSelect:0 Shape=(?, 1, 7))
cargo:	(State-test/Cargo:0 Shape=(?, 10, 7))
multi_select:	(State-test/Multiselect:0 Shape=(?, 10, 7))
build_queue:	(State-test/BuildQueue:0 Shape=(?, 10, 7))
units:	(State-test/Units:0 Shape=(?, 31, 7))
control_groups:	(State-test/ControlGroups:0 Shape=(?, 10, 2))
available_actions:	(State-test/AvailableActions:0 Shape=(?, 6))
used_actions:	(State-test/UsedActions:0 Shape=(?, 6))
actions:	(State-test/Actions:0 Shape=(?, 12))
nonspatial_features:	(State-test/NonspatialFeatures:0 Shape=(?, 260))
spatial_features:	(State-test/SpatialFeatures:0 Shape=(?, 84, 84, 20))
conv1:	(State-test/Convolutional1/Relu:0 Shape=(?, 80, 80, 32))
max_pool1:	(State-test/Pool1/MaxPool:0 Shape=(?, 40, 40, 32))
conv2:	(State-test/Convolutional2/Relu:0 Shape=(?, 36, 36, 64))
max_pool2:	(State-test/Pool2/MaxPool:0 Shape=(?, 18, 18, 64))
max_pool2_flat:	(State-test/Pool2_Flattened
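
<p>The comments in <code>RLBrain.step</code> outline the remaining work: compute action probabilities, then sample an action. One detail the outline leaves open is how to avoid sampling an action the game currently forbids. The pure-Python sketch below shows one possible approach, masking unavailable actions and renormalizing before sampling. The function names and the toy probabilities are hypothetical, and this is not necessarily how the finished agent will do it.</p>

```python
import random


def mask_and_normalize(probabilities, mask):
    """Zero out masked entries and rescale the rest to sum to one."""
    masked = [p * m for p, m in zip(probabilities, mask)]
    total = sum(masked)
    if total > 0:
        return [p / total for p in masked]
    available = sum(mask)
    if available == 0:
        raise ValueError('no action is available')
    # every available action had zero probability: fall back to uniform
    return [m / available for m in mask]


def choose_action(action_ids, probabilities, available_ids, rng=random):
    """Sample one action id, restricted to those in available_ids."""
    mask = [1.0 if action in available_ids else 0.0 for action in action_ids]
    probs = mask_and_normalize(probabilities, mask)
    index = rng.choices(range(len(action_ids)), weights=probs, k=1)[0]
    return action_ids[index]


action_ids = [0, 1, 5, 7, 331, 332]            # same ids as default_actions
toy_probs = [0.1, 0.1, 0.1, 0.1, 0.3, 0.3]     # made-up network output
available = {0, 7}                             # e.g. nothing is selected yet
picks = [choose_action(action_ids, toy_probs, available, random.Random(seed))
         for seed in range(20)]
print(picks)
```

<p>Here the move actions (331 and 332) carry most of the probability but are never chosen, because they are not in the available set.</p>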