### 1. Start the Environment

In [1]:
from mlagents.envs import UnityEnvironment
import numpy as np

# Some more magic so that the notebook will reload external python modules;
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

**_Before running the code cell below_**, change the `file_name` parameter to match the location of your Unity environment build.

For instance, if you are using a Mac and downloaded `Reacher.app` into the same folder as the notebook, the line would read as follows (this notebook itself loads a Crawler build instead):
```
env = UnityEnvironment(file_name="Reacher.app")
```

In [2]:
env = UnityEnvironment(file_name='unity_envs/Crawler_StaticTarget')

INFO:mlagents.envs:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of Training Brains : 1
        Reset Parameters :
		
Unity brain name: CrawlerStaticLearning
        Number of Visual Observations (per agent): 0
        Vector Observation space size (per agent): 129
        Number of stacked Vector Observation: 1
        Vector Action space type: continuous
        Vector Action space size (per agent): [20]
        Vector Action descriptions: , , , , , , , , , , , , , , , , , , , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [3]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

* Set-up: A creature with 4 arms and 4 forearms.
* Goal: The agent must move its body toward the goal direction without falling.
  * CrawlerStaticTarget: Goal direction is always forward.
  * CrawlerDynamicTarget: Goal direction is randomized.
* Agents: The environment contains 3 agents linked to a single Brain (the build loaded in this notebook reports 12 agents; see the output below).
* Agent Reward Function (independent):
  * +0.03 times body velocity in the goal direction.
  * +0.01 times body direction alignment with goal direction.
* Brains: One Brain with the following observation/action space.
  * Vector Observation space: 117 variables corresponding to position, rotation, velocity, and angular velocities of each limb plus the acceleration and angular acceleration of the body (the build loaded in this notebook reports 129 variables per agent).
  * Vector Action space: (Continuous) Size of 20, corresponding to target rotations for joints.
  * Visual Observations: None.
* Reset Parameters: None.
* Benchmark Mean Reward for CrawlerStaticTarget: 2000
* Benchmark Mean Reward for CrawlerDynamicTarget: 400
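
The two reward terms listed above can be read as dot products with the goal direction. The sketch below is only an illustration under that assumption: `crawler_reward` and its argument names are hypothetical, not part of the ML-Agents API, and the real reward is computed inside the Unity environment.

```python
import numpy as np

def crawler_reward(body_velocity, body_forward, goal_direction):
    """Hypothetical per-step reward built from the two listed terms."""
    goal = np.asarray(goal_direction, dtype=float)
    goal = goal / np.linalg.norm(goal)                    # normalize the goal direction
    velocity_term = 0.03 * np.dot(body_velocity, goal)    # speed along the goal direction
    alignment_term = 0.01 * np.dot(body_forward, goal)    # how well the body faces the goal
    return velocity_term + alignment_term

# An agent moving at 2 units/s straight toward the goal while facing it
print(crawler_reward([0.0, 0.0, 2.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]))
```

Under this reading, an agent that faces the goal but stands still earns only the small alignment term, so most of the reward has to come from forward speed.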

Let's print some information about the environment.

In [4]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size[0]
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])

Number of agents: 12
Size of each action: 20
There are 12 agents. Each observes a state with length: 129
The state for the first agent looks like: [ 0.00000000e+00  0.00000000e+00  0.00000000e+00  2.25000000e+00
  1.00000000e+00  0.00000000e+00  1.78813934e-07  0.00000000e+00
  1.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  6.06093168e-01 -1.42857209e-01 -6.06078804e-01  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  1.33339906e+00 -1.42857209e-01
 -1.33341408e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
 -6.0609

### 3. Take Random Actions in the Environment

In [5]:
env_info = env.reset(train_mode=False)[brain_name]     # reset the environment    
states = env_info.vector_observations                  # get the current state (for each agent)
scores = np.zeros(num_agents) # initialize the score (for each agent)
step=0
while True:
    actions = np.random.randn(num_agents, action_size) # select an action (for each agent)
    actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
    env_info = env.step(actions)[brain_name]           # send all actions to the environment
    next_states = env_info.vector_observations         # get next state (for each agent)
    rewards = env_info.rewards                         # get reward (for each agent)
    dones = env_info.local_done                        # see if episode finished
    scores += env_info.rewards                         # update the score (for each agent)
    states = next_states                               # roll over states to next time step
    step+=1
    if np.any(dones):                                  # exit loop if episode finished
        break
print('Total score (averaged over agents) this episode: {}'.format(np.mean(scores)))

Total score (averaged over agents) this episode: 0.20592984704611203


### 4. Training the agent!

Now it's your turn to train an agent to solve the environment!  When training the agent, we have to set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```

In [6]:
import random
import datetime
import numpy as np
from collections import deque
import matplotlib.pyplot as plt
%matplotlib inline

# PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.distributions import Normal

# imports for rendering outputs in Jupyter.
from JSAnimation.IPython_display import display_animation
from matplotlib import animation
from IPython.display import display

# Some more magic so that the notebook will reload external python modules;
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [7]:
# defining the device
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print ("using",device)

using cpu


### 5. Define policy network (Actor Critic style)

In [8]:
state_size = brain.vector_observation_space_size
action_size = brain.vector_action_space_size[0]  # reported as a one-element list, e.g. [20]
action_low = -1
action_high = 1

# define actor critic network
class ActorCritic(nn.Module):
    
    def __init__(self,state_size,action_size,action_high,action_low,hidden_size=32):
        super(ActorCritic, self).__init__()
        
        # action range
        self.action_high = torch.tensor(action_high).to(device)
        self.action_low = torch.tensor(action_low).to(device)
        
        self.std = nn.Parameter(torch.zeros(action_size))
        
        # common network
        self.fc1 = nn.Linear(state_size,512)
        
        # actor network
        self.fc2_actor = nn.Linear(512,256)
        self.fc3_action = nn.Linear(256,action_size)
        #self.fc3_std = nn.Linear(64,action_size)
        
        # critic network
        self.fc2_critic = nn.Linear(512,64)
        self.fc3_critic = nn.Linear(64,1)
    
    def forward(self,state):
        # common network
        x = F.relu(self.fc1(state))
        
        # actor network
        x_actor = F.relu(self.fc2_actor(x))
        action_mean = torch.sigmoid(self.fc3_action(x_actor))  # F.sigmoid is deprecated
        ## rescale action mean
        action_mean_ = (self.action_high-self.action_low)*action_mean + self.action_low
        #action_std = F.sigmoid(self.fc3_std(x_actor))
        
        # critic network
        x_critic = F.relu(self.fc2_critic(x))
        v = self.fc3_critic(x_critic)
        return action_mean_,v
    
    def act(self,state,action=None):
        # converting state from numpy array to pytorch tensor on the "device"
        state = torch.from_numpy(state).float().to(device)
        action_mean,v = self.forward(state)
        prob_dist = Normal(action_mean,F.softplus(self.std))
        if action is None:
            action = prob_dist.sample()
        log_prob = prob_dist.log_prob(action).sum(-1).unsqueeze(-1)
        entropy = prob_dist.entropy().sum(-1).unsqueeze(-1)
        return {'a': action,
                'log_pi_a': log_prob,
                'ent': entropy,
                'mean': action_mean,
                'v': v}
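
The `act` method above builds a diagonal Gaussian whose mean is a sigmoid output rescaled into `[action_low, action_high]` and whose standard deviation is `softplus(self.std)`, then sums the per-dimension log-probabilities. The following is a minimal NumPy stand-in for those two steps, not the PyTorch code itself; the helper names `rescale_mean` and `gaussian_log_prob` are hypothetical.

```python
import numpy as np

def rescale_mean(logits, low=-1.0, high=1.0):
    """Squash logits with a sigmoid, then map (0, 1) onto (low, high)."""
    squashed = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))
    return (high - low) * squashed + low

def gaussian_log_prob(action, mean, std):
    """Log-density of a diagonal Gaussian, summed over the action dimension."""
    action, mean, std = (np.asarray(x, dtype=float) for x in (action, mean, std))
    per_dim = (-0.5 * ((action - mean) / std) ** 2
               - np.log(std) - 0.5 * np.log(2.0 * np.pi))
    return per_dim.sum(axis=-1, keepdims=True)

mean = rescale_mean(np.zeros((3, 20)))            # zero logits map to the midpoint, 0.0
std = np.full(20, np.log(2.0))                    # softplus(0) = ln 2, the initial std
actions = mean + std * np.random.randn(3, 20)     # one sampled action per agent
print(gaussian_log_prob(actions, mean, std).shape)  # (3, 1)
```

Summing over the last axis with `keepdims=True` mirrors `.sum(-1).unsqueeze(-1)` in `act`, so each agent gets a single log-probability in a column vector.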

### 6. Storage class

In [9]:
class Storage:
    def __init__(self, size, keys=None):
        if keys is None:
            keys = []
        keys = keys + ['s', 'a', 'r', 'm',
                       'v', 'q', 'pi', 'log_pi', 'ent',
                       'adv', 'ret', 'q_a', 'log_pi_a',
                       'mean']
        self.keys = keys
        self.size = size
        self.reset()

    def add(self, data):
        for k, v in data.items():
            assert k in self.keys
            getattr(self, k).append(v)

    def placeholder(self):
        for k in self.keys:
            v = getattr(self, k)
            if len(v) == 0:
                setattr(self, k, [None] * self.size)

    def reset(self):
        for key in self.keys:
            setattr(self, key, [])

    def cat(self, keys):
        data = [getattr(self, k)[:self.size] for k in keys]
        return map(lambda x: torch.cat(x, dim=0), data)

### 7. PPO agent

In [10]:
from collections import deque
from itertools import accumulate
from torch import tensor  # torch.tensor is a function, not an importable module

def random_sample(indices, batch_size):
    indices = np.asarray(np.random.permutation(indices))
    batches = indices[:len(indices) // batch_size * batch_size].reshape(-1, batch_size)
    for batch in batches:
        yield batch
    r = len(indices) % batch_size
    if r:
        yield indices[-r:]
        
class Agent:
    
    def __init__(self,env,learning_rate=1e-3):
        self.env = env
        nS = brain.vector_observation_space_size
        nA = brain.vector_action_space_size[0]
        self.policy = ActorCritic(state_size=nS,hidden_size=128,action_size=nA,
                             action_low=action_low,action_high=action_high).to(device)
        self.optimizer = optim.RMSprop(self.policy.parameters(), lr=learning_rate)
        
        # reset the environment
        env_info = self.env.reset(train_mode=True)[brain_name]
        self.states = env_info.vector_observations
        
        self.episode_rewards_window = deque(maxlen=100)
        self.episode_rewards = []
        num_trajectories = 12
        self.online_rewards = np.zeros(num_trajectories)
    
        
    def train(self,max_opt_steps=1000,num_trajectories=12,rollout_length=2048,mini_batch_size=64,gamma=.99,
              target_score=-250,use_gae=False,gae_tau=0.95,PRINT_EVERY=100):
        
        for opt_step in range(max_opt_steps):
        
            storage = Storage(rollout_length)
            states = self.states
            for _ in range(rollout_length):
                prediction = self.policy.act(states)
                
                # send all actions to the environment
                env_info = self.env.step((prediction['a']).cpu().numpy())[brain_name]
                
                next_states = np.array(env_info.vector_observations)         # get next state (for each agent)
                rewards = np.array(env_info.rewards)                      # get reward (for each agent)
                terminals = np.array(env_info.local_done)                  # see if episode finished
                
                self.online_rewards += rewards
                for i, terminal in enumerate(terminals):
                    if terminal:
                        self.episode_rewards.append(self.online_rewards[i])
                        self.episode_rewards_window.append(self.online_rewards[i])
                        self.online_rewards[i] = 0
                
                storage.add(prediction)
                storage.add({'r': tensor(rewards).unsqueeze(-1).float().to(device),
                             'm': tensor(1 - terminals).unsqueeze(-1).float().to(device),
                             's': tensor(states).to(device)})
                states = next_states

            self.states = states
            prediction = self.policy.act(states)
            storage.add(prediction)
            storage.placeholder()

            advantages = tensor(np.zeros((num_trajectories, 1))).float().to(device)
            returns = prediction['v'].detach()
            for i in reversed(range(rollout_length)):
                returns = storage.r[i] + gamma * storage.m[i] * returns
                if not use_gae:
                    advantages = returns - storage.v[i].detach()
                else:
                    td_error = storage.r[i] + gamma * storage.m[i] * storage.v[i + 1] - storage.v[i]
                    advantages = advantages * gae_tau * gamma * storage.m[i] + td_error
                storage.adv[i] = advantages.detach()
                storage.ret[i] = returns.detach()

            states, actions, log_probs_old, returns, advantages = storage.cat(['s', 'a', 'log_pi_a', 'ret', 'adv'])
            actions = actions.detach()
            log_probs_old = log_probs_old.detach()
            advantages = (advantages - advantages.mean()) / advantages.std()
            
            ppo_ratio_clip = 0.2
            gradient_clip = 0.5
            entropy_weight = 0.0
            
            for _ in range(10):
                sampler = random_sample(np.arange(states.size(0)), mini_batch_size)
                for batch_indices in sampler:
                    batch_indices = tensor(batch_indices).long()
                    sampled_states = states[batch_indices]
                    sampled_actions = actions[batch_indices]
                    sampled_log_probs_old = log_probs_old[batch_indices]
                    sampled_returns = returns[batch_indices]
                    sampled_advantages = advantages[batch_indices]

                    prediction = self.policy.act(sampled_states.cpu().numpy(), sampled_actions)
                    ratio = (prediction['log_pi_a'] - sampled_log_probs_old).exp()
                    obj = ratio * sampled_advantages
                    obj_clipped = ratio.clamp(1.0 - ppo_ratio_clip,
                                              1.0 + ppo_ratio_clip) * sampled_advantages
                    policy_loss = -torch.min(obj, obj_clipped).mean() - entropy_weight * prediction['ent'].mean()

                    value_loss = 0.5 * (sampled_returns - prediction['v']).pow(2).mean()

                    self.optimizer.zero_grad()
                    (policy_loss + value_loss).backward()
                    nn.utils.clip_grad_norm_(self.policy.parameters(), gradient_clip)
                    self.optimizer.step()
            
            #printing progress
            if opt_step % PRINT_EVERY == 0:
                print ("Opt step: {}\t Avg reward: {:.2f}\t std: {}".format(opt_step,np.mean(self.episode_rewards_window),
                                                                             self.policy.std))
                # save the policy
                torch.save(self.policy, 'ppo-crawler.policy')
            
            if np.mean(self.episode_rewards_window)>= target_score:
                print ("Environment solved in {} optimization steps! ... Avg reward : {:.2f}".format(opt_step,
                                                                                          np.mean(self.episode_rewards_window)))
                # save the policy
                torch.save(self.policy, 'ppo-crawler.policy')
                break
                
        return self.episode_rewards
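
The inner optimization loop above uses the clipped surrogate objective from PPO. As a small, self-contained illustration of that one step, here is a NumPy sketch; `clipped_surrogate_loss` is a hypothetical helper, and the numbers are made up to show the clipping at work.

```python
import numpy as np

def clipped_surrogate_loss(log_probs_new, log_probs_old, advantages, clip=0.2):
    """PPO policy loss: negative mean of the pessimistic (min) surrogate."""
    ratio = np.exp(log_probs_new - log_probs_old)            # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip, 1.0 + clip) * advantages
    return -np.minimum(unclipped, clipped).mean()

log_old = np.log(np.array([0.5, 0.5]))
log_new = np.log(np.array([1.0, 0.25]))   # probability ratios of 2.0 and 0.5
adv = np.array([1.0, -1.0])
print(clipped_surrogate_loss(log_new, log_old, adv))
```

In this example the first ratio (2.0) is capped at 1.2 because the advantage is positive, and the second ratio (0.5) is raised to 0.8 because the minimum picks the more pessimistic term for a negative advantage, so the loss comes out at roughly -0.2. Taking the minimum removes the incentive to push the policy far from the one that collected the data.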

### 8. Train the agent

In [None]:
# let's define and train our agent
agent = Agent(env=env,learning_rate=1e-4)

In [None]:
scores = agent.train(max_opt_steps=2000,gamma=0.98,target_score=600,use_gae=True,PRINT_EVERY=1)
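
The call above enables `use_gae=True` with `gamma=0.98` and the default `gae_tau=0.95`. The sketch below reproduces the same backward recursion for a single trajectory in NumPy so the effect of those two parameters is easier to inspect; `compute_gae` is a hypothetical helper and the reward and value numbers are invented.

```python
import numpy as np

def compute_gae(rewards, values, masks, gamma=0.98, tau=0.95):
    """Generalized advantage estimates for one trajectory.

    rewards, masks: length T; values: length T + 1 (includes the bootstrap value).
    masks[t] is 0.0 where the episode ended at step t, else 1.0.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        # one-step TD error, with the bootstrap cut off at episode ends
        td_error = rewards[t] + gamma * masks[t] * values[t + 1] - values[t]
        # exponentially weighted sum of future TD errors
        running = td_error + gamma * tau * masks[t] * running
        advantages[t] = running
    return advantages

rewards = np.array([1.0, 1.0, 1.0])
values = np.array([0.0, 0.0, 0.0, 0.0])   # a critic that predicts zero everywhere
masks = np.array([1.0, 1.0, 0.0])         # the episode ends on the last step
print(compute_gae(rewards, values, masks))
```

With `tau` close to 1 the estimate approaches the discounted return minus the value baseline (low bias, high variance); with `tau` close to 0 it collapses to the one-step TD error.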

In [11]:
# load the previously trained Crawler policy
policy =  torch.load('ppo-crawler.policy',map_location='cpu')
agent = Agent(env)
agent.policy = policy

In [15]:
frames = []
total_rewards = np.zeros(12)

# reset the environment
env_info = env.reset(train_mode=False)[brain_name]
states = np.array(env_info.vector_observations)
value = []
r = []
for t in range(2000):
    prediction = agent.policy.act(states)
    action  = prediction['a'].cpu().numpy()
    v = prediction['v'].detach().cpu().numpy()
    #frames.append(env.render(mode='rgb_array')) 
    
    # send all actions to the environment
    env_info = env.step(action)[brain_name]

    next_states = np.array(env_info.vector_observations)         # get next state (for each agent)
    rewards = np.array(env_info.rewards)                         # get reward (for each agent)
    terminals = np.array(env_info.local_done)                    # see if episode finished
    
    #value.append(v.squeeze())
    #r.append(reward)
    states=next_states
    total_rewards+= rewards
    
    if np.any(terminals):
        break

# reward of the first agent whose episode ended
# (np.argmax falls back to agent 0 if no episode ended within the step limit)
eps_reward = total_rewards[np.argmax(terminals)]
print ("Total reward:",eps_reward)
#animate_frames(frames)



Total reward: 17.941041075624526
