# Continuous Control

---

You are welcome to use this coding environment to train your agent for the project.  Follow the instructions below to get started!

### 1. Start the Environment

Run the next code cell to install a few packages.  This line will take a few minutes to run!

In [1]:
!pip -q install ./python

ERROR: Invalid requirement: './python'


Both versions of the environment are already saved in the Workspace and can be accessed at the file paths provided below.

Please select one of the two options below for loading the environment.

In [2]:
from unityagents import UnityEnvironment
import numpy as np

# select this option to load version 1 (with a single agent) of the environment
# env = UnityEnvironment(file_name='/data/Reacher_One_Linux_NoVis/Reacher_One_Linux_NoVis.x86_64')

# select this option to load version 2 (with 20 agents) of the environment
# env = UnityEnvironment(file_name='/data/Reacher_Linux_NoVis/Reacher.x86_64')
# this run instead loads a local Windows build of the Crawler environment
env = UnityEnvironment(file_name='./Crawler_Windows_x86_64/Crawler')
# env = UnityEnvironment(file_name='./Reacher_Windows_x86_64/Reacher')

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: CrawlerBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 129
        Number of stacked Vector Observation: 1
        Vector Action space type: continuous
        Vector Action space size (per agent): 20
        Vector Action descriptions: , , , , , , , , , , , , , , , , , , , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [3]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

Run the code cell below to print some information about the environment.

In [4]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])

Number of agents: 12
Size of each action: 20
There are 12 agents. Each observes a state with length: 129
The state for the first agent looks like: [ 0.00000000e+00  0.00000000e+00  0.00000000e+00  2.25000000e+00
  1.00000000e+00  0.00000000e+00  1.78813934e-07  0.00000000e+00
  1.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  6.06093168e-01 -1.42857209e-01 -6.06078804e-01  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  1.33339906e+00 -1.42857209e-01
 -1.33341408e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
 -6.0609

### 3. Take Random Actions in the Environment

In the next code cell, you will learn how to use the Python API to control the agent and receive feedback from the environment.

Note that **in this coding environment, you will not be able to watch the agents while they are training**, and you should set `train_mode=True` to restart the environment.

In [5]:
env_info = env.reset(train_mode=True)[brain_name]      # reset the environment    
states = env_info.vector_observations                  # get the current state (for each agent)
scores = np.zeros(num_agents)                          # initialize the score (for each agent)
count = 0
# reward = []
while count<2048:
    actions = np.random.randn(num_agents, action_size) # select an action (for each agent)
    actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
    env_info = env.step(actions)[brain_name]           # send all actions to the environment
    next_states = env_info.vector_observations         # get next state (for each agent)
    rewards = env_info.rewards                         # get reward (for each agent)
    dones = env_info.local_done                        # see if episode finished
    scores += env_info.rewards                         # update the score (for each agent)
    states = next_states                               # roll over states to next time step
    count = count+1
    if np.any(dones):                                  # exit loop if episode finished
        break
print('Total score (averaged over agents) this episode: {}'.format(np.mean(scores)))
print(count)

Total score (averaged over agents) this episode: 0.2944198167727639
9
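
The action sampling in the loop above can be tried without Unity. The sketch below assumes only NumPy and reuses the 12 agents and 20 action dimensions reported earlier; the helper name `sample_random_actions` and the fixed seed are illustrative choices, not part of the original notebook.

```python
import numpy as np

def sample_random_actions(num_agents, action_size, rng):
    """Draw one standard-normal action vector per agent and clip it to [-1, 1]."""
    actions = rng.standard_normal((num_agents, action_size))
    return np.clip(actions, -1.0, 1.0)

rng = np.random.default_rng(0)          # fixed seed, chosen only for repeatability
actions = sample_random_actions(12, 20, rng)
print(actions.shape)                    # (12, 20)
print(actions.min() >= -1.0, actions.max() <= 1.0)
```

Because roughly a third of standard-normal draws fall outside [-1, 1], a noticeable share of the entries ends up exactly at one of the two bounds.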


When finished, you can close the environment.

In [6]:
# env.close()

### 4. It's Your Turn!

Now it's your turn to train your own agent to solve the environment!  A few **important notes**:
- When training the agent, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```
- To structure your work, you're welcome to work directly in this Jupyter notebook, or you might like to start over with a new file!  You can see the list of files in the workspace by clicking on **_Jupyter_** in the top left corner of the notebook.
- In this coding environment, you will not be able to watch the agents while they are training.  However, **_after training the agents_**, you can download the saved model weights to watch the agents on your own machine! 

### Create the Model Class

In [7]:
# create neural network policy and optimizer
import torch
import torch.nn as nn
import torch.nn.functional as F

# set up a neural net that outputs actions, sigma for each action and state value
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

class Policy(nn.Module):
    
    # initialize layers
    def __init__(self, s_size=129, h_size=64, h_size2 = 32, a_size=20):
        super(Policy, self).__init__()

        
        self.fc1 = nn.Linear(s_size, h_size)
#         self.bn1 = nn.BatchNorm1d(h_size)
        self.fc2 = nn.Linear(h_size, h_size)
#         self.bn2 = nn.BatchNorm1d(h_size)
#         self.fc3 = nn.Linear(h_size, h_size2)
#         self.fc4 = nn.Linear(h_size2, h_size2)
        
        self.fc_critic1 = nn.Linear(s_size, h_size)
#         self.critic_bn1 = nn.BatchNorm1d(h_size)
        self.fc_critic2 = nn.Linear(h_size, h_size)
#         self.critic_bn2 = nn.BatchNorm1d(h_size)
#         self.fc_critic3 = nn.Linear(h_size, h_size2)
#         self.fc_critic4 = nn.Linear(h_size2, h_size2)
    
    
        self.fc_sig1 = nn.Linear(s_size, h_size)
#         self.sig_bn1 = nn.BatchNorm1d(h_size)
        self.fc_sig2 = nn.Linear(h_size, h_size)
#         self.sig_bn2 = nn.BatchNorm1d(h_size)
#         self.fc_sig3 = nn.Linear(h_size, h_size2)
#         self.fc_sig4 = nn.Linear(h_size2, h_size2)
        
        self.fc_action = nn.Linear(h_size, a_size)
        self.fc_value = nn.Linear(h_size, 1)
        self.fc_sigma = nn.Linear(h_size, a_size)
        
#         self.std = nn.Parameter(torch.ones(1, a_size))
        # Standard deviations approximated separately
#         self.register_parameter('log_sigma', None)
#         self.sigma = nn.Parameter(torch.ones(1, action_size)*0.5, requires_grad=True)
                
    # forward function for calculating mu, sigma and value
    def forward(self, state):
        
        # mu output
        x = (F.relu(self.fc1(state)))
        x = (F.relu(self.fc2(x)))
#         x = (F.relu(self.fc3(x)))
#         x = (F.relu(self.fc4(x)))
        
        # action outputs
#         mu = (self.fc_action(x))
        mu = torch.tanh(self.fc_action(x))
    
        # sigma output
        x2 = (F.relu(self.fc_sig1(state)))
        x2 = (F.relu(self.fc_sig2(x2)))
#         x2 = (F.relu(self.fc_sig3(x2)))
#         x2 = (F.relu(self.fc_sig4(x2)))
        # use the sigma branch features (x2), not the mu branch features (x)
        sigma = torch.sigmoid(self.fc_sigma(x2))
#         sigma = F.hardtanh(self.fc_sigma(x2), 0.1, 1)
#         sigma = F.softplus(self.sigma).expand(mu.size())

        # critic value output
        x3 = (F.relu(self.fc_critic1(state)))
        x3 = (F.relu(self.fc_critic2(x3)))
#         x3 = (F.relu(self.fc_critic3(x3)))
#         x3 = (F.relu(self.fc_critic4(x3)))
        value = self.fc_value(x3)
        
        mu = torch.clamp(mu, -0.99, 0.99)
        sigma = torch.clamp(sigma, 0.001, 0.999)
        
        return mu, sigma, value
    
    
# instantiate the policy; pass a_size by keyword so it is not mistaken for h_size2
policy = Policy(state_size, 64, a_size=action_size).to(device)

# we use the Adam optimizer with learning rate 5e-5
# optim.SGD is also possible
import torch.optim as optim
optimizer = optim.Adam(policy.parameters(), lr=5e-5)

In [8]:
# test
# statetest = torch.tensor(states, dtype=torch.float, device=device)
# mu, sigma, value = policy(statetest)

### Define the Function That Generates Trajectories

In [9]:
# build a function that collects trajectory data
from torch.distributions import Normal

def collect_trajectories(env, policy, tmax=20, num_agents=12, state_size=129, action_size=20, epi_search = 1.0):
    action_list = []
    state_list = []
    reward_list = []
    prob_list = []
    value_list = []
    mask = []
    scores = np.zeros(num_agents)
    scores_list = []
    
    # reset environment
    brain_name = env.brain_names[0]
    env_info = env.reset(train_mode=True)[brain_name] 
    
    # get starting state
    state_current = env_info.vector_observations
    
    for t in range(tmax):
        # append current state to state list
        state_list.append(state_current)
        
        # map current state to mu_current, sigma_current
        state_current = torch.from_numpy(state_current).to(torch.float).to(device)
        mu_current, sigma_current, value_current = policy(state_current)
        
        # create distribution object
#         sigma_rollout = torch.ones_like(sigma_current)*epi_search
        dist = Normal(mu_current, sigma_current)
        
        # get action
        act_current = dist.sample()
        act_current = torch.clamp(act_current, -1, 1)
        
        # calculate the probability density of act_current (one value per action dimension)
        prob = torch.exp(dist.log_prob(act_current))
        prob = prob.cpu().detach().numpy()
        
        # get reward, done flag and state_next
        act_current = act_current.cpu().detach().numpy()
        env_info = env.step(act_current)[brain_name]
        
        # some rewards are NaN; replace those with a negative reward
        reward = env_info.rewards
        reward = np.asarray(reward)
        reward[np.isnan(reward)] = -5.0
        
        # retrieve done flag and next state
        done = env_info.local_done 
        state_next = env_info.vector_observations
        
        # accumulate scores for each rollout before any agent is done
        scores = scores + reward
        
        # append state_current, act_current, reward, action prob and state value into list
        action_list.append(act_current)
        reward_list.append(reward)
        prob_list.append(prob)        
        value_list.append(value_current.cpu().detach().numpy())
        
        
        # if any agent is done (or the horizon is reached), mark the step as done for all agents,
        # then reset the environment and get a new starting state
        
        if any(done) or t+1==tmax:

            # since one agent is done, all agents terminate, so set the done flag to True for every agent
            for i in range(len(done)):
                done[i] = True
            
            # restart env    
            brain_name = env.brain_names[0]
            env_info = env.reset(train_mode=True)[brain_name]
            
            # retrieve next starting state
            state_next = env_info.vector_observations
            
            # append scores to scores list
            scores_list.append(scores)
            
            # reset scores array to zero in order to accumulate scores again for next rollout
            scores = scores*0
            
        # append done mask
        mask.append(done)
        
        # assign state_next to state_current
        state_current = state_next
        
    # convert the lists to numpy arrays
    state_list = np.array(state_list)
    action_list = np.array(action_list)
    prob_list = np.array(prob_list)
    value_list = np.array(value_list)
    reward_list = np.expand_dims(np.array(reward_list), -1)
    scores_list = np.array(scores_list)
    
    # convert the boolean done flags into a continuation mask (1 = continue, 0 = done)
    mask = np.array(mask)
    mask = np.where(mask, 0, 1)
    mask = np.expand_dims(mask, -1)
    mask[-1] = 0
    
    
    return state_list, action_list, prob_list, reward_list, value_list, mask, scores_list
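
`collect_trajectories` stores `torch.exp(dist.log_prob(act_current))`, which is one density value per action dimension. For reference, the NumPy sketch below reproduces that per-dimension Gaussian log-density and also shows the joint log-probability of the whole action vector, obtained by summing over dimensions. The function name `normal_log_prob` and the toy sizes are hypothetical, and the joint version is shown as an alternative, not as what the notebook does.

```python
import numpy as np

def normal_log_prob(action, mu, sigma):
    """Element-wise log-density of independent Normal(mu, sigma) distributions."""
    return (-0.5 * ((action - mu) / sigma) ** 2
            - np.log(sigma)
            - 0.5 * np.log(2.0 * np.pi))

mu = np.zeros((2, 3))        # 2 agents, 3 action dimensions (toy sizes)
sigma = np.ones((2, 3))
action = np.zeros((2, 3))

per_dim = normal_log_prob(action, mu, sigma)   # shape (2, 3), one value per dimension
joint = per_dim.sum(axis=-1)                   # shape (2,), one value per agent
print(per_dim.shape, joint.shape)
```

Working with summed log-probabilities rather than per-dimension densities yields a single probability ratio per agent and time step, which is one common way to set up the PPO ratio.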

In [10]:
# test
# states, actions, probs, rewards, values, mask, scores = collect_trajectories(env, policy, 20, 
#                                                                              num_agents, state_size, action_size, 1.0)

# print((scores))
# print(mask.squeeze(-1))
# print(np.mean(scores))
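
The done-flag handling at the end of `collect_trajectories` can be checked on a toy rollout. The sketch below uses made-up flags for 4 time steps and 2 agents and mirrors the conversion to a continuation mask, where 1 means the episode continues and 0 means it ended.

```python
import numpy as np

# made-up done flags: 4 time steps, 2 agents; the episode ends at step 1
done_flags = [[False, False],
              [True,  True],
              [False, False],
              [False, False]]

mask = np.array(done_flags)
mask = np.where(mask, 0, 1)          # 1 = episode continues, 0 = episode ended
mask = np.expand_dims(mask, -1)      # shape (4, 2, 1), matching the reward array
mask[-1] = 0                         # the rollout is always cut at the last step
print(mask.squeeze(-1))
```

The zeros mark the points where discounted returns and advantages must not flow backwards across an episode boundary.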

### Define the Functions That Calculate the Loss

In [11]:
# helper function for normalizing rewards and advantage estimates
def normalization(rewards):
    rewards_mean = np.mean(rewards, 1, keepdims=True)
    rewards_std = np.std(rewards, 1, keepdims=True)+1e-10
    rewards_normalize = (rewards-rewards_mean)/rewards_std
    
    return rewards_normalize

# helper function for calculating discounted future rewards, td_error, advantage and monte-carlo advantage
def Cal_GAE(rewards, values, mask, reward_discount = 0.99, adv_discount = 0.95): 
    # create variables for storing cumulative rewards
    reward_future = np.zeros(np.shape(rewards))
    reward_future_pre = np.zeros(np.shape(rewards[0]))
    
    # create variables for storing cumulative advantage estimation
    advantage = np.zeros(np.shape(rewards))
    advantage_pre = np.zeros(np.shape(rewards[0]))
    
    # create variables for storing td_error
    td_error = np.zeros(np.shape(rewards))
    
    # loop in reversed order
    for i in reversed(range(len(rewards))):
        
        # calculate cumulative reward
        reward_future[i] = rewards[i] + reward_discount*reward_future_pre*mask[i]
        reward_future_pre = reward_future[i]
        
        # calculate td_error
        if i == len(rewards)-1:
            td_error[i] = rewards[i] - values[i]
        else:
            td_error[i] = rewards[i] + reward_discount*values[i+1]*mask[i] - values[i]
            
        # calculate advantage
        advantage[i] = td_error[i] + reward_discount*adv_discount*advantage_pre*mask[i]
        advantage_pre = advantage[i]
        
    # calculate monte-carlo advantage once all discounted returns are available
    mc_error = reward_future - values

    return reward_future, td_error, advantage, mc_error

In [12]:
# test cell
# reward_future, td_error, advantage, mc_error = Cal_GAE(rewards, values, mask)
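
To make the backward recursion in `Cal_GAE` concrete, here is a minimal single-agent sketch on a three-step rollout. The function name `gae_1d` and the toy numbers are assumptions for illustration; the update rules follow the ones used above, including the choice not to bootstrap from a value estimate at the final step.

```python
import numpy as np

def gae_1d(rewards, values, mask, gamma, lam):
    """Discounted returns and GAE advantages for one agent, computed backwards."""
    n = len(rewards)
    returns = np.zeros(n)
    advantages = np.zeros(n)
    next_return, next_adv = 0.0, 0.0
    for i in reversed(range(n)):
        returns[i] = rewards[i] + gamma * next_return * mask[i]
        next_value = values[i + 1] if i < n - 1 else 0.0   # no bootstrap at the end
        td = rewards[i] + gamma * next_value * mask[i] - values[i]
        advantages[i] = td + gamma * lam * next_adv * mask[i]
        next_return, next_adv = returns[i], advantages[i]
    return returns, advantages

rewards = np.array([1.0, 1.0, 1.0])
values = np.array([0.5, 0.5, 0.5])
mask = np.array([1.0, 1.0, 0.0])       # the last step terminates the rollout
returns, advantages = gae_1d(rewards, values, mask, gamma=0.9, lam=0.5)
print(returns)      # approximately [2.71, 1.9, 1.0]
print(advantages)   # approximately [1.47875, 1.175, 0.5]
```

Working through the last step by hand: the TD error is 1 - 0.5 = 0.5, and since nothing follows it, the advantage is also 0.5. One step earlier the TD error is 1 + 0.9 * 0.5 - 0.5 = 0.95, and the advantage adds 0.9 * 0.5 * 0.5 = 0.225 on top.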

In [13]:
# function for calculating loss
def clipped_surrogate(policy, states, actions, prob_old, reward_future, td_error, advantage, mc_error, 
                      epsilon=0.2, beta=0.01, mse_w = 0.5, mode = "advantage"):
       
    # convert reward_future, mc_error, td_error and advantage to tensors
    reward_future = torch.tensor(reward_future, dtype=torch.float, device=device)
    mc_error = torch.tensor(mc_error, dtype=torch.float, device=device)
    td_error = torch.tensor(td_error, dtype=torch.float, device=device)
    advantage = torch.tensor(advantage, dtype=torch.float, device=device)
    
    # calculate mu, sigma and state value
    states = torch.tensor(states, dtype=torch.float, device=device)
    mu, sigma, value = policy(states)
#     d = np.shape(states)
#     d_action = np.shape(actions)
#     d_value = np.shape(reward_future)
#     states_shaped = np.reshape(states, (d[0]*d[1], d[2]))
#     states_shaped = torch.tensor(states_shaped, dtype=torch.float, device=device)
#     mu, sigma, value = policy(states_shaped)
#     mu = torch.reshape(mu, (d_action))
#     sigma = torch.reshape(sigma, (d_action))
#     value = torch.reshape(value, (d_value))
    
    # calculate current action probability
    actions = torch.tensor(actions, dtype=torch.float, device=device)
    dist_training = Normal(mu, sigma)
    prob_new = torch.exp(dist_training.log_prob(actions))
    
    # calculate probability ratio = prob_new/prob_old
    prob_old = torch.tensor(prob_old, dtype=torch.float, device=device)
    ratio = prob_new/prob_old

    # calculate clipped loss function
    ratio_clip = torch.clamp(ratio, 1-epsilon, 1+epsilon)
    if mode == "MC":
        clipped_surrogate_loss = -torch.mean(torch.min(ratio*mc_error, ratio_clip*mc_error))
    elif mode == "TD":
        clipped_surrogate_loss = -torch.mean(torch.min(ratio*td_error, ratio_clip*td_error))
    else:
        clipped_surrogate_loss = -torch.mean(torch.min(ratio*advantage, ratio_clip*advantage))

    # calculate MSE loss value 
    mse_loss = torch.mean((reward_future - value).pow(2))
    
    # calculate entropy regularization term
    entropy_loss = -torch.mean(dist_training.entropy())
    
    # total loss
    total_loss = clipped_surrogate_loss + mse_w*mse_loss + beta*entropy_loss
    
    return total_loss, clipped_surrogate_loss, mse_loss, entropy_loss

In [14]:
# test cell
# L1, L2, L3, L4 = clipped_surrogate(policy, states, actions, probs, reward_future, td_error, advantage, mc_error)
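
The effect of the clipping in `clipped_surrogate` is easiest to see on a few hand-picked probability ratios. The NumPy sketch below is an illustration only (the helper name `clipped_objective` is hypothetical); it evaluates the per-sample term whose negative mean forms the policy loss.

```python
import numpy as np

def clipped_objective(ratio, advantage, epsilon=0.2):
    """Per-sample PPO clipped objective: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    ratio_clip = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    return np.minimum(ratio * advantage, ratio_clip * advantage)

ratio = np.array([0.5, 1.0, 1.5])
positive = clipped_objective(ratio, np.ones(3))     # advantage = +1
negative = clipped_objective(ratio, -np.ones(3))    # advantage = -1
print(positive)   # approximately [0.5, 1.0, 1.2]
print(negative)   # approximately [-0.8, -1.0, -1.5]
```

With a positive advantage the gain is capped once the ratio exceeds 1 + epsilon, so there is no incentive to push the new policy further. With a negative advantage the minimum keeps the more pessimistic of the two values.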

### Main Training Loop

In [None]:
# training config
episode = 100
tmax = 32768
step = [1]
minibatch_size = [1024]
load_flag = False
shuffle_flag = True
normalization_flag = True
GAE_mode = "GAE"
saving_flag = False
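
With `episode = 100`, it can help to know how far the clipping range and the entropy weight decay over a full run. The sketch below assumes the starting values and per-episode decay factors set inside the training function (0.2 with 0.998 for `epsilon`, 0.01 with 0.995 for `beta`); the helper name `decayed` is hypothetical.

```python
def decayed(initial, decay, num_updates):
    """Value of a coefficient after num_updates multiplicative decays."""
    return initial * decay ** num_updates

episode = 100
epsilon_final = decayed(0.2, 0.998, episode)    # clipping range after the run
beta_final = decayed(0.01, 0.995, episode)      # entropy weight after the run
print(round(epsilon_final, 4), round(beta_final, 5))
```

Under these settings `epsilon` shrinks only mildly, to roughly 0.164, while `beta` drops to roughly 0.0061, so the entropy bonus fades faster than the clipping range narrows.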

In [None]:
# imports for timing, plotting and the progress bar
import os
# import requests
import time
import matplotlib.pyplot as plt
from collections import deque
!pip install progressbar
import progressbar as pb
from parallelEnv import parallelEnv

# main training function
# keep track of how long training takes

def training(env, policy, tmax, step, minibatch_size, load_flag, shuffle_flag, normalization_flag, GAE_mode, saving_flag):
    
    # hyperparameters
    discount_reward = .99
    discount_advantage = 0.9
    epsilon = 0.2
    epsilon_decay = 0.998
    beta = 0.01
    beta_decay = 0.995
    mse_w = 0.5
    epi_search = 1.0
    
    # load policy if needed
    if load_flag == True:
        policy = torch.load('PPO_Crawler.policy')

    # widget bar to display progress
    widget = ['training loop: ', pb.Percentage(), ' ', 
              pb.Bar(), ' ', pb.ETA() ]
    timer = pb.ProgressBar(widgets=widget, maxval=episode).start()

    # keep track of progress
    mean_rewards = []
    loss_clip_his = []
    loss_mse_his = []
    loss_entropy_his = []
    # mean_rewards_window = deque(maxlen=100)  # last 100 scores

    for e in range(episode):

        # collect trajectories, returns states_list, actions_list, probs_list, rewards_list, values_list, done mask
        # and scores for each rollout in the tmax horizon
        states, actions, probs, rewards, values, mask, scores = collect_trajectories(env, policy, tmax, num_agents, 
                                                                   state_size, action_size, epi_search)

        # compute discounted future rewards, td_error, advantage and monte-carlo advantage
        reward_future, td_error, advantage, mc_error = Cal_GAE(rewards, values, mask, 
                                                               discount_reward, discount_advantage)

        # normalize advantage estimations
        if normalization_flag == True:
            td_error = normalization(td_error)
            advantage = normalization(advantage)
            mc_error = normalization(mc_error)

        # create temporary variable to accumulate loss for each epoch
        L_clip_temp = 0
        L_mse_temp = 0
        L_entropy_temm = 0

        # loop through each step
        for _ in range(step):

            indices = np.arange(len(states))
            # random shuffle indices for training
            if shuffle_flag == True:
                np.random.shuffle(indices)

            # calculate SGD_epoch based on (length of data)/minibatch
            SGD_epoch = max(len(states)//minibatch_size, 1)

            for i in range(SGD_epoch):

                # retrieve minibatch indices
                ind_start = i*minibatch_size
                if (ind_start+minibatch_size) > (len(states)-minibatch_size):
                    ind_end = len(states)
                else:
                    ind_end = ind_start+minibatch_size
                indices_train = indices[ind_start:ind_end]

                # retrieve minibatch data
                states_mini = states[indices_train]
                actions_mini = actions[indices_train]
                probs_old_mini = probs[indices_train]
                reward_future_mini = reward_future[indices_train]
                td_error_mini = td_error[indices_train]
                advantage_mini = advantage[indices_train]
                mc_error_mini = mc_error[indices_train]

                # calculate loss function and keep track of clip loss, mse loss and entropy loss
                L, L_clip, L_mse, L_entropy = clipped_surrogate(policy, states_mini, actions_mini, probs_old_mini, 
                                                                reward_future_mini, td_error_mini, 
                                                                advantage_mini, mc_error_mini,
                                                                epsilon, beta, mse_w, GAE_mode)

                # accumulate loss history as plain floats so the autograd graph is not retained
                L_clip_temp = L_clip_temp + L_clip.item()
                L_mse_temp = L_mse_temp + L_mse.item()
                L_entropy_temm = L_entropy_temm + L_entropy.item()

                # optimize neural network parameters
                optimizer.zero_grad()
                L.backward()
                optimizer.step()
                del L, L_clip, L_mse, L_entropy

        # the clipping parameter reduces as time goes on
        epsilon*=epsilon_decay

        # the regularization term also reduces
        # this reduces exploration in later runs
        beta*=beta_decay
    #     if beta<1e-4:
    #         beta = 1e-4

        # epi_search update
    #     epi_search = epi_search*0.995
    #     if epi_search<0.001:
    #         epi_search = 0.001

        # record the mean score and the mean of each loss term over all minibatch updates
        mean_rewards.append(np.mean(scores))
        loss_clip_his.append(L_clip_temp/(step*SGD_epoch))
        loss_mse_his.append(L_mse_temp/(step*SGD_epoch))
        loss_entropy_his.append(L_entropy_temm/(step*SGD_epoch))

        # display some progress every 10 iterations
        if (e+1)%10 ==0 :
            print("Episode: {0:d}, score: {1:f}".format(e+1,np.mean(scores)))

        # update progress widget bar
        timer.update(e+1)

    timer.finish()
    
    return mean_rewards, loss_mse_his, loss_clip_his, loss_entropy_his
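
The minibatch slicing inside the training function lets the final minibatch absorb whatever is left over when the rollout length is not a multiple of the minibatch size. The pure-Python sketch below isolates that index logic; the helper name `minibatch_slices` is hypothetical and the sizes are toy values.

```python
def minibatch_slices(num_samples, minibatch_size):
    """Return (start, end) index pairs; the final pair absorbs any remainder."""
    num_batches = max(num_samples // minibatch_size, 1)
    slices = []
    for i in range(num_batches):
        start = i * minibatch_size
        if start + minibatch_size > num_samples - minibatch_size:
            end = num_samples          # last batch: take everything that is left
        else:
            end = start + minibatch_size
        slices.append((start, end))
    return slices

print(minibatch_slices(10, 4))   # [(0, 4), (4, 10)]
print(minibatch_slices(8, 4))    # [(0, 4), (4, 8)]
print(minibatch_slices(3, 4))    # [(0, 3)]
```

Since the indices run over the first axis of the rollout arrays, each minibatch holds whole time steps across all agents. With `tmax = 32768` and a minibatch size of 1024, this gives 32 updates per pass over the data.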

In [None]:
scores_train = []
# L_mse_per_train = []
# L_ent_per_train = []
# L_clip_per_train =[]
for s in step:
    for mini_size in minibatch_size:
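        # declare a fresh policy and rebuild the optimizer so it updates this policy's parameters
        policy = Policy(state_size, 64, a_size=action_size).to(device)
        optimizer = optim.Adam(policy.parameters(), lr=5e-5)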
        mean_rewards, loss_mse_his, loss_clip_his, loss_entropy_his = training(env, policy, tmax, s, mini_size, load_flag, 
                                                                             shuffle_flag, normalization_flag, 
                                                                             GAE_mode, saving_flag)
        scores_train.append(mean_rewards)

In [None]:
for p in range(len(scores_train)):
    plt.figure(p)
    plt.plot(scores_train[p])

In [None]:
# save policy
if saving_flag == True:
    torch.save(policy, 'PPO_Crawler.policy')

In [None]:
# test performance
env_info = env.reset(train_mode=True)[brain_name]      # reset the environment    
states = env_info.vector_observations                  # get the current state (for each agent)
scores = np.zeros(num_agents)                          # initialize the score (for each agent)
count = 0
# reward = []
while count < 1000:
    states_test = torch.tensor(states, dtype=torch.float, device=device)
    mu, sigma, _ = policy(states_test)
    actions = mu.cpu().detach().numpy()
#     actions = np.random.randn(num_agents, action_size) # select an action (for each agent)
    actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
    env_info = env.step(actions)[brain_name]           # send all actions to the environment
    next_states = env_info.vector_observations         # get next state (for each agent)
    rewards = env_info.rewards                         # get reward (for each agent)
    dones = env_info.local_done                        # see if episode finished
    scores += env_info.rewards                         # update the score (for each agent)
    states = next_states                               # roll over states to next time step
    count = count+1
    if np.any(dones):                                  # exit loop if episode finished
        break
print('Total score (averaged over agents) this episode: {}'.format(np.mean(scores)))
print(count)

In [24]:
env.close()