### 1. Start the Environment

In [3]:
from gym_unity.envs import UnityEnv
import numpy as np

# Some more magic so that the notebook will reload external python modules;
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

**_Before running the code cell below_**, change the `env_name` path to match the location of the Crawler Unity environment on your machine.

For instance, if the Linux build of the static-target Crawler is stored under `unity_envs/`, the environment is created as follows:
```
env = UnityEnv('unity_envs/Crawler_StaticTarget_Linux/Crawler_StaticTarget_Linux.x86_64', worker_id=1, use_visual=False, multiagent=True)
```

In [5]:
env_name = 'unity_envs/Crawler_StaticTarget_Linux/Crawler_StaticTarget_Linux.x86_64'
env = UnityEnv(env_name,worker_id=1,use_visual=False, multiagent=True)

INFO:mlagents.envs:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of Training Brains : 1
        Reset Parameters :
		
Unity brain name: CrawlerStaticLearning
        Number of Visual Observations (per agent): 0
        Vector Observation space size (per agent): 129
        Number of stacked Vector Observation: 1
        Vector Action space type: continuous
        Vector Action space size (per agent): [20]
        Vector Action descriptions: , , , , , , , , , , , , , , , , , , , 
INFO:gym_unity:12 agents within environment.


### 2. Examine the State and Action Spaces

* Set-up: A creature with 4 arms and 4 forearms.
* Goal: Each agent must move its body toward the goal direction without falling.
    * CrawlerStaticTarget: Goal direction is always forward.
    * CrawlerDynamicTarget: Goal direction is randomized.
* Agents: The environment description lists 3 agents linked to a single Brain; the build loaded above reports 12 agents.
* Agent Reward Function (independent):
    * +0.03 times body velocity in the goal direction.
    * +0.01 times body direction alignment with goal direction.
* Brains: One Brain with the following observation/action space.
    * Vector Observation space: The environment description lists 117 variables corresponding to position, rotation, velocity, and angular velocities of each limb plus the acceleration and angular acceleration of the body; the build loaded above reports a size of 129 per agent.
    * Vector Action space: (Continuous) Size of 20, corresponding to target rotations for joints.
    * Visual Observations: None.
* Reset Parameters: None
* Benchmark Mean Reward for CrawlerStaticTarget: 2000
* Benchmark Mean Reward for CrawlerDynamicTarget: 400
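
As a rough illustration of the two reward terms listed above, the sketch below computes a per-step reward from a body velocity, a body forward direction, and a goal direction. The function name `crawler_reward`, the helper functions, and the use of plain dot products are assumptions made for illustration; the actual reward is computed inside the Unity environment and may differ in detail.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def crawler_reward(body_velocity, body_forward, goal_direction):
    """Hypothetical per-step reward built from the two listed terms."""
    goal = normalize(goal_direction)
    # +0.03 times body velocity in the goal direction
    velocity_term = 0.03 * dot(body_velocity, goal)
    # +0.01 times alignment of the body direction with the goal direction
    alignment_term = 0.01 * dot(normalize(body_forward), goal)
    return velocity_term + alignment_term

# Moving at 2 units/s straight toward the goal while facing it.
reward = crawler_reward([2.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0])
print(round(reward, 4))  # 0.07
```

With per-step rewards of this size, an episode return depends mostly on how long the creature keeps moving toward the goal without falling.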

Let's print some information about the environment.

In [18]:
# number of agents
num_agents = env.number_agents
print('Number of agents:', num_agents)

# size of each action
action_size = env.action_space.shape[0]
print('Size of each action:', action_size)

# examine the state space 
states = env.reset()
state_size = env.observation_space.shape[0]
print('There are {} agents. Each observes a state with length: {}'.format(num_agents, state_size))
print('The state for the first agent looks like:', states[0])

Number of agents: 12
Size of each action: 20
There are 12 agents. Each observes a state with length: 129
The state for the first agent looks like: [ 9.99999225e-01  1.27272622e-03  0.00000000e+00  2.25000000e+00
  1.00000036e+00 -3.22882910e-12 -1.19209290e-07  3.22882975e-12
  1.00000000e+00 -3.22882910e-12  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  6.06093168e-01 -1.42857209e-01 -6.06078804e-01  5.00000000e-01
  5.00000000e-01  0.00000000e+00  5.00000000e-01  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  1.33339918e+00 -1.42857209e-01
 -1.33341408e+00  5.00000000e-01  0.00000000e+00  0.00000000e+00
  5.00000000e-01  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
 -6.0609

### 3. Take Random Actions in the Environment

In [23]:
states = env.reset()                 # reset env and get the current state (for each agent)
scores = np.zeros(num_agents)        # initialize the score (for each agent)
step=0
while True:
    # select an action (for each agent)
    actions = list(2*np.random.rand(num_agents, action_size)-1)
    next_states,rewards,dones,_ = env.step(actions)    
    
    # update the score (for each agent)
    scores +=  rewards
    
    # roll over states to next time step
    states = next_states                               
    step+=1
    
    # exit loop if episode finished
    if np.any(dones):                                  
        break
print('Total score (averaged over agents) this episode: {}'.format(np.mean(scores)))

Total score (averaged over agents) this episode: 1.1957619555372123


### 4. Training the agent!

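Now it's our turn to train an agent to solve the environment! When using the lower-level `UnityEnvironment` API, training requires `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```

The `UnityEnv` gym wrapper used in this notebook is instead reset with a plain `env.reset()`, as in the cells above.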

In [25]:
import random
import datetime
import torch
import numpy as np
from collections import deque
import matplotlib.pyplot as plt
%matplotlib inline

# pytorch submodules (torch itself is imported above)
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.distributions import Normal

# imports for rendering outputs in Jupyter.
from JSAnimation.IPython_display import display_animation
from matplotlib import animation
from IPython.display import display

# Some more magic so that the notebook will reload external python modules;
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [26]:
# defining the device
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print ("using",device)

using cuda:0


### 5. Define policy network (Actor Critic style)

In [27]:
action_low = env.action_space.low
action_high = env.action_space.high

# define actor critic network
class ActorCritic(nn.Module):
    
    def __init__(self,state_size,action_size,action_high,action_low,hidden_size=32):
        super(ActorCritic, self).__init__()
        
        # action range
        self.action_high = torch.tensor(action_high).to(device)
        self.action_low = torch.tensor(action_low).to(device)
        
        self.std = nn.Parameter(torch.zeros(action_size))
        
        # common network
        self.fc1 = nn.Linear(state_size,512)
        
        # actor network
        self.fc2_actor = nn.Linear(512,256)
        self.fc3_action = nn.Linear(256,action_size)
        #self.fc3_std = nn.Linear(64,action_size)
        
        # critic network
        self.fc2_critic = nn.Linear(512,64)
        self.fc3_critic = nn.Linear(64,1)
    
    def forward(self,state):
        # common network
        x = F.relu(self.fc1(state))
        
        # actor network
        x_actor = F.relu(self.fc2_actor(x))
        # torch.sigmoid replaces the deprecated F.sigmoid; output lies in (0, 1)
        action_mean = torch.sigmoid(self.fc3_action(x_actor))
        ## rescale action mean
        action_mean_ = (self.action_high-self.action_low)*action_mean + self.action_low
        #action_std = F.sigmoid(self.fc3_std(x_actor))
        
        # critic network
        x_critic = F.relu(self.fc2_critic(x))
        v = self.fc3_critic(x_critic)
        return action_mean_,v
    
    def act(self,state,action=None):
        # converting state from numpy array to pytorch tensor on the "device"
        state = torch.from_numpy(state).float().to(device)
        action_mean,v = self.forward(state)
        prob_dist = Normal(action_mean,F.softplus(self.std))
        if action is None:
            action = prob_dist.sample()
        log_prob = prob_dist.log_prob(action).sum(-1).unsqueeze(-1)
        entropy = prob_dist.entropy().sum(-1).unsqueeze(-1)
        return {'a': action,
                'log_pi_a': log_prob,
                'ent': entropy,
                'mean': action_mean,
                'v': v}
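
The policy above combines three small pieces: a sigmoid output rescaled to the action range, a standard deviation obtained by passing a learned parameter through softplus, and a joint log-probability formed by summing per-dimension Gaussian log-densities. The following is a minimal standard-library sketch of those pieces. The helper names and the action bounds of -1 and 1 are assumptions for illustration; the notebook reads the real bounds from `env.action_space`.

```python
import math

def softplus(x):
    """softplus(x) = log(1 + exp(x)); always positive."""
    return math.log1p(math.exp(x))

def rescale(sigmoid_out, low, high):
    """Map a value in (0, 1) onto the interval (low, high)."""
    return (high - low) * sigmoid_out + low

def gaussian_log_prob(action, mean, std):
    """Log-density of a one-dimensional Normal(mean, std) at `action`."""
    return (-0.5 * ((action - mean) / std) ** 2
            - math.log(std)
            - 0.5 * math.log(2.0 * math.pi))

# The std parameter starts at zero, so the initial std is softplus(0) = ln 2.
initial_std = softplus(0.0)

# Two action dimensions with assumed bounds [-1, 1].
mean = [rescale(s, -1.0, 1.0) for s in (0.5, 0.75)]  # [0.0, 0.5]
action = [0.1, 0.4]

# Independent dimensions: the joint log-probability is the sum over dimensions.
joint_log_prob = sum(gaussian_log_prob(a, m, initial_std)
                     for a, m in zip(action, mean))
print(initial_std, mean, joint_log_prob)
```

One consequence of this design is that the raw `std` parameter may become negative during training while the standard deviation actually used for sampling, `softplus(std)`, stays positive.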

### 6. Storage class

In [28]:
class Storage:
    def __init__(self, size, keys=None):
        if keys is None:
            keys = []
        keys = keys + ['s', 'a', 'r', 'm',
                       'v', 'q', 'pi', 'log_pi', 'ent',
                       'adv', 'ret', 'q_a', 'log_pi_a',
                       'mean']
        self.keys = keys
        self.size = size
        self.reset()

    def add(self, data):
        for k, v in data.items():
            assert k in self.keys
            getattr(self, k).append(v)

    def placeholder(self):
        for k in self.keys:
            v = getattr(self, k)
            if len(v) == 0:
                setattr(self, k, [None] * self.size)

    def reset(self):
        for key in self.keys:
            setattr(self, key, [])

    def cat(self, keys):
        data = [getattr(self, k)[:self.size] for k in keys]
        return map(lambda x: torch.cat(x, dim=0), data)

### 7. PPO agent

In [35]:
from collections import deque
from itertools import accumulate
from torch import tensor  # torch.tensor is a function, not a module

def random_sample(indices, batch_size):
    indices = np.asarray(np.random.permutation(indices))
    batches = indices[:len(indices) // batch_size * batch_size].reshape(-1, batch_size)
    for batch in batches:
        yield batch
    r = len(indices) % batch_size
    if r:
        yield indices[-r:]
        
class Agent:
    
    def __init__(self,env,learning_rate=1e-3):
        self.env = env
        nS = state_size
        nA = action_size
        self.policy = ActorCritic(state_size=nS,hidden_size=128,action_size=nA,
                             action_low=action_low,action_high=action_high).to(device)
        self.optimizer = optim.RMSprop(self.policy.parameters(), lr=learning_rate)
        
        # reset the environment
        self.states = np.array(env.reset())
        
        self.episode_rewards_window = deque(maxlen=100)
        self.episode_rewards = []
        num_trajectories = 12
        self.online_rewards = np.zeros(num_trajectories)
    
        
    def train(self,max_opt_steps=1000,num_trajectories=12,rollout_length=2048,mini_batch_size=64,gamma=.99,
              target_score=-250,use_gae=False,gae_tau=0.95,PRINT_EVERY=100):
        
        for opt_step in range(max_opt_steps):
        
            storage = Storage(rollout_length)
            states = self.states
            for _ in range(rollout_length):
                prediction = self.policy.act(states)
                
                # send all actions to the environment
                next_states,rewards,terminals,_ = self.env.step(list(prediction['a'].cpu().numpy()))
                
                next_states = np.array(next_states)
                rewards = np.array(rewards)                   
                terminals = np.array(terminals)
                
                self.online_rewards += rewards
                # record and reset the running score of every agent whose episode ended
                for i, terminal in enumerate(terminals):
                    if terminal:
                        self.episode_rewards.append(self.online_rewards[i])
                        self.episode_rewards_window.append(self.online_rewards[i])
                        self.online_rewards[i] = 0
                
                storage.add(prediction)
                storage.add({'r': tensor(rewards).unsqueeze(-1).float().to(device),
                             'm': tensor(1 - terminals).unsqueeze(-1).float().to(device),
                             's': tensor(states).to(device)})
                states = next_states

            self.states = states
            prediction = self.policy.act(states)
            storage.add(prediction)
            storage.placeholder()

            advantages = tensor(np.zeros((num_trajectories, 1))).float().to(device)
            returns = prediction['v'].detach()
            for i in reversed(range(rollout_length)):
                returns = storage.r[i] + gamma * storage.m[i] * returns
                if not use_gae:
                    advantages = returns - storage.v[i].detach()
                else:
                    td_error = storage.r[i] + gamma * storage.m[i] * storage.v[i + 1] - storage.v[i]
                    advantages = advantages * gae_tau * gamma * storage.m[i] + td_error
                storage.adv[i] = advantages.detach()
                storage.ret[i] = returns.detach()

            states, actions, log_probs_old, returns, advantages = storage.cat(['s', 'a', 'log_pi_a', 'ret', 'adv'])
            actions = actions.detach()
            log_probs_old = log_probs_old.detach()
            advantages = (advantages - advantages.mean()) / advantages.std()
            
            ppo_ratio_clip = 0.2
            gradient_clip = 0.5
            entropy_weight = 0.0
            
            for _ in range(10):
                sampler = random_sample(np.arange(states.size(0)), mini_batch_size)
                for batch_indices in sampler:
                    batch_indices = tensor(batch_indices).long()
                    sampled_states = states[batch_indices]
                    sampled_actions = actions[batch_indices]
                    sampled_log_probs_old = log_probs_old[batch_indices]
                    sampled_returns = returns[batch_indices]
                    sampled_advantages = advantages[batch_indices]

                    prediction = self.policy.act(sampled_states.cpu().numpy(), sampled_actions)
                    ratio = (prediction['log_pi_a'] - sampled_log_probs_old).exp()
                    obj = ratio * sampled_advantages
                    obj_clipped = ratio.clamp(1.0 - ppo_ratio_clip,
                                              1.0 + ppo_ratio_clip) * sampled_advantages
                    policy_loss = -torch.min(obj, obj_clipped).mean() - entropy_weight * prediction['ent'].mean()

                    value_loss = 0.5 * (sampled_returns - prediction['v']).pow(2).mean()

                    self.optimizer.zero_grad()
                    (policy_loss + value_loss).backward()
                    nn.utils.clip_grad_norm_(self.policy.parameters(), gradient_clip)
                    self.optimizer.step()
            
            #printing progress
            if opt_step % PRINT_EVERY == 0:
                print ("Opt step: {}\t Avg reward: {:.2f}\t std: {}".format(opt_step,np.mean(self.episode_rewards_window),
                                                                             self.policy.std))
                # save the policy
                torch.save(self.policy, 'ppo-crawler.policy')
            
            if np.mean(self.episode_rewards_window)>= target_score:
                print ("Environment solved in {} optimization steps! ... Avg reward : {:.2f}".format(opt_step-100,
                                                                                          np.mean(self.episode_rewards_window)))
                # save the policy
                torch.save(self.policy, 'ppo-crawler.policy')
                break
                
        return self.episode_rewards
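
The two core computations in `train` are the backward pass that builds generalized advantage estimates (the `use_gae` branch) and the clipped surrogate objective. The sketch below restates both on plain Python lists and floats so the arithmetic is easy to follow. The function names `gae_advantages` and `clipped_surrogate` are illustrative and do not appear in the notebook; the logic is intended to mirror the loop above for a single trajectory, not to replace it.

```python
import math

def gae_advantages(rewards, values, masks, gamma=0.99, tau=0.95):
    """Generalized advantage estimates for one trajectory.

    `values` holds one more entry than `rewards`: the last entry is the
    bootstrap value of the state reached after the final step.
    `masks[t]` is 0.0 if step t ended an episode, else 1.0.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        td_error = rewards[t] + gamma * masks[t] * values[t + 1] - values[t]
        running = td_error + gamma * tau * masks[t] * running
        advantages[t] = running
    return advantages

def clipped_surrogate(log_prob_new, log_prob_old, advantage, clip=0.2):
    """PPO objective for one sample (to be maximized)."""
    ratio = math.exp(log_prob_new - log_prob_old)
    clipped_ratio = min(max(ratio, 1.0 - clip), 1.0 + clip)
    # Taking the minimum removes any incentive to push the ratio
    # outside [1 - clip, 1 + clip] in the direction that helps.
    return min(ratio * advantage, clipped_ratio * advantage)

advantages = gae_advantages([1.0, 1.0], [0.0, 0.0, 0.0], [1.0, 1.0],
                            gamma=0.5, tau=1.0)
print(advantages)  # [1.5, 1.0]

# Probability ratio of 1.5 with a positive advantage: the clipped term wins.
print(clipped_surrogate(math.log(1.5), 0.0, 2.0))
```

The notebook's `policy_loss` is the negative mean of this objective over a mini-batch, minus an entropy bonus whose weight is set to zero.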

### 8. Train the agent

In [36]:
# let's define and train our agent
agent = Agent(env=env,learning_rate=1e-4)

In [37]:
scores = agent.train(max_opt_steps=2000,gamma=0.98,target_score=2000,use_gae=True,PRINT_EVERY=1)



Opt step: 0	 Avg reward: -2.61	 std: Parameter containing:
tensor([ 0.0008, -0.0070, -0.0049, -0.0037,  0.0023, -0.0050, -0.0047,  0.0094,
        -0.0068,  0.0006, -0.0048, -0.0010,  0.0019, -0.0048,  0.0026, -0.0032,
        -0.0091,  0.0024,  0.0021, -0.0078], device='cuda:0',
       requires_grad=True)


  "type " + obj.__name__ + ". It won't be checked "


Opt step: 1	 Avg reward: -0.26	 std: Parameter containing:
tensor([-0.0002, -0.0093, -0.0105, -0.0044, -0.0013, -0.0031, -0.0051,  0.0097,
        -0.0046, -0.0035, -0.0014, -0.0022, -0.0058, -0.0033, -0.0010, -0.0046,
        -0.0082, -0.0012,  0.0046, -0.0042], device='cuda:0',
       requires_grad=True)
Opt step: 2	 Avg reward: 4.48	 std: Parameter containing:
tensor([-0.0026, -0.0084, -0.0094, -0.0072, -0.0047, -0.0020, -0.0079,  0.0087,
        -0.0102, -0.0122,  0.0002,  0.0009, -0.0049, -0.0087,  0.0034, -0.0069,
        -0.0054, -0.0098, -0.0078, -0.0079], device='cuda:0',
       requires_grad=True)
Opt step: 3	 Avg reward: 5.29	 std: Parameter containing:
tensor([-1.0853e-02, -1.4654e-02, -1.1256e-02, -5.3427e-03, -6.3788e-03,
        -4.6587e-05, -3.6881e-03, -1.3093e-03, -1.3239e-02, -1.6244e-02,
        -3.5796e-03, -4.7262e-03, -1.4249e-02, -1.4963e-02,  8.9431e-04,
        -1.2529e-02, -1.4089e-02, -7.1098e-03, -3.7697e-03, -1.1113e-02],
       device='cuda:0', requires_g

Opt step: 28	 Avg reward: 38.80	 std: Parameter containing:
tensor([-0.0537, -0.0736, -0.0551, -0.0571, -0.0721, -0.0675, -0.0580, -0.0689,
        -0.0674, -0.0462, -0.0853, -0.1098, -0.0496, -0.0435, -0.1236, -0.0643,
        -0.0688, -0.0806, -0.0542, -0.0846], device='cuda:0',
       requires_grad=True)
Opt step: 29	 Avg reward: 41.28	 std: Parameter containing:
tensor([-0.0531, -0.0687, -0.0542, -0.0611, -0.0715, -0.0769, -0.0627, -0.0673,
        -0.0680, -0.0466, -0.0923, -0.1144, -0.0587, -0.0464, -0.1251, -0.0560,
        -0.0631, -0.0841, -0.0607, -0.0941], device='cuda:0',
       requires_grad=True)
Opt step: 30	 Avg reward: 38.56	 std: Parameter containing:
tensor([-0.0519, -0.0767, -0.0567, -0.0602, -0.0704, -0.0821, -0.0684, -0.0745,
        -0.0767, -0.0514, -0.0944, -0.1137, -0.0575, -0.0454, -0.1246, -0.0619,
        -0.0692, -0.0854, -0.0607, -0.0995], device='cuda:0',
       requires_grad=True)
Opt step: 31	 Avg reward: 32.39	 std: Parameter containing:
tensor([-0.04

Opt step: 55	 Avg reward: 66.38	 std: Parameter containing:
tensor([-0.1205, -0.1224, -0.1506, -0.1095, -0.1232, -0.2085, -0.1986, -0.1505,
        -0.1907, -0.1204, -0.1949, -0.1732, -0.1570, -0.1319, -0.2407, -0.1773,
        -0.1572, -0.1679, -0.0967, -0.1539], device='cuda:0',
       requires_grad=True)
Opt step: 56	 Avg reward: 64.63	 std: Parameter containing:
tensor([-0.1250, -0.1252, -0.1626, -0.1127, -0.1258, -0.2047, -0.2001, -0.1644,
        -0.1851, -0.1327, -0.1960, -0.1793, -0.1635, -0.1288, -0.2498, -0.1870,
        -0.1681, -0.1586, -0.0912, -0.1637], device='cuda:0',
       requires_grad=True)
Opt step: 57	 Avg reward: 66.53	 std: Parameter containing:
tensor([-0.1192, -0.1283, -0.1691, -0.1225, -0.1337, -0.2016, -0.2114, -0.1663,
        -0.1936, -0.1375, -0.2038, -0.1787, -0.1660, -0.1380, -0.2507, -0.1887,
        -0.1681, -0.1539, -0.0989, -0.1740], device='cuda:0',
       requires_grad=True)
Opt step: 58	 Avg reward: 70.97	 std: Parameter containing:
tensor([-0.12

Opt step: 82	 Avg reward: 63.82	 std: Parameter containing:
tensor([-0.1829, -0.1756, -0.2302, -0.1683, -0.2147, -0.2459, -0.2856, -0.2211,
        -0.2584, -0.2598, -0.2074, -0.2468, -0.2068, -0.2265, -0.2992, -0.2529,
        -0.2161, -0.1798, -0.1387, -0.2162], device='cuda:0',
       requires_grad=True)
Opt step: 83	 Avg reward: 79.52	 std: Parameter containing:
tensor([-0.1877, -0.1674, -0.2306, -0.1641, -0.2179, -0.2487, -0.2952, -0.2244,
        -0.2618, -0.2596, -0.2106, -0.2523, -0.2176, -0.2230, -0.3071, -0.2464,
        -0.2073, -0.1876, -0.1375, -0.2201], device='cuda:0',
       requires_grad=True)
Opt step: 84	 Avg reward: 87.73	 std: Parameter containing:
tensor([-0.1891, -0.1712, -0.2312, -0.1667, -0.2149, -0.2536, -0.3019, -0.2268,
        -0.2702, -0.2621, -0.2079, -0.2555, -0.2247, -0.2330, -0.3229, -0.2508,
        -0.2134, -0.1711, -0.1399, -0.2232], device='cuda:0',
       requires_grad=True)
Opt step: 85	 Avg reward: 71.24	 std: Parameter containing:
tensor([-0.19

Opt step: 109	 Avg reward: 93.61	 std: Parameter containing:
tensor([-0.2431, -0.1861, -0.2758, -0.1673, -0.2657, -0.2453, -0.3095, -0.2147,
        -0.3470, -0.3014, -0.2245, -0.2468, -0.2683, -0.2778, -0.3239, -0.2821,
        -0.2053, -0.2355, -0.1909, -0.2625], device='cuda:0',
       requires_grad=True)
Opt step: 110	 Avg reward: 104.28	 std: Parameter containing:
tensor([-0.2619, -0.1914, -0.2714, -0.1693, -0.2788, -0.2396, -0.3004, -0.2283,
        -0.3338, -0.2966, -0.2242, -0.2341, -0.2702, -0.2801, -0.3199, -0.2893,
        -0.2036, -0.2308, -0.1903, -0.2656], device='cuda:0',
       requires_grad=True)
Opt step: 111	 Avg reward: 95.00	 std: Parameter containing:
tensor([-0.2598, -0.1822, -0.2702, -0.1731, -0.2816, -0.2499, -0.3011, -0.2194,
        -0.3375, -0.3060, -0.2306, -0.2301, -0.2743, -0.2744, -0.3105, -0.2855,
        -0.2059, -0.2380, -0.1857, -0.2723], device='cuda:0',
       requires_grad=True)
Opt step: 112	 Avg reward: 92.45	 std: Parameter containing:
tensor([

Opt step: 136	 Avg reward: 114.02	 std: Parameter containing:
tensor([-0.2966, -0.2231, -0.3194, -0.2028, -0.3354, -0.2073, -0.3865, -0.1945,
        -0.3699, -0.3679, -0.2777, -0.2862, -0.2693, -0.3179, -0.3737, -0.3619,
        -0.2107, -0.2465, -0.1730, -0.3342], device='cuda:0',
       requires_grad=True)
Opt step: 141	 Avg reward: 130.50	 std: Parameter containing:
tensor([-0.3117, -0.2219, -0.3417, -0.2141, -0.3223, -0.2412, -0.3966, -0.1708,
        -0.3797, -0.3965, -0.2842, -0.3108, -0.2892, -0.3067, -0.3787, -0.3612,
        -0.2351, -0.2416, -0.1768, -0.3208], device='cuda:0',
       requires_grad=True)
Opt step: 142	 Avg reward: 129.09	 std: Parameter containing:
tensor([-0.3102, -0.2305, -0.3306, -0.2073, -0.3219, -0.2332, -0.4094, -0.1635,
        -0.3840, -0.3919, -0.2918, -0.3195, -0.2889, -0.3113, -0.3759, -0.3702,
        -0.2426, -0.2539, -0.1719, -0.3255], device='cuda:0',
       requires_grad=True)
Opt step: 143	 Avg reward: 124.34	 std: Parameter containing:
tenso

Opt step: 167	 Avg reward: 147.77	 std: Parameter containing:
tensor([-0.3387, -0.2615, -0.3114, -0.2246, -0.3501, -0.2637, -0.3926, -0.2039,
        -0.4384, -0.4151, -0.3314, -0.3578, -0.2846, -0.3102, -0.4500, -0.3709,
        -0.2187, -0.2468, -0.2353, -0.3477], device='cuda:0',
       requires_grad=True)
Opt step: 168	 Avg reward: 174.33	 std: Parameter containing:
tensor([-0.3379, -0.2614, -0.3202, -0.2312, -0.3520, -0.2574, -0.3934, -0.2053,
        -0.4387, -0.4154, -0.3429, -0.3502, -0.2849, -0.3228, -0.4592, -0.3800,
        -0.2163, -0.2471, -0.2367, -0.3507], device='cuda:0',
       requires_grad=True)
Opt step: 169	 Avg reward: 151.22	 std: Parameter containing:
tensor([-0.3434, -0.2616, -0.3251, -0.2350, -0.3548, -0.2674, -0.3838, -0.2089,
        -0.4345, -0.4074, -0.3371, -0.3511, -0.2888, -0.3215, -0.4576, -0.3824,
        -0.2139, -0.2538, -0.2315, -0.3421], device='cuda:0',
       requires_grad=True)
Opt step: 170	 Avg reward: 175.03	 std: Parameter containing:
tenso

Opt step: 194	 Avg reward: 211.90	 std: Parameter containing:
tensor([-0.3927, -0.2416, -0.3980, -0.2545, -0.3691, -0.2688, -0.4218, -0.1983,
        -0.4119, -0.4102, -0.3903, -0.4034, -0.2829, -0.3051, -0.4603, -0.4364,
        -0.2567, -0.2800, -0.2130, -0.3786], device='cuda:0',
       requires_grad=True)
Opt step: 195	 Avg reward: 185.75	 std: Parameter containing:
tensor([-0.4016, -0.2408, -0.3968, -0.2483, -0.3661, -0.2714, -0.4309, -0.1896,
        -0.4070, -0.4164, -0.3988, -0.4096, -0.2684, -0.2958, -0.4527, -0.4316,
        -0.2537, -0.2813, -0.2167, -0.3749], device='cuda:0',
       requires_grad=True)
Opt step: 196	 Avg reward: 220.66	 std: Parameter containing:
tensor([-0.3935, -0.2486, -0.3907, -0.2384, -0.3746, -0.2810, -0.4283, -0.1867,
        -0.4072, -0.4263, -0.3980, -0.4163, -0.2540, -0.2973, -0.4469, -0.4241,
        -0.2670, -0.2841, -0.2083, -0.3627], device='cuda:0',
       requires_grad=True)
Opt step: 197	 Avg reward: 218.78	 std: Parameter containing:
tenso

Opt step: 221	 Avg reward: 218.47	 std: Parameter containing:
tensor([-0.3593, -0.2111, -0.3950, -0.1731, -0.3334, -0.2155, -0.4187, -0.1402,
        -0.4075, -0.3641, -0.4429, -0.4321, -0.2610, -0.2989, -0.4505, -0.4018,
        -0.2454, -0.2603, -0.1947, -0.3544], device='cuda:0',
       requires_grad=True)
Opt step: 222	 Avg reward: 252.79	 std: Parameter containing:
tensor([-0.3615, -0.2191, -0.3950, -0.1715, -0.3383, -0.2031, -0.4198, -0.1379,
        -0.4226, -0.3698, -0.4529, -0.4500, -0.2634, -0.2998, -0.4613, -0.4092,
        -0.2386, -0.2617, -0.1845, -0.3590], device='cuda:0',
       requires_grad=True)
Opt step: 223	 Avg reward: 218.04	 std: Parameter containing:
tensor([-0.3575, -0.2253, -0.4030, -0.1773, -0.3467, -0.1922, -0.4259, -0.1401,
        -0.4245, -0.3751, -0.4567, -0.4563, -0.2729, -0.2964, -0.4684, -0.4018,
        -0.2404, -0.2558, -0.1911, -0.3612], device='cuda:0',
       requires_grad=True)
Opt step: 224	 Avg reward: 269.79	 std: Parameter containing:
tenso

Opt step: 248	 Avg reward: 279.88	 std: Parameter containing:
tensor([-0.2823, -0.2096, -0.3973, -0.1859, -0.3531, -0.1329, -0.3985, -0.1666,
        -0.4156, -0.4100, -0.4897, -0.4714, -0.1757, -0.2463, -0.3961, -0.3987,
        -0.2499, -0.2988, -0.1754, -0.3052], device='cuda:0',
       requires_grad=True)
Opt step: 249	 Avg reward: 310.49	 std: Parameter containing:
tensor([-0.2875, -0.2047, -0.3882, -0.1744, -0.3484, -0.1410, -0.4072, -0.1587,
        -0.4140, -0.4076, -0.4862, -0.4746, -0.1819, -0.2560, -0.3878, -0.3988,
        -0.2402, -0.2982, -0.1705, -0.2929], device='cuda:0',
       requires_grad=True)
Opt step: 250	 Avg reward: 291.82	 std: Parameter containing:
tensor([-0.2967, -0.2023, -0.3855, -0.1761, -0.3488, -0.1464, -0.4131, -0.1590,
        -0.4085, -0.4066, -0.4920, -0.4730, -0.1786, -0.2411, -0.3818, -0.4034,
        -0.2474, -0.2957, -0.1693, -0.2957], device='cuda:0',
       requires_grad=True)
Opt step: 251	 Avg reward: 304.12	 std: Parameter containing:
tenso

Opt step: 278	 Avg reward: 294.80	 std: Parameter containing:
tensor([-0.2553, -0.1955, -0.2823, -0.1419, -0.3940, -0.1312, -0.3303, -0.0930,
        -0.3763, -0.4272, -0.4663, -0.5601, -0.1573, -0.2220, -0.3473, -0.3812,
        -0.2438, -0.2549, -0.1145, -0.3031], device='cuda:0',
       requires_grad=True)
Opt step: 279	 Avg reward: 290.31	 std: Parameter containing:
tensor([-0.2567, -0.2016, -0.2868, -0.1341, -0.3987, -0.1284, -0.3363, -0.0980,
        -0.3766, -0.4281, -0.4615, -0.5555, -0.1586, -0.2221, -0.3468, -0.3795,
        -0.2334, -0.2522, -0.1051, -0.3071], device='cuda:0',
       requires_grad=True)
Opt step: 280	 Avg reward: 279.25	 std: Parameter containing:
tensor([-0.2587, -0.1945, -0.2783, -0.1417, -0.3929, -0.1303, -0.3389, -0.0875,
        -0.3734, -0.4430, -0.4588, -0.5501, -0.1516, -0.2254, -0.3373, -0.3719,
        -0.2364, -0.2373, -0.1134, -0.2879], device='cuda:0',
       requires_grad=True)
Opt step: 284	 Avg reward: 289.13	 std: Parameter containing:
tenso

Opt step: 308	 Avg reward: 368.54	 std: Parameter containing:
tensor([-0.3103, -0.1490, -0.2598, -0.0922, -0.4004, -0.1424, -0.3192, -0.0113,
        -0.2842, -0.4504, -0.4997, -0.6071, -0.1409, -0.1784, -0.2973, -0.3137,
        -0.2475, -0.2036, -0.0999, -0.2623], device='cuda:0',
       requires_grad=True)
Opt step: 309	 Avg reward: 383.62	 std: Parameter containing:
tensor([-0.3161, -0.1311, -0.2639, -0.1036, -0.4076, -0.1347, -0.3240, -0.0031,
        -0.2916, -0.4714, -0.4939, -0.6045, -0.1464, -0.1852, -0.2900, -0.3234,
        -0.2567, -0.2027, -0.0928, -0.2623], device='cuda:0',
       requires_grad=True)
Opt step: 310	 Avg reward: 376.39	 std: Parameter containing:
tensor([-0.3304, -0.1394, -0.2573, -0.1025, -0.4034, -0.1319, -0.3186, -0.0042,
        -0.2843, -0.4693, -0.4956, -0.6157, -0.1391, -0.1855, -0.3063, -0.3227,
        -0.2593, -0.2050, -0.0854, -0.2558], device='cuda:0',
       requires_grad=True)
Opt step: 311	 Avg reward: 327.34	 std: Parameter containing:
tenso

Opt step: 335	 Avg reward: 118.22	 std: Parameter containing:
tensor([-0.2990, -0.1539, -0.2004, -0.0721, -0.3583, -0.0401, -0.2744, -0.0721,
        -0.2965, -0.3623, -0.5214, -0.6027, -0.1607, -0.1314, -0.1894, -0.3174,
        -0.2001, -0.1953, -0.0864, -0.2743], device='cuda:0',
       requires_grad=True)
Opt step: 336	 Avg reward: 89.90	 std: Parameter containing:
tensor([-0.2999, -0.1648, -0.1957, -0.0735, -0.3598, -0.0397, -0.2721, -0.0679,
        -0.2954, -0.3643, -0.5160, -0.5970, -0.1676, -0.1342, -0.1885, -0.3165,
        -0.2025, -0.2021, -0.0923, -0.2766], device='cuda:0',
       requires_grad=True)
Opt step: 337	 Avg reward: 129.60	 std: Parameter containing:
tensor([-0.2908, -0.1530, -0.1904, -0.0656, -0.3691, -0.0310, -0.2710, -0.0762,
        -0.3040, -0.3713, -0.5275, -0.5955, -0.1697, -0.1321, -0.1822, -0.3182,
        -0.2087, -0.1996, -0.0994, -0.2768], device='cuda:0',
       requires_grad=True)
Opt step: 338	 Avg reward: 122.15	 std: Parameter containing:
tensor

Opt step: 362	 Avg reward: 224.71	 std: Parameter containing:
tensor([-0.2428, -0.1580, -0.1745, -0.0905, -0.3677, -0.0288, -0.2266, -0.0148,
        -0.2666, -0.4005, -0.5294, -0.6150, -0.1384, -0.0246, -0.1472, -0.2675,
        -0.1268, -0.1517, -0.1220, -0.2160], device='cuda:0',
       requires_grad=True)
Opt step: 363	 Avg reward: 229.43	 std: Parameter containing:
tensor([-0.2484, -0.1628, -0.1616, -0.0938, -0.3560, -0.0068, -0.2149, -0.0239,
        -0.2580, -0.4032, -0.5160, -0.6015, -0.1339, -0.0185, -0.1461, -0.2575,
        -0.1194, -0.1592, -0.1209, -0.2141], device='cuda:0',
       requires_grad=True)
Opt step: 364	 Avg reward: 237.07	 std: Parameter containing:
tensor([-0.2498, -0.1611, -0.1709, -0.0923, -0.3555, -0.0156, -0.2129, -0.0269,
        -0.2635, -0.4089, -0.5148, -0.6063, -0.1365, -0.0175, -0.1364, -0.2535,
        -0.1318, -0.1547, -0.1224, -0.1985], device='cuda:0',
       requires_grad=True)
Opt step: 365	 Avg reward: 219.67	 std: Parameter containing:
tenso

Opt step: 389	 Avg reward: 137.42	 std: Parameter containing:
tensor([-0.2341, -0.1480, -0.1732, -0.0016, -0.3177, -0.0395, -0.1754,  0.0083,
        -0.2220, -0.3675, -0.4847, -0.6159, -0.0639,  0.0027, -0.0710, -0.2208,
        -0.1134, -0.1031, -0.0994, -0.0951], device='cuda:0',
       requires_grad=True)
Opt step: 390	 Avg reward: 127.20	 std: Parameter containing:
tensor([-0.2266, -0.1405, -0.1703,  0.0098, -0.3110, -0.0435, -0.1707,  0.0148,
        -0.2166, -0.3795, -0.4692, -0.6135, -0.0642, -0.0023, -0.0711, -0.2193,
        -0.0985, -0.1026, -0.0968, -0.1006], device='cuda:0',
       requires_grad=True)
Opt step: 391	 Avg reward: 142.05	 std: Parameter containing:
tensor([-0.2246, -0.1304, -0.1592,  0.0179, -0.3079, -0.0239, -0.1759,  0.0198,
        -0.2166, -0.3920, -0.4663, -0.5979, -0.0510, -0.0098, -0.0713, -0.2244,
        -0.0988, -0.0988, -0.1032, -0.0972], device='cuda:0',
       requires_grad=True)
Opt step: 392	 Avg reward: 129.17	 std: Parameter containing:
tenso

Opt step: 419	 Avg reward: 209.90	 std: Parameter containing:
tensor([-0.1916, -0.1289, -0.1268,  0.0361, -0.2976,  0.0311, -0.1211,  0.0248,
        -0.2321, -0.3675, -0.4770, -0.6017,  0.0683, -0.0658, -0.0330, -0.2838,
        -0.0204, -0.0727, -0.0854, -0.1253], device='cuda:0',
       requires_grad=True)
Opt step: 420	 Avg reward: 266.23	 std: Parameter containing:
tensor([-0.1852, -0.1105, -0.1189,  0.0546, -0.3035,  0.0345, -0.1133,  0.0224,
        -0.2363, -0.3693, -0.4922, -0.6001,  0.0693, -0.0598, -0.0144, -0.2884,
        -0.0212, -0.0769, -0.0853, -0.1357], device='cuda:0',
       requires_grad=True)
Opt step: 421	 Avg reward: 219.16	 std: Parameter containing:
tensor([-0.1721, -0.1148, -0.1302,  0.0619, -0.3122,  0.0348, -0.1248,  0.0162,
        -0.2410, -0.3798, -0.4878, -0.5951,  0.0671, -0.0445, -0.0131, -0.2822,
        -0.0293, -0.0811, -0.0908, -0.1392], device='cuda:0',
       requires_grad=True)
Opt step: 422	 Avg reward: 229.34	 std: Parameter containing:
tenso

Opt step: 446	 Avg reward: 252.95	 std: Parameter containing:
tensor([-0.1637, -0.1203, -0.1393,  0.0660, -0.3475,  0.1125, -0.1478,  0.0563,
        -0.2573, -0.3343, -0.5949, -0.5837,  0.1129, -0.0206,  0.0419, -0.2176,
         0.0370, -0.0920, -0.1381, -0.1128], device='cuda:0',
       requires_grad=True)
Opt step: 447	 Avg reward: 275.15	 std: Parameter containing:
tensor([-0.1678, -0.1313, -0.1472,  0.0846, -0.3498,  0.1107, -0.1494,  0.0545,
        -0.2616, -0.3538, -0.5930, -0.5819,  0.1227, -0.0186,  0.0362, -0.2285,
         0.0510, -0.0912, -0.1486, -0.1153], device='cuda:0',
       requires_grad=True)
Opt step: 448	 Avg reward: 263.38	 std: Parameter containing:
tensor([-0.1575, -0.1344, -0.1433,  0.1041, -0.3720,  0.1143, -0.1407,  0.0592,
        -0.2541, -0.3514, -0.5820, -0.5765,  0.1229, -0.0243,  0.0225, -0.2332,
         0.0494, -0.0940, -0.1480, -0.1228], device='cuda:0',
       requires_grad=True)
Opt step: 449	 Avg reward: 241.04	 std: Parameter containing:
tenso

Opt step: 477	 Avg reward: 244.85	 std: Parameter containing:
tensor([-0.2057, -0.1041, -0.1348,  0.2155, -0.3393,  0.1156, -0.0939,  0.0123,
        -0.2607, -0.3333, -0.5433, -0.5580,  0.1331,  0.0476,  0.0992, -0.1331,
         0.0630, -0.0779, -0.1655, -0.0666], device='cuda:0',
       requires_grad=True)
Opt step: 478	 Avg reward: 277.36	 std: Parameter containing:
tensor([-0.2095, -0.1013, -0.1279,  0.2190, -0.3457,  0.1293, -0.0916,  0.0175,
        -0.2625, -0.3361, -0.5439, -0.5596,  0.1325,  0.0479,  0.1084, -0.1321,
         0.0648, -0.0671, -0.1570, -0.0743], device='cuda:0',
       requires_grad=True)
Opt step: 479	 Avg reward: 274.62	 std: Parameter containing:
tensor([-0.1960, -0.0999, -0.1283,  0.2179, -0.3385,  0.1191, -0.0996,  0.0322,
        -0.2635, -0.3308, -0.5537, -0.5598,  0.1364,  0.0460,  0.1011, -0.1292,
         0.0560, -0.0725, -0.1565, -0.0635], device='cuda:0',
       requires_grad=True)
Opt step: 480	 Avg reward: 250.93	 std: Parameter containing:
tenso

Opt step: 504	 Avg reward: 305.57	 std: Parameter containing:
tensor([-0.1682, -0.0328, -0.1494,  0.2704, -0.4122,  0.2063, -0.0491,  0.0234,
        -0.2242, -0.3722, -0.5239, -0.5644,  0.1212,  0.1216,  0.0835, -0.0941,
         0.0970, -0.0140, -0.1390, -0.0662], device='cuda:0',
       requires_grad=True)
Opt step: 505	 Avg reward: 306.37	 std: Parameter containing:
tensor([-0.1678, -0.0305, -0.1557,  0.2799, -0.4177,  0.1981, -0.0554,  0.0238,
        -0.2253, -0.3617, -0.5374, -0.5655,  0.1256,  0.1120,  0.0886, -0.0896,
         0.0866, -0.0144, -0.1428, -0.0601], device='cuda:0',
       requires_grad=True)
Opt step: 506	 Avg reward: 316.43	 std: Parameter containing:
tensor([-0.1596, -0.0273, -0.1467,  0.2763, -0.4220,  0.2146, -0.0644,  0.0339,
        -0.2091, -0.3535, -0.5451, -0.5588,  0.1304,  0.1119,  0.0957, -0.0778,
         0.0907, -0.0150, -0.1402, -0.0513], device='cuda:0',
       requires_grad=True)
Opt step: 507	 Avg reward: 312.39	 std: Parameter containing:
tenso

Opt step: 531	 Avg reward: 339.52	 std: Parameter containing:
tensor([-0.1803, -0.0008, -0.0484,  0.2488, -0.4595,  0.2881, -0.0707,  0.0634,
        -0.2379, -0.2960, -0.5928, -0.5744,  0.1722,  0.1698,  0.1408, -0.1005,
         0.0829,  0.0322, -0.0722, -0.0457], device='cuda:0',
       requires_grad=True)
Opt step: 532	 Avg reward: 306.72	 std: Parameter containing:
tensor([-1.7712e-01,  5.7997e-04, -4.9542e-02,  2.4545e-01, -4.5693e-01,
         2.8636e-01, -7.2966e-02,  6.5863e-02, -2.4201e-01, -3.0116e-01,
        -5.8705e-01, -5.6764e-01,  1.7549e-01,  1.6776e-01,  1.4684e-01,
        -9.9634e-02,  7.5998e-02,  3.0645e-02, -6.4333e-02, -4.2958e-02],
       device='cuda:0', requires_grad=True)
Opt step: 533	 Avg reward: 329.59	 std: Parameter containing:
tensor([-0.1850,  0.0021, -0.0452,  0.2489, -0.4574,  0.2871, -0.0767,  0.0638,
        -0.2464, -0.3073, -0.5855, -0.5690,  0.1683,  0.1618,  0.1487, -0.1044,
         0.0632,  0.0335, -0.0558, -0.0339], device='cuda:0',
      

Opt step: 558	 Avg reward: 281.13	 std: Parameter containing:
tensor([-0.1642,  0.0724, -0.0670,  0.2597, -0.4444,  0.2584, -0.1095,  0.0197,
        -0.2237, -0.3326, -0.6107, -0.5658,  0.1052,  0.1382,  0.1672, -0.0741,
         0.0913,  0.0684, -0.0874, -0.0234], device='cuda:0',
       requires_grad=True)
Opt step: 559	 Avg reward: 348.25	 std: Parameter containing:
tensor([-0.1646,  0.0799, -0.0616,  0.2620, -0.4490,  0.2524, -0.1164,  0.0143,
        -0.2186, -0.3285, -0.6139, -0.5704,  0.1147,  0.1189,  0.1630, -0.0775,
         0.1014,  0.0570, -0.0875, -0.0182], device='cuda:0',
       requires_grad=True)
Opt step: 560	 Avg reward: 312.71	 std: Parameter containing:
tensor([-0.1683,  0.0878, -0.0514,  0.2547, -0.4542,  0.2486, -0.1140,  0.0185,
        -0.2160, -0.3353, -0.6135, -0.5599,  0.1113,  0.1248,  0.1653, -0.0878,
         0.0986,  0.0546, -0.0902, -0.0114], device='cuda:0',
       requires_grad=True)
Opt step: 561	 Avg reward: 275.91	 std: Parameter containing:
tenso

Opt step: 584	 Avg reward: 304.12	 std: Parameter containing:
tensor([-0.1544,  0.1092, -0.0624,  0.2426, -0.4261,  0.2453, -0.1358, -0.0387,
        -0.2670, -0.3148, -0.6033, -0.6224,  0.1461,  0.1453,  0.2028, -0.0206,
         0.1786,  0.0287, -0.0481,  0.0179], device='cuda:0',
       requires_grad=True)
Opt step: 585	 Avg reward: 319.11	 std: Parameter containing:
tensor([-0.1531,  0.1149, -0.0608,  0.2417, -0.4266,  0.2474, -0.1411, -0.0337,
        -0.2448, -0.3203, -0.6057, -0.6212,  0.1542,  0.1395,  0.2018, -0.0226,
         0.1823,  0.0202, -0.0429,  0.0162], device='cuda:0',
       requires_grad=True)
Opt step: 586	 Avg reward: 322.22	 std: Parameter containing:
tensor([-0.1580,  0.0984, -0.0538,  0.2488, -0.4312,  0.2494, -0.1383, -0.0473,
        -0.2550, -0.3122, -0.6213, -0.6217,  0.1591,  0.1390,  0.2000, -0.0236,
         0.1888,  0.0076, -0.0459,  0.0099], device='cuda:0',
       requires_grad=True)
Opt step: 587	 Avg reward: 369.77	 std: Parameter containing:
tenso

Opt step: 611	 Avg reward: 358.17	 std: Parameter containing:
tensor([-0.1366,  0.0982, -0.0583,  0.2049, -0.4355,  0.2050, -0.1854, -0.0385,
        -0.3182, -0.3527, -0.6652, -0.6743,  0.1093,  0.1940,  0.2093, -0.0046,
         0.2324, -0.0717, -0.0168,  0.0336], device='cuda:0',
       requires_grad=True)
Opt step: 612	 Avg reward: 385.19	 std: Parameter containing:
tensor([-0.1412,  0.0997, -0.0709,  0.2015, -0.4151,  0.2066, -0.1955, -0.0315,
        -0.3254, -0.3469, -0.6699, -0.6841,  0.1080,  0.1734,  0.2103, -0.0038,
         0.2326, -0.0659, -0.0116,  0.0292], device='cuda:0',
       requires_grad=True)
Opt step: 613	 Avg reward: 348.94	 std: Parameter containing:
tensor([-0.1399,  0.1088, -0.0682,  0.1991, -0.4033,  0.2262, -0.1859, -0.0299,
        -0.3184, -0.3502, -0.6563, -0.6898,  0.1083,  0.1555,  0.1948,  0.0216,
         0.2337, -0.0791, -0.0158,  0.0290], device='cuda:0',
       requires_grad=True)
Opt step: 614	 Avg reward: 314.82	 std: Parameter containing:
tenso

Opt step: 638	 Avg reward: 337.31	 std: Parameter containing:
tensor([-0.2419,  0.1256, -0.0733,  0.2065, -0.4597,  0.1949, -0.1450, -0.0768,
        -0.3720, -0.3863, -0.6678, -0.6835,  0.0472,  0.1843,  0.2646,  0.0945,
         0.1989, -0.0813, -0.0077,  0.0347], device='cuda:0',
       requires_grad=True)
Opt step: 639	 Avg reward: 327.31	 std: Parameter containing:
tensor([-0.2322,  0.1243, -0.0820,  0.1928, -0.4750,  0.1976, -0.1439, -0.0852,
        -0.3687, -0.3977, -0.6840, -0.6841,  0.0391,  0.1700,  0.2639,  0.0909,
         0.1979, -0.0796, -0.0057,  0.0450], device='cuda:0',
       requires_grad=True)
Opt step: 640	 Avg reward: 326.13	 std: Parameter containing:
tensor([-0.2408,  0.1229, -0.0791,  0.2076, -0.4699,  0.1967, -0.1522, -0.0760,
        -0.3719, -0.3969, -0.6894, -0.6865,  0.0255,  0.1748,  0.2638,  0.0907,
         0.1955, -0.0857,  0.0079,  0.0538], device='cuda:0',
       requires_grad=True)
Opt step: 641	 Avg reward: 269.76	 std: Parameter containing:
tenso

Opt step: 664	 Avg reward: 286.16	 std: Parameter containing:
tensor([-0.2336,  0.1889, -0.1243,  0.2306, -0.5289,  0.2373, -0.1708, -0.0357,
        -0.3587, -0.4225, -0.7019, -0.7115, -0.0245,  0.1571,  0.3144,  0.0401,
         0.1193, -0.1021,  0.0097,  0.0523], device='cuda:0',
       requires_grad=True)
Opt step: 665	 Avg reward: 275.47	 std: Parameter containing:
tensor([-0.2398,  0.1903, -0.1377,  0.2415, -0.5353,  0.2405, -0.1656, -0.0392,
        -0.3717, -0.4205, -0.6926, -0.7140, -0.0237,  0.1437,  0.3226,  0.0395,
         0.1316, -0.0985,  0.0114,  0.0454], device='cuda:0',
       requires_grad=True)
Opt step: 666	 Avg reward: 294.53	 std: Parameter containing:
tensor([-0.2401,  0.1870, -0.1439,  0.2506, -0.5238,  0.2431, -0.1638, -0.0579,
        -0.3780, -0.4285, -0.6923, -0.7045, -0.0211,  0.1547,  0.3090,  0.0404,
         0.1333, -0.0873,  0.0113,  0.0484], device='cuda:0',
       requires_grad=True)
Opt step: 667	 Avg reward: 291.49	 std: Parameter containing:
tenso

Opt step: 691	 Avg reward: 249.59	 std: Parameter containing:
tensor([-0.2864,  0.2045, -0.1375,  0.2503, -0.5413,  0.2326, -0.1810, -0.0856,
        -0.3736, -0.5196, -0.6803, -0.7412, -0.0479,  0.1252,  0.3553,  0.0944,
         0.1079, -0.0829,  0.0150,  0.0595], device='cuda:0',
       requires_grad=True)
Opt step: 692	 Avg reward: 269.01	 std: Parameter containing:
tensor([-0.2837,  0.2047, -0.1354,  0.2451, -0.5572,  0.2304, -0.1901, -0.0905,
        -0.3684, -0.5196, -0.6938, -0.7245, -0.0551,  0.1249,  0.3482,  0.0943,
         0.1072, -0.0729,  0.0183,  0.0436], device='cuda:0',
       requires_grad=True)
Opt step: 693	 Avg reward: 303.28	 std: Parameter containing:
tensor([-0.2740,  0.2085, -0.1518,  0.2469, -0.5479,  0.2306, -0.2005, -0.0821,
        -0.3679, -0.5246, -0.6926, -0.7248, -0.0520,  0.1343,  0.3559,  0.0792,
         0.1116, -0.0665,  0.0323,  0.0385], device='cuda:0',
       requires_grad=True)
Opt step: 694	 Avg reward: 309.34	 std: Parameter containing:
tenso

Opt step: 718	 Avg reward: 156.47	 std: Parameter containing:
tensor([-0.2783,  0.1730, -0.1594,  0.2763, -0.5278,  0.2349, -0.2254, -0.0530,
        -0.3164, -0.4794, -0.7098, -0.7806, -0.0527,  0.2073,  0.3406,  0.1634,
         0.1488, -0.0062,  0.0489,  0.0380], device='cuda:0',
       requires_grad=True)
Opt step: 719	 Avg reward: 168.65	 std: Parameter containing:
tensor([-0.2841,  0.1574, -0.1603,  0.2788, -0.5289,  0.2356, -0.2223, -0.0452,
        -0.3081, -0.4867, -0.7081, -0.7791, -0.0467,  0.2044,  0.3460,  0.1495,
         0.1365, -0.0141,  0.0350,  0.0504], device='cuda:0',
       requires_grad=True)
Opt step: 720	 Avg reward: 143.67	 std: Parameter containing:
tensor([-0.3015,  0.1659, -0.1635,  0.2892, -0.5372,  0.2290, -0.2336, -0.0365,
        -0.3192, -0.4920, -0.7083, -0.7732, -0.0560,  0.2144,  0.3476,  0.1608,
         0.1521, -0.0111,  0.0401,  0.0412], device='cuda:0',
       requires_grad=True)
Opt step: 721	 Avg reward: 180.53	 std: Parameter containing:
tenso

Opt step: 745	 Avg reward: 151.05	 std: Parameter containing:
tensor([-0.3150,  0.1369, -0.0965,  0.3030, -0.5855,  0.2414, -0.2861, -0.0338,
        -0.2780, -0.4977, -0.6838, -0.8044, -0.0381,  0.1720,  0.3498,  0.1798,
         0.1776, -0.0172,  0.0363,  0.0319], device='cuda:0',
       requires_grad=True)
Opt step: 746	 Avg reward: 141.14	 std: Parameter containing:
tensor([-0.3111,  0.1370, -0.0940,  0.3054, -0.5848,  0.2370, -0.2926, -0.0306,
        -0.2838, -0.4983, -0.6814, -0.8048, -0.0357,  0.1581,  0.3531,  0.1707,
         0.1754, -0.0278,  0.0455,  0.0517], device='cuda:0',
       requires_grad=True)
Opt step: 747	 Avg reward: 134.43	 std: Parameter containing:
tensor([-0.3235,  0.1420, -0.0964,  0.2909, -0.5799,  0.2213, -0.2832, -0.0322,
        -0.2778, -0.5067, -0.6704, -0.8071, -0.0379,  0.1684,  0.3485,  0.1825,
         0.1569, -0.0207,  0.0469,  0.0451], device='cuda:0',
       requires_grad=True)
Opt step: 748	 Avg reward: 128.96	 std: Parameter containing:
tenso

Opt step: 772	 Avg reward: 161.14	 std: Parameter containing:
tensor([-0.3313,  0.1230, -0.0756,  0.2902, -0.5731,  0.1962, -0.2675,  0.0057,
        -0.2923, -0.5197, -0.7109, -0.8418, -0.0336,  0.2347,  0.2853,  0.2057,
         0.1018, -0.0121,  0.0909,  0.0408], device='cuda:0',
       requires_grad=True)
Opt step: 773	 Avg reward: 177.04	 std: Parameter containing:
tensor([-0.3382,  0.1148, -0.0722,  0.2782, -0.5723,  0.2024, -0.2721,  0.0071,
        -0.3090, -0.5261, -0.7140, -0.8237, -0.0452,  0.2264,  0.2811,  0.1959,
         0.0953, -0.0168,  0.0877,  0.0348], device='cuda:0',
       requires_grad=True)
Opt step: 774	 Avg reward: 165.31	 std: Parameter containing:
tensor([-0.3402,  0.1187, -0.0874,  0.2734, -0.5768,  0.1938, -0.2676,  0.0104,
        -0.3096, -0.5323, -0.7119, -0.8121, -0.0540,  0.2259,  0.2890,  0.1943,
         0.0902, -0.0276,  0.0937,  0.0292], device='cuda:0',
       requires_grad=True)
Opt step: 775	 Avg reward: 145.71	 std: Parameter containing:
tenso

Opt step: 803	 Avg reward: 144.75	 std: Parameter containing:
tensor([-0.4017,  0.1360, -0.1717,  0.2896, -0.5854,  0.1693, -0.3564,  0.0282,
        -0.2051, -0.4438, -0.6944, -0.8575, -0.0542,  0.2020,  0.3018,  0.1446,
         0.0599, -0.1105,  0.1342, -0.0046], device='cuda:0',
       requires_grad=True)
Opt step: 804	 Avg reward: 170.29	 std: Parameter containing:
tensor([-0.4040,  0.1231, -0.1665,  0.2799, -0.5889,  0.1657, -0.3552,  0.0165,
        -0.1964, -0.4402, -0.6951, -0.8556, -0.0578,  0.2108,  0.2824,  0.1456,
         0.0541, -0.1080,  0.1491, -0.0124], device='cuda:0',
       requires_grad=True)
Opt step: 805	 Avg reward: 136.69	 std: Parameter containing:
tensor([-0.4128,  0.1205, -0.1652,  0.2654, -0.5964,  0.1603, -0.3573,  0.0230,
        -0.2012, -0.4419, -0.6977, -0.8596, -0.0579,  0.2068,  0.2692,  0.1431,
         0.0532, -0.1132,  0.1505, -0.0077], device='cuda:0',
       requires_grad=True)
Opt step: 806	 Avg reward: 155.60	 std: Parameter containing:
tenso

KeyboardInterrupt: 
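
The training output above prints one `Opt step` / `Avg reward` pair per optimization step, followed by a dump of the raw `std` parameter. To inspect the reward trend without the tensor dumps, the log text can be parsed after the fact. The following is a minimal sketch, assuming the output was saved as a string in the format shown above; `parse_rewards` and `moving_average` are hypothetical helper names, not part of the notebook.

```python
import re

# Matches lines such as "Opt step: 390\t Avg reward: 127.20\t std: ..."
LOG_LINE = re.compile(r"Opt step:\s*(\d+)\s+Avg reward:\s*(-?\d+(?:\.\d+)?)")


def parse_rewards(log_text):
    """Return a list of (opt_step, avg_reward) pairs found in the log text."""
    return [(int(step), float(reward)) for step, reward in LOG_LINE.findall(log_text)]


def moving_average(values, window):
    """Trailing moving average; early entries use as many values as are available."""
    averaged = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        averaged.append(sum(chunk) / len(chunk))
    return averaged


# Three lines copied from the output above (tensor dumps omitted).
sample_log = (
    "Opt step: 390\t Avg reward: 127.20\t std: Parameter containing:\n"
    "Opt step: 391\t Avg reward: 142.05\t std: Parameter containing:\n"
    "Opt step: 392\t Avg reward: 129.17\t std: Parameter containing:\n"
)

pairs = parse_rewards(sample_log)
rewards = [reward for _, reward in pairs]
smoothed = moving_average(rewards, window=2)
print(pairs)
print(smoothed)
```

A wider window (for example 20 steps) would make it easier to spot slow trends in a run like this one, where the per-step average is noisy.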

In [None]:
frames = []
total_rewards = np.zeros(12)

# reset the environment
env_info = env.reset(train_mode=True)[brain_name]
states = np.array(env_info.vector_observations)
value = []
r = []
for t in range(2000):
    prediction = agent.policy.act(states)
    action  = prediction['a'].cpu().numpy()
    v = prediction['v'].detach().cpu().numpy()
    frames.append(env.render(mode='rgb_array')) 
    
    
    # send all actions to the environment
    env_info = env.step(action)[brain_name]

    next_states = np.array(env_info.vector_observations)         # get next state (for each agent)
    rewards = np.array(env_info.rewards)                         # get reward (for each agent)
    terminals = np.array(env_info.local_done)                    # see if episode finished
    
    #value.append(v.squeeze())
    #r.append(reward)
    states = next_states
    total_rewards += rewards
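    if terminals.any():
        # stop at the first agent whose episode has ended and report its return
        eps_rewards = total_rewards[np.argmax(terminals)]
        break
else:
    # no episode ended within the step budget; report the per-agent returns
    eps_rewards = total_rewards

print("Total reward:", eps_rewards)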
env.close()
#animate_frames(frames)
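
The rollout above keeps a running return for each of the 12 agents and reads one off when an episode ends. That bookkeeping can be checked in isolation, without Unity, on made-up data. This is a minimal sketch under assumptions: `track_returns` is a hypothetical helper, the rewards and done flags are invented for illustration, and resetting an agent's running total after its episode ends is a design choice that the cell above does not make. Plain Python lists are used so the snippet runs without the environment.

```python
def track_returns(reward_steps, done_steps):
    """Accumulate per-agent returns and collect them whenever an episode ends.

    reward_steps: list of per-step reward lists (one entry per agent).
    done_steps:   list of per-step done-flag lists (one entry per agent).
    Returns a list of (agent_index, episode_return) in completion order.
    """
    n_agents = len(reward_steps[0])
    running = [0.0] * n_agents
    finished = []
    for rewards, dones in zip(reward_steps, done_steps):
        for i, (reward, done) in enumerate(zip(rewards, dones)):
            running[i] += reward
            if done:
                finished.append((i, running[i]))
                running[i] = 0.0  # start a fresh episode for this agent
    return finished


# Invented data: 2 agents, 3 steps.
reward_steps = [[1.0, 0.5], [2.0, 0.5], [3.0, 1.0]]
done_steps = [[False, False], [True, False], [False, True]]
episodes = track_returns(reward_steps, done_steps)
print(episodes)
```

Resetting the running total on `done` keeps later episodes from inheriting reward from earlier ones, which matters if the rollout continues past the first terminal state.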