# Continuous Control

---

In this notebook, you will learn how to use the Unity ML-Agents environment for the second project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893) program.

### 1. Start the Environment

We begin by importing the necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [1]:
from unityagents import UnityEnvironment
import numpy as np

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Reacher.app"`
- **Windows** (x86): `"path/to/Reacher_Windows_x86/Reacher.exe"`
- **Windows** (x86_64): `"path/to/Reacher_Windows_x86_64/Reacher.exe"`
- **Linux** (x86): `"path/to/Reacher_Linux/Reacher.x86"`
- **Linux** (x86_64): `"path/to/Reacher_Linux/Reacher.x86_64"`
- **Linux** (x86, headless): `"path/to/Reacher_Linux_NoVis/Reacher.x86"`
- **Linux** (x86_64, headless): `"path/to/Reacher_Linux_NoVis/Reacher.x86_64"`

For instance, if you are using a Mac, then you downloaded `Reacher.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Reacher.app")
```

In [2]:
env = UnityEnvironment(file_name='Reacher_Many/Reacher.exe')

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		goal_speed -> 1.0
		goal_size -> 5.0
Unity brain name: ReacherBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 33
        Number of stacked Vector Observation: 1
        Vector Action space type: continuous
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [3]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

In this environment, a double-jointed arm can move to target locations. A reward of `+0.1` is provided for each step that the agent's hand is in the goal location. Thus, the goal of your agent is to maintain its position at the target location for as many time steps as possible.

The observation space consists of `33` variables corresponding to the position, rotation, velocity, and angular velocity of the arm.  Each action is a vector of four numbers, corresponding to the torques applied to the two joints.  Every entry in the action vector must be a number between `-1` and `1`.

Run the code cell below to print some information about the environment.

In [4]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])

Number of agents: 20
Size of each action: 4
There are 20 agents. Each observes a state with length: 33
The state for the first agent looks like: [ 0.00000000e+00 -4.00000000e+00  0.00000000e+00  1.00000000e+00
 -0.00000000e+00 -0.00000000e+00 -4.37113883e-08  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00 -1.00000000e+01  0.00000000e+00
  1.00000000e+00 -0.00000000e+00 -0.00000000e+00 -4.37113883e-08
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  5.75471878e+00 -1.00000000e+00
  5.55726624e+00  0.00000000e+00  1.00000000e+00  0.00000000e+00
 -1.68164849e-01]


### 3. Take Random Actions in the Environment

In the next code cell, you will learn how to use the Python API to control the agent and receive feedback from the environment.

Once this cell is executed, you can watch how the agent performs when it selects an action at random at each time step.  A window should pop up that lets you observe the agent as it moves through the environment.

Of course, as part of the project, you'll have to change the code so that the agent is able to use its experience to gradually choose better actions when interacting with the environment!

In [13]:
env_info = env.reset(train_mode=False)[brain_name]     # reset the environment    
states = env_info.vector_observations                  # get the current state (for each agent)
scores = np.zeros(num_agents)                          # initialize the score (for each agent)
while True:
    actions = np.random.randn(num_agents, action_size) # select an action (for each agent)
    actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
    env_info = env.step(actions)[brain_name]           # send all actions to the environment
    next_states = env_info.vector_observations         # get next state (for each agent)
    # print(next_states.shape) <-- (20,33)
    rewards = env_info.rewards                         # get reward (for each agent)
    dones = env_info.local_done                        # see if episode finished
    scores += env_info.rewards                         # update the score (for each agent)
    states = next_states                               # roll over states to next time step
    if np.any(dones):                                  # exit loop if episode finished
        break
print('Total score (averaged over agents) this episode: {}'.format(np.mean(scores)))

Total score (averaged over agents) this episode: 0.16099999640136958


When finished, you can close the environment.

In [6]:
# env.close()

### 4. It's Your Turn!

Now it's your turn to train your own agent to solve the environment!  When training the agent, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```

The implementation below is based on the code at https://github.com/reinforcement-learning-kr/pg_travel, with only minor adjustments.
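
The `Code.ppo` module imported below is not shown in this notebook. As a rough orientation only, the clipped surrogate objective that PPO is known for can be sketched as follows. The function name `clipped_surrogate_loss` and the toy tensors are illustrative assumptions, not the contents of `Code/ppo.py`; the default `clip_param=0.1` mirrors the `args.clip_param` value set later.

```python
import torch

def clipped_surrogate_loss(new_log_prob, old_log_prob, advantages, clip_param=0.1):
    """Generic PPO clipped surrogate loss (a quantity to minimize)."""
    # probability ratio between the updated policy and the data-collecting policy
    ratio = torch.exp(new_log_prob - old_log_prob)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_param, 1.0 + clip_param) * advantages
    # keep the more pessimistic objective per sample, then negate to obtain a loss
    return -torch.min(unclipped, clipped).mean()

# toy check: a ratio of 1.5 with a positive advantage is capped at 1 + clip_param
old_lp = torch.log(torch.tensor([0.2]))
new_lp = torch.log(torch.tensor([0.3]))
loss = clipped_surrogate_loss(new_lp, old_lp, torch.tensor([2.0]))
```

In the toy check the unclipped term is `1.5 * 2.0` and the clipped term is `1.1 * 2.0`, so the smaller, clipped term determines the loss.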

In [7]:
input_size = state_size
hidden_size = 256
output_size = action_size
agent_number = states.shape[0]

import torch
import torch.optim as optim
torch.autograd.set_detect_anomaly(True)
device = torch.device('cpu')

In [8]:
#from korea_code.model import Actor, Critic
from collections import deque
import importlib
import Code.my_model as my_model
import Code.ppo as ppo
importlib.reload(my_model)
importlib.reload(ppo)

from Code.running_state import ZFilter
from Code.memory import Memory
from Code.utils import to_tensor, get_action, save_checkpoint, save_networks
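
The modules under `Code/` are not included in this notebook. The sketch below shows one plausible shape for a Gaussian policy network and an action sampler that would fit the calls `mu, std, _ = actor(...)` and `get_action(mu, std)` used later. The class `SketchActor`, the function `sample_action`, and the fixed log-std are assumptions for illustration; the layer layout simply mirrors the `Critic` printout further down.

```python
import torch
import torch.nn as nn

class SketchActor(nn.Module):
    """Hypothetical Gaussian policy: state -> (mean, std, log-std) per action dimension."""
    def __init__(self, num_inputs, num_outputs, hidden_size):
        super().__init__()
        self.fc1 = nn.Linear(num_inputs, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, num_outputs)
        self.elu = nn.ELU()

    def forward(self, x):
        x = self.elu(self.fc1(x))
        x = self.elu(self.fc2(x))
        mu = self.fc3(x)
        logstd = torch.zeros_like(mu)   # assumption: fixed log-std of 0, i.e. std = 1
        std = torch.exp(logstd)
        return mu, std, logstd

def sample_action(mu, std):
    """Draw one action per agent from N(mu, std) and return a NumPy array."""
    with torch.no_grad():
        action = torch.normal(mu, std)
    return action.numpy()

sketch_actor = SketchActor(33, 4, 128)
sketch_mu, sketch_std, _ = sketch_actor(torch.zeros(20, 33))
sketch_actions = sample_action(sketch_mu, sketch_std)   # one row per agent
```

Samples from an unbounded Gaussian can fall outside the `[-1, 1]` range the environment expects; whether the real `get_action` clips them is not visible here.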

In [17]:
class artificial_args:
    pass

args = artificial_args()

args.actor_lr = 0.0005
args.critic_lr = 0.0005
args.l2_rate = 0.0001
args.gamma = 0.995
args.lamda = 0.95
args.clip_param = 0.1

args.batch_size = 2048
args.hidden_size = 128
args.episodes = 1000
args.activation = "elu"

args.scores_window_len = 100

actor = my_model.Actor(input_size, output_size, args.hidden_size).to(device)
critic = my_model.Critic(input_size, args.hidden_size).to(device)
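
The names `gamma` and `lamda` suggest discounted returns combined with generalized advantage estimation (GAE), but `ppo.process_memory` is not shown, so the following is only a NumPy sketch of that technique under this assumption. The function name `returns_and_gae` and the toy trajectory are hypothetical; `masks` follows the convention of the training loop below, where an entry is 0 at the step on which the episode ended.

```python
import numpy as np

def returns_and_gae(rewards, masks, values, gamma=0.995, lamda=0.95):
    """Discounted returns and GAE advantages for one agent's trajectory.

    masks[t] is 0 where the episode ended at step t, otherwise 1.
    values[t] is the critic's estimate V(s_t).
    """
    n = len(rewards)
    returns = np.zeros(n)
    advantages = np.zeros(n)
    running_return, next_value, running_adv = 0.0, 0.0, 0.0
    # walk the trajectory backwards so each step can reuse the following one
    for t in reversed(range(n)):
        running_return = rewards[t] + gamma * running_return * masks[t]
        delta = rewards[t] + gamma * next_value * masks[t] - values[t]
        running_adv = delta + gamma * lamda * running_adv * masks[t]
        returns[t] = running_return
        advantages[t] = running_adv
        next_value = values[t]
    return returns, advantages

# toy trajectory: three steps, episode ends on the last one, critic predicts 0
rets, advs = returns_and_gae(
    rewards=np.array([0.0, 0.1, 0.1]),
    masks=np.array([1.0, 1.0, 0.0]),
    values=np.zeros(3),
)
```

With `lamda` close to 1 the advantage approaches the full discounted return minus the value estimate; smaller values lean more on the critic and reduce variance at the cost of bias.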

In [10]:
running_state = ZFilter((agent_number,input_size), clip=5)
states = running_state(env_info.vector_observations)

actor_optim = optim.Adam(actor.parameters(), lr=args.actor_lr, weight_decay=args.l2_rate)
critic_optim = optim.Adam(critic.parameters(), lr=args.critic_lr, weight_decay=args.l2_rate)
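
`ZFilter` comes from `Code.running_state`, which is not shown. Judging by its name and the `clip=5` argument, it presumably standardizes observations with running statistics and clips the result. The class below is a minimal sketch of that idea using Welford's update; `RunningNormalizer` and its attributes are hypothetical names, not the actual implementation.

```python
import numpy as np

class RunningNormalizer:
    """Hypothetical running z-score filter: clip((x - mean) / (std + eps), -clip, clip)."""
    def __init__(self, shape, clip=5.0, eps=1e-8):
        self.n = 0
        self.mean = np.zeros(shape)
        self.m2 = np.zeros(shape)      # running sum of squared deviations (Welford)
        self.clip = clip
        self.eps = eps

    def __call__(self, x):
        x = np.asarray(x, dtype=np.float64)
        # update the running statistics with the new observation
        self.n += 1
        delta = x - self.mean
        self.mean = self.mean + delta / self.n
        self.m2 = self.m2 + delta * (x - self.mean)
        # sample standard deviation; fall back to 1 until two samples are seen
        std = np.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else np.ones_like(self.mean)
        return np.clip((x - self.mean) / (std + self.eps), -self.clip, self.clip)

normalizer = RunningNormalizer((2, 3), clip=5)
first = normalizer(np.ones((2, 3)))         # mean equals the first sample
second = normalizer(np.full((2, 3), 3.0))   # mean is now 2, sample std is sqrt(2)
```

Because the statistics have the shape `(agent_number, input_size)`, a filter like this keeps a separate running mean and variance for every agent and every observation component.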

In [42]:
scores = []
scores_window = deque(maxlen=args.scores_window_len)
score_avg = 0

env_info = env.reset(train_mode=True)[brain_name]
for ep in range(args.episodes):
    actor.eval()
    critic.eval()
    memory = [Memory() for _ in range(agent_number)]
    
    steps = 0
    score = np.zeros(agent_number)
    
    while True:
        steps += 1
        
        mu, std, _ = actor(to_tensor(states))
        actions = get_action(mu, std)
        env_info = env.step(actions)[brain_name]
        
        
        next_states = running_state(env_info.vector_observations)
        rewards = env_info.rewards
        dones = env_info.local_done
        masks = list(~(np.array(dones)))
        
        for i in range(agent_number):
            memory[i].push(states[i], actions[i], rewards[i], masks[i])
        
        score += np.array(rewards)
        states = next_states
                
        if np.any(dones):  # exit loop if episode finished
            scores.append(score)
            scores_window.append(score)
            score = 0
            episodes = len(scores)
            if len(scores) % 1 == 0:
                print('{}th episode - mean in window: {} mean in episode: {} unformated: {}'.format(
                    episodes, np.mean(scores_window), np.mean(scores[-1]), scores[-1]))
            if len(scores) % 100 == 0:
                save_networks(actor, critic, args, "Checkpoints/traning_"+str(episodes)+".pth")
            break
            
    actor.train()
    critic.train()
    sts, ats, returns, advants, old_policy, old_value = [], [], [], [], [], []
    
    for i in range(agent_number):
        batch = memory[i].sample()
        st, at, rt, adv, old_p, old_v = ppo.process_memory(actor, critic, batch, args)
        sts.append(st)
        ats.append(at)
        returns.append(rt)
        advants.append(adv)
        old_policy.append(old_p)
        old_value.append(old_v)
        
    sts = torch.cat(sts)
    ats = torch.cat(ats)
    returns = torch.cat(returns)
    advants = torch.cat(advants)
    old_policy = torch.cat(old_policy)
    old_value = torch.cat(old_value)
    
    ppo.train_model(actor, critic, actor_optim, critic_optim, sts, ats, returns, advants,
                    old_policy, old_value, args)
    

1th episode - mean in window: 0.35099999215453864 mean in episode: 0.35099999215453864 unformated: [0.56999999 0.28999999 1.32999997 0.43999999 0.18       0.37999999
 0.12       0.25999999 0.40999999 0.89999998 0.1        0.88999998
 0.         0.15       0.25999999 0.06       0.26999999 0.
 0.         0.40999999]
2th episode - mean in window: 0.35274999211542307 mean in episode: 0.35449999207630756 unformated: [1.48999997 0.         0.         0.36999999 0.         0.48999999
 0.         0.         0.19       1.00999998 0.         0.22
 0.19       0.43999999 0.14       0.15       0.14       1.41999997
 0.40999999 0.42999999]
3th episode - mean in window: 0.3648333251786729 mean in episode: 0.38899999130517243 unformated: [0.21       0.12       0.         0.         0.48999999 0.47999999
 0.         1.31999997 0.16       0.25999999 0.29999999 0.42999999
 1.05999998 0.82999998 0.88999998 0.04       0.         0.95999998
 0.22999999 0.        ]
4th episode - mean in window: 0.36987499173

27th episode - mean in window: 0.7907407230663078 mean in episode: 1.045499976631254 unformated: [1.91999996 1.22999997 1.76999996 1.65999996 0.94999998 0.24999999
 1.22999997 1.56999996 0.59999999 1.11999997 1.06999998 0.
 0.54999999 0.80999998 0.36999999 1.81999996 0.96999998 1.82999996
 0.42999999 0.75999998]
28th episode - mean in window: 0.809392839051517 mean in episode: 1.3129999706521631 unformated: [1.43999997 2.78999994 0.60999999 0.23999999 1.69999996 0.13
 0.         0.98999998 1.60999996 1.01999998 0.86999998 2.59999994
 2.28999995 1.58999996 3.02999993 1.85999996 0.92999998 1.44999997
 0.2        0.90999998]
29th episode - mean in window: 0.8309482572889276 mean in episode: 1.4344999679364263 unformated: [0.13       2.22999995 0.54999999 1.22999997 0.19       2.61999994
 1.66999996 0.89999998 1.79999996 2.22999995 1.94999996 0.87999998
 1.56999996 0.67999998 2.12999995 1.52999997 2.18999995 0.82999998
 2.31999995 1.05999998]
30th episode - mean in window: 0.85458331423190

53th episode - mean in window: 1.3110094046589198 mean in episode: 2.2634999494068326 unformated: [1.95999996 1.60999996 1.60999996 2.99999993 3.13999993 1.62999996
 1.11999997 2.60999994 3.27999993 2.08999995 1.60999996 1.34999997
 2.52999994 3.80999991 2.45999995 0.89999998 1.60999996 2.85999994
 2.36999995 3.71999992]
54th episode - mean in window: 1.3222129334092003 mean in episode: 1.9159999571740627 unformated: [2.42999995 1.84999996 1.97999996 3.23999993 1.97999996 1.75999996
 1.89999996 2.84999994 1.13999997 1.90999996 2.24999995 1.10999998
 1.99999996 1.20999997 2.11999995 2.37999995 0.86999998 1.01999998
 2.88999994 1.42999997]
55th episode - mean in window: 1.3373545155623419 mean in episode: 2.1549999518319964 unformated: [2.03999995 2.47999994 1.03999998 2.72999994 2.34999995 1.21999997
 1.32999997 1.50999997 1.47999997 2.11999995 1.32999997 1.88999996
 1.20999997 1.82999996 3.68999992 4.3799999  1.56999996 2.67999994
 4.00999991 2.20999995]
56th episode - mean in window: 

79th episode - mean in window: 1.9045948941378466 mean in episode: 4.282499904278666 unformated: [3.94999991 5.36999988 1.96999996 3.12999993 5.48999988 3.80999991
 4.3499999  7.49999983 6.20999986 4.08999991 4.4599999  3.78999992
 6.00999987 4.94999989 2.58999994 5.20999988 1.79999996 2.16999995
 4.99999989 3.79999992]
80th episode - mean in window: 1.928106206903467 mean in episode: 3.7854999153874815 unformated: [6.34999986 4.5599999  2.22999995 4.4099999  5.84999987 3.14999993
 2.95999993 4.08999991 1.52999997 3.07999993 1.63999996 2.64999994
 4.5199999  2.49999994 4.83999989 6.87999985 4.94999989 3.11999993
 3.62999992 2.76999994]
81th episode - mean in window: 1.9504135366517728 mean in episode: 3.7349999165162444 unformated: [4.89999989 5.45999988 2.90999993 5.12999989 1.63999996 0.93999998
 3.71999992 5.94999987 5.40999988 3.99999991 2.59999994 3.89999991
 2.92999993 3.72999992 3.27999993 3.14999993 2.46999994 5.52999988
 4.5799999  2.46999994]
82th episode - mean in window: 1.

101th episode - mean in window: 2.7073999394848944 mean in episode: 6.934499845001847 unformated: [ 6.25999986  5.68999987  5.41999988  9.1699998   4.4299999   7.53999983
  4.87999989  5.95999987 12.20999973  7.14999984  5.05999989  5.54999988
  9.67999978  5.14999988  7.86999982  8.9799998  10.40999977  6.77999985
  3.88999991  6.60999985]
102th episode - mean in window: 2.7752649379679935 mean in episode: 7.140999840386212 unformated: [ 5.54999988  8.62999981  6.83999985  7.17999984  7.21999984  7.79999983
  9.88999978  6.24999986  7.18999984  6.46999986  2.56999994  8.40999981
  5.07999989  9.1399998   2.43999995  5.60999987  7.98999982  6.83999985
  6.22999986 15.48999965]
103th episode - mean in window: 2.8609349360531198 mean in episode: 8.9559997998178 unformated: [10.67999976 10.28999977 21.02999953  6.03999986 11.93999973 11.09999975
  7.45999983  5.32999988  8.7599998   6.62999985  9.1199998   8.17999982
  9.48999979  5.59999987  4.6699999   8.9999998   7.58999983 10.00999978

125th episode - mean in window: 4.90532989035733 mean in episode: 11.26999974809587 unformated: [10.51999976  9.71999978 14.74999967 11.66999974 12.42999972 14.06999969
 11.60999974  7.40999983  5.62999987  6.91999985 18.52999959 14.16999968
 13.87999969  8.49999981 13.5699997  13.3299997  12.61999972 15.13999966
  6.63999985  4.2899999 ]
126th episode - mean in window: 5.025374887674116 mean in episode: 13.225999704375862 unformated: [14.68999967 14.34999968 12.43999972 13.16999971 11.74999974 15.91999964
 13.05999971  7.32999984  9.28999979 12.27999973 16.33999963 14.36999968
 14.15999968 10.32999977 15.69999965 12.78999971 16.90999962 16.96999962
 10.75999976 11.90999973]
127th episode - mean in window: 5.1362748851953075 mean in episode: 12.135499728750437 unformated: [ 8.9299998   9.27999979 13.2399997  14.30999968 15.70999965  9.52999979
 11.65999974  9.42999979 14.83999967 15.09999966 10.57999976  8.45999981
 11.47999974 14.95999967 11.13999975 14.92999967 15.25999966 12.0999997

149th episode - mean in window: 8.280524814915843 mean in episode: 17.72049960391596 unformated: [18.11999959 20.74999954 23.04999948 19.93999955 17.28999961 18.55999959
 19.23999957 10.33999977 16.06999964  9.64999978 17.9499996   9.66999978
 18.89999958 23.82999947 21.34999952 17.9399996  11.30999975 24.04999946
 23.40999948 12.98999971]
150th episode - mean in window: 8.432329811522736 mean in episode: 17.9409995989874 unformated: [19.89999956 15.40999966 23.02999949 14.45999968 20.20999955 23.38999948
 19.58999956 16.60999963 20.23999955  4.82999989 23.35999948 15.89999964
 22.1899995  20.09999955  6.09999986 23.38999948 24.00999946 20.37999954
 14.59999967 11.11999975]
151th episode - mean in window: 8.599724807781167 mean in episode: 18.91449957722798 unformated: [30.59999932 16.95999962 10.95999976 23.03999949 12.40999972 22.3899995
 17.6999996  22.3099995  14.45999968 14.41999968 19.91999955 20.05999955
 21.76999951 16.71999963 17.7799996  23.15999948 18.76999958 13.5199997
 21

173th episode - mean in window: 12.520164720152504 mean in episode: 22.258499502483755 unformated: [22.64999949 27.08999939 17.59999961  9.92999978 24.87999944 19.63999956
 33.16999926 16.87999962 30.96999931 16.31999964 28.81999936 28.16999937
 17.07999962 14.43999968 18.97999958 29.79999933 11.51999974 24.62999945
 26.9099994  25.68999943]
174th episode - mean in window: 12.711744715870358 mean in episode: 22.722499492112547 unformated: [19.08999957 16.47999963 27.64999938 20.95999953 17.16999962 24.54999945
 20.23999955 21.07999953 25.75999942 25.56999943 25.85999942 27.62999938
 32.09999928 16.77999962 25.26999944 23.22999948 16.58999963 20.39999954
 24.48999945 23.54999947]
175th episode - mean in window: 12.893954711797647 mean in episode: 22.765999491140246 unformated: [21.67999952 19.36999957 20.26999955 21.85999951 29.35999934 19.50999956
 24.27999946 21.44999952 28.02999937 24.10999946 24.44999945 12.71999972
 29.09999935 17.43999961 18.70999958 32.35999928 21.43999952 20.069

197th episode - mean in window: 17.687284604658373 mean in episode: 30.820999311096966 unformated: [30.29999932 33.40999925 33.18999926 35.24999921 37.40999916 29.62999934
 25.87999942 30.54999932 24.55999945 30.58999932 25.36999943 37.32999917
 30.95999931 33.66999925 29.90999933 24.59999945 33.16999926 32.16999928
 31.69999929 26.7699994 ]
198th episode - mean in window: 17.89689959997311 mean in episode: 28.534999362193048 unformated: [25.38999943 30.42999932 30.65999931 32.28999928 32.08999928 26.15999942
 32.30999928 15.69999965 31.63999929 20.28999955 21.58999952 34.41999923
 33.46999925 31.90999929 22.88999949 34.62999923 27.53999938 26.29999941
 32.40999928 28.57999936]
199th episode - mean in window: 18.137614594592712 mean in episode: 30.28149932315573 unformated: [34.10999924 33.10999926 27.41999939 30.29999932 27.31999939 31.5099993
 33.85999924 31.2099993  31.60999929 25.18999944 30.55999932 32.48999927
 30.49999932 27.71999938 33.05999926 29.66999934 31.78999929 34.379999

201th episode - mean in window: 18.606099584121257 mean in episode: 30.876999309845267 unformated: [25.94999942 28.36999937 36.87999918 19.38999957 29.09999935 38.22999915
 28.62999936 32.08999928 36.05999919 29.18999935 25.41999943 33.83999924
 32.96999926 30.32999932 31.5399993  35.02999922 29.50999934 33.97999924
 27.73999938 33.28999926]
202th episode - mean in window: 18.82005457933899 mean in episode: 28.53649936215952 unformated: [31.73999929 17.50999961 32.86999927 26.9199994  31.60999929 27.45999939
 36.82999918 24.44999945 31.3299993  31.96999929 22.3899995  28.67999936
 29.51999934 28.69999936 24.49999945 30.49999932 37.67999916 30.86999931
 26.00999942 19.18999957]
203th episode - mean in window: 19.043809574337676 mean in episode: 31.331499299686403 unformated: [32.68999927 29.47999934 35.19999921 34.08999924 26.6699994  27.93999938
 33.99999924 24.27999946 33.81999924 30.00999933 31.1799993  32.90999926
 33.82999924 32.48999927 33.43999925 17.8399996  31.76999929 34.66999

225th episode - mean in window: 23.230134480766022 mean in episode: 29.072999350167812 unformated: [38.10999915 33.28999926 33.94999924 20.34999955 34.21999924 34.77999922
 16.65999963 15.55999965 34.54999923 23.86999947 25.65999943 22.00999951
 33.80999924 33.29999926 33.11999926 20.48999954 34.95999922 27.98999937
 28.90999935 35.8699992 ]
226th episode - mean in window: 23.394384477094746 mean in episode: 29.650999337248503 unformated: [20.55999954 28.30999937 36.18999919 25.52999943 23.91999947 29.38999934
 31.64999929 27.0099994  25.99999942 33.31999926 31.89999929 33.64999925
 32.33999928 15.99999964 33.61999925 33.45999925 35.7899992  36.79999918
 21.34999952 36.22999919]
227th episode - mean in window: 23.579924472947607 mean in episode: 30.68949931403622 unformated: [33.72999925 35.52999921 36.33999919 25.99999942 34.71999922 37.48999916
 34.10999924 17.11999962 32.93999926 35.8399992  16.78999962 32.49999927
 27.74999938 16.58999963 35.6599992  27.0499994  30.63999932 32.8099

249th episode - mean in window: 26.94854439765308 mean in episode: 30.790999311767518 unformated: [35.53999921 35.8799992  37.90999915 36.76999918 27.97999937 27.96999937
 25.07999944 36.05999919 24.83999944 29.17999935 28.16999937 21.93999951
 27.10999939 33.68999925 34.90999922 28.78999936 35.48999921 23.47999948
 31.2499993  33.77999924]
250th episode - mean in window: 27.097104394332504 mean in episode: 32.79699926692992 unformated: [30.20999932 33.96999924 34.38999923 37.08999917 34.77999922 31.99999928
 31.92999929 27.73999938 37.61999916 35.46999921 25.95999942 36.94999917
 33.35999925 32.98999926 33.49999925 31.76999929 36.80999918 22.10999951
 34.06999924 33.21999926]
251th episode - mean in window: 27.234174391268752 mean in episode: 32.62149927085265 unformated: [30.85999931 27.49999939 33.87999924 38.18999915 34.80999922 32.98999926
 36.51999918 23.12999948 21.79999951 30.53999932 37.58999916 34.96999922
 32.53999927 30.31999932 35.12999921 34.50999923 33.65999925 35.509999

273th episode - mean in window: 30.069459327895196 mean in episode: 34.576499227155 unformated: [30.01999933 36.39999919 34.90999922 36.79999918 36.36999919 30.40999932
 33.76999925 36.67999918 37.01999917 33.26999926 36.10999919 36.86999918
 24.69999945 31.66999929 33.85999924 37.54999916 36.13999919 36.46999918
 36.27999919 36.22999919]
274th episode - mean in window: 30.18461432532128 mean in episode: 34.237999234721066 unformated: [37.90999915 36.57999918 36.58999918 31.85999929 29.13999935 34.78999922
 36.52999918 36.93999917 38.05999915 38.07999915 29.77999933 37.89999915
 28.90999935 31.62999929 37.53999916 33.25999926 28.45999936 32.87999927
 38.18999915 29.72999934]
275th episode - mean in window: 30.312814322455786 mean in episode: 35.58599920459092 unformated: [36.22999919 37.71999916 38.12999915 30.39999932 31.54999929 38.99999913
 35.8699992  37.57999916 35.43999921 33.39999925 37.19999917 36.98999917
 36.28999919 35.13999921 36.41999919 36.59999918 36.52999918 26.41999941

297th episode - mean in window: 32.0252842841791 mean in episode: 34.73649922357872 unformated: [36.44999919 27.26999939 36.59999918 37.92999915 37.15999917 35.21999921
 31.57999929 37.76999916 31.96999929 35.9099992  36.16999919 33.49999925
 37.15999917 32.68999927 28.62999936 36.0099992  31.79999929 37.36999916
 36.33999919 37.19999917]
298th episode - mean in window: 32.07969928296283 mean in episode: 33.97649924056604 unformated: [37.79999916 39.59999911 36.46999918 31.64999929 36.57999918 36.27999919
 32.66999927 38.12999915 33.91999924 37.37999916 32.05999928 37.10999917
 26.7499994  28.58999936 38.55999914 29.23999935 33.84999924 34.50999923
 28.55999936 29.81999933]
299th episode - mean in window: 32.12745928189531 mean in episode: 35.05749921640381 unformated: [37.35999916 35.9099992  36.12999919 34.71999922 32.89999926 32.43999927
 37.35999916 30.90999931 35.00999922 36.39999919 33.12999926 26.9199994
 36.39999919 36.66999918 34.37999923 34.84999922 37.29999917 36.84999918
 3

301th episode - mean in window: 32.21860927985795 mean in episode: 34.97399921827018 unformated: [37.16999917 33.43999925 32.64999927 33.99999924 36.13999919 27.97999937
 37.24999917 35.51999921 37.66999916 35.7599992  38.72999913 37.14999917
 31.3999993  36.26999919 32.47999927 37.25999917 37.09999917 26.04999942
 37.75999916 37.69999916]
302th episode - mean in window: 32.28681927833333 mean in episode: 35.35749920969829 unformated: [36.83999918 38.03999915 33.72999925 35.9199992  32.09999928 35.34999921
 34.00999924 29.60999934 38.01999915 34.33999923 32.60999927 33.31999926
 36.11999919 37.54999916 36.97999917 36.26999919 37.34999917 35.33999921
 38.42999914 35.21999921]
303th episode - mean in window: 32.3083942778511 mean in episode: 33.488999251462516 unformated: [37.19999917 24.84999944 38.75999913 34.42999923 34.53999923 30.16999933
 28.11999937 25.10999944 37.78999916 35.8099992  34.64999923 36.14999919
 31.1699993  33.64999925 34.75999922 37.13999917 35.12999921 24.19999946


KeyboardInterrupt: 
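
In the log above, "mean in window" is `np.mean(scores_window)`: the average over every agent in every episode still held in the `deque`, which keeps at most `scores_window_len` episodes. This is why it lags well behind "mean in episode" while scores are rising. The small sketch below reproduces that behaviour with made-up numbers and a window of 3 instead of 100.

```python
from collections import deque
import numpy as np

window = deque(maxlen=3)                  # the notebook uses maxlen=100
episode_scores = [np.array([1.0, 3.0]),   # made-up per-agent scores, two agents
                  np.array([2.0, 4.0]),
                  np.array([5.0, 7.0]),
                  np.array([9.0, 11.0])]

window_means = []
for per_agent in episode_scores:
    window.append(per_agent)              # the oldest episode drops out once the window is full
    # mean over every agent in every episode still held in the window
    window_means.append(float(np.mean(window)))
```

After the fourth episode the first one has left the window, so the windowed mean covers only episodes two to four.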

In [49]:
save_networks(actor, critic, args, "Checkpoints/quite_noice", talking=True)

{'args_params': 'actor_lr: 0.001, critic_lr: 0.001, l2_rate: 0.001, gamma: 0.995, lamda: 0.95, clip_param: 0.1, batch_size: 2048, hidden_size: 128, episodes: 1000, activation: elu, scores_window_len: 100', 'actor': OrderedDict([('fc1.weight', tensor([[ 0.2253,  0.0607, -0.0154,  ...,  0.1255,  0.0832,  0.1398],
        [-0.0152, -0.1104,  0.1681,  ..., -0.0961, -0.0599, -0.1280],
        [-0.1248, -0.0233,  0.2199,  ...,  0.0091, -0.1544,  0.1089],
        ...,
        [-0.1514, -0.2345,  0.1036,  ...,  0.0898, -0.1072, -0.1515],
        [ 0.0277, -0.1248,  0.0792,  ...,  0.0898, -0.1130, -0.0496],
        [ 0.1727, -0.0637,  0.0441,  ..., -0.0916,  0.1432,  0.0514]])), ('fc1.bias', tensor([ 1.7092e-01, -7.4782e-02, -2.9665e-02, -1.4141e-01, -1.5080e-02,
         1.1387e-01, -7.2574e-03, -1.3343e-01, -1.6532e-01,  8.3783e-02,
        -7.3638e-02,  1.7075e-01,  1.2834e-01,  6.6958e-02, -1.5263e-01,
        -9.4893e-02,  1.9919e-02, -2.3753e-01,  1.0750e-01,  5.9081e-02,
        -8.0627e
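
The printed dictionary suggests that `save_networks` stores a hyperparameter string together with the networks' state dicts, and the loading cell below reads the `"actor"` and `"critic"` keys. A minimal round trip of that kind could look like the sketch below; `save_networks_sketch` and `load_networks_sketch` are hypothetical helpers, and the toy `nn.Linear` layers merely stand in for the real networks.

```python
import io
import torch
import torch.nn as nn

def save_networks_sketch(actor, critic, target):
    """Hypothetical stand-in for save_networks: store both state dicts in one file."""
    torch.save({"actor": actor.state_dict(), "critic": critic.state_dict()}, target)

def load_networks_sketch(actor, critic, source):
    """Restore both networks from a checkpoint written by save_networks_sketch."""
    checkpoint = torch.load(source, map_location="cpu")
    actor.load_state_dict(checkpoint["actor"])
    critic.load_state_dict(checkpoint["critic"])

# toy networks standing in for the real actor and critic
src_actor, src_critic = nn.Linear(33, 4), nn.Linear(33, 1)
dst_actor, dst_critic = nn.Linear(33, 4), nn.Linear(33, 1)

buffer = io.BytesIO()                     # in-memory file, so nothing is written to disk
save_networks_sketch(src_actor, src_critic, buffer)
buffer.seek(0)
load_networks_sketch(dst_actor, dst_critic, buffer)
```

Saving state dicts rather than whole module objects keeps the checkpoint independent of the class definitions' import paths, at the price of having to rebuild the networks with matching sizes before loading.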

Let's see how the model performs after training.

In [22]:
# load a saved checkpoint if necessary
loaded_pth = torch.load("Checkpoints/traning_300.pth")
actor.load_state_dict(loaded_pth["actor"])
critic.load_state_dict(loaded_pth["critic"])
actor.eval()
critic.eval()

Critic(
  (fc1): Linear(in_features=33, out_features=128, bias=True)
  (fc2): Linear(in_features=128, out_features=128, bias=True)
  (fc3): Linear(in_features=128, out_features=1, bias=True)
  (elu): ELU(alpha=1.0)
)

In [23]:
env_info = env.reset(train_mode=False)[brain_name]     # reset the environment
states = running_state(env_info.vector_observations)   # normalize the initial state, as during training
scores = np.zeros(num_agents)                          # initialize the score (for each agent)
while True:
    mu, std, _ = actor(to_tensor(states))
    actions = get_action(mu, std)
    env_info = env.step(actions)[brain_name]
    next_states = running_state(env_info.vector_observations)
    rewards = env_info.rewards
    dones = env_info.local_done
    scores += env_info.rewards                         # update the score (for each agent)
    states = next_states       
    if np.any(dones):                                  # exit loop if episode finished
        break
print("scores:", np.mean(scores), "rewards", rewards)

scores: 7.782999826036393 rewards [0.0, 0.0, 0.03999999910593033, 0.0, 0.0, 0.0, 0.03999999910593033, 0.0, 0.0, 0.03999999910593033, 0.03999999910593033, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
