# Deep Deterministic Policy Gradients (DDPG)
---
Notebook adapted from
https://raw.githubusercontent.com/udacity/deep-reinforcement-learning/master/ddpg-bipedal/DDPG.ipynb
(the original trains DDPG on OpenAI Gym's BipedalWalker-v2 environment; this version uses the Unity Tennis environment).

### 1. Import the Necessary Packages

In [1]:
import gym
import random
import torch
import numpy as np
from collections import deque
import matplotlib.pyplot as plt
%matplotlib inline

from DDPG_Multi_agent_kthStep import Agent

### 2. Instantiate the Environment and Agent

In [2]:
from unityagents import UnityEnvironment
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

In [3]:
no_graphics=True
#no_graphics=False

# env = UnityEnvironment(file_name=r'C:\EigeneLokaleDaten\DeepRL\Value-based-methods\p2_continuous-control\Reacher_Windows_x86_64_multiple_agents\Reacher.exe')

# select this option to load version 2 (with 20 agents) of the environment
#env = UnityEnvironment(file_name='/data/Reacher_Linux_NoVis/Reacher.x86_64')
# raw string so the backslashes in the Windows path are not treated as escape sequences
env = UnityEnvironment(file_name=r'C:\EigeneLokaleDaten\DeepRL\Value-based-methods\p3_collab-compet\Tennis_Windows_x86_64\Tennis.exe', no_graphics=no_graphics)

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: TennisBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 8
        Number of stacked Vector Observation: 3
        Vector Action space type: continuous
        Vector Action space size (per agent): 2
        Vector Action descriptions: , 


In [4]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

In [6]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])
print('The state for the 2nd agent looks like:', states[1])

print(state_size)
print(len(states))

Number of agents: 2
Size of each action: 2
There are 2 agents. Each observes a state with length: 24
The state for the first agent looks like: [ 0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.         -6.65278625 -1.5
 -0.          0.          6.83172083  6.         -0.          0.        ]
The state for the 2nd agent looks like: [ 0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.         -6.4669857  -1.5
  0.          0.         -6.83172083  6.          0.          0.        ]
24
2


In [7]:
NN_architecture = None # default 256,256,256,128
#NN_architecture = [64,64,32,16]  # actor_fc, critic_fc1,critic_fc2,critic_fc3
#NN_architecture = [16,32,16,8]  # actor_fc, critic_fc1,critic_fc2,critic_fc3

agent = Agent(state_size=env_info.vector_observations.shape[1], action_size=brain.vector_action_space_size, random_seed=10, NN_architecture=NN_architecture)

Hyperparams 1000000 128 0.99 0.001 0.0001 0.001 0 new update 30 20
Hyperparams Noise 0.0 0.15 0.2


In [8]:
print(agent.act(states[0],add_noise=False))
print(agent.act(states[1],add_noise=False))
#print(agent.act(states[2],add_noise=False))

[0.06329875 0.0195394 ]
[0.05057339 0.01870034]


### 3. Train the Agent with DDPG

Run the code cell below to train the agent from scratch.  Alternatively, you can skip to the next code cell to load the pre-trained weights from file.
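
For orientation, the updates that a standard DDPG agent performs at each learning step can be written as below. This is the textbook form of the algorithm, not a transcription of the imported `Agent` class, whose internals are not shown in this notebook. The symbols are notation introduced here: $\mu_\theta$ is the actor, $Q_\phi$ the critic, primed parameters belong to the target networks, $d_i$ is the done flag, and $N$ is the batch size.

```latex
\begin{aligned}
y_i &= r_i + \gamma\,(1 - d_i)\, Q_{\phi'}\!\big(s'_i,\ \mu_{\theta'}(s'_i)\big) \\
L(\phi) &= \frac{1}{N}\sum_{i=1}^{N}\big(Q_{\phi}(s_i, a_i) - y_i\big)^2 \\
J(\theta) &= \frac{1}{N}\sum_{i=1}^{N} Q_{\phi}\big(s_i,\ \mu_{\theta}(s_i)\big) \\
\theta' &\leftarrow \tau\,\theta + (1-\tau)\,\theta', \qquad
\phi' \leftarrow \tau\,\phi + (1-\tau)\,\phi'
\end{aligned}
```

The critic is trained to minimise $L(\phi)$, the actor to maximise $J(\theta)$, and the last line is the soft update of the target networks. The unlabeled hyperparameter printout in this notebook contains 0.99 and 0.001, which would be consistent with $\gamma$ and $\tau$, but that reading is an assumption.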

In [9]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print("using device: ",device)

using device:  cpu


In [10]:
data = np.load('state_scale.npz')
scale = data['scale_int']
print(scale)


print(states[0])
print(states[0]/scale)

env_info = env.reset(train_mode=True)[brain_name]     # reset the environment    
states = env_info.vector_observations                  # get the current state (for each agent)

action = agent.act(states[0]/scale,add_noise=False) 
action2 = agent.act(states[1]/scale,add_noise=False) 
actions = agent.act(states/scale,add_noise=False) 

print(action)
print(action2)
print(actions[0:2])

[21 30 30 30 30 30 30 30 30 30 30 23 23 23 30  8 12 14 30 30 30 30 30 30]
[ 0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.         -6.65278625 -1.5
 -0.          0.          6.83172083  6.         -0.          0.        ]
[ 0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.         -0.55439885 -0.10714286
 -0.          0.          0.22772403  0.2        -0.          0.        ]
[0.05703903 0.01734038]
[0.05646411 0.01745478]
[[0.05703903 0.01734038]
 [0.05646411 0.01745478]]


In [11]:
load_weights = False
if load_weights:
    # load weights after system crashed again :(
    # load weights from last checkpoint
    NN_architecture = None
    Noise_params = None # default 0.0 0.15 0.2
    rnd_seed = 42
    #torch.save(agent.critic_local.state_dict(), 'checkpoint_critic.pth')
    agent = Agent(state_size=env_info.vector_observations.shape[1], action_size=brain.vector_action_space_size, random_seed=rnd_seed, NN_architecture=NN_architecture)
    agent.critic_local.load_state_dict(torch.load('Multi_checkpoint_critic.pth'))
    agent.critic_target.load_state_dict(torch.load('Multi_checkpoint_critic.pth'))
    agent.actor_local.load_state_dict(torch.load('Multi_checkpoint_actor.pth'))
    agent.actor_target.load_state_dict(torch.load('Multi_checkpoint_actor.pth'))
    #https://knowledge.udacity.com/questions/805381

In [12]:
states = env_info.vector_observations              # get the current state (for each agent)
rewards = env_info.rewards                         # get reward (for each agent)
next_states = env_info.vector_observations         # get next state (for each agent)   
dones = env_info.local_done                        # see if episode finished
agent.step(states[0], action, rewards[0], next_states[0], dones[0])

In [13]:
print('hey ho - lets go')
#NN_architecture = [64,64,32,16]  # actor_fc, critic_fc1, critic_fc2, critic_fc3
NN_architecture = None  # default 256,256,256,128

#Noise_params = [0.0,0.03,0.03]
Noise_params = None # default 0.0 0.15 0.2

rnd_seed = 42

agent = Agent(state_size=env_info.vector_observations.shape[1], action_size=brain.vector_action_space_size, random_seed=rnd_seed, NN_architecture=NN_architecture, Noise_params=Noise_params, ReLU=True)



load_weights = False

if load_weights:
    print('weights loaded from last checkpoint')
    agent.critic_local.load_state_dict(torch.load('Multi_checkpoint_critic_2nd.pth'))
    agent.critic_target.load_state_dict(torch.load('Multi_checkpoint_critic_2nd.pth'))
    agent.actor_local.load_state_dict(torch.load('Multi_checkpoint_actor_2nd.pth'))
    agent.actor_target.load_state_dict(torch.load('Multi_checkpoint_actor_2nd.pth'))      
    '''
    # load 20 agent - every step update 250 steps checkpoint as start - CPU / own PC 
    # different PyTorch version - problems loading the checkpoint :(
    agent.actor_local.load_state_dict(torch.load('checkpoint_actor_250.pth'))
    agent.actor_target.load_state_dict(torch.load('checkpoint_actor_250.pth'))
    agent.critic_local.load_state_dict(torch.load('checkpoint_critic_250.pth'))
    agent.critic_target.load_state_dict(torch.load('checkpoint_critic_250.pth'))        
    '''
else:
    print('Start training from scratch')
    

print("using device: ",device)
print('Network:',NN_architecture)
do_scaling = True
if do_scaling:
    print('load state scaling')
    data = np.load('state_scale.npz')
    #scale = data['scale_int']
    scale = data['scale']
else:
    scale = np.ones(state_size)  # no scaling: one entry per state dimension (24 for Tennis)

add_noise2state = True
print('add_noise to state',add_noise2state)
if add_noise2state:
    print('Noise_params:',Noise_params)

def ddpg(n_episodes=10000, max_t=999):  # 2000 / 700
    not_yet_shown = True
    scores_deque = deque(maxlen=100)
    scores = []
    max_score = -np.inf
    for i_episode in range(1, n_episodes+1):                        
        env_info = env.reset(train_mode=True)[brain_name]     # reset the environment    
        states = env_info.vector_observations                  # get the current state (for each agent)
        #state = states[0]/scale
        states = states/scale
        agent.reset()  # noise reset....
        score = 0
        #############
        for t in range(max_t):
            actions = agent.act(states,add_noise=add_noise2state)      
            #print('t:',action)
            env_info = env.step(actions)[brain_name]           # send all action to Env
            next_states = env_info.vector_observations         # get next state (for each agent)   
            next_states = next_states/scale
            rewards = env_info.rewards                         # get reward (for each agent)
            dones = env_info.local_done                        # see if episode finished
            
            #next_state, reward, done, _ = env.step(action)                                    
            ### -> agent.step(states[0]/scale, action, rewards[0], next_states[0]/scale, dones[0])
            for state, action, reward, next_state, done in zip(states, actions, rewards, next_states, dones):
                agent.step(state, action, reward, next_state, done)
            states = next_states
            score += np.max(rewards)
            #if np.any(dones==True):
            if np.any(dones):
                #print('SHOULD NEVER BE REACHED...')
                #assert 1==0
                break 
        scores_deque.append(score)
        scores.append(score)
        print('\rEpisode {}\tAverage Score: {:.2f}\tScore: {:.2f}'.format(i_episode, np.mean(scores_deque), score), end="")
        if np.mean(scores_deque) >= 0.5 and not_yet_shown:
            print('\rEpisode {}\tAverage Score: {:.2f}\tScore: {:.2f}'.format(i_episode, np.mean(scores_deque), score))
            print('Assignment -DONE-')
            not_yet_shown = False
        if i_episode % 500 == 0:
            torch.save(agent.actor_local.state_dict(), 'Multi_checkpoint_actor_30_20_local.pth')
            torch.save(agent.critic_local.state_dict(), 'Multi_checkpoint_critic_30_20_local.pth')
            torch.save(agent.actor_target.state_dict(), 'Multi_checkpoint_actor_30_20_target.pth')
            torch.save(agent.critic_target.state_dict(), 'Multi_checkpoint_critic_30_20_target.pth')
            print('\rEpisode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_deque)))   
            # average max score of +.5 over 100 consecutive episodes

    return scores

scores = ddpg()

fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(np.arange(1, len(scores)+1), scores)
plt.ylabel('Score')
plt.xlabel('Episode #')
plt.show()

hey ho - lets go
Hyperparams 1000000 128 0.99 0.001 0.0001 0.001 0 new update 30 20
Hyperparams Noise 0.0 0.15 0.2
Start training from scratch
using device:  cpu
Network: None
load state scaling
add_noise to state True
Noise_params: None
Episode 4	Average Score: 0.00	Score: 0.00

  torch.nn.utils.clip_grad_norm(self.critic_local.parameters(), 1)   # take advise from Attempt 3 / Udacity Course


Episode 500	Average Score: 0.00	Score: 0.00
Episode 1000	Average Score: 0.17	Score: 0.00
Episode 1500	Average Score: 0.08	Score: 0.10
Episode 2000	Average Score: 0.11	Score: 0.10
Episode 2500	Average Score: 0.32	Score: 0.30
Episode 3000	Average Score: 0.11	Score: 0.10
Episode 3500	Average Score: 0.12	Score: 0.10
Episode 3800	Average Score: 0.50	Score: 0.30
Assignment -DONE-
Episode 4000	Average Score: 0.22	Score: 0.30
Episode 4500	Average Score: 0.17	Score: 0.10
Episode 5000	Average Score: 0.17	Score: 0.10
Episode 5500	Average Score: 0.32	Score: 0.10
Episode 6000	Average Score: 0.29	Score: 0.20
Episode 6438	Average Score: 0.14	Score: 0.20

KeyboardInterrupt: 
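
The score bookkeeping in the training loop above can be sketched in isolation. The helper names `episode_score` and `first_solved_episode` and the sample rewards are hypothetical; the logic follows the loop as written: each step contributes the larger of the two agents' rewards, and the run counts as solved once the mean over the last (up to) 100 episodes reaches 0.5.

```python
from collections import deque


def episode_score(step_rewards):
    """Sum, over time steps, the larger of the two agents' rewards."""
    return sum(max(rewards) for rewards in step_rewards)


def first_solved_episode(scores, target=0.5, window=100):
    """Return the 1-based episode at which the rolling mean first reaches target.

    Mirrors the notebook's check: the mean is taken over however many
    episodes are in the window so far (at most `window`). Returns None if
    the target is never reached.
    """
    recent = deque(maxlen=window)
    for episode, score in enumerate(scores, start=1):
        recent.append(score)
        if sum(recent) / len(recent) >= target:
            return episode
    return None


# Hypothetical rewards for one short episode: one (agent 0, agent 1) pair per step.
rewards_per_step = [(0.0, 0.0), (0.1, 0.0), (0.0, -0.01), (0.0, 0.1)]
print(episode_score(rewards_per_step))
```

Summing per-step maxima can give a higher value than summing each agent's rewards first and then taking the maximum, so the two conventions are not interchangeable when comparing results.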

In [None]:
env_info = env.reset(train_mode=True)[brain_name]     # reset the environment    
states = env_info.vector_observations                  # get the current state (for each agent)
action = agent.act(states[0]/scale) 
print(states[0]/scale)
print(states[0])
print(action)

In [23]:
not_yet_shown

True

In [None]:
# some random agent - determine length of episodes...
done_mean = []
score_mean = []
states_mean = []
for runs in range(100):
    env_info = env.reset(train_mode=True)[brain_name]      # reset the environment    
    states = env_info.vector_observations                  # get the current state (for each agent)
    scores = np.zeros(num_agents)                          # initialize the score (for each agent)
    jj = 0 
    while True:
        jj += 1 
        actions = np.random.randn(num_agents, action_size) # select an action (for each agent)
        actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
        env_info = env.step(actions)[brain_name]           # send all actions to the environment
        next_states = env_info.vector_observations         # get next state (for each agent)
        rewards = env_info.rewards                         # get reward (for each agent)
        dones = env_info.local_done                        # see if episode finished
        scores += env_info.rewards                         # update the score (for each agent)
        states = next_states                               # roll over states to next time step
        states_mean.append(states[0])
        if np.any(dones):                                  # exit loop if episode finished
            #print(jj,'steps before done')
            done_mean.append(jj)
            break        
    #print(runs,'Total score (averaged over agents) this episode: {}'.format(np.mean(scores)))
    score_mean.append(np.mean(scores))

In [None]:
print('<score>',np.mean(score_mean),'+/-',np.std(score_mean))
print('<done steps>',np.mean(done_mean),'+/-',np.std(done_mean))
print(states_mean[0].shape)
print(len(states_mean))
print(states_mean[10010-1])
states_np = np.array(states_mean)

s=0
print(np.argmax(states_np[:,s]))
print(states_mean[np.argmax(states_np[:,s])])
print(states_np.shape)

In [46]:
state = states_np[0]
dx = np.zeros(state.shape[0])
for s in range(state.shape[0]):
    if np.min(states_np[:, s]) > 0:  # column s (all samples of state dimension s), not row s
        dx[s] = np.max(states_np[:,s])
    else:
        if np.abs(np.max(states_np[:,s])) > np.abs(np.min(states_np[:,s])):
            dx[s] = np.abs(np.max(states_np[:,s]))
        else:
            dx[s] = np.abs(np.min(states_np[:,s]))
    print('state ',s,'min/max scaling:',np.min(states_np[:,s]),np.max(states_np[:,s]),'\t\t\t->',np.min(states_np[:,s])/dx[s],np.max(states_np[:,s])/dx[s])
    #print('state ',s,':',dx[s])


state  0 min/max scaling: -4.020280838012695 4.010631561279297 			-> -1.0 0.997599850079586
state  1 min/max scaling: -4.006692409515381 1.8407402038574219 			-> -1.0 0.4594164002921462
state  2 min/max scaling: -3.9615864753723145 4.008120059967041 			-> -0.9883901719762583 1.0
state  3 min/max scaling: -0.1462584137916565 1.0 			-> -0.1462584137916565 1.0
state  4 min/max scaling: -0.6754260659217834 0.6729655265808105 			-> -1.0 0.9963570559901106
state  5 min/max scaling: -0.70906001329422 0.9998599886894226 			-> -0.7091593036177276 1.0
state  6 min/max scaling: -0.743294358253479 0.8522251844406128 			-> -0.8721807003877334 1.0
state  7 min/max scaling: -10.913050651550293 9.352399826049805 			-> -1.0 0.8569922494331331
state  8 min/max scaling: -2.058267831802368 1.7443766593933105 			-> -1.0 0.8474974113868399
state  9 min/max scaling: -3.3310513496398926 3.4418137073516846 			-> -0.9678186075338114 1.0
state  10 min/max scaling: -13.723333358764648 13.842732429504395 			-> -0.



In [82]:
scale = np.copy(dx)
scale[dx < 1] = 1   # dimensions whose range is already below 1 are left unscaled
print(scale)

# round every scale factor up to the next integer (works for any state size)
scale_int = np.ceil(scale).astype(int)
print(scale_int)

np.savez('state_scale.npz',scale=scale, scale_int=scale_int)

[ 4.02028084  4.00669241  4.00812006  1.          1.          1.
  1.         10.91305065  2.05826783  3.44181371 13.84273243  9.51571655
 13.86110687  9.92453194 10.01359653  9.82462692  1.          1.
  1.          1.         11.61246014  8.71936607  7.06147718 17.99956512
 19.19440842 16.99439049  8.          1.          8.          1.
  1.          1.          1.        ]
[ 5  5  5  1  1  1  1 11  3  4 14 10 14 10 11 10  1  1  1  1 12  9  8 18
 20 17  8  1  8  1  1  1  1]


In [85]:
data = np.load('state_scale.npz')
print(data['scale'])
print(data['scale_int'])

[ 4.02028084  4.00669241  4.00812006  1.          1.          1.
  1.         10.91305065  2.05826783  3.44181371 13.84273243  9.51571655
 13.86110687  9.92453194 10.01359653  9.82462692  1.          1.
  1.          1.         11.61246014  8.71936607  7.06147718 17.99956512
 19.19440842 16.99439049  8.          1.          8.          1.
  1.          1.          1.        ]
[ 5  5  5  1  1  1  1 11  3  4 14 10 14 10 11 10  1  1  1  1 12  9  8 18
 20 17  8  1  8  1  1  1  1]
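
The scaling cells above can be condensed into a vectorized form. The following is a minimal sketch, assuming the collected observations are stacked into an array of shape `(n_samples, state_size)`; the function name `max_abs_scale` and the random stand-in data are hypothetical and only illustrate the computation.

```python
import numpy as np


def max_abs_scale(observations, floor=1.0):
    """Per-dimension scale: the largest |value| seen, but never below `floor`.

    observations: array of shape (n_samples, state_size).
    """
    scale = np.max(np.abs(observations), axis=0)
    return np.maximum(scale, floor)


# Hypothetical stand-in for the observations collected by the random agent.
rng = np.random.default_rng(0)
observations = rng.normal(size=(1000, 24)) * np.linspace(0.1, 10.0, 24)

scale = max_abs_scale(observations)
scale_int = np.ceil(scale).astype(int)   # integer variant, rounded up

scaled = observations / scale            # broadcasts over the sample axis
print(scale.shape, np.abs(scaled).max())
```

Because the scale is derived from the data's own extremes, every scaled value observed so far lies in [-1, 1]; states seen later during training may still fall slightly outside that range.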


### 4. Explore

In this exercise, we have provided a sample DDPG agent and demonstrated how to use it to train on the Unity Tennis environment. To continue your learning, you are encouraged to complete any (or all!) of the following tasks:
- Amend the various hyperparameters and network architecture to see if you can get your agent to solve the environment faster than this benchmark implementation.  Once you build intuition for the hyperparameters that work well with this environment, try solving a different OpenAI Gym task!
- Write your own DDPG implementation.  Use this code as reference only when needed -- try as much as you can to write your own algorithm from scratch.
- You may also like to implement prioritized experience replay, to see if it speeds learning.  
- The current implementation adds Ornstein-Uhlenbeck noise to the action space. However, it has [been shown](https://blog.openai.com/better-exploration-with-parameter-noise/) that adding noise to the parameters of the neural network policy can improve performance. Make this change to the code, to verify it for yourself!
- Write a blog post explaining the intuition behind the DDPG algorithm and demonstrating how to use it to solve an RL environment of your choosing.  
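
As a starting point for the noise experiments suggested above, here is a minimal sketch of an Ornstein-Uhlenbeck process. It is not the implementation inside the imported `Agent`, which is not shown here. The class name `OUNoise` and its interface are hypothetical, and reading the printed noise defaults `0.0 0.15 0.2` as `mu`, `theta`, `sigma` is an assumption. Actions are clipped to [-1, 1], as in the random-agent cell.

```python
import numpy as np


class OUNoise:
    """Ornstein-Uhlenbeck process, discretised with a unit time step."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, seed=0):
        self.mu = mu * np.ones(size)
        self.theta = theta
        self.sigma = sigma
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        """Return the internal state to the mean (e.g. at episode start)."""
        self.state = self.mu.copy()

    def sample(self):
        """Advance one step: pull toward mu, then add Gaussian noise."""
        drift = self.theta * (self.mu - self.state)
        diffusion = self.sigma * self.rng.standard_normal(self.state.shape)
        self.state = self.state + drift + diffusion
        return self.state


noise = OUNoise(size=2)
action = np.array([0.06, 0.02])                        # hypothetical actor output
noisy_action = np.clip(action + noise.sample(), -1.0, 1.0)
print(noisy_action)
```

Successive samples are correlated because each one starts from the previous state, which is the property that distinguishes this process from independent Gaussian noise. Parameter-space noise would instead perturb the actor's weights before computing the action.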