# Reinforcement learning

In this notebook, we give some examples of reinforcement learning algorithms for a pendulum model. Beware that the implementations of the algorithms are meant to be simple, hence they are not very efficient: don't use them as a reference if you want to implement a dedicated RL algorithm for another problem.

In [None]:
import numpy as np
from numpy.linalg import norm,inv,pinv,svd,eig
import matplotlib.pyplot as plt
import time
import random

## Environments

We are going to work with an inverted pendulum with limited torque, which must swing to collect energy before rising up to the unstable equilibrium state. As the algorithms that we are going to explore work either on discrete action-state spaces or on continuous ones, several versions of this environment are proposed. In general, they all work the same way: get a random initial configuration with reset, display it in meshcat with render, and run a simulation step with step(control). Examples:

In [None]:
from tp6.env_pendulum import EnvPendulum,EnvPendulumDiscrete,EnvPendulumHybrid,EnvPendulumSinCos

In [None]:
env = EnvPendulum(1,viewer='meshcat')
env.name = str(env.__class__)
env.u0 = np.zeros(env.nu)

In [None]:
env.jupyter_cell()

In [None]:
env.render()
for i in range(10):
    env.step(env.u0)
    env.render()


We define here 4 main environments that you can similarly test:

- EnvPendulum:    state NX=2 continuous, control NU=1 continuous, Euler integration step with DT=1e-2 and high friction
- EnvPendulumDiscrete:  state NX=441 discrete, control NU=11 discrete, Euler step DT=0.5 low friction
- EnvPendulumSinCos: state NX=3 with x=[cos,sin,vel], control NU=1 continuous, Euler step DT=1e-2, high friction
- EnvPendulumHybrid:  state NX=3 continuous with x=[cos,sin,vel], control NU=11 discrete, Euler step DT=0.5 low friction
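
To make the common interface concrete, here is a minimal sketch of a pendulum environment exposing reset and step with an explicit Euler integration. The class ToyPendulum, its parameters (friction, umax, the quadratic reward) and the convention that angle 0 is the upright position are all assumptions chosen for illustration; this is not the code of tp6.env_pendulum.

```python
import numpy as np

class ToyPendulum:
    """Hypothetical stand-in for the tp6 environments (not their actual code)."""

    def __init__(self, dt=1e-2, friction=0.1, umax=2.0, gravity=9.81, length=1.0):
        self.dt, self.friction, self.umax = dt, friction, umax
        self.g, self.l = gravity, length
        self.nx, self.nu = 2, 1                 # state [angle, velocity], one torque
        self.rng = np.random.default_rng(0)
        self.x = np.zeros(self.nx)

    def reset(self, x0=None):
        """Start from x0 if given, otherwise from a random [angle, velocity]."""
        if x0 is None:
            x0 = self.rng.uniform([-np.pi, -1.0], [np.pi, 1.0])
        self.x = np.array(x0, dtype=float)
        return self.x.copy()

    def step(self, u):
        """One explicit Euler step with saturated torque; returns (state, reward)."""
        u = float(np.clip(np.asarray(u).ravel()[0], -self.umax, self.umax))
        q, v = self.x
        # Assumed convention: q=0 is upright, so gravity pushes away from q=0.
        a = (self.g / self.l) * np.sin(q) + u - self.friction * v
        q_next = q + v * self.dt
        v_next = v + a * self.dt
        q_next = (q_next + np.pi) % (2 * np.pi) - np.pi   # wrap angle to [-pi, pi)
        self.x = np.array([q_next, v_next])
        reward = -(q_next**2 + 1e-1 * v_next**2 + 1e-3 * u**2)
        return self.x.copy(), reward

toy_env = ToyPendulum()
x_toy = toy_env.reset()
for _ in range(10):
    x_toy, r_toy = toy_env.step(np.zeros(toy_env.nu))
```

The discrete variants can be thought of as the same dynamics wrapped with an index encoding of the state and a finite list of torques.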


## Value iteration

For the first algorithm, we implement a value iteration, an algorithm that works on discrete states and discrete actions. As it is not very efficient, we must coarsely discretize the pendulum. Here is the implementation.

In [None]:
# %load tp6/qtable.py
'''
Example of Q-table learning with a simple discretized 1-pendulum environment.
-- Converges in 1k  episodes with pendulum(1)
-- Converges in 10k episodes with cozmo model
'''

import matplotlib.pyplot as plt
import signal
import time
import numpy as np

### --- Random seed
RANDOM_SEED = 1188 #int((time.time()%10)*1000)
print("Seed = %d" % RANDOM_SEED)
np.random.seed(RANDOM_SEED)

### --- Environment
from tp6.env_pendulum import EnvPendulumDiscrete; Env = lambda : EnvPendulumDiscrete(1,viewer='meshcat')
env = Env()

### --- Hyper parameters
NEPISODES               = 400           # Number of training episodes
NSTEPS                  = 50            # Max episode length
LEARNING_RATE           = 0.85          # Step size of the Q-table update
DECAY_RATE              = 0.99          # Discount factor 

Q     = np.zeros([env.nx,env.nu])       # Q-table initialized to 0

def policy(s):
    return np.argmax(Q[s,:])

def rendertrial(s0=None,maxiter=100):
    '''Roll-out from random state using greedy policy.'''
    s = env.reset(s0)
    for i in range(maxiter):
        a = np.argmax(Q[s,:])
        s,r = env.step(a)
        env.render()
    
signal.signal(signal.SIGTSTP, lambda x,y:rendertrial()) # Roll-out when CTRL-Z is pressed

h_rwd = []                              # Learning history (for plot).
for episode in range(1,NEPISODES):
    x    = env.reset()
    rsum = 0.0
    for steps in range(NSTEPS):
        u         = np.argmax(Q[x,:] + np.random.randn(1,env.nu)/episode) # Greedy action with noise
        x2,reward = env.step(u)
        
        # Compute reference Q-value at state x respecting HJB
        Qref = reward + DECAY_RATE*np.max(Q[x2,:])

        # Update Q-Table to better fit HJB
        Q[x,u] += LEARNING_RATE*(Qref-Q[x,u])
        x       = x2
        rsum   += reward

    h_rwd.append(rsum)
    if not episode%20:
        print('Episode #%d done with average cost %.2f' % (episode,sum(h_rwd[-20:])/20))

print("Total rate of success: %.3f" % (sum(h_rwd)/NEPISODES))
rendertrial()
plt.plot( np.cumsum(h_rwd)/range(1,NEPISODES) )
plt.show()



After convergence, you can try the obtained policy using the method <rendertrial>.

In [None]:
env.jupyter_cell()

In [None]:
rendertrial(maxiter=NSTEPS)

Let's display the optimal flow. As states are denoted by their indexes, we need to recover the 2d state from the index, with the following method:

In [None]:
def x2d(s):
    return env.decode_x(s)
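
As an illustration of what such a decoding could look like, here is a sketch that maps a flat index to the centre of a cell on a regular grid. The 21 x 21 layout (which gives 441 states), the velocity bound VMAX and the function names encode/decode are assumptions made for this example; the actual layout is whatever env.decode_x implements.

```python
import numpy as np

NQ, NV = 21, 21      # assumed grid resolution: 21 * 21 = 441 discrete states
VMAX = 8.0           # assumed velocity bound (hypothetical value)

def encode(iq, iv):
    """Flatten a pair of bin indexes (angle bin, velocity bin) into one index."""
    return iq * NV + iv

def decode(s):
    """Return the continuous [angle, velocity] at the centre of cell s."""
    iq, iv = divmod(s, NV)
    q = -np.pi + (iq + 0.5) * (2 * np.pi / NQ)
    v = -VMAX + (iv + 0.5) * (2 * VMAX / NV)
    return np.array([q, v])

s_mid = encode(10, 10)       # central cell of the grid
x_mid = decode(s_mid)        # close to [0, 0]
```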

In [None]:
from tp6.flow import plotFlow

In [None]:
plotFlow(env,policy,x2d)
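
The table update used in this section can be checked on a problem small enough to solve by hand. The sketch below applies the same rule, Q[x,u] += rate*(reward + decay*max Q[x2,:] - Q[x,u]), to a hypothetical 5-cell chain where the reward is 1 when landing on the last cell. The chain, its reward and the purely random exploration are assumptions for illustration and have nothing to do with the pendulum.

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS = 5, 2          # chain of 5 cells; action 0 = left, 1 = right
GAMMA, ALPHA = 0.9, 0.5             # discount factor and table learning rate

def chain_step(s, a):
    """Move left or right on the chain; reward 1 when landing on the last cell."""
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s2 == N_STATES - 1 else 0.0
    return s2, reward

Q_toy = np.zeros((N_STATES, N_ACTIONS))
for episode in range(500):
    s = int(rng.integers(N_STATES))
    for _ in range(20):
        a = int(rng.integers(N_ACTIONS))             # random exploration
        s2, r = chain_step(s, a)
        q_ref = r + GAMMA * Q_toy[s2].max()          # reference value from the Bellman equation
        Q_toy[s, a] += ALPHA * (q_ref - Q_toy[s, a]) # move the table entry toward it
        s = s2

greedy_toy = Q_toy.argmax(axis=1)    # greedy action in each cell
```

On this chain the greedy policy should pick "right" everywhere, and Q_toy[4,1] should approach 1/(1-GAMMA) = 10.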

## Value iteration with a neural network

Next, we marginally modify the value iteration to store the Q function not as a table but as a neural network. The main modification is that the Bellman contraction must now be achieved with a gradient descent ... and that is much less efficient. Let's see, on the same environment first.


In [None]:
# %load tp6/deeptable.py
'''
Example of Q-table learning with a simple discretized 1-pendulum environment using a linear Q network.
'''

import numpy as np
import random
import tensorflow as tf
import tensorflow.compat.v1 as tf1
import matplotlib.pyplot as plt
from tp6.env_pendulum import EnvPendulumDiscrete; Env = lambda : EnvPendulumDiscrete(1,viewer='meshcat')
import signal
import time
tf1.disable_eager_execution()


### --- Random seed
RANDOM_SEED = int((time.time()%10)*1000)
print("Seed = %d" % RANDOM_SEED)
np.random.seed(RANDOM_SEED)

### --- Hyper parameters
NEPISODES               = 2000          # Number of training episodes
NSTEPS                  = 50            # Max episode length
LEARNING_RATE           = 0.1           # Step length in optimizer
DECAY_RATE              = 0.99          # Discount factor 

### --- Environment
env = Env()
NX  = env.nx
NU  = env.nu

### --- Q-value networks
class QValueNetwork:
    def __init__(self):
        x               = tf1.placeholder(shape=[1,NX],dtype=tf.float32)
        W               = tf1.Variable(tf1.random_uniform([NX,NU],0,0.01,seed=100))
        qvalue          = tf1.matmul(x,W)
        u               = tf1.argmax(qvalue,1)

        qref            = tf1.placeholder(shape=[1,NU],dtype=tf.float32)
        loss            = tf1.reduce_sum(tf.square(qref - qvalue))
        optim           = tf1.train.GradientDescentOptimizer(LEARNING_RATE).minimize(loss)

        self.x          = x             # Network input
        self.qvalue     = qvalue        # Q-value as a function of x
        self.u          = u             # Policy  as a function of x
        self.qref       = qref          # Reference Q-value at next step (to be set to l+Q o f)
        self.optim      = optim         # Optimizer      

### --- Tensor flow initialization
#tf.reset_default_graph()
qvalue  = QValueNetwork()
sess = tf1.InteractiveSession()
tf1.global_variables_initializer().run()

def onehot(ix,n=NX):
    '''Return a vector which is 0 everywhere except index <ix> set to 1.'''
    return np.array([[ (i==ix) for i in range(n) ],],float)   # np.float was removed in NumPy 1.24
   
def disturb(u,i):
    u += int(np.random.randn()*10/(i/50+10))
    return np.clip(u,0,NU-1)

def rendertrial(maxiter=100):
    x = env.reset()
    for i in range(maxiter):
        u = sess.run(qvalue.u,feed_dict={ qvalue.x:onehot(x) })
        x,r = env.step(u)
        env.render()
        if r==1: print('Reward!'); break
signal.signal(signal.SIGTSTP, lambda x,y:rendertrial()) # Roll-out when CTRL-Z is pressed

### --- History of search
h_rwd = []                              # Learning history (for plot).

### --- Training
for episode in range(1,NEPISODES):
    x    = env.reset()
    rsum = 0.0

    for step in range(NSTEPS-1):
        u = sess.run(qvalue.u,feed_dict={ qvalue.x: onehot(x) })[0] # Greedy policy ...
        u = disturb(u,episode)                                      # ... with noise
        x2,reward = env.step(u)

        # Compute reference Q-value at state x respecting HJB
        Q2        = sess.run(qvalue.qvalue,feed_dict={ qvalue.x: onehot(x2) })
        Qref      = sess.run(qvalue.qvalue,feed_dict={ qvalue.x: onehot(x ) })
        Qref[0,u] = reward + DECAY_RATE*np.max(Q2)

        # Update Q-table to better fit HJB
        sess.run(qvalue.optim,feed_dict={ qvalue.x    : onehot(x),
                                          qvalue.qref : Qref       })

        rsum += reward
        x = x2
        if reward == 1: break

    h_rwd.append(rsum)
    if not episode%20: print('Episode #%d done with %d success' % (episode,sum(h_rwd[-20:])))

print("Total rate of success: %.3f" % (sum(h_rwd)/NEPISODES))
rendertrial()
plt.plot( np.cumsum(h_rwd)/range(1,NEPISODES) )
plt.show()



See? Each step is much more costly (partly due to the poor implementation, but a gradient step is certainly more costly than a table update), and much less informative: the algorithm is slower to converge.
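
It may help to see why the linear network with one-hot input is really a table in disguise. The NumPy sketch below (sizes and names are arbitrary assumptions, and it does not use TensorFlow) performs one gradient step on the loss sum((qref - q)^2) and checks that only the entry W[x,u] changes, exactly as in a table update with an effective rate of 2*LR.

```python
import numpy as np

NX_TOY, NU_TOY, LR = 4, 3, 0.1
rng = np.random.default_rng(0)
W = rng.uniform(0, 0.01, size=(NX_TOY, NU_TOY))   # linear "network": q(x) = onehot(x) @ W

def onehot_vec(ix, n=NX_TOY):
    """Row vector with a single 1 at index ix."""
    v = np.zeros((1, n))
    v[0, ix] = 1.0
    return v

x_toy, u_toy, target = 2, 1, 1.0
xin = onehot_vec(x_toy)
q = xin @ W                          # current Q-values at state x_toy (row x_toy of W)
qref = q.copy()
qref[0, u_toy] = target              # only the played control gets a new reference

# Gradient of sum((qref - q)^2) with respect to W, then one descent step.
grad = -2.0 * xin.T @ (qref - q)
W_new = W - LR * grad

changed = np.argwhere(W_new != W)    # indexes of the modified entries
```

With LEARNING_RATE = 0.1 in the network version, each step therefore moves the entry by 20% of the residual, against 85% in the table version above.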

## Q-learning

The good point is that the algorithm scales well to more complex problems, and efficient Q-learning can be implemented on very large ones (if you have enough CPUs). Let's take a look at the main one, the Deep-Q algorithm. The input of the neural network can be basically anything (discrete, continuous, an image, etc.); the only limitation is that the control must remain discrete, with a number of controls that is not too large. So let's take a hybrid version of the same pendulum, with continuous state space (cos, sin, velocity) and discrete control.
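
As a rough picture of what "continuous state in, discrete control out" means, the sketch below builds an untrained two-layer network in NumPy that returns one value per discrete torque, and picks the control by a noisy argmax. The sizes, the random weights and the function names are assumptions for illustration; the actual network used below lives in tp6.qnetwork and may be organized differently.

```python
import numpy as np

rng = np.random.default_rng(0)
NX_H, NU_H, NH = 3, 11, 32     # state [cos, sin, vel], 11 discrete torques, hidden width

# Randomly initialized weights of a hypothetical two-layer network (untrained).
W1, b1 = rng.normal(scale=0.1, size=(NX_H, NH)), np.zeros(NH)
W2, b2 = rng.normal(scale=0.1, size=(NH, NU_H)), np.zeros(NU_H)

def q_values(x):
    """Return the NU_H values Q(x,u), one for every discrete control u."""
    h = np.maximum(0.0, x @ W1 + b1)      # ReLU hidden layer
    return h @ W2 + b2

def noisy_greedy(x, noise=0.0):
    """Argmax over the controls after perturbing the values with Gaussian noise."""
    q = q_values(x)
    return int(np.argmax(q + noise * rng.standard_normal(q.shape)))

theta, vel = 0.3, -1.0
x_h = np.array([np.cos(theta), np.sin(theta), vel])
u_h = noisy_greedy(x_h)                   # greedy control index (no noise)
u_explore = noisy_greedy(x_h, noise=1.0)  # exploratory control index
```

The exhaustive argmax is cheap here because there are only 11 controls, which is why the control set must stay small.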

In [1]:
# %load tp6/qlearn.py
'''
Train a Q-value following a classical Q-learning algorithm (enforcing the
satisfaction of HJB method), using a noisy greedy exploration strategy.

The result of a training for a continuous pendulum (after 200 iterations)
is stored in qvalue.h5.

Reference:
Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." 
Nature 518.7540 (2015): 529.
'''

from tp6.env_pendulum import EnvPendulumHybrid; Env = lambda : EnvPendulumHybrid(1,viewer='meshcat')
from tp6.qnetwork import QNetwork
from collections import deque
import time
import signal
import matplotlib.pyplot as plt
import random
import numpy as np
import tensorflow as tf

### --- Random seed
RANDOM_SEED = int((time.time()%10)*1000)
print("Seed = %d" %  RANDOM_SEED)
np .random.seed     (RANDOM_SEED)
random.seed         (RANDOM_SEED)

### --- Environment
env                 = Env()

### --- Hyper parameters
NEPISODES               = 1000          # Number of training episodes
NSTEPS                  = 60            # Max episode length
QVALUE_LEARNING_RATE    = 0.001         # Base learning rate for the Q-value Network
DECAY_RATE              = 0.99          # Discount factor 
UPDATE_RATE             = 0.01          # Homotopy rate to update the networks
REPLAY_SIZE             = 10000         # Size of replay buffer
BATCH_SIZE              = 64            # Number of points to be fed in stochastic gradient
NH1 = NH2               = 32            # Hidden layer size

### --- Replay memory
class ReplayItem:
    def __init__(self,x,u,r,d,x2):
        self.x          = x
        self.u          = u
        self.reward     = r
        self.done       = d
        self.x2         = x2
replayDeque = deque()

### --- Tensor flow initialization
qvalue          = QNetwork(nx=env.nx,nu=env.nu,learning_rate=QVALUE_LEARNING_RATE)
qvalueTarget    = QNetwork(name='target',nx=env.nx,nu=env.nu)
# Uncomment to load networks
#qvalue.load()
#qvalueTarget.load()

def rendertrial(maxiter=NSTEPS,verbose=True):
    x = env.reset()
    traj = [x.copy()]
    rsum = 0.
    for i in range(maxiter):
        u = qvalue.policy(x)[0]
        x, reward = env.step(u)
        env.render()
        time.sleep(1e-2)
        rsum += reward
        traj.append(x.copy())
    if verbose: print('Lasted ',i,' timestep -- total reward:',rsum)
    return np.array(traj)
signal.signal(signal.SIGTSTP, lambda x,y:rendertrial()) # Roll-out when CTRL-Z is pressed

### History of search
h_rwd = []

### --- Training
for episode in range(1,NEPISODES):
    x    = env.reset()
    rsum = 0.0

    for step in range(NSTEPS):
        u       = qvalue.policy(x,                                     # Greedy policy ...
                                noise=1. / (1. + episode + step))      # ... with noise
        x2,r    = env.step(u)
        done    = False # Some environment may return information when task completed

        replayDeque.append(ReplayItem(x,u,r,done,x2))                # Feed replay memory ...
        if len(replayDeque)>REPLAY_SIZE: replayDeque.popleft()       # ... with FIFO forgetting.

        rsum   += r
        x       = x2
        if done: break
        
        # Start optimizing networks when memory size > batch size.
        if len(replayDeque) > BATCH_SIZE:     
            batch = random.sample(replayDeque,BATCH_SIZE)            # Random batch from replay memory.
            x_batch    = np.vstack([ b.x      for b in batch ])
            u_batch    = np.vstack([ b.u      for b in batch ])
            r_batch    = np.array([ [b.reward] for b in batch ])
            d_batch    = np.array([ [b.done]   for b in batch ])
            x2_batch   = np.vstack([ b.x2     for b in batch ])
            
            # Compute the value at the next state x2 from the target network
            v_batch    = qvalueTarget.value(x2_batch)
            qref_batch = r_batch + (d_batch==False)*(DECAY_RATE*v_batch)

            # Update qvalue to solve HJB constraint: q = r + q'
            qvalue.trainer.train_on_batch([x_batch,u_batch],qref_batch)
            
            # Update target networks by homotopy.
            qvalueTarget.targetAssign(qvalue,UPDATE_RATE)
      
    # \\\END_FOR step in range(NSTEPS)

    # Display and logging (not mandatory).
    print('Ep#{:3d}: lasted {:d} steps, reward={:3.0f}' .format(episode, step,rsum))
    h_rwd.append(rsum)
    if not (episode+1) % 200:     rendertrial(30)

# \\\END_FOR episode in range(NEPISODES)

print("Average reward during trials: %.3f" % (sum(h_rwd)/NEPISODES))
rendertrial()
plt.plot( np.cumsum(h_rwd)/range(1,NEPISODES) )
plt.show()

# Uncomment to save networks
#qvalue.save()



Seed = 3473
You can open the visualizer by visiting the following URL:
http://127.0.0.1:7002/static/



Ep#  1: lasted 59 steps, reward=-230
Ep#  2: lasted 59 steps, reward=-217
Ep#  3: lasted 59 steps, reward=-222
Ep#  4: lasted 59 steps, reward=-184



Ep#  5: lasted 59 steps, reward=-203
Ep#  6: lasted 59 steps, reward=-231
Ep#  7: lasted 59 steps, reward=-224
Ep#  8: lasted 59 steps, reward=-48
Ep#  9: lasted 59 steps, reward=-219
Ep# 10: lasted 59 steps, reward=-216
Ep# 11: lasted 59 steps, reward=-232
Ep# 12: lasted 59 steps, reward=-180


KeyboardInterrupt: 

## Actor critic
When the control space is discrete, the optimal policy is directly obtained by maximizing the Q value with an exhaustive search. This does not work for a continuous control space. In that case, a policy network must be trained in parallel to greedily optimize the Q function. A famous algorithm for that, which is a nearly direct extension of Deep-Q, is the Deep Deterministic Policy Gradient (DDPG). 
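
The difference between the two settings can be seen on a toy quality function of a single control. In the sketch below, the function q_of_u and its maximizer 0.37 are invented for illustration: the discrete case scans a grid of 11 torques, while the continuous case follows the gradient of Q with respect to u. DDPG does not run such an inner loop for each state; it trains a policy network to output the maximizer directly, but the gradient signal is of the same nature.

```python
import numpy as np

U_STAR = 0.37                        # maximizer of the toy quality function (arbitrary)

def q_of_u(u):
    """Toy quality function of the control only, maximal at U_STAR."""
    return -(u - U_STAR) ** 2

# Discrete control: exhaustive search over a small grid of torques.
u_grid = np.linspace(-1.0, 1.0, 11)
u_discrete = u_grid[np.argmax(q_of_u(u_grid))]

# Continuous control: gradient ascent on u using dQ/du.
u_cont = 0.0
for _ in range(100):
    dq_du = -2.0 * (u_cont - U_STAR)
    u_cont += 0.1 * dq_du
```

The grid search can only return the closest grid point, whereas the gradient ascent converges to the maximizer itself.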

Like Deep-Q, it optimizes the Q function to contract the Bellman residual. To do that efficiently, it also uses minibatches to avoid the local collapses due to sample dependency. In addition, it smooths the gradient direction using a so-called "target network", which can be understood as an ad-hoc trust region to avoid violent gradient steps. Finally, it greedily optimizes a policy network: one of the main tricks of the paper is to compute the gradient direction of the policy network, which uses the Jacobian of the value network. But in the following implementation, this is automatically computed by TensorFlow's automatic differentiation (tf.GradientTape).
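
Before reading the full implementation, the chain rule behind the policy gradient can be checked by hand on a toy problem. The sketch below assumes a linear policy u = x.theta and an invented critic Q(x,u) = -(u - x.k_opt)^2 (both are illustrative choices, not the networks used below). It computes the gradient of the actor loss as dQ/du times du/dtheta, compares it with finite differences, and then runs a plain gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))            # minibatch of 64 states of dimension 3
k_opt = np.array([0.5, -1.0, 0.2])      # the toy critic is maximal at u = X @ k_opt

def critic(X, u):
    """Toy quality function Q(x,u) = -(u - x.k_opt)^2, one value per sample."""
    return -(u - X @ k_opt) ** 2

def dcritic_du(X, u):
    """Derivative of the toy critic with respect to the control."""
    return -2.0 * (u - X @ k_opt)

def actor_loss(theta):
    """Greedy actor loss: minus the mean quality of the policy controls."""
    return -np.mean(critic(X, X @ theta))

def actor_grad(theta):
    """Chain rule: dloss/dtheta = -mean(dQ/du * du/dtheta), with du/dtheta = x."""
    u = X @ theta
    return -(dcritic_du(X, u)[:, None] * X).mean(axis=0)

theta0 = np.zeros(3)
eps = 1e-6
# Central finite differences of the loss along each coordinate of theta.
fd_grad = np.array([(actor_loss(theta0 + eps * e) - actor_loss(theta0 - eps * e)) / (2 * eps)
                    for e in np.eye(3)])

theta = theta0.copy()
for _ in range(500):
    theta -= 0.1 * actor_grad(theta)    # plain gradient descent on the actor loss
```

In the implementation, the same product of Jacobians is obtained without writing it explicitly, by differentiating the actor loss through the critic.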

In [None]:
# %load tp6/ddpg.py
'''
Deep actor-critic network, 
From "Continuous control with deep reinforcement learning", by Lillicrap et al, arXiv:1509.02971
'''

from tp6.env_pendulum import EnvPendulumSinCos; Env = lambda : EnvPendulumSinCos(1,viewer='meshcat')
import gym
import tensorflow as tf
import tensorflow.keras as tfk
import numpy as np
import matplotlib.pyplot as plt
import time
import random
from collections import deque
import signal

#######################################################################################################33
#######################################################################################################33
#######################################################################################################33
### --- Random seed
RANDOM_SEED = 0 # int((time.time()%10)*1000)
print("Seed = %d" %  RANDOM_SEED)
np .random.seed     (RANDOM_SEED)
random.seed         (RANDOM_SEED)
tf.random.set_seed  (RANDOM_SEED)

### --- Hyper parameters
NEPISODES               = 1000          # Number of training episodes
NSTEPS                  = 200           # Max episode length
QVALUE_LEARNING_RATE    = 0.001         # Base learning rate for the Q-value Network
POLICY_LEARNING_RATE    = 0.0001        # Base learning rate for the policy network
DECAY_RATE              = 0.99          # Discount factor 
UPDATE_RATE             = 0.01          # Homotopy rate to update the networks
REPLAY_SIZE             = 10000         # Size of replay buffer
BATCH_SIZE              = 64            # Number of points to be fed in stochastic gradient
NH1 = NH2               = 250           # Hidden layer size
EXPLORATION_NOISE       = 0.2

### --- Environment
# problem = "Pendulum-v1"
# env = gym.make(problem)
# NX = env.observation_space.shape[0]
# NU = env.action_space.shape[0]
# UMAX = env.action_space.high[0]
# env.reset(seed=RANDOM_SEED)
# assert( env.action_space.low[0]==-UMAX)

env                 = Env()             # Continuous pendulum
NX                  = env.nx            # ... training converges with q,qdot with 2x more neurones.
NU                  = env.nu            # Control is dim-1: joint torque
UMAX                = env.umax[0]       # Torque range


#######################################################################################################33
### NETWORKS ##########################################################################################33
#######################################################################################################33

class QValueNetwork:
    '''
    Neural representation of the Quality function:
    Q:  x,u -> Q(x,u) in R
    '''
    def __init__(self,nx,nu,nhiden1=32,nhiden2=256,learning_rate=None):

        state_input = tfk.layers.Input(shape=(nx,))
        state_out = tfk.layers.Dense(nhiden1, activation="relu")(state_input)
        state_out = tfk.layers.Dense(nhiden1, activation="relu")(state_out)

        action_input = tfk.layers.Input(shape=(nu,))
        action_out = tfk.layers.Dense(nhiden1, activation="relu")(action_input)

        concat = tfk.layers.Concatenate()([state_out, action_out])

        out = tfk.layers.Dense(nhiden2, activation="relu")(concat)
        out = tfk.layers.Dense(nhiden2, activation="relu")(out)
        value_output = tfk.layers.Dense(1)(out)

        self.model = tfk.Model([state_input, action_input], value_output)

    @tf.function
    def targetAssign(self,target,tau=UPDATE_RATE):
        for (tar,cur) in zip(target.model.variables,self.model.variables):
            tar.assign(cur * tau + tar * (1 - tau))
 

class PolicyNetwork:
    '''
    Neural representation of the policy function:
    Pi: x -> u=Pi(x) in R^nu
    '''
    def __init__(self,nx,nu,umax,nhiden=32,learning_rate=None):
        random_init = tf.random_uniform_initializer(minval=-0.005, maxval=0.005)
        
        state_input = tfk.layers.Input(shape=(nx,))
        out = tfk.layers.Dense(nhiden, activation="relu")(state_input)
        out = tfk.layers.Dense(nhiden, activation="relu")(out)
        policy_output = tfk.layers.Dense(1, activation="tanh",
                                         kernel_initializer=random_init)(out)*umax
        self.model = tfk.Model(state_input, policy_output)

    @tf.function
    def targetAssign(self,target,tau=UPDATE_RATE):
        for (tar,cur) in zip(target.model.variables,self.model.variables):
            tar.assign(cur * tau + tar * (1 - tau))

    def numpyPolicy(self,x,noise=None):
        '''Eval the policy with numpy input-output (nx,)->(nu,).'''
        x_tf = tf.expand_dims(tf.convert_to_tensor(x), 0)
        u = np.squeeze(self.model(x_tf).numpy(),0)
        if noise is not None:
            u = np.clip( u+noise, -UMAX,UMAX)
        return u

    def __call__(self, x,**kwargs):
        return self.numpyPolicy(x,**kwargs)

            
        
#######################################################################################################33

class OUNoise:
    '''
    Ornstein–Uhlenbeck processes are Markov random walks with the nice property of
    eventually reverting toward their mean.
    We use one to add some random search at the beginning of the exploration.
    '''
    def __init__(self, mean, std_deviation, theta=0.15, dt=1e-2, y_initial=None,dtype=np.float32):
        self.theta = theta
        self.mean = mean.astype(dtype)
        self.std_dev = std_deviation.astype(dtype)
        self.dt = dt
        self.dtype=dtype
        self.reset(y_initial)

    def __call__(self):
        # Formula taken from https://www.wikipedia.org/wiki/Ornstein-Uhlenbeck_process.
        noise = np.random.normal(size=self.mean.shape).astype(self.dtype)
        self.y += \
            self.theta * (self.mean - self.y) * self.dt \
            + self.std_dev * np.sqrt(self.dt) * noise
        return self.y.copy()

    def reset(self,y_initial = None):
        self.y = y_initial.astype(self.dtype) if y_initial is not None else np.zeros_like(self.mean)

### --- Replay memory
class ReplayItem:
    '''
    Storage for the minibatch
    '''
    def __init__(self,x,u,r,d,x2):
        self.x          = x
        self.u          = u
        self.reward     = r
        self.done       = d
        self.x2         = x2


#######################################################################################################33
quality = QValueNetwork(NX,NU,NH1,NH2)
qualityTarget = QValueNetwork(NX,NU,NH1,NH2)
quality.targetAssign(qualityTarget,1)

policy = PolicyNetwork(NX,NU,umax=UMAX,nhiden=NH2)
policyTarget = PolicyNetwork(NX,NU,umax=UMAX,nhiden=NH2)
policy.targetAssign(policyTarget,1)

replayDeque = deque(maxlen=REPLAY_SIZE)  # FIFO replay memory, bounded by REPLAY_SIZE

ou_noise = OUNoise(mean=np.zeros(1), std_deviation=float(EXPLORATION_NOISE) * np.ones(1))
ou_noise.reset( np.array([ UMAX/2 ]) )

#######################################################################################################33
### MAIN ACTOR-CRITIC BLOCK
#######################################################################################################33

critic_optimizer = tfk.optimizers.Adam(QVALUE_LEARNING_RATE)
actor_optimizer = tfk.optimizers.Adam(POLICY_LEARNING_RATE)

@tf.function
def learn(state_batch, action_batch, reward_batch, next_state_batch):
    '''
    <learn> is isolated in a tf.function to make it more efficient.
    @tf.function forces tensorflow to optimize the inner computation graph defined in this function.
    '''

    # Automatic differentiation of the critic loss, using tf.GradientTape
    # The critic loss is the classical Q-learning loss:
    #         loss = || Q(x,u) -  (reward + DECAY_RATE*Qtarget(xnext,PiTarget(xnext)) ) ||**2
    with tf.GradientTape() as tape:
        target_actions = policyTarget.model(next_state_batch, training=True)
        y = reward_batch + DECAY_RATE * qualityTarget.model(
            [next_state_batch, target_actions], training=True
        )
        critic_value = quality.model([state_batch, action_batch], training=True)
        critic_loss = tf.math.reduce_mean(tf.math.square(y - critic_value))
        
    critic_grad = tape.gradient(critic_loss, quality.model.trainable_variables)
    critic_optimizer.apply_gradients(
        zip(critic_grad, quality.model.trainable_variables)
    )

    # Automatic differentiation of the actor loss, using tf.GradientTape
    # The actor loss implements a greedy optimization on the quality function
    #           loss = -Q(x,Pi(x))
    with tf.GradientTape() as tape:
        actions = policy.model(state_batch, training=True)
        critic_value = quality.model([state_batch, actions], training=True)
        actor_loss = -tf.math.reduce_mean(critic_value)

    actor_grad = tape.gradient(actor_loss, policy.model.trainable_variables)
    actor_optimizer.apply_gradients(
        zip(actor_grad, policy.model.trainable_variables)
    )
  

#######################################################################################################33
#######################################################################################################33
#######################################################################################################33

def rendertrial(maxiter=NSTEPS,verbose=True):
    '''
    Display a roll-out from random start and optimal feedback.
    Press ^Z to get a roll-out at training time.
    '''
    x = env.reset()
    rsum = 0.
    for i in range(maxiter):
        u = policy(x)
        x, reward = env.step(u)[:2]
        env.render()
        rsum += reward
    if verbose: print('Lasted ',i,' timestep -- total reward:',rsum)
signal.signal(signal.SIGTSTP, lambda x,y:rendertrial()) # Roll-out when CTRL-Z is pressed
env.full.sleepAtDisplay=5e-3

# Logs
h_rewards = []
h_steps   = []

# Takes about 4 min to train
for episode in range(NEPISODES):

    prev_state = env.reset()

    for step in range(NSTEPS):
        # Uncomment this to see the actor in action
        # (but not in a Python notebook).
        #env.render()

        action = policy(prev_state, noise=ou_noise())
        state, reward = env.step(action)[:2]
        done=False
        
        replayDeque.append(ReplayItem(prev_state, action, reward, done, state))
        
        prev_state = state

        if len(replayDeque) <= BATCH_SIZE:  continue


        ####################################################################
        # Sample a minibatch
        
        batch = random.sample(replayDeque,BATCH_SIZE)            # Random batch from replay memory.
        state_batch    = tf.convert_to_tensor([ b.x      for b in batch ])
        action_batch    = tf.convert_to_tensor([ b.u      for b in batch ])
        reward_batch    = tf.convert_to_tensor([ [ b.reward ] for b in batch ],dtype=np.float32)
        done_batch    = tf.convert_to_tensor([ b.done   for b in batch ])
        next_state_batch   = tf.convert_to_tensor([ b.x2     for b in batch ])

        ####################################################################
        # One gradient step for the minibatch

        # Critic and actor gradients
        learn(state_batch, action_batch, reward_batch, next_state_batch)
        # Step smoothing using target networks
        policy.targetAssign(policyTarget)
        quality.targetAssign(qualityTarget)

        if done: break   # stop at episode end.

    # Some prints and logs
    episodic_reward = sum([ replayDeque[-i-1].reward for i in range(step+1) ])
    h_rewards.append( episodic_reward )
    h_steps.append(step+1)
    
    print(f'Ep#{episode:3d}: lasted {step+1:d} steps, reward={episodic_reward:3.1f} ')

    
    # avg_reward = np.mean(h_rewards[-40:])
    # if episode==5 and RANDOM_SEED==0:
    #     assert(  abs(avg_reward + 1423.0528188196286) < 1e-3 )
    # if episode==0 and RANDOM_SEED==0:
    #     assert(  abs(avg_reward + 1712.386325099637) < 1e-3 )
        
# Plotting graph
# Episodes versus Avg. Rewards
plt.plot(h_rewards)
plt.xlabel("Episode")
plt.ylabel("Episodic Reward")
plt.show()

#######################################################################################################33
#######################################################################################################33
#######################################################################################################33
