# Deep Q-Learning on a CartPole Built in PyBullet

The main goal of deep reinforcement learning is to train an "agent" that interacts with an "environment". The environment represents the task or problem to be solved, which in this case is the CartPole. The agent is the neural network that will be trained.

![title](CartPend.png)



The environment gives us two things:
- State: It describes where the environment is at time $t$. In this case, the state can be represented as an image, or as a state vector containing the position and velocity of the cart ($x,\dot{x}$) and the angle and angular velocity of the pendulum ($\theta,\omega$).
- Reward: Every time the agent does something good, the environment gives it a reward, and when it does something bad, it gives it a penalty. Here, we reward the cart at every time step while the pole stays between (-10°, 10°), and give it a penalty if the pole leaves that range. (The environment code below actually ends the episode at 0.3 rad, about 17°, and starts the pole within ±0.174 rad, about ±10°.)
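
The reward scheme just described can be sketched in a few lines. This is a minimal illustration, not the notebook's environment code: the name `step_reward`, the default 10° limit, and the return format are assumptions made for the example, and the -100 penalty mirrors the value used in the environment class further down.

```python
import math

def step_reward(theta, limit=math.radians(10.0), penalty=-100.0):
    """Return (reward, done) for one time step, given the pole angle in radians."""
    if abs(theta) > limit:
        # The pole left the allowed range: penalize and end the episode.
        return penalty, True
    # The pole is still balanced: reward the agent for surviving this step.
    return 1.0, False

# State vector as described above: (x, x_dot, theta, omega).
state = (0.0, 0.0, math.radians(3.0), 0.0)
print(step_reward(state[2]))             # pole at 3 degrees, inside the range
print(step_reward(math.radians(15.0)))   # pole at 15 degrees, outside the range
```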

Our agent, as mentioned, is a neural network with 4 inputs and two outputs (go left or go right):
![title](NN.png)

The number of hidden neurons may change, but the sizes of the input and output layers may not.
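
To make the point about layer sizes concrete, the sketch below counts the trainable parameters of a fully connected network for two different hidden configurations that share the same 4 inputs and 2 outputs. The helper `dense_param_count` is hypothetical; the first layer list copies the widths used in `create_model` later in this notebook, while the second is an arbitrary smaller alternative chosen only for comparison.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the width of every layer, input first, output last.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out   # weight matrix plus one bias per unit
    return total

# 4 state inputs, hidden layers as in create_model, 2 action outputs.
print(dense_param_count([4, 512, 256, 64, 32, 2]))
# A much smaller hidden stack with the same input and output sizes.
print(dense_param_count([4, 24, 24, 2]))
```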

The agent's goal is to maximize the discounted sum of future rewards. The strategy it uses to choose the next action from the current state is called the "policy": it maps states to actions, favoring actions that promise high rewards. The quantity being maximized is:$$\sum_{t=0}^\infty \gamma^t R(s_t,a_t)  \hspace{2cm} \text{Discounted return}$$

$\gamma$ is the "discount factor" (a value between 0 and 1). It gives more weight to immediate rewards than to rewards further in the future.
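
As a small worked example of the discounted sum, the sketch below evaluates it for a short, finite list of rewards. The function name `discounted_return` is an assumption for illustration; $\gamma = 0.95$ matches the value set in the agent later in this notebook.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * rewards[t] over a finite episode."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

rewards = [1.0, 1.0, 1.0]
# Computes 1 + 0.95 + 0.95**2: later rewards count for less.
print(discounted_return(rewards, gamma=0.95))
```

With $\gamma = 1$ every reward counts equally, and with $\gamma = 0$ only the first reward matters.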

We need a function called "Q" (a table in classic Q-learning; here it is approximated by the neural network). It maps each state-action pair to the immediate reward plus the best discounted future rewards that can be collected by later actions in the trajectory. It is updated with the following equation, where $\alpha$ is the learning rate:
$$Q(s_t,a_t) \leftarrow (1-\alpha) \cdot Q(s_t,a_t) + \alpha \cdot (R_t + \gamma \cdot \max\limits_{{a}} Q(s_{t+1},a))$$
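
The update rule can be tried out on a plain lookup table before moving to a neural network. The following is a minimal tabular sketch under stated assumptions: `q_update`, the string state labels, the learning rate of 0.1, and the action coding (0 = left, 1 = right) are all illustrative choices, and a single learning rate `alpha` is used for both terms of the update.

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.95, done=False):
    """Apply one Q-learning update in place and return the new Q(s, a)."""
    # Terminal transitions have no future value to bootstrap from.
    best_next = 0.0 if done else max(Q[(s_next, b)] for b in actions)
    target = r + gamma * best_next
    Q[(s, a)] = (1.0 - alpha) * Q[(s, a)] + alpha * target
    return Q[(s, a)]

Q = defaultdict(float)   # unseen state-action pairs default to 0.0
actions = [0, 1]         # assumed coding: 0 = left, 1 = right
q_update(Q, s="s0", a=1, r=1.0, s_next="s1", actions=actions)
# With an all-zero table this is 0.9 * 0 + 0.1 * (1 + 0.95 * 0).
print(Q[("s0", 1)])
```

The deep variant used in this notebook replaces the table lookup with a network prediction and moves the prediction toward the same target by minimizing a squared error.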


Here is the pseudocode from https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf:
![title](PseudoCode.png)
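
Two building blocks of deep Q-learning with experience replay are epsilon-greedy action selection and a bounded replay memory sampled uniformly at random. The sketch below shows both using only the standard library. The helper names (`epsilon_greedy`, `remember`, `sample_batch`) are hypothetical and the Q-values are hard-coded stand-ins for a network's output; the memory size of 2000 matches the agent defined later in this notebook.

```python
import random
from collections import deque

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() <= epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])

memory = deque(maxlen=2000)   # oldest transitions are dropped once full

def remember(state, action, reward, next_state, done):
    """Store one transition in the replay memory."""
    memory.append((state, action, reward, next_state, done))

def sample_batch(batch_size, rng=random):
    """Return a uniformly sampled minibatch, or None until enough data exists."""
    if len(memory) < batch_size:
        return None
    return rng.sample(list(memory), batch_size)

# Fill the memory with a few dummy transitions and draw a minibatch.
dummy_state = (0.0, 0.0, 0.0, 0.0)
for _ in range(5):
    action = epsilon_greedy([0.2, 0.7], epsilon=0.1)
    remember(dummy_state, action, 1.0, dummy_state, False)
print(len(sample_batch(4)))
```

Sampling from the memory instead of training on consecutive steps breaks the correlation between successive transitions, which is the motivation for experience replay.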

In [9]:
import pybullet as pb
import time
from PIL import Image
import pybullet_data
import numpy as np
import random
from collections import deque
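# Import everything from tensorflow.keras so it matches the `keras` module used below
# (mixing the standalone `keras` package with `tensorflow.keras` can cause errors).
from tensorflow import keras
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.optimizers import Adam, RMSprop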

from packaging import version
from datetime import datetime
from tensorboard import notebook
%load_ext tensorboard

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


In [10]:
physicsClient = pb.connect(pb.GUI)
# physicsClient = pb.connect(pb.DIRECT)
pb.setAdditionalSearchPath(pybullet_data.getDataPath())
pb.setGravity(0, 0, -9.81)
pb.isNumpyEnabled()

pb.setTimeStep(0.03)

pb.configureDebugVisualizer(pb.COV_ENABLE_GUI,0)
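# configureDebugVisualizer takes one flag per call, so set each option separately.
pb.configureDebugVisualizer(pb.COV_ENABLE_MOUSE_PICKING,0)
pb.configureDebugVisualizer(pb.COV_ENABLE_VR_RENDER_CONTROLLERS,0)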
planetID = pb.loadURDF("plane.urdf")

# Define the Keras TensorBoard callback.
logdir="logs/fit/" + datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = keras.callbacks.TensorBoard(log_dir=logdir)

In [6]:
class CartPend():
    def __init__(self):
        self.cubeStarsPos = [0,0,0.12]
        self.cubeStartOrientation = pb.getQuaternionFromEuler([0,0,0])
        
        self.robotId = pb.loadURDF("cartpend.urdf",self.cubeStarsPos,self.cubeStartOrientation)
        
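        # Disable the default motor on joint 4 (the pendulum) so it can swing freely.
        pb.setJointMotorControl2(self.robotId,4,pb.VELOCITY_CONTROL,force=0)
        # Start the pendulum at a random angle in [-0.174, 0.174] rad (about ±10°).
        pb.resetJointState(self.robotId,4,-0.174 + (0.174+0.174)*np.random.rand())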
        
        self.time = 0
        self.reward = 0
        self.End = False
        
    def reset(self):
        self.End = False
        self.reward = 0
        self.time = 0
        
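    def run(self,action):
        # Map the discrete action to a target velocity: 0 -> -10, anything else -> +10.
        targetVelocity = -10 if action == 0 else 10

        # Drive joints 0-3 (assumed to be the wheels) with the same target velocity.
        for i in range(4):
            pb.setJointMotorControl2(self.robotId,i,pb.VELOCITY_CONTROL,targetVelocity=targetVelocity)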
            
        view = pb.getBasePositionAndOrientation(self.robotId)[0]
        pb.resetDebugVisualizerCamera(3,0,-20,view)
        pb.stepSimulation()
        
            
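        cartPos, cartVel = pb.getJointState(self.robotId,0)[0:2]
        poleAngle, poleVel = pb.getJointState(self.robotId,4)[0:2]

        if abs(poleAngle) > 0.3 or abs(cartPos) > 100:
            # Failure: the pole fell past 0.3 rad or the cart drifted too far.
            stepReward = -100
            self.End = True
        else:
            # +1 for surviving this step, minus a small penalty for drifting from the origin.
            stepReward = 1 - abs(cartPos)*0.01
            self.time+=1
        self.reward += stepReward  # cumulative episode score

        time.sleep(0.01)
        # Return the per-step reward: the Q-learning target uses R_t, not the running total.
        return (cartPos, cartVel, poleAngle, poleVel),stepReward,self.End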

def Reset_Env(Env):
    if Env.End:
        time.sleep(0.1)
        pb.removeBody(Env.robotId)
        return CartPend()
    return Env

In [4]:
class DQNAgent:
    def __init__(self):
        self.state_size = 4
        self.action_size = 2
        self.EPISODES = 150
        self.memory = deque(maxlen=2000)
        
        self.graph = []
        
        self.gamma = 0.95    # discount rate
        self.epsilon = 1.0  # exploration rate
        self.epsilon_min = 0.001
        self.epsilon_decay = 0.999
        self.batch_size = 128
        self.train_start = 128

        self.model = self.create_model()
               
    def create_model(self):
        model = keras.models.Sequential()
    
        model.add(Dense(512, input_shape=(self.state_size,), activation="relu"))
        model.add(Dense(256, activation="relu"))
        model.add(Dense(64, activation="relu"))
        model.add(Dense(32, activation="relu"))
        model.add(Dense(self.action_size, activation="linear"))

        # `lr` is deprecated in favor of `learning_rate`; accuracy is not meaningful for Q-value regression.
        model.compile(loss="mse", optimizer=Adam(learning_rate=0.001))

        return model

    def memorize(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def act(self, state):
        if np.random.random() <= self.epsilon:
            return random.randrange(self.action_size)
        
        return np.argmax(self.model.predict(state))

    def replay(self):
        if len(self.memory) < self.train_start:
            return
        
        minibatch = random.sample(self.memory,self.batch_size)

        action = [i[1] for i in minibatch]
        reward = [i[2] for i in minibatch]
        done = [i[4] for i in minibatch]
        state = np.array([i[0] for i in minibatch]).reshape((self.batch_size,self.state_size))
        next_state = np.array([i[3] for i in minibatch]).reshape((self.batch_size,self.state_size))
        
        
        target = self.model.predict(state)
        target_next = self.model.predict(next_state)

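        # Q-learning target: reward only for terminal steps, otherwise reward plus discounted best next Q-value.
        for i in range(self.batch_size):
            target[i][action[i]] = reward[i] if done[i] else reward[i] + self.gamma * (np.amax(target_next[i]))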

        self.model.fit(state, target, batch_size=self.batch_size, verbose=0,callbacks=[tensorboard_callback])
        

        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

    def load(self, name):
        self.model.load_weights(name+".h5")
        
    def save(self, name):
        self.model.save_weights(name+".h5")
              
    def run(self):
        print("Running")
        Env = CartPend()
        
        # Resume from saved weights if they exist; otherwise start from scratch.
        try:
            self.model.load_weights("CartPend_model.h5")
        except (OSError, ValueError):
            pass
        
        for e in range(self.EPISODES):
            state = pb.getJointState(Env.robotId,0)[0:2] + pb.getJointState(Env.robotId,4)[0:2]
            state = np.reshape(state, [1, self.state_size])
            done = False
            i = 0
            while True:
                action = self.act(state)
                next_state, reward, done = Env.run(action)
                next_state = np.reshape(next_state, [1, self.state_size])
                    
                self.memorize(state, action, reward, next_state, done)
                state = next_state
                
                i += 1
                if done:                   
                    print("episode: {}/{}, score: {}, e: {:.2}".format(e, self.EPISODES, Env.reward, self.epsilon))
                    
                    self.graph.append(Env.reward)
                    
                    if Env.reward >= 800:
                        print("Saving trained model as CartPend_model.h5")
                        self.save("CartPend_model")
                        return
                    Env = Reset_Env(Env)
                    break
                self.replay()    
                
        
    def test(self):
        Env = CartPend()
        
        self.model.load_weights("CartPend_model.h5")
        
        for e in range(30):
            state = pb.getJointState(Env.robotId,0)[0:2] + pb.getJointState(Env.robotId,4)[0:2]
            state = np.reshape(state, [1, self.state_size])
            done = False
            i = 0
            while True:
                action = np.argmax(self.model.predict(state))
                next_state, reward, done = Env.run(action)
                next_state = np.reshape(next_state, [1, self.state_size])
                    
                state = next_state
                
                i += 1
                if done:                   
                    print("episode: {}/{}, score: {}, e: {:.2}".format(e, 30, Env.reward, self.epsilon))
                    Env = Reset_Env(Env)
                    break

In [11]:
Agent = DQNAgent()
#Agent.run()
Agent.test()

episode: 0/30, score: -41.412090882927906, e: 1.0
episode: 1/30, score: -85.28798405230732, e: 1.0
episode: 2/30, score: -85.28798403561373, e: 1.0
episode: 3/30, score: -68.84906929327616, e: 1.0
episode: 4/30, score: -86.24898658877369, e: 1.0
episode: 5/30, score: -42.076143122189926, e: 1.0
episode: 6/30, score: -8.97679581497114, e: 1.0
episode: 7/30, score: 970.1882699588243, e: 1.0
episode: 8/30, score: -87.21298958189452, e: 1.0
episode: 9/30, score: -88.17999234042809, e: 1.0
episode: 10/30, score: 115.14290555383329, e: 1.0
episode: 11/30, score: -88.17999236403278, e: 1.0
episode: 12/30, score: -88.1619942107773, e: 1.0
episode: 13/30, score: -21.949266390077526, e: 1.0
episode: 14/30, score: -87.21298966715158, e: 1.0
episode: 15/30, score: -50.828229461226016, e: 1.0


error: Not connected to physics server.

In [None]:
pb.disconnect()

In [6]:
%tensorboard --logdir logs

Reusing TensorBoard on port 6006 (pid 7012), started 0:37:23 ago. (Use '!kill 7012' to kill it.)