# Acrobot (Simple DeepQ Learning + Replay Memory) 

In [2]:
import gym
import numpy as np
import random
import imageio 
from matplotlib import pyplot as plt

## Context

In [3]:
env = gym.make('Acrobot-v1')

The Acrobot is described on [gym](https://www.gymlibrary.dev/environments/classic_control/acrobot/) as: "The system consists of two links connected linearly to form a chain, with one end of the chain fixed. The joint between the two links is actuated. The goal is to apply torques on the actuated joint to swing the free end of the linear chain above a given height while starting from the initial state of hanging downwards."

This task is made easier by its small, discrete action space: there are only 3 actions, whereas the pendulum has a continuous action space (infinitely many possible torques). The 3 actions are: apply -1 torque, apply 0 torque, and apply 1 torque.

In [4]:
env.action_space

Discrete(3)
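
As a small illustration, the three action indices can be mapped to the torques listed in the gym documentation. The names `ACTION_TO_TORQUE` and `describe_action` are hypothetical helpers introduced only for this sketch.

```python
# Index -> torque, following the order given in the gym documentation
# for Acrobot-v1 (0: -1 torque, 1: 0 torque, 2: +1 torque).
ACTION_TO_TORQUE = {0: -1.0, 1: 0.0, 2: 1.0}


def describe_action(action: int) -> str:
    """Return a readable description of a discrete action index."""
    torque = ACTION_TO_TORQUE[action]
    return f"action {action}: apply {torque:+.0f} torque"


for a in range(3):
    print(describe_action(a))
```

This is only a lookup for readability; the environment itself takes the integer index.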

The observation is a vector of 6 numbers: the cosine and sine of theta1, the cosine and sine of theta2, and the angular velocities of theta1 and theta2.
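
Since the angles are given as cosine/sine pairs, they can be recovered with `atan2` when a more readable form is wanted. This is a minimal sketch; `obs_to_angles` is a hypothetical helper, and the sample observation is made up for illustration (it corresponds to both links hanging straight down with no velocity).

```python
import math


def obs_to_angles(obs):
    """Recover (theta1, theta2) in radians from a 6-value Acrobot observation."""
    cos1, sin1, cos2, sin2, _vel1, _vel2 = obs
    theta1 = math.atan2(sin1, cos1)
    theta2 = math.atan2(sin2, cos2)
    return theta1, theta2


# Illustrative observation: cos = 1 and sin = 0 for both joints.
hanging_down = [1.0, 0.0, 1.0, 0.0, 0.0, 0.0]
print(obs_to_angles(hanging_down))
```

The network itself is fed the raw 6-value vector; this conversion is only useful for inspection or plotting.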

What matters here is that the state is continuous, not discrete, so there are infinitely many possible states. A QTable is still possible if each dimension is discretized into bins, but it is not the right solution: the table grows very quickly with the number of bins, and coarse bins lose information.
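
To get a feel for the problem, here is a rough count of how many entries a QTable would need if each of the 6 observation values were cut into a fixed number of bins. The bin counts are arbitrary choices for illustration, and `q_table_entries` is a hypothetical helper.

```python
def q_table_entries(bins_per_dim: int, n_dims: int = 6, n_actions: int = 3) -> int:
    """Number of Q-values in a table with `bins_per_dim` bins per state dimension."""
    return bins_per_dim ** n_dims * n_actions


for bins in (5, 10, 20):
    print(f"{bins} bins per dimension -> {q_table_entries(bins):,} entries")
```

Doubling the resolution multiplies the table size by 2^6 = 64, which is why a discretized table becomes impractical quickly.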

In [5]:
env.observation_space.shape

(6,)

If a QTable is not the right solution, a natural alternative is DeepQ Learning.

Instead of creating a table that stores all the QValues, we use a neural network to approximate those QValues. The neural network lets us handle an infinite number of states: the estimates depend only on the network's weights and not on a static table.

## DeepQ Learning (With replay memory)

The difference between a simple DeepQ Learning algorithm and one using replay memory is that the first learns only from its latest experience, whereas the second can also learn from past experiences.

This simple change makes a huge difference: without memory the model can't generalize well and is very likely not to train on every possible situation.

### Replay Memory (deque)
First, create the replay memory, which stores for each step: the current state, the action taken, the reward, the next state, and whether the state is terminal.

The memory can't have infinite capacity (for performance reasons), so we set it up as a deque with a maximum length: it automatically appends incoming experiences and discards the oldest ones once it is full.

The methods created are:
- 'push': appends a new experience to the memory.
- 'sample': randomly draws a batch of experiences from the memory and returns them; they are used to update the network weights.
- '__len__': makes the built-in 'len' return the memory length, which is used to start updating the network's weights once enough experiences are stored.
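
The eviction behaviour of a bounded deque can be seen on a tiny example. The capacity of 3 and the placeholder experiences are chosen only for illustration.

```python
from collections import deque
import random

# A deliberately tiny memory so the eviction is easy to see.
memory = deque(maxlen=3)

for step in range(5):
    # Placeholder experience; the real memory stores
    # (state, action, reward, next_state, done) tuples.
    memory.append(("experience", step))

# Only the 3 most recent experiences remain (steps 2, 3 and 4).
print(list(memory))

# Draw a random batch of 2 distinct experiences from what is left.
batch = random.sample(memory, 2)
print(batch)
```

No explicit deletion is needed: once `maxlen` is reached, each append drops the oldest element.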

In [6]:
from collections import deque
class ReplayMemory:
    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)
    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))
    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)
    def __len__(self):
        return len(self.memory)
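
The training loop later unpacks a sampled batch with `zip(*...)`. The sketch below shows what that does on a hand-written batch; the values are made up, with states shortened to a single number for readability.

```python
# A hypothetical batch of (state, action, reward, next_state, done) tuples.
batch = [
    ([0.1], 0, -1.0, [0.2], False),
    ([0.2], 2, -1.0, [0.3], False),
    ([0.3], 1, 0.0, [0.4], True),
]

# zip(*batch) transposes the list of tuples into one tuple per field.
states, actions, rewards, next_states, dones = zip(*batch)

print(actions)  # one entry per experience
print(dones)
```

Each resulting tuple has one entry per sampled experience, which makes it easy to convert every field to an array.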

### The network

The network can be set up however you want, and it doesn't need to be very deep: small layers of 32 and 64 units seem to be enough. The model below uses two hidden layers of 64 and 128 units.

The input (input_shape) is what the network receives: the observation, a vector of 6 values (shape: (6,)).

The outputs are the QValues of the possible actions; for the Acrobot the action space has a size of 3 (torque of -1, 0 or 1).

In [7]:
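from keras import layers, Sequential, optimizers

def create_model(input_shape, output_shape, learning_rate):
    # learning_rate is not used here: the optimizer is set when the caller compiles the model.
    model = Sequential()
    # Two hidden ReLU layers; input_shape is the observation size (6 for the Acrobot).
    model.add(layers.Dense(64, input_shape=(input_shape,), activation='relu'))
    model.add(layers.Dense(128, activation='relu'))
    # Linear output: one unbounded QValue per action.
    model.add(layers.Dense(output_shape, activation='linear'))
    return model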
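
To give an idea of how small this network is, the number of trainable parameters can be counted by hand for the 6 -> 64 -> 128 -> 3 architecture built by `create_model`. `dense_params` is a hypothetical helper for this sketch.

```python
def dense_params(n_in: int, n_out: int) -> int:
    """Parameters of a fully connected layer: one weight per connection plus one bias per unit."""
    return n_in * n_out + n_out


# Observation size, the two hidden layers, then one output per action.
layer_sizes = [6, 64, 128, 3]

total = sum(dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))
print(total)  # 448 + 8320 + 387 = 9155
```

A few thousand weights replace a table that would otherwise need millions of entries.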

### DeepQ Learning algorithm

The algorithm is pretty much the same as the QLearning one; the difference is in the QValue estimation.
Before, we would take the QTable and update the entry for the current state and the action taken. Here, updates are made only when the network learns from memory.

By taking a batch of past experiences we can compute the corresponding target QValues, but instead of storing them in a table, we update the network's weights by training the model to associate QValues with a state, so that it can take the best action.
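
The target for each sampled experience is the reward plus the discounted best QValue of the next state, with no bootstrap term when the next state is terminal. Below is a plain-Python sketch of that computation; `td_targets` is a hypothetical helper and the numbers are made up.

```python
def td_targets(rewards, next_q_values, dones, discount):
    """Compute r + discount * max_a' Q(s', a') for each experience.

    For terminal transitions (done is True) the target is just the reward.
    """
    targets = []
    for reward, q_next, done in zip(rewards, next_q_values, dones):
        bootstrap = 0.0 if done else discount * max(q_next)
        targets.append(reward + bootstrap)
    return targets


# Two illustrative experiences: one ordinary step, one terminal step.
rewards = [-1.0, 0.0]
next_q_values = [[-5.0, -3.0, -4.0], [1.0, 2.0, 3.0]]
dones = [False, True]

print(td_targets(rewards, next_q_values, dones, discount=0.99))
```

The training code does the same thing in vectorized form, and writes each target only into the output slot of the action that was actually taken.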

In [8]:
rewards_list = []

def DeepQLearning(env, learning_rate, discount, epsilon, max_steps, episodes, batch_size_p=32):
    
    input_shape = env.observation_space.shape[0]
    output_shape = env.action_space.n

    model = create_model(
        input_shape=input_shape, 
        output_shape=output_shape, 
        learning_rate=learning_rate)

    model.compile(loss='mse', optimizer=optimizers.Adam(learning_rate = learning_rate))


    memory = ReplayMemory(capacity=100000)
    batch_size = batch_size_p
    

    for i in range(episodes):
        state, info = env.reset()
        state = state.reshape(1, input_shape)
        done = False
        reward_tot = 0

        if i > 50:
            #model.save_weights(f"model.weights-eps{i}.h5")
            model.save_weights("model.weights.h5", overwrite=True)

        
        for j in range (max_steps): 

            if np.random.rand() <= epsilon:
                action = env.action_space.sample()
            else:
                action = np.argmax(model.predict(state, verbose=0)[0])

            new_state, reward, done, truncated, info = env.step(action)
            new_state = new_state.reshape(1, input_shape) 
            reward_tot += reward

            memory.push(state, action, reward, new_state, done)
            state = new_state

            
            # Only update the model once there are enough experiences in memory
            if len(memory) >= batch_size:

                # Sample a random batch and transpose it into one tuple per field
                states, actions, rewards, new_states, dones = zip(*memory.sample(batch_size))

                # Stack the (1, input_shape) arrays into (batch_size, input_shape)
                states = np.vstack(states)
                new_states = np.vstack(new_states)
                actions = np.array(actions)
                rewards = np.array(rewards, dtype=np.float32)
                dones = np.array(dones, dtype=np.float32)

                # Start from the current predictions so that only the taken action's output changes
                targets = model.predict_on_batch(states)
                q_values_next = model.predict_on_batch(new_states)
                max_q_values_next = np.amax(q_values_next, axis=1)

                # Target: r + discount * max_a' Q(s', a'), without bootstrap on terminal states
                targets[np.arange(batch_size), actions] = rewards + discount * max_q_values_next * (1 - dones)

                model.fit(states, targets, epochs=1, verbose=0)

                # Checkpoint after every update (slow, but an interrupted run still leaves weights on disk)
                model.save_weights('model.weights.h5', overwrite=True)
                


            # Stop on termination, or when the environment's own time limit is hit
            if done or truncated:
                break

        print("Episodes n°:", i, "Epsilon:", epsilon, "Total reward:", reward_tot)
        rewards_list.append(reward_tot)
        if epsilon > 0.001:
            epsilon *= 0.98




## Training time !

In [13]:
DeepQLearning(env, 
          learning_rate=0.01, 
          discount=0.99, 
          epsilon=1.0, 
          max_steps=500, 
          episodes=400,
          batch_size_p=256)

Episodes n°: 0 Epsilon: 1.0 Total reward: -500.0
Episodes n°: 1 Epsilon: 0.98 Total reward: -500.0
Episodes n°: 2 Epsilon: 0.9603999999999999 Total reward: -500.0
Episodes n°: 3 Epsilon: 0.9411919999999999 Total reward: -500.0
Episodes n°: 4 Epsilon: 0.9223681599999999 Total reward: -500.0
Episodes n°: 5 Epsilon: 0.9039207967999998 Total reward: -500.0
Episodes n°: 6 Epsilon: 0.8858423808639998 Total reward: -500.0
Episodes n°: 7 Epsilon: 0.8681255332467198 Total reward: -500.0
Episodes n°: 8 Epsilon: 0.8507630225817854 Total reward: -500.0
Episodes n°: 9 Epsilon: 0.8337477621301497 Total reward: -500.0
Episodes n°: 10 Epsilon: 0.8170728068875467 Total reward: -500.0
Episodes n°: 11 Epsilon: 0.8007313507497957 Total reward: -500.0
Episodes n°: 12 Epsilon: 0.7847167237347998 Total reward: -500.0
Episodes n°: 13 Epsilon: 0.7690223892601038 Total reward: -500.0
Episodes n°: 14 Epsilon: 0.7536419414749017 Total reward: -500.0
Episodes n°: 15 Epsilon: 0.7385691026454037 Total reward: -500.0
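
The epsilon values printed above follow the schedule in the training loop: epsilon is multiplied by 0.98 after each episode as long as it is above 0.001. The sketch below counts how many episodes that takes; `episodes_until_floor` is a hypothetical helper.

```python
def episodes_until_floor(epsilon=1.0, decay=0.98, floor=0.001):
    """Count the multiplications needed before epsilon is no longer above the floor."""
    episodes = 0
    while epsilon > floor:
        epsilon *= decay
        episodes += 1
    return episodes, epsilon


n_episodes, final_epsilon = episodes_until_floor()
print(n_episodes, final_epsilon)
```

With these settings the decay stops after 342 episodes, so with 400 episodes the agent only acts almost fully greedily near the end of training.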


In [1]:
from matplotlib import pyplot as plt  # needed again: this cell ran after a kernel restart
plt.plot(rewards_list)
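
Per-episode rewards are usually noisy, so a moving average can make the trend easier to read. This is a minimal sketch; `moving_average` is a hypothetical helper and the sample rewards are made up, not taken from the run above.

```python
def moving_average(values, window):
    """Return the average of each consecutive `window`-sized slice of `values`."""
    if window <= 0:
        raise ValueError("window must be positive")
    averages = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            # Drop the value that just left the window.
            running -= values[i - window]
        if i >= window - 1:
            averages.append(running / window)
    return averages


# Illustrative rewards only.
sample_rewards = [-500.0, -500.0, -400.0, -300.0]
print(moving_average(sample_rewards, window=2))
```

The result of `moving_average(rewards_list, window)` could then be passed to `plt.plot` alongside the raw curve.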

## Create the GIF

In [14]:
images = []
env = gym.make('Acrobot-v1', render_mode='rgb_array')

state,info = env.reset()


input_shape = env.observation_space.shape[0]
output_shape = env.action_space.n

state = state.reshape(1, input_shape)


model = create_model(
        input_shape=input_shape, 
        output_shape=output_shape, 
        learning_rate=0.001
)
model.load_weights('model.weights.h5')
model.compile(loss='mse', optimizer=optimizers.Adam(learning_rate = 0.001))

score = 0
done = False

stp = 0
while stp < 300:

        stp += 1
        print(stp)

        action = np.argmax(model.predict(state, verbose=0))

        new_state, reward, done, trunc, info = env.step(action)
        state = new_state.reshape(1, input_shape) 

        frame = env.render()  # Save the frame
        for m in range(5):
                images.append(frame)


        if done == True:
                break
        score +=1

env.close()
imageio.mimsave('img/AcrobotDQN9.gif', images, fps=30)

1
2
3
...
115
