# Module Five Assignment: Cartpole Problem
Review the code in this notebook and in the score_logger.py file in the *scores* directory. Then return to this notebook and select **Cell** > **Run All** from the menu bar. The code takes several minutes to run.

In [1]:
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
  
  
from scores.score_logger import ScoreLogger  
  
ENV_NAME = "CartPole-v1"  
  
GAMMA = 0.95  
LEARNING_RATE = 0.001  
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
  
EXPLORATION_MAX = 1.0  
EXPLORATION_MIN = 0.01  
EXPLORATION_DECAY = 0.995  
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
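    def experience_replay(self):
        # Wait until the replay memory holds at least one full batch.
        if len(self.memory) < BATCH_SIZE:
            return
        batch = random.sample(self.memory, BATCH_SIZE)
        for state, action, reward, state_next, terminal in batch:
            # Q-learning target: the reward alone on a terminal step,
            # otherwise the reward plus the discounted best next-state Q-value.
            q_update = reward
            if not terminal:
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))
            # Overwrite only the taken action's Q-value, then fit on that single sample.
            q_values = self.model.predict(state)
            q_values[0][action] = q_update
            self.model.fit(state, q_values, verbose=0)
        # Decay the exploration rate once per replay call, never below the minimum.
        self.exploration_rate *= EXPLORATION_DECAY
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)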
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  



Using TensorFlow backend.
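
The training target built inside `experience_replay` can be looked at in isolation. The sketch below is a minimal illustration, not the notebook's code: `q_target` and `updated_q_row` are hypothetical helper names, and plain Python lists with made-up numbers stand in for the arrays that `model.predict` would return.

```python
GAMMA = 0.95  # same discount factor as the notebook


def q_target(reward, next_q_values, terminal, gamma=GAMMA):
    """Return the Q-learning target for a single transition."""
    if terminal:
        # No future reward after the episode ends.
        return reward
    # Reward plus the discounted value of the best next action.
    return reward + gamma * max(next_q_values)


def updated_q_row(current_q_values, action, target):
    """Copy the predicted Q-values and overwrite only the action that was taken."""
    row = list(current_q_values)
    row[action] = target
    return row


# Made-up Q-values standing in for model.predict(state_next)[0].
next_q = [0.5, 2.0]
target = q_target(1.0, next_q, terminal=False)
print(target)
print(q_target(-1.0, next_q, terminal=True))
print(updated_q_row([0.3, 0.7], action=0, target=target))
```

Only the entry for the chosen action changes, so the squared-error loss penalizes the network for that action alone while the other outputs are fitted to their own current predictions.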


In [2]:
cartpole()

Run: 1, exploration: 1.0, score: 14
Scores: (min: 14, avg: 14, max: 14)

Run: 2, exploration: 0.9275689688183278, score: 21
Scores: (min: 14, avg: 17.5, max: 21)

Run: 3, exploration: 0.7552531090661897, score: 42
Scores: (min: 14, avg: 25.666666666666668, max: 42)

Run: 4, exploration: 0.7147372386831305, score: 12
Scores: (min: 12, avg: 22.25, max: 42)

Run: 5, exploration: 0.6730128848950395, score: 13
Scores: (min: 12, avg: 20.4, max: 42)

Run: 6, exploration: 0.6465587967553006, score: 9
Scores: (min: 9, avg: 18.5, max: 42)

Run: 7, exploration: 0.6088145090359074, score: 13
Scores: (min: 9, avg: 17.714285714285715, max: 42)

Run: 8, exploration: 0.547986285490042, score: 22
Scores: (min: 9, avg: 18.25, max: 42)

Run: 9, exploration: 0.5264466124450268, score: 9
Scores: (min: 9, avg: 17.22222222222222, max: 42)

Run: 10, exploration: 0.4858739637363176, score: 17
Scores: (min: 9, avg: 17.2, max: 42)

Run: 11, exploration: 0.46211964903917074, score: 11
Scores: (min: 9, avg: 16.636

Run: 87, exploration: 0.01, score: 200
Scores: (min: 8, avg: 76.6896551724138, max: 372)

Run: 88, exploration: 0.01, score: 205
Scores: (min: 8, avg: 78.14772727272727, max: 372)

Run: 89, exploration: 0.01, score: 198
Scores: (min: 8, avg: 79.49438202247191, max: 372)

Run: 90, exploration: 0.01, score: 67
Scores: (min: 8, avg: 79.35555555555555, max: 372)

Run: 91, exploration: 0.01, score: 201
Scores: (min: 8, avg: 80.6923076923077, max: 372)

Run: 92, exploration: 0.01, score: 37
Scores: (min: 8, avg: 80.21739130434783, max: 372)

Run: 93, exploration: 0.01, score: 139
Scores: (min: 8, avg: 80.84946236559139, max: 372)

Run: 94, exploration: 0.01, score: 189
Scores: (min: 8, avg: 82, max: 372)

Run: 95, exploration: 0.01, score: 179
Scores: (min: 8, avg: 83.02105263157895, max: 372)

Run: 96, exploration: 0.01, score: 215
Scores: (min: 8, avg: 84.39583333333333, max: 372)

Run: 97, exploration: 0.01, score: 221
Scores: (min: 8, avg: 85.80412371134021, max: 372)

Run: 98, explorati

Run: 187, exploration: 0.01, score: 273
Scores: (min: 15, avg: 184.6, max: 500)

Run: 188, exploration: 0.01, score: 115
Scores: (min: 15, avg: 183.7, max: 500)

Run: 189, exploration: 0.01, score: 120
Scores: (min: 15, avg: 182.92, max: 500)

Run: 190, exploration: 0.01, score: 138
Scores: (min: 15, avg: 183.63, max: 500)

Run: 191, exploration: 0.01, score: 64
Scores: (min: 15, avg: 182.26, max: 500)

Run: 192, exploration: 0.01, score: 174
Scores: (min: 15, avg: 183.63, max: 500)

Run: 193, exploration: 0.01, score: 138
Scores: (min: 15, avg: 183.62, max: 500)

Run: 194, exploration: 0.01, score: 191
Scores: (min: 15, avg: 183.64, max: 500)

Run: 195, exploration: 0.01, score: 154
Scores: (min: 15, avg: 183.39, max: 500)

Run: 196, exploration: 0.01, score: 309
Scores: (min: 15, avg: 184.33, max: 500)

Run: 197, exploration: 0.01, score: 198
Scores: (min: 15, avg: 184.1, max: 500)

Run: 198, exploration: 0.01, score: 144
Scores: (min: 15, avg: 183.75, max: 500)

Run: 199, exploratio

NameError: name 'exit' is not defined

In [3]:
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
  
  
from scores.score_logger import ScoreLogger  
  
ENV_NAME = "CartPole-v1"  
  
GAMMA = 0.95  
LEARNING_RATE = 0.0001  # decreased from 0.001 (smaller learning rate)
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
  
EXPLORATION_MAX = 1.0  
EXPLORATION_MIN = 0.01  
EXPLORATION_DECAY = 0.995  
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  



In [4]:
cartpole()

Run: 1, exploration: 1.0, score: 14
Scores: (min: 14, avg: 14, max: 14)

Run: 2, exploration: 0.9416228069143757, score: 18
Scores: (min: 14, avg: 16, max: 18)

Run: 3, exploration: 0.8955869907338783, score: 11
Scores: (min: 11, avg: 14.333333333333334, max: 18)

Run: 4, exploration: 0.8224322824348486, score: 18
Scores: (min: 11, avg: 15.25, max: 18)

Run: 5, exploration: 0.7514768435208588, score: 19
Scores: (min: 11, avg: 16, max: 19)

Run: 6, exploration: 0.7147372386831305, score: 11
Scores: (min: 11, avg: 15.166666666666666, max: 19)

Run: 7, exploration: 0.6730128848950395, score: 13
Scores: (min: 11, avg: 14.857142857142858, max: 19)

Run: 8, exploration: 0.6433260027715241, score: 10
Scores: (min: 10, avg: 14.25, max: 19)

Run: 9, exploration: 0.5562889678716474, score: 30
Scores: (min: 10, avg: 16, max: 30)

Run: 10, exploration: 0.5264466124450268, score: 12
Scores: (min: 10, avg: 15.6, max: 30)

Run: 11, exploration: 0.500708706245853, score: 11
Scores: (min: 10, avg: 15.1

Run: 83, exploration: 0.011150733307840981, score: 13
Scores: (min: 8, avg: 12.024096385542169, max: 30)

Run: 84, exploration: 0.010552547534153616, score: 12
Scores: (min: 8, avg: 12.023809523809524, max: 30)

Run: 85, exploration: 0.010137759008060509, score: 9
Scores: (min: 8, avg: 11.988235294117647, max: 30)

Run: 86, exploration: 0.01, score: 13
Scores: (min: 8, avg: 12, max: 30)

Run: 87, exploration: 0.01, score: 11
Scores: (min: 8, avg: 11.988505747126437, max: 30)

Run: 88, exploration: 0.01, score: 12
Scores: (min: 8, avg: 11.988636363636363, max: 30)

Run: 89, exploration: 0.01, score: 11
Scores: (min: 8, avg: 11.97752808988764, max: 30)

Run: 90, exploration: 0.01, score: 13
Scores: (min: 8, avg: 11.988888888888889, max: 30)

Run: 91, exploration: 0.01, score: 10
Scores: (min: 8, avg: 11.967032967032967, max: 30)

Run: 92, exploration: 0.01, score: 13
Scores: (min: 8, avg: 11.978260869565217, max: 30)

Run: 93, exploration: 0.01, score: 15
Scores: (min: 8, avg: 12.0107526

Run: 185, exploration: 0.01, score: 93
Scores: (min: 10, avg: 39.42, max: 139)

Run: 186, exploration: 0.01, score: 58
Scores: (min: 10, avg: 39.87, max: 139)

Run: 187, exploration: 0.01, score: 78
Scores: (min: 10, avg: 40.54, max: 139)

Run: 188, exploration: 0.01, score: 133
Scores: (min: 10, avg: 41.75, max: 139)

Run: 189, exploration: 0.01, score: 102
Scores: (min: 10, avg: 42.66, max: 139)

Run: 190, exploration: 0.01, score: 117
Scores: (min: 10, avg: 43.7, max: 139)

Run: 191, exploration: 0.01, score: 99
Scores: (min: 10, avg: 44.59, max: 139)

Run: 192, exploration: 0.01, score: 81
Scores: (min: 10, avg: 45.27, max: 139)

Run: 193, exploration: 0.01, score: 103
Scores: (min: 10, avg: 46.15, max: 139)

Run: 194, exploration: 0.01, score: 69
Scores: (min: 11, avg: 46.74, max: 139)

Run: 195, exploration: 0.01, score: 107
Scores: (min: 11, avg: 47.69, max: 139)

Run: 196, exploration: 0.01, score: 87
Scores: (min: 11, avg: 48.45, max: 139)

Run: 197, exploration: 0.01, score: 

Run: 286, exploration: 0.01, score: 301
Scores: (min: 69, avg: 189.9, max: 500)

Run: 287, exploration: 0.01, score: 174
Scores: (min: 69, avg: 190.86, max: 500)

Run: 288, exploration: 0.01, score: 144
Scores: (min: 69, avg: 190.97, max: 500)

Run: 289, exploration: 0.01, score: 176
Scores: (min: 69, avg: 191.71, max: 500)

Run: 290, exploration: 0.01, score: 215
Scores: (min: 69, avg: 192.69, max: 500)

Run: 291, exploration: 0.01, score: 233
Scores: (min: 69, avg: 194.03, max: 500)

Run: 292, exploration: 0.01, score: 196
Scores: (min: 69, avg: 195.18, max: 500)

Solved in 192 runs, 292 total runs.


NameError: name 'exit' is not defined
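
The `Scores:` lines and the closing "Solved in 192 runs, 292 total runs." message come from `ScoreLogger`, whose source is not shown in this notebook. The output is consistent with a rolling window over the most recent 100 runs: the averages are divided by 100 once run 100 is reached, and 292 - 100 = 192. The sketch below is a guess at that behavior under stated assumptions: the class name `RollingScores`, the window of 100, and the threshold of 195 are inferred from the output, not taken from score_logger.py.

```python
from collections import deque
from statistics import mean

WINDOW = 100         # assumed: number of recent runs that are averaged
SOLVE_AVERAGE = 195  # assumed: rolling average needed to count as solved


class RollingScores:
    """Track recent scores and report when their average crosses a threshold."""

    def __init__(self, window=WINDOW, solve_average=SOLVE_AVERAGE):
        self.scores = deque(maxlen=window)  # old scores fall out automatically
        self.solve_average = solve_average

    def add_score(self, score, run):
        self.scores.append(score)
        average = mean(self.scores)
        print("Scores: (min: %s, avg: %s, max: %s)"
              % (min(self.scores), average, max(self.scores)))
        # Only a full window can count as solved.
        window_full = len(self.scores) == self.scores.maxlen
        solved = window_full and average >= self.solve_average
        if solved:
            print("Solved in %d runs, %d total runs."
                  % (run - self.scores.maxlen, run))
        return solved


# Tiny demonstration with a window of 3 instead of 100.
logger = RollingScores(window=3, solve_average=10)
for run, score in enumerate([4, 12, 12, 12], start=1):
    if logger.add_score(score, run):
        break
```

Returning a flag, as done here, lets the training loop stop with a plain `break` or `return` instead of relying on an `exit` call.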

In [1]:
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
  
  
from scores.score_logger import ScoreLogger  
  
ENV_NAME = "CartPole-v1"  
  
GAMMA = 0.95  
LEARNING_RATE = 0.01  # increased from 0.001 (larger learning rate)
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
  
EXPLORATION_MAX = 1.0  
EXPLORATION_MIN = 0.01  
EXPLORATION_DECAY = 0.995  
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  



Using TensorFlow backend.


Note: If the code is running properly, you should begin to see output appearing above this code block. It will take several minutes, so it is recommended that you let this code run in the background while completing other work. When the code has finished, it will print output saying, "Solved in _ runs, _ total runs."

You may see an error about not having an exit command. This error does not affect the program's functionality and results from the steps taken to convert the code from Python 2.x to Python 3. Please disregard this error.
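
If the traceback is distracting, one option is to catch that specific error around the training call. The sketch below is only an illustration: `run_training` and `fake_training` are hypothetical names, and `fake_training` stands in for `cartpole()` so that the example runs without Gym or Keras.

```python
def run_training(train_fn):
    """Run train_fn and treat the known 'exit' NameError as a normal stop."""
    try:
        train_fn()
    except NameError as err:
        # Re-raise anything that is not the expected missing-exit error.
        if "'exit'" not in str(err):
            raise
        return "stopped: " + str(err)
    return "finished"


def fake_training():
    # Hypothetical stand-in that fails the same way the notebook output does.
    raise NameError("name 'exit' is not defined")


print(run_training(fake_training))
```

In the notebook itself the call would be `run_training(cartpole)`. Checking the message keeps genuine `NameError` bugs visible.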

In [3]:
cartpole()

Run: 1, exploration: 1.0, score: 10
Scores: (min: 10, avg: 10, max: 10)

Run: 2, exploration: 0.9558895783575597, score: 19
Scores: (min: 10, avg: 14.5, max: 19)

Run: 3, exploration: 0.9000874278732445, score: 13
Scores: (min: 10, avg: 14, max: 19)

Run: 4, exploration: 0.8433051360508336, score: 14
Scores: (min: 10, avg: 14, max: 19)

Run: 5, exploration: 0.7552531090661897, score: 23
Scores: (min: 10, avg: 15.8, max: 23)

Run: 6, exploration: 0.5878229785513479, score: 51
Scores: (min: 10, avg: 21.666666666666668, max: 51)

Run: 7, exploration: 0.5647174463480732, score: 9
Scores: (min: 9, avg: 19.857142857142858, max: 51)

Run: 8, exploration: 0.5425201222922789, score: 9
Scores: (min: 9, avg: 18.5, max: 51)

Run: 9, exploration: 0.5057535983897912, score: 15
Scores: (min: 9, avg: 18.11111111111111, max: 51)

Run: 10, exploration: 0.47862223409330756, score: 12
Scores: (min: 9, avg: 17.5, max: 51)

Run: 11, exploration: 0.4439551321314536, score: 16
Scores: (min: 9, avg: 17.3636363

Run: 88, exploration: 0.01, score: 56
Scores: (min: 8, avg: 22.295454545454547, max: 88)

Run: 89, exploration: 0.01, score: 86
Scores: (min: 8, avg: 23.01123595505618, max: 88)

Run: 90, exploration: 0.01, score: 25
Scores: (min: 8, avg: 23.033333333333335, max: 88)

Run: 91, exploration: 0.01, score: 41
Scores: (min: 8, avg: 23.23076923076923, max: 88)

Run: 92, exploration: 0.01, score: 10
Scores: (min: 8, avg: 23.08695652173913, max: 88)

Run: 93, exploration: 0.01, score: 36
Scores: (min: 8, avg: 23.225806451612904, max: 88)

Run: 94, exploration: 0.01, score: 20
Scores: (min: 8, avg: 23.19148936170213, max: 88)

Run: 95, exploration: 0.01, score: 25
Scores: (min: 8, avg: 23.210526315789473, max: 88)

Run: 96, exploration: 0.01, score: 20
Scores: (min: 8, avg: 23.177083333333332, max: 88)

Run: 97, exploration: 0.01, score: 44
Scores: (min: 8, avg: 23.391752577319586, max: 88)

Run: 98, exploration: 0.01, score: 31
Scores: (min: 8, avg: 23.46938775510204, max: 88)

Run: 99, explor

Run: 192, exploration: 0.01, score: 9
Scores: (min: 8, avg: 34.51, max: 124)

Run: 193, exploration: 0.01, score: 173
Scores: (min: 8, avg: 35.88, max: 173)

Run: 194, exploration: 0.01, score: 23
Scores: (min: 8, avg: 35.91, max: 173)

Run: 195, exploration: 0.01, score: 30
Scores: (min: 8, avg: 35.96, max: 173)

Run: 196, exploration: 0.01, score: 9
Scores: (min: 8, avg: 35.85, max: 173)

Run: 197, exploration: 0.01, score: 30
Scores: (min: 8, avg: 35.71, max: 173)

Run: 198, exploration: 0.01, score: 25
Scores: (min: 8, avg: 35.65, max: 173)

Run: 199, exploration: 0.01, score: 20
Scores: (min: 8, avg: 35.25, max: 173)

Run: 200, exploration: 0.01, score: 13
Scores: (min: 8, avg: 35.24, max: 173)

Run: 201, exploration: 0.01, score: 147
Scores: (min: 8, avg: 36.24, max: 173)

Run: 202, exploration: 0.01, score: 74
Scores: (min: 8, avg: 36.47, max: 173)

Run: 203, exploration: 0.01, score: 166
Scores: (min: 8, avg: 37.59, max: 173)

Run: 204, exploration: 0.01, score: 13
Scores: (min

Run: 296, exploration: 0.01, score: 53
Scores: (min: 8, avg: 32.75, max: 166)

Run: 297, exploration: 0.01, score: 19
Scores: (min: 8, avg: 32.64, max: 166)

Run: 298, exploration: 0.01, score: 19
Scores: (min: 8, avg: 32.58, max: 166)

Run: 299, exploration: 0.01, score: 10
Scores: (min: 8, avg: 32.48, max: 166)

Run: 300, exploration: 0.01, score: 8
Scores: (min: 8, avg: 32.43, max: 166)

Run: 301, exploration: 0.01, score: 10
Scores: (min: 8, avg: 31.06, max: 166)

Run: 302, exploration: 0.01, score: 9
Scores: (min: 8, avg: 30.41, max: 166)

Run: 303, exploration: 0.01, score: 10
Scores: (min: 8, avg: 28.85, max: 152)

Run: 304, exploration: 0.01, score: 10
Scores: (min: 8, avg: 28.82, max: 152)

Run: 305, exploration: 0.01, score: 9
Scores: (min: 8, avg: 28.46, max: 152)

Run: 306, exploration: 0.01, score: 9
Scores: (min: 8, avg: 27.68, max: 152)

Run: 307, exploration: 0.01, score: 10
Scores: (min: 8, avg: 27.37, max: 152)

Run: 308, exploration: 0.01, score: 9
Scores: (min: 8, a

Run: 401, exploration: 0.01, score: 81
Scores: (min: 9, avg: 27.04, max: 122)

Run: 402, exploration: 0.01, score: 49
Scores: (min: 9, avg: 27.44, max: 122)

Run: 403, exploration: 0.01, score: 99
Scores: (min: 9, avg: 28.33, max: 122)

Run: 404, exploration: 0.01, score: 41
Scores: (min: 9, avg: 28.64, max: 122)

Run: 405, exploration: 0.01, score: 20
Scores: (min: 9, avg: 28.75, max: 122)

Run: 406, exploration: 0.01, score: 14
Scores: (min: 9, avg: 28.8, max: 122)

Run: 407, exploration: 0.01, score: 61
Scores: (min: 9, avg: 29.31, max: 122)

Run: 408, exploration: 0.01, score: 22
Scores: (min: 9, avg: 29.44, max: 122)

Run: 409, exploration: 0.01, score: 106
Scores: (min: 9, avg: 30.4, max: 122)

Run: 410, exploration: 0.01, score: 43
Scores: (min: 9, avg: 30.72, max: 122)

Run: 411, exploration: 0.01, score: 15
Scores: (min: 9, avg: 30.76, max: 122)

Run: 412, exploration: 0.01, score: 49
Scores: (min: 9, avg: 30.73, max: 122)

Run: 413, exploration: 0.01, score: 88
Scores: (min: 

Run: 506, exploration: 0.01, score: 98
Scores: (min: 8, avg: 17.15, max: 106)

Run: 507, exploration: 0.01, score: 36
Scores: (min: 8, avg: 16.9, max: 106)

Run: 508, exploration: 0.01, score: 38
Scores: (min: 8, avg: 17.06, max: 106)

Run: 509, exploration: 0.01, score: 30
Scores: (min: 8, avg: 16.3, max: 98)

Run: 510, exploration: 0.01, score: 39
Scores: (min: 8, avg: 16.26, max: 98)

Run: 511, exploration: 0.01, score: 11
Scores: (min: 8, avg: 16.22, max: 98)

Run: 512, exploration: 0.01, score: 10
Scores: (min: 8, avg: 15.83, max: 98)

Run: 513, exploration: 0.01, score: 9
Scores: (min: 8, avg: 15.04, max: 98)

Run: 514, exploration: 0.01, score: 13
Scores: (min: 8, avg: 14.5, max: 98)

Run: 515, exploration: 0.01, score: 9
Scores: (min: 8, avg: 14.5, max: 98)

Run: 516, exploration: 0.01, score: 9
Scores: (min: 8, avg: 14.49, max: 98)

Run: 517, exploration: 0.01, score: 8
Scores: (min: 8, avg: 14.47, max: 98)

Run: 518, exploration: 0.01, score: 10
Scores: (min: 8, avg: 14.48, m

Run: 612, exploration: 0.01, score: 10
Scores: (min: 8, avg: 9.96, max: 25)

Run: 613, exploration: 0.01, score: 8
Scores: (min: 8, avg: 9.95, max: 25)

Run: 614, exploration: 0.01, score: 9
Scores: (min: 8, avg: 9.91, max: 25)

Run: 615, exploration: 0.01, score: 10
Scores: (min: 8, avg: 9.92, max: 25)

Run: 616, exploration: 0.01, score: 10
Scores: (min: 8, avg: 9.93, max: 25)

Run: 617, exploration: 0.01, score: 8
Scores: (min: 8, avg: 9.93, max: 25)

Run: 618, exploration: 0.01, score: 12
Scores: (min: 8, avg: 9.95, max: 25)

Run: 619, exploration: 0.01, score: 11
Scores: (min: 8, avg: 9.96, max: 25)

Run: 620, exploration: 0.01, score: 16
Scores: (min: 8, avg: 10.03, max: 25)

Run: 621, exploration: 0.01, score: 9
Scores: (min: 8, avg: 10.03, max: 25)

Run: 622, exploration: 0.01, score: 9
Scores: (min: 8, avg: 10.02, max: 25)

Run: 623, exploration: 0.01, score: 9
Scores: (min: 8, avg: 10.01, max: 25)

Run: 624, exploration: 0.01, score: 9
Scores: (min: 8, avg: 10, max: 25)

Run:

Run: 718, exploration: 0.01, score: 10
Scores: (min: 8, avg: 10.25, max: 24)

Run: 719, exploration: 0.01, score: 9
Scores: (min: 8, avg: 10.23, max: 24)

Run: 720, exploration: 0.01, score: 10
Scores: (min: 8, avg: 10.17, max: 24)

Run: 721, exploration: 0.01, score: 10
Scores: (min: 8, avg: 10.18, max: 24)

Run: 722, exploration: 0.01, score: 9
Scores: (min: 8, avg: 10.18, max: 24)

Run: 723, exploration: 0.01, score: 9
Scores: (min: 8, avg: 10.18, max: 24)

Run: 724, exploration: 0.01, score: 9
Scores: (min: 8, avg: 10.18, max: 24)

Run: 725, exploration: 0.01, score: 16
Scores: (min: 8, avg: 10.23, max: 24)

Run: 726, exploration: 0.01, score: 11
Scores: (min: 8, avg: 10.25, max: 24)

Run: 727, exploration: 0.01, score: 9
Scores: (min: 8, avg: 10.26, max: 24)

Run: 728, exploration: 0.01, score: 9
Scores: (min: 8, avg: 10.15, max: 24)

Run: 729, exploration: 0.01, score: 12
Scores: (min: 8, avg: 10.17, max: 24)

Run: 730, exploration: 0.01, score: 10
Scores: (min: 8, avg: 10.18, ma

KeyboardInterrupt: 

In [4]:
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
  
  
from scores.score_logger import ScoreLogger  
  
ENV_NAME = "CartPole-v1"  
  
GAMMA = 0.95  
LEARNING_RATE = 0.001 
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
  
EXPLORATION_MAX = 2.0  # increased from 1.0; while the rate is >= 1.0 every action is random
EXPLORATION_MIN = 0.01  
EXPLORATION_DECAY = 0.995  
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  



In [5]:
cartpole()

Run: 1, exploration: 1.98005, score: 22
Scores: (min: 22, avg: 22, max: 22)

Run: 2, exploration: 1.9022202609315437, score: 9
Scores: (min: 9, avg: 15.5, max: 22)

Run: 3, exploration: 1.7556182834681147, score: 17
Scores: (min: 9, avg: 16, max: 22)

Run: 4, exploration: 1.4657537092873598, score: 37
Scores: (min: 9, avg: 21.25, max: 37)

Run: 5, exploration: 1.325936166922741, score: 21
Scores: (min: 9, avg: 21.2, max: 37)

Run: 6, exploration: 1.2298972430714525, score: 16
Scores: (min: 9, avg: 20.333333333333332, max: 37)

Run: 7, exploration: 1.0850402445845577, score: 26
Scores: (min: 9, avg: 21.142857142857142, max: 37)

Run: 8, exploration: 1.0165901475171681, score: 14
Scores: (min: 9, avg: 20.25, max: 37)

Run: 9, exploration: 0.9335514740318087, score: 18
Scores: (min: 9, avg: 20, max: 37)

Run: 10, exploration: 0.8277366916839737, score: 25
Scores: (min: 9, avg: 20.5, max: 37)

Run: 11, exploration: 0.7525419961860931, score: 20
Scores: (min: 9, avg: 20.454545454545453, max

Run: 91, exploration: 0.01, score: 246
Scores: (min: 9, avg: 180.2967032967033, max: 500)

Run: 92, exploration: 0.01, score: 247
Scores: (min: 9, avg: 181.02173913043478, max: 500)

Run: 93, exploration: 0.01, score: 116
Scores: (min: 9, avg: 180.32258064516128, max: 500)

Run: 94, exploration: 0.01, score: 110
Scores: (min: 9, avg: 179.5744680851064, max: 500)

Run: 95, exploration: 0.01, score: 96
Scores: (min: 9, avg: 178.69473684210527, max: 500)

Run: 96, exploration: 0.01, score: 76
Scores: (min: 9, avg: 177.625, max: 500)

Run: 97, exploration: 0.01, score: 44
Scores: (min: 9, avg: 176.24742268041237, max: 500)

Run: 98, exploration: 0.01, score: 100
Scores: (min: 9, avg: 175.46938775510205, max: 500)

Run: 99, exploration: 0.01, score: 172
Scores: (min: 9, avg: 175.43434343434345, max: 500)

Run: 100, exploration: 0.01, score: 143
Scores: (min: 9, avg: 175.11, max: 500)

Run: 101, exploration: 0.01, score: 61
Scores: (min: 9, avg: 175.5, max: 500)

Run: 102, exploration: 0.01,

Run: 192, exploration: 0.01, score: 9
Scores: (min: 8, avg: 183.62, max: 500)

Run: 193, exploration: 0.01, score: 11
Scores: (min: 8, avg: 182.57, max: 500)

Run: 194, exploration: 0.01, score: 10
Scores: (min: 8, avg: 181.57, max: 500)

Run: 195, exploration: 0.01, score: 15
Scores: (min: 8, avg: 180.76, max: 500)

Run: 196, exploration: 0.01, score: 16
Scores: (min: 8, avg: 180.16, max: 500)

Run: 197, exploration: 0.01, score: 500
Scores: (min: 8, avg: 184.72, max: 500)

Run: 198, exploration: 0.01, score: 419
Scores: (min: 8, avg: 187.91, max: 500)

Run: 199, exploration: 0.01, score: 437
Scores: (min: 8, avg: 190.56, max: 500)

Run: 200, exploration: 0.01, score: 271
Scores: (min: 8, avg: 191.84, max: 500)

Run: 201, exploration: 0.01, score: 175
Scores: (min: 8, avg: 192.98, max: 500)

Run: 202, exploration: 0.01, score: 232
Scores: (min: 8, avg: 194.75, max: 500)

Run: 203, exploration: 0.01, score: 42
Scores: (min: 8, avg: 193.07, max: 500)

Run: 204, exploration: 0.01, score:

NameError: name 'exit' is not defined
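
The exploration values printed above follow directly from the multiplicative decay in `experience_replay`: the rate is multiplied by `EXPLORATION_DECAY` once per replay call and clamped at `EXPLORATION_MIN`. The sketch below counts how many replay calls a given setting needs; `decay_steps` is a hypothetical helper written for this illustration, not part of the notebook.

```python
def decay_steps(start, target, decay):
    """Count multiplications by `decay` until the rate is at or below `target`."""
    rate = start
    steps = 0
    while rate > target:
        rate *= decay
        steps += 1
    return steps


# Baseline settings: replay calls until the rate reaches EXPLORATION_MIN.
print(decay_steps(1.0, 0.01, 0.995))

# EXPLORATION_MAX = 2.0: replay calls during which the rate stays above 1.0,
# so every action is still chosen at random.
print(decay_steps(2.0, 1.0, 0.995))

# EXPLORATION_DECAY = 0.5: replay calls until the rate reaches EXPLORATION_MIN.
print(decay_steps(1.0, 0.01, 0.5))
```

Because `np.random.rand()` is always below 1.0, a starting rate of 2.0 does not make exploration "more random"; it only lengthens the fully random phase before the usual decay takes effect.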

In [6]:
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
  
  
from scores.score_logger import ScoreLogger  
  
ENV_NAME = "CartPole-v1"  
  
GAMMA = 0.95  
LEARNING_RATE = 0.001
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
  
EXPLORATION_MAX = 1.0  
EXPLORATION_MIN = 0.01  
EXPLORATION_DECAY = 0.5  # decreased from 0.995 (much faster exploration decay)
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  



In [8]:
cartpole()

Run: 1, exploration: 1.0, score: 13
Scores: (min: 13, avg: 13, max: 13)

Run: 2, exploration: 0.01, score: 17
Scores: (min: 13, avg: 15, max: 17)

Run: 3, exploration: 0.01, score: 11
Scores: (min: 11, avg: 13.666666666666666, max: 17)

Run: 4, exploration: 0.01, score: 9
Scores: (min: 9, avg: 12.5, max: 17)

Run: 5, exploration: 0.01, score: 9
Scores: (min: 9, avg: 11.8, max: 17)

Run: 6, exploration: 0.01, score: 10
Scores: (min: 9, avg: 11.5, max: 17)

Run: 7, exploration: 0.01, score: 12
Scores: (min: 9, avg: 11.571428571428571, max: 17)

Run: 8, exploration: 0.01, score: 49
Scores: (min: 9, avg: 16.25, max: 49)

Run: 9, exploration: 0.01, score: 9
Scores: (min: 9, avg: 15.444444444444445, max: 49)

Run: 10, exploration: 0.01, score: 9
Scores: (min: 9, avg: 14.8, max: 49)

Run: 11, exploration: 0.01, score: 10
Scores: (min: 9, avg: 14.363636363636363, max: 49)

Run: 12, exploration: 0.01, score: 9
Scores: (min: 9, avg: 13.916666666666666, max: 49)

Run: 13, exploration: 0.01, score

Run: 95, exploration: 0.01, score: 111
Scores: (min: 8, avg: 84.15789473684211, max: 173)

Run: 96, exploration: 0.01, score: 116
Scores: (min: 8, avg: 84.48958333333333, max: 173)

Run: 97, exploration: 0.01, score: 164
Scores: (min: 8, avg: 85.30927835051547, max: 173)

Run: 98, exploration: 0.01, score: 104
Scores: (min: 8, avg: 85.5, max: 173)

Run: 99, exploration: 0.01, score: 133
Scores: (min: 8, avg: 85.97979797979798, max: 173)

Run: 100, exploration: 0.01, score: 107
Scores: (min: 8, avg: 86.19, max: 173)

Run: 101, exploration: 0.01, score: 131
Scores: (min: 8, avg: 87.37, max: 173)

Run: 102, exploration: 0.01, score: 116
Scores: (min: 8, avg: 88.36, max: 173)

Run: 103, exploration: 0.01, score: 106
Scores: (min: 8, avg: 89.31, max: 173)

Run: 104, exploration: 0.01, score: 155
Scores: (min: 8, avg: 90.77, max: 173)

Run: 105, exploration: 0.01, score: 147
Scores: (min: 8, avg: 92.15, max: 173)

Run: 106, exploration: 0.01, score: 108
Scores: (min: 8, avg: 93.13, max: 173)

Run: 196, exploration: 0.01, score: 97
Scores: (min: 12, avg: 119.02, max: 317)

Run: 197, exploration: 0.01, score: 240
Scores: (min: 12, avg: 119.78, max: 317)

Run: 198, exploration: 0.01, score: 116
Scores: (min: 12, avg: 119.9, max: 317)

Run: 199, exploration: 0.01, score: 166
Scores: (min: 12, avg: 120.23, max: 317)

Run: 200, exploration: 0.01, score: 84
Scores: (min: 12, avg: 120, max: 317)

Run: 201, exploration: 0.01, score: 208
Scores: (min: 12, avg: 120.77, max: 317)

Run: 202, exploration: 0.01, score: 127
Scores: (min: 12, avg: 120.88, max: 317)

Run: 203, exploration: 0.01, score: 107
Scores: (min: 12, avg: 120.89, max: 317)

Run: 204, exploration: 0.01, score: 130
Scores: (min: 12, avg: 120.64, max: 317)

Run: 205, exploration: 0.01, score: 135
Scores: (min: 12, avg: 120.52, max: 317)

Run: 206, exploration: 0.01, score: 129
Scores: (min: 12, avg: 120.73, max: 317)

Run: 207, exploration: 0.01, score: 146
Scores: (min: 12, avg: 121.1, max: 317)

Run: 208, exploration: 

Run: 297, exploration: 0.01, score: 112
Scores: (min: 9, avg: 123.02, max: 263)

Run: 298, exploration: 0.01, score: 116
Scores: (min: 9, avg: 123.02, max: 263)

Run: 299, exploration: 0.01, score: 136
Scores: (min: 9, avg: 122.72, max: 263)

Run: 300, exploration: 0.01, score: 163
Scores: (min: 9, avg: 123.51, max: 263)

Run: 301, exploration: 0.01, score: 135
Scores: (min: 9, avg: 122.78, max: 263)

Run: 302, exploration: 0.01, score: 125
Scores: (min: 9, avg: 122.76, max: 263)

Run: 303, exploration: 0.01, score: 131
Scores: (min: 9, avg: 123, max: 263)

Run: 304, exploration: 0.01, score: 170
Scores: (min: 9, avg: 123.4, max: 263)

Run: 305, exploration: 0.01, score: 101
Scores: (min: 9, avg: 123.06, max: 263)

Run: 306, exploration: 0.01, score: 102
Scores: (min: 9, avg: 122.79, max: 263)

Run: 307, exploration: 0.01, score: 269
Scores: (min: 9, avg: 124.02, max: 269)

Run: 308, exploration: 0.01, score: 111
Scores: (min: 9, avg: 124.12, max: 269)

Run: 309, exploration: 0.01, sco

Run: 399, exploration: 0.01, score: 98
Scores: (min: 16, avg: 124.3, max: 281)

Run: 400, exploration: 0.01, score: 111
Scores: (min: 16, avg: 123.78, max: 281)

Run: 401, exploration: 0.01, score: 135
Scores: (min: 16, avg: 123.78, max: 281)

Run: 402, exploration: 0.01, score: 121
Scores: (min: 16, avg: 123.74, max: 281)

Run: 403, exploration: 0.01, score: 180
Scores: (min: 16, avg: 124.23, max: 281)

Run: 404, exploration: 0.01, score: 109
Scores: (min: 16, avg: 123.62, max: 281)

Run: 405, exploration: 0.01, score: 139
Scores: (min: 16, avg: 124, max: 281)

Run: 406, exploration: 0.01, score: 150
Scores: (min: 16, avg: 124.48, max: 281)

Run: 407, exploration: 0.01, score: 135
Scores: (min: 16, avg: 123.14, max: 281)

Run: 408, exploration: 0.01, score: 126
Scores: (min: 16, avg: 123.29, max: 281)

Run: 409, exploration: 0.01, score: 12
Scores: (min: 12, avg: 122.13, max: 281)

Run: 410, exploration: 0.01, score: 122
Scores: (min: 12, avg: 122.31, max: 281)

Run: 411, exploration:

Run: 500, exploration: 0.01, score: 105
Scores: (min: 12, avg: 141.62, max: 416)

Run: 501, exploration: 0.01, score: 109
Scores: (min: 12, avg: 141.36, max: 416)

Run: 502, exploration: 0.01, score: 126
Scores: (min: 12, avg: 141.41, max: 416)

Run: 503, exploration: 0.01, score: 172
Scores: (min: 12, avg: 141.33, max: 416)

Run: 504, exploration: 0.01, score: 154
Scores: (min: 12, avg: 141.78, max: 416)

Run: 505, exploration: 0.01, score: 165
Scores: (min: 12, avg: 142.04, max: 416)

Run: 506, exploration: 0.01, score: 120
Scores: (min: 12, avg: 141.74, max: 416)

Run: 507, exploration: 0.01, score: 183
Scores: (min: 12, avg: 142.22, max: 416)

Run: 508, exploration: 0.01, score: 172
Scores: (min: 12, avg: 142.68, max: 416)

Run: 509, exploration: 0.01, score: 130
Scores: (min: 19, avg: 143.86, max: 416)

Run: 510, exploration: 0.01, score: 139
Scores: (min: 19, avg: 144.03, max: 416)

Run: 511, exploration: 0.01, score: 187
Scores: (min: 19, avg: 144.83, max: 416)

Run: 512, explor

Run: 601, exploration: 0.01, score: 227
Scores: (min: 24, avg: 188.51, max: 500)

Run: 602, exploration: 0.01, score: 54
Scores: (min: 24, avg: 187.79, max: 500)

Run: 603, exploration: 0.01, score: 18
Scores: (min: 18, avg: 186.25, max: 500)

Run: 604, exploration: 0.01, score: 39
Scores: (min: 18, avg: 185.1, max: 500)

Run: 605, exploration: 0.01, score: 96
Scores: (min: 18, avg: 184.41, max: 500)

Run: 606, exploration: 0.01, score: 122
Scores: (min: 18, avg: 184.43, max: 500)

Run: 607, exploration: 0.01, score: 187
Scores: (min: 18, avg: 184.47, max: 500)

Run: 608, exploration: 0.01, score: 117
Scores: (min: 18, avg: 183.92, max: 500)

Run: 609, exploration: 0.01, score: 138
Scores: (min: 18, avg: 184, max: 500)

Run: 610, exploration: 0.01, score: 500
Scores: (min: 18, avg: 187.61, max: 500)

Run: 611, exploration: 0.01, score: 140
Scores: (min: 18, avg: 187.14, max: 500)

Run: 612, exploration: 0.01, score: 232
Scores: (min: 18, avg: 187.69, max: 500)

Run: 613, exploration: 0

NameError: name 'exit' is not defined

In this RL example the agent, represented by the solver `DQNSolver`, tries to balance an inverted pendulum. The state values describe the cart position, cart velocity, pole angle, and pole tip velocity. The actions describe what the cart can do to change the position of the pole: move left or move right.
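
As a small illustration of the state and action description above, the sketch below labels the four observation values and the two actions. The names `CartPoleState`, `ACTIONS`, and `label_observation` are hypothetical helpers, not part of the notebook, and the sample observation is made up rather than taken from a real run.

```python
from collections import namedtuple

# Field order follows the CartPole observation vector:
# cart position, cart velocity, pole angle, pole tip velocity.
CartPoleState = namedtuple(
    "CartPoleState",
    ["cart_position", "cart_velocity", "pole_angle", "pole_tip_velocity"],
)

# The action space is discrete: 0 pushes the cart left, 1 pushes it right.
ACTIONS = {0: "push left", 1: "push right"}


def label_observation(observation):
    """Wrap a raw 4-element observation in a named tuple (hypothetical helper)."""
    return CartPoleState(*observation)


# Illustrative values only, not output from the environment.
state = label_observation([0.02, -0.15, 0.03, 0.21])
print(state.pole_angle, ACTIONS[1])
```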

This example of RL uses DQN (Deep Q-Learning) as its reinforcement learning algorithm, implemented in `DQNSolver`.
The algorithm uses experience replay (line 88): the transitions experienced by the agent are remembered and later sampled at random for training. By sampling, the algorithm tries to "reduce correlation between subsequent actions".
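
The remember-then-sample idea can be sketched without any neural network. The snippet below is a minimal stand-in that mirrors the notebook's `deque` memory and `random.sample` call; the buffer and batch sizes are deliberately tiny, and the stored transitions are placeholder integers rather than real CartPole states.

```python
import random
from collections import deque

MEMORY_SIZE = 5  # tiny on purpose; the notebook uses 1000000
BATCH_SIZE = 3   # the notebook uses 20

# A bounded deque silently drops the oldest transition once it is full.
memory = deque(maxlen=MEMORY_SIZE)


def remember(state, action, reward, next_state, done):
    """Store one transition, as DQNSolver.remember does."""
    memory.append((state, action, reward, next_state, done))


def sample_batch():
    """Return a random batch, or an empty list if too few transitions exist."""
    if len(memory) < BATCH_SIZE:
        return []
    return random.sample(list(memory), BATCH_SIZE)


# Placeholder transitions: the state is just the step index.
for step in range(8):
    remember(step, step % 2, 1.0, step + 1, False)

batch = sample_batch()
print(len(memory), len(batch))
```

Because the batch is drawn at random from the whole buffer, consecutive training samples are usually not consecutive time steps, which is the decorrelation effect the quote refers to.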

The algorithm also uses a discount factor (`GAMMA`) of 0.95, which is not myopic. The discount factor determines the importance of future rewards. A factor of 0 makes the agent short-sighted by considering only current rewards, while a factor approaching 1 makes it strive for a long-term high reward.
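
To make the effect of the discount factor concrete, the sketch below computes the discounted return of a short reward sequence for a factor of 0 and for 0.95. The function name `discounted_return` and the ten-step reward list are assumptions chosen for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward k steps ahead is weighted by gamma**k."""
    total = 0.0
    # Work backwards so each step adds its reward plus the discounted future.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Ten steps that each pay a reward of 1, as CartPole does while the pole stays up.
rewards = [1.0] * 10

myopic = discounted_return(rewards, 0.0)        # only the first reward counts
far_sighted = discounted_return(rewards, 0.95)  # roughly 8.03

print(myopic, far_sighted)
```

With a factor of 0 the agent values the sequence at a single reward, while at 0.95 most of the ten future rewards still contribute.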

In long-running training sessions, or in models with very many states, the Q-table may become too large to fit into memory or to be searched efficiently. A neural network may be substituted for this table to approximate the Q-values, with the training targets computed using the Bellman equation.
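
The Bellman target used in `experience_replay` can be written as a small pure-Python function. This is a sketch under the assumption that the network's predictions are replaced by plain lists of numbers; `q_target` is a hypothetical name, and the Q-values are made up.

```python
GAMMA = 0.95  # same discount factor as the notebook


def q_target(reward, next_q_values, terminal, gamma=GAMMA):
    """Bellman target: reward plus the discounted best Q-value of the next state."""
    if terminal:
        # No future reward exists after a terminal step.
        return reward
    return reward + gamma * max(next_q_values)


# Stand-ins for model.predict(state)[0] and model.predict(state_next)[0].
q_values = [0.5, 0.2]
next_q_values = [2.0, 3.0]
action = 1

# Only the entry for the action actually taken is overwritten with the target.
q_values[action] = q_target(1.0, next_q_values, terminal=False)
print(q_values)  # second entry is approximately 3.85
```

The updated `q_values` list plays the role of the regression target that the notebook passes to `self.model.fit`.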

With the default learning rate, alpha (0.001), the algorithm finished in 125 runs. At 0.0001, 192 runs were needed, and at 0.01 more than 784 were needed; the simulation was terminated at 784 runs. The learning rate is the "step size". If the step size is too large, the simulation can meander around the minimum without converging, which is what happened at 0.01. In theory, a large step size is useful at the beginning of the run so that experience is gained quickly, but it becomes detrimental later on. A decay factor could be applied so that the learning rate starts high and decreases as the simulation progresses.
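
The step-size argument can be illustrated on a toy problem. The sketch below runs plain gradient descent on f(x) = x^2, which is only an analogy and not the DQN or its Adam optimizer; the function `gradient_descent`, its `decay` parameter, and the chosen rates are assumptions for illustration.

```python
def gradient_descent(learning_rate, steps=50, start=1.0, decay=1.0):
    """Minimize f(x) = x**2 (gradient 2x), shrinking the rate by `decay` each step."""
    x = start
    for _ in range(steps):
        x -= learning_rate * 2.0 * x
        learning_rate *= decay
    return x


slow = gradient_descent(0.01)               # small steps: still far from 0
good = gradient_descent(0.1)                # moderate steps: very close to 0
too_large = gradient_descent(1.0)           # flips between +1 and -1 forever
decayed = gradient_descent(1.0, decay=0.9)  # starts large, then settles near 0

print(slow, good, too_large, decayed)
```

On this toy function, the overly large fixed rate never approaches the minimum, while the same starting rate with a decay does, which matches the reasoning in the paragraph above.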


Surma, G. (2021). Cartpole - Introduction to Reinforcement Learning (DQN - Deep Q-Learning). Medium. Retrieved from https://gsurma.medium.com/cartpole-introduction-to-reinforcement-learning-ed0eb5b58288

Q-learning. (2023, March 28). In Wikipedia. https://en.wikipedia.org/w/index.php?title=Q-learning&oldid=1146994709