# Module Five Assignment: Cartpole Problem
Review the code in this notebook and in the score_logger.py file in the *scores* folder (directory). Once you have reviewed the code, return to this notebook and select **Cell** and then **Run All** from the menu bar to run this code. The code takes several minutes to run.

In [1]:
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
  
  
from scores.score_logger import ScoreLogger  
  
ENV_NAME = "CartPole-v1"  
  
GAMMA = 0.95  
LEARNING_RATE = 0.001  
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
  
EXPLORATION_MAX = 1.0  
EXPLORATION_MIN = 0.01  
EXPLORATION_DECAY = 0.995  
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  



Using TensorFlow backend.


In [2]:
cartpole()

Run: 1, exploration: 0.985074875, score: 23
Scores: (min: 23, avg: 23, max: 23)

Run: 2, exploration: 0.9000874278732445, score: 19
Scores: (min: 19, avg: 21, max: 23)

Run: 3, exploration: 0.8390886103705794, score: 15
Scores: (min: 15, avg: 19, max: 23)

Run: 4, exploration: 0.7822236754458713, score: 15
Scores: (min: 15, avg: 18, max: 23)

Run: 5, exploration: 0.697046600835495, score: 24
Scores: (min: 15, avg: 19.2, max: 24)

Run: 6, exploration: 0.6465587967553006, score: 16
Scores: (min: 15, avg: 18.666666666666668, max: 24)

Run: 7, exploration: 0.6149486215357263, score: 11
Scores: (min: 11, avg: 17.571428571428573, max: 24)

Run: 8, exploration: 0.4982051627146237, score: 43
Scores: (min: 11, avg: 20.75, max: 43)

Run: 9, exploration: 0.46211964903917074, score: 16
Scores: (min: 11, avg: 20.22222222222222, max: 43)

Run: 10, exploration: 0.42650460709830135, score: 17
Scores: (min: 11, avg: 19.9, max: 43)

Run: 11, exploration: 0.40565285250151817, score: 11
Scores: (min: 11, 

Run: 90, exploration: 0.01, score: 339
Scores: (min: 11, avg: 128.93333333333334, max: 358)

Run: 91, exploration: 0.01, score: 139
Scores: (min: 11, avg: 129.04395604395606, max: 358)

Run: 92, exploration: 0.01, score: 196
Scores: (min: 11, avg: 129.77173913043478, max: 358)

Run: 93, exploration: 0.01, score: 149
Scores: (min: 11, avg: 129.9784946236559, max: 358)

Run: 94, exploration: 0.01, score: 195
Scores: (min: 11, avg: 130.67021276595744, max: 358)

Run: 95, exploration: 0.01, score: 211
Scores: (min: 11, avg: 131.5157894736842, max: 358)

Run: 96, exploration: 0.01, score: 197
Scores: (min: 11, avg: 132.19791666666666, max: 358)

Run: 97, exploration: 0.01, score: 153
Scores: (min: 11, avg: 132.41237113402062, max: 358)

Run: 98, exploration: 0.01, score: 133
Scores: (min: 11, avg: 132.41836734693877, max: 358)

Run: 99, exploration: 0.01, score: 174
Scores: (min: 11, avg: 132.83838383838383, max: 358)

Run: 100, exploration: 0.01, score: 128
Scores: (min: 11, avg: 132.79, m

Run: 189, exploration: 0.01, score: 151
Scores: (min: 52, avg: 185.53, max: 500)

Run: 190, exploration: 0.01, score: 130
Scores: (min: 52, avg: 183.44, max: 500)

Run: 191, exploration: 0.01, score: 153
Scores: (min: 52, avg: 183.58, max: 500)

Run: 192, exploration: 0.01, score: 164
Scores: (min: 52, avg: 183.26, max: 500)

Run: 193, exploration: 0.01, score: 139
Scores: (min: 52, avg: 183.16, max: 500)

Run: 194, exploration: 0.01, score: 251
Scores: (min: 52, avg: 183.72, max: 500)

Run: 195, exploration: 0.01, score: 128
Scores: (min: 52, avg: 182.89, max: 500)

Run: 196, exploration: 0.01, score: 152
Scores: (min: 52, avg: 182.44, max: 500)

Run: 197, exploration: 0.01, score: 370
Scores: (min: 52, avg: 184.61, max: 500)

Run: 198, exploration: 0.01, score: 196
Scores: (min: 52, avg: 185.24, max: 500)

Run: 199, exploration: 0.01, score: 165
Scores: (min: 52, avg: 185.15, max: 500)

Run: 200, exploration: 0.01, score: 159
Scores: (min: 52, avg: 185.46, max: 500)

Run: 201, explor

Run: 290, exploration: 0.01, score: 251
Scores: (min: 12, avg: 171.47, max: 500)

Run: 291, exploration: 0.01, score: 200
Scores: (min: 12, avg: 171.94, max: 500)

Run: 292, exploration: 0.01, score: 297
Scores: (min: 12, avg: 173.27, max: 500)

Run: 293, exploration: 0.01, score: 96
Scores: (min: 12, avg: 172.84, max: 500)

Run: 294, exploration: 0.01, score: 242
Scores: (min: 12, avg: 172.75, max: 500)

Run: 295, exploration: 0.01, score: 178
Scores: (min: 12, avg: 173.25, max: 500)

Run: 296, exploration: 0.01, score: 102
Scores: (min: 12, avg: 172.75, max: 500)

Run: 297, exploration: 0.01, score: 136
Scores: (min: 12, avg: 170.41, max: 500)

Run: 298, exploration: 0.01, score: 161
Scores: (min: 12, avg: 170.06, max: 500)

Run: 299, exploration: 0.01, score: 124
Scores: (min: 12, avg: 169.65, max: 500)

Run: 300, exploration: 0.01, score: 151
Scores: (min: 12, avg: 169.57, max: 500)

Run: 301, exploration: 0.01, score: 173
Scores: (min: 12, avg: 169.68, max: 500)

Run: 302, explora

Run: 391, exploration: 0.01, score: 468
Scores: (min: 9, avg: 166.74, max: 500)

Run: 392, exploration: 0.01, score: 358
Scores: (min: 9, avg: 167.35, max: 500)

Run: 393, exploration: 0.01, score: 206
Scores: (min: 9, avg: 168.45, max: 500)

Run: 394, exploration: 0.01, score: 96
Scores: (min: 9, avg: 166.99, max: 500)

Run: 395, exploration: 0.01, score: 100
Scores: (min: 9, avg: 166.21, max: 500)

Run: 396, exploration: 0.01, score: 180
Scores: (min: 9, avg: 166.99, max: 500)

Run: 397, exploration: 0.01, score: 281
Scores: (min: 9, avg: 168.44, max: 500)

Run: 398, exploration: 0.01, score: 229
Scores: (min: 9, avg: 169.12, max: 500)

Run: 399, exploration: 0.01, score: 205
Scores: (min: 9, avg: 169.93, max: 500)

Run: 400, exploration: 0.01, score: 139
Scores: (min: 9, avg: 169.81, max: 500)

Run: 401, exploration: 0.01, score: 244
Scores: (min: 9, avg: 170.52, max: 500)

Run: 402, exploration: 0.01, score: 198
Scores: (min: 9, avg: 171.59, max: 500)

Run: 403, exploration: 0.01, 

NameError: name 'exit' is not defined

Note: If the code is running properly, you should begin to see output appearing above this code block. It will take several minutes, so it is recommended that you let this code run in the background while completing other work. When the code has finished, it will print output saying, "Solved in _ runs, _ total runs."

You may see an error about not having an exit command. This error does not affect the program's functionality and results from the steps taken to convert the code from Python 2.x to Python 3. Please disregard this error.
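
The exploration values printed in the output can be reproduced from the decay rule in `experience_replay`. The sketch below is a simplified model of the training loop, not the notebook's code: the helper name `exploration_after` is hypothetical, and it assumes that the rate decays once per non-terminal step after the memory holds at least `BATCH_SIZE` transitions, which is how the loop above is written.

```python
def exploration_after(episode_lengths, batch_size, decay, eps_min, eps=1.0):
    """Return the exploration rate printed at the end of each episode.

    Hypothetical helper: mirrors the order of calls in cartpole(), where
    experience_replay() runs only after non-terminal steps and does nothing
    until the memory contains at least batch_size transitions.
    """
    memory_size = 0
    history = []
    for length in episode_lengths:
        for step in range(1, length + 1):
            memory_size += 1           # remember() stores one transition
            if step == length:         # terminal step: loop breaks before replay
                break
            if memory_size >= batch_size:
                eps = max(eps_min, eps * decay)
        history.append(eps)
    return history


# First configuration: BATCH_SIZE = 20, decay 0.995, floor 0.01
first = exploration_after([23, 19], batch_size=20, decay=0.995, eps_min=0.01)

# Second configuration: BATCH_SIZE = 30, decay 0.95, floor 0.1
second = exploration_after([11, 24, 53], batch_size=30, decay=0.95, eps_min=0.1)

print(first)
print(second)
```

With the episode lengths taken from the logs (23 and 19 steps in the first configuration; 11, 24, and 53 in the second), this reproduces the logged rates up to floating-point rounding. It also shows why the second configuration reaches its floor of 0.1 by the third run: a decay factor of 0.95 per step shrinks the rate much faster than 0.995.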

In [3]:
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
  
  
from scores.score_logger import ScoreLogger  
  
ENV_NAME = "CartPole-v1"  
  
GAMMA = 0.99  
LEARNING_RATE = 0.001  
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 30  
  
EXPLORATION_MAX = 1.0  
EXPLORATION_MIN = 0.1  
EXPLORATION_DECAY = 0.95  
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  


In [4]:
cartpole()

Run: 1, exploration: 1.0, score: 11
Scores: (min: 11, avg: 11, max: 11)

Run: 2, exploration: 0.7737809374999999, score: 24
Scores: (min: 11, avg: 17.5, max: 24)

Run: 3, exploration: 0.1, score: 53
Scores: (min: 11, avg: 29.333333333333332, max: 53)

Run: 4, exploration: 0.1, score: 15
Scores: (min: 11, avg: 25.75, max: 53)

Run: 5, exploration: 0.1, score: 29
Scores: (min: 11, avg: 26.4, max: 53)

Run: 6, exploration: 0.1, score: 57
Scores: (min: 11, avg: 31.5, max: 57)

Run: 7, exploration: 0.1, score: 19
Scores: (min: 11, avg: 29.714285714285715, max: 57)

Run: 8, exploration: 0.1, score: 63
Scores: (min: 11, avg: 33.875, max: 63)

Run: 9, exploration: 0.1, score: 36
Scores: (min: 11, avg: 34.111111111111114, max: 63)

Run: 10, exploration: 0.1, score: 51
Scores: (min: 11, avg: 35.8, max: 63)

Run: 11, exploration: 0.1, score: 109
Scores: (min: 11, avg: 42.45454545454545, max: 109)

Run: 12, exploration: 0.1, score: 67
Scores: (min: 11, avg: 44.5, max: 109)

Run: 13, exploration: 0

Run: 95, exploration: 0.1, score: 9
Scores: (min: 8, avg: 121.93684210526315, max: 500)

Run: 96, exploration: 0.1, score: 9
Scores: (min: 8, avg: 120.76041666666667, max: 500)

Run: 97, exploration: 0.1, score: 11
Scores: (min: 8, avg: 119.62886597938144, max: 500)

Run: 98, exploration: 0.1, score: 10
Scores: (min: 8, avg: 118.51020408163265, max: 500)

Run: 99, exploration: 0.1, score: 9
Scores: (min: 8, avg: 117.4040404040404, max: 500)

Run: 100, exploration: 0.1, score: 9
Scores: (min: 8, avg: 116.32, max: 500)

Run: 101, exploration: 0.1, score: 11
Scores: (min: 8, avg: 116.32, max: 500)

Run: 102, exploration: 0.1, score: 8
Scores: (min: 8, avg: 116.16, max: 500)

Run: 103, exploration: 0.1, score: 10
Scores: (min: 8, avg: 115.73, max: 500)

Run: 104, exploration: 0.1, score: 10
Scores: (min: 8, avg: 115.68, max: 500)

Run: 105, exploration: 0.1, score: 9
Scores: (min: 8, avg: 115.48, max: 500)

Run: 106, exploration: 0.1, score: 10
Scores: (min: 8, avg: 115.01, max: 500)

Run:

Run: 201, exploration: 0.1, score: 9
Scores: (min: 8, avg: 9.85, max: 13)

Run: 202, exploration: 0.1, score: 13
Scores: (min: 8, avg: 9.9, max: 13)

Run: 203, exploration: 0.1, score: 10
Scores: (min: 8, avg: 9.9, max: 13)

Run: 204, exploration: 0.1, score: 11
Scores: (min: 8, avg: 9.91, max: 13)

Run: 205, exploration: 0.1, score: 10
Scores: (min: 8, avg: 9.92, max: 13)

Run: 206, exploration: 0.1, score: 9
Scores: (min: 8, avg: 9.91, max: 13)

Run: 207, exploration: 0.1, score: 10
Scores: (min: 8, avg: 9.93, max: 13)

Run: 208, exploration: 0.1, score: 10
Scores: (min: 8, avg: 9.95, max: 13)

Run: 209, exploration: 0.1, score: 11
Scores: (min: 8, avg: 9.96, max: 13)

Run: 210, exploration: 0.1, score: 9
Scores: (min: 8, avg: 9.95, max: 13)

Run: 211, exploration: 0.1, score: 8
Scores: (min: 8, avg: 9.93, max: 13)

Run: 212, exploration: 0.1, score: 10
Scores: (min: 8, avg: 9.9, max: 13)

Run: 213, exploration: 0.1, score: 9
Scores: (min: 8, avg: 9.89, max: 13)

Run: 214, exploratio

Run: 309, exploration: 0.1, score: 113
Scores: (min: 8, avg: 25.1, max: 155)

Run: 310, exploration: 0.1, score: 124
Scores: (min: 8, avg: 26.25, max: 155)

Run: 311, exploration: 0.1, score: 125
Scores: (min: 8, avg: 27.42, max: 155)

Run: 312, exploration: 0.1, score: 106
Scores: (min: 8, avg: 28.38, max: 155)

Run: 313, exploration: 0.1, score: 123
Scores: (min: 8, avg: 29.52, max: 155)

Run: 314, exploration: 0.1, score: 113
Scores: (min: 8, avg: 30.55, max: 155)

Run: 315, exploration: 0.1, score: 109
Scores: (min: 8, avg: 31.54, max: 155)

Run: 316, exploration: 0.1, score: 108
Scores: (min: 8, avg: 32.5, max: 155)

Run: 317, exploration: 0.1, score: 93
Scores: (min: 8, avg: 33.34, max: 155)

Run: 318, exploration: 0.1, score: 36
Scores: (min: 8, avg: 33.61, max: 155)

Run: 319, exploration: 0.1, score: 88
Scores: (min: 8, avg: 34.4, max: 155)

Run: 320, exploration: 0.1, score: 95
Scores: (min: 8, avg: 35.26, max: 155)

Run: 321, exploration: 0.1, score: 104
Scores: (min: 8, avg

NameError: name 'exit' is not defined

In the CartPole problem, the goal is to balance a pole on a moving cart by pushing the cart left or right. The model uses four pieces of information to decide what to do: the cart's position, the cart's velocity, the pole's angle, and the pole's angular velocity. It can take one of two actions: push the cart to the left or push it to the right. The model learns to balance the pole with Deep Q-Learning, in which a neural network estimates the value (Q-value) of each action in the current state and the action with the highest estimate is usually chosen. Every step the model takes is stored in memory, and training batches are sampled from that memory at random. Random sampling breaks up the correlation between consecutive steps, so the network trains on a more diverse set of examples.
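
The replay memory and the Q-learning target can be sketched without Keras or Gym. This is a minimal illustration, not the notebook's implementation: the transitions are made up, the helper `td_target` is a hypothetical name, and the Q-values for the next state are passed in as a plain list instead of coming from a network.

```python
import random
from collections import deque

GAMMA = 0.95  # discount factor from the first configuration


def td_target(reward, next_q_values, terminal, gamma=GAMMA):
    """Q-learning target: r if the episode ended, else r + gamma * max Q(s', a')."""
    if terminal:
        return reward
    return reward + gamma * max(next_q_values)


# Bounded replay memory of (state, action, reward, next_state, terminal) tuples.
memory = deque(maxlen=1000)
for t in range(50):
    # Made-up transitions, for illustration only.
    memory.append((t, t % 2, 1.0, t + 1, False))

random.seed(0)
batch = random.sample(memory, 20)  # uniform sample without replacement

# Target for a non-terminal step, and for a terminal step with the -1 penalty.
ongoing = td_target(1.0, [0.5, 2.0], terminal=False)
failed = td_target(-1.0, [0.5, 2.0], terminal=True)
print(len(batch), ongoing, failed)
```

In the notebook, the same target overwrites the entry for the chosen action in the network's predicted Q-values before `fit` is called, so only that action's estimate is pushed toward the target.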

The neural network in the DQN has an input layer for the four state values, two hidden layers of 24 units each, and an output layer with one value per action; the action with the highest output is the one the network suggests. Adjusting the learning rate affects the training process. A high learning rate can make learning faster but unstable, while a low learning rate results in slower but steadier learning. The neural network makes the Q-learning algorithm more efficient by generalizing from past experiences to states it has not seen before, which improves the model's ability to learn to balance the pole on the cart.
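
The trade-off between a high and a low learning rate can be seen on a toy problem. The sketch below uses plain gradient descent on f(x) = x², not the Adam optimizer or the network from the notebook, and the learning rates are chosen only to make the three regimes visible.

```python
def gradient_descent(lr, steps=20, x0=1.0):
    """Minimize f(x) = x**2 (gradient 2x) with a fixed learning rate."""
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x  # each step multiplies x by (1 - 2 * lr)
    return x


low = gradient_descent(lr=0.01)   # slow but steady: x shrinks by 2% per step
mid = gradient_descent(lr=0.4)    # fast: x shrinks by 80% per step
high = gradient_descent(lr=1.1)   # unstable: |x| grows by 20% per step

print(low, mid, high)
```

After 20 steps the low rate is still far from the minimum at 0, the moderate rate is very close to it, and the high rate has moved further away than where it started.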