# Module Five Assignment: Cartpole Problem
Review the code in this notebook and in the score_logger.py file in the *scores* folder (directory). Once you have reviewed the code, return to this notebook and select **Cell** and then **Run All** from the menu bar to run this code. The code takes several minutes to run.

In [1]:
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
  
  
from scores.score_logger import ScoreLogger  
  
ENV_NAME = "CartPole-v1"  
  
GAMMA = 0.95  
LEARNING_RATE = 0.001  
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
  
EXPLORATION_MAX = 1.0  
EXPLORATION_MIN = 0.01  
EXPLORATION_DECAY = 0.995  
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print ("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))  
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  



Using TensorFlow backend.


In [2]:
cartpole()

Run: 1, exploration: 0.8690529955452602, score: 48
Scores: (min: 48, avg: 48, max: 48)

Run: 2, exploration: 0.7744209942832988, score: 24
Scores: (min: 24, avg: 36, max: 48)

Run: 3, exploration: 0.7040696960536299, score: 20
Scores: (min: 20, avg: 30.666666666666668, max: 48)

Run: 4, exploration: 0.6433260027715241, score: 19
Scores: (min: 19, avg: 27.75, max: 48)

Run: 5, exploration: 0.6211445383053219, score: 8
Scores: (min: 8, avg: 23.8, max: 48)

Run: 6, exploration: 0.5878229785513479, score: 12
Scores: (min: 8, avg: 21.833333333333332, max: 48)

Run: 7, exploration: 0.5507399854171277, score: 14
Scores: (min: 8, avg: 20.714285714285715, max: 48)

Run: 8, exploration: 0.5159963842937159, score: 14
Scores: (min: 8, avg: 19.875, max: 48)

Run: 9, exploration: 0.4598090507939749, score: 24
Scores: (min: 8, avg: 20.333333333333332, max: 48)

Run: 10, exploration: 0.43732904629000013, score: 11
Scores: (min: 8, avg: 19.4, max: 48)

Run: 11, exploration: 0.4159480862733536, score: 1

Run: 89, exploration: 0.01, score: 162
Scores: (min: 8, avg: 54.47191011235955, max: 175)

Run: 90, exploration: 0.01, score: 198
Scores: (min: 8, avg: 56.06666666666667, max: 198)

Run: 91, exploration: 0.01, score: 141
Scores: (min: 8, avg: 57, max: 198)

Run: 92, exploration: 0.01, score: 141
Scores: (min: 8, avg: 57.91304347826087, max: 198)

Run: 93, exploration: 0.01, score: 132
Scores: (min: 8, avg: 58.70967741935484, max: 198)

Run: 94, exploration: 0.01, score: 128
Scores: (min: 8, avg: 59.4468085106383, max: 198)

Run: 95, exploration: 0.01, score: 128
Scores: (min: 8, avg: 60.16842105263158, max: 198)

Run: 96, exploration: 0.01, score: 149
Scores: (min: 8, avg: 61.09375, max: 198)

Run: 97, exploration: 0.01, score: 135
Scores: (min: 8, avg: 61.855670103092784, max: 198)

Run: 98, exploration: 0.01, score: 118
Scores: (min: 8, avg: 62.42857142857143, max: 198)

Run: 99, exploration: 0.01, score: 102
Scores: (min: 8, avg: 62.82828282828283, max: 198)

Run: 100, exploration: 

Run: 189, exploration: 0.01, score: 125
Scores: (min: 102, avg: 138.26, max: 198)

Run: 190, exploration: 0.01, score: 181
Scores: (min: 102, avg: 138.09, max: 181)

Run: 191, exploration: 0.01, score: 126
Scores: (min: 102, avg: 137.94, max: 181)

Run: 192, exploration: 0.01, score: 139
Scores: (min: 102, avg: 137.92, max: 181)

Run: 193, exploration: 0.01, score: 137
Scores: (min: 102, avg: 137.97, max: 181)

Run: 194, exploration: 0.01, score: 166
Scores: (min: 102, avg: 138.35, max: 181)

Run: 195, exploration: 0.01, score: 121
Scores: (min: 102, avg: 138.28, max: 181)

Run: 196, exploration: 0.01, score: 133
Scores: (min: 102, avg: 138.12, max: 181)

Run: 197, exploration: 0.01, score: 151
Scores: (min: 102, avg: 138.28, max: 181)

Run: 198, exploration: 0.01, score: 140
Scores: (min: 102, avg: 138.5, max: 181)

Run: 199, exploration: 0.01, score: 147
Scores: (min: 103, avg: 138.95, max: 181)

Run: 200, exploration: 0.01, score: 141
Scores: (min: 103, avg: 139.11, max: 181)

Run: 

Run: 289, exploration: 0.01, score: 139
Scores: (min: 58, avg: 139.99, max: 204)

Run: 290, exploration: 0.01, score: 150
Scores: (min: 58, avg: 139.68, max: 204)

Run: 291, exploration: 0.01, score: 141
Scores: (min: 58, avg: 139.83, max: 204)

Run: 292, exploration: 0.01, score: 88
Scores: (min: 58, avg: 139.32, max: 204)

Run: 293, exploration: 0.01, score: 132
Scores: (min: 58, avg: 139.27, max: 204)

Run: 294, exploration: 0.01, score: 120
Scores: (min: 58, avg: 138.81, max: 204)

Run: 295, exploration: 0.01, score: 166
Scores: (min: 58, avg: 139.26, max: 204)

Run: 296, exploration: 0.01, score: 122
Scores: (min: 58, avg: 139.15, max: 204)

Run: 297, exploration: 0.01, score: 144
Scores: (min: 58, avg: 139.08, max: 204)

Run: 298, exploration: 0.01, score: 131
Scores: (min: 58, avg: 138.99, max: 204)

Run: 299, exploration: 0.01, score: 137
Scores: (min: 58, avg: 138.89, max: 204)

Run: 300, exploration: 0.01, score: 139
Scores: (min: 58, avg: 138.87, max: 204)

Run: 301, explora

Run: 390, exploration: 0.01, score: 109
Scores: (min: 13, avg: 132.16, max: 268)

Run: 391, exploration: 0.01, score: 88
Scores: (min: 13, avg: 131.63, max: 268)

Run: 392, exploration: 0.01, score: 81
Scores: (min: 13, avg: 131.56, max: 268)

Run: 393, exploration: 0.01, score: 66
Scores: (min: 13, avg: 130.9, max: 268)

Run: 394, exploration: 0.01, score: 117
Scores: (min: 13, avg: 130.87, max: 268)

Run: 395, exploration: 0.01, score: 114
Scores: (min: 13, avg: 130.35, max: 268)

Run: 396, exploration: 0.01, score: 97
Scores: (min: 13, avg: 130.1, max: 268)

Run: 397, exploration: 0.01, score: 103
Scores: (min: 13, avg: 129.69, max: 268)

Run: 398, exploration: 0.01, score: 112
Scores: (min: 13, avg: 129.5, max: 268)

Run: 399, exploration: 0.01, score: 148
Scores: (min: 13, avg: 129.61, max: 268)

Run: 400, exploration: 0.01, score: 131
Scores: (min: 13, avg: 129.53, max: 268)

Run: 401, exploration: 0.01, score: 112
Scores: (min: 13, avg: 129.44, max: 268)

Run: 402, exploration: 

Run: 491, exploration: 0.01, score: 39
Scores: (min: 12, avg: 124.04, max: 366)

Run: 492, exploration: 0.01, score: 53
Scores: (min: 12, avg: 123.76, max: 366)

Run: 493, exploration: 0.01, score: 98
Scores: (min: 12, avg: 124.08, max: 366)

Run: 494, exploration: 0.01, score: 117
Scores: (min: 12, avg: 124.08, max: 366)

Run: 495, exploration: 0.01, score: 106
Scores: (min: 12, avg: 124, max: 366)

Run: 496, exploration: 0.01, score: 128
Scores: (min: 12, avg: 124.31, max: 366)

Run: 497, exploration: 0.01, score: 140
Scores: (min: 12, avg: 124.68, max: 366)

Run: 498, exploration: 0.01, score: 113
Scores: (min: 12, avg: 124.69, max: 366)

Run: 499, exploration: 0.01, score: 189
Scores: (min: 12, avg: 125.1, max: 366)

Run: 500, exploration: 0.01, score: 121
Scores: (min: 12, avg: 125, max: 366)

Run: 501, exploration: 0.01, score: 100
Scores: (min: 12, avg: 124.88, max: 366)

Run: 502, exploration: 0.01, score: 145
Scores: (min: 12, avg: 125.03, max: 366)

Run: 503, exploration: 0.0

Run: 592, exploration: 0.01, score: 140
Scores: (min: 12, avg: 139.17, max: 356)

Run: 593, exploration: 0.01, score: 164
Scores: (min: 12, avg: 139.83, max: 356)

Run: 594, exploration: 0.01, score: 119
Scores: (min: 12, avg: 139.85, max: 356)

Run: 595, exploration: 0.01, score: 122
Scores: (min: 12, avg: 140.01, max: 356)

Run: 596, exploration: 0.01, score: 150
Scores: (min: 12, avg: 140.23, max: 356)

Run: 597, exploration: 0.01, score: 96
Scores: (min: 12, avg: 139.79, max: 356)

Run: 598, exploration: 0.01, score: 128
Scores: (min: 12, avg: 139.94, max: 356)

Run: 599, exploration: 0.01, score: 189
Scores: (min: 12, avg: 139.94, max: 356)

Run: 600, exploration: 0.01, score: 132
Scores: (min: 12, avg: 140.05, max: 356)

Run: 601, exploration: 0.01, score: 109
Scores: (min: 12, avg: 140.14, max: 356)

Run: 602, exploration: 0.01, score: 144
Scores: (min: 12, avg: 140.13, max: 356)

Run: 603, exploration: 0.01, score: 160
Scores: (min: 12, avg: 140.02, max: 356)

Run: 604, explora

Run: 693, exploration: 0.01, score: 150
Scores: (min: 47, avg: 159.96, max: 323)

Run: 694, exploration: 0.01, score: 198
Scores: (min: 47, avg: 160.75, max: 323)

Run: 695, exploration: 0.01, score: 229
Scores: (min: 47, avg: 161.82, max: 323)

Run: 696, exploration: 0.01, score: 136
Scores: (min: 47, avg: 161.68, max: 323)

Run: 697, exploration: 0.01, score: 106
Scores: (min: 47, avg: 161.78, max: 323)

Run: 698, exploration: 0.01, score: 145
Scores: (min: 47, avg: 161.95, max: 323)

Run: 699, exploration: 0.01, score: 193
Scores: (min: 47, avg: 161.99, max: 323)

Run: 700, exploration: 0.01, score: 150
Scores: (min: 47, avg: 162.17, max: 323)

Run: 701, exploration: 0.01, score: 104
Scores: (min: 47, avg: 162.12, max: 323)

Run: 702, exploration: 0.01, score: 191
Scores: (min: 47, avg: 162.59, max: 323)

Run: 703, exploration: 0.01, score: 148
Scores: (min: 47, avg: 162.47, max: 323)

Run: 704, exploration: 0.01, score: 121
Scores: (min: 47, avg: 161.17, max: 323)

Run: 705, explor

Run: 794, exploration: 0.01, score: 103
Scores: (min: 48, avg: 171.46, max: 500)

Run: 795, exploration: 0.01, score: 124
Scores: (min: 48, avg: 170.41, max: 500)

Run: 796, exploration: 0.01, score: 123
Scores: (min: 48, avg: 170.28, max: 500)

Run: 797, exploration: 0.01, score: 74
Scores: (min: 48, avg: 169.96, max: 500)

Run: 798, exploration: 0.01, score: 101
Scores: (min: 48, avg: 169.52, max: 500)

Run: 799, exploration: 0.01, score: 301
Scores: (min: 48, avg: 170.6, max: 500)

Run: 800, exploration: 0.01, score: 249
Scores: (min: 48, avg: 171.59, max: 500)

Run: 801, exploration: 0.01, score: 500
Scores: (min: 48, avg: 175.55, max: 500)

Run: 802, exploration: 0.01, score: 258
Scores: (min: 48, avg: 176.22, max: 500)

Run: 803, exploration: 0.01, score: 212
Scores: (min: 48, avg: 176.86, max: 500)

Run: 804, exploration: 0.01, score: 130
Scores: (min: 48, avg: 176.95, max: 500)

Run: 805, exploration: 0.01, score: 224
Scores: (min: 48, avg: 176.03, max: 500)

Run: 806, explorat

NameError: name 'exit' is not defined

Note: If the code is running properly, you should begin to see output appearing above this code block. It will take several minutes, so it is recommended that you let this code run in the background while completing other work. When the code has finished, it will print output saying, "Solved in _ runs, _ total runs."

You may see an error about not having an exit command. This error does not affect the program's functionality and results from the steps taken to convert the code from Python 2.x to Python 3. Please disregard this error.

In [1]:
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
  
  
from scores.score_logger import ScoreLogger  
  
ENV_NAME = "CartPole-v1"  
  
GAMMA = 1.0 # Changed from 0.95 to 1.0 
LEARNING_RATE = 0.015 # Changed from 0.001 to 0.015  
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
  
EXPLORATION_MAX = 1.0  
EXPLORATION_MIN = 0.05 # Changed from 0.01 to 0.05
EXPLORATION_DECAY = 0.975 # Changed from 0.995 to 0.975  
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print ("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))  
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  



Using TensorFlow backend.


In [3]:
cartpole()

Run: 1, exploration: 1.0, score: 12
Scores: (min: 12, avg: 12, max: 12)

Run: 2, exploration: 0.70155967750146, score: 22
Scores: (min: 12, avg: 17, max: 22)

Run: 3, exploration: 0.558606726608075, score: 10
Scores: (min: 10, avg: 14.666666666666666, max: 22)

Run: 4, exploration: 0.3120414750117885, score: 24
Scores: (min: 10, avg: 17, max: 24)

Run: 5, exploration: 0.19783148213754234, score: 19
Scores: (min: 10, avg: 17.4, max: 24)

Run: 6, exploration: 0.10774724674045444, score: 25
Scores: (min: 10, avg: 18.666666666666668, max: 25)

Run: 7, exploration: 0.05439147112547512, score: 28
Scores: (min: 10, avg: 20, max: 28)

Run: 8, exploration: 0.05, score: 16
Scores: (min: 10, avg: 19.5, max: 28)

Run: 9, exploration: 0.05, score: 26
Scores: (min: 10, avg: 20.22222222222222, max: 28)

Run: 10, exploration: 0.05, score: 10
Scores: (min: 10, avg: 19.2, max: 28)

Run: 11, exploration: 0.05, score: 14
Scores: (min: 10, avg: 18.727272727272727, max: 28)

Run: 12, exploration: 0.05, scor

Run: 95, exploration: 0.05, score: 31
Scores: (min: 8, avg: 19.33684210526316, max: 76)

Run: 96, exploration: 0.05, score: 9
Scores: (min: 8, avg: 19.229166666666668, max: 76)

Run: 97, exploration: 0.05, score: 32
Scores: (min: 8, avg: 19.36082474226804, max: 76)

Run: 98, exploration: 0.05, score: 17
Scores: (min: 8, avg: 19.336734693877553, max: 76)

Run: 99, exploration: 0.05, score: 34
Scores: (min: 8, avg: 19.484848484848484, max: 76)

Run: 100, exploration: 0.05, score: 9
Scores: (min: 8, avg: 19.38, max: 76)

Run: 101, exploration: 0.05, score: 29
Scores: (min: 8, avg: 19.55, max: 76)

Run: 102, exploration: 0.05, score: 20
Scores: (min: 8, avg: 19.53, max: 76)

Run: 103, exploration: 0.05, score: 23
Scores: (min: 8, avg: 19.66, max: 76)

Run: 104, exploration: 0.05, score: 9
Scores: (min: 8, avg: 19.51, max: 76)

Run: 105, exploration: 0.05, score: 9
Scores: (min: 8, avg: 19.41, max: 76)

Run: 106, exploration: 0.05, score: 12
Scores: (min: 8, avg: 19.28, max: 76)

Run: 107, 

Run: 198, exploration: 0.05, score: 142
Scores: (min: 8, avg: 124.4, max: 500)

Run: 199, exploration: 0.05, score: 163
Scores: (min: 8, avg: 125.69, max: 500)

Run: 200, exploration: 0.05, score: 182
Scores: (min: 8, avg: 127.42, max: 500)

Run: 201, exploration: 0.05, score: 156
Scores: (min: 8, avg: 128.69, max: 500)

Run: 202, exploration: 0.05, score: 129
Scores: (min: 8, avg: 129.78, max: 500)

Run: 203, exploration: 0.05, score: 244
Scores: (min: 8, avg: 131.99, max: 500)

Run: 204, exploration: 0.05, score: 125
Scores: (min: 8, avg: 133.15, max: 500)

Run: 205, exploration: 0.05, score: 136
Scores: (min: 8, avg: 134.42, max: 500)

Run: 206, exploration: 0.05, score: 291
Scores: (min: 8, avg: 137.21, max: 500)

Run: 207, exploration: 0.05, score: 187
Scores: (min: 8, avg: 138.98, max: 500)

Run: 208, exploration: 0.05, score: 126
Scores: (min: 10, avg: 140.16, max: 500)

Run: 209, exploration: 0.05, score: 177
Scores: (min: 10, avg: 141.79, max: 500)

Run: 210, exploration: 0.05

NameError: name 'exit' is not defined

In [8]:
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
  
  
from scores.score_logger import ScoreLogger  
  
ENV_NAME = "CartPole-v1"  
  
GAMMA = 1.0 # Changed from 0.95 to 1.0 
LEARNING_RATE = 0.025 # Changed from 0.015 to 0.025  
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
  
EXPLORATION_MAX = 1.0  
EXPLORATION_MIN = 0.075 # Changed from 0.05 to 0.075
EXPLORATION_DECAY = 0.999 # Changed from 0.975 to 0.999  
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print ("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))  
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  



In [9]:
cartpole()

Run: 1, exploration: 1.0, score: 16
Scores: (min: 16, avg: 16, max: 16)

Run: 2, exploration: 0.9821521870514506, score: 22
Scores: (min: 16, avg: 19, max: 22)

Run: 3, exploration: 0.9694605362958227, score: 14
Scores: (min: 14, avg: 17.333333333333332, max: 22)

Run: 4, exploration: 0.9531108968798944, score: 18
Scores: (min: 14, avg: 17.5, max: 22)

Run: 5, exploration: 0.925854183751895, score: 30
Scores: (min: 14, avg: 20, max: 30)

Run: 6, exploration: 0.9093297114626595, score: 19
Scores: (min: 14, avg: 19.833333333333332, max: 30)

Run: 7, exploration: 0.8913148576343527, score: 21
Scores: (min: 14, avg: 20, max: 30)

Run: 8, exploration: 0.8701675093639105, score: 25
Scores: (min: 14, avg: 20.625, max: 30)

Run: 9, exploration: 0.8351949903256736, score: 42
Scores: (min: 14, avg: 23, max: 42)

Run: 10, exploration: 0.8227543820685892, score: 16
Scores: (min: 14, avg: 22.3, max: 42)

Run: 11, exploration: 0.8040377382995522, score: 24
Scores: (min: 14, avg: 22.454545454545453, 

Run: 83, exploration: 0.31519071309757335, score: 12
Scores: (min: 8, avg: 15.120481927710843, max: 45)

Run: 84, exploration: 0.3095652482070879, score: 19
Scores: (min: 8, avg: 15.166666666666666, max: 45)

Run: 85, exploration: 0.30373614537615473, score: 20
Scores: (min: 8, avg: 15.223529411764705, max: 45)

Run: 86, exploration: 0.3013147438372364, score: 9
Scores: (min: 8, avg: 15.151162790697674, max: 45)

Run: 87, exploration: 0.29771878754395165, score: 13
Scores: (min: 8, avg: 15.126436781609195, max: 45)

Run: 88, exploration: 0.2947549613501431, score: 11
Scores: (min: 8, avg: 15.079545454545455, max: 45)

Run: 89, exploration: 0.29211275315428575, score: 10
Scores: (min: 8, avg: 15.02247191011236, max: 45)

Run: 90, exploration: 0.28149687783101773, score: 38
Scores: (min: 8, avg: 15.277777777777779, max: 45)

Run: 91, exploration: 0.2789735162078359, score: 10
Scores: (min: 8, avg: 15.219780219780219, max: 45)

Run: 92, exploration: 0.2767495237336227, score: 9
Scores: (m

Run: 170, exploration: 0.075, score: 22
Scores: (min: 8, avg: 16.69, max: 62)

Run: 171, exploration: 0.075, score: 10
Scores: (min: 8, avg: 16.69, max: 62)

Run: 172, exploration: 0.075, score: 12
Scores: (min: 8, avg: 16.66, max: 62)

Run: 173, exploration: 0.075, score: 11
Scores: (min: 8, avg: 16.66, max: 62)

Run: 174, exploration: 0.075, score: 11
Scores: (min: 8, avg: 16.68, max: 62)

Run: 175, exploration: 0.075, score: 20
Scores: (min: 8, avg: 16.78, max: 62)

Run: 176, exploration: 0.075, score: 9
Scores: (min: 8, avg: 16.78, max: 62)

Run: 177, exploration: 0.075, score: 10
Scores: (min: 8, avg: 16.78, max: 62)

Run: 178, exploration: 0.075, score: 16
Scores: (min: 8, avg: 16.84, max: 62)

Run: 179, exploration: 0.075, score: 16
Scores: (min: 8, avg: 16.9, max: 62)

Run: 180, exploration: 0.075, score: 11
Scores: (min: 8, avg: 16.91, max: 62)

Run: 181, exploration: 0.075, score: 12
Scores: (min: 8, avg: 16.9, max: 62)

Run: 182, exploration: 0.075, score: 16
Scores: (min: 9

Run: 275, exploration: 0.075, score: 8
Scores: (min: 8, avg: 12.87, max: 89)

Run: 276, exploration: 0.075, score: 9
Scores: (min: 8, avg: 12.87, max: 89)

Run: 277, exploration: 0.075, score: 9
Scores: (min: 8, avg: 12.86, max: 89)

Run: 278, exploration: 0.075, score: 10
Scores: (min: 8, avg: 12.8, max: 89)

Run: 279, exploration: 0.075, score: 9
Scores: (min: 8, avg: 12.73, max: 89)

Run: 280, exploration: 0.075, score: 9
Scores: (min: 8, avg: 12.71, max: 89)

Run: 281, exploration: 0.075, score: 10
Scores: (min: 8, avg: 12.69, max: 89)

Run: 282, exploration: 0.075, score: 12
Scores: (min: 8, avg: 12.65, max: 89)

Run: 283, exploration: 0.075, score: 10
Scores: (min: 8, avg: 12.66, max: 89)

Run: 284, exploration: 0.075, score: 8
Scores: (min: 8, avg: 12.63, max: 89)

Run: 285, exploration: 0.075, score: 14
Scores: (min: 8, avg: 12.67, max: 89)

Run: 286, exploration: 0.075, score: 9
Scores: (min: 8, avg: 12.64, max: 89)

Run: 287, exploration: 0.075, score: 26
Scores: (min: 8, avg

Run: 379, exploration: 0.075, score: 142
Scores: (min: 8, avg: 50.44, max: 418)

Run: 380, exploration: 0.075, score: 113
Scores: (min: 8, avg: 51.48, max: 418)

Run: 381, exploration: 0.075, score: 115
Scores: (min: 8, avg: 52.53, max: 418)

Run: 382, exploration: 0.075, score: 500
Scores: (min: 8, avg: 57.41, max: 500)

Run: 383, exploration: 0.075, score: 500
Scores: (min: 8, avg: 62.31, max: 500)

Run: 384, exploration: 0.075, score: 292
Scores: (min: 8, avg: 65.15, max: 500)

Run: 385, exploration: 0.075, score: 342
Scores: (min: 8, avg: 68.43, max: 500)

Run: 386, exploration: 0.075, score: 262
Scores: (min: 8, avg: 70.96, max: 500)

Run: 387, exploration: 0.075, score: 500
Scores: (min: 8, avg: 75.7, max: 500)

Run: 388, exploration: 0.075, score: 155
Scores: (min: 8, avg: 76.67, max: 500)

Run: 389, exploration: 0.075, score: 101
Scores: (min: 8, avg: 77.35, max: 500)

Run: 390, exploration: 0.075, score: 500
Scores: (min: 8, avg: 82.21, max: 500)

Run: 391, exploration: 0.075,

NameError: name 'exit' is not defined

    In this scenario, the agent's objective is to balance a pole on top of a cart that moves left or right along a frictionless track. If the pole falls to either side, the episode ends in failure for the agent. At each step the agent observes four values (cart position, cart velocity, pole angle, and pole angular velocity) and chooses one of two actions: push the cart left or push it right. The goal is to reach an average score of 195. The algorithm used to solve this problem is Q-learning, a model-free reinforcement learning method in which the agent learns which action is best to take in its current state (Banoula, 2023).
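
To make the Q-learning idea concrete, here is a minimal sketch of the tabular update rule on a made-up problem. The state names, the `ALPHA` value, and the `q_update` helper are illustrative assumptions and are not part of the notebook; `GAMMA` reuses the notebook's original value of 0.95.

```python
from collections import defaultdict

ALPHA = 0.5   # learning rate for the table update (assumed value)
GAMMA = 0.95  # discount factor, as in the original notebook cell


def q_update(q_table, state, action, reward, next_state, done, n_actions=2):
    """Apply one Q-learning update and return the new Q(state, action)."""
    # A terminal transition has no future value to bootstrap from.
    if done:
        best_next = 0.0
    else:
        best_next = max(q_table[(next_state, a)] for a in range(n_actions))
    target = reward + GAMMA * best_next
    # Move the stored estimate part of the way toward the target.
    q_table[(state, action)] += ALPHA * (target - q_table[(state, action)])
    return q_table[(state, action)]


q_table = defaultdict(float)  # every unseen (state, action) starts at 0.0
first = q_update(q_table, "upright", 1, 1.0, "tilted", done=False)
second = q_update(q_table, "tilted", 0, -1.0, "fallen", done=True)
print(first, second)
```

The table grows with the number of distinct states, which is why a continuous problem like CartPole is usually handled with a function approximator instead.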
    
    According to an article by Jordi Torres, experience replay stores past transitions as tuples in a replay buffer and trains on small batches sampled at random from it, rather than only on the most recent experience, so that an individual tuple can be reused many times during learning (TORRES.AI, 2021). Future rewards can also be discounted. The discount factor (GAMMA in the code above) sets how much a future reward counts relative to an immediate one (Banoula, 2023): with a discount factor below 1, a reward is typically worth more the sooner the agent receives it.
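
The two ideas in this paragraph can be sketched in a few lines. This is a simplified illustration and not the notebook's `DQNSolver`: the buffer size, the placeholder transitions, and the `discounted_return` helper are assumptions chosen to keep the example small.

```python
import random
from collections import deque


def discounted_return(rewards, gamma):
    """Sum the rewards, weighting the reward k steps ahead by gamma**k."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Replay buffer: a bounded deque silently drops the oldest transition when full.
buffer = deque(maxlen=5)
for step in range(8):
    # (state, action, reward, next_state, done) with placeholder values
    buffer.append((step, 0, 1.0, step + 1, False))

# Sample a small random batch of past transitions to train on.
batch = random.sample(list(buffer), 3)

print(discounted_return([1.0, 1.0, 1.0], 0.95))
print(discounted_return([1.0, 1.0, 1.0], 1.0))
print(batch)
```

With `gamma` set to 1.0, as in the later cells of this notebook, every future reward counts as much as an immediate one.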
    
    Standard Q-learning stores a Q-value for each state-action pair, whereas deep Q-learning uses a neural network to produce a Q-value for each action from the current state, and it typically contains two neural networks in its learning process (Singh, 2022). The code in this notebook uses a single network. Because experience replay trains on small batches sampled from the replay buffer, each training step works with only a small amount of data, which keeps the memory needed for training low. Altering the learning rate and the other hyperparameters changed how quickly the agent finished: the output above ends at about run 806 with the original settings, about run 210 with the second set, and about run 391 with the third. In these three trials, small changes to the existing values made the agent learn the game in fewer runs than the more drastic changes did. Each configuration was run only once, so some of the difference may be due to randomness.
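
One way to see how differently the three configurations explore is to count how many decay steps the exploration rate needs to fall to its minimum. The sketch below assumes, as in the notebook's `experience_replay` method, that the rate is multiplied by the decay factor once per replay call; the `steps_to_floor` helper and the `settings` labels are illustrative names and not part of the notebook.

```python
def steps_to_floor(decay, floor, start=1.0):
    """Count multiplications by `decay` until the rate drops to `floor` or below."""
    rate = start
    steps = 0
    while rate > floor:
        rate *= decay
        steps += 1
    return steps


# (EXPLORATION_DECAY, EXPLORATION_MIN) for each of the three cells above
settings = {
    "original": (0.995, 0.01),
    "second": (0.975, 0.05),
    "third": (0.999, 0.075),
}

for name, (decay, floor) in settings.items():
    print(name, steps_to_floor(decay, floor))
```

This is consistent with the logs above, where the second configuration is already at its minimum of 0.05 by run 8, while the third is still at about 0.28 at run 92.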
    
Resources:

Banoula, M. (2023, February 22). What is Q-learning: Everything you need to know. Simplilearn. https://www.simplilearn.com/tutorials/machine-learning-tutorial/what-is-q-learning

Singh, S. (2022, October 7). A comprehensive guide to neural networks in deep Q-learning. Turing. https://www.turing.com/kb/how-are-neural-networks-used-in-deep-q-learning

TORRES.AI, J. (2021, May 10). Deep Q-Network (DQN)-II. Medium. https://towardsdatascience.com/deep-q-network-dqn-ii-b6bf911b6b2c 

