# Module Five Assignment: Cartpole Problem
Review the code in this notebook and in the score_logger.py file in the *scores* folder (directory). Once you have reviewed the code, return to this notebook and select **Cell** and then **Run All** from the menu bar to run this code. The code takes several minutes to run.

In [1]:
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
  
  
from scores.score_logger import ScoreLogger  
  
ENV_NAME = "CartPole-v1"  
  
GAMMA = 0.95  
LEARNING_RATE = 0.001  
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
  
EXPLORATION_MAX = 1.0  
EXPLORATION_MIN = 0.01  
EXPLORATION_DECAY = 0.995  
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print ("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))  
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  



Using TensorFlow backend.
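
To make the update in `experience_replay` concrete, the target written into `q_values[0][action]` can be computed in isolation. The sketch below is a plain-Python illustration and not part of the notebook: `q_target` is a hypothetical helper, the built-in `max` stands in for `np.amax`, and the Q-values passed in are made-up numbers.

```python
GAMMA = 0.95  # same discount factor as the starter code


def q_target(reward, next_q_values, terminal, gamma=GAMMA):
    """Return the value that replaces Q(state, action) for one replayed transition."""
    if terminal:
        # No future reward is available after a terminal step.
        return reward
    # Otherwise bootstrap from the best predicted value of the next state.
    return reward + gamma * max(next_q_values)


# Non-terminal step: reward of 1 plus the discounted best next-state value.
non_terminal = q_target(1.0, [2.0, 3.0], terminal=False)

# Terminal step: the notebook flips the reward to -1 and does not bootstrap.
terminal_case = q_target(-1.0, [2.0, 3.0], terminal=True)

print(non_terminal, terminal_case)
```

Only the entry for the action that was actually taken is overwritten; the other outputs of the network keep their predicted values, so they contribute no error to the `mse` loss for that sample.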


In [2]:
cartpole()

Run: 1, exploration: 1.0, score: 14
Scores: (min: 14, avg: 14, max: 14)

Run: 2, exploration: 0.960693043575437, score: 14
Scores: (min: 14, avg: 14, max: 14)

Run: 3, exploration: 0.8778091417340573, score: 19
Scores: (min: 14, avg: 15.666666666666666, max: 19)

Run: 4, exploration: 0.7590483508202912, score: 30
Scores: (min: 14, avg: 19.25, max: 30)

Run: 5, exploration: 0.7183288830986236, score: 12
Scores: (min: 12, avg: 17.8, max: 30)

Run: 6, exploration: 0.6596532430440636, score: 18
Scores: (min: 12, avg: 17.833333333333332, max: 30)

Run: 7, exploration: 0.5937455908197752, score: 22
Scores: (min: 12, avg: 18.428571428571427, max: 30)

Run: 8, exploration: 0.5344229416520513, score: 22
Scores: (min: 12, avg: 18.875, max: 30)

Run: 9, exploration: 0.5032248303978422, score: 13
Scores: (min: 12, avg: 18.22222222222222, max: 30)

Run: 10, exploration: 0.4858739637363176, score: 8
Scores: (min: 8, avg: 17.2, max: 30)

Run: 11, exploration: 0.46677573701590436, score: 9
Scores: (mi

Run: 82, exploration: 0.01831548549245123, score: 10
Scores: (min: 8, avg: 10.951219512195122, max: 30)

Run: 83, exploration: 0.017595559502304726, score: 9
Scores: (min: 8, avg: 10.927710843373495, max: 30)

Run: 84, exploration: 0.016903931611681827, score: 9
Scores: (min: 8, avg: 10.904761904761905, max: 30)

Run: 85, exploration: 0.01615829206087557, score: 10
Scores: (min: 8, avg: 10.894117647058824, max: 30)

Run: 86, exploration: 0.015523158778943369, score: 9
Scores: (min: 8, avg: 10.872093023255815, max: 30)

Run: 87, exploration: 0.014838425699981627, score: 10
Scores: (min: 8, avg: 10.862068965517242, max: 30)

Run: 88, exploration: 0.014255172347583332, score: 9
Scores: (min: 8, avg: 10.840909090909092, max: 30)

Run: 89, exploration: 0.013626370684745774, score: 10
Scores: (min: 8, avg: 10.831460674157304, max: 30)

Run: 90, exploration: 0.01302530572838545, score: 10
Scores: (min: 8, avg: 10.822222222222223, max: 30)

Run: 91, exploration: 0.012513320603703188, score: 9


Run: 182, exploration: 0.01, score: 116
Scores: (min: 9, avg: 31.33, max: 122)

Run: 183, exploration: 0.01, score: 110
Scores: (min: 9, avg: 32.34, max: 122)

Run: 184, exploration: 0.01, score: 107
Scores: (min: 9, avg: 33.32, max: 122)

Run: 185, exploration: 0.01, score: 130
Scores: (min: 9, avg: 34.52, max: 130)

Run: 186, exploration: 0.01, score: 108
Scores: (min: 9, avg: 35.51, max: 130)

Run: 187, exploration: 0.01, score: 127
Scores: (min: 9, avg: 36.68, max: 130)

Run: 188, exploration: 0.01, score: 115
Scores: (min: 9, avg: 37.74, max: 130)

Run: 189, exploration: 0.01, score: 119
Scores: (min: 9, avg: 38.83, max: 130)

Run: 190, exploration: 0.01, score: 110
Scores: (min: 9, avg: 39.83, max: 130)

Run: 191, exploration: 0.01, score: 122
Scores: (min: 9, avg: 40.96, max: 130)

Run: 192, exploration: 0.01, score: 110
Scores: (min: 9, avg: 41.97, max: 130)

Run: 193, exploration: 0.01, score: 131
Scores: (min: 9, avg: 43.18, max: 131)

Run: 194, exploration: 0.01, score: 135


Run: 284, exploration: 0.01, score: 129
Scores: (min: 96, avg: 125.35, max: 156)

Run: 285, exploration: 0.01, score: 125
Scores: (min: 96, avg: 125.3, max: 156)

Run: 286, exploration: 0.01, score: 120
Scores: (min: 96, avg: 125.42, max: 156)

Run: 287, exploration: 0.01, score: 134
Scores: (min: 96, avg: 125.49, max: 156)

Run: 288, exploration: 0.01, score: 118
Scores: (min: 96, avg: 125.52, max: 156)

Run: 289, exploration: 0.01, score: 116
Scores: (min: 96, avg: 125.49, max: 156)

Run: 290, exploration: 0.01, score: 141
Scores: (min: 96, avg: 125.8, max: 156)

Run: 291, exploration: 0.01, score: 125
Scores: (min: 96, avg: 125.83, max: 156)

Run: 292, exploration: 0.01, score: 142
Scores: (min: 96, avg: 126.15, max: 156)

Run: 293, exploration: 0.01, score: 140
Scores: (min: 96, avg: 126.24, max: 156)

Run: 294, exploration: 0.01, score: 116
Scores: (min: 96, avg: 126.05, max: 156)

Run: 295, exploration: 0.01, score: 134
Scores: (min: 96, avg: 126.14, max: 156)

Run: 296, explorat

Run: 385, exploration: 0.01, score: 132
Scores: (min: 45, avg: 124.79, max: 156)

Run: 386, exploration: 0.01, score: 136
Scores: (min: 45, avg: 124.95, max: 156)

Run: 387, exploration: 0.01, score: 146
Scores: (min: 45, avg: 125.07, max: 156)

Run: 388, exploration: 0.01, score: 37
Scores: (min: 37, avg: 124.26, max: 156)

Run: 389, exploration: 0.01, score: 129
Scores: (min: 37, avg: 124.39, max: 156)

Run: 390, exploration: 0.01, score: 125
Scores: (min: 37, avg: 124.23, max: 156)

Run: 391, exploration: 0.01, score: 90
Scores: (min: 37, avg: 123.88, max: 156)

Run: 392, exploration: 0.01, score: 92
Scores: (min: 37, avg: 123.38, max: 156)

Run: 393, exploration: 0.01, score: 133
Scores: (min: 37, avg: 123.31, max: 156)

Run: 394, exploration: 0.01, score: 110
Scores: (min: 37, avg: 123.25, max: 156)

Run: 395, exploration: 0.01, score: 159
Scores: (min: 37, avg: 123.5, max: 159)

Run: 396, exploration: 0.01, score: 117
Scores: (min: 37, avg: 123.32, max: 159)

Run: 397, exploratio

Run: 486, exploration: 0.01, score: 150
Scores: (min: 12, avg: 132.02, max: 193)

Run: 487, exploration: 0.01, score: 128
Scores: (min: 12, avg: 131.84, max: 193)

Run: 488, exploration: 0.01, score: 124
Scores: (min: 12, avg: 132.71, max: 193)

Run: 489, exploration: 0.01, score: 60
Scores: (min: 12, avg: 132.02, max: 193)

Run: 490, exploration: 0.01, score: 132
Scores: (min: 12, avg: 132.09, max: 193)

Run: 491, exploration: 0.01, score: 132
Scores: (min: 12, avg: 132.51, max: 193)

Run: 492, exploration: 0.01, score: 143
Scores: (min: 12, avg: 133.02, max: 193)

Run: 493, exploration: 0.01, score: 123
Scores: (min: 12, avg: 132.92, max: 193)

Run: 494, exploration: 0.01, score: 128
Scores: (min: 12, avg: 133.1, max: 193)

Run: 495, exploration: 0.01, score: 169
Scores: (min: 12, avg: 133.2, max: 193)

Run: 496, exploration: 0.01, score: 139
Scores: (min: 12, avg: 133.42, max: 193)

Run: 497, exploration: 0.01, score: 185
Scores: (min: 12, avg: 133.82, max: 193)

Run: 498, explorati

Run: 587, exploration: 0.01, score: 111
Scores: (min: 52, avg: 141.7, max: 227)

Run: 588, exploration: 0.01, score: 143
Scores: (min: 52, avg: 141.89, max: 227)

Run: 589, exploration: 0.01, score: 131
Scores: (min: 52, avg: 142.6, max: 227)

Run: 590, exploration: 0.01, score: 124
Scores: (min: 52, avg: 142.52, max: 227)

Run: 591, exploration: 0.01, score: 128
Scores: (min: 52, avg: 142.48, max: 227)

Run: 592, exploration: 0.01, score: 133
Scores: (min: 52, avg: 142.38, max: 227)

Run: 593, exploration: 0.01, score: 145
Scores: (min: 52, avg: 142.6, max: 227)

Run: 594, exploration: 0.01, score: 128
Scores: (min: 52, avg: 142.6, max: 227)

Run: 595, exploration: 0.01, score: 141
Scores: (min: 52, avg: 142.32, max: 227)

Run: 596, exploration: 0.01, score: 211
Scores: (min: 52, avg: 143.04, max: 227)

Run: 597, exploration: 0.01, score: 57
Scores: (min: 52, avg: 141.76, max: 227)

Run: 598, exploration: 0.01, score: 169
Scores: (min: 52, avg: 141.79, max: 227)

Run: 599, exploration

Run: 688, exploration: 0.01, score: 198
Scores: (min: 15, avg: 161.01, max: 500)

Run: 689, exploration: 0.01, score: 100
Scores: (min: 15, avg: 160.7, max: 500)

Run: 690, exploration: 0.01, score: 134
Scores: (min: 15, avg: 160.8, max: 500)

Run: 691, exploration: 0.01, score: 187
Scores: (min: 15, avg: 161.39, max: 500)

Run: 692, exploration: 0.01, score: 100
Scores: (min: 15, avg: 161.06, max: 500)

Run: 693, exploration: 0.01, score: 370
Scores: (min: 15, avg: 163.31, max: 500)

Run: 694, exploration: 0.01, score: 111
Scores: (min: 15, avg: 163.14, max: 500)

Run: 695, exploration: 0.01, score: 145
Scores: (min: 15, avg: 163.18, max: 500)

Run: 696, exploration: 0.01, score: 258
Scores: (min: 15, avg: 163.65, max: 500)

Run: 697, exploration: 0.01, score: 302
Scores: (min: 15, avg: 166.1, max: 500)

Run: 698, exploration: 0.01, score: 225
Scores: (min: 15, avg: 166.66, max: 500)

Run: 699, exploration: 0.01, score: 128
Scores: (min: 15, avg: 166.56, max: 500)

Run: 700, explorati

Run: 789, exploration: 0.01, score: 197
Scores: (min: 10, avg: 157.71, max: 500)

Run: 790, exploration: 0.01, score: 500
Scores: (min: 10, avg: 161.37, max: 500)

Run: 791, exploration: 0.01, score: 108
Scores: (min: 10, avg: 160.58, max: 500)

Run: 792, exploration: 0.01, score: 440
Scores: (min: 10, avg: 163.98, max: 500)

Run: 793, exploration: 0.01, score: 171
Scores: (min: 10, avg: 161.99, max: 500)

Run: 794, exploration: 0.01, score: 201
Scores: (min: 10, avg: 162.89, max: 500)

Run: 795, exploration: 0.01, score: 130
Scores: (min: 10, avg: 162.74, max: 500)

Run: 796, exploration: 0.01, score: 490
Scores: (min: 10, avg: 165.06, max: 500)

Run: 797, exploration: 0.01, score: 137
Scores: (min: 10, avg: 163.41, max: 500)

Run: 798, exploration: 0.01, score: 145
Scores: (min: 10, avg: 162.61, max: 500)

Run: 799, exploration: 0.01, score: 107
Scores: (min: 10, avg: 162.4, max: 500)

Run: 800, exploration: 0.01, score: 226
Scores: (min: 10, avg: 163.89, max: 500)

Run: 801, explora

NameError: name 'exit' is not defined

Note: If the code is running properly, you should begin to see output appearing above this code block. It will take several minutes, so it is recommended that you let this code run in the background while completing other work. When the code has finished, it will print output saying, "Solved in _ runs, _ total runs."

You may see an error about not having an exit command. This error does not affect the program's functionality and results from the steps taken to convert the code from Python 2.x to Python 3. Please disregard this error.






# Value Modification to Explore Effects on Algorithm Performance


If the objective is to explore how each value influences the performance, we should be able to return to a baseline setting easily.

In [2]:
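# These are the values used in the starter code. We will return to these while exploring behavior.
def reset_to_baseline_settings():
    # Declare the settings as global so the assignments below rebind the
    # module-level names; without this they would only create local variables.
    global GAMMA, LEARNING_RATE, MEMORY_SIZE, BATCH_SIZE
    global EXPLORATION_MAX, EXPLORATION_MIN, EXPLORATION_DECAY

    GAMMA = 0.95
    LEARNING_RATE = 0.001

    MEMORY_SIZE = 1000000
    BATCH_SIZE = 20

    EXPLORATION_MAX = 1.0
    EXPLORATION_MIN = 0.01
    EXPLORATION_DECAY = 0.995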
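
A reset helper of this kind only works if its assignments reach the module-level names. The sketch below uses a single stand-in setting to show the difference between assigning with and without a `global` declaration. The function names `reset_local` and `reset_global` are hypothetical and exist only for this illustration.

```python
# Stand-in for one of the notebook's module-level settings.
GAMMA = 0.95


def reset_local():
    # Plain assignment creates a new local name; the module-level GAMMA is untouched.
    GAMMA = 0.95


def reset_global():
    # The global declaration makes the assignment rebind the module-level name.
    global GAMMA
    GAMMA = 0.95


GAMMA = 0.96           # simulate an experiment that changed the setting
reset_local()
after_local = GAMMA    # the experiment's value survives the "reset"

reset_global()
after_global = GAMMA   # the baseline value is restored

print(after_local, after_global)
```

In the recorded output further down, the run with EXPLORATION_MIN = 0.012 starts at an exploration rate of 0.8, the value set by the experiment before it. That is the behavior one would expect if a reset did not take effect, so it is worth checking when comparing those results against the baseline.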

I propose that we do not need to attempt to solve the problem fully at this stage; it may take a long time to reach a stable mean score of 195. We can explore behavior at an earlier stage and extrapolate. Here, I am slightly modifying the print statements and adding the option to end early.


In [3]:
def cartpole(max_run):  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
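            if terminal:
                # Stop once the requested number of runs is reached (this final run is not logged).
                if run >= max_run:
                    return
                # Print a separator before every tenth run to make the log easier to scan.
                if run % 10 == 0:
                    print("####################")
                print("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step), end=' | ')
                score_logger.add_score(step, run)
                break
            dqn_solver.experience_replay()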
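
The target mentioned earlier, a stable mean score of 195, can be checked with a rolling window over recent scores. The sketch below assumes the mean is taken over the most recent 100 runs. That assumption is consistent with the logged averages, but it is not confirmed here because score_logger.py is not shown. `first_solved_run` is a hypothetical helper, and the score list is synthetic.

```python
from collections import deque
from statistics import mean

AVERAGE_SCORE_TO_SOLVE = 195  # threshold mentioned in the text
CONSECUTIVE_RUNS = 100        # assumed window size, not confirmed by the notebook


def first_solved_run(scores, threshold=AVERAGE_SCORE_TO_SOLVE, window=CONSECUTIVE_RUNS):
    """Return the 1-based run at which the rolling mean first reaches the threshold, or None."""
    recent = deque(maxlen=window)  # old scores fall out automatically
    for run, score in enumerate(scores, start=1):
        recent.append(score)
        if len(recent) == window and mean(recent) >= threshold:
            return run
    return None


# Synthetic example: 50 poor runs followed by 150 good ones.
synthetic_scores = [10] * 50 + [200] * 150
solved_at = first_solved_run(synthetic_scores)
print(solved_at)
```

Because early low scores stay in the window for 100 runs, the rolling mean lags behind the agent's current performance, which is one reason stopping at run 250 and extrapolating can save time.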

First, increase the discount factor (gamma) by 0.01.

In [11]:
# Note: the baseline average at run# 250 was 101.74
reset_to_baseline_settings()
GAMMA = 0.96 
cartpole(250)

Run: 1, exploration: 1.0, score: 16 | Scores: (min: 16, avg: 16, max: 16)

Run: 2, exploration: 0.9558895783575597, score: 13 | Scores: (min: 13, avg: 14.5, max: 16)

Run: 3, exploration: 0.7744209942832988, score: 43 | Scores: (min: 13, avg: 24, max: 43)

Run: 4, exploration: 0.6866430931872001, score: 25 | Scores: (min: 13, avg: 24.25, max: 43)

Run: 5, exploration: 0.6274028820538087, score: 19 | Scores: (min: 13, avg: 23.2, max: 43)

Run: 6, exploration: 0.3420891339682016, score: 122 | Scores: (min: 13, avg: 39.666666666666664, max: 122)

Run: 7, exploration: 0.192217783647157, score: 116 | Scores: (min: 13, avg: 50.57142857142857, max: 122)

Run: 8, exploration: 0.1580861105294992, score: 40 | Scores: (min: 13, avg: 49.25, max: 122)

Run: 9, exploration: 0.11299003011401039, score: 68 | Scores: (min: 13, avg: 51.333333333333336, max: 122)

####################
Run: 10, exploration: 0.10068643904747315, score: 24 | Scores: (min: 13, avg: 48.6, max: 122)

Run: 11, exploration: 0.09

Run: 88, exploration: 0.01, score: 194 | Scores: (min: 8, avg: 71.13636363636364, max: 417)

Run: 89, exploration: 0.01, score: 34 | Scores: (min: 8, avg: 70.71910112359551, max: 417)

####################
Run: 90, exploration: 0.01, score: 267 | Scores: (min: 8, avg: 72.9, max: 417)

Run: 91, exploration: 0.01, score: 158 | Scores: (min: 8, avg: 73.83516483516483, max: 417)

Run: 92, exploration: 0.01, score: 160 | Scores: (min: 8, avg: 74.77173913043478, max: 417)

Run: 93, exploration: 0.01, score: 184 | Scores: (min: 8, avg: 75.94623655913979, max: 417)

Run: 94, exploration: 0.01, score: 189 | Scores: (min: 8, avg: 77.14893617021276, max: 417)

Run: 95, exploration: 0.01, score: 181 | Scores: (min: 8, avg: 78.2421052631579, max: 417)

Run: 96, exploration: 0.01, score: 164 | Scores: (min: 8, avg: 79.13541666666667, max: 417)

Run: 97, exploration: 0.01, score: 169 | Scores: (min: 8, avg: 80.0618556701031, max: 417)

Run: 98, exploration: 0.01, score: 180 | Scores: (min: 8, avg: 81

Run: 184, exploration: 0.01, score: 230 | Scores: (min: 8, avg: 158.72, max: 427)

Run: 185, exploration: 0.01, score: 173 | Scores: (min: 8, avg: 158.71, max: 427)

Run: 186, exploration: 0.01, score: 189 | Scores: (min: 8, avg: 159.09, max: 427)

Run: 187, exploration: 0.01, score: 200 | Scores: (min: 8, avg: 159.48, max: 427)

Run: 188, exploration: 0.01, score: 284 | Scores: (min: 8, avg: 160.38, max: 427)

Run: 189, exploration: 0.01, score: 205 | Scores: (min: 8, avg: 162.09, max: 427)

####################
Run: 190, exploration: 0.01, score: 201 | Scores: (min: 8, avg: 161.43, max: 427)

Run: 191, exploration: 0.01, score: 160 | Scores: (min: 8, avg: 161.45, max: 427)

Run: 192, exploration: 0.01, score: 319 | Scores: (min: 8, avg: 163.04, max: 427)

Run: 193, exploration: 0.01, score: 156 | Scores: (min: 8, avg: 162.76, max: 427)

Run: 194, exploration: 0.01, score: 252 | Scores: (min: 8, avg: 163.39, max: 427)

Run: 195, exploration: 0.01, score: 93 | Scores: (min: 8, avg: 162

Next, decrease the discount factor (gamma) by 0.01.

In [5]:
# Note: the baseline average at run# 250 was 101.74
reset_to_baseline_settings()
GAMMA = 0.94
cartpole(250)

Run: 1, exploration: 0.9752487531218751, score: 25 | Scores: (min: 25, avg: 25, max: 25)

Run: 2, exploration: 0.8390886103705794, score: 31 | Scores: (min: 25, avg: 28, max: 31)

Run: 3, exploration: 0.7705488893118823, score: 18 | Scores: (min: 18, avg: 24.666666666666668, max: 31)

Run: 4, exploration: 0.7219385759785162, score: 14 | Scores: (min: 14, avg: 22, max: 31)

Run: 5, exploration: 0.6730128848950395, score: 15 | Scores: (min: 14, avg: 20.6, max: 31)

Run: 6, exploration: 0.6211445383053219, score: 17 | Scores: (min: 14, avg: 20, max: 31)

Run: 7, exploration: 0.5937455908197752, score: 10 | Scores: (min: 10, avg: 18.571428571428573, max: 31)

Run: 8, exploration: 0.5507399854171277, score: 16 | Scores: (min: 10, avg: 18.25, max: 31)

Run: 9, exploration: 0.5238143793828016, score: 11 | Scores: (min: 10, avg: 17.444444444444443, max: 31)

####################
Run: 10, exploration: 0.4883155414435353, score: 15 | Scores: (min: 10, avg: 17.2, max: 31)

Run: 11, exploration: 0

Run: 82, exploration: 0.01, score: 98 | Scores: (min: 8, avg: 27.085365853658537, max: 114)

Run: 83, exploration: 0.01, score: 124 | Scores: (min: 8, avg: 28.253012048192772, max: 124)

Run: 84, exploration: 0.01, score: 125 | Scores: (min: 8, avg: 29.404761904761905, max: 125)

Run: 85, exploration: 0.01, score: 90 | Scores: (min: 8, avg: 30.11764705882353, max: 125)

Run: 86, exploration: 0.01, score: 151 | Scores: (min: 8, avg: 31.523255813953487, max: 151)

Run: 87, exploration: 0.01, score: 105 | Scores: (min: 8, avg: 32.367816091954026, max: 151)

Run: 88, exploration: 0.01, score: 225 | Scores: (min: 8, avg: 34.55681818181818, max: 225)

Run: 89, exploration: 0.01, score: 128 | Scores: (min: 8, avg: 35.60674157303371, max: 225)

####################
Run: 90, exploration: 0.01, score: 270 | Scores: (min: 8, avg: 38.21111111111111, max: 270)

Run: 91, exploration: 0.01, score: 243 | Scores: (min: 8, avg: 40.46153846153846, max: 270)

Run: 92, exploration: 0.01, score: 167 | Score

NameError: name 'exit' is not defined
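
One way to read the two discount-factor experiments above is through the effective horizon 1 / (1 - gamma), which is the sum of the geometric weights gamma^k placed on future rewards. This is a common rule of thumb rather than something the notebook computes, and `effective_horizon` is a hypothetical helper introduced only for this sketch.

```python
def effective_horizon(gamma):
    """Sum of the weights gamma**k over all future steps k = 0, 1, 2, ..."""
    return 1.0 / (1.0 - gamma)


def weight_at_step(gamma, k):
    """Weight applied to a reward received k steps in the future."""
    return gamma ** k


for gamma in (0.94, 0.95, 0.96):
    print(
        "gamma:", gamma,
        "| effective horizon:", round(effective_horizon(gamma), 2),
        "| weight on a reward 50 steps ahead:", round(weight_at_step(gamma, 50), 4),
    )
```

A change of 0.01 looks small, but it moves the effective horizon from about 20 steps at 0.95 to about 25 steps at 0.96 and about 17 steps at 0.94, so the agent weighs distant consequences noticeably differently.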

Next, increase the learning rate by 0.0002.

In [6]:
# Note: the baseline average at run# 250 was 101.74
reset_to_baseline_settings()
LEARNING_RATE = 0.0012
cartpole(250)

Run: 1, exploration: 1.0, score: 14 | Scores: (min: 14, avg: 14, max: 14)

Run: 2, exploration: 0.9322301194154049, score: 20 | Scores: (min: 14, avg: 17, max: 20)

Run: 3, exploration: 0.8734200960253871, score: 14 | Scores: (min: 14, avg: 16, max: 20)

Run: 4, exploration: 0.7861544476842928, score: 22 | Scores: (min: 14, avg: 17.5, max: 22)

Run: 5, exploration: 0.7147372386831305, score: 20 | Scores: (min: 14, avg: 18, max: 22)

Run: 6, exploration: 0.6832098777212641, score: 10 | Scores: (min: 10, avg: 16.666666666666668, max: 22)

Run: 7, exploration: 0.6498078359349755, score: 11 | Scores: (min: 10, avg: 15.857142857142858, max: 22)

Run: 8, exploration: 0.5937455908197752, score: 19 | Scores: (min: 10, avg: 16.25, max: 22)

Run: 9, exploration: 0.5618938591163328, score: 12 | Scores: (min: 10, avg: 15.777777777777779, max: 22)

####################
Run: 10, exploration: 0.5398075216808175, score: 9 | Scores: (min: 9, avg: 15.1, max: 22)

Run: 11, exploration: 0.5134164023722473

####################
Run: 80, exploration: 0.01, score: 72 | Scores: (min: 8, avg: 15.9, max: 78)

Run: 81, exploration: 0.01, score: 67 | Scores: (min: 8, avg: 16.530864197530864, max: 78)

Run: 82, exploration: 0.01, score: 64 | Scores: (min: 8, avg: 17.109756097560975, max: 78)

Run: 83, exploration: 0.01, score: 64 | Scores: (min: 8, avg: 17.674698795180724, max: 78)

Run: 84, exploration: 0.01, score: 60 | Scores: (min: 8, avg: 18.178571428571427, max: 78)

Run: 85, exploration: 0.01, score: 62 | Scores: (min: 8, avg: 18.694117647058825, max: 78)

Run: 86, exploration: 0.01, score: 88 | Scores: (min: 8, avg: 19.5, max: 88)

Run: 87, exploration: 0.01, score: 67 | Scores: (min: 8, avg: 20.04597701149425, max: 88)

Run: 88, exploration: 0.01, score: 107 | Scores: (min: 8, avg: 21.03409090909091, max: 107)

Run: 89, exploration: 0.01, score: 70 | Scores: (min: 8, avg: 21.584269662921347, max: 107)

####################
Run: 90, exploration: 0.01, score: 80 | Scores: (min: 8, avg: 22.

Run: 175, exploration: 0.01, score: 206 | Scores: (min: 26, avg: 196.04, max: 500)

Solved in 75 runs, 175 total runs.


NameError: name 'exit' is not defined

Next, decrease the learning rate by 0.0002.

In [7]:
# Note: the baseline average at run# 250 was 101.74
reset_to_baseline_settings()
LEARNING_RATE = 0.0008  
cartpole(250)

Run: 1, exploration: 1.0, score: 16 | Scores: (min: 16, avg: 16, max: 16)

Run: 2, exploration: 0.9229311239742362, score: 20 | Scores: (min: 16, avg: 18, max: 20)

Run: 3, exploration: 0.8866535105013078, score: 9 | Scores: (min: 9, avg: 15, max: 20)

Run: 4, exploration: 0.7666961448653229, score: 30 | Scores: (min: 9, avg: 18.75, max: 30)

Run: 5, exploration: 0.7219385759785162, score: 13 | Scores: (min: 9, avg: 17.6, max: 30)

Run: 6, exploration: 0.6763948591909945, score: 14 | Scores: (min: 9, avg: 17, max: 30)

Run: 7, exploration: 0.6465587967553006, score: 10 | Scores: (min: 9, avg: 16, max: 30)

Run: 8, exploration: 0.6057704364907278, score: 14 | Scores: (min: 9, avg: 15.75, max: 30)

Run: 9, exploration: 0.5618938591163328, score: 16 | Scores: (min: 9, avg: 15.777777777777779, max: 30)

####################
Run: 10, exploration: 0.531750826943791, score: 12 | Scores: (min: 9, avg: 15.4, max: 30)

Run: 11, exploration: 0.47622912292284103, score: 23 | Scores: (min: 9, avg: 

Run: 87, exploration: 0.01, score: 233 | Scores: (min: 9, avg: 134.183908045977, max: 332)

Run: 88, exploration: 0.01, score: 204 | Scores: (min: 9, avg: 134.97727272727272, max: 332)

Run: 89, exploration: 0.01, score: 136 | Scores: (min: 9, avg: 134.98876404494382, max: 332)

####################
Run: 90, exploration: 0.01, score: 482 | Scores: (min: 9, avg: 138.84444444444443, max: 482)

Run: 91, exploration: 0.01, score: 211 | Scores: (min: 9, avg: 139.63736263736263, max: 482)

Run: 92, exploration: 0.01, score: 191 | Scores: (min: 9, avg: 140.19565217391303, max: 482)

Run: 93, exploration: 0.01, score: 183 | Scores: (min: 9, avg: 140.65591397849462, max: 482)

Run: 94, exploration: 0.01, score: 227 | Scores: (min: 9, avg: 141.5744680851064, max: 482)

Run: 95, exploration: 0.01, score: 210 | Scores: (min: 9, avg: 142.29473684210527, max: 482)

Run: 96, exploration: 0.01, score: 251 | Scores: (min: 9, avg: 143.42708333333334, max: 482)

Run: 97, exploration: 0.01, score: 86 | Sc

NameError: name 'exit' is not defined

Next, we will change the exploration factor by increasing EXPLORATION_MAX by 0.2.

In [8]:
# Note: the baseline average at run# 250 was 101.74
reset_to_baseline_settings()
EXPLORATION_MAX = 1.2
cartpole(250)

Run: 1, exploration: 1.1702985037462499, score: 25 | Scores: (min: 25, avg: 25, max: 25)

Run: 2, exploration: 1.080104913447893, score: 17 | Scores: (min: 17, avg: 21, max: 25)

Run: 3, exploration: 1.0221622315204157, score: 12 | Scores: (min: 12, avg: 18, max: 25)

Run: 4, exploration: 0.9339750684823697, score: 19 | Scores: (min: 12, avg: 18.25, max: 25)

Run: 5, exploration: 0.8364559210025934, score: 23 | Scores: (min: 12, avg: 19.2, max: 25)

Run: 6, exploration: 0.7915838916528757, score: 12 | Scores: (min: 12, avg: 18, max: 25)

Run: 7, exploration: 0.75288345846457, score: 11 | Scores: (min: 11, avg: 17, max: 25)

Run: 8, exploration: 0.719673451664079, score: 10 | Scores: (min: 10, avg: 16.125, max: 25)

Run: 9, exploration: 0.6776609356176874, score: 13 | Scores: (min: 10, avg: 15.777777777777779, max: 25)

####################
Run: 10, exploration: 0.6349104873708861, score: 14 | Scores: (min: 10, avg: 15.6, max: 25)

Run: 11, exploration: 0.6008504474950233, score: 12 | S

Run: 85, exploration: 0.01, score: 130 | Scores: (min: 10, avg: 101.98823529411764, max: 500)

Run: 86, exploration: 0.01, score: 176 | Scores: (min: 10, avg: 102.84883720930233, max: 500)

Run: 87, exploration: 0.01, score: 174 | Scores: (min: 10, avg: 103.66666666666667, max: 500)

Run: 88, exploration: 0.01, score: 320 | Scores: (min: 10, avg: 106.125, max: 500)

Run: 89, exploration: 0.01, score: 327 | Scores: (min: 10, avg: 108.6067415730337, max: 500)

####################
Run: 90, exploration: 0.01, score: 152 | Scores: (min: 10, avg: 109.08888888888889, max: 500)

Run: 91, exploration: 0.01, score: 176 | Scores: (min: 10, avg: 109.82417582417582, max: 500)

Run: 92, exploration: 0.01, score: 275 | Scores: (min: 10, avg: 111.6195652173913, max: 500)

Run: 93, exploration: 0.01, score: 236 | Scores: (min: 10, avg: 112.95698924731182, max: 500)

Run: 94, exploration: 0.01, score: 239 | Scores: (min: 10, avg: 114.29787234042553, max: 500)

Run: 95, exploration: 0.01, score: 236 | S

NameError: name 'exit' is not defined
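
The effect of the starting exploration rate can be estimated without training. Because `act` explores whenever `np.random.rand() < self.exploration_rate`, and `np.random.rand()` returns values below 1, any rate of 1.0 or more means every action is random. A starting value of 1.2 therefore mainly delays the point at which the rate reaches EXPLORATION_MIN. The sketch below computes that delay in closed form. `steps_to_floor` is a hypothetical helper, and a "step" here means one call to `experience_replay` that actually applies the decay.

```python
import math

EXPLORATION_MIN = 0.01
EXPLORATION_DECAY = 0.995


def steps_to_floor(exploration_max, exploration_min=EXPLORATION_MIN, decay=EXPLORATION_DECAY):
    """Smallest n such that exploration_max * decay**n <= exploration_min."""
    return math.ceil(math.log(exploration_min / exploration_max) / math.log(decay))


for start in (0.8, 1.0, 1.2):
    print("EXPLORATION_MAX:", start, "| decay steps to reach the floor:", steps_to_floor(start))
```

Under these assumptions the three starting values differ by only a few dozen decay steps out of roughly nine hundred, so any difference in the logged scores may owe as much to run-to-run randomness as to the setting itself.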

Next, we will change the exploration factor by decreasing EXPLORATION_MAX by 0.2.

In [4]:
# Note: the baseline average at run# 250 was 101.74
reset_to_baseline_settings()
EXPLORATION_MAX = .8
cartpole(250)

Run: 1, exploration: 0.8, score: 14 | Scores: (min: 14, avg: 14, max: 14)

Run: 2, exploration: 0.6481259022523785, score: 48 | Scores: (min: 14, avg: 31, max: 48)

Run: 3, exploration: 0.6102901313127972, score: 13 | Scores: (min: 13, avg: 25, max: 48)

Run: 4, exploration: 0.5775508607828131, score: 12 | Scores: (min: 12, avg: 21.75, max: 48)

Run: 5, exploration: 0.5465679021770115, score: 12 | Scores: (min: 12, avg: 19.8, max: 48)

Run: 6, exploration: 0.5095270607151027, score: 15 | Scores: (min: 12, avg: 19, max: 48)

Run: 7, exploration: 0.47262149029254136, score: 16 | Scores: (min: 12, avg: 18.571428571428573, max: 48)

Run: 8, exploration: 0.43838902839203386, score: 16 | Scores: (min: 12, avg: 18.25, max: 48)

Run: 9, exploration: 0.40257986431827397, score: 18 | Scores: (min: 12, avg: 18.22222222222222, max: 48)

####################
Run: 10, exploration: 0.3848218967584385, score: 10 | Scores: (min: 10, avg: 17.4, max: 48)

Run: 11, exploration: 0.36969571923133676, score:

Run: 86, exploration: 0.01, score: 9 | Scores: (min: 9, avg: 175.13953488372093, max: 500)

Run: 87, exploration: 0.01, score: 14 | Scores: (min: 9, avg: 173.28735632183907, max: 500)

Run: 88, exploration: 0.01, score: 12 | Scores: (min: 9, avg: 171.45454545454547, max: 500)

Run: 89, exploration: 0.01, score: 194 | Scores: (min: 9, avg: 171.7078651685393, max: 500)

####################
Run: 90, exploration: 0.01, score: 177 | Scores: (min: 9, avg: 171.76666666666668, max: 500)

Run: 91, exploration: 0.01, score: 223 | Scores: (min: 9, avg: 172.32967032967034, max: 500)

Run: 92, exploration: 0.01, score: 203 | Scores: (min: 9, avg: 172.66304347826087, max: 500)

Run: 93, exploration: 0.01, score: 289 | Scores: (min: 9, avg: 173.91397849462365, max: 500)

Run: 94, exploration: 0.01, score: 239 | Scores: (min: 9, avg: 174.60638297872342, max: 500)

Run: 95, exploration: 0.01, score: 199 | Scores: (min: 9, avg: 174.86315789473684, max: 500)

Run: 96, exploration: 0.01, score: 191 | Sco

NameError: name 'exit' is not defined

Next, we will change the exploration factor by increasing EXPLORATION_MIN by 0.002.

In [5]:
# Note: the baseline average at run# 250 was 101.74
reset_to_baseline_settings()
EXPLORATION_MIN = 0.012
cartpole(250)

Run: 1, exploration: 0.8, score: 18 | Scores: (min: 18, avg: 18, max: 18)

Run: 2, exploration: 0.738344899179389, score: 18 | Scores: (min: 18, avg: 18, max: 18)

Run: 3, exploration: 0.6679145338549811, score: 21 | Scores: (min: 18, avg: 19, max: 21)

Run: 4, exploration: 0.6320839780376225, score: 12 | Scores: (min: 12, avg: 17.25, max: 21)

Run: 5, exploration: 0.5775508607828131, score: 19 | Scores: (min: 12, avg: 17.6, max: 21)

Run: 6, exploration: 0.5438350626661264, score: 13 | Scores: (min: 12, avg: 16.833333333333332, max: 21)

Run: 7, exploration: 0.5198462687479806, score: 10 | Scores: (min: 10, avg: 15.857142857142858, max: 21)

Run: 8, exploration: 0.43184601734465417, score: 38 | Scores: (min: 10, avg: 18.625, max: 38)

Run: 9, exploration: 0.40867945628830904, score: 12 | Scores: (min: 10, avg: 17.88888888888889, max: 38)

####################
Run: 10, exploration: 0.3848218967584385, score: 13 | Scores: (min: 10, avg: 17.4, max: 38)

Run: 11, exploration: 0.3498632370

Run: 86, exploration: 0.012, score: 132 | Scores: (min: 10, avg: 114.70930232558139, max: 324)

Run: 87, exploration: 0.012, score: 153 | Scores: (min: 10, avg: 115.14942528735632, max: 324)

Run: 88, exploration: 0.012, score: 250 | Scores: (min: 10, avg: 116.68181818181819, max: 324)

Run: 89, exploration: 0.012, score: 147 | Scores: (min: 10, avg: 117.02247191011236, max: 324)

####################
Run: 90, exploration: 0.012, score: 169 | Scores: (min: 10, avg: 117.6, max: 324)

Run: 91, exploration: 0.012, score: 154 | Scores: (min: 10, avg: 118, max: 324)

Run: 92, exploration: 0.012, score: 167 | Scores: (min: 10, avg: 118.53260869565217, max: 324)

Run: 93, exploration: 0.012, score: 166 | Scores: (min: 10, avg: 119.04301075268818, max: 324)

Run: 94, exploration: 0.012, score: 196 | Scores: (min: 10, avg: 119.86170212765957, max: 324)

Run: 95, exploration: 0.012, score: 94 | Scores: (min: 10, avg: 119.58947368421053, max: 324)

Run: 96, exploration: 0.012, score: 150 | Scores

####################
Run: 180, exploration: 0.012, score: 9 | Scores: (min: 8, avg: 136.05, max: 308)

Run: 181, exploration: 0.012, score: 10 | Scores: (min: 8, avg: 134.34, max: 308)

Run: 182, exploration: 0.012, score: 9 | Scores: (min: 8, avg: 133.8, max: 308)

Run: 183, exploration: 0.012, score: 8 | Scores: (min: 8, avg: 131.37, max: 308)

Run: 184, exploration: 0.012, score: 10 | Scores: (min: 8, avg: 129.74, max: 308)

Run: 185, exploration: 0.012, score: 9 | Scores: (min: 8, avg: 127.82, max: 308)

Run: 186, exploration: 0.012, score: 10 | Scores: (min: 8, avg: 126.6, max: 308)

Run: 187, exploration: 0.012, score: 9 | Scores: (min: 8, avg: 125.16, max: 308)

Run: 188, exploration: 0.012, score: 8 | Scores: (min: 8, avg: 122.74, max: 308)

Run: 189, exploration: 0.012, score: 10 | Scores: (min: 8, avg: 121.37, max: 308)

####################
Run: 190, exploration: 0.012, score: 9 | Scores: (min: 8, avg: 119.77, max: 308)

Run: 191, exploration: 0.012, score: 10 | Scores: (min

Next, we will change the exploration factor by decreasing EXPLORATION_MIN by 0.002, from 0.01 to 0.008.

In [6]:
# Note: the baseline average at run# 250 was 101.74
reset_to_baseline_settings()
EXPLORATION_MIN = 0.008
cartpole(250)

Run: 1, exploration: 0.7801990024975001, score: 25 | Scores: (min: 25, avg: 25, max: 25)

Run: 2, exploration: 0.7309799088100746, score: 14 | Scores: (min: 14, avg: 19.5, max: 25)

Run: 3, exploration: 0.6780342802418761, score: 16 | Scores: (min: 14, avg: 18.333333333333332, max: 25)

Run: 4, exploration: 0.6226500456549138, score: 18 | Scores: (min: 14, avg: 18.25, max: 25)

Run: 5, exploration: 0.5689308419917721, score: 19 | Scores: (min: 14, avg: 18.4, max: 25)

Run: 6, exploration: 0.48461634919258256, score: 33 | Scores: (min: 14, avg: 20.833333333333332, max: 33)

Run: 7, exploration: 0.4655675554722388, score: 9 | Scores: (min: 9, avg: 19.142857142857142, max: 33)

Run: 8, exploration: 0.4405919883337024, score: 12 | Scores: (min: 9, avg: 18.25, max: 33)

Run: 9, exploration: 0.4211572899560216, score: 10 | Scores: (min: 9, avg: 17.333333333333332, max: 33)

####################
Run: 10, exploration: 0.39856413017169917, score: 12 | Scores: (min: 9, avg: 16.8, max: 33)

Run: 

Run: 85, exploration: 0.008, score: 164 | Scores: (min: 8, avg: 94.29411764705883, max: 326)

Run: 86, exploration: 0.008, score: 332 | Scores: (min: 8, avg: 97.05813953488372, max: 332)

Run: 87, exploration: 0.008, score: 180 | Scores: (min: 8, avg: 98.01149425287356, max: 332)

Run: 88, exploration: 0.008, score: 147 | Scores: (min: 8, avg: 98.56818181818181, max: 332)

Run: 89, exploration: 0.008, score: 207 | Scores: (min: 8, avg: 99.78651685393258, max: 332)

####################
Run: 90, exploration: 0.008, score: 177 | Scores: (min: 8, avg: 100.64444444444445, max: 332)

Run: 91, exploration: 0.008, score: 137 | Scores: (min: 8, avg: 101.04395604395604, max: 332)

Run: 92, exploration: 0.008, score: 182 | Scores: (min: 8, avg: 101.92391304347827, max: 332)

Run: 93, exploration: 0.008, score: 150 | Scores: (min: 8, avg: 102.44086021505376, max: 332)

Run: 94, exploration: 0.008, score: 176 | Scores: (min: 8, avg: 103.22340425531915, max: 332)

Run: 95, exploration: 0.008, score

Run: 178, exploration: 0.008, score: 163 | Scores: (min: 13, avg: 170.62, max: 332)

Run: 179, exploration: 0.008, score: 104 | Scores: (min: 13, avg: 169.37, max: 332)

####################
Run: 180, exploration: 0.008, score: 123 | Scores: (min: 13, avg: 168.92, max: 332)

Run: 181, exploration: 0.008, score: 209 | Scores: (min: 13, avg: 168.75, max: 332)

Run: 182, exploration: 0.008, score: 119 | Scores: (min: 13, avg: 167.84, max: 332)

Run: 183, exploration: 0.008, score: 171 | Scores: (min: 13, avg: 167.84, max: 332)

Run: 184, exploration: 0.008, score: 143 | Scores: (min: 13, avg: 167.62, max: 332)

Run: 185, exploration: 0.008, score: 186 | Scores: (min: 13, avg: 167.84, max: 332)

Run: 186, exploration: 0.008, score: 108 | Scores: (min: 13, avg: 165.6, max: 290)

Run: 187, exploration: 0.008, score: 169 | Scores: (min: 13, avg: 165.49, max: 290)

Run: 188, exploration: 0.008, score: 129 | Scores: (min: 13, avg: 165.31, max: 290)

Run: 189, exploration: 0.008, score: 120 | Sc

Next, we will change the exploration factor by increasing EXPLORATION_DECAY by 0.002, from 0.995 to 0.997.

In [7]:
# Note: the baseline average at run# 250 was 101.74
reset_to_baseline_settings()
EXPLORATION_DECAY = 0.997 
cartpole(250)

Run: 1, exploration: 0.8, score: 14 | Scores: (min: 14, avg: 14, max: 14)

Run: 2, exploration: 0.7310460524835921, score: 36 | Scores: (min: 14, avg: 25, max: 36)

Run: 3, exploration: 0.6842868893850883, score: 23 | Scores: (min: 14, avg: 24.333333333333332, max: 36)

Run: 4, exploration: 0.6640332134021367, score: 11 | Scores: (min: 11, avg: 21, max: 36)

Run: 5, exploration: 0.6424458732078172, score: 12 | Scores: (min: 11, avg: 19.2, max: 36)

Run: 6, exploration: 0.5995496929668116, score: 24 | Scores: (min: 11, avg: 20, max: 36)

Run: 7, exploration: 0.5800586751618989, score: 12 | Scores: (min: 11, avg: 18.857142857142858, max: 36)

Run: 8, exploration: 0.5578391422208724, score: 14 | Scores: (min: 11, avg: 18.25, max: 36)

Run: 9, exploration: 0.5429569657600744, score: 10 | Scores: (min: 10, avg: 17.333333333333332, max: 36)

####################
Run: 10, exploration: 0.5268864043803683, score: 11 | Scores: (min: 10, avg: 16.7, max: 36)

Run: 11, exploration: 0.50822835566609

Run: 85, exploration: 0.008, score: 183 | Scores: (min: 9, avg: 211.3294117647059, max: 500)

Run: 86, exploration: 0.008, score: 109 | Scores: (min: 9, avg: 210.13953488372093, max: 500)

Run: 87, exploration: 0.008, score: 35 | Scores: (min: 9, avg: 208.1264367816092, max: 500)

Run: 88, exploration: 0.008, score: 12 | Scores: (min: 9, avg: 205.89772727272728, max: 500)

Run: 89, exploration: 0.008, score: 167 | Scores: (min: 9, avg: 205.46067415730337, max: 500)

####################
Run: 90, exploration: 0.008, score: 386 | Scores: (min: 9, avg: 207.46666666666667, max: 500)

Run: 91, exploration: 0.008, score: 322 | Scores: (min: 9, avg: 208.72527472527472, max: 500)

Run: 92, exploration: 0.008, score: 179 | Scores: (min: 9, avg: 208.40217391304347, max: 500)

Run: 93, exploration: 0.008, score: 351 | Scores: (min: 9, avg: 209.93548387096774, max: 500)

Run: 94, exploration: 0.008, score: 500 | Scores: (min: 9, avg: 213.0212765957447, max: 500)

Run: 95, exploration: 0.008, score

NameError: name 'exit' is not defined

Finally, we will change the exploration factor by decreasing EXPLORATION_DECAY by 0.002, from 0.995 to 0.993.

In [8]:
# Note: the baseline average at run# 250 was 101.74
reset_to_baseline_settings()
EXPLORATION_DECAY = 0.993
cartpole(250)

Run: 1, exploration: 0.7250692144831015, score: 34 | Scores: (min: 34, avg: 34, max: 34)

Run: 2, exploration: 0.5670262481879318, score: 36 | Scores: (min: 34, avg: 35, max: 36)

Run: 3, exploration: 0.5360399416300314, score: 9 | Scores: (min: 9, avg: 26.333333333333332, max: 36)

Run: 4, exploration: 0.506746945032946, score: 9 | Scores: (min: 9, avg: 22, max: 36)

Run: 5, exploration: 0.47905472401804494, score: 9 | Scores: (min: 9, avg: 19.4, max: 36)

Run: 6, exploration: 0.41626519225154546, score: 21 | Scores: (min: 9, avg: 19.666666666666668, max: 36)

Run: 7, exploration: 0.3720129626646657, score: 17 | Scores: (min: 9, avg: 19.285714285714285, max: 36)

Run: 8, exploration: 0.34922177020490575, score: 10 | Scores: (min: 9, avg: 18.125, max: 36)

Run: 9, exploration: 0.3232533563698685, score: 12 | Scores: (min: 9, avg: 17.444444444444443, max: 36)

####################
Run: 10, exploration: 0.2992159748319074, score: 12 | Scores: (min: 9, avg: 16.9, max: 36)

Run: 11, explor

Run: 81, exploration: 0.008, score: 17 | Scores: (min: 8, avg: 11.938271604938272, max: 36)

Run: 82, exploration: 0.008, score: 17 | Scores: (min: 8, avg: 12, max: 36)

Run: 83, exploration: 0.008, score: 25 | Scores: (min: 8, avg: 12.156626506024097, max: 36)

Run: 84, exploration: 0.008, score: 29 | Scores: (min: 8, avg: 12.357142857142858, max: 36)

Run: 85, exploration: 0.008, score: 31 | Scores: (min: 8, avg: 12.576470588235294, max: 36)

Run: 86, exploration: 0.008, score: 29 | Scores: (min: 8, avg: 12.767441860465116, max: 36)

Run: 87, exploration: 0.008, score: 41 | Scores: (min: 8, avg: 13.091954022988507, max: 41)

Run: 88, exploration: 0.008, score: 37 | Scores: (min: 8, avg: 13.363636363636363, max: 41)

Run: 89, exploration: 0.008, score: 50 | Scores: (min: 8, avg: 13.775280898876405, max: 50)

####################
Run: 90, exploration: 0.008, score: 53 | Scores: (min: 8, avg: 14.21111111111111, max: 53)

Run: 91, exploration: 0.008, score: 68 | Scores: (min: 8, avg: 14.

Run: 177, exploration: 0.008, score: 213 | Scores: (min: 15, avg: 107.74, max: 407)

Run: 178, exploration: 0.008, score: 253 | Scores: (min: 15, avg: 110.06, max: 407)

Run: 179, exploration: 0.008, score: 251 | Scores: (min: 17, avg: 112.42, max: 407)

####################
Run: 180, exploration: 0.008, score: 258 | Scores: (min: 17, avg: 114.78, max: 407)

Run: 181, exploration: 0.008, score: 199 | Scores: (min: 17, avg: 116.6, max: 407)

Run: 182, exploration: 0.008, score: 355 | Scores: (min: 17, avg: 119.98, max: 407)

Run: 183, exploration: 0.008, score: 259 | Scores: (min: 17, avg: 122.32, max: 407)

Run: 184, exploration: 0.008, score: 182 | Scores: (min: 17, avg: 123.85, max: 407)

Run: 185, exploration: 0.008, score: 186 | Scores: (min: 17, avg: 125.4, max: 407)

Run: 186, exploration: 0.008, score: 37 | Scores: (min: 17, avg: 125.48, max: 407)

Run: 187, exploration: 0.008, score: 147 | Scores: (min: 17, avg: 126.54, max: 407)

Run: 188, exploration: 0.008, score: 223 | Scor

NameError: name 'exit' is not defined

# Thoughts about results:

For each variable change, we had a sample size of one. This is largely due to constraints on computational power and access to the computer lab. As such, these results are not conclusive.

We took 826 runs to solve the scenario with default values; at run #249, the average score was slightly above 100. The results for each modified value were as follows:

- Gamma: increasing it slightly resulted in an average score of 194 by run 249; decreasing it slightly solved the problem in 158 runs.
- Learning rate: increasing it slightly solved the problem in 175 runs; decreasing it solved the problem in 121 runs.
- Exploration max: increasing it solved the problem in 147 runs; decreasing it solved the problem in 110 runs.
- Exploration min: increasing it gave an average score of 77 at run 249; decreasing it gave an average of 88.
- Exploration decay: increasing it solved the problem in 100 runs; decreasing it solved the problem in 235 runs.

Overall, adjusting gamma in either direction resulted in improvements, with decreasing the value yielding better performance. Adjusting the learning rate in either direction also resulted in an improvement, with decreasing it giving the better performance. Modifying exploration max in either direction improved our score, and again decreasing it gave the better performance. Changing exploration min harmed our performance in both directions. Changing exploration decay improved our performance in either direction, and increasing it resulted in the quickest solve of all the tests.
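One way to reason about the effect of exploration decay is to estimate how many decay steps it takes for the exploration rate to fall from its maximum to its minimum. The sketch below assumes the rate is multiplied by the decay once per step and clamped at the minimum; that update rule is not shown in this section, and `steps_to_min` is a hypothetical helper rather than part of the notebook.

```python
import math


def steps_to_min(exploration_max, exploration_min, decay):
    """Number of multiplicative decay steps until the rate reaches the minimum.

    Assumes the update rule rate = max(exploration_min, rate * decay).
    """
    if not 0 < decay < 1:
        raise ValueError("decay must be strictly between 0 and 1")
    if exploration_min >= exploration_max:
        return 0
    # Solve exploration_max * decay**n <= exploration_min for the smallest integer n.
    return math.ceil(math.log(exploration_min / exploration_max) / math.log(decay))


# Compare the three decay values tested above, using the baseline max and min.
for decay in (0.993, 0.995, 0.997):
    print(decay, steps_to_min(1.0, 0.01, decay))
```

Under this assumption, a larger decay value keeps the agent exploring for more steps, so the three settings differ in how long random actions dominate early training.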

It could be that the performance changes were a result of the low sample size. It could also be that these values need to be in certain ratios relative to each other, or that we could produce a much quicker solve by combining what has worked so far:



In [None]:
# Note: the baseline average at run# 250 was 101.74
reset_to_baseline_settings()
GAMMA = 0.94
LEARNING_RATE = 0.0008  
EXPLORATION_MAX = .8
EXPLORATION_DECAY = 0.997 
cartpole(250)

Run: 1, exploration: 0.8, score: 11 | Scores: (min: 11, avg: 11, max: 11)

Run: 2, exploration: 0.7857075689708344, score: 15 | Scores: (min: 11, avg: 13, max: 15)

Run: 3, exploration: 0.7556105185200527, score: 14 | Scores: (min: 11, avg: 13.333333333333334, max: 15)

Run: 4, exploration: 0.7244863565164134, score: 15 | Scores: (min: 11, avg: 13.75, max: 15)

Run: 5, exploration: 0.7009337191955577, score: 12 | Scores: (min: 11, avg: 13.4, max: 15)

Run: 6, exploration: 0.6740839873838664, score: 14 | Scores: (min: 11, avg: 13.5, max: 15)

Run: 7, exploration: 0.6405185355881937, score: 18 | Scores: (min: 11, avg: 14.142857142857142, max: 18)

Run: 8, exploration: 0.6104558150072419, score: 17 | Scores: (min: 11, avg: 14.5, max: 18)

Run: 9, exploration: 0.509757628551752, score: 61 | Scores: (min: 11, avg: 19.666666666666668, max: 61)

####################
Run: 10, exploration: 0.4742938935465297, score: 25 | Scores: (min: 11, avg: 20.2, max: 61)

Run: 11, exploration: 0.40447429223

Run: 85, exploration: 0.01, score: 231 | Scores: (min: 11, avg: 153.9294117647059, max: 333)

Run: 86, exploration: 0.01, score: 293 | Scores: (min: 11, avg: 155.54651162790697, max: 333)

Run: 87, exploration: 0.01, score: 204 | Scores: (min: 11, avg: 156.10344827586206, max: 333)

Run: 88, exploration: 0.01, score: 186 | Scores: (min: 11, avg: 156.4431818181818, max: 333)

Run: 89, exploration: 0.01, score: 165 | Scores: (min: 11, avg: 156.53932584269663, max: 333)

####################
Run: 90, exploration: 0.01, score: 177 | Scores: (min: 11, avg: 156.76666666666668, max: 333)

Run: 91, exploration: 0.01, score: 147 | Scores: (min: 11, avg: 156.65934065934067, max: 333)

Run: 92, exploration: 0.01, score: 131 | Scores: (min: 11, avg: 156.3804347826087, max: 333)

Run: 93, exploration: 0.01, score: 149 | Scores: (min: 11, avg: 156.30107526881721, max: 333)

Run: 94, exploration: 0.01, score: 229 | Scores: (min: 11, avg: 157.0744680851064, max: 333)

Run: 95, exploration: 0.01, score

Run: 179, exploration: 0.01, score: 144 | Scores: (min: 14, avg: 188.53, max: 449)

####################
Run: 180, exploration: 0.01, score: 166 | Scores: (min: 14, avg: 187.58, max: 449)

Run: 181, exploration: 0.01, score: 185 | Scores: (min: 14, avg: 187.32, max: 449)

Run: 182, exploration: 0.01, score: 267 | Scores: (min: 14, avg: 188.32, max: 449)

Run: 183, exploration: 0.01, score: 151 | Scores: (min: 14, avg: 188.12, max: 449)

Run: 184, exploration: 0.01, score: 194 | Scores: (min: 14, avg: 187.67, max: 449)



# Final thoughts about modifying the values



The goal of the agent is to balance a pole on a cart (Surma, 2019). The pole is subject to simulated gravity and is attached to the cart at a pivot point, so the setup can also be described as an inverted pendulum. The cart must keep the pole upright without letting it fall over; in some versions of the scenario, the goal is to keep the pole within a certain range of angles. The simulated environment in this scenario was created by OpenAI. The pole can sit at various angles relative to the cart, and the cart can be displaced along a line; in some versions of the problem, the cart's movement along the x-axis is limited to a finite range. The cart can move left or right, and moving the cart changes the angle of the pole, which is how the cart counteracts the simulated gravity.


In our scenario, the agent is a neural network which learns from experience by making attempts, determining which strategies were successful, and continuing successful behaviors (Surma, 2019). In this case, the agent is trained by reinforcement learning, and the problem is modeled as a Markov decision process. Each step yields a reward, and the agent also estimates how favorable the resulting state is. This estimate of future reward is multiplied by the "GAMMA" value, which controls how much the predicted future reward counts relative to the immediate reward; we will discuss the "GAMMA" value later. The agent is then updated with the corresponding reward and incorporates this data. The agent will try to select the action which it predicts will lead to the state that maximizes the reward. If the pole falls down, the run is determined to be terminal. Information is printed out about the run, then the process is repeated.


Q-learning refers to a model-free, value-based algorithm to find the best series of actions (Awan, 2022). The “Q” in Q-learning stands for quality, which is a measure of how much a specific action will maximize future reward. In our scenario, we remember each state which our agent has experienced (Surma, 2019). A Q-table is formed, which is a collection of the state-action pairs and associated Q-values (Luu, 2024). After an action, we perform an experience replay, which updates the Q-table with the maximum predicted reward for an action in a specific state. It is important to note that we will progressively discount future predicted rewards.
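The tabular update described above can be sketched as follows. This is an illustration only: the notebook itself uses a neural network rather than a table, and the names `q_table`, `q_update`, and the step size `ALPHA` are assumptions introduced for this example.

```python
from collections import defaultdict

GAMMA = 0.95  # discount factor (the notebook's baseline value)
ALPHA = 0.1   # step size; an assumed value for illustration

# Q-table: maps (state, action) pairs to estimated Q-values, defaulting to 0.
q_table = defaultdict(float)


def q_update(state, action, reward, next_state, terminal, n_actions=2):
    """Apply one Q-learning update for a single observed transition."""
    target = reward
    if not terminal:
        # Best value the table currently predicts for the next state.
        best_next = max(q_table[(next_state, a)] for a in range(n_actions))
        target = reward + GAMMA * best_next
    # Move the stored value a fraction ALPHA of the way toward the target.
    q_table[(state, action)] += ALPHA * (target - q_table[(state, action)])
    return q_table[(state, action)]


# One update on an empty table: the target is 1.0, so the entry moves to 0.1.
q_update("s0", 1, 1.0, "s1", terminal=False)
```

The state labels `"s0"` and `"s1"` are placeholders; a real table for this problem would need some way of turning the environment's observations into hashable keys.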


In our scenario, we are not using standard Q-learning but rather deep Q-learning, which replaces the Q-table with a neural network. The neural network is used to predict the Q-values of actions based on a given state. Instead of recording state-action pairs and associated Q-values, the neural network attempts to minimize the difference between its predicted Q-value for a given state and the observed outcome (Luu, 2024).
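The training target that the network is pulled toward can be sketched with NumPy alone. The `q_update` expression mirrors the one in the notebook's `experience_replay` method; the helper `dqn_target` and the example numbers are hypothetical, and how the notebook feeds the target back into the model is not shown here.

```python
import numpy as np

GAMMA = 0.95  # discount factor (the notebook's baseline value)


def dqn_target(q_current, action, reward, q_next, terminal):
    """Return the target Q-value vector for one transition.

    q_current: predicted Q-values for the current state, one per action
    q_next:    predicted Q-values for the next state, one per action
    """
    target = np.array(q_current, dtype=float)  # copy, so the input is untouched
    q_update = reward if terminal else reward + GAMMA * np.max(q_next)
    target[action] = q_update  # only the value of the action taken changes
    return target


# The agent took action 1, received reward 1.0, and the episode continued.
t = dqn_target([0.5, 0.2], action=1, reward=1.0, q_next=[2.0, 3.0], terminal=False)
```

Here the entry for action 1 becomes 1.0 + 0.95 * 3.0, while the entry for action 0 is left at the network's own prediction, so the squared error for the untaken action is zero.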


We discount future rewards in order to give more weight to immediate rewards, so that we do not take an especially risky action merely because we anticipate a high payoff (Chanda, 2024). This discounting factor is one of the variables which we can fine-tune in order to prevent overly cautious behavior while also preventing destructively risky behavior.
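As a small worked example of discounting, the sketch below computes the discounted sum of a sequence of rewards, where a reward received k steps in the future is weighted by gamma to the power k. The function `discounted_return` is a hypothetical helper, not part of the notebook.

```python
GAMMA = 0.95  # discount factor (the notebook's baseline value)


def discounted_return(rewards, gamma=GAMMA):
    """Sum of rewards where the reward k steps ahead is weighted by gamma**k."""
    total = 0.0
    # Work backwards so each step adds its reward to the discounted remainder.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Three steps of reward 1 are worth 1 + 0.95 + 0.95**2 in total.
three_step_value = discounted_return([1, 1, 1])

# Weight applied to a reward that arrives 20 steps in the future.
weight_after_20_steps = GAMMA ** 20
```

With a gamma closer to 1, distant rewards keep more of their weight; with a smaller gamma, the agent effectively looks fewer steps ahead.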


In our scenario, we are using deep Q-learning, which applies Q-learning principles to a neural network. As with other neural networks, we have an input layer which takes in the values that we can observe from the current state, we have processing layers, and we have an output layer with one output for each action we can take (in this case, two: left or right) (Halthor, 2023). Whenever we apply an experience replay, we adjust the weights of the neural network so that its predicted Q-values move closer to the Q-values we observed.
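The layer structure can be illustrated with a plain NumPy forward pass that follows the same shape as the notebook's Keras model (two hidden layers of 24 ReLU units and a linear output). This is a sketch under assumptions: the weights are random stand-ins for a trained model, the observation is assumed to have four components, and `predict_q` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(0)
OBSERVATION_SIZE = 4  # assumed number of observed values per state
N_ACTIONS = 2         # left or right

# Randomly initialised weights standing in for a trained model.
layer_sizes = [OBSERVATION_SIZE, 24, 24, N_ACTIONS]
weights = [rng.normal(0.0, 0.1, size=(m, n))
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]


def predict_q(state):
    """Forward pass: two ReLU hidden layers, then a linear output layer."""
    x = np.asarray(state, dtype=float)
    for w, b in zip(weights[:-1], biases[:-1]):
        x = np.maximum(0.0, x @ w + b)
    return x @ weights[-1] + biases[-1]


q_values = predict_q([0.0, 0.1, 0.02, -0.1])  # one Q-value per action
action = int(np.argmax(q_values))              # greedy action choice
n_params = sum(w.size + b.size for w, b in zip(weights, biases))
```

Under these assumptions the network has 770 trainable values in total (120 + 600 + 50 across the three layers), which are the values that experience replay adjusts.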


One of the issues with standard Q-learning is that it becomes challenging with large numbers of possible states (Luu, 2024). In this scenario, this problem is not quite as pronounced because we have a closed set of potential states; the pole can only have a finite number of simulated angles and the cart has only a finite set of potential positions. One consequence of the deep Q-learning approach is that the neural network does not always record the most accurate action for a given scenario, which could negatively affect the performance of the algorithm. Training time for both standard Q-learning and deep Q-learning would depend on the number of possible states. For standard Q-learning to be effective, you would need to ensure a high level of coverage of all potential scenarios. Deep Q-learning may be able to be trained effectively with a subset of potential scenarios; however, because of the aforementioned accuracy problems, the overall training time may be higher due to the need to refine the accuracy.
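To give a sense of how quickly a Q-table grows, the sketch below counts the entries a table would need if each observed value were divided into a fixed number of bins. The binning scheme, the assumption of four observed values, and the helper `q_table_entries` are all illustrative and do not come from the notebook.

```python
def q_table_entries(bins_per_dimension, n_dimensions=4, n_actions=2):
    """Number of (state, action) entries for a uniformly binned state space."""
    n_states = bins_per_dimension ** n_dimensions
    return n_states * n_actions


# Table size for a coarse, a medium, and a fine discretization.
for bins in (10, 50, 100):
    print(bins, q_table_entries(bins))
```

Each additional bin per dimension multiplies the number of states, which is why good coverage of every state becomes harder as the discretization gets finer.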


In our tests, we observed that the starting values produced very low performance, in contradiction with the performance reported by the documentation. While the documentation claimed to solve the problem in an average of 131 runs, we observed that the agent was sufficiently trained only after 826 runs. This indicates that there is a high level of variability across potential outcomes and that a larger sample size is needed in order to detect changes in performance. The fact that nearly all of our modified runs resulted in better performance points to the same variability. For this reason, even though increasing the learning rate performed worse than decreasing it, I do not believe that we can draw any firm conclusions.


My current hypothesis is that there exist multiple potential values and strategies for solving this problem quickly, although one might tend to be more mathematically reliable than the others. I was unable to reproduce the working environment on my local machine, so I am unable to increase the computational power or bypass the time constraints imposed by the virtual lab environment.


# Works Cited:
Awan, A. A. (2022, October 27). An introduction to Q-learning: A tutorial for Beginners. DataCamp. https://www.datacamp.com/tutorial/introduction-q-learning-beginner-tutorial
 

Chanda, K. (2024, August 26). Q-learning. GeeksforGeeks. https://www.geeksforgeeks.org/q-learning-in-python/ 


Halthor, A. (2023, November 28). Deep Q-Networks Explained!. YouTube. https://www.youtube.com/watch?v=x83WmvbRa2I 


Luu, Q. T. (2024, March 18). Q-learning vs. deep Q-learning vs. Deep Q-Network. Baeldung on Computer Science. https://www.baeldung.com/cs/q-learning-vs-deep-q-learning-vs-deep-q-network 


Surma, G. (2019, November 10). Cartpole - introduction to reinforcement learning (DQN - deep Q-learning). Medium. https://gsurma.medium.com/cartpole-introduction-to-reinforcement-learning-ed0eb5b58288#f94f 

