# Module Five Assignment: Cartpole Problem
Review the code in this notebook and in the score_logger.py file in the *scores* folder (directory). Once you have reviewed the code, return to this notebook and select **Cell** and then **Run All** from the menu bar to run this code. The code takes several minutes to run.

In [2]:
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
  
  
from scores.score_logger import ScoreLogger  
  
ENV_NAME = "CartPole-v1"  
  
GAMMA = 0.95  
LEARNING_RATE = 0.001  
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
  
EXPLORATION_MAX = 1.0  
EXPLORATION_MIN = 0.01  
EXPLORATION_DECAY = 0.995  
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            if step > 100:
                return
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print ("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))  
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  



In [3]:
cartpole()

Run: 1, exploration: 1.0, score: 11
Scores: (min: 11, avg: 11, max: 11)

Run: 2, exploration: 0.8911090557802088, score: 32
Scores: (min: 11, avg: 21.5, max: 32)

Run: 3, exploration: 0.7940753492934954, score: 24
Scores: (min: 11, avg: 22.333333333333332, max: 32)

Run: 4, exploration: 0.6498078359349755, score: 41
Scores: (min: 11, avg: 27, max: 41)

Run: 5, exploration: 0.6057704364907278, score: 15
Scores: (min: 11, avg: 24.6, max: 41)

Run: 6, exploration: 0.5618938591163328, score: 16
Scores: (min: 11, avg: 23.166666666666668, max: 41)

Run: 7, exploration: 0.46211964903917074, score: 40
Scores: (min: 11, avg: 25.571428571428573, max: 41)

Run: 8, exploration: 0.42650460709830135, score: 17
Scores: (min: 11, avg: 24.5, max: 41)

Run: 9, exploration: 0.40769130904675194, score: 10
Scores: (min: 10, avg: 22.88888888888889, max: 41)

Run: 10, exploration: 0.3858205374665315, score: 12
Scores: (min: 10, avg: 21.8, max: 41)

Run: 11, exploration: 0.37251769488706843, score: 8
Scores: 

Run: 83, exploration: 0.01, score: 36
Scores: (min: 8, avg: 14.096385542168674, max: 61)

Run: 84, exploration: 0.01, score: 53
Scores: (min: 8, avg: 14.55952380952381, max: 61)

Run: 85, exploration: 0.01, score: 43
Scores: (min: 8, avg: 14.894117647058824, max: 61)

Run: 86, exploration: 0.01, score: 21
Scores: (min: 8, avg: 14.965116279069768, max: 61)

Run: 87, exploration: 0.01, score: 84
Scores: (min: 8, avg: 15.758620689655173, max: 84)

Run: 88, exploration: 0.01, score: 22
Scores: (min: 8, avg: 15.829545454545455, max: 84)

Run: 89, exploration: 0.01, score: 58
Scores: (min: 8, avg: 16.303370786516854, max: 84)

Run: 90, exploration: 0.01, score: 39
Scores: (min: 8, avg: 16.555555555555557, max: 84)

Run: 91, exploration: 0.01, score: 64
Scores: (min: 8, avg: 17.076923076923077, max: 84)

Run: 92, exploration: 0.01, score: 20
Scores: (min: 8, avg: 17.108695652173914, max: 84)

Run: 93, exploration: 0.01, score: 18
Scores: (min: 8, avg: 17.118279569892472, max: 84)

Run: 94, ex

Run: 184, exploration: 0.01, score: 160
Scores: (min: 18, avg: 144.66, max: 500)

Run: 185, exploration: 0.01, score: 196
Scores: (min: 18, avg: 146.19, max: 500)

Run: 186, exploration: 0.01, score: 267
Scores: (min: 18, avg: 148.65, max: 500)

Run: 187, exploration: 0.01, score: 255
Scores: (min: 18, avg: 150.36, max: 500)

Run: 188, exploration: 0.01, score: 228
Scores: (min: 18, avg: 152.42, max: 500)

Run: 189, exploration: 0.01, score: 155
Scores: (min: 18, avg: 153.39, max: 500)

Run: 190, exploration: 0.01, score: 220
Scores: (min: 18, avg: 155.2, max: 500)

Run: 191, exploration: 0.01, score: 311
Scores: (min: 18, avg: 157.67, max: 500)

Run: 192, exploration: 0.01, score: 289
Scores: (min: 18, avg: 160.36, max: 500)

Run: 193, exploration: 0.01, score: 220
Scores: (min: 40, avg: 162.38, max: 500)

Run: 194, exploration: 0.01, score: 179
Scores: (min: 42, avg: 163.77, max: 500)

Run: 195, exploration: 0.01, score: 288
Scores: (min: 42, avg: 166.17, max: 500)

Run: 196, explora

NameError: name 'exit' is not defined

Note: If the code is running properly, you should begin to see output appearing above this code block. It will take several minutes, so it is recommended that you let this code run in the background while completing other work. When the code has finished, it will print output saying, "Solved in _ runs, _ total runs."

You may see an error about not having an exit command. This error does not affect the program's functionality and results from the steps taken to convert the code from Python 2.x to Python 3. Please disregard this error.

In [13]:
# Reduce GAMMA
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
  
  
from scores.score_logger import ScoreLogger  
  
ENV_NAME = "CartPole-v1"  
  
GAMMA = 0.55  
LEARNING_RATE = 0.001  
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
  
EXPLORATION_MAX = 1.0  
EXPLORATION_MIN = 0.01  
EXPLORATION_DECAY = 0.995  
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:
            if step > 100:
                return
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print ("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))  
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  

cartpole()

Run: 1, exploration: 1.0, score: 19
Scores: (min: 19, avg: 19, max: 19)

Run: 2, exploration: 0.9091562615825302, score: 20
Scores: (min: 19, avg: 19.5, max: 20)

Run: 3, exploration: 0.6935613678313175, score: 55
Scores: (min: 19, avg: 31.333333333333332, max: 55)

Run: 4, exploration: 0.6149486215357263, score: 25
Scores: (min: 19, avg: 29.75, max: 55)

Run: 5, exploration: 0.5878229785513479, score: 10
Scores: (min: 10, avg: 25.8, max: 55)

Run: 6, exploration: 0.547986285490042, score: 15
Scores: (min: 10, avg: 24, max: 55)

Run: 7, exploration: 0.5159963842937159, score: 13
Scores: (min: 10, avg: 22.428571428571427, max: 55)

Run: 8, exploration: 0.42013897252428334, score: 42
Scores: (min: 10, avg: 24.875, max: 55)

Run: 9, exploration: 0.3858205374665315, score: 18
Scores: (min: 10, avg: 24.11111111111111, max: 55)

Run: 10, exploration: 0.3507711574848344, score: 20
Scores: (min: 10, avg: 23.7, max: 55)

Run: 11, exploration: 0.27714603575484437, score: 48
Scores: (min: 10, avg

In [16]:
# Learning rate reduced by a factor of 10 (0.001 -> 0.0001)
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
  
  
from scores.score_logger import ScoreLogger  
  
ENV_NAME = "CartPole-v1"  
  
GAMMA = 0.95  
LEARNING_RATE = 0.0001  
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
  
EXPLORATION_MAX = 1.0  
EXPLORATION_MIN = 0.01  
EXPLORATION_DECAY = 0.995  
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:
            if step > 100:
                return
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print ("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))  
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  

cartpole()

Run: 1, exploration: 1.0, score: 13
Scores: (min: 13, avg: 13, max: 13)

Run: 2, exploration: 0.9558895783575597, score: 16
Scores: (min: 13, avg: 14.5, max: 16)

Run: 3, exploration: 0.8866535105013078, score: 16
Scores: (min: 13, avg: 15, max: 16)

Run: 4, exploration: 0.736559652908221, score: 38
Scores: (min: 13, avg: 20.75, max: 38)

Run: 5, exploration: 0.6465587967553006, score: 27
Scores: (min: 13, avg: 22, max: 38)

Run: 6, exploration: 0.6088145090359074, score: 13
Scores: (min: 13, avg: 20.5, max: 38)

Run: 7, exploration: 0.567555222460375, score: 15
Scores: (min: 13, avg: 19.714285714285715, max: 38)

Run: 8, exploration: 0.5371084840724134, score: 12
Scores: (min: 12, avg: 18.75, max: 38)

Run: 9, exploration: 0.5185893309484582, score: 8
Scores: (min: 8, avg: 17.555555555555557, max: 38)

Run: 10, exploration: 0.46211964903917074, score: 24
Scores: (min: 8, avg: 18.2, max: 38)

Run: 11, exploration: 0.4417353564707963, score: 10
Scores: (min: 8, avg: 17.454545454545453, 

Run: 85, exploration: 0.01, score: 20
Scores: (min: 8, avg: 16.49411764705882, max: 43)

Run: 86, exploration: 0.01, score: 32
Scores: (min: 8, avg: 16.674418604651162, max: 43)

Run: 87, exploration: 0.01, score: 27
Scores: (min: 8, avg: 16.79310344827586, max: 43)

Run: 88, exploration: 0.01, score: 51
Scores: (min: 8, avg: 17.181818181818183, max: 51)

Run: 89, exploration: 0.01, score: 55
Scores: (min: 8, avg: 17.60674157303371, max: 55)

Run: 90, exploration: 0.01, score: 32
Scores: (min: 8, avg: 17.766666666666666, max: 55)

Run: 91, exploration: 0.01, score: 28
Scores: (min: 8, avg: 17.87912087912088, max: 55)

Run: 92, exploration: 0.01, score: 24
Scores: (min: 8, avg: 17.945652173913043, max: 55)

Run: 93, exploration: 0.01, score: 31
Scores: (min: 8, avg: 18.086021505376344, max: 55)

Run: 94, exploration: 0.01, score: 57
Scores: (min: 8, avg: 18.5, max: 57)

Run: 95, exploration: 0.01, score: 34
Scores: (min: 8, avg: 18.66315789473684, max: 57)

Run: 96, exploration: 0.01, s

In [10]:
# decrease Exploration Decay
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
  
  
from scores.score_logger import ScoreLogger  
  
ENV_NAME = "CartPole-v1"  
  
GAMMA = 0.95  
LEARNING_RATE = 0.001  
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
  
EXPLORATION_MAX = 1.0  
EXPLORATION_MIN = 0.01  
EXPLORATION_DECAY = 0.595  
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:
            if step > 100:
                return
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print ("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))  
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  

cartpole()

Run: 1, exploration: 1.0, score: 18
Scores: (min: 18, avg: 18, max: 18)

Run: 2, exploration: 0.01, score: 13
Scores: (min: 13, avg: 15.5, max: 18)

Run: 3, exploration: 0.01, score: 10
Scores: (min: 10, avg: 13.666666666666666, max: 18)

Run: 4, exploration: 0.01, score: 9
Scores: (min: 9, avg: 12.5, max: 18)

Run: 5, exploration: 0.01, score: 9
Scores: (min: 9, avg: 11.8, max: 18)

Run: 6, exploration: 0.01, score: 10
Scores: (min: 9, avg: 11.5, max: 18)

Run: 7, exploration: 0.01, score: 10
Scores: (min: 9, avg: 11.285714285714286, max: 18)

Run: 8, exploration: 0.01, score: 10
Scores: (min: 9, avg: 11.125, max: 18)

Run: 9, exploration: 0.01, score: 9
Scores: (min: 9, avg: 10.88888888888889, max: 18)

Run: 10, exploration: 0.01, score: 10
Scores: (min: 9, avg: 10.8, max: 18)

Run: 11, exploration: 0.01, score: 8
Scores: (min: 8, avg: 10.545454545454545, max: 18)

Run: 12, exploration: 0.01, score: 9
Scores: (min: 8, avg: 10.416666666666666, max: 18)

Run: 13, exploration: 0.01, sco
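
Note: In the output above, the exploration rate is already at its floor of 0.01 by run 2. The sketch below is a rough check of why, assuming the rate is multiplied by EXPLORATION_DECAY once per call to experience_replay() and clamped at EXPLORATION_MIN, as in the code cell. The helper name steps_to_floor is hypothetical and is not part of the notebook.

```python
def steps_to_floor(decay, start=1.0, floor=0.01):
    """Count decay steps until the exploration rate reaches the floor.

    Hypothetical helper: it mirrors the two lines at the end of
    experience_replay() (multiply by the decay, then clamp at the floor).
    """
    rate = start
    steps = 0
    while rate > floor:
        rate = max(floor, rate * decay)
        steps += 1
    return steps


# Compare the original decay with the reduced decay from this cell.
for decay in (0.995, 0.595):
    print(decay, steps_to_floor(decay))
```

With a decay of 0.595 the agent stops exploring after only a handful of training steps, which may explain why the scores in this run stay near 10.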

In [None]:
# narrow exploration min and max
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
  
  
from scores.score_logger import ScoreLogger  
  
ENV_NAME = "CartPole-v1"  
  
GAMMA = 0.95  
LEARNING_RATE = 0.001  
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
  
EXPLORATION_MAX = 0.7  
EXPLORATION_MIN = 0.4  
EXPLORATION_DECAY = 0.995  
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:
            if step > 100:
                return
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print ("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))  
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  

cartpole()

Explain how reinforcement learning concepts apply to the cartpole problem.
    The nature of the cartpole problem is that the bottom of the system (I will call it the "truck") needs to stay centered under the pole's center of gravity. If it is not, the pole begins to fall and the truck has to move back under it. In reinforcement learning terms, the agent controls the truck and receives a reward of +1 for every step the pole stays upright. This notebook also negates the reward on the terminal step, so letting the pole fall is punished. The agent has only two possible outputs: push left or push right.
    
What is the goal of the agent in this case?
    The goal of the agent in the cartpole problem is to keep the pole upright for as long as possible by moving the truck toward the pole's center of gravity.
What are the various state values?        
    0	Cart Position
    1	Cart Velocity
    2	Pole Angle   
    3	Pole Velocity At Tip 
What are the possible actions that can be performed?
    Push left or push right. There are only 2 options.
What reinforcement algorithm is used for this problem?
    This example uses Deep Q-Learning (DQN), which is Q-learning with a neural network estimating the Q-values.
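    As a rough sketch of the update that experience_replay() applies in the code cells, the target Q-value for one remembered step could be computed as below. The function name q_target and the sample numbers are my own, for illustration only; the notebook does this inline using model.predict().

```python
def q_target(reward, next_q_values, gamma, terminal):
    """Target Q-value for one transition (hypothetical helper, not in the notebook)."""
    if terminal:
        # Nothing comes after a terminal step, so the target is the reward alone.
        return reward
    # Reward now plus the discounted best estimate for the next state.
    return reward + gamma * max(next_q_values)


# Made-up Q-value estimates for the next state: [left, right].
next_q = [10.0, 8.0]
print(q_target(1.0, next_q, 0.95, terminal=False))
print(q_target(-1.0, next_q, 0.95, terminal=True))
```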
Analyze how experience replay is applied to the cartpole problem.
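    Every step of every run is remembered as a (state, action, reward, next state, done) tuple in a memory deque that holds up to 1,000,000 entries. After each non-terminal step, experience_replay() draws a random batch of 20 of these memories and trains the network on them, so the agent keeps learning from old experiences, loosely similar to biological memory.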
How does experience replay work in this algorithm?
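    Once the memory holds at least BATCH_SIZE (20) steps, experience_replay() samples 20 of them at random. For each one it computes a target Q-value: the reward alone if the step was terminal, otherwise the reward plus GAMMA times the best predicted Q-value of the next state. It writes that target into the predicted Q-values for the action that was taken and fits the network on the result, then multiplies the exploration rate by EXPLORATION_DECAY. The memories are used only for training. The next action comes from the network's prediction (or a random choice while exploring), not from looking up a similar past situation.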
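    A minimal sketch of the memory part on its own, using the same deque and random.sample calls as the code cells. The function name sample_batch, the small buffer size, and the fake transitions are assumptions for illustration; training the network is left out.

```python
import random
from collections import deque


def sample_batch(memory, batch_size):
    """Return a random batch of transitions, or an empty list if too few are stored."""
    if len(memory) < batch_size:
        return []
    return random.sample(memory, batch_size)


# Small buffer for illustration; the notebook uses maxlen=1000000.
memory = deque(maxlen=50)

# Fake transitions in the notebook's order: (state, action, reward, next_state, done).
for step in range(30):
    memory.append((step, step % 2, 1.0, step + 1, False))

batch = sample_batch(memory, 20)
print(len(batch))
```

    Sampling at random, rather than replaying the most recent steps in order, means each training batch mixes steps from different runs.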
What is the effect of introducing a discount factor for calculating the future rewards?
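    The discount factor (GAMMA) is multiplied against the estimate of the best future value, so it reduces how much we take the future into account when making a decision. A lower GAMMA makes the agent more short-sighted: with GAMMA = 0.55 a reward five steps away is weighted at about 0.05, while with GAMMA = 0.95 it is still weighted at about 0.77.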
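    To get a feel for the two GAMMA values tried in this notebook, the sketch below adds up a stream of +1 rewards with each discount applied. The function name discounted_return and the 100-step reward list are assumptions for illustration, not part of the notebook.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward at step t is weighted by gamma ** t."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


# 100 steps of +1 reward, like a run where the pole stays up for 100 steps.
rewards = [1.0] * 100
for gamma in (0.95, 0.55):
    print(gamma, round(discounted_return(rewards, gamma), 2))
```

    For a long run of +1 rewards the total approaches 1 / (1 - GAMMA), which is 20 for 0.95 but only about 2.2 for 0.55, so the lower value effectively looks just a couple of steps ahead.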
Analyze how neural networks are used in deep Q-learning.
    The network takes the place of the Q-table. It takes the observation (the state) as input and outputs one estimated Q-value per action, left or right. act() picks the action with the larger value, and experience_replay() trains the network toward the updated Q-value targets.
Explain the neural network architecture that is used in the cartpole problem.
    The network is a Keras Sequential model. The input is our observation space (4 values), followed by 2 hidden Dense layers of 24 units each with relu activation. The output layer has 2 units with linear activation, one Q-value for left and one for right. The model is compiled with mse loss and the Adam optimizer.
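    As a quick check on how small this network is, the sketch below counts its trainable parameters from the layer widths in the code cells (4 inputs, 24, 24, 2 outputs). The helper name dense_params is my own; it assumes each Dense layer has one weight per input-unit pair plus one bias per unit.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of one fully connected (Dense) layer."""
    return n_inputs * n_units + n_units


# Layer widths read from the notebook: 4 observations, 24, 24, 2 actions.
layer_sizes = [4, 24, 24, 2]

per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(per_layer)
print(per_layer, total_params)
```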
How does the neural network make the Q-learning algorithm more efficient?
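    Using a neural network lets the system avoid programmatically searching its history or keeping a table entry for every possible state, which would not be practical because the state values are continuous. The network generalizes from the states it has seen, so it can make an informed decision about states it has never visited.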
What difference do you see in the algorithm performance when you increase or decrease the learning rate?
    When I increased the learning rate (I multiplied it by 10), each run was significantly faster; however, there was little to no improvement between runs. When the learning rate was reduced (again by a factor of 10), the system learned much more each run, and each run took longer to complete.
    
Citation: 
Surma, G. (2019, November 10). Cartpole - introduction to reinforcement Learning (DQN - DEEP Q-LEARNING). Retrieved February 07, 2021, from https://gsurma.medium.com/cartpole-introduction-to-reinforcement-learning-ed0eb5b58288