# Module Five Assignment: Cartpole Problem
Review the code in this notebook and in the score_logger.py file in the *scores* folder (directory). Once you have reviewed the code, return to this notebook and select **Cell** and then **Run All** from the menu bar to run this code. The code takes several minutes to run.

In [3]:
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
  
  
from scores.score_logger import ScoreLogger  
  
ENV_NAME = "CartPole-v1"  
  
GAMMA = 0.95  
LEARNING_RATE = 0.001  
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
  
EXPLORATION_MAX = 1.0  
EXPLORATION_MIN = 0.01  
EXPLORATION_DECAY = 0.995  
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print(f"Run: {run}, exploration: {dqn_solver.exploration_rate}, score: {step}")  
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  



In [4]:
cartpole()

Run: 1, exploration: 1.0, score: 15
Scores: (min: 15, avg: 15, max: 15)

Run: 2, exploration: 0.9229311239742362, score: 21
Scores: (min: 15, avg: 18, max: 21)

Run: 3, exploration: 0.7705488893118823, score: 37
Scores: (min: 15, avg: 24.333333333333332, max: 37)

Run: 4, exploration: 0.7076077347272662, score: 18
Scores: (min: 15, avg: 22.75, max: 37)

Run: 5, exploration: 0.6662995813682115, score: 13
Scores: (min: 13, avg: 20.8, max: 37)

Run: 6, exploration: 0.6180388156137953, score: 16
Scores: (min: 13, avg: 20, max: 37)

Run: 7, exploration: 0.5907768628656763, score: 10
Scores: (min: 10, avg: 18.571428571428573, max: 37)

Run: 8, exploration: 0.5647174463480732, score: 10
Scores: (min: 10, avg: 17.5, max: 37)

Run: 9, exploration: 0.531750826943791, score: 13
Scores: (min: 10, avg: 17, max: 37)

Run: 10, exploration: 0.4858739637363176, score: 19
Scores: (min: 10, avg: 17.2, max: 37)

Run: 11, exploration: 0.46677573701590436, score: 9
Scores: (min: 9, avg: 16.454545454545453, 

NameError: name 'exit' is not defined

Note: If the code is running properly, you should begin to see output appearing above this code block. It will take several minutes, so it is recommended that you let this code run in the background while completing other work. When the code has finished, it will print output saying, "Solved in _ runs, _ total runs."

You may see an error about not having an exit command. This error does not affect the program's functionality and results from the steps taken to convert the code from Python 2.x to Python 3. Please disregard this error.

In [6]:
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
  
  
from scores.score_logger import ScoreLogger  
  
ENV_NAME = "CartPole-v1"  
  
GAMMA = 0.90  
LEARNING_RATE = 0.001  
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
  
EXPLORATION_MAX = 1.0  
EXPLORATION_MIN = 0.01  
EXPLORATION_DECAY = 0.995  
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print(f"Run: {run}, exploration: {dqn_solver.exploration_rate}, score: {step}")  
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()

In [8]:
cartpole()

Run: 1, exploration: 0.995, score: 21
Scores: (min: 21, avg: 21, max: 21)

Run: 2, exploration: 0.9416228069143757, score: 12
Scores: (min: 12, avg: 16.5, max: 21)

Run: 3, exploration: 0.8307187014821328, score: 26
Scores: (min: 12, avg: 19.666666666666668, max: 26)

Run: 4, exploration: 0.7861544476842928, score: 12
Scores: (min: 12, avg: 17.75, max: 26)

Run: 5, exploration: 0.6935613678313175, score: 26
Scores: (min: 12, avg: 19.4, max: 26)

Run: 6, exploration: 0.6180388156137953, score: 24
Scores: (min: 12, avg: 20.166666666666668, max: 26)

Run: 7, exploration: 0.5790496471185967, score: 14
Scores: (min: 12, avg: 19.285714285714285, max: 26)

Run: 8, exploration: 0.547986285490042, score: 12
Scores: (min: 12, avg: 18.375, max: 26)

Run: 9, exploration: 0.5238143793828016, score: 10
Scores: (min: 10, avg: 17.444444444444443, max: 26)

Run: 10, exploration: 0.500708706245853, score: 10
Scores: (min: 10, avg: 16.7, max: 26)

Run: 11, exploration: 0.47147873742168567, score: 13
Scor

NameError: name 'exit' is not defined

In [None]:
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
  
  
from scores.score_logger import ScoreLogger  
  
ENV_NAME = "CartPole-v1"  
  
GAMMA = 0.95  
LEARNING_RATE = 0.005  
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
  
EXPLORATION_MAX = 1.0  
EXPLORATION_MIN = 0.01  
EXPLORATION_DECAY = 0.995  
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print(f"Run: {run}, exploration: {dqn_solver.exploration_rate}, score: {step}")  
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()

In [9]:
cartpole()

Run: 1, exploration: 0.9369146928798039, score: 33
Scores: (min: 33, avg: 33, max: 33)

Run: 2, exploration: 0.778312557068642, score: 38
Scores: (min: 33, avg: 35.5, max: 38)

Run: 3, exploration: 0.7219385759785162, score: 16
Scores: (min: 16, avg: 29, max: 38)

Run: 4, exploration: 0.653073201944699, score: 21
Scores: (min: 16, avg: 27, max: 38)

Run: 5, exploration: 0.5937455908197752, score: 20
Scores: (min: 16, avg: 25.6, max: 38)

Run: 6, exploration: 0.5344229416520513, score: 22
Scores: (min: 16, avg: 25, max: 38)

Run: 7, exploration: 0.4982051627146237, score: 15
Scores: (min: 15, avg: 23.571428571428573, max: 38)

Run: 8, exploration: 0.47622912292284103, score: 10
Scores: (min: 10, avg: 21.875, max: 38)

Run: 9, exploration: 0.4417353564707963, score: 16
Scores: (min: 10, avg: 21.22222222222222, max: 38)

Run: 10, exploration: 0.41386834584198684, score: 14
Scores: (min: 10, avg: 20.5, max: 38)

Run: 11, exploration: 0.3976004408064698, score: 9
Scores: (min: 9, avg: 19.45

NameError: name 'exit' is not defined

In [None]:
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
  
  
from scores.score_logger import ScoreLogger  
  
ENV_NAME = "CartPole-v1"  
  
GAMMA = 0.95  
LEARNING_RATE = 0.001  
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
  
EXPLORATION_MAX = 0.9  
EXPLORATION_MIN = 0.02  
EXPLORATION_DECAY = 0.8  
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print(f"Run: {run}, exploration: {dqn_solver.exploration_rate}, score: {step}")  
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()

In [10]:
cartpole()

Run: 1, exploration: 0.8390886103705794, score: 55
Scores: (min: 55, avg: 55, max: 55)

Run: 2, exploration: 0.7590483508202912, score: 21
Scores: (min: 21, avg: 38, max: 55)

Run: 3, exploration: 0.7292124703704616, score: 9
Scores: (min: 9, avg: 28.333333333333332, max: 55)

Run: 4, exploration: 0.6832098777212641, score: 14
Scores: (min: 9, avg: 24.75, max: 55)

Run: 5, exploration: 0.653073201944699, score: 10
Scores: (min: 9, avg: 21.8, max: 55)

Run: 6, exploration: 0.5937455908197752, score: 20
Scores: (min: 9, avg: 21.5, max: 55)

Run: 7, exploration: 0.5452463540625918, score: 18
Scores: (min: 9, avg: 21, max: 55)

Run: 8, exploration: 0.5211953074858876, score: 10
Scores: (min: 9, avg: 19.625, max: 55)

Run: 9, exploration: 0.4883155414435353, score: 14
Scores: (min: 9, avg: 19, max: 55)

Run: 10, exploration: 0.457510005540005, score: 14
Scores: (min: 9, avg: 18.5, max: 55)

Run: 11, exploration: 0.43732904629000013, score: 10
Scores: (min: 9, avg: 17.727272727272727, max: 5

NameError: name 'exit' is not defined

In [1]:
# I first tried setting gamma to 0.75 and 0.8 and found that the agent was unable to solve the problem. With gamma at 0.9 it could solve it, but it took more runs to do so. Changing the learning rate to 0.005 was much faster and solved it in significantly fewer runs. Changing the exploration variables to max 0.9, min 0.02, and decay 0.8 showed little change and produced a result very similar to the original: 75 / 175 runs originally and 78 / 178 this time.
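
To make the two exploration schedules easier to compare, here is a minimal sketch that counts how many multiplicative decay steps each one needs to reach its minimum. The helper name `steps_to_floor` is hypothetical and not part of the notebook; the sketch assumes one decay step per call to `experience_replay`, as in the code cells above.

```python
import math


def steps_to_floor(start, floor, decay):
    """Smallest n such that start * decay**n <= floor (hypothetical helper)."""
    return math.ceil(math.log(floor / start) / math.log(decay))


# Original schedule: EXPLORATION_MAX=1.0, EXPLORATION_MIN=0.01, EXPLORATION_DECAY=0.995
original = steps_to_floor(1.0, 0.01, 0.995)

# Modified schedule: EXPLORATION_MAX=0.9, EXPLORATION_MIN=0.02, EXPLORATION_DECAY=0.8
modified = steps_to_floor(0.9, 0.02, 0.8)

print("original schedule:", original, "decay steps to reach the minimum")
print("modified schedule:", modified, "decay steps to reach the minimum")
```

Under these assumptions the original schedule takes 919 decay steps to reach its minimum, while the modified one takes 18.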

# The goal is to keep the pole balanced upright by applying appropriate forces to the cart it is attached to by a pivot. The state describes the condition of the system: the cart's position and velocity and the pole's angle and angular velocity. The possible actions are pushing the cart to the left or to the right. When the cart moves to the left the pole tilts to the right, and when the cart moves to the right the pole tilts to the left. With the reinforcement learning algorithm, Q-learning, the agent is rewarded for every step the pole stays up, so making the right decisions earns more reward. Q-learning picks the best-known action for any given state, so we need to know which actions are better by assigning a value to each one.
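
Picking the best-known action while still exploring can be sketched as an epsilon-greedy rule, similar in spirit to the `act` method in the code cells above. This is a simplified illustration using plain lists; the function name `choose_action` and the Q-values are hypothetical.

```python
import random


def choose_action(q_values, exploration_rate, rng=random):
    """Epsilon-greedy: explore with probability exploration_rate, else exploit."""
    if rng.random() < exploration_rate:
        # Explore: pick any action uniformly at random
        return rng.randrange(len(q_values))
    # Exploit: pick the index of the largest Q-value
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Hypothetical Q-values for the two CartPole actions: 0 = push left, 1 = push right
q_values = [0.2, 0.7]

greedy = choose_action(q_values, exploration_rate=0.0)    # never explores
explored = choose_action(q_values, exploration_rate=1.0)  # always explores
print("greedy action:", greedy, "exploratory action:", explored)
```

With an exploration rate of 0 the agent always takes the action with the highest value; with a rate of 1 it ignores the values entirely.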

# Experience replay is a replay-memory technique used in reinforcement learning. It addresses the problem of correlated consecutive samples (autocorrelation), which leads to unstable training, by storing past transitions and training on random samples of them, which makes the problem more like supervised learning. The discount factor determines the importance of future rewards. The lower it is, the more weight immediate rewards get; at 0, only the current reward is considered.
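
Both ideas can be illustrated with a short sketch. The transitions below are placeholders (integer states, constant reward) rather than real CartPole data, and `discounted_return` is a hypothetical helper that is not part of the notebook.

```python
import random
from collections import deque


def discounted_return(rewards, gamma):
    """Sum of rewards weighted by gamma**t, where t=0 is the immediate reward."""
    return sum(r * gamma ** t for t, r in enumerate(rewards))


# Replay memory: transitions are stored in the order they happen...
memory = deque(maxlen=1000)
for t in range(100):
    # (state, action, reward, next_state, done) with placeholder values
    memory.append((t, 0, 1.0, t + 1, False))

# ...but training uses a random sample, which breaks up consecutive steps
batch = random.sample(list(memory), 20)
print("sampled time steps:", sorted(state for state, *_ in batch))

# Effect of the discount factor on ten future rewards of 1.0 each
rewards = [1.0] * 10
for gamma in (0.0, 0.90, 0.95):
    print("gamma", gamma, "->", round(discounted_return(rewards, gamma), 3))
```

With gamma at 0 only the first reward counts, and the total grows as gamma approaches 1 because later rewards are discounted less.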

# Using a neural network reduces the memory requirements for continuous or larger state spaces, because Q-values no longer have to be stored in a table with an entry for every state. The neural network is set up so that the input is a state and the output is one Q-value per action. The action with the highest Q-value is the best-known action for the current state. I noticed that a slight increase to the learning rate drastically reduced the number of runs required for it to perfect its actions.
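
The way the network's output is turned into a training target can be sketched without Keras. The snippet below mirrors the update in `experience_replay` from the code cells above, using plain lists in place of model predictions; the function names and Q-values are hypothetical.

```python
def td_target(reward, next_q_values, terminal, gamma=0.95):
    """Target for the taken action: r if terminal, else r + gamma * max Q(s', a')."""
    if terminal:
        return reward
    return reward + gamma * max(next_q_values)


def updated_q_values(q_values, action, target):
    """Copy of the predicted Q-values with only the taken action's entry replaced."""
    new_q = list(q_values)
    new_q[action] = target
    return new_q


q_state = [0.5, 0.8]  # hypothetical network output for the current state
q_next = [1.0, 2.0]   # hypothetical network output for the next state

target = td_target(reward=1.0, next_q_values=q_next, terminal=False)
print("training target:", updated_q_values(q_state, action=0, target=target))

# On a terminal step the target is just the (negative) reward
print("terminal target:", td_target(reward=-1.0, next_q_values=q_next, terminal=True))
```

Only the entry for the action that was actually taken changes, so fitting the network on this target leaves the estimate for the other action untouched.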

# Surma, G. (2019, November 10). Cartpole - introduction to reinforcement learning (DQN - deep Q-learning). Medium. https://gsurma.medium.com/cartpole-introduction-to-reinforcement-learning-ed0eb5b58288 

# Papers With Code. (n.d.). Experience replay explained. https://paperswithcode.com/method/experience-replay