# Module Five Assignment: Cartpole Problem
Review the code in this notebook and in the score_logger.py file in the *scores* folder (directory). Once you have reviewed the code, return to this notebook and select **Cell** and then **Run All** from the menu bar to run this code. The code takes several minutes to run.

In [1]:
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
  
  
from scores.score_logger import ScoreLogger  
  
ENV_NAME = "CartPole-v1"  
  
GAMMA = 0.95  
LEARNING_RATE = 0.001  
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
  
EXPLORATION_MAX = 1.0  
EXPLORATION_MIN = 0.01  
EXPLORATION_DECAY = 0.995  
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
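    def experience_replay(self):
        # Wait until the memory holds at least one full batch.
        if len(self.memory) < BATCH_SIZE:
            return
        # Uniformly sample past transitions to train on.
        batch = random.sample(self.memory, BATCH_SIZE)
        for state, action, reward, state_next, terminal in batch:
            # Terminal transitions have no future reward to add.
            q_update = reward
            if not terminal:
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))
            # Overwrite only the Q-value of the action taken, then fit to it.
            q_values = self.model.predict(state)
            q_values[0][action] = q_update
            self.model.fit(state, q_values, verbose=0)
        # Decay the exploration rate once per replay call, down to the minimum.
        self.exploration_rate *= EXPLORATION_DECAY
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)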
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print ("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))  
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  



Using TensorFlow backend.


In [2]:
cartpole()

Run: 1, exploration: 0.9275689688183278, score: 35
Scores: (min: 35, avg: 35, max: 35)

Run: 2, exploration: 0.8647077305675338, score: 15
Scores: (min: 15, avg: 25, max: 35)

Run: 3, exploration: 0.7705488893118823, score: 24
Scores: (min: 15, avg: 24.666666666666668, max: 35)

Run: 4, exploration: 0.7328768546436799, score: 11
Scores: (min: 11, avg: 21.25, max: 35)

Run: 5, exploration: 0.6498078359349755, score: 25
Scores: (min: 11, avg: 22, max: 35)

Run: 6, exploration: 0.5761543988830038, score: 25
Scores: (min: 11, avg: 22.5, max: 35)

Run: 7, exploration: 0.5507399854171277, score: 10
Scores: (min: 10, avg: 20.714285714285715, max: 35)

Run: 8, exploration: 0.531750826943791, score: 8
Scores: (min: 8, avg: 19.125, max: 35)

Run: 9, exploration: 0.47622912292284103, score: 23
Scores: (min: 8, avg: 19.555555555555557, max: 35)

Run: 10, exploration: 0.4417353564707963, score: 16
Scores: (min: 8, avg: 19.2, max: 35)

Run: 11, exploration: 0.42013897252428334, score: 11
Scores: (mi

NameError: name 'exit' is not defined

In [3]:
# Modifying the discount factor (GAMMA)
# original = 0.95
# experiment = 0.75
# This run was stopped manually after 278 runs (about 30 minutes).
# A previous attempt was stopped at 1500 total runs.
# No solution found.

ENV_NAME = "CartPole-v1"  
  
GAMMA = 0.75  
LEARNING_RATE = 0.001  
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
  
EXPLORATION_MAX = 1.0  
EXPLORATION_MIN = 0.01  
EXPLORATION_DECAY = 0.995  
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print ("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))  
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()

cartpole()

Run: 1, exploration: 0.985074875, score: 23
Scores: (min: 23, avg: 23, max: 23)

Run: 2, exploration: 0.8911090557802088, score: 21
Scores: (min: 21, avg: 22, max: 23)

Run: 3, exploration: 0.8390886103705794, score: 13
Scores: (min: 13, avg: 19, max: 23)

Run: 4, exploration: 0.7901049725470279, score: 13
Scores: (min: 13, avg: 17.5, max: 23)

Run: 5, exploration: 0.7111635524897149, score: 22
Scores: (min: 13, avg: 18.4, max: 23)

Run: 6, exploration: 0.6118738784280476, score: 31
Scores: (min: 13, avg: 20.5, max: 31)

Run: 7, exploration: 0.5618938591163328, score: 18
Scores: (min: 13, avg: 20.142857142857142, max: 31)

Run: 8, exploration: 0.5057535983897912, score: 22
Scores: (min: 13, avg: 20.375, max: 31)

Run: 9, exploration: 0.47147873742168567, score: 15
Scores: (min: 13, avg: 19.77777777777778, max: 31)

Run: 10, exploration: 0.3706551064126331, score: 49
Scores: (min: 13, avg: 22.7, max: 49)

Run: 11, exploration: 0.26759021970270175, score: 66
Scores: (min: 13, avg: 26.636

KeyboardInterrupt: 

In [5]:
# modifying exploration factor
# original MAX = 1.0
# experiment MAX = 1.0
# original MIN = 0.01
# experiment MIN = 0.05
# original DECAY = 0.995
# experiment DECAY = 0.9999
# Solved after 300 total runs.

ENV_NAME = "CartPole-v1"  
  
GAMMA = 0.95  
LEARNING_RATE = 0.001  
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
  
EXPLORATION_MAX = 1.0  
EXPLORATION_MIN = 0.05  
EXPLORATION_DECAY = 0.9999  
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print ("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))  
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()

cartpole()

Run: 1, exploration: 1.0, score: 17
Scores: (min: 17, avg: 17, max: 17)

Run: 2, exploration: 0.9982015291843062, score: 21
Scores: (min: 17, avg: 19, max: 21)

Run: 3, exploration: 0.9962070215713769, score: 21
Scores: (min: 17, avg: 19.666666666666668, max: 21)

Run: 4, exploration: 0.9945148237990713, score: 18
Scores: (min: 17, avg: 19.25, max: 21)

Run: 5, exploration: 0.9921307310710261, score: 25
Scores: (min: 17, avg: 20.4, max: 25)

Run: 6, exploration: 0.9908417346989258, score: 14
Scores: (min: 14, avg: 19.333333333333332, max: 25)

Run: 7, exploration: 0.9889608287826225, score: 20
Scores: (min: 14, avg: 19.428571428571427, max: 25)

Run: 8, exploration: 0.9851112007201276, score: 40
Scores: (min: 14, avg: 22, max: 40)

Run: 9, exploration: 0.9836345678377172, score: 16
Scores: (min: 14, avg: 21.333333333333332, max: 40)

Run: 10, exploration: 0.9799036631402607, score: 39
Scores: (min: 14, avg: 23.1, max: 40)

Run: 11, exploration: 0.9779457165143248, score: 21
Scores: (mi

NameError: name 'exit' is not defined

In [6]:
# modifying learning rate
# original = 0.001
# experiment = 0.0001
# solved in 329 total runs.

ENV_NAME = "CartPole-v1"  
  
GAMMA = 0.95  
LEARNING_RATE = 0.0001  
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
  
EXPLORATION_MAX = 1.0  
EXPLORATION_MIN = 0.01  
EXPLORATION_DECAY = 0.995  
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print ("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))  
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()

cartpole()

Run: 1, exploration: 0.9000874278732445, score: 41
Scores: (min: 41, avg: 41, max: 41)

Run: 2, exploration: 0.8475428503023453, score: 13
Scores: (min: 13, avg: 27, max: 41)

Run: 3, exploration: 0.6596532430440636, score: 51
Scores: (min: 13, avg: 35, max: 51)

Run: 4, exploration: 0.6027415843082742, score: 19
Scores: (min: 13, avg: 31, max: 51)

Run: 5, exploration: 0.5535075230322891, score: 18
Scores: (min: 13, avg: 28.4, max: 51)

Run: 6, exploration: 0.500708706245853, score: 21
Scores: (min: 13, avg: 27.166666666666668, max: 51)

Run: 7, exploration: 0.4738479773082268, score: 12
Scores: (min: 12, avg: 25, max: 51)

Run: 8, exploration: 0.43080185560799106, score: 20
Scores: (min: 12, avg: 24.375, max: 51)

Run: 9, exploration: 0.41386834584198684, score: 9
Scores: (min: 9, avg: 22.666666666666668, max: 51)

Run: 10, exploration: 0.39166620452737816, score: 12
Scores: (min: 9, avg: 21.6, max: 51)

Run: 11, exploration: 0.3596735257153405, score: 18
Scores: (min: 9, avg: 21.272

NameError: name 'exit' is not defined

Summary of experiments:

Original Run:
Solved after 152 total runs.

Experiment One:
Modified Discount Factor to 0.75
Stopped test at 1500 total runs with no solution.

Experiment Two:
Modified Exploration Factor.
Changed Min to 0.05
Changed Decay to 0.9999
Solved after 300 total runs.

Experiment Three:
Modified Learning Rate to 0.0001
Solved after 329 total runs.
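
The exploration settings in Experiment Two mainly change how quickly the agent stops acting randomly. As a rough sketch (the helper name `steps_to_floor` is hypothetical, and it assumes the rate is multiplied by the decay factor once per call to `experience_replay`, as in the code above), the number of replay steps needed to reach the minimum can be estimated like this:

```python
import math


def steps_to_floor(start, floor, decay):
    """Smallest n with start * decay**n <= floor (hypothetical helper)."""
    return math.ceil(math.log(floor / start) / math.log(decay))


# Original settings: EXPLORATION_MIN = 0.01, EXPLORATION_DECAY = 0.995
original_steps = steps_to_floor(1.0, 0.01, 0.995)

# Experiment Two: EXPLORATION_MIN = 0.05, EXPLORATION_DECAY = 0.9999
experiment_steps = steps_to_floor(1.0, 0.05, 0.9999)

print(original_steps, experiment_steps)
```

Under these assumptions the original settings reach the minimum after roughly 900 replay steps, while Experiment Two needs roughly 30,000. That is consistent with the exploration rate still being about 0.98 after 11 runs in the Experiment Two output.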

Explain how reinforcement learning concepts apply to the cartpole problem.
What is the goal of the agent in this case?
    To keep the pole balanced upright on the cart for as long as possible by pushing the cart to the left or right.
What are the various state values?
    Cart Position, Cart Velocity, Pole Angle, Pole Velocity At Tip
What are the possible actions that can be performed?
    Push cart to the left, Push cart to the right
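
As a small illustration of the state and action spaces, the sketch below gives names to the four state values and the two actions. The class names `Observation` and `Action` and the example numbers are made up for illustration; the sketch assumes Gym's CartPole convention that action 0 pushes left and action 1 pushes right.

```python
from enum import IntEnum
from typing import NamedTuple


class Observation(NamedTuple):
    """The four state values, in the order listed above (hypothetical type)."""
    cart_position: float
    cart_velocity: float
    pole_angle: float
    pole_velocity_at_tip: float


class Action(IntEnum):
    """The two discrete actions (assumed numbering: 0 = left, 1 = right)."""
    PUSH_LEFT = 0
    PUSH_RIGHT = 1


# Example observation with made-up values.
obs = Observation(0.02, -0.15, 0.03, 0.25)
print(obs.pole_angle, Action(1).name)
```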
What reinforcement algorithm is used for this problem?
    Deep Q-Learning
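
The core of the Q-learning update in `experience_replay` is the target value for the action that was taken. A minimal sketch of that calculation follows; the function name `q_target` is hypothetical, and plain Python lists stand in for the model's predictions.

```python
def q_target(reward, next_q_values, terminal, gamma=0.95):
    """Target value for the action that was taken (hypothetical helper)."""
    if terminal:
        # No future reward once the episode has ended.
        return reward
    # Immediate reward plus the discounted best estimate for the next state.
    return reward + gamma * max(next_q_values)


print(q_target(1.0, [2.0, 3.0], terminal=False))  # 1.0 + 0.95 * 3.0
print(q_target(-1.0, [2.0, 3.0], terminal=True))
```

The network is then fitted so that its prediction for the chosen action moves toward this target, while the predictions for the other actions are left as they were.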

Analyze how experience replay is applied to the cartpole problem.
How does experience replay work in this algorithm?
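    Each step is stored in memory as a (state, action, reward, next state, terminal) tuple.  Experience replay then uniformly
    samples a batch of these experiences (20 per call in this notebook) and, for each one, updates the Q value of the action
    that was taken (Surma, 2019).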
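
The storage and sampling part of experience replay can be sketched without a neural network. The class `ReplayMemory` below is an illustrative stand-in, not the notebook's `DQNSolver`, and the capacity and batch size are deliberately tiny.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size store of past transitions with uniform sampling (illustrative)."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest transition once it is full.
        self.buffer = deque(maxlen=capacity)

    def remember(self, state, action, reward, next_state, terminal):
        self.buffer.append((state, action, reward, next_state, terminal))

    def sample(self, batch_size):
        # Return an empty batch until enough transitions have been collected.
        if len(self.buffer) < batch_size:
            return []
        return random.sample(list(self.buffer), batch_size)


memory = ReplayMemory(capacity=5)
for t in range(8):
    memory.remember(t, 0, 1.0, t + 1, False)

batch = memory.sample(3)
print(len(memory.buffer), len(batch))
```

Sampling at random from the whole buffer, rather than training on the most recent steps in order, breaks up the correlation between consecutive transitions.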
What is the effect of introducing a discount factor for calculating the future rewards?
    It keeps the sum of future rewards finite and weights near-term rewards more heavily than distant ones.  If rewards are
    not discounted, the sum of the rewards would grow infinitely and an optimal solution would not be found (Beysolow, 2019).
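
To make the effect of the discount factor concrete, the sketch below computes the discounted return for a long episode that earns a reward of 1 per step, using the two GAMMA values tried in this notebook. The function name `discounted_return` is hypothetical.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards weighted by gamma**t (hypothetical helper)."""
    total = 0.0
    # Work backwards so each step adds its reward to the discounted future.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


rewards = [1.0] * 500  # a reward of 1 per step over a long episode

for gamma in (0.95, 0.75):
    bound = 1.0 / (1.0 - gamma)  # limit of the geometric series
    print(gamma, round(discounted_return(rewards, gamma), 3), round(bound, 3))
```

The return approaches 1 / (1 - GAMMA): about 20 for GAMMA = 0.95 but only 4 for GAMMA = 0.75. With the smaller value the agent effectively looks only a few steps ahead, which may be one reason Experiment One did not find a solution.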

Analyze how neural networks are used in deep Q-learning.
Explain the neural network architecture that is used in the cartpole problem.
    The neural network consists of an input layer, hidden layers, and an output layer.  In this notebook, the observation
    values feed two hidden layers of 24 ReLU units each, followed by a linear output layer with one unit per action.  The
    state of the environment is fed into the neural network and the expected reward is calculated for each action.  The next
    action is then selected based on the greatest expected reward (Kedia, 2020).  Deep Q-learning as described by Singh
    (2022) uses two neural networks, a main network and a target network; the code in this notebook trains a single network.
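
The flow from state to action can be sketched in plain Python. The layer sizes follow the model definition in the code (two `Dense(24)` layers and an output layer sized by the action space), assuming 4 state values and 2 actions. The weights here are random and untrained, and the helper names `dense`, `init_layer`, and `q_values` are hypothetical, so the chosen action is meaningless beyond showing the mechanics.

```python
import random


def dense(inputs, weights, biases, relu):
    """One fully connected layer: weighted sum plus bias, optional ReLU."""
    outputs = []
    for unit_weights, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(unit_weights, inputs)) + bias
        outputs.append(max(0.0, z) if relu else z)
    return outputs


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases (untrained, for illustration only)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


rng = random.Random(0)
sizes = [4, 24, 24, 2]  # assumed: 4 state values in, 2 actions out
layers = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]


def q_values(state):
    """Forward pass: ReLU on hidden layers, linear output layer."""
    activations = state
    for index, (weights, biases) in enumerate(layers):
        is_hidden = index < len(layers) - 1
        activations = dense(activations, weights, biases, relu=is_hidden)
    return activations


state = [0.02, -0.15, 0.03, 0.25]  # made-up observation
q = q_values(state)
action = q.index(max(q))  # greedy choice: the action with the largest Q value
n_params = sum(len(w) * len(w[0]) + len(b) for w, b in layers)
print(len(q), action, n_params)
```

With these sizes the network has 120 + 600 + 50 = 770 trainable parameters, which is small compared with the table a tabular method would need for continuous state values.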
How does the neural network make the Q-learning algorithm more efficient?
    Rather than storing a value for every state and action, the network approximates the Q values, so it can estimate the
    reward of each action even for states it has not seen before.  For the two-network variant: "Despite having the same
    architecture, these networks differ in their weights. During each N-step, the weights get copied between the main and
    target networks. Both networks help the algorithm learn more effectively and stabilize the learning process" (Singh, 2022).
What difference do you see in the algorithm performance when you increase or decrease the learning rate?
    Decreasing the learning rate from 0.001 to 0.0001 made the problem take longer to solve: 329 total runs, compared with
    152 total runs for the original settings.


References:
Surma, G. (2019, November 10). Cartpole - introduction to reinforcement learning (DQN - deep Q-learning). Medium. Retrieved March 29, 2023, from https://gsurma.medium.com/cartpole-introduction-to-reinforcement-learning-ed0eb5b58288 

Gulli, A., & Pal, S. (2017). Deep learning with Keras: Get to grips with the basics of Keras to implement fast and efficient deep-learning models. Packt Publishing.

Beysolow, T., II. (2019). Applied reinforcement learning with Python: With OpenAI Gym, TensorFlow, and Keras. Apress.

Kedia, A. (2020, April 15). Creating deep neural networks from scratch, an introduction to reinforcement learning. Medium. Retrieved March 29, 2023, from https://towardsdatascience.com/creating-deep-neural-networks-from-scratch-an-introduction-to-reinforcement-learning-part-i-549ef7b149d2 

Singh, S. (2022, October 7). A comprehensive guide to neural networks in deep Q-learning. Turing. Retrieved March 29, 2023, from https://www.turing.com/kb/how-are-neural-networks-used-in-deep-q-learning#how-deep-q-networks-function

Note: If the code is running properly, you should begin to see output appearing above this code block. It will take several minutes, so it is recommended that you let this code run in the background while completing other work. When the code has finished, it will print output saying, "Solved in _ runs, _ total runs."

You may see an error about not having an exit command. This error does not affect the program's functionality and results from the steps taken to convert the code from Python 2.x to Python 3. Please disregard this error.