# Module Five Assignment: Cartpole Problem
Review the code in this notebook and in the score_logger.py file in the *scores* folder (directory). Once you have reviewed the code, return to this notebook and select **Cell** and then **Run All** from the menu bar to run this code. The code takes several minutes to run.

In [1]:
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  

from scores.score_logger import ScoreLogger  
  
ENV_NAME = "CartPole-v1"
  
GAMMA = 0.95
LEARNING_RATE = 0.001
  
MEMORY_SIZE = 1000000
BATCH_SIZE = 20
  
EXPLORATION_MAX = 1.0
EXPLORATION_MIN = 0.01
EXPLORATION_DECAY = 0.995
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print ("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))  
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  

Using TensorFlow backend.


In [2]:
cartpole()

Run: 1, exploration: 1.0, score: 9
Scores: (min: 9, avg: 9, max: 9)

Run: 2, exploration: 0.960693043575437, score: 19
Scores: (min: 9, avg: 14, max: 19)

Run: 3, exploration: 0.8183201210226743, score: 33
Scores: (min: 9, avg: 20.333333333333332, max: 33)

Run: 4, exploration: 0.7076077347272662, score: 30
Scores: (min: 9, avg: 22.75, max: 33)

Run: 5, exploration: 0.6797938283326578, score: 9
Scores: (min: 9, avg: 20, max: 33)

Run: 6, exploration: 0.6118738784280476, score: 22
Scores: (min: 9, avg: 20.333333333333332, max: 33)

Run: 7, exploration: 0.5819594443402982, score: 11
Scores: (min: 9, avg: 19, max: 33)

Run: 8, exploration: 0.5507399854171277, score: 12
Scores: (min: 9, avg: 18.125, max: 33)

Run: 9, exploration: 0.5211953074858876, score: 12
Scores: (min: 9, avg: 17.444444444444443, max: 33)

Run: 10, exploration: 0.483444593917636, score: 16
Scores: (min: 9, avg: 17.3, max: 33)

Run: 11, exploration: 0.457510005540005, score: 12
Scores: (min: 9, avg: 16.818181818181817, 

Note: If the code is running properly, you should begin to see output appearing above this code block. It will take several minutes, so it is recommended that you let this code run in the background while completing other work. When the code has finished, it will print output saying, "Solved in _ runs, _ total runs."

You may see an error about not having an exit command. This error does not affect the program's functionality and results from the steps taken to convert the code from Python 2.x to Python 3. Please disregard this error.

In [3]:
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  

from scores.score_logger import ScoreLogger  
  
ENV_NAME = "CartPole-v1"
  
GAMMA = 0.95
LEARNING_RATE = 0.01    # changed learning rate from 0.001 to 0.01
  
MEMORY_SIZE = 1000000
BATCH_SIZE = 20
  
EXPLORATION_MAX = 1.0
EXPLORATION_MIN = 0.01
EXPLORATION_DECAY = 0.995
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print ("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))  
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  

In [4]:
# changed learning rate from 0.001 to 0.01
cartpole()

Run: 1, exploration: 1.0, score: 13
Scores: (min: 13, avg: 13, max: 13)

Run: 2, exploration: 0.918316468354365, score: 24
Scores: (min: 13, avg: 18.5, max: 24)

Run: 3, exploration: 0.8778091417340573, score: 10
Scores: (min: 10, avg: 15.666666666666666, max: 24)

Run: 4, exploration: 0.8061065909263957, score: 18
Scores: (min: 10, avg: 16.25, max: 24)

Run: 5, exploration: 0.6433260027715241, score: 46
Scores: (min: 10, avg: 22.2, max: 46)

Run: 6, exploration: 0.5790496471185967, score: 22
Scores: (min: 10, avg: 22.166666666666668, max: 46)

Run: 7, exploration: 0.5535075230322891, score: 10
Scores: (min: 10, avg: 20.428571428571427, max: 46)

Run: 8, exploration: 0.5211953074858876, score: 13
Scores: (min: 10, avg: 19.5, max: 46)

Run: 9, exploration: 0.4932355662165453, score: 12
Scores: (min: 10, avg: 18.666666666666668, max: 46)

Run: 10, exploration: 0.46211964903917074, score: 14
Scores: (min: 10, avg: 18.2, max: 46)

Run: 11, exploration: 0.446186062443672, score: 8
Scores: (

KeyboardInterrupt: 

I stopped the above run after letting it go for the entire time I was at work. The script ran for ~10 hours and was nowhere near solving the problem, even after 10k+ runs.
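
One way to build intuition for why a larger learning rate can stall training is a toy one-dimensional problem. The sketch below is not the DQN from this notebook: it runs plain gradient descent on f(w) = w², a hypothetical stand-in for a loss surface, and the function name `gradient_descent` and the three step sizes are illustrative assumptions. The notebook's Adam optimizer adapts its steps, so this only illustrates the general mechanism, not what happened in the run above.

```python
def gradient_descent(step_size, start=1.0, steps=20):
    """Minimize f(w) = w**2 with plain gradient descent; return every value of w."""
    w = start
    path = [w]
    for _ in range(steps):
        grad = 2.0 * w              # derivative of w**2
        w = w - step_size * grad    # move against the gradient
        path.append(w)
    return path


# Illustrative step sizes: too small, reasonable, too large for this toy problem.
for step_size in (0.01, 0.4, 1.1):
    final_w = gradient_descent(step_size)[-1]
    print(f"step_size={step_size}: |w| after 20 steps = {abs(final_w):.6f}")
```

In this toy setting the smallest step size creeps toward the minimum at w = 0, the middle one reaches it quickly, and the largest one overshoots further on every step and moves away from it.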

In [5]:
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  

from scores.score_logger import ScoreLogger  
  
ENV_NAME = "CartPole-v1"
  
GAMMA = 0.5     # changed discount factor from 0.95 to 0.5
LEARNING_RATE = 0.001
  
MEMORY_SIZE = 1000000
BATCH_SIZE = 20
  
EXPLORATION_MAX = 1.0
EXPLORATION_MIN = 0.01
EXPLORATION_DECAY = 0.995
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print ("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))  
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  

In [6]:
# changed discount factor from 0.95 to 0.5
cartpole()

Run: 1, exploration: 1.0, score: 17
Scores: (min: 17, avg: 17, max: 17)

Run: 2, exploration: 0.9137248860125932, score: 21
Scores: (min: 17, avg: 19, max: 21)

Run: 3, exploration: 0.7255664080186093, score: 47
Scores: (min: 17, avg: 28.333333333333332, max: 47)

Run: 4, exploration: 0.6696478204705644, score: 17
Scores: (min: 17, avg: 25.5, max: 47)

Run: 5, exploration: 0.5997278763867329, score: 23
Scores: (min: 17, avg: 25, max: 47)

Run: 6, exploration: 0.5344229416520513, score: 24
Scores: (min: 17, avg: 24.833333333333332, max: 47)

Run: 7, exploration: 0.4351424010585501, score: 42
Scores: (min: 17, avg: 27.285714285714285, max: 47)

Run: 8, exploration: 0.40974000909221303, score: 13
Scores: (min: 13, avg: 25.5, max: 47)

Run: 9, exploration: 0.3386767948568688, score: 39
Scores: (min: 13, avg: 27, max: 47)

Run: 10, exploration: 0.28993519966087045, score: 32
Scores: (min: 13, avg: 27.5, max: 47)

Run: 11, exploration: 0.23489314109365644, score: 43
Scores: (min: 13, avg: 28

KeyboardInterrupt: 

I stopped this run after ~2.5 hours when I noticed that the average score had regressed from the 130s, where it stood when I last checked, to the 90s and then to 88.
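
To get a feel for what lowering `GAMMA` from 0.95 to 0.5 does, it can help to add up a constant reward of 1 per step under each discount factor. The sketch below is a minimal illustration; the helper name `discounted_return` and the 200-step horizon are assumptions made for the example, and 1 / (1 - gamma) is used only as a rough rule of thumb for how far ahead the agent effectively looks.

```python
def discounted_return(gamma, steps, reward=1.0):
    """Sum of reward * gamma**k for k = 0 .. steps - 1."""
    return sum(reward * gamma ** k for k in range(steps))


for gamma in (0.95, 0.5):
    horizon = 1.0 / (1.0 - gamma)   # rough "effective horizon" in steps
    total = discounted_return(gamma, 200)
    print(f"gamma={gamma}: return over 200 steps = {total:.3f}, "
          f"effective horizon ~ {horizon:.0f} steps")
```

With a discount factor of 0.5, rewards more than a few steps ahead contribute almost nothing to the total, whereas 0.95 keeps roughly the next 20 steps in view.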

In [1]:
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  

from scores.score_logger import ScoreLogger  
  
ENV_NAME = "CartPole-v1"
  
GAMMA = 0.95
LEARNING_RATE = 0.001
  
MEMORY_SIZE = 1000000
BATCH_SIZE = 20
  
EXPLORATION_MAX = 1.0
EXPLORATION_MIN = 0.01
EXPLORATION_DECAY = 0.5     # changed exploration decay from 0.995 to 0.5
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print ("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))  
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  

Using TensorFlow backend.


In [2]:
# changed exploration decay from 0.995 to 0.5
cartpole()

Run: 1, exploration: 1.0, score: 15
Scores: (min: 15, avg: 15, max: 15)

Run: 2, exploration: 0.015625, score: 11
Scores: (min: 11, avg: 13, max: 15)

Run: 3, exploration: 0.01, score: 10
Scores: (min: 10, avg: 12, max: 15)

Run: 4, exploration: 0.01, score: 10
Scores: (min: 10, avg: 11.5, max: 15)

Run: 5, exploration: 0.01, score: 10
Scores: (min: 10, avg: 11.2, max: 15)

Run: 6, exploration: 0.01, score: 10
Scores: (min: 10, avg: 11, max: 15)

Run: 7, exploration: 0.01, score: 10
Scores: (min: 10, avg: 10.857142857142858, max: 15)

Run: 8, exploration: 0.01, score: 11
Scores: (min: 10, avg: 10.875, max: 15)

Run: 9, exploration: 0.01, score: 10
Scores: (min: 10, avg: 10.777777777777779, max: 15)

Run: 10, exploration: 0.01, score: 12
Scores: (min: 10, avg: 10.9, max: 15)

Run: 11, exploration: 0.01, score: 14
Scores: (min: 10, avg: 11.181818181818182, max: 15)

Run: 12, exploration: 0.01, score: 36
Scores: (min: 10, avg: 13.25, max: 36)

Run: 13, exploration: 0.01, score: 9
Scores: 
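
The exploration values printed above can be reproduced with a little arithmetic. In `experience_replay`, the rate is multiplied by `EXPLORATION_DECAY` once per call that gets past the `BATCH_SIZE` check, and it is then clamped to `EXPLORATION_MIN`. The sketch below is a minimal illustration of that schedule; the helper names `updates_to_floor` and `rate_after` are hypothetical and not part of the notebook.

```python
import math

EXPLORATION_MAX = 1.0
EXPLORATION_MIN = 0.01


def updates_to_floor(decay, start=EXPLORATION_MAX, floor=EXPLORATION_MIN):
    """Smallest number of decay updates after which the rate is at or below the floor."""
    return math.ceil(math.log(floor / start) / math.log(decay))


def rate_after(decay, updates, start=EXPLORATION_MAX, floor=EXPLORATION_MIN):
    """Exploration rate after a number of decay updates, clamped to the floor."""
    return max(floor, start * decay ** updates)


for decay in (0.995, 0.5):
    print(f"decay={decay}: reaches the floor after {updates_to_floor(decay)} updates")

print(rate_after(0.5, 6))   # six decay updates with a decay of 0.5
```

Under these assumptions, a decay of 0.5 reaches the floor within 7 updates, while 0.995 needs 919. The value 0.015625 printed for Run 2 equals 0.5 ** 6, that is, six decay updates, and from Run 3 onward the agent explores only 1% of the time.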

- **Explain how reinforcement learning concepts apply to the cartpole problem.**
  - What is the goal of the agent in this case?
    - The goal of the agent is to move a cart either left or right in a manner that keeps a pole with a weight at the top balanced on said cart. It can be thought of as balancing a pendulum on your hand, but the range of motion is only on an X-axis rather than X and Z axes.
  - What are the various state values?
    - The state values are found in the _observation\_space_ parameter. They correspond to the cart's position, the cart's velocity, the pole's angle, and the velocity of the pole at its tip (Surma, 2018).
  - What are the possible actions that can be performed?
    - The agent can take only one of two actions: move the cart left (0) or move the cart right (1) (Surma, 2018).
  - What reinforcement algorithm is used for this problem?
    - The code we are given uses the Deep Q-Learning algorithm.
- **Analyze how experience replay is applied to the cartpole problem.**
  - How does experience replay work in this algorithm?
    - Experience replay trains the neural network to predict the Q-values of the state-action pairs, which in turn lets the agent pick the action with the highest expected total reward in a given state. After every time step, the agent stores an experience in memory: the state at that step, the action taken, the reward received, the resulting state, and a flag marking whether the episode ended. Once the memory holds at least BATCH_SIZE (20) experiences, the program draws a random batch from it after each non-terminal step and uses that batch as training data, with the goal of making the network's Q-value estimates accurate (Deeplizard, 2018).
  - What is the effect of introducing a discount factor for calculating the future rewards?
    - The purpose of Q-learning is for the agent to choose actions that maximize the total reward earned, which can take many steps. Rewards further in the future are less certain to be obtained, so a later reward is worth less at the current step than an immediate reward of the same size. For example, compare the possibility of getting a piece of cake in the next 5 minutes with the possibility of getting a piece of cake of the same size, flavor, etc. in 5 days. The reward is the same, but the future is uncertain, so the cake in 5 minutes is inherently more valuable than the cake in 5 days. In the code, the discount factor GAMMA multiplies the highest predicted Q-value of the next state, so a smaller GAMMA makes the agent weigh immediate rewards more heavily than distant ones.
- **Analyze how neural networks are used in deep Q-learning.**
  - Explain the neural network architecture that is used in the cartpole problem.
    - The neural network for this problem is made up of four layers: an input layer, two hidden layers, and an output layer. The input layer has an input shape of 4, which corresponds to the observation space: cart position, cart velocity, pole angle, and pole velocity. It feeds the first hidden layer, which has 24 nodes and passes its values to a second hidden layer, also with 24 nodes. Both hidden layers use the rectified linear unit (ReLU) activation function. The output layer uses a linear activation and has two outputs, one Q-value for each of the two available actions: move the cart left or move the cart right.
  - How does the neural network make the Q-learning algorithm more efficient?
    - The cartpole problem has only two actions, but its four state values are continuous, so a table of Q-values would need one entry per state-action pair after the state has been split into discrete bins. The more states and actions there are, the more state-action pairs there will be, and the more compute-intensive it becomes to store and update each pair's Q-value after each step. The neural network makes this more efficient by learning from past experiences to estimate the Q-values for the available actions at the current state, rather than storing and updating a value for every pair individually.
  - What difference do you see in the algorithm performance when you increase or decrease the learning rate?
    - I changed the learning rate of the algorithm from 0.001 to 0.01, and it was never able to "win" the cartpole game. This is likely caused by the algorithm overcorrecting, that is, adjusting the model's weights too much after each pass. An easier way to understand this is to imagine shooting an arrow at a target and missing to the right by 1 foot. If your learning rate is too high, you adjust your aim to the left, but you aim too far and miss your second shot by 0.9 feet to the left. You try to correct your next shot by aiming to the right, but you overcorrect again and miss to the right by 1.1 feet. You keep overcorrecting, shot after shot, and maybe get close to the center a few times, but the overcorrections throw your aim way off again. If your learning rate is too small, you undercorrect. After that first shot that is off to the right by 1 foot, you adjust your aim too little. Your next shot is still off to the right, but only by 0.98 feet this time. The next shot is closer, but still off to the right by 0.96 feet. It will take you many shots to finally hit the bullseye. To circle back to the question, a learning rate that is too large will cause the algorithm to repeatedly overshoot the optimal solution, while a learning rate that is too small will require far too many epochs to reach it (Brownlee, 2019).
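
The target value that `experience_replay` trains toward can be sketched in isolation. The example below is a minimal, dependency-free illustration rather than the notebook's code: the function name `q_target` and the next-state Q-values are hypothetical, and in the notebook those values come from `self.model.predict(state_next)`.

```python
GAMMA = 0.95  # discount factor used in the baseline cell


def q_target(reward, next_q_values, terminal, gamma=GAMMA):
    """Training target for the action taken in one stored experience."""
    if terminal:
        return reward                           # no future reward once the episode ends
    return reward + gamma * max(next_q_values)  # reward plus discounted best next Q-value


# Hypothetical predicted Q-values for the next state: [move left, move right]
print(q_target(1.0, [0.5, 2.0], terminal=False))   # 1.0 + 0.95 * 2.0
print(q_target(-1.0, [0.5, 2.0], terminal=True))   # terminal step: reward only
```

Only the Q-value of the action that was actually taken is replaced by this target; the network's prediction for the other action is left unchanged before fitting, which mirrors the `q_values[0][action] = q_update` line in the notebook.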

##### References
Brownlee, J. (2019, January 25). _Understand the Impact of Learning Rate on Neural Network Performance_. Retrieved from Machine Learning Mastery: https://machinelearningmastery.com/understand-the-dynamics-of-learning-rate-on-deep-learning-neural-networks/

Deeplizard. (2018, November 2). _Reinforcement Learning - Developing Intelligent Agents_. Retrieved from Deeplizard: https://deeplizard.com/learn/video/Bcuj2fTH4\_4

Surma, G. (2018, September 26). _Cartpole - Introduction to Reinforcement Learning (DQN - Deep Q-Learning)_. Retrieved from Medium: https://gsurma.medium.com/cartpole-introduction-to-reinforcement-learning-ed0eb5b58288