# Module Five Assignment: Cartpole Problem
Review the code in this notebook and in the score_logger.py file in the *scores* folder (directory). Once you have reviewed the code, return to this notebook and select **Cell** and then **Run All** from the menu bar to run this code. The code takes several minutes to run.

In [1]:
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
  
  
from scores.score_logger import ScoreLogger  
  
ENV_NAME = "CartPole-v1"  
  
GAMMA = 0.95  
LEARNING_RATE = 0.001  
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
  
EXPLORATION_MAX = 1.0  
EXPLORATION_MIN = 0.01  
EXPLORATION_DECAY = 0.995  
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  



Using TensorFlow backend.


In [2]:
cartpole()

Run: 1, exploration: 0.9416228069143757, score: 32
Scores: (min: 32, avg: 32, max: 32)

Run: 2, exploration: 0.8265651079747222, score: 27
Scores: (min: 27, avg: 29.5, max: 32)

Run: 3, exploration: 0.7822236754458713, score: 12
Scores: (min: 12, avg: 23.666666666666668, max: 32)

Run: 4, exploration: 0.7111635524897149, score: 20
Scores: (min: 12, avg: 22.75, max: 32)

Run: 5, exploration: 0.6180388156137953, score: 29
Scores: (min: 12, avg: 24, max: 32)

Run: 6, exploration: 0.5618938591163328, score: 20
Scores: (min: 12, avg: 23.333333333333332, max: 32)

Run: 7, exploration: 0.531750826943791, score: 12
Scores: (min: 12, avg: 21.714285714285715, max: 32)

Run: 8, exploration: 0.4858739637363176, score: 19
Scores: (min: 12, avg: 21.375, max: 32)

Run: 9, exploration: 0.446186062443672, score: 18
Scores: (min: 12, avg: 21, max: 32)

Run: 10, exploration: 0.42650460709830135, score: 10
Scores: (min: 10, avg: 19.9, max: 32)

Run: 11, exploration: 0.40974000909221303, score: 9
Scores: (

NameError: name 'exit' is not defined

Note: If the code is running properly, you should begin to see output appearing above this code block. It will take several minutes, so it is recommended that you let this code run in the background while completing other work. When the code has finished, it will print output saying, "Solved in _ runs, _ total runs."

You may see an error about not having an exit command. This error does not affect the program's functionality and results from the steps taken to convert the code from Python 2.x to Python 3. Please disregard this error.

In [5]:
# Experiment 1: Adjusting Exploration Factor for Quicker Runs
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  

from scores.score_logger import ScoreLogger  

ENV_NAME = "CartPole-v1"

# Parameters
GAMMA = 0.95  
LEARNING_RATE = 0.001  

MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  

# Adjusted Exploration Factor
EXPLORATION_MAX = 1.0  
EXPLORATION_MIN = 0.05  # Higher than the 0.01 baseline, so the floor on exploration is raised
EXPLORATION_DECAY = 0.995  # Same decay value as the baseline

class DQNSolver:  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  

        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  

    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  

    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  

    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  

def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    max_runs = 500  # Set a maximum number of runs
    while run < max_runs:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))  
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  

cartpole()


Run: 1, exploration: 1.0, score: 13
Scores: (min: 13, avg: 13, max: 13)

Run: 2, exploration: 0.985074875, score: 10
Scores: (min: 10, avg: 11.5, max: 13)

Run: 3, exploration: 0.8142285204175609, score: 39
Scores: (min: 10, avg: 20.666666666666668, max: 39)

Run: 4, exploration: 0.7183288830986236, score: 26
Scores: (min: 10, avg: 22, max: 39)

Run: 5, exploration: 0.6696478204705644, score: 15
Scores: (min: 10, avg: 20.6, max: 39)

Run: 6, exploration: 0.6088145090359074, score: 20
Scores: (min: 10, avg: 20.5, max: 39)

Run: 7, exploration: 0.5732736268885887, score: 13
Scores: (min: 10, avg: 19.428571428571427, max: 39)

Run: 8, exploration: 0.46912134373457726, score: 41
Scores: (min: 10, avg: 22.125, max: 41)

Run: 9, exploration: 0.32050833588933575, score: 77
Scores: (min: 10, avg: 28.22222222222222, max: 77)

Run: 10, exploration: 0.2384520680152932, score: 60
Scores: (min: 10, avg: 31.4, max: 77)

Run: 11, exploration: 0.13001512070402377, score: 122
Scores: (min: 10, avg: 39.

NameError: name 'exit' is not defined

In [7]:
# Experiment 2: Adjusting Discount Factor with Baseline Values
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  

from scores.score_logger import ScoreLogger  

ENV_NAME = "CartPole-v1"

# Parameters (Baseline values except for GAMMA)
GAMMA = 0.85  # Lower discount factor to prioritize immediate rewards
LEARNING_RATE = 0.001  # Baseline learning rate

MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  

# Exploration Parameters (Baseline values)
EXPLORATION_MAX = 1.0  
EXPLORATION_MIN = 0.01  
EXPLORATION_DECAY = 0.995  

class DQNSolver:  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  

        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  

    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  

    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  

    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  

def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    max_runs = 500  # Set a maximum number of runs to limit execution time
    while run < max_runs:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))  
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  

cartpole()


Run: 1, exploration: 1.0, score: 17
Scores: (min: 17, avg: 17, max: 17)

Run: 2, exploration: 0.9229311239742362, score: 19
Scores: (min: 17, avg: 18, max: 19)

Run: 3, exploration: 0.8265651079747222, score: 23
Scores: (min: 17, avg: 19.666666666666668, max: 23)

Run: 4, exploration: 0.7255664080186093, score: 27
Scores: (min: 17, avg: 21.5, max: 27)

Run: 5, exploration: 0.6211445383053219, score: 32
Scores: (min: 17, avg: 23.6, max: 32)

Run: 6, exploration: 0.43952667968844233, score: 70
Scores: (min: 17, avg: 31.333333333333332, max: 70)

Run: 7, exploration: 0.3669578217261671, score: 37
Scores: (min: 17, avg: 32.142857142857146, max: 70)

Run: 8, exploration: 0.28417984116121187, score: 52
Scores: (min: 17, avg: 34.625, max: 70)

Run: 9, exploration: 0.20931540516921554, score: 62
Scores: (min: 17, avg: 37.666666666666664, max: 70)

Run: 10, exploration: 0.09770335251664321, score: 153
Scores: (min: 17, avg: 49.2, max: 153)

Run: 11, exploration: 0.027488364100186506, score: 254

In [None]:
# Experiment 3: Adjusting Learning Rate
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  

from scores.score_logger import ScoreLogger  

ENV_NAME = "CartPole-v1"

# Parameters (Baseline values except for LEARNING_RATE)
GAMMA = 0.95  # Baseline discount factor
LEARNING_RATE = 0.01  # Increased learning rate for faster learning

MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  

# Exploration Parameters (Baseline values)
EXPLORATION_MAX = 1.0  
EXPLORATION_MIN = 0.01  
EXPLORATION_DECAY = 0.995  

class DQNSolver:  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  

        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  

    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  

    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  

    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  

def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  # No max_runs, as in the base code
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))  
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  

cartpole()


Run: 1, exploration: 0.9416228069143757, score: 32
Scores: (min: 32, avg: 32, max: 32)

Run: 2, exploration: 0.8647077305675338, score: 18
Scores: (min: 18, avg: 25, max: 32)

Run: 3, exploration: 0.8224322824348486, score: 11
Scores: (min: 11, avg: 20.333333333333332, max: 32)

Run: 4, exploration: 0.7477194593032545, score: 20
Scores: (min: 11, avg: 20.25, max: 32)

Run: 5, exploration: 0.7111635524897149, score: 11
Scores: (min: 11, avg: 18.4, max: 32)

Run: 6, exploration: 0.5967292370047992, score: 36
Scores: (min: 11, avg: 21.333333333333332, max: 36)

Run: 7, exploration: 0.567555222460375, score: 11
Scores: (min: 11, avg: 19.857142857142858, max: 36)

Run: 8, exploration: 0.531750826943791, score: 14
Scores: (min: 11, avg: 19.125, max: 36)

Run: 9, exploration: 0.5032248303978422, score: 12
Scores: (min: 11, avg: 18.333333333333332, max: 36)

Run: 10, exploration: 0.47622912292284103, score: 12
Scores: (min: 11, avg: 17.7, max: 36)

Run: 11, exploration: 0.45522245551230495, sc

Analysis of CartPole Problem Using Reinforcement Learning
1. How Reinforcement Learning Concepts Apply to the CartPole Problem
Reinforcement learning (RL) is a method where an agent learns to make sequential decisions through interactions with an environment to maximize cumulative rewards (Barto et al., 1983). In the CartPole problem, the agent learns by trial and error how to balance a pole on a moving cart by taking actions that lead to the highest cumulative reward.

2. The Goal of the Agent
The goal of the agent in the CartPole problem is to keep the pole balanced for as many time steps as possible. The agent receives a reward at each time step for successfully balancing the pole, with the episode ending when the pole falls over or the cart moves out of bounds.
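The notebook's training loop turns this goal into a reward signal by flipping the sign of the last reward of each run (`reward if not terminal else -reward`). The sketch below isolates that rule. The names `shaped_reward` and `episode_rewards` are illustrative and do not appear in the notebook, and the per-step reward of 1.0 is assumed to match what CartPole returns.

```python
def shaped_reward(reward, terminal):
    """Keep the step reward, but negate it on the step that ends the run."""
    return reward if not terminal else -reward


def episode_rewards(num_steps, step_reward=1.0):
    """Rewards stored for a run that ends on its last step."""
    return [
        shaped_reward(step_reward, terminal=(i == num_steps - 1))
        for i in range(num_steps)
    ]


rewards = episode_rewards(5)
print(rewards)
```

Every surviving step is rewarded and the step that ends the run is penalized, so longer runs accumulate more reward.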
3. State Values
The state in the CartPole problem is represented by four values:
- Position of the cart.
- Velocity of the cart.
- Angle of the pole.
- Angular velocity of the pole.

These state values describe the environment's current situation, which the agent uses to decide on its next action.
4. Possible Actions
The agent has two possible actions:
- Move the cart to the left.
- Move the cart to the right.

These actions directly affect the cart's movement and the pole's stability, determining the success or failure of each episode.
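As a rough illustration of how the four state values and two actions fit together, the sketch below gives each a name. It assumes Gym's CartPole ordering of the observation (cart position, cart velocity, pole angle, pole angular velocity) and its action encoding (0 pushes left, 1 pushes right). `CartPoleState` and `describe` are hypothetical helpers, not part of the notebook.

```python
from typing import NamedTuple


class CartPoleState(NamedTuple):
    cart_position: float
    cart_velocity: float
    pole_angle: float  # radians
    pole_angular_velocity: float


PUSH_LEFT, PUSH_RIGHT = 0, 1  # assumed Gym action encoding


def describe(observation, action):
    """Turn a raw observation and action into a readable summary."""
    state = CartPoleState(*observation)
    direction = "left" if action == PUSH_LEFT else "right"
    return (
        f"cart at {state.cart_position:+.2f}, "
        f"pole angle {state.pole_angle:+.3f} rad, pushing {direction}"
    )


summary = describe([0.0, 0.1, 0.02, -0.3], PUSH_RIGHT)
print(summary)
```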
5. Reinforcement Algorithm Used
The algorithm used in this problem is Deep Q-Learning, an extension of the Q-learning algorithm. Q-learning repeatedly updates the expected cumulative reward (Q-value) of each action in a given state so that the values converge toward those of an optimal policy (Mnih et al., 2015). The "deep" aspect is the use of a neural network to approximate these Q-values, which makes continuous state spaces such as CartPole's tractable.
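The update rule is easiest to see in tabular form. The following is a minimal sketch of one Q-learning update and not the notebook's code: the step size `ALPHA`, the string state labels, and the function name `q_update` are assumptions made for illustration. The notebook computes the same target (`reward + GAMMA * max Q(next_state)`) but fits a neural network to it instead of writing to a table.

```python
from collections import defaultdict

GAMMA = 0.95  # discount factor, as in the baseline notebook
ALPHA = 0.1   # step size; an assumed value for this tabular sketch


def q_update(q_table, state, action, reward, next_state, terminal, n_actions=2):
    """Move Q(state, action) a fraction ALPHA toward the bootstrapped target."""
    target = reward
    if not terminal:
        best_next = max(q_table[(next_state, a)] for a in range(n_actions))
        target += GAMMA * best_next
    q_table[(state, action)] += ALPHA * (target - q_table[(state, action)])
    return q_table[(state, action)]


q = defaultdict(float)  # unseen (state, action) pairs start at 0.0
q[("s1", 0)] = 2.0
q[("s1", 1)] = 4.0

# Target is 1.0 + 0.95 * 4.0 = 4.8, so the value moves from 0.0 to 0.48.
new_value = q_update(q, "s0", 1, reward=1.0, next_state="s1", terminal=False)
print(new_value)
```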
6. How Experience Replay Is Applied
Experience replay stores the agent's experiences (state, action, reward, next state, and a flag marking whether the run ended) in a buffer, from which the agent samples random batches for training. Sampling at random breaks the temporal correlation between consecutive experiences, which improves the stability and convergence of the learning algorithm (Lin, 1992). In the CartPole notebook, once at least 20 experiences are stored, the agent samples a random batch of 20 after each non-terminal step and uses it to update the neural network's weights.
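The buffer itself needs very little code. The sketch below separates it from the network so that the storage and sampling behavior can be seen on its own. The class name `ReplayBuffer` is made up for this example; the notebook keeps the same `deque` directly on `DQNSolver`.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest entry when full.
        self.memory = deque(maxlen=capacity)

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Return a random batch, or an empty list if too few items are stored."""
        if len(self.memory) < batch_size:
            return []
        return random.sample(self.memory, batch_size)


buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.remember(step, 0, 1.0, step + 1, False)

stored_states = [experience[0] for experience in buffer.memory]
batch = buffer.sample(2)
```

With a capacity of 3, only the three most recent experiences remain after five insertions. The notebook's capacity of 1,000,000 is large enough that nothing is dropped in short training sessions.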
7. Effect of the Discount Factor (GAMMA) on Future Rewards
The discount factor (GAMMA) determines the importance of future rewards. A high GAMMA (close to 1) focuses on long-term gains, while a lower GAMMA emphasizes short-term rewards. In our experiment, setting GAMMA to 0.85 resulted in the agent's failure to solve the problem within 500 runs, suggesting that prioritizing immediate rewards can hinder the agent's ability to learn an effective long-term strategy.
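To get a feel for how much the two GAMMA values differ, the sketch below computes a discounted return and the rule-of-thumb "effective horizon" 1 / (1 - GAMMA). The horizon formula is a common heuristic brought in here for illustration and is not taken from the notebook. The function names are likewise hypothetical.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward k steps ahead is weighted by gamma**k."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def effective_horizon(gamma):
    """Approximate number of future steps that still carry noticeable weight."""
    return 1.0 / (1.0 - gamma)


example = discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25
horizon_baseline = effective_horizon(0.95)
horizon_lowered = effective_horizon(0.85)
print(example, horizon_baseline, horizon_lowered)
```

Under this heuristic the baseline agent looks roughly 20 steps ahead, while GAMMA = 0.85 shortens that to under 7 steps.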
8. Neural Network Architecture in Deep Q-Learning
The neural network used in the CartPole problem is a fully connected feedforward network:
- Input layer: Takes the state values (4 nodes for the cart's position, velocity, pole's angle, and angular velocity).
- Hidden layers: Two hidden layers with 24 neurons each, using ReLU activation functions to introduce non-linearity (Glorot & Bengio, 2010).
- Output layer: Contains two neurons representing the possible actions (left and right), outputting the expected Q-values for each action (Mnih et al., 2015).
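To make the layer sizes concrete, the following sketch counts the trainable parameters of a network with these widths. It assumes each dense layer has a weight matrix plus one bias per output unit, which is the default for Keras `Dense` layers. The helper name `dense_params` is made up for illustration.

```python
def dense_params(n_inputs, n_outputs):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_outputs + n_outputs


layer_sizes = [4, 24, 24, 2]  # input, hidden, hidden, output

per_layer = [
    dense_params(n_in, n_out)
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
]
total_params = sum(per_layer)
print(per_layer, total_params)
```

The three layers contribute 120, 600, and 50 parameters, for 770 in total.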
9. How the Neural Network Makes the Q-Learning Algorithm More Efficient
In traditional Q-learning, the agent maintains a Q-table for all state-action pairs, which becomes impractical for environments with large or continuous state spaces. Using a neural network allows the agent to generalize across similar states, approximating Q-values for unseen states, which significantly enhances learning efficiency (Mnih et al., 2015).
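A quick calculation shows why a table becomes impractical. The sketch below assumes each of the four state values is discretized into the same number of bins. This discretization scheme is a hypothetical one chosen for illustration, since the notebook never builds a Q-table.

```python
def q_table_entries(bins_per_dimension, state_dimensions=4, actions=2):
    """Number of Q-values a table needs for a uniformly discretized state."""
    return bins_per_dimension ** state_dimensions * actions


coarse_table = q_table_entries(10)
fine_table = q_table_entries(100)
print(coarse_table, fine_table)
```

Even a coarse grid of 10 bins per value needs 20,000 entries, and 100 bins per value needs 200 million, each of which must be visited to be learned. A small network with two hidden layers of 24 units has only a few hundred weights and shares them across all states.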
10. Observations on the Learning Rate (LEARNING_RATE)
The learning rate controls how much the network adjusts its weights in response to each update. A higher learning rate can speed up learning but risks instability (Glorot & Bengio, 2010). In our experiment with an increased learning rate (0.01, ten times the 0.001 baseline), the agent did not solve the problem and timed out after 1,069 runs, which suggests that the overly aggressive updates caused the learning process to oscillate.
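The oscillation effect can be reproduced on a toy problem. The sketch below runs plain gradient descent on a one-dimensional quadratic. It is not the notebook's Adam optimizer or its network, and the curvature of 250 is an arbitrary value chosen so that the two learning rates fall on opposite sides of the stability limit.

```python
def gradient_descent(learning_rate, curvature=250.0, start=1.0, steps=50):
    """Minimize f(w) = 0.5 * curvature * w**2 and return every visited w."""
    w = start
    path = [w]
    for _ in range(steps):
        gradient = curvature * w
        w -= learning_rate * gradient
        path.append(w)
    return path


slow = gradient_descent(0.001)  # each step multiplies w by 0.75
fast = gradient_descent(0.01)   # each step multiplies w by -1.5

print(abs(slow[-1]), abs(fast[-1]))
```

With the smaller rate, `w` shrinks steadily toward the minimum at 0. With the larger rate, `w` overshoots, changes sign on every step, and grows without bound.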

11. Observations on the Exploration Factor
The exploration factor encourages the agent to explore various actions to learn an effective policy. Adjusting EXPLORATION_MIN and EXPLORATION_DECAY directly shifts the agent's exploration-exploitation trade-off. In our experiment, aggressive decay and a lower minimum exploration rate resulted in 100 runs without solving the problem, implying insufficient exploration of the environment.
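The decay schedule can be worked out directly from the constants. In the notebook, the exploration rate is multiplied by EXPLORATION_DECAY once per call to `experience_replay` that has enough stored experiences. The sketch below counts how many such updates are needed to reach a given floor; the function names are illustrative and not part of the notebook.

```python
import math

EXPLORATION_MAX = 1.0
EXPLORATION_DECAY = 0.995


def exploration_after(updates, exploration_min):
    """Exploration rate after a number of replay updates, clipped at the floor."""
    return max(exploration_min, EXPLORATION_MAX * EXPLORATION_DECAY ** updates)


def updates_to_floor(exploration_min):
    """Smallest number of replay updates at which the rate reaches the floor."""
    ratio = exploration_min / EXPLORATION_MAX
    return math.ceil(math.log(ratio) / math.log(EXPLORATION_DECAY))


baseline_updates = updates_to_floor(0.01)
raised_floor_updates = updates_to_floor(0.05)
after_twelve = exploration_after(12, 0.01)
print(baseline_updates, raised_floor_updates, after_twelve)
```

A floor of 0.01 is reached after 919 updates and a floor of 0.05 after 598. Twelve updates give a rate of about 0.9416, which matches the value printed for Run 1 of the baseline. That run lasted 32 steps, and replay only begins once 20 experiences are stored.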


References
Barto, A. G., Sutton, R. S., & Anderson, C. W. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, SMC-13(5), 834–846. https://ieeexplore.ieee.org/document/6313077

Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 249–256. https://proceedings.mlr.press/v9/glorot10a.html

Lin, L. (1992). Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3–4), 293–321. https://doi.org/10.1007/bf00992699

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., & Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533. https://doi.org/10.1038/nature14236

