# Module Five Assignment: Cartpole Problem
Review the code in this notebook and in the score_logger.py file in the *scores* folder (directory). Once you have reviewed the code, return to this notebook and select **Cell** and then **Run All** from the menu bar to run this code. The code takes several minutes to run.

In [4]:
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
  
  
from scores.score_logger import ScoreLogger  
  
ENV_NAME = "CartPole-v1"  
  
GAMMA = 0.95  
LEARNING_RATE = 0.001  
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
  
EXPLORATION_MAX = 1.0  
EXPLORATION_MIN = 0.01  
EXPLORATION_DECAY = 0.995  
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
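    def experience_replay(self):  
        # Wait until the buffer holds at least one full batch.  
        if len(self.memory) < BATCH_SIZE:  
            return  
        # Sample stored transitions uniformly at random.  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            # Bellman target: the reward alone on terminal steps, otherwise  
            # the reward plus the discounted best Q-value of the next state.  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            # Overwrite only the taken action's Q-value, then fit on that one sample.  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        # Decay epsilon once per replay call, never below EXPLORATION_MIN.  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  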
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print ("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))  
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  



In [5]:
cartpole()

Run: 1, exploration: 0.995, score: 21
Scores: (min: 21, avg: 21, max: 21)

Run: 2, exploration: 0.9558895783575597, score: 9
Scores: (min: 9, avg: 15, max: 21)

Run: 3, exploration: 0.8390886103705794, score: 27
Scores: (min: 9, avg: 19, max: 27)

Run: 4, exploration: 0.7940753492934954, score: 12
Scores: (min: 9, avg: 17.25, max: 27)

Run: 5, exploration: 0.7402609576967045, score: 15
Scores: (min: 9, avg: 16.8, max: 27)

Run: 6, exploration: 0.7005493475733617, score: 12
Scores: (min: 9, avg: 16, max: 27)

Run: 7, exploration: 0.653073201944699, score: 15
Scores: (min: 9, avg: 15.857142857142858, max: 27)

Run: 8, exploration: 0.5997278763867329, score: 18
Scores: (min: 9, avg: 16.125, max: 27)

Run: 9, exploration: 0.5732736268885887, score: 10
Scores: (min: 9, avg: 15.444444444444445, max: 27)

Run: 10, exploration: 0.5425201222922789, score: 12
Scores: (min: 9, avg: 15.1, max: 27)

Run: 11, exploration: 0.49571413690105054, score: 19
Scores: (min: 9, avg: 15.454545454545455, max: 

NameError: name 'exit' is not defined

Note: If the code is running properly, you should begin to see output appearing above this code block. It will take several minutes, so it is recommended that you let this code run in the background while completing other work. When the code has finished, it will print output saying, "Solved in _ runs, _ total runs."

You may see an error about not having an exit command. This error does not affect the program's functionality and results from the steps taken to convert the code from Python 2.x to Python 3. Please disregard this error.
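
The exploration values in the log above can be reproduced from the scores alone. The sketch below assumes, as the code in the first cell implies, that `EXPLORATION_DECAY` is applied once per call to `experience_replay` that gets past the batch-size check, and that `experience_replay` is skipped on the terminal step of each run. The function name `exploration_after_runs` is hypothetical and is not part of the notebook.

```python
def exploration_after_runs(scores, decay=0.995, eps_max=1.0, eps_min=0.01, batch_size=20):
    """Replay the epsilon schedule implied by a list of episode scores."""
    eps = eps_max
    memory_len = 0  # transitions stored so far (far below MEMORY_SIZE here)
    history = []
    for score in scores:
        for step in range(1, score + 1):
            memory_len += 1             # remember() runs on every step
            terminal = (step == score)  # assumption: the run ends on its last step
            if not terminal and memory_len >= batch_size:
                eps = max(eps_min, eps * decay)  # one decay per replay call
        history.append(eps)
    return history


# Scores of the first three baseline runs, taken from the log above.
for run, eps in enumerate(exploration_after_runs([21, 9, 27]), start=1):
    print(f"Run: {run}, exploration: {eps}")
```

Under these assumptions the printed values should agree with the first three lines of the baseline log (0.995, then roughly 0.9559 and 0.8391), which shows that exploration decays per training step rather than per run.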

In [6]:
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
  
  
from scores.score_logger import ScoreLogger  
  
ENV_NAME = "CartPole-v1"  
  
GAMMA = 0.99                              #changed from 0.95 to 0.99  
LEARNING_RATE = 0.001  
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
  
EXPLORATION_MAX = 1.0  
EXPLORATION_MIN = 0.01  
EXPLORATION_DECAY = 0.995  
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print ("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))  
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  


In [7]:
cartpole()

Run: 1, exploration: 0.8866535105013078, score: 44
Scores: (min: 44, avg: 44, max: 44)

Run: 2, exploration: 0.8183201210226743, score: 17
Scores: (min: 17, avg: 30.5, max: 44)

Run: 3, exploration: 0.7705488893118823, score: 13
Scores: (min: 13, avg: 24.666666666666668, max: 44)

Run: 4, exploration: 0.6305556603555866, score: 41
Scores: (min: 13, avg: 28.75, max: 44)

Run: 5, exploration: 0.6027415843082742, score: 10
Scores: (min: 10, avg: 25, max: 44)

Run: 6, exploration: 0.5761543988830038, score: 10
Scores: (min: 10, avg: 22.5, max: 44)

Run: 7, exploration: 0.5398075216808175, score: 14
Scores: (min: 10, avg: 21.285714285714285, max: 44)

Run: 8, exploration: 0.4932355662165453, score: 19
Scores: (min: 10, avg: 21, max: 44)

Run: 9, exploration: 0.47147873742168567, score: 10
Scores: (min: 10, avg: 19.77777777777778, max: 44)

Run: 10, exploration: 0.4506816115185697, score: 10
Scores: (min: 10, avg: 18.8, max: 44)

Run: 11, exploration: 0.42650460709830135, score: 12
Scores: (

NameError: name 'exit' is not defined

In [1]:
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
  
  
from scores.score_logger import ScoreLogger  
  
ENV_NAME = "CartPole-v1"  
  
GAMMA = 0.99                              # 1 changed from 0.95 to 0.99  
LEARNING_RATE = 0.005                     # 2 changed from 0.001 to 0.005
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
  
EXPLORATION_MAX = 1.0  
EXPLORATION_MIN = 0.01  
EXPLORATION_DECAY = 0.995  
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print ("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))  
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  


Using TensorFlow backend.


In [2]:
cartpole()

Run: 1, exploration: 0.985074875, score: 23
Scores: (min: 23, avg: 23, max: 23)

Run: 2, exploration: 0.9369146928798039, score: 11
Scores: (min: 11, avg: 17, max: 23)

Run: 3, exploration: 0.8690529955452602, score: 16
Scores: (min: 11, avg: 16.666666666666668, max: 23)

Run: 4, exploration: 0.7940753492934954, score: 19
Scores: (min: 11, avg: 17.25, max: 23)

Run: 5, exploration: 0.7111635524897149, score: 23
Scores: (min: 11, avg: 18.4, max: 23)

Run: 6, exploration: 0.653073201944699, score: 18
Scores: (min: 11, avg: 18.333333333333332, max: 23)

Run: 7, exploration: 0.6118738784280476, score: 14
Scores: (min: 11, avg: 17.714285714285715, max: 23)

Run: 8, exploration: 0.567555222460375, score: 16
Scores: (min: 11, avg: 17.5, max: 23)

Run: 9, exploration: 0.5425201222922789, score: 10
Scores: (min: 10, avg: 16.666666666666668, max: 23)

Run: 10, exploration: 0.5211953074858876, score: 9
Scores: (min: 9, avg: 15.9, max: 23)

Run: 11, exploration: 0.4858739637363176, score: 15
Score

SystemExit: 

  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)


In [3]:
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
  
  
from scores.score_logger import ScoreLogger  
  
ENV_NAME = "CartPole-v1"  
  
GAMMA = 0.95                              # 3 reverted from 0.99 to the baseline 0.95  
LEARNING_RATE = 0.0005                    # 3 changed from 0.005 to 0.0005
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
  
EXPLORATION_MAX = 1.0  
EXPLORATION_MIN = 0.01  
EXPLORATION_DECAY = 0.995  
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print ("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))  
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  


In [4]:
cartpole()

Run: 1, exploration: 1.0, score: 14
Scores: (min: 14, avg: 14, max: 14)

Run: 2, exploration: 0.9703725093562657, score: 12
Scores: (min: 12, avg: 13, max: 14)

Run: 3, exploration: 0.9046104802746175, score: 15
Scores: (min: 12, avg: 13.666666666666666, max: 15)

Run: 4, exploration: 0.8475428503023453, score: 14
Scores: (min: 12, avg: 13.75, max: 15)

Run: 5, exploration: 0.7402609576967045, score: 28
Scores: (min: 12, avg: 16.6, max: 28)

Run: 6, exploration: 0.6900935609921609, score: 15
Scores: (min: 12, avg: 16.333333333333332, max: 28)

Run: 7, exploration: 0.6465587967553006, score: 14
Scores: (min: 12, avg: 16, max: 28)

Run: 8, exploration: 0.5997278763867329, score: 16
Scores: (min: 12, avg: 16, max: 28)

Run: 9, exploration: 0.5704072587541458, score: 11
Scores: (min: 11, avg: 15.444444444444445, max: 28)

Run: 10, exploration: 0.5344229416520513, score: 14
Scores: (min: 11, avg: 15.3, max: 28)

Run: 11, exploration: 0.510849320360386, score: 10
Scores: (min: 10, avg: 14.81

SystemExit: 

  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)


# Reinforcement Learning Analysis: CartPole Problem

## Reinforcement Learning Concepts in CartPole

The CartPole problem demonstrates key reinforcement learning concepts: an agent learns to balance a pole on a moving cart by choosing actions that maximize the cumulative reward it receives.

| Concept               | Implementation Details                                                      |
|-----------------------|-----------------------------------------------------------------------------|
| **Agent Goal**        | Keep the pole balanced as long as possible (episodes cap at 500 timesteps)  |
| **State Values**      | [Cart position, Cart velocity, Pole angle, Pole angular velocity]           |
| **Possible Actions**  | 0 (push cart left) or 1 (push cart right)                                   |
| **Algorithm**         | Deep Q-Network with experience replay                                       |
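
The state and action sizes in the table fix the network's input and output widths. As a rough illustration, the NumPy sketch below mirrors the layer sizes used in `DQNSolver` (4 inputs, two hidden layers of 24 units, 2 outputs). The random weights, the sample observation, and the helper name `q_values` are assumptions made for illustration; this is not the Keras model from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [4, 24, 24, 2]  # observation size, two hidden layers, number of actions

# One (weights, biases) pair per Dense layer; random values stand in for trained ones.
params = [
    (rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out))
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]


def q_values(state, params):
    """Forward pass: ReLU on hidden layers, linear output (one Q-value per action)."""
    x = state
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x


n_params = sum(w.size + b.size for w, b in params)
state = np.array([[0.01, -0.02, 0.03, 0.04]])  # made-up observation, shape (1, 4)
q = q_values(state, params)
action = int(np.argmax(q[0]))  # greedy choice, analogous to DQNSolver.act

print(n_params)  # 770 = (4*24 + 24) + (24*24 + 24) + (24*2 + 2)
print(q.shape)   # (1, 2)
```

With only 770 trainable parameters the network is small, which is one reason a single-sample `fit` call per transition is feasible in the notebook.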

## Experimental Results

| Configuration                    | Runs to Solve | Total Runs | Max Score | Avg Score |
|----------------------------------|---------------|------------|-----------|-----------|
| Baseline (Gamma=0.95, LR=0.001)  | 83            | 183        | 500       | 195.4     |
| Gamma=0.99, LR=0.001             | 321           | 421        | 500       | 197.84    |
| Gamma=0.99, LR=0.005             | 372           | 472        | 500       | 196.64    |
| Gamma=0.95, LR=0.0005            | 211           | 311        | 500       | 195.66    |
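
To make the two discount factors in the table more concrete, the sketch below compares them on an idealized 500-step episode with a reward of +1 per step. It ignores the notebook's sign flip on the terminal reward, and the helper name `discounted_return` is hypothetical. The quantity 1 / (1 - Gamma) is a common rule of thumb for the effective planning horizon.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t, accumulated backwards from the last reward."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


rewards = [1.0] * 500  # idealized full-length CartPole-v1 episode
for gamma in (0.95, 0.99):
    horizon = 1.0 / (1.0 - gamma)
    print(gamma, round(horizon), round(discounted_return(rewards, gamma), 2))
```

Gamma=0.95 gives a horizon of about 20 steps and a return close to 20, while Gamma=0.99 gives a horizon of about 100 steps and a return just under 100, so the Q-values the network must fit are roughly five times larger and depend on rewards much further ahead.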

## Key Findings

1. **Discount Factor (γ) Impact**:
   - Higher Gamma = longer training: raising Gamma from 0.95 to 0.99 (an increase of 0.04) at LR=0.001 increased runs to solve from 83 to 321
   - Suggests higher Gamma values slow learning in exchange for a slightly higher average score (197.84 vs 195.4)
   - Maintained maximum score of 500 in all configurations


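2. **Learning Rate Effects**:
   - The best learning rate in these runs was 0.001:
     - Fastest convergence, in the baseline configuration (83 runs to solve)
     - Best balance between stability and training speed
   - Raising it to 0.005 (at Gamma=0.99) increased runs to solve from 321 to 372 and total runs from 421 to 472. The average score (196.64) was higher than the baseline's 195.4 but slightly below the 197.84 obtained at LR=0.001 with the same Gamma.
   - Lowering it to 0.0005 (at Gamma=0.95) took 211 runs to solve and 311 total runs. That is about 160 fewer of each than the 0.005 run, with an average score about 1 point lower (195.66 vs 196.64), but still 128 more runs to solve than the baseline.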
   
   
3. **Experience replay applied to the CartPole problem**:
    - In the CartPole implementation, experience replay enhances the DQN algorithm's stability and efficiency: each transition is stored in a deque of up to 1,000,000 entries, and every training step fits the network on 20 transitions sampled at random, which breaks the correlation between consecutive steps.
    - During updates, the algorithm combines the immediate reward with the discounted best Q-value of the next state, using γ (set to 0.95 or 0.99), following the Bellman equation. The discount factor creates a tradeoff: in these runs γ=0.99 gave a slightly higher average score (197.84 vs 195.4) but needed about 3.9 times as many training runs to solve (321 vs 83) compared to γ=0.95. A likely explanation is that γ=0.99 makes the agent weight future pole stability more heavily, which lengthens the horizon its Q-values must capture, while γ=0.95 focuses more on short-term rewards.
  
  
4. **Neural networks are used in deep Q-learning**:

In CartPole the neural network serves as a function approximator that lets the agent handle the environment's continuous state space, where a lookup table of Q-values is not practical. The code's DQNSolver class takes the 4-dimensional state as input and outputs one Q-value for each of the two discrete actions. The network's efficiency stems from two features: first, the ability to generalize across similar states; second, training on batches drawn through experience replay. The learning rate also affects performance. In the experiments, the higher rate (0.005) converged more slowly (372 runs vs the baseline's 83), plausibly due to parameter overshooting, although that run also used γ=0.99, so the two effects cannot be separated. The lower rate (0.0005) produced smaller, steadier updates but required more episodes than the baseline (211 runs vs 83). This aligns with established deep RL principles, where neural networks trade off sample efficiency for generalization capability.
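
The Bellman update described in the findings above can be sketched in isolation. The example below assumes the same rule as `experience_replay` in the notebook: the target is the reward alone on a terminal step and the reward plus the discounted maximum next-state Q-value otherwise. The helper name `td_target` and all Q-values are made up for illustration.

```python
import numpy as np


def td_target(reward, next_q_values, terminal, gamma):
    """Target for the action taken: r if terminal, else r + gamma * max Q(s', a')."""
    if terminal:
        return reward
    return reward + gamma * float(np.max(next_q_values))


next_q = np.array([2.0, 3.0])  # made-up Q-values predicted for the next state

# The same transition yields a larger target under the larger discount factor.
for gamma in (0.95, 0.99):
    print(gamma, td_target(1.0, next_q, terminal=False, gamma=gamma))

# On the terminal step the notebook flips the reward sign, so the target is -1.0.
print(td_target(-1.0, next_q, terminal=True, gamma=0.95))

# Only the entry for the action actually taken is overwritten before fitting.
q_values = np.array([[0.5, 0.7]])  # made-up current predictions, shape (1, 2)
action = 1
q_values[0][action] = td_target(1.0, next_q, terminal=False, gamma=0.95)
print(q_values)
```

Because the untouched entry keeps the network's own prediction, its contribution to the mean squared error is zero and the fit only moves the Q-value of the action that was taken.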
