# Module Five Assignment: Cartpole Problem
Review the code in this notebook and in the score_logger.py file in the *scores* folder (directory). Once you have reviewed the code, return to this notebook and select **Cell** and then **Run All** from the menu bar to run this code. The code takes several minutes to run.

In [6]:
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
  
  
from scores.score_logger import ScoreLogger  
  
ENV_NAME = "CartPole-v1"  
  
GAMMA = 0.95  
LEARNING_RATE = 0.001  
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
  
EXPLORATION_MAX = 1.0  
EXPLORATION_MIN = 0.01  
EXPLORATION_DECAY = 0.995  
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print ("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))  
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  



In [7]:
cartpole()

Run: 1, exploration: 1.0, score: 13
Scores: (min: 13, avg: 13, max: 13)

Run: 2, exploration: 0.9703725093562657, score: 13
Scores: (min: 13, avg: 13, max: 13)

Run: 3, exploration: 0.8224322824348486, score: 34
Scores: (min: 13, avg: 20, max: 34)

Run: 4, exploration: 0.7255664080186093, score: 26
Scores: (min: 13, avg: 21.5, max: 34)

Run: 5, exploration: 0.6662995813682115, score: 18
Scores: (min: 13, avg: 20.8, max: 34)

Run: 6, exploration: 0.567555222460375, score: 33
Scores: (min: 13, avg: 22.833333333333332, max: 34)

Run: 7, exploration: 0.4932355662165453, score: 29
Scores: (min: 13, avg: 23.714285714285715, max: 34)

Run: 8, exploration: 0.43732904629000013, score: 25
Scores: (min: 13, avg: 23.875, max: 34)

Run: 9, exploration: 0.3858205374665315, score: 26
Scores: (min: 13, avg: 24.11111111111111, max: 34)

Run: 10, exploration: 0.33698341088258443, score: 28
Scores: (min: 13, avg: 24.5, max: 34)

Run: 11, exploration: 0.28704309604425327, score: 33
Scores: (min: 13, avg: 

Run: 90, exploration: 0.01, score: 170
Scores: (min: 13, avg: 135.66666666666666, max: 328)

Run: 91, exploration: 0.01, score: 120
Scores: (min: 13, avg: 135.4945054945055, max: 328)

Run: 92, exploration: 0.01, score: 118
Scores: (min: 13, avg: 135.30434782608697, max: 328)

Run: 93, exploration: 0.01, score: 38
Scores: (min: 13, avg: 134.25806451612902, max: 328)

Run: 94, exploration: 0.01, score: 170
Scores: (min: 13, avg: 134.63829787234042, max: 328)

Run: 95, exploration: 0.01, score: 241
Scores: (min: 13, avg: 135.7578947368421, max: 328)

Run: 96, exploration: 0.01, score: 193
Scores: (min: 13, avg: 136.35416666666666, max: 328)

Run: 97, exploration: 0.01, score: 150
Scores: (min: 13, avg: 136.49484536082474, max: 328)

Run: 98, exploration: 0.01, score: 176
Scores: (min: 13, avg: 136.89795918367346, max: 328)

Run: 99, exploration: 0.01, score: 219
Scores: (min: 13, avg: 137.72727272727272, max: 328)

Run: 100, exploration: 0.01, score: 157
Scores: (min: 13, avg: 137.92, ma

Run: 189, exploration: 0.01, score: 133
Scores: (min: 12, avg: 188.84, max: 500)

Run: 190, exploration: 0.01, score: 120
Scores: (min: 12, avg: 188.34, max: 500)

Run: 191, exploration: 0.01, score: 276
Scores: (min: 12, avg: 189.9, max: 500)

Run: 192, exploration: 0.01, score: 192
Scores: (min: 12, avg: 190.64, max: 500)

Run: 193, exploration: 0.01, score: 388
Scores: (min: 12, avg: 194.14, max: 500)

Run: 194, exploration: 0.01, score: 192
Scores: (min: 12, avg: 194.36, max: 500)

Run: 195, exploration: 0.01, score: 351
Scores: (min: 12, avg: 195.46, max: 500)

Solved in 95 runs, 195 total runs.


NameError: name 'exit' is not defined

Note: If the code is running properly, you should begin to see output appearing above this code block. It will take several minutes, so it is recommended that you let this code run in the background while completing other work. When the code has finished, it will print output saying, "Solved in _ runs, _ total runs."

You may see an error about not having an exit command. This error does not affect the program's functionality and results from the steps taken to convert the code from Python 2.x to Python 3. Please disregard this error.

In [None]:
# Changing the learning rate: LEARNING_RATE is raised from 0.001 to 0.005; everything else matches the cell above
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
  
  
from scores.score_logger import ScoreLogger  
  
ENV_NAME = "CartPole-v1"  
  
GAMMA = 0.95  
LEARNING_RATE = 0.005  
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
  
EXPLORATION_MAX = 1.0  
EXPLORATION_MIN = 0.01  
EXPLORATION_DECAY = 0.995  
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print ("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))  
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  


Explain how reinforcement learning concepts apply to the cartpole problem.
    What is the goal of the agent in this case?
        The goal of the agent is to keep the pole balanced upright on the cart for as many time steps as it can. The score for a run is the number of steps the pole stayed up, so a longer run means a higher score.
        
    What are the various state values?
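        Each state is a vector of four continuous values: the cart position, the cart velocity, the pole angle, and the pole's angular velocity. A run ends when the pole tilts too far, the cart moves too far from the center, or the step limit is reached.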
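
As a small illustration, the sketch below attaches names to the four numbers that `env.reset()` and `env.step()` return in CartPole-v1. The `CartPoleState` type, the `label_observation` helper, and the sample numbers are assumptions made for this sketch and are not part of the notebook; the component order follows the CartPole documentation.

```python
from collections import namedtuple

# Hypothetical helper type: the field names are labels chosen for this sketch.
CartPoleState = namedtuple(
    "CartPoleState",
    ["cart_position", "cart_velocity", "pole_angle", "pole_angular_velocity"],
)


def label_observation(observation):
    """Attach names to the 4 numbers in one CartPole observation."""
    if len(observation) != 4:
        raise ValueError("CartPole observations have exactly 4 components")
    return CartPoleState(*observation)


# Made-up sample values, not taken from a real run.
sample = label_observation([0.02, -0.15, 0.03, 0.21])
print(sample)
```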
        
    What are the possible actions that can be performed?
        There are two possible actions: pushing the cart to the left or pushing it to the right.
        
    What reinforcement algorithm is used for this problem?
        The algorithm is deep Q-learning (the DQNSolver class), which is Q-learning with a neural network in place of the Q-table. The agent gets a positive reward for each step the pole stays upright and a negative reward on the step where the run ends, and it uses those rewards to update its estimate of how good each action is in each state.
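
The target value that `experience_replay` computes for each sampled step can be sketched on its own. This is a minimal stand-alone version, assuming the next-state Q-values are already available as a plain list; `q_target` is a hypothetical helper name, and the example Q-values are made up.

```python
GAMMA = 0.95  # same discount factor as the notebook


def q_target(reward, next_q_values, terminal, gamma=GAMMA):
    """Return the Q-learning target for one transition.

    A terminal step has no future, so the target is just the reward.
    Otherwise the target adds the discounted best Q-value of the next state.
    """
    if terminal:
        return reward
    return reward + gamma * max(next_q_values)


# Made-up Q-value estimates for the next state, ordered (left, right).
print(q_target(1.0, [2.0, 4.0], terminal=False))
print(q_target(-1.0, [2.0, 4.0], terminal=True))
```

In the notebook, this target overwrites only the entry for the action that was taken, and the network is then fitted toward the modified Q-value vector.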
        
Analyze how experience replay is applied to the cartpole problem.
    How does experience replay work in this algorithm?
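        Experience replay stores each step in memory as a tuple of the state, the action performed, the reward or punishment, the next state, and whether the run ended. Once at least 20 steps (BATCH_SIZE) are stored, the agent draws a random sample of 20 of them after every step and trains the network on each one, so it keeps learning from older experiences and not only from the most recent step.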
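
The storage and sampling part of experience replay can be sketched without the neural network. This mirrors the `remember` method and the first lines of `experience_replay`; the tiny `MEMORY_SIZE` and `BATCH_SIZE` values and the integer "states" are assumptions chosen so the behaviour is easy to see, and `sample_batch` is a hypothetical helper name.

```python
import random
from collections import deque

MEMORY_SIZE = 5  # tiny on purpose; the notebook uses 1000000
BATCH_SIZE = 3   # the notebook uses 20

# A deque with maxlen silently drops the oldest item once it is full.
memory = deque(maxlen=MEMORY_SIZE)


def remember(state, action, reward, next_state, done):
    """Store one transition, as DQNSolver.remember does."""
    memory.append((state, action, reward, next_state, done))


def sample_batch():
    """Return a random batch, or None until enough transitions are stored."""
    if len(memory) < BATCH_SIZE:
        return None
    return random.sample(list(memory), BATCH_SIZE)


# Store 7 made-up transitions; states are plain integers here for brevity.
for t in range(7):
    remember(t, t % 2, 1.0, t + 1, False)

print(len(memory))  # the two oldest transitions were dropped
batch = sample_batch()
print(batch)
```

Sampling at random rather than replaying the steps in order breaks up the strong correlation between consecutive steps of the same run.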
        
    What is the effect of introducing a discount factor for calculating the future rewards?
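        The discount factor (GAMMA = 0.95 in this code) makes a reward count for less the further in the future it is. The agent still cares about keeping the pole up later, but it gives more weight to rewards it can get soon, which also keeps the estimated Q-values from growing without limit.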
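
The effect of the discount factor can be worked through numerically. The sketch below assumes a constant reward of 1 per step and uses a hypothetical `discounted_return` helper; with GAMMA = 0.95, a reward 10 steps ahead is weighted by roughly 0.6, and the total over an endless run can never exceed 1 / (1 - 0.95) = 20.

```python
GAMMA = 0.95  # same discount factor as the notebook


def discounted_return(rewards, gamma=GAMMA):
    """Sum the rewards, weighting the reward k steps ahead by gamma ** k."""
    total = 0.0
    for step, reward in enumerate(rewards):
        total += (gamma ** step) * reward
    return total


# Weight given to a reward that arrives k steps in the future.
for k in (0, 1, 10, 50):
    print(k, round(GAMMA ** k, 4))

# A long run of +1 rewards approaches, but never exceeds, 1 / (1 - GAMMA).
print(discounted_return([1.0] * 1000))
```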
        
Analyze how neural networks are used in deep Q-learning.
    Explain the neural network architecture that is used in the cartpole problem.
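        This problem uses deep Q-learning, which ditches the Q-table of traditional Q-learning and adds a neural network that produces one Q-value for each action. The network takes the four state values as input, has two hidden layers of 24 neurons each with ReLU activation, and ends in a linear output layer with two neurons, one per action. It is trained with mean squared error loss and the Adam optimizer.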
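
To give a sense of how small this network is, the sketch below counts its trainable parameters from the layer sizes in the `Dense(...)` calls. The input size of 4 and output size of 2 are what `env.observation_space.shape[0]` and `env.action_space.n` give for CartPole-v1; `dense_params` is a hypothetical helper for this sketch.

```python
# Layer sizes as (inputs, outputs), read off the Dense(...) calls in DQNSolver.
LAYERS = [(4, 24), (24, 24), (24, 2)]


def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


per_layer = [dense_params(n_in, n_out) for n_in, n_out in LAYERS]
print(per_layer)       # [120, 600, 50]
print(sum(per_layer))  # 770
```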
       
    How does the neural network make the Q-learning algorithm more efficient?
        A Q-table needs one entry for every state and action pair, which does not scale to CartPole because its state values are continuous. The neural network ditches the Q-table and its per-entry updates entirely: it estimates the Q-values of all actions for any state in one pass, and the agent selects the action with the highest value. Because the network generalizes across similar states, the agent does not have to visit and update every state separately.
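
Choosing an action from the network's output can be sketched as follows. It mirrors the `act` method of `DQNSolver`, but takes the Q-values as a plain list instead of calling `model.predict`, which is an assumption made so the sketch runs without Keras; the example Q-values are made up.

```python
import random


def act(q_values, exploration_rate, rng=random):
    """Epsilon-greedy choice: explore at random, otherwise take the best action."""
    if rng.random() < exploration_rate:
        return rng.randrange(len(q_values))
    # Index of the largest Q-value, like np.argmax in the notebook.
    return max(range(len(q_values)), key=lambda action: q_values[action])


# With no exploration the highest Q-value always wins.
print(act([0.1, 0.9], exploration_rate=0.0))
# With full exploration the choice ignores the Q-values.
print(act([0.1, 0.9], exploration_rate=1.0))
```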
        
    What difference do you see in the algorithm performance when you increase or decrease the learning rate?
        When I increased the learning rate, the agent seemed to need less training time to get better at the game. The opposite seemed to be true when I lowered the learning rate: training seemed to take longer.
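
The general trade-off behind a learning rate can be seen on a toy problem. The sketch below is not the DQN and does not use Adam; it assumes plain gradient descent on the one-variable function (w - 3)^2, with a hypothetical `steps_to_converge` helper, only to show that a larger step size can converge in fewer steps while a step size that is too large never converges.

```python
def steps_to_converge(learning_rate, start=0.0, target=3.0,
                      tolerance=1e-3, max_steps=10_000):
    """Minimise (w - target) ** 2 by gradient descent.

    Returns the number of steps needed to get within `tolerance` of the
    target, or None if the updates diverge or run out of steps.
    """
    w = start
    for step in range(1, max_steps + 1):
        gradient = 2.0 * (w - target)
        w -= learning_rate * gradient
        if abs(w - target) < tolerance:
            return step
        if abs(w - target) > 1e6:
            return None  # the updates are overshooting further each step
    return None


for rate in (0.01, 0.1, 1.1):
    print(rate, steps_to_converge(rate))
```

In the CartPole runs the same idea applies only loosely, since the results also depend on random exploration and on which experiences are sampled, so a single run at each learning rate is not a reliable comparison.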
        
References:
GfG. (2023, January 23). Deep Q-learning. GeeksforGeeks. https://www.geeksforgeeks.org/deep-q-learning/# 
Lamba, A. (2018, September 3). An introduction to Q-learning: Reinforcement learning. Medium. https://medium.com/free-code-camp/an-introduction-to-q-learning-reinforcement-learning-14ac0b4493cc 
Loeber, P. (2022, February 10). Reinforcement learning with (deep) Q-learning explained. AssemblyAI. https://www.assemblyai.com/blog/reinforcement-learning-with-deep-q-learning-explained/ 
Singh, S. (2022, October 7). A comprehensive guide to neural networks in Deep Q-Learning. Turing. https://www.turing.com/kb/how-are-neural-networks-used-in-deep-q-learning