# Module Five Assignment: Cartpole Problem
Review the code in this notebook and in the score_logger.py file in the *scores* folder (directory). Once you have reviewed the code, return to this notebook and select **Cell** and then **Run All** from the menu bar to run this code. The code takes several minutes to run.

In [1]:
import random
import gym
import numpy as np
from collections import deque
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
import tensorflow as tf

# If score_logger is not available, you can remove it or implement a simple logging mechanism.
from scores.score_logger import ScoreLogger

ENV_NAME = "CartPole-v1"

GAMMA = 0.95
LEARNING_RATE = 0.001

MEMORY_SIZE = 1000000
BATCH_SIZE = 20

EXPLORATION_MAX = 1.0
EXPLORATION_MIN = 0.01
EXPLORATION_DECAY = 0.995


class DQNSolver:

    def __init__(self, observation_space, action_space):
        self.exploration_rate = EXPLORATION_MAX

        self.action_space = action_space
        self.memory = deque(maxlen=MEMORY_SIZE)

        self.model = Sequential()
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))
        self.model.add(Dense(24, activation="relu"))
        self.model.add(Dense(self.action_space, activation="linear"))
        self.model.compile(loss="mse", optimizer=Adam(learning_rate=LEARNING_RATE))

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def act(self, state):
        state = np.array(state)  # Ensure state is a NumPy array
        if np.random.rand() < self.exploration_rate:
            return random.randrange(self.action_space)
        state = np.reshape(state, [1, -1])  # Reshape state to match model input
        q_values = self.model.predict(state, verbose=0)
        return np.argmax(q_values[0])

    def experience_replay(self):
        if len(self.memory) < BATCH_SIZE:
            return
        batch = random.sample(self.memory, BATCH_SIZE)
        for state, action, reward, state_next, terminal in batch:
            state = np.array(state)  # Ensure state is a NumPy array
            state_next = np.array(state_next)  # Ensure next state is a NumPy array
            state = np.reshape(state, [1, -1])
            state_next = np.reshape(state_next, [1, -1])
            q_update = reward
            if not terminal:
                q_update = reward + GAMMA * np.amax(
                    self.model.predict(state_next, verbose=0)[0]
                )
            q_values = self.model.predict(state, verbose=0)
            q_values[0][action] = q_update
            self.model.fit(state, q_values, verbose=0)
        self.exploration_rate *= EXPLORATION_DECAY
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)


def cartpole():
    env = gym.make(ENV_NAME)
    observation_space = env.observation_space.shape[0]
    action_space = env.action_space.n
    dqn_solver = DQNSolver(observation_space, action_space)
    run = 0
    while run <= 100:
        run += 1
        state, _ = env.reset()  # Unpack the tuple and ignore the dictionary
        state = np.array(state)  # Convert only the relevant part to a NumPy array
        state = np.reshape(state, [1, -1])  # Correctly reshape the initial state
        step = 0
        while True:
            step += 1
            action = dqn_solver.act(state)
            # gym >= 0.26 returns (obs, reward, terminated, truncated, info)
            state_next, reward, terminated, truncated, _ = env.step(action)
            # End the episode on a failure or when the time limit is reached
            terminal = terminated or truncated
            # Penalize only a real failure, not a time-limit truncation
            reward = reward if not terminated else -reward
            state_next = np.array(state_next)  # Ensure next state is a NumPy array
            state_next = np.reshape(
                state_next, [1, -1]
            )  # Correctly reshape the next state
            dqn_solver.remember(state, action, reward, state_next, terminal)
            state = state_next
            if terminal:
                print(
                    f"Run: {run}, exploration: {dqn_solver.exploration_rate:.4f}, score: {step}"
                )
                break
            dqn_solver.experience_replay()




In [2]:
cartpole()



Run: 1, exploration: 0.8911, score: 43
Run: 2, exploration: 0.8518, score: 10
Run: 3, exploration: 0.7862, score: 17
Run: 4, exploration: 0.7292, score: 16
Run: 5, exploration: 0.6798, score: 15
Run: 6, exploration: 0.6433, score: 12
Run: 7, exploration: 0.5997, score: 15
Run: 8, exploration: 0.5676, score: 12
Run: 9, exploration: 0.5186, score: 19
Run: 10, exploration: 0.4908, score: 12
Run: 11, exploration: 0.4621, score: 13
Run: 12, exploration: 0.4308, score: 15
Run: 13, exploration: 0.4057, score: 13
Run: 14, exploration: 0.3839, score: 12
Run: 15, exploration: 0.3670, score: 10
Run: 16, exploration: 0.3455, score: 13
Run: 17, exploration: 0.3110, score: 22
Run: 18, exploration: 0.2973, score: 10
Run: 19, exploration: 0.2799, score: 13
Run: 20, exploration: 0.2649, score: 12
Run: 21, exploration: 0.2445, score: 17
Run: 22, exploration: 0.2326, score: 11
Run: 23, exploration: 0.2212, score: 11
Run: 24, exploration: 0.2072, score: 14
Run: 25, exploration: 0.1810, score: 28
Run: 26, 

Note: If the code is running properly, you should begin to see output appearing above this code block. It will take several minutes, so it is recommended that you let this code run in the background while completing other work. When the code has finished, it will print output saying, "Solved in _ runs, _ total runs."

You may see an error about not having an exit command. This error does not affect the program's functionality and results from the steps taken to convert the code from Python 2.x to Python 3. Please disregard this error.

The goal of this agent is to keep a pole balanced on a cart for as long as possible. The longer the pole stays up, the higher the score.
The state values are the cart position, cart velocity, pole angle, and pole angular velocity.
The agent can only push the cart left or right. These are the two possible actions.
The agent is trained using the Deep Q-learning algorithm, which is a variant of the Q-learning algorithm. The deep version uses a neural network (hence "deep learning") to approximate the Q-values instead of storing them in a table. The approximation works as long as the relative importance of the values is preserved (Kumar, 2023).
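
The value the network is trained toward can be sketched without any neural network. The snippet below is a minimal illustration that mirrors the `q_update` computation in `experience_replay`, using the notebook's `GAMMA = 0.95`. The function name `q_target` and the example Q-values are assumptions made up for illustration.

```python
GAMMA = 0.95  # same discount factor as the notebook


def q_target(reward, next_q_values, terminal, gamma=GAMMA):
    """Return the Q-learning target for a single transition."""
    if terminal:
        # No future reward is possible once the episode has ended
        return reward
    # Immediate reward plus the discounted best estimate for the next state
    return reward + gamma * max(next_q_values)


# Hypothetical Q-values for (push left, push right) in the next state
example_next_q = [1.0, 2.0]
print(q_target(1.0, example_next_q, terminal=False))
print(q_target(-1.0, example_next_q, terminal=True))
```

In the notebook, `next_q_values` comes from `self.model.predict(state_next)`, so the network supplies its own training targets.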

Experience replay is actually really cool. We use a replay buffer that stores the agent's past experiences so that it can reuse them more efficiently during its training! (Hugging Face, n.d.)
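
A replay buffer can be sketched in a few lines. This is a simplified stand-in for the `remember` and `experience_replay` methods in the notebook, not the notebook's exact code: the `ReplayBuffer` class name, the small capacity, and the dummy transitions are assumptions chosen for illustration.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size memory that drops the oldest transition when full."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)

    def remember(self, transition):
        self.memory.append(transition)

    def sample(self, batch_size):
        # Wait until enough transitions have been collected
        if len(self.memory) < batch_size:
            return []
        # Random sampling breaks up the order in which steps were recorded
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


buffer = ReplayBuffer(capacity=5)
for step in range(8):
    # Dummy transition: (state, action, reward, next_state, terminal)
    buffer.remember((step, 0, 1.0, step + 1, False))

batch = buffer.sample(3)
print(len(buffer), batch)
```

Because the capacity is 5, only the five most recent transitions remain after eight steps. The notebook uses the same idea with `MEMORY_SIZE = 1000000` and `BATCH_SIZE = 20`.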
The discount factor controls how future rewards are weighted relative to more immediate rewards. If it is closer to 1, the agent gives more weight to long-term rewards; if it is closer to 0, it focuses on immediate rewards. This notebook uses a discount factor of 0.95. It is almost alive in how the agent has to balance short-term and long-term rewards. This could help with my chess example!
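
To make the effect of the discount factor concrete, the sketch below sums a stream of rewards under a few values of gamma. The helper name `discounted_return` and the 50-step reward stream are assumptions for illustration. Only 0.95 comes from the notebook.

```python
def discounted_return(rewards, gamma):
    """Sum rewards, weighting the reward at step t by gamma ** t."""
    total = 0.0
    # Work backwards so each step adds its reward to the discounted future
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# CartPole gives a reward of 1 per step; assume a hypothetical 50-step episode
rewards = [1.0] * 50
for gamma in (0.5, 0.95, 0.99):
    print(gamma, round(discounted_return(rewards, gamma), 2))
```

With a constant reward of 1, the return approaches 1 / (1 - gamma) as the episode gets longer, so a gamma of 0.95 behaves roughly like looking about 20 steps ahead.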

Neural network architecture for this problem: because we are using Deep Q-learning, we are leveraging a neural network. The input layer takes the state vector as input, followed by two hidden Dense layers with ReLU activation and just 24 neurons each. The output is another Dense layer with linear activation that produces one Q-value per action.
Using a neural network increases efficiency across similar states, which helps with large or continuous state spaces, as opposed to maintaining a table with an entry for every possible state-action pair. Neural networks generalize across states by approximating the values, as I mentioned above after reading GeeksforGeeks.
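
A rough size comparison shows why the network scales better than a table. The layer sizes below match the model in the notebook (4 inputs, two hidden layers of 24, 2 outputs). The table side is an assumption: CartPole's state is continuous, so a table would first need each state value split into bins, and 20 bins per value is a hypothetical choice.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases in a stack of fully connected layers."""
    return sum(
        n_in * n_out + n_out  # weight matrix plus one bias per output unit
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


# 4 state values -> 24 -> 24 -> 2 actions, as in the notebook's model
network_params = dense_param_count([4, 24, 24, 2])

# Hypothetical Q-table: 20 bins for each of the 4 state values, 2 actions
BINS_PER_DIMENSION = 20
table_entries = BINS_PER_DIMENSION ** 4 * 2

print(network_params, table_entries)
```

Under these assumptions the network has 770 trainable parameters, while the table needs 320,000 entries and still cannot share what it learns between neighboring bins.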

The learning rate (which is no longer "lr"; I fixed a lot of the old code that was given) controls how large a step the optimizer takes on each weight update, which affects how quickly and reliably the agent reaches its best possible score. If it is too large, training is unstable because the updates may jump over the optimum. If it is too low, we get a stable result, but it can take much longer. Many data scientists call setting the learning rate an art more than a science, but there are a few methods for optimizing it. Outside of reinforcement learning, I like to create a matrix and plot the results of different learning rates before homing in on an ideal one. Another method is hyperparameter tuning, such as RandomizedSearchCV or GridSearchCV from sklearn (if your hardware can handle the load).
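
The trade-off can be seen on a toy problem. The sketch below runs plain gradient descent on a one-dimensional quadratic, which is an assumption standing in for the real DQN loss. The function name, the target value of 3.0, and the three learning rates are all hypothetical and are not tuned for CartPole.

```python
def final_error(learning_rate, steps=50, start=0.0, optimum=3.0):
    """Run gradient descent on (w - optimum) ** 2 and return the final distance."""
    w = start
    for _ in range(steps):
        gradient = 2.0 * (w - optimum)  # derivative of the squared error
        w -= learning_rate * gradient
    return abs(w - optimum)


# Too small, reasonable, and too large for this toy problem
for learning_rate in (0.01, 0.1, 1.1):
    print(learning_rate, final_error(learning_rate))
```

The smallest rate is still far from the optimum after 50 steps, the middle rate gets very close, and the largest rate overshoots further on every step and diverges. A sweep like this is the same idea as plotting results across a grid of learning rates.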


Kumar, S. (2023, July 18). Deep Q-Learning. GeeksforGeeks. https://www.geeksforgeeks.org/deep-q-learning/

Hugging Face. (n.d.). Deep Q-Algorithm. Hugging Face. https://huggingface.co/learn/deep-rl-course/en/unit3/deep-q-algorithm
