# Assessment 3: RL Gym
### Game Selection: CartPole
For this assignment I have chosen the 2D control environment CartPole because of its straightforward mechanics and clear reward structure. The agent receives a reward of +1 for every step in which the pole is still upright, so training aims to maximise how long it keeps the pole balanced before the episode ends. Environment documentation: https://gymnasium.farama.org/environments/classic_control/cart_pole/


Using some code from: 
- https://www.sliceofexperiments.com/p/an-actually-runnable-march-2023-tutorial 
- https://medium.com/analytics-vidhya/q-learning-is-the-most-basic-form-of-reinforcement-learning-which-doesnt-take-advantage-of-any-8944e02570c5
- https://gist.github.com/maciejbalawejder/d028e0ddc4c88c19d3761e58fb90c137#file-q-learning-py
- https://www.baeldung.com/cs/epsilon-greedy-q-learning
- https://www.digitalocean.com/community/tutorials/how-to-build-atari-bot-with-openai-gym

In [None]:
#Pre-setup installs
%pip install gymnasium[classic-control]
%pip install gymnasium[toy-text]
%pip install tensorflow

In [None]:
# Setup/imports
import numpy as np
import gymnasium

env = gymnasium.make("FrozenLake-v1")  # discrete-state environment used by the tabular Q-learning cells below

### Model Implementation: 
Implement and train an RL model using an algorithm like Q-learning, Deep Q-Networks (DQN), or any other suitable method. Explain your choice of algorithm and any modifications you made. Comment on the hyperparameters and why you chose them.
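
As a minimal sketch of the tabular Q-learning rule that the training cell in this notebook applies, the update can be isolated into one pure-Python function. The helper name `q_update`, its `terminal` flag, and the three-state table are illustrative assumptions and are not part of the notebook's own code; the default `alpha` and `gamma` mirror the `learning_rate` and `discount_factor` values defined below.

```python
def q_update(q_table, state, action, reward, next_state,
             alpha=0.1, gamma=0.99, terminal=False):
    """Apply one tabular Q-learning update in place and return the new value."""
    # Bootstrap from the best next-state value unless the episode truly ended.
    best_next = 0.0 if terminal else max(q_table[next_state])
    td_target = reward + gamma * best_next
    # Blend the old estimate with the target, weighted by the learning rate.
    q_table[state][action] = (1 - alpha) * q_table[state][action] + alpha * td_target
    return q_table[state][action]


# Hypothetical 3-state, 2-action table, initialised to zero.
q = [[0.0, 0.0] for _ in range(3)]
q_update(q, state=1, action=0, reward=1.0, next_state=2, terminal=True)
q_update(q, state=0, action=1, reward=0.0, next_state=1)
print(q)
```

A higher `alpha` moves each estimate further toward the newest target, while `gamma` close to 1 makes rewards that arrive many steps later count almost as much as immediate ones.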

In [None]:
# Define hyperparameters
number_of_runs = 30000  # takes about 5 seconds
learning_rate = 0.1
discount_factor = 0.99
exploration = 0.1
report_interval = 500
report = 'Average: %.2f, 100-run average: %.2f (Run %d)'
q_table = np.zeros((env.observation_space.n, env.action_space.n)) # stores learned values

### Training Process: 
Describe the training process, including any pre-processing steps such as frame stacking or converting frames to grayscale. Take short (<10 sec) videos at suitable training steps to demonstrate the agent's progress. Provide commentary on the agent's performance and any notable observations.
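
Both training loops in this notebook choose actions epsilon-greedily, and the DQN cell additionally decays epsilon by a constant factor after each episode. The sketch below isolates those two ideas using only the standard library. The function names `epsilon_greedy` and `decayed_epsilon` are hypothetical, and the hard floor at `minimum` is a simplifying assumption (the notebook's loop instead stops multiplying once epsilon drops below `epsilon_min`).

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def decayed_epsilon(episode, start=1.0, minimum=0.01, decay=0.995):
    """Epsilon after `episode` multiplicative decays, floored at `minimum`."""
    return max(minimum, start * decay ** episode)


# Smallest number of episodes n for which decay**n <= minimum.
episodes_to_floor = math.ceil(math.log(0.01) / math.log(0.995))
print(episodes_to_floor)
print(epsilon_greedy([0.1, 0.9], epsilon=decayed_epsilon(2000)))
```

With the values used in the DQN cell, exploration stays above its floor for most of the 1000 training episodes, which is worth keeping in mind when reading the per-episode rewards printed during training.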

In [None]:
# Start training
observations, actions = env.observation_space, env.action_space
rewards = []

for run in range(number_of_runs):
    observation, info = env.reset()
    done = False
    run_reward = 0
    while not done:
        if np.random.rand() < exploration:
            action = actions.sample()  # Take random actions
        else:
            action = np.argmax(q_table[observation, :])  # Take learned action 

        new_observation, reward, terminated, truncated, _ = env.step(action)

        q_table[observation, action] = (1 - learning_rate) * q_table[observation, action] + learning_rate * \
            (reward + discount_factor * np.max(q_table[new_observation, :]))
        
        run_reward += reward
        
        observation = new_observation

        if terminated or truncated:
            done = True
            rewards.append(run_reward)
            if (run + 1) % report_interval == 0:  # use the interval defined with the hyperparameters
                print(report % (np.mean(rewards), np.mean(rewards[-100:]), run + 1))
env.close()

### Evaluation and Performance Metrics: 
Evaluate the performance of your trained model. Provide relevant metrics such as average reward, episodes needed to solve the game, and any additional visualizations or graphs. Comment on the strengths and limitations of your trained agent.
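
One way to turn "episodes needed to solve the game" into a number is to track a trailing average of episode rewards and record when it first crosses a threshold. The sketch below assumes the commonly quoted CartPole-v1 criterion of an average reward of 475 over 100 consecutive episodes; that threshold, the function name `first_solved_episode`, and the reward history are assumptions for illustration, not results from this notebook.

```python
from collections import deque


def first_solved_episode(episode_rewards, threshold=475.0, window=100):
    """Return the 1-based episode at which the trailing `window`-episode mean
    first reaches `threshold`, or None if it never does."""
    recent = deque(maxlen=window)
    running_sum = 0.0
    for index, reward in enumerate(episode_rewards, start=1):
        if len(recent) == window:
            running_sum -= recent[0]  # value about to be evicted by append
        recent.append(reward)
        running_sum += reward
        if len(recent) == window and running_sum / window >= threshold:
            return index
    return None


# Made-up reward history: 50 poor episodes followed by 150 perfect ones.
history = [20.0] * 50 + [500.0] * 150
print(first_solved_episode(history))
```

Passing the list of per-episode totals collected during training to a helper like this would give a single figure to report alongside the average reward.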

In [None]:
import gymnasium
import numpy as np
import tensorflow as tf
import random

# Create the CartPole environment
env = gymnasium.make('CartPole-v1')

# Define the neural network model
class DQN(tf.keras.Model):
    def __init__(self, num_actions):
        super(DQN, self).__init__()
        self.dense1 = tf.keras.layers.Dense(24, activation='relu')
        self.dense2 = tf.keras.layers.Dense(24, activation='relu')
        self.output_layer = tf.keras.layers.Dense(num_actions)

    def call(self, state):
        x = self.dense1(state)
        x = self.dense2(x)
        return self.output_layer(x)

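# Initialize the DQN model
num_actions = env.action_space.n
model = DQN(num_actions)
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
# train_on_batch needs a compiled model; the loss is the MSE between predicted and target Q-values
model.compile(optimizer=optimizer, loss='mse')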

# Define the epsilon-greedy exploration strategy
epsilon = 1.0
epsilon_min = 0.01
epsilon_decay = 0.995

# Replay memory parameters
memory = []
memory_capacity = 10000
batch_size = 32

# Training parameters
gamma = 0.99  # Discount factor
num_episodes = 1000

# Function to preprocess observations
def preprocess_observation(obs):
    return np.reshape(obs, [1, -1])

# Training the DQN
for episode in range(num_episodes):
    state, _ = env.reset()
    state = preprocess_observation(state)
    done = False
    total_reward = 0

    while not done:
        # Epsilon-greedy exploration
        if np.random.rand() <= epsilon:
            action = env.action_space.sample()  # Explore
        else:
            q_values = model.predict(state, verbose=0)  # verbose=0 silences the per-call progress bar
            action = np.argmax(q_values)  # Exploit

        next_state, reward, terminated, truncated, _ = env.step(action)
        next_state = preprocess_observation(next_state)
        total_reward += reward
        
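        done = terminated or truncated

        # Store `terminated` rather than `done`: a time-limit truncation is not a real
        # terminal state, so its target should still bootstrap from the next state
        memory.append((state, action, reward, next_state, terminated))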
        if len(memory) > memory_capacity:
            del memory[0]
        
        # Experience replay
        if len(memory) >= batch_size:
            # Sample transitions and unzip them into one sequence per field
            batch = random.sample(memory, batch_size)
            states, actions, rewards, next_states, dones = zip(*batch)

            states = np.vstack(states)
            next_states = np.vstack(next_states)
            actions = np.array(actions, dtype=int)
            rewards = np.array(rewards, dtype=np.float32)
            dones = np.array(dones, dtype=np.float32)

            q_values = model.predict(states, verbose=0)
            next_q_values = model.predict(next_states, verbose=0)

            # Bellman target: no bootstrapping from states flagged as terminal
            max_next_q_values = np.max(next_q_values, axis=1)
            targets = rewards + (1.0 - dones) * gamma * max_next_q_values

            # Only the Q-value of the action actually taken is moved toward its target
            q_values[np.arange(batch_size), actions] = targets
            model.train_on_batch(states, q_values)

        state = next_state

    # Decay epsilon
    if epsilon > epsilon_min:
        epsilon *= epsilon_decay

    # Display progress
    if (episode + 1) % 10 == 0:
        print(f"Episode: {episode + 1}, Total Reward: {total_reward}")

# After training, use the model to play the game
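env.close()
# Rendering must be requested at creation time; 'human' mode draws a window on every step
env = gymnasium.make('CartPole-v1', render_mode='human')
state, _ = env.reset()  # Gymnasium's reset returns (observation, info)
state = preprocess_observation(state)
done = False
while not done:
    q_values = model.predict(state, verbose=0)
    action = np.argmax(q_values)
    # Gymnasium's step returns five values, with separate terminated/truncated flags
    next_state, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
    state = preprocess_observation(next_state)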

env.close()


### Documentation and Report: 
Provide a clear and detailed report of your process, including decisions, challenges, and any improvements made during the training. Include commentary on the weights chosen and any pre-processing techniques applied.