# Assessment 3: RL Gym
### Game Selection: CartPole
For this assignment I have chosen CartPole, a 2D training environment, because of its straightforward mechanics and clear reward structure. The agent is rewarded for every step it takes while still alive, so it can be trained to increase the time it stays alive. Environment documentation: https://gymnasium.farama.org/environments/classic_control/cart_pole/


Some of the code below is adapted from https://www.sliceofexperiments.com/p/an-actually-runnable-march-2023-tutorial

In [None]:
#Pre-setup installs
%pip install gymnasium[classic-control]
%pip install tensorflow

In [None]:
# Setup/imports
from collections import defaultdict
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.patches import Patch
from tqdm import tqdm
import tensorflow as tf
import gymnasium

env = gymnasium.make("CartPole-v1")  # create the environment used for the game

### Model Implementation: 
Implement and train an RL model using an algorithm like Q-learning, Deep Q-Networks (DQN), or any other suitable method. Explain your choice of algorithm and any modifications you made. Comment on the hyperparameters and why you chose them.
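
As a minimal sketch of what the tabular Q-learning option involves, the snippet below applies one update step and one epsilon-greedy action choice on a toy table. The function names `q_update` and `epsilon_greedy`, the dictionary-based table, and the states `"s0"` and `"s1"` are assumptions made for illustration and are not part of this notebook; only the Python standard library is used.

```python
import random
from collections import defaultdict

def q_update(q, state, action, reward, next_state, n_actions,
             learning_rate=0.15, discount_factor=0.99, terminal=False):
    """Apply one tabular Q-learning update in place and return the new value."""
    # Value of the best action in the next state; zero if the episode ended.
    next_max = 0.0 if terminal else max(q[(next_state, a)] for a in range(n_actions))
    target = reward + discount_factor * next_max
    q[(state, action)] += learning_rate * (target - q[(state, action)])
    return q[(state, action)]

def epsilon_greedy(q, state, n_actions, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    return max(range(n_actions), key=lambda a: q[(state, a)])

# Toy usage with hypothetical states "s0" and "s1" and two actions.
q = defaultdict(float)
q[("s1", 1)] = 2.0
new_value = q_update(q, "s0", 0, reward=1.0, next_state="s1", n_actions=2)
greedy_action = epsilon_greedy(q, "s1", n_actions=2, epsilon=0.0)
```

The discount factor scales the estimated value of the next state, so a value close to 1 makes the agent weigh long-term survival almost as heavily as the immediate reward.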

In [None]:
# Define hyperparameters
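number_of_runs = 1000  # number of training episodes
learning_rate = 0.15  # step size for each Q-value update
discount_factor = 0.99  # weight given to estimated future rewards
# Probability of taking a random action; it is at or above 1 for the first
# 40 runs (always explore) and decays towards 0 as the run number grows.
exploration = lambda run: 50. / (run + 10)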

### Training Process: 
Describe the training process, including any pre-processing steps such as frame stacking or converting frames to grayscale. Take short (<10 sec) videos at suitable training steps to demonstrate the agent's progress. Provide commentary on the agent's performance and any notable observations.
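
CartPole returns four continuous numbers rather than image frames, so for a tabular method the closest equivalent of pre-processing is binning each value into a discrete index. The sketch below shows one way this might be done with the standard library only. The helper names `to_bin` and `discretize_observation`, the velocity bounds, and the bin counts are assumptions chosen for illustration, not values taken from this notebook.

```python
def to_bin(value, low, high, n_bins):
    """Return the index (0 .. n_bins - 1) of the bin that contains value."""
    clipped = min(max(value, low), high)  # values outside the range go to the edge bins
    fraction = (clipped - low) / (high - low)
    return min(int(fraction * n_bins), n_bins - 1)

def discretize_observation(observation, bounds, bins):
    """Convert a sequence of continuous values into a tuple of bin indices."""
    return tuple(
        to_bin(value, low, high, n)
        for value, (low, high), n in zip(observation, bounds, bins)
    )

# Hypothetical bounds and bin counts for the four observation values:
# cart position, cart velocity, pole angle (radians), pole angular velocity.
bounds = [(-2.4, 2.4), (-3.0, 3.0), (-0.21, 0.21), (-3.0, 3.0)]
bins = [6, 6, 12, 12]
state = discretize_observation([0.0, 0.5, -0.05, 10.0], bounds, bins)
```

More bins give the agent a finer view of the state but enlarge the table, so each entry is visited less often during training.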

In [None]:
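# Start training
# CartPole observations are continuous, so they are binned into discrete
# states before indexing the Q-table. The bin counts and the velocity
# bounds below are assumptions and can be tuned.
n_bins = (6, 6, 12, 12)
lower = np.array([-2.4, -3.0, -0.21, -3.0])
upper = np.array([2.4, 3.0, 0.21, 3.0])

def discretize(observation):
    """Map a continuous observation to a tuple of bin indices."""
    ratios = (np.clip(observation, lower, upper) - lower) / (upper - lower)
    return tuple((ratios * (np.array(n_bins) - 1)).round().astype(int))

# One Q-value per (discretized state, action) pair
q_table = np.zeros(n_bins + (env.action_space.n,))
episode_rewards = []

for run in range(1, number_of_runs + 1):
    observation, info = env.reset()
    state = discretize(observation)
    done = False
    total_reward = 0.0
    while not done:
        if np.random.random() < exploration(run):
            action = env.action_space.sample()  # Explore action space
        else:
            action = int(np.argmax(q_table[state]))  # Exploit learned values

        # Step the environment once per loop iteration (Gymnasium returns five values)
        observation, reward, terminated, truncated, info = env.step(action)
        next_state = discretize(observation)
        done = terminated or truncated

        old_value = q_table[state + (action,)]
        # Do not bootstrap from a terminal state
        next_max = 0.0 if terminated else np.max(q_table[next_state])

        new_value = (1 - learning_rate) * old_value + learning_rate * (reward + discount_factor * next_max)
        q_table[state + (action,)] = new_value

        state = next_state
        total_reward += reward

    episode_rewards.append(total_reward)
    # print result
    if run % (number_of_runs // 10) == 0:
        print(f"Run {run}: mean reward over last 100 runs = {np.mean(episode_rewards[-100:]):.1f}")

env.close()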

### Evaluation and Performance Metrics: 
Evaluate the performance of your trained model. Provide relevant metrics such as average reward, episodes needed to solve the game, and any additional visualizations or graphs. Comment on the strengths and limitations of your trained agent.
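
One way the metrics above might be computed from a list of per-episode reward totals is sketched below. The helper names `moving_average` and `episodes_to_solve`, the threshold, and the window size are assumptions for illustration, and the reward values are made up rather than results from this notebook.

```python
def moving_average(values, window):
    """Mean of each run of `window` consecutive values."""
    return [
        sum(values[i - window:i]) / window
        for i in range(window, len(values) + 1)
    ]

def episodes_to_solve(rewards, threshold, window):
    """Number of episodes played when the moving average first reaches threshold, else None."""
    for i, average in enumerate(moving_average(rewards, window)):
        if average >= threshold:
            return i + window
    return None

# Made-up reward totals, for illustration only.
example_rewards = [10, 20, 30, 40, 50, 60]
averages = moving_average(example_rewards, window=3)
solved_after = episodes_to_solve(example_rewards, threshold=40, window=3)
```

Plotting the moving average against the episode number (for example with `matplotlib`) gives a smoother picture of progress than the raw per-episode totals, which vary a lot while the agent is still exploring.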

### Documentation and Report: 
Provide a clear and detailed report of your process, including decisions, challenges, and any improvements made during the training. Include commentary on the weights chosen and any pre-processing techniques applied.