# Assessment 3: RL Gym
### Game Selection: CartPole
For this assignment I have chosen the 2D CartPole environment because of its straightforward mechanics and clear reward structure. The agent is rewarded for every step in which it is still alive, so it can be trained to keep the pole upright, and therefore stay alive, for longer. Environment documentation: https://gymnasium.farama.org/environments/classic_control/cart_pole/
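
As a rough illustration of this reward structure, the sketch below computes the return for an episode that survives a given number of steps, assuming a reward of +1 per step. The function name `episode_return` is hypothetical, and the 0.99 discount is simply the value used in the hyperparameter cell of this notebook.

```python
def episode_return(steps_survived, discount=1.0):
    """Sum of +1 rewards over an episode, optionally discounted per step."""
    return sum(discount ** t for t in range(steps_survived))

# Undiscounted, the return is just the number of steps the pole stayed up.
print(episode_return(20))   # 20.0
print(episode_return(200))  # 200.0

# With discounting, later steps count for less, so the return levels off below 100.
print(round(episode_return(200, discount=0.99), 2))
```

Undiscounted return is the natural score to report; the discounted version is what the learning update actually estimates.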


This notebook uses some code from:
- https://www.sliceofexperiments.com/p/an-actually-runnable-march-2023-tutorial 
- https://medium.com/analytics-vidhya/q-learning-is-the-most-basic-form-of-reinforcement-learning-which-doesnt-take-advantage-of-any-8944e02570c5
- https://gist.github.com/maciejbalawejder/d028e0ddc4c88c19d3761e58fb90c137#file-q-learning-py
- https://www.baeldung.com/cs/epsilon-greedy-q-learning


In [None]:
#Pre-setup installs
%pip install gymnasium[classic-control]
%pip install gymnasium[box2d]
%pip install tensorflow

In [None]:
# Setup/imports
import numpy as np
import gymnasium

env = gymnasium.make("CartPole-v1")  # create the environment used for the game

### Model Implementation: 
Implement and train an RL model using an algorithm like Q-learning, Deep Q-Networks (DQN), or any other suitable method. Explain your choice of algorithm and any modifications you made. Comment on the hyperparameters and why you chose them.
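
One candidate for this task is tabular Q-learning with epsilon-greedy exploration. The sketch below shows a single update step on a toy table; it is an illustration under assumed values, not a definitive implementation. The helper names `choose_action` and `q_update` and the two-state table are hypothetical.

```python
import random

def choose_action(q_row, epsilon, rng=random):
    """Epsilon-greedy: explore with probability epsilon, otherwise pick the best-valued action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])

def q_update(q, state, action, reward, next_state, terminated, alpha, gamma):
    """Apply Q(s,a) += alpha * (reward + gamma * max_a' Q(s',a') - Q(s,a)) in place."""
    # A terminal state has no future value to bootstrap from.
    future = 0.0 if terminated else max(q[next_state])
    q[state][action] += alpha * (reward + gamma * future - q[state][action])
    return q[state][action]

# Toy table: two states, two actions (values are made up for illustration).
q = {"s0": [0.0, 0.0], "s1": [1.0, 2.0]}
new_value = q_update(q, "s0", 1, reward=1.0, next_state="s1",
                     terminated=False, alpha=0.15, gamma=0.99)
print(new_value)  # roughly 0.15 * (1.0 + 0.99 * 2.0)
print(choose_action(q["s0"], epsilon=0.0))  # 1
```

Here `alpha` plays the role of the learning rate, `gamma` the discount factor, and `epsilon` the exploration rate.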

In [None]:
# Define hyperparameters
number_of_runs = 10000  # takes about 1.5 seconds
learning_rate = 0.15
discount_factor = 0.99
exploration = 0.1
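n_bins = 10  # assumed number of buckets per observation dimension (CartPole observations are continuous)
# Q-table with one axis per discretised observation dimension plus one for the actions
q_table = np.zeros((n_bins,) * env.observation_space.shape[0] + (env.action_space.n,))  # stores learned values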

### Training Process: 
Describe the training process, including any pre-processing steps such as frame stacking or converting frames to grayscale. Take short (<10 sec) videos at suitable training steps to demonstrate the agent's progress. Provide commentary on the agent's performance and any notable observations.
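
CartPole returns four numbers (cart position, cart velocity, pole angle, pole angular velocity) rather than image frames, so if a tabular method is used, the relevant pre-processing is to bucket each continuous value. A minimal sketch, assuming ten buckets per dimension and hand-picked value ranges; the names `make_edges`, `discretize`, and `LIMITS` are hypothetical, and the velocity ranges are guesses because those values are unbounded in the environment.

```python
from bisect import bisect_right

# Assumed ranges: position and angle roughly match where CartPole ends an episode;
# the two velocity ranges are illustrative guesses.
LIMITS = [(-2.4, 2.4), (-3.0, 3.0), (-0.21, 0.21), (-3.5, 3.5)]

def make_edges(low, high, n_buckets):
    """Return the n_buckets - 1 evenly spaced interior edges between low and high."""
    step = (high - low) / n_buckets
    return [low + step * i for i in range(1, n_buckets)]

EDGES = [make_edges(lo, hi, 10) for lo, hi in LIMITS]

def discretize(observation):
    """Map each value to a bucket index in 0..9; out-of-range values land in the end buckets."""
    return tuple(bisect_right(edges, x) for x, edges in zip(observation, EDGES))

print(discretize([0.1, 0.1, 0.01, 0.1]))    # (5, 5, 5, 5)
print(discretize([-5.0, 9.0, 0.2, -0.1]))   # (0, 9, 9, 4)
```

The resulting tuple can be used directly as an index into a table with one axis per observation dimension.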

In [None]:
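# Start training
# Assumed value ranges for (cart position, cart velocity, pole angle, pole angular velocity);
# the two velocities are unbounded in CartPole, so their limits here are a guess.
limits = [(-2.4, 2.4), (-3.0, 3.0), (-0.21, 0.21), (-3.5, 3.5)]
# Interior bin edges per dimension, sized from the Q-table so the two stay in sync
bins = [np.linspace(lo, hi, n + 1)[1:-1] for (lo, hi), n in zip(limits, q_table.shape[:-1])]

def discretize(obs):
    """Map a continuous observation to a tuple of bin indices usable as a Q-table index."""
    return tuple(int(np.digitize(x, b)) for x, b in zip(obs, bins))

for run in range(number_of_runs):
    observation, _ = env.reset()  # Gymnasium's reset() returns (observation, info)
    state = discretize(observation)
    total_reward = 0.0
    done = False

    while not done:
        if np.random.uniform(0, 1) < exploration:
            action = env.action_space.sample()  # Take a random action (explore)
        else:
            action = int(np.argmax(q_table[state]))  # Take the best learned action (exploit)

        new_observation, reward, terminated, truncated, _ = env.step(action)
        new_state = discretize(new_observation)

        # Q-learning update; a terminal state has no future value to bootstrap from
        future = 0.0 if terminated else np.max(q_table[new_state])
        q_table[state + (action,)] += learning_rate * \
            (reward + discount_factor * future - q_table[state + (action,)])

        state = new_state
        total_reward += reward
        done = terminated or truncated

    # Report progress periodically using the episode's total reward
    if (run + 1) % 1000 == 0:
        print(f"Run {run + 1}: total reward {total_reward}")

env.close()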

### Evaluation and Performance Metrics: 
Evaluate the performance of your trained model. Provide relevant metrics such as average reward, episodes needed to solve the game, and any additional visualizations or graphs. Comment on the strengths and limitations of your trained agent.
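
The metrics named above can be computed from a list of per-episode returns. The sketch below is one way to do it, assuming the commonly quoted CartPole-v1 criterion of an average return of at least 475 over 100 consecutive episodes; the function names and the `fake_returns` data are hypothetical and do not come from a real training run.

```python
from statistics import mean

def moving_average(returns, window):
    """Average of each run of `window` consecutive episode returns."""
    return [mean(returns[i - window:i]) for i in range(window, len(returns) + 1)]

def episodes_to_solve(returns, threshold=475.0, window=100):
    """Return the 1-based episode at which the trailing average first reaches the threshold, or None."""
    for i, avg in enumerate(moving_average(returns, window)):
        if avg >= threshold:
            return i + window
    return None

# Made-up returns for illustration only: 150 weak episodes, then 150 perfect ones.
fake_returns = [20.0] * 150 + [500.0] * 150
print(mean(fake_returns))               # 260.0
print(episodes_to_solve(fake_returns))  # 245
```

The output of `moving_average` is also the natural series to plot as a learning curve.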

In [None]:
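import gymnasium
# Random-action baseline on LunarLander; Gymnasium 1.0+ registers this environment as "LunarLander-v3"
env = gymnasium.make("LunarLander-v2", render_mode="human")
observation, info = env.reset()

for _ in range(1000):
    action = env.action_space.sample()  # random action; a trained agent would choose from the observation here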
    observation, reward, terminated, truncated, info = env.step(action)

    if terminated or truncated:
        observation, info = env.reset()

env.close()

### Documentation and Report: 
Provide a clear and detailed report of your process, including decisions, challenges, and any improvements made during the training. Include commentary on the weights chosen and any pre-processing techniques applied.