# Assessment 3: RL Gym
### Game Selection: CartPole
For this assignment I have chosen the 2D training tool CartPole due to its straightforward mechanics and clear reward structure. The AI is rewarded for every step it survives, so it can be trained to keep the pole upright, and therefore stay alive, for longer. https://gymnasium.farama.org/environments/classic_control/cart_pole/


https://gymnasium.farama.org/environments/toy_text/blackjack/

This notebook adapts some code from:
- https://www.sliceofexperiments.com/p/an-actually-runnable-march-2023-tutorial 
- https://medium.com/analytics-vidhya/q-learning-is-the-most-basic-form-of-reinforcement-learning-which-doesnt-take-advantage-of-any-8944e02570c5
- https://gist.github.com/maciejbalawejder/d028e0ddc4c88c19d3761e58fb90c137#file-q-learning-py
- https://www.baeldung.com/cs/epsilon-greedy-q-learning


In [None]:
# Pre-setup installs
%pip install gymnasium[classic-control]
%pip install gymnasium[toy-text]
%pip install tensorflow

In [42]:
# Setup/imports
import numpy as np
import gymnasium

env = gymnasium.make("FrozenLake-v1")  # create the environment used for the game

### Model Implementation: 
Implement and train an RL model using an algorithm like Q-learning, Deep Q-Networks (DQN), or any other suitable method. Explain your choice of algorithm and any modifications you made. Comment on the hyperparameters and why you chose them.

In [43]:
# Define hyperparameters
number_of_runs = 10000  # takes about 1.5 seconds
learning_rate = 0.15
discount_factor = 0.99
exploration = 0.2
q_table = np.zeros((env.observation_space.n, env.action_space.n)) # stores learned values
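
To make the roles of `learning_rate` and `discount_factor` concrete, here is a minimal sketch of a single tabular Q-learning update written as a standalone function. The helper name `q_update` and the example numbers are illustrative assumptions and not part of the notebook. The formula is the same update that the training cell applies to `q_table`.

```python
def q_update(q_sa, reward, max_next_q, learning_rate=0.15, discount_factor=0.99):
    """Return the updated Q-value for one (state, action) pair.

    Hypothetical helper for illustration: it moves the old estimate a fraction
    `learning_rate` of the way towards the target
    reward + discount_factor * max_next_q.
    """
    target = reward + discount_factor * max_next_q
    return q_sa + learning_rate * (target - q_sa)


# A step that reaches the goal (reward 1.0) from a pair whose value was 0.0.
goal_step = q_update(q_sa=0.0, reward=1.0, max_next_q=0.0)

# The step before it earns no reward, but bootstraps from the value just learned.
previous_step = q_update(q_sa=0.0, reward=0.0, max_next_q=goal_step)
```

With a learning rate of 0.15 the estimate moves 15% of the way towards the target on each update, and a discount factor of 0.99 means a reward loses very little value as it propagates back to earlier states.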

### Training Process: 
Describe the training process, including any pre-processing steps such as frame stacking or converting frames to grayscale. Take short (<10 sec) videos at suitable training steps to demonstrate the agent's progress. Provide commentary on the agent's performance and any notable observations.

In [46]:
# Start training
observations, actions = env.observation_space, env.action_space

for run in range(number_of_runs):
    observation, info = env.reset()
    done = False

    while not done:
        if np.random.rand() < exploration:
            action = actions.sample()  # Take random actions
        else:
            action = np.argmax(q_table[observation, :])  # Take learned action 

        new_observation, reward, terminated, truncated, _ = env.step(action)

        q_table[observation, action] = q_table[observation, action] + learning_rate * \
            (reward + discount_factor * np.max(q_table[new_observation, :]) - q_table[observation, action])
        
        observation = new_observation

        # An episode ends when the environment terminates or the step limit truncates it
        done = terminated or truncated
        if done and reward == 1.0:
            print(f"{reward} on run {run}")

env.close()

0.0 on run 0
0.0 on run 1
1.0 on run 2
0.0 on run 3
1.0 on run 4
0.0 on run 5
0.0 on run 6
0.0 on run 7
1.0 on run 8
0.0 on run 9
0.0 on run 10
1.0 on run 11
0.0 on run 12
0.0 on run 13
0.0 on run 14
0.0 on run 15
1.0 on run 16
0.0 on run 17
0.0 on run 18
0.0 on run 19
0.0 on run 20
0.0 on run 21
0.0 on run 22
0.0 on run 23
0.0 on run 24
0.0 on run 25
0.0 on run 26
0.0 on run 27
1.0 on run 28
0.0 on run 29
0.0 on run 30
0.0 on run 31
0.0 on run 32
1.0 on run 33
0.0 on run 34
0.0 on run 35
0.0 on run 36
1.0 on run 37
0.0 on run 38
0.0 on run 39
0.0 on run 40
0.0 on run 41
0.0 on run 42
0.0 on run 43
1.0 on run 44
0.0 on run 45
0.0 on run 46
0.0 on run 47
0.0 on run 48
0.0 on run 49
0.0 on run 50
0.0 on run 51
0.0 on run 52
0.0 on run 53
0.0 on run 54
0.0 on run 55
0.0 on run 56
1.0 on run 57
1.0 on run 58
0.0 on run 59
0.0 on run 60
0.0 on run 61
0.0 on run 62
0.0 on run 63
0.0 on run 64
0.0 on run 65
1.0 on run 66
1.0 on run 67
1.0 on run 68
1.0 on run 69
0.0 on run 70
0.0 on run 71
0.
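
The raw per-run log is hard to read as evidence of learning. One way to summarise it is the success rate over consecutive blocks of runs, sketched below under the assumption that the final reward of each run is collected into a list (the loop above only prints it). The function name `success_rate_per_block` and the block size are hypothetical choices.

```python
def success_rate_per_block(rewards, block_size):
    """Split per-episode rewards into consecutive blocks and return the
    fraction of successful episodes (reward > 0) in each full block."""
    rates = []
    for start in range(0, len(rewards) - block_size + 1, block_size):
        block = rewards[start:start + block_size]
        rates.append(sum(1 for r in block if r > 0) / block_size)
    return rates


# Rewards for runs 0-9, copied from the training log above.
first_ten = [0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0]
rates = success_rate_per_block(first_ten, block_size=5)
```

A rate that rises from block to block would suggest the agent is improving, while a flat rate would suggest the Q-table is not yet guiding its choices.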

### Evaluation and Performance Metrics: 
Evaluate the performance of your trained model. Provide relevant metrics such as average reward, episodes needed to solve the game, and any additional visualizations or graphs. Comment on the strengths and limitations of your trained agent.
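
As a starting point for these metrics, the sketch below runs the learned policy greedily (no random exploration) and reports the average reward and the fraction of successful episodes. It assumes an environment that follows the Gymnasium `reset()`/`step()` interface with discrete observations, and a `q_table` indexed as `q_table[observation][action]`. The names `greedy_action` and `evaluate` are hypothetical helpers, not part of the notebook or of Gymnasium.

```python
def greedy_action(q_row):
    """Index of the largest value in one Q-table row (first index wins ties)."""
    return max(range(len(q_row)), key=lambda a: q_row[a])


def evaluate(env, q_table, episodes=100):
    """Run greedy episodes and return (average reward, success rate).

    Assumes the environment eventually terminates or truncates each episode,
    for example through a step limit.
    """
    total_reward = 0.0
    successes = 0
    for _ in range(episodes):
        observation, info = env.reset()
        done = False
        episode_reward = 0.0
        while not done:
            action = greedy_action(q_table[observation])
            observation, reward, terminated, truncated, info = env.step(action)
            episode_reward += reward
            done = terminated or truncated
        total_reward += episode_reward
        if episode_reward > 0:
            successes += 1
    return total_reward / episodes, successes / episodes
```

It could be called as `average_reward, success_rate = evaluate(env, q_table)` on a freshly created environment, since the training cell closes the one it used.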

In [None]:
import gymnasium
env = gymnasium.make("LunarLander-v2", render_mode="human")
observation, info = env.reset()

for _ in range(1000):
    action = env.action_space.sample()  # agent policy that uses the observation and info
    observation, reward, terminated, truncated, info = env.step(action)

    if terminated or truncated:
        observation, info = env.reset()

env.close()

### Documentation and Report: 
Provide a clear and detailed report of your process, including decisions, challenges, and any improvements made during the training. Include commentary on the weights chosen and any pre-processing techniques applied.