# Assessment 3: RL Gym
### Game Selection: FrozenLake
For this assignment I chose the simple game Frozen Lake because of its straightforward mechanics and clear reward structure. The agent is rewarded only when it reaches the end of the maze without falling into a hole. Because this game has a discrete observation space rather than a continuous one, the algorithms used can be much simpler. Environment documentation: https://gymnasium.farama.org/environments/toy_text/frozen_lake/
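
To make the discrete observation space concrete: on the default 4x4 map, the Gymnasium documentation describes each observation as a single integer equal to `row * ncols + col`, and there are four possible actions. The sketch below only illustrates that encoding. It assumes the default 4x4 layout, and `state_to_cell` and the constants are names made up for this example rather than part of Gymnasium.

```python
# Illustrative sketch, assuming the default 4x4 FrozenLake map.
# N_ROWS, N_COLS, N_ACTIONS and state_to_cell are names invented for this example.
N_ROWS, N_COLS = 4, 4
N_ACTIONS = 4  # 0 = left, 1 = down, 2 = right, 3 = up


def state_to_cell(state):
    """Convert an integer observation back to a (row, col) pair."""
    return divmod(state, N_COLS)


n_states = N_ROWS * N_COLS
q_table_entries = n_states * N_ACTIONS  # one value per (state, action) pair

print(n_states, q_table_entries)  # 16 64
print(state_to_cell(0))           # (0, 0), the start tile
print(state_to_cell(15))          # (3, 3), the goal tile
```

With only 64 state-action values to track, a plain lookup table is enough and no function approximation is needed.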


This notebook adapts some code from:
- https://www.sliceofexperiments.com/p/an-actually-runnable-march-2023-tutorial 
- https://medium.com/analytics-vidhya/q-learning-is-the-most-basic-form-of-reinforcement-learning-which-doesnt-take-advantage-of-any-8944e02570c5
- https://gist.github.com/maciejbalawejder/d028e0ddc4c88c19d3761e58fb90c137#file-q-learning-py
- https://www.baeldung.com/cs/epsilon-greedy-q-learning
- https://www.digitalocean.com/community/tutorials/how-to-build-atari-bot-with-openai-gym

In [None]:
#Pre-setup installs
%pip install "gymnasium[toy-text]"
%pip install numpy

In [None]:
# Setup/imports
import numpy as np
import gymnasium

env = gymnasium.make("FrozenLake-v1")  # create the environment used for the game

### Model Implementation: 
For this game, I chose the Q-learning algorithm, primarily because it is one of the simplest algorithms that can show learning and improvement. I modified it by adding an exploration rate that decays over time, so the model relies more and more on its learned behaviours as training progresses.
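
As a rough illustration of what a single Q-learning update does, the sketch below applies the tabular update rule to two consecutive steps near the goal. The function name `q_update` is made up for this example, and the default values are the learning rate and discount factor from this notebook's hyperparameter cell.

```python
# Minimal sketch of one tabular Q-learning update (q_update is an illustrative name).
def q_update(old_value, reward, best_next_value,
             learning_rate=0.1, discount_factor=0.99):
    """Blend the old estimate with the target: reward + discounted best next value."""
    target = reward + discount_factor * best_next_value
    return (1 - learning_rate) * old_value + learning_rate * target


# Step into the goal: reward 1, and the terminal state has no learned value.
goal_step = q_update(old_value=0.0, reward=1.0, best_next_value=0.0)

# One step earlier: no reward, but the next state now has a learned value.
earlier_step = q_update(old_value=0.0, reward=0.0, best_next_value=goal_step)

print(round(goal_step, 4), round(earlier_step, 4))  # 0.1 0.0099
```

The second value is much smaller than the first, which shows how the reward at the goal only gradually propagates back to earlier states over many runs.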
The hyperparameters for this algorithm, shown below, were chosen by trial and error. I found that a higher learning rate would overfit quickly and perform worse as the run count increased. With the hyperparameters shown below, the training code can reliably produce a model that solves the Frozen Lake puzzle (here, "solving" means reaching a best 100-run average reward of at least 0.78) in about 6000 runs.
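
To give a feel for how quickly exploration fades, the sketch below evaluates the decay schedule used in the training loop at a few run counts. The three values are copied from the hyperparameter cell, while `exploration_rate` as a standalone function and `floor_run` are names introduced here for illustration.

```python
import math

# Values copied from the hyperparameter cell in this notebook.
initial_exploration = 1.0
min_exploration = 0.01
exploration_decay = 0.001


def exploration_rate(run):
    """Exponentially decayed exploration rate, clamped at a lower floor."""
    return max(min_exploration,
               initial_exploration * math.exp(-exploration_decay * run))


for run in (0, 1000, 3000, 6000):
    print(run, round(exploration_rate(run), 3))

# First run at which the floor takes over: solve exp(-decay * run) = min / initial.
floor_run = math.ceil(
    math.log(initial_exploration / min_exploration) / exploration_decay)
print(floor_run)  # 4606
```

With these values the rate drops to roughly 0.37 after 1000 runs and reaches its floor of 0.01 at about run 4600, somewhat before the roughly 6000 runs quoted above.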

In [None]:
# Define hyperparameters
number_of_runs = 10000  # takes about 3 seconds
learning_rate = 0.1
discount_factor = 0.99
initial_exploration = 1.0
min_exploration = 0.01
exploration_decay = 0.001
report_interval = 500
report = 'Average: %.2f, 100-run average: %.2f, Best average: %.2f (Run %d)'

### Training Process: 
Describe the training process, including any pre-processing steps such as frame stacking or converting frames to grayscale. Take short (<10 sec) videos at suitable training steps to demonstrate the agent's progress. Provide commentary on the agent's performance and any notable observations.

In [None]:
# Reset learned values, rewards and best streak
q_table = np.zeros((env.observation_space.n, env.action_space.n)) # stores learned values
rewards = []
best_streak = 0.0

# Start training
for run in range(number_of_runs):
    observation, info = env.reset()
    done = False
    run_reward = 0
    exploration_rate = max(min_exploration, initial_exploration * np.exp(-exploration_decay * run)) # decrease exploration rate every run
    while not done:
        if np.random.rand() < exploration_rate:
            action = env.action_space.sample()  # Take random actions
        else:
            action = np.argmax(q_table[observation, :])  # Take learned action 

        new_observation, reward, terminated, truncated, _ = env.step(action)

        q_table[observation, action] = (1 - learning_rate) * q_table[observation, action] + learning_rate * \
            (reward + discount_factor * np.max(q_table[new_observation, :]))
        
        run_reward += reward        
        observation = new_observation
        
        if terminated or truncated:
            done = True
            rewards.append(run_reward)
            if (run + 1) % 100 == 0:  # once per 100 runs, check if the last 100-run average is the best so far
                current_streak = np.mean(rewards[-100:])
                if current_streak > best_streak:
                    best_streak = current_streak
            if (run + 1) % report_interval == 0:  # every report_interval runs, print a report showing progress
                print(report % (np.mean(rewards), np.mean(rewards[-100:]), best_streak, run + 1))
env.close()

### Evaluation and Performance Metrics: 
Evaluate the performance of your trained model. Provide relevant metrics such as average reward, episodes needed to solve the game, and any additional visualizations or graphs. Comment on the strengths and limitations of your trained agent.
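
One possible way to compute the "episodes needed to solve the game" metric is to scan the recorded rewards for the first run where the trailing 100-run average reaches the 0.78 threshold. The sketch below is only an assumption about how this could be done: `first_solved_run` is a hypothetical helper, it checks after every run rather than every 100th run as the training loop does, and the input list is made up purely to exercise the function.

```python
def first_solved_run(rewards, window=100, threshold=0.78):
    """Return the 1-based run at which the trailing-window average first
    reaches the threshold, or None if it never does."""
    running_sum = 0.0
    for i, reward in enumerate(rewards):
        running_sum += reward
        if i >= window:
            running_sum -= rewards[i - window]  # drop the value leaving the window
        if i + 1 >= window and running_sum / window >= threshold:
            return i + 1
    return None


# Made-up reward sequence purely to exercise the function; not real training output.
toy_rewards = [0.0] * 5 + [1.0] * 10
print(first_solved_run(toy_rewards, window=5, threshold=0.78))  # 9
```

After training, the same function could be called on the `rewards` list from the training cell with the default window and threshold.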

### Documentation and Report: 
Provide a clear and detailed report of your process, including decisions, challenges, and any improvements made during the training. Include commentary on the weights chosen and any pre-processing techniques applied.