### Important Note
I have took out the comments we have made during the class all comments/markdowns have specifically been made to tell what I'm changing and why.

In [13]:
import gymnasium as gym
import numpy as np
import matplotlib.pyplot as plt
import time
from IPython.display import clear_output

In [14]:
env = gym.make("FrozenLake-v1", is_slippery=False, render_mode="rgb_array")

grid_height, grid_width = env.unwrapped.desc.shape

env.reset()

(0, {'prob': 1})

In [15]:
gamma = 0.9

max_epsilon = 0.9
epsilon = max_epsilon
min_epsilon = 0.01

decay = 0.0005

learning_rate = 0.5

trainings = 50
total_episodes = 10_000
max_steps = 50
print(f'Total steps: {trainings * total_episodes * max_steps}')
print(f'Estimated time: {total_episodes * trainings / 10_000 * 6 / 60} minutes')  # just based on my laptop's performance

Total steps: 25000000
Estimated time: 5.0 minutes


In [16]:
def calculate_value(Q, state, action, reward, next_state, alpha, gamma):
    Q[state, action] = Q[state, action] + alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
    return Q

In [17]:
# helper for Q-values visualisation
def summarise_best(q_table):  # removed width and height from incoming variables as the function can already access that
    best_values = np.max(q_table, axis = 1)
    return best_values.reshape(grid_width, grid_height)

In [18]:
first_solves = []
step_counts = []

for training in range(trainings):
    first_solve = (-1, -1)
    min_step_count = (-1, max_steps + 1) # episode, step count
    q_table = np.zeros([env.observation_space.n, env.action_space.n])

    for episode in range(total_episodes):
        state, info = env.reset()

        done = False
        total_rewards = 0

        for step in range(max_steps):
            # next action
            if np.random.uniform(0, 1) > epsilon:
                action = np.argmax(q_table[state, :])
            else:
                action = env.action_space.sample()

            # results
            observation, reward, done, truncated, info = env.step(action)

            # q_table update
            q_table = calculate_value(q_table, state, action, reward, observation, learning_rate, gamma)

            # update: state, reward
            state = observation

            if reward == 1:
                if first_solve == (-1, -1):
                    first_solve = (episode, step)
                if min_step_count[1] > step:
                    min_step_count = (episode, step)

            total_rewards += reward

            # end condition check
            if done:
                #print(f"Step count: {step}")
                break

        # decay
        epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay * episode)
        #print(f"Episode {episode} done with current reward: {total_rewards}\nEpsilon: {epsilon}")

    first_solves.append(first_solve)
    step_counts.append(min_step_count)

In [24]:
# average out the results
average_frist_solve = [0, 0]
average_step_count = [0, 0]
for i in range(trainings):
    average_frist_solve[0] += first_solves[i][0]
    average_frist_solve[1] += first_solves[i][1]
    average_step_count[0] += step_counts[i][0]
    average_step_count[1] += step_counts[i][1]

average_frist_solve[0] = average_frist_solve[0] / trainings
average_frist_solve[1] = average_frist_solve[1] / trainings
average_step_count[0] = average_step_count[0] / trainings
average_step_count[1] = average_step_count[1] / trainings

print(f'First average solve episode: {average_frist_solve[0]}, with average step count: {first_solve[1]}')
print(f'Smallest average step count of {average_step_count[1]} reached on average episode {average_step_count[0]}')

First average solve episode: 149.08, with average step count: 22
Smallest average step count of 5.0 reached on average episode 345.16


based on this I will reduce my total episodes to 2_500 for speed of testing in the other

### Concepts talk
The setup
- Agent: our reinforcement learning model is called the agent, this is who acts and thinks.
- Environment: this is what our agent interracts with, the whole world (in the agent's eye).
- State: the current, well, state of the environment, all the info about what is where and in what way. The environment is a videogame and for example a save file in that videogame (namely the current save file) would be the state.
- Observation: the next state, if we make a move the observation of what that move does with the environment becomes the new state.

The main loop
- Rewards: our agent "wants" to get as much of it as it can, this is what it optimizes around. So if we choose an unoptimal award, we'll get an unoptimal agent.
- Policy: actually this is the first one that I can't just intuitively say, so let's google... okay scoured some sites, and kind of gets described as an agents "brain" or how it's actions are implemented into the world. So this is basically the connector between what an agent "thinks" and how that "thought" is actually realized in the environment/state.

Q-stuff
- Q-learning: is a complex math algorithm that is responsible for the learning of the agent. It achieves this goal through constantly adjusting values through successes and failures (which it gets through positive and negative awards). I have to assume there are infinite amount of RL algorithms and this is just a simple one that works really well. Now that I know the base concepts even I could just poke at some math until it shows RL style behaviour.
- Q-table: is the current map of the agent's knowledge. It has the exact best move for each possible state of the environment (according to the agent).

Uh, smthsmth
- Exploration vs Exploitation: this is a choice the agent makes at every move. Exploration simply means it will act (semi-)randomly, while exploitation means it simply acts based on what it "thinks" is the best move (Q-table)

Variables:
- gamma: how much the agent focuses on it's future rewards, usually slightly less than the current rewards (like 0.9). However more vague goals require lower values here.
- epsilon: how wild the agent is, basically the ration between exploration and exploitation
- decay: how quickly epsilon goes down during training (until it reaches some minimum value)
- learning rate: how strongly it should value "new experiences" over older ones. Aka if it gets a reward how much this new reward poke at the values in the Q-table
- total episodes: how many training loops, we didn't use one but an early cutoff is probably recommeded here
- max steps: max actions taken in a single training (or episode), used to avoid infinite loops of actions clogging the system