# Imports
---
Nice, let's get started!\
Before we begin, we need to make sure all the required packages are installed in our development environment.\
Either create a Python env or make sure your existing one is active; you can see the active env near the top-right corner of the screen.\
Next, open the terminal and run these commands.
```
pip install numpy                   # Powerful tool for working with arrays, matrices, and vectors.
pip install gymnasium               # The library containing games our agent can play.
pip install "gymnasium[toy-text]"   # The sub-package containing the game we use in this project.
```
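Before moving on, it can help to confirm that the packages are visible from the active environment. The snippet below is a minimal sketch that uses only the standard library; `is_installed` is a hypothetical helper name, not part of pip or Gymnasium.

```python
import importlib.util


def is_installed(module_name):
    """Return True if `module_name` can be imported in the active environment."""
    return importlib.util.find_spec(module_name) is not None


# Report the status of the two top-level packages installed above.
for name in ("numpy", "gymnasium"):
    status = "found" if is_installed(name) else "missing, run pip install again"
    print(f"{name}: {status}")
```

If a package shows up as missing, the most likely cause is that the terminal is using a different Python env than the notebook.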

Now we can import the needed libraries.\
\
First, we grab the gymnasium library as gym so we can reference it more easily later.\
Next, we import numpy as np.\
Last, we grab math, which is built into Python and does not need to be installed.

In [1]:
import gymnasium as gym
import numpy as np
import math

# Hyperparameters
---

In [6]:
is_training = False     # If True, Q-values are updated and random actions are taken; if False, the agent only follows the learned Q-values
render = True           # If True, the board will render for us to watch, but the agent will run slower
is_slippery = False     # If True, the agent moves in the intended direction only 1/3 of the time and slips sideways otherwise, independent of epsilon

episodes = 5                   # Number of times the agent will attempt the game
learning_rate_a = 0.95         # How strongly each update moves the current Q-value toward the new estimate
discount_factor_g = 0.98       # How much future rewards count relative to the immediate reward

epsilon = 1                         # The probability of taking a random action (1 = 100%)
epsilon_decay_rate = 2 / episodes   # The amount that will be subtracted from epsilon after every episode
rng = np.random.default_rng()       # Random number generator; a random action is chosen when rng.random() < epsilon
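To see what the decay rule does over a run, here is a small sketch that lists the epsilon value used in each episode. `epsilon_schedule` is a hypothetical helper written for illustration; it applies the same `2 / episodes` step and the same floor at 0 that the training loop uses, and rounds the values only for display.

```python
def epsilon_schedule(episodes, start=1.0):
    """List the epsilon value in effect during each episode under linear decay."""
    decay = 2 / episodes              # same rule as epsilon_decay_rate
    epsilon = start
    values = []
    for _ in range(episodes):
        values.append(round(epsilon, 3))      # rounded for readability only
        epsilon = max(epsilon - decay, 0.0)   # never drop below zero
    return values


print(epsilon_schedule(5))
print(epsilon_schedule(10))
```

Because the step is `2 / episodes`, epsilon reaches 0 about halfway through the run, so the second half of the episodes is purely greedy.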

# The Agent's Environment
---

In [7]:
env = gym.make('FrozenLake-v1', desc=['SFHH','FFFF','HHFF','GFFH'], map_name="4x4", is_slippery=is_slippery, render_mode='human' if render else None)
goal = 12
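The value `goal = 12` follows from how FrozenLake numbers its states: row by row, starting at 0 in the top-left corner. The sketch below derives the index from the map instead of hard-coding it; `tile_index` is a hypothetical helper, and it assumes every row of the map has the same length.

```python
desc = ['SFHH', 'FFFF', 'HHFF', 'GFFH']   # same map as passed to gym.make


def tile_index(desc, tile):
    """Return the flattened state index of the first occurrence of `tile`."""
    n_cols = len(desc[0])
    for row, line in enumerate(desc):
        col = line.find(tile)
        if col != -1:
            return row * n_cols + col   # states are numbered row by row
    raise ValueError(f"tile {tile!r} not found")


print(tile_index(desc, 'S'))   # start state
print(tile_index(desc, 'G'))   # goal state
```

Deriving the index this way keeps `goal` correct if the map is edited later.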

Below is the matrix that will contain all of the Q-values for each action in each state.\
It is in its own cell so that switching from training to testing won't reset the Q-values.

In [4]:
q = np.zeros((env.observation_space.n,env.action_space.n))
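For this 4x4 map the table has 16 rows (one per state) and 4 columns (one per action). The sketch below mirrors that layout with plain nested lists instead of NumPy so that it runs on its own; `q_table`, `ACTION_NAMES`, and `greedy_action` are illustrative names, and the action order (0 = left, 1 = down, 2 = right, 3 = up) is the one documented for FrozenLake.

```python
ACTION_NAMES = ["left", "down", "right", "up"]   # FrozenLake action indices 0..3

n_states, n_actions = 16, 4                       # 4x4 grid, four possible moves
q_table = [[0.0] * n_actions for _ in range(n_states)]


def greedy_action(q_row):
    """Index of the largest Q-value; ties go to the lowest index, like np.argmax."""
    return max(range(len(q_row)), key=lambda a: q_row[a])


# With an all-zero row, every action ties and index 0 is chosen.
print(ACTION_NAMES[greedy_action(q_table[0])])

# Once one action has a higher value, it becomes the greedy choice.
q_table[0][1] = 0.5
print(ACTION_NAMES[greedy_action(q_table[0])])
```

One consequence of the tie-breaking: an untrained, all-zero table always selects action 0, so testing before any training is unlikely to reach the goal.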

# Train/Test The Agent
---


In [8]:
for i in range(episodes):
    terminated = False
    steps = 0
    
    state = env.reset()[0]

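    # Cap each episode at 30 steps so the agent cannot wander forever.
    while(not terminated and steps < 30):
        # Epsilon-greedy: explore with probability epsilon (training only), otherwise exploit.
        if is_training and rng.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(q[state, :])

        new_state,reward,terminated,_,_ = env.step(action)
        if terminated and new_state != goal:
            # The episode ended in a hole: replace the default reward of 0 with a penalty.
            reward = -2
        elif not terminated:
            # Small shaping reward computed from the 1-based (x, y) grid
            # coordinates of the new state and of the goal.
            sx = (new_state % 4) + 1
            sy = math.floor(new_state/4)+1
            gx = (goal % 4) + 1
            gy = math.floor(goal/4)+1
            reward = ((sx+gx)/2+(sy+gy)/2)*1e-5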

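        if is_training:
            # Q-learning update: move Q(state, action) toward
            # reward + discount * (best Q-value of the new state).
            q[state, action] = q[state, action]+learning_rate_a*(
                reward+discount_factor_g*np.max(q[new_state,:])-q[state, action]
            )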

        steps += 1

        if reward > 0 and terminated: 
            print(f'Episode {i+1} won in {steps} steps')
        elif terminated: 
            print(f'Episode {i+1} lost  in {steps} steps')

        state = new_state
    epsilon = max(epsilon - epsilon_decay_rate, 0)
env.close()

Episode 1 won in 7 steps
Episode 2 won in 7 steps
Episode 3 won in 7 steps
Episode 4 won in 7 steps
Episode 5 won in 7 steps
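To make the reward adjustments and the Q-value update in the loop easier to follow, here is a minimal sketch that applies them to single transitions. `shaped_reward` and `q_update` are hypothetical helper names that repeat the arithmetic from the loop with the hyperparameter values used in this notebook; the environment reward of 1 for reaching the goal is FrozenLake's default.

```python
import math

GOAL = 12              # flattened index of the goal tile
LEARNING_RATE = 0.95   # learning_rate_a
DISCOUNT = 0.98        # discount_factor_g


def shaped_reward(new_state, env_reward, terminated, goal=GOAL):
    """Repeat the reward adjustments made inside the training loop."""
    if terminated and new_state != goal:
        return -2                          # the episode ended in a hole
    if not terminated:
        sx = (new_state % 4) + 1           # 1-based column of the new state
        sy = math.floor(new_state / 4) + 1 # 1-based row of the new state
        gx = (goal % 4) + 1
        gy = math.floor(goal / 4) + 1
        return ((sx + gx) / 2 + (sy + gy) / 2) * 1e-5
    return env_reward                      # goal reached: keep the environment's reward


def q_update(q_sa, reward, max_next_q, alpha=LEARNING_RATE, gamma=DISCOUNT):
    """One Q-learning step for a single (state, action) entry."""
    return q_sa + alpha * (reward + gamma * max_next_q - q_sa)


# Final move of a winning episode, starting from an untrained entry.
win_reward = shaped_reward(12, 1.0, True)
print(q_update(0.0, win_reward, 0.0))

# An ordinary move onto state 4 (column 1, row 2).
print(shaped_reward(4, 0.0, False))

# Stepping into the hole at state 2.
print(shaped_reward(2, 0.0, True))
```

Note that the shaping term averages the coordinates of the state and the goal rather than measuring a distance between them, and at around 1e-5 it is tiny compared with the goal reward of 1 and the hole penalty of -2.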
