# Imports
---
Nice, let's get started!\
Before we start, we need to make sure all the packages are installed in our development environment.\
Either create a Python env or make sure your existing one is active; you can see the active env near the top-right corner of the screen.\
Next, open the terminal and run these commands.
```
pip install numpy                    # Powerful tool for working with arrays, matrices, and vectors.
pip install gymnasium                # The library containing the games our agent can play.
pip install "gymnasium[toy-text]"    # The extra for the toy-text games, including the one we use in this project.
```

Now we can import the needed libraries.\
\
First, we import the gymnasium library as gym so we can reference it more easily later.\
Next, we import numpy as np.\
Last, we import math, which is part of the Python standard library and does not need to be installed.

In [8]:
import gymnasium as gym
import numpy as np
import math

# Hyper Variables
---
Here we initialize all the adjustable variables that will be used to train/test the agent.\
In this cell, we put every variable we will want to adjust to maximize the agent's results.\
\
The only **NON**-adjustable variable here is rng, the random number generator we use to draw random values between 0 and 1.\
Of the adjustable variables, the only one that doesn't affect the agent's performance is render;\
it just displays the board and the agent as it runs the map. The agent will train much slower with it enabled.

In [9]:
is_training = True     # If False, Q-vals won't be updated and random actions will not happen
render = False           # If True, the board will render for us to watch, but the agent will train slower.
is_slippery = False     # If True, the agent slips in an unintended direction 2/3 of the time, independent of epsilon
board = ['SFHG','FHFF','FFFF','HFFH'] # S = start / F = frozen ice / H = hole / G = goal

episodes = 500                 # Number of times the agent will attempt the game
max_steps = 40                 # The number of moves the agent is allocated per episode
learning_rate_a = 0.95         # Alpha: how strongly each new estimate overwrites the current Q-val
discount_factor_g = 0.98       # Gamma: the weight future rewards have compared to the immediate reward

epsilon = 1                         # The probability of random events - 1 = 100%
epsilon_decay_rate = 2 / episodes   # The amount that will be subtracted from epsilon on every episode
rng = np.random.default_rng()       # Random generator; rng.random() gives a val between 0-1, if < epsilon a random action is chosen / NON-ADJUSTABLE
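
To see what the decay rate above means in practice, here is a minimal sketch that lists the epsilon used in each episode. The helper name `epsilon_schedule` is made up for this illustration and is not part of the notebook; it only assumes the same rule as the training loop (subtract the decay after every episode and never go below 0).

```python
def epsilon_schedule(episodes, start=1.0, decay=None):
    """Return the epsilon used in each episode under linear decay, floored at 0."""
    if decay is None:
        decay = 2 / episodes  # same rule as epsilon_decay_rate in the cell above
    values = []
    epsilon = start
    for _ in range(episodes):
        values.append(epsilon)
        epsilon = max(epsilon - decay, 0)
    return values

schedule = epsilon_schedule(500)

# Index of the first episode where epsilon is (numerically) zero.
# A small tolerance is used because repeated float subtraction is not exact.
first_greedy = next(i for i, e in enumerate(schedule) if e < 1e-9)
print(schedule[0], round(schedule[125], 3), first_greedy)
```

With a decay of 2 / episodes, epsilon reaches 0 about halfway through the run, so roughly the second half of the episodes is played with purely greedy actions. A smaller decay rate would keep the agent exploring for longer.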

# The Agent's Environment
---
In this section we have 3 different code cells.\
The reason for this is convenience: this way, all the edits are made in the cell above.\
After making any adjustments, re-run the cell above and the cells below for the changes to take effect.\
\
This function searches the board matrix for the row containing "G".\
Once it finds that row, it takes the index of "G" within the string and adds it to the row index multiplied by 4.\
This gives the goal's position value, as if the board were a grid like a calendar whose first day is 0.

In [10]:
def find_goal_in_board(board):
    # Return the flat index of 'G' (row * 4 + column), or None if the board has no goal
    for i, row in enumerate(board):
        if 'G' in row:
            return (i*4)+row.index('G')

Here, following the Gymnasium [Docs](https://gymnasium.farama.org/), we initialize the agent's environment.\
Note "board", "is_slippery", and "render".\
These are variables from the cell above that can be adjusted. The comments there explain what they do.\
\
We also use the function to find the position of the goal and store it in a variable to be used later.\
This has to match the board variable and should not be adjusted.

In [11]:
env = gym.make('FrozenLake-v1', desc=board, map_name="4x4", is_slippery=is_slippery, render_mode='human' if render else None)
goal = find_goal_in_board(board)
print(goal)

3
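
The printed 3 is a flat index: row 0, column 3 on a board 4 cells wide. As a small sketch of the conversion in both directions, the helpers below (`to_row_col` and `to_index` are hypothetical names, not part of the notebook) use `divmod` to split an index and the same row * width + column rule to rebuild it.

```python
def to_row_col(index, width=4):
    """Split a flat board index into (row, column) for a board `width` cells wide."""
    return divmod(index, width)

def to_index(row, col, width=4):
    """Rebuild the flat index from a (row, column) pair."""
    return row * width + col

board = ['SFHG', 'FHFF', 'FFFF', 'HFFH']

goal_row, goal_col = to_row_col(3)
print(goal_row, goal_col, board[goal_row][goal_col])

# Round trip: every cell maps to an index and back to the same cell.
for r in range(4):
    for c in range(4):
        assert to_row_col(to_index(r, c)) == (r, c)
```

The same arithmetic shows up again in the training loop, where a state index is turned into grid coordinates with `% 4` and a division by 4.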


In [12]:
env.observation_space

Discrete(16)

Below is the matrix that will contain all of the Q-values for each action in each state.\
It is in its own cell so that when we switch from training to testing, we won't reset the Q-values.

In [13]:
q = np.zeros((env.observation_space.n,env.action_space.n))
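
One way to inspect a table of this shape is to print the best action per state as a grid of arrows. This is only a sketch: `greedy_policy_grid` and `ARROWS` are names invented here, and it runs on a small demo table rather than the notebook's `q`. The arrow order assumes FrozenLake's action numbering (0 = left, 1 = down, 2 = right, 3 = up).

```python
import numpy as np

ARROWS = ['<', 'v', '>', '^']  # assumed action order: 0=left, 1=down, 2=right, 3=up

def greedy_policy_grid(q_table, width=4):
    """Return one string per board row showing the highest-valued action in each state."""
    best = np.argmax(q_table, axis=1)
    rows = []
    for start in range(0, len(best), width):
        rows.append(' '.join(ARROWS[a] for a in best[start:start + width]))
    return rows

demo_q = np.zeros((16, 4))
demo_q[0, 2] = 1.0  # pretend "right" currently looks best in state 0

for line in greedy_policy_grid(demo_q):
    print(line)
```

States whose row is still all zeros show `<`, because `np.argmax` returns the first index when values tie. The same tie-breaking applies whenever the agent acts greedily on a state it has not learned anything about yet.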

# Train/Test The Agent
---
There's a lot going on in this cell.\
We use the epsilon-greedy algorithm to choose actions in the environment, which is modeled as a Markov Decision Process.\
Then we use a custom reward scheme I came up with: falling into a hole costs -2, and every non-terminal step earns a tiny reward computed from the grid coordinates of the new state and the goal, meant to reflect the state's position relative to the goal.\
Lastly, we use the Bellman equation to update the Q-value for each action in each state.
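
To make the update concrete before reading the loop, here is a worked sketch of two single updates on the board from this notebook. The functions `shaped_reward` and `q_update` are illustrative names, not part of the notebook; their bodies copy the formulas from the training cell below, with alpha = 0.95 and gamma = 0.98 as set earlier. It assumes the agent first steps from state 7 up into the goal (state 3), and later steps from state 11 up into state 7.

```python
def shaped_reward(new_state, goal, width=4):
    """Reward for a non-terminal step, using the same formula as the training loop."""
    sx, sy = (new_state % width) + 1, (new_state // width) + 1
    gx, gy = (goal % width) + 1, (goal // width) + 1
    return ((sx + gx) / 2 + (sy + gy) / 2) * 1e-5

def q_update(old_q, reward, best_next_q, alpha=0.95, gamma=0.98):
    """One Q-learning update: move old_q toward reward + gamma * best_next_q."""
    return old_q + alpha * (reward + gamma * best_next_q - old_q)

# Step 1: from state 7, moving up reaches the goal (the environment gives reward 1).
# The goal is terminal and its row in the table stays at 0, so best_next_q is 0.
q_7_up = q_update(old_q=0.0, reward=1.0, best_next_q=0.0)

# Step 2: from state 11, moving up reaches state 7 (non-terminal, tiny shaped reward).
# The best value in state 7 is now q_7_up, so it flows back one step.
q_11_up = q_update(old_q=0.0, reward=shaped_reward(7, 3), best_next_q=q_7_up)

print(q_7_up, q_11_up)
```

The second value is slightly smaller than the first, mostly because of the discount factor. This is how the goal's reward spreads backward through the table one state at a time, while the shaped reward only adds a very small amount on top.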

In [14]:
for i in range(episodes):
    terminated = False
    steps = 0
    
    state = env.reset()[0]

    while(not terminated and steps < max_steps):
        if is_training and rng.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(q[state, :])

        new_state,reward,terminated,_,_ = env.step(action)
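        if terminated and new_state != goal:
            reward = -2  # Penalize falling into a hole
        elif not terminated:
            # Convert the flat indices into 1-based (x, y) grid coordinates
            sx = (new_state % 4) + 1
            sy = math.floor(new_state/4)+1
            gx = (goal % 4) + 1
            gy = math.floor(goal/4)+1
            reward = ((sx+gx)/2+(sy+gy)/2)*1e-5  # Tiny shaped reward for non-terminal steps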

        if is_training:
            q[state, action] = q[state, action]+learning_rate_a*(
                reward+discount_factor_g*np.max(q[new_state,:])-q[state, action]
            )

        steps += 1

        if reward > 0 and terminated: 
            print(f'Episode {i+1} won in {steps} steps')
        elif terminated: 
            print(f'Episode {i+1} lost  in {steps} steps')

        state = new_state
    epsilon = max(epsilon - epsilon_decay_rate, 0)
env.close()

Episode 1 lost  in 11 steps
Episode 2 lost  in 4 steps
Episode 3 lost  in 8 steps
Episode 4 lost  in 12 steps
Episode 5 lost  in 3 steps
Episode 6 lost  in 6 steps
Episode 7 lost  in 5 steps
Episode 8 lost  in 4 steps
Episode 9 lost  in 2 steps
Episode 10 lost  in 4 steps
Episode 11 lost  in 7 steps
Episode 12 lost  in 3 steps
Episode 13 lost  in 2 steps
Episode 14 lost  in 2 steps
Episode 15 lost  in 9 steps
Episode 16 lost  in 5 steps
Episode 17 lost  in 4 steps
Episode 18 lost  in 2 steps
Episode 19 lost  in 2 steps
Episode 20 lost  in 2 steps
Episode 21 lost  in 4 steps
Episode 22 lost  in 10 steps
Episode 23 lost  in 4 steps
Episode 24 lost  in 19 steps
Episode 25 lost  in 3 steps
Episode 26 lost  in 9 steps
Episode 27 lost  in 3 steps
Episode 28 lost  in 9 steps
Episode 29 lost  in 2 steps
Episode 30 lost  in 10 steps
Episode 31 lost  in 5 steps
Episode 32 lost  in 4 steps
Episode 33 lost  in 3 steps
Episode 34 lost  in 10 steps
Episode 35 lost  in 17 steps
Episode 36 lost  in 12
