# Frozen Lake - Kaan Gogcay
In this notebook I built an agent that learns how to navigate from start to finish. Everything it learns is based on the current map layout, so when you create a new environment with a different map layout, it has to relearn everything. Throughout the document I will try to explain what is going on.
- [Import Dependencies](#import-dependencies)
- [Create Environment](#create-environment)
- [Define Q-Table](#define-q-table)
- [Define Agent](#define-agent)
- [Create Agent](#create-agent)
- [Make the Agent Learn](#make-the-agent-learn)
- [Test Agent](#test-agent)
- [Evaluation](#evaluation)
- [Sources](#sources)

## Import dependencies

In [1]:
import sys # 3.8.18
import gym # 0.26.2
import pandas as pd # 2.0.3
import numpy as np # 1.23.4
import random as rnd # N/A
import time # N/A

print(sys.version[0:7])
print(gym.__version__)
print(pd.__version__)
print(np.__version__)

3.8.18 
0.26.2
2.0.3
1.23.4


## Create Environment
When creating the environment, a few arguments are worth noting.
- `is_slippery=False` - with `is_slippery` enabled, the agent may slip 90 degrees to the left or 90 degrees to the right of the direction it intended. For example, if it wants to move right, the actual chance of moving right is 1/3, the chance of moving up is 1/3 (90 degrees to the left), and the chance of moving down is also 1/3 (90 degrees to the right). I decided to set this to False since it adds a lot of chance to the exercise. I'm trying to get into RL and don't want to overcomplicate things right from the start.
- `map_name='8x8'` - this makes the map 8x8 instead of 4x4. On a bigger map it is harder to reach the gift by accident, which means the agent needs a well-thought-out reward system to learn how to play.
- `desc=[x]` - here I set the layout of the map. In this map there are multiple ways to reach the gift, mixed with lots of holes.
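
The slipping rule described above can be sketched without gym. This is only a minimal sketch, assuming the action encoding from the Frozen Lake documentation (0 = left, 1 = down, 2 = right, 3 = up). `slippery_outcomes` is a hypothetical helper name for illustration and is not part of gym.

```python
from fractions import Fraction

# Action encoding assumed from the Frozen Lake documentation.
ACTION_NAMES = {0: "left", 1: "down", 2: "right", 3: "up"}


def slippery_outcomes(intended_action):
    """Return {action_name: probability} for one intended action on slippery ice.

    Hypothetical helper: the agent either moves as intended or slips to one of
    the two perpendicular directions, each with a probability of 1/3.
    """
    candidates = [
        (intended_action - 1) % 4,  # one perpendicular direction
        intended_action,            # the intended direction
        (intended_action + 1) % 4,  # the other perpendicular direction
    ]
    return {ACTION_NAMES[a]: Fraction(1, 3) for a in candidates}


# Wanting to move right (2) can end up as down, right, or up.
print(slippery_outcomes(2))
```

With `is_slippery=False`, as used in this notebook, the intended action is always the action that gets executed.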

In [97]:
desc_base=[
"SFFFFFFF",
"HHHHHHHF",
"FFFFFFHF",
"FHHHHFHF",
"FHFGHFHF",
"FHFFFFHF",
"FHHHHHHF",
"FFFFFFFF"

]


In [99]:
#env = gym.make('FrozenLake-v1', render_mode='human', is_slippery = False, map_name='8x8')
env = gym.make('FrozenLake-v1', 
               is_slippery=False, 
               map_name='8x8', 
               desc=desc_base)
env.reset()
env.render()

print(f'states: {env.observation_space.n}')
print(f'actions: {env.action_space.n}')

states: 64
actions: 4


## Define Q-Table
To learn, the agent needs something to base its decisions on, so I create a Q-Table. This is basically a long array with one row per state. Each row holds one entry per action the agent can take, and for each state-action combination a reward is saved. This way the agent slowly finds out which action gives the best reward in a certain state.

In [100]:
def create_bigger_q_table(states, actions):
    q = np.full((states, actions), 0) # one row per state (64 for the 8x8 map), one column per action
    return q
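
To make the link between the map and the table rows concrete, here is a small sketch. It assumes states are numbered row by row (state = row * number of columns + column), as described in the Frozen Lake documentation. The helper names `cell_to_state` and `state_to_cell` are hypothetical, and the table is built from plain lists only to keep the example free of dependencies.

```python
N_ROWS, N_COLS, N_ACTIONS = 8, 8, 4

# Same shape as the notebook's Q-table: one row per state, one column per action.
q_table = [[0] * N_ACTIONS for _ in range(N_ROWS * N_COLS)]


def cell_to_state(row, col):
    """Map a (row, col) position on the grid to its flat state index."""
    return row * N_COLS + col


def state_to_cell(state):
    """Map a flat state index back to its (row, col) position."""
    return divmod(state, N_COLS)


# In desc_base the gift 'G' sits in row 4, column 3 (counting from 0).
gift_state = cell_to_state(4, 3)
print(len(q_table), len(q_table[0]))
print(gift_state, state_to_cell(gift_state))
```

Under that assumption, the start tile is state 0 and the gift in this layout is state 35.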

## Define Agent
Here we actually define the Agent. The agent has a few functions worth mentioning.
- `step(self, state, exploration_rate)` - step() eventually returns an action. With a probability of `exploration_rate` it returns a random action. Otherwise it checks which action has the highest saved reward for the current state. If multiple actions share that reward, it makes a random choice between them.
- `learn(self, state, action, reward)` - learn() overwrites the reward saved for a state-action combination only if the new reward is higher than the one saved previously. When we define the Q-Table, every state-action combination has a reward of 0 saved, so any reward bigger than 0 overwrites the default. This also means that a reward below 0 will not overwrite it. This becomes useful whenever we set the `exploration_rate` high.
- `force_learn(self, state, action, reward)` - force_learn() always overwrites a reward. It does not check whether the previously saved reward is better. This is useful whenever we want to punish our agent: since the default value of each reward is 0, we can't store a negative reward with the learn() function.
- `evaluate(self)` - prints the Q-Table.

In [101]:
class Agent_v6():  
    def __init__(self, q_table):
        self.q_table = q_table


    def step(self, state, exploration_rate):
        # Random action if exploring
        if rnd.randint(1, 100) <= (exploration_rate*100):
            return env.action_space.sample()
        
        # Find best action
        actions = self.q_table[state]
        best_reward = max(actions)
        # Collect every action that ties for the highest saved reward
        good_actions = [i for i, r in enumerate(actions) if r == best_reward]
        # Break ties with a random choice
        choice = rnd.choice(good_actions)
        return choice



    def learn(self, state, action, reward):
        old_reward = self.q_table[state, action]
        if reward > old_reward:
            self.q_table[state, action] = reward


    def force_learn(self, state, action, reward):
        self.q_table[state, action] = reward



    def evaluate(self):
        df = pd.DataFrame(self.q_table)
        print(df)
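
The difference between learn() and force_learn() is easiest to see on a tiny table. The following is a minimal sketch, not the notebook's agent: `TinyAgent` and `best_actions` are hypothetical names, and the table uses plain lists instead of NumPy so the example runs on its own.

```python
class TinyAgent:
    """Stripped-down stand-in for Agent_v6 with only the two update rules."""

    def __init__(self, n_states, n_actions):
        self.q_table = [[0] * n_actions for _ in range(n_states)]

    def learn(self, state, action, reward):
        # Only overwrite when the new reward beats the saved one.
        if reward > self.q_table[state][action]:
            self.q_table[state][action] = reward

    def force_learn(self, state, action, reward):
        # Always overwrite, which is what allows punishments below 0.
        self.q_table[state][action] = reward


def best_actions(q_row):
    """Return every action index that ties for the highest saved reward."""
    top = max(q_row)
    return [i for i, r in enumerate(q_row) if r == top]


agent = TinyAgent(n_states=2, n_actions=4)
agent.learn(0, 1, -2)        # ignored: -2 is not higher than the default 0
agent.force_learn(0, 1, -2)  # stored: the punishment overwrites the default
agent.learn(0, 2, 5)         # stored: 5 is higher than 0
agent.learn(0, 2, 3)         # ignored: 3 is not higher than 5

print(agent.q_table[0])                # [0, -2, 5, 0]
print(best_actions(agent.q_table[0]))  # [2]
print(best_actions(agent.q_table[1]))  # [0, 1, 2, 3] (all tied at 0)
```

When several actions tie, as in the untouched second row, step() picks one of them at random.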

## Create Agent
Here we combine everything mentioned above. We create the Agent and give it a Q-Table.

In [102]:
agent11 = Agent_v6(create_bigger_q_table(env.observation_space.n, env.action_space.n))
agent11.q_table

array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0,

## Make the Agent Learn
To make the agent learn, we let it play in the game loop I created. A lot is happening inside this game loop, so I tried to write comments to make it as clear as possible.

In [105]:
env.reset()
env.render()

max_moves = 1000 # max moves before resetting env
episodes = 70 # amount of tries
exploration_rate=0.0 # the chance of ignoring what we've learned and doing a random action
steps_looking_into_future = 36 # the amount of steps we should look into the future
agent = agent11 # set agent

for episode in range(episodes):
    new_observation = 0  # We assume we always start top left; if you want to start somewhere else, change this variable
    action_history = [] # clear the action history
    state_history = [] # clear the state history
    env.reset()
    for move in range(max_moves):  

        # keep track of the action history
        # later used to learn the entire state-action history if we ever reach the gift
        if move>0:
            if len(action_history) < steps_looking_into_future:
                action_history.append(action)
            else: 
                action_history.pop(0)
                action_history.append(action)

        # Get an action based on the Q-table
        action = agent.step(new_observation, exploration_rate)

        # keep track of the state history
        # later used to learn the entire state-action history if we ever reach the gift
        if move>0:
            if len(state_history) < steps_looking_into_future:
                state_history.append(observation)
            else:
                state_history.pop(0)
                state_history.append(observation)

        observation = new_observation

        # Execute action
        new_observation, reward, terminated, truncated, _ = env.step(action)

        # Learn
        if (not terminated):
            if new_observation == observation:
                reward = -1
                agent.force_learn(observation, action, reward)
            else:
                agent.learn(observation, action, reward)

        # Stop if dead
        if terminated:
            # if the player did not collect the gift and the game stops it will reach this point with reward=0
            if reward == 0:
                reward = -2 # punishment
            # If the player collected the gift it will reach this point with reward=1
            elif reward == 1:
                reward = steps_looking_into_future+1
                # reward the last x moves it took to get to the finish.
                print('Reached the Gift')
                for x in range(steps_looking_into_future):
                    new_reward = x+1 # oldest remembered move gets 1, the most recent one gets the most
                    try:
                        agent.learn(state_history[x], action_history[x], new_reward)
                    except IndexError:
                        raise Exception('Tried to look more steps into the future than the steps it took to reach the finish. Try lowering the variable `steps_looking_into_future`')
            # force learn the last action (this usually is something really good or really bad)
            agent.force_learn(observation, action, reward)
            break

Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gift
Reached the Gi
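
The reward hand-out at the end of a successful episode can be sketched separately from the game loop. This is a sketch under assumptions, not the loop above: `window_rewards` is a hypothetical helper, and it uses `collections.deque` with `maxlen` in place of the manual `pop(0)` / `append` bookkeeping. Unlike the loop above, it does not raise when the path is shorter than the window. It counts back from the final move instead.

```python
from collections import deque


def window_rewards(visited, window):
    """Assign increasing rewards to the last `window` (state, action) pairs.

    `visited` holds the pairs leading up to, but not including, the final move
    onto the gift. The most recent pair gets `window`, the one before it gets
    `window - 1`, and so on. The final move itself would get `window + 1`.
    """
    recent = deque(visited, maxlen=window)  # silently drops the oldest pairs
    last = len(recent) - 1
    return [(s, a, window - (last - i)) for i, (s, a) in enumerate(recent)]


# Four remembered moves, but only the last three fit in the window.
path = [(0, 2), (1, 2), (2, 1), (10, 1)]
for state, action, reward in window_rewards(path, window=3):
    print(state, action, reward)
# 1 2 1
# 2 1 2
# 10 1 3
```

When the history fills the window exactly, this gives the same values as `new_reward` in the loop: 1 for the oldest remembered move up to `steps_looking_into_future` for the move just before the gift.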

## Test Agent
For testing the agent we will make the map visible so we can see how the agent behaves. To make the map visible we have to redefine the environment.

In [106]:
env = gym.make('FrozenLake-v1', 
               render_mode='human',
               is_slippery=False, 
               map_name='8x8', 
               desc=desc_base)
env.reset()
env.render()

In [107]:
env.reset()
env.render()

max_moves = 1000 # max moves before resetting env
episodes = 3 # amount of tries
exploration_rate=0.0 # the chance of ignoring what we've learned and doing a random action
steps_looking_into_future = 36 # the amount of steps we should look into the future
agent = agent11 # set agent

for episode in range(episodes):
    new_observation = 0  # We assume we always start top left; if you want to start somewhere else, change this variable
    action_history = [] # clear the action history
    state_history = [] # clear the state history
    env.reset()
    for move in range(max_moves):  

        # keep track of the action history
        # later used to learn the entire state-action history if we ever reach the gift
        if move>0:
            if len(action_history) < steps_looking_into_future:
                action_history.append(action)
            else: 
                action_history.pop(0)
                action_history.append(action)

        # Get an action based on the Q-table
        action = agent.step(new_observation, exploration_rate)

        # keep track of the state history
        # later used to learn the entire state-action history if we ever reach the gift
        if move>0:
            if len(state_history) < steps_looking_into_future:
                state_history.append(observation)
            else:
                state_history.pop(0)
                state_history.append(observation)

        observation = new_observation

        # Execute action
        new_observation, reward, terminated, truncated, _ = env.step(action)

        # Learn
        if (not terminated):
            if new_observation == observation:
                reward = -1
                agent.force_learn(observation, action, reward)
            else:
                agent.learn(observation, action, reward)

        # Stop if dead
        if terminated:
            # if the player did not collect the gift and the game stops it will reach this point with reward=0
            if reward == 0:
                reward = -2 # punishment
            # If the player collected the gift it will reach this point with reward=1
            elif reward == 1:
                reward = steps_looking_into_future+1
                # reward the last x moves it took to get to the finish.
                print('Reached the Gift')
                for x in range(steps_looking_into_future):
                    new_reward = x+1 # oldest remembered move gets 1, the most recent one gets the most
                    try:
                        agent.learn(state_history[x], action_history[x], new_reward)
                    except IndexError:
                        raise Exception('Tried to look more steps into the future than the steps it took to reach the finish. Try lowering the variable `steps_looking_into_future`')
            # force learn the last action (this usually is something really good or really bad)
            agent.force_learn(observation, action, reward)
            break

Reached the Gift
Reached the Gift
Reached the Gift


## Evaluation
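We will now take a look at my Q-Table. Here is what each number in the Q-Table means. The legend assumes `steps_looking_into_future = 15`, so that collecting the gift is worth 16. With a different value the numbers shift accordingly.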

- -2: will result in death
- -1: will result in standing still
- 0: just walk
- 1: will result in being 15 steps away from gift
- 2: will result in being 14 steps away from gift
- 3: will result in being 13 steps away from gift
- 4: will result in being 12 steps away from gift
- 5: will result in being 11 steps away from gift
- 6: will result in being 10 steps away from gift
- 7: will result in being 9 steps away from gift
- 8: will result in being 8 steps away from gift
- 9: will result in being 7 steps away from gift
- 10: will result in being 6 steps away from gift
- 11: will result in being 5 steps away from gift
- 12: will result in being 4 steps away from gift
- 13: will result in being 3 steps away from gift
- 14: will result in being 2 steps away from gift
- 15: will result in being 1 step away from gift
- 16: will result in getting gift
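
The legend follows a simple rule: a positive value `v` means the action leaves the agent `(steps_looking_into_future + 1) - v` steps away from the gift. Here is a small sketch of that arithmetic. `describe_q_value` is a hypothetical helper, and the default of 15 is an assumption chosen to match the legend above.

```python
def describe_q_value(value, steps_looking_into_future=15):
    """Translate one Q-table entry into the wording used in the legend."""
    gift_reward = steps_looking_into_future + 1
    fixed = {
        -2: "will result in death",
        -1: "will result in standing still",
        0: "just walk",
        gift_reward: "will result in getting gift",
    }
    if value in fixed:
        return fixed[value]
    if not 0 < value < gift_reward:
        raise ValueError(f"unexpected Q-table value: {value}")
    steps_away = gift_reward - value
    unit = "step" if steps_away == 1 else "steps"
    return f"will result in being {steps_away} {unit} away from gift"


for v in (-2, 0, 1, 15, 16):
    print(v, describe_q_value(v))
```

With `steps_looking_into_future = 36`, as in the training cell, the same rule would make 37 the value for collecting the gift.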

In [67]:
pd.set_option('display.max_rows', None)
agent11.evaluate() # evaluate() already prints the table, so no extra print() is needed

     0   1  2  3
0   -1   0  1 -1
1    0  -2  2 -1
2    0   2  4 -1
3    0  -2  5 -1
4    0   0  6 -1
5    0  -2  7 -1
6    2   8  4 -1
7    5  -2 -1 -1
8   -1   0 -2  0
9    0   0  0  0
10  -2   0 -2  3
11   0   0  0  0
12  -2   0 -2  0
13   0   0  0  0
14  -2   9 -2  1
15   0   0  0  0
16  -1   0 -2  0
17   0   0  0  0
18  -2   0 -2  0
19   0   0  0  0
20  -2  -2  0  0
21   0  -2  0 -2
22   0  10  8  0
23   9  -2 -1 -2
24  -1   0 -2  0
25   0   0  0  0
26  -2   0 -2  0
27   0   0  0  0
28   0   0  0  0
29   0   0  0  0
30  -2  11 -2  0
31   0   0  0  0
32  -1   0  0  0
33   0  -2  0 -2
34   0  -2 -2  0
35   0   0  0  0
36   0   0  0  0
37  12   0  0  0
38  12  -2  0  0
39   0   0  0  0
40  -1   0 -2  0
41   0   0  0  0
42   0   0  0  0
43   0   0  0  0
44   0   0 -2  0
45   0   0  0  0
46   0   0  0  0
47   0   0  0  0
48  -1   0 -2  0
49   0   0  0  0
50  -2   0  0 -2
51   0  -2  0  0
52   0   0  0  0
53   0   0  0  0
54   0   0  0  0
55   0   0  0  0
56  -1  -1  0  0
57   0  -1  0 
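
In the printed table, each row is a state and each column is an action. To read off what the agent would do in a state, take the column with the highest value. This is a minimal sketch, assuming the action encoding from the Frozen Lake documentation (0 = left, 1 = down, 2 = right, 3 = up). `greedy_action_name` is a hypothetical helper, and the two rows are copied from the output above.

```python
# Action encoding assumed from the Frozen Lake documentation.
ACTION_NAMES = ["left", "down", "right", "up"]

# Rows 0 and 14 as printed in the Q-table above.
printed_rows = {
    0: [-1, 0, 1, -1],
    14: [-2, 9, -2, 1],
}


def greedy_action_name(q_row):
    """Name of the action with the highest saved reward (first one on ties)."""
    return ACTION_NAMES[q_row.index(max(q_row))]


for state, q_row in printed_rows.items():
    print(state, greedy_action_name(q_row))
```

Note that this helper always takes the first action on a tie, whereas the agent's step() chooses randomly among tied actions.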

## Sources
- https://www.gymlibrary.dev/environments/toy_text/frozen_lake/
- https://www.youtube.com/watch?v=ZhoIgo3qqLU&t=277s
- https://www.youtube.com/watch?v=V65phXUGb4I
- https://towardsdatascience.com/q-learning-for-beginners-2837b777741
- https://www.youtube.com/watch?v=Vrro7W7iW2w
