# Lab 4: Q-table based reinforcement learning



Solve [`FrozenLake8x8-v1`](https://www.gymlibrary.dev/environments/toy_text/frozen_lake/) using a Q-table.


1. Import Necessary Packages (e.g. `gym`, `numpy`):

In [74]:
import numpy as np
import gym
import random
from tqdm import tqdm


2. Instantiate the Environment and Agent

In [75]:
# Start the environment
env = gym.make('FrozenLake-v1', map_name="8x8", is_slippery=True)

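def play_rnd(env, times=2):
    """Play `times` episodes with uniformly random actions (sanity check)."""
    for _ in tqdm(range(times)):
        state, _ = env.reset()
        done = False
        while not done:
            action = env.action_space.sample()
            next_state, reward, terminated, truncated, _ = env.step(action)
            # An episode ends on a terminal state or when the time limit truncates it
            done = terminated or truncated
    # The env is not closed here because later cells reuse it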

# play_rnd(env)

# The Agent class holds only the learned Q-table and a greedy evaluation method, which makes it easier to split the notebook into sections
# Training and other methods could also be placed here
class Agent:
    def __init__(self, qtable):
        self.qtable = qtable

    def test(self, env, total_test_episodes, max_steps):
        """Run greedy episodes and print the average return and win rate."""
        rewards = []

        for episode in tqdm(range(total_test_episodes)):
            state, _ = env.reset()
            total_rewards = 0

            for step in range(max_steps):
                # Act greedily with respect to the learned Q-values
                action = np.argmax(self.qtable[state, :])
                new_state, reward, terminated, truncated, _ = env.step(action)

                total_rewards += reward

                # Stop on a terminal state or a time-limit truncation
                if terminated or truncated:
                    break
                state = new_state

            # Record every episode, including those that hit max_steps
            rewards.append(total_rewards)

        average_return = sum(rewards) / total_test_episodes
        print("Average episode return: " + str(average_return))
        print("Average win: " + str(average_return * 100) + "%")


3. Set up the QTable:

In [76]:
action_size = env.action_space.n
state_size = env.observation_space.n
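# Choose the initialization: "0" (zeros), "random", or "memory" (load a previously saved table)
initialize = "memory"

if initialize == "0":
    qtable = np.zeros((state_size, action_size))
elif initialize == "random":
    qtable = np.random.rand(state_size, action_size)
elif initialize == "memory":
    # Requires the file written by np.savetxt at the end of a previous run
    qtable = np.loadtxt("qtable_FrozenLake")
else:
    raise ValueError(f"Unknown initialization: {initialize!r}")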
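
The `"memory"` option only works once a table has been saved by an earlier run. As a minimal sketch (the helper name `load_or_init_qtable` and the fallback to zeros are assumptions for illustration, not part of the lab code), the loading step could fall back to a fresh table when the file is missing or has the wrong shape:

```python
import os
import tempfile

import numpy as np


def load_or_init_qtable(path, state_size, action_size):
    """Load a saved Q-table if it exists and has the expected shape, else return zeros.

    Hypothetical helper: the zero fallback is an illustrative choice.
    """
    if os.path.exists(path):
        qtable = np.loadtxt(path)
        if qtable.shape == (state_size, action_size):
            return qtable
    return np.zeros((state_size, action_size))


# Small demonstration in a temporary directory (4 states, 2 actions)
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "qtable_demo")
    fresh = load_or_init_qtable(path, 4, 2)        # no file yet, so zeros
    np.savetxt(path, np.arange(8.0).reshape(4, 2))
    restored = load_or_init_qtable(path, 4, 2)     # file exists and matches
    mismatched = load_or_init_qtable(path, 5, 2)   # wrong shape, so zeros
```

In the notebook this would be called with `state_size` and `action_size` from the environment, so the first run starts from zeros and later runs resume from the saved table.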


4. The Q-Learning algorithm training

In [77]:
def train(qtable, total_episodes=50000, max_steps=100, learning_rate=0.7, gamma=0.618, decay_rate=0.01):
    epsilon = 1.0                      # Exploration rate
    max_epsilon = 1.0                  # Exploration probability at start
    min_epsilon = 0.01                 # Minimum exploration probability 
    win = 0
    for episode in tqdm(range(total_episodes)):
        # Reset the environment
        state,_ = env.reset()
        step = 0
        done = False
        total_return = 0
        for step in range(max_steps):
            # Choose an action a in the current world state (s)
            ## First, draw a random number in [0, 1]
            exp_exp_tradeoff = random.uniform(0,1)
            
            ## If this number is greater than epsilon --> exploitation (take the action with the biggest Q-value for this state)
            if exp_exp_tradeoff > epsilon:
                action = np.argmax(qtable[state,:])
            
            # Otherwise make a random choice --> exploration
            else:
                action = env.action_space.sample()
            
            # Take the action (a) and observe the outcome state (s') and reward (r)
            new_state, reward, terminated, truncated, _ = env.step(action)
            # The episode is over on a terminal state or a time-limit truncation
            done = terminated or truncated

            # Update Q(s,a) := Q(s,a) + lr * [R(s,a) + gamma * max_a' Q(s',a') - Q(s,a)]
            qtable[state, action] += learning_rate * (reward + gamma *
                                        np.max(qtable[new_state, :]) - qtable[state, action])
                    
            # Move to the new state
            state = new_state
            
            total_return += reward
            # If done: count a win when the goal was reached, then finish the episode
            # (episodes that end in a hole must also stop here)
            if done:
                if total_return > 0:
                    win += 1
                break
        # print(f"episode return {total_return}")
        
        # Reduce epsilon (because we need less and less exploration)
        epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*episode)
    print(f"Total wins: {win}")
train(qtable, 10000)
env.close()

100%|██████████| 10000/10000 [01:32<00:00, 108.57it/s]

Total wins: 1063
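
To see what the training loop does at each step, the update rule and the epsilon schedule can be tried in isolation on a toy table. This is a sketch under stated assumptions: `q_update` and `epsilon_at` are hypothetical helper names, the 3-state table is made up, and the `terminated` mask (no bootstrapping from a terminal state) is a common variation that the training cell above does not apply. The default hyperparameters are the ones from `train`.

```python
import math

import numpy as np


def q_update(qtable, state, action, reward, new_state, terminated,
             learning_rate=0.7, gamma=0.618):
    """Apply one tabular Q-learning update in place and return the new value."""
    # Variation (assumption): a terminal state has no future return to bootstrap from
    bootstrap = 0.0 if terminated else np.max(qtable[new_state])
    td_target = reward + gamma * bootstrap
    qtable[state, action] += learning_rate * (td_target - qtable[state, action])
    return qtable[state, action]


def epsilon_at(episode, min_epsilon=0.01, max_epsilon=1.0, decay_rate=0.01):
    """Exploration rate after `episode` episodes of exponential decay."""
    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay_rate * episode)


# Toy table: 3 states, 2 actions; state 2 already values action 1 at 1.0
toy = np.zeros((3, 2))
toy[2] = [0.0, 1.0]

# Non-terminal step with zero reward: 0.7 * (0 + 0.618 * 1.0 - 0) = 0.4326
q_update(toy, state=1, action=0, reward=0.0, new_state=2, terminated=False)

# Terminal step with reward 1: 0.7 * (1.0 - 0) = 0.7
q_update(toy, state=0, action=1, reward=1.0, new_state=2, terminated=True)

for episode in (0, 100, 500, 1000):
    print(f"episode {episode:5d}: epsilon = {epsilon_at(episode):.4f}")
```

With `decay_rate=0.01`, epsilon is already close to `min_epsilon` after roughly 500 episodes, so most of a 10000-episode run is spent almost entirely exploiting.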





5. Evaluate how well your agent performs
* Render output of one episode
* Give an average episode return

In [78]:
env = gym.make('FrozenLake-v1', map_name="8x8", is_slippery=True) # render_mode='human' to see the game
agent = Agent(qtable)
agent.test(env, 100, 100)
env.close()

100%|██████████| 100/100 [00:00<00:00, 383.02it/s]

Average episode return: 0.18
Average win: 18.0%
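
An average over 100 episodes is a noisy estimate. As a rough sketch, assuming each episode return is either 0 or 1 (true for FrozenLake, where only reaching the goal gives a reward), the binomial standard error indicates how far the reported win rate could plausibly be from the true one. The helper name `win_rate_with_stderr` is hypothetical, and the returns list below is reconstructed from the reported 18 wins out of 100 rather than taken from an actual log.

```python
import math


def win_rate_with_stderr(returns):
    """Return the mean of 0/1 episode returns and its binomial standard error."""
    n = len(returns)
    p = sum(returns) / n
    stderr = math.sqrt(p * (1.0 - p) / n)
    return p, stderr


# 18 wins in 100 episodes, matching the average reported above
returns = [1.0] * 18 + [0.0] * 82
p, stderr = win_rate_with_stderr(returns)
print(f"win rate {p:.2f} +/- {1.96 * stderr:.2f} (approximate 95% interval)")
```

The standard error shrinks with the square root of the number of episodes, so a run of only a handful of episodes gives a much less reliable estimate than a run of 100.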





In [79]:
# Show some gameplay in a rendered window
env = gym.make('FrozenLake-v1', map_name="8x8", render_mode='human', is_slippery=True) # render_mode='human' to see the game
agent = Agent(qtable)
agent.test(env, 5, 100)
env.close()

100%|██████████| 5/5 [00:00<00:00, 448.91it/s]

Average episode return: 0.2
Average win: 20.0%





In [80]:
np.savetxt("qtable_FrozenLake", qtable)

6. (<i>Optional</i>) Adapt the code for one of the continuous [Classical Control](https://www.gymlibrary.dev/environments/classic_control/) problems. Think and talk about how you could use our `Model` class from last Thursday to decide actions.
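
One possible starting point for the adaptation, separate from the `Model` class idea, is to keep the Q-table and discretize the continuous observation into bins. The sketch below assumes a 4-dimensional CartPole-like observation. The clipping ranges, the bin count, and the names `LOW`, `HIGH`, `N_BINS`, `EDGES`, and `discretize` are all illustrative assumptions, not values taken from the environment.

```python
import numpy as np

# Hypothetical clipping ranges for (position, velocity, angle, angular velocity)
LOW = np.array([-2.4, -3.0, -0.21, -3.0])
HIGH = np.array([2.4, 3.0, 0.21, 3.0])
N_BINS = 6       # bins per observation dimension (assumption)
N_ACTIONS = 2    # e.g. push left / push right

# Interior bin edges per dimension: N_BINS - 1 edges split the range into N_BINS bins
EDGES = [np.linspace(lo, hi, N_BINS + 1)[1:-1] for lo, hi in zip(LOW, HIGH)]


def discretize(observation):
    """Map a continuous observation to a tuple of bin indices.

    Values outside the clipping range fall into the first or last bin.
    """
    return tuple(int(np.digitize(x, edges)) for x, edges in zip(observation, EDGES))


# One Q-value per (bin, bin, bin, bin, action) combination
qtable = np.zeros((N_BINS,) * len(EDGES) + (N_ACTIONS,))

# The tuple of bin indices replaces the integer state used for FrozenLake
state = discretize([0.1, -0.5, 0.02, 10.0])
best_action = int(np.argmax(qtable[state]))
```

The training loop then stays the same, with `qtable[state]` indexed by the tuple instead of `qtable[state, :]`. The table grows as `N_BINS` to the power of the observation dimension, which is one reason to consider a function approximator for larger problems.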