# Simple algorithm with Q-Learning

## Import

In [1]:
%pip install -r requirements.txt

Note: you may need to restart the kernel to use updated packages.





In [2]:
import gymnasium as gym

import numpy as np
from random import randint, uniform
from IPython import display
from IPython.display import clear_output

import matplotlib
import matplotlib.pyplot as plt

from time import sleep

## Loading the environment

In [3]:
env = gym.make("Taxi-v3", render_mode="ansi")
# render_mode="human" could be tried later for a graphical display

env.reset()
print(env.render())

+---------+
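|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+

(Colors were lost in this text export: the taxi is in the second row, fourth column; the passenger waits at B and the destination is G.)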




The filled square represents the taxi, which is yellow without a passenger and green with a passenger.  
The pipe ("|") represents a wall which the taxi cannot cross.  
R, G, Y, B are the possible pickup and destination locations. The blue letter represents the current passenger pick-up location, and the purple letter is the current destination.  

In [4]:
print(f"Action Space {env.action_space}")
print(f"State Space {env.observation_space}")

Action Space Discrete(6)
State Space Discrete(500)


The environment has 6 actions:
- **0**: south
- **1**: north
- **2**: east
- **3**: west
- **4**: pickup
- **5**: dropoff

And 500 possible states (25 × 5 × 4):
- **5x5** grid, i.e. 25 taxi positions
- **5** passenger locations (R, G, Y, B, or inside the taxi)
- **4** destinations
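
The count above can be checked with a small sketch. It assumes the encoding that Taxi-v3 uses to pack the four components into one integer; the helper names `encode` and `decode` are written out here for illustration, with passenger indices 0 to 3 standing for R, G, Y, B and 4 for "inside the taxi".

```python
def encode(taxi_row, taxi_col, passenger, destination):
    """Pack the four state components into a single integer (assumed Taxi-v3 layout)."""
    return ((taxi_row * 5 + taxi_col) * 5 + passenger) * 4 + destination


def decode(state):
    """Inverse of encode: recover (taxi_row, taxi_col, passenger, destination)."""
    state, destination = divmod(state, 4)
    state, passenger = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, passenger, destination


# Enumerate every combination and confirm that there are exactly 500 distinct states
all_states = {
    encode(row, col, passenger, destination)
    for row in range(5)
    for col in range(5)
    for passenger in range(5)
    for destination in range(4)
}
print(len(all_states))

# Example: taxi at row 4, column 3, passenger at B (3), destination B (3)
print(encode(4, 3, 3, 3))
print(decode(475))
```

Packing the state into one integer is what lets a plain 2D table indexed by `[state, action]` hold every Q-value.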

In [13]:
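# Read the state from the unwrapped environment; wrappers do not always forward attributes
print(f"Current state: {env.unwrapped.s}")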

Current state: 475


In [14]:
env.unwrapped.P[env.unwrapped.s]

{0: [(1.0, 475, -1, False)],
 1: [(1.0, 375, -1, False)],
 2: [(1.0, 495, -1, False)],
 3: [(1.0, 475, -1, False)],
 4: [(1.0, 479, -1, False)],
 5: [(1.0, 475, -10, False)]}

For each action in this state, the transition table gives a list of `(probability, nextstate, reward, done)` tuples:
- **probability**: always 1.0 in this environment
- **nextstate**: the state reached if the agent takes this action
- **reward**: the reward (positive or negative) received for performing this action
- **done**: boolean, `True` when the passenger is dropped off at the correct destination
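
As a minimal sketch of how such a table can be read, the snippet below copies the entries printed above for state 475 into a plain dictionary and unpacks each tuple. The names `transitions`, `ACTION_NAMES` and `describe` are illustrative and not part of Gymnasium.

```python
# Transition entries copied from the output above (state 475)
transitions = {
    0: [(1.0, 475, -1, False)],
    1: [(1.0, 375, -1, False)],
    2: [(1.0, 495, -1, False)],
    3: [(1.0, 475, -1, False)],
    4: [(1.0, 479, -1, False)],
    5: [(1.0, 475, -10, False)],
}
ACTION_NAMES = ["south", "north", "east", "west", "pickup", "dropoff"]


def describe(table):
    """Return one readable line per (action, outcome) pair."""
    lines = []
    for action, outcomes in table.items():
        for probability, next_state, reward, done in outcomes:
            lines.append(
                f"{ACTION_NAMES[action]:>7}: p={probability}, "
                f"next={next_state}, reward={reward}, done={done}"
            )
    return lines


for line in describe(transitions):
    print(line)

# Actions that leave the state unchanged (blocked move or illegal dropoff)
stuck = [ACTION_NAMES[a] for a, outcomes in transitions.items() if outcomes[0][1] == 475]
print(stuck)
```

Only the illegal dropoff costs -10; every other action costs -1 per step, which pushes the agent toward short routes.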

## Q-Learning

To implement the Q-Learning algorithm, we will go through these steps:

- Initialize the Q-table with zeros.
- Start exploring: in the current state (S), select one of the possible actions (a).
- Travel to the next state (S') as a result of that action (a).
- Among all possible actions from the state (S'), select the one with the highest Q-value.
- Update the Q-table value for (S, a) using the Q-Learning update rule.
- Set the next state as the current state.
- If the goal state is reached, end the episode and repeat the process with a new one.
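
For reference, the update step is commonly written as below. The symbols are assumptions of this note: $\alpha$ is the learning rate, $\gamma$ the discount factor, and $r$ the reward received when moving from $S$ to $S'$. The form matches the `new_value` line of the training loop in this notebook.

```latex
$$
Q(S, a) \leftarrow (1 - \alpha)\, Q(S, a) + \alpha \left( r + \gamma \max_{a'} Q(S', a') \right)
$$
```

With $\alpha = 0$ the table never changes, and with $\alpha = 1$ the old estimate is fully replaced by the new target.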

In [15]:
# Initialize the Q-Table
q_table = np.zeros([env.observation_space.n, env.action_space.n])

q_table

array([[0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       ...,
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.]])

In [16]:
# Hyperparameters
ALPHA = 0.1
GAMMA = 0.6
EPSILON = 0.1

For now, the hyperparameters are set to arbitrary values; they will be tuned in another notebook.  
Since they do not change during training, they are written as constants.  
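
To get a feel for these values, here is a small sketch of a single Q-value update computed by hand. The helper `q_update` is hypothetical (it is not defined elsewhere in this notebook) and mirrors the update formula used in the training loop.

```python
# Values copied from the hyperparameter cell
ALPHA = 0.1  # learning rate: how far each update moves the old estimate
GAMMA = 0.6  # discount factor: weight given to future rewards


def q_update(old_value, reward, next_max, alpha=ALPHA, gamma=GAMMA):
    """Return the updated Q-value for one (state, action) pair."""
    return (1 - alpha) * old_value + alpha * (reward + gamma * next_max)


# Very first update: the table is all zeros and a normal move costs -1
first = q_update(old_value=0.0, reward=-1, next_max=0.0)
print(first)

# An illegal pickup or dropoff costs -10 instead
illegal = q_update(old_value=0.0, reward=-10, next_max=0.0)
print(illegal)
```

With `ALPHA = 0.1`, each update moves the estimate only a tenth of the way toward the new target, so many visits are needed before the values settle.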

In [17]:
reward_list = []
penalities_list = []

episode_number = 100000

# Learning
for i in range(episode_number):
    # Reset the environment
    state = env.reset()[0]

    reward_count = 0
    
    # initialize fields
    epochs = penalties = reward = 0
    done = False
    
    # Start the episode process
    while not done:
        # Deciding which action to perform
        if uniform(0, 1) < EPSILON:
            action = env.action_space.sample() # Exploration
        else:
            action = np.argmax(q_table[state]) # Exploitation
        
        # Performing action inside the environment
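        next_state, reward, terminated, truncated, info = env.step(action)
        # Gymnasium reports two end flags: goal reached (terminated) or time limit hit (truncated)
        done = terminated or truncated
        
        # Getting useful fields to calculate the new value of Q-Table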
        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])
        
        # Calculate new value
        new_value = (1 - ALPHA) * old_value + ALPHA * (reward + GAMMA * next_max)
        
        q_table[state, action] = new_value
        
        # Count the illegal actions (pickup or dropoff) performed by the agent
        penalties += 1 if reward == -10 else 0
        
        # Updating state
        state = next_state

        reward_count += reward
        
        epochs += 1
    
    # Every 1000 episodes, record the stats and refresh the plots
    if i % 1000 == 0:
        clear_output(wait=True)
        penalities_list.append(penalties)
        reward_list.append(reward_count)
        print('Episode: {}, reward: {}, wrong dropouts: {}\n'.format(i, reward_count, penalties))

        fig, (axs1,axs2) = plt.subplots(1,2, figsize=(12, 6)) # create in 1 line 2 plots
        
        axs1.plot(reward_list)
        axs1.set_xlabel('episode*1000')
        axs1.set_ylabel('reward')
        axs1.grid(True)
        
        axs2.plot(penalities_list)
        axs2.set_xlabel('episode*1000')
        axs2.set_ylabel('wrong dropouts')
        axs2.grid(True)
        
        plt.show()
        sleep(0.1)

print("Finished")

Episode: 5120, reward: -5, wrong dropouts: 1



KeyboardInterrupt: 

In [9]:
# Examining the q_table
q_table

array([[  0.        ,   0.        ,   0.        ,   0.        ,
          0.        ,   0.        ],
       [ -2.4183706 ,  -2.3639511 ,  -2.41837062,  -2.36395109,
         -2.27325184, -11.36393207],
       [ -1.87014399,  -1.45024   ,  -1.870144  ,  -1.45024003,
         -0.7504    , -10.45023981],
       ...,
       [ -1.10895034,   0.41599995,  -1.02411286,  -1.15270773,
         -1.96      ,  -5.19012549],
       [ -2.14625018,  -2.12207667,  -2.14627998,  -2.12207695,
         -5.07090015,  -6.29348347],
       [  3.37134783,   1.37178285,   3.32635135,  11.        ,
         -2.85982595,  -1.82536   ]])
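
As a sanity check, the last row shown above (state 499) can be read by hand. The sketch assumes the usual Taxi-v3 encoding, under which state 499 puts the taxi one cell east of B with the passenger on board and B as the destination. The variable names are illustrative.

```python
GAMMA = 0.6  # discount factor from the hyperparameter cell

# Last row of the printed Q-table, i.e. state 499
row_499 = [3.37134783, 1.37178285, 3.32635135, 11.0, -2.85982595, -1.82536]
ACTION_NAMES = ["south", "north", "east", "west", "pickup", "dropoff"]

# Greedy action: the index of the largest Q-value in the row
best_action = max(range(len(row_499)), key=row_499.__getitem__)
print(ACTION_NAMES[best_action], row_499[best_action])

# Expected value of that action once learning has converged:
# -1 for the move, then a discounted +20 for the successful dropoff
expected = -1 + GAMMA * 20
print(expected)
```

The learned value for moving west agrees with this hand computation, which suggests that this entry had converged before training was interrupted.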

## Evaluation

Evaluate the trained agent over 100 episodes.  

Metrics:
- **total_epochs**: total number of actions performed by the agent across all evaluation episodes
- **total_penalties**: total number of illegal actions (pickup / dropoff) performed by the agent across all evaluation episodes

In [10]:
# Initialize fields
total_epochs = total_penalties = 0
EPISODES = 100

In [12]:
def print_frames(frames, sleep_time, episodes, random):
    if random:
        # Pick `episodes` episode indices at random (duplicates are possible).
        # Episodes are numbered from 0 to EPISODES - 1.
        selected = {randint(0, EPISODES - 1) for _ in range(episodes)}
    else:
        # Otherwise replay the first `episodes` episodes
        selected = set(range(episodes))

    for i, frame in enumerate(frames, start=1):
        # Skip frames that belong to episodes we are not replaying
        if frame["episode"] not in selected:
            continue

        clear_output(wait=True)

        print(f"Episode: {frame['episode']}")

        print(frame['frame'])

        print(f"Timestep: {i}")
        print(f"State: {frame['state']}")
        print(f"Action: {frame['action']}")
        print(f"Reward: {frame['reward']}")

        sleep(sleep_time)

In [13]:
frames = []

for i in range(EPISODES):
    # Reset the environment
    state = env.reset()[0]
    
    # Initialize fields
    epochs = penalties = reward = 0
    done = False
    
    # Start the episode process
    while not done:
        # Only Exploitation during the evaluation phase
        action = np.argmax(q_table[state])
        
        # Performing action inside the environment
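        state, reward, terminated, truncated, info = env.step(action)
        # Stop on success (terminated) or when the time limit is hit (truncated)
        done = terminated or truncated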
        
        # Put each rendered frame into a dict for animation
        frames.append({
            "episode": i,
            "frame": env.render(),
            "state": state,
            "action": action,
            "reward": reward
        })
        
        # Count the illegal actions (pickup or dropoff) performed by the agent
        penalties += 1 if reward == -10 else 0
        
        epochs += 1
    
    total_penalties += penalties
    total_epochs += epochs

In [14]:
# Results
print(f"Results ({EPISODES} episodes)")
print(f"Average timesteps: {total_epochs / EPISODES}")
print(f"Average penalties: {total_penalties / EPISODES}")

Results (100 episodes)
Average timesteps: 13.14
Average penalties: 0.0


## Visualization

In [17]:
print_frames(frames, sleep_time=0.1, episodes=5, random=True)

Episode: 96
+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)

Timestep: 1274
State: 475
Action: 5
Reward: 20
