<a href="https://colab.research.google.com/github/Deep-of-Machine/AI_Academy/blob/main/9_1_taxi.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import gym

In [2]:
env = gym.make('Taxi-v3').env

In [9]:
env?

agent: the taxi    
env: a 5 x 5 parking lot (with 4 fixed stops) plus a passenger    
passenger: a current location (waiting at one of the stops, or riding in the taxi) plus a destination

In [4]:
env.render()

+---------+
|[35mR[0m:[43m [0m| : :G|
| : | : : |
| : : : : |
| | : | : |
|[34;1mY[0m| : |B: |
+---------+



In [6]:
env.reset()
env.render()

+---------+
|R: | : :G|
| : | : : |
| : :[43m [0m: : |
| | : | : |
|[35mY[0m| : |[34;1mB[0m: |
+---------+



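Taxi actions: (south, north, east, west, pickup, dropoff)    
States: 5 x 5 x 5 x 4 = 500 (taxi row x taxi column x passenger location x destination)    
Reward: -1 per action by default, -10 for an illegal pickup or dropoff, +20 when the passenger is delivered to the destination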
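
The state count can be checked by enumerating every combination directly. The sketch below is plain Python and does not touch the environment. The names `ACTION_NAMES`, `PASSENGER_LOCATIONS`, and `DESTINATIONS` are illustrative, and the stop order R, G, Y, B for indices 0 to 3 is an assumption inferred from the rendered maps in this notebook.

```python
from itertools import product

# Action indices in the order listed above.
ACTION_NAMES = ["south", "north", "east", "west", "pickup", "dropoff"]

# The passenger is at one of the four stops or inside the taxi (5 options);
# the destination is always one of the four stops (4 options).
PASSENGER_LOCATIONS = ["R", "G", "Y", "B", "in taxi"]
DESTINATIONS = ["R", "G", "Y", "B"]

# One state per (taxi row, taxi column, passenger location, destination).
states = list(product(range(5), range(5), PASSENGER_LOCATIONS, DESTINATIONS))
print(len(states))        # 5 * 5 * 5 * 4 = 500
print(len(ACTION_NAMES))  # 6
```

These two counts should match the `Discrete(500)` and `Discrete(6)` spaces the environment reports.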

In [7]:
print(env.action_space)
print(env.observation_space)

Discrete(6)
Discrete(500)


In [8]:
env.reset()
env.render()
print(env.s)

+---------+
|[35mR[0m: | : :G|
| : | : :[43m [0m|
| : : : : |
| | : | : |
|[34;1mY[0m| : |B: |
+---------+

188


In [10]:
one_step = env.step(0)
print(one_step)  # (next state, reward, done, info dict with the transition probability)
print()
env.render()

(288, -1, False, {'prob': 1.0})

+---------+
|[35mR[0m: | : :G|
| : | : : |
| : : : :[43m [0m|
| | : | : |
|[34;1mY[0m| : |B: |
+---------+
  (South)


In [11]:
state = env.encode(1, 2, 3, 0)  # (taxi row, taxi column, passenger index, destination index)
print("state:", state)

env.s = state
env.render()

state: 152
+---------+
|[35mR[0m: | : :G|
| : |[43m [0m: : |
| : : : : |
| | : | : |
|Y| : |[34;1mB[0m: |
+---------+
  (South)
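
The value 152 is consistent with a mixed-radix packing of the four components (radices 5, 5, 5, 4). The sketch below is a standalone re-implementation under that assumption, not the library's own code; `encode` and `decode` here are helpers written for illustration only.

```python
def encode(taxi_row, taxi_col, pass_idx, dest_idx):
    """Pack the four components into one integer (mixed radix 5, 5, 5, 4)."""
    return ((taxi_row * 5 + taxi_col) * 5 + pass_idx) * 4 + dest_idx


def decode(state):
    """Invert encode(): recover (taxi_row, taxi_col, pass_idx, dest_idx)."""
    state, dest_idx = divmod(state, 4)
    state, pass_idx = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, pass_idx, dest_idx


print(encode(1, 2, 3, 0))  # 152
print(decode(188))         # (1, 4, 2, 0)
```

Decoding 188, the state printed after the earlier `env.reset()`, gives a taxi at row 1, column 4, which agrees with the rendering shown for that state.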


For each action: the probability of the transition, the next state, and the reward

In [12]:
# {action: [(probability, nextstate, reward, done)]}
env.P[152]

{0: [(1.0, 252, -1, False)],
 1: [(1.0, 52, -1, False)],
 2: [(1.0, 172, -1, False)],
 3: [(1.0, 152, -1, False)],
 4: [(1.0, 152, -10, False)],
 5: [(1.0, 152, -10, False)]}
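
Each action maps to a list because an action could in general have several possible outcomes; here every list holds a single entry with probability 1.0, so the dynamics are deterministic. The sketch below copies the table printed above into a plain dictionary (`P_152` and `expected_reward` are illustrative names) and reports which actions leave the taxi where it is.

```python
# Transition table for state 152, copied from the output above:
# {action: [(probability, next_state, reward, done)]}
P_152 = {
    0: [(1.0, 252, -1, False)],
    1: [(1.0, 52, -1, False)],
    2: [(1.0, 172, -1, False)],
    3: [(1.0, 152, -1, False)],
    4: [(1.0, 152, -10, False)],
    5: [(1.0, 152, -10, False)],
}
ACTION_NAMES = ["south", "north", "east", "west", "pickup", "dropoff"]


def expected_reward(transitions):
    """Probability-weighted reward over all outcomes of one action."""
    return sum(prob * reward for prob, _next_state, reward, _done in transitions)


# Actions whose every outcome stays in state 152.
no_move = [
    action
    for action, transitions in P_152.items()
    if all(next_state == 152 for _p, next_state, _r, _d in transitions)
]
print([ACTION_NAMES[a] for a in no_move])  # ['west', 'pickup', 'dropoff']
print(expected_reward(P_152[4]))           # -10.0
```

West is blocked by the wall drawn to the left of the taxi in the map, and both pickup and dropoff are illegal here, which is why they cost -10.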

Take random actions until one passenger has been delivered to the destination.

In [19]:
env.s = 408
epochs = 0
penalties, reward = 0, 0
frames = []
done = False

while not done:
    action = env.action_space.sample()
    state, reward, done, info = env.step(action)

    if reward == -10:
        penalties += 1

    frames.append({
        'frame': env.render(mode='ansi'),
        'state': state,
        'action': action,
        'reward': reward
    })

    epochs += 1
print("Timestep taken: {}".format(epochs))
print("penalties incurred: {}".format(penalties))

Timestep taken: 208
penalties incurred: 65


In [21]:
len(frames)

208

In [22]:
frames[0]['frame']

'+---------+\n|\x1b[35mR\x1b[0m: | : :G|\n| : | : : |\n| : : : : |\n| | : | : |\n|\x1b[34;1m\x1b[43mY\x1b[0m\x1b[0m| : |B: |\n+---------+\n  (West)\n'

In [23]:
frame_num = -1
print(frames[frame_num]['frame'])
print(frames[frame_num]['state'])
print(frames[frame_num]['action'])
print(frames[frame_num]['reward'])

+---------+
|[35m[34;1m[43mR[0m[0m[0m: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)

0
5
20


Let's replay the episode frame by frame from the start.

In [29]:
from IPython.display import clear_output, Pretty, display
import time

def print_frames(frames, term=True):
    # Replay the recorded frames in place; pause between frames only when term is True.
    for frame in frames:
        clear_output(wait=True)
        display(Pretty(frame['frame']))
        if term:
            time.sleep(0.3)


print_frames(frames, False)

+---------+
|[35m[34;1m[43mR[0m[0m[0m: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)


Create a Q-table and train it.


In [30]:
import numpy as np
q_table = np.zeros([env.observation_space.n, env.action_space.n])
print(q_table.shape)

(500, 6)
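
A freshly zeroed table makes every action look equally good, so training needs a way to try actions other than the current best. The training cell further down does this with an epsilon-greedy rule. A minimal standalone sketch of that rule follows; `epsilon_greedy` and the sample `q_row` values are made up for illustration and use only the standard library.

```python
import random


def epsilon_greedy(q_row, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    # Greedy choice; ties go to the lowest index, as with np.argmax.
    return max(range(len(q_row)), key=lambda a: q_row[a])


q_row = [0.0, -0.5, 1.2, 0.3, -10.0, -10.0]  # made-up values for one state
print(epsilon_greedy(q_row, epsilon=0.0))    # always greedy: 2
```

With `epsilon=0.1`, roughly one step in ten is a uniformly random action and the rest follow the table.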


In [None]:
print(q_table[408])
print()
env.s = 408
env.render()

In [31]:
%%time

import random

alpha = 0.1    # learning rate
gamma = 0.6    # discount factor
epsilon = 0.1  # exploration rate (probability of taking a random action)

all_epochs = []
all_penalties = []

for i in range(1, 100001):
    state = env.reset()

    epochs, penalties, reward = 0, 0, 0
    done = False
    while not done:
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()  # explore: random action
        else:
            action = np.argmax(q_table[state])  # exploit: best known action

        next_state, reward, done, info = env.step(action)

        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])

        new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
        q_table[state, action] = new_value

        if reward == -10:
            penalties += 1

        state = next_state
        epochs += 1

    if i % 100 == 0:
        clear_output(wait=True)
        print(f"Episode: {i}")
    
print("Training Finishing.\n")

Episode: 100000
Training Finishing.

CPU times: user 1min 37s, sys: 22.2 s, total: 1min 59s
Wall time: 1min 42s
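
The update inside the training loop blends the old estimate with a one-step target, `reward + gamma * next_max`, weighted by `alpha`. The sketch below isolates that single line as a function (`q_update` is an illustrative name) and applies it twice by hand, using the same `alpha` and `gamma` as the training cell.

```python
def q_update(old_value, reward, next_max, alpha=0.1, gamma=0.6):
    """Blend the old estimate with the one-step target reward + gamma * next_max."""
    target = reward + gamma * next_max
    return (1 - alpha) * old_value + alpha * target


# Starting from an all-zero table, apply the same -1 step twice.
q = 0.0
q = q_update(q, reward=-1, next_max=0.0)
print(round(q, 4))  # -0.1
q = q_update(q, reward=-1, next_max=0.0)
print(round(q, 4))  # -0.19
```

Each update moves the estimate a fraction `alpha` of the way toward the target, so a value stops changing once it equals `reward + gamma * next_max`.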


Inspect the learned Q-table.

In [32]:
env.reset()
env.s = 408
print(q_table[408])
print()
env.render()

print(np.argmax(q_table[408]))

[ -1.45024     -1.870144    -1.45024001  -1.45024     -0.7504
 -10.45023842]

+---------+
|[35mR[0m: | : :G|
| : | : : |
| : : : : |
| | : | : |
|[34;1m[43mY[0m[0m| : |B: |
+---------+

4
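
The learned value for pickup, -0.7504, can be cross-checked by hand. In the rendering above the taxi sits on Y together with the waiting passenger, and the destination R is four cells to the north. Assuming the shortest plan is pickup, four moves north, then dropoff, the discounted sum of rewards with `gamma = 0.6` reproduces the table entry. `discounted_return` is an illustrative helper, not part of the notebook.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * reward_t over the reward sequence."""
    total = 0.0
    # Work backwards so each step is reward + gamma * (return of the rest).
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Assumed shortest plan from state 408: pickup, north x4, dropoff.
rewards = [-1, -1, -1, -1, -1, 20]
print(round(discounted_return(rewards, gamma=0.6), 4))  # -0.7504
```

The agreement suggests the entry for the greedy action in this state has converged to its optimal value.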


In [34]:
frames = []
env.s = env.encode(4,4,0,3)
done = False

while not done:
    trained_action = np.argmax(q_table[env.s])
    state, reward, done, info = env.step(trained_action)
    
    frames.append({
        'frame': env.render(mode='ansi'),
        'state': state,
        'action': trained_action,
        'reward': reward
        }
    )

print_frames(frames, True)

+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |[35m[34;1m[43mB[0m[0m[0m: |
+---------+
  (Dropoff)


In [35]:
len(frames)

17

In [36]:
total_epochs, total_penalties = 0,0
episodes = 100
for _ in range(episodes):
    state = env.reset()
    epochs, penalties, reward = 0, 0, 0
    
    done = False
    
    while not done:
        action = np.argmax(q_table[state])
        state, reward, done, info = env.step(action)

        if reward == -10:
            penalties += 1

        epochs += 1

    total_penalties += penalties
    total_epochs += epochs

print(f"Results after {episodes} episodes")
print(f"Average timesteps per episode: {total_epochs / episodes}")
print(f"Average penalties per episode: {total_penalties / episodes}")

Results after 100 episodes
Average timesteps per episode: 13.38
Average penalties per episode: 0.0
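
The `while not done` loops above stop only when the environment reports `done`. Assuming `gym.make('Taxi-v3').env` returns the environment without its step-limit wrapper, a greedy policy read from an under-trained table could cycle forever. A minimal sketch of a capped rollout follows; `run_greedy`, `step_fn`, and the three-state chain are hypothetical stand-ins so that the example runs without gym.

```python
def run_greedy(step_fn, q_table, start_state, max_steps=200):
    """Follow argmax actions until done or until max_steps is reached.

    Returns (steps_taken, total_reward, finished).
    """
    state, total_reward = start_state, 0
    for step in range(1, max_steps + 1):
        row = q_table[state]
        action = max(range(len(row)), key=lambda a: row[a])
        state, reward, done = step_fn(state, action)
        total_reward += reward
        if done:
            return step, total_reward, True
    return max_steps, total_reward, False


# Hypothetical toy chain 0 -> 1 -> 2 (goal): action 1 moves right, action 0 stays.
def chain_step(state, action):
    next_state = min(state + action, 2)
    done = next_state == 2
    return next_state, (20 if done else -1), done


good_table = [[0.0, 1.0], [0.0, 1.0], [0.0, 0.0]]
stuck_table = [[1.0, 0.0], [1.0, 0.0], [0.0, 0.0]]
print(run_greedy(chain_step, good_table, 0))                 # (2, 19, True)
print(run_greedy(chain_step, stuck_table, 0, max_steps=50))  # (50, -50, False)
```

With the four-value `env.step` used in this notebook, a wrapper such as `lambda s, a: env.step(a)[:3]` could serve as `step_fn`; the `finished` flag then separates real deliveries from rollouts that hit the cap.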
