In [1]:
from pathlib import Path
import herringbone as hb

All initialization tests passed.
imported herringbone without any errors :)


### Creating an MDP

An MDP is formally defined as a 5-tuple $\mathcal{M} = (S, A, P, R, \gamma)$, where:
- $S$ defines the state space
- $A$ defines the action space
- $P$ models the environment dynamics
- $R$ models the reward function
- $\gamma$ defines the discount factor

To create an MDP with this framework, pass it paths to at least a state config, a map, and an action config.

Optionally, it also accepts an array of transition matrices (see $P$ in the formal MDP definition), a seed, and the discount factor $\gamma$.
These all have default values, so do not fret if you do not understand them yet!
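
As a small illustration of what the discount factor $\gamma$ does, the sketch below computes the discounted return of one reward sequence. The function name `discounted_return` and the reward sequence are illustrative assumptions, not part of herringbone.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k] for one episode."""
    total = 0.0
    # Work backwards so each step is: G_t = r_t + gamma * G_{t+1}
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Illustrative sequence: four step penalties followed by a goal reward.
rewards = [-1, -1, -1, -1, 10]
print(discounted_return(rewards, gamma=1.0))  # undiscounted: plain sum
print(discounted_return(rewards, gamma=0.9))  # later rewards count for less
```

With $\gamma = 1$ every reward counts equally; with $\gamma < 1$ the goal reward at the end of the sequence contributes less the further away it is.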

In [2]:
print('Map options:\n\t',
      *sorted([m.stem for m in Path("herringbone/env_core/maps").glob('*.csv')]), sep='\n\t- ')

print('\nSupported configurations:\n\t',
      *sorted([c.stem for c in Path("herringbone/env_core/config").glob('*.json')]), sep='\n\t- ')

Map options:
	
	- danger_holes
	- double_fish
	- easy
	- example
	- example2
	- mega
	- slides
	- wall_of_death

Supported configurations:
	
	- action_config
	- state_config


In [3]:
state_path = "herringbone/env_core/config/state_config.json"
map_path = "herringbone/env_core/maps/double_fish.csv"
action_path = "herringbone/env_core/config/action_config.json"

gamma = 1

mdp = hb.MDP(state_path, map_path, action_path, seed=42, gamma=gamma)

### Previewing the board

The board can be previewed with the following code.

**Render Modes**
1. `'sar'`: prints the state, action, and reward of each iteration (only used in Monte Carlo simulations and Temporal Difference learning)
2. `'rewards'`: prints the board with the calculated rewards for each state
3. `'ascii'`: prints an ASCII representation of the board

In [4]:
render_modes = ['sar', 'rewards', 'ascii']
hb.Render.preview_frame(board=mdp.get_board(), agent_state=None, render_mode=render_modes[2])

╔═════════╦═════════╦═════════╦═════════╦═════════╗
║         ║         ║         ║         ║ <x)))>< ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║         ║         ║         ║         ║         ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║         ║         ║         ║         ║         ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║         ║         ║         ║         ║         ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║         ║         ║         ║         ║ <x)))>< ║
╚═════════╩═════════╩═════════╩═════════╩═════════╝[0]


### Creating a policy & running an episode

A policy can be created from the MDP; by default this policy is uniform random.

An episode is created from an MDP, a policy, and a max depth, which ensures that it does not run forever under a suboptimal policy.

The episode instance can then be run with a render mode.
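
To show why a max depth matters, here is a minimal standalone sketch of an episode loop. It does not use herringbone: the 5x5 board, the two goal cells, the reward scheme (-1 per step, 10 on reaching a goal), and the names `step`, `run_episode`, and `uniform_policy` are all assumptions made for illustration.

```python
import random

ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
TERMINALS = {(0, 4), (4, 4)}  # assumed goal cells on a 5x5 board
SIZE = 5


def step(state, action):
    """Move one cell; bumping into the border leaves the agent in place."""
    d_row, d_col = ACTIONS[action]
    row = min(max(state[0] + d_row, 0), SIZE - 1)
    col = min(max(state[1] + d_col, 0), SIZE - 1)
    reward = 10 if (row, col) in TERMINALS else -1  # assumed reward scheme
    return (row, col), reward


def run_episode(policy, start, max_depth, rng):
    """Follow `policy` from `start` until a terminal state or `max_depth` steps."""
    state, trajectory = start, []
    for _ in range(max_depth):
        if state in TERMINALS:
            break
        action = policy(state, rng)
        next_state, reward = step(state, action)
        trajectory.append((state, action, reward))
        state = next_state
    return trajectory


def uniform_policy(state, rng):
    """Pick any of the four actions with equal probability."""
    return rng.choice(list(ACTIONS))


rng = random.Random(42)
trajectory = run_episode(uniform_policy, start=(4, 1), max_depth=1000, rng=rng)
print(len(trajectory), "steps")
```

A policy that always walks into a wall never reaches a goal cell, so without the cap the loop would not end; with the cap it simply stops after `max_depth` steps.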

In [5]:

random_policy = hb.Policy(mdp=mdp)
episode = hb.Episode(mdp=mdp, policy=random_policy, max_depth=1000)
episode.run("ascii")

╔═════════╦═════════╦═════════╦═════════╦═════════╗
║         ║         ║         ║         ║ <x)))>< ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║         ║         ║         ║         ║         ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║         ║         ║         ║         ║         ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║         ║         ║         ║         ║         ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║         ║  =^.^=  ║         ║         ║ <x)))>< ║
╚═════════╩═════════╩═════════╩═════════╩═════════╝[0]
╔═════════╦═════════╦═════════╦═════════╦═════════╗
║         ║         ║         ║         ║ <x)))>< ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣

# Dynamic Programming

### Policy Iteration

The code below runs the policy iteration algorithm.

Creating the algorithm only takes an MDP object and $\theta$, which defines how small the value updates must become before the algorithm terminates.

The optimal policy, the optimal state values, and the Q-values of the MDP can be retrieved by calling the `run()` method on your `PolicyIteration` object.
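
The role of $\theta$ can be sketched with iterative policy evaluation, the evaluation step inside policy iteration. The code below is a standalone illustration, not herringbone's implementation: the board layout, the reward scheme (-1 per step, 10 on reaching a goal cell), and the function name `evaluate_uniform_policy` are assumptions.

```python
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
TERMINALS = {(0, 4), (4, 4)}  # assumed goal cells on a 5x5 board
SIZE = 5
STATES = [(row, col) for row in range(SIZE) for col in range(SIZE)]


def step(state, move):
    """Deterministic move; bumping into the border leaves the agent in place."""
    row = min(max(state[0] + move[0], 0), SIZE - 1)
    col = min(max(state[1] + move[1], 0), SIZE - 1)
    reward = 10 if (row, col) in TERMINALS else -1  # assumed reward scheme
    return (row, col), reward


def evaluate_uniform_policy(theta, gamma=1.0):
    """Sweep over all states until the largest change in a sweep is below theta."""
    values = {s: 0.0 for s in STATES}  # terminal values stay at 0
    sweeps = 0
    while True:
        delta = 0.0
        for s in STATES:
            if s in TERMINALS:
                continue
            # Expected one-step return under the uniform random policy.
            new_value = 0.0
            for move in MOVES:
                next_state, reward = step(s, move)
                new_value += 0.25 * (reward + gamma * values[next_state])
            delta = max(delta, abs(new_value - values[s]))
            values[s] = new_value
        sweeps += 1
        if delta < theta:
            return values, sweeps


coarse, coarse_sweeps = evaluate_uniform_policy(theta=1e-1)
fine, fine_sweeps = evaluate_uniform_policy(theta=1e-10)
print(coarse_sweeps, fine_sweeps)
```

A smaller $\theta$ costs more sweeps but leaves the values closer to the true values of the evaluated policy.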

In [6]:
theta = 0.000_000_000_1

policy_iteration = hb.PolicyIteration(mdp=mdp, theta_threshold=theta)

# Run PolicyIteration
pi_optimal_policy, pi_state_values, pi_q_values = policy_iteration.run()

#### Displaying policy/state values

The policy can be displayed by simply printing the `Policy` object.

The learned state values can be displayed by calling `hb.Render.preview_V(mdp, state_values)`, and the learned Q-values by calling `hb.Render.preview_Q(mdp, q_values)`.

In [7]:
hb.Render.preview_V(mdp=mdp, learned_V=pi_state_values)

print('-----')

print(pi_optimal_policy)

print('-----')

hb.Render.preview_Q(mdp=mdp, learned_Q=pi_q_values)

╔═══════╦═══════╦═══════╦═══════╦═══════╗
║  7.00 ║  8.00 ║  9.00 ║ 10.00 ║  0.00 ║
╠═══════╬═══════╬═══════╬═══════╬═══════╣
║  6.00 ║  7.00 ║  8.00 ║  9.00 ║ 10.00 ║
╠═══════╬═══════╬═══════╬═══════╬═══════╣
║  5.00 ║  6.00 ║  7.00 ║  8.00 ║  9.00 ║
╠═══════╬═══════╬═══════╬═══════╬═══════╣
║  6.00 ║  7.00 ║  8.00 ║  9.00 ║ 10.00 ║
╠═══════╬═══════╬═══════╬═══════╬═══════╣
║  7.00 ║  8.00 ║  9.00 ║ 10.00 ║  0.00 ║
╚═══════╩═══════╩═══════╩═══════╩═══════╝
-----
╔═════════╦═════════╦═════════╦═════════╦═════════╗
║    →    ║    →    ║    →    ║    →    ║ ↑/↓/←/→ ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║   ↑/→   ║   ↑/→   ║   ↑/→   ║   ↑/→   ║    ↑    ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║  ↑/↓/→  ║  ↑/↓/→  ║  ↑/↓/→  ║  ↑/↓/→  ║   ↑/↓   ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║   ↓/→   ║   ↓/→   ║   ↓/→   ║   ↓/→   ║    ↓    ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║    →    ║    →    ║    →    ║    →    ║ ↑/↓/←/→ ║
╚═════════╩═

### Value Iteration

The code below runs the value iteration algorithm.
It works pretty much the same as the policy iteration algorithm, aside from the name of the class.
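
For readers who want to see the idea behind the class, below is a minimal standalone sketch of value iteration. It is not herringbone's code: the 5x5 board with goal cells at `(0, 4)` and `(4, 4)`, the reward scheme (-1 per step, 10 on reaching a goal), and all function names are assumptions inferred from the printed output in this notebook.

```python
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
TERMINALS = {(0, 4), (4, 4)}  # assumed goal cells
SIZE = 5
STATES = [(row, col) for row in range(SIZE) for col in range(SIZE)]


def step(state, action):
    """Deterministic move; bumping into the border leaves the agent in place."""
    d_row, d_col = ACTIONS[action]
    row = min(max(state[0] + d_row, 0), SIZE - 1)
    col = min(max(state[1] + d_col, 0), SIZE - 1)
    reward = 10 if (row, col) in TERMINALS else -1  # assumed reward scheme
    return (row, col), reward


def q_value(values, state, action, gamma):
    """One-step lookahead: immediate reward plus discounted next-state value."""
    next_state, reward = step(state, action)
    return reward + gamma * values[next_state]


def value_iteration(theta, gamma=1.0):
    """Apply the Bellman optimality update until changes fall below theta."""
    values = {s: 0.0 for s in STATES}  # terminal values stay at 0
    while True:
        delta = 0.0
        for s in STATES:
            if s in TERMINALS:
                continue
            best = max(q_value(values, s, a, gamma) for a in ACTIONS)
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < theta:
            return values


def greedy_actions(values, state, gamma=1.0):
    """All actions that achieve the best one-step lookahead value."""
    q = {a: q_value(values, state, a, gamma) for a in ACTIONS}
    best = max(q.values())
    return {a for a, v in q.items() if v == best}


values = value_iteration(theta=1e-10)
for row in range(SIZE):
    print(" ".join(f"{values[(row, col)]:6.2f}" for col in range(SIZE)))
print(greedy_actions(values, (1, 0)))
```

Under these assumptions each state's value is 10 minus the number of -1 steps needed before the final move into the nearest goal cell, and states with several equally good moves have several greedy actions.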

In [8]:
theta = 0.000_000_000_1

value_iteration = hb.ValueIteration(mdp=mdp, theta_threshold=theta)

vi_optimal_policy, vi_state_values, vi_q_values = value_iteration.run()

In [9]:
hb.Render.preview_V(mdp=mdp, learned_V=vi_state_values)

print('-----')

print(vi_optimal_policy)

print('-----')

hb.Render.preview_Q(mdp=mdp, learned_Q=vi_q_values)

╔═══════╦═══════╦═══════╦═══════╦═══════╗
║  7.00 ║  8.00 ║  9.00 ║ 10.00 ║  0.00 ║
╠═══════╬═══════╬═══════╬═══════╬═══════╣
║  6.00 ║  7.00 ║  8.00 ║  9.00 ║ 10.00 ║
╠═══════╬═══════╬═══════╬═══════╬═══════╣
║  5.00 ║  6.00 ║  7.00 ║  8.00 ║  9.00 ║
╠═══════╬═══════╬═══════╬═══════╬═══════╣
║  6.00 ║  7.00 ║  8.00 ║  9.00 ║ 10.00 ║
╠═══════╬═══════╬═══════╬═══════╬═══════╣
║  7.00 ║  8.00 ║  9.00 ║ 10.00 ║  0.00 ║
╚═══════╩═══════╩═══════╩═══════╩═══════╝
-----
╔═════════╦═════════╦═════════╦═════════╦═════════╗
║    →    ║    →    ║    →    ║    →    ║ ↑/↓/←/→ ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║   ↑/→   ║   ↑/→   ║   ↑/→   ║   ↑/→   ║    ↑    ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║  ↑/↓/→  ║  ↑/↓/→  ║  ↑/↓/→  ║  ↑/↓/→  ║   ↑/↓   ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║   ↓/→   ║   ↓/→   ║   ↓/→   ║   ↓/→   ║    ↓    ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║    →    ║    →    ║    →    ║    →    ║ ↑/↓/←/→ ║
╚═════════╩═

# Monte Carlo Methods

### Monte Carlo Prediction

To run Monte Carlo prediction, first initialize a `MonteCarloPredictor` object.

Then pass a policy and a sample count to its `evaluate_policy` method.

Finally, the estimated state values can be retrieved from `mc_predictor.value_functions` and previewed with `Render.preview_V`.
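
The idea behind Monte Carlo prediction can be sketched without the library: sample episodes under the policy and average the observed returns per state. The sketch below uses the first-visit variant with uniformly random start states; whether herringbone uses first-visit or every-visit averaging is not stated here, so treat that choice, the board layout, the reward scheme, and the name `mc_first_visit` as assumptions.

```python
import random
from collections import defaultdict

MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
TERMINALS = {(0, 4), (4, 4)}  # assumed goal cells on a 5x5 board
SIZE = 5
STATES = [(row, col) for row in range(SIZE) for col in range(SIZE)]


def step(state, move):
    """Deterministic move; bumping into the border leaves the agent in place."""
    row = min(max(state[0] + move[0], 0), SIZE - 1)
    col = min(max(state[1] + move[1], 0), SIZE - 1)
    reward = 10 if (row, col) in TERMINALS else -1  # assumed reward scheme
    return (row, col), reward


def mc_first_visit(n_episodes, gamma=1.0, max_depth=1000, seed=0):
    """Estimate state values of the uniform random policy from sampled returns."""
    rng = random.Random(seed)
    starts = [s for s in STATES if s not in TERMINALS]
    returns_sum = defaultdict(float)
    visits = defaultdict(int)
    for _ in range(n_episodes):
        state = rng.choice(starts)
        episode = []
        for _ in range(max_depth):
            if state in TERMINALS:
                break
            next_state, reward = step(state, rng.choice(MOVES))
            episode.append((state, reward))
            state = next_state
        # Walk backwards; the last write per state is its earliest visit.
        g = 0.0
        first_return = {}
        for s, reward in reversed(episode):
            g = reward + gamma * g
            first_return[s] = g
        for s, g_s in first_return.items():
            returns_sum[s] += g_s
            visits[s] += 1
    return {s: returns_sum[s] / visits[s] for s in visits}


estimates = mc_first_visit(n_episodes=2000)
print(round(estimates[(0, 3)], 2), round(estimates[(2, 0)], 2))
```

The estimates are noisy for small sample counts and settle as the number of episodes grows, which is why the cell below uses a large `N`.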

In [10]:
N = 100_000
mc_predictor = hb.MonteCarloPredictor(mdp)
mc_predictor.evaluate_policy(random_policy, n_samples=N)
hb.Render.preview_V(mdp=mdp, learned_V=mc_predictor.value_functions)

╔════════╦════════╦════════╦════════╦════════╗
║ -42.26 ║ -37.93 ║ -28.98 ║ -14.19 ║  0.00  ║
╠════════╬════════╬════════╬════════╬════════╣
║ -42.62 ║ -38.59 ║ -30.90 ║ -20.54 ║ -9.76  ║
╠════════╬════════╬════════╬════════╬════════╣
║ -42.70 ║ -38.92 ║ -31.99 ║ -23.10 ║ -15.46 ║
╠════════╬════════╬════════╬════════╬════════╣
║ -42.39 ║ -38.60 ║ -30.81 ║ -20.45 ║ -9.46  ║
╠════════╬════════╬════════╬════════╬════════╣
║ -42.16 ║ -37.95 ║ -28.92 ║ -14.06 ║  0.00  ║
╚════════╩════════╩════════╩════════╩════════╝


### Monte Carlo Control

Monte Carlo control works in a similar fashion. First an object is initialized, which can then be trained using the `.train` method.

The learned policy can be retrieved from the `.policy` attribute of the `MonteCarloController` object.
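
The `epsilon` argument used in the next cell presumably controls $\epsilon$-greedy exploration; that reading is an assumption, since the library's internals are not shown here. A minimal sketch of $\epsilon$-greedy action selection, with a hypothetical helper `epsilon_greedy` and made-up Q-values:

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick any action, otherwise a best-valued one."""
    actions = list(q_values)
    if rng.random() < epsilon:
        return rng.choice(actions)
    best = max(q_values.values())
    # Break ties between equally good actions at random.
    return rng.choice([a for a in actions if q_values[a] == best])


rng = random.Random(0)
q = {"up": 5.0, "down": -4.5, "left": -4.0, "right": 10.0}  # made-up values
picks = [epsilon_greedy(q, epsilon=0.25, rng=rng) for _ in range(10_000)]
# The greedy action is chosen with probability (1 - epsilon) + epsilon / 4.
print(picks.count("right") / len(picks))
```

A larger `epsilon` explores more but follows the current best action less often.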

In [11]:

N = 100_000
mc_control = hb.MonteCarloController(mdp, epsilon=0.25)
mc_control.train(n_episodes=N)
trained_policy = mc_control.policy

This trained policy can be run in an episode. The render mode `"sar"` gives a quick overview of the actions taken by the agent.

In [12]:
print(trained_policy)
episode = hb.Episode(mdp=mdp, policy=trained_policy, max_depth=1000)
episode.run("sar")

╔═════════╦═════════╦═════════╦═════════╦═════════╗
║    →    ║    →    ║    →    ║    →    ║ ↑/↓/←/→ ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║    →    ║    →    ║    →    ║    ↑    ║    ↑    ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║    →    ║    →    ║    →    ║    →    ║    ↓    ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║    →    ║    →    ║    →    ║    →    ║    ↓    ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║    →    ║    →    ║    →    ║    →    ║ ↑/↓/←/→ ║
╚═════════╩═════════╩═════════╩═════════╩═════════╝
t: 0 | S: [2, 2], R: -1, A: →
t: 1 | S: [2, 3], R: -1, A: →
t: 2 | S: [2, 4], R: -1, A: ↓
t: 3 | S: [3, 4], R: -1, A: ↓
t: 4 | S: [4, 4], R: 10, A: None


# Temporal Difference Learning

### Sarsa
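
Sarsa is an on-policy temporal difference method: after each step it updates $Q(s, a)$ toward $r + \gamma Q(s', a')$, where $a'$ is the action the agent actually takes next. The sketch below shows a single update; the helper `sarsa_update`, the step size `alpha`, and the example numbers are illustrative assumptions, not herringbone's API.

```python
def sarsa_update(q, state, action, reward, next_state, next_action,
                 alpha, gamma, terminal):
    """On-policy TD update: bootstrap from the action actually chosen next."""
    if terminal:
        target = reward
    else:
        target = reward + gamma * q[next_state][next_action]
    q[state][action] += alpha * (target - q[state][action])


q = {
    (0, 2): {"right": 0.0, "left": 0.0},
    (0, 3): {"right": 4.0, "left": -2.0},
}
# The agent moved right from (0, 2), then (while exploring) chose "left" in (0, 3).
sarsa_update(q, (0, 2), "right", -1, (0, 3), "left",
             alpha=0.5, gamma=0.9, terminal=False)
print(q[(0, 2)]["right"])
```

Because the target uses the next action that was really taken, exploratory moves pull the estimates down, so Sarsa's Q-values reflect the behaviour policy rather than the greedy one.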

In [13]:
mdp_sarsa = hb.MDP(
    state_config=state_path,
    map=map_path,
    action_config=action_path,
    seed=42,
)

num_episodes = 10_000

s = hb.Sarsa(num_episodes, mdp=mdp_sarsa)

In [14]:
s.run()

{[0, 0]: {↑: -8.756797413308496,
  ↓: -8.839242017720442,
  ←: -8.766816994981962,
  →: -7.676003514828438},
 [0, 1]: {↑: -7.162564920232798,
  ↓: -8.110610101275519,
  ←: -8.4964076348574,
  →: -7.1028290880034},
 [0, 2]: {↑: -3.9912187883975547,
  ↓: -7.5796014675425205,
  ←: -7.628890356795175,
  →: -3.6807232691858114},
 [0, 3]: {↑: 5.151323340396946,
  ↓: -4.651376561862955,
  ←: -4.154878269953178,
  →: 10.0},
 [0, 4]: {↑: 0.0, ↓: 0.0, ←: 0.0, →: 0.0},
 [1, 0]: {↑: -8.15139939289955,
  ↓: -8.775222824924995,
  ←: -8.533085768307288,
  →: -8.105630107280646},
 [1, 1]: {↑: -8.137059732430753,
  ↓: -8.835827035740863,
  ←: -8.507999722440763,
  →: -4.1726151972575485},
 [1, 2]: {↑: -7.2284446952057175,
  ↓: -8.212022241982083,
  ←: -7.088421726013737,
  →: -1.8504131980973129},
 [1, 3]: {↑: 2.0287880299061722,
  ↓: -5.8519597579882525,
  ←: -6.931692924021375,
  →: 2.7170122360007722},
 [1, 4]: {↑: 10.0,
  ↓: -5.588855982457644,
  ←: -2.400903677826759,
  →: 0.4719992031391427},
 [2

In [15]:
print(s.policy)

╔═════════╦═════════╦═════════╦═════════╦═════════╗
║    →    ║    →    ║    →    ║    →    ║ ↑/↓/←/→ ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║    →    ║    →    ║    →    ║    →    ║    ↑    ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║    ↓    ║    ↓    ║    →    ║    →    ║    ↑    ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║    →    ║    ↓    ║    →    ║    →    ║    ↓    ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║    →    ║    →    ║    →    ║    →    ║ ↑/↓/←/→ ║
╚═════════╩═════════╩═════════╩═════════╩═════════╝


In [16]:
hb.Render.preview_Q(mdp_sarsa, s.q_values)

╔═════════════╦═════════════╦═════════════╦═════════════╦═════════════╗
║   ↑: -8.76  ║   ↑: -7.16  ║   ↑: -3.99  ║   ↑: 5.15   ║   ↑: 0.00   ║ 
║   ↓: -8.84  ║   ↓: -8.11  ║   ↓: -7.58  ║   ↓: -4.65  ║   ↓: 0.00   ║ 
║   ←: -8.77  ║   ←: -8.50  ║   ←: -7.63  ║   ←: -4.15  ║   ←: 0.00   ║ 
║   →: -7.68  ║   →: -7.10  ║   →: -3.68  ║   →: 10.00  ║   →: 0.00   ║ 
╠═════════════╬═════════════╬═════════════╬═════════════╬═════════════╣
║   ↑: -8.15  ║   ↑: -8.14  ║   ↑: -7.23  ║   ↑: 2.03   ║   ↑: 10.00  ║ 
║   ↓: -8.78  ║   ↓: -8.84  ║   ↓: -8.21  ║   ↓: -5.85  ║   ↓: -5.59  ║ 
║   ←: -8.53  ║   ←: -8.51  ║   ←: -7.09  ║   ←: -6.93  ║   ←: -2.40  ║ 
║   →: -8.11  ║   →: -4.17  ║   →: -1.85  ║   →: 2.72   ║   →: 0.47   ║ 
╠═════════════╬═════════════╬═════════════╬═════════════╬═════════════╣
║   ↑: -8.55  ║   ↑: -8.14  ║   ↑: -7.20  ║   ↑: -4.84  ║   ↑: 0.90   ║ 
║   ↓: -7.39  ║   ↓: -7.46  ║   ↓: -6.62  ║   ↓: -4.18  ║   ↓: -4.35  ║ 
║   ←: -8.65  ║   ←: -8.79  ║   ←: -8.62  ║   ←: -8.17

### Q-Learning
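
Q-learning is an off-policy temporal difference method: it updates $Q(s, a)$ toward $r + \gamma \max_{a'} Q(s', a')$, regardless of which action is taken next. The sketch below shows one update and the fixed point of that target along a straight path into a goal. The helper `q_learning_update`, the step size `alpha`, and the values $\gamma = 0.9$, step reward -1, and goal reward 10 are assumptions; they are chosen because they are consistent with the Q-values printed below (10.0, 8.0, 6.2, 4.58 up to floating-point rounding), which suggests but does not confirm that the MDP's default discount factor is 0.9.

```python
def q_learning_update(q, state, action, reward, next_state,
                      alpha, gamma, terminal):
    """Off-policy TD update: bootstrap from the best next action."""
    if terminal:
        target = reward
    else:
        target = reward + gamma * max(q[next_state].values())
    q[state][action] += alpha * (target - q[state][action])


# One update for the move that enters a goal cell.
q = {(0, 3): {"right": 0.0}, (0, 4): {"right": 0.0}}
q_learning_update(q, (0, 3), "right", 10, (0, 4),
                  alpha=0.5, gamma=0.9, terminal=True)
print(q[(0, 3)]["right"])

# Converged values one, two, three, and four moves away from the goal,
# assuming reward 10 on arrival, -1 per other step, and gamma = 0.9.
chain = [10.0]
for _ in range(3):
    chain.append(-1 + 0.9 * chain[-1])
print([round(v, 3) for v in chain])
```

Since the target always uses the best next action, the learned values approach those of the greedy policy even while the agent explores.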

In [17]:
mdp_q_learning = hb.MDP(
    state_config=state_path,
    map=map_path,
    action_config=action_path,
    seed=42,
)

num_episodes = 10_000

ql = hb.QLearning(num_episodes, mdp=mdp_q_learning)

In [18]:
ql.run()

{[0, 0]: {↑: 3.121999999999999,
  ↓: 1.8097999999999992,
  ←: 3.121999999999999,
  →: 4.579999999999998},
 [0, 1]: {↑: 4.579999999999998,
  ↓: 3.121999999999999,
  ←: 3.121999999999999,
  →: 6.199999999999999},
 [0, 2]: {↑: 6.199999999999999,
  ↓: 4.579999999999998,
  ←: 4.579999999999998,
  →: 8.0},
 [0, 3]: {↑: 8.0, ↓: 6.199999999999999, ←: 6.199999999999999, →: 10.0},
 [0, 4]: {↑: 0.0, ↓: 0.0, ←: 0.0, →: 0.0},
 [1, 0]: {↑: 3.121999999999999,
  ↓: 0.6288199999999993,
  ←: 1.8097999999999992,
  →: 3.121999999999999},
 [1, 1]: {↑: 4.579999999999998,
  ↓: 1.8097999999999992,
  ←: 1.8097999999999992,
  →: 4.579999999999998},
 [1, 2]: {↑: 6.199999999999999,
  ↓: 3.121999999999999,
  ←: 3.121999999999999,
  →: 6.199999999999999},
 [1, 3]: {↑: 8.0, ↓: 4.579999999999998, ←: 4.579999999999998, →: 8.0},
 [1, 4]: {↑: 10.0, ↓: 6.199999999999999, ←: 6.199999999999999, →: 8.0},
 [2, 0]: {↑: 1.8097999999999992,
  ↓: 1.8097999999999992,
  ←: 0.6288199999999993,
  →: 1.8097999999999992},
 [2, 1]: {↑:

In [19]:
print(ql.policy)

╔═════════╦═════════╦═════════╦═════════╦═════════╗
║    →    ║    →    ║    →    ║    →    ║ ↑/↓/←/→ ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║   ↑/→   ║   ↑/→   ║   ↑/→   ║   ↑/→   ║    ↑    ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║  ↑/↓/→  ║  ↑/↓/→  ║  ↑/↓/→  ║  ↑/↓/→  ║   ↑/↓   ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║   ↓/→   ║   ↓/→   ║   ↓/→   ║   ↓/→   ║    ↓    ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║    →    ║    →    ║    →    ║    →    ║ ↑/↓/←/→ ║
╚═════════╩═════════╩═════════╩═════════╩═════════╝


In [20]:
hb.Render.preview_Q(mdp=mdp_q_learning, learned_Q=ql.q_values)

╔═════════════╦═════════════╦═════════════╦═════════════╦═════════════╗
║   ↑: 3.12   ║   ↑: 4.58   ║   ↑: 6.20   ║   ↑: 8.00   ║   ↑: 0.00   ║ 
║   ↓: 1.81   ║   ↓: 3.12   ║   ↓: 4.58   ║   ↓: 6.20   ║   ↓: 0.00   ║ 
║   ←: 3.12   ║   ←: 3.12   ║   ←: 4.58   ║   ←: 6.20   ║   ←: 0.00   ║ 
║   →: 4.58   ║   →: 6.20   ║   →: 8.00   ║   →: 10.00  ║   →: 0.00   ║ 
╠═════════════╬═════════════╬═════════════╬═════════════╬═════════════╣
║   ↑: 3.12   ║   ↑: 4.58   ║   ↑: 6.20   ║   ↑: 8.00   ║   ↑: 10.00  ║ 
║   ↓: 0.63   ║   ↓: 1.81   ║   ↓: 3.12   ║   ↓: 4.58   ║   ↓: 6.20   ║ 
║   ←: 1.81   ║   ←: 1.81   ║   ←: 3.12   ║   ←: 4.58   ║   ←: 6.20   ║ 
║   →: 3.12   ║   →: 4.58   ║   →: 6.20   ║   →: 8.00   ║   →: 8.00   ║ 
╠═════════════╬═════════════╬═════════════╬═════════════╬═════════════╣
║   ↑: 1.81   ║   ↑: 3.12   ║   ↑: 4.58   ║   ↑: 6.20   ║   ↑: 8.00   ║ 
║   ↓: 1.81   ║   ↓: 3.12   ║   ↓: 4.58   ║   ↓: 6.20   ║   ↓: 8.00   ║ 
║   ←: 0.63   ║   ←: 0.63   ║   ←: 1.81   ║   ←: 3.12 

### Deep Q-Learning

In [21]:
mdp_deep_q_learning = hb.MDP(
    state_config=state_path,
    map=map_path,
    action_config=action_path,
    seed=42,
)

num_episodes = 10_000

dql = hb.DeepQLearning(num_episodes, mdp=mdp_deep_q_learning)

In [22]:
dql.run()

{[0, 0]: {↑: np.float32(1.2735373),
  ↓: np.float32(0.52230424),
  ←: np.float32(0.7144997),
  →: np.float32(2.6016333)},
 [0, 1]: {↑: np.float32(2.504319),
  ↓: np.float32(1.163114),
  ←: np.float32(1.2879168),
  →: np.float32(4.194538)},
 [0, 2]: {↑: np.float32(3.3830638),
  ↓: np.float32(2.126968),
  ←: np.float32(2.3171227),
  →: np.float32(5.9427443)},
 [0, 3]: {↑: np.float32(3.3209622),
  ↓: np.float32(3.497663),
  ←: np.float32(2.9782357),
  →: np.float32(7.9020677)},
 [0, 4]: {↑: 0, ↓: 0, ←: 0, →: 0},
 [1, 0]: {↑: np.float32(1.1756893),
  ↓: np.float32(-0.31608272),
  ←: np.float32(-0.07923591),
  →: np.float32(0.66712856)},
 [1, 1]: {↑: np.float32(2.2421432),
  ↓: np.float32(-0.047480702),
  ←: np.float32(0.2868037),
  →: np.float32(1.965961)},
 [1, 2]: {↑: np.float32(3.6586637),
  ↓: np.float32(0.17442444),
  ←: np.float32(0.9907745),
  →: np.float32(2.9085658)},
 [1, 3]: {↑: np.float32(4.338876),
  ↓: np.float32(0.5904563),
  ←: np.float32(1.4689506),
  →: np.float32(3.70827

In [23]:
print(dql.policy)

╔═════════╦═════════╦═════════╦═════════╦═════════╗
║    →    ║    →    ║    →    ║    →    ║ ↑/↓/←/→ ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║    ↑    ║    ↑    ║    ↑    ║    ↑    ║    ↑    ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║    ↓    ║    ↓    ║    ↓    ║    ↓    ║    ↓    ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║    ↓    ║    ↓    ║    ↓    ║    ↓    ║    ↓    ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║    →    ║    →    ║    →    ║    →    ║ ↑/↓/←/→ ║
╚═════════╩═════════╩═════════╩═════════╩═════════╝


In [24]:
hb.Render.preview_Q(mdp=mdp_deep_q_learning, learned_Q=dql.q_values)

╔═════════════╦═════════════╦═════════════╦═════════════╦═════════════╗
║   ↑: 1.27   ║   ↑: 2.50   ║   ↑: 3.38   ║   ↑: 3.32   ║   ↑: 0.00   ║ 
║   ↓: 0.52   ║   ↓: 1.16   ║   ↓: 2.13   ║   ↓: 3.50   ║   ↓: 0.00   ║ 
║   ←: 0.71   ║   ←: 1.29   ║   ←: 2.32   ║   ←: 2.98   ║   ←: 0.00   ║ 
║   →: 2.60   ║   →: 4.19   ║   →: 5.94   ║   →: 7.90   ║   →: 0.00   ║ 
╠═════════════╬═════════════╬═════════════╬═════════════╬═════════════╣
║   ↑: 1.18   ║   ↑: 2.24   ║   ↑: 3.66   ║   ↑: 4.34   ║   ↑: 4.73   ║ 
║   ↓: -0.32  ║   ↓: -0.05  ║   ↓: 0.17   ║   ↓: 0.59   ║   ↓: 0.15   ║ 
║   ←: -0.08  ║   ←: 0.29   ║   ←: 0.99   ║   ←: 1.47   ║   ←: 1.32   ║ 
║   →: 0.67   ║   →: 1.97   ║   →: 2.91   ║   →: 3.71   ║   →: 2.98   ║ 
╠═════════════╬═════════════╬═════════════╬═════════════╬═════════════╣
║   ↑: -0.00  ║   ↑: 0.41   ║   ↑: 1.09   ║   ↑: 1.43   ║   ↑: 1.70   ║ 
║   ↓: 0.43   ║   ↓: 1.00   ║   ↓: 1.24   ║   ↓: 1.85   ║   ↓: 2.16   ║ 
║   ←: -0.41  ║   ←: -0.39  ║   ←: 0.07   ║   ←: 0.36 