In [1]:
from pathlib import Path
import herringbone as hb

All initialization tests passed.
imported herringbone without any errors :)


### Creating an MDP

An MDP is formally defined as a 5-tuple $\mathcal{M} = (S, A, P, R, \gamma)$, where:
- $S$ defines the state space
- $A$ defines the action space
- $P$ models the environment dynamics
- $R$ models the reward function
- $\gamma$ defines the discount factor

To create an MDP with this framework, pass it paths to at least a state config, a map, and an action config.

Optionally, it also accepts an array of transition matrices (see $P$ in the formal MDP definition), a seed, and the discount factor $\gamma$.
These all have default values, so do not fret if you do not understand them yet!
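
To get a feel for what $\gamma$ does, the snippet below computes the discounted return $G = \sum_t \gamma^t r_t$ of a single reward sequence. It is a standalone sketch, not part of herringbone; the function name `discounted_return` and the reward sequence are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Return G = r_0 + gamma * r_1 + gamma^2 * r_2 + ... for one episode."""
    g = 0.0
    # Accumulate from the last reward backwards: G_t = r_t + gamma * G_{t+1}
    for reward in reversed(rewards):
        g = reward + gamma * g
    return g


# Hypothetical episode: three steps that cost -1 each, then a goal worth 10.
rewards = [-1, -1, -1, 10]
undiscounted = discounted_return(rewards, gamma=1.0)
discounted = discounted_return(rewards, gamma=0.9)
print(f"gamma=1.0: {undiscounted:.2f}")
print(f"gamma=0.9: {discounted:.2f}")
```

With $\gamma = 1$ every reward counts fully; with $\gamma < 1$ the goal reward shrinks the further away it is.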

In [2]:
print('Map options:\n\t',
      *sorted([m.stem for m in Path("herringbone/env_core/maps").glob('*.csv')]), sep='\n\t- ')

print('\nSupported configurations:\n\t',
      *sorted([c.stem for c in Path("herringbone/env_core/config").glob('*.json')]), sep='\n\t- ')

Map options:
	
	- danger_holes
	- double_fish
	- easy
	- example
	- example2
	- flappy_bird
	- gamma
	- heart
	- maze
	- mega
	- slides
	- wall_of_death

Supported configurations:
	
	- action_config
	- state_config


In [3]:
state_path = "herringbone/env_core/config/state_config.json"
map_path = "herringbone/env_core/maps/double_fish.csv"
action_path = "herringbone/env_core/config/action_config.json"

gamma = 1

mdp = hb.MDP(state_path, map_path, action_path, seed=42, gamma=gamma)

### Previewing the board

The board can be previewed with the following code.

**Render Modes**
1. `'sar'`: prints the state, action, reward of each iteration (only used in Monte Carlo simulations and Temporal Difference learning)
2. `'rewards'`: prints the board with the calculated rewards for each state
3. `'ascii'`: prints an ASCII representation of the board

In [4]:
render_modes = ['sar', 'rewards', 'ascii']
hb.Render.preview_frame(board=mdp.get_board(), agent_state=None, render_mode=render_modes[2])

╔═════════╦═════════╦═════════╦═════════╦═════════╗
║         ║         ║         ║         ║ <x)))>< ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║         ║         ║         ║         ║         ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║         ║         ║         ║         ║         ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║         ║         ║         ║         ║         ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║         ║         ║         ║         ║ <x)))>< ║
╚═════════╩═════════╩═════════╩═════════╩═════════╝[0]


### Creating a policy & running an episode

A policy can be created from the MDP; by default, this policy is uniform random.

An episode is created from an MDP, a policy, and a max depth, which ensures that it does not run forever under a suboptimal policy.

The episode instance can then be run with a render mode.

In [5]:
random_policy = hb.Policy(mdp=mdp)
episode = hb.Episode(mdp=mdp, policy=random_policy, max_depth=1000)
episode.run("ascii")

╔═════════╦═════════╦═════════╦═════════╦═════════╗
║         ║         ║         ║         ║ <x)))>< ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║         ║         ║         ║         ║         ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║         ║         ║         ║         ║         ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║         ║         ║         ║         ║         ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║         ║  =^.^=  ║         ║         ║ <x)))>< ║
╚═════════╩═════════╩═════════╩═════════╩═════════╝[0]
╔═════════╦═════════╦═════════╦═════════╦═════════╗
║         ║         ║         ║         ║ <x)))>< ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣

# Dynamic Programming

### Policy Iteration

The code below runs the policy iteration algorithm.

The algorithm takes only an MDP object and a threshold $\theta$, which defines how small the value updates must become before the algorithm terminates.

Calling `run()` on the `PolicyIteration` object returns the optimal policy, the optimal state values, and the Q-values of the MDP.
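
As an illustration of how $\theta$ acts as a stopping criterion, here is a minimal standalone sketch of iterative policy evaluation (the evaluation half of policy iteration) for a uniform random policy. It does not use herringbone: the 5x5 grid, the goal cells, the reward of 10 for entering a goal and -1 otherwise, and the rule that walking into the border leaves the agent in place are all assumptions chosen to resemble the `double_fish` map.

```python
ROWS, COLS = 5, 5
TERMINALS = {(0, 4), (4, 4)}                   # assumed goal cells
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]     # up, down, left, right


def step(state, move):
    """Deterministic move; walking into the border leaves the agent in place."""
    row, col = state[0] + move[0], state[1] + move[1]
    if not (0 <= row < ROWS and 0 <= col < COLS):
        row, col = state
    reward = 10 if (row, col) in TERMINALS else -1   # assumed reward scheme
    return (row, col), reward


def evaluate_uniform_policy(gamma=1.0, theta=1e-10):
    """Sweep all states until the largest value change drops below theta."""
    V = {(r, c): 0.0 for r in range(ROWS) for c in range(COLS)}
    sweeps = 0
    while True:
        delta = 0.0
        for state in V:
            if state in TERMINALS:
                continue   # terminal values stay at 0
            new_value = 0.0
            for move in MOVES:
                next_state, reward = step(state, move)
                new_value += 0.25 * (reward + gamma * V[next_state])
            delta = max(delta, abs(new_value - V[state]))
            V[state] = new_value
        sweeps += 1
        if delta < theta:
            return V, sweeps


V_fine, sweeps_fine = evaluate_uniform_policy(theta=1e-10)
V_coarse, sweeps_coarse = evaluate_uniform_policy(theta=1e-2)
print(f"theta=1e-10: {sweeps_fine} sweeps, theta=1e-2: {sweeps_coarse} sweeps")
```

A smaller $\theta$ costs more sweeps but leaves a smaller residual error in the values.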

In [6]:
theta = 0.000_000_000_1

policy_iteration = hb.PolicyIteration(mdp=mdp, theta_threshold=theta)

# Run PolicyIteration
pi_optimal_policy, pi_state_values, pi_q_values = policy_iteration.run()

#### Displaying policy/state values

The policy can be displayed by simply printing the `Policy` object.

The learned state values can be displayed by calling `hb.Render.preview_V(mdp=mdp, learned_V=state_values)`, and the Q-values by calling `hb.Render.preview_Q(mdp=mdp, learned_Q=q_values)`.

In [7]:
hb.Render.preview_V(mdp=mdp, learned_V=pi_state_values)

print('-----')

print(pi_optimal_policy)

print('-----')

hb.Render.preview_Q(mdp=mdp, learned_Q=pi_q_values)

╔═══════╦═══════╦═══════╦═══════╦═══════╗
║  7.00 ║  8.00 ║  9.00 ║ 10.00 ║  0.00 ║
╠═══════╬═══════╬═══════╬═══════╬═══════╣
║  6.00 ║  7.00 ║  8.00 ║  9.00 ║ 10.00 ║
╠═══════╬═══════╬═══════╬═══════╬═══════╣
║  5.00 ║  6.00 ║  7.00 ║  8.00 ║  9.00 ║
╠═══════╬═══════╬═══════╬═══════╬═══════╣
║  6.00 ║  7.00 ║  8.00 ║  9.00 ║ 10.00 ║
╠═══════╬═══════╬═══════╬═══════╬═══════╣
║  7.00 ║  8.00 ║  9.00 ║ 10.00 ║  0.00 ║
╚═══════╩═══════╩═══════╩═══════╩═══════╝
-----
╔═════════╦═════════╦═════════╦═════════╦═════════╗
║    →    ║    →    ║    →    ║    →    ║ ↑/↓/←/→ ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║   ↑/→   ║   ↑/→   ║   ↑/→   ║   ↑/→   ║    ↑    ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║  ↑/↓/→  ║  ↑/↓/→  ║  ↑/↓/→  ║  ↑/↓/→  ║   ↑/↓   ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║   ↓/→   ║   ↓/→   ║   ↓/→   ║   ↓/→   ║    ↓    ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║    →    ║    →    ║    →    ║    →    ║ ↑/↓/←/→ ║
╚═════════╩═

### Value Iteration

The code below runs the value iteration algorithm.
It works pretty much the same as the policy iteration algorithm, aside from the name of the class.
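
For readers who want to see the algorithm itself, the following is a minimal standalone sketch of value iteration, not herringbone's implementation. It assumes a 5x5 grid resembling `double_fish`: two goal cells in the right-hand corners, a reward of 10 for entering a goal and -1 otherwise, deterministic moves, and walls that leave the agent in place. The helper names (`step`, `backup`, `greedy_actions`) are hypothetical.

```python
ROWS, COLS = 5, 5
TERMINALS = {(0, 4), (4, 4)}   # assumed goal cells
ACTIONS = {"↑": (-1, 0), "↓": (1, 0), "←": (0, -1), "→": (0, 1)}


def step(state, move):
    """Deterministic move; walking into the border leaves the agent in place."""
    row, col = state[0] + move[0], state[1] + move[1]
    if not (0 <= row < ROWS and 0 <= col < COLS):
        row, col = state
    reward = 10 if (row, col) in TERMINALS else -1   # assumed reward scheme
    return (row, col), reward


def backup(V, state, move, gamma):
    """One-step lookahead: reward plus discounted value of the next state."""
    next_state, reward = step(state, move)
    return reward + gamma * V[next_state]


def value_iteration(gamma=1.0, theta=1e-10):
    V = {(r, c): 0.0 for r in range(ROWS) for c in range(COLS)}
    while True:
        delta = 0.0
        for state in V:
            if state in TERMINALS:
                continue   # terminal values stay at 0
            best = max(backup(V, state, move, gamma) for move in ACTIONS.values())
            delta = max(delta, abs(best - V[state]))
            V[state] = best
        if delta < theta:
            return V


def greedy_actions(V, state, gamma=1.0):
    """All actions whose one-step lookahead ties for the maximum."""
    scores = {name: backup(V, state, move, gamma) for name, move in ACTIONS.items()}
    best = max(scores.values())
    return [name for name, score in scores.items() if abs(score - best) < 1e-9]


V = value_iteration()
for r in range(ROWS):
    print(" ".join(f"{V[(r, c)]:5.2f}" for c in range(COLS)))
print("/".join(greedy_actions(V, (2, 0))))
```

Under these assumptions the sketch should land on the same state values and tied greedy actions as the `ValueIteration` output shown in this section.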

In [8]:
theta = 0.000_000_000_1

value_iteration = hb.ValueIteration(mdp=mdp, theta_threshold=theta)

vi_optimal_policy, vi_state_values, vi_q_values = value_iteration.run()

In [9]:
hb.Render.preview_V(mdp=mdp, learned_V=vi_state_values)

print('-----')

print(vi_optimal_policy)

print('-----')

hb.Render.preview_Q(mdp=mdp, learned_Q=vi_q_values)

╔═══════╦═══════╦═══════╦═══════╦═══════╗
║  7.00 ║  8.00 ║  9.00 ║ 10.00 ║  0.00 ║
╠═══════╬═══════╬═══════╬═══════╬═══════╣
║  6.00 ║  7.00 ║  8.00 ║  9.00 ║ 10.00 ║
╠═══════╬═══════╬═══════╬═══════╬═══════╣
║  5.00 ║  6.00 ║  7.00 ║  8.00 ║  9.00 ║
╠═══════╬═══════╬═══════╬═══════╬═══════╣
║  6.00 ║  7.00 ║  8.00 ║  9.00 ║ 10.00 ║
╠═══════╬═══════╬═══════╬═══════╬═══════╣
║  7.00 ║  8.00 ║  9.00 ║ 10.00 ║  0.00 ║
╚═══════╩═══════╩═══════╩═══════╩═══════╝
-----
╔═════════╦═════════╦═════════╦═════════╦═════════╗
║    →    ║    →    ║    →    ║    →    ║ ↑/↓/←/→ ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║   ↑/→   ║   ↑/→   ║   ↑/→   ║   ↑/→   ║    ↑    ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║  ↑/↓/→  ║  ↑/↓/→  ║  ↑/↓/→  ║  ↑/↓/→  ║   ↑/↓   ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║   ↓/→   ║   ↓/→   ║   ↓/→   ║   ↓/→   ║    ↓    ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║    →    ║    →    ║    →    ║    →    ║ ↑/↓/←/→ ║
╚═════════╩═

# Monte Carlo Methods

### Monte Carlo Prediction

To run Monte Carlo prediction, first initialize a `MonteCarloPredictor` object.

Then pass a policy and a sample count to its `evaluate_policy` method.

Finally, the value function of this object can be retrieved with `mc_predictor.value_functions` and previewed with `Render.preview_V`.
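
The idea behind Monte Carlo prediction can be sketched without the framework: sample many episodes under a fixed policy and average the returns observed after the first visit to each state. The example below is a standalone illustration on a hypothetical five-cell random walk (both ends terminal, reward -1 per step); it is not how `MonteCarloPredictor` is implemented, and whether herringbone uses first-visit or every-visit averaging is not stated in the source.

```python
import random
from collections import defaultdict


def run_episode(rng, start=2, left_end=0, right_end=4):
    """Random walk: step left or right with equal probability, reward -1 per step."""
    state, trajectory = start, []
    while state not in (left_end, right_end):
        trajectory.append((state, -1))   # (state, reward for the step taken from it)
        state += rng.choice((-1, 1))
    return trajectory


def first_visit_mc(n_samples, gamma=1.0, seed=42):
    rng = random.Random(seed)
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for _ in range(n_samples):
        g = 0.0
        first_visit_return = {}
        # Walk backwards so g is the return following each time step;
        # overwriting leaves the return from the earliest visit of each state.
        for state, reward in reversed(run_episode(rng)):
            g = reward + gamma * g
            first_visit_return[state] = g
        for state, g_first in first_visit_return.items():
            returns_sum[state] += g_first
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in sorted(returns_sum)}


V_mc = first_visit_mc(n_samples=20_000)
for state, value in V_mc.items():
    print(f"V({state}) = {value:.2f}")
```

For this walk the exact values are -3, -4 and -3 for the three non-terminal cells, so the estimates should settle close to those numbers as the sample count grows.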

In [10]:
N = 100_000
mc_predictor = hb.MonteCarloPredictor(mdp)
mc_predictor.evaluate_policy(random_policy, n_samples=N)
hb.Render.preview_V(mdp=mdp, learned_V=mc_predictor.value_functions)

╔════════╦════════╦════════╦════════╦════════╗
║ -42.26 ║ -37.93 ║ -28.98 ║ -14.19 ║  0.00  ║
╠════════╬════════╬════════╬════════╬════════╣
║ -42.62 ║ -38.59 ║ -30.90 ║ -20.54 ║ -9.76  ║
╠════════╬════════╬════════╬════════╬════════╣
║ -42.70 ║ -38.92 ║ -31.99 ║ -23.10 ║ -15.46 ║
╠════════╬════════╬════════╬════════╬════════╣
║ -42.39 ║ -38.60 ║ -30.81 ║ -20.45 ║ -9.46  ║
╠════════╬════════╬════════╬════════╬════════╣
║ -42.16 ║ -37.95 ║ -28.92 ║ -14.06 ║  0.00  ║
╚════════╩════════╩════════╩════════╩════════╝


### Monte Carlo Control

Monte Carlo control works in a similar fashion. First an object is initialized, which can then be trained using the `.train` method.

The trained policy can be retrieved through the `.policy` attribute of the `MonteCarloController` object.
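
The `epsilon` argument passed to the controller in the next cell presumably sets the exploration rate of an $\epsilon$-greedy policy; the source does not spell this out. As a rough standalone sketch of that selection rule (the function name and the Q-values are made up for illustration):

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a uniformly random action with probability epsilon, else a greedy one."""
    actions = list(q_values)
    if rng.random() < epsilon:
        return rng.choice(actions)
    best = max(q_values.values())
    # Break ties between equally good actions at random.
    return rng.choice([a for a in actions if q_values[a] == best])


q = {"↑": 0.0, "↓": -1.0, "←": -1.0, "→": 2.0}   # hypothetical Q-values for one state
rng = random.Random(42)
picks = [epsilon_greedy(q, 0.25, rng) for _ in range(10_000)]
greedy_share = picks.count("→") / len(picks)
print(f"share of greedy picks: {greedy_share:.3f}")   # expected near 1 - 0.25 + 0.25 / 4
```

The greedy action is also chosen during some exploratory draws, which is why its share sits above $1 - \epsilon$.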

In [11]:
N = 100_000
mc_control = hb.MonteCarloController(mdp, epsilon=0.25)
mc_control.train(n_episodes=N)
trained_policy = mc_control.policy

This trained policy can be run in an episode. The render mode `"sar"` gives a quick overview of the actions taken by the agent.

In [12]:
print(trained_policy)
episode = hb.Episode(mdp=mdp, policy=trained_policy, max_depth=1000)
episode.run("sar")

╔═════════╦═════════╦═════════╦═════════╦═════════╗
║    →    ║    →    ║    →    ║    →    ║ ↑/↓/←/→ ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║    →    ║    →    ║    →    ║    ↑    ║    ↑    ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║    →    ║    →    ║    →    ║    →    ║    ↓    ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║    →    ║    →    ║    →    ║    →    ║    ↓    ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║    →    ║    →    ║    →    ║    →    ║ ↑/↓/←/→ ║
╚═════════╩═════════╩═════════╩═════════╩═════════╝
t: 0 | S: [2, 2], R: -1, A: →
t: 1 | S: [2, 3], R: -1, A: →
t: 2 | S: [2, 4], R: -1, A: ↓
t: 3 | S: [3, 4], R: -1, A: ↓
t: 4 | S: [4, 4], R: 10, A: None


# Temporal Difference Learning

### Sarsa

`Sarsa` inherits from `TDControl`.

Required arguments:

* `num_episodes: int` The number of training loops
* `mdp: MDP` The quintuple $\mathcal{M} = (S, A, P, R, \gamma)$ defining the learning environment

Optional arguments:

* `alpha: float = 0.5` The learning rate $\alpha$
* `epsilon: float = 1.0` The exploration rate $\epsilon$
    
    Reward-Based Decay:
    * `epsilon_min: float = sys.float_info.epsilon` The lower bound of $\epsilon$
    * `epsilon_delta: float = 0.01` The rate at which $\epsilon$ decays
    * `reward_threshold: float = 1.0` The minimum reward required for one step of $\epsilon$-decay
    * `reward_increment: float = 1.0` The rate at which the reward threshold increases after each $\epsilon$-decay step
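
One possible reading of the reward-based decay parameters is sketched below. This is an interpretation, not herringbone's code: it assumes that $\epsilon$ is reduced by subtracting `epsilon_delta` (a multiplicative decay would fit the description equally well), that the comparison uses the total reward of an episode, and that the function name `decay_epsilon` is hypothetical.

```python
import sys


def decay_epsilon(epsilon, episode_reward, reward_threshold,
                  epsilon_min=sys.float_info.epsilon,
                  epsilon_delta=0.01, reward_increment=1.0):
    """Return the updated (epsilon, reward_threshold) after one episode."""
    if epsilon > epsilon_min and episode_reward >= reward_threshold:
        # Assumed subtractive decay, never dropping below epsilon_min.
        epsilon = max(epsilon_min, epsilon - epsilon_delta)
        # Raise the bar so the next decay step needs a better episode.
        reward_threshold += reward_increment
    return epsilon, reward_threshold


eps, threshold = 1.0, 1.0
for episode_reward in [0.5, 2.0, 1.5, 3.0]:   # made-up episode rewards
    eps, threshold = decay_epsilon(eps, episode_reward, threshold)
    print(f"reward={episode_reward:>4}: epsilon={eps:.2f}, threshold={threshold:.1f}")
```

Tying the decay to achieved reward means the agent only explores less once it has shown that it can do better.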

#### Initialize the model

In [13]:
mdp_sarsa = hb.MDP(
    state_config=state_path,
    map=map_path,
    action_config=action_path,
    seed=42,
)

num_episodes = 10_000

s = hb.Sarsa(num_episodes, mdp=mdp_sarsa)

#### Train the model

In [14]:
s.run();

#### Display the found policy

In [15]:
print(s.policy)

╔═════════╦═════════╦═════════╦═════════╦═════════╗
║    →    ║    →    ║    →    ║    →    ║ ↑/↓/←/→ ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║    →    ║    →    ║    →    ║    →    ║    ↑    ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║    ↓    ║    ↓    ║    →    ║    →    ║    ↑    ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║    →    ║    ↓    ║    →    ║    →    ║    ↓    ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║    →    ║    →    ║    →    ║    →    ║ ↑/↓/←/→ ║
╚═════════╩═════════╩═════════╩═════════╩═════════╝


#### Inspect the found Q-values

In [16]:
hb.Render.preview_Q(mdp_sarsa, s.q_values)

╔═════════════╦═════════════╦═════════════╦═════════════╦═════════════╗
║   ↑: -8.76  ║   ↑: -7.16  ║   ↑: -3.99  ║   ↑: 5.15   ║   ↑: 0.00   ║ 
║   ↓: -8.84  ║   ↓: -8.11  ║   ↓: -7.58  ║   ↓: -4.65  ║   ↓: 0.00   ║ 
║   ←: -8.77  ║   ←: -8.50  ║   ←: -7.63  ║   ←: -4.15  ║   ←: 0.00   ║ 
║   →: -7.68  ║   →: -7.10  ║   →: -3.68  ║   →: 10.00  ║   →: 0.00   ║ 
╠═════════════╬═════════════╬═════════════╬═════════════╬═════════════╣
║   ↑: -8.15  ║   ↑: -8.14  ║   ↑: -7.23  ║   ↑: 2.03   ║   ↑: 10.00  ║ 
║   ↓: -8.78  ║   ↓: -8.84  ║   ↓: -8.21  ║   ↓: -5.85  ║   ↓: -5.59  ║ 
║   ←: -8.53  ║   ←: -8.51  ║   ←: -7.09  ║   ←: -6.93  ║   ←: -2.40  ║ 
║   →: -8.11  ║   →: -4.17  ║   →: -1.85  ║   →: 2.72   ║   →: 0.47   ║ 
╠═════════════╬═════════════╬═════════════╬═════════════╬═════════════╣
║   ↑: -8.55  ║   ↑: -8.14  ║   ↑: -7.20  ║   ↑: -4.84  ║   ↑: 0.90   ║ 
║   ↓: -7.39  ║   ↓: -7.46  ║   ↓: -6.62  ║   ↓: -4.18  ║   ↓: -4.35  ║ 
║   ←: -8.65  ║   ←: -8.79  ║   ←: -8.62  ║   ←: -8.17

### Q-Learning

`QLearning` is a subclass of `TDZero` and takes the same arguments as `Sarsa`.
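
The two algorithms differ only in the target of their temporal-difference update: Sarsa bootstraps from the action the agent actually takes next, Q-learning from the best action available. The sketch below shows the textbook update rules on a made-up two-state Q-table; it is not taken from herringbone. `alpha=0.5` follows the default listed above, while `gamma=0.9` is an assumption (the Q-learning table printed in this section is consistent with it, but the source does not state the default).

```python
import copy


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.9, done=False):
    """On-policy target: uses the action actually chosen in the next state."""
    target = r if done else r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])


def q_learning_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.9, done=False):
    """Off-policy target: uses the greedy action in the next state."""
    target = r if done else r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])


# Hypothetical Q-table: in state "B", going right is much better than going left.
Q_start = {"A": {"left": 0.0, "right": 0.0}, "B": {"left": 2.0, "right": 10.0}}

Q_sarsa = copy.deepcopy(Q_start)
Q_qlearn = copy.deepcopy(Q_start)

# Same transition for both; the agent explores and picks "left" in state "B".
sarsa_update(Q_sarsa, "A", "right", -1, "B", a_next="left")
q_learning_update(Q_qlearn, "A", "right", -1, "B")

print(f"Sarsa:      Q(A, right) = {Q_sarsa['A']['right']:.2f}")
print(f"Q-learning: Q(A, right) = {Q_qlearn['A']['right']:.2f}")
```

Because Sarsa's estimates include the cost of its own exploratory moves, its Q-values under a high $\epsilon$ tend to be lower than Q-learning's.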

#### Initialize the model

In [17]:
mdp_q_learning = hb.MDP(
    state_config=state_path,
    map=map_path,
    action_config=action_path,
    seed=42,
)

num_episodes = 10_000

ql = hb.QLearning(num_episodes, mdp=mdp_q_learning)

#### Train the model

In [18]:
ql.run();

#### Display the found policy

In [19]:
print(ql.policy)

╔═════════╦═════════╦═════════╦═════════╦═════════╗
║    →    ║    →    ║    →    ║    →    ║ ↑/↓/←/→ ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║   ↑/→   ║   ↑/→   ║   ↑/→   ║   ↑/→   ║    ↑    ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║  ↑/↓/→  ║  ↑/↓/→  ║  ↑/↓/→  ║  ↑/↓/→  ║   ↑/↓   ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║   ↓/→   ║   ↓/→   ║   ↓/→   ║   ↓/→   ║    ↓    ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║    →    ║    →    ║    →    ║    →    ║ ↑/↓/←/→ ║
╚═════════╩═════════╩═════════╩═════════╩═════════╝


#### Inspect the found Q-values

In [20]:
hb.Render.preview_Q(mdp=mdp_q_learning, learned_Q=ql.q_values)

╔═════════════╦═════════════╦═════════════╦═════════════╦═════════════╗
║   ↑: 3.12   ║   ↑: 4.58   ║   ↑: 6.20   ║   ↑: 8.00   ║   ↑: 0.00   ║ 
║   ↓: 1.81   ║   ↓: 3.12   ║   ↓: 4.58   ║   ↓: 6.20   ║   ↓: 0.00   ║ 
║   ←: 3.12   ║   ←: 3.12   ║   ←: 4.58   ║   ←: 6.20   ║   ←: 0.00   ║ 
║   →: 4.58   ║   →: 6.20   ║   →: 8.00   ║   →: 10.00  ║   →: 0.00   ║ 
╠═════════════╬═════════════╬═════════════╬═════════════╬═════════════╣
║   ↑: 3.12   ║   ↑: 4.58   ║   ↑: 6.20   ║   ↑: 8.00   ║   ↑: 10.00  ║ 
║   ↓: 0.63   ║   ↓: 1.81   ║   ↓: 3.12   ║   ↓: 4.58   ║   ↓: 6.20   ║ 
║   ←: 1.81   ║   ←: 1.81   ║   ←: 3.12   ║   ←: 4.58   ║   ←: 6.20   ║ 
║   →: 3.12   ║   →: 4.58   ║   →: 6.20   ║   →: 8.00   ║   →: 8.00   ║ 
╠═════════════╬═════════════╬═════════════╬═════════════╬═════════════╣
║   ↑: 1.81   ║   ↑: 3.12   ║   ↑: 4.58   ║   ↑: 6.20   ║   ↑: 8.00   ║ 
║   ↓: 1.81   ║   ↓: 3.12   ║   ↓: 4.58   ║   ↓: 6.20   ║   ↓: 8.00   ║ 
║   ←: 0.63   ║   ←: 0.63   ║   ←: 1.81   ║   ←: 3.12 

### Deep Q-Learning

`DeepQLearning` inherits from `QLearning` and takes the same arguments as `QLearning` (and `Sarsa`). In addition, its class attributes are:

* `SYNC_RATE = 10` Rate at which the target network's weights are updated with the policy network's weights
* `REPLAY_MEMORY_SIZE = 1_000` The upper bound on transitions stored in replay memory
* `MINI_BATCH_SIZE = 32` The number of transitions $(s, a, s', r)$ sampled by the agent for training
* `DEPTH_MAX = 5_000` The upper bound on the number of steps the agent can take during a training loop
* `LOSS_FN = nn.MSELoss()` The loss function to compute the error between the predicted and target Q-values
* `OPTIMIZER = torch.optim.AdamW` The optimizer to update the model parameters $\theta$ during training
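
To make the roles of `REPLAY_MEMORY_SIZE`, `MINI_BATCH_SIZE` and `SYNC_RATE` concrete, here is a framework-free sketch of a bounded replay memory and a periodic target sync. It is an illustration only: the `ReplayMemory` class is hypothetical, the networks are replaced by a plain dictionary, and the sync is assumed to be counted in steps (the source does not say whether it is steps or episodes).

```python
import random
from collections import deque

SYNC_RATE = 10
REPLAY_MEMORY_SIZE = 1_000
MINI_BATCH_SIZE = 32


class ReplayMemory:
    """Fixed-size buffer; the oldest transitions are dropped once it is full."""

    def __init__(self, capacity=REPLAY_MEMORY_SIZE):
        self.buffer = deque(maxlen=capacity)

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement breaks up consecutive transitions.
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


memory = ReplayMemory()
rng = random.Random(42)
policy_weights = {"w": 0.0}            # stand-in for the policy network's parameters
target_weights = dict(policy_weights)  # stand-in for the target network
sync_count = 0

for step in range(1, 1_501):
    memory.push((step, "→", step + 1, -1))   # dummy transition (s, a, s', r)
    if len(memory) >= MINI_BATCH_SIZE:
        batch = memory.sample(MINI_BATCH_SIZE, rng)
        policy_weights["w"] += 0.001 * len(batch)   # placeholder for a gradient step
    if step % SYNC_RATE == 0:
        target_weights = dict(policy_weights)       # copy policy -> target
        sync_count += 1

print(len(memory), sync_count)
```

Keeping the target network frozen between syncs gives the Q-value targets a chance to stay stable while the policy network is being updated.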

#### Initialize the model

In [21]:
mdp_deep_q_learning = hb.MDP(
    state_config=state_path,
    map=map_path,
    action_config=action_path,
    seed=42,
)

num_episodes = 10_000

dql = hb.DeepQLearning(num_episodes, mdp=mdp_deep_q_learning)

#### Train the model

In [22]:
dql.run();

#### Display the found policy

In [23]:
print(dql.policy)

╔═════════╦═════════╦═════════╦═════════╦═════════╗
║    →    ║    →    ║    →    ║    →    ║ ↑/↓/←/→ ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║    ↑    ║    ↑    ║    ↑    ║    ↑    ║    ↑    ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║    ↓    ║    ↓    ║    ↓    ║    ↓    ║    ↓    ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║    ↓    ║    ↓    ║    ↓    ║    ↓    ║    ↓    ║
╠═════════╬═════════╬═════════╬═════════╬═════════╣
║    →    ║    →    ║    →    ║    →    ║ ↑/↓/←/→ ║
╚═════════╩═════════╩═════════╩═════════╩═════════╝


#### Display the found Q-values as a Q-table for convenience

In [24]:
hb.Render.preview_Q(mdp=mdp_deep_q_learning, learned_Q=dql.q_values)

╔═════════════╦═════════════╦═════════════╦═════════════╦═════════════╗
║   ↑: 1.24   ║   ↑: 2.47   ║   ↑: 3.34   ║   ↑: 3.28   ║   ↑: 0.00   ║ 
║   ↓: 0.52   ║   ↓: 1.14   ║   ↓: 2.11   ║   ↓: 3.50   ║   ↓: 0.00   ║ 
║   ←: 0.69   ║   ←: 1.26   ║   ←: 2.29   ║   ←: 2.96   ║   ←: 0.00   ║ 
║   →: 2.56   ║   →: 4.16   ║   →: 5.92   ║   →: 7.87   ║   →: 0.00   ║ 
╠═════════════╬═════════════╬═════════════╬═════════════╬═════════════╣
║   ↑: 1.13   ║   ↑: 2.21   ║   ↑: 3.64   ║   ↑: 4.31   ║   ↑: 4.75   ║ 
║   ↓: -0.33  ║   ↓: -0.08  ║   ↓: 0.14   ║   ↓: 0.59   ║   ↓: 0.10   ║ 
║   ←: -0.09  ║   ←: 0.27   ║   ←: 0.98   ║   ←: 1.47   ║   ←: 1.32   ║ 
║   →: 0.65   ║   →: 1.95   ║   →: 2.89   ║   →: 3.73   ║   →: 2.98   ║ 
╠═════════════╬═════════════╬═════════════╬═════════════╬═════════════╣
║   ↑: -0.03  ║   ↑: 0.36   ║   ↑: 1.08   ║   ↑: 1.40   ║   ↑: 1.72   ║ 
║   ↓: 0.43   ║   ↓: 0.95   ║   ↓: 1.19   ║   ↓: 1.84   ║   ↓: 2.09   ║ 
║   ←: -0.43  ║   ←: -0.41  ║   ←: 0.06   ║   ←: 0.32 

# Extra Maps

The following section addresses how our algorithms and policies react to the more complex situations that can be created with our framework.

### 1. Gamma
The following map shows the policy's dependence on gamma:

In [25]:
map_path = "herringbone/env_core/maps/gamma.csv"
gamma = 1
mdp = hb.MDP(state_path, map_path, action_path, seed=42, gamma=gamma, start_coords=[0,1])
hb.Render.preview_frame(board=mdp.get_board(), agent_state=mdp.get_board().states[0][1], render_mode=render_modes[1])

╔═══════╦═══════╦═══════╦═══════╦═══════╦═══════╗
║   10  ║ =^.^= ║   -1  ║   -1  ║   -1  ║  100  ║
╚═══════╩═══════╩═══════╩═══════╩═══════╩═══════╝[0]


In [26]:
"""Low gamma: prioritise immediate rewards"""
mdp.gamma = 0.1
policy_iteration = hb.PolicyIteration(mdp=mdp, theta_threshold=theta)
pi_optimal_policy, pi_state_values, pi_q_values = policy_iteration.run()
print(pi_optimal_policy)

╔═════════╦═════════╦═════════╦═════════╦═════════╦═════════╗
║ ↑/↓/←/→ ║    ←    ║    ←    ║    →    ║    →    ║ ↑/↓/←/→ ║
╚═════════╩═════════╩═════════╩═════════╩═════════╩═════════╝


In [27]:
"""High gamma: prioritise long-term rewards"""
mdp.gamma = 1
policy_iteration = hb.PolicyIteration(mdp=mdp, theta_threshold=theta)
pi_optimal_policy, pi_state_values, pi_q_values = policy_iteration.run()
print(pi_optimal_policy)

╔═════════╦═════════╦═════════╦═════════╦═════════╦═════════╗
║ ↑/↓/←/→ ║    →    ║    →    ║    →    ║    →    ║ ↑/↓/←/→ ║
╚═════════╩═════════╩═════════╩═════════╩═════════╩═════════╝
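
The two policies above can be checked by hand. The sketch below compares the discounted return of walking straight left with that of walking straight right from each non-terminal cell. It is a standalone calculation, not herringbone code, and it assumes that a reward is collected on entering a cell and that the cell hidden behind the cat in the preview is worth -1 like its neighbours.

```python
REWARDS = [10, -1, -1, -1, -1, 100]   # reward for entering each cell (cell 1 assumed -1)


def walk_return(start, direction, gamma):
    """Discounted return of walking straight from `start` to the end of the corridor."""
    move = -1 if direction == "left" else 1
    g, discount, pos = 0.0, 1.0, start
    while 0 < pos < len(REWARDS) - 1:
        pos += move
        g += discount * REWARDS[pos]
        discount *= gamma
    return g


def best_directions(gamma):
    """Better of the two straight walks for every non-terminal cell."""
    return [max(("left", "right"), key=lambda d: walk_return(s, d, gamma))
            for s in range(1, len(REWARDS) - 1)]


low = best_directions(0.1)
high = best_directions(1.0)
print("gamma=0.1:", low)
print("gamma=1.0:", high)
```

With $\gamma = 0.1$ the reward of 100 is discounted so heavily that the two cells nearest the 10 prefer it, while with $\gamma = 1$ the larger reward wins everywhere.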


### 2. Maze

This map shows that a maze can be created by utilizing negative rewards.

In [28]:
map_path = "herringbone/env_core/maps/maze.csv"
mdp = hb.MDP(state_path, map_path, action_path, seed=42, gamma=gamma, start_coords=[0,0])

policy_iteration = hb.PolicyIteration(mdp=mdp, theta_threshold=theta)
pi_optimal_policy, pi_state_values, pi_q_values = policy_iteration.run()

hb.Render.preview_frame(board=mdp.get_board(), agent_state=mdp.get_board().states[0][0], render_mode=render_modes[2])

print(pi_optimal_policy)

╔════════╦════════╦════════╦════════╦════════╦════════╦════════╦════════╦════════╦════════╗
║ =^.^=  ║  ||||  ║        ║  ||||  ║  ||||  ║  ||||  ║  ||||  ║  ||||  ║  ||||  ║ <✰))>< ║
╠════════╬════════╬════════╬════════╬════════╬════════╬════════╬════════╬════════╬════════╣
║        ║  ||||  ║        ║        ║        ║  ||||  ║        ║        ║  ||||  ║        ║
╠════════╬════════╬════════╬════════╬════════╬════════╬════════╬════════╬════════╬════════╣
║        ║  ||||  ║        ║  ||||  ║        ║  ||||  ║        ║  ||||  ║  ||||  ║        ║
╠════════╬════════╬════════╬════════╬════════╬════════╬════════╬════════╬════════╬════════╣
║        ║        ║        ║  ||||  ║       

### 3. Flappy Cat

This map shows that a gridworld is adaptable to multiple game types.

In [29]:
map_path = "herringbone/env_core/maps/flappy_bird.csv"
mdp = hb.MDP(state_path, map_path, action_path, seed=42, gamma=gamma, start_coords=[4,0])

policy_iteration = hb.PolicyIteration(mdp=mdp, theta_threshold=theta)
pi_optimal_policy, pi_state_values, pi_q_values = policy_iteration.run()

hb.Render.preview_frame(board=mdp.get_board(), agent_state=mdp.get_board().states[4][0], render_mode=render_modes[2])

print(pi_optimal_policy)

╔════════╦════════╦════════╦════════╦════════╦════════╦════════╦════════╦════════╦════════╗
║        ║  ||||  ║        ║  ||||  ║        ║  ||||  ║        ║  ||||  ║        ║ <✰))>< ║
╠════════╬════════╬════════╬════════╬════════╬════════╬════════╬════════╬════════╬════════╣
║        ║  ||||  ║        ║  ||||  ║        ║  ||||  ║        ║        ║        ║ <✰))>< ║
╠════════╬════════╬════════╬════════╬════════╬════════╬════════╬════════╬════════╬════════╣
║        ║  ||||  ║        ║        ║        ║  ||||  ║        ║  ||||  ║        ║ <✰))>< ║
╠════════╬════════╬════════╬════════╬════════╬════════╬════════╬════════╬════════╬════════╣
║        ║        ║        ║  ||||  ║       

### 4. The end.

This map thanks you for grading our work!

In [30]:
map_path = "herringbone/env_core/maps/heart.csv"
mdp = hb.MDP(state_path, map_path, action_path, seed=42, gamma=gamma, start_coords=[4,0])

policy_iteration = hb.PolicyIteration(mdp=mdp, theta_threshold=theta)
pi_optimal_policy, pi_state_values, pi_q_values = policy_iteration.run()

hb.Render.preview_frame(board=mdp.get_board(), agent_state=mdp.get_board().states[4][0], render_mode=render_modes[2])

╔═══════╦═══════╦═══════╦═══════╦═══════╦═══════╦═══════╗
║       ║       ║       ║       ║       ║       ║       ║
╠═══════╬═══════╬═══════╬═══════╬═══════╬═══════╬═══════╣
║       ║       ║  hole ║       ║  hole ║       ║       ║
╠═══════╬═══════╬═══════╬═══════╬═══════╬═══════╬═══════╣
║       ║  hole ║  hole ║  hole ║  hole ║  hole ║       ║
╠═══════╬═══════╬═══════╬═══════╬═══════╬═══════╬═══════╣
║       ║       ║  hole ║  hole ║  hole ║       ║       ║
╠═══════╬═══════╬═══════╬═══════╬═══════╬═══════╬═══════╣
║ =^.^= ║       ║       ║  hole ║       ║       ║       ║
╠═══════╬═══════╬═══════╬═══════╬═══════╬═══════╬═══════╣
║       ║       ║      