In [4]:
import herringbone as hb

All initialization tests passed.
imported herringbone without any errors :)


### Creating an MDP

An MDP is formally defined as a 5-tuple $\mathcal{M} = (S, A, P, R, \gamma)$, where:
- $S$ defines the state space
- $A$ defines the action space
- $P$ models the environment dynamics
- $R$ models the reward function
- $\gamma$ defines the discount factor
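
To make the tuple concrete, here is a minimal sketch in plain Python that does not use herringbone. The `TinyMDP` class, the `build_corridor` helper, and the two-action corridor are hypothetical names invented for illustration. The rewards (-1, -1, 10) mirror the `easy` map used in this notebook, and the rule that a move into a wall leaves the agent in place is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TinyMDP:
    """Hypothetical container mirroring the 5-tuple (S, A, P, R, gamma)."""
    states: tuple       # S: all states
    actions: tuple      # A: all actions
    transitions: dict   # P[(s, a)] -> {next_state: probability}
    rewards: dict       # R[next_state] -> reward for entering that state
    gamma: float        # discount factor


def build_corridor() -> TinyMDP:
    """Build a deterministic 1x3 corridor whose rightmost cell is terminal."""
    states = (0, 1, 2)
    actions = ("left", "right")
    transitions = {}
    for s in states:
        for a in actions:
            if s == 2:
                nxt = 2  # the terminal state absorbs every action
            elif a == "left":
                nxt = max(s - 1, 0)  # assumed: bumping into the wall keeps the agent in place
            else:
                nxt = s + 1
            transitions[(s, a)] = {nxt: 1.0}
    rewards = {0: -1, 1: -1, 2: 10}
    return TinyMDP(states, actions, transitions, rewards, gamma=1.0)


corridor = build_corridor()

# Every row of P must be a probability distribution over next states.
for distribution in corridor.transitions.values():
    assert abs(sum(distribution.values()) - 1.0) < 1e-12

print(corridor.transitions[(1, "right")])
```

Storing $P$ as a dictionary keyed by `(state, action)` is only one possible layout; an array of per-action transition matrices carries the same information.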

To create an MDP with this framework, you need to provide paths to at least a state config, a map, and an action config.

Additionally, it can take an array of transition matrices (see $P$ in the formal MDP definition), a seed, and the discount factor $\gamma$.
These all have default values, so do not fret if you do not understand them yet!

In [5]:
map_names = ["slides", "example", "easy", "danger_holes", "double_fish", "wall_of_death", "example2", "mega"]
selected_map_id = 2

state_path = "herringbone/env_core/config/state_config.json"
map_path = f"herringbone/env_core/maps/{map_names[selected_map_id]}.csv"
action_path = "herringbone/env_core/config/action_config.json"

gamma = 1

mdp = hb.MDP(state_path, map_path, action_path, seed=42, gamma=gamma)

### Previewing the board

The board can be previewed with the following code.

**Render Modes**
1. `'sar'`: prints the state, action, reward of each iteration (only used in Monte Carlo simulations and Temporal Difference learning)
2. `'rewards'`: prints the board with the calculated rewards for each state
3. `'ascii'`: prints an ASCII representation of the board

In [6]:
render_modes = ['sar', 'rewards', 'ascii']
hb.Render.preview_frame(board=mdp.get_board(), agent_state=None, render_mode=render_modes[2])

╔═════════╦═════════╦═════════╗
║         ║         ║ <x)))>< ║
╚═════════╩═════════╩═════════╝[0]


### Creating a policy & running an episode

A policy can be created with the help of the MDP; this policy is uniform/random by default.

An episode is created with an MDP, a policy, and a max depth, which ensures that it does not run forever with a suboptimal policy.

This episode instance can be run with a render mode.
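
As a rough illustration of what such an episode loop does, the sketch below rolls out a uniform random policy on a hypothetical 1x3 corridor and stops after `max_depth` steps. It does not use herringbone: `step`, `uniform_policy`, and `run_episode` are illustrative names, and the wall behaviour and reward layout are assumptions modelled on the `easy` map.

```python
import random

ACTIONS = ("up", "down", "left", "right")
REWARDS = {0: -1, 1: -1, 2: 10}  # assumed: reward for entering each cell
TERMINAL = 2


def step(state, action):
    """Move in a 1x3 corridor; a move into a wall leaves the agent in place."""
    if action == "left":
        nxt = max(state - 1, 0)
    elif action == "right":
        nxt = min(state + 1, 2)
    else:  # "up" and "down" always hit a wall in a single-row board
        nxt = state
    return nxt, REWARDS[nxt]


def uniform_policy(state, rng):
    """Pick every action with equal probability, regardless of the state."""
    return rng.choice(ACTIONS)


def run_episode(policy, start=0, max_depth=1000, seed=None):
    """Roll out a policy until the terminal state or the depth limit is reached."""
    rng = random.Random(seed)
    state, trajectory = start, []
    for _ in range(max_depth):
        if state == TERMINAL:
            break
        action = policy(state, rng)
        nxt, reward = step(state, action)
        trajectory.append((state, action, reward))
        state = nxt
    return trajectory, state


trajectory, final_state = run_episode(uniform_policy, seed=42)
print(len(trajectory), final_state)
```

The depth limit matters for policies that never reach the goal: a policy that always moves left would loop forever without it.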

In [None]:

random_policy = hb.Policy(mdp=mdp)
episode = hb.Episode(mdp=mdp, policy=random_policy, max_depth=1000)
episode.run("ascii")


╔═══════╦═══════╦═══════╗
║   -1  ║   -1  ║   10  ║
╚═══════╩═══════╩═══════╝[0]
╔═════════╦═════════╦═════════╗
║  =^.^=  ║         ║ <x)))>< ║
╚═════════╩═════════╩═════════╝[0]
╔═════════╦═════════╦═════════╗
║  =^.^=  ║         ║ <x)))>< ║
╚═════════╩═════════╩═════════╝[1]
╔═════════╦═════════╦═════════╗
║  =^.^=  ║         ║ <x)))>< ║
╚═════════╩═════════╩═════════╝[2]
╔═════════╦═════════╦═════════╗
║  =^.^=  ║         ║ <x)))>< ║
╚═════════╩═════════╩═════════╝[3]
╔═════════╦═════════╦═════════╗
║  =^.^=  ║         ║ <x)))>< ║
╚═════════╩═════════╩═════════╝[4]
╔═════════╦═════════╦═════════╗
║  =^.^=  ║         ║ <x)))>< ║
╚═════════╩═════════╩═════════╝[5]
╔═════════╦═════════╦═════════╗
║  =^.^=  ║         ║ <x)))>< ║
╚═════════╩═════════╩═════════╝[6]
╔═════════

# Dynamic Programming

### Policy Iteration

The code below runs the policy iteration algorithm. The pseudocode can be found at:
> (citation to be added)

The algorithm takes only an MDP object and a threshold $\theta$, which defines how small the value updates must become before the algorithm terminates.

The optimal policy, the optimal state values, and the Q-values of the MDP can be retrieved by calling the `run()` function on your `PolicyIteration` object.
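
For readers who want to see the mechanics, below is a minimal sketch of policy iteration on a hypothetical 1x3 corridor, written in plain Python. It is not herringbone's implementation: the helper names, the reward-on-entry convention, and the wall behaviour are assumptions chosen to resemble the `easy` map. The sketch alternates iterative policy evaluation (sweeping until the largest update falls below `theta`) with greedy policy improvement.

```python
ACTIONS = ("up", "down", "left", "right")
STATES = (0, 1, 2)
TERMINAL = 2
REWARDS = {0: -1, 1: -1, 2: 10}  # assumed: reward for entering each cell


def step(state, action):
    """Deterministic corridor move; a move into a wall leaves the agent in place."""
    if action == "left":
        nxt = max(state - 1, 0)
    elif action == "right":
        nxt = min(state + 1, 2)
    else:
        nxt = state
    return nxt, REWARDS[nxt]


def q_value(state, action, V, gamma):
    """One-step lookahead: immediate reward plus discounted value of the next state."""
    nxt, reward = step(state, action)
    return reward + gamma * V[nxt]


def evaluate(policy, V, gamma, theta):
    """Iterative policy evaluation: sweep until the largest update is below theta."""
    while True:
        delta = 0.0
        for s in STATES:
            if s == TERMINAL:
                continue
            new_v = sum(p * q_value(s, a, V, gamma) for a, p in policy[s].items())
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:
            return V


def policy_iteration(gamma=1.0, theta=1e-10):
    """Start from a uniform policy and improve it until it stops changing."""
    policy = {s: {a: 0.25 for a in ACTIONS} for s in STATES}
    V = {s: 0.0 for s in STATES}
    while True:
        V = evaluate(policy, V, gamma, theta)
        stable = True
        for s in STATES:
            if s == TERMINAL:
                continue
            best = max(ACTIONS, key=lambda a: q_value(s, a, V, gamma))
            greedy = {a: 1.0 if a == best else 0.0 for a in ACTIONS}
            if greedy != policy[s]:
                stable = False
            policy[s] = greedy
        if stable:
            return policy, V


policy, V = policy_iteration()
print(V)
```

Under these assumptions the values converge to 9 and 10 for the two non-terminal cells, which agrees with the herringbone output for the `easy` map.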

In [8]:
theta = 0.000_000_000_1

policy_iteration = hb.PolicyIteration(mdp=mdp, theta_threshold=theta)

# Run PolicyIteration
pi_optimal_policy, pi_state_values, pi_q_values = policy_iteration.run()

#### Displaying policy/state values

The policy can be displayed by simply printing the `Policy` object.

The learned state values can be displayed by calling `hb.Render.preview_V(mdp=mdp, learned_V=state_values)`. The Q-values can be printed directly.

In [9]:
hb.Render.preview_V(mdp=mdp, learned_V=pi_state_values)

print('-----')

print(pi_optimal_policy)

print('-----')

print(pi_q_values)

╔═══════╦═══════╦═══════╗
║  9.00 ║ 10.00 ║  0.00 ║
╚═══════╩═══════╩═══════╝
-----
╔═════════╦═════════╦═════════╗
║    →    ║    →    ║ ↑/↓/←/→ ║
╚═════════╩═════════╩═════════╝
-----
{[0, 0]: {↑: 8.0, ↓: 8.0, ←: 8.0, →: 9.0}, [0, 1]: {↑: 9.0, ↓: 9.0, ←: 8.0, →: 10.0}, [0, 2]: {↑: 0, ↓: 0, ←: 0, →: 0}}


### Value Iteration

The code below runs the value iteration algorithm. The pseudocode can be found at:
> (citation to be added)

It is called in the same way as the policy iteration algorithm; only the class name differs.
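
The difference lies in the update rule rather than the interface. The sketch below shows value iteration on a hypothetical 1x3 corridor in plain Python. It is not herringbone's implementation; the helper names, the reward-on-entry convention, and the wall behaviour are assumptions modelled on the `easy` map. Each sweep replaces a state's value with the best one-step lookahead, and the greedy policy is read off once the updates fall below `theta`.

```python
ACTIONS = ("up", "down", "left", "right")
STATES = (0, 1, 2)
TERMINAL = 2
REWARDS = {0: -1, 1: -1, 2: 10}  # assumed: reward for entering each cell


def step(state, action):
    """Deterministic corridor move; a move into a wall leaves the agent in place."""
    if action == "left":
        nxt = max(state - 1, 0)
    elif action == "right":
        nxt = min(state + 1, 2)
    else:
        nxt = state
    return nxt, REWARDS[nxt]


def q_value(state, action, V, gamma):
    """One-step lookahead: immediate reward plus discounted value of the next state."""
    nxt, reward = step(state, action)
    return reward + gamma * V[nxt]


def value_iteration(gamma=1.0, theta=1e-10):
    """Apply the Bellman optimality update until the largest change is below theta."""
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            if s == TERMINAL:
                continue
            best = max(q_value(s, a, V, gamma) for a in ACTIONS)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    # Extract the greedy policy from the converged values.
    policy = {
        s: max(ACTIONS, key=lambda a: q_value(s, a, V, gamma))
        for s in STATES
        if s != TERMINAL
    }
    return policy, V


policy, V = value_iteration()
print(policy, V)
```

Unlike policy iteration, no explicit policy is maintained during the sweeps; it is derived only at the end.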

In [10]:
theta = 0.000_000_000_1

value_iteration = hb.ValueIteration(mdp=mdp, theta_threshold=theta)

vi_optimal_policy, vi_state_values, vi_q_values = value_iteration.run()

In [11]:
hb.Render.preview_V(mdp=mdp, learned_V=vi_state_values)

print('-----')

print(vi_optimal_policy)

print('-----')

print(vi_q_values)

╔═══════╦═══════╦═══════╗
║  9.00 ║ 10.00 ║  0.00 ║
╚═══════╩═══════╩═══════╝
-----
╔═════════╦═════════╦═════════╗
║    →    ║    →    ║ ↑/↓/←/→ ║
╚═════════╩═════════╩═════════╝
-----
{[0, 0]: {↑: 8.0, ↓: 8.0, ←: 8.0, →: 9.0}, [0, 1]: {↑: 9.0, ↓: 9.0, ←: 8.0, →: 10.0}, [0, 2]: {↑: 0, ↓: 0, ←: 0, →: 0}}


# Monte Carlo Methods

### Monte Carlo Prediction

To run Monte Carlo prediction, a `MonteCarloPredictor` object needs to be initialized.

After that, a policy and a sample count can be passed to the `evaluate_policy` method.

Finally, the value functions of this object can be retrieved with `mc_predictor.value_functions` and previewed with `Render.preview_V`.
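
To give an idea of what such a predictor computes, here is a minimal sketch of first-visit Monte Carlo prediction for a uniform random policy on a hypothetical 1x3 corridor. It is plain Python and makes no claim about how `MonteCarloPredictor` works internally: the first-visit variant, the fixed start state, the helper names, and the reward and wall behaviour are all assumptions.

```python
import random
from collections import defaultdict

ACTIONS = ("up", "down", "left", "right")
TERMINAL = 2
REWARDS = {0: -1, 1: -1, 2: 10}  # assumed: reward for entering each cell


def step(state, action):
    """Deterministic corridor move; a move into a wall leaves the agent in place."""
    if action == "left":
        nxt = max(state - 1, 0)
    elif action == "right":
        nxt = min(state + 1, 2)
    else:
        nxt = state
    return nxt, REWARDS[nxt]


def mc_prediction(n_samples, gamma=1.0, seed=0):
    """Estimate V under a uniform random policy by averaging first-visit returns."""
    rng = random.Random(seed)
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for _ in range(n_samples):
        # Sample one full episode starting in the leftmost cell (assumed start).
        state, episode = 0, []
        while state != TERMINAL:
            nxt, reward = step(state, rng.choice(ACTIONS))
            episode.append((state, reward))
            state = nxt
        # Walk backwards accumulating the return; the last overwrite for a
        # state corresponds to its earliest (first) visit in the episode.
        G = 0.0
        first_visit_return = {}
        for s, r in reversed(episode):
            G = r + gamma * G
            first_visit_return[s] = G
        for s, g in first_visit_return.items():
            returns_sum[s] += g
            returns_count[s] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}


values = mc_prediction(n_samples=20_000)
print(values)
```

Solving the Bellman equations for this corridor by hand gives -1 and 3 for the two non-terminal cells under the random policy, so the sampled estimates should land close to those numbers.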

In [15]:
N = 100000
mc_predictor = hb.MonteCarloPredictor(mdp)
mc_predictor.evaluate_policy(random_policy, n_samples=N)
hb.Render.preview_V(mdp=mdp, learned_V=mc_predictor.value_functions)

╔═══════╦═══════╦═══════╗
║ -1.00 ║  2.98 ║  0.00 ║
╚═══════╩═══════╩═══════╝


### Monte Carlo Control

Monte Carlo control works in a similar fashion. First, an object is initialized, which can then be trained using the `.train` method.

The learned policy can be retrieved via the `.policy` attribute of the `MonteCarloController` object.
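
The sketch below illustrates one common form of Monte Carlo control: on-policy, first-visit, with an epsilon-greedy behaviour policy. It runs on a hypothetical 1x3 corridor in plain Python. Whether `MonteCarloController` uses this exact variant is not stated in this notebook, so treat the algorithmic details, helper names, start state, and reward and wall behaviour as assumptions.

```python
import random
from collections import defaultdict

ACTIONS = ("up", "down", "left", "right")
TERMINAL = 2
REWARDS = {0: -1, 1: -1, 2: 10}  # assumed: reward for entering each cell


def step(state, action):
    """Deterministic corridor move; a move into a wall leaves the agent in place."""
    if action == "left":
        nxt = max(state - 1, 0)
    elif action == "right":
        nxt = min(state + 1, 2)
    else:
        nxt = state
    return nxt, REWARDS[nxt]


def epsilon_greedy(Q, state, epsilon, rng):
    """Explore with probability epsilon, otherwise take the best-known action."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])


def mc_control(n_episodes, epsilon=0.25, gamma=1.0, max_depth=1000, seed=0):
    """Learn Q from sampled episodes and return the greedy policy with Q."""
    rng = random.Random(seed)
    Q = defaultdict(float)
    counts = defaultdict(int)
    for _episode in range(n_episodes):
        state, episode = 0, []
        for _step in range(max_depth):
            if state == TERMINAL:
                break
            action = epsilon_greedy(Q, state, epsilon, rng)
            nxt, reward = step(state, action)
            episode.append((state, action, reward))
            state = nxt
        # Backward pass: the last overwrite per (state, action) pair is the
        # return that followed its first visit.
        G = 0.0
        first_visit = {}
        for s, a, r in reversed(episode):
            G = r + gamma * G
            first_visit[(s, a)] = G
        for key, g in first_visit.items():
            counts[key] += 1
            Q[key] += (g - Q[key]) / counts[key]  # incremental sample mean
    greedy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in (0, 1)}
    return greedy, Q


greedy_policy, Q = mc_control(n_episodes=5000)
print(greedy_policy)
```

Because the behaviour policy keeps exploring, the learned Q-values describe the epsilon-greedy policy rather than the fully greedy one; only the extracted greedy policy is expected to match the dynamic programming result.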

In [17]:

N = 100000
mc_control = hb.MonteCarloController(mdp, epsilon=0.25)
mc_control.train(n_episodes=N)
trained_policy = mc_control.policy

This trained policy can be run in an episode. The render mode `"sar"` gives a quick overview of the actions taken by the agent.

In [18]:
print(trained_policy)
episode = hb.Episode(mdp=mdp, policy=trained_policy, max_depth=1000)
episode.run("sar")

╔═════════╦═════════╦═════════╗
║    →    ║    →    ║ ↑/↓/←/→ ║
╚═════════╩═════════╩═════════╝
t: 0 | S: [0, 1], R: -1, A: ↑
t: 1 | S: [0, 1], R: -1, A: →
t: 2 | S: [0, 2], R: 10, A: None
