# <h1><center>Q-learning Tutorial</center></h1>

This notebook provides an introductory tutorial to Q-learning. Specifically, we will implement Q-Learning using a Q-table in JAX and use it to steer a simplified brittle star robot towards a random target. The [brittle star robot and its environment](https://github.com/Co-Evolve/brb/tree/new-framework/brb/brittle_star) is part of the [**the Bio-inspired Robotics Benchmark (BRB)**](https://github.com/Co-Evolve/brb). Instead of directly outputting joint-level actions, we will use our Q-learned controller to modulate a CPG that in turn outputs the joint-level actions.


## Q-Learning

* Belongs to the class of model-free algorithms, meaning that it does not require prior knowledge (or a model) of the environment.
* Is an off-policy algorithm, meaning that it does not necessarily use the 'current policy' to produce actions
* As it name entails, the goal of the algorithm is to learn the Q-function
    * The Q-function receives state and action pairs, and (tries to) return the expected cumulative reward
        * In other words, it tries to predict the expected payoff of doing a certain action in a given state
* In this tutorial we will focus on tabular Q-learning
    * We will optimize the values of a two-dimensional table with states as rows and actions as columns.
    * Each cell in this table corresponds to the Q value of a state and action
        * As we use a table, the state and actions have to be discretized.
    * Initially, this Q-table will be populated with arbitrary values or zeros
    * The q-learning algorithm then tries to optimize the values in this table through an iterative process of exploration and exploitation.
        * During the exploration phase, the agent will take random actions to gather information about the environment and update the Q-table accordingly
        * As the agent explores more, it gradually transitions to the exploitation phase, where it leverages the learned Q-values to make more informed decisions and maximize the cumulative reward
    * The Q-learning algorithm can be summarized as follows:
        1. Initialize the Q-table with arbitrary values or zeros
        2. Observe the current state of the environment
        3. Choose an action to take based on a exploration-exploitation trade-off. This can for instance be done by using an exploration strategy like epsilon-greedy, where the agent selects a random action with a certain probability and chooses the action with the highest Q-value with a complementary probability.
        4. Perform the chosen action and observe the reward and the resulting next state
        5. Update the Q-value of the state-action pair using the Q-learning update rule:<br>
        $Q(s,a) = (1 - \alpha) Q(s,a) + \alpha(r + \gamma\max(Q(s', a')))$<br>
                where alpha $\alpha$ is the learning rate, gamma $\gamma$ is the discount factor that determines the importance of future rewards, $r$ is the immediate reward obtained (given by the environment), and $\max(Q(s', a'))$ represents the maximum Q-value for the next state. 
        6. Repeat steps $2$ to $5$ until convergence or a predefined number of iterations.
* A natural extension of the tabular Q-learning algorithm is the Deep-QLearning (DQN) algorithm. As it name gives away, the DQN algorithm swaps the Q-table for a deep Q-neural-network. This neural network maps states to Q(s, a) values. 

* This tutorial will focus on Q-Table learning 

## Implementing tabular Q-learning in JAX

When implementing something in JAX it's important to remember that JAX follows the functional programming paradigm. Put simply, we thus rely on pure functions (deterministic and without side effects) and immutable data structures (instead of changing data in place, new data structures are created with the desired modifications) as primary building blocks.

We will thus start by creating a data structure that we will use to hold our current Q-Learner's state and its related learning (hyper)parameters.

In [1]:
from typing import Tuple
import functools
import chex
from flax import struct
import jax
import jax.numpy as jnp


@struct.dataclass
class QLearningPolicyState:
    q_table: jnp.ndarray
    alpha: float
    epsilon: float
    gamma: float
    rng: chex.PRNGKey


class QLearningPolicy:
    def __init__(
            self,
            num_states: int,
            num_actions: int
            ) -> None:
        self._num_states = num_states
        self._num_actions = num_actions

    @functools.partial(jax.jit, static_argnums=(0,))
    def apply_q_learning_update_rule(
            self,
            policy_state: QLearningPolicyState,
            state_index: int,
            next_state_index: int,
            action_index: int,
            reward: float
            ) -> QLearningPolicyState:
        old_q_value = policy_state.q_table[state_index, action_index]
        best_future_q_value = jnp.max(policy_state.q_table[next_state_index])
        q_value_update = reward + policy_state.gamma * best_future_q_value
        new_q_value = (1 - policy_state.alpha) * old_q_value + policy_state.alpha * q_value_update

        new_q_table = policy_state.q_table.at[state_index, action_index].set(new_q_value)
        # noinspection PyUnresolvedReferences
        return policy_state.replace(q_table=new_q_table)

    @functools.partial(jax.jit, static_argnums=(0,))
    def epsilon_greedy_policy(
            self,
            policy_state: QLearningPolicyState,
            state_index: int
            ) -> Tuple[QLearningPolicyState, int]:
        rng, explore_rng, random_action_rng = jax.random.split(policy_state.rng, 3)
        explore = jax.random.uniform(explore_rng) < policy_state.epsilon

        def get_random_action() -> int:
            return jax.random.choice(key=random_action_rng, a=jnp.arange(policy_state.q_table.shape[1]))

        def get_greedy_action() -> int:
            return jnp.argmax(policy_state.q_table[state_index])

        action_index = jax.lax.cond(
                pred=explore, true_fun=get_random_action, false_fun=get_greedy_action
                )

        # noinspection PyUnresolvedReferences
        return policy_state.replace(rng=rng), action_index

    @functools.partial(jax.jit, static_argnums=(0,))
    def reset(
            self,
            rng: chex.PRNGKey,
            alpha: float,
            gamma: float,
            epsilon: float
            ) -> QLearningPolicyState:
        rng, q_table_rng = jax.random.split(rng, 2)
        # noinspection PyArgumentList
        return QLearningPolicyState(
                q_table=jax.random.uniform(
                        key=rng,
                        shape=(self._num_states, self._num_actions),
                        dtype=jnp.float32,
                        minval=-0.001,
                        maxval=0.001
                        ), alpha=alpha, epsilon=epsilon, gamma=gamma, rng=rng
                )

Great, we have implemented the tabular Q-learning algorithm. Time to test it out with the brittle star environment!

## Case study: CPG modulations for directed brittle star locomotion

### Environment setup
* Load BRB's brittle star environment -> targeted locomotion
* Create a state indexer
* Create an action mapper 

In [2]:
import os
import subprocess
import logging

try:
    if subprocess.run('nvidia-smi').returncode:
        raise RuntimeError(
                'Cannot communicate with GPU. '
                'Make sure you are using a GPU Colab runtime. '
                'Go to the Runtime menu and select Choose runtime type.'
                )

    # Add an ICD config so that glvnd can pick up the Nvidia EGL driver.
    # This is usually installed as part of an Nvidia driver package, but the Colab
    # kernel doesn't install its driver via APT, and as a result the ICD is missing.
    # (https://github.com/NVIDIA/libglvnd/blob/master/src/EGL/icd_enumeration.md)
    NVIDIA_ICD_CONFIG_PATH = '/usr/share/glvnd/egl_vendor.d/10_nvidia.json'
    if not os.path.exists(NVIDIA_ICD_CONFIG_PATH):
        with open(NVIDIA_ICD_CONFIG_PATH, 'w') as f:
            f.write(
                    """{
                            "file_format_version" : "1.0.0",
                            "ICD" : {
                                "library_path" : "libEGL_nvidia.so.0"
                            }
                        }
                        """
                    )

    # Configure MuJoCo to use the EGL rendering backend (requires GPU)
    print('Setting environment variable to use GPU rendering:')
    %env MUJOCO_GL=egl

    # Check if jax finds the GPU
    import jax

    print(jax.devices('gpu'))
except Exception:
    logging.warning("Failed to initialize GPU. Everything will run on the cpu.")

try:
    print('Checking that the mujoco installation succeeded:')
    import mujoco

    mujoco.MjModel.from_xml_string('<mujoco/>')
except Exception as e:
    raise e from RuntimeError(
            'Something went wrong during installation. Check the shell output above '
            'for more information.\n'
            'If using a hosted Colab runtime, make sure you enable GPU acceleration '
            'by going to the Runtime menu and selecting "Choose runtime type".'
            )

print('MuJoCo installation successful.')

Tue Jan 30 19:32:46 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A2                      On  | 00000000:3B:00.0 Off |                    0 |
|  0%   46C    P8               8W /  60W |      4MiB / 15356MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [3]:
from brb.brittle_star.environment.directed_locomotion.shared import \
    BrittleStarDirectedLocomotionEnvironmentConfiguration
import numpy as np
from mujoco_utils.environment.base import MuJoCoEnvironmentConfiguration
from brb.brittle_star.environment.directed_locomotion.dual import BrittleStarDirectedLocomotionEnvironment
from typing import List
import mediapy as media
from brb.brittle_star.mjcf.morphology.morphology import MJCFBrittleStarMorphology
from brb.brittle_star.mjcf.morphology.specification.default import default_brittle_star_morphology_specification
from brb.brittle_star.mjcf.arena.aquarium import AquariumArenaConfiguration, MJCFAquariumArena

morphology_specification = default_brittle_star_morphology_specification(
        num_arms=5, num_segments_per_arm=5, use_p_control=True, use_torque_control=False
        )
arena_configuration = AquariumArenaConfiguration(
        size=(10, 5), sand_ground_color=False, attach_target=True, wall_height=1.5, wall_thickness=0.1
        )
environment_configuration = BrittleStarDirectedLocomotionEnvironmentConfiguration(
        joint_randomization_noise_scale=0.0,
        render_mode="rgb_array",
        simulation_time=20,
        num_physics_steps_per_control_step=10,
        time_scale=2,
        camera_ids=[0, 1],
        render_size=(480, 640)
        )


def create_environment() -> BrittleStarDirectedLocomotionEnvironment:
    morphology = MJCFBrittleStarMorphology(
            specification=morphology_specification
            )
    arena = MJCFAquariumArena(
            configuration=arena_configuration
            )
    env = BrittleStarDirectedLocomotionEnvironment.from_morphology_and_arena(
            morphology=morphology, arena=arena, configuration=environment_configuration, backend="MJX"
            )
    return env


def post_render(
        render_output: List[np.ndarray],
        environment_configuration: MuJoCoEnvironmentConfiguration
        ) -> np.ndarray:
    if render_output is None:
        # Temporary workaround until https://github.com/google-deepmind/mujoco/issues/1379 is fixed
        return None

    num_cameras = len(environment_configuration.camera_ids)
    num_envs = len(render_output) // num_cameras

    if num_cameras > 1:
        # Horizontally stack frames of the same environment
        frames_per_env = np.array_split(render_output, num_envs)
        render_output = [np.concatenate(env_frames, axis=1) for env_frames in frames_per_env]

    # Vertically stack frames of different environments
    render_output = np.concatenate(render_output, axis=0)

    return render_output[:, :, ::-1]  # RGB to BGR


def show_video(
        images: List[np.ndarray | None],
        path: str | None = None
        ) -> str | None:
    # Temporary workaround until https://github.com/google-deepmind/mujoco/issues/1379 is fixed
    filtered_images = [image for image in images if image is not None]
    num_nones = len(images) - len(filtered_images)
    if num_nones > 0:
        logging.warning(
                f"env.render produced {num_nones} None's. Resulting video might be a bit choppy (consquence of https://github.com/google-deepmind/mujoco/issues/1379)."
                )
    if path:
        media.write_video(path=path, images=filtered_images)
    return media.show_video(images=filtered_images)

In [4]:
rng = jax.random.PRNGKey(seed=0)
env = create_environment()
env_reset_fn = jax.jit(env.reset)
env_step_fn = jax.jit(env.step)

In [5]:
print("Observation space:")
print(env.observation_space)
print()
print("Action space:")
print(env.action_space)
rng, sub_rng = jax.random.split(rng, 2)
env_state = env_reset_fn(rng=sub_rng)
media.show_image(post_render(env.render(env_state), environment_configuration=env.environment_configuration))

Observation space:
Dict('in_plane_joint_position': Box(-0.5235988, 0.5235988, (25,), <class 'jax.numpy.float32'>), 'out_of_plane_joint_position': Box(-0.5235988, 0.5235988, (25,), <class 'jax.numpy.float32'>), 'in_plane_joint_velocity': Box(-inf, inf, (25,), <class 'jax.numpy.float32'>), 'out_of_plane_joint_velocity': Box(-inf, inf, (25,), <class 'jax.numpy.float32'>), 'segment_contact': Box(0.0, 1.0, (25,), <class 'jax.numpy.float32'>), 'disk_position': Box(-inf, inf, (3,), <class 'jax.numpy.float32'>), 'disk_rotation': Box(-3.1415927, 3.1415927, (3,), <class 'jax.numpy.float32'>), 'disk_linear_velocity': Box(-inf, inf, (3,), <class 'jax.numpy.float32'>), 'disk_angular_velocity': Box(-inf, inf, (3,), <class 'jax.numpy.float32'>), 'unit_xy_direction_to_target': Box(-1.0, 1.0, (2,), <class 'jax.numpy.float32'>), 'xy_distance_to_target': Box(0.0, inf, (1,), <class 'jax.numpy.float32'>))

Action space:
Box(-0.5235988, 0.5235988, (50,), <class 'jax.numpy.float32'>)


### CPG setup

* CPG model from Sproewitz paper
* Oscillator system from the CPG tutorial

Copied CPG implementation from the CPG tutorial:

In [6]:
import functools
from flax import struct
import jax
import jax.numpy as jnp
import chex
from functools import partial
from typing import Tuple

from typing import Callable


def euler_solver(
        current_time: float,
        y: float,
        derivative_fn: Callable[[float, float], float],
        delta_time: float
        ) -> float:
    slope = derivative_fn(current_time, y)
    next_y = y + delta_time * slope
    return next_y


def rk4_solver(
        current_time: float,
        y: float,
        derivative_fn: Callable[[float, float], float],
        delta_time: float
        ) -> float:
    # This is the original euler
    slope1 = derivative_fn(current_time, y)
    # These are additional slope calculations that improve our approximation of the true slope  
    slope2 = derivative_fn(current_time + delta_time / 2, y + slope1 * delta_time / 2)
    slope3 = derivative_fn(current_time + delta_time / 2, y + slope2 * delta_time / 2)
    slope4 = derivative_fn(current_time + delta_time, y + slope3 * delta_time)
    average_slope = (slope1 + 2 * slope2 + 2 * slope3 + slope4) / 6
    next_y = y + average_slope * delta_time
    return next_y


@struct.dataclass
class CPGState:
    time: float
    phases: jnp.ndarray
    dot_amplitudes: jnp.ndarray  # first order derivative of the amplitude
    amplitudes: jnp.ndarray
    dot_offsets: jnp.ndarray  # first order derivative of the offset 
    offsets: jnp.ndarray
    outputs: jnp.ndarray

    # We'll make these modulatory parameters part of the state as they will change as well
    R: jnp.ndarray
    X: jnp.ndarray
    omegas: jnp.ndarray
    rhos: jnp.ndarray


class CPG:
    def __init__(
            self,
            weights: jnp.ndarray,
            amplitude_gain: float = 20,
            offset_gain: float = 20,
            dt: float = 0.01,
            solver: str = "euler"
            ) -> None:
        self._weights = weights
        self._amplitude_gain = amplitude_gain
        self._offset_gain = offset_gain
        self._dt = dt
        assert solver in ["euler", "rk4"], f"'solver' must be one of ['euler', 'rk4']"

        if solver == "euler":
            self._solver = euler_solver
        else:
            self._solver = rk4_solver

    @property
    def num_oscillators(
            self
            ) -> int:
        return self._weights.shape[0]

    @staticmethod
    def phase_de(
            weights: jnp.ndarray,
            amplitudes: jnp.ndarray,
            phases: jnp.ndarray,
            phase_biases: jnp.ndarray,
            omegas: jnp.ndarray
            ) -> jnp.ndarray:
        @jax.vmap  # vectorizes this function for us over an additional batch dimension (in this case over all oscillators)
        def sine_term(
                phase_i: float,
                phase_biases_i: float
                ) -> jnp.ndarray:
            return jnp.sin(phases - phase_i - phase_biases_i)

        couplings = jnp.sum(weights * amplitudes * sine_term(phase_i=phases, phase_biases_i=phase_biases), axis=1)
        return omegas + couplings

    @staticmethod
    def second_order_de(
            gain: jnp.ndarray,
            modulator: jnp.ndarray,
            values: jnp.ndarray,
            dot_values: jnp.ndarray
            ) -> jnp.ndarray:

        return gain * ((gain / 4) * (modulator - values) - dot_values)

    @staticmethod
    def first_order_de(
            dot_values: jnp.ndarray
            ) -> jnp.ndarray:
        return dot_values

    @staticmethod
    def output(
            offsets: jnp.ndarray,
            amplitudes: jnp.ndarray,
            phases: jnp.ndarray
            ) -> jnp.ndarray:
        return offsets + amplitudes * jnp.cos(phases)

    def reset(
            self,
            rng: chex.PRNGKey
            ) -> CPGState:
        phase_rng, amplitude_rng, offsets_rng = jax.random.split(rng, 3)
        # noinspection PyArgumentList
        state = CPGState(
                phases=jax.random.uniform(
                        key=phase_rng, shape=(self.num_oscillators,), dtype=jnp.float32, minval=-0.01, maxval=0.01
                        ),
                amplitudes=jnp.zeros(self.num_oscillators),
                offsets=jnp.zeros(self.num_oscillators),
                dot_amplitudes=jnp.zeros(self.num_oscillators),
                dot_offsets=jnp.zeros(self.num_oscillators),
                outputs=jnp.zeros(self.num_oscillators),
                time=0.0,
                R=jnp.zeros(self.num_oscillators),
                X=jnp.zeros(self.num_oscillators),
                omegas=jnp.zeros(self.num_oscillators),
                rhos=jnp.zeros_like(self._weights)
                )
        return state

    @functools.partial(jax.jit, static_argnums=(0,))
    def step(
            self,
            state: CPGState
            ) -> CPGState:
        # Update phase
        new_phases = self._solver(
                current_time=state.time,
                y=state.phases,
                derivative_fn=lambda
                    t,
                    y: self.phase_de(
                        omegas=state.omegas,
                        amplitudes=state.amplitudes,
                        phases=y,
                        phase_biases=state.rhos,
                        weights=self._weights
                        ),
                delta_time=self._dt
                )
        new_dot_amplitudes = self._solver(
                current_time=state.time,
                y=state.dot_amplitudes,
                derivative_fn=lambda
                    t,
                    y: self.second_order_de(
                        gain=self._amplitude_gain, modulator=state.R, values=state.amplitudes, dot_values=y
                        ),
                delta_time=self._dt
                )
        new_amplitudes = self._solver(
                current_time=state.time,
                y=state.amplitudes,
                derivative_fn=lambda
                    t,
                    y: self.first_order_de(dot_values=state.dot_amplitudes),
                delta_time=self._dt
                )
        new_dot_offsets = self._solver(
                current_time=state.time,
                y=state.dot_offsets,
                derivative_fn=lambda
                    t,
                    y: self.second_order_de(
                        gain=self._offset_gain, modulator=state.X, values=state.offsets, dot_values=y
                        ),
                delta_time=self._dt
                )
        new_offsets = self._solver(
                current_time=0,
                y=state.offsets,
                derivative_fn=lambda
                    t,
                    y: self.first_order_de(dot_values=state.dot_offsets),
                delta_time=self._dt
                )

        new_outputs = self.output(offsets=new_offsets, amplitudes=new_amplitudes, phases=new_phases)
        # noinspection PyUnresolvedReferences
        return state.replace(
                phases=new_phases,
                dot_amplitudes=new_dot_amplitudes,
                amplitudes=new_amplitudes,
                dot_offsets=new_dot_offsets,
                offsets=new_offsets,
                outputs=new_outputs,
                time=state.time + self._dt
                )


def create_cpg() -> CPG:
    ip_oscillator_indices = jnp.arange(0, 10, 2)
    oop_oscillator_indices = jnp.arange(1, 10, 2)

    adjacency_matrix = jnp.zeros((10, 10))
    # Connect oscillators within an arm
    adjacency_matrix = adjacency_matrix.at[ip_oscillator_indices, oop_oscillator_indices].set(1)
    # Connect IP oscillators of neighbouring arms
    adjacency_matrix = adjacency_matrix.at[
        ip_oscillator_indices, jnp.concatenate((ip_oscillator_indices[1:], jnp.array([ip_oscillator_indices[0]])))].set(
            1
            )
    # Connect OOP oscillators of neighbouring arms
    adjacency_matrix = adjacency_matrix.at[oop_oscillator_indices, jnp.concatenate(
            (oop_oscillator_indices[1:], jnp.array([oop_oscillator_indices[0]]))
            )].set(1)

    # Make adjacency matrix symmetric (i.e. make all connections bi-directional)
    adjacency_matrix = jnp.maximum(adjacency_matrix, adjacency_matrix.T)
    # Connect oscillators within an arm
    ip_oscillator_indices = jnp.arange(0, 10, 2)
    oop_oscillator_indices = jnp.arange(1, 10, 2)
    adjacency_matrix = adjacency_matrix.at[ip_oscillator_indices, oop_oscillator_indices].set(1)

    return CPG(
            weights=5 * adjacency_matrix,
            amplitude_gain=20,
            offset_gain=20,
            dt=environment_configuration.control_timestep
            )


def get_oscillator_indices_for_arm(
        arm_index: int
        ) -> Tuple[int, int]:
    return arm_index * 2, arm_index * 2 + 1


@jax.jit
def modulate_cpg(
        cpg_state: CPGState,
        leading_arm_index: int,
        joint_limit: float
        ) -> CPGState:
    left_rower_arm_indices = [(leading_arm_index - 1) % 5, (leading_arm_index - 2) % 5]
    right_rower_arm_indices = [(leading_arm_index + 1) % 5, (leading_arm_index + 2) % 5]

    leading_arm_ip_oscillator_index, leading_arm_oop_oscillator_index = get_oscillator_indices_for_arm(
            arm_index=leading_arm_index
            )

    R = jnp.zeros_like(cpg_state.R)
    X = jnp.zeros_like(cpg_state.X)
    rhos = jnp.zeros_like(cpg_state.rhos)
    omegas = jnp.pi * jnp.ones_like(cpg_state.omegas)
    phases_bias_pairs = []

    def modulate_leading_arm(
            _X: jnp.ndarray,
            _arm_index: int
            ) -> jnp.ndarray:
        ip_oscillator_index, oop_oscillator_index = get_oscillator_indices_for_arm(arm_index=_arm_index)
        return _X.at[oop_oscillator_index].set(joint_limit)

    def modulate_left_rower(
            _R: jnp.ndarray,
            _arm_index: int
            ) -> Tuple[jnp.ndarray, List[Tuple[int, int, float]]]:
        ip_oscillator_index, oop_oscillator_index = get_oscillator_indices_for_arm(arm_index=_arm_index)
        _R = _R.at[ip_oscillator_index].set(joint_limit)
        _R = _R.at[oop_oscillator_index].set(joint_limit)
        _phase_bias_pairs = [(ip_oscillator_index, oop_oscillator_index, jnp.pi / 2)]
        return _R, _phase_bias_pairs

    def phase_biases_first_left_rower(
            _arm_index: int
            ) -> List[Tuple[int, int, float]]:
        ip_oscillator_index, oop_oscillator_index = get_oscillator_indices_for_arm(arm_index=_arm_index)
        _phase_bias_pairs = [(ip_oscillator_index, leading_arm_ip_oscillator_index, jnp.pi / 4),
                             (leading_arm_oop_oscillator_index, oop_oscillator_index, jnp.pi / 4)]
        return _phase_bias_pairs

    def modulate_right_rower(
            _R: jnp.ndarray,
            _arm_index: int
            ) -> Tuple[jnp.ndarray, List[Tuple[int, int, float]]]:
        ip_oscillator_index, oop_oscillator_index = get_oscillator_indices_for_arm(arm_index=_arm_index)
        _R = _R.at[ip_oscillator_index].set(joint_limit)
        _R = _R.at[oop_oscillator_index].set(joint_limit)
        _phase_bias_pairs = [(oop_oscillator_index, ip_oscillator_index, jnp.pi / 2)]
        return _R, _phase_bias_pairs

    def phase_biases_first_right_rower(
            _arm_index: int
            ) -> List[Tuple[int, int, float]]:
        ip_oscillator_index, oop_oscillator_index = get_oscillator_indices_for_arm(arm_index=_arm_index)
        _phase_bias_pairs = [(leading_arm_ip_oscillator_index, ip_oscillator_index, jnp.pi / 4),
                             (oop_oscillator_index, leading_arm_oop_oscillator_index, jnp.pi / 4)]
        return _phase_bias_pairs

    def phase_biases_second_rowers(
            _left_arm_index: int,
            _right_arm_index: int
            ) -> List[Tuple[int, int, float]]:
        left_ip_oscillator_index, _ = get_oscillator_indices_for_arm(arm_index=_left_arm_index)
        right_ip_oscillator_index, _ = get_oscillator_indices_for_arm(arm_index=_right_arm_index)
        _phase_bias_pairs = [(left_ip_oscillator_index, right_ip_oscillator_index, jnp.pi)]
        return _phase_bias_pairs

    X = modulate_leading_arm(_X=X, _arm_index=leading_arm_index)

    R, phb = modulate_left_rower(_R=R, _arm_index=left_rower_arm_indices[0])
    phases_bias_pairs += phb

    R, phb = modulate_left_rower(_R=R, _arm_index=left_rower_arm_indices[1])
    phases_bias_pairs += phb

    R, phb = modulate_right_rower(_R=R, _arm_index=right_rower_arm_indices[0])
    phases_bias_pairs += phb

    R, phb = modulate_right_rower(_R=R, _arm_index=right_rower_arm_indices[1])
    phases_bias_pairs += phb

    phases_bias_pairs += phase_biases_first_left_rower(_arm_index=left_rower_arm_indices[0])
    phases_bias_pairs += phase_biases_first_right_rower(_arm_index=right_rower_arm_indices[0])

    phases_bias_pairs += phase_biases_second_rowers(
            _left_arm_index=left_rower_arm_indices[1], _right_arm_index=right_rower_arm_indices[1]
            )

    for oscillator1, oscillator2, bias in phases_bias_pairs:
        rhos = rhos.at[oscillator1, oscillator2].set(bias)
        rhos = rhos.at[oscillator2, oscillator1].set(-bias)

    # noinspection PyUnresolvedReferences
    return cpg_state.replace(
            R=R, X=X, rhos=rhos, omegas=omegas
            )


@jax.jit
def map_cpg_outputs_to_actions(
        cpg_state: CPGState
        ) -> jnp.ndarray:
    num_arms = 5
    num_oscillators_per_arm = 2
    num_segments_per_arm = 5

    cpg_outputs_per_arm = cpg_state.outputs.reshape((num_arms, num_oscillators_per_arm))
    cpg_outputs_per_segment = cpg_outputs_per_arm.repeat(num_segments_per_arm, axis=0)

    actions = cpg_outputs_per_segment.flatten()
    return actions

Helper function to create our CPG system:

Implement the action mapper: in this case our `action_index` is the leading arm for our CPG modulation.

In [7]:
@functools.partial(jax.jit, static_argnums=(0,))
def cpg_action_mapper(
        cpg: CPG,
        cpg_state: CPGState,
        action_index: int,
        joint_limit: float
        ) -> Tuple[CPGState, jnp.ndarray]:
    cpg_state = modulate_cpg(cpg_state=cpg_state, leading_arm_index=action_index, joint_limit=joint_limit)
    cpg_state = cpg.step(state=cpg_state)
    actions = map_cpg_outputs_to_actions(cpg_state=cpg_state)
    return cpg_state, actions

Implement the state indexer. In this case, we will only use the `unit_xy_direction_to_target` observation. We'll convert this into an actual angle w.r.t. the robot's orientation and descretize it in 5 areas (one per arm).

In [8]:
from typing import Dict


@jax.jit
def state_indexer(
        observations: Dict[str, jnp.ndarray]
        ) -> int:
    direction_to_target = observations["unit_xy_direction_to_target"]
    angle_to_target_wrt_x_axis = jnp.arctan2(direction_to_target[1], direction_to_target[0])
    disk_rotation_wrt_x_axis = observations["disk_rotation"][-1]

    angle_to_target_wrt_first_arm = angle_to_target_wrt_x_axis - disk_rotation_wrt_x_axis

    num_arms = morphology_specification.number_of_arms
    a = jnp.pi / num_arms
    bin_edges = jnp.arange(-num_arms * a, (num_arms + 1) * a, 2 * a)
    bin_index = jnp.digitize(angle_to_target_wrt_first_arm, bin_edges, right=False) - 1
    bin_index_to_arm_index = jnp.array([3, 4, 0, 1, 2])

    return bin_index_to_arm_index[bin_index]

### Validation check
* We can write a visualize_episode function
* As the current solution is quite trivial (identitiy matrix), we can do a validation of all of the above code by using this as a qtable

In [9]:
def visualize_episode(policy_state: QLearningPolicyState, rng: chex.PRNGKey) -> None:
    env_rng, cpg_rng, = jax.random.split(rng, 2) 
    
    env_state = env_reset_fn(env_rng)
    cpg_state = cpg.reset(cpg_rng)
    
    # noinspection PyUnresolvedReferences
    # make the policy deterministic
    policy_state = policy_state.replace(epsilon=0)
    
    frames = []
    while not(env_state.terminated | env_state.truncated):
        state_index = state_indexer(env_state.observations)
        policy_state, action_index = q_learning_policy.epsilon_greedy_policy(
                policy_state=policy_state, state_index=state_index
                ) 
        cpg_state, actions = cpg_action_mapper(cpg=cpg, cpg_state=cpg_state, action_index=action_index, joint_limit=env.action_space.high[0] * 0.5)
        
        env_state = env_step_fn(env_state, actions)
        frame = post_render(env.render(state=env_state), environment_configuration=environment_configuration)
        frames.append(frame)
    
    show_video(images=frames)

In [10]:
q_learning_policy = QLearningPolicy(num_states=5, num_actions=5)
cpg = create_cpg()

rng, episode_rng, q_learner_rng = jax.random.split(rng, 3)
policy_state = q_learning_policy.reset(rng=q_learner_rng, alpha=0.1, gamma=0.99, epsilon=0)
policy_state = policy_state.replace(q_table=jnp.identity(n=5))

visualize_episode(policy_state=policy_state, rng=episode_rng)



0
This browser does not support the video tag.


### Rollout function

In [11]:
from mujoco_utils.environment.mjx_env import MJXEnvState


@jax.jit
def rollout(
        rng: chex.PRNGKey,
        policy_state: QLearningPolicyState
        ) -> Tuple[QLearningPolicyState, jnp.ndarray]:
    """
    Do a single episode rollout
    """
    rng, env_rng, cpg_rng = jax.random.split(rng, 3)
    env_state = env_reset_fn(rng=rng)
    cpg_state = cpg.reset(rng=cpg_rng)

    def policy_step(
            _state: Tuple[MJXEnvState, QLearningPolicyState, CPGState],
            _
            ):
        _env_state, _policy_state, _cpg_state = _state
        _state_index = state_indexer(_env_state.observations)

        _next_policy_state, _action_index = q_learning_policy.epsilon_greedy_policy(
                policy_state=_policy_state, state_index=_state_index
                )
        _next_cpg_state, _actions = cpg_action_mapper(
                cpg=cpg, cpg_state=_cpg_state, action_index=_action_index, joint_limit=env.action_space.high[0] * 0.5
                )
        _next_env_state = env_step_fn(state=_env_state, action=_actions)

        _next_state_index = state_indexer(_next_env_state.observations)
        carry = (_next_env_state, _next_policy_state, _next_cpg_state)
        return carry, {
                "state_index": _state_index, "next_state_index": _next_state_index, "action_index": _action_index,
                "reward": _next_env_state.reward}

    carry, scan_out = jax.lax.scan(
            policy_step,
            (env_state, policy_state, cpg_state),
            (),
            env.environment_configuration.total_num_control_steps - 1
            # avoid last step (terminal one, so _next_state_index would be the reset observations, which we don't want in this case)
            )
    _, policy_state, _ = carry
    
    return policy_state, scan_out

### Train function

In [12]:
from typing import Union


def update_policy(
        policy_state: QLearningPolicyState,
        rollout_data: Dict[str, jnp.ndarray]
        ) -> QLearningPolicyState:
    def _update_step(
            _policy_state: QLearningPolicyState,
            data_sample: Dict[str, Union[int, float]]
            ):
        _updated_policy_state = q_learning_policy.apply_q_learning_update_rule(
                policy_state=_policy_state,
                state_index=data_sample["state_index"],
                next_state_index=data_sample["next_state_index"],
                action_index=data_sample["action_index"],
                reward=data_sample["reward"]
                )
        return _updated_policy_state, None

    policy_state, _ = jax.lax.scan(
        _update_step, policy_state, rollout_data
        )
    return policy_state

In [13]:
rng, q_learner_rng = jax.random.split(rng, 2)
q_learning_policy = QLearningPolicy(num_states=5, num_actions=5)
policy_state = q_learning_policy.reset(rng=q_learner_rng, alpha=0.1, gamma=0.99, epsilon=0.3)
cpg = create_cpg()

In [17]:
from tqdm import tqdm

def train_policy(rng: chex.PRNGKey, policy_state: QLearningPolicyState, num_episodes: int) -> QLearningPolicyState:
    original_q_table = policy_state.q_table
    
    for step in tqdm(range(num_episodes), desc="Training policy"):
        rng, sub_rng = jax.random.split(rng, 2)
        policy_state, rollout_data = rollout(rng=sub_rng, policy_state=policy_state)
        policy_state = update_policy(policy_state=policy_state, rollout_data=rollout_data)

        avg_reward = jnp.average(rollout_data["reward"])
        q_table_update = policy_state.q_table - original_q_table

        tqdm.write(f"Step: {step}")
        tqdm.write(f"\tAverage reward: {avg_reward}")
        tqdm.write(f"\tQ-table update:")
        tqdm.write(str(q_table_update))
    return policy_state

First iteration might take a while (jit compilation of the rollout function)

In [18]:
rng, train_rng = jax.random.split(rng, 2)
updated_policy_state = train_policy(rng=train_rng, policy_state=policy_state, num_episodes=100)

Training policy:   1%|          | 1/100 [00:49<1:22:19, 49.90s/it]

Step: 0
	Average reward: -0.007328457664698362
	Q-table update:
[[ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [-0.00942609 -0.0063887  -0.00978278 -0.01446791 -0.01234108]]


Training policy:   2%|▏         | 2/100 [01:30<1:12:31, 44.41s/it]

Step: 1
	Average reward: -0.0016144717810675502
	Q-table update:
[[ 0.          0.          0.          0.          0.        ]
 [ 0.00105066  0.00277698  0.00789041  0.00446887  0.00064403]
 [-0.00075969 -0.00224661 -0.00100014 -0.00294211 -0.00079892]
 [ 0.          0.          0.          0.          0.        ]
 [-0.00942609 -0.0063887  -0.00978278 -0.01446791 -0.01234108]]


Training policy:   3%|▎         | 3/100 [02:09<1:07:52, 41.98s/it]

Step: 2
	Average reward: -0.004330573137849569
	Q-table update:
[[-0.00955017 -0.00605517 -0.00931935 -0.01048278 -0.00341312]
 [ 0.00105066  0.00277698  0.00789041  0.00446887  0.00064403]
 [-0.00075969 -0.00224661 -0.00100014 -0.00294211 -0.00079892]
 [ 0.          0.          0.          0.          0.        ]
 [-0.00942609 -0.0063887  -0.00978278 -0.01446791 -0.01234108]]


Training policy:   4%|▍         | 4/100 [02:48<1:05:10, 40.73s/it]

Step: 3
	Average reward: 0.0022559131029993296
	Q-table update:
[[ 0.02921684  0.00268832  0.00566957  0.01570714  0.04268624]
 [ 0.03651081  0.03388714  0.04137666  0.0326297   0.03756818]
 [-0.00075969 -0.00224661 -0.00100014 -0.00294211 -0.00079892]
 [ 0.          0.          0.          0.          0.        ]
 [-0.00942609 -0.0063887  -0.00978278 -0.01446791 -0.01234108]]


Training policy:   5%|▌         | 5/100 [03:27<1:03:24, 40.05s/it]

Step: 4
	Average reward: -0.004278081934899092
	Q-table update:
[[ 0.02921684  0.00268832  0.00566957  0.01570714  0.04268624]
 [ 0.03651081  0.03388714  0.04137666  0.0326297   0.03756818]
 [-0.00075969 -0.00224661 -0.00100014 -0.00294211 -0.00079892]
 [ 0.         -0.00395116 -0.00146385  0.         -0.01740062]
 [-0.01566138 -0.01647122 -0.01733224 -0.01771543 -0.0206251 ]]


Training policy:   6%|▌         | 6/100 [04:07<1:02:47, 40.08s/it]

Step: 5
	Average reward: 0.0016707705799490213
	Q-table update:
[[ 0.04618793  0.04389442  0.02716469  0.02914672  0.05411799]
 [ 0.06585576  0.0659995   0.06548399  0.06576505  0.06270025]
 [-0.00075969 -0.00224661 -0.00100014 -0.00294211 -0.00079892]
 [ 0.         -0.00395116 -0.00146385  0.         -0.01740062]
 [-0.01566138 -0.01647122 -0.01733224 -0.01771543 -0.0206251 ]]


Training policy:   7%|▋         | 7/100 [04:47<1:02:00, 40.00s/it]

Step: 6
	Average reward: -0.0019173999316990376
	Q-table update:
[[ 0.04739139  0.04937695  0.03803986  0.04165535  0.06196202]
 [ 0.06087993  0.06303462  0.06667958  0.06475752  0.06523859]
 [-0.00075969 -0.00224661 -0.00100014 -0.00294211 -0.00079892]
 [ 0.         -0.00395116 -0.00146385  0.         -0.01740062]
 [-0.01566138 -0.01647122 -0.01733224 -0.01771543 -0.0206251 ]]


Training policy:   8%|▊         | 8/100 [05:28<1:01:46, 40.29s/it]

Step: 7
	Average reward: 0.0029644803144037724
	Q-table update:
[[ 0.05111822  0.06195284  0.04433484  0.04699047  0.08939579]
 [ 0.10058102  0.09402202  0.13118595  0.09354378  0.08850393]
 [-0.00075969 -0.00224661 -0.00100014 -0.00294211 -0.00079892]
 [ 0.         -0.00395116 -0.00146385  0.         -0.01740062]
 [-0.01566138 -0.01647122 -0.01733224 -0.01771543 -0.0206251 ]]


Training policy:   9%|▉         | 9/100 [06:08<1:01:10, 40.33s/it]

Step: 8
	Average reward: -0.0050645689480006695
	Q-table update:
[[ 0.05111822  0.06195284  0.04433484  0.04699047  0.08939579]
 [ 0.0990446   0.09629192  0.10678337  0.08563443  0.09145891]
 [ 0.01135813  0.02872079  0.01585086  0.00437885  0.04205313]
 [ 0.         -0.00395116 -0.00146385  0.         -0.01740062]
 [-0.01566138 -0.01647122 -0.01733224 -0.01771543 -0.0206251 ]]


Training policy:  10%|█         | 10/100 [06:48<1:00:20, 40.22s/it]

Step: 9
	Average reward: 0.005111172329634428
	Q-table update:
[[ 0.05111822  0.06195284  0.04433484  0.04699047  0.08939579]
 [ 0.0990446   0.09629192  0.10678337  0.08563443  0.09145891]
 [ 0.01865331  0.03247523  0.02645942  0.01454782  0.02945361]
 [ 0.         -0.00395116 -0.00146385  0.         -0.01740062]
 [ 0.16196877  0.12871788  0.13459735  0.10625286  0.10508946]]


Training policy:  11%|█         | 11/100 [07:26<58:51, 39.68s/it]  

Step: 10
	Average reward: -0.001205463893711567
	Q-table update:
[[0.05111822 0.06195284 0.04433484 0.04699047 0.08939579]
 [0.0990446  0.09629192 0.10678337 0.08563443 0.09145891]
 [0.01865331 0.03247523 0.02645942 0.01454782 0.02945361]
 [0.00812251 0.00607731 0.01031263 0.00673596 0.00425522]
 [0.12244091 0.14250691 0.14360213 0.13151622 0.13570847]]


Training policy:  12%|█▏        | 12/100 [08:06<58:15, 39.72s/it]

Step: 11
	Average reward: 0.000570237054489553
	Q-table update:
[[0.07961834 0.08179849 0.07471278 0.07871307 0.0864573 ]
 [0.10635353 0.10315333 0.11476696 0.10717361 0.10580634]
 [0.01865331 0.03247523 0.02645942 0.01454782 0.02945361]
 [0.00812251 0.00607731 0.01031263 0.00673596 0.00425522]
 [0.12244091 0.14250691 0.14360213 0.13151622 0.13570847]]


Training policy:  13%|█▎        | 13/100 [08:46<57:46, 39.85s/it]

Step: 12
	Average reward: -0.002648968016728759
	Q-table update:
[[0.07961834 0.08179849 0.07471278 0.07871307 0.0864573 ]
 [0.10635353 0.10315333 0.11476696 0.10717361 0.10580634]
 [0.02395608 0.01660735 0.0247243  0.02410668 0.02201357]
 [0.00571147 0.00941927 0.00639955 0.00548473 0.00350235]
 [0.12244091 0.14250691 0.14360213 0.13151622 0.13570847]]


Training policy:  14%|█▍        | 14/100 [09:26<56:55, 39.72s/it]

Step: 13
	Average reward: -0.006231142673641443
	Q-table update:
[[ 0.07961834  0.08179849  0.07471278  0.07871307  0.0864573 ]
 [ 0.10635353  0.10315333  0.11476696  0.10717361  0.10580634]
 [ 0.02395608  0.01660735  0.0247243   0.02410668  0.02201357]
 [-0.00925351 -0.00634902 -0.00194556 -0.00290406 -0.00983298]
 [ 0.12244091  0.14250691  0.14360213  0.13151622  0.13570847]]


Training policy:  15%|█▌        | 15/100 [10:05<56:12, 39.68s/it]

Step: 14
	Average reward: 0.0003340225084684789
	Q-table update:
[[ 0.08818576  0.09195022  0.08216756  0.08613822  0.10589756]
 [ 0.10812448  0.10878164  0.11122835  0.10929624  0.10664944]
 [ 0.02395608  0.01660735  0.0247243   0.02410668  0.02201357]
 [-0.00925351 -0.00634902 -0.00194556 -0.00290406 -0.00983298]
 [ 0.12244091  0.14250691  0.14360213  0.13151622  0.13570847]]


Training policy:  16%|█▌        | 16/100 [10:45<55:30, 39.65s/it]

Step: 15
	Average reward: -0.004292625933885574
	Q-table update:
[[0.08818576 0.09195022 0.08216756 0.08613822 0.10589756]
 [0.10812448 0.10878164 0.11122835 0.10929624 0.10664944]
 [0.02395608 0.01660735 0.0247243  0.02410668 0.02201357]
 [0.00445574 0.01536007 0.02762567 0.01851527 0.00747049]
 [0.12307186 0.1406645  0.12832034 0.1360688  0.13628606]]


Training policy:  17%|█▋        | 17/100 [11:24<54:28, 39.38s/it]

Step: 16
	Average reward: -0.0032352705020457506
	Q-table update:
[[0.0980095  0.0976242  0.09662665 0.09552174 0.09841119]
 [0.10057282 0.10567646 0.09464865 0.10625447 0.1062941 ]
 [0.02395608 0.01660735 0.0247243  0.02410668 0.02201357]
 [0.00445574 0.01536007 0.02762567 0.01851527 0.00747049]
 [0.12307186 0.1406645  0.12832034 0.1360688  0.13628606]]


Training policy:  18%|█▊        | 18/100 [12:02<53:32, 39.18s/it]

Step: 17
	Average reward: -0.0020116011146456003
	Q-table update:
[[0.0980095  0.0976242  0.09662665 0.09552174 0.09841119]
 [0.09784168 0.1014356  0.09967005 0.10007841 0.0998591 ]
 [0.0241038  0.01802558 0.02551191 0.04975696 0.02521585]
 [0.00445574 0.01536007 0.02762567 0.01851527 0.00747049]
 [0.12307186 0.1406645  0.12832034 0.1360688  0.13628606]]


Training policy:  19%|█▉        | 19/100 [12:42<53:07, 39.36s/it]

Step: 18
	Average reward: 0.0035523015540093184
	Q-table update:
[[0.0980095  0.0976242  0.09662665 0.09552174 0.09841119]
 [0.10446078 0.10711196 0.10376829 0.10588103 0.10478055]
 [0.12175021 0.11602672 0.08429884 0.13457118 0.10791697]
 [0.00445574 0.01536007 0.02762567 0.01851527 0.00747049]
 [0.12307186 0.1406645  0.12832034 0.1360688  0.13628606]]


Training policy:  20%|██        | 20/100 [13:21<52:21, 39.26s/it]

Step: 19
	Average reward: -0.0038065470289438963
	Q-table update:
[[0.0980095  0.0976242  0.09662665 0.09552174 0.09841119]
 [0.10446078 0.10711196 0.10376829 0.10588103 0.10478055]
 [0.12175021 0.11602672 0.08429884 0.13457118 0.10791697]
 [0.00445574 0.01536007 0.02762567 0.01851527 0.00747049]
 [0.12829855 0.12738569 0.12758417 0.12377679 0.12665601]]


Training policy:  21%|██        | 21/100 [14:00<51:39, 39.23s/it]

Step: 20
	Average reward: -0.006968076806515455
	Q-table update:
[[0.0980095  0.0976242  0.09662665 0.09552174 0.09841119]
 [0.10446078 0.10711196 0.10376829 0.10588103 0.10478055]
 [0.12175021 0.11602672 0.08429884 0.13457118 0.10791697]
 [0.00445574 0.01536007 0.02762567 0.01851527 0.00747049]
 [0.10703472 0.10806508 0.10048493 0.10391533 0.10803758]]


Training policy:  22%|██▏       | 22/100 [14:42<51:55, 39.94s/it]

Step: 21
	Average reward: 0.0011888406006619334
	Q-table update:
[[0.0980095  0.0976242  0.09662665 0.09552174 0.09841119]
 [0.12579605 0.1319943  0.12928696 0.12990564 0.1285765 ]
 [0.13083614 0.12718844 0.11317436 0.12829337 0.12356684]
 [0.00445574 0.01536007 0.02762567 0.01851527 0.00747049]
 [0.10703472 0.10806508 0.10048493 0.10391533 0.10803758]]


Training policy:  23%|██▎       | 23/100 [15:21<50:54, 39.66s/it]

Step: 22
	Average reward: -0.0016309871571138501
	Q-table update:
[[0.1161632  0.10341989 0.10044172 0.1067357  0.10134387]
 [0.12579605 0.1319943  0.12928696 0.12990564 0.1285765 ]
 [0.13083614 0.12718844 0.11317436 0.12829337 0.12356684]
 [0.00445574 0.01536007 0.02762567 0.01851527 0.00747049]
 [0.09932715 0.09938421 0.09900307 0.0987228  0.10270094]]


Training policy:  24%|██▍       | 24/100 [16:02<50:55, 40.20s/it]

Step: 23
	Average reward: -8.790765423327684e-05
	Q-table update:
[[0.1161632  0.10341989 0.10044172 0.1067357  0.10134387]
 [0.12579605 0.1319943  0.12928696 0.12990564 0.1285765 ]
 [0.13083614 0.12718844 0.11317436 0.12829337 0.12356684]
 [0.06215486 0.04924799 0.08005124 0.05117385 0.04694097]
 [0.0983848  0.09833609 0.09544207 0.09079023 0.10010412]]


Training policy:  25%|██▌       | 25/100 [16:43<50:24, 40.33s/it]

Step: 24
	Average reward: -0.0008657514117658138
	Q-table update:
[[0.1161632  0.10341989 0.10044172 0.1067357  0.10134387]
 [0.12579605 0.1319943  0.12928696 0.12990564 0.1285765 ]
 [0.13083614 0.12718844 0.11317436 0.12829337 0.12356684]
 [0.07933073 0.07908982 0.07645434 0.078303   0.07419634]
 [0.09574408 0.09671831 0.09533983 0.09303671 0.0952284 ]]


Training policy:  26%|██▌       | 26/100 [17:22<49:13, 39.91s/it]

Step: 25
	Average reward: 0.0024577570147812366
	Q-table update:
[[0.1161632  0.10341989 0.10044172 0.1067357  0.10134387]
 [0.22038494 0.22977354 0.20409809 0.2375665  0.2027635 ]
 [0.1122446  0.12213819 0.11632075 0.11814593 0.12044893]
 [0.07933073 0.07908982 0.07645434 0.078303   0.07419634]
 [0.09574408 0.09671831 0.09533983 0.09303671 0.0952284 ]]


Training policy:  27%|██▋       | 27/100 [18:02<48:34, 39.92s/it]

Step: 26
	Average reward: -0.006087891757488251
	Q-table update:
[[0.1161632  0.10341989 0.10044172 0.1067357  0.10134387]
 [0.22038494 0.22977354 0.20409809 0.2375665  0.2027635 ]
 [0.11359214 0.10512087 0.10579392 0.10492355 0.10859356]
 [0.07215172 0.0790784  0.07632949 0.08318108 0.06961561]
 [0.09574408 0.09671831 0.09533983 0.09303671 0.0952284 ]]


Training policy:  28%|██▊       | 28/100 [18:43<48:21, 40.30s/it]

Step: 27
	Average reward: 0.0037788860499858856
	Q-table update:
[[0.1161632  0.10341989 0.10044172 0.1067357  0.10134387]
 [0.22038494 0.22977354 0.20409809 0.2375665  0.2027635 ]
 [0.11359214 0.10512087 0.10579392 0.10492355 0.10859356]
 [0.13727996 0.12467214 0.13790666 0.16687791 0.11189965]
 [0.11057987 0.14575826 0.10478403 0.10308123 0.09019978]]


Training policy:  29%|██▉       | 29/100 [19:22<47:17, 39.97s/it]

Step: 28
	Average reward: 0.003507007146254182
	Q-table update:
[[0.209903   0.17549698 0.1692659  0.16725099 0.16894749]
 [0.22038494 0.22977354 0.20409809 0.2375665  0.2027635 ]
 [0.11359214 0.10512087 0.10579392 0.10492355 0.10859356]
 [0.14145517 0.13111296 0.14023887 0.16178675 0.11690387]
 [0.14178823 0.13717633 0.13529053 0.13548404 0.12678666]]


Training policy:  30%|███       | 30/100 [20:03<46:45, 40.08s/it]

Step: 29
	Average reward: 0.00234965980052948
	Q-table update:
[[0.209903   0.17549698 0.1692659  0.16725099 0.16894749]
 [0.22038494 0.22977354 0.20409809 0.2375665  0.2027635 ]
 [0.11359214 0.10512087 0.10579392 0.10492355 0.10859356]
 [0.17189953 0.1694925  0.17276101 0.18146716 0.16943729]
 [0.15042473 0.14875011 0.14767598 0.14665341 0.14660443]]


Training policy:  31%|███       | 31/100 [20:42<45:41, 39.73s/it]

Step: 30
	Average reward: -0.006260796450078487
	Q-table update:
[[0.18958934 0.1797565  0.17173184 0.17306799 0.17489107]
 [0.20834409 0.21621408 0.21230769 0.20804054 0.2092526 ]
 [0.11359214 0.10512087 0.10579392 0.10492355 0.10859356]
 [0.17189953 0.1694925  0.17276101 0.18146716 0.16943729]
 [0.15042473 0.14875011 0.14767598 0.14665341 0.14660443]]


Training policy:  32%|███▏      | 32/100 [21:22<45:05, 39.79s/it]

Step: 31
	Average reward: 0.0017942789709195495
	Q-table update:
[[0.18958934 0.1797565  0.17173184 0.17306799 0.17489107]
 [0.20834409 0.21621408 0.21230769 0.20804054 0.2092526 ]
 [0.11359214 0.10512087 0.10579392 0.10492355 0.10859356]
 [0.19447923 0.18906218 0.19782667 0.19525976 0.19456884]
 [0.18214373 0.1608934  0.17110634 0.17027956 0.16511019]]


Training policy:  33%|███▎      | 33/100 [22:00<43:54, 39.32s/it]

Step: 32
	Average reward: -0.006615381687879562
	Q-table update:
[[0.18958934 0.1797565  0.17173184 0.17306799 0.17489107]
 [0.20834409 0.21621408 0.21230769 0.20804054 0.2092526 ]
 [0.08751184 0.08935846 0.09209416 0.08880714 0.09012706]
 [0.19447923 0.18906218 0.19782667 0.19525976 0.19456884]
 [0.18214373 0.1608934  0.17110634 0.17027956 0.16511019]]


Training policy:  34%|███▍      | 34/100 [22:39<43:05, 39.17s/it]

Step: 33
	Average reward: 0.006649913731962442
	Q-table update:
[[0.18958934 0.1797565  0.17173184 0.17306799 0.17489107]
 [0.3088516  0.35137588 0.33012718 0.32708678 0.2982433 ]
 [0.08751184 0.08935846 0.09209416 0.08880714 0.09012706]
 [0.19447923 0.18906218 0.19782667 0.19525976 0.19456884]
 [0.18214373 0.1608934  0.17110634 0.17027956 0.16511019]]


Training policy:  35%|███▌      | 35/100 [23:17<42:16, 39.02s/it]

Step: 34
	Average reward: 0.0001893282460514456
	Q-table update:
[[0.18958934 0.1797565  0.17173184 0.17306799 0.17489107]
 [0.3088516  0.35137588 0.33012718 0.32708678 0.2982433 ]
 [0.12720954 0.12407299 0.15553534 0.12517709 0.12591794]
 [0.18581228 0.19180621 0.18345405 0.18859182 0.18579899]
 [0.18214373 0.1608934  0.17110634 0.17027956 0.16511019]]


Training policy:  36%|███▌      | 36/100 [23:56<41:34, 38.98s/it]

Step: 35
	Average reward: -0.00753745436668396
	Q-table update:
[[0.18958934 0.1797565  0.17173184 0.17306799 0.17489107]
 [0.3088516  0.35137588 0.33012718 0.32708678 0.2982433 ]
 [0.12720954 0.12407299 0.15553534 0.12517709 0.12591794]
 [0.17208701 0.1684505  0.16873176 0.17621373 0.16760415]
 [0.18214373 0.1608934  0.17110634 0.17027956 0.16511019]]


Training policy:  37%|███▋      | 37/100 [24:35<40:56, 38.99s/it]

Step: 36
	Average reward: 0.005220227409154177
	Q-table update:
[[0.20080237 0.20695147 0.18730511 0.18922806 0.18181123]
 [0.3019737  0.33672312 0.33012718 0.33082235 0.29951283]
 [0.13322468 0.13147849 0.16424426 0.13173138 0.12895554]
 [0.17208701 0.1684505  0.16873176 0.17621373 0.16760415]
 [0.2979255  0.25324705 0.24827494 0.24050325 0.2456802 ]]


Training policy:  38%|███▊      | 38/100 [25:14<40:22, 39.07s/it]

Step: 37
	Average reward: 0.0028857358265668154
	Q-table update:
[[0.20080237 0.20695147 0.18730511 0.18922806 0.18181123]
 [0.3019737  0.33672312 0.33012718 0.33082235 0.29951283]
 [0.13322468 0.13147849 0.16424426 0.13173138 0.12895554]
 [0.17986804 0.175911   0.18292166 0.2115347  0.18125395]
 [0.2947838  0.30349952 0.30743238 0.28539926 0.30402133]]


Training policy:  39%|███▉      | 39/100 [25:53<39:42, 39.06s/it]

Step: 38
	Average reward: 0.004032460041344166
	Q-table update:
[[0.20080237 0.20695147 0.18730511 0.18922806 0.18181123]
 [0.36242118 0.35481563 0.36680573 0.3719289  0.35498884]
 [0.17687194 0.17774865 0.19199252 0.17077512 0.17658865]
 [0.17986804 0.175911   0.18292166 0.2115347  0.18125395]
 [0.2947838  0.30349952 0.30743238 0.28539926 0.30402133]]


Training policy:  40%|████      | 40/100 [26:32<38:45, 38.76s/it]

Step: 39
	Average reward: -0.005889091640710831
	Q-table update:
[[0.20080237 0.20695147 0.18730511 0.18922806 0.18181123]
 [0.36242118 0.35481563 0.36680573 0.3719289  0.35498884]
 [0.17687194 0.17774865 0.19199252 0.17077512 0.17658865]
 [0.17986804 0.175911   0.18292166 0.2115347  0.18125395]
 [0.2760013  0.27991015 0.27880773 0.2850214  0.27997983]]


Training policy:  41%|████      | 41/100 [27:12<38:28, 39.13s/it]

Step: 40
	Average reward: -0.0034316470846533775
	Q-table update:
[[0.24069837 0.2737472  0.24176843 0.23637566 0.23999828]
 [0.32642925 0.34058514 0.33934388 0.3342129  0.3322997 ]
 [0.17687194 0.17774865 0.19199252 0.17077512 0.17658865]
 [0.17986804 0.175911   0.18292166 0.2115347  0.18125395]
 [0.2760013  0.27991015 0.27880773 0.2850214  0.27997983]]


Training policy:  42%|████▏     | 42/100 [27:50<37:43, 39.02s/it]

Step: 41
	Average reward: 0.007646812126040459
	Q-table update:
[[0.24069837 0.2737472  0.24176843 0.23637566 0.23999828]
 [0.32642925 0.34058514 0.33934388 0.3342129  0.3322997 ]
 [0.28840542 0.29730305 0.37533188 0.31032416 0.33522618]
 [0.19209377 0.1942128  0.20231229 0.23495544 0.19215739]
 [0.2760013  0.27991015 0.27880773 0.2850214  0.27997983]]


Training policy:  43%|████▎     | 43/100 [28:30<37:10, 39.13s/it]

Step: 42
	Average reward: 0.004732043948024511
	Q-table update:
[[0.24768798 0.26550448 0.24444592 0.24600327 0.24436596]
 [0.32642925 0.34058514 0.33934388 0.3342129  0.3322997 ]
 [0.28840542 0.29730305 0.37533188 0.31032416 0.33522618]
 [0.28534377 0.2803788  0.2948385  0.3238661  0.28728366]
 [0.28013802 0.28172475 0.28362727 0.27859154 0.28129074]]


Training policy:  44%|████▍     | 44/100 [29:09<36:27, 39.07s/it]

Step: 43
	Average reward: 0.005423384252935648
	Q-table update:
[[0.24768798 0.26550448 0.24444592 0.24600327 0.24436596]
 [0.34860238 0.37224248 0.35117838 0.3475664  0.35142094]
 [0.37711143 0.4084314  0.41846833 0.415772   0.41147026]
 [0.28842506 0.28440142 0.30317214 0.317014   0.29044527]
 [0.2800618  0.2814488  0.27746734 0.2785324  0.28129074]]


Training policy:  45%|████▌     | 45/100 [29:47<35:43, 38.96s/it]

Step: 44
	Average reward: -0.005317110102623701
	Q-table update:
[[0.24768798 0.26550448 0.24444592 0.24600327 0.24436596]
 [0.34860238 0.37224248 0.35117838 0.3475664  0.35142094]
 [0.37711143 0.4084314  0.41846833 0.415772   0.41147026]
 [0.28842506 0.28440142 0.30317214 0.317014   0.29044527]
 [0.26052213 0.25419536 0.26180464 0.25962472 0.25987267]]


Training policy:  46%|████▌     | 46/100 [30:26<35:03, 38.96s/it]

Step: 45
	Average reward: 0.0028347379993647337
	Q-table update:
[[0.31722885 0.31300595 0.31523874 0.3106723  0.30783027]
 [0.34860238 0.35099775 0.3542951  0.3519186  0.3532151 ]
 [0.37711143 0.4084314  0.41846833 0.415772   0.41147026]
 [0.28842506 0.28440142 0.30317214 0.317014   0.29044527]
 [0.25817782 0.25031987 0.25274926 0.25599983 0.25323185]]


Training policy:  47%|████▋     | 47/100 [31:05<34:24, 38.95s/it]

Step: 46
	Average reward: 0.007006984669715166
	Q-table update:
[[0.31722885 0.31300595 0.31523874 0.3106723  0.30783027]
 [0.34860238 0.35099775 0.3542951  0.3519186  0.3532151 ]
 [0.40740362 0.42336315 0.4348642  0.41724226 0.42054546]
 [0.3815699  0.38970828 0.3599343  0.4230951  0.35625666]
 [0.25817782 0.25031987 0.25274926 0.25599983 0.25323185]]


Training policy:  48%|████▊     | 48/100 [31:44<33:39, 38.84s/it]

Step: 47
	Average reward: 0.00407676724717021
	Q-table update:
[[0.31722885 0.31300595 0.31523874 0.3106723  0.30783027]
 [0.34860238 0.35099775 0.3542951  0.3519186  0.3532151 ]
 [0.44709653 0.44788352 0.44482306 0.43451074 0.44158626]
 [0.41519004 0.40626276 0.41561517 0.4290278  0.40452927]
 [0.25019512 0.24822354 0.25229716 0.25390792 0.25323185]]


Training policy:  49%|████▉     | 49/100 [32:26<33:46, 39.74s/it]

Step: 48
	Average reward: 0.0005903989658690989
	Q-table update:
[[0.30471036 0.30622655 0.30699638 0.30957016 0.302968  ]
 [0.34860238 0.35099775 0.3542951  0.3519186  0.3532151 ]
 [0.44709653 0.44788352 0.44482306 0.43451074 0.44158626]
 [0.41519004 0.40626276 0.41561517 0.4290278  0.40452927]
 [0.26738596 0.26959124 0.26821092 0.28792724 0.26226655]]


Training policy:  50%|█████     | 50/100 [33:06<33:09, 39.79s/it]

Step: 49
	Average reward: -0.00022781970619689673
	Q-table update:
[[0.30471036 0.30622655 0.30699638 0.30957016 0.302968  ]
 [0.34860238 0.35099775 0.3542951  0.3519186  0.3532151 ]
 [0.43419543 0.43848202 0.4361925  0.43687907 0.43289807]
 [0.41329324 0.41157225 0.41451982 0.41163304 0.40987822]
 [0.26738596 0.26959124 0.26821092 0.28792724 0.26226655]]


Training policy:  51%|█████     | 51/100 [33:45<32:24, 39.68s/it]

Step: 50
	Average reward: -0.0011519936379045248
	Q-table update:
[[0.30471036 0.30622655 0.30699638 0.30957016 0.302968  ]
 [0.34860238 0.35099775 0.3542951  0.3519186  0.3532151 ]
 [0.4319166  0.4214476  0.4255321  0.42451456 0.4283959 ]
 [0.40414998 0.40436462 0.39734814 0.40328768 0.40372425]
 [0.26738596 0.26959124 0.26821092 0.28792724 0.26226655]]


Training policy:  52%|█████▏    | 52/100 [34:24<31:33, 39.44s/it]

Step: 51
	Average reward: -0.0061341519467532635
	Q-table update:
[[0.2891883  0.2920705  0.28971618 0.283791   0.28887776]
 [0.34860238 0.35099775 0.3542951  0.3519186  0.3532151 ]
 [0.4319166  0.4214476  0.4255321  0.42451456 0.4283959 ]
 [0.40414998 0.40436462 0.39734814 0.40328768 0.40372425]
 [0.26738596 0.26959124 0.26821092 0.28792724 0.26226655]]


Training policy:  53%|█████▎    | 53/100 [35:02<30:40, 39.16s/it]

Step: 52
	Average reward: -0.00449570594355464
	Q-table update:
[[0.2891883  0.2920705  0.28971618 0.283791   0.28887776]
 [0.34860238 0.35099775 0.3542951  0.3519186  0.3532151 ]
 [0.4319166  0.4214476  0.4255321  0.42451456 0.4283959 ]
 [0.37156522 0.37658623 0.37861225 0.37436095 0.38027143]
 [0.26738596 0.26959124 0.26821092 0.28792724 0.26226655]]


Training policy:  54%|█████▍    | 54/100 [35:41<29:57, 39.09s/it]

Step: 53
	Average reward: 0.002961383666843176
	Q-table update:
[[0.28994778 0.30406827 0.29313698 0.29116583 0.29246855]
 [0.34860238 0.35099775 0.3542951  0.3519186  0.3532151 ]
 [0.4319166  0.4214476  0.4255321  0.42451456 0.4283959 ]
 [0.37156522 0.37658623 0.37861225 0.37436095 0.38027143]
 [0.28869203 0.28951073 0.28245714 0.32650328 0.27498445]]


Training policy:  55%|█████▌    | 55/100 [36:20<29:15, 39.02s/it]

Step: 54
	Average reward: 0.003552706679329276
	Q-table update:
[[0.34172636 0.34971654 0.34459025 0.33404827 0.33588168]
 [0.34565336 0.3489152  0.34444565 0.35145333 0.3471934 ]
 [0.4319166  0.4214476  0.4255321  0.42451456 0.4283959 ]
 [0.37156522 0.37658623 0.37861225 0.37436095 0.38027143]
 [0.28869203 0.28951073 0.28245714 0.32650328 0.27498445]]


Training policy:  56%|█████▌    | 56/100 [36:59<28:33, 38.95s/it]

Step: 55
	Average reward: 0.00015593435091432184
	Q-table update:
[[0.33759746 0.33676484 0.33708242 0.33783928 0.34071544]
 [0.34565336 0.3489152  0.34444565 0.35145333 0.3471934 ]
 [0.4319166  0.4214476  0.4255321  0.42451456 0.4283959 ]
 [0.37156522 0.37658623 0.37861225 0.37436095 0.38027143]
 [0.30301815 0.30245298 0.29480326 0.30869415 0.28988513]]


Training policy:  57%|█████▋    | 57/100 [37:39<28:15, 39.44s/it]

Step: 56
	Average reward: -0.005345235578715801
	Q-table update:
[[0.32586733 0.32317388 0.3290071  0.3249215  0.32007268]
 [0.3357831  0.33951345 0.33665186 0.32442108 0.33770517]
 [0.4319166  0.4214476  0.4255321  0.42451456 0.4283959 ]
 [0.37156522 0.37658623 0.37861225 0.37436095 0.38027143]
 [0.30301815 0.30245298 0.29480326 0.30869415 0.28988513]]


Training policy:  58%|█████▊    | 58/100 [38:18<27:30, 39.29s/it]

Step: 57
	Average reward: 0.002622096799314022
	Q-table update:
[[0.32586733 0.32317388 0.3290071  0.3249215  0.32007268]
 [0.3357831  0.33951345 0.33665186 0.32442108 0.33770517]
 [0.42037904 0.4239634  0.4255535  0.42569974 0.4258251 ]
 [0.3851701  0.3839403  0.38631064 0.3883738  0.4041686 ]
 [0.30301815 0.30245298 0.29480326 0.30869415 0.28988513]]


Training policy:  59%|█████▉    | 59/100 [38:58<26:56, 39.42s/it]

Step: 58
	Average reward: 0.0012748929439112544
	Q-table update:
[[0.32586733 0.32317388 0.3290071  0.3249215  0.32007268]
 [0.3508644  0.35705027 0.35048997 0.34376684 0.3483711 ]
 [0.41972238 0.41253224 0.41680783 0.4024214  0.42556307]
 [0.3851701  0.3839403  0.38631064 0.3883738  0.4041686 ]
 [0.30301815 0.30245298 0.29480326 0.30869415 0.28988513]]


Training policy:  60%|██████    | 60/100 [39:37<26:11, 39.29s/it]

Step: 59
	Average reward: 0.004772119224071503
	Q-table update:
[[0.32586733 0.32317388 0.3290071  0.3249215  0.32007268]
 [0.39793837 0.42214838 0.41221544 0.407312   0.41490865]
 [0.41753888 0.41466528 0.4166236  0.40586835 0.40606678]
 [0.3851701  0.3839403  0.38631064 0.3883738  0.4041686 ]
 [0.30301815 0.30245298 0.29480326 0.30869415 0.28988513]]


Training policy:  61%|██████    | 61/100 [40:17<25:41, 39.52s/it]

Step: 60
	Average reward: 0.0019430489046499133
	Q-table update:
[[0.32586733 0.32317388 0.3290071  0.3249215  0.32007268]
 [0.45199376 0.43608665 0.45055553 0.4573483  0.4578321 ]
 [0.40403506 0.41088727 0.4137161  0.4062247  0.4029217 ]
 [0.3851701  0.3839403  0.38631064 0.3883738  0.4041686 ]
 [0.30301815 0.30245298 0.29480326 0.30869415 0.28988513]]


Training policy:  62%|██████▏   | 62/100 [40:56<24:54, 39.34s/it]

Step: 61
	Average reward: -0.005317156668752432
	Q-table update:
[[0.32586733 0.32317388 0.3290071  0.3249215  0.32007268]
 [0.42855865 0.4255189  0.4298708  0.43194807 0.4219426 ]
 [0.40403506 0.41088727 0.4137161  0.4062247  0.4029217 ]
 [0.3851701  0.3839403  0.38631064 0.3883738  0.4041686 ]
 [0.30301815 0.30245298 0.29480326 0.30869415 0.28988513]]


Training policy:  63%|██████▎   | 63/100 [41:36<24:23, 39.55s/it]

Step: 62
	Average reward: 0.0028814272955060005
	Q-table update:
[[0.32586733 0.32317388 0.3290071  0.3249215  0.32007268]
 [0.42855865 0.4255189  0.4298708  0.43194807 0.4219426 ]
 [0.4224081  0.4314175  0.4295639  0.4284682  0.4247796 ]
 [0.39810362 0.3915343  0.39255568 0.39025605 0.3930333 ]
 [0.30301815 0.30245298 0.29480326 0.30869415 0.28988513]]


Training policy:  64%|██████▍   | 64/100 [42:15<23:34, 39.28s/it]

Step: 63
	Average reward: -0.004222115036100149
	Q-table update:
[[0.32586733 0.32317388 0.3290071  0.3249215  0.32007268]
 [0.42855865 0.4255189  0.4298708  0.43194807 0.4219426 ]
 [0.4224081  0.4314175  0.4295639  0.4284682  0.4247796 ]
 [0.363247   0.37159878 0.372323   0.36927128 0.3710271 ]
 [0.30301815 0.30245298 0.29578504 0.3162513  0.28988513]]


Training policy:  65%|██████▌   | 65/100 [42:54<22:53, 39.25s/it]

Step: 64
	Average reward: -0.0050420272164046764
	Q-table update:
[[0.31480053 0.3140255  0.3107951  0.31452906 0.3129348 ]
 [0.42855865 0.4255189  0.4298708  0.43194807 0.4219426 ]
 [0.4224081  0.4314175  0.4295639  0.4284682  0.4247796 ]
 [0.363247   0.37159878 0.372323   0.36927128 0.3710271 ]
 [0.2973416  0.29884282 0.29310644 0.28408438 0.28708565]]


Training policy:  66%|██████▌   | 66/100 [43:34<22:26, 39.59s/it]

Step: 65
	Average reward: -0.002768634120002389
	Q-table update:
[[0.35956863 0.32830155 0.33508086 0.32602754 0.33343184]
 [0.41012603 0.40686595 0.41124514 0.40687066 0.4125588 ]
 [0.4224081  0.4314175  0.4295639  0.4284682  0.4247796 ]
 [0.363247   0.37159878 0.372323   0.36927128 0.3710271 ]
 [0.2973416  0.29884282 0.29310644 0.28408438 0.28708565]]


Training policy:  67%|██████▋   | 67/100 [44:14<21:45, 39.57s/it]

Step: 66
	Average reward: -0.0017023856053128839
	Q-table update:
[[0.39046815 0.36030498 0.36704296 0.36631238 0.36666223]
 [0.39770505 0.3951489  0.39489087 0.39168218 0.38362902]
 [0.4224081  0.4314175  0.4295639  0.4284682  0.4247796 ]
 [0.363247   0.37159878 0.372323   0.36927128 0.3710271 ]
 [0.2973416  0.29884282 0.29310644 0.28408438 0.28708565]]


Training policy:  68%|██████▊   | 68/100 [44:54<21:07, 39.60s/it]

Step: 67
	Average reward: -0.0033574458211660385
	Q-table update:
[[0.39046815 0.36030498 0.36704296 0.36631238 0.36666223]
 [0.39770505 0.3951489  0.39489087 0.39168218 0.38362902]
 [0.42101324 0.40913403 0.42328238 0.42296934 0.41618326]
 [0.35336587 0.3546008  0.35005432 0.35229993 0.35528642]
 [0.2973416  0.29884282 0.29310644 0.28408438 0.28708565]]


Training policy:  69%|██████▉   | 69/100 [45:33<20:26, 39.57s/it]

Step: 68
	Average reward: 0.005160314496606588
	Q-table update:
[[0.39046815 0.36030498 0.36704296 0.36631238 0.36666223]
 [0.41004398 0.3972308  0.3951177  0.39428848 0.39513853]
 [0.45627895 0.45296666 0.4659666  0.45990124 0.461175  ]
 [0.35336587 0.3546008  0.35005432 0.35229993 0.35528642]
 [0.2973416  0.29884282 0.29310644 0.28408438 0.28708565]]


Training policy:  70%|███████   | 70/100 [46:12<19:39, 39.31s/it]

Step: 69
	Average reward: 0.000916363496799022
	Q-table update:
[[0.41859204 0.40080932 0.38752246 0.39338484 0.3983469 ]
 [0.38877985 0.38967368 0.38819733 0.39352953 0.3954831 ]
 [0.45627895 0.45296666 0.4659666  0.45990124 0.461175  ]
 [0.35336587 0.3546008  0.35005432 0.35229993 0.35528642]
 [0.2973416  0.29884282 0.29310644 0.28408438 0.28708565]]


Training policy:  71%|███████   | 71/100 [46:52<19:06, 39.53s/it]

Step: 70
	Average reward: 0.0014923326671123505
	Q-table update:
[[0.41859204 0.40080932 0.38752246 0.39338484 0.3983469 ]
 [0.38877985 0.38967368 0.38819733 0.39352953 0.3954831 ]
 [0.45973888 0.45524898 0.4586439  0.44569588 0.44896096]
 [0.36083037 0.3590913  0.3711298  0.36810043 0.38972872]
 [0.2973416  0.29884282 0.29310644 0.28408438 0.28708565]]


Training policy:  72%|███████▏  | 72/100 [47:30<18:17, 39.18s/it]

Step: 71
	Average reward: -0.0074232700280845165
	Q-table update:
[[0.41859204 0.40080932 0.38752246 0.39338484 0.3983469 ]
 [0.38877985 0.38967368 0.38819733 0.39352953 0.3954831 ]
 [0.42116606 0.4254595  0.42967653 0.42697477 0.4238818 ]
 [0.36083037 0.3590913  0.3711298  0.36810043 0.38972872]
 [0.2973416  0.29884282 0.29310644 0.28408438 0.28708565]]


Training policy:  73%|███████▎  | 73/100 [48:10<17:44, 39.44s/it]

Step: 72
	Average reward: -0.003971870988607407
	Q-table update:
[[0.39275527 0.40080932 0.38752246 0.39391923 0.3983469 ]
 [0.38877985 0.38967368 0.38819733 0.39352953 0.3954831 ]
 [0.42116606 0.4254595  0.42967653 0.42697477 0.4238818 ]
 [0.36083037 0.3590913  0.3711298  0.36810043 0.38972872]
 [0.27734804 0.277459   0.28043708 0.27656767 0.27848348]]


Training policy:  74%|███████▍  | 74/100 [48:49<17:02, 39.35s/it]

Step: 73
	Average reward: 0.0058660260401666164
	Q-table update:
[[0.3950003  0.3981844  0.39245865 0.39487892 0.3982358 ]
 [0.38878876 0.38967368 0.3879581  0.39352953 0.39439797]
 [0.46636268 0.4588925  0.47934452 0.46839395 0.4575092 ]
 [0.37009782 0.36227882 0.37573585 0.37237996 0.39662245]
 [0.27734804 0.277459   0.28043708 0.27656767 0.27848348]]


Training policy:  75%|███████▌  | 75/100 [49:28<16:20, 39.20s/it]

Step: 74
	Average reward: -0.007269200868904591
	Q-table update:
[[0.38005102 0.39112595 0.3744296  0.37921572 0.3359248 ]
 [0.38878876 0.38967368 0.3879581  0.39352953 0.39439797]
 [0.46636268 0.4588925  0.47934452 0.46839395 0.4575092 ]
 [0.37009782 0.36227882 0.37573585 0.37237996 0.39662245]
 [0.27050534 0.28664258 0.27570498 0.2675532  0.2737268 ]]


Training policy:  76%|███████▌  | 76/100 [50:08<15:45, 39.38s/it]

Step: 75
	Average reward: -0.007362614385783672
	Q-table update:
[[0.38005102 0.39112595 0.3744296  0.37921572 0.3359248 ]
 [0.38878876 0.38967368 0.3879581  0.39352953 0.39439797]
 [0.46636268 0.4588925  0.47934452 0.46839395 0.4575092 ]
 [0.35657322 0.3578621  0.3600675  0.34841016 0.33459294]
 [0.27806276 0.29256654 0.28050664 0.2755229  0.2803678 ]]


Training policy:  77%|███████▋  | 77/100 [50:46<14:59, 39.10s/it]

Step: 76
	Average reward: 0.004395902156829834
	Q-table update:
[[0.38005102 0.39112595 0.3744296  0.37921572 0.3359248 ]
 [0.38878876 0.38967368 0.3879581  0.39352953 0.39439797]
 [0.48827845 0.5148167  0.515145   0.51196843 0.49659574]
 [0.35562053 0.35142055 0.3479451  0.3468579  0.34454873]
 [0.27806276 0.29256654 0.28050664 0.2755229  0.2803678 ]]


Training policy:  78%|███████▊  | 78/100 [51:26<14:23, 39.26s/it]

Step: 77
	Average reward: -0.006391280796378851
	Q-table update:
[[0.38005102 0.39112595 0.3744296  0.37921572 0.3359248 ]
 [0.38878876 0.38967368 0.3879581  0.39352953 0.39439797]
 [0.48877093 0.49848324 0.5144024  0.49410725 0.49742514]
 [0.39852852 0.35575595 0.37067774 0.35939294 0.35496184]
 [0.27806276 0.29256654 0.28050664 0.2755229  0.2803678 ]]


Training policy:  79%|███████▉  | 79/100 [52:05<13:40, 39.06s/it]

Step: 78
	Average reward: -0.006969363894313574
	Q-table update:
[[0.38005102 0.39112595 0.3744296  0.37921572 0.3359248 ]
 [0.37187883 0.36937767 0.36602813 0.37441012 0.3662716 ]
 [0.48877093 0.49848324 0.5144024  0.49410725 0.49742514]
 [0.39852852 0.35575595 0.37067774 0.35939294 0.35496184]
 [0.27806276 0.29256654 0.28050664 0.2755229  0.2803678 ]]


Training policy:  80%|████████  | 80/100 [52:44<13:02, 39.10s/it]

Step: 79
	Average reward: -0.004165532998740673
	Q-table update:
[[0.38005102 0.39112595 0.3744296  0.37921572 0.3359248 ]
 [0.37187883 0.36937767 0.36602813 0.37441012 0.3662716 ]
 [0.48877093 0.49848324 0.5144024  0.49410725 0.49742514]
 [0.3779928  0.35575595 0.37260997 0.35172433 0.35496184]
 [0.27252796 0.28356794 0.27435336 0.27328852 0.2903776 ]]


Training policy:  81%|████████  | 81/100 [53:23<12:21, 39.01s/it]

Step: 80
	Average reward: 0.0013138712383806705
	Q-table update:
[[0.38005102 0.39112595 0.3744296  0.37921572 0.3359248 ]
 [0.37187883 0.36937767 0.36602813 0.37441012 0.3662716 ]
 [0.48877093 0.49848324 0.5144024  0.49410725 0.49742514]
 [0.35786816 0.3600629  0.3627418  0.3595643  0.36038366]
 [0.3029437  0.32032055 0.3168468  0.32479915 0.35223967]]


Training policy:  82%|████████▏ | 82/100 [54:01<11:36, 38.72s/it]

Step: 81
	Average reward: 0.002260736655443907
	Q-table update:
[[0.38005102 0.39112595 0.3744296  0.37921572 0.3359248 ]
 [0.37187883 0.36937767 0.36602813 0.37441012 0.3662716 ]
 [0.5209215  0.528974   0.51181793 0.52526134 0.5254208 ]
 [0.35251886 0.35348657 0.34537253 0.35314247 0.35214144]
 [0.3029437  0.32032055 0.3168468  0.32479915 0.35223967]]


Training policy:  83%|████████▎ | 83/100 [54:40<11:00, 38.83s/it]

Step: 82
	Average reward: 0.004625315312296152
	Q-table update:
[[0.37688905 0.3716567  0.3744296  0.37845886 0.33958215]
 [0.37187883 0.36937767 0.36602813 0.37441012 0.3662716 ]
 [0.5209215  0.528974   0.51181793 0.52526134 0.5254208 ]
 [0.40070483 0.4025694  0.40194842 0.4096219  0.40019718]
 [0.3029437  0.3433986  0.33866683 0.34270224 0.3610471 ]]


Training policy:  84%|████████▍ | 84/100 [55:20<10:26, 39.17s/it]

Step: 83
	Average reward: -0.0015424201264977455
	Q-table update:
[[0.37688905 0.3716567  0.3744296  0.37845886 0.33958215]
 [0.37187883 0.36937767 0.36602813 0.37441012 0.3662716 ]
 [0.5143477  0.48615697 0.51372045 0.5154233  0.51450187]
 [0.41615948 0.4243296  0.4106158  0.42759848 0.40142316]
 [0.3029437  0.3433986  0.33866683 0.34270224 0.3610471 ]]


Training policy:  85%|████████▌ | 85/100 [55:58<09:44, 38.95s/it]

Step: 84
	Average reward: 0.006125676911324263
	Q-table update:
[[0.37688905 0.3716567  0.3744296  0.37845886 0.33958215]
 [0.37187883 0.36937767 0.36602813 0.37441012 0.3662716 ]
 [0.5143477  0.48615697 0.51372045 0.5154233  0.51450187]
 [0.4544087  0.467985   0.46376    0.47357845 0.44957024]
 [0.330633   0.36550874 0.36303183 0.37243602 0.3912459 ]]


Training policy:  86%|████████▌ | 86/100 [56:37<09:04, 38.89s/it]

Step: 85
	Average reward: 0.0018689049175009131
	Q-table update:
[[0.37688905 0.3716567  0.3744296  0.37845886 0.33958215]
 [0.38669604 0.38367063 0.38198304 0.3950279  0.377846  ]
 [0.49648285 0.50233585 0.5090139  0.4870038  0.51100355]
 [0.4544087  0.467985   0.46376    0.47357845 0.44957024]
 [0.330633   0.36550874 0.36303183 0.37243602 0.3912459 ]]


Training policy:  87%|████████▋ | 87/100 [57:16<08:24, 38.83s/it]

Step: 86
	Average reward: -0.007169043179601431
	Q-table update:
[[0.3658844  0.35919526 0.3600662  0.36062783 0.3595247 ]
 [0.37159473 0.37328473 0.3727797  0.34861115 0.3643028 ]
 [0.49648285 0.50233585 0.5090139  0.4870038  0.51100355]
 [0.4544087  0.467985   0.46376    0.47357845 0.44957024]
 [0.330633   0.36550874 0.36303183 0.37243602 0.3912459 ]]


Training policy:  88%|████████▊ | 88/100 [57:55<07:47, 38.96s/it]

Step: 87
	Average reward: 0.0032118242233991623
	Q-table update:
[[0.3658844  0.35919526 0.3600662  0.36062783 0.3595247 ]
 [0.40552056 0.43195656 0.4026612  0.40890846 0.41212466]
 [0.50010735 0.5031447  0.5034331  0.4958728  0.4829578 ]
 [0.4544087  0.467985   0.46376    0.47357845 0.44957024]
 [0.330633   0.36550874 0.36303183 0.37243602 0.3912459 ]]


Training policy:  89%|████████▉ | 89/100 [58:34<07:10, 39.14s/it]

Step: 88
	Average reward: 0.005414024461060762
	Q-table update:
[[0.42922637 0.37061018 0.37250227 0.36452094 0.3595247 ]
 [0.4526106  0.45926303 0.45411694 0.4571623  0.4493991 ]
 [0.4963257  0.4903331  0.49736938 0.496231   0.4889394 ]
 [0.4544087  0.467985   0.46376    0.47357845 0.44957024]
 [0.330633   0.36550874 0.36303183 0.37243602 0.3912459 ]]


Training policy:  90%|█████████ | 90/100 [59:15<06:35, 39.54s/it]

Step: 89
	Average reward: 0.0053351703099906445
	Q-table update:
[[0.42922637 0.37061018 0.37250227 0.36452094 0.3595247 ]
 [0.4526106  0.45926303 0.45411694 0.4571623  0.4493991 ]
 [0.4925118  0.48953137 0.49064088 0.49523827 0.4879442 ]
 [0.47201172 0.47336036 0.47666103 0.47242606 0.47728214]
 [0.41737363 0.41594353 0.42993617 0.41677275 0.45581502]]


Training policy:  91%|█████████ | 91/100 [59:55<05:55, 39.55s/it]

Step: 90
	Average reward: 0.0023727109655737877
	Q-table update:
[[0.42922637 0.37061018 0.37250227 0.36452094 0.3595247 ]
 [0.45236558 0.4523109  0.45687526 0.45530197 0.45265672]
 [0.49523312 0.48793674 0.493811   0.48312482 0.4958569 ]
 [0.47201172 0.47336036 0.47666103 0.47242606 0.47728214]
 [0.41737363 0.41594353 0.42993617 0.41677275 0.45581502]]


Training policy:  92%|█████████▏| 92/100 [1:00:34<05:15, 39.48s/it]

Step: 91
	Average reward: -0.0025000658351927996
	Q-table update:
[[0.42922637 0.37061018 0.37250227 0.36452094 0.3595247 ]
 [0.45236558 0.4523109  0.45687526 0.45530197 0.45265672]
 [0.47251827 0.4763542  0.47330037 0.47457612 0.46669334]
 [0.4729684  0.47753543 0.47567636 0.4748638  0.47042456]
 [0.41737363 0.41594353 0.42993617 0.41677275 0.45581502]]


Training policy:  93%|█████████▎| 93/100 [1:01:14<04:36, 39.55s/it]

Step: 92
	Average reward: -0.001179343438707292
	Q-table update:
[[0.39644617 0.39375326 0.39568946 0.38845742 0.38896725]
 [0.44396445 0.4457621  0.43684295 0.4439439  0.44749317]
 [0.47251827 0.4763542  0.47330037 0.47457612 0.46669334]
 [0.4729684  0.47753543 0.47567636 0.4748638  0.47042456]
 [0.41737363 0.41594353 0.42993617 0.41677275 0.45581502]]


Training policy:  94%|█████████▍| 94/100 [1:01:53<03:56, 39.39s/it]

Step: 93
	Average reward: -0.003873314941301942
	Q-table update:
[[0.39644617 0.39375326 0.39568946 0.38845742 0.38896725]
 [0.44396445 0.4457621  0.43684295 0.4439439  0.44749317]
 [0.47051397 0.4603091  0.47126442 0.47105718 0.46565756]
 [0.45525923 0.45250976 0.4571885  0.46066356 0.4578864 ]
 [0.41737363 0.41594353 0.42993617 0.41677275 0.45581502]]


Training policy:  95%|█████████▌| 95/100 [1:02:31<03:16, 39.21s/it]

Step: 94
	Average reward: 0.0031652431935071945
	Q-table update:
[[0.45195076 0.44369274 0.44429627 0.42994344 0.4471741 ]
 [0.43553796 0.4420922  0.43791857 0.4348075  0.42983845]
 [0.47051397 0.4603091  0.47126442 0.47105718 0.46565756]
 [0.45525923 0.45250976 0.4571885  0.46066356 0.4578864 ]
 [0.41737363 0.41594353 0.42993617 0.41677275 0.45581502]]


Training policy:  96%|█████████▌| 96/100 [1:03:12<02:38, 39.70s/it]

Step: 95
	Average reward: 0.0005812917370349169
	Q-table update:
[[0.45195076 0.44369274 0.44429627 0.42994344 0.4471741 ]
 [0.44040644 0.44925773 0.43728927 0.43742225 0.4376345 ]
 [0.46728057 0.46279496 0.45990553 0.46195138 0.4586084 ]
 [0.45525923 0.45250976 0.4571885  0.46066356 0.4578864 ]
 [0.41737363 0.41594353 0.42993617 0.41677275 0.45581502]]


Training policy:  97%|█████████▋| 97/100 [1:03:53<01:59, 39.88s/it]

Step: 96
	Average reward: 0.0062070623971521854
	Q-table update:
[[0.49323377 0.46120885 0.468885   0.46584553 0.47775644]
 [0.44040644 0.44925773 0.43728927 0.43742225 0.4376345 ]
 [0.46728057 0.46279496 0.45990553 0.46195138 0.4586084 ]
 [0.45525923 0.45250976 0.4571885  0.45848954 0.4578864 ]
 [0.47368735 0.4758983  0.4772976  0.4779776  0.4877939 ]]


Training policy:  98%|█████████▊| 98/100 [1:04:34<01:20, 40.38s/it]

Step: 97
	Average reward: -0.000334176846081391
	Q-table update:
[[0.49323377 0.46120885 0.468885   0.46584553 0.47775644]
 [0.44040644 0.44925773 0.43728927 0.43742225 0.4376345 ]
 [0.45080128 0.45646632 0.45291328 0.45829156 0.45737243]
 [0.45369908 0.45263687 0.45553637 0.44919503 0.45103842]
 [0.47368735 0.4758983  0.4772976  0.4779776  0.4877939 ]]


Training policy:  99%|█████████▉| 99/100 [1:05:13<00:39, 39.95s/it]

Step: 98
	Average reward: 0.004938620608299971
	Q-table update:
[[0.49702367 0.48946363 0.48671705 0.48609355 0.50109416]
 [0.4399662  0.43937615 0.43728927 0.43715373 0.43768367]
 [0.45080128 0.45646632 0.45291328 0.45829156 0.45737243]
 [0.45369908 0.45263687 0.45553637 0.44919503 0.45103842]
 [0.48910472 0.50179374 0.500435   0.49628878 0.50380254]]


Training policy: 100%|██████████| 100/100 [1:05:52<00:00, 39.52s/it]

Step: 99
	Average reward: 0.0015689806314185262
	Q-table update:
[[0.49524105 0.49048808 0.48732626 0.48682952 0.49110866]
 [0.43650088 0.44327947 0.44552    0.4405692  0.44126135]
 [0.45110252 0.45646632 0.45291328 0.45175323 0.45737243]
 [0.45369908 0.45263687 0.45553637 0.44919503 0.45103842]
 [0.48910472 0.50179374 0.500435   0.49628878 0.50380254]]





In [19]:
print(updated_policy_state.q_table - policy_state.q_table)

[[0.49524105 0.49048808 0.48732626 0.48682952 0.49110866]
 [0.43650088 0.44327947 0.44552    0.4405692  0.44126135]
 [0.45110252 0.45646632 0.45291328 0.45175323 0.45737243]
 [0.45369908 0.45263687 0.45553637 0.44919503 0.45103842]
 [0.48910472 0.50179374 0.500435   0.49628878 0.50380254]]


In [20]:
rng, episode_rng = jax.random.split(rng, 2)
visualize_episode(policy_state=updated_policy_state, rng=episode_rng)



0
This browser does not support the video tag.


### Exploit JAX: vectorize!

* Two approaches: train multiple agents OR train a single agent with faster data collection
    * Single agent -> width vs depth trade-off
* Todo: compare training times

In [72]:
print(updated_policy_state.q_table - policy_state.q_table)

[[-0.20057654 -0.20961711 -0.20375799 -0.20657158 -0.20181191]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]]


## Exercise
* Reduce the epsilon over time
* Make the controller decentralized
    * You'll need to make the states and actions arm specific
    * E.g. like this:
        * Actions = [leading arm, left rower, right rower]
                * Or: [amplitude 0, amplitude 0.5, amplitude 1]
        * State = [arm is closest to target, arm is on left axis, arm is on right axis]
        * You'll have to modify the CPG as well!
* Experience replay (keep data buffer)