# Introduction to training on the Rubik's Cube 
<a href="https://githubtocolab.com/adsodemelk/umoja23/blob/main/Rubik's%20Cube%20walkthrough%20colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab"/></a>

This notebook walks through examples of how to run training with the model as currently set up, and how to make modifications and explore your own solutions.

In [1]:
import os 
#os.chdir("/content/drive/MyDrive/umojahack-africa-2023-advanced-challenge")

In [2]:
# !git clone https://github.com/adsodemelk/umoja23/

In [3]:
import os
import sys
# Download the necessary files from a public bucket
#!gsutil cp -r gs://umoja23/ .

In [4]:
# Install requirements
#!pip install -Ur umoja23/requirements/requirements.txt

## Imports
Here we import the modules we are using.

**Note**: If you face issues importing numpy, try restarting the runtime (go to Runtime -> Restart runtime).

In [5]:
# Add installed package to the path
import os, sys 
sys.path.append(os.path.abspath("./umoja23"))


# Environment
from rubiks_cube.env import RubiksCube, create_flattened_env
import gym
import matplotlib.pyplot as plt

# Training

from umoja23.training.example import main_training
from umoja23.training.configs import CUSTOM_MODEL_CONFIG

# Evaluation

from ray.rllib.algorithms import AlgorithmConfig
from ray.rllib.algorithms.ppo import PPO
from umoja23.training.configs import get_config, MEDIUM_ENV_CONFIG
from umoja23.training.registry import register, _ray

from umoja23.evaluation.seeds import PUBLIC_SEEDS
from umoja23.evaluation.generate_rollout import main_rollout
from evaluation.validate_rollout import main_validation

%matplotlib inline




In [None]:
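import shutil
# Remove results from previous runs, if any, so that the latest run is easy to find
shutil.rmtree(os.path.expanduser("~/ray_results/PPO"), ignore_errors=True)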

## Introduction to the environment

The Rubik's Cube is likely a very familiar puzzle to everyone! In case you would like to read more about its history or some of the maths and mechanics behind it, the Wikipedia page is a good starting point: https://en.wikipedia.org/wiki/Rubik%27s_Cube

We present here an implementation in Python (NumPy) that allows you to explore how reinforcement learning can be applied to the problem.

First, let's instantiate the environment and walk through some of its basic mechanics. Feel free to consult rubiks_cube/env.py to see details of the implementation.

In [None]:
STEP_LIMIT, ITERS, NUM_SCRAMBLES_ON_RESET = 100, 100, 0

In [None]:
env = RubiksCube(step_limit=STEP_LIMIT, 
                 reward_function_type="sparse", 
                 num_scrambles_on_reset=NUM_SCRAMBLES_ON_RESET)

assert isinstance(env, gym.Env)

The observation space consists of

1: the cube, with ids indicating the colour of each sticker. There are 6 faces, each containing 3x3 stickers;

2: the step_count, which starts at 0 and increments by 1 on every turn taken, until the environment step_limit is hit (at which point the episode must end).
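To make the shape of the observation concrete, here is a minimal sketch that builds a toy observation by hand. It is not the environment's own code: the `make_solved_observation` helper is hypothetical, and both the assumption that a solved face `i` is filled with colour id `i` and the plain-integer `step_count` are illustrative guesses. Check `env.observation_space` for the real types.

```python
# Hypothetical stand-in for the observation; the real one comes from env.reset().
NUM_FACES, FACE_SIZE = 6, 3

def make_solved_observation():
    """Build a toy observation: a 6 x 3 x 3 grid of colour ids plus a step counter."""
    # Assumption: in the solved state every sticker on face i carries colour id i.
    cube = [[[face] * FACE_SIZE for _ in range(FACE_SIZE)] for face in range(NUM_FACES)]
    return {"cube": cube, "step_count": 0}

obs = make_solved_observation()
print(len(obs["cube"]), len(obs["cube"][0]), len(obs["cube"][0][0]))  # 6 3 3
print(obs["step_count"])  # 0
```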

The action space consists of a face to turn and an amount to turn it by. There are 6 possible faces and 3 possible amounts, corresponding to (in order) clockwise, anti-clockwise, and a half turn.

Convention:

0 = up face

1 = front face

2 = right face

3 = back face

4 = left face

5 = down face

All faces are read in reading order when looking directly at the face.

To look directly at the faces:

UP: LEFT face on the left and BACK face pointing up

FRONT: LEFT face on the left and UP face pointing up

RIGHT: FRONT face on the left and UP face pointing up

BACK: RIGHT face on the left and UP face pointing up

LEFT: BACK face on the left and UP face pointing up

DOWN: LEFT face on the left and FRONT face pointing up

Turn amounts (e.g. clockwise) are defined when looking directly at the face.
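When debugging, it can help to translate an action tuple back into words. The sketch below applies the face and amount ordering listed above; `describe_action` is a hypothetical helper, not part of the repository.

```python
# Orderings taken from the convention described above.
FACES = ("up", "front", "right", "back", "left", "down")
AMOUNTS = ("clockwise", "anti-clockwise", "half turn")

def describe_action(action):
    """Turn a (face, amount) tuple into a human-readable description."""
    face, amount = action
    return f"{FACES[face]} face, {AMOUNTS[amount]}"

print(describe_action((0, 0)))  # up face, clockwise
print(describe_action((5, 2)))  # down face, half turn
```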

In [None]:
print(f"Observation space: {env.observation_space}")
print(f"Action space: {env.action_space}")

In [None]:
# Render the environment

obs = env.reset()
env.render()

As mentioned, an action consists of specifying a face and an amount to turn it by. We can visualise the impact of an action by using the rendering above.

In [None]:
# Don't scramble the cube so it's easier to understand what's happening
env = RubiksCube(step_limit=STEP_LIMIT,
                 reward_function_type="sparse", 
                 num_scrambles_on_reset=0)
obs = env.reset()
env.render()

In [None]:
# Turn the UP face clockwise (by 90 degrees)
action = (0, 0)
obs, reward, done, info = env.step(action)
env.render()
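To build intuition for what a quarter turn does to the stickers on the turned face itself, here is a small sketch that rotates a single 3x3 grid. It assumes the face is stored in reading order, as in the convention above. It is not the environment's implementation, and it ignores the stickers on the four neighbouring faces, which also move in a real turn.

```python
def rotate_clockwise(face):
    """Rotate a square grid of stickers by 90 degrees clockwise."""
    return [list(row) for row in zip(*face[::-1])]

def rotate_anticlockwise(face):
    """Rotate a square grid of stickers by 90 degrees anti-clockwise."""
    return [list(row) for row in zip(*face)][::-1]

# Label the stickers 1..9 in reading order so the movement is visible.
face = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]

turned = rotate_clockwise(face)
print(turned)                        # [[7, 4, 1], [8, 5, 2], [9, 6, 3]]
print(rotate_anticlockwise(turned))  # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

The second print shows that an anti-clockwise turn undoes a clockwise one, which is why the action (0, 1) inverts the action (0, 0).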

By default, the reward function is +1 if the cube is solved, otherwise 0. This is specified by passing reward_function_type to the environment constructor. You can implement your own custom reward function if you think that doing so would better incentivise the agent to learn how to solve the cube! However, be aware that the final evaluation will be purely on how often the cube is solved.

In [None]:
print(f"Reward before solving: {reward}")

# Invert the above action
action = (0, 1)
obs, reward, done, info = env.step(action)
print(f"Reward after solving: {reward}")
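As a rough illustration of the two kinds of reward discussed above, the sketch below implements the sparse rule (+1 if solved, otherwise 0) alongside one hypothetical denser alternative. Both functions work on a nested-list cube and are assumptions for illustration only. The environment's actual reward code lives in rubiks_cube/env.py, and `matching_centre_reward` is not something the repository provides.

```python
def is_solved(cube):
    """A cube is solved when every face shows a single colour id."""
    return all(len({sticker for row in face for sticker in row}) == 1 for face in cube)

def sparse_reward(cube):
    """+1 if the cube is solved, otherwise 0."""
    return 1.0 if is_solved(cube) else 0.0

def matching_centre_reward(cube):
    """Hypothetical dense reward: fraction of stickers matching their face centre."""
    matches = sum(sticker == face[1][1] for face in cube for row in face for sticker in row)
    total = sum(len(row) for face in cube for row in face)
    return matches / total

# Toy solved cube: face i is filled with colour id i (an assumption).
cube = [[[face] * 3 for _ in range(3)] for face in range(6)]
print(sparse_reward(cube), matching_centre_reward(cube))  # 1.0 1.0

# Change one sticker (not a reachable cube state, just an illustration).
cube[0][0][0] = 1
print(sparse_reward(cube))  # 0.0
print(matching_centre_reward(cube) == 53 / 54)  # True
```

A denser signal like this may make learning easier, but it can also reward states that are not actually close to solved, so it is worth comparing against the sparse baseline.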

The episode can finish for one of two reasons:

1: The cube is solved

2: The step limit is hit. In this case, with the default sparse reward, all rewards throughout the episode will be 0.

In [None]:
print(f"Above environment has finished? {done}")

env = RubiksCube(step_limit=STEP_LIMIT,
                 reward_function_type="sparse", 
                 num_scrambles_on_reset=NUM_SCRAMBLES_ON_RESET)
obs = env.reset()
done = False
while not done:
    # Select a random action
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)
    print(f"New environment has finished? {done}. Reward obtained={reward}")
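The two termination conditions can be summarised in a few lines. This is a sketch of the logic as described above, not the environment's own code. The `episode_status` helper is hypothetical, and treating "hit" as `step_count >= step_limit` is an assumption.

```python
def episode_status(solved, step_count, step_limit):
    """Return (done, reason) following the two termination rules above."""
    if solved:
        return True, "solved"
    # Assumption: the limit counts as hit once step_count reaches step_limit.
    if step_count >= step_limit:
        return True, "step limit reached"
    return False, "in progress"

print(episode_status(solved=False, step_count=3, step_limit=100))    # (False, 'in progress')
print(episode_status(solved=True, step_count=3, step_limit=100))     # (True, 'solved')
print(episode_status(solved=False, step_count=100, step_limit=100))  # (True, 'step limit reached')
```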

For some models it might be more helpful to "flatten" the action space, meaning to have a single integer between 0 and 17 rather than a tuple of integers. For this purpose you can use the already implemented wrapper. To get this behaviour in the models below, you will typically want to set FLATTEN_ACTIONS to True in training/configs.py.

In [None]:
env = create_flattened_env(dict(step_limit=STEP_LIMIT,
                                reward_function_type="sparse", 
                                num_scrambles_on_reset=NUM_SCRAMBLES_ON_RESET))

print(env.action_space)
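One natural way to map between the tuple and the flat representation is sketched below. The ordering `face * 3 + amount` is an assumption made for illustration, and the wrapper in rubiks_cube/env.py may order the 18 actions differently, so check its source before relying on a particular index.

```python
NUM_FACES, NUM_AMOUNTS = 6, 3

def flatten_action(face, amount):
    """Map a (face, amount) pair to a single integer in [0, 17]."""
    # Assumed ordering: all amounts for face 0 first, then face 1, and so on.
    return face * NUM_AMOUNTS + amount

def unflatten_action(index):
    """Inverse of flatten_action: recover (face, amount) from the integer."""
    return divmod(index, NUM_AMOUNTS)

print(flatten_action(0, 0))   # 0
print(flatten_action(5, 2))   # 17
print(unflatten_action(7))    # (2, 1)
```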

Finally, you can combine multiple frames together to create an animation of your Rubik's Cube!

In [None]:
env = RubiksCube(step_limit=STEP_LIMIT,
                 reward_function_type="sparse",
                 num_scrambles_on_reset=100)
obs = env.reset()
cubes = [obs["cube"].copy()]
done = False
while not done:
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)
    cubes.append(obs["cube"].copy())

anim = env.animation(cubes)
anim.save("cube.gif")

## Training

The algorithm implemented here is Proximal Policy Optimization (PPO), an on-policy reinforcement learning algorithm developed by OpenAI. For more information, please consult their page: https://openai.com/research/openai-baselines-ppo. Note that one could equally use other algorithms such as DQN, and you are encouraged to experiment with this, though it may involve changing other parts of the code such as evaluation/generate_rollout.py.

Fundamentally, a PPO agent should implement a policy network and a value network. The policy network maps observations to a parametrisation of the policy that is used to select the next action, whereas the value network maps observations to a single number which is supposed to estimate the expected discounted future reward, and is used to stabilise the training. The model implemented here by default shares most of the parameters between these two networks, though you are encouraged to experiment with different model architectures.
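To make "expected discounted future reward" concrete, the sketch below computes the discounted return of a single episode. The value network is trained to estimate the expectation of this quantity from a given observation. The `discounted_return` helper and the discount factor used are illustrative and not taken from the training configuration.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward t steps ahead is weighted by gamma ** t."""
    total = 0.0
    # Work backwards so each step only needs the return of the step after it.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

# Sparse reward: nothing until the cube is solved on the third step.
print(discounted_return([0.0, 0.0, 1.0], gamma=0.5))  # 0.25
```

With a sparse reward, the return from the initial state shrinks geometrically with the number of moves needed, which is one reason heavily scrambled cubes give a weak learning signal.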

To understand the architecture of the model, examine the model class in training/PPO_models.py. By default it will use FactorisedPPOModel, which takes as input the cube and step count from the observation and produces an encoding of the observation: a fixed-size vector whose size is given by the final entry of dense_layer_dims in the model_config (defined in training.configs.CUSTOM_MODEL_CONFIG). This encoding is projected to a single number for predicting the value. As for the policy, we take advantage of the factorised/conditional structure of the action space: first, the model's face_selection_model produces logits for the decision of which face to turn. Then we compute a conditional distribution by concatenating a sampled value of the selected face (encoded as a one-hot) to the encoding and projecting to give logits for the decision of how much to turn the face; this is the model's cube_movement_amount_selection_model. This model must be used in combination with the FactorisedActionDistribution.
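The factorised structure means the probability of an action is p(face) * p(amount | face). The sketch below shows that decomposition in pure Python. It is a simplified stand-in: the real model derives the amount logits from the encoding concatenated with a one-hot face, whereas here a lookup table `amount_logits_by_face` (a hypothetical name) plays that role.

```python
import math

def log_softmax(logits):
    """Numerically stable log-probabilities for a list of logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]

def factorised_log_prob(face_logits, amount_logits_by_face, action):
    """log p(face) + log p(amount | face) for an action (face, amount)."""
    face, amount = action
    log_p_face = log_softmax(face_logits)[face]
    # The amount distribution depends on which face was selected.
    log_p_amount = log_softmax(amount_logits_by_face[face])[amount]
    return log_p_face + log_p_amount

# With all logits equal, each of the 6 * 3 = 18 actions is equally likely.
face_logits = [0.0] * 6
amount_logits_by_face = [[0.0] * 3 for _ in range(6)]
log_p = factorised_log_prob(face_logits, amount_logits_by_face, (2, 1))
print(math.isclose(math.exp(log_p), 1 / 18))  # True
```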

Note that if one instead flattens the action space this conditional structure will no longer be present and so one should use a different model like FlatPPOModel, and will not need to specify a custom action distribution (since by default PPO is expecting the model to output logits for a flat action space).

### Training your first model

The main training script will write a checkpoint (learned model weights and associated metadata) that you will later be able to use to evaluate training on unseen instances.

This script can also be run from the command line; the following command runs the same code as the Python cell below (here with smaller example values):

python training/example.py --step_limit 10 --reward_function_type sparse --num_scrambles_on_reset 2 --agent_name PPO --num_iterations 2

In [None]:
main_training(
    step_limit=STEP_LIMIT,
    reward_function_type="sparse",
    num_scrambles_on_reset=NUM_SCRAMBLES_ON_RESET,
    model_config=CUSTOM_MODEL_CONFIG,
    agent_name="PPO",
    num_iterations=ITERS,
    restore_path=None,
)

As you can see, there is quite a lot of output! The main highlights are the path to the checkpoint (here the directory is ~/ray_results/PPO/PPO_rubiks_cube_env_f42af_00000_0_2023-02-28_14-55-34), a summary of the model used (which you can override by modifying training/PPO_models.py) and the per-iteration results. The latter is a series of metrics summarising what happened during the episodes of each iteration.

You might find it easier to view these in tensorboard:

In [None]:
%load_ext tensorboard

In [None]:
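# Run directories are named PPO_<env>_<id>_..._<timestamp>, so pick the newest by modification time
results_dir = os.path.expanduser("~/ray_results/PPO")
log_dir = max((x for x in os.listdir(results_dir) if x.startswith("PPO")),
              key=lambda x: os.path.getmtime(os.path.join(results_dir, x)))
log_dir

In [None]:
# Checkpoint names are zero-padded, so the last one in sorted order is the latest
chk_pt = sorted(x for x in os.listdir(os.path.join(results_dir, log_dir)) if x.startswith("checkpoint"))[-1]
chk_pt

In [None]:
def get_dir(root=False):
    """Return the most recently modified PPO run directory (root=True) or its latest checkpoint."""
    results_dir = os.path.expanduser("~/ray_results/PPO")
    runs = [os.path.join(results_dir, x) for x in os.listdir(results_dir) if x.startswith("PPO")]
    run_dir = max(runs, key=os.path.getmtime)
    if root:
        return run_dir
    chk_pt = sorted(x for x in os.listdir(run_dir) if x.startswith("checkpoint"))[-1]
    return os.path.join(run_dir, chk_pt)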

In [None]:
#%tensorboard --logdir ~/ray_results/PPO/PPO_rubiks_cube_env_f62df_00000_0_2023-03-18_23-16-50

To see some basic documentation of how to use the main training script, run in terminal:

python training/example.py --help

### Training your second model

The library allows you to restore from a checkpoint and continue training, without having to redo the first stages of training. This could be useful in various situations, for example if the training crashed, or if you want to start training on easier environments before progressing to more difficult examples as the agent gets better (curriculum learning). Note that this will only work if the model can stay the same throughout.
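A curriculum can be written down as a list of stages before running anything. The sketch below only builds the keyword arguments for each stage and does not call the training script. `curriculum_stages` is a hypothetical helper, and it assumes that the iteration count passed to training is a running total that includes iterations already stored in the restored checkpoint.

```python
def curriculum_stages(scramble_schedule, iterations_per_stage):
    """Build per-stage settings with increasing difficulty and a cumulative iteration count."""
    stages = []
    for stage_index, num_scrambles in enumerate(scramble_schedule, start=1):
        stages.append({
            "num_scrambles_on_reset": num_scrambles,
            # Assumption: num_iterations is a running total, so it must grow each stage.
            "num_iterations": stage_index * iterations_per_stage,
        })
    return stages

for stage in curriculum_stages([2, 5, 10], iterations_per_stage=2):
    print(stage)
# {'num_scrambles_on_reset': 2, 'num_iterations': 2}
# {'num_scrambles_on_reset': 5, 'num_iterations': 4}
# {'num_scrambles_on_reset': 10, 'num_iterations': 6}
```

Each stage after the first would also be given a restore_path pointing at the checkpoint written by the previous stage.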

Let's restore the checkpoint generated above and continue training on a harder environment. We will increase the number of scrambles done to the cube on reset. Note that we must increase the number of training iterations as otherwise there is nothing to do. The results of this training will be written to a new directory.

This corresponds to a command of the following form (shown here with smaller example values):

python training/example.py --step_limit 10 --reward_function_type sparse --num_scrambles_on_reset 5 --agent_name PPO --num_iterations 4 --restore_path ~/ray_results/PPO/PPO_rubiks_cube_env_f42af_00000_0_2023-02-28_14-55-34/checkpoint_000002

In [None]:
os.listdir(f"{get_dir(True)}")

In [None]:
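main_training(
    step_limit=STEP_LIMIT,
    reward_function_type="sparse",
    # Harder environment than the first run: scramble the cube on reset
    num_scrambles_on_reset=5,
    model_config=CUSTOM_MODEL_CONFIG,
    agent_name="PPO",
    # The iteration count includes those already stored in the checkpoint
    num_iterations=2 * ITERS,
    restore_path=f"{get_dir()}",
)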

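Note that only the additional iterations are performed: training resumes from the iteration count stored in the checkpoint (2 additional iterations in the command-line example above).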

There is a lot more customisability available in rllib; feel free to consult the documentation at https://docs.ray.io/en/latest/rllib/index.html

## Evaluation

The final stage is to produce a set of rollouts, that is, sequences of actions and new observations that one obtains by repeatedly inferring from a model and taking the corresponding action in the environment. Included below is a simple set of commands that are likely to be useful if you would like to investigate the decisions that your model is making in particular situations.

In [None]:
config = get_config(
    env_config=MEDIUM_ENV_CONFIG,
    model_config=CUSTOM_MODEL_CONFIG,
    agent_name="PPO",
)
register(agent_name="PPO")
with _ray():
    agent = PPO(AlgorithmConfig.from_dict(config))
    agent.restore(f"{get_dir()}")
    env = RubiksCube(**MEDIUM_ENV_CONFIG)
    obs = env.reset()
    # Exploration means to sample from the parametrised distribution. Setting it to false picks the modal action
    action = agent.compute_single_action(observation=obs, explore=False)

In [None]:
env.render()

In [None]:
action

In [None]:
print(f"Agent is choosing action {action}")
obs, reward, done, info = env.step(action)

In [None]:

env.render()
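The single step above extends naturally to a full rollout: keep asking the policy for an action and stepping the environment until the episode ends. The sketch below shows that loop on a toy stand-in environment so it runs without any trained model. `CountdownEnv` and `rollout` are hypothetical names, and the environment is deliberately not a Rubik's Cube; it only mimics the reset/step interface used in this notebook.

```python
class CountdownEnv:
    """Toy stand-in environment with the same reset/step shape as the cube environment."""

    def __init__(self, start=3):
        self.start = start
        self.remaining = start

    def reset(self):
        self.remaining = self.start
        return {"remaining": self.remaining}

    def step(self, action):
        # The action is how much to count down by; the episode ends at zero.
        self.remaining -= action
        done = self.remaining <= 0
        reward = 1.0 if done else 0.0
        return {"remaining": self.remaining}, reward, done, {}

def rollout(env, policy, max_steps=100):
    """Run one episode and record the (action, reward) pair of every step."""
    obs = env.reset()
    trajectory = []
    for _ in range(max_steps):
        action = policy(obs)
        obs, reward, done, info = env.step(action)
        trajectory.append((action, reward))
        if done:
            break
    return trajectory

trajectory = rollout(CountdownEnv(start=3), policy=lambda obs: 1)
print(trajectory)  # [(1, 0.0), (1, 0.0), (1, 1.0)]
```

With the trained agent, the policy would be something like `lambda obs: agent.compute_single_action(observation=obs, explore=False)`, called while the agent is still available.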

For a more systematic way of doing this, there is a script which will run this automatically on a collection of increasingly difficult environment configurations and write the result to a file. To use this script, run:

python evaluation/generate_rollout.py

while specifying command line arguments checkpoint_path (indicating the path to restore from if using a trained model), the results_path (where to write the results of the inference to), and the agent_name (can leave blank if using PPO).

In [None]:
# PUBLIC_SEEDS

In [None]:
# Example on a smaller set of seeds
main_rollout(seeds=PUBLIC_SEEDS,
             checkpoint_path=f"{get_dir()}", 
             results_path=f"sample_results_steps_{STEP_LIMIT}_iter_{ITERS}_dl_dims_{'_'.join([str(x) for x in CUSTOM_MODEL_CONFIG['dense_layer_dims']])}.txt")

Finally, you can run the script python evaluation/validate_rollout.py to make sure that the results are given in a format the platform can understand. For more information on how to use this script, run python evaluation/validate_rollout.py --help

In [None]:
score = main_validation(results_path=f"sample_results_steps_{STEP_LIMIT}_iter_{ITERS}_dl_dims_{'_'.join([str(x) for x in CUSTOM_MODEL_CONFIG['dense_layer_dims']])}.txt", public_seeds=PUBLIC_SEEDS)
print(f"You scored {score}!")

In [None]:
# os.listdir()

In [None]:
#from google.colab import files

In [None]:
#files.download(f"sample_results_steps_{STEP_LIMIT}_iter_{ITERS}_dl_dims_{'_'.join([str(x) for x in CUSTOM_MODEL_CONFIG['dense_layer_dims']])}.txt")