# Finding short superpermutations for n=4

In [6]:
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv
from GymPermutationsEnv import GymPermutationEnv
# Instantiate the env
vec_env = make_vec_env(GymPermutationEnv, n_envs=16, env_kwargs=dict(alphabet_size=4), vec_env_cls=SubprocVecEnv)

In [7]:
# Train the agent
#model = A2C("MlpPolicy", vec_env, verbose=1,policy_kwargs=dict(optimizer_class=RMSpropTFLike, optimizer_kwargs=dict(eps=1e-5))).learn(total_timesteps=100000)
from stable_baselines3.common.callbacks import EvalCallback, StopTrainingOnRewardThreshold

eval_env = make_vec_env(GymPermutationEnv, env_kwargs=dict(alphabet_size=4), vec_env_cls=SubprocVecEnv)
callback_on_best = StopTrainingOnRewardThreshold(reward_threshold=63, verbose=1) # for why 63 see notes below about calculating the max possible reward
eval_callback = EvalCallback(eval_env, callback_on_new_best=callback_on_best, verbose=1)
model = PPO('MlpPolicy', vec_env, verbose=1)
model.learn(10000000, callback=eval_callback)

Using cpu device
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 72.6     |
|    ep_rew_mean     | -240     |
| time/              |          |
|    fps             | 8447     |
|    iterations      | 1        |
|    time_elapsed    | 3        |
|    total_timesteps | 32768    |
---------------------------------
----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 56.9       |
|    ep_rew_mean          | -162       |
| time/                   |            |
|    fps                  | 2859       |
|    iterations           | 2          |
|    time_elapsed         | 22         |
|    total_timesteps      | 65536      |
| train/                  |            |
|    approx_kl            | 0.01294235 |
|    clip_fraction        | 0.118      |
|    clip_range           | 0.2        |
|    entropy_loss         | -3.17      |
|    explained_variance   | 0.00199    |
|    learning_rate        | 

<stable_baselines3.ppo.ppo.PPO at 0x2e3113654d0>

In [8]:
# Test the trained agent
vec_env = make_vec_env(GymPermutationEnv, n_envs=1, env_kwargs=dict(alphabet_size=4))
obs = vec_env.reset()
n_steps = 20
for step in range(n_steps):
    action, _ = model.predict(obs, deterministic=True)
    print(f"Step {step + 1}")
    print("Action: ", action)
    obs, reward, done, info = vec_env.step(action)
    print("obs=", obs, "reward=", reward, "done=", done)
    vec_env.render()
    if done:
        # Note that the VecEnv resets automatically
        # when a done signal is encountered
        print("Goal reached!", "reward=", reward)
        break

Step 1
Action:  [3]
obs= [[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0]] reward= [0.] done= [False]
[1, 3, 4, 2]
Step 2
Action:  [6]
obs= [[0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0]] reward= [9.] done= [False]
[1, 3, 4, 2, 1, 3, 4]
Step 3
Action:  [18]
obs= [[0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 1 0 0 0 0 0]] reward= [5.] done= [False]
[1, 3, 4, 2, 1, 3, 4, 1, 2, 3]
Step 4
Action:  [19]
obs= [[1 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 1 0 0 0 0]] reward= [8.] done= [False]
[1, 3, 4, 2, 1, 3, 4, 1, 2, 3, 4, 1, 3, 2]
Step 5
Action:  [10]
obs= [[1 0 1 1 0 0 1 0 0 1 1 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
  0 0 0 0 0 0 0 0 0 0 0 0]] reward= [9.] done= [False]
[1, 3, 4, 2, 1, 3, 4, 1, 2, 3, 4, 1, 3, 2, 4, 1, 3]
Step 6
Action:  [13]
obs= [[1 0 1 1 0 0 1 0 0 1 1 0 0 1 0 1 1 1 1 

### The RL agent found the following superpermutation:
[1, 3, 4, 2, 1, 3, 4, 1, 2, 3, 4, 1, 3, 2, 4, 1, 3, 1, 4, 2, 3, 1, 4, 3, 2, 1, 4, 3, 1, 2, 4, 3, 1]

We can check whether it is the same (up to relabelling) as the shortest superpermutation shown on Wikipedia:

In [9]:
import permutation_utils
found = [1, 3, 4, 2, 1, 3, 4, 1, 2, 3, 4, 1, 3, 2, 4, 1, 3, 1, 4, 2, 3, 1, 4, 3, 2, 1, 4, 3, 1, 2, 4, 3, 1]
shortest_superpermutation = [int(i) for i in "123412314231243121342132413214321"]
relabellings_list = permutation_utils.get_possible_relabellings(shortest_superpermutation, [1,2,3,4])
print(found in relabellings_list)

True
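The same comparison can be sketched without `permutation_utils`, whose internals are not shown here. This is a minimal alternative, not the module's implementation: the hypothetical helper `canonical_form` relabels the symbols of each sequence in order of first appearance, and two sequences are equal up to relabelling exactly when their canonical forms match.

```python
def canonical_form(seq):
    """Relabel symbols as 1, 2, 3, ... in order of first appearance."""
    mapping = {}
    for symbol in seq:
        # Assign the next unused label the first time a symbol is seen.
        mapping.setdefault(symbol, len(mapping) + 1)
    return [mapping[symbol] for symbol in seq]


def same_up_to_relabelling(a, b):
    """True if some renaming of the symbols turns sequence a into sequence b."""
    return canonical_form(a) == canonical_form(b)


found = [1, 3, 4, 2, 1, 3, 4, 1, 2, 3, 4, 1, 3, 2, 4, 1, 3,
         1, 4, 2, 3, 1, 4, 3, 2, 1, 4, 3, 1, 2, 4, 3, 1]
shortest = [int(c) for c in "123412314231243121342132413214321"]

print(same_up_to_relabelling(found, shortest))
```

Comparing canonical forms avoids enumerating all n! relabellings, which matters little for n=4 but grows quickly for larger alphabets.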


### Calculating the maximum possible reward for an episode

Assume that n is the size of the alphabet. Let us call the superpermutation formed by appending every possible permutation a "naive superpermutation".

Since there are n! permutations, each of length n, and every one of them must be included, the naive superpermutation has length n!*n.

The environment rewards the model with 1 point for each character "saved" compared to this naive superpermutation. For example, merging two permutations that overlap in 2 characters earns 2 points. If the merge also creates another permutation (other than the two being merged), the model earns an additional n points.
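The reward rule can be sketched as follows. This is an illustration and not the code of `GymPermutationEnv`: the helper names `permutation_windows` and `merge_reward` are hypothetical, the sketch assumes the environment always merges with the largest possible overlap, and the permutations passed in below are inferred from the rendered sequences in the test run above. Penalties for re-adding an existing permutation are not modelled.

```python
def permutation_windows(seq, n):
    """Set of length-n windows of seq that are permutations of 1..n."""
    symbols = set(range(1, n + 1))
    return {
        tuple(seq[i:i + n])
        for i in range(len(seq) - n + 1)
        if set(seq[i:i + n]) == symbols
    }


def merge_reward(seq, perm):
    """Append perm to seq with maximal overlap; return (merged, reward)."""
    n = len(perm)
    # Largest k < n such that the last k symbols of seq equal the first k of perm.
    overlap = next(k for k in range(n - 1, -1, -1)
                   if k == 0 or seq[-k:] == perm[:k])
    merged = seq + perm[overlap:]
    # Permutations that appear as a side effect, other than the one just added.
    created = permutation_windows(merged, n) - permutation_windows(seq, n)
    extra = created - {tuple(perm)}
    return merged, overlap + n * len(extra)


seq = [1, 3, 4, 2]
seq, r2 = merge_reward(seq, [2, 1, 3, 4])
seq, r3 = merge_reward(seq, [4, 1, 2, 3])
print(seq, r2, r3)
```

Under these assumptions the sketch yields rewards of 9 and 5 for the two merges, matching steps 2 and 3 of the logged episode.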

Thus the cumulative reward for an episode is (aside from penalties for picking already added permutations) equal to (n!*n)-(length of the superpermutation created by the model).

So the length of the superpermutation created by the model is at most (n!*n)-(cumulative reward for the episode), with equality when no penalties were incurred.

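For n=4, this is (4!*4)-(cumulative reward for the episode) = 96-(cumulative reward for the episode).

The shortest superpermutation for n=4 has length 33, so 33 = 96-reward, which gives a maximum reward of 63.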
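As a quick sanity check of this arithmetic, the sketch below verifies that the sequence found by the agent contains every permutation of 1..4 and recomputes the reward threshold from its length. The helper `is_superpermutation` is a hypothetical name introduced for illustration.

```python
from itertools import permutations
from math import factorial


def is_superpermutation(seq, n):
    """True if every permutation of 1..n occurs as a contiguous window of seq."""
    windows = {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}
    return all(p in windows for p in permutations(range(1, n + 1)))


n = 4
found = [1, 3, 4, 2, 1, 3, 4, 1, 2, 3, 4, 1, 3, 2, 4, 1, 3,
         1, 4, 2, 3, 1, 4, 3, 2, 1, 4, 3, 1, 2, 4, 3, 1]

naive_length = factorial(n) * n          # 4! * 4 = 96
max_reward = naive_length - len(found)   # 96 - 33 = 63

print(is_superpermutation(found, n), len(found), max_reward)
```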