# Imports
To begin with, we import the tools we need from EvoTorch, along with PyTorch, which we use to save the learned policy as a TorchScript module for submission.

In [None]:
from evotorch import Problem
from evotorch.algorithms import PGPE
from evotorch.neuroevolution import GymNE
from evotorch.logging import StdOutLogger, PandasLogger
import torch

# Configuration
Now we configure the specific environment we want to solve. We arrived at the values for the weighted reward keys through some trial and error, and we expect that they can be tuned further.

In [None]:
env_name = "myosuite:myoChallengeBaodingP1-v1"
policy_path = 'agent/policies/learned_policy_boading.pkl'
env_config = {
    'weighted_reward_keys' : {
        'pos_dist_1':1.0,
        'pos_dist_2':1.0,
        'solved': 2,
        'act_reg': 0.1,
    }
}
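
To give a feel for what these weights do, here is a minimal sketch that assumes the environment combines its named reward terms into one scalar as a weighted sum. The per-step values in `example_terms` are made up for illustration and are not output from the environment; `weighted_reward` is a hypothetical helper, not part of MyoSuite or EvoTorch.

```python
def weighted_reward(terms, weights):
    """Weighted sum of the named reward terms; terms without a weight are ignored."""
    return sum(weight * terms[key] for key, weight in weights.items())


# Same weights as in env_config above
weights = {"pos_dist_1": 1.0, "pos_dist_2": 1.0, "solved": 2, "act_reg": 0.1}

# Hypothetical per-step values of each reward term (illustration only)
example_terms = {"pos_dist_1": -0.5, "pos_dist_2": -0.25, "solved": 1.0, "act_reg": -1.0}

total = weighted_reward(example_terms, weights)
print("weighted reward for this step:", total)
```

Under this assumption, raising the weight of `solved` makes reaching the goal dominate the distance terms, while raising `act_reg` penalizes large actions more strongly.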

We also have the configuration of the optimizer. Here we are generally following the setup used in previous work with ClipUp:

In [None]:
CLIPUP_MAX_SPEED = 0.15
CLIPUP_ALPHA = CLIPUP_MAX_SPEED * 0.75
RADIUS = CLIPUP_MAX_SPEED * 15

STDEV_LR = 0.1
STDEV_MAX_CHANGE = 0.2

POPSIZE = 20000
POPSIZE_MAX = POPSIZE * 8
NUM_INTERACTIONS = int(POPSIZE * 200 * 0.75)
NUM_GENERATIONS = 2000
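
Most of these constants are derived from just two base values, `CLIPUP_MAX_SPEED` and `POPSIZE`. The sketch below recomputes them so the resulting numbers are easy to inspect. It also converts the radius into a per-parameter standard deviation, assuming that the radius is the norm of the standard deviation vector. The `solution_length` of 10,000 is a hypothetical placeholder; the real value is printed once the problem is created in the Setup section. `derive_settings` is a helper introduced here for illustration only.

```python
import math


def derive_settings(max_speed, popsize, solution_length):
    """Recompute the derived hyperparameters from the two base values."""
    radius = max_speed * 15
    return {
        "center_learning_rate": max_speed * 0.75,
        "radius_init": radius,
        # Assumption: radius is the norm of the per-parameter stdev vector
        "stdev_per_parameter": radius / math.sqrt(solution_length),
        "popsize_max": popsize * 8,
        "num_interactions": int(popsize * 200 * 0.75),
    }


# solution_length=10_000 is a placeholder, not the real size of the policy
settings = derive_settings(max_speed=0.15, popsize=20000, solution_length=10_000)
for name, value in settings.items():
    print(f"{name}: {value}")
```

Changing `CLIPUP_MAX_SPEED` therefore rescales both the learning rate and the initial search radius together, which keeps the two in a fixed ratio.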

# Setup
Now we're ready to create the problem class. Note that we're using the `Policy` class from the included `policy.py` file: our initial experiments suggest that a slightly more complex (and recurrent) policy is beneficial for these challenging environments. Feel free to modify `Policy` as you try to improve upon this baseline.

In [None]:
from policy import Policy

problem = GymNE(
    env_name=env_name,
    env_config = env_config,
    network=Policy,
    network_args = {
        'hidden_dim': 64,
    },
    observation_normalization=True,
    num_actors='max',
)
print('Solution length is', problem.solution_length)

With `problem` instantiated, we can create the searcher:

In [None]:
searcher = PGPE(
    problem,
    center_learning_rate=CLIPUP_ALPHA,
    optimizer="clipup",
    optimizer_config={"max_speed": CLIPUP_MAX_SPEED},
    radius_init=RADIUS,
    stdev_learning_rate=STDEV_LR,
    stdev_max_change=STDEV_MAX_CHANGE,
    popsize=POPSIZE,
    popsize_max=POPSIZE_MAX,
    num_interactions=NUM_INTERACTIONS,
    distributed = True,
)
searcher
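
The interplay of `popsize`, `popsize_max`, and `num_interactions` can be pictured as follows: keep sampling solutions until the generation has consumed enough environment steps, or until the population cap is reached. This is a simplified, sequential sketch of that idea as we understand it, not EvoTorch's implementation. The function name, the batch-wise growth, and the toy episode lengths are all assumptions made for illustration.

```python
import itertools


def solutions_per_generation(episode_lengths, popsize, popsize_max, num_interactions):
    """Return (solutions evaluated, environment steps used) for one generation."""
    lengths = iter(episode_lengths)
    evaluated = 0
    interactions = 0
    while evaluated < popsize_max:
        # Evaluate one more batch, without exceeding the population cap
        batch = min(popsize, popsize_max - evaluated)
        interactions += sum(itertools.islice(lengths, batch))
        evaluated += batch
        # Stop growing once enough environment steps have been collected
        if interactions >= num_interactions:
            break
    return evaluated, interactions


# Toy setting: popsize 4, cap 32, target of 600 environment steps
for steps_per_episode in (200, 50, 10):
    result = solutions_per_generation(
        itertools.repeat(steps_per_episode),
        popsize=4,
        popsize_max=32,
        num_interactions=600,
    )
    print(steps_per_episode, "steps per episode ->", result)
```

The point of the sketch is that early in training, when poor policies end their episodes quickly, more solutions are needed to reach the interaction target, so the effective population is larger than `POPSIZE`.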

We add an `after_step_hook` which will save the policy after every generation. This means that you can asynchronously evaluate and submit agents, even as the evolutionary run continues!

In [None]:
import torch

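def save_policy():
    # Build a policy network from the current center of the search distribution
    policy = problem.to_policy(searcher.status['center'])
    # Compile to TorchScript and overwrite the previously saved policy
    scripted_module = torch.jit.script(policy)
    torch.jit.save(scripted_module, policy_path)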

searcher.after_step_hook.append(save_policy)
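
Because the file is overwritten every generation while you may be reading it from another process, a reader could in principle pick up a half-written file. One common way to guard against this is to write to a temporary file in the same directory and then rename it over the target. The sketch below shows that pattern using only the standard library; `atomic_save` and `write_demo` are hypothetical helpers introduced here, not part of EvoTorch or PyTorch.

```python
import os
import tempfile


def atomic_save(save_fn, path):
    """Call save_fn(tmp_path), then move the finished file to `path` in one step."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    # The temporary file lives in the target directory so the rename stays on one filesystem
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    os.close(fd)
    try:
        save_fn(tmp_path)
        os.replace(tmp_path, path)
    finally:
        # Clean up the temporary file if saving failed before the rename
        if os.path.exists(tmp_path):
            os.remove(tmp_path)


def write_demo(tmp_path):
    # Stand-in for the real save call; writes placeholder bytes
    with open(tmp_path, "wb") as f:
        f.write(b"policy bytes")


demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "policies", "demo_policy.pkl")
atomic_save(write_demo, demo_path)
print("saved:", os.path.exists(demo_path))
```

Inside `save_policy`, the last line could then become something like `atomic_save(lambda p: torch.jit.save(scripted_module, p), policy_path)`. As a side effect, the helper also creates the target directory if it does not exist yet.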

To keep an eye on progress, we attach a `StdOutLogger` instance and a `PandasLogger` instance so that we can track what's going on:

In [None]:
_ = StdOutLogger(searcher)
pandas_logger = PandasLogger(searcher)

# Train
And that's all there is to it! We're ready to train. As we go, the hook above will write `learned_policy_boading.pkl` to the `agent/policies/` directory after every generation. Check out `README.md` to see how you can submit these learned agents.

In [None]:
searcher.run(NUM_GENERATIONS)

Finally, it's nice to see how the mean evaluation develops over the generations. You can interrupt the above cell at any time, generate the plot below, and restart the above cell to continue training!

In [None]:
pandas_logger.to_dataframe().mean_eval.plot()