## Day 5, Part B: Using Your Trained Policy

## Learning goals
- What to do with the trained policy

## Definitions
- **Simulation environment**: Not to be confused with the Python/conda environment.  The simulation environment is the simulated world where the reinforcement learning takes place.  It provides opportunities for an agent to learn and explore, and ideally provides challenges that aid in efficient learning.
- **Agent (aka actor or policy)**: An entity in the simulation environment that performs actions.  The agent could be a person, a robot, a car, a thermostat, etc.
- **State variable**: An observed variable in the simulation environment.  State variables can be coordinates of objects or entities, an amount of fuel in a tank, air temperature, wind speed, etc.
- **Action variable**: An action that the agent can perform.  Examples: step forward, increase velocity to 552.5 knots, push object left with force of 212.3 N, etc.
- **Reward**: A value given to the agent for doing something considered to be 'good'.  Reward is commonly assigned at each time step and accumulated over a learning episode.
- **Episode**: A learning event consisting of multiple steps in which the agent can explore.  It starts with the unmodified environment and continues until the goal is achieved or something prevents further progress, such as a robot getting stuck in a hole.  Multiple episodes are typically run in loops until the model is fully trained.
- **Model (aka policy or agent)**: An RL model is composed of the modeling architecture (e.g., neural network) and parameters or weights that define the unique behavior of the model.
- **Policy (aka model or agent)**: The parameters of a model that encode the best choices to make in an environment.  The choices are not necessarily good ones until the model undergoes training.  The policy (or model) is the "brain" of the agent.
- **Replay Buffer**: A place in memory to store state, action, reward and other variables describing environmental state transitions. It is effectively the agent's memory of past experiences.
- **On-policy**: The agent learns only from experience collected by its current policy (PPO is an example).
- **Off-policy**: The agent can learn from experience collected by a different policy, such as older versions of itself stored in a replay buffer (TD3 is an example).
- **Value function**: Function (typically a neural network) used to estimate the value, or expected cumulative reward, of a state or of an action taken in a state.
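
To make the replay buffer definition above concrete, here is a minimal sketch using only the Python standard library. The names `Transition` and `ReplayBuffer` are illustrative assumptions; this is not the implementation used by the libraries in this course.

```python
import random
from collections import deque, namedtuple

# One environment transition: what the agent saw, did, and got back.
Transition = namedtuple("Transition", "state action reward next_state done")


class ReplayBuffer:
    """Fixed-size memory of past transitions (illustrative sketch)."""

    def __init__(self, capacity, seed=None):
        # A deque with maxlen silently drops the oldest entry when full.
        self.memory = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Transition(state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sample without replacement.
        return self.rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


buffer = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buffer.add(state=t, action=0, reward=1.0, next_state=t + 1, done=False)
batch = buffer.sample(2)
```

After five additions to a buffer of capacity three, only the three most recent transitions remain, which is how old experience ages out.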

## I've trained. I'm happy with the results. Now what?

Training an RL policy can be very time-consuming and expensive, so you want to make sure it's put to good use.  Before we try to make use of the trained model, let's be sure it's ready to use.  In previous notebooks, we have been saving the models (e.g., Day 1 Part C), but there are additional nuances that can be helpful during training.

- Consider auto-saving the highest reward policy during training
- Consider auto-saving periodically in case you need to pause a long training run or there is a power outage

Many RL frameworks include these features, but verify that they are enabled, and add them by hand (or extend them for your own needs) if they are missing. The TD3 implementation has built-in save and load functions, which we'll cover below.
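
If you do need to add these by hand, the two saving strategies can be sketched in a framework-agnostic way as follows. `CheckpointKeeper` and its file names are hypothetical, and `pickle` stands in for whatever save call your library provides.

```python
import pickle
import tempfile
from pathlib import Path


class CheckpointKeeper:
    """Save the best-scoring policy plus periodic snapshots (hypothetical helper)."""

    def __init__(self, save_dir, every_n_episodes=10):
        self.save_dir = Path(save_dir)
        self.save_dir.mkdir(parents=True, exist_ok=True)
        self.every_n_episodes = every_n_episodes
        self.best_reward = float("-inf")

    def _write(self, name, params):
        path = self.save_dir / name
        with open(path, "wb") as f:
            pickle.dump(params, f)
        return path

    def update(self, episode, episode_reward, params):
        """Call once per episode; returns the list of files written."""
        written = []
        if episode_reward > self.best_reward:
            # New high score: overwrite the single "best" file.
            self.best_reward = episode_reward
            written.append(self._write("best.pkl", params))
        if episode % self.every_n_episodes == 0:
            # Periodic snapshot, kept even if the reward was poor.
            written.append(self._write(f"episode_{episode:06d}.pkl", params))
        return written


keeper = CheckpointKeeper(tempfile.mkdtemp(), every_n_episodes=2)
history = [keeper.update(ep, r, {"weights": [ep]})
           for ep, r in enumerate([5.0, 3.0, 9.0, 1.0], start=1)]
```

Keeping the best policy separate from the periodic snapshots means a late collapse in training does not cost you your best result.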

You might also want to stand up a database to store a large number of policy snapshots or to keep the replay buffer's state-action transitions (methods that learn from previously collected transitions, often called offline RL, are a new field in RL).

>**There's a golden rule you may have learned from late nights writing school essays and the power goes out: "Save. And save often."**

From our original stable_baselines3 CartPole:

```python
model.save("ppo_cartpole")
```

In this case, saving produced a `ppo_cartpole.zip` file; other libraries might produce a NumPy `.npy` file or a serialized Python object.  These files always contain the model parameters, but some also include other training state, such as the episode number, so you can restart training where you left off.  What is included depends on the library you use.  In any case, **the file is the artifact that you spent all your time and money producing - the stored values in the neural network**.
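
The `.zip` file is an ordinary zip archive, so the standard library can show what is inside without loading the model. The helper name `list_saved_files` is an assumption for illustration, and the entries you see will depend on the library version.

```python
import zipfile


def list_saved_files(path):
    """Return the names of the files stored inside a saved-model archive."""
    with zipfile.ZipFile(path) as archive:
        return sorted(archive.namelist())


# Example (assumes ppo_cartpole.zip exists in the working directory):
# print(list_saved_files("ppo_cartpole.zip"))
```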

If you still have `ppo_cartpole.zip` around, you can load it up and put it to use; otherwise, rerun `Day1_PartC` to create it.

Now, import the same boilerplate:

In [None]:
import os
import gym
from stable_baselines3 import PPO

Create our environment with `gym.make()` and load the zip back into a variable using stable-baselines3's PPO load utility.

In [None]:
env = gym.make("CartPole-v1")
model = PPO.load("ppo_cartpole")

We'll go ahead and do the same render we did before to 'see it in action', but let's take a look at what we have already:

In [None]:
obs = env.reset()
obs

This array is the environment state at reset.  Feel free to re-run the above cell a few times; you'll see different results for each run.  (Use Control-Enter to re-run without leaving the cell.)

Passing that environment state to the policy in `model.predict(obs)` returns the policy action to take given the current environment.

In [None]:
action, _states = model.predict(obs)
action

Believe it or not, for CartPole, that's the *entire* ballgame. All that time and effort gives you a policy that delivers one thing: the action to take given the state of the environment.

>**Remember that you are building a tool.**

You *can* hook things up to run an entire episode and play things out like a simulation/game/etc., or you could just take that single one-off state->action converter and drop it into another piece of code.  Maybe you have a theory-based algorithm that solves your problem perfectly, except for that one blind point where your algorithm has a divide-by-zero (shrug), so in that exception catch you drop in your trained policy to do a bit more than just a simple 'default action.'
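
The exception-handler idea can be sketched like this. `analytic_controller` is a made-up stand-in for a theory-based algorithm, and `stub_predict` imitates the `(action, state)` return shape of `model.predict`; in practice you would pass `model.predict` itself.

```python
def analytic_controller(obs):
    """Hypothetical theory-based rule that breaks when the pole angle is exactly zero."""
    # Convert to plain floats: NumPy scalars return inf/nan with a warning
    # instead of raising ZeroDivisionError.
    pole_angle = float(obs[2])
    pole_velocity = float(obs[3])
    return 1 if (pole_velocity / pole_angle) > 0 else 0


def choose_action(obs, predict):
    """Use the analytic rule, falling back to the trained policy on failure."""
    try:
        return analytic_controller(obs), "analytic"
    except ZeroDivisionError:
        action, _state = predict(obs)
        return int(action), "policy"


def stub_predict(obs):
    """Stand-in for model.predict: always pushes left."""
    return 0, None


normal = choose_action([0.0, 0.0, 0.5, 1.0], stub_predict)
blind_spot = choose_action([0.0, 0.0, 0.0, 1.0], stub_predict)
```

The second call hits the divide-by-zero blind spot, so the action comes from the policy rather than from a hard-coded default.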

>**Try running the next two cells, again and again, to advance the environment forward; predict->step->predict->step**

In [None]:
obs, rewards, dones, info = env.step(action)
obs, rewards

In [None]:
action, _states = model.predict(obs)
action

We can, of course, just run the thing through the episode (or 1k steps in a loop below) given our policy.  But remember: running episodes is not the only thing you can do with the policy you've trained.

>**Your policy is now a function that pops out 'best actions' when you ask it to**

Set up a multiplayer game, for example, and every time the computer gets a turn (or some other opportunity to move, such as a set polling interval of, say, 0.05 seconds) you pass the observations to your policy, and your AI player can act.

In [None]:
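obs = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
    if dones:
        # Start a new episode once the pole falls or the time limit is hit
        obs = env.reset()
env.close()  # closes the render window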
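
If you want a number out of a run rather than a rendering, the same predict/step loop can be wrapped in a function that stops when the episode ends and returns the total reward. This is a sketch under assumptions: `run_episode` is a hypothetical helper, and `CountdownEnv` is a toy stand-in that mimics the 4-tuple `step` API used in this notebook so the example runs without gym.

```python
class CountdownEnv:
    """Toy stand-in for a gym environment (reset/step with a 4-tuple return)."""

    def __init__(self, length=5):
        self.length = length
        self.t = 0

    def reset(self):
        self.t = 0
        return [float(self.t)]

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0
        done = self.t >= self.length
        return [float(self.t)], reward, done, {}


def run_episode(env, predict, max_steps=1000):
    """Play one episode and return (total_reward, steps_taken)."""
    obs = env.reset()
    total_reward, steps = 0.0, 0
    while steps < max_steps:
        action, _state = predict(obs)
        obs, reward, done, _info = env.step(action)
        total_reward += reward
        steps += 1
        if done:
            break
    return total_reward, steps


def always_right(obs):
    """Stand-in for model.predict that always returns action 1."""
    return 1, None


total, steps = run_episode(CountdownEnv(length=5), always_right)
```

With the real objects from the cells above, the call would look like `run_episode(env, model.predict)`.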

For the case of TD3, there are save and load functions built in, and they look like this:

```python
    def save(self, filename):
        torch.save(self.critic.state_dict(), filename + "_critic")
        torch.save(self.critic_optimizer.state_dict(), filename + "_critic_optimizer")

        torch.save(self.actor.state_dict(), filename + "_actor")
        torch.save(self.actor_optimizer.state_dict(), filename + "_actor_optimizer")

    def load(self, filename):
        self.critic.load_state_dict(torch.load(filename + "_critic"))
        self.critic_optimizer.load_state_dict(torch.load(filename + "_critic_optimizer"))
        self.critic_target = copy.deepcopy(self.critic)

        self.actor.load_state_dict(torch.load(filename + "_actor"))
        self.actor_optimizer.load_state_dict(torch.load(filename + "_actor_optimizer"))
        self.actor_target = copy.deepcopy(self.actor)
```

It's simply using the PyTorch functions `torch.save()` and `torch.load()` to save and load the state dictionaries of the actor and critic (and their optimizers). Then, when we request an action, we're asking a loaded (trained or partially trained) policy: `policy.select_action(np.array(state))`

## Load Ant and have it perform some actions

There's a lot of code in the next two cells, but it is rather simple in what it's doing: 
- import/load boilerplate
- register the environment
- load the policy

The `load_policy()` in this case is nearly identical to the first half of the `main()` function we were playing with in our Ant examples.  It just stops as soon as it has the TD3 load accomplished, with correct parameters.  We don't need most of them, but we're bringing them along for the ride, just in case.

In [None]:
import numpy as np
import torch
import gym
import pybullet_envs
import os
import time
import sys
from pathlib import Path

sys.path.append(str(Path().resolve().parent))
import utils
import TD3
from numpngw import write_apng
from gym.envs.registration import registry, make, spec

def register(id, *args, **kwargs):
    # Skip registration if this id already exists, so the cell can be re-run safely
    if id in registry.env_specs:
        return
    return gym.envs.registration.register(id, *args, **kwargs)

register(id='MyAntBulletEnv-v0',
         entry_point='override_ant_random:MyAntBulletEnv',
         max_episode_steps=3000,
         reward_threshold=2500.0)

In [None]:
def load_policy(env_name_var):
    args = {
            "policy" : "TD3",                  # Policy name (TD3, DDPG or OurDDPG)
            "env" : env_name_var,              # OpenAI gym environment name
            "seed" : 0,                        # Sets Gym, PyTorch and Numpy seeds
            "start_timesteps" : 25e3,          # Time steps initial random policy is used
            "eval_freq" : 5e3,                 # How often (time steps) we evaluate
            "max_timesteps" : 2e6,             # Max time steps to run environment
            "expl_noise" : 0.1,                # Std of Gaussian exploration noise
            "batch_size" : 256,                # Batch size for both actor and critic
            "discount" : 0.99,                 # Discount factor
            "tau" : 0.007,                     # Target network update rate
            "policy_noise" : 0.2,              # Noise added to target policy during critic update
            "noise_clip" : 0.5,                # Range to clip target policy noise
            "policy_freq" : 2,                 # Frequency of delayed policy updates
            "save_model" : True,               # Save model and optimizer parameters
            "load_model" : "default",          # Model load file name, "" doesn't load, "default" uses file_name
           }

    file_name = f"{args['policy']}_{args['env']}_{args['seed']}_{args['tau']}"
    print("---------------------------------------")
    print(f"Policy: {args['policy']}, Env: {args['env']}, Seed: {args['seed']}")
    print("---------------------------------------")

    if not os.path.exists("./results"):
        os.makedirs("./results")

    if args['save_model'] and not os.path.exists("./models"):
        os.makedirs("./models")

    env = gym.make(args['env'])

    # Set seeds
    env.seed(args['seed'])
    env.action_space.seed(args['seed'])
    torch.manual_seed(args['seed'])
    np.random.seed(args['seed'])

    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.shape[0] 
    max_action = float(env.action_space.high[0])

    kwargs = {
        "state_dim": state_dim,
        "action_dim": action_dim,
        "max_action": max_action,
        "discount": args['discount'],
        "tau": args['tau'],
    }

    # Initialize policy
    if args['policy'] == "TD3":
        # Target policy smoothing is scaled wrt the action scale
        kwargs["policy_noise"] = args['policy_noise'] * max_action
        kwargs["noise_clip"] = args['noise_clip'] * max_action
        kwargs["policy_freq"] = args['policy_freq']
        policy = TD3.TD3(**kwargs)

    if args['load_model'] != "":
        policy_file = file_name if args['load_model'] == "default" else args['load_model']
        policy.load(f"./models/{policy_file}")

    return policy

In [None]:
policy = load_policy("MyAntBulletEnv-v0")

In [None]:
env = gym.make("MyAntBulletEnv-v0", render=True)
env.seed(0)

In [None]:
state, done = env.reset(), False

In [None]:
state

Now that everything's set up, you can view the ant and **step through the simulation using the next cell** (we could have even just made that its own function - call it 'advance' or something).  Control-Enter will run the cell without advancing to the next cell.

At this point, it's fun to keep the simulation window and the notebook both visible (I shrink my notebook to see the window on the side).  The view can be adjusted with control-click and your mouse wheel.

**This cell can be used to advance the ant, one step at a time:**

In [None]:
action = policy.select_action(np.array(state))
state, reward, done, _ = env.step(action)

In [None]:
env.robot.body_xyz

Maybe mid-course we want to change the target the ant is walking to (which then shows up in the observation space):

In [None]:
env.robot.walk_target_x = -10
env.robot.walk_target_y = -10

Go back to the cell above and advance the ant. It should turn and go toward this new target.  There might be some momentum built up and some maneuvering to turn around, so advance it enough times to actually see it turn and start walking again.

You get the picture: the policy controls what actions are taken.  It's an artifact that we save and load, but beyond that, how it gets used is up to us.

## Walk to each coordinate
For fun, let's set it up so we can pass in a list of walking-path coordinates that the ant needs to visit.  Maybe this list could be provided by another path-finding AI, or a classical control scheme.

In [None]:
my_list = [(3,3),(0,3),(-6,-6),(0,-6),(9,9),(0,9),(0,0)]

In [None]:
for target_x, target_y in my_list:
    env.robot.walk_target_x = target_x
    env.robot.walk_target_y = target_y
    path_done = False
    counter_i = 0
    while not path_done:
        action = policy.select_action(np.array(state))
        state, reward, done, _ = env.step(action)
        time.sleep(1. / 100.)  # comment out to run at max local system speed
        counter_i += 1
        # Give up on this target after 500 steps
        if counter_i > 500:
            path_done = True

        # Move on once the ant is within 0.2 units of the target
        dist = np.linalg.norm([target_x - env.robot.body_xyz[0], target_y - env.robot.body_xyz[1]])
        if dist < 0.2:
            path_done = True
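
The waypoint logic above (set a target, step until close enough or until a step budget runs out) can also be written as a reusable function that reports which targets were actually reached. This is a sketch under assumptions: `follow_waypoints` and `ToyWalker` are hypothetical names, and the toy walker replaces the ant and policy so the example runs without PyBullet.

```python
import math


def follow_waypoints(waypoints, set_target, advance, get_position,
                     tolerance=0.2, max_steps=500):
    """Visit each waypoint in turn; return (target, reached, steps) per waypoint."""
    results = []
    for target in waypoints:
        set_target(target)
        reached, steps = False, 0
        while steps < max_steps and not reached:
            advance()
            steps += 1
            x, y = get_position()
            reached = math.hypot(target[0] - x, target[1] - y) < tolerance
        results.append((target, reached, steps))
    return results


class ToyWalker:
    """Hypothetical point robot that moves 0.1 units per step toward its target."""

    def __init__(self):
        self.position = [0.0, 0.0]
        self.target = (0.0, 0.0)

    def set_target(self, target):
        self.target = target

    def advance(self):
        dx = self.target[0] - self.position[0]
        dy = self.target[1] - self.position[1]
        dist = math.hypot(dx, dy)
        if dist > 0:
            step = min(0.1, dist)
            self.position[0] += step * dx / dist
            self.position[1] += step * dy / dist

    def get_position(self):
        return tuple(self.position)


walker = ToyWalker()
results = follow_waypoints([(3, 3), (0, 3)], walker.set_target,
                           walker.advance, walker.get_position)
```

For the ant, `set_target` would assign `env.robot.walk_target_x` and `env.robot.walk_target_y`, `advance` would run one `select_action`/`step` pair, and `get_position` would read the first two entries of `env.robot.body_xyz`.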

Don't let all that power go to your head... poor little ant.

Try changing the list up a few times and see the ant run different routes. 

>**Can you make a path that Ant will follow?**

This particular policy is from the 5 million time-step custom ant with no reward modification. It will mostly get the job done, but there will be a few instances where it fails to reach the next point (this is why we give each target point a 500 time-step time-out).

>**For fun, try taking one of your modified ants with custom reward and send it through a similar challenge.**

- What happens if you manually tweak some of the robot's internal state values as it's moving?
- Is the ant robust to observation signal noise?  How might you modify the training course so it would handle real-world sensor noise/errors/corruptions that it might encounter (as if the policy were placed into a real robot)?
- What modifications might you want to make to the environment that the ant is trained in? 
- The observables it sees - would they be general enough to handle the 'real world'?
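
One way to start probing the sensor-noise question is to corrupt the observation before the policy sees it. The sketch below adds zero-mean Gaussian noise using only the standard library; `add_sensor_noise` and `with_sensor_noise` are hypothetical helpers, and Gaussian noise is just one assumed noise model among many.

```python
import random


def add_sensor_noise(obs, sigma, rng):
    """Return a copy of obs with zero-mean Gaussian noise added to every entry."""
    return [float(x) + rng.gauss(0.0, sigma) for x in obs]


def with_sensor_noise(predict, sigma, seed=0):
    """Wrap a predict-style function so it only ever sees noisy observations."""
    rng = random.Random(seed)

    def wrapped(obs):
        return predict(add_sensor_noise(obs, sigma, rng))

    return wrapped


clean = [0.1, -0.2, 0.3]
noisy_a = add_sensor_noise(clean, 0.05, random.Random(1))
noisy_b = add_sensor_noise(clean, 0.05, random.Random(1))
```

For the ant, something like `with_sensor_noise(policy.select_action, sigma=0.05)` could replace the direct call, converting the noisy list back with `np.array` if the policy needs it. Increasing `sigma` until the ant stops reaching its targets gives a rough picture of how much noise the policy tolerates.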

There are lots of things to consider and weigh when building out your RL environment, training, and how you use the policy, but hopefully by this point you can start to answer some of these questions and think about what you might do, yourself.  Best of luck!