### TODO
- Learning goals: Use lower level items as goals.  I.e. the method used to solve the problem.  Structure with subproblems and one story - subproblems become learning outcomes
- Consider places where learners write some code (but always give them the code in md so they can copy/paste if needed)
- see other todos inline

# Day 1, Part B: Faster Learning

## Learning goals
- 

**TODO: Add learning goals**

## Definitions
- **Simulation environment**: Note that this is not the same as the Python/conda environment.  The simulation environment is the simulated world where the reinforcement learning takes place.  It provides opportunities for an agent to learn and explore, and ideally provides challenges that aid in efficient learning.
- **Agent (aka actor or policy)**: An entity in the simulation environment that performs actions.  The agent could be a person, a robot, a car, a thermostat, etc.
- **State variable**: An observed variable in the simulation environment.  They can be coordinates of objects or entities, an amount of fuel in a tank, air temperature, wind speed, etc.
- **Action variable**: An action that the agent can perform.  Examples: step forward, increase velocity to 552.5 knots, push object left with force of 212.3 N, etc.
- **Reward**: A value given to the agent for doing something considered to be 'good'.  Reward is commonly assigned at each time step and accumulated over a learning episode.
- **Episode**: A learning event consisting of multiple steps in which the agent can explore.  It starts with the unmodified environment and continues until the goal is achieved or something prevents further progress, such as a robot getting stuck in a hole.  Multiple episodes are typically run in loops until the model is fully trained.
- **Model (aka policy or agent)**: An RL model is composed of the modeling architecture (e.g., neural network) and parameters or weights that define the unique behavior of the model.
- **Policy (aka model or agent)**: The parameters of a model that encode the best choices to make in an environment.  The choices are not necessarily good ones until the model undergoes training.  The policy (or model) is the "brain" of the agent.
- **Replay Buffer**: A place in memory to store state, action, reward and other variables describing environmental state transitions. It is effectively the agent's memory of past experiences.

![Reinforcement Learning Cycle](./images/Reinforcement-learning-diagram-01.png)
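
To make the definitions above concrete, here is a minimal sketch of the cycle using only the Python standard library. `ToyLineEnv`, `run_episode`, and the reward values are hypothetical and chosen for illustration; they are not part of Gym or Stable-Baselines3.

```python
import random
from collections import deque


class ToyLineEnv:
    """Hypothetical 1-D world: the agent starts at 0 and must reach position 3."""

    def __init__(self, goal=3, max_steps=20):
        self.goal = goal
        self.max_steps = max_steps

    def reset(self):
        self.position = 0
        self.steps = 0
        return self.position  # the state variable observed by the agent

    def step(self, action):
        # action is -1 (step back) or +1 (step forward)
        self.position += action
        self.steps += 1
        reached = self.position == self.goal
        reward = 1.0 if reached else -0.1  # small penalty for every extra step
        done = reached or self.steps >= self.max_steps
        return self.position, reward, done


def run_episode(env, policy, replay_buffer):
    """Run one episode, store each transition, and return the accumulated reward."""
    state = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = policy(state)
        next_state, reward, done = env.step(action)
        replay_buffer.append((state, action, reward, next_state, done))
        total_reward += reward
        state = next_state
    return total_reward


rng = random.Random(0)
random_policy = lambda state: rng.choice([-1, 1])  # an untrained policy
replay_buffer = deque(maxlen=1000)  # oldest experiences are dropped first

episode_reward = run_episode(ToyLineEnv(), random_policy, replay_buffer)
print(f"transitions stored: {len(replay_buffer)}, episode reward: {episode_reward:.1f}")
```

Training would replace `random_policy` with a model whose parameters are updated from the transitions stored in the replay buffer.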

## Options for speeding up learning
- Multiprocessing (multiple instances of CartPole).  Also give examples of algorithms designed for shared learning (MATD3?)
- Better reward functions (discussed in other notebooks)
- Other tricks (e.g., move agent farther from target as learning progresses)
- Add attention or other innovative algorithm enhancements

**TODO: write out the bullets above**
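
As one possible reading of the "move agent farther from target" bullet, the sketch below grows the agent's starting distance linearly as training progresses. The function name `start_distance` and the distance bounds are assumptions made for illustration, not values from this course.

```python
def start_distance(progress, min_distance=1.0, max_distance=10.0):
    """Linearly grow the agent's starting distance from the target.

    progress: fraction of training completed, clamped to [0, 1].
    """
    progress = max(0.0, min(1.0, progress))
    return min_distance + progress * (max_distance - min_distance)


total_episodes = 5
for episode in range(total_episodes):
    progress = episode / (total_episodes - 1)
    print(f"episode {episode}: start {start_distance(progress):.2f} units from target")
```

Early episodes start close to the target so that rewards are easy to find; later episodes start farther away, once the agent has something to build on.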

### Multiprocessing and Shared Learning

Because RL focuses on decision sequences in which an agent's actions influence future environment states, a single episode is inherently sequential, which makes it difficult to parallelize computations and speed up learning.  Running several identical agents side by side, however, allows them to share what they learn.  Stable-Baselines3, which works with OpenAI Gym environments, supports this through vectorized environments: it creates multiple identical environments, and giving each one a different random seed makes each begin learning in a different state.  These environments can run in separate processes on a single computer at the same time, and the experience collected in all of them is used to train the same model.
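
The seeding idea can be sketched without any RL library. The example below is a sequential stand-in, not real multiprocessing: `make_toy_env` and the dictionary-based "environment" are hypothetical, and only the `seed + rank` pattern mirrors the notebook code.

```python
import random


def make_toy_env(rank, seed=0):
    """Return a factory for a toy 'environment' seeded with seed + rank."""
    def _init():
        rng = random.Random(seed + rank)
        # The start state stands in for, e.g., a randomized initial pole angle.
        return {"rank": rank, "rng": rng, "state": rng.uniform(-0.05, 0.05)}
    return _init


# Build four copies; a real vectorized environment would run each in its own process.
envs = [make_toy_env(rank)() for rank in range(4)]

shared_experience = []  # transitions from every copy end up in one pool
for step in range(3):
    for env in envs:
        next_state = env["state"] + env["rng"].uniform(-0.01, 0.01)
        shared_experience.append((env["rank"], env["state"], next_state))
        env["state"] = next_state

start_states = [t[1] for t in shared_experience[:4]]
print(len(shared_experience), "transitions from", len(envs), "environments")
print("start states:", start_states)
```

Because each copy has its own seed, the copies start in different states and explore different parts of the state space, while the same `seed` and `rank` always reproduce the same start state.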

In [None]:
import os
import gym
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv, SubprocVecEnv
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.utils import set_random_seed
from tqdm import trange

This function helps us make multiple environments:

In [None]:
def make_env(env_id, rank, seed=0):
    """
    Utility function for multiprocessed env.

    :param env_id: (str) the environment ID
    :param rank: (int) index of the subprocess
    :param seed: (int) the initial seed for the RNG
    """
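    def _init():
        # Each subprocess calls this function to build its own environment.
        env = gym.make(env_id)
        # Offset the seed by rank so that every environment starts differently.
        env.seed(seed + rank)
        return env

    # Seed the global random number generators for reproducibility.
    set_random_seed(seed)

    return _init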

In [None]:
env_id = "CartPole-v1"

Let's use multiple CPUs, but avoid using all of them so that the computer remains responsive.

In [None]:
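# os.cpu_count() can return None if the count cannot be determined
num_cpu = os.cpu_count() or 1
if num_cpu > 2:
    num_cpu -= 1  # leave one CPU free for the rest of the system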

Create the vectorized environment

In [None]:
env = SubprocVecEnv([make_env(env_id, i) for i in range(num_cpu)])

Start learning then display the result

In [None]:
model = PPO('MlpPolicy', env, verbose=1)
model.learn(total_timesteps=25000)

This will render the trained agent acting in all of the environments:

In [None]:
obs = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
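
With a vectorized environment, `rewards` and `dones` hold one entry per environment, and Stable-Baselines3 resets a sub-environment automatically when its episode ends. A minimal sketch of tracking per-environment episode returns follows; the helper name `update_episode_returns` and the made-up step outputs are assumptions for illustration.

```python
def update_episode_returns(running_returns, rewards, dones):
    """Add this step's rewards to each environment's running total.

    Returns the totals of episodes that just finished and resets those slots.
    """
    finished = []
    for i, (reward, done) in enumerate(zip(rewards, dones)):
        running_returns[i] += float(reward)
        if done:
            finished.append(running_returns[i])
            running_returns[i] = 0.0
    return finished


# Made-up step outputs for two environments (stand-ins for env.step results).
fake_steps = [
    ([1.0, 1.0], [False, False]),
    ([1.0, 1.0], [False, True]),  # environment 1 finishes after 2 steps
    ([1.0, 1.0], [True, False]),  # environment 0 finishes after 3 steps
]

running = [0.0, 0.0]
completed = []
for rewards, dones in fake_steps:
    completed.extend(update_episode_returns(running, rewards, dones))

print(completed)
```

In the loop above, the same helper could be called with the `rewards` and `dones` returned by `env.step(action)`, and the collected returns plotted afterwards.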


**TODO: Show training loss plot and reward plot**


You can close the render windows by restarting the kernel.

**TODO: Figure out how to close the env windows without restarting the kernel**

**Maybe don't include parking environment**