# Training notebook
---

This notebook shows example usage of the provided code to collect expert/prior data and replicate the experiments from the paper.

## Collect expert/prior data

In [None]:
from train_expert import train_expert
from collect_expert_data import collect_expert_data
from collect_prior_data import collect_prior_data

**Train an 'expert' agent and collect expert trajectories in the source environment:**

Tested source environments:

1) *Inverted Pendulum* realm:
- *InvertedPendulum-v2*
- *InvertedDoublePendulum-v2*

2) *Reacher* realm:
- *ReacherEasy-v2*
- *ThreeReacherEasy-v2*

3) *Hopper* realm:
- *Hopper-v2*

4) *Half-Cheetah* realm:
- *HalfCheetah-v2*

5) *7DOF-Pusher* realm:
- *PusherHumanSim-v2*

6) *7DOF-Striker* realm:
- *StrikerHumanSim-v2*

In [None]:
# Define source environment
env_name = 'InvertedPendulum-v2'

# Train expert agent
expert_agent = train_expert(env_name=env_name)

# Collect expert trajectories
collect_expert_data(agent=expert_agent,
                    env_name=env_name,
                    max_timesteps=10000,
                    expert_samples_location='expert_data')

**Collect prior trajectories for the realms:**

Tested realms:

1) *Inverted Pendulum* realm: *Inverted Pendulum/InvertedPendulum*

2) *Reacher* realm: *Reacher*

3) *Hopper* realm: *Hopper*

4) *Half-Cheetah* realm: *Half-Cheetah/HalfCheetah*

5) *7DOF-Pusher* realm: *7DOF-Pusher/Pusher*

6) *7DOF-Striker* realm: *7DOF-Striker/Striker*

In [None]:
# Define realm
env_realm = 'Inverted Pendulum'

# Collect prior data
collect_prior_data(realm_name=env_realm,
                   max_timesteps=10000,
                   prior_samples_location='prior_data')
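Each tested source environment belongs to exactly one realm, so the two lists above can be tied together in a small lookup table. The sketch below is a convenience only and is not part of the provided code: the names `ENV_TO_REALM` and `realm_for` are hypothetical, and the realm strings use the first spelling listed for each realm.

```python
# Hypothetical lookup table (not part of the provided code): maps each tested
# source environment to the realm it belongs to, following the lists above.
ENV_TO_REALM = {
    'InvertedPendulum-v2': 'Inverted Pendulum',
    'InvertedDoublePendulum-v2': 'Inverted Pendulum',
    'ReacherEasy-v2': 'Reacher',
    'ThreeReacherEasy-v2': 'Reacher',
    'Hopper-v2': 'Hopper',
    'HalfCheetah-v2': 'Half-Cheetah',
    'PusherHumanSim-v2': '7DOF-Pusher',
    'StrikerHumanSim-v2': '7DOF-Striker',
}


def realm_for(env_name):
    """Return the realm name for a tested source environment."""
    if env_name not in ENV_TO_REALM:
        raise ValueError(f"Unknown source environment: {env_name!r}")
    return ENV_TO_REALM[env_name]


print(realm_for('InvertedDoublePendulum-v2'))
```

With such a helper, the realm passed to `collect_prior_data` could be derived from `env_name` (for example `realm_name=realm_for(env_name)`) instead of being typed by hand.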

## Perform *Observational* Imitation

### *DisentanGAIL* models

In [None]:
from run_experiment import run_experiment

**Replicating the experiments**

We provide a function for collecting data with different variations of the *DisentanGAIL* algorithm.

The parameters of this function are divided into three dictionaries. Below, we briefly describe the parameters most relevant to reproducing the *DisentanGAIL* experiments with the hyper-parameters given in the supplementary material (*Appendix B*). Parameters not described below can be kept constant across all *source/target* environments.

Please refer to our code (mainly *disentangail_models.py*) for further details and additional options.

1) **exp_params**: defines the general parameters of the algorithm:

>Specify ***exp_name*** as the location where to log the results
>
>---

>Specify ***exp_num*** as the experiment number
>
>---

>Specify ***env_name*** as the *source* environment name:
>
>
>
>For the *Inverted Pendulum* realm:
>>*InvertedPendulum-v2*
>>
>>*InvertedDoublePendulum-v2*

>For the *Reacher* realm:
>>*ReacherEasy-v2*
>>
>>*ThreeReacherEasy-v2*

>For the *Hopper* realm:
>>*Hopper-v2*

>For the *Half-Cheetah* realm:
>>*HalfCheetah-v2*

>For the *7DOF-Pusher* realm:
>>*PusherHumanSim-v2*

>For the *7DOF-Striker* realm:
>>*StrikerHumanSim-v2*

>---

>Specify ***env_type*** as the domain difference in the *target* environment:
>
>
>
>For the *Inverted Pendulum* realm:
>>*InvertedPendulum-v2*: *expert/colored/to_two/to_colored_two*
>>
>>*InvertedDoublePendulum-v2*: *expert/colored/to_one/to_colored_one*

>For the *Reacher* realm:
>>*ReacherEasy-v2*: *expert/tilted/to_three/to_tilted_three*
>>
>>*ThreeReacherEasy-v2*: *expert/tilted/to_two/to_tilted_two*

>For the *Hopper* realm:
>>*Hopper-v2*: *expert/flexible*

>For the *Half-Cheetah* realm:
>>*HalfCheetah-v2*: *expert/to_locked_feet*

>For the *7DOF-Pusher* realm:
>>*PusherHumanSim-v2*: *expert/to_robot*

>For the *7DOF-Striker* realm:
>>*StrikerHumanSim-v2*: *expert/to_robot*
>
>---

>Specify ***epochs*** as the maximum number of epochs to run the experiment.
>
>---

>Specify ***episode_limit*** as the task horizon.
>
>---

>Specify ***return_threshold*** as an optional early-termination condition: learning stops once the desired performance is reached.
>
>---

2) **learner_params**: defines the parameters of the 'observer' agent's policy:

>Specify ***l_buffer_size*** as the maximum size of the buffer of visual trajectories collected by the agent
>
>---

3) **discriminator_params**: defines the parameters of the discriminator:

>Specify ***d_mi_constant*** as the initial penalty coefficient for the expert demonstrations constraint (set to *0.0* to disable the expert demonstrations constraint) 
>
>---

>Specify ***d_prior_mi_constant*** as the initial penalty coefficient for the prior data constraint (set to *0.0* to disable the prior data constraint) 
>
>---

>Specify ***d_pre_filters*** as a list with the number of filters in each layer of the preprocessor (followed by a Tanh nonlinearity and 2x2 max-pooling)
>
>---

>Specify ***d_hidden_units*** as a list with the number of units within each hidden fully-connected layer of the invariant discriminator (followed by a ReLU/Tanh nonlinearity)
>
>---

>Specify ***d_mi_hidden_units*** as a list with the number of units within each hidden fully-connected layer of the statistics network (followed by a Tanh nonlinearity)
>
>---

>Specify ***n_expert_demos/n_expert_prior_demos/n_agent_prior_demos*** as the number of *expert demonstrations/prior expert data/prior agent data* to utilize for learning.
>
>---
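The `env_name`/`env_type` combinations listed under **exp_params** can be written down as a lookup table and checked before launching a long run. This is a minimal sketch, not part of the provided code: `VALID_ENV_TYPES` and `check_env_pair` are hypothetical names, and the table assumes that the slash-separated options listed for each environment are the alternative accepted values.

```python
# Hypothetical table (not part of the provided code): the env_type values
# listed above for each source environment, assuming that the slash-separated
# options are alternatives.
VALID_ENV_TYPES = {
    'InvertedPendulum-v2': ('expert', 'colored', 'to_two', 'to_colored_two'),
    'InvertedDoublePendulum-v2': ('expert', 'colored', 'to_one', 'to_colored_one'),
    'ReacherEasy-v2': ('expert', 'tilted', 'to_three', 'to_tilted_three'),
    'ThreeReacherEasy-v2': ('expert', 'tilted', 'to_two', 'to_tilted_two'),
    'Hopper-v2': ('expert', 'flexible'),
    'HalfCheetah-v2': ('expert', 'to_locked_feet'),
    'PusherHumanSim-v2': ('expert', 'to_robot'),
    'StrikerHumanSim-v2': ('expert', 'to_robot'),
}


def check_env_pair(env_name, env_type):
    """Raise ValueError unless env_type is listed for env_name."""
    if env_name not in VALID_ENV_TYPES:
        raise ValueError(f"Unknown source environment: {env_name!r}")
    allowed = VALID_ENV_TYPES[env_name]
    if env_type not in allowed:
        raise ValueError(
            f"env_type {env_type!r} is not listed for {env_name!r}; "
            f"listed options: {', '.join(allowed)}"
        )


# A listed combination passes silently.
check_env_pair('InvertedPendulum-v2', 'colored')
```

Calling `check_env_pair(exp_params['env_name'], exp_params['env_type'])` before `run_experiment` would catch a mistyped combination early, under the assumptions stated above.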

In [None]:
# Define parameters

exp_params = {
    'exp_name': 'InvertedPendulum_to_colored/DisentanGAIL',
    'expert_samples_location': 'expert_data',
    'prior_samples_location': 'prior_data',
    'env_name': 'InvertedPendulum-v2',
    'env_type': 'colored',
    'exp_num': -1,
    'epochs': 20,
    'test_runs_per_epoch': 5,
    'steps_per_epoch': 1000,
    'init_random_samples': 5000,
    'training_starts': 512,
    'episode_limit': 50,
    'visualize_collected_observations': True,
}

learner_params = {
    'l_type': 'SAC',
    'l_buffer_size': 10000,
    'l_exploration_noise': 0.2,
    'l_learning_rate': 1e-3,
    'l_batch_size': 256,
    'l_updates_per_step': 1,
    'l_act_delay': 1,
    'l_gamma': 0.99,
    'l_polyak': 0.995,
    'l_entropy_coefficient': 0.2,
    'l_tune_entropy_coefficient': True,
    'l_target_entropy': None,
    'l_clip_actor_gradients': False,
}

discriminator_params = {
    'd_type': 'latent',
    'd_loss': 'ce',
    'd_rew': 'mixed',
    'd_rew_noise': False,
    'd_learning_rate': 1e-3,
    'd_mi_learning_rate': 1e-3,
    'd_updates_per_step': 1,
    'd_mi_updates_per_step': 1,
    'd_e_batch_size': 128,
    'd_l_batch_size': 128,
    'd_stability_constant':  1e-7,
    'd_sn_discriminator': True,
    'd_mi_constant': 0.5,
    'd_adaptive_mi': True,
    'd_double_mi': True,
    'd_use_min_double_mi': True,
    'd_max_mi': 0.99,
    'd_min_mi': 0.99/2,
    'd_use_dual_mi': False,
    'd_mi_lagrangian_lr': 1e-3,
    'd_max_mi_constant': 5.0,
    'd_min_mi_constant': 1e-4,
    'd_unbiased_mi': True,
    'd_unbiased_mi_decay': .99,
    'd_prior_mi_constant': 1.0,
    'd_negative_priors': True,
    'd_max_mi_prior': 0.001,
    'd_min_mi_prior_constant': 1e-3,
    'd_clip_mi_predictions': True,
    'd_pre_filters': [16, 16, 1],
    'd_hidden_units': [32, 32],
    'd_mi_hidden_units': [32, 32],
    'd_pre_scale_stddev': 0.5,
    'n_expert_demos': 10000,
    'n_expert_prior_demos': 10000,
    'n_agent_prior_demos': 10000,
}



In [None]:
# Run 5 repetitions of the experiment

for i in range(5):
    exp_params['exp_num'] = i
    gail, sampler = run_experiment(exp_params, learner_params, discriminator_params)
    
    # uncomment below to visualize the learnt behaviour at the end of each experiment
    
    # sampler.sample_test_trajectories(gail._agent, 0.0, 1, True)
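To compare variations of the algorithm, such as disabling one of the two constraints by setting its initial penalty coefficient to *0.0* as described above, it can help to derive each variant from a shared base dictionary instead of editing it in place. The sketch below is illustrative only: `make_variant` is a hypothetical helper, the base dictionary is a small excerpt of the discriminator parameters, and whether other flags also need changing for a given variation is not covered here.

```python
import copy


def make_variant(base_params, **overrides):
    """Return a deep copy of base_params with the given keys overridden.

    Hypothetical helper (not part of the provided code). Unknown keys are
    rejected so that a typo does not silently create a new parameter.
    """
    unknown = set(overrides) - set(base_params)
    if unknown:
        raise KeyError(f"Unknown parameter(s): {sorted(unknown)}")
    variant = copy.deepcopy(base_params)
    variant.update(overrides)
    return variant


# Small excerpt of the discriminator parameters, used here for illustration.
base = {
    'd_mi_constant': 0.5,
    'd_prior_mi_constant': 1.0,
    'd_hidden_units': [32, 32],
}

# Variants with one constraint disabled (coefficient set to 0.0).
no_expert_constraint = make_variant(base, d_mi_constant=0.0)
no_prior_constraint = make_variant(base, d_prior_mi_constant=0.0)
```

Because each variant is a deep copy, changing a list such as `d_hidden_units` in one variant leaves the base dictionary and the other variants untouched.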

### *Domain confusion loss* models

In [None]:
# Note: this rebinds the name run_experiment, replacing the DisentanGAIL version imported above
from run_experiment_dc_loss import run_experiment

**Replicating the *domain confusion loss* results**

We provide a function for collecting data with different algorithms that use our implementation of the *domain confusion loss*.

The parameters of this function are similar to the *DisentanGAIL* parameters described above. Please refer to our code (mainly *disentangail_dc_loss_models.py*) for further details.


In [None]:
# Define parameters

exp_params = {
    'exp_name': 'InvertedPendulum_to_colored/DisentanGAIL_dc_loss',
    'expert_samples_location': 'expert_data',
    'prior_samples_location': 'prior_data',
    'env_name': 'InvertedPendulum-v2',
    'env_type': 'colored',
    'exp_num': -1,
    'epochs': 20,
    'test_runs_per_epoch': 5,
    'steps_per_epoch': 1000,
    'init_random_samples': 5000,
    'training_starts': 512,
    'episode_limit': 50,
    'visualize_collected_observations': True,
}

learner_params = {
    'l_type': 'SAC',
    'l_buffer_size': 10000,
    'l_learning_rate': 1e-3,
    'l_batch_size': 256,
    'l_updates_per_step': 1,
    'l_act_delay': 1,
    'l_gamma': 0.99,
    'l_polyak': 0.995,
    'l_entropy_coefficient': 0.2,
    'l_tune_entropy_coefficient': True,
    'l_target_entropy': None,
    'l_clip_actor_gradients': False,
}

discriminator_params = {
    'd_type': 'latent',
    'd_domain_constant': 0.25,
    'd_rew': 'mixed',
    'd_rew_noise': False,
    'd_learning_rate': 1e-3,
    'd_updates_per_step': 1,
    'd_stability_constant':  1e-7,
    'd_e_batch_size': 128,
    'd_l_batch_size': 128,
    'd_sn_discriminator': True,
    'd_use_prior_data': True,
    'd_pre_filters': [16, 16, 1],
    'd_hidden_units': [32, 32],
    'd_pre_scale_stddev': 0.5,
    'n_expert_demos': 10000,
    'n_expert_prior_demos': 10000,
    'n_agent_prior_demos': 10000,
}

In [None]:
# Run 5 repetitions of the experiment

for i in range(5):
    exp_params['exp_num'] = i
    gail, sampler = run_experiment(exp_params, learner_params, discriminator_params)
    
    # uncomment below to visualize the learnt behaviour at the end of each experiment
    
    # sampler.sample_test_trajectories(gail._agent, 0.0, 1, True)