# DLAD Exercise 2: Multitask Learning

### Working with code

- fork the original exercise template repository under your user account, make sure that the fork remains private
- enable [mirroring](https://about.gitlab.com/blog/2016/12/01/how-to-keep-your-fork-up-to-date-with-its-origin/) so that changes to the template (the upstream repository) propagate into your fork. This matters if we find a bug in the template and push a fix: without mirroring, the fix will not reach your fork
- add your fork to SageMaker repositories (see course presentation on AWS or check [here](https://aws.amazon.com/blogs/machine-learning/amazon-sagemaker-notebooks-now-support-git-integration-for-increased-persistence-collaboration-and-reproducibility/))
- clone your fork to your PC, edit code, push changes back to your fork

### Running experiments

- The code is pulled from a linked repository _automatically_ only when you start the Notebook Instance. Subsequent pulls have to be done from the Terminal (see below)
- The only code cell in this notebook launches a new Training Job and trains a model until the job completes
- Check the live TensorBoard from time to time to make sure training is progressing. In most cases it is enough to confirm that the first training steps did not fail
- If you regret the hyperparameter choice, you can stop the job by visiting the [SageMaker Training Jobs Console](https://console.aws.amazon.com/sagemaker/home?region=us-east-1#/jobs), selecting the job in progress, and pressing the Stop button. The job takes a couple of minutes to transition into the Stopped state; until then it counts against your limits. After that you can stop the notebook cell. *Stopping the notebook cell alone does not terminate the Training Job; it only disconnects the notebook from it*
- Each Training Job has a very simple lifecycle: it either runs until the end, thus resulting in a `submission.zip` file, or it fails/stops, in which case it cannot be resumed.
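
The console steps above can also be scripted. The following is a minimal sketch, assuming a client that behaves like `boto3.client("sagemaker")` with its `describe_training_job` and `stop_training_job` calls; the helper name `stop_and_wait` and its polling parameters are our own and not part of the exercise template.

```python
import time

# Terminal states reported in the "TrainingJobStatus" field.
TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def stop_and_wait(client, job_name, poll_seconds=15, max_polls=40):
    """Request a stop for `job_name`, then poll until it reaches a terminal state.

    `client` is assumed to behave like boto3.client("sagemaker").
    Returns the final status, or the last status seen if polling ran out.
    """
    status = client.describe_training_job(TrainingJobName=job_name)["TrainingJobStatus"]
    if status in TERMINAL_STATES:
        return status  # nothing to stop; a finished job cannot be resumed either
    if status == "InProgress":
        client.stop_training_job(TrainingJobName=job_name)
    for _ in range(max_polls):
        status = client.describe_training_job(TrainingJobName=job_name)["TrainingJobStatus"]
        if status in TERMINAL_STATES:
            break
        time.sleep(poll_seconds)
    return status


# Usage inside the Notebook Instance (needs AWS credentials and a real job name):
# import boto3
# print(stop_and_wait(boto3.client("sagemaker"), "<your-job-name>"))
```

While the helper reports a non-terminal status, the job still counts against your limits, which matches the behaviour described for the console.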

### Incremental changes

- (optional) With code changes
  - change code in your standalone environment
  - dry-run the code in your standalone environment to make sure it doesn't fail on a syntax error; this saves you the Training Job initialization time (on the order of 5 minutes)
  - push code to your private fork
  - open the Terminal within your Notebook Instance (Jupyter Logo in top-left corner -> New -> Terminal), navigate to your code folder
  - execute `git pull` to fetch the changes you pushed to your fork
- (optional) With hyperparameter changes
  - update the hyperparameters dictionary in the cell below. Its entries are the parameters you would normally pass on the command line as `--key <value>` in the standalone environment. All command line parameters are described in `mtl/utils/config.py`.
- Assign a meaningful yet short name to your experiment. A good name lets you quickly recall what the experiment was about when looking at a list of 20 other experiments. Stick to letters, digits, and hyphens, since the name becomes part of the Training Job name. Example: `question1-lr-1e-3-optimizer-adam`
- Run the cell
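
To illustrate how dictionary entries relate to `--key <value>` arguments, here is a minimal sketch. The parser is a hypothetical stand-in: the parameter subset and the default values are made up for illustration, and the real definitions are in `mtl/utils/config.py`. The helpers `to_argv` and `str2bool` are likewise our own names.

```python
import argparse


def str2bool(text):
    """Parse 'True'/'False'-style strings, since values arrive as text."""
    if text.lower() in ("true", "1", "yes"):
        return True
    if text.lower() in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {text!r}")


def to_argv(hyperparameters):
    """Flatten a hyperparameters dict into ['--key', 'value', ...]."""
    argv = []
    for key, value in hyperparameters.items():
        argv += [f"--{key}", str(value)]
    return argv


def build_parser():
    # Hypothetical subset with made-up defaults; see mtl/utils/config.py for the real ones.
    parser = argparse.ArgumentParser()
    parser.add_argument("--batch_size", type=int, default=4)
    parser.add_argument("--num_epochs", type=int, default=16)
    parser.add_argument("--tensorboard_daemon_start", type=str2bool, default=False)
    return parser


hyperparameters = {"batch_size": 8, "tensorboard_daemon_start": True}
args = build_parser().parse_args(to_argv(hyperparameters))
print(args.batch_size, args.num_epochs, args.tensorboard_daemon_start)  # 8 16 True
```

Keys left out of the dictionary keep their defaults, so only the values you want to change need to be listed.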

### Running multiple experiments simultaneously

Each account is likely to have a limit of more than one ml.p2.xlarge instance, so training multiple experiments in parallel is possible. To do that, we recommend running several separate Notebook Instances (as many as your limit allows), each linked to the same repository. This way each Notebook Instance corresponds to one Training Job at a time.

## Start a Training Job

- "In [\*]" next to the cell means the Training Job is running
- "In [any digit]" means the Training Job has finished, or the cell was disconnected from the job (either manually or by restarting/reconnecting to the notebook)

In [None]:
hyperparameters={
    # enable tensorboard server to run alongside the training code (you need this to track progress)
    'tensorboard_daemon_start': True,
    # enable forwarding tensorboard to your ngrok.com account (you need this to access tensorboard)
    'ngrok_daemon_start': True,
    # TODO: use your personal ngrok.com account authtoken to access tensorboard
    'ngrok_auth_token': 'your token string here',
    
    # TODO: override configurable experiment parameters below this line
    # 'batch_size': <new value>,
    # 'num_epochs': ...,
    # ...
}

# TODO: give name to distinct experiments, e.g. 'batch-size-4--newideaX-on--someparam-off'
experiment_name = 'experiment-meaningful-description'

#
# no changes below this line
#
import datetime
import sagemaker
from sagemaker.pytorch import PyTorch

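estimator = PyTorch(
    source_dir='./',
    entry_point='train_sagemaker.py',
    role=sagemaker.get_execution_role(),
    framework_version='1.4.0',
    train_instance_count=1,
    train_instance_type='ml.p2.xlarge',
    train_volume_size=30,           # storage volume size in GB
    train_use_spot_instances=True,  # managed spot training: cheaper, but may be interrupted
    train_max_run=86000,            # maximum training time in seconds
    train_max_wait=86400,           # maximum total time in seconds, including the wait for spot capacity
    debugger_hook_config=False,     # disable the SageMaker Debugger hook
    hyperparameters=hyperparameters,
)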

estimator.fit(
    {'training': 's3://dlad-miniscapes/miniscapes.zip'},
    job_name=experiment_name + datetime.datetime.now().strftime("-%Y-%m-%d-%H-%M-%S"),
)
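
The job name in the cell above is the experiment name followed by a timestamp. The sketch below shows one way to check a name before launching, assuming that SageMaker accepts only letters, digits, and hyphens in Training Job names and limits them to 63 characters (worth verifying against the current API reference). The helper `make_job_name` is hypothetical and not part of the template.

```python
import datetime
import re

MAX_JOB_NAME_LEN = 63                    # assumed SageMaker limit for job names
TIMESTAMP_FORMAT = "-%Y-%m-%d-%H-%M-%S"  # same suffix as in the cell above


def make_job_name(experiment_name, now=None):
    """Build '<experiment name>-<timestamp>' using only letters, digits, and hyphens."""
    now = now or datetime.datetime.now()
    suffix = now.strftime(TIMESTAMP_FORMAT)
    # Replace every run of other characters (dots, underscores, spaces) with one hyphen.
    clean = re.sub(r"[^a-zA-Z0-9]+", "-", experiment_name).strip("-")
    if not clean:
        raise ValueError("experiment name has no usable characters")
    # Truncate the descriptive part so that the timestamp suffix always fits.
    clean = clean[: MAX_JOB_NAME_LEN - len(suffix)].rstrip("-")
    return clean + suffix


fixed_time = datetime.datetime(2020, 4, 1, 12, 30, 0)
print(make_job_name("q1-lr-0.001-adam", fixed_time))  # q1-lr-0-001-adam-2020-04-01-12-30-00
```

Since the timestamp suffix takes 20 characters, about 43 characters remain for the descriptive part under the assumed limit.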