# RECOVER Codebase Tutorial
This notebook is a tutorial on using the RECOVER codebase. We will go over how to set up a new experiment and how to launch it. We will also detail the configuration settings that can be changed and the different ways to use the datasets, models, and experimental setups.

## Logical Structure of the Codebase
The RECOVER codebase makes heavy use of [Ray Tune](https://docs.ray.io/en/latest/tune/index.html). The code implements a `tune.Trainable` object which is used for training. By default, a run using RECOVER constructs one of these `Trainable` classes (either a `BasicTrainer`, used for the simple supervised regression task, or an `ActiveTrainer`, which implements active learning), and Tune then runs the experiment. Given an object whose class is a subclass of `tune.Trainable`, Tune repeatedly calls the object's `step` method, with the understanding that `step` contains all of the optimization logic.
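
To make the `step` contract concrete, here is a minimal sketch that does not import Ray. `ToyTrainable` and `run_trial` are hypothetical names invented for illustration: the first stands in for a `tune.Trainable` subclass, the second imitates the way Tune repeatedly calls `step`. It is not the RECOVER `BasicTrainer`, and the toy quadratic objective is an assumption chosen only so the loop has something to optimize.

```python
from typing import Dict, List


class ToyTrainable:
    """Hypothetical stand-in for a `tune.Trainable` subclass (Ray is not imported)."""

    def __init__(self, config: Dict[str, float]) -> None:
        self.setup(config)

    def setup(self, config: Dict[str, float]) -> None:
        # Ray Tune's `Trainable` exposes a `setup(config)` hook for this purpose.
        self.lr = config["lr"]
        self.weight = 0.0  # single parameter to fit
        self.target = 3.0  # toy optimum (illustrative assumption)

    def step(self) -> Dict[str, float]:
        # One unit of optimization: a gradient step on (weight - target)^2.
        grad = 2.0 * (self.weight - self.target)
        self.weight -= self.lr * grad
        # The returned dictionary is what would be logged as metrics.
        return {"loss": (self.weight - self.target) ** 2}


def run_trial(trainable: ToyTrainable, max_iterations: int) -> List[Dict[str, float]]:
    """Hypothetical driver imitating Tune: call `step` until the iteration budget is spent."""
    history = []
    for iteration in range(1, max_iterations + 1):
        metrics = trainable.step()
        metrics["training_iteration"] = iteration
        history.append(metrics)
    return history


history = run_trial(ToyTrainable({"lr": 0.1}), max_iterations=20)
print(history[0]["loss"], history[-1]["loss"])
```

The point of the design is that the trainer owns all optimization logic inside `step`, while the outer loop (Tune in practice) only decides how many times to call it and what to do with the returned metrics.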

In effect, this repeated-`step` contract is how the RECOVER code works. Each experiment run with the RECOVER pipeline requires a configuration object, defined in a Python file in the `config` directory and assigned to the variable `configuration`. We will go deeper into how to set up a configuration later; for now, note that example config files can be found in the `config` directory.

We now walk through how an experiment runs, from start to finish, using the RECOVER pipeline. First, a `Trainable` subclass is instantiated. During its initialization, a dataset object is built from the configuration parameters. The dataset object holds the [DrugComb](https://drugcomb.fimm.fi/) dataset, altered in specific ways to enable different random splits based on your configuration. If you are running an active learning trial with the `ActiveTrainer`, it also splits the training set into two parts: an initial seen set, which is visible to the learner, and an unseen set, which the learner cannot access directly but may query, adding the queried combinations to its seen set for further learning.
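
The seen/unseen initialization can be sketched as follows. This is an illustration under assumptions, not the `ActiveTrainer` code: `split_seen_unseen` and `n_initial_seen` are hypothetical names, and the examples are represented by plain integer indices.

```python
import random
from typing import List, Sequence, Tuple


def split_seen_unseen(
    train_indices: Sequence[int], n_initial_seen: int, seed: int
) -> Tuple[List[int], List[int]]:
    """Shuffle the training indices and cut them into (seen, unseen)."""
    shuffled = list(train_indices)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    return shuffled[:n_initial_seen], shuffled[n_initial_seen:]


train_indices = range(100)
seen, unseen = split_seen_unseen(train_indices, n_initial_seen=10, seed=0)
print(len(seen), len(unseen))  # 10 90
```

Seeding the shuffle means that two runs with the same `seed` start from the same seen set, which makes different query strategies comparable.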

Next, Tune takes control of training and repeatedly calls `step` on the `Trainable` object. With the `BasicTrainer`, the `step` method does one optimization pass over the entire training set and computes metrics on both the full training set and the validation set. With the `ActiveTrainer`, the `step` method optimizes over the set of all drug combinations seen so far, using a held-out set of combinations for early stopping. It then computes acquisition scores over the unseen drug combinations and chooses the next batch of combinations to query and unmask. Finally, `ActiveTrainer.step` computes metrics on the seen set, along with several metrics that track the quality of the active learning queries themselves. Note that Tune automatically logs all metrics returned by the `step` method to TensorBoard.
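
The query part of an active learning step can be sketched as below. Everything here is a simplified assumption: `select_batch` and `active_step` are hypothetical names, greedy top-k selection is swapped in for whatever acquisition strategy RECOVER actually uses, training with early stopping is omitted, and the score function is a toy.

```python
from typing import Callable, Dict, List, Tuple


def select_batch(unseen: List[int], scores: Dict[int, float], batch_size: int) -> List[int]:
    """Greedy acquisition: take the `batch_size` unseen items with the highest score."""
    ranked = sorted(unseen, key=lambda index: scores[index], reverse=True)
    return ranked[:batch_size]


def active_step(
    seen: List[int],
    unseen: List[int],
    score_fn: Callable[[int], float],
    batch_size: int,
) -> Tuple[List[int], List[int], Dict[str, float]]:
    # 1. Training on `seen` with early stopping would happen here (omitted).
    # 2. Score every unseen combination with the acquisition function.
    scores = {index: score_fn(index) for index in unseen}
    # 3. Choose the next batch and move it from the unseen set to the seen set.
    chosen = select_batch(unseen, scores, batch_size)
    chosen_set = set(chosen)
    new_seen = seen + chosen
    new_unseen = [index for index in unseen if index not in chosen_set]
    # 4. Report metrics, including one that describes the query itself.
    metrics = {
        "n_seen": float(len(new_seen)),
        "mean_acquired_score": sum(scores[i] for i in chosen) / max(len(chosen), 1),
    }
    return new_seen, new_unseen, metrics


def toy_score(index: int) -> float:
    """Toy acquisition score (illustrative assumption)."""
    return float(index % 3)


new_seen, new_unseen, metrics = active_step([0, 1], [2, 3, 4, 5, 6], toy_score, batch_size=2)
print(new_seen, new_unseen, metrics)
```

Calling `active_step` repeatedly, each time feeding back the returned seen and unseen sets, gives the overall active learning loop that Tune drives through `step`.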

Since running with Tune requires Ray and can take slightly longer to set up, we also provide the option to run without Tune by setting `use_tune = False` in the configuration. This lets things run slightly faster, but you lose Tune's easy hyperparameter search and its automatic logging.

Next we'll look at an example run of the RECOVER code, before finally going over what each configuration controls in more detail.

## An Example Run of RECOVER
Since Ray is a distributed processing library and this tutorial runs in a Jupyter notebook (`ipynb`), we're going to run without Ray Tune. However, we generally recommend using Ray Tune, as it provides many benefits whether or not you are running on a cluster.

Let's begin by importing what we need.

In [5]:
from recover.train import train
from recover.datasets.drugcomb_matrix_data import DrugCombMatrix
from recover.models.models import Baseline
from recover.models.predictors import BilinearFilmMLPPredictor, BilinearMLPPredictor
from recover.utils.utils import get_project_root
from recover.train import train_epoch, eval_epoch, BasicTrainer
import os
from ray import tune
from importlib import import_module

########################################################################################################################
# Configuration
########################################################################################################################


pipeline_config = {
    "use_tune": False,
    "num_epoch_without_tune": 50,  
    "seed": 0,
    # Optimizer config
    "lr": 1e-4,
    "weight_decay": 1e-2,
    "batch_size": 128,
    # Train epoch and eval_epoch to use
    "train_epoch": train_epoch,
    "eval_epoch": eval_epoch,
}

predictor_config = {
    "predictor": BilinearMLPPredictor,
    "predictor_layers":
        [
            2048,
            128,
            64,
            1,
        ],
    "merge_n_layers_before_the_end": 2,  # Computation on the sum of the two drug embeddings for the last n layers
    "allow_neg_eigval": True,
}

model_config = {
    "model": Baseline,
    "load_model_weights": False,
}

dataset_config = {
    "dataset": DrugCombMatrix,
    "study_name": 'ALMANAC',
    "in_house_data": 'without',
    "val_set_prop": 0.2,
    "test_set_prop": 0.1,
    "split_valid_train": "pair_level",
    "cell_line": 'MCF7',  
    "target": "bliss_max",
    "fp_bits": 1024,
    "fp_radius": 2
}

########################################################################################################################
# Configuration that will be loaded
########################################################################################################################

configuration = {
    "trainer": BasicTrainer, 
    "trainer_config": {
        **pipeline_config,
        **predictor_config,
        **model_config,
        **dataset_config,
    }
}

This configuration tells us to run 50 epochs of training on the ALMANAC dataset restricted to the MCF7 cell line. We split the dataset by drug pair, so that no drug pair appears in more than one of the training, validation, and test sets. The parameters `fp_bits` and `fp_radius` control the Morgan fingerprint computed for each drug in a combination; the fingerprints are used as part of the features for each drug pair.
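
A pair-level split of this kind can be sketched as below. This is an illustration, not the `DrugCombMatrix` implementation: `pair_level_split` is a hypothetical function, drug names are toy strings, and it assumes that `val_prop` and `test_prop` are proportions of the number of unique pairs, which the real code may define differently.

```python
import random
from itertools import combinations
from typing import Dict, List, Tuple


def pair_level_split(
    rows: List[Tuple[str, str]], val_prop: float, test_prop: float, seed: int
) -> Dict[str, List[int]]:
    """Assign row indices to train/val/test so each drug pair lands in exactly one split."""
    # Group rows by unordered pair, so (A, B) and (B, A) count as the same pair.
    rows_by_pair: Dict[Tuple[str, ...], List[int]] = {}
    for index, (drug_a, drug_b) in enumerate(rows):
        key = tuple(sorted((drug_a, drug_b)))
        rows_by_pair.setdefault(key, []).append(index)

    # Shuffle the unique pairs (not the rows) and cut them into three groups.
    pairs = sorted(rows_by_pair)
    random.Random(seed).shuffle(pairs)
    n_test = round(test_prop * len(pairs))
    n_val = round(val_prop * len(pairs))
    pairs_by_split = {
        "test": pairs[:n_test],
        "val": pairs[n_test:n_test + n_val],
        "train": pairs[n_test + n_val:],
    }
    return {
        name: sorted(i for pair in chosen for i in rows_by_pair[pair])
        for name, chosen in pairs_by_split.items()
    }


# Toy data: 10 unique pairs among 5 drugs, each measured twice (once in each order).
drugs = ["d0", "d1", "d2", "d3", "d4"]
rows = []
for drug_a, drug_b in combinations(drugs, 2):
    rows.append((drug_a, drug_b))
    rows.append((drug_b, drug_a))

splits = pair_level_split(rows, val_prop=0.2, test_prop=0.1, seed=0)
print({name: len(indices) for name, indices in splits.items()})
```

Splitting on pairs rather than on rows is what prevents repeated measurements of the same combination from leaking between the training and evaluation sets.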
We'll be running without active learning (hence our use of the `BasicTrainer`), and we'll be using a 4-layer multi-layer perceptron (MLP) as our predictor. To make the predictor's drug combination predictions symmetric (i.e., for drugs $A, B$ we require that $f(A,B) = f(B,A)$ for predictor $f$), we merge the drug representations into a combined representation; the layer of the MLP at which this is done is set by the parameter `merge_n_layers_before_the_end`. We are using a `BilinearMLPPredictor`, which computes predictions via $h_1 W^T W h_2$, where $h_1, h_2$ are the representations of drugs $A$ and $B$ respectively. Next we'll call `train` to begin training. Note that we're calling `train` directly only for demonstration purposes. Normally it is better to run `python train.py` from the command line with the config passed as an argument, and with Tune enabled. We're only disabling Tune here because we're working in a Jupyter notebook, where Ray Tune makes less sense.
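
As a brief aside before launching training, the symmetry of a bilinear form of the shape $h_1 W^T W h_2$ can be checked numerically. The sketch below is an assumption-laden illustration: it treats the form as scalar-valued, uses a small hand-written matrix, and does not model `allow_neg_eigval` or the layers that follow the merge in the actual `BilinearMLPPredictor`. `matvec` and `bilinear_score` are hypothetical helper names.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, h: Vector) -> Vector:
    """Plain matrix-vector product W h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def bilinear_score(W: Matrix, h1: Vector, h2: Vector) -> float:
    """Compute h1^T W^T W h2, i.e. the dot product of the projections W h1 and W h2."""
    p1, p2 = matvec(W, h1), matvec(W, h2)
    return sum(a * b for a, b in zip(p1, p2))


W = [[1.0, 2.0, 0.0], [0.0, -1.0, 3.0]]  # toy weights (illustrative)
h_a = [1.0, 0.0, 2.0]  # toy representation of drug A
h_b = [0.5, 1.0, -1.0]  # toy representation of drug B

print(bilinear_score(W, h_a, h_b), bilinear_score(W, h_b, h_a))
```

Because $W^T W$ is a symmetric matrix, swapping the two drugs leaves the score unchanged, which is the property $f(A,B) = f(B,A)$ required above.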

In [6]:
train(configuration)

Initializing regular training pipeline
Dataset loaded.
4463 drug comb experiments among 149 drugs
	 fingeprints with radius 2 and nbits 1024
	 drug features dimension 1173
	 1 cell-lines
model initialized randomly
Baseline(
  (criterion): MSELoss()
  (predictor): BilinearMLPPredictor(
    (before_merge_mlp): Sequential(
      (0): LinearModule(in_features=1173, out_features=2048, bias=True)
      (1): ReLUModule()
      (2): LinearModule(in_features=2048, out_features=128, bias=True)
      (3): ReLUModule()
    )
    (after_merge_mlp): Sequential(
      (0): LinearModule(in_features=128, out_features=64, bias=True)
      (1): ReLUModule()
      (2): LinearModule(in_features=64, out_features=1, bias=True)
    )
  )
)
Training {'loss_mean': 137.83404617309571, 'comb_r_squared': 0.008456869497901563}
Testing {'loss_mean': 96.26114654541016, 'comb_r_squared': 0.08663332884535772, 'spearman': 0.2269973471653043} 

Training {'loss_mean': 73.85206871032715, 'comb_r_squared': 0.109599766165787

We can see that by the end we arrive at an $R^2$ value of ~$0.497$ on the test set and a Spearman rank correlation of $0.509$. When we run without Tune, as we do here, results are only printed to the command line as shown above. When we run with Tune, the results can be accessed via TensorBoard, or with Wandb by following the instructions [here](https://docs.wandb.ai/guides/integrations/other/ray-tune). To do the same run as above but with Tune, we simply need to put the `configuration` we wrote in a file and make a few changes to it. A configuration like the one above, but usable with Tune, can be found in the cell below.

In [8]:
from recover.datasets.drugcomb_matrix_data import DrugCombMatrix
from recover.models.models import Baseline
from recover.models.predictors import BilinearFilmMLPPredictor, BilinearMLPPredictor
from recover.utils.utils import get_project_root
from recover.train import train_epoch, eval_epoch, BasicTrainer
import os
from ray import tune
from importlib import import_module

pipeline_config = {
    "use_tune": True,
    "seed": 0,
    # Optimizer config
    "lr": 1e-4,
    "weight_decay": 1e-2,
    "batch_size": 128,
    # Train epoch and eval_epoch to use
    "train_epoch": train_epoch,
    "eval_epoch": eval_epoch,
}

predictor_config = {
    "predictor": BilinearMLPPredictor,
    "predictor_layers":
        [
            2048,
            128,
            64,
            1,
        ],
    "merge_n_layers_before_the_end": 2,  # Computation on the sum of the two drug embeddings for the last n layers
    "allow_neg_eigval": True,
}

model_config = {
    "model": Baseline,
    "load_model_weights": False,
}

dataset_config = {
    "dataset": DrugCombMatrix,
    "study_name": 'ALMANAC',
    "in_house_data": 'without',
    "val_set_prop": 0.2,
    "test_set_prop": 0.1,
    "split_valid_train": "pair_level",
    "cell_line": 'MCF7', 
    "target": "bliss_max",
    "fp_bits": 1024,
    "fp_radius": 2
}

########################################################################################################################
# Configuration that will be loaded
########################################################################################################################

configuration = {
    "trainer": BasicTrainer,
    "trainer_config": {
        **pipeline_config,
        **predictor_config,
        **model_config,
        **dataset_config,
    },
    "summaries_dir": os.path.join(get_project_root(), "RayLogs"),
    "memory": 1800,
    "stop": {"training_iteration": 1000, 'patience': 10},
    "checkpoint_score_attr": 'eval/comb_r_squared',
    "keep_checkpoints_num": 1, # This means we'll only checkpoint the model with the best R^2 on the validation set
    "checkpoint_at_end": False,
    "checkpoint_freq": 1,
    "resources_per_trial": {"cpu": 8, "gpu": 0},
}

To run with Tune, copy the above into a file named, say, `my_config.py` and then launch the experiment by running `python train.py -c my_config` from the command line. The `train.py` script invokes Ray, sets up the experiment and, finally, launches it. Ray prints the name and log directory of the experiment, so to view the results in TensorBoard one can simply run:

```
$ cd <log_dir_printed_by_ray>
$ tensorboard --logdir=. --host localhost --port 8080
```
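
How a config name such as `my_config` might be resolved into a `configuration` dictionary can be sketched as follows. This is a guess at the mechanism, not the contents of `train.py`: `load_configuration` and `parse_args` are hypothetical names, the `--config` long option is an assumption (only `-c` appears above), and the real script may look the module up inside the `config` package rather than on `sys.path`. The demo writes a throwaway module to a temporary directory so that the block is self-contained.

```python
import argparse
import importlib
import sys
import tempfile
from pathlib import Path
from typing import List


def parse_args(argv: List[str]) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Launch an experiment from a config module")
    parser.add_argument("-c", "--config", required=True, help="config module name, without .py")
    return parser.parse_args(argv)


def load_configuration(module_name: str) -> dict:
    """Import the named module and return its `configuration` variable."""
    module = importlib.import_module(module_name)
    return module.configuration


# Demo with a throwaway config module (hypothetical name and contents).
with tempfile.TemporaryDirectory() as tmp_dir:
    Path(tmp_dir, "my_demo_config.py").write_text(
        'configuration = {"trainer_config": {"use_tune": True, "seed": 0}}\n'
    )
    sys.path.insert(0, tmp_dir)
    importlib.invalidate_caches()
    try:
        args = parse_args(["-c", "my_demo_config"])
        loaded = load_configuration(args.config)
    finally:
        sys.path.remove(tmp_dir)

print(loaded)
```

Keeping the configuration in an importable Python module, rather than in a data file, is what allows entries such as `"trainer": BasicTrainer` to refer directly to classes and functions.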