# Optuna Tutorial for 10x Dataset
This notebook shows how to train new models with our code and how to run an automated hyperparameter optimization with Optuna.

In [1]:
import scanpy as sc
import os
import optuna
import yaml

import sys
sys.path.append('../')
import tcr_embedding as tcr
from tcr_embedding.constants import DONOR_SPECIFIC_ANTIGENS
from tcr_embedding.utils_training import init_model
from tcr_embedding.evaluation.WrapperFunctions import get_model_prediction_function
from tcr_embedding.evaluation.Imputation import run_imputation_evaluation

In [2]:
%load_ext autoreload
%autoreload 2

## USER CHOICES

In [3]:
DONOR_NR = '1'  # options: ['1', '2']
MODEL_TYPE = 'concat'  # options: ['RNA', 'TCR', 'concat', 'PoE']

### Load Data
First, download the 10x dataset and preprocess it according to the notebook in "../preprocessing/10x_preprocessing.ipynb". Then load the preprocessed data.

In [4]:
adata = sc.read_h5ad('../data/10x_CD8TC/v6_supervised.h5ad')

This example uses cells from donor 1

In [5]:
adata = adata[adata.obs['donor'] == 'donor_'+DONOR_NR]
# only use the most common and known antigen specificities; copy the view so that .obs can be modified without a warning
adata = adata[adata.obs['binding_name'].isin(DONOR_SPECIFIC_ANTIGENS[DONOR_NR])].copy()
adata.obs['binding_name'] = adata.obs['binding_name'].astype(str)


## Manual hyperparameter configuration
This example shows how to use manually determined hyperparameters to initialize and train models. The hyperparameters can be specified in a config file that we load in.

In [6]:
# Use a context manager so the config file is closed after reading
with open('../config/transformer.yaml', 'r') as f:
    params = yaml.safe_load(f)
params

{'seq_model_arch': 'Transformer',
 'seq_model_hyperparams': {'embedding_size': 32,
  'num_heads': 4,
  'forward_expansion': 2,
  'encoding_layers': 2,
  'decoding_layers': 2,
  'dropout': 0.1},
 'scRNA_model_arch': 'MLP',
 'scRNA_model_hyperparams': {'gene_hidden': [1000],
  'activation': 'leakyrelu',
  'output_activation': 'relu',
  'dropout': 0.2,
  'batch_norm': True},
 'hdim': 800,
 'activation': 'leakyrelu',
 'dropout': 0.2,
 'batch_norm': True,
 'shared_hidden': [200],
 'zdim': 100,
 'lr': 0.0001,
 'batch_size': 256,
 'losses': ['MSE', 'CE'],
 'loss_weights': [0.1, 1.0, 5e-05]}
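
The three entries of `loss_weights` are positional, so it can help to name them before training. The following is a minimal sketch, assuming the order [RNA, TCR, KL] given in the training parameters of this notebook; `name_loss_weights` is a hypothetical helper and not part of `tcr_embedding`.

```python
def name_loss_weights(loss_weights):
    """Map the positional loss weights to names (hypothetical helper)."""
    names = ('rna', 'tcr', 'kl')  # assumed order: [RNA, TCR, KL]
    if len(loss_weights) != len(names):
        raise ValueError(f'expected {len(names)} loss weights, got {len(loss_weights)}')
    return dict(zip(names, loss_weights))


# Values copied from the config shown above
weights = name_loss_weights([0.1, 1.0, 5e-05])
print(weights)
```

A check like this catches a config with a missing KL weight before a long training run starts.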

In [7]:
model = init_model(params, model_type=MODEL_TYPE, adata=adata, dataset_name='10x')

**Training parameters**:
- experiment_name (str): A name for the experiment, used to name the saved weights
- n_epochs (int): Maximum number of epochs to train
- batch_size (int): Batch size
- lr (float): Learning rate
- loss_weights (list of 3 floats): Weighting of the losses, loss_weights[0] := RNA loss, loss_weights[1] := TCR loss, loss_weights[2] := KL loss
- early_stop (int): Patience, i.e. the number of epochs without improvement of the validation loss before training stops early. Training does not stop before KL annealing is finished, i.e. before 30% of n_epochs
- balanced_sampling (str or None): If not None, the adata.obs key of the attribute to balance the data on. Used to balance clonotypes; do not use it for the RNA-only model!
- save_path (str): path to save the model weights
- num_workers (int): Number of workers for PyTorch dataloader
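
The interplay between `early_stop` and KL annealing described above can be sketched as follows. This is illustrative logic only, not the implementation in `tcr_embedding`; the function `should_stop` and its argument `annealing_fraction` are hypothetical names, and the 30% value is taken from the parameter description.

```python
def should_stop(epoch, best_epoch, n_epochs, patience, annealing_fraction=0.3):
    """Return True if training should stop early (illustrative logic only)."""
    # Never stop while KL annealing is assumed to be still running
    if epoch < annealing_fraction * n_epochs:
        return False
    # Stop once `patience` epochs have passed without a new best validation loss
    return epoch - best_epoch >= patience


n_epochs = 100
patience = n_epochs // 10  # same rule as `early_stop=N_EPOCHS // 10` in this notebook

# Inside the annealing phase: no stop, even though 13 epochs passed without improvement
print(should_stop(epoch=15, best_epoch=2, n_epochs=n_epochs, patience=patience))
# After annealing, 5 epochs without improvement: keep training
print(should_stop(epoch=35, best_epoch=30, n_epochs=n_epochs, patience=patience))
# After annealing, 15 epochs without improvement: stop
print(should_stop(epoch=45, best_epoch=30, n_epochs=n_epochs, patience=patience))
```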

In [8]:
# Please increase N_EPOCHS for real training purposes. Setting it to a low value is only for demonstration purposes.
N_EPOCHS = 100

In [9]:
model.train(
    experiment_name=f'{MODEL_TYPE}_donor_{DONOR_NR}_manual',
    n_epochs=N_EPOCHS,
    batch_size=params['batch_size'],
    lr=params['lr'],
    loss_weights=params['loss_weights'],
    early_stop=N_EPOCHS // 10,
    balanced_sampling='clonotype',
    save_path='../saved_models',
    num_workers=0
)

Create Dataloader
Dataloader created


Epoch:  86%|████████████████████████████████▋     | 87/101 [10:28<01:41,  7.23s/it, best_epoch=11, best_f1_score=0.815]

Early stopped





The weights with the best kNN score and best reconstruction loss are saved and can be loaded for downstream tasks.

In [10]:
file_path = os.path.join('../saved_models', f'{MODEL_TYPE}_donor_{DONOR_NR}_manual_best_knn_model.pt')
model.load(file_path)
test_embedding_func = get_model_prediction_function(model, batch_size=1024)  # helper function for evaluation functions

summary = run_imputation_evaluation(adata, test_embedding_func, query_source='test', use_non_binder=False, use_reduced_binders=True, num_neighbors=5)
print(f"kNN weighted f1-score: {summary['knn']['weighted avg']['f1-score']}")

kNN weighted f1-score: 0.5849673701390998
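
For readers unfamiliar with the metric, a weighted F1 score is the mean of the per-class F1 scores, weighted by the number of true samples per class. The key names `'weighted avg'` and `'f1-score'` resemble the dictionary output of scikit-learn's `classification_report`, but that is an assumption about how `summary` is built. The sketch below computes the metric by hand on made-up labels; `weighted_f1` is a hypothetical helper.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)  # number of true samples per class
    total = 0.0
    for label, n_true in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        n_pred = sum(p == label for p in y_pred)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += n_true * f1
    return total / len(y_true)


# Made-up labels, purely for illustration
y_true = ['A', 'A', 'A', 'B']
y_pred = ['A', 'A', 'B', 'B']
print(weighted_f1(y_true, y_pred))
```

Because of the weighting, frequent antigen specificities dominate the score, which is worth keeping in mind when classes are imbalanced.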


## Optuna - automated hyperparameter search
Instead of relying on hyperparameter values from experience and gut feeling, the hyperparameters can be tuned in an automated fashion. In this tutorial we use Optuna (https://optuna.org/).

### Objective function for Optuna
To perform a hyperparameter search, an objective function needs to be defined. The objective takes an Optuna trial object, which suggests hyperparameter values. The function then runs the whole training and evaluation loop and returns the resulting metric, so that Optuna knows which performance the current hyperparameters reach.

In [11]:
def objective(trial):
    """
    Objective function for Optuna
    :param trial: Optuna Trial Object
    """
    # suggest_params is a global function, defined later on
    params = suggest_params(trial)

    # create directory to save model weights for each trial
    save_path = f'../optuna/{name}/trial_{trial.number}'
    os.makedirs(save_path, exist_ok=True)

    model = init_model(params, model_type=MODEL_TYPE, adata=adata, dataset_name='10x')
    
    # Train Model
    model.train(
        experiment_name=name,
        n_epochs=N_EPOCHS,
        batch_size=params['batch_size'],
        lr=params['lr'],
        loss_weights=params['loss_weights'], # [] or list of floats storing weighting of loss in order [scRNA, TCR, KLD]
        early_stop=N_EPOCHS // 10,
        balanced_sampling='clonotype',
        save_path=save_path,
        num_workers=0,
    )

    # in case kNN failed from the beginning
    if model.best_knn_metric == -1:
        return None

    return model.best_knn_metric

In [12]:
name = f'{MODEL_TYPE}_donor_{DONOR_NR}'

import sys
import importlib
sys.path.append('../config_optuna')

# Loads the suggest_params function from the file in ../config_optuna/ whose name matches `name`, i.e. MODEL_TYPE and DONOR_NR
suggest_params = importlib.import_module(name.lower()).suggest_params

In [13]:
# For demonstration purposes set to a low value
N_TRIALS = 3

If the Optuna database file already exists (e.g. because a search was started previously), Optuna will raise an error when the study is created. Either delete the database file manually or uncomment the corresponding lines in the next cell.

In [14]:
os.makedirs('../optuna/', exist_ok=True)

# Uncomment to delete the previous database file and restart the hyperparameter optimization. Be careful: this deletes the previous state!
# if os.path.exists(f'../optuna/{name}.db'):
#     os.remove(f'../optuna/{name}.db')

# Create study object
study = optuna.create_study(study_name=name, storage=f'sqlite:///../optuna/{name}.db', direction='maximize')

# Starts hyperparameter optimization
study.optimize(objective, n_trials=N_TRIALS)

[I 2021-06-02 10:28:07,226] A new study created in RDB with name: concat_donor_1


Create Dataloader


Epoch:   0%|                                                                                   | 0/101 [00:00<?, ?it/s]

Dataloader created


Epoch: 100%|██████████████████████████████████████| 101/101 [10:12<00:00,  6.06s/it, best_epoch=0, best_f1_score=0.347]
[I 2021-06-02 10:38:20,699] Trial 0 finished with value: 0.3471632371949116 and parameters: {'dropout': 0.25, 'activation': 'linear', 'rna_hidden': 2000, 'hdim': 100, 'shared_hidden': 100, 'rna_num_layers': 3, 'tfmr_encoding_layers': 2, 'loss_weights_seq': 2.460285025312969, 'loss_weights_kl': 0.2657602125175934, 'batch_size': 256, 'lr': 4.753226911948138e-05, 'tfmr_embedding_size': 16, 'tfmr_num_heads': 2, 'tfmr_forward_expansion': 4, 'tfmr_dropout': 0.05, 'zdim': 50}. Best is trial 0 with value: 0.3471632371949116.


Create Dataloader


Epoch:   0%|                                                                                   | 0/101 [00:00<?, ?it/s]

Dataloader created


Epoch: 100%|█████████████████████████████████████| 101/101 [16:45<00:00,  9.96s/it, best_epoch=57, best_f1_score=0.541]
[I 2021-06-02 10:55:07,881] Trial 1 finished with value: 0.5412424072270924 and parameters: {'dropout': 0.25, 'activation': 'leakyrelu', 'rna_hidden': 2000, 'hdim': 200, 'shared_hidden': 100, 'num_layers': 2, 'rna_num_layers': 1, 'tfmr_encoding_layers': 3, 'loss_weights_seq': 0.11974500375153953, 'loss_weights_kl': 1.8403759623483707e-06, 'batch_size': 256, 'lr': 0.0005345086266269178, 'tfmr_embedding_size': 32, 'tfmr_num_heads': 4, 'tfmr_forward_expansion': 4, 'tfmr_dropout': 0.2, 'zdim': 20}. Best is trial 1 with value: 0.5412424072270924.


Create Dataloader


Epoch:   0%|                                                                                   | 0/101 [00:00<?, ?it/s]

Dataloader created


Epoch:  42%|███████████████▊                      | 42/101 [02:55<04:06,  4.18s/it, best_epoch=21, best_f1_score=0.834]
[I 2021-06-02 10:58:04,817] Trial 2 finished with value: 0.8339651561361385 and parameters: {'dropout': 0.2, 'activation': 'linear', 'rna_hidden': 1750, 'hdim': 500, 'shared_hidden': 500, 'rna_num_layers': 1, 'tfmr_encoding_layers': 2, 'loss_weights_seq': 5.48978826979483, 'loss_weights_kl': 0.0018328574596167392, 'batch_size': 512, 'lr': 0.0006459063331390998, 'tfmr_embedding_size': 16, 'tfmr_num_heads': 2, 'tfmr_forward_expansion': 2, 'tfmr_dropout': 0.0, 'zdim': 60}. Best is trial 2 with value: 0.8339651561361385.


Early stopped


In [15]:
print(f'{len(study.trials)} trials have finished\n')

best_trial = study.best_trial
print(f'Best Trial: {best_trial.number}')
print(f'Best weighted f1-score on the val set: {best_trial.value}')
print('Using the following parameters:')
for key, value in best_trial.params.items():
    print(f'    {key}: {value}')
print(f'\nYou can find the saved weights here: "../optuna/{name}/trial_{best_trial.number}/{name}_best_knn_model.pt"')

3 trials have finished

Best Trial: 2
Best weighted f1-score on the val set: 0.8339651561361385
Using the following parameters:
    activation: linear
    batch_size: 512
    dropout: 0.2
    hdim: 500
    loss_weights_kl: 0.0018328574596167392
    loss_weights_seq: 5.48978826979483
    lr: 0.0006459063331390998
    rna_hidden: 1750
    rna_num_layers: 1
    shared_hidden: 500
    tfmr_dropout: 0.0
    tfmr_embedding_size: 16
    tfmr_encoding_layers: 2
    tfmr_forward_expansion: 2
    tfmr_num_heads: 2
    zdim: 60

You can find the saved weights here: "../optuna/concat_donor_1/trial_2/concat_donor_1_best_knn_model.pt"
