# Kopp et al. 2021 Training

**Authorship:**
Adam Klie, *08/07/2022*
***
**Description:**
This notebook is a use case from the EUGENe paper. It showcases the Janggu functionality integrated into EUGENe by training models to classify JunD binding sites.
***

In [1]:
if 'autoreload' not in get_ipython().extension_manager.loaded:
    %load_ext autoreload 
%autoreload 2

import os
import logging
import torch
import numpy as np
import pandas as pd
import eugene as eu

Global seed set to 13


GPU is available: True
Number of GPUs: 1
Current GPU: 0
GPUs: Quadro RTX 5000




In [2]:
# Configure EUGENe 
eu.settings.dataset_dir = "/cellar/users/aklie/data/eugene/kopp21"
eu.settings.output_dir = "/cellar/users/aklie/projects/EUGENe/EUGENe_paper/output/kopp21"
eu.settings.logging_dir = "/cellar/users/aklie/projects/EUGENe/EUGENe_paper/logs/kopp21"
eu.settings.config_dir = "/cellar/users/aklie/projects/EUGENe/EUGENe_paper/configs/kopp21"
eu.settings.verbosity = logging.ERROR

# Load in the `SeqData`

In [3]:
sdata = eu.dl.read_h5sd(
    filename=os.path.join(eu.settings.dataset_dir, "jund_train_processed.h5sd"),
)
sdata

SeqData object with = 948771 seqs
seqs = (948771,)
names = (948771,)
rev_seqs = (948771,)
ohe_seqs = (948771, 500, 4)
ohe_rev_seqs = (948771, 500, 4)
seqs_annot: 'chr', 'end', 'start', 'target', 'train_test', 'train_val'
pos_annot: None
seqsm: None
uns: None
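
The `rev_seqs` and `ohe_rev_seqs` fields presumably hold the reverse complement of each sequence, which is passed to the models alongside the forward strand later in this notebook. As a minimal sketch of that operation, assuming plain DNA strings (the helper `reverse_complement` is hypothetical and not part of the EUGENe API):

```python
# Translation table mapping each DNA base to its complement; N stays N.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (hypothetical helper)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


forward = "ACGTTG"
reverse = reverse_complement(forward)
print(forward, reverse)
```

Applying the function twice returns the original sequence, which is a quick sanity check for any reverse-complement implementation.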

# Model initialization 

In [4]:
from pytorch_lightning import seed_everything
def prep_new_model(
    seed,
    arch,
    config
):
    # Instantiate the model
    model = eu.models.load_config(
        arch=arch,
        model_config=config
    )

    # Set a seed
    seed_everything(seed)
    
    # Initialize the model prior to conv filter initialization
    eu.models.base.init_weights(model)

    # Return the model
    return model 
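
In `prep_new_model`, the seed is set after the model is instantiated but before `init_weights` runs, so the weights drawn during initialization depend only on the seed. The sketch below illustrates that ordering with the standard library alone; `init_weights` and `prep_weights` here are hypothetical stand-ins, not the EUGENe functions, and the global `random` module plays the role of the seeded generators.

```python
import random


def init_weights(n):
    """Hypothetical stand-in for weight initialization: draws n values from the global RNG."""
    return [random.gauss(0.0, 1.0) for _ in range(n)]


def prep_weights(seed, n=4):
    """Seed first, then initialize, mirroring the order used in prep_new_model."""
    random.seed(seed)
    return init_weights(n)


same_a = prep_weights(1)
same_b = prep_weights(1)
different = prep_weights(2)

print(same_a == same_b)     # identical seeds give identical initial weights
print(same_a == different)  # a different seed gives a different draw
```

This is why each trial in the training loop can use its trial number as the seed: repeating a trial reproduces its starting point, while different trials start from different weights.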

In [5]:
# Just make sure the model is taking in the proper data
model_types = ["FCN", "CNN", "RNN", "Hybrid", "Kopp21CNN"]
model_names = ["dsFCN", "dsCNN", "dsRNN", "dsHybrid", "Kopp21CNN"]
for model_name, model_type in zip(model_names, model_types):
    print(model_name, model_type)
    model = prep_new_model(0, model_type, os.path.join(eu.settings.config_dir, f"{model_name}.yaml"))
    if model_type == "RNN":
        sdataloader = sdata.to_dataset(transform_kwargs={"transpose": False}).to_dataloader() 
    else:
        sdataloader = sdata.to_dataset(transform_kwargs={"transpose": True}).to_dataloader()
    test_seqs = next(iter(sdataloader))
    print(model(test_seqs[1], test_seqs[2]).size())
    print()

dsFCN FCN


Global seed set to 0


No transforms given, assuming just need to tensorize).
torch.Size([128, 1])

dsCNN CNN


Global seed set to 0


No transforms given, assuming just need to tensorize).
torch.Size([128, 1])

dsRNN RNN


Global seed set to 0


No transforms given, assuming just need to tensorize).
torch.Size([128, 1])

dsHybrid Hybrid


Global seed set to 0


No transforms given, assuming just need to tensorize).
torch.Size([128, 1])

Kopp21CNN Kopp21CNN


Global seed set to 0


No transforms given, assuming just need to tensorize).
torch.Size([128, 1])
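
The check above passes `transpose=False` for the RNN and `transpose=True` for every other architecture. One plausible reading, stated here as an assumption, is that the stored one-hot arrays are length-first, (500, 4) per sequence, that recurrent layers consume this layout directly, and that convolutional layers expect channels first, (4, 500). The pure-Python sketch below shows the two layouts on a toy sequence; `one_hot` and `transpose` are hypothetical helpers, not EUGENe functions.

```python
ALPHABET = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a length-first matrix with one row per position."""
    return [[1 if base == letter else 0 for letter in ALPHABET] for base in seq]


def transpose(matrix):
    """Swap rows and columns, turning (length, 4) into (4, length)."""
    return [list(column) for column in zip(*matrix)]


length_first = one_hot("ACGTA")            # 5 rows, 4 columns
channels_first = transpose(length_first)   # 4 rows, 5 columns

print(len(length_first), len(length_first[0]))
print(len(channels_first), len(channels_first[0]))
print(channels_first[0])  # the "A" channel across all positions
```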



In [None]:
# Train each of the 5 architectures with 5 different random initializations
model_types = ["FCN", "CNN", "RNN", "Hybrid", "Kopp21CNN"]
model_names = ["dsFCN", "dsCNN", "dsRNN", "dsHybrid", "Kopp21CNN"]
trials = 5
for model_name, model_type in zip(model_names, model_types):
    for trial in range(1, trials+1):
        print(f"{model_name} trial {trial}")

        # Initialize the model
        model = prep_new_model(
            arch=model_type, 
            config=os.path.join(eu.settings.config_dir, f"{model_name}.yaml"),
            seed=trial
        )

        # RNNs take length-first input; all other architectures need the transpose
        if model_type == "RNN":
            t_kwargs = {"transpose": False}
        else:
            t_kwargs = {"transpose": True}

        # Train the model
        eu.train.fit(
            model=model, 
            sdata=sdata, 
            gpus=1, 
            target="target",
            train_key="train_val",
            epochs=30,
            early_stopping_metric="val_loss",
            early_stopping_patience=5,
            transform_kwargs=t_kwargs,
            batch_size=64,
            num_workers=4,
            name=model_name,
            seed=trial,
            version=f"trial_{trial}",
            verbosity=logging.ERROR
        )
        # Get predictions on the training and validation sets
        eu.settings.dl_num_workers = 0
        eu.predict.train_val_predictions(
            model,
            sdata=sdata, 
            target="target",
            train_key="train_val",
            transform_kwargs=t_kwargs,
            name=model_name,
            version=f"trial_{trial}",
            prefix=f"{model_name}_trial_{trial}_"
        )
        del model 
sdata.write_h5sd(os.path.join(eu.settings.output_dir, "train_predictions.h5sd"))

dsFCN trial 1


Global seed set to 1
Global seed set to 1


No transforms given, assuming just need to tensorize).
No transforms given, assuming just need to tensorize).


GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Set SLURM handle signals.

  | Name      | Type                      | Params
--------------------------------------------------------
0 | hp_metric | AUROC                     | 0     
1 | fcn       | BasicFullyConnectedModule | 1.1 M 
--------------------------------------------------------
1.1 M     Trainable params
0         Non-trainable params
1.1 M     Total params
4.232     Total estimated model params size (MB)


Validation sanity check: 0it [00:00, ?it/s]

Global seed set to 1


Training: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Metric val_loss improved. New best score: 0.120


Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Monitored metric val_loss did not improve in the last 5 records. Best score: 0.120. Signaling Trainer to stop.


No transforms given, assuming just need to tensorize).
No transforms given, assuming just need to tensorize).


GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
  f"The dataloader, {name}, does not have many workers which may be a bottleneck."


Predicting: 0it [00:00, ?it/s]

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Predicting: 0it [00:00, ?it/s]

SeqData object modified:
    seqs_annot:
        + dsFCN_trial_1_target_predictions
dsFCN trial 2


Global seed set to 2
Global seed set to 2


No transforms given, assuming just need to tensorize).
No transforms given, assuming just need to tensorize).


GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Set SLURM handle signals.

  | Name      | Type                      | Params
--------------------------------------------------------
0 | hp_metric | AUROC                     | 0     
1 | fcn       | BasicFullyConnectedModule | 1.1 M 
--------------------------------------------------------
1.1 M     Trainable params
0         Non-trainable params
1.1 M     Total params
4.232     Total estimated model params size (MB)


Validation sanity check: 0it [00:00, ?it/s]

Global seed set to 2


Training: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Metric val_loss improved. New best score: 0.118


Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Monitored metric val_loss did not improve in the last 5 records. Best score: 0.118. Signaling Trainer to stop.


No transforms given, assuming just need to tensorize).
No transforms given, assuming just need to tensorize).


GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Predicting: 0it [00:00, ?it/s]

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Predicting: 0it [00:00, ?it/s]

SeqData object modified:
    seqs_annot:
        + dsFCN_trial_2_target_predictions
dsFCN trial 3


Global seed set to 3
Global seed set to 3


No transforms given, assuming just need to tensorize).
No transforms given, assuming just need to tensorize).


GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Set SLURM handle signals.

  | Name      | Type                      | Params
--------------------------------------------------------
0 | hp_metric | AUROC                     | 0     
1 | fcn       | BasicFullyConnectedModule | 1.1 M 
--------------------------------------------------------
1.1 M     Trainable params
0         Non-trainable params
1.1 M     Total params
4.232     Total estimated model params size (MB)


Validation sanity check: 0it [00:00, ?it/s]

Global seed set to 3


Training: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Metric val_loss improved. New best score: 0.119


Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Monitored metric val_loss did not improve in the last 5 records. Best score: 0.119. Signaling Trainer to stop.


No transforms given, assuming just need to tensorize).
No transforms given, assuming just need to tensorize).


GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Predicting: 0it [00:00, ?it/s]

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Predicting: 0it [00:00, ?it/s]

SeqData object modified:
    seqs_annot:
        + dsFCN_trial_3_target_predictions
dsFCN trial 4


Global seed set to 4
Global seed set to 4


No transforms given, assuming just need to tensorize).
No transforms given, assuming just need to tensorize).


GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Set SLURM handle signals.

  | Name      | Type                      | Params
--------------------------------------------------------
0 | hp_metric | AUROC                     | 0     
1 | fcn       | BasicFullyConnectedModule | 1.1 M 
--------------------------------------------------------
1.1 M     Trainable params
0         Non-trainable params
1.1 M     Total params
4.232     Total estimated model params size (MB)


Validation sanity check: 0it [00:00, ?it/s]

Global seed set to 4


Training: 0it [00:00, ?it/s]

---

## Scratch