# Guide to instantiating models with EUGENe
Adam Klie (last updated: *09/20/2023*)
***
**Description:**
This notebook is meant to serve as a guide to instantiating models with EUGENe. It will cover the following topics:
1. EUGENe's `models` library
2. Using EUGENe's built-in models
3. Using configuration files to instantiate models
4. Using custom architectures
5. Using custom LightningModules

# Set-up

Make sure to have [installed EUGENe](https://eugene-tools.readthedocs.io/en/latest/installation.html) in your environment before you start!
> **Warning**:
> Running this notebook without a GPU is feasible but will be very slow. We recommend using [Google Colab](https://medium.com/deep-learning-turkey/google-colab-free-gpu-tutorial-e113627b9f5d) if you don't have access to your own GPU. If you choose this option, run the following before you begin the tutorial:
> ```
> !pip install eugene-tools
> !pip install torchmetrics==0.1a1.4
> ```

In [None]:
# General imports
import torch

In [None]:
# Test out basic torch
x_linear = torch.randn(10, 4)
x = torch.randn(10, 4, 100)

A few cells near the end of the notebook illustrate plugging EUGENe architectures into different software frameworks. To run them, you will need to install the following packages:

In [None]:
# Get EvoAug set-up
#!pip install evoaug
#!pip install git+https://github.com/p-koo/evoaug_analysis.git

We recommend cloning the entire tutorials repository so that you have all the necessary intermediate files you need, but when applicable, we also provide links to download the files directly.

In [None]:
# Change this to where you would like to save all your results, likely your tutorials download directory
import os
os.chdir("/cellar/users/aklie/projects/ML4GLand/tutorials")  # TODO: change this to your own directory
cwd = os.getcwd()
cwd

## EUGENe's `models` library -- the basics

### Starting with PyTorch

I like to think of architecting neural networks as playing with adult legos. Legos specifically designed for geeks who like to learn from data. 

Designing and training neural networks for regulatory genomics requires a comprehensive library of architecture lego blocks. Fortunately for us, PyTorch provides an extensive library that we can use right out of the box. 

In [None]:
import torch.nn as nn

In [None]:
# We define a simple linear layer with 4 inputs and 5 outputs.
layer = nn.Linear(4, 5)
layer_out = layer(x_linear)
layer, layer_out.shape

### EUGENe's layers build on PyTorch

EUGENe builds on PyTorch by adding several useful layers such as inception and residual layers. These are defined in the `eugene.models.base._layers` module.

In [None]:
# Load EUGENe's layers 
from eugene.models.base import _layers as layers

Layers in EUGENe can be broken up into several categories: **Activations, Convolutional, Pooling, Recurrent, Attention, Normalizer, Wrappers, Gluers, Sampling, Noise, Misc**

In [None]:
# Take exponential activations as an example; they have been shown to improve the interpretability of models
layer = layers.Exponential(inplace=False)
layer_out = layer(x)
layer, layer_out.shape

In [None]:
# For a more complex example, we can use the InceptionConv1D layer that uses multiple sizes of convolutions in a single layer
layer = layers.InceptionConv1D(in_channels=4, out_channels=16)
layer_out = layer(x)
layer, layer_out.shape

In [None]:
# We can also use the MultiHeadAttention layer that is used in the Transformer architecture
layer = layers.MultiHeadAttention(
    input_dim=4,
    head_dim=10,
    num_heads=2
)
layer_out = layer(x.transpose(1, 2), mask=None)
layer, layer_out.shape
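
The transpose in the cell above moves the tensor from `(batch, channels, length)` to `(batch, length, channels)`, which is presumably the layout the attention layer expects. As a rough point of comparison, the same shape handling can be sketched with PyTorch's own `torch.nn.MultiheadAttention`. This is a stand-in for illustration only, not EUGENe's `MultiHeadAttention`, and `embed_dim=4` and `num_heads=2` are assumptions chosen to match the toy input.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x_demo = torch.randn(10, 4, 100)   # (batch, channels, length), like one-hot DNA
seq = x_demo.transpose(1, 2)       # (batch, length, channels)

# batch_first=True tells PyTorch that the batch dimension comes first
attn = nn.MultiheadAttention(embed_dim=4, num_heads=2, batch_first=True)

# Self-attention: the sequence serves as query, key, and value
attn_out, attn_weights = attn(seq, seq, seq)
print(attn_out.shape, attn_weights.shape)
```

The output keeps the `(batch, length, channels)` layout, while the attention weights hold one length-by-length map per sequence.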

Remember you can always bring up the signature and docstring for any layer by using the `?` operator in Jupyter notebooks. For example, to bring up the docstring for the `InceptionConv1D` layer, you can run the following code:

```python
from eugene.models.base._layers import InceptionConv1D
InceptionConv1D?
```

In [None]:
# One more common example of what we call a wrapper layer. Residual layers are used to add the input to the output of a layer.
conv_layer = torch.nn.Conv1d(4, 4, 5, padding="same")
layer = layers.Residual(conv_layer)
layer_out = layer(x)
layer, layer_out.shape
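
To make the idea of a wrapper layer concrete, here is a minimal sketch of a residual wrapper written in plain PyTorch. The class name `ResidualSketch` is hypothetical and this is not EUGENe's implementation of `Residual`; it only illustrates the "add the input to the output" pattern described in the comment above.

```python
import torch
import torch.nn as nn

class ResidualSketch(nn.Module):
    """Hypothetical wrapper: returns input + wrapped(input)."""

    def __init__(self, wrapped):
        super().__init__()
        self.wrapped = wrapped

    def forward(self, x):
        # The wrapped layer must preserve the input shape for the sum to work
        return x + self.wrapped(x)

torch.manual_seed(0)
x_demo = torch.randn(10, 4, 100)
conv = nn.Conv1d(4, 4, 5, padding="same")  # same channels and length in and out
res = ResidualSketch(conv)
out = res(x_demo)
print(out.shape)
```

Because the convolution keeps both the channel count and the sequence length, the skip connection can be added without any reshaping.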

### From layers to blocks

Additionally, EUGENe introduces flexible functions for establishing common “blocks” that are composed of heterogeneous sets of layers arranged in a predefined or adaptable order. Blocks are available in the `eugene.models.base._blocks` module.

In [None]:
from eugene.models.base import _blocks as blocks

A convolutional block (Conv1DBlock in EUGENe) often comprises convolutional, normalization, activation, and dropout layers in different orderings depending on the model and task.

In [None]:
conv1d_block = blocks.Conv1DBlock(
    input_len=100,
    input_channels=4,
    output_channels=32,
    conv_kernel=23,
    dropout_rate=0.1,
)
block_out = conv1d_block(x)
conv1d_block, block_out.shape

**Note**: Order matters! The order of these layers varies from model to model. For example, a DeepSEA conv block uses the ordering “conv-act-pool-dropout”, while a DeepSTARR conv block uses “conv-norm-act-pool”. We can change the ordering of the layers in the conv block using the `order` argument.

In [None]:
# A la DeepSEA. Note that we can omit a layer by simply leaving it out of the order argument (norm is left out here). The other arguments shown are also configurable in EUGENe
conv1d_block = blocks.Conv1DBlock(
    input_len=100,
    input_channels=4,
    output_channels=32,
    conv_kernel=23,
    conv_type="conv1d",
    conv_padding="same",
    pool_type="max",
    norm_type="batchnorm",
    dropout_rate=0.5,
    order="conv-act-pool-dropout"
)
block_out = conv1d_block(x)
conv1d_block, block_out.shape

In [None]:
# A la DeepSTARR
conv1d_block = blocks.Conv1DBlock(
    input_len=100,
    input_channels=4,
    output_channels=32,
    conv_kernel=23,
    conv_type="conv1d",
    conv_padding="same",
    pool_type="max",
    norm_type="batchnorm",
    dropout_rate=0.5,
    order="conv-norm-act-pool"
)
block_out = conv1d_block(x)
conv1d_block, block_out.shape
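
One way to picture what an order string does is to build a block from it by hand. The sketch below uses only plain PyTorch layers; `build_block` and its defaults are hypothetical and do not reflect how `Conv1DBlock` is implemented. It assumes the string always starts with `conv`, so that any later `norm` sees the convolution's output channels.

```python
import torch
import torch.nn as nn

def build_block(order, in_ch=4, out_ch=32, kernel=23, dropout=0.5):
    """Hypothetical helper: turn an order string into an nn.Sequential."""
    factories = {
        "conv": lambda: nn.Conv1d(in_ch, out_ch, kernel, padding="same"),
        "norm": lambda: nn.BatchNorm1d(out_ch),  # assumes norm follows conv
        "act": lambda: nn.ReLU(),
        "pool": lambda: nn.MaxPool1d(2),
        "dropout": lambda: nn.Dropout(dropout),
    }
    # Layers that are not named in the order string are simply never created
    return nn.Sequential(*[factories[name]() for name in order.split("-")])

torch.manual_seed(0)
x_demo = torch.randn(10, 4, 100)
deepsea_like = build_block("conv-act-pool-dropout")
deepstarr_like = build_block("conv-norm-act-pool")
deepsea_out = deepsea_like(x_demo)
deepstarr_out = deepstarr_like(x_demo)
print(deepsea_like)
print(deepstarr_like)
```

Both blocks produce the same output shape here; only the sequence of operations differs.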

We also offer two other blocks: `DenseBlocks` and `RecurrentBlocks`. We leave it as an exercise to the reader to explore these blocks or to create their own! (Transformer blocks coming soon!)

### From blocks to towers

We can also stack multiple blocks on top of one another so that we can build very deep architectures without having to write a lot of code. We call these stacks of blocks "towers". Towers are available in the `eugene.models.base._towers` module.

In [None]:
from eugene.models.base import _towers as towers

We built a single flexible `Tower` class that can handle any block. The class takes a set of `static_block_args` that are repeated across blocks and a set of `dynamic_block_args` that change from block to block. You can also pass in a set of `mults` to scale an argument from one block to the next (e.g. if you want exponentially increasing dilations like in Basenji). We show an example of using the `Tower` class with a `Conv1DBlock` (probably the most common block in genomics) below:

In [None]:
from eugene.models.base import _blocks as blocks

In [None]:
# Here we toss a Conv1DBlock with some different arguments than before. If you like dogs, then this tower is for you.
tower = towers.Tower(
    input_size=(4, 100),
    block=blocks.Conv1DBlock,
    repeats=3,
    static_block_args={'input_len': 100, 'conv_kernel': 3, 'conv_padding': 'same', 'conv_type': 'conv1d', 'activation': 'gelu', 'order': 'conv-norm-act-dropout-pool'},
    dynamic_block_args={'input_channels': [4, 10, 20], 'output_channels': [10, 20, 30]},
    mults={"conv_dilation": 2}
)
tower_out = tower(x)
tower, tower.input_size, tower.output_size, tower_out.shape
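
To see how the three kinds of arguments could combine, the pure-Python sketch below expands them into one keyword dictionary per block. `expand_block_args` is a hypothetical helper, not EUGENe's code, and the scaling rule for `mults` (start at 1 and multiply once per block) is an assumption made for illustration; check the `Tower` docstring for the actual behavior.

```python
def expand_block_args(repeats, static_block_args=None, dynamic_block_args=None,
                      mults=None, start=1):
    """Hypothetical helper: return one kwargs dict per block."""
    static_block_args = static_block_args or {}
    dynamic_block_args = dynamic_block_args or {}
    mults = mults or {}
    per_block = []
    for i in range(repeats):
        kwargs = dict(static_block_args)        # shared by every block
        for name, values in dynamic_block_args.items():
            kwargs[name] = values[i]            # i-th value goes to the i-th block
        for name, mult in mults.items():
            kwargs[name] = start * mult ** i    # assumed rule: start, start*mult, ...
        per_block.append(kwargs)
    return per_block

per_block = expand_block_args(
    repeats=3,
    static_block_args={"conv_kernel": 3, "conv_padding": "same"},
    dynamic_block_args={"input_channels": [4, 10, 20], "output_channels": [10, 20, 30]},
    mults={"conv_dilation": 2},
)
for kwargs in per_block:
    print(kwargs)
```

Under these assumptions the dilation doubles from block to block while the kernel size and padding stay fixed.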

The other cool thing about this is it is not EUGENe specific! That is, it isn't constrained to working on EUGENe layers or blocks. You can use it with any PyTorch layer or block!

In [None]:
# Here we just repeat a linear layer 3 times, decreasing the number of units in each layer
tower = towers.Tower(
    input_size=400,
    block=torch.nn.Linear,
    repeats=3,
    dynamic_block_args={'in_features': [400, 200, 100], 'out_features': [200, 100, 10]},
)
tower_out = tower(x.reshape(10, -1))
tower, tower.input_size, tower.output_size, tower_out.shape

Using standardized objects such as blocks and towers has several benefits:
1. It makes it easier to write code, and for others to read your code.
2. It makes it easier to interpret models. You can pull layers out of the model based on the standard names and see what is going on.
3. It makes it easier to debug models and identify where things are going wrong.
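
The second point, pulling layers out by name, can be sketched with plain PyTorch. The layer names `conv`, `act`, and `pool` below are assumptions for this toy block, not the names EUGENe assigns; printing a EUGENe model shows its actual names.

```python
from collections import OrderedDict

import torch.nn as nn

# A toy block whose layers carry explicit names
block = nn.Sequential(OrderedDict([
    ("conv", nn.Conv1d(4, 32, 23, padding="same")),
    ("act", nn.ReLU()),
    ("pool", nn.MaxPool1d(2)),
]))

# named_modules() yields (name, module) pairs for every submodule
modules = dict(block.named_modules())
first_conv = modules["conv"]

# The convolution filters are a common starting point for interpretation
filters = first_conv.weight.detach()
print(filters.shape)  # (out_channels, in_channels, kernel_size)
```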

## Using EUGENe's built-in models

### Architectures as simple nn.Modules

We refer to architectures in EUGENe as any class that inherits from `torch.nn.Module` and includes a `forward` method. This is the standard way of defining architectures in PyTorch. This means that any of the layers, blocks, or towers we discussed are technically architectures on their own. However, architectures that are actually used in practice are usually composed of multiple layers, blocks, and towers in combination.

In [None]:
import torch.nn as nn

In [None]:
class EverythingEverywhereAllAtOnce(torch.nn.Module):
    def __init__(
        self
    ):
        super(EverythingEverywhereAllAtOnce, self).__init__()
        self.layer = nn.Linear(4, 5)
        self.layer2 = layers.Exponential(inplace=False)
        self.layer3 = layers.InceptionConv1D(in_channels=4, out_channels=16)
        self.layer4 = layers.MultiHeadAttention(
            input_dim=4,
            head_dim=10,
            num_heads=2
        )
        self.layer5 = layers.Residual(conv_layer)
        self.layer6 = blocks.Conv1DBlock(
            input_len=100,
            input_channels=4,
            output_channels=32,
            conv_kernel=23,
            dropout_rate=0.1,
        )
        self.layer7 = towers.Tower(
            input_size=(4, 100),
            block=blocks.Conv1DBlock,
            repeats=3,
            static_block_args={'input_len': 100, 'conv_kernel': 3, 'conv_padding': 'same', 'conv_type': 'conv1d', 'activation': 'gelu', 'order': 'conv-norm-act-dropout-pool'},
            dynamic_block_args={'input_channels': [4, 10, 20], 'output_channels': [10, 20, 30]},
            mults={"conv_dilation": 2}
        )
        self.layer8 = towers.Tower(
            input_size=400,
            block=torch.nn.Linear,
            repeats=3,
            dynamic_block_args={'in_features': [400, 200, 100], 'out_features': [200, 100, 10]},
        )

    def forward(self, x):
        x = self.layer(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
        x = self.layer5(x)
        x = self.layer6(x)
        x = self.layer7(x)
        x = self.layer8(x)
        return x

In [None]:
# Note, the forward for this architecture will not actually work, I just wanted to stack everything we've looked at so far into one model
EverythingEverywhereAllAtOnce()

### Built-in architectures

We could just leave you on your own to play with your new lego sets, but for those who want some fully assembled models to play with, we have you covered. We've built a model zoo (cliché, I know) of models that are available in the `eugene.models.zoo` module.

In [None]:
from eugene.models import zoo

Like we mentioned before, every architecture in the zoo is a class that inherits from `torch.nn.Module` and includes a `forward` method. This means that you can use any of the architectures in the zoo just like you would any other PyTorch model.

### Types of built-in architectures

We are ever the [splitters at EUGENe](https://en.wikipedia.org/wiki/Lumpers_and_splitters), so we split our models into many categories based on the types of tasks they were designed for. The model zoo has the following sections:

1. Basic models
2. Transcription factor binding prediction models
3. Regulatory classification models
4. Cis-regulatory element (CRE) activity prediction models
5. Profile predictors
6. Single cell predictors

#### Basic architectures

Let's start by checking out the basic models

In [None]:
from eugene.models.zoo._basic_models import FCN, CNN, RNN, Hybrid

We provide customizable fully connected (FCN), convolutional (CNN), recurrent (RNN) and hybrid (a combination of the three) architectures that can all be instantiated from single function calls! Did I say we were splitters?

In [None]:
# FCNs are just a bunch of linear layers stacked on top of each other
model = FCN(
    input_len=100,
    output_dim=10,
    dense_kwargs={
        "hidden_dims": [50, 25],
        "activations": "relu",
        "batchnorm": True,
    }
)
model_out = model(x)
model, model_out.shape

In [None]:
# Here is just another example of an FCN, this time with a different activation function and dropout, as well as other customizable parameters
model = FCN(
    input_len=100,
    output_dim=1,
    dense_kwargs=dict(
        hidden_dims=[100, 50, 25], 
        activations=["relu", None, None], 
        dropout_rates=[0.1, 0.5], 
        batchnorm=True, 
        batchnorm_first=True, 
        biases=False
    ),
)
model_out = model(x)
model, model_out.shape

In [None]:
# Now we can move to CNNs, which are just a bunch of convolutions stacked on top of each other, followed by a dense block
model = CNN(
    input_len=100,
    output_dim=10,
    conv_kwargs={
        "input_channels": 4,
        "conv_channels": [10, 10],
        "conv_kernels": [5, 3],
        "activations": [],
        "pool_types": []
    }
)
model_out = model(x)
model, model_out.shape

Note that the number of incoming features to the dense block is determined automagically! Thanks [torchinfo](https://github.com/TylerYep/torchinfo)!
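
For intuition, one generic way to work out the flattened feature count is to push a dummy input through the convolutional layers and read off the shape. The sketch below uses plain PyTorch and is not a description of how EUGENe or torchinfo does it; the layer sizes mirror the `conv_channels` and `conv_kernels` from the cell above, with no activations or pooling.

```python
import torch
import torch.nn as nn

# Two unpadded convolutions: kernel 5, then kernel 3
conv_tower = nn.Sequential(
    nn.Conv1d(4, 10, 5),
    nn.Conv1d(10, 10, 3),
)

# Run a dummy batch of one sequence through the tower without tracking gradients
with torch.no_grad():
    dummy = torch.zeros(1, 4, 100)
    flat_dim = conv_tower(dummy).flatten(1).shape[1]

# Length shrinks 100 -> 96 -> 94, so the dense block sees 10 * 94 features
dense = nn.Linear(flat_dim, 10)
print(flat_dim)
```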

In [None]:
# Let's skip RNNs since those aren't used a whole lot and jump to Hybrids, which are just convolutions followed by recurrent layers and then a dense block
model = Hybrid(
    input_len=100,
    output_dim=10,
    conv_kwargs={
        "input_channels": 4,
        "conv_channels": [10, 10],
        "conv_kernels": [5, 3],
        "activations": "relu",
        "pool_types": "max"
    },
    recurrent_kwargs={
        "hidden_dim": 10,
        "num_layers": 10,
        "bidirectional": True
    }
)
model_out = model(x)
model, model_out.shape


We encourage the user to play with different arguments to these architectures, and not just because we selfishly want them to help us find any bugs that have slipped past our testing ;)

#### Published architectures

Sometimes, even having to specify the arguments to the layers you want is too much. We get it. We've been there. That's why we've built in a set of models designed for specific tasks. These are published architectures, which often represent specific configurations of the basic architectures above, made accessible to users through single function calls. One example is the transcription factor binding classifiers. These models are available in the `eugene.models.zoo._tf_binding_predictors` module.

In [None]:
from eugene.models.zoo._tf_binding_predictors import DeepBind, Kopp21CNN

At a minimum, each architecture expects an input sequence length (`input_len`) and an output dimension (`output_dim`). Other required and optional arguments are model specific, but each model follows the same contract as before. That is, each model is a class that inherits from `torch.nn.Module` and includes a `forward` method.

In [None]:
model = DeepBind(
    input_len=100,
    output_dim=1,
    mode="rbp"
)
model_out = model(x)
model, model_out.shape

We leave it as an exercise to the reader to explore the other published architectures in the model zoo.

### LightningModules for training architectures

One thing you may be thinking to yourself if you've made it this far is, "Wait, how do I make sure I train the DeepBind architecture as the creator intended?" (e.g. for binary classification). It is true: nothing in the way we've written the built-in architectures requires you to train them in any particular way (again, maybe we are lumpers). Instead, we've written that contract into something called a LightningModule from PyTorch Lightning! Don't worry, you don't have to know anything about PyTorch Lightning to use these modules, only how they are used in EUGENe. As a starting place, we can grab LightningModules directly from the `eugene.models` module.

In [None]:
from eugene import models

Each LightningModule is meant to do the following:

1. define the types of architectures it can train
2. standardize the way a user interacts with those architectures in EUGENe
3. reduce boilerplate PyTorch and PyTorch Lightning code

For example, we implemented SequenceModule to expect an architecture that ingests a single tensor (usually one-hot encoded DNA sequences) as input and outputs a single tensor. The SequenceModule defines how this class of models should be trained (including the loss function and optimizer), what metrics should be reported, and how inference should be handled. As a result, any PyTorch model that follows this contract can be trained using SequenceModule. 

In [None]:
# We first need to create a model architecture that we can pass to the module. We kept it simple with DeepBind here
arch = DeepBind(
    input_len=100,
    output_dim=1
)
arch

In [None]:
# Next we pass the architecture to the module, along with several arguments that we want to use to train the model
module = models.SequenceModule(
    arch=arch,
    task="binary_classification",
    loss_fxn="bce",
    optimizer="adam",
    metric="auroc",
    metric_kwargs={"task": "binary", "num_classes": 1},
)
module

And voila! We have an architecture wrapped in a SequenceModule that is now ready for training! Let's break down the arguments in the code above.
1. `task` - the task we are training the model for. This is used to determine the loss function and metrics to use when we don't pass them in explicitly.
2. `loss_fxn` - the loss function to use for training. If not provided, the loss function is determined based on the task.
3. `optimizer` - the optimizer to use for training.
4. `metric` - the metric to use for training. If not provided, the metric is determined based on the task.

Remember, the `SequenceModule` is expecting an architecture that follows the contract we described above. DeepBind, as well as most of the other built-in architectures, follow this contract. If we pass in an architecture that doesn't follow this contract, however, we will likely get an error when we try to train the model.
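
A quick way to catch contract violations before training is to run a dummy batch through the architecture and inspect what comes back. The helper `returns_single_tensor` and the two toy classes below are hypothetical and not part of EUGENe; they only sketch the "single tensor in, single tensor out" check.

```python
import torch
import torch.nn as nn

def returns_single_tensor(arch, input_len, channels=4, batch_size=2):
    """Hypothetical check: does arch map one tensor to exactly one tensor?"""
    was_training = arch.training
    arch.eval()  # avoid batch-statistics side effects during the probe
    with torch.no_grad():
        out = arch(torch.zeros(batch_size, channels, input_len))
    arch.train(was_training)  # restore the original mode
    return isinstance(out, torch.Tensor)

class SingleOutput(nn.Module):
    def __init__(self):
        super().__init__()
        self.dense = nn.Linear(4 * 100, 1)

    def forward(self, x):
        return self.dense(x.flatten(1))

class TupleOutput(SingleOutput):
    def forward(self, x):
        out = super().forward(x)
        return out, out  # two outputs: breaks the single-tensor contract

ok = returns_single_tensor(SingleOutput(), input_len=100)
not_ok = returns_single_tensor(TupleOutput(), input_len=100)
print(ok, not_ok)
```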

We have also implemented a ProfileModule that handles BPNet style training, where the model has multiple output tensors (“heads”), can take in optional control inputs, and uses multiple loss functions. Only one built-in architecture follows this contract right now (BPNet), but others may follow it in the future.
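
To illustrate what "multiple heads with multiple loss functions" means in practice, here is a toy two-head model in plain PyTorch. It is not BPNet and not what `ProfileModule` does internally: the class `TwoHeadSketch`, the layer sizes, the use of mean squared error for both heads, and the weight `count_weight` are all assumptions made to keep the sketch short, and control inputs are left out.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoHeadSketch(nn.Module):
    """Hypothetical model with a per-position head and a per-sequence head."""

    def __init__(self, n_filters=8, n_outputs=2):
        super().__init__()
        self.body = nn.Conv1d(4, n_filters, 21, padding="same")
        self.profile_head = nn.Conv1d(n_filters, n_outputs, 1)
        self.count_head = nn.Linear(n_filters, n_outputs)

    def forward(self, x):
        hidden = torch.relu(self.body(x))
        profile = self.profile_head(hidden)            # (batch, n_outputs, length)
        counts = self.count_head(hidden.mean(dim=-1))  # (batch, n_outputs)
        return profile, counts

torch.manual_seed(0)
model = TwoHeadSketch()
x_demo = torch.randn(10, 4, 100)
profile, counts = model(x_demo)

# One loss per head, combined with an assumed weighting
profile_target = torch.zeros_like(profile)
count_target = torch.zeros_like(counts)
count_weight = 0.5
loss = F.mse_loss(profile, profile_target) + count_weight * F.mse_loss(counts, count_target)
print(profile.shape, counts.shape, loss.item())
```

A model like this returns a tuple, so it would not fit the single-tensor contract of SequenceModule.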

In [None]:
from eugene.models.zoo import BPNet

In [None]:
arch = BPNet(
    input_len=2114,
    output_dim=1000,
    n_outputs=2,
    n_control_tracks=2, 
    trimming=(2114 - 1000) // 2,
    name="BPNet"
)

In [None]:
from eugene.models import ProfileModule

In [None]:
module = ProfileModule(arch=arch)
module

## Using config files

By now, hopefully you have a pretty good feel for how architectures work in EUGENe, and we can move into some nuances.

It can be quite cumbersome to drag around the arguments you need to build a model to every place you need to instantiate it, and just as cumbersome to remember them all. That's why we've built in a config file system that allows you to save and load model configurations. Let's start with an already generated config for a CNN model.

In [None]:
import importlib
import os
import torch
import yaml
from eugene import settings

In [None]:
# TODO: Uncomment and run the following to get the config. You will not need to do this if you are working in your cloned tutorials repo
#!mkdir -p $cwd/configs
#!wget https://raw.githubusercontent.com/ML4GLand/tutorials/main/configs/simple_cnn.yaml -O $cwd/configs/simple_cnn.yaml

In [None]:
# We can make use of a EUGENe helper for loading from a config
models.load_config("configs/simple_cnn.yaml")

Let's take a look at the config directly

In [None]:
!cat $cwd/configs/simple_cnn.yaml

We can see that it looks a lot like the arguments we passed into the CNN architecture. It's really the same process either way, the config file just has a little more persistence.
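
The general pattern behind config-driven instantiation can be sketched in a few lines. Everything in this example is hypothetical: the `TinyArch` class, the `ARCH_REGISTRY` lookup, and the keys `arch_name` and `arch` are invented for illustration and do not describe the schema that `models.load_config` expects. JSON from the standard library stands in for YAML so that the sketch has no extra dependencies.

```python
import json
import os
import tempfile

class TinyArch:
    """Hypothetical stand-in for an architecture class."""

    def __init__(self, input_len, output_dim, hidden_dims):
        self.input_len = input_len
        self.output_dim = output_dim
        self.hidden_dims = hidden_dims

# Map names that may appear in a config to the classes they refer to
ARCH_REGISTRY = {"TinyArch": TinyArch}

def build_from_config(path):
    """Read a config file and instantiate the architecture it names."""
    with open(path) as handle:
        config = json.load(handle)
    arch_cls = ARCH_REGISTRY[config["arch_name"]]
    return arch_cls(**config["arch"])  # stored arguments become keyword arguments

config = {
    "arch_name": "TinyArch",
    "arch": {"input_len": 100, "output_dim": 10, "hidden_dims": [50, 25]},
}
with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = os.path.join(tmp_dir, "tiny_arch.json")
    with open(config_path, "w") as handle:
        json.dump(config, handle)
    arch = build_from_config(config_path)

print(arch.input_len, arch.output_dim, arch.hidden_dims)
```

The arguments live in one file, so every script that needs the model can rebuild it from the same source.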

Configs work similarly for the published architectures. Let's load in a config for the DeepBind architecture.

In [None]:
# TODO: Uncomment and run the following to get the config, you will not need to do this if you are working in your cloned tutorials repo
#!mkdir -p $cwd/configs
#!wget https://raw.githubusercontent.com/ML4GLand/tutorials/main/configs/deepbind.yaml -O $cwd/configs/deepbind.yaml

In [None]:
models.load_config("configs/deepbind.yaml")

In [None]:
!cat $cwd/configs/deepbind.yaml

Note that when you build a SequenceModule with a config, you must use a built-in architecture.

Finally, you don't have to add the SequenceModule-specific arguments to the config if you don't want to. You can specify just the architecture.

In [None]:
# TODO: Uncomment and run the following to get the config, you will not need to do this if you are working in your cloned tutorials repo
#!mkdir -p $cwd/configs
#!wget https://raw.githubusercontent.com/ML4GLand/tutorials/main/configs/deepbind_arch.yaml -O $cwd/configs/deepbind_arch.yaml

In [None]:
models.load_config("configs/deepbind_arch.yaml")

In [None]:
!cat $cwd/configs/deepbind_arch.yaml

## Using custom architectures

If I've said it twice, I'll say it three times: architectures in EUGENe are just classes that inherit from `torch.nn.Module` and include a `forward` method. This means that you can grab PyTorch models from anywhere and use them in EUGENe. Let's build a new architecture from scratch and plug it into a SequenceModule.

In [None]:
# You could imagine building this a ton of different ways, but here is a simple example
import torch.nn.functional as F
class SmallCNN(nn.Module):
    def __init__(self):
        super(SmallCNN, self).__init__()

        # Set the attributes
        self.input_len = 100
        self.output_dim = 1

        # Create the blocks
        self.conv1 = nn.Conv1d(4, 30, 21)
        self.relu  = nn.ReLU()
        
        self.dense = nn.Linear(30, 1)

    def forward(self, x):
        x = self.conv1(x)
        x = self.relu(x)
        x = F.max_pool1d(x, x.size()[-1]).flatten(1, -1)
        x = self.dense(x)
        return x

In [None]:
# Now we can pass this to the module
models.SequenceModule(
    arch=SmallCNN(),
    task="binary_classification",
    loss_fxn="bce",
    optimizer="adam",
    metric="auroc",
    metric_kwargs={"task": "binary"}
)
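
Before handing a custom architecture to a module, it can help to trace the tensor shapes through its forward pass by hand. The sketch below repeats the same operations as `SmallCNN` step by step on a random batch, using only PyTorch; the batch size of 10 is an arbitrary choice.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
x_demo = torch.randn(10, 4, 100)  # (batch, channels, length)

conv1 = nn.Conv1d(4, 30, 21)      # no padding: length 100 - 21 + 1 = 80
dense = nn.Linear(30, 1)

hidden = torch.relu(conv1(x_demo))
# Global max pool: a window as wide as the sequence leaves one value per filter
pooled = F.max_pool1d(hidden, hidden.size(-1)).flatten(1, -1)
out = dense(pooled)

print(hidden.shape, pooled.shape, out.shape)
```

The final shape has one value per sequence, matching the single-output binary classification setup used above.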

And away we go!

## Using custom LightningModules

We can also do the reverse process by plugging in a built-in architecture to a custom LightningModule. Let's use the EvoAug RobustModel LightningModule, and plug in the DeepBind architecture.

In [None]:
import torch
from evoaug import evoaug
from evoaug_analysis import utils

In [None]:
# Instantiate the DeepBind architecture as set up in the EvoAug paper (no trained weights are loaded here)
deepbind = DeepBind(249, 2)
loss = torch.nn.MSELoss()
optimizer_dict = utils.configure_optimizer(deepbind, lr=0.001, weight_decay=1e-6, decay_factor=0.1, patience=5, monitor='val_loss')
model = evoaug.RobustModel(
    deepbind, 
    criterion=loss, 
    optimizer=optimizer_dict, 
    augment_list=[]
)
model

Once again, away we go!