# Adding a model to EUGENe 

**Authorship:**
Adam Klie, *10/05/2022*<br>
**Last Updated**: 10/08/2022
***
**Description:**
EUGENe offers several customizable architectures as built-in, including flexible fully connected, convolutional, recurrent, hybrid architectures and seminal DeepBind and DeepSEA architectures. We also provide implementations of models introduced in Jores et al and Kopp et al. However, this set of provided modules may not be sufficient for a users training task and many users may need to add custom architectures to the library. This can be achieved in a few straightforward steps outlined below. We also provide a walkthrough on how to add a model on the EUGENe readthedocs page and available on GitHub.

This tutorial is intended to show how to add a model to EUGENe. It's a pretty simple process and allows the model to be utiilized throughout the EUGENe pipeline.
***

In [18]:
if 'autoreload' not in get_ipython().extension_manager.loaded:
    %load_ext autoreload 
%autoreload 2

import os
import numpy as np
import pandas as pd
import eugene as eu

eu.settings.dataset_dir = "./tutorial_datasets"
eu.settings.logging_dir = "./tutorial_logs"
eu.settings.output_dir = "./tutorial_output"

# 1. Review the `BaseModel` class and check out some examples in the `_custom_models.py` file
In order to fully integrate models into the EUGENe pipeline, it is recommended that you make your model a subclass of the [`BaseModel` class](https://eugene-tools.readthedocs.io/en/latest/usage-principles.html#basemodel-a-pytorch-lightning-template-for-deep-models). Though many of EUGENe's functions work under the assumption that the model is a subclass of a `torch.nn.Module`, many other functions assume a structure dictated by the `BaseModel` class. For most of the rest of this tutorial, we assume that you are inheriting from `BaseModel`.

Before you begin implementing anything it is recommended that you take a look at the (`BaseModel` class attributes)[https://github.com/adamklie/EUGENe/blob/main/eugene/models/base/_base_model.py]. These are the attributes that you will need to instantiate for any EUGENe model. I also find that it helps to see a few examples, which you can find in the [`_custom_models.py` file](https://github.com/adamklie/EUGENe/blob/main/eugene/models/_custom_models.py)

# 2. Create a model class
* This should be a Python class that at the very least inherits from `torch.nn.Module` (but ideally should inherit from the `BaseModel` class).

* For naming your model class, use the last name of the first author followed by the year of publication (`NameYY`) if your model is associated with a publication. It can also be useful to add the type of model you are implementing. For example, if the model is a CNN that was published in 2021 by the author "Jane Doe", the function should be named `Doe21CNN`. If your model is not associated with a publication, feel free to come up with your own name, but use the camel case convention (`ModelName`).
* At the minimum, this class should contain two functions, an  `__init__()` and a `forward()`.
* The `__init__()` function will set-up the way the model architecture is initialized. To use BaseModel functionality, a  user must first make a call to `super.init()` in the first line. The BaseModel class expects the user to include:
    
    - **input_len**: Expected input length
        - In most cases, this should be the length of the longest input sequence. See the `preprocess` module for more details on how different length inputs are handled.
    
    <br>

    - **output_dim**: The expected output dimension
        - The number of output neurons. One for single task regression and binary classification, multiple for multi-task regression, and the number of classes for multi-class classification.
    
    <br>

    - **strand**: The input type broken into three categories (described below)
        - *ss*: or single stranded models only take in one direction of the double stranded DNA (usually the 5’—>3’ direction)
        - *ds*: or double stranded models ingest both the forward and reverse strand (3’—>5’ reverse complement of forward) through the same set of layers. They aggregate the representations from these inputs according to the `aggr` argument and the error is backpropogated through this shared architecture
        - *ts*: or twin stranded models ingest both the forward and reverse strand (3’—>5’ reverse complement of forward) through a two sets of identically shaped layers. That is, two separate twin models handle each input and the representation learned from these different architectures is aggregated according to `aggr`.

    <br>

    - **task**: The type of task we are trying to model
        - We currently support single task and multitask regression. Passing in "regression" into this argument with different output_dim’s handles these cases.
        - We currently support binary and multiclass classification. Binary can be run with "binary_classification" and multiclass can be run with "multiclass_classification"

    <br>

    - **aggr**: The way to aggregate information from multiple stranded inputs (*ds* and *ss* models)
        - "avg": take the average value of each output neuron across the strands
        - "max" : take the max value for each output neuron across the strands
        - "concat" : concat the representation learned prior to the output. For networks that have multiple modules (e.g. `Hybrid` models, you can separate the different possible concatenations by adding a suffix (e.g. "concat_cnn" means concatenate the representation learned after the CNN module of a `Hybrid` model) 

    <br>

    - **loss_fxn** : The loss function to use. We currently support: 
        - "mse": mean squared error
        - "poisson": poisson negative log likelihood loss
        - "bce": binary cross entropy loss
        - "cross_entropy": cross entropy loss

    <br>

* Current models in EUGENe assume a single stranded (ss), regression model (regression) that is trained to optimize mean squared error (mse) by default.

* `forward`
    - The requirement of the forward function are that it can handle at least a single strand as input of length `input_len` and that it outputs vector of values of dimension equivalent to `output_dim`. 
    - To be compatible with EUGENe’’s baseline training functionality, the forward function should take in both the forward (x) and reverse strand (x_rev) as arguments. Note that the model needs to take in `x` and `x_rev_comp` as arguments with `x_rev_comp` defaulting to `None`. Even if your model takes in only the forward strand (i.e. does not use "ds" or "ts" modes), this needs to be defined.

In [19]:
import torch.nn as nn
import torch.nn.functional as F
from eugene.models.base import BaseModel

In [20]:
BaseModel?

[0;31mInit signature:[0m
[0mBaseModel[0m[0;34m([0m[0;34m[0m
[0;34m[0m    [0minput_len[0m[0;34m:[0m [0mint[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0moutput_dim[0m[0;34m:[0m [0mint[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mstrand[0m[0;34m:[0m [0mstr[0m [0;34m=[0m [0;34m'ss'[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mtask[0m[0;34m:[0m [0mstr[0m [0;34m=[0m [0;34m'regression'[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0maggr[0m[0;34m:[0m [0mstr[0m [0;34m=[0m [0;32mNone[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mloss_fxn[0m[0;34m:[0m [0mstr[0m [0;34m=[0m [0;34m'mse'[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0moptimizer[0m[0;34m:[0m [0mstr[0m [0;34m=[0m [0;34m'adam'[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mlr[0m[0;34m:[0m [0mfloat[0m [0;34m=[0m [0;36m0.001[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mscheduler[0m[0;34m:[0m [0mstr[0m [0;34m=[0m [0;34m'lr_scheduler'[0m[0;34m,[0m[0;34m[0m
[0;3

In [22]:
class TutorialCNN(BaseModel):
    def __init__(
        self,
        input_len: int,
        output_dim: int,
        strand: str = "ss",
        task: str = "regression",
        aggr: str = "avg",
        loss_fxn: str = "mse",
        **kwargs
    ):
        # Don't worry that we don't pass in the class name to the super call (as is standard for creating new
        # nn.Module subclasses). This is handled by inherting BaseModel
        super().__init__(
            input_len, 
            output_dim, 
            strand=strand, 
            task=task, 
            aggr=aggr, 
            loss_fxn=loss_fxn,
            **kwargs
        )
        # Define the layers of the model
        self.conv1 = nn.Conv1d(4, 30, 21)
        self.dense = nn.Linear(30, 1)
        self.sigmoid = nn.Sigmoid()        
            
            
    # Define the forward pass of the model/
    # Note how you need to use the x_rev_comp argument if you want to use the reverse complement of the sequence, 
    # but this can be ignored if the model is only meant to take in a single strand as input
    def forward(self, x, x_rev_comp=None):
        x = F.relu(self.conv1(x))
        x = F.max_pool1d(x, x.size()[-1]).flatten(1, -1)
        x = self.dense(x)
        x = self.sigmoid(x)
        if self.strand == "ds":
            x_rev_comp = F.relu(self.conv1(x_rev_comp))
            x_rev_comp = F.max_pool1d(x_rev_comp, x_rev_comp.size()[-1]).flatten(1, -1)
            x_rev_comp = self.dense(x_rev_comp)
            x_rev_comp = self.sigmoid(x_rev_comp)
            x = (x + x_rev_comp / 2)
        return x

# 3. Test the forward pass
Its often helpful to run a simple forward pass of the model with some dummy data to make sure all your matrix multiplication and other operations are working.

In [23]:
import torch

In [24]:
# Length of strand
x_len = 66

# Generate some random input
x = torch.randn(10, 4, x_len)
x_rev = torch.randn(10, 4, x_len)

In [25]:
# Instantiate your model
model = TutorialCNN(input_len=x_len, output_dim=1, strand="ds")

In [26]:
model(x, x_rev)

tensor([[0.5254],
        [0.4977],
        [0.5097],
        [0.5046],
        [0.5201],
        [0.4770],
        [0.5340],
        [0.5367],
        [0.3727],
        [0.3681]], grad_fn=<AddBackward0>)

# 4. Test a PL trainer
If your model is a BaseModel instance and you are working with SeqData objects, this is as simple as a call to the `fit` function within the `train` model.

If you want to work directly with PL trainers, this is a little more complicated but still not too bad! You just need to create an appropriate DataLoader for your implementation (which can be converted form a SeqData object) and pass the model and the dataloader to a PL trainer)

## Using `train.fit`

In [27]:
sdata = eu.datasets.random1000()
eu.pp.ohe_seqs_sdata(sdata)
eu.pp.reverse_complement_seqs_sdata(sdata)
eu.pp.train_test_split_sdata(sdata)
eu.train.fit(model, sdata, target_keys="activity_0", epochs=1, name="test_fit", version="add_model_tutorial")

One-hot encoding sequences:   0%|          | 0/1000 [00:00<?, ?it/s]

Global seed set to 13
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Set SLURM handle signals.

  | Name      | Type    | Params
--------------------------------------
0 | hp_metric | R2Score | 0     
1 | conv1     | Conv1d  | 2.6 K 
2 | dense     | Linear  | 31    
3 | sigmoid   | Sigmoid | 0     
--------------------------------------
2.6 K     Trainable params
0         Non-trainable params
2.6 K     Total params
0.010     Total estimated model params size (MB)


SeqData object modified:
	ohe_seqs: None -> 1000 ohe_seqs added
SeqData object modified:
	ohe_rev_seqs: None -> 1000 ohe_rev_seqs added
SeqData object modified:
    seqs_annot:
        + train_val
Dropping 0 sequences with NaN targets.
No transforms given, assuming just need to tensorize.
No transforms given, assuming just need to tensorize.


  rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")


Validation sanity check: 0it [00:00, ?it/s]

  f"The dataloader, {name}, does not have many workers which may be a bottleneck."
Global seed set to 13
  f"The dataloader, {name}, does not have many workers which may be a bottleneck."
  f"The number of training samples ({self.num_training_batches}) is smaller than the logging interval"


Training: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

## Directly using PL trainer

In [28]:
# Direct access to PL trainer
from pytorch_lightning import Trainer

In [33]:
# Separate train and val
sdata_train = sdata[sdata["train_val"]]
sdata_val = sdata[~sdata["train_val"]]

In [47]:
# Make some dataloaders
sdataloader_train = sdata_train.to_dataset(target_keys="activiy_0").to_dataloader(batch_size=32)
sdataloader_val = sdata_val.to_dataset(target_keys="activity_0").to_dataloader(batch_size=32)

No transforms given, assuming just need to tensorize.
No transforms given, assuming just need to tensorize.


In [48]:
# Define a trainer by hand
trainer = Trainer(max_epochs=1)

GPU available: True, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
  "GPU available but not used. Set the gpus flag in your trainer `Trainer(gpus=1)` or script `--gpus=1`."


In [None]:
# Fit the model
trainer.fit(model, train_dataloaders=sdataloader_train, val_dataloaders=sdataloader_val)

# 5. Adding you model to EUGENe
Once you are happy with how your model seems to be working, you can add it to the appropriate `.py` file within EUGENe. 
- `_base_models.py`: This is meant for implementations of flexible architectures that are at the core of deep learning across fields. This might be something like a vanilla autoencoder, where you can change the number of hidden layers and units in the encoder or decoder.

- `_sota_models.py`: These are often specific instances of the the above Base Models. Often these models have architectures that don't quite fit within the mold of the Base Models (e.g. DeepBind models that concatenate global and average pooling layers), but can also just be calls to Base Models with a specific configuration of hyperparameters (an example of the latter might be the DeepSEA architecture, which could be created with a specific call to a CNN). There also must be some basis for calling this a SOTA architecture. I realize this is somewhat arbitrary and I could probably have endless debates with people about what this means, but I typically use the rule that I know a SOTA architecture when I see one (e.g. if you are reading this you probably know what DeepSEA is).

- `_custom_models.py`: These are custom architectures that don't really fall under the Base Models or SOTA Models. These might be published models that were successful on a particular dataset, or your own custom architecture you just want to be able to use and test within EUGENe. A note for the latter. In order for the a custom model to make it into a future release of EUGENe, there should be some basis for its inclusion. That is, you should be able to demonstrate the utility of the architecure on some real world data.

I've already went ahead and added the `TutorialCNN` to the Custom Models.

# 6. Create a unit test for your model
* In order for your model to make it into the next EUGENe release, it needs to have a unit test within the [`test_models.py` file](https://github.com/adamklie/EUGENe/blob/main/tests/test_models.py). At a minimum this unit test should test the instantiation of your model and the training procedure of your choice on some dummy data. Check out the unit tests already there for more examples.

As is the general rule for testing, the more "units" you can test the better. Feel free to add other tests as well. One other area that might be a little tricky is making sure the convolutional filters of your model are seen by the `generate_pfms` function in the `interpret` module. 

Don't forget to actually run your tests as well! This can be done with the following command

```bash
pytest tests/test_models.py -k "test_TutorialCNN"
```

# 7. Document your function

## Docstring

Once your happy with your model, you've tested it and it's working as expected, you can add documentation to the function. This is done by adding a docstring to the function. The docstring should be formatted in [numpydoc](https://numpydoc.readthedocs.io/en/latest/format.html) format.

```python
"""Tutorial CNN model

    This is a very simple one layer convolutional model for testing purposes. It is featured in testing and tutorial
    notebooks.

    Parameters
    ----------
    input_len : int
        Length of the input sequence.
    output_dim : int
        Dimension of the output.
    strand : str, optional
        Strand of the input. Only ss is supported for this model
    task : str, optional
        Task of the model. Either "regression" or "classification".
    aggr : str, optional
        Aggregation method. This model only supports "avg"
    loss_fxn : str, optional
        Loss function.
    **kwargs
        Keyword arguments to pass to the BaseModel class.
    """
```

## (Optional) Add information to the EUGENe model Notion database
If you want to help me in my never ending quest/addiction to organize things, please consider adding the details of your new model to [this](https://www.notion.so/44cca45b45cd41c2b06b74b9ca6242da?v=235befda27d54f9eaa85dafeaad1be3b) Notion database. Check out the examples already there for how to format your entry.

# 8. (Optional) Submit a pull request
You only need to do this if you want to share your model with the world (which is strongly encouraged)!

Once you've completed all of the above steps, you can submit a pull request to the EUGENe repository. We will review your pull request and merge it into the main branch if everything looks good. If there are any issues, we will let you know and you can make the necessary changes. Once your pull request is merged, your model will be available in the next release of EUGENe!

# 9. More advanced training techniques
The beauty of using PyTorch Lightning under the hood is that the framework has allowed us to create an abstraction from the basic training details that become boilerplate with most models while giving us the flexibility to make changes in a simple modular manner. You can find a slightly more involved description of advanced training in the documentation for EUGENe and we will be adding functionality and examples in the future.


# Wrapping up
Hopefully this guide was helpful in getting you started with adding your own model to EUGENe. If you have any questions, feel free to open a GitHub issue.

---