# Adding a model to EUGENe 

**Authorship:**
Adam Klie, *10/05/2022*
***
**Description:**
This tutorial shows how to add a model to EUGENe. The process is simple, but it's important to follow the steps in order so that the model can be properly utilized throughout the EUGENe pipeline.
***

In [5]:
if 'autoreload' not in get_ipython().extension_manager.loaded:
    %load_ext autoreload 
%autoreload 2

import os
import numpy as np
import pandas as pd
import eugene as eu

eu.settings.dataset_dir = "./tutorial_datasets"
eu.settings.logging_dir = "./tutorial_logs"
eu.settings.output_dir = "./tutorial_output"

# 1. Review the `BaseModel` class and check out some examples in the `_custom_models.py` file
To fully integrate a model into the EUGENe pipeline, it is recommended that you make it a subclass of the [`BaseModel` class]() (**TODO**). Though many of EUGENe's functions work under the assumption that the model is a subclass of `torch.nn.Module`, many others assume the structure dictated by the `BaseModel` class. For the rest of this tutorial, we assume that we are inheriting from `BaseModel`.

Before you begin implementing anything, it is recommended that you take a look at the `BaseModel` class attributes. These are the attributes that you will need to instantiate for any EUGENe model. I also find that it helps to see a few examples, which you can find in the `_custom_models.py` file.

# 2. Create a model class
* This should be a Python class that at the very least inherits from `torch.nn.Module`, but ideally should inherit from the `BaseModel` class.
    - If your model is associated with a publication, use the last name of the first author followed by the year of publication (NameYY). It can also be useful to add the type of model you are implementing. For example, if the model is a CNN that was published in 2021 by the author "Jane Doe", the class should be named `Doe21CNN`.
* At a minimum, this class should have two methods: an `__init__` and a `forward`.
    - `__init__`
        * Needs to call:
        ```python
        super().__init__(
            input_len, 
            output_dim, 
            strand=strand, 
            task=task, 
            aggr=aggr, 
            loss_fxn=loss_fxn,
            **kwargs
        )
        ```
    - `forward`
        * Needs to take in the forward-strand input `x` and the reverse-strand input `x_rev_comp` as arguments, with `x_rev_comp` defaulting to `None`. Even if your model takes in only the forward strand (i.e. does not use the "ds" or "ts" modes), the second argument needs to be defined.
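
The naming convention above can be sketched as a small helper. `model_class_name` is a hypothetical function written for this tutorial, not part of EUGENe; it simply joins the first author's last name, the two-digit year, and an optional model type.

```python
def model_class_name(last_name: str, year: int, model_type: str = "") -> str:
    """Build a NameYY[Type] class name, e.g. Doe21CNN.

    Hypothetical helper for illustration only; it is not part of EUGENe.
    """
    # Capitalize the first letter and leave the rest of the name untouched
    name = last_name[:1].upper() + last_name[1:]
    # Keep only the last two digits of the publication year
    yy = f"{year % 100:02d}"
    return f"{name}{yy}{model_type}"


print(model_class_name("doe", 2021, "CNN"))  # Doe21CNN
```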

In [None]:
from eugene.models.base import BaseModel


In [None]:
class NewModel(BaseModel):

    def __init__(
        self,
        input_len: int,
        output_dim: int,
        strand: str = "ss",
        task: str = "regression",
        aggr: str = None,
        loss_fxn: str = "mse",
        **kwargs
    ):
        # Don't worry that we don't pass in the class name to the super call (as is standard for creating new
        # nn.Module subclasses). This is handled by inheriting BaseModel
        super().__init__(
            input_len,
            output_dim,
            strand=strand,
            task=task,
            aggr=aggr,
            loss_fxn=loss_fxn,
            **kwargs
        )

    def forward(self, x, x_rev_comp=None):
        # Placeholder: replace with your architecture's forward pass
        return x
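
For reference, here is a minimal sketch of what a filled-in `forward` might look like. To keep it runnable without EUGENe installed, it subclasses plain `torch.nn.Module` as a stand-in for `BaseModel`. The class name `ToyTwoStrandCNN`, the layer sizes, and the way the two strands are combined are all assumptions made for illustration; they are not necessarily how EUGENe's built-in models treat the "ds" or "ts" modes.

```python
import torch
import torch.nn as nn


class ToyTwoStrandCNN(nn.Module):
    """Hypothetical stand-in; in EUGENe you would inherit from BaseModel instead."""

    def __init__(self, input_len, output_dim, strand="ss", aggr="avg"):
        super().__init__()
        self.strand = strand
        self.aggr = aggr
        # One convolution followed by a global max pool over sequence positions
        self.conv = nn.Sequential(
            nn.Conv1d(4, 8, kernel_size=5),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),
            nn.Flatten(),
        )
        self.head = nn.Linear(8, output_dim)

    def forward(self, x, x_rev_comp=None):
        out = self.head(self.conv(x))
        # Single-strand mode, or no reverse strand given: ignore x_rev_comp
        if self.strand == "ss" or x_rev_comp is None:
            return out
        # Assumed behavior: share weights across strands, then aggregate
        out_rev = self.head(self.conv(x_rev_comp))
        if self.aggr == "max":
            return torch.max(out, out_rev)
        return (out + out_rev) / 2
```

Note that `x_rev_comp` stays in the signature even in single-strand mode, so the same call works regardless of the strand setting.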

# 3. Test the forward pass
It's often helpful to run a simple forward pass of the model with some dummy data to make sure all your matrix multiplications and other operations are working.

In [None]:
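import torch

# Length of strand
x_len = 66

# Generate some random input: (batch, channels, length) for the sequences
x = torch.randn(10, 4, x_len)
x_rev = torch.randn(10, 4, x_len)
# Random targets: (batch, number of outputs)
y = torch.randn(10, 2)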

In [None]:
# Instantiate your model (arguments chosen to match the dummy data above)
model = NewModel(input_len=x_len, output_dim=2)

In [None]:
model(x, x_rev)
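
Beyond eyeballing the output, it can help to assert on its shape and values. The sketch below is one way to do that. `check_forward` and `FlattenLinear` are hypothetical names introduced here, and `FlattenLinear` is a plain `torch.nn.Module` stand-in so the example runs without EUGENe.

```python
import torch
import torch.nn as nn


def check_forward(model, input_len, output_dim, batch_size=10):
    """Run one forward pass on random data and check the output."""
    x = torch.randn(batch_size, 4, input_len)
    x_rev = torch.randn(batch_size, 4, input_len)
    model.eval()
    with torch.no_grad():
        out = model(x, x_rev)
    # One row per sequence, one column per output
    assert out.shape == (batch_size, output_dim), f"unexpected shape {tuple(out.shape)}"
    # No NaN or inf values from the forward pass
    assert torch.isfinite(out).all(), "output contains NaN or inf"
    return out


class FlattenLinear(nn.Module):
    """Hypothetical stand-in model: flatten the one-hot input, then one linear layer."""

    def __init__(self, input_len, output_dim):
        super().__init__()
        self.linear = nn.Linear(4 * input_len, output_dim)

    def forward(self, x, x_rev_comp=None):
        return self.linear(x.flatten(start_dim=1))


out = check_forward(FlattenLinear(66, 2), input_len=66, output_dim=2)
print(out.shape)  # torch.Size([10, 2])
```

The placeholder `forward` in this tutorial returns `x` unchanged, so it would not pass this check until a real architecture is filled in.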

# 4. Test a PL trainer
If your model is a `BaseModel` instance and you are working with SeqData objects, this is as simple as a call to the `fit` function within the `train` module.

If you are working directly with PyTorch Lightning (PL) trainers, this is a little more complicated but still not too bad! You just need to create an appropriate DataLoader for your implementation (which can be converted from a SeqData object) and pass the model and the dataloader to a PL trainer.

## Using `train.fit`

In [None]:
sdata = eu.datasets.random1000()
eu.train.fit(sdata, model)

## Directly using PL trainer

In [None]:
from pytorch_lightning import Trainer
from torch.utils.data import DataLoader

In [None]:
sdataloader = DataLoader(sdata.to_dataset(target_keys="activity_0"))

In [None]:
trainer = Trainer()

In [None]:
trainer.fit(model, sdataloader)

# 5. Adding your model to EUGENe
Once you are happy with how your model seems to be working, you can add it to the appropriate `.py` file within EUGENe. 
- `_base_models.py`: This is meant for implementations of flexible architectures that are at the core of deep learning across fields. This might be something like a vanilla autoencoder, where you can change the number of hidden layers and units in the encoder or decoder.
- `_sota_models.py`: These are often instances of the Base Models above. Some have architectures that don't quite fit within the mold of the Base Models (e.g. DeepBind models that concatenate max and average pooling layers), while others are just calls to Base Models with a specific configuration of hyperparameters (an example of the latter might be the DeepSEA architecture, which could be created with a specific call to a CNN). There also must be some basis for calling this a SOTA architecture. I realize this is somewhat arbitrary and I could probably have endless debates with people about what this means, but I typically use the rule that I know a SOTA architecture when I see one (e.g. if you are reading this you probably know what DeepSEA is).
- `_custom_models.py`: These are custom architectures that don't really fall under the Base Models or SOTA Models. These might be published models that were successful on a particular dataset, or your own custom architecture that you just want to be able to use within EUGENe. A note on the latter: for a custom model to make it into a future release of EUGENe, there should be some basis for its inclusion. That is, you should be able to demonstrate the utility of the architecture on some real-world data.

# 6. Create a unit test for your model
* In order for your model to make it into the next EUGENe release, it needs to have a unit test within the [`test_models.py` file](TODO). At a minimum this unit test should test the instantiation of your model and the training procedure of your choice on some dummy data. Check out the unit tests already there for more examples.

As is the general rule for testing, the more "units" you can test the better. Feel free to add other tests as well. One other area that might be a little tricky is making sure the convolutional filters of your model are seen by the `generate_pfms` function in the `interpret` module. 
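
A minimal unit test along these lines might look like the sketch below. The model `NewModelCNN21` is a hypothetical plain `torch.nn.Module` stand-in (a real test would import your model from EUGENe), and a single SGD step stands in for "the training procedure of your choice".

```python
import torch
import torch.nn as nn


class NewModelCNN21(nn.Module):
    """Hypothetical stand-in; a real test would import your model from EUGENe."""

    def __init__(self, input_len, output_dim):
        super().__init__()
        self.conv = nn.Conv1d(4, 8, kernel_size=5)
        self.head = nn.Linear(8, output_dim)

    def forward(self, x, x_rev_comp=None):
        # Convolution, ReLU, then a global max pool over sequence positions
        h = torch.relu(self.conv(x)).amax(dim=-1)
        return self.head(h)


def test_NewModelCNN21():
    torch.manual_seed(0)
    x = torch.randn(10, 4, 66)
    y = torch.randn(10, 2)

    # 1. Instantiation
    model = NewModelCNN21(input_len=66, output_dim=2)

    # 2. Forward pass gives one row per sequence and one column per target
    out = model(x)
    assert out.shape == (10, 2)

    # 3. One optimization step runs and changes the parameters
    before = [p.detach().clone() for p in model.parameters()]
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss = nn.functional.mse_loss(out, y)
    loss.backward()
    optimizer.step()
    assert any(not torch.equal(b, p) for b, p in zip(before, model.parameters()))
```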

Don't forget to actually run your tests as well! This can be done with the following command:

```bash
pytest tests/test_models.py -k "test_NewModelCNN21"
```

# 7. Document your model

## Docstring

Once you're happy with your model, and you've tested it and it's working as expected, you can add documentation to it. This is done by adding a docstring. The docstring should be in numpydoc format and should contain a parameters and a returns section. You can see examples of this in any of the model scripts. For reference, here is the docstring of a dataset-loading function:

```python
"""
Reads in the farley15 dataset.

Parameters
----------
return_sdata : bool, optional
    If True, return SeqData object for the farley15 dataset. The default is True.
    If False, return the paths to any downloaded files.
**kwargs : kwargs, dict
    Keyword arguments to pass to read_csv.

Returns
-------
sdata : SeqData
    SeqData object for the farley15 dataset.
""" 
```
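
Applied to a model class, a numpydoc docstring might look like the sketch below. The class is a plain Python stand-in (no `BaseModel`) so the block runs on its own, and the parameter descriptions are my reading of the argument names used earlier in this tutorial rather than text taken from EUGENe's documentation.

```python
class NewModel:
    """
    Template model that returns its input unchanged.

    Parameter descriptions below are illustrative assumptions.

    Parameters
    ----------
    input_len : int
        Length of the input sequences.
    output_dim : int
        Number of output values per sequence.
    strand : str, optional
        Strand mode, e.g. "ss", "ds" or "ts". The default is "ss".
    task : str, optional
        Type of prediction task. The default is "regression".
    aggr : str, optional
        How to aggregate outputs across strands. The default is None.
    loss_fxn : str, optional
        Name of the loss function. The default is "mse".
    """

    def __init__(self, input_len, output_dim, strand="ss",
                 task="regression", aggr=None, loss_fxn="mse"):
        # Store the arguments so the documented parameters map to attributes
        self.input_len = input_len
        self.output_dim = output_dim
        self.strand = strand
        self.task = task
        self.aggr = aggr
        self.loss_fxn = loss_fxn
```

Since a class has no return value, the returns section is more naturally placed on the docstring of `forward`.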

## (Optional) Add information to the EUGENe model Notion database
If you want to help me in my never-ending quest/addiction to organize things, please consider adding the details of your new model to [this](TODO) Notion database. Check out the example already there for how to format your entry.

# 8. (Optional) Submit a pull request
You only need to do this if you want to share your model with the world (which is strongly encouraged)!

Once you've completed all of the above steps, you can submit a pull request to the EUGENe repository. We will review your pull request and merge it into the main branch if everything looks good. If there are any issues, we will let you know and you can make the necessary changes. Once your pull request is merged, your model will be available in the next release of EUGENe!

# X. More advanced training techniques
- Use supplementary note stuff
- Note that tutorials for this are coming!

# Wrapping up
Hopefully this guide was helpful in getting you started with adding your own model to EUGENe. If you have any questions, feel free to open a GitHub issue.

---