# Template for a EUGENe workflow on a new dataset

**Authorship:**
Adam Klie, *08/07/2022*
***
**Description:**
Template notebook for creating a EUGENe workflow on a new dataset. To see a full list of functionality, check out the [API](https://eugenegroup.github.io/EUGENE/api/). You can always use `?` to see the available parameters for each method.
***

In [None]:
if 'autoreload' not in get_ipython().extension_manager.loaded:
    %load_ext autoreload 
%autoreload 2

import numpy as np
import pandas as pd
import eugene as eu

# Configure EUGENe 
print(eu.__version__)
eu.settings.dataset_dir = #TODO: path to dataset directory for saving download links
eu.settings.logging_dir = #TODO: path to logging directory for model training logs
eu.settings.output_dir = #TODO: path to output directory for outputs of model training
eu.settings.dl_num_workers = #TODO: number of workers for data loading
eu.settings.batch_size = #TODO: batch size for data loading

# Dataloading
We first need to load our data into memory. If the dataset is a "EUGENe benchmarking dataset", it can be loaded through the `datasets` module:

```python
eu.datasets.random1000()
```

If the requested dataset requires a download, it will be downloaded and loaded in automatically. Use `get_dataset_info()` to get information about the datasets available as "EUGENe benchmarking datasets".

---

You can also read from standard file formats into `SeqData` objects using `read_` functions from the `dataloading` module:

```python
eu.dl.read_csv('datasets/random1000/random1000_seqs.tsv')
```
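For intuition about what a `read_` function has to do, here is a minimal standard-library sketch that parses a tab-separated table into names, sequences and targets. It does not use EUGENe. The helper `read_seq_table`, the column names (`name`, `seq`, `target`) and the example rows are assumptions made for illustration, not EUGENe's file layout or API.

```python
import csv
import io


def read_seq_table(handle, name_col="name", seq_col="seq", target_col="target"):
    """Parse a tab-separated table into parallel lists.

    Hypothetical helper for illustration; the column names are assumptions.
    """
    names, seqs, targets = [], [], []
    for row in csv.DictReader(handle, delimiter="\t"):
        names.append(row[name_col])
        # Normalize case so that later encoding steps see a consistent alphabet.
        seqs.append(row[seq_col].upper())
        targets.append(float(row[target_col]))
    return names, seqs, targets


# A tiny in-memory file with made-up data stands in for a real TSV on disk.
example_tsv = "name\tseq\ttarget\nseq001\tACGTAC\t0.5\nseq002\tttgaca\t1.25\n"
names, seqs, targets = read_seq_table(io.StringIO(example_tsv))
print(names, seqs, targets)
```

Passing an open file handle instead of `io.StringIO` would read the same layout from disk.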

In [1]:
# TODO: load your data

# Data Visualization
Data visualization is a key part of the EUGENe workflow. We can use the `plotting` module to visualize aspects of our dataset like target value distributions and sequence length:

```python
sdata["SEQ_LEN"] = [len(seq) for seq in sdata.seqs]
eu.pl.histplot(
    sdata, 
    keys="SEQ_LEN", 
    orient="h"
)
eu.pl.violinplot(
    sdata,
    keys="target",
    groupby="group",
    hue="subgroup",
    xlab="Group",
    ylab="Target",
    title="Target distribution"
)
```
The commands above are examples: the histogram shows the distribution of sequence lengths, and the violin plot shows the distribution of target values.

In [1]:
# TODO: Write some code to visualize what your data looks like. eu.pl.histplot, eu.pl.boxplot, eu.pl.scatterplot, eu.pl.violinplot are useful functions here

# Preprocessing
We can preprocess our data using the `preprocessing` module. This includes:

- reverse complementing sequences: `eu.pp.reverse_complement_data(sdata)`
- one-hot encoding the sequences: `eu.pp.one_hot_encode_data(sdata)`
- training/validation/test split: `eu.pp.train_test_split(sdata, split=0.8, rand_state=42)`
- scaling the target values: `eu.pp.scale_data(sdata)`
- and more!

Take a look at the API for more functions you can use. Most users, however, can use the `eu.pp.prepare_data(sdata)` function to get their data ready for training.
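To make these steps concrete, the following plain-Python sketch shows what reverse complementing, one-hot encoding and a seeded train/test split do conceptually. It is independent of EUGENe: the helper names, the A/C/G/T column order and the all-zero row for ambiguous bases are assumptions for illustration, not a description of how `eu.pp` implements them.

```python
import random

ALPHABET = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def one_hot_encode(seq):
    """Encode a sequence as one length-4 row per position (A, C, G, T order).

    Characters outside the alphabet (e.g. N) become an all-zero row.
    """
    return [[1 if base == letter else 0 for letter in ALPHABET] for base in seq]


def train_test_mask(n, train_fraction=0.8, seed=42):
    """Return n booleans; True marks items assigned to the training set."""
    indices = list(range(n))
    # A dedicated, seeded generator keeps the split reproducible.
    random.Random(seed).shuffle(indices)
    train_indices = set(indices[: round(n * train_fraction)])
    return [i in train_indices for i in range(n)]


print(reverse_complement("ACGTT"))
print(one_hot_encode("GGCAN"))
print(train_test_mask(10))
```

Storing the split as a boolean mask mirrors the `train_key="train"` convention used in the training examples, where a single column marks which sequences belong to the training set.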


In [None]:
# TODO: Preprocess your sequences and targets

In [3]:
# TODO: Visualize after preprocessing to sanity check

# Training
Now that we have our data ready, it's time to train our model. This starts with instantiating and initializing our model. We can use the `models` module to do this:

```python
model = eu.models.DeepBind(
    input_len=100,
    output_dim=1,
    scheduler="reduce_lr_on_plateau",
    scheduler_patience=2,
    lr=0.001
)
model.summary()
eu.models.init_weights(model)
```
We offer several options for instantiating a model architecture. Take a look at the API for more options and details.
- The `Base Model`s contain the 4 common base architectures: FCN, CNN, RNN and Hybrid. 
- The `SOTA Model`s contain 2 SOTA architectures: DeepBind and DeepSEA.
- The `Custom Models` are models that you can add to. We currently have a single custom model implemented to serve as a template (`Jores21CNN`). Who knows? Maybe your custom model will become SOTA!
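The `scheduler="reduce_lr_on_plateau"` and `scheduler_patience=2` arguments in the model example above suggest a schedule that lowers the learning rate once the monitored loss stops improving. The sketch below shows that idea in plain Python, loosely following the logic of PyTorch's `ReduceLROnPlateau` (without its improvement threshold or cooldown). How EUGENe wires these arguments internally is not confirmed here, and the reduction `factor` of 0.1 and the loss values are assumptions.

```python
def reduce_lr_on_plateau(val_losses, lr=0.001, patience=2, factor=0.1):
    """Return the learning rate in effect after each epoch.

    The rate is multiplied by `factor` whenever the validation loss has failed
    to improve on the best value seen so far for more than `patience` epochs.
    """
    best = float("inf")
    bad_epochs = 0
    history = []
    for loss in val_losses:
        if loss < best:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:
            lr *= factor
            bad_epochs = 0  # start counting again after a reduction
        history.append(lr)
    return history


# Made-up validation losses: improvement stalls after the second epoch.
losses = [1.0, 0.8, 0.81, 0.82, 0.83, 0.7]
print(reduce_lr_on_plateau(losses))
```

With a patience of 2, the rate only drops after the third consecutive epoch without improvement, so a single noisy epoch does not trigger a reduction.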

In [None]:
# TODO: Instantiate your model

With a model instantiated and initialized, we are set up to train our model. We can do this through the `train` module:

```python
eu.train.fit(
    model=model, 
    sdata=sdata, 
    gpus=1, 
    target="target",
    train_key="train",
    epochs=50,
    version="v1"
)
```

We can see how well our models trained by plotting a training summary:

```python
eu.train.pl_training_summary(model, version="v1")
```

In [None]:
# TODO: Initialize your model's parameters and train your model

# Evaluation
After the model has been trained, we can evaluate its performance on our training data and our held-out test data. Predictions are made through the `predict` module and visualized through the `plotting` module.
It is often best to use the model that achieved the lowest loss on the validation data for evaluation. We can load this model in from the log directory:
```python
best_model = eu.models.DeepBind.load_from_checkpoint("...")
```
We can then use this model to make predictions on our training and validation data and to visualize the performance:
```python
eu.predict.train_val_predictions(
    best_model, 
    sdata=sdata, 
    target="target",
    train_key="train",
    version="v1",
)
train_idx = np.where(sdata["train"] == True)[0]
eu.pl.performance_scatter(
    sdata, 
    seq_idx=train_idx, 
    target="target", 
    prediction="target_predictions",
    title="Training Set Performance",
    alpha=0.5,
)
```
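A scatter plot of predictions against targets is usually summarized with a correlation coefficient and an error measure. The sketch below computes the Pearson correlation and the mean squared error for two plain Python lists. The helper names and the numbers are made up for illustration; which metrics `eu.pl.performance_scatter` reports is not stated here.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def mean_squared_error(x, y):
    """Average squared difference between two equal-length lists."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


# Made-up targets and predictions for four sequences.
targets = [0.0, 1.0, 2.0, 3.0]
predictions = [0.1, 0.9, 2.2, 2.8]
print(pearson_r(targets, predictions), mean_squared_error(targets, predictions))
```

Comparing these numbers between the training and validation subsets gives a quick first check for overfitting.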

In [None]:
# TODO: See how you performed on the training and validation sets

It is important to understand how the model performs on a held-out (and ideally independent) test set. You should either load the test set separately (above or here) or split your data during preprocessing (see `jores21_analysis.ipynb` for an example).

```python
eu.predict.predictions(
    best_model, 
    sdata=sdata, 
    target="target",
    version="v1",
    file_label="test"
)
```

In [None]:
# TODO: If you have a test set, see how you did on that

# Interpretation
Potentially the most important step in the EUGENe workflow is the interpretation of the model's predictions. This is done through the `interpret` module. All the functions in this module act on either `SeqData` objects and models, or just models. Results from these calls can be visualized using the `plotting` module.

---

There are many options for interpreting the model's predictions, and we will again point users to the API for all the options and their arguments. We list examples for a few common ones below.


## Feature attribution
We can calculate the contribution of each nucleotide to the model's predictions for a sequence by using the `interpret` module's `feature_attribution` function. We currently implement several methods for this, including `DeepLift`, `ISM`, `InputXGradient` and `DeepSHAP`.
```python
eu.interpret.feature_attribution(
    best_model,
    sdata_test,
    saliency_method="DeepLift",
    device="cuda" if eu.settings.gpus > 0 else "cpu"
)
```
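Of the listed methods, in silico mutagenesis (ISM) is the simplest to picture: mutate each position to every alternative base and record how the prediction changes. The sketch below illustrates the idea with a toy scoring function standing in for a trained model. The function names and the `GATA` motif are made up, and this is not EUGENe's implementation.

```python
ALPHABET = "ACGT"


def toy_model(seq):
    """Stand-in for a trained model: counts occurrences of the motif GATA."""
    motif = "GATA"
    hits = sum(
        seq[i : i + len(motif)] == motif for i in range(len(seq) - len(motif) + 1)
    )
    return float(hits)


def in_silico_mutagenesis(seq, predict):
    """Score every single-nucleotide substitution of `seq`.

    Returns one dict per position, mapping each alternative base to the
    change in prediction (mutant minus reference).
    """
    reference = predict(seq)
    effects = []
    for pos, ref_base in enumerate(seq):
        deltas = {}
        for alt in ALPHABET:
            if alt == ref_base:
                continue
            mutant = seq[:pos] + alt + seq[pos + 1 :]
            deltas[alt] = predict(mutant) - reference
        effects.append(deltas)
    return effects


effects = in_silico_mutagenesis("CCGATACC", toy_model)
for pos, deltas in enumerate(effects):
    print(pos, deltas)
```

Positions inside the motif show a drop in the score for every substitution, while flanking positions show no change; this is the pattern an attribution map makes visible.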

In [6]:
# TODO: Run feature attribution on your model

## Filter Visualization 
We can get an idea of what each filter in the first convolutional layer of the model is seeing by using the `interpret` module's `generate_pfms` function. This creates a position frequency matrix for each filter in the model using sequences that highly activate that filter (which can be defined in multiple ways). We often pass the test sequences through the model, but you can in principle pass any sequences you want.
```python
eu.interpret.generate_pfms(
    best_model, 
    sdata_test
)
```
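To make the idea concrete: once the subsequences that most strongly activate a filter have been collected, the position frequency matrix is the per-position nucleotide count over them. The sketch below shows only this counting step. How `generate_pfms` selects activating subsequences is not covered, and the helper name and example subsequences are made up.

```python
ALPHABET = "ACGT"


def position_frequency_matrix(subseqs):
    """Count each nucleotide at each position across equal-length subsequences.

    Returns one row per position; each row holds counts in A, C, G, T order.
    """
    length = len(subseqs[0])
    if any(len(s) != length for s in subseqs):
        raise ValueError("all subsequences must have the same length")
    pfm = [[0] * len(ALPHABET) for _ in range(length)]
    for s in subseqs:
        for pos, base in enumerate(s):
            if base in ALPHABET:  # skip ambiguous characters such as N
                pfm[pos][ALPHABET.index(base)] += 1
    return pfm


# Made-up subsequences assumed to activate one filter strongly.
activating = ["GATA", "GATA", "GATC", "TATA"]
pfm = position_frequency_matrix(activating)
for row in pfm:
    print(row)
```

Dividing each row by its sum would turn the counts into per-position probabilities, a common input for drawing sequence logos.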

In [7]:
# TODO: Run filter visualization on your model

## Other interpretation methods
We currently implement a few other methods for interpreting the model's predictions. These include:
- Dimensionality Reduction on your importance scores: e.g. `eu.interpret.pca`
- ...

We are looking to add more! If you are interested in contributing...

In [None]:
# TODO: Perform other interpretation methods on your trained model

# Wrapping up
EUGENe is very much meant to be a community project. It represents a collection of data, models, and techniques meant for analyzing sequence data with deep learning. We are looking for contributions in almost every aspect of EUGENe. We are particularly interested in:

- New model additions through the `models` module
- New dataset additions through the `datasets` module
- New preprocessing techniques through the `preprocessing` module
- New visualization techniques through the `plotting` module
- New interpretation techniques through the `interpret` module
- New methods for training models in the `train` module

Please do not hesitate to contact us if you have any questions or suggestions.

---