# Getting Started with Selene

This tutorial explores the core components of Selene, and should teach you everything you need to know to train a simple model on biological sequence data.
Before starting this tutorial, you need to install Selene.
Instructions for installation are available [here](https://selene.flatironinstitute.org/installation.html).
Lastly, if you are not familiar with neural networks, we recommend reading through this [introductory PyTorch tutorial on neural networks](https://pytorch.org/tutorials/beginner/blitz/neural_networks_tutorial.html).
In the simplest case, we train a neural network as follows:

1. Construct our neural network, which should be a [`torch.nn.Module`](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) object
2. Load the training data and divide it into training and validation sets
3. Iterate over the training set
4. Compute and backpropagate the training loss after each iteration
5. Save the model weights at specified intervals
6. Compute and report the loss on the validation set at specified intervals
7. Compute and report the loss on the validation set after training is complete
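
The steps above can be sketched without any deep learning library. The following dependency-free illustration fits a one-parameter linear model with plain gradient descent instead of a `torch.nn.Module`. It is not Selene's training code, and every name in it (`train`, `mse`, `save_every`, `validate_every`) is hypothetical.

```python
import random


def mse(weight, data):
    """Mean squared error of the model y = weight * x on (x, y) pairs."""
    return sum((weight * x - y) ** 2 for x, y in data) / len(data)


def train(data, lr=0.01, max_steps=200, save_every=50, validate_every=50, seed=0):
    rng = random.Random(seed)
    data = list(data)
    rng.shuffle(data)

    # Step 2: divide the data into training and validation sets.
    split = int(0.8 * len(data))
    train_set, valid_set = data[:split], data[split:]

    weight = 0.0      # Step 1: the "model" is a single parameter here.
    checkpoints = {}  # Stand-in for weights saved to disk.
    for step in range(1, max_steps + 1):
        # Step 3: iterate over the training set, one example per step.
        x, y = train_set[(step - 1) % len(train_set)]
        # Step 4: gradient of the squared error (weight * x - y) ** 2.
        grad = 2 * (weight * x - y) * x
        weight -= lr * grad
        if step % save_every == 0:      # Step 5: save the weights.
            checkpoints[step] = weight
        if step % validate_every == 0:  # Step 6: periodic validation loss.
            print(f"step {step}: validation loss {mse(weight, valid_set):.6f}")
    # Step 7: validation loss after training is complete.
    return weight, mse(weight, valid_set), checkpoints


# Toy data drawn from y = 3x.
examples = [(x / 10, 3 * x / 10) for x in range(1, 21)]
weight, final_loss, checkpoints = train(examples)
```

Selene handles steps 2 through 7 for us, which is why only the model and a configuration file remain to be written.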

With Selene, much of this work is already done for us.
In fact, we do not actually need to write any code besides our model and a configuration file.

## Download the data

First, we need to download the data.

TODO

## Command line arguments

Selene takes only two command line arguments.
The first of these is the positional parameter for the configuration file, which we will discuss in more detail in the following section.
The second argument is the optional named argument for the learning rate, specified with `--lr`.
The learning rate only needs to be specified when Selene is training a model, and is ignored in all other circumstances.
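
To make the shape of this interface concrete, here is a minimal sketch of a parser with one positional argument and one optional `--lr` flag, built with the standard library's `argparse`. It is an illustration only, not Selene's actual parser; the attribute name `config_path` is an assumption.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="selene")
    # Positional argument: path to the YAML configuration file.
    parser.add_argument("config_path", help="path to the YAML configuration file")
    # Optional named argument: learning rate, only relevant when training.
    parser.add_argument("--lr", type=float, default=None, help="learning rate")
    return parser


# Parse an example argument list instead of the real command line.
args = build_parser().parse_args(["train.yaml", "--lr", "0.01"])
print(args.config_path, args.lr)
```

Leaving `--lr` out of the argument list gives `args.lr` the value `None`, which a caller can use to fall back to the configuration file.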

## Configuration file syntax

The configuration file is a [YAML file](https://en.wikipedia.org/wiki/YAML) that specifies the majority of the runtime parameters for Selene.
In general, a YAML file with keys `key1` and `key2` taking values `val1` and `val2` would look like this:

```YAML
---
key1: val1
key2: val2
...
```

To train a new network, we must include a few keys in this YAML file.
Later sections explain each of these keys in detail.
First, however, we need to cover the syntax of the configuration file.
Each of the argument types that a configuration file can contain is discussed below.

### Literal arguments

The simplest configuration arguments are literals.
To specify a learning rate of `0.01`, and that we would like to evaluate test metric performance and not save the datasets, we would include the following lines in our configuration file:
```YAML
lr: 0.01
evaluate_on_test: True
save_datasets: False
```

### List arguments

After literals, list arguments like `ops` are the next simplest type of configuration parameter.
Syntactically, list arguments are very similar to the Python lists that they represent.
For instance, to specify `ops` as the Python list below:
```python
ops = ["train"]
```
we would write the following line in our configuration file:
```YAML
ops: [train]
```

### Dictionary arguments

The next type of argument we need is a dictionary.
Like lists, dictionaries in the configuration file are very similar to their Python equivalents.
For instance, if the `model` configuration were written as a dictionary in Python, it might look something like the following:
```python
model = {"file": "/path/to/deepsea.py",
         "class": "DeepSEA",
         "sequence_length": 1000,
         "n_classes_to_predict": 919,
         "non_strand_specific": {
             "use_module": True,
             "mode": "mean"
             }
         }
```
Now, to write this in the configuration file, we simply include the following lines:
```YAML
model: {
    file: /path/to/deepsea.py,
    class: DeepSEA,
    sequence_length: 1000,
    n_classes_to_predict: 919,
    non_strand_specific: {
        use_module: True,
        mode: mean
    }
}
```

### Function arguments 

In addition to the types we've just discussed, Selene's configuration file accepts Python function calls.
For instance, let's say we want to specify the value of the `features` argument for `train_model`, which takes a list of strings and specifies the names of the values we are predicting with our model.
One option would be to write the list of strings into the configuration file, but this might take a long time if this list is very long.
If we were using Python, we would just read the list of feature names as follows:
```python
import selene
features = selene.utils.load_features_list(input_path="features_list.txt")
```
Fortunately, we can use function call arguments to include this in our configuration file.
Specifically, we would write the following in our configuration file:
```YAML
features: !obj:selene.utils.load_features_list {
    input_path: "features_list.txt"
}
```

## Training a model and analyzing sequences with it

To train or analyze sequences with a model, we first specify the configuration file and then we execute `selene` from the command line.
The first section provides an overview of all the requirements for training a model with Selene.
The second section covers the arguments used to evaluate sequences with a trained model.
We recommend opening the included `train.yaml` and `analyze.yaml` configuration files and following along in them while reading through these sections.

### Configuration file arguments for training
Before running Selene from the command line, we need to specify its runtime parameters in a configuration file.
Specifically, we need to include the following:

| key                | definition |
|--------------------|------------|
| `ops`              | list of operations to execute with Selene |
| `model`            | dict containing the configuration parameters for the model we intend to train |
| `sampler`          | a subclass of `selene.samplers.Sampler` |
| `train_model`      | a subclass of `selene.TrainModel` |
| `lr`               | a floating point value for the learning rate, if we do not want to specify it in the command line arguments |
| `evaluate_on_test` | a boolean specifying whether we should calculate performance metrics on held out test data |
| `save_datasets`    | a boolean specifying if we would like to write the training/validation/test data to file |

#### ops

Selene currently supports the `train` and `analyze` operations, and allows chaining of operations by simply adding them to the `ops` list in the configuration file.
For instance, to train a model and then use it to analyze some data, you would include the following line in the configuration file:
```YAML
ops: [train, analyze]
```
To only train a model, we would just write the following:
```YAML
ops: [train]
```
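
Chaining can be thought of as running one handler per entry of `ops`, in list order. The following is a hypothetical sketch of that dispatch pattern, assuming that unsupported names are rejected up front; `run_ops` and the placeholder handlers are not part of Selene.

```python
def run_ops(ops, handlers):
    """Run each requested operation in the order it appears in `ops`."""
    unknown = [op for op in ops if op not in handlers]
    if unknown:
        raise ValueError(f"unsupported operations: {unknown}")
    return [handlers[op]() for op in ops]


# Placeholder handlers; real ones would train a model or analyze sequences.
handlers = {
    "train": lambda: "trained",
    "analyze": lambda: "analyzed",
}
results = run_ops(["train", "analyze"], handlers)
print(results)
```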

#### model

In this tutorial, we will use the neural network from [DeepSEA](http://deepsea.princeton.edu), which models chromatin properties of sequences in the non-coding genome.
The class for this model, `DeepSEA`, is specified in the `deepsea.py` file from earlier.
The model should follow all of the [normal rules for specifying a `torch.nn.Module`](https://pytorch.org/tutorials/beginner/examples_nn/two_layer_net_module.html), with two exceptions.
First, the file with the model class should include a module-level function called `criterion` that returns the object to use for the [PyTorch loss function](https://pytorch.org/docs/master/nn.html#loss-functions).
In `deepsea.py`, this is defined as follows:
```python
def criterion():
    return torch.nn.BCELoss()
```
Second, the same file must define a function called `get_optimizer` that takes a learning rate and returns the [optimizer class](https://pytorch.org/docs/master/optim.html) and its parameters.
The return value should be a 2-tuple, where the first element is the optimizer class, and the second element is a `dict` containing the keyword arguments to use when constructing the optimizer.
In `deepsea.py`, this is specified as follows:
```python
def get_optimizer(lr):
    return (torch.optim.SGD, {"lr": lr, "weight_decay": 1e-6, "momentum": 0.9})
```
Note that, to allow specifying the learning rate at the command line, you should include the passed `lr` argument in the `dict` of keyword arguments.
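
To illustrate how a model file with these three names could be consumed, the sketch below loads a Python file by path and pulls out the class, `criterion`, and `get_optimizer`. It is an assumption about the general approach, not Selene's loader. The stand-in model file avoids PyTorch so that the example runs anywhere, and `load_model_file`, `ToyModel`, and the use of `dict` as a placeholder optimizer class are all hypothetical.

```python
import importlib.util
import os
import tempfile


def load_model_file(path, class_name):
    """Load a model file and return (model class, criterion, get_optimizer)."""
    spec = importlib.util.spec_from_file_location("user_model", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return getattr(module, class_name), module.criterion, module.get_optimizer


# A stand-in model file that defines the three required names.
SOURCE = '''
class ToyModel:
    pass

def criterion():
    return "loss-function-placeholder"

def get_optimizer(lr):
    # `dict` stands in for an optimizer class such as torch.optim.SGD.
    return (dict, {"lr": lr, "momentum": 0.9})
'''

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_model.py")
    with open(model_path, "w") as handle:
        handle.write(SOURCE)
    model_class, criterion, get_optimizer = load_model_file(model_path, "ToyModel")

# Unpack the 2-tuple and build the "optimizer" from its keyword arguments.
# A PyTorch optimizer would also receive the model parameters as its first argument.
optimizer_class, optimizer_kwargs = get_optimizer(0.01)
optimizer = optimizer_class(**optimizer_kwargs)
```

Returning the class and its keyword arguments separately, rather than a constructed optimizer, lets the caller supply the model parameters at construction time.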

#### sampler

The `sampler` argument specifies how Selene will sample its training data.
The value for `sampler` should be a function-type argument whose function constructs an object that is a subclass of `selene.samplers.Sampler`.
The specific arguments for the sampler's construction will vary by class, so it is important to check the class definitions and documentation when specifying them.
For this example, we will use the following configuration for the `sampler`:
```YAML
sampler: !obj:selene.samplers.IntervalsSampler {
    reference_sequence: !obj:selene.sequences.Genome {
        input_path: hg19.fasta
    },
    features: !obj:selene.utils.load_features_list {
        input_path: distinct_features.txt
    },
    target_path: target_data.bed.gz,
    intervals_path: intervals_only.txt,
    sequence_length: 1000,
    center_bin_to_predict: 200,
    test_holdout: [chr8, chr9],
    validation_holdout: [chr6, chr7],
    feature_thresholds: 0.5,
    seed: 127,
    mode: "train",
    save_datasets: []
}
```
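
The `validation_holdout` and `test_holdout` lists name chromosomes. A plausible reading, stated here as an assumption rather than a description of how `IntervalsSampler` works internally, is that intervals on those chromosomes are kept out of training and reserved for validation and testing. The sketch below shows such a split on made-up intervals; both function names are hypothetical.

```python
def assign_partition(chrom, validation_holdout, test_holdout):
    """Return the dataset that a chromosome's intervals belong to."""
    if chrom in test_holdout:
        return "test"
    if chrom in validation_holdout:
        return "validate"
    return "train"


def split_intervals(intervals, validation_holdout, test_holdout):
    """Group (chrom, start, end) intervals by partition."""
    partitions = {"train": [], "validate": [], "test": []}
    for chrom, start, end in intervals:
        name = assign_partition(chrom, validation_holdout, test_holdout)
        partitions[name].append((chrom, start, end))
    return partitions


# Made-up intervals for illustration only.
intervals = [
    ("chr1", 0, 1000),
    ("chr6", 500, 1500),
    ("chr8", 200, 1200),
    ("chr9", 0, 1000),
]
partitions = split_intervals(intervals, ["chr6", "chr7"], ["chr8", "chr9"])
print({name: len(items) for name, items in partitions.items()})
```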

#### train_model
The `train_model` argument is responsible for specifying many of the parameters for `selene.TrainModel`.
The following parameters for `train_model` are automatically generated, and should not be specified in the configuration file:

| parameter          |
|--------------------|
| `model`            |
| `data_sampler`     |
| `loss_criterion`   |
| `optimizer_class`  |
| `optimizer_kwargs` |

With this in mind, we write the following in our configuration file:
```YAML
train_model: !obj:selene.TrainModel {
    batch_size: 64,
    max_steps: 500000,
    report_stats_every_n_steps: 16000,
    n_validation_samples: 32000,
    cpu_n_threads: 32,
    use_cuda: False,
    data_parallel: False,
    logging_verbosity: 2,
    output_dir: ./
}
```
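
It can help to work out what these numbers imply before starting a long run. The arithmetic below uses the values from the configuration above and assumes, as an illustration only, that statistics are reported whenever the step count is a multiple of `report_stats_every_n_steps` and that validation is done in batches of `batch_size`.

```python
batch_size = 64
max_steps = 500000
report_stats_every_n_steps = 16000
n_validation_samples = 32000

# Training examples drawn over the whole run.
total_training_samples = batch_size * max_steps

# Steps at which statistics would be reported under the assumption above.
report_steps = list(
    range(report_stats_every_n_steps, max_steps + 1, report_stats_every_n_steps)
)

# Batches needed to cover the validation samples, rounding up.
validation_batches = -(-n_validation_samples // batch_size)

print(total_training_samples, len(report_steps), validation_batches)
# prints: 32000000 31 500
```

Because 500000 is not a multiple of 16000, the last report under this assumption falls at step 496000, slightly before training ends.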

#### other arguments

There are three additional arguments we can use when training models: `lr`, `evaluate_on_test`, and `save_datasets`.
If you do not want to specify the learning rate in the command line arguments, you can specify it in the configuration file.
However, note that Selene will throw an exception and crash if `lr` is not included in the configuration file or specified in the command line arguments.
If we want to specify it in the configuration file, we can include the following line:
```YAML
lr: 0.01
```
If you want your model to evaluate its performance on the test data after it has completed its training, you can use the `evaluate_on_test` argument.
Unless this argument is included in the configuration file and set to `True`, the model performance will not be evaluated on the test data.
To set the `evaluate_on_test` argument to `True`, you would include the following line in your configuration file:
```YAML
evaluate_on_test: True
```
Lastly, there is the `save_datasets` argument, which will let us save our data to file if we are using a sampler that generates 
Like the `evaluate_on_test` argument, `save_datasets` will be set to `False` if it is not included in the configuration file.
To specify that we would like to save the datasets, we would include the following in our configuration file:
```YAML
save_datasets: True
```
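
The rules for these three arguments can be summarized in a few lines. The sketch below is hypothetical and not taken from Selene: it raises an error when no learning rate is available and defaults the two booleans to `False`. The text does not say which source wins when `lr` is given in both places, so the precedence of the command line value here is an assumption.

```python
def resolve_training_options(config, cli_lr=None):
    """Combine the parsed configuration dict with the command line learning rate."""
    # Assumption: a command line value takes precedence over the config file.
    lr = cli_lr if cli_lr is not None else config.get("lr")
    if lr is None:
        raise ValueError(
            "lr must be given on the command line or in the configuration file"
        )
    return {
        "lr": lr,
        # Both flags fall back to False when absent from the configuration.
        "evaluate_on_test": config.get("evaluate_on_test", False),
        "save_datasets": config.get("save_datasets", False),
    }


options = resolve_training_options({"lr": 0.01, "evaluate_on_test": True})
print(options)
```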

### Configuration file arguments for analyzing sequences

TODO

#### TODO

### Running it
TODO