# Regression Models in Selene

Selene is a flexible framework that can be used for tasks beyond classification.
This tutorial demonstrates how to train regression models with Selene.
For this example, we will predict mean ribosomal load (MRL) from 50 base pair 5' UTR sequences using models and data from [*Human 5′ UTR design and variant effect prediction from a massively parallel translation assay*](https://doi.org/10.1101/310375) by Sample et al.
This data was generated from a massively parallel reporter assay (MPRA), which you can read more about in the preprint on [*bioRxiv*](https://doi.org/10.1101/310375).

## Setup

**Architecture:** The model is defined in [utr_model.py](https://github.com/FunctionLab/selene/blob/master/tutorials/regression_mpra_example/utr_model.py) and differs only superficially from the model in [the paper](https://doi.org/10.1101/310375).
Since this is a real-valued regression problem, it is appropriate that the `criterion` method in `utr_model.py` uses the mean squared error.
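
To make the loss concrete, here is a minimal pure-Python sketch of what mean squared error computes. It is an illustration only, not the code in `utr_model.py`; the function name `mean_squared_error` and the MRL values are hypothetical.

```python
def mean_squared_error(predictions, targets):
    """Return the average squared difference between predictions and targets."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        raise ValueError("at least one value is required")
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical MRL values, made up for illustration only.
predicted_mrl = [5.0, 6.5, 7.0]
observed_mrl = [5.5, 6.0, 8.0]

mse = mean_squared_error(predicted_mrl, observed_mrl)
print(mse)
```

Because the errors are squared, large deviations from the observed MRL are penalized more heavily than small ones, which suits a continuous target.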


**Data:** The data from Sample et al. is available on the [Gene Expression Omnibus (GEO)](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE114002).
For convenience, we have included [the `download_data.py` script](https://github.com/FunctionLab/selene/blob/master/tutorials/regression_mpra_example/download_data.py) to download and preprocess the data.
It should produce three files: `train.mat`, `validate.mat`, and `test.mat`.
They contain the data for training, validation, and testing, respectively.

**Configuration file:** The configuration file [`regression_train.yml`](https://github.com/FunctionLab/selene/blob/master/tutorials/regression_mpra_example/regression_train.yml) differs slightly from the configuration files in the classification tutorials.
Specifically, `metrics` in `train_model` includes the coefficient of determination (`r2`), since the default metrics (`roc_auc` and `average_precision`) are not appropriate for regression.
Further, `report_gt_feature_n_positives` in `train_model` has been set to zero to prevent spurious filtering based on target values.
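
For readers unfamiliar with the coefficient of determination, the sketch below implements its standard definition, one minus the ratio of the residual sum of squares to the total sum of squares. It is a pure-Python illustration under that assumption, not Selene's own metric code; the function name and the values are hypothetical.

```python
def coefficient_of_determination(targets, predictions):
    """Return R^2 = 1 - SS_res / SS_tot for paired targets and predictions."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    if not targets:
        raise ValueError("at least one value is required")
    mean_target = sum(targets) / len(targets)
    # Residual sum of squares: error left over after the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(targets, predictions))
    # Total sum of squares: error of always predicting the mean target.
    ss_tot = sum((t - mean_target) ** 2 for t in targets)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all targets are identical")
    return 1 - ss_res / ss_tot


# Hypothetical MRL values, made up for illustration only.
observed_mrl = [5.5, 6.0, 8.0]
predicted_mrl = [5.0, 6.5, 7.0]

r2 = coefficient_of_determination(observed_mrl, predicted_mrl)
print(round(r2, 4))
```

A value of 1 means the predictions match the targets exactly, and a value of 0 means the model does no better than always predicting the mean target.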

## Download the data

To download the data, just run the [`download_data.py`](https://github.com/FunctionLab/selene/blob/master/tutorials/regression_mpra_example/download_data.py) script from the command line:
```sh
python download_data.py
```

## Train the model

```python
from selene_sdk.utils import load_path
from selene_sdk.utils import parse_configs_and_run
```

Before running `load_path` on `regression_train.yml`, please edit the YAML file to include the absolute path of the model file.
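
One way to obtain the absolute path to paste into the YAML file is sketched below. It assumes that the current working directory is the tutorial directory containing `utr_model.py`; adjust `model_file` if your layout differs.

```python
import os

# Assumption: utr_model.py sits in the current working directory.
model_file = "utr_model.py"

# Resolve the relative file name against the current working directory.
model_path = os.path.abspath(model_file)
print(model_path)
```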

Currently, the model is set to train on GPU.
If you do not have CUDA on your machine, please set `use_cuda` to `False` in the configuration file. Note that using the CPU instead of GPU will slow down training considerably.
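
If you are unsure whether CUDA is usable on your machine, a quick check along the following lines may help. This is a sketch: the helper `cuda_is_available` is hypothetical and simply wraps PyTorch's `torch.cuda.is_available()`, returning `False` when PyTorch cannot be imported.

```python
def cuda_is_available():
    """Return True only if PyTorch is installed and reports a usable CUDA device."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


use_cuda = cuda_is_available()
print(f"Suggested use_cuda setting for regression_train.yml: {use_cuda}")
```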

```python
configs = load_path("./regression_train.yml", instantiate=False)
parse_configs_and_run(configs, lr=0.001)
```

```
Outputs and logs saved to ./2018-12-08-22-08-14
[VALIDATE] average r2: 0.8641705948994154
[VALIDATE] average r2: 0.8767916124114791
[VALIDATE] average r2: 0.8817297326343803
[TEST] average r2: 0.9232683662644537
```
