# Likelihood Ratio Outlier Detection on Genomes

## Method

## Dataset

## Todo
- show in vs. out-of-distribution plot of Fig 1 (incl. log-likelihood vs. GC content)
- show in vs. out-of-distribution plot for LLR (incl. GC content plot) + ROC-AUC curve

In [1]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Dense, LSTM

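from alibi_detect.od import LLR
from alibi_detect.datasets import fetch_genome
from alibi_detect.utils.fetching import fetch_detector  # used below to download the pretrained detector
from alibi_detect.utils.saving import save_detector, load_detector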


### Load genome data

*X* represents the genomic sequences and *y* whether they are outliers ($1$) or not ($0$).

In [2]:
(X_train, y_train), (X_val, y_val), (X_test, y_test) = \
        fetch_genome(return_X_y=True, return_labels=False)
print(X_train.shape, y_train.shape, X_val.shape, y_val.shape, X_test.shape, y_test.shape)

(1000000, 250) (1000000,) (6999774, 250) (6999774,) (7000000, 250) (7000000,)


There are no outliers in the training set, while outliers make up the majority (about 86%) of the validation and test sets:

In [5]:
print('Fraction of outliers in train, val and test sets: '
      '{:.2f}, {:.2f} and {:.2f}'.format(y_train.mean(), y_val.mean(), y_test.mean()))

Fraction of outliers in train, val and test sets: 0.00, 0.86 and 0.86


### Define model

We need a generative model of the genomic sequences. Following the paper, we opt for a simple LSTM. Note that neither the model nor the function passed as `log_prob` below needs to be defined if we simply load the pretrained detector later on:

In [6]:
class LlrLSTM(tf.keras.Model):
    def __init__(self, hidden_dim: int, input_dim: int, dropout: float = 0.):
        super(LlrLSTM, self).__init__()
        self.input_dim = input_dim
        self.lstm = LSTM(hidden_dim, dropout=dropout, return_sequences=True)
        self.logits = Dense(input_dim, activation=None)

    def call(self, x: tf.Tensor) -> tf.Tensor:
        x = tf.one_hot(tf.cast(x, tf.int32), self.input_dim)
        x = self.lstm(x)
        x = self.logits(x)
        return x

In [7]:
hidden_dim = 2000
input_dim = 4  # ACGT nucleobases
model = LlrLSTM(hidden_dim, input_dim)
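
The `call` method above one-hot encodes integer-coded nucleobases before they reach the LSTM. A minimal pure-Python sketch of that encoding follows. The mapping A=0, C=1, G=2, T=3 and the helper names `encode` and `one_hot` are assumptions for illustration, since the notebook does not state which integer `fetch_genome` assigns to which base.

```python
from typing import Dict, List

# Hypothetical mapping; the dataset's actual integer coding may differ.
BASE_TO_INT: Dict[str, int] = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(seq: str) -> List[int]:
    """Map a nucleobase string to integer codes."""
    return [BASE_TO_INT[base] for base in seq]


def one_hot(indices: List[int], depth: int = 4) -> List[List[float]]:
    """One row per position, with a single 1.0 at the position's code."""
    return [[1.0 if j == i else 0.0 for j in range(depth)] for i in indices]


codes = encode("GATC")
print(codes)
print(one_hot(codes))
```

With `input_dim = 4`, each position of a length-250 sequence becomes a 4-dimensional vector, so one batch element has shape `(250, 4)` when it enters the LSTM.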

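We also need to define a loss function. Because the model outputs logits, the softmax cross-entropy is the negative log-likelihood of the observed nucleobase, so the same function, negated, can be used to evaluate the log-likelihood at inference time: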

In [9]:
def loss_fn(y, x):
    y = tf.one_hot(tf.cast(y, tf.int32), 4)  # ACGT one-hot encoding
    return tf.nn.softmax_cross_entropy_with_logits(y, x, axis=-1)
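
Softmax cross-entropy against a one-hot target is the negative log-probability the model assigns to the true class, so negating the loss gives a per-position log-likelihood. The pure-Python sketch below checks this on made-up logits. `log_softmax` and `cross_entropy` are illustrative helpers, not part of TensorFlow or alibi-detect.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def cross_entropy(target_index: int, logits: List[float]) -> float:
    """Softmax cross-entropy between a one-hot target and the logits."""
    target = [1.0 if i == target_index else 0.0 for i in range(len(logits))]
    log_p = log_softmax(logits)
    return -sum(t * lp for t, lp in zip(target, log_p))


logits = [2.0, 0.5, -1.0, 0.1]  # made-up scores for the four bases
loss = cross_entropy(0, logits)
log_likelihood = log_softmax(logits)[0]

# The loss is exactly the negated log-probability of the true class.
assert math.isclose(-loss, log_likelihood)
```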

### Load or train the outlier detector

We can either fetch the pretrained detector from a [Google Cloud Bucket](https://console.cloud.google.com/storage/browser/seldon-models/alibi-detect/od/llr/genome) or train one from scratch:

In [8]:
load_pretrained = False

In [None]:
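filepath = 'my_path'  # change to (absolute) directory where model is downloaded
if load_pretrained:  # load pretrained outlier detector
    detector_type = 'outlier'
    dataset = 'genome'
    detector_name = 'LikelihoodRatio'
    od = fetch_detector(filepath, detector_type, dataset, detector_name)
    filepath = os.path.join(filepath, detector_name)
else:  # define model, initialize, train and save outlier detector

    # log_prob must return a log-likelihood, i.e. the negated cross-entropy loss
    def likelihood_fn(y, x):
        return -loss_fn(y, x)

    od = LLR(threshold=None, model=model, log_prob=likelihood_fn, sequential=True)

    # train on the outlier-free training set; the perturbation settings are
    # example values, with feature_range covering the 4 nucleobase codes 0..3
    od.fit(X_train,
           mutate_fn_kwargs=dict(rate=.2, feature_range=(0, 3)),
           loss_fn=loss_fn)

    # save the trained outlier detector
    save_detector(od, filepath)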
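
As the detector's name suggests, the quantity of interest is a likelihood ratio: the log-likelihood of a sequence under the model trained on the original data, minus its log-likelihood under a second, background model. The sketch below illustrates only that arithmetic on made-up per-position log-probabilities. It is not alibi-detect's implementation, the function name `log_likelihood_ratio` is hypothetical, and summing over positions (rather than averaging) is an assumption.

```python
from typing import Sequence


def log_likelihood_ratio(logp_model: Sequence[float],
                         logp_background: Sequence[float]) -> float:
    """Sequence-level log-likelihood ratio from per-position log-probabilities."""
    if len(logp_model) != len(logp_background):
        raise ValueError("both models must score the same positions")
    return sum(logp_model) - sum(logp_background)


# Made-up per-position log-probabilities for a 4-base toy sequence.
logp_model = [-0.2, -0.3, -0.1, -0.4]
logp_background = [-1.1, -1.3, -1.2, -1.4]

llr = log_likelihood_ratio(logp_model, logp_background)
print(round(llr, 3))
```

A larger ratio means the sequence is explained better by the main model than by the background model. How the ratio is turned into an outlier score and compared with a threshold is left to the detector.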