# Likelihood Ratio Outlier Detection on Genomes

## Method

## Dataset

## Todo
- reproduce the in- vs. out-of-distribution plot of Fig. 1 (incl. log-likelihood vs. GC content)
- show the in- vs. out-of-distribution plot for the LLR (incl. GC content plot) + ROC-AUC curve

## Issues
- save/load fn: needs to work for any TF model, so can't simply save/load generically.
  1. PixelCNN: ok
  2. functional API: ok
  3. class-modular API: only ok from TF 2.2

In [1]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input, LSTM

from alibi_detect.od import LLR
from alibi_detect.datasets import fetch_genome
from alibi_detect.utils.fetching import fetch_detector
from alibi_detect.utils.saving import save_detector, load_detector



### Load genome data

*X* represents the genome sequences and *y* whether they are outliers ($1$) or not ($0$).

In [2]:
(X_train, y_train), (X_val, y_val), (X_test, y_test) = \
        fetch_genome(return_X_y=True, return_labels=False)
print(X_train.shape, y_train.shape, X_val.shape, y_val.shape, X_test.shape, y_test.shape)

(1000000, 250) (1000000,) (6999774, 250) (6999774,) (7000000, 250) (7000000,)
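
Each row of *X* is a sequence of 250 integer-encoded nucleobases. As a rough illustration of how such a row could be decoded and how the GC content mentioned in the Todo list could be computed, here is a minimal sketch. The mapping `0..3 -> A, C, G, T` and the helper names `decode` and `gc_content` are assumptions for illustration; the encoding actually used by `fetch_genome` may differ.

```python
from typing import Sequence

# Assumed integer encoding; the mapping used by the dataset may differ.
NUCLEOBASES = "ACGT"


def decode(seq: Sequence[int]) -> str:
    """Map an integer-encoded sequence to its nucleobase string."""
    return "".join(NUCLEOBASES[int(i)] for i in seq)


def gc_content(seq: Sequence[int]) -> float:
    """Fraction of positions that are G or C under the assumed encoding."""
    gc_codes = {NUCLEOBASES.index("G"), NUCLEOBASES.index("C")}
    return sum(int(i) in gc_codes for i in seq) / len(seq)


example = [0, 1, 2, 3, 2, 2, 1, 0]  # toy sequence, not taken from the dataset
print(decode(example))      # ACGTGGCA
print(gc_content(example))  # 0.625
```

Applied to a real row, this would be called as `gc_content(X_train[0])`.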


There are no outliers in the training set, while outliers make up the majority (about 86%) of the validation and test sets:

In [3]:
print('Fraction of outliers in train, val and test sets: '
      '{:.2f}, {:.2f} and {:.2f}'.format(y_train.mean(), y_val.mean(), y_test.mean()))

Fraction of outliers in train, val and test sets: 0.00, 0.86 and 0.86


### Define model

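We need to define a generative model of the genome sequences. We follow the paper and opt for a simple LSTM. The model is defined twice below, first with the subclassing API and then with the functional API; the second definition overrides the first. Note that we don't actually need to define the model or the `log_prob` function below if we simply load the pretrained detector later on: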

In [None]:
class LlrLSTM(tf.keras.Model):
    def __init__(self, hidden_dim: int, input_dim: int, dropout: float = 0.):
        super(LlrLSTM, self).__init__()
        self.input_dim = input_dim
        self.lstm = LSTM(hidden_dim, dropout=dropout, return_sequences=True)
        self.logits = Dense(input_dim, activation=None)

    def call(self, x: tf.Tensor) -> tf.Tensor:
        x = tf.one_hot(tf.cast(x, tf.int32), self.input_dim)
        x = self.lstm(x)
        x = self.logits(x)
        return x
    
hidden_dim = 2000
input_dim = 4  # ACGT nucleobases
model = LlrLSTM(hidden_dim, input_dim)

#inputs = np.random.randint(0, high=3, size=250).reshape(1, 250)
#model._set_inputs(inputs)
#model.save('model', save_format='tf')

In [9]:
genome_dim = 250
input_dim = 4  # ACGT nucleobases
hidden_dim = 2000

inputs = Input(shape=(genome_dim,), dtype=tf.int8)
x = tf.one_hot(tf.cast(inputs, tf.int32), input_dim)
x = LSTM(hidden_dim, return_sequences=True)(x)
logits = Dense(input_dim, activation=None)(x)
model = tf.keras.Model(inputs=inputs, outputs=logits, name='LlrLSTM')

In [14]:
model.save('whatever', save_format='tf')

Instructions for updating:
If using Keras pass *_constraint arguments to layers.


INFO:tensorflow:Assets written to: whatever/assets



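We also need to define the loss function. Since the model outputs logits, the softmax cross-entropy between the one-hot encoded nucleobases and the logits is the negative log-likelihood of each position, so the same function can be reused to evaluate the likelihood at inference time: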

In [12]:
def loss_fn(y: tf.Tensor, x: tf.Tensor) -> tf.Tensor:
    # y: integer-encoded targets, x: logits predicted by the model
    y = tf.one_hot(tf.cast(y, tf.int32), 4)  # ACGT one-hot encoding
    return tf.nn.softmax_cross_entropy_with_logits(y, x, axis=-1)
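
To make the link between logits, cross-entropy and log-likelihood concrete, here is a minimal sketch in plain Python, so it runs without TensorFlow. The helper names are hypothetical, and the next-token shift (the logits at step `t` predict the token at step `t + 1`) is an assumption about how a sequential model is scored, not a statement about the library's internals.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_norm for v in logits]


def cross_entropy(target: int, logits: Sequence[float]) -> float:
    """Cross-entropy against a one-hot target, i.e. -log p(target)."""
    return -log_softmax(logits)[target]


def sequence_log_likelihood(tokens: Sequence[int],
                            logits_per_step: Sequence[Sequence[float]]) -> float:
    """Sum of log p(token) over the predicted positions.

    Assumes logits_per_step[t] holds the logits for tokens[t + 1].
    """
    targets = tokens[1:]
    return -sum(cross_entropy(y, l) for y, l in zip(targets, logits_per_step))


tokens = [0, 2, 1]                      # toy sequence
logits = [[0.0] * 4, [0.0] * 4]         # uniform predictions over ACGT
ll = sequence_log_likelihood(tokens, logits)
print(math.isclose(ll, 2 * math.log(0.25)))  # True
```

With uniform logits every nucleobase has probability 1/4, so the two predicted positions contribute `2 * log(1/4)` in total.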

### Load or train the outlier detector

We can either fetch the pretrained detector from a [Google Cloud Bucket](https://console.cloud.google.com/storage/browser/seldon-models/alibi-detect/od/llr/genome) or train one from scratch:

In [13]:
load_pretrained = False

In [None]:
filepath = 'my_path'  # change to (absolute) directory where model is downloaded
if load_pretrained:  # load pretrained outlier detector
    detector_type = 'outlier'
    dataset = 'genome'
    detector_name = 'LLR'
    od = fetch_detector(filepath, detector_type, dataset, detector_name)
    filepath = os.path.join(filepath, detector_name)
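else:  # define model, initialize, train and save outlier detector
    # NOTE: loss_fn returns the cross-entropy, i.e. the negative log-likelihood,
    # whereas log_prob is meant to return log probabilities; check the sign here.
    od = LLR(threshold=None, model=model, log_prob=loss_fn, sequential=True)

    # train
    od.fit(
        X_train,
        mutate_fn_kwargs=dict(rate=.1, feature_range=(0, 3)),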
        mutate_batch_size=1000,
        loss_fn=loss_fn,
        epochs=50,
        batch_size=100
    )
    
    # save the trained outlier detector
    save_detector(od, filepath)
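
The `mutate_fn_kwargs` above control how the training sequences are perturbed for the background model, and the detector then compares the log-likelihoods of the two models. The sketch below illustrates both ideas in plain Python. It is a simplified stand-in and not the library's implementation: the helper names, the inclusive treatment of `feature_range` and the sign convention of the score are assumptions.

```python
import random
from typing import List, Optional, Sequence, Tuple


def mutate(seq: Sequence[int],
           rate: float = 0.1,
           feature_range: Tuple[int, int] = (0, 3),
           rng: Optional[random.Random] = None) -> List[int]:
    """Redraw each position with probability `rate`.

    The new value is drawn uniformly from feature_range (both ends inclusive),
    so a redrawn position can end up with its original value.
    """
    rng = rng or random.Random()
    low, high = feature_range
    return [rng.randint(low, high) if rng.random() < rate else tok for tok in seq]


def llr_outlier_score(logp_semantic: float, logp_background: float) -> float:
    """Negative log-likelihood ratio; higher is assumed to mean more outlying."""
    return -(logp_semantic - logp_background)


rng = random.Random(0)
sequence = [rng.randint(0, 3) for _ in range(250)]  # toy sequence
mutated = mutate(sequence, rate=0.1, rng=rng)
n_changed = sum(a != b for a, b in zip(sequence, mutated))
print(f"{n_changed} of {len(sequence)} positions changed")

# Toy log-likelihoods, not results from the trained models.
print(llr_outlier_score(-10.0, -12.0))  # -2.0
```

A sequence that the semantic model explains much better than the background model gets a low score, while a sequence that both models explain equally well (or the background model explains better) gets a higher one.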