Baseline Needs Even More Love: A Simple Word-Embedding-Based Model for Genetic Engineering Attribution
This repository contains my submission for the Genetic Engineering Attribution Challenge. The goal of the challenge was to create an algorithm that identified the most likely lab-of-origin for genetically engineered DNA. The challenge is described in more detail here, and the competition dataset is analysed here.
I approached the challenge using natural language processing: I byte-pair encoded the genetic sequences and implemented a variant of SWEM-max in TensorFlow to classify them. SWEM-max was particularly well suited to the task given its strong performance on small datasets and long texts. The approach proposed in my paper was judged "particularly promising" and "quite distinctive from the other submissions", ranking 5th in the Innovation Track (judged on the real-world merits of the model) and 21st in the Prediction Track (judged on model top-10 accuracy).
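To give a concrete feel for byte-pair encoding on DNA, here is a toy sketch (not the tokenizer used in the paper, which follows Alley et al. (2020)): each round merges the most frequent adjacent pair of tokens, so frequently co-occurring subsequences grow into motifs.

```python
# Illustrative toy sketch of one byte-pair-encoding merge step on a DNA string.
# Real BPE repeats this until a target vocabulary size is reached.
from collections import Counter

def bpe_merge_step(tokens):
    """Merge the most frequent adjacent token pair into a single token."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return tokens
    (a, b), _ = pairs.most_common(1)[0]
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            merged.append(a + b)  # replace the pair with a merged motif
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("ATGCGATGCGATAT")  # start from single nucleotides
for _ in range(3):               # three merge rounds for illustration
    tokens = bpe_merge_step(tokens)
print(tokens)  # frequent pairs such as 'AT' get merged into longer motifs
```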
Shen et al. (2018) first demonstrated that Simple Word-Embedding-Based Models (SWEMs) outperform convolutional neural networks (CNNs) and recurrent neural networks (RNNs) in many natural language processing (NLP) tasks. We apply SWEMs to the task of genetic engineering attribution. We encode genetic sequences using byte-pair encoding (BPE) as proposed by Alley et al. (2020), which separates each sequence into motifs (distinct subsequences of DNA). Our model uses a max-pooling SWEM to extract a feature vector from the organism's motifs, and a simple neural network to extract a feature vector from the organism's phenotypes (observed characteristics). These two feature vectors are concatenated and used to predict the lab of origin. Our model achieves 90.24% top-10 accuracy on the private test set, outperforming RNNs (Alley et al., 2020) and CNNs (Nielsen & Voigt, 2018). The simplicity of our model makes it interpretable, and we discuss how domain experts may approach interpreting it.
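For readers who think in code, the following is a minimal Keras sketch of the architecture described above. All sizes (vocabulary, embedding width, phenotype dimensionality, number of labs) are placeholder assumptions, not the tuned values from the paper.

```python
# Minimal sketch of the SWEM-max + phenotype architecture (assumed sizes).
import tensorflow as tf

VOCAB_SIZE, EMBED_DIM, N_PHENOTYPES, N_LABS = 1000, 64, 39, 1314  # placeholders

motifs = tf.keras.Input(shape=(None,), dtype=tf.int32, name="motifs")
phenotypes = tf.keras.Input(shape=(N_PHENOTYPES,), name="phenotypes")

# SWEM-max: embed each motif, then take the element-wise max over the sequence.
x = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)(motifs)
seq_features = tf.keras.layers.GlobalMaxPooling1D()(x)

# Simple dense layer extracting a feature vector from the phenotypes.
phen_features = tf.keras.layers.Dense(32, activation="relu")(phenotypes)

# Concatenate the two feature vectors and predict the lab of origin.
combined = tf.keras.layers.Concatenate()([seq_features, phen_features])
outputs = tf.keras.layers.Dense(N_LABS, activation="softmax")(combined)

model = tf.keras.Model([motifs, phenotypes], outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=[tf.keras.metrics.SparseTopKCategoricalAccuracy(k=10)])
```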
If you would like to cite this report, please cite it as:
@article{GeneticSWEM,
  author = {Kieran Litschel},
  title  = {Baseline Needs Even More Love: A Simple Word-Embedding-Based Model for Genetic Engineering Attribution},
  URL    = {https://github.com/KieranLitschel/GeneticSWEM},
  year   = {2020},
}
During the competition I made a submission every time I made a significant improvement to my model. I saved the notebooks corresponding to each submission and have included them in this repository in the development directory. If you are curious about how the development of my model progressed through the competition, take a look.
The *Train*, *Infer*, and *Build vectors and metadata for TensorFlow* notebooks were my final submission for the Innovation Track. The *Train* notebook loads and pre-processes the training data, and uses it to train the model. The *Infer* notebook loads and pre-processes the test data, and makes predictions for each sample using the model trained in the *Train* notebook. The *Build vectors and metadata for TensorFlow* notebook extracts the word embeddings from the trained model in the format projector.tensorflow.org accepts as input.
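As a rough sketch of what that last step boils down to (assumed details, not the notebook's actual code): projector.tensorflow.org accepts one TSV file of embedding vectors and a matching TSV of labels, one row per motif.

```python
# Sketch of exporting embeddings for the TensorFlow Embedding Projector.
# `model` is a trained Keras model as in the architecture sketch above; the
# layer name "embedding" and the placeholder labels are assumptions.
weights = model.get_layer("embedding").get_weights()[0]  # (vocab, embed_dim)

# Real code would map each row index back to its byte-pair-encoded motif string.
motif_labels = [f"motif_{i}" for i in range(weights.shape[0])]

with open("vectors.tsv", "w") as vf, open("metadata.tsv", "w") as mf:
    for label, vector in zip(motif_labels, weights):
        vf.write("\t".join(f"{x:.6f}" for x in vector) + "\n")
        mf.write(label + "\n")
```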
Note that the execution output is not saved for any of the submitted notebooks. If you want to see the execution output, see development notebook GE_8_36, which trains and evaluates the same model.
These notebooks were run on Ubuntu 18.04 using Python 3.6.9. A requirements.txt listing the package versions I used is included. You will likely have success running these notebooks with a different operating system, Python version, or package versions, but I have not tested this.
I am currently working on an open-source implementation of SWEM-max (max-pooling SWEM) in TensorFlow, with an emphasis on the interpretability techniques discussed in the paper. I have some ideas for making the model even more interpretable, which I plan to include in the implementation. Development is at an early stage, and the implementation is not yet flexible enough to fully replicate the model proposed here, but I plan to add that flexibility in the future. You can check it out here.
Since the competition I have had a few ideas for improving my approach. I plan to explore them and then develop the report into a full pre-print paper.