Once we've trained a model, we might want to better understand what sequence motifs the first convolutional layer has discovered and how it's using them. Basset offers two methods to help users explore these filters.

Before you start, double-check that you have Tomtom, the motif comparison tool from the MEME Suite, installed. Tomtom provides rigorous methods for comparing the filters to a database of motifs. You can download it here: http://meme-suite.org/doc/download.html

To run this tutorial, you'll need to either download the pre-trained model from http://www and preprocess the consortium data, or just substitute your own files here:

In [5]:
model_file = '../data/models/pretrained_model.th'
seqs_file = '../data/encode_roadmap.h5'

First, we'll run basset_motifs.py, which will extract a bunch of basic information about the first layer filters. The script takes an HDF file (such as any preprocessed with preprocess_features.py) and samples sequences from the test set. It sends those sequences through the neural network and examines its hidden unit values to describe what they're doing.

By default, the script will search the CIS-BP Homo sapiens database. You can change that with -m.

-s specifies the number of sequences to sample. 1000 is fast and sufficient.

-t asks the script to trim uninformative positions off the filter ends.

In [11]:
import subprocess
subprocess.call('basset_motifs.py -s 1000 -t -o motifs_out %s %s' % (model_file, seqs_file), shell=True)

0
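The 0 above is the script's exit code, which means it finished without error. If you'd rather not format a shell string by hand, here is a minimal sketch that assembles the same call as an argument list. It uses only the flags described above; build_motifs_cmd is a hypothetical helper name of mine, not part of Basset, and it assumes basset_motifs.py is on your PATH.

```python
import subprocess

def build_motifs_cmd(model_file, seqs_file, out_dir="motifs_out", n_seqs=1000, trim=True):
    """Assemble the basset_motifs.py call as an argument list (hypothetical helper)."""
    cmd = ["basset_motifs.py", "-s", str(n_seqs)]
    if trim:
        cmd.append("-t")  # trim uninformative positions off the filter ends
    cmd += ["-o", out_dir, model_file, seqs_file]
    return cmd

cmd = build_motifs_cmd('../data/models/pretrained_model.th', '../data/encode_roadmap.h5')
print(' '.join(cmd))

# To actually run it, and raise an exception on a non-zero exit code:
# subprocess.check_call(cmd)
```

Passing a list avoids shell quoting problems if your file paths contain spaces, and check_call turns a silent non-zero exit code into an error you can't miss.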

Now there's plenty of information output in motifs_out. My favorite way to get started is to open the HTML file output by Tomtom's comparison of the motifs to a database. It displays all of the motifs and their database matches in a neat table.

Before we take a look, though, let me describe where these position weight matrices came from. Inside the neural network, the filters are represented by real-valued matrices. Here's one:

In [21]:
from IPython.display import HTML
HTML('<iframe src=motifs_out/filter9_heat.pdf width=1000 height=250></iframe>')
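To make it concrete how a real-valued matrix like this acts on DNA, here is a minimal NumPy sketch of a single filter scanning a one-hot encoded sequence. It's a simplification under my own assumptions: the toy filter and sequence are made up, and the bias and any normalization that a trained network applies are ignored.

```python
import numpy as np

def one_hot(seq, alphabet="ACGT"):
    """Encode a DNA string as a 4 x L matrix (rows A, C, G, T); unknown bases stay all-zero."""
    x = np.zeros((len(alphabet), len(seq)))
    for j, base in enumerate(seq):
        if base in alphabet:
            x[alphabet.index(base), j] = 1.0
    return x

def scan_filter(filt, seq_1hot):
    """Slide a 4 x W filter along a 4 x L sequence and apply a ReLU to each score."""
    width = filt.shape[1]
    n_pos = seq_1hot.shape[1] - width + 1
    scores = np.array([np.sum(filt * seq_1hot[:, i:i + width]) for i in range(n_pos)])
    return np.maximum(scores, 0.0)

# Toy filter (made up for illustration) that rewards T, G, A at positions 0, 1, 2.
filt = -np.ones((4, 3))
filt[3, 0] = 1.0  # T
filt[2, 1] = 1.0  # G
filt[0, 2] = 1.0  # A

acts = scan_filter(filt, one_hot("ACTGACC"))
print(acts)  # one activation per window; the TGA window scores highest
```

Each window's score is the sum of the filter weights selected by the bases in that window, so the filter responds most strongly wherever the sequence matches its preferred pattern.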

Although it's a matrix of values, this doesn't quite match the conventional notion of a position weight matrix that we typically use to represent sequence motifs in genome biology. To make one, basset_motifs.py pulls out the underlying subsequences that activate the filter in the test sequences and passes them to WebLogo.

In [43]:
HTML('<iframe src=motifs_out/filter9_logo.eps width=1000 height=250></iframe>')

0
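The step from activating subsequences to a logo can be sketched in a few lines. This is an illustration under my own assumptions, not the code in basset_motifs.py: the list of sites is made up, and how the script decides which subsequences count as "activating" isn't shown here.

```python
import numpy as np

ALPHABET = "ACGT"

def position_frequency_matrix(sites):
    """Count nucleotides at each position of equal-length sites and normalize columns to sum to 1."""
    counts = np.zeros((len(ALPHABET), len(sites[0])))
    for site in sites:
        for j, base in enumerate(site):
            counts[ALPHABET.index(base), j] += 1
    return counts / counts.sum(axis=0, keepdims=True)

def information_content(pfm):
    """Per-position information in bits, relative to a uniform background."""
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(pfm > 0, pfm * np.log2(pfm), 0.0)
    return 2.0 + plogp.sum(axis=0)

# Hypothetical subsequences that activated a filter.
sites = ["TGAC", "TGAG", "TGAT", "TGAA"]
pfm = position_frequency_matrix(sites)
ic = information_content(pfm)
print(pfm)
print(ic)
```

The letter heights in a sequence logo correspond to this per-position information. Here the first three positions are fully conserved and the last carries none, which is the kind of uninformative end position that a trimming option like -t is meant to remove.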

We can also visualize the distribution of activations for that filter.

In [33]:
#fig = Image(filename=('motifs_out/filter9_dens.pdf'))
#fig

HTML('<iframe src=motifs_out/filter9_dens.pdf width=850 height=550></iframe>')

The other primary tool that we have to understand the filters is to remove the filter from the model and assess the impact. Rather than truly remove it and re-train, we can just nullify it within the model by setting all output from the filter to its mean. This way the model around it isn't drastically affected, but there's no information flowing through.
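Here is a minimal sketch of that nullification on an array of first-layer activations. The array shape (sequences x filters x positions), the random data, and the nullify_filter name are assumptions for illustration; the real script works on the model's own hidden units.

```python
import numpy as np

def nullify_filter(acts, filter_idx):
    """Return a copy of acts with one filter's output replaced everywhere by its mean.

    acts is assumed to have shape (n_seqs, n_filters, n_positions).
    """
    out = acts.copy()
    out[:, filter_idx, :] = acts[:, filter_idx, :].mean()
    return out

# Made-up activations: 8 sequences, 3 filters, 10 positions.
rng = np.random.default_rng(0)
acts = rng.random((8, 3, 10))

nulled = nullify_filter(acts, filter_idx=1)
print(nulled[:, 1, :].std())  # no variation left in the nullified filter
```

The nullified filter keeps its average level, so downstream layers see inputs on a familiar scale, but its output no longer varies with the sequence. Comparing the model's predictions before and after gives a measure of that filter's influence.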

This analysis requires considerably more compute time, so I separated it into a different script. To give it enough sequences to obtain a good estimate of each filter's influence, I typically run it overnight.

To get really useful output, the script needs a few additional pieces of information.

-m specifies the table created by basset_motifs.py above.

-t specifies a table where the second column is the target labels.

In [None]:
!basset_motif_infl.py 