Peptide cleavage site prediction
Given a sequence of aminoacids, we want to predict the location of its cleavage site. This can be framed as a binary classification problem: given a location in the sequence, we use the surrounding aminoacids to predict whether or not the cleavage site is located there.
For example, as in the image below, we may try to predict whether or not the cleavage site is located after the fifth aminoacid in the sequence, by looking at the
q=4 respectively preceding and following this position.
We are therefore looking for methods to score a window of
p+q aminoacids around any candidate cleavage site. We implemented 2 such methods:
- Using the training data to create a Position Specific Scoring Matrix (PSSM). This is an ad hoc method, and we explain how the PSSM is built below.
Consider a window containing aminoacid a in position i, with i between -p and q-1 (cf image above).
- if a appears in position i especially frequently in windows surrounding cleavage sites (compared with other windows), then this should increase the score of the window
- if a appears in position i especially seldom in windows surrounding cleavage sites (compared with other windows), then this should decrease the score of our window
Let f(a,i) be the frequency of aminoacid a at the relative position i in the training set, and g(a) the general frequency of a in the training set. The PSSM matrix W(a,i) is defined by the equation: W(a,i) = log[f(a,i)] - log[g(a)]. Summing the W(a,i) for every aminoacid a in position i in the window defines the PSSM score of the window.
- Using Support Vector Machines (SVM) and the C++ library
- modify the first line of the function
src/main.cppto chose the desired dataset.
make clean && makeinside the
Structure of our code
In this section, we describe how the raw data is transformed in order to be exploitable by our learning models:
train_test_splittakes the original dataset (throughout the project, we worked with
EUKSIG_13.red) and builds two files
test.red, in respective proportions 4/5 and 1/5 using a random seed.
Sequenceencodes a sequence of aminoacids. Each instance has three attributes: the integers
cleavage_sitecontain the number of aminoacids in the sequence, and the position of the cleavage site.
aa_sequenceis a string containing the sequence of aminoacids.
test.red) into arrays of instances of
Sequence. As we shall see below, these arrays will be used directly for training in the case of the PSSM method, or converted into
libsvm-compatible files in the case of SVM.
The abstract class
Predictorcontains the functions used to predict the location of the cleavage site. It is extended by the class
PSSM, which implements these functions in the case of the PSSM approach.
traintakes as argument a training set (an array of instances of
Sequence) and the number of instances it contains, and performs training (in the
PSSMimplementation, it simply builds the Position Specific Scoring Matrix). Once training has been completed, the following functions can be used:
PSSMscore of a given window in a sequence.
FindCleavageMaxreturns the position in a sequence such that the window surrounding it maximizes the value of
FindCleavagesThresholdreturns all the positions in a sequence such that the score of the surrounding windows exceeds some specified threshold.
Given an array of instances of
Sequence, the function
SequenceAccuracyreturns the fraction of sequences such that
FindCleavageMaxreturns the true position of the cleavage site of the sequence.
Support Vector Machines
We used the library
libsvm to perform SVM on the training examples. This entailed preprocessing the data so as to convert it in a format accessible by
libsvm. Note that at this point, the format of the data is an array of sequences, whereas the actual dataset on which we wish to train Support Vector Machines is the set of windows (rather than the set of sequences).
Hence the following functions, written directly inside
main.cpp, which convert the data from an array of sequences into a vector of windows (and their associated labels):
SequenceToWindowsconverts the array of instances of
Sequenceinto a vector of strings, such that each string corresponds to a window of size
p + q.
SequenceToLabelsconverts the same array into a vector of strings, such that each string is the label of the corresponding window ("+1" or "-1" depending on whether the window's position is that of the sequence's cleavage site).
The two following functions use the vectors created above to create
libsvm-compatible training and testing files:
SequenceSVMTranslatetakes as argument the vectors built by the two functions above and uses them to write into a
.txtfile (the name of which is also an argument) in the
<label> <index1>:<value1> <index2>:<value2> ...format
SequenceSVMTranslateSimilarityMatrixalso writes into a file (the name of which is also an argument) in a
libsvm-compatible data format. However, in this case the file should allow
libsvmto exploit a precomputed kernel, specified in the argument
similarity_matrix: the format is
<label> 0:i 1:K(xi,x1) ... L:K(xi,xL). We noticed that the resulting files can quickly become very heavy (> 500 MB with 267 training instances), since the size grows quadratically with the number of instances.
We wrote a bash script
src/main.sh in order to facilitate training and prediction using
libsvm, making it easier to choose kernels and hyperparameters.
The project comprises three directories:
src is the source directory: it contains all the code we wrote, as detailed above.
data contains all our data. To be more precise:
data/datasetscontains all the
data/matricescontains the Identity and BLOSUM62 matrices in
data/modelscontains the models created by
data/outputscontains the output files created when running
svm-predict, both predictions for each instance and the associated scores (f-score, precision, recall, accuracy).
data/testcontains the test sets, in a format compatible with
data/traincontains the training sets, in a format compatible with
libsvm-3.23 is where the
libsvm functions are stored. We modified it slightly, exactly as advised here, in order to validate our model based on better evaluation functions than the accuracy.