A deep learning framework for predicting lysine acetylation sites in proteins
- Python >= 3.6
- MATLAB R2016a
- TensorFlow == 1.6.0
- There are seven sub-folders in the "Deep Learning" folder. The six folders named after the coding schemes contain Python code; each predictor is obtained by performing 4-fold cross-validation on the feature vectors produced by the corresponding encoding method.
- The folder named "Encoding schemes" contains MATLAB code for six encoding schemes: AAindex, BLOSUM62, CKSAAP (composition of k-spaced amino acid pairs), IG (information gain), One-hot, and PSSM (position-specific scoring matrix). These programs encode protein fragments into feature vectors of different dimensions.
- The folder named "Protein capture" contains a protein interception program that cuts proteins into lysine-centered fragments of equal length. (Note: put the FASTA file and the protein ID file in this folder when running this program.)
- The folder named "Feature Combination" contains the optimal model obtained by combining the six coding methods with F-score. (Note: put the encoded test set into this folder when running this program; all files in this folder should be kept in the same path.)
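The fragment interception step performed by the "Protein capture" program can be illustrated with a short Python sketch. This is not the repository's code: the function name `lysine_fragments`, the flank size of 10 residues, and the padding character `X` are all assumptions chosen for illustration.

```python
def lysine_fragments(sequence, flank=10, pad="X"):
    """Return (position, fragment) pairs for every lysine in `sequence`.

    Each fragment has length 2 * flank + 1 with the lysine at its centre.
    Positions are 1-based. Windows that run past either terminus are
    padded with `pad` so that all fragments have equal length.
    NOTE: `flank` and `pad` are illustrative assumptions.
    """
    sequence = sequence.upper()
    padded = pad * flank + sequence + pad * flank
    fragments = []
    for i, residue in enumerate(sequence):
        if residue == "K":
            # Residue i of `sequence` sits at index i + flank in `padded`,
            # so the slice [i, i + 2 * flank] is centred on it.
            fragments.append((i + 1, padded[i:i + 2 * flank + 1]))
    return fragments


if __name__ == "__main__":
    for position, fragment in lysine_fragments("MKAK", flank=2):
        print(position, fragment)
```

Padding the termini is one way to keep every fragment the same length; discarding lysines that are too close to a terminus would be an equally plausible choice.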
The amino acids within a small range around the acetylation site are primary sequence features and have proven useful for lysine acetylation site prediction in previous studies. These features can be used to represent protein sequences [1].
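One common way to turn the residues around a site into numbers is one-hot encoding, one of the six schemes listed above. The sketch below assumes a 20-letter amino acid alphabet and encodes any other symbol (for example a padding character) as an all-zero block; both choices are illustrative and may differ from the MATLAB implementation in this repository.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed 20-letter alphabet


def one_hot_encode(fragment, alphabet=AMINO_ACIDS):
    """Encode a fragment as a flat 0/1 vector of length len(fragment) * len(alphabet)."""
    index = {aa: i for i, aa in enumerate(alphabet)}
    vector = []
    for residue in fragment.upper():
        row = [0] * len(alphabet)
        if residue in index:
            row[index[residue]] = 1
        # Symbols outside the alphabet (e.g. padding) stay all zeros.
        vector.extend(row)
    return vector


if __name__ == "__main__":
    encoded = one_hot_encode("AKX")
    print(len(encoded), sum(encoded))
```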
BLOSUM matrices have been among the most widely used substitution matrix series for protein homology search and sequence alignment since their publication in 1992. Essential characteristics of protein evolution can be learned from the analysis of aligned protein sequences [2] [3].
The CKSAAP encoding scheme captures the composition of amino acid pairs separated by a small number (k) of residues within a peptide.
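A minimal sketch of CKSAAP follows. It assumes a 20-letter alphabet, gaps k = 0 to `k_max` (the default of 4 is an illustrative guess, not a value stated in this document), and normalization by the number of k-spaced pairs in the fragment. Pairs containing a symbol outside the alphabet are ignored.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed 20-letter alphabet


def cksaap_encode(fragment, k_max=4, alphabet=AMINO_ACIDS):
    """Return the frequencies of all k-spaced residue pairs for k = 0..k_max.

    The output has (k_max + 1) * len(alphabet) ** 2 entries, ordered by k
    and then by pair ("AA", "AC", ..., "YY").
    """
    pairs = ["".join(p) for p in product(alphabet, repeat=2)]
    fragment = fragment.upper()
    features = []
    for k in range(k_max + 1):
        counts = dict.fromkeys(pairs, 0)
        # Number of (i, i + k + 1) index pairs that fit in the fragment.
        n_windows = max(len(fragment) - k - 1, 0)
        for i in range(n_windows):
            pair = fragment[i] + fragment[i + k + 1]
            if pair in counts:
                counts[pair] += 1
        total = max(n_windows, 1)  # avoid division by zero for short input
        features.extend(counts[p] / total for p in pairs)
    return features


if __name__ == "__main__":
    print(len(cksaap_encode("AKAKA")))
```

With the assumed defaults, each fragment maps to a vector of 5 × 400 = 2000 values.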
Shannon entropy is defined as a function that represents the average amount of information for a set of objects according to their probabilities. It can be used to measure the conservation of amino acids in fragments.
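For a position where residue $a$ occurs with probability $p_a$, the Shannon entropy is $H = -\sum_a p_a \log_2 p_a$; a fully conserved position has $H = 0$. The sketch below computes this per position over a set of equal-length fragments. The function names are hypothetical, and how the repository derives its IG features from entropy is not specified here.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Shannon entropy (in bits) of the empirical distribution of `symbols`."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def positional_entropy(fragments):
    """Entropy of each column in a list of equal-length fragments.

    Low entropy at a position indicates a conserved residue.
    """
    return [shannon_entropy(column) for column in zip(*fragments)]


if __name__ == "__main__":
    print(positional_entropy(["AKA", "AKC", "GKA", "AKC"]))
```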
AAindex is a database of numerical indices representing various physicochemical and biochemical properties of amino acids [4]. There are 566 entries in the Amino Acid Index Database.
To capture information about sequence evolution, we can exploit the position-specific scoring matrix (PSSM) [5].
We combined a series of feature extraction methods with a deep learning framework to predict lysine acetylation sites and obtained better results. Two approaches were adopted. The first trained a model on each coding scheme separately. The second combined the six encoding schemes using F-score to train the model. The workflow is shown below:
We constructed a feedforward neural network of six layers (including input and output layers).
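The F-score-based feature combination mentioned above can be sketched as follows. The README does not define F-score, so this sketch assumes the widely used two-class formulation: for feature $j$, the squared distances of the positive and negative class means from the overall mean, divided by the sum of the two within-class sample variances. The function names and the `top_n` cut-off are hypothetical, not taken from the repository.

```python
def f_scores(positives, negatives):
    """F-score of every feature, given two lists of equal-length feature rows.

    Assumes at least two samples per class (sample variance uses n - 1).
    A feature with zero variance in both classes gets a score of 0.0.
    """
    n_features = len(positives[0])
    scores = []
    for j in range(n_features):
        pos = [row[j] for row in positives]
        neg = [row[j] for row in negatives]
        mean_pos = sum(pos) / len(pos)
        mean_neg = sum(neg) / len(neg)
        mean_all = (sum(pos) + sum(neg)) / (len(pos) + len(neg))
        var_pos = sum((x - mean_pos) ** 2 for x in pos) / (len(pos) - 1)
        var_neg = sum((x - mean_neg) ** 2 for x in neg) / (len(neg) - 1)
        numerator = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2
        denominator = var_pos + var_neg
        scores.append(numerator / denominator if denominator > 0 else 0.0)
    return scores


def select_top_features(positives, negatives, top_n):
    """Indices of the `top_n` features with the highest F-score."""
    scores = f_scores(positives, negatives)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:top_n]


if __name__ == "__main__":
    acetylated = [[1.0, 0.0], [3.0, 1.0]]
    non_acetylated = [[1.0, 10.0], [3.0, 11.0]]
    print(f_scores(acetylated, non_acetylated))
```

In this setting the rows would be the concatenated feature vectors from the six encoding schemes, and the selected indices would define the optimized feature set fed to the network.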
- Comparison of fragment information between lysine acetylation and non-acetylation sites. (A) The percentage of each amino acid in the lysine acetylation and non-acetylation fragments. (B) A pLogo of compositional bias around the lysine acetylation and non-acetylation sites.
- Performance measures of different features. (A) The accuracy, specificity, sensitivity, and AUC values of different features. (B) ROC curves and their AUC values for different features.
- The distribution of the number of each type of feature and their corresponding F-score sums in the optimized feature set. The distribution of F-score sums of each type of feature.
- Performance measures of the optimized selected predictors. (A) The accuracy, specificity, sensitivity, and AUC values in 4-, 6-, 8-, and 10-fold cross-validation. (B) ROC curves and their AUC values in 4-, 6-, 8-, and 10-fold cross-validation.
- The ROC curves for the independent test set.