This is the package of GuanLab's winning algorithm in the NCI-CPTAC DREAM Proteogenomics Challenge

Machine Learning Empowers Phosphoproteome Prediction in Cancers

This is the package of our winning algorithm in the 2017 NCI-CPTAC DREAM Proteogenomics Challenge.

background: Proteogenomics Challenge

see also: Hongyang Li and Yuanfang Guan's 1st Place Solution

Please contact us if you have any questions or suggestions.



Git clone a copy of the code:

git clone

Required dependencies

  • R (3.4.3)
  • python (3.6.5)
  • numpy (1.13.3). It comes pre-packaged in Anaconda.
  • scikit-learn (0.19.0), a popular machine learning package. It can be installed with:
pip install -U scikit-learn


All the omic data are 2D matrices, where columns are cancer samples and rows are genes/proteins/phosphorylation sites. The proteomic and phosphoproteomic data originally came from CPTAC-breast and CPTAC-ovary. The genomic data originally came from TCGA-breast and TCGA-ovary.
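
The matrix layout described above can be illustrated with a minimal parsing sketch. It assumes a tab-delimited file whose header row holds the sample IDs, whose first column holds the gene/protein/site IDs, and in which missing values are written as "NA"; none of this is confirmed by the repository, and `read_omic_matrix` is a hypothetical helper rather than part of the package.

```python
import io
import numpy as np

def read_omic_matrix(handle, missing="NA"):
    """Parse a tab-delimited matrix (assumed layout, see lead-in).

    Returns (feature IDs, sample IDs, features x samples array).
    """
    header = handle.readline().rstrip("\n").split("\t")
    samples = header[1:]
    features, rows = [], []
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        features.append(fields[0])
        # Missing measurements become NaN so that rows keep equal length
        rows.append([np.nan if v in ("", missing) else float(v)
                     for v in fields[1:]])
    return features, samples, np.array(rows, dtype=float)

# Tiny in-memory example standing in for a file under data/raw/
demo = io.StringIO("gene\tS1\tS2\tS3\nTP53\t0.5\tNA\t-1.2\nBRCA1\t1.1\t0.3\t0.0\n")
features, samples, values = read_omic_matrix(demo)
```

Rows of `values` are genes and columns are samples, matching the orientation described above.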

During the challenge, we downloaded these data directly from the challenge website. Unfortunately, this download link is no longer available to unregistered users. We therefore provide examples of dummy data in the directory data/raw/.

  • retrospective_breast_CNA_sort_common_gene_16884.txt
  • retrospective_breast_phospho_sort_common_gene_31981.txt
  • retrospective_breast_proteome_sort_common_gene_10005.txt
  • retrospective_breast_RNA_sort_common_gene_15107.txt
  • retrospective_ova_CNA_sort_common_gene_11859.txt
  • retrospective_ova_JHU_proteome_sort_common_gene_7061.txt
  • retrospective_ova_phospho_filtered.txt
  • retrospective_ova_phospho_sort_common_gene_10057.txt
  • retrospective_ova_PNNL_proteome_sort_common_gene_7061.txt
  • retrospective_ova_rna_seq_sort_common_gene_15121.txt

Then preprocess the data and generate the 5-fold cross-validation sets using the code in

  • data/trimmed_set
  • data/normalization
  • data/cv_set
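
As a rough illustration of the 5-fold split, the sketch below partitions sample (column) indices with scikit-learn's `KFold`. It is not the code in data/cv_set; the shuffling, the seed, and the `sample_folds` helper are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import KFold

def sample_folds(n_samples, n_splits=5, seed=0):
    """Return a list of (train_idx, test_idx) pairs over sample indices."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # KFold only needs the number of rows; a placeholder array is enough
    placeholder = np.zeros((n_samples, 1))
    return [(train, test) for train, test in kf.split(placeholder)]

folds = sample_folds(20)
# Because samples are columns, a fold is applied as values[:, train_idx]
```

Each sample lands in exactly one test fold, so every column of the matrix is held out once.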

Four models

We have two sets of code in parallel:

  • prediction/breast
  • prediction/ova

1. "proteome" model

This model directly approximates the phosphorylation level based on the corresponding parent protein level.
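
A minimal sketch of this idea, assuming the parent protein level is used unchanged as the prediction for each of its phosphorylation sites (the repository code may scale or fit this relationship differently). `proteome_baseline` and its argument names are hypothetical.

```python
import numpy as np

def proteome_baseline(protein_values, protein_genes, site_parents):
    """Predict each site from its parent protein's abundance.

    protein_values: proteins x samples array
    protein_genes:  gene ID for each protein row
    site_parents:   parent gene ID for each phosphorylation site
    """
    row_of = {gene: i for i, gene in enumerate(protein_genes)}
    n_samples = protein_values.shape[1]
    pred = np.full((len(site_parents), n_samples), np.nan)
    for k, gene in enumerate(site_parents):
        if gene in row_of:
            pred[k] = protein_values[row_of[gene]]
        # Sites whose parent protein was not measured stay NaN
    return pred

protein_values = np.array([[1.0, 2.0], [3.0, 4.0]])
pred = proteome_baseline(protein_values, ["AKT1", "TP53"],
                         ["TP53", "TP53", "MISSING"])
```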


2. "site-specific" model

This model considers the protein-protein interactions that regulate phosphorylation: all protein levels are used as features to predict each phosphorylation site. The base learner is a random forest with a maximum depth of 3 and 100 trees.
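
The stated base learner maps onto scikit-learn's `RandomForestRegressor` roughly as follows. The data here are synthetic, the train/test split is arbitrary, and missing values are assumed to have been handled beforehand; only the depth and tree count come from the description above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
# Synthetic stand-ins: 50 proteins x 40 samples, and one target site
protein = rng.normal(size=(50, 40))
site = 0.8 * protein[3] + 0.1 * rng.normal(size=40)

# scikit-learn expects samples as rows, so transpose the matrix
X = protein.T
model = RandomForestRegressor(n_estimators=100, max_depth=3, random_state=0)
model.fit(X[:30], site[:30])      # train on the first 30 samples
pred = model.predict(X[30:])      # predict the 10 held-out samples
```

One such model is fitted per phosphorylation site, with the full proteome as its feature set.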


3. "cross-tissue" model

Similar to the "site-specific" model, but this model is trained on the combined breast and ovarian cancer samples.
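
One way to combine the two tissues, sketched under the assumption that the matrices are restricted to the proteins measured in both and then concatenated along the sample axis. `combine_tissues` is a hypothetical helper; the repository may align the data differently.

```python
import numpy as np

def combine_tissues(values_a, genes_a, values_b, genes_b):
    """Stack samples from two tissues over their shared genes."""
    idx_a = {g: i for i, g in enumerate(genes_a)}
    idx_b = {g: i for i, g in enumerate(genes_b)}
    shared = [g for g in genes_a if g in idx_b]   # keep order of tissue A
    rows_a = [idx_a[g] for g in shared]
    rows_b = [idx_b[g] for g in shared]
    # Same rows (genes) in both blocks, samples placed side by side
    combined = np.hstack([values_a[rows_a], values_b[rows_b]])
    return shared, combined

breast = np.arange(6.0).reshape(3, 2)          # 3 genes x 2 samples
ova = np.arange(100.0, 109.0).reshape(3, 3)    # 3 genes x 3 samples
shared, combined = combine_tissues(breast, ["A", "B", "C"],
                                   ova, ["C", "A", "D"])
```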


4. "multi-site" model

This model considers the associations between phosphorylation sites of the same parent protein.
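
The description does not say how these associations enter the model. As one possible reading, the sketch below builds a feature matrix for a target site from the other sites that share its parent protein; `sibling_features` and the grouping by a per-site parent list are assumptions for illustration only.

```python
import numpy as np

def sibling_features(phospho_values, site_parents, target_row):
    """Collect the other sites of the target's parent protein as features.

    phospho_values: sites x samples array
    site_parents:   parent gene ID for each site row
    """
    parent = site_parents[target_row]
    siblings = [i for i, p in enumerate(site_parents)
                if p == parent and i != target_row]
    X = phospho_values[siblings].T     # samples x sibling sites
    y = phospho_values[target_row]     # target site across samples
    return siblings, X, y

phospho = np.arange(12.0).reshape(4, 3)   # 4 sites x 3 samples
siblings, X, y = sibling_features(phospho, ["AKT1", "TP53", "AKT1", "AKT1"], 0)
```

`X` and `y` could then be passed to any regressor, for example the random forest used in the "site-specific" model.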


5. ensemble model

Our final results are the ensemble of the 1-4 models mentioned above.
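
How the four predictions are combined is not stated here. A minimal sketch, assuming a plain (optionally weighted) average of prediction matrices that share the same sites x samples shape; `average_ensemble` is a hypothetical helper.

```python
import numpy as np

def average_ensemble(predictions, weights=None):
    """Average a list of equally shaped prediction matrices."""
    stacked = np.stack(predictions, axis=0)   # models x sites x samples
    return np.average(stacked, axis=0, weights=weights)

pred_a = np.array([[1.0, 2.0]])
pred_b = np.array([[3.0, 4.0]])
equal = average_ensemble([pred_a, pred_b])
weighted = average_ensemble([pred_a, pred_b], weights=[3, 1])
```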


6. result analysis and figure preparation
