Machine Learning Empowers Phosphoproteome Prediction in Cancers
This is the package of our winning algorithm in the 2017 NCI-CPTAC DREAM Proteogenomics Challenge.
background: Proteogenomics Challenge
Clone a copy of the code:
git clone https://github.com/GuanLab/phosphoproteome_prediction.git
- R (3.4.3)
- python (3.6.5)
- numpy (1.13.3). It comes pre-packaged in Anaconda.
- scikit-learn (0.19.0) A popular machine learning package. It can be installed by:
pip install -U scikit-learn
All the omic data are 2D matrices, where columns are cancer samples and rows are genes/proteins/phosphorylation sites. The proteomic and phosphoproteomic data originally came from CPTAC-breast and CPTAC-ovary. The genomic data originally came from TCGA-breast and TCGA-ovary.
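As a minimal sketch of this layout, the snippet below builds a tiny made-up matrix with proteins as rows and samples as columns, then transposes it into the samples-by-features orientation that scikit-learn expects. The IDs and values are invented for illustration and are not taken from CPTAC or TCGA.

```python
import numpy as np

# Hypothetical IDs; the real files use CPTAC/TCGA identifiers.
proteins = ["P1", "P2", "P3"]
samples = ["S1", "S2", "S3", "S4"]

# Rows are proteins, columns are cancer samples.
protein_matrix = np.array([
    [0.5, -1.2, 0.3, 0.9],
    [1.1, 0.4, -0.7, 0.0],
    [-0.2, 0.8, 1.5, -1.0],
])

# scikit-learn expects samples as rows, so transpose before modelling.
X = protein_matrix.T
print(X.shape)  # (4, 3)
```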
During the challenge, we downloaded these data directly from the challenge website. Unfortunately, this download link is no longer available to unregistered users. We have therefore provided examples of dummy data in the directory data/raw/.
Then preprocess the data and generate the 5-fold cross-validation splits using the code in
We have two sets of code in parallel
1. "proteome" model
This model directly approximates the phosphorylation level based on the corresponding parent protein level.
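One simple reading of this idea is sketched below: the prediction for each phosphorylation site is the abundance of its parent protein in the same sample. This is an illustrative assumption, not necessarily the exact procedure in the repository. The site ID format `<parent protein>:<site>` and the helper `predict_from_parent` are hypothetical.

```python
import numpy as np

# Hypothetical toy data: rows are proteins, columns are samples.
protein_ids = ["GENE_A", "GENE_B"]
protein_levels = np.array([
    [0.2, 1.0, -0.5],
    [-1.1, 0.3, 0.7],
])

# Assumed ID format "<parent protein>:<site>"; the real format may differ.
site_ids = ["GENE_A:S10", "GENE_A:T25", "GENE_B:Y7"]

# Map each protein ID to its row index in protein_levels.
row_of = {pid: i for i, pid in enumerate(protein_ids)}

def predict_from_parent(site_ids, protein_levels, row_of):
    """Return a sites-by-samples matrix copied from each site's parent protein row."""
    parent_rows = [row_of[s.split(":")[0]] for s in site_ids]
    return protein_levels[parent_rows, :]

pred = predict_from_parent(site_ids, protein_levels, row_of)
print(pred.shape)  # (3, 3)
```

Under this sketch, sites that share a parent protein receive identical predictions, which is one motivation for the more flexible models that follow.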
2. "site-specific" model
This model accounts for protein-protein interactions in regulating phosphorylation: all protein levels are used as features to predict each phosphorylation site. The base learner is a random forest with 100 trees and a maximum depth of 3.
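The base learner described above can be sketched with scikit-learn as follows. The data here are synthetic stand-ins, and the train/test split and `random_state` are assumptions made only so the example is reproducible. Only the two hyperparameters (100 trees, maximum depth 3) come from the description.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)

# Synthetic stand-in data: 40 samples, 20 protein levels as features.
n_samples, n_proteins = 40, 20
X = rng.normal(size=(n_samples, n_proteins))
# One phosphosite target that depends on two of the proteins plus noise.
y = 0.8 * X[:, 0] - 0.5 * X[:, 3] + 0.1 * rng.normal(size=n_samples)

# Hyperparameters stated in the text: 100 trees, maximum depth 3.
model = RandomForestRegressor(n_estimators=100, max_depth=3, random_state=0)
model.fit(X[:30], y[:30])

# Predict the held-out samples.
pred = model.predict(X[30:])
print(pred.shape)  # (10,)
```

In the site-specific setting, one such model would be fitted per phosphorylation site, each using the full set of protein levels as features.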
3. "cross-tissue" model
Similar to the "site-specific" model, but trained on the combined samples from breast and ovarian cancer.
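Combining the two tissues can be sketched as below, assuming that only proteins measured in both cohorts are kept so that every sample has the same feature set. That filtering rule, the IDs, and the values are illustrative assumptions.

```python
import numpy as np

# Hypothetical protein IDs and toy matrices (rows: proteins, columns: samples).
breast_ids = ["P1", "P2", "P3"]
ovary_ids = ["P2", "P3", "P4"]
breast = np.arange(12, dtype=float).reshape(3, 4)  # 4 breast samples
ovary = np.arange(6, dtype=float).reshape(3, 2)    # 2 ovarian samples

# Keep only proteins measured in both tissues, in a fixed order.
shared = sorted(set(breast_ids) & set(ovary_ids))
b_rows = [breast_ids.index(p) for p in shared]
o_rows = [ovary_ids.index(p) for p in shared]

# Stack samples side by side: same features, more samples.
combined = np.hstack([breast[b_rows, :], ovary[o_rows, :]])
print(shared, combined.shape)  # ['P2', 'P3'] (2, 6)
```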
4. "multi-site" model
This model considers the associations between phosphorylation sites of the same parent protein.
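One way to set up such a model is to group sites by parent protein, so that the other sites of the same protein can serve as features for a given target site. The sketch below shows only that grouping step. The ID format `<parent protein>:<site>` and the helper `sibling_rows` are hypothetical.

```python
from collections import defaultdict

# Assumed ID format "<parent protein>:<site>"; the real format may differ.
site_ids = ["GENE_A:S10", "GENE_A:T25", "GENE_A:Y40", "GENE_B:Y7"]

# Group site row indices by parent protein.
sites_of = defaultdict(list)
for row, sid in enumerate(site_ids):
    sites_of[sid.split(":")[0]].append(row)

def sibling_rows(target_row):
    """Rows of the other sites that share the target site's parent protein."""
    parent = site_ids[target_row].split(":")[0]
    return [r for r in sites_of[parent] if r != target_row]

print(sibling_rows(0))  # [1, 2]
print(sibling_rows(3))  # []
```

A site with no siblings, such as the last one here, has no features under this scheme and would need to fall back to one of the other models.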
5. ensemble model
Our final results are an ensemble of models 1-4 described above.
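The description does not say how the four models are weighted. As a minimal sketch, the snippet below assumes an unweighted average of their predictions. The model names and values are made up for illustration.

```python
import numpy as np

# Hypothetical predictions (sites x samples) from the four models.
preds = {
    "proteome": np.array([[0.2, 0.4], [1.0, -1.0]]),
    "site_specific": np.array([[0.4, 0.0], [0.8, -0.6]]),
    "cross_tissue": np.array([[0.0, 0.2], [1.2, -0.8]]),
    "multi_site": np.array([[0.2, 0.2], [1.0, -0.8]]),
}

# Equal weights are an assumption; the text does not state the weighting.
ensemble = np.mean(np.stack(list(preds.values())), axis=0)
print(ensemble.shape)  # (2, 2)
```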
6. result analysis and figure preparation