This repo hosts the scripts used in the study of Signatures of Discriminative Copy Number Aberrations in 31 Cancer Subtypes.
The open-access data from Progenetix and TCGA, and restricted data from PCAWG were used in the study.
- The complete lists of the samples used in each data repository are provided in
/data
. - The open-access data used in the study is available at Progenetix.
- In accordance with the data access policies of the ICGC, researchers need to apply to the ICGC Data Access Compliance Office for PCWAG data access. Instruction on accessing restricted data from the ICGC/PCAWG is available at https://docs.icgc.org/pcawg/data/.
alt-pipeline/ Scripts to generate .pkl files from the provided sample data at Progenetix.
classification/ Scripts of the classification experiments.
data/ The external & generated data used during the study.
integration/ Scripts to process the original data from Progenetix, TCGA, and PCAWG.
plots/ Scritps of all figures.
signatures/ Scripts for feature & signature generation using Autoencoder and LRP
Copy number data from Progenetix, TCGA and PCAWG were preprocessed with the following steps, respectively:
- probe or segment data were lifted to hg38, if the original data was not in hg38.
- transformed to a uniform data structure and stored in mongodb.
- normalized using mecan4CNA.
All data were combined in a single collection in mongodb (db:Rebased, collection:mecaned).
{
"source" : "TCGA",
"project" : "TCGA-BRCA",
"sample_id" : "ae96c429-b221-4894-a45a-6aa4e8d32c71",
"morphology" : "8500/3",
"topography" : "Breast, NOS",
"segments" : [
{
"chro" : "1",
"start" : 3301765,
"end" : 53333626,
"probes" : 26594,
"value" : -0.2022
}
],
"base" : 1.85,
"level_distance" : 0.35,
"normalized" : [
{
"chro" : "1",
"start" : 3301765,
"end" : 53333626,
"probes" : 26594,
"value" : -0.2504
}
],
"cytobands" : [
{
"start" : 0,
"end" : 2300000,
"name" : "p36.33",
"note" : "gneg",
"total_dup" : 0,
"total_del" : 0,
"dup_length" : 0,
"del_length" : 0,
"dup_count" : 0,
"del_count" : 0,
"chro" : "1",
"ave_dup" : 0,
"ave_del" : 0
}
]
}
Here, segments
stores the original data, normalized
stores the normalized segments, and cytobands
is used in the feature extraction procedure to store the summary of each cytoband. base
and level_distance
are parameters computed by mecan4CNA and are used for the normalization.
To facilitate the downstream pipelines, the following pickle files were created from the mongodb.
all_bands_meta.pkl
: the metadata of all samples.all_bands.pkl
: the band features (weighted CNV average) of each sample.all_bands_label.pkl
: the morphology, topography and organ labels of each sample.all_bands_disease_label.pkl
: the morphology label of each sample.all_bands_source_label.pkl
: the source of each sample.
When using the download data from Progenetix, please use the alt-pipeline
instead of integration
to preprocess data. Because of the difference in data, the alt-pipeline
is not identical to the original pipeline.
Please run scripts in the following order:
- load_data
- calibration
- combine_data
- normalization
- cytoband_data
- gen_pickles
The procedure:
- Build an autoencoder model using cytoband features
- Extract high-weighting cytoband features
- Build an autoencoder model using gene features (generated from high-weighting cytoband features)
- Extract high-weight gene features
- Generate signatures for cancer subtypes
The procedure:
- Filter data (with subtype signatures, enough samples)
- Upsampling & downsampling during cross-validations
- Multi-class classification of cancer subtypes
- Extend classification results to organs of origin