Genomic Copy Number Signatures Based Classifiers for Subtype Identification in Cancer

This repo hosts the scripts used in the study of Signatures of Discriminative Copy Number Aberrations in 31 Cancer Subtypes.

Data

The study used open-access data from Progenetix and TCGA and restricted data from PCAWG.

The complete lists of the samples used in each data repository are provided in /data.
The open-access data used in the study is available at Progenetix.
In accordance with the data access policies of the ICGC, researchers need to apply to the ICGC Data Access Compliance Office for access to PCAWG data. Instructions for accessing restricted data from the ICGC/PCAWG are available at https://docs.icgc.org/pcawg/data/.

File structure

alt-pipeline/			Scripts to generate .pkl files from the provided sample data at Progenetix.
classification/			Scripts of the classification experiments.
data/				The external & generated data used during the study.
integration/			Scripts to process the original data from Progenetix, TCGA, and PCAWG.
plots/				Scripts for all figures.
signatures/			Scripts for feature & signature generation using an autoencoder and LRP.

Workflow

Data integration

Copy number data from Progenetix, TCGA, and PCAWG were each preprocessed with the following steps:

Probe or segment data were lifted over to hg38 if the original data were not already in hg38.
The data were transformed to a uniform data structure and stored in mongodb.
The data were normalized using mecan4CNA.

All data were combined in a single collection in mongodb (db:Rebased, collection:mecaned).

An example of the mongodb data structure

{
    "source" : "TCGA",
    "project" : "TCGA-BRCA",
    "sample_id" : "ae96c429-b221-4894-a45a-6aa4e8d32c71",
    "morphology" : "8500/3",
    "topography" : "Breast, NOS",
    "segments" : [
        {
            "chro" : "1",
            "start" : 3301765,
            "end" : 53333626,
            "probes" : 26594,
            "value" : -0.2022
        }
    ],
    "base" : 1.85,
    "level_distance" : 0.35,
    "normalized" : [
        {
            "chro" : "1",
            "start" : 3301765,
            "end" : 53333626,
            "probes" : 26594,
            "value" : -0.2504
        }
    ],
    "cytobands" : [
        {
            "start" : 0,
            "end" : 2300000,
            "name" : "p36.33",
            "note" : "gneg",
            "total_dup" : 0,
            "total_del" : 0,
            "dup_length" : 0,
            "del_length" : 0,
            "dup_count" : 0,
            "del_count" : 0,
            "chro" : "1",
            "ave_dup" : 0,
            "ave_del" : 0
        }
    ]
}

Here, segments stores the original data, normalized stores the normalized segments, and cytobands is used in the feature extraction procedure to store the summary of each cytoband. base and level_distance are parameters computed by mecan4CNA and are used for the normalization.
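The per-cytoband summary could be computed roughly as follows. This is a minimal sketch, not the repository's code: the gain/loss cutoffs (DUP_THRESHOLD, DEL_THRESHOLD), the function name summarize_band, and the reading of total_dup/total_del as length-weighted sums are all assumptions made for illustration.

```python
# Sketch only: the cutoffs and the exact meaning of each summary field
# are assumptions, not taken from the repository's scripts.
DUP_THRESHOLD = 0.15    # hypothetical cutoff for calling a gain
DEL_THRESHOLD = -0.15   # hypothetical cutoff for calling a loss


def summarize_band(band, segments):
    """Summarize the normalized segments that overlap one cytoband."""
    summary = dict(band, total_dup=0.0, total_del=0.0,
                   dup_length=0, del_length=0, dup_count=0, del_count=0)
    for seg in segments:
        if seg["chro"] != band["chro"]:
            continue
        # Number of bases shared by the segment and the band.
        overlap = min(seg["end"], band["end"]) - max(seg["start"], band["start"])
        if overlap <= 0:
            continue
        if seg["value"] >= DUP_THRESHOLD:
            summary["total_dup"] += seg["value"] * overlap
            summary["dup_length"] += overlap
            summary["dup_count"] += 1
        elif seg["value"] <= DEL_THRESHOLD:
            summary["total_del"] += seg["value"] * overlap
            summary["del_length"] += overlap
            summary["del_count"] += 1
    # Length-weighted averages; 0 when no gain or loss touches the band.
    summary["ave_dup"] = (summary["total_dup"] / summary["dup_length"]
                          if summary["dup_length"] else 0.0)
    summary["ave_del"] = (summary["total_del"] / summary["del_length"]
                          if summary["del_length"] else 0.0)
    return summary
```

With the example document above, the normalized segment starts at 3301765, so it does not overlap band p36.33 (0 to 2300000) and every summary field stays at 0, which matches the values shown.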

Intermediate files

To facilitate the downstream pipelines, the following pickle files were created from the mongodb collection:

all_bands_meta.pkl: the metadata of all samples.
all_bands.pkl: the band features (weighted CNV average) of each sample.
all_bands_label.pkl: the morphology, topography and organ labels of each sample.
all_bands_disease_label.pkl: the morphology label of each sample.
all_bands_source_label.pkl: the source of each sample.
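A minimal sketch of reading these files back, assuming they are standard pickles stored together in one directory and that each holds one entry per sample in the same order. Neither assumption is confirmed by this README, and the helper names load_intermediates and check_alignment are hypothetical. If the objects were saved as pandas or NumPy structures, those libraries must be installed for unpickling to succeed.

```python
import pickle
from pathlib import Path

# File names as listed above.
PICKLE_NAMES = [
    "all_bands_meta.pkl",
    "all_bands.pkl",
    "all_bands_label.pkl",
    "all_bands_disease_label.pkl",
    "all_bands_source_label.pkl",
]


def load_intermediates(data_dir):
    """Return {file name: unpickled object} for every listed file that exists."""
    data_dir = Path(data_dir)
    loaded = {}
    for name in PICKLE_NAMES:
        path = data_dir / name
        if path.exists():
            # Only unpickle files you trust: pickle can execute code on load.
            with path.open("rb") as handle:
                loaded[name] = pickle.load(handle)
    return loaded


def check_alignment(loaded):
    """Report whether all sized objects have the same number of entries."""
    sizes = {name: len(obj) for name, obj in loaded.items()
             if hasattr(obj, "__len__")}
    return len(set(sizes.values())) <= 1, sizes
```

Calling check_alignment(load_intermediates("data")) before training is a cheap way to catch a features file and a label file that were generated from different runs.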

The alt-pipeline

When using the data downloaded from Progenetix, please use the alt-pipeline instead of integration to preprocess the data. Because the downloaded data differ from the original data, the alt-pipeline is not identical to the original pipeline.

Please run scripts in the following order:

load_data
calibration
combine_data
normalization
cytoband_data
gen_pickles
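The order above can be enforced with a small runner. This is a sketch under assumptions: it supposes each step is a Python script named after the step with a .py suffix inside alt-pipeline/ and that the scripts need no command-line arguments. Check the actual file names and adjust pipeline_dir and suffix before using it.

```python
import subprocess
import sys
from pathlib import Path

# Step names in the order given above.
STEPS = [
    "load_data",
    "calibration",
    "combine_data",
    "normalization",
    "cytoband_data",
    "gen_pickles",
]


def build_commands(pipeline_dir="alt-pipeline", suffix=".py"):
    """Build one interpreter command per step (the suffix is an assumption)."""
    return [[sys.executable, str(Path(pipeline_dir) / f"{step}{suffix}")]
            for step in STEPS]


def run_pipeline(pipeline_dir="alt-pipeline", suffix=".py"):
    """Run every step in order and stop at the first failure."""
    for command in build_commands(pipeline_dir, suffix):
        print("running", " ".join(command))
        # check=True raises CalledProcessError if a step exits non-zero.
        subprocess.run(command, check=True)
```

Stopping at the first failing step matters here because each script consumes the output of the previous one.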

Feature & signature generation

The procedure:

Build an autoencoder model using cytoband features
Extract high-weighting cytoband features
Build an autoencoder model using gene features (generated from high-weighting cytoband features)
Extract high-weighting gene features
Generate signatures for cancer subtypes
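The "extract high-weighting features" steps can be illustrated with a simple ranking. The sketch assumes that a relevance score per sample and per feature is already available (for example from LRP on the trained autoencoder) and ranks features by mean absolute relevance. That ranking criterion, the function name top_features, and the cutoff k are assumptions for illustration, not the study's documented selection rule.

```python
def top_features(relevance, names, k):
    """Return the k feature names with the highest mean absolute relevance.

    relevance: list of per-sample lists, one score per feature.
    names:     feature names, in the same order as the scores.
    """
    if not relevance:
        return []
    totals = [0.0] * len(names)
    for row in relevance:
        for i, score in enumerate(row):
            totals[i] += abs(score)
    means = [total / len(relevance) for total in totals]
    # Sort feature indices from most to least relevant.
    ranked = sorted(range(len(names)), key=lambda i: means[i], reverse=True)
    return [names[i] for i in ranked[:k]]


# Toy scores for three cytoband features across two samples.
scores = [[0.1, -0.9, 0.3],
          [0.2, 0.8, -0.1]]
bands = ["1p36.33", "8q24.21", "17p13.1"]
print(top_features(scores, bands, k=2))
```

Taking the absolute value keeps features that are strongly relevant in either direction, so gains and losses are treated alike in this toy ranking.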

Classification

The procedure:

Filter data (with subtype signatures, enough samples)
Upsampling & downsampling during cross-validation
Multi-class classification of cancer subtypes
Extend classification results to organs of origin
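The filtering and resampling steps can be sketched as below. This is an illustration under assumptions, not the study's code: the minimum class size, the per-class target, and the helper names filter_classes and resample_training_fold are hypothetical. The point it shows is that resampling is applied to the training fold only, so the held-out fold keeps its original class distribution.

```python
import random
from collections import Counter, defaultdict


def filter_classes(labels, min_samples):
    """Return indices of samples whose class has at least min_samples members."""
    counts = Counter(labels)
    return [i for i, label in enumerate(labels) if counts[label] >= min_samples]


def resample_training_fold(train_idx, labels, target, seed=0):
    """Balance a training fold to `target` samples per class.

    Large classes are downsampled without replacement; small classes are
    upsampled by drawing extra copies with replacement.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i in train_idx:
        by_class[labels[i]].append(i)
    balanced = []
    for members in by_class.values():
        if len(members) >= target:
            balanced.extend(rng.sample(members, target))
        else:
            extra = rng.choices(members, k=target - len(members))
            balanced.extend(members + extra)
    rng.shuffle(balanced)
    return balanced


# Toy labels: class "c" is too small and is dropped before resampling.
labels = ["a"] * 6 + ["b"] * 2 + ["c"]
kept = filter_classes(labels, min_samples=2)
train = resample_training_fold(kept, labels, target=4)
print(Counter(labels[i] for i in train))
```

In a real cross-validation loop, resample_training_fold would be called once per fold on that fold's training indices, and the test indices would be passed to the classifier unchanged.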
