Welcome to CNVRobot v4.2!
CNVRobot is an integrated pipeline designed to detect rare germline and somatic copy-number variants (CNVs) as well as loss of heterozygosity (LOH) regions in the human genome using short-read DNA sequencing from any NGS platform. This includes targeted sequencing (TS), whole-exome sequencing (WES), and whole-genome sequencing (WGS). It integrates sequencing depth with SNP zygosity across the genome, incorporates advanced data denoising, segmentation and variant prioritization options, and provides detailed visualization for manual inspection of detected CNVs.
The code, databases, and example data are available for download HERE. The latest version is CNVRobot v4.2. Please note that the databases are not available here on GitHub.
- GATK v4.2+ (Apache License 2.0)
- PICARD (MIT License)
- R (GPL-2 | GPL-3)
- Samtools, HTSlib (bgzip and tabix), BCFtools (MIT/Expat License)
- bedtools (GNU General Public License v2.0)
- bedGraphToBigWig (UCSC Genome Browser License)
- dos2unix (BSD License)
- java (Oracle Binary Code License)
- gzip/gunzip (GNU General Public License)
- R packages:
- tidyverse (MIT License)
- expss (GPL-2 | GPL-3)
- splitstackshape (GPL-3)
- karyoploteR (Artistic-2.0)
- IRanges (Artistic-2.0)
- rhdf5 (Artistic-2.0)
- only for CNVkit analysis: see CNVkit github for instalation via Conda and setting up a new Python environment
- only for ASCAT analysis: R packages ASCAT, GenomicRanges (Artistic-2.0) and plyranges (Artistic-2.0)
- BAM files
- FASTA file - the reference file used for alignment (possible assemblies: hg19, GRCh37, hg38, GRCh38, or CHM13v2.0)
- capture BED file (for WES or TS)
- first three columns required - contig, start, and end; no header
- requires the same genome assembly as the reference file (Contigs not found in the reference will be automatically excluded.)
- master files - spreadsheets of the project, controls, samples, etc.; see below for details
- Fill
./CNVRobot_vX.X/Masters/setting_base.sh
to specify paths for software. - Fill
./CNVRobot_vX.X/Masters/master_projects
to specify conditions for your project. - Fill
./CNVRobot_vX.X/Masters/master_samples.txt
and./CNVRobot_vX.X/Masters/master_controls.txt
to specify samples and control sets for the project. - Run
./CNVRobot_vX.X/Scripts/run.sh
.
How CNVRobot regulates what project(s), sample(s) and control(s) are executed?
-
The
INCLUDE
column in each master file serves as a switch to determine which projects to execute, which samples to analyze, and which controls to employ. This way, it's possible to maintain multiple projects within the project master file, designatingno
in theINCLUDE
column for the projects you do not wish to activate. -
The same principle applies to the sample master file; by marking
INCLUDE
asyes
for a particular sample, you're indicating your intention to analyze it. Conversely, if you want to exclude certain samples from the current analysis, simply assignno
to theINCLUDE
field. -
For the control master file, you might have multiple controls listed. If you need to exclude some controls, perhaps due to poor quality control, assign
no
to their respectiveINCLUDE
fields. CNVRobot will register this modification and execute the analysis excluding the specified controls. -
CNVRobot will link samples and controls for each project by matching
PROJECT_ID
,CAPTURE_ID
, andGENOME_VERSION
columns. -
Details about each master file are provided below.
-
Please avoid using space characters in any identifier.
To run this test data, please fill ./CNVRobot_vX.X/Masters/setting_base.sh
and run ./CNVRobot_vX.X/Scripts/run.sh
.
Simulated example data contains one sample, two controls, and a reference file.
Platform: targeted sequencing
Chromosome locus: 21q21
Genomic position: chr21:15.0-16.0Mb
Genome assembly: hg38
CNVs in the sample: loss of chr21:15.2-15.3Mb, gain of chr21:15.7-15.8Mb
You can browse /Test_example/Test_example_output
to see the output from this example data, including figures and IGV files.
GATK recommends that the control set consists of at least 40 samples without pathogenic CNVs. These samples should be prepared, sequenced, and analyzed using the same laboratory protocols and bioinformatic processes. However, even if you only have 2–5 controls processed simultaneously, it is still better than having no control at all. Even a small number of controls can already be highly effective in reducing noise from technical artifacts. The control set can overlap with the sample set, as long as no pathological CNVs are expected. For instance, the control set can comprise unaffected parents in a pedigree dataset or germline samples in a cancer dataset. By default, any related samples are automatically excluded from the controls during analysis. This is done by excluding all controls that match the main identifier MAIN_ID
provided in both the sample and control spreadsheet.
CNVRobot offers an option to denoise a tumor sample using its corresponding germline sample. The resulting profile might be noisier (as only one control is technically used), but it can provide a better understanding of what is somatic and what is germline. This option can be configured under SETTING_MODE
in the /Masters/master_projects.txt
. See the Specific project setting options
section for further details.
For the accurate analysis of CNV on chrX and chrY, denoising using controls and segmentation are carefully processed. To detect gonosomal CNVs correctly, controls of the same sex are required for GATK data denoising. This means that SMPL_PON_SEX_SELECT
within /Masters/master_projects.txt
should be M
, F
, both_separated
, matched_main
, or matched_each
. The CNVRobot segmentation algorithm will always recognize which control sex was used against which sample sex and adjust log2 ratio cut-offs for abnormalities accordingly. For chrY analysis in males, the Panel of Normals should consist of male controls. If the Panel of Normals consists of female controls, chrY data will be excluded from the analysis by GATK algorithms. If SMPL_PON_SEX_SELECT
is mixed, the analysis of chrX will still be correct (due to coverage doubling in males, see below), but chrY cannot be reported accurately as it will be influenced by the ratio of males and females in the GATK Panel of Normals.
Note that CNVRobot has a module for denoising data using CNVkit, which can be beneficial for the analysis of chrY if a small control set is available and/or includes female controls only. See the Specific project setting options
section for further details.
Before including a control in the Panel of Normals, its sex is verified or predicted if unknown. Sex tests and predictions do not run for samples, as gonosomal aneuploidies can be present.
Each control is tested to ensure the sex is correctly determined by the CTRL_SEX
value in /Masters/master_controls.txt
. This test uses global coverage (normalized for genomic size) of gonosomes compared to autosomes. If there are no gonosomal sequencing targets, the test will be skipped. If there are no autosomal sequencing targets, but both gonosomes have sequencing targets, the test will use the ratio of chrY to chrX. If the predicted sex differs from what was provided by the user, the pipeline will halt and report this error.
If CTRL_SEX
is defined as unk
in /Masters/master_controls.txt
, the same test will be run to predict the sex, and the control will be further processed using this predicted sex.
By default, the collected coverage of chrX and chrY is now doubled in male samples and controls. This step was introduced in response to observations of GATK "over-denoising" gonosomes relative to autosomes when the Panel of Normals is composed of males (haploid chrX and chrY). The normal (diploid) values for log2 ratios are as follows:
- F sample denoised by F controls: chrX = 0, chrY = no data
- M sample denoised by F controls: chrX = 0, chrY = no data
- F sample denoised by M controls: chrX = 0, chrY = deep minus
- M sample denoised by M controls: chrX = 0, chrY = 0
The segmentation algorithm automatically adjusts log2 ratio cut-offs for gonosomes based on the sample/control sex. Therefore, chrY will not be reported as deleted in a female sample if the Panel of Normals consists of male samples.
If, for any reason, a user prefers the original analysis without doubling the male gonosomal coverage, it is possible to specify this for your project run under SETTING_MODE
in /Masters/master_projects.txt
. Refer to the Specific project setting options
section for details. Note that the segmentation algorithm will detect this and adjust log2 ratio cut-offs for gonosomes accordingly, avoiding reports of changes due to sex-flips. In this setting, the normal (diploid) values for log2 ratios are as follows:
- F sample denoised by F controls: chrX = 0, chrY = no data
- M sample denoised by F controls: chrX = -1, chrY = no data
- F sample denoised by M controls: chrX = 1, chrY = deep minus
- M sample denoised by M controls: chrX = 0, chrY = 0
The segmentation of denoised coverage and SNP zygosity data is an important step to recognize abnormal segments (losses, gains, and LOH).
The pipeline offers a default segmentation setting. However, results can be reviewed and segmentation conditions adapted by the user based on data type, quality, and expectation. There are two options for customizing segmentation conditions using the project's master file SEGMENTATION_ID
column:
a) smart segmentation: The identifier pattern smart-XX
is provided within SEGMENTATION_ID
(in the project's master file), where XX
represents a segmentation coefficient, a number between 0.5 (for high sensitivity) and 1 (for high specificity). If smart segmentation is used, segmentation conditions are not loaded from the segmentation master file, but are instead calculated automatically by the given segmentation coefficient and detected sample quality. For subclonal analysis, the pattern is smart-XX-sub-YY
, with YY
representing a number between 0.1 and 1.0 (for example, 0.5 indicates a subclonal analysis to detect monoallelic CNV in 50% of cells).
b) custom segmentation: A short identifier (for example, my_segm, somatic_segmentation, etc.) is provided within SEGMENTATION_ID
(in the project's master file). This identifier matches a unique column name within the segmentation master file, where all segmentation conditions are manually defined by the user.
CNVRobot offers several run options beyond the default settings that can be useful for specific situations. These settings are found under SETTING_MODE
in /Masters/master_projects.txt
. Due to their complexity, they are explained in this specific section rather than in the /Masters/master_projects.txt
description. The SETTING_MODE
can simply be set to default
, in which case CNVRobot will operate in its standard setting.
However, SETTING_MODE
consists of seven positions, each of which can switch the run to a specific project setting.
Position 1 - Low mapping quality read count collection: F
/-
(default) or D
:
- Introduced with the T2T-CHM13v2.0 genome assembly to enable the collection of low mapping quality reads.
F
/-
(default) - This is the GATK-recommended setting when all read filters are enabled. WhileF
and-
are equivalent, if Microsoft Excel is used to fill the master files,F
should be used since the-
symbol can cause issues when placed at the beginning.D
- Enables the collection of reads with the "-DF "MappingQualityReadFilter" argument. This collects low-quality reads, allowing us to visualize regions like centromeres using GATK algorithms.
Position 2 - Matching germline denoising of tumor sample: -
(default) or G
-
(default) - The Panel of Normals is created based on the list of controls found within/Masters/master_controls.txt
and by the defined sex (SMPL_PON_SEX_SELECT
in/Masters/master_projects.txt
). The matching germline is excluded from the denoising. If listed asSAMPLE2
in/Masters/master_samples.txt
, it will be processed alongside the tumor with the same Panel of Normals.G
- The tumor sample will be denoised using only the corresponding germline sample. This germline must be listed in/Masters/master_controls.txt
with the sameMAIN_ID
identifier as the tumor sample in/Masters/master_samples.txt
. If the germline is also listed asSAMPLE2
in/Masters/master_samples.txt
, it will be processed next to the tumor and denoised by the Panel of Normals as in the default setting.
Position 3 - Not-doubling gonosomal coverage: -
(default) or A
(refer to Gonosomal analysis
section for context)
-
(default) - The collected coverage of chrX and chrY is doubled in male samples and controls.A
- The collected coverage of chrX and chrY IS NOT doubled in male samples and controls (retains the original GATK setting).
Position 4 - Speeding up CNVRobot by skipping certain steps: -
(default) or Q
:
-
(default) - No steps are skipped.Q
- Skips the following: segmentation and QC metrics for controls, as well as annotation and reports for samples.
Position 5 - Speeding up CNVRobot plotting by omitting detailed plots: -
(default) or Q
-
(default) - A figure is generated for each abnormal segment (excluding sub-clonal findings).Q
- No figure is generated for abnormal segments. Genome and chromosome figures are still generated.
Position 6 - Running CNVkit (tested with v0.9.10): -
(default) or Y
-
(default) - CNVkit is silent.Y
- CNVkit runs alongside GATK. Note that it needs to be correctly installed in a specific Python environment and requires paths toCONDA
andCNVKIT_ENV
(environment name) specified in/Masters/setting_base.txt
. CNVkit is used for coverage collection and denoising and may offer advantages for gonosomal analysis. Data is segmented using the CNVRobot algorithm. Outputs are similar to those produced by GATK, including plots and IGV files.
The normal (diploid) values for log2 ratios processed by CNVkit are as follows:
- F sample denoised by F controls: chrX = 0, chrY = deep minus
- M sample denoised by F controls: chrX = 0, chrY = 0
- F sample denoised by M controls: chrX = 0, chrY = deep minus
- M sample denoised by M controls: chrX = 0, chrY = 0
Position 7 - Running ASCAT (tested with v2.5.2): -
(default) or Y
-
(default) - ASCAT is silent.Y
- ASCAT is running. This allows output from CNVRobot to be directly processed by ASCAT to determine ploidy and analyze absolute copy-number data in a tumor sample. Please note that this mode is still a work in progress and requires both a tumor and a matching germline sample. The germline sample must be specified asSAMPLE2
in/Masters/master_samples.txt
to be included in the analysis. Outputs are stored in a separate folder within each sample and include regular ASCAT outputs as well as IGV files with absolute CN for segments (/IGV/*_absCN*
) from CNVRobot, and ASCAT CN segmentation for major and minor alleles (/IGV/*_allele.bw
).
Examples of SETTING_MODE
:
default
: All defaults apply - only reads with high mappability are collected by GATK, Panel of Normals is used for denoising, gonosomal coverage is doubled in males, no steps are skipped, and neither CNVkit nor ASCAT are run.D
(equivalent toD------
): Reads with low mapping quality will be collected by GATK. Other settings will remain as default.-G
(equivalent to-G-----
/FG-----
): The tumor will be denoised by its matching germline. Other settings will remain as default.---QQ
(equivalent to---QQ--
/F--QQ--
): CNVRobot will skip certain steps during processing and will not generate a plot for each abnormality. Other settings will remain as default.D-A
(equivalent toD-A----
): Reads with low mapping quality will be collected by GATK, and gonosomal coverage will not be doubled in males. Other settings will remain as default.-----Y
(equivalent to-----Y-
/F----Y-
): CNVRobot runs CNVkit in addition to default settings.------Y
(equivalent toF-----Y
): CNVRobot runs ASCAT in addition to default settings.
Important: If Microsoft Excel is used to fill the master files, the pattern starting F
(not -
) should be used. Symbol -
at the beggining can cause issues.
Please note that the SETTING_MODE
component of CNVRobot is still under development. If any aspect is unclear, do not hesitate to contact me for further details.
/Masters/setting_base.sh
Paths to software, main output folders, and basic parameters for the identification of noisy and rare SNPs in the gnomAD database
/Masters/master_projects.txt
Main information for each project
(!) = requires explicit values
(!-) = requires explicit values in specific situations
(D) = accepting default
Columns:
INCLUDE
(!) -yes
,no
- helps to regulate what project(s) will be executedyes
- project will be executedno
- project will not be executed- It is recommended to run one project per time.
PROJECT_ID
- short and unique identifier for the projectPROJECT_TYPE
(!) -germline
,tumor
,other
- important value for variant origin prediction in the final report; also used to define abnormal segments smoothing during segmentationgermline
- final report is generated for a pedigree project as heredity prediction; sample 1 is a proband and samples 2 and 3 are parents (works also if only one parent is available)tumor
- final report is generated for a cancer project as germline/somatic prediction (paired germline has to be available)other
- no report is generated; anything else
SEQ_TYPE
(!) -WGS
,WES
,TS
- type of the sequencing captureWGS
- whole genome sequencingWES
- whole exome sequencingTS
- targeted sequencing
CAPTURE_ID
- short identifier for the capture (examples: nextera, twist, sureselect_V4, wgs, wgs_10X etc.)CAPTURE_FILE
- path to the capture filepath/to/file
- required for any sequencing with targets (TS or WES)- for WGS, this column is not used and can be left blank or contains anything like not_used, na etc.
GENOME_VERSION
(!) -GRCh37-hg19
,GRCh38-hg38
,CHM13v2.0
GRCh37-hg19
- for any version of human genome assembly such as GRCh37, hg19 and b37GRCh38-hg38
- for any version of human genome assembly such as GRCh38 and hg38CHM13v2.0
- for T2T-CHM13v2.0 genome assembly
REF
- path to fasta fileSETTING_MODE
(D) (!) = seeSpecific project setting options section
GENE_ANNOTATION
(D) - path to gene database that will be used for annotation and plotsdefault
- refseq database will be used, provided in/Databases/GENE_ANNOTATION/
/path/to/file
- custom file can be prepared, seeCustom gene annotation
section below
BIN
(D) - numeric value - length (bp) of intervals for processing, used in WGS or when capture targets are too long; if no bins are required, value is0
default
- automatic use of1000
PADDING
(D) - numeric value - extension (bp) of each interval for depth collectiondefault
- automatic use of0
for WGS and read length for WES and TS (read length is calculated from the first control sample)
WAY_TO_BAM
(D) (!) -absolute
,find_in_dir
default
=absolute
absolute
- BAM file for each control/sample is provided as an absolute path. The final absolute path consists ofCTRL_BAM_DIR
(projects master file) +CTRL_PATH_TO_BAM
(controls master file) for controls andSMPL_BAM_DIR
(projects master file) +SAMPLE_PATH_TO_BAM
(samples master file) for samples.find_in_dir
- BAM file for each control/sample is found recursively in folderCTRL_BAM_DIR
/SMPL_BAM_DIR
using control/sample identifier (controls/samples master file) and shared patternCTRL_BAM_PATTERN
/SMPL_BAM_PATTERN
(projects master file). Importantly, this pattern has to fit only one BAM file in the provided folder. This setting is not primarily recommended, but it can be very useful when BAM files are in folders with many subfolders and absolute path would be very long and not very enjoyable to specify for each control/sample. The final file is found asCTRL_ID*CTRL_BAM_PATTERN
/SAMPLE1_ID*SMPL_BAM_PATTERN
. Patterns can be, for example,*.bam
,*_final.bam
, etc.
CTRL
(!) -yes
,no
yes
- controls will be included in the analysisno
- controls will not be included in the analysis; coverage profile will be denoised only by GC correction
CTRL_BAM_DIR
- path to the folder with BAM file of controlsCTRL_BAM_PATTERN
(D) - pattern (file suffix) for the BAM files of controls (examples:*.bam
,*_final.bam
, etc.); used only whenWAY_TO_BAM
isfind_in_dir
default
=na
(which means it is not used asWAY_TO_BAM
default
isabsolute
)
CTRL_SEX_TEST
(D) (!) -custom
,no
- see Sex test for more informationdefault
- sex test of controls will be performed using the default setting (SRY and GAPDH genes coverage)custom
- sex test of controls will be performed using custom genomic locations defined within the sex master file underPROJECT_ID
no
- sex test of controls will not be performed (for example because chromosome Y has no targets, sex of controls is unknown or user just does not require this test to be done)
CTRL_PON_SEX_SELECT
(D) (!) -M
,F
,matched
,mixed
- During pre-processing, controls are denoised by each other and segmented. This allows us to look at controls qc and the number of segments with a given segmentation setting, and possibly exclude controls with extreme values if wanted. Denoised data is also further used to determine CNVs in control in annotation and plots. Controls itself and controls with the sameMAIN_ID
are excluded from denoising. Settingmatched
is recommended, others can be used in case of a limited number of controls and/or controls with each sex.default
=matched
M
- only male controls are used for denoising of controlsF
- only female controls are used for denoising of controlsmatched
- male controls are used for denoising of male controls and female controls are used for denoising of female controlsmixed
- male and female controls are mixed together for denoising of samples; NOT RECOMMENDED setting for two reasons: 1) gonozomes cannot be analyzed, and 2) mixing sex was found to increase false positivity rate for autosomes during pipeline testing (one single-sex control provided better sensitivity/specificity than six mixed controls). This setting is recommended to be used only if the sex is unknown and cannot be determined from Y capture.
SMPL_BAM_DIR
- path to the folder with BAM file of samplesSMPL_BAM_PATTERN
(D) - pattern (file suffix) for the BAM files of samples (examples:*.bam
,*_final.bam
, etc.); used only whenWAY_TO_BAM
isfind_in_dir
default
=na
(which means it is not used asWAY_TO_BAM
default
isabsolute
)
SMPL_PON_SEX_SELECT
(D) (!) -M
,F
,both_separated
,matched_main
,matched_each
,mixed
,all_options
. This value allows specifying what controls are used for denoising of samples. Controls with the sameMAIN_ID
as the sample are automatically excluded from denoising. Selection depends on user's requirements and the number of controls with each sex available.default
=matched_each
M
- only male controls are used for denoising of samplesF
- only female controls are used for denoising of samplesboth_separated
-M
andF
will be performed separatelymatched_main
- male or female controls are selected based on the sex of the main sample (proband or tumor sample) and used for all related samples (family members or germline sample of the tumor)matched_each
- male controls are used for denoising of male samples and female controls are used for denoising of female samplesmixed
- male and female controls are mixed together for denoising of samples; NOT RECOMMENDED setting for two reasons: 1) gonozomes cannot be analyzed, and 2) mixing sex was found to increase false positivity rate for autosomes during pipeline testing (one single-sex control provided better sensitivity/specificity than six mixed controls). This setting is recommended to be used only if the sex is unknown and cannot be determined from Y capture.all_option
-M
,F
,mixed
,matched_main
, andmatched_each
will be performed separately
CN_FREQ_IN_CTRLS
(!) -yes
,no
yes
- CNV analysis in controls will be included in the annotation and plotsno
- CNV analysis in controls will not be included in the annotation and plots, for example because of no or a very small number of controls.
GNOMAD_SELECTION
(D) (!) -capture_filter
,capture_filter_and_ctrl_denois
,af_filter
,af_filter_and_ctrl_denois
- the original gnomAD database of SNP that is provided has to be filtered for efficient performing; usuallycapture_filter
/capture_filter_and_ctrl_denois
for TS or WES andaf_filter
/af_filter_and_ctrl_denois
for WGS. Option_and_ctrl_denois
requires a reasonable number of controls.default
- automatically selectedaf_filter_and_ctrl_denois
for WGS andcapture_filter_and_ctrl_denois
for WES and TScapture_filter
- SNPs are filtered for those being in the covered regions (for TS and WES)capture_filter_and_ctrl_denois
-capture_filter
+ noisy SNPs (recognized in controls) are excluded; noise in controls is found by default parameters in the base master fileaf_filter
- SNPs with very rare alternative alleles are excluded; allelic frequency is defined by default parameter in the base master fileaf_filter_and_ctrl_denois
-af_filter
+ noisy SNPs (recognized in controls) are excluded; noise in controls is found by default parameters in the base master file
SEGMENTATION_ID
(D) (!-) (seeData segmentation
section above for more information)default
=smart-0.65
for single-sex denoising andsmart-0.85
for mixed-sex denoising- smart segmentation -
smart-XX
(without sub-clonal analysis) orsmart-XX-sub-YY
(with subclonal analysis); XX = number between 0.5 (high sensitivity) and 1.0 (high specificity), YY = number between 0.1 and 1.0 (for example, 0.5 indicates that subclonal analysis can detect monoallelic CNV in 50% of cells) - custom segmentation - short identifier (for example,
my_segm
,somatic_segmentation
, etc.) to select required segmentation conditions from the segmentation master file
SEGMENTATION_ID_USE
(D) (!) -segm_id
,full
- This controls what identifier of segmentation will be used in filenames.default
=segm_id
segm_id
- RECOMMENDED; segmentation condition will be used in filenames by its short identifier (full version is still kept inside of the segmentation table)full
- all segmentation parameters in numeric version will be used in filenames; not primarily recommended due to creating long file names that can be difficult in some systems- If smart segmentation is used,
segm_id
is selected automatically.
NOTE
- any additional information that the user wants to keep with the project
/Masters/master_controls.txt
Controls spreadsheet
(!) = requires explicit values
Columns:
INCLUDE
(!) -yes
,no
- helps to regulate which controls will be included and at the same time to keep excluded controls with excluding reason in a note columnyes
- control will be includedno
- control will not be included
PROJECT_ID
- short and unique identifier for the project, matchingPROJECT_ID
in the other master filesCAPTURE_ID
- short identifier for the capture, matchingCAPTURE_ID
in the other master filesGENOME_VERSION
(!) -GRCh37-hg19
,GRCh38-hg38
, matchingGENOME_VERSION
in the other master filesGRCh37-hg19
- for any version of human genome assembly such as GRCh37, hg19, and b37GRCh38-hg38
- for any version of human genome assembly such as GRCh38 and hg38
MAIN_ID
- controls/samples group identifier; This identifier is the one that decides which controls to exclude for denoising because of relatedness. Therefore, controls/samples that are related have to share the same identifier in the controls and the samples master file.CTRL_ID
- control identifier; IfWAY_TO_BAM
isfind_in_dir
, this identifier has to be part of the BAM file name asCTRL_ID*CTRL_BAM_PATTERN
(pattern determined in the projects master file)CTRL_SEX
(!) -M
,F
,unk
- control sexM
- control is maleF
- control is femaleunk
- sex of control is unknown
CTRL_PATH_TO_BAM
- control BAM file path whenWAY_TO_BAM
isabsolute
, the final path for each BAM file is a combination ofCTRL_BAM_DIR
(projects master file) andCTRL_PATH_TO_BAM
. IfWAY_TO_BAM
isfind_in_dir
, this column is not used.NOTE
- any additional information that the user wants to keep with the control
/Masters/master_samples.txt
Samples spreadsheet
(!) = requires explicit values
Columns:
INCLUDE
(!) -yes
,no
- helps to regulate what samples are analyzed and skipped during runPROJECT_ID
- short and unique identifier for the project, matchingPROJECT_ID
in the other master filesCAPTURE_ID
- short identifier for the capture, matchingCAPTURE_ID
in the other master filesGENOME_VERSION
(!) -GRCh37-hg19
,GRCh38-hg38
, matchingGENOME_VERSION
in the other master filesGRCh37-hg19
- for any version of human genome assembly such as GRCh37, hg19, and b37GRCh38-hg38
- for any version of human genome assembly such as GRCh38 and hg38
MAIN_ID
- controls/samples group identifier; This identifier is the one that decides what controls to exclude from the denoising because of relatedness. Therefore, controls/samples that are related have to share the same identifier in the controls and the samples master files.SAMPLE1_ID
- sample_1 (main sample) identifier; IfWAY_TO_BAM
isfind_in_dir
, this identifier has to be part of BAM file name asSAMPLE1_ID*SMPL_BAM_PATTERN
(pattern determined in the projects master file)SAMPLE1_TYPE
- sample_1 (main sample) type; for example proband, child-affected, child-unaffected, tumor, etc.SAMPLE1_SEX
- (!) -M
,F
,unk
- sample_1 (main sample) sexM
- sample is maleF
- sample is femaleunk
- sex of the sample is unknown
SAMPLE1_PATH_TO_BAM
- sample_1 (main sample) BAM file path whenWAY_TO_BAM
isabsolute
, the final path for each BAM file is a combination ofSMPL_BAM_DIR
(projects master file) andSAMPLE1_PATH_TO_BAM
. IfWAY_TO_BAM
isfind_in_dir
, this column is not used.SAMPLE2_ID
- sample_2 identifier; IfWAY_TO_BAM
isfind_in_dir
, this identifier has to be part of BAM file name asSAMPLE2_ID*SMPL_BAM_PATTERN
(pattern determined in the projects master file)SAMPLE2_TYPE
- sample_2 (main sample) type; for example father, mother, germline, etc.SAMPLE2_SEX
- (!) -M
,F
,unk
- sample_2 sexM
- sample is maleF
- sample is femaleunk
- sex of the sample is unknown
SAMPLE2_PATH_TO_BAM
- sample_2 BAM file path whenWAY_TO_BAM
isabsolute
, the final path for each BAM file is a combination ofSMPL_BAM_DIR
(projects master file) andSAMPLE2_PATH_TO_BAM
. IfWAY_TO_BAM
isfind_in_dir
, this column is not used.SAMPLE3_ID
- sample_3 identifier; IfWAY_TO_BAM
isfind_in_dir
, this identifier has to be part of BAM file name asSAMPLE3_ID*SMPL_BAM_PATTERN
(pattern determined in the projects master file)SAMPLE3_TYPE
- sample_3 (main sample) type; for example father, mother, etc.SAMPLE3_SEX
- (!) -M
,F
,unk
- sample_3 sexM
- sample is maleF
- sample is femaleunk
- sex of the sample is unknown
SAMPLE3_PATH_TO_BAM
- sample_3 BAM file path whenWAY_TO_BAM
isabsolute
, the final path for each BAM file is a combination ofSMPL_BAM_DIR
(projects master file) andSAMPLE3_PATH_TO_BAM
. IfWAY_TO_BAM
isfind_in_dir
, this column is not used.NOTE
- any additional information that user wants to keep with the sample
/Masters/master_reg_of_interest_to_plot.txt
Spreadsheet that allows to plot required region for required sample; if PROJECT_ID
is recognized to be running and at least one of the lines of this master file has INCLUDE
= yes
, a separate detailed plot for the required sample within /detail_regions_of_interest/
folder will be produced. It can be useful when a user wants to print a suspicious region that was not recognized as abnormal and therefore not printed as a detailed plot.
(!) = requires explicit values
Columns:
- First 17 columns are the same as columns in the samples master file
CONTIG
- chromosome of the region to plotSTART
- starting position of the region to plotEND
- ending position of the region to plotZOOM
- numeric value, zoom for the defined region to be printed to the plot, examples:0
:CONTIG
:START
-END
will be printed to the plot1
:CONTIG
: [START
-1✗(END
-START
)] - [END
+1✗(END
-START
)] will be printed to the plot2
:CONTIG
: [START
-2✗(END
-START
)] - [END
+2✗(END
-START
)] will be printed to the plot
/Masters/setting_segmentation.txt
Parameters for CNVs and LOH segmentation, used when custom segmentation is selected (see column SEGMENTATION_ID
in the projects master) and paired by a short identifier.
Columns:
FEATURE
- Segmentation parameterSEGM-EXAMPLE
- Example of segmentation setting...
- Additional columns can be added by the user; a short and unique ID of segmentation is required
The segmentation identifier defined in the column header (e.g., my_segm, somatic_segmentation, etc.) is used to pair the segmentation setting and project as SEGMENTATION_ID
within the projects master file. This identifier is also used in filenames if SEGMENTATION_ID_USE
= segm_id
.
List of parameters:
SEGM_DIFFERENCE
SEGM_MINSIZE
SEGM_MINSIZE_SUB
SEGM_MINPROBE
SEGM_MINPROBE_SUB
SEGM_MINKEEP
SEGM_MINLOSS
SEGM_MINLOSS_SUB
SEGM_MINLOSS_BIAL
SEGM_MINGAIN
SEGM_MINGAIN_SUB
SEGM_GAP
SEGM_SMOOTHPERC
SEGM_AFDIF
SEGM_AFSIZE
SEGM_AFPROBE
Subclonal analysis is defined by parameters ending with _SUB
. If no subclonal analysis is required, these variables are set to none
. If SEGM_GAP
is set to none
, segments continue independently regardless of their distance from each other in the capture.
/CNVRobot/Processing/Results/PROJECT_ID/PROJECT_ID_SETTING/data/
...allelicCounts...rds
- RDS file with SNP zygosity collection...counts...hdf
- HDF5 file with coverage collection...denoisedCR...tsv
- TSV file with denoised coverage data...SEGMENTS...segmentation...tsv
- TSV file with segmentation data before annotation, contains all normal and abnormal segments...SEGMENTS...tsv
- TSV file with segmentation and annotation, contains all normal and abnormal segments...SEGMENTS...REPORT...tsv
- TSV file with segmentation, annotation, and variant origin prediction, contains ONLY abnormal segments
/CNVRobot/Processing/Results/PROJECT_ID/PROJECT_ID_SETTING/plot_SETTING/plot/
- genome - Figure showing all chromosomes (contigs with centromere) extracted from the reference (usually chr1-chr22, chrX, and chrY)
- chromosome - Figure is generated for each chromosome
- /detail/ - Folder with figures generated for each abnormal segment (does not include sub-clonal findings)
/CNVRobot/Processing/Results/PROJECT_ID/PROJECT_ID_SETTING/plot_SETTING/IGV/
-
..._abnormal_CN_segments.bed
- BED file with segments for DNA losses and gains -
..._abnormal_LOH_segments.bed
- BED file with segments for LOH (any regions with deviation from heterozygosity, which can be due to deletion, cnnLOH or gain); generated only if LOH region(s) were detected -
..._segments.bw
- BigWig with all segments (normal and abnormal),..._CN.bw
- BigWig with denoised coverage data- IGV setting:
- Type of Graph: Points
- Windowing Function: None
- Set Data Range: min -2.5, mid 0.0, max 2.5
- Note: log2 ratio values smaller than -2.25 are shifted to -2.25 and log2 ratio bigger than 2.25 are shifted to 2.25 in IGV output (as well as in the PNG plots)
- Tip: when these two BigWig files in the same setting, mark them and by right click select "Overlay Tracks"
- IGV setting:
-
.._SNP.bw
- BigWig with SNP zygosity- IGV setting:
- Type of Graph: Points
- Windowing Function: None
- Set Data Range: min 0.0, mid 0.5, max 1.0
- IGV setting:
/CNVRobot/Processing/Results/PROJECT_ID/PROJECT_ID_SETTING/data/
...(anti)targetcoverate.cnn
- CNN file with coverage collection...cnr
- denoised coverage data...SEGMENTS...CNVkit.tsv
- TSV files with segmentation, same as above, just generated using CNVkit data
/CNVRobot/Processing/Results/PROJECT_ID/PROJECT_ID_SETTING/plot_SETTING/cnvkit_plot/
/CNVRobot/Processing/Results/PROJECT_ID/PROJECT_ID_SETTING/plot_SETTING/cnvkit_IGV/
- same as above, just generated using CNVkit data
/CNVRobot/Processing/Results/PROJECT_ID/PROJECT_ID_SETTING/ascat.../
- contains data outputs from ASCAT (
/data/
) and IGV files (/IGV/
) - note, this is still a work in progress
How to prepare gene annotation other than default RefSeq provided:
- Download the required gene annotation from UCSC website.
- Columns have to be named as follows (order does not need to be the same):
gene_id
: Identifier that will be displayed at plots and used for annotation (usually gene symbol)chrom
: Chromosomestrand
: + or - for strandtxStart
: Transcription start positiontxEnd
: Transcription end positioncdsStart
: Coding region startcdsEnd
: Coding region endexonStarts
: Exon start positionsexonEnds
: Exon end positions Most of the columns will match automatically, onlygene_id
needs to be picked and renamed by the user, for example, by using the commandsed -i 's/name2/gene_id/g' file.txt
. The table should never be edited in EXCEL due to renaming some gene symbols to dates. Contig IDs have to match IDs in the reference file.
This is an advanced option that allows the production of personalized detail plots for additional regions with personalized annotation.
A customized R script is required, provided in the example. This script has to be placed in the /CNVRobot_vX.X/Scripts/R/
folder and named as plot_CN_PROJECT_ID.R
. It is then recognized as a source in plot_CN.R
and produces plots within the /project_specific_regions/
folder.
- Processing of the databases from the original source is available upon request.
- Original sources of databases:
- gnomAD SNP:
- hg19/GRCh37 and hg38/GRCh38: gnomAD SNP v2.1 (exome and genome data was downloaded, combined, and processed into a simple VCF file in hg19/GRCh37 version; hg38/GRCh38 version prepared by liftover)
- CHM13v2.0: T2T aws -
chm13v2.0_dbSNPv155.vcf.gz
(filtered for gnomAD SNP only and processed into a simple VCF file)
- Database of Genomic Variants (DGV, TCAG) v2020-02-25 (hg19/GRCh37 and hg38/GRCh38 prepared using files from the website, CHM13v2.0 prepared by liftover from hg38/GRCh38 version, unlifted regions excluded)
- Chromosome bands:
- hg19/GRCh37 and hg38/GRCh38: UCSC chromosome bands and centromeres
- CHM13v2.0: T2T aws -
chm13v2.0_cytobands_allchrs.bed
- Chromosome centromeres: Border position between p arm and q arm of each chromosome, generated from chromosome bands
- Gene annotation:
- hg19/GRCh37 and hg38/GRCh38: UCSC Table Browser -
RefSeq All (ncbiRefSeq)
(under Genes and Gene Predictions, NCBI RefSeq) - CHM13v2.0: T2T UCSC -
catLiftOffGenesV1.bb
(processed to UCSC-looking format)
- hg19/GRCh37 and hg38/GRCh38: UCSC Table Browser -
- Mappability:
- GRCh37/hg19: ENCODE from UCSC -
wgEncodeCrgMapabilityAlign100mer.bigWig
- GRCh38/hg38: Hoffman Lab from UCSC
k100.Umap.MultiTrackMappability.bw
- CHM13v2.0: T2T aws -
chm13v2.0_100mer_mappability.bb
(note this data is binary and not quantitative as the others)
- GRCh37/hg19: ENCODE from UCSC -
- ENCODE black list:
- GRCh37/hg19: UCSC Table Browser -
ENCODE Blacklist (encBlackList)
(under Mapping and Sequencing, Problematic Regions) - GRCh38/hg38 and CHM13v2.0: liftover from GRCh37/hg19, unlifted regions excluded
- GRCh37/hg19: UCSC Table Browser -
- gnomAD SNP:
The CNVRobot pipeline is not designed for detecting common population CNVs and may produce errors in regions with low mappability. Users should be aware that CNVRobot performs a depth of coverage analysis within specified bins; therefore, the exact starting and ending points of abnormalities are not precisely determined. The tool is only capable of detecting unbalanced changes. Much like DNA microarrays, coverage-based analysis can pose challenges when there's a change in the sample's ploidy, given that it relies on median centering normalization.