Auxiliary data generation

Details on how HiFiCNV auxiliary data files were generated.

Excluded Regions

Excluded regions can optionally be specified with a BED file.

Pre-computed excluded regions files

Several useful exclusion tracks are provided for commonly used genomes. These can be used directly for those genomes, or as examples for developing exclusion files for other genomes.

Two pre-computed exclusion files are provided for hg38/GRCh38:

  • cnv.excluded_regions.hg38.bed.gz - Contains regions that are known to cause artifacts during data processing (e.g. centromeres). The script used to generate this file can be found here.
  • cnv.excluded_regions.common_50.hg38.bed.gz - Contains all the regions in the above file, plus regions that were frequently called as a duplication or deletion in a population. The additional regions were generated by running HiFiCNV on our population (N=97), and then storing any bin where >50% of the population had a duplication or deletion overlapping that bin. This is the recommended excluded regions track for human sample analysis.
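
The ">50% of the population" rule for the common_50 track can be sketched roughly as follows. This is a minimal illustration for a single chromosome, not the script used to build the published file: the function names, the `bin_size` parameter, and the input layout (one list of half-open `(start, end)` call intervals per sample) are all assumptions made for the example.

```python
from typing import Dict, List, Set, Tuple

Interval = Tuple[int, int]  # half-open [start, end), as in BED


def bins_touched(calls: List[Interval], bin_size: int) -> Set[int]:
    """Return the indices of all bins overlapped by any call in one sample."""
    touched: Set[int] = set()
    for start, end in calls:
        if end <= start:
            continue  # skip empty or malformed intervals
        first = start // bin_size
        last = (end - 1) // bin_size
        touched.update(range(first, last + 1))
    return touched


def common_bins(per_sample_calls: List[List[Interval]],
                bin_size: int,
                min_fraction: float = 0.5) -> List[int]:
    """Bins where strictly more than min_fraction of samples have a call."""
    n_samples = len(per_sample_calls)
    counts: Dict[int, int] = {}
    for calls in per_sample_calls:
        # Each sample contributes at most once per bin.
        for b in bins_touched(calls, bin_size):
            counts[b] = counts.get(b, 0) + 1
    return sorted(b for b, c in counts.items() if c > min_fraction * n_samples)


def merge_bins(bins: List[int], bin_size: int) -> List[Interval]:
    """Collapse sorted bin indices into contiguous BED-style intervals."""
    merged: List[Interval] = []
    for b in bins:
        start, end = b * bin_size, (b + 1) * bin_size
        if merged and merged[-1][1] == start:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged
```

For example, with four samples and a hypothetical 1000 bp bin size, a bin is kept only when at least three samples have a duplication or deletion overlapping it; the resulting intervals would then be appended to the base exclusion file.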

More limited exclusion files are also provided for hg19 and hs37d5:

  • cnv.excluded_regions.hg19.bed.gz - Contains regions that are known to cause artifacts during data processing (e.g. centromeres). This file was generated with the following script, modified with a 'ref' value of 'hg19'.
  • cnv.excluded_regions.hs37d5.bed.gz - Contains regions that are known to cause artifacts during data processing (e.g. centromeres). This file was generated by the following script which converts chromosome names from the hg19 exclusion file, removing hg19 non-canonical contigs and marking those from hs37d5 as excluded.

Note that the common deletion and duplication calls in the population provided for GRCh38 are not available for the other reference genomes. To improve CNV precision, it is recommended to either use GRCh38 or create a similar track of common population calls for other reference genomes.

How excluded regions influence copy number calling

All depth bins intersecting an excluded region are removed from the depth bins track. All minor allele frequency (MAF) evidence intersecting an excluded region is removed from the MAF track.
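
As a rough illustration of this filtering step, the sketch below drops any record that intersects an excluded interval. It assumes half-open BED-style coordinates on a single chromosome; the record layout `(start, end, value)` and the function names are hypothetical and do not reflect HiFiCNV's internal data structures.

```python
from typing import List, Tuple

Interval = Tuple[int, int]          # half-open [start, end)
Record = Tuple[int, int, float]     # (start, end, value), e.g. a depth bin or a MAF site


def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True if two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def drop_excluded(records: List[Record], excluded: List[Interval]) -> List[Record]:
    """Keep only records that do not intersect any excluded interval."""
    return [
        (start, end, value)
        for start, end, value in records
        if not any(overlaps(start, end, x_start, x_end) for x_start, x_end in excluded)
    ]
```

The same function covers both tracks if a MAF site at position `p` is represented as the one-base interval `(p, p + 1)`. A linear scan over the excluded intervals is used for clarity; a sorted or indexed lookup would be preferable for genome-scale input.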

Segmentation treats any depth bin intersecting an excluded region as having a small bias in favor of a special unknown copy-number state: the probabilities of all other copy-number states are equal to each other, but lower than that of the unknown state. This means that a copy-number change can span a short excluded region if there is sufficient evidence on the left or right flank, whereas longer excluded regions should be segmented into the unknown state.
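
The trade-off described above can be illustrated with a toy Viterbi-style segmenter. This is only a sketch under stated assumptions and is not HiFiCNV's actual model: the Gaussian score for observed bins, the `unknown_bonus`, `unknown_floor`, and `switch_penalty` values, and the use of `None` to mark an excluded bin are all hypothetical choices made for the example.

```python
from typing import List, Optional

UNKNOWN = -1                       # label for the special unknown state
CN_STATES = [0, 1, 2, 3, 4]        # ordinary copy-number states
STATES = CN_STATES + [UNKNOWN]


def emission_score(obs: Optional[float], state: int,
                   sigma: float = 0.3,
                   unknown_bonus: float = 0.5,
                   unknown_floor: float = -10.0) -> float:
    """Unnormalized log-score of one bin under one state (hypothetical model)."""
    if obs is None:
        # Excluded bin: every copy-number state scores the same,
        # and the unknown state gets a small bonus.
        return unknown_bonus if state == UNKNOWN else 0.0
    if state == UNKNOWN:
        # Observed bin: the unknown state is strongly disfavored.
        return unknown_floor
    # Observed bin: Gaussian-shaped score around the state's copy number.
    return -((obs - state) ** 2) / (2 * sigma ** 2)


def segment(observations: List[Optional[float]],
            switch_penalty: float = 5.0) -> List[int]:
    """Most likely state path, charging a fixed penalty per state change."""
    if not observations:
        return []
    n_states = len(STATES)
    score = [emission_score(observations[0], s) for s in STATES]
    back: List[List[int]] = []
    for obs in observations[1:]:
        new_score, pointers = [], []
        for j, state in enumerate(STATES):
            # Best predecessor for state j, accounting for the switch penalty.
            candidates = [score[i] - (0.0 if i == j else switch_penalty)
                          for i in range(n_states)]
            best_i = max(range(n_states), key=lambda i: candidates[i])
            new_score.append(candidates[best_i] + emission_score(obs, state))
            pointers.append(best_i)
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    idx = max(range(n_states), key=lambda i: score[i])
    path = [idx]
    for pointers in reversed(back):
        idx = pointers[idx]
        path.append(idx)
    path.reverse()
    return [STATES[i] for i in path]
```

With these hypothetical values, entering and leaving the unknown state costs two switch penalties (10.0), while each excluded bin contributes a bonus of only 0.5. A two-bin excluded gap inside a deletion therefore stays at copy number 1, whereas a 30-bin excluded gap accumulates enough bonus (15.0) to be segmented as unknown.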

Expected Copy Number

By default, HiFiCNV expects each chromosome to have two full copies (i.e. a diploid organism). When reporting variants to the output VCF file, it only reports deviations from this expectation. However, this expectation is undesirable for some chromosomes (e.g. sex chromosomes) or for non-diploid organisms. The expectation can be overridden by providing a BED file with expected copy-number values. Examples corresponding to XX/XY karyotypes are provided for the human GRCh38/hg38, hg19 and hs37d5 references: