ccdg_qc

Scripts to run QC on CCDG (Centers for Common Disease Genomics) WGS (Whole Genome Sequence) data.

Data: ~60K whole genomes sequenced at 5 centers:

- Broad Institute
- Baylor
- New York Genome Center (NYGC)
- Univ. of Michigan
- Washington Univ. in St. Louis

2017 Analyst: Robert Maier

Links to sequence data QC resources:

- gnomAD QC
- WES QC workflow
- Sequence QC blog for Covid HGI
- Sequence QC GitHub for Covid HGI

CCDG WGS & WES QC Pipeline

WGS Data

Hail: INFO: wrote matrix table with 2907865897 rows and 136959 columns in 121159 partitions to gs://ccdg/vds/wgs_136k_recombine.vds/reference_data
    Total size: 86.05 TiB
    * Rows/entries: 86.05 TiB
    * Columns: 633.99 KiB
    * Globals: 11.00 B
    * Smallest partition: 4 rows (116.47 KiB)
    * Largest partition:  9161 rows (841.07 MiB)
Hail: INFO: wrote matrix table with 1105368358 rows and 136959 columns in 121159 partitions to gs://ccdg/vds/wgs_136k_recombine.vds/variant_data
    Total size: 22.68 TiB
    * Rows/entries: 22.68 TiB
    * Columns: 633.99 KiB
    * Globals: 11.00 B
    * Smallest partition: 0 rows (20.00 B)
    * Largest partition:  18048 rows (6.11 GiB)

WES Data

Hail: INFO: wrote matrix table with 268099170 rows and 203664 columns in 5439 partitions to gs://ccdg/vds/split_200k_ccdg_exome.vds/reference_data
    Total size: 1.96 TiB
    * Rows/entries: 1.96 TiB
    * Columns: 920.55 KiB
    * Globals: 11.00 B
    * Smallest partition: 1 rows (153.00 B)
    * Largest partition:  53478 rows (1.01 GiB)
Hail: INFO: wrote matrix table with 111590474 rows and 203664 columns in 5439 partitions to gs://ccdg/vds/split_200k_ccdg_exome.vds/variant_data
    Total size: 1.40 TiB
    * Rows/entries: 1.40 TiB
    * Columns: 920.55 KiB
    * Globals: 11.00 B
    * Smallest partition: 0 rows (20.00 B)
    * Largest partition:  36797 rows (11.03 GiB)

Reference data have the following schema:

vds.reference_data.describe()

----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
----------------------------------------
Row fields:
    'locus': locus<GRCh38>
    'ref_allele': str
----------------------------------------
Entry fields:
    'END': int32
    'DP': int32
    'GQ': int32
----------------------------------------
Column key: ['s']
Row key: ['locus']
----------------------------------------

Variant data have the following schema:

vds.variant_data.describe()

----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
----------------------------------------
Row fields:
    'locus': locus<GRCh38>
    'alleles': array<str>
    'rsid': str
----------------------------------------
Entry fields:
    'LA': array<int32>
    'LGT': call
    'LAD': array<int32>
    'LPGT': call
    'LPL': array<int32>
    'RGQ': int32
    'gvcf_info': struct {
          ...
    }
    'DP': int32
    'GQ': int32
    'MIN_DP': int32
    'PID': str
    'SB': array<int32>
----------------------------------------
Column key: ['s']
Row key: ['locus', 'alleles']
----------------------------------------

Variants Hard Filtering

GOAL: Select reliable variants for PCA
Overall: Restrict variants in CCDG + gnomAD genomes to those with a high callrate that fall in high quality intervals from both CCDG & UKBB exomes
- Filter CCDG genomes and exomes variants to:
  - Present in both exomes and genomes
  - Variants in autosomes
  - (bi-allelic/all ?) single nucleotide variants (SNVs) only
  - Variants in good capture platforms (high quality intervals in UKBB 455K exomes & CCDG exomes)
  - [Optional] Variants with a precomputed AC > 10 in gnomAD v3 genomes
- Densify variants in CCDG genomes and exomes
- Compute combined MAF and Callrate for CCDG and gnomAD v3 genomes, and Callrate for CCDG and UKBB exomes
  - Combined MAF > 0.1% (or lower?)
  - Combined Callrate > 99%
  - High Callrate in CCDG and UKBB exomes (cutoff TBD)
- LD Pruning
  - Select CCDG or gnomAD genomes to be the LD pruning dataset
  - LD pruning with a cutoff of r2 = 0.1
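
The MAF and callrate cutoffs above can be illustrated with a minimal pure-Python sketch. This is not the pipeline's Hail code. It assumes genotypes at a bi-allelic site are encoded as alternate-allele counts (0, 1, 2) with `None` for a missing call, and the function names are hypothetical. A "combined" metric is approximated here by concatenating the genotype lists of the datasets being combined.

```python
from typing import Optional, Sequence


def call_rate(genotypes: Sequence[Optional[int]]) -> float:
    """Fraction of samples with a non-missing genotype at one variant."""
    if not genotypes:
        return 0.0
    return sum(g is not None for g in genotypes) / len(genotypes)


def minor_allele_frequency(genotypes: Sequence[Optional[int]]) -> Optional[float]:
    """MAF over called diploid genotypes; None if nothing was called."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    alt_af = sum(called) / (2 * len(called))
    return min(alt_af, 1.0 - alt_af)


def passes_variant_filter(
    genotypes: Sequence[Optional[int]],
    min_maf: float = 0.001,       # "Combined MAF > 0.1%"
    min_call_rate: float = 0.99,  # "Combined Callrate > 99%"
) -> bool:
    """Apply the two combined cutoffs listed above to one variant."""
    maf = minor_allele_frequency(genotypes)
    return maf is not None and maf > min_maf and call_rate(genotypes) > min_call_rate


# Hypothetical toy data: one variant genotyped in two cohorts.
ccdg = [0] * 95 + [1] * 5
gnomad = [0] * 99 + [None]
print(passes_variant_filter(ccdg + gnomad))
```

The exome callrate cutoff is still marked TBD above, so it is left out of the sketch rather than guessed.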

Sample QC Metric Computation

GOAL: generate metrics for downstream sample QC
- hl.vds.sample_qc()

Sex Imputation (Before/After Interval QC for WES Data)

Interval QC (WES Data Only)

Samples Hard Filtering
- Low coverage
- High/low n_snp
- High n_singleton
- High r_het_hom_var
- Ambiguous sex
- Sex aneuploidy

Platform PCA

GOAL: Determine which platform samples are from

Relatedness Inference

GOAL: Remove related samples
- Approach 1:
  - hl.pc_relate()
  - hl.maximal_independent_set() to select a maximal independent set of samples
  - hl.hwe_normalized_pca()
  - Project the related samples onto the PCs of the unrelated samples
- Approach 2:
  - Project all samples to gnomAD
  - Compute relatedness
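
To make the "maximal independent set" step in Approach 1 concrete, here is a small pure-Python illustration of the idea. It is not Hail's implementation of `hl.maximal_independent_set()`. The function name is hypothetical, and the related pairs are taken as input because the kinship cutoff that defines "related" is not stated here. The sketch greedily drops the sample with the most remaining relatives, then re-adds any dropped sample that no longer conflicts with a kept one, so the result is an independent set that cannot be extended.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def unrelated_subset(samples: Iterable[str],
                     related_pairs: Iterable[Tuple[str, str]]) -> Set[str]:
    """Return a set of samples in which no two members form a related pair."""
    kept = set(samples)
    neighbors: Dict[str, Set[str]] = defaultdict(set)
    for a, b in related_pairs:
        if a != b and a in kept and b in kept:
            neighbors[a].add(b)
            neighbors[b].add(a)

    all_samples = set(kept)
    while kept:
        # Sample with the most relatives still kept; ties broken by name.
        worst = max(sorted(kept), key=lambda s: len(neighbors[s] & kept))
        if not neighbors[worst] & kept:
            break  # no related pairs remain
        kept.remove(worst)

    # Re-add dropped samples whose relatives were all dropped as well.
    for s in sorted(all_samples - kept):
        if not neighbors[s] & kept:
            kept.add(s)
    return kept


# Hypothetical toy data.
print(sorted(unrelated_subset("abcde", [("a", "b"), ("b", "c"), ("d", "e")])))
```

A greedy pass like this does not guarantee the largest possible unrelated set; it only guarantees that no related pair survives.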

Ancestry Inference

GOAL: Determine general ancestry of cohort
- Approach 1: using labels within data
- Approach 2: Joint call with 1KG and HGDP
- Approach 3: Project onto gnomAD PCA loadings.
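
Approach 3 can be sketched in plain Python as a dot product between a sample's normalized genotypes and precomputed loadings. The sketch rests on several assumptions: the loadings came from an HWE-normalized PCA (the normalization below follows the one documented for `hl.hwe_normalized_pca()`), reference-panel allele frequencies are available for the same variants, and missing genotypes contribute zero. The function name and inputs are hypothetical, not the pipeline's code.

```python
import math
from typing import List, Optional, Sequence


def project_sample(genotypes: Sequence[Optional[int]],
                   allele_freqs: Sequence[float],
                   loadings: Sequence[Sequence[float]]) -> List[float]:
    """Project one sample onto k PCs.

    genotypes:    alt-allele counts (0/1/2) per variant, None if missing
    allele_freqs: reference-panel alt-allele frequency per variant
    loadings:     per-variant list of k PC loadings
    """
    n_variants = len(genotypes)
    k = len(loadings[0])
    scores = [0.0] * k
    for g, p, load in zip(genotypes, allele_freqs, loadings):
        if g is None or p <= 0.0 or p >= 1.0:
            continue  # missing or monomorphic: contributes nothing
        # HWE normalization: center by 2p, scale by sqrt(m * 2p(1-p)).
        z = (g - 2.0 * p) / math.sqrt(n_variants * 2.0 * p * (1.0 - p))
        for i in range(k):
            scores[i] += z * load[i]
    return scores


# Hypothetical toy data: two variants, two PCs.
print(project_sample([2, 0], [0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]]))
```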

Outlier Sample Detection

GOAL: Remove outlier samples based on sample qc metrics stratified by population group
- n_snp, Ti/Tv, insertion/deletion ratio, het/hom ratio (DEFAULT: outlier if > 4 MAD from the median)
  - Ancestry: EUR/EAS/AFR/SAS, filter out within cohort outliers.
  - PCs: PC1 - PC10, filter out outliers on each PC.
  - Platform: check % filtered for each platform.
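
A minimal sketch of the stratified MAD filter follows, in plain Python rather than Hail. It assumes the MAD is the unscaled median absolute deviation and that a sample is an outlier when its metric lies more than `n_mads` MADs from the median of its own group. The function name and the toy values are hypothetical.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Set


def mad_outliers(values: Dict[str, float],
                 groups: Dict[str, str],
                 n_mads: float = 4.0) -> Set[str]:
    """Flag samples whose metric is > n_mads MADs from their group median."""
    by_group = defaultdict(list)
    for sample, value in values.items():
        by_group[groups[sample]].append(value)

    bounds = {}
    for group, vals in by_group.items():
        med = median(vals)
        mad = median(abs(v - med) for v in vals)
        bounds[group] = (med - n_mads * mad, med + n_mads * mad)

    return {
        sample
        for sample, value in values.items()
        if not bounds[groups[sample]][0] <= value <= bounds[groups[sample]][1]
    }


# Hypothetical n_snp-like metric for two ancestry groups.
values = {"a": 100, "b": 102, "c": 98, "d": 101, "e": 99, "f": 200,
          "g": 200, "h": 202, "i": 198, "j": 201, "k": 199}
groups = {s: "EUR" for s in "abcdef"}
groups.update({s: "AFR" for s in "ghijk"})
print(mad_outliers(values, groups))
```

In the toy data a value of 200 is flagged in one group but is typical in the other, which is the reason for stratifying before applying the cutoff.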

Variant QC

GOAL: Remove low quality/somatic variants

Genotype QC
- GQ >= 20
- DP >= 10
- 0.2 < AB < 0.8
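
The three genotype cutoffs can be written as a single predicate. The sketch below is plain Python, not the pipeline's Hail expressions, and makes two assumptions that are not stated above: allele balance (AB) is the alternate-allele read fraction computed from a two-element allelic depth array, and the AB bounds are applied to heterozygous calls only. The function names are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def allele_balance(ad: Sequence[int]) -> Optional[float]:
    """Alt-read fraction from [ref_depth, alt_depth]; None if no reads."""
    total = sum(ad)
    return ad[1] / total if total > 0 else None


def passes_genotype_qc(gq: Optional[int],
                       dp: Optional[int],
                       ad: Optional[Sequence[int]] = None,
                       is_het: bool = False,
                       min_gq: int = 20,
                       min_dp: int = 10,
                       ab_range: Tuple[float, float] = (0.2, 0.8)) -> bool:
    """GQ >= 20, DP >= 10, and (assumed: hets only) 0.2 < AB < 0.8."""
    if gq is None or dp is None:
        return False
    if gq < min_gq or dp < min_dp:
        return False
    if is_het:
        ab = allele_balance(ad) if ad else None
        if ab is None:
            return False
        low, high = ab_range
        return low < ab < high
    return True


# Hypothetical calls: a balanced het and a skewed het.
print(passes_genotype_qc(30, 20, [10, 10], is_het=True))
print(passes_genotype_qc(30, 20, [19, 1], is_het=True))
```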
