diff --git a/.gitignore b/.gitignore index 22197ad..a64d92f 100644 --- a/.gitignore +++ b/.gitignore @@ -1,6 +1,7 @@ # Created by .ignore support plugin (hsz.mobi) # PyCharm +.DS_Store .idea/ Polaris.wiki Polaris.archive.*.tar.bz2 -ena_submissions/* \ No newline at end of file +ena_submissions/* diff --git a/README.md b/README.md index 355d336..b329690 100644 --- a/README.md +++ b/README.md @@ -2,19 +2,19 @@

- +

- + HiSeqX data

-HiSeqX data +HiSeqX data

@@ -24,10 +24,8 @@ # Table of Contents - [Summary](#summary) -- [Latest variant call release VC1.0](#latest-variant-call-release--vc10) +- [Latest release](#latest-release) - [Download the VCF](#download-the-vcf) - - [AWS CLI](#aws-cli) - - [wget](#wget) - [Sequencing resources](#sequencing-resources) - [HiSeqX PCR-Free Data (Polaris 1)](#hiseqx-pcr-free-data-polaris-1) - [Associated resources](#associated-resources) @@ -44,68 +42,27 @@ ## Summary The Polaris project provides -* Population sequencing resources on high throughput Illumina sequencing - platforms -* Variant calls validated using population genetic and Mendelian methods - -The variant calls currently provided in Polaris are breakpoint-resolved deletion -and insertion structural variants (SVs). +* Population sequencing resources on high-throughput Illumina sequencing platforms +* Variant calls from multiple technologies, validated by population genetics and Mendelian methods Further details of the sequencing resources, input data sources, genotyping methods and validation methods can be found in the [project wiki][1.1]. -## Latest variant call release — VC1.0 - -Our latest variant call release set is [VC1.0][2.1]. This call set contains -**70,706** from a candidate set of **184,988** validated breakpoint-resolved SV calls. - -Candidates were identified from 4 sources: -* [Previously characterized Manta calls][2.2.1] -* [Platinum Genomes pedigree consistent events with unique population breakpoints][2.2.2] -* [Parliament insertions][2.2.3][2](#English2015) -* [Icelandic insertions identified with PopIns][2.2.4][3](#Kehr2017) +## Latest release -All candidates were jointly re-called using our -[breakpoint joint caller suite, `paragraph`][2.3]. 
+Our latest variant call release set is v1.5, containing breakpoint-resolved structural variants (SVs) from: -Validation consisted of: +- Manta deletion and insertion calls from NovaSeq sequenced NA12877 & NA12878 +- Refined Sniffles (v1.0.8) deletion and insertion calls from PacBio and Oxford Nanopore sequenced NA12878 (Sedlazeck et al. 2018) +- Copy number variants and large deletions curated from HiSeqX and NovaSeq sequenced population and Platinum Genome pedigree -* [Polaris 1 Diversity Panel][3.1.1.1] / [Polaris 1 PGx Panel][3.1.2.1] HWE - assessment -* [Platinum Genomes pedigree][2.4.1][1](#Eberle2017) pedigree - consistency check +In total, there are 13,120 PASS and 27,093 FAIL entries on **hg38**. hg19 calls were generated by liftover. -PASS calls were either -* Pedigree consistent -* Homozygous in pedigree and MAF > 0.05 + HWE p-value > 0.05 in Polaris panels - -Complete release notes for VC1.0 can be found [here][2.1]. +Complete release notes for v1.5 can be found at [release-notes/v1.5](release-notes/v1.5) ### Download the VCF -The VC1.0 VCF is available can be downloaded either using `AWS CLI` or `wget` -and can also be viewed in this [S3 bucket display][2.5.1]. Using `wget` is -currently the easier of the two command line options. 
- -#### `AWS CLI` - -Polaris datasets are stored in an AWS S3 bucket called `illumina-polaris`, and -can de downloaded using the [AWS CLI][2.5.2]: - -```bash -$: aws cp s3://illumina-polaris/vc1_0.vcf.gz -$: aws cp s3://illumina-polaris/vc1_0.vcf.gz.tbi -``` - -#### `wget` - -If you don't have AWS credentials, you can use `wget` or a similar tool to -download VC1.0: - -```bash -$: wget https://s3.amazonaws.com/illumina-polaris/vc1_0.vcf.gz -$: wget https://s3.amazonaws.com/illumina-polaris/vc1_0.vcf.gz.tbi -``` +VCFs for hg38 and hg19 are available at [release-data/v1.5](release-data/v1.5) ## Sequencing resources diff --git a/release-data/v1.5/hg19/polaris.v1.5.hg19.vcf.gz b/release-data/v1.5/hg19/polaris.v1.5.hg19.vcf.gz new file mode 100644 index 0000000..2917fa3 Binary files /dev/null and b/release-data/v1.5/hg19/polaris.v1.5.hg19.vcf.gz differ diff --git a/release-data/v1.5/hg19/polaris.v1.5.hg19.vcf.gz.tbi b/release-data/v1.5/hg19/polaris.v1.5.hg19.vcf.gz.tbi new file mode 100644 index 0000000..b996447 Binary files /dev/null and b/release-data/v1.5/hg19/polaris.v1.5.hg19.vcf.gz.tbi differ diff --git a/release-data/v1.5/hg38/polaris.v1.5.hg38.vcf.gz b/release-data/v1.5/hg38/polaris.v1.5.hg38.vcf.gz new file mode 100644 index 0000000..4ac1175 Binary files /dev/null and b/release-data/v1.5/hg38/polaris.v1.5.hg38.vcf.gz differ diff --git a/release-data/v1.5/hg38/polaris.v1.5.hg38.vcf.gz.tbi b/release-data/v1.5/hg38/polaris.v1.5.hg38.vcf.gz.tbi new file mode 100644 index 0000000..5cd0902 Binary files /dev/null and b/release-data/v1.5/hg38/polaris.v1.5.hg38.vcf.gz.tbi differ diff --git a/release-notes/images/vc1_0/hwe_pvalues.png b/release-notes/v1.0/images/vc1_0/hwe_pvalues.png similarity index 100% rename from release-notes/images/vc1_0/hwe_pvalues.png rename to release-notes/v1.0/images/vc1_0/hwe_pvalues.png diff --git a/release-notes/images/vc1_0/hwe_ternary_plot.png b/release-notes/v1.0/images/vc1_0/hwe_ternary_plot.png similarity index 100% rename from 
release-notes/images/vc1_0/hwe_ternary_plot.png
rename to release-notes/v1.0/images/vc1_0/hwe_ternary_plot.png
diff --git a/release-notes/images/vc1_0/minor_allele_frequency.png b/release-notes/v1.0/images/vc1_0/minor_allele_frequency.png
similarity index 100%
rename from release-notes/images/vc1_0/minor_allele_frequency.png
rename to release-notes/v1.0/images/vc1_0/minor_allele_frequency.png
diff --git a/release-notes/images/vc1_0/pedigree_hamming_distance.png b/release-notes/v1.0/images/vc1_0/pedigree_hamming_distance.png
similarity index 100%
rename from release-notes/images/vc1_0/pedigree_hamming_distance.png
rename to release-notes/v1.0/images/vc1_0/pedigree_hamming_distance.png
diff --git a/release-notes/images/vc1_0/pedigree_hamming_distance_by_type.png b/release-notes/v1.0/images/vc1_0/pedigree_hamming_distance_by_type.png
similarity index 100%
rename from release-notes/images/vc1_0/pedigree_hamming_distance_by_type.png
rename to release-notes/v1.0/images/vc1_0/pedigree_hamming_distance_by_type.png
diff --git a/release-notes/vc1_0.md b/release-notes/v1.0/vc1_0.md
similarity index 100%
rename from release-notes/vc1_0.md
rename to release-notes/v1.0/vc1_0.md
diff --git a/release-notes/v1.5/README.md b/release-notes/v1.5/README.md
new file mode 100644
index 0000000..ff490dc
--- /dev/null
+++ b/release-notes/v1.5/README.md
@@ -0,0 +1,98 @@
+# Cascadia v0.6 Release Notes (June 2018)
+
+# Table of Contents
+- [Overview](#overview)
+- [Dataset Summary](#dataset-summary)
+- [Validation Scheme](#validation-scheme)
+- [Merging Scheme](#merging-scheme)
+
+## Overview
+
+A **hg38** truth set of simple deletions and insertions built from:
+
+- Manta deletion and insertion calls from NovaSeq NA12877 & NA12878 (NSV4 pipeline on hg38)
+  - This part is the same as in v0.5 (except for a minor VCF format fix)
+
+- Refined Sniffles (v1.0.8) deletion and insertion calls from PacBio NA12878 on hg38
+  - Insertion sequence was assembled and refined from
PacBio + ONT reads
+
+- Copy number variants and large deletions curated from population and Platinum Genome pedigree on *hg19*, coming from deletion calls of Manta/Canvas and Sniffles.
+  - Please see Xiao's CNV truth set repository for details:
+  - https://git.illumina.com/xchen2/CNVTruthSet
+
+The hg19 truth set is generated by lift-over of the hg38 truth set.
+
+12,374 (98%) passed entries and 24,405 (88%) failed entries were successfully converted to hg19.
+
+The release VCF contains genotypes of NA12877 and NA12878 (re-genotyped with our targeted graph genotyper *Paragraph*).
+
+Illumina cluster users can find a VCF containing the full Platinum Genome pedigree genotypes on the Illumina cluster.
+
+### Unvalidated variants
+
+In addition, VCFs named with "all_merge.include_unvalidated" contain unvalidated variants from Sniffles v1.0.8 calls. These are:
+
+- All calls on chrX, chrY and mitochondria
+
+- Inversions, duplications and translocations
+
+These unvalidated variants are labeled as *UNVALIDATED* in their filter fields.
+
+We have not yet established a robust pipeline for validating these variants; we expect to validate them in a future release.
+
+### Data format
+
+Variants that pass our pedigree and population checks are labeled as *PASS* in their filter fields.
+
+Variants that fail any filter are labeled with the specific filter name(s) in their filter fields.
+
+The *SOURCE* key in the *INFO* field indicates where the variant originally comes from. Unvalidated variants do not have a *SOURCE* key because they all come from Sniffles.
+
+## Dataset Summary
+
+### Merged variants partitioned by SVLEN and type
+
+| SV Type               |    INS    |    INS    |    DEL    |    DEL    |    CNV    |    CNV    |
+|:----------------------|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|
+| Filter                |   PASS    |   FAIL    |   PASS    |   FAIL    |   PASS    |   FAIL    |
+| _L_ \< 50 bp          |    905    |   1,934   |   1,178   |   1,390   |     0     |     0     |
+| 50 \< _L_ \< 100 bp   |   1,503   |   3,571   |   1,939   |   3,553   |     0     |     0     |
+| 100 \< _L_ \< 1kb     |   2,240   |   4,785   |   2,389   |   4,417   |     0     |     0     |
+| 1kb \< _L_ \< 10kb    |    18     |    155    |   1,978   |   2,930   |    105    |    139    |
+| _L_ > 10kb            |     0     |     0     |    337    |   1,065   |    89     |    219    |
+| __Total__             | __4,666__ | __10,445__| __7,821__ | __13,355__|  __194__  |  __358__  |
+
+### **Non-reference** calls in NA12878 partitioned by SVLEN and type
+
+| SV Type               |    INS    |    DEL    |    CNV    |
+|:----------------------|:---------:|:---------:|:---------:|
+| _L_ \< 50 bp          |    878    |   1,169   |     0     |
+| 50 \< _L_ \< 100 bp   |   1,274   |   1,521   |     0     |
+| 100 \< _L_ \< 1kb     |   1,958   |   1,926   |     0     |
+| 1kb \< _L_ \< 10kb    |    17     |    626    |    19     |
+| _L_ > 10kb            |     0     |    140    |    11     |
+| __Total__             | __4,127__ | __5,382__ |  __30__   |
+
+
+
+### Merged variants partitioned by SVLEN, type and source
+
+| SV Type               | INS           | INS           | DEL           | DEL           |
+|:----------------------|:--------------|:--------------|:--------------|:--------------|
+| Source                | Manta PASS    | Sniffles PASS | Manta PASS    | Sniffles PASS |
+| _L_ \< 50 bp          | 6 (24%)       | 901 (32%)     | 7 (100%)      | 1,178 (46%)   |
+| 50 \< _L_ \< 100 bp   | 1,151 (47%)   | 620 (22%)     | 1,598 (44%)   | 726 (32%)     |
+| 100 \< _L_ \< 1kb     | 1,451 (55%)   | 1,189 (25%)   | 2,256 (45%)   | 1,492 (47%)   |
+| 1kb \< _L_ \< 10kb    | 1 (100%)      | 18 (10%)      | 736 (29%)     | 267 (66%)     |
+| _L_ > 10kb            | 0 (n/a)       | 0 (n/a)       | 156 (17%)     | 0 (n/a)       |
+| __Total__             |__2,606 (51%)__|__2,728 (26%)__|__4,753 (39%)__|__3,663 (43%)__|
+
+With refined Sniffles calls as input, v0.6 has a significantly improved validation rate for Sniffles calls: 2,604 validated insertions (a validation rate of 26%), compared with 589 validated insertions (5%) in v0.5.
+
+## Merging Scheme
+
+For simple SVs validated by Paragraph, we follow the same merging scheme for deletions as [v0.5](../v0.5/README.md).
+
+For CNVs, we do not try to merge them with simple SVs.
+
+For large deletions labeled as PASS, we prioritize Paragraph-validated ones over Xiao's CNV truth set.
\ No newline at end of file
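
The filter conventions described in the v1.5 release notes (*PASS*, named filters, *UNVALIDATED*, and the *SOURCE* key in *INFO*) can be spot-checked on the released VCFs with a short script. The following is a minimal sketch and not part of this change. It assumes that the source is written in INFO as `SOURCE=<value>`, which the release notes do not spell out, and the function names `tally_records` and `tally_vcf` are hypothetical.

```python
import gzip
from collections import Counter


def tally_records(lines):
    """Count VCF records by filter class and by INFO/SOURCE value."""
    filters = Counter()
    sources = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        filt, info = fields[6], fields[7]
        if filt in ("PASS", "UNVALIDATED"):
            filters[filt] += 1
        else:
            filters["FAIL"] += 1  # any named filter counts as a failure
        # Assumption: SOURCE is written as SOURCE=<value> in INFO.
        source = "none"
        for entry in info.split(";"):
            if entry.startswith("SOURCE="):
                source = entry[len("SOURCE="):]
                break
        sources[source] += 1
    return filters, sources


def tally_vcf(path):
    """Open a plain or gzip-compressed VCF and tally its records."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return tally_records(handle)
```

Calling `tally_vcf("release-data/v1.5/hg38/polaris.v1.5.hg38.vcf.gz")` from the repository root would give counts to compare against the PASS and FAIL totals quoted in the README. Records without a SOURCE key are grouped under `"none"`, which is where the unvalidated Sniffles calls would be expected to land.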