CASC-264 add v1.5 release data
Sai Chen committed Aug 22, 2018
1 parent d664550 commit 6d40c6a
Showing 13 changed files with 114 additions and 58 deletions.
3 changes: 2 additions & 1 deletion .gitignore
@@ -1,6 +1,7 @@
# Created by .ignore support plugin (hsz.mobi)
# PyCharm
.DS_Store
.idea/
Polaris.wiki
Polaris.archive.*.tar.bz2
ena_submissions/*
ena_submissions/*
71 changes: 14 additions & 57 deletions README.md
@@ -2,19 +2,19 @@

<p align="center">
<a href="../../wiki/Sample-Information">
<img src="https://img.shields.io/badge/Total%20samples%20sequenced-220-6d73f3.svg" height="30">
<img src="https://img.shields.io/badge/Total%20samples%20sequenced-150-6d73f3.svg" height="30">
</a>
</p>

<p align="center">
<a href="../../wiki/Sample-Information#hiseqx-data-polaris-1">
<a href="../../wiki/Sample-Information#hiseqx-data-polaris">
<img src="https://img.shields.io/badge/HiSeqX%20data-Polaris%201-ed9d2d.svg" alt="HiSeqX data">
</a>
</p>

<p align="center">
<a href="release-notes/vc1_0.md#get-the-data">
<img src="https://img.shields.io/badge/Latest%20variant%20calls-VC1.0-8a6183.svg" alt="HiSeqX data">
<img src="https://img.shields.io/badge/Latest%20variant%20calls-v1.5-8a6183.svg" alt="HiSeqX data">
</a>
</p>

@@ -24,10 +24,8 @@

# Table of Contents
- [Summary](#summary)
- [Latest variant call release VC1.0](#latest-variant-call-release--vc10)
- [Latest release](#latest-release)
- [Download the VCF](#download-the-vcf)
- [AWS CLI](#aws-cli)
- [wget](#wget)
- [Sequencing resources](#sequencing-resources)
- [HiSeqX PCR-Free Data (Polaris 1)](#hiseqx-pcr-free-data-polaris-1)
- [Associated resources](#associated-resources)
@@ -44,68 +42,27 @@
## Summary

The Polaris project provides
* Population sequencing resources on high throughput Illumina sequencing
platforms
* Variant calls validated using population genetic and Mendelian methods

The variant calls currently provided in Polaris are breakpoint-resolved deletion
and insertion structural variants (SVs).
* Population sequencing resources on high throughput Illumina sequencing platforms
* Variant calls from multiple technologies, validated by population genetic and Mendelian methods

Further details of the sequencing resources, input data sources, genotyping
methods and validation methods can be found in the [project wiki][1.1].

## Latest variant call release &mdash; VC1.0

Our latest variant call release set is [VC1.0][2.1]. This call set contains
**70,706** validated breakpoint-resolved SV calls from a candidate set of **184,988**.

Candidates were identified from 4 sources:
* [Previously characterized Manta calls][2.2.1]
* [Platinum Genomes pedigree consistent events with unique population breakpoints][2.2.2]
* [Parliament insertions][2.2.3]<sup>[2](#English2015)</sup>
* [Icelandic insertions identified with PopIns][2.2.4]<sup>[3](#Kehr2017)</sup>
## Latest release

All candidates were jointly re-called using our
[breakpoint joint caller suite, `paragraph`][2.3].
Our latest variant call release set is v1.5, containing breakpoint-resolved structural variants (SVs) from:

Validation consisted of:
- Manta deletion and insertion calls from NovaSeq sequenced NA12877 & NA12878
- Refined Sniffles (v1.0.8) deletion and insertion calls from PacBio and Oxford Nanopore sequenced NA12878 (Sedlazeck et al. 2018)
- Copy number variants and large deletions curated from the HiSeqX and NovaSeq sequenced population and Platinum Genomes pedigree

* [Polaris 1 Diversity Panel][3.1.1.1] / [Polaris 1 PGx Panel][3.1.2.1] HWE
assessment
* [Platinum Genomes pedigree][2.4.1]<sup>[1](#Eberle2017)</sup> pedigree
consistency check
In total, there are 13,120 PASS and 27,093 FAIL entries on **hg38**. hg19 calls were generated by liftover.

PASS calls were either
* Pedigree consistent
* Homozygous in pedigree and MAF > 0.05 + HWE p-value > 0.05 in Polaris panels

Complete release notes for VC1.0 can be found [here][2.1].
Complete release notes for v1.5 can be found at [release-notes/v1.5](release-notes/v1.5).

### Download the VCF

The VC1.0 VCF can be downloaded using either `AWS CLI` or `wget`
and can also be viewed in this [S3 bucket display][2.5.1]. Using `wget` is
currently the easier of the two command line options.

#### `AWS CLI`

Polaris datasets are stored in an AWS S3 bucket called `illumina-polaris`, and
can be downloaded using the [AWS CLI][2.5.2]:

```bash
$: aws s3 cp s3://illumina-polaris/vc1_0.vcf.gz .
$: aws s3 cp s3://illumina-polaris/vc1_0.vcf.gz.tbi .
```

#### `wget`

If you don't have AWS credentials, you can use `wget` or a similar tool to
download VC1.0:

```bash
$: wget https://s3.amazonaws.com/illumina-polaris/vc1_0.vcf.gz
$: wget https://s3.amazonaws.com/illumina-polaris/vc1_0.vcf.gz.tbi
```
VCFs for hg38 and hg19 are available at [release-data/v1.5](release-data/v1.5)
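
As a rough sanity check after downloading, the PASS and FAIL counts quoted above can be recomputed from a VCF. The sketch below is not part of the release. It uses only the Python standard library and assumes the standard VCF layout, in which FILTER is the seventh tab-separated column. The function name `count_pass_fail` is hypothetical.

```python
import gzip


def count_pass_fail(vcf_path):
    """Count records whose FILTER column is PASS versus anything else.

    Assumes a gzip- or bgzip-compressed VCF with standard column order.
    """
    n_pass = 0
    n_other = 0
    # bgzip output is a series of valid gzip members, so gzip.open can read it.
    with gzip.open(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip meta-information and header lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 7:
                continue  # not a well-formed VCF record
            if fields[6] == "PASS":
                n_pass += 1
            else:
                n_other += 1
    return n_pass, n_other
```

For example, `count_pass_fail("release-data/v1.5/hg38/polaris.v1.5.hg38.vcf.gz")` would return a `(pass, other)` tuple to compare against the totals stated above. Every non-PASS FILTER value is counted in the second number.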

## Sequencing resources

Binary file added release-data/v1.5/hg19/polaris.v1.5.hg19.vcf.gz
Binary file not shown.
Binary file not shown.
Binary file added release-data/v1.5/hg38/polaris.v1.5.hg38.vcf.gz
Binary file not shown.
Binary file not shown.
File renamed without changes
File renamed without changes.
98 changes: 98 additions & 0 deletions release-notes/v1.5/README.md
@@ -0,0 +1,98 @@
# Cascadia v0.6 Release Notes (June 2018)

# Table of Contents
- [Overview](#overview)
- [Dataset Summary](#dataset-summary)
- [Validation Scheme](#validation-scheme)
- [Merging Scheme](#merging-scheme)

## Overview

An **hg38** truth set of simple deletions and insertions built from:

- Manta deletion and insertion calls from NovaSeq NA12877 & NA12878 (NSV4 pipeline on hg38)
  - This part is the same as in v0.5 (except for a minor VCF format fix)

- Refined Sniffles (v1.0.8) deletion and insertion calls from PacBio NA12878 on hg38
  - Insertion sequences were assembled and refined from PacBio + ONT reads

- Copy number variants and large deletions curated from the population and Platinum Genomes pedigree on *hg19*, coming from deletion calls of Manta/Canvas and Sniffles
  - Please see Xiao's CNV truth set repository for details: https://git.illumina.com/xchen2/CNVTruthSet

The hg19 truth set is generated by lift-over of the hg38 truth set.

12,374 (98%) passed entries and 24,405 (88%) failed entries were successfully converted to hg19.

The release VCF contains genotypes of NA12877 and NA12878 (re-genotyped with our targeted graph genotyper *Paragraph*).

Illumina cluster users can find a VCF containing full Platinum Genomes pedigree genotypes on the Illumina cluster.

### Unvalidated variants

In addition, VCFs named with "all_merge.include_unvalidated" contain unvalidated variants from Sniffles v1.0.8 calls. These are:

- All calls on chrX, chrY and mitochondria

- Inversions, duplications and translocations

These unvalidated variants are labeled as *UNVALIDATED* in their filter fields.

We have not yet established a robust pipeline for validating these variants; we plan to validate them in a future release.

### Data format

Variants that pass our pedigree and population checks are labeled as *PASS* in their filter fields.

Variants that fail any filter are labeled with the specific filter name(s) in their filter fields.

The *SOURCE* key in the *INFO* field indicates where the variant originally comes from. Unvalidated variants do not have a *SOURCE* key because they all come from Sniffles.
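
The labeling described above can be read back from a record with a few lines of Python. This is a minimal sketch, assuming the standard VCF column order (FILTER in column 7, INFO in column 8) and semicolon-separated filter names and INFO entries. The function name `parse_record` and the example value `Manta` for *SOURCE* are illustrative assumptions, not taken from the release files.

```python
def parse_record(line):
    """Return (filters, source) for one tab-separated VCF data line."""
    fields = line.rstrip("\n").split("\t")
    # Column 7 (index 6) is FILTER; several failed filters are
    # separated by semicolons in standard VCF.
    filters = fields[6].split(";")
    # Column 8 (index 7) is INFO: semicolon-separated KEY=VALUE pairs or flags.
    info = {}
    for item in fields[7].split(";"):
        key, _, value = item.partition("=")
        info[key] = value
    # SOURCE is absent for unvalidated variants, so fall back to None.
    return filters, info.get("SOURCE")
```

A validated record such as `"chr1\t100\t.\tA\t<DEL>\t.\tPASS\tSVTYPE=DEL;SOURCE=Manta"` yields `(["PASS"], "Manta")`, while a record without a *SOURCE* key yields `None` as the second element.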

## Dataset Summary

### Merged variants partitioned by SVLEN and type

| SV Type | INS | INS | DEL | DEL | CNV | CNV |
|:----------------------|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|
| Filter | PASS | FAIL | PASS | FAIL | PASS | FAIL |
| _L_ \< 50 bp | 905 | 1,934 | 1,178 | 1,390 | 0 | 0 |
| 50 \< _L_ \< 100 bp | 1,503 | 3,571 | 1,939 | 3,553 | 0 | 0 |
| 100 \< _L_ \< 1kb | 2,240 | 4,785 | 2,389 | 4,417 | 0 | 0 |
| 1kb \< _L_ \< 10kb | 18 | 155 | 1,978 | 2,930 | 105 | 139 |
| _L_ > 10kb | 0 | 0 | 337 | 1,065 | 89 | 219 |
| __Total__ | __4,666__ | __10,445__| __7,821__ | __13,355__| __194__ | __358__ |
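
The totals in the table above give a per-type PASS fraction. The snippet below is only arithmetic on the "Total" row. Treating PASS / (PASS + FAIL) as the quantity of interest is an assumption made here for illustration, and the names `totals` and `pass_fraction` are hypothetical.

```python
# Totals copied from the "Total" row of the table above.
totals = {
    "INS": {"PASS": 4666, "FAIL": 10445},
    "DEL": {"PASS": 7821, "FAIL": 13355},
    "CNV": {"PASS": 194, "FAIL": 358},
}


def pass_fraction(counts):
    """Fraction of merged entries labeled PASS for one SV type."""
    return counts["PASS"] / (counts["PASS"] + counts["FAIL"])


for sv_type, counts in totals.items():
    print(f"{sv_type}: {pass_fraction(counts):.1%}")
```
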

### **Non-reference** calls in NA12878 partitioned by SVLEN and type

| SV Type | INS | DEL | CNV |
|:----------------------|:---------:|:---------:|:---------:|
| _L_ \< 50 bp | 878 | 1,169 | 0 |
| 50 \< _L_ \< 100 bp | 1,274 | 1,521 | 0 |
| 100 \< _L_ \< 1kb | 1,958 | 1,926 | 0 |
| 1kb \< _L_ \< 10kb | 17 | 626 | 19 |
| _L_ > 10kb | 0 | 140 | 11 |
| __Total__ | __4,127__ | __5,382__ | __30__ |
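
The size partition used in these tables can be sketched as a small binning function. The tables write the bins with strict inequalities, so the handling of exact boundary values (50, 100, 1,000 and 10,000 bp) is not specified; the half-open bins below are an assumption, as are the names `size_bin`, `EDGES` and `LABELS`.

```python
from bisect import bisect_right

# Bin edges in bp. Assumption: each bin includes its lower edge.
EDGES = [50, 100, 1000, 10000]
LABELS = ["<50 bp", "50-100 bp", "100 bp-1 kb", "1-10 kb", ">=10 kb"]


def size_bin(svlen):
    """Map an SVLEN value to one of the five size bins."""
    # Deletions are commonly stored with a negative SVLEN, so use the magnitude.
    length = abs(svlen)
    return LABELS[bisect_right(EDGES, length)]
```

Counting variants per `(SVTYPE, size_bin(SVLEN))` pair would then reproduce the layout of the tables, though boundary cases may be assigned differently than in the release.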



### Merged variants partitioned by SVLEN, type and source

| SV Type | INS | INS | DEL | DEL |
|:----------------------|:--------------|:--------------|:--------------|:--------------|
| Source | Manta PASS | Sniffles PASS | Manta PASS | Sniffles PASS |
| _L_ \< 50 bp | 6 (24%) | 901 (32%) | 7 (100%) | 1,178 (46%) |
| 50 \< _L_ \< 100 bp | 1,151 (47%) | 620 (22%) | 1,598 (44%) | 726 (32%) |
| 100 \< _L_ \< 1kb | 1,451 (55%) | 1,189 (25%) | 2,256 (45%) | 1,492 (47%) |
| 1kb \< _L_ \< 10kb | 1 (100%) | 18 (10%) | 736 (29%) | 267 (66%) |
| _L_ > 10kb | 0 (n/a) | 0 (n/a) | 156 (17%) | 0 (n/a) |
| __Total__ |__2,606 (51%)__|__2,728 (26%)__|__4,753 (39%)__|__3,663 (43%)__|

Using refined Sniffles calls as input, v0.6 has a significantly improved validation rate for Sniffles calls, with 2,604 validated insertions and a validation rate of 26%, compared to v0.5 (589 validated insertions and a validation rate of 5%).

## Merging Scheme

For simple SVs validated by Paragraph, we followed the same merging scheme for deletions as [v0.5](../v0.5/README.md).

For CNVs, we do not try to merge them with simple SVs.

For large deletions labeled as PASS, we prioritize Paragraph-validated ones over those from Xiao's CNV truth set.
