"None" genotypes with GATK4 best practices BAM/CRAMs #4

mgonzalezporta · 2020-10-28T04:06:07Z

We intend to run Cryrius on a cohort of 10K Asian genomes which have been analysed with the GATK4 best practices pipeline. Whilst running some tests, I've noticed that I consistently get a genotype of None in the outputs.

In parallel, I've also tried to launch Cyrius on one of the replicates of NA12878 available in BaseSpace, analysed with Dragen 3.2.8, and observe the same behaviour.

In both cases, alignment has been done against GRCh38 with alt contigs (using the fasta from the GATK4 bundle and hg38-altaware in Dragen, respectively). Seeing this previous issue (#1), I wonder if it could be the alt-aware alignment what's leading to no-call. Would you have any thoughts?

I'm attaching the json output for the two test samples mentioned above for additional info.
output.zip

Many thanks

The text was updated successfully, but these errors were encountered:

xiao-chen-xc · 2020-10-28T19:29:31Z

Hi Mar, I looked at your files and each sample has a different problem.

The NA12878 sample has only 18X coverage, so that's not enough for getting accurate results. We have run Cyrius before in a NA12878 BAM from NYGC (as part of 1000G) realigned with DRAGEN (available here: http://1000genomes-dragen.s3.amazonaws.com/) and it worked fine in that BAM. Maybe the BAM file you used had been downsampled?

The first sample has sufficient genome-wide coverage, but the CYP2D6/CYP2D7 region does not. If you view the BAM in IGV you will likely see that this region gets a lot fewer reads than the rest of the genome. I'm guessing that this might have something to do with your aligner or processing steps after alignment. Could you share a bit about how you got your BAM? Is it similar to the NYGC 1000G pipeline described here http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/20190405_NYGC_b38_pipeline_description.pdf ? Are there any other steps that may remove (low mapping quality) reads from the BAM? Are any of the samples you are using publicly available? If yes could you point me to one or share one with me (just the CYP2D6/D7 region would be fine) and it would be easier for me to see what is the problem with the BAM.
Alt-aware alignment should not cause problem to Cyrius, as long as alignments to the primary assembly take precedence over alternative contigs. We have run Cyrius on BWA and DRAGEN bams with alt handling before and they all work fine. Also note that this problem is different from the #1 issue you mentioned because in that case though most of the reads are assigned to mapQ zero by the tool postalt, they are still kept there in the BAM so Cyrius is able to call the copy number of the CYP2D6/CYP2D7 region. And in your case, a good portion of the reads are missing from this region in your BAM so even the copy number calling step fails.

mgonzalezporta · 2020-10-29T14:41:48Z

Thanks for your quick response Xiao.

Regarding NA12878, indeed it was a downsampled analysis. I've re-tried with a 30X replicate and I do get a PASS genotype. Perhaps it may be worth adding a warning or fail flag for cases like this, indicating that the reason for a no-call is low coverage?

As for WHB4204, I cannot see anything obvious with respect to the alignment workflow. Similar to the NYGC 1000G pipeline, it covers alignment with bwa-mem (per read group), bam merging, duplicate marking with picard, BQSR recalibration and conversion to CRAM. I'm attaching the specific commands for reference - see alignment_workflow.md.

I've also tried a few follow up analyses:

Extending the test set to include only 30X data (most of our cohort is done at 15X and that could explain a subset of results). I've also included a mix of PCR-Free and PCR-based builds. The genotype call is None in all the 100 samples considered.
Running Cyrius on a publicly available HG002 replicate (SRR8861483, 40X), that's been processed with our analysis pipeline. The genotype call here is also None. You can access the cram file and index here:
https://cyrius-troubleshooting.s3-ap-southeast-1.amazonaws.com/SRR8861483/SRR8861483.bqsr.cram
https://cyrius-troubleshooting.s3-ap-southeast-1.amazonaws.com/SRR8861483/SRR8861483.bqsr.cram.crai
Inspecting the CYP2D6/CYP2D7 region in IGV for 3 samples: (i) our original sample (WHB4204, 30X), (ii) the HG002 replicate analysed with our pipeline (SRR8861483, 40X), and (iii) the NA12878 replicate analysed with Dragen (NA12878, 30X). The first two show clear aggregation of reads with mapq0 (with quite uneven coverage for HG002) - see igv_snapshot.png.

Altogether, these point to our analysis workflow as the reason for the no-call genotypes, even though we do retain mapq0 reads. Would you have any further thoughts?

attachment.zip

xiao-chen-xc · 2020-10-30T04:49:35Z

Hi Mar, the screenshots suggest that reads are missing from the CYP2D6/CYP2D7 region and I agree that it has something to do with your analysis workflow.

I downloaded the original bam submitted to SRA for SRR8861483 and in that bam the coverage of the D6/D7 region looks normal. I extracted mapped reads to D6/D7 region from that bam and compared with mapped reads to D6/D7 region in your cram. The attached fastq files contain ~5000 reads that belong to D6/D7 region but are missing in your cram. I aligned these reads using bwa v0.7.17-r1188 and they all align correctly.

Perhaps you could try processing this small read set through your pipeline and keep intermediate bams from each step. Hopefully you will be able to see during which step those reads are tossed. Or if you have intermediate bams available for some other samples, you could check the coverage of the D6/D7 region in each bam and see if it starts to lose reads at a particular step.

hg2_cram_missing_read1.fq.gz
hg2_cram_missing_read2.fq.gz

xiao-chen-xc · 2020-10-30T05:09:35Z

I tried to look for those missing reads elsewhere in your cram. It took me forever go through the entire cram but it seems that they are not there at all, not even marked as unaligned. This is a bit confusing to me. Shouldn't the cram contain all reads from a sequencing run? Is it possible that they are somehow tossed even before alignment?

mgonzalezporta · 2020-10-30T11:37:37Z

Much appreciated Xiao. Allow me some time to follow up internally and get the big picture. Will keep you posted.

…

On Fri, 30 Oct 2020, 13:09 Xiao Chen, ***@***.***> wrote: I tried to look for those missing reads elsewhere in your cram. It took me forever go through the entire cram but it seems that they are not there at all, not even marked as unaligned. This is a bit confusing to me. Shouldn't the cram contain all reads from a sequencing run? Is it possible that they are somehow tossed even before alignment? — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#4 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAKWEK5LHAQPYP3SW3SX3TTSNJDBZANCNFSM4TBYRQ6Q> .

xiao-chen-xc · 2020-10-30T16:17:37Z

Sounds good. I want to make a correction to my reply yesterday. Today I was able to find those missing reads in the cram (not sure what happened yesterday...so please ignore my last comment). They align to alternative contigs like chr22_KI270928v1_alt. But when I aligned them using BWA v0.7.17-r1188 they aligned to chr22. It'd be interesting to find out during which step in your pipeline those reads are moved from chr22 to alternative contigs.

mgonzalezporta · 2020-11-03T13:56:10Z

Hi Xiao,

I've run our alignment workflow step by step and tried a few combinations, and it seems that the choice of reference genome is what's driving the differences. Taking your fastq files as an example, here a summary of the relevant tests:

# commands used (bwa v0.7.17-r1188)
$ seqtk mergepe $fq1 $fq2 > merged.fq && gzip merged.fq
$ bwa mem -R "$readgroup" -p $REF merged.fq.gz \
    | samtools fixmate - - \
    | samtools sort -o merged.bam \
    && samtools index merged.bam

# TEST 1: REF = GRCh38 from GATK resource bundle (Homo_sapiens_assembly38.fasta)
# downloaded from https://console.cloud.google.com/storage/browser/gcp-public-data--broad-references/hg38/v0
$ samtools idxstats merged.bam | awk '$3!=0' | grep '^chr22'
chr22   50818468        1964    2
chr22_GL383582v2_alt    162811  1308    3
chr22_KB663609v1_alt    74013   2872    3
chr22_KI270928v1_alt    176103  4075    5

# TEST 2: REF = GRCh38 from 1000G (GRCh38_full_analysis_set_plus_decoy_hla.fa)
# downloaded from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome
$ samtools idxstats merged.bam | awk '$3!=0' | grep '^chr22'
chr22   50818468        10200   13
chr22_GL383582v2_alt    162811  33      0
chr22_KB663609v1_alt    74013   1148    0
chr22_KI270928v1_alt    176103  2231    0

# TEST 3: REF = hg19 from GATK resource bundle (human_g1k_v37_decoy.fasta)
# downloaded a while ago from FTP site (GATK bundle 2.8 > b37)
# now available from https://console.cloud.google.com/storage/browser/gcp-public-data--broad-references/hg19/v0
$ samtools idxstats merged.bam | awk '$3!=0' | grep '^22'
22      51304566        10242   0

Note the difference in the number reads that map to chr22 / alt contigs for test 1 vs. test 2. Not sure what may be happening here, as I've checked the MD5 checksum of these contigs and they are the same (based on dict files). Could you have a look at the GRCh38 reference from the GATK resource bundle? Happy to investigate other suggestions.

xiao-chen-xc · 2020-11-03T17:57:56Z

Hi Mar, I think BWA needs a idxbase.alt file to perform alt-aware alignment. In the GATK resource bundle link, there is only a Homo_sapiens_assembly38.fasta.64.alt. If you rename it to Homo_sapiens_assembly38.fasta.alt, BWA will be able to recognize it. I tried it and all reads aligned to chr22.

mgonzalezporta · 2020-11-04T09:15:21Z

Thanks Xiao, this is rather unfortunate...

I've confirmed that renaming the file does rescue reads to chr22 in the subsetted fastqs. I'm also re-aligning the HG002 replicates and a subset of our 30X genomes using the updated bundle, and will re-run Cyrius when ready. I'll keep you posted on the outcome (it may take a while as the analysis will run on a rather busy SGE cluster).

mgonzalezporta · 2020-11-06T10:15:19Z

Hi Xiao, confirmed that Cyrius is able to return genotype calls on our data after re-running alignment as discussed. Many thanks for your help in troubleshooting this!

xiao-chen-xc · 2020-11-07T02:43:17Z

That's great! No problem.

mgonzalezporta closed this as completed Dec 1, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

"None" genotypes with GATK4 best practices BAM/CRAMs #4

"None" genotypes with GATK4 best practices BAM/CRAMs #4

mgonzalezporta commented Oct 28, 2020

xiao-chen-xc commented Oct 28, 2020 •

edited

Loading

mgonzalezporta commented Oct 29, 2020

xiao-chen-xc commented Oct 30, 2020 •

edited

Loading

xiao-chen-xc commented Oct 30, 2020

mgonzalezporta commented Oct 30, 2020 via email

xiao-chen-xc commented Oct 30, 2020

mgonzalezporta commented Nov 3, 2020

xiao-chen-xc commented Nov 3, 2020

mgonzalezporta commented Nov 4, 2020

mgonzalezporta commented Nov 6, 2020

xiao-chen-xc commented Nov 7, 2020

"None" genotypes with GATK4 best practices BAM/CRAMs #4

"None" genotypes with GATK4 best practices BAM/CRAMs #4

Comments

mgonzalezporta commented Oct 28, 2020

xiao-chen-xc commented Oct 28, 2020 • edited Loading

mgonzalezporta commented Oct 29, 2020

xiao-chen-xc commented Oct 30, 2020 • edited Loading

xiao-chen-xc commented Oct 30, 2020

mgonzalezporta commented Oct 30, 2020 via email

xiao-chen-xc commented Oct 30, 2020

mgonzalezporta commented Nov 3, 2020

xiao-chen-xc commented Nov 3, 2020

mgonzalezporta commented Nov 4, 2020

mgonzalezporta commented Nov 6, 2020

xiao-chen-xc commented Nov 7, 2020

xiao-chen-xc commented Oct 28, 2020 •

edited

Loading

xiao-chen-xc commented Oct 30, 2020 •

edited

Loading