Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"None" genotypes with GATK4 best practices BAM/CRAMs #4

Closed
mgonzalezporta opened this issue Oct 28, 2020 · 11 comments
Closed

"None" genotypes with GATK4 best practices BAM/CRAMs #4

mgonzalezporta opened this issue Oct 28, 2020 · 11 comments

Comments

@mgonzalezporta
Copy link

We intend to run Cryrius on a cohort of 10K Asian genomes which have been analysed with the GATK4 best practices pipeline. Whilst running some tests, I've noticed that I consistently get a genotype of None in the outputs.

In parallel, I've also tried to launch Cyrius on one of the replicates of NA12878 available in BaseSpace, analysed with Dragen 3.2.8, and observe the same behaviour.

In both cases, alignment has been done against GRCh38 with alt contigs (using the fasta from the GATK4 bundle and hg38-altaware in Dragen, respectively). Seeing this previous issue (#1), I wonder if it could be the alt-aware alignment what's leading to no-call. Would you have any thoughts?

I'm attaching the json output for the two test samples mentioned above for additional info.
output.zip

Many thanks

@xiao-chen-xc
Copy link
Contributor

xiao-chen-xc commented Oct 28, 2020

Hi Mar, I looked at your files and each sample has a different problem.

The NA12878 sample has only 18X coverage, so that's not enough for getting accurate results. We have run Cyrius before in a NA12878 BAM from NYGC (as part of 1000G) realigned with DRAGEN (available here: http://1000genomes-dragen.s3.amazonaws.com/) and it worked fine in that BAM. Maybe the BAM file you used had been downsampled?

The first sample has sufficient genome-wide coverage, but the CYP2D6/CYP2D7 region does not. If you view the BAM in IGV you will likely see that this region gets a lot fewer reads than the rest of the genome. I'm guessing that this might have something to do with your aligner or processing steps after alignment. Could you share a bit about how you got your BAM? Is it similar to the NYGC 1000G pipeline described here http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/20190405_NYGC_b38_pipeline_description.pdf ? Are there any other steps that may remove (low mapping quality) reads from the BAM? Are any of the samples you are using publicly available? If yes could you point me to one or share one with me (just the CYP2D6/D7 region would be fine) and it would be easier for me to see what is the problem with the BAM. 
Alt-aware alignment should not cause problem to Cyrius, as long as alignments to the primary assembly take precedence over alternative contigs. We have run Cyrius on BWA and DRAGEN bams with alt handling before and they all work fine. Also note that this problem is different from the #1 issue you mentioned because in that case though most of the reads are assigned to mapQ zero by the tool postalt, they are still kept there in the BAM so Cyrius is able to call the copy number of the CYP2D6/CYP2D7 region. And in your case, a good portion of the reads are missing from this region in your BAM so even the copy number calling step fails.

@mgonzalezporta
Copy link
Author

Thanks for your quick response Xiao.

Regarding NA12878, indeed it was a downsampled analysis. I've re-tried with a 30X replicate and I do get a PASS genotype. Perhaps it may be worth adding a warning or fail flag for cases like this, indicating that the reason for a no-call is low coverage?

As for WHB4204, I cannot see anything obvious with respect to the alignment workflow. Similar to the NYGC 1000G pipeline, it covers alignment with bwa-mem (per read group), bam merging, duplicate marking with picard, BQSR recalibration and conversion to CRAM. I'm attaching the specific commands for reference - see alignment_workflow.md.

I've also tried a few follow up analyses:

  1. Extending the test set to include only 30X data (most of our cohort is done at 15X and that could explain a subset of results). I've also included a mix of PCR-Free and PCR-based builds. The genotype call is None in all the 100 samples considered.

  2. Running Cyrius on a publicly available HG002 replicate (SRR8861483, 40X), that's been processed with our analysis pipeline. The genotype call here is also None. You can access the cram file and index here:
    https://cyrius-troubleshooting.s3-ap-southeast-1.amazonaws.com/SRR8861483/SRR8861483.bqsr.cram
    https://cyrius-troubleshooting.s3-ap-southeast-1.amazonaws.com/SRR8861483/SRR8861483.bqsr.cram.crai

  3. Inspecting the CYP2D6/CYP2D7 region in IGV for 3 samples: (i) our original sample (WHB4204, 30X), (ii) the HG002 replicate analysed with our pipeline (SRR8861483, 40X), and (iii) the NA12878 replicate analysed with Dragen (NA12878, 30X). The first two show clear aggregation of reads with mapq0 (with quite uneven coverage for HG002) - see igv_snapshot.png.

Altogether, these point to our analysis workflow as the reason for the no-call genotypes, even though we do retain mapq0 reads. Would you have any further thoughts?

attachment.zip

@xiao-chen-xc
Copy link
Contributor

xiao-chen-xc commented Oct 30, 2020

Hi Mar, the screenshots suggest that reads are missing from the CYP2D6/CYP2D7 region and I agree that it has something to do with your analysis workflow.

I downloaded the original bam submitted to SRA for SRR8861483 and in that bam the coverage of the D6/D7 region looks normal. I extracted mapped reads to D6/D7 region from that bam and compared with mapped reads to D6/D7 region in your cram. The attached fastq files contain ~5000 reads that belong to D6/D7 region but are missing in your cram. I aligned these reads using bwa v0.7.17-r1188 and they all align correctly. 

Perhaps you could try processing this small read set through your pipeline and keep intermediate bams from each step. Hopefully you will be able to see during which step those reads are tossed. Or if you have intermediate bams available for some other samples, you could check the coverage of the D6/D7 region in each bam and see if it starts to lose reads at a particular step.

hg2_cram_missing_read1.fq.gz
hg2_cram_missing_read2.fq.gz

@xiao-chen-xc
Copy link
Contributor

I tried to look for those missing reads elsewhere in your cram. It took me forever go through the entire cram but it seems that they are not there at all, not even marked as unaligned. This is a bit confusing to me. Shouldn't the cram contain all reads from a sequencing run? Is it possible that they are somehow tossed even before alignment?

@mgonzalezporta
Copy link
Author

mgonzalezporta commented Oct 30, 2020 via email

@xiao-chen-xc
Copy link
Contributor

Sounds good. I want to make a correction to my reply yesterday. Today I was able to find those missing reads in the cram (not sure what happened yesterday...so please ignore my last comment). They align to alternative contigs like chr22_KI270928v1_alt. But when I aligned them using BWA v0.7.17-r1188 they aligned to chr22. It'd be interesting to find out during which step in your pipeline those reads are moved from chr22 to alternative contigs.

@mgonzalezporta
Copy link
Author

Hi Xiao,

I've run our alignment workflow step by step and tried a few combinations, and it seems that the choice of reference genome is what's driving the differences. Taking your fastq files as an example, here a summary of the relevant tests:

# commands used (bwa v0.7.17-r1188)
$ seqtk mergepe $fq1 $fq2 > merged.fq && gzip merged.fq
$ bwa mem -R "$readgroup" -p $REF merged.fq.gz \
    | samtools fixmate - - \
    | samtools sort -o merged.bam \
    && samtools index merged.bam

# TEST 1: REF = GRCh38 from GATK resource bundle (Homo_sapiens_assembly38.fasta)
# downloaded from https://console.cloud.google.com/storage/browser/gcp-public-data--broad-references/hg38/v0
$ samtools idxstats merged.bam | awk '$3!=0' | grep '^chr22'
chr22   50818468        1964    2
chr22_GL383582v2_alt    162811  1308    3
chr22_KB663609v1_alt    74013   2872    3
chr22_KI270928v1_alt    176103  4075    5

# TEST 2: REF = GRCh38 from 1000G (GRCh38_full_analysis_set_plus_decoy_hla.fa)
# downloaded from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome
$ samtools idxstats merged.bam | awk '$3!=0' | grep '^chr22'
chr22   50818468        10200   13
chr22_GL383582v2_alt    162811  33      0
chr22_KB663609v1_alt    74013   1148    0
chr22_KI270928v1_alt    176103  2231    0

# TEST 3: REF = hg19 from GATK resource bundle (human_g1k_v37_decoy.fasta)
# downloaded a while ago from FTP site (GATK bundle 2.8 > b37)
# now available from https://console.cloud.google.com/storage/browser/gcp-public-data--broad-references/hg19/v0
$ samtools idxstats merged.bam | awk '$3!=0' | grep '^22'
22      51304566        10242   0

Note the difference in the number reads that map to chr22 / alt contigs for test 1 vs. test 2. Not sure what may be happening here, as I've checked the MD5 checksum of these contigs and they are the same (based on dict files). Could you have a look at the GRCh38 reference from the GATK resource bundle? Happy to investigate other suggestions.

@xiao-chen-xc
Copy link
Contributor

Hi Mar, I think BWA needs a idxbase.alt file to perform alt-aware alignment. In the GATK resource bundle link, there is only a Homo_sapiens_assembly38.fasta.64.alt. If you rename it to Homo_sapiens_assembly38.fasta.alt, BWA will be able to recognize it. I tried it and all reads aligned to chr22.

@mgonzalezporta
Copy link
Author

Thanks Xiao, this is rather unfortunate...

I've confirmed that renaming the file does rescue reads to chr22 in the subsetted fastqs. I'm also re-aligning the HG002 replicates and a subset of our 30X genomes using the updated bundle, and will re-run Cyrius when ready. I'll keep you posted on the outcome (it may take a while as the analysis will run on a rather busy SGE cluster).

@mgonzalezporta
Copy link
Author

Hi Xiao, confirmed that Cyrius is able to return genotype calls on our data after re-running alignment as discussed. Many thanks for your help in troubleshooting this!

@xiao-chen-xc
Copy link
Contributor

That's great! No problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants