"None" genotypes with GATK4 best practices BAM/CRAMs #4
Hi Mar, I looked at your files and each sample has a different problem. The NA12878 sample has only 18X coverage, which is not enough for getting accurate results. We have run Cyrius before on an NA12878 BAM from NYGC (as part of 1000G) realigned with DRAGEN (available here: http://1000genomes-dragen.s3.amazonaws.com/) and it worked fine on that BAM. Maybe the BAM file you used had been downsampled? The first sample has sufficient genome-wide coverage, but the CYP2D6/CYP2D7 region does not. If you view the BAM in IGV you will likely see that this region gets a lot fewer reads than the rest of the genome. I'm guessing that this might have something to do with your aligner or processing steps after alignment. Could you share a bit about how you got your BAM? Is it similar to the NYGC 1000G pipeline described here http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/20190405_NYGC_b38_pipeline_description.pdf ? Are there any other steps that may remove (low mapping quality) reads from the BAM? Are any of the samples you are using publicly available? If yes, could you point me to one or share one with me (just the CYP2D6/D7 region would be fine)? That would make it easier for me to see what the problem with the BAM is. |
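As an alternative to eyeballing the region in IGV, the coverage comparison described above could be sketched on the command line. This is a minimal sketch, not something from the thread: `mean_depth` is a hypothetical helper, and `$BAM`, `$CYP2D6_REGION` and `$CONTROL_REGION` are placeholders the reader would need to set for their own reference build.

```shell
# mean_depth: read `samtools depth -a` output (chrom, pos, depth) on stdin
# and print the mean depth to one decimal place ("NA" if there is no input).
mean_depth() {
    awk '{ sum += $3; n++ }
         END { if (n > 0) printf "%.1f\n", sum / n; else print "NA" }'
}

# Example usage (requires samtools; all three variables are placeholders):
#   samtools depth -a -r "$CYP2D6_REGION"  "$BAM" | mean_depth
#   samtools depth -a -r "$CONTROL_REGION" "$BAM" | mean_depth
```

A region depth far below the control depth would be consistent with reads going missing from CYP2D6/CYP2D7 specifically, as opposed to low coverage genome-wide.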
Thanks for your quick response Xiao. Regarding NA12878, indeed it was a downsampled analysis. I've re-tried with a 30X replicate and I do get a PASS genotype. Perhaps it may be worth adding a warning or fail flag for cases like this, indicating that the reason for a no-call is low coverage? As for WHB4204, I cannot see anything obvious with respect to the alignment workflow. Similar to the NYGC 1000G pipeline, it covers alignment with bwa-mem (per read group), bam merging, duplicate marking with picard, BQSR recalibration and conversion to CRAM. I'm attaching the specific commands for reference - see I've also tried a few follow up analyses:
Altogether, these point to our analysis workflow as the reason for the no-call genotypes, even though we do retain mapq0 reads. Would you have any further thoughts? |
Hi Mar, the screenshots suggest that reads are missing from the CYP2D6/CYP2D7 region and I agree that it has something to do with your analysis workflow. I downloaded the original bam submitted to SRA for SRR8861483 and in that bam the coverage of the D6/D7 region looks normal. I extracted mapped reads to D6/D7 region from that bam and compared with mapped reads to D6/D7 region in your cram. The attached fastq files contain ~5000 reads that belong to D6/D7 region but are missing in your cram. I aligned these reads using bwa v0.7.17-r1188 and they all align correctly. Perhaps you could try processing this small read set through your pipeline and keep intermediate bams from each step. Hopefully you will be able to see during which step those reads are tossed. Or if you have intermediate bams available for some other samples, you could check the coverage of the D6/D7 region in each bam and see if it starts to lose reads at a particular step. |
I tried to look for those missing reads elsewhere in your cram. It took me forever to go through the entire cram but it seems that they are not there at all, not even marked as unaligned. This is a bit confusing to me. Shouldn't the cram contain all reads from a sequencing run? Is it possible that they are somehow tossed even before alignment? |
Much appreciated Xiao. Allow me some time to follow up internally and get
the big picture. Will keep you posted.
|
Sounds good. I want to make a correction to my reply yesterday. Today I was able to find those missing reads in the cram (not sure what happened yesterday...so please ignore my last comment). They align to alternative contigs like chr22_KI270928v1_alt. But when I aligned them using BWA v0.7.17-r1188 they aligned to chr22. It'd be interesting to find out during which step in your pipeline those reads are moved from chr22 to alternative contigs. |
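One way to track where a known set of reads ends up after each pipeline step is to tally the reference names of their alignments. The following is a minimal sketch under stated assumptions: `contigs_for_reads` and the file `missing_reads.txt` (one read name per line) are hypothetical names, not part of the thread.

```shell
# contigs_for_reads NAMES_FILE: read SAM records on stdin and count, per
# reference name (column 3), the records whose read name (column 1) is
# listed in NAMES_FILE. Header lines (starting with "@") are skipped.
contigs_for_reads() {
    awk -F '\t' -v names="$1" '
        BEGIN { while ((getline line < names) > 0) want[line] = 1 }
        /^@/ { next }
        ($1 in want) { count[$3]++ }
        END { for (c in count) printf "%s\t%d\n", c, count[c] }
    ' | LC_ALL=C sort
}

# Example usage (requires samtools; file names are placeholders):
#   samtools view intermediate_step.bam | contigs_for_reads missing_reads.txt
```

Running this on the intermediate BAM from each step would show at which point the counts shift from chr22 to the alternative contigs.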
Hi Xiao, I've run our alignment workflow step by step and tried a few combinations, and it seems that the choice of reference genome is what's driving the differences. Taking your fastq files as an example, here is a summary of the relevant tests:
# commands used (bwa v0.7.17-r1188)
$ seqtk mergepe $fq1 $fq2 > merged.fq && gzip merged.fq
$ bwa mem -R "$readgroup" -p $REF merged.fq.gz \
| samtools fixmate - - \
| samtools sort -o merged.bam \
&& samtools index merged.bam
# TEST 1: REF = GRCh38 from GATK resource bundle (Homo_sapiens_assembly38.fasta)
# downloaded from https://console.cloud.google.com/storage/browser/gcp-public-data--broad-references/hg38/v0
$ samtools idxstats merged.bam | awk '$3!=0' | grep '^chr22'
chr22 50818468 1964 2
chr22_GL383582v2_alt 162811 1308 3
chr22_KB663609v1_alt 74013 2872 3
chr22_KI270928v1_alt 176103 4075 5
# TEST 2: REF = GRCh38 from 1000G (GRCh38_full_analysis_set_plus_decoy_hla.fa)
# downloaded from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome
$ samtools idxstats merged.bam | awk '$3!=0' | grep '^chr22'
chr22 50818468 10200 13
chr22_GL383582v2_alt 162811 33 0
chr22_KB663609v1_alt 74013 1148 0
chr22_KI270928v1_alt 176103 2231 0
# TEST 3: REF = hg19 from GATK resource bundle (human_g1k_v37_decoy.fasta)
# downloaded a while ago from FTP site (GATK bundle 2.8 > b37)
# now available from https://console.cloud.google.com/storage/browser/gcp-public-data--broad-references/hg19/v0
$ samtools idxstats merged.bam | awk '$3!=0' | grep '^22'
22 51304566 10242 0
Note the difference in the number of reads that map to chr22 / alt contigs for test 1 vs. test 2. Not sure what may be happening here, as I've checked the MD5 checksums of these contigs and they are the same (based on dict files). Could you have a look at the GRCh38 reference from the GATK resource bundle? Happy to investigate other suggestions. |
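The primary-versus-alt split in the idxstats listings above can be reduced to two numbers per test. This is a minimal sketch; `primary_vs_alt` is a hypothetical helper, and it assumes ALT contig names follow the `<chrom>_..._alt` pattern seen in the output above.

```shell
# primary_vs_alt CHROM: read `samtools idxstats` output on stdin
# (name, length, mapped, unmapped) and report mapped read-segments on
# CHROM itself versus on its ALT contigs (names like CHROM_*_alt).
primary_vs_alt() {
    awk -v chrom="$1" '
        $1 == chrom { primary += $3 }
        index($1, chrom "_") == 1 && $1 ~ /_alt$/ { alt += $3 }
        END { printf "primary=%d alt=%d\n", primary, alt }
    '
}

# Example usage (requires samtools):
#   samtools idxstats merged.bam | primary_vs_alt chr22
```

Applied to the numbers reported above, test 1 gives `primary=1964 alt=8255` and test 2 gives `primary=10200 alt=3412`, which makes the shift toward the alt contigs in test 1 easy to see.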
Hi Mar, I think BWA needs a |
Thanks Xiao, this is rather unfortunate... I've confirmed that renaming the file does rescue reads to chr22 in the subsetted fastqs. I'm also re-aligning the HG002 replicates and a subset of our 30X genomes using the updated bundle, and will re-run Cyrius when ready. I'll keep you posted on the outcome (it may take a while as the analysis will run on a rather busy SGE cluster). |
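The reply that prompted the rename is truncated in this copy of the thread, so the following is an assumption about the kind of check involved and not a record of what was suggested. According to bwa's ALT-mapping documentation, `bwa mem` switches to ALT-aware mapping when a file named `<index prefix>.alt` exists next to the index. A minimal sketch of a pre-alignment check, where `has_alt_file` is a hypothetical helper:

```shell
# has_alt_file INDEX_PREFIX: report whether INDEX_PREFIX.alt exists.
# INDEX_PREFIX is assumed to be the same string passed to `bwa mem`
# as the index base (for example, the path held in $REF).
has_alt_file() {
    if [ -f "$1.alt" ]; then
        echo "found: $1.alt"
    else
        echo "missing: $1.alt"
        return 1
    fi
}

# Example usage:
#   has_alt_file "$REF" || echo "bwa mem will not run in ALT-aware mode" >&2
```

Running a check like this before alignment would flag a reference bundle whose `.alt` file is absent or named differently from the index prefix.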
Hi Xiao, confirmed that Cyrius is able to return genotype calls on our data after re-running alignment as discussed. Many thanks for your help in troubleshooting this! |
That's great! No problem. |
We intend to run Cyrius on a cohort of 10K Asian genomes which have been analysed with the GATK4 best practices pipeline. Whilst running some tests, I've noticed that I consistently get a genotype of None in the outputs. In parallel, I've also tried to launch Cyrius on one of the replicates of NA12878 available in BaseSpace, analysed with Dragen 3.2.8, and observe the same behaviour.
In both cases, alignment has been done against GRCh38 with alt contigs (using the fasta from the GATK4 bundle and hg38-altaware in Dragen, respectively). Seeing this previous issue (#1), I wonder if the alt-aware alignment could be what's leading to the no-call. Would you have any thoughts?
I'm attaching the json output for the two test samples mentioned above for additional info.
output.zip
Many thanks