Canu + minimap for correction on Guppy v3.6 reads: uncorrected has better error profile #1715

gringer opened this issue May 15, 2020 · 6 comments

gringer commented May 15, 2020

A pileup of uncorrected Guppy v3.6 reads produces a better guess at the true sequence than a pileup of the corrected reads. The errors are deletion variants, and only at specific locations; for the most part, reads are more accurately called after correction (especially at SNPs). I acknowledge that this difference may be purely an effect of the correction overlapper (I used minimap, rather than the default mhap).

I've started re-assembling my January 2017 reads from Nippostrongylus brasiliensis, recalled with Guppy v3.6, using Canu v2.0. We have a fairly accurate mitochondrial genome that was previously assembled from Illumina-corrected nanopore reads (confirmed by our own Illumina cDNA reads, and by the Sanger Institute's own Illumina reads), so I've been using that as a yardstick to measure the accuracy of basecalled unmethylated reads. I use LAST with a trained alignment matrix for this mapping, because it seems to have a better mapping error profile than minimap2.
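(For context, a trained matrix like bc.mat can be produced with last-train; the sketch below mirrors the training commands from a later comment in this thread, with a placeholder read file name:)

$ lastdb -uNEAR -R01 Nb_mtDNA_MRSR_corrected.fasta Nb_mtDNA_MRSR_corrected.fasta
$ last-train -P 10 -Q 1 Nb_mtDNA_MRSR_corrected.fasta basecalled_reads.fastq.gz > bc.mat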

Here is a combined coverage / variant plot showing Canu-corrected reads:
[Image: stompPlot_corrected_Nb_mtDNA]

And here is a combined coverage / variant plot showing the uncorrected version of those same reads:
[Image: stompPlot_Nb_uncorrected_mtDNA]

The identified variants with frequency >40% on both the forward and reverse strands are indicated on the outer portion of each plot. In both plots, the only identified variants are deletions. Note that there are more identified variants in the corrected reads (11) than in the uncorrected reads (1). Based on Illumina reads that I've looked at, I believe that single uncorrected variant is a true reflection of polymorphic mitochondrial sequence, rather than an error.

This isn't an issue as such; it's more of an observation, in case it's helpful for improving and/or speeding up Canu.

In case anyone wants to have a look at these reads, the uncorrected Guppy v3.6-called reads that map to the mitochondrial genome can be found here. Coverage for the mitochondrial genome is about 650X.
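(As a quick sanity check, not part of the original workflow, mean coverage can be confirmed from the BAM produced by the mapping commands below:)

$ samtools depth -a mtDNA_called_uncorrected_vs_nippo_mtDNA.bam | awk '{sum += $3} END {print "mean depth:", sum / NR}'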


In order to map reads to the mitochondrial genome I use LAST. Here's an example command sequence:

lastal -P 10 -p ~/db/last/bc.mat ~/db/fasta/nippo/Nb_mtDNA_MRSR_corrected.fasta uncorrectedReads_vs_Nb_mtDNA.fa.gz | \
  last-split | \
  maf-convert sam | \
  samtools view -h --reference ~/db/fasta/nippo/Nb_mtDNA_MRSR_corrected.fasta | \
  samtools sort > mtDNA_called_uncorrected_vs_nippo_mtDNA.bam

I have a semi-automated visualisation script that shows me coverage and variant frequencies, and determines any observed variants that are consistent on forward-mapped and reverse-mapped reads. I use this consistency check to exclude methylation-related variant signals (because it's been my experience that typically only one strand is methylated in the same place):

samtools view mtDNA_called_uncorrected_vs_nippo_mtDNA.bam -F 0x10 -b | \
  samtools mpileup --reference ~/db/fasta/nippo/Nb_mtDNA_MRSR_corrected.fasta -d 100000 -Q 0 - | \
  ~/scripts/readstomper.pl -c > stompedCounts_mtDNA_called_uncorrected_vs_nippo_mtDNA_fwd.csv
samtools view mtDNA_called_uncorrected_vs_nippo_mtDNA.bam -f 0x10 -b | \
  samtools mpileup --reference ~/db/fasta/nippo/Nb_mtDNA_MRSR_corrected.fasta -d 100000 -Q 0 - | \
  ~/scripts/readstomper.pl -c > stompedCounts_mtDNA_called_uncorrected_vs_nippo_mtDNA_rev.csv
# (plotting example from a different dataset; substitute the forward/reverse CSVs generated above)
~/scripts/stomp_plotter.r -f stompedCounts_fwd_4T1_ρ0SC_vs_chrM_9457.csv -r stompedCounts_rev_4T1_ρ0SC_vs_chrM_9457.csv -circular 16299 -adj 9457 -log -type png -scale -max 60000 -nodels -mindepth 2

command:

~/install/canu/canu-2.0/Linux-amd64/bin/canu -nanopore-raw called_Nb_CFED_65bptrim_guppy_3.6.0.fq.gz \
  -p Nb_ONTCFED_guppy360_65bpTrim -d Nb_ONTCFED_guppy360_65bpTrim \
  genomeSize=400M corOverlapper=minimap

version: Canu 2.0

system: Debian Linux desktop
Linux elegans 5.6.0-1-amd64 #1 SMP Debian 5.6.7-1 (2020-04-29) x86_64 GNU/Linux


skoren commented May 15, 2020

Yes, we've seen some similar effects. I think that with the increase in quality of 3.6.0 data, it makes sense to assemble the reads uncorrected.

We're also exploring strategies similar to those we use for HiFi data (compress, correct isolated errors, mask systematic errors). You can try this with the options:

-trim-assemble batOptions="-eg 0.12 -sb 0.01 -dg 12 -db 12 -dr 6 -ca 1000 -cp 10" "correctedErrorRate=0.12" -pacbio-hifi <your nanopore fastq>
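For context, a complete invocation with these options might look like the following (a sketch only; the genome size and file name are placeholders):

canu -p asm -d asm genomeSize=<genome size> \
  -trim-assemble batOptions="-eg 0.12 -sb 0.01 -dg 12 -db 12 -dr 6 -ca 1000 -cp 10" \
  "correctedErrorRate=0.12" -pacbio-hifi nanopore_reads.fastq.gz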

harish0201 commented

@gringer Any such plots or outcomes from Bonito basecalled datasets?


gringer commented Apr 28, 2022

Sorry, missed this question. I've done a super-accuracy recall (with guppy v5.1.15 / 6.1) with my mouse cDNA reads, but not yet the nippo mtDNA reads. I'll put that on my mental list to do shortly.
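(For reference, a super-accuracy recall looks something like the sketch below; the input/output paths are placeholders, and the config assumes R9.4.1 flow cells, matching the r941_min_sup_g507 medaka model used later in this thread:)

$ guppy_basecaller -i fast5/ -s recalled_sup/ -c dna_r9.4.1_450bps_sup.cfg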


gringer commented May 2, 2022

I've created an updated, slightly more curated dataset, filtered on the following conditions (a sketch of one way to implement such a filter follows the list):

  • length >= 3kb, AND
  • either target (mitochondrial) match length >= 50%
  • or query (read) match length >= 90%
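
A minimal sketch of such a filter, assuming minimap2 PAF alignments against the mitochondrial reference (the file names, and the use of seqkit for read extraction, are placeholders rather than the exact commands used):

$ minimap2 -x map-ont Nb_mtDNA_MRSR_corrected.fasta reads.fastq.gz > reads_vs_mtDNA.paf
# PAF columns: $2 = query length, $3/$4 = query start/end, $7 = target length, $8/$9 = target start/end
$ awk '$2 >= 3000 && (($9 - $8) >= 0.5 * $7 || ($4 - $3) >= 0.9 * $2) {print $1}' reads_vs_mtDNA.paf | sort -u > keep_reads.txt
$ seqkit grep -f keep_reads.txt reads.fastq.gz -o filtered_reads.fastq.gz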

Hopefully this will exclude most of the nuclear mitochondrial copies, if they exist. Files can be found here:


gringer commented May 2, 2022

Now for the accuracy comparisons. I'm checking each assembled genome against the reference using LAST (with default mapping parameters, because I expect the assembly to be very similar to the corrected reference), and measuring the gap-compressed identity (GCI) of the longest alignment to the longest assembled contig (which likely covers less than 100% of the genome).
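(As a rough cross-check, distinct from the LAST-based measurement used for the numbers below: minimap2's de:f: tag reports gap-compressed per-base divergence, so 1 - de approximates the gap-compressed identity of each alignment.)

$ minimap2 -c Nb_mtDNA_MRSR_corrected.fasta assembly.contigs.fasta | awk '{for (i = 13; i <= NF; i++) if ($i ~ /^de:f:/) {split($i, a, ":"); print $1, 1 - a[3]}}'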

# default parameters, all reads
$ canu-2.2/bin/canu -nanopore guppy_5.1.15_sup_reads_long_mtDNA.fastq.gz -p canu_allReads_default -d canu_allReads_default genomeSize=13k
# GCI: 99.87% (17 errors)
# default parameters, 100 most accurate reads
$ canu-2.2/bin/canu -nanopore 100best_guppy_5.1.15_sup_reads_long_mtDNA.fastq.gz -p canu_bestReads_default -d canu_bestReads_default genomeSize=13k
# GCI: 99.91% (12 errors)
# @skoren's recommended options, all reads
$ canu-2.2/bin/canu -p canu_allReads_skOpts -d canu_allReads_skOpts genomeSize=13k  -trim-assemble batOptions="-eg 0.12 -sb 0.01 -dg 12 -db 12 -dr 6 -ca 1000 -cp 10" "correctedErrorRate=0.12" -pacbio-hifi guppy_5.1.15_sup_reads_long_mtDNA.fastq.gz
# GCI: 99.84% (21 errors) [... overlapErrorAdjustment took 270 s / 920 s, much longer than others]
# trim & assemble only, all reads
$ canu-2.2/bin/canu -trim-assemble -nanopore-corrected guppy_5.1.15_sup_reads_long_mtDNA.fastq.gz -p canu_allReads_trimAssemble -d canu_allReads_trimAssemble genomeSize=13k
# GCI: 99.85% (20 errors) [overlapErrorAdjustment took 35s]
# trim & assemble only, 100 most accurate reads
$ canu-2.2/bin/canu -trim-assemble -nanopore-corrected 100best_guppy_5.1.15_sup_reads_long_mtDNA.fastq.gz -p canu_bestReads_trimAssemble -d canu_bestReads_trimAssemble genomeSize=13k
# GCI: 99.83% (21 errors)
# minimap2, trim & assemble only, 100 most accurate reads
$ canu-2.2/bin/canu -trim-assemble -nanopore-corrected 100best_guppy_5.1.15_sup_reads_long_mtDNA.fastq.gz -p canu_bestReads_mm2TrimAssemble -d canu_bestReads_mm2TrimAssemble genomeSize=13k corOverlapper=minimap
# GCI: 99.83% (21 errors)
# minimap2 [with correction], 100 most accurate reads
$ canu-2.2/bin/canu -nanopore 100best_guppy_5.1.15_sup_reads_long_mtDNA.fastq.gz -p canu_bestReads_mm2 -d canu_bestReads_mm2 genomeSize=13k corOverlapper=minimap
# GCI: 99.91% (12 errors)

So, at least for these tests, the longest-running canu assemblies didn't produce the best results, and there didn't seem to be any difference between using minimap2 and mhap as the overlapper. The best outcome came from pre-filtering reads by mean accuracy, then assembling with correction.

This seems inconsistent with my first result, but bear in mind that these are a subset of reads (trying to exclude as much haplotypic variation as possible), and canu may perform better with its own idea of "corrected" reads, even if they are less accurate relative to the reference, because the developers have optimised assembly for those reads.

As something of an encore, I tried to see if I could get medaka + LAST to improve on the bestReads_mm2 assembly:

$ lastdb -uNEAR -R01 canu_bestReads_mm2/canu_bestReads_mm2.contigs.fasta canu_bestReads_mm2/canu_bestReads_mm2.contigs.fasta
$ last-train -P 10 -Q 1 canu_bestReads_mm2/canu_bestReads_mm2.contigs.fasta 100best_guppy_5.1.15_sup_reads_long_mtDNA.fastq.gz > trained.mat
$ lastal -p trained.mat -P 10 -j 7 canu_bestReads_mm2/canu_bestReads_mm2.contigs.fasta 100best_guppy_5.1.15_sup_reads_long_mtDNA.fastq.gz | maf-convert sam | samtools view -b -T canu_bestReads_mm2/canu_bestReads_mm2.contigs.fasta | samtools sort > calls_to_draft.bam
$ samtools index calls_to_draft.bam
$ mkdir -p results
$ medaka consensus calls_to_draft.bam results/contig1.hdf --model r941_min_sup_g507
$ medaka stitch results/contig1.hdf canu_bestReads_mm2/canu_bestReads_mm2.contigs.fasta results/polished.assembly.fasta
$ lastal Nb_mtDNA_MRSR_corrected.fasta results/polished.assembly.fasta | last-map-probs | ~/scripts/maf2csv.pl | cut -d ',' -f 14,15
# GCI: 99.93 (10 errors)

According to diffseq, the errors are all deletions. This is a known/expected issue for nanopore reads, due to the increased chance of skipped signal, and it makes some sense for these errors to be consistent, because certain sequences may make DNA unwinding slightly harder or easier:

# [I doubled the genome to make sure I could get a contiguous section of matching sequence. Duplicated matches have been removed]
########################################
# Program: diffseq
# Rundate: Tue  3 May 2022 01:11:48
# Commandline: diffseq
#    [-asequence] dbl_Nb_mtDNA_MRSR_corrected.fasta
#    [-bsequence] results/polished.assembly.fasta
# Report_format: diffseq
# Report_file: nb_mtdna.diffseq
# Additional_files: 2
# 1: Nb_mtDNA.diffgff (Feature file for first sequence)
# 2: tig00000001.diffgff (Feature file for second sequence)
########################################

#=======================================
#
# Sequence: Nb_mtDNA     from: 1   to: 26710
# HitCount: 14
#
# Compare: tig00000001     from: 1   to: 25293
# 
# Nb_mtDNA overlap starts at 1
# tig00000001 overlap starts at 7909
# 
#
#=======================================
Nb_mtDNA 730-730 Length: 1
Sequence: T
Sequence: 
tig00000001 8637 Length: 0

Nb_mtDNA 2842-2842 Length: 1
Sequence: T
Sequence: 
tig00000001 10748 Length: 0

Nb_mtDNA 3562-3563 Length: 2
Sequence: TT
Sequence: 
tig00000001 11467 Length: 0

Nb_mtDNA 4193-4193 Length: 1
Sequence: T
Sequence: 
tig00000001 12096 Length: 0

Nb_mtDNA 4919-4919 Length: 1
Sequence: T
Sequence: 
tig00000001 12821 Length: 0

Nb_mtDNA 5201-5201 Length: 1
Sequence: T
Sequence: 
tig00000001 13102 Length: 0

Nb_mtDNA 8253-8253 Length: 1
Sequence: T
Sequence: 
tig00000001 16153 Length: 0

Nb_mtDNA 8881-8882 Length: 2
Sequence: TT
Sequence: 
tig00000001 16780 Length: 0

Nb_mtDNA 9530-9530 Length: 1
Sequence: T
Sequence: 
tig00000001 17427 Length: 0

Nb_mtDNA 12316-12316 Length: 1
Sequence: T
Sequence: 
tig00000001 20212 Length: 0
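
(To check the homopolymer context around any of these deletions, the flanking reference sequence can be extracted directly; a sketch for the first reported deletion, using the doubled reference and 1-based coordinates around position 730:)

$ samtools faidx dbl_Nb_mtDNA_MRSR_corrected.fasta Nb_mtDNA:725-735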

In other words, yes, medaka was able to very slightly improve the accuracy of the assembly (bearing in mind that medaka is designed to work best with Flye and minimap2, rather than canu and LAST).

I have now tried this with minimap2 instead of LAST; unlike with LAST, medaka couldn't make the assembly any more accurate:

$ minimap2 -a Nb_mtDNA_MRSR_corrected.fasta results/polished.assembly.fasta | samtools sort > mm2_calls_to_draft.bam
$ samtools index mm2_calls_to_draft.bam
$ medaka consensus mm2_calls_to_draft.bam results/mm2_contig1.hdf --model r941_min_sup_g507
$ medaka stitch results/mm2_contig1.hdf canu_bestReads_mm2/canu_bestReads_mm2.contigs.fasta results/mm2_polished.assembly.fasta
...
[Copying contig 'tig00000001' verbatim from input]


gringer commented May 2, 2022

Revisiting those plots... canu does a really good job at read correction, essentially eliminating the random SNP errors and most INDELs. The problem (as previously observed) is that it leaves behind some consistent errors:

[Image: Best Raw]

[Image: Best Corrected]

[Image: All Raw]

[Image: All Corrected]
