Some questions about the article #1

shelkmike · 2022-01-01T12:13:24Z

Could you please answer some questions about the article (https://arxiv.org/pdf/2112.08687.pdf):

1. For HiFi reads you used Minimap2 with the option --ava-pb that is intended for PacBio CLR reads and not PacBio HiFi reads (Table S1). Why didn't you try Minimap2 with some other parameters? For example you could have increased the window size and the minimizer size. I suppose this will make Minimap2 faster and decrease its RAM consumption, thus reducing the difference between BLEND and Minimap2 on HiFi reads.
2. Why did you use N50 and not NGA50 (Table 2)? N50 may be inflated due to misassemblies that result in improper sequence junctions.
3. Why did you measure k-mer completeness and average identity using unpolished assemblies (Table 3)? Miniasm assemblies require polishing, because the accuracy of its contigs is the same as the accuracy of the reads used for the assembly. The higher accuracy of BLEND in Table 3 means that contigs made with BLEND are composed of slightly more accurate reads than contigs made with Minimap2, but the difference in accuracy may disappear after polishing.
4. Taking into account that you used only one non-HiFi long read dataset and BLEND performed on it worse than Minimap2 (N50 in Table 2), is it correct to say that BLEND is probably fit only for HiFi long reads, and not PacBio CLR or Nanopore reads?

With best wishes,
Mikhail Schelkunov

canfirtina · 2022-01-03T14:18:57Z

Thanks for your interest and questions.

There are several reasons why we used --ava-pb for HiFi reads. First, we use the default parameter settings provided by each tool. Currently, Minimap2 does not suggest a default parameter setting for finding overlapping reads specifically with PacBio HiFi reads. Thus, we use the only available default parameter setting for finding overlapping PacBio reads: --ava-pb.

Second, it is difficult to guess the best custom settings for Minimap2, because many other parameters also affect accuracy, performance, and memory usage. For example, map-pb uses -w10 while map-hifi uses -w19, and map-hifi sets many other options differently from map-pb. Overlapping also seems to use half of the window length used for read mapping (e.g., ava-pb uses -w5 and map-pb uses -w10). When designing our experiments, we asked ourselves the following question: is it better to use the parameter settings suggested in map-hifi and adapt them for finding overlapping reads (i.e., by also using the options -X -e0 -m100)? We tried both half of the original map-hifi window length (-w10) and the original window length suggested by map-hifi (-w19). What we observed was the following: with the map-hifi settings plus -X -e0 -m100, Minimap2 runs 1.2x - 4x faster than with ava-pb (still much slower than BLEND), at the cost of lost information in the PAF file and reduced assembly accuracy.

Third, we use window lengths as large as -w500 (and potentially even higher), together with the ability to combine many neighboring k-mers (e.g., 100 neighboring k-mers, as in -x map-hifi --genome human), so that accuracy is not lost. The original implementation of Minimap2 does not allow a window length above 256.
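
To illustrate why the window length matters for speed and memory, here is a minimal sketch of (w, k)-minimizer sampling. It is an illustration only, not Minimap2's or BLEND's code: the function name, the use of CRC32 as the k-mer hash, and the random test sequence are all assumptions, and strand handling (canonical k-mers) is left out.

```python
import random
import zlib


def minimizers(seq, k, w):
    """Return the sorted list of (position, k-mer) minimizers of seq.

    For every window of w consecutive k-mers, the k-mer with the smallest
    hash is kept. Ties are broken by taking the leftmost position.
    CRC32 is a stand-in hash chosen only because it is deterministic.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    hashes = [zlib.crc32(kmer.encode()) for kmer in kmers]
    picked = set()
    for start in range(len(kmers) - w + 1):
        best = min(range(start, start + w), key=lambda i: (hashes[i], i))
        picked.add((best, kmers[best]))
    return sorted(picked)


# Hypothetical 2 kb "read" generated with a fixed seed for reproducibility.
rng = random.Random(0)
read = "".join(rng.choice("ACGT") for _ in range(2000))

dense = minimizers(read, k=15, w=5)    # small window: many seeds
sparse = minimizers(read, k=15, w=19)  # large window: fewer seeds

print(len(dense), len(sparse))
```

With this tie-breaking rule, every minimizer selected with the larger window is also selected with the smaller one, so a larger window can only reduce the number of seeds that have to be stored and compared.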

Perhaps we could contact Heng Li and ask for his opinion on suitable parameter settings for finding overlapping HiFi reads with Minimap2. We will update our experiments accordingly if we receive a suggestion from him. I would also appreciate any pointer to a similar discussion where Heng Li provides suggestions for overlapping HiFi reads.

I would also like to clarify that we use the default settings for HiFi reads when they are available (i.e., we use map-hifi for mapping HiFi reads with Minimap2).

Thank you for suggesting NGA50; I agree with your point. There is no strong reason for choosing N50 over NGA50. I believe we chose N50 because it is probably a more commonly reported statistic than NGA50. We also have the NGA50 numbers. The NGA50 results 1) are mostly in line with the trend we observe in the N50 results and 2) do not change our observations regarding the contiguity of the assemblies generated using BLEND and Minimap2 overlaps. We will include the NGA50 results in the revised version of the paper, too.
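
For readers unfamiliar with the difference between the two statistics, the following is a minimal sketch with made-up numbers. The function name, the contig lengths, and the genome length are hypothetical, and the aligned blocks are assumed to come from aligning the contigs to a reference and breaking them at misassemblies, which this sketch does not do itself.

```python
def half_coverage_length(lengths, target_total=None):
    """Return the length L such that pieces of length >= L together cover
    at least half of target_total (default: the sum of all pieces)."""
    if target_total is None:
        target_total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= target_total:
            return length
    return 0  # the pieces never reach half of target_total


# Hypothetical assembly: five contigs, 300 bases in total.
contigs = [100, 80, 60, 40, 20]

# Assumption: the 100-base contig contains one misassembly, so after
# alignment to the reference it is split into two 50-base aligned blocks.
aligned_blocks = [50, 50, 80, 60, 40, 20]
genome_length = 300  # hypothetical reference genome length

n50 = half_coverage_length(contigs)
nga50 = half_coverage_length(aligned_blocks, genome_length)

print(n50, nga50)  # prints "80 50"
```

In this toy case the misassembled contig raises N50 to 80, while NGA50 drops to 50 because the contig is broken at the misassembly.
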
Regarding the unpolished assemblies: we want to assess the quality of the overlapping reads by measuring the accuracy of assemblies under the same conditions, without the effect of additional tools (e.g., polishing). Otherwise, it becomes more challenging to separate the direct effect of the overlapping algorithms from that of the polishing tools on the quality of assemblies. We have the following statement in our paper to clarify this point:

"We use miniasm because it does not perform error correction when generating de novo assemblies, which allows us to directly assess the quality of overlaps without using additional approaches for improving the accuracy of assemblies."

We definitely agree that miniasm needs assembly polishing to generate higher-quality assemblies. It may be true that the final polished assembly has such a high accuracy that the accuracy of the initial draft assembly does not matter at all. However, we also note that this depends on the coverage of the read set, the assembly polishing tool, the read mapper used to generate the input for most assembly polishing tools, and probably several other factors. The question may then become: what coverage do BLEND and Minimap2 require to achieve 99.9% accuracy, if they both end up with such good accuracy after assembly polishing? An answer to such a question may again be inferred from the initial draft assemblies without any error correction.

For these reasons, we are not currently considering including assembly polishing in our experiments.

Whether BLEND is fit only for HiFi reads is a good question, and I believe it is still open for discussion. I partially agree with your point. Unfortunately, the performance benefits are not high when using PacBio CLR reads. BLEND approximates the hash value of a seed by using the seed's k-mers. Such an approximation works well with HiFi reads because they contain fewer errors, so an erroneous k-mer is less likely to be included in the BLEND calculation. I am still working on making BLEND better with PacBio CLR reads and ONT reads, but this is an ongoing process. I am not sure whether we will be able to improve on what we already have in the current version of our implementation. We will definitely announce a new release on the same GitHub page if we can achieve better performance and memory usage with PacBio CLR reads.
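
As a rough illustration of how a seed hash can be approximated from the seed's k-mers, here is a minimal SimHash-style sketch. This is an assumption made for illustration and not BLEND's actual code: the function names, the CRC32 k-mer hash, the 32-bit width, and the example sequences are all hypothetical.

```python
import zlib


def kmers_of(seed, k):
    """Split a seed sequence into its overlapping k-mers."""
    return [seed[i:i + k] for i in range(len(seed) - k + 1)]


def fuzzy_hash(seed, k, bits=32):
    """SimHash-style hash: each k-mer votes on every bit of the result.

    A bit of the output is 1 when more k-mer hashes have that bit set
    than unset, so a few differing k-mers may leave many bits unchanged.
    """
    votes = [0] * bits
    for kmer in kmers_of(seed, k):
        h = zlib.crc32(kmer.encode())  # stand-in hash, not BLEND's
        for b in range(bits):
            votes[b] += 1 if (h >> b) & 1 else -1
    value = 0
    for b in range(bits):
        if votes[b] > 0:
            value |= 1 << b
    return value


# Hypothetical seed and a copy carrying a single substitution error.
seed = "ACGTTGCAAGGCTTACCGATGGCATTAGC"
noisy = seed[:10] + "T" + seed[11:]

a = fuzzy_hash(seed, k=7)
b = fuzzy_hash(noisy, k=7)
distance = bin(a ^ b).count("1")  # number of differing bits

print(distance)
```

A single error changes up to k of the seed's k-mers, so the more errors a read contains, the more votes flip and the less likely two matching seeds are to end up with the same hash value.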

We have not thoroughly tested BLEND with ONT reads. We believe the current parameter settings should be good enough for ONT reads, but we have not yet confirmed that they work better than Minimap2.

In short, based on the results we show in our paper, we believe BLEND is best suited to PacBio HiFi reads.

Best,

Can Firtina

shelkmike closed this as completed Jan 3, 2022
