How to run "generate_sequence_features_single" with UNSORTED bam #169

GuoYang-qd · 2024-07-31T02:33:43Z

Thank you for developing SemiBin2. It performs exceptionally well and generates a large number of high-quality MAGs.

We would therefore like to apply SemiBin2 to our large datasets. Since analysing large datasets is usually very time-consuming, we hope to streamline the pipeline as much as possible.

Sorting BAM files often consumes a significant amount of computational and storage resources (e.g., in our case the temporary files created during sorting are usually hundreds of GB per BAM). However, it seems that SemiBin2 does not support unsorted BAM files as input, as the following error occurs when running the "generate_sequence_features_single" module:

Input error: Chromosome k127_4971567 found in non-sequential lines. This suggests that the input file is not sorted correctly.

Are there any alternative tools or ways to generate "data.csv" and "data_split.csv" from unsorted BAM files? Or would it be possible to make simple modifications to the "generate_sequence_features_single" module so that it accepts unsorted BAM files?

luispedro · 2024-08-01T00:01:34Z

Unfortunately, it's not trivial to use non-sorted files. It's conceptually possible (we do so in NGLess), but not in a way that fits SemiBin.
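
To illustrate why this is conceptually possible: the mean depth of a contig is just the total number of aligned bases divided by the contig length, and a sum does not depend on the order in which alignments arrive. The sketch below is a minimal illustration under that assumption only. It is not how NGLess or SemiBin compute abundance, and the function name and the `(contig, start, end)` input format are hypothetical. Any other per-contig statistic the real pipeline uses is not covered here.

```python
from collections import defaultdict


def mean_depth_unsorted(contig_lengths, alignments):
    """Mean depth per contig from alignments given in any order.

    contig_lengths: dict mapping contig name -> length in bases.
    alignments: iterable of (contig, start, end) tuples, 0-based and
        half-open. This input format is a hypothetical simplification
        of what one would extract from BAM records.
    """
    aligned_bases = defaultdict(int)
    for contig, start, end in alignments:
        # Clip to the contig boundaries, then add the aligned length.
        start = max(start, 0)
        end = min(end, contig_lengths[contig])
        if end > start:
            aligned_bases[contig] += end - start
    # Contigs with no alignments get a depth of 0.0.
    return {
        name: aligned_bases[name] / length
        for name, length in contig_lengths.items()
    }


# Alignments interleaved across contigs, as in an unsorted BAM.
lengths = {"contig_a": 100, "contig_b": 50}
reads = [
    ("contig_a", 0, 50),
    ("contig_b", 0, 50),
    ("contig_a", 50, 100),
    ("contig_a", 0, 100),
]
depths = mean_depth_unsorted(lengths, reads)
```

Memory use here grows with the number of contigs rather than the number of reads, which is what makes a single streaming pass over unsorted input feasible for this particular statistic.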

GuoYang-qd · 2024-08-04T03:15:25Z

Thanks for the reply. Currently, I can generate the tetramer frequencies in "data.csv". The abundance calculated by NGLess seems to follow a similar trend to the abundance that SemiBin generates with bedtools. So, can the abundance calculated by NGLess replace the abundance calculated by bedtools?

Additionally, I noticed that "data_split.csv" appears to sample contigs from "data.csv" and then split each contig's abundance and tetramer frequencies into two values (the average of these two values seems to be the number in "data.csv"). How is this done? Could you briefly explain the logic behind it?

Thanks!
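
The observation about "data_split.csv" would be consistent with each contig being cut into two halves and the features being computed separately for each half. That is an assumption based only on the description above, not something confirmed in this thread. A minimal sketch under that assumption, with hypothetical function names and toy data:

```python
from collections import Counter


def tetramer_freqs(seq, k=4):
    """Normalised k-mer frequencies (no reverse-complement merging)."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def mean(values):
    return sum(values) / len(values)


def split_in_half(seq, depth):
    """Cut a contig and its per-base depth vector at the midpoint."""
    mid = len(seq) // 2
    return (seq[:mid], depth[:mid]), (seq[mid:], depth[mid:])


# Toy contig: 16 bases, first half covered 2x, second half 6x.
seq = "ACGTACGTACGTACGT"
depth = [2] * 8 + [6] * 8

(left_seq, left_depth), (right_seq, right_depth) = split_in_half(seq, depth)

whole_depth = mean(depth)
half_depth_average = (mean(left_depth) + mean(right_depth)) / 2

whole_acgt = tetramer_freqs(seq)["ACGT"]
left_acgt = tetramer_freqs(left_seq)["ACGT"]
right_acgt = tetramer_freqs(right_seq)["ACGT"]
```

For halves of equal length, the average of the two half depths equals the whole-contig depth exactly. The tetramer frequencies only match approximately, because the k-mers that span the cut point are counted in the whole contig but in neither half.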
