Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature request: ubam support? #757

Open
eboyden opened this issue Jan 25, 2024 · 4 comments
Open

feature request: ubam support? #757

eboyden opened this issue Jan 25, 2024 · 4 comments

Comments

@eboyden
Copy link

eboyden commented Jan 25, 2024

Pretty self-explanatory. We're trying to eliminate the need to process data in fastq format, so it would be terrific if cutadapt could accept ubam input and write ubam output. (We probably don't need the ability to convert fastq to ubam or vice-versa, so mandating ubam in = ubam out would probably be fine, but we wouldn't say no to additional flexibility either.) And if this were implemented, it would be nice to be able to place trimmed sequences or other metadata into sam tags, similar to the current ability to modify fastq read names and/or comments.) Just a suggestion, thanks!

@rhpvorderman
Copy link
Collaborator

rhpvorderman commented Jan 26, 2024

The latest version of dnaio (the library that cutadapt uses for sequence IO) has uBAM read support. So cutadapt can already read uBAM files.

Most aligners only support FASTQ format though, so there has been no need yet to add uBAM write support.

After implementing the uBAM reader, I have not grown very fond of the BAM format for unaligned pure sequencing data for these reasons:

  1. Overhead. Bam starts with a 36-byte field with all sorts of metadata stored in integers. Some of this data (sequence length) is useful, most of this metadata is related to alignment and therefore unused.
  2. The BAM tag format is extremely clunky to parse. You need to support all the different typecodes. Unlike parsing fields in a FASTQ header. this is a tremendous amount of work.
  3. While it stores the sequence in 4-bit format (for 16 IUPAC codes) and supposedly this leads to smaller data, in practice it does not. FASTQ data and BAM data are both compressed using DEFLATE (which is what is used in gzip) so the differences in filesizes are minimal. BAM is smaller in memory, but not smaller on disk. If you convert the IUPAC to a sequence anyway, the in-memory difference does not matter anymore.

So for pure sequencing data, uBAM adds a lot of parsing difficulties without offering any advantages. While TAGS may seem like a nice format for adding metadata, only very few Tag names that apply to pure sequencing data are anchored in the specification. So most sequencers will have custom named tags anyway. Well in that case a FASTQ header is much, much easier to parse.

Pretty self-explanatory. We're trying to eliminate the need to process data in fastq format ...

Well could you explain yourself a bit more on why you are doing this? What downstream programs that work pre-alignment do require uBAM or would function better with the functionality it offers compared to FASTQ?

@marcelm
Copy link
Owner

marcelm commented Jan 26, 2024

The latest version of dnaio (the library that cutadapt uses for sequence IO) has uBAM read support. So cutadapt can already read uBAM files.

To clarify: Only single-end reads can be read. This is supported from dnaio 1.1.0 onwards. Cutadapt itself does not need to be updated. Since uBAM can only be read, the output will currently be FASTQ with any BAM-specific metadata lost.

I am very much in favor of BAM-only workflows and would like to see Cutadapt support it.
The minor inconveniences at parsing time (only relevant for us developers) don’t negate that uBAM is the clearly superior format (for users). Example: Cutadapt could add a @PG header to the output documenting the command that was run for improved reproducibility.

@rhpvorderman
Copy link
Collaborator

Ah yes, I did not consider the BAM header and the provenance argument. I am convinced.

Maybe we can create an issue at dnaio and discuss the technical implementation? The alternative is to defer to pysam. As someone who has tried to make an alternative to pysam I can say that full support of SAM/BAM/CRAM is a major undertaking. However, limited BAM support in context of unaligned data only seems feasible.

@eboyden
Copy link
Author

eboyden commented Jan 26, 2024

Thanks to both of you for your quick and thoughtful responses. I didn't know that cutadapt can read ubam at all, so that's good to know, although it's unfortunate that currently the tag metadata are lost and only SE reads are supported. While I better appreciate the cons of ubams from a developer perspective now, as I'm sure you're aware fastqs have their own parsing issues for users - they use 4 lines to encode one "unit" of information, and storing metadata by concatenating them in the read comment is clunky to say the least (assuming you have a convenient means of putting them there in the first place - cutadapt is a rare software that has excellent functionality in that respect); and then retrieving those data typically requires an aligner that can move fastq comments into sam tags, of which there are also only a few (bwa, mm2, bowtie2, SNAP; incidentally the latter two can read ubams). And indeed, the provenance argument. Many other software tools are increasingly embracing and even preferring ubam for unaligned reads (e.g. fgbio). So I'd still like to see ubam write suuport someday, but I understand that it's easy to make suggestions, less so to implement them, so I appreciate your even taking it under consideration.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants