
High level file processing #76

Open
TransGirlCodes opened this issue Jun 22, 2022 · 8 comments
Labels
enhancement New feature or request

Comments

@TransGirlCodes
Member

TransGirlCodes commented Jun 22, 2022

I think we would do BioJulia's usability a lot of favours if we made it easier to use the file format packages to process HTS files à la seqtk, samtools, etc.

For example, compared to seqtk, it's currently not as convenient to do:

using FASTX
transcribe(open(FASTQ.Reader, "x.fq"), open(FASTA.Writer, "x.fa"))

As it is to just do seqtk seq -a in.fq.gz > out.fa

or

using FASTX

names = Set(...names names names...)

open(FASTQ.Reader, "x.fq") do x
    open(FASTQ.Writer, "y.fq") do y
        r = FASTQ.Record()
        while !eof(x)
            read!(x, r)
            if FASTQ.identifier(r) in names
                write(y, r)
            end
        end
    end
end

As it is to do seqtk subseq in.fq name.lst > out.fq

One of Julia's strengths - even if you're not using the parts of BioJulia that provide efficient core data structures and algorithms - is the set of features that make it a pretty great glue language. So the ability to very tersely express high-level processing of HTS files is important.

There are a few proposals and options I'm considering to improve this area and wanted to get feedback or other ideas.

  1. File handles / string literals
    The open(FASTQ.Reader, ...) / open(FASTA.Reader, ...) pattern gets irritating after a while.
    I'd like to propose a filehandle kind of string literal that, when used in BioJulia ecosystem functions, does the opening for you, e.g.
using FASTX
transcribe(rdr"x.fq", wtr"x.fa")

It could even detect FASTA or FASTQ at the metaprogramming stage by inspecting the string. These could generate the code to open a stream, or just create a filehandle type that causes dispatch to use a transcribe function that does the appropriate opens.
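For illustration, here's a minimal sketch of what such a literal could expand to - the macro name rdr_str and the extension checks are assumptions, not an existing FASTX API:

using FASTX

# Hypothetical reader literal: pick the reader type from the file
# extension at macro-expansion time, since the path is a string literal.
macro rdr_str(path)
    if endswith(path, ".fq") || endswith(path, ".fastq")
        :(open(FASTQ.Reader, $path))
    elseif endswith(path, ".fa") || endswith(path, ".fasta")
        :(open(FASTA.Reader, $path))
    else
        error("cannot infer record format from extension: $path")
    end
end

# rdr"x.fq" then expands to open(FASTQ.Reader, "x.fq") at parse time.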

  2. A way of tersely expressing stream processing / common HTS file processing

A lot of file processing involves opening a source, doing some stuff, and writing to a sink. I wonder if we should implement that pattern as a function (or a few functions), and then express e.g. transcribe using it instead.

With 1, this could look something like:

process(rdr"x.fq", wtr"y.fq") do record
    ...some stuff
end

We could also have a few of these methods, e.g. a filter, a transform, etc., that make various assumptions about the provided function.
Or we could just have one more general function where, if the provided function returns a record, it is written, and if it returns nothing, it isn't. That way a user can decide about dropping records in the same function that transforms them - see the sketch below.
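A rough sketch of that general variant, assuming FASTQ on both ends and a hypothetical process function (written with plain opens rather than the proposed string literals):

using FASTX

# Hypothetical: pass each record through f; a returned record is
# written to the sink, a returned nothing drops the record.
function process(f, src::AbstractString, dst::AbstractString)
    open(FASTQ.Reader, src) do reader
        open(FASTQ.Writer, dst) do writer
            for record in reader
                out = f(record)
                out === nothing || write(writer, out)
            end
        end
    end
end

# Usage: keep only records at least 50 bp long.
process("x.fq", "y.fq") do record
    length(sequence(record)) >= 50 ? record : nothing
end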

Another alternative would be to do something similar to what I did with the DSL in Pseudoseq: https://github.com/bioinfologics/Pseudoseq.jl/blob/master/src/sequencing/DSL.jl

And provide a series of iterator-like types, that wrap either a reader or another iterator, and that can be plugged into a writer.

So more like:

# Example of if `write` was extended to have methods that consumed a reader or an iterator...
# To achieve transcribe
write(wtr"out.fa", rdr"in.fq")

# To achieve a filter
write(wtr"out.fa", Filter(x -> rand(Bool), rdr"in.fa"))

# We could overload e.g. `|>` to make the above a bit nicer.
# Or even
rdr"in.fa" |> Filter( x -> rand(Bool) ) |> wtr"out.fa"

# Overloading certain argument combinations of |> would make the above equivalent to:
write(wtr"out.fa", Filter(x -> rand(Bool), rdr"in.fa"))
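A sketch of the write-consumes-an-iterator half of this (extending Base.write like so is an assumption about the design, not current FASTX behaviour):

using FASTX

# Drain any iterator of records into a writer, returning bytes written.
function Base.write(writer::FASTA.Writer, records)
    n = 0
    for record in records
        n += write(writer, record)
    end
    return n
end

# The filter example then works with Base's own lazy iterator:
open(FASTA.Writer, "out.fa") do w
    open(FASTA.Reader, "in.fa") do r
        write(w, Iterators.filter(x -> rand(Bool), r))
    end
end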

Basically I'm angling for a kind of API that approaches a DSL on top of the readers and writers.

One such iterator could be one that modifies the iteration behaviour of the reader it wraps to do in-place record reading, instead of the default that allocates a new record every time. e.g.

Inplace(rdr"in.fa") |> Filter( x -> rand(Bool) ) |> wtr"out.fa"
# or perhaps
rdr"in.fa" |> Inplace |> Filter( x -> rand(Bool) ) |> wtr"out.fa"

Or, as an alternative to all of this, we just try to provide as many individual functions like filter, trim, transcode, etc. as possible and hope we cover everyone's common needs.

pinging @jakobnissen @CiaranOMara @kescobo

@TransGirlCodes added the enhancement label Jun 22, 2022
@kescobo
Member

kescobo commented Jun 23, 2022

This all seems great to me. I'd prefer slightly more verbose string macros for readability (eg writer"thing.fq"), but we can save that for future bikeshedding. I agree in principle with this idea, and definitely agree that a better API would be very welcome.

@CiaranOMara
Member

I think our file format packages are more like libraries that provide composable elements/structures and algorithms for use in the Julia ecosystem. A command-line interface is probably not a like-for-like comparison to our file format packages, but the comparison certainly demonstrates a desired ease of use.

I prefer the series of iterators-like approaches where you can pipe from source to sink. And like @kescobo, I prefer more verbose names.

FileIO sets a precedent for loading and saving files. I think part of the solution, whatever the approach, is registering our file format packages with FileIO and implementing its streaming IO interface -- I have always wanted to register the BioJulia file format packages with FileIO.

In terms of designing interfaces and composable data structures, maybe the BioJulia tutorials are a good place to do that. The tutorials would lay out the common tasks and goals, and then the process of writing the tutorial will identify gaps/missing features and sore points. And where there are gaps or sore points, I suspect writing those sections in the way we wish to use the packages will show us or help us design the interfaces. What do you reckon?

@kescobo
Member

kescobo commented Jun 24, 2022

In terms of designing interfaces and composable data structures, maybe the BioJulia tutorials are a good place to do that.

💯💯💯

I've just recently started the process of reigniting that repo - it hadn't really been touched since Julia 0.7. I definitely think that's (one of) the right place(s) to solidify the expected interface.

@jakobnissen
Member

So I'm a bit torn here. On one hand, I completely agree that we should play to Julia's strengths: enabling convenient high-level syntax suitable for interactive use, while also enabling stricter syntax that is more error-resistant and more optimised for libraries. And I, too, am a bit tired of the pattern:

using FASTX, CodecZlib

open(path) do io
    io = last(splitext(path)) == ".gz" ? GzipDecompressorStream(io) : io
    reader = FASTA.Reader(io)
    for record in reader
        ...
    end
    close(reader)
end

...which certainly feels like it could be better.

That was my original intent for making BadBio.jl (which I've stopped using for no real reason) - code up some convenient shorthands, and then possibly migrate them to the proper repo if useful. Of course BadBio is not a serious way to go about this problem.

On the other hand, however, I am a bit wary of conflating low and high level packages. The two pitfalls I see are:

  • Library authors have to load and compile a big bunch of unneeded code pertaining to interactive use. Maybe that's unavoidable and therefore not a real issue. After all, they already have to load code for all the myriad functions they don't use in the library.
  • We might accidentally expose convenient-but-error-prone syntax, like an open function that autodetects the format. People definitely should not use that in library code. But maybe we can avoid that pitfall by mentally separating methods into convenience vs basic ones.

Okay, after writing this I'm pretty sure I'm on board with this, @SabrinaJaye

For your concrete proposals: yes, I think it's a good idea! I think we should just use the existing Base.Iterators framework as much as possible. We really don't want to get bogged down in trying to re-invent Transducers.jl - that'd be a project approximately the size of the entire BioJulia project so far.

Perhaps a good idea would be to "just" have readers be iterators of records (any headers can be accessed on instantiation, like hdr, itr = reader("foo.vcf")), have the writers be sinks, and then have some kind of clever mechanism that automatically closes readers when they reach EOF, just like Base.EachLine, and closes writers when they are garbage collected, or something like that.
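A sketch of the auto-closing part, in the spirit of Base.EachLine (the wrapper type is hypothetical, and it assumes the wrapped reader is a stateful iterator, as FASTX readers are):

using FASTX

# Hypothetical: iterate records off a reader and close it once EOF is hit.
struct ClosingRecords{R}
    reader::R
end

Base.IteratorSize(::Type{<:ClosingRecords}) = Base.SizeUnknown()

function Base.iterate(it::ClosingRecords, state=nothing)
    y = iterate(it.reader)  # stateful reader: this just reads the next record
    y === nothing && (close(it.reader); return nothing)
    return y
end

# itr = ClosingRecords(open(FASTA.Reader, "in.fa"))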

Lots of bikeshedding potential there. I don't think the details matter too much. An improvement is an improvement.

W.r.t tutorials, yes, that would be a nice improvement. But that's a separate issue I think - no matter the outcome of the proposal here, updating the tutorials would be a good idea.

@TransGirlCodes
Member Author

TransGirlCodes commented Jul 6, 2022

Ok cool. It's funny you mention Transducers, as this issue got me onto reading about those - but I think, given that that library works with iterators as well, if we do this right then Transducers and Iterators methods should ideally just work - 'cause Julia is awesome like that.

I'll make some smaller PRs that move towards some of these improvements.

@jonathanBieler

jonathanBieler commented Dec 10, 2022

I think at a minimum it would be great to have FileIO.load & FileIO.save to easily read & write files. I assume having FileIO as a dependency isn't an issue? This seems like an independent problem from the whole processing-pipeline question.
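The user-facing side of that would presumably be just the generic FileIO verbs (assuming the formats were registered, which they currently are not):

using FileIO

records = load("in.fq")   # would dispatch on the registered format
save("out.fa", records)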

On that note, I also wrote something along these lines: https://github.com/jonathanBieler/BioRecordsProcessing.jl

I think my examples look pretty good (there's minimal boilerplate) but you can't specify source and sink, it's only disk-to-disk.

One concern I have is that having sources and sinks with processing in between isn't specific to biological records, and when you think about adding parallel processing it starts to look a lot like Dagger.jl or other generic frameworks. That is to say, this doesn't really concern the FASTX, BED, ... packages (or concerns them only in making sure they are compatible with those tools).

I think it becomes a bit more relevant if you include domain-specific knowledge and handle common use cases as easily as possible (I have some examples in my readme, but we could also look at common samtools/vcftools/... usages). What I have in BioRecordsProcessing is 1) processing paired files, 2) processing whole folders in parallel, 3) grouping records before processing. Another one would be reading only selected regions of indexed files.

@CiaranOMara
Member

CiaranOMara commented Dec 11, 2022

I think at a minimum it would be great to have FileIO.load & FileIO.save to easily read & write files. I assume having FileIO as a dependency isn't an issue? This seems like an independent problem from the whole processing-pipeline question.

In the queryverse ecosystem, the convention seems to be to have a separate package that implements the FileIO interface. Following that convention, we'd have BEDFiles.jl, FASTXFiles.jl, etc. I'd prefer to follow this convention when implementing the FileIO interfaces.

@CiaranOMara
Member

Looking at the initial examples with Filter.

Inplace(rdr"in.fa") |> Filter( x -> rand(Bool) ) |> wtr"out.fa"

I think Filter only needs to be a curried variant of Iterators.filter.

Iterators.filter(fn::Function) = (itr) -> Iterators.filter(fn, itr)

I wonder why this is not already in the standard Julia libraries. I s'pose it's difficult to decide whether eager or lazy variants should be distinguished.
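With that method defined, the earlier pipeline composes directly (note that adding methods to Iterators.filter from a package would be type piracy, so a package-local name might be safer in practice):

using FASTX

# Curried variant as proposed above:
Iterators.filter(fn::Function) = (itr) -> Iterators.filter(fn, itr)

# Readers already iterate records, so the pipeline is plain composition:
open(FASTA.Reader, "in.fa") do rdr
    open(FASTA.Writer, "out.fa") do wtr
        foreach(rec -> write(wtr, rec), rdr |> Iterators.filter(x -> rand(Bool)))
    end
end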
